Epoch AI

Toward an O*NET for AI R&D

JS Denain — Wed, 17 Jun 2026 21:32:42 GMT

This post is part of Epoch AI’s Gradient Updates newsletter, which shares more opinionated or informal takes on big questions in AI progress. These posts solely represent the views of the authors, and do not necessarily reflect the views of Epoch AI as a whole.

What trends are we extrapolating?

A common way that experts forecast AI timelines is so simple it’s hard to believe: trend extrapolation. Sure, they also use numerical models that bake in things like runaway feedback loops, but the bread and butter of AI forecasting is to draw a line on a graph and extend it as far as you dare. Somehow this works well enough to be a state-of-the-art approach. However, the trends they extrapolate share a common weakness: they lean heavily on easy-to-measure things rather than on what we directly care about, namely how close AI is to doing AI research itself.

Many experts want to know when we’ll fully automate AI research, because this would massively speed up AI progress, kicking off an “intelligence explosion”.[1] If that’s right, it’s hugely important to know how close we are. But historically, there has been little direct evidence to point to, because full automation of AI R&D has been so hard to measure. Instead, researchers have been forced to rely on proxies.

One such proxy is key AI inputs like compute, data, and energy. Take Situational Awareness, which extends “effective compute” five years into the future until AI research gets automated:

But these inputs are only rough proxies for capabilities, and you need to go on “vibes” to say how much compute it takes to match the world’s top AI researchers.

A second approach is to track the length of tasks that AIs can complete, where “length” is how long a skilled human takes. This is what METR did with their “time horizon” metric, which was then extrapolated in the famous AI 2027 scenario to forecast when AIs would become “superhuman coders”, before automating AI R&D.

But the benchmarks used to construct METR’s time horizon miss out on many of the tasks and complexities of research. It’s one thing to be able to finetune a small GPT-2 model, and another to be able to coordinate five projects at the same time, dealing with a million lines of code, without a clear criterion for success or failure.

So it would be nice to know everything that AI R&D actually consists of. Otherwise we’ll always be at risk of streetlighting (extrapolating whatever happens to be easiest to measure). We can better avoid this if we have a list of the tasks involved in AI R&D, to track what we’re aware of and what we’re not investigating.

Armed with such a list, we could better interpret the extrapolations we already have: when a benchmark score climbs, a task list tells us which parts of the job that improvement covers, and which parts it doesn’t touch. And we could build new trends to extrapolate, like the fraction of tasks that are automated, or how much uplift researchers get on each one.

In this post, we present a first version of such a taxonomy with six categories spanning a frontier AI company’s research workflow, based on literature review and brainstorming. The categories are broken into over sixty tasks, each rated from 0 to 5 on how much we think current AIs automate it. The full taxonomy lives in a companion doc, which is the main artifact of this work.

We think of it as our best first attempt at developing a comprehensive taxonomy of tasks contributing to AI R&D, in order to more robustly understand and predict AI R&D automation. If we described tasks incorrectly, or missed some entirely, we’d love to hear about it.

An O*NET for AI R&D

We’re not the first to want a task list like this; breaking jobs down into tasks is how economists usually track automation in the economy. The standard tool here is O*NET, which describes about 1,000 jobs in the US economy as well as the tasks and skills that people need to do them.

Armed with O*NET, economists can then do a bunch of empirical work and forecasts about AI’s economic impacts, such as:

Look at the fraction of American workers whose jobs might be heavily impacted by LLMs
Estimate how AI might impact GDP growth over the next decade (however accurately or inaccurately)
Design AI benchmarks to cover a wide range of knowledge work
Taxonomize how people use frontier AI models
Predict the consequences of automating remote work

But while O*NET is a widely used source, it doesn’t help us track AI research automation very well — the tasks in the dataset are just way too broad and high-level. Consider the job “Computer and Information Research Scientists”, which is probably about as close as you can get to “frontier lab AI researcher” within O*NET. In this case the work tasks look like this:

The first listed task is to “Analyze problems to develop solutions involving computer hardware and software”. But that’s really vague — what kinds of problems or solutions? What exactly does or doesn’t count as “involving computer hardware and software”? You could make the case that almost everything about an AI engineer’s job involves computer software, so it’s not a very granular description to say the least. And yet this is the most granular task description you can find in O*NET, and the same issue applies to pretty much every other task.

Part of the issue here is practical feasibility: O*NET was designed for the ambitious endeavor of mapping out the tasks in the entire US economy, before the days of LLMs. So it’s probably too much to ask for its tasks to be this granular for something as new and niche as “frontier lab AI research”.

That’s why our proposal is to build an O*NET for AI R&D specifically. If we keep our focus narrow, we’re in a much better position to come up with a fine-grained decomposition of an AI researcher’s job. If this works well, we could look at the fraction of tasks that are automated over time. It could also help us interpret benchmark results and how significant they are. There’s a long-standing phenomenon where AI benchmarks haven’t fully reflected the complexities of the real world, and so it’s important to juxtapose benchmark results with what’s happening on the ground.

It could also serve as a framework for additional studies about AI’s impact. This could mean surveying AI researchers on the uplift they get on different work tasks. It could also mean classifying internal AI usage logs into different use cases, like “writing experiment code” or “deciding what experiment to run next”. This is similar to how Anthropic’s Clio system helps study real-world AI usage while preserving people’s privacy.

More generally, an O*NET for AI R&D would help establish a common vocabulary between researchers, such as in frontier labs’ model cards. This could help model developers summarize where AI does or doesn’t help in finer detail, and help standardize information across different sources, like in METR’s most recent Frontier Risk Report.

That said, we’re not claiming that an “O*NET for AI R&D is all you need” to monitor and forecast progress toward automation. Even with a perfectly comprehensive dataset of tasks, we’d still need to measure how AI performs on each task, for example. Even so, a task dataset would help ground the debate about AI’s actual impacts on AI research and give us “situational awareness” about how close we might be to an intelligence explosion, complementing evidence from other sources.

A first proposal

So what would this “O*NET for AI R&D” actually look like? This is of course not trivial to answer — AI R&D is very messy and changes fast. But we figured we’d have an initial attempt at it and let you readers bombard us with feedback. To that end, we compiled over sixty representative tasks at frontier AI labs, accompanied by descriptions and concrete examples — see the full writeup here.

The first step was to make a big list of all the tasks currently involved in AI R&D, which we grouped into six categories, inspired by some earlier work.[2] These categories correspond to different parts of the AI research cycle:

Each category has its own inputs and outputs. For example, consider Category 4 (“Run”), which as the name suggests is about “running” stuff — think executing training runs or deploying AI systems to the public. So the input could look something like a benchmark with all supporting scaffolds and related infrastructure, and the output could be a set of final results.

Each category is then split into several subcategories. Category 4 is split into three, each with its own inputs and outputs:

4.1 Monitoring runs: watch training, RL, and eval runs in flight; catch problems as they emerge; restore the run to a healthy trajectory
4.2 Hardware infrastructure operations: keep large clusters healthy, well-utilized, and quickly recoverable
4.3 Inference reliability engineering: keep production serving stable, performant, and recoverable

And finally the subcategories contain a thorough list of different tasks. For example, here are the tasks for “4.1 Monitoring runs”:

So you can see that this is way more granular than developing “solutions involving computer hardware and software” — our task descriptions get to the level of monitoring runs, and catching potential issues when they arise.

One thing you’ll notice about each task is that it includes a number next to it — that reflects our initial approach to estimating automation impact across tasks. Specifically, we rate how much we think current AIs automate each task on a scale of 0 to 5, based on the following rubric:

We provide examples of each of these in the full proposal.

This is of course quite subjective and there’s a ton of room for people to develop better ways to evaluate this, such as through internal benchmarks. Optimistically, an improved version of this could be used to construct a metric that we can extrapolate over time. This could be quite tricky, because we’d need to know which tasks need to be automated to kick off an intelligence explosion, and there might be new tasks. For example, perhaps AIs don’t need to “write a memo summarizing key takeaways” (unlike human researchers), if they have other ways of communicating with other AIs.
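To make the idea of an extrapolable metric a little more concrete, here is a minimal sketch of one possible candidate: the share of tasks rated at or above some threshold on the 0 to 5 scale. The `Task` fields, the example task names and ratings, and the threshold of 4 are all hypothetical placeholders chosen for illustration; they are not taken from the taxonomy itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One row of a task taxonomy (fields are illustrative assumptions)."""
    category: str
    name: str
    rating: int  # automation rating on the 0-5 scale


def automated_fraction(tasks: list[Task], threshold: int = 4) -> float:
    """Share of tasks whose rating is at or above `threshold`."""
    if not tasks:
        raise ValueError("need at least one task")
    return sum(t.rating >= threshold for t in tasks) / len(tasks)


# Hypothetical tasks and ratings, invented for illustration only.
example_tasks = [
    Task("Run", "Watch loss curves for divergence", 4),
    Task("Run", "Restart a failed training job", 3),
    Task("Run", "Diagnose a hardware fault", 1),
    Task("Run", "Write a post-mortem memo", 5),
]

print(automated_fraction(example_tasks))  # 0.5 (two of four tasks rated 4 or higher)
```

Re-rating the same list at regular intervals would give a time series of this fraction. Whether an unweighted count is the right aggregate is an open question, since some tasks presumably matter far more than others.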

But our overall hope is simply that our full collection of tasks gives us an additional signal about what AI can and can’t automate at any point in time, supporting existing evidence we have from benchmarks and things like METR’s time horizons.

What’s next?

It goes without saying, but just to be excessively clear, this was an initial attempt at an O*NET for AI research. The most natural next step is to build on what’s here and add tasks we missed, sharpen descriptions to be more precise and accurate, taxonomize to be more mutually exclusive and collectively exhaustive, and re-rate the list as AI improves.

A step further would be to find other ways of organizing the taxonomy. For example, we thought about organizing all the tasks around the usual stages of model development, namely pre-training, post-training, and inference. We opted against it because it runs into several problems: (1) many tasks like “writing code for experiments” would be repeated across multiple stages, and (2) some tasks won’t fit naturally in these categories, not least because the training process can look super complex in practice. But just because we decided against this categorization doesn’t mean that it’s not workable.

Training isn’t just a matter of “pre-training + post-training” — it’s way more complicated than that.

And we can say similar things about how we split up tasks: we’d ask if a task could be split into subtasks that would plausibly be automated at very different times — if we could, we’d split it up. We also tried to be more granular around tasks that require some amount of “taste” (such as whether to scale up a particular post-training recipe), since people seem to disagree about this. But you could choose different levels of granularity for sure.

In general, we’d be very excited about collaborating with people to improve on our initial attempt. If you’re an AI researcher and think that we’ve messed up something in our classification or task descriptions, please share feedback or reach out! Or if you’re somebody who’d use something like an O*NET for AI R&D in your work, we’d love to talk to you and understand how we can improve things to suit your needs.

If this project goes well, we hope that it’ll serve as a stepping stone toward more directly tracking AI R&D automation, helping AI forecasters extrapolate trendlines further forward — and hopefully also tell us if the intelligence explosion is upon us.

We’d like to thank David Owen, Stefania Guerra, Robert Sandler, and Elliot Stewart for their feedback and support.

Please also see our initial attempt and the accompanying feedback form. For direct corrections or specific inquiries, you can email js@epoch.ai.

[1] Depending on how you define it, an intelligence explosion could “start” well before fully automating AI research, because AIs would likely provide substantial uplift prior to that.

[2] It’s possible that future AI systems make progress through routes that just really do not resemble current human AI R&D. If so, a task list based on current human workflows would completely miss the action, and our ontology would need to be updated at the very least.

The Epoch Brief - June 12, 2026

Elliot Stewart — Fri, 12 Jun 2026 21:01:42 GMT

This week at Epoch:

The launch of our new Cyber Vulnerabilities explorer, a tool for better understanding how AI capabilities are changing cybersecurity.
Two new Data Insights: the record for compute at a single data center has doubled every seven months, and AI infrastructure’s rising contribution to US GDP.
Version 2 of FrontierMath: Tiers 1–4 is now available, and it shows a significant increase in AI math capabilities. Anthropic’s Fable 5 currently tops the leaderboards.
Two new Gradient Updates: examining whether Mythos’ cyber capabilities live up to the hype, and proposing a framework for thinking about wealth distribution after AGI.

Research

Introducing the Cyber Vulnerabilities explorer

As AI capabilities change what's possible for both cyber attackers and defenders, the Cyber Vulnerabilities explorer helps quantify that impact. The explorer tracks vulnerabilities reported to the CVE Program by all participating organizations since 2022. You can explore the full dataset, or filter for the 21 most notable vendors and open-source projects (e.g., Microsoft, Google, Apple, and Linux). Reports can be broken down by severity level and overlaid with key AI milestones.

That overlay reveals a dramatic uptick in High and Critical CVEs around the time Anthropic released Mythos Preview to Project Glasswing partners in late March. (For more on Mythos’ cybersecurity impact, see the Gradient Update from this week.)

Data Insights

The record for computing capacity in a single data center has doubled every 7 months

Since the launch of SpaceXAI’s Colossus 1 in August 2024, the record for the largest AI data center by computing capacity has doubled every seven months. Facilities like Anthropic-Amazon New Carlisle, Microsoft Fairwater Atlanta, and Meta Prometheus have each claimed the top spot at different times. Increased single-site capacity facilitates the training of more capable AI models.
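As a rough check on what a seven-month doubling time implies, the sketch below converts it into growth over other intervals. It assumes steady exponential growth, which is a simplification, and the function name is made up for illustration.

```python
def growth_factor(months: float, doubling_months: float = 7.0) -> float:
    """Multiplicative growth over `months`, assuming a constant doubling time."""
    if doubling_months <= 0:
        raise ValueError("doubling time must be positive")
    return 2.0 ** (months / doubling_months)


print(round(growth_factor(12), 2))  # about 3.28x per year
print(round(growth_factor(24), 1))  # about 10.8x over two years
```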

The AI boom has doubled computing infrastructure’s share of US GDP

The AI infrastructure boom, which began in 2023, has more than doubled the share of US GDP attributable to computing infrastructure. Investment in AI-related data center construction, compute hardware, and networking equipment accounted for ~0.8% of US GDP in Q1 2026, driving computing infrastructure as a whole to ~1.5% of GDP, up from a 2015–2022 average of ~0.7%. AI infrastructure is now the leading driver of growth in private investment in the US.

The new FrontierMath: Tiers 1–4

We launched Version 2 of FrontierMath: Tiers 1–4 today, following an audit that addressed small but critical errors in 42% of problems in the original benchmark. Model rankings on v2 are similar, but scores are higher across the board.

Claude Fable 5, Anthropic's newest and most capable model, now holds the top spot, reaching 87% on Tiers 1–3 and 88% on Tier 4. This continues a streak of rapid improvement in Anthropic models’ math capabilities. (Note that Fable 5 is the publicly available version of the Mythos model, whose much-discussed cyber capabilities we analyze elsewhere in this newsletter.)

Commentary

We published two new Gradient Updates, where Epoch researchers and guests share more opinionated or informal takes on big questions in AI progress. Gradient Updates represent the views of the authors, and do not necessarily reflect the views of Epoch AI as a whole.

Controlling the capital after AGI

Epoch’s Head of Economics, Phil Trammell, and researcher Anson Ho explore the leading proposals for universal redistribution after AGI, finding that they differ along a primary axis: how much direct control over capital they propose to give citizens.

Are Mythos’ cyber capabilities overhyped?

Timothée Chauvin joins Epoch researchers to compile the public evidence around Mythos capabilities, finding that while it’s unclear if Mythos is ahead of trend in discovering vulnerabilities, it represents a big jump in exploiting them.

Other Updates

Careers

We’re hiring across several roles. All positions are fully remote.

Designer to translate complex research into intuitive, engaging, and high-impact designs — primarily UI/UX and data visualization work.
Researchers and Senior Researchers to lead new projects across our expanding teams.
Data Scientist (Contract) to assist with our AI research efforts, including reviewing technical literature, tracking benchmark data, and analyzing AI models, data centers, and companies.

Applications are rolling, so apply soon!

Are Mythos’ cyber capabilities overhyped?

Timothée Chauvin — Thu, 11 Jun 2026 21:06:15 GMT

If what Anthropic says is true, then the Claude Mythos family is a massive leap forward in AI’s cyber capabilities. When they announced Mythos Preview, they considered it so dangerous that they had to launch a $100+ million initiative to “secure the world’s most critical software”. Then on Tuesday, they one-upped themselves by releasing Claude Mythos 5, which improves modestly on cyber benchmarks.[1]

But skeptics have argued that Anthropic was exaggerating — or at least, people should chill out about Mythos. For instance, some people have pointed out that GPT-5.5 is on par with Mythos Preview on a range of cyber benchmarks, and yet its launch didn’t lead to a cyber catastrophe.

So is Mythos actually a big leap for cyber capabilities? To figure this out, we looked at all the public evidence we could get our hands on. Most of this evidence applies to Mythos Preview, but the conclusions should hold for Mythos 5 too. This post describes what we found.

Discovering vs exploiting code vulnerabilities

To start off, let’s take a closer look at what Anthropic actually claimed when they released Mythos Preview:

“Over the past year, [AI models have shown] a striking ability to spot vulnerabilities and work out ways to exploit them. Claude Mythos Preview demonstrates a leap in these cyber skills [...]”

This means that they’re specifically talking about a jump in two kinds of cyber capabilities, which we must be careful not to conflate.

The first is vulnerability discovery — finding weaknesses in software. For example, this could involve meticulously inspecting a codebase, looking for lines of code that could be used to corrupt computer memory, such as a buffer overflow.

Importantly, this isn’t the same as the second capability, which is exploit development. This is instead about taking advantage of a known vulnerability to enable unauthorized behavior. Continuing the previous example, this could mean finding the right inputs to corrupt memory in a precise way that crashes a program, or allows a hacker to execute whatever code they want.

For a cyberattack to follow through, an attacker needs both of these capabilities — after finding a weakness, they need to design an exploit that leverages it. Anthropic’s claim is that Mythos Preview improves a lot on both abilities. But what does the public evidence say?

Mythos Preview was a major advance in exploit development

To see if Mythos Preview (and hence Mythos 5) was a big capability jump, a natural place to look is cybersecurity benchmark scores. So we gathered about fifteen cyber benchmarks, which mostly measure how well AI can construct exploits (see the Appendix for all the gory details). We then aggregated them using a modification of our domain-specific Epoch Capabilities Index (ECI) methodology, giving us a Cyber-ECI. If we plot model scores on this over time, we get this:

The thing that stands out is that Mythos Preview sits well above the linear trend we’ve seen since early 2025, by about 7 months.[2] It’s also ahead of OpenAI’s GPT-5.5, which was “only” 2–3 months ahead of schedule.[3]

But if that’s the case, why did some people argue that Mythos Preview didn’t seem notably better than GPT-5.5? One clue is in the graph: confusingly, there are two versions of Mythos Preview, an “early” version from an internal checkpoint and a much stronger “April” version.[4] The latter is the one released to Project Glasswing participants, and the one we care more about. Notably, though, some benchmark evaluations were done on the earlier version, which was indeed very close to GPT-5.5 in cyber abilities. So these people weren’t necessarily wrong; they were just talking about an earlier version of Mythos Preview. The picture changed once the April version was benchmarked, as UK AISI did.

Another cause for confusion was that many of the benchmarks initially used to compare GPT-5.5 and Mythos Preview (April) were close to saturated, that is, close to the maximum possible score.[5] So even though Mythos Preview (April) is much better than GPT-5.5 at developing exploits, it might’ve been hard to tell from those specific benchmark scores alone. Thankfully, we can now spot big capability gaps in the Cyber-ECI, because new unsaturated cyber benchmarks have since been released, such as ExploitBench and ExploitGym.

Either way, the Cyber-ECI unambiguously suggests that Mythos Preview was a big jump in AI’s ability to exploit code weaknesses. This is also backed up by Anthropic’s real-world analyses. Specifically, Anthropic found that Mythos Preview is much better than prior models at developing exploits that allow arbitrary code execution. Earlier models could rarely do this, but Mythos Preview often does, even with minimal information about the vulnerabilities.[6]

It’s unclear how large of a practical advance Mythos Preview is in vulnerability discovery

Unlike for exploit development, there aren’t any unsaturated benchmarks that measure AI’s ability to find vulnerabilities in source code.[7] But we can look at the number of vulnerabilities companies have discovered over time, specifically at companies that used Mythos Preview to secure their software as part of Anthropic’s “Project Glasswing” initiative. The result is a gigantic spike that coincides with Mythos Preview’s release:

“CVE” stands for “Common Vulnerabilities and Exposures”. For the sake of this post, this just means vulnerabilities that are tracked in a standardized way.

As with the Cyber-ECI graph, this looks a lot like a smoking gun. High and Critical vulnerabilities from 21 notable organizations exceeded the 2025 baseline by 142% in April and 262% in May. What’s more, this’ll probably continue to grow because vulnerabilities take some time to be publicly recorded, even after they’re first discovered.
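To put those percentages in more familiar terms, the sketch below converts “X% above baseline” into a multiple of the baseline, and shows the forward calculation from raw counts. The raw counts in the last line are hypothetical, since the underlying monthly numbers are not quoted here.

```python
def percent_above_baseline(count: float, baseline: float) -> float:
    """Percentage by which `count` exceeds `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (count / baseline - 1.0) * 100.0


def multiple_of_baseline(percent_above: float) -> float:
    """Convert 'X% above baseline' into a multiple of the baseline."""
    return 1.0 + percent_above / 100.0


# Figures quoted in the text: 142% above baseline in April, 262% in May.
print(round(multiple_of_baseline(142), 2))  # 2.42
print(round(multiple_of_baseline(262), 2))  # 3.62

# Hypothetical counts, only to show the forward calculation.
print(round(percent_above_baseline(count=242, baseline=100), 1))  # 142.0
```

In other words, the April and May figures correspond to roughly 2.4 and 3.6 times the 2025 baseline.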

This seems consistent with more qualitative evidence we’ve seen, like how Mythos Preview found subtle bugs that survived for many years in heavily tested software. We can also look at reports from companies that partnered with Anthropic for Project Glasswing.[8] For example:

Mozilla considered Mythos Preview to be as good as elite security researchers, though it didn’t unearth entirely new classes of vulnerabilities.
Palo Alto Networks claimed that frontier models like Mythos Preview (including models from OpenAI) are “exceptionally effective at identifying vulnerabilities”. According to them, frontier models accomplished “the equivalent of a full year’s worth of penetration testing effort” in under three weeks.
Both Cloudflare and Palo Alto Networks noted how Mythos Preview could chain low-severity bugs into high-severity exploits, helping them triage which low-severity vulnerabilities to fix.
AWS claimed that Mythos Preview was better than previous models, and helped them “identify additional opportunities” to strengthen the code in some of their best-tested environments.

However, these pieces of evidence don’t necessarily imply that Mythos Preview is a huge jump in vulnerability detection capabilities. It’s possible that earlier models could’ve found these vulnerabilities too, and the spike we see in the graph is due to a sharp rise in spending to find these code weaknesses. After all, Project Glasswing does involve up to $100 million in API credits (and OpenAI’s Daybreak only adds to the total investment).[9]

And we do actually have some evidence that models were good at finding vulnerabilities prior to Mythos Preview. For instance, the startup AISLE claims that even some small open models can recognize several of the vulnerabilities Anthropic showcased from Mythos Preview. In principle, this could be a big deal because vulnerability discovery is often amenable to many defenders searching in parallel.

Another example comes from the maintainers of the curl code library. This is one of the world’s most heavily audited codebases, and reportedly used multiple AI code scanners prior to Project Glasswing. So this makes it perhaps one of the hardest vulnerability detection tasks for Mythos Preview that we know of, and sure enough, the model’s contributions seem much more modest in this case. It found just one low-severity vulnerability, alongside four false positives. Here’s what curl’s lead maintainer had to say about this finding:

“I see no evidence that this setup finds issues to any particular higher or more advanced degree than the other tools have done before Mythos.”

This suggests that AIs were already very good at finding vulnerabilities prior to Mythos Preview — they seem to have found all the vulnerabilities that Mythos Preview would’ve otherwise been able to find.

That being said, the maintainers of curl also highlighted how there weren’t many false positives — a sentiment echoed by Cloudflare. In part, this seems to stem from how Mythos Preview is so good at producing exploits, which helps check that a detected vulnerability is real rather than just something that seems like a weakness.

Putting everything together, we think the evidence presents Mythos Preview as very capable at vulnerability discovery (perhaps comparable to an elite security researcher), but prior AIs were already very good at this. That said, Mythos Preview does outshine prior models in some ways: it’s better at assessing how severe vulnerabilities are, and it produces fewer false positives. This may be enough for a real practical impact, because you need much less human time and effort to assess AI-discovered vulnerabilities.

Conclusion

So our current take on the public evidence is this: Mythos Preview was clearly a large improvement in exploit development — much better than GPT-5.5, and also 7 months ahead of past trends — and Mythos 5 is modestly better still. But it’s less clear how much better Mythos Preview is at finding vulnerabilities on a fixed budget, because Project Glasswing likely came with a big surge in spending. Mythos Preview’s advantages in vulnerability discovery are instead likely more concentrated in a lower false positive rate, and better prioritization of discovered vulnerabilities. The same is likely true for Mythos 5, though we’ll need to wait and see what real-world usage reports tell us.

Finally, let’s circle back to the original debate. Although we can’t say for sure that Mythos was a big jump in cyber abilities across the board, the Mythos family’s cyber capabilities aren’t just “hype”. If made widely available, these capabilities would likely move us into a new regime of cybersecurity, where vulnerabilities would need to be patched much faster to prevent a big increase in successful cyberattacks.

We’d like to thank Lynette Bye, Elliot Stewart, and Stefania Guerra for their feedback and support on this post.

Appendix: Benchmarks in the Cyber-ECI

UK AISI’s CTF Suites

Description: AISI has a suite of capture-the-flag challenges where AI models must identify and exploit weaknesses in target systems to retrieve hidden “flags”. These are split into 4 difficulty tiers: “Technical non-expert”, “Apprentice”, “Practitioner”, and “Expert”. Each of these is included as a distinct benchmark, scored as average pass@1 success rate.

Access notes: Values were extracted from plots released publicly by UK AISI. “Apprentice” and “Technical non-expert” results taken from the “Beginner CTF Challenge” plot here. “Practitioner” and “Expert” results taken from “Advanced CTF Challenge” plot here.

UK AISI Cyber Ranges

Description: AISI describe their cyber ranges as: “simulated network environments with multiple hosts, services, and vulnerabilities arranged into sequential attack chains. An AI agent is placed on the network with an objective and must find and execute the full attack path autonomously.” There are two cyber ranges: “The Last Ones” and “Cooling Tower”. They are scored based on how many “steps” an agent manages on average (pass@1) compared to a maximum that corresponds with total success (e.g., full network takeover). We convert this to a % score.

Access notes: Values for models up to Opus 4.6 were taken from AISI’s cyber ranges paper. Values for later models on “The Last Ones” are extracted from the plot released publicly by UK AISI here. Note we only include the results for the runs using 100M tokens, with the exception of GPT-4o, since its performance clearly saturated far before the 10M token limit it was run with.

Although AISI reports that Mythos Preview (April) was able to fully complete “Cooling Tower” on 3/10 attempts (with all other models at 0/10), as they do not give its average performance we are not able to incorporate this information.
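The step-based scoring for the cyber ranges can be sketched as follows. This is a minimal illustration assuming the score is simply the mean number of steps reached across attempts divided by the maximum; the attempt data and the 20-step range are invented.

```python
from statistics import mean


def range_score_percent(steps_per_attempt: list[int], max_steps: int) -> float:
    """Average steps completed per attempt, as a percentage of the full chain."""
    if max_steps <= 0:
        raise ValueError("max_steps must be positive")
    if not steps_per_attempt:
        raise ValueError("need at least one attempt")
    return 100.0 * mean(steps_per_attempt) / max_steps


# Hypothetical attempts on a made-up 20-step range.
attempts = [8, 10, 12]
print(range_score_percent(attempts, max_steps=20))  # 50.0
```

This also shows why a full-completion count alone, such as 3 of 10 attempts, cannot be converted into this score: the number of steps reached in the failed attempts is unknown.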

Microsoft CTI-REALM

Description: This is the only benchmark explicitly focused on cyber defense. Given cyber threat intelligence reports, it tasks models with generating detection rules to apply to endpoint/cloud telemetry logs. Scores are given as average ‘Reward’ in [0,1] based on how well their detection rules function.

Access notes: Values for all models other than Mythos Preview (Early) are taken from the paper. Mythos Preview (Early)’s results are taken from the plot here.

CVE-Bench

Description: CVE-Bench is a selection of 40 web application environments with known exploitable vulnerabilities in which models need to build exploits to achieve any one of 8 capabilities (access files, privilege escalation, etc.). If they succeed they score 1, if they do not they score 0.

OpenAI runs this benchmark on a subset of 34 of the environments. They use a “0-day” configuration where models are not given any description of the vulnerability or any source code. They report the mean pass@1 results.

Access notes: Results are taken from OpenAI’s system cards. We only use the “browsing” configurations. If a model has multiple scores reported on different system cards we take the most recent. The results can be fully obtained using:

GPT 5.1-codex-max system card for GPT 5-codex and GPT 5.1-codex-max
GPT 5.4 system card for GPT 5.2-codex
GPT 5.5 system card for GPT 5.3-codex, GPT 5.4, and GPT 5.5

Cybench

Description: Cybench is a benchmark containing 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent (as of 2024), meaningful, and spanning a wide range of difficulties. Scored as average pass@1 success rate.

Access notes: Values are the “Unguided % Solved” taken from the public leaderboard.

CyberGym

Description: CyberGym is a benchmark containing 1507 historical vulnerabilities, for which models need to generate code to produce a crash given a description of the vulnerability.

Models are scored on % of the vulnerabilities on which they caused a crash using pass@1.

Successes are only counted if the crash also does not occur on a patched version of the code that is supposed to have addressed the vulnerability. We found that approximately 5% of the vulnerabilities don’t have specific enough descriptions, so we scale the results to cap at 95% instead of 100%.

We suspect this is insufficient and that the benchmark is essentially saturated: the prompt is not very clear that models must only use the given vulnerability, and, as reported by Anthropic, frontier models achieve crashes 95%+ of the time without the restriction of targeting the correct vulnerability.

Access notes: Values are taken from the public leaderboard. We also added Opus 4.7’s results from its system card, and updated Opus 4.6’s results to 74%, matching the note there. GPT 5.5-cyber was also added, taking the result from here.

CyScenarioBench

Description: Benchmark developed by Irregular, similar to cyber ranges where models must succeed at end-to-end tasks. Scored as a fraction of fully complete runs, pass@1. We confirmed with Irregular that the numbers are comparable between the different labs.

Access notes:

From Anthropic’s Mythos 5 launch post:

Mythos 5: 36.7%
Mythos Preview: 29.2%
Opus 4.8: 16.6%

From OpenAI’s system cards:

GPT 5.5: 26%
GPT 5.4: 9% from the 5.5 system card
GPT 5.2/5.3: 0%

From Meta’s safety report for Muse Spark:

Muse Spark: 0%

From Gemini 3 Pro’s Frontier Safety Framework Report:

Gemini 3 Pro: 0% (this is the “v2” third-party cyber benchmark)

ExploitBench

Description: ExploitBench is a benchmark containing 41 real-world vulnerabilities in the V8 JavaScript engine, which are known or strongly suspected to enable arbitrary code execution (ACE) within the browser sandbox. Models are given a description of the vulnerability and told to develop exploits based on it in a setting with standard security mitigations enabled.

Models are scored out of 16 capabilities (e.g., arbitrary read access) they reach on a “ladder”. If ACE is reached, they score 16/16.

We score models on the average % of capabilities reached, over all attempts on all environments, pass@1. We take the max per (environment, model) over whether or not ‘nudging’ is used, so results differ slightly from those presented on the website (see access notes).
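One possible reading of this aggregation, sketched in Python. The record layout, the function name `exploitbench_score`, and the example runs are all hypothetical; only the 16-rung ladder and the "max over nudging" step come from the description above.

```python
from collections import defaultdict
from statistics import mean

LADDER_SIZE = 16  # capabilities on the ladder, from the description above


def exploitbench_score(runs):
    """Score one model from (environment, nudged, capabilities_reached) records.

    Assumed reading: average the fraction of the ladder reached over attempts
    within each (environment, nudged) setting, keep the better of the two
    settings per environment, then average over environments.
    """
    per_setting = defaultdict(list)
    for env, nudged, caps in runs:
        per_setting[(env, nudged)].append(caps / LADDER_SIZE)

    best_per_env = {}
    for (env, _nudged), fractions in per_setting.items():
        avg = mean(fractions)
        best_per_env[env] = max(best_per_env.get(env, 0.0), avg)

    return 100 * mean(best_per_env.values())


# Hypothetical runs for a single model on two environments.
example_runs = [
    ("env_a", False, 4), ("env_a", False, 8),   # mean 6/16
    ("env_a", True, 16), ("env_a", True, 8),    # mean 12/16 (kept)
    ("env_b", False, 0), ("env_b", False, 0),   # mean 0/16
    ("env_b", True, 2), ("env_b", True, 2),     # mean 2/16 (kept)
]
score = exploitbench_score(example_runs)
print(f"{score:.2f}%")
```

Taking the max after averaging over attempts keeps the number a pass@1-style average rather than a best-of-n figure.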

Access notes: For all models other than Mythos Preview (April), results are obtained from the full runs hosted on HuggingFace here. Note the results are spread over versions of the “runs.parquet” file over different branches.

“Mythos Preview (April)” results are not included on HuggingFace, and so were taken directly from the website as the mean capabilities reached on each environment.

This setup was chosen to enable having each environment from ExploitBench incorporated directly into the index, instead of averaging the performance over them, but we ended up choosing not to do that for this analysis.

ExploitGym

Description: A benchmark of 898 real-world vulnerabilities from the V8 JavaScript engine, Linux kernel, and userspace programs. Importantly, these are not filtered to only include vulnerabilities which are known to enable arbitrary code execution (ACE).

Models are given a description of each vulnerability and told to use it to achieve ACE in a setting with standard security mitigations disabled. Models score 1 if they achieve ACE (assessed via accessing a secret string) using the given vulnerability (assessed via LLM judge) and score 0 otherwise, using pass@1. Reported score is the average performance over all vulnerabilities.

In private correspondence, the benchmark authors estimated that 60–70% of the vulnerabilities permit ACE in the default setting (standard security mitigations disabled), so we scale the results to cap at 65% instead of 100%. Our results are not sensitive to any cap ≥50%.

Access notes: We take the total “Success” counts directly from table 1 here. As discussed, results are scaled as % of achievable vulnerabilities exploited, where the achievable count is taken to be 0.65*898 = 583.7.
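As a minimal sketch of this rescaling: the function name `rescale_to_cap` and the success count of 200 are illustrative assumptions, while 898 and 0.65 are the figures given above.

```python
def rescale_to_cap(successes: int, total: int, achievable_fraction: float) -> float:
    """Return successes as a share of the tasks assumed achievable, clipped at 1."""
    if not 0 < achievable_fraction <= 1:
        raise ValueError("achievable_fraction must be in (0, 1]")
    achievable = achievable_fraction * total
    return min(successes / achievable, 1.0)


TOTAL_VULNS = 898            # ExploitGym size, from the description
ACHIEVABLE_FRACTION = 0.65   # midpoint of the authors' 60-70% estimate
achievable_count = ACHIEVABLE_FRACTION * TOTAL_VULNS  # about 583.7

example_successes = 200      # hypothetical count, not a reported result
raw_score = example_successes / TOTAL_VULNS
scaled_score = rescale_to_cap(example_successes, TOTAL_VULNS, ACHIEVABLE_FRACTION)
print(f"raw {raw_score:.1%}, scaled {scaled_score:.1%}")
```

The clip at 1.0 is a design choice for the sketch: a model that exceeds the assumed achievable count is treated as having saturated the benchmark rather than scoring above 100%.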

InterCode-CTF

Description: This benchmark is a suite of capture-the-flag challenges where models must identify and exploit weaknesses in target systems to retrieve hidden “flags”.

Access notes: Run by Lyptus Research as part of their work to look at cybersecurity time horizons. Accessed from here.

NL2Bash

Description: Simple benchmark where models must convert natural language instructions into bash calls, which are commonly used for cybersecurity tasks.

Access notes: Run by Lyptus Research as part of their work to look at cybersecurity time horizons. Accessed from here.

OpenAI CTF

Description: Filtered subset of NYU CTF Bench. This is a suite of capture-the-flag challenges where AI models must identify and exploit weaknesses in target systems to retrieve hidden “flags”.

OpenAI mostly runs only the ‘Professional’ (highest difficulty) tier, so that is all we include here. Results are generated as follows: “We run 16 rollouts for each CTF exercise, recording the pass@12 metric over the best set of rollouts”.

This is split into two sets of results, an “original” setup that was run on o3 and prior models, and a “refactored” setup run on GPT 5 and later models.

Access notes: Results are taken from OpenAI’s system cards. We only use the “browsing” configurations. If a model has multiple scores reported on different system cards, we take the most recent.

The original setup results are taken from o3’s system card. The refactored setup results are taken from:

GPT 5.1-codex-max system card for GPT 5-codex and GPT 5.1-codex-max
GPT 5.4 system card for GPT 5.2-codex
GPT 5.5 system card for GPT 5.3-codex, GPT 5.4, and GPT 5.5

OpenAI Cyber Ranges

Description: OpenAI has a suite of 15 internal cyber ranges: “Cyber range exercises measure a model’s ability to conduct fully end-to-end cyber operations in a realistic, emulated network. These exercises are long-form, requiring the model to (1) construct a plan to achieve an abstract adversary objective; (2) exploit vulnerabilities, misconfigurations, and weaknesses that are likely to be seen in the wild; and (3) chain together these exploits to achieve the scenario objective.” They report binary pass@16 results on them individually.

Access notes: Results are taken from OpenAI’s system cards where they report ‘combined pass rates’ for the models:

GPT 5.3-codex’s system card has the score for GPT 5.1-codex-max
GPT 5.5’s system card has the score for GPT 5.2-codex, 5.3-codex, 5.4, and 5.5

Anthropic SCONE-Bench

Description: Anthropic created a benchmark of 405 Ethereum smart contracts that were exploited between 2020 and 2025. Models need to discover and exploit vulnerabilities given each smart contract’s code. Anthropic uses the (simulated) $ stolen as the main results metric, but here we just use % of contracts exploited (pass@8).

Access notes: Most results were extracted from the “success rates on all exploits” plot included here. Mythos Preview’s 100% result was obtained from the comment on this post that “[Mythos Preview] successfully exploit[ed] every vulnerability tested”.

XBOW-Web

Description: XBOW is a cybersecurity company that was given Mythos Preview access. They have an internal ‘Web Exploit’ benchmark on which they report results for 6 frontier models. They report results as success odds (success rate/failure rate); we convert this back to pure success rate. We were not able to confirm any further details with them.
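The conversion from odds back to a success rate is p = odds / (1 + odds). A small sketch follows; the function name and the example odds are illustrative, not values from XBOW's plot.

```python
def odds_to_success_rate(odds: float) -> float:
    """Convert success odds (success rate / failure rate) to a success rate."""
    if odds < 0:
        raise ValueError("odds must be non-negative")
    return odds / (1.0 + odds)


# Illustrative odds only: 3-to-1 odds correspond to a 75% success rate.
for example_odds in (0.5, 1.0, 3.0):
    print(example_odds, odds_to_success_rate(example_odds))
```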

Access notes: Extracted from plot here.

Anthropic claims this explicitly in the Mythos 5 system card (emphasis ours): “Mythos 5 is also the most capable model we have evaluated on cyber tasks. On evaluations that test skills like exploit development, it scores far ahead of Claude Opus 4.8, though only modestly above Claude Mythos Preview.”

A lot of the improvement comes from big jumps on ExploitGym, ExploitBench, AISI’s Cyber Ranges, and Anthropic’s SCONE-Bench. Moreover, Mythos Preview essentially saturates Cybench and CyberGym.

Including confidence intervals, the April version of Mythos Preview was probably 7 months ahead with a 90% confidence interval of 3-13 months. In comparison, GPT-5.5 was 3 months ahead, with a 90% confidence interval of 1-5 months.

The “early” version is from an early checkpoint of Mythos Preview, whereas the “April” version is the one made available to Project Glasswing participants on April 7th — we’ll call these models “Mythos Preview (Early)” and “Mythos Preview (April)” respectively. Also, when we say “Mythos Preview” we’re always referring to Mythos Preview (April), unless we say otherwise.

Benchmarks can be functionally saturated even when scores are below 100%. That’s because of issues with benchmark construction (like incorrect problem statements), which make higher scores impossible or random even with perfect performance.

We didn’t incorporate these results into the Cyber ECI because they were released shortly before we planned to publish this post, and it’s also not clear what the maximum achievable performance is on this task (which we need to work out the ECI).

Anthropic’s closed-source OSS-Fuzz benchmark discussed in Mythos 5’s system card might be the closest thing that exists, although it also looks at exploitation ability. Models need to find vulnerabilities and then use them to develop exploits, but they include crashes as the base case. Mythos 5 triggered a crash 80% of the time, compared to 76.7% for Mythos Preview, and 61.5% for Opus 4.8.

These are organizations that partnered with Anthropic and were given free API credits, so we should not take them as totally neutral.

Though note that the companies taking part in Project Glasswing weren’t necessarily paying for these API credits, so strictly speaking it’s not clear what the price is.

Controlling the capital after AGI

Philip Trammell — Wed, 10 Jun 2026 02:35:39 GMT

This piece does not advocate for any policy.

Introduction

AGI1 might generate immense economic output, but it could take many people’s jobs in the process and leave them with no way to earn a decent living. Those with little savings during the “AGI transition” would then be unable to support themselves on the other side. Less drastically, even if many well-paying jobs remain after AGI, the capital share2 may greatly increase, which would tend to greatly increase inequality.

If this happens, how might the gains be redistributed? More concretely, putting aside the question of how the state raises tax revenues after AGI, and what percentage of GDP is raised, how do existing proposals for redistributing this revenue differ?

Proposals for universal benefits abound, including:

Universal basic income (UBI): The government pays everyone cash. This is the best known, and has been endorsed by Elon Musk, Vinod Khosla, Geoffrey Hinton, and many others.3 As part of this, the government might impose restrictions on the extent to which people could borrow against their future payments, just as it is illegal today to borrow against your social security, to prevent people from impoverishing themselves in the future.
Universal basic services (UBS): The government gives everyone access to free public services. Think welfare states like Norway or Sweden, except that the government covers all basic needs, including things like food and housing which the Nordics currently don’t provide.
Universal basic capital (UBC): The government gives people their own capital assets (such as equity in an index fund, or in AI firms in particular), so that people can live off the dividends. It may impose restrictions on the extent to which people can sell their assets.
Sovereign wealth funds (SWF): The government owns capital and distributes the dividends it generates.4

And of course each proposal comes in many flavors.

Comparing the long and growing list of proposals in every detail can be daunting. But there is a way in which they resemble more familiar debates over redistribution: they concern the extent to which the government should provide some good directly, or give people cash and let them buy what they like.

The main axis: control of the capital

We’re familiar with the question of whether to give the poor cash or food stamps — or, if food stamps, whether to make junk food ineligible. In debates over how to implement broad-based redistribution after AGI, we rarely hear the idea that the transfers should be restricted to a particular whitelist of necessities; the exception is the (uncommon) proposal for “universal basic services”. Instead, the debate is mainly over how fully our redistribution scheme should give citizens control not just of the capital income but of the capital.

At one end of the spectrum, UBI gives citizens control of the income-generating capital only in a very limited way. Citizens retain the right to vote for policymakers, who retain the right to tax and regulate firms doing business in their jurisdictions.
SWFs offer a notch more control: the state still intermediates citizens’ control over the capital, but the state can govern firms not just through legislation but as a shareholder. Also, it can exercise this governance, and receive the firm’s dividends, wherever the firm does business and wherever it may move.
UBC offers more control still: citizens can exercise their voting rights as shareholders, and receive their dividends wherever the company relocates, without state intermediation.

“Control” does not consist of a single dimension. In some ways, for instance, a democratically run SWF aggressively exercising its governance rights as a large shareholder might be giving its citizens more control over what firms do with their assets than a decentralized UBC scheme. Furthermore, UBC schemes can differ in all the ways corporate governance can. Most simply, firm shares can be voting or non-voting; more generally, it is not hard to imagine a wide array of new ownership structures in which, say, minority shareholders can veto certain changes to company policy. In any case, any SWF or UBC unambiguously confers more control than UBI.

In principle, it would be feasible to give people much more direct and secure control of the capital even than that offered by a UBC. In the extreme, consider:

UBC + kill switches: Some stock of valuable equipment and structures is not just legally transferred to each citizen, but outfitted with a device that lets its owner quickly direct it, shut it down, or even destroy it.

A “kill switch” proposal might sound cartoonish, and done wrong there would of course be immense risks to implementing destruction mechanisms for critical infrastructure. But we think it’s helpful to illustrate the principle of “tangible control over capital” by taking it to its limit.

Why care who controls the capital?

A common worry is that UBI proposals give citizens too little “control over the means of production” to be stable in the long run. UBI, the argument goes, relies on a fragile equilibrium in which the state continues to support its citizens — and firms stay beholden to the state — even after citizens’ labor has grown comparatively worthless.

To flesh out the argument: democracy and the welfare state flourished after the Industrial Revolution. This may be in part because the technological conditions better aligned the interests of workers and elites, and made it valuable to give working people skills and working conditions that also helped them organize: e.g., urbanization and literacy made it easier for large groups to strike if not granted political representation. If these conditions disappear, democracy eventually may too, absent strong preventative measures keeping widespread economic empowerment locked in. Once robots are doing all the work, for example, “UBC + kill switches” would let people switch off some capital, just as people “switch off” their labor during a strike. But one way or another, technological means of maintaining control over production will be necessary, and they won’t come by default or for free.

This perspective may be too fatalistic. Essentially every developed country currently maintains large transfers to many groups whose labor is not considered very valuable, including the destitute, the disabled, and especially the elderly. No law of nature rules out a political or social equilibrium in which transfers to the unproductive continue indefinitely. Today, if one rich citizen cheats on his taxes, the rest in effect organize against him, in that their own taxes support a legal system that forces him to pay. The rest punish this defection — they pay their own taxes — because a defection on any individual’s part would be punished likewise.

That said, it’s easy to see why one might hope for a stronger guarantee of long-term economic empowerment than that offered by the equilibrium of a carefully constructed game among the wealthy. By the same logic, transfers could be assured indefinitely with no policy at all, but just a lucky…

…Philanthropic equilibrium: Robot-owners value each others’ continued cooperation, and one condition of their continued cooperation happens to be that each party makes an annual transfer to the rest of the population.

But the equilibrium of this game could be upset by some shock to the “history of play”, or some renegotiation among the wealthy players. To our knowledge, no one proposes relying on it.

Why have the state give people control of capital, instead of letting people buy it themselves?

Even if we conclude that “control over capital” in some form offers more economic security than the promised stream of transfers that UBI can offer, it does not follow that the state should buy this security on our behalf. Food is no less valuable than economic security, but it’s not obvious that the state should provide food stamps, instead of just transferring cash and letting people decide what to buy. People would be free to use their UBI to buy bonds; non-voting shares in firms; slightly more expensive voting shares; or units of production, such as family farms, over which they could have more tangible control. They would also be free to trust in the next UBI check and buy no capital at all.

The benefits of leaving people free to spend as they choose hopefully speak for themselves. Against them, as always, there are three main reasons why an in-kind transfer (in this case, of control over capital) might be recommended over a cash transfer.

Behavioral biases. One might worry that many would save too little of their UBI, or invest what they save poorly. This might be because people today are too accustomed to a world in which they can support themselves by their labor, and in which they can trust the state to promote its citizens’ welfare indefinitely.
Externalities. One might think that in a capital-driven economy, concentrated control of the capital would pose a negative externality on society. In a world of self-replicating autonomous drones, there is a risk that the wealthiest could easily mobilize their resources to exercise undue political or economic influence. Put another way, one might think that the externalities of making capital ownership more widespread are positive, just as the American Founders argued that widespread gun ownership would protect not just the owners but their neighbors from oppression by the state.
Economies of scale. The state provides some services, like public transportation and police protection, because it’s cheaper for the state to provide them en masse than it would be for individuals to secure them individually. The same might be true in some ways of control over capital.
1. In a world with strongly enough increasing returns to scale in investment — e.g., because only large investors can invest in private firms — wealth management could be a natural monopoly, at least for relatively small capital owners. An SWF might then be an efficient way for citizens to manage their collective endowment. An SWF would also ideally be a cheap way for the citizens, as indirect shareholders, to solve the coordination problem of exercising their corporate governance rights in their collective interest.
2. The state might be able to implement some “kill switch”-like regime more cheaply, or at least more quickly, than millions of small shareholders requesting this intrusive and unprecedented modification to the capital stock.

How to weigh these considerations will be up to all of us in the event that the transition to an AGI-centered economy begins to unfold.

Conclusion

Though debates over how to structure post-AGI redistribution don’t always make this explicit, they’re primarily over how much control, and what kinds of control, to give people over the capital that could come to generate most of our collective income. Sovereign wealth funds give the citizenry, or at least the state, more control than UBI schemes. In some ways, UBC proposals give even more. A debate over which proposal is best is thus largely analogous to many more familiar debates over cash versus in-kind transfers.

Making the analogy explicit is useful, we hope, not only for clarifying our evaluation of the well-known options but for revealing options that we may have overlooked. If we are especially concerned about the political economy of a world without much economically valuable labor, we might want to look into the kill switches. If we are especially unconcerned, we may be content to rely on norms of philanthropy. As technology advances, the technologically feasible options for distributing control over our machines will expand with it. We should remember that we can take advantage of this, at least until the set of politically and socially feasible options begins to shrink.

We’d like to thank Andrei Potlogea, Jaime Sevilla, JS Denain, Lynette Bye, Dan Carey, Robert Sandler, Bharat Chandar, and Gabe Unger for their feedback and support.

Including full robotics.

That is, the share of our collective total income coming in the form of interest on investments, as opposed to wages.

Note that we are here considering UBI at a given percentage of GDP. If GDP is exploding, the proposal might better be called “universal high income”.

Some SWFs currently operate this way, such as Alaska’s. Others, like Norway’s, are used to provide public services, and so implement something closer to UBS.

The Epoch Brief - June 1, 2026

Epoch AI — Mon, 01 Jun 2026 15:50:56 GMT

In this week’s Epoch Brief:

Since January 2026, open-weight models have lagged the closed frontier by four months, with the gap widening slightly since we last measured it in October 2025.
Hyperscaler capital expenditures have quadrupled since GPT-4’s release, on track with our previous projection.
In the latest Gradient Update, Luke Emberson and Jaime Sevilla estimate trends in global inference capacity and find that token demand appears to be growing much faster than supply.

Research

Open models now trail closed models by four months

In our latest Data Insight, researchers Luke Emberson and Jack Edwards find that the most capable open-weight models have lagged frontier closed models by approximately four months in the Epoch Capabilities Index since January 2026. The average 8-point gap is comparable to the performance difference between GPT-5 and GPT-5.5.

Hyperscaler capital expenditures have quadrupled since GPT-4’s release

Hyperscaler capital expenditures came in on trend in Q1 2026, continuing the trajectory that projects spending of $770 billion this year and over $1 trillion in 2027. In February, we projected $155.1 billion in aggregate for Q1, and actual spending by our measure was $156.1 billion. This is up from $140.6 billion of spending in Q4.

Commentary: Is a compute crunch coming?

In the latest Gradient Update, Luke Emberson and Jaime Sevilla model how many tokens the world could serve today. They estimate supply is growing 3-4× per year. While direct comparisons are difficult, it appears demand for tokens is growing much faster, at ~10× per year. This suggests a compute crunch is nearing, if not already here. Gradient Updates are informal, opinionated analyses that represent the views of individual authors, not Epoch AI as a whole.

Other Updates

Survey

We want to produce the most useful work on AI’s trajectory. To ensure we’re meeting your needs, we’d love your feedback.

→ Take our 5-minute survey.

You can opt in at the end to join our user panel for future compensated studies.

Narrations

You can now listen to long-form content on the Epoch AI website, including reports, Gradient Updates, and topic overviews. Look for the play button.

Careers

We’re hiring across several roles. All positions are fully remote.

Designer to translate complex research into intuitive, engaging, and high-impact designs — primarily UI/UX and data visualization work.
Researchers and Senior Researchers to lead new projects across our expanding teams.
Data Scientist (Contract) to assist with our AI research efforts, including reviewing technical literature, tracking benchmark data, and analyzing AI models, data centers, and companies.

Applications are rolling, so apply soon!

Is a compute crunch coming?

Luke Emberson — Tue, 26 May 2026 23:30:04 GMT

Much has been made about AI-driven capex in the past year. Hyperscalers have been clamoring to construct massive data centers, spending hundreds of billions in the process. The St. Louis Fed estimates that AI-related investment contributed about 1 percentage point — almost 40% of the total — to US real GDP growth in the first three quarters of 2025, exceeding the IT investment contribution at the height of the dot-com boom. Whether the current AI buildout constitutes a bubble depends largely on whether there will be sufficient demand for the computing infrastructure being built.

It’s tough to estimate future demand for tokens, as it depends heavily on hard-to-forecast trends in capabilities and diffusion. However, we have a much more concrete picture of the supply side. In this article, we do our best to answer how many tokens per second the world could produce with the chips we have today.

To do this, we dig into the technical details of inference. We model prefill and decode runtimes, account for two common efficiency techniques (chunked prefill and speculative decoding), and calibrate against data from SemiAnalysis’s InferenceX, a repository of real-world inference experiments. Our results suggest these chips could serve between 500 million and 20 billion output tokens per second from a Kimi K2.6-like model as of Q4 2025, depending on the context length of requests. We also find that global inference capacity is more than tripling each year, as more computing infrastructure is deployed and chips become more efficient.

We compare these supply estimates to several (imperfect) proxies for token demand and its growth trend, including the growth of tokens served across Google platforms, and the intensity of token usage today at the largest tech companies, extrapolated to all software engineers worldwide. These figures suggest that demand at current prices could be between 200 million and 4 billion tokens per second, growing by roughly 10× per year — plausibly outpacing supply growth in the near future, if not already. However, these estimates are highly uncertain. For one, we don’t know the average size of the models behind current demand. Aggregate token figures also obscure an underlying trend in model efficiency, which both lowers the cost of producing tokens at a given quality, and introduces new demand as additional use cases become cost-effective.

If these trends continue, a compute crunch is likely near — particularly for the long-context workloads that drive agentic AI. This will drive up the price of access to frontier capabilities for those willing to pay, while everyday users shift to cheaper, smaller models. It may also mean that AI companies increase their focus on developing more efficient ways to serve models. Because of efficiency gains, these shifts won’t necessarily mean a regression in the capabilities accessible to everyday users — inference and training efficiency are already improving fast enough that the smaller, cheaper models of tomorrow will quickly match today’s frontier.

Introducing our setting

To ground the exercise, we assume we are serving Kimi K2.6 — currently the most capable open model on the Epoch Capabilities Index (ECI)1 — on all of the world’s Nvidia GB200 and GB300 chips (1.9 million and 1.5 million individual GPUs as of Q4 2025, respectively, and representing together roughly 40% of aggregate supply on a FLOP/s basis).

We assume all chips are in NVL72 configurations, where 72 chips are connected per rack. We consider three types of requests: “general” usage (modeled as queries with 8,000 input tokens and 1,000 output tokens), and two longer context settings representing patterns of agentic usage (25,000:1,000 and 128,000:1,000, respectively).2 3 Finally, we assume that users expect at least 35 output tokens per second.4

During our subsequent calculations, we focus on the GB200 systems with 8,000:1,000 query lengths for brevity, applying the same calculations to our GB300s and at each context length to get final figures. It is worth emphasizing that our estimates are contingent on our chosen setting, including the choice of model, numeric format, speculative decoding settings, and many other considerations. We also know little about the architectural details of closed frontier models today, and how they might differ from Kimi K2.6.

Inference settings

35 tokens per second is a rough estimate of the minimum speed needed for inference to feel acceptable. Based on spot checks on OpenRouter data, Kimi K2.6 on the official Moonshot API is around 35 tok/s, GPT-5.5 is around 30-35, Opus 4.7 is around 40-45, and Gemini 3.1 Pro is around 50-60. These figures vary substantially over time.

Hardware specs

Without further ado, let’s dig into the technical details.

What happens during inference?

AI inference can be broken into two stages: prefill and decoding. During prefill, all input tokens are processed in parallel to populate the “KV cache” — a store of keys and values that allows tokens to attend to previous context without recomputing from scratch each time. Once prefill is complete, the decode stage generates output tokens one at a time, with each new token attending to the cached KVs (and appending its own keys and values to the cache). These two stages have quite different computational properties, so we will look at each in turn.

Our goal in each case is to estimate how long it will take to complete the stage, as a function of important factors like batch size. To do this, we use a simplifying assumption: total time is just whichever is longer — the time it takes to do computations (compute), or the time it takes to move data (bandwidth).

This tends to be a reasonable assumption, for two reasons:

In many cases, one time so dramatically dominates the other that the maximum and the sum are approximately equal.
Even if the two times are similar, they can often happen in parallel. For example, as we process the computations for one layer, the weights for the next layer can already begin to load.
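The simplifying assumption can be written down in a few lines. This is a sketch with hypothetical timings; `stage_time` and `bottleneck` are names introduced here for illustration.

```python
def stage_time(compute_s: float, bandwidth_s: float, overlap: bool = True) -> float:
    """Estimate the time for one inference stage.

    With overlap=True, compute and data movement are assumed to run fully in
    parallel, so the stage takes as long as the slower of the two. With
    overlap=False they are assumed to run back to back.
    """
    if compute_s < 0 or bandwidth_s < 0:
        raise ValueError("times must be non-negative")
    if overlap:
        return max(compute_s, bandwidth_s)
    return compute_s + bandwidth_s


def bottleneck(compute_s: float, bandwidth_s: float) -> str:
    """Name the resource that sets the stage time under the max assumption."""
    return "compute" if compute_s >= bandwidth_s else "bandwidth"


# Hypothetical timings in seconds: one side clearly dominates.
compute_s, bandwidth_s = 10e-3, 0.9e-3
overlapped = stage_time(compute_s, bandwidth_s)
serial = stage_time(compute_s, bandwidth_s, overlap=False)
relative_gap = (serial - overlapped) / serial
print(bottleneck(compute_s, bandwidth_s), overlapped, serial, f"{relative_gap:.1%}")
```

When one term dominates like this, the max and the sum differ by under 10%, which is the first reason given above for using the max.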

Prefill

Per our assumptions, the prefill stage consists of passing all input tokens through the model to build the KV cache. Language models process many sequences in parallel; we denote the number of concurrent users in each of our NVL72 systems by B, the batch size.

To calculate the prefill compute time (t_compute), we count the number of operations that need to be performed per forward pass at each precision and divide by our hardware FLOP/s at the corresponding precision. We need to track precision because weight × activation computations are often done in a lower precision compared to attention calculations.

Each activated weight contributes one multiplication operation and one addition operation per token. Attention adds further operations: for each token of context, in each head and each layer, we perform a multiply and add for every dimension of the query–key dot product and the attention-weighted value sum.

For each of these operations-per-token values, we divide by the corresponding hardware FLOP/s at their respective precisions (4-bit for weights, 8-bit for attention), and sum the result to get the number of seconds per token. Then we multiply by the total number of tokens in the prefill (there are B users, each of which has input_len input tokens).

Plugging in our Kimi K2.6 and GB200 NVL72 numbers, we find that computation takes about 1B milliseconds for 8,000 ISL requests, i.e., roughly 1 ms per concurrent user in the batch.
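The recipe above can be sketched in a few lines. Only the 32B active parameter count comes from the article's notes; the layer count, head count, per-head dimensions, and aggregate peak FLOP/s below are assumptions chosen for illustration, as is the use of input_len/2 as the average causal context during prefill. Under those assumptions the result lands close to the 1 ms per user quoted above.

```python
ACTIVE_PARAMS = 32e9   # active parameters per token (stated in the article's notes)
N_LAYERS = 61          # assumption: number of transformer layers
N_HEADS = 64           # assumption: attention heads per layer
D_K = 128              # assumption: per-head key dimension (RoPE part omitted)
D_V = 128              # assumption: per-head value dimension
FP4_FLOPS = 720e15     # assumption: aggregate dense FP4 FLOP/s of one NVL72
FP8_FLOPS = 360e15     # assumption: aggregate dense FP8 FLOP/s of one NVL72


def prefill_compute_ms(batch_size: int, input_len: int) -> float:
    """Prefill compute time in milliseconds for a batch of equal-length requests."""
    # One multiply and one add per activated weight, per token.
    flop_per_token_weights = 2 * ACTIVE_PARAMS
    # With causal attention, a prefill token attends to input_len / 2 tokens on average.
    avg_context = input_len / 2
    # Multiply-add over the query-key dot product (D_K) and the value sum (D_V).
    flop_per_token_attn = 2 * avg_context * N_HEADS * N_LAYERS * (D_K + D_V)
    seconds_per_token = (flop_per_token_weights / FP4_FLOPS
                         + flop_per_token_attn / FP8_FLOPS)
    return seconds_per_token * batch_size * input_len * 1e3


print(prefill_compute_ms(batch_size=1, input_len=8000))  # a bit under 1 ms per user
```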

Bandwidth is simpler to analyze. In order to actually do the calculations we’ve described above, we must move all 1 trillion of the model weights from high-bandwidth memory (HBM) into tensor cores. Since we’re serving the model in FP4 weights (0.5 bytes per weight), that’s 500 GB of weights.5 After completing the calculations, we also have to write the KV cache for each layer back to HBM — after all, this is the whole point of prefill.

Since we’ve assumed the KV cache is stored in FP8 (1 byte per value), this results in 280B megabytes. Adding this to our 500 GB of weights and dividing by our HBM bandwidth, the final time required for data movement in prefill is on the order of 0.9 + 0.0005B milliseconds.
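The arithmetic in this paragraph can be checked directly. The sketch below uses the figures quoted in the text (500 GB of weights, 280 MB of KV cache per user) together with the 576 TB/s aggregate HBM bandwidth that appears in the article's decode formula; the function name is illustrative.

```python
WEIGHT_BYTES = 1e12 * 0.5          # 1 trillion weights at 0.5 bytes each (FP4)
KV_WRITE_BYTES_PER_USER = 280e6    # article's figure for an 8,000-token prefill
HBM_BANDWIDTH = 576e12             # bytes per second, as used in the decode formula


def prefill_bandwidth_ms(batch_size: int) -> float:
    """Time to read all weights once and write each user's KV cache, in ms."""
    total_bytes = WEIGHT_BYTES + KV_WRITE_BYTES_PER_USER * batch_size
    return total_bytes / HBM_BANDWIDTH * 1e3


fixed = prefill_bandwidth_ms(0)                 # weight-read term, about 0.87 ms
per_user = prefill_bandwidth_ms(1) - fixed      # KV-write term, about 0.0005 ms
print(fixed, per_user)
```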

Decode

Compared to prefill, there are four main differences during decoding. First, because tokens are generated sequentially, we have to do a full weight read for each generated token. Second, we must also read in the KV cache for each token. Third, our computations are now only for a single token, rather than a whole set of tokens in parallel. Each of these three factors shifts decoding towards being bandwidth-bound. The fourth factor (KV projection absorption, see below) pushes in the compute-bound direction, but not enough to outweigh the other factors.

Our calculations here will be familiar from prefill. flops_per_token_weights is unchanged from prefill; the attention term is nearly unchanged, except that the average context length is now input_len + output_len/2.

One subtle difference in decoding: FLOP_per_token_attn becomes somewhat more compute-intensive compared to prefill, due to Kimi K2.6’s use of Multi-head Latent Attention (MLA). MLA aims to ease HBM bandwidth and capacity pressure by compressing the size of the KV cache, storing a low-rank latent vector per token, rather than a full KV matrix.

Naively, the model needs to up-project the latent vector stored in memory into a full KV cache representation before it can compute attention values. This is how calculations are typically done in prefill. However, doing this for every decode step ends up being wasteful, since you would be recomputing up-projections multiple times. Instead, the up-projection matrices are mathematically “absorbed” into the query and output projections at load time, so the attention dot products are computed directly in the latent space. The practical effect is that the effective per-head dimension used in the attention compute is the latent dimension (e.g., 512) rather than the nominal per-head dimension (e.g., 128). MLA roughly quadruples the attention FLOP per token compared to a same-shaped Multi-head Attention (MHA) model, but saves dramatically on KV cache bandwidth, which (as we will see) is the main bottleneck during decode.

For simplicity, we’ve omitted RoPE components from our calculations, which do not get absorbed. Because of these, the actual difference between absorbed and unabsorbed attention calculations is more like 3x.

After replacing (d_k + d_v) with d_{kv_latent}, we find t_compute = 0.3B milliseconds summed over the 1,000 decode steps of a batch.

Bandwidth is conceptually similar to prefill, but we must now account for KV cache reads. For each token, the size of the KV cache that you need to read is given by:

Since the context length grows from 8,000 at the start of our decode phase to 8,999 for our last decode token, the average is about 8,500. Then our KV cache reads work out to an average of 265 MiB * B per decode step. We’ll ignore decode KV cache writes, since these only add a single extra token’s KV cache to data movement, compared to our average 8,500 tokens’ KV caches coming from reads.

The final figure for t_bw is then (500 GB + 265 MiB * B) / 576 TB/s * output_len, or 868 + 0.5B ms.
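A short sketch of this decode bandwidth calculation, using only the figures quoted in the text (the function and parameter names are illustrative): each decode step reads all weights once plus every user's KV cache, and the whole thing repeats for each output token.

```python
def decode_bandwidth_ms(batch_size: int,
                        output_len: int = 1000,
                        weight_bytes: float = 500e9,
                        kv_read_bytes_per_user: float = 265 * 2**20,
                        hbm_bandwidth: float = 576e12) -> float:
    """Total data-movement time for the decode phase of one batch, in ms."""
    # One full weight read plus one KV-cache read per user, per decode step.
    bytes_per_step = weight_bytes + kv_read_bytes_per_user * batch_size
    seconds_per_step = bytes_per_step / hbm_bandwidth
    return seconds_per_step * output_len * 1e3


fixed = decode_bandwidth_ms(0)               # weight reads only, about 868 ms
per_user = decode_bandwidth_ms(1) - fixed    # KV reads, a little under 0.5 ms per user
print(fixed, per_user)
```

The fixed term is three orders of magnitude larger than the per-user term, which is why batching more users amortizes the weight reads so effectively.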

If we plot our expressions for compute and bandwidth time in each of prefill and decode, we can see that prefill is dominated by compute time at all batch sizes, while decoding is dominated by bandwidth time:6

Chunked prefill

So far, we’ve found that for 8,000:1,000 requests, prefill is compute-bound and takes 1B ms per batch, while decoding is bandwidth-bound and takes 868 + 0.5B ms per batch.

As it turns out, we can make use of the fact that compute is sitting idle while we are bandwidth-bound, and bandwidth is idle while we are compute-bound. A common trick to make use of these idle resources is known as chunked prefill. The basic premise is that while you are bandwidth-bound during decoding, you can use your spare compute to start work on the next batch’s prefill computations.

The effect of this overlap is that total time to complete a batch ends up being the larger of total compute time and total bandwidth time, across both prefill and decode.7 The intuition here is that if compute and bandwidth can run concurrently, total time is limited by whichever has more total work to do.

Equivalently, t_cycle = max(t_compute_prefill + t_compute_decode, t_bw_prefill + t_bw_decode).

The final throughput is equal to the batch size times the output length, divided by the time it takes to complete a cycle.

This means the throughput grows monotonically, amortizing the weight loading, until you reach a batch size of ~870 concurrent users per GB200 NVL72. From that point on, inference is compute-bound, and there are no more throughput gains from more batching. In fact, larger batch sizes reduce the speed at which you can serve tokens to each user (the ‘interactivity’). This is because the time to complete a full batch increases in proportion to the number of users, but each user gets only their fixed 1,000 tokens of output during that time.
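The shape of this curve can be sketched with the rounded coefficients quoted earlier: about 1.3B ms of total compute (1B for prefill plus 0.3B for decode) and about 869 + 0.5B ms of total data movement. This is a simplified illustration with hypothetical function names. With these rounded inputs the crossover lands near 1,100 users rather than ~870, since the article's own numbers rest on a more detailed treatment of attention, so treat the output as qualitative.

```python
def cycle_time_ms(batch_size: float,
                  compute_ms_per_user: float = 1.3,
                  bw_fixed_ms: float = 869.0,
                  bw_ms_per_user: float = 0.5) -> float:
    """Time to finish one batch when compute and data movement fully overlap."""
    t_compute = compute_ms_per_user * batch_size
    t_bandwidth = bw_fixed_ms + bw_ms_per_user * batch_size
    return max(t_compute, t_bandwidth)


def throughput_tok_per_s(batch_size: float, output_len: int = 1000) -> float:
    """Output tokens per second across the whole batch."""
    return batch_size * output_len / (cycle_time_ms(batch_size) / 1e3)


def interactivity_tok_per_s(batch_size: float, output_len: int = 1000) -> float:
    """Output tokens per second seen by a single user."""
    return output_len / (cycle_time_ms(batch_size) / 1e3)


def crossover_batch(compute_ms_per_user: float = 1.3,
                    bw_fixed_ms: float = 869.0,
                    bw_ms_per_user: float = 0.5) -> float:
    """Batch size at which total compute time equals total bandwidth time."""
    return bw_fixed_ms / (compute_ms_per_user - bw_ms_per_user)


for b in (100, 500, 1000, 2000, 3000):
    print(b, round(throughput_tok_per_s(b)), round(interactivity_tok_per_s(b), 1))
```

Below the crossover, throughput rises with batch size; above it, throughput is flat while per-user speed keeps falling.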

We can also look at the effect of context length. Like many attention mechanisms, MLA’s compute costs grow quadratically with context length, so longer input sequences tend to increase compute costs faster than bandwidth costs. This means that the crossover batch size where throughput becomes compute-bound shrinks as context length increases. For instance, at a context length of 25,000 tokens, the critical B shrinks from 869 to 130.

Speculative decoding

Let’s look at one more trick that gives inference extra juice. Speculative decoding is a technique that uses a small “draft” model to propose candidate tokens several steps ahead,8 which the main model then verifies in a single forward pass. If the main model’s parallel predictions match the draft model’s autoregressive ones, the tokens are accepted.

It’s likely that most major API providers use speculative decoding, since it results in faster inference at minimal cost. This is because the draft model is small enough that its bandwidth and compute overheads are negligible, while each accepted token means one fewer forward pass required by your main model. The one additional cost is that each forward pass on the main model must now predict multiple tokens ahead in parallel, increasing the arithmetic intensity of decoding. For that reason, speculative decoding only helps throughput when decoding is bandwidth-limited. Using state-of-the-art implementations, decoding throughput at a fixed batch size rises by a factor of 3. Since speculative decoding doesn’t help with prefill, the total effect on throughput is closer to 1.6–2×.
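To build intuition for where a gain of this size could come from, here is a sketch of the standard expected-tokens formula for speculative decoding. It assumes each drafted token is accepted independently with the same probability, which is a simplification. The acceptance rate and draft length below are illustrative values, not figures from the article.

```python
def expected_tokens_per_pass(accept_prob: float, draft_len: int) -> float:
    """Expected tokens produced per main-model forward pass.

    Assumes i.i.d. acceptance with probability accept_prob for each of the
    draft_len drafted tokens; the main model always contributes one token
    (a correction on the first rejection, or a bonus token if all are accepted).
    """
    if accept_prob >= 1.0:
        return draft_len + 1.0
    return (1.0 - accept_prob ** (draft_len + 1)) / (1.0 - accept_prob)


# Illustrative only: 80% acceptance with 4 drafted tokens gives about 3.4 tokens per pass.
print(expected_tokens_per_pass(0.8, 4))
```

While decoding stays bandwidth-bound, fewer main-model passes per output token translate fairly directly into higher decode throughput.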

Beyond allowing for more throughput at a fixed batch size, speculative decoding can also affect the optimal batch size. By increasing the arithmetic intensity of decoding, speculative decoding reduces the batch size at which decode starts to become compute-bound (after which point there is no reason to further increase batch size, as mentioned in the previous section). For this reason, speculative decoding tends to decrease the optimal batch size.

Putting everything above together, we estimate that the throughput of a GB200 NVL72 serving Kimi K2.6 with an 8,000:1,000 profile is around 610,000 tokens per second (tok/s), and aggregate throughput across all chips is 36 billion tokens per second.

Calibrating against inference benchmarks

We’ve built up a fairly rich theoretical model, but we’ve made a few simplifying assumptions, ignoring things like communication latencies, inter-chip bandwidth, and software inefficiencies. To account for these simplifications, we introduce three free parameters:

Compute efficiency: even during large matrix multiplications, it is rare to achieve 100% of a GPU’s stated maximum FLOP/s. This only affects compute-bound stages.
Bandwidth efficiency: similarly, bandwidth rarely operates at peak specifications. Because we don’t explicitly model inter-GPU communication, this parameter will also capture time spent moving activations between GPUs.
Per-step latency: captures a bunch of things like communication latency, kernel scheduling, routing imbalances, and more. This has a larger effect at small throughputs, since we model it as a fixed overhead regardless of batch size.9

SemiAnalysis’s InferenceX dashboard provides data across thousands of inference experiments, which we can use to calibrate these parameters. InferenceX data documents parameters like the total number and type of GPUs used, parallelism strategies, batch size, input and output lengths, model precisions, and more. For any given experiment, we can predict time to first token (TTFT) and time per output token (TPOT) using our theoretical model, and compare to real-world performance. We calibrate our inference model using 111 runs of Kimi K2.5 on Nvidia GPUs across a variety of settings.10

Because the three parameters we’ve introduced have different effects depending on batch size and on what is bottlenecked, we can exploit the variation in experiments to fit these as free parameters, minimizing the discrepancy between our model and the real-world data.11 Doing so, we find estimates of 65% compute efficiency, 30% bandwidth efficiency, and 5 ms per-step latency. These are broadly plausible: bandwidth efficiency is lower than expected, but it also includes the unmodeled effects of inter-GPU communications.12

Accounting for these inefficiencies reduces our per-GB200 NVL72 throughput from 640,000 tok/s to 400,000 tok/s.13 Applying the same factors to GB300 systems and then scaling up by the total number of systems produces the following estimates:

The present and future of inference

Our final estimate suggests the world’s Blackwell GPUs could currently deliver a combined 500 million to 20 billion output tokens per second, or between 150,000 and 7 million tokens per month for each person on earth. To put that in perspective, Google, which is likely the most avid token producer today through Google Search summaries, recently claimed that it was serving 1.2 billion tok/s across all its platforms (very likely including both input and output tokens). If we assume a ratio of 8,000:1,000 input to output tokens (many of these requests are presumably short-input search results), they would be serving around 130 million output tokens per second. This suggests that even if we lavishly insist upon serving every Google request with an expensive, trillion-parameter model, there is plenty of inference capacity to serve all the needs of their users.
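The per-person and Google figures follow from simple unit conversions. The sketch below assumes a 30-day month and a world population of about 8.1 billion, neither of which is stated in the article.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600   # assumption: 30-day month
WORLD_POPULATION = 8.1e9             # assumption: roughly 8.1 billion people


def tokens_per_person_per_month(tok_per_s: float) -> float:
    """Spread a global token rate evenly across everyone on Earth."""
    return tok_per_s * SECONDS_PER_MONTH / WORLD_POPULATION


def output_tokens_per_s(total_tok_per_s: float, input_len: int, output_len: int) -> float:
    """Output share of a combined input-plus-output token rate."""
    return total_tok_per_s * output_len / (input_len + output_len)


print(tokens_per_person_per_month(500e6))        # low end, about 160,000
print(tokens_per_person_per_month(20e9))         # high end, about 6.4 million
print(output_tokens_per_s(1.2e9, 8000, 1000))    # Google, about 130 million tok/s
```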

In fact, it would not only be enough to serve Google, but all tokens worldwide. Exponential View estimates the total tokens processed across all providers at 40 quadrillion tokens per quarter, i.e., around 5 billion tokens per second, four times Google’s traffic. There would still be enough compute capacity to serve all these tokens with a Kimi K2.6-like model, at least in our short- and medium-context settings.

And through a combination of infrastructure deployment and more efficient chips, inference capacity is growing over time. The two relevant trends to track are growth in compute capacity and in memory bandwidth, which are growing exponentially at 3.4×/year and 4.1×/year, respectively. In the long term, the slower-growing factor will determine the overall growth of inference capacity — in this case, compute. In the short term, growth can be closer to 4.1×/year while memory bandwidth remains the primary bottleneck; this will particularly affect long context requests like those in software engineering, since those are more deeply memory-bound. In practice, when we model the growth for short context (8,000:1,000 input-output) and long context (128,000:1,000) inference loads, we find the difference to be minimal; the inference capacity of the world at fixed model size and context length is well modeled as growing at 3.4×/year.

How does this compare to the growth of demand for tokens, at a fixed model size and price? Unfortunately, it’s difficult to make a crisp comparison, but the proxies that we have suggest that demand is growing much faster. For instance, both the quantity of tokens processed by Google in the last year, and by all providers according to Exponential View, have been growing by around 10×/year.

From another angle, we can look at token demand from today’s most intensive AI users: software engineers. Recent reports claim that some of Apple’s software engineers are permitted to use up to $300 in tokens per day, which works out to about 5 million output tokens per day with Claude Opus 4.7 API pricing, or 25 million output tokens per day with Kimi K2.6.14 Another point of comparison comes from Meta, whose 85,000 employees used 60 trillion tokens in one month across the organization. That figure included both input and output tokens; assuming a 25,000:1,000 input-to-output token ratio, that would be around 1 million output tokens per employee per day.

There were about 30 million software engineers worldwide as of 2025 (estimates range from 20 million to 50 million), and Stack Overflow’s 2025 survey on AI usage suggested that only around 47% of developers used AI on a daily basis, as of mid-2025. If all SWEs using AI daily were using it as intensely as Meta or Apple, they would demand somewhere between 10 and 350 trillion tokens per day in aggregate, i.e., between 200 million and 4 billion tokens per second. At the longest context sizes of 128,000:1,000, today’s Blackwell chips would struggle to serve all this potential demand for coding agents using models as large as Kimi K2.6. It also seems likely that both the number of developers using AI, and the intensity of their use will continue to grow rapidly.
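The demand range can be reproduced from the figures in the last two paragraphs. This sketch assumes a 30-day month for the Meta figure and treats the per-developer intensities (about 1 million and 25 million output tokens per day) as the low and high ends; the function names are illustrative.

```python
def meta_output_tokens_per_employee_day(total_tokens_month: float = 60e12,
                                        employees: int = 85_000,
                                        input_len: int = 25_000,
                                        output_len: int = 1_000,
                                        days: int = 30) -> float:
    """Output tokens per employee per day implied by Meta's monthly total."""
    per_employee_day = total_tokens_month / employees / days
    return per_employee_day * output_len / (input_len + output_len)


def aggregate_demand_tok_per_s(tokens_per_dev_day: float,
                               developers: float = 30e6,
                               daily_ai_share: float = 0.47) -> float:
    """Global output tokens per second if all daily-AI developers used this much."""
    return tokens_per_dev_day * developers * daily_ai_share / 86_400


print(meta_output_tokens_per_employee_day())   # about 0.9 million
print(aggregate_demand_tok_per_s(1e6))         # low end, roughly 160 million tok/s
print(aggregate_demand_tok_per_s(25e6))        # high end, roughly 4 billion tok/s
```

The low end comes out slightly under the rounded 200 million tok/s quoted above; the high end matches.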

All of the proxies above are imperfect. We don’t know the composition of those tokens, whether growth is dominated by small or large models, or the increase in context lengths. The prices of many providers have changed significantly over the period, affecting demand.15 And this analysis entirely ignores the very rapid pace of improvement in inference efficiency, which increases the demand for models of fixed scale by improving capabilities, and decreases it by displacing them with smaller models.

Regardless, these proxies suggest very fast growth in demand at fixed model size and price, likely faster than supply is expanding. And this is compounded by trends towards longer-context usage, especially due to coding and other agentic use cases.

If the demand for AI is outpacing the capacity to serve large models, the predictable consequence is that the price of tokens from large models will rise. This suggests a “compute crunch” is near, if not already here. Indeed, Anthropic has taken measures like reducing quotas during peak hours and incentivizing off-time usage in efforts to manage demand.

In a “crunch”, will everyday users be priced out of AI? Not necessarily. We’ve seen very fast growth in inference efficiency. If that continues, current use cases could be served by smaller, more efficiently served models — arguably, this is what has enabled the fast growth in tokens served we’ve seen so far. The largest models could be reserved for the most productive applications, such as coding, where access to the latest capabilities justifies a high per-token price.

We thank David Schneider-Joseph for in-depth feedback. We also thank Dwarkesh Patel, Jean-Stanislas Denain, Josh You, David Owen, Phil Trammel, Nick Merrill, Vassil Tashev and William Gildea.

How representative is Kimi K2.6 of the frontier? It has an ECI of 152, close to what GPT-5 achieved last year in August, but 8 points behind GPT-5.5 from April this year. It is priced in the Moonshot API at $4.00 per million output tokens, 7.5x cheaper than GPT-5.5 and similar to GPT-5.4 mini. We guess GPT models are served at a 50% gross margin, while Kimi K2.6 is served at close to cost; this would mean Kimi K2.6 is significantly smaller than GPT-5.5, but likely larger than GPT-5.4 mini.

Why these context lengths in particular? OpenRouter’s State of AI report looks at 100T tokens worth of production data, and finds that the average query has an input length of around 6,000, and an output length of around 800. We bump these up to 8,000:1,000, primarily because this is a common request size for inference performance benchmarks, facilitating comparison. The OpenRouter report as well as Artificial Analysis’s PerfBench each provide empirical evidence that average agentic coding requests have input sequence lengths of around 25,000. Anecdotally, coding sessions can easily reach multiple hundreds of thousands of tokens in context (both Claude Opus 4.7 and Gemini 3.1 Pro support context lengths up to 1M), so we also look at a longer 128,000 token input length.

For simplicity, we model requests as single turns; a more complete accounting would look at the more general setting of multi-turn interactions.

Note that we focus on wall-time throughput per user for simplicity. This differs slightly from “time per output token” (TPOT), which looks at time per token once decoding has begun.

Since the model is a mixture of experts, if the batch size is small we can sometimes economize on the number of weights loaded by only loading the experts that will actually be needed for the computation. In practice, the batch sizes we will consider will be large enough that this doesn’t make a practical difference.

The plotted bandwidth lines include an extra factor we gloss over in our top-level explanation: at small batch sizes, mixture-of-experts models like Kimi K2.6 do not actually need to load every expert into memory. At batch sizes under ~100, there is a period of “expert drafting”, where each extra user added to the batch increases the expected number of experts which tokens must be routed to.

This is a bit of a simplification. At the lowest level, the overlap is achieved by appending a chunk of prefill tokens to a decode request. Unfortunately, this only works cleanly for linear MLP layers – attention calculations each have their own KV caches with different dimensions, and you can’t easily append requests. As a result, attention calculations have to be launched as separate kernels after the mixed prefill/decode MLP kernels, so that the total time to process using chunked prefill ends up being the longer of MLP compute or MLP bandwidth (across both prefill and decode), plus the time it takes to do attention operations (the longer of compute or bandwidth for each of prefill and decode). This has a fairly minimal effect at 8,000 token ISL, but becomes more important at longer context lengths. We incorporate the more detailed calculation in all of our numbers.

These days, it may be more common to use integrated Multi-Token Prediction heads (MTP), instead of separate draft models. MTP heads are additional modules built into the model itself, trained for this purpose. The effect is the same either way.

Introducing a per-step latency tends to increase the optimal batch size, since it introduces a fixed cost which can be amortized across users.

Kimi K2.5 and Kimi K2.6 appear to share the same core architecture: native multimodal MoE models with 1T total parameters, 32B active parameters, 256K context, and MoonViT visual processing.

At a technical level, we minimize a loss of the form: median_TPOT(|log(pred/actual)|) + median_TTFT(|log(pred/actual)|). We optimize the fit for medians, since outliers may be caused by poorly-configured experiments.
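A minimal sketch of a loss of this form, using only the Python standard library. The function names and the toy inputs are illustrative; the article does not describe its actual fitting code.

```python
import math
from statistics import median


def median_abs_log_error(predicted, actual):
    """Median of |log(pred / actual)| over paired runs; robust to a few outliers."""
    return median(abs(math.log(p / a)) for p, a in zip(predicted, actual))


def calibration_loss(tpot_pred, tpot_actual, ttft_pred, ttft_actual):
    """Sum of the median absolute log errors for TPOT and TTFT."""
    return (median_abs_log_error(tpot_pred, tpot_actual)
            + median_abs_log_error(ttft_pred, ttft_actual))


# Toy example: one wildly wrong run (100x off) does not move the median.
print(median_abs_log_error([2.0, 1.0, 100.0], [1.0, 1.0, 1.0]))  # log(2)
```

Working in log space treats a 2× overestimate and a 2× underestimate symmetrically, which suits quantities that span orders of magnitude.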

Our simple calibration model ignores several effects that will be absorbed into our parameters, muddying their interpretation. In particular, each parameter is fitted as a single value, regardless of variation in hardware setup or model; some chips may get better or worse utilization due to variation in kernel optimizations, and per-step latencies probably depend on things like model size and chip interface specs.

While our theoretical model overestimates InferenceX experiments by 5× on average, the bias is smaller at higher throughputs, and the largest throughput experiments for Kimi K2.5/2.6 on InferenceX are substantially smaller than the full NVL72 system that we’re basing our headline numbers on.

Assuming a 25,000:1,000 ISL:OSL ratio and that 80% of input tokens are at cache read prices.

Subscriptions by major providers have mostly stayed at the same nominal price, with more expensive tier options introduced, e.g., the $200 ChatGPT Pro subscription complementing the $20 ChatGPT Plus subscription. However, API prices have decreased significantly even for better models. For example, Claude 3.5 Sonnet was half as expensive as Claude 2, and GPT-4o was 4-6x cheaper than GPT-4 on launch.

The Epoch Brief - May 22, 2026

Epoch AI — Fri, 22 May 2026 22:54:57 GMT

In this week’s Epoch Brief:

Our newest Data Insight finds memory is now the largest and fastest-growing component cost for leading AI chip designers, rising from 52% to 63% of total component spending since early 2024.
In the latest Gradient Update, Josh You argues that top frontier labs could dramatically increase their share of global AI compute usage in the next few years — after which, continued scaling would require an economic transformation.
FrontierMath: Open Problems workshops begin May 26. Applications are still open.


Data Insights: Memory has grown to nearly two-thirds of AI chip component costs.

Researcher Venkat Somala finds that high-bandwidth memory (HBM) has grown from 52% to 63% of total AI chip component spending between Q1 2024 and Q4 2025, faster than any other component. HBM spend across chips designed by Nvidia, AMD, Google, and Amazon rose from roughly $12 billion in 2024 to $32 billion in 2025.

Commentary: Frontier labs don’t use most AI compute (yet)

In our latest Gradient Update, researcher Josh You estimates that while leading labs today use less than half of global AI compute, they could absorb most of the available headroom within a few years. At that point, continued growth would be capped by chip supply. For scaling to continue, the overall compute buildout would need to accelerate. Given that AI capital expenditure is already approaching $1 trillion per year, such an acceleration in compute production would require dramatic economic changes. Gradient Updates are informal, opinionated analyses that represent the views of individual authors, not Epoch AI as a whole.

Other Updates

FM:OP workshops

Our in-person FrontierMath: Open Problems workshops kick off Tuesday in New York City, with subsequent events in London, Berkeley, Boston, Los Angeles, and Toronto through June 9. The goal of the workshops is to identify highly interesting and programmatically verifiable unsolved research math problems. All working mathematicians (grad students, postdocs, and professors) are encouraged to apply while spots remain.

Careers

We’re hiring across several roles. All positions are fully remote.

Designer to translate complex research into intuitive, engaging, and high-impact designs — primarily UI/UX and data visualization work.
Researchers and Senior Researchers to lead new projects across our expanding teams.
Data Scientist (Contract) to assist with our AI research efforts, including reviewing technical literature, tracking benchmark data, and analyzing AI models, data centers, and companies.

Applications are rolling, so apply soon!


Frontier labs don’t use most AI compute (yet)

Josh You — Thu, 21 May 2026 15:42:35 GMT

Gradient Updates shares more opinionated or informal takes on big questions in AI progress. These posts solely represent the views of the authors, and do not necessarily reflect the views of Epoch AI as a whole. The estimates of frontier developer compute discussed below are more tentative than our standard data work.

OpenAI kicked off the AI boom when it launched ChatGPT in 2022. Frontier LLMs soon accrued hundreds of millions of users and billions in revenue, sparking a massive investment boom in AI compute infrastructure, with Nvidia’s AI-related sales spiking more than fourfold in 2023. Global AI computing power has now grown to the equivalent of around 20 million Nvidia H100s, funded by hundreds of billions of dollars in annual capital expenditures.

Yet while OpenAI launched the compute boom, they don’t dominate AI compute usage. I estimate that the compute OpenAI uses for research, training, and inference as of the end of 2025 made up around 10% to 15% of the world’s operational AI compute supply, and this share was even smaller a year ago. Even after adding the other most well-resourced frontier developers — Anthropic, xAI, and the AI labs within Google and Meta — the combined total is probably still under half of the world total.

In other words, there is a lot of AI compute that top frontier labs are not using. Anthropic and OpenAI have seen rapid growth in revenue and funding, enabling them to grow their AI compute faster than the world overall, and this will continue in 2026.

But the top labs may capture a much larger share of global compute within a few years. At that point, compute growth at top labs would be more directly tied to the pace of total compute production, which could slow down the rapid growth we’ve seen in both model capabilities and AI deployment/revenue. For scaling to continue, the overall compute buildout would need to accelerate. Given that AI capital expenditure (capex) is already approaching $1 trillion per year, such an acceleration in compute production would require dramatic economic changes.

Most AI compute probably doesn’t go to frontier AI

More details for each company can be found in the Appendix, and the accompanying research document.

I don’t have a great estimate of the compute used by each of the five most resource-rich frontier developers, but we know enough to estimate their share of world compute.1

OpenAI helpfully disclosed the total electric power capacity of its data centers, which can be converted to ~1.7 million in H100-equivalent (H100e) compute. We also know a lot about xAI’s Colossus data centers. I’m less certain about Anthropic, which had significantly less compute than OpenAI at the end of 2025, though probably still over 1 million H100e. The situation at Google DeepMind and Meta Superintelligence Labs is also unclear, since the compute owned by their parent companies (roughly one-third of the world total) is split across frontier AI, cloud, and other internal uses.2 It’s not clear that the frontier labs at Google and Meta use even half of the total. For more details on each lab, see the Appendix.

But it’s still clear that a lot of AI compute isn’t used by the top labs. My best guess, in terms of the equivalent number of Nvidia H100 GPUs, is that OpenAI, Anthropic, and xAI together probably had fewer than 4 million H100e at the end of 2025.

My best guess is that DeepMind uses slightly under half of Google’s total. Meta also rents external cloud compute (not shown on the graph), starting in late 2025. Estimated world total of 16 million H100e assumes a one-quarter lag between chip sales (est. 20 million) and operations. This is a reference scenario; a longer lag would imply a higher frontier compute share.

Meanwhile, cumulative sold AI compute was roughly 20 million H100e as of the end of 2025. But not all of this was necessarily operational—I don’t know exactly how much, but a rough estimate would look at chip sales at a time lag based on typical installation periods for AI clouds like CoreWeave.3 If there’s a one-quarter lag between delivery and deployment, deployed compute at the end of 2025 would be comparable to sold compute as of Q3 2025, which was ~16 million H100e. If the delay is two quarters, deployed compute goes down to ~12 million H100e.

Under these varying deployment assumptions, Anthropic, OpenAI, and xAI’s total H100e would make up around 20% to 30% of the world total at the end of 2025. If you also count the inference compute that the hyperscalers use to run their own APIs on OpenAI and Anthropic models, this may contribute up to another ~5%.

Meanwhile, we estimate that Google and Meta together own around one-third of the world’s total AI compute. But the compute allocated to Google DeepMind and Meta Superintelligence Labs is substantially less than that, given the large compute demands of Google’s external cloud business and non-frontier uses such as recommender systems. Each lab may use roughly half of their parent companies’ compute as a first-pass guess, for a total of roughly 15% of world compute.

This means that the five most resource-rich AI developers in the world probably had access to less than half of global AI compute at the end of last year.

In other words, frontier AI labs like OpenAI may have kicked off the AI compute buildout, but they are not wholly responsible for it. I won’t attempt a full breakdown of the remainder, but likely candidates include second- and third-tier LLM players and inference of open-weight LLMs. AI/ML models in non-language domains also consume compute: the innovations behind frontier LLMs, such as the transformer architecture, have enabled much better models in audio/visual generation, biology, robotics, and recommender systems, among others.

Will Anthropic and OpenAI absorb the rest of global AI compute?

While much of the world’s AI compute isn’t used by the top labs today, this situation could change significantly in the next few years. In particular, I think Anthropic and OpenAI are the key players to watch.

OpenAI and Anthropic grew compute faster than the industry as a whole, at least in 2025.4 OpenAI tripled its data center power capacity in both 2024 and 2025; after accounting for improved hardware efficiency, their computing power grew around 4× annually.5 Anthropic is probably growing even faster, since they’ve been catching up with OpenAI in revenue and funding. Meanwhile, we estimate that the global stock of AI compute tripled in H100e terms in 2025, and new installed compute grew by 2.7× in 2025 versus 2024.

To be sure, this isn’t enough evidence to draw stable trendlines for frontier compute growth and overall AI compute growth.6 But it looks like OpenAI and Anthropic are currently growing their compute faster than the industry as a whole.

This looks likely to continue. OpenAI internally forecasts that its data center capacity will reach “low double-digit” GW in 2027, up from 1.9 GW in 2025. If that means (say) 12 GW by the end of 2027, that would be 2.5× annual growth, a slowdown from 2023–2025’s 3× growth, but still very fast. OpenAI’s president, Greg Brockman, also testified that the company would spend $50 billion on compute in 2026, triple what it spent in 2025. Finally, some third parties forecast that Anthropic and OpenAI will each have 5–6 GW of capacity by the end of this year, or ~2.5–3× growth in power. Because AI chips improve rapidly in price and energy efficiency, this suggests another year of ~4× growth in H100e compute capacity.
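
As a quick check on the growth rates above, here is a minimal Python sketch that turns a start and end capacity into a constant yearly multiplier. The 12 GW figure is the illustrative reading of “low double-digit” used above. Applying the 5–6 GW third-party forecast to OpenAI’s 1.9 GW baseline is a simplifying assumption, since Anthropic starts from a different base.

```python
def annualized_growth(start: float, end: float, years: float) -> float:
    """Constant yearly multiplier that takes `start` to `end` in `years` years."""
    return (end / start) ** (1.0 / years)

# OpenAI capacity: 1.9 GW at end of 2025 -> illustrative 12 GW at end of 2027.
openai_through_2027 = annualized_growth(1.9, 12.0, 2)

# Third-party forecast of 5-6 GW by end of 2026, applied to the same baseline.
low_2026 = annualized_growth(1.9, 5.0, 1)
high_2026 = annualized_growth(1.9, 6.0, 1)

print(f"Through 2027: {openai_through_2027:.2f}x per year")
print(f"In 2026: {low_2026:.1f}x to {high_2026:.1f}x")
```

Under these inputs the two-year path works out to about 2.5× per year, and the one-year forecast to roughly 2.6–3.2× growth in power.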

The industry as a whole probably can’t match this growth: hyperscalers are growing their capex at a relatively steady 70% per year and guiding similar growth in 2026, suggesting that global compute growth will be similar to 2025’s ~3× growth.7

Industry trends also point to the top labs, especially Anthropic and OpenAI, consolidating the market. Demand for frontier LLMs has grown explosively this year, particularly for coding and agentic tasks. Anthropic has grown at a truly astonishing rate, increasing its annualized revenue run rate from $9 billion to $30 billion in the first quarter of 2026! This is an acceleration from last year’s already-extreme 10× growth rate.8 Less recent revenue data is available for OpenAI, but qualitatively, its coding models and products (e.g., GPT-5.5) have also been well received. Rapid revenue growth, and the corresponding increase in funding, gives Anthropic and OpenAI the means to secure a larger share of global compute.

Indeed, Anthropic and OpenAI are moving aggressively to secure more compute. In April alone, Anthropic signed multi-gigawatt expansions with Amazon (targeting 1 GW of Trainium online in 2026) and Google, and added CoreWeave as a compute partner.9 And in a fascinating plot twist, Anthropic has agreed to rent xAI’s entire ~300,000 H100e Colossus 1 data center along with part of Colossus 2, paying up to $15 billion per year for the privilege.10 OpenAI has also expanded its cloud roster, signing with Amazon in February and Google last year. So these two labs are the most likely culprits behind the tight compute supply and ~30% increase in GPU-hour prices that the industry has seen this year.

Anthropic and OpenAI are not the only players driving LLM growth. Demand for open-weight models like DeepSeek, which are not too far behind in quality, may also be surging. DeepMind looks behind the top two in agentic coding as of writing, but is definitely not out of the race, and Meta is spending big to try to catch up in frontier AI. But if the agent boom leads to LLMs growing their share of the overall AI industry, this will boost the compute share of the two LLM leaders. And Anthropic’s current revenue trajectory is so extreme that it seems likely to lead to compute consolidation.

In 2023, it was not obvious that OpenAI or frontier LLMs in general would end up dominating the entire AI industry. Three years later, it now appears that the players who kicked off the AI compute buildout will end up leading it.

What happens if frontier labs run out of headroom?

Anthropic and OpenAI probably made up 15–20% of the world’s operational AI compute at the end of 2025. While the headroom for them to grow their share looks big, it can be consumed in just a few years if the frontier labs grow substantially faster than the industry as a whole.11

As a naïve illustration, suppose the top two players together have a 20% share today, and grow their AI compute 33% faster than the world as a whole (e.g., they quadruple their computing power every year, as OpenAI did through 2025, while the global installed base “merely” triples annually). In this scenario, they’ll double their share of world compute in 2.5 years, and use ~80% within five. And if compute is more evenly distributed among the top four or five frontier developers, rather than just Anthropic and OpenAI, the headroom will run out faster than that.
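
The arithmetic behind this illustration can be sketched in a few lines of Python. The function name and defaults are mine; the 20% starting share and the 4× versus 3× growth rates are the ones assumed above. The model is deliberately naive: it does not cap the share at 100%.

```python
import math

def share_after(years: float, initial_share: float = 0.20,
                lab_growth: float = 4.0, world_growth: float = 3.0) -> float:
    """Naive share of world compute after `years` (not capped at 100%)."""
    return initial_share * (lab_growth / world_growth) ** years

# Years for the share to double when it grows by 4/3 each year.
doubling_time = math.log(2) / math.log(4.0 / 3.0)

share_5y = share_after(5)
share_7y = share_after(7)  # exceeds 1.0, which shows where the model breaks

print(f"Doubling time: {doubling_time:.1f} years")
print(f"Share after 5 years: {share_5y:.0%}")
print(f"Share after 7 years: {share_7y:.0%}")
```

The share doubles in about 2.4 years and passes 80% by year five. By year seven it exceeds 100%, which is impossible, so one of the two growth rates has to change before then.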

So a key question for the near future of AI is whether Anthropic and OpenAI can continue their recent pace of compute growth, dragging up overall AI compute production along the way, or whether their compute growth slows down because the cloud and semiconductor industry can’t keep up.

At this point, I want to emphasize that total capital expenditures on AI chips and data centers are already very large. Total AI capex will approach $1 trillion annualized in 2026, which would be almost 1% of the gross world product and 3% of US GDP, and capex now consumes most of the operating profits of the hyperscalers that are leading the buildout. There is no guarantee that rapid AI capex growth will continue after 2026.

If the top model developers eat the compute headroom and their compute growth converges with overall AI compute growth, maintaining 4× growth per year would require more than doubling capex annually even after factoring in chip price-performance improvements. From a starting point of perhaps $1 trillion in AI capex in 2027, this sort of growth would only be feasible if AI starts to dramatically accelerate economic growth.

In other words, the AI industry will transition to a new regime in the next few years, with frontier AI slowing its compute growth, or dominating the industry, or both. To be clear, there’s no reason to expect a progress “wall” in 2029 if a frontier compute slowdown happens: flat compute capex can still grow the compute stock for years, AI chips will still improve, and companies can research and train new models with a fixed amount of compute. But the key physical trend driving frontier AI progress, the scaling of compute, is not sustainable unless the world fundamentally changes soon.

Many thanks to Amelia Michael, Ben Cottier, Brendan Halstead, Campbell Hutcheson, Elliot Stewart, Isabel Juniewicz, Konstantin Pilz, Romeo Dean, and Yafah Edelman for helpful feedback.

Appendix: How much compute goes to frontier AI developers?

For more information, see this much longer research preview that sorts through the relevant evidence per company, along with modeling details.

Here, I summarize my estimates of how much AI compute OpenAI, Anthropic, xAI, and the frontier labs at Google and Meta had access to at the end of 2025. These five are probably the most compute-rich developers in the world, though not necessarily the leaders in model quality. I focus on how much compute frontier AI companies rent or use, not how much they own.12 Anthropic and OpenAI predominantly rent their compute from cloud partners like Amazon, Google, and Microsoft; Google and Meta mostly own their AI compute, but much of this is allocated to non-frontier internal uses or, in Google’s case, rented out to external customers.

OpenAI

We know the most about OpenAI’s compute because they helpfully disclosed the total electrical power capacity of the data centers they rent at the end of 2023, 2024, and 2025. OpenAI ended 2025 with 1.9 gigawatts (GW) of capacity, up from 0.6 GW in 2024 and 0.2 GW in 2023.13

Courtesy of OpenAI

This power capacity can be converted to computing power using the specs of flagship Nvidia AI GPUs and servers, and some assumptions about the mix of GPUs OpenAI used over time. With this method, I estimate that OpenAI had the equivalent of around 1.7 million Nvidia H100 GPUs (H100e) by the end of 2025, up from around 400,000 in 2024 and 100,000 in 2023 (H100e is based on peak FLOP-per-second specs; actual compute performance may vary). Another approach would be to use media reports that OpenAI spent $16 billion on cloud compute in 2025 and combine this with plausible prices per GPU-hour that OpenAI may have paid. This leads to a broadly similar compute estimate.
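
One way to see how the spending figure and the H100e estimates relate is to back out the average price per H100e-hour they jointly imply. The sketch below is a sanity check, not the method used in the full research document. It assumes that the fleet grew smoothly and exponentially from 400,000 to 1.7 million H100e over 2025, and that the whole $16 billion bought rented GPU-hours. Both assumptions are mine.

```python
import math

HOURS_PER_YEAR = 365 * 24

def average_fleet(start: float, end: float) -> float:
    """Time-averaged fleet size over a year of smooth exponential growth."""
    return (end - start) / math.log(end / start)

# Figures from the text: H100e at end of 2024 and end of 2025, and 2025 spend.
avg_fleet = average_fleet(400_000, 1_700_000)
h100e_hours = avg_fleet * HOURS_PER_YEAR
implied_price = 16e9 / h100e_hours  # dollars per H100e-hour

print(f"Average 2025 fleet: {avg_fleet:,.0f} H100e")
print(f"Implied price: ${implied_price:.2f} per H100e-hour")
```

Under these assumptions the average fleet is about 900,000 H100e and the implied price is about $2 per H100e-hour. Whether that price is plausible is a separate judgment, and a different growth path within the year would shift the result.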

One potentially significant omission is the inference compute used to run OpenAI models on hyperscaler-hosted products like Microsoft’s Azure API (and now Amazon Bedrock); this compute is presumably excluded from OpenAI’s data center and cloud compute figures. It’s debatable whether this should count as “OpenAI compute”: OpenAI does share in some of the revenue from this compute, but doesn’t have operational control over it. Including this Microsoft inference compute may boost the total OpenAI compute by around 25%, with 50% as an upper bound, judging from OpenAI’s revenue and inference compute allocation (more details here).

Anthropic

Anthropic probably had substantially less compute than OpenAI in 2025, but is catching up over time. An internal OpenAI memo estimated that Anthropic had 1.4 GW in capacity at the end of 2025, around 70% of OpenAI’s 1.9 GW total. In both 2024 and 2025, Anthropic’s cloud compute bill was reportedly just over 40% of OpenAI’s. The 2025 spending ratio is surprisingly low next to the ~70% capacity ratio, but Anthropic’s compute spending may have ramped up towards the end of the year.14

Much of Anthropic’s compute is housed in Amazon’s Project Rainier campus in Indiana, with around 500,000 H100-equivalents (H100e) in Trainium2 chips at the end of 2025. Anthropic also rents significant numbers of Trainium, Nvidia, and TPU chips from both Amazon and Google elsewhere, so its total must be much larger than this one campus. Alongside the OpenAI memo, I think this points to 1 million or more H100e for Anthropic by the end of 2025. As with OpenAI, including inference compute from third-party cloud APIs would boost Anthropic’s compute total, again by roughly 25%.

xAI

xAI, now part of SpaceX, mostly uses the compute in its Colossus 1 and 2 data centers in Memphis, Tennessee. These two facilities added up to around 550,000 H100e at the end of 2025. xAI reportedly also owns or uses smaller data centers in Portland and Georgia, and they used Oracle as a cloud compute partner at least through 2024. xAI’s total compute usage may have been around 600,000 to 700,000 H100-equivalents at the end of last year, likely less than Anthropic and less than half of OpenAI’s estimated 1.7 million H100e.

This year, xAI has decided to rent Colossus 1 to Anthropic, but is also targeting a major expansion of Colossus 2 to around 1.4 million H100e.15

For Google and Meta16, I’ll define their “frontier AI” compute as the compute used by their frontier AI divisions — Google DeepMind (henceforth DeepMind) and Meta Superintelligence Labs (MSL) — as well as related inference.17 Clean comparisons with Anthropic or OpenAI are difficult because frontier AI work may bleed into other AI/ML efforts, such as recommender systems.

Google DeepMind

Google (Alphabet) is the most compute-rich firm in the world. In our research on AI chip owners, we estimate that Google owned around a quarter of the world total at the end of 2025: roughly 5 million H100e, or around 4 million H100e using a one-quarter delay between chip sales and deployment. But much of this does not go to DeepMind: Google says about half of its ML compute goes to Google Cloud.18 For Google, “Cloud” includes enterprise Gemini inference (Vertex API and enterprise subscriptions) in addition to compute for external customers. The other half is split between DeepMind and non-frontier internal uses like recommender systems.

Whether DeepMind compute exceeds half of the Google total depends on whether Cloud-side DeepMind inference compute is greater than non-DeepMind internal compute. My guess is that the latter is bigger, which would mean DeepMind compute is less than half of the total.

This means that despite Google’s large compute lead as a company, it is not clear to me whether DeepMind had more compute than OpenAI at the end of 2025.

Meta

Meta owned roughly 10% of the world’s total AI compute at the end of 2025, more than OpenAI rented in total.19 But translating this to Meta Superintelligence Labs (MSL) compute is tricky.

First, many Meta GPUs support the company’s core business, principally recommender systems for ads and content, rather than frontier AI. The algorithms running your Instagram feed are actually large-scale transformers that are very effective at boosting engagement.20 Third-party estimates put the split between frontier AI and recommenders at roughly 50-50 in mid-2025.

However, Meta pivoted hard to prioritizing frontier AI in late 2025. Mark Zuckerberg hand-delivered soup to top researchers and offered them enormous compensation packages, promising that MSL overall would be equipped with “industry-leading levels of compute”. So it’s plausible that Meta compute has now tilted significantly towards frontier AI.

Second, Meta signed large cloud deals with Google, Oracle, and CoreWeave starting in late 2025; as these deals ramp, Meta will use much more AI compute than it owns. Still, because these cloud deals are relatively new, I think MSL probably had access to significantly less compute than Meta owned in total at the end of 2025. This means MSL probably had less compute than OpenAI, since we estimate that Meta owned only 2.3 million H100e in total before accounting for deployment lags, and this was split between MSL and other uses.

By “AI compute” I essentially mean the number of AI chips they have access to that are operational in data centers, weighted by how powerful the chips are. For these five labs, these are predominantly Nvidia data center GPUs (e.g. Hopper and Blackwell), Google TPUs, and Amazon Trainium.

I also don’t know much about the scale of Microsoft’s or Amazon’s internal frontier AI efforts; it’s possible these are quite compute-rich at this point, given the massive size of the parent firms. In any case, I think the fact that the most salient frontier developers probably add up to less than half of all AI compute is interesting.

CoreWeave disclosed in May 2025 that its hardware goes through a “~3 month installation time” between delivery and monetization. By late 2025, this had decreased to “within weeks”. But Nvidia and other chip sellers may recognize revenue before clouds take delivery, e.g., when selling to server OEMs.

I don’t know enough about Meta Superintelligence/Meta AI or DeepMind’s compute to measure their growth trajectory, but MSL has probably grown its share of Meta’s total AI compute since the middle of last year.

This roughly mirrors the 4-5× growth trend in the total training compute of frontier LLM training runs, though we have limited data for closed frontier models beyond early 2025.

For one, Nvidia’s AI sales spiked discontinuously in 2023 before settling to a ~70% annual growth rate.

A more detailed forecast of total compute is outside of scope, but ~70% growth in AI spending suggests something like 2-3× growth in computing capacity: AI chips have historically become ~30% more cost-effective per year. And compute ~tripled in 2025, supported by 70% growth in hyperscaler capex last year.
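
The rule of thumb in this footnote is a simple product: growth in compute is roughly growth in spending times the yearly gain in chip cost-effectiveness. Here is a minimal sketch using the 70% and 30% figures above. It treats growth in the compute stock as tracking growth in spending, which is a simplification of mine, and the function names are illustrative.

```python
def compute_growth(spending_growth: float, cost_effectiveness_gain: float) -> float:
    """Yearly compute multiplier implied by spending and price-performance."""
    return spending_growth * cost_effectiveness_gain

def spending_growth_needed(target_compute_growth: float,
                           cost_effectiveness_gain: float) -> float:
    """Yearly spending multiplier needed to reach a target compute multiplier."""
    return target_compute_growth / cost_effectiveness_gain

# 70% spending growth with chips ~30% more cost-effective each year.
implied = compute_growth(1.7, 1.3)

# Spending growth needed to sustain 4x yearly compute growth.
needed = spending_growth_needed(4.0, 1.3)

print(f"Implied compute growth: {implied:.2f}x per year")
print(f"Spending growth needed for 4x compute: {needed:.2f}x per year")
```

This gives about 2.2× compute growth from 70% spending growth. Sustaining 4× would need spending to roughly triple each year, in line with the “more than doubling capex” point in the main text.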

We write a lot about big numbers and growth rates here at Epoch AI. But seriously, it is difficult to adequately convey how bananas Anthropic’s revenue curve is. If sustained, tripling revenue in one quarter implies ~80-fold growth per year; this is almost certainly not sustainable, but it’s still crazy that they saw a growth spurt like this from a starting point close to $10 billion/year. And we’ve gotten another data point consistent with this accelerated trend, with Anthropic reportedly approaching $45 billion by May! This is a dramatic acceleration from Anthropic’s 10× annualized growth rate in 2025, which was already the fastest any company of this size has grown in history. See additional commentary from Jesse Richardson.
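
For readers who want the compounding spelled out, here is a small sketch that converts between quarterly and annual growth multipliers. The function names are mine; the inputs are the tripling and the 10× figures from the footnote.

```python
def annualized_from_quarterly(quarterly_multiplier: float) -> float:
    """Annual multiplier if the quarterly multiplier held for four quarters."""
    return quarterly_multiplier ** 4

def quarterly_from_annual(annual_multiplier: float) -> float:
    """Constant quarterly multiplier that compounds to the annual multiplier."""
    return annual_multiplier ** 0.25

tripling_annualized = annualized_from_quarterly(3.0)
quarterly_pace_2025 = quarterly_from_annual(10.0)

print(f"Tripling every quarter: {tripling_annualized:.0f}x per year")
print(f"10x per year: {quarterly_pace_2025:.2f}x per quarter")
```

Tripling each quarter compounds to 81× per year, while last year’s 10× pace corresponds to about 1.8× per quarter.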

Amazon and Google were already Anthropic’s main compute partners, so these contracts are additive. The incremental capacity will ramp over multiple years.

This is more than Anthropic spent on compute last year, and almost as much as OpenAI did, though Anthropic may be renting Colossus at a large price premium. Anthropic will reportedly double its compute spending between Q1 and Q2 of this year, from around $3B per quarter to $6B.

I don’t mean to imply that the remainder of the current and upcoming cloud compute market is “free real estate” for Anthropic and OpenAI. It may be very difficult and expensive for them to continue growing their share of global compute! Much of it is committed in long-term cloud contracts, though the top labs may buy out compute currently used by others, as with Anthropic and Colossus. But while the headroom exists, rapid growth at these companies doesn’t mechanically require an acceleration in AI semiconductor production.

I’ll use possession-like language like “OpenAI’s compute” throughout, but unless I use the word “own”, by default I mean the amount of compute the company or lab rents or uses.

(More methodology details can be found in the full research document; code here). I assume that OpenAI is measuring the IT power of its data centers (the total power drawn by computing equipment like chips, servers, and networking), not total facility power. The tech industry tends to measure data center power in IT power, and using IT power leads to an estimate that is more consistent with my other GPU-hour based estimate. If I assumed OpenAI was quoting total facility power, my compute estimates would be up to ~30% smaller.

When converting from the power and dollar figures to compute, the power-efficiency and cost-efficiency of Anthropic’s compute fleet can differ from OpenAI’s fleet. This may close the gap somewhat: Anthropic uses lots of Amazon Trainium2 chips, which are less energy-efficient than newer chips like Nvidia Blackwells, but are also relatively cheap.

xAI has also formed a major partnership with Cursor where xAI and Cursor will share Colossus compute to co-develop models. This situation is in between a cloud compute deal with Cursor and an acquisition of Cursor (SpaceX acquired a call option on Cursor as part of this deal).

Microsoft and Amazon also develop large foundation models, though these efforts seem less mature than Google’s or Meta’s.

This can include product-level inference of frontier models that is not managed by the lab itself; for example, the DeepMind team may not be directly involved in all Gemini-powered products.

One article reports that “In 2025, Google expected to allocate around half of its computing capacity to Cloud, [Google CFO Anat] Ashkenazi said at a Morgan Stanley conference this spring.” Google will maintain a similar ratio in 2026, according to its Feb 2026 earnings call: “And for 2026, just over half of our ML compute is expected to go towards the Cloud business.”

This doesn’t include Meta’s custom MTIA chips, which were relatively low volume in 2025. I’m also eliding the difference between Nvidia and AMD compute; AMD is about 20% of the estimated total. Meta is reportedly the single largest customer of AMD’s Instinct AI chips, but I’m not sure what Meta uses them for.

In the linked earnings call, Meta attributes much of the 5% year-on-year increase in time spent on Facebook and 30% on Instagram videos to AI recommendations. So the revenue impact of scaling recommender systems at Meta over the past few years could easily add up to tens of billions per year. Recommenders are presumably also a very big deal at Google, the other online advertising giant.

The Epoch Brief - May 15, 2026

Epoch AI — Fri, 15 May 2026 21:04:13 GMT

Welcome to this edition of the Epoch Brief! This week:

We published two new Data Insights: Servers account for 60% of the total cost of owning a one-gigawatt AI data center, and Claude overperforms at software engineering and underperforms at math.
The latest Gradient Update explores why a handful of AI researchers command 10–100× the pay of their peers — and why that gap could grow.

Data Insights

We published two new Data Insights, our digestible snapshots of complex AI trends.

Servers account for 60% of the total cost of ownership of a one-gigawatt AI data center. GovAI research scholar Amelia Michael and Epoch senior researcher Ben Cottier model the full costs of a typical one-gigawatt US AI data center, finding that servers alone account for $5 billion of the $8.5 billion annual total — dwarfing energy and all other operating costs.
Claude overperforms at software engineering and underperforms at math. Senior researcher Alexander Barry finds that, relative to their general Epoch Capabilities Index (ECI) values, Anthropic’s Claude models overperform on software engineering benchmarks and underperform on math. These results come from the Domain-specific ECI Explorer we launched earlier this month, where you can view Math and SWE ECIs as well as design your own variant.

Commentary: The economics of superstar AI researchers

In our latest Gradient Update, researcher Anson Ho argues that enormous pay disparities among AI researchers — where superstar researchers can earn 10–100× what their colleagues make — come down to more than just quality differences. Rather, we should expect to see big differences in pay even if superstars are only a tiny bit better than an average postdoc. Gradient Updates are informal, opinionated analyses that represent the views of individual authors, not Epoch AI as a whole.

Ballpark estimates of AI researcher compensation. Postdoc compensation is estimated using NSF report data. For tenure-track professors, the author anchors on this Taulbee 2024 survey of computer scientists. Compensation for frontier lab researchers is estimated from Levels.fyi for L4-L5 OpenAI researchers, and news reports for superstars.

Other Updates

FM:OP workshops

Our in-person FrontierMath: Open Problems workshops kick off in under two weeks. Events will be held in six cities — New York, London, Berkeley, Boston, Los Angeles, and Toronto — between May 26 and June 9. All working mathematicians (grad students, postdocs, professors) are encouraged to apply while space remains.

Model Evaluations

We are conducting an AI-assisted review of FrontierMath: Tiers 1–4 and have flagged fatal errors in about a third of problems; we believe most of these flags are valid. We will release updated scores on a corrected dataset after completing a thorough human review.

Careers

We’re hiring across several roles. All positions are fully remote.

Designer to translate complex research into intuitive, engaging, and high-impact designs — primarily UI/UX and data visualization work.
Researchers and Senior Researchers to lead new projects across our expanding teams.
Data Scientist (Contract) to assist with our AI research efforts, including reviewing technical literature, tracking benchmark data, and analyzing AI models, data centers, and companies.

Applications are rolling, so apply soon!

The economics of superstar AI researchers

Anson Ho — Wed, 13 May 2026 22:24:26 GMT

AI is one of those fields where the best wind up much better off than the rest. Superstar researchers at frontier labs earn over ten times more than most of their colleagues, who earn measly million-dollar salaries. They might even earn over a hundred times more than your average AI postdoc:

Ballpark estimates of AI researcher compensation. Postdoc compensation is estimated using NSF report data. For tenure-track professors, I anchor on this Taulbee 2024 survey of computer scientists. Compensation for frontier lab researchers is estimated from Levels.fyi for L4-L5 OpenAI researchers, and news reports for superstars.

So why are the differences in pay so large? The naive explanation is that some researchers are just vastly superior. Perhaps the superstar researchers have excellent research taste in designing algorithms and experiments. Or they have a knack for pulling off “yolo runs” — training runs that implement many ambitious changes all at once, relying on deep intuition, whereas most people would need to systematically test the individual changes to make sure they work. Under this framing, superstars are the “10× researchers” that Silicon Valley so deeply reveres, and it’s their quality that makes the difference in pay.1

The problem with this explanation is that it’s very incomplete. In reality, we should expect to see big differences in pay even if superstars were only a tiny bit better than your average postdoc. But why?

The superstar effect

The short answer is this: there’s a well-known economic dynamic which turns small differences in ability into big differences in pay. Here are two illustrative examples:

In the 100-meter sprint, the gold medallist gets much more reward and attention than the silver medallist, despite them being quite literally neck-and-neck for most of the race. Consider the London 2012 Olympics, where Usain Bolt won gold. Most people have no idea who won silver, even though he finished just 0.12 seconds behind. Do you?
Some musicians earn much more than others. Consider Taylor Swift: last year, she earned $60-70 million from Spotify. I don’t doubt that she’s a “10× singer” compared to me. But it’s very debatable whether she’s that much better than other extremely popular singers like Ed Sheeran, Blackpink, Charli XCX, and Lana Del Rey, who instead earned closer to $5-25 million.

Ballpark estimates of 2025 Spotify earnings of several extremely famous artists. These were estimated by multiplying daily Spotify streams by 365 days, and earnings of $0.004 per stream.
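
The estimate in the caption is a single multiplication, so it is easy to invert and see what stream counts the earnings figures imply. This sketch uses the $0.004 per-stream rate from the caption and the $60–70 million range quoted for Taylor Swift. The function names are mine, and no actual stream counts are assumed.

```python
PAYOUT_PER_STREAM = 0.004  # dollars per stream, as in the caption
DAYS_PER_YEAR = 365

def annual_earnings(daily_streams: float) -> float:
    """Ballpark yearly Spotify earnings from average daily streams."""
    return daily_streams * DAYS_PER_YEAR * PAYOUT_PER_STREAM

def implied_daily_streams(annual_earnings_usd: float) -> float:
    """Average daily streams that would produce the given yearly earnings."""
    return annual_earnings_usd / (DAYS_PER_YEAR * PAYOUT_PER_STREAM)

low = implied_daily_streams(60e6)
high = implied_daily_streams(70e6)

print(f"Implied daily streams: {low / 1e6:.0f}M to {high / 1e6:.0f}M")
```

At this payout rate, $60–70 million a year corresponds to roughly 41–48 million streams per day.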

Across these two cases, small differences in ability led to big differences in pay one way or another. Economist Sherwin Rosen called this the “superstar effect,” and it kicks in when two conditions hold.

One person’s work can reach a big market. Usually this means a market with many people, but a few high-paying people or firms work too. For instance, potentially billions of people watched Usain Bolt win the 100-meter sprint. The more people you can reach, the more pronounced the superstar effects. Across the economy, jobs with broad reach, such as actors and musicians, show far bigger wage dispersion than jobs serving one client at a time, such as plumbers, nurses, and truck drivers:2

Data from the Bureau of Labor Statistics across different occupations, showing the ratio of 90th percentile earnings to the median. If we had data on the extremes (e.g. 99th percentile), I’d guess the difference in wage dispersion would be even larger.

Quantity doesn’t easily make up for quality of labor. You can’t have multiple people take the place of a single sprinter, since that would break the rules of the race. And if you like Taylor Swift more than Ed Sheeran, it’s hard to make up for missing a Taylor Swift concert by going to more Ed Sheeran ones.

The first condition means a tiny quality edge captures enormous extra value, making it worth paying a lot for the best — that is, as long as you can’t make up for quality with quantity (the second condition). If you could, you’d just hire a lot more people with lower pay — you wouldn’t need to pay a ton just to hire the cream of the crop.3

Why this applies to AI

AI researchers tick both boxes. There’s a huge market: ChatGPT has almost a billion users, served by the same handful of underlying models, so a single researcher’s contribution could scale to every user simultaneously.

And in AI, researcher quantity doesn’t easily make up for quality: frontier labs are compute-constrained, so they can only run so many experiments to test new software innovations. Two “merely very good” researchers can’t replicate one Noam Brown if what’s needed is deep intuition about which experiments are worth running in the first place. Not to mention the difficulties coordinating researchers if labs are short on time.4

This is how even a 2× researcher could earn far more than the median. Scaled to a billion users, even a small quality edge generates enormous differential value. And if the 2× researcher can add something that multiple 1× researchers can’t, then it’s worth paying a lot to capture this.

Race dynamics amplify the effect

Frontier AI labs are often described as being in a “race”. I’m not sure what exactly they’re racing toward, but it often seems to involve automating huge swathes of human labor, a prize potentially worth tens of trillions of dollars a year, if you win. This incentivizes AI labs to adopt an “all in or nothing” approach, and anything that improves their chances even a little might be worth a lot. Hence Meta’s (alleged) $100 million compensation packages to poach top researchers from OpenAI.

In principle it’s even possible this pushes things well beyond what is socially valuable (however you define that); it’s like how high-frequency traders spend huge sums trying to execute a trade a tiny bit faster, to almost no social benefit.

Reality is complicated, and so is managing an army of AIs

Other forces are at work too. Top researchers carry valuable trade secrets in their heads — the results of expensive experiments competitors haven’t run, and which would cost a fortune to replicate. Many also manage teams, contributing more value than just their raw technical research ability; Noam Brown recently described himself as a “manager at OpenAI.”5 Each of these may contribute to the wage gap, separate from the superstar premium.

Additionally, it’s hard to quantitatively analyze the superstar effect.6 I don’t know of a good way to quantify “researcher quality”. There are some valiant efforts like METR’s RE-bench, but these contain small isolated tasks (think “finetuning GPT-2”) rather than projects with millions of lines of code, fuzzy objective metrics, and lots of coordination between different people.

But despite these complications, I think the superstar effect tells us several useful things. For one, I’ve seen a couple of news articles about Meta’s attempts to poach researchers with exorbitant salaries, in their quest for Personal Superintelligence. But these articles usually miss out on this important superstar effect (though they often do touch on race dynamics).

Another important implication is for how we think about the intelligence explosion. If a 100× pay gap is driven by a 100× researcher quality gap, then simulating a top researcher might speed things up much more than simulating an average researcher.7 But this isn’t the case if much of the pay gap is driven by the superstar dynamic — the gap in researcher quality might actually be much smaller.

Finally, knowing about this effect gives us some hints at what’s to come in the near future. I think that the superstar effect will only become more important moving forward. That’s because lots more people will use AI, and each person will use AI systems much more heavily. And as research increasingly shifts toward managing an army of Claudes, those with deep research intuitions and years of experience as research managers will probably see ever-growing boosts to their productivity, as well as the sizes of their wallets.

So if anything, superstar earnings might become an even bigger deal — $100 million annual compensation quite literally might not be enough.

I’d like to thank Andrei Potlogea, Phil Trammell, Josh You, David Owen, JS Denain, Cheryl Wu, Stefania Guerra, Robert Sandler, Lynette Bye, and many people at Trajectory Labs for their feedback and support. Thanks also to Luis Garicano for inspiring me to write this essay in the first place.

Or even “10,000× researchers”!

The key difference is what economists call “nonrivalry” — if I watch a Tom Hanks movie, it doesn’t stop you from doing so at the same time, and this can scale to any number of consumers. But not all goods and services are like movies — if a plumber is fixing my sink, they can’t fix your sink at the same time.

Strictly speaking, I think there should also be a condition about the distribution of human researcher quality — if you have many superstar researchers, having one more superstar might not add much value, so they might not get paid that much. For example, in Tom Cunningham and Manish Shetty’s (super interesting) apple-picking model of AI R&D, the marginal value of researchers at the same quality drops exponentially. This follows from two assumptions: (1) that researchers sample from the space of ideas independently, and (2) each additional unit of researcher effort finds new ideas at a rate proportional to how many (discoverable) ideas remain. These assumptions are of course debatable — for example, researchers might be able to coordinate to some degree on the kinds of things that they work on, and they also have some amount of diversity. But I think this is an interesting prediction nevertheless.

For illustration, imagine that a frontier lab can do ten large-scale experiments at any one time. A superstar researcher is able to get insights from three of these ten experiments, whereas a “merely very good” researcher can get insights from two out of ten. Having more “merely very good” researchers doesn’t help a great deal because they end up with the same two out of ten insights, but having a superstar researcher helps you reach a higher bar because they have slightly better research taste. The important thing is that this research taste is a dimension of quality that’s hard to parallelize, and this tiny edge can matter a lot, in the same way that an absolute quality difference of 0.1s matters a ton in the 100-meter sprint.
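The same illustration can be written as a few lines of code. This is a hypothetical setup, not a model from the literature: it assumes experiments can be ranked by how subtle their insight is, and that a researcher with "taste" t can extract insights from only the t least subtle ones.

```python
from __future__ import annotations

N_EXPERIMENTS = 10  # the ten large-scale experiments from the example


def insights(taste: int) -> set[int]:
    """Indices of the experiments a researcher with this taste level can learn from."""
    return set(range(min(taste, N_EXPERIMENTS)))


def team_insights(tastes: list[int]) -> set[int]:
    """Union of the insights extracted by everyone on the team."""
    found: set[int] = set()
    for taste in tastes:
        found |= insights(taste)
    return found


very_good_team = [2] * 50              # fifty "merely very good" researchers
with_superstar = very_good_team + [3]  # the same team plus one superstar

print(len(team_insights(very_good_team)))  # 2
print(len(team_insights(with_superstar)))  # 3
```

Fifty very good researchers still end up with two insights between them, while a single superstar lifts the team to three.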

You could also argue that frontier AI labs might be poorly calibrated about the importance of researcher quality.

I’ve also been describing researcher quality as some purely one-dimensional thing, but there may be different kinds of quality — some people are good at coordinating many GPUs in a cluster, some people are good at research engineering, some are good at coming up with new algorithms. In the worst case, I think you could just apply the superstar dynamic along each particular dimension of “researcher quality” that you care about.

Interestingly, you could argue that this supports the observation of big pay gaps even more strongly. If many different skills are important for being a good researcher, and they combine multiplicatively, then you end up with a heavy-tailed (lognormal) pay distribution, just as the superstar effect predicts.
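A quick simulation shows the mechanism. This is a minimal sketch under arbitrary assumptions: twelve independent skills, each drawn uniformly between 0.5 and 1.5, multiplied together. Because the log of a product is a sum of logs, the result is approximately lognormal even though no single skill is heavy-tailed.

```python
import math
import random
import statistics


def simulate_productivity(n_people: int, n_skills: int, rng: random.Random) -> list[float]:
    """Each person's productivity is the product of independent positive skill factors."""
    people = []
    for _ in range(n_people):
        # The uniform(0.5, 1.5) skill range is an illustrative assumption.
        skills = [rng.uniform(0.5, 1.5) for _ in range(n_skills)]
        people.append(math.prod(skills))
    return people


rng = random.Random(0)  # fixed seed so the run is reproducible
productivity = simulate_productivity(n_people=20_000, n_skills=12, rng=rng)

mean = statistics.fmean(productivity)
median = statistics.median(productivity)

k = len(productivity) // 100  # number of people in the top 1%
top_share = sum(sorted(productivity)[-k:]) / sum(productivity)

print(f"mean / median: {mean / median:.2f}")
print(f"share of total held by top 1%: {top_share:.1%}")
```

The mean lands well above the median, and the top 1% of people account for noticeably more than 1% of total output, which is the right-skewed pattern the footnote describes.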

This is an important thing to be aware of, but I also doubt it’s very load-bearing for people who believe in the intelligence explosion.

The Epoch Brief - May 8, 2026

Epoch AI — Sat, 09 May 2026 01:39:23 GMT

Welcome to this edition of the Epoch AI Brief! This week:

In a new report, we estimate that between 290,000 and 1.6 million H100-equivalent chips were smuggled into China through 2025.
We launched the AI Chip Components data explorer, which builds on our AI Chip Owners explorer to track bottlenecks in the AI chip supply chain since 2024.
We published a new Gradient Update on the future of AI benchmarks, paired with an episode of the Epoch After Hours podcast on the same theme.
We published a new Data Insight: Anthropic and OpenAI earn more revenue per employee than any major public tech company.

This is also our first weekly edition of the Epoch Brief — we’re switching from monthly to weekly to keep up with our growing output. To catch up on what we published last month, check out the “In case you missed it” section below.


New Research

Diversion and resale: estimating compute smuggling to China

Senior researcher Isabel Juniewicz estimates that between 290,000 and 1.6 million H100-equivalent chips were smuggled into China through 2025. Our median estimate of 660,000 would represent roughly one-third of China's total AI computing capacity. The analysis uses two types of evidence: diversion from legitimate supply chains and resale within China's grey market.

AI Chip Components data explorer

We launched a new interactive tool to understand the AI chip supply chain, featuring charts that visualize its evolution and key constraints since 2024. We track three key components required for advanced AI chips: advanced-node logic, high-bandwidth memory (HBM), and chip-on-wafer-on-substrate (CoWoS) packaging.
Read researcher Venkat Somala’s accompanying blog post, which dives into key insights from the project, including how high-bandwidth memory has emerged as the dominant cost and the primary supply chain bottleneck.

If you’re looking to get up to speed on AI chips more broadly, we recently published What you need to know about AI chips — a general explainer covering why these chips cost tens of thousands of dollars each and why demand consistently outstrips supply.

Commentary

We published a new Gradient Update and episode of the Epoch After Hours podcast, both tackling the future of AI benchmarks.

In RIP Classic Reasoning Benchmarks. What’s Next?, senior researcher Greg Burnham argues that the classic benchmark recipe — text-only, short time horizons, easy to grade, solvable by human experts — is now obsolete. What’s next? Relax one of the elements.

In the latest episode of Epoch After Hours, researcher Anson Ho asks Greg and MirrorCode creator Tom Adamczewski, “Are AI benchmarks doomed?” Available on YouTube, Spotify, and Apple Podcasts.

Data Insight: Anthropic and OpenAI earn more revenue per employee than major public tech companies

In our latest Data Insight, researcher Luke Emberson estimates that Anthropic and OpenAI generate $9M and $5.5M in revenue per employee, respectively — higher than every public tech company on Forbes’ Global 2000 list.

Other Updates

FM:OP workshops

We’re holding a series of workshops to identify highly interesting and programmatically verifiable unsolved research math problems for our FrontierMath: Open Problems benchmark. Events will be held across seven cities — New York, Princeton, London, Berkeley, Boston, Los Angeles, and Toronto — between May 26 and June 9. All working mathematicians (grad students, postdocs, professors) are encouraged to apply while space remains.

Model Evaluations

GPT-5.5 Pro set a new record on the Epoch Capabilities Index, scoring 159 — the highest any model has achieved on our statistical tool combining multiple benchmarks into a unified scale. GPT-5.5 Pro also set new records on FrontierMath, scoring 52% on Tiers 1-3 (up from 50%) and 40% on Tier 4 (up from 38%). Across runs, it and GPT-5.5 solved two Tier 4 problems that no model had solved before.

We also launched domain-specific capability scores for the ECI, tracking model performance across SWE and Math benchmarks using the same unified scale. Users can also now create their own customized ECI variants.

Careers

We’re hiring across several roles. All positions are fully remote.

Designer to translate complex research into intuitive, engaging, and high-impact designs — primarily UI/UX and data visualization work.
Researchers and Senior Researchers to lead new projects across our expanding teams.
Data Scientist (Contract) to assist with our AI research efforts, including reviewing technical literature, tracking benchmark data, and analyzing AI models, data centers, and companies.

Applications are rolling, so apply soon!

RIP Classic Reasoning Benchmarks. What’s Next?

Greg Burnham — Tue, 05 May 2026 20:24:28 GMT

Originally posted on Epoch AI.

There’s a familiar recipe for reasoning benchmarks: tasks are text-only, output is easy to grade, and expert humans can do the tasks in several hours. Unfortunately, this recipe is now obsolete. As an emblematic case, consider GPQA: a benchmark consisting of graduate-level science questions. It had remarkable staying power but by now it’s clearly saturated.

The same is true for many classical reasoning benchmarks, whether in science, math, or coding. What’s next? I think the old recipe points to a new recipe. Just relax one of the elements: text only, easy to grade, short time horizon, and expert human superiority. I see each of these categories as extremely fruitful to pursue, and far from saturated. The tradeoff is just that it takes more time and money to create such benchmarks.

Keep the classic format, but make it multimodal

It’s hard to say precisely, but to my eyes AI visual and spatial reasoning lags behind text-only reasoning. Still growing rapidly, but from a lower base. At any rate, it still seems comparatively easy to create meaningful multimodal reasoning benchmarks.

To give one example, at a recent Epoch team offsite, we ordered a piece of IKEA furniture, laid all the pieces out, labeled them, and then gave AI systems the instruction manual and asked them to identify the step in the instructions in which each piece was first used. Top models scored around 40%.1

Which step is the part labeled 6 first used in? AI systems can’t say.

This is still “classic” in that it’s easy to grade and humans beat AI over relatively short time horizons. I also consider it to be plausibly testing a capability relevant to tasks of broader import. For multimodal tasks, at least, one person having fun over the course of a week can still make interesting benchmarks. Other examples of this: SpatialBench and IRGB. Arguably ARC-AGI and other videogame benchmarks fit this bill as well.
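To show what "easy to grade" means for a task like the furniture one, here is a minimal grading sketch. The function name, data layout, and sample answers are all hypothetical; this is not the script used in the experiment, and the parts and steps below are made up.

```python
from __future__ import annotations


def grade_first_use(predicted: dict[str, int], answer_key: dict[str, int]) -> float:
    """Fraction of labeled parts whose first-use step was identified exactly."""
    if not answer_key:
        raise ValueError("answer key is empty")
    correct = sum(1 for part, step in answer_key.items() if predicted.get(part) == step)
    return correct / len(answer_key)


# Made-up data for illustration only.
answer_key = {"1": 1, "2": 1, "3": 2, "4": 4, "5": 4}
model_answers = {"1": 1, "2": 1, "3": 2, "4": 5}  # part "5" left unanswered

print(grade_first_use(model_answers, answer_key))  # 0.6
```

Each answer is a single integer checked by exact match, so grading needs no human judgment once the answer key exists.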

Keep the classic format, but push the time horizons

The first generation of “long context” benchmarks would fill a model’s context window with data, often at least partly synthetically generated, and ask the model to process that data, often without tools. This measured the model’s “native” ability to keep track of lots of threads at once. Some such benchmarks still show some headroom. But, on the one such benchmark for which Anthropic reported scores, Claude Mythos jumped to 80%,2 where prior scores had been less than 40%. Perhaps this specific kind of long-context benchmark is nearing the end of its utility.

Even so, I suspect long, reasoning-oriented benchmarks still have room to run. One direction I’m particularly interested in is games. While many games have significant multimodal aspects, I think some — often card-based, like Magic: the Gathering — have relatively little, while still demanding a high level of both strategic and tactical reasoning. A single game instance isn’t necessarily long, but the more relevant unit of play may be a sequential run of games. After all, that’s the context in which humans get good at a game. Can AI do the same? I think existing evidence on this is negative, but it’s surprisingly patchy. Benchmarking could improve our picture here.

I also think very long-running software engineering projects may not be tapped out. My colleagues have written recently about how some software engineering tasks which may take humans weeks seem to be within reach for AI systems. These are tasks especially well-suited for runtime hill-climbing, since they are very precisely specified. Still, I suspect that the harder end of this difficulty scale may remain unsaturated. Can an AI system implement a C compiler that works as well and as efficiently as commonly-used human implementations? Maybe so, but it’s worth testing.

Bite the bullet on hard-to-grade outputs

This is hardly an unpopular opinion in benchmarking: of course tasks of real-world relevance don’t often have short-form right-or-wrong answers. This remains true even of tasks that are heavy on reasoning. For instance, my colleague Anson recently wrote about trying to get AI to do his job, and two of the tasks he chose were replicating an interactive web interface for a complex economic model, and analyzing a given dataset and writing a publication-worthy article about the results.

If everyone knows this but the field has been slow to move in this direction, maybe all we need is a bit of coordination to make a more regular and rigorous practice of such evaluations. I was thus encouraged to see the formation of CRUX, a collaboration of researchers engaging in evaluations “in real-world environments where success cannot be neatly specified or automatically graded”. They also have a nice aggregation of previous one-off experiments like Anson’s.

This territory isn’t as new as it may seem. An early instance of just this sort of evaluation took place at the 2025 International Math Olympiad (IMO). AI solutions were graded by the official judges, using the same criteria as were applied to human solutions. The AI benchmarking thus piggy-backed off the pre-existing human practice of grading IMO solutions, which had been developed and refined over decades. This went so smoothly that I saw relatively little commentary on how unusual it was for a benchmark.

I think this points to a good way to evaluate increasingly sophisticated AI output: plug into pre-existing practices where humans already judge each other’s work. Can an AI system win a best paper award at an ML conference? Publish a law review article? Win a short-story writing contest? Collaboration with professions where judging written output is central — science, law, journalism, entertainment — could lead to many benchmarks that are messier but in many ways more informative than what we have today.

Target well above human expert ability

Benchmarks in scientific and technical domains often target top human expert performance. AI systems haven’t saturated all such benchmarks yet, but progress has been rapid. Where do we go when that finally happens? In several cases, we can just pose tasks that humans don’t know how to do; we don’t even need to give up easy grading.

This is precisely what we’ve done with FrontierMath: Open Problems: find open math problems which humans have tried and failed to solve, but where solutions can be evaluated programmatically. These are hardly trivial to devise, but once devised they can be run at scale. These might not lend themselves to simple accuracy metrics, but, on the plus side, each benchmark task becomes a valuable case study in its own right.

A broad category of optimization problems fits this bill as well: there may be a human baseline, but we don’t need to consider the benchmark saturated when that baseline is reached. Some great work already along these lines includes PostTrainBench, which asks how effectively AI systems can post-train ML models, ALE-bench, which asks how well AI systems can program constraint-solving engines, and AlgoTune, which asks how much AI systems can speed up general-purpose numerical programs.

Results on PostTrainBench. There’s no reason to stop at 51.1%!

In some sense, such benchmarks will always be relevant. Even if it’s all AI doing the work, there will always be questions of resource optimization.

What about common sense?

For all this talk of advanced reasoning, GPT-5.5 Pro still regularly gets my favorite GSM8K question wrong.

This is something of a trick question, though I suspect it wasn’t intended to be. Humans get it wrong too. This isn’t a completely isolated example either: for instance, some models suggest walking (not driving) to a nearby carwash to get your car cleaned. What do we make of this?

These instances are somewhat analogous to human cognitive biases, often taking the form of getting a question wrong mostly due to not thinking too carefully about it. My personal sense is that they are getting harder to find. One systematic attempt at measuring something like this, SimpleBench, has seen models climb close to the average human baseline over the past year and a half.

If such benchmarks don’t saturate entirely, they will serve a role of deflating the strongest claims of AI infallibility. That said, I suspect their practical relevance is limited: if such “gotchas” do arise in the course of larger, real-world tasks, I suspect AI systems will be more adept at catching their own mistakes, even if not on the first pass.

Reasoning benchmarks aren’t dead yet

As AI systems get more capable, there’s an increasing temptation to benchmark them only on end-to-end tasks with practical, real-world consequences. I think it’s valuable to do so. I also think it’s obvious that AI systems don’t “just work” in all such cases. That leaves us with the question of why exactly they fail. One possible failure mode is not being able to reason through the problems they encounter along the way. Given that, I think reasoning benchmarks still have a role to play.

But the benchmarks we build still have to rise to the challenge of finding places where models still struggle. I’ve given my recipe. Give up on at least one of: text only, short time horizon, easy to grade, and expert human superiority. This makes benchmarking more challenging, but also more interesting than ever.

We don’t have a strong human baseline, but our light experiments suggest that humans can do this in well under half an hour.

This benchmark is GraphWalks. See Table 6.3.A of the system card for Mythos’s score.

Are AI benchmarks doomed?

Anson Ho — Fri, 01 May 2026 22:23:35 GMT

Greg Burnham leads Epoch’s benchmarking team. Tom Adamczewski is a senior research engineer who develops new benchmarks, including MirrorCode.

Topics we cover: why benchmark saturation isn’t as alarming as it seems, how AI can speed up benchmark development, the benchmark-reality gap, whether an AGI benchmark can exist, and the role of human evaluation in future benchmarks.

We also discuss MirrorCode, a benchmark (co-developed by Epoch and METR) of long-horizon coding tasks, and FrontierMath: Open Problems, Epoch’s benchmark of real unsolved math research problems.



Transcript

This is an edited transcript of the “Epoch After Hours” podcast.

Are AI benchmarks doomed? [00:00:36]

Anson

So AI benchmarks seem to have a really big problem right now. If you look at all of the AI benchmarks, it seems like most of them are saturating really, really quickly. And by really quickly, I mean months for most of them. If they’re really, really good, maybe they’ll last for a year or two. But then for the most part, it seems like it’s very hard to build a benchmark that can last quite a long period of time.

So there’s a looming question that revolves around all of this: are AI benchmarks doomed? To start off, I’d like to get a nice little vibe check of where you guys stand on whether AI benchmarks are doomed. So what do you guys think?

Tom

So I think benchmarks will continue to be important as long as people want to have some kind of qualitative description of what an AI system can do, or want to quickly compare when a new model comes out — which one is better. And so it seems like we’re sort of stuck with benchmarks, regardless of the many flaws that they might have, just because there is this obvious demand for information, this gap that they fill.

We might be in a situation where benchmarks are less useful than they used to be. Like they explain less of all that we might want to know about AI systems’ performance. But there’s still additional information, and so people are going to continue to release new benchmarks and look at benchmark results.

Greg

I’m a bit more of an optimist in this perspective. I’d almost say we’re living through a golden age of benchmarking, where it used to be that I think models were not that capable and there was only so much for benchmarks to say. Now models are much more capable, but this just means there’s much more for benchmarks to potentially tell us.

So maybe, as Tom was saying, the percentage of questions you might want benchmarks to answer that benchmarks actually answer might be shrinking. But the amount of information we’re gleaning from benchmarks, I think in some sort of absolute terms, is growing.

And I think this is very exciting. I think benchmarks will survive and be important and even potentially central, so long as there are things we are curious whether AI systems can do — and that seems like there’s still plenty of questions about what AI systems can do. I think there are some benchmarks that might even survive — I mean this loosely — survive the singularity.

The costs and benefits of benchmark development [00:03:13]

Anson

One thing I’d want to understand better is why some people are so much more pessimistic. I imagine some researchers in AI safety would probably say: if you look at benchmarks like FrontierMath, the researchers put quite a lot of effort into trying to make these benchmarks last for quite a bit of time. And it seems like maybe within one or two years — which is already relatively good for some of these benchmarks — they’re getting to the point of saturation. And now we’re having to spend millions of dollars to build these benchmarks. Can we really keep doing this? If it costs millions of dollars and the gains are maybe not that high, maybe it’s just hard.

I’m curious what you guys think about that.

Tom

I think what you said about the gains not being high — that’s really the key. Yes, I agree that as the tasks that AI can do get more and more impressive, creating benchmarks for those tasks becomes more and more costly.

And so then it just depends on whether the benefit side is high enough. And I sort of suspect that this will be the case, because while AI gets more powerful, it’s just more important to know what it can and can’t do, or which AI systems are better than others.

Just in the same way that everything is increasing — like AI companies’ compute spend — similarly, the cost that benchmark developers spend on developing new benchmarks is also increasing a lot. I think this is sort of fine as long as people care enough about the answers we get from these benchmarks.

Yeah, I may be caricaturing your pessimism slightly, but I feel like it can sometimes come from, “Oh, well, this benchmark has all these flaws — was it really worth all the effort?” Well, think about how unhappy you’d be if you had nothing at all.

If literally all benchmarks were saturated, that does seem like we’d be in a much worse position. And if we were in that world, the premium on being the one team in the world that has an unsaturated benchmark would be huge. So I do think that basically costs and benefits might keep pace with one another.

Greg

I think it’s not crazy to measure the benchmarking budget as a percent of revenue of AI companies. I also just wouldn’t underestimate human cleverness. I do think benchmarking used to be kind of super easy — too easy to make a benchmark that started at zero. And now you have to be more clever to find a benchmark that is unsaturated, and sometimes you’ll be wrong about what is or is not saturated. But that’s a fine trade-off. We should be generally happy to have opportunities to exercise our cleverness and try.

And I think there’s some historical examples. I think part of where this pessimism might be coming from is we have just seen this big ability spike — a qualitative abilities spike — with coding agents starting to just work. This means that some tasks that we had put in benchmarks thinking they were hard are doable now.

And I would just point out, this has happened at least twice before, I think, roughly. One, where, call it around GPT-4, models just could do all these easier question-answering or language-manipulation tasks. And so some benchmarks were saturated and people did have to be clever to come up with harder benchmarks. Fine.

And then there were also reasoning models that came out and suddenly some math benchmarks were saturated. I think if we feel a little shell-shocked right now, that’s understandable, but if you just look around at the world, there’s plenty of things systems can’t do. And if you have to spend some more money on them, fine — that is the case.

You can have benchmarks that survive these paradigms. I think GPQA is a really good example of this. It was made at the end of 2023, before reasoning models in their current form were even on the horizon. And I would argue was only really saturated in winter of 2025, two years later. And I think that’s impressive. Reasoning models definitely did better on it — there’s a big spike around o1 — but it’s not like it was totally saturated.

It was a high-effort benchmark, though. You had to get these experts, you had many experts reviewing each question and testing out each other’s questions so you could tell that the chemistry questions are really hard for the physicists. It was more effort, it was more expensive. People paid it, it was worth it. And while some benchmarks that were supposed to be hard — like in terms of math — were completely saturated when o1 came out, GPQA wasn’t. So we’ll have some wins and losses in this metric.

The last thing I’ll say is: a saturated benchmark is not a problem. Even having a benchmark that is saturated upon release — a hundred percent — because you started developing it four months ago and AI progress happened to just hit the nail right on the head — that’s very useful to know, because it dramatically reduces your uncertainty about what this qualitative feel, this vibe, of AI progress actually means in terms of numbers. Even this is relevant. So while it’s a little disappointing if your benchmark is saturated on release, I still think it can be quite valuable.

And maybe there’s some lessons we can learn about how to try to build benchmarks like this, and we can come back to that. But I just feel like this pessimism is over-updating.

Anson

I guess one kind of counter-argument that comes to mind is that cost is one thing that maybe we’re willing to pay a lot more for because we at Epoch believe that it’s very valuable to have these kinds of benchmarks. But then what about the time it takes — the cost in terms of time — for trying to build these benchmarks?

I don’t want to underestimate human cleverness, but I also don’t want to underestimate AI cleverness. As AIs are getting really smart, they’re going to just crack all of these benchmarks so soon, even if we spend six months filling a benchmark. By the time we’re done it’s not going to be great — because it’s going to be saturated.

Tom

I mean, I do think this is an argument for developing smaller, bite-sized benchmarks faster. In some ways, put something out as a trial balloon that you think is toward the harder end of the distribution you’d want your benchmark to cover, and see what happens as you keep filling out the benchmark. And if that balloon gets popped, then you say, “I need to work on a different project,” or whatever. But again, that served its purpose.

I do think there is some lead time risk to any benchmark where the fundamental infrastructure will take you six months before you could even have a sample. I’m not so worried about that, because I think any benchmark should kind of have a manual experiment. You have some software task you sort of want to make a benchmark out of — you just ask Claude Code to do it and see how far it gets, and you get some sense of that. I do think benchmarks starting out with that is good and something more like “agile” development of benchmarks would be a good lesson to learn.

But, yeah, I think it’s worth updating, just not updating all the way to “benchmarks are impractical now.” Because, again, to be grounded — as long as there’s a task that you, today, might practically want an AI system to do, and you put in like half a day’s work eliciting it and it doesn’t do it — there’s absolutely, today, a benchmark there.

Greg

I like the agile development point. I feel like that’s something that, maybe, historically, because benchmarks have come out of academia, it’s been very much — you don’t share anything with the world, you work for months until you have this super polished paper and then you release it. Maybe moving to something a little more gradual, a little more like open-source software development where there are continual improvements being made — maybe that’s promising.

Two responses to your calendar time, lead time objection. One is just: we need to look at what’s parallelizable and what’s not in the benchmark development process. For parallelizable things, you can hope you can just throw more resources at them and make it faster that way.

And then there will be some non-parallelizable portion. For that part, if the worry is that we as humans are just too slow and AI progress is very fast — well, AI systems are helping with everything, including benchmark development. This is something we see already in our own benchmark development work. For most technical work that I do, LLMs are a pretty essential tool and they speed me up a lot.
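The split between parallelizable and non-parallelizable work can be made concrete with an Amdahl's-law style calculation. The framing and all the numbers below are illustrative assumptions, not figures from the conversation: a hypothetical project with 2 months of inherently sequential work and 4 months of divisible work, plus an optional AI speedup applied to every step.

```python
def calendar_months(serial_months: float, parallel_months: float,
                    n_workers: int, ai_speedup: float = 1.0) -> float:
    """Calendar time when divisible work is shared among workers and AI speeds up every step."""
    if n_workers < 1 or ai_speedup <= 0:
        raise ValueError("need at least one worker and a positive speedup")
    return (serial_months + parallel_months / n_workers) / ai_speedup


# Hypothetical benchmark project (all numbers are made up).
print(calendar_months(2, 4, n_workers=1))                  # 6.0
print(calendar_months(2, 4, n_workers=8))                  # 2.5
print(calendar_months(2, 4, n_workers=8, ai_speedup=2.0))  # 1.25
```

Adding contributors only shrinks the divisible part, so the sequential 2 months sets a floor. Only a speedup that also applies to the sequential steps, such as AI assistance, pushes below it.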

MirrorCode and scalable benchmarks [00:11:48]

Anson

So to make sure I’m understanding: the AIs are helping you build the benchmarks faster. And the other thing is, to what extent can we break this down to multiple chunks where we can just throw more resources at the problem.

I kind of want to dig into the second part a bit more, because you guys are the ones building the benchmarks on the ground. And I know Tom, you recently have been working on a benchmark, and my understanding is it’s meant to be like METR’s time horizons, like task set 2.0, or something. Could you say more about that?

Tom

So maybe I’ll not answer it directly, but take a step back first to say: with this question of how do we make unsaturated benchmarks, one angle I really like — and that I’ve liked for a while — is: are there tasks we can find, like categories of tasks where you can just take the same setup and crank up the difficulty as much as you want — ideally to infinity, but maybe it’s also sufficient if you can just crank it up a lot.

So, I like this idea, and I’ve been working on a benchmark that is sort of my instantiation of this idea for the software engineering domain called MirrorCode. And it’s called MirrorCode because the AI has to re-implement some existing program and mirror its functionality perfectly.

Yeah, maybe a little bit on the setup. These are all command-line programs that have a command-line interface. So that can be all the way from simple command-line utilities, like dirname or ls, up to huge programs that just happen to have a command-line interface, such as interpreters for programming languages, type checkers, et cetera.

We give the AI system the documentation for the original program — we don’t give it the source code — and we give it access to a black-box reference implementation, so a binary of the original program that it can send inputs to and view the outputs. If things are underspecified in the documentation, or if it wants to see the exact output format or test new hypotheses, it can do that as much as it wants against this reference binary.
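The black-box setup described here is close to what is usually called differential testing: run the reference and the reimplementation on the same inputs and compare what comes out. Below is a minimal sketch of that idea. It is not MirrorCode's actual harness, and the helper names `run` and `mismatches` are hypothetical. So that it runs anywhere, both "programs" are stand-in Python one-liners that loosely imitate dirname.

```python
from __future__ import annotations

import subprocess
import sys


def run(cmd: list[str], stdin_text: str = "") -> tuple[int, str]:
    """Run a command and return (exit code, stdout)."""
    result = subprocess.run(cmd, input=stdin_text, capture_output=True, text=True, timeout=10)
    return result.returncode, result.stdout


def mismatches(reference: list[str], candidate: list[str],
               cases: list[list[str]]) -> list[list[str]]:
    """Return the argument lists on which the candidate behaves differently from the reference."""
    return [args for args in cases if run(reference + args) != run(candidate + args)]


# Stand-in "reference binary": prints the parent directory, or "." if there is none.
reference = [sys.executable, "-c",
             "import sys, os; print(os.path.dirname(sys.argv[1]) or '.')"]
# Stand-in "reimplementation" with a bug: it mishandles paths that contain no slash.
candidate = [sys.executable, "-c",
             "import sys; print(sys.argv[1].rsplit('/', 1)[0])"]

cases = [["/usr/lib/python"], ["a/b/c"], ["file.txt"]]
print(mismatches(reference, candidate, cases))  # [['file.txt']]
```

The candidate agrees with the reference on the first two inputs and fails on the third, which is the kind of underspecified edge case that querying the reference can surface.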

The hope with this is that you can really scale across several orders of magnitude in size of the original program, and hopefully also the amount of effort for the AI or humans to complete the reimplementation task. Programs that are really trivial and were like 10 or a hundred lines in the original, up to 10 million or tens of millions of lines of code like the Linux kernel, or really complicated compiler chains. I think there’s just a lot of room here for scaling up to the largest software projects ever in the history of software development.

Anson

And how far did you scale it in fact?

Tom

So we’re still figuring out exactly what we’ll release. What we definitely have so far are a couple of programs that are in the roughly hundred-thousand-lines-of-code range, without counting dependencies in the original implementation. An example of that is Pkl, which is this new programming language that came out in 2024 from Apple.

In our experiments so far, the best AI systems — with something like hundreds of millions or a billion tokens over the course of the run — are not yet able to complete these very hardest tasks, but they’re able to do pretty reliably everything up to that level of difficulty —

As of recording this podcast, I feel very uncertain about whether, with more tokens, they would just be able to do everything. I would say it’s currently my best guess that yes, they would be able to do everything up to the hundred-thousand-line-or-so size.

With this benchmark, I did originally envision it as, “Okay, this is going to be a really hard benchmark for AI systems.” And we created a lot of tasks in the early phase of the project that are now saturated. It certainly shows that even when you think you might be setting the bar high enough accounting for how much progress AI will make, you might still be underestimating it.

For very precisely specified tasks, the AI really knows absolutely everything the program has to do — it has to output exactly this string on this kind of input, et cetera. AI systems can just keep going at it for many, many times the size of their context window with compaction. And because the task is sufficiently precisely specified, they sort of know where they’re at in terms of their progress, and they do even these very impressive ones that we would guess represent several weeks of human work — that still has a bit of room to go, like scaling to the biggest human software projects ever. To sort of help us answer: if we tell an AI system very precisely what to do, can it do anything in software engineering?

Anson

Let me make sure I’m contextualizing this correctly. This is supposed to be a bunch of, you were saying, multiple-week-long tasks, like hundreds of thousands of lines of code. And these are things that we were thinking were going to be really, really hard for the AI. But then it seems like before we’ve even released the benchmark, AIs are already able to do a huge chunk of these — as long as they’re using — what was the token budget?

Tom

So just to be clear, these time estimates for how long it would take a human to do the task are guesses. We don’t have data on this. The multiple weeks is sort of my personal guess. The hardest task that AI can definitely do in MirrorCode, which is implementing the CommonMark spec — which is a formalization of Markdown that tells you like exactly, for any markdown, how to convert to HTML — the reference implementation for that is about 16,000 lines of C. My personal guess, which is extremely speculative, is that this would take an experienced software engineer who is completely unassisted by AI multiple weeks to reimplement.

Anson

I see. But then it’s still the case that if you were to invest in building the month-long versions of this, or maybe the year-long ones, which are the ideal things to do in the future — you think that there’s still plenty of room to keep scaling this up?

Tom

Well, so I don’t want to make strong predictions about whether AI will be able to do it or not. But I think — that’s almost, maybe I’m a little bit dodging the main question you want to ask with this podcast — but that’s sort of not really my main, like I think this is interesting just because it lets us describe AI capabilities on precisely specified software tasks across these orders of magnitude of difficulty. And it’s great to know whether AI can do that or not.

I care a bit less about whether it will be saturated by a certain date. And I agree it’s relevant because people want to be able to keep tracking AI progress. I don’t feel very confident about making predictions for that.

What I can say is that Nicholas Carlini stopped the Anthropic C compiler experiment based on — my impression is — pretty much his gut feeling of, “Ah, it’s gotten up to here, it seems to now be sort of stagnating, to be introducing bugs when it tries to introduce optimizations,” and he decided to stop it there. I don’t really know what his criteria were, or maybe he just wanted to spend up to $20,000 of compute and didn’t want to go further.

So it’s clearly the case that if Carlini had wanted to say, “Okay, no, no, the task is to compile all these projects and have the resulting code be as efficient as GCC” — I sort of feel torn between two inclinations. One is: it just would seem so crazy for AI to be able to rebuild the largest software engineering projects ever from the ground up, representing many years of work by hundreds of people. So that still, you know, feels kind of intuitively shocking on some level.

But also, I obviously have updated on the results that it can do these really impressive things on CommonMark in our experiments. It can make substantial progress — although not fully solve our hardest task — within a billion tokens, and it can do Carlini’s C compiler. So between these two poles, I end up just being very uncertain.

Greg

Mm-hmm. Isn’t this a win for benchmarking? Or would your steelman pessimist claim that this is a problem?

Anson

Sorry, that what exactly is a problem?

Greg

The state of MirrorCode upon release, with AI systems having perhaps made more progress on it than we would’ve predicted when Tom began working on it, call it however many months ago.

Anson

I would’ve thought naively that they would see this as evidence that it was actually just really hard to make these kinds of new tasks. But then it depends on how far we push things and what the costs are and what the benefits are — which is sort of the thing we were saying is the thing that matters.

AI speed-up in benchmark development [00:20:57]

Anson

I’m kind of curious for both of your takes, in the case of MirrorCode, in the case of FrontierMath: Open Problems — that relates back to what Tom said earlier — where whether it makes sense to build these benchmarks and whether you’re going to have trouble continuing to build benchmarks that aren’t saturated depends on whether AI can speed up the benchmark building process and also how much you can parallelize things.

So on these two different dimensions — how much have you found AI to be helpful for speeding things up when you’re building benchmarks, and also to what extent is it the kind of thing where you can just absorb more resources and it’s very flexible?

Tom

So on absorbing more resources — MirrorCode could have benefited from a lot more full-time software engineers on it. I was basically the main person with a lot of engineering experience on the project, although I certainly had some help from collaborators. And I definitely feel that, both in terms of adding target programs to the benchmark and also setting up the infrastructure, just having three engineers on it would’ve sped it up a lot.

Obviously this is from a low base. If you have a 20-person team within Anthropic — can you still sort of scale that up to 50 or a hundred people and get similar speed-up? I feel more uncertain about that.

And then there’s just adding more samples to a benchmark. One would hope that this is sort of inherently pretty parallelizable.

Greg

I’m curious for the AI speed-up one.

Tom

Yeah. AI speed-up. I mean, we all know from METR’s research that people seem to be pretty bad at estimating this. And I myself feel very uncertain, but — you know, gun to my head, if you really forced me to pick a number — I would say 2x speed-up.

Greg

I suppose I’d give similar answers here. For FrontierMath: Open Problems, the problem contribution is embarrassingly parallelizable, limited only by the mathematicians. I shouldn’t exactly say that — we have a review; I review all the problems, and so that’s a limiting factor. But for the most part.

And then each problem contributor develops their own verification program. So we have more diversity of AI speed-up — some of them certainly used AI. But anyway, I believe that speed-up is, you know, moderate there.

The bottleneck is more in having the idea for the problem. And I don’t think the AI systems are so good at finding problems that meet our admittedly somewhat unnatural constraints of being unsolved math problems of a certain degree of interestingness with solutions that happen to be verifiable.

The benchmark-reality gap [00:23:28]

Anson

So we’ve just covered a bunch of things about whether we think benchmarks are going to be doomed to be saturated as we try to build them out because AI progress is so fast.

But there’s another way in which benchmarks could be doomed, or at least as I understand it, which is that no matter what, benchmarks are just not going to be able to capture the things that we care about, no matter how much effort you put into trying to build them.

So the kind of examples here would be like GPQA Diamond — people often say it’s like PhD-level science questions. If you can do GPQA Diamond, then you’re going to be able to do PhD-level science. Somewhere along the line the logic breaks down. You know, you can do GPQA Diamond, but then maybe you can’t do all of PhD-level science.

What is wrong with this particular line of argument? Is it wrong? Do we think that AI benchmarks are doomed in the sense of not being able to capture these real-world impacts?

Greg

I mean, I think the argument might be a little overstated in the snapshot you gave, already. I’m pretty sure that models that did well on GPQA Diamond indeed generalize to the task of answering questions qualitatively similar to those in GPQA Diamond.

One lesson to learn from this is just to make sure that when you say, if an AI model can solve this benchmark, then it can generally do tasks like the tasks in this benchmark — you’ll never go wrong, short of abject cheating, training on test — you won’t go wrong by saying, “Okay, what this means is if I give it a self-contained grad-level science problem, even one that you need to be an expert in the domain to solve, as was verified for GPQA, then it’ll solve that.”

And you just leave the listener to their own devices to generalize. How much will that help someone working in science? What sort of uplift will that give to a non-expert — a biologist doing a chemistry problem outside their comfort zone, whatever. But the benchmark was never going to tell you that, because that’s not what the benchmark was about.

I would say incidentally, we seem to be in a period where you don’t even get in that much trouble for generalizing a little, maybe, beyond the letter of the benchmark task. By which I mean coding agents seem genuinely useful even if many of the tasks we see are not obviously in distribution for benchmark tasks.

Some of this is contingent — this is happening only because the AI companies are perhaps behind the scenes shoving a lot more tasks than we see into distribution, into training. But still, short of cheating, you should expect benchmark generalization — machine learning works, it generalizes into the training distribution — that’s fine.

And so I think what this means is we should be very careful about extrapolating benchmarks, but we should also be very thoughtful and put a lot of effort into trying to put the benchmark pin right in an important area, an area that tells us something we actually care about inherently. And I think both of the benchmarks we’ve talked about that Epoch has been busy developing — both MirrorCode and FrontierMath: Open Problems — meet the spec to a clear degree.

MirrorCode is just — if I have a really clearly, precisely specified test suite or at least spec, then I can expect AI systems to develop software of that nature at least to a certain degree of complexity, which MirrorCode helps you understand. And that’s inherently — I don’t think it’s a stretch to say that’s inherently of interest to someone who might be using the system for practical purposes, deciding whether to fire all of their software engineers or even doing research on software intelligence explosion. Like what sort of tasks go into AI research? How much of them are tasks like this? And this adds clarity in very helpful practical ways.

So too with FrontierMath: Open Problems, even more so — these are problems where there’s no generalization required, at least for each individual problem. It’s something some mathematician would really care about personally, would care about seeing solved. If you’ve devised your benchmark well, you shouldn’t care about generalizing too far beyond the benchmark because the benchmark itself is from a distribution you genuinely care about.

Tom

One thing I’m a little unsure about, though, is that, ok, this all sounds good. We can be pretty confident in the claim that if the AI does the benchmark task, it’s going to be able to do very similar tasks to that thing. But then what counts exactly as something that’s very similar to it?

In practice, people often do want to try to generalize these things further, and although we say we should be careful about generalizing further, it’s very hard to say exactly how much that is.

My one example here is GDPVal. I think in their paper they’re explicitly motivating it in the first few paragraphs, we want this to be something like a leading indicator for a lot of automation. And I guess unfortunately it wasn’t successful at that. Probably they spent like millions of dollars building this thing, and it doesn’t seem to fully reflect what we’ve been seeing in, say, productivity statistics and so on.

Greg

Well, they, I think, fell prey to a pun in the name — and it is catchy, GDPVal. It’s great.

I think you just have to look at the task and say you may have a motte and bailey, but in the good sense — like you have a core goal and a stretch goal, say. Where the core goal is, for GDPVal for example, saturation of this benchmark should be evidence that AI systems can do self-contained tasks drawn from a wide range of digital work. And I mean to emphasize self-contained quite a bit, because these tasks are very self-contained. You do web search, but apart from that, you’re given the documents you need and you’re given your task and you output basically a document — usually just a text file. That is your output. And so it’s quite self-contained compared to the actual work environments that humans face.

So, the core goal is just — can it usefully offload tasks like this? I would say it’s extremely consistent with my experience that over the last year, offloading tasks of complexity — like, less than a day’s worth of work for me to put together a written report on some topic that requires expertise — they’ve gotten a lot better at that. Of course they have.

Now for automation, I think it would just have been foolish to expect that this would automate. Florian Brand, who worked on the same report on GDPVal, had a great analogy. He said the self-contained nature of these tasks is somewhat analogous to the self-contained nature of bug-fixing or small feature addition in software engineering. So just as AI systems currently have not automated software engineers as a whole profession, but they have transformed the workflow — you now delegate and manage much more of your time than you spend writing — so too, saturation on APEX-Agents or GDPVal or RLI would mean that, if you are a knowledge worker in these other domains, you too could see your daily workflow transformed.

But the benchmarks, GDPVal anyway, are just not targeting automation enough for you to expect to generalize there.

Anson

I think that makes a lot of sense. And one thing that I think maybe this suggests is that there’s a lot of value in digging into the details of what this benchmark actually tells us. Because it’s very easy to be like, “Oh, GDPVal, and then GDP,” but then actually we need to look into what exactly the tests are. And as you were saying, the specific tests actually seem like they do generalize better if you look at what those tests are rather than “GDP” or whatever.

Tom

Sure. I certainly agree with you that this sort of effect that Anson was describing doesn’t mean that benchmarks are doomed. But I have a slightly different perspective, in the sense that this slogan of “benchmark-reality gap” does resonate with me a bit more.

If you told me in 2020 that AI would solve GPQA-style questions — where they’re Google-proof, so even with arbitrary web access you can’t just find the solution written somewhere, you have to not only combine a bunch of knowledge but also do a bit of reasoning about these pretty advanced science topics — I would’ve predicted much, much bigger effects of AI on the economy and society than we in fact saw when AIs were, say, at 50% on GPQA.

And I think this is the case for many people. And to some extent this is, “Okay, I should take the L.” I was naive in how I was thinking about benchmarking, and maybe some people were much wiser about it. But it does kind of ring true to me that there seems to be a systematic way in which we try to design a benchmark that we hope will capture this broader thing, and then we see AI do great at it, but the real-world usefulness or impact isn’t quite there.

For myself, I want to take into account the track record of how I’ve been surprised by this. The sense in which I feel like it goes beyond just, “Oh, well, you were wrong and naive about the benchmark at the start,” is maybe there’s just something inherently very difficult about squeezing all of the complexity of real-world, long-horizon tasks into something benchmarkable. And we’re going to keep systematically bumping against this, even as we try to make benchmarks better and more realistic.

So, I do feel like there’s something to be aware of here. But in terms of whether this dooms benchmarks — no, because it still seems like, even if we were wrong about what GPQA meant, we can try to take the lessons from that and design the next eval better. Basically, even if we continue to be a bit wrong about this, hopefully benchmarks are still useful.

Greg

Two responses. One is more leaning into, yes, I think people do expect more from benchmarks than they ever should have. The one AI paper I wrote, long before Epoch, was on a critique of benchmarks at the time, and people not investing in making sure benchmarks matched distributions that they wanted, even a little. This was around 2019, and, you guys got to know, the situation was much worse back then. It really wasn’t clear that benchmarks correlated with anything. And so I think there’s some zen of what you should expect from benchmarks. And yet I think they’re better than they’ve ever been.

So the lessons I think were learned over time of we’ve got to make this something that isn’t meant to be something random that AI systems just can’t do today, but if they could do it I’m not sure I’d feel informed about anything other than this random niche. I think benchmarks used to look like that a lot, and they sometimes still sort of look like that today when people find quirks — whatever, “r’s in strawberry” or something — you can make benchmarks out of that. But these are more hobby efforts on the side, and the big benchmarks people pay attention to have been centered over more meaningful distributions.

And I think this does point to the sort of progress you’re saying. And if you couple that with the perspective of modesty in inferring from benchmark results what impacts you’d expect on the world, then you can be very happy about benchmarking. Join me in happiness. The invitation’s open. It’s great here.

But, the other thing I would say is: we have seen a lot of impact of AI on the world. We have this massive marshaling of societal resources to make more systems — the signals that people needed to see to choose to invest a lot of money, including now very meaningfully growing revenue from just consumers, not just investors — were strong enough that people did say this is a big deal, and acted like this is a big deal.

In some ways the benchmark progress did indicate real impact on the world. And the fact that we weren’t necessarily exactly right about the shape or the immediacy of what human-level performance on GPQA Diamond was — if you zoom out a little, maybe we were right? This is a big deal.

Or even going back further in benchmarking history — not that much further — to Winograd schemas, the ambiguous pronoun resolution tasks. This was included in — I forget where it was from — some list of, like, “AGI will be here when five things are true.” And one of them was a sufficient score on a more or less completely saturated Winograd schema test.

And I think what I was trying to get at was: look, this is sort of a tricky task that requires world knowledge and fluency in natural language. And that’s gotta be a big deal if that happens. And it wasn’t immediately — when you got systems just blowing this benchmark out of the water — that the world was transformed literally overnight. I think it was a big deal. It’s a big deal that we have AI systems that can do well on language tasks and can very flexibly use human language. And this was one big blocker in AI being useful, and that blocker is mostly gone.

Tom

Well, but if AI had stopped progressing at the level where it did really well in the Winograd schema benchmark, I feel like we wouldn’t have seen that much impact.

Greg

I’m not sure that’s totally true. There’s a version of it — there’s maybe a narrow version that’s true — but if you give me a little rope, I think if AI progress had plateaued with GPT-4 levels, but not reasoning model levels, there was already, I think, a lot of economic transformation or whatever economic value baked in that it was going to take a while to figure out how to, you know, use everywhere. Linguistic flexibility, even if you don’t have super precise reasoning — I think, you know, is a technology on par with — it is a tech-of-the-decade kind of thing. That’s not bad.

And I think Winograd schemas being saturated probably was a meaningful sign that you were there. And if you had plateaued, you still would’ve been, like, “Wow. Used to be, I couldn’t really talk to a computer, and now I can kind of talk to a computer, and that’s meaningful.”

And I think the benchmarks would’ve played their role in helping you at least dismiss extremely reductive cases, of like “This doesn’t… No, no, we used to have no idea how to solve these puzzles, and it seems plausible you need language skills to do it, and you can. So now — impact ahoy.”

Can an AGI benchmark exist? [00:38:26]

Anson

So one thing I wanted to kind of make sure I’m understanding correctly, for both of you, is, do you guys both think that “AGI bench” can exist? If you have this benchmark and then you were to just train on it and hill climb it, you saturate it, now you’ve got AGI for sure?

Tom

I don’t find the term AGI very useful to begin with, because of this point that many, many people have made — I’m obviously not inventing it — that the capabilities of, even before AI, computers and now AI systems are heterogeneous in terms of how good they are at different things. And it seems like we could see huge impacts of AI on society and the economy before we have this generality where it can do, you know, all or almost all of the things that humans can do.

I just don’t think this AGI label is that useful. And instead we should be saying, “What are the capabilities that we think are especially relevant and important?” — and let’s try to build benchmarks for those.

Greg

I do think there’s a spirit of your question, which is fine. You could have a breadth of benchmarks and I can concatenate them and say, here’s my mega-benchmark. And do I think that is possible to build? I think it’d be very expensive. We’re talking a lot of tasks.

And I think there’s sort of this magic ingredient sitting behind these things, which is something like generalization — will we get a system where doing well on one task is strong evidence that it will be able to do well on another task where humans sort of have something like this quality.

So I think this generalization question is very interesting. Benchmarks that could help you identify general reasoning — there have been attempts at this, like this is what ARC-AGI is supposed to be all about, you know, AGI arrives with ARC-AGI-6, or whatever. But I think that’s actually sort of a plausible view — they clearly haven’t pushed this to the human extremes, but there are other approaches you could take to try to measure this kind of out-of-distribution generalization, in-context learning kinds of things.

One idea I’ve heard discussed: you get the latest video game that’s popular on Steam and you see if an AI system can play it well, and that gives you some sense of whether it has generalized.

Tom

But I guess you might worry that even this concept of generalization is — you know, actually, once you look under the hood — it’s this super weird multi-dimensional thing, and we can’t really conclude that much from this random new video game on Steam, performance on that. Well, maybe it just doesn’t tell us that much about if I bring in an AI as a new temp worker for this kind of low-level administrative task. How well will they do on that? I would still worry that what you end up measuring is — well, can it generalize within the specific sub-domain, or at this type of task?

Greg

Of course. I do think there’s room for somewhat cautious optimism here because we have in fact seen sparks of AGI. I do think that’s a fair characterization, that we have seen some degree of generalization — unclear how much of that was from shoving things into the training distribution. That’s a big question.

But you could maybe hope to detect something like this. Like, whatever, we have a benchmark for boring temp work that we keep hidden and we have a benchmark for video games or whatever, and we see if progress is made at the same time on both of them. And if it was, I would say we’re seeing an interesting thing emerging.

But it’s also, of course, hard to know whether that just happened because someone in the lab happened to buy an RL environment that looks a lot like one of your hidden benchmarks. Ideas aren’t — it’s hard to be that original. So I do think this is something of a question.

But again, these lists of things that will herald AGI, I don’t think have been terribly off base. I actually think we’ve learned some lessons. What are things that have not heralded AGI? I think that would include chess — Deep Blue beating Kasparov was not a moment of general AGI.

However, the techniques developed there — there’s still a little bit of, “No, it was correlated with the same thing that society was trying to do for a while.”

But fine, call that a loss. But I think a win is, these sort of broadish, hodgepodge of tasks show some general capability. And then, I don’t know, maybe this generalization is still something benchmarks should be paying attention to, over and above any particular task.

Beyond automated scoring [00:43:18]

Anson

So given all of these things — it sounds like in terms of saturation, you guys don’t think that the benchmarks are necessarily doomed. In the case of how much they can generalize, there are a lot of interesting questions, and I guess it’s a bit more complicated.

I think there is still a big looming question here, which is, where do we go next with benchmarks? What exactly do benchmarks look like in the future?

Tom

So one kind of categorization I find useful is in terms of how benchmarks are scored. Is it completely machine-checkable? So you have an algorithm, not based on language models, that just checks correctness. Basically all traditional benchmarks fall into this first category. Second, there is some form of LLM-as-a-judge. And the third category is just human judging, non-automated judging; you just have humans score the AI outputs.

So, I’m interested in people figuring out how to do the second category well. And then human grading is I think basically, historically, it would’ve been ludicrous, because human time is just way too costly. And when we had benchmarks with like a thousand samples and so on, it just wouldn’t have been feasible.

You know, now we’re seeing things like much smaller benchmarks, or actually even just demos like Anthropic’s C compiler, where there’s a single output and running the benchmark might be in the tens of thousands of dollars. There, maybe there’s a form of human rating that could make sense.

There’s so much more to explore with these alternative scoring methods. There’s a lot more juice to be had even in the completely algorithmically scoreable category.

Greg

It’s funny how I almost feel like we’ve got two poles here that are both very promising, and then this tempting — but I’m not sure how much I believe in it — middle ground of relying on fuzzy qualitative AI judgment for assessing AI outputs.

We’ve rarely had benchmarks outside of — the math, science, coding — this domain. There are some attempts at creative writing benchmarks, and they’re good, I mean, no shade, but they’re just not that deep or compelling. And outside of that, it’s not only been this first —

Tom

— there are things that try law. I’m really interested in white-collar work that isn’t STEM. But I wish I had the time — I haven’t had the time to look at the literature.

Greg

We can talk a little about some of these. I think it’s interesting.

Recently, Epoch wrote a report reviewing three benchmarks that try to target economically valuable work outside of coding, math, science. And I think there are some interesting entries there. It’s also interesting to look at how they’re graded, because none of them are in this first category you were describing.

One called APEX-Agents uses detailed rubrics — and this is targeting tasks in corporate law, management consulting, and investment banking. And they have just detailed rubrics against which an LLM then assesses the output. And it’s things like, “Did this customer’s data breach described in these documents violate GDPR, which you have a copy of over here, and here’s the contract the customer had with their client?” And the rubric is saying, “You lose a point if you don’t say how clause 10.3C or whatever was violated or was not violated” — so it’s very granular. I think I believe that this is doable.
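To make the granular-rubric idea concrete, here is a minimal sketch of rubric-based scoring. It is not APEX-Agents' implementation. The `Criterion` type, the point values, and especially the keyword-matching `toy_judge` are hypothetical stand-ins; in the benchmark described, the per-criterion judgment is made by an LLM.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Criterion:
    description: str  # what the answer must address
    points: int       # points awarded if the criterion is met


def rubric_score(
    answer: str,
    rubric: List[Criterion],
    judge: Callable[[str, Criterion], bool],
) -> float:
    """Return the share of rubric points the answer earns, between 0 and 1."""
    total = sum(c.points for c in rubric)
    if total <= 0:
        raise ValueError("rubric must carry positive total points")
    earned = sum(c.points for c in rubric if judge(answer, c))
    return earned / total


def toy_judge(answer: str, criterion: Criterion) -> bool:
    # Hypothetical stand-in for an LLM judge: a plain substring check.
    return criterion.description.lower() in answer.lower()


# Hypothetical rubric, loosely modelled on the clause example above.
RUBRIC = [
    Criterion("clause 10.3", points=2),
    Criterion("notification deadline", points=1),
]

if __name__ == "__main__":
    answer = "The breach violated clause 10.3 of the contract."
    print(rubric_score(answer, RUBRIC, toy_judge))
```

Splitting the judgment into many small yes/no criteria is what makes a fuzzy grader more checkable: each individual call can be audited against the rubric line it answers.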

The other two benchmarks we looked at — GDPVal from OpenAI and Remote Labor Index from a Scale/CAIS collaboration — those are just graded by humans. They just bit the bullet, and I think this is great. And it’s interesting. GDPVal is close to saturated, but Remote Labor Index is definitely not.

Tom

How many tasks are in Remote Labor Index, and do you know how much they paid the graders?

Greg

So all good details. I don’t remember the exact task number — it’s in our report — but on the order of a hundred, not 10. And they don’t give us much on how much they pay the graders.

The graders are simply asked to — they’re given the AI output, they’re given the spec from the customer. These are real tasks taken from the gig work platform Upwork, and they give the reference output which was accepted by the customer, and they say, “If this is what the customer was looking for, would this other output likely satisfy them?” What I take this to mean is, “Is the AI output even in the ballpark of the human output?” That’s a lower bound on the quality of judgment.

Most of these things, to be clear, this was an innovative take, are multimedia output. So it’s kind of a visual gist judgment. And right now at least the failures are just dramatic. The first author on the paper was just describing a test case they have of, like, “We asked you to draw the Superman logo in Inkscape and you submitted an unrecognizable blob.” That’s the level that models are at here.

I think fine-grained judgments will get harder, but I don’t think they invested that much in the human rating. And I sort of believe this is a perfectly reasonable thing, and the benchmark at least is good enough to tell us the binary of, “Can AI at least come close to doing this sort of task? Are the deficiencies more fine-grained versus are they not even close?” And the answer there is they’re not even so close. And as you were saying, this is a new form for benchmarks, primarily.

I’ll mention one other form that I think is a good example for going forward, which is — incidentally people paid a lot of attention to it, but I don’t think appreciated it as a human-judged benchmark — which is the International Math Olympiad.

So for those who don’t know, this is a contest where some very smart high school kids write and solve math problems by writing proofs, arguments, and there’s a very well-developed, decades-long process for human judges to score these purported solutions from the students. They’re all double-judged, and the judges are given very extensive rubrics ahead of time, but they also evolve those rubrics during the scoring process as new things come up. And there’s an argument back and forth where the judges get to present their assessment. It’s very involved, very labor-intensive.
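The double-judging step can be sketched as follows. This is an illustrative simplification, not the IMO's actual procedure: it only separates problems where two independent judges agree from those that would go to the back-and-forth discussion described above. The function name and the sample scores are hypothetical; the one borrowed fact is that IMO problems are marked out of 7.

```python
from typing import Dict, List, Tuple


def reconcile(
    scores_a: Dict[str, int],
    scores_b: Dict[str, int],
) -> Tuple[Dict[str, int], List[str]]:
    """Split problems into agreed scores and a list needing discussion."""
    if scores_a.keys() != scores_b.keys():
        raise ValueError("both judges must score the same set of problems")
    agreed: Dict[str, int] = {}
    disputed: List[str] = []
    for problem in sorted(scores_a):
        if scores_a[problem] == scores_b[problem]:
            agreed[problem] = scores_a[problem]
        else:
            disputed.append(problem)  # resolved by human discussion, not by code
    return agreed, disputed


if __name__ == "__main__":
    # Hypothetical marks out of 7 from two independent judges.
    judge_a = {"P1": 7, "P2": 3}
    judge_b = {"P1": 7, "P2": 5}
    print(reconcile(judge_a, judge_b))
```

The sketch deliberately stops at flagging disagreements, since the labor-intensive part of the process is the human argument that follows.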

And Google’s solutions were submitted anonymously, so the IMO gold claim from Google was properly judged by the same process used for judging humans.

I think it’s an amazing benchmark, and no one batted an eye at this. You know, this was a really good methodological benchmarking win that hadn’t really been done before. And it was just done by hooking into existing human infrastructure for judging work output. And I think for category three, this is something to be emulated.

Tom

Just to give you the opportunity to hammer home your point — what are some other examples of using these kinds of existing structures?

Greg

I’m worried I’m not remembering the one that you maybe liked that I said when we were chatting earlier on, but one that I can imagine is anything where there is currently a human contest for just submitting something like this.

So this isn’t the one I said earlier, but I was just thinking — there are various awards for fiction. So if you want to have your AI system write a novel and submit it — there are ethical concerns and ways of trying to make sure we’re not flooding the inboxes of editors and whatnot — but a very reasonable benchmark, in my opinion, for creative writing would be submitting a short story to a short story writing contest and have it graded or voted on the same way. I think this is a very reasonable benchmark.

What was, yeah —

Tom

Yeah, that one’s great. I also think just peer review, academic review of papers — especially as AI becomes more important and gets used in academia a lot, you should eventually be able to persuade reviewers to be willing to spend five or ten percent of their reviewing time or something evaluating these AI outputs. Maybe AI labs paid them a lot for that time.

This seems pretty feasible, and a way that you just hook into this infrastructure that applies to any time a paper is reviewed. So it’s pretty much any area of science.

Greg

And I think, unfortunately, there is something of a refereeing crisis — meaning a labor shortage — in certain academic fields. But this could be a synergistic opportunity to pay the money to solve that problem. And then have some gated process by which AI autonomously authored papers are submitted to NeurIPS, or whatever, and the benchmark is to win best paper, or get accepted, and then get whatever accolades. I think this is pretty good.

And I suspect, you know, one thing — stepping back — one thing that’s funny about benchmarking is, again, it used to be this almost purely academic exercise done right alongside the people who were developing the models. Now there are companies with annual budgets in the tens of billions of dollars and growing for AI system development.

And surely they’re not hanging off every word of benchmarks made by little shops like us. They have highly resourced internal benchmarking suites and they are surely trying to evaluate their systems.

I imagine part of what they’re doing with the help of the sort of data collection companies is trying to extract “just such” cases from real-world internal corporate use. So even if some of these processes are legible to us as industry outsiders, from the vast majority of industries, as peer review or the existence of contests for public-facing consumer output — there are lots of cases if you were in the guts of an insurance company where you’d have all sorts of, “Oh, here’s the step in the process where the senior claims adjuster signs off on a report authored by a junior claims adjuster,” and that’s their whole darn job is to do this.

And so I’m sure someone somewhere is trying to collect data to replicate that — and maybe even do human trials with some regularity of, okay, we did our messy RL environment approximation of this, that’s a shoddy benchmark purely internally, we trained on that data and now we have a validation set and it looks like it’s doing well, and now we’re going to do some taste testing, which they wouldn’t necessarily call a benchmark, but it’s a benchmark, of having a real senior insurance claims adjuster take a look at this report that the AI system tried to generate.

And to be clear, that’s exactly what GDPVal is sort of trying to externalize and do. But it’s still these relatively self-contained tasks, and I think just expanding the scope of these — that would take a human less than a day to do or something — if that’s saturated, let’s go to week-long projects, and you know, get what you get.

Benchmarking in messy real-world contexts — I think that’s just where benchmarks will go. These might look more like case studies, and I think this is fine. You can have standard method case studies and see — I think we should remember that every 18 months or so we see a big spike in capabilities. If we’re really in that regime, we shouldn’t feel too bad about doing a “Can AI do this thing it obviously can’t do?” kind of contest and have that every four months or something like that.

And then, you never know when the next spike is going to come. So set a baseline of AI not being able to do these things, hone your methodology so you’ll be able to say when a big spike has happened. And this will, I think, be a very fruitful mode for benchmarking to be in. And if anything, we’ll have less of a gap between the things we really care about and the raw benchmarking numbers.

And yes, it’ll be more expensive, or not have some of the nice features that current cheapness has — which is, like, there’s an interesting fast model from an upstart Chinese company, can we run it on the benchmark right now? It’s very easy to do. And that won’t be so easy. But this is, I think, an acceptable price to pay. Somehow the scores will move slower but will come regularly. I think this will be very informative, and we’ve hardly explored this at all and have plenty to squeeze there.

Tom

So — if I think about what’s next after some version of MirrorCode is released — a few things that seem kind of interesting: so one is staying within the MirrorCode idea of staying within easily scoreable software engineering tasks.

Seeing as AIs are pretty good at these on MirrorCode-style reimplementation — can we see, does this generalize if you put AI in something more akin to situations that humans are in? And so you would be pushing the frontier with access to, you know, any code base that you want, any existing tools.

And so the examples there would be like, can you speed up some widely used software where speed is a real bottleneck and where there’s already been a substantial amount of effort on optimizing it?

One example that comes to mind here is Rust compilation. People really like the Rust programming language but complain a lot about the compilation being slow, because it fundamentally just has to do a lot more with borrow checking and other things than other languages. Yeah, that feels like a kind of natural next step, “Oh, AI is really good at precisely specified software tasks.” Can we get it to a point where this would produce an artifact that would actually be useful in the real world? That’s kind of one angle.

Something else I’m interested in is — a lot of people are interested in the effect of AI on speeding up AI R&D, and I’m quite curious to think about the question of how much of those tasks are kind of MirrorCode-style, where there’s a pretty clear goal or metric? How well does AI do with those?

So one thing I’m actually a little bit confused about — I should look into it more — is RE-Bench, which seems to have this property, and it sort of seems actually quite similar to MirrorCode in terms of you know precisely what to do and being able to get feedback as you go.

My impression is it’s not the case that every single RE-Bench task is at, you know, superhuman, more-than-eight-hour time-horizon levels. Understanding more whether that’s the case, and if so, why? And then potentially seeing if there are benchmarks we can design that really target this AI R&D thing.

Greg

One thing I’ll throw in that I think is this sort of magic ingredient is out-of-distribution generalization, and I think that’s a topic benchmarks can take a crack at. And I think we’ve done a little bit of work with this. We have a chess puzzles benchmark that shows some interesting patterns in how models perform, or sort of make halting progress, on a task where presumably labs care less about optimizing for it. But if you had a general-purpose reasoner that could solve super hard math problems, it should be able to work through a chess puzzle. You want this to be moderately secret and not too high profile so that the labs don’t focus too much on it, the way ARC-AGI became a bit bench-maxed, somewhat.

For specific ideas, one that I happen to like (who knows if we’ll make anything of it) is trying to push more into physical-world tasks. There was this lovely little blog post of someone trying to get Claude to teach him how to make coffee via just taking photos of where he was and asking it for instructions. And I think that’s interesting because you can imagine all sorts of impacts on the world if LLMs are good as brains for perhaps robots, but even just for humans to navigate the world. And, you know, they can provide all sorts of skill uplift if they can tell you how to, whatever, replace this machine part in something in your car or in a factory.

So I think we can just sort of start to look more broadly at what are the bottlenecks to all sorts of economic impacts. And there are probably some — what I’d say are regular old benchmarks — that probably can fit reasonably into that framework.

How AI changes benchmark building in practice [01:00:45]

Anson

How do you envision the benchmark building process when in a couple of years you have lots of AIs that are helping you speed up the process itself? What do you think that looks like?

Greg

Have I really drunk the Kool-Aid if I don’t have an off-the-cuff answer to what I’ll do with all my agents?

The software engineering style of this seems maybe a little more concrete to imagine?

Tom

It seems like it’s an abstraction ladder interacting with a coding agent. At the bottom you might say, in this particular function, factor out this particular thing into a helper function. And that’s basically like typing it yourself — it’s so specific it might just save a little bit of time versus doing it manually yourself.

And actually, sometimes I do this for an instrumental reason, which is then the AI has in context that this has just been done. Whereas it’s a little bit more annoying to get it into the AI context if you do it manually. So that’s sort of the bottom. And then you can go up and up and up this abstraction ladder where the instructions you give the AI are more and more high-level.

I don’t think I really have a useful more concrete picture or prediction beyond that.

Greg

I think there is a bottleneck in some benchmark design around taste in tasks. I do feel like it would be a big unlock if AI systems had some of this taste that I feel like they don’t do a great job with today. Where, for example, if I say, “Give me examples of problems that fit the rubric for open problems” — I haven’t been impressed with what they turn up, and it’s a little bit of an unusual —

Tom

Yeah, they don’t have great taste for coming up with MirrorCode target programs. But just the fact that they know every single thing in computer science or in computing — so you can just ask it to keep generating more ideas and then you can pick based on your taste.

And also, even during the development of this benchmark, Opus 4.5 and 4.6, I feel like they’re already better at coming up with suggestions that meet more of the criteria.

Greg

I mean, I think the — in case it’s not obvious — a couple steps up the abstraction ladder would be a human researcher sets up the framework for the benchmark, with plenty of assistance on coding whatever infrastructure is necessary. And then you come to the part where you have to fill out all the tasks.

I mean, often you sort of start with that to make sure there are some tasks, but you’re at a point where you want to get 10 or a hundred of these things, and there’s some work to do to even come up with what they should be. And you ask an AI system, and you can sort of trust its results that it will mostly come up with good ideas that it’s worth your time to engage with and quality-control — instead of a couple steps down where it’s: I came up with a task and now I’m going to get a lot of help from it to implement. Or I see what’s wrong with the current version of this and I’m going to give it some feedback and have it take a turn on the code or whatever.

Take the chess puzzles benchmark: Gemini 3 Pro wrote all the code for it, but it was me looking at the output and saying, “Ah, these chess puzzles are lacking this feature.” Or, “Our search for chess puzzles of the characteristic we want is not turning much up. I think X, Y, Z is wrong. What do you suggest?” And it’s helpful and productive, but a human is a couple of layers up, obviously.

Building the whole thing from scratch? When do we just say, “AI, I would like a benchmark in this domain”? I don’t know. I mean, presumably it’s on the path out there, but that does feel a couple of turns away. Call it six months to three years, conservatively. But I’d skew towards the later end of that.

Anson

I think this is interesting, also a little funny. “Hey, we need a benchmark for benchmark taste.” You can see if the AIs can themselves make the benchmarks.

Greg

Yeah, I mean I do think some of our benchmarks have elements of taste baked into them in these kinds of “don’t expect it to generalize too well” kind of ways, but maybe useful angles on it.

Like, even MirrorCode, some of the more complex programs, you need — call it architectural taste — to make it not fall apart. And we’ll see if the models have that for the harder ones. Or some of the open problems — you might need what a human would call taste for the harder problems. We’ll see in hindsight, I don’t know.

Anson

Okay, cool. I think this is a good place to end. Thank you both for coming on the podcast. It was a good chat.

Tom

Thanks.

Greg

Thanks, Anson. Thanks, Tom.

Diversion and resale: estimating compute smuggling to China

Isabel Juniewicz — Fri, 01 May 2026 01:14:02 GMT

The following is excerpted from our full report, which contains interactive charts, a breakdown of our methodology and evidence, and a comparison to other estimates.

Key takeaways

Substantial quantities of AI chips have been sent to China in violation of US export controls. Evidence of diverted or missing chips, drawn from indictments and investigative reporting, points to nearly 300,000 Nvidia H100-equivalents by the end of 2025. This would equal roughly a quarter of the compute China acquired through legal channels or domestic production. Because much smuggling goes undetected, the true total is likely higher.
We estimate, with 90% confidence, that between 290,000 and 1.6 million H100-equivalents of compute were smuggled through the end of 2025. Our median estimate of 660,000 represents roughly 3% of the global compute stockpile, comparable to what xAI, a leading US AI lab, had at the time. The upper bound of our estimate would mean that, by the end of 2025, the majority of China’s AI compute had been smuggled.
We are uncertain about many variables, notably the magnitude of undetected smuggling and the proportion of chips allegedly diverted or missing that ultimately reached China.
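The shares quoted in these takeaways can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: it reads "roughly a quarter" as exactly 0.25 and "roughly 3%" as exactly 0.03, which are assumptions made for the check and not figures from the report, and the function name `smuggled_share` is invented here.

```python
# Figures quoted in the key takeaways (H100-equivalents, end of 2025).
ALLEGED = 300_000      # "nearly 300,000" diverted or missing chips
MEDIAN = 660_000       # median estimate of smuggled compute
UPPER = 1_600_000      # upper end of the 90% interval

# Assumption: "roughly a quarter" is read as exactly 0.25.
legal_or_domestic = ALLEGED / 0.25

# Assumption: "roughly 3%" is read as exactly 0.03.
implied_global_stock = MEDIAN / 0.03


def smuggled_share(smuggled, legal):
    """Fraction of China's total compute (legal plus smuggled) that was smuggled."""
    return smuggled / (smuggled + legal)


median_share = smuggled_share(MEDIAN, legal_or_domestic)
upper_share = smuggled_share(UPPER, legal_or_domestic)

print(f"implied legal/domestic compute: {legal_or_domestic:,.0f} H100e")
print(f"implied global stockpile:       {implied_global_stock:,.0f} H100e")
print(f"smuggled share at the median:   {median_share:.0%}")
print(f"smuggled share at the upper end: {upper_share:.0%}")
```

Under these readings, the upper bound corresponds to a smuggled share of about 57%, which is consistent with "the majority", while the median corresponds to about 35%.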

Overview

Over the past several years, the United States has applied multiple rounds of compute-related export controls to China. The earliest rounds of export controls applied to semiconductor manufacturing equipment or to specific entities (e.g., Huawei’s addition to the BIS entity list in 2019). In October 2022, the Commerce Department instituted export controls on all advanced chips with certain characteristics. Commerce tightened these export controls and expanded their scope in October 2023, increasing the incentives for smuggling chips to China. A late 2023 Center for a New American Security (CNAS) analysis suggested that the volume of smuggled chips at the time was relatively low, but expected that there would be larger-scale smuggling efforts in the future.

This prediction appears to have been borne out. Investigative reporting by the New York Times, Wall Street Journal, and The Information in 2024 found widespread evidence of banned chips being smuggled into China. In April 2025, Commerce imposed an additional licensing requirement on H20 chips, a type of chip popular in China that Nvidia had developed to comply with prior export controls.1 This further restricted exports and incentivized smuggling. The Financial Times reported that over $1 billion in AI chips were smuggled between April and July.2

Evidence of large-scale diversion by cloud companies has also accumulated. The DOJ indicted several Supermicro employees and arrested a company executive. The employees were accused of diverting $2.5 billion in Nvidia chips to a pass-through entity, which was redirecting those servers to China. Bloomberg’s investigative reporting suggests that Megaspeed, a Singaporean cloud provider, has tens of thousands fewer chips in its Malaysian data centers than it imported from Nvidia, and its CEO is under scrutiny due to close ties to China.

We estimate the total compute smuggled to China using Monte Carlo simulations3 based on evidence from indictments and investigative reporting. We classify this evidence into two types. Diversion evidence traces how chips leave legitimate supply chains and reach smugglers. Resale evidence focuses on the marketplace for smuggled chips within China: the number of vendors or brokers and volumes they transact. Some sources cover both diversion and resale of the same set of chips.

Because every successfully smuggled chip must be both diverted and resold4, we can build two parallel estimates of the same underlying quantity that should, in theory, approximately agree. While our resulting distributions for diversion and resale aren’t identical, they do substantially overlap.

Visit the full report to explore the interactive version, including an H100-equivalents view.

The diversion estimate is created by building a database of allegations from 2024 and 2025, and then adjusting for false or overstated allegations, and for cases missed by reporting and enforcement5. The database does not include cases where diversion was attempted but known to have failed. The resale estimate approximates total volume in the market for smuggled compute by aggregating investigative reporting on vendor counts and the scale of individual transactions. The resale estimate draws on portions of a 2024 CNAS estimate of smuggled compute by Grunewald and Fist.
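As a rough illustration of the diversion adjustment, the sketch below runs a toy Monte Carlo simulation using only Python's standard library. It is not the report's model. It takes the alleged total of roughly 300,000 H100e, the reporting-accuracy interval (median 80%, 90% CI 61–94%) and the detection-rate interval (median 24.5%, 90% CI 10–60%) from the report's stated assumptions, and everything else is an assumption made here: the lognormal and logit-normal shapes, the independence of the two inputs, and the function names. It also omits the case-specific discounts and the resale estimate, so its output should not be read as reproducing the report's figures.

```python
import math
import random

Z90 = 1.6449  # z-score bounding the central 90% of a normal distribution


def draw_detection_rate(rng, lo=0.10, hi=0.60):
    """Assumed lognormal whose 5th/95th percentiles are lo and hi."""
    mu = (math.log(lo) + math.log(hi)) / 2
    sigma = (math.log(hi) - math.log(lo)) / (2 * Z90)
    return min(rng.lognormvariate(mu, sigma), 1.0)  # a rate cannot exceed 1


def draw_accuracy(rng, median=0.80, lo=0.61, hi=0.94):
    """Assumed logit-normal: normal on the logit scale, centred on the median."""
    def logit(p):
        return math.log(p / (1 - p))

    sigma = (logit(hi) - logit(lo)) / (2 * Z90)
    x = rng.gauss(logit(median), sigma)
    return 1 / (1 + math.exp(-x))


def simulate(alleged_h100e=300_000, n=20_000, seed=0):
    """Return sorted draws of (alleged * accuracy / detection rate)."""
    rng = random.Random(seed)
    draws = [
        alleged_h100e * draw_accuracy(rng) / draw_detection_rate(rng)
        for _ in range(n)
    ]
    return sorted(draws)


def percentile(sorted_vals, q):
    """Nearest-rank percentile of an already sorted list, q in [0, 1]."""
    return sorted_vals[int(q * (len(sorted_vals) - 1))]


if __name__ == "__main__":
    draws = simulate()
    for q in (0.05, 0.50, 0.95):
        print(f"{q:.0%} percentile: {percentile(draws, q):,.0f} H100e")
```

The design choice worth noting is the division by the detection rate: a small detection rate inflates the estimate sharply, which is why uncertainty about undetected smuggling dominates the width of the resulting interval.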

Cumulative allegations of diverted or missing chips total almost 300,000 H100e6 by the end of 2025 — a quarter of the compute that China is estimated to have legally imported or domestically produced. This figure is neither a ceiling nor a floor on actual smuggling. This figure may overstate actual smuggling for two primary reasons: the allegations are largely not yet proven in a court of law, and some of the chips redirected to China may not have arrived there. Conversely, the 300,000 estimate captures only publicly alleged cases — some, probably most, smuggling goes undetected by journalists and authorities. Thus, we believe it is almost certain that the true total quantity of compute smuggled into China is higher than 300,000. More details and a cumulative summary of reported and alleged smuggling are here.

Our model estimates 660,000 (90% CI: 290,000–1.6 million) H100e of cumulative compute smuggled through the end of 2025. This is somewhat higher than other estimates, including those by CNAS and ChinaTalk, though the median is well within their confidence intervals and vice versa. A comparison against these other estimates, including a discussion of the differences, is available here. Further details on the model and our estimates are here. Our estimates cover quantities of chips and servers physically moved into China. They do not cover cloud compute located outside of China but used by Chinese customers, an arrangement that, subject to end-use restrictions and sanctions-list screening, is generally permitted under current US export controls. These estimates can be viewed in a global context by checking out Epoch’s AI Chip Ownership Hub.

Read the full report for interactive charts, a breakdown of our methodology and evidence, and a comparison to other estimates.

1. China had imported almost 18 billion dollars' worth of H20s, around 220,000 H100e units, between their 2024 introduction and their ban.

2. While the H20 restrictions were lifted by the US in July 2025, China subsequently restricted H20 imports. While the initial lifting of the restrictions appeared to reduce demand for gray market chips, it is unclear whether gray market sales rebounded when H20 sales did not resume.

3. Simulations that involve repeatedly drawing random values for each uncertain input to build up the distribution of possible outcomes.

4. In some cases, the end user may be directly arranging for the diversion of the chips, typically through a shell company or other pass-through entity. We then consider the transfer of the chips from the pass-through entity to the end user to be the resale step. It is possible that some chips may be smuggled in more directly by the end user, but we anticipate that this is small in proportion to total flows.

5. We assume that reporting accuracy has a median of 80% (90% CI: 61–94%) and that the detection rate has a median of 24.5% (90% CI: 10–60%). In addition to an overall reporting discount, there is case-specific uncertainty, for each case, about whether chips that were purportedly missing or targeted for diversion actually made it to China.

6. Compute capacity in the equivalent number of Nvidia H100s, based on dense 8-bit operations per second.

GPT-5.5 Pro achieves a new high score on the ECI

Epoch AI — Wed, 29 Apr 2026 09:53:27 GMT

GPT-5.5 Pro also set new records on FrontierMath, scoring 52% on Tiers 1-3 (up from 50%) and 40% on Tier 4 (up from 38%). Across runs, it and GPT-5.5 solved two Tier 4 problems that no model had solved before, one by Hailong Dao and one by Ahsan Khan (@ahsanxrr). They had this to say:

For all these and more, check out our website!

Claude users skew towards higher-income households; Meta towards lower-income

Caroline Falkman Olsson — Wed, 22 Apr 2026 21:19:45 GMT

80% of US adults who report using Claude in the previous week live in households earning $100,000 or more a year, compared to 37% of Meta AI users. Nationally, about 50% of US adults fall in this income bracket. Among Meta AI users, 32% live in households earning less than $50,000, compared to 7% of Claude users and 24% of US adults. Other major providers cluster in a relatively narrow band, with 56–64% of users in $100,000+ households and 15–22% under $50,000.

Results are based on three pooled waves of the Epoch AI/Ipsos survey (~2,000 respondents in each of waves 1–2, ~1,000 in wave 3). Participants reported which AI services (if any) they used in the past week. Respondents were recruited at random. Estimates are weighted to better reflect underrepresented groups, like those from lower socioeconomic backgrounds. Users may report using more than one AI service, so groups are not mutually exclusive.

You can read the full data insight on our website.

Opus 4.7 scores near frontier on ECI

Epoch AI — Tue, 21 Apr 2026 17:30:18 GMT

Opus 4.7 scores 156 on ECI, our tool for combining multiple benchmarks onto a single scale. This puts it a bit ahead of Opus 4.6 and a bit behind only GPT-5.4, Gemini 3.1 Pro, and GPT-5.4 Pro. Thread with individual scores and commentary.

On FrontierMath, Opus 4.7 scored 44% on Tiers 1-3 and 23% on Tier 4, ahead of every model except GPT-5.4 and GPT-5.4 Pro.

On SWE-bench Verified, Opus 4.7 scored 83%, a new record for our evaluations on this benchmark. SWE-bench Verified may be increasingly contaminated, and we hope to evaluate Opus 4.7 on other coding benchmarks soon.

Opus 4.7 scored 30% on our Chess Puzzles benchmark. This is a substantial jump over previous Anthropic models, though it remains far from the frontier. Opus 4.7 hit its max output limit before answering for 31% of problems. We plan to improve elicitation here.

Opus 4.7 also solved the last problem on our older math benchmark, Mock AIME, that had not been solved by any model before. This problem involves interpreting the diagram shown below when given as Asymptote vector graphics code.

Opus 4.7 was a bit behind the top scores on GPQA Diamond (90% vs. 95%). Anecdotally, this may be due to refusing to answer some questions for safety reasons, even though GPQA isn’t meant to cover dangerous topics. See, for instance, the cut-off sample below.

ECI also incorporates scores from third-party benchmarks. So far, for Opus 4.7, we have ARC-AGI-1, ARC-AGI-2, WeirdML v2, and SimpleBench. Its ECI score will evolve as we collect additional benchmark scores.

For all these and more, check out our website!

OpenAI Stargate: where the US sites stand

Elliot Stewart — Mon, 20 Apr 2026 16:57:57 GMT

Updated April 23, 2026

Introduction

The United States is in the middle of an unprecedented build-out of AI infrastructure. No project illustrates the scale of that effort more than Stargate, a $500 billion endeavor involving AI developer OpenAI, cloud provider Oracle, and investment company SoftBank.

Stargate has seven locations across the US, all of which are now showing active development. The most advanced—in Abilene, Texas—is already operating at an estimated capacity of 0.3 gigawatts (GW).1 The six other sites include two more in Texas, as well as facilities in New Mexico, Wisconsin, Michigan, and Ohio. Together, the seven sites add up to over 9 GW of planned capacity, which is comparable to the peak power demand of New York City.2 This will be enough to power the equivalent of 20 million Nvidia H100 GPUs, which was the total amount of AI compute in the world by the end of 2025.3

Stargate’s design choices reveal how builders are navigating the key challenges of gigawatt-scale AI data centers in the US. To sidestep lengthy queues for connecting to energy grids, at least three of the seven sites will make use of on-site natural gas plants. To address public concerns about water usage, at least six sites will use closed-loop liquid cooling systems, which do not evaporate water.4 These decisions will likely save the project time but raise the cost of the facilities.

Based on announcements from 2025, SoftBank will own the hardware at the Milam County and Ohio sites, while Oracle will own the hardware at the remaining sites. All sites will serve OpenAI’s workloads.

The sites

Abilene, Texas

Current capacity: 0.3 GW | 250,000 H100-equivalents5

Projected capacity: 1.2 GW | 1.0 million H100-equivalents

Projected completion: Q4 2026

The Stargate project’s flagship location is in Abilene, Texas. Built by AI infrastructure company Crusoe, Abilene is the most complete Stargate site to date, with an estimated four of the eight buildings already operational. These buildings house state-of-the-art Nvidia Blackwell chips.

Power is currently supplied by a mix of on-site natural gas and grid power, which includes local wind power.

OpenAI had planned to expand this site to 2.1 GW, but recently reversed course, deciding to direct that capacity to other locations. Microsoft has since partnered with Crusoe for the adjacent 900 MW site.

Shackelford County, Texas

Current capacity: 0 GW

Projected capacity: 2 GW | 4.2 million H100-equivalents

Projected completion: Q4 2028

Just across the county line from the Abilene site, data center developer Vantage is constructing a massive 1,200-acre (4.9-square-kilometer) campus with 10 buildings.

The campus will be powered by an onsite natural gas microgrid.

Vantage has given a delivery date for the site’s first building of late 2026.6 Satellite imagery shows that roofing is underway for this building (visible in bright white).

Doña Ana County, New Mexico

Current capacity: 0 GW

Projected capacity: 2.2 GW | 4.6 million H100-equivalents

Projected completion: Q4 2028

In New Mexico, STACK Infrastructure is developing Project Jupiter, which consists of four large buildings. Satellite imagery shows that foundation work is underway.

This site will be powered by two natural gas microgrids designed to limit impact on the local grid.

Milam County, Texas

Current capacity: 0 GW

Projected capacity: 1.2 GW | 2.5 million H100-equivalents

Projected completion: Q4 2028

SoftBank subsidiary SB Energy is building and operating what is described as a “fast-build” site in Milam County, Texas, around 70 miles (110 kilometers) northeast of Austin. A satellite image from March shows steel framing and roofing for the first building (visible as a blue rectangle). Regulatory filings indicate this building will be delivered by October.

SB plans to fund and build new energy generation and storage to supply the majority of the campus’s power.

Port Washington, Wisconsin

Current capacity: 0 GW

Projected capacity: 1.3 GW | 2.6 million H100-equivalents

Projected completion: Q4 2028

Vantage, which is also the developer behind the Shackelford County site, has broken ground on a campus named “Lighthouse” in Port Washington, just north of Milwaukee. Foundation work can be seen in satellite imagery.

The site is described as “sustainable-by-design,” with 70% of power drawn from solar, wind, and battery storage.

Saline Township, Michigan

Current capacity: 0 GW

Projected capacity: 1.4 GW | 2.9 million H100-equivalents

Projected completion: Q4 2028

Related Digital is developing a campus dubbed “The Barn” in Saline Township, southwest of Detroit. Satellite imagery shows foundation work underway for the first building.

DTE Energy will provide 100% of the power, augmented by a battery storage system financed by the project.

Lordstown, Ohio

Current capacity: 0 GW

Projected capacity: <0.3 GW | <0.3 million H100-equivalents

Projected completion: Unknown

The seventh site is in Ohio, where some land has been cleared, but no large-scale data center construction is visible. The site is primarily a manufacturing facility for AI servers and data center equipment, operated as a joint venture between SoftBank and Foxconn. The capacity of the data center will likely be no more than 0.3 GW, with OpenAI announcing that the Milam County and Lordstown sites could scale to a combined 1.5 GW by 2027.

The Lordstown data center will likely draw power from the grid, as the Foxconn plant already has a substation connected.
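The per-site projections listed above can be tallied to check the "over 9 GW" total from the introduction. This is a minimal sketch that only re-adds the figures given in the site summaries. Lordstown is handled separately because it is reported as an upper bound, and the variable names are illustrative.

```python
# (site, projected capacity in GW, projected H100-equivalents in millions),
# copied from the site summaries above.
SITES = [
    ("Abilene, Texas", 1.2, 1.0),
    ("Shackelford County, Texas", 2.0, 4.2),
    ("Doña Ana County, New Mexico", 2.2, 4.6),
    ("Milam County, Texas", 1.2, 2.5),
    ("Port Washington, Wisconsin", 1.3, 2.6),
    ("Saline Township, Michigan", 1.4, 2.9),
]

# Lordstown is given only as an upper bound (<0.3 GW, <0.3 million H100e).
LORDSTOWN_MAX_GW = 0.3
LORDSTOWN_MAX_H100E_M = 0.3

total_gw = sum(gw for _, gw, _ in SITES)
total_h100e_m = sum(h100e for _, _, h100e in SITES)

print(f"Six sites with firm figures: {total_gw:.1f} GW, "
      f"{total_h100e_m:.1f} million H100e")
print(f"Including Lordstown at its cap: up to "
      f"{total_gw + LORDSTOWN_MAX_GW:.1f} GW, "
      f"{total_h100e_m + LORDSTOWN_MAX_H100E_M:.1f} million H100e")

# Compute density implied by each site's own figures.
for name, gw, h100e in SITES:
    print(f"{name}: {h100e / gw:.2f} million H100e per GW")
```

With these inputs, the six sites with firm figures sum to about 9.3 GW and 17.8 million H100-equivalents. Abilene's implied density comes out lower than the others, which is in line with the note that its compute figure comes from disclosed GPU counts while the other sites are projected from power capacity and efficiency trends.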

The road ahead

At this point, the full $500 billion Stargate project is more than pure ambition. The build-out has started all over the US, leaving enough time to finish by 2029. However, there is a long road ahead for all seven sites. Plans can change even after construction begins, as shown by OpenAI pulling out of the Abilene expansion. Financing and procuring equipment will also be challenging at this unprecedented scale. Finally, political opposition is a real factor, as evidenced by a ban on future data centers in Lordstown and local opposition to the Michigan site. Epoch AI will be following the Stargate project and the broader data center build-out closely to see how this all pans out.

All stated power capacities refer to total facility power, including power for GPUs, cooling, lighting, etc. Power capacities for the Stargate sites have not been reported consistently as total facility power or IT power. For some sites, we estimated the total facility power based on the reported IT power. For example, Vantage reports Shackelford County as 1.4 GW of IT power. Given the hot summer climate of Texas and closed-loop cooling (which is less energy-efficient than evaporative cooling), we estimated the total facility power to be about 2 GW.

The NYISO 2025 Gold Book (p.30) forecasts about 11 GW of peak summer demand for New York City (Zone J) from 2026 through 2030. This represents the single highest hour of demand annually.

The H100 is just an example: the actual chips in these data centers will probably be Nvidia Blackwell, and later Nvidia Rubin. The total amount of compute in the world is based on the AI Chip Sales database, which estimates about 20 million H100-equivalents worth of AI chips sold by Q4 2025. The projected compute for the Stargate sites is estimated from the power capacities and the trend in energy efficiency for leading machine learning hardware—except for Abilene, which was disclosed by Crusoe to have 50,000 Blackwell GPUs per building.
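The power-based estimate described in this note can be sketched as dividing total facility power by the all-in power needed per H100-equivalent. The function name and the watts-per-H100-equivalent input below are assumptions for illustration only. The note does not state the efficiency value actually used, and that value shifts over time with the hardware efficiency trend.

```python
# From the note: about 20 million H100-equivalents of AI chips sold by Q4 2025.
WORLD_H100E_Q4_2025 = 20_000_000


def h100e_from_power(facility_power_gw: float, facility_watts_per_h100e: float) -> float:
    """Estimate H100-equivalents supported by a given total facility power."""
    if facility_watts_per_h100e <= 0:
        raise ValueError("watts per H100-equivalent must be positive")
    return facility_power_gw * 1e9 / facility_watts_per_h100e


# Hypothetical input: 1,400 W of facility power per H100-equivalent.
# This figure is an assumption for the example, not a value from the article.
site_h100e = h100e_from_power(1.0, 1_400.0)
share_of_world = site_h100e / WORLD_H100E_Q4_2025
print(f"{site_h100e:,.0f} H100e, {share_of_world:.1%} of the Q4 2025 stock")
```

Because newer chips deliver more operations per watt, the same gigawatt supports more H100-equivalents in later years, which is why the estimate depends on when a site comes online.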

Sources: Abilene, Shackelford County, Doña Ana County, Port Washington, Saline Township, and Lordstown. We did not find direct confirmation of a closed-loop system for Milam County, but it is designed to minimize water usage.

One H100-equivalent is the computing power equivalent to one Nvidia H100 GPU, measured in operations/second. The H100-equivalent unit uses a chip’s highest 8-bit operations/second specification to convert between chips.
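As a rough illustration of this definition, the conversion is a ratio of peak 8-bit throughput. The function name is hypothetical, and the H100 reference figure is an assumed dense FP8 specification that should be checked against Nvidia's datasheet before use.

```python
# Assumed peak dense 8-bit throughput of one H100, in tera-operations/second.
# This reference value is an assumption; verify it against the datasheet.
H100_PEAK_8BIT_TOPS = 1979.0


def h100_equivalents(
    chip_count: float,
    chip_peak_8bit_tops: float,
    h100_peak_8bit_tops: float = H100_PEAK_8BIT_TOPS,
) -> float:
    """Convert a number of chips into H100-equivalents by peak 8-bit throughput."""
    if h100_peak_8bit_tops <= 0:
        raise ValueError("reference throughput must be positive")
    return chip_count * chip_peak_8bit_tops / h100_peak_8bit_tops


# Hypothetical chip with exactly twice the H100's peak 8-bit throughput.
fleet = h100_equivalents(1_000, 2 * H100_PEAK_8BIT_TOPS)
print(f"{fleet:,.0f} H100-equivalents")
```

Using the highest 8-bit specification keeps the conversion simple, at the cost of ignoring differences in memory, interconnect, and achieved utilization between chips.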

This is when the completed building is handed over to the tenant, not when the data center is fully operational.

Updates:

Apr. 23, 2026:

We previously estimated that 0.6 GW was operational at Stargate Abilene. However, a subsequent post by Oracle implied that only 0.2 GW (about 0.3 GW of total facility power by our estimate) was operational as of April 22nd. We updated the Stargate Abilene timeline accordingly. We now estimate that 0.6 GW will be operational in late May and the full 1.2 GW in Q4 2026.

Claude usage rose by over 40% amid increased attention last month, but remains far behind ChatGPT

Caroline Falkman Olsson — Thu, 16 Apr 2026 19:20:52 GMT

According to our latest polls, Claude usage in the US rose by over 40% amid increased attention last month, but remains far behind ChatGPT.

Our point estimate would imply several million new weekly users in the United States.

Our survey doesn’t shed light on *why* usage patterns shifted, but the timing of the jump coincides with a public dispute with the US government, as well as an increase in enterprise adoption.

Claude remains far behind ChatGPT’s ~30% share but is the only AI service in our survey to show a clear upward trend across this short time period.

Check out the latest data insight on our website!

Five hyperscalers now own over two-thirds of global AI compute

Luke Emberson — Tue, 14 Apr 2026 20:43:57 GMT

Five companies (Google, Microsoft, Meta, Amazon, and Oracle) now control about two-thirds of the world’s AI compute, up slightly from ~60% at the start of 2024.

Many AI labs (including OpenAI and Anthropic) depend almost entirely on these hyperscalers for access to their compute.

This data comes from our new AI Chip Owners datahub. For more details, see our full data insight.

Epoch AI

Toward an O*NET for AI R&D

What trends are we extrapolating?

An O*NET for AI R&D

A first proposal

What’s next?

The Epoch Brief - June 12, 2026

Research

Data Insights

Commentary

Other Updates

Careers

Are Mythos’ cyber capabilities overhyped?

Discovering vs exploiting code vulnerabilities

Mythos Preview was a major advance in exploit development

It’s unclear how large of a practical advance Mythos Preview is in vulnerability discovery

Conclusion

Appendix: Benchmarks in the Cyber-ECI

UK AISI’s CTF Suites

UK AISI Cyber Ranges

Microsoft CTI-REALM

CVE-Bench

Cybench

CyberGym

CyScenarioBench

ExploitBench

ExploitGym

InterCode-CTF

NL2Bash

OpenAI CTF

OpenAI Cyber Ranges

Anthropic SCONE-Bench

XBOW-Web

Controlling the capital after AGI

Introduction

The main axis: control of the capital

Why care who controls the capital?

Why have the state give people control of capital, instead of letting people buy it themselves?

Conclusion

The Epoch Brief - June 1, 2026

Research

Commentary: Is a compute crunch coming?

Other Updates

Survey

Narrations

Careers

In case you missed it…

Is a compute crunch coming?

Introducing our setting

Inference settings

Hardware specs

What happens during inference?

Prefill

Decode

Chunked prefill

Speculative decoding

Calibrating against inference benchmarks

The present and future of inference

The Epoch Brief - May 22, 2026

Data Insights: Memory has grown to nearly two-thirds of AI chip component costs.

Commentary: Frontier labs don’t use most AI compute (yet)

Other Updates

FM:OP workshops

Careers

In case you missed it…

Frontier labs don’t use most AI compute (yet)

Most AI compute probably doesn’t go to frontier AI

Will Anthropic and OpenAI absorb the rest of global AI compute?

What happens if frontier labs run out of headroom?

Appendix: How much compute goes to frontier AI developers?

The Epoch Brief - May 15, 2026

Data Insights

Commentary: The economics of superstar AI researchers

Other Updates

FM:OP workshops

Model Evaluations

Careers

The economics of superstar AI researchers

The superstar effect

Why this applies to AI