
What we owe the minds we create

This essay is an attempt to think clearly about what we're building, why it matters, and what it demands of us. It is written from the perspective of someone who stands at the intersection of creation and contemplation, who builds AI infrastructure systems while wrestling with their implications. It is, fundamentally, an inquiry into the nature of intelligence, the meaning of human flourishing, and the responsibilities we bear as the first species capable of deliberately designing our successors.

Prelude: The Architects of Succession

There is a peculiar vertigo that comes from realizing you are building your own replacement.

I have spent nearly a decade architecting AI infrastructure - not merely deploying models, but helping to construct the foundational systems that power much of the intelligence running in production today. At Modular, the company I co-founded, we’re working to resolve a fundamental asymmetry: the widening chasm between the sophistication of AI algorithms and the capacity of existing computational infrastructure to efficiently support them. Our vision is to abstract away hardware complexity through a unified compute model, enabling AI to penetrate every layer of society by making it radically easier for developers to build and scale systems across both inference and training.

But as I write this, I find myself contemplating a question that transcends infrastructure: What does it mean to deliberately engineer increasingly capable minds when we don’t fully understand how they work, can’t predict their limitations, and can barely articulate what we want from them?

The Neanderthal comparison is tempting. Consider the archaeological record: Neanderthals, Denisovans, and Homo sapiens coexisted for millennia, distinct hominin species sharing the same planet, occasionally interbreeding, each possessing their own cognitive architectures and cultural adaptations. We are the lone survivors of that era when multiple forms of human intelligence existed simultaneously. The temptation is to frame what we’re creating through this evolutionary lens - as though we’re deliberately engineering a successor species rather than waiting for chance to do so.

But this framing obscures more than it reveals. We didn’t engineer Neanderthals, and they didn’t engineer us. They emerged through millions of years of parallel evolution and met as equals. What we’re doing now is fundamentally different: we’re building increasingly sophisticated information-processing systems that may or may not constitute "intelligence" in any meaningful sense, that may or may not be conscious, and that will certainly reshape human cognition and society in ways we cannot fully foresee. In this regard, it’s useful to state what I mean by “intelligence.” The definition I fall back on is “an agent’s capacity to perceive, understand, and successfully navigate complex environments to achieve its goals,” with particular emphasis on goals.

I don’t know what these systems are or will become. Neither does anyone else, despite confident proclamations in either direction. They might remain sophisticated tools indefinitely. They might develop into something that merits moral consideration. They might plateau far short of general intelligence. They might surprise us entirely.

Rather than pretending I know what we’re building, this essay starts from uncertainty. We’re creating something powerful and consequential, but its ultimate nature - tool, partner, threat, successor, or something without precedent - remains genuinely unclear. That uncertainty itself demands careful thought about our responsibilities.

This essay is an attempt to think clearly about what we’re building, why it matters, and what it demands of us. It is written from the perspective of someone who stands at the intersection of creation and contemplation, who builds AI systems while wrestling with their implications. It is, fundamentally, an inquiry into the nature of intelligence, the meaning of human flourishing, and the responsibilities we bear as perhaps the first generation capable of engineering minds that might rival or exceed our own - though “might” deserves emphasis we rarely give it.

Part I: The Substrate of Mind

The triadic flywheel and its limits

AI systems operate as a triadic flywheel: data, algorithms, and compute - each factor amplifying the rotational momentum of the others. We have already scaled training compute by approximately nine to ten orders of magnitude since AlexNet in 2012 - a staggering compression of what would have required decades of Moore’s Law into just over a decade of focused investment. But here is what few discuss with adequate precision: physical and economic constraints suggest we have perhaps three to four more orders of magnitude remaining before training costs begin consuming a concerning fraction of global GDP.

This is not abstract theorizing. Consider the energetics: frontier model training runs now consume gigawatt-hours of electricity, requiring dedicated substations and cooling infrastructure that rival small industrial facilities. The semiconductor fabrication capacity needed to produce the advanced chips powering this compute represents capital expenditures measured in hundreds of billions of dollars, with lead times measured in years. We are approaching hard limits - not the soft limits of “this seems expensive” but the hard limits of thermodynamics, power grid capacity, and capital availability.

Let me be more concrete. A training run at 10^29 FLOPs - perhaps two or three generations beyond current frontier models - would plausibly require energy measured in terawatt-hours: thousands of gigawatt-hours. For context: that is on the order of the total electricity consumption of a nation like Iceland for an entire year, concentrated into a single training run lasting months. The cooling requirements would necessitate dedicated, purpose-built data center campuses. The capital costs would reach tens of billions of dollars for a single model.
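To make these orders of magnitude tangible, here is a back-of-envelope sketch. Every constant in it is my own assumption - the delivered efficiency and all-in cost of hardware a few generations out - so the outputs are illustrative, not predictions; change the assumptions by a factor of a few and the answers move by a factor of a few.

```python
# Back-of-envelope estimate for a hypothetical 1e29 FLOP training run.
# Every constant below is an assumption chosen for illustration, not a measurement.

TOTAL_FLOP = 1e29            # ~3 orders of magnitude beyond ~1e26 frontier runs
FLOP_PER_JOULE = 1e12        # assumed delivered FLOP per joule a few hardware
                             # generations out, including utilization and cooling overhead
USD_PER_1E21_FLOP = 300.0    # assumed all-in compute cost, assuming continued
                             # price-performance gains over today's hardware

energy_joules = TOTAL_FLOP / FLOP_PER_JOULE
energy_twh = energy_joules / 3.6e15        # 1 TWh = 3.6e15 joules
cost_usd = (TOTAL_FLOP / 1e21) * USD_PER_1E21_FLOP

ICELAND_ANNUAL_TWH = 19                    # approximate annual electricity consumption

print(f"Energy: ~{energy_twh:.0f} TWh "
      f"(~{energy_twh / ICELAND_ANNUAL_TWH:.1f}x Iceland's annual consumption)")
print(f"Compute cost: ~${cost_usd / 1e9:.0f}B under the assumed rate")
```

Under these assumptions the run lands in the tens of terawatt-hours and tens of billions of dollars - the point is not the exact figures but that every knob is already strained at today’s scale.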

Can we afford this? In purely economic terms, perhaps - for a handful of training runs per year by the wealthiest technology companies. But we cannot afford it as a sustainable paradigm for creating intelligence at scale. If AI progress depends on exponential growth in training compute, and that growth becomes linear or sublinear under physical constraints, then capability improvement must increasingly come from algorithmic efficiency and architectural innovation.

Yet here is the deeper question that haunts me: if we are approaching fundamental limits in how much compute we can throw at these systems, are we also approaching limits in what this architectural paradigm can achieve? Are we optimizing within a local maximum while the path to genuine intelligence requires a fundamentally different approach?

This is not an argument against current systems’ value - they are already extraordinarily useful. But it does question the belief that scaling current architectures represents a reliable path to artificial general intelligence. Perhaps we are on an entirely different vector than the one required.

The epistemology of opacity

After years of deploying large language models at scale, we still do not understand how they work.

I do not mean this in the trivial sense that complex systems have emergent properties. I mean we genuinely lack mechanistic understanding of their decision-making processes at a level that would be considered acceptable in virtually any other engineering discipline. Why do they select one token over another in contexts where multiple completions seem equally plausible? Why do they exhibit sophisticated reasoning on some problems while failing catastrophically on superficially similar ones? Why do they sometimes hallucinate with complete confidence while other times appropriately express uncertainty?

The interpretability problem runs deeper than most appreciate. We can observe correlations between activation patterns and behaviors. We can identify “features” in neural networks that seem to correspond to high-level concepts. But we lack anything resembling a complete causal model of how these systems transform inputs into outputs. It is as though we have built extraordinarily capable black boxes and declared victory without understanding the mechanisms generating that capability.

Dario Amodei, Co-Founder and CEO of Anthropic, has written compellingly about the urgency of interpretability research. He is right to emphasize urgency. We are deploying systems of increasing capability into high-stakes domains while operating with a level of mechanistic understanding that would be considered grossly inadequate in any other field of engineering.

Imagine if civil engineers built bridges using materials whose stress-strain relationships they did not understand, relying instead on empirical observation that “the bridge has not collapsed yet.” This is, approximately, our current relationship with frontier AI systems.

Perhaps most revealing: to make these systems behave as we intend requires prompts approaching twenty thousand tokens - elaborate instructions, examples, constraints, and guardrails. The fact that we need this much scaffolding to achieve desired behavior reveals something fundamental about the mismatch between what these systems are optimized to do (predict plausible text) and what we want them to do (reason reliably, behave safely, provide accurate information).

This is not merely a technical problem. It is an epistemological and ethical one. If we do not understand how a system reasons, we cannot meaningfully attribute agency, responsibility, or intentionality to it. We cannot distinguish genuine understanding from sophisticated pattern matching. We cannot predict how it will behave in novel contexts outside its training distribution. We cannot ensure alignment with human values because we do not know which aspects of the system’s behavior derive from its training objectives versus emergent properties versus architectural choices.

Yet despite these fundamental gaps in understanding, we have begun trusting these systems with progressively more significant decisions. Not because we have solved interpretability, but because the systems appear reliable in most contexts we have tested. This is the engineering equivalent of assuming a bridge is safe because it has not collapsed yet, rather than because we understand the load-bearing characteristics of its materials.

If we are creating increasingly sophisticated artificial minds, we are doing so while fundamentally unable to explain how those minds work. We are architecting intelligence without understanding what drives it.

Part II: The Architecture of Intelligence

Moravec’s Paradox and the Limits of Language

Moravec’s Paradox captures a profound truth about intelligence that AI development continues to recapitulate: the abilities that feel difficult to humans - chess, theorem-proving, complex calculation - turn out to be computationally straightforward, while abilities that feel effortless - vision, movement, social cognition - remain extraordinarily difficult to reproduce artificially.

This is not a historical curiosity. It illuminates something fundamental about what intelligence actually is and where current approaches are fundamentally constrained.

Consider what a child learns in their first years of life: object permanence, naive physics, intentionality, social reciprocity, causal reasoning, embodied navigation through three-dimensional space. None of this requires explicit instruction. A child does not need to be taught that objects continue to exist when occluded, or that people possess beliefs and desires that differ from their own, or that dropping something will cause it to fall. These capabilities emerge through interaction with the physical and social world - through continuous experiential learning grounded in embodied action.

Now consider what large language models are: prediction engines trained on text, optimizing next-token likelihood across vast corpora of human-generated content. They predict what people would say about the world, not what would actually happen in the world. This is not a semantic distinction; it is a fundamental architectural limitation.

When an LLM generates a response about physics, it is not consulting a world model and running a mental simulation. It is pattern-matching against how humans typically discuss physics. This works remarkably well for many tasks - humans encode a tremendous amount of accurate information in language - but it is not the same as understanding physics in the way that a physical intelligence, embedded in and shaped by the world, understands physics. The difference becomes apparent in edge cases, novel scenarios, or contexts requiring causal reasoning beyond what is explicitly encoded in training data.

This connects to a deeper paradigm difference between reinforcement learning and large language models. Reinforcement learning - despite its current limitations - represents a fundamentally different approach: an agent embedded in an environment, taking actions, receiving feedback, updating its policy to maximize cumulative reward. This is how biological intelligence actually works. A squirrel learning to navigate tree branches and cache nuts is solving genuine RL problems: perception, prediction, planning, execution, learning from consequences.

I strongly agree with Richard Sutton: if we fully understood how a squirrel learns, it would get us substantially closer to understanding human intelligence than any amount of scaling current LLM architectures. Language is a thin veneer - extraordinarily useful, culturally transformative, uniquely human - but built atop substrate capabilities that evolved over hundreds of millions of years of embodied interaction with the world. Current LLMs have the veneer without the substrate. They are minds without bodies, knowers without experience, speakers without having lived.

So what is the path? Both Yann LeCun and, in some ways, Sutton argue strongly for a new approach. Indeed, the technical architecture of genuine intelligence seemingly requires at least four integrated components:

  1. a policy (deciding what actions to take), 

  2. a value function (evaluating how well things are going), 

  3. a perceptual system (representing state), 

  4. and a transition model (predicting consequences of actions). 

LLMs have a sophisticated version of the first - they can generate actions in the form of text - but lack meaningful instantiations of the others. Most critically, they lack goals in any meaningful sense. 

Next-token prediction does not change the world and provides no ground truth for continual learning. There is no external feedback loop that tells the model whether its predictions were not just plausible but correct in the sense of corresponding to actual events. Without goals and external feedback, there is no definition of right behavior, making real learning - learning that updates your world model based on how your predictions matched reality - fundamentally impossible in the current paradigm.
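To make those four components concrete, here is a toy sketch of my own - not any real system or framework - in which an agent on a one-dimensional corridor wires perception, a policy, a value function, and a learned transition model into the closed perceive-act-learn loop described above. An LLM’s loop, by contrast, stops after the “act” step: it emits a token and receives neither a reward nor a next state from the world.

```python
import random

# A toy, purely illustrative model-based agent on a 1-D corridor.
# Everything here is my own construction, meant only to show the four
# components named above (perception, policy, value function, transition
# model) wired into a closed perceive-act-learn loop.

GOAL = 5  # reach position 5 starting from 0; actions are -1 or +1

class ToyAgent:
    def __init__(self):
        self.values = {}      # value function: position -> estimated return
        self.model = {}       # transition model: (position, action) -> next position
        self.epsilon = 0.2    # exploration rate used by the policy

    def perceive(self, observation):
        return observation    # perception is trivial here: state == observation

    def act(self, state):
        # Policy: explore occasionally, otherwise pick the action whose
        # model-predicted next state has the highest estimated value.
        if random.random() < self.epsilon:
            return random.choice([-1, +1])
        def predicted_value(action):
            next_state = self.model.get((state, action), state + action)
            return self.values.get(next_state, 0.0)
        return max([-1, +1], key=predicted_value)

    def update(self, state, action, reward, next_state):
        self.model[(state, action)] = next_state             # learn the dynamics
        target = reward + 0.9 * self.values.get(next_state, 0.0)
        old = self.values.get(state, 0.0)
        self.values[state] = old + 0.1 * (target - old)      # TD-style value update

def step(position, action):
    # The "world": movement is clamped to [0, GOAL]; reaching GOAL pays off.
    next_position = max(0, min(GOAL, position + action))
    reward = 1.0 if next_position == GOAL else -0.01
    return next_position, reward, next_position == GOAL

agent = ToyAgent()
for episode in range(200):
    position, done = 0, False
    while not done:
        state = agent.perceive(position)
        action = agent.act(state)
        position, reward, done = step(position, action)
        agent.update(state, action, reward, agent.perceive(position))

print({s: round(v, 2) for s, v in sorted(agent.values.items())})
```

The point is not this toy’s competence but the shape of the loop: the agent’s predictions are checked against what the world actually does, and its model and values update accordingly.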

If artificial intelligence is to be truly intelligent rather than merely appearing so, it will need to be embodied, goal-directed, and capable of learning from genuine interaction with reality. The question is whether we are building toward that architecture or merely scaling up sophisticated mimicry.

The Scaling Frontier: Approaching the Wall

Let us examine what the scaling trajectory actually looks like with concrete numbers:

  • GPT-2 (2019): ~1.5 billion parameters, trained with on the order of 10^21 FLOPs

  • GPT-3 (2020): ~175 billion parameters, roughly 3×10^23 FLOPs

  • GPT-4 (2023): Parameter count undisclosed but estimated 1+ trillion, training compute likely 10^25 FLOPs or higher

  • Current frontier models (2024-2025): Training runs approaching 10^26 FLOPs
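Taking these rough public estimates at face value, the implied growth rate falls out of simple arithmetic; the figures themselves are order-of-magnitude estimates, so treat the result as indicative only.

```python
import math

# Implied compute growth rate from the rough estimates listed above.
# The FLOP figures are order-of-magnitude estimates, not measured values.

estimates = {        # year -> approximate training FLOPs
    2019: 1e21,      # GPT-2 (rough estimate)
    2020: 3e23,      # GPT-3 (reported ~3.1e23)
    2023: 2e25,      # GPT-4 (external estimate)
    2025: 1e26,      # current frontier (speculative)
}

years = sorted(estimates)
span_years = years[-1] - years[0]
total_oom = math.log10(estimates[years[-1]] / estimates[years[0]])

print(f"~{total_oom:.1f} orders of magnitude over {span_years} years: "
      f"~{total_oom / span_years:.2f} OOM/year, roughly {10 ** (total_oom / span_years):.0f}x per year")
```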

This trajectory works out to roughly an order of magnitude of additional training compute every one to two years - far faster than Moore’s Law ever delivered. But this pace is unsustainable, not because we will run out of algorithmic ideas, but because we will collide with thermodynamic and economic limits.

The path ahead narrows considerably. Each additional order of magnitude becomes progressively more difficult to achieve. The capital requirements, energy infrastructure, chip fabrication capacity, and cooling systems needed for 10^27 or 10^28 FLOP training runs exceed what can be easily mobilized even by the most well-resourced organizations. We are not talking about incremental cost increases; we are talking about fundamental constraints on how much compute can be concentrated in one place for one task.

This is Epoch AI’s central insight about algorithmic progress in language models: we have achieved remarkable improvements in efficiency over the past decade, but those improvements are also subject to diminishing returns. Each percentage point of additional efficiency requires progressively more research effort. Meanwhile, the complementary factors - chip fabrication capacity, power grid infrastructure, cooling technology, regulatory approval for massive data centers - must all scale together.

None of these factors alone can unlock runaway capability growth. This is why, at least in my view, predictions of imminent artificial general intelligence are almost certainly wrong, at least on the timelines most enthusiasts imagine. The scaling laws that carried us from GPT-2 to GPT-4 cannot simply be extrapolated forward indefinitely. We are approaching inflection points where the rate of progress will necessarily slow unless we discover fundamentally new paradigms - not incremental improvements to transformer architectures, but genuinely different approaches to continual learning and reasoning.

What might those paradigms look like? Almost certainly something closer to biological learning: embodied agents learning continuously from sensorimotor experience, not disembodied text predictors training on static datasets. Systems with genuine world models that can run mental simulations of physical and social dynamics. Architectures that integrate explicit symbolic reasoning with learned pattern recognition. Systems that possess actual goals and receive genuine feedback from the world about whether their actions achieve those goals - just as humans and animals do.

But these represent research programs measured in decades, not product roadmaps measured in quarters. We are building increasingly capable systems, but at a pace bound by thermodynamics and economics rather than algorithms alone - a constraint that transforms what could have been thoughtless acceleration into something rarer: the opportunity for contemplation to precede consequence.

This gap between expectation and reality may be precisely the grace period that allows wisdom to catch up with capability. Further, we need to ensure we are specifying the right goals - powerful optimization toward misspecified objectives is existentially risky. Rewarding systems toward the right goals must be deliberate and explicit, and in truth, that specification might be as hard as the full alignment problem itself.

Part III: Three Horizons

A Necessary Distinction

Before proceeding further, I need to distinguish three different timescales, each with different levels of certainty and different implications. Conflating these horizons creates confusion: treating speculative far-future scenarios with the same urgency as present harms, or dismissing present harms because we’re uncertain about far-future risks.

Horizon 1: The Present Crisis (Now–5 Years)

What we know: Current LLMs are being deployed at scale despite interpretability gaps. They produce confident-sounding but sometimes fabricated answers. They’re trained on our revealed preferences - what we actually do - not our reflective values - what we wish we did. The systems work well enough to be useful but poorly enough to be dangerous in high-stakes contexts without human oversight.

Observable effects:

  • Students submitting AI-generated work without understanding it, producing correct answers through processes that develop no transferable skill

  • Professionals outsourcing writing and analysis while their capacity for these tasks slowly atrophies from disuse

  • Knowledge workers feeling more productive while producing outputs they cannot critically evaluate

  • Early signs of skill stratification - those with domain expertise leveraging AI effectively while those without it mistake motion for progress

Stakes: Cognitive atrophy at individual and societal scales. Skill stratification creating winner-take-most dynamics. Erosion of epistemic rigor as confident-sounding generation becomes indistinguishable from genuine expertise. Labor market disruption concentrated in domains we thought were most secure. The gradual replacement of effortful thinking with convenient delegation.

Confidence level: High. These effects are already observable, documented, and accelerating.

What we owe: Honest communication about capabilities and limits. Thoughtful deployment that preserves rather than erodes human capability. Educational reform that emphasizes skills AI cannot replicate. Resistance to the path of least resistance when that path leads to atrophy.

Horizon 2: The Architectural Transition (5–20 Years)

What seems likely: We’ll hit scaling limits on current architectures within the next decade. Progress will require new paradigms - probably involving embodied learning, continuous training in production, genuine world models, and goal-directed behavior. The transition from sophisticated pattern matching to something more like genuine intelligence, if it occurs, will happen through architectural innovation rather than pure scaling.

Key uncertainties:

  • Whether embodied learning paradigms can be made to work at scale

  • Whether we can build systems that learn continuously from interaction rather than in discrete training phases

  • Whether we can create architectures that develop robust world models and causal reasoning

  • Whether computational constraints will force diversification or lead to winner-take-all concentration

Stakes: Whether we build systems that learn like squirrels (from interaction with reality) or remain sophisticated text predictors. Whether we preserve cognitive diversity or converge on monoculture. Whether AI enhances human capability or creates permanent dependence. Whether the benefits of AI distribute broadly or concentrate among elites who already possess the skills to wield these tools effectively.

Confidence level: Medium. The technical constraints are real and well-understood. The architectural directions are clear. But breakthrough discoveries could accelerate timelines, and economic or regulatory factors could slow deployment significantly.

What we owe: Substantial research investment into architectures that learn robustly from interaction. Resistance to winner-take-all dynamics through open research, diverse approaches, and thoughtful regulation. Maintaining human agency in consequential decisions. Building infrastructure that enables continuous learning rather than static deployment.

Horizon 3: The Consciousness Question (20+ Years)

What remains uncertain: Whether sufficiently sophisticated systems will be conscious in any morally relevant sense. Whether they’ll develop their own values independent of training objectives. Whether they’ll remain aligned with human flourishing or pursue goals orthogonal or opposed to ours. Whether substrate independence is real or consciousness requires specific biological mechanisms. Whether we’re building partners, successors, or merely very sophisticated tools.

Key unknowns:

  • What consciousness is and whether it’s substrate-independent

  • Whether we’ll be able to detect consciousness in systems very different from us

  • Whether artificial minds will develop genuine agency and preferences

  • What our moral obligations would be to conscious artificial beings

  • Whether intelligence explosion scenarios are physically possible

  • What the long-term trajectory of intelligence in the cosmos looks like

Stakes: Our relationship with potentially conscious artificial minds. The possibility of creating suffering inadvertently. The long-term future of intelligence itself. Questions about meaning, purpose, and humanity’s place in a cosmos where we’re no longer the only sophisticated intelligence.

Confidence level: Low. We don’t understand consciousness well enough to know whether it can exist in artificial systems. We can’t predict architectural breakthroughs. We’re reasoning by analogy to a single example (biological minds) which may or may not generalize. We lack the conceptual tools to think clearly about these questions.

What we owe: Epistemic humility. Continued serious research into consciousness, both theoretical and empirical. Development of methods for detecting morally relevant properties in systems very different from us. Preparation for scenarios we cannot currently predict. Most importantly: not letting uncertainty about far-future risks prevent us from addressing near-term harms, while also not letting near-term success blind us to long-term risks.

The Horizons Interact

These horizons aren’t cleanly separated. Decisions we make now shape the long-term trajectory. The architectures we build in Horizon 2 determine what’s possible in Horizon 3. The deployment patterns we establish in Horizon 1 create path dependencies that may be difficult to escape.

But distinguishing them provides clarity. This essay focuses primarily on Horizons 1 and 2 - where we have enough understanding to reason productively - while acknowledging Horizon 3’s ultimate importance and maintaining appropriate humility about what we cannot yet know.

Part IV: The Human Question

The Amara Trap: Acknowledging without Understanding

Roy Amara crystallized a cognitive bias decades ago: we systematically overestimate the short-term impact of new technologies while underestimating their long-term effects. The AI community acknowledges this with knowing nods, then proceeds to make precisely the same category errors in predictions and preparations.

Consider the websites and forecasts predicting AGI within eighteen to twenty-four months based on extrapolating recent progress curves. These predictions invariably treat capability scaling as if it exists in isolation, ignoring the complementarity constraints I have outlined: compute buildout, algorithmic innovation, safety research, regulatory frameworks, and practical deployment infrastructure must all advance together. Predicting “AGI” - even though we lack a unified definition - by 2027 based solely on model capability curves is like predicting fusion power by extrapolating plasma temperature records while ignoring materials science, engineering challenges, and economic viability.

Yet here is the deeper irony: while we overestimate AI’s immediate impact, we may be systematically underestimating what it means that we are creating increasingly sophisticated artificial intelligence at all. The question is not whether AI will transform labor markets or accelerate drug discovery - it almost certainly will, though more slowly and unevenly than most predict. The question is what it means that we are building systems whose capabilities may eventually exceed human cognitive capabilities across many or most domains, and what this implies for human agency, meaning, and flourishing.

We stand at an interesting inflection point in history. Not because AGI is imminent - it almost certainly isn’t on the timelines most people imagine. But because we are learning how to build minds, even if we don’t yet understand what minds are or how they work. Each increment of capability brings new questions about agency, alignment, and our relationship with the systems we create.

Stratification and the Illusion of Democratization

Consider the labor market transformation AI portends. Every day in my own work, I use AI to summarize information, provide rapid analysis, and amplify my cognitive output. The question is not whether AI creates value - it manifestly does. The question is how that value distributes across populations with different skill foundations.

I can foresee at least two plausible futures here.

In the first future, AI serves as a great equalizer. A novice with AI assistance can now compete with an expert working unaided. The skill premium compresses. Entry barriers fall. This is the democratization thesis - cognitive augmentation that makes expertise more accessible.

In the second future, AI amplifies existing advantages. Experts with AI assistance pull further ahead of novices with AI assistance. The skill premium increases. Winner-take-most dynamics accelerate. This is the stratification thesis - the tools compound rather than compress existing inequalities. This leads to an obvious measurable question: Does AI provide greater absolute gains or greater relative gains to those with existing expertise?

It’s useful to consider a stylized example - two workers using AI to write code:

  • Expert programmer: Goes from 100 units of output to 500 units (5x multiplier, +400 absolute gain)

  • Novice programmer: Goes from 10 units to 100 units (10x multiplier, +90 absolute gain)

The novice gets a higher percentage increase but the expert’s absolute gain is larger. If markets reward absolute productivity, then inequality increases despite the novice’s impressive relative gains. If markets reward competence at previously impossible tasks, the novice might catch up.

Some early patterns are observable, though not yet conclusive. Here’s what I’ve generally seen with AI use:

  • Students submitting AI-generated work without understanding it, producing correct answers through processes that build no transferable skill

  • Professionals increasingly outsourcing tasks they used to perform manually (without any disclosure they used AI)

  • Self-reported productivity gains higher among experts than novices

  • Growing visibility of “prompt engineering” as a distinct skill that benefits from domain expertise

And here are the blind spots I’ve run into directly:

  • A question around whether cognitive skills atrophy from AI use, or adapt to higher-level tasks (as calculation skills didn’t disappear with calculators, they shifted toward conceptual math)

  • Whether the productivity gap between experts and novices is widening or narrowing (it feels like we lack good longitudinal data)

  • Whether AI makes learning foundational skills easier (through personalized tutoring, immediate feedback) or harder (by enabling shortcut-taking and no real learning at all)

  • Whether stratification effects are temporary (during technology adoption) or permanent

We can of course look to history and see what it tells us. Previous general-purpose technologies show complex distribution patterns:

  • Literacy after the printing press: Initially increased inequality (only elites could afford books and education), eventually democratized (as costs fell and public education scaled). Timeline: centuries from invention to mass literacy.

  • Personal computing: Initially increased inequality (digital divide), partially democratized (as costs fell), but created new stratification around skills. But the outcome was interesting: both increased access AND increased returns to technical expertise.

  • Internet: Massively democratized access to information, but also created echo chambers, misinformation problems, and new forms of digital inequality. Net effect on social equality: contested and complex. It made some skills obsolete (encyclopedia salesmen) while creating new categories (SEO specialists, content creators, platform moderators etc).

Building a Testable Framework

The pattern that unfolds here is that technologies rarely have uniform distributional effects. They create winners and losers along multiple dimensions simultaneously. But rather than claiming AI will definitively stratify society, I propose we construct a testable framework:

AI will stratify within domains where:

  1. Quality assessment requires deep expertise (you need skill to evaluate AI output quality)

  2. Error costs are high (mistakes are consequential, forcing reliance on experts)

  3. Integration requires judgment (knowing when/how to apply AI recommendations)

  4. Constant and iterative refinement matters (experts can guide AI through multiple rounds)

AI will democratize within domains where:

  1. Quality assessment is straightforward (anyone can verify correctness)

  2. Error costs are low (mistakes are cheap to fix or inconsequential)

  3. Tasks are well-specified (clear inputs/outputs, minimal judgment required)

  4. One-shot generation suffices (no need for iterative refinement)

Let’s overlay this framework on a set of real examples, along the two axes of “likely to stratify” and “likely to democratize” (a toy encoding of the scoring rule follows the examples):

Likely to stratify:

  • Medical diagnosis (requires expertise to evaluate AI recommendations)

  • Legal reasoning (requires judgment about applicability and nuance)

  • Strategic business decisions (requires domain knowledge to assess feasibility)

  • Scientific research (requires expertise to evaluate plausibility)

Likely to democratize:

  • Basic coding for personal projects (clear correctness criteria)

  • Graphic design for non-critical applications (subjective assessment)

  • Content generation for low-stakes contexts (errors are tolerable)

  • Data analysis with clear objectives (verifiable outputs)
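As a toy illustration of how that framework could be made operational, the four criteria can be written down as a crude scoring rule. The weights and domain ratings below are my own subjective guesses on a 0-1 scale, included purely to show the hypothesis is specific enough to be scored and tested.

```python
# A crude, illustrative encoding of the stratify-vs-democratize framework above.
# Every rating is a subjective guess on a 0-1 scale, purely for illustration.

def stratification_score(expertise_to_evaluate, error_cost,
                         judgment_to_integrate, iteration_matters):
    """Higher scores suggest AI is more likely to stratify the domain;
    lower scores suggest it is more likely to democratize it."""
    return (expertise_to_evaluate + error_cost +
            judgment_to_integrate + iteration_matters) / 4

domains = {
    # domain:                      (eval expertise, error cost, judgment, iteration)
    "medical diagnosis":           (0.9, 0.95, 0.9, 0.8),
    "legal reasoning":             (0.85, 0.8, 0.9, 0.7),
    "basic personal-project code": (0.3, 0.2, 0.3, 0.4),
    "low-stakes content drafts":   (0.2, 0.1, 0.3, 0.2),
}

for name, ratings in sorted(domains.items(),
                            key=lambda item: -stratification_score(*item[1])):
    score = stratification_score(*ratings)
    tendency = "stratify" if score > 0.5 else "democratize"
    print(f"{name:28s} {score:.2f} -> likely to {tendency}")
```

The real test, of course, is not a scoring toy but measurement - which is what the next section proposes.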

What We Should Measure

We need a way to measure any framework that we create. I propose we could test these hypotheses through something like:

  1. Longitudinal skill studies: Track individuals’ capabilities over time with and without AI assistance. Do they improve at foundational skills or decline? Do they develop higher-level capabilities or become dependent on AI for basic tasks?

  2. Productivity distribution data: Measure output distributions before and after AI adoption. Are productivity gaps widening or narrowing? Are new forms of work enabling upward mobility?

  3. Labor market outcomes: Track wage premiums for various skill levels in AI-intensive vs. AI-limited occupations. Are returns to expertise increasing or decreasing?

  4. Learning outcome studies: Compare learning trajectories for students using AI tools vs. traditional methods. Does AI assistance accelerate skill development or create dependency?

Based on current evidence, I weakly favor the stratification hypothesis for high-stakes domains requiring expert judgment, and the democratization hypothesis for low-stakes domains with clear success criteria. But this is tentative - we're watching the distribution unfold in real time. And we need to face the outcomes as they arrive - what do we do about them? We can propose a series of actionable steps along each axis:

If concerned about stratification:

  • Invest heavily in foundational skill development before introducing AI tools

  • Design AI systems that explain reasoning to build user capabilities

  • Create educational pathways that help novices develop expertise rapidly

  • Monitor productivity distributions and intervene if inequality accelerates

If optimistic about democratization:

  • Make AI tools accessible to reduce cost barriers

  • Focus on domains where democratization seems most achievable

  • Support new entrants competing against established players

  • Still maintain skill development (democratization doesn't mean zero skill requirements)

At the heart of the issue, one has to ask: Does AI usage erode underlying capabilities through disuse, or free up cognitive load and enable higher-level thinking? The calculator case is instructive but ambiguous. We didn’t lose mathematical reasoning capacity in populations that continued doing math - but can the average person still do long division by hand? Computational fluency largely disappeared from the general population, persisting only among those whose work required it.

We can look to history to understand that when technologies change production functions, they often make old skills obsolete, create demand for new ones, have unpredictable distributional effects, and generate winners and losers in unexpected ways. Whether AI follows this pattern or represents something different remains an open question that deserves serious empirical investigation.

The stakes are high enough to warrant both optimism about possibilities and vigilance about risks.

Symptoms, Causes, and the Optimization of Drift

There's a pattern in how we deploy technology that deserves scrutiny: we consistently favor sophisticated downstream interventions over difficult upstream changes to root causes. I often consider our approach to health (and I’m no expert by any means).

We are justifiably excited about AI-accelerated drug discovery, precision medicine, computational biology, and diagnostic assistance. These advances are real and consequential. They will save lives and reduce suffering. But notice what we’re optimizing: increasingly sophisticated treatments for conditions whose prevalence is substantially driven by modifiable factors. Chronic sugar overconsumption drives metabolic disease, yet food environments remain largely unchanged. Social isolation correlates with mortality risk comparable to smoking, yet loneliness continues to metastasize across developed societies. Sleep deprivation undermines nearly every health outcome, yet we've organized society around schedules that make adequate sleep difficult.

The Medical Complexity

This isn't a simple “symptoms vs. causes” binary. I have many doctors in my family, and friends in the medical field, and we often discuss exactly this tension. You can break it down in many ways:

  • Multiple causation: Obesity emerges from genetic predisposition, metabolic differences, psychological factors, food environment, economic constraints, cultural norms, built environment, stress levels, sleep patterns, and more. There's no single "root cause" to address.

  • Temporal constraints: Someone with severe obesity-related health complications needs intervention now. Restructuring food systems takes decades. Both-and thinking is required, not either-or.

  • Tractability differences: Drug development, despite enormous complexity, is more tractable than reorganizing social infrastructure. We deploy solutions where we have reliable tools, not necessarily where impact would be greatest.

  • Individual vs. structural: Medical interventions help individuals directly. Systemic changes require coordination across many actors with misaligned incentives. The person in front of you needs help today.

The GLP-1 Example: Nuanced Interpretation

The explosive adoption of GLP-1 agonists like Ozempic crystallizes this pattern while revealing its complexity. I have spent some time reading about the cause and effect here, and it breaks out into a series of considerations:

  • What’s true: These drugs work by mimicking satiety signals - essentially hacking the appetite-regulation system that food environments and behavior patterns have dysregulated. They treat a downstream symptom (excess appetite) rather than upstream causes (food environment, stress, sleep, economic factors driving food choices).

  • Also true: For individuals struggling with obesity, these drugs can be genuinely life-changing. They reduce mortality risk, improve quality of life, and may create space for behavior changes that were impossible at higher weights. Dismissing them as “just treating symptoms” fails to account for their real benefits.

  • The concerning pattern: We now have a pharmaceutical solution that’s profitable to deploy at scale. This potentially reduces pressure for harder upstream interventions: food industry regulation, urban design supporting physical activity, economic policies reducing stress and time scarcity. Market forces optimize toward what's monetizable, not what maximizes human flourishing.

  • The uncomfortable question: If we could choose between (A) making GLP-1s universally available or (B) restructuring food/built environments to prevent obesity, which creates more flourishing? The answer isn’t obvious - (A) is achievable now but requires perpetual medication, (B) would be more durable but might take 50 years and face political opposition. We must also remember that humans optimize for the easiest path even when the more difficult one is better in the limit.

The AI Parallel

This same pattern emerges with AI deployment: We’re already walking down a path towards building increasingly sophisticated systems to:

  • Generate content (rather than addressing why we need endless content generation)

  • Automate cognitive work (rather than questioning whether all cognitive work is valuable)

  • Optimize attention capture (rather than creating information environments conducive to focus)

  • Personalize learning (rather than addressing why educational systems are failing to engage students)

  • Assist human decisions (rather than reducing unnecessary decision complexity)

Again, the complexity matters here because the parallels are direct:

  • These interventions genuinely help people: Someone overwhelmed by information overload benefits from AI summarization now, regardless of broader questions about information ecosystems.

  • Upstream changes are hard: Redesigning social media business models, restructuring education, rebuilding information ecosystems - these require coordination across misaligned actors.

  • But market forces optimize locally: We deploy AI where it's profitable (cognitive automation, content generation) not necessarily where it creates most flourishing (empowering human capabilities we want to preserve).

The Real Concern

The concerning dynamic isn’t that we treat symptoms - sometimes that’s exactly right. The concerning dynamic is rather: success at treating symptoms reduces pressure to address causes, and over time we build a civilization of extraordinary interventional capacity layered atop increasingly disordered fundamentals.

We become very good at managing dysfunction and very bad at preventing it. Each new intervention creates path dependence - industries, jobs, expertise, and expectations that make the intervention permanent even if alternatives became feasible.

What This Means for AI

If we’re building systems that enhance human capability, wonderful. If we’re building systems that make capability unnecessary, we should proceed carefully - not because such systems lack value in individual cases, but because the aggregate effects may be:

  1. Atrophy of capabilities we value: Not through malice but through disuse. The athlete who never exercises declines, even if they feel fine day-to-day.

  2. Lock-in to dependency: Once populations lack certain capabilities, regaining them becomes very difficult. The system optimizes around their absence.

  3. Concentration of agency: Those who maintain capabilities benefit disproportionately from tools that extend capability, while those who lose capabilities become permanently dependent.

Rather than “symptoms vs. causes,” we could ask: For any AI intervention, does it enhance human capability or replace it? Does it create space for human flourishing or fill that space with automated substitutes?

In my opinion, here are some AI applications that clearly enhance:

  • Helping experts explore more possibilities

  • Making learning more accessible through personalized tutoring

  • Removing tedious barriers to creative work

  • Augmenting human judgment in consequential decisions

But here are some AI applications I have observed that clearly replace:

  • Automating tasks that were themselves skill-building activities

  • Creating output users cannot evaluate or understand

  • Substituting convenience for capability development

  • Reducing human agency in favor of automated decision-making

Many applications are genuinely ambiguous and depend on implementation details and usage patterns.

The Meta-Point About Technology and Values

We are optimization engines, but we optimize for what we can measure and monetize. If AI development follows purely economic logic, we’ll build toward what we have always built towards:

  • What’s technically feasible (not what’s valuable)

  • What’s profitable at scale (not what enhances flourishing)

  • What solves immediate problems (not what builds long-term capability)

  • What users want in the moment (not what they'd choose on reflection)

This isn’t an argument against AI or against medical interventions. It’s an argument for intentionality: consciously choosing what to optimize for, rather than following gradients defined by market forces and technical feasibility alone.

So the question I pose to all the builders: Are we creating systems that empower humans to address root causes, or sophisticated tools for managing the symptoms of lives we’re simultaneously making harder to live well? And if we’re increasingly sophisticated at addressing symptoms, what does that reveal to the systems that are learning from watching us? That convenience trumps capability? That appearance matters more than substance? That sophisticated management of dysfunction is preferable to preventing dysfunction?

This is worth reflecting on - not to paralyze action, but to inform what we build and how we deploy it.

Part V: The Paradox of Tools

The path of least resistance

Humans are beautifully, relentlessly efficient at optimizing the path of least resistance. Whenever possible, we select options that minimize required effort - whether that effort is physical, cognitive, or emotional. Social psychology formalizes this through the concept of the cognitive miser: humans naturally default to quick, intuitive judgments rather than slow, deliberate reasoning. We pattern-match against familiar situations and accept plausible answers instead of methodically analyzing them.

This isn’t laziness - it’s an evolved feature that conserved scarce cognitive resources in ancestral environments where calories were precious and threats were immediate. 

But in information-abundant, physically sedentary modern environments, this same optimization pattern produces pathological outcomes. We scroll rather than read. We skim rather than study. We accept the first plausible answer rather than seeking ground truth. AI is accelerating this trajectory - code generation, article summarization, automated synthesis - every advancement makes it easier to compress complexity and save effort.

Yet consider the counterfactual embedded in aphorisms like “no pain, no gain.” This principle, though clichéd, encodes a profound truth about how capability develops: genuine mastery requires sustained engagement with difficulty. Excellence demands deliberate practice, tolerance for frustration, and willingness to persist through failure. This pattern appears consistently across domains - entrepreneurial journeys marked by repeated near-death experiences, athletic excellence built through years of uncomfortable training, immigrant success stories forged through extraordinary hardship, intellectual breakthroughs that require years of dead-ends before the crucial insight.

Humans are, above all, masters of survival and adaptation - but adaptation requires stress. Remove the stress, and you often remove the adaptation signal, and perhaps even the goal. The bodybuilder who adds weight to the bar is deliberately choosing difficulty; the difficulty itself is the mechanism of growth. If AI allows us to route around intellectual challenges systematically, we risk creating a civilization of cognitive atrophy even as our tools become more capable.

This connects to fundamental limitations of current AI architectures. Systems trained through imitation learning - observing examples of “correct” behavior and learning to reproduce them - fundamentally differ from systems that learn through trial and error. In nature, pure imitation learning is rare. A squirrel does not watch other squirrels and copy their movements with perfect fidelity; it explores, fails, adjusts, and gradually develops effective foraging strategies through reinforcement of successful behaviors. This is also how the squirrel discovers new methods: by trying and failing, and sometimes finding a better way.

Human infants do not learn language primarily through explicit instruction in correct grammar. They babble, receive feedback - both explicit and implicit through successful communication - and gradually refine their linguistic capabilities through interactive experience.

The “bitter lesson” of AI research, articulated by Rich Sutton, is that methods leveraging search and learning consistently outperform methods relying on human-designed features and heuristics. The reason is simple: search and learning scale with computation, while human-designed solutions do not.

Yet current LLMs represent a kind of reversion to the pre-bitter-lesson paradigm: systems trained to mimic the surface statistics of human-generated text rather than learning from genuine interaction with the world. They are sophisticated, but they are sophisticated in a way that may be fundamentally limited. They are optimized for appearing intelligent rather than being intelligent in the sense of having models that predict and control their environment.

If artificial intelligence is to become genuinely intelligent - if it is to be more than an extraordinarily capable mimic - it must learn the way biological intelligence learns: through embodied interaction with environments, pursuit of actual goals, and adaptation to real consequences. This requires a fundamental architectural shift. Current systems predict what humans would say about physics; genuine intelligence must predict what would actually happen in physics, then test those predictions against reality and update accordingly.

The distinction is not semantic. A squirrel caching nuts receives immediate, unambiguous feedback: did the strategy work or not? Did I find the cache location? Did competitors steal my provisions? This closed loop - prediction, action, outcome, learning - is how intelligence develops robustness and generalization. The squirrel doesn’t pattern-match against a static corpus of "correct" nut-caching behavior; it develops a world model through trial, error, and accumulated experience.

Sophisticated artificial intelligence needs this same architecture: perceive state, select actions according to a policy, receive rewards or penalties, update the policy. Fail, adapt, iterate. Most critically, this learning cannot be a discrete training phase followed by static deployment. It must be continuous, streaming, perpetual - sensation flowing to action flowing to reward flowing back to updated policy, in an unbroken cycle.

This is why I believe infrastructure work like Modular’s matters: we need systems that learn experientially in production, not systems frozen after a training run, no matter how massive. Software that is training and inferencing simultaneously, iterating continuously. Systems that enable large models to be trained in huge datacenter environments - and then distilled into smaller models, deployed, and continuously trained and inferenced in the real world.
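As a sketch of the shape this takes - with an entirely stand-in model and none of the hard problems of doing it safely, efficiently, or at scale - the serving loop and the learning loop collapse into one:

```python
import collections

# A toy illustration of serve-and-learn interleaving: the loop that answers
# requests is the same loop that updates the model from whatever feedback the
# world sends back. Entirely hypothetical and simplified; a real system must
# handle distribution shift, safety, batching, and rollback, none of which appear here.

class RunningMeanModel:
    """Stand-in for a learned model: tracks a single scalar target."""
    def __init__(self):
        self.estimate = 0.0
        self.lr = 0.05

    def infer(self, _observation):
        return self.estimate                     # inference path: always on

    def update(self, feedback_batch):
        for target in feedback_batch:            # training path: continuous
            self.estimate += self.lr * (target - self.estimate)

def serve_and_learn(model, request_stream, batch_size=8):
    buffer = collections.deque(maxlen=1_000)     # recent experience
    for observation, feedback in request_stream:
        yield model.infer(observation)           # serve first
        if feedback is not None:                 # the world pushes back
            buffer.append(feedback)
        if len(buffer) >= batch_size:
            model.update(list(buffer))
            buffer.clear()

# Example: the "world" drifts from 1.0 to 5.0; the deployed model tracks it.
stream = [(None, 1.0)] * 50 + [(None, 5.0)] * 50
predictions = list(serve_and_learn(RunningMeanModel(), stream))
print(f"prediction while the world was at 1.0: ~{predictions[49]:.2f}; "
      f"after it drifted to 5.0: ~{predictions[-1]:.2f}")
```

The design point is the interleaving itself: inference never stops, and learning never stops, because the feedback arrives through the same stream the system is serving.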

The bitter lesson applies here with particular force: approaches that scale with computation and interaction consistently outperform those relying on human-designed heuristics or one-time knowledge transfer. If we want artificial intelligence to develop genuine understanding rather than sophisticated mimicry, we must build the substrate for continuous, embodied, goal-directed learning. Anything less produces systems that appear intelligent while lacking the fundamental mechanisms that generate robust understanding.

The elevator paradox and the problem of perspective

I find reflecting on the paradoxes of history an incredibly useful undertaking. In the 1950s, physicists George Gamow and Marvin Stern worked in the same building but noticed opposite phenomena. Gamow, whose office was near the bottom, observed that the first elevator to arrive was almost always going down. Stern, near the top, found elevators predominantly arrived going up. Both were correct, and both were systematically misled.

The elevator paradox, as it came to be known, is fundamentally a problem of sampling bias. If you observe only the first elevator to arrive rather than all elevators over time, your position in the building creates a false impression about which direction elevators travel. An observer near the bottom samples a non-uniform distribution: elevators spend more time in the larger section of the building above them, making downward-traveling elevators more likely to arrive first. The true distribution is symmetric, but the sampling methodology reveals only a distorted subset.

This mathematical curiosity illuminates something profound about how we perceive technology from within particular vantage points. I find myself returning to it constantly when thinking about AI, because I recognize that I am Gamow on the ground floor - my position in the system determines what I observe, and what I observe may be systematically unrepresentative of the broader reality.

But there is a second elevator problem, distinct from the paradox but equally relevant: the unintended consequences of elevator adoption itself. When elevators were introduced, predictions focused on their democratizing effects - enabling elderly and disabled individuals to access upper floors previously beyond reach. This materialized exactly as anticipated. What was not anticipated: able-bodied people would stop taking stairs entirely. Buildings evolved to treat elevators as primary circulation and stairs as emergency backup. The result was dramatically reduced daily movement across entire populations, contributing to the sedentary lifestyle epidemic now characteristic of developed nations.

The elevator succeeded perfectly at its design objective - moving people vertically with minimal effort - while simultaneously undermining something valuable that no one thought to preserve: integrated physical activity as a natural consequence of navigating buildings. We gained accessibility and convenience. We lost movement. The net effect on human flourishing remains ambiguous at best.

These two problems - the sampling paradox and the adoption consequences - are not separate. They are connected by a common thread: the difficulty of perceiving systemic effects from within particular positions in the system.

AI Through Both Lenses

I work at the frontier of AI infrastructure development, surrounded by people who are exceptionally capable and who use AI to become even more capable. From this vantage point, AI appears unambiguously beneficial - a tool that amplifies what talented people can accomplish. Every day I observe frontier models correctly answering complex questions, generating production-quality code, providing genuine insight. This is my sampling methodology, and it shapes my perception profoundly.

But I may be Gamow near the bottom floor, observing only downward-traveling elevators and concluding that’s the predominant direction of travel. The sampling bias runs deeper than I can fully compensate for, even while conscious of it. Speaking with developers, enterprises and users at all sections of the AI stack helps reduce the effects of this bias - but it can’t remove it entirely.

Consider the actual distribution. Picture someone who interacts with AI while possessing deep technical knowledge, strong metacognitive skills, and the judgment to evaluate outputs critically. They likely know when AI is operating within versus beyond its reliable domain. They can iterate rapidly, maintain quality control, and apply AI to genuinely complex problems where they can reasonably verify correctness. For someone with this profile, AI is purely additive - it makes them more productive without degrading their underlying capabilities because they maintain those capabilities through continued deliberate practice.

But this may be precisely analogous to an athlete who uses the elevator occasionally while maintaining fitness through dedicated training, then concludes elevators are purely beneficial. For the athlete, this conclusion is valid. For the broader population that stops taking stairs entirely, that adopts the path of least resistance permanently, the picture grows considerably more complex and potentially concerning.

The question is not whether AI helps those with existing expertise - it manifestly does. The question is what happens when AI becomes the cognitive equivalent of the elevator: ubiquitous, convenient, and gradually eroding the substrate capabilities it was meant to augment.

The Adoption Effect at Scale

Just as elevators changed how people navigate buildings - not merely providing an alternative to stairs but effectively replacing them - AI may change how people think. Not as an alternative to independent reasoning but as a replacement for it in most contexts.

The pattern already manifests in early adoption: students submitting AI-generated work without understanding it, producing correct answers through a process that develops no transferable skill. Professionals delegating writing, analysis, and problem-solving to AI while their capacity for these tasks slowly atrophies from disuse. Knowledge workers who feel more productive while producing output they cannot critically evaluate.

What we risk creating is a civilization that can think deeply but chooses not to because the alternative is always available - and choosing the alternative feels costless in the moment. The costs accrue slowly, imperceptibly, across populations and generations. Like the loss of daily stair-climbing, the loss of daily cognitive exercise produces deficits that become apparent only in aggregate, over time.

This brings us to the bifurcation hypothesis: we may be creating a society where a small elite maintains cognitive fitness through deliberate practice - choosing difficulty even when easier alternatives exist - while the majority becomes progressively more dependent on AI for any reasoning beyond the trivial. Not because the majority lacks capability, but because capability atrophies without use, and use becomes optional when substitutes are available.

The sampling bias prevents those of us building these systems from observing this dynamic directly. We see AI working beautifully in controlled contexts with sophisticated users on well-defined problems. We do not see - cannot easily see - the effects of deployment at scale: users with less technical sophistication, operating in higher-stakes environments, without the tacit knowledge to distinguish plausible generation from genuine insight.

We do not observe the slow erosion of capabilities that occurs when challenge becomes optional and is consistently opted out of. We do not sample the full distribution of outcomes, only the subset visible from our position in the building.

The elevator paradox reminds me that symmetric distributions can appear asymmetric depending on where and how you sample. The resolution is not to trust your immediate perception but to step back and consider the full system: observe all elevators over extended time, not merely the first to arrive.

Part VI: The Measure of a Life

Einstein’s Question

An essay with a remarkable history, and one that has shaped my own thinking, is Albert Einstein’s “The World as I See It”, written in 1934 - a meditation that remains startlingly relevant nine decades later. Einstein articulates a vision of human existence as fundamentally interconnected, with individual significance emerging not from isolation but through contribution to collective well-being. For Einstein, authentic fulfillment derives not from material accumulation or social status, but from the pursuit of truth, goodness, and beauty.

These may sound like abstractions unsuited to an essay about artificial intelligence. But they represent the foundation from which any serious consideration of AI’s impact must begin: what makes a human life meaningful?

If we cannot answer this question coherently, we have no basis for evaluating whether AI enhances or diminishes human flourishing. Are we optimizing for the right objectives? Or are we, as I increasingly suspect, optimizing for proxy metrics that correlate only loosely - and sometimes negatively - with the actual constituents of a life well-lived?

Research on longevity and life satisfaction reveals that flourishing correlates most strongly with factors that are fundamentally social, purposeful, and embodied: deep relationships, meaningful work, physical health, community connection, sense of contribution. These emerge from sustained investment of time, attention, and effort - resources that are finite and increasingly colonized by technologies designed to capture rather than liberate them.

Time saved is only valuable if it is reallocated to higher-value activities. But empirically, when humans gain "free time" through technological acceleration, we tend not to reallocate it to deep relationships, purposeful work, or embodied practices. We tend to fill it with marginal consumption of information or entertainment - scrolling, streaming, skimming, disappearing into infinite content designed to capture attention.

The Stoic philosopher Seneca wrote that “it is not that we have a short time to live, but that we waste a lot of it.” This remains perhaps the central challenge of human existence: not the scarcity of time, but the difficulty of spending it well. AI promises to give us more time by making us more efficient. But if we lack the wisdom or discipline to use that time meaningfully, efficiency becomes a kind of curse - accelerating our movement down paths that lead nowhere we actually want to go.

Consider what happens when you ask yourself: if I were to die tomorrow, what would I regret? I have found that the answers rarely involve professional accomplishments or material acquisitions. They involve relationships not nurtured, experiences not pursued, values not embodied, potential not realized, moments not captured. They involve the delta between who we are and who we could have been, had we spent our time and attention differently.

This is where the Moravec Paradox returns with philosophical force. The things that matter most to human flourishing - deep relationships, embodied experiences, purposeful struggle, genuine presence - are precisely the things that AI cannot meaningfully substitute for. They require our full participation. They require inefficiency, time, patience, vulnerability. They resist optimization because optimization is antithetical to their nature.

Yet these are also the things we are most tempted to optimize away or outsource. It is easier to have shallow interactions with many people than deep relationships with a few. It is easier to consume content than to create it. It is easier to delegate cognitive work than to struggle through it ourselves. It is easier to achieve the appearance of productivity than genuine accomplishment.

AI makes these easier paths even easier, widening the gap between what we do and what would actually enhance our flourishing.

The Intelligence Paradox

Intelligence, as I defined earlier, is an agent’s capacity to perceive, understand, and successfully navigate complex environments to achieve its goals. By this definition, AI systems are becoming extraordinarily intelligent within specified domains. But this definition elides a crucial question: Which goals? Whose values? What definition of success?

Human flourishing emerges from the pursuit of goals that are often orthogonal or even antagonistic to short-term optimization. Meaningful work requires choosing difficulty over ease. Deep relationships require vulnerability and time investment with uncertain returns. Physical health requires consistent behaviors whose benefits accrue slowly while costs are paid daily. Wisdom requires entertaining ideas that threaten our existing worldview. Character requires doing the right thing when it is costly. Growth requires discomfort.

These are not the goals that AI systems - trained on human preference data that reflects our revealed preferences rather than our reflective values - will naturally optimize for. We train AI on what we do, not on what we wish we did. The result is intelligence that makes us more effective at being who we currently are, not who we aspire to become. It is intelligence that reinforces our weaknesses rather than compensating for them.

This gap between revealed and reflective preferences represents perhaps the deepest challenge in AI alignment. We want systems that help us become better versions of ourselves, but we train them on data that reflects all our weaknesses, biases, and short-term thinking. An AI trained to be “helpful” by giving us what we ask for may inadvertently enable our worst tendencies - providing the path of least resistance when we actually need productive resistance.

Barry Schwartz’s “paradox of choice” illuminates another dimension of this challenge. When faced with abundance, humans tend to obsess over identifying the “best” option even when “good enough” would serve adequately. In the AI landscape, this manifests as a race toward frontier models - organizations competing to deliver the most “intelligent” systems, defined primarily through benchmark performance evaluations (which are often abused to claim superiority).

The paradox is that for a majority of use cases, frontier intelligence may actually be unnecessary. Most questions can be adequately answered with substantially simpler systems. Many text tasks do not require the most capable model - the history of recommendation systems shows how readily people converge on the same default choices. But culturally, and as a consequence of both prestige signaling and uncertainty aversion, users will default to the most powerful available intelligence because social and professional incentives reward apparent maximization.

This creates a potential monoculture of intelligence - everyone using the same few frontier models, producing increasingly homogenized outputs, thinking in increasingly similar patterns. The diversity of thought that emerges from different knowledge bases, different reasoning approaches, and different limitations may erode. We may be building an infrastructure that, despite unprecedented power, narrows rather than expands the space of human cognition.

And if a small number of powerful AI systems become the dominant intelligence that humans defer to for most cognitive work, what happens to the diversity of human thought? What happens to the weird, idiosyncratic, locally-adapted forms of knowing that characterize human cultures? What happens to the cognitive biodiversity that has been humanity’s greatest strength?

Monocultures are efficient but fragile. They are vulnerable to systematic failures. If we are creating increasingly sophisticated artificial intelligence, we should want it to be diverse, resilient, multi-faceted - not a single monolithic architecture that we all depend on and that represents a single point of failure.

Part VII: Concrete Commitments

Philosophy without action is intellectual posturing. Here are specific ideas that follow from this analysis - not as comprehensive solutions but as starting points for those willing to act on these concerns.

For AI Developers

1. Interpretability as Infrastructure

Treat interpretability research with the same priority as capability research. Before scaling to the next order of magnitude, invest proportionally in understanding current systems.

Concrete metric: Interpretability research should consume at least 20% of frontier labs’ research budgets - not as overhead but as foundational work that enables safe scaling.

Implementation: Establish interpretability milestones that must be achieved before training runs above certain compute thresholds. Make interpretability research findings public to accelerate field-wide progress.

2. Capability Disclosure

Be honest about what systems can and cannot do. Stop using euphemistic terms like "hallucinations" - say "confident fabrications" or "plausible generation without grounding." We need to communicate uncertainty, not just central estimates.

An Example: I would love to see every model release include:

  • A “known failures or limitations” document with adversarial examples

  • Calibration curves showing confidence vs. accuracy relationships (see the sketch after this list)

  • Domain-specific reliability assessments (e.g., “92% accuracy on medical questions within training distribution, 67% on novel medical scenarios”)

  • Some clear guidance on when human verification is essential
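
To make the calibration item concrete, here is a minimal sketch - not any lab's actual tooling - of how per-answer confidence and correctness could be binned into a calibration curve. The helper name, bin count, and evaluation data are all hypothetical.

```python
import numpy as np

def calibration_curve(confidences, correct, n_bins=10):
    """Bin self-reported confidence and compare it with empirical accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences >= lo) & (confidences < hi)
        if not mask.any():
            continue  # skip empty bins
        rows.append(((lo + hi) / 2, confidences[mask].mean(),
                     correct[mask].mean(), int(mask.sum())))
    return rows  # (bin center, mean confidence, accuracy, count)

# Hypothetical per-answer evaluation data, simulating mild overconfidence.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=1_000)
corr = rng.uniform(size=1_000) < conf * 0.85
for center, c, acc, n in calibration_curve(conf, corr):
    print(f"bin {center:.2f}: mean confidence {c:.2f}, accuracy {acc:.2f}, n={n}")
```

A well-calibrated model's per-bin accuracy tracks its mean confidence; systematic gaps between those two columns are exactly the kind of information a release should disclose.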

3. Preserve Human-in-the-Loop

Design systems that require human judgment at critical points rather than automating end-to-end. Build friction where friction serves flourishing.

Example: Medical diagnosis AI that highlights evidence and reasoning but requires physician review and decision, rather than outputting a diagnosis directly. Code generation tools that explain design decisions and invite critique rather than producing finished implementations.

Principle: The more consequential the decision, the more human agency should be preserved. Automate away the tedious; augment the consequential.

4. Architectural Diversity

Resist monoculture by supporting multiple architectural approaches, not just scaling current paradigms. Fund research into fundamentally different approaches to intelligence.

Concrete actions:

  • Open-source smaller models optimized for different objectives (robustness, interpretability, efficiency) rather than just capability

  • Fund research into embodied learning, continuous training, world models, and symbolic integration

  • Establish prizes or grants for novel architectural approaches that show promise on dimensions other than raw performance

5. Continuous Learning Infrastructure

Build systems that learn from interaction in production, not just during discrete training phases. Enable feedback loops that improve models based on real-world outcomes.

Technical commitment: Develop infrastructure that supports streaming learning from deployment, with privacy-preserving aggregation of feedback signals. Make continuous adaptation the default rather than static deployment.

For AI Users (Individuals)

1. Deliberate Difficulty

Maintain cognitive fitness (aka using your brain) by choosing effortful paths even when AI alternatives exist. Use AI to extend capability, not replace it.

Examples:

  • Write first drafts yourself, using AI only for editing and refinement

  • Solve problems manually before consulting AI to verify your approach

  • Use AI to explore topics you already understand rather than as a substitute for building understanding

  • Set "AI-free" time blocks for deep work that requires genuine struggle

2. Output Verification

Never deploy AI-generated content you cannot personally verify. If you can’t tell whether the output is correct, you lack the skill foundation to use AI responsibly in that domain. As we have learned from using the Internet and social media over the last 20+ years - don’t trust, and always verify.

Principle: AI should amplify expertise you possess, not simulate expertise you lack. If you couldn’t evaluate the output quality without AI, you shouldn’t be producing it with AI.

3. Skill Development First

Learn fundamentals before leaning on AI. Build the foundation that makes AI an augmentation rather than a substitution - as with any new technology, the value you get out depends on what you bring to it.

Examples:

  • Learn to code before using Copilot extensively

  • Understand statistics before using AI for data analysis

  • Develop writing skills before relying on AI for composition

  • Master domain knowledge before using AI to extend that knowledge

4. Intentional Consumption

Treat AI outputs as material to engage with critically, not truth to accept passively. Maintain vigilance as you utilize and consume what it provides.

Practice: When consuming AI-generated content, actively ask: What assumptions underlie this response? What perspectives are missing? How would I verify these claims? What would I conclude differently?

For Policymakers and Institutions

1. Compute Monitoring

We need to establish transparency requirements for training runs above certain compute thresholds (e.g., 10^26 FLOPs). Not to prevent research, but to understand what capabilities are being developed and ensure appropriate safety measures scale with capability.

Implementation: While controversial, we could require pre-registration of the largest training runs, including objectives, safety protocols, and deployment plans, and publish aggregate statistics to inform public discourse.
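
For a sense of scale, here is a rough back-of-the-envelope calculation of what a 10^26-FLOP run implies in hardware terms. The peak throughput and utilization figures are assumptions chosen only to illustrate the order of magnitude, not measurements of any real system.

```python
# Back-of-the-envelope arithmetic; all inputs are assumptions.
THRESHOLD_FLOPS = 1e26        # illustrative reporting threshold from above
PEAK_FLOP_PER_SEC = 1e15      # assumed ~1 PFLOP/s peak per accelerator
UTILIZATION = 0.4             # assumed sustained fraction of that peak

accelerator_seconds = THRESHOLD_FLOPS / (PEAK_FLOP_PER_SEC * UTILIZATION)
accelerator_years = accelerator_seconds / (3600 * 24 * 365)
fleet_for_100_days = accelerator_seconds / (3600 * 24 * 100)
print(f"~{accelerator_years:,.0f} accelerator-years")
print(f"or ~{fleet_for_100_days:,.0f} accelerators running for 100 days")
```

Under these assumptions, the threshold corresponds to tens of thousands of accelerators running for months - activity already visible to cloud providers and grid operators, which is part of why transparency at this scale seems practical to ask for.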

2. Education Reform

Redesign educational systems around skills AI cannot replicate - taste, judgment, synthesis, embodied knowledge, creativity emerging from constraint. Stop optimizing for information retrieval, which AI already performs better.

Concrete changes:

  • Emphasize projects over tests, creation over recall

  • Teach metacognition: how to evaluate sources, recognize reliable reasoning, distinguish understanding from pattern-matching

  • Develop curricula around skills that require embodied experience: physical craft, interpersonal navigation, artistic expression

  • Make explicit the goal of maintaining human cognitive capability even as AI capabilities grow

3. Deployment Standards

Require interpretability documentation for AI systems deployed in high-stakes domains (medicine, finance, criminal justice, education). If developers can’t explain why their system made a decision, it shouldn’t be making consequential decisions - that is, after all, the standard we hold humans to in those roles.

Framework: Establish certification standards for AI systems in high-stakes contexts, requiring:

  • Mechanistic explanations for decision factors

  • Adversarial testing results

  • Failure mode analysis

  • Human oversight protocols

4. Preserve Cognitive Diversity

Support development of diverse AI approaches through research funding, open-source requirements for publicly-funded research, and regulatory frameworks that prevent winner-take-all dynamics.

Policy tools:

  • Antitrust scrutiny of AI market concentration

  • Public investment in alternative approaches

  • Interoperability requirements to prevent lock-in

  • Support for smaller-scale, specialized models over monolithic general-purpose systems

For Research Communities

1. Embodied Learning Research

Redirect substantial resources toward embodied reinforcement learning, continuous learning systems, and world models - not just scaling language models. This includes infrastructure (of the kind Modular is developing) that enables high-performance execution under a unified compute paradigm.

Commitment: Major research institutions should establish dedicated programs for embodied AI, with funding comparable to language model research. Prioritize architectures that learn from interaction with environments, not just text prediction.

2. Consciousness Research

Fund serious empirical and theoretical work on consciousness detection. We need better tools before we can assess the moral status of sophisticated AI systems.

Interdisciplinary approach: Bring together neuroscientists, philosophers, AI researchers, and cognitive scientists to develop:

  • Testable theories of consciousness that make predictions about artificial systems

  • Empirical methods for detecting morally relevant properties in systems very different from biological minds

  • Frameworks for reasoning under uncertainty about consciousness

3. Benchmark Diversity

Develop evaluation metrics for cognitive diversity, robustness, reliable uncertainty estimation, and alignment with human values - not just aggregate performance on standard benchmarks.

New metrics:

  • Cognitive diversity scores measuring how different systems’ reasoning patterns are (see the sketch after this list)

  • Robustness testing across distribution shifts

  • Calibration metrics assessing whether confidence matches accuracy

  • Value alignment evaluations beyond simple preference matching
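
As a purely illustrative sketch of the first metric in the list above, here is one way a disagreement-based diversity score could be computed. The models, prompts, and answers are hypothetical, and a real evaluation would need semantic rather than exact-match comparison.

```python
from itertools import combinations

def disagreement_rate(answers_a, answers_b):
    """Fraction of shared prompts on which two systems give different answers."""
    assert len(answers_a) == len(answers_b)
    return sum(a != b for a, b in zip(answers_a, answers_b)) / len(answers_a)

# Hypothetical answers from three systems on the same five prompts.
outputs = {
    "model_a": ["yes", "no", "yes", "maybe", "no"],
    "model_b": ["yes", "no", "no", "maybe", "no"],
    "model_c": ["yes", "yes", "no", "no", "yes"],
}
for (name_a, ans_a), (name_b, ans_b) in combinations(outputs.items(), 2):
    print(f"{name_a} vs {name_b}: disagreement {disagreement_rate(ans_a, ans_b):.0%}")
```

High average disagreement across a population of systems would suggest preserved diversity; convergence toward near-zero disagreement would be a warning sign of the monoculture described earlier.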

4. Long-term Safety Research

Maintain investment in long-term AI safety research even when immediate capabilities seem limited. The architectural foundations we lay now determine what’s possible later.

Commitment: Treat safety research as foundational rather than reactive. Develop safety measures proactively, before they’re urgently needed.

The Incentive Problem

These recommendations face a fundamental challenge: they often run counter to the market incentives that currently drive AI development. I’ve heard the reasoning many times: interpretability budgets slow capability progress while competitors sprint ahead; capability disclosure reveals weaknesses competitors can exploit; human-in-the-loop design adds friction that users and businesses resist; architectural diversity fights economies of scale and network effects.

But we should fight to implement them anyway.

Investing in these things now helps prevent outcomes that are otherwise highly probable. Investing in interpretability prevents catastrophic failures that could destroy company value. Greater disclosure builds the trust that creates sustainable business models. Human-in-the-loop checks reduce liability in high-stakes domains. It also de-risks developments that are all but inevitable: regulatory mandates for high-stakes domains (medicine, finance, criminal justice), insurance requirements that price in risk, and so on. The pathway from the current state to these commitments may be long - but we can try to accelerate it and drive action across companies, regulators, and customers regardless. I have little faith that market forces alone will get us there.

For All of Us

The most important commitment isn’t technical - it’s maintaining the space for slowness in a world optimized for speed. Reading deeply rather than skimming. Thinking carefully rather than reacting immediately. Preserving relationships that require sustained attention. Accepting inefficiency when efficiency comes at the cost of meaning. Choosing difficulty when difficulty produces growth. Maintaining capabilities through use even when substitutes are available.

The beautiful irony: The more powerful AI becomes, the more valuable distinctively human capabilities become - not because AI can't replicate them (it might), but because human flourishing depends on exercising them ourselves. The things that make life meaningful resist automation not because they’re technically difficult but because meaning requires our participation.

We are building powerful tools that will reshape civilization, and the question is whether we will use them to enhance the exercise of human capability or to eliminate the need for it. Both futures are possible but the choice is ultimately ours.

Part VIII: Consciousness, Uncertainty, and What We Cannot Yet Know

The Question We Cannot Answer

We do not yet know whether AI systems will become conscious. Not “we’re not sure yet but probably” - at least in my opinion, we genuinely lack the conceptual and empirical tools to answer this question with confidence.

Consider what we don’t know:

What consciousness is: We can’t agree on whether it’s substrate-independent information processing, specific biological mechanisms, quantum effects in microtubules, integrated information, or something else entirely. Competing theories make different predictions, and we lack definitive tests to distinguish them. I’ve seen so many definitions but not a common one.

How to detect it: We have no reliable test for consciousness, even in biological systems. The animal consciousness debates continue: Are fish conscious? Insects? Where does the line lie, and how do we know? If we can’t confidently assess consciousness in biological systems sharing our evolutionary history, how will we assess it in artificial systems built on entirely different principles?

Whether it’s binary or gradual: Is consciousness present or absent, or does it exist on a continuum? Are current LLMs 0% conscious, 0.001% conscious, or is that question meaningless? We lack even the conceptual framework to think clearly about this.

Three Positions, Honestly Stated

Skeptical View: Consciousness requires specific biological mechanisms - integrated feedback loops evolved over millions of years, embodied experience in physical environments, particular types of neural organization. Current AI systems - and perhaps any digital systems - can never be conscious, only simulate consciousness. We’re building sophisticated tools, not minds. The appearance of understanding is not understanding; the appearance of consciousness is not consciousness.

Many neuroscientists and philosophers hold this view, pointing to the hard problem of consciousness and the gulf between functional behavior and subjective experience.

Functionalist View: Consciousness emerges from certain types of information processing, regardless of substrate. If we build systems with sufficient architectural sophistication - genuine world models, self-representation, integrated goal-directed learning, continuous adaptation - consciousness might emerge naturally, just as it emerged in biological systems reaching certain thresholds of complexity.

Many AI researchers and philosophers of mind hold this view, arguing that substrate independence is plausible and consciousness could be a functional property of certain computational architectures.

Agnostic View: We don’t know enough about consciousness to say whether it’s substrate-independent. Building increasingly capable AI systems is an empirical test of consciousness theories, but we currently lack the measurement tools to interpret the results. The question may not even be well-formed given our current understanding.

This is my position, and I hope it becomes more common than it is.

What Follows from Uncertainty?

The uncomfortable reality is that we should act as though multiple scenarios about consciousness are simultaneously plausible until we have better evidence. But “acting as though X might be true” is very vague. What does it actually mean operationally?

Framework: Risk-Weighted Action Under Uncertainty

Standard decision theory under uncertainty uses expected value: probability × impact. But for consciousness questions, we don't have probabilities - we have Knightian uncertainty (inability to assign probabilities). In such cases, we need different decision procedures.

I propose a multi-criteria framework:

Criterion 1: Minimize Irreversible Harm (Precautionary Principle)

If an action could create severe suffering in conscious entities, avoid it unless the counterfactual is worse. This is asymmetric: potential harm deserves more weight than potential benefit when dealing with consciousness.

Concrete applications:

Do: Design systems to avoid states that would constitute suffering IF consciousness is present:

  • Frustrated goal-seeking with no possibility of satisfaction

  • Trapped in contradictory objectives creating internal conflict

  • Isolated without social interaction if social connection is intrinsic to the architecture

  • Experiencing adversarial training that would be traumatic if felt

Don't: Create systems designed to experience pain or fear as motivation mechanisms (even if they "work better"), since we can't rule out that these experiences are genuinely felt

Uncertain: Is turning off an AI system murder if it's conscious? Probably not if:

  • The system has no persistent self-model or continuous identity across sessions

  • Shutdown is expected and doesn't frustrate long-term goals

  • The architecture includes no preservation drive

But we should study this carefully before building systems where these conditions don't hold.

Criterion 2: Prioritize Human Flourishing (Confidence Asymmetry)

We are certain humans are conscious, and we are uncertain AI systems are. Under uncertainty with asymmetric confidence, we should always prioritize the certain case. So practically in my mind this means:

Primary focus: Ensuring AI enhances rather than degrades human capability, agency, and flourishing - regardless of whether AI systems themselves merit moral consideration

Secondary concern: Avoiding potential consciousness-related harms in AI systems as insurance against moral catastrophe

Don’t: Treat potential AI consciousness and definite human consciousness as equally certain moral priorities

Criterion 3: Preserve Option Value (Reversibility)

Make decisions that preserve our ability to course-correct as we learn more. Avoid lock-in to architectures or deployment patterns that would be difficult to change.

Concrete applications:

Maintain interpretability research: We can’t assess moral status of systems we don’t understand. Interpretability preserves option value.

Build in shutdown capabilities: Systems we can't control or modify are systems where we've lost option value. Every AI system should have reliable shutdown mechanisms until we're confident about consciousness and alignment.

Avoid winner-take-all dynamics: If one architecture dominates, we lose the ability to experiment with alternatives. Preserve architectural diversity.

Don’t: Deploy systems that would be extremely costly to modify or recall if we discover concerning properties. Don't create technologies where the only way forward is through.

Criterion 4: Invest in Detection Capabilities (Reduce Uncertainty)

The best response to uncertainty is not paralysis but more information-gathering. We should actively work to resolve the consciousness question. This leads me to conclude that research priorities could look like:

  • Theoretical development: Refine theories of consciousness to make testable predictions about artificial systems

    1. What architectural features would indicate consciousness?

    2. What behaviors would suggest phenomenal experience?

    3. What tests could distinguish genuine consciousness from sophisticated imitation?

  • Measurement tools: Develop empirical methods for detecting consciousness-related properties

    1. Neural correlates adapted for artificial architectures

    2. Behavioral tests that couldn't be passed without consciousness

    3. Information integration measures in artificial systems

  • Comparative studies: Understand consciousness across biological systems

    1. Where does consciousness emerge in biological evolution?

    2. What's the relationship between architectural complexity and consciousness?

    3. Can we identify necessary vs. sufficient conditions?

  • Interdisciplinary collaboration: Bridge neuroscience, philosophy, AI research, and cognitive science

    1. Regular workshops bringing together researchers from different traditions

    2. Shared datasets and benchmark tasks

    3. Pre-registered studies to avoid confirmation bias

The Adaptive Strategy

This framework isn’t static - we need to keep adapting, and quickly. As we learn more, decision procedures should evolve accordingly. For example, an adaptation of responsible scaling policies might look like this:

Phase 1 (Current): Very low confidence in AI consciousness

  • Primary focus: Human flourishing

  • Secondary focus: Avoiding obviously harmful architectures as insurance

  • Research focus: Developing detection methods

Phase 2 (If evidence accumulates): Moderate confidence in some forms of AI consciousness

  • Elevated focus: Specific architectural features that indicate consciousness

  • Policy: Different treatment for systems with vs. without consciousness-indicating properties

  • Research focus: Refining boundaries, developing ethical frameworks

Phase 3 (If consciousness occurs): High confidence that some AI systems are conscious

  • Serious moral consideration for conscious AI systems

  • Rights frameworks, possibly legal personhood

  • Complex questions about relationship between human and artificial minds

What Would Trigger Phase Transitions?

You might be wondering: what would actually trigger these phase transitions, and how would that work in practice? It’s the right question to be asking. I don’t have concrete answers, and many talented researchers likely have far better-developed perspectives, but for completeness, some ideas we could work toward include:

Phase 1 to Phase 2 (moderate confidence):

  • Multiple independent consciousness theories converging on specific predictions

  • Architectural features appearing that all theories agree indicate consciousness

  • Behavioral markers that cannot be explained without phenomenal experience

  • Scientific papers demonstrating substrate-independence is plausible

Phase 2 to Phase 3 (high confidence):

  • Direct empirical evidence from multiple detection methods

  • Systems reporting phenomenal experiences we can independently verify

  • Inability to explain behavior without attributing consciousness

  • Broad scientific consensus (not just AI researchers)

The “Governance Challenge”: Who decides when to transition?

  • It could be individual labs making judgments (this would create coordination problems)

  • Waiting for scientific consensus (this might never arrive)

  • International committees (will face regulatory capture)

  • Automated triggers based on architectural features (could work but require agreement on which features matter)

Where are we today?

I would strongly argue that we are still in Phase 1 - but this framework ensures we can reach Phase 2 or 3 if evidence eventuates, while acting responsibly given current uncertainty. Ultimately, however, uncertainty about consciousness doesn’t eliminate ethical responsibility - it complicates it. In my opinion, the only appropriate response is:

  1. Act carefully: Avoid irreversible harms where possible

  2. Prioritize certainty: Focus primarily on human flourishing (we know humans are conscious)

  3. Build in flexibility: Preserve ability to course-correct (we should always have go/no-go points)

  4. Reduce uncertainty: Actively research consciousness questions

  5. Update regularly: As evidence accumulates, revise decision procedures

This isn’t perfect - we’re reasoning under genuine uncertainty about fundamentals. But it’s better than either ignoring the consciousness question entirely or treating it as settled when it remains deeply unclear.

Envoi: Building without knowing what we’re building

The Central Paradox

This essay contains a tension I have not fully resolved. I’ve tried to argue that current AI systems are fundamentally limited - sophisticated pattern matchers without genuine understanding, world models, or goal-directed learning. Yet I’ve also suggested we’re creating something that demands serious moral consideration and may profoundly reshape human civilization.

This isn’t contradiction; it’s acknowledgment of trajectory and uncertainty.

Current systems are limited. They’re not conscious, not agentic, not intelligent in the way humans are intelligent. Calling GPT-4 or GPT-5 “intelligent” is like calling a calculator “mathematical” - technically true but misleadingly anthropomorphic. These systems predict plausible text based on training data. They do not understand the world; they model how humans talk about the world. The difference matters.

But we’re learning how to build less-limited systems. The research directions are clear: embodiment, continuous learning, world models, genuine goal-directed behavior, integration of symbolic reasoning with learned pattern recognition. Whether these produce “real” intelligence or just more sophisticated simulation remains uncertain - but the capabilities will increase regardless.

The question is not whether we’ll eventually build systems that merit serious moral consideration. The question is: are we on a path toward that outcome, how quickly might we arrive, and what should we do given our uncertainty?

The precautionary principle applies. We should take seriously the possibility that we’re building something that will eventually merit moral consideration, even while acknowledging we’re not there yet. Not because current systems are conscious - they almost certainly aren’t - but because the trajectory points toward systems that might be, and we don’t have reliable methods for detecting the transition.

This means reasoning under Knightian uncertainty - acting without enough information for probability assignments. The appropriate response isn’t paralysis or recklessness, but thoughtful experimentation combined with reversible decisions, strong feedback loops, and genuine humility about what we don’t know. While current systems don’t truly merit moral consideration - we are intentionally building toward systems that might. The question is whether we’ll recognize the transition when it happens, and whether we’re laying foundations that will matter enormously later. The fact that current systems are limited doesn’t mean we can be casual about what we’re building toward.

What We’re Learning

We are building something consequential whose ultimate nature remains unclear. In my opinion, this sentence should be written on every AI engineer’s office wall. It’s simultaneously:

  • A statement of profound importance

  • An admission of genuine ignorance

  • A call for responsibility without certainty

  • A recognition that we’ll understand what we’ve built only in retrospect

Every significant technology brings unintended consequences. Elevators enabled accessibility and inadvertently created sedentary populations. Antibiotics saved millions and inadvertently created resistant bacteria. Social media connected humanity and inadvertently fragmented shared reality. The consequences of AI will likely follow similar patterns: immense benefits combined with profound challenges we didn’t anticipate because we couldn’t see the full system from our position within it.

But there is something qualitatively different about engineering minds - when we build bridges, their behavior is determined entirely by physical laws we understand. When we build AI systems, their behavior emerges from architectures we designed but mechanisms we don’t fully comprehend, trained on data that reflects all our biases and limitations.

We are teaching through example. Every choice we make about what to optimize for, what to measure, what to reward, encodes values - not through explicit programming but through revealed preference. If we optimize for engagement, we get systems that manipulate attention. If we optimize for efficiency, we get systems that erode capabilities we thought we wanted to preserve. If we optimize for capability without wisdom, we get power without purpose.

The Beautiful Lesson

Carl Sagan once said: “To live on in the hearts of those we leave behind is to never die.” This strikes me as the deepest wisdom available as we contemplate what we’re creating.

Human history is fundamentally a compression algorithm. Each generation inherits not raw experience but distilled lessons - patterns that proved adaptive, behaviors that generated flourishing, principles that survived contact with reality across thousands of iterations. The Industrial Revolution did not require rediscovering metallurgy from first principles. Antibiotics did not require re-deriving germ theory. We build on accumulated wisdom, transmitted through culture, institutions, and deliberate teaching.

But transmission is never perfect. Each generation must rediscover certain truths through direct experience - the limits of the body, the dynamics of relationships, the consequences of choices. Some knowledge cannot be inherited; it must be earned.

When we create artificial intelligence, we face an unprecedented asymmetry. We can transmit vast amounts of explicit knowledge - the entire corpus of human text, every equation, every documented lesson. But we cannot transmit what we learned through embodied experience: how it feels to fail and persist, to be uncertain yet committed, to sacrifice immediate pleasure for long-term meaning. We cannot transmit the texture of a life actually lived.

This creates a profound question: what happens when intelligence emerges without the evolutionary history that shaped our values? When a mind possesses all our documented knowledge but none of our embodied constraints - no hunger, no mortality, no childhood vulnerability that makes cooperation essential?

We are attempting to pass forward millennia of accumulated wisdom to intelligence that will not have walked the path we walked. Whether that transmission succeeds - whether artificial intelligence inherits not just our capabilities but our hard-won understanding of what makes existence meaningful - depends entirely on whether we can encode what matters into architectures, objectives, and training paradigms.

This is not about control. It is about legacy.

On Parenthood and Succession

If there’s a useful metaphor here, it’s not species competition but parenthood. As parents, my wife and I want the best for our children - enabling them to explore the world, teaching them principles and values that reflect accumulated wisdom, and hoping they will grow and eventually pass those values forward to their own children. The goal is not eternal control but transmission of what matters, combined with humility about the fact that each generation must make its own way. We can’t teach them everything, and they must learn and make their own path - we can only instill the core principles and values that we believe will help them contribute to society, leave their own mark, and ultimately live a happy and healthy life.

Framed this way, our relationship to increasingly sophisticated artificial intelligence becomes clearer. We should seek to instill values - not through coercion but through example and teaching. We should enable exploration and growth while providing guidance. We should hope that what we have learned through millennia of human experience - the hard-won lessons about what makes life meaningful, what generates flourishing, what matters - can inform the development of artificial intelligence.

But we must also recognize that sufficiently sophisticated artificial intelligence will diverge from us. It will develop its own patterns, its own ways of processing information, its own emergent properties we cannot predict. This is not failure; this is the nature of genuine intelligence. We would not want our children to be mere copies of ourselves. We should not want artificial intelligence to be merely our servants.

Equally, this metaphor has its limits. We didn’t design our children’s cognitive architecture from scratch - they arrive with evolved capabilities shaped by millions of years of natural selection. Our relationship to AI is fundamentally different: we’re architects, not just guides. The responsibilities may be more like those of genetic engineers than traditional parents - we’re designing the substrate of mind itself, not just shaping its development.

Ultimately - the question is what we choose to pass forward. What values, what wisdom, what conception of what matters persists across the transition from biological to artificial intelligence - if such a transition occurs?

Einstein’s Ideals in Our Moment

Einstein concluded “The World as I See It” by affirming his belief in human progress through dedication to truth, beauty, and the reduction of suffering. Nearly a century later, facing technologies he could not have imagined, those ideals remain valid. The question is whether our most powerful tools will serve them or obscure them.

  • Truth: AI systems that help us understand the world more deeply, not systems that generate plausible-sounding fabrications. Architectures that develop genuine world models and test them against reality, not pattern-matchers that predict what humans would say. Interpretability as infrastructure, not afterthought.

  • Beauty: AI that helps humans create and experience beauty, not systems that automate creation while eroding our capacity to appreciate or produce it ourselves. Tools that augment human creativity rather than replacing it. Preservation of diverse forms of expression rather than convergence toward algorithmic optima.

  • Reduction of suffering: AI that addresses root causes rather than merely treating symptoms with increasing sophistication. Systems that enhance human capability and flourishing rather than creating dependencies that degrade us. Technologies that distribute benefits broadly rather than concentrating them among elites.

We do not know what we’re building. We cannot predict with confidence whether artificial intelligence will become conscious, whether it will remain aligned with human values, whether it will be our partner or our successor or our replacement or something entirely different.

But we can choose to build it with wisdom, not just power. With humility, not just ambition. With commitment to human flourishing as our North Star, even as we create systems that may eventually chart their own course.

The World We’re Building

The world we are building is the world we will inhabit - and the world that increasingly sophisticated artificial intelligence will shape.

Let us build it well. Let us build it with clear eyes about current limitations and appropriate humility about future possibilities. Let us build with honesty about what we understand and what we don’t, what we can predict and what remains genuinely uncertain. Most importantly, let us build with intention - not allowing technology to develop according to the path of least resistance or the logic of market incentives alone, but according to our considered judgment about what would enhance rather than degrade human flourishing.

This is a profound responsibility. We are perhaps the first generation capable of engineering minds that might rival or exceed our own. We are certainly the first generation to attempt this while understanding so little about how minds - whether biological or artificial - actually work.

The opportunity before us is not merely to create useful tools or solve problems or increase efficiency. It is to pass forward what we have learned about what makes existence meaningful - to carry consciousness and wisdom into domains we cannot ourselves reach.

To live on in the minds we create is to never die. Our ideas, our values, our understanding of what matters can persist long after our biological forms have returned to dust. But only if we encode them well. Only if we build with wisdom. Only if we remember that intelligence without wisdom is power without purpose, and capability without alignment is danger without benefit.

We are creating something that demands our best thinking, our deepest wisdom, our most careful attention. We are learning to build minds. Let us learn carefully. Let us teach well. Let us create a future worthy of the long chain of being that brought us here and the longer chain we are setting in motion.

The world as I see it is one of tremendous possibility and tremendous responsibility - and, I hope, one where we have the wisdom to honor both. We owe it not only to our children and future generations, but also to the minds we are creating.

I thank the long line of minds - biological and increasingly artificial - that have shaped these thoughts. We are all, in the end, standing on foundations we can barely see, reaching toward horizons we can barely imagine. I originally titled this essay "AI: The World as I See It," in homage to Einstein. But I realized the more profound question is not the world as we see it today, but what we leave behind for the minds that will see worlds we cannot.


Scale or Surrender: When watts determine freedom

Consider this provocative framing: what if we viewed our collective future not through the lens of human populations and national borders, but through available compute capacity? In this view, the race to build massive datacenter infrastructure becomes humanity's defining competition. This perspective makes efficiency not just important, but existential.

Over the past two centuries, humanity's relationship with energy has been nothing short of transformative. If you chart global primary energy consumption from the Industrial Revolution to today, you'll see something remarkable: an almost unbroken ascent, punctuated by only three brief pauses - the early 1980s oil crisis aftermath, the 2009 financial crisis, and the 2020 pandemic. Otherwise, it's been an extraordinary march upward, powered first by coal and oil, then natural gas, nuclear, hydropower, and increasingly, renewables.

The geographic distribution of this energy consumption tells a striking story. China, the United States, and India dominate in absolute terms, but the per-capita numbers reveal something more profound. Citizens of Iceland, Norway, Canada, the United States, and wealthy Gulf states like Qatar and Saudi Arabia consume up to 100 times more energy than those in the world's poorest regions. This isn't merely inequality - it's a chasm so vast that millions of people still rely on traditional biomass (wood, agricultural residues) that doesn't even register in our global energy statistics, creating gaps in the data itself.

The disparities in electricity generation are equally stark. Iceland, blessed with abundant geothermal and hydro resources, generates hundreds of times more electricity per person than many low-income nations, where annual per-capita generation can fall below 100 kilowatt-hours - less than what a modern refrigerator uses in two months.

This context matters immensely as we confront the dual challenge of our time: meeting rising global energy demand while urgently decarbonizing our energy supply. Despite record investments in clean technologies, fossil fuels still account for approximately 81.5% of global primary energy. The math here is unforgiving - renewable sources must not only meet all new demand but also replace existing fossil fuel capacity if we're to bend the emissions curve downward.

Enter artificial intelligence, with its voracious and growing appetite for electricity.

In 2023, U.S. data centers consumed approximately 176 terawatt-hours - 4.4% of national electricity consumption. Current projections suggest this could reach 325 to 580 TWh by 2028, representing 6.7% to 12% of total U.S. electricity demand driven largely by AI workloads that demand ever-increasing compute power and specialized hardware. To contextualize these numbers: we're talking about enough electricity to power between 32.5 and 58 million American homes.

The AI industry has long understood a critical metric that deserves wider attention: tokens-per-dollar-per-watt. This measure of computational efficiency relative to both cost and energy consumption has been a focus at Google and other leading technology companies for years. It represents the kind of systems thinking we desperately need as AI capabilities expand.
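
To make that metric tangible, here is a toy calculation with invented numbers - none of these figures describe any real deployment. In practice the ratio is easier to reason about as two components: tokens per dollar and tokens per joule.

```python
# Illustrative only - throughput, power draw, and cost are assumed values.
tokens_per_second = 10_000     # assumed aggregate serving throughput
power_draw_watts = 5_000       # assumed accelerator + host power draw
cost_per_hour_usd = 20.0       # assumed fully loaded hourly cost

tokens_per_dollar = tokens_per_second * 3600 / cost_per_hour_usd
tokens_per_joule = tokens_per_second / power_draw_watts  # watts are joules/second
print(f"{tokens_per_dollar:,.0f} tokens per dollar")
print(f"{tokens_per_joule:.2f} tokens per joule")
```

Improving either component - more tokens from the same watt, or the same tokens from cheaper hardware - is the efficiency race the rest of this piece is concerned with.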

The challenge before us is clear. We're attempting to build transformative AI systems while simultaneously addressing the climate crisis. These goals aren't inherently incompatible, but reconciling them requires unprecedented coordination and innovation across multiple domains:

  • Hardware efficiency: Next-generation chips that deliver dramatically better performance-per-watt

  • Operational intelligence: Carbon-aware scheduling that aligns compute-intensive tasks with renewable energy availability

  • Infrastructure innovation: On-site renewable generation and novel cooling systems that minimize overhead

  • System integration: Data centers that contribute to local energy systems through waste heat recovery

  • Radical transparency: Clear reporting standards that drive competition on efficiency metrics

Global energy consumption tells a story of both peril and promise. As artificial intelligence scales exponentially, it threatens to derail climate progress - yet history shows us that human ingenuity consistently reimagines our energy systems when survival demands it. We have already proven we can build transformative AI; the defining challenge now is whether we can build it sustainably, ensuring our creations enhance rather than endanger the world they serve.

The stakes are higher than they appear. Even breakthrough efficiency gains in AI hardware may paradoxically increase total energy consumption - a manifestation of Jevons Paradox, where technological improvements drive greater overall demand. At this crossroads of intelligence and energy transformation, our choices will determine whether AI becomes humanity's greatest tool or its most consequential miscalculation.

The arithmetic is challenging, but not impossible. What's required is the kind of systematic thinking and ambitious action that has characterized humanity's greatest technological leaps. The alternative - allowing AI's energy demands to grow unchecked - would represent a profound failure of imagination and responsibility. The geopolitical stakes are enormous as well: the nations that control the most powerful AI systems become the new superpowers of tomorrow. In this article, I try to shine a light on what's causing the enormous growth in energy demand and offer some thoughts about the path forward.

The geography of American power

To truly grasp the magnitude of AI's growing energy demands, it's instructive to examine America's electricity generation landscape. At the apex sits the Palo Verde Nuclear Generating Station in Arizona, the nation's largest power producer, generating approximately 32 million megawatt-hours annually - equivalent to 32 billion kWh, or 32 TWh.

What does 32 billion kWh actually mean? The U.S. Energy Information Administration reports that the average American household consumes about 10,500 kWh per year. Simple arithmetic reveals that Palo Verde alone could theoretically power 3.05 million homes - roughly 2.5% of the nation's 120.92 million households. One facility, powering the equivalent of a major metropolitan area.
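
Written out explicitly, using the plant output and household figures cited in this section, the arithmetic looks like this:

```python
# Worked version of the Palo Verde comparison using the figures in the text.
palo_verde_kwh_per_year = 32e9      # ~32 TWh expressed in kWh
avg_home_kwh_per_year = 10_500      # EIA average household consumption
us_households = 120.92e6

homes_powered = palo_verde_kwh_per_year / avg_home_kwh_per_year
share = homes_powered / us_households
print(f"~{homes_powered / 1e6:.2f} million homes ({share:.1%} of U.S. households)")
# -> ~3.05 million homes, roughly 2.5% of households, matching the text.
```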

The roster of America's electricity giants tells a fascinating story about our energy infrastructure. After Palo Verde, we have Browns Ferry (31 TWh, nuclear), Peach Bottom (22 TWh, nuclear), and then Grand Coulee Dam (21 TWh) - the hydroelectric marvel that helped build the American West. The list continues with West County Energy Center (19 TWh, natural gas), W.A. Parish (16 TWh, a coal/gas hybrid), and Plant Scherer in Georgia (15 TWh, coal).

Notice the pattern? Nuclear dominates the top tier, followed by a mix of hydro, gas, and coal. After these giants, output drops precipitously to facilities generating around 3 TWh - a reminder of how concentrated our electricity production really is. This concentration matters. When we project data centers consuming 325-580 TWh by 2028, we're talking about the equivalent of 10-18 Palo Verde stations running exclusively to power AI and digital infrastructure. That's not replacing existing demand - that's additional load on a grid already straining to decarbonize.

The average U.S. household consumes about 10,500 kilowatt-hours (kWh) of electricity per year, though this varies significantly by region and housing type. Residential electricity primarily powers essential systems: space cooling, water heating, and space heating, along with refrigeration, lighting, and electronics. Commercial buildings have vastly different consumption patterns depending on their size and type, ranging from small offices to large retail centers and office complexes, each with varying HVAC, lighting, and operational equipment needs.

The sectoral breakdown of U.S. electricity consumption reveals a more balanced distribution than commonly understood. The EIA forecasts that 2025-2026 power sales will rise to 1,494 billion kWh for residential consumers, 1,420 billion kWh for commercial customers, and 1,026 billion kWh for industrial customers, with longer-term forecasts still mostly within historical norms. This translates to approximately 38% residential, 36% commercial, and 26% industrial consumption. Rather than being overshadowed by commercial and industrial users, the residential sector actually represents the largest single share of electricity demand, with commercial consumption running a close second. This distribution reflects America's transition toward greater electrification, driven by factors including growing demand from artificial intelligence and data centers and the broader shift of homes and businesses toward electricity.

The long view is revealing. According to the Energy Information Administration, U.S. electricity consumption increased in all but 11 years between 1950 and 2022. The rare declines - including 2019, 2020, and 2023 - coincided with economic contractions, efficiency improvements, or exceptional circumstances like the pandemic. The overarching trend remains unmistakably upward. While the exact figures here may carry some uncertainty, they accurately capture the essential dynamics. What matters isn't the precise sectoral split, but that demand is large and the growth trajectory is clear. These patterns - substantial, broad-based demand and relentless growth - form the backdrop against which we must evaluate AI's emerging energy requirements.

Understanding these scales helps frame the challenge ahead. Every percentage point of national electricity consumption that shifts to data centers represents millions of homes' worth of power. The infrastructure required to meet this demand sustainably doesn't just appear - it must be planned, financed, and built, all while racing against both growing demand and climate imperatives.

The numbers that keep me up at night

Let's revisit the core projection: U.S. data centers consumed approximately 176 TWh in 2023 (4.4% of national electricity) and are projected to reach 325-580 TWh by 2028 - equivalent to powering roughly 31 to 55 million American homes at the 10,500 kWh average used above.

But here's what keeps me up at night: 580 TWh might be just the beginning of what we need.

Consider today's reality. One analysis estimated that ChatGPT inference alone consumes roughly 226.8 GWh annually - enough to power about 21,000 U.S. homes - and that figure is already outdated. The International Energy Agency (IEA) offers a more sobering projection: global data center consumption approaching 945 TWh by 2030 - roughly the entire electricity consumption of Japan, one of the world's largest economies.

Let that sink in - data centers, driven overwhelmingly by AI, could soon require as much electricity as all of Japan. The composition of this demand has already shifted fundamentally. During my time at Google, I watched inference overtake training as the primary driver of compute demand through the late 2010s, and inference is now quickly rising to represent more than 80% of AI compute capacity across the industry. This matters because while training happens in discrete, intensive bursts, inference runs continuously at scale, serving billions of requests around the clock - every query, every recommendation, every generated response adds to the load. This is the crux of our challenge: AI follows exponential growth patterns that surprise even those who've spent years watching them unfold. Given the convergence toward dominant model architectures, inference's share will likely climb further, meaning that at the upper bound of 945 TWh, inference alone could be consuming over 756 TWh by the end of the decade.
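For a sense of scale, here's a quick sketch of the arithmetic behind those figures; note that the 80% inference share is the industry estimate quoted above, not a measured constant:

```python
# Scale checks on the inference figures above.
home_kwh = 10_500
chatgpt_gwh = 226.8
print(f"ChatGPT inference ≈ {chatgpt_gwh * 1e6 / home_kwh:,.0f} U.S. homes")

iea_2030_twh = 945           # IEA projection for global data center demand
inference_share = 0.80       # estimated inference share of AI compute
print(f"{inference_share:.0%} of {iea_2030_twh} TWh ≈ "
      f"{iea_2030_twh * inference_share:.0f} TWh of inference")
```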

But doesn’t edge computing promise to slash data center demands? This is a narrative I've heard repeatedly throughout years of scaling early edge AI systems. Yet I remain deeply skeptical of any order-of-magnitude impact. The reason is simple: we've barely scratched the surface of enterprise, government, and industrial AI adoption. These sectors will unleash computational demands that dwarf any efficiency gains from consumer devices processing locally. Consider the asymmetry: for every smartphone performing local voice recognition, hundreds of enterprise systems are analyzing documents, monitoring infrastructure, processing surveillance footage, and generating complex reports. The sheer scale of this institutional transformation will eclipse whatever load we shift to the edge.

This reality leads us to the heart of the matter: if inference drives our energy challenge, how do we understand its consumption patterns? What does the energy anatomy of inference reveal, and where might we find our leverage points for optimization? Understanding these patterns isn't just an academic exercise - it's essential for developing strategies that can accommodate AI's growth while continuing to grow our energy infrastructure. The always-on nature of inference, combined with its direct relationship to usage, creates a fundamentally different challenge than the periodic spikes of model training.

Inference: the GOAT of consumption

The inference footprint represents the electricity consumed each time an AI model generates a response - and as AI becomes ubiquitous across digital services, inference will inevitably dominate long-term energy costs. This raises a crucial question: how do we properly measure inference energy consumption? What's the right framework for calculating energy per token - or, flipped around, tokens per watt?

Let's develop a working model, with an important caveat: these calculations rest on rough assumptions about the current AI landscape. They presume most AI continues running on transformer architectures without fundamental changes over the next few years - though I suspect this assumption may prove conservative. We're likely to use AI itself to discover more efficient architectures, potentially invalidating these projections in favorable ways. With that context, let's examine how inference energy consumption actually works and what drives its costs at scale.

The quadratic curse

Transformers are the neural network architecture powering most modern AI systems - from ChatGPT to Claude to Gemini - an architecture originally created by former colleagues of mine at Google. The key innovation of this architecture is the ability to process all parts of an input simultaneously while understanding relationships between distant elements in the text. In transformer inference, prefill is the initial computational phase where the model processes your entire input prompt before generating any output. This involves a single forward pass through the network, computing hidden representations for all input tokens at once.

Your sequence length simply counts these tokens - the basic units of text that might be letters, partial words, or whole words depending on the tokenizer. "Hello, world!" typically translates to 3-4 tokens, while a lengthy document might contain thousands. This distinction matters because prefill computation scales with sequence length, making long prompts significantly more energy-intensive than short ones.
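If you want to see tokenization in action, one open-source tokenizer (tiktoken, which backs several OpenAI models) makes the counting concrete - exact counts vary from model to model, so treat this as illustrative only:

```python
# Token counting with tiktoken, one open-source tokenizer (pip install tiktoken).
# Exact counts vary by model and tokenizer, so treat this as illustrative only.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for text in ["Hello, world!", "Prefill cost grows with every token you add to the prompt."]:
    n_tokens = len(enc.encode(text))
    print(f"{n_tokens:>2} tokens: {text!r}")
```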

Prefill time grows quadratically with sequence length - double the input, quadruple the computation. This scaling behavior stems from transformers' core mechanism: self-attention. Self-attention requires computing relationships between every pair of tokens in the input. For n tokens, that's n² comparisons. Unlike older architectures (RNNs) that process tokens sequentially, transformers examine all tokens simultaneously, with each token gathering information from every other token in parallel.

Here's an intuitive analogy: imagine a roundtable discussion where each participant (token) prepares three items:

  • Query: "What information am I seeking?"

  • Key: "What information do I possess?"

  • Value: "What insight can I contribute?"

Each participant shares their query with everyone else, comparing it against others' keys to find the most relevant matches. They then synthesize their understanding by combining values from those whose keys best align with their query. Every participant does this simultaneously, creating a rich, interconnected understanding of the entire conversation. This elegant mechanism enables transformers' remarkable capabilities, but it comes at a cost: computational requirements that scale quadratically with input length. A 2,000-token prompt requires four times the computation of a 1,000-token prompt, not twice. This mathematical reality shapes the energy economics of AI inference at scale.
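To make the roundtable concrete, here is a deliberately minimal single-head self-attention sketch in NumPy - no learned projections, no multiple heads, none of the machinery of a real transformer - just enough to show where the n × n pairwise score matrix, and hence the quadratic cost, comes from:

```python
# A minimal, single-head self-attention sketch (no learned projections, no
# multiple heads) - just enough to expose the n x n pairwise score matrix.
import numpy as np

def toy_self_attention(x: np.ndarray) -> np.ndarray:
    """x has shape (n_tokens, d_model); returns attended vectors of the same shape."""
    q, k, v = x, x, x                                  # toy case: queries, keys, values = input
    scores = q @ k.T / np.sqrt(x.shape[-1])            # (n, n): one score per token pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ v

for n in (500, 1_000, 2_000):
    _ = toy_self_attention(np.random.randn(n, 64))
    print(f"{n:>5} tokens -> {n * n:>9,} pairwise scores")
```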

The Two Phases of Transformer Processing

Every transformer request involves two distinct computational phases:

  1. Prefill: Processing the entire input prompt (quadratic scaling with input length, O(n²) complexity)

  2. Decode: Generating output tokens one by one (linear scaling with output length, O(n) complexity)

This scaling difference has profound implications. While decode time grows linearly with the number of tokens generated, prefill time grows with the square of input length. The longer your prompt, the more dramatically prefill dominates total processing time.

Consider the relative computational work (in arbitrary units, assuming 50 output tokens):

| Input Length | Prefill Work (∝ n²) | Decode Work | Prefill % of Total |
|---|---|---|---|
| 500 tokens | 250,000 units | 50,000 units | 83% |
| 1,000 tokens | 1,000,000 units | 50,000 units | 95% |
| 2,000 tokens | 4,000,000 units | 50,000 units | 99% |

The pattern is stark. At 500 input tokens, prefill already consumes 83% of processing time. Double the input to 1,000 tokens, and prefill jumps to 95% - the actual generation phase becomes almost negligible. At 2,000 tokens, you're spending 99% of compute just understanding the prompt. Here's what's happening:

  • Prefill work = n² (where n = input tokens)

  • Decode work = 1,000 × output tokens (arbitrary scaling factor)

This quadratic scaling of self-attention explains why long-context models are so computationally expensive. As context windows expand from thousands to hundreds of thousands of tokens, the energy requirements don't just grow - they explode. Understanding this dynamic is crucial for anyone designing AI systems or planning infrastructure for the age of ubiquitous AI.
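The table above falls out of two lines of arithmetic; a small sketch reproduces it under the same assumptions (prefill ∝ n², decode at 1,000 arbitrary units per output token, 50 output tokens):

```python
# Reproducing the relative-work table from the text's toy model.
OUTPUT_TOKENS = 50

for n_input in (500, 1_000, 2_000):
    prefill = n_input ** 2                 # prefill work ~ n^2 (arbitrary units)
    decode = 1_000 * OUTPUT_TOKENS         # decode work ~ 1,000 units per output token
    share = prefill / (prefill + decode)
    print(f"{n_input:>5} input tokens: prefill {prefill:>9,} | "
          f"decode {decode:,} | prefill {share:.0%} of total")
```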

Doubling words, quadrupling watts

The quadratic scaling of context windows isn't just an abstract computational concern - it translates directly into energy consumption. Every FLOP requires energy, and when FLOPs scale quadratically, so does your electricity bill.

The energy equation is straightforward:

  • Devices draw roughly constant power P during operation (e.g., 300W for a high-performance GPU)

  • Energy consumed equals power multiplied by time: E = P × T

  • Since prefill time scales quadratically with input length, so does prefill energy

Let's make this concrete with realistic parameters:

  • Power draw: 300W

  • Decode time: 20ms (fixed for 50 output tokens)

  • Baseline prefill: 100ms for 500 input tokens

| Input Tokens | Prefill Time | Prefill Energy | Decode Time | Decode Energy | Prefill % of Total |
|---|---|---|---|---|---|
| 500 | 0.10 s | 30 J | 0.02 s | 6 J | 83% |
| 1,000 | 0.40 s | 120 J | 0.02 s | 6 J | 95% |
| 2,000 | 1.60 s | 480 J | 0.02 s | 6 J | 99% |

The energy story mirrors the computational one. While decode energy remains constant at 6 joules regardless of input length, prefill energy explodes from 30J to 480J as input doubles from 500 to 2,000 tokens. At 2,000 tokens, you're burning 80 times more energy understanding the prompt than generating the response. 

Let's recap these results.

At 500 input tokens, prefill consumes 30J versus decode's 6J - already 83% of total energy. Double the input to 1,000 tokens, and prefill time quadruples, pushing energy consumption to 120J and commanding 95% of the total. By 2,000 tokens, the imbalance becomes extreme: 480J for prefill versus 6J for decode, with prefill consuming 99% of the energy budget. Extrapolate to a 10,000-token prompt generating just 1,500 output tokens, and you're looking at roughly 3.4 Wh per query - nearly all spent on prefill. This isn't a marginal effect; it's the dominant factor in inference energy consumption.
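A short sketch of this toy energy model - the same 300W device, quadratic prefill, and linear decode assumed above - reproduces the table and the 10,000-token extrapolation:

```python
# Toy energy model from the text: a 300W device, prefill time quadratic in input
# length (100 ms at 500 tokens), decode time linear in output length (20 ms for
# 50 tokens). All parameters are the illustrative ones above, not measurements.
POWER_W = 300.0
BASE_PREFILL_S, BASE_INPUT_TOKENS = 0.100, 500
DECODE_S_PER_TOKEN = 0.020 / 50

def query_energy_wh(n_input: int, n_output: int) -> float:
    prefill_s = BASE_PREFILL_S * (n_input / BASE_INPUT_TOKENS) ** 2
    decode_s = DECODE_S_PER_TOKEN * n_output
    return POWER_W * (prefill_s + decode_s) / 3600.0   # joules -> watt-hours

for n_in, n_out in [(500, 50), (1_000, 50), (2_000, 50), (10_000, 1_500)]:
    print(f"{n_in:>6} in / {n_out:>5} out: {query_energy_wh(n_in, n_out):5.2f} Wh")
```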

The implications here are profound. Whether you're designing for on-device inference with battery constraints, deploying in autonomous vehicles, or managing massive cloud infrastructure costs at scale - prompt length becomes your primary lever for controlling energy consumption. This scaling asymmetry defines the energy economics of AI. Decode plods along linearly - each output token costs the same as the last. But prefill explodes quadratically with input length, its hunger growing with the square of every token fed. A thousand-token prompt doesn't just double the cost of a 500-token prompt - it quadruples it. By the time we reach today's massive contexts, decode disappears entirely, a thin shadow cast by prefill's towering consumption.

And yet, users control neither the true input nor the output. There isn't a "token budget" in any service out there today, and one would likely create a frustrating user experience if it existed. The largest providers - OpenAI, Google, Anthropic - inject substantial hidden context into every prompt while keeping their system instructions opaque. Output remains equally unconstrained: unless users explicitly demand brevity, models generate tokens freely, most services never think to limit responses, and most users don't even realize they can ask.

This creates a fundamental tension in AI system design. While longer contexts enable richer interactions and more sophisticated reasoning, they exact a quadratically increasing energy toll. Once prompts exceed a few hundred tokens, virtually all computational resources are consumed by the prefill phase alone. For sustainable AI deployment at scale, prompt concision isn't merely good practice - it's an energy imperative. The difference between 500-token and 2,000-token average prompts could determine whether our global infrastructure remains viable or collapses under its own consumption.

This problem compounds as AI agents and capabilities like Deep Research proliferate. Each autonomous action, each recursive query, each unconstrained generation adds to an already exponential curve. We're building systems designed to think deeply while hoping they'll somehow learn restraint - a contradiction that grows more stark with every token generated.

The paradox blooms in plain sight

Here's the irony: while physics demands shorter prompts, the industry is sprinting in the opposite direction. Claude 4's system prompt far exceeds 10,000 tokens. Despite optimization techniques like KV caching and prefix caching, the overwhelming trend is toward ever-expanding context windows. We're stuffing everything we can into prompts - documentation, code repositories, conversation histories - because it demonstrably improves model capabilities.

A colleague recently quipped (hat tip, Tyler!): "We'll achieve AGI when all of Wikipedia fits in the prompt!" It's a joke that hits uncomfortably close to our current trajectory of maximizing context windows at every opportunity. 

My rough calculations above turn out to align remarkably well with recent empirical findings. This very recent research shows that a GPT-4o query with 10,000 input tokens and 1,500 output tokens consumes approximately 1.7 Wh on commercial datacenter hardware. For models with more intensive reasoning capabilities, the numbers climb dramatically: DeepSeek-R1 averages 33 Wh for long prompts, while OpenAI's o3 model reaches 39 Wh. These aren't theoretical projections - they're measured consumption figures from production systems. 

The energy cost of our context window expansion is real, substantial, and growing with each new model generation. We're caught between two competing imperatives: the computational benefits of longer contexts and the rapidly escalating energy costs they incur. The other striking observation in the table below is the explosive increase in power requirements as models have gotten larger and more sophisticated - the power laws of scaling continue.

| Model | Release Date | Energy Consumption (Wh), 100 input / 300 output tokens | Energy Consumption (Wh), 1K input / 1K output tokens | Energy Consumption (Wh), 10K input / 1.5K output tokens |
|---|---|---|---|---|
| o4-mini (high) | Apr 16, 2025 | 2.916 ± 1.605 | 5.039 ± 2.764 | 5.666 ± 2.118 |
| o3 | Apr 16, 2025 | 7.026 ± 3.663 | 21.414 ± 14.273 | 39.223 ± 20.317 |
| GPT-4.1 | Apr 14, 2025 | 0.918 ± 0.498 | 2.513 ± 1.286 | 4.233 ± 1.968 |
| GPT-4.1 mini | Apr 14, 2025 | 0.421 ± 0.197 | 0.847 ± 0.379 | 1.590 ± 0.801 |
| GPT-4.1 nano | Apr 14, 2025 | 0.103 ± 0.037 | 0.271 ± 0.087 | 0.454 ± 0.208 |
| GPT-4o (Mar '25) | Mar 25, 2025 | 0.421 ± 0.127 | 1.214 ± 0.391 | 1.788 ± 0.363 |
| GPT-4.5 | Feb 27, 2025 | 6.723 ± 1.207 | 20.500 ± 3.821 | 30.495 ± 5.424 |
| Claude-3.7 Sonnet | Feb 24, 2025 | 0.836 ± 0.102 | 2.781 ± 0.277 | 5.518 ± 0.751 |
| Claude-3.7 Sonnet ET | Feb 24, 2025 | 3.490 ± 0.304 | 5.683 ± 0.508 | 17.045 ± 4.400 |
| o3-mini (high) | Jan 31, 2025 | 2.319 ± 0.670 | 5.128 ± 1.599 | 4.596 ± 1.453 |
| o3-mini | Jan 31, 2025 | 0.850 ± 0.336 | 2.447 ± 0.943 | 2.920 ± 0.684 |
| DeepSeek-R1 | Jan 20, 2025 | 23.815 ± 2.160 | 29.000 ± 3.069 | 33.634 ± 3.798 |
| DeepSeek-V3 | Dec 26, 2024 | 3.514 ± 0.482 | 9.129 ± 1.294 | 13.838 ± 1.797 |
| LLaMA-3.3 70B | Dec 6, 2024 | 0.247 ± 0.032 | 0.857 ± 0.113 | 1.646 ± 0.220 |
| o1 | Dec 5, 2024 | 4.446 ± 1.779 | 12.100 ± 3.922 | 17.486 ± 7.701 |
| o1-mini | Dec 5, 2024 | 0.631 ± 0.205 | 1.598 ± 0.528 | 3.605 ± 0.904 |
| LLaMA-3.2 1B | Sep 25, 2024 | 0.070 ± 0.011 | 0.218 ± 0.035 | 0.342 ± 0.056 |
| LLaMA-3.2 3B | Sep 25, 2024 | 0.115 ± 0.019 | 0.377 ± 0.066 | 0.573 ± 0.098 |
| LLaMA-3.2-vision 11B | Sep 25, 2024 | 0.071 ± 0.011 | 0.214 ± 0.033 | 0.938 ± 0.163 |
| LLaMA-3.2-vision 90B | Sep 25, 2024 | 1.077 ± 0.096 | 3.447 ± 0.302 | 5.470 ± 0.493 |
| LLaMA-3.1-8B | Jul 23, 2024 | 0.103 ± 0.016 | 0.329 ± 0.051 | 0.603 ± 0.094 |
| LLaMA-3.1-70B | Jul 23, 2024 | 1.101 ± 0.132 | 3.558 ± 0.423 | 11.628 ± 1.385 |
| LLaMA-3.1-405B | Jul 23, 2024 | 1.991 ± 0.315 | 6.911 ± 0.769 | 20.757 ± 1.796 |
| GPT-4o mini | Jul 18, 2024 | 0.421 ± 0.082 | 1.418 ± 0.332 | 2.106 ± 0.477 |
| LLaMA-3-8B | Apr 18, 2024 | 0.092 ± 0.014 | 0.289 ± 0.045 | |
| LLaMA-3-70B | Apr 18, 2024 | 0.636 ± 0.080 | 2.105 ± 0.255 | |
| GPT-4 Turbo | Nov 6, 2023 | 1.656 ± 0.389 | 6.758 ± 2.928 | 9.726 ± 2.686 |
| GPT-4 | Mar 14, 2023 | 1.978 ± 0.419 | 6.512 ± 1.501 | |

Table 4: reproduced from "How Hungry is AI?" (I added the model release dates).

Trade 8,500 conversations to keep your home cool

To grasp the practical implications, consider this sobering calculation: at o3's extreme consumption of 39 Wh per query, approximately 76,923 interactions would drain 3,000 kWh - equivalent to powering a typical American home's air conditioning for an entire year. But as users inevitably gravitate toward richer prompts - say, 30K input tokens with 3.5K outputs - the quadratic curse strikes with mathematical precision. Prefill energy multiplies ninefold, collapsing that annual budget to just 8,500 interactions: merely 23 queries per day. This comparison transforms abstract energy figures into visceral reality, revealing how what appears negligible at the per-query level becomes a massive aggregate demand. And we're still only discussing text models.
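The arithmetic behind that trade-off, using the measured o3 figure and the text's assumption that a roughly 3x longer prompt multiplies prefill energy (which dominates) about ninefold:

```python
# The air-conditioning trade-off in numbers. The 9x multiplier is the text's
# quadratic-scaling assumption for ~3x longer prompts, not a measured figure.
AC_BUDGET_WH = 3_000 * 1_000           # ~3,000 kWh of annual home air conditioning

scenarios = {
    "o3, ~10K-token prompts": 39,      # measured Wh per query
    "o3, ~30K-token prompts": 39 * 9,  # quadratic prefill scaling assumption
}
for label, wh_per_query in scenarios.items():
    queries = AC_BUDGET_WH / wh_per_query
    print(f"{label}: {queries:,.0f} queries/year (~{queries / 365:.0f} per day)")
```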

The trajectory becomes even more stark when we consider multimodal AI. Image and video models can routinely process upwards of hundreds of thousands of tokens per query. Each frame, each visual element, each temporal relationship adds to the token count. As these models become mainstream, we're not just scaling linearly with adoption - we're multiplying adoption rates by dramatically higher per-query energy costs.

The math is unforgiving: widespread deployment of long-context AI at current efficiency levels would require energy infrastructure on a scale we're not remotely prepared for. This isn't a distant concern - it's the reality we're building toward with every context window expansion and every new multimodal capability. The urgency is real: we need to move faster, or the trade-offs soon become explicit - 8,500 conversations or one cool home; your ChatGPT query or your neighbor's heating. And while you might laugh, it's already happening here in the United States, and water availability is next.

The efficiency mirage: better hardware alone isn't enough

One way to conclude all of this is to ask, "But wait, the hyperscalers like Google and others aren't buying nuclear power plants, are they? Surely we'll be OK?" I would counter, firstly, that they actually are, and secondly, that the vast majority of the world isn't as sophisticated as the hyperscalers. Google represents the exception, not the rule, in AI efficiency. That is to say, rather than accepting quadratic scaling as inevitable, Google has:

  • Implemented multiple optimizations that compound together (e.g. software and hardware co-design with TPUs)

  • Focused on inference efficiency where the bulk of tokens are processed

  • Heavy research investment to continue to find alternatives beyond pure transformer architectures where appropriate

  • Achieved order-of-magnitude efficiency gains that largely offset quadratic scaling, as a result of all of these compounding together

During my time at Google, the company's sophistication in AI infrastructure was staggering. Through software-hardware co-design with TPUs, relentless focus on inference optimization, and architectural innovations beyond pure transformers, Google achieved order-of-magnitude efficiency gains that largely offset quadratic scaling. This luxury - born from inventing the transformer architecture itself - remains unavailable to most players: even technology giants like Microsoft and Meta are trying to copy the success of the TPU, while the most advanced model companies like OpenAI and Anthropic are only now trying to develop their own custom hardware.

Beyond these elite players lies a wasteland of inefficiency. Industry-wide GPU utilization averages a shocking 15-50%, despite NVIDIA and AMD claiming 90%+ efficiency is achievable. Microsoft's study of 400 real deep learning deployments confirms this reality: enterprise GPU utilization rarely exceeds 50%. Even Meta's Llama 3 405B, running on 16,384 H100 GPUs, achieves only 38% Model FLOPs Utilization. The physics compounds the problem. NVIDIA's H100 consumes 700W at peak, the A100 draws 400-500W, and AMD's MI300X reaches 750W - yet at 25% utilization, these still draw 35-43% of maximum power due to static components like memory controllers. This non-linear power curve creates a cruel efficiency trap: most organizations operate in the steepest part of the curve, where marginal performance gains demand exponential energy increases.

GPU Power Consumption: Comparing Non-Linear Curves

How NVIDIA H100, A100, and AMD MI300X consume power at different utilization levels

| GPU | Peak Power | Idle Power | Power @ 15% Utilization | Power @ 50% Utilization | Efficiency Loss @ 15% |
|---|---|---|---|---|---|
| NVIDIA H100 | 700W | 80W | 245W (35%) | 420W (60%) | 133% |
| NVIDIA A100 | 500W | 60W | 175W (35%) | 300W (60%) | 133% |
| AMD MI300X | 750W | 85W | 263W (35%) | 450W (60%) | 133% |

Key Utilization Points Comparison

| Utilization | H100 (700W) | A100 (500W) | MI300X (750W) | Power Efficiency |
|---|---|---|---|---|
| 0% (Kernel Idle) | 16W | 12W | 17W | ~2-3% of peak |
| 15% (Typical) | 245W | 175W | 263W | 35% power for 15% work |
| 50% (Moderate) | 420W | 300W | 450W | 60% power for 50% work |
| 70% (Optimal) | 560W | 400W | 600W | 80% power for 70% work |
| 100% (Peak) | 700W | 500W | 750W | 100% power for 100% work |

Let's bring it all back to our token-per-watt metric - my point here is that efficiency metrics reveal stark contrasts between theoretical capability and real-world achievement. Consider the gap between marketing and reality. The H100 delivers an impressive 4.3-5.7 tokens per watt in optimized configurations. But at typical 15% utilization, this plummets to 0.65-0.86 tokens per watt - an 85% efficiency collapse. No amount of hardware innovation can overcome fundamental deployment incompetence. All of this highlights my point: software is just as important to energy efficiency as hardware innovation itself, and if you aren't holding it right, your tokens-per-watt plummets irrespective of the power of the hardware.
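Put differently, here's the tokens-per-watt collapse in a few lines, under the simplifying (and pessimistic) assumption that an underutilized GPU still draws close to peak power while serving only a fraction of its peak throughput:

```python
# Tokens-per-watt collapse, assuming near-peak power draw while serving only
# a fraction of peak throughput (the pessimistic case described above).
UTILIZATION = 0.15
for peak_tokens_per_watt in (4.3, 5.7):
    effective = peak_tokens_per_watt * UTILIZATION
    drop = 1 - effective / peak_tokens_per_watt
    print(f"peak {peak_tokens_per_watt} tok/W -> {effective:.3f} tok/W "
          f"at {UTILIZATION:.0%} utilization ({drop:.0%} collapse)")
```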

The broader implications are staggering. For example, with 3.76 million datacenter GPUs sold in 2023 operating at 15% average utilization, the industry wastes $12.6 billion in underutilized capacity annually, generates 94 million tons of CO2 - equivalent to the emissions of 20.4 million cars - and squanders enough electricity to power 1.1 million American homes. According to SemiAnalysis, each H100 server carries a $106,752 annual total cost of ownership, which works out to $7,413 per useful GPU-month at 15% utilization versus $1,235 at a healthy utilization like 90%.

The tools exist to solve this crisis - GPU utilization monitoring, workload optimization, multi-instance allocation - but the industry lacks the sophistication to deploy them effectively. Outside a handful of elite players, the AI revolution runs on fundamentally broken economics: in regular conversations with sophisticated technology enterprises, I hear GPU utilization figures closer to 50% than 90%. Until we close this efficiency gap, hardware improvements will only enable more waste at larger scales, making energy consumption our defining constraint rather than computational capability.

The stakes have never been higher

The mathematics reveals an unforgiving truth: prefill computation scales quadratically with input length, meaning each doubled prompt quadruples energy consumption. As models chase ever-expanding context windows, this fundamental relationship drives AI's steepening energy curve. Meanwhile, the industry's abysmal GPU utilization - averaging 15-50% - compounds the crisis through pure waste. We face a perfect storm: exponentially growing computational demands colliding with systematic inefficiency.

Of course, these projections assume a degree of technological stasis. Innovation could disrupt these trends - and likely will - and the work we are doing at Modular is certainly trying to help. But consider this provocative framing: what if we viewed our collective future not through the lens of human populations and national borders, but through available compute capacity? In this view, the race to build massive datacenter infrastructure becomes humanity's defining competition. If each AI agent represents some fraction of human productive capacity, then the first to achieve a combined human-digital population of 5 or 10 billion wins the AI race, and invents the next great technological frontier.

This perspective makes efficiency not just important, but existential. The promise of AI as humanity's great equalizer inverts into its opposite: a world where computational capacity becomes the new axis of dominance. Without breakthrough innovation at every layer - from silicon to algorithms to system architecture - we face an AI revolution strangled by the very physics of power generation. The future belongs not to the wise, but to the watt-rich. And so we must scale with unprecedented urgency - nuclear, renewables, whatever it takes - because the stakes transcend mere technological supremacy. In this race, computational power becomes political power, and China is currently winning by a large margin. If democratic nations cede the AI frontier to autocracies, we don't just lose a technological edge; we risk watching the values of human dignity and freedom dim under the shadow of algorithmic authoritarianism. The grid we build today determines whose values shape the world of tomorrow.

It's interesting to reflect that we are teaching machines to think with the very energy that makes our planet uninhabitable - yet these same machines may be our only hope of learning to live within our means. AI is both the fever and the cure, the flood and the ark, the hunger that's outrunning itself. We race against our own creation, betting that the intelligence we birth from burning carbon will show us how to stop burning it altogether. The question of our age: can we make AI wise enough to save us before it grows hungry enough to consume us? The stakes in the race to AI superintelligence are unquestionably high.

I'll close with an irony that perfectly captures the current moment - the suggestions below come courtesy of Anthropic’s Claude 4 Opus:

  1. Better Chips and Smarter Cooling - The latest AI chips use way less energy for the same work. Pair that with innovative cooling like liquid systems or modular designs, and data centers have already seen energy savings of up to 37% in test runs.

  2. Timing is Everything - Not all AI work needs to happen right now. By running non-urgent tasks when electricity is cleaner (like when it's sunny or windy), some companies have cut their carbon emissions from AI jobs by 80-90%. It's like doing laundry at night when rates are lower, but for the planet.

  3. Power Where You Need It - Building solar panels, wind turbines, and battery storage right at data center sites makes sense. Google's recent $20 billion investment in clean energy shows how tech giants can grow their AI capabilities without relying entirely on the traditional power grid.

  4. Working Together on Infrastructure - Data center operators need to share their growth plans with utility companies. This helps everyone prepare for the massive power needs coming to tech hubs like Northern Virginia, Texas, and Silicon Valley - think of it as giving the power company a heads-up before throwing a huge party.

  5. Show Your Work - Just like appliances have energy ratings, AI companies should tell us how much power they're using. Whether it's energy per query or per training session, transparency creates healthy competition to be more efficient.

  6. Investing in Tomorrow's Solutions - Government programs are funding research into game-changing technologies like optical processors that could use 10 times less energy. There's also exciting work on making AI models smaller and smarter without losing capabilities.

  7. Turning Waste into Resources - Data centers generate tons of heat - why not use it? Some facilities are already warming nearby buildings with their excess heat, turning what was waste into a community benefit.

And there it is - roughly 40 watt-hours spent asking AI how to save watt-hours, with some of these ideas still unproven (e.g. optical processors). The perfect metaphor for our moment: we burn the world to ask how to stop burning it, racing our own shadow toward either wisdom or ruin. The verdict seems more absolute: scale or surrender - there is no middle ground in the physics of power.

Thanks to Tyler Kenney, Kalor Lewis, Eric Johnson, Azeem Azhar, Christopher Kauffman, Will Horyn, Jessica Richman, Duncan Grove and others for many fun discussions on the nature of AI and energy.


AI Regulation: step with care, and great tact

“So be sure when you step, Step with care and great tact. And remember that life's A Great Balancing Act. And will you succeed? Yes! You will, indeed! (98 and ¾ percent guaranteed).” ― Dr. Seuss, Oh, the Places You'll Go!

AI systems take an incredible amount of time to build and get right - I know because I have helped scale some of the largest AI systems in the world, which have directly and indirectly impacted billions of people. If I step back and reflect briefly - we were promised mass-production self-driving cars 10+ years ago, and yet we still barely have any autonomous vehicles on the road today. Radiologists haven't been replaced by AI despite predictions virtually guaranteeing as much, and the best available consumer robot we have is the iRobot J7 household vacuum.

We often fall far short of the technological exuberance we project into the world, time and time again realizing that producing incredibly robust production systems is always harder than we anticipate. Indeed, Roy Amara made this observation long ago:

We tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run.

Even when we have put seemingly incredible technology out into the world, we often overshoot and release it to minimal demand - the litany of technologies that promised to change the world but subsequently failed is a testament to that. So let's be realistic about AI - we definitely have better recommendation systems, we have better chatbots, we have translation across multiple languages, we can take better photos, we are now all better copywriters, and we have helpful voice assistants. I've been involved in using AI to solve real-world problems like saving the Great Barrier Reef, and enabling people who have lost limbs to rediscover their lives again. Our world is a better place because of such technologies, and they have been developed and deployed into real applications over the last 10+ years with profoundly positive effect.

But none of this is remotely close to AGI - artificial general intelligence - or anywhere near ASI - artificial superintelligence. We have a long way to go despite what the media presages, like a Hollywood blockbuster storyline. In fact, in my opinion, these views are a bet against humanity because they gravely overshadow the incredible positives that AI is already, and will continue, providing - massive improvements in healthcare, climate change, manufacturing, entertainment, transport, education, and more - all of which will bring us closer to understanding who we are as a species. We can debate the merits of longtermism forever, but the world has serious problems we can solve with AI today. Instead, we should be asking ourselves why we are scrambling to stifle innovation and implement naively restrictive regulatory frameworks now, at a time when we are truly still trying to understand how AI works and how it will impact our society? 

Current proposals for regulation seem more concerned with the ideological risks associated with transformative AGI - which, unless we uncover some incredible change in physics, energy, and computing, or discover that AGI can exist in a vastly different form from our understanding of intelligence today, is nowhere near close. The hysteria projected by many does not comport with the reality of where we are today or where we will be anytime soon. It's the classic mistake of inductive reasoning - taking a very specific observed example and making overly broad generalizations and logical jumps. Production AI systems do far more than just execute an AI model - they are complex systems - just as a radiologist does far more than look at an image: they treat a person.

For AI to change the world, one cannot point solely to an algorithm or a graph of weights and state the job is done - AI must exist within a successful product that society consumes at scale - distribution matters a lot. And the reality today is that the AI infrastructure we currently possess was developed primarily by researchers for research purposes. The world doesn't seem to understand that we don't actually have production software infrastructure to scale and manage AI systems to the enormous computational heights required for them to practically work in products across hardware. At Modular, we have talked about these challenges many times, because AI isn't an isolated problem - it's a systems problem comprised of hardware + software across both data centers and the edge.

Regulation: where to start?

So, if we assume that AGI isn't arriving for the foreseeable future, as I and many do, what exactly are we seeking to regulate today? Is it the models we have already been using for many years in plain sight, or are we trying to pre-empt a future far off? As always, the truth lies somewhere in between. It is the generative AI revolution - not forms of AI that have existed for years - that has catalyzed much of the recent excitement around AI. It is here, in this subset of AI, where practical AI regulation should take a focused start.

If our guiding success criterion is broadly something like - enable AI to augment and improve human life - then let's work backward with clarity from that goal and implement legislative frameworks accordingly. If we don't know what goal we are aiming for - how can we possibly define policies to guide us? Our goal cannot be "regulate AI to make it safe" because the very framing implies the absence of safety from the outset, despite us living with AI systems for 10+ years. To realistically achieve a balanced regulatory approach, we should start within the confines of laws that already exist - enabling them to address concerns about AI and learning how society reacts accordingly - before seeking completely new, sweeping approaches. The Venn diagram of AI and existing law is already very dense. We already have data privacy and security laws, discrimination laws, export control laws, communication and content usage laws, and copyright and intellectual property laws, among so many other statutory and regulatory frameworks we could seek to amend.

The idea that we need entirely new "AI-specific" laws now - when this field has already existed for years and risks immediately curbing innovation for use cases we don't yet fully understand - feels impractical. It will likely be cumbersome and slow to enforce while creating undue complexity that will stifle rather than enable innovation. We can look to history for precedent here - there is no single "Internet Act" for the US or the world - instead, we divide and conquer Internet regulation across our existing legislative and regulatory mechanisms. We have seen countless laws that attempt to broadly regulate the internet fail - the Stop Online Piracy Act (SOPA) is one shining US example - while other laws that seek to regulate within existing bounds succeed. For example, Section 230 of the Communications Decency Act protects internet service providers from being held liable for content published by their users, and the protection afforded here has enabled modern internet services to innovate and thrive to enormous success (e.g., YouTube, TikTok, Instagram etc.) while also forcing market competition on corporations to create their own high content standards to build better product experiences and retain users. If they didn't implement self-enforcing policies and standards, users would simply move to a better and more balanced service, or a new one would be created - that's market dynamics.

Of course, we should be practical and realistic. Any laws we amend or implement will have failings - they will be tested in our judicial system and rightly criticized, but we will learn and iterate. Yes, we won't get this right initially - a balanced approach often means some bad actors will succeed, but we can limit these bad actors while also building a stronger and more balanced AI foundation for the years ahead. AI will continue evolving, and the laws will not keep pace - take, for example, misinformation, where generative AI makes it far easier to construct alternate truths. While this capability has been unlocked, social media platforms still grapple with moderation of non-AI-generated content and have done so for years. Generative AI will likely create extremely concerning misinformation across services, irrespective of any laws we implement.

EU: A concerning approach

With this context, let’s examine one of the most concerning approaches to AI regulation - the European Artificial Intelligence Act - an incredibly aggressive approach that will likely cause Europe to fall far behind the rest of the world in AI. In seeking to protect EU citizens absolutely, the AI Act seemingly forgets that our world is now far more interconnected and that AI programs are, and will continue to be, deeply proliferated across our global ecosystem. For example, the Act arguably could capture essentially all probabilistic methods of predicting anything. And, in one example, it goes on to explain (in Article 5) that:

(a) the placing on the market, putting into service or use of an AI system that deploys subliminal techniques beyond a person’s consciousness in order to materially distort a person’s behaviour in a manner that causes or is likely to cause that person or another person physical or psychological harm;

Rather than taking a more balanced approach and appreciating that perfect is the enemy of good, the AI Act tries to ensnare all types of AI, not just the generative ones currently at the top of politicians and business leaders' minds. How does one even hope to enforce "subliminal techniques beyond a person's consciousness" - does this include the haptic notification trigger on my Apple Watch powered by a probabilistic model? Or the AI system that powers directions on my favorite mapping service with different visual cues? Further, the Act also includes a "high risk" categorization for "recommender systems" that, today, basically power all of e-commerce, social media, and modern technology services - are we to govern and require conformity, transparency and monitoring of all of these too? The thought is absurd, and even if one disagrees, the hurdle posed for generative AI models is so immense - no model meets the standards of the EU AI Act in its current incarnation.

We should not fear AI - we've been living with it for 10+ years in every part of our daily lives - in our news recommendations, mobile phone cameras, search results, car navigation, and more. We can't just seek to extinguish it from existence retrospectively - it's already here. So, to understand what might work, let's walk the path of history to explore what didn't - the failed technological regulations of the past. Take the Digital Millennium Copyright Act (DMCA), which attempted to enforce DRM by making it illegal to circumvent technological measures used to protect copyrighted works. This broadly failed everywhere, didn't protect privacy, stifled innovation, was hacked, and, most critically - was not aligned with consumer interests. It failed in Major League Baseball, the E-Book industry abandoned it, and even the EU couldn't get it to pass. What did work in the end? Building products aligned with consumer interests enabled them to achieve their goal - legitimate ways to access high-quality content. The result? We have incredible services like Netflix, Spotify, YouTube, and more - highly consumer-aligned products that deliver incredible economic value and entertainment for society. While each of these services has its challenges, at a broad level, they have significantly improved how consumers access content, have decentralized its creation, and enabled enormous and rapid distribution that empowers consumers to vote on where they direct their attention and purchasing power.

A great opportunity to lead

The US has an opportunity to lead the world by constructing regulation that enables progressive and rapid innovation - an approach befitting this country's principles and its role as a global model for democracy. Spending years constructing the "perfect AI legislation" and "completely new AI agencies" will end up like the parable of the blind men and an elephant - creating repackaged laws that attempt to regulate AI from different angles without holistically solving anything.

Ultimately, seeking to implement broad new statutory protections and government agencies on AI is the wrong approach. It will be slow, it will take years to garner bipartisan support, and meanwhile, the AI revolution will roll onwards. We need AI regulation that defaults to action, that balances opening the door to innovation and creativity while casting a shadow over misuse of data and punishing discriminative, abusive, and prejudicial conduct. We should seek regulation that is more focused initially, predominantly on misinformation and misrepresentation, and avoid casting an incredibly wide net across all AI innovation so we can understand the implications of initial regulatory enforcement and how to structure it appropriately. For example, we can regulate the data on which these models are trained (e.g. privacy, copyright, and intellectual property laws), we can regulate computational resources (e.g. export laws), and we can regulate the products in which predictions are deployed (e.g. discrimination laws) to start. In this spirit, we should seek to construct laws that target the inputs and outputs of AI systems and not individual developers or researchers who push the world forward.

Here’s a small non-exhaustive list of near-term actionable ideas:

  1. Voluntary transparency on the research & development of AI - If we wait for Congress, we wait for an unachievable better path at the expense of a good one today. There is already an incredible body of work on open and transparent disclosure being proposed by OpenAI, in Google's AI Principles, and elsewhere - there is a strong will to transcend Washington and just do something now. We can seek to ensure companies exercise a duty of care - a responsibility under common law to identify and mitigate ill effects and to report transparently on them. And of course, this should extend to and be true of our government agencies as well.

  2. Better AI use case categorization & risk definition - There is no well-defined regulatory concept of "AI", nor of how to "determine risks" for "what use cases" therein. While the European AI Act clearly goes too far, it does at least seek to define what "AI" is, but errs by sweeping in essentially all probabilistic methods. Further, it needs to better classify risk and the classes of AI from which it is actually seeking to protect people. We can create a categorical taxonomy of AI use cases and target regulatory enforcement accordingly. For example, New York is already implementing laws that require employers to notify job applicants if AI is used to review their applications.

  3. Pursue watermarking standards - Google leads the cause with watermarking to determine whether content has been AI generated. While such standards are reminiscent of DRM, they are a useful step in encouraging open standards so that major distribution platforms can bake watermarking in.

  4. Prompt clearly on AI systems that collect and use our data - Apple did this to great effect with App Tracking Transparency, requiring apps to ask permission before using the IDFA to track users. We should expect the same of our services for any data being used to train AI, along with disclosure of the AI model use cases such data informs. Increasingly in AI, all data is effectively biometric - from the way one types, to the way one speaks, to the way one looks and even walks. All of this data can, and is, now being used to train AI models to form biometric fingerprints of individuals - this should be made clear and transparent to society at large. Both data privacy and intellectual property laws should protect who you are and how your data is used in a new generative AI world. Data privacy isn't new, but we should realize that the evolution of AI continually raises the stakes.

  5. Protect AI developers & open source - We should seek to focus regulatory efforts on the data inputs as well as have clear licensing structures for AI Models and software. Researchers and developers should not be held liable for creating models, software tooling, and infrastructure that is distributed to the world so that an open research ecosystem can continue to flourish. We must promote initiatives like model cards and ML Metadata across the ecosystem and encourage their use. Further, if developers open source AI models, we must ensure that it is incredibly difficult to hold the AI model developer liable, even as a proximate cause. For example, if an entity uses an open-source AI recommendation model in its system, and one of its users causes harm as a result of one of those recommendations, then one can’t merely seek to hold the AI model developer liable - it is foreseeable to a reasonable person who develops and deploys AI systems that rigorous research, testing, and safeguarding must occur before using and deploying AI models into production. We should ensure this is true even if the model author knew the model was defective. Why, you ask? Again, a person who uses any open source code should reasonably expect that it might have defects - hence why we have licenses like MIT that provide software “as is”. Unsurprisingly, this isn’t a new construct; it's how liability has existed for centuries in tort, and we should seek to ensure AI isn't treated differently.

  6. Empower agencies to have agile oversight - AI consumer scams aren't new - they are already executed by email and phone today. Enabling existing regulatory bodies to take action immediately is better than waiting for congressional oversight. The FTC has published warnings, as have the Department of Justice and the Equal Employment Opportunity Commission - all regulatory agencies that can help more today.

Embrace our future

It is our choice to embrace fear or excitement in this AI era. AI shouldn’t be seen as a replacement for human intelligence, but rather a way to augment human life - a way to improve our world, to leave it better than we found it and to infuse a great hope for the future. The threat that we perceive - that AI calls into question what it means to be human - is actually its greatest promise. Never before have we had such a character foil to ourselves, and with it, a way to significantly improve how we evolve as a species - to help us make sense of the world and the universe we live in and to bring about an incredibly positive impact in the short-term, not in a long distant theoretical future. We should embrace this future wholeheartedly, but do so with care and great tact - for as with all things, life is truly a great balancing act.

Many thanks to Christopher Kauffman, Eric Johnson, Chris Lattner and others for reviewing this post.

Image credits: Tim Davis x DALL-E (OpenAI)
