Scale or Surrender: When watts determine freedom

Over the past two centuries, humanity's relationship with energy has been nothing short of transformative. If you chart global primary energy consumption from the Industrial Revolution to today, you'll see something remarkable: an almost unbroken ascent, punctuated by only three brief pauses - the early 1980s oil crisis aftermath, the 2009 financial crisis, and the 2020 pandemic. Otherwise, it's been an extraordinary march upward, powered first by coal and oil, then natural gas, nuclear, hydropower, and increasingly, renewables. This wonderful graphic highlights this well, with populous nations like China, the United States, and India dominating total consumption.

The geographic distribution of this energy consumption tells a striking story. China, the United States, and India dominate in absolute terms, but the per-capita numbers reveal something more profound. Citizens of Iceland, Norway, Canada, the United States, and wealthy Gulf states like Qatar and Saudi Arabia consume up to 100 times more energy than those in the world's poorest regions. This isn't merely inequality - it's a chasm so vast that millions of people still rely on traditional biomass (wood, agricultural residues) that doesn't even register in global energy statistics, leaving gaps in the data itself.

The disparities in electricity generation are equally stark. Iceland, blessed with abundant geothermal and hydro resources, generates hundreds of times more electricity per person than many low-income nations, where annual per-capita generation can fall below 100 kilowatt-hours - less than what a modern refrigerator uses in two months.

This context matters immensely as we confront the dual challenge of our time: meeting rising global energy demand while urgently decarbonizing our energy supply. Despite record investments in clean technologies, fossil fuels still account for approximately 81.5% of global primary energy. The math here is unforgiving - renewable sources must not only meet all new demand but also replace existing fossil fuel capacity if we're to bend the emissions curve downward.

Enter artificial intelligence, with its voracious and growing appetite for electricity.

In 2023, U.S. data centers consumed approximately 176 terawatt-hours - 4.4% of national electricity consumption. Current projections suggest this could reach 325 to 580 TWh by 2028, representing 6.7% to 12% of total U.S. electricity demand, driven largely by AI workloads that require ever-increasing compute and specialized hardware. To contextualize these numbers: we're talking about enough electricity to power between 32.5 and 58 million American homes.

The AI industry has long understood a critical metric that deserves wider attention: tokens-per-dollar-per-watt. This measure of computational efficiency relative to both cost and energy consumption has been a focus at Google and other leading technology companies for years. It represents the kind of systems thinking we desperately need as AI capabilities expand.

The challenge before us is clear. We're attempting to build transformative AI systems while simultaneously addressing the climate crisis. These goals aren't inherently incompatible, but reconciling them requires unprecedented coordination and innovation across multiple domains:

  • Hardware efficiency: Next-generation chips that deliver dramatically better performance-per-watt

  • Operational intelligence: Carbon-aware scheduling that aligns compute-intensive tasks with renewable energy availability

  • Infrastructure innovation: On-site renewable generation and novel cooling systems that minimize overhead

  • System integration: Data centers that contribute to local energy systems through waste heat recovery

  • Radical transparency: Clear reporting standards that drive competition on efficiency metrics

Global energy consumption tells a story of both peril and promise. As artificial intelligence scales exponentially, it threatens to derail climate progress - yet history shows us that human ingenuity consistently reimagines our energy systems when survival demands it. We have already proven we can build transformative AI; the defining challenge now is whether we can build it sustainably, ensuring our creations enhance rather than endanger the world they serve.

The stakes are higher than they appear. Even breakthrough efficiency gains in AI hardware may paradoxically increase total energy consumption - a manifestation of the Jevons paradox, where technological improvements drive greater overall demand. At this crossroads of intelligence and energy transformation, our choices will determine whether AI becomes humanity's greatest tool or its most consequential miscalculation.

The arithmetic is challenging, but not impossible. What's required is the kind of systematic thinking and ambitious action that has characterized humanity's greatest technological leaps. The alternative - allowing AI's energy demands to grow unchecked - would represent a profound failure of imagination and responsibility. The geopolitical risks are enormous too: whichever nations control the most powerful AI systems will be the superpowers of tomorrow. In this article, I try to shine a light on what's causing the enormous growth in energy demand, and offer some thoughts on the path forward.

The geography of American power

To truly grasp the magnitude of AI's growing energy demands, it's instructive to examine America's electricity generation landscape. At the apex sits the Palo Verde Nuclear Generating Station in Arizona, the nation's largest power producer, generating approximately 32 million megawatt-hours annually - equivalent to 32 billion kWh, or 32 TWh.

What does 32 billion kWh actually mean? The U.S. Energy Information Administration reports that the average American household consumes about 10,500 kWh per year. Simple arithmetic reveals that Palo Verde alone could theoretically power 3.05 million homes - roughly 2.5% of the nation's 120.92 million households. One facility, powering the equivalent of a major metropolitan area.

The roster of America's electricity giants tells a fascinating story about our energy infrastructure. After Palo Verde, we have Browns Ferry (31 TWh, nuclear), Peach Bottom (22 TWh, nuclear), and then Grand Coulee Dam (21 TWh) - the hydroelectric marvel that helped build the American West. The list continues with West County Energy Center (19 TWh, natural gas), W.A. Parish (16 TWh, a coal/gas hybrid), and Plant Scherer in Georgia (15 TWh, coal).

Notice the pattern? Nuclear dominates the top tier, followed by a mix of hydro, gas, and coal. After these giants, output drops precipitously to facilities generating around 3 TWh - a reminder of how concentrated our electricity production really is. This concentration matters. When we project data centers consuming 325-580 TWh by 2028, we're talking about the equivalent of 10-18 Palo Verde stations running exclusively to power AI and digital infrastructure. That's not replacing existing demand - that's additional load on a grid already straining to decarbonize.
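
A quick back-of-the-envelope sketch of this arithmetic, using the approximate figures cited in this section (32 TWh for Palo Verde, 10,500 kWh per household, 120.92 million households):

```python
# Rough scale math: Palo Verde's output vs. homes and projected data center demand.
# All figures are the approximate values quoted in the text, not precise EIA data.

PALO_VERDE_TWH = 32        # annual generation, terawatt-hours
KWH_PER_HOME = 10_500      # average U.S. household consumption per year
US_HOUSEHOLDS_M = 120.92   # million U.S. households

homes_powered_m = PALO_VERDE_TWH * 1e9 / KWH_PER_HOME / 1e6   # millions of homes
share_of_households = homes_powered_m / US_HOUSEHOLDS_M
print(f"Palo Verde ≈ {homes_powered_m:.2f}M homes ({share_of_households:.1%} of U.S. households)")

for projected_twh in (325, 580):                               # 2028 data center projections
    print(f"{projected_twh} TWh ≈ {projected_twh / PALO_VERDE_TWH:.1f} Palo Verde stations")
```

Running it reproduces the numbers above: about 3.05 million homes (2.5% of households), and 10-18 Palo Verde equivalents for the 2028 projections.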

The average U.S. household consumes about 10,500 kilowatt-hours (kWh) of electricity per year, though this varies significantly by region and housing type. Residential electricity primarily powers essential systems: space cooling, water heating, space heating, along with refrigeration, lighting, and electronics. Commercial buildings have vastly different consumption patterns depending on their size and type, ranging from small offices to large retail centers and office complexes, each with varying HVAC, lighting, and operational equipment needs.

The sectoral breakdown of U.S. electricity consumption reveals a more balanced distribution than commonly understood. The EIA forecasts that 2025-2026 power sales will rise to 1,494 billion kWh for residential consumers, 1,420 billion kWh for commercial customers, and 1,026 billion kWh for industrial customers, with longer-term forecasts still mostly within historical norms. This translates to approximately 38% residential, 36% commercial, and 26% industrial consumption. Rather than being overshadowed by commercial and industrial users, the residential sector actually represents the largest single share of electricity demand, with commercial consumption running a close second. This distribution reflects America's transition toward greater electrification in homes and businesses, driven by factors including growing demand from artificial intelligence and data centers, as well as homes and businesses simply using more electricity.

The long view is revealing. According to the Energy Information Administration, U.S. electricity consumption increased in all but 11 years between 1950 and 2022, with a further slight dip in 2023. The rare declines coincided with economic contractions, efficiency improvements, or exceptional circumstances like the pandemic. The overarching trend remains unmistakably upward. While the exact figures here may carry some uncertainty, they accurately capture the essential dynamics. What matters isn't the precise sectoral split, but that demand is enormous and the growth trajectory is clear. These patterns - concentrated demand and relentless growth - form the backdrop against which we must evaluate AI's emerging energy requirements.

Understanding these scales helps frame the challenge ahead. Every percentage point of national electricity consumption that shifts to data centers represents millions of homes' worth of power. The infrastructure required to meet this demand sustainably doesn't just appear - it must be planned, financed, and built, all while racing against both growing demand and climate imperatives.

The numbers that keep me up at night

Let's revisit the core projection: U.S. data centers consumed approximately 176 TWh in 2023 (4.4% of national electricity) and are projected to reach 325-580 TWh by 2028 – equivalent to powering 32.5 to 58 million American homes.

But here's what keeps me up at night: 580 TWh might be just the beginning of what we need.

Consider today's reality. One analysis estimated that ChatGPT inference alone consumes approximately 226.8 GWh annually – enough to power 21,000 U.S. homes – and that's already outdated. The International Energy Agency (IEA) offers a more sobering projection: 945 TWh of data center consumption globally by 2030 - roughly the entire electricity consumption of Japan, the world's third-largest economy.

Let that sink in - AI could require as much electricity as the world's third-largest economy. The composition of this demand has already shifted fundamentally. During my time at Google, I watched inference overtake training as the primary driver of compute demand through the late 2010s, and inference is now quickly rising to represent more than 80% of AI compute capacity across the industry. This matters because while training happens in discrete, intensive bursts, inference runs continuously at scale, serving billions of requests around the clock - every query, every recommendation, every generated response adds to the load. This is the crux of our challenge: AI follows exponential growth patterns that surprise even those who've spent years watching them unfold. Given the convergence toward dominant model architectures, inference's share will likely climb further, meaning that under our upper-bound projection of 945 TWh, inference alone could consume over 756 TWh by 2028.

But doesn’t edge computing promise to slash data center demands? This is a narrative I've heard repeatedly throughout years of scaling early edge AI systems. Yet I remain deeply skeptical of any order-of-magnitude impact. The reason is simple: we've barely scratched the surface of enterprise, government, and industrial AI adoption. These sectors will unleash computational demands that dwarf any efficiency gains from consumer devices processing locally. Consider the asymmetry: for every smartphone performing local voice recognition, hundreds of enterprise systems are analyzing documents, monitoring infrastructure, processing surveillance footage, and generating complex reports. The sheer scale of this institutional transformation will eclipse whatever load we shift to the edge.

This reality leads us to the heart of the matter: if inference drives our energy challenge, how do we understand its consumption patterns? What does the energy anatomy of inference reveal, and where might we find our leverage points for optimization? Understanding these patterns isn't just an academic exercise - it's essential for developing strategies that can accommodate AI's growth while continuing to grow our energy infrastructure. The always-on nature of inference, combined with its direct relationship to usage, creates a fundamentally different challenge than the periodic spikes of model training.

Inference: the GOAT of consumption

The inference footprint represents the electricity consumed each time an AI model generates a response - as AI becomes ubiquitous across digital services, inference will inevitably dominate long-term energy costs. This raises a crucial question: how do we properly measure inference energy consumption? What's the right framework for calculating the energy cost per token of inference?

Let's develop a working model, with an important caveat: these calculations rest on rough assumptions about the current AI landscape. They presume most AI continues running on transformer architectures without fundamental changes over the next few years - though I suspect this assumption may prove conservative. We're likely to use AI itself to discover more efficient architectures, potentially invalidating these projections in favorable ways. With that context, let's examine how inference energy consumption actually works and what drives its costs at scale.

The quadratic curse

Transformers - the neural network architecture powering most modern AI systems, from ChatGPT to Claude to Gemini - were created by former colleagues of mine at Google. The key innovation of this architecture is the ability to process all parts of an input simultaneously while understanding relationships between distant elements in the text. In transformer inference, prefill is the initial computational phase where the model processes your entire input prompt before generating any output. This involves a single forward pass through the network, computing hidden representations for all input tokens at once.

Your sequence length simply counts these tokens - the basic units of text that might be letters, partial words, or whole words depending on the tokenizer. "Hello, world!" typically translates to 3-4 tokens, while a lengthy document might contain thousands. This distinction matters because prefill computation scales with sequence length, making long prompts significantly more energy-intensive than short ones.

Prefill time grows quadratically with sequence length - double the input, quadruple the computation. This scaling behavior stems from transformers' core mechanism: self-attention. Self-attention requires computing relationships between every pair of tokens in the input. For n tokens, that's n² comparisons. Unlike older architectures (RNNs) that process tokens sequentially, transformers examine all tokens simultaneously, with each token gathering information from every other token in parallel.

Here's an intuitive analogy: imagine a roundtable discussion where each participant (token) prepares three items:

  • Query: "What information am I seeking?"

  • Key: "What information do I possess?"

  • Value: "What insight can I contribute?"

Each participant shares their query with everyone else, comparing it against others' keys to find the most relevant matches. They then synthesize their understanding by combining values from those whose keys best align with their query. Every participant does this simultaneously, creating a rich, interconnected understanding of the entire conversation. This elegant mechanism enables transformers' remarkable capabilities, but it comes at a cost: computational requirements that scale quadratically with input length. A 2,000-token prompt requires four times the computation of a 1,000-token prompt, not twice. This mathematical reality shapes the energy economics of AI inference at scale.
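
A minimal NumPy sketch makes the quadratic cost visible: the attention-score matrix has one entry for every pair of tokens, so the work to fill it grows as n². This is an illustrative toy - single head, random projections standing in for learned weights - not any production implementation:

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """Single-head self-attention over x with shape (n_tokens, d_model)."""
    n, d = x.shape
    rng = np.random.default_rng(0)
    # Each token prepares a query, key, and value (random projections here).
    W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q, K, V = x @ W_q, x @ W_k, x @ W_v

    scores = Q @ K.T / np.sqrt(d)   # (n, n): every token compared against every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)            # row-wise softmax
    print(f"{n:>5} tokens -> {scores.size:>9,} pairwise comparisons")
    return weights @ V              # each token blends values from the tokens it attends to

for n in (500, 1_000, 2_000):
    self_attention(np.random.default_rng(1).standard_normal((n, 64)))
```

The printed counts - 250,000, 1,000,000, and 4,000,000 comparisons - are exactly the quadratic growth discussed below.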

The Two Phases of Transformer Processing

Every transformer request involves two distinct computational phases:

  1. Prefill: Processing the entire input prompt (quadratic scaling with input length, O(n²) complexity)

  2. Decode: Generating output tokens one by one (linear scaling with output length, O(n) complexity)

This scaling difference has profound implications. While decode time grows linearly with the number of tokens generated, prefill time grows with the square of input length. The longer your prompt, the more dramatically prefill dominates total processing time.

Consider the relative computational work (in arbitrary units, assuming 50 output tokens):

Input Length | Prefill Work (∝ n²) | Decode Work | Prefill % of Total
500 tokens | 250,000 units | 50,000 units | 83%
1,000 tokens | 1,000,000 units | 50,000 units | 95%
2,000 tokens | 4,000,000 units | 50,000 units | 99%

The pattern is stark. At 500 input tokens, prefill already consumes 83% of processing time. Double the input to 1,000 tokens, and prefill jumps to 95% - the actual generation phase becomes almost negligible. At 2,000 tokens, you're spending 99% of compute just understanding the prompt. Here's what's happening:

  • Prefill work = n² (where n = input tokens)

  • Decode work = 1,000 × output tokens (arbitrary scaling factor)

This quadratic scaling of self-attention explains why long-context models are so computationally expensive. As context windows expand from thousands to hundreds of thousands of tokens, the energy requirements don't just grow - they explode. Understanding this dynamic is crucial for anyone designing AI systems or planning infrastructure for the age of ubiquitous AI.
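
Those percentages fall out of two lines of arithmetic. Here is a quick sketch using the same arbitrary work units as the table above (prefill = n², decode = 1,000 per output token):

```python
# Relative compute work for prefill vs. decode, in the table's arbitrary units.
OUTPUT_TOKENS = 50
DECODE_UNITS_PER_TOKEN = 1_000   # arbitrary scaling factor from the text

for n_input in (500, 1_000, 2_000):
    prefill = n_input ** 2                          # quadratic in input length
    decode = DECODE_UNITS_PER_TOKEN * OUTPUT_TOKENS  # linear in output length
    share = prefill / (prefill + decode)
    print(f"{n_input:>5} input tokens: prefill {prefill:>9,} units, "
          f"decode {decode:,} units, prefill share {share:.0%}")
```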

Doubling words, quadrupling watts

The quadratic scaling of context windows isn't just an abstract computational concern - it translates directly into energy consumption. Every FLOP requires energy, and when FLOPs scale quadratically, so does your electricity bill.

The energy equation is straightforward:

  • Devices draw roughly constant power P during operation (e.g., 300W for a high-performance GPU)

  • Energy consumed equals power multiplied by time: E = P × T

  • Since prefill time scales quadratically with input length, so does prefill energy

Let's make this concrete with realistic parameters:

  • Power draw: 300W

  • Decode time: 20ms (fixed for 50 output tokens)

  • Baseline prefill: 100ms for 500 input tokens

Input Tokens | Prefill Time | Prefill Energy | Decode Time | Decode Energy | Prefill % of Total
500 | 0.10 s | 30 J | 0.02 s | 6 J | 83%
1,000 | 0.40 s | 120 J | 0.02 s | 6 J | 95%
2,000 | 1.60 s | 480 J | 0.02 s | 6 J | 99%

The energy story mirrors the computational one. While decode energy remains constant at 6 joules regardless of input length, prefill energy explodes from 30J to 480J as input doubles from 500 to 2,000 tokens. At 2,000 tokens, you're burning 80 times more energy understanding the prompt than generating the response. 

Let's recap these results.

At 500 input tokens, prefill consumes 30 J versus decode's 6 J - already 83% of total energy. Double the input to 1,000 tokens, and prefill time quadruples, pushing energy consumption to 120 J and commanding 95% of the total. By 2,000 tokens, the imbalance becomes extreme: 480 J for prefill versus 6 J for decode, with prefill consuming 99% of the energy budget. Extrapolate to a 10,000-token prompt generating just 1,500 output tokens, and you're looking at 3.4 Wh per query - nearly all spent on prefill. This isn't a marginal effect; it's the dominant factor in inference energy consumption.
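
As a sanity check, here is a small sketch of that energy model: E = P × T, with prefill time scaled quadratically from the 100 ms / 500-token baseline and decode at 0.4 ms per output token (from the 20 ms / 50-token figure). These are the article's illustrative parameters, not measurements of any real GPU:

```python
POWER_W = 300                    # assumed device power draw
BASE_PREFILL_S = 0.100           # 100 ms for a 500-token prompt
BASE_PREFILL_TOKENS = 500
DECODE_S_PER_TOKEN = 0.020 / 50  # 20 ms for 50 output tokens

def query_energy_wh(input_tokens: int, output_tokens: int) -> float:
    prefill_s = BASE_PREFILL_S * (input_tokens / BASE_PREFILL_TOKENS) ** 2  # quadratic
    decode_s = DECODE_S_PER_TOKEN * output_tokens                           # linear
    joules = POWER_W * (prefill_s + decode_s)   # E = P x T
    return joules / 3600                        # joules -> watt-hours

for n_in, n_out in [(500, 50), (1_000, 50), (2_000, 50), (10_000, 1_500)]:
    print(f"{n_in:>6} in / {n_out:>5} out -> {query_energy_wh(n_in, n_out):.2f} Wh")
```

The last line reproduces the roughly 3.4 Wh figure for the 10,000-token prompt above.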

The implications are therefore profound. Whether you're designing for on-device inference with battery constraints, deploying in autonomous vehicles, or managing massive cloud infrastructure costs at scale, prompt length becomes your primary lever for controlling energy consumption. This scaling asymmetry defines the energy economics of AI. Decode plods along linearly - each output token costs the same as the last. But prefill explodes quadratically with input length, its hunger growing with the square of every token fed. A thousand-token prompt doesn't just double the cost of a 500-token prompt - it quadruples it. By the time we reach today's massive contexts, decode all but disappears, a thin shadow cast by prefill's towering consumption.

And yet, users control neither the true input nor the output. There isn't a "token budget" in any service out there today, and one would likely create a frustrating user experience if there were. The largest providers - OpenAI, Google, Anthropic - inject substantial hidden context into every prompt while keeping their system instructions opaque. Output remains equally unconstrained: unless users explicitly demand brevity, models generate tokens freely, most services never think to limit responses, and most users don't even realize they can ask.

This creates a fundamental tension in AI system design. While longer contexts enable richer interactions and more sophisticated reasoning, they exact a quadratically increasing energy toll. Once prompts exceed a few hundred tokens, virtually all computational resources are consumed by the prefill phase alone. For sustainable AI deployment at scale, prompt concision isn't merely good practice - it's an energy imperative. The difference between 500-token and 2,000-token average prompts could determine whether our global infrastructure remains viable or collapses under its own consumption.

This problem compounds as AI agents and capabilities like Deep Research proliferate. Each autonomous action, each recursive query, each unconstrained generation adds to an already exponential curve. We're building systems designed to think deeply while hoping they'll somehow learn restraint—a contradiction that grows more stark with every token generated.

The paradox blooms in plain sight

Here's the irony: while physics demands shorter prompts, the industry is sprinting in the opposite direction. Claude 4's system prompt far exceeds 10,000 tokens. Despite optimization techniques like KV cache retrieval and prefix caching, the overwhelming trend is toward ever-expanding context windows. We're stuffing everything we can into prompts - documentation, code repositories, conversation histories - because it demonstrably improves model capabilities.

A colleague recently quipped (hat tip, Tyler!): "We'll achieve AGI when all of Wikipedia fits in the prompt!" It's a joke that hits uncomfortably close to our current trajectory of maximizing context windows at every opportunity. 

My rough calculations above turn out to align remarkably well with recent empirical findings. This very recent research shows that a GPT-4o query with 10,000 input tokens and 1,500 output tokens consumes approximately 1.7 Wh on commercial datacenter hardware. For models with more intensive reasoning capabilities, the numbers climb dramatically: DeepSeek-R1 averages 33 Wh for long prompts, while OpenAI's o3 model reaches 39 Wh. These aren't theoretical projections - they're measured consumption figures from production systems. 

The energy cost of our context window expansion is real, substantial, and growing with each new model generation. We're caught between two competing imperatives: the computational benefits of longer contexts and the exponential energy costs they incur. The other interesting observation in the table below is the explosive increase in power requirements as models have grown larger and more sophisticated - the power laws of scaling continue.

Model | Release Date | Energy (Wh), 100 input / 300 output tokens | Energy (Wh), 1K input / 1K output tokens | Energy (Wh), 10K input / 1.5K output tokens
o4-mini (high) | Apr 16, 2025 | 2.916 ± 1.605 | 5.039 ± 2.764 | 5.666 ± 2.118
o3 | Apr 16, 2025 | 7.026 ± 3.663 | 21.414 ± 14.273 | 39.223 ± 20.317
GPT-4.1 | Apr 14, 2025 | 0.918 ± 0.498 | 2.513 ± 1.286 | 4.233 ± 1.968
GPT-4.1 mini | Apr 14, 2025 | 0.421 ± 0.197 | 0.847 ± 0.379 | 1.590 ± 0.801
GPT-4.1 nano | Apr 14, 2025 | 0.103 ± 0.037 | 0.271 ± 0.087 | 0.454 ± 0.208
GPT-4o (Mar '25) | Mar 25, 2025 | 0.421 ± 0.127 | 1.214 ± 0.391 | 1.788 ± 0.363
GPT-4.5 | Feb 27, 2025 | 6.723 ± 1.207 | 20.500 ± 3.821 | 30.495 ± 5.424
Claude-3.7 Sonnet | Feb 24, 2025 | 0.836 ± 0.102 | 2.781 ± 0.277 | 5.518 ± 0.751
Claude-3.7 Sonnet ET | Feb 24, 2025 | 3.490 ± 0.304 | 5.683 ± 0.508 | 17.045 ± 4.400
o3-mini (high) | Jan 31, 2025 | 2.319 ± 0.670 | 5.128 ± 1.599 | 4.596 ± 1.453
o3-mini | Jan 31, 2025 | 0.850 ± 0.336 | 2.447 ± 0.943 | 2.920 ± 0.684
DeepSeek-R1 | Jan 20, 2025 | 23.815 ± 2.160 | 29.000 ± 3.069 | 33.634 ± 3.798
DeepSeek-V3 | Dec 26, 2024 | 3.514 ± 0.482 | 9.129 ± 1.294 | 13.838 ± 1.797
LLaMA-3.3 70B | Dec 6, 2024 | 0.247 ± 0.032 | 0.857 ± 0.113 | 1.646 ± 0.220
o1 | Dec 5, 2024 | 4.446 ± 1.779 | 12.100 ± 3.922 | 17.486 ± 7.701
o1-mini | Dec 5, 2024 | 0.631 ± 0.205 | 1.598 ± 0.528 | 3.605 ± 0.904
LLaMA-3.2 1B | Sep 25, 2024 | 0.070 ± 0.011 | 0.218 ± 0.035 | 0.342 ± 0.056
LLaMA-3.2 3B | Sep 25, 2024 | 0.115 ± 0.019 | 0.377 ± 0.066 | 0.573 ± 0.098
LLaMA-3.2-vision 11B | Sep 25, 2024 | 0.071 ± 0.011 | 0.214 ± 0.033 | 0.938 ± 0.163
LLaMA-3.2-vision 90B | Sep 25, 2024 | 1.077 ± 0.096 | 3.447 ± 0.302 | 5.470 ± 0.493
LLaMA-3.1-8B | Jul 23, 2024 | 0.103 ± 0.016 | 0.329 ± 0.051 | 0.603 ± 0.094
LLaMA-3.1-70B | Jul 23, 2024 | 1.101 ± 0.132 | 3.558 ± 0.423 | 11.628 ± 1.385
LLaMA-3.1-405B | Jul 23, 2024 | 1.991 ± 0.315 | 6.911 ± 0.769 | 20.757 ± 1.796
GPT-4o mini | Jul 18, 2024 | 0.421 ± 0.082 | 1.418 ± 0.332 | 2.106 ± 0.477
LLaMA-3-8B | Apr 18, 2024 | 0.092 ± 0.014 | 0.289 ± 0.045 |
LLaMA-3-70B | Apr 18, 2024 | 0.636 ± 0.080 | 2.105 ± 0.255 |
GPT-4 Turbo | Nov 6, 2023 | 1.656 ± 0.389 | 6.758 ± 2.928 | 9.726 ± 2.686
GPT-4 | Mar 14, 2023 | 1.978 ± 0.419 | 6.512 ± 1.501 |

Table 4 from "How Hungry is AI?" (I added the model release dates.)

Trade 8,500 conversations to keep your home cool

To grasp the practical implications, consider this sobering calculation: at o3's extreme consumption of 39 Wh per query, approximately 76,923 interactions would drain 3,000 kWh - equivalent to powering a typical American home's air conditioning for an entire year. But as users inevitably gravitate toward richer prompts - say, 30K input tokens with 3.5K outputs - the quadratic curse strikes with mathematical precision. Prefill energy multiplies ninefold, collapsing that annual budget to just 8,500 interactions: merely 23 queries per day. This comparison transforms abstract energy figures into visceral reality, revealing how what appears negligible at the per-query level becomes a massive aggregate demand. And we're still only discussing text models.
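
The arithmetic behind those figures, as a sketch: 39 Wh per query is the measured o3 figure cited above, and the ninefold multiplier assumes per-query energy is dominated by prefill, which scales with the square of a roughly 3x longer input:

```python
ANNUAL_AC_BUDGET_KWH = 3_000   # ~one U.S. home's air conditioning for a year
O3_WH_PER_QUERY = 39           # measured long-prompt consumption (10K in / 1.5K out)

queries_today = ANNUAL_AC_BUDGET_KWH * 1_000 / O3_WH_PER_QUERY
print(f"At {O3_WH_PER_QUERY} Wh/query: {queries_today:,.0f} queries per AC-year")

# Richer prompts: ~30K input tokens, so prefill-dominated energy scales by (30K / 10K)^2 = 9x.
richer_wh = O3_WH_PER_QUERY * (30_000 / 10_000) ** 2
queries_richer = ANNUAL_AC_BUDGET_KWH * 1_000 / richer_wh
print(f"At ~{richer_wh:.0f} Wh/query: {queries_richer:,.0f} queries "
      f"(~{queries_richer / 365:.0f} per day)")
```

This prints roughly 76,900 queries today versus about 8,500 (around 23 per day) under the richer-prompt assumption.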

The trajectory becomes even more stark when we consider multimodal AI. Image and video models can routinely process upwards of hundreds of thousands of tokens per query. Each frame, each visual element, each temporal relationship adds to the token count. As these models become mainstream, we're not just scaling linearly with adoption - we're multiplying adoption rates by dramatically higher per-query energy costs.

The math is unforgiving: widespread deployment of long-context AI at current efficiency levels would require energy infrastructure on a scale we're not remotely prepared for. This isn't a distant concern - it's the reality we're building toward with every context window expansion and every new multimodal capability. The urgency is real: we need to move faster, or soon the trade-offs become explicit - 8,500 conversations or one cool home. Your ChatGPT query or your neighbor's heating. And while you might laugh, it's already happening here in the United States, and water availability is next.

The stakes have never been higher

The mathematics we've explored tells a clear story: prefill computation scales quadratically with input length. Double your prompt, quadruple your energy consumption. As models embrace ever-larger context windows, this dynamic becomes one of the primary drivers of AI's steep energy trajectory.

Of course, these projections assume a degree of technological stasis. Innovation could disrupt these trends - and likely will - and the work we are doing at Modular is certainly trying to help. But consider this provocative framing: what if we viewed our collective future not through the lens of human populations and national borders, but through available compute capacity? In this view, the race to build massive datacenter infrastructure becomes humanity's defining competition. If each AI agent represents some fraction of human productive capacity, then the first to achieve a combined human-digital population of 5 or 10 billion wins the AI race, and invents the next great technological frontier.

This perspective makes efficiency not just important, but existential. The promise of AI as humanity's great equalizer inverts into its opposite: a world where computational capacity becomes the new axis of dominance. Without breakthrough innovation at every layer - from silicon to algorithms to system architecture - we face an AI revolution strangled by the very physics of power generation. The future belongs not to the wise, but to the watt-rich. And so we must scale with unprecedented urgency - nuclear, renewables, whatever it takes - because the stakes transcend mere technological supremacy. In this race, computational power becomes political power, and China is currently winning by a large margin. If democratic nations cede the AI frontier to autocracies, we don't just lose a technological edge; we risk watching the values of human dignity and freedom dim under the shadow of algorithmic authoritarianism. The grid we build today determines whose values shape the world of tomorrow.

It's interesting to reflect that we are teaching machines to think with the very energy that makes our planet uninhabitable - yet these same machines may be our only hope of learning to live within our means. AI is both the fever and the cure, the flood and the ark, the hunger that's outrunning itself. We race against our own creation, betting that the intelligence we birth from burning carbon will show us how to stop burning it altogether. The question of our age: Can we make AI wise enough to save us before it grows hungry enough to consume us? The stakes are unquestionably high in the race to AI superintelligence.

I'll close with an irony that perfectly captures the current moment - the suggestions below come courtesy of Anthropic’s Claude 4 Opus:

  1. Better Chips and Smarter Cooling - The latest AI chips use way less energy for the same work. Pair that with innovative cooling like liquid systems or modular designs, and data centers have already seen energy savings of up to 37% in test runs.

  2. Timing is Everything - Not all AI work needs to happen right now. By running non-urgent tasks when electricity is cleaner (like when it's sunny or windy), some companies have cut their carbon emissions from AI jobs by 80-90%. It's like doing laundry at night when rates are lower, but for the planet.

  3. Power Where You Need It - Building solar panels, wind turbines, and battery storage right at data center sites makes sense. Google's recent $20 billion investment in clean energy shows how tech giants can grow their AI capabilities without relying entirely on the traditional power grid.

  4. Working Together on Infrastructure - Data center operators need to share their growth plans with utility companies. This helps everyone prepare for the massive power needs coming to tech hubs like Northern Virginia, Texas, and Silicon Valley - think of it as giving the power company a heads-up before throwing a huge party.

  5. Show Your Work - Just like appliances have energy ratings, AI companies should tell us how much power they're using. Whether it's energy per query or per training session, transparency creates healthy competition to be more efficient.

  6. Investing in Tomorrow's Solutions - Government programs are funding research into game-changing technologies like optical processors that could use 10 times less energy. There's also exciting work on making AI models smaller and smarter without losing capabilities.

  7. Turning Waste into Resources - Data centers generate tons of heat - why not use it? Some facilities are already warming nearby buildings with their excess heat, turning what was waste into a community benefit.

And there it is - 40 watt-hours spent asking AI how to save watt-hours per token, with some of these ideas still unproven (e.g., optical processors). The perfect metaphor for our moment: we burn the world to ask how to stop burning it, racing our own shadow toward either wisdom or ruin. The verdict seems absolute: scale or surrender - there is no middle ground in the physics of power.

Thanks to Tyler Kenney, Kalor Lewis, Eric Johnson, Christopher Kauffman, Will Horyn, Jessica Richman, Duncan Grove and others for many fun discussions on the nature of AI and energy.
