Probabilistic engineering and the 24-7 employee
Software is quietly becoming a probabilistic system, and almost no one is saying it out loud.
We built our profession around deterministic code. Write it, test it, ship it, know it works - but in my experience that contract is breaking. Inside the top few percent of operators at truly AI-native companies, the codebase has started to become something you believe works, with a probability you can no longer precisely state. The workday is changing as a consequence, and so are the roles, the organizations, the training pipelines, and the nature of what it means to ship.
I noticed because I built one.
A few months ago, in the evenings after my day job running Modular, I started building a side project called Compound Loop - a system that orchestrates multiple frontier models against each other to write, review, and merge code more or less autonomously. I would set it running on a real problem before I went to bed, and I would wake up and triage a stack of pull requests that had not existed the night before. Some were excellent, some were wrong, and some surfaced a question I did not know to ask. By 8 a.m. I was not catching up on yesterday's work - I was deciding which of the overnight jobs to keep, while the system kept analyzing logs and adding more PRs. The continuous compounding nature of it was, and still is, infectious to watch.
For the first time in the history of knowledge work, the person who went home did not take the only copy of their brain with them. 9-9-6 as a concept is dead, and we are simply 24-7 employees now - but the 24-7 employee is not a person working 24 hours, it is a person whose agents work with enormous parallelization. Most teams in 2026 still bottleneck on coordination rather than typing, and most organizations have barely begun to restructure, but the frontier is always where the future shows up first, and the frontier is already here. This essay is not a description of the industry at large, but rather a description of what is already happening inside the most AI-native teams, and where I believe that pulls the rest of the industry.
Roles are not just collapsing upward - they are splitting
Inside the most AI-native teams, the pattern is messier than the clean "everyone levels up" story most commentary is selling. Some operators really are moving up the stack: the best engineers are becoming more effective product managers, operating one abstraction layer above the code; the best product managers are becoming system architects; and the best architects are thinking about distribution, growth, and the shape of the market. For this group - maybe the top tier of any team - the work is more leveraged than it has ever been, and they are having the best years of their careers.
But that is not the whole picture, and pretending it is does a disservice to everyone else. Alongside the upward shift, a downward pressure is fragmenting roles in ways the headlines are not covering. Plenty of engineers are not becoming architects - instead they are becoming spec writers, reviewers, and agent babysitters, operators who spend their days translating intent into machine-readable prompts and then grading the machine's work against standards they themselves might not fully possess. Some of that work is genuinely important, but some of it is the 2026 equivalent of data entry, dressed up in new terminology.
We need to be honest about what that means for the people doing it. These fragmented roles will be paid less, valued less, and in many cases become career dead ends - a layer of output-wrangling work the system needs but does not reward. The pay gap between the top tercile, who run fleets of agents effectively, and the middle tercile, who manage their exhaust, will be wider than the pay gap between engineers and sales reps was in the previous era. That gap is already opening inside the companies I watch closely, and I don't believe it is going to close on its own.
One honest note on where the scarce work has moved. In AI infrastructure, kernel performance and compiler design and hardware abstraction remain deeply defensible moats, because there is still a high degree of determinism needed at the lowest levels of systems engineering. But at the level of building software on top of those moats, the center of gravity has shifted hard toward the human inputs a machine cannot yet replicate, and that shift is real and accelerating.
Jevons was right about coal, and he is right about code
In 1865, the economist William Stanley Jevons observed that more efficient steam engines led to more coal consumption rather than less - efficiency expanded the set of things worth building engines for. We are living the software version of that same observation, and it is one of the most exciting moments the profession has ever seen. As the unit cost of writing code approaches zero, we are not writing less, we are writing vastly more and shipping vastly more, and the best teams are leaning into the curve with both hands.
The companies that believe the scaling laws are unbounded are building accordingly, and they will be the power-law-distributed winners.
Many of my friends at leading AI-native companies are already operating this way in practice. Agents are opening pull requests, reviewing each other's work, and closing them without a human ever touching the keyboard, with a live log-monitoring loop that catches and fixes issues as they surface. Self-healing test suites rewrite themselves when the underlying code changes. Autonomous experimentation loops spin up, measure, and tear down a hundred hypotheses in the time a team once ran three. Documentation updates itself on every merge, driven by tightly honed AI skills that themselves self-improve. We are moving from a world where features were bound by how fast engineers could type to one where we are bound by human creativity, the management of agentic systems, and how fast the product surface can absorb the output.
In my view, this is a wonderful moment to build. The throughput gains are not subtle, and the teams that have genuinely restructured around agents are shipping three, five, or ten times what they shipped a year ago, and the curve is bending up rather than flattening. Many of the founders and operators I talk to who are running their companies this way are not complaining about noise - they are trying to figure out how to feed more work to their agent fleets tomorrow than they did today, because every incremental unit of well-directed agent output is a compounding advantage over competitors who are still typing.
But Jevons' second lesson applies here too, and it is the one that separates teams that ride the curve from teams that get thrown off it: when supply explodes, selection becomes everything. More coal made engines more valuable, but it also made the discipline of choosing what to burn, what to power, and what to build with the output dramatically more important. Cheap energy without judgment is just waste, and the same logic applies to code.
For the teams running this well, selection is not a drowning problem - it is the new leverage point. The operator who can direct a fleet of agents toward the right problem, filter the outputs for what is actually valuable, and integrate the results into something coherent is doing the highest-leverage work in software right now. The value of a piece of work is no longer set by how much effort it took to produce, because effort has collapsed - it is set by how well someone pointed the agent fleet, chose from what came back, and integrated it into something that compounds even faster. Production is not where the work gets hard anymore. Where it is hard now is direction, selection, and coherence, and those are the exact muscles the best teams are building for as fast as they can.
From deterministic engineering to probabilistic engineering
We are rapidly moving from deterministic engineering to probabilistic engineering, and our tools, our training, and our organizational instincts are still built for the old paradigm. Deterministic engineering was the contract we operated under for most of the history of the profession - you wrote code, you tested it, you reviewed it, and you knew, within well-understood bounds, what it did. Failures were deterministic - given the same input, you got the same output, and a bug was a reproducible thing you could hunt.
Probabilistic engineering is different, and inside frontier teams it is already here. Large portions of the codebase were generated by stochastic systems, reviewed under time pressure against contexts too large to fully hold, and integrated into a whole that no single human ever designed end-to-end. The codebase still runs and still ships, but the confidence interval around "this works as intended" has widened, and most teams have not updated their practices to reflect it. This is where the asymmetry at the center of all this comes into focus: generation has become cheap, but validation has not.
An agent can produce a plausible-looking 500-line pull request in under a minute, but catching a subtle bug in that same pull request - a concurrency issue, a silent misinterpretation of the spec, or a case where the code does what was literally asked for but not what was actually wanted - can take a senior engineer an hour of careful reading, or longer. Review scales worse than generation, and crucially, review scales worse than linearly with output volume, because as more of your codebase is written by agents, the context you need to hold in your head to evaluate any single piece grows. You are not reviewing one pull request against a codebase you wrote; you are reviewing a pull request against a codebase largely written by other agents, reviewed by you at a depth you have started to forget, under time pressure that is always rising.
At some scale, the system produces more than humans can reliably evaluate, and correctness becomes probabilistic rather than assured.
This is not a future problem, it is a present one. Past a certain throughput, bugs slip through not because reviewers are careless, but because the output volume has exceeded what human attention can meaningfully inspect, and the models doing much of the review are non-deterministic themselves and miss plenty. The codebase stops being a thing you know works and becomes a thing you believe works, with a probability you can no longer precisely state.
Concretely, this looks like a race condition that passes your test suite nine times out of ten, a feature that works perfectly in staging and fails under a prompt distribution you did not anticipate, or a migration that is silently corrupting one row in ten thousand and will take three weeks to catch. Proximal and Modular recently published joint research testing frontier agentic systems against basic tasks - the failure patterns we documented map directly to what I am describing. I've personally seen this in code I've written with my own multi-agent harness system. The typical failure mode is not a dramatic collapse but a slow, silent degradation - generation rises, review quality falls, unnoticed defects accumulate, and trust in the system quietly erodes until a customer or an auditor or a production incident forces the issue into the open. By then the technical debt runs deep.
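To make the first of those failure modes concrete, here is a minimal, simulated sketch of how a race condition hides behind a green test run. The code and the interleavings are illustrative, not drawn from any real system: two logical threads each do a read-modify-write on a shared counter, and whether the bug appears depends entirely on which interleaving the run happens to exercise.

```python
def run(schedule):
    """Run two simulated threads incrementing a shared counter.

    Each thread does: read the counter, then write its read value + 1.
    `schedule` is the global order of (thread, step) operations, which
    models a particular thread interleaving.
    """
    counter = 0
    local = {}
    for thread, step in schedule:
        if step == "read":
            local[thread] = counter      # stale snapshot of shared state
        else:  # "write"
            counter = local[thread] + 1  # may clobber the other thread's write
    return counter

# The interleaving your test suite happens to exercise: correct result.
serial = [("a", "read"), ("a", "write"), ("b", "read"), ("b", "write")]
# The interleaving production eventually hits: a silently lost update.
racy = [("a", "read"), ("b", "read"), ("a", "write"), ("b", "write")]

print(run(serial))  # 2 - both increments land, the test passes
print(run(racy))    # 1 - one increment is lost, and no test saw it
```

The defect is real in both runs; only the schedule decides whether it is observable. That is exactly why "passes the suite" stops being the same claim as "works."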
The uncomfortable truth is that we do not yet have the tooling to solve this properly. Culture helps - smaller merges, harder gates, ruthless skepticism toward polished output, observability, rollback discipline - but culture does not scale past a certain team size, and the systems we have today for evaluating probabilistic code are primitive compared to what we need. I hope someone is going to build the right tooling for this problem, and whoever does will define the operating system of serious software development for the next decade. The new CI/CD is not a tool yet - it is, for now, a culture of ruthless skepticism, and an honest admission that we are building the replacement for that culture in real time.
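Some of those cultural gates can, at least partially, be made mechanical while we wait for real tooling. A minimal sketch - the field names and thresholds are hypothetical, and this is not a real CI integration, just the shape of a pre-merge gate that enforces "smaller merges, harder gates" in code:

```python
# Hypothetical pre-merge gate: the thresholds and PR fields below are
# illustrative, not taken from any real CI system.
MAX_CHANGED_LINES = 300      # smaller merges: reject oversized diffs outright
REQUIRED_HUMAN_REVIEWS = 1   # agent review alone does not count as sign-off

def gate(pr):
    """Return (ok, reasons) for a pull request represented as a dict."""
    reasons = []
    if pr["changed_lines"] > MAX_CHANGED_LINES:
        reasons.append("diff too large to review meaningfully; split it")
    if pr["human_reviews"] < REQUIRED_HUMAN_REVIEWS:
        reasons.append("no human reviewer signed off")
    if not pr["has_rollback_plan"]:
        reasons.append("no rollback plan recorded")
    return (not reasons, reasons)

# A typical overnight agent PR: large, agent-reviewed only, no rollback plan.
ok, reasons = gate({"changed_lines": 850, "human_reviews": 0,
                    "has_rollback_plan": False})
print(ok, reasons)  # fails all three gates
```

The point is not that three `if` statements solve the problem - they do not - but that the gates worth having are the ones that survive being written down precisely enough to automate.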
Not every industry moves at the same speed
The shift from deterministic to probabilistic engineering will not happen uniformly. Technology diffusion takes time, legal and regulatory frameworks always lag technology progress, and the shift will tier by industry and by risk profile - so understanding the tiering matters for anyone deciding how to build.
The deterministic tier is made up of highly regulated, high-stakes domains - avionics, medical devices, financial trading infrastructure, nuclear control systems, the core of payment networks - which will remain deeply deterministic for a long time, and they should. The cost of a silent correctness failure in a fly-by-wire system is not a customer complaint - it is lives. These domains will adopt agent assistance carefully, behind formal verification, extensive simulation, and human sign-off chains that deliberately slow things down. That is not a failure of imagination - it is a correct reading of what the stakes demand.
The probabilistic tier is consumer software, internal tools, marketing systems, most SaaS, most content infrastructure, and most experimental and early-stage product work - this is where probabilistic engineering is already running hot and will rapidly accelerate. The cost of a bug is a rollback, an apology, a hotfix, and in exchange, teams in this tier get iteration speed that the deterministic world structurally cannot match. A probabilistic team willing to ship, measure, and correct can out-learn a deterministic competitor by an order of magnitude per quarter.
The "convergence zone" is what I call the interesting future in the middle, and it is where the next decade of competition plays out. As models get smarter, as the harnesses around them get better, and as iteration loops compress toward real time, the frontier of what is "safe enough to do probabilistically" will keep moving. Domains that look deterministic today - parts of insurance, parts of healthcare, parts of enterprise infrastructure - will find probabilistic methods creeping up on them from below, ten percent at a time. It will look slow, until suddenly it is fast. Meanwhile, the leading edge of probabilistic engineering will start building deterministic guardrails back in - formal checks, verified critical paths, hybrid systems where stochastic generation is bounded by deterministic verification.
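That hybrid pattern - stochastic generation bounded by deterministic verification - can be sketched in a few lines. The generator below is a toy stand-in for an agent (a real system would call a model), and every name is hypothetical; the point is only that nothing unverified is ever accepted, so the output is deterministic-grade even though the producer is not:

```python
import random

def stochastic_generate(spec, rng):
    # Toy stand-in for an agent proposing a candidate ordering of `spec`.
    # A real system would call a model here; this one just guesses.
    candidate = list(spec)
    rng.shuffle(candidate)
    return candidate

def generate_verified(spec, rng, max_attempts=1000):
    # Deterministic guardrail: accept a candidate only if it provably
    # satisfies the property we care about (here, being sorted).
    for _ in range(max_attempts):
        candidate = stochastic_generate(spec, rng)
        if candidate == sorted(spec):
            return candidate
    # Bounded retries: when verification keeps failing, escalate instead
    # of shipping an unverified guess.
    raise RuntimeError("no candidate passed verification; escalate to a human")

print(generate_verified([3, 1, 2], random.Random(0)))  # [1, 2, 3]
```

The design choice worth noticing is where the trust boundary sits: the stochastic component can be arbitrarily wrong, because the deterministic check - not the generator - decides what ships.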
The winners over the next ten years will be the teams that know which tier they are in, resist the temptation to pretend they are in a different one, and get very precise about where the boundary between the two should sit inside their own stack.
The agentic fleet
I have thought a lot about the right metaphor for what is changing, and it is not the "factory shift," because the factory worker was the system being automated and that is not us. The right metaphor, in my view, is "the agentic fleet" - but I want to be careful with the word, because "fleet" carries connotations of order, hierarchy, and reliability that the reality does not yet deserve. What most operators are actually running is closer to a swarm of brittle contractors than a well-drilled Navy: the agents are uneven in capability, stochastic in behavior, occasionally confidently wrong, and often expensive to run at scale. Orchestration layers break, context windows blow up, and inference costs appear on bills that founders and C-level teams do not want to show their boards.
With that caveat stated honestly, I still think the agentic fleet concept holds. A fleet has composition - different agents for different tasks. It has coordination - handoffs, dependencies, escalation paths, and it has a command structure - someone decides the mission, someone sets the rules of engagement, someone reviews what came back. And critically, a fleet has watch shifts: it does not stop when the commander sleeps, it carries on within the orders it was given, and it reports back in the morning with what it found.
A good fleet is not defined by how much it produces - it is defined by how well what it produces holds together. Framed in this way, one's workday has a new shape - triage and merge in the morning, high-leverage human work in the middle (customer conversations, strategy, product decisions, writing the specs that will drive the night run), and review and redirection in the afternoon as the first agents report back. Then, at the end of the day, something the previous generation of knowledge workers never did - you simply hand off. You queue the work and give your agentic fleet the specs for what you want attempted overnight, you dispatch them, and you accept that some of what comes back will be wrong, some will be brilliant, and the difference between the two is the work that only you can do. Then your work day is complete. Your agents do not sleep, and that is the whole point - you wake up ahead of where you ended, provided the review discipline holds.
Build for the model you do not have yet
One of the most consistent things I have been saying for the last few years - and a point that many large enterprise leaders I speak with still miss - is that the model we are using today is the dumbest model we will ever use.
I want to be careful with that claim, because capability growth is not guaranteed to be smooth - costs, latency, reliability, and scaling limits may complicate the curve in ways that matter. But the directional bet is well-supported by what I see at the infrastructure layer: frontier capability will meaningfully exceed today's in the next six to twelve months, and the gap between the best model you can work with now and the best model you will work with then will likely be larger than the gap between today and a year ago, which was already substantial. The scaling laws continue to hold true.
This has a strategic implication most leadership teams have not fully absorbed. You are not building organizational muscle to harness the model you have - you are building it to harness the model you do not have yet. The specifications you are learning to write, the review culture you are installing, the observability you are wiring in, the agent fleet you are learning to direct, the training rituals you are experimenting with to keep your juniors' craft alive - none of this is for 2026 capability, it is scaffolding for 2027 and 2028. The companies building this scaffolding now, before the next capability jump lands, will absorb the jump as leverage, while the companies waiting for the tooling to mature before they retool their organizations will spend the first year of the next capability era learning what the early movers already know, while the early movers compound.
This is the part that separates organizations that stay relevant from organizations that do not. Build the system for the model you will have rather than the one you have today, and be willing to over-invest in specification, review, and operational discipline relative to what the current model demands, because the current model is the weakest you will ever work with. The teams that internalize this early pull away, and the teams that do not will find, eighteen months from now, that they have been quietly passed by competitors who spent this year building the wrong tools for the right problem. Irrelevancy, in this era, does not announce itself - it arrives as a gradual inability to keep up with teams that were not noticeably better than you a year ago.
The muscle we will lose
As I argued in Part V of my last essay, AI will either further stratify society or largely democratize it. We are creatures that are beautifully, relentlessly efficient at optimizing for the path of least resistance. Whenever possible, we select the options that minimize required effort - whether that effort is physical, cognitive, or emotional. For the purposes of this piece, that tendency leads to a simple conclusion: if you never build, you lose the ability to evaluate what is being built.
That is not a hypothetical - it is already happening with junior engineers who have leaned on AI since their first week on the job. They ship fast, they produce polished code, and they can describe in general terms what that code is doing. But when it fails in a way the model did not anticipate, they often cannot find the bug, because they never developed the internal model of the system that only forms when you have personally wrestled a stack trace at 2 a.m. for the hundredth time.
Taste is not learned by clicking approve on polished first drafts, judgment is not developed by accepting a machine's plausible answer in five seconds instead of sitting with a hard problem for an afternoon, and craft is not acquired by reviewing other agents' work. These are skills that only form through the friction that agents are now, very helpfully, removing.
This creates a training crisis most organizations have not begun to reckon with. The apprenticeship model of software engineering - juniors ship small things, seniors review them, juniors absorb taste through the red ink - breaks when the juniors are shipping through agents and the seniors are reviewing agent output rather than human output. Where does the next generation's craft come from? How do you train taste without reps? What replaces mentorship when the thing being mentored on was never written by the mentee in the first place? Here is the uncomfortable extension of this argument: for most traditional organizations I speak with, the current generation of senior engineers is the last cohort fully trained in the old methodology.
Everyone coming up behind them is learning in an environment where the hard parts of the work are mediated by machines that did not exist a few years ago. That does not mean they will be worse, it means they will be different, and the burden falls on the rest of us to figure out what hard-mode training looks like when the old hard mode is no longer commercially rational to impose. The teams that treat agents as a pure accelerant without redesigning how they develop people will find themselves, in five or ten years, with a generation of operators who can direct a fleet but might not understand the schematics of the boat. The response, for anyone reading this who wants their own craft to survive, is a contrarian one - do the work without the fleet. Obviously not all the time, and not even most of the time, but deliberately and regularly, the hard way, on something that matters. Keep the muscle, as most of your peers will not, and in a decade that might very well be the difference.
The uneasy part
This essay does not resolve to optimism by design - as with all change, pretending it is not coming will not stop it from arriving. Work has already changed for good, and it will keep evolving at the pace of AI. With it, we will all reclaim the day for work that actually requires a human, and the machine will reclaim the night for work that was always drudgery.
The next few years will be messy. The plausible cases also include an employee class exhausted by the review burden they signed up for, a layer of fragmented roles the system needs but does not reward, a generation of juniors who never develop the craft the current seniors used to judge what came back, teams that confuse volume of output for quality of work and do not notice the gap until an incident forces it, and organizations that built operational muscle for the next model and organizations that did not, separated by a gap that keeps widening. All of it is possible, and some of it is already happening.
At a minimum, the key takeaway is this. Let us build organizations for the model we do not have yet, so we are not caught flat when it arrives. Let us keep building the hard things ourselves, sometimes, so we remember how. Let us dispatch the night fleet and sleep well knowing the work is underway - and stay awake to the possibility that some of what comes back is wrong in ways we are no longer trained to see.
The 24-7 employee is not a promise, it is a rearrangement and a bet on a probabilistic engineering future. The bet is that the human in the loop remains sharp enough, honest enough, and trained well enough to be worth having in the loop at all - and that the organization around that human is built not for today's model, but for the one that has not shipped yet. That bet is winnable but it is not yet won.