May 13, 2026 · 13 min read · original research · methodology

We modeled AI substitution economics for 15 knowledge-worker roles. Here's where it gets uncomfortable.

Cross-role read across software engineering, product, design, data, ML, ops, support, sales engineering, finance, and engineering management. Five findings — including the one most AI-risk frameworks systematically miss.

By Andrei Kondrykau. Methodology is published at /methodology.

We built Wagecore to answer one question for each knowledge-worker role: is AI substitution operationally cheaper than the human, or is it only theoretically cheaper? Below are five findings from running our v1 capability matrix across fifteen roles. The most uncomfortable one is also the simplest.

Before the findings: this is a model read, not a survey. We have not yet collected user adoption data at scale, and the numbers come from a transparent hand-authored matrix calibrated against published research (MIT CSAIL on automation viability, BCG on enterprise AI value capture, and the post-incident reviews from Klarna, Uber, and others). The matrix versioning, axis definitions, and threshold rules are all on our methodology page. We mention this up front because the moat we are building is not the matrix — the matrix is open. The moat is the per-task adoption distribution as users contribute, which we are explicit about being v0 today.

The fifteen roles

The v1 corpus covers five technical roles (software engineer, data engineer, machine-learning engineer, product manager, product designer), five operations-adjacent roles (customer-support lead, sales engineer, engineering manager, financial analyst, account executive), and five creative-and-breadth roles (content marketer, growth marketing manager, UX researcher, recruiter, business operations analyst). Each role has six to eight representative tasks, scored on nine axes: four capability-cluster axes, three reliability-cluster axes, an operational economics modifier, and a human-advantage dampener composed of five canonical irreducible-value axes.
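Schematically, a v1 cell looks something like this. The individual axis names inside each cluster are simplified for illustration (the canonical definitions and versioning live on /methodology); the one field name the post uses verbatim is errorCostMultiplier.

```typescript
// A simplified sketch of one (role × task) cell. Axis names inside the
// clusters are illustrative; canonical definitions live on /methodology.
interface TaskCell {
  role: string;
  task: string;
  // Four capability-cluster axes, scored 0-100.
  capability: { understanding: number; generation: number; toolUse: number; domainDepth: number };
  // Three reliability-cluster axes, scored 0-100.
  reliability: { consistency: number; factuality: number; recoverability: number };
  // Operational economics modifier: the 1-5 error-cost multiplier.
  errorCostMultiplier: 1 | 2 | 3 | 4 | 5;
  // Human-advantage dampener: five canonical irreducible-value axes,
  // encoded as string tags in v1 (numeric scores planned for v1.5).
  humanAdvantageTags: Array<'trust' | 'ambiguity' | 'accountability' | 'persuasion' | 'context'>;
}
```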

Per ADR-016, every task lands in one of four substitution classes — Replaceable (AI runs end-to-end with minimal oversight), AI-augmented (AI does most of the work, the human owns decisions and context), Human-led + AI-assisted (the human leads, AI is tooling), and Human-critical (AI delivers no net value, or negative value, due to trust, regulation, accountability, or relational complexity). The thresholds are deterministic, encoded in code, and explained at length in the canonical taxonomy post.
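In code, the assignment reduces to a small pure function. In the sketch below, the Replaceable gates (capability ≥ 75, reliability ≥ 80, low error cost) and the error-cost override are the ones this post names; the "low error cost" cutoff and the middle-band boundaries are placeholders we invented here, and ADR-016 remains the canonical rule set.

```typescript
type SubstitutionClass =
  | 'Replaceable'
  | 'AI-augmented'
  | 'Human-led + AI-assisted'
  | 'Human-critical';

// Deterministic class assignment. The Replaceable gates and the error-cost
// override are the ones named in this post; the middle-band boundaries and
// the "low error cost" cutoff are illustrative placeholders, not ADR-016 values.
function classify(capability: number, reliability: number, errorCostMultiplier: number): SubstitutionClass {
  // Rule 1: catastrophic error cost gates the task regardless of capability.
  if (errorCostMultiplier >= 5) return 'Human-critical';
  // Replaceable requires all three gates to pass simultaneously.
  if (capability >= 75 && reliability >= 80 && errorCostMultiplier <= 2) return 'Replaceable';
  // Placeholder boundaries for the middle two classes.
  if (capability >= 60 && reliability >= 60) return 'AI-augmented';
  if (capability >= 40) return 'Human-led + AI-assisted';
  return 'Human-critical';
}
```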

Finding 1 — Most knowledge work lives in the middle two classes

Across the 91 (role × task) cells in the v1 corpus, the baseline substitution-class distribution by task count is roughly: 4% Replaceable, 38% AI-augmented, 27% Human-led + AI-assisted, 31% Human-critical. The Replaceable bucket is narrow — only a handful of tasks in the corpus clear capability ≥ 75, reliability ≥ 80, AND low error cost simultaneously. The middle two classes carry the largest mass at 65% combined.

This matters because the dominant public framing of AI labor economics is binary. “Safe vs at risk.” “Will robots take my job, yes or no.” The data does not support either pole as majority. The honest read is that knowledge work decomposes into a portfolio of tasks where AI is operationally cheaper for some, more expensive for others, and a net wash for many.

For a software engineer in the v1 corpus, none of the eight modeled tasks land in Replaceable as their baseline class. Even documentation, where capability scores high, fails the reliability or error-cost gate when shipped into production code. The role is roughly half AI-augmented (feature implementation against a clear spec, code review drafting, writing tests), with the other half split between Human-led + AI-assisted (system design, on-call triage) and Human-critical (mentoring, architecture decisions with multi-year context). The share-weighted read lands the role in Augmentation territory: not Replaceable, not Human-critical.
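The share-weighted read is the obvious aggregation: weight each task's class by its share of the role's hours and report where the mass lands. A sketch, with illustrative shares for the software-engineer example above (the exact aggregation rule is the methodology's; this majority-mass version is a simplification):

```typescript
type TaskRead = { share: number; cls: string }; // share: fraction of the role's hours

// Report the substitution class carrying the largest share-weighted mass.
function roleRead(tasks: TaskRead[]): string {
  const mass = new Map<string, number>();
  for (const t of tasks) mass.set(t.cls, (mass.get(t.cls) ?? 0) + t.share);
  return [...mass.entries()].sort((a, b) => b[1] - a[1])[0][0];
}

// Illustrative shares for the software-engineer read:
roleRead([
  { share: 0.45, cls: 'AI-augmented' },            // implementation, review drafts, tests
  { share: 0.30, cls: 'Human-led + AI-assisted' }, // system design, on-call triage
  { share: 0.25, cls: 'Human-critical' },          // mentoring, long-horizon architecture
]); // → 'AI-augmented'
```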

Finding 2 — Capability has run ahead of reliability

Of the 91 (role × task) cells in our v1 matrix, 31 score capability ≥ 75 — well above the threshold that popular AI-risk frameworks treat as “the model can do this.” Of those 31, only 5 also score reliability ≥ 80 — the threshold that, combined with low error cost, triggers Replaceable under our rule set. The other 26 high-capability tasks fail the reliability gate. They are technically achievable in the demo and not achievable in production.

This is the “Klarna pattern” we wrote about separately. The model can complete the customer-service ticket. The model cannot complete it at the failure rate the business can tolerate. The gap between those two sentences is where most reversal cases live.

Examples from the corpus. A data engineer's pipeline-monitoring task scores capability in the high band but reliability in the mid-70s: capability passes the Replaceable bar, reliability does not. A growth marketer's headline-drafting task scores capability in the low 80s and reliability in the mid-60s, the same pattern. A UX researcher's transcript-synthesis task lands in the same shape: high capability, mid-tier reliability.

In all three cases the popular framing would label the task "automatable." The reliability and error-cost gates say otherwise: not at the failure rate the business will tolerate, not net of the human oversight needed to catch the errors, and not net of the cost of being wrong when oversight misses some.

Finding 3 — Error cost is the most underweighted axis in the public discourse

Wagecore scores error cost on a 1–5 multiplier for each task, where 1 means “wrong output is cheap to detect and correct” and 5 means “wrong output creates regulatory, financial, or reputational damage that compounds.” In the v1 corpus, roughly 38% of tasks score 4 or 5 — they punch above their weight in the headline substitution-class assignment.

Per Rule 1 of ADR-016, any task with errorCostMultiplier ≥ 5 lands in Human-critical regardless of capability. The capability score can be 95 — if confidently-wrong AI output is catastrophic, deploying that AI carries net negative expected value. The math is straightforward: the cost of one rare error, amortized across all the times the AI does not err, has to compare favorably against the all-in human cost. For tasks where the rare-error cost is large (medical sign-off, financial attestation, regulatory filing), the math fails.
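The arithmetic, with deliberately hypothetical numbers:

```typescript
// Per-output cost of the AI path, including the amortized cost of the rare
// error that slips past oversight. All inputs below are hypothetical.
function aiCostPerOutput(
  inferenceCost: number,  // tokens etc., per output
  oversightCost: number,  // reviewer minutes × loaded wage, per output
  errorRate: number,      // residual error rate after oversight
  errorCost: number       // cost of one error that ships
): number {
  return inferenceCost + oversightCost + errorRate * errorCost;
}

// A regulatory-filing shaped task, where a rare error is catastrophic:
aiCostPerOutput(0.50, 12, 0.002, 250_000); // = 0.50 + 12 + 500 = 512.50 per output
// Against an all-in human cost of, say, 80 per output, the math fails,
// exactly as the error-cost gate predicts.
```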

Two examples. A financial analyst's “prepare audit-grade variance commentary” task scores capability 70, reliability 60, error cost 5. The capability is mid-tier; the error cost gates the whole task into Human-critical. A customer-support lead's “respond to a regulator inquiry” task scores capability 68, reliability 55, error cost 5. Same gate.

Now compare to where popular AI-risk frameworks land these. Both tasks score in the “medium to high AI exposure” band on tools that weight only capability. The error-cost axis flips the conclusion. If you are a financial analyst reading a tool that ranks your role “78% exposed,” the implicit claim is that 78% of your work is operationally substitutable today. The reality is that the audit-grade outputs, which are the high-leverage part of the role, are operationally not substitutable today regardless of capability — and may never be substitutable, because the legal accountability axis is structurally human.

Finding 4 — The five human-advantage axes are not independent

We score each task on five canonical axes of irreducible human value: trust (sustained relationship), ambiguity (reading an unfamiliar room), accountability (named regulated sign-off), persuasion (changing someone's behavior through human dynamics), and context (multi-year history that does not fit in a model context window).

In the v1 corpus the axes cluster qualitatively into two groups. Tasks tagged with trust also tend to be tagged with accountability — the two co-occur on fiduciary work (medical, legal, financial attestation, named regulated sign-off). Tasks tagged with ambiguity tend to co-occur with context — open-ended judgment work like architecture, system design, or executive strategy. The two clusters do not meaningfully overlap in the corpus.

The implication is that “human-critical work” is not one thing. There are at least two distinguishable kinds: fiduciary work (auditor, doctor, lawyer, named therapist — high trust, high accountability) and judgment-under-ambiguity work (architect, senior PM, principal designer — high ambiguity, high context). The economics of automating these differ. Fiduciary work has structural human anchors (regulation, professional licensing, named liability). Judgment-under-ambiguity work has architectural anchors (no context window holds the multi-year tech-debt graph; no prompt captures the political map of the org).
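Because the v1 axes are string tags, the clustering claim is checkable with nothing fancier than pair counting. A sketch (the counting is generic; the tag names are the five above):

```typescript
// Count how often each pair of human-advantage tags appears on the same task.
function cooccurrence(taskTags: string[][]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const tags of taskTags) {
    const sorted = [...tags].sort();
    for (let i = 0; i < sorted.length; i++) {
      for (let j = i + 1; j < sorted.length; j++) {
        const key = `${sorted[i]}+${sorted[j]}`;
        counts.set(key, (counts.get(key) ?? 0) + 1);
      }
    }
  }
  return counts;
}

// In the v1 corpus, 'accountability+trust' and 'ambiguity+context' dominate;
// cross-cluster pairs like 'ambiguity+trust' barely register.
cooccurrence([
  ['trust', 'accountability'],             // audit-grade attestation
  ['ambiguity', 'context'],                // architecture decision
  ['trust', 'accountability', 'persuasion'],
]);
```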

We say this with a methodological asterisk: the corpus is hand-authored, the axes today are encoded as string tags per task rather than numeric scores, and we publish this clustering finding as a working hypothesis. The v1.5 evaluator panel (Claude plus GPT-4-class models, three evaluators in total) is the planned test: if numeric re-scoring breaks the two-cluster structure, we will update this finding and stamp the version.

Finding 5 — Oversight, not inference, is the dominant operational cost

For the typical v1 cell — combining the per-task oversight minutes, loaded-reviewer wage, and current token pricing in our cost-model constants — the largest single line in operational AI cost is oversight (minutes of human review per unit of output, multiplied by the loaded wage of the reviewer). Not tokens. Not orchestration. Not integration. The number-one driver of whether AI deployment ships net-positive economics is how many minutes of human attention each AI output still requires.

This is the line most public AI-cost analyses skip. The token line is cheap to compute and easy to defend (“a million tokens cost $X”). The oversight line requires knowing the reliability axis, the error-cost axis, and the loaded wage of the reviewer. Three numbers most calculator-style tools refuse to ask for.
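Put side by side with placeholder numbers, the two lines are not close. Token pricing, minutes, and wages below are illustrative constants, not our cost-model values; the point is the ratio.

```typescript
// Token line: output length × price per million tokens.
const tokensPerOutput = 8_000;
const pricePerMTok = 3.0;                                  // $ per million tokens
const tokenLine = (tokensPerOutput / 1e6) * pricePerMTok;  // $0.024 per output

// Oversight line: review minutes × loaded reviewer wage.
const oversightMinutes = 6;        // minutes of human review per output
const loadedWagePerHour = 120;     // loaded reviewer wage, $/hour
const oversightLine = (oversightMinutes / 60) * loadedWagePerHour; // $12 per output

// oversightLine / tokenLine = 500×. The token line is noise next to the
// minutes of human attention each output still requires.
```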

The implication: capability improvements that lower the token line without lowering oversight minutes do not shift the economics materially. Reliability improvements that cut oversight from ten minutes per output to two minutes per output change the answer for the whole role. This is why our methodology weights reliability and error cost as gates and dampeners rather than as inputs to a sum. Capability gates which tasks enter the model; reliability multiplies the operational viability; error cost divides it; human advantage dampens it.
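As a functional form, that structure reads roughly as follows. This is our paraphrase of the gate/multiply/divide/dampen composition, not the literal methodology formula, and the gate threshold is a placeholder.

```typescript
// "Capability gates, reliability multiplies, error cost divides,
// human advantage dampens," expressed as a single viability score.
function viability(
  capability: number,            // 0-100; a gate, not a summand
  reliability: number,           // 0-100
  errorCostMultiplier: number,   // 1-5
  humanAdvantageDampener: number // 0-1, fraction of value that stays human
): number {
  if (capability < 40) return 0;         // gate: the task never enters the model
  const base = reliability / 100;        // multiplies operational viability
  return (base / errorCostMultiplier) * (1 - humanAdvantageDampener);
}
```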

The structural prediction: the next generation of meaningful AI labor cost reductions is not from cheaper inference. It is from reliability improvements that materially reduce oversight minutes per output. The Nvidia executive who told Axios in April 2026 that “the cost of compute is far beyond the costs of the employees” was describing the inference line. The reliability line is structurally much harder to push, which is why post-deployment reversals (Klarna, Uber AI-coding budget burn) are clustering at the deployments where reliability has not caught up to capability.

What we deliberately did not model

Three things, named so you can argue with us on the right axis. First, option value — the value of deferring an AI deployment until capability or cost improves. A task that today scores Human-led + AI-assisted may shift to AI-augmented in two years; the option to wait has real expected value for the firm. We do not price this because we do not have a defensible decline curve for reliability. Capability curves are tractable; reliability curves are not.

Second, strategic redeployment value. When AI substitutes 20% of a role's task hours, the freed hours can be redirected to higher-leverage work. The economic value of that redirection depends on whether the freed time goes to high-marginal-value work (architecture, mentoring, customer retention) or to lateral activity. Our model assumes pure cost savings on the freed hours, which underestimates the upside in the best case and avoids overpromising in the average case. We are deliberately conservative.

Third, terminal value beyond Year 5. The financial projection layer (NPV / IRR / Payback, available to Pro subscribers on each Wagecard) runs five years out. We do not extrapolate further because the assumptions about capability and cost decay get arbitrary fast. We prefer a five-year answer we can defend over a twenty-year answer nobody will trust.
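The projection layer itself is standard discounted-cash-flow machinery with a hard stop, sketched here with hypothetical cash flows and a placeholder discount rate:

```typescript
// Five-year NPV with no terminal value, matching the
// "no extrapolation past Year 5" stance above.
function npvFiveYears(yearlyNetSavings: number[], discountRate: number): number {
  return yearlyNetSavings
    .slice(0, 5) // hard stop at Year 5
    .reduce((acc, cashFlow, i) => acc + cashFlow / Math.pow(1 + discountRate, i + 1), 0);
}

npvFiveYears([-20_000, 15_000, 30_000, 35_000, 35_000], 0.1); // ≈ 62,400
// Year 1 is negative (integration plus oversight ramp); payback lands in Year 3.
```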

What this means if you are reading as a knowledge worker

The headline take is the calm one. Most roles in the v1 corpus are not in headline AI exposure trouble today, and the framework predicts they will not be in headline trouble in the next five years either. That is not a defense of complacency. The middle two classes (AI-augmented, Human-led + AI-assisted) are where the operational shift is happening, and they require the worker to actively change how they use AI — not to fear it, not to celebrate it, but to operate with it as the new floor of the toolset.

If you want the specific read for your role, geo, and task mix, the Wagecard wizard takes about three minutes. Anonymous preview before sign-in; no salary required unless you want the market-percentile read. The numbers on your Wagecard come from the same matrix we drew the findings above from.

What this means if you are reading as a deployment lead

The two failure modes we see most often in public reversals are (1) capability-without-reliability rollouts that underestimated oversight load, and (2) Replaceable-by-headline tasks that were actually Human-critical-by-error-cost. Both are diagnosable in advance. Capability and reliability decompose cleanly in our matrix; error cost is a 1–5 multiplier per task. The diagnosis takes about an hour if you write down the tasks. The post-incident review takes about a quarter if you skip the diagnosis.
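Both checks compress to a few lines once the tasks are written down with the three scores. The thresholds below reuse the gates named earlier in this post; the function and messages are ours.

```typescript
type DiagCell = { task: string; capability: number; reliability: number; errorCost: number };

// Pre-deployment check for the two failure modes above.
function diagnose(cell: DiagCell): string | null {
  // Mode 2: Replaceable by headline capability, Human-critical by error cost.
  if (cell.capability >= 75 && cell.errorCost >= 5)
    return `${cell.task}: headline-replaceable but error-cost-gated`;
  // Mode 1: capability without reliability; oversight load will be underestimated.
  if (cell.capability >= 75 && cell.reliability < 80)
    return `${cell.task}: capability has run ahead of reliability`;
  return null; // no known pre-deployment red flag from the matrix alone
}
```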

If you are running AI deployment for a team or org, the B2B view is a paste-the-roles flow that produces the same matrix-derived read across your headcount. The methodology is the same; the surface is org-level.

One more caveat

We are pre-launch. The numbers above come from a v1 hand-authored matrix calibrated against public research. When the v1.5 evaluator panel ships (target Q3 2026), the matrix will be regression-tested against three model evaluators and the medians will be stamped into the same data structure. If any of the five findings above flips after that pass, we will say so on the methodology page, update this post with the new numbers, and stamp the version. The v1 cells will remain readable; the version stamp on every Wagecard records which matrix produced the read.

The longer-term moat is the per-task adoption distribution from real users — what AI tools are actually being used, at what intensity, per role × geo × experience. That distribution is what nobody else has, and it is what we are building toward. Today we have it for none of the cells. In a year, with even modest adoption, we will have it for the most-computed cells. The transparency gates on /insights show exactly where that data is and is not yet, by N count, in real time.

That is the entire pitch. Open methodology because trustworthy economics need to be auditable. Closed-by-data moat because that is the only thing in this product that nobody else can reproduce by reading the formula.

Comments and methodology pushback welcome. The fastest way to argue with the framework is to compute your own Wagecard and tell us which cell looks wrong. The matrix version on every Wagecard records the snapshot you saw; we keep an audit log of how it shifted.