In February 2024, Klarna announced that an OpenAI-powered assistant was handling the work of 700 full-time customer service agents. By 2025 the company said the number had grown to 853. In May 2025, Klarna's CEO told the Financial Times the firm had started hiring humans back, citing quality complaints and the limits of pure automation. That arc — announcement, escalation, partial retreat — is the most cited story in any conversation about AI replacing customer support, and it's also the most misread. Treated as a triumph, it overstates the case. Treated as a failure, it understates it. Treated as a tool, it tells you something specific: the cost of substituting an agent is not one number, it's a distribution across substitution classes, and the classes behave very differently.
This post walks through the four substitution classes Wagecore uses for customer-support work, the cost ranges with confidence bands inside each, and the methodology choices behind the numbers.
The Klarna case, read carefully
The original Klarna/OpenAI press release in February 2024 reported the AI assistant resolving 2.3 million conversations in its first month, about two-thirds of inbound chat tickets, with customer satisfaction scores on par with human agents and average resolution time falling from 11 minutes to under 2. That was the headline. The follow-up details, mostly surfaced in 2025 reporting from Yahoo Finance and the Financial Times, matter more: the 700-agent figure was a comparison against the contracted agent capacity the assistant displaced, not against Klarna's own employees. The 853 figure announced later in 2025 used the same comparison method. And the 2025 partial reversal was not "AI doesn't work." It was "the residual tickets that AI cannot resolve are harder, more emotionally loaded, and require humans who are better-paid than the contracted-agent baseline."
What this case actually shows is the substitution classes pulling apart in the real world. The volume-tier, password-reset, status-of-my-order class went almost entirely to AI and stayed there. The dispute-and-troubleshoot class went mostly to AI with a human review loop, and stayed there too. The complex-escalation class went to AI initially, then partially back to humans. And the relationship-or-novel-edge-case class never actually moved.
That's not a failure of the technology. It's the substitution map asserting itself.
Four substitution classes
Wagecore classifies customer-support tasks into four classes based on observable cost and reliability properties — not based on whether the task "feels automatable." The classes are:
Full Substitution. L1 ticket triage, password resets, order-status lookups, FAQ answers, simple refund processing within policy bounds. These tasks have narrow input distributions, strict reliability requirements on only a small set of failure modes, and low error cost. AI handles them end-to-end without a human in the resolution path. Confidence band: $2–$8 per resolved ticket using a frontier-model API plus a vendor wrapper (Intercom Fin, Ada, and Forethought all sit in this range as of public pricing through 2025). The lower end assumes a well-tuned vendor; the upper end assumes off-the-shelf with retrieval but no retraining. Human equivalent: $15–$25 per ticket for an outsourced contact-center agent, per the public pricing guide published by Crescendo and mid-market BPO benchmarks. The ratio favors AI by roughly 3–5×, and the gap is stable.
Supervised Substitution. Billing disputes, product troubleshooting where the customer's setup matters, account changes with policy edge cases, simple complaints. AI proposes a resolution, a human reviews it before it ships to the customer — either case-by-case for low-confidence cases, or via batched audit for high-confidence cases. The cost structure is meaningfully different from full substitution: you pay the AI inference cost plus a fraction of an agent's time per ticket, where the fraction depends on your audit policy. Confidence band: $5–$14 per resolved ticket. The wide band reflects the choice between heavy audit (every ticket reviewed) and light audit (sampled). Human-only equivalent: $18–$30 per ticket — these tickets take longer than full-substitution tickets, so the human baseline rises too. The ratio favors AI by 2–3×, and degrades as you tighten the audit loop.
Augmentation. Complex escalations, emotional situations (refunds tied to medical or family circumstances, complaints about service breakdowns), multi-system investigations, executive-attention cases. AI assists the human by drafting responses, pulling history, summarizing prior tickets, and suggesting policy precedent, but does not act. The human owns the resolution. Cost is essentially "human salary plus a per-seat AI assistant subscription." Confidence band: $20–$45 per ticket. The AI contribution shows up as throughput, not headcount: a senior agent with a good copilot handles maybe 30% more tickets per shift. Human-only equivalent: $25–$60 per ticket. Ratio: modest, roughly a 20–25% cost reduction per ticket at the band endpoints, with the upside expressed as faster resolution rather than lower agent count.
Non-substitutable Residual. Relationship management with strategic accounts, novel edge cases that don't fit any prior pattern, regulatory or legal correspondence, crisis incidents (fraud rings, mass outage handling, PR-sensitive complaints). AI may be in the loop as a research tool, but the resolution path is fully human and often spans multiple humans (an agent, a manager, sometimes legal). Cost: $50–$200+ per ticket depending on duration and seniority. There is no AI baseline to compare against because the substitution probability is effectively zero at current capabilities. Klarna's partial hire-back of human agents in 2025 happened mostly inside this class and the upper edge of Augmentation, exactly the classes where AI's confidence was lowest and the cost of a wrong answer was highest.
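The four classes and their bands can be written down as data, which makes the per-class ratios easy to recompute. This is a minimal sketch, not Wagecore's implementation: the `SubstitutionClass` type and the choice to compare band midpoints are assumptions made for illustration, and the residual's upper bound of 200 ignores the open-ended "+" in the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Band = Tuple[float, float]  # (low, high) in USD per resolved ticket


@dataclass(frozen=True)
class SubstitutionClass:
    """Hypothetical container for one class and its published bands."""
    name: str
    ai_band: Optional[Band]  # None where the post gives no AI baseline
    human_band: Band

    def midpoint_ratio(self) -> Optional[float]:
        """Human-only midpoint cost divided by AI-path midpoint cost."""
        if self.ai_band is None:
            return None
        ai_mid = sum(self.ai_band) / 2
        human_mid = sum(self.human_band) / 2
        return human_mid / ai_mid


# Bands copied from the four class descriptions above.
CLASSES = [
    SubstitutionClass("Full Substitution", (2, 8), (15, 25)),
    SubstitutionClass("Supervised Substitution", (5, 14), (18, 30)),
    SubstitutionClass("Augmentation", (20, 45), (25, 60)),
    SubstitutionClass("Non-substitutable Residual", None, (50, 200)),
]

for c in CLASSES:
    ratio = c.midpoint_ratio()
    label = f"{ratio:.1f}x at midpoints" if ratio is not None else "no AI baseline"
    print(f"{c.name}: {label}")
```

Comparing midpoints is only one way to summarize two overlapping bands. Pairing the endpoints differently gives a much wider spread, which is part of the argument for publishing bands at all.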
The human baseline, fully loaded
The cost-per-ticket numbers above ride on a human baseline that itself deserves a confidence band. ZipRecruiter's 2025 data for "Customer Support Representative" in the US shows an average annual base of roughly $42,000, with a 25th–75th percentile band of $34,000–$50,000 depending on geography and tenure. Fully loaded (benefits, payroll tax, equipment, manager overhead, attrition replacement cost, training amortization), the typical multiplier is 1.35–1.55×, putting the loaded annual cost at roughly $57,000–$77,000, from the average base at the low multiplier to the 75th-percentile base at the high one. Divide by 1,800–2,000 productive hours per year and you get $28–$43 per loaded agent hour. An industry-typical handle time of 8–14 minutes per ticket across the full mix accounts for only about $4–$10 of that hourly cost per ticket, so the $15–$25 per-ticket figure for routine L1 work and the $25–$60 figure for complex tickets cited above only follow if they also absorb paid time not spent handling a ticket, such as idle time between contacts and after-contact work.
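The loaded-hour arithmetic in the paragraph above can be checked in a few lines. This is a sketch under one stated assumption: the low end pairs the average base salary with the low multiplier and the most productive hours, and the high end pairs the 75th-percentile base with the high multiplier and the fewest hours. The function name is hypothetical.

```python
def loaded_hourly_cost(base_salary: float, load_multiplier: float,
                       productive_hours: float) -> float:
    """Fully loaded cost per productive agent hour, in USD."""
    return base_salary * load_multiplier / productive_hours


# Low end: average base, low multiplier, spread over the most hours.
low = loaded_hourly_cost(42_000, 1.35, 2_000)
# High end: 75th-percentile base, high multiplier, spread over the fewest hours.
high = loaded_hourly_cost(50_000, 1.55, 1_800)

print(f"${low:.2f} to ${high:.2f} per loaded agent hour")
```

Under that pairing the result lands on the $28–$43 range quoted in the text.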
Outsourced BPO pricing — Crescendo's published guide, the mid-market benchmarks from the contact-center analyst firms — runs lower than this on the per-ticket basis ($6–$15 for L1 voice or chat in lower-cost geographies) but should not be read as the human baseline unless the AI alternative is being compared against the same offshore arrangement. The economically honest comparison sets like against like: in-house against in-house, BPO against BPO, and AI against the human cost it is actually displacing inside that organization. Mixing the comparisons is how you get the 10× cost-reduction claims that don't survive the first quarter of operations.
The implication for the substitution-class math: in a high-cost in-house environment, Full Substitution's 3–5× ratio compounds because the human baseline is high. In a low-cost BPO environment, the same technology produces a 1.5–2.5× ratio because the human baseline is already low. The technology is constant; the savings are not.
Why confidence bands, not point estimates
A single dollar figure per ticket is the cleanest possible answer, and it is almost always wrong. Two reasons.
First, the input distribution to each class varies wildly across companies. A consumer-fintech ticket mix is heavily Full Substitution at the top of the funnel; a B2B SaaS support queue is Augmentation-heavy because the tickets reference customer-specific configurations. The same "AI agent replaces a human" claim can map to a 4× cost reduction at one company and a 1.2× reduction at another, not because the technology is different but because the work distribution is.
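To see how much the ticket mix alone moves the headline ratio, a blended cost can be computed as a share-weighted average over the classes. The sketch below uses the midpoints of the bands quoted in this post. The two mixes are hypothetical, chosen only to contrast a Full Substitution-heavy queue with an Augmentation-heavy one, and the residual is priced identically on both paths because it has no AI baseline.

```python
from typing import Mapping

# Band midpoints in USD per ticket. 125 is the midpoint of $50-$200
# and ignores the open-ended "+" on the residual band.
HUMAN_ONLY = {"full": 20.0, "supervised": 24.0, "augmentation": 42.5, "residual": 125.0}
AI_PATH = {"full": 5.0, "supervised": 9.5, "augmentation": 32.5, "residual": 125.0}


def blended_cost(mix: Mapping[str, float], cost: Mapping[str, float]) -> float:
    """Share-weighted average cost per ticket for a given ticket mix."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("ticket mix shares must sum to 1")
    return sum(share * cost[name] for name, share in mix.items())


def blended_ratio(mix: Mapping[str, float]) -> float:
    """Human-only blended cost divided by AI-path blended cost."""
    return blended_cost(mix, HUMAN_ONLY) / blended_cost(mix, AI_PATH)


# Hypothetical mixes, not measured data.
FINTECH_LIKE = {"full": 0.70, "supervised": 0.20, "augmentation": 0.08, "residual": 0.02}
B2B_LIKE = {"full": 0.30, "supervised": 0.20, "augmentation": 0.40, "residual": 0.10}

print(f"fintech-like mix: {blended_ratio(FINTECH_LIKE):.2f}x")
print(f"B2B-like mix:     {blended_ratio(B2B_LIKE):.2f}x")
```

The per-class costs are identical in both runs. Only the shares change, and the Full Substitution-heavy mix comes out well ahead.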
Second, AI pricing is moving. Frontier-model per-token cost has fallen roughly 10× from early 2024 to mid-2025. Vendor wrappers have not fallen at the same rate, because the cost structure of an Intercom Fin or an Ada is not pure model inference — it's retrieval, vendor margin, sales motion, and integration. The lower end of each band tracks raw inference; the upper end tracks vendor pricing. The gap between the two narrows over time but is not zero.
We publish confidence bands because point estimates create the illusion of certainty that the Klarna case explicitly contradicted. The 700-agent figure was a point estimate, and it didn't survive contact with the residual ticket distribution.
The Wagecard methodology behind these numbers
Wagecore's Wagecard treats customer-support roles the way it treats every other role: as a weighted average across substitution classes, with each class scored on capability, reliability, error cost, and oversight cost. The four classes above map onto our standard frontier — Full Substitution corresponds to our Replaceable cell, Supervised Substitution to AI-augmented, Augmentation to Human-led-AI-assisted, Non-substitutable Residual to Human-critical.
The Investment View on a customer-support function therefore reads as an NPV computation, not a single ratio. Inputs: ticket volume distribution across the four classes, current human-only cost per class, expected AI-plus-human cost per class with a chosen audit policy, switching costs (vendor onboarding, retrieval-index build, retraining contracts), and a risk-adjusted discount rate that accounts for the chance the vendor's pricing or quality changes mid-contract. The IRR on full-substitution-heavy queues is high — typically 80%+ on a one-year horizon at the bands above. The IRR on augmentation-heavy queues is modest. The payback period varies from under a quarter to over two years depending on which class dominates.
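The shape of that computation can be sketched with a plain discounted cash-flow function and a payback counter. This is an illustration, not the Investment View itself. The switching cost, monthly saving, and monthly discount rate are made-up inputs, and a real run would derive the monthly saving from ticket volume and the per-class cost gaps.

```python
from typing import Optional, Sequence


def npv(rate_per_period: float, cash_flows: Sequence[float]) -> float:
    """Net present value; cash_flows[0] occurs now, cash_flows[t] after t periods."""
    return sum(cf / (1.0 + rate_per_period) ** t for t, cf in enumerate(cash_flows))


def payback_period(cash_flows: Sequence[float]) -> Optional[int]:
    """First period at which cumulative undiscounted cash flow is non-negative."""
    cumulative = 0.0
    for t, cf in enumerate(cash_flows):
        cumulative += cf
        if cumulative >= 0.0:
            return t
    return None  # never pays back within the horizon


# Hypothetical inputs for a one-year horizon, in USD.
switching_cost = 60_000.0   # vendor onboarding, retrieval-index build, etc.
monthly_saving = 25_000.0   # human-only cost minus AI-plus-human cost
monthly_rate = 0.015        # stand-in for a risk-adjusted discount rate

flows = [-switching_cost] + [monthly_saving] * 12
print(f"NPV over 12 months: ${npv(monthly_rate, flows):,.0f}")
print(f"Payback period: {payback_period(flows)} months")
```

Raising the discount rate is one simple way to reflect the risk that vendor pricing or quality changes mid-contract. Scenario-weighting the cash flows is another.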
This is not a black box. The substitution classes, the cost bands, and the weighting are all published in our methodology. We do not retroactively backfill prior numbers when our methodology revises: a Wagecard computed under v1 stays a v1 Wagecard, with the v1 numbers, even if v2 updates the bands. The reason is that the cost of a substitution decision is paid against the numbers known at decision time — backfilling rewrites history in a way that makes prior decisions look better or worse than they were when made.
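One way to make the no-backfill rule hard to violate is to store each computed band as an immutable record stamped with its methodology version. The sketch below is an assumption about how that could look, not a description of how Wagecards are stored: the `BandSnapshot` type, the `revise` helper, the dates, and the v2 band are all hypothetical.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Tuple


@dataclass(frozen=True)
class BandSnapshot:
    """Hypothetical immutable record of one band under one methodology version."""
    methodology_version: str
    computed_on: date
    class_name: str
    ai_band_usd: Tuple[float, float]


def revise(previous: BandSnapshot, new_version: str,
           new_band: Tuple[float, float], on: date) -> BandSnapshot:
    """Return a new snapshot; the previous one is left exactly as it was."""
    return replace(previous, methodology_version=new_version,
                   ai_band_usd=new_band, computed_on=on)


v1 = BandSnapshot("v1", date(2025, 6, 30), "Full Substitution", (2.0, 8.0))
v2 = revise(v1, "v2", (1.5, 7.0), date(2026, 6, 30))  # made-up revised band

print(v1)
print(v2)
```

Because the dataclass is frozen, an attempt to overwrite a field on `v1` raises an error instead of silently rewriting the earlier numbers.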
Reading the Klarna arc through the classes
With the four classes in hand, Klarna's announcement-escalation-partial-retreat sequence reads cleanly:
The 700-agent and 853-agent figures captured the Full Substitution and most of Supervised Substitution displacement. Those are real, the math holds, and the ratio is roughly what the public pricing on Intercom Fin and equivalent vendors would predict for a high-volume consumer-fintech ticket mix.
The 2025 partial hire-back captured Augmentation and Non-substitutable Residual. Klarna initially routed those tickets through AI too, hit a quality wall, and adjusted. That isn't an AI failure — it's the substitution map being read correctly the second time. The class boundaries are real, and crossing them on optimistic assumptions costs money in customer dissatisfaction faster than it saves it in salary.
What the case does not show is the binary framing that dominates most commentary: AI either replaces customer support or it doesn't. Both readings are wrong. AI replaces a measurable fraction of the work at a known cost ratio, with the fraction depending on the ticket distribution and the chosen audit policy. The other fraction stays human, and gets more valuable as the substitutable work compresses around it.
What to do with this
Four things follow.
First, before computing any "AI replaces customer support" cost, classify the tickets. The Full Substitution share matters most because it dominates the ratio. A queue that's 70% Full Substitution behaves very differently from one that's 30% Full Substitution and 40% Augmentation — and the headline figures from competitors rarely tell you which they have.
Second, treat the audit policy as a first-class variable. Supervised Substitution's cost band is wider than the others because the audit choice changes the unit cost by nearly 3×. Most write-ups skip this and quote whichever endpoint flatters the conclusion.
Third, don't price the Non-substitutable Residual against an AI baseline. There isn't one. Those tickets stay human, and the right comparison is human-versus-human (senior agent vs. junior, in-house vs. outsourced), not human-versus-AI. Pricing the residual against a phantom AI baseline is what made Klarna's first pass overestimate the savings — and what makes most internal "AI replaces customer support" business cases overpromise by 30–50% before they even hit pilot.
Fourth, version the analysis. The bands here reflect inference pricing and vendor pricing as observed through mid-2025. They will move. A decision made today should record which numbers it was made against, because the next twelve months of pricing changes will look like savings only against an unchanged baseline. Wagecards carry a methodology version on the face of the card for exactly this reason: a Wagecard is a snapshot of a decision, not a forecast.
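The second point, audit policy as a first-class variable, is the easiest of the four to make concrete. A minimal sketch, assuming the unit cost is the AI cost per ticket plus the reviewed share of tickets times the loaded cost of a review. The function name and every input value below are illustrative assumptions, not vendor or Wagecore figures.

```python
def supervised_cost_per_ticket(ai_cost: float, review_fraction: float,
                               review_minutes: float,
                               loaded_hourly_cost: float) -> float:
    """AI cost plus the expected human review cost for one ticket, in USD."""
    if not 0.0 <= review_fraction <= 1.0:
        raise ValueError("review_fraction must be between 0 and 1")
    review_cost = (review_minutes / 60.0) * loaded_hourly_cost
    return ai_cost + review_fraction * review_cost


# Hypothetical inputs: $4 AI cost, 12-minute review, $43 loaded agent hour.
heavy = supervised_cost_per_ticket(4.0, 1.00, 12.0, 43.0)  # every ticket reviewed
light = supervised_cost_per_ticket(4.0, 0.10, 12.0, 43.0)  # 10% sampled audit

print(f"heavy audit: ${heavy:.2f} per ticket")
print(f"light audit: ${light:.2f} per ticket")
print(f"heavy / light: {heavy / light:.1f}x")
```

With these inputs both results fall inside the $5–$14 Supervised Substitution band once rounded, and the only thing that changed between them is `review_fraction`.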
If you want the same analysis run against your own role or function, with substitution classes, confidence bands, and an Investment View, that's what Wagecore does. The methodology is open at wagecore.ai/methodology and a free Wagecard is at wagecore.ai/start.
Sources
- Klarna and OpenAI joint announcement, February 2024: AI assistant resolving 2.3M conversations, ~700-agent equivalent.
- Yahoo Finance reporting, 2025: Klarna AI assistant performing work equivalent to 853 full-time agents.
- Financial Times reporting, May 2025: Klarna's partial human-agent rehiring.
- ZipRecruiter customer support representative salary data, 2025: US average base salary, the input to the fully loaded per-ticket cost.
- Intercom Fin AI public pricing: per-resolution cost benchmarks through 2025.
- Crescendo outsourced call-center pricing guide: BPO per-ticket cost ranges for L1 through complex tiers.