Floburn Journal · Failure modes

Why your AI pilot stalled at the demo.

Six failure modes we keep finding when we're called in to revive AI pilots that already failed. Every one is foreseeable. None of them show up in the demo.

By Aaron Burns · March 11, 2026 · 6 min read


When a company calls us, it usually isn't because they're starting from scratch. It's because something already broke. The pattern is consistent enough to name: a vendor pitch six months earlier, an internal champion, a pilot that demoed beautifully in March, and a strained meeting in October where someone asks why the rollout stalled.

We've worked through enough of these to map the failure modes. They are not random. They are not vendor-specific. They are not unique to any one industry. They are six recurring shapes of how an AI implementation fails between the demo and production. Every one is foreseeable. None show up in the demo. Most are the demo's fault.

1. The demo was on clean data. Production isn't.

The vendor demo runs on a sample dataset the vendor curated for the demo. The data has no nulls, no duplicates, no character-encoding artifacts, no rows from the legacy ERP that nobody migrated cleanly. The model performs at 94% accuracy on that sample.

Production data has all of those things. The model's accuracy drops to 71% on day one. The team spends the next four months wrestling with data quality issues that should have been surfaced at the diagnosis stage. Meanwhile the vendor's pricing model, calibrated against demo-data accuracy, no longer makes economic sense.

The fix is upstream: insist on running the proof of concept against a representative slice of your production data, with all of its actual ugliness intact. If the vendor declines, that refusal is your answer, and the answer is no.
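Before handing a slice to the vendor, it helps to know how ugly it is. The sketch below is one rough way to profile a sample for the three problems named above: nulls, duplicates, and character-encoding artifacts. The function name, the `invoice_id` key, and the sample rows are hypothetical, and treating the Unicode replacement character as the sign of an encoding problem is an assumption that will miss other kinds of damage.

```python
from collections import Counter

# Assumption: a failed decode usually leaves the Unicode replacement character.
REPLACEMENT_CHAR = "\ufffd"


def profile_rows(rows, key_field):
    """Count null cells, duplicate keys, and encoding artifacts in dict rows."""
    null_cells = 0
    artifact_cells = 0
    keys = Counter()
    for row in rows:
        keys[row.get(key_field)] += 1
        for value in row.values():
            # Treat None and blank strings as nulls.
            if value is None or (isinstance(value, str) and value.strip() == ""):
                null_cells += 1
            elif isinstance(value, str) and REPLACEMENT_CHAR in value:
                artifact_cells += 1
    # Every occurrence of a key beyond the first counts as a duplicate row.
    duplicate_rows = sum(count - 1 for count in keys.values() if count > 1)
    return {
        "rows": len(rows),
        "null_cells": null_cells,
        "duplicate_rows": duplicate_rows,
        "artifact_cells": artifact_cells,
    }


# Hypothetical sample standing in for a slice of production data.
sample = [
    {"invoice_id": "A-1", "vendor": "Acme", "amount": "120.00"},
    {"invoice_id": "A-1", "vendor": "Acme", "amount": "120.00"},
    {"invoice_id": "A-2", "vendor": "Caf\ufffd Luna", "amount": None},
    {"invoice_id": "A-3", "vendor": "", "amount": "75.50"},
]
report = profile_rows(sample, "invoice_id")
print(report)
```

If the vendor's curated sample and your own slice produce very different reports, that gap is a fair early warning of how far demo accuracy may fall in production.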

2. The model was right. The workflow wasn't.

Sometimes the model performs well. The pilot generates accurate outputs at acceptable latency and cost. The problem is downstream of the model: the existing workflow around it was designed for human-only processing, with implicit handoffs, exceptions, and trust-building behaviors that don't translate to AI inserted at one step.

The output lands in a queue nobody monitors. Or it lands in an inbox where a human is supposed to validate it but doesn't, because the volume tripled. Or it goes straight through, with the validation step omitted by default — and now nobody catches the model's confident-but-wrong answers until a customer complaint surfaces them in week eight.

The fix is to redesign the workflow at the same time you build the model, not after. The single most common mistake we see is treating the AI build and the process change as sequential when they have to be simultaneous.

3. No one owns the system after launch.

The pilot ships. The implementation team — internal or vendor — celebrates. Six weeks later something drifts. Performance degrades. A new edge case surfaces. The data team says it's a product issue. Product says it's an engineering issue. Engineering says it's a vendor issue. The vendor says it's working as specified.

In the months it takes to negotiate ownership, the system rots. Trust in the output erodes. Users start running parallel manual processes just to be sure. By month six, the AI is technically running but effectively dead.

The fix is to name an owner before launch — by name, in writing, with a budget line. The owner is responsible for monitoring, drift detection, retraining cadence, escalation paths, and the relationship with the vendor. Without that, the most successful pilot in the world becomes the most embarrassing post-mortem.
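Part of what the owner signs up for can be made mechanical. As a minimal sketch, assuming the owner has a baseline accuracy from launch and a weekly sample of human-reviewed outputs, a drift check might look like the following. The function names, the 0.05 tolerance, and the sample numbers are illustrative assumptions, not a recommended threshold.

```python
def accuracy(outcomes):
    """Share of True values in a list of per-item correctness flags."""
    return sum(outcomes) / len(outcomes)


def needs_escalation(baseline_accuracy, recent_outcomes, tolerance=0.05):
    """True when recent accuracy falls more than `tolerance` below baseline."""
    if not recent_outcomes:
        # No reviewed sample this period; nothing to compare against.
        return False
    return accuracy(recent_outcomes) < baseline_accuracy - tolerance


# Hypothetical numbers: 90% accuracy at launch, reviewed samples since.
launch_accuracy = 0.90
drifting_week = [True] * 16 + [False] * 4  # 80% correct
stable_week = [True] * 18 + [False] * 2    # 90% correct

print(needs_escalation(launch_accuracy, drifting_week))
print(needs_escalation(launch_accuracy, stable_week))
```

The check itself is trivial. What matters is that a named person receives the alert and has the budget to act on it.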

4. Your reference customer was a different shape.

The vendor's case study customer had 5,000 employees, a dedicated MLOps team, and a Kubernetes platform their data engineering org had been building for three years. The pilot ran beautifully there because the company already had the operational infrastructure to absorb a new ML system.

You have 200 people, no MLOps function, and your data engineering is one engineer who is also responsible for the BI dashboards. The same product does not run the same way at your scale. The integrations you need don't exist. The monitoring tooling assumes infrastructure you don't have. The cost model assumes a usage volume you can't reach.

The fix happens before the contract: ask the vendor for a reference customer at your scale, in your sector, with your stack. If they can't produce one, ask whether you're the experiment.

5. The integration debt compounded.

What was supposed to be a six-week pilot ran four months. Then six. Every connector required custom work. Every system the AI needed to read from or write to had its own authentication scheme, its own data model, its own rate limit. The implementation team's hours kept climbing. By month five they'd moved on to the next account. By month six the budget was gone.

The integration work was always going to be larger than the AI work. We say this in every diagnostic. The model is 10% of the project. The plumbing is the project. Companies that miscalculate this ratio at scoping pay for it at every milestone.

The fix is to scope integration first, model second. Identify every system the AI needs to touch, list every connector, estimate every authentication and data-shape headache. Then double the estimate. Then budget for monitoring, retries, exception handling, and a slow start. The pilot does not begin until that scope is named and signed off on.
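The arithmetic is simple enough to sketch. The example below lists each system the AI has to touch, sums the authentication and data-shape estimates, and then doubles the total. The `Connector` fields, the systems, and the day counts are all hypothetical placeholders, and the result deliberately leaves out the separate budget for monitoring, retries, and exception handling.

```python
from dataclasses import dataclass


@dataclass
class Connector:
    """One system the AI must read from or write to (hypothetical fields)."""
    system: str
    auth_days: float          # estimated effort for the authentication scheme
    data_mapping_days: float  # estimated effort for data-model differences


def scoped_estimate(connectors, multiplier=2.0):
    """Return (raw total, padded total) in days; the default doubles the raw sum."""
    raw = sum(c.auth_days + c.data_mapping_days for c in connectors)
    return raw, raw * multiplier


# Illustrative numbers only; replace with your own per-system estimates.
connectors = [
    Connector("ERP", auth_days=5, data_mapping_days=10),
    Connector("CRM", auth_days=3, data_mapping_days=6),
    Connector("Ticketing", auth_days=2, data_mapping_days=4),
]
raw_days, budget_days = scoped_estimate(connectors)
print(f"raw estimate: {raw_days} days, budgeted: {budget_days} days")
```

Writing the list down per system is the useful part: a connector nobody can estimate is a risk that should be named before the pilot begins.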

6. The success metric was aspirational.

The pilot was approved against an objective like "improve customer experience" or "increase efficiency" or "unlock our data." Those are not metrics. Those are slogans. When asked at month four how much customer experience had improved, no one had a baseline from month zero to compare against. No measurement framework had been put in place. There was no way to determine whether the pilot had succeeded.

This is the failure mode that hides the longest, because nobody knows the pilot has failed — they just know it isn't obviously succeeding. Resources keep flowing. Six months later, the system is still running, the team still believes in it, and no one can answer the question of whether the company should renew the contract.

The fix is the most boring item in this list and the most important: define the metric in operational terms before the pilot starts. If you can't measure customer experience directly, decompose it — into response time, resolution rate, escalation frequency, repeat-contact rate. Pick two. Instrument them. Then build.
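As a rough illustration of "pick two, instrument them," the sketch below computes resolution rate and repeat-contact rate for a baseline period and a pilot period, then reports the change. The ticket fields and the sample records are assumptions made for the example; a real support system will have its own schema.

```python
def resolution_rate(tickets):
    """Share of tickets marked resolved."""
    return sum(t["resolved"] for t in tickets) / len(tickets)


def repeat_contact_rate(tickets):
    """Share of tickets where the customer had to contact support again."""
    return sum(t["repeat_contact"] for t in tickets) / len(tickets)


# Hypothetical month-zero baseline, captured before the pilot starts.
baseline_tickets = [
    {"resolved": True, "repeat_contact": True},
    {"resolved": True, "repeat_contact": False},
    {"resolved": False, "repeat_contact": True},
    {"resolved": False, "repeat_contact": False},
]
# Hypothetical sample from the pilot period.
pilot_tickets = [
    {"resolved": True, "repeat_contact": False},
    {"resolved": True, "repeat_contact": False},
    {"resolved": True, "repeat_contact": True},
    {"resolved": False, "repeat_contact": False},
]

resolution_change = resolution_rate(pilot_tickets) - resolution_rate(baseline_tickets)
repeat_change = repeat_contact_rate(pilot_tickets) - repeat_contact_rate(baseline_tickets)
print(f"resolution rate change: {resolution_change:+.2f}")
print(f"repeat-contact rate change: {repeat_change:+.2f}")
```

The baseline list is the part most pilots skip. Without it, the subtraction on the last lines has nothing to subtract from.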

The common root

The six modes look different on the surface. They share a single cause: the demo is a sales artifact, not an operating artifact. The demo is constructed to make the AI look easy, fast, and clean — because that's what closes a contract. The work of making AI run inside a real company is none of those things. It is uneven, slow, and full of edge cases the vendor's demo team made go away with curated data and a controlled environment.

The companies that successfully ship AI into production are the ones that expected this and built for it from the diagnostic onward. They asked for production data in the proof of concept. They redesigned the workflow simultaneously with the model. They named an owner before launch. They got reference customers at their actual scale. They scoped integration before scoping accuracy. They named measurable success metrics.

When we get called in to revive a stalled pilot, the conversation usually starts with the customer asking what went wrong with the vendor. The honest answer is almost always nothing in particular — the system was set up to fail at all six points simultaneously, and the question is which point broke first.

The work we do across these revivals lives under practical AI implementation. When the conversation is more specific — California labor compliance, for instance — that work has its own page, called MicroForensics. Either way, the discovery call is the right first step.