Field note: the dashboard that didn't surface the drift.
The pager went off at 2:14am. The dashboard said the lead-scoring agent was healthy. Four weeks later, we found out the dashboard had been measuring the wrong thing the whole time.
The pager went off at 2:14am. The ops dashboard showed a green health check on the lead-scoring agent and a yellow on one of its dependencies. The on-call engineer cleared the yellow, watched the green hold for five minutes, and went back to bed. That was a Tuesday in October.
By Monday morning, four weeks of scored leads were wrong, and nobody on the team had noticed.
The agent's job was to read inbound web-form submissions, classify each one against a scoring model the marketing team had tuned, and write the score back to the CRM. The model had been working since June. The dashboard tracked five health metrics: latency, throughput, error rate, queue depth, and a heartbeat counter incremented every time the agent finished a batch. All five had been green for months.
What the dashboard didn't track was the distribution of scores. We did have a score histogram, but it lived in the model-monitoring tool: on a separate dashboard, behind a different domain login, in a tab the team didn't keep open. That histogram was the only metric that would have caught what was actually happening.
The vendor of the lead-source feed had silently changed its schema in late September. A field the model relied on, the form's referring source, had been renamed, and its enumeration values had been collapsed from twelve categories to eight. The agent's code read the field by its old name, got an empty string instead of the expected enumeration value, and defaulted to unknown. The model, which had been trained on a distribution with twelve referring-source categories, was now scoring nearly every lead as if its source were unknown, and unknown was a value the model had learned to score low. Sales saw a 30% drop in qualified-lead volume across four weeks and assumed the seasonal trough had come early.
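The failure mode is easy to reproduce in a few lines. The sketch below is illustrative only: the field names (referring_source, ref_source), the sample values, and the helper function are assumptions, not the agent's actual code. It shows how a read-by-name with a silent default turns a renamed field into a valid-looking "unknown" without raising anything.

```python
# Hypothetical names throughout; this is not the production agent code.
UNKNOWN = "unknown"


def referring_source(record: dict) -> str:
    """Read the referring source by name, silently defaulting when absent."""
    value = record.get("referring_source", "")
    return value if value else UNKNOWN


# A record in the old schema: the field is present under the expected name.
old_record = {"email": "a@example.com", "referring_source": "paid_search"}

# A record in the changed schema: the field was renamed (assumed new name),
# so the lookup under the old name finds nothing.
new_record = {"email": "b@example.com", "ref_source": "search"}

print(referring_source(old_record))  # paid_search
print(referring_source(new_record))  # unknown
```

Nothing in this path raises an error or slows down, which is why latency, throughput, error rate, queue depth, and the heartbeat would all stay green.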
The fix was structural. The agent now reads the schema definition from a contract file pinned to a specific version of the feed vendor's schema. When the incoming schema no longer matches the contract, the agent fails closed: it doesn't process the batch, and it pages the on-call engineer with a specific schema-mismatch error. The dashboard now tracks score distribution as a first-class health metric alongside the existing five.
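A minimal sketch of the fail-closed check, under stated assumptions: the contract layout, the schema_version key, the allowed values, and the names SchemaMismatchError, validate_batch, and process_batch are all hypothetical. The contract is inlined as a JSON string so the example is self-contained; in practice it would be loaded from the version-pinned file.

```python
import json


class SchemaMismatchError(Exception):
    """Raised when a batch does not match the pinned contract."""


# Hypothetical contract contents; normally read from a version-pinned file.
CONTRACT_JSON = """
{
  "schema_version": "2",
  "fields": {
    "referring_source": ["organic", "paid_search", "referral"]
  }
}
"""


def load_contract(text: str) -> dict:
    return json.loads(text)


def validate_batch(batch: dict, contract: dict) -> None:
    """Raise SchemaMismatchError on the first mismatch; return None if clean."""
    got, pinned = batch.get("schema_version"), contract["schema_version"]
    if got != pinned:
        raise SchemaMismatchError(f"schema version {got!r} != pinned {pinned!r}")
    for i, record in enumerate(batch.get("records", [])):
        for field, allowed in contract["fields"].items():
            if field not in record:
                raise SchemaMismatchError(f"record {i}: missing field {field!r}")
            if record[field] not in allowed:
                raise SchemaMismatchError(
                    f"record {i}: {field}={record[field]!r} not in contract"
                )


def process_batch(batch: dict, contract: dict, score, page_on_call):
    """Score a batch only if it matches the contract; otherwise page and skip."""
    try:
        validate_batch(batch, contract)
    except SchemaMismatchError as exc:
        page_on_call(f"schema mismatch: {exc}")
        return None  # fail closed: nothing is scored or written back
    return [score(record) for record in batch["records"]]
```

The design choice worth noting is that a missing or out-of-contract value is treated as an error rather than mapped to a default, so a renamed field stops the pipeline instead of quietly becoming "unknown".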
The lesson we relearned, or rather the lesson the data taught us again: the health checks an agent ships with cover the failure modes you anticipated at launch. They don't cover the failure modes that show up later, especially the silent ones. Distribution monitoring is what catches drift. Heartbeats catch death.
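One way to turn "distribution monitoring" into a health metric is to compare the current score histogram against a baseline window. The sketch below uses the population stability index (PSI), which is a technique swapped in here for illustration and not necessarily what the team's dashboard computes. It assumes scores lie in [0, 1]; the bucket count, the smoothing floor, and the alert threshold are assumptions to be tuned.

```python
import math


def bucket_shares(scores, n_buckets=10):
    """Fraction of scores in each equal-width bucket over [0, 1]."""
    if not scores:
        raise ValueError("need at least one score")
    counts = [0] * n_buckets
    for s in scores:
        # Clamp so out-of-range scores land in the first or last bucket.
        idx = max(0, min(int(s * n_buckets), n_buckets - 1))
        counts[idx] += 1
    return [c / len(scores) for c in counts]


def psi(baseline, current, n_buckets=10, floor=1e-6):
    """Population stability index between two samples of scores."""
    total = 0.0
    for b, c in zip(bucket_shares(baseline, n_buckets),
                    bucket_shares(current, n_buckets)):
        b, c = max(b, floor), max(c, floor)  # avoid log(0) on empty buckets
        total += (c - b) * math.log(c / b)
    return total


ALERT_THRESHOLD = 0.25  # assumed value; tune against historical windows

# Made-up data: a spread-out baseline versus a window where every lead
# scores low, as when the referring source collapses to "unknown".
baseline_scores = [i / 100 for i in range(100)]
drifted_scores = [0.05] * 100

print(psi(baseline_scores, baseline_scores) > ALERT_THRESHOLD)  # False
print(psi(baseline_scores, drifted_scores) > ALERT_THRESHOLD)   # True
```

A check like this only alerts if it runs on the same dashboard and paging path as the heartbeat; a drift metric on a tab nobody opens catches nothing.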
The shorter version: if your agent is producing outputs that change someone's behavior downstream (sales prioritization, ticket routing, content publication, payment authorization), the dashboard that says the agent is running is not the dashboard you need. You need the dashboard that says the agent's outputs still look like the outputs the team is acting on. Those are different metrics, and the absence of the second one is what kept the team in the dark for four weeks while the agent processed twelve hundred bad batches.