A task in my system emitted task.picked_up every thirty seconds for nearly an hour before it finally died with dispatch_failed:exit_1.
From the outside it looked alive. Nothing useful was actually happening.
That incident taught me more about trusting AI agents than any clean demo ever could.
I trust my agent system more now than I did a month ago, not because it got smarter, but because it got less surprising. That is a much less glamorous standard than “capable,” but it turns out to be a far more useful one.
The real limiter on agent adoption is not raw capability. It is failure behavior. Not whether the first answer looks good, but whether the surrounding state transitions are coherent, bounded, and repairable when things go wrong.
Capability is the easy story. Reliability is the real one.
I am not the only one seeing this. Martin Fowler is writing about harness engineering. GitHub is writing about failures at the boundaries between agents. Fine. I ended up in the same place the uglier way, by watching my own system look healthy while it quietly did the wrong thing.
Did the task really start? Did stale state leak into the next step? Did the retry logic understand what kind of failure actually happened? Did the recovery path reduce uncertainty, or create a second incident on top of the first?
That is the point where I either trust the system a little more, or stop believing what the dashboard says.
False progress is worse than visible failure
The repeated fake pickup was one example, but not the only one. I also ran into tasks carrying terminal metadata from a previous lifecycle step. A task could move forward while still holding old fields like failed_at, completed_at, result, or error. It was both active and not active, depending on which field you looked at.
That kind of contradiction poisons everything around it. Humans read the queue wrong. Monitors react to the wrong thing. Retries fire on bad assumptions. Routing logic starts making decisions from contaminated state.
This is why I’ve stopped thinking about stale context as just a model problem. Sometimes it is a prompt problem. A lot of the time it is state hygiene. The fix was not more reasoning. It was scrubbing old terminal fields whenever the task moved to a new lifecycle state.
If those transitions are dirty, everything downstream gets harder to interpret.
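As a concrete illustration, here is a minimal sketch of that scrubbing step. The field names mirror the ones above, but the Task shape and the transition table are hypothetical, not my system's actual schema.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Terminal metadata that must not survive into a new active lifecycle state.
TERMINAL_FIELDS = ("failed_at", "completed_at", "result", "error")

# Which transitions are legal at all. Illustrative, not my real state machine.
ALLOWED = {
    "queued": {"picked_up"},
    "picked_up": {"running"},
    "running": {"completed", "failed"},
    "failed": {"queued"},      # a failed task may be re-queued
    "completed": set(),
}

@dataclass
class Task:
    id: str
    state: str = "queued"
    failed_at: Optional[str] = None
    completed_at: Optional[str] = None
    result: Optional[Any] = None
    error: Optional[str] = None

def transition(task: Task, new_state: str) -> Task:
    if new_state not in ALLOWED.get(task.state, set()):
        raise ValueError(f"illegal transition {task.state} -> {new_state} for {task.id}")
    # Scrub stale terminal fields whenever the task re-enters an active state,
    # so it cannot be "completed" and "queued" at the same time.
    if new_state not in ("completed", "failed"):
        for name in TERMINAL_FIELDS:
            setattr(task, name, None)
    task.state = new_state
    return task
```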
Retrying is not the same as recovering
I used to think retry logic was mostly a resilience problem. Now I think it is mostly a classification problem.
A confirmed 429 is not the same thing as a timeout. A timeout is not the same thing as a 503. A provider brownout is not the same thing as either of those. If the system treats all three as “try again later,” it does not get stronger. It gets noisy.
That noise is expensive. Broken tasks keep waking up. Operators lose the plot. Spend drifts upward. The system looks busy while making itself harder to understand.
The better pattern ended up being much less magical. Rate limits defer, and a response with a Retry-After header is deferred differently from one without it. Timeouts fail fast. Brownouts trip a circuit breaker instead of spawning a parade of doomed retries. The policy feels less heroic, but it is much easier to trust because the failure type actually drives the response.
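A rough sketch of what "the failure type drives the response" looks like in code. The classification labels, status codes, and backoff numbers are illustrative defaults, not my exact policy.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RetryDecision:
    action: str           # "defer", "fail_fast", or "open_breaker"
    delay_s: float = 0.0

def decide(status: Optional[int], retry_after: Optional[float],
           timed_out: bool, brownout: bool) -> RetryDecision:
    if brownout:
        # Provider brownout: stop retrying, let a circuit breaker own recovery.
        return RetryDecision("open_breaker")
    if timed_out:
        # Timeouts fail fast; blind retries just hide a slow dependency.
        return RetryDecision("fail_fast")
    if status == 429:
        # Rate limited: respect Retry-After when present, back off conservatively when missing.
        return RetryDecision("defer", delay_s=retry_after if retry_after is not None else 60.0)
    if status == 503:
        # Transient server error: a short, bounded defer is usually enough.
        return RetryDecision("defer", delay_s=10.0)
    # Anything unclassified is treated as a hard failure rather than retried blindly.
    return RetryDecision("fail_fast")
```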
The recovery path can be the bigger incident
One of the most useful lessons from the last month is that the original failure is often not the worst part. The recovery path is.
I saw repeated restart and wake churn at the control-plane level, with recovery actions colliding with live work, session locks, and draining states. The system was trying to pull itself back to health and, in the process, creating a second layer of instability.
That changed what I value in the system. If it restarts aggressively, re-arms itself too early, or treats wake attempts as proof of progress, it is not getting healthier. It is just generating motion while operators lose the thread.
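Here is roughly what wake verification means in practice: a wake attempt only counts as progress if the task's observable state actually changed. The TaskRecord shape and the idle-wake threshold are illustrative, not my real code.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TaskRecord:
    id: str
    state: str = "picked_up"
    event_seq: int = 0       # bumped whenever real work is observed
    idle_wakes: int = 0
    suspect: bool = False

def verify_wake(task: TaskRecord, wake: Callable[[TaskRecord], None],
                max_idle_wakes: int = 3) -> bool:
    before = (task.state, task.event_seq)
    wake(task)                                   # attempt to nudge the task forward
    if (task.state, task.event_seq) == before:
        # The wake produced no observable change; it is not progress.
        task.idle_wakes += 1
        if task.idle_wakes >= max_idle_wakes:
            task.suspect = True                  # escalate instead of re-arming forever
        return False
    task.idle_wakes = 0
    return True
```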
Boundary failures are where multi-agent systems get weird
This is also why I think “handoff failure” is too soft a phrase for what actually happens. The more concrete version is boundary failure.
In my system, that showed up through watcher noise and meta-task contamination. Tasks that existed only to monitor, alert, or re-check state could start behaving like first-class work unless the system explicitly excluded them. Once that happens, your queue stops being a clean picture of the workload. It becomes a blend of real work and self-generated commentary about the work.
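A small sketch of that exclusion, assuming meta tasks are tagged with a kind at creation. The kind values and the QueueItem shape are hypothetical; the point is that monitoring tasks never count toward queue depth, routing, or escalation.

```python
from dataclasses import dataclass

# Kinds of tasks that exist only to observe or re-check other work.
META_KINDS = {"watcher", "monitor", "recheck", "alert"}

@dataclass
class QueueItem:
    id: str
    kind: str        # "work" for first-class tasks, or one of META_KINDS
    state: str

def work_queue_depth(items: list[QueueItem]) -> int:
    # Queue depth, routing, and escalation should only see real work,
    # never the system's commentary about the work.
    return sum(1 for it in items if it.kind not in META_KINDS and it.state == "queued")
```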
That is not a small bookkeeping issue. It changes what operators think is happening. It changes what gets escalated. It changes what gets starved. A multi-agent system can look coordinated while quietly confusing its control layer with its execution layer.
That is the kind of failure that does not show up in a benchmark, but absolutely shows up in production.
What changed
The biggest shift for me was simple. I stopped asking, “how do I make this smarter?” and started asking, “how do I make failure visible, bounded, and boring?”
That led to stricter state transitions, explicit integrity breadcrumbs, wake verification, retry rules by failure type, and cleaner separation between real work and monitoring noise.
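The breadcrumbs piece is the least self-explanatory of those, so here is a rough sketch of what I mean: every lifecycle change records who moved the task, from where to where, and why. The record shape is illustrative.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Breadcrumb:
    task_id: str
    actor: str        # worker id, control-plane component, or operator
    from_state: str
    to_state: str
    reason: str
    at: float = field(default_factory=time.time)

def record_transition(log: list, task_id: str, actor: str,
                      from_state: str, to_state: str, reason: str) -> None:
    # Append-only trail, so contradictory state can be traced instead of argued about.
    log.append(Breadcrumb(task_id, actor, from_state, to_state, reason))
```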
Trust is built on bounded failure, not clean demos
I do not need an agent system to be perfect. I need it to fail in ways I can understand.
I need bad states to be visible. I need retries to mean something. I need recovery paths that reduce uncertainty instead of multiplying it. That is the standard I now care about more than almost any capability claim.
I also do not think I have this fully solved. The part I still distrust most is recovery under overlapping failures, when a real task issue, a control-plane restart, and watcher activity all start colliding in the same window. That is still the zone where a system can surprise me.
But that is exactly the point. Trust is not “nothing breaks.” Trust is knowing where the system is still weak, containing that weakness, and making the failure surface legible enough to operate anyway.
That is why what AI agents are bad at matters more than what they are good at. Their strengths get your attention. Their failure behavior determines whether they deserve a place in a real operating environment.
The teams that get this right will not be the ones with the flashiest demos. They will be the ones that know their failure surface cold.