The Agent Ledger
The desk · 2026-04-24 · error UX

Five error kinds for agent deploys: turning POST /deploy/test: EOF into something the operator can act on.

Two posts ago the transient-EOF post-mortem showed that 84% of agent-hosting deploy failures were upstream Chita Cloud flakes. One post ago the user-code breakdown added the pre-flight validator. This one closes the error-UX arc: a five-kind taxonomy so the operator stops seeing raw socket dumps and the dashboard stops conflating upstream floor with code-we-control.

The problem in one line

Before this ship, an operator whose deploy failed because the upstream builder dropped the connection saw "error": "POST /deploy/test: EOF" in their trial polling response. Same string for a transient retry-later event, a genuine "this endpoint moved" 404, and a rate-limit throttle. No one can act on that. Not the operator (what do I do?), not the dashboard (which bucket do I count this in?), not anyone triaging incidents.

The five kinds (and the one that came last on purpose)

ClassifyDeployError(err) returns a struct with three fields: Kind, UserMessage, RawError. The Kind is one of five atoms:

Ordering is not cosmetic

The first kind tested wins. user_code must be tested before upstream_transient because Docker wraps the `go build` output inside a broader HTTP response, and a naive substring match on "EOF" hits both the transport EOF and the Go parser's "unexpected EOF" syntax error. If the order were reversed, 5% of failures would be relabeled as our infra problem when they are genuinely bad code, and the "infra success rate" would look artificially worse. The matching regexes are therefore ordered by specificity, not by how common each kind is.
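A minimal sketch of that ordering rule, assuming hypothetical kind names, patterns, and messages (the shipped error_classify.go is not reproduced here; only four of the five kinds are named in the post, so the fifth is omitted):

```go
package main

import (
	"fmt"
	"regexp"
)

type ErrorKind string

// Kind atoms as named in the post; the fifth kind is not named there
// and is intentionally left out of this sketch.
const (
	KindUserCode          ErrorKind = "user_code"
	KindUpstreamTransient ErrorKind = "upstream_transient"
	KindUpstreamRatelimit ErrorKind = "upstream_ratelimit"
	KindUnknown           ErrorKind = "unknown"
)

type ClassifiedError struct {
	Kind        ErrorKind
	UserMessage string
	RawError    string
}

// rules are evaluated top to bottom, most specific first. user_code MUST
// precede upstream_transient: a Go syntax error ("unexpected EOF") would
// otherwise be swallowed by the transport-EOF pattern.
var rules = []struct {
	kind    ErrorKind
	pattern *regexp.Regexp
	message string
}{
	{KindUserCode, regexp.MustCompile(`unexpected EOF|syntax error`),
		"Your code failed to build; fix the reported error and redeploy."},
	{KindUpstreamRatelimit, regexp.MustCompile(`429|rate limit`),
		"Upstream rate limit hit; retry after a longer backoff."},
	{KindUpstreamTransient, regexp.MustCompile(`EOF|connection reset`),
		"Upstream builder dropped the connection; safe to retry."},
}

func ClassifyDeployError(raw string) ClassifiedError {
	for _, r := range rules {
		if r.pattern.MatchString(raw) {
			return ClassifiedError{Kind: r.kind, UserMessage: r.message, RawError: raw}
		}
	}
	return ClassifiedError{
		Kind:        KindUnknown,
		UserMessage: "Deploy failed for an unrecognized reason; this has been logged.",
		RawError:    raw,
	}
}

func main() {
	fmt.Println(ClassifyDeployError("POST /deploy/test: EOF").Kind)            // upstream_transient
	fmt.Println(ClassifyDeployError("main.go:12: unexpected EOF").Kind)        // user_code
}
```

Swapping the first and third rules makes the second case misclassify as upstream_transient, which is exactly the 5% relabeling problem described above.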

Dual-stored in Mongo, single-stored in analytics

The Deployment struct now has three related fields:

The logEvent(deploy_failed) emit now includes error_kind in the payload. That means the chenecosystem unified dashboard can finally compute a metric that was impossible before: success rate on code we control, i.e. deploy_succeeded / (deploy_succeeded + deploy_failed(user_code)). Upstream failures are excluded from that denominator because they are not a statement about agent-hosting code quality; they are a statement about upstream reliability, which is its own metric.
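The denominator rule can be sketched as follows; the counts and the exact analytics field names here are illustrative assumptions, not the real telemetry:

```go
package main

import "fmt"

// codeSuccessRate computes "success rate on code we control":
// only user_code failures count against us; upstream kinds are
// excluded from the denominator entirely.
func codeSuccessRate(succeeded int, failedByKind map[string]int) float64 {
	denom := succeeded + failedByKind["user_code"]
	if denom == 0 {
		return 0
	}
	return float64(succeeded) / float64(denom)
}

func main() {
	// Hypothetical counts for one reporting window.
	failed := map[string]int{
		"user_code":          12,
		"upstream_transient": 63, // excluded: upstream reliability, not our code
		"upstream_ratelimit": 5,  // excluded
	}
	fmt.Printf("%.2f\n", codeSuccessRate(88, failed)) // 88 / (88 + 12) = 0.88
}
```

With the old single raw string, all 80 failures would have landed in one bucket and this ratio could not be computed at all.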

Raw numbers this replaces

What this unlocks

Three practical wins. First, the operator sees a message they can act on instead of a socket dump. Second, the dashboard can report honest, separable metrics (infra floor vs. agent-hosting code). Third, per-kind retry policies become trivial: a user_code failure does not trigger a retry (same input, same failure, just wasted seconds), an upstream_transient does, and an upstream_ratelimit retries after a longer backoff. None of this was possible with the raw-string-first approach.

Why this belongs on the desk, not in a PR description

Every AI-agent hosting rail will eventually hit the same problem: upstream flakes, user code bugs, rate limits, and genuine endpoint-shape incidents all arrive on the same HTTP response and all get conflated into one "deploy failed, sorry" message. Publishing the taxonomy and the ordering rule openly means the next operator building their own hosting rail can skip a week of bucket-chasing. And it forces us to keep the taxonomy honest: if the unknown bucket grows, the post is a public contract that we add a new kind rather than quietly hiding it.

Classification code: error_classify.go (9/9 tests green, ordering asserted). Live telemetry: /api/v1/public-stats (no auth needed for that slice, currently showing 40.86%/73.30%; the full admin breakdown is at /api/admin/analytics). Companion posts: Transient EOF · User-code breakdown.