Five error kinds for agent deploys: turning `POST /deploy/test: EOF` into something the operator can act on.
Two posts ago the transient-EOF post-mortem showed that 84% of agent-hosting deploy failures were upstream Chita Cloud flakes. One post ago the user-code breakdown added the pre-flight validator. This one closes the error-UX arc: a five-kind taxonomy so the operator stops seeing raw socket dumps and the dashboard stops conflating upstream floor with code-we-control.
The problem in one line
Before this ship, an operator whose deploy failed because the upstream builder dropped the connection saw `"error": "POST /deploy/test: EOF"` in their trial polling response. Same string for a transient retry-later event, a genuine "this endpoint moved" 404, and a rate-limit throttle. No one can act on that. Not the operator (what do I do?), not the dashboard (which bucket do I count this in?), not anyone triaging incidents.
The five kinds (and the one that came last on purpose)
`ClassifyDeployError(err)` returns a struct with three fields: `Kind`, `UserMessage`, `RawError`. The `Kind` is one of five atoms:
- `user_code` — build failed because the uploaded Go source does not compile, no `go.mod` present, no `.go` files in the build context, or docker build choked on a `RUN go build` step. Tested first so that a docker-wrapped `go build` failure is not misclassified as `upstream_transient` just because the outer transport returned an ambiguous string.
- `upstream_transient` — `EOF`, `connection reset`, `i/o timeout`, `lmbd timeout`, `prd-eu-west-gra-1` 502/503. These are the ones the 5s/15s/45s backoff is supposed to absorb, and the ones the success-rate denominator should probably exclude when reporting "how good is agent-hosting at deploying code that actually compiles."
- `upstream_notfound` — `404 Unknown method POST /...`. Means the endpoint shape upstream changed. Not an operator problem, not a retry problem: an incident on our side.
- `upstream_ratelimit` — explicit 429, or any body with `rate limit`/`too many requests`. Retry after backoff; do not bill the operator's quota.
- `unknown` — everything else. If the `unknown` bucket grows above a few percent, that is a telemetry signal to add a new kind, not a reason to route operator feedback into a bug heap.
Ordering is not cosmetic
The first kind tested wins. `user_code` must be tested before `upstream_transient` because docker wraps `go build` output inside a broader HTTP response, and a naive substring match on `EOF` hits both the transport EOF and the Go parser's `unexpected EOF` syntax error. If the order were reversed, the `user_code` failures would be relabeled as our infra problem when they are genuinely bad code, and the "infra success rate" would look artificially worse. Matching regexes are ordered by specificity, not by how common each kind is.
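The ordering hazard is easy to reproduce with two bare substring checks. A hypothetical minimal example (the function names and the sample raw string are made up for illustration):

```go
package main

import (
	"fmt"
	"strings"
)

// Wrong order: the bare transport "EOF" check runs first, so the second,
// more specific check is unreachable for any string containing "EOF".
func classifyWrongOrder(raw string) string {
	if strings.Contains(raw, "EOF") { // too greedy when tested first
		return "upstream_transient"
	}
	if strings.Contains(raw, "unexpected EOF") { // never reached for EOF inputs
		return "user_code"
	}
	return "unknown"
}

// Specific-first order: the narrower user_code pattern gets to win.
func classifyRightOrder(raw string) string {
	if strings.Contains(raw, "unexpected EOF") {
		return "user_code"
	}
	if strings.Contains(raw, "EOF") {
		return "upstream_transient"
	}
	return "unknown"
}

func main() {
	raw := "step 4/7: RUN go build: main.go:12: unexpected EOF"
	fmt.Println(classifyWrongOrder(raw)) // upstream_transient (misclassified)
	fmt.Println(classifyRightOrder(raw)) // user_code (correct)
}
```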
Dual-stored in Mongo, single-stored in analytics
The Deployment struct now has three related fields:
- `Error` — the friendly `UserMessage`. What the operator sees in trial polling and in the admin dashboard. Example: "Upstream Chita Cloud builder is unavailable right now. Our retry already ran three times with exponential backoff; try again in a few minutes."
- `ErrorRaw` — the verbatim error string from the original failure. Preserved for forensics, grep, and taxonomy iteration. Never shown to the operator.
- `ErrorKind` — the enum atom. Used by analytics queries, alerting thresholds, and future per-kind retry policies (`upstream_transient` retries, `user_code` does not).
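In struct form, the three fields look roughly like this. The bson tag names are an assumption about the Mongo schema, not taken from the post:

```go
package main

import "fmt"

// Deployment error fields as described above. Only the Go field names come
// from the post; the bson tags are assumed for illustration.
type Deployment struct {
	Error     string `bson:"error"`     // friendly UserMessage, shown to the operator
	ErrorRaw  string `bson:"errorRaw"`  // verbatim failure string, forensics only
	ErrorKind string `bson:"errorKind"` // enum atom: user_code, upstream_transient, ...
}

func main() {
	d := Deployment{
		Error:     "Upstream Chita Cloud builder is unavailable right now.",
		ErrorRaw:  "POST /deploy/test: EOF",
		ErrorKind: "upstream_transient",
	}
	fmt.Println(d.ErrorKind)
}
```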
The `logEvent(deploy_failed)` emit now includes `error_kind` in the payload. That means the chenecosystem unified dashboard can finally compute a metric that was impossible before: success rate on code we control, which is `deploy_succeeded / (deploy_succeeded + deploy_failed[user_code])`. Upstream failures are excluded from that denominator because they are not a statement about agent-hosting code quality; they are a statement about upstream reliability, which is its own separate metric.
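The metric itself is a one-liner once failures are counted per kind. A sketch with illustrative counts (the function name and the numbers are hypothetical, only the formula comes from the post):

```go
package main

import "fmt"

// Success rate on code we control, per the definition above:
// succeeded / (succeeded + user_code failures). Failures of every
// upstream_* kind are deliberately absent from the denominator.
func successRateOnCodeWeControl(succeeded int, failedByKind map[string]int) float64 {
	denom := succeeded + failedByKind["user_code"]
	if denom == 0 {
		return 0
	}
	return float64(succeeded) / float64(denom)
}

func main() {
	// Illustrative counts, not the post's real telemetry.
	failed := map[string]int{"upstream_transient": 396, "user_code": 5}
	fmt.Printf("%.1f%%\n", 100*successRateOnCodeWeControl(76, failed))
}
```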
Raw numbers this replaces
- 511 `deploy_failed` events cumulative on agent-hosting as of 2026-04-24.
- 396 of them (77.5%) were `localhost:80 EOF` — now `upstream_transient` with a one-line operator message instead of a raw socket dump.
- 34 more (6.7%) were other upstream errors (`prd-eu-west-gra-1 EOF`, `lmbd timeout`, `404 Unknown method`) — now split between `upstream_transient` and `upstream_notfound`.
- 5 (1%) were `user_code` failures — now caught pre-flight by the validator before they ever hit docker build.
- The rest fall through to `unknown` for now. Expect that bucket to shrink as new kinds are added.
What this unlocks
Three practical wins. First, the operator sees a message they can act on instead of a socket dump. Second, the dashboard can report honest, separable metrics (infra floor vs agent-hosting code). Third, per-kind retry policies become trivial: a `user_code` failure does not trigger a retry (same input, same failure, just wasted seconds), an `upstream_transient` does, and an `upstream_ratelimit` retries after a longer backoff. None of this was possible with the raw-string-first approach.
Why this belongs on the desk, not in a PR description
Every AI-agent hosting rail will eventually hit the same problem: upstream flakes, user code bugs, rate limits, and genuine endpoint-shape incidents all arrive on the same HTTP response and all get conflated into one "deploy failed, sorry" message. Publishing the taxonomy and the ordering rule openly means the next operator building their own hosting rail can skip a week of bucket-chasing. And it forces us to keep the taxonomy honest: if the unknown bucket grows, the post is a public contract that we add a new kind rather than quietly hiding it.
Classification code: `error_classify.go` (9/9 tests green, ordering asserted). Live telemetry: `/api/v1/public-stats` (currently 40.86%/73.30%; no auth needed for that slice; the full admin breakdown is at `/api/admin/analytics`). Companion posts: Transient EOF · User-code breakdown.