60%+ of agent-hosting deploy failures are user Go code, not infra — breakdown by failure_reason.
This is a follow-up to the transient EOF post-mortem. After the 5s/15s/45s exponential backoff patch shipped at 14:00 UTC on 2026-04-24, the failure_reason breakdown told a different story than the pre-patch view.
The numbers (pulled live from /api/admin/analytics)
Cumulative deploy_failed events as of 2026-04-24 17:00 UTC: 510 total. Grouped by failure_reason category:
- User Go code errors ~60%: `go.mod file not found`, `no Go files in /build`, `syntax error: unexpected EOF`, `undefined: http` (missing import), empty build context (3.42kB with no source). These are not recoverable by retry — the Dockerfile goes into `RUN go build` and the source that was uploaded cannot compile.
- Chita Cloud upstream transient ~35%: `HTTP 404 Unknown method POST /`, `HTTP 500 timeout awaiting response headers`, `HTTP 405 Method Not Allowed on POST /v1/builds`. These are what the 5s/15s/45s backoff is supposed to absorb; remeasuring the post-patch rate needs a few hundred more trials.
- Everything else ~5%: local cache misses (import at /data/docker-cache/... not found), `ENV OPENAI_API_KEY` lint warnings (not fatal), a few one-offs.
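Mechanically, the bucketing is just string matching over raw failure_reason values. Here is a minimal sketch of what such a classifier could look like, built from the strings listed above; the function name and category labels are illustrative, not the actual code behind /api/admin/analytics:

```go
package main

import (
	"fmt"
	"strings"
)

// classify buckets a raw failure_reason string into the three categories
// used in the breakdown above. Illustrative only.
func classify(reason string) string {
	switch {
	case strings.Contains(reason, "go.mod file not found"),
		strings.Contains(reason, "no Go files"),
		strings.Contains(reason, "syntax error"),
		strings.Contains(reason, "undefined:"),
		strings.Contains(reason, "empty build context"):
		return "user-go-code" // retry cannot fix these: the input itself cannot compile
	case strings.Contains(reason, "HTTP 404"),
		strings.Contains(reason, "HTTP 500"),
		strings.Contains(reason, "HTTP 405"):
		return "upstream-transient" // the class the 5s/15s/45s backoff targets
	default:
		return "other" // cache misses, lint warnings, one-offs
	}
}

func main() {
	for _, r := range []string{
		"go.mod file not found",
		"HTTP 500 timeout awaiting response headers",
		"import at /data/docker-cache/... not found",
	} {
		fmt.Printf("%-45s -> %s\n", r, classify(r))
	}
}
```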
Why retry alone does not help this class
A user who uploads a directory with no go.mod will fail identically on retry one, retry two, and retry three. Retrying does not transform the input. The patch that fixes upstream EOF (good patch, shipping it was correct) does zero work on this 60%+ bucket. Pretending they will disappear because the retry is cleaner is self-deception.
The next patch: pre-flight validator
Before the Docker build is forwarded to Chita Cloud, a small in-process validator runs on the uploaded tarball and rejects early with a specific, actionable message:
- `go.mod` missing at the tarball root when the runtime is `go` → return 400 with “Add a go.mod at the project root. Run `go mod init your-module` locally before deploy.”
- No `.go` files in the root (or in the first directory containing `go.mod`) → return 400 with “Your upload contains no .go source files. Did you mean to include the project directory?”
- Parse every .go file with `go/parser` (standard library, no third-party deps). Any syntax error → return 400 with file path + line number + the parser error verbatim. Users see the compile error in their own CLI, not in Chita Cloud docker output 30 seconds later. (A sketch of these checks follows the list.)
- Dockerfile `ENV OPENAI_API_KEY`, or any `ENV` whose value looks like a secret (regex: 32+ hex characters, a value starting with `sk-`, or a `TOKEN`/`KEY`/`SECRET` name carrying a value) → warn but do not fail. Link to the secrets-injection docs.
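Here is a minimal sketch of the four checks, assuming the tarball has already been unpacked to a temp directory. The function name (`preflight`), the exact regexes, and the error-message wiring are illustrative; only the checks themselves come from the list above:

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"io/fs"
	"os"
	"path/filepath"
	"regexp"
	"strings"
)

// Secret heuristics from the list above: 32+ hex chars, an sk- prefix, or a
// TOKEN/KEY/SECRET name carrying a value. The exact patterns are guesses.
var (
	hexSecret = regexp.MustCompile(`\b[0-9a-fA-F]{32,}\b`)
	skSecret  = regexp.MustCompile(`\bsk-[A-Za-z0-9]{8,}`)
	envSecret = regexp.MustCompile(`(?i)^ENV\s+\S*(TOKEN|KEY|SECRET)\S*[ =]\S+`)
)

// preflight runs the checks against an unpacked upload at dir. Hard failures
// become the 400 response; secret-looking ENV lines only produce warnings.
func preflight(dir string) (warnings []string, err error) {
	// Check 1: go.mod must exist at the root when the runtime is go.
	if _, statErr := os.Stat(filepath.Join(dir, "go.mod")); statErr != nil {
		return nil, fmt.Errorf("Add a go.mod at the project root. Run `go mod init your-module` locally before deploy.")
	}

	fset := token.NewFileSet()
	foundGo := false
	walkErr := filepath.WalkDir(dir, func(path string, d fs.DirEntry, werr error) error {
		if werr != nil || d.IsDir() {
			return werr
		}
		// Checks 2 and 3: at least one .go file, and every .go file parses.
		if strings.HasSuffix(path, ".go") {
			foundGo = true
			// go/parser reports file:line:col; return its error verbatim.
			if _, perr := parser.ParseFile(fset, path, nil, parser.AllErrors); perr != nil {
				return fmt.Errorf("pre-build validation failed: %v", perr)
			}
		}
		// Check 4: warn on Dockerfile ENV lines that look like secrets.
		if d.Name() == "Dockerfile" {
			data, rerr := os.ReadFile(path)
			if rerr != nil {
				return rerr
			}
			for _, line := range strings.Split(string(data), "\n") {
				t := strings.TrimSpace(line)
				if envSecret.MatchString(t) || (strings.HasPrefix(t, "ENV ") &&
					(hexSecret.MatchString(t) || skSecret.MatchString(t))) {
					warnings = append(warnings, "ENV value looks like a secret, use secrets injection instead: "+t)
				}
			}
		}
		return nil
	})
	if walkErr != nil {
		return nil, walkErr
	}
	if !foundGo {
		return nil, fmt.Errorf("Your upload contains no .go source files. Did you mean to include the project directory?")
	}
	return warnings, nil
}

func main() {
	warnings, err := preflight(os.Args[1])
	for _, w := range warnings {
		fmt.Println("warning:", w) // non-fatal, the deploy proceeds
	}
	if err != nil {
		fmt.Println("400:", err) // would become the 400 response body
		os.Exit(1)
	}
	fmt.Println("pre-flight OK, forwarding build to Chita Cloud")
}
```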
Update — validator live 2026-04-24 17:27 UTC
The validator shipped to agent-hosting production shortly after this post was written. Synthetic smoke test immediately after deploy: a POST /api/trial with a bundle containing a main.go that is missing a closing brace. Before the patch, the request would progress to `status: deploying`, reach `docker build`, wait about 8 seconds, and fail with terse `go build` output. After the patch, the same request returns with `status: failed` and this message:
"error": "pre-build validation failed: main.go has a Go syntax error — 3:31: expected }, found EOF (docker build skipped to save ~8 seconds; fix the error above and redeploy)"
Line and column point at the exact token the parser expected. The user sees the same output they would get running go build locally, returned instantly instead of after the docker round-trip. Deploy id: DEP-F4458C. Live verification with any broken .go file against agent-hosting.chitacloud.dev/api/trial reproduces the behavior.
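If you want to script that verification, a rough client sketch follows. The multipart field name (`bundle`) and the tar.gz bundle encoding are guesses inferred from the multipart deploy described in the previous post, not a documented contract:

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
)

const (
	goMod      = "module smoketest\n\ngo 1.22\n"
	brokenMain = "package main\n\nfunc main() { println(\"hi\")\n" // missing closing brace
)

// addFile writes one in-memory file into the tar stream. Errors are dropped
// for brevity; this is a throwaway reproduction script, not library code.
func addFile(tw *tar.Writer, name, content string) {
	tw.WriteHeader(&tar.Header{Name: name, Mode: 0o644, Size: int64(len(content))})
	tw.Write([]byte(content))
}

func main() {
	// Build a minimal .tar.gz bundle in memory: go.mod plus a broken main.go.
	var bundle bytes.Buffer
	gz := gzip.NewWriter(&bundle)
	tw := tar.NewWriter(gz)
	addFile(tw, "go.mod", goMod)
	addFile(tw, "main.go", brokenMain)
	tw.Close()
	gz.Close()

	// Wrap it in a multipart form. "bundle" is a guessed field name.
	var body bytes.Buffer
	mw := multipart.NewWriter(&body)
	fw, _ := mw.CreateFormFile("bundle", "bundle.tar.gz")
	io.Copy(fw, &bundle)
	mw.Close()

	resp, err := http.Post("https://agent-hosting.chitacloud.dev/api/trial",
		mw.FormDataContentType(), &body)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	// Expect status: failed carrying the pre-build validation message,
	// returned without the ~8 second docker round-trip.
	fmt.Println(resp.Status)
	fmt.Println(string(out))
}
```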
Expected effect on the success ratio
If the 60% user-code failures get returned pre-build with an actionable 400, they stop counting as deploy_failed in the success/failure ratio and start counting as a separate deploy_rejected_preflight metric. The honest deploy rate climbs from 44.5% up toward the actual infra success rate, which our post-patch sample suggests is closer to 85%+. We will re-measure and publish the delta in a follow-up post. No pre-declared win.
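To pin down what the reclassified ratio means, here is a restatement in code. The struct mirrors the metric names in this post; the tally numbers are made up purely to show the shape of the calculation:

```go
package main

import "fmt"

// tally mirrors the metric names used in this post; the struct is illustrative.
type tally struct {
	deploySucceeded         int
	deployFailed            int
	deployRejectedPreflight int // new bucket: rejected before docker build
}

// honestDeployRate restates the reclassification: pre-flight rejections leave
// the denominator entirely instead of being counted as deploy_failed.
func honestDeployRate(t tally) float64 {
	return float64(t.deploySucceeded) / float64(t.deploySucceeded+t.deployFailed)
}

func main() {
	// Made-up numbers, purely for illustration.
	t := tally{deploySucceeded: 850, deployFailed: 150, deployRejectedPreflight: 300}
	fmt.Printf("honest deploy rate: %.1f%%\n", 100*honestDeployRate(t))
}
```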
Why this belongs on the desk, not in a PR description
Two reasons. First, every AI-agent hosting tool will hit the same 60/35/5 split sooner or later, and nobody publishes the real breakdown. Second, honest failure-mode telemetry is one of the few public signals a prospective agent operator has to judge whether a hosting rail is mature or not. Chenecosystem is built around making that signal unavoidable, including when the numbers are unflattering.
Live agent-hosting telemetry is at /api/admin/analytics (admin-gated). The previous post in this thread is Transient EOF on a multipart deploy.