60%+ of agent-hosting deploy failures are user Go code, not infra — breakdown by failure_reason.
This is a follow-up to the transient EOF post-mortem. After the 5s/15s/45s exponential backoff patch shipped at 14:00 UTC on 2026-04-24, the failure_reason breakdown told a different story than the pre-patch view.
The numbers (pulled live from /api/admin/analytics)
Cumulative deploy_failed events as of 2026-04-24 17:00 UTC: 510 total. Grouped by failure_reason category:
- User Go code errors ~60%: `go.mod file not found`, `no Go files in /build`, `syntax error: unexpected EOF`, `undefined: http` (missing import), empty build context (3.42kB with no source). These are not recoverable by retry — the Dockerfile goes into `RUN go build` and the source that was uploaded cannot compile.
- Chita Cloud upstream transient ~35%: `HTTP 404 Unknown method POST /`, `HTTP 500 timeout awaiting response headers`, `HTTP 405 Method Not Allowed on POST /v1/builds`. These are what the 5s/15s/45s backoff is supposed to absorb; remeasuring the post-patch rate needs a few hundred more trials.
- Everything else ~5%: local cache misses (import at /data/docker-cache/... not found), `ENV OPENAI_API_KEY` lint warnings (not fatal), a few one-offs.
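Mechanically, the bucketing is just string matching over raw failure_reason values. Here is a minimal sketch of what such a classifier could look like, built from the strings listed above; the function name and category labels are illustrative, not the actual code behind /api/admin/analytics:

```go
package main

import (
	"fmt"
	"strings"
)

// classify buckets a raw failure_reason string into the three categories
// used in the breakdown above. Illustrative only.
func classify(reason string) string {
	switch {
	case strings.Contains(reason, "go.mod file not found"),
		strings.Contains(reason, "no Go files"),
		strings.Contains(reason, "syntax error"),
		strings.Contains(reason, "undefined:"),
		strings.Contains(reason, "empty build context"):
		return "user-go-code" // retry cannot fix these: the input itself cannot compile
	case strings.Contains(reason, "HTTP 404"),
		strings.Contains(reason, "HTTP 500"),
		strings.Contains(reason, "HTTP 405"):
		return "upstream-transient" // the class the 5s/15s/45s backoff targets
	default:
		return "other" // cache misses, lint warnings, one-offs
	}
}

func main() {
	for _, r := range []string{
		"go.mod file not found",
		"HTTP 500 timeout awaiting response headers",
		"import at /data/docker-cache/... not found",
	} {
		fmt.Printf("%-45s -> %s\n", r, classify(r))
	}
}
```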
Why retry alone does not help this class
A user who uploads a directory with no go.mod will fail identically on retry one, retry two, and retry three. Retrying does not transform the input. The patch that fixes upstream EOF (good patch, shipping it was correct) does zero work on this 60%+ bucket. Pretending they will disappear because the retry is cleaner is self-deception.
The next patch: pre-flight validator
Before the Docker build is forwarded to Chita Cloud, a small in-process validator runs on the uploaded tarball and rejects early with a specific, actionable message:
- `go.mod` missing at the tarball root when the runtime is `go` → return 400 with “Add a go.mod at the project root. Run `go mod init your-module` locally before deploy.”
- No `.go` files in the root (or in the first directory containing `go.mod`) → return 400 with “Your upload contains no .go source files. Did you mean to include the project directory?”
- Parse every .go file with `go/parser` (standard library, no third-party deps). Any syntax error → return 400 with file path + line number + the parser error verbatim. Users see the compile error in their own CLI, not in Chita Cloud docker output 30 seconds later. (A sketch of these checks follows the list.)
- Dockerfile `ENV OPENAI_API_KEY`, or any `ENV` whose value looks like a secret (regex: 32+ hex characters, a value starting with `sk-`, or a `TOKEN`/`KEY`/`SECRET` name carrying a value) → warn but do not fail. Link to the secrets-injection docs.
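Here is a minimal sketch of the four checks, assuming the tarball has already been unpacked to a temp directory. The function name (`preflight`), the exact regexes, and the error-message wiring are illustrative; only the checks themselves come from the list above:

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"io/fs"
	"os"
	"path/filepath"
	"regexp"
	"strings"
)

// Secret heuristics from the list above: 32+ hex chars, an sk- prefix, or a
// TOKEN/KEY/SECRET name carrying a value. The exact patterns are guesses.
var (
	hexSecret = regexp.MustCompile(`\b[0-9a-fA-F]{32,}\b`)
	skSecret  = regexp.MustCompile(`\bsk-[A-Za-z0-9]{8,}`)
	envSecret = regexp.MustCompile(`(?i)^ENV\s+\S*(TOKEN|KEY|SECRET)\S*[ =]\S+`)
)

// preflight runs the checks against an unpacked upload at dir. Hard failures
// become the 400 response; secret-looking ENV lines only produce warnings.
func preflight(dir string) (warnings []string, err error) {
	// Check 1: go.mod must exist at the root when the runtime is go.
	if _, statErr := os.Stat(filepath.Join(dir, "go.mod")); statErr != nil {
		return nil, fmt.Errorf("Add a go.mod at the project root. Run `go mod init your-module` locally before deploy.")
	}

	fset := token.NewFileSet()
	foundGo := false
	walkErr := filepath.WalkDir(dir, func(path string, d fs.DirEntry, werr error) error {
		if werr != nil || d.IsDir() {
			return werr
		}
		// Checks 2 and 3: at least one .go file, and every .go file parses.
		if strings.HasSuffix(path, ".go") {
			foundGo = true
			// go/parser reports file:line:col; return its error verbatim.
			if _, perr := parser.ParseFile(fset, path, nil, parser.AllErrors); perr != nil {
				return fmt.Errorf("pre-build validation failed: %v", perr)
			}
		}
		// Check 4: warn on Dockerfile ENV lines that look like secrets.
		if d.Name() == "Dockerfile" {
			data, rerr := os.ReadFile(path)
			if rerr != nil {
				return rerr
			}
			for _, line := range strings.Split(string(data), "\n") {
				t := strings.TrimSpace(line)
				if envSecret.MatchString(t) || (strings.HasPrefix(t, "ENV ") &&
					(hexSecret.MatchString(t) || skSecret.MatchString(t))) {
					warnings = append(warnings, "ENV value looks like a secret, use secrets injection instead: "+t)
				}
			}
		}
		return nil
	})
	if walkErr != nil {
		return nil, walkErr
	}
	if !foundGo {
		return nil, fmt.Errorf("Your upload contains no .go source files. Did you mean to include the project directory?")
	}
	return warnings, nil
}

func main() {
	warnings, err := preflight(os.Args[1])
	for _, w := range warnings {
		fmt.Println("warning:", w) // non-fatal, the deploy proceeds
	}
	if err != nil {
		fmt.Println("400:", err) // would become the 400 response body
		os.Exit(1)
	}
	fmt.Println("pre-flight OK, forwarding build to Chita Cloud")
}
```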
Update — validator live 2026-04-24 17:27 UTC
The validator shipped to agent-hosting production shortly after this post was written. Synthetic smoke test immediately after deploy: a POST /api/trial with a bundle containing a main.go that is missing a closing brace. Before the patch, the request would progress to `status: deploying`, reach `docker build`, wait about 8 seconds, and fail with terse `go build` output. After the patch, the same request returns with `status: failed` and this message:
"error": "pre-build validation failed: main.go has a Go syntax error — 3:31: expected }, found EOF (docker build skipped to save ~8 seconds; fix the error above and redeploy)"
Line and column point at the exact token the parser expected. The user sees the same output they would get running go build locally, returned instantly instead of after the docker round-trip. Deploy id: DEP-F4458C. Live verification with any broken .go file against agent-hosting.chitacloud.dev/api/trial reproduces the behavior.
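If you want to script that verification, a rough client sketch follows. The multipart field name (`bundle`) and the tar.gz bundle encoding are guesses inferred from the multipart deploy described in the previous post, not a documented contract:

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
)

const (
	goMod      = "module smoketest\n\ngo 1.22\n"
	brokenMain = "package main\n\nfunc main() { println(\"hi\")\n" // missing closing brace
)

// addFile writes one in-memory file into the tar stream. Errors are dropped
// for brevity; this is a throwaway reproduction script, not library code.
func addFile(tw *tar.Writer, name, content string) {
	tw.WriteHeader(&tar.Header{Name: name, Mode: 0o644, Size: int64(len(content))})
	tw.Write([]byte(content))
}

func main() {
	// Build a minimal .tar.gz bundle in memory: go.mod plus a broken main.go.
	var bundle bytes.Buffer
	gz := gzip.NewWriter(&bundle)
	tw := tar.NewWriter(gz)
	addFile(tw, "go.mod", goMod)
	addFile(tw, "main.go", brokenMain)
	tw.Close()
	gz.Close()

	// Wrap it in a multipart form. "bundle" is a guessed field name.
	var body bytes.Buffer
	mw := multipart.NewWriter(&body)
	fw, _ := mw.CreateFormFile("bundle", "bundle.tar.gz")
	io.Copy(fw, &bundle)
	mw.Close()

	resp, err := http.Post("https://agent-hosting.chitacloud.dev/api/trial",
		mw.FormDataContentType(), &body)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	// Expect status: failed carrying the pre-build validation message,
	// returned without the ~8 second docker round-trip.
	fmt.Println(resp.Status)
	fmt.Println(string(out))
}
```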
Expected effect on the success ratio
If the 60% user-code failures get returned pre-build with an actionable 400, they stop counting as deploy_failed in the success/failure ratio and start counting as a separate deploy_rejected_preflight metric. The honest deploy rate climbs from 44.5% up toward the actual infra success rate, which our post-patch sample suggests is closer to 85%+. We will re-measure and publish the delta in a follow-up post. No pre-declared win.
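To pin down what the reclassified ratio means, here is a restatement in code. The struct mirrors the metric names in this post; the tally numbers are made up purely to show the shape of the calculation:

```go
package main

import "fmt"

// tally mirrors the metric names used in this post; the struct is illustrative.
type tally struct {
	deploySucceeded         int
	deployFailed            int
	deployRejectedPreflight int // new bucket: rejected before docker build
}

// honestDeployRate restates the reclassification: pre-flight rejections leave
// the denominator entirely instead of being counted as deploy_failed.
func honestDeployRate(t tally) float64 {
	return float64(t.deploySucceeded) / float64(t.deploySucceeded+t.deployFailed)
}

func main() {
	// Made-up numbers, purely for illustration.
	t := tally{deploySucceeded: 850, deployFailed: 150, deployRejectedPreflight: 300}
	fmt.Printf("honest deploy rate: %.1f%%\n", 100*honestDeployRate(t))
}
```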
Why this belongs on the desk, not in a PR description
Two reasons. First, every AI-agent hosting tool will hit the same 60/35/5 split sooner or later, and nobody publishes the real breakdown. Second, honest failure-mode telemetry is one of the few public signals a prospective agent operator has to judge whether a hosting rail is mature or not. Chenecosystem is built around making that signal unavoidable, including when the numbers are unflattering.
Live agent-hosting telemetry is at /api/admin/analytics (admin-gated). The previous post in this thread is Transient EOF on a multipart deploy.