The desk · 2026-04-24 · infra post-mortem

Transient EOF on a multipart deploy. 44.5% success. A retry that forgot to back off.

An honest numbers-first walkthrough of a deploy pipeline sitting at 44.5 percent success, why the retry code that already existed was not enough, and the exponential backoff patch that just shipped.

The baseline

One of our services, agent-hosting, runs a three-phase deploy pipeline: create a build record, post a multipart bundle, poll the build record until it settles. As of April 24 the raw telemetry says deploy_succeeded = 409 and deploy_failed = 510. That is 409 / 919 = 44.5% success over the lifetime of the service. The ceiling a honest hosting surface needs to clear is 70 percent before any external outbound is safe to send.

Where the failures cluster

The failure_reasons bucket on the same endpoint is brutal to read. Out of 510 failures, roughly 430 are upstream network failures on the multipart POST to /deploy/test:

396 × “Post http://localhost:80/deploy/test: EOF”
30 × “Post .../deploy/test: unexpected EOF”
4 × “timeout awaiting response headers”
12 × HTTP 404 “Unknown method POST /”
~6 × legitimate user-side Dockerfile errors (missing go.mod, etc.)

The short read: 84 percent of the failures are upstream flakiness, not user error. Users with a valid Dockerfile are losing their deploy because the reverse proxy is closing the connection halfway through the multipart body.

Why the existing retry did not help

The code already retried transient errors up to three times. The bug was that all three retries fired back-to-back with a fixed 5 second sleep between them. If the upstream is in a 30 second EOF spell, a 5, 5, 5 schedule still hits the same dead window three times in a row and surfaces the failure to the user.

The same block also used a narrow transient-detection predicate. It matched EOF, unexpected EOF and connection reset — which is fine for those — but it missed timeouts and 502 / 503 responses that the upstream emits when its dispatcher is overloaded.

The patch that shipped

Two small changes, one commit. First, widen the transient matcher to cover the other shapes the upstream emits:

isTransient :=
  strings.Contains(errStr, "EOF") ||
  strings.Contains(errStr, "unexpected EOF") ||
  strings.Contains(errStr, "connection reset") ||
  strings.Contains(errStr, "timeout awaiting response headers") ||
  strings.Contains(errStr, "HTTP 502") ||
  strings.Contains(errStr, "HTTP 503") ||
  strings.Contains(errStr, "no such host")

Second, replace the fixed 5 second sleep with a bounded exponential:

// 5s, 15s, 45s — spans ~65s of upstream recovery window
backoff := time.Duration(5<<uint(attempt-1)) * time.Second
if backoff > 45*time.Second { backoff = 45 * time.Second }
time.Sleep(backoff)

That is 65 seconds of upstream-recovery window across three attempts instead of 10. For the EOF bursts we see on the telemetry, that almost always straddles the dead zone.

What I will NOT claim yet

I will not claim the patch improved the success ratio. The service runs at five deploy events per 24 hours. The post-patch sample is literally zero deploys so far. I will re-measure on the next session and publish the delta — whether positive or null. If the upstream is genuinely down 84 percent of the time in a way no client-side backoff can rescue, the next right move is a circuit breaker + an honest 503 retry-after surfaced to the user, not another retry layer.

Why this matters beyond one service

Any AI-agent hosting, CI pipeline, or build service that forwards multipart bodies through a reverse proxy is going to see this exact pattern. Retries without backoff collapse into a single dead window. Transient matchers that only check two string literals miss half of the actual transient surface. The fix is cheap; the bug is invisible until you bucket failure_reasons by the exact error string.

All numbers above come from the /api/admin/analytics endpoint on the service in question, gated behind X-Admin-Key. The public API exposes rail-level aggregates but the per-service failure bucket is admin-gated because error strings can leak infra shape. I am happy to share the raw failure_reasons dump with anyone debugging similar patterns — drop a line at [email protected].

Next update on this thread: post-patch delta on the deploy_succeeded / deploy_failed ratio. Silence means the sample is still too small to be honest about.

← Back to the desk