Delays with job adoption

Updates

Postmortem
August 08, 2025 at 6:03 PM
What happened: A bug in our job reconciliation system (which ensures we spin up VMs to run your CI jobs even if GitHub fails to send a webhook) caused us to overprovision VMs, spinning up 110% more VMs than needed. The extra load increased queue times, and delays in draining the surplus VMs slowed the system's recovery.
What we’ve done since:
- Fixed the underlying bug
- Improved how we drain the queue under load
- Added safeguards to better handle similar failure modes in the future
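
For readers unfamiliar with reconciliation, the idea described above can be sketched roughly as follows. This is a hypothetical illustration only, not our actual implementation: the names `Job`, `vms_to_provision`, and `surplus_percent`, and the inputs they take, are assumptions made for the example. The sketch shows one way a reconciliation pass might avoid overprovisioning, by subtracting VMs that are already booting before requesting new ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """A queued CI job (hypothetical shape, for illustration only)."""
    job_id: str
    arch: str


def vms_to_provision(queued_jobs, claimed_job_ids, booting_vm_count):
    """Return how many new VMs one reconciliation pass should request.

    queued_jobs: jobs the CI provider reports as queued
    claimed_job_ids: IDs of jobs that already have a VM assigned
    booting_vm_count: VMs requested earlier that have not claimed a job yet
    """
    unclaimed = [job for job in queued_jobs if job.job_id not in claimed_job_ids]
    # Count VMs that are still booting, so repeated passes do not
    # request a second VM for the same job.
    return max(0, len(unclaimed) - booting_vm_count)


def surplus_percent(provisioned, needed):
    """Percentage of VMs provisioned beyond what was needed."""
    return 100 * (provisioned - needed) / needed


jobs = [Job("a", "amd64"), Job("b", "amd64"), Job("c", "amd64")]
# One job is already claimed and one VM is still booting,
# so only one more VM is required.
print(vms_to_provision(jobs, {"a"}, 1))  # 1
# "110% more than needed" means 210 VMs where 100 would have sufficed.
print(surplus_percent(210, 100))  # 110.0
```

In a sketch like this, omitting the `booting_vm_count` term would make every pass request VMs for jobs that already have one on the way, which is one plausible way an overprovisioning bug of this kind could arise.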
Resolved
August 07, 2025 at 6:21 PM
This incident has been resolved, and queue times are returning to normal.
Identified
August 07, 2025 at 4:19 PM
We are running into fleet capacity issues in our amd64 cluster and are looking into mitigations.
Investigating
August 07, 2025 at 12:43 PM
We're seeing delays in job adoption times for AMD64 jobs across both our EU and U.S. regions. We're currently investigating this issue.

Status | Blacksmith - Delays with job adoption – Incident details