Status | Blacksmith - Delays with job adoption – Incident details

Delays with job adoption

Resolved
Degraded performance
Started 23 days ago
Lasted about 6 hours

Affected

Blacksmith Managed Runners

Degraded performance from 12:43 PM to 6:21 PM

Updates
  • Postmortem

    What happened: A bug in our job reconciliation system (which ensures we spin up VMs to run your CI jobs even if GitHub fails to send a webhook) caused us to overprovision VMs, spinning up 110% more VMs than needed. This additional load led to increased queue times and made it harder for the system to recover quickly due to delays in draining the additional VMs.

    What we’ve done since:

    • Fixed the underlying bug

    • Improved how we drain the queue under load

    • Added safeguards to better handle similar failure modes in the future
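
The postmortem does not describe the bug itself or the safeguards that were added, so the following is only a minimal sketch of how a reconciliation step of this kind might bound overprovisioning. All names here (`FleetSnapshot`, `vms_to_provision`, `max_per_tick`) and the cap value are hypothetical and are not taken from Blacksmith's system. The idea shown is that each reconciliation pass counts VMs that are already booting or idle before requesting more, and limits how many it requests in a single pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FleetSnapshot:
    """Hypothetical view of the fleet at one reconciliation pass."""
    queued_jobs: int   # jobs waiting for a runner, found by polling rather than webhooks
    booting_vms: int   # VMs already requested but not yet ready
    idle_vms: int      # VMs that are up and waiting to adopt a job


def vms_to_provision(snapshot: FleetSnapshot, max_per_tick: int = 50) -> int:
    """Return how many new VMs to request in this pass.

    Subtracting VMs that are already booting or idle keeps repeated passes
    from requesting a second VM for the same job. The per-pass cap
    (an assumed safeguard) bounds the damage if the job count is wrong.
    """
    available = snapshot.booting_vms + snapshot.idle_vms
    shortfall = max(0, snapshot.queued_jobs - available)
    return min(shortfall, max_per_tick)
```

For example, with 10 queued jobs, 4 VMs booting, and 2 idle, this sketch would request 4 more VMs. With more capacity than queued jobs it would request none.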

  • Resolved

    This incident has been resolved, and queue times are returning to normal.

  • Identified

    We are running into fleet capacity issues in our amd64 cluster and are looking into mitigations.

  • Investigating

    We're seeing delays in job adoption times for AMD64 jobs across both our EU and U.S. regions. We're currently investigating this issue.