What happened: A bug in our job reconciliation system, which ensures we spin up VMs to run your CI jobs even if GitHub fails to send a webhook, caused us to overprovision VMs: we spun up 110% more VMs than needed. The extra load increased queue times, and delays in draining the surplus VMs slowed the system's recovery.
What we’ve done since: