Blacksmith control plane outage

Resolved
Major outage
Started 4 days ago
Lasted about 7 hours

Affected

Blacksmith Managed Runners

Major outage from 3:56 PM to 3:58 PM, Partial outage from 3:58 PM to 4:32 PM, Operational from 4:32 PM to 7:32 PM, Degraded performance from 7:32 PM to 10:31 PM

EU ARM

Major outage from 3:56 PM to 3:58 PM, Partial outage from 3:58 PM to 4:32 PM, Operational from 4:32 PM to 7:32 PM, Degraded performance from 7:32 PM to 10:31 PM

EU X86

Major outage from 3:56 PM to 3:58 PM, Partial outage from 3:58 PM to 4:32 PM, Operational from 4:32 PM to 7:32 PM, Degraded performance from 7:32 PM to 10:31 PM

US ARM

Major outage from 3:56 PM to 3:58 PM, Partial outage from 3:58 PM to 4:32 PM, Operational from 4:32 PM to 7:32 PM, Degraded performance from 7:32 PM to 10:31 PM

US X86

Major outage from 3:56 PM to 3:58 PM, Partial outage from 3:58 PM to 4:32 PM, Operational from 4:32 PM to 7:32 PM, Degraded performance from 7:32 PM to 10:31 PM

EU-WEST x86

Major outage from 3:56 PM to 3:58 PM, Partial outage from 3:58 PM to 4:32 PM, Operational from 4:32 PM to 7:32 PM, Degraded performance from 7:32 PM to 10:31 PM

Updates
  • Resolved

    This incident has been resolved. Queue times are back to normal, and we are working on mitigations on our end to prevent similar load-based degradations in the future.

  • Identified

    We are continuing to face issues with our upstream database provider. We are working with them to identify the root cause of the problem.

  • Update

    We are still investigating issues with slow job adoption.

  • Monitoring

    We implemented a fix and are re-queuing the jobs in the backlog.

  • Investigating

    There may be some delays with job adoption.