Status | Blacksmith - Notice history

Blacksmith Managed Runners - Operational

Uptime: Oct 2025 · 100.0%, Nov 2025 · 100.0%, Dec 2025 · 100.0%

Incremental Docker Builders - Operational

Uptime: Oct 2025 · 99.99%, Nov 2025 · 99.91%, Dec 2025 · 100.0%

API - Operational

Uptime: Oct 2025 · 100.0%, Nov 2025 · 100.0%, Dec 2025 · 100.0%

Website - Operational

Uptime: Oct 2025 · 99.79%, Nov 2025 · 100.0%, Dec 2025 · 100.0%

GitHub → Actions - Operational

GitHub → API Requests - Operational

GitHub → Webhooks - Operational

Notice history

Dec 2025

No notices reported this month

Nov 2025

Timeouts when interacting with upstream GitHub mirrors and GHCR
  • Postmortem

    Leadup

    Blacksmith runs CI jobs for a large fraction of its customers on a fleet of bare metal machines in our US region. All GitHub operations (clones, API calls, container pulls from ghcr.io) that run as part of a CI job route through a blend of ISPs managed by our datacenter provider. Prior to this incident, we had very little visibility into, or redundancy in, the path to GitHub outside of this ISP blend.

    Fault

    One upstream ISP in the blend was consistently showing signs of congestion on the return traffic from one of GitHub's edge nodes. Since CI jobs often involve a significant amount of ingress from GitHub's endpoints, approximately 7-10% of application-level HTTP connections to GitHub experienced 5-20 second stalls, causing actions/checkout and other GitHub operations to time out. The same operations from other regions in the US and our EU region completed normally.

    Job failure rate in our US region increased from a ~10% baseline to 20-30% during the incident.
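
    As a rough sanity check on those numbers (not part of the original analysis), the jump from a ~10% baseline is what you would expect if each job performs a handful of GitHub operations and roughly 8% of the underlying connections stall past the client timeout. The connection counts below are assumptions for illustration only.

    # Back-of-the-envelope check: combine the baseline job failure rate with
    # the chance that at least one of a job's GitHub connections stalls.
    # The connections-per-job values are assumptions, not measured figures.
    baseline_failure = 0.10   # normal job failure rate in the US region
    stall_probability = 0.08  # ~7-10% of connections to GitHub stalled

    for connections_per_job in (1, 2, 3):
        p_stall_hit = 1 - (1 - stall_probability) ** connections_per_job
        total = 1 - (1 - baseline_failure) * (1 - p_stall_hit)
        print(f"{connections_per_job} GitHub connections/job -> ~{total:.0%} job failure rate")
    # Prints roughly 17%, 24%, and 30%, bracketing the observed 20-30%.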

    Detection and Timeline

    November 28, 2025:

    • 8:00 AM ET: Sporadic customer reports of actions/checkout timing out on a small percentage of jobs. We also observed failures in our own verification workflows. Initial assumption was that GitHub would declare an upstream degradation, as seen in previous similar incidents.

    • 3:00 PM ET: The low volume of jobs running on Thanksgiving day meant that we heard no further reports for several hours. This was when we first noticed our internal verification workflows sporadically failing to check out with "Failed to connect to github.com port 443 after 134053 ms: Couldn't connect to server".

    • 3:08 PM ET: Noticed probes in our EU region were completely healthy. Decided to reach out to our datacenter in the US.

    • 3:27 PM ET: Datacenter team acknowledged and tried to get support involved.

    • 3:38 PM ET: Datacenter team was provided with a reproduction in which pings and traceroutes showed no issues but application-level curls showed variable latency and occasional timeouts (a probe sketch illustrating this kind of check follows the timeline).

    • 4:13 PM ET: Datacenter network engineers joined a call with the Blacksmith team to experiment with switching one of the ISPs to see if error rates would go down.

    • 5:53 PM ET: The ISP switch was reverted as no change in the failure rate was observed.

    • 8:37 PM ET: An instance was spun up in another US region offered by our datacenter. A similar test was run and did not show signs of degradation despite hitting the same GitHub edge node.

    • 10:42 PM ET: Started scoping out a fallback GitHub proxy in AWS so that we could re-route GitHub-bound traffic through the AWS direct connect with GitHub. This was based on our previous findings that the issue was somewhere in the path out of our datacenter.

    November 29, 2025:

    • 2:01 AM ET: Director of Networking at our datacenter reached out asking if we were still running into issues.

    • 3:05 AM ET: Networking team received a report from another customer with a traceroute showing degradation in a hop on the return path. This was helpful as it gave them confidence that this was an ISP-level degradation. The team started turning off ISPs in their blend one at a time.

    • 4:40 AM ET: Datacenter team asked us to re-run stress tests and we saw immediate recovery. With the degraded ISP identified, it was removed from the blend. Job failure rates on Blacksmith returned to their baseline.

    • 5:08 AM ET: A postmortem call between the datacenter team and Blacksmith was scheduled for December 1, 2025, and the incident was closed.

    • 8:36 AM ET: With recovery looking stable, orgs that had been rebalanced to the EU were moved back to the US.
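
    For reference, the reproduction shared with the datacenter team boiled down to timing full HTTPS requests rather than relying on ICMP probes, since pings and traceroutes looked healthy while application-level connections stalled. The sketch below is a minimal illustration of that kind of probe, not the exact tooling used during the incident; the endpoints, thresholds, and request counts are assumptions based on the behaviour described above.

    # probe_github.py -- illustrative application-level probe. ICMP and
    # traceroute can look healthy while full TCP+TLS+HTTP round trips to
    # GitHub intermittently stall, so we time complete HTTPS requests and
    # flag anything slower than a stall threshold.
    import time
    import urllib.error
    import urllib.request

    TARGETS = ["https://github.com", "https://api.github.com", "https://ghcr.io/v2/"]
    STALL_SECONDS = 5.0     # stalls observed during the incident were 5-20s
    TIMEOUT_SECONDS = 30.0
    ROUNDS = 50

    def timed_request(url: str) -> float:
        """Issue one HTTPS request and return the wall-clock time it took."""
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as resp:
                resp.read(1024)  # pull at least some response bytes over the wire
        except urllib.error.HTTPError:
            pass  # an HTTP status (e.g. 401 from ghcr.io) still proves the path works
        except Exception as exc:
            print(f"  error on {url} after {time.monotonic() - start:.1f}s: {exc}")
        return time.monotonic() - start

    if __name__ == "__main__":
        stalled = 0
        for round_number in range(ROUNDS):
            for url in TARGETS:
                elapsed = timed_request(url)
                if elapsed > STALL_SECONDS:
                    stalled += 1
                    print(f"[{round_number}] STALL {url} took {elapsed:.1f}s")
            time.sleep(1)
        print(f"{stalled} stalled requests out of {ROUNDS * len(TARGETS)}")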

    Root causes

    1. Why did jobs fail? GitHub connections timed out after 133+ seconds.

    2. Why did connections time out? HTTP request stalls on ~7-10% of connections to GitHub IPs.

    3. Why were there TCP stalls? One ISP in our datacenter provider's blend had degraded routing to GitHub's edge nodes.

    4. Why was the ISP routing degraded? RCA pending from ISP.

    5. Why did this cause widespread impact? No redundant network path to GitHub outside the ISP blend. The entire fleet of worker nodes in the region was routing its traffic through the same set of upstream ISPs.

    Mitigation and resolution

    Immediate mitigation: Remove the degraded ISP from the blend. This was a whack-a-mole process due to lack of observability into which specific ISP was causing the issue. The datacenter provider had to disable ISPs one at a time, monitor for reproduction, and repeat until the culprit was identified and removed.

    Customer workaround: Published status page update advising affected customers to contact us for migration to EU region. 20+ customers were temporarily migrated to EU. All have since been rebalanced back to their original regions.

    Duration: ~19 hours

    Lessons learned

    What went well

    • Team escalated effectively with datacenter provider despite Thanksgiving holiday

    • Maintained clear communication with provider throughout incident

    • Incident response prompted the design and implementation of a GitHub proxy as a DR mechanism

    What could have gone better

    • Took too long to identify where the degradation was and subsequently which ISP in the blend was degraded

    • Took too long to realize that failures were region-specific and were not happening in our EU region

    • No synthetic monitoring to detect GitHub connectivity issues before customer reports

    • Limited observability into ISP health within the datacenter provider's blend

    • Investigations focused only on outbound traffic, instead of also considering inbound traffic congestion

    Action items

    • Build a transparent GitHub proxy through AWS as a DR mechanism (a rough sketch of the idea follows this list)

    • Work with datacenter provider to set up synthetic probes for GitHub connectivity

    • Establish faster time-to-resolution process once degradation is detected

    • Work with the datacenter provider to align on a more transparent support escalation path

    • Get more observability into ISP blend health from datacenter provider

    • Follow up on the RCA from the upstream ISP (pending; the ISP is being reintroduced to the blend gradually)
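
    To make the proxy action item concrete, the sketch below shows one possible shape for such a fallback: a byte-level TCP relay deployed in AWS that runner traffic for github.com can be redirected to (for example via DNS or DNAT) when the datacenter's default path degrades. This is an illustration of the idea, not Blacksmith's actual design; the listen port and redirection mechanism are assumptions, and TLS still terminates at GitHub, so the relay never sees plaintext.

    # fallback_relay.py -- illustrative sketch of a transparent GitHub relay
    # running in AWS. It accepts TCP connections from the runner fleet and
    # splices them byte-for-byte to github.com:443, giving GitHub-bound
    # traffic a second network path without touching TLS.
    import asyncio

    LISTEN_PORT = 8443              # assumption: port exposed to the runner fleet
    UPSTREAM = ("github.com", 443)  # could equally be ghcr.io or api.github.com

    async def pump(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
        """Copy bytes in one direction until EOF, then close the write side."""
        try:
            while chunk := await reader.read(65536):
                writer.write(chunk)
                await writer.drain()
        finally:
            writer.close()

    async def handle(client_r: asyncio.StreamReader, client_w: asyncio.StreamWriter) -> None:
        """Splice one client connection to GitHub in both directions."""
        upstream_r, upstream_w = await asyncio.open_connection(*UPSTREAM)
        await asyncio.gather(
            pump(client_r, upstream_w),   # runner -> GitHub
            pump(upstream_r, client_w),   # GitHub -> runner
            return_exceptions=True,
        )

    async def main() -> None:
        server = await asyncio.start_server(handle, host="0.0.0.0", port=LISTEN_PORT)
        async with server:
            await server.serve_forever()

    if __name__ == "__main__":
        asyncio.run(main())

    A production version would also need per-hostname routing (api.github.com, ghcr.io), health checks, and capacity planning; the sketch only illustrates the data path.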

  • Resolved

    We are seeing complete recovery in our US region. We will share the previously mentioned postmortem with our customers in the coming days. Please reach out to us if you continue to see degraded interactions with GitHub.

  • Monitoring

    Our network provider has implemented a fix to remove an ISP that was affecting return traffic from GitHub's edge nodes. We are seeing recovery in our US region and will continue to monitor the situation.

    We sincerely apologize for the disruption in service. We will share a detailed postmortem covering what happened and what we're doing to prevent such an extended outage in the future.

  • Investigating

    We are exploring a proxy-based solution to circumvent the hops in our network that are resulting in degraded interactions with GitHub. We apologize for the extended nature of this outage and ask that you reach out through our support portal at https://app.blacksmith.sh/?support=open if your org is currently blocked so that we can provide you with a workaround.

  • Update

    Some evidence suggests that there's an issue with the upstream GitHub edge node we're hitting from our US region, while the edge node reached from our EU region doesn't show the same degradation. We're still monitoring.

  • Monitoring

    We realize this is affecting several customer workflows, and we are actively working on a resolution.

  • Identified

    We're working with our network provider to understand why 10-15% of requests to GitHub services are seeing 5+ second latency spikes. We are working to identify whether the issue is upstream or in our network stack.

  • Update

    We are continuing to look into this issue. It seems localized to one region, and we are working on identifying the source of the degradation. We will follow up with an update as soon as we know more.

  • Investigating

    We are currently seeing signs of upstream degradation causing some interactions with GitHub, such as checkouts and pushes and pulls to GHCR, to time out.

GitHub Actions Degraded Performance
  • Resolved
    This incident has been resolved.
  • Identified

    GitHub is reporting degraded performance for Actions. We are continuing to monitor their status page for updates and will update our status here.

    You can check GitHub's status page here: https://www.githubstatus.com/incidents/zs5ccnvqv64m

  • Monitoring

    GitHub has applied a mitigation and is seeing recovery. We are continuing to monitor.

GitHub reporting degraded availability for Git operations
  • Resolved

    Git operations are functioning normally. This incident has been resolved.

  • Monitoring

    GitHub's team has shipped a fix and is seeing recovery in some areas. We will continue to monitor.

  • Update

    GitHub has identified the likely cause of the incident and is working on a fix.

  • Identified

    GitHub is reporting failures for some Git HTTP operations.

    https://www.githubstatus.com/

  • Investigating

    We are seeing an increase in job failures. GitHub has declared an incident indicating that Git operations are experiencing degraded availability, affecting some jobs. We are currently investigating.

Docker pushes are experiencing slower push times in the EU region
  • Resolved

    This incident has been resolved. We've determined it was due to an upstream network degradation in our EU region. While we continue working to provide a root cause, our internal signals indicate the issue has passed.

    If you are experiencing long Docker push times and are pushing to US-based registries, please reach out to us using the support portal in your dashboard or at support@blacksmith.sh so we can unblock you.

  • Update

    We're continuing to investigate this issue.

    If you are experiencing long Docker push times and are pushing to US-based registries, please reach out to us using the support portal in your dashboard or at support@blacksmith.sh so we can unblock you.

  • Identified

    We are continuing to investigate slow Docker push times in our EU region.

  • Investigating

    We are currently investigating this incident and looking into signs of wider network degradation.

Oct 2025

GitHub reporting degraded performance for Actions
  • Resolved
    This incident has been resolved.
  • Update

    GitHub and Azure have indicated they are seeing improvement after applying fixes. We are continuing to monitor.

  • Monitoring

    We are continuing to monitor both GitHub's and Azure's declared incidents and will update here once we have more information.

  • Identified

    We are seeing degraded performance for customers using Azure due to an ongoing incident with their services. We are monitoring both of these issues and will update here as we learn more.

    GitHub incident:
    https://www.githubstatus.com/incidents/4jxdz4m769gy

    Azure outage thread on Hacker News:
    https://news.ycombinator.com/item?id=45748756

  • Investigating

    We are monitoring GitHub's declared incident related to Actions.

Increased queue times in the US region
  • Resolved

    This incident has been resolved and queue times are back to normal. The longer queue times were caused by a combination of an outage with our upstream DB provider and a need to scale up our fleet to absorb higher-than-normal traffic. We are working with our DB provider to prevent such an outage in the future and have scaled up our fleet to prevent similar saturation.

    We sincerely apologize for the prolonged queueing that our customers saw today. We take such a prolonged outage very seriously and will work on hardening our systems based on our discoveries from today. If you still see any queued jobs, we recommend canceling and re-triggering them to see normal adoption.

  • Update

    Queue times are stabilizing and we're continuing to monitor.

  • Update

    We are monitoring increased queue times related to an incident with our database provider and are working towards a fix.

  • Update

    We are continuing to monitor increased queue times.

  • Update

    We are continuing to monitor increasing queue times. Our team is working on mitigations to bring times back down to normal.

  • Monitoring

    Increased queue times continue, but we are seeing improvement and are monitoring the situation.

  • Update

    We are continuing to investigate increased queue times in the US and are working to resolve this.

  • Update

    We are continuing to investigate increased queue times in the US.

  • Investigating

    We have been notified of extended queue times in the US region and are looking into this.
