Blacksmith Managed Runners - Operational
Incremental Docker Builders - Operational
API - Operational
Website - Operational
Notice history
Dec 2025
No notices reported this month
Nov 2025
- Postmortem
Leadup
Blacksmith runs CI jobs for a large fraction of its customers on a fleet of bare metal machines in our US region. All GitHub operations (clones, API calls, container pulls from ghcr.io) that run as part of a CI job route through a blend of ISPs managed by our datacenter provider. Prior to this incident, we had very little visibility into this path to GitHub and no redundant route outside of the ISP blend.
Fault
One upstream ISP in the blend was consistently showing signs of congestion on the return traffic from one of GitHub's edge nodes. Since CI jobs often involve a significant amount of ingress from GitHub's endpoints, approximately 7-10% of application-level HTTP connections to GitHub experienced 5-20 second stalls, causing actions/checkout and other GitHub operations to time out. The same operations completed normally from other US regions and from our EU region. Job failure rate in our US region increased from a ~10% baseline to 20-30% during the incident.
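To make the failure mode concrete: the stalls only show up when you time complete HTTPS requests, which is why application-level curls surfaced the problem while pings and traceroutes looked clean. The hypothetical Python probe below (not the tooling we actually used; the endpoint, sample count, and 5-second stall threshold are all illustrative) measures that kind of tail.

```python
import time
import urllib.request

# Illustrative endpoint, sample count, and stall threshold; the real targets
# and thresholds during the incident differed.
URL = "https://api.github.com/zen"
SAMPLES = 50
STALL_SECONDS = 5.0

def timed_request(url: str, timeout: float = 30.0) -> float:
    """Time one full HTTPS request; return elapsed seconds (inf on error/timeout)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
    except OSError:
        return float("inf")
    return time.monotonic() - start

def main() -> None:
    durations = sorted(timed_request(URL) for _ in range(SAMPLES))
    stalled = sum(1 for d in durations if d >= STALL_SECONDS)
    print(f"fastest: {durations[0]:.2f}s  median: {durations[SAMPLES // 2]:.2f}s")
    print(f"stalled or failed (>= {STALL_SECONDS:.0f}s): {stalled}/{SAMPLES}")

if __name__ == "__main__":
    main()
```

A healthy path reports a stalled fraction near zero; during the incident, a probe like this run from the affected US region would have shown roughly the 7-10% figure above, while the same script run from the EU region stayed clean.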
Detection and Timeline
November 28, 2025:
8:00 AM ET: Sporadic customer reports of actions/checkout timing out on a small percentage of jobs. We also observed failures in our own verification workflows. Initial assumption was that GitHub would declare an upstream degradation, as seen in previous similar incidents.
3:00 PM ET: The low volume of jobs running on Thanksgiving day meant that we heard no more reports about this for several hours. This was when we first noticed our internal verification workflows sporadically failing to check out with Failed to connect to github.com port 443 after 134053 ms: Couldn't connect to server.
3:08 PM ET: Noticed probes in our EU region were completely healthy. Decided to reach out to our datacenter provider in the US.
3:27 PM ET: Datacenter team acknowledged and tried to get support involved.
3:38 PM ET: Datacenter team was provided with a reproduction where pings and traceroutes showed no issues but application level curls were showing variable latency and occasional timeouts.
4:13 PM ET: Datacenter network engineers joined a call with the Blacksmith team to experiment with switching one of the ISPs to see if error rates would go down.
5:53 PM ET: The ISP switch was reverted as no change was noticed in the failure rate.
8:37 PM ET: An instance was spun up in another US region offered by our datacenter. A similar test was run and did not show signs of degradation despite hitting the same GitHub edge node.
10:42 PM ET: Started scoping out a fallback GitHub proxy in AWS so that we could re-route GitHub-bound traffic through the AWS Direct Connect with GitHub (a rough sketch of this approach follows the timeline). This was based on our previous findings that the issue was somewhere in the path out of our datacenter.
November 29, 2025:
2:01 AM ET: Director of Networking at our datacenter reached out asking if we were still running into issues.
3:05 AM ET: Networking team received a report from another customer with a traceroute showing degradation in a hop on the return path. This was helpful as it gave them confidence that this was an ISP level degradation. The team started turning off ISPs in their blend one at a time.
4:40 AM ET: Datacenter team asked us to re-run stress tests and we saw immediate recovery. With the degraded ISP identified, it was removed from the blend. Job failure rates on Blacksmith returned to their baseline.
5:08 AM ET: A postmortem call between the datacenter team and Blacksmith was scheduled for 01/12/25, and the incident was closed.
8:36 AM ET: With recovery looking stable, orgs that had been rebalanced to the EU were moved back to the US.
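The fallback proxy scoped out above deserves a brief illustration. The idea behind a "transparent" proxy is a TCP passthrough that never terminates TLS, so clients still validate GitHub's real certificate; only the network path changes. The sketch below is a minimal illustration of that idea, not our actual implementation: it assumes runner hosts can be pointed at the proxy (for example via DNS or routing rules), that the proxy host sits on the alternate path (such as the AWS side of a Direct Connect), and that the proxy host itself resolves and reaches github.com normally. Names and ports are placeholders.

```python
import socket
import threading

# Hypothetical values: UPSTREAM is the real GitHub endpoint; LISTEN_PORT is where
# worker traffic gets redirected (e.g. via DNS or routing rules on the runner hosts).
UPSTREAM = ("github.com", 443)
LISTEN_PORT = 8443

def pipe(src: socket.socket, dst: socket.socket) -> None:
    """Copy bytes one way until either side closes, then shut both sockets down."""
    try:
        while data := src.recv(65536):
            dst.sendall(data)
    except OSError:
        pass
    finally:
        for s in (src, dst):
            try:
                s.shutdown(socket.SHUT_RDWR)
            except OSError:
                pass

def handle(client: socket.socket) -> None:
    """Splice one client connection onto a fresh connection toward GitHub."""
    try:
        upstream = socket.create_connection(UPSTREAM, timeout=10)
    except OSError:
        client.close()
        return
    threading.Thread(target=pipe, args=(client, upstream), daemon=True).start()
    pipe(upstream, client)
    # A production proxy would manage socket lifetimes, timeouts, and metrics more carefully.
    client.close()
    upstream.close()

def main() -> None:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("0.0.0.0", LISTEN_PORT))
        srv.listen(128)
        while True:
            client, _ = srv.accept()
            threading.Thread(target=handle, args=(client,), daemon=True).start()

if __name__ == "__main__":
    main()
```

Because TLS is passed through untouched, clients behave exactly as they would talking to GitHub directly; the proxy only buys an alternate route when the default ISP blend degrades.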
Root causes
Why did jobs fail? GitHub connections timed out after 133+ seconds.
Why did connections time out? HTTP request stalls on ~7-10% of connections to GitHub IPs.
Why were there TCP stalls? One ISP in our datacenter provider's blend had degraded routing to GitHub's edge nodes.
Why was the ISP routing degraded? RCA pending from ISP.
Why did this cause widespread impact? There was no redundant network path to GitHub outside the ISP blend; the entire fleet of worker nodes in the region was routing its traffic through the same set of upstream ISPs.
Mitigation and resolution
Immediate mitigation: Remove the degraded ISP from the blend. This was a whack-a-mole process due to lack of observability into which specific ISP was causing the issue. The datacenter provider had to disable ISPs one at a time, monitor for reproduction, and repeat until the culprit was identified and removed.
Customer workaround: Published a status page update advising affected customers to contact us for migration to the EU region. 20+ customers were temporarily migrated to the EU. All have since been rebalanced back to their original regions.
Duration: ~19 hours
Lessons learned
What went well
Team escalated effectively with datacenter provider despite Thanksgiving holiday
Maintained clear communication with provider throughout incident
Incident response prompted design and implementation of GitHub proxy as a DR mechanism
What could have gone better
Took too long to identify where the degradation was and subsequently which ISP in the blend was degraded
Took too long to realize that failures were region specific and were not happening in our EU region
No synthetic monitoring to detect GitHub connectivity issues before customer reports
Limited observability into ISP health within the datacenter provider's blend
Investigation initially focused only on outbound traffic, instead of also considering congestion on inbound (return path) traffic
Action items
Build transparent GitHub proxy through AWS as DR mechanism
Work with datacenter provider to set up synthetic probes for GitHub connectivity (see the sketch after this list)
Establish faster time-to-resolution process once degradation is detected
Work with the datacenter provider to align on a more transparent support escalation path
Get more observability into ISP blend health from datacenter provider
Follow up on RCA from upstream ISP (pending - ISP is being reintroduced to blend gradually)
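On the synthetic probe action item above, a minimal sketch of what such a probe could look like follows. Endpoints, interval, and latency budget are assumed placeholders, and in practice results would feed a metrics and alerting pipeline (run from each region, so a regional divergence like this one is visible immediately) rather than stdout.

```python
import time
import urllib.error
import urllib.request

# Hypothetical probe configuration; real endpoints, intervals, and thresholds
# would come out of the action items above.
ENDPOINTS = [
    "https://github.com/",
    "https://api.github.com/zen",
    "https://ghcr.io/v2/",  # returns 401 without auth, which still proves reachability
]
INTERVAL_SECONDS = 60
LATENCY_BUDGET_SECONDS = 5.0

def probe(url: str) -> tuple[bool, float]:
    """Return (healthy, elapsed seconds); any HTTP response within budget is healthy."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=LATENCY_BUDGET_SECONDS) as resp:
            resp.read()
    except urllib.error.HTTPError:
        pass  # an HTTP error status (e.g. 401 from ghcr.io) still means the path is up
    except OSError:
        return False, time.monotonic() - start
    return True, time.monotonic() - start

def main() -> None:
    while True:
        for url in ENDPOINTS:
            healthy, elapsed = probe(url)
            print(f"{'ok   ' if healthy else 'ALERT'} {url} {elapsed:.2f}s")
        time.sleep(INTERVAL_SECONDS)

if __name__ == "__main__":
    main()
```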
- Resolved
We are seeing complete recovery in our US region. We will be sharing the previously mentioned postmortem in the coming days with our customers. Please reach out to us if you continue to see degradations in interactions with GitHub.
- Monitoring
Our network provider has implemented a fix to remove an ISP that was affecting return traffic from GitHub's edge nodes. We are seeing recovery in our US region and will continue to monitor the situation.
We sincerely apologize for the disruption in service. A detailed postmortem will be shared about what happened here along with what we're doing to prevent such an extended outage in the future.
- Investigating
We are exploring a proxy-based solution to circumvent the hops in our network that are resulting in degraded interactions with GitHub. We apologize for the extended nature of this outage and ask that you reach out through our support portal at https://app.blacksmith.sh/?support=open if your org is currently blocked so that we can provide you with a workaround.
- Update
Some evidence suggests that there's an issue with the upstream GitHub edge node that we're hitting from our US region, while the edge node being hit from our EU region doesn't show the same degradation. We're still monitoring.
- Monitoring
We realize this is affecting several customer workflows, and we are actively working on a resolution.
- Identified
We're working with our network provider to understand why 10-15% of requests to GitHub services are failing or seeing 5+ second latency spikes. We're working to identify whether the issue is upstream or in our network stack.
- Update
We are continuing to look into this issue. It seems localized to one region, and we are working on identifying the source of the degradation. We will follow up with an update as soon as we know more.
- Investigating
We are currently seeing signs of upstream degradation causing some interactions with GitHub, such as checkouts or pushes and pulls to GHCR, to time out.
- Resolved
This incident has been resolved.
- Identified
GitHub is reporting degraded performance for Actions. We are continuing to monitor their status page for updates and will update our status here.
You can check GitHub's status page here: https://www.githubstatus.com/incidents/zs5ccnvqv64m
- Monitoring
GitHub has applied a mitigation and is seeing recovery. We are continuing to monitor.
- Resolved
Git operations are functioning normally. This incident has been resolved.
- Monitoring
GitHub's team has shipped a fix and is seeing recovery in some areas. We will continue to monitor.
- Update
GitHub has identified the likely cause of the incident and is working on a fix.
- Identified
GitHub is reporting failures for some Git HTTP operations.
https://www.githubstatus.com/
- Investigating
We are seeing an increase in job failures. GitHub has declared an incident indicating Git operations are experiencing degraded availability affecting some jobs. We are currently investigating this incident.
- Resolved
This incident has been resolved. We've determined this incident to be due to an upstream network degradation in our EU region. While we continue to work to provide a root cause, our internal signals indicate the issue has passed.
If you are experiencing long Docker push times and are pushing to US-based registries, please reach out to us using the support portal in your dashboard or at support@blacksmith.sh so we can unblock you.
- Update
We're continuing to investigate this issue.
If you are experiencing long Docker push times and are pushing to US-based registries, please reach out to us using the support portal in your dashboard or at support@blacksmith.sh so we can unblock you.
- Identified
We are continuing to investigate slow Docker push times in our EU region.
- Investigating
We are currently investigating this incident and looking into signs of wider network degradation.
Oct 2025
- Resolved
This incident has been resolved.
- Update
GitHub and Azure have indicated they are seeing improvement after applying fixes. We are continuing to monitor.
- Monitoring
We are continuing to monitor both GitHub's and Azure's declared incidents and will update here once we have more information.
- Identified
We are seeing degraded performance for customers using Azure due to an ongoing incident with their services. We are monitoring both the GitHub and Azure incidents and will update here as we hear more information.
GitHub incident: https://www.githubstatus.com/incidents/4jxdz4m769gy
Azure outage thread on Hacker News: https://news.ycombinator.com/item?id=45748756
- Investigating
We are monitoring GitHub's declared incident related to Actions.