A power outage in a single availability zone within the AWS US East 2 region caused a widespread outage lasting about 2 hours (10 AM to 12:16 PM PT) on Thursday, July 28th, 2022.
Metrist monitors detected the outage around 10 AM PT, soon after the AWS US West 2 region recovered from a 3+ hour period of high latency. About 2 hours later, at 12:16 PM PT, Metrist alerted customers that the incident was completely resolved.
AWS updated its status page at 10:11 AM PT, stating that a power outage at 9:57 AM PT was the source of the outage. After a few updates throughout the 2-hour period, AWS announced at 12:45 PM PT that all of the instances and volumes had recovered.
Services Affected
Several AWS services were affected by the outage. Metrist monitors detected downtime for:
- AWS CloudFront
- AWS CloudWatch
- AWS EC2
- AWS EKS
- AWS Lambda
- AWS RDS
- AWS SES
- AWS SQS
While US East 2 is less popular than the long-favored but less reliable US East 1, a number of big-name SaaS companies were down or degraded as a result of the AWS outage, including:
- Auth0
- Zoom
- Zendesk
- Okta
- WebEx
- New Relic
- SmartThings
Timeline of Events
The following is a timeline of events, according to Metrist and AWS reporting. (Times are in PT.)
Prelude
- 9:44 AM. Metrist monitors show that the AWS US West 2 region recovers from a 3+ hour period of high latency that impacted multiple AWS products.
- 9:58 AM. Metrist spots the first signs of trouble in US East 2 when our monitoring sees the first error with AWS EKS.
Beginning
- 10:00 AM. Metrist monitoring of Fastly, HubSpot, Datadog, Zoom, Cloudflare, Zendesk, Jira, Google Calendar, and Bambora from AWS US East 2 begins to fail. All other Metrist instances continue to test these services successfully.
- 10:12 AM. Metrist declares that AWS EC2 is completely down in US East 2, with the first error appearing at 10:10 AM PT.
- 10:13 AM. AWS updates its status page with an “Informational” message saying that it began investigating a network connectivity issue impacting EC2 APIs for some users at 10:11 AM PT.
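The 10:00 AM entry above relies on a simple piece of reasoning: when many unrelated services fail from one monitoring location but keep passing from every other location, the fault most likely sits with that location rather than with the services. The sketch below illustrates that reasoning only. It is not Metrist's implementation; the function name `suspect_regions`, the `min_services` threshold, the region names other than us-east-2, and the sample results are all hypothetical.

```python
from typing import Dict, List


def suspect_regions(checks: Dict[str, Dict[str, bool]], min_services: int = 3) -> List[str]:
    """Return monitoring regions that look like the source of the failures.

    `checks` maps a service name to {monitoring_region: check_passed}.
    A region is suspect when at least `min_services` services fail from it
    while each of those services still passes from some other region.
    """
    counts: Dict[str, int] = {}
    for results in checks.values():
        passed = list(results.values())
        # Healthy everywhere, or down everywhere: not a region-specific signal.
        if all(passed) or not any(passed):
            continue
        for region, ok in results.items():
            if not ok:
                counts[region] = counts.get(region, 0) + 1
    return sorted(region for region, n in counts.items() if n >= min_services)


# Illustrative data only (not actual Metrist results).
sample = {
    "Fastly": {"us-east-2": False, "us-west-2": True, "eu-west-1": True},
    "HubSpot": {"us-east-2": False, "us-west-2": True, "eu-west-1": True},
    "Datadog": {"us-east-2": False, "us-west-2": True, "eu-west-1": True},
    "Zoom": {"us-east-2": False, "us-west-2": True, "eu-west-1": True},
}
print(suspect_regions(sample))  # ['us-east-2']
```

The threshold guards against blaming a region for a single flaky check. A real monitor would presumably also weigh retries and timing, which this sketch ignores.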
Recovery
- 10:18 AM. Metrist shows the first signs of recovery as AWS EC2 begins to come back online.
- 10:31 AM. AWS EC2 is operational again, but highly latent.
General Resolution
- 11:28 AM. Metrist reports that all AWS services, except AWS EKS, have fully recovered in US East 2.
- 11:45 AM. AWS’s status page reports that “the vast majority of EC2 instances” had recovered as of 11:37 AM.
Complete Resolution
- 12:16 PM. Metrist determines that AWS EKS has finally recovered in full and the AWS US East 2 incident on July 28th, 2022 is over.
- 12:45 PM. AWS’s status page reports, “The issue has been resolved and the service is operating normally.”
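The intervals implied by this timeline can be checked with a little arithmetic. The sketch below uses only timestamps listed above; the helper names `to_minutes` and `minutes_between` are ours, and which pair of timestamps best represents "detection lead time" is a judgment call, so two variants are shown.

```python
def to_minutes(clock: str) -> int:
    """Convert a 12-hour clock string such as '9:58 AM' to minutes after midnight."""
    time_part, meridiem = clock.split()
    hour, minute = map(int, time_part.split(":"))
    hour = hour % 12 + (12 if meridiem.upper() == "PM" else 0)
    return hour * 60 + minute


def minutes_between(start: str, end: str) -> int:
    """Minutes elapsed from start to end, both on the same day."""
    return to_minutes(end) - to_minutes(start)


# Timestamps taken from the timeline above (all PT, July 28th, 2022).
print(minutes_between("10:00 AM", "12:16 PM"))  # 136: Metrist-observed outage window
print(minutes_between("9:58 AM", "10:13 AM"))   # 15: first Metrist error to AWS status page update
print(minutes_between("10:00 AM", "10:11 AM"))  # 11: widespread failures to AWS's stated investigation start
print(minutes_between("12:16 PM", "12:45 PM"))  # 29: Metrist all-clear to AWS "resolved" message
```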
Questions Raised by the Outage
Metrist data showed that the AWS US West 2 region experienced more than three hours of high latency, which went unreported, in the period just before the US East 2 region went down. Since outages are usually due to human error, and events like these are rare, the coincidence raises the question of whether the two events were related.
Additionally, the widespread nature of the outage led some developers to question whether only one availability zone was affected. However, the interdependencies of the different regions may have led to the expanded “blast radius” of the outage.
Conclusions From the Incident
While this outage was relatively “isolated” (to one availability zone), the effects of the 2-hour outage were widespread. AWS was relatively prompt in updating its status page. However, Metrist users knew about the outage about 10 minutes before Amazon updated its status page.
If you would like to be notified about AWS outages as they happen – without having to check a status page or social media – try Metrist. This tool helps you identify the source of outages so you can have better observability and stay resilient.