In today’s world, apps are commonly built on top of other apps. Using third-party cloud, API, and SaaS products can help us deliver more and better features, faster. As cloud dependency increases, it becomes crucial for users to receive timely updates when an outage occurs. A recent study by Metrist examines how quickly the status pages of third-party services are updated during a cloud outage, and the results aren’t pretty.
Metrist’s Research on Status Page Updates During Cloud Outages
According to Metrist’s research, the average time it takes for a SaaS status page to be updated during a cloud outage is 29 minutes. Metrist arrived at this figure by continuously monitoring the functionality of popular cloud services and, for each outage, comparing the timestamp at which its monitoring first detected the problem with the timestamp of the vendor’s first status page update.
The study analyzed 25 cloud outages from 14 popular cloud products: Asana, AWS EC2, Azure Test Plans, CircleCI, Datadog, EasyPost, Fastly, GitHub, HubSpot, npm, PagerDuty, Sentry, Trello, and Zoom. These outages occurred between October 2022 and March 2023. The median update time was 16 minutes, while the minimum was 2 minutes (nice work, Trello) and the maximum was 121 minutes (what happened, PagerDuty?).
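To make the methodology concrete, here is a minimal Python sketch of how an update delay and its summary statistics could be computed. It is not Metrist’s actual tooling: the function names are hypothetical, and the sample timestamps are invented for illustration and are not data from the study.

```python
from datetime import datetime
from statistics import mean, median

# Hypothetical sample records, NOT data from the Metrist study.
# Each pair is (outage first detected by monitoring,
#               first status page update from the vendor), in ISO 8601.
SAMPLE_INCIDENTS = [
    ("2023-01-10T14:00:00+00:00", "2023-01-10T14:02:00+00:00"),
    ("2023-01-18T09:30:00+00:00", "2023-01-18T09:46:00+00:00"),
    ("2023-02-03T22:15:00+00:00", "2023-02-03T23:00:00+00:00"),
]


def update_delay_minutes(detected_at: str, status_updated_at: str) -> float:
    """Minutes between outage detection and the first status page update."""
    detected = datetime.fromisoformat(detected_at)
    updated = datetime.fromisoformat(status_updated_at)
    return (updated - detected).total_seconds() / 60


def summarize(incidents):
    """Mean, median, minimum, and maximum update delay, in minutes."""
    delays = [update_delay_minutes(d, u) for d, u in incidents]
    return {
        "mean": mean(delays),
        "median": median(delays),
        "min": min(delays),
        "max": max(delays),
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_INCIDENTS))
```

A few very slow updates pull the mean well above the median, which is consistent with the gap between the study’s 29-minute average and 16-minute median.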
It’s important to note that an outage only enters this analysis when one of the 70 services Metrist monitors experiences a significant outage and the vendor also updates its status page. Interestingly, the majority of outages detected by Metrist never appear on the third-party cloud vendor’s status page. As we’ve covered elsewhere, cloud vendors tend to have high thresholds before they’ll update a status page, with significant hoops that need to be jumped through internally.
Challenges in Updating Status Pages Quickly During Cloud Outages
Companies face several challenges when updating their status pages promptly during a cloud outage:
- Identifying the issue: Determining the root cause of an outage may take time, which delays the process of informing users. The vendor may not even know they have an issue until well after users begin to be impacted.
- Coordination between teams: Accurate and up-to-date information on the status page requires collaboration between engineering and support teams.
- Prioritizing issue resolution: Companies might focus on fixing the issue before updating the status page to minimize downtime for users.
- Internal communication delays: In larger organizations, inefficient communication channels between different teams and strict external communication policies can result in delays in updating the status page.
The Importance of Observability and Monitoring for Timely Status Page Updates
Providing timely status page updates during a cloud outage is crucial for several reasons, and observability can play a key role in this process:
- Maintaining trust: Accurate and timely information fosters trust between the service provider and users, showing that the company is actively working to resolve the issue.
- Reducing support inquiries: Informed users are less likely to contact support teams, decreasing the overall workload during an outage.
- Managing expectations: Users can plan and adjust their workflows while waiting for the issue to be resolved, thanks to timely status page updates.
- Enhancing customer satisfaction: Faster status page updates help downstream customers pinpoint the source of an incident sooner, lowering their own mean time to recovery (MTTR).
An average of 29 minutes to update a status page is simply unacceptable. By addressing the challenges that slow status page updates and investing in observability and monitoring, cloud, API, and SaaS vendors can help their customers reduce downtime and can earn greater customer trust.