As I write this, the internet is recovering from a second major cloud infrastructure outage in just eight days. It wasn’t just VMs and blob storage that were down: our doorbells didn’t work, our streaming platforms were paused, and our Slack workspaces went silent. That was the impact on the typical consumer, but if you are a software engineer on call, or an SRE who keeps platforms running, your experience was worse.
You started seeing problems with your software and systems, so you began to investigate. You went to the vendor’s status page and everything reported as fine. You continued to investigate. You asked your friends on Twitter and in your Slack groups whether they were having problems, too. Now you knew that it was not just you, but you still had questions, and without answers, you couldn’t take action to mitigate the impact.
- Is it just a problem with logging into the console, and everything else is running just fine?
- If products are down, which ones are impacted? Are these the products you use?
- Is the issue isolated to one region, a region you care about, or is the problem more widespread?
- How long should you expect this issue to last?
- If you failed over to another region, could you be confident that the new region would remain healthy?
- What other products and services that you rely on might be impacted by this outage?
- Were there any signs that this was coming? Did you miss a breaking change announcement from a software vendor?
- Are there alternative vendors you could consider for better reliability?
While you asked these questions and couldn’t get timely or factual answers, more than 60 minutes passed. Then, the vendor finally updated their status page, confirming some of what you and your colleagues had guessed, but were unable to prove.
This story would be compelling with just one major outage, yet there were two. They weren’t isolated, either. In the same timespan, the web’s most-used communications API went down, as did a popular CI/CD tool and a leading user identity platform.
The reality of modern software development
Today, apps are built on apps. Over the past year and a half, my co-founder and I have conducted more than 125 interviews with software developers and technology leaders. We learned that the typical web property relies on about 50 cloud-hosted third parties in real time, with most applications at scale using 75 to 100 IaaS, SaaS, and API products. Everything from compute, to databases, to message queues, to user authentication, to credit card processing, to shipping labels, to maps, to email delivery, and recommendation engines. There is a cloud-hosted API for just about everything.
When one of these products goes down, the software that depends on it risks going down, too. One outage from one provider isn’t too bad, but 99.9% uptime still allows each provider about 526 minutes of downtime every year. Across 50 providers whose outages don’t overlap, the app that depends on all of them could be exposed to more than 400 hours of downtime a year. When that downtime happens, software developers are left scrambling for answers, and action is delayed. Status pages are not updated quickly, if at all, and when they are, the information is written for anyone and everyone, not for the visitor who just needs to know whether they are impacted, and how. Observability tools, like APM, can spot an issue quickly, but they don’t answer the question “is it us or is it them?” While we seek answers, impact worsens.
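The uptime arithmetic above can be checked with a short sketch. It rests on assumptions that are not in the original text: a 365-day year, every provider being critical to the app, and outages that are statistically independent (real incidents often are not). The function names are hypothetical and exist only for this illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes(availability: float) -> float:
    """Yearly downtime, in minutes, permitted by an availability between 0 and 1."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def combined_availability(per_provider: float, providers: int) -> float:
    """Availability of an app that needs every provider up at the same time.

    Assumption: outages are independent, so availabilities multiply.
    """
    return per_provider ** providers


# Inputs taken from the figures in the text: 50 providers at 99.9% each.
single = downtime_minutes(0.999)
combined = downtime_minutes(combined_availability(0.999, 50))
worst_case = 50 * single  # upper bound: no two outages ever overlap

print(f"one provider:   {single:.0f} minutes/year")
print(f"50 independent: {combined:.0f} minutes/year")
print(f"50, no overlap: {worst_case:.0f} minutes/year")
```

Under these assumptions, roughly 526 minutes is the yearly budget for a single provider at 99.9%. For an app that needs all 50 providers, the exposure is on the order of 25,000 minutes a year, or more than 17 days.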
After spending much of our careers in the observability and incident management space, my co-founder and I created Metrist. Formerly known as Canary Monitoring, Metrist is on a mission to make the world more reliable, by creating visibility into the availability, functionality, and performance of the web’s most built-upon products.
Observability is about more than the metrics, logs, and traces of first- and second-party software. True observability means real-time, accurate visibility into third-party dependencies. Cloud vendors write the rules on SLAs, and until now, they controlled the data that defines compliance. Customers deserve more, and we are here to deliver it.
We are Metrist.