Identifying the source of an outage is the name of the game when it comes to incident response and observability. But what if you were blind to the source of up to 70% of those outages?
That’s the case for most companies because it’s difficult to get visibility into a fundamental part of today’s apps: cloud dependencies like AWS, Twilio, and GitHub. Amazingly, the overlooked culprit behind up to 70% of SaaS outages is the set of apps they are built upon, and having observability for those apps can be a game-changer for reliability.
Let’s take a look at how cloud dependencies account for such a large proportion of outages, even when each one offers 99.99% reliability, and how visibility into those dependencies can vastly improve our own reliability.
“Is It Me Or Them?” The Quasi-Panicked Conundrum Faced with Cloud Dependency Outages
As we know, modern apps are built on other apps. In fact, the average digital business relies on 137 different cloud services to power its software and run the company. That’s everything from AWS to Twilio, GitHub, Stripe, Zoom, Snowflake, Slack, Avalara, and more.
When one of these products goes down, we risk going down too, or at the very least degrading the user experience. But visibility into cloud dependencies is not an automatic aspect of observability software and practices, and vendors make visibility into their products difficult and vague.
So then during an outage we are often sent scrambling to answer the question “is it me, or is it them?!”
Manual Status Page Updates and Low Visibility
The problem gets worse when we suspect a cloud dependency, because status pages mislead us: they are updated manually, and it can take 20 minutes or more before they reflect an outage. As a result, incident response is often delayed until we can verify our suspicions 20, 30, or even 60 minutes later.
But we’ve built our apps on upstream dependencies that promise 99.9% uptime – doesn’t that mean we should have similar reliability? Not really – in fact, using multiple dependencies – even with high reliability – can make your reliability worse. Let’s discuss how.
My Dependencies Have 99.9% Reliability – How Could They Be The Culprit?
It may seem difficult to believe that cloud dependencies have such a high impact on our reliability, especially if they promise 99.99% (or more) uptime in their SLAs. But when we use so many dependencies together, their combined odds of failure determine how reliable our app actually is.
And multiple dependencies, even highly reliable ones, actually make the overall reliability worse. Why is that?
Too Many SLAs in the Kitchen Makes Reliability Worse
Let’s assume that each of those cloud dependencies promises 4 nines of availability. That’s wildly good: fewer than 5 minutes of downtime per month! So our hypothetical app should have wildly good reliability, too… right? Not so much.
Even if we’re using cloud dependencies with 99.99% uptime, when we start to combine those services, that percentage gets worse. And it’s because of math. (Boo!) Here’s the reasoning:
When you have multiple services and SLAs, you need to move beyond the single SLAs to composite SLAs, that is, the SLA of a service that has dependencies with their own SLAs. This figure reflects the actual reliability of your software as it uses those dependencies.
Unfortunately, determining a composite SLA isn’t as simple as averaging the uptimes together or taking the lowest of them all. Instead, to find the composite SLA of a system that needs every one of its dependencies to be up, we multiply the SLAs of the services together, and the math is not so nice. For example, an app with 5 dependencies, all with four nines of availability, can only offer a 99.95% SLA itself (0.9999 multiplied by itself five times is roughly 0.9995).
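To make the multiplication concrete, here is a minimal Python sketch of the composite SLA arithmetic. It assumes every dependency is in the critical path and that failures are independent, which real systems only approximate. The function names are hypothetical, chosen for this illustration.

```python
import math

def composite_availability(availabilities):
    """Multiply individual availabilities, given as fractions such as 0.9999."""
    return math.prod(availabilities)

def monthly_downtime_minutes(availability, days=30):
    """Downtime budget per month implied by an availability fraction."""
    return (1.0 - availability) * days * 24 * 60

four_nines = 0.9999
composite = composite_availability([four_nines] * 5)  # five dependencies, as in the example

print(f"composite availability: {composite:.4%}")  # 99.9500%
print(f"one dependency: {monthly_downtime_minutes(four_nines):.1f} min/month")  # 4.3
print(f"all five together: {monthly_downtime_minutes(composite):.1f} min/month")  # 21.6
```

In other words, five dependencies that each allow about 4 minutes of downtime a month add up to a budget of roughly 21 minutes a month for the app that sits on top of them.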
The Importance of Assessing Cloud Dependency Reliability
Now, I don’t want to be an alarmist. Things are not as bad as they may seem, but as we can see it’s still incredibly important to address the issues at hand.
First, not all of our cloud dependencies are in the critical path. For many of them, if they go down, we can remain up. And, in reality, some of our cloud dependencies will deliver better than what their SLA promises.
But I firmly believe that increasing our understanding and visibility into one of the top causes of outages is our best way to improve our incident response. It can help us understand what the risks are and where we are vulnerable instead of being blind to impact from dependencies.
It can help us respond to issues with full context about everything happening in the stack, rather than only having information about the systems we own. And it can help us hold our cloud vendors more accountable. We aren’t going to stop building on them, but we can make sure we are working with providers that support our SLAs rather than standing in the way of them.
How You Can Start Getting Observability Into Cloud Dependencies
If you are ready to start addressing this layer of the stack, here is how I recommend getting started. First, understand where your vulnerabilities are by taking an inventory of every cloud dependency in your stack and noting the SLA each one offers.
Focus on the services in the critical path, but remember that can be anything from a dozen AWS services to a recommendation engine plugin, an API that aggregates shipping providers, or a service that converts a fax to an email. Then, treat these services just like you do the network, infrastructure, and application layers of your stack, by creating direct observability into the performance, availability, and functionality of the third-party apps you depend on.
Gaining this sort of visibility can be done, but it might mean setting up canaries in your backend to test endpoints, pointing a popular synthetic monitoring product outward at the third parties you depend on the most, or even adding a new monitoring product to your toolchain alongside your observability platform of choice (such as Metrist!).
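As a rough illustration of the inventory step, the sketch below records each dependency with its promised SLA and a critical-path flag, then reports the weakest critical dependency and the composite SLA of the critical path. The dependency names, SLA values, and flags are invented placeholders, not real vendor figures, and the helper names are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class Dependency:
    name: str
    sla: float           # promised availability as a fraction
    critical_path: bool  # True if its outage takes us down or degrades us

# Placeholder inventory: every name and number here is made up for illustration.
inventory = [
    Dependency("cloud-compute", 0.9999, True),
    Dependency("payments-api", 0.9999, True),
    Dependency("sms-provider", 0.9995, True),
    Dependency("team-chat", 0.999, False),
]

def critical_dependencies(deps):
    """Keep only the dependencies whose failure affects our users."""
    return [d for d in deps if d.critical_path]

def composite_sla(deps):
    """Multiply SLAs together, assuming independent failures."""
    return math.prod(d.sla for d in deps)

# Sort ascending so the least reliable critical dependency comes first.
critical = sorted(critical_dependencies(inventory), key=lambda d: d.sla)
print("weakest critical dependency:", critical[0].name)
print(f"composite SLA of critical path: {composite_sla(critical):.4%}")
```

Even a simple list like this shows which vendor sets the ceiling on your own SLA and which services can be left out of the calculation.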
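For a sense of what a backend canary might look like, here is a minimal sketch using only the Python standard library. It is one possible shape, not a description of how any particular monitoring product works. The URL, the 2-second slowness threshold, and the up/degraded/down rules are assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request

def http_fetch(url, timeout):
    """Issue a GET request and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status

def probe(url, fetch=http_fetch, timeout=5.0, slow_after=2.0):
    """Classify one check of a third-party endpoint as up, degraded, or down."""
    started = time.monotonic()
    try:
        status = fetch(url, timeout)
    except OSError:
        # URLError, timeouts, and connection failures are all OSError subclasses.
        return {"url": url, "state": "down", "status": None,
                "seconds": time.monotonic() - started}
    elapsed = time.monotonic() - started
    if status >= 500:
        state = "down"
    elif status >= 400 or elapsed > slow_after:
        state = "degraded"  # assumed rule: client errors or slow replies count as degraded
    else:
        state = "up"
    return {"url": url, "state": state, "status": status, "seconds": elapsed}

# Stubbed fetch so the sketch runs without network access; the URL is hypothetical.
result = probe("https://vendor.example/health", fetch=lambda url, timeout: 503)
print(result["state"])  # prints: down
```

Running a probe like this on a schedule against the endpoints you actually call gives you your own signal for answering “is it me, or is it them?” without waiting on a status page.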
And if you’d like to have better observability for your cloud dependencies and get alerted about outages at least 25 minutes before status pages update, try out Metrist.io for free!