Not too long ago, Gergely Orosz pointed out on Twitter that Slack reported 100% uptime since 2022, but he didn’t think that was the case. So who was right, Orosz or Slack?
We run functional monitoring against SaaS products like Slack to detect downtime in real time. So we set out to see whether Slack really did have 100% uptime.
Let’s start with a little context on the conversation.
That Tweet That Started It All
The whole conversation started with this Twitter thread which can be found here.
Our Data On Slack’s Reliability Performance
Here’s what our data found. We mainly monitor core functionality like sending messages. And an important thing to remember is that this is data from our experience with Slack (more on this later).
In Feb 2022, Slack had a pretty notable widespread outage, and we could see it in our data.
But what about after that?
For the rest of the year, we didn’t see much major downtime. There were occasional sporadic errors, but never enough for Metrist to consider them part of an outage.
So we can see that Slack was pretty darn reliable – at least for core functionality – and at least as far as we could see it. But it wasn’t perfect.
So is 100% uptime really valid? Let’s return to the Orosz conversation.
Orosz’s Key Points
Some of the key points that emerged from the Twitter thread can help us further determine who is right in this conversation.
- Slack’s pretty narrow definition of downtime. Perhaps Slack’s definition of downtime is too narrow since it only applies to major, global outages. But its users genuinely experienced downtime – as evidenced by Orosz’s experience, Twitter, our data, and even Slack’s status page details.
- Non-global downtime is still downtime – right? From the definition and from experience, we can see that not everyone experiences downtime the same way. We can even see that in the differences between Orosz’s experience and ours. We didn’t experience downtime when he did, and vice versa.
- Seriously, why say 100%? Orosz questioned why Slack said 100% in the first place. Even if Slack has a narrow definition of downtime, and even if the reliability of their core features is pretty rock solid – why say 100%? How does that benefit them?
Perhaps the last point is the stickiest one for this conversation.
Maybe Just Say 99.999999%
Slack had pretty solid uptime for its core features from Feb 2022 to Feb 2023. At least as far as we saw it. However, everyone experiences downtime differently. Some people experience downtime while others don’t. So if an incident isn’t global, does it not count as reportable downtime?
Perhaps the issue at hand here is the simple claim of 100%. We can see from Slack’s own status page that they report incidents. However, they don’t include those non-global incidents in their calculation of downtime.
But is that really fair?
Maybe the best thing is for Slack to simply say they had 99.9999% or 99.99999999999% uptime – or whatever makes sense for them. Nobody even expects 100% uptime – and it can be frustrating or even feel like gaslighting when a company says it has 100% uptime and you didn’t experience it that way.
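To put those percentages in perspective, here is a minimal sketch of the arithmetic behind "nines." The function name and the 365-day year are our own assumptions for illustration, not anything Slack publishes.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day year, ignoring leap years


def allowed_downtime_seconds(uptime_percent: float,
                             period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Return the downtime (in seconds) implied by an uptime percentage over a period."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_seconds * (100 - uptime_percent) / 100


# Illustrative uptime claims, from "three nines" up to a flat 100%.
for pct in (99.9, 99.99, 99.9999, 100.0):
    seconds = allowed_downtime_seconds(pct)
    print(f"{pct}% uptime leaves room for about {seconds:.1f} seconds of downtime per year")
```

Under these assumptions, 99.9999% still leaves room for roughly half a minute of downtime a year, while 100% leaves room for none at all. That is why even a single short incident is enough to make a 100% claim feel off.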
So Who Was Right?
All in all, both Slack and Orosz were correct (or incorrect) in their assertions.
- Slack was correct in that they had solid uptime for their core features over that time period. And according to their definition of uptime, it appears that they had “100% uptime.”
- Orosz was also correct in that Slack did have downtime, even if it wasn’t global. And perhaps Slack’s definition of downtime is a little too narrow and unhelpful (maybe even misleading).
The definition certainly hasn’t kept up with the times: in complex systems, different customers experience different downtime, and the definition doesn’t account for that. It is also perhaps outdated in that customers don’t expect 100% uptime and would rather see transparency from their vendors.
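To make the effect of the definition concrete, here is a minimal sketch that computes uptime from the same incident list under two definitions: one that counts only global outages and one that counts every incident. The incident durations, the 30-day period, and all names below are made up for illustration. This is not Slack’s actual calculation or Metrist’s data.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    minutes: float   # how long the incident lasted
    is_global: bool  # True if it affected every customer


def uptime_percent(incidents, period_minutes, global_only):
    """Uptime over the period, optionally counting only global incidents as downtime."""
    downtime = sum(i.minutes for i in incidents if i.is_global or not global_only)
    return 100 * (1 - downtime / period_minutes)


# Hypothetical month: two partial incidents, no global outage.
PERIOD_MINUTES = 30 * 24 * 60
incidents = [Incident(minutes=45, is_global=False),
             Incident(minutes=20, is_global=False)]

narrow = uptime_percent(incidents, PERIOD_MINUTES, global_only=True)
broad = uptime_percent(incidents, PERIOD_MINUTES, global_only=False)
print(f"Global outages only: {narrow:.4f}%")
print(f"All incidents:       {broad:.4f}%")
```

With these invented numbers, the narrow definition reports a perfect month while the broad one does not. Note that the broad version is a simplification too: it treats every partial incident as downtime for everyone, which overstates things for customers who were never affected.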
How Does This Apply to You?
So what’s the point of this examination? How can it apply to you and your software’s reliability? It’s important to remember that:
- Definitions of downtime can be narrow
- Customers experience downtime differently
- Being transparent about downtime is a good practice for customer satisfaction. (In other words, don’t over-sell and under-deliver on reliability just for appearances since it can backfire.)
What do you think about this conversation? We’d love to hear from you.
And if you have questions about how you can get personalized data on your SaaS dependencies like Slack, or on other cloud dependencies like AWS or Azure, contact us to try out Metrist.