Today, Jeff, the team, and I are excited to announce the general availability of Metrist. The product we’ve created is software with a web and Slack UI that helps you and your organization manage your cloud dependencies. We do this by running end-to-end functional tests against third-party APIs while also measuring performance directly from your apps with our installable agents.
This initial version of our product solves the most acute problems. It alerts your team reliably and quickly that a cloud dependency is experiencing issues, shows data to aid in troubleshooting, and offers ways to automate remediation.
The Backstory… Why Did We Make This?
Before Metrist, I was an early software developer at PagerDuty. During that time, one of the projects I worked on was ensuring the PagerDuty platform itself was working properly—and that included the third-party apps that it was built upon. As I helped to solve this problem, I witnessed the benefits of end-to-end testing of those third parties.
There were two such internal systems at PagerDuty during my seven years there:
- The first sent text messages to various phone numbers through different telephony providers and, through a custom app on the associated devices, recorded and reported how long it took to receive them.
- The second tested functionality of the entire PagerDuty platform through its APIs, alerting us if something was misbehaving.
The telemetry from these systems allowed us to choose third-party providers programmatically and detect issues more effectively.
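To make "choosing providers programmatically" concrete, here is a minimal sketch of how delivery telemetry could drive that choice. It is not the PagerDuty implementation: the provider names, the `pick_provider` function, and the 5% failure threshold are all assumptions made for illustration.

```python
from statistics import median


def pick_provider(samples, max_failure_rate=0.05):
    """Pick the provider with the lowest median delivery time.

    `samples` maps a (hypothetical) provider name to a list of delivery
    times in seconds, with None standing in for a message that never
    arrived. Providers whose failure rate exceeds `max_failure_rate`
    (an assumed threshold) are skipped. Returns None if nothing qualifies.
    """
    best, best_latency = None, None
    for provider, times in samples.items():
        if not times:
            continue  # no telemetry for this provider yet
        delivered = [t for t in times if t is not None]
        failure_rate = 1 - len(delivered) / len(times)
        if not delivered or failure_rate > max_failure_rate:
            continue  # too unreliable to consider
        latency = median(delivered)
        if best_latency is None or latency < best_latency:
            best, best_latency = provider, latency
    return best


# Made-up telemetry for three hypothetical providers.
telemetry = {
    "provider_a": [2.0, 3.0, 2.5],
    "provider_b": [1.0, None, None],  # fast, but most messages were lost
    "provider_c": [4.0, 4.5, 5.0],
}
chosen = pick_provider(telemetry)
```

Using the median rather than the mean keeps one unusually slow delivery from disqualifying an otherwise healthy provider.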
However, both of these systems were decommissioned because the time needed to maintain them had to go to other work. “We should have a whole team dedicated to this,” fellow developers lamented.
That’s precisely where we at Metrist come in.
What We Made
Jeff and I moved on from PagerDuty, but the lack of visibility into third parties continued to bother us. We felt compelled to team up and build a product to solve this problem.
We’ve since created a product that:
- Uses end-to-end functional testing and in-app monitoring to create unmatched visibility, especially when combined with anonymized, aggregated telemetry across our user base.
- Has a dashboard where you can see the statuses, performance, and reliability of all of your cloud vendors in one place. No more hunting down different status pages or vendor emails for notifications (if they update their status page or send a message at all!).
- Alerts you about outages 20 minutes ahead of status pages and other apps, on average. This is because our data is objective and not based solely on a manually updated, marketing-regulated status page.
Our product is important because companies are increasingly reliant on third parties (the average is 137!), but those third parties are responsible for 70% of downtime. What’s more, that downtime can cost an average of $300k per hour.
It’s hard to tell when those third parties are the cause of those outages, and even harder to tell why. But within the first couple of months of deploying Metrist, we began to see behind the curtain of some interesting and mysterious outages, in addition to the big, obvious ones.
Data Behind Some Interesting Outages
Operating for a little under two years now, from three cloud providers across 21 regions, we have identified:
- 7,000+ outages,
- 40,000+ instances of degraded service,
- Across 100 third parties.
We’ve started tracking some of these publicly in our series here.
Rather than focus on the massive outages, I’d like to highlight three nuanced ones that Metrist caught. Issues like these often don’t make the status page and aren’t talked about on social networks, but can be the most difficult to get to the bottom of when you are impacted by them.
Outage Example 1: Cache Invalidation at npm — February, 2021
Our npm package metadata retrieval tests failed in all five of our North American source regions. Subsequent tests succeeded only inconsistently, with the bulk of requests resulting in a 404 status code.
This was verified in the npm web UI when searching for a package—occasionally the package would be displayed, but a browser reload would result in a 404. It did not appear to matter from which region requests originated.
We worked directly with npm/GitHub to resolve the issue, which had to do with cache invalidation, after we brought it to their attention. The npm CLI seemed to be unaffected during this time.
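Intermittent failures like these are easy to miss with a single probe. The sketch below repeats a metadata request several times and classifies the outcome. It is an illustration only, not Metrist's actual check: the function names, the attempt count, and the choice of the public registry URL `https://registry.npmjs.org/<package>` are assumptions.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status code
    except urllib.error.URLError:
        return None  # DNS failure, refused connection, and similar


def classify(statuses):
    """Summarize repeated probes as 'up', 'degraded', or 'down'."""
    if not statuses:
        raise ValueError("need at least one probe result")
    ok = sum(1 for status in statuses if status == 200)
    if ok == len(statuses):
        return "up"
    if ok == 0:
        return "down"
    return "degraded"  # some requests succeed, some fail


def check_package(package, attempts=5, fetch=http_status):
    """Probe a package's registry metadata `attempts` times.

    `fetch` is injectable so the check can be exercised without network access.
    """
    url = f"https://registry.npmjs.org/{package}"
    statuses = [fetch(url) for _ in range(attempts)]
    return classify(statuses), statuses


# Simulated responses resembling the flapping 404s described above.
simulated = iter([200, 404, 404, 200, 404])
state, seen = check_package("left-pad", fetch=lambda url: next(simulated))
```

A one-shot check would have reported either "up" or "down" here depending on timing; repeating the probe surfaces the "degraded" state in between.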
Outage Example 2: Token Authentication Requirements for GitHub — August, 2021
Our GitHub tests were sporadically erroring with 503 status codes. The errors were uneven across source regions, with some experiencing them more than others. As with many outages we see of this nature, we sought confirmation on social media and in a #hangops Slack room.
It turns out that this was due to an incremental rollout of token-based authentication. At one point, our Canadian source regions saw updated endpoints 95% of the time, but the west coast only 5%. In addition to this, libgit2, or our particular use of it, couldn’t seem to handle the new authentication scheme, but our git CLI checks worked as expected.
Outage Example 3: An Unannounced Breaking Change with Azure Cosmos DB — September, 2022
For a more recent example of this kind of outage, see Nikko’s write-up on an unannounced Azure Cosmos DB API change.
This is Just the Beginning
These outages show the importance of running tests with multiple programming languages, SDKs, and CLIs, and from different vantage points, since a striking number of issues are detectable with, or affect, only a subset of these.
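As a rough illustration of why those extra dimensions matter, the sketch below groups check results by one dimension at a time and computes a failure rate for each value. The record layout, region names, and client labels are hypothetical, loosely modeled on the GitHub example above where libgit2 checks failed while git CLI checks passed.

```python
from collections import defaultdict


def failure_rates(results, dimension):
    """Return the failure rate per value of `dimension`.

    `results` is an iterable of dicts with (assumed) keys 'region',
    'client', and 'ok'; `dimension` is 'region' or 'client'.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for result in results:
        key = result[dimension]
        totals[key] += 1
        if not result["ok"]:
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}


# Made-up results: every libgit2 check fails, every git CLI check passes.
results = [
    {"region": "ca-central", "client": "libgit2", "ok": False},
    {"region": "ca-central", "client": "git-cli", "ok": True},
    {"region": "us-west", "client": "libgit2", "ok": False},
    {"region": "us-west", "client": "git-cli", "ok": True},
]
by_client = failure_rates(results, "client")
by_region = failure_rates(results, "region")
```

Sliced by region, both locations look equally half-broken; sliced by client, the problem is clearly isolated to one library. A test suite with only one client would have seen either a total outage or nothing at all.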
At Metrist, we combine all of this end-to-end testing data with real, third-party API call telemetry from your apps, as well as status page data. This data can be anonymized, aggregated, and shared amongst our user base to help paint an overall picture of a third party’s health.
To deliver on these capabilities, we’ve raised $5.5M in seed funding from some incredible investors. Backers such as Heavybit, Scott and Steve Klein of StatusPage.io, and Alex Solomon of PagerDuty have seen the problem firsthand and believe in our vision to provide a similar level of visibility into cloud dependencies as we have into the software and systems we own.
If any of this sounds interesting to you, please sign up for a free account or check out our software on GitHub. We plan on releasing more on GitHub in the future, such as the source code for our end-to-end tests and documentation on how you can write your own, in the programming language of your choice.