DevOps expert Jeff Smith explains how site reliability data from third parties can influence your app’s performance and how you can leverage that data.
Today’s expert is Jeff Smith. He is the Director of Production Operations at ad tech company Basis Technologies (formerly Centro) and was previously Manager of Site Reliability Engineering at Grubhub.
Jeff is a notable figure in the DevOps movement: he is the author of the book Operations Anti-Patterns, DevOps Solutions, as well as a frequent conference speaker and podcast guest.
In this interview, we discuss industry shifts to cloud-hosted third parties, along with their advantages and challenges. We hope you’ll get as much out of the conversation as we did.
Building on third-party apps is becoming more mainstream – but so is the dependency on their uptime.
METRIST: What do you think is driving the industry shift to use cloud-based third-party apps to power new apps?
JEFF: Over the last 20 years, there’s been a definite shift in people’s willingness to consume third-party services. And I think a lot of that stems from the wide adoption of API-based interfaces. As a result, the complexity of managing and understanding the underlying implementation is gone.
And that empowers people to do cool things, giving us access to emerging technologies sooner. Innovations being productized and turned into APIs bodes well for the entire industry because now we can leverage the immense resources in the API Economy.
I really like Amazon’s term, Undifferentiated Heavy Lifting. Email is a perfect example. At Basis, it’s a critical part of the product and business. We use it all the time, but we’re not going to build and manage email infrastructure ourselves, especially when there are other providers with the dedicated expertise to do it properly.
So, it’s the same concept with other third-party tools. We’re selling ads, so creating other tools is not a core competency. We don’t want to take on that burden when there are providers out there doing it better and can manage the complexity for us.
METRIST: What challenges arise when we start to rely on dozens of cloud-hosted dependencies?
JEFF: People need to understand where their key integrations live and what’s core to the business. Because from a customer perspective, if something isn’t working, they don’t care who’s at fault. They see that you’re not delivering the convenience you said you would. So they come to you for information about when things will be fixed.
For example, if I’m going to be multi-region for disaster recovery, I need to know which of my core third-party partners are also multi-region. Because if there’s another AWS us-east-1 outage, we may be fine. On the other hand, if one of my third-party providers’ infrastructure is down, we need to know what capabilities are out, how to monitor it, and communicate that to customers. So now the scope of what you care about isn’t just your software, isn’t just your dependencies, it’s also your dependencies’ dependencies.
This means most companies today are leveraging multiple third-party services for a single core functionality. So Stripe may be the primary. Then, if Stripe is down, you have a secondary fallback for credit card processing, like Braintree, because a complete credit card processing outage would be so impactful.
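As a rough illustration of the primary/fallback idea described above, here is a minimal Python sketch. Everything in it is hypothetical: `ProviderError`, `charge_with_fallback`, and the two stand-in provider functions are invented for the example and do not correspond to any real Stripe or Braintree API.

```python
class ProviderError(Exception):
    """Raised by a (hypothetical) provider client when a charge cannot be processed."""


def charge_with_fallback(amount_cents, providers):
    """Try each (name, charge_function) pair in order.

    Returns (provider_name, receipt) for the first provider that succeeds,
    and raises ProviderError if every provider fails.
    """
    failures = []
    for name, charge in providers:
        try:
            return name, charge(amount_cents)
        except ProviderError as exc:
            # Remember why this provider failed, then move on to the next one.
            failures.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(failures))


# Stand-in providers for the example (assumptions, not real SDK calls).
def primary_down(amount_cents):
    raise ProviderError("service unavailable")


def secondary_ok(amount_cents):
    return f"receipt-{amount_cents}"


result = charge_with_fallback(
    500, [("primary", primary_down), ("secondary", secondary_ok)]
)
```

In a real payment flow, a fallback like this would also need safeguards against charging a customer twice, for example when the primary times out after it has already processed the charge. The sketch leaves that out.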
Solutions to third-party dependency issues:
METRIST: How do you get the visibility you need into cloud platforms and APIs?
JEFF: It’s amazing how few tools exist for monitoring third parties. Not long ago, there was a us-west-2 outage with AWS, and we all went to Twitter and Downdetector for updates, not to the AWS status page.
We saw the complaints on Twitter and Downdetector and scrambled trying to figure out exactly what was happening, and it took us a while to get official confirmation from AWS that it was them and they were working on it. It’s frustrating: this core service delivers the heartbeat of your company, and we have to go to Twitter to find out if we’re OK.
Unfortunately, I think providers are hesitant to declare an emergency and update their status page because it shows fault. Or, there’s a lack of discipline in the incident management process, and the status page becomes an afterthought to the people fighting the incident.
Having to rely on these rudimentary monitoring systems can fail us and lead us down rabbit holes in the wrong direction, where we’re burning time and maybe not even tracking the right incident.
The roles and responsibilities of third parties addressing cloud dependency data transparency.
METRIST: What’s the impact of not having clear visibility into cloud dependencies?
JEFF: Often, when alerts go off, we jump right into a troubleshooting path that seeks to figure out what’s wrong with a third party and prove definitively that it is them, not you. At the same time, going in that direction means you’re not tracking down other leads. Suddenly you may realize it’s not a third-party issue after all, and now you’re starting over again from ground zero. It burns a ton of time.
When we don’t have visibility into whether third parties are verifiably healthy, we use Twitter and Downdetector as proxies. Right now, it’s the best we have, but it’s only accurate when it is a large-scale event. So, if it’s a us-east-1 or us-west-2 issue, there’s enough chatter to bubble to the top.
But if it’s some other issue and you don’t see chatter on Twitter or Downdetector, it’s not enough to clear the third party. I still need to keep looking for the main cause, and I could be burning valuable time. That becomes problematic. The signal is buried in the noise, if there is a community signal at all.
METRIST: Why aren’t cloud vendors solving this problem for us?
JEFF: From my experience, I’m still not getting the up-to-date information that I need from things like the AWS Personal Health Dashboard. The problem is when a dashboard loses its credibility as a source of truth, it’s useless to you. I may see everything’s OK, but I don’t really believe that; I just think they haven’t updated it yet.
It’d be different if I knew it was one hundred percent accurate every time I went to this dashboard. But that’s rarely the case. So I keep it open and refreshing, looking for an update. That’s what happened during the AWS us-west-2 outage. There was no update, so I kept checking Twitter, and when I saw it there, that was the truth, not what the AWS Personal Health Dashboard said.
METRIST: How could cloud vendors meaningfully improve their status pages and external communications?
JEFF: When it comes to statuses and health dashboards, I think we need different levels of acknowledgment and confirmation of a particular issue. So, for example, how can we get AWS or other providers to say, “Yes, we’ve heard people complaining about this issue, and we’re looking into it.”
Then at least that way, even if it’s not confirmed as a full-blown issue, I know there’s something there. Providers could communicate that they are looking into things or are troubleshooting a potential issue and not ruling it out right now but will come back and let us know when something is ruled out. That type of granularity would help.
It’s always easy to demand transparency but a lot harder to deliver, and I think we lose sight of that when we’re on the buy-side. With things like the AWS Personal Health Dashboard, part of that may be that the status pages are a product challenge. With so many providers, customers, and products, it’s a multi-layered problem to find ways to tell so many people what they need to know, each with the right level of granularity.
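The idea of several acknowledgment levels, rather than a binary up/down status, could be modeled along these lines. This is only a sketch: the level names and the `vendor_still_suspect` helper are invented here for illustration and are not taken from AWS or any other provider's status page.

```python
from enum import Enum


class IssueStatus(Enum):
    """Hypothetical status levels a provider could publish."""

    OPERATIONAL = "no known issues"
    REPORTS_RECEIVED = "we have heard reports and are looking into it"
    INVESTIGATING = "troubleshooting a potential issue, not ruled out yet"
    CONFIRMED = "confirmed incident"
    RULED_OUT = "investigated and ruled out on our side"


# Statuses that tell a customer to keep the vendor on the suspect list.
SUSPECT_STATUSES = {
    IssueStatus.REPORTS_RECEIVED,
    IssueStatus.INVESTIGATING,
    IssueStatus.CONFIRMED,
}


def vendor_still_suspect(status):
    """Return True while the vendor has not cleared itself."""
    return status in SUSPECT_STATUSES
```

With levels like these, an on-call engineer would know whether "no incident posted" means "nothing reported" or "reported but not yet confirmed".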
METRIST: What if vendors went even further? What is your ideal level of visibility into cloud dependencies?
JEFF: I’d love to see a third party’s performance data aggregated across customers. It seems like an easy way to bring out the signal from the noise. Tell me this is the aggregate response time that we’re seeing from this API endpoint across the customer base, the aggregate error rate that we’re seeing across this endpoint, and so on.
Then when I start to see elevations, if it’s against a large population group, it’s easier to figure out if it’s a one-off situation or something else and how that matches the reporting.
It boils down to how much confidence I have in what the dashboard is telling me. That lets me start to look in a direction and begin confirming things.
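To make the aggregation idea concrete, the following sketch computes an aggregate response time, an aggregate error rate, and the share of customers seeing errors for one endpoint. The `Sample` record, the `aggregate` function, and the sample data are all assumptions made for the example. A real vendor would have its own data model.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Sample:
    customer: str      # which customer made the request
    latency_ms: float  # observed response time
    ok: bool           # True if the request succeeded


def aggregate(samples):
    """Summarize one endpoint's behavior across the whole customer base."""
    customers = {s.customer for s in samples}
    affected = {s.customer for s in samples if not s.ok}
    return {
        "median_latency_ms": median([s.latency_ms for s in samples]),
        "error_rate": sum(1 for s in samples if not s.ok) / len(samples),
        # A low share suggests a one-off; a high share suggests a broad issue.
        "affected_customer_share": len(affected) / len(customers),
    }


samples = [
    Sample("a", 100, True),
    Sample("a", 120, True),
    Sample("b", 900, False),
    Sample("c", 110, True),
]
summary = aggregate(samples)
```

Here only one of three customers sees errors, which points toward a one-off rather than a broad elevation.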
Steps developers can take to gain greater visibility through third-party data:
METRIST: Should developers be responsible for getting this visibility for themselves, or should we look to our vendors to make it happen?
JEFF: I think it depends on the actual service, mainly because you may rely on so many key services, and most people will want something for every service they use. I could see a world for both, with the mixture depending on how critical the service you’re interacting with is.
However, what you need to see from each provider are levels of details that vary based on the type of integration. For something like Mailgun, our email partner, I’d like an observability dashboard I own because there are specific things I’d want to track without needing to wait or go through a vendor. For others, where the service isn’t quite as vital, having an idea if something is an issue is useful, but it’s not worth my time and energy to manage and maintain it myself.
METRIST: Is there a role for greater visibility outside of incident response?
JEFF: When we talk about reliability with third parties, it’s always about major outages. Those are visible and easily documented. But part of reliability is also consistency. And that’s much harder to get data on.
For example, if I’m getting an 11-second lag from a provider, that gets passed on to my customers. So while it’s not a critical on-call issue, it’s still impacting customers.
Those issues are where I see impactful conversations happening because I’m tracking my metrics, so now I want to see your metrics as they relate to mine. But there’s a lot of data I don’t have, or we can’t prioritize resources to get it. So, this means that there are probably many vendors that are getting over on their SLAs simply because no one really has the data to show if and why something isn’t performing.
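One way to gather that kind of consistency data yourself is to record the latency of each call to a provider and summarize it with percentiles and a within-target share. The sketch below assumes you already collect latencies in milliseconds. The function names, the 1,000 ms target, and the sample numbers are illustrative, not drawn from any real SLA.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def share_within_target(latencies_ms, target_ms):
    """Fraction of requests that finished within the latency target."""
    within = sum(1 for latency in latencies_ms if latency <= target_ms)
    return within / len(latencies_ms)


# 18 fast calls and 2 calls with an 11-second lag.
latencies_ms = [200] * 18 + [11000] * 2

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
compliance = share_within_target(latencies_ms, target_ms=1000)
```

The median looks healthy while the 95th percentile exposes the 11-second lag, which is exactly the kind of inconsistency that never shows up as an outage.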
It would also be interesting to know how these vendors publish and share their disaster recovery plans, because it’s essential for decision-making.
Getting information such as if they run in AWS or Azure, which regions they’re in, what the disaster recovery plan looks like, and so on, can make a huge difference. For example, if you have a failover plan but one of your core third-party integrations isn’t going to fail over with you, that changes the calculus of how you handle disaster recovery.
When you’re a bigger company, you can request this sort of information, but for a smaller startup and for less-expensive, swipe-a-credit-card services, it’s tougher to get answers.
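If you can collect region information from your vendors, checking it against your own failover plan is straightforward. In this sketch, the vendor names and their region lists are made up for illustration, and `vendors_missing_from` is a hypothetical helper.

```python
def vendors_missing_from(vendor_regions, failover_region):
    """List vendors that do not run in the region you plan to fail over to."""
    return sorted(
        vendor
        for vendor, regions in vendor_regions.items()
        if failover_region not in regions
    )


# Hypothetical inventory of core integrations and where they run.
vendor_regions = {
    "email-provider": {"us-east-1"},
    "payments-provider": {"us-east-1", "us-west-2"},
    "search-provider": {"us-west-2"},
}

at_risk = vendors_missing_from(vendor_regions, "us-west-2")
```

Any vendor on the resulting list would not follow you to the failover region, so its features need a separate contingency plan.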
Possible outcomes for vendors and customers with shared reliability data.
METRIST: Would the dynamic change between vendors and customers if there was trusted and shared data to define the reliability of cloud products?
JEFF: I think it would have to. It inherently changes the dynamic simply because the nature of competition would force that change.
If, as a competitor, I know customers are complaining about your response time and reliability, and I have the numbers to prove it, that becomes a selling point in the brochure: this vendor is x% faster than another.
If you’re having speed issues that you suspect are tied back to a third-party provider, then seeing data from targeted third-party monitoring speaks to you in a way that it doesn’t when it’s a black box and you don’t know what your current performance is.
How you can monitor the reliability of your third-party apps:
Jeff Smith had some great insights into third-party reliability data and we hope you enjoyed this interview. If you have questions about this conversation or these topics, feel free to contact us for more information.
Metrist also provides a free tool for developers and support professionals to get accurate, detailed, timely information regarding the downtime of their cloud-based apps. To monitor your own apps, simply sign up today.