Many companies are using third-party vendors to increase their development speed and power. However, outages in these third-party apps can cause issues for companies’ apps depending on those vendors. Helping companies stay resilient in the face of these outages is the specialty of today’s guest.
We were very excited to sit down with Erin McKeown, Director of Engineering in Resilience at Zendesk. With her experience at Google, Salesforce, and Zendesk, she has a lot of insights into how to keep these heavyweight platforms reliable.
If you want to make sure that your company knows what to do in the case of a third-party outage, you won’t want to miss this interview.
Set the stage for us: how does Zendesk approach building software in the cloud era?
Zendesk is built primarily on AWS, and we are all-in on utilizing third parties for best-in-class functionality. SaaS is at the heart of everything we do. There’s so much benefit to utilizing third parties for different pieces of critical infrastructure, but at the same time that does open up the door for risk.
We have products across all of our offerings that require specialized technology where it just makes more sense for us to leverage a vendor that excels at delivering those capabilities. That leaves us free to spend time on mitigation strategies, runbooks, and building partnerships with those best-in-class vendors.
Considering how impactful third parties can be, how important is it to have visibility into the availability, performance, and functionality of your vendors?
It’s incredibly important. What many people in my space are asking themselves is what can be put into their monitoring to differentiate between vendor-caused and internal issues.
Being able to differentiate between internal and vendor issues is really challenging to do, especially if you don’t have the right level of insight from the vendor itself. Even just status page updates or initial outreach from technical partners can take a minimum of 15 minutes because everyone has an incident response process and timeline that needs to play out.
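As a minimal sketch of what that differentiation could look like in monitoring, assume you run independent health probes against both your own services and each vendor. The probe names, the `ProbeResult` type, and the classification labels below are hypothetical illustrations, not something described in the interview or used at Zendesk.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProbeResult:
    target: str      # hypothetical probe name, e.g. "email-vendor"
    is_vendor: bool  # True if the probe checks a third party directly
    ok: bool         # True if the probe succeeded


def classify_incident(results: List[ProbeResult]) -> str:
    """Give a rough first guess at where an incident originates."""
    vendor_failures = [r.target for r in results if r.is_vendor and not r.ok]
    internal_failures = [r.target for r in results if not r.is_vendor and not r.ok]

    if vendor_failures and not internal_failures:
        return "likely vendor-caused"
    if internal_failures and not vendor_failures:
        return "likely internal"
    if vendor_failures and internal_failures:
        return "mixed: investigate both"
    return "no failure detected"
```

A result like this is only a starting point for triage. Its value is that it does not depend on waiting for the vendor's status page to update.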
How often are vendors the source of outages for their customers?
This question is a huge focus area for us right now at Zendesk. The volume of incidents caused by third parties is higher than some of the other root causes ever were in the past.
It’s easy to think that some of these cloud vendors are powerhouses and they know exactly what they are doing, but that’s just not the case.
Does this recent change at Zendesk represent a shift in the industry for how we build and deliver resilient systems?
It’s less about how things have changed over the years, and more about the maturity of an organization, and the organizations they rely on.
Earlier on in a company’s lifecycle, you’ll see self-inflicted wounds. For example, there’s a lack of testing or the staging environment isn’t close enough to production to catch everything. So less mature organizations are causing problems for themselves.
Later in a company’s lifecycle, we see a shift to impact coming from other places, from the outside, like third-party dependency outages, and often those can be the most impactful.
Should organizations that rely heavily on cloud vendors approach incident management differently than organizations that don’t?
If you build a strong incident management program and you have the right tools and processes in place, you should be able to respond to any type of incident, whether it’s self-inflicted or caused by a third party.
It’s not just about incident management, though. It’s also about how you architect your technology for resilience towards different types of outages. That means that the decision to use a third party needs to include an evaluation of what the risks are in terms of outages. What could those outages mean for you and your business, and do the benefits outweigh the costs?
Doing this analysis upfront will lead to having a firm understanding of what is in your control when something happens with a third-party vendor, and what is completely out of your control. All of this informs your incident management strategy.
Of course, that line is getting finer and finer and sometimes it seems like SaaS and infrastructure companies are all jumbled into one. That can lead to more tolerance for downtime when something like a major AWS outage happens, but we don’t have to accept that.
You can be one of the companies that rely on AWS but don’t go down because you’ve built in mitigation strategies, you leverage different Availability Zones, and you have done what you need to do to be resilient to different types of third-party outages. That’s the goal many of us have.
Ultimately, you have to consider your third-party vendors as an extension of your product and your team. If you are impacted by an outage from a third-party vendor, your customers don’t necessarily know that you have this downstream dependency, so the problem is yours.
What are some ways to treat vendors like an extension of your team in an effort to create resiliency to third-party downtime?
At Zendesk we have a vendor success team within our product engineering organization. They are responsible for managing our relationships with key vendors and hold QBRs with them. When major outages occur, the team makes sure that we have engineering-leadership-level debriefs, and they partner with these third parties on the technical challenges we may be having with their technology.
We also have a Global Vendor Resilience Program. This program works with our Tier Zero and Tier One vendors to ensure that we are partnering with the most critical vendors at the deepest level. These are the vendors that power our core abilities to serve customers, and they aren’t just vendors embedded in our software or in the development pipeline. These vendors are also platforms like Slack and various monitoring tools that impact our ability to conduct incident response and other critical functions.
Not everyone can have teams and programs like that, but there is still a lot you can do. It starts before you even select a vendor. Before you start utilizing a vendor, a thorough vendor resiliency assessment will help you find out what you are getting into in a way that their boxed answers to questions won’t.
What exactly are they going to offer to you? How will you interact with them during business disruptions? How are you going to be able to get in touch with them when things go wrong? Is that just logging a ticket, or do you actually have a direct person that you’re going to be able to engage with?
Not only does this assessment help you understand the risk, but it also gives you a blueprint to work with. Using a blueprint like this, you can model the different scenarios with a third party, how they might impact you, and how you can minimize that impact. And you can do all of this before you ever experience an outage from that vendor.
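One way to picture such a blueprint is as a small table of outage scenarios per vendor, each with a planned mitigation, so that gaps are easy to spot. The sketch below is only an illustration under that assumption: the `OutageScenario` fields, the numeric tier convention (0 for Tier Zero, 1 for Tier One), and the `unmitigated` helper are all hypothetical, not a format described in the interview.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OutageScenario:
    vendor: str                       # hypothetical vendor name
    tier: int                         # assumed convention: 0 = Tier Zero, 1 = Tier One
    description: str                  # what fails and how it would affect you
    mitigation: Optional[str] = None  # planned response, if one exists yet


def unmitigated(scenarios: List[OutageScenario]) -> List[OutageScenario]:
    """Return scenarios with no mitigation yet, most critical tier first."""
    gaps = [s for s in scenarios if not s.mitigation]
    return sorted(gaps, key=lambda s: s.tier)


# Example blueprint with made-up entries.
blueprint = [
    OutageScenario("chat-vendor", 1, "Incident channel unavailable"),
    OutageScenario("cloud-provider", 0, "One Availability Zone down",
                   mitigation="Fail over to another Availability Zone"),
    OutageScenario("email-vendor", 0, "Outbound email delayed"),
]

for gap in unmitigated(blueprint):
    print(f"Tier {gap.tier}: {gap.vendor} - {gap.description}")
```

Reviewing the unmitigated list regularly is one simple way to act on the assessment before an outage ever happens.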
How You Can Use Erin’s Strategies to Manage Risk
Erin had a lot of insights into how to manage risks from third-party vendors through resiliency practices. (Stay tuned for part two of this interview for even more helpful information from Erin!)
One of the key ways to stay resilient is by getting timely information about outages from third-party vendors. This information can be accessed more quickly by developing relationships with vendors, or through observability tools like Metrist.
Be sure to follow Erin on LinkedIn and check out the other articles in this series. If you have any questions about this interview or how tools like Metrist can help you stay resilient, please let us know.