Today’s interview is with DevOps expert Linda Ypulong. As Director of Engineering at Credit Karma, she has a lot of experience working with third-party dependencies and making sure her customers have the most reliable platform possible.
Linda took an unconventional path in her 30-year career, starting as a software trainer. After working with brands such as Harrah’s Entertainment, Mozilla, and Sephora, Linda joined Credit Karma in 2019. Now, she focuses on operational effectiveness and finding ways for continuous improvement to maximize operations and revenue.
Linda had some great tips about navigating the advantages and liabilities of third-party dependencies, and we’re excited to share her insights with you!
How the DevOps industry is changing:
According to Linda:
“I specialize in finding ways for continuous improvement. What I’ve learned over my 30 years is it doesn’t matter what you’ve built, if you can’t run it, then you’re not going to be successful.
Originally [in the software industry], you built everything. We had Dev teams and Ops teams. It was software that you owned and configured, there was a lot of responsibility and it was really limiting. There’s a finite amount of knowledge any person or any company can have.
But… The more things change, the more they become the same. Now a lot of this is going back to the developers with DevOps; you build it, you own it. It’s interesting because a developer has to have knowledge of all the third-party tools they leverage, and that takes us back to where we were in the 1990s.”
Third-party SaaS tools can improve agility, but evaluating reliability is still a challenge.
“Recently, our DNS provider had a massive outage. The last time it happened was about five years ago. We didn’t have a backup provider because it was such a rare thing. So now, we have a primary and a secondary provider because you need to have a backup. So I think to myself, ‘Wouldn’t it be great if we had visibility into the real-time and historical reliability of all the vendors we work with?’
One of the biggest challenges in my role as a Director of Engineering is knowing there are solutions but figuring out the right one that meets both business needs and budget.”
Your app is only as reliable as the vendors in your software supply chain.
“The recent Log4j security issue was a great example of how third-party dependencies have created challenges that we didn’t have before.
It is relatively easy to patch anything on-prem and your internal tools. The challenge comes when you realize that a large organization can have well over 400 third-party pieces of software in use, some of which are embedded into your product, and increasingly they are cloud-hosted rather than installed libraries.
When you think of what’s embedded in your product, you’re only as secure and reliable as the provider. So understanding how each provider manages outages and responds to security incidents is important and a challenge.”
Relying on dozens of third-party SaaS tools is becoming the norm and the number will continue to rise.
“I’d estimate 95% of companies are using 10-15 third-party providers on the small end, and easily upwards of 50 for bigger brands.
In two or three years, I think the number will go higher, but it’s a double-edged sword. We want developers to understand how to code for the cloud and work with third-party providers, but that just means an increasing amount of knowledge we’ll be asking these people to maintain in order to deliver the product.”
How an outage from a major cloud infrastructure provider can have a cascading effect:
“Since we know everyone is using third-party cloud vendors, it means those third parties now have their own third parties, so it’s important to be aware of that. AWS isn’t a cloud provider I have to work with directly, but when they recently went down, we saw CircleCI go down. So we could infer that CircleCI had a problem because of AWS, and suddenly our ability to deploy was broken.
If you manage operations, you’re held to a high standard in your company. So, if anything goes down, you want to be aware of it before your customers are, and that is hard to do.
Recently, a major outage impacted my infrastructure. Thanks to some great monitoring software, and automation through some amazing third-party tools, we were able to mobilize. In less than five minutes, we called the vendor and told them their product was down — which was news to them.
We want answers, so visibility is key. Of course, third-party providers try to present data such as uptime, availability, and so on, but being able to see and extrapolate what may or may not be their problem vs my problem is important, especially when everyone is dealing with so many third-party vendors.”
Your customer’s experience will be impacted by the availability of your cloud-hosted third parties.
“If everybody has 99.9% availability, that’d be great – if they all went down at the same time. However, you’re only as good as all of those 0.1%s added together. So if you have a hundred third-party dependencies with 99.9% availability, go ahead and multiply that out, because it is unlikely that their downtime will happen at the exact same time.
The issue is it is really hard to know about true reliability before you sign the contract. Then, after you sign the contract, if there are reliability issues, it is a lot of work to track down the data and hold your vendors accountable to availability promises and expectations. That extra work is never budgeted.
So it is critical to identify your single points of failure in your third-party apps. Then determine if it’s a risk you’re willing to take or something you can’t afford to have happen again.”
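To make the "multiply that out" arithmetic concrete, here is a minimal Python sketch. It rests on two simplifying assumptions that are not from the interview: every dependency is required for your product to work, and vendors fail independently of one another. The function names and the 30-day month are illustrative choices only.

```python
from math import prod


def combined_availability(availabilities):
    """Chance that every dependency is up at the same moment.

    Assumption: failures are independent and every dependency is required,
    so the individual availabilities simply multiply.
    """
    return prod(availabilities)


def expected_downtime_hours(availability, period_hours=30 * 24):
    """Expected hours of downtime over a period (default: a 30-day month)."""
    return (1.0 - availability) * period_hours


# One hundred vendors, each promising 99.9% availability.
vendors = [0.999] * 100
combined = combined_availability(vendors)

print(f"Single vendor:  {0.999:.2%} available, "
      f"{expected_downtime_hours(0.999):.2f} h down per month")
print(f"All 100 vendors: {combined:.2%} available, "
      f"{expected_downtime_hours(combined):.2f} h down per month")
```

Under these assumptions, one hundred vendors at 99.9% each work out to roughly 90.5% combined availability. Real outages are often correlated (for example, when several vendors share a cloud provider), so treat this as a rough estimate rather than a prediction.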
Why it’s important to determine downtime proactively versus reactively:
“I’ve never seen a senior leader say, ‘I’ll just handle that when it breaks.’ Being proactive requires visibility. A lot of times, when you’re talking about dozens of third-party vendors and then their third-party vendors, you wish you had a crystal ball because you don’t have the visibility you need.
Sadly, Twitter and DownDetector are probably our best way of knowing how third parties are working at any time, and that is still reactive. You’re basically searching Twitter to see if other customers are complaining, and by the time you can really find a signal, the issue has been going on for a while. Beyond seeing that a SaaS provider is having a problem, you don’t get visibility into the details of what is down or if you’re impacted.”
Visibility doesn’t just matter when a third-party cloud provider is down.
“Another challenge is when a third party goes down or announces they have a problem, but it doesn’t impact you. Your team is scrambling to do extraneous health checks, which is disruptive. With more visibility, including greater detail about the issues, we would know we’re fine and could continue with regular work.
With self-hosted status pages, it’s literally like letting the fox watch the hen house. Too many times, I have seen a status page go without an update until the actual outage is resolved. The other problem is that some of the most devastating outages mean the company itself can’t get to its own status page. So nobody can update it.
What we need is a central place that we can go to and trust to be a source of truth.”
Visibility into the real-time reliability of the third parties is critical but still nearly impossible.
“On a scale of 1-10, the importance of visibility into third-party reliability is a 15 for me. It’s critical to know what’s going on when you have outsourced so many other functions, such as marketing and customer relationship tools, to third parties.
The reality is that the visibility we do have is about a two out of ten. It’s a black box, especially when there are outages that aren’t impacting everyone, but just you, and you’re trying to get answers. There are many times when you hit a minor bug, but since you aren’t given any visibility into what’s happening, it ends up delaying resolution.”
Can accurate, timely vendor availability data make a difference?
“Vendor management becomes really important as we build on the cloud. Part of vendor management is access to data. Not just availability data, but access to your vendor’s RCAs too. You need to know when they have outages, what percentage of their customer base was impacted, and if you’re impacted. Getting this data proactively, in a reliable way, would save me time, provide answers, and improve our operational scorecards. Plus, in terms of the larger conversation, having this data can help foster some discussions that could lead us all to better reliability.”
Why the customer often bears the burden of producing vendor reliability data:
“Recently, a third-party provider that we call through an API was having reliability problems. So, I had to get into meetings with and about them once or twice a week, going back and forth with the vendor, and showing them our data and the problems. I’ve had to become the record keeper and note taker. It was painful.
Data doesn’t lie. If we could easily get clean data and agree on it, we could be blameless. No one thinks the other is trying to take advantage of them. We just want to know how to use the vendor’s product, what may be causing problems, and how to solve them.”
How third-party vendors help push innovation and agility, even with dependency challenges:
“Building on public cloud platforms and with third-party APIs allows us to be more innovative, and more nimble. You can focus on your core competency, really innovate in your space, and then partner with the people that are innovative in their space.
I want a world where we pick vendors, and our partners, based on data, including real-time visibility and an understanding of where they are in their maturity journey. I don’t want to sign the dotted line until I know how reliable you are, because that will impact my reliability.”
How you can leverage Linda’s insights and monitor third-party cloud dependencies yourself:
Linda had some amazing insights into DevOps and third-party dependencies. If you have questions regarding this interview or third-party dependencies, we would be happy to help.
Metrist provides a free tool for monitoring third-party app outages and health. To get started, simply sign up here.
Enjoy this interview? Check out more in our Interview Series.