Welcome to the first in a series of interviews with software and SaaS industry experts on what it is like to build on, and manage the reliability of, cloud hosted third parties. In this series, we are interviewing SREs, Software Engineers, Directors of Engineering, CTOs, and more from companies like Weedmaps, Freshly, Zendesk, Netflix, and Credit Karma. In the resulting blog posts, we will share their opinions, experiences, and advice, in their own words, as we all work to improve the reliability of apps built on the public cloud.
Today’s expert is Mike Canzoneri, Head of Technology Operations at Torch Leadership Labs. Mike started coding as a kid, and throughout his 23-year career, he has had a wide range of experience from software engineering to network and database administration.
Mike joined Torch in 2019 as an early employee and leader in the Engineering org. There, Mike has leveraged his decades of technical and operations experience to oversee a platform that helps facilitate digital learning and leadership development for customers, including American Express, Allstate Insurance, and Reddit.
Getting there wasn’t easy. The Torch team has smartly leveraged an extensive set of IaaS, APIs, and SaaS products to deliver a reliable platform that features 1:1 video conferencing, calendar integrations, and quantified progress through analytics.
Visibility is key in making all that happen, and we were excited to speak with him about his thoughts and experiences.
How Mike approaches the build vs. buy decision
In Mike’s words: With the Torch platform today, there are a set of primary components built in-house and then integrated with a series of third-party services. These include our cloud hosting provider and third-party services that provide our video conferencing and calendaring functions.
In addition to that, there’s monitoring, logging, and several other components that all get integrated together. The resulting product our customers interact with is our set of internally engineered components plus the third-party services we leverage.
When I joined Torch, decisions had to be made around speed over ownership, so the calculus around build versus buy was very pragmatic. At the time, we had four Software Engineers and one Operations Engineer.
It doesn’t make a lot of sense for us to build our own video streaming services; it makes a lot more sense to utilize a third-party service like Twilio or Zoom. It doesn’t make a lot of sense to build our own monitoring tools and set up our own monitoring service; it makes a lot of sense to use something like Datadog.
So over time, there are a lot of these decisions, and you end up with an ecosystem of third parties that are the best at what they do, so you can be the best at what you do.
In part, it’s because you don’t have to build up, or pay top talent for, all of the supporting skill sets within the organization; your third-party cloud hosting provider is there to offset that.
At the same time, there is just so much more capability out there that you can buy for what is a small amount of money compared to the engineering effort it would take to build any of these things.
Building and running a SaaS product requires a lot of cloud vendors
In Mike’s words: When I joined the company, we relied on around 30 SaaS/cloud providers. Our total today is between 90 and 110 across the entire company.
We use around 30 to 40 third parties alone to support our product and engineering teams, a subset of which is used for our product in production.
It is easy to point at things like Salesforce and AWS as the main cloud products the industry uses, but these days cloud and SaaS can deliver value all the way down to the small things.
There are lots and lots of little utility services that don’t require a lot of your company’s data or a tremendous amount of integration, but for a little bit of money, they can provide immense value to your organization.
With the modern SaaS options available today, and with integration getting closer to being a first-class component of everything, it really comes down to efficient capital allocation.
No venture capitalist I’ve ever met is going to give somebody $5 to $50 million and expect that they’re going to spend half of that standing up their own in-app video streaming, monitoring infrastructure, or all the other things.
The risk of using third parties is outweighed by the benefits
In Mike’s words: I’ll start by saying this: it is a common misconception that utilizing a third-party IaaS or SaaS product is higher risk than building the capability yourself.
Think about your cell phone. Who delivers that service better, you or the company that deployed billions of dollars and tens of thousands of people to deliver it?
It’s a similar situation with cloud infrastructure and SaaS. The risk is not greater. In most cases, the risk is lower because you have experts building and maintaining those things. Torch is not going to do video better than Twilio or Zoom.
Now, that doesn’t mean there isn’t risk.
When we detect an issue with our software, I would guess that 30% of the time, the issue is on our side, and 70% of the time, the issue is tied back to a third party. That doesn’t mean that we’re fully down; sometimes it means a certain area of the product or a small feature is impacted.
It’s actually harder to find those partial outages. There is an old monitoring concept of the single pane of glass, and it becomes really important when you’re integrated with so many external services.
Sometimes the impact from a third party can be really minor, and it only impacts a small number of people, but that just means that your ability to identify that impact needs to be really strong.
To trust the status of a third party, additional monitoring is required
In Mike’s words: We have monitoring and logging using Datadog, with alerts set up in Slack and email. We recently added external monitoring of the most impactful third-party services.
This came in handy recently with AWS and Twilio outages, where our externally focused monitoring was the first thing to alert us.
This means we don’t have to wait for errors in our app, then investigate, and finally confirm that there’s an issue with someone else. We don’t have to wait for a user to report issues.
We already know there’s a third-party issue that could impact our app, so we’re able to communicate internally and externally. We’re also able to halt anything that needs to be halted so we’re not causing data damage or just to reduce end-user impact.
With this addition to our monitoring stack, we found out about these recent vendor issues up to 60 minutes before they were officially acknowledged and communicated.
Without it, we’d be on our heels, not taking meaningful action until much later.
The other thing that is interesting is that so many services are interconnected now. When your cloud provider has an issue, some subset of the third-party SaaS products you use are also affected by that cloud provider.
We now have the ability to know which other services we rely on are impacted, and we can take meaningful action without having to wait for the cascade of alerts and errors from our metrics and logs.
Without monitoring focused externally on third-party SaaS and cloud infrastructure, it is really hard to determine if the issue is you, or them. You have to wait until that service says, “Hey, we’re having a problem with our cloud provider.”
With monitoring and alerting focused directly on the major cloud providers, we don’t have to wait. We know they’re having a problem.
And if it’s a major third-party cloud platform, then we’ll know there will be additional issues with the other tools we use, too.
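The cascade described above can be sketched in a few lines. This is a minimal illustration, assuming a hand-maintained map of which services run on which providers; the service names and edges below are hypothetical and do not describe Torch's actual setup. Given one failing provider, it walks the map to list everything that depends on it, directly or indirectly.

```python
from collections import deque

# Hypothetical dependency map: each service lists the providers it relies on.
# Names and edges are illustrative assumptions, not a real topology.
DEPENDS_ON = {
    "our-app": ["cloud-provider", "video-api", "monitoring"],
    "video-api": ["cloud-provider"],
    "monitoring": ["other-cloud"],
    "calendar-api": [],
}


def impacted_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map so we can look up "who depends on this provider?".
    dependents = {}
    for service, providers in depends_on.items():
        for provider in providers:
            dependents.setdefault(provider, set()).add(service)

    # Breadth-first walk outward from the failed provider.
    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in impacted:
                impacted.add(service)
                queue.append(service)
    return sorted(impacted)


print(impacted_by("cloud-provider", DEPENDS_ON))  # ['our-app', 'video-api']
```

With a map like this, an alert on one major provider can be turned straight into a list of downstream tools to watch, rather than waiting for each of them to report a problem.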
Why you cannot rely on proactive communication from third parties
In Mike’s words: It’s extremely inconsistent, and in a lot of cases, just bad. There are no industry standards. Oftentimes, there are contractual obligations to communicate in certain ways, but that varies from customer to customer.
There are status pages and sometimes status-related APIs. Then there are some services that will report directly into Datadog and other monitoring products. Finally, there are things that will pipe an RSS feed into Slack.
It’s very inconsistent. Engineering and ops teams have to do a lot of work to centralize those signals as much as possible.
A recent AWS outage was interesting because they couldn’t update their own status page due to a reliance on their own services, one of which was the source of the issue. So, you can end up in a really tough situation where the monitoring tools that you’ve put in place are not always resilient to issues themselves.
When it comes to modern software engineering in the cloud, we don’t yet have a notion of “Tier 1” and “Tier 2” providers like they have in some industries. It’s something that I believe the industry needs to create.
In many cases, status pages fall short
In Mike’s words: They do well at providing a place to go to see the status for a specific service.
What status pages don’t do well is provide you with the ability to see what’s going on across all of your services. Instead, and often at regular intervals during incident response, you have to look at 15 different status pages, which may or may not have been updated.
This is a big challenge and something that needs to be improved on across the SaaS landscape.
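One way to picture the cross-service view described above is a small function that collapses many per-vendor status payloads into a single worst-first list. This is only a sketch: it assumes each vendor's status has already been fetched into a dictionary with `status.indicator` and `status.description` fields, a shape some hosted status page products expose but which is not universal. The vendor names, the payload shape, and the choice to rank a missing status as high as a major incident are all assumptions made for illustration.

```python
# Severity ranking for the indicator values assumed here; higher is worse.
SEVERITY = {"none": 0, "minor": 1, "major": 2, "critical": 3}

# A vendor whose status could not be read is treated like a major incident,
# since a silent status page is not evidence that things are healthy.
UNKNOWN_RANK = 2


def summarize(payloads):
    """Collapse per-vendor status payloads into one list, worst first."""
    rows = []
    for vendor, payload in payloads.items():
        status = (payload or {}).get("status", {})
        indicator = status.get("indicator", "unknown")
        description = status.get("description", "no status reported")
        rows.append((vendor, indicator, description))
    # Sort by descending severity, then by vendor name for a stable order.
    return sorted(rows, key=lambda r: (-SEVERITY.get(r[1], UNKNOWN_RANK), r[0]))


# Hypothetical payloads; None stands for a status page that could not be fetched.
payloads = {
    "video-api": {"status": {"indicator": "major", "description": "Partial outage"}},
    "cloud-provider": {"status": {"indicator": "none", "description": "All systems operational"}},
    "calendar-api": None,
}

for vendor, indicator, description in summarize(payloads):
    print(f"{vendor:15} {indicator:8} {description}")
```

Fetching the payloads is deliberately left out, because every vendor publishes status differently, which is exactly the inconsistency described in this interview.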
The challenges with visibility into third parties go beyond incident response
In Mike’s words: There is no common standard today for change management communication for things like API changes, new deployments, or scheduled maintenance. Today, those things come via email.
Within an organization, when you’re using so many SaaS products, it’s not always the Operations team that gets those emails. It’s very common for a vendor to communicate that they’re going to be having a planned outage over the weekend or performing a major upgrade on a Wednesday night, and the Engineering team doesn’t know because someone else got the email.
A big communication gap exists today within the SaaS space.
If someone came to me today and asked me if there were any planned changes, updates, or maintenance from our third-party vendors, it would take a lot of time to put that answer together. We could get it, with some reasonable level of accuracy, but it is not within arm’s reach.
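If vendor change notices were collected in one place instead of scattered across inboxes, answering that question could be a single query. The sketch below assumes someone has already entered each notice as a record with a vendor, a summary, and a start and end time; the `PlannedChange` type, the sample entries, and the seven-day default are hypothetical, and all times are assumed to be in the same time zone.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class PlannedChange:
    """One vendor notice: maintenance, an upgrade, or an API change."""
    vendor: str
    summary: str
    starts_at: datetime
    ends_at: datetime


def upcoming(changes, now, horizon_days=7):
    """Return changes overlapping [now, now + horizon_days], soonest first."""
    window_end = now + timedelta(days=horizon_days)
    hits = [c for c in changes if c.starts_at <= window_end and c.ends_at >= now]
    return sorted(hits, key=lambda c: c.starts_at)


# Hypothetical notices gathered from vendor emails.
changes = [
    PlannedChange("cloud-provider", "Scheduled maintenance",
                  datetime(2024, 1, 6, 2, 0), datetime(2024, 1, 6, 4, 0)),
    PlannedChange("video-api", "Major upgrade",
                  datetime(2024, 1, 3, 22, 0), datetime(2024, 1, 4, 0, 0)),
    PlannedChange("calendar-api", "Old API version retired",
                  datetime(2024, 2, 1, 0, 0), datetime(2024, 2, 1, 1, 0)),
]

for change in upcoming(changes, now=datetime(2024, 1, 1, 9, 0)):
    print(change.starts_at, change.vendor, change.summary)
```

The hard part in practice is getting the notices into a shared record at all, which is the communication gap described above.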
Mike’s ideal view into third-party status is inspired by an unlikely source: the aviation industry
In Mike’s words: If I could wave a magic wand, I would have the same visibility over the third-party SaaS and IaaS products that air traffic control has around airspace.
I want to know what they’re doing ahead of time. I want to know how things are going right now. I want that to happen in a standard way, communicated immediately. We deserve quick acknowledgment of a problem, even if the vendor doesn’t know what it is yet.
These things should be standardized across all SaaS products so we don’t have to chase down inconsistent ways of getting that information.
It’s standardizing the sharing of monitoring, alerting, and change management information so that we don’t just react to what’s going on right now, but can plan system changes at times that carry lower risk.
Honest real-time visibility into third-party availability is the missing piece
In Mike’s words: It’s everything. It’s the thing that can give us the ability to make sure our users are having a good experience. It’s extremely important.
Every opportunity we have to gain more visibility into third parties, we should take it and add it to our existing dashboards.
Third parties are a fundamental part of any SaaS offering. The need for third parties spans every team within a company, from marketing, which uses them to capture usage, to engineering, which uses them to host infrastructure. Visibility into the current (and in some cases past) status or health of third-party infrastructure or services is lacking.
Still, the benefits of leveraging a third party are essential to keeping costs low and systems running reliably. The biggest challenge in leveraging a third party is the lack of transparency into the third party’s health.