Like any tech sector, fintech relies on third-party cloud apps. When those apps malfunction, however, high-stakes businesses are put at risk.
Luke Rotta is the SRE & Observability Manager at Chicago Trading Company (CTC) and has worked in the fintech industry for over 20 years. CTC is a privately held company with offices in Chicago, London, New York, and Boulder. The company uses software and data to deploy innovative, time-sensitive trading strategies, so for its technology, every second counts.
We had a wide-ranging and informative conversation with Luke about how important third parties and visibility are to him, his team, CTC, and the entire fintech industry.
Metrist: Chicago Trading Company operates in a high-stakes arena. Does that stop you from using cloud-hosted third-party vendors?
Luke: “We use a number of third parties, including observability tools, cloud infra providers, financial services apps, and our clearing system. Our proprietary applications rely on third parties, with the most critical dependencies being the exchanges and the marketplace itself. If any of those connections to the market go down, we can’t make money.
And we are shifting more and more to the public cloud because of the opportunities it gives us to scale more easily, and that shift is increasing the number of external dependencies in our stack.
Currently, we use between 50 and 100 cloud products, with a 50/50 split between API and UI-rich SaaS products. Their impact on our reliability varies from month to month. Typically 20% to 25% of our incidents are tied back to third parties, but that can spike to as much as 50%.”
How important is visibility into the health of your third parties?
“On a scale of 1 to 10, the importance of third-party visibility is a 10 – it’s paramount. Unfortunately, it’s a struggle to get the visibility we need. I’d say the ease of doing so is a 2 on a scale of 10.
Vendors often don’t tell us when something’s broken – we tell them! After that, it’s a long process to figure out what the problem actually is and get it fixed.
But when you have a new line of business, you can’t build everything yourself, so you purchase some things in an effort to get going fast. And that often results in a large surface area of third-party dependencies.
Third-party visibility is actually a pain point I’m currently trying to solve. Right now, visibility means me pushing our third parties to send insight to me proactively, because I don’t have any way to pull it. Otherwise, the only visibility we get is whatever feedback we may or may not get in the user interfaces of the products we rely on the most.
Just the other day, we discovered a breaking change caused by a third-party provider when a trader reported to my team that a drop-down didn’t work in a piece of our software. When we reported it to their customer support team, they found the service was down on their end. We had no visibility into that planned change or the subsequent incident; we had to find out on our own.”
Why do you think you aren’t getting the proactive communication you need from your third-party dependencies?
“A lot of times they simply don’t know they are having problems. It isn’t until their postmortem or outage review that gaps in their monitoring are highlighted.
But a customer alerting the vendor of an issue is a terrible way for the vendors to learn about things. So we’re trying to push the vendors to find ways to improve proactive communication – whether it’s a simple email or status page update.
In an ideal world, we’d have ways to check these things proactively – a pre-flight check, so to speak. But right now it’s just not easy to get visibility into those third-party systems.”
Don’t your monitoring tools, the likes of Splunk and PagerDuty, offer third-party visibility?
“They actually don’t focus on that area, so it’s a gap that they have today. To my knowledge, no one is focused on third-party visibility; they’re all focused on metrics, logs, and traces – application telemetry, not third-party dependencies and their availability.
Part of the problem is that it can be hard to separate cause and effect. Traditional monitoring tools can give you signals, but they don’t excel at telling you if the problem is you or the external dependency.
These platforms notify you of a symptom that could be from a third party, but you don’t know whether the third party is the root cause. So it’s hard to discern whether it’s us or them because of the intricacies involved. Without direct visibility, the process of figuring out if it’s us or them is a bit like playing “ping pong” back and forth across teams.
They check their systems, we check our software, and everyone checks their configuration, trying to line up what our software is saying to the system with what we’re seeing. That whole process of triage takes on average 30 minutes, sometimes up to a couple of hours.
For major issues, our business is at risk that entire time. For lower-level incidents, we waste time sitting idle until other teams get back to us and the assignment can be finished.”
What about status pages? You mentioned them earlier. Don’t they solve the problem?
“Status pages give you a broad overview. Sometimes they are good at the start and end of an incident, but too many things change too fast in between and those changes don’t get reflected on the status page. It’s good to have good mass communication about what is happening, but engineers need a more real-time way to look at the service health and status of the products they are using.
Most importantly, we need easy ways to check the things we care about because many cloud products have grown to be so complex. They may have 20 components or more, but the status page is only reporting on three of them. However, those three are irrelevant to me if I don’t use those parts of the product. What I need to know is what is impacting me, and what isn’t.”
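The filtering Luke describes, reducing a vendor's full component list to the parts a team actually uses, might be sketched as follows. This is only an illustration: the `ComponentStatus` record, the vendor and component names, and the status strings are all hypothetical and do not correspond to any real vendor's status feed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentStatus:
    vendor: str
    component: str
    status: str  # hypothetical values: "operational", "degraded", "down"


def relevant_issues(statuses, components_in_use):
    """Return non-operational statuses for components we actually depend on."""
    return [
        s for s in statuses
        if (s.vendor, s.component) in components_in_use
        and s.status != "operational"
    ]


# Hypothetical snapshot of one vendor's component health.
snapshot = [
    ComponentStatus("cloud-a", "object-storage", "degraded"),
    ComponentStatus("cloud-a", "email-service", "down"),
    ComponentStatus("cloud-a", "compute", "operational"),
]

# Hypothetical set of (vendor, component) pairs this team relies on.
in_use = {("cloud-a", "object-storage"), ("cloud-a", "compute")}

impacting_us = relevant_issues(snapshot, in_use)
for issue in impacting_us:
    print(f"{issue.vendor}/{issue.component}: {issue.status}")
```

Here the email outage is ignored because the team does not use that component, while the degraded storage component is surfaced.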
What would it mean to you and your team if you had immediate and accurate verification of third-party health?
“That would be huge! It would tell us exactly where to look, right at the start of our incident response. If we could know exactly where the failure is coming from or isn’t coming from, we’d be able to start from a better place. It’s all about MTTR (Mean Time to Resolution).
Let’s face it, you’re going to have issues; no system or piece of software is perfect. We break our systems and others break theirs. So we focus on MTTR and on reducing it, so we can limit the impact on the business.”
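For readers who want to track MTTR separately for internal and third-party incidents, a minimal sketch follows. The incident log and its numbers are invented for illustration and are not CTC data; the `cause` labels are likewise assumptions.

```python
from collections import defaultdict
from statistics import mean


def mttr_by_cause(incidents):
    """Mean time to resolution in minutes, grouped by incident cause."""
    durations = defaultdict(list)
    for cause, minutes_to_resolve in incidents:
        durations[cause].append(minutes_to_resolve)
    return {cause: mean(values) for cause, values in durations.items()}


# Hypothetical incident log: (cause, minutes from detection to resolution).
incidents = [
    ("internal", 20),
    ("internal", 40),
    ("third_party", 90),
    ("third_party", 150),
]

result = mttr_by_cause(incidents)
for cause, minutes in result.items():
    print(f"{cause}: {minutes} min")
```

Splitting the metric this way makes it easier to see how much of the overall resolution time comes from external dependencies.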
How does mean time to resolution compare between issues caused by third parties and issues caused by your own software?
“Third-party issues take a lot longer to solve.
Not long ago, we had an issue with a system that simply writes data to a public cloud platform in support of our trading strategy. We were experiencing massive queuing within the application that runs on-prem. As the queue grew, we started to worry about the impact on our real-time workflows.
It was a really squirrely problem because we had to involve the public cloud provider, and they immediately started looking to see if there was a regional issue. At the same time, we looked at the network configuration of our on-prem software to ensure we weren’t running over capacity.
It ultimately took a considerable amount of time to get to the bottom of the issue because of all the levels of complexity we had to look into. A lot of that time was lost because we didn’t have good visibility into the functionality of the public cloud infrastructure.”
If you could wave a magic wand to solve the problems we discussed, what would the solution look like?
“For me, it would be having the ability to run functional checks against endpoints. I’m happy to build it myself because I may have some different requirements than other people have. But just having the ability to check and get the information would be helpful.
Then, I want the ability to pull that real-time vendor status when I need it, so we can use automation or human intervention to make a determination of impact. Of course, having vendors push that to me would be even better from a customer experience standpoint.
Finally, centralization is key. My approach from an observability perspective is to pull as much as we can into one place. The more screens you have to look at, the harder it is to assimilate the data. Having all of our third-party vendor data together in one spot would make it easy to have a view of the current state of our world.
The fact that I have to keep telling third parties about issues, instead of them telling me, is not sustainable. It begins to feel like we are doing a piece of their operations for them – as a paying customer! Incurring that cost of time on our end reduces the value we get from third-party vendors.”
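The functional checks and centralized view Luke describes could take many forms. The sketch below is one minimal possibility, not a description of CTC's or Metrist's tooling: each check is a small probe function, and the results are gathered into a single list. The check names, the stub probes, and the expected status code of 200 are assumptions made for illustration; `http_get_status` shows how a real probe might be written with the standard library, but it is not called here so the example runs without network access.

```python
import time
import urllib.request
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    healthy: bool
    latency_seconds: float
    detail: str


def http_get_status(url, timeout=5.0):
    """Example real probe: return the HTTP status code of a GET request."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def run_check(name, probe, expected=200):
    """Run one functional check and record whether it behaved as expected."""
    started = time.monotonic()
    try:
        observed = probe()
        healthy = observed == expected
        detail = f"expected {expected}, got {observed}"
    except Exception as exc:  # a failed probe is itself a useful signal
        healthy = False
        detail = f"probe failed: {exc}"
    return CheckResult(name, healthy, time.monotonic() - started, detail)


def run_all(checks):
    """Collect every vendor check into one list for a single view."""
    return [run_check(name, probe) for name, probe in checks.items()]


def failing_probe():
    raise TimeoutError("no response")


# Hypothetical checks, using stub probes in place of real vendor endpoints.
checks = {
    "vendor-a-orders-api": lambda: 200,
    "vendor-b-market-data": lambda: 503,
    "vendor-c-clearing": failing_probe,
}

results = run_all(checks)
for r in results:
    print(f"{r.name}: {'OK' if r.healthy else 'FAIL'} ({r.detail})")
```

Keeping the probe as a plain callable means a team with different requirements can swap in its own logic, such as placing a test order or reading a known record, without changing how results are collected.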
Key Takeaways for Fintechs – or Any Company
Luke’s expertise helps Chicago Trading Company stay up and running as reliably as possible. His strategies take planning, persistence, and communication, but they are worth every ounce of effort – especially when time equals money.
We all want to avoid feeling like we are using our time ineffectively or even wasting it. In instances like his where time is of the essence, building close relationships with your vendors and using visibility tools can help expedite the incident response process.
A tool that can expedite this process while providing a centralized overview of the statuses of your third-party vendors is Metrist. If you have questions about how you can improve your visibility into the cloud products you rely on the most, let us know!
And you can check it out yourself by signing up for free.