Software stacks are a dime a dozen these days, and picking the right one can seem somewhere between difficult and impossible. But while some people say choosing the “right tool” doesn’t matter, we think there is an advantage in choosing the best stack for the job.
Although it may take a bit of legwork to make this determination, choosing the right stack means that problems get solved more quickly, fewer hacks are needed, and less cruft accumulates. That’s one reason why we worked hard to find our “right tool”: Elixir.
This blog post is the story of how and why we picked Elixir as our current stack.
A Quick Background on Metrist
Before we get started, it might be helpful to explain the context of what we needed. At Metrist, we build an observability tool that functionally tests and monitors cloud products (like AWS, Azure, Stripe) as well as IT-related tools (Zoom, Jira) and general SaaS-style vendors (Stripe, Humi). From a range of actions, like looking at status pages, actively monitoring APIs, and observing how our customers’ production systems interact with these vendors, we try to distill a summarized status so our users are the first to know when something is amiss with any of their vendors.
If you squint at any observability tool, you will see a system in which a lot of asynchronous events flow around: observed metrics, aggregations of these, interpretations of aggregations (“this system is too slow”), and finally things like notifications. What we needed, therefore, was a system where “asynchronous processing” is not an addition in a library or an afterthought, but core to the whole stack.
A Quick Background on Elixir
For those who haven’t heard of Elixir yet, the language was created by José Valim in response to the issues that he, and many others, had with Ruby. Ruby (especially in the form of Ruby on Rails) excelled at letting people write web applications quickly. Unfortunately, it had serious shortcomings elsewhere. The performance was lacking, the OO model was very tricky to build out on (making for ‘big balls of mud’), and running any background processing was cumbersome.
Elixir solved these problems by starting with a proven platform: the Erlang programming language and its “Open Telecom Platform” architecture. Erlang was originally created to run telephony switches and similar networking equipment. However, people started to discover that its processing model coupled with its extreme reliability was a pretty perfect fit for the modern web. In fact, this language was used for WhatsApp and is considered to be an important aspect of why the messaging platform is so valuable. Because of Erlang/Elixir, only a relatively small number of engineers are needed to keep the largest chat application on the planet humming away.
José created a somewhat Ruby-inspired but very functional programming language on top of all that goodness, and released it to the world some ten years ago. In 2022, it appeared in Stack Overflow’s Developer Survey as the second-most loved programming language, and its main web framework, Phoenix, as the most loved web framework. Although popularity is not the best metric for selecting a software stack, José apparently was on to something.
Getting Started: Choosing Our First Tools
Metrist is a startup, and like any startup, we wanted to create an MVP that we could get in front of potential customers quickly. When a product is at such an early stage, founders are well advised to choose a stack that they’re comfortable with and that helps them demonstrate potential value. Scalability and maintenance are for later stages.
Our team started in much the same way. Our first soft launch was based on an assortment of technologies that were chosen simply because our team had experience with them or because they could otherwise speed up the development process.
For example, we used:
- C# as the backend programming language;
- Vue and Nuxt for the UI;
- Serverless on AWS as deployment target;
- A hosted object database (RavenDB) for persistence.
This approach worked well for the goal of getting something out to customers quickly to learn and iterate on. However, it had some issues when it came to building a lasting business.
The Need for Speed
After implementing an early version of the product on this stack, we quickly figured out that having a lot of AWS Lambda functions with a single-page-style application in TypeScript was not the best way to move fast. Additionally, we were considering the CQRS/ES model (Command Query Responsibility Segregation with Event Sourcing) of implementing our software. This model was attractive to us because it gave us the promise of being able to replay events, especially telemetry and error events, as well as our users’ responses to them.
As an experiment, we reworked our backend to a simple, homebrew event sourcing model. Our event log was DynamoDB, which signaled event handlers through SQS. It worked, and it came with Amazon’s promise of being very scalable (still, no servers were harmed in the whole endeavor). However, it was also very slow. It took a lot of time for an event to round-trip from the command side to the query side, and while we did like the model, we needed a nicer implementation.
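To make the model concrete, here is a minimal in-memory sketch of event sourcing, written in Python purely for illustration. It is not our implementation: the names `Event`, `EventLog`, and `LatencyView` are hypothetical, a plain list stands in for the DynamoDB event log, and direct function calls stand in for the SQS signaling. It only shows the two properties we cared about: handlers build a query-side view from events, and the log can be replayed to rebuild that view.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """One immutable entry in the event log (hypothetical shape)."""
    kind: str      # e.g. "telemetry" or "error"
    vendor: str    # which vendor the event is about
    value: float   # e.g. a measured latency in milliseconds


@dataclass
class EventLog:
    """Append-only log; a list stands in for a real event store."""
    events: list = field(default_factory=list)
    handlers: list = field(default_factory=list)

    def append(self, event):
        # Command side: record the event, then signal every handler.
        self.events.append(event)
        for handler in self.handlers:
            handler(event)

    def replay(self, handler):
        # Feed the full history to a handler, e.g. to rebuild a view.
        for event in self.events:
            handler(event)


class LatencyView:
    """Query side: average telemetry value per vendor."""

    def __init__(self):
        self.totals = {}
        self.counts = {}

    def handle(self, event):
        if event.kind != "telemetry":
            return  # this view ignores everything but telemetry
        self.totals[event.vendor] = self.totals.get(event.vendor, 0.0) + event.value
        self.counts[event.vendor] = self.counts.get(event.vendor, 0) + 1

    def average(self, vendor):
        return self.totals[vendor] / self.counts[vendor]
```

A new view can be attached at any time and brought up to date with `log.replay(view.handle)`, which is the replay promise that made the model attractive. The slowness we saw came from the round trip between the real log and the real handlers, which this sketch deliberately leaves out.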
Additionally, one of our projects in early 2021 was to write an agent that customers could install and run on their own premises. Users collecting their own telemetry about their SaaS vendors would be able to compare the performance they observed with the performance we monitored. At the same time, we felt that scheduling our monitoring jobs on Lambda was inflexible, so we were looking to replace it.
A Possible Alternative: Orchestrator
The Metrist Orchestrator was born out of the idea of making this agent and our private monitoring scheduler the same thing. It would be less work, it would put individual monitoring by customers on an equal footing with our own, feature-wise, and it would help us ensure that we found bugs before customers did.
C# was a likely candidate, but C# tends to produce very large executables. Also, what we wanted was a very reliable scheduling system. As some of us already had Elixir experience, listing “the sort of reliability that Elixir gives you” as a requirement for the Orchestrator was just one step away from “let’s just use Elixir.”
The Catch with the Orchestrator
The Orchestrator’s job is mainly to run and schedule independent processes: cloud monitoring jobs, but also configuration refreshes (the configuration resides on our backend, so these are API calls) and the reporting of telemetry. Then there are housekeeping jobs to make sure that we clean up after ourselves (after all, we do not want to eat all the disk space on a customer’s server).
All these jobs need to be supervised so they can be restarted when an error happens, or so that we can at least report that something is wrong and a human can look at it. Furthermore, again because we planned to run this on customer machines, we needed a minimal footprint: negligible CPU and memory use especially, and preferably not too big an executable. In short, our needs described the Erlang/Elixir ecosystem.
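The supervision requirement can be sketched in a few lines. The following is a deliberately naive Python illustration, not Orchestrator code and not how OTP works internally: `supervise`, `max_restarts`, and `report` are hypothetical names, and the job is just a function that may raise. It shows the two behaviors described above: restart on error, and report when restarting no longer helps.

```python
def supervise(job, max_restarts=3, report=print):
    """Run `job`, restarting it when it raises.

    Returns the job's result, or None once the job has failed more
    than `max_restarts` times. Every failure is passed to `report`
    so that a human can look at it.
    """
    failures = 0
    while True:
        try:
            return job()
        except Exception as exc:
            failures += 1
            if failures > max_restarts:
                report(f"giving up after {failures} failures: {exc}")
                return None
            report(f"restarting after failure {failures}: {exc}")
```

Writing this by hand for every kind of job, and getting it right for concurrent jobs, is exactly the work we did not want to do ourselves.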
The Elixir Solution
Elixir is built from the ground up to schedule, run, and supervise processes and to do so with a minimal footprint. This minimal footprint is largely an outgrowth of the fact that Erlang and the Open Telecom Platform (OTP) it provides got started in the ’80s, and the feature set we need today is not too different from the feature set the developers behind Erlang needed back then. The system was designed and built on extremely restricted hardware, and therefore it uses minimal resources on modern computers. As all communication between processes is asynchronous, the sort of message passing that matches our needs is built right in as well – literally everything works that way, so writing an observability platform on top of it feels very natural.
We had in-house experience, and we knew that training new developers on the language was relatively easy, so we went forward and built and released Orchestrator on this platform. A little tool called Bakeware took our code and converted it into a single executable, making distribution a breeze. The first customer installed it, and it ran so well that frankly everybody forgot about it.
At the same time, we converted our service monitor scheduling to use the exact same code base. We started in all AWS regions in North America, then added Google and Azure regions in early 2022.
Elixir and Metrist Today
All in all, the Metrist Orchestrator application is now running in dozens of instances (at the time of this writing, at least one Orchestrator instance is running in each North American region of AWS, Azure and GCP) and it is boringly reliable. Yes, we created bugs and had the joy of fixing them – but Elixir and its platform were never to blame.
Ultimately, our experience creating Orchestrator on Elixir confirmed that we had selected the right stack. However, this was just the beginning of our journey with this versatile language.
Tune in for our next installment, in which we’ll talk about how we took the learnings from this project and combined them with our ideas on CQRS/ES and infrastructure to radically simplify our backend architecture.