When you start a company with the idea to use external funding to get it going, the first order of business is not to build the most awesome, scalable, presentable infrastructure. The first order of business is to build what is essentially a demo that shows the essence of your ideas to potential early stage investors. And nobody cares whether that demo survives first contact with the real world out there. That first cut needs to be compelling, it needs to be a good advocate for the business case, and hopefully it will stretch to support the first couple of users while the “serious engineering” takes off.
When you start a company in 2020, companies like Amazon are out there to help you, advertising “serverless” products that promise to let you get started without too much knowledge of hosting infrastructure, servers, and all these details. Also, it is the 21st century, so who wants the hassle of maintaining a database schema anymore? We have document databases and NoSQL now.
Therefore, the first version of Metrist got written in C# (you pick the language you’re most familiar with), deployed on AWS Lambda as a series of functions behind an API Gateway, with data stored in a document database, RavenDB, also as a hosted offering. A front-end got built in Vue/Nuxt, and the little bits of AWS infrastructure that needed to be set up were managed through the product of a company called Stackery, which has since been acquired by AWS.
All was good, the product got built, monitors got written, and impressive demos were shown to potential investors and customers alike. The stack was as up-to-date as could be, so all should be well, no? Everybody lived happily ever after, and this would have been a short and boring blog post.
Introducing CQRS/ES
Not so fast. As mentioned before, one thing we realized early on is that there’s value in the sort of data we are collecting, and therefore there is value in collecting it as precisely as possible. There are two ways to store data when something happens:
- Store the resulting state after the change. This is what everybody does; it is called “CRUD” – you store the new address, overwriting the old one;
- Store the change itself. This is what design approaches like CQRS/ES advocate – you store the fact that user X updated the street address to B, while keeping the information that, a year ago, user Y initially created it as A.
We felt that by focusing on storing changes, we’d capture more valuable data for later processing. The state that results from these changes can always be recalculated, but if you throw the changes away, they’re gone. If you want to capture change instead of state, you may as well go with the premier established pattern around it, so we decided to adopt CQRS/ES and start marching in that direction. For a more in-depth view of “state vs changes” and how CQRS/ES works, give one of Greg Young’s videos on CQRS/ES a view, like this one.
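To make the difference concrete, here is a minimal sketch in plain Elixir (the data shapes are invented for illustration):

```elixir
# CRUD: only the latest state survives.
user = %{id: 42, street: "A"}
user = %{user | street: "B"}
IO.inspect(user) # %{id: 42, street: "B"} – "A" is gone for good

# Event sourcing: every change is an immutable fact in an append-only log;
# the current state is just a fold over those facts, and history is never lost.
events = [
  %{type: :address_created, by: "user Y", street: "A", at: ~D[2021-03-01]},
  %{type: :address_changed, by: "user X", street: "B", at: ~D[2022-03-01]}
]

current =
  Enum.reduce(events, %{}, fn event, state ->
    Map.put(state, :street, event.street)
  end)

IO.inspect(current) # %{street: "B"}, with the full history still sitting in `events`
```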
Implementing CQRS/ES requires some machinery, which is readily available in straight .NET (it’s more popular there than anywhere else, probably thanks to the prevalence of business applications written in .NET), but it was not available out of the box in our AWS “serverless” architecture. Essentially, you need two things:
- An event store, as all events together form one big append-only log of things that happened;
- An asynchronous processing machine that gets triggered by new events and does things like update “projection” databases, which is where the actual querying of data happens.
Staying true to our “serverless” approach, we looked around and came up with the following, somewhat common, setup:
- We store all events in DynamoDB.
- Writes to DynamoDB trigger, via Kinesis, SQS notifications that route to Lambda functions for further processing.
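For illustration only – our real code was C# at the time – appending such an event with Elixir (assuming the usual `ex_aws_dynamo` and `jason` dependencies; table and field names are invented) would look roughly like this:

```elixir
# One event = one DynamoDB item. The stream id plus a monotonically
# increasing version number form the append-only log; Kinesis/SQS/Lambda
# pick the item up asynchronously afterwards.
event = %{
  stream_id: "user-42",
  version: 7,
  type: "address_changed",
  payload: Jason.encode!(%{street: "B", changed_by: "X"}),
  inserted_at: DateTime.utc_now() |> DateTime.to_iso8601()
}

ExAws.Dynamo.put_item("events", event)
|> ExAws.request!()
```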
Some framework code in C# was quickly whipped up to support the basic use cases we envisioned, and we slowly started moving code away from writing directly to RavenDB and towards emitting events that got sent into this machine. It all worked quite well and showed us that this direction was indeed where we wanted to go, but it had one drawback: it was very slow.
The event processing in CQRS/ES is asynchronous and this causes complications. For example, a common way to deal with data changes in “standard” applications is to “read after write” – you change an address, write it to the database, then show the user their profile page again with the updated address by reading it back so you’re sure you have the up-to-date information. This is wasteful (you just wrote the data – you have it right there!) but helps in decoupling the “address change” code from the “display address” code so more often than not, it is how things get implemented.
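In code, that familiar pattern looks something like this Ecto-flavored sketch (module, repo, and schema names are invented):

```elixir
defmodule MyApp.Accounts do
  # "Read after write": persist the change, then immediately read it back,
  # so the caller is sure it has the up-to-date record.
  def update_address(user_id, attrs) do
    user = MyApp.Repo.get!(MyApp.User, user_id)

    {:ok, _} =
      user
      |> Ecto.Changeset.cast(attrs, [:street, :city])
      |> MyApp.Repo.update()

    # Wasteful (we just wrote the data!) but it keeps the "change address"
    # code nicely decoupled from the "display address" code.
    MyApp.Repo.get!(MyApp.User, user_id)
  end
end
```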
In CQRS/ES, however, the user action causes an “address changed” event, which in our system gets written to DynamoDB, and that’s it. In the background, AWS spins up machinery to eventually get that event to a triggered Lambda function that responds to “address changed” events and writes the new address to the read database, but that takes a long time: typically 5-10 seconds. In the meantime, the user profile page would still show the old address, and only after some manual refreshing would the new one come up.
Clearly, this was not acceptable. Either we needed to keep the new address around in memory somehow, or shorten the delays by at least an order of magnitude. Preferably we could do both – we are somewhat database heavy and the less unnecessary load on it, the better.
Adding Elixir And Moving To AWS ECS
As mentioned before, we were also looking at Elixir around that time. The library support it offered in the form of Phoenix LiveView and Commanded promised a more integrated approach with fewer moving parts to worry about and – most importantly – a way to both close the delay gap and naturally solve the “avoid read-after-write” issue.
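To make that concrete, here is a rough sketch assuming Commanded’s `consistency: :strong` option (application, event, command, and read-model names are all invented):

```elixir
# Invented command/event structs for the example.
defmodule UpdateAddress, do: defstruct [:user_id, :street]
defmodule AddressChanged, do: defstruct [:user_id, :street]

defmodule Metrist.Projectors.Address do
  # Keeps the read model up to date. With `consistency: :strong`, a
  # dispatch that asks for strong consistency blocks until this handler
  # has processed the resulting events.
  use Commanded.Event.Handler,
    application: Metrist.App,   # the (assumed) Commanded application
    name: __MODULE__,
    consistency: :strong

  def handle(%AddressChanged{user_id: id, street: street}, _metadata) do
    # Hypothetical helper that writes to the projection database.
    Metrist.ReadModel.upsert_address(id, street)
    :ok
  end
end

# Somewhere in the LiveView handling the form submit: by the time this
# returns, the projection above has caught up – no read-after-write dance,
# no multi-second wait on background machinery.
:ok =
  Metrist.App.dispatch(
    %UpdateAddress{user_id: 42, street: "B"},
    consistency: :strong
  )
```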
Elixir applications typically are “happier” when deployed as a monolith. In fact, one could argue that the stack already includes most of what platforms like Kubernetes offer to “lesser” stacks, making it very easy to keep monoliths manageable and scale individual parts without having to resort to complexity-exploding measures like microservices.
So we decided to (gradually) replace the functionality spread out over dozens of AWS Lambda functions with a single Elixir monolith. By then, we had also decided that RavenDB did not offer what we wanted; the Commanded library “wanted” PostgreSQL anyway, so we might as well combine both the “event log” and the “projection database” in a single instance, hosted on AWS with their RDS product. As we were spreading our synthetic monitoring to other platforms as well (first to Azure and, a bit later, Google Cloud), we also decided that the AWS-specific Stackery product had seen its best days for us and took the opportunity to migrate our infrastructure to Hashicorp’s Terraform.
Where to deploy Elixir? We still did not want to have too much to do with infrastructure management, so the path of least resistance that presented itself for our very simple requirements was Amazon’s Elastic Container Service, which essentially is just what you want: it takes your code and runs it. The only “restriction” is that your code needs to be a Docker container, which is simple enough to accomplish. An ALB load balancer in front of it helps with high availability and roll-outs, making things seamless to the user: with two instances of the monolith running, one could go away, either because of technical issues or as part of a roll-out, and the collaboration between ECS and the load balancer would ensure that users never noticed.
The Need For Speed: Back To Virtual Machines
However, the interplay between ECS and ALB was very much visible to developers. We are proud of our attention to “developer experience” – you can’t have happy users if you don’t have happy developers first, and with a very small team juggling so much, it becomes very important to choose the paths of least nuisance. So, we try to make our lives simple by doing things like continuous deployment triggered through GitHub Actions. Our flow is simple: you work on something, make sure it runs in our staging environment, then set the corresponding ticket in Jira to “ready for review”; someone else will then pick up the code review and merge when they’re ok with your changes. The merge triggers a production roll-out, no actions needed except for the reviewer to wait until the roll-out completes, do a quick sanity check on production, and maybe take a screenshot to announce completion of a feature in an internal Slack channel.
The important bit in that sentence is “to wait”. How long can you make someone wait before their attention span wanders off, they go do something else like hang out on Slack or Reddit, or even get tempted to do a full context switch by starting to work on a next ticket? A good developer experience requires a single flow from review through sanity check to moving the card to the Done lane, and we somewhat arbitrarily set that wait time to 5 minutes. Hit merge, get up, get a cup of tea (steep time: 4 minutes), walk back, and you can start testing. If it’s longer, it quickly becomes annoying and a chore.
That is 5 minutes to:
- Do the actual merge in Git (very quick) and wait for GitHub Actions to spin up our job (variable, but usually very quick);
- Compile the merged code (in Elixir, reasonably quick – 30 seconds);
- Run the whole test suite again (we need to keep our suite tight – another minute);
- Package everything up as a container and upload the container to some container repository (easily 30-60 seconds depending on the amount of “luck” around the CD machine’s cache);
- Do a rolling deployment so nobody sees error pages.
Given that ~3 minutes will be spent in the compile/test/package steps, we have just a couple of minutes left to deploy. And whatever we tried, ECS wants to be really careful about ensuring that everything deploys in the right order, at the right time, without any hiccups, and with plenty of health checks and pauses in between: we could not get ECS to do the kind of deploy we wanted in the time we wanted. More often than not, just ECS’ deploy machinery “waking up” took the whole time budget. With mostly API calls and LiveView’s seamless reconnect, we did not need most of that carefulness at all, but still, ECS forced us to “pay” for it. We needed something simpler to make that time budget. And, given that we wanted to run our active monitoring on non-AWS platforms as well, we needed something “lowest common denominator” across cloud platforms, which pretty much still means “run a virtual machine and do everything yourself.”
As we had a reasonable amount of Terraform code already, the step of removing ECS and replacing it with plain EC2 was fairly simple. All we needed on top of that was a little program running on each VM that did the following:
- Regularly poll for a new version of the container (in a private S3 bucket).
- Pull down the container.
- Remove the instance from the load balancer.
- Shut down the old version.
- Start the new version.
- Add the instance to the load balancer.
And it had to do all this in such a way that the two instances of this little program would not act at the same time, one waiting for the other, to give us a simple rolling upgrade.
Luckily, Erlang (the platform underneath Elixir) has clustering and global locking built in. So, by implementing these steps in Elixir, all we needed to do to ensure this happened one machine after the other was:
- Deploy the little program so that it would run as part of an Erlang cluster.
- Wrap the third through last steps in an Erlang `global` lock (the first two steps can be done in parallel, of course), as sketched below.
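A condensed sketch of that coordination, with the S3, load balancer, and process-management calls stubbed out as hypothetical helpers:

```elixir
defmodule Deployer do
  # Runs on every VM as part of one Erlang cluster. Polling and pulling
  # happen on all nodes in parallel; the disruptive steps run inside a
  # cluster-wide :global lock, so only one machine is ever out of the
  # load balancer at a time – a simple rolling upgrade.
  def maybe_upgrade do
    if new_version_available?() do
      pull_new_release()

      :global.trans({:rolling_upgrade, node()}, fn ->
        deregister_from_load_balancer()
        stop_old_release()
        start_new_release()
        register_with_load_balancer()
      end)
    end
  end

  # The helpers below stand in for the real S3, ALB, and process
  # management calls (roughly 500 lines in total in our version).
  defp new_version_available?, do: false
  defp pull_new_release, do: :ok
  defp deregister_from_load_balancer, do: :ok
  defp stop_old_release, do: :ok
  defp start_new_release, do: :ok
  defp register_with_load_balancer, do: :ok
end
```

The `:global.trans/2` call only runs its function after acquiring the named lock across the whole cluster, which is what gives us the one-machine-at-a-time behaviour for free.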
As Elixir sports a very good AWS client library, it was easy enough to talk to the load balancer as well, so after a couple of days we had a little daemon that got packaged up as a Docker container and launched as part of a cloud-init script that we had Terraform install. The tool does its job, the upgrades work as expected, and we could focus on building the product again. Yes, we’ve written roughly 500 lines of deployment tooling, but it is very trivial code, pretty much maintenance free, and it is in the language we use on a daily basis, so if someone needs to change something, it is easy to quickly refresh their knowledge.
Saving Memory By Ditching Docker
However, we did notice that the Docker daemon sometimes gobbles up a lot of memory. We suspected the AWS logging plugin (our logs get sent to AWS CloudWatch), which sometimes could not get rid of the logs quickly enough and started buffering, but we never really could figure out what exactly caused it. We did notice, though, that the daemon would steal multiple gigabytes of memory from our application, and that caused a couple of out-of-memory crashes.
When a component is giving you trouble, check what value it adds to your system, and if you don’t need it, toss it. In our case, an Elixir release is self-contained already, and the standard tooling will create a tarball that you can launch in two simple steps: unpack it, then run the included shell script. Docker was not really adding much here: the caching is useless if you run just one app, we did not use its networking facilities, and whether you copy a tarball to a private S3 bucket or a Docker image to a private registry is hardly a big difference either. All we really used was the automatic log interception and forwarding to CloudWatch, and that very feature seemed to be the one giving us trouble.
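That tarball is just standard Elixir release tooling; a minimal, hypothetical `mix.exs` configuration looks like this:

```elixir
# mix.exs
def project do
  [
    app: :metrist,
    version: "0.1.0",
    releases: [
      metrist: [
        include_executables_for: [:unix],
        # :assemble builds the release; :tar wraps it in the tarball we
        # copy to S3. On the VM it is: unpack, then run `bin/metrist start`.
        steps: [:assemble, :tar]
      ]
    ]
  ]
end
```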
So, Docker was out as well. It took us a day to figure out how to send logs to CloudWatch when running under Linux’s standard systemd supervisor, and that’s how we run today: a monolith on a virtual machine with a very light layer of homebrew scripting for continuous deployment. It’s almost like it’s 2005 again…
Conclusion
Our setup is boring, utterly boring. We have two application server instances on two small virtual machines, backed by a PostgreSQL database. We have a single programming language, Elixir, that handles most of the backend and frontend functionality. Deployments are fast and automatic, and we hardly ever have infrastructure changes – the most “exciting” thing we’ve done recently is to add a bit of Sentry monitoring, because the AWS Lambda function we had to help us out with signaling errors in logs was a bit too much of an odd duck (it was our only remaining Lambda function). Otherwise, things run on decades-old tooling that stays out of our way, so we can fully focus on product development, which is what we are supposed to spend our time on.
While we did not set out to go back to the past, we happened to end up with something that looks suspiciously like it. We simply followed our goals, like keeping things simple enough that we can all understand them (important if you don’t have a separate infrastructure department), requiring a decent developer experience with tools that stay out of the way and operate quickly, and not introducing tools “because everybody else uses them.” We’re small, every moving part you add will invariably give you additional headaches, and we are able to do well without them.
Our particular stack was created for our particular situation. Running a dozen clusters across three cloud providers, each running a single workload, made a standard solution like Kubernetes way less attractive, and having the Elixir/Erlang stack made rolling our own _reliable_ roll-out script trivial. As the usual disclaimer goes, your mileage may vary. But we do hope that we have illustrated that being critical about your tooling and not blindly following the loudest voices in our somewhat fashion-driven industry can help you land in a very comfortable place.