Why we decided to go for the Big Rewrite

One of the questions that any software engineer will face at some point in their career is: when is the BIG rewrite the right thing to do? And is it ever?

For small programs, where it would be a question of a few days or a few weeks to rewrite them, it will generally not be worth pondering this question for very long, and in most cases it will be better to just get started on either fixing the existing implementation or on rewriting it. For small programs it is possible for one person to keep all of their state, and their logic, and their invariants in their head, but for big programs this does not work.

Big programs are qualitatively different from small programs in that they consist of many moving and inter-locking parts, each with their own design decisions, trade-offs, and underlying assumptions. It is possible, and beneficial, to look at each part in isolation, but you also have to zoom out and look at the emerging picture of the whole system. For such systems it is no longer possible to keep everything in your head and this makes it much harder to make an informed decision on whether a rewrite would be the right thing to do or not.

In this post we will try to give a more general framework on how to answer this question for a specific project and we will also tell our story of rewriting the core data processing system that powers Channable.

Backstory

We used to be heavily invested in Apache Spark, but we have been Spark-free for six months now. We had already taken the decision to move away from Spark and HDFS in November 2017, but it took us more than a year before we could turn off the last Spark servers. We had written our core data processing system in Scala, on top of Spark, and we had added many features to it over the years. Fully replacing it with a new system took a humongous engineering effort, since we had to replace not only our own application, but also all of the features of Spark and HDFS that we relied upon. One key insight early on was that we could not replace our current system all at once; it was simply too big for this. Instead, we would have to do it piece by piece, running the old and the new system side by side and gradually migrating one feature after another.

When is a full rewrite appropriate?

The decision to move away from Spark was not an easy one. As an engineer, it is always tempting to think of a rewrite as a silver bullet that will fix all the issues and technical debt of an established system that has grown organically over the years. However, these rewrites then often end up being late, over budget, and under-delivering in terms of functionality and performance. This is so common that it’s known as the second-system effect. One of the main reasons for this is that for a company (and a startup in particular) it is not really possible to stop all new development for a year to focus on a rewrite of an internal system. This means it is essential that the old and the new system can evolve in parallel, with the new system slowly taking over more and more of the workload of the old system, until there is a point where the old system can be put into feature-freeze mode and new features only need to be added to the new system.

This raises the question: in which situations is it appropriate to decide on a full rewrite?

In theory, there is an easy answer to this question: If the cost of the rewrite, in terms of money, time, and opportunity cost, is less than the cost of fixing the issues with the old system, then one should go for the rewrite. In practice, it is impossible to give a robust estimate for any sufficiently large system for either the cost of a rewrite or the cost of evolving the old system because there are simply too many unknown unknowns [1]. Any project to turn those unknown unknowns into known unknowns and finally into known knowns will likely amount to as much effort as trying to implement both solutions at the same time.

We should therefore consider a few other factors instead. Let’s start with the technical ones:

There are also some questions about the migration process that we need to think about:

Additionally, we should also think about some social factors:

Our concrete reasons for a full rewrite

In our case, the answer to all of these questions was yes.

One of our original mistakes (back in 2014) had been that we had tried to “future-proof” our system by trying to predict our future requirements. One of our main reasons for choosing Apache Spark had been its ability to handle very large datasets (larger than what you can fit into memory on a single node) and its ability to distribute computations over a whole cluster of machines [4]. At the time, we did not have any datasets that were this large. In fact, five years later, we still do not. Our datasets have grown by a lot for sure, both in size and quantity, but we can still easily fit each individual dataset into memory on a single node, and this is unlikely to change any time soon [5]. We cannot fit all of our datasets in memory on one node, but that is also not necessary, since we can trivially shard datasets of different projects over different servers, because they are all independent of one another.

With hindsight, it seems obvious that divining future requirements is a fool’s errand. Prematurely designing systems “for scale” is just another instance of premature optimization, which many development teams seem to run into at one point or another [6].

One social factor that gave us confidence that a rewrite was the right way to go was the fact that we had rewritten our job scheduling system before with excellent results. This system has stood the test of time and has evolved with our new requirements both in terms of features and scale. Like back then, we took the time to systematically write down all of the shortcomings of our current system and debated these within the team to make sure that everyone had a deep understanding of the problem. Having a robust discussion about all the technical issues that we were facing with the old system convinced us that a rewrite was the right solution.

Some technical insights that we had over time were:

We do not need a distributed file system, Postgres will do. Our usage of HDFS came mostly as a side-effect of using Spark rather than as a deliberate choice. Over time we kept adding more and more features that are classic use cases for relational databases like pagination, joining datasets, querying and filtering of data, and searching through whole datasets. It was possible to implement these features in Spark, on top of HDFS, but it was clunky and performance was much worse than what you can get with Postgres [7].

We do not need a distributed compute cluster, a horizontally sharded compute system will do. Our datasets are all independent of each other and each dataset is small enough to fit into memory on a large server. We can therefore divide our datasets over n servers and scale horizontally by simply adding more servers. All of our datasets are refreshed at least once per day, so we also get rebalancing of the cluster for free.
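To make the sharding idea concrete, here is a minimal Python sketch that assigns each independent dataset to one of n servers. It is purely illustrative: the function name `server_for`, the `project-…` identifiers, and the choice of SHA-256 modulo n are assumptions made for this example and are not taken from the actual system.

```python
import hashlib
from collections import Counter


def server_for(dataset_id: str, num_servers: int) -> int:
    """Return the index of the server that should hold this dataset.

    Assumption for this sketch: a stable hash of the dataset id, taken
    modulo the number of servers. Python's built-in hash() is avoided
    because it is randomized per process for strings.
    """
    if num_servers < 1:
        raise ValueError("num_servers must be at least 1")
    digest = hashlib.sha256(dataset_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


# Hypothetical dataset ids: one independent dataset per project.
dataset_ids = [f"project-{i}" for i in range(1000)]

# How many datasets each of 4 servers would hold.
placement_4 = Counter(server_for(d, 4) for d in dataset_ids)

# After adding a fifth server, the same function yields a new placement.
placement_5 = Counter(server_for(d, 5) for d in dataset_ids)
```

Under a scheme like this, adding a server changes where many datasets belong. Because every dataset is re-imported at least once per day, each one can simply be written to whichever server the current placement picks at its next import, which is one plausible way to get rebalancing without a separate migration step.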

We do not need a complicated caching system, we can simply cache whole datasets in memory instead [8]. We use Postgres as our storage layer, but we do have an additional in-memory cache layer, since we need fast access to the entire dataset. This is implemented as a relatively simple local LRU cache that is managed by our compute processes that run on each server. Our datasets are all immutable, and we therefore need no complicated logic to sync data between Postgres and the cache but can simply swap out the whole dataset whenever it is re-imported.
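A local LRU cache of whole datasets can be sketched in a few lines of Python. This is an illustration under assumptions, not the real implementation: the class `DatasetCache`, its `load` callback (standing in for a database fetch), and the tuple-based dataset representation are all invented for this example.

```python
from collections import OrderedDict
from typing import Callable, Hashable, Tuple


class DatasetCache:
    """A small LRU cache that holds whole, immutable datasets."""

    def __init__(self, capacity: int, load: Callable[[Hashable], Tuple]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._load = load  # hypothetical loader, e.g. a database query
        self._entries: "OrderedDict[Hashable, Tuple]" = OrderedDict()

    def get(self, dataset_id: Hashable) -> Tuple:
        if dataset_id in self._entries:
            self._entries.move_to_end(dataset_id)  # mark as most recently used
            return self._entries[dataset_id]
        dataset = self._load(dataset_id)
        self._put(dataset_id, dataset)
        return dataset

    def replace(self, dataset_id: Hashable, dataset: Tuple) -> None:
        """Swap in a freshly imported dataset as a whole; no partial sync."""
        self._put(dataset_id, dataset)

    def _put(self, dataset_id: Hashable, dataset: Tuple) -> None:
        self._entries[dataset_id] = dataset
        self._entries.move_to_end(dataset_id)
        while len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used


loads = []  # records which datasets had to be fetched from storage


def fake_load(dataset_id):
    loads.append(dataset_id)
    return (("id", dataset_id),)


cache = DatasetCache(capacity=2, load=fake_load)
cache.get("a")
cache.get("b")
cache.get("a")  # cache hit, "a" becomes most recently used
cache.get("c")  # cache is full, so "b" is evicted
```

Because a dataset is never modified in place, `replace` is the only invalidation logic the sketch needs: a re-import overwrites the cached value as a whole.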

We do not need cluster-wide parallelism, single-machine parallelism will do. A major reason to choose Spark had been the promise of getting great performance by distributing computations over a whole cluster of machines. However, the overhead of cutting your dataset up into smaller chunks, serializing them, and sending them over the network is significant. Given that our datasets were <= 10GB in size, it did not make sense to do this, since the overhead was bigger than the gain.

We do not need to migrate the storage layer and the compute layer at the same time, we can do one after the other. We could import the same dataset into both the old and the new storage system in parallel, since both systems could run side-by-side independent of one another. There was also no further synchronization logic necessary, since our datasets are immutable. This also meant that we could gradually start using the data from the new storage system while we were still building the new compute system. Concretely, we could start moving over one endpoint after another, and we could ensure that there were no regressions by sending the same job to both the old and the new system and making sure that the results were identical.

How we went about the rewrite process

For big rewrite projects, there is always a high chance of failure, simply because there are so many unknowns and also because it can seem like an overwhelming task at first. There are however some principles that you can follow to increase the chances of success.

Avoid feature creep. An important decision early on was to avoid feature creep, and to only replicate existing functionality at first. Feature creep is one of the main reasons why big rewrite projects often fail, since it is easy to underestimate the time it takes to replicate the existing behavior and to overestimate one’s own abilities. In the later stages of the project we did have to add some new features, which customers had been waiting for, but this was only after we had established a solid base to build upon.

Test critical assumptions early. We made some big bets on Postgres and Haskell with this project. To lower the risk of failure we tested some critical assumptions about these core technologies right from the start. For Postgres, we worked out how we could best store our nested data structures in the database, and also how we could horizontally shard them over multiple servers. We also tested that both the disk and the network throughput on Google Cloud were sufficient, and that we could quickly load our datasets into the database. For Haskell, we worked out the data types for representing our core abstractions. We also gave some thought to the best way to do error handling later on.

Break project up into a dependency tree. We spent some time on breaking down the project into smaller parts and drawing out the dependencies between them. We then identified the parts that could be done independently from others and that could also be implemented as stand-alone services. In our case, we could separate the rewrite of the storage layer from the compute layer and focus on that first. Our earlier design decision to treat datasets as immutable paid off here, since we could simply import each dataset into both HDFS and Postgres in parallel and we did not need to worry about keeping them synchronized, since there were never any partial updates (just whole re-imports). This first stage of the rewrite could thus be wholly finished and tested within a few months.

Having the datasets available in Postgres then naturally led to the second stage of the project, which involved porting each feature that could be powered by Postgres alone, one by one, over to the new system. Another earlier design decision that paid off here was that all our internal services communicate with each other via well-defined APIs. We could therefore port one endpoint after another to the new system, and gradually roll that out in production. Rolling out these changes also had immediate positive value for our customers, since virtually all views and queries got much faster (on the order of 10x to 100x) and there was also less variance in query latencies.

Prototype as proof-of-concept. We quickly developed a first prototype of our data processing engine. This allowed us to explore different design ideas without committing to any specific design right away. One of our goals was to utilize Haskell’s strong type system by encoding important program invariants in it. We therefore spent some time on finding the right data types to represent the rules which our customers use to improve the quality of their datasets and which form the core of our data processing engine.

Get new code quickly into production. Once we had broken down the project into smaller parts we focused on getting the first stage quickly into production. This uncovered some issues with the first design, and allowed us to quickly iterate on it. For example, one thing that we tested early was our choice for the in-memory layout of our data. We compared a row-based layout to a column-based layout and discovered that the former was significantly faster for our typical workloads. Had we discovered this only much later, we would have wasted a lot of time and effort, since the later stages all build on the earlier ones. Similarly, with the whole import pipeline running in production we could already verify that we could meet our performance targets regarding import times and that total system throughput and capacity for data ingestion were sufficient.

Opportunistically implement new features. We have a long backlog of features that we want to implement, and of course not all are equally important. For the really time-pressing and important ones we had no choice but to implement them in both the old and the new system. But for the less time-critical ones, we had some leeway in choosing which ones to implement first and could therefore choose the ones that could already be powered by the new system. In general, we bootstrapped the new system in the order that would minimize the amount of duplicate work that we had to do. In the end, there were only a few features that we had to do twice.

Use black-box testing to ensure identical behavior. We had both the old and the new system running side-by-side in production. We could therefore send the same request to both systems and make sure that the results were identical. If they were not, we would log an error, so that we could later investigate the differences offline. This allowed us to find many edge cases without any impact on customers and also gave us great confidence that we had not introduced any behavioral changes in the new system.
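The comparison described above can be sketched as a thin wrapper around the two systems. This Python version is only an illustration: the names `shadowed`, `old_total`, and `new_total` are hypothetical, and the choice to always return the old system's answer to the caller is an assumption of this sketch.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("shadow")


def shadowed(
    old_system: Callable[[Any], Any],
    new_system: Callable[[Any], Any],
    report: Optional[Callable[[tuple], None]] = None,
) -> Callable[[Any], Any]:
    """Send every request to both systems and report any difference.

    The caller always receives the old system's result, so a bug in the
    new system cannot affect what the caller sees.
    """
    if report is None:
        def report(mismatch: tuple) -> None:
            logger.error("mismatch: request=%r old=%r new=%r", *mismatch)

    def handle(request: Any) -> Any:
        expected = old_system(request)
        try:
            actual = new_system(request)
        except Exception as exc:  # a crash in the new system counts as a mismatch
            report((request, expected, exc))
            return expected
        if actual != expected:
            report((request, expected, actual))
        return expected

    return handle


# Hypothetical old and new implementations of the same endpoint.
def old_total(prices):
    return sum(prices)


def new_total(prices):
    return sum(p for p in prices if p > 0)  # edge case: drops negative prices


mismatches = []
total = shadowed(old_total, new_total, report=mismatches.append)
first = total([1, 2, 3])  # both systems agree
second = total([5, -2])   # systems disagree; recorded for offline analysis
```

Collecting mismatches instead of raising keeps the comparison out of the request path, so differences can be investigated later.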

Build metrics into the system right from the start. Since we were writing a performance-critical application, we made sure that we tracked a few relevant metrics right from the start. For example, we tracked the time that the application spent on garbage collection, the time it took to fetch a dataset from the database, and the time it took to run whole jobs. We also wrote a few timing and benchmark scripts that helped us to quickly uncover bottlenecks in specific rules. We recently formalized this into a performance index that systematically tracks overall performance for a representative set of projects.
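Timing the phases of a job needs very little machinery. The following Python sketch records wall-clock durations per metric name. The metric names, the `timed` helper, and the `run_job` stand-in are all hypothetical, and garbage collection time is not measured here, since that number would have to come from the language runtime.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# metric name -> list of observed durations in seconds
timings = defaultdict(list)


@contextmanager
def timed(metric: str):
    """Record how long the enclosed block takes under the given metric name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[metric].append(time.perf_counter() - start)


def run_job(dataset_id: str):
    """Stand-in for a job: fetch a dataset, then apply rules to every row."""
    with timed("job.total"):
        with timed("dataset.fetch"):
            # Placeholder for a database fetch.
            rows = [{"dataset": dataset_id, "id": i} for i in range(1000)]
        with timed("rules.apply"):
            return [dict(row, seen=True) for row in rows]


result = run_job("project-1")
```

Nesting the timers gives both the total job time and a breakdown by phase, which is the kind of data that makes it possible to see where a slow job spends its time.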

Single-core performance first, parallelism later. We heavily optimized for single-core performance at first, since a single-core program is much simpler to reason about and easier to measure than a parallel one. This uncovered both specific bottlenecks within some functions and more general problems with some algorithms (or rather their implementation) and data structures. It turned out that for our small and medium-sized projects we could get better performance on a single core than with a whole cluster in our old system, which validated one of our design ideas. Once we had picked all the low-hanging fruit and there were no more big gains to be found in single-core performance, we turned our sights a bit higher and focused on adding support for parallel evaluation to the new system. Not building in support for parallelism right from the start may sound like a bad idea, but this is where our choice of Haskell really shone: all of our data structures were immutable and each job could be seen as a pure function. Many of our operations were trivially parallelizable (e.g. map operations) and we could leverage the rock-solid support for concurrency and parallelism of the Haskell runtime system. With single-machine parallelism we were thus able to surpass the performance of our old system on a multi-machine cluster, even for our biggest projects.
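Why pure functions over immutable data make a map operation easy to parallelize can be shown with a small sketch. It is written in Python rather than Haskell, with a made-up rule `apply_rule` and made-up rows. Note that Python threads do not speed up CPU-bound work because of the global interpreter lock, so this only illustrates the structure (split, map independently, recombine in order), not the performance.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Sequence, Tuple

Row = Tuple[str, float]  # hypothetical row: (title, price)


def apply_rule(row: Row) -> Row:
    """A pure function: the output depends only on the input row."""
    title, price = row
    return (title.strip().title(), round(price * 1.21, 2))


def run_sequential(rows: Sequence[Row]) -> Tuple[Row, ...]:
    return tuple(apply_rule(row) for row in rows)


def run_parallel(
    rows: Sequence[Row], workers: int = 4, chunk_size: int = 250
) -> Tuple[Row, ...]:
    """Split the rows into chunks, map each chunk independently, recombine."""
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, so the original row
        # order is preserved without any extra bookkeeping.
        results = pool.map(run_sequential, chunks)
        return tuple(row for chunk in results for row in chunk)


rows = tuple((f"  item {i} ", float(i)) for i in range(1000))
same = run_parallel(rows) == run_sequential(rows)
```

Since no chunk can observe or modify another chunk's data, the parallel result is identical to the sequential one regardless of worker count or chunk size, which is what makes it possible to add parallelism after the single-core version is already correct and fast.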

Conclusion

In November of 2017 we decided to embark on a big rewrite of our core data processing system, which had been built upon Apache Spark and HDFS. The old system had organically grown over many years and many features had been added, for which it had not been designed originally. This had led to a point where it was hard to add new features, it was hard to debug issues, performance varied widely, and there were some intractable issues that led to outages and operational grief. It is now almost two years later, and we are happy to report that the project has been a success. The new system is very reliable, easy to operate, easy to debug, can be scaled horizontally, and it is straightforward to extend and to refactor. Last but not least, we have put in a lot of effort to make performance great as well, which our customers have noticed.


1: The cost of upgrading the old system will likely be understood better, since you should have already attempted various improvements before even considering a rewrite.

2: Concretely: Is it hard to make changes? Is it difficult to debug issues? Is the system difficult to operate reliably? Are there intractable issues?

3: For example, buying some off-the-shelf solution, hiring somebody else to fix it, buying more hardware etc.

4: We have a separate dataset for every project of every customer internally.

5: The biggest instance that you can rent on Google Cloud right now has 416 vCPUs and 11 TB of memory.

6: Examples abound: Using Kubernetes when systemd would do, using a distributed database when Postgres would do, using microservices when a monolith would do, etc.

7: With Postgres most queries take a few milliseconds, while with Spark we saw latency between a few hundred milliseconds and tens of seconds.

8: We are working on bringing more sophisticated caching back in the future, since it is beneficial for our customers with large datasets.