Debugging a long-running Apache Spark application: A War Story

Prelude

Our Spark cluster was having a bad day. Our monitoring dashboards showed that job execution times kept getting worse and worse, and jobs started to pile up. A quick look at our monitoring dashboard revealed above average load, but nothing out of the ordinary. There was plenty of processing capacity left in the cluster, but it seemed to go unused. That was certainly odd, but nothing that warranted immediate investigation since the issue had only occurred once and was probably just a one-time anomaly. We restarted Sparkmachine1 (our aptly-named Spark application) and confirmed that execution times would go back to normal, meaning that we could get back to work on other more interesting things.

Unfortunately, the problems kept coming back and they seemed to be getting worse over time. The longer our application kept running, the worse it got. Since we regularly deployed new code (and therefore restarted our application), we hadn’t noticed the problems immediately. They really started to become apparent when we didn’t deploy for a while.

At this point, it was clear that there was something wrong somewhere, and that we would have to dig deeper to find the root cause of the problem. In the rest of this blog post, we will describe how we went ever further down the rabbit hole, and what we learned along the way. In particular, how we improved our monitoring of both Spark and our Spark application, how we added custom metrics that helped us, and how we analyzed a heap dump that gave us the missing clue that lead to a fix for the problem.

This post should hopefully be interesting to readers who also have long-running Spark applications and are looking for ways to get more insights into what the system is doing and where the bottlenecks are.

Background

We have been using Apache Spark for a few years already since it hit version 1.0 back in 2014. Our data processing engine, Sparkmachine, is built on top of Spark as a Spark application. It is responsible for processing hundreds of millions of items (e.g. product data) every day and it makes the processed data available as streams, which other services can consume to e.g. produce a data feed.

Most of this processing is done in the background as automated batch jobs. We do, however, also have a feature where customers can preview the processed data in our application. This makes it important for us that we can produce these previews quickly, so that we don’t keep our customers waiting and this also needs to work for datasets with millions of items.

To make this work, we utilize Spark’s ability to distribute a computation over a whole cluster, parallelizing e.g. map operations that have to be applied to each item. Another optimization that we use is dynamic code generation. We compile customer-supplied processing rules to JVM bytecode that we then broadcast to the Spark cluster using so-called Broadcast Variables . This makes the compiled code available to each Spark worker, which can then further JIT compile it to machine code. Using this technique enables us to reach our latency goals for processing jobs.

Our use case for Spark is thus a bit atypical: Instead of a few big and long-running jobs we instead have very many small and medium-sized ones. Our application is also a long-running process with strict uptime requirements. Fortunately, Spark is flexible enough to accomodate this use case. It does, however, take time and effort to configure and tune it and there are sometimes issues that require us to dig deeper into the internal workings of Spark, as you will see in this post.

Metrics and Monitoring

The first part of any debugging story usually involves gathering all of the facts. After all, it’s quite similar to a detective story but instead of interrogating some witnesses we will usually inspect a few stack traces, sift through the logs, correlate some possibly related events and come up with a theory.

This theory can then be tested and ideally the problem can then be reproduced and fixed.

In a distributed system like Spark this process can be more hairy, since a lot of events are happening at the same time and establishing cause and effect is not always obvious. We will therefore briefly discuss which metrics we tracked and which monitoring tools we used to do so.

Spark monitoring dashboard

Spark comes with its own web interface which provides basic monitoring functionality. It shows the jobs currently running on the Spark cluster and some information such as the total execution time, GC time and memory usage of the executors.

Whenever we experienced a severe slowdown, we usually had one or two Spark jobs that were running for quite a long time. This is why we initially suspected that some specific jobs would cause the problems.

We wrote a small tool that would regularly query Spark for the current jobs and alert us about those taking longer than a given threshold.

That gave us a good indication for when the problems were occurring, but only after the fact. We also tried to look for patterns in the slow jobs, but there was nothing that stood out to us.

Prometheus + Grafana

Besides the Spark dashboard, we also export metrics from Sparkmachine to Prometheus, a time-series based monitoring tool that gathers metrics from a multitude of sources. Almost all of our services export metrics that are collected by Prometheus. The metrics are visualized in Grafana, a web-based interface for inspecting the time-series data.

For every server, we always collect basic stats such as CPU load, memory usage and IO performance. The official Prometheus Java library we were using in Sparkmachine also provided metrics about the internals of the Hotspot JVM. At the time, we only displayed the time spent in garbage collection in our Grafana dashboard.

In addition to these low-level statistics, we were also exporting custom metrics about our internal processes. Such as the duration of job compilations and executions, cache hit rates, and some more.

All those metrics we had at this point were overwhelming to look at. We had a single Grafana dashboard with 21 shiny, colored graphs. Some of them would show abnormal behavior when our Spark application was slowing down, but none of them revealed the underlying cause.

Since the metrics we had displayed were not useful in our search, we decided to look at those metrics we collected but did not visualize in Grafana. While fiddling with those, one particular preview graph stood out:

Spikes in number of loaded classes correlating with application restarts

The important bits are the blue graph which shows the number of loaded classes. And the orange one, that represents the total number of garbage collections since the application was started. This means that whenever the orange graph drops back to zero, there had been a Sparkmachine restart.

Except for two deployment-related restarts, we always restarted when we had performance issues. And in all those cases, the number of loaded classes was excessively high.

Now we had a correlation, and a lead to investigate. The question then became: why does that number suddenly start to rise, and why is that a problem?

The second question was partly answered by the standard output logs of the Spark executors. They had always been empty before, when we checked them, but one day when we had issues again, there was a single message:

CodeCache: size=524288Kb used=519954Kb max_used=520044Kb free=4333Kb
 bounds [0x00007f13b4000000, 0x00007f13d4000000, 0x00007f13d4000000]
 total_blobs=122017 nmethods=121368 adapters=554
 compilation: disabled (not enough contiguous free space left)

The JVM JIT compiles JVM bytecode to machine code and then stores the compiled code in a CodeCache in order to avoid recompilation of identical bytecode.

Oddly enough, the JIT compiler ran out of space for storing the compiled code, so it disabled JIT compilation altogether. That could explain the severe slowdowns we experienced.

Surely, there had to be a way to configure the size and behavior of that cache.

Trials and Tribulations

During this phase, we were optimistic that finding the right JVM flags would solve our problem.

We went ahead and experimented with some JVM flags such as InitialCodeCacheSize, ReservedCodeCacheSize, UseCodeCacheFlushing, and ReservedCodeCacheSize. However, none of these made any substantial difference. The number of loaded classes would still rise inexorably and JIT compilation would be turned off at some point.

At this point we had to start looking for another suspect. One thing pretty high on our list were Java class loaders. A class loader is responsibly for dynamically loading new classes into the JVM. Every JVM has one default class loader, but you can additionally create your own class loaders, which are then children of the default class loader. Every class loader holds a certain number of classes. And importantly: as long as any of those classes in a class loader is still in use then the whole class loader cannot be garbage collected.

We therefore suspected that we possibly had a suboptimal granularity of class loaders, for example by having one class loader holding references to loaded classes of different jobs.

This would mean that if a class loader held 99 classes of a jobA and 1 of a jobB, those 99 classes were not being garbage-collected because their class loader could not be garbage-collected because this one class left was still being referenced somewhere (by a cache, for example).

After looking at the code and collecting metrics related to the number of class loaders alive, that turned out not to be the problem either. We had, in fact, one class loader per sub-task of a job.

Those were good news in terms of the current infrastructure as it already offered a pretty fine level of granularity. It meant, though, that our research had to dive yet a little deeper.

Finding the root cause

At this point, we were almost certain that the huge number of loaded classes was the main cause for the slowdowns that we were seeing. We suspected, that our dynamic code generator was responsible for creating all of those classes. However, those classes should have been very short-lived. After executing a job, the classes were not necessary any more and should have been garbage-collected. So what was keeping them from being garbage-collected?

To answer this question, we went ahead with getting a JVM heap dump using jmap and analysing it using the standalone Memory Analyser tool.

The heap dumped revealed that we had millions of objects of type org.apache.spark.CleanupTaskWeakReference and org.apache.spark.CleanAccum. Clearly, there should never be that many objects of this type alive at any point in time because they are cleanup tasks that should be executed right away.

We then focused on the Dominator Tree section, where you can see the list of the biggest objects in your dump. In the top three we could find the ContextCleaner class. The name just matched our assumptions about the cause of our issues, leading us next to go and investigate the source code of this class.

How the Spark ContextCleaner works

Because Spark is a distributed system, it has to manage some cluster-wide state, such as RDDs and Broadcast variables. It also needs to track these objects so that they can be removed from all servers once they are no longer needed. These objects are tracked by the so-called driver program, which runs on Spark master server (each Spark application consists of one driver program and many executor programs which run on the Spark worker servers). The driver program also creates one instance of the ContextCleaner class.

This class is responsible for removing RDDs, Broadcasts, and other objects that are no longer in use, from all executors in the cluster. You can think of it as the cluster-wide garbage collector.

Whenever a new object is created that holds a Spark resource, it is registered with the ContextCleaner:

It uses Java’s WeakReference mechanism to receive a notification from the garbage collector when an object associated with a Spark resource is no longer referenced in the application. When the object that is referred to by a WeakReference is collected, the JVM garbage collector inserts the WeakReference itself (not the object) into a ReferenceQueue, if given.

The referenceQueue member of the ContextCleaner is such a queue that is passed to every WeakReference created by the cleaner. In order to prevent the WeakReferences from being collected before the object they refer to, they are stored in the referenceBuffer member and thereby kept alive.

This is how the ContextCleaner knows when the “handle” for something (e.g. the Broadcast[_] instance that is a handle for a broadcast variable) is no longer referenced from user code. It uses a class derived from WeakReference in order to store the cleanup task together with the reference.

The actual cleanup happens in the keepCleaning function, which runs in a separate thread until stopped. It polls the referenceQueue (not the referenceBuffer), that is, it gets a weak reference (CleanupTaskWeakReference) whose object has already been collected.

It then checks the task that is associated with that weak reference

and performs the corresponding cleanup (e.g. removing the now unused broadcast variable from the workers). It also removes the weak reference from the referenceBuffer, enabling it to be garbage-collected as well.

The cleanup performed by doCleanup* causes Spark to drop the references to the contents of a Broadcast variable or RDD, allowing the GC to reclaim that space.

So all those CleanupTaskWeakReference objects that we had seen in the heap dump were kept alive because the keepCleaning thread had somehow not executed those clean-up tasks yet. The question thus was, why not?

Fixing the root cause

There seemed to be only two possible explanations for this:

Either the keepCleaning thread could not keep up, or the referenceQueue was somehow empty.

We first investigated the latter and looked how often the garbage collector is run by default in the Spark driver program.

And then, we saw the light:

The default garbage-collection interval was simply too long for our use case: 30 minutes. Running every 30mins was way too little since we generate hundreds of thousands of classes during peak load. This means that the keepCleaning thread is doing nothing for the first 30 minutes, and then is suddenly swamped with way too many tasks to keep up with. This problem then keeps getting worse and worse since the cleanup tasks are generated faster than they can be processed, which in turn leads to bigger and bigger GC heaps, a truly overwhelming task for the GC, until it can’t keep up any more and then the whole cluster dies in a slow and agonizing death.

The solution was simply to change the periodicGC.interval to every 3 mins instead of every 30 mins, ensuring that the GC runs often enough and never ends up lacking behind and being incapable of catching up.

We deployed this change and immediately saw a dramatic improvement, especially during the peak time around 4am:

Metrics showing the impact of a more frequent GC cycle

Or so we thought…

A couple of days after this last fix and having written with great joy what felt to us like the story of our Exodus, Spark went back to behave just like it was before: not coping during peak time.

This regression was truly surprising and unexpected. How could it be that the fix showed a tremendously positive impact during the first days and suddenly it seemed like the fix had lost its effect?

The only reasonable explanation was that the load at peak time had increased considerably, which indeed turned out to be the case because there were a number of big new projects.

At this point we tried decreasing the GC interval to 1min and see if it would help but the difference was minimal and insufficient.

After some more investigation, we further paid more attention to the fact that the ContextCleaner thread was blocking by default, meaning that it would sequentially tell each worker server to delete e.g. a specific BroadCastVariable and wait until that worker confirmed the deletion, before moving on to the next worker.

We were wondering why the cleaning thread would block by default and it turns out that it actually was non-blocking at first, but was later changed to be blocking as a workaround for a bug in Spark itself (SPARK-3015).

This workaround was already introduced 3.5 years ago, and the issue it worked around seemed to be triggered only in fairly specific circumstances. We therefore felt it worth a try to make the cleaning thread non-blocking.

By doing this, we were hoping to increase its throughput, since it can then send cleanup commands to all n workers concurrently instead of sequentially.

Since that change has been deployed, Spark has been working as we’d expect it to work by default: robustly and consistently. We haven’t seen our cluster dying a slow, crashing death ever since:

Metrics showing the impact making the ContextCleaner non-blocking

It makes us wonder how this has not been a big problem for many other Spark users. Especially people running a cluster with hundreds or even thousands of machines. Perhaps our atypical use case explains it alone. In any case, it is odd that Spark, meant for heavy and clustered computations is by default cleaned by a single, blocking thread.

Epilogue

In the end, our cluster performance problems were solved by two simple configuration changes. First, we had to make sure that the garbage collector in the Spark driver program was triggered more often. This would make sure that the queue with cleanup tasks would get filled as soon as a cluster-wide resource like e.g. an RDD was not needed any more and could be expediently removed from all workers. Second, we had to make sure that the cleanup thread was running in non-blocking mode. Otherwise, it could simply not keep up with the number of cleanup tasks that we were generating.

Given, that the fix for our problem was relatively simple in the end, the question remains how we could have known earlier. We were already tracking a whole number of metrics and graphing them in our dashboards, but there is still no silver bullet to knowing which metrics are important and which ones aren’t. It still takes experience and knowledge of the internals of a system to know what to look for.

After we had found the relevant metric (the number of loaded classes was way too high) it became simply a process of following this lead till the end. Along the way, a heap dump helped to show which part of the code was likely the problem (the ContextCleaner). And ultimately reading the source code of that component led to the right theory.


1: We’re really good at naming things. See also our job-scheduling system Jobmachine.