Erlang is a hoarder

One day you set aside a shoebox to store newspaper clippings. Suddenly you are trapped under an avalanche of whole newspapers and wondering how long your body will lie there before anyone misses you.

That is what kept happening to my Erlang apps. They would store obsolete binary data in memory until memory filled up. Then they would go into swap and become unresponsive and unrecoverable. Eventually somebody would notice the smell and restart the server.

The problem seems to be related to Erlang’s memory management optimizations. Sometimes an optimization becomes pathological. If you store a piece of binary data for a while (a newspaper clipping) Erlang “optimizes” by remembering the whole binary (the newspaper). When you remove all references to that data (toss the clipping) Erlang sometimes fails to purge the data (lets the newspapers pile up everywhere). If nobody shows up to collect the garbage, Erlang dies an embarrassing death.
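Here is a minimal sketch of the clipping-and-newspaper effect, assuming the clipping is a sub-binary matched out of a large reference-counted binary. The module name `clipping` and the sizes are made up for illustration. `binary:referenced_byte_size/1` reports how much data a binary is actually keeping alive, and `binary:copy/1` makes a fresh copy that no longer points at the original.

```erlang
-module(clipping).
-export([demo/0]).

%% Returns {BytesKeptAliveByClip, BytesKeptAliveByCopy}.
demo() ->
    %% The "newspaper": a large binary built at runtime.
    Newspaper = binary:copy(<<"x">>, 1000000),
    %% The "clipping": a 10-byte sub-binary that still refers to the newspaper.
    <<Clip:10/binary, _/binary>> = Newspaper,
    %% A detached copy of the clipping that refers only to its own 10 bytes.
    Copy = binary:copy(Clip),
    {binary:referenced_byte_size(Clip), binary:referenced_byte_size(Copy)}.
```

If the first number comes back much larger than the second, holding on to `Clip` is holding on to the whole newspaper. Copying before storing is one way around that, at the cost of the copy itself.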

The first step to recovery is to monitor the app’s memory footprint and log in every so often to sweep out the detritus. It can be tricky to find the PIDs that need attention and tragic if you arrive too late. The permanent solution is to build periodic garbage collection into the app. It’s not hard to do. The only hazard is doing it too often since it incurs some CPU overhead.
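Finding the PIDs that need attention can be sketched roughly as below. This assumes that `erlang:process_info(Pid, binary)` returns a list of `{Id, Size, RefCount}` tuples, which is what current emulators do, but the documentation warns that this item may change without notice. The module name `bin_hogs` is hypothetical.

```erlang
-module(bin_hogs).
-export([top/1]).

%% Return up to N {Pid, Bytes} pairs, largest first, where Bytes is the
%% total size of the off-heap binaries the process currently references.
%% A binary shared by several processes is counted once per process.
top(N) ->
    Sizes = [{Pid, binary_bytes(Pid)} || Pid <- erlang:processes()],
    Sorted = lists:reverse(lists:keysort(2, Sizes)),
    lists:sublist(Sorted, N).

binary_bytes(Pid) ->
    case erlang:process_info(Pid, binary) of
        {binary, Bins} ->
            %% Assumed tuple shape; elements of any other shape are skipped.
            lists:sum([Size || {_Id, Size, _RefCount} <- Bins]);
        undefined ->
            0   % the process exited while we were looking
    end.
```

Calling `bin_hogs:top(5)` from a remote shell gives a short list of candidates to try `erlang:garbage_collect/1` on.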

Each time I have found an app doing this, I’ve had to locate the offending module and install explicit garbage collection. If there is a periodic event, such as a timeout that happens every second, I’ll use it to call something like this:

gc(Tick) ->
    %% Force a collection once every 60 ticks (once a minute at one tick per second).
    case Tick rem 60 of
        0 -> erlang:garbage_collect(self());
        _ -> ok
    end.

Today I installed this simple code and here is the result:

Memory footprint reduced drastically

CPU utilization raised slightly

For the cost of 5% of one CPU core I stopped the cycle of swap and restart. I would like to learn why my binaries are not being garbage collected automatically. The processes involved queue the binaries in lists for a short time, then send them to socket loops which dispose of them via gen_tcp:send/2. Setting fullsweep_after to 0 had no effect. I’ll be interested in any theories. However, I’m not looking for a new solution since mine is satisfactory. I hope other Erlang hackers find it useful.
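For a socket loop that has no periodic event of its own, one way to get one is a self-addressed timer message. This is only a sketch under assumptions: the module name `sock_loop`, the `{send, Bin}`, `tick` and `stop` messages, and the one-second interval are all invented for illustration, and error handling for `gen_tcp:send/2` is left out.

```erlang
-module(sock_loop).
-export([start/1, gc/1]).

%% Start a hypothetical socket loop that ticks once a second.
start(Socket) ->
    spawn(fun() -> schedule_tick(), loop(Socket, 0) end).

schedule_tick() ->
    erlang:send_after(1000, self(), tick).

loop(Socket, Tick) ->
    receive
        {send, Bin} when is_binary(Bin) ->
            _ = gen_tcp:send(Socket, Bin),   % result ignored in this sketch
            loop(Socket, Tick);
        tick ->
            gc(Tick),
            schedule_tick(),
            loop(Socket, Tick + 1);
        stop ->
            ok
    end.

%% Force a collection on every 60th tick, starting with tick 0.
gc(Tick) ->
    case Tick rem 60 of
        0 -> erlang:garbage_collect(self());
        _ -> ok
    end.
```

The 60-tick interval is the knob for the CPU-versus-memory tradeoff: a smaller number frees binaries sooner and costs more CPU.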

Published by

Andy Skelton

Code Wrangler Automattic

16 thoughts on “Erlang is a hoarder”

    1. On each graph, memory and CPU, the vertical axis is scaled to the total capacity of that resource. They illustrate the tradeoff between CPU and memory: for a tiny slice of the available CPU, we get most of the memory back.

      The CPU graph goes to 1600% because there are 16 CPU cores in that server. That’s just how Munin makes the graphs.

  1. I do not use erlang – really at all – but does the problem still exist if the process where the garbage was being collected dies?

  2. Although it’s not impossible, it’s still unlikely that there’s an actual bug in the garbage collection (in particular since when you trigger the GC manually, all the junk does get cleaned out). It’s not really that “Erlang sometimes fails to purge the data”, but my guess is that you’ve done a lot of work on binaries, perhaps without allocating and freeing much other data. So the process heap is full of small handles to reference-counted large binary chunks stored off heap. The heap itself has not filled up enough to trigger another round of GC since last time (if the process started out by doing some heavy work, the heap area may have grown to a fair size), and if the process has now gone into waiting in a receive loop, there is no reason for it to suddenly start GCing unless you explicitly tell it to. (A sleeping process with a list of 1000 integers on the heap is no problem, but a list of 1000 handles to large off-heap binaries could be bad.) I think this problem could be fixed by making the emulator do a scavenging sweep over inactive processes with large heaps that have not already been forced to GC. Meanwhile, it’s always a good idea to force a GC when a process switches phase from allocation-heavy work to being mostly inactive.

  3. Does erlang not automatically garbage collect when it runs out of memory, rather than swapping to disk and eventually dying completely?

  4. So the problem you’re experiencing with large binaries has to do with the fact that they are reference counted and, as a function of that, keep track of *every process that touches them* whether or not they actually *do* anything with the data. Functionally this means that a large binary cannot be destroyed until every process that has ever touched it has been garbage collected first. This can be especially problematic if you do a lot of matching on binaries or otherwise come about a large number of sub-binaries.

    So, suppose you have an intermediate process that’s routing a request to another process. In such a situation, the intermediate process must also be garbage collected, but since such a process is very lightweight, it’s not garbage collected very frequently, meaning your large binaries live for a longer time.

    You can mitigate this problem by shrinking the “fullsweep_after” parameter (default 65535; I’m using 10 atm), or, more effectively, by not passing your binaries around unnecessarily. In my example, you could mitigate the problem by only passing the essential information for routing to the intermediate process and having the intermediate process return the pid of the destination process.

    Good breakdown on the details here:

    1. Thank you for the clear explanation! Lowering fullsweep_after didn’t help. As it turns out I am GCing all of the processes on timers. Memory hasn’t been a problem since I started doing that last year.
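For anyone who wants to try the fullsweep_after suggestion from the thread on a single process instead of the whole node, a rough sketch follows. The module name `sweep_opts`, the router process and its `{route, Dest, Msg}` message are hypothetical. The value 10 is the one quoted in the comment above.

```erlang
-module(sweep_opts).
-export([spawn_router/0, fullsweep_after/1]).

%% Spawn a lightweight routing process with a low fullsweep_after,
%% so that a full sweep happens after at most 10 minor collections.
spawn_router() ->
    spawn_opt(fun router_loop/0, [{fullsweep_after, 10}]).

router_loop() ->
    receive
        {route, Dest, Msg} ->
            Dest ! Msg,
            router_loop();
        stop ->
            ok
    end.

%% Read the setting back from a live process.
fullsweep_after(Pid) ->
    {garbage_collection, Info} = erlang:process_info(Pid, garbage_collection),
    proplists:get_value(fullsweep_after, Info).
```

Note that this setting only changes what a collection does once one is triggered. A process that sits idle in a receive never triggers one, which may be why lowering it made no difference here while timer-driven collection did.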

Comments are closed.