Erlang is a hoarder

One day you set aside a shoebox to store newspaper clippings. Suddenly you are trapped under an avalanche of whole newspapers and wondering how long your body will lie there before anyone misses you.

That is what kept happening to my Erlang apps. They would store obsolete binary data in memory until memory filled up. Then they would go into swap and become unresponsive and unrecoverable. Eventually somebody would notice the smell and restart the server.

The problem seems to be related to Erlang’s memory management optimizations. Sometimes an optimization becomes pathological. If you store a piece of binary data for a while (a newspaper clipping) Erlang “optimizes” by remembering the whole binary (the newspaper). When you remove all references to that data (toss the clipping) Erlang sometimes fails to purge the data (lets the newspapers pile up everywhere). If nobody shows up to collect the garbage, Erlang dies an embarrassing death.

The first step to recovery is to monitor the app’s memory footprint and log in every so often to sweep out the detritus. It can be tricky to find the PIDs that need attention and tragic if you arrive too late. The permanent solution is to build periodic garbage collection into the app. It’s not hard to do. The only hazard is doing it too often since it incurs some CPU overhead.
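To give a flavor of the triage involved, here is a rough helper I might load into a remote shell (a sketch only; the tuple format returned by process_info(Pid, binary) is undocumented and could change between releases). It ranks processes by the total size of the off-heap binaries they reference, so the worst offenders can be swept by hand with erlang:garbage_collect(Pid):

%% Rank processes by total bytes of refc binaries they reference.
top_binary_holders(N) ->
    Ranked = lists:reverse(lists:sort(
                 [{binary_bytes(P), P} || P <- erlang:processes()])),
    lists:sublist(Ranked, N).

binary_bytes(Pid) ->
    case erlang:process_info(Pid, binary) of
        {binary, Bins} -> lists:sum([Sz || {_Addr, Sz, _Refs} <- Bins]);
        undefined      -> 0   % process exited while we were looking
    end.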

Each time I have found an app doing this, I’ve had to locate the offending module and install explicit garbage collection. If there is a periodic event, such as a timeout that happens every second, I’ll use it to call something like this:

%% Assuming a once-per-second tick: force a collection of this
%% process every 60 ticks, i.e. once a minute.
gc(Tick) ->
    case Tick rem 60 of
        0 -> erlang:garbage_collect(self());
        _ -> ok
    end.
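
The surrounding wiring varies from app to app, but in a gen_server it might look something like this sketch (the tick message and counter-only state are my own invention for illustration):

%% Hypothetical gen_server wiring: a self-rescheduling one-second
%% timer drives gc/1, so the process sweeps itself once a minute.
%% Here the server state is just the tick counter, for brevity.
init(_Args) ->
    erlang:send_after(1000, self(), tick),
    {ok, 0}.

handle_info(tick, Tick) ->
    gc(Tick),
    erlang:send_after(1000, self(), tick),
    {noreply, Tick + 1}.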

Today I installed this simple code and here is the result:

[Graph: memory footprint reduced drastically]

[Graph: CPU utilization rose slightly]

For the cost of 5% of one CPU core I stopped the cycle of swap and restart. I would like to learn why my binaries are not being garbage collected automatically. The processes involved queue the binaries in lists for a short time, then send them to socket loops which dispose of them via gen_tcp:send/2. Setting fullsweep_after to 0 had no effect. I’ll be interested in any theories. However, I’m not looking for a new solution since mine is satisfactory. I hope other Erlang hackers find it useful.
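
For anyone who wants to experiment with that knob themselves, fullsweep_after is set per process at spawn time (there is also a VM-wide default via erlang:system_flag/2). A minimal sketch, with a placeholder loop:

%% Spawn a process whose collector does a full sweep on every GC,
%% rather than the default generational behavior. In my case this
%% setting made no difference.
start() ->
    spawn_opt(fun loop/0, [{fullsweep_after, 0}]).

loop() ->
    receive _Msg -> loop() end.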


16 Comments

  1. That’s a really terrible graph on the bottom. Why does the vertical axis go up to 1600% to show a 5% increase?

    • On each graph, memory and CPU, the vertical axis is scaled to the total capacity of that resource. They illustrate the tradeoff between CPU and memory: for a tiny slice of the available CPU, we get most of the memory back.

      The CPU graph goes to 1600% because there are 16 CPU cores in that server. That’s just how Munin makes the graphs.

    • I’m guessing it’s a 16-core system.

  2. Have you considered hibernating processes that do no work, to get them garbage collected and compacted?

    • Yes. These processes receive several messages per second. It would be very wasteful to hibernate them so often.

  3. Andy / February 13, 2012

    I do not use Erlang – really at all – but does the problem still exist if the process where the garbage was being collected dies?

  4. John / February 13, 2012

    See the item about binaries here: http://prog21.dadgum.com/43.html

  5. Although it’s not impossible, it’s still unlikely that there’s an actual bug in the garbage collection (in particular since when you trigger the GC manually, all the junk does get cleaned out). It’s not really that “Erlang sometimes fails to purge the data”, but my guess is that you’ve done a lot of work on binaries, perhaps without allocating and freeing much other data. So the process heap is full of small handles to reference-counted large binary chunks stored off heap.

    The heap itself has not filled up enough to trigger another round of GC since last time (if the process started out by doing some heavy work, the heap area may have grown to a fair size), and if the process has now gone into waiting in a receive loop, there is no reason for it to suddenly start GCing unless you explicitly tell it to. (A sleeping process with a list of 1000 integers on the heap is no problem, but a list of 1000 heap binaries could be bad.)

    I think this problem could be fixed by making the emulator do a scavenging sweep over inactive processes with large heaps that have not already been forced to GC. Meanwhile, it’s always a good idea to force a GC when a process switches phase from allocation-heavy work to being mostly inactive.
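    A minimal sketch of that last point (the function names here are placeholders, not from the app above):

    %% After an allocation-heavy phase, force one collection before
    %% the process settles into a quiet receive loop.
    worker(Job) ->
        Result = heavy(Job),
        erlang:garbage_collect(),   % drop the phase's garbage now
        idle(Result).

    heavy(N) ->
        binary:copy(<<0>>, N).      % stand-in for binary-heavy work

    idle(Result) ->
        receive stop -> Result end.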

  6. I’ve posted a lengthy reply/explanation on Hacker News. The whole discussion is here: http://news.ycombinator.com/item?id=3586438

    –Matt

  7. Sean / February 13, 2012

    Consider using ERL_FULLSWEEP_AFTER=0. This will trigger GC very frequently.

  8. Dan / February 13, 2012

    Does Erlang not automatically garbage collect when it runs out of memory, rather than swapping to disk and eventually dying completely?

    • Dan / February 13, 2012

      Ok, Richard and Matt have already addressed this. Should learn to read better next time, I guess.

  9. I often use Erlang and I’m looking for something that builds up from the very basics to more complex material, in ways idiomatic to Erlang (I’d like exercises in message passing, network programming, etc.). Thanks!

  10. fauxsoup / February 20, 2013

    So the problem you’re experiencing with large binaries has to do with the fact that they are reference counted and, as a function of that, keep track of *every process that touches them* whether or not they actually *do* anything with the data. Functionally this means that a large binary cannot be destroyed until every process that has ever touched it has been garbage collected first. This can be especially problematic if you do a lot of matching on binaries or otherwise end up with a large number of sub-binaries.

    So, suppose you have an intermediate process that’s routing a request to another process. In such a situation, the intermediate process must also be garbage collected, but since such a process is very lightweight, it’s not garbage collected very frequently, meaning your large binaries live for a longer time.

    You can mitigate this problem by shrinking the “fullsweep_after” parameter (default 65535; I’m using 10 atm), or, more effectively, by not passing your binaries around unnecessarily. In my example, you could mitigate the problem by only passing the essential information for routing to the intermediate process and having the intermediate process return the pid of the destination process.
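
    To make the sub-binary trap concrete, a minimal sketch (the function is mine, not from the post):

    %% Matching out a 16-byte header yields a sub-binary that pins the
    %% whole packet in memory for as long as the header is referenced.
    %% binary:copy/1 allocates an independent copy, so the big binary
    %% can be reclaimed once the processes referencing it are GCed.
    header(<<Header:16/binary, _Rest/binary>>) ->
        binary:copy(Header).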

    Good breakdown on the details here: http://dieswaytoofast.blogspot.com/2012/12/erlang-binaries-and-garbage-collection.html

    • Thank you for the clear explanation! Lowering fullsweep_after didn’t help. As it turns out I am GCing all of the processes on timers. Memory hasn’t been a problem since I started doing that last year.
