Out of memory: Kill process or sacrifice child

It is 6 AM. I am awake, summarizing the sequence of events leading to my way-too-early wake-up call. As such stories usually start, my phone alarm went off. Sleepy and grumpy, I checked the phone to see whether I was really crazy enough to set the wake-up alarm for 5 AM. No, it was our monitoring system indicating that one of the Plumbr services had gone down.

As a seasoned veteran in the domain, I made the first correct step towards a solution by turning on the espresso machine. With a cup of coffee I was equipped to tackle the problem. The first suspect, the application itself, seemed to have behaved completely normally before the crash. No errors, no warning signs, no trace of anything suspicious in the application logs.

The monitoring we have in place had noticed the death of the process and had already restarted the crashed service. But as I already had caffeine in my bloodstream, I started to gather more evidence. 30 minutes later I found myself staring at the following in /var/log/kern.log:

Jun  4 07:41:59 plumbr kernel: [70667120.897649] Out of memory: Kill process 29957 (java) score 366 or sacrifice child
Jun  4 07:41:59 plumbr kernel: [70667120.897701] Killed process 29957 (java) total-vm:2532680kB, anon-rss:1416508kB, file-rss:0kB
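If you want to search your own logs for entries like these, something along the following lines might do. It is a minimal sketch, not what I ran that morning: the class name OomKillScanner is made up for illustration, and the pattern assumes the exact message wording shown above, which other kernel versions may phrase differently.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Hypothetical helper, named for illustration only.
final class OomKillScanner {

    // Assumes the "Out of memory: Kill process <pid> (<name>) score <n>" wording from the log above.
    private static final Pattern OOM_LINE = Pattern.compile(
            "Out of memory: Kill process (\\d+) \\(([^)]+)\\) score (\\d+)");

    /** Returns "<pid> <name> <score>" for an OOM-killer line, or null for any other line. */
    static String parse(String line) {
        Matcher m = OOM_LINE.matcher(line);
        return m.find() ? m.group(1) + " " + m.group(2) + " " + m.group(3) : null;
    }

    /** Collects every OOM-killer victim mentioned in the given log file. */
    static List<String> scan(Path log) throws IOException {
        List<String> victims = new ArrayList<>();
        // ISO-8859-1 accepts any byte sequence, so stray bytes in the log cannot abort the scan.
        try (Stream<String> lines = Files.lines(log, StandardCharsets.ISO_8859_1)) {
            lines.map(OomKillScanner::parse)
                 .filter(v -> v != null)
                 .forEach(victims::add);
        }
        return victims;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the path mentioned in the article; pass another file as the first argument.
        Path log = Paths.get(args.length > 0 ? args[0] : "/var/log/kern.log");
        scan(log).forEach(System.out::println);
    }
}
```

Reading the kernel log usually requires elevated privileges, so the scan may need to run as root or against a copy of the file.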

Apparently we had become victims of the Linux kernel internals. As you all know, Linux is built with a bunch of unholy creatures (called ‘daemons’). Those daemons are shepherded by several kernel jobs, one of which seems to be an especially sinister entity. Apparently all modern Linux kernels have a built-in mechanism called the “Out Of Memory killer”, which can annihilate your processes under extremely low memory conditions. When such a condition is detected, the killer is activated and picks a process to kill. The target is picked using a set of heuristics that score all processes and select the one with the worst score to kill.
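The kernel exposes the current score of each process in /proc/<pid>/oom_score, so you can get a rough idea of who would be picked next. The following is a minimal sketch under that assumption; the class and method names are hypothetical, and the real selection happens inside the kernel at the moment memory runs out, so treat the output as a hint only.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, named for illustration only.
final class OomScoreRanking {

    /** Reads a small text file such as /proc/<pid>/oom_score and trims the trailing newline. */
    static String readTrimmed(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.US_ASCII).trim();
    }

    /** Returns "<pid> <score>" for the highest oom_score found under procRoot, or null if none. */
    static String likelyVictim(Path procRoot) throws IOException {
        String worst = null;
        int worstScore = -1;
        // Process directories are the entries whose names start with a digit.
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(procRoot, "[0-9]*")) {
            for (Path dir : dirs) {
                try {
                    int score = Integer.parseInt(readTrimmed(dir.resolve("oom_score")));
                    if (score > worstScore) {
                        worstScore = score;
                        worst = dir.getFileName() + " " + score;
                    }
                } catch (IOException | NumberFormatException e) {
                    // The process exited mid-scan or is not readable; skip it.
                }
            }
        }
        return worst;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(likelyVictim(Paths.get("/proc")));
    }
}
```

The root directory is passed in as a parameter so the scan can be tried against a fake directory tree before pointing it at the real /proc.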

Understanding the “Out Of Memory killer”

By default, Linux kernels allow processes to request more memory than currently available in the system. This makes all the sense in the world, considering that most of the processes never actually use all of the memory they allocate. The easiest comparison to this approach would be with the cable operators. They sell all the consumers a 100Mbit download promise, far exceeding the actual bandwidth present in their network. The bet is again on the fact that the users will not simultaneously all use their allocated download limit. Thus one 10Gbit link can successfully serve way more than the 100 users our simple math would permit.

A side effect of this approach becomes visible when one of your programs is on the path to depleting the system’s memory. This can lead to extremely low memory conditions, where no pages can be allocated to any process. You might have faced such a situation, where not even the root account can kill the offending task. To prevent such situations, the killer activates and identifies the process to be killed.

You can read more about fine-tuning the behaviour of the “Out of memory killer” in this article in the Red Hat documentation.

What was triggering the Out of memory killer?

Now that we have the context, it is still unclear what triggered the “killer” and woke me up at 5 AM. Some more investigation revealed that:

  • The configuration in /proc/sys/vm/overcommit_memory allowed overcommitting memory – it was set to 1, indicating that every malloc() should succeed.
  • The application was running on an EC2 m1.small instance. EC2 instances have swapping disabled by default.
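Both facts can be checked from inside a Java process by reading /proc. The following is a minimal sketch, assuming a Linux host; the class and method names are hypothetical, and the descriptions of modes 0, 1 and 2 paraphrase the kernel's overcommit documentation rather than quote it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper, named for illustration only.
final class MemoryPolicyCheck {

    /** Describes the value read from /proc/sys/vm/overcommit_memory. */
    static String describeOvercommit(String raw) {
        switch (raw.trim()) {
            case "0": return "heuristic overcommit";
            case "1": return "always overcommit";
            case "2": return "strict accounting, no overcommit";
            default:  return "unknown mode: " + raw.trim();
        }
    }

    /** Extracts SwapTotal in kB from the lines of /proc/meminfo, or -1 if the entry is absent. */
    static long swapTotalKb(List<String> meminfo) {
        for (String line : meminfo) {
            if (line.startsWith("SwapTotal:")) {
                // The line looks like "SwapTotal:       655356 kB".
                String[] parts = line.trim().split("\\s+");
                return Long.parseLong(parts[1]);
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        String mode = new String(
                Files.readAllBytes(Paths.get("/proc/sys/vm/overcommit_memory")),
                StandardCharsets.US_ASCII);
        List<String> meminfo =
                Files.readAllLines(Paths.get("/proc/meminfo"), StandardCharsets.US_ASCII);
        System.out.println("overcommit_memory: " + describeOvercommit(mode));
        long swapKb = swapTotalKb(meminfo);
        System.out.println(swapKb > 0 ? "swap: " + swapKb + " kB" : "swap: none reported");
    }
}
```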

Those two facts, combined with a sudden spike in traffic to our services, resulted in the application requesting more and more memory to support the extra users. The overcommit configuration allowed this greedy process to keep allocating, eventually triggering the “Out of memory killer”, which did exactly what it is meant to do: it killed our application and woke me up in the middle of the night.

Example

When I described the behaviour to engineers, one of them was interested enough to create a small test case reproducing the error. When you compile and launch the following Java code snippet on Linux (I used the latest stable Ubuntu version):

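package eu.plumbr.demo;

import java.util.ArrayList;
import java.util.List;

public class OOM {

	public static void main(String[] args) {
		// Hold on to every array so that nothing can be garbage collected.
		List<int[]> l = new ArrayList<>();
		for (int i = 10000; i < 100000; i++) {
			try {
				// Each array takes roughly 400 MB of heap.
				l.add(new int[100_000_000]);
			} catch (Throwable t) {
				// Print and keep going, so the JVM keeps pushing against the memory limit.
				t.printStackTrace();
			}
		}
	}
}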

then you will face the very same Out of memory: Kill process <PID> (java) score <SCORE> or sacrifice child message.

Note that you might need to tweak the swapfile and heap sizes. In my test case I used a 2 GB heap, specified via -Xmx2g, and the following swap configuration, which creates a 640 MB swapfile:

swapoff -a 
dd if=/dev/zero of=swapfile bs=1024 count=655360
mkswap swapfile
swapon swapfile

Solution?

There are several ways to handle such a situation. In our case, we just migrated the system to an instance with more memory. I also considered allowing swapping, but after consulting with engineering I was reminded that garbage collection on the JVM does not cope well with swapping, so this option was off the table.

Other possibilities would involve fine-tuning the OOM killer, scaling the load horizontally across several small instances or reducing the memory requirements of the application.
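As an illustration of the fine-tuning option, Linux lets a process bias its own score through /proc/self/oom_score_adj, which accepts values from -1000 (never kill) to 1000 (kill first). The following is a minimal sketch of that idea and not something we deployed; the class name is hypothetical, and lowering the value typically requires elevated privileges, while raising it does not.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, named for illustration only.
final class OomScoreAdjuster {

    static final int MIN_ADJ = -1000; // exempts the process from the OOM killer
    static final int MAX_ADJ = 1000;  // makes the process the preferred victim

    /** Keeps the requested adjustment inside the range the kernel accepts. */
    static int clamp(int adj) {
        return Math.max(MIN_ADJ, Math.min(MAX_ADJ, adj));
    }

    /** Writes the clamped adjustment to oom_score_adj inside the given /proc/<pid> directory. */
    static void setAdjustment(Path procDir, int adj) throws IOException {
        byte[] value = Integer.toString(clamp(adj)).getBytes(StandardCharsets.US_ASCII);
        Files.write(procDir.resolve("oom_score_adj"), value);
    }

    public static void main(String[] args) throws IOException {
        // /proc/self refers to the calling process, here the JVM itself.
        int adj = args.length > 0 ? Integer.parseInt(args[0]) : 0;
        setAdjustment(Paths.get("/proc/self"), adj);
    }
}
```

Protecting one process only shifts the kill to another one, so this is worth considering only when some other process on the host is a more acceptable victim.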

If you found the study interesting, follow Plumbr on Twitter or RSS. We keep publishing our insights about Java internals.

Reference: Out of memory: Kill process or sacrifice child from our JCG partner Jaan Angerpikk at the Plumbr Blog.


One Response to "Out of memory: Kill process or sacrifice child"

  1. Andrew Logvinov says:

    Interesting thing, thanks for sharing.

    Was the amount of memory allocated for this application (Xmx) higher than amount of memory actually available on this host minus some 500-1000m for the OS itself? Otherwise, I don’t get how the JVM wouldn’t crash with OOM itself when trying to allocate another array.

Java Code Geeks and all content copyright © 2010-2014, Exelixis Media Ltd | Terms of Use | Privacy Policy | Contact
All trademarks and registered trademarks appearing on Java Code Geeks are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries.
Java Code Geeks is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.