
Java Lock Implementations

We all use 3rd party libraries as a normal part of development. Generally, we have no control over their internals. The libraries provided with the JDK are a typical example. Many of these libraries employ locks to manage contention.
JDK locks come with two implementations. One uses atomic CAS-style instructions to manage the claim process. CAS instructions tend to be the most expensive type of CPU instruction, and on x86 they also have memory ordering semantics. Locks are often un-contended, which gives rise to a possible optimisation whereby a lock can be biased to the un-contended thread, using techniques that avoid atomic instructions. This biasing allows a lock, in theory, to be quickly reacquired by the same thread. If the lock turns out to be contended by multiple threads, the algorithm will revert from being biased and fall back to the standard approach using atomic instructions. Biased locking became the default lock implementation with Java 6.
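To make the claim process concrete, the following is a minimal sketch of a CAS-based lock claim. It is not how HotSpot monitors or biased locking are actually implemented; the SpinLock class and its use of AtomicBoolean are mine, purely for illustration.

import java.util.concurrent.atomic.AtomicBoolean;

// Minimal sketch of a CAS-based lock claim, for illustration only.
// HotSpot's monitors and biased-locking machinery are far more involved.
public final class SpinLock
{
    private final AtomicBoolean claimed = new AtomicBoolean(false);

    public void lock()
    {
        // compareAndSet issues an atomic instruction (lock cmpxchg on x86)
        // with full memory ordering semantics on every attempt.
        while (!claimed.compareAndSet(false, true))
        {
            Thread.yield(); // back off while another thread holds the claim
        }
    }

    public void unlock()
    {
        claimed.set(false); // volatile store publishes the release
    }
}

Biased locking aims to avoid paying for that atomic instruction when the same thread reacquires the lock, at the cost of extra housekeeping and a revocation step if another thread contends.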
When respecting the single writer principle, biased locking should be your friend. Recently, while using the sockets API, I decided to measure the lock costs and was surprised by the results: my un-contended thread was incurring a bit more cost from the lock than I expected. I put together the following test to compare the cost of the current lock implementations available in Java 6.
The Test
For the test I shall increment a counter within a lock, while increasing the number of threads contending on that lock. The test will be repeated for the three major lock implementations available in Java:
  1. Atomic locking on Java language monitors
  2. Biased locking on Java language monitors
  3. ReentrantLock introduced with the java.util.concurrent package in Java 5.
I’ll also run the tests on the three most recent generations of Intel CPUs. For each CPU I’ll execute the tests up to the maximum number of concurrent threads the core count will support.
The tests are carried out with 64-bit Linux (Fedora Core 15) and Oracle JDK 1.6.0_29.

The Code

import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.CyclicBarrier;

import static java.lang.System.out;

public final class TestLocks implements Runnable
{
    public enum LockType { JVM, JUC }
    public static LockType lockType;

    public static final long ITERATIONS = 500L * 1000L * 1000L;
    public static long counter = 0L;

    public static final Object jvmLock = new Object();
    public static final Lock jucLock = new ReentrantLock();
    private static int numThreads;
    private static CyclicBarrier barrier;

    public static void main(final String[] args) throws Exception
    {
        lockType = LockType.valueOf(args[0]);
        numThreads = Integer.parseInt(args[1]);
        
        runTest(numThreads); // warm up
        counter = 0L;

        final long start = System.nanoTime();
        runTest(numThreads);
        final long duration = System.nanoTime() - start;

        out.printf("%d threads, duration %,d (ns)\n", numThreads, duration);
        out.printf("%,d ns/op\n", duration / ITERATIONS);
        out.printf("%,d ops/s\n", (ITERATIONS * 1000000000L) / duration);
        out.println("counter = " + counter);
    }

    private static void runTest(final int numThreads) throws Exception
    {
        barrier = new CyclicBarrier(numThreads); // release all worker threads together
        Thread[] threads = new Thread[numThreads];

        for (int i = 0; i < threads.length; i++)
        {
            threads[i] = new Thread(new TestLocks());
        }

        for (Thread t : threads)
        {
            t.start();
        }

        for (Thread t : threads)
        {
            t.join();
        }
    }

    public void run()
    {
        try
        {
            barrier.await();
        }
        catch (Exception e)
        {
            // don't care
        }

        switch (lockType)
        {
            case JVM: jvmLockInc(); break;
            case JUC: jucLockInc(); break;
        }
    }

    private void jvmLockInc()
    {
        long count = ITERATIONS / numThreads; // each thread performs its share of the iterations
        while (0 != count--)
        {
            synchronized (jvmLock)
            {
                ++counter;
            }
        }
    }

    private void jucLockInc()
    {
        long count = ITERATIONS / numThreads;
        while (0 != count--)
        {
            jucLock.lock();
            try
            {
                ++counter;
            }
            finally
            {
                jucLock.unlock();
            }
        }
    }
}

Script the tests:

set -x
for i in {1..8}; do java -XX:-UseBiasedLocking TestLocks JVM $i; done
for i in {1..8}; do java -XX:+UseBiasedLocking TestLocks JVM $i; done
for i in {1..8}; do java TestLocks JUC $i; done

The Results

Figures 1–3: test results for each of the three Intel CPU generations.
Biased locking should no longer be the default lock implementation on modern Intel processors. I recommend you measure your applications and experiment with the -XX:-UseBiasedLocking JVM option to determine whether you can benefit from using an atomic lock-based algorithm for the un-contended case.

Observations
  1. Biased locking, in the un-contended case, is ~10% more expensive than atomic locking. It seems that for recent CPU generations the cost of atomic instructions is less than the housekeeping necessary for biased locks. Prior to Nehalem, lock instructions would assert a lock on the memory bus to perform these atomic operations, each costing more than 100 cycles. Since Nehalem, atomic instructions can be handled locally within a CPU core, and typically cost only 10-20 cycles if they do not need to wait on the store buffer to drain while enforcing memory ordering semantics.
  2. As contention increases, language monitor locks quickly reach a throughput limit regardless of thread count.
  3. ReentrantLock gives the best un-contended performance and scales significantly better with increasing contention compared to language monitors using synchronized.
  4. ReentrantLock has an odd characteristic of reduced performance when 2 threads are contending. This deserves further investigation.
  5. Sandybridge suffers from the increased latency of atomic instructions, which I detailed in a previous article, when the contended thread count is low. As the contended thread count continues to increase, the cost of kernel arbitration tends to dominate and Sandybridge shows its strength with its increased memory throughput.
Conclusion
When developing your own concurrent libraries, I would recommend ReentrantLock rather than the synchronized keyword, due to its significantly better performance on x86, if a lock-free alternative algorithm is not a viable option.
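As a reference point for the lock-free alternative mentioned above, a shared counter can be maintained without any lock at all. This is only a sketch using AtomicLong; the LockFreeCounter name is mine for illustration.

import java.util.concurrent.atomic.AtomicLong;

// Sketch of a lock-free alternative for the counter used in this test:
// a single atomic read-modify-write replaces the lock acquire/release pair.
public final class LockFreeCounter
{
    private final AtomicLong counter = new AtomicLong(0L);

    public long increment()
    {
        return counter.incrementAndGet(); // atomic increment, no lock held
    }

    public long get()
    {
        return counter.get();
    }
}

In this benchmark such a counter would of course remove the very lock contention being measured, so it serves only to illustrate what the lock-free alternative looks like.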
Update 20-Nov-2011
Dave Dice has pointed out that biased locking is not implemented for locks created in the first few seconds of JVM startup. I’ll re-run my tests this week and post the results. I’ve had some more quality feedback suggesting my results could be invalid. Micro-benchmarks can be tricky, but the advice of measuring your own application in the large still stands.
A re-run of the tests, taking account of Dave’s feedback, can be seen in this follow-on blog post.
Reference: Java Lock Implementations from our JCG partner Martin Thompson at the Mechanical Sympathy blog.