Hardware Transactional Memory in Java, or why synchronized will be cool again

Peter LawreyFebruary 5th, 2014Last Updated: February 5th, 2014

0 24 5 minutes read

Overview

Hardware Transaction Memory has the potential to allow multiple threads to speculatively access the same data structure at the same time and let the cache coherence protocol determine if a conflict occurred. HTM aims to give you the scalability of fine grain locking, the simplicity of course grain locking, and performance close to no locking at all. If supported by the JVM, your program or library was written with course grain lock it could mean your application can scale to many more cores with little change.

While adding support for this into C and C++ is non trivial, adding support to native code generated by the JVM might be done without having to change the byte code.

In short, this could allow many threads to speculatively execute synchronized blocks for a lock concurrently, even concurrent writes, and the processor work out whether this was a problem and repeats the block until it is not.

What is Hardware Transaction Memory and how much will it cost?

Hardware Transactional Memory has been around for some time but until recently it hasn’t been main stream. With Intel introducing support into some of their 4th Generation i3/i5/i7 processors (Haswell) and their E3-1200 v3 (up to 4 core, one socket ATM) family processors, is widely available to new Intel based machines. It may be later this year, or next year before we see the larger numbers of core which means HTM would make a real difference. AFAIK, AMD plan to add this functionality soon.

BTW Azul’s Vega systems have had this technology for almost ten years, and I would expect Azul would be best placed to implement this in the JVM first.

The hardware you will buy, and perhaps have already, will do this. Many new models of laptop have Haswell processors as they have significantly improved power consumption.

How might it work?

synchronized blocks are often used in Java on a just-in-case basis. To simplify the code, these locks are often much more coarse than is optimal e.g. Hashtable locks the whole object/Map for any operation vs ConcurrentHashMap which has fine grain locking. Writing fine grain locking is much harder to get right and thus more error prone. The goal of Hardware Transaction Memory is to support course grained locking but get the benefit of fine grain locking. This is particularly useful for code where re-optimising the code is not practical.

Example

private final Map map = new HashMap<>(); 

public synchronized PooledObject acquireObject(String key) {
    PooledObjectobject = map.get(key);
    if (object == null)
        map.put(key, object = new PooledObject());
    return map;
}

You might expect the common case

only reads the map
updates the map, but in different places e.g. different keys.
rarely attempts to update the same key in two threads at once.

What you would like is

concurrent execution between threads.
very small over head compared to code without locking.
the CPU or JVM to do all the work to optimise this i.e you don’t have to change the code.

Without HTM, the synchronized block needs to obtain a lock and enforce serialized access even though the majority case is a read operation.

With HTM the byte code could be turned into pseudo code like this

public PooledObject acquireObject(String key) {
    int code;
    do {
        xbegin();
        PooledObjectobject = map.get(key);
        if (object == null)
            map.put(key, object = new PooledObject());
        return map;
    } while((code = xend()) == RETRYABLE);
    if (code != DONE) {
        // take corrective action such as
        // obtain a normal lock and repeat
    }
}

The XEND instruction demarcates the end of where the speculative read-set and write-set in the cache is checked to see whether any of these are any cache line accessed was modified by another CPU/thread. If not, the changes made are committed. Otherwise any changes are dropped and the loop can be repeated.

Note: rolling back the transaction means reverting the changes and could even mean rolling back the object creation where it doesn’t have significant side effects. If it does have side effects there is an XABORT instruction which can trigger the transaction to abort and the fall back code would need to be run.

Compare And Swap is limited to 64-bits what is the limit of these transactions?

The limit is the number of lines you can store in your L1 cache. This is up to 32 KB. If you have hyper threading this might be half i.e. 16 KB. Also, the L1 cache is 8 way associative so in the worst case 9 cache lines which hash to the same bucket could cause the transaction to fail. (less with hyper threading) Never the less, it is much higher and far more flexible than CAS 64-bit or 2CAS which is 128-bit.

Writing this transaction locking structure with fall back add boilerplate and duplicate code in a language like C.

Conclusion

The beauty of this pattern is it can be applied to Java code which has already been compiled and available as open source libraries. Unlike C code which would need significant rework to exploit this functionality, Java programs could take advantage of HTM without re-compilation. What we would need is a change in the JVM.

Notes (some corrections/clarifications on what I said earlier)

For me; “cool” technology is one I believe generates wide interest, even if there isn’t proven wide usefulness. I believe implementing this in a mainstream JVM will challenge what even experienced developers “know” about multi-threaded programming.

While Intel TSX is available in some Haswell processors, it is not available in all Haswell processors. You should check with Haswell on ARK and look for Intel TSX-NI is Yes.

It has been noted that this may not make much difference for well tuned code. Intel’s Designer for TSX Ravi Rajwar presented at QCon SF 2012 on Going Under the Hood with Intel?s Next Generation Microarchitecture Codename Haswell on the Mechanical Sympathy track. If you look at Page 29, it suggests to me that fine grained code will scale well across cores anyway and will not gain as much. Where TSX may help is course grained locking.

For more technical details I suggest you read Gil Tene’s post on the mechanical sympathy group. He has more first hand experience of tuning a JVM to support HTM than any one I have met.