Home » Java » Core Java » Printing Generated Assembly Code From The Hotspot JIT Compiler

About Martin Thompson

Martin Thompson

Printing Generated Assembly Code From The Hotspot JIT Compiler

Sometimes when profiling a Java application it is necessary to understand the assembly code generated by the Hotspot JIT compiler. This can be useful in determining what optimisation decisions have been made and how our code changes can affect the generated assembly code. It is also useful at times knowing what instructions are emitted when debugging a concurrent algorithm to ensure visibility rules have been applied as expected. I have found quite a few bugs in various JVMs this way.

This blog illustrates how to install a Disassembler Plugin and provides command line options for targeting a particular method.


Previously it was necessary to obtain a debug build for printing the assembly code generated by the Hotspot JIT for the Oracle/SUN JVM. Since Java 7, it has been possible to print the generated assembly code if a Disassembler Plugin is installed in a standard Oracle Hotspot JVM. To install the plugin for 64-bit Linux follow the steps below:

  1. Download the appropriate binary, or build from source, from https://kenai.com/projects/base-hsdis/downloads
  2. On Linux rename linux-hsdis-amd64.so to libhsdis-amd64.so
  3. Copy the shared library to $JAVA_HOME/jre/lib/amd64/server

You now have the plugin installed!

Test Program

To test the plugin we need some code that is both interesting to a programmer and executes sufficiently hot to be optimised by the JIT. Some details of when the JIT will optimise can be found here. The code below can be used to measure the average latency between two threads by reading and writing volatile fields. These volatile fields are interesting because they require associated hardware fences to honour the Java Memory Model.

import static java.lang.System.out;

public class InterThreadLatency
    private static final int REPETITIONS = 100 * 1000 * 1000;

    private static volatile int ping = -1;
    private static volatile int pong = -1;

    public static void main(final String[] args)
        throws Exception
        for (int i = 0; i < 5; i++)
            final long duration = runTest();

            out.printf("%d - %dns avg latency - ping=%d pong=%d\n",
                       duration / (REPETITIONS * 2),

    private static long runTest() throws InterruptedException
        final Thread pongThread = new Thread(new PongRunner());
        final Thread pingThread = new Thread(new PingRunner());

        final long start = System.nanoTime();

        return System.nanoTime() - start;

    public static class PingRunner implements Runnable
        public void run()
            for (int i = 0; i < REPETITIONS; i++)
                ping = i;

                while (i != pong)
                    // busy spin

    public static class PongRunner implements Runnable
        public void run()
            for (int i = 0; i < REPETITIONS; i++)
                while (i != ping)
                    // busy spin

                pong = i;

Printing Assembly Code

It is possible to print all generated assembly code with the following statement.

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly InterThreadLatency

However this can put you in the situation of not being able to see the forest for the trees. It is generally much more useful to target a particular method. For this test, the run() method will be optimised and generated twice by Hotspot. Once for the OSR version, and then again for the standard JIT version. The standard JIT version follows.

java -XX:+UnlockDiagnosticVMOptions '-XX:CompileCommand=print,*PongRunner.run' InterThreadLatency

Compiled method (c2)   10531    5             InterThreadLatency$PongRunner::run (30 bytes)
 total in heap  [0x00007fed81060850,0x00007fed81060b30] = 736
 relocation     [0x00007fed81060970,0x00007fed81060980] = 16
 main code      [0x00007fed81060980,0x00007fed81060a00] = 128
 stub code      [0x00007fed81060a00,0x00007fed81060a18] = 24
 oops           [0x00007fed81060a18,0x00007fed81060a30] = 24
 scopes data    [0x00007fed81060a30,0x00007fed81060a78] = 72
 scopes pcs     [0x00007fed81060a78,0x00007fed81060b28] = 176
 dependencies   [0x00007fed81060b28,0x00007fed81060b30] = 8
Decoding compiled method 0x00007fed81060850:
[Entry Point]
  # {method} 'run' '()V' in 'InterThreadLatency$PongRunner'
  #           [sp+0x20]  (sp of caller)
  0x00007fed81060980: mov    0x8(%rsi),%r10d
  0x00007fed81060984: shl    $0x3,%r10
  0x00007fed81060988: cmp    %r10,%rax
  0x00007fed8106098b: jne    0x00007fed81037a60  ;   {runtime_call}
  0x00007fed81060991: xchg   %ax,%ax
  0x00007fed81060994: nopl   0x0(%rax,%rax,1)
  0x00007fed8106099c: xchg   %ax,%ax
[Verified Entry Point]
  0x00007fed810609a0: sub    $0x18,%rsp
  0x00007fed810609a7: mov    %rbp,0x10(%rsp)    ;*synchronization entry
                                                ; - InterThreadLatency$PongRunner::[email protected] (line 58)
  0x00007fed810609ac: xor    %r11d,%r11d
  0x00007fed810609af: mov    $0x7ad0fcbf0,%r10  ;   {oop(a 'java/lang/Class' = 'InterThreadLatency')}
  0x00007fed810609b9: jmp    0x00007fed810609d0
  0x00007fed810609bb: nopl   0x0(%rax,%rax,1)   ; OopMap{r10=Oop off=64}
                                                ; - InterThreadLatency$PongRunner::[email protected] (line 60)
  0x00007fed810609c0: test   %eax,0xaa1663a(%rip)        # 0x00007fed8ba77000
                                                ; - InterThreadLatency$PongRunner::[email protected] (line 60)
                                                ;   {poll}
  0x00007fed810609c6: nopw   0x0(%rax,%rax,1)   ;*iload_1
                                                ; - InterThreadLatency$PongRunner::[email protected] (line 60)
  0x00007fed810609d0: mov    0x74(%r10),%r9d    ;*getstatic ping
                                                ; - InterThreadLatency::[email protected] (line 3)
                                                ; - InterThreadLatency$PongRunner::[email protected] (line 60)
  0x00007fed810609d4: cmp    %r9d,%r11d
  0x00007fed810609d7: jne    0x00007fed810609c0
  0x00007fed810609d9: mov    %r11d,0x78(%r10)
  0x00007fed810609dd: lock addl $0x0,(%rsp)     ;*putstatic pong
                                                ; - InterThreadLatency::[email protected] (line 3)
                                                ; - InterThreadLatency$PongRunner::[email protected] (line 65)
  0x00007fed810609e2: inc    %r11d              ;*iinc
                                                ; - InterThreadLatency$PongRunner::[email protected] (line 58)
  0x00007fed810609e5: cmp    $0x5f5e100,%r11d
  0x00007fed810609ec: jl     0x00007fed810609d0  ;*if_icmpeq
                                                ; - InterThreadLatency$PongRunner::[email protected] (line 60)
  0x00007fed810609ee: add    $0x10,%rsp
  0x00007fed810609f2: pop    %rbp
  0x00007fed810609f3: test   %eax,0xaa16607(%rip)        # 0x00007fed8ba77000
                                                ;   {poll_return}
  0x00007fed810609f9: retq                      ;*iload_1
                                                ; - InterThreadLatency$PongRunner::[email protected] (line 60)
  0x00007fed810609fa: hlt    
  0x00007fed810609fb: hlt    
  0x00007fed810609fc: hlt    
  0x00007fed810609fd: hlt    
  0x00007fed810609fe: hlt    
  0x00007fed810609ff: hlt    
[Exception Handler]
[Stub Code]
  0x00007fed81060a00: jmpq   0x00007fed8105eaa0  ;   {no_reloc}
[Deopt Handler Code]
  0x00007fed81060a05: callq  0x00007fed81060a0a
  0x00007fed81060a0a: subq   $0x5,(%rsp)
  0x00007fed81060a0f: jmpq   0x00007fed81038c00  ;   {runtime_call}
  0x00007fed81060a14: hlt    
  0x00007fed81060a15: hlt    
  0x00007fed81060a16: hlt    
  0x00007fed81060a17: hlt    
OopMapSet contains 1 OopMaps

OopMap{r10=Oop off=64}

An Interesting Observation

The red highlighted lines of assembly code above are very interesting. When a
volatile field is written, under the Java Memory Model the write must be sequentially consistent, i.e. not appear to be reordered due to optimisations normally applied such as staging the write to the store buffer. This can be achieved by inserting the appropriate memory barriers. In the case above, Hotspot has chosen to enforce the ordering by issuing a MOV instruction (register to memory address – i.e. the write) followed by a LOCK ADD instruction (no op to the stack pointer as a fence idiom) that has ordering semantics. This is less than ideal on an x86 processor. The same action could have been performed more efficiently and correctly with a single LOCK XCHG instruction for the write. This makes me wonder if there are some significant compromises in the JVM to make it portable across many architectures, rather than be the best it can on x86.

Do you want to know how to develop your skillset to become a Java Rockstar?
Subscribe to our newsletter to start Rocking right now!
To get you started we give you our best selling eBooks for FREE!
1. JPA Mini Book
2. JVM Troubleshooting Guide
3. JUnit Tutorial for Unit Testing
4. Java Annotations Tutorial
5. Java Interview Questions
6. Spring Interview Questions
7. Android UI Design
and many more ....
Email address:

Leave a Reply

Be the First to Comment!

Notify of