Mastering Java Bytecode

Attila Mihaly Balazs, December 9th, 2013 (Last Updated: December 9th, 2013)


Hey! Happy Advent :D I’m Simon Maple (@sjmaple), the Technical Evangelist for ZeroTurnaround. You know, the JRebel guys! Well, as a result of writing a product like JRebel, which interacts with bytecode more often than you care to imagine, there are many things we’ve learned about it which we’d love to share.

Let’s start at the start… Java was a language designed to run on a virtual machine so that it only needed to be compiled once to run everywhere (yes, yes, write once, test everywhere). As a result, the JVM which you install onto your system would be native, allowing the code that runs on it to be platform agnostic. Java bytecode is the intermediate representation of the Java code you write as source and is the result of you compiling your code. So your class files are the bytecode.

To be more succinct, Java bytecode is the instruction set of the Java Virtual Machine, which is interpreted or JIT-compiled into native code at runtime.

Have you ever played about with assembler or machine code? Bytecode is kind of similar, in a way, but many people in the industry don’t really play with it that much, more out of a lack of necessity. However, it is important to understand what’s going on, and useful if you want to out-geek someone in the pub.

Firstly, let’s take a look at some bytecode basics. We’ll take the expression ‘1+2’ first and see how this gets executed as Java bytecode. 1+2 can be written in reverse Polish notation as 1 2 +. Why? Well when we put it on a stack it all becomes clear…

OK, in bytecode we’d actually see the opcodes iconst_1, iconst_2 and iadd rather than push and add, but the flow is the same. Each opcode is one byte in length, hence bytecode. There are 256 possible opcodes as a result, but only 200 or so are used. Opcodes are prefixed with a type followed by the operation name. So iconst and iadd, which we saw above, are a constant of integer type and an add instruction for integer types.
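To make the stack flow concrete, here is a minimal sketch of a toy operand stack that understands just these three instructions. The class and method names (OperandStackDemo, run) are made up for illustration; a real JVM does not dispatch on strings, of course.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of an operand stack; names are illustrative, not part of any JVM API.
final class OperandStackDemo {

    // Executes a tiny subset of opcode mnemonics and returns the value left on top of the stack.
    static int run(String... instructions) {
        Deque<Integer> stack = new ArrayDeque<>();
        for (String insn : instructions) {
            switch (insn) {
                case "iconst_1":
                    stack.push(1);            // push the int constant 1
                    break;
                case "iconst_2":
                    stack.push(2);            // push the int constant 2
                    break;
                case "iadd": {
                    int right = stack.pop();  // pop both operands...
                    int left = stack.pop();
                    stack.push(left + right); // ...and push their sum
                    break;
                }
                default:
                    throw new IllegalArgumentException("unsupported instruction: " + insn);
            }
        }
        return stack.pop();
    }

    public static void main(String[] args) {
        // 1 2 + in reverse Polish notation
        System.out.println(run("iconst_1", "iconst_2", "iadd")); // prints 3
    }
}
```

The operands go onto the stack first and the operator consumes them afterwards, which is exactly why 1 2 + is the natural way to write the expression.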

This is all very well and good, but how about reading class files? Typically, all you see when you open a class file in your editor of choice is a bunch of smiley faces and some squares, dots and other weird characters, right? The answer is javap, a utility you actually get with your JDK. Let’s look at a code example to see javap in action.

public class Main {
    public static void main(String[] args) {
        MovingAverage app = new MovingAverage();
    }
}

Once this class is compiled into a Main.class file, we can use the following command to extract the bytecode: javap -c Main

Compiled from "Main.java"
public class algo.Main {
  public algo.Main();
    Code:
       0: aload_0
       1: invokespecial #1    // Method java/lang/Object."<init>":()V
       4: return

  public static void main(java.lang.String[]);
    Code:
       0: new           #2
       3: dup
       4: invokespecial #3
       7: astore_1
       8: return
}

We can see we have our default constructor and main method in the bytecode straight away. By the way, this is how Java gives you a default constructor for constructor-less classes! The bytecode in the constructor is simply a call to super(), while our main method creates a new instance of MovingAverage and returns. The #n characters actually refer to constants, which we can view using the -verbose argument as follows: javap -c -verbose Main. The interesting part of what is returned is shown below:

public class algo.Main
  SourceFile: "Main.java"
  minor version: 0
  major version: 51
  flags: ACC_PUBLIC, ACC_SUPER
Constant pool:
   #1 = Methodref    #5.#21         //  java/lang/Object."<init>":()V
   #2 = Class        #22            //  algo/MovingAverage
   #3 = Methodref    #2.#21         //  algo/MovingAverage."<init>":()V
   #4 = Class        #23            //  algo/Main
   #5 = Class        #24            //  java/lang/Object
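The minor and major version values come from the very start of the class file: a 4-byte magic number (0xCAFEBABE), then a 2-byte minor version and a 2-byte major version, all big-endian. Major version 51 corresponds to Java SE 7. As a rough sketch, here is how you could read that header yourself. ClassHeaderDemo and parseHeader are names invented for this example, and the main method assumes the demo's own class file is reachable on the classpath.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

// Illustrative helper, not part of the JDK: parses the fixed 8-byte class file header.
final class ClassHeaderDemo {

    // Returns {magic, minor, major} from the first eight bytes of a class file.
    static int[] parseHeader(byte[] firstEightBytes) {
        ByteBuffer buf = ByteBuffer.wrap(firstEightBytes); // big-endian by default, like class files
        int magic = buf.getInt();                          // u4 magic
        int minor = buf.getShort() & 0xFFFF;               // u2 minor_version
        int major = buf.getShort() & 0xFFFF;               // u2 major_version
        return new int[] { magic, minor, major };
    }

    public static void main(String[] args) throws IOException {
        // Assumption: this class was loaded from a directory or jar on the classpath.
        InputStream raw = ClassHeaderDemo.class.getResourceAsStream("ClassHeaderDemo.class");
        if (raw == null) {
            System.out.println("class file not found on the classpath");
            return;
        }
        byte[] header = new byte[8];
        try (DataInputStream in = new DataInputStream(raw)) {
            in.readFully(header);
        }
        int[] parsed = parseHeader(header);
        System.out.printf("magic=%08X minor=%d major=%d%n", parsed[0], parsed[1], parsed[2]);
    }
}
```

The major version you see when running this depends on the compiler and target level you used, so it will not necessarily be 51.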

Now we can match our instructions to our constants and piece together what’s actually happening much more easily. Is anything still bugging you about the example above? No? What about the numbers in front of each instruction…

       0: new           #2
       3: dup
       4: invokespecial #3
       7: astore_1
       8: return

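Now it’s really bugging you, right? :) If we visualise this method body as an array of bytes, the numbers are simply indexes into that array: new sits at index 0 and its two-byte constant pool index fills indexes 1 and 2, so dup lands at index 3, and so on. Every instruction also has a HEX representation (new is 0xbb, for example), and those are the bytes we’d see if we opened the class file in a HEX editor.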

We could actually change the bytecode here in our HEX editor, but let’s be honest, it’s not something you’d really want to do, particularly on a Friday afternoon after the obligatory pub trip. Better ways to do this would be using ASM or javassist.
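To see where the offsets 0, 3, 4, 7 and 8 come from, here is a small sketch that walks a byte array and prints each instruction with its position. The opcode values (new = 0xbb, dup = 0x59, invokespecial = 0xb7, astore_1 = 0x4c, return = 0xb1) are taken from the JVM specification. The byte array itself is assembled by hand from the javap listing above rather than dumped from the actual class file, and OffsetDemo is a made-up name that only knows these five opcodes.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative mini-disassembler for the five opcodes used in Main.main; not a general tool.
final class OffsetDemo {

    // Hand-assembled bytes for: new #2, dup, invokespecial #3, astore_1, return
    static final int[] MAIN_CODE = { 0xbb, 0x00, 0x02, 0x59, 0xb7, 0x00, 0x03, 0x4c, 0xb1 };

    static List<String> decode(int[] code) {
        List<String> lines = new ArrayList<>();
        int pc = 0; // index into the code array, i.e. the number javap prints
        while (pc < code.length) {
            switch (code[pc]) {
                case 0xbb: // new: opcode plus a 2-byte constant pool index
                    lines.add(pc + ": new #" + ((code[pc + 1] << 8) | code[pc + 2]));
                    pc += 3;
                    break;
                case 0xb7: // invokespecial: opcode plus a 2-byte constant pool index
                    lines.add(pc + ": invokespecial #" + ((code[pc + 1] << 8) | code[pc + 2]));
                    pc += 3;
                    break;
                case 0x59: // dup: no operands
                    lines.add(pc + ": dup");
                    pc += 1;
                    break;
                case 0x4c: // astore_1: the slot number is baked into the opcode
                    lines.add(pc + ": astore_1");
                    pc += 1;
                    break;
                case 0xb1: // return: no operands
                    lines.add(pc + ": return");
                    pc += 1;
                    break;
                default:
                    throw new IllegalArgumentException("unsupported opcode at " + pc);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        decode(MAIN_CODE).forEach(System.out::println);
    }
}
```

Instructions that carry operands take up more than one byte, which is why the numbering skips from 0 to 3 and from 4 to 7.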

Let’s move on from our basic example and add some local variables that store state and interact directly with our stack. Check out the following code:

public static void main(String[] args) {
  MovingAverage ma = new MovingAverage();

  int num1 = 1;
  int num2 = 2;

  ma.submit(num1);
  ma.submit(num2);

  double avg = ma.getAvg();
}

Let’s see what we get this time in our bytecode:

Code:
   0: new           #2        // class algo/MovingAverage
   3: dup
   4: invokespecial #3        // Method algo/MovingAverage."<init>":()V
   7: astore_1
   8: iconst_1
   9: istore_2
  10: iconst_2
  11: istore_3
  12: aload_1
  13: iload_2
  14: i2d
  15: invokevirtual #4        // Method algo/MovingAverage.submit:(D)V
  18: aload_1
  19: iload_3
  20: i2d
  21: invokevirtual #4        // Method algo/MovingAverage.submit:(D)V
  24: aload_1
  25: invokevirtual #5        // Method algo/MovingAverage.getAvg:()D
  28: dstore        4
  30: return



LocalVariableTable:
  Start  Length  Slot  Name  Signature
      0      31     0  args  [Ljava/lang/String;
      8      23     1  ma    Lalgo/MovingAverage;
     10      21     2  num1  I
     12      19     3  num2  I
     30       1     4  avg   D

This looks a lot more interesting… We can see that we create an object of type MovingAverage, which is stored in the local variable ma via the astore_1 instruction (1 is the slot number in the LocalVariableTable). Instructions iconst_1 and iconst_2 push the constants 1 and 2 onto the stack, and istore_2 and istore_3 then store them in LocalVariableTable slots 2 and 3 respectively. A load instruction pushes a local variable onto the stack, while a store instruction pops the top item from the stack and stores it in the LocalVariableTable. It’s important to realise that when a store instruction is used, the item is taken off the stack, and if you want to use it again, you’ll need to load it.
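The interplay between the stack and the local variable slots can be sketched with a toy frame. LocalsDemo and its methods are invented for this illustration and only mimic the handful of instructions at positions 8 to 14 of the listing above (the aload_1 at position 12 is left out to keep things short).

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Toy model of a method frame: an array of local variable slots plus an operand stack.
final class LocalsDemo {
    final Number[] locals = new Number[5];          // slots 0..4
    final Deque<Number> stack = new ArrayDeque<>(); // operand stack

    void iconst(int value) { stack.push(value); }                           // push a constant
    void istore(int slot)  { locals[slot] = stack.pop(); }                  // pop into a slot
    void iload(int slot)   { stack.push(locals[slot]); }                    // copy a slot onto the stack
    void i2d()             { stack.push((double) stack.pop().intValue()); } // int -> double

    static LocalsDemo simulate() {
        LocalsDemo frame = new LocalsDemo();
        frame.iconst(1);  //  8: iconst_1 -> stack [1]
        frame.istore(2);  //  9: istore_2 -> stack [],  slot 2 = 1
        frame.iconst(2);  // 10: iconst_2 -> stack [2]
        frame.istore(3);  // 11: istore_3 -> stack [],  slot 3 = 2
        frame.iload(2);   // 13: iload_2  -> stack [1], slot 2 still holds 1
        frame.i2d();      // 14: i2d      -> stack [1.0]
        return frame;
    }

    public static void main(String[] args) {
        LocalsDemo frame = simulate();
        System.out.println("stack:  " + frame.stack);
        System.out.println("locals: " + Arrays.toString(frame.locals));
    }
}
```

Note that a store empties the stack entry it consumed, while a load leaves the slot untouched, so the same variable can be loaded as many times as you like.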

How about the flow of execution? All we’ve seen is a simple progression from one line to the next. I want to see some BASIC style GOTO 10 in the mix! Let’s take another example:

MovingAverage ma = new MovingAverage();
for (int number : numbers) {
    ma.submit(number);
}

In this case the flow of execution will jump around many times as we traverse the for loop. Assuming that the numbers variable is a static field in the same class, the bytecode looks like this:

   0: new           #2    // class algo/MovingAverage
   3: dup
   4: invokespecial #3    // Method algo/MovingAverage."<init>":()V
   7: astore_1
   8: getstatic     #4    // Field numbers:[I
  11: astore_2
  12: aload_2
  13: arraylength
  14: istore_3
  15: iconst_0
  16: istore        4
  18: iload         4
  20: iload_3
  21: if_icmpge     43
  24: aload_2
  25: iload         4
  27: iaload
  28: istore        5
  30: aload_1
  31: iload         5
  33: i2d
  34: invokevirtual #5    // Method algo/MovingAverage.submit:(D)V
  37: iinc          4, 1
  40: goto          18
  43: return



LocalVariableTable:
  Start  Length  Slot  Name    Signature
     30       7     5  number  I
     12      31     2  arr$    [I
     15      28     3  len$    I
     18      25     4  i$      I
      0      49     0  args    [Ljava/lang/String;
      8      41     1  ma      Lalgo/MovingAverage;
     48       1     2  avg     D

The instructions from position 8 through 17 are used to set up the loop. There are three variables in the LocalVariableTable that aren’t really mentioned in the source: arr$, len$ and i$. These are the loop variables. arr$ stores the reference value of the numbers field, from which the length of the loop, len$, is derived. i$ is the loop counter, which is incremented by the iinc instruction.

First we need to test our loop expression, which is performed by a comparison instruction:

  18: iload         4
  20: iload_3
  21: if_icmpge     43

We’re loading local variables 4 and 3 onto the stack, which are the loop counter and the loop length. We’re checking if i$ is greater than or equal to len$. If it is, we jump to the instruction at position 43, otherwise we proceed. We can then perform our logic in the loop and, at the end, we increment our counter and jump back to the code that checks the loop condition at position 18.

  37: iinc          4, 1    // increment i$
  40: goto          18      // jump back to the beginning of the loop
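Putting the pieces together, the loop in the bytecode behaves roughly like the hand-written version below. This is only a sketch: the MovingAverage class here is a hypothetical stand-in (its real source isn't shown in this post), the numbers array is sample data, and both methods return the average purely so the two forms can be compared. The variable names arr$, len$ and i$ are legal Java identifiers, so we can borrow them straight from the LocalVariableTable.

```java
// Sketch comparing the enhanced for loop with a hand-desugared equivalent.
final class LoopDemo {

    // Hypothetical stand-in for the MovingAverage class used in this post.
    static final class MovingAverage {
        private double sum;
        private int count;
        void submit(double value) { sum += value; count++; }
        double getAvg() { return count == 0 ? 0.0 : sum / count; }
    }

    static int[] numbers = { 1, 2, 3 }; // sample data, assumed for the example

    static double enhancedFor() {
        MovingAverage ma = new MovingAverage();
        for (int number : numbers) {
            ma.submit(number);
        }
        return ma.getAvg();
    }

    static double desugared() {
        MovingAverage ma = new MovingAverage();
        int[] arr$ = numbers;       //  8: getstatic, 11: astore_2
        int len$ = arr$.length;     // 12: aload_2, 13: arraylength, 14: istore_3
        int i$ = 0;                 // 15: iconst_0, 16: istore 4
        while (true) {
            if (i$ >= len$) break;  // 18..21: iload 4, iload_3, if_icmpge 43
            int number = arr$[i$];  // 24..28: aload_2, iload 4, iaload, istore 5
            ma.submit(number);      // 30..34: aload_1, iload 5, i2d, invokevirtual
            i$++;                   // 37: iinc 4, 1
        }                           // 40: goto 18
        return ma.getAvg();
    }

    public static void main(String[] args) {
        System.out.println(enhancedFor()); // prints 2.0
        System.out.println(desugared());   // prints 2.0
    }
}
```

The enhanced for loop is pure compiler sugar: at the bytecode level there is only a counter, a comparison and a goto.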

There are a bunch of arithmetic opcode and type combinations that can be used in bytecode, such as iadd, ladd, fadd and dadd for adding int, long, float and double values, as well as a number of type conversion opcodes, which are important when assigning, say, an integer to a variable of type long.

In our previous example we pass an integer to a submit method which takes a double. Java does this conversion for us implicitly, but in bytecode you’ll see that the i2d opcode is used:

  31: iload         5
  33: i2d
  34: invokevirtual #5    // Method algo/MovingAverage.submit:(D)V
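If you want to see a few conversion opcodes for yourself, a tiny class like the following is enough; compile it and run javap -c on it. ConversionDemo is a made-up name, and the opcode sequences in the comments are what javac typically emits for such one-line static methods.

```java
// Minimal sketch: each method body boils down to a load, one conversion opcode and a return.
final class ConversionDemo {

    // Implicit widening int -> double: iload_0, i2d, dreturn
    static double widenToDouble(int value) {
        return value;
    }

    // Implicit widening int -> long: iload_0, i2l, lreturn
    static long widenToLong(int value) {
        return value;
    }

    // Explicit narrowing double -> int: dload_0, d2i, ireturn (truncates toward zero)
    static int narrowToInt(double value) {
        return (int) value;
    }

    public static void main(String[] args) {
        System.out.println(widenToDouble(5)); // prints 5.0
        System.out.println(widenToLong(5));   // prints 5
        System.out.println(narrowToInt(2.9)); // prints 2
    }
}
```

Widening conversions need no cast in the source, yet they still cost an instruction in the bytecode, while narrowing conversions have to be spelled out in both places.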

So, you’ve made it this far. Well done, you’ve earned a coffee! Is any of this actually useful to know or is it just geek fodder? Well, it’s both! Firstly, you can now tell your friends that you’re a JVM that can process bytecode, and secondly, you can better understand what you’re doing when writing bytecode. For example, when using ObjectWeb ASM, which is one of the most widely used bytecode manipulation tools, you’ll find yourself constructing instructions and this knowledge will prove invaluable!

If you found this interesting and want to know more, then check out our free Mastering Java Bytecode report from Anton Arhipov, the JRebel Product Lead at ZeroTurnaround. (JRebel uses javassist and we have had lots of fun learning about and interacting with Java bytecode!) This report goes into more depth and touches on how to use ASM.

Thanks for reading! Let me know what you thought! (@sjmaple)

Reference: Mastering Java Bytecode from our JCG partner Attila Mihaly Balazs at the Java Advent Calendar blog.