Java8 Lambdas: Sorting Performance Pitfall EXPLAINED

Daniel ShayaJanuary 27th, 2015Last Updated: January 26th, 2015

0 25 4 minutes read

Written in collaboration with Peter Lawrey.

A few days ago I raised a serious problem with the performance of sorting using the new Java8 declarative style. See blogpost here. In that post I only pointed out the issue but in this post I’m going to go a bit deeper into understanding and explaining the causes of the problem. This will be done by reproducing the issue using the declarative style, and bit by bit modifying the code until we have removed the performance issue and are left with the performance that we would expect using the old style compare.

To recap, we are sorting instances of this class:

private static class MyComparableInt{
        private int a,b,c,d;

        public MyComparableInt(int i) {
            a = i%2;
            b = i%10;
            c = i%1000;
            d = i;
        }

        public int getA() return a;
        public int getB() return b;
        public int getC() return c;
        public int getD() return d;
}

Using the declarative Java 8 style (below) it took ~6s to sort 10m instances:

List mySortedList = myComparableList.stream()
      .sorted(Comparator.comparing(MyComparableInt::getA)
                              .thenComparing(MyComparableInt::getB)
                              .thenComparing(MyComparableInt::getC)
                              .thenComparing(MyComparableInt::getD))
      .collect(Collectors.toList());

Using a custom sorter (below) took ~1.6s to sort 10m instances.

This is the code call to sort:

List mySortedList = myComparableList.stream()
                    .sorted(MyComparableIntSorter.INSTANCE)
                    .collect(Collectors.toList());

Using this custom Comparator:

public enum MyComparableIntSorter 
    implements Comparator<MyComparableInt>{
        INSTANCE;

        @Override
        public int compare(MyComparableInt o1, MyComparableInt o2) {
            int comp = Integer.compare(o1.getA(), o2.getA());
            if(comp==0){
                comp = Integer.compare(o1.getB(), o2.getB());
                if(comp==0){
                    comp = Integer.compare(o1.getC(), o2.getC());
                    if(comp==0){
                        comp = Integer.compare(o1.getD(), o2.getD());
                    }
                }
            }
            return comp;
        }
 }

Let’s create a comparing method in our class so we can analyse the code more closely. The reason for the comparing method is to allow us to easily swap implementations but leave the calling code the same.

In all cases this is how the comparing method will be called:

List mySortedList = myComparableList.stream()
                    .sorted(comparing(
                            MyComparableInt::getA,
                            MyComparableInt::getB,
                            MyComparableInt::getC,
                            MyComparableInt::getD))
                    .collect(Collectors.toList());

The first implementation of comparing is pretty much a copy of the one in jdk.

public static <T, U extends Comparable<? super U>> Comparator<T> 
   comparing(
            Function<? super T, ? extends U> ke1,
            Function<? super T, ? extends U> ke2,
            Function<? super T, ? extends U> ke3,
            Function<? super T, ? extends U> ke4)
    {
        return  Comparator.comparing(ke1).thenComparing(ke2)
                  .thenComparing(ke3).thenComparing(ke4);
    }

Not surprisingly this took ~6s to run through the test – but at least we have reproduced the problem and have a basis for moving forward.

Let’s look at the flight recording for this test:

As can be seen there are two big issues:

A performance issue in the lambda$comparing method
Repeatedly calling Integer.valueOf (auto-boxing)

Let’s try and deal with the first one which is in the comparing method. At first sight this seems strange because when you look at the code there’s not much happening in that method. One thing however that is going on here extensively are virtual table lookups as the code finds the correct implementation of the function. Virtual table lookups are used when there are multiple methods called from a single line of code. We can eliminate this source of latency with the following implementation of comparing. By expanding all of the uses of the Function interface each line can only call one implementation and thus the method is inlined.

public static <T, U extends Comparable<? super U>> Comparator<T> 
       comparing(
            Function<? super T, ? extends U> ke1,
            Function<? super T, ? extends U> ke2,
            Function<? super T, ? extends U> ke3,
            Function<? super T, ? extends U> ke4)
    {
        return  (c1, c2) -> {
            int comp = compare(ke1.apply(c1), ke1.apply(c2));
            if (comp == 0) {
                comp = compare(ke2.apply(c1), ke2.apply(c2));
                if (comp == 0) {
                    comp = compare(ke3.apply(c1), ke3.apply(c2));
                    if (comp == 0) {
                        comp = compare(ke4.apply(c1), ke4.apply(c2));
                    }
                }
            }
            return comp;
        };
    }

By unwinding the method the JIT should be able to inline the method lookup.

Indeed the time almost halves to 3.5s, let’s look at the Flight Recording for this run:

When I first saw this I was very surprised because as yet we haven’t done any changes to reduce the calls to Integer.valueOf but that percentage has gone right down! What has has actually happened here is that, because of the changes we made to allow inlining, the Integer.valueOf has been inlined and the time taken for the Integer.valueOf is being blamed on the caller ( lambda$comparing ) which has inlined the callee ( Integer.valueOf ). This is a common problem in profilers as they can be mistaken as to which method to blame especially when inlining has taken place.

But we know that in the previous Flight Recording Integer.valueOf
was highlighted so let’s remove that with this implementation of comparing and see if we can reduce the time further.

return  (c1, c2) -> {
    int comp = compare(ke1.applyAsInt(c1), ke1.applyAsInt(c2));
    if (comp == 0) {
        comp = compare(ke2.applyAsInt(c1), ke2.applyAsInt(c2));
        if (comp == 0) {
           comp = compare(ke3.applyAsInt(c1), ke3.applyAsInt(c2));
           if (comp == 0) {
             comp = compare(ke4.applyAsInt(c1), ke4.applyAsInt(c2));
           }
         }
    }
    return comp;
};

With this implementation the time goes right down to 1.6s which is what we could achieve with the custom Comparator.

Let’s again look at the flight recording for this run:

All the time is now going in the actual sort methods and not in overhead.

In conclusion we have learnt a couple of interesting things from this investigation:

Using the new Java8 declarative sort will in some cases be up to 4x slower than writing a custom Comparator because of the cost of auto-boxing and virtual table lookups.
FlightRecorder whilst being better than other profilers (see my first blog post on this issue) will still attribute time to the wrong methods especially when inlining is taking place.

Reference:

Java8 Lambdas: Sorting Performance Pitfall EXPLAINED from our JCG partner Daniel Shaya at the Rational Java blog.