How to Translate SQL GROUP BY and Aggregations to Java 8

Lukas EderJanuary 26th, 2015Last Updated: January 25th, 2015

0 23 8 minutes read

I couldn’t resist. I have read this question by Hugo Prudente on Stack Overflow. And I knew there had to be a better way than what the JDK has to offer.

The question reads:

I’m looking for a lambda to refine the data already retrieved. I have a raw resultset, if the user do not change the date I want use java’s lambda to group by the results for then. And I’m new to lambdas with java.
The lambda I’m looking for works simliar to this query.
SELECT
    z, w, 
    MIN(x), MAX(x), AVG(x), 
    MIN(y), MAX(y), AVG(y) 
FROM table 
GROUP BY z, w;

SQL is declarative. Functional programming is not.

Before we go on with this discussion, let’s establish a very important fact. SQL is a completely declarative language. Functional (or “functional-ish”, to keep the Haskell-aficionados at peace) programming languages like Java 8 are not declarative. While expressing data transformation algorithms using functions is much more concise than expressing them using objects, or worse, using imperative instructions, you’re still explicitly expressing the algorithm.

When you write SQL, you don’t write any algorithm. You just describe the result you want to have. The SQL engine’s optimiser will figure out the algorithm for you – e.g. based on the fact that you may have an index on Z but not on W or on (Z, W).

While simple examples like these can easily be implemented using Java 8, you will quickly run into Java’s limitations, once you need to do more complex reporting.

Of course, as we’ve blogged before, the optimum is reached when you combine SQL and functional programming.

How can this be written in Java 8?

There are a variety of ways to do it. The essence is to understand all the participants in such a transformation. And no matter if you find this easy or hard, suitable for Java 8 or inadequate, thinking about the different, lesser-known parts of new Stream API is certainly worth the exercise.

The main participants here are:

Stream: If you’re using JDK 8 libraries, then the new java.util.stream.Stream type will be your first choice.
Collector: The JDK provides us with a rather low-level and thus very powerful new API for data aggregation (also known as “reduction”). This API is summarised by the new java.util.stream.Collector type, a new type from which we have heard only little so far in the blogosphere

Disclaimer

Some of the code displayed here might not work in your favourite IDE. Unfortunately, even if Java 7 reaches its end of life, all major IDEs (Eclipse, IntelliJ, NetBeans), and even the javac compiler still have quite a few bugs related to the combination of generic type inference and lambda expressions. Stay tuned until those bugs are fixed! And report any bug you discover. We’ll all thank you for it!

Let’s go!

Let’s review our SQL statement:

SELECT
    z, w, 
    MIN(x), MAX(x), AVG(x), 
    MIN(y), MAX(y), AVG(y) 
FROM table 
GROUP BY z, w;

In terms of the Stream API, the table itself is the Stream. Let’s just assume that we have a “table type” A as such:

class A {
    final int w;
    final int x;
    final int y;
    final int z;

    A(int w, int x, int y, int z) {
        this.w = w;
        this.x = x;
        this.y = y;
        this.z = z;
    }

    @Override
    public String toString() {
        return "A{" +
                "w=" + w +
                ", x=" + x +
                ", y=" + y +
                ", z=" + z +
                '}';
    }
}

You can also add equals() and hashCode() if you must.

We can now easily compose the Stream using Stream.of(), and some sample data:

Stream<A> stream =
Stream.of(
    new A(1, 1, 1, 1),
    new A(1, 2, 3, 1),
    new A(9, 8, 6, 4),
    new A(9, 9, 7, 4),
    new A(2, 3, 4, 5),
    new A(2, 4, 4, 5),
    new A(2, 5, 5, 5));

Now, the next step is to GROUP BY z, w. The Stream API itself, unfortunately, doesn’t contain such a convenience method. We have to resort to more low-level operations by specifying the more general Stream.collect() operation, and passing a Collector to it that does the grouping. Luckily, a variety of different grouping Collectors are already made available from the Collectors helper class.

So we add that to our stream:

Stream.of(
    new A(1, 1, 1, 1),
    new A(1, 2, 3, 1),
    new A(9, 8, 6, 4),
    new A(9, 9, 7, 4),
    new A(2, 3, 4, 5),
    new A(2, 4, 4, 5),
    new A(2, 5, 5, 5))
.collect(Collectors.groupingBy(...));

Now the interesting part starts. How do we specify that we want to group by both A.z and A.w? We need to provide this groupingBy method with a function that can extract something like a SQL tuple from the A type. We could write our own tuple, or we simply use that of jOOλ, a library that we have created and open-sourced to improve our jOOQ integration tests.

The Tuple2 type roughly looks like this:

public class Tuple2<T1, T2> {

    public final T1 v1;
    public final T2 v2;

    public T1 v1() {
        return v1;
    }

    public T2 v2() {
        return v2;
    }

    public Tuple2(T1 v1, T2 v2) {
        this.v1 = v1;
        this.v2 = v2;
    }
}

public interface Tuple {
    static <T1, T2> Tuple2<T1, T2> tuple(T1 v1, T2 v2) {
        return new Tuple2<>(v1, v2);
    }
}

It has many more useful features, but these ones will be sufficient for this article.

On a side-note

Why the JDK doesn’t ship with built-in tuples like C#’s or Scala’s escapes me.

Functional programming without tuples is like coffee without sugar: A bitter punch in your face.

Anyway… back on track

So we’re grouping by the (A.z, A.w) tuple, as we would in SQL

Map<Tuple2<Integer, Integer>, List<A>> map =
Stream.of(
    new A(1, 1, 1, 1),
    new A(1, 2, 3, 1),
    new A(9, 8, 6, 4),
    new A(9, 9, 7, 4),
    new A(2, 3, 4, 5),
    new A(2, 4, 4, 5),
    new A(2, 5, 5, 5))
.collect(Collectors.groupingBy(
    a -> tuple(a.z, a.w)
));

As you can see, this produces a verbose but very descriptive type, a map containing our grouping tuple as its key, and a list of collected table records as its value.

Running the following statement:

map.entrySet().forEach(System.out::println);

will yield:

(1, 1)=[A{w=1, x=1, y=1, z=1}, A{w=1, x=2, y=3, z=1}]
(4, 9)=[A{w=9, x=8, y=6, z=4}, A{w=9, x=9, y=7, z=4}]
(5, 2)=[A{w=2, x=3, y=4, z=5}, A{w=2, x=4, y=4, z=5}, A{w=2, x=5, y=5, z=5}]

That’s already quite awesome! In fact, this behaves like the SQL:2011 standard COLLECT() aggregate function, that is also available in Oracle 10g+

Now, instead of actually collecting the A records, we prefer to aggregate the individual values of x and y. The JDK provides us with a couple of interesting new types, e.g. the java.util.IntSummaryStatistics, which is available for convenience again from the Collectors type via Collectors.summarizingInt().

On a side note

For my taste, this sledge-hammer data aggregation technique is a bit quirky. The JDK libraries have been left intentionally low level and verbose, perhaps to keep the library footprint small, or to prevent “horrible” consequences when in 5-10 years (after the release of JDK 9 and 10), it becomes obvious that some features may have been added prematurely.

At the same time, there is this all-or-nothing IntSummaryStatistics, that blindly aggregates these popular aggregation values for your collection:

COUNT(*)
SUM()
MIN()
MAX()

and obviously, once you have SUM() and COUNT(*), you also have AVG() = SUM() / COUNT(*). So that’s going to be the Java way. IntSummaryStatistics.

In case you were wondering, the SQL:2011 standard specifies these aggregate functions:

AVG, MAX, MIN, SUM, EVERY, ANY, SOME, COUNT, STDDEV_POP, STDDEV_SAMP, VAR_SAMP, VAR_POP, COLLECT, FUSION, INTERSECTION, COVAR_POP, COVAR_SAMP, CORR, REGR_SLOPE, REGR_INTERCEPT, REGR_COUNT, REGR_R2, REGR_AVGX, REGR_AVGY, REGR_SXX, REGR_SYY, REGR_SXY, PERCENTILE_CONT, PERCENTILE_DISC, ARRAY_AGG

And obviously there are many other, vendor-specific aggregate and window functions in SQL. We’ve blogged about them all:

True, MIN, MAX, SUM, COUNT, AVG are certainly the most popular ones. But it would’ve been nicer if they hadn’t been included in these default aggregation types, but made available in a much more composable way.

Anyway… back on track

If you want to stay low-level and use mostly JDK API, you can use the following technique to implement aggregation over two columns:

Map<
    Tuple2<Integer, Integer>, 
    Tuple2<IntSummaryStatistics, IntSummaryStatistics>
> map = Stream.of(
    new A(1, 1, 1, 1),
    new A(1, 2, 3, 1),
    new A(9, 8, 6, 4),
    new A(9, 9, 7, 4),
    new A(2, 3, 4, 5),
    new A(2, 4, 4, 5),
    new A(2, 5, 5, 5))
.collect(Collectors.groupingBy(
    a -> tuple(a.z, a.w),
    Collector.of(

        // When collecting, we'll aggregate data
        // into two IntSummaryStatistics for x and y
        () -> tuple(new IntSummaryStatistics(), 
                    new IntSummaryStatistics()),

        // The accumulator will simply take
        // new t = (x, y) values
        (r, t) -> {
            r.v1.accept(t.x);
            r.v2.accept(t.y);
        },

        // The combiner will merge two partial
        // aggregations, in case this is executed
        // in parallel
        (r1, r2) -> {
            r1.v1.combine(r2.v1);
            r1.v2.combine(r2.v2);

            return r1;
        }
    )
));

map.entrySet().forEach(System.out::println);

The above would now print

(1, 1)=(IntSummaryStatistics{count=2, sum=3, min=1, average=1.500000, max=2}, 
        IntSummaryStatistics{count=2, sum=4, min=1, average=2.000000, max=3})
(4, 9)=(IntSummaryStatistics{count=2, sum=17, min=8, average=8.500000, max=9}, 
        IntSummaryStatistics{count=2, sum=13, min=6, average=6.500000, max=7})
(5, 2)=(IntSummaryStatistics{count=3, sum=12, min=3, average=4.000000, max=5}, 
        IntSummaryStatistics{count=3, sum=13, min=4, average=4.333333, max=5})

But obviously, no one will want to write that much code. The same thing can be achieved with jOOλ with much less code

Map<
    Tuple2<Integer, Integer>, 
    Tuple2<IntSummaryStatistics, IntSummaryStatistics>
> map =

// Seq is like a Stream, but sequential only,
// and with more features
Seq.of(
    new A(1, 1, 1, 1),
    new A(1, 2, 3, 1),
    new A(9, 8, 6, 4),
    new A(9, 9, 7, 4),
    new A(2, 3, 4, 5),
    new A(2, 4, 4, 5),
    new A(2, 5, 5, 5))

// Seq.groupBy() is just short for 
// Stream.collect(Collectors.groupingBy(...))
.groupBy(
    a -> tuple(a.z, a.w),

    // ... because once you have tuples, 
    // why not add tuple-collectors?
    Tuple.collectors(
        Collectors.summarizingInt(a -> a.x),
        Collectors.summarizingInt(a -> a.y)
    )
));

What you see above is probably as close as it gets to the original, very simmple SQL statement:

SELECT
    z, w, 
    MIN(x), MAX(x), AVG(x), 
    MIN(y), MAX(y), AVG(y) 
FROM table 
GROUP BY z, w;

The interesting part here is the fact that we have what we call “tuple-collectors”, a Collector that collects data into tuples of aggregated results for any degree of the tuple (up to 8). Here’s the code for Tuple.collectors:

// All of these generics... sheesh!
static <T, A1, A2, D1, D2> 
       Collector<T, Tuple2<A1, A2>, Tuple2<D1, D2>> 
collectors(
    Collector<T, A1, D1> collector1
  , Collector<T, A2, D2> collector2
) {
    return Collector.of(
        () -> tuple(
            collector1.supplier().get()
          , collector2.supplier().get()
        ),
        (a, t) -> {
            collector1.accumulator().accept(a.v1, t);
            collector2.accumulator().accept(a.v2, t);
        },
        (a1, a2) -> tuple(
            collector1.combiner().apply(a1.v1, a2.v1)
          , collector2.combiner().apply(a1.v2, a2.v2)
        ),
        a -> tuple(
            collector1.finisher().apply(a.v1)
          , collector2.finisher().apply(a.v2)
        )
    );
}

Where the Tuple2<D1, D2> is the aggregation result type that we derive from collector1 (which provides D1) and from collector2 (which provides D2).

That’s it. We’re done!

Conclusion

Java 8 is a first step towards functional programming in Java. Using Streams and lambda expressions, we can already achieve quite a bit. The JDK APIs, however, are extremely low level and the experience when using IDEs like Eclipse, IntelliJ, or NetBeans can still be a bit frustrating. While writing this article (and adding the Tuple.collectors() method), I have reported around 10 bugs to the different IDEs. Some javac compiler bugs are not yet fixed, prior to JDK 1.8.0_40 ea. In other words:

I just keep throwing generic type parameters at the darn thing until the compiler stops bitching at me

But we’re on a good path. I trust that more useful API will ship with JDK 9 and especially with JDK 10, when all of the above will hopefully profit from the new value types and generic type specialization.

We have created jOOλ to add the missing pieces to the JDK libraries. If you want to go all in on functional programming, i.e. when your vocabulary includes hipster terms (couldn’t resist) like monads, monoids, functors, and all that, we suggest you skip the JDK’s Streams and jOOλ entirely, and go download functionaljava by Mark Perry or javaslang by Daniel Dietrich

Reference:

How to Translate SQL GROUP BY and Aggregations to Java 8 from our JCG partner Lukas Eder at the JAVA, SQL, AND JOOQ blog.