Core Java

Java Spark RDD reduce() Examples – sum, min and max operations

A quick guide to explore the Spark RDD reduce() method in java programming to find sum, min and max values from the data set.

1. Overview

In this tutorial, we will learn how to use the Spark RDD reduce() method using java programming language. Most of the developers use the same method reduce() in pyspark but in this article, we will understand
how to get the sum, min and max operations with Java RDD.

2. Java Spark RDD – reduce() method

First let understand the syntax of java reduce() spark method.

1
public T reduce(scala.Function2<T,T,T> f)

This method takes the Function2 functional interface which is the concept of Java 8. But the Function2 is implemented in Scala language.

Function2 takes two arguments as input and returns one value. Here, always input and output type should be same.

3. Java Spark RDD reduce() Example to find sum

In the below examples, we first created the SparkConf and JavaSparkContext with local mode for the testing purpose.

We’ve provided the step by step meaning in the program.

We must have to pass the lambda expression to the reduce() method. If you are new to java, please read the in-depth article on Java 8 Lambda expressions.

You might be surprised with the logic behind the reduce() method. Below is the explanation of its internals. As a developer, you should know the basic knowledge on hood what is going on.

On the RDD, reduce() method is called with the logic of value1 + value2. That means this formula will be applied to all the values in each partition untill partition will have only one value.

If there are more than one partitions then all the outputs of partitions are moved to another data node. Then next, again the same logic value1 + value2 is applied to get the final result.

if only one partition is for the input file or dataset then it will return the final output of the single partion.

01
02
03
04
05
06
07
08
09
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
package com.javaprogramto.rdd.reduce;
 
import java.util.Arrays;
import java.util.List;
 
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
 
public class RDDReduceExample {
 
    public static void main(String[] args) {
         
        // to remove the unwanted loggers from output.
        Logger.getLogger("org.apache").setLevel(Level.WARN);
 
        // Getting the numbers list.
        List<Integer> numbersList = getSampleData();
         
        // Creating the SparkConf object
        SparkConf sparkConf = new SparkConf().setAppName("Java RDD_Reduce Example").setMaster("local");
 
        // Creating JavaSprakContext object
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
         
        // Converting List into JavaRDD.
        JavaRDD<Integer> integersRDD =  sc.parallelize(numbersList);
         
        // Getting the sum of all numbers using reduce() method
        Integer sumResult = integersRDD.reduce( (value1, value2) -> value1 + value2);
 
        // printing the sum
         
        System.out.println("Sum of RDD numbers using reduce() : "+sumResult);
         
        // closing Spark Context
        sc.close();
         
    }
 
    /**
     * returns a list of integer numbers
     *
     * @return
     */
    private static List<Integer> getSampleData() {
 
        return Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9);
 
    }
 
}

Output:

1
Sum of RDD numbers using reduce() : 45

4. Java Spark RDD reduce() min and max Examples

Next, let us find the min and max values from the RDD.

01
02
03
04
05
06
07
08
09
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
package com.javaprogramto.rdd.reduce;
 
import java.util.Arrays;
import java.util.List;
 
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
 
public class RDDReduceExample {
 
    public static void main(String[] args) {
         
        // to remove the unwanted loggers from output.
        Logger.getLogger("org.apache").setLevel(Level.WARN);
 
        // Getting the numbers list.
        List<Integer> numbersList = getSampleData();
         
        // Creating the SparkConf object
        SparkConf sparkConf = new SparkConf().setAppName("Java RDD_Reduce Example").setMaster("local");
 
        // Creating JavaSprakContext object
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
         
        // Converting List into JavaRDD.
        JavaRDD<Integer> integersRDD =  sc.parallelize(numbersList);
         
        // Finding Min and Max values using reduce() method
         
        Integer minResult = integersRDD.reduce( (value1, value2) -> Math.min(value1, value2));
         
        System.out.println("Min of RDD numbers using reduce() : "+minResult);
         
        Integer maxResult = integersRDD.reduce( (value1, value2) -> Math.max(value1, value2));
         
        System.out.println("Max of RDD numbers using reduce() : "+maxResult);
         
        // closing Spark Context
        sc.close();
         
    }
 
    /**
     * returns a list of integer numbers
     *
     * @return
     */
    private static List<Integer> getSampleData() {
 
        return Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9);
 
    }
 
}

Output:

1
2
Min of RDD numbers using reduce() : 1
Max of RDD numbers using reduce() : 9

5. Conclusion

In this post, we’ve seen how to use reduce() aggregate operation on the RDD dataset to find the
sum,  m in and max values with example program in java.

Reference

Reduce() API

GitHub Code

Published on Java Code Geeks with permission by Venkatesh Nukala, partner at our JCG program. See the original article here: Java Spark RDD reduce() Examples – sum, min and max opeartions

Opinions expressed by Java Code Geeks contributors are their own.

Venkatesh Nukala

Venkatesh Nukala is a Software Engineer working for Online Payments Industry Leading company. In my free time, I would love to spend time with family and write articles on technical blogs. More on JavaProgramTo.com
Subscribe
Notify of
guest

This site uses Akismet to reduce spam. Learn how your comment data is processed.

0 Comments
Inline Feedbacks
View all comments
Back to top button