The last few years a paradigm shift is taking place in the field of computer processors. For years, processor makers consistently delivered increases in clock rates, so developers enjoyed the fact that their single-threaded software executed faster without any effort from their part. Now, processor makers favor multi-core chip designs, and software has to be written in a multi-threaded or multi-process manner to take full advantage of the hardware. Thus, the only way for software developers to catch up is to write applications that leverage parallelism, i.e. engaging multiple CPU cores for handling all tasks rather than a single faster core. Moore's law still applies, but in a different context.
Parallel computing or parallelization is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently ("in parallel"). In essence, if a CPU intensive problem can be divided in smaller, independent tasks, then those tasks can be assigned to different processors.
Regarding multi-threading and concurrency, Java is very interesting. It has had support for Threads since its beginning and back in the old days you could manipulate thread execution using a low-level approach with the interrupt, join, sleep methods. Moreover, the notify and wait methods that all objects inherit could also be helpful.
It was possible to control application execution in that way, but the process was a bit tedious. And then came the concurrency package in Java 1.5 which provided a higher level framework with which developers could handle threading in a simpler, easier and less error prone way. The package provides a bunch of utility classes commonly useful in concurrent programming.
Over the years, the need for “concurrency-enabled” programs became even bigger, so the platform took this a step further introducing New Concurrency Features for Java SE 7. One of the features is the introduction of the Fork/Join framework, which was intended to be included in the JDK 1.5 version, but finally did not make it.
The Fork/Join framework is designed to make divide-and-conquer algorithms easy to parallelize. That type of algorithms is perfect for problems that can be divided into two or more sub-problems of the same type. They use recursion to break down the problem to simple tasks until these become simple enough to be solved directly. The solutions to the sub-problems are then combined to give a solution to the original problem. An excellent introduction to the Fork/Join approach is the article “Fork-Join Development in Java SE”.
As you may have noticed, the Fork/Join approach is similar to MapReduce in that they are both algorithms for parallelizing tasks. One difference however is that Fork/Join tasks will subdivide themselves into smaller tasks only if necessary (if too large), whereas MapReduce algorithms divide up all the work into portions as the first step of their execution.
The Fork/Join Java framework originated as JSR-166 and you can find more information in the Concurrency JSR-166 Interest Site led by Doug Lea. Actually, this is where you have to go if you wish to use the framework and you do not have JDK 7. As mentioned in the site, we can use the related classes in JDK 1.5 and 1.6 without having to install the latest JDK. In order to do so, we have to download the jsr166 jar and launch our JVM using the option -Xbootclasspath/p:jsr166.jar. Note that you may need to precede "jsr166.jar" with its full file path. This has to be done because the JAR contains some classes that override the core Java classes (for example those under the java.util package). Don't forget to bookmark the JSR-166 API docs.
So, let's see how the framework can be used to address a real problem. We are going to create a program that calculates Fibonacci numbers (a classic pedagogical problem) and see how to use the new classes to make it as fast as we can. This problem has also been addressed in the Java Fork/Join + Groovy article.
First we create a class that represents the problem as follows:
As you can see, we use the recursive version of the solution and this is a typical implementation. (Note that this implementation is highly inefficient since it calculates the same values over and over. In a real-life scenario, the already calculated values should be cached and retrieved in subsequent executions). Let's now see how a single thread approach would look like:
Our task extends the RecursiveTask class which is recursive result-bearing ForkJoinTask. We override the compute method which handles the main computation performed by this task. In that method, we first check if we have to use parallelism (by comparing to a threshold). If this is an easy to perform task, we directly invoke the solve method, otherwise we create two smaller tasks and then execute each one of them independently. The execution occurs into different threads and their results are then combined. This is achieved by using the fork and join methods. Let's test our implementation:
It is obvious that the Fork/Join framework scales a lot better that the single-thread approach and executes the computations in much smaller times. (If you wish to perform some stress tests of your own, don't forget to remove the System.out call in the FibonacciProblem class.) It is also interesting to see some pictures of the CPU usage in my Windows 7 machine (i7-720QM with 4 cores and Hyper-Threading) while each one of the approaches was used.
Single-thread:
The total CPU usage remained very low during the execution (never exceed 16%). As you can see, the CPU is under-utilized while the single-thread application struggles to perform all the calculations.
Multi-threaded:
The CPU utilization is much better and all the processors contribute to the total calculation. We have reached the end to my introduction of the Fork/Join framework in Java. Note that I have only scratched the surface here, numerous other features exist and are ready to help us in leveraging the multi-cores CPUs. A new era is emerging, so developers should get familiar with these concepts. As always, you can find here the source code produced for this article. Don't forget to share!
Parallel computing or parallelization is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently ("in parallel"). In essence, if a CPU intensive problem can be divided in smaller, independent tasks, then those tasks can be assigned to different processors.
Regarding multi-threading and concurrency, Java is very interesting. It has had support for Threads since its beginning and back in the old days you could manipulate thread execution using a low-level approach with the interrupt, join, sleep methods. Moreover, the notify and wait methods that all objects inherit could also be helpful.
It was possible to control application execution in that way, but the process was a bit tedious. And then came the concurrency package in Java 1.5 which provided a higher level framework with which developers could handle threading in a simpler, easier and less error prone way. The package provides a bunch of utility classes commonly useful in concurrent programming.
Over the years, the need for “concurrency-enabled” programs became even bigger, so the platform took this a step further introducing New Concurrency Features for Java SE 7. One of the features is the introduction of the Fork/Join framework, which was intended to be included in the JDK 1.5 version, but finally did not make it.
The Fork/Join framework is designed to make divide-and-conquer algorithms easy to parallelize. That type of algorithms is perfect for problems that can be divided into two or more sub-problems of the same type. They use recursion to break down the problem to simple tasks until these become simple enough to be solved directly. The solutions to the sub-problems are then combined to give a solution to the original problem. An excellent introduction to the Fork/Join approach is the article “Fork-Join Development in Java SE”.
As you may have noticed, the Fork/Join approach is similar to MapReduce in that they are both algorithms for parallelizing tasks. One difference however is that Fork/Join tasks will subdivide themselves into smaller tasks only if necessary (if too large), whereas MapReduce algorithms divide up all the work into portions as the first step of their execution.
The Fork/Join Java framework originated as JSR-166 and you can find more information in the Concurrency JSR-166 Interest Site led by Doug Lea. Actually, this is where you have to go if you wish to use the framework and you do not have JDK 7. As mentioned in the site, we can use the related classes in JDK 1.5 and 1.6 without having to install the latest JDK. In order to do so, we have to download the jsr166 jar and launch our JVM using the option -Xbootclasspath/p:jsr166.jar. Note that you may need to precede "jsr166.jar" with its full file path. This has to be done because the JAR contains some classes that override the core Java classes (for example those under the java.util package). Don't forget to bookmark the JSR-166 API docs.
So, let's see how the framework can be used to address a real problem. We are going to create a program that calculates Fibonacci numbers (a classic pedagogical problem) and see how to use the new classes to make it as fast as we can. This problem has also been addressed in the Java Fork/Join + Groovy article.
First we create a class that represents the problem as follows:
package com.javacodegeeks.concurrency.forkjoin;
public class FibonacciProblem {
public int n;
public FibonacciProblem(int n) {
this.n = n;
}
public long solve() {
return fibonacci(n);
}
private long fibonacci(int n) {
System.out.println("Thread: " +
Thread.currentThread().getName() + " calculates " + n);
if (n <= 1)
return n;
else
return fibonacci(n-1) + fibonacci(n-2);
}
}
As you can see, we use the recursive version of the solution and this is a typical implementation. (Note that this implementation is highly inefficient since it calculates the same values over and over. In a real-life scenario, the already calculated values should be cached and retrieved in subsequent executions). Let's now see how a single thread approach would look like:
package com.javacodegeeks.concurrency.forkjoin;
import org.perf4j.StopWatch;
public class SillyWorker {
public static void main(String[] args) throws Exception {
int n = 10;
StopWatch stopWatch = new StopWatch();
FibonacciProblem bigProblem = new FibonacciProblem(n);
long result = bigProblem.solve();
stopWatch.stop();
System.out.println("Computing Fib number: " + n);
System.out.println("Computed Result: " + result);
System.out.println("Elapsed Time: " + stopWatch.getElapsedTime());
}
}
We just create a new FibonacciProblem and execute its solve method which will recursively call the fibonacci method. I also use the nice Perf4J library to keep track of the elapsed time. The output would be something like this (I have isolated the last lines): ... Thread: main calculates 1 Thread: main calculates 0 Computing Fib number: 10 Computed Result: 55 Elapsed Time: 8 As expected, all the job is getting done by only one thread (main). Let's see how this can be rewritten using the Fork/Join framework. Note that in the Fibonacci solution, the following take place: fibonacci(n-1) + fibonacci(n-2) Because of that, we can assign each one of these two tasks to a new worker (i.e. a new thread) and after the workers have finished their executions, we will join the result. Taking that into account, we introduce the FibonacciTask class which is essentially a way to divide a big Fibonacci problem to smaller ones.package com.javacodegeeks.concurrency.forkjoin;
import java.util.concurrent.RecursiveTask;
public class FibonacciTask extends RecursiveTask<Long> {
private static final long serialVersionUID = 6136927121059165206L;
private static final int THRESHOLD = 5;
private FibonacciProblem problem;
public long result;
public FibonacciTask(FibonacciProblem problem) {
this.problem = problem;
}
@Override
public Long compute() {
if (problem.n < THRESHOLD) { // easy problem, don't bother with parallelism
result = problem.solve();
}
else {
FibonacciTask worker1 = new FibonacciTask(new FibonacciProblem(problem.n-1));
FibonacciTask worker2 = new FibonacciTask(new FibonacciProblem(problem.n-2));
worker1.fork();
result = worker2.compute() + worker1.join();
}
return result;
}
}
Note: If you are not using JDK 7 and have manually included the JSR-166 libraries, you will have to override the default Java core classes. Otherwise you will encounter the following error: java.lang.SecurityException: Prohibited package name: java.util.concurrent To prevent this, setup your JVM to override the classes by using the following argument: -Xbootclasspath/p:lib/jsr166.jar I have used the “lib/jsr166.jar” value because the JAR resides in a folder named “lib” inside my Eclipse project. Here is how the configuration looks like:
Our task extends the RecursiveTask class which is recursive result-bearing ForkJoinTask. We override the compute method which handles the main computation performed by this task. In that method, we first check if we have to use parallelism (by comparing to a threshold). If this is an easy to perform task, we directly invoke the solve method, otherwise we create two smaller tasks and then execute each one of them independently. The execution occurs into different threads and their results are then combined. This is achieved by using the fork and join methods. Let's test our implementation:
package com.javacodegeeks.concurrency.forkjoin;
import java.util.concurrent.ForkJoinPool;
import org.perf4j.StopWatch;
public class ForkJoinWorker {
public static void main(String[] args) {
// Check the number of available processors
int processors = Runtime.getRuntime().availableProcessors();
System.out.println("No of processors: " + processors);
int n = 10;
StopWatch stopWatch = new StopWatch();
FibonacciProblem bigProblem = new FibonacciProblem(n);
FibonacciTask task = new FibonacciTask(bigProblem);
ForkJoinPool pool = new ForkJoinPool(processors);
pool.invoke(task);
long result = task.result;
System.out.println("Computed Result: " + result);
stopWatch.stop();
System.out.println("Elapsed Time: " + stopWatch.getElapsedTime());
}
}
We first check the available number of processors in the system and then create a new ForkJoinPool with the corresponding level of parallelism. We assign and execute our task by using the invoke method. Here is the output (I have isolated the first and last methods): No of processors: 8 Thread: ForkJoinPool-1-worker-7 calculates 4 Thread: ForkJoinPool-1-worker-6 calculates 4 Thread: ForkJoinPool-1-worker-4 calculates 4 ... Thread: ForkJoinPool-1-worker-2 calculates 1 Thread: ForkJoinPool-1-worker-2 calculates 0 Computed Result: 55 Elapsed Time: 16 Note that now the computation is delegated to a number of worker threads, each one of which is assigned a task smaller than the original one. You may have noticed that the elapsed time is bigger that the previous one. This paradox happens because we have used a low threshold (5) and a low value for n (10). Because of that a big number of threads is created, thus unnecessary delay is introduced. The real power of the framework will become apparent for bigger values of threshold (around 20) and higher values of n (40 and above). I performed some quick stress tests for values n>40 and here is a chart with the results:
It is obvious that the Fork/Join framework scales a lot better that the single-thread approach and executes the computations in much smaller times. (If you wish to perform some stress tests of your own, don't forget to remove the System.out call in the FibonacciProblem class.) It is also interesting to see some pictures of the CPU usage in my Windows 7 machine (i7-720QM with 4 cores and Hyper-Threading) while each one of the approaches was used.Single-thread:
The total CPU usage remained very low during the execution (never exceed 16%). As you can see, the CPU is under-utilized while the single-thread application struggles to perform all the calculations.Multi-threaded:
The CPU utilization is much better and all the processors contribute to the total calculation. We have reached the end to my introduction of the Fork/Join framework in Java. Note that I have only scratched the surface here, numerous other features exist and are ready to help us in leveraging the multi-cores CPUs. A new era is emerging, so developers should get familiar with these concepts. As always, you can find here the source code produced for this article. Don't forget to share!
Related Articles :
Related Snippets :
Cool experiment!, not sure I'd include it as
ReplyDelete'import java.util.concurrent.ForkJoinPool ,etc' and
-Xbootclasspath/p:lib/jsr166.jar but instead I'd go with
jsr166y.jar (no need to override bootcl) and import it as 'import jsr166y.ForkJoinPool etc' until it solidifies since it may not be included in JDK7, others are not on board for example http://coopsoft.com/ar/CalamityArticle.html
and others on the fence: ver http://www.scala-lang.org/node/6412 (Scala already has it in 2.8.1 undocumented)
time will tell. but thanks again for your post.
Hi Martin,
ReplyDeleteThank you very much for the additional info!
Regards,
Ilias
Windows 7, Quad Core, 8gb Ram, HP Make
ReplyDeleteSillyWorker Output:
====================
Computing Fib number: 50
Computed Result: 12586269025
Elapsed Time: 115940
Computing Fib number: 50
Computed Result: 12586269025
Elapsed Time: 115862
Computing Fib number: 50
Computed Result: 12586269025
Elapsed Time: 115600
ForkJoinWorker Output:
=======================
No of processors: 4
Computed Result: 12586269025
Elapsed Time: 265914
No of processors: 2
Computed Result: 12586269025
Elapsed Time: 339986
No of processors: 1
Computed Result: 12586269025
Elapsed Time: 606991
For time being I conclude...From this Exercise I see Silly Worker was much better than Fork/Join.
I just can't Digest this............I am sure there is something fishy...will dig more...Am in middle of Robin Hood (my best movie)...will look into the fishy part later...
Hi Sreenath V,
ReplyDeletePlease let me know a couple of things:
1) What was the value used for the threshold?
2) Perhaps you forgot to turn of Sys.out logging?
Ilias
When I ran with 4 processor i got results in 7 seconds and i could not believe my self as the fork less code took 115 seconds.
ReplyDeleteThen I found in the code ForkJoinWorker.java the n was set to 10. Later, I changed the value of n (fibanaci number) to 50 as it was same in case of SillyWorker(Fork less or single threaded).
And coming to Sys.out I dont have any except for the last three lines which prints the results and this won't matter...
Let me try implementing different way and try with 2,3,4 core CPUs.
I changed the THRESHOLD to 25 and now i can breathe easy...
ReplyDeleteWith Threshold = 25;
No of processors: 4
Computed Result: 12586269025
Elapsed Time: 33180
With Threshold = 30;
No of processors: 4
Computed Result: 12586269025
Elapsed Time: 32100
With Threshold = 35;
No of processors: 4
Computed Result: 12586269025
Elapsed Time: 32973
With Threshold = 28;
No of processors: 4
Computed Result: 12586269025
Elapsed Time: 32146
With Threshold = 32;
No of processors: 4
Computed Result: 12586269025
Elapsed Time: 32244
*************************
Best results:
==============
With Threshold = 30;
No of processors: 4
Computed Result: 12586269025
Elapsed Time: 32100
With Threshold = 30;
No of processors: 3
Computed Result: 12586269025
Elapsed Time: 41578
No of processors: 2
Computed Result: 12586269025
Elapsed Time: 62046
No of processors: 1
Computed Result: 12586269025
Elapsed Time: 111892
Threshold is the key for this problem type (Fibonacci)
Hi Sreenath V,
ReplyDeleteI am glad you managed to figure this out! :-)
It is mentioned somewhere in the post that the threshold value should be 40 and above, perhaps this should be stressed more.
Regards,
Ilias
But thanks for the post and for the introduction...I was using HazzelCast for multi core programing so far...good to hear as part of JSR...
ReplyDeleteThis seem to be a very incomplete test.
ReplyDeleteWhat happens if you have both tasks fork and join the results instead of one computing and one joining? It seems to my untrained eye that calling compute on the task is similar to calling run directly on a runnable.
If you did this without fork join and just used standard java threading or concurrency (seems that this test could easily be replicated using futures) what would the results be?
I am trying to run FibonacciTask.java.
ReplyDeleteI downloaded jsr166.jar and placed it in the same folder where FibonacciTask.java is located to start with.
In eclipse, as shown in the screen-shot, under the debug configurations >> Java Application and under the Arguments tab I give the following
-Xbootclasspath/home/username/data/study/jsr166.jar
It gives me the below error:
Unrecognized option: -Xbootclasspath/home/username/data/study/jsr166.jar
Could not create the Java virtual machine
I tried giving
-Xbootclasspath/jsr166.jar as well, but does not help.
Can anyone suggest me a way to resolve this ?
Thanks in advance,
Hi Neena,
ReplyDeleteIt seems that you are having some difficulties setting up the JVM arguments. You can alternatively use jsr166y.jar as suggested by Martin Zoldano. Please read the first comment in this post.
Regards,
Ilias
Thanks Ilias, it worked with jsr166y.jar.
ReplyDeleteI am using Linux (Ubuntu 10.10, 64 bit, 4 processors)
Here are my results for ForkJoinWorker :
No of processors: 4
Computed Result: 12586269025
Elapsed Time: 110325
Hey Neena,
ReplyDeleteGlad it worked!!
The statement with the graph "It is obvious that the Fork/Join framework scales a lot better that the single-thread approach" is false.
ReplyDeleteIt is obvious that the two curves are the SAME except for a scale factor (which is due to the fact that all processors are used now). It is the same same ineffecient calculation.
This is a poor choice of a problem, where the overheard of is much greater than the task itself. Rather than a recursive calculation, a divide and conquer problem such as merge sort would be much better.
Functionally this is great!
ReplyDeleteBut practically the second application contains twice as much code as the first one.
So almost nobody will use this except Tomcat developers.
To make multithreading usable Oracle has to make it simple as 2+2=4.
I want special language construct to be introduced like:
return fork fibonacci(n-1) + fibonacci(n-2);
It's very simple. Expression is calculated as usual from left to right. Forking function is executed in separate thread or pull.
In my computer(i3, 4 processors), n=50:
ReplyDeletesilly use 78174ms , cpu 25%;
forkjoin use 250563ms, cpu 90% .
CPU was used fully, But time was spent more. Because this forkjoin example contain a lot of repeated calculation.
I've tested both applications (usual recursive and fork/join architecture's one).
ReplyDeleteFork/Join cosumes much more memory and time.
By the way, I've changed FibonacciProblem class to cache the calculated results:
private static ConcurrentMap result=new ConcurrentHashMap();
then method:
private long fibonacci(int n) {
if (n <= 1) return n;
Long fibonacci1 = getFib(n-2);
Long fibonacci = getFib(n-1);
return fibonacci + fibonacci1;
}
and finally method:
private Long getFib(int n) {
Long fibonacci = result.get(n);
if (fibonacci==null){
fibonacci=fibonacci(n);
result.put(n, fibonacci);
}
return fibonacci;
}
of course, I do not need to use putIfAbsent method because there is only one thread to put the result in the map.
After that changing I've got the results:
Computing Fib number: 44
Computed Result: 701408733
Elapsed Time: 1 ms
and fork/join (w/o any modifications)
Computed Result: 701408733
Elapsed Time: 14816 ms
NIkon DSL : Adding the cache will impact the results as it will depends on many processors accessing the same collection , remove it and re-evaluate the results..
ReplyDeleteThe discussion how to optimize a recursive Fibonacci implementation makes not really sense. For the computation of Fibonacci numbers there are very fast implementations (e.g. http://nayuki.eigenstate.org/page/fast-fibonacci-algorithms) the recursive is a bad one.
ReplyDeleteI have to admit, that I used the same sample to demonstrate some aspects of fork and join (see http://sites.google.com/site/markussprunck/blog-1/howtoimplementforkandjoininjava7-jsr166concurrencyutilities), but just because of simplicity of Fibonacci not because it is a good idea to use it.