Basic Java 7 Fork/Join Benchmark

One of the enhancements of the newly released Java 7 is the Fork/Join framework that is part of the java.util.concurrent package. This framework is supposed to ease the implementation of divide-and-conquer algorithms while providing high performance on a multicore system. As performance benchmarks are still rare to find (and I just wanted to try out the framework), I implemented a simple divide-and-conquer algorithm with this framework and compared its execution time to that of a similar implementation using plain old threads and one using a Thread pool.

The task to be performed is really simple: compute the maximum number in an array of integers. The algorithm is identical for all three alternatives; see this Java-like pseudocode:

int findMax(int[] numbers, int start, int end) {
  if ((end - start) > 1) {
    int middle = start + (end - start) / 2;
    return Math.max(findMax(numbers, start, middle),
                    findMax(numbers, middle + 1, end));
  }
  return Math.max(numbers[start], numbers[end]);
}

In words: the input is an array of ints, a starting index (initially zero), and an ending index (initially numbers.length - 1, since end is inclusive). If at most two elements remain between start and end, the maximum of the two boundary values is computed and returned. If there is a larger range of values in between, the method is recursively called twice, each call covering half of the range, and the maximum of the two halves is returned. As each of the halves can be traversed independently, some benefit could be expected from parallelizing the calls. In fact, for such a trivial example, simply iterating through the array outperforms the solutions below, due to the overhead of creating threads. If the base operation were more complex than just computing the maximum of two ints, parallelization could increase performance. Still, this example is sufficient for comparing different methods of parallelization. In particular, I tested the following:

  • plain-old-threads: For each recursive invocation, create a new thread, i.e. two new threads per method call when the base case is not reached. The calling thread starts its children and waits for them to complete using Thread.join().
  • thread pool: Initially, a thread pool is created, specifically a CachedThreadPool, which creates new threads on demand or reuses idle ones; this may reduce some of the thread-creation overhead. For each recursive invocation, two Callables are submitted to the pool. The calling thread then waits for them to complete and obtains the results via their Future instances.
  • fork/join: Each recursive call creates two RecursiveTasks, forks them, and joins their results to compute the maximum.
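For illustration, the fork/join variant could be sketched roughly as follows. This is not the original benchmark code; the class name MaxTask and all details are my own reconstruction of the approach described above:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sketch of a RecursiveTask implementing the findMax algorithm above.
public class MaxTask extends RecursiveTask<Integer> {
    private final int[] numbers;
    private final int start, end; // end is inclusive

    MaxTask(int[] numbers, int start, int end) {
        this.numbers = numbers;
        this.start = start;
        this.end = end;
    }

    @Override
    protected Integer compute() {
        if ((end - start) > 1) {
            int middle = start + (end - start) / 2;
            MaxTask left = new MaxTask(numbers, start, middle);
            MaxTask right = new MaxTask(numbers, middle + 1, end);
            left.fork();                    // run the left half asynchronously
            int rightMax = right.compute(); // compute the right half in this thread
            return Math.max(left.join(), rightMax);
        }
        return Math.max(numbers[start], numbers[end]);
    }

    public static void main(String[] args) {
        int[] data = {3, 9, 1, 7, 4, 8, 2};
        int max = new ForkJoinPool().invoke(new MaxTask(data, 0, data.length - 1));
        System.out.println(max); // prints 9
    }
}
```

Forking one child and computing the other in the current thread keeps the calling worker busy instead of blocking it, which is part of what the framework's work-stealing scheduler relies on.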

I ran the tests with a randomly populated array of varying sizes. My machine has an i7 Q720 processor and 4 GB of RAM and is running Windows 7. I executed the tests on JRE 7.0 (Edit: with the 32-bit VM). Here are some results:

Execution time (milliseconds)

Array Size   Plain Old Threads         Thread Pool   Fork/Join
10           2                         3             1
100          13                        9             1
1000         141                       37            2
10000        1360 (OutOfMemoryError)   118           4
50000        2341 (OutOfMemoryError)   309           16

The execution times of the fork/join implementation are really impressive compared to the other techniques. At an array size of 10,000, the solution with plain old threads malfunctions because the VM is unable to create new native threads (and as a result the program might actually not find the correct maximum). The same happens for the thread pool soon after an array size of 50,000. This is not very surprising, given the number of threads needed when partitioning an array of such a size down to subranges of two elements, with one thread each. The results indicate that the fork/join framework might really be preferable in situations where:

  1. A divide-and-conquer solution is possible
  2. Excessive parallelism is required
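For comparison, the thread-pool variant described above can be sketched as follows. Again, this is a hypothetical reconstruction, not the original benchmark code, and it uses lambdas for brevity where actual Java 7 code would use anonymous Callable classes:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of the recursive thread-pool approach. Each recursive call submits
// two Callables and then blocks on their Futures, so the cached pool must
// keep creating threads for all the blocked callers.
public class PoolMax {
    static final ExecutorService POOL = Executors.newCachedThreadPool();

    static int findMax(int[] numbers, int start, int end) throws Exception {
        if ((end - start) > 1) {
            int middle = start + (end - start) / 2;
            Future<Integer> left = POOL.submit(() -> findMax(numbers, start, middle));
            Future<Integer> right = POOL.submit(() -> findMax(numbers, middle + 1, end));
            return Math.max(left.get(), right.get()); // the caller blocks here
        }
        return Math.max(numbers[start], numbers[end]);
    }

    public static void main(String[] args) throws Exception {
        int[] data = {3, 9, 1, 7, 4, 8, 2};
        System.out.println(findMax(data, 0, data.length - 1)); // prints 9
        POOL.shutdown();
    }
}
```

With a cached (unbounded) pool this works but creates roughly one thread per blocked caller, matching the thread explosion observed above; with a bounded pool the same pattern could even deadlock. Avoiding this is exactly what the fork/join design is for.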

Of course, I have to emphasize that this is a very simple benchmark, not allowing an overall assessment of the fork/join framework. It is just a starting point, helping to get a grasp of the problem.

9 thoughts on “Basic Java 7 Fork/Join Benchmark”

  1. Gabriel,

    “First and foremost, fork/join tasks should operate as “pure” in-memory algorithms in which no I/O operations come into play.”
    Thanks for that hint!

    It would definitely be interesting to hear from someone with in-depth JVM knowledge why Java is so wicked fast here. Maybe we are missing some crucial aspect? It might also be worth discussing this on Stack Overflow.

  2. Jörg,

    The third observation is what annoyed me the most. How can it be that Java sums up a million elements in 2 seconds? (Sorry, I’ve uploaded an updated version of the test with the correct labels.) Maybe the JVM is forecasting the addition and max calculation and summing them up as we create the array elements. Or it might be using some kind of loop-unrolling mechanism to speed things up.
    The executors seem to run much slower because there is a constant demand for new threads from the pool. The rate of demand for new threads far exceeds the rate at which threads become available back to the pool. As a result, I’ve noticed the creation of additional thousands of threads during the executor test, as there was no upper bound on the number of threads in the pool. As everyone knows, thread creation is a pretty expensive action.
    Regarding more complex operations in Fork/Join: you have to shun I/O operations in that kind of algorithm, as mentioned in one of Oracle’s blogs:
    “First and foremost, fork/join tasks should operate as “pure” in-memory algorithms in which no I/O operations come into play.”
    Maybe one of the JVM team members can chime in on this discussion and give us some clues as to why the iterative version is so much faster than Fork/Join.
    For reference: I’m running on a quad core 64 bit Windows 7.

  3. Gabriel,

    thanks for the enhanced version of the benchmark! I fully agree with you concerning the “process window” variable. In the original benchmark, I set this value to 2 for the sake of simplicity. A benchmark with a more realistic workload for the threads (you could control that with the process window) and different types of tasks, e.g. sorting, should be interesting (Hm, I’ll keep that in mind…).

    Anyway, in the blog post, I forgot to mention a critical aspect (I edited that now): although I run a 64-bit OS, I executed the benchmark with a 32-bit VM. Using a 64-bit VM eliminated the OutOfMemoryErrors. Here are the results:
    Array Size: 1000000
    =====Initializing Max dataset=====
    =====Done initializing Max dataset=====
    =====Running Max test=====
    =====Starting FORKJOIN test=====
    FORKJOIN: 2.147416203E9 duration: 32
    =====Starting EXECUTOR test=====
    EXECUTOR: 2.147416203E9 duration: 1342
    =====Starting ITERATIVE test=====
    ITERATIVE: 2.147416203E9 duration: 16
    =====Done running Max test=====
    =====Initializing Sum dataset=====
    =====Done initializing Sum dataset=====
    =====Running Sum test=====
    =====Starting FORKJOIN test=====
    FORKJOIN: 4497488.0 duration: 296
    =====Starting EXECUTOR test=====
    EXECUTOR: 225196.0 duration: 94038
    =====Starting ITERATIVE test=====
    ITERATIVE: 225196.0 duration: 16
    =====Done running Sum test=====

    1. Observation: Your machine sums up a lot faster; mine is slightly better in the max test. I have no immediate explanation for that ;-) The difference should average out over many runs (law of large numbers).

    2. Observation: Fork/Join is better than Executors. Not surprising: thread management and distribution of the workload seem to be implemented more efficiently in the new framework. Maybe it is the work-stealing algorithm of the Fork/Join framework that does the trick, maybe something else. No matter what, my conclusion is: for this kind of task, definitely use Fork/Join instead of Executors.

    3. Observation: Iterative is fastest. This is somewhat irritating… But keep in mind that the computation we do is really trivial. My guess is that the thread creation and management overhead simply dwarfs the computation itself. If the threads did something more sophisticated, say writing to an I/O device, I would expect that to change (again, I should keep such a benchmark in mind…).

    Considering this, I find the results surprising, but plausible.

  4. Jörg,

    Thanks for sending over the source code you had. If you don’t mind, I’ve done a bit of refactoring to be able to use the Command pattern, so we can test calculations other than the maximum (e.g. the sum of the numbers).
    I also think that it’s a bit unfair to force threads to process a series of two numbers. As a result, I’ve added a “process window” variable to decide when a thread should stop splitting up the work and delegating to two sub-threads. I’ve also run the test on an array of size 1000000 (a million), giving up the plain-old-threads test and adding an iterative test. I got quite peculiar results:
    Array Size: 1000000
    Max: 2.147476655E9 duration: 35
    Max: 2.147476655E9 duration: 2282
    Max: 2.147476655E9 duration: 4
    Max: 4499456.0 duration: 34
    Max: 4499456.0 duration: 36864
    Max: 4499456.0 duration: 2

    I’ve zipped and uploaded the source to the following link:

  5. Gabriel,

    thank you for the feedback! Luckily, I still had the code. It consists of several classes and a little Strategy pattern. I put all the classes into a single paste on Pastebin; you can find it here:

    To execute it, simply put each of the classes into its own file, or modify the code accordingly. But please remember, this was just a quick hack, so don’t blame me if the code isn’t completely clean ;-)

    It would be very interesting to see the results from your machine!


  6. Joerg,

    Great post on these new capabilities. Do you happen to have the source code for the test, so that I could recreate the results?

