Tuesday, 21 September 2010

Java Best Practices – Queue battle and the Linked ConcurrentHashMap


Continuing our series of articles concerning proposed practices while working with the Java programming language, we are going to perform a performance comparison between four popular Queue implementation classes with relevant semantics. To make things more realistic we are going to test against a multi–threading environment so as to discuss and demonstrate how to utilize ArrayBlockingQueue, ConcurrentLinkedQueue, LinkedBlockingQueue and/or LinkedList for high performance applications.

Last but not least we are going to provide our own implementation of a ConcurrentHashMap. The ConcurrentLinkedHashMap implementation differs from ConcurrentHashMap in that it maintains a doubly–linked list running through all of its entries. This linked list defines the iteration ordering, which is normally the order in which keys were inserted into the map. The ConcurrentLinkedHashMap benefits from the combined characteristics of ConcurrentHashMap and LinkedHashMap achieving performance just slightly below that of ConcurrentHashMap due to the added expense of maintaining the linked list.

All discussed topics are based on use cases derived from the development of mission critical, ultra high performance production systems for the telecommunication industry.

Prior reading each section of this article it is highly recommended that you consult the relevant Java API documentation for detailed information and code samples.

All tests are performed against a Sony Vaio with the following characteristics :
  • System : openSUSE 11.1 (x86_64)
  • Processor (CPU) : Intel(R) Core(TM)2 Duo CPU T6670 @ 2.20GHz
  • Processor Speed : 1,200.00 MHz
  • Total memory (RAM) : 2.8 GB
  • Java : OpenJDK 1.6.0_0 64-Bit
The following test configuration is applied :
  • Concurrent worker Threads : 50
  • Test repeats per worker Thread : 100
  • Overall test runs : 100


ArrayBlockingQueue vs ConcurrentLinkedQueue vs LinkedBlockingQueue vs LinkedList

One of the most common tasks a Java developer has to implement is storing and retrieving objects from Collections. The Java programming language provides a handful of Collection implementation classes with both overlapping and unique characteristics. Since Java 1.5 the Queue implementation classes became the de–facto standard for holding elements prior processing. Besides basic Collection operations, queues provide additional insertion, extraction, and inspection operations.

Nevertheless working with Queue implementation classes, especially in a multi–threading environment, can by tricky. The majority of them provide concurrent access by default but concurrency can be treated in a blocking or non – blocking manner. A BlockingQueue implementation class supports operations that wait for the queue to become non-empty when retrieving an element, and wait for space to become available in the queue when storing an element.

The case scenarios that will be discussed here are having multiple Threads inserting, retracting and iterating through elements of ConcurrentLinkedQueue and LinkedList Queue implementation classes and ArrayBlockingQueue and LinkedBlockingQueue BlockingQueue implementation classes. We are going to demonstrate how to properly utilize the aforementioned collection implementation classes in a multi–threading environment and provide relevant performance comparison charts so as to show which one performs better in every test case.

To make a fare comparison we will assume that no NULL elements are allowed and that the size of every queue is limited where it is applicable. Thus the BlockingQueue implementation classes of our test group will be initialized with a maximum size of 5000 elements – remember that we will be using 50 worker Threads performing 100 test repeats each. Furthermore since LinkedList is the only Queue implementation class of our test group that does not provide concurrent access by default, concurrency for LinkedList will be achieved using a synchronized block to access the list. At this point we must pinpoint that the Java documentation for the LinkedList Collection implementation class proposes the use of Collections.synchronizedList static method in order to maintain concurrent access to the list. This method provides a “wrapped” synchronized instance of the designated Collection implementation class as shown below :

List syncList = Collections.synchronizedList(new LinkedList());

This approach is appropriate when you want to use the specific implementation class as a List rather than a Queue. To be able to use the “queue” functionality of the specific Collection implementation class you must utilize it as is, or cast it to a Queue interface.


Test case #1 – Adding elements in a queue

For the first test case we are going to have multiple Threads adding String elements in each Queue implementation class. In order to maintain uniqueness among String elements, we will construct them as shown below :
  • A static first part e.g. “helloWorld”
  • The worker Thread id, remember that we have 50 worker Threads running concurrently
  • The worker Thread test repeat number, remember that each worker thread performs 100 test repeats for each test run
For every test run each worker Thread will insert 100 String elements, as shown below :
  • For the first test repeat
    • Worker Thread #1 will insert the String element : “helloWorld-1-1”
    • Worker Thread #2 will insert the String element : “helloWorld-2-1”
    • Worker Thread #3 will insert the String element : “helloWorld-3-1”
    • etc …
  • For the second test repeat
    • Worker Thread #1 will insert the String element : “helloWorld-1-2”
    • Worker Thread #2 will insert the String element : “helloWorld-2-2”
    • Worker Thread #3 will insert the String element : “helloWorld-3-2”
    • etc …
  • etc …
At the end of each test run every Queue implementation class will be populated with 5000 distinct String elements. For adding elements we will be utilizing the “put()” operation for the BlockingQueue implementation classes and the “offer()” operation for the Queue implementation classes as shown below :
  • arrayBlockingQueue.put(“helloWorld-" + id + "-" + count);
  • linkedBlockingQueue.put(“helloWorld-" + id + "-" + count);
  • concurrentLinkedQueue.offer(“helloWorld-" + id + "-" + count);
  • synchronized(linkedList) {
       linkedList.offer(“helloWorld-" + id + "-" + count);
    }
Below we present a performance comparison chart between the four aforementioned Queue implementation classes



The horizontal axis represents the number of test runs and the vertical axis the average transactions per second (TPS) for each test run. Thus higher values are better. As you can see all Queue implementation classes performed almost identically when adding elements to them. ArrayBlockingQueue and ConcurrentLinkedQueue performed slightly better compared to LinkedList and LinkedBlockingQueue. The latter was the worst performer scoring 625000 TPS on average.


Test case #2 – Removing elements from a queue

For the second test case we are going to have multiple Threads removing String elements from each Queue implementation class. All queue implementation classes will be pre–populated with the String elements from the previous test case. Every Thread will remove a single element from each Queue implementation class until the Queue is empty.

For removing elements we will be utilizing the “take()” operation for the BlockingQueue implementation classes and the “poll()” operation for the Queue implementation classes as shown below :
  • arrayBlockingQueue.take();
  • linkedBlockingQueue.take();
  • concurrentLinkedQueue.poll();
  • synchronized(linkedList) {
       linkedList.poll();
    }
Below we present a performance comparison chart between the four aforementioned Queue implementation classes


The horizontal axis represents the number of test runs and the vertical axis the average transactions per second (TPS) for each test run. Thus higher values are better. Again all Queue implementation classes performed almost identically when removing String elements from them. ArrayBlockingQueue and ConcurrentLinkedQueue performed slightly better compared to LinkedList and LinkedBlockingQueue. The latter was the worst performance scoring 710000 TPS on average.


Test case #3 – Iterators

For the third test case we are going to have multiple worker Threads iterating over the elements of each Queue implementation class. Every worker Thread will be using the Queue's “iterator()” operation to retrieve a reference to an Iterator instance and iterate through all the available Queue elements using the Iterator's “next()” operation. All Queue implementation classes will be pre–populated with the String values from the first test case. Below is the performance comparison chart for the aforementioned test case.



The horizontal axis represents the number of test runs and the vertical axis the average transactions per second (TPS) for each test run. Thus higher values are better. Both ArrayBlockingQueue and LinkedBlockingQueue implementation classes performed poorly compared to the ConcurrentLinkedQueue and LinkedList implementation classes. LinkedBlockingQueue scored 35 TPS on average, while ArrayBlockingQueue scored 81 TPS on average. On the other hand LinkedList outperformed ConcurrentLinkedQueue, resulting in 15000 TPS on average.


Test case #4 – Adding and removing elements

For our final test case we are going to implement the combination of test case #1 and test case #2 scenarios. In other words we are going to implement a producer – consumer test case. One group of worker Threads is going to insert String elements to every Queue implementation class, whereas another group of worker Threads is going to retract the String elements from them. Every Thread from the “inserting elements” group is going to insert just one element, whereas every Thread from the “retracting elements” group is going to retract just one element. Thus we are going to concurrently insert and retract 5000 unique String elements from every Queue implementation class in total.

To properly simulate the aforementioned test case we must start all worker Threads that retract elements prior starting the worker Threads that insert elements. In order for the worker Threads of the “retracting elements” group to be able to retract a single element they must wait and retry if the relevant Queue is empty. BlockingQueue implementation classes provide waiting functionality by default but Queue implementation classes do not. Thus for removing elements we will be utilizing the “take()” operation for the BlockingQueue implementation classes and the “poll()” operation for the Queue implementation classes as shown below :
  • arrayBlockingQueue.take();
  • linkedBlockingQueue.take();
  • while(result == null)
       result = concurrentLinkedQueue.poll();
  • while(result == null)
    synchronized(linkedList) {
       result = linkedList.poll();
    }

    As you can see we have implemented the bare minimum – a while loop – in order for our ConcurrentLinkedQueue and LinkedList consumers to be able to perform reties when retracting from an empty Queue. Of course you can experiment and implement a more sophisticated approach to the matter. Nevertheless keep in mind that the aforementioned along with any other artificial implementation is NOT a proposed solution and should be avoided in favor to the BlockingQueue “take()” operation as will be shown in the following performance comparison charts.

    Below is the performance comparison chart for the addition part of the aforementioned test case.


    Following is the performance comparison chart for the retraction part of the aforementioned test case.


    The horizontal axis represent the number of test runs and the vertical axis the average transactions per second (TPS) for each test run. Thus higher values are better. As expected ArrayBlockingQueue and LinkedBlockingQueue outperformed both LinkedList and ConcurrentLinkedQueue. BlockingQueue implementations are designed to be used primarily for producer–consumer queues. Their blocking behavior grands them unparalleled performance and efficiency in this kind of scenarios, especially when the number of producer and consumer Threads is relatively high. In fact, according to relevant performance comparisons, the relative gain in performance between Queue implementation classes and BlockingQueue implementation classes increases in favor of the latter as the number of producer and consumer Threads rises.

    As you can see from the provided performance results LinkedBlockingQueue achieved the best combined (adding and removing elements) performance results and should be your number one candidate for implementing producer – consumer schenarios.



    ConcurrentLinkedHashMap

    Our ConcurrentLinkedHashMap implementation is a tweaked version of the ConcurrentHashMap implementation originally coded by Doug Lea and found on OpenJDK 1.6.0_0. We present a concurrent hash map and linked list implementation of the ConcurrentMap interface, with predictable iteration order. This implementation differs from ConcurrentHashMap in that it maintains a doubly-linked list running through all of its entries. This linked list defines the iteration ordering, which is normally the order in which keys were inserted into the map (insertion-order). Note that insertion order is not affected if a key is re-inserted into the map.

    This implementation spares its clients from the unspecified, generally chaotic ordering provided by ConcurrentHashMap and Hashtable, without incurring the increased cost associated with TreeMap. It can be used to produce a copy of a map that has the same order as the original, regardless of the original map's implementation:

    void foo(Map m) {
    Map copy = new ConcurrentLinkedHashMap(m);
    ...
    }

    This technique is particularly useful if a module takes a map on input, copies it, and later returns results whose order is determined by that of the copy. (Clients generally appreciate having things returned in the same order they were presented.)

    A special “ConcurrentLinkedHashMap(int,float,int,boolean)” constructor is provided to create a concurrent linked hash map whose order of iteration is the order in which its entries were last accessed, from least-recently accessed to most-recently (access-order). This kind of map is well-suited to building LRU caches. Invoking the put or get method results in an access to the corresponding entry (assuming it exists after the invocation completes). The “putAll()” method generates one entry access for each mapping in the specified map, in the order that key-value mappings are provided by the specified map's entry set iterator. No other methods generate entry accesses. In particular, operations on collection-views do not affect the order of iteration of the backing map.

    The “removeEldestEntry(Map.Entry)” method may be overridden to impose a policy for removing stale mappings automatically when new mappings are added to the map.

    Performance is likely to be just slightly below that of ComcurrentHashMap, due to the added expense of maintaining the linked list, with one exception: Iteration over the collection-views of a ConcurrentLinkedHashMap requires time proportional to the size of the map, regardless of its capacity. Iteration over a ConcurrentHashMap is likely to be more expensive, requiring time proportional to its capacity.

    You can download the latest version of the ConcurrentLinkedHashMap source and binary code from here

    You can download the source code of  all the "tester" classes that we used to conduct our performance comparison from here, here and here


    Happy coding


    Justin


    Related Articles :



    11 comments:

    1. I'd like to recommend a nice ConcurrentLinkedHashMap implementation: https://code.google.com/p/concurrentlinkedhashmap/

      ReplyDelete
    2. In your #4 test case, the graph for removal, it looks like something may be wrong with your test. The graph shows ConcurrentLinkedQueue as performing 2000/sec in the first pass (similar to the others, which before hotspot optimization is not surprising) but then dropping to 0/sec for the rest of the test. This doesn't correspond to the performance I've seen from ConcurrentLinkedQueue in the past. I would have expected it to be at the top of that graph as well.

      ReplyDelete
    3. Hello Derek,

      I am afraid you are right. Due to a minor typo ConcurrentLinkedQueue performance on the first pass was not around 2000 TPS but around 20 TPS!. At a matter of fact ConcurrentLinkedQueue scored 102 TPS an average for the specific test case. Thats why all TPS scores for the specific Queue implementation class seem are located close to zero on the graph.

      It is obvious that the bare minimum approach implemented to achieve polling, resulted in an overkill for the retracting threads of the ConcurrentLinkedQueue. We can back up our assumption considering that the specific Queue implementation class achieved top performance on the "addition" part of the test.

      BRs

      Justin

      ReplyDelete
    4. In your ConcurrentLinkedHashMap, concurrent calls to removeEldestEntry() may not honor a size constraint due to data races. For example, two threads may insert near the threshold size, but both return false when checking against #size(). The map would then be +1 greater, but never able to self-correct itself later.

      Another concern is that all reads serialize against the doubly-linked list's lock when reordering the link chain (e.g. access order). While this is a very slim lock it could result in contention.

      Lastly, your HashIterator may skip or see duplicate entries if in access order as it traverses the links. This could occur due to the reordering of entries (e.g. reads) during iteration.

      This looks like a good first attempt, but you may want to add some concurrent test cases that validate the behavior.

      ReplyDelete
    5. Hello Ben,

      Thank you for your valuable comments. Below are my answers :

      1) Since removeEldestEntry() operation is meant to be implemented by external processes it is their concern to properly utilize the specific functionality.

      2) Like you said this is a very slim lock and thus the possibility to result in contention is very slim also. Nevertheless we will try to further optimize the specific approach in a future release.

      3) You are right about your point here. We will fix this bug and come back to you as soon as possible. Sorry for the inconvenience.


      BRs

      ReplyDelete
    6. An additional concern is how you read and modify the list's before/after pointers. You are sometimes under a dedicated lock (e.g. when reordering), sometimes a segment's lock (e.g. cloning), and sometimes with no lock (e.g. iteration). I suspect that the list is being corrupted but not failing because you evict based on the entry's key (not its reference) so it can be found in the table (or not, you ignore the result).

      Unfortunately you'll find that fixing this properly is non-trivial. The easiest solutions are either to place the links on a value wrapper with a global list (as done in the Google Code version) or on the entries using per-segment lists (as done in Google's MapMaker). Either of those approaches would restrict mutations of the list under a single lock. If you're adventurous you could use a lock-free dequeue (e.g. JDK 7), but that's non-trivial. Interestingly, the lock-based strategies can be made to perform better than the lock-free approach based on a few simple (but not immediately obvious) observations.

      I'll be interested to see how you mature this, but I think you'll discover that this class is not quite as simple as you initially thought. Its a wonderful learning exercise, though, so best of luck and have fun with it!

      ReplyDelete
    7. Hello Ben, all

      A fixed version of our ConcurrentLinkedHashMap is available for download. I have fixed the Iteration bug and now the iterator sees a snapshot of the underlying map at a given point in time (the iterator's creation time). Also modification of the lists's before/after pointers (when reordering and cloning)is done atomically, thus the list cannot be corrupted.

      Further optimization and tunning will be applied in the following versions, so stay tuned!

      BRs

      ReplyDelete
    8. Hello all,

      Minor optimizations concerning iterators and concurrent linked list modifications where performed in our ConcurrentLinkedHashMap. These optimizations finalize version 1.0.0

      Feel free to download the final version use it and comment on any issue you like.

      In one of my future comments I will provide the performance results derived from the main benchmark tests performed against our ConcurrentLinkedHashMap.

      Until then, stay tuned!

      BRs

      ReplyDelete
    9. Hello all,

      Our ConcurrentLinkedHashMap implementation has been moved to SourceForge.net, for more information please visit our Software page

      BRs

      ReplyDelete
    10. Do you publish somewhere the code that you use to perform your benchmarks?

      Benchmarking in Java is relatively complex to get it right as described here:
      http://ellipticgroup.com/html/benchmarkingArticle.html

      I would like to experiment around with your test cases if possible.

      ReplyDelete
    11. Hello cs224,

      I have updated the article with the location of the "tester" classes source code. You can download them and experiment around our test cases!

      BRs

      ReplyDelete

    Related Posts Plugin for WordPress, Blogger...