Algorithms and Data Structures of JDK 7
While checking periodically if there is one or another standard algorithm in JDK I’ve decided to make such index. It was also interesting why some famous data structures or algs are included in and others – not. A format of this survey is only about key properties and features of algorithms and data structures of JDK, all details and full description – you can easily find in javadoc or jdk sources. Let’s start from simple to complex!
Data Structures of JDK
There is a stack in jdk appeared there from origins – class Stack, but it’s not recommended to use it, it’s complex and strange: it inherits from Vector, so is based on Dynamic Array and synchronized. Why a simple stack needs all of that and why it’s not just an interface – is not clear (discussed many times: 1, 2), but it seems just an architecture mistake, the same as with Vector itself. Btw, JDK authors themselves suggest using Deque instead.
Deque – an interface (api) of double-ended queue (LIFO + FIFO in O (1)), which includes the stack operations (push, pop, isEmpty, size) and is available in the jdk not so far (1.6+). Of course it’d be more logical to put these stack operations in an interface Stack, and e.g. let Deque inherit it, but as Stack was already present, and backward compatibility is the “holy grail” of java – they had to sacrifice a normal design. Implementations of Deque are ArrayDeque and LinkedList, who are also implementors of an usual queue – so we’ll discuss this later.
Next, let’s look at queue data type. Everything here is good and well designed. Queue – interface (api) of FIFO queue, adding to the beginning and removal from the end are in constant time O(1).
There are main implementations: ArrayDeque, cyclical buffer based on dynamically extensible array (doubles when filling) and LinkedList, a classic doubly-linked list (the size is not limited). Surprisingly, the first does not support random access (add/remove/get with index), the second does but in O(n) time with iterations through a linked list. These classes also implements the above mentioned Deque so they support removal from the end and adding to the top in constant time.
Next, since jdk 1.5+, the PriorityQueue was added which in fact violates the contract because the queue elements are not retrieved from the end (and they are not added to the head), but according to their priorities. PriorityQueue is based on an extensible binary heap with the minimum on the top (according to its comparator) and it enhances in 1.5 times on filling. Respectively, the key features are: element addition/removal is in O(log N) and a reference to the minimum (head) in O(1).
Other types of queues are designed for multithreaded usage, they are: BlockingQueue, TransferQueue, ConcurrentLinkedQueue and ConcurrentLinkedDeque.
Implementations of BlockingQueue (ArrayBlockingQueue, LinkedBlockingQueue, PriorityBlockingQueue) are kind of synchronized versions of their originals, i.e. nearly every operation is performed serially (with exclusive lock). It’s also true for DelayQueue: synchronized too and uses PriorityQueue internally.
Other implementations: SynchronousQueue, TransferQueue ( LinkedTransferQueue), ConcurrentLinkedQueue, ConcurrentLinkedDeque – based on a different approach: they use a non-blocking queue algorithms based on linked lists and CAS instructions which are well paralleled in a multiprocessor environment. Detailed description is in the source code.
Theory of such algorithms is a quite huge and modern topic so it’s not well standardized and structured yet and they are beyond the scope of this review and subject for another article.
As been said since jdk 1.5+ there is an general PriorityQueue that works according to the element comparator. Moreover, there is another implementation of the heap in the jdk. This is the good old Timer, the internal class – TaskQueue (a task with minimal latency is on the top). Of course it’s a private class and can not be used except inside the Timer.
As you know there are sequential and/or random access type lists. In java, it’s List and there are 2 main implementations: first – is ArrayList, supports random access, based on dynamically extensible array (increases by half when filling), but it doesn’t shrink after deleting all elements, you need to call a special method (trimToSize).
And the second – again LinkedList it’s a doubly-linked list sequential access size is limited only by the memory of jvm. Although the methods of random access (index) are also present – as been said, they require O(n) time.
Thus, there is no the simplest linked list implementation in the java collections though it’d be a good idea (2 times less overhead for the links), as well as there is no a simple stack.
For using a list in a multithreaded environment, there are several options: CopyOnWriteArrayList (change operations – O(n)), wrappers (synchonizedList) as well as outdated Vector.
They are presented as binary tree and hash tables in JDK. The Binary Tree – it’s TreeMap (or TreeSet), the implementation of SortedMap (SortedSet too), it is based on a classic red-black tree, i.e. is balanced and its basic operations are guaranteed for the O(log N) and it is not limited in size. Other types of trees are not present in jdk.
A hash table is HashMap (HashSet too) is probably the most used structure in java, it is based on a dynamically extensible hash table with separate chaining with linked-lists, with all the features: the performance depends on a hash function quality, so it’s O(N) in the worst case. When the size reaches a predetermined loadFactor the HashMap doubles in size. It is worth noting that for the protection of the bad hash functions a double hashing is used and a tricky bit arithmetic applies to the result of a call to hashCode().
There are also implementations of hash tables with open-addressing (linear probing) in JDK. One of them is IdentityHashMap, which uses an optimized version of the classical linear probing when both the keys and values are stored in the same array next to each other for better data caching (javadoc: «better locality for large tables than does using separate arrays»)
The second open-addressing implementation is a very specific: it only serves to store ThreadLocal elements (internal hidden class in ThreadLocalMap) and of course is not available for use.
There are multithreaded versions: ConcurrentHashMap, wrapper synchronizedMap, Hashtable and ConcurrentSkipListMap. Wrapper – naturally is just blocking version of the regular HashMap Hashtable – the same thing (and it’s legacy and not recommended), ConcurrentHashMap – lock-striping version, that reduces the critical lock sections (you’d better read about it JCiP, here is an excerpt) ConcurrentSkipListMap – is an adapted version of the non-blocking named algorithm for hash tables (see the source code for the details).
Sets (w/o duplicates) – it’s Set in java, and implemented via HashMap so all that is said to hash tables – valid for HashSet.
Graph structures and algorithms are not represented in jdk. Therefore, you can use only third-party libraries for that.
There is an usual implementation of strings in java: based on array of unicode characters. It is worth mentioning that starting from version 1.7_17 – the performance of substring is O(n), since the substring is copied.
It is interesting this implementation uses a simple (brute-force) substring search algorithm that runs in O(N*M) in the worst case, and there are not some efficient algorithms (built on finite-state machine: Knuth-Morris-Pratt, etc.) in JDK.
There are several reasons for that: a large size of the alphabet UTF characters (~ 65K), and hence a large overhead for storage a state machine, while the brute-force algorithm is in-place (do not use extra memory). And secondly, on average input strings – this algorithm is not much behind others by statistics.
The same with a sorting: there are effective radix sort string algs (LSD, MSD, etc), but in the jdk, there is a standard object sort used for strings, which runs in O(M*N*log N), if most of the lines don’t differ much (M – length of lines). The reason is the same: quick counting string algs require additional arrays of UTF alphabet size, which makes them very ineffective on average inputs.
Since jdk7 many changes regarding sorts of options has occured, this was discussed many time, there are many information and articles on this subject, you can easily google it.
In short, here is the actual list of sorting implementations in the jdk: TimSort – for sorting objects by default, mergesort – also used for objects, the old version (enabled through a system property), Dual-Pivot Quick sort – for primitives, then sorting by counting is used for byte / character arrays and finally insertion sort is used for small arrays used in any case.
Collection contents are sorted with Collections.sort(List …) and it is still done via copying the collection into an array, sorting it, and then overwriting the collection contents accordingly. Therefore, it’s still impossible to sort a collection w/o an overhead, though I think it’d be a good idea to have an in-place sort of linked lists. String sorting in java is also done with the object version, the reasons were already mentioned.
A traditional binary search is present for all arrays of primitives and objects as well as for random access list implementations (e.g. ArrayList etc). In addition there is a version of binary search for linked lists in JDK! It’s very surprisingly: JDK doesn’t have a sorting for linked lists, but it has a binary search for them! Although it doesn’t make much sense because the performance of such version is O(N).
There are Pattern and Matcher classes in for that. Java uses the traditional implementation based on non-deterministic finite state machine (NFA) with backtracking. So, it gives an exponential complexity of O(m^N) in the worst case on degenerated inputs, e.g. run this regexp in java and try to add/remove “a” symbols: “aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa”.matches(“(a|aa)*b”)
Also, there is the so-called ordered alternation – it’s a feature that stops a search after the first match found but not the most specific(longest) search result (example).
Hash functions, checksums
There are six versions of hashCode implementation in jdk and the Park-Miller_random_number_generator is a default, others are simple like constant or object memory address and don’t use algorithms, you can find them in c++ jdk sources. There are also industry standard hashing algorithms (SHA-*, MD5 and variations) in MessageDigest class.
For checksums there are implementations of Adler-32 (javadoc) and CRC32 ( javadoc) algorithms.
There is an implementation of a standard compression deflate (Deflater) algorithm in jdk, and zip/gzip jdk utils use it. All of them are in java.util.zip package.
As you can see the classic data structures are not fully presented in java, but at the same time almost there are a lot of thread-safe versions available nearly for all data structures in jdk. What is missing is an open question. For example you can argue whether jdk needs some Union-Find, but the lack of graph structures and algorithms in the language completely in the age of social networks very surprises and in practice generates a lot of bugs and bicycles.
Thanks! Exactly what I was searching for!
Thanks. I was looking for exactly this info on JDK internals.
Thank You !!!