Home » Java » Core Java » Unique hashCodes is not enough to avoid collisions

About Peter Lawrey

Unique hashCodes is not enough to avoid collisions

There is a common misconception that if you have unique hashCode() you won’t have collisions.  While unique, or almost unique, hashCodes are good, this is not the end of the story.

The problem is that the size of a HashMap is not unlimited (or at least 2^32 in size)  This means the hashCode() number has to be reduced to a smaller number of bits.

The way HashMap, and thus HashSet and LinkedHashMap ,work is to mutate the bits in the following manner:
 
 

h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);

and then apply a mask for the lowest bits to select a bucket.  The problem is that even with unique hashCode()s as Integer does, there will be values with different hash code map to the same bucket. You can research how Integer.hashCode() works:

public static void main(String[] args) {
    Set integers = new HashSet<>();
    for (int i = 0; i <= 400; i++)
        if ((hash(i) & 0x1f) == 0)
            integers.add(i);
    Set integers2 = new HashSet<>();
    for (int i = 400; i >= 0; i--)
        if ((hash(i) & 0x1f) == 0)
            integers2.add(i);
    System.out.println(integers);
    System.out.println(integers2);

}

static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

this prints:

[373, 343, 305, 275, 239, 205, 171, 137, 102, 68, 34, 0]
[0, 34, 68, 102, 137, 171, 205, 239, 275, 305, 343, 373]

The entries as in the reverse order they were added as the HashMap is acting as a linked list, placing all entries into the same bucket.

Solutions?

A simple solution is to have a bucket turn into a tree instead of a linked list.  In Java 8, it will do this for String keys, but this could be done for all Comparable types AFAIK.

Another approach is to allow custom hashing strategies to allow the developer to avoid such problems, or to randomize the mutation on a per collection basis, amortizing the cost to the application.

Other notes

I would favour supporting 64-bit hash codes, esp for complex objects.  This has a very low chance of collision in the hash code itself and supports very large data structures well. e.g. into the billions.
 

Do you want to know how to develop your skillset to become a Java Rockstar?

Subscribe to our newsletter to start Rocking right now!

To get you started we give you our best selling eBooks for FREE!

1. JPA Mini Book

2. JVM Troubleshooting Guide

3. JUnit Tutorial for Unit Testing

4. Java Annotations Tutorial

5. Java Interview Questions

6. Spring Interview Questions

7. Android UI Design

and many more ....

 

Leave a Reply

Your email address will not be published. Required fields are marked *

*


Want to take your Java Skills to the next level?
Grab our programming books for FREE!
  • Save time by leveraging our field-tested solutions to common problems.
  • The books cover a wide range of topics, from JPA and JUnit, to JMeter and Android.
  • Each book comes as a standalone guide (with source code provided), so that you use it as reference.
Last Step ...

Where should we send the free eBooks?

Good Work!
To download the books, please verify your email address by following the instructions found on the email we just sent you.