Bloom Filter Implementation in Java on GitHub

Bloom Filter is a type of Set Data Structure. For those unaware, a Set Data Structure only has one main method, contains. It’s only used to determine if a specific element is included in a group of elements or not. Most Data Structures (like a Hash MapLinked List, or Array) can create this function fairly easily. You simply need to search the data structure for the specific element.

However, these types of Data Structures can pose a problem when the number of elements in the set exceeds the amount of memory available, as these types of data structures store all of the elements in memory.

This is where the Bloom Filter becomes interesting. As the Bloom Filter doesn’t actually store all of the elements of the set in memory.

Instead of placing each element into a the Data Structure, the Bloom Filter only stores an array of bytes. For each element added to the Bloom Filter, k bits are set in its array. These bits are typically determined by a hashing function.

To check if an element is within the set, you simply check if the bits that would normally be one for this item are actually one. If they all are one (instead of zero), then the item is within the set. If any of the bits are not one, then the item is not within the set.

With every Data Structure there is definitely a draw back to the Bloom Filter. By using the method above, the Bloom Filter can say an element is within the set when it actually isn’t. False positives are possible in the set, and they depend on several factors, such as:

  • The size of the byte array
  • The number of bits (k) set per element
  • The number of items in the set

By tweaking the above values, you can easily get the false positive probability to respectable levels while still saving a large amount of space.

After I discovered the Bloom Filter, I went looking for an implementation in Java. Sadly, a standard implementation doesn’t exist! So, I wrote a quick and simple version of the Bloom Filter for Java. You can find the source code on GitHub.

My implementation uses:

  • MD5 Hash
    • To add an Object, the set takes the value of the hashCode() method to compute the MD5 hash. For subsequent values of k, the filter uses the previously computed MD5 hash (converted to an int) to generate the new MD5 hash.
  • Backed by a simple byte array
  • Implements the Set<Object> interface, although some of the methods in the interface will not work properly.

Note that the project also used the SizeOf Library to get the number of byte used in memory.

I also did a few quick expirements to compare the filter to a standard ArrayList in Java and a few performance checks.

  • Time required to add an element to the set using different k values
  • Size of the set versus the array list at different levels

As to be expected, the larger the number of elements required to be in the set, the more useful the Bloom Filter becomes. It does get a bit tricky when determining how large the Bloom Filter should be and what the optimal k  value is for a given set, especially if the set is continually growing.

For the tests, I simply added Objects (which have a size of 16 bytes) to each data structure, and I then use the SizeOf library to get the true amount of space used.

From the above graph, its easy to see that the Bloom Filter is much more efficient on size once the array becomes larger than 100 objects. That trend continues at 1500 objects, with the Bloom Filter requiring 22808 bytes less than the ArrayList to store the same amount of elements.

The above graph shows the time in seconds (on an early 2012 iMac) to add an element to the list with different numbers of bits set (k). As k increases, the time increases fairly slowly up to 10 bits. However, anything past 10 becomes very costly, with 100 bits set requiring a full second to complete.

Feel free to check out the source code for the tests and the Bloom Filter implementation itself on GitHub.
 

Reference: Bloom Filter Implementation in Java on GitHub from our JCG partner Isaac Taylor at the Programming Mobile blog.

Do you want to know how to develop your skillset to become a Java Rockstar?

Subscribe to our newsletter to start Rocking right now!

To get you started we give you two of our best selling eBooks for FREE!

JPA Mini Book

Learn how to leverage the power of JPA in order to create robust and flexible Java applications. With this Mini Book, you will get introduced to JPA and smoothly transition to more advanced concepts.

JVM Troubleshooting Guide

The Java virtual machine is really the foundation of any Java EE platform. Learn how to master it with this advanced guide!

Given email address is already subscribed, thank you!
Oops. Something went wrong. Please try again later.
Please provide a valid email address.
Thank you, your sign-up request was successful! Please check your e-mail inbox.
Please complete the CAPTCHA.
Please fill in the required fields.

2 Responses to "Bloom Filter Implementation in Java on GitHub"

  1. Ilias Tsagklis says:

    Thanks for sharing guys..!

Leave a Reply


three + = 7



Java Code Geeks and all content copyright © 2010-2014, Exelixis Media Ltd | Terms of Use | Privacy Policy
All trademarks and registered trademarks appearing on Java Code Geeks are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries.
Java Code Geeks is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.
Do you want to know how to develop your skillset and become a ...
Java Rockstar?

Subscribe to our newsletter to start Rocking right now!

To get you started we give you two of our best selling eBooks for FREE!

Get ready to Rock!
You can download the complementary eBooks using the links below:
Close