A common problem is to determine the prime factorization of a number. The brute force approach is trial division (Wikipedia, Khan Academy) but that requires a lot of wasted effort if multiple numbers must be factored.
One widely used solution is the Sieve of Eratosthenes (Wikipedia, Math World). It is easy to modify the Sieve of Eratosthenes to contain the largest prime factor of each composite number. This makes it extremely cheap to subsequently compute the prime factorization of numbers.
If we only care about primality we can either use a bitmap with the Sieve of Eratosthenes, or use the Sieve of Atkin).
(Sidenote: for clarity I’m leaving out the common optimizations that follow from the facts that a prime number is always “1 mod 2, n > 2″ and “1 or 5 mod 6, n > 5″. This can substantially reduce the amount of memory required for a sieve.)
public enum SieveOfEratosthenes { SIEVE; private int[] sieve; private SieveOfEratosthenes() { // initialize with first million primes - 15485865 // initialize with first 10k primes - 104729 sieve = initialize(104729); } /** * Initialize the sieve. */ private int[] initialize(int sieveSize) { long sqrt = Math.round(Math.ceil(Math.sqrt(sieveSize))); long actualSieveSize = (int) (sqrt * sqrt); // data is initialized to zero int[] sieve = new int[actualSieveSize]; for (int x = 2; x < sqrt; x++) { if (sieve[x] == 0) { for (int y = 2 * x; y < actualSieveSize; y += x) { sieve[y] = x; } } } return sieve; } /** * Is this a prime number? * * @FIXME handle n >= sieve.length! * * @param n * @return true if prime * @throws IllegalArgumentException * if negative number */ public boolean isPrime(int n) { if (n < 0) { throw new IllegalArgumentException("value must be non-zero"); } boolean isPrime = sieve[n] == 0; return isPrime; } /** * Factorize a number * * @FIXME handle n >= sieve.length! * * @param n * @return map of prime divisors (key) and exponent(value) * @throws IllegalArgumentException * if negative number */ private Map<Integer, Integer> factorize(int n) { if (n < 0) { throw new IllegalArgumentException("value must be non-zero"); } final Map<Integer, Integer> factors = new TreeMap<Integer, Integer>(); for (int factor = sieve[n]; factor > 0; factor = sieve[n]) { if (factors.containsKey(factor)) { factors.put(factor, 1 + factors.get(factor)); } else { factors.put(factor, 1); } n /= factor; } // must add final term if (factors.containsKey(n)) { factors.put(n, 1 + factors.get(n)); } else { factors.put(n, 1); } return factors; } /** * Convert a factorization to a human-friendly string. The format is a * comma-delimited list where each element is either a prime number p (as * "p"), or the nth power of a prime number as "p^n". * * @param factors * factorization * @return string representation of factorization. * @throws IllegalArgumentException * if negative number */ public String toString(Map factors) { StringBuilder sb = new StringBuilder(20); for (Map.Entry entry : factors.entrySet()) { sb.append(", "); if (entry.getValue() == 1) { sb.append(String.valueOf(entry.getKey())); } else { sb.append(String.valueOf(entry.getKey())); sb.append("^"); sb.append(String.valueOf(entry.getValue())); } } return sb.substring(2); } }
This code has a major weakness – it will fail if the requested number is out of range. There is an easy fix – we can dynamically resize the sieve as required. We use a Lock to ensure multithreaded calls don’t get the sieve in an intermediate state. We need to be careful to avoid getting into a deadlock between the read and write locks.
private final ReadWriteLock lock = new ReentrantReadWriteLock(); /** * Initialize the sieve. This method is called when it is necessary to grow * the sieve. */ private void reinitialize(int n) { try { lock.writeLock().lock(); // allocate 50% more than required to minimize thrashing. initialize((3 * n) / 2); } finally { lock.writeLock().unlock(); } } /** * Is this a prime number? * * @param n * @return true if prime * @throws IllegalArgumentException * if negative number */ public boolean isPrime(int n) { if (n < 0) { throw new IllegalArgumentException("value must be non-zero"); } if (n > sieve.length) { reinitialize(n); } boolean isPrime = false; try { lock.readLock().lock(); isPrime = sieve[n] == 0; } finally { lock.readLock().unlock(); } return isPrime; } /** * Factorize a number * * @param n * @return map of prime divisors (key) and exponent(value) * @throws IllegalArgumentException * if negative number */ private Map<Integer, Integer> factorize(int n) { if (n < 0) { throw new IllegalArgumentException("value must be non-zero"); } final Map<Integer, Integer> factors = new TreeMap<Integer, Integer>(); try { if (n > sieve.length) { reinitialize(n); } lock.readLock().lock(); for (int factor = sieve[n]; factor > 0; factor = sieve[n]) { if (factors.containsKey(factor)) { factors.put(factor, 1 + factors.get(factor)); } else { factors.put(factor, 1); } n /= factor; } } finally { lock.readLock().unlock(); } // must add final term if (factors.containsKey(n)) { factors.put(n, 1 + factors.get(n)); } else { factors.put(n, 1); } return factors; }
Iterable<Integer> and foreach loops
In the real world it’s often easier to use a foreach loop (or explicit Iterator) than to probe a table item by item. Fortunately it’s easy to create an iterator that’s built on top of our self-growing sieve.
/** * @see java.util.List#get(int) * * We can use a cache of the first few (1000? 10,000?) primes * for improved performance. * * @param n * @return nth prime (starting with 2) * @throws IllegalArgumentException * if negative number */ public Integer get(int n) { if (n < 0) { throw new IllegalArgumentException("value must be non-zero"); } Iterator<Integer> iter = iterator(); for (int i = 0; i < n; i++) { iter.next(); } return iter.next(); } /** * @see java.util.List#indexOf(java.lang.Object) */ public int indexOf(Integer n) { if (!isPrime(n)) { return -1; } int index = 0; for (int i : sieve) { if (i == n) { return index; } index++; } return -1; } /** * @see java.lang.Iterable#iterator() */ public Iterator<Integer> iterator() { return new EratosthenesListIterator(); } public ListIterator<Integer> listIterator() { return new EratosthenesListIterator(); } /** * List iterator. * * @author Bear Giles <bgiles@coyotesong.com> */ static class EratosthenesListIterator extends AbstractListIterator<Integer> { int offset = 2; /** * @see com.invariantproperties.projecteuler.AbstractListIterator#getNext() */ @Override protected Integer getNext() { while (true) { offset++; if (SIEVE.isPrime(offset)) { return offset; } } // we'll always find a value since we dynamically resize the sieve. } /** * @see com.invariantproperties.projecteuler.AbstractListIterator#getPrevious() */ @Override protected Integer getPrevious() { while (offset > 0) { offset--; if (SIEVE.isPrime(offset)) { return offset; } } // we only get here if something went horribly wrong throw new NoSuchElementException(); } } }
IMPORTANT : The code:
for (int prime : SieveOfEratosthenes.SIEVE) { ... }
is essentially an infinite loop. It will only stop once the JVM exhausts the heap space when allocating a new sieve.
In practice this means that the maximum prime we can maintain in our sieve is around 1 GB. That requires 4 GB with 4-byte ints. If we only care about primality and use a common optimization that 4 GB can hold information on 64 GB values. For simplicity we can call this 9-to-10 digit numbers (base 10).
What if we put our sieve on a disk?
There is no reason why the sieve has to remain in memory. Our iterator can quietly load values from disk instead of an in-memory cache. A 4 TB disk, probably accessed in raw mode, would seem to bump the size of our sieve to 14-to-15 digit numbers (base 10). In fact it will be a bit less because we’ll have to double the size of our primitive types from int to long, and then probably to an even larger format.
More! More! More!
We can dramatically increase the effective size of our sieve by noting that we only have to compute sqrt(n) to initialize a sieve of n values. We can flip that and say that a fully populated sieve of n values can be used to populate another sieve of n^{2} values. In this case we’ll want to only populate a band, not the full n^{2} sieve. Our in-memory sieve can now cover values up to roughly 40 digit numbers (base 10), and the disk-based sieve jumps to as much as 60 digit numbers (base 10), minus the space require for the larger values.
There is no reason why this approach can’t be taken even further – use a small sieve to bootstrap a larger transient sieve and use it, in turn, to populate an even larger sieve.
But how long will this take ?
Aye, there’s the rub. The cost to initialize a sieve of n values is O(n^{2}). You can use various tweaks to reduce the constants but at the end of the day you’re visiting every node once (O(n)), and then visiting some rolling value proportional to n beyond each of those points. For what it’s worth this is a problem where keeping the CPU’s cache architecture could make a big difference.
In practical terms any recent system should be able to create a sieve containing the first million primes within a few seconds. Bump the sieve to the first billion primes and the time has probably leapt to a week, maybe a month if limited JVM heap space forces us to use the disk heavily. My gut instinct is that it will take a server farm months to years to populate a TB disk
Why bother ?
For most of us the main takeaway is a demonstration of how to start a collection with a small seed, say a sieve with n = 1000, and transparently grow it as required. This is easy with prime numbers but it isn’t a huge stretch to imagine the same approach being used with, oh, RSS feeds. We’re used to thinking of Iterators as some boring aspect of Collections but in fact they give us a lot of flexibility when used as part of an Iterable.
There is also a practical reason for a large prime sieve – factoring large numbers. There are several good algorithms for factoring large numbers but they’re expensive – even “small” numbers may take months or years on a server farm. That’s why the first step is always doing trial division with “small” primes – something that may take a day by itself.
Source Code
The good news is that I have published the source code for this… and the bad news is it’s part of ongoing doodling when I’m doing Project Euler problems. (There are no solutions here – it’s entirely explorations of ideas inspired by the problems. So the code is a little rough and should not be used to decide whether or not to bring me in for an interview (unless you’re impressed): http://github.com/beargiles/projecteuler.
Reference: | Getting an Infinite List of Primes in Java from our JCG partner Bear Giles at the Invariant Properties blog. |
Bulletproof Java Code: A Practical Strategy for Developing Functional, Reliable, and Secure Java CodeUse Java? If you do, you know that Java software can be used to drive application logic of Web services or Web applications. Perhaps you use it for desktop applications? Or, embedded devices? Whatever your use of Java code, functional errors are the enemy! To combat this enemy, your team might already perform functional testing. Even so, you're taking significant risks if you have not yet implemented a comprehensive team-wide quality management strategy. Such a strategy alleviates reliability, security, and performance problems to ensure that your code is free of functionality errors.Read this article to learn about this simple four-step strategy that is proven to make Java code more reliable, more secure, and easier to maintain. |