Prime Numbers in Python

Here is the full generator. The source can also be downloaded from here: primegen.py
It's based on the code posted here

def primegen(maxn=float("inf"),count=float("inf")):
	# Sequentially output primes. Outputs all p<maxn or the first 'count' primes.
	for n in (2,3,5,7,11,13,17,19,23,29,31,37,41,43,47):
		if n>=maxn or count<=0: return
		yield n
		count-=1
	# Recursive generator for upcoming factors.
	r,sq,n=7,49,49
	it=iter(primegen())
	while next(it)<r: pass
	comp=dict()
	jump=(1,6,5,4,3,2,1,4,3,2,1,2,1,4,3,
	      2,1,2,1,4,3,2,1,6,5,4,3,2,1,2)
	while n<maxn and count>0:
		# See if we have a factor of n.
		f=comp.pop(n,0)
		if n==sq:
			# n=r*r is the next prime square we're waiting for.
			f,r=r,next(it)
			sq=r*r
		elif f==0:
			# n!=sq and isn't in comp, so it's prime.
			yield n
			count-=1
		if f:
			# We've found a factor of n. Add it to comp.
			q=n//f
			q+=jump[q%30]
			while q*f in comp: q+=jump[q%30]
			comp[q*f]=f
		n+=jump[n%30]


print("First 20 primes : "+str(list(primegen(count=20))))
print("Primes under 100: "+str(list(primegen(100))))
print("Loop iterator test:")
for p in primegen():
	if p>50: break
	print(p)

Output:

First 20 primes : [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71] Primes under 100: [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97] Loop iterator test: 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47

Performance

Our performance data has been generated from profiling.zip using pypy and python2. primegen.py will need to be included in the same folder. The generators used for profiling were taken from the page posted above, and each generator is denoted by the last name of its author.

The following table shows the time and memory used to iterate over all primes less than 10^8. The ability to skip more composites per iteration, and inlining adding factors to comp, allows our algorithm to perform about 36% faster than the next best.

name	pypy time	python time	memory
Eppstein	35.22 s	90.98 s	1,428,453 kb
Martelli	27.67 s	50.96 s	663,365 kb
Beneicke	19.93 s	27.80 s	663,365 kb
Hofstra	19.87 s	26.75 s	663,365 kb
Hochberg	17.37 s	27.43 s	663,365 kb
Ness	7.68 s	21.10 s	340 kb
Dee	4.47 s	17.48 s	337 kb

The following graph shows the time taken to iterate up to 10^7. Even though these algorithms involve finding a factor of n, their performance appears linear with respect to n.

Explanation

The algorithm relies on having a rolling dictionary of upcoming composite numbers, and being able to retrieve their factors quickly. We'll first analyze a simple version of this dictionary and then look at ways to speed it up.

Our dictionary of composites, hence named comp, will return a factor of n if n is composite, otherwise it will have no entry for n. That is, if n is composite, comp[n]=f for n%f=0, otherwise n is prime and comp[n] will be missing.

We will build comp as we iterate over n=2,3,4,... For every iteration we will find a factor, f, of n using f=comp[n]. If n is prime, we'll take f=n. Otherwise, the factor will be used to calculate the next composite not already in comp that's a multiple of f. That is, we'll iterate over values of n+k*f for k=1,2,3,... until we find a value not in comp. This will give us the next entry comp[n+k*d]=f.

The code for this is much more succinct than the derivation:

comp=dict()
n=2
while True:
	if n in comp:
		# n is composite and n%f=0
		f=comp.pop(n)
	else:
		# n is prime
		f=n
	# Find first value of n+k*f not in comp.
	k=1
	while n+k*f in comp:
		k+=1
	comp[n+k*f]=f
	n+=1

Indeed, this version is sufficient for a simple prime number generator:

def primegen():
	comp=dict()
	n=2
	while True:
		if n in comp:
			f=comp.pop(n)
		else:
			yield n
			f=n
		k=1
		while n+k*f in comp:
			k+=1
		comp[n+k*f]=f
		n+=1

The first big improvement comes from Will Ness, who noted that prime factors are being added to comp before they need to be. That is, for a prime p, we don't need to be tracking p as a factor of n until n=p^2. For n<p^2, there will be some other prime factor of n that will identify n as a composite.

For example, consider p=7. We can see that for n=8,9,...,48, all values will either be prime, or have 2, 3, or 5 as a factor:

8=2*4	22=2*11	36=2*18
9=3*3	23=prime	37=prime
10=2*5	24=2*12	38=2*19
11=prime	25=5*5	39=3*13
12=2*6	26=2*13	40=2*20
13=prime	27=3*9	41=prime
14=2*7	28=2*14	42=2*21
15=3*5	29=prime	43=prime
16=2*8	30=2*15	44=2*22
17=prime	31=prime	45=3*15
18=2*9	32=2*16	46=2*23
19=prime	33=3*11	47=prime
20=2*10	34=2*17	48=2*24
21=3*7	35=5*7	49=7*7

It isn't until n=49 that the smallest factor is 7. Using the simple generator we have so far, let us see what comp is when n=49:

comp={58: 29, 62: 31, 51: 17, 74: 37, 57: 19, 52: 13, 50: 2, 82: 41,
      49: 7 , 86: 43, 55: 11, 60: 5 , 69: 23, 94: 47, 54: 3}

There are a total of 15 entries. We know that we only need to have 4 at most: 2, 3, 5, and 7. These extra entries increase the values skipped over in the n+k*f loop and increase the number of entries that need to be ignored when we call if n in comp.

We can improve on this by delaying the addition of p until we pass n>=p^2, and just rely on the other factors to identify composites. The main problem of this is tracking what the next p will be when n=p^2, which is the exact problem we're already trying to solve with our prime generator!

Fortunately, we can solve this easily by making a recursive call to the primegen iterator we already have. This new iterator will solely be used for tracking the next factor we need to add. We need to be careful about making an infinite number of recursive calls to primegen however, so we're going to manually output the first 2 primes first and then use our main loop.

def primegen():
	yield 2
	yield 3
	comp=dict()
	r,sq=2,4
	it=iter(primegen())
	while next(it)<r: pass
	n=4
	while True:
		# If comp returns a non-zero value, n is composite with factor f.
		f=comp.pop(n,0)
		if n==sq:
			# n=r*r is the next prime square we're waiting for. Add r to comp.
			f=r
			r=next(it)
			sq=r*r
		elif f==0:
			# n isn't a prime square and isn't in comp, so it's prime.
			yield n
		if f:
			# We've found a factor of n. Add the next value of n+k*f to comp.
			k=1
			while n+k*f in comp:
				k+=1
			comp[n+k*f]=f
		n+=1

Here, sq=r^2 where sq is the square we're watching for and r is the next prime we'll add to comp. If we check what comp contains when n=49, it's what we'd expect: {50: 5, 51: 3, 52: 2}. Note that 7 is not an entry because it is implicitly stored in r.

The last remaining way to speed up the algorithm is skip numbers that we know have a factor. For instance, every even number greater than 2 is a composite divisible by 2. Every third number is divisible by 3, etc... In general, if we choose to skip composites divisible by primes 2,3,...,p, then we will be able to skip a larger and larger fraction of integers described by the formula: y=1-(1/2)*(2/3)*(4/5)*...*((p-1)/p).

p	fraction skipped
*	0.000000
2	0.500000
3	0.666667
5	0.733333
7	0.771429
11	0.792208
13	0.808192
17	0.819475
19	0.828976
23	0.836412

For the purpose of this generator, we choose to skip integers divisible by 2, 3, and 5. By the table above, this will allow us to skip 73% of all integers we need to test. To do this we'll create a table that tells us how far to jump from n so that n+jump is not divisible by 2, 3, or 5. To simplify this table, note that r is divisible by 2, 3, or 5, if and only if r+30 is also divisible by 2, 3, or 5. Thus, to calculate how far from n to jump so n+jump isn't divisible by 2, 3, or 5, we may write n as n=q*30+r for some 0<r<=30 and use r to find the jump value.

To derive the jump table, we begin by looking at the values 0 to 29:

0	1	2	3	4	5
6	7	8	9	10	11
12	13	14	15	16	17
18	19	20	21	22	23
24	25	26	27	28	29

We now cross off all values divisible by 2, 3, or 5, leaving:

	1
	7				11
	13				17
	19				23
					29

Our jump table will be calculated as follows: given some entry on the table, how many spaces do we have to jump to reach the next value not divisible by 2, 3, or 5. We have

1	6	5	4	3	2
1	4	3	2	1	2
1	4	3	2	1	2
1	4	3	2	1	6
5	4	3	2	1	2

Or, in python:

jump=(1,6,5,4,3,2,1,4,3,2,1,2,1,4,3,
      2,1,2,1,4,3,2,1,6,5,4,3,2,1,2)

We can now calculate the next value of n with n+=jump[n%30].

Before we simply drop this table into the main loop of our prime generator, we need to modify the section that adds upcoming composites to comp. Previously, when we added values to comp, we knew that n would iterate over all positive integers without skipping any values. Because of our new jump table, we introduce the possibility that we'll jump over a composite that comp is expecting us to hit.

k=1
while n+k*f in comp:
	k+=1
comp[n+k*f]=f

To fix this section, we'll make use of the jump table. Our fixed loop must jump over all the values that n will jump over, and still be a multiple of f. We can do this by factoring f out of n, and then jumping over values when we check for composites already in comp. Our new loop will look like so:

q=n//f
q+=jump[q%30]
while q*f in comp: q+=jump[q%30]
comp[q*f]=f

We'll present what our generator currently looks like, but note that it does not work correctly.

def primegen():
	yield 2
	yield 3
	comp=dict()
	r,sq=2,4
	it=iter(primegen())
	while next(it)<r: pass
	n=4
	jump=(1,6,5,4,3,2,1,4,3,2,1,2,1,4,3,
	      2,1,2,1,4,3,2,1,6,5,4,3,2,1,2)
	while True:
		# If comp returns a non-zero value, n is composite with factor f.
		f=comp.pop(n,0)
		if n==sq:
			# n=r*r is the next prime square we're waiting for. Add r to comp.
			f=r
			r=next(it)
			sq=r*r
		elif f==0:
			# n isn't a prime square and isn't in comp, so it's prime.
			yield n
		if f:
			# We've found a factor of n. Add the next value of n+k*f to comp.
			q=n//f
			q+=jump[q%30]
			while q*f in comp: q+=jump[q%30]
			comp[q*f]=f
		n+=jump[n%30]

The last modification we need to make is to the primes we manually yield. We must yield at least 2, 3, and 5 to prevent them from being skipped by jump, and we must also yield 7 to prevent primegen() from making infinite recursive calls. We also want to yield as many as possible to minimize the number of recursive calls, but we can't yield any prime greater than 7^2=49. Thus, we'll yield all primes up 49. We'll also rearrange some of the code and add convenience checks for the maximum value of n and maximum number of primes to generate.

Here is the final version of our prime generator:

def primegen(maxn=None,count=None):
	# Sequentially output primes. Outputs all p<maxn or the first 'count' primes.
	if  maxn==None:  maxn=float("inf")
	if count==None: count=float("inf")
	for n in (2,3,5,7,11,13,17,19,23,29,31,37,41,43,47):
		if n>=maxn or count<=0: return
		yield n
		count-=1
	# Recursive generator for upcoming factors.
	r,sq,n=7,49,49
	it=iter(primegen())
	while next(it)<r: pass
	comp=dict()
	jump=(1,6,5,4,3,2,1,4,3,2,1,2,1,4,3,
	      2,1,2,1,4,3,2,1,6,5,4,3,2,1,2)
	while n<maxn and count>0:
		# See if we have a factor of n.
		f=comp.pop(n,0)
		if n==sq:
			# n=r*r is the next prime square we're waiting for.
			f,r=r,next(it)
			sq=r*r
		elif f==0:
			# n!=sq and isn't in comp, so it's prime.
			yield n
			count-=1
		if f:
			# We've found a factor of n. Add it to comp.
			q=n//f
			q+=jump[q%30]
			while q*f in comp: q+=jump[q%30]
			comp[q*f]=f
		n+=jump[n%30]

Notes

We have explained how the prime generating algorithm works and shown that it is faster than current related algorithms. We now list some current ideas for improving it.

One trivial improvement is to increase the primes we jump around (2,3, and 5). This has the drawback of increasing the jump table size for every prime we add, and every additional prime skips fewer and fewer numbers.

When we are querying if the current value of n is composite, f=comp.pop(n,0), n will either be the minimum key of comp or it will not be in comp at all. Any data structure which could exploit this will allow for a nice speedup.

We could potentially speed up the algorithm by precalculating the jump table used when calculating the next composite of f. That is

ujump=((1,6),(7,4),(11,2),(13,4),(17,2),(19,4),(23,6),(29,2))
fjump=[0]*30
...
if f:
	for u in ujump: fjump[(u[0]*f)%30]=u[1]*f
	q=n+fjump[n%30]
	while q in comp: q+=fjump[q%30]
	comp[q]=f

In practice, the table precalculation adds more time than is saves (6.88s vs the current 4.65s).

Another algorithm we could use relies on primality testing. It would look like so:

def isprime(n):
	# return true iff n is prime

def primegen():
	n=2
	while True:
		if isprime(n): yield n
		n+=1

It is conceptually simple, uses constant memory, and takes O(log(n)^2) time using the Rabin-Miller primality test. In practice, the isprime() function is too slow to compete with the current primegen for real-world values of n.