Python for Bioinformatics: More about primes

Friday, December 17, 2010

More about primes

I took a quick look at our question from the primes.py script the other day (here) about the patterns within each decade: how many decades have primes ending in 1,3,7 or 7,9 and so on?

I modified the script to keep track of this information in a global variable L. I still kept pL because we need to use the primes in pL to test new candidates. I grabbed the first one million primes and looked at the patterns. The results are amazing, at least I think so:


  (2, 3, 5, 7)      1
  (1, 3, 7, 9)   1228
     (1, 7, 9)   4799
     (3, 7, 9)   4839
     (1, 3, 7)   4843
     (1, 3, 9)   4913
        (1, 9)  17638
        (3, 7)  17664
        (7, 9)  17844
        (1, 3)  17865
        (3, 9)  45941
        (1, 7)  46184
          (1,) 152464
          (7,) 152613
          (9,) 152738
          (3,) 152816
            () 754197
1548587
1000001

We look through 1548587 decades to get the first 10⁶ primes (actually one more than that, in order to count the pattern in the last decade correctly). Slightly less than half of these contain no primes, of the rest about 80% have a single prime. (Also of interest, the overall density of primes is reduced by about 1/6 between 10⁵ and 10⁶---not shown).

For the pairs and triples we observe an amazingly consistent distribution of primes considering only their last digit. Each of the four is generally equivalent except that the pairs 1,7 and 3,9 are about 2.6 times more likely than the other pairs. What property of integers can account for the amazing constancies and also this systematic difference?

Before getting too excited, I should probably head over to the prime pages and check out their lists. But that just seems incredible to me. Something deep is going on here (or not.. maybe it's something to do with the way a "decade" is defined. Have to think about that).

BTW, this algorithm isn't scaling well. 10⁵ is really easy, 10⁶ is agonizingly slow. Use something better for big problems.

def primes(L,N):
    pL = [2,3,5,7]
    count = 4
    L.append(pL[:])
    base = 0
    while True:
        temp = list()
        base += 10
        for i in [1,3,7,9]:
            n = base + i
            for p in pL:
                prod = p*p
                if prod == n or not n % p:
                    break
                if prod > n:
                    count += 1
                    pL.append(n)
                    temp.append(n)
                    break
        L.append(temp[:])
        if count > N:  return pL,count

L = list()
pL,count = primes(L,int(1e6))
D = dict()
for sL in L:
    t = tuple([n % 10 for n in sL])
    if not t in D:
        D[t] = 1
    else:
        D[t] += 1

for k in sorted(D.keys(),key=lambda k:D[k]):
    print str(k).rjust(14), str(D[k]).rjust(6)
    
print sum(D.values())
print count

[UPDATE: It turns out the patterns depend on the definition of a decade. For example, to have the decades go from 15 to 25, etc. I changed these four lines:

 2    pL = [2,3,5,7,11,13]
 3    count = 6
 5    base = 5
10        for i in [2,4,6,8]:

and now the pattern is different.


     (7, 9, 3)   6020
     (7, 1, 3)   6232
        (7, 1)  22485
        (9, 3)  22521
        (1, 3)  22616
        (7, 9)  22690
        (9, 1)  28466
        (7, 3)  44681
          (7,) 147906
          (3,) 148038
          (1,) 170134
          (9,) 170243
            () 736554

I'm not saying I understand it, but the fact that it changes argues pretty strongly that it's an artifact. ]