wheel sieve - slow reconstruction from bitsets

Question

I'm trying to get my feet wet with parallel processing by implementing a wheel sieve for primes, the order of operation roughly as follows,

given some upper bound N, construct 8 different spokes of the form [p, p + 30, p+ 60, ..., p + 30n] for p in {1, 7, 11, 13, 17, 19, 23, 29}.
combine these all together into one list
For each p, sieve all the primes in the list created in step 2 (using pmap)
take these 8 bitsets, parse them, and rebuild some sorted list of all primes less than N

My code is listed below (primes and primes2 are single-threaded implementations of a simple sieve, and wheel sieve respectively, for comparison)

The problem I'm having is that my current attempts to implement step 4 (in functions primes4, primes5, and primes6) are all dominating steps 1-3. Can anyone give me any pointers on how I should complete primes3 (which implements only steps 1-3)? otherwise, if this is hopeless, can someone explain some other strategy for splitting up the work for separate threads to do in isolation?

(defn spoke-30 [p N]
  (map #(+ p (* %1 30)) (range 0 (/ N 30))))

(defn interleaved-spokes
  "returns all the spoke30 elements less than or equal to N (but not the first which is 1)"
  [N]
  (rest (filter #(< % N) (apply interleave (map #(spoke-30 % N) '(1 7 11 13 17 19 23 29))))))


(defn get-orbit
  "for a spacing diff, generates the orbit of diff under + in Z_{30}"
  [diff]
  (map #(mod (* diff %) 30) (range 0 30)))

(defn _all-orbits
  "returns a map of maps where each key is an element of (1 7 11 13 17 19 23 29),
  and each value is the orbit of that element under + in Z_30)"
  []
  (let [primes '(1 7 11 13 17 19 23 29)]
    (zipmap primes (map #(zipmap (get-orbit %) (range 30 0 -1)) primes))))

(def all-orbits (memoize _all-orbits))

(defn start 
"for a prime N and a spoke on-spoke, determine the optimal starting point"
  [N on-spoke]
  (let [dist   (mod (- N) 30)
        lowest (* (((all-orbits) dist) on-spoke) N)
        sqrN (* N N)]

    ; this might be one away from where I need to be, but it is a cheaper
    ; calculation than the absolute best start.
    (cond (>= lowest sqrN) lowest
          (= lowest N)  (* 31 N)
          true (+ lowest (* N 30 (int (/ N 30)))))))

(defn primes2 [bound]
  (let [bset      (new java.util.BitSet bound)
        sqrtBound (Math/sqrt bound)
        pList     (interleaved-spokes sqrtBound)]
    (.flip bset 2 bound)
    (doseq [i (range 9 bound 6)] (.clear bset i)) ;clear out the special case 3s
    (doseq [i (range 25 bound 10)] (.clear bset i)) ;clear out the special case 5s
    (doseq [x '(1 7 11 13 17 19 23 29)
            y pList
            z (range (start y x) bound (* 30 y))]
          (.clear bset z))
    (conj (filter (fn [x] (.get bset x)) (range 1 bound 2)) 2)))



(defn scaled-start [N on-spoke]
  (let [dist      (mod (- N) 30)
        k         (((all-orbits) dist) on-spoke)
        lowest    (int (/ (* k N) 30))
        remaining (* (int (/ (- N k) 30)) N)
        start     (+ lowest remaining)]
    (if (> start 0) start N)))

;TODO do I even *need* this bitset!?? ...
(defn mark-composites [bound spoke pList]
  (let [scaledbound  (int (/ bound 30))
        bset         (new java.util.BitSet scaledbound)]
    (if (= spoke 1) (.set bset 0)) ; this won't be marked as composite - but it isn't prime!
    (doseq [x pList
            y (range (scaled-start x spoke) scaledbound x)]
      (.set bset y))
    [spoke bset]))


;TODO now need to find a quick way of reconstructing the required list
; need the raw bitsets ... will then loop over [0 scaledbound] and the bitsets, adding to a list in correct order if the element is true and not present already (it shouldn't!)
(defn primes3 [bound]
  (let [pList (interleaved-spokes (Math/sqrt bound))]
        (pmap #(mark-composites bound % pList) '(1 7 11 13 17 19 23 29)))) 

(defn primes4 [bound]
  (let [pList (interleaved-spokes (Math/sqrt bound))
        bits  (pmap #(mark-composites bound % pList) '(1 7 11 13 17 19 23 29))
        L     (new java.util.ArrayList)]
    (.addAll L '(2 3 5))
    (doseq [z (range 0 (int (/ bound 30))) [x y] bits ]
      (if (not (.get y z)) (.add L (+ x (* 30 z))))
        (println x y z L)
    )))

(defn primes5 [bound]
  (let [pList (interleaved-spokes (Math/sqrt bound))
    bits  (pmap #(mark-composites bound % pList) '(1 7 11 13 17 19 23 29))]
    (for [z (range 0 (int (+ 1 (/ bound 30)))) [x y] bits ]
      (if (not (.get y z)) (+ x (* 30 z))))
    ))

(defn primes6 [bound]
  (let [pList (interleaved-spokes (Math/sqrt bound))]
    (concat '(2 3 5) (filter #(not= % 0) (apply interleave (pmap #(mark-composites2 bound % pList) '(1 7 11 13 17 19 23 29))))))) 


(defn primes [n]
  "returns a list of prime numbers less than or equal to n"
  (let [bs (new java.util.BitSet n)]
    (.flip bs 2 n)
    ;(doseq [i (range 4 n 2)] (.clear bs i)) ;clear out the special case 2s
    (doseq [i (range 3 (Math/sqrt n))] 
      (if (.get bs i) ; it seems faster to check if odd than to range in steps of 2
        (doseq [j (range (* i i) n (* 2 i))] (.clear bs j))))
    (conj (filter (fn [x] (.get bs x)) (range 1 n 2)) 2)))

some timings are:

user=> (time (count (primes 1000000)))
"Elapsed time: 117.023543 msecs"
78498
user=> (time (count (primes2 1000000)))
"Elapsed time: 77.10944 msecs"
78498
user=> (time (count (primes3 1000000)))
"Elapsed time: 22.447898 msecs"
8
user=> (time (count (primes4 1000000)))
"Elapsed time: 647.586234 msecs"
78506
user=> (time (count (primes5 1000000)))
"Elapsed time: 721.62017 msecs"
266672
user=> (time (count (primes6 1000000)))
"Elapsed time: 306.280182 msecs"

HexedAgain, I don't fully understand what you're trying to do, but: I love `pmap`, but creating new threads is expensive, so it's useful only when you can arrange the code so that the cost of computation in the function passed to the `pmap` call overwhelms the cost of thread creation. Would it be possible to rearrange code so that some of the code passed to different instances of `map` in the helper functions can be combined into a single `pmap` call? Maybe your judicious use of Java rather than pure Clojure makes the code so efficient that `pmap` can't help. — Mars, Dec 28 '15 at 03:31
Thanks, I do use pmap when I am marking off the composites (since there is a lot of work there which can be done in isolation), and there is some speed up for doing that (though not massive), however the bottleneck is in parsing the data from 8 different threads and patching it all together into one sorted list. My attempts to solve that one so far have come up short. — HexedAgain, Dec 28 '15 at 10:18
I think that is going to be difficult given that each bitset is scaled by 30 (to save space) (for example s7 = [7,37,67,97,127,157,187, 217,247,277,,...] -> [0,0,0,0,0,1,1,1,0...]). Unless I'm missing something. I would have to map this spoke to a fat bitmap (though I could still squash it by ignoring the evens) and OR this new bitmap with some other bitmap (OR'd with the other 7 under a similar mapping). I'm getting the impression that if wheel-sieving can be easily parallelized ... my concept of how to do this is completely wrong, and that I need to hit the drawing board again. — HexedAgain, Jan 03 '16 at 01:26

wheel sieve - slow reconstruction from bitsets

0 Answers0