I'm fairly new to Clojure and I have some code that I am trying to optimise. I want to compute co-occurrence counts. The main function is compute-space and the output is a nested map of the form
{"w1" {"w11" 10, "w12" 31, ...}
"w2" {"w21" 14, "w22" 1, ...}
...
}
meaning that "w1" co-occurs with "w11" 10 times, and so on.
It takes a coll of documents (sentences) and a coll of target words, iterates over both, and finally applies the context-fn (such as sliding-window) to extract the context words. More concretely, I am passing a closure over sliding-window:
(compute-space docs (fn [target doc] (sliding-window target doc 5)) targets)
I've been testing it with around 50 million words (~3 million sentences) and ca. 20,000 targets. This version takes more than a day to complete. I also wrote a pmap-based parallel function (pcompute-space, below) that brings the running time down to around 10 hours, but I still feel it should be faster. I don't have other code to compare against, but my intuition tells me it should be faster.
(defn compute-space
  ([docs context-fn targets]
   (let [space (atom {})]
     (doseq [doc docs
             target targets]
       (when-let [contexts (context-fn target doc)]
         (doseq [w contexts]
           ;; bump the count for [target w], initialising it to 1 the first time
           (if (get-in @space [target w])
             (swap! space update-in [target w] (partial inc))
             (swap! space assoc-in [target w] 1)))))
     @space)))
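To make the expected output concrete, here is a toy call (using an ad-hoc context-fn that simply returns every other word in the sentence, just for illustration):
(compute-space [["a" "b" "c"]]
               (fn [target doc] (remove #{target} doc))
               ["a" "b"])
;;=> {"a" {"b" 1, "c" 1}, "b" {"a" 1, "c" 1}}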
(defn sliding-window
  "Returns the words within n positions to the left and to the right of
  each occurrence of target in the sentence s."
  [target s n]
  (loop [todo s seen [] acc []]
    (let [curr (first todo)]
      (cond
        (= curr target) (recur (rest todo) (cons curr seen)
                               (concat acc (take n seen) (take n (rest todo))))
        (empty? todo) acc
        :else (recur (rest todo) (cons curr seen) acc)))))
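For example, with a window of 2 on each side:
(sliding-window "b" ["a" "b" "c" "d" "e" "f"] 2)
;;=> ("a" "c" "d")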
(defn pcompute-space
  [docs step context-fn targets]
  (reduce
    #(deep-merge-with + %1 %2)
    (pmap
      (fn [chunk]
        (tick) ; progress helper, definition omitted
        (compute-space chunk context-fn targets))
      (partition-all step docs))))
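deep-merge-with and tick are small helpers: tick just reports progress, and deep-merge-with is essentially the old clojure.contrib.map-utils version, included here so the snippet is self-contained:
(defn deep-merge-with
  "Like merge-with, but merges maps recursively, applying f only when
  there is a non-map value at a given level."
  [f & maps]
  (apply
    (fn m [& maps]
      (if (every? map? maps)
        (apply merge-with m maps)
        (apply f maps)))
    maps))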
I profiled the application with jvisualvm and found out that clojure.lang.Cons, clojure.lang.ChunkedCons and clojure.lang.ArrayChunk dominate the profile quite excessively (see picture). This surely has to do with the double doseq loop (earlier experiments suggested this approach was faster than map, reduce and the like, although I was only using time to benchmark the functions).
I'd be very thankful for any insights you could provide, and for suggestions on how to refactor the code to make it run faster. I guess reducers could be of some help here, but I'm not sure how or why.
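For what it's worth, this is roughly the direction I was imagining with reducers: an untested sketch (the function name and the chunk size are just placeholders) that folds over a vector of the docs, builds per-chunk count maps with frequencies, and merges the chunks with a nested merge-with +.
(require '[clojure.core.reducers :as r])

(defn fold-compute-space
  "Sketch only: parallel fold over a vector of docs."
  [docs context-fn targets]
  (r/fold
    512                                   ; docs per leaf chunk, to be tuned
    (fn combinef
      ([] {})                             ; identity map for each chunk
      ([m1 m2] (merge-with (partial merge-with +) m1 m2)))
    (fn reducef [space doc]
      (reduce
        (fn [space target]
          (if-let [contexts (seq (context-fn target doc))]
            (update-in space [target]
                       (fnil #(merge-with + % (frequencies contexts)) {}))
            space))
        space
        targets))
    (vec docs)))                          ; fold only parallelises vectors (and maps)
But fold wants the whole input realised as a vector, and with only two cores I don't know whether it would actually beat pmap over partition-all, which is part of what I'm asking.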
SPECS
MacPro 2010, 2.4 GHz Intel Core 2 Duo, 4 GB RAM
Clojure 1.6.0
Java 1.7.0_51 Java HotSpot(TM) 64-Bit Server VM