
I'm trying to parse a file with a million lines; each line is a JSON string with some information about a book (author, contents, etc.). I'm using iota to load the file, as my program throws an OutOfMemoryError if I try to use slurp. I'm also using cheshire to parse the strings. The program simply loads the file and counts the distinct words across all books.

My first attempt used pmap to do the heavy lifting; I figured this would essentially make use of all my CPU cores.

(ns multicore-parsing.core
  (:require [cheshire.core :as json]
            [iota :as io]
            [clojure.string :as string]
            [clojure.core.reducers :as r]))


(defn words-pmap
  [filename]
  (letfn [(parse-with-keywords [line]   ; renamed from `str` to avoid shadowing clojure.core/str
            (json/parse-string line true))
          (words [book]
            (string/split (:contents book) #"\s+"))]
    (->>
     (io/vec filename)
     (pmap parse-with-keywords)
     (pmap words)
     (r/reduce #(apply conj %1 %2) #{})  ; accumulate a set of distinct words
     (count))))

While it does seem to use all cores, each core rarely exceeds 50% of its capacity. My guess was that this had to do with pmap's batch size, and so I stumbled across a relatively old question in which some comments reference the clojure.core.reducers library.
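For context, pmap's parallelism window is fixed: it only keeps about (+ 2 n-cpus) futures in flight ahead of the consumer, so when per-item work is cheap the coordination overhead leaves cores partly idle. Here's a minimal sketch (not part of my program) illustrating the window; the Thread/sleep is an arbitrary stand-in for per-line work:

;; Sketch: show pmap's bounded lookahead. With cheap items, threads
;; spend most of their time coordinating rather than computing.
(defn pmap-window-demo []
  (let [n-cpus (.availableProcessors (Runtime/getRuntime))]
    (println "cores:" n-cpus "; pmap window:" (+ 2 n-cpus))
    ;; each item takes ~50ms; pmap overlaps at most (+ 2 n-cpus) of them
    (time (dorun (pmap (fn [x] (Thread/sleep 50) x) (range 64))))))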

I decided to rewrite the function using reducers/map:

(defn words-reducers
  [filename]
  (letfn [(parse-with-keywords [line]
            (json/parse-string line true))
          (words [book]
            (string/split (:contents book) #"\s+"))]
    (->>
     (io/vec filename)
     (r/map parse-with-keywords)
     (r/map words)
     (r/reduce #(apply conj %1 %2) #{})
     (count))))

But CPU usage is worse, and it even takes longer to finish than the previous implementation:

multicore-parsing.core=> (time (words-pmap "./dummy_data.txt"))
"Elapsed time: 20899.088919 msecs"
546
multicore-parsing.core=> (time (words-reducers "./dummy_data.txt"))
"Elapsed time: 28790.976455 msecs"
546

What am I doing wrong? Is memory-mapped loading plus reducers the right approach for parsing a large file?

EDIT: this is the file I'm using.

EDIT2: Here are the timings with iota/seq instead of iota/vec:

multicore-parsing.core=> (time (words-reducers "./dummy_data.txt"))
"Elapsed time: 160981.224565 msecs"
546
multicore-parsing.core=> (time (words-pmap "./dummy_data.txt"))
"Elapsed time: 160296.482722 msecs"
546
eugecm
  • It looks like `io/vec` scans the entire file to build an index of where the lines are. Do you get different results if you try `io/seq`? – Nathan Davis May 14 '16 at 17:47
  • @NathanDavis I just tried; the times are worse. Let me update the question. – eugecm May 14 '16 at 18:04
  • [This talk](https://www.youtube.com/watch?v=BzKjIk0vgzE) by Leon Barrett, author of [Claypoole](https://github.com/TheClimateCorporation/claypoole), might have some relevant information. It explains `pmap` in detail, including why it often doesn't saturate the CPU, and a little bit about why feeding one `pmap` into another can have surprising results. Also, if you're mainly looking for a way to saturate your CPU, Claypoole might be exactly what you need. – Ben Kovitz May 14 '16 at 22:20
  • Not saturating the CPU: Sounds like it's I/O-bound. Maybe it would help to use [line-seq](http://clojuredocs.org/clojure.core/line-seq), which reads lines lazily. Also, don't call `pmap` twice in a row like that. It's better to use `(pmap (comp words parse-with-keywords))`. Try to pack as much processing as possible into a single `pmap` call, because there's a lot of overhead to creating multiple threads each time it's called. If the processing done with a single `pmap` call is too little, it won't be worthwhile to use it. – Mars May 15 '16 at 06:16
  • Usually it's best to do timing with the [Criterium](https://github.com/hugoduncan/criterium) library, although in your case it might not matter. – Mars May 15 '16 at 06:19
  • Correction: I don't believe `line-seq` will help if it's I/O-bound--I wasn't thinking--but it might be worth trying anyway. – Mars May 15 '16 at 06:24
  • Interesting. After doing some profiling I noticed that `iota` actually takes a long time to split the chunks. Using a `clojure.java.io/reader` with `line-seq` cut the time almost in half (with pmap). I guess there's no need to load the whole file into memory if I can process it line by line. – eugecm May 15 '16 at 12:20
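Putting the last two comments together, a line-seq version with a single pmap pass might look like the following sketch (not from the original post; it assumes the question's ns form plus [clojure.java.io :as jio] in :require, and the reduce must run inside with-open so the reader isn't closed under the lazy sequence):

;; Sketch: read lines lazily and fuse parse + split into one pmap
;; pass, equivalent to (pmap (comp words parse-with-keywords) ...).
(defn words-line-seq
  [filename]
  (letfn [(parse-words [line]
            (-> (json/parse-string line true)
                :contents
                (string/split #"\s+")))]
    (with-open [rdr (jio/reader filename)]
      (->> (line-seq rdr)
           (pmap parse-words)    ; one pmap pass, fused per-line work
           (reduce into #{})     ; forced here, while rdr is still open
           (count)))))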

1 Answer


I don't believe that reducers are going to be the right solution for you, as they don't cope at all well with lazy sequences (a reducer will give correct results with a lazy sequence, but won't parallelise well).
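To make that concrete: in clojure.core.reducers, r/reduce is always serial; only r/fold runs in parallel, and only over foldable collections (persistent vectors and maps; iota's README says its vectors support fold as well, which is iota's main selling point). So words-reducers as written does all its work on one thread. A fold-based variant might look like this sketch (my code, not the poster's, and untested against their data; the chunk size of 512 is arbitrary):

;; Sketch: r/fold splits the vector into chunks, reduces each chunk
;; on a fork/join pool, then merges partial results with the combine fn.
(defn words-fold
  [filename]
  (->> (io/vec filename)
       (r/filter identity)                        ; drop nils iota may return for blank lines
       (r/map #(json/parse-string % true))
       (r/map #(string/split (:contents %) #"\s+"))
       (r/fold 512                                ; chunk size (tunable)
               (fn ([] #{}) ([a b] (into a b)))   ; combine: identity value + set merge
               (fn [acc ws] (into acc ws)))       ; reduce: add one line's words
       (count)))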

You might want to take a look at this sample code from the book Seven Concurrency Models in Seven Weeks (disclaimer: I am the author) which solves a similar problem (counting the number of times each word appears on Wikipedia).

Given a list of Wikipedia pages, this function counts the words sequentially (get-words returns a sequence of words from a page):

(defn count-words-sequential [pages]
  (frequencies (mapcat get-words pages)))

This is a parallel version using pmap, which does run faster, but only by around 1.5x:

(defn count-words-parallel [pages]
  (reduce (partial merge-with +)
    (pmap #(frequencies (get-words %)) pages)))

The reason it only goes around 1.5x faster is that the reduce becomes a bottleneck: it calls (partial merge-with +) once for each page. Merging batches of 100 pages improves the performance to around 3.2x on a 4-core machine:

(defn count-words [pages]
  (reduce (partial merge-with +)
    (pmap count-words-sequential (partition-all 100 pages))))
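Applied back to the question's distinct-word count, the same batching idea might look like this (an illustrative sketch only, assuming the question's namespace; the batch size of 100 is as arbitrary here as it is above):

;; Sketch: batch the lines so each pmap task is substantial, build one
;; word set per batch, then merge the batch sets at the end.
(defn words-pmap-batched
  [filename]
  (letfn [(parse-words [line]
            (-> (json/parse-string line true)
                :contents
                (string/split #"\s+")))]
    (->> (io/vec filename)
         (partition-all 100)                        ; batch to amortize task overhead
         (pmap #(into #{} (mapcat parse-words) %))  ; one set per batch
         (reduce into #{})                          ; merge batch sets
         (count))))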
Paul Butcher
  • was `pages` a lazy sequence? or was it previously loaded with all the pages? – eugecm May 18 '16 at 12:57
  • `pages` is lazy, yes. – Paul Butcher May 19 '16 at 08:10
  • You can see the source that's loading pages here: https://media.pragprog.com/titles/pb7con/code/FunctionalProgramming/WordCount/src/wordcount/pages.clj and for completeness the implementation of get-words here: https://media.pragprog.com/titles/pb7con/code/FunctionalProgramming/WordCount/src/wordcount/words.clj – Paul Butcher May 19 '16 at 08:12