
How can I manage memory and performance when processing a large time-series dataset?

Size: ~3.2 GB

Lines: ~54 million

First few lines of the dataset:

{:ts 20200601040025269 :bid 107.526000 :ask 107.529000}
{:ts 20200601040025370 :bid 107.525000 :ask 107.529000}
{:ts 20200601040026421 :bid 107.525000 :ask 107.528000}
{:ts 20200601040026724 :bid 107.524000 :ask 107.528000}
{:ts 20200601040027424 :bid 107.524000 :ask 107.528000}
{:ts 20200601040033535 :bid 107.524000 :ask 107.527000}
{:ts 20200601040034230 :bid 107.523000 :ask 107.526000}
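Each line is itself a readable EDN map, so it can be parsed directly:

(clojure.edn/read-string "{:ts 20200601040025269 :bid 107.526000 :ask 107.529000}")
;; => {:ts 20200601040025269, :bid 107.526, :ask 107.529}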

Helper functions

(require '[clojure.java.io :as io])

(defn lines [n filename]
  ;; Eagerly realize the first n lines of the file.
  (with-open [rdr (io/reader filename)]
    (doall (take n (line-seq rdr)))))

(def dataset (into [] (lines 2000 "./data/rawdata.map")))

For best performance, I want to keep as much of the data in memory as possible. However, my notebook has only 16 GB of RAM, and when I load more data, CPU/memory utilization approaches 95%.

  1. Can I manage memory better when working with a large dataset in Clojure?
  2. Can I reserve a memory buffer to store the dataset?
  3. Since this is time-series data in a small-memory environment, once the first batch of data is processed, the next batch could be retrieved with line-seq (see the sketch after this list).
  4. What data structure should be used to implement this?
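Something like the following sketch is what I have in mind for point 3; process-batch is a placeholder for the actual per-batch computation:

(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])

;; Read the file lazily and hand it to process-batch in fixed-size chunks,
;; so only one batch of parsed records is in memory at a time.
(defn process-in-batches [filename batch-size process-batch]
  (with-open [rdr (io/reader filename)]
    (doseq [batch (partition-all batch-size (line-seq rdr))]
      (process-batch (mapv edn/read-string batch)))))

;; e.g. (process-in-batches "./data/rawdata.map" 100000 analyze-batch!)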

Please feel free to comment.

Thanks

madeinQuant
  • Have you considered using a data processing library such as [tech.ml.dataset](https://github.com/techascent/tech.ml.dataset)? – Steffan Westcott May 04 '21 at 10:15
  • @SteffanWestcott, tech.ml.dataset is great and has a lot of features. In our case, we are implementing a small project and want to maintain a self-contained codebase. Thanks for your comment. – madeinQuant May 04 '21 at 10:34
  • Obviously, you will need to work in chunks of data in some fashion. You are already using a lazy sequence (via `line-seq`). Beyond that, you will need to add more details of the intended processing of the data. – Alan Thompson May 04 '21 at 13:50
  • @AlanThompson, thanks for your comment. We want to minimize disk I/O in order to maximize performance. – madeinQuant May 04 '21 at 16:09
  • This question is both over- and under-specified. Nobody can advise you on how to read a file with no idea of how you plan to use the data you got. Do you need random access? Will you access each item only once? And so on. Alan Thompson's advice is the obvious approach. You reject it to "minimize disk i/o", but it doesn't do any more i/o than any other approach. Such constraints over-specify the question. – amalloy May 04 '21 at 18:18

2 Answers


Because the dataset consists of only 54 million lines, it can fit into memory if you pack the data tightly. Assuming that is what you want, e.g. for the convenience of random access, here is one approach.

The reason you cannot fit it into memory is likely the overhead of all the objects used to represent each record read from the file. But if you flatten the values into, for example, a byte buffer, the amount of space needed to store them is not that great: you could represent the timestamp as one byte per digit, and the amounts in a fixed-point representation. Here is a quick and dirty solution.

(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])
(import '[java.nio ByteBuffer])

(def fixed-pt-factor 1000)
;; 17 bytes of timestamp digits + 4 bytes each for bid and ask.
(def record-size (+ 17 4 4))
(def max-count 54000000)

(defn put-amount [^ByteBuffer dst amount]
  ;; Store the amount as a fixed-point int, e.g. 107.526 -> 107526.
  (let [x (* fixed-pt-factor amount)]
    (.putInt dst (int x))))

(defn push-record [^ByteBuffer dst m]
  ;; Timestamp (convert to string and push char by char).
  (doseq [c (str (:ts m))]
    (.put dst (byte c)))
  (put-amount dst (:bid m))
  (put-amount dst (:ask m))
  dst)

(defn get-amount [^ByteBuffer src pos]
  ;; Read the fixed-point int back and rescale to a BigDecimal.
  (/ (BigDecimal. (.getInt src pos))
     fixed-pt-factor))

(defn record-count [^ByteBuffer dataset]
  (quot (.position dataset) record-size))

(defn nth-record [^ByteBuffer dataset n]
  (let [offset (* n record-size)]
    {:ts (edn/read-string (apply str (map #(->> % (+ offset) (.get dataset) char) (range 17))))
     :bid (get-amount dataset (+ offset 17))
     :ask (get-amount dataset (+ offset 17 4))}))

(defn load-dataset [filename]
  (let [result (ByteBuffer/allocate (* record-size max-count))]
    (with-open [rdr (io/reader filename)]
      (transduce (map edn/read-string) (completing push-record) result (line-seq rdr)))
    result))

You can then use load-dataset to load the dataset, record-count to get the number of records, and nth-record to get the nth record:

(def dataset (load-dataset filename))

(record-count dataset)
;; => 7

(nth-record dataset 2)
;; => {:ts 20200601040026421, :bid 107.525M, :ask 107.528M}

Exactly how you represent the values in the byte buffer is up to you; I have not optimized it particularly. The loaded dataset in this example requires only about 54000000 × 25 bytes = 1.35 GB, which fits in memory (you may have to raise the JVM heap limit, e.g. with -Xmx2g).

In case you need to load files larger than this, you may consider putting the data into a memory-mapped file instead of an in-memory byte buffer.
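A minimal sketch of that variant (the function name is mine). Because nth-record uses absolute reads, it works unchanged on the mapped buffer; record-count, however, would need to derive the count from the file length rather than from .position:

(import '[java.io RandomAccessFile]
        '[java.nio.channels FileChannel$MapMode])

;; Map the packed-record file into memory and return a read-only ByteBuffer
;; view over it. The mapping stays valid after the file is closed. A single
;; mapping is limited to about 2 GB, so larger files need several mappings.
(defn map-packed-dataset [filename]
  (with-open [raf (RandomAccessFile. filename "r")]
    (.map (.getChannel raf) FileChannel$MapMode/READ_ONLY 0 (.length raf))))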

Rulle
  • Thanks for your help, I shall apply your suggestion to my project. Other than that, could you suggest some reading material about managing large datasets on the JVM? – madeinQuant May 06 '21 at 00:00
  • I found an example of memory mapping in Clojure: https://github.com/clojure-cookbook/clojure-cookbook/blob/master/04_local-io/4-08_memory-map-files.asciidoc. Thank you for your suggestion. – madeinQuant May 06 '21 at 00:41
  • A memory-mapped file is a large array of bytes in memory, not the data structures encoded in the file. How can we associate a memory-mapped file with a Clojure data structure, e.g. an atom? – madeinQuant May 25 '21 at 09:47

Use deftype to create a type with a long ts and doubles for bid and ask. If you parse your line strings into instances of this type, you will find that a 54-million-row dataset should fit in memory easily: 24 bytes of data, plus ~8 bytes of object header, plus ~8 bytes of reference in the array, makes roughly 40 bytes per record, or around 2 GB of heap.
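A minimal sketch of the approach described above (the Tick type and helper names are illustrative, not from the answer's promised example):

(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])

;; One record = 24 bytes of primitive fields, no boxing.
(deftype Tick [^long ts ^double bid ^double ask])

(defn parse-tick
  "Parses one EDN line of the dataset into a Tick."
  [line]
  (let [{:keys [ts bid ask]} (edn/read-string line)]
    (Tick. (long ts) (double bid) (double ask))))

(defn load-ticks [filename]
  (with-open [rdr (io/reader filename)]
    (into [] (map parse-tick) (line-seq rdr))))

;; Usage:
;; (def ticks (load-ticks "./data/rawdata.map"))
;; (.-ts ^Tick (nth ticks 2)) ;; => 20200601040026421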

More exotic solutions (primitive arrays for a column store, or flyweights to access packed byte buffers) are possible but unneeded for your stated problem parameters.
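For reference, a minimal sketch of the column-store variant just mentioned, again with illustrative names: one primitive array per column, so there are no per-record objects at all.

(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])

(defn load-columns
  "Reads up to n records into three parallel primitive arrays."
  [filename n]
  (let [^longs ts    (long-array n)
        ^doubles bid (double-array n)
        ^doubles ask (double-array n)]
    (with-open [rdr (io/reader filename)]
      (doseq [[i line] (map-indexed vector (take n (line-seq rdr)))]
        (let [m (edn/read-string line)]
          (aset ts i (long (:ts m)))
          (aset bid i (double (:bid m)))
          (aset ask i (double (:ask m))))))
    {:ts ts :bid bid :ask ask}))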

Example code to follow; I only have my phone to hand.

pete23