How can I manage memory and performance when processing a large time-series data set?
Size: ~3.2 GB
Lines: ~54 million
First few lines of the dataset:
{:ts 20200601040025269 :bid 107.526000 :ask 107.529000}
{:ts 20200601040025370 :bid 107.525000 :ask 107.529000}
{:ts 20200601040026421 :bid 107.525000 :ask 107.528000}
{:ts 20200601040026724 :bid 107.524000 :ask 107.528000}
{:ts 20200601040027424 :bid 107.524000 :ask 107.528000}
{:ts 20200601040033535 :bid 107.524000 :ask 107.527000}
{:ts 20200601040034230 :bid 107.523000 :ask 107.526000}
Helper functions
(require '[clojure.java.io :as io])

(defn lines [n filename]
  (with-open [rdr (io/reader filename)]
    ;; doall realizes all n lines before the reader is closed
    (doall (take n (line-seq rdr)))))

(def dataset (into [] (lines 2000 "./data/rawdata.map")))
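Each line looks like an EDN map, so I assume it can be parsed with clojure.edn/read-string, e.g.:

(require '[clojure.edn :as edn])

;; parse one raw line into a Clojure map
(edn/read-string "{:ts 20200601040025269 :bid 107.526000 :ask 107.529000}")
;; => {:ts 20200601040025269, :bid 107.526, :ask 107.529}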
For best performance I would like to keep as much data in memory as possible. However, my notebook has only 16 GB of RAM, and when I load more data, CPU and memory utilization approaches 95%.
- Can I manage memory better for a large dataset in Clojure?
- Can I reserve a memory buffer to store the data set?
- Since this is time-series data and memory is limited, once the first batch has been processed, the next batch could be retrieved with line-seq (see the sketch after this list).
- What data structure would you suggest for implementing this?
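A rough sketch of what I have in mind, assuming each batch can be processed independently (process-file and process-batch! are placeholder names I made up):

(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])

;; Stream the file and process it in fixed-size batches so that only one
;; batch needs to be resident in memory at a time.
(defn process-file [filename batch-size process-batch!]
  (with-open [rdr (io/reader filename)]
    (->> (line-seq rdr)             ; lazy seq of lines
         (map edn/read-string)      ; parse each line into a map
         (partition-all batch-size) ; group into lazy batches
         (run! process-batch!))))   ; process batches eagerly, one at a time

;; Example usage: count ticks per batch where the spread exceeds 0.005
(process-file "./data/rawdata.map" 10000
              (fn [batch]
                (println (count (filter #(> (- (:ask %) (:bid %)) 0.005) batch)))))

Since run! is eager, the whole file is consumed before with-open closes the reader, and nothing retains the head of the lazy sequence.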
Please feel free to comment.
Thanks