
I have a sample data set in a txt file. The data file is extremely large, so loading it into memory is not an option. I need to read the file lazily, and I need the lines to be read in a random order. There may also be cases where I don't need to read all the lines. This is what I have found so far:

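(require '[clojure.java.io :as io])

(defn read-lazy [in-file]
  (letfn [(helper [rdr]
            ;; lazy-seq defers each read until the next element is requested
            (lazy-seq
              (if-let [line (.readLine rdr)]
                (cons line (helper rdr))
                (do (.close rdr) nil))))]
    (helper (io/reader in-file))))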

which returns a lazy-seq of the file. How can I loop through random lines in the lazy-seq when I need to? I think using a go block could help here. A go block could put a random line in a channel and wait for something to consume it. Once that line has been read, it puts another line in the channel and waits for the next read. How can I implement that?

Here's how I've worked it out (not random):

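(require '[clojure.core.async :refer [chan go >! <!! close!]]
         '[clojure.java.io :as io])

(def lazy-ch (chan))

(defn async-fetch-set [in-file]
  (go
    (with-open [reader (io/reader in-file)]
      (doseq [line (line-seq reader)]
        (>! lazy-ch line)))
    (close! lazy-ch)))

;; async-fetch-set must have been called for a line to be available here
(println "got: " (<!! lazy-ch))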

Is this a good way to approach the problem? Is there a better solution? I might not need to read all the lines, so I'd like to be able to close the reader whenever I need to.

Lordking
  • So, to make sure I understand your problem correctly: you need to perform the same operation on every line of the file in sequence, but the order of that sequence needs to be random? – Sam Estep Nov 03 '15 at 17:34
  • yup. that's what i'm trying to do. i added an update – Lordking Nov 03 '15 at 17:39

2 Answers


Your solution above does not include any randomness. core.async channels are first-in, first-out constructs. If you really want random reads, you first need to count the number of lines in the file, then use (rand-int N) to generate an integer I in the interval [0..N-1], then read line I from the file.

There are several different approaches to read line I from the file, which trade off speed vs memory requirements.
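As a rough illustration of the lowest-memory variant of this idea, the sketch below counts the lines in one pass and then rescans the file from the start to fetch a randomly chosen line. The function names count-lines, nth-line and random-line are hypothetical and chosen only for this example. Every call rescans the file, so this trades speed for memory.

```clojure
(require '[clojure.java.io :as io])

(defn count-lines
  "Counts the lines in file f in a single pass, without retaining them."
  [f]
  (with-open [rdr (io/reader f)]
    (count (line-seq rdr))))

(defn nth-line
  "Returns line i (zero-based) of file f by scanning from the start."
  [f i]
  (with-open [rdr (io/reader f)]
    (nth (line-seq rdr) i)))

(defn random-line
  "Returns one uniformly chosen line of file f, or nil if f is empty."
  [f]
  (let [n (count-lines f)]
    (when (pos? n)
      (nth-line f (rand-int n)))))
```

If many random lines are needed, counting once and reusing N avoids one of the two passes per line.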

Alan Thompson
(defn char-seq
  "Returns a lazy sequence of characters from rdr. rdr must implement
  java.io.Reader."
  [rdr]
  (let [c (.read rdr)]
    (if-not (neg? c)
      (cons (char c) (lazy-seq (char-seq rdr))))))

(defn line-offsets
  "Returns a lazy sequence of the character offsets at which each line in s
  starts. Handles \n, \r and \r\n line endings."
  [s]
  (if (seq s)
    (->> (partition 2 1 s)
         (keep-indexed
          (fn [i [b c]]
            ;; a line ends at b; the next line starts at i + 1
            (when (or (= b \newline)
                      (and (= b \return) (not= c \newline)))
              (inc i))))
         (cons 0))))

(defn ordered-line-seq
  "Returns the lines of text from raf at each offset in offsets as a lazy
  sequence of strings. raf must be a java.io.RandomAccessFile. Offsets are
  passed to .seek as byte offsets, so character offsets only match for
  single-byte encodings."
  [raf offsets]
  (map (fn [i]
         (.seek raf i)
         (.readLine raf))
       offsets))

Example usage:

(let [filename "data.txt"
      offsets (with-open [rdr (clojure.java.io/reader filename)]
                (shuffle (line-offsets (char-seq rdr))))]
  (with-open [raf (java.io.RandomAccessFile. filename "r")]
    (dorun (map println (ordered-line-seq raf offsets)))))
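Since the question mentions that not every line may be needed, here is a hedged variation that reads only the first n lines of the shuffled order. It also records offsets with RandomAccessFile itself, so they are byte offsets and match what .seek expects. The names byte-line-offsets and random-lines are hypothetical, and RandomAccessFile's unbuffered readLine is likely to be slow on very large files, so treat this as a sketch rather than a tuned implementation.

```clojure
(import 'java.io.RandomAccessFile)

(defn byte-line-offsets
  "Returns a vector of the byte offset at which each line of the file starts."
  [filename]
  (with-open [raf (RandomAccessFile. filename "r")]
    (loop [offsets [] pos 0]
      ;; readLine returns nil at end of file
      (if (.readLine raf)
        (recur (conj offsets pos) (long (.getFilePointer raf)))
        offsets))))

(defn random-lines
  "Returns up to n lines of the file, in random order, as a vector."
  [filename n]
  (let [offsets (take n (shuffle (byte-line-offsets filename)))]
    (with-open [raf (RandomAccessFile. filename "r")]
      ;; mapv is eager, so every read happens before raf is closed
      (mapv (fn [off]
              (.seek raf (long off))
              (.readLine raf))
            offsets))))
```

For example, (random-lines "data.txt" 10) would return ten lines in random order and close the file afterwards. Note that RandomAccessFile's readLine maps each byte to one character, so non-ASCII text would need separate decoding.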
Sam Estep