
What is the best way to read a very large file (for example, a text file with 100,000 names, one per line) into a list in Clojure, lazily, loading it only as needed?

Basically I need to do all sorts of string searches on these items (currently I do that with grep and regexes in shell scripts).

I tried adding '( at the beginning and ) at the end, but apparently this method (loading a static/constant list?) has a size limitation for some reason.

Ali

5 Answers


There are various ways of doing this, depending on exactly what you want.

If you have a function that you want to apply to each line in a file, you can use code similar to Abhinav's answer:

(with-open [rdr ...]
  (doall (map function (line-seq rdr))))

This has the advantage that the file is opened, processed, and closed as quickly as possible, but forces the entire file to be consumed at once.
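For example, a minimal sketch of this pattern (the path and regex are placeholders, not from the original answer), collecting every line that matches a pattern; doall realizes the whole result before with-open closes the reader:

(with-open [rdr (clojure.java.io/reader "/etc/passwd")]
  (doall (filter #(re-find #"root" %) (line-seq rdr))))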

If you want to delay processing of the file you might be tempted to return the lines, but this won't work:

(map function ; broken!!!
    (with-open [rdr ...]
        (line-seq rdr)))

because the file is closed when with-open returns, which is before you lazily process the file.

One way around this is to pull the entire file into memory with slurp:

(map function (clojure.string/split-lines (slurp filename))) ; slurp returns one string, so split it into lines first

That has an obvious disadvantage - memory use - but guarantees that you don't leave the file open.
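Applied to the question's use case, a hedged sketch of this eager approach (grep-lines, the path, and the pattern are all made up for illustration):

(require '[clojure.string :as str])

(defn grep-lines [filename pattern]
  ;; Read the whole file eagerly, split it into lines, keep the matches.
  (->> (slurp filename)
       str/split-lines
       (filter #(re-find pattern %))))

;; (grep-lines "/tmp/names.txt" #"^Ali")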

An alternative is to leave the file open until the read reaches the end of the stream, generating a lazy sequence as you go:

(ns ...
  (:use clojure.test))

(defn stream-consumer [stream]
  (println "read" (count stream) "lines"))

(defn broken-open [file]
  (with-open [rdr (clojure.java.io/reader file)]
    (line-seq rdr)))

(defn lazy-open [file]
  (defn helper [rdr]
    (lazy-seq
      (if-let [line (.readLine rdr)]
        (cons line (helper rdr))
        (do (.close rdr) (println "closed") nil))))
  (lazy-seq
    (do (println "opening")
      (helper (clojure.java.io/reader file)))))

(deftest test-open
  (try
    (stream-consumer (broken-open "/etc/passwd"))
    (catch RuntimeException e
      (println "caught " e)))
  (let [stream (lazy-open "/etc/passwd")]
    (println "have stream")
    (stream-consumer stream)))

(run-tests)

Which prints:

caught  #<RuntimeException java.lang.RuntimeException: java.io.IOException: Stream closed>
have stream
opening
closed
read 29 lines

Showing that the file wasn't even opened until it was needed.

This last approach has the advantage that you can process the stream of data "elsewhere" without keeping everything in memory, but it also has an important disadvantage - the file is not closed until the end of the stream is read. If you are not careful you may open many files in parallel, or even forget to close them (by not reading the stream completely).
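To make the pitfall concrete (a hypothetical snippet, not from the original answer): taking only a few lines never reaches the end of the sequence, so the close in lazy-open never runs:

(let [lines (lazy-open "/etc/passwd")]
  (doall (take 3 lines)))
;; only three lines were realized, so the reader is still open here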

The best choice depends on the circumstances - it's a trade-off between lazy evaluation and limited system resources.

PS: Is lazy-open defined somewhere in the libraries? I arrived at this question trying to find such a function and ended up writing my own, as above.

andrew cooke

Andrew's solution worked well for me, but nested defns are not idiomatic, and you don't need lazy-seq twice. Here is an updated version using letfn, without the extra prints:

(defn lazy-file-lines [file]
  (letfn [(helper [rdr]
            (lazy-seq
              (if-let [line (.readLine rdr)]
                (cons line (helper rdr))
                (do (.close rdr) nil))))]
    (helper (clojure.java.io/reader file))))

(count (lazy-file-lines "/tmp/massive-file.txt"))
;=> <a large integer>
JohnJ

You need to use line-seq. An example from clojuredocs:

;; Count lines of a file (loses head):
user=> (with-open [rdr (clojure.java.io/reader "/etc/passwd")]
         (count (line-seq rdr)))

But with a lazy list of strings, you cannot efficiently perform operations that require the whole list to be present, like sorting. If you can implement your operations as filter or map, then you can consume the list lazily. Otherwise an embedded database is the better fit.
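For example, a hedged sketch of a grep-like search expressed as a lazy filter (the path and pattern are placeholders); doall realizes the matches before with-open closes the reader:

(with-open [rdr (clojure.java.io/reader "/path/to/names.txt")]
  (->> (line-seq rdr)
       (filter #(re-find #"smith" %)) ; lazy, grep-like search
       doall))                        ; realize before the reader closes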

Also note that you should not hold on to the head of the list; otherwise the whole list will be loaded into memory.

Furthermore, if you need to do more than one operation, you'll need to read the file again and again. Be warned, laziness can make things difficult sometimes.

Abhinav Sarkar
  • Thanks a lot, but what if I wanted to keep the whole list in memory (not being lazy)? What would be the best way then? As you said, for some operations I need to go over the list over and over again (let's assume I have enough memory to keep the whole list). – Ali Nov 08 '10 at 04:08
  • 4
    In that case, simply keep a reference to the head of the lazy list. It will be loaded lazily first time and then stay loaded. Something like: `(def names (with-open [rdr (clojure.java.io/reader "/path/to/names/file")] (line-seq rdr)))` – Abhinav Sarkar Nov 08 '10 at 05:42
  • 7
    Well, I don't think so. Because you have surrounded "line-seq" with "with-open", the underlying stream will be closed automatically when it returns. So there is nothing left behind your "names" var. So basically you would have to 1: `(def rdr (clojure.java.io/reader "/path/to/names/file"))` then 2: `(def names (line-seq rdr))` then 3: `(. rdr close)`. Finally, you can now play around with your "names" like: `(count names)` – Rollo Tomazzi May 06 '11 at 12:49
  • 2
    @RolloTomazzi, if you don't realize `names` before closing `rdr`, it won't work either (the problem is exact the same you point on @AbhinavSarkar's suggestion: `line-seq` reads only the first element, the rest is lazy, so closing `rdr` won't allow you to read past the first element of `names`, so `(count names)` will probably throw an exception). You'd have to add a new step between 2 and 3, something to realize the collection, like `(dorun names)`. But, then, this is equivalent to `(def names (with-open [rdr ...] (doall (line-seq rdr))))`, like in @andrew's answer, which is way better. – Bruno Reis Jul 30 '12 at 06:15

See my answer here:

(ns user
  (:require [clojure.core.async :as async :refer :all
             :exclude [map into reduce merge partition partition-by take]]))

(defn read-dir [dir]
  (let [directory (clojure.java.io/file dir)
        files (filter #(.isFile %) (file-seq directory))
        ch (chan)]
    (go
      (doseq [file files]
        (with-open [rdr (clojure.java.io/reader file)]
          (doseq [line (line-seq rdr)]
            (>! ch line))))  ; put each line on the channel
      (close! ch))           ; signal that all files are consumed
    ch))

So you can consume the channel like this:

(def aa "D:\\Users\\input")

(let [ch (read-dir aa)]
  (loop []
    (when-let [line (<!! ch)]
      (println line)
      (recur))))
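A nice property of this design: the channel is unbuffered, so >! parks inside the go block until the consumer takes a line, meaning the whole directory is never buffered in memory at once.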
chen_767

You might find the iota library useful for working with very large files in Clojure. I use iota sequences all the time when I am applying reducers to large amounts of input, and iota/vec provides random access to files larger than memory by indexing them.
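For instance, a hedged sketch along the lines of iota's documented reducers usage (the path, the pattern, and the count-matches name are assumptions for illustration):

(require '[iota :as iota]
         '[clojure.core.reducers :as r])

(defn count-matches [filename pattern]
  ;; Fold in parallel over a memory-mapped, indexed view of the file.
  (->> (iota/vec filename)
       (r/filter identity)              ; iota/vec yields nil for blank lines
       (r/filter #(re-find pattern %))
       (r/map (constantly 1))
       (r/fold +)))

;; (count-matches "/tmp/massive-file.txt" #"smith")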

Matt