
I'm new to Compojure, but have been enjoying using it so far. I'm currently encountering a problem in one of my API endpoints that generates a large CSV file from the database and then passes this as the response body.

The problem seems to be that the whole CSV file is kept in memory, which then causes an out-of-memory error in the API. What is the best way to handle and generate this, ideally as a gzipped file? Is it possible to stream the response so that a few thousand rows are returned at a time? Returning the same data as a JSON response body works without any problem.

Here is the current code I'm using to return this:

(defn complete
  "Returns metrics for each completed benchmark instance"
  [db-client response-format]
  (let [benchmarks  (completed-benchmark-metrics {} db-client)]
    (case response-format
      :json  (json-grouped-output field-mappings benchmarks)
      :csv   (csv-output benchmarks))))

(defn csv-output [data-seq]
  (let [header (map name (keys (first data-seq)))
        out    (java.io.StringWriter.)
        write  #(csv/write-csv out (list %))]
    (write header)
    (dorun (map (comp write vals) data-seq))
    (.toString out)))

The data-seq is the results returned from the database, which I think is a lazy sequence. I'm using yesql to perform the database call.

Here is my compojure resource for this API endpoint:

(defresource results-complete [db]
  :available-media-types  ["application/json" "text/csv"]
  :allowed-methods        [:get]
  :handle-ok              (fn [request]
                            (let [response-format (keyword (get-in request [:request :params :format] :json))
                                  disposition     (str "attachment; filename=\"nucleotides_benchmark_metrics." (name response-format) "\"")
                                  response        {:headers {"Content-Type" (content-types response-format)
                                                             "Content-Disposition" disposition}
                                                   :body    (results/complete db response-format)}]
                              (ring-response response))))
Michael Barton
  • Why do you have the `dorun` there? Its whole purpose is to be non-lazy. I'm pretty sure `.toString` is non-lazy, and I'm not sure about `keys` & `vals`. – Alan Thompson Sep 18 '17 at 21:57
  • @AlanThompson, it's my understanding that `dorun` does not keep the head of the sequence and so can be used for side effects such as writing to a string buffer. – Michael Barton Sep 18 '17 at 22:27
  • Yes, but it does cause the sequence to be evaluated immediately, not lazily. – Alan Thompson Sep 18 '17 at 22:54
  • This page talks about laziness when working with csv files. Just in case it helps: https://github.com/clojure/data.csv – Chris Murphy Sep 18 '17 at 23:58
  • Thanks for all the suggestions, I was able to create a solution based on the linked articles and documentation – Michael Barton Sep 19 '17 at 21:28

3 Answers


Thanks to all the suggestions provided in this thread, I was able to create a solution using piped-input-stream:

(defn csv-output [data-seq]
  ;; Requires [clojure.data.csv :as csv], [clojure.java.io :as io],
  ;; and [ring.util.io :refer [piped-input-stream]].
  (let [headers    (map name (keys (first data-seq)))
        rows       (map vals data-seq)
        stream-csv (fn [out]
                     (csv/write-csv out (cons headers rows))
                     (.flush out))]
    (piped-input-stream #(stream-csv (io/make-writer % {})))))

This differs from my original solution because it does not realise the sequence with dorun, nor does it create a large String object. Instead it writes to a PipedInputStream connection asynchronously, as described by the documentation:

Create an input stream from a function that takes an output stream as its argument. The function will be executed in a separate thread. The stream will be automatically closed after the function finishes.
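For the gzipped output the question also asked about, the same approach should work with the writer wrapped in a java.util.zip.GZIPOutputStream. This is an untested sketch rather than something from my production code (`gzipped-csv-output` is just an illustrative name), and you would also need to set a "Content-Encoding" "gzip" response header:

```clojure
(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io]
         '[ring.util.io :refer [piped-input-stream]])

(defn gzipped-csv-output [data-seq]
  (let [headers (map name (keys (first data-seq)))
        rows    (map vals data-seq)]
    (piped-input-stream
      (fn [out]
        ;; The GZIPOutputStream must be closed, not just flushed,
        ;; so that the gzip trailer is written; with-open handles that.
        (with-open [gz  (java.util.zip.GZIPOutputStream. out)
                    wtr (io/make-writer gz {})]
          (csv/write-csv wtr (cons headers rows)))))))
```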

Michael Barton

Your csv-output function completely realises the dataset and turns it into a String. To lazily stream the data, you'll need to return something other than a concrete data type such as a String. This suggests that Ring supports returning a stream, which can be lazily realised by Jetty. The answer to this question might prove useful.
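As a minimal sketch of what that looks like: Ring accepts a java.io.InputStream (or a java.io.File) as the response :body, and the adapter copies it to the client in chunks instead of holding the whole payload in memory. Here `metrics.csv` is a placeholder file name, not something from the question:

```clojure
(require '[clojure.java.io :as io])

(defn csv-handler [request]
  {:status  200
   :headers {"Content-Type" "text/csv"}
   ;; io/input-stream on a file streams from disk; an InputStream built
   ;; with ring.util.io/piped-input-stream works the same way.
   :body    (io/input-stream (io/file "metrics.csv"))})
```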

l0st3d

I was also struggling with streaming a large CSV file. My solution was to use an http-kit channel to stream each line of the data-seq to the client and then close the channel. It looks like this:

(require '[org.httpkit.server :refer [with-channel send! close]])

(defn handler [req]
  (with-channel req channel
    (let [header   "your$header"
          data-seq ["your$seq-data"]]
      (doseq [line (cons header data-seq)]
        ;; The final false argument tells send! not to close the
        ;; channel after each chunk.
        (send! channel
               {:status  200
                :headers {"Content-Type" "text/csv"}
                :body    (str line "\n")}
               false))
      (close channel))))
Minh Tuan Nguyen
  • Thanks for your suggestion. I'm not familiar with the http-kit server; I ended up answering this question using `piped-input-stream` because that fits with my existing Clojure API stack, which includes Liberator and Compojure. I'll likely use channels in future for larger data. – Michael Barton Sep 19 '17 at 22:49