
I develop a Heroku Rails application on the Cedar stack, and this is the bottleneck:

def self.to_csvAlt(options = {})
  CSV.generate(options) do |csv|
    # Header row: fixed columns plus one column per page/attribute pair.
    column_headers = ["user_id", "session_id", "survey_id"]
    pages = PageEvent.order(:page).select(:page).map(&:page).uniq
    page_attributes = ["a", "b", "c", "d", "e"]
    pages.each do |p|
      page_attributes.each do |pa|
        column_headers << p + "_" + pa
      end
    end
    csv << column_headers

    # One row per session: user/survey info, then the summed attributes for each page.
    session_ids = PageEvent.order(:session_id).select(:session_id).map(&:session_id).uniq
    session_ids.each do |si|
      session_user = PageEvent.find(:first, :conditions => ["session_id = ? AND page != ?", si, 'none'])
      if session_user.nil?
        row = [si, nil, nil, nil]
      else
        row = [session_user.username, si, session_user.survey_name]
      end
      pages.each do |p|
        a = 0
        b = 0
        c = 0
        d = 0
        e = 0
        # Sum each attribute across every event for this page within the session.
        allpages = PageEvent.where(:page => p, :session_id => si)
        allpages.each do |ap|
          a += ap.a
          b += ap.b
          c += ap.c
          d += ap.d
          e += ap.e
        end
        # Place the five sums at the offsets matching the header order.
        index = pages.index p
        end_index = (index + 1) * 5 + 2
        if !p.nil?
          row[end_index] = a
          row[end_index - 1] = b
          row[end_index - 2] = c
          row[end_index - 3] = d
          row[end_index - 4] = e
        else
          row[end_index] = nil
          row[end_index - 1] = nil
          row[end_index - 2] = nil
          row[end_index - 3] = nil
          row[end_index - 4] = nil
        end
      end
      csv << row
    end
  end
end

As you can see, it generates a CSV file from a table that contains data on each individual page taken from a group of surveys. The problem is that there are ~50,000 individual pages in the table, so the Heroku app keeps giving me R14 errors (memory quota exceeded, 512 MB) and eventually dies when the dyno goes to sleep after an hour.

That being said, I really don't care how long it takes to run; I just need it to complete. I am waiting on approval to add a worker dyno to run the CSV generation, which I know will help, but in the meantime I would still like to optimize this code. There is potential for over 100,000 pages to be processed at a time, and I realize this is incredibly memory-heavy, so I really need to cut its memory usage as much as possible. Thank you for your time.

celeriko

2 Answers


You can split it up into batches so that the work is completed in sensible chunks.

Try something like this:

def self.to_csvAlt(options = {})

  # ...

  pages = PageEvent.order(:page).select(:page).map(&:page).uniq

  pages.find_each(:batch_size => 5000) do |p|
    # ...

Using find_each with a batch_size, you won't do one huge lookup for your loop. Instead it'll fetch 5000 rows, run your loop, fetch the next 5000, loop again, and so on, until there are no more records returned.

The other key thing to note here is that rather than Rails trying to instantiate all of the objects returned from the database at the same time, it will only instantiate those returned in your current batch. This can save a huge memory overhead if you have a giant dataset.
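As a rough, untested sketch of how that could apply to the per-page attribute totals in your loop (the `page` variable and the 5000 batch size are placeholders; the key point is that `find_each` has to be called on an ActiveRecord relation, not on a plain Array):

totals = Hash.new { |hash, key| hash[key] = [0, 0, 0, 0, 0] }

# Fetch the matching rows 5000 at a time instead of loading them all at once,
# and accumulate the a..e sums per session as each batch is processed.
PageEvent.where(:page => page).find_each(:batch_size => 5000) do |event|
  sums = totals[event.session_id]
  sums[0] += event.a
  sums[1] += event.b
  sums[2] += event.c
  sums[3] += event.d
  sums[4] += event.e
end
# totals now maps each session_id to its five summed attributes for this page.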

UPDATE:

Using #map to restrict your results to a single attribute of your model is highly inefficient. You should instead use the Active Record pluck method to pull back just the data you want directly from the DB, rather than manipulating the results in Ruby, like this:

# Instead of this:
pages = PageEvent.order(:page).select(:page).map(&:page).uniq

# Use this:
pages = PageEvent.order(:page).pluck(:page).uniq

I also personally prefer to use .distinct rather than the alias .uniq as I feel it sits more in line with the DB query rather than confusing things with what seems more like an array function:

pages = PageEvent.order(:page).distinct.pluck(:page)
Jon
  • getting `NoMethodError (undefined method 'find_each' for #<Array:0x...>)` – celeriko Jan 28 '14 at 01:57
  • You need to use find_each on an `Arel` object. It'll be the `.map(&:page)` part doing it. Replace that with `.pluck(:page)` instead ... I'll update my answer for you. – Jon Jan 28 '14 at 07:00
  • Didn't work with `pluck`, `uniq`, or `distinct`... not sure why. I figured out another way to get the `session_id`s, and find_each made a huge difference. Thanks! – celeriko Jan 29 '14 at 02:24
  • `find_each` greatly decreased both the memory usage and the running time. Marking as answer, even though the above code (`pluck`, `uniq`, `distinct`) didn't work. – celeriko Jan 29 '14 at 04:38
  • The same `NoMethodError (undefined method 'find_each' for #<Array...>)` – celeriko Jan 29 '14 at 15:10
  • Ah, ok. If you do it without the `pluck` then it should be fine. You basically need to ensure you still have an `Arel` object when you call `find_each` – Jon Jan 29 '14 at 23:47

Use

CSV.open("path/to/file.csv", "wb")

instead of CSV.generate. CSV.open streams the CSV rows into the file as you write them, whereas generate builds one huge string in memory, which will end up exhausting memory if it gets too large.
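As a rough, untested sketch (the file path, batch size, and row contents are placeholders based on the columns in your question):

require 'csv'

# Stream rows straight to disk instead of accumulating one big string in memory.
# The path and the row layout here are illustrative only.
CSV.open("tmp/page_events.csv", "wb") do |csv|
  csv << ["user_id", "session_id", "survey_id"]
  PageEvent.find_each(:batch_size => 1000) do |event|
    # Each row is written out to the file handle as it is built.
    csv << [event.username, event.session_id, event.survey_name]
  end
end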

Kieran Andrews