3

I've seen a couple of posts about this with no real answers or with out-of-date answers, so I'm wondering if there are any new solutions. I have an enormous CSV I need to read in. I can't call `open()` on it because it kills my server; I have no choice but to use `.foreach()`.

Doing it this way, my script will take 6 days to run. I want to see if I can cut that down by using threads and splitting the task into two or four parts, so that one thread reads lines 1 through n while another thread simultaneously reads lines n+1 through the end.

So I need to be able to read in only the second half of the file in one thread (and later, if I split it into more threads, just a specific line through a specific line).

Is there any way in Ruby to do this? Can `CSV.foreach` start at a certain row?

CSV.foreach(FULL_FACT_SHEET_CSV_PATH) do |trial|

EDIT: Just to give an idea of what one of my threads looks like:

threads << Thread.new {
  CSV.open('matches_thread3.csv', 'wb') do |output_csv|
    output_csv << HEADER
    count = 1
    index = 0

    CSV.foreach(CSV_PATH) do |trial|
      index += 1
      if index > 120000
        break if index > 180000
        # do stuff
      end
    end
  end
}

But as you can see, it has to iterate through the file until it gets to record 120,000 before it starts. So the goal would be to eliminate reading all of the rows before row 120,000 by starting the read at row 120,000.
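To make the intended split concrete, here is a minimal sketch of the two-thread version I have in mind (the row ranges are placeholders, and `CSV_PATH`/`HEADER` are the same constants as above). Note that each thread still has to read and throw away every row before its window, which is exactly the waste I'm trying to avoid:

require 'csv'

# hypothetical per-thread row windows (1-based, inclusive)
RANGES = [[1, 120_000], [120_001, 240_000]]

threads = RANGES.each_with_index.map do |(first, last), i|
  Thread.new do
    CSV.open("matches_thread#{i}.csv", 'wb') do |output_csv|
      output_csv << HEADER
      index = 0
      CSV.foreach(CSV_PATH) do |trial|
        index += 1
        next  if index < first   # still has to read and discard all earlier rows
        break if index > last
        # do stuff
      end
    end
  end
end
threads.each(&:join)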

bjacobs
  • Six days is a really long time. Even if you parallelized the task into 6 processes, it would still take at least a day. I think you should really find a way to improve the speed; if you give us some more details about how each line is processed, we could help. – coorasse Jul 14 '17 at 14:36
  • `CSV.foreach` is a fast method by Ruby standards, so you might not get much improvement just by optimizing it without throwing a lot more CPU at it. I'd suggest you try to parse this with Python or some other language instead. – Nerve Jul 14 '17 at 14:40
  • Check this https://stackoverflow.com/questions/10650444/import-records-from-csv-in-small-chunks-ruby-on-rails – ramongr Jul 14 '17 at 15:38
  • Just *reading* takes 6 days? Or reading and processing line-by-line? – Mark Thomas Jul 14 '17 at 20:05
  • @Nerve Unless there is a flaw in CSV, not much will be gained by switching the language from Ruby to Python. – Mark Thomas Jul 14 '17 at 20:10
  • @MarkThomas The whole process now takes 3 days (since I removed some of the search terms). It looks like it's reading each line at a rate of maybe 2.5 seconds per line, but then I have logic in there that traverses another CSV, with more loops inside that. It's a matching algorithm that produces a dump to eventually upload to a db. Normally I'd use a db and something like Solr for this, but it's not my project; I just have this one task. – bjacobs Jul 14 '17 at 20:40
  • Would your workflow allow you to manually make a copy of the file without the first 120,000 rows, and then just run your Ruby script on that copy? The [XSV](https://github.com/BurntSushi/xsv) tool for CSV analysis can select the rows you want quickly, or you could use the more generic commands `wc -l` and `tail`. – Rory O'Kane Jul 14 '17 at 21:36
  • Sounds like you have an N+1-style problem. Does the second CSV fit in memory? Otherwise maybe you should be spawning background jobs. – Mark Thomas Jul 14 '17 at 23:23
  • @MarkThomas No, the second CSV is read in fully; it's a smaller file. There's no issue with my script because it runs fine against a db. It's really just reading the CSV that's slow, and I was hoping to break it into threads and be able to start a foreach at a particular line for each thread. But it seems like that is not possible. – bjacobs Jul 18 '17 at 19:17
  • @RoryO'Kane I can split the file into multiple CSVs as an option, I suppose. I was hoping for a foreach option. – bjacobs Jul 18 '17 at 19:20

3 Answers

6

If still relevant, you can do something like this using `.with_index` after `foreach`:

rows_array = []

# columns is assumed to be an array of the column indexes you want to keep
CSV.foreach(path).with_index do |row, i|
  next if i == 0 # skip the header row
  rows_array << columns.map { |n| row[n] }
end
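Applied to the ranges from the question (a sketch, assuming the 120,000–180,000 window from the edit), the same idea lets a thread skip ahead and stop early, although the skipped rows are still read and parsed:

CSV.foreach(CSV_PATH).with_index(1) do |row, i|
  next  if i <= 120_000   # rows before the window are still parsed, just not processed
  break if i >  180_000
  # do stuff with row
end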
4

> But as you can see, it has to iterate the file until it gets to record 120,000 before it starts. So the goal would be to eliminate reading all of the rows before row 120,000 by starting to read at row 120,000.

Impossible. The content of a CSV file is just a blob of text with some commas and newlines. You can't know at which offset in the file row N starts without knowing where row N-1 ends. And to know that, you have to know where row N-1 starts (see the recursion?) and read the file until you see where it ends (i.e., until you encounter a newline that is not part of a field value).

The exception is if all your rows are of a fixed size, in which case you can seek directly to offset 120_000 * row_size. I have yet to see a file like that, though.
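A minimal sketch of that fixed-width case, assuming a hypothetical `ROW_SIZE` in bytes per row (newline included) and the `CSV_PATH` from the question:

require 'csv'

ROW_SIZE  = 200       # hypothetical: every row occupies exactly 200 bytes, newline included
START_ROW = 120_000

File.open(CSV_PATH, 'r') do |file|
  file.seek(START_ROW * ROW_SIZE)   # jump straight past the first 120,000 rows
  CSV.new(file).each do |row|
    # do stuff, starting at row 120,001
  end
end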

Sergio Tulentsev
0

As per my understanding of your question, this Ruby approach may help you.

require 'csv'

csv_file = "matches_thread3.csv"

# define one constant chunk size for the jobs
CHUNK_SIZE = 120000

# split      - splitting on "\n" will generate an array of CSV records
# drop(1)    - skips the header row
# each_slice - will create arrays of CHUNK_SIZE records each
File.read(csv_file).split("\n").drop(1).each_slice(CHUNK_SIZE).with_index do |chunk, index|
  data = []

  # each chunk will work as a separate job of 120000 records
  chunk.each do |row|
    data << row
    # do stuff
  end
end
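If the goal is still to fan the chunks out to threads (as in the question), each chunk could be handed to its own thread; here is a sketch, assuming per-chunk output files and that the per-row work is thread-safe. Note that `File.read` still loads the entire file into memory, which the question says is not workable for this file size.

threads = File.read(csv_file).split("\n").drop(1)
              .each_slice(CHUNK_SIZE).with_index.map do |chunk, index|
  Thread.new do
    CSV.open("matches_thread#{index}.csv", 'wb') do |output_csv|
      chunk.each do |line|
        row = CSV.parse_line(line)   # parse one raw CSV line into an array of fields
        output_csv << row
        # do stuff
      end
    end
  end
end
threads.each(&:join)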