
I have this block of Ruby code. I need to read a big json.gz file that cannot be loaded into RAM at once, so slurping the whole file with a single read is not an option. Instead I open it with Zlib::GzipReader and then read it lazily with batch loading. Everything works, except that for some reason not all of the data from the JSON reaches the block: only 5,500,125 rows are processed, while the file has roughly 6,600,000 rows. If I use File.open('authors.jsonl.gz') instead of Zlib, then all rows are processed, but they are not unzipped.

I've spent almost all day looking through the documentation and haven't found anything :( I also tried unzipping each row as it is processed, but all of those attempts failed as well. Is there a way to unzip the file and then read all of its content in chunks (not just part of it), or at least to read it line by line and unzip each line on its own?
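To be clear, by "read it in chunks" I mean something along these lines (just a sketch; the 16 KB chunk size is an arbitrary value I picked for illustration):

require 'zlib'

Zlib::GzipReader.open('authors.jsonl.gz') do |gz|
  # Read the decompressed stream in fixed-size chunks instead of line by line.
  while (chunk = gz.read(16 * 1024))
    # chunk holds up to 16 KB of decompressed bytes; a chunk may end in the
    # middle of a JSON line, so lines would still have to be reassembled
    # across chunk boundaries before calling JSON.parse.
  end
end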

Thank you guys :)

require 'zlib'
require 'json'

Zlib::GzipReader.open('authors.jsonl.gz') do |file|
  # Iterate over the decompressed lines lazily, batch_size lines at a time,
  # so the whole file never has to be held in RAM.
  file.lazy.each_slice(batch_size) do |lines|
    lines.each do |line|
      # Drop literal "\u0000" escape sequences before parsing.
      parsed_line = JSON.parse(line.gsub('\u0000', ''))

      array_of_authors << { id: parsed_line['id'],
                            name: parsed_line['name'],
                            username: parsed_line['username'],
                            description: parsed_line['description'],
                            followers_count: parsed_line.dig('public_metrics', 'followers_count'),
                            following_count: parsed_line.dig('public_metrics', 'following_count'),
                            tweet_count: parsed_line.dig('public_metrics', 'tweet_count'),
                            listed_count: parsed_line.dig('public_metrics', 'listed_count') }
    end
  end
end
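For comparison, this is the plain-File variant mentioned above. It does reach every one of the ~6,600,000 rows, but the lines are still gzip-compressed bytes, so JSON.parse cannot handle them:

File.open('authors.jsonl.gz') do |file|
  file.lazy.each_slice(batch_size) do |lines|
    # All rows are iterated here, but each line is raw compressed
    # data rather than JSON text.
  end
end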
