I was wondering if someone could help me with a streaming download issue I am having?
Currently I have a large gzipped XML file stored in an S3 bucket that I need to download, decompress, and parse daily using Ruby. Because of the file's size, I opted to download it in chunks and perform the necessary parsing on each chunk, like so:
module S3Stream
  def self.call(credentials)
    Enumerator.new do |yielder|
      connection = Fog::Storage.new(credentials)
      bucket = connection.directories.new(key: 'bucket_name')
      bucket.files.get('file_name.tar.gz') do |chunk, remaining_bytes, total_bytes|
        yielder << chunk
      end
    end
  end
end
Returning the Enumerator object allows me to process each chunk later in the script (i.e. unzip and parse the XML contained within).
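For context, the "unzip each chunk" step can be done incrementally with Ruby's standard `Zlib::Inflate`. The following is only a minimal sketch: `gunzip_stream` is a hypothetical helper name, and it assumes the chunks together form a single gzip stream.

```ruby
require 'zlib'

# Hypothetical helper: wraps any enumerable of gzip-compressed byte chunks
# and yields decompressed strings as they become available.
def gunzip_stream(chunks)
  Enumerator.new do |yielder|
    # MAX_WBITS + 32 asks zlib to auto-detect a gzip or zlib header.
    inflater = Zlib::Inflate.new(Zlib::MAX_WBITS + 32)
    begin
      chunks.each do |chunk|
        data = inflater.inflate(chunk)
        yielder << data unless data.empty?
      end
      # Flush whatever is still buffered once the input is exhausted.
      rest = inflater.finish
      yielder << rest unless rest.empty?
    ensure
      inflater.close
    end
  end
end
```

Something like `gunzip_stream(S3Stream.call(credentials)).each { |xml_part| ... }` would then hand decompressed pieces to the XML parser without holding the whole file in memory.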
This all needs to run on a schedule, so I have a detached Heroku dyno running the Ruby script for me (my Rails site is deployed to Heroku as well). It runs fine locally, but after about an hour on Heroku (sometimes less, depending on the size of the dyno) the script fails with this error message:
Errno::ECONNRESET: Connection reset by peer
My issue is that when this fails in the middle of my stream, I am not sure how to retry the download starting at the chunk where the connection got reset.
Fog's stream downloading documentation is sparse, and switching to the AWS SDK for Ruby seems to give even less information to the block (i.e. no remaining or total bytes parameters).
Any ideas? Thanks in advance!
UPDATE 1
My most recent attempt involved keeping track of the byte location of the last chunk worked with (i.e. total_bytes - remaining_bytes) before the connection got reset. Fog allows you to set a custom header in the options parameter of the get method:
bucket.files.get('file_name.tar.gz', header: {'Range' => "bytes=#{@last_byte}-"}) do |chunk, remaining_bytes, total_bytes|
  yielder << chunk
end
So when the stream got reset, I would just try to pick up where I left off in a new stream by using the HTTP Range header.
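To make the idea concrete, here is a rough sketch of the resume loop, independent of any particular S3 library. `resumable_chunks` and its `fetch` argument are hypothetical: `fetch` stands for any callable that takes a byte offset and yields chunks starting there (for example, a ranged GET). It assumes the server honors the Range header and that the object does not change between attempts.

```ruby
# Hypothetical helper: `fetch` is any callable that takes a byte offset and
# yields the object's chunks starting at that offset.
def resumable_chunks(fetch, max_retries: 5)
  Enumerator.new do |yielder|
    offset = 0    # bytes already handed to the consumer
    retries = 0   # consecutive failures without progress
    begin
      fetch.call(offset) do |chunk|
        yielder << chunk
        offset += chunk.bytesize
        retries = 0   # progress was made, so reset the retry budget
      end
    rescue Errno::ECONNRESET
      retries += 1
      raise if retries > max_retries
      retry   # re-runs the begin block with the updated offset
    end
  end
end
```

Because the offset counts compressed bytes, a decompressor that is kept alive across retries should see one uninterrupted byte stream.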
However, Fog seemed to ignore this header, and after some digging I found that it is not an accepted HTTP header attribute in the fog-aws gem.
Next, I will attempt to perform these same steps with the aws-sdk gem, which looks like it supports the HTTP Range header.
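As a starting point, a ranged streaming request with the aws-sdk might look roughly like the sketch below. `stream_from` is a hypothetical wrapper; it assumes `client` behaves like `Aws::S3::Client` from the aws-sdk-s3 gem, whose `get_object` accepts a `range:` parameter and yields chunks when given a block.

```ruby
# Hypothetical helper: streams an S3 object from `offset` to the end.
# `client` is expected to behave like Aws::S3::Client (aws-sdk-s3 gem).
def stream_from(client, bucket:, key:, offset: 0, &block)
  # "bytes=N-" requests everything from byte N through the end of the object.
  client.get_object(bucket: bucket, key: key, range: "bytes=#{offset}-", &block)
end

# Usage (needs the aws-sdk-s3 gem and valid credentials; the region and
# offset here are placeholders):
#   require 'aws-sdk-s3'
#   client = Aws::S3::Client.new(region: 'us-east-1')
#   stream_from(client, bucket: 'bucket_name', key: 'file_name.tar.gz', offset: 1024) do |chunk|
#     # handle chunk
#   end
```

The block here only receives the chunk, so the running offset would have to be tracked by adding up `chunk.bytesize` rather than relying on remaining/total byte counts.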