We are using Amazon EMR with Common Crawl data to perform crawling. EMR writes its output to Amazon S3 in a binary-like format. We'd like to copy that output to a local machine as plain text.
How can we achieve that? What's the best way?
Normally we could use Hadoop's copyToLocal, but we can't access Hadoop directly, and the data is on S3.