I have to download many gzipped files stored on S3, with keys like these:
crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00000.warc.gz
crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00001.warc.gz
To download them, you must prepend the prefix https://commoncrawl.s3.amazonaws.com/ to each key.
I need to download and decompress the files, then assemble their contents into a single RDD.
Something similar to this:
JavaRDD<String> text =
sc.textFile("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00000.warc.gz");
I want to do the equivalent of the following code with Spark:
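// Needs: com.amazonaws.services.s3.model.GetObjectRequest and S3Object,
// java.util.zip.GZIPInputStream, java.io.BufferedReader, java.io.InputStreamReader,
// java.util.ArrayList, java.util.List
List<String> sitemaps = new ArrayList<>();  // collects matches across all files
for (String key : keys) {
    S3Object object = s3.getObject(new GetObjectRequest(bucketName, key));
    // try-with-resources closes the reader and the underlying S3 stream
    try (BufferedReader buffered = new BufferedReader(
            new InputStreamReader(new GZIPInputStream(object.getObjectContent())))) {
        String line = buffered.readLine();
        while (line != null) {
            // keep only the "Sitemap:" lines
            if (line.matches("Sitemap:.*")) {
                sitemaps.add(line);
            }
            line = buffered.readLine();
        }
    }
}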
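
To make the per-file step concrete, here is a minimal sketch of the same filtering using only the JDK, reading over HTTPS with the prefix above instead of the S3 client. The class name SitemapLines and the methods toUrl, extract and fetch are hypothetical names chosen for illustration, and UTF-8 decoding is an assumption:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Spark or the AWS SDK.
class SitemapLines {
    // Prefix from the question for turning an S3 key into a download URL.
    static final String PREFIX = "https://commoncrawl.s3.amazonaws.com/";

    static String toUrl(String key) {
        return PREFIX + key;
    }

    // Decompresses a gzipped stream and returns the lines starting with "Sitemap:".
    static List<String> extract(InputStream gzipped) throws IOException {
        List<String> sitemaps = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith("Sitemap:")) {
                    sitemaps.add(line);
                }
            }
        }
        return sitemaps;
    }

    // Downloads one file by key and extracts its Sitemap lines (needs network access).
    static List<String> fetch(String key) throws IOException {
        try (InputStream in = URI.create(toUrl(key)).toURL().openStream()) {
            return extract(in);
        }
    }

    public static void main(String[] args) throws IOException {
        // Offline demo: gzip a small robots.txt-like text in memory, then extract from it.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write("User-agent: *\nSitemap: https://example.com/sitemap.xml\nDisallow: /\n"
                    .getBytes(StandardCharsets.UTF_8));
        }
        System.out.println(extract(new ByteArrayInputStream(bytes.toByteArray())));
    }
}
```

With a helper like this, one possible way to get a single RDD would be to distribute the keys rather than the file contents, for example sc.parallelize(keys).flatMap(key -> SitemapLines.fetch(key).iterator()), assuming Spark 2.x or later, where the Java flatMap function returns an Iterator. I have not verified this against a cluster, so treat it as a starting point.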