
I have to download many gzipped files stored on S3 like this:

crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00000.warc.gz
crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00001.warc.gz

To download them you must add the prefix https://commoncrawl.s3.amazonaws.com/

I have to download and decompress the files, then assemble the content into a single RDD.

Something similar to this:

JavaRDD<String> text = 
    sc.textFile("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00000.warc.gz");

I want to do the equivalent of this code with Spark:

    for (String key : keys) {
        // Download the object from S3 and gunzip it on the fly
        S3Object object = s3.getObject(new GetObjectRequest(bucketName, key));

        GZIPInputStream gzipStream = new GZIPInputStream(object.getObjectContent());
        InputStreamReader decoder = new InputStreamReader(gzipStream);
        BufferedReader buffered = new BufferedReader(decoder);

        List<String> sitemaps = new ArrayList<>();

        // Keep only the "Sitemap:" lines of each robots.txt record
        String line = buffered.readLine();
        while (line != null) {
            if (line.matches("Sitemap:.*")) {
                sitemaps.add(line);
            }
            line = buffered.readLine();
        }
    }
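
In other words, I imagine something along these lines (a rough, untested sketch with the Spark 2.x Java API and the AWS SDK v1 classes from the code above; I assume the client has to be created inside the lambda because AmazonS3 is not serializable, and that us-east-1 is the right region for the commoncrawl bucket):

    JavaRDD<String> sitemaps = sc.parallelize(keys).flatMap(key -> {
        // Build the client on the executor; AmazonS3 is not serializable
        AmazonS3 s3 = AmazonS3ClientBuilder.standard().withRegion("us-east-1").build();
        S3Object object = s3.getObject(new GetObjectRequest(bucketName, key));

        // Gunzip on the fly and keep only the "Sitemap:" lines
        BufferedReader buffered = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(object.getObjectContent())));

        List<String> result = new ArrayList<>();
        String line;
        while ((line = buffered.readLine()) != null) {
            if (line.startsWith("Sitemap:")) {
                result.add(line);
            }
        }
        buffered.close();
        return result.iterator();
    });
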
fra96
  • There is already a tool which extracts all sitemaps from Common Crawl robots.txt archives: https://github.com/commoncrawl/cc-mrjob/blob/master/sitemaps_from_robotstxt.py It's Python and based on [mrjob](https://mrjob.readthedocs.io/en/latest/), but it would be easy to port it to Spark, cf. [cc-pyspark](https://github.com/commoncrawl/cc-pyspark). – Sebastian Nagel Nov 12 '18 at 16:01

1 Answer


To read something from S3, you can do this:

sc.textFile("s3n://path/to/dir")

If dir contains your gzip files, they will be gunzipped and combined into one RDD. If your files are not directly at the root of the directory like this:

/root
  /a
    f1.gz
    f2.gz
  /b
    f3.gz

or even this:

/root
  f3.gz
  /a
    f1.gz
    f2.gz

then you should use a wildcard like this: sc.textFile("s3n://path/to/dir/*") and Spark will find the files in dir and its subdirectories.
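
For example, applied to the question's data it could look like this (a sketch; I'm assuming the commoncrawl bucket is reachable through the s3n connector configured in your Hadoop/Spark setup):

    // Read every gzipped robots.txt archive under the segment's robotstxt/ prefix
    JavaRDD<String> text = sc.textFile(
            "s3n://commoncrawl/crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/*.gz");

    // Same filtering as the loop in the question
    JavaRDD<String> sitemaps = text.filter(line -> line.startsWith("Sitemap:"));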

Beware of this though. The wildcard will work, but you may run into latency issues on S3 in production and may prefer to use the AmazonS3Client to retrieve the paths yourself.
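
A sketch of that second approach (assuming the AWS SDK v1 client and the us-east-1 region; textFile accepts a comma-separated list of paths, so the listed keys still end up in a single RDD):

    // List the keys on the driver with the S3 client...
    AmazonS3 s3 = AmazonS3ClientBuilder.standard().withRegion("us-east-1").build();
    ObjectListing listing = s3.listObjects("commoncrawl",
            "crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/");

    List<String> paths = new ArrayList<>();
    for (S3ObjectSummary summary : listing.getObjectSummaries()) {
        paths.add("s3n://commoncrawl/" + summary.getKey());
    }

    // ...then hand Spark one comma-separated list of paths, yielding a single RDD
    JavaRDD<String> text = sc.textFile(String.join(",", paths));

Keep in mind that listObjects returns at most 1000 keys per call, so for a larger prefix you would page through the results with listNextBatchOfObjects.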

Oli
  • What do you mean not yours? – Oli Nov 08 '18 at 12:48
  • I found this error: `The request signature we calculated does not match the signature you provided. Check your key and signing method` – fra96 Nov 08 '18 at 15:33
  • If I use s3n the error is: `Relative path in absolute URI: S3ObjectSummary` – fra96 Nov 08 '18 at 15:51
  • Have you tried something like this? `sc.textFile("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/*")` – Oli Nov 08 '18 at 18:19