
I'm trying to stream crawl data from Common Crawl, but Amazon S3 returns an error when I pass the stream=True parameter to requests.get. Here is an example:

import requests

resp = requests.get(url, stream=True)  # url is the S3 HTTP url given below
print(resp.raw.read())

When I run this on a Common Crawl s3 http url, I get the response:

b'<?xml version="1.0" encoding="UTF-8"?>\n<Error><Code>NoSuchKey</Code>
<Message>The specified key does not exist.</Message><Key>crawl-data/CC-
MAIN-2018-05/segments/1516084886237.6/warc/CC-
MAIN-20180116070444-20180116090444-00000.warc.gz\n</Key>
<RequestId>3652F4DCFAE0F641</RequestId><HostId>Do0NlzMr6
/wWKclt2G6qrGCmD5gZzdj5/GNTSGpHrAAu5+SIQeY15WC3VC6p/7/1g2q+t+7vllw=
</HostId></Error>'

I am using warcio and need a streaming file object as input to the archive iterator, and I can't download the file all at once because of limited memory. What should I do?

PS. The url I request in the example is https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-05/segments/1516084886237.6/warc/CC-MAIN-20180116070444-20180116090444-00000.warc.gz

Superman
  • you can't stream s3 objects like this, you have to access them using the `boto3` module, also since aws is by default a deny first system it says this key doesn't exist in order to prevent information leak – eagle Feb 25 '18 at 21:38
  • @eagle Doesn't boto3 require the Amazon keys? – Superman Feb 25 '18 at 21:39
  • 1
    Thats not exactly true, this is a public dataset. https://aws.amazon.com/public-datasets/common-crawl/ – avigil Feb 25 '18 at 21:40
  • it requires the account keys, yes – eagle Feb 25 '18 at 21:40
  • it doesn't matter if the dataset is public, you need to use the `boto3` library to access this data, there's no other way around this, and it's implemented as such for a reason – eagle Feb 25 '18 at 21:41
  • from http://commoncrawl.org/the-data/get-started/: "The Common Crawl dataset lives on Amazon S3 as part of the Amazon Public Datasets program. From Public Data Sets, you can download the files entirely free using HTTP or S3." – avigil Feb 25 '18 at 21:43
  • How come I can access the data without streaming over HTTP perfectly fine, then? Wouldn't S3 give the same error regardless of streaming? – Superman Feb 25 '18 at 21:43
  • 1
    you've ommitted a digit in your url! `CC-MAIN-2018-0` should be `CC-MAIN-2018-05` – avigil Feb 25 '18 at 21:57
  • @eagle, your information is not correct. Public content in S3 can be accessed with any HTTP user agent. The entire S3 REST API is open and documented. – Michael - sqlbot Feb 26 '18 at 02:38

1 Answer


There is an error in your url. Compare the key in the response you are getting:

<Key>crawl-data/CC-
MAIN-2018-05/segments/1516084886237.6/warc/CC-
MAIN-20180116070444-20180116090444-00000.warc.gz\n</Key>

to the one in the intended url:

https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-05/segments/1516084886237.6/warc/CC-MAIN-20180116070444-20180116090444-00000.warc.gz

For some reason you are adding unnecessary whitespace to the key, probably picked up while reading the paths file (readline() keeps the trailing '\n' character on every line). Try calling .strip() on each line to remove the trailing newline before building the URL.
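A minimal sketch of the suspected bug and the fix, assuming the keys come from a warc.paths file read line by line (variable names are illustrative):

```python
# A key read with readline()/readlines() keeps its trailing "\n",
# which then ends up inside the requested URL -> S3 reports NoSuchKey.
base = "https://commoncrawl.s3.amazonaws.com/"
key = ("crawl-data/CC-MAIN-2018-05/segments/1516084886237.6/warc/"
       "CC-MAIN-20180116070444-20180116090444-00000.warc.gz\n")  # as read from file

bad_url = base + key            # still ends with "\n" -> NoSuchKey
good_url = base + key.strip()   # trailing newline removed

print(repr(bad_url[-12:]))              # note the '\n' at the end
print(good_url.endswith(".warc.gz"))    # True
```

With the stripped URL, requests.get(url, stream=True) should return the gzip stream as before, and warcio's ArchiveIterator can consume it via resp.raw.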

avigil
  • I mistyped on the website, I checked the logs and I didn't have the error in the program – Superman Feb 25 '18 at 22:00
  • @Superman the url you think you are streaming is not the one actually being requested. The key in the response does not match the working url. Updated answer – avigil Feb 25 '18 at 22:09
  • The whitespace in the post is just for formatting; the response comes without whitespace in my code. – Superman Feb 25 '18 at 22:13
  • Check the response. There is definitely a `\n` in your requested key. – avigil Feb 25 '18 at 22:13
  • @avigil You were right: readlines() called on the paths file leaves a trailing newline that messed up the request code. I appreciate the help! – Superman Feb 25 '18 at 22:17