I'm trying to find interesting data inside the Web Data Commons dumps. It is taking days to grep across them on my machine (even in parallel). Is there an index of which websites are covered, and a way to extract data specifically for those sites?
To get all of the pages from a particular domain, one option is to query the Common Crawl index API.
For example, to list all of the pages Common Crawl has for wikipedia.org:
http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org*/&showNumPages=true
This tells you how many pages of results Common Crawl has for this domain (note that you can use wildcards, as in this example).
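A minimal sketch of that page-count query in Python, using the requests library. The response body is a small JSON object; the exact field names it contains (e.g. "pages") are an assumption on my part, so check the raw output:

    import requests

    INDEX = "http://index.commoncrawl.org/CC-MAIN-2015-11-index"

    # Ask the index how many pages of results exist for this domain pattern.
    resp = requests.get(INDEX, params={
        "url": "*.wikipedia.org/*",   # wildcards are allowed, as in the URL above
        "showNumPages": "true",
    })
    resp.raise_for_status()

    # The body is a small JSON object, e.g. {"pages": ..., "pageSize": ..., "blocks": ...}
    print(resp.text)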
Then request each page in turn and have Common Crawl return a JSON record for each capture:
http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=en.wikipedia.org/*&page=0&output=json
You can then parse the JSON and find the WARC file that holds each capture through the field: filename (a code sketch follows below).
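Here is a minimal sketch of fetching one page of index results, parsing the JSON lines, and pulling down the matching WARC data. The answer above only guarantees the "filename" field; the "offset"/"length" fields and the data.commoncrawl.org download host are assumptions I am adding, so verify them against the actual records:

    import json
    import requests

    INDEX = "http://index.commoncrawl.org/CC-MAIN-2015-11-index"
    DATA_HOST = "https://data.commoncrawl.org/"   # assumed download host for WARC files

    # Fetch one page of index results as JSON.
    resp = requests.get(INDEX, params={
        "url": "en.wikipedia.org/*",
        "page": 0,
        "output": "json",
    })
    resp.raise_for_status()

    # Each non-empty line of the response body is a separate JSON record.
    records = [json.loads(line) for line in resp.text.splitlines() if line]
    print(len(records), "records on this page")

    rec = records[0]
    print(rec["filename"])   # path of the WARC file that holds this capture

    # If the record also carries offset/length, a Range request fetches just
    # that one gzipped WARC record instead of the whole multi-GB file.
    if "offset" in rec and "length" in rec:
        start = int(rec["offset"])
        end = start + int(rec["length"]) - 1
        warc_chunk = requests.get(
            DATA_HOST + rec["filename"],
            headers={"Range": f"bytes={start}-{end}"},
        )
        print(len(warc_chunk.content), "bytes of gzipped WARC data")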
This link will help you.
