Questions tagged [common-crawl]

Open crawl of the web that can be accessed and analyzed by everyone.

Common Crawl is a non-profit organization that builds and maintains an open crawl of the web so that anyone can access and analyze it. The organization states that its crawler respects nofollow and robots.txt policies.

Common Crawl makes available roughly 100 TB of web archive data covering about 6 billion web pages crawled between 2008 and 2012. The crawl data is stored in an Amazon public-datasets S3 bucket and is freely downloadable. Common Crawl also publishes open-source libraries for processing the data with Hadoop, as well as its crawler.

Web site: http://commoncrawl.org/

71 questions
2 votes, 1 answer

Common Crawl Keyword Lookup

I want to find a list of all the websites that contain a specific keyword. For example, if I search for the keyword "Sports" or "Football", only the related website URLs, titles, descriptions and images need to be extracted from the Common Crawl WARC…
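
The Common Crawl index does not support full-text keyword search, so one common approach is to scan downloaded WARC files record by record. A minimal sketch of that idea, assuming the warcio package and an already-downloaded file named example.warc.gz (both are placeholders, not taken from the question):

    from warcio.archiveiterator import ArchiveIterator

    KEYWORD = b"football"            # example keyword
    path = "example.warc.gz"         # placeholder for a downloaded WARC file

    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            # Only HTTP responses carry the page HTML
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read()
            if KEYWORD in body.lower():
                print(url)   # title/description/image would be parsed out of `body`
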
2 votes, 2 answers

Get offset and length of a subset of a WAT archive from Common Crawl index server

I would like to download a subset of a WAT archive segment from Amazon S3. Background: Searching the Common Crawl index at http://index.commoncrawl.org yields results with information about the location of WARC files on AWS S3. For example,…
jmtroos
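
For reference, each index hit carries the archive filename plus the byte offset and length of the capture, which map directly onto an HTTP Range request. A hedged sketch of that round trip; the crawl id CC-MAIN-2018-34 and the URL example.com are placeholders:

    import gzip
    import io
    import json
    import requests

    # Ask one crawl's index about a URL (crawl id and URL are placeholders)
    resp = requests.get(
        "https://index.commoncrawl.org/CC-MAIN-2018-34-index",
        params={"url": "example.com/", "output": "json"},
    )
    hit = json.loads(resp.text.splitlines()[0])       # first capture

    # filename/offset/length locate the record inside the archive
    offset, length = int(hit["offset"]), int(hit["length"])
    byte_range = "bytes={}-{}".format(offset, offset + length - 1)
    data = requests.get(
        "https://data.commoncrawl.org/" + hit["filename"],
        headers={"Range": byte_range},
    ).content

    # Each record is an independently gzipped member, so it unpacks on its own
    print(gzip.GzipFile(fileobj=io.BytesIO(data)).read()[:200])
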
2 votes, 0 answers

Fetch Common Crawl data using Apache Nutch

I found my data on the Common Crawl website and downloaded it from there; now I have to fetch that data using Apache Nutch but don't know how. The file is in the WARC file format.
Sahil Rohila
2 votes, 0 answers

S3 "read operation timed out" while reading Common Crawl data

In order to read a few files from Common Crawl I have written this script: import warc import boto for line in sys.stdin: line = line.strip() # Connect to AWS and read a dataset conn = boto.connect_s3(anon=True,…
Hafiz Muhammad Shafiq
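
Timeouts like this are often just a slow connection to the public bucket. A sketch of anonymous access with longer timeouts and automatic retries, using boto3 rather than the older boto shown in the question; the key below is one of the crawl listing files and stands in for any object in the bucket:

    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # Anonymous S3 client with generous timeouts and automatic retries
    s3 = boto3.client(
        "s3",
        config=Config(
            signature_version=UNSIGNED,
            connect_timeout=60,
            read_timeout=300,
            retries={"max_attempts": 5},
        ),
    )

    key = "crawl-data/CC-MAIN-2018-34/wet.paths.gz"   # example object in the commoncrawl bucket
    s3.download_file("commoncrawl", key, "wet.paths.gz")
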
2 votes, 1 answer

Hadoop process WARC files

I have a general question about Hadoop file splitting and multiple mappers. I am new to Hadoop and am trying to get a handle on how to set up for optimal performance. My project is currently processing WARC files which are gzipped. Using the current…
2 votes, 1 answer

Giving Common Crawl location as input to Amazon EMR using mrjob (Python)

It has been only days since I started using mrjob and I have tried a few low- and medium-level tasks. Now I am stuck at giving the Common Crawl [from now on referred to as CC] location as input to EMR using Python mrjob. My config file looks like this…
The6thSense
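
For context, mrjob accepts S3 paths directly on the command line, so a CC listing file can be the job input without special configuration. A toy sketch under that assumption, in which each mapper line is one WET file path:

    from mrjob.job import MRJob

    class CountWetPaths(MRJob):
        """Toy job: count WET paths per crawl segment, just to show the S3 input wiring."""

        def mapper(self, _, line):
            # lines look like crawl-data/CC-MAIN-.../segments/<segment>/wet/...warc.wet.gz
            parts = line.strip().split("/")
            segment = parts[3] if len(parts) > 3 else "unknown"
            yield segment, 1

        def reducer(self, segment, counts):
            yield segment, sum(counts)

    if __name__ == "__main__":
        CountWetPaths.run()

Run with, for example: python count_wet_paths.py -r emr s3://commoncrawl/crawl-data/CC-MAIN-2018-34/wet.paths.gz (Hadoop decompresses the gzipped input transparently).
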
2 votes, 1 answer

Download the complete Common Crawl index file

The Common Crawl index file used in the project below (https://github.com/trivio/common_crawl_index/blob/master/bin/remote_copy), mmap = BotoMap(s3_anon, src_bucket, '/common-crawl/projects/url-index/url-index.1356128792'), is a partial one. I want the…
Vanaja Jayaraman
1 vote, 1 answer

Python's zlib doesn't work on CommonCrawl file

I was trying to unzip a file using Python's zlib and it doesn't seem to work. The file is 100 MB from Common Crawl and I downloaded it as wet.gz. When I unzip it on the terminal with gunzip, everything works fine, and here are the first few lines of…
157 239n
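
A likely explanation, for reference: Common Crawl's .wet.gz/.warc.gz files are gzip streams (often with many concatenated members), and a bare zlib.decompress() call does not expect the gzip header. A short sketch of two approaches that do handle it; the file name wet.gz is taken from the question:

    import gzip
    import zlib

    # Option 1: the gzip module handles the header and multiple members transparently
    with gzip.open("wet.gz", "rb") as f:
        print(f.read(200))

    # Option 2: zlib works if told to expect a gzip wrapper (wbits = 16 + MAX_WBITS);
    # this decodes up to the end of the first gzip member
    with open("wet.gz", "rb") as f:
        raw = f.read()
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    print(d.decompress(raw)[:200])
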
1 vote, 2 answers

How to get webpage text from Common Crawl?

Using Common Crawl, is there a way I can download raw text from all pages of a particular domain (e.g., wisc.edu)? I am only interested in text for NLP purposes such as topic modeling.
SanMelkote
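
For reference, the index server can list every capture for a domain, and the corresponding WET files hold the extracted plain text. A hedged sketch of the listing step; the crawl id is a placeholder, and matchType=domain also picks up subdomains:

    import json
    import requests

    resp = requests.get(
        "https://index.commoncrawl.org/CC-MAIN-2018-34-index",
        params={"url": "wisc.edu", "matchType": "domain",
                "output": "json", "limit": "20"},
    )
    for line in resp.text.splitlines():
        hit = json.loads(line)
        print(hit["url"], hit["filename"], hit["offset"], hit["length"])

Each hit can then be range-fetched as in the sketch further up, or the matching WET file downloaded for the pre-extracted text.
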
1 vote, 1 answer

Why do my Apache Nutch warc and commoncrawldump commands fail after a crawl?

I have successfully crawled a website using Nutch and now I want to create a WARC from the results. However, running both the warc and commoncrawldump commands fails. Also, running bin/nutch dump -segment .... works successfully on the same segment…
cc100
1 vote, 2 answers

How to crawl the web for a specific language

I am trying to collect all available text information (as much as possible) from web pages for the Uzbek language (for my research). What is the best way to do it? I found Common Crawl, but I'm not sure whether it's easy to extract text in a specific language.
1 vote, 1 answer

Is it possible to get titles from the web version of the Common Crawl API?

I am trying to get URLs, titles and languages from web pages. Fortunately there exists the CC API: https://github.com/webrecorder/pywb/wiki/CDX-Server-API#api-reference. But sadly I did not find a way to also get the titles. At the moment I query CC…
Mazzespazze
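
For reference, the CDX API does not expose page titles; one workaround is to range-fetch the capture that each hit points to and pull the <title> out of the HTML. A hedged sketch of such a helper, assuming warcio is available and `hit` is one JSON record returned by the API:

    import io
    import re
    import requests
    from warcio.archiveiterator import ArchiveIterator

    def title_for_capture(hit):
        """hit: one CDX API result with filename/offset/length fields."""
        offset, length = int(hit["offset"]), int(hit["length"])
        rng = "bytes={}-{}".format(offset, offset + length - 1)
        data = requests.get("https://data.commoncrawl.org/" + hit["filename"],
                            headers={"Range": rng}).content
        # The byte range is a complete gzipped WARC record, so it can be iterated directly
        for record in ArchiveIterator(io.BytesIO(data)):
            if record.rec_type == "response":
                html = record.content_stream().read().decode("utf-8", errors="replace")
                m = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
                return m.group(1).strip() if m else None
        return None
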
1 vote, 1 answer

mrjob step is failing. How do I debug it?

I am trying to run a sample mrjob on an EMR cluster. I created the EMR cluster manually in the AWS dashboard and started mrjob as follows: python keywords.py -r emr s3://commoncrawl/crawl-data/CC-MAIN-2018-34/wet.paths.gz --cluster-id j-22GFG1FUGS12L. The job is…
Javith
1 vote, 1 answer

Processing many WARC archives from CommonCrawl using Hadoop Streaming and MapReduce

I am working on a project in which I need to download crawl data (from Common Crawl) for specific URLs from an S3 bucket and then process that data. Currently I have a MapReduce job (Python via Hadoop Streaming) which gets the correct S3 file…
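
For context, the usual Hadoop Streaming pattern here is to feed a warc.paths listing as the job input, so that each mapper line is one S3 key which the mapper downloads and iterates itself (gzipped WARCs are not splittable across mappers). A rough sketch of such a mapper; anonymous boto3 access and warcio are assumptions, not necessarily what the project in the question uses:

    #!/usr/bin/env python
    # Hadoop Streaming mapper: stdin carries WARC keys, stdout emits (url, 1) pairs
    import sys
    import tempfile

    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config
    from warcio.archiveiterator import ArchiveIterator

    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

    for line in sys.stdin:
        key = line.strip()
        if not key:
            continue
        # Pull the whole archive to local disk first, then walk its records
        with tempfile.NamedTemporaryFile(suffix=".warc.gz") as tmp:
            s3.download_fileobj("commoncrawl", key, tmp)
            tmp.seek(0)
            for record in ArchiveIterator(tmp):
                if record.rec_type == "response":
                    url = record.rec_headers.get_header("WARC-Target-URI")
                    print("{}\t1".format(url))
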
1 vote, 1 answer

How do I use Common Crawl to search the web for a certain keyword query?

Common Crawl is a non-profit, third-party web archive: http://commoncrawl.org. I see there is an API to search Common Crawl for a given domain. How can I search Common Crawl for a given search term?