Questions tagged [common-crawl]

Open crawl of the web that can be accessed and analyzed by everyone.

Common Crawl is a non-profit organization that builds and maintains an open crawl of the web so that anyone can access and analyze it. The organization states that its crawler respects nofollow and robots.txt policies.

Common Crawl makes available roughly 100 TB of web archive data covering about 6 billion web pages crawled between 2008 and 2012. The crawl data is stored in an Amazon public-datasets S3 bucket and is freely downloadable. Common Crawl also publishes open-source libraries for processing the data with Hadoop, as well as its crawler.

Web site: http://commoncrawl.org/

71 questions
2 votes, 1 answer

Common Crawl Keyword Lookup

I want to find a list of all the websites that contain a specific keyword. For example, if I search for the keyword "Sports" or "Football", only the related website URLs, titles, descriptions and images need to be extracted from the Common Crawl WARC…
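
The Common Crawl index does not support full-text keyword search, so one common approach is to scan downloaded WARC files record by record. A minimal sketch of that idea, assuming the warcio package and an already-downloaded file named example.warc.gz (both are placeholders, not taken from the question):

    from warcio.archiveiterator import ArchiveIterator

    KEYWORD = b"football"            # example keyword
    path = "example.warc.gz"         # placeholder for a downloaded WARC file

    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            # Only HTTP responses carry the page HTML
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read()
            if KEYWORD in body.lower():
                print(url)   # title/description/image would be parsed out of `body`
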
2 votes, 2 answers

Get offset and length of a subset of a WAT archive from Common Crawl index server

I would like to download a subset of a WAT archive segment from Amazon S3. Background: Searching the Common Crawl index at http://index.commoncrawl.org yields results with information about the location of WARC files on AWS S3. For example,…
jmtroos
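
For reference, each index hit carries the archive filename plus the byte offset and length of the capture, which map directly onto an HTTP Range request. A hedged sketch of that round trip; the crawl id CC-MAIN-2018-34 and the URL example.com are placeholders:

    import gzip
    import io
    import json
    import requests

    # Ask one crawl's index about a URL (crawl id and URL are placeholders)
    resp = requests.get(
        "https://index.commoncrawl.org/CC-MAIN-2018-34-index",
        params={"url": "example.com/", "output": "json"},
    )
    hit = json.loads(resp.text.splitlines()[0])       # first capture

    # filename/offset/length locate the record inside the archive
    offset, length = int(hit["offset"]), int(hit["length"])
    byte_range = "bytes={}-{}".format(offset, offset + length - 1)
    data = requests.get(
        "https://data.commoncrawl.org/" + hit["filename"],
        headers={"Range": byte_range},
    ).content

    # Each record is an independently gzipped member, so it unpacks on its own
    print(gzip.GzipFile(fileobj=io.BytesIO(data)).read()[:200])
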
2 votes, 0 answers

Fetch Common Crawl data using Apache Nutch

I found my data on the Common Crawl website and downloaded it from there; now I have to fetch that data using Apache Nutch but don't know how. The file is in the WARC file format.
Sahil Rohila
2 votes, 0 answers

S3 "read operation timed out" while reading Common Crawl data

In order to read a few files from Common Crawl I have written this script: import warc import boto for line in sys.stdin: line = line.strip() # Connect to AWS and read a dataset conn = boto.connect_s3(anon=True,…
Hafiz Muhammad Shafiq
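
Timeouts like this are often just a slow connection to the public bucket. A sketch of anonymous access with longer timeouts and automatic retries, using boto3 rather than the older boto shown in the question; the key below is one of the crawl listing files and stands in for any object in the bucket:

    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # Anonymous S3 client with generous timeouts and automatic retries
    s3 = boto3.client(
        "s3",
        config=Config(
            signature_version=UNSIGNED,
            connect_timeout=60,
            read_timeout=300,
            retries={"max_attempts": 5},
        ),
    )

    key = "crawl-data/CC-MAIN-2018-34/wet.paths.gz"   # example object in the commoncrawl bucket
    s3.download_file("commoncrawl", key, "wet.paths.gz")
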
2 votes, 1 answer

Hadoop process WARC files

I have a general question about Hadoop file splitting and multiple mappers. I am new to Hadoop and am trying to get a handle on how to set up for optimal performance. My project is currently processing WARC files which are gzipped. Using the current…
2 votes, 1 answer

Giving Common Crawl location as input to Amazon EMR using mrjob (Python)

It has been only days since I started using mrjob and I have tried a few low- and medium-level tasks. Now I am stuck at giving the Common Crawl [from now on referred to as CC] location as input to EMR using Python mrjob. My config file looks like this…
The6thSense
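
For context, mrjob accepts S3 paths directly on the command line, so a CC listing file can be the job input without special configuration. A toy sketch under that assumption, in which each mapper line is one WET file path:

    from mrjob.job import MRJob

    class CountWetPaths(MRJob):
        """Toy job: count WET paths per crawl segment, just to show the S3 input wiring."""

        def mapper(self, _, line):
            # lines look like crawl-data/CC-MAIN-.../segments/<segment>/wet/...warc.wet.gz
            parts = line.strip().split("/")
            segment = parts[3] if len(parts) > 3 else "unknown"
            yield segment, 1

        def reducer(self, segment, counts):
            yield segment, sum(counts)

    if __name__ == "__main__":
        CountWetPaths.run()

Run with, for example: python count_wet_paths.py -r emr s3://commoncrawl/crawl-data/CC-MAIN-2018-34/wet.paths.gz (Hadoop decompresses the gzipped input transparently).
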
2 votes, 1 answer

Download the complete Common Crawl index file

The Common Crawl index file used in the project below (https://github.com/trivio/common_crawl_index/blob/master/bin/remote_copy), mmap = BotoMap(s3_anon, src_bucket, '/common-crawl/projects/url-index/url-index.1356128792'), is a partial one. I want the…
Vanaja Jayaraman
1 vote, 1 answer

Python's zlib doesn't work on CommonCrawl file

I was trying to unzip a file using Python's zlib and it doesn't seem to work. The file is 100 MB from Common Crawl and I downloaded it as wet.gz. When I unzip it on the terminal with gunzip, everything works fine, and here are the first few lines of…
157 239n
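
A likely explanation, for reference: Common Crawl's .wet.gz/.warc.gz files are gzip streams (often with many concatenated members), and a bare zlib.decompress() call does not expect the gzip header. A short sketch of two approaches that do handle it; the file name wet.gz is taken from the question:

    import gzip
    import zlib

    # Option 1: the gzip module handles the header and multiple members transparently
    with gzip.open("wet.gz", "rb") as f:
        print(f.read(200))

    # Option 2: zlib works if told to expect a gzip wrapper (wbits = 16 + MAX_WBITS);
    # this decodes up to the end of the first gzip member
    with open("wet.gz", "rb") as f:
        raw = f.read()
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    print(d.decompress(raw)[:200])
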
1 vote, 2 answers

How to get webpage text from Common Crawl?

Using Common Crawl, is there a way I can download raw text from all pages of a particular domain (e.g., wisc.edu)? I am only interested in text for NLP purposes such as topic modeling.
SanMelkote
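
For reference, the index server can list every capture for a domain, and the corresponding WET files hold the extracted plain text. A hedged sketch of the listing step; the crawl id is a placeholder, and matchType=domain also picks up subdomains:

    import json
    import requests

    resp = requests.get(
        "https://index.commoncrawl.org/CC-MAIN-2018-34-index",
        params={"url": "wisc.edu", "matchType": "domain",
                "output": "json", "limit": "20"},
    )
    for line in resp.text.splitlines():
        hit = json.loads(line)
        print(hit["url"], hit["filename"], hit["offset"], hit["length"])

Each hit can then be range-fetched as in the sketch further up, or the matching WET file downloaded for the pre-extracted text.
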
1 vote, 1 answer

Why do my Apache Nutch warc and commoncrawldump commands fail after a crawl?

I have successfully crawled a website using Nutch and now I want to create a WARC from the results. However, running both the warc and commoncrawldump commands fails. Also, running bin/nutch dump -segment .... works successfully on the same segment…
cc100
1 vote, 2 answers

How to crawl the web for a specific language

I am trying to collect all available text information (as much as possible) from web pages for the Uzbek language (for my research). What is the best way to do it? I found Common Crawl, but I'm not sure whether it's easy to extract text in a specific language.
1 vote, 1 answer

Is it possible to get titles from the web version of the Common Crawl API?

I am trying to get URLs, titles and languages from web pages. Fortunately there exists the CC API: https://github.com/webrecorder/pywb/wiki/CDX-Server-API#api-reference. But sadly I did not find a way to also get the titles. At the moment I query CC…
Mazzespazze
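
For reference, the CDX API does not expose page titles; one workaround is to range-fetch the capture that each hit points to and pull the <title> out of the HTML. A hedged sketch of such a helper, assuming warcio is available and `hit` is one JSON record returned by the API:

    import io
    import re
    import requests
    from warcio.archiveiterator import ArchiveIterator

    def title_for_capture(hit):
        """hit: one CDX API result with filename/offset/length fields."""
        offset, length = int(hit["offset"]), int(hit["length"])
        rng = "bytes={}-{}".format(offset, offset + length - 1)
        data = requests.get("https://data.commoncrawl.org/" + hit["filename"],
                            headers={"Range": rng}).content
        # The byte range is a complete gzipped WARC record, so it can be iterated directly
        for record in ArchiveIterator(io.BytesIO(data)):
            if record.rec_type == "response":
                html = record.content_stream().read().decode("utf-8", errors="replace")
                m = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
                return m.group(1).strip() if m else None
        return None
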
1 vote, 1 answer

mrjob step is failing. How do I debug it?

I am trying to run a sample mrjob on an EMR cluster. I created the EMR cluster manually in the AWS dashboard and started mrjob as follows: python keywords.py -r emr s3://commoncrawl/crawl-data/CC-MAIN-2018-34/wet.paths.gz --cluster-id j-22GFG1FUGS12L. The job is…
Javith
1 vote, 1 answer

Processing many WARC archives from CommonCrawl using Hadoop Streaming and MapReduce

I am working on a project in which I need to download crawl data (from Common Crawl) for specific URLs from an S3 bucket and then process that data. Currently I have a MapReduce job (Python via Hadoop Streaming) which gets the correct S3 file…
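
For context, the usual Hadoop Streaming pattern here is to feed a warc.paths listing as the job input, so that each mapper line is one S3 key which the mapper downloads and iterates itself (gzipped WARCs are not splittable across mappers). A rough sketch of such a mapper; anonymous boto3 access and warcio are assumptions, not necessarily what the project in the question uses:

    #!/usr/bin/env python
    # Hadoop Streaming mapper: stdin carries WARC keys, stdout emits (url, 1) pairs
    import sys
    import tempfile

    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config
    from warcio.archiveiterator import ArchiveIterator

    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

    for line in sys.stdin:
        key = line.strip()
        if not key:
            continue
        # Pull the whole archive to local disk first, then walk its records
        with tempfile.NamedTemporaryFile(suffix=".warc.gz") as tmp:
            s3.download_fileobj("commoncrawl", key, tmp)
            tmp.seek(0)
            for record in ArchiveIterator(tmp):
                if record.rec_type == "response":
                    url = record.rec_headers.get_header("WARC-Target-URI")
                    print("{}\t1".format(url))
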
1 vote, 1 answer

How do I use Common Crawl to search the web for a certain keyword query?

Common Crawl is a non-profit, third-party web archive: http://commoncrawl.org. I see there is an API to search Common Crawl for a given domain. How can I search Common Crawl for a given search term?