Questions tagged [common-crawl]

Open crawl of the web that can be accessed and analyzed by everyone.

Common Crawl is a non-profit organization that builds and maintains an open crawl of the web, archiving web pages with the intent of making the data accessible to everyone. The organization states that its crawler respects nofollow and robots.txt policies.

Common Crawl makes available a roughly 100 TB web archive covering about 6 billion web pages crawled between 2008 and 2012. The crawl data is kept in the Amazon Public Datasets S3 bucket and is freely downloadable. Common Crawl also publishes its crawler and an open-source library for processing the data with Hadoop.

Web site: http://commoncrawl.org/
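
As a quick orientation for the questions below, here is a minimal sketch of fetching one crawl's WARC file listing over plain HTTP (no AWS account needed). The crawl label CC-MAIN-2023-50 and the data.commoncrawl.org mirror are assumptions that change between releases:

    import gzip
    import io

    import requests

    # Assumed crawl label and mirror host; both change between releases.
    CRAWL = "CC-MAIN-2023-50"
    BASE = "https://data.commoncrawl.org"

    # Each crawl publishes a gzipped list of its WARC file paths.
    paths_url = f"{BASE}/crawl-data/{CRAWL}/warc.paths.gz"
    resp = requests.get(paths_url, timeout=60)
    resp.raise_for_status()

    with gzip.open(io.BytesIO(resp.content), "rt") as fh:
        warc_paths = [line.strip() for line in fh if line.strip()]

    print(f"{len(warc_paths)} WARC files in {CRAWL}")
    # Any single WARC segment can then be fetched (or streamed) from the same host.
    print(f"first file: {BASE}/{warc_paths[0]}")
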

71 questions
0
votes
1 answer

Exception in newsplease commoncrawl.py file

I am using the newsplease library that I have cloned from https://github.com/fhamborg/news-please. I want to use newsplease to get news articles from the Common Crawl news dataset. I am running the commoncrawl.py file as instructed here. I have used the command…
0
votes
0 answers

Common Crawl: pyspark, unable to use it

As part of an internship, I must download Hadoop and Spark and test them on some Common Crawl data. I tried to follow the steps on this page https://github.com/commoncrawl/cc-pyspark#get-sample-data (I installed Spark 3.0.0 on my computer), but when…
Fitz
  • 41
  • 4
0
votes
1 answer

Does Common Crawl contain only benign URLs? If yes, how do they avoid indexing malicious URLs?

We would like to know whether the Common Crawl database can be used as a legitimate dataset for URL classification.
test M
  • 9
  • 3
0
votes
1 answer

How to read multiple gzipped files from S3 into a single RDD with HTTP requests?

I have to download many gzipped files stored on S3 like…
fra96
  • 43
  • 6
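
Not an answer from the thread, just a rough sketch of one common approach: parallelize the list of HTTPS URLs and let each Spark task fetch and decompress its own file. The URLs below are hypothetical, and each gzip file is assumed small enough to hold in memory per task:

    import gzip
    import io

    import requests
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("gz-from-s3-over-http").getOrCreate()
    sc = spark.sparkContext

    # Hypothetical file listing; in practice these would come from a *.paths file.
    urls = [
        "https://data.commoncrawl.org/crawl-data/.../wet/file-00000.warc.wet.gz",
        "https://data.commoncrawl.org/crawl-data/.../wet/file-00001.warc.wet.gz",
    ]

    def fetch_lines(url):
        """Download one gzipped file and yield its decoded lines."""
        resp = requests.get(url, timeout=120)
        resp.raise_for_status()
        with gzip.open(io.BytesIO(resp.content), "rt", errors="replace") as fh:
            for line in fh:
                yield line.rstrip("\n")

    # One partition per URL, so each file is fetched by a single task.
    rdd = sc.parallelize(urls, numSlices=len(urls)).flatMap(fetch_lines)
    print(rdd.take(5))
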
0
votes
1 answer

mrjob returned non-zero exit status 256

I'm new to MapReduce and I'm trying to run a MapReduce job using the mrjob package for Python. However, I encountered this error: ERROR:mrjob.launch:Step 1 of 1 failed: Command '['/usr/bin/hadoop', 'jar',…
kkesley
  • 3,258
  • 1
  • 28
  • 55
0
votes
2 answers

Reading WARC Files Efficiently

I am reading a WARC file with Python's 'warc' library. The current file that I am using is around 4.50 GB. The thing is: file = warc.open("random.warc") html_lists = [line for line in file] Executing these 2 lines takes up to 40 seconds. Since there…
MeteHan
  • 289
  • 2
  • 16
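
A minimal sketch of one way to avoid materializing the whole multi-gigabyte file: iterate records lazily with the warcio library (a different package from the 'warc' module in the question); the filename is a placeholder:

    from warcio.archiveiterator import ArchiveIterator

    # Stream records one at a time instead of building a list of the whole file.
    with open("random.warc", "rb") as stream:          # works for .warc.gz too
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            uri = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()      # bytes for this record only
            # process html here, then let it go out of scope
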
0
votes
1 answer

How to download multiple large files concurrently in Python?

I am trying to download a series of WARC files from the CommonCrawl database, each of them about 25 MB. This is my script: import json import urllib.request from urllib.error import HTTPError from src.Util import rooted with…
kabeersvohra
  • 1,049
  • 1
  • 14
  • 31
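
A generic sketch of one way to download several files in parallel with a thread pool and streamed writes; the URL list and output directory are placeholders:

    import os
    from concurrent.futures import ThreadPoolExecutor, as_completed

    import requests

    # Hypothetical inputs: a handful of WARC URLs and a local target directory.
    urls = ["https://example.com/file1.warc.gz", "https://example.com/file2.warc.gz"]
    out_dir = "downloads"
    os.makedirs(out_dir, exist_ok=True)

    def download(url):
        local_path = os.path.join(out_dir, url.rsplit("/", 1)[-1])
        with requests.get(url, stream=True, timeout=120) as resp:
            resp.raise_for_status()
            with open(local_path, "wb") as fh:
                for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
                    fh.write(chunk)
        return local_path

    # A small pool is usually enough; downloads are I/O bound, not CPU bound.
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(download, u) for u in urls]
        for fut in as_completed(futures):
            print("finished", fut.result())
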
0
votes
1 answer

Can't stream files from Amazon S3 using requests

I'm trying to stream crawl data from Common Crawl, but Amazon S3 errors out when I use the stream=True parameter with requests.get. Here is an example: resp = requests.get(url, stream=True) print(resp.raw.read()) When I run this on a Common Crawl s3…
Superman
  • 196
  • 1
  • 2
  • 8
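
A short sketch of two streaming patterns with requests: iter_content, which handles chunking and decoding for you, and raw.read, which returns still-encoded bytes unless urllib3 is asked to decode them. The URL is a placeholder for any publicly readable object:

    import requests

    # Placeholder: any publicly readable object served over HTTPS.
    url = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/warc.paths.gz"

    # Pattern 1: let requests handle chunking via iter_content.
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=1 << 16):
            pass  # process each 64 KiB chunk here

    # Pattern 2: read from the raw stream, asking urllib3 to decode content-encoding.
    resp = requests.get(url, stream=True, timeout=60)
    data = resp.raw.read(1024, decode_content=True)   # first KiB, decoded
    print(len(data))
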
0
votes
0 answers

No Credentials error with Python, Common Crawl

I am trying out a sample Common Crawl example based on https://engineeringblog.yelp.com/2015/03/analyzing-the-web-for-the-price-of-a-sandwich.html. I am running the command below on my local Windows PC based on the instructions. python…
Shamnad P S
  • 1,095
  • 2
  • 15
  • 43
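
The mrjob setup itself is not reproduced here; this is only a sketch of the underlying idea behind such credential errors: reading public data with an anonymous (unsigned) boto3 client. The object key is hypothetical, and whether the commoncrawl bucket currently permits unsigned reads depends on Common Crawl's access policy:

    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # Anonymous (unsigned) client: no AWS credentials are configured.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

    # Hypothetical key; real keys come from the crawl's *.paths listings.
    obj = s3.get_object(
        Bucket="commoncrawl",
        Key="crawl-data/CC-MAIN-2023-50/warc.paths.gz",
    )
    print(obj["ContentLength"], "bytes")
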
0
votes
1 answer

Converting a warc.gz file downloaded from Common Crawl to an RDD

I have downloaded a warc.gz file from Common Crawl and I have to process it using Spark. How can I convert the file into an RDD? sc.textFile("filepath") does not seem to help. When rdd.take(1) is printed, it gives me [u'WARC/1.0'] whereas it should…
Ravi Ranjan
  • 353
  • 1
  • 6
  • 22
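
A rough sketch of one approach: sc.textFile splits the gzipped WARC by lines and loses record boundaries, so instead hand each whole file to a WARC parser. This uses binaryFiles plus warcio (an assumption, not the library from the question), and loads each file into memory, which is fine for modestly sized segments:

    import io

    from warcio.archiveiterator import ArchiveIterator
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("warc-to-rdd").getOrCreate()
    sc = spark.sparkContext

    def parse_warc(path_and_bytes):
        """Yield (target_uri, payload_bytes) for each response record in one file."""
        _path, data = path_and_bytes
        for record in ArchiveIterator(io.BytesIO(data)):
            if record.rec_type == "response":
                uri = record.rec_headers.get_header("WARC-Target-URI")
                yield uri, record.content_stream().read()

    # binaryFiles keeps each .warc.gz intact (one element per file), so record
    # boundaries are preserved; textFile would split on newlines and break them.
    rdd = sc.binaryFiles("path/to/*.warc.gz").flatMap(parse_warc)
    print(rdd.take(1))
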
0
votes
2 answers

Java API to query CommonCrawl to populate Digital Object Identifier (DOI) Database

I am attempting to create a database of Digital Object Identifiers (DOIs) found on the internet. By manually searching the CommonCrawl Index Server I have obtained some promising results. However, I wish to develop a programmatic…
Hector
  • 4,016
  • 21
  • 112
  • 211
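
The index server behind those manual searches exposes a plain HTTP CDX API, so it can be queried programmatically from any language. A rough Python sketch (kept in Python to match the other examples on this page); the crawl label and the doi.org prefix query are placeholder choices:

    import json

    import requests

    # Assumed crawl label; available indexes are listed at
    # https://index.commoncrawl.org/collinfo.json
    CRAWL = "CC-MAIN-2023-50"
    INDEX = f"https://index.commoncrawl.org/{CRAWL}-index"

    # Query the CDX API for captures under a placeholder URL prefix.
    params = {"url": "doi.org/*", "output": "json", "limit": 5}
    resp = requests.get(INDEX, params=params, timeout=60)
    resp.raise_for_status()

    # The API returns one JSON object per line.
    for line in resp.text.splitlines():
        capture = json.loads(line)
        print(capture["url"], capture["filename"], capture["offset"], capture["length"])
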
0
votes
1 answer

Cannot find URL from a WARC file crawled from Common Crawl

I have crawled data from Common Crawl and I want to find out the URL corresponding to each of the records. for record in files: print record['WARC-Target-URI'] This outputs an empty list. I am referring to the following…
Ravi Ranjan
  • 353
  • 1
  • 6
  • 22
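
An empty value here usually means the record type does not carry that header at all (for example, the warcinfo record at the start of a file). A small sketch with warcio that filters by record type before reading the header; the filename is a placeholder:

    from warcio.archiveiterator import ArchiveIterator

    # Only records that describe a capture (e.g. 'response', 'request') carry a
    # WARC-Target-URI header; 'warcinfo' records do not.
    with open("segment.warc.gz", "rb") as stream:        # placeholder filename
        for record in ArchiveIterator(stream):
            uri = record.rec_headers.get_header("WARC-Target-URI")
            if uri:
                print(record.rec_type, uri)
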
0
votes
1 answer

Beautiful Soup takes too much time for text extraction in Common Crawl data

I have to parse HTML content in the Common Crawl data set (warc.gz files). I have decided to use the bs4 (BeautifulSoup) module, as most people suggest it. Following is the code snippet to get text: from bs4 import BeautifulSoup soup = BeautifulSoup(src,…
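
A small sketch of two common speed-ups, assuming lxml is installed: keep BeautifulSoup but switch to the C-backed lxml parser, or skip BeautifulSoup entirely when only the text is needed. The markup below is a placeholder:

    from bs4 import BeautifulSoup
    import lxml.html

    html = "<html><body><p>Example page</p></body></html>"   # placeholder markup

    # Option 1: keep BeautifulSoup but use the lxml parser instead of html.parser.
    soup = BeautifulSoup(html, "lxml")
    text = soup.get_text(separator=" ", strip=True)

    # Option 2: go straight to lxml when only the text is needed.
    text2 = lxml.html.fromstring(html).text_content()

    print(text, "|", text2)
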
0
votes
1 answer

How to handle binary data in Common Crawl using Python

I have to analyze Common Crawl data. For that I am using Python 2.7. I have observed that some warc.gz files contain binary data. I have to parse the HTML source using bs4. But how can I detect that this is textual data and this is…
Hafiz Muhammad Shafiq
  • 8,168
  • 12
  • 63
  • 121
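
One common heuristic is to check the record's HTTP Content-Type header before handing the payload to bs4. A sketch with warcio (shown in Python 3, whereas the question uses Python 2.7); the filename is a placeholder:

    from warcio.archiveiterator import ArchiveIterator

    with open("segment.warc.gz", "rb") as stream:            # placeholder filename
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            # The HTTP response headers say what the payload claims to be.
            ctype = record.http_headers.get_header("Content-Type", "") if record.http_headers else ""
            if "text/html" in ctype.lower():
                html = record.content_stream().read()         # parse with bs4 here
            else:
                pass  # skip images, PDFs and other binary payloads
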
0
votes
0 answers

Company name matching Common Crawl using mrjob

I have a list of company names and details like phone number, address, email, etc. I want to get their company_url. We thought of using the Google API to make requests, but it turns out to be costly. After searching I found Common Crawl, which was somewhat close…