Questions tagged [common-crawl]

Open crawl of the web that can be accessed and analyzed by everyone.

Common Crawl is a non-profit organization that builds and maintains an open crawl of the web, archiving web pages with the intent of making the data accessible to everyone. The organization states that its crawler respects nofollow and robots.txt policies.

Common Crawl makes available a roughly 100 TB web archive covering about 6 billion web pages crawled between 2008 and 2012. The crawl data is kept in the Amazon Public Datasets S3 bucket and is freely downloadable. Common Crawl also publishes its crawler and an open-source library for processing the data with Hadoop.

Web site: http://commoncrawl.org/
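
As a quick orientation for the questions below, here is a minimal sketch of fetching one crawl's WARC file listing over plain HTTP (no AWS account needed). The crawl label CC-MAIN-2023-50 and the data.commoncrawl.org mirror are assumptions that change between releases:

    import gzip
    import io

    import requests

    # Assumed crawl label and mirror host; both change between releases.
    CRAWL = "CC-MAIN-2023-50"
    BASE = "https://data.commoncrawl.org"

    # Each crawl publishes a gzipped list of its WARC file paths.
    paths_url = f"{BASE}/crawl-data/{CRAWL}/warc.paths.gz"
    resp = requests.get(paths_url, timeout=60)
    resp.raise_for_status()

    with gzip.open(io.BytesIO(resp.content), "rt") as fh:
        warc_paths = [line.strip() for line in fh if line.strip()]

    print(f"{len(warc_paths)} WARC files in {CRAWL}")
    # Any single WARC segment can then be fetched (or streamed) from the same host.
    print(f"first file: {BASE}/{warc_paths[0]}")
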

71 questions
0
votes
1 answer

Exception in newsplease commoncrawl.py file

I am using the newsplease library that I have cloned from https://github.com/fhamborg/news-please. I want to use newsplease to get news articles from the Common Crawl news dataset. I am running the commoncrawl.py file as instructed here. I have used the command…
0
votes
0 answers

Common Crawl: pyspark, unable to use it

As part of an internship, I must download Hadoop and Spark and test them on some Common Crawl data. I tried to follow the steps on this page https://github.com/commoncrawl/cc-pyspark#get-sample-data (I installed Spark 3.0.0 on my computer), but when…
Fitz
  • 41
  • 4
0
votes
1 answer

Does Common Crawl contain only benign URLs? If yes, how do they avoid indexing malicious URLs?

We would like to know whether the Common Crawl database can be used as a legitimate dataset for URL classification.
test M
  • 9
  • 3
0
votes
1 answer

How to read multiple gzipped files from S3 into a single RDD with HTTP requests?

I have to download many gzipped files stored on S3 like…
fra96
  • 43
  • 6
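
Not an answer from the thread, just a rough sketch of one common approach: parallelize the list of HTTPS URLs and let each Spark task fetch and decompress its own file. The URLs below are hypothetical, and each gzip file is assumed small enough to hold in memory per task:

    import gzip
    import io

    import requests
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("gz-from-s3-over-http").getOrCreate()
    sc = spark.sparkContext

    # Hypothetical file listing; in practice these would come from a *.paths file.
    urls = [
        "https://data.commoncrawl.org/crawl-data/.../wet/file-00000.warc.wet.gz",
        "https://data.commoncrawl.org/crawl-data/.../wet/file-00001.warc.wet.gz",
    ]

    def fetch_lines(url):
        """Download one gzipped file and yield its decoded lines."""
        resp = requests.get(url, timeout=120)
        resp.raise_for_status()
        with gzip.open(io.BytesIO(resp.content), "rt", errors="replace") as fh:
            for line in fh:
                yield line.rstrip("\n")

    # One partition per URL, so each file is fetched by a single task.
    rdd = sc.parallelize(urls, numSlices=len(urls)).flatMap(fetch_lines)
    print(rdd.take(5))
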
0
votes
1 answer

mrjob returned non-zero exit status 256

I'm new to MapReduce and I'm trying to run a MapReduce job using the mrjob package for Python. However, I encountered this error: ERROR:mrjob.launch:Step 1 of 1 failed: Command '['/usr/bin/hadoop', 'jar',…
kkesley
  • 3,258
  • 1
  • 28
  • 55
0
votes
2 answers

Reading WARC Files Efficiently

I am reading a WARC file with Python's 'warc' library. The current file that I am using is around 4.50 GB. The thing is: file = warc.open("random.warc") html_lists = [line for line in file] Executing these 2 lines takes up to 40 seconds. Since there…
MeteHan
  • 289
  • 2
  • 16
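
A minimal sketch of one way to avoid materializing the whole multi-gigabyte file: iterate records lazily with the warcio library (a different package from the 'warc' module in the question); the filename is a placeholder:

    from warcio.archiveiterator import ArchiveIterator

    # Stream records one at a time instead of building a list of the whole file.
    with open("random.warc", "rb") as stream:          # works for .warc.gz too
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            uri = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()      # bytes for this record only
            # process html here, then let it go out of scope
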
0
votes
1 answer

How to download multiple large files concurrently in Python?

I am trying to download a series of WARC files from the CommonCrawl database, each of them about 25 MB. This is my script: import json import urllib.request from urllib.error import HTTPError from src.Util import rooted with…
kabeersvohra
  • 1,049
  • 1
  • 14
  • 31
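
A generic sketch of one way to download several files in parallel with a thread pool and streamed writes; the URL list and output directory are placeholders:

    import os
    from concurrent.futures import ThreadPoolExecutor, as_completed

    import requests

    # Hypothetical inputs: a handful of WARC URLs and a local target directory.
    urls = ["https://example.com/file1.warc.gz", "https://example.com/file2.warc.gz"]
    out_dir = "downloads"
    os.makedirs(out_dir, exist_ok=True)

    def download(url):
        local_path = os.path.join(out_dir, url.rsplit("/", 1)[-1])
        with requests.get(url, stream=True, timeout=120) as resp:
            resp.raise_for_status()
            with open(local_path, "wb") as fh:
                for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
                    fh.write(chunk)
        return local_path

    # A small pool is usually enough; downloads are I/O bound, not CPU bound.
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(download, u) for u in urls]
        for fut in as_completed(futures):
            print("finished", fut.result())
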
0
votes
1 answer

Can't stream files from Amazon S3 using requests

I'm trying to stream crawl data from Common Crawl, but Amazon S3 errors out when I use the stream=True parameter with requests.get. Here is an example: resp = requests.get(url, stream=True) print(resp.raw.read()) When I run this on a Common Crawl s3…
Superman
  • 196
  • 1
  • 2
  • 8
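
A short sketch of two streaming patterns with requests: iter_content, which handles chunking and decoding for you, and raw.read, which returns still-encoded bytes unless urllib3 is asked to decode them. The URL is a placeholder for any publicly readable object:

    import requests

    # Placeholder: any publicly readable object served over HTTPS.
    url = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/warc.paths.gz"

    # Pattern 1: let requests handle chunking via iter_content.
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=1 << 16):
            pass  # process each 64 KiB chunk here

    # Pattern 2: read from the raw stream, asking urllib3 to decode content-encoding.
    resp = requests.get(url, stream=True, timeout=60)
    data = resp.raw.read(1024, decode_content=True)   # first KiB, decoded
    print(len(data))
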
0
votes
0 answers

No Credentials error with Python, Common Crawl

I am trying out a sample Common Crawl example based on https://engineeringblog.yelp.com/2015/03/analyzing-the-web-for-the-price-of-a-sandwich.html. I am running the command below on my local Windows PC based on the instructions. python…
Shamnad P S
  • 1,095
  • 2
  • 15
  • 43
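
The mrjob setup itself is not reproduced here; this is only a sketch of the underlying idea behind such credential errors: reading public data with an anonymous (unsigned) boto3 client. The object key is hypothetical, and whether the commoncrawl bucket currently permits unsigned reads depends on Common Crawl's access policy:

    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # Anonymous (unsigned) client: no AWS credentials are configured.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

    # Hypothetical key; real keys come from the crawl's *.paths listings.
    obj = s3.get_object(
        Bucket="commoncrawl",
        Key="crawl-data/CC-MAIN-2023-50/warc.paths.gz",
    )
    print(obj["ContentLength"], "bytes")
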
0
votes
1 answer

Converting a warc.gz file downloaded from Common Crawl to an RDD

I have downloaded a warc.gz file from Common Crawl and I have to process it using Spark. How can I convert the file into an RDD? sc.textFile("filepath") does not seem to help. When rdd.take(1) is printed, it gives me [u'WARC/1.0'] whereas it should…
Ravi Ranjan
  • 353
  • 1
  • 6
  • 22
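
A rough sketch of one approach: sc.textFile splits the gzipped WARC by lines and loses record boundaries, so instead hand each whole file to a WARC parser. This uses binaryFiles plus warcio (an assumption, not the library from the question), and loads each file into memory, which is fine for modestly sized segments:

    import io

    from warcio.archiveiterator import ArchiveIterator
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("warc-to-rdd").getOrCreate()
    sc = spark.sparkContext

    def parse_warc(path_and_bytes):
        """Yield (target_uri, payload_bytes) for each response record in one file."""
        _path, data = path_and_bytes
        for record in ArchiveIterator(io.BytesIO(data)):
            if record.rec_type == "response":
                uri = record.rec_headers.get_header("WARC-Target-URI")
                yield uri, record.content_stream().read()

    # binaryFiles keeps each .warc.gz intact (one element per file), so record
    # boundaries are preserved; textFile would split on newlines and break them.
    rdd = sc.binaryFiles("path/to/*.warc.gz").flatMap(parse_warc)
    print(rdd.take(1))
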
0
votes
2 answers

Java API to query CommonCrawl to populate Digital Object Identifier (DOI) Database

I am attempting to create a database of Digital Object Identifiers (DOIs) found on the internet. By manually searching the CommonCrawl Index Server I have obtained some promising results. However, I wish to develop a programmatic…
Hector
  • 4,016
  • 21
  • 112
  • 211
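
The index server behind those manual searches exposes a plain HTTP CDX API, so it can be queried programmatically from any language. A rough Python sketch (kept in Python to match the other examples on this page); the crawl label and the doi.org prefix query are placeholder choices:

    import json

    import requests

    # Assumed crawl label; available indexes are listed at
    # https://index.commoncrawl.org/collinfo.json
    CRAWL = "CC-MAIN-2023-50"
    INDEX = f"https://index.commoncrawl.org/{CRAWL}-index"

    # Query the CDX API for captures under a placeholder URL prefix.
    params = {"url": "doi.org/*", "output": "json", "limit": 5}
    resp = requests.get(INDEX, params=params, timeout=60)
    resp.raise_for_status()

    # The API returns one JSON object per line.
    for line in resp.text.splitlines():
        capture = json.loads(line)
        print(capture["url"], capture["filename"], capture["offset"], capture["length"])
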
0
votes
1 answer

Cannot find URL from a WARC file crawled from Common Crawl

I have crawled data from Common Crawl and I want to find out the URL corresponding to each of the records. for record in files: print record['WARC-Target-URI'] This outputs an empty list. I am referring to the following…
Ravi Ranjan
  • 353
  • 1
  • 6
  • 22
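
An empty value here usually means the record type does not carry that header at all (for example, the warcinfo record at the start of a file). A small sketch with warcio that filters by record type before reading the header; the filename is a placeholder:

    from warcio.archiveiterator import ArchiveIterator

    # Only records that describe a capture (e.g. 'response', 'request') carry a
    # WARC-Target-URI header; 'warcinfo' records do not.
    with open("segment.warc.gz", "rb") as stream:        # placeholder filename
        for record in ArchiveIterator(stream):
            uri = record.rec_headers.get_header("WARC-Target-URI")
            if uri:
                print(record.rec_type, uri)
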
0
votes
1 answer

Beautiful Soup takes too much time for text extraction in Common Crawl data

I have to parse HTML content in the Common Crawl data set (warc.gz files). I have decided to use the bs4 (BeautifulSoup) module, as most people suggest it. Following is the code snippet to get text: from bs4 import BeautifulSoup soup = BeautifulSoup(src,…
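
A small sketch of two common speed-ups, assuming lxml is installed: keep BeautifulSoup but switch to the C-backed lxml parser, or skip BeautifulSoup entirely when only the text is needed. The markup below is a placeholder:

    from bs4 import BeautifulSoup
    import lxml.html

    html = "<html><body><p>Example page</p></body></html>"   # placeholder markup

    # Option 1: keep BeautifulSoup but use the lxml parser instead of html.parser.
    soup = BeautifulSoup(html, "lxml")
    text = soup.get_text(separator=" ", strip=True)

    # Option 2: go straight to lxml when only the text is needed.
    text2 = lxml.html.fromstring(html).text_content()

    print(text, "|", text2)
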
0
votes
1 answer

How to handle binary data in Common Crawl using Python

I have to analyze Common Crawl data. For that I am using Python 2.7. I have observed that some warc.gz files contain binary data. I have to parse the HTML source using bs4. But how can I detect that this is textual data and this is…
Hafiz Muhammad Shafiq
  • 8,168
  • 12
  • 63
  • 121
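
One common heuristic is to check the record's HTTP Content-Type header before handing the payload to bs4. A sketch with warcio (shown in Python 3, whereas the question uses Python 2.7); the filename is a placeholder:

    from warcio.archiveiterator import ArchiveIterator

    with open("segment.warc.gz", "rb") as stream:            # placeholder filename
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            # The HTTP response headers say what the payload claims to be.
            ctype = record.http_headers.get_header("Content-Type", "") if record.http_headers else ""
            if "text/html" in ctype.lower():
                html = record.content_stream().read()         # parse with bs4 here
            else:
                pass  # skip images, PDFs and other binary payloads
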
0
votes
0 answers

Company name matching Common Crawl using mrjob

I have a list of company names and details like phone number, address, email, etc. I want to get their company_url. We thought of using the Google API to make requests, but it turns out to be costly. After searching I found Common Crawl, which was somewhat close…