Questions tagged [common-crawl]

Open crawl of the web that can be accessed and analyzed by everyone.

Common Crawl is a non-profit organization that builds and maintains an open crawl of the web, archiving it with the intent of giving everyone access. The organization states that its crawler respects nofollow and robots.txt policies.

Common Crawl makes available a roughly 100 TB web archive covering about 6 billion web pages crawled between 2008 and 2012. The crawl data is kept in an Amazon public-datasets S3 bucket and is freely downloadable. Common Crawl also publishes open-source libraries for processing the data with Hadoop, as well as its crawler.
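
The archive can be explored with any S3 client before downloading anything. A minimal sketch using boto3 with unsigned (anonymous) requests against the public commoncrawl bucket; the keys and sizes listed vary by crawl:

    # List a few objects from the public Common Crawl S3 bucket
    # using anonymous access (no AWS credentials required).
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    resp = s3.list_objects_v2(
        Bucket="commoncrawl",
        Prefix="crawl-data/",  # all monthly crawls live under this prefix
        MaxKeys=10,
    )
    for obj in resp.get("Contents", []):
        print(obj["Key"], obj["Size"])
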

Web site: http://commoncrawl.org/

71 questions
0 votes • 1 answer

Crate Common Crawl Example not working

I am trying to use this example of Crate with Common Crawl: https://github.com/crate/crate-commoncrawl I have set up Crate and even created the table schema using the instructions from the example. I am accessing Crate using the URL:…
Jaffer Wilson • 7,029 • 10 • 62 • 139
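
For readers hitting the same wall: CrateDB exposes an HTTP SQL endpoint on port 4200, which is the quickest way to confirm the setup from the crate-commoncrawl example is reachable. A minimal sketch; the table name is hypothetical and should match the schema created in the example:

    import requests

    # Query a local Crate instance over its HTTP endpoint (default port 4200).
    resp = requests.post(
        "http://localhost:4200/_sql",
        json={"stmt": "SELECT count(*) FROM commoncrawl"},  # hypothetical table
    )
    resp.raise_for_status()
    print(resp.json()["rows"])
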
0 votes • 0 answers

Common crawl example having doubts

I am trying to run a Common Crawl example, extracting URLs and emails from a WARC file. I have just one doubt: whether an email I extract belongs to the URL or to some other website; this is the confusing part. Kindly help me. How can I…
Jaffer Wilson • 7,029 • 10 • 62 • 139
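
The association the asker is unsure about is well defined: each WARC response record carries its own WARC-Target-URI header, so an email found in a record's payload belongs to that record's URL. A minimal sketch with warcio, assuming a local .warc.gz file:

    import re
    from warcio.archiveiterator import ArchiveIterator

    EMAIL_RE = re.compile(rb"[\w.+-]+@[\w-]+\.[\w.-]+")

    with open("example.warc.gz", "rb") as stream:  # hypothetical file name
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            # The URL this record was fetched from:
            url = record.rec_headers.get_header("WARC-Target-URI")
            # Any email found in this record's body belongs to that URL.
            for email in set(EMAIL_RE.findall(record.content_stream().read())):
                print(url, email.decode("ascii", "replace"))
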
0 votes • 2 answers

How do I archive and retrieve a large HTML dataset?

I am a fresher and I am about to participate in a contest this weekend. The problem involves archiving and retrieving a large HTML dataset, and I have no idea how to approach it. My friend suggested using a web archive and Common Crawl. Please suggest…
Sriram S • 1 • 1
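
WARC is the standard container Common Crawl itself uses for exactly this problem. A minimal sketch of writing HTML pages into a compressed WARC with warcio; the URL and payload are placeholders:

    from io import BytesIO
    from warcio.warcwriter import WARCWriter
    from warcio.statusandheaders import StatusAndHeaders

    with open("archive.warc.gz", "wb") as out:
        writer = WARCWriter(out, gzip=True)
        payload = BytesIO(b"<html><body>Hello</body></html>")  # placeholder page
        http_headers = StatusAndHeaders(
            "200 OK", [("Content-Type", "text/html")], protocol="HTTP/1.0"
        )
        record = writer.create_warc_record(
            "http://example.com/", "response",
            payload=payload, http_headers=http_headers,
        )
        writer.write_record(record)

Retrieval is then the ArchiveIterator loop shown elsewhere on this page.
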
0 votes • 1 answer

Common Crawl AWS public dataset transfer cost

I'm currently working with Common Crawl datasets and I want to know the cost of transferring data from the original S3 bucket to my EC2 cluster. Is there any charge, or is it totally free?
ar-ms • 735 • 6 • 14
0 votes • 2 answers

MRJob determining if running inline, local, emr or hadoop

I am building on some old code from a few years back that uses the Common Crawl dataset on EMR with MRJob. The code uses the following inside the MRJob subclass's mapper function to determine whether it is running locally or on EMR: self.options.runner ==…
Pykler • 14,565 • 9 • 41 • 50
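
In recent mrjob versions the job no longer sees the runner name by default, which is the likely reason old code reading self.options.runner broke. A sketch of the documented workaround, assuming mrjob 0.6 or later: pass the --runner switch through to the job.

    from mrjob.job import MRJob

    class MRRunnerAware(MRJob):

        def configure_args(self):
            super(MRRunnerAware, self).configure_args()
            # Forward the --runner switch so it is visible inside tasks.
            self.pass_arg_through("--runner")

        def mapper(self, _, line):
            # self.options.runner is e.g. "inline", "local", "hadoop" or
            # "emr"; it is None if --runner was not given explicitly.
            mode = "local" if self.options.runner in ("inline", "local") else "cluster"
            yield mode, 1

    if __name__ == "__main__":
        MRRunnerAware.run()
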
0 votes • 1 answer

How to read all the data of Common Crawl from AWS with Java?

I'm totally new to Hadoop and MapReduce programming, and I'm trying to write my first MapReduce program with Common Crawl data. I would like to read all the data of April 2015 from AWS. For example, if I want to download all the data of April…
pi-2r • 1,259 • 4 • 27 • 52
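
Each monthly crawl publishes a warc.paths.gz manifest listing every WARC file, which is the usual starting point whatever language the job itself is written in. A sketch in Python rather than Java; CC-MAIN-2015-18 is believed to be the April 2015 crawl, but verify the ID against the crawl list on commoncrawl.org:

    import gzip
    import urllib.request

    # Manifest of all WARC files in one monthly crawl (assumed crawl ID).
    url = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2015-18/warc.paths.gz"
    with urllib.request.urlopen(url) as resp:
        paths = gzip.decompress(resp.read()).decode().splitlines()

    print(len(paths), "WARC files; first:", paths[0])
    # Each entry can then be fetched as https://data.commoncrawl.org/<path>
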
0 votes • 0 answers

Convert commoncrawl keyword search script to Hadoop EMR script

I have built a keyword search script which runs from EC2 and successfully saves its output to S3. But it is single-threaded, which is why it is slow. I want to run it on EMR using a custom JAR. Can someone please convert this to a Hadoop script so I can…
Sohail Ahmed • 1,667 • 14 • 23
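
The asker's script isn't shown, so no direct conversion is possible, but the usual shape of such a job is: feed the warc.paths manifest as input, let each mapper stream one WARC file, and fan out with -r emr. A hedged sketch with mrjob (the keyword is a placeholder; warcio must be installed on the cluster, e.g. via a bootstrap action):

    import urllib.request
    from mrjob.job import MRJob
    from warcio.archiveiterator import ArchiveIterator

    KEYWORD = b"example"  # hypothetical search term

    class MRKeywordSearch(MRJob):

        def mapper(self, _, warc_path):
            # Each input line is one WARC path from warc.paths.
            url = "https://data.commoncrawl.org/" + warc_path.strip()
            hits = 0
            for record in ArchiveIterator(urllib.request.urlopen(url)):
                if record.rec_type == "response":
                    if KEYWORD in record.content_stream().read():
                        hits += 1
            yield warc_path, hits

    if __name__ == "__main__":
        MRKeywordSearch.run()

Test locally on a short path list first (python mr_keyword_search.py paths.txt), then switch to -r emr.
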
0 votes • 1 answer

How do I log from a mapper? (hadoop with commoncrawl)

I'm using the Common Crawl example code from their "MapReduce for the Masses" tutorial. I'm trying to make modifications to the mapper, and I'd like to be able to log strings to some output. I'm considering setting up a NoSQL db and just pushing my…
kelorek • 6,042 • 6 • 29 • 32
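
There is no need for a side database: Hadoop already captures each task's stderr in its task logs, and counters are aggregated in the job UI. The tutorial code is Java (System.err or log4j works there); the same idea in a Python/mrjob mapper:

    import sys
    from mrjob.job import MRJob

    class MRWithLogging(MRJob):

        def mapper(self, _, line):
            # Lands in this task's stderr log, viewable in the job UI.
            sys.stderr.write("DEBUG: line of length %d\n" % len(line))
            # Aggregated across all tasks and reported with the job status.
            self.increment_counter("debug", "lines_seen", 1)
            yield len(line), 1

    if __name__ == "__main__":
        MRWithLogging.run()
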
-1 votes • 1 answer

Unknown archive format! How can I extract URLs from the WARC file by Jupyter?

I'm trying to extract website URLs from a .WARC (Web ARChive) file from the Common Crawl dataset (commoncrawl.org). After decompressing the file and writing the code to read it, I attached the code: import pandas as pd from warcio.archiveiterator…
Jawaher • 3 • 2
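
warcio raises "Unknown archive format" when the stream doesn't begin with a valid WARC record: typical causes are a truncated download, a file opened in text mode, or a manual decompression step gone wrong. A minimal sketch that passes the .warc.gz straight to ArchiveIterator (file name hypothetical) and collects the URLs into a DataFrame:

    import pandas as pd
    from warcio.archiveiterator import ArchiveIterator

    urls = []
    with open("example.warc.gz", "rb") as stream:  # binary mode, still gzipped
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                urls.append(record.rec_headers.get_header("WARC-Target-URI"))

    df = pd.DataFrame({"url": urls})
    print(df.head())
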
-1 votes • 1 answer

Means of getting data for a given website from the Web Data Commons?

I'm trying to find interesting data inside the Web Data Commons dumps. It is taking days to grep across them on my machine (even in parallel). Is there an index of which websites are covered, and a way to extract data specifically for those sites?
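
Web Data Commons is extracted from Common Crawl, and Common Crawl publishes a per-crawl URL index that can be queried by domain, which avoids grepping the dumps. A sketch against the public index server; the crawl ID is an example:

    import json
    import urllib.request
    from urllib.parse import urlencode

    params = urlencode({"url": "example.com/*", "output": "json"})
    query = "https://index.commoncrawl.org/CC-MAIN-2015-18-index?" + params
    with urllib.request.urlopen(query) as resp:
        for line in resp:
            rec = json.loads(line)
            # filename/offset/length locate the record inside its WARC file.
            print(rec["url"], rec.get("filename"))
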
-1 votes • 1 answer

Jar file for the mentioned import statements is required

import edu.cmu.lemurproject.WarcHTMLResponseRecord; import edu.cmu.lemurproject.WarcRecord; I am using these import statements and getting an error. Could you please suggest the JAR file for the above imports?