Questions tagged [common-crawl]

Open crawl of the web that can be accessed and analyzed by everyone.

Common Crawl is a non-profit organization that builds and maintains an open crawl of the web, archiving it with the intent of giving everyone access. The organization states that its crawler respects nofollow and robots.txt policies.

Common Crawl makes available a roughly 100 TB web archive covering about 6 billion web pages crawled between 2008 and 2012. The crawl data is kept in an Amazon public-datasets S3 bucket and is freely downloadable. Common Crawl also publishes open-source libraries for processing the data with Hadoop, as well as its crawler.
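
The archive can be explored with any S3 client before downloading anything. A minimal sketch using boto3 with unsigned (anonymous) requests against the public commoncrawl bucket; the keys and sizes listed vary by crawl:

    # List a few objects from the public Common Crawl S3 bucket
    # using anonymous access (no AWS credentials required).
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    resp = s3.list_objects_v2(
        Bucket="commoncrawl",
        Prefix="crawl-data/",  # all monthly crawls live under this prefix
        MaxKeys=10,
    )
    for obj in resp.get("Contents", []):
        print(obj["Key"], obj["Size"])
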

Web site: http://commoncrawl.org/

71 questions
0 votes • 1 answer

Crate Common Crawl Example not working

I am trying to use this example of Crate with Common Crawl: https://github.com/crate/crate-commoncrawl I have set up Crate and even created the table schema using the instructions from the example. I am accessing Crate using the URL:…
Jaffer Wilson • 7,029 • 10 • 62 • 139
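
For readers hitting the same wall: CrateDB exposes an HTTP SQL endpoint on port 4200, which is the quickest way to confirm the setup from the crate-commoncrawl example is reachable. A minimal sketch; the table name is hypothetical and should match the schema created in the example:

    import requests

    # Query a local Crate instance over its HTTP endpoint (default port 4200).
    resp = requests.post(
        "http://localhost:4200/_sql",
        json={"stmt": "SELECT count(*) FROM commoncrawl"},  # hypothetical table
    )
    resp.raise_for_status()
    print(resp.json()["rows"])
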
0 votes • 0 answers

Common crawl example having doubts

I am trying to run a Common Crawl example, extracting URLs and emails from a WARC file. I have just one doubt: whether an email I extract belongs to the URL or to some other website; this is the confusing part. Kindly help me. How can I…
Jaffer Wilson • 7,029 • 10 • 62 • 139
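
The association the asker is unsure about is well defined: each WARC response record carries its own WARC-Target-URI header, so an email found in a record's payload belongs to that record's URL. A minimal sketch with warcio, assuming a local .warc.gz file:

    import re
    from warcio.archiveiterator import ArchiveIterator

    EMAIL_RE = re.compile(rb"[\w.+-]+@[\w-]+\.[\w.-]+")

    with open("example.warc.gz", "rb") as stream:  # hypothetical file name
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            # The URL this record was fetched from:
            url = record.rec_headers.get_header("WARC-Target-URI")
            # Any email found in this record's body belongs to that URL.
            for email in set(EMAIL_RE.findall(record.content_stream().read())):
                print(url, email.decode("ascii", "replace"))
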
0 votes • 2 answers

How do I archive and retrieve a large HTML dataset?

I am a fresher and I am about to participate in a contest this weekend. The problem involves archiving and retrieving a large HTML dataset, and I have no idea how to approach it. My friend suggested using a web archive and Common Crawl. Please suggest…
Sriram S • 1 • 1
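
WARC is the standard container Common Crawl itself uses for exactly this problem. A minimal sketch of writing HTML pages into a compressed WARC with warcio; the URL and payload are placeholders:

    from io import BytesIO
    from warcio.warcwriter import WARCWriter
    from warcio.statusandheaders import StatusAndHeaders

    with open("archive.warc.gz", "wb") as out:
        writer = WARCWriter(out, gzip=True)
        payload = BytesIO(b"<html><body>Hello</body></html>")  # placeholder page
        http_headers = StatusAndHeaders(
            "200 OK", [("Content-Type", "text/html")], protocol="HTTP/1.0"
        )
        record = writer.create_warc_record(
            "http://example.com/", "response",
            payload=payload, http_headers=http_headers,
        )
        writer.write_record(record)

Retrieval is then the ArchiveIterator loop shown elsewhere on this page.
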
0 votes • 1 answer

Common Crawl AWS public dataset transfer cost

I'm currently working with Common Crawl datasets and I want to know the cost of transferring data from the original S3 bucket to my EC2 cluster. Is there any charge, or is it totally free?
ar-ms • 735 • 6 • 14
0 votes • 2 answers

MRJob determining if running inline, local, emr or hadoop

I am building on some old code from a few years back that uses the Common Crawl dataset on EMR with MRJob. The code uses the following inside the MRJob subclass's mapper function to determine whether it is running locally or on EMR: self.options.runner ==…
Pykler • 14,565 • 9 • 41 • 50
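
In recent mrjob versions the job no longer sees the runner name by default, which is the likely reason old code reading self.options.runner broke. A sketch of the documented workaround, assuming mrjob 0.6 or later: pass the --runner switch through to the job.

    from mrjob.job import MRJob

    class MRRunnerAware(MRJob):

        def configure_args(self):
            super(MRRunnerAware, self).configure_args()
            # Forward the --runner switch so it is visible inside tasks.
            self.pass_arg_through("--runner")

        def mapper(self, _, line):
            # self.options.runner is e.g. "inline", "local", "hadoop" or
            # "emr"; it is None if --runner was not given explicitly.
            mode = "local" if self.options.runner in ("inline", "local") else "cluster"
            yield mode, 1

    if __name__ == "__main__":
        MRRunnerAware.run()
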
0 votes • 1 answer

How to read all the data of Common Crawl from AWS with Java?

I'm totally new to Hadoop and MapReduce programming, and I'm trying to write my first MapReduce program with Common Crawl data. I would like to read all the data of April 2015 from AWS. For example, if I want to download all the data of April…
pi-2r • 1,259 • 4 • 27 • 52
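
Each monthly crawl publishes a warc.paths.gz manifest listing every WARC file, which is the usual starting point whatever language the job itself is written in. A sketch in Python rather than Java; CC-MAIN-2015-18 is believed to be the April 2015 crawl, but verify the ID against the crawl list on commoncrawl.org:

    import gzip
    import urllib.request

    # Manifest of all WARC files in one monthly crawl (assumed crawl ID).
    url = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2015-18/warc.paths.gz"
    with urllib.request.urlopen(url) as resp:
        paths = gzip.decompress(resp.read()).decode().splitlines()

    print(len(paths), "WARC files; first:", paths[0])
    # Each entry can then be fetched as https://data.commoncrawl.org/<path>
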
0 votes • 0 answers

Convert commoncrawl keyword search script to Hadoop EMR script

I have built a keyword search script which runs from EC2 and successfully saves its output to S3. But it is single-threaded, which is why it is slow. I want to run it on EMR using a custom JAR. Can someone please convert this to a Hadoop script so I can…
Sohail Ahmed • 1,667 • 14 • 23
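
The asker's script isn't shown, so no direct conversion is possible, but the usual shape of such a job is: feed the warc.paths manifest as input, let each mapper stream one WARC file, and fan out with -r emr. A hedged sketch with mrjob (the keyword is a placeholder; warcio must be installed on the cluster, e.g. via a bootstrap action):

    import urllib.request
    from mrjob.job import MRJob
    from warcio.archiveiterator import ArchiveIterator

    KEYWORD = b"example"  # hypothetical search term

    class MRKeywordSearch(MRJob):

        def mapper(self, _, warc_path):
            # Each input line is one WARC path from warc.paths.
            url = "https://data.commoncrawl.org/" + warc_path.strip()
            hits = 0
            for record in ArchiveIterator(urllib.request.urlopen(url)):
                if record.rec_type == "response":
                    if KEYWORD in record.content_stream().read():
                        hits += 1
            yield warc_path, hits

    if __name__ == "__main__":
        MRKeywordSearch.run()

Test locally on a short path list first (python mr_keyword_search.py paths.txt), then switch to -r emr.
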
0 votes • 1 answer

How do I log from a mapper? (hadoop with commoncrawl)

I'm using the Common Crawl example code from their "MapReduce for the Masses" tutorial. I'm trying to make modifications to the mapper, and I'd like to be able to log strings to some output. I'm considering setting up a NoSQL db and just pushing my…
kelorek • 6,042 • 6 • 29 • 32
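
There is no need for a side database: Hadoop already captures each task's stderr in its task logs, and counters are aggregated in the job UI. The tutorial code is Java (System.err or log4j works there); the same idea in a Python/mrjob mapper:

    import sys
    from mrjob.job import MRJob

    class MRWithLogging(MRJob):

        def mapper(self, _, line):
            # Lands in this task's stderr log, viewable in the job UI.
            sys.stderr.write("DEBUG: line of length %d\n" % len(line))
            # Aggregated across all tasks and reported with the job status.
            self.increment_counter("debug", "lines_seen", 1)
            yield len(line), 1

    if __name__ == "__main__":
        MRWithLogging.run()
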
-1 votes • 1 answer

Unknown archive format! How can I extract URLs from the WARC file by Jupyter?

I'm trying to extract website URLs from a .WARC (Web ARChive) file from the Common Crawl dataset (commoncrawl.org). After decompressing the file and writing the code to read it, I attached the code: import pandas as pd from warcio.archiveiterator…
Jawaher • 3 • 2
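
warcio raises "Unknown archive format" when the stream doesn't begin with a valid WARC record: typical causes are a truncated download, a file opened in text mode, or a manual decompression step gone wrong. A minimal sketch that passes the .warc.gz straight to ArchiveIterator (file name hypothetical) and collects the URLs into a DataFrame:

    import pandas as pd
    from warcio.archiveiterator import ArchiveIterator

    urls = []
    with open("example.warc.gz", "rb") as stream:  # binary mode, still gzipped
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                urls.append(record.rec_headers.get_header("WARC-Target-URI"))

    df = pd.DataFrame({"url": urls})
    print(df.head())
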
-1 votes • 1 answer

Means of getting data for a given website from the Web Data Commons?

I'm trying to find interesting data inside the Web Data Commons dumps. It is taking days to grep across them on my machine (even in parallel). Is there an index of which websites are covered, and a way to extract data specifically for those sites?
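
Web Data Commons is extracted from Common Crawl, and Common Crawl publishes a per-crawl URL index that can be queried by domain, which avoids grepping the dumps. A sketch against the public index server; the crawl ID is an example:

    import json
    import urllib.request
    from urllib.parse import urlencode

    params = urlencode({"url": "example.com/*", "output": "json"})
    query = "https://index.commoncrawl.org/CC-MAIN-2015-18-index?" + params
    with urllib.request.urlopen(query) as resp:
        for line in resp:
            rec = json.loads(line)
            # filename/offset/length locate the record inside its WARC file.
            print(rec["url"], rec.get("filename"))
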
-1 votes • 1 answer

Jar file for the mentioned import statements is required

import edu.cmu.lemurproject.WarcHTMLResponseRecord; import edu.cmu.lemurproject.WarcRecord; I am using these import statements and getting an error. Could you please suggest the JAR file for the above imports?