1

Common Crawl is a non-profit third party web search engine. http://commoncrawl.org

I'm seeing the API to search Common Crawl for a given domain.

How can I search common crawl for a given search term?

1 Answers1

3

you can't currently search the content of the web pages. There was commonsearch which used the CC datasets but I am not sure how up to date it is. If you are looking for a limited set of keywords you could use Mapreduce or Spark to filter the pages but if you are dealing with an open or arbitrary set of queries then the best approach would be to index the datasets into Elasticsearch or SOLR yourself.

Julien Nioche
  • 4,772
  • 1
  • 22
  • 28
  • Thanks for the answer Julien! To confirm, is there no way to use CommonCrawl to search the web? A script equivalent of entering a term into Google.com from a browser. Is that included as a feature in CommonCrawl capabilities? – longtimelurker42 Dec 12 '17 at 21:19
  • 1
    commoncrawl is a dataset. you can't query it directly but you can use it to build your own search yourself – Julien Nioche Dec 13 '17 at 10:05