1

I am trying to get URLs, titles, and languages from web pages. Fortunately, Common Crawl (CC) offers an index API: https://github.com/webrecorder/pywb/wiki/CDX-Server-API#api-reference. Unfortunately, I have not found a way to get the titles as well.

At the moment I query CC with, for example, http://index.commoncrawl.org/CC-MAIN-2018-47-index?url=www.example.com/*&output=json, which returns the "url" and "languages" fields.
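For reference, the query above can be sketched in Python roughly as follows. This is only an illustration: the helper names (`build_query_url`, `parse_index_lines`, `fetch_index`) are made up for this example, and it assumes the index answers with one JSON object per line when `output=json` is set.

```python
import json
import urllib.request
from urllib.parse import urlencode

INDEX_BASE = "http://index.commoncrawl.org"


def build_query_url(crawl_id, url_pattern):
    """Build the index query URL for one crawl (hypothetical helper)."""
    # Keep "/" and "*" unescaped so the URL matches the form used in the question.
    params = urlencode({"url": url_pattern, "output": "json"}, safe="/*")
    return f"{INDEX_BASE}/{crawl_id}-index?{params}"


def parse_index_lines(body):
    """Parse a response body, assumed to hold one JSON object per line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def fetch_index(crawl_id, url_pattern):
    """Run the query over the network and return the parsed records."""
    with urllib.request.urlopen(build_query_url(crawl_id, url_pattern)) as resp:
        return parse_index_lines(resp.read().decode("utf-8"))
```

Calling `fetch_index("CC-MAIN-2018-47", "www.example.com/*")` should then give a list of dictionaries from which "url" and "languages" can be read; no title field is expected there.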

Is there any way to get the titles through the API, without downloading every WARC file?

Thanks!

Mazzespazze

1 Answer

2

No. The page title isn't indexed in Common Crawl's URL index (neither in the CDX index nor the columnar index).

Sebastian Nagel
  • Thank you! I guess I can download each segment, search it for the titles I need, then delete it and download the next one. Do you think that would be possible? – Mazzespazze Jan 31 '19 at 18:03
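As a possible alternative to downloading whole segments: each index record also carries "filename", "offset" and "length" fields, so a single WARC record can be requested with an HTTP Range header and the title parsed from its HTML. The following is a minimal sketch, not a tested pipeline. It assumes the data is served under `https://data.commoncrawl.org/`, that each record is its own gzip member, and that the payload is HTML decodable as UTF-8; all helper names are hypothetical.

```python
import gzip
import urllib.request
from html.parser import HTMLParser

DATA_BASE = "https://data.commoncrawl.org/"  # assumed download host


def byte_range(offset, length):
    """Range header value covering exactly one record (end is inclusive)."""
    start = int(offset)
    return f"bytes={start}-{start + int(length) - 1}"


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element only."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the whitespace-normalised title, or None if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    title = " ".join("".join(parser.parts).split())
    return title or None


def title_from_record(gz_bytes):
    """Extract the title from one gzipped WARC response record."""
    raw = gzip.decompress(gz_bytes)
    # Layout: WARC headers, blank line, HTTP headers, blank line, payload.
    parts = raw.split(b"\r\n\r\n", 2)
    if len(parts) < 3:
        return None
    # Assumption: UTF-8; real pages may use other encodings.
    return extract_title(parts[2].decode("utf-8", errors="replace"))


def fetch_title(record):
    """Download one record by byte range and return its title."""
    request = urllib.request.Request(
        DATA_BASE + record["filename"],
        headers={"Range": byte_range(record["offset"], record["length"])},
    )
    with urllib.request.urlopen(request) as resp:
        return title_from_record(resp.read())
```

This still means one request per page, so it only helps when the set of URLs is small compared with a full segment.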