Is it possible to get titles from the web version of the Common Crawl API?

Question

I am trying to get URLs, titles and languages of web pages. Fortunately, there is the CC index API: https://github.com/webrecorder/pywb/wiki/CDX-Server-API#api-reference. Sadly, I could not find a way to get the titles as well.

At the moment I query CC with, for example, http://index.commoncrawl.org/CC-MAIN-2018-47-index?url=www.example.com/*&output=json, which returns the "url" and "languages" fields.
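For reference, a minimal Python sketch of that query. It assumes the index answers with one JSON object per line when `output=json` is set; the function names `build_query_url`, `parse_index_response` and `query_index` are made up for illustration and are not part of any CC client library.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Host taken from the query URL above.
INDEX_BASE = "http://index.commoncrawl.org"


def build_query_url(crawl_id, url_pattern):
    """Build the index query URL for one crawl (e.g. "CC-MAIN-2018-47")."""
    # Keep "/" and "*" unescaped so the URL looks like the one in the question.
    params = urlencode({"url": url_pattern, "output": "json"}, safe="/*")
    return f"{INDEX_BASE}/{crawl_id}-index?{params}"


def parse_index_response(body):
    """Parse the response body, assumed to hold one JSON object per line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def query_index(crawl_id, url_pattern):
    """Run the query over HTTP and return the parsed records."""
    with urlopen(build_query_url(crawl_id, url_pattern)) as response:
        return parse_index_response(response.read().decode("utf-8"))
```

Calling `query_index("CC-MAIN-2018-47", "www.example.com/*")` should then return a list of dicts with keys such as "url" and "languages", but no title.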

Is there any way to get the titles by querying CC through the API, without downloading every WARC file?

Thanks!

score 2 · Answer 1 · answered Jan 31 '19 at 12:12

No. The page title isn't indexed in Common Crawl's URL index (neither in the CDX index nor the columnar index).

Sebastian Nagel

Thank you! I guess I can download every single segment and search there for every title I need. Then delete the segment and download the next one. Do you think that would be possible? – Mazzespazze Jan 31 '19 at 18:03
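A possibly lighter alternative to downloading whole segments, sketched here under stated assumptions rather than as the method recommended in the answer: the JSON records returned by the index also carry "filename", "offset" and "length" fields, so a single WARC record can be requested with an HTTP Range header and the title parsed from its HTML. The download host `https://data.commoncrawl.org/`, the UTF-8 decoding and all function names below are assumptions made for illustration.

```python
import gzip
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Assumption: public download host for Common Crawl data files.
DATA_BASE = "https://data.commoncrawl.org/"


class _TitleParser(HTMLParser):
    """Collects the text of the first <title> element."""

    def __init__(self):
        super().__init__()
        self.title = None
        self._in_title = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.title = "".join(self._parts).strip()


def title_from_record(raw_record):
    """Return the page title from one gzip-compressed WARC response record."""
    data = gzip.decompress(raw_record)
    # Layout: WARC headers, blank line, HTTP headers, blank line, HTML body.
    parts = data.split(b"\r\n\r\n", 2)
    if len(parts) < 3:
        return None
    parser = _TitleParser()
    # Assumption: decode as UTF-8; real pages may use other encodings.
    parser.feed(parts[2].decode("utf-8", errors="replace"))
    return parser.title


def fetch_record(index_entry):
    """Fetch only the bytes of one record, using fields from an index entry."""
    start = int(index_entry["offset"])
    end = start + int(index_entry["length"]) - 1
    request = Request(DATA_BASE + index_entry["filename"],
                      headers={"Range": f"bytes={start}-{end}"})
    with urlopen(request) as response:
        return response.read()
```

With an index entry `entry` for an HTML page, `title_from_record(fetch_record(entry))` would give the title or `None`. This still costs one request per page, so it only avoids full downloads, not the per-page work.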
