3

As part of a research project, I need to download as many freely available RDF (Resource Description Framework, *.rdf) files from the web as possible. Which Python libraries or frameworks are well suited to this?

Are there any websites or search engines that can do this? I've tried a Google filetype:RDF search. Google initially reports 6,960,000 results, but as you page through them the count drops to just 205. I wrote a script to screen-scrape the results and download the files, but 205 files are not enough for my research, and I am sure there are more than 205 such files on the web. So I really need a file crawler. Are there any online or offline tools for this purpose, or Python frameworks or sample scripts that could achieve it? Any help in this regard is highly appreciated.

Asanka
  • 31
  • 1
  • 2
  • Good question, I need to do something similar. I know Teleport Pro can crawl for file types, but probably not from google.com. Perhaps there is another website that lists Google results in a form that can be downloaded. Teleport Pro can crawl websites for PDFs; I got 100 MB of MIDI files with it. – bandybabboon Aug 16 '14 at 10:09

5 Answers

1

Crawling RDF content from the Web is no different from crawling any other content. That said, if your question is "what is a good Python web crawler", then you should read this question: Anyone know of a good Python based web crawler that I could use?. If your question is about processing RDF with Python, there are several options, one being RDFLib.
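To illustrate the point that crawling for RDF is ordinary crawling with a filter on the links, here is a minimal standard-library sketch. The names `LinkParser`, `split_links` and `crawl` are made up for this example, and the `.rdf` suffix test is an assumption about how the files are named. It does not honor robots.txt or rate limits, which a real crawler should.

```python
from collections import deque
from html.parser import HTMLParser
from http.client import HTTPException
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag, resolved against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def split_links(html, base_url, suffix=".rdf"):
    """Return (rdf_links, other_links) found in one HTML page."""
    parser = LinkParser(base_url)
    parser.feed(html)
    rdf, other = [], []
    for link in parser.links:
        # Compare only the URL path, so query strings do not hide the suffix.
        path = urlparse(link).path.lower()
        (rdf if path.endswith(suffix) else other).append(link)
    return rdf, other


def crawl(seed_url, max_pages=50):
    """Breadth-first crawl from seed_url; returns the set of .rdf URLs seen."""
    queue, seen, found = deque([seed_url]), {seed_url}, set()
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()
        pages += 1
        try:
            with urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except (OSError, ValueError, HTTPException):
            continue  # unreachable or malformed URL: skip it
        rdf, other = split_links(html, url)
        found.update(rdf)
        for link in other:
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return found
```

Calling `crawl("http://example.org/")` (a placeholder seed) would return the `.rdf` URLs discovered within the first 50 pages; downloading them is then a plain `urlopen` per URL.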

Community
  • 1
  • 1
MarcoS
  • 13,386
  • 7
  • 42
  • 63
0

I know I'm a bit late with this answer, but for future searchers: http://sindice.com/ is a great index of RDF documents.

Sweet Burlap
  • 346
  • 3
  • 9
0

Teleport Pro. It may not be able to copy from Google directly (too big), but it can probably handle proxy sites that return Google results, and I know for a fact that I could download 10,000 PDFs within a day if I wanted to. It has file-type specifiers and many other options.

bandybabboon
  • 2,210
  • 1
  • 23
  • 33
0

Here's one workaround:

Get "Download Master" from the Chrome extensions, or a similar program.

Search on Google or another search engine, with Google set to show 100 results per page.

Select "show all files".

Type your file extension, .rdf, and press Enter.

Press download.

You can get 100 files per click, which is not bad.
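Since the steps above select files by extension only, a quick sanity check on each download can help. The sketch below is a heuristic, not a validator: the helper name `looks_like_rdf_xml` is made up, and it assumes the file is RDF/XML with an `rdf:RDF` root element, so it will reject other RDF serializations and any RDF/XML document that omits that wrapper.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the RDF vocabulary, used to recognize the rdf:RDF root.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def looks_like_rdf_xml(data):
    """True if data (bytes) parses as XML with an rdf:RDF root element."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        return False  # not well-formed XML at all
    return root.tag == "{%s}RDF" % RDF_NS
```

For a downloaded file, something like `looks_like_rdf_xml(open(path, "rb").read())` would do, where `path` is wherever the file was saved.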

bandybabboon
  • 2,210
  • 1
  • 23
  • 33
0

Did you notice text along the lines of "Google has hidden similar results, click here to show all results" at the bottom of one of the pages? That might help.

ron
  • 9,262
  • 4
  • 40
  • 73