At the moment I am crawling a large number of predefined sites, looking for a very small number of particular documents of interest. Importantly, I am not crawling these sites to create my own search engine: it is specifically for retrieving the documents.

All of the major search engines have an API that I don't mind paying for, but their terms seem to assume that you are using the API to build your own search engine.

For example: Yahoo BOSS TOS at http://info.yahoo.com/legal/us/yahoo/boss/tou/ . B.1(a) says "You are permitted to use the Services only for the purpose of incorporating and displaying Results from the Services as part of a Search Product deployed on Your Offering". So I can only use it for my own search engine.

Google only has the Custom Search Engine stuff, which again is not what I need.

Bing's API seems to be closer to what I need, but its TOS require that certain pieces of information not be removed, etc. Then again, it doesn't require me to use it only for implementing my own search engine (from what I can see).

Am I reading too much into this or is there a search engine that allows me to essentially use the results of their crawl of certain sites instead of my own for my product? Again, the search results themselves are not my product: it's what I do with the data in the documents that is.

Thanks for any tips.

Narcissus

1 Answer

You will not want to use a search engine to do this.

Search engines will not index all content on a site. If a site has lots of similar pages, for example, they will be thrown out. Sites with a large number of pages will not be completely indexed.

You could potentially miss lots of pages this way.

Keep it crawling!

P.S. Crawling individual websites often violates their TOS. If you care about that, also take care to adhere to robots.txt.
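For the robots.txt part, here is a minimal sketch of one way a crawler might check a URL before fetching it, using Python's standard `urllib.robotparser`. The sample rules, the `example-crawler` user agent, and the `example.com` URLs are all made up for illustration; a real crawler would load each site's actual robots.txt (for instance with `set_url()` and `read()`) instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def make_checker(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = make_checker(SAMPLE_ROBOTS_TXT)

# "example-crawler" is an assumed user-agent name, not a real crawler.
allowed = parser.can_fetch("example-crawler", "https://example.com/docs/report.pdf")
blocked = parser.can_fetch("example-crawler", "https://example.com/private/report.pdf")

# Crawl-delay is a non-standard directive; crawl_delay() returns None if absent.
delay = parser.crawl_delay("example-crawler")

print(allowed, blocked, delay)
```

Sleeping for the reported delay between requests to the same host is one simple way to keep the load down.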

Byron Whitlock
  • Thanks for the reply Byron. FWIW, we absolutely do adhere to robots.txt and we do a lot to keep the load down on the sites we hit (we can go very slowly as we have plenty of sites to crawl in parallel). The documents we are pulling are very few and far between and when they're there, they're not thrown out by the search engines (as they don't have similar versions... it's just the nature of the documents themselves). As I say, thanks, but I guess I still need to know if anyone knows if we can do what we want. – Narcissus Sep 13 '13 at 14:05