12

If you've done any serious research into search APIs, you know that most of them carry a long list of TOS/TOU restrictions that make them nearly impossible to use in anything but the most trivial applications.

Bing's 2.0 API, Yahoo Search BOSS, Google Places, Google AJAX Search (dead), et al., are far too restrictive for us. We need to run a finite and relatively small number of queries (perhaps 500k) one time only, storing specific data from the results for use within our application.

For example, we need to match up business names with their target websites (we have written the algorithm to make a 'best guess' from a set of results if necessary; we just need a vanilla result set). We also need to match an address to the company in question.
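To make the 'best guess' step concrete, here is a minimal sketch of what such a matcher could look like. It is not the algorithm mentioned above (which isn't shown); the function names `normalize`, `domain_label`, and `best_guess` are hypothetical, and the scoring (string similarity between the business name and the first label of each result's host name) is only one assumed heuristic.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import urlparse


def normalize(text):
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def domain_label(url):
    """Return the first host label after stripping a leading 'www.'.

    Naive by design: 'en.wikipedia.org' yields 'en', not 'wikipedia'.
    """
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host.split(".")[0]


def best_guess(business_name, result_urls):
    """Pick the result URL whose domain label best resembles the name.

    Returns None when the result set is empty.
    """
    target = normalize(business_name)

    def score(url):
        return SequenceMatcher(None, target, normalize(domain_label(url))).ratio()

    return max(result_urls, key=score, default=None)


if __name__ == "__main__":
    # Hypothetical result set for one query.
    results = [
        "http://www.yelp.com/biz/acme-plumbing",
        "http://www.acmeplumbing.com/",
        "http://en.wikipedia.org/wiki/Plumbing",
    ]
    print(best_guess("Acme Plumbing, Inc.", results))
```

A real matcher would likely also weigh result rank, page titles, and the address we are trying to match, but the input it needs is the same: a plain list of result URLs per query.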

Unfortunately, I can find ZERO search APIs that will allow us to fire off queries in a programmatic, non-user-initiated manner.

We're even quite eager to give someone cold, hard cash for access to this kind of data; Google, Bing, Yahoo, and others simply don't seem to want our money (as evidenced by their TOSes)...

Any thoughts?

rinogo
  • Hi, everyone! I see that this has received a close vote. If there is a SO community that would be more appropriate for this question, please let me know. I honestly looked through them all, and the original SO proper seemed to be the most relevant. Thanks! :) – rinogo Aug 31 '11 at 23:35
  • Have you tried Blekko? What do you mean by "I can find ZERO search API's that will allow us to fire off queries in a programmatic, non-user-initiated manner"? There was a discussion about Google's Custom Search Engine being able to search the whole web (by adding a site and removing it later). You can also buy "credits" for the Custom Search Engine, although some users found a limitation even in that case. Anyway, I understand your point about the limitations of the current search APIs. Google is the best search engine; even though others compete, nobody has a larger index. – sw. Sep 01 '11 at 03:50
  • Thanks so much for your response, sw. Prompted by your suggestion, I checked out Blekko, and their TOU is also quite restrictive. (For the time being, however, there is a glimmer of hope for the Blekko API: http://dev-ops.net/2011/02/02/blekko-search-engine-with-some-nice-features/ ) Google's CSE won't work for us; we prefer a long-term legitimate solution rather than a short-term, legally questionable patch. We have money and are willing to part with it! :) Why are none of the big names willing to accommodate entities with legitimate business needs like ours? – rinogo Sep 02 '11 at 17:15
  • Well, I even wrote an article about it: http://blog.databigbang.com/google-search-no-api/ since there is a business opportunity there. I think in your case you will have to combine [many] data sources, but it will not be straightforward to mix/clean/etc. the data. I'd be interested in discussing it by chat since it's a very interesting subject. I am now on #bigdata on freenode. – sw. Sep 02 '11 at 20:33

2 Answers

3

A freely accessible index of 5 billion web pages, their page rank, their link graphs and other metadata, hosted on Amazon EC2.

http://commoncrawl.org/

Their Terms of Service (or TOU) are pretty reasonable and unrestrictive too:

http://commoncrawl.org/about/terms-of-use/

seanieb
  • Haven't looked into this one much (it might satisfy the requirements, not sure); I thought I'd add it as a comment: http://80legs.com/ – rinogo Mar 04 '14 at 17:51
0

If you know some Visual Basic, I'd suggest playing around with Bing Ad Intelligence. It's a free Excel plugin, and all you need to use it is a free Microsoft account.

The query limit is 20,000 words per query. You can get information on Clicks, Impressions, CTR, CPC, Average Bid and Total Cost. The query limit is a little lower if you use the more advanced keyword research features.

Donald