10

On the Stackoverflow podcast this week, Jeff mentioned that in 2004 he wrote a script which queried Google with 110,000 English words and collected a database containing the number of hits for each word. They use this on Stackoverflow e.g. for the "Related" list on the right-hand side of each question page.

Since creating one of these today with a similar script would be difficult (as Joel mentioned, "at 30,000 words you get a knock at your door"), I was wondering if anyone knows of a more up-to-date, free database of Google word frequencies (e.g. for IT words which have surely changed since then such as jquery, ruby, azure, etc.).

hippietrail
Edward Tanguay

4 Answers

5

A quick Google search(!) turns up a few hits. This link looks promising:

But it's not targeted at IT words.

Mitch Wheat
3

It may be late to answer this, but I can propose a different approach. Instead of getting the "number of hits" from Google, compute an approximation of it yourself: get a big collection of text pages (a corpus) and count the occurrences of each word in it. I have done this with Wikipedia. There is a dump of all wiki pages; you just need to write a parser to extract the text and count the words. The result is a list of far more than 110K words (at least 2M-3M). If you really need the numbers from Google search results, you can query Google for a sample of words and then normalize the computed values to match the Google values. I hope this helps.
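A minimal sketch of the counting and normalization steps, assuming the page text has already been extracted from the dump into plain strings (the dump parser itself is not shown). The function names, the sample pages, and the reference hit counts are made up for illustration, and the single scale factor is just one crude way to do the normalization:

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters, optionally with one apostrophe part.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")

def count_words(texts):
    """Count lowercase word occurrences across an iterable of plain-text strings."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    return counts

def fit_scale(corpus_counts, reference_hits):
    """Estimate one multiplicative factor from corpus counts to reference hit counts.

    Uses the ratio of totals over the sampled words that occur in the corpus.
    """
    shared = [w for w in reference_hits if corpus_counts.get(w, 0) > 0]
    if not shared:
        raise ValueError("no sampled word occurs in the corpus")
    total_hits = sum(reference_hits[w] for w in shared)
    total_counts = sum(corpus_counts[w] for w in shared)
    return total_hits / total_counts

def estimate_hits(corpus_counts, scale):
    """Scale every corpus count into the range of the reference hit counts."""
    return {word: count * scale for word, count in corpus_counts.items()}

# Stand-in for text extracted from a dump (hypothetical data).
pages = [
    "Ruby is a language. jQuery is a library.",
    "Ruby and jQuery and Azure.",
]
counts = count_words(pages)

# Hypothetical hit counts for a small sample of words, not real search data.
sample_hits = {"ruby": 2000, "jquery": 4000}
scale = fit_scale(counts, sample_hits)
estimates = estimate_hits(counts, scale)
```

With a real corpus, a regression on log counts would probably fit better than a single ratio, since raw word frequencies span several orders of magnitude.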

1

According to Google, you may send 50,000 queries per day per IP. I don't really think it is illegal to split them between your friends.

I had a similar problem with the queries-per-day-per-IP limit, but we solved it with a totally different approach.

Skuta
0

You can split the list between your friends/colleagues and use sufficiently large timeouts so that you don't exceed 50,000 requests per day per IP, and then merge the results. I'm not sure about the legality of this approach, but the probability of having Google people "knocking at your door" using this method is pretty low.

NOTE: edited according to data provided by Skuta
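A rough sketch of the bookkeeping only (splitting, pacing, merging), with no network calls. It takes the 50,000-per-day figure quoted above at face value; the function names and the three-way split are assumptions for illustration:

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60

def split_round_robin(words, n_workers):
    """Deal the word list out into n_workers nearly equal sublists."""
    return [words[i::n_workers] for i in range(n_workers)]

def min_delay_seconds(daily_limit):
    """Smallest pause between requests that keeps one IP at or under the limit."""
    return SECONDS_PER_DAY / daily_limit

def days_needed(n_words, n_workers, daily_limit):
    """Whole days required when each worker stays within its daily limit."""
    per_worker = math.ceil(n_words / n_workers)
    return math.ceil(per_worker / daily_limit)

def merge_results(partials):
    """Combine per-worker {word: hits} dictionaries into one."""
    merged = {}
    for partial in partials:
        merged.update(partial)
    return merged

# Placeholder word list of the size mentioned in the question.
words = [f"word{i}" for i in range(110_000)]
chunks = split_round_robin(words, 3)
delay = min_delay_seconds(50_000)
days = days_needed(len(words), 3, 50_000)
```

At that limit, the pause works out to 86,400 / 50,000, or about 1.7 seconds between requests per IP, and three people could cover 110,000 words within a single day.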

Boris Gorelik