C++ libraries for web ranking and search engines

Question

Can anybody recommend some libraries that contain web ranking algorithms such as PageRank and HITS? Thank you.

I seriously doubt such libraries exist. AFAIK, PageRank et al are secret algorithms. — Violet Giraffe, Nov 10 '11 at 08:44
Note that I have retagged this question so it is more likely to lead to related posts that could contain useful information. You can click on those tags and browse them, or mix them. For instance: http://stackoverflow.com/questions/tagged/c%2b%2b%20search-engine — HostileFork says dont trust SE, Nov 10 '11 at 16:20

score 1 · Accepted Answer · answered Nov 11 '11 at 13:59

I guess you are referring to the canonical PageRank algorithm as published in the original PageRank paper. People nowadays often use "PageRank" to refer to the actual current Google algorithm for search.

If that is really the case, a PageRank implementation is not that difficult to find and use. Searching on Google turns up a good number of implementations, including one in Python, for example.
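Since the question asks about C++, here is a minimal sketch of the canonical power-iteration PageRank, not a reference implementation. It assumes the normalized form in which the ranks sum to 1, and it spreads the rank of pages without out-links uniformly over all pages; both are common conventions rather than anything stated above. The function name `pagerank`, the adjacency-list input format, and the default parameters are hypothetical choices made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// out_links[u] lists the pages that page u links to (assumed input format).
// Returns one score per page; scores sum to 1 under the conventions above.
std::vector<double> pagerank(const std::vector<std::vector<std::size_t>>& out_links,
                             double damping = 0.85,
                             int max_iters = 100,
                             double tol = 1e-10) {
    const std::size_t n = out_links.size();
    if (n == 0) return {};

    std::vector<double> rank(n, 1.0 / n);  // start from a uniform distribution
    for (int it = 0; it < max_iters; ++it) {
        // Rank held by dangling pages (no out-links) is shared by everyone.
        double dangling = 0.0;
        for (std::size_t u = 0; u < n; ++u)
            if (out_links[u].empty()) dangling += rank[u];

        // Teleportation term plus the redistributed dangling rank.
        std::vector<double> next(n, (1.0 - damping) / n + damping * dangling / n);

        // Each page passes an equal share of its damped rank along every out-link.
        for (std::size_t u = 0; u < n; ++u) {
            if (out_links[u].empty()) continue;
            const double share = damping * rank[u] / out_links[u].size();
            for (std::size_t v : out_links[u]) {
                assert(v < n);  // link targets must be valid page indices
                next[v] += share;
            }
        }

        // Stop once the total (L1) change between iterations is below tol.
        double delta = 0.0;
        for (std::size_t i = 0; i < n; ++i) delta += std::abs(next[i] - rank[i]);
        rank.swap(next);
        if (delta < tol) break;
    }
    return rank;
}
```

Calling `pagerank(graph)` on an adjacency list returns a vector that can be compared element by element against the output of another ranking method run on the same graph.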

For the HITS algorithm there's pseudocode on Wikipedia. There's also a Perl implementation.
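The HITS iteration translates to C++ in much the same way. The following is a rough sketch of the usual hub/authority update with normalization to unit Euclidean length after each step. The names `HitsScores`, `hits`, and `normalize_l2`, the adjacency-list input, and the fixed iteration count are assumptions made for this example, and it scores the whole graph rather than a query-focused subgraph.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct HitsScores {
    std::vector<double> hub;
    std::vector<double> authority;
};

// Scale v to unit Euclidean length; an all-zero vector is left unchanged.
static void normalize_l2(std::vector<double>& v) {
    double sum_sq = 0.0;
    for (double x : v) sum_sq += x * x;
    if (sum_sq == 0.0) return;
    const double norm = std::sqrt(sum_sq);
    for (double& x : v) x /= norm;
}

// out_links[u] lists the pages that page u links to (assumed input format).
HitsScores hits(const std::vector<std::vector<std::size_t>>& out_links, int iters = 50) {
    const std::size_t n = out_links.size();
    HitsScores s{std::vector<double>(n, 1.0), std::vector<double>(n, 1.0)};

    for (int it = 0; it < iters; ++it) {
        // Authority update: sum the hub scores of all pages linking to v.
        std::vector<double> auth(n, 0.0);
        for (std::size_t u = 0; u < n; ++u) {
            for (std::size_t v : out_links[u]) {
                assert(v < n);  // link targets must be valid page indices
                auth[v] += s.hub[u];
            }
        }
        normalize_l2(auth);

        // Hub update: sum the new authority scores of all pages u links to.
        std::vector<double> hub(n, 0.0);
        for (std::size_t u = 0; u < n; ++u)
            for (std::size_t v : out_links[u]) hub[u] += auth[v];
        normalize_l2(hub);

        s.authority.swap(auth);
        s.hub.swap(hub);
    }
    return s;
}
```

For a graph where pages 0 and 1 both link to page 2, page 2 ends up as the only authority and pages 0 and 1 as equally strong hubs.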

I'd also suggest CLucene as something to start experimenting with.

But CLucene doesn't have a manual. Do you know how I can use it in C++? — orezvani, Dec 17 '11 at 06:29

score 0 · Answer 2 · answered Nov 10 '11 at 16:15

Unless you work for Google, there aren't many good ways of finding out the specifics of their page ranking algorithm...which changes from time to time. Wikipedia outlines some of the basics:

http://en.wikipedia.org/wiki/PageRank

Other people write lengthy articles:

http://www.smashingmagazine.com/2007/06/05/google-pagerank-what-do-we-really-know-about-it/

If you are interested in the kinds of techniques that are involved in writing a search engine, there are several topics. For instance, there is "web crawling" and how to write programs that visit web sites and grab their contents...and determining when to visit the sites again to see if they've changed:

http://en.wikipedia.org/wiki/Web_crawler

Once you have a bunch of data on your machine(s) to analyze and search, the subject area to study is called "Information Retrieval" (or "IR"):

http://en.wikipedia.org/wiki/Information_retrieval

It's a fairly young field, but a lot of work is being done in it. Wikipedia has a list of "free search engine software":

http://en.wikipedia.org/wiki/Category:Free_search_engine_software

I'd suggest that if you're new to this then it might be best to start with figuring out how to use something like Lucene to provide a search box on a website you have. Then dig in and see how it works. It has been ported to C++ if that is important to you:

http://clucene.sourceforge.net/

answered by HostileFork says dont trust SE


Thank you for the great information, but I am focused on Web Ranking, which is a part of Web Information Retrieval. I need implementations of ranking algorithms such as PageRank and others in order to compare their results with mine. – orezvani Nov 10 '11 at 18:41
You can try those Free Search Engine Software links and maybe be able to get at some kind of data files showing rankings they calculate. But the only tractable way to compare against Google's methods would be to make sample data sets and then either use Google Site Search or buy a Google Search Appliance...feed in various terms and compare what their top hit choices were to yours run on the same data: http://www.google.com/enterprise/search/gsa.html – HostileFork says dont trust SE Nov 10 '11 at 18:50
There are some famous ranking algorithms, such as PageRank and HITS, that have been published in many papers. I need to compare my results with them! I want their implementations! Do you have any idea? – orezvani Nov 10 '11 at 19:32
You've seen what everyone else has seen...general descriptions published in papers. You've also noticed there is no published source code alongside those papers. Unless you work for Google, reverse engineer the Google Search Appliance, or participate in some kind of industrial espionage...you will not have access to the source for their search algorithms. You can treat commercial search engines as a black box and look at the results, or you can study the internals of open source engines. That's what you've got. – HostileFork says dont trust SE Nov 11 '11 at 18:47
