Web Crawling and PageRank

Question

I'm a computer science student and fairly inexperienced with web crawling and building search engines. I am currently using the latest version of OpenSearchServer to crawl several thousand domains. With the built-in search engine creation tool, I get results that are related to my query, but they are ranked using a vector space model of the documents rather than PageRank or a similar link-based algorithm. As a result, the top results are only marginally helpful, while higher-quality results from sites such as Wikipedia are buried on the second page.

Is there some way to run a crude PageRank algorithm in OpenSearchServer? If not, is there a similarly easy-to-use open source package that does this?

Thanks for the help! This is my first time doing anything like this, so any feedback is greatly appreciated.

score 1 · Answer 1 · answered Feb 18 '15 at 19:12

I am not familiar with OpenSearchServer, but I know that most students working on search engines use Lucene or Indri. If you read papers on novel approaches to document search, you will find that the majority of them use one of these two APIs. Lucene is more flexible than Indri when it comes to defining different ranking algorithms. I suggest you take a look at both and see whether they suit your purpose.
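For reference, here is a minimal sketch of the kind of crude PageRank the question asks about, computed offline over the crawled link graph by power iteration. It does not use Lucene, Indri, or OpenSearchServer; the function name `pagerank`, the `links` dictionary format, and the default damping factor of 0.85 are assumptions made for this illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Crude PageRank by power iteration.

    links maps each page to the list of pages it links to
    (a hypothetical format; adapt it to whatever the crawler exports).
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start from a uniform distribution.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages without outlinks is spread evenly over all pages.
        dangling = sum(rank[page] for page in pages if not links.get(page))
        new_rank = {
            page: (1.0 - damping) / n + damping * dangling / n
            for page in pages
        }
        # Every page passes an equal share of its rank to each page it links to.
        for page, targets in links.items():
            if not targets:
                continue
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank
```

The resulting scores could then be stored per document and combined with the text relevance score at query time, in whichever engine you end up using.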

score 0 · Answer 2 · answered Mar 12 '15 at 10:01

As you mention, the web crawl template of OpenSearchServer uses a search query whose relevancy is based on the vector space model. But if you use the latest version (v1.5.11), the score also takes the number of backlinks into account.

You can change the weight given to the backlink-based part of the score. By default it is set to 1.

[Screenshot: Scoring panel of OpenSearchServer]
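To illustrate what such a weight does, here is a minimal sketch of blending a text relevance score with a backlink count. This is not OpenSearchServer's actual scoring formula; the function names, the additive blend, and the logarithmic damping of the backlink count are all assumptions made for the example.

```python
import math


def blended_score(relevance, backlinks, backlink_weight=1.0):
    """Combine a text relevance score with a backlink-based boost.

    Hypothetical formula: log1p keeps pages with huge backlink counts
    from completely drowning out the text relevance.
    """
    return relevance + backlink_weight * math.log1p(backlinks)


def rank_results(results, backlink_weight=1.0):
    """Sort (url, relevance, backlinks) tuples by blended score, best first."""
    return sorted(
        results,
        key=lambda r: blended_score(r[1], r[2], backlink_weight),
        reverse=True,
    )
```

With a weight of 0 the order depends on text relevance alone; raising the weight lets heavily linked pages move up the list.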

We are currently working on providing more control over relevance. This will be available in future versions of OpenSearchServer.
