
I want to use OpenSearchServer (http://www.open-search-server.com/) to create a production-level web search engine. Is there any other good free software for building a search engine? I want to crawl millions of websites.

Gopa Soft

2 Answers


(Disclosure: The author of this post is affiliated with the website/product mentioned herein)

OpenSearchServer is based on Lucene. In addition, it contains a powerful web crawler able to index millions of pages. I am the founder of this software, and I use it on projects that index thousands of websites.

However, indexing millions of websites is another story. You will need to distribute the crawl over several servers to build a distributed index.
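To make "distribute the crawl" concrete, here is a minimal Java sketch of one common approach, hash-based partitioning of hosts across crawler nodes; the node addresses are invented for the example:

    import java.util.List;

    public class CrawlSharding {
        // Invented node addresses; in a real deployment these would be
        // the servers that each own one crawl shard of the web.
        static final List<String> CRAWLER_NODES = List.of(
                "http://crawler-1:8080", "http://crawler-2:8080", "http://crawler-3:8080");

        // Hash on the host rather than the full URL so all pages of a
        // site land on the same node, which keeps per-site politeness
        // rules (crawl delay, robots.txt) local to one crawler.
        static String nodeFor(String host) {
            int shard = Math.floorMod(host.hashCode(), CRAWLER_NODES.size());
            return CRAWLER_NODES.get(shard);
        }

        public static void main(String[] args) {
            System.out.println(nodeFor("example.com"));        // same node every time
            System.out.println(nodeFor("stackoverflow.com"));
        }
    }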

You then use another pool of servers to handle the search requests from your users. It is possible to use several instances of OpenSearchServer to do that.
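On the query side, a front end fans each query out to all index shards and merges their partial results ("scatter-gather"). Here is a minimal sketch of the merge step only; the Hit record and the sample data stand in for whatever each OpenSearchServer instance would return over HTTP:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    public class ScatterGatherMerge {
        // Placeholder for one search hit as a shard would return it (Java 16+ record).
        record Hit(String url, float score) {}

        // Combine the per-shard top-k lists into one global top-k,
        // ordered by descending relevance score.
        static List<Hit> mergeTopK(List<List<Hit>> perShard, int k) {
            List<Hit> all = new ArrayList<>();
            perShard.forEach(all::addAll);
            all.sort(Comparator.comparingDouble(Hit::score).reversed());
            return all.subList(0, Math.min(k, all.size()));
        }

        public static void main(String[] args) {
            List<Hit> shard1 = List.of(new Hit("http://a.example/", 4.2f),
                                       new Hit("http://b.example/", 1.1f));
            List<Hit> shard2 = List.of(new Hit("http://c.example/", 3.7f));
            mergeTopK(List.of(shard1, shard2), 2)
                    .forEach(h -> System.out.println(h.score() + "  " + h.url()));
        }
    }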

Whatever software you choose, you must select your hardware carefully, especially the storage. On a large index, search query performance is tied to storage performance. A large RAID pool or SSD disks are welcome.

Emmanuel Keller
  • Will query speed be slow on a big index or not? – Gopa Soft Apr 06 '12 at 15:17
  • It depends on the size. One server with RAID or SSD and 16 GB of RAM can handle tens of millions of documents with fast query execution times (< 500 ms). Do you have any idea of the total number of indexed pages? – Emmanuel Keller Apr 06 '12 at 15:38
  • I have approximately 50,000 websites. The documents will number in the billions. – Gopa Soft Apr 06 '12 at 15:54
  • I am also getting the error "com.jaeksoft.searchlib.web.ServletException: java.lang.NullPointerException" when I send a request to the server – Gopa Soft Apr 06 '12 at 15:57
  • It is better to use the forum on SourceForge for technical issues. Mainly, you must optimize the index before using it for queries. A good practice is to duplicate the index using the replication feature. The web crawler works on the first index; every hour or day, depending on your choice, you copy the first index to a second one using the scheduler. A typical scenario is: stop the web crawler, optimize the index, run the replication, start the web crawler (a schematic version of this cycle is sketched after these comments). – Emmanuel Keller Apr 06 '12 at 16:32
  • And what about your question, "Do you have any idea of the total number of indexed pages?"? – Gopa Soft Apr 06 '12 at 16:40
  • You can operate hundreds of millions of documents on one (big) server: 32 GB of RAM and a RAID 10 pool of SSD disks. To reach one billion documents, you have to distribute the crawl over several independent servers. You will then use OpenSearchServer 1.3 (which will be unveiled next week) with the distributed request handler to handle the distributed queries. – Emmanuel Keller Apr 06 '12 at 19:13
  • Because you are advocating for the product, even though you are not linking to it, I have added the required disclosure to this post. You'll notice I also added the required disclosure to all of the posts where you link to the project. Please keep this in mind for the future. – Andrew Barber Oct 17 '12 at 12:37
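To make the replication cycle described in the comments concrete, here is a schematic Java sketch; the interface and every method name below are invented for illustration (in OpenSearchServer itself these steps are configured as scheduler jobs, not coded by hand):

    public final class ReplicationCycle {
        // Hypothetical client interface standing in for the real
        // scheduler tasks (crawler start/stop, optimize, replication).
        interface SearchServer {
            void stopWebCrawler();
            void optimizeIndex(String index);
            void replicate(String sourceIndex, String targetIndex);
            void startWebCrawler();
        }

        // Queries keep hitting the "serving" copy while the crawl index
        // is optimized and copied over it, so users never search a
        // half-built index.
        static void runCycle(SearchServer server) {
            server.stopWebCrawler();                // 1. pause writes
            server.optimizeIndex("crawl");          // 2. merge segments for fast reads
            server.replicate("crawl", "serving");   // 3. copy to the query-facing index
            server.startWebCrawler();               // 4. resume crawling
        }
    }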

The most popular open-source search-engine software packages are Nutch and Lucene. Nutch is a web page crawler; here is the main page: https://nutch.apache.org/

Lucene is an index server; here is the main page: https://lucene.apache.org/

You can use the two together to build the search engine.
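As a taste of the Lucene half, here is a minimal sketch against the Lucene 8/9-era Java API (the index path, field names, and sample text are invented for the example) that indexes one crawled page and then searches it:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class LuceneDemo {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(Paths.get("demo-index"));

            // Index one crawled page: the URL as an exact-match stored field,
            // the page text as an analyzed, searchable field.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                doc.add(new StringField("url", "http://example.com/", Field.Store.YES));
                doc.add(new TextField("content", "A sample page about building search engines", Field.Store.YES));
                writer.addDocument(doc);
            }

            // Search the index and print matching URLs with their scores.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Query query = new QueryParser("content", new StandardAnalyzer()).parse("search engines");
                for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                    System.out.println(hit.score + "  " + searcher.doc(hit.doc).get("url"));
                }
            }
        }
    }

In a full pipeline, Nutch would supply the fetched pages that feed the indexing loop in place of the hard-coded sample document.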

yaronli