
Basic requirements:

  • Should be able to index sources like MediaWiki, Confluence, SharePoint, GitHub Enterprise, and Askbot.
  • Should be reasonably smart about de-duplicating results (the lack of this is one reason Confluence search is so painful).
  • Should definitely incorporate ranking heuristics such as how many pages link to a document and whether the search terms appear in the document's title. A way for users to downrank particular results would be a bonus.
  • Should be somewhat tunable (e.g., prefer Confluence over SharePoint, blacklist certain paths).
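
To make the requirements above concrete, here is a minimal Python sketch of the kind of ranking I have in mind. Everything in it is an assumption for illustration only: the `Doc` fields, the weights in `SOURCE_WEIGHT`, the `BLOCKED_PREFIXES` list, and the exact-match fingerprint used for de-duplication are hypothetical and not taken from any particular product.

```python
import hashlib
import math
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Doc:
    url: str
    source: str          # e.g. "confluence", "sharepoint" (hypothetical labels)
    title: str
    body: str
    inbound_links: int = 0


# Hypothetical tuning knobs: prefer Confluence over SharePoint, block some paths.
SOURCE_WEIGHT = {"confluence": 1.5, "sharepoint": 0.8}
BLOCKED_PREFIXES = ("/archive/",)


def score(doc, terms):
    """Combine body hits, title hits and inbound links into one number."""
    body = doc.body.lower()
    title = doc.title.lower()
    body_hits = sum(body.count(t) for t in terms)
    title_hits = sum(1 for t in terms if t in title)
    if body_hits + title_hits == 0:
        return 0.0  # no match at all, regardless of link count
    base = body_hits + 3.0 * title_hits + math.log1p(doc.inbound_links)
    return base * SOURCE_WEIGHT.get(doc.source, 1.0)


def fingerprint(doc):
    """Hash of the whitespace- and case-normalized body (exact duplicates only)."""
    normalized = " ".join(doc.body.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def search(docs, query):
    """Return matching docs, best first, with duplicates and blocked paths removed."""
    terms = query.lower().split()
    best = {}  # fingerprint -> (score, doc)
    for doc in docs:
        if urlparse(doc.url).path.startswith(BLOCKED_PREFIXES):
            continue
        s = score(doc, terms)
        if s <= 0:
            continue
        key = fingerprint(doc)
        if key not in best or s > best[key][0]:
            best[key] = (s, doc)
    ranked = sorted(best.values(), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked]
```

A real engine would use near-duplicate detection rather than an exact hash, but the shape of the problem is the same: score, filter, collapse duplicates, sort.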

Are there off-the-shelf products that can do the above? FOSS projects? Are there FOSS projects that can provide the basics for the above and are easy to extend or build a frontend for?

3 Answers


You can try Apache Solr; it's a great tool.

According to the website:

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites.
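
As a rough illustration of how the tuning asked for in the question might map onto Solr, the sketch below builds a query URL using the eDisMax parser's `qf` (per-field boosts), `bq` (boost query) and `fq` (filter query) parameters. The core name `intranet` and the field names `title`, `body`, `source` and `space` are assumptions; they depend entirely on how your schema is defined.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, user_query):
    """Build a Solr /select URL with title boosting and a source preference."""
    params = [
        ("q", user_query),
        ("defType", "edismax"),          # extended DisMax query parser
        ("qf", "title^3 body"),          # title matches count three times as much
        ("bq", "source:confluence^2"),   # prefer Confluence results (assumed field)
        ("fq", "-space:archive"),        # exclude an assumed 'archive' space
        ("hl", "true"),                  # hit highlighting
        ("wt", "json"),
    ]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical local core; 8983 is Solr's default port.
url = build_solr_query("http://localhost:8983/solr/intranet", "deploy guide")
```

Link-count ranking and de-duplication are not covered by these parameters and would need to be handled at index time.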

Raúl Juárez
  • Nutch+Solr aren't hitting it off as well as I'd have hoped. I'm still playing around with them, but it's a bit tricky without much familiarity with either tool (and the documentation for Nutch seems to be quite schizophrenic). In general, would you recommend going this route, or does it make sense to roll my own crawling tool for Solr? – Jun-Dai Bates-Kobashigawa Aug 27 '13 at 23:14
  • @Jun-DaiBates-Kobashigawa I would recommend using Nutch. AFAIK it is the best open source web crawler, and I don't think it is going away. – Raúl Juárez Aug 28 '13 at 14:47
  • Vote for Elasticsearch. – boj May 19 '14 at 21:46

You could try a bundled distribution of Solr and related tools, such as OpenESP or Constellio. Expect to spend some time tuning the sources and imports. ManifoldCF, which is bundled with OpenESP, is an open source connector/crawler framework for plugging in connectors to various systems like those you describe, and several connectors come out of the box.

Cominvent

You can try Moogle. It is open source and easy to deploy on Windows with IIS. Its interface looks like Google's, so it should feel somewhat familiar. See http://techstuff.smsjuju.com/intranet-search-engine/