What's a good database for full text search on a large number of relatively small text documents? (C# backend)

Question

I am designing a system that will ingest large numbers of documents. I want to support full-text search on the document contents as well as on other metadata (keyword/sentiment analysis). How the keyword/sentiment analysis is done is beyond the scope of this question, but this sort of metadata needs to live alongside the searchable documents.

The main assumptions are:

- By "large" I mean a few hundred thousand documents initially, with the goal of reaching millions.
- The documents are 0-15 KB each.
- The documents are plain text (UTF-8).
- I want to be able to run full-text searches over the document contents.
- Everything is hosted on a single machine, with no cloud or distributed services.
- New documents are inserted continuously (roughly 1-2 per second).
- Text searches are ad hoc.
- More complicated query use cases would be, for example:
  - show me all documents about 'Widgets' that are positive within a given date range
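To make the last query above concrete, here is a minimal sketch of the data shape involved: an inverted index for the text, with sentiment and date stored next to each document. It is written in Python only for brevity and is not the C# stack discussed here. The names `Doc`, `TinyIndex`, and `search` are hypothetical, and a real system would presumably leave tokenizing and ranking to a search library such as Lucene.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    doc_id: int
    text: str
    sentiment: str  # e.g. "positive"; how it is computed is out of scope
    created: date


class TinyIndex:
    """Hypothetical in-memory index: term -> set of document ids."""

    def __init__(self):
        self.docs = {}
        self.postings = defaultdict(set)

    def add(self, doc):
        self.docs[doc.doc_id] = doc
        # Naive tokenizer: lowercase runs of word characters.
        for term in re.findall(r"\w+", doc.text.lower()):
            self.postings[term].add(doc.doc_id)

    def search(self, term, sentiment=None, start=None, end=None):
        """Return ids of docs containing `term` that pass the metadata filters."""
        hits = []
        for doc_id in sorted(self.postings.get(term.lower(), ())):
            doc = self.docs[doc_id]
            if sentiment is not None and doc.sentiment != sentiment:
                continue
            if start is not None and doc.created < start:
                continue
            if end is not None and doc.created > end:
                continue
            hits.append(doc_id)
        return hits


index = TinyIndex()
index.add(Doc(1, "Widgets are great", "positive", date(2014, 10, 1)))
index.add(Doc(2, "Widgets broke again", "negative", date(2014, 10, 2)))
index.add(Doc(3, "Love these widgets", "positive", date(2013, 1, 5)))

result = index.search(
    "Widgets",
    sentiment="positive",
    start=date(2014, 1, 1),
    end=date(2014, 12, 31),
)
print(result)  # [1]
```

The point of the sketch is that the text match and the metadata filters are answered from the same record, which is why the metadata needs to live alongside the searchable documents.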

C# is the language of choice for fetching documents, processing them, and storing and retrieving them from the database, so C# bindings are a big plus, or at least an easy way to bridge the gap.

Naive Approach

A naive approach is to use MySQL along with Apache Lucene, with the document contents either stored as files that are referenced from the database, or stored directly in a Text field in the database.

I could then use a C# port of Lucene, such as Lucene.Net.

My concern with this approach is whether the size of my data and what I want to do with it are too much for MySQL. I know premature optimization is silly, and that people often think they need some 'big data' solution when a regular SQL database would do just fine. My other main concern is that this approach would be clunky and cumbersome to develop compared to some potential alternatives.
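For a rough sense of scale, the figures from the assumptions can be worked through. The sketch below uses Python as a calculator. The 15 KB worst-case document size and the insert rates come from the question; the document counts of 300,000, 1 million, and 10 million are assumed readings of "a few 100,000" and "millions".

```python
MAX_DOC_BYTES = 15 * 1024        # upper bound taken from "0-15kb"
SECONDS_PER_DAY = 24 * 60 * 60


def raw_text_gib(doc_count, doc_bytes=MAX_DOC_BYTES):
    """Worst-case raw text size in GiB, ignoring index and storage overhead."""
    return doc_count * doc_bytes / 1024**3


def docs_per_day(rate_per_second):
    """Documents ingested per day at a steady insert rate."""
    return rate_per_second * SECONDS_PER_DAY


# Assumed document counts; only the orders of magnitude come from the question.
for count in (300_000, 1_000_000, 10_000_000):
    print(f"{count:>10,} docs -> at most {raw_text_gib(count):.1f} GiB of raw text")

for rate in (1, 2):
    print(f"{rate}/s -> {docs_per_day(rate):,} docs per day")
```

Under these assumptions the worst case is on the order of 14 GiB of raw text at 1 million documents and about 143 GiB at 10 million, before any index overhead, with 86,400 to 172,800 new documents per day.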

Alternatives

From doing some research, one alternative that looks promising is using CouchDB with Lucene. I have come across two libraries that solve this:

What I'm looking for:

I haven't done a whole lot with this size of data. I wonder:

Does this amount of data and use case merit a non-relational database?
Should documents live in the database, or as files with references in the database?
Is there a database/full-text-search technology that is particularly suited for this scenario that I haven't considered?

Maybe you can repost this on [Database Administrators](http://dba.stackexchange.com/) or have the post moved there. It is a good and valid question to ask, in the right place. — TaW, Oct 30 '14 at 13:12

score 1 · Accepted Answer · answered Oct 29 '14 at 22:54

I would suggest you look into RavenDB. It uses Lucene and is 100% .NET. It has text analyzers for full-text indexing and fuzzy searches.

stricq
