I am designing a system that aims to ingest large numbers of documents. I want to support full-text search on the document contents, as well as on other metadata (keyword/sentiment analysis). How the keyword/sentiment analysis is done is beyond the scope of this question, but it is worth noting that this sort of metadata needs to live alongside the searchable documents.
The main assumptions are:
- by large I mean initially a few hundred thousand documents, with the goal of reaching millions
- the documents are 0-15 KB each
- these documents are text (utf-8)
- desire to be able to full-text-search document contents
- hosted on a single machine, no cloud/distributed services
- new documents are inserted continuously (roughly 1-2 per second)
- ad hoc text searches
- more complicated queries would combine text search with metadata, for example:
  - show me all documents about 'Widgets' that are positive and fall within a given date range
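To make that last query concrete, here is a toy sketch in Python (not the C#/Lucene code I would actually write). It assumes each document carries a sentiment label and a date next to its text; the `Doc` and `TinyIndex` names, the label values, and the sample documents are all made up for illustration. It only shows the shape of the query: a term lookup narrowed by two metadata filters.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    doc_id: int
    text: str
    sentiment: str   # hypothetical label: "positive", "negative" or "neutral"
    published: date


class TinyIndex:
    """Toy inverted index: maps each lowercased term to the ids of docs containing it."""

    def __init__(self):
        self.docs = {}
        self.postings = defaultdict(set)

    def add(self, doc):
        self.docs[doc.doc_id] = doc
        for term in re.findall(r"\w+", doc.text.lower()):
            self.postings[term].add(doc.doc_id)

    def search(self, term, sentiment, start, end):
        # Text match first, then filter the hits on the metadata fields.
        hits = self.postings.get(term.lower(), set())
        return sorted(
            doc_id
            for doc_id in hits
            if self.docs[doc_id].sentiment == sentiment
            and start <= self.docs[doc_id].published <= end
        )


# Made-up sample data, purely for illustration.
index = TinyIndex()
index.add(Doc(1, "Widgets are great", "positive", date(2012, 3, 1)))
index.add(Doc(2, "Widgets broke again", "negative", date(2012, 3, 2)))
index.add(Doc(3, "Love these widgets", "positive", date(2011, 1, 5)))

results = index.search("widgets", "positive", date(2012, 1, 1), date(2012, 12, 31))
print(results)
```

Whatever technology I end up with needs to answer this kind of combined query efficiently, which is why the metadata has to be reachable from the same place as the text index.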
C# is the language of choice for fetching documents, processing, storing and retrieving from db. So having C# bindings is a big plus. Or at least an easy way to bridge the gap.
Naive Approach
A naive approach is to use MySQL along with Apache Lucene. The document contents would either be stored as files with references to them in the DB, or be stored directly in a TEXT column in the database.
I could then use a C# port of Lucene, such as Lucene.Net, for the full-text index.
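The two storage options could be sketched roughly like this. This is only an illustration, using Python's built-in `sqlite3` as a stand-in for MySQL; the table and column names (`documents`, `body`, `file_path`, and so on) and the sample rows are hypothetical. Each row holds the metadata plus either the text itself or a path to a file, and the full-text index would live separately in Lucene, keyed by `id`.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE documents (
        id        INTEGER PRIMARY KEY,
        published TEXT NOT NULL,   -- ISO date, e.g. '2012-03-01'
        sentiment TEXT,            -- hypothetical label from the analysis step
        body      TEXT,            -- option A: contents stored in the database
        file_path TEXT,            -- option B: contents stored as a file on disk
        CHECK ((body IS NULL) != (file_path IS NULL))  -- exactly one of the two
    );
    CREATE INDEX idx_documents_published ON documents (published);
    """
)

# Option A: the text lives in the row.
conn.execute(
    "INSERT INTO documents (published, sentiment, body) VALUES (?, ?, ?)",
    ("2012-03-01", "positive", "Widgets are great"),
)
# Option B: the row only references a file (made-up path).
conn.execute(
    "INSERT INTO documents (published, sentiment, file_path) VALUES (?, ?, ?)",
    ("2012-03-02", "negative", "/data/docs/2.txt"),
)

# Metadata-only query: which documents fall in a date range, and is the body inline?
rows = conn.execute(
    "SELECT id, body IS NOT NULL FROM documents "
    "WHERE published BETWEEN ? AND ? ORDER BY id",
    ("2012-01-01", "2012-12-31"),
).fetchall()
print(rows)
```

Either way, a search would hit the Lucene index first and then use the returned ids to pull metadata (and contents) from the database.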
My concern with this approach is whether the size of my data, and what I want to do with it, is too much for MySQL. I know it is silly to optimize prematurely, and that people often think they need some 'big data' solution when a regular SQL database would do just fine. My other main concern is that this approach would be clunky and cumbersome to develop compared to some potential alternatives.
Alternatives
From doing some research, one alternative that looks promising is using CouchDB with Lucene. I have come across two libraries that solve this:
What I'm looking for:
I haven't done much work with data of this size. I wonder:
- Does this amount of data and use case merit a non-relational database?
- Should documents live in the database, or as files with references in the database?
- Is there a database/full-text-search technology that is particularly suited for this scenario that I haven't considered?