
I have a pretty small collection of string values in memory (around 8400 records with an average of 10 words each).

What I'm trying to find out is whether there's a library that, when I search for a string within that collection, returns the matching values and can also assign some kind of weight to the results.

This is what I'm trying to do; let's say that I have these records in a List in memory:

  • Department Store General Manager
  • General and Operations Manager
  • General Manager
  • Restaurant Generally Managers
  • Restaurant General Manager

Let's say that I'm working on a method that receives a search string and analyzes that collection in order to retrieve the results:

List<string> SearchJotitles("General Manager")

I want something that will return all the records that contain the words General AND Manager. So far it should be easy: I could do it with regular expressions.

But the tricky part is that I want to apply some weighting rules along these lines:

  • "OK: the third record is the best match because it's an EXACT match."
  • "The first and last records should be next because they have the two words with no distance between them."
  • "The second record should be next because it has the two exact words, but with other words in between."
  • "The 4th record should be last because it has only a partial match of both words."

THAT's the kind of logic I want to apply.

I know there are some libraries like Lucene.NET or Sphinx. I'm not ruling them out; I'm just not convinced they're worth using for such a small in-memory collection.

In the worst-case scenario, I'll write an IComparer implementation for the entities, but I want to know if there's something out there I could already use.

Thanks and regards,

Silvestre

1 Answer


In this particular example the number of records is small, but that does not reduce the complexity of full-text search.

If you have only 5 records, it might be a good idea to implement a simple Levenshtein distance (or find an implementation online), tokenise all phrases and write your own matching algorithm (word distance, maybe synonyms, etc.).
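To give an idea of what such a hand-rolled approach might look like, here is a minimal sketch of a tiered scorer built on tokenisation plus Levenshtein distance. Everything in it is an assumption made for illustration: the names TitleRanker, Score, Search and maxEdits are hypothetical, the tiers (exact, adjacent phrase, all words present, fuzzy) are just one reading of the rules in the question, and the edit-distance threshold of 2 is arbitrary.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not from any library): ranks titles against a query in tiers.
public static class TitleRanker
{
    // Lower-cased words of a phrase, split on spaces.
    static string[] Tokenize(string text) =>
        text.ToLowerInvariant().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

    // Classic dynamic-programming Levenshtein distance between two words.
    public static int Levenshtein(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;
        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            var tmp = prev; prev = curr; curr = tmp; // reuse the two row buffers
        }
        return prev[b.Length];
    }

    // True if the query words appear in the title consecutively and in order.
    static bool ContainsSequence(string[] title, string[] query)
    {
        for (int start = 0; start + query.Length <= title.Length; start++)
            if (query.Select((w, k) => title[start + k] == w).All(x => x)) return true;
        return false;
    }

    // Assumed tiers: 4 exact, 3 adjacent phrase, 2 all words present, 1 fuzzy, 0 no match.
    public static int Score(string title, string query, int maxEdits = 2)
    {
        var t = Tokenize(title);
        var q = Tokenize(query);
        if (q.Length == 0) return 0;
        if (t.SequenceEqual(q)) return 4;
        if (ContainsSequence(t, q)) return 3;
        if (q.All(w => t.Contains(w))) return 2;
        if (q.All(w => t.Any(x => Levenshtein(x, w) <= maxEdits))) return 1;
        return 0;
    }

    // Matching titles, best tier first; ties keep input order (LINQ ordering is stable).
    public static List<string> Search(IEnumerable<string> titles, string query) =>
        titles.Select(t => new { Title = t, Score = Score(t, query) })
              .Where(x => x.Score > 0)
              .OrderByDescending(x => x.Score)
              .Select(x => x.Title)
              .ToList();
}
```

Calling TitleRanker.Search(titles, "General Manager") on the five sample titles from the question should put "General Manager" first and "Restaurant Generally Managers" last. Note that a fixed edit-distance threshold can produce false positives on short words, which is exactly the kind of tuning a custom algorithm tends to need.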

On the other hand, Lucene.NET gives you that out of the box. You can use RAMDirectory to store the index in memory. Most importantly, you don't have to spend hours trying to figure out why your custom algorithm does not work as it should. Why reinvent the wheel?

An alternative: are you using a SQL database in your application? If so, it may be worth leveraging the full-text search built into modern SQL databases.

b2zw2a
  • Thank you @plentysmart; I'm gonna give Lucene.NET a chance: I heard it's kinda hard to configure but I'll look into it. I thought about the full-text search, yes, but it's not THAT powerful if you want to customize the way it ranks the results, so I discarded that. – Silvestre Nov 10 '14 at 21:44