
I have a list of perhaps 100,000 strings in memory in my application. I need to find the top 20 strings that contain a certain keyword (case insensitive). That's easy to do, I just run the following LINQ.

from s in stringList
where s.ToLower().Contains(searchWord.ToLower())
select s

However, I have a distinct feeling that I could do this much faster, and I need to find a way to do that, because I look things up in this list multiple times per second.

Niels Brinch

4 Answers


Finding substrings (as opposed to complete matches) is surprisingly hard. There is nothing built in to help you with this. I suggest you look into suffix tree data structures, which can be used to find substrings efficiently.

You can pull searchWord.ToLower() out into a local variable to save tons of string operations, btw. You can also pre-calculate the lower-cased version of stringList. If you can't precompute, at least use s.IndexOf(searchWord, StringComparison.InvariantCultureIgnoreCase) != -1, which saves on the expensive ToLower calls.

You can also slap an .AsParallel() on the query.
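A minimal sketch combining these tweaks, assuming stringList and searchWord as in the question (the class and method names here are mine, not the OP's):

```csharp
// Case-insensitive IndexOf avoids allocating a lowered copy of every
// string, AsParallel() spreads the scan across cores, and Take(20)
// reflects the question's "top 20".
using System;
using System.Collections.Generic;
using System.Linq;

public class SearchDemo
{
    public static List<string> Find(List<string> stringList, string searchWord)
    {
        return stringList
            .AsParallel()
            .Where(s => s.IndexOf(searchWord, StringComparison.InvariantCultureIgnoreCase) >= 0)
            .Take(20)
            .ToList();
    }

    public static void Main()
    {
        var stringList = new List<string> { "Bicycle", "car", "Cycle path" };
        foreach (var s in Find(stringList, "cycle"))
            Console.WriteLine(s);
    }
}
```

Note that AsParallel() makes no ordering guarantee, so the 20 results returned are not deterministic when more than 20 strings match.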

usr
  • Thanks. Creating the list with lower case strings is a good idea which I will implement. I was hoping for another, better, way of doing this, but tweaks to my own solution is also good. Thanks. – Niels Brinch May 15 '12 at 18:53
  • I just looked up Trie and it looks like what I'm looking for. Even found someone implemented it for me! :) http://geekyisawesome.blogspot.com/2010/07/c-trie.html – Niels Brinch May 15 '12 at 18:55
  • A Trie won't save you when all 100,000 strings contain the same letter 'a'. Your Trie will have 100,000 children under the short paths that exist in most strings. It's still the same reverse index method. – Daniel Baktiar May 15 '12 at 18:55
  • The Trie's cost is linear in the input size, not quadratic. – usr May 15 '12 at 18:56
  • Daniel, I guess I should try to limit the searchWord to at least 3 characters or so ... – Niels Brinch May 15 '12 at 20:13
  • Hi usr, don't just go by the textbook: a Trie's cost is linear in the input size only if you index from the first character. If you index every occurrence of a sequence inside the word, it is still combinatorial, because that input has to be pre-generated. For example, take the words "abcd", "bce", "ghibc". If you only index the Trie to recognize "ab*", yes, it is linear. But if you want to index "*bc*" you either do a full scan or pre-index with combinatorial explosion. – Daniel Baktiar May 17 '12 at 02:59
  • You're right. I misremembered. I meant Suffix Trees. They can be constructed in linear time and space. – usr May 17 '12 at 09:05

Another option, although it would require a fair amount of memory, would be to precompute something like a suffix array (a list of positions within the strings, sorted by the strings to which they point).

http://en.wikipedia.org/wiki/Suffix_array

This would be most feasible if the list of strings you're searching against is relatively static. The entire set of string indexes could be stored in a single array of tuples (indexOfString, positionInString), upon which you would perform a binary search, using String.Compare(keyword, 0, target, targetPos, keyword.Length).

So if you had 100,000 strings of average 20 length, you would need 100,000 * 20 * 2*sizeof(int) of memory for the structure. You could cut that in half by packing both indexOfString and positionInString into a single 32 bit int, for example with positionInString in the lowest 12 bits, and the indexOfString in the remaining upper bits. You'd just have to do a little bit fiddling to get the two values back out. It's important to note that the structure contains no strings or substrings itself. The strings you're searching against exist only once.

This would basically give you a complete index, and allow finding any substring very quickly (binary search over the index the suffix array represents), with a minimum of actual string comparisons.

If memory is dear, a simple optimization of the original brute-force algorithm would be to precompute a dictionary of unique chars, and assign ordinal numbers to represent each. Then precompute a bit array for each string, with the bits set for each unique char contained within the string. Since your strings are relatively short, there should be a fair amount of variability among the resulting BitArrays (it wouldn't work well if your strings were very long). You then simply compute the BitArray of your search keyword, and only search for the keyword in those strings where keywordBits & targetBits == keywordBits. If your strings are preconverted to lower case and use just the English alphabet, the BitArray would likely fit within a single int. So this would require a minimum of additional memory, be simple to implement, and would allow you to quickly filter out strings within which you will definitely not find the keyword. This can be a useful optimization: each individual string search is fast, but you have so many of them to do with the brute-force approach.
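A minimal sketch of that filter, using a single int (one bit per letter) under the stated assumption of lower-cased English strings; the names here are mine, not the OP's:

```csharp
// A string can only contain the keyword if its letter mask covers
// every bit of the keyword's letter mask.
using System;

public class BitFilterDemo
{
    public static int LetterMask(string s)
    {
        int mask = 0;
        foreach (char c in s)
            if (c >= 'a' && c <= 'z')
                mask |= 1 << (c - 'a');   // bit 0 = 'a', bit 25 = 'z'
        return mask;
    }

    public static void Main()
    {
        string[] strings = { "bicycle", "car", "cycle path" };
        string keyword = "cycle";
        int keywordBits = LetterMask(keyword);
        foreach (var s in strings)
            // cheap filter first; only run the real substring search on survivors
            if ((keywordBits & LetterMask(s)) == keywordBits && s.Contains(keyword))
                Console.WriteLine(s);
    }
}
```

In real use the per-string masks would be precomputed once, so each query costs one AND and one compare per string before any character comparison happens.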

EDIT For those interested, here is a basic implementation of the initial solution I proposed. I ran tests using 100,000 randomly generated strings of lengths described by the OP. Although it took around 30 seconds to construct and sort the index, once made, the speed of searching for keywords 3000 times was 49,805 milliseconds for brute force, and 18 milliseconds using the indexed search, so a couple thousand times faster. If you rarely build the list, then my simple, but relatively slow method of initially building the suffix array should be sufficient. There are smarter ways to build it that are faster, but would require more coding than my basic implementation below.

// little test console app
static void Main(string[] args) {
    var list = new SearchStringList(true);
    list.Add("Now is the time");
    list.Add("for all good men");
    list.Add("Time now for something");
    list.Add("something completely different");
    while (true) {
        string keyword = Console.ReadLine();
        if (keyword.Length == 0) break;
        foreach (var pos in list.FindAll(keyword)) {
            Console.WriteLine(pos.ToString() + " =>" + list[pos.ListIndex]);
        }
    }
}
~~~~~~~~~~~~~~~~~~
// file for the class that implements a simple suffix array
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Collections;

namespace ConsoleApplication1 {
    public class SearchStringList {
        private List<string> strings = new List<string>();
        private List<StringPosition> positions = new List<StringPosition>();
        private bool dirty = false;
        private readonly bool ignoreCase = true;

        public SearchStringList(bool ignoreCase) {
            this.ignoreCase = ignoreCase;
        }

        public void Add(string s) {
            if (s.Length > 255) throw new ArgumentOutOfRangeException("s", "string too big.");
            this.strings.Add(s);
            this.dirty = true;
            for (byte i = 0; i < s.Length; i++) this.positions.Add(new StringPosition(strings.Count-1, i));
        }

        public string this[int index] { get { return this.strings[index]; } }

        public void EnsureSorted() {
            if (dirty) {
                this.positions.Sort(Compare);
                this.dirty = false;
            }
        }

        public IEnumerable<StringPosition> FindAll(string keyword) {
            var idx = IndexOf(keyword);
            while ((idx >= 0) && (idx < this.positions.Count)
                && (Compare(keyword, this.positions[idx]) == 0)) {
                yield return this.positions[idx];
                idx++;
            }
        }

        private int IndexOf(string keyword) {
            EnsureSorted();

            // binary search
            // When the keyword appears multiple times, this should
            // point to the first match in positions. The following
            // positions could be examined for additional matches
            int minP = 0;
            int maxP = this.positions.Count - 1;
            while (maxP > minP) {
                int midP = minP + ((maxP - minP) / 2);
                if (Compare(keyword, this.positions[midP]) > 0) {
                    minP = midP + 1;
                } else {
                    maxP = midP;
                }
            }
            if ((maxP == minP) && (Compare(keyword, this.positions[minP]) == 0)) {
                return minP;
            } else {
                return -1;
            }
        }

        private int Compare(StringPosition pos1, StringPosition pos2) {
            int len = Math.Max(this.strings[pos1.ListIndex].Length - pos1.StringIndex, this.strings[pos2.ListIndex].Length - pos2.StringIndex);
            return String.Compare(strings[pos1.ListIndex], pos1.StringIndex, this.strings[pos2.ListIndex], pos2.StringIndex, len, ignoreCase);
        }

        private int Compare(string keyword, StringPosition pos2) {
            return String.Compare(keyword, 0, this.strings[pos2.ListIndex], pos2.StringIndex, keyword.Length, this.ignoreCase);
        }

        // Packs index of string, and position within string into a single int. This is
        // set up for strings no greater than 255 bytes. If longer strings are desired,
        // the code for the constructor, and extracting  ListIndex and StringIndex would
        // need to be modified accordingly, taking bits from ListIndex and using them
        // for StringIndex.
        public struct StringPosition {
            public static StringPosition NotFound = new StringPosition(-1, 0);
            private readonly int position;
            public StringPosition(int listIndex, byte stringIndex) {
                this.position = (listIndex < 0) ? -1 : (listIndex << 8) | stringIndex;
            }
            public int ListIndex { get { return (this.position >= 0) ? (this.position >> 8) : -1; } }
            public byte StringIndex { get { return (byte) (this.position & 0xFF); } }
            public override string ToString() {
                return ListIndex.ToString() + ":" + StringIndex;
            }
        }
    }
}
  • Thanks a lot, seems like a good solution, much along the same lines as the Trie suggestion. – Niels Brinch May 16 '12 at 14:45
  • @NielsBrinch - I've added code for a simple implementation of the initial solution I proposed, in case you're interested. – hatchet - done with SOverflow May 16 '12 at 17:07
  • So it just returns one match and then a 'brute force'-ish approach can be used to get the next matches right after that, until it doesn't match any longer? – Niels Brinch May 18 '12 at 11:34
  • @NielsBrinch - In my basic implementation, it returns the first match (the first element in positions that points to a matching string). But it would be easy to modify it to return all the matches by just looking at the following elements of the positions array, adding StringPositions to the result until the first non-match is found, and then returning the list of results. That works because positions is in order, sorted by the strings they point to, and the binary search positions to the first match. – hatchet - done with SOverflow May 18 '12 at 12:28
  • @NielsBrinch - Also, I've edited the code in the answer to allow specifying whether you want case sensitive matching or not, and it will respect that without converting all the strings ToLower. – hatchet - done with SOverflow May 18 '12 at 12:48
  • Thanks hatchet, I am implementing your solution now. Thanks a lot. – Niels Brinch May 19 '12 at 14:30
  • I removed the EnsureSorted() call from the IndexOf method and instead make sure I call it after adding new strings to the list ... seems it could take quite some time to ensure sorting each time, right? – Niels Brinch May 19 '12 at 14:49
  • Hmm, it only works to get one record, because the next one is sorted by the letter it starts with, but the search word might be in the middle of the string... – Niels Brinch May 19 '12 at 15:03
  • @NielsBrinch - no, the EnsureSorted will only re-sort the list if it needs to be resorted. That's what the dirty flag is for. So once it's sorted the first time, and you don't add any more strings, the call to EnsureSorted will take almost no time. I've modified the code to return all matches and included a little test app code so you can see that it really does find all the matches. – hatchet - done with SOverflow May 19 '12 at 18:39
  • I've run your console app and can see that it DOES in fact work. I am surprised, because it seems that if you create simpler strings "a 123" "b 678" and "c 123" and then search for "123" it would first find "a 123" and the next element would be "b 678" which doesn't match... – Niels Brinch May 20 '12 at 10:24
  • I implemented your solution in my app and it works perfectly. Thanks a lot. Just still don't understand WHY it works :) – Niels Brinch May 20 '12 at 19:47

In that case, what you need is a reverse index.

If you are willing to pay, you can use a database-specific full-text search index, tuning the indexing to cover every subset of words.

Alternatively, you can use a very successful open source project that can achieve the same thing.

You need to pre-index the strings using a tokenizer and build the reverse index file. We had a similar use case in Java, where we needed a very fast autocomplete over a big set of data.

You can take a look at Lucene.NET which is a port of Apache Lucene (in Java).

If you are willing to ditch LINQ, you can use NHibernate Search. (wink).

Another option is to implement the pre-indexing in memory yourself, preprocessing the pattern so that unneeded scanning is bypassed; take a look at the Knuth-Morris-Pratt algorithm.
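A minimal sketch of Knuth-Morris-Pratt substring search (the names here are mine; this is the textbook algorithm, not code from the question):

```csharp
// The failure table lets the scan resume without re-examining text
// characters, so each string is searched in O(n + m) time rather than
// O(n * m) worst case.
using System;

public class KmpDemo
{
    public static int KmpIndexOf(string text, string pattern)
    {
        if (pattern.Length == 0) return 0;
        // failure[i] = length of the longest proper prefix of pattern
        // that is also a suffix of pattern.Substring(0, i + 1)
        var failure = new int[pattern.Length];
        for (int i = 1, k = 0; i < pattern.Length; i++)
        {
            while (k > 0 && pattern[i] != pattern[k]) k = failure[k - 1];
            if (pattern[i] == pattern[k]) k++;
            failure[i] = k;
        }
        for (int i = 0, k = 0; i < text.Length; i++)
        {
            while (k > 0 && text[i] != pattern[k]) k = failure[k - 1];
            if (text[i] == pattern[k]) k++;
            if (k == pattern.Length) return i - k + 1;  // match ends at i
        }
        return -1;
    }

    public static void Main()
    {
        Console.WriteLine(KmpIndexOf("bicycle", "cycle")); // 2
    }
}
```

For short keywords over short strings the constant factors matter more than the asymptotics, so this mainly pays off when keywords or strings get longer.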

Daniel Baktiar
  • Hi usr, for the database option, the tuning is database-specific. For Lucene you can customize the tokenizer to emit every subset sequence of characters, which produces a larger index file. – Daniel Baktiar May 15 '12 at 18:39
  • 1
    Are you sure? For a single(!) 100 char string this would produce about 5000 tokens (quadratic)! This is impractical. Also you wouldn't need to use Lucene for that. Just use a dictionary. But as I said, there are too many possible substrings. – usr May 15 '12 at 18:40
  • You can limit it, say, to only index 3 chars or more. When an occurrence repeats, the index only produces one entry. I agree there have to be some trade-offs. – Daniel Baktiar May 15 '12 at 18:43
  • Ahh, exactly what I also asked you in the other thread, down to the 3 chars. Well done, answering before I even ask :) – Niels Brinch May 15 '12 at 20:15

There's one approach that would be a lot faster, but it would mean looking for exact word matches rather than using the Contains functionality.

Basically, if you have the memory for it you could create a Dictionary of words which also reference some sort of ID (or IDs) for the strings in which the word is found.

So the Dictionary might be of type Dictionary<string, List<int>>. The benefit here, of course, is that you're consolidating a lot of words into a smaller collection. And the Dictionary is very fast with lookups, since it's built on a hash table.
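A minimal sketch of that word index (the names and the delimiter set are my assumptions, not the OP's data):

```csharp
// Map each lower-cased word to the indexes of the strings containing
// it. Lookup becomes a single hash probe instead of scanning 100,000
// strings.
using System;
using System.Collections.Generic;

public class WordIndexDemo
{
    public static Dictionary<string, List<int>> BuildIndex(IList<string> strings)
    {
        var index = new Dictionary<string, List<int>>();
        for (int i = 0; i < strings.Count; i++)
        {
            var words = strings[i].ToLowerInvariant()
                .Split(new[] { ' ', ',', '.', ';' }, StringSplitOptions.RemoveEmptyEntries);
            foreach (var word in words)
            {
                if (!index.TryGetValue(word, out var ids))
                    index[word] = ids = new List<int>();
                if (ids.Count == 0 || ids[ids.Count - 1] != i)  // skip duplicate IDs
                    ids.Add(i);
            }
        }
        return index;
    }

    public static void Main()
    {
        var index = BuildIndex(new[] { "Now is the time", "Time now for something" });
        Console.WriteLine(string.Join(",", index["time"])); // 0,1
    }
}
```

The index is built once; per-query cost is then independent of the number of strings, at the price of only matching whole words.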

Now if this isn't what you're looking for you might search for in-memory full-text searching libraries. SQL Server supports full-text searching using indexing to speed up the process beyond traditional wildcard searches. But a pure in-memory solution would surely be faster. This still may not give you the exact functionality of a wildcard search, however.

Steve Wortham
  • Thanks, yes it would have to be a complete substring, not just a word search. Thanks though. – Niels Brinch May 15 '12 at 18:46
  • @NielsBrinch - I'm curious. Is the wildcard search (as you have written your code) really the desired result? I mean, I used to code wildcard searches all the time until I discovered the advantages of full text searching. There are 3 big advantages full text searching libraries give you: indexing (for speed), ranking (for relevance), and the ability to search for alternate versions of a root word. – Steve Wortham May 15 '12 at 18:57
  • Optimally I would be fine with searching on actual English words, but it DOES need to match even when the word is embedded in another word: 'bicycle' should be found by 'cycle'. – Niels Brinch May 15 '12 at 20:15