Another option, although it would require a fair amount of memory, would be to precompute something like a suffix array (a list of positions within the strings, sorted by the strings to which they point).
http://en.wikipedia.org/wiki/Suffix_array
This would be most feasible if the list of strings you're searching against is relatively static. The entire list of string indexes could be stored in a single array of tuples(indexOfString, positionInString), upon which you would perform a binary search, using String.Compare(keyword, 0, target, targetPos, keyword.Length)
.
So if you had 100,000 strings of average 20 length, you would need 100,000 * 20 * 2*sizeof(int) of memory for the structure. You could cut that in half by packing both indexOfString and positionInString into a single 32 bit int, for example with positionInString in the lowest 12 bits, and the indexOfString in the remaining upper bits. You'd just have to do a little bit fiddling to get the two values back out. It's important to note that the structure contains no strings or substrings itself. The strings you're searching against exist only once.
This would basically give you a complete index, and allow finding any substring very quickly (binary search over the index the suffix array represents), with a minimum of actual string comparisons.
If memory is dear, a simple optimization of the original brute force algorithm would be to precompute a dictionary of unique chars, and assign ordinal numbers to represent each. Then precompute a bit array for each string with the bits set for each unique char contained within the string. Since your strings are relatively short, there should be a fair amount of variability of the resuting BitArrays (it wouldn't work well if your strings were very long). You then simply compute the BitArray or your search keyword, and only search for the keyword in those strings where keywordBits & targetBits == keywordBits
. If your strings are preconverted to lower case, and are just the English alphabet, the BitArray would likely fit within a single int. So this would require a minimum of additional memory, be simple to implement, and would allow you to quickly filter out strings within which you will definitely not find the keyword. This might be a useful optimization since string searches are fast, but you have so many of them to do using the brute force search.
EDIT For those interested, here is a basic implementation of the initial solution I proposed. I ran tests using 100,000 randomly generated strings of lengths described by the OP. Although it took around 30 seconds to construct and sort the index, once made, the speed of searching for keywords 3000 times was 49,805 milliseconds for brute force, and 18 milliseconds using the indexed search, so a couple thousand times faster. If you rarely build the list, then my simple, but relatively slow method of initially building the suffix array should be sufficient. There are smarter ways to build it that are faster, but would require more coding than my basic implementation below.
// little test console app
static void Main(string[] args) {
var list = new SearchStringList(true);
list.Add("Now is the time");
list.Add("for all good men");
list.Add("Time now for something");
list.Add("something completely different");
while (true) {
string keyword = Console.ReadLine();
if (keyword.Length == 0) break;
foreach (var pos in list.FindAll(keyword)) {
Console.WriteLine(pos.ToString() + " =>" + list[pos.ListIndex]);
}
}
}
~~~~~~~~~~~~~~~~~~
// file for the class that implements a simple suffix array
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Collections;
namespace ConsoleApplication1 {
public class SearchStringList {
private List<string> strings = new List<string>();
private List<StringPosition> positions = new List<StringPosition>();
private bool dirty = false;
private readonly bool ignoreCase = true;
public SearchStringList(bool ignoreCase) {
this.ignoreCase = ignoreCase;
}
public void Add(string s) {
if (s.Length > 255) throw new ArgumentOutOfRangeException("string too big.");
this.strings.Add(s);
this.dirty = true;
for (byte i = 0; i < s.Length; i++) this.positions.Add(new StringPosition(strings.Count-1, i));
}
public string this[int index] { get { return this.strings[index]; } }
public void EnsureSorted() {
if (dirty) {
this.positions.Sort(Compare);
this.dirty = false;
}
}
public IEnumerable<StringPosition> FindAll(string keyword) {
var idx = IndexOf(keyword);
while ((idx >= 0) && (idx < this.positions.Count)
&& (Compare(keyword, this.positions[idx]) == 0)) {
yield return this.positions[idx];
idx++;
}
}
private int IndexOf(string keyword) {
EnsureSorted();
// binary search
// When the keyword appears multiple times, this should
// point to the first match in positions. The following
// positions could be examined for additional matches
int minP = 0;
int maxP = this.positions.Count - 1;
while (maxP > minP) {
int midP = minP + ((maxP - minP) / 2);
if (Compare(keyword, this.positions[midP]) > 0) {
minP = midP + 1;
} else {
maxP = midP;
}
}
if ((maxP == minP) && (Compare(keyword, this.positions[minP]) == 0)) {
return minP;
} else {
return -1;
}
}
private int Compare(StringPosition pos1, StringPosition pos2) {
int len = Math.Max(this.strings[pos1.ListIndex].Length - pos1.StringIndex, this.strings[pos2.ListIndex].Length - pos2.StringIndex);
return String.Compare(strings[pos1.ListIndex], pos1.StringIndex, this.strings[pos2.ListIndex], pos2.StringIndex, len, ignoreCase);
}
private int Compare(string keyword, StringPosition pos2) {
return String.Compare(keyword, 0, this.strings[pos2.ListIndex], pos2.StringIndex, keyword.Length, this.ignoreCase);
}
// Packs index of string, and position within string into a single int. This is
// set up for strings no greater than 255 bytes. If longer strings are desired,
// the code for the constructor, and extracting ListIndex and StringIndex would
// need to be modified accordingly, taking bits from ListIndex and using them
// for StringIndex.
public struct StringPosition {
public static StringPosition NotFound = new StringPosition(-1, 0);
private readonly int position;
public StringPosition(int listIndex, byte stringIndex) {
this.position = (listIndex < 0) ? -1 : this.position = (listIndex << 8) | stringIndex;
}
public int ListIndex { get { return (this.position >= 0) ? (this.position >> 8) : -1; } }
public byte StringIndex { get { return (byte) (this.position & 0xFF); } }
public override string ToString() {
return ListIndex.ToString() + ":" + StringIndex;
}
}
}
}