Search String By SubWords

Question

Which algorithms and data structures would help me do this?

I have a file of roughly 10,000 lines loaded in memory as an ordered set. Given a search string, I want to get all the lines in which every word of the search string is a prefix of some word in the line. An example should clarify this:

Lines:

"A brow Fox flies."
"Boxes are full of food."
"Cats runs slow"
"Dogs hates eagles"
"Dolphins have eyes and teath"

Case 1:

search string = "fl b a"

"A brow Fox flies."

Explanation: the search string has three words, "fl", "b", and "a", and the only line that has a word starting with each of them is line 1.

Case 2:

search string = "e do ha"

"Dogs hates eagles", "Dolphins have eyes and teath"

Solution

(Fast enough for me: about 30 ms, including sorting the final result, on my PC for a set of 10k lines with 3 words per line.)

I used a trie, as suggested in the accepted answer, plus some hacky methods (mainly hash sets) to filter out duplicate and false-positive results.
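
The filtering code itself isn't shown here. As a rough sketch of what a final false-positive check could look like (an assumption, not the code actually used), the hypothetical helper `line_satisfies` below tries to give every search fragment its own distinct word in the line. It leans on the limit of about five words per line mentioned in the comments, which keeps the brute-force permutation search cheap.

```python
from itertools import permutations


def line_satisfies(line, query):
    """Return True if each query fragment prefixes a different word of line.

    Hypothetical helper: brute force over word orderings, which is only
    reasonable because lines are assumed to hold about five words at most.
    """
    words = line.lower().split()
    fragments = query.lower().split()
    if len(fragments) > len(words):
        return False
    # Try every way of assigning distinct words to the fragments, in order.
    return any(
        all(word.startswith(fragment) for word, fragment in zip(chosen, fragments))
        for chosen in permutations(words, len(fragments))
    )


print(line_satisfies("A brow Fox flies.", "fl b a"))  # True
print(line_satisfies("A brow Fox flies.", "a a a"))   # False
```

Running a check like this only on the candidate lines returned by the trie keeps the expensive part small.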

Are there any constraints on the length of lines, on the length of search fragments or on the length of search strings? I'm trying to find an improvement on the trie solution, because thinking about it I increasingly feel there's just *so much* duplication of information going on in it that there must be some optimizations to be found. — Andy Jones, Dec 19 '13 at 11:17
Well, not on length, but on word count: no more than 5 words in each line. – ToddlerXxX, Dec 19 '13 at 12:12

score 4 · Accepted Answer · answered Dec 18 '13 at 20:06


I think what you're probably wanting is a trie. Construct one for the set of all words in your document, and have each leaf point to a hashset containing the indices of the lines in which the key of the leaf appears.

To execute a search, you'd use each fragment of the search string to navigate to a node in the tree and take the union over the hashsets of all leaves in that node's subtree. Then you'd take the intersection of those unions over the set of fragments to get the list of lines satisfying the search string.
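
A minimal Python sketch of this idea, under a few assumptions: words are split on whitespace and lowercased, the names `TrieNode`, `build_trie`, `lines_with_prefix` and `search` are made up for illustration, and the line sets are stored on every node where a word ends (not only on leaves), since one word can be a prefix of another.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # next character -> TrieNode
        self.lines = set()   # indices of lines containing the word ending here


def build_trie(lines):
    root = TrieNode()
    for index, line in enumerate(lines):
        for word in line.lower().split():
            node = root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.lines.add(index)
    return root


def lines_with_prefix(root, prefix):
    """Union of the line sets of every word that starts with prefix."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return set()
    found = set()
    stack = [node]
    while stack:  # iterative walk over the subtree below the prefix node
        current = stack.pop()
        found |= current.lines
        stack.extend(current.children.values())
    return found


def search(root, lines, query):
    """Intersect the per-fragment unions to get the matching lines."""
    result = None
    for fragment in query.lower().split():
        matches = lines_with_prefix(root, fragment)
        result = matches if result is None else result & matches
    return [lines[i] for i in sorted(result or ())]


LINES = [
    "A brow Fox flies.",
    "Boxes are full of food.",
    "Cats runs slow",
    "Dogs hates eagles",
    "Dolphins have eyes and teath",
]
root = build_trie(LINES)
print(search(root, LINES, "fl b a"))   # ['A brow Fox flies.']
print(search(root, LINES, "e do ha"))  # ['Dogs hates eagles', 'Dolphins have eyes and teath']
```

This plain version treats a repeated fragment such as "a a a" as a single requirement, so it does not check that each fragment is matched by a different word.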

answered Dec 18 '13 at 20:06

Andy Jones


I thought of that, but suppose the search is "a a a". This would return the first line, even though that line contains only one word prefixed with "a", not three. – ToddlerXxX Dec 18 '13 at 20:10
Good point! I'll have a think about it. Edit: To anyone who reads this, the tempting solution is to keep multiple hashsets in each leaf, indexed by the number of times that prefix appears in the contained strings. The problem with this is "a a a" wouldn't match "aa ab ac". You could assign hashsets to every internal node as well, but that seems a little inelegant. – Andy Jones Dec 18 '13 at 20:24
(Preliminary thought) Sort the input fragments and trie data alphabetically, and on each matching substring, walk down the trie until you match a next fragment or encounter the end-of-branch. – Jongware Dec 18 '13 at 21:02
I like the sorting part, and I'm still thinking about walking down the trie. – ToddlerXxX Dec 18 '13 at 21:07
It's not difficult to modify your method to handle the `"a a a"` case. When you create the list of lines that contain words starting with a particular prefix, include a count that says how many matching words that line contains. So your intermediate structure wouldn't be just a line number, but instead the line number plus a count for the number of times each prefix occurs. So in the `"a a a"` case you only have to search the trie once, and then filter out those lines that don't have a count of 3. – Jim Mischel Dec 19 '13 at 01:56
@JimMischel I think this is what I am going to do if nothing else better comes up. – ToddlerXxX Dec 19 '13 at 09:40
@JimMischel I am having trouble filtering the results. Searching the trie gives me a set with duplicates that I need to remove; then I need to intersect this set with any previous set and cross out the wrong ones, and I can't work out how to do that. – ToddlerXxX Dec 19 '13 at 14:44
@ToddlerXxX: If you're having specific issues with some code, then isolate the problem and post a question showing the relevant code. I'm sure somebody (perhaps me if I see it) will be able to help you out. – Jim Mischel Dec 19 '13 at 14:51
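
Jim Mischel's counting suggestion from the comments above could be sketched roughly as follows. This is one possible reading of it, not his code: the names `Node`, `build_index`, `prefix_counts` and `search_with_counts` are hypothetical, each word-ending node keeps a `Counter` of line index to occurrences, and a line is accepted when it has at least (rather than exactly) as many matching words as the fragment is repeated.

```python
from collections import Counter


class Node:
    def __init__(self):
        self.children = {}
        self.counts = Counter()  # line index -> times this exact word occurs


def build_index(lines):
    root = Node()
    for index, line in enumerate(lines):
        for word in line.lower().split():
            node = root
            for ch in word:
                node = node.children.setdefault(ch, Node())
            node.counts[index] += 1
    return root


def prefix_counts(root, prefix):
    """Map line index -> number of words in that line starting with prefix."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return Counter()
    total = Counter()
    stack = [node]
    while stack:
        current = stack.pop()
        total.update(current.counts)
        stack.extend(current.children.values())
    return total


def search_with_counts(root, lines, query):
    needed = Counter(query.lower().split())  # "a a a" -> {"a": 3}
    result = None
    for prefix, required in needed.items():  # one trie search per distinct prefix
        counts = prefix_counts(root, prefix)
        enough = {i for i, seen in counts.items() if seen >= required}
        result = enough if result is None else result & enough
    return [lines[i] for i in sorted(result or ())]


LINES = ["A brow Fox flies.", "Boxes are full of food.", "aa ab ac"]
root = build_index(LINES)
print(search_with_counts(root, LINES, "a a a"))   # ['aa ab ac']
print(search_with_counts(root, LINES, "fl b a"))  # ['A brow Fox flies.']
```

One limitation of this sketch: two different fragments can still be satisfied by the same word (for example "a ab" against a line whose only match is "abc"), so a final exact check on the surviving lines may still be needed.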

score 0 · Answer 2 · edited Dec 19 '13 at 00:34

Here is my 2 cents:

using System;
using System.Collections.Generic;
using System.Linq;

// Per-prefix bookkeeping: how many words with this prefix a line must contain.
class DicVal
{
    public int OriginalValue;   // required number of matching words
    public int CurrentValue;    // matches still needed on the current line

    public int LineNumber;      // line for which CurrentValue was last reset
}

static class Program
{
    private static void Main()
    {
        var a = "A brow Fox flies.\r\n" +
                "Boxes are full of food.\r\n" +
                "Cats runs slow\r\n" +
                "Dogs hates eagles\r\n" +
                "A brow Fox flies. AA AB AC\r\n" +
                "Dolphins have eyes and teath";

        // Split on "\r\n" explicitly; Environment.NewLine is "\n" on non-Windows platforms.
        var lines = a.Split(new[] {"\r\n"}, StringSplitOptions.RemoveEmptyEntries);
        var dic = new Dictionary<string, DicVal>
            {
                {"fl", new DicVal { OriginalValue = 1, LineNumber = -1}},
                {"b", new DicVal { OriginalValue = 1, LineNumber = -1}},
                {"a", new DicVal { OriginalValue = 4, LineNumber = -1}}
            };

        // Total number of prefix matches a line needs.
        var globalCount = dic.Sum(x => x.Value.OriginalValue);
        var lineNumber = 0;

        foreach (var line in lines)
        {
            var words = line.Split(' ');
            var currentCount = globalCount;

            foreach (var word in words.Select(x => x.ToLower()))
            {
                // Look up every prefix of the word in the dictionary.
                for (var i = 1; i <= word.Length; i++)
                {
                    var substr = word.Substring(0, i);
                    DicVal entry;
                    if (dic.TryGetValue(substr, out entry))
                    {
                        // First hit for this prefix on this line: reset its counter.
                        if (entry.LineNumber != lineNumber)
                        {
                            entry.CurrentValue = entry.OriginalValue;
                            entry.LineNumber = lineNumber;
                        }

                        if (entry.CurrentValue > 0)
                        {
                            currentCount--;
                            entry.CurrentValue--;
                        }
                    }
                }
            }

            // Every required prefix was found the required number of times.
            if (currentCount == 0)
                Console.WriteLine(line);

            lineNumber++;
        }
    }
}

Not going to explain much, as code is the best documentation :P.

Output: A brow Fox flies. AA AB AC

Assuming you implement everything efficiently, the running time will be as good as possible, since you need to read every word at least ONCE.

Further optimization is possible, for example by applying threading. Look into the parallel aggregation pattern, as this problem parallelizes easily.

edited Dec 19 '13 at 00:34

Bernhard Barker


answered Dec 19 '13 at 00:18

Erti-Chris Eelmaa


This operates in time proportional to M*N*Q, where M is the number of prefixes, N is the number of lines to search, and Q is the average number of words per line. A trie implementation will be much faster in searching, although it will take a bit more time to initialize. A single-threaded trie implementation will operate *much* faster than a parallelized version of what you suggest when N*Q is even moderately large. – Jim Mischel Dec 19 '13 at 01:47
Unless I misunderstand you, my code is not bound by M at all. This will run in N*Q. The prefix check is O(1). What I am saying is that you can't do better than N*Q, since you need to touch each character at least once. I beg to differ: 10k lines is enough to justify the time it takes to fire up new threads. I can imagine the gain being well justified on an average client computer. Splitting the work between processors and using the aggregate pattern should do the trick. – Erti-Chris Eelmaa Dec 19 '13 at 08:20
"not bound by M at all" as in not as you said. N*Q + M, yes. – Erti-Chris Eelmaa Dec 19 '13 at 08:42
What does OriginalValue refer to, and why is it 4 for "a"? – ToddlerXxX Dec 19 '13 at 09:37
It indicates how many of these need to be found. In my example, 4 "a" prefixes have to be found in the input line. – Erti-Chris Eelmaa Dec 19 '13 at 11:10
I find this easier than the trie approach, which needs many intersections and unions and then has to discard false positives. – Karim Tarabishy Dec 19 '13 at 11:37
Should only the outer for loop be executed in parallel? – ToddlerXxX Dec 19 '13 at 13:25
Wait ... your algorithm checks for prefixes by taking substrings and checking the dictionary? So if I want to know if the word "foobar" starts with one of the prefixes, first the code checks "f" against the dictionary, then it checks "fo", then "foo", etc? That algorithm is crazy inefficient. You're doing a lookup for *every character* that's in the entire text. Not only that, you're creating a temporary string for every character. I don't think I could devise a less efficient method to solve the OP's problem. Congratulations. – Jim Mischel Dec 19 '13 at 14:21
The idea is that the program will load the data once and do multiple (perhaps thousands) of searches. With your method, you have to touch every character every time. But if you use a prefix tree you end up touching a very small number of nodes--orders of magnitude fewer than with your method. Parallelization would speed your algorithm, yes. But with sufficiently large data (say, a million lines), your algorithm would take approximately forever to complete. And [half of forever is still forever](http://blog.mischel.com/2012/01/03/half-of-forever-is-still-forever/). – Jim Mischel Dec 19 '13 at 14:30
@ToddlerXxX: that is right. You need to chunk work into LINES/ProcessorCount onto each processor. You would be able to split work even more, but I don't see possible gains. – Erti-Chris Eelmaa Dec 19 '13 at 16:47
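
The chunk-and-aggregate idea discussed in these comments might look roughly like this in Python. It is a sketch under assumptions, not the answer's code: `line_matches`, `scan_chunk` and `parallel_search` are hypothetical names, the per-line test counts words per prefix the way the C# example does, and threads are used only to keep the example self-contained. In CPython a thread pool will not speed up CPU-bound scanning because of the global interpreter lock, so a process pool would be needed for a real gain.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def line_matches(line, needed):
    """True if the line has enough words for every required prefix."""
    words = line.lower().split()
    return all(
        sum(word.startswith(prefix) for word in words) >= required
        for prefix, required in needed.items()
    )


def scan_chunk(chunk, needed):
    return [line for line in chunk if line_matches(line, needed)]


def parallel_search(lines, query, workers=4):
    needed = Counter(query.lower().split())
    if not needed or not lines:
        return []
    size = -(-len(lines) // workers)  # ceiling division: lines per worker
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in chunk order, so line order is preserved.
        parts = pool.map(scan_chunk, chunks, [needed] * len(chunks))
        return [line for part in parts for line in part]


LINES = [
    "A brow Fox flies.",
    "Boxes are full of food.",
    "Cats runs slow",
    "Dogs hates eagles",
    "A brow Fox flies. AA AB AC",
    "Dolphins have eyes and teath",
]
print(parallel_search(LINES, "fl b a a a a", workers=2))
# ['A brow Fox flies. AA AB AC']
```

Only the outer loop over lines is split here; each worker scans its own chunk and the partial results are concatenated at the end.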

Rusty Rob · Answer 3 · 2013-12-19T00:51:35.020

Here's a fairly simple implementation that should be appropriate for your use case. The idea is that you can store all combinations of short prefixes for each line (and for each query) since you only have 10,000 lines and assuming each line doesn't contain too many words. Now look up each hash generated for the query string. For each hash match, we then check for an exact match. For my example code I consider only prefixes of length 1, however you could repeat this approach for prefixes of length 2 & 3 provided the prefixes in your query have those lengths too.

__author__ = 'www.google.com/+robertking'

from itertools import combinations
from collections import defaultdict

lines = [
    "A brow Fox flies.",
    "Boxes are full of food.",
    "Cats runs slow",
    "Dogs hates eagles",
    "Dolphins have eyes and teath"
]
lines = [line.lower() for line in lines]

def short_prefixes(line):
    for word in line.split():
        yield word[:1]

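def get_hashes(line):
    starts = list(short_prefixes(line))
    # Emit sorted groups of 1 to 3 first letters. The "+ 1" makes the range
    # include len(starts) itself, so a one-word query still yields a key.
    for prefixes_in_hash in range(1, min(4, len(starts) + 1)):
        for hash_group in combinations(starts, r=prefixes_in_hash):
            yield tuple(sorted(hash_group))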


def get_hash_map():
    possible_matches = defaultdict(list)
    for line_pos, line in enumerate(lines):
        for hash in get_hashes(line):
            possible_matches[hash].append(line_pos)
    return possible_matches


possible_matches = get_hash_map()

def ok(line, q):
    return all(line.startswith(prefix) or ((" " + prefix) in line) for prefix in q)

def query(search_string):
    search_string = search_string.lower()
    q = search_string.split()
    hashes = set(get_hashes(search_string))
    possible_lines = set()
    for hash in hashes:
        for line_pos in possible_matches[hash]:
            possible_lines.add(line_pos)

    for line_pos in possible_lines:
        if ok(lines[line_pos], q):
            yield lines[line_pos]

print(list(query("fl b a")))
#['a brow fox flies.']