1

Data:

x.txt,simple text file(around 1 MB) y.txt dictionary file(around 1Lakh words).

Need to find whether any of the word/s in y.txt is present in x.txt.

Need an algorithm which consumes less time for execution and language preferred for the same.

PS: Please suggest any algorithm apart from BRUTE FORCE METHOD.

I need pattern matching rather than exact string matching.

For instance :

x.txt : "The Old Buzzards were disestablished on 27 April"

Y.txt : "establish"

Output should be : Found establish in X.txt : Line 1

Thank you.

Ankith
  • 277
  • 2
  • 4
  • 13
  • This has got nothing to do with regex... – Tom Lord Jun 26 '13 at 08:33
  • define BRUTE FORCE METHOD.. – Anthony Steven Jun 26 '13 at 08:43
  • [Sets](http://en.wikipedia.org/wiki/Set_%28abstract_data_type%29)? (The implementation actually depends on whether the dictionary is static and you're providing various input text files or the other way around). – bereal Jun 26 '13 at 08:44
  • Dictionary file will be growing(update) as time passes and yes I am providing multiple input files. – Ankith Jun 26 '13 at 08:57
  • BRUTE FORCE: compare characters of all the patterns against every character of the text until the first failure (if you compare afterwards that's just plain dumb). In other words you have a loop over the characters in the text and you try every single pattern until they fail, or you don't learn anything from the patterns or successive successful or failed comparisons. – user1666959 Jun 26 '13 at 10:21
  • Please check if the following PERL script is relevant to you http://stackoverflow.com/questions/1099985/algorithm-for-multiple-word-matching-in-text. – Srihari Jun 27 '13 at 23:50

3 Answers3

2

It is not clear to me whether you need this to get a job done or it is home work. If you need it to get a job done then:

#!/usr/bin/bash
Y=`cat y.txt | tr '\n' '|'`
echo "${Y%|}"
grep -E "${Y%|}" x.txt
if [ "$?" -eq 0 ]
then
    echo "found"
else
    echo "no luck"
fi

is hard to beat as you slurp in all the patterns from a file, construct a regular expression (the echo shows the regex) and then hand it to grep which constructs a finite state automata for you. That is going to fly as it compares every character in the text at most once. If it is homework then I suggest you consult Cormen et al 'Introduction to Algorithms', or the first few chapters of the Dragon Book which will also explain what I just said.

Forgot to add: y.txt should contain your pattern one per line, but as a nice side effect your patterns do not have to be single words.

user1666959
  • 1,805
  • 12
  • 11
  • Not homework.I am developing the application for server. and between when it comes to application, in y.txt or dictionary file there is no relation between each words so constructing general regular expression is not possible. – Ankith Jun 26 '13 at 11:09
  • Above mentioned program wont work as there is limit for grep argument. It throws the following error "Argument list too long" – Ankith Jul 13 '13 at 12:56
  • In that case explore the -f (--file) option to grep. Dump your patterns to a file, one pattern per line, same effect as the code above. Should solve the problem. – user1666959 Jul 14 '13 at 08:00
0

Suppose, you have any Set implementation in your standard library, here is some pseudo-code:

dictionary = empty set

def populate_dict():
    for word in dict_file:
        add(dictionary, word)

def validate_text(text_file):
    for word in text_file:      ### O(|text_file|)
        if word in dictionary:  ### O(log |dictonary|)
            report(word)

populate_dict()
every_now_and_then(populate_dict)

That would give you O(t * log d) instead of the brute-force O(t * d) where t and d are the lengths of the input text file and dictionary respectively. I don't think that anything faster is possible since you can't read the file faster that O(t) and can't search faster than O(log d).

bereal
  • 32,519
  • 6
  • 58
  • 104
0

This is a search algorithm I had in mind for a while. Basically the algorithm is in two steps.

In the first step all the words from y.txt are inserted in a tree. Every path in the tree from the root to a leaf is a word. The leaf is empty.

For example, the tree for the words dog and day is the following.

<root>--<d>-<a>-<y>-<>
          \-<o>-<g>-<>

The second part of the algorithm is a search down the tree. When you reach an empty leaf then you have found a word.

The implementation in Groovy, if more comments are needed just ask

//create a tree to store the words in a compact and fast to search way
//each path of the tree from root to an empty leaf is a word
def tree = [:]
new File('y.txt').eachLine{ word->
    def t=tree
    word.each{ c ->
        if(!t[c]){
            t[c]=[:]
        }
        t=t[c]
    }
    t[0]=0//word terminator (the leaf)
}
println tree//for debug purpose
//search for the words in x.txt
new File('x.txt').eachLine{ str, line->
    for(int i=0; i<str.length(); i++){
        if(tree[str[i]]){
            def t=tree[str[i]]
            def res=str[i]
            def found=false
            for(int j=i+1; j<str.length(); j++){
                if(t[str[j]]==null){
                    if(found){
                        println "Found $res at line $line, col $i"
                        res=str[j]
                        found=false
                    }
                    break
                }else if(t[str[j]][0]==0){
                    found=true
                    res+=str[j]
                    t=t[str[j]]
                    continue
                }else{
                    t=t[str[j]]
                    res+=str[j]
                }
                found=false
            }
            if(found) println "Found $res at line $line, col $i"//I know, an ugly repetition, it's for words at the end of a line. I will fix this later
        }
    }
}

this is my y.txt

dog
day
apple
daydream

and x.txt

This is a beautiful day and I'm walking with my dog while eating an apple.
Today it's sunny.
It's a daydream

The output is the following:

$ groovy search.groovy
[d:[o:[g:[0:0]], a:[y:[0:0, d:[r:[e:[a:[m:[0:0]]]]]]]], a:[p:[p:[l:[e:[0:0]]]]]]
Found day at line 1, col 20
Found dog at line 1, col 48
Found apple at line 1, col 68
Found day at line 2, col 2
Found daydream at line 3, col 7

This algorithm should be fast because the depth of the tree doesn't depend on the number of words in y.txt. The depth is equal to the length of the longest word in y.txt.

Marco Martinelli
  • 892
  • 2
  • 14
  • 27