19

I have two files, wordlist.txt and text.txt.

The first file, wordlist.txt, contains a huge list of words in Chinese, Japanese, and Korean, e.g.:

你
你们
我

The second file, text.txt, contains long passages, e.g.:

你们要去哪里?
卡拉OK好不好?

I want to create a new word list (wordsfound.txt) that contains only those lines from wordlist.txt which are found at least once within text.txt. For the example above, the output file should contain:

你
你们

"我" is not found in this list because it is never found in text.txt.

I want to find a very fast way to create this list which only contains lines from the first file that are found in the second.

I know a simple way in BASH to check each line in wordlist.txt and see if it is in text.txt using grep:

a=1
while read -r line
do
    c=$(grep -c "$line" text.txt)
    if [ "$c" -ge 1 ]
    then
        echo "$line" >> wordsfound.txt
        echo "Found" $a
    else
        echo "Not found" $a
    fi
    a=$((a + 1))
done < wordlist.txt

Unfortunately, as wordlist.txt is a very long list, this process takes many hours. There must be a faster solution. Here is one consideration:

As the files contain CJK characters, they can be thought of as using a giant alphabet of about 8,000 letters, so nearly every word shares characters with other words. E.g.:

我
我们

Due to this fact, if "我" is never found within text.txt, then it is quite logical that "我们" never appears either. A faster script might check "我" first and, upon finding that it is not present, skip every subsequent word in wordlist.txt that also contains "我". Since there are only about 8,000 unique characters in wordlist.txt, the script should not need to check so many lines.
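
For example, a rough sketch of this prefilter idea (untested; the file names are placeholders, it assumes GNU grep and GNU awk in a UTF-8 locale, and it only prunes the list, so the surviving words still need to be checked against text.txt):

grep -o . text.txt | sort -u > chars-in-text.txt
awk 'NR == FNR { seen[$0] = 1; next }       # first file: one character per line
     {
         n = split($0, ch, "")              # split the word into characters (gawk)
         for (i = 1; i <= n; i++)
             if (!(ch[i] in seen)) next     # a character never occurs in text.txt: drop the word
         print
     }' chars-in-text.txt wordlist.txt > wordlist-pruned.txt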

What is the fastest way to create the list containing only those words from the first file that are also found somewhere within the second?

Village
  • 2
    Suppose `我` is in `wordlist.txt` but `我们` isn't. Suppose that `我们` appears in `text.txt`; is that a match for `我`? I.e. are you really matching words, or just arbitrary substrings of Chinese characters, which could be fragments of words? – Kaz Mar 23 '12 at 23:50
  • My goal is to create a new, shortened wordlist.txt that omits words which never match, so that later, more complex scripts, which take many hours to do their work, can run much more quickly. The new list is about 5% of the original length. If "我们" is found, but "我" is never found in isolation, ideally the new word list does not show "我", but if this additional check is very difficult to implement, then it is unnecessary. – Village Mar 24 '12 at 00:45
  • 1
    Not for nothin', Village, but you keep asking "is there a faster way"? The frank and honest truth is no, not really. There's no way faster than brute force to check for a value in an unsorted set, and there never will be. You can add a bunch of specific criteria to make use of binary searches, but the general case will never be faster than brute force. Sorry. Searches are an insanely consumptive process, and tons of research is being done into how to optimize them, but generally they involve ordering the data in some way. – FrankieTheKneeMan Mar 29 '12 at 10:45

12 Answers

12

I grabbed the text of War and Peace from the Gutenberg project and wrote the following script. It prints all words in /usr/share/dict/words which are also in war_and_peace.txt. You can change those defaults with:

perl findwords.pl --wordlist=/path/to/wordlist --text=/path/to/text > wordsfound.txt

On my computer, it takes just over a second to run.

use strict;
use warnings;
use utf8::all;

use Getopt::Long;

my $wordlist = '/usr/share/dict/words';
my $text     = 'war_and_peace.txt';

GetOptions(
    "worlist=s" => \$wordlist,
    "text=s"    => \$text,
);

open my $text_fh, '<', $text
    or die "Cannot open '$text' for reading: $!";

my %is_in_text;
while ( my $line = <$text_fh> ) {
    chomp($line);

    # you will want to customize this line
    my @words = grep { $_ } split /[[:punct:][:space:]]/ => $line;
    next unless @words;

    # This beasty uses the 'x' builtin in list context to assign
    # the value of 1 to all keys (the words)
    @is_in_text{@words} = (1) x @words;
}

open my $wordlist_fh, '<', $wordlist
    or die "Cannot open '$wordlist' for reading: $!";

while ( my $word = <$wordlist_fh> ) {
    chomp($word);
    if ( $is_in_text{$word} ) {
        print "$word\n";
    }
}

And here's my timing:

[ovid] $ wc -w war_and_peace.txt
565450 war_and_peace.txt
[ovid] $ time perl findwords.pl > wordsfound.txt

real    0m1.081s
user    0m1.076s
sys 0m0.000s
[ovid] $ wc -w wordsfound.txt
15277 wordsfound.txt
Ovid
  • 1
    I should mention that for particularly huge text files, the File::Map (http://search.cpan.org/dist/File-Map/) module is a good alternative, but it might be a bit trickier to use in this case. – Ovid Mar 27 '12 at 08:54
  • 1
    What is the purpose of the `grep {$_}` before split? – Freek Kalter Mar 27 '12 at 19:27
  • 1
    Freek: It's to grep out empty strings. You get a few of them with that split. – Ovid Mar 27 '12 at 20:08
  • 1
    This looks like it'll work for western languages but how will it cope with Chinese, Japanese and Korean, as this is what Village plans to use the code for. Remember, with a word list of A, AB and C, a text file containing ABDEFG would produce a results file with two lines; A and AB. Would this script produce this result? – swampf0etus Mar 27 '12 at 21:32
  • 1
    swampf0etus: the utf8::all line ensures that all filehandles (both in and out) are utf8, so reading other character sets will be fine. I have a note that the `/[[:punct:][:space:]]/` regex will likely need to be changed depending on the user's needs. In short, this should be a reasonable solution, with just a bit of tweaking. – Ovid Mar 29 '12 at 12:53
  • 1
    Of course, the input file may be in a different encoding, in which case you need to specify the encoding layer properly: `open my $fh, '<:7bit-jis', $some_file or die $!;` – Ovid Mar 29 '12 at 13:01
  • 1
    @Ovid You could shorten it a bit using `@is_in_text{@words} = undef;` and later check with `if( exists $is_in_text{$word} )`. This saves you from creating the temporary array `(1) x @words` – dgw Mar 29 '12 at 13:20
  • 1
    @dgw: I had considered that, but I thought the `$is_in_text{$word}` boolean check read more naturally. Either way, I can't disagree with you :) – Ovid Mar 29 '12 at 13:37
  • 1
    @Ovid As long as the array `@words` is not too large, you are right that the easier-to-read code should be preferred. – dgw Mar 29 '12 at 13:44
5

Just use comm

http://unstableme.blogspot.com/2009/08/linux-comm-command-brief-tutorial.html

comm -1 wordlist.txt text.txt

moodywoody
  • I made some test files, but it always prints every word contained in `wordlist.txt`, even if they never appear in `text.txt`. – Village Mar 20 '12 at 04:54
  • 1
    Isn't it optimal for the inputs to comm to be sorted? `comm -1 <(sort wordlist.txt) <(sort text.txt)` for modern shells. Good luck to all. – shellter Mar 25 '12 at 15:28
5

This might work for you:

 tr '[:punct:]' ' ' < text.txt | tr -s ' ' '\n' |sort -u | grep -f - wordlist.txt

Basically, create a new word list from text.txt and grep it against the wordlist.txt file.

N.B. You may want to use the software you used to build the original wordlist.txt, in which case all you need is:

yoursoftware < text.txt > newwordlist.txt
grep -f newwordlist.txt wordlist.txt 
potong
  • This seems to assume that the words in `text.txt` are separated by spaces and that they use standard punctuation, but as everything is Chinese, Japanese, or Korean, there are no spaces between words and different punctuation symbols are used. – Village Mar 20 '12 at 07:59
  • 1
    Would setting the LC* variables have any effect on how this solution works? Good luck to all. – shellter Mar 25 '12 at 01:24
4

Use grep with fixed-strings (-F) semantics; this will be fastest. Similarly, if you want to write it in Perl, use the index function instead of a regex.

sort -u wordlist.txt > wordlist-unique.txt
grep -F -f wordlist-unique.txt text.txt

I'm surprised that there are already four answers, but no one has posted this yet. People just don't know their toolbox anymore.

daxim
  • This finds and prints those lines in `text.txt` that have words in `wordlist.txt`, however, I want to print only those lines in `wordlist.txt` which have a match somewhere in `text.txt`. – Village Mar 22 '12 at 23:49
  • 1
    @Village I guess you could add a `-o` to `grep` (display only the matching parts) and make that unique again with `sort -u` (sketched below). – Michael Kohl Mar 24 '12 at 08:39
  • 1
    @daxim: This won't work in the general case. For example, if your word list contains `foo` and `foobar`, and the text contains `foobar`, this will only return `foobar`. Of course, another search with the matches in the word list could detect such substring matches. – l0b0 Mar 26 '12 at 09:17
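
A concrete form of the -o / sort -u suggestion from the comments above might look like this (a sketch; the foo vs. foobar caveat from the previous comment still applies):

sort -u wordlist.txt > wordlist-unique.txt
grep -oFf wordlist-unique.txt text.txt | sort -u > wordsfound.txt
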
4

I'm quite sure this is not the fastest solution, but it is at least a working one (I hope).

This solution needs Ruby 1.9; the text files are expected to be UTF-8.

#encoding: utf-8
#Get test data
$wordlist = File.readlines('wordlist.txt', :encoding => 'utf-8').map{|x| x.strip}
$txt = File.read('text.txt', :encoding => 'utf-8')

new_wordlist = []
$wordlist.each{|word|
  new_wordlist << word if $txt.include?(word)
}

#Save the result
File.open('wordlist_new.txt', 'w:utf-8'){|f|
  f << new_wordlist.join("\n")
}

Can you provide a bigger example for benchmarking the different methods? (Perhaps some test files to download?)

Below is a benchmark with four methods.

#encoding: utf-8
require 'benchmark'
N = 10_000 #Number of Test loops

#Get test data
$wordlist = File.readlines('wordlist.txt', :encoding => 'utf-8').map{|x| x.strip}
$txt = File.read('text.txt', :encoding => 'utf-8')

def solution_count
    new_wordlist = []
    $wordlist.each{|word|
      new_wordlist << word if $txt.count(word) > 0
    }
    new_wordlist.sort
end

#Faster than count: it can stop after the first hit
def solution_include
    new_wordlist = []
    $wordlist.each{|word|
      new_wordlist << word if $txt.include?(word)
    }
    new_wordlist.sort
end
def solution_combine()
    #get biggest word size
    max = 0
    $wordlist.each{|word| max = word.size if word.size > max }
    #Build list of all letter combination from text
    words_in_txt = []
    0.upto($txt.size){|i|
      1.upto(max){|l|
        words_in_txt << $txt[i,l]
      }
    }
    (words_in_txt & $wordlist).sort
end
#Idea behind:
#- remove string if found.
#- the next comparison is faster, the search text is shorter.
#
#This will not work with overlapping words.
#Example:
#  abcdef contains def.
#  if we check bcd first, the 'd' of def will be deleted, def is not detected.
def solution_gsub
    new_wordlist = []
    txt = $txt.dup  #avoid to manipulate data source for other methods
    #We must start with the big words.
    #If we start with the small ones, we destroy long words
    $wordlist.sort_by{|x| x.size }.reverse.each{|word|
      new_wordlist << word if txt.gsub!(word,'')
    }
    #Now we must add words which were already part of longer words
    new_wordlist.dup.each{|neww|
      $wordlist.each{|word|          
        new_wordlist << word if word != neww and neww.include?(word)
      }
    }
    new_wordlist.sort
end

#Save the result
File.open('wordlist_new.txt', 'w:utf-8'){|f|
  #~ f << solution_include.join("\n")
  f << solution_combine.join("\n")
}

#Check the different results
if solution_count != solution_include
  puts "Difference solution_count <> solution_include"
end
if solution_gsub != solution_include
  puts "Difference solution_gsub <> solution_include"
end
if solution_combine != solution_include
  puts "Difference solution_combine <> solution_include"
end

#Benchmark the solution
Benchmark.bmbm(10) {|b|

  b.report('count') { N.times { solution_count } }
  b.report('include') { N.times { solution_include } }
  b.report('gsub') { N.times { solution_gsub } } #wrong results
  b.report('combine') { N.times { solution_combine } }

} #Benchmark

I think the solution_gsub variant is not correct; see the comment in the method definition. If CJK happens to allow this approach, please give me feedback. That variant is the slowest in my test, but perhaps it will do better with bigger examples, and perhaps it can be tuned a bit.

The combine variant is also very slow, but it would be interesting to see what happens with a bigger example.

knut
4

I would probably use Perl:

use strict;

my @aWordList = ();

open(WORDLIST, "< wordlist.txt") || die("Can't open wordlist.txt");

while(my $sWord = <WORDLIST>)
{
   chomp($sWord);
   push(@aWordList, $sWord);
}

close(WORDLIST);

open(TEXT, "< text.txt") || die("Can't open text.txt");

while(my $sText = <TEXT>)
{
   foreach my $sWord (@aWordList)
   {
      if($sText =~ /$sWord/)
      {
          print("$sWord\n");
      }
   }
}


close(TEXT);

This won't be too slow, but if you could let us know the size of the files you're dealing with, I could have a go at writing something much more clever with hash tables.

swampf0etus
  • The `wordlist.txt` file can contain nearly 300,000 lines. The `text.txt` files vary in length, from very short, to millions of characters. – Village Mar 24 '12 at 00:31
  • 1
    This takes an age when run against war_and_peace and the unix dictionary. I'm currently working on a faster hash table version. – swampf0etus Mar 26 '12 at 16:26
4

First TXR Lisp solution ( http://www.nongnu.org/txr ):

(defvar tg-hash (hash)) ;; tg == "trigraph"

(unless (= (len *args*) 2)
  (put-line `arguments required: <wordfile> <textfile>`)
  (exit nil))

(defvar wordfile [*args* 0])

(defvar textfile [*args* 1])

(mapcar (lambda (line)
          (dotimes (i (len line))
            (push line [tg-hash [line i..(succ i)]])
            (push line [tg-hash [line i..(ssucc i)]])
            (push line [tg-hash [line i..(sssucc i)]])))
        (file-get-lines textfile))

(mapcar (lambda (word)
          (if (< (len word) 4)
            (if [tg-hash word]
              (put-line word))
            (if (find word [tg-hash [word 0..3]]
                      (op search-str @2 @1))
              (put-line word))))
        (file-get-lines wordfile))

The strategy here is to reduce the corpus of text to a hash table which is indexed on the individual characters, digraphs and trigraphs occurring in its lines, associating these fragments with the lines. Then, when we process the word list, this reduces the search effort.

Firstly, if the word is short, three characters or fewer (probably common among Chinese words), we can try to get an instant match in the hash table. If there is no match, the word is not in the corpus.

If the word is longer than three characters, we can try to get a match for the first three characters. That gives us a list of lines which contain a match for the trigraph. We can search those lines exhaustively to see which ones of them match the word. I suspect that this will greatly reduce the number of lines that have to be searched.

I would need your data, or something representative thereof, to be able to see what the behavior is like.

Sample run:

$ txr words.tl words.txt text.txt
water
fire
earth
the

$ cat words.txt
water
fire
earth
the
it

$ cat text.txt
Long ago people
believed that the four
elements were
just
water
fire
earth

(TXR reads UTF-8 and does all string manipulation in Unicode, so testing with ASCII characters is valid.)

The use of lazy lists means that we do not store the entire list of 300,000 words, for instance. Although we are using the Lisp mapcar function, the list is being generated on the fly and because we don't keep the reference to the head of the list, it is eligible for garbage collection.

Unfortunately we do have to keep the text corpus in memory because the hash table associates lines.

If that's a problem, the solution could be reversed. Scan all the words, and then process the text corpus lazily, tagging those words which occur. Then eliminate the rest. I will post such a solution also.

Kaz
  • Thanks for the bounty; identifying and fixing that silly garbage collector behavior was really all the reward I needed out of this. – Kaz Mar 30 '12 at 21:57
3
new file newlist.txt
for each word in wordlist.txt:
    check if word is in text.txt (I would use grep, if you're willing to use bash)
    if yes:
        append it to newlist.txt (probably echo word >> newlist.txt)
    if no:
        next word
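
A direct bash rendering of the pseudocode above might look like this (a sketch; grep -qF does a fixed-string search and stops at the first hit):

while read -r word; do
    if grep -qF "$word" text.txt; then
        echo "$word" >> newlist.txt    # found at least once
    fi
done < wordlist.txt
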
FrankieTheKneeMan
3

The simplest way with a bash script:

  1. Preprocess the text first with "tr" and "sort" to format it to one word per line and remove duplicated lines (a sketch of this step appears below).

  2. Do this:

cat wordlist.txt | while read i; do grep -E "^$i$" text.txt; done;

That's the list of words you want...
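
A possible form of step 1 (a sketch; the output name text-words.txt is a placeholder, so step 2 would then grep against it instead of text.txt, and it assumes the text uses spaces and ASCII punctuation as separators, which, as noted elsewhere in this thread, CJK text does not):

tr -s '[:punct:][:space:]' '\n' < text.txt | sort -u > text-words.txt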

Spike
3

Try this:

cat wordlist.txt | while read line
do
    if [ "$(grep -wc "$line" text.txt)" -gt 0 ]
    then
        echo "$line"
    fi
done

Whatever you do, if you use grep you must use -w to match a whole word. Otherwise, if you have foo in wordlist.txt and foobar in text.txt, you'll get a wrong match.

If the files are VERY big and this loop takes too much time to run, you can convert text.txt to a list of words (easy with AWK) and use comm to find the words that are in both lists.
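
For example, that comm-based idea might look something like this (a sketch; the intermediate file names are placeholders, and the awk tokenizer assumes whitespace-separated words, so CJK text would need a different tokenizer):

awk '{ for (i = 1; i <= NF; i++) print $i }' text.txt | sort -u > text-words.txt
sort -u wordlist.txt > wordlist-sorted.txt
comm -12 wordlist-sorted.txt text-words.txt > wordsfound.txt    # keep only lines present in both sorted files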

Eran Ben-Natan
  • 1
    -w won't work with Chinese words... so you have to use a regex in this case to make sure each line is matched. – Spike Mar 20 '12 at 17:07
3

This solution is in Perl; it maintains your original semantics and uses the optimization you suggested.

#!/usr/bin/perl
@list=split("\n",`sort < ./wordlist.txt | uniq`);
$size=scalar(@list);
for ($i=0;$i<$size;++$i) { $list[$i]=quotemeta($list[$i]);}
for ($i=0;$i<$size;++$i) {
    my $j = $i+1;
    while ($j < $size && $list[$j]=~/^$list[$i]/) {
            ++$j;
    }
    $skip[$i]=($j-$i-1);
}
open IN,"<./text.txt" or die;
@text = (<IN>);
close IN;
foreach $c(@text) {
    for ($i=0;$i<$size;++$i) {
            if ($c=~/$list[$i]/) {
                    $found{$list[$i]}=1;
                    last;
            }
            else {
                    $i+=$skip[$i];
            }
    }
}
open OUT,">wordsfound.txt" or die;
while ( my ($key, $value) = each(%found) ) {
        print OUT "$key\n";
}
close OUT;
exit;
pizza
  • 1
    This is similar to the method I provided, but I don't see any benefit in adding the results to a hash to be iterated and printed at the end. You're spending time (and memory) adding to the hash and then iterating and printing the hash contents when you could just print each result as it's found. Also, you're slurping entire files into arrays; as these files are potentially quite large, this could blow the memory. – swampf0etus Mar 25 '12 at 07:49
  • 1
    I take it the requirement is for each result to be shown only once, hence the hash. I tested with a 200k wordlist.txt file; it only takes a few seconds. Now, the poster said wordlist.txt is huge, but text.txt varies. Since wordlist.txt is sorted, some optimization can be done as the poster suggested. – pizza Mar 25 '12 at 08:16
  • 1
    I meant 2000k, not 200k. Based on the assumption that wordlist.txt is about 300k lines and CJK words are typically 1-3 characters in length, assuming all words are 3 characters + line end (8 bytes per line), the memory requirement is only a few megabytes. The optimization is realistic, as there are a fairly large number of words that start with the same character or some subset of characters. – pizza Mar 25 '12 at 08:26
3

Use parallel processing to speed up the processing.

1) Run sort & uniq on wordlist.txt, then split it into several files (X). Do some testing; X should equal the number of cores in your computer.

 split -d -n l/X wordlist.txt

2) Use xargs -P X -n 1 to run script.sh on each split file (x00, x01, ...) in parallel:

 find ./splitted_files_dir -type f -name "x*" -print | xargs -P 20 -n 1 -I SPLITTED_FILE script.sh SPLITTED_FILE

3) Concatenate the output files: cat output* > output.txt

This will speed up the processing enough, and you can keep using tools that you already understand, which eases the maintenance "cost".

The script is almost identical to the one you used in the first place.

script.sh
FILE=$1
OUTPUTFILE="output-${FILE}.txt"
WORDLIST="wordlist.txt"
a=1
while read -r line
do
    c=$(grep -c "$line" "${FILE}")
    if [ "$c" -ge 1 ]
    then
        echo "$line" >> "${OUTPUTFILE}"
        echo "Found" $a
    else
        echo "Not found" $a
    fi
    a=$((a + 1))
done < "${WORDLIST}"
user1126070
  • 1
    This will be uselessly slow on even moderately sized word lists. For each line of the huge corpus of text, you want to grep through the huge wordlist. Also, you're doing the grep the wrong way: you're seeing whether a whole line of text (consisting of many words) occurs in the list of words. – Kaz Mar 27 '12 at 05:21
  • 1
    In the specification, the wordlist contains one word per line. The Linux kernel will cache the text file, so it will be quite fast. Yes, I swapped the wordlist/text files; I will edit the script. This method can be used with any other script; I used a shell script here because the requester seems to understand shell scripts well. – user1126070 Mar 27 '12 at 06:48