I have two files, wordlist.txt and text.txt. The first file, wordlist.txt, contains a huge list of words in Chinese, Japanese, and Korean, e.g.:
你
你们
我
The second file, text.txt, contains long passages, e.g.:
你们要去哪里?
卡拉OK好不好?
I want to create a new word list (wordsfound.txt), but it should contain only those lines from wordlist.txt that are found at least once within text.txt. Given the files above, the output file should contain:
你
你们
"我" is not found in this list because it is never found in text.txt
.
I want to find a very fast way to create this list which only contains lines from the first file that are found in the second.
I know a simple way in Bash to check each line of wordlist.txt against text.txt using grep:
a=1
while read -r line
do
    if grep -q -- "$line" text.txt
    then
        echo "$line" >> wordsfound.txt
        echo "Found $a"
    else
        echo "Not found $a"
    fi
    a=$((a + 1))
done < wordlist.txt
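For reference, the loop above can be collapsed into a single awk pass (a sketch of an alternative, assuming text.txt fits in memory): read all of text.txt once into a string, then test each word with index() instead of spawning one grep process per word.

```shell
# Recreate the sample files from the question, for illustration only.
printf '你\n你们\n我\n' > wordlist.txt
printf '你们要去哪里?\n卡拉OK好不好?\n' > text.txt

# First file (NR == FNR): concatenate all of text.txt into one string.
# Second file: print each wordlist.txt line that occurs as a substring.
awk 'NR == FNR { text = text $0; next }
     index(text, $0) { print }' text.txt wordlist.txt > wordsfound.txt

cat wordsfound.txt    # 你 and 你们, but not 我
```

This is still O(words × text size) in the worst case, since index() does a plain substring search per word, but it avoids the per-word process startup, which is usually the dominant cost of the grep loop.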
Unfortunately, as wordlist.txt is a very long list, this process takes many hours. There must be a faster solution. Here is one consideration:
As the files contain CJK characters, they can be thought of as using a giant alphabet of about 8,000 letters, so nearly every word shares characters with other words. E.g.:
我
我们
Due to this fact, if "我" is never found within text.txt, then it is quite logical that "我们" never appears either. A faster script might check "我" first and, upon finding that it is absent, skip every subsequent word in wordlist.txt that also contains "我". If there are only about 8,000 unique characters in wordlist.txt, the script should not need to check so many lines.
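That pruning idea can be sketched in awk (again an illustrative sketch, not a known recipe): collect every character that occurs in text.txt, then reject any word using a character outside that set before doing the substring check. It assumes split(s, a, "") splits a string into single elements; gawk splits per character in a UTF-8 locale, while some awks split per byte, but the filter stays correct either way because the final index() call makes the real decision.

```shell
# Recreate the sample files from the question, for illustration only.
printf '你\n你们\n我\n' > wordlist.txt
printf '你们要去哪里?\n卡拉OK好不好?\n' > text.txt

awk '
  NR == FNR {                  # first file: text.txt
    text = text $0
    n = split($0, a, "")
    for (i = 1; i <= n; i++)
      seen[a[i]]               # record every character (or byte) present
    next
  }
  {                            # second file: wordlist.txt
    n = split($0, a, "")
    for (i = 1; i <= n; i++)
      if (!(a[i] in seen))
        next                   # some character never occurs: skip the word
    if (index(text, $0))       # cheap filter passed: confirm with substring search
      print
  }' text.txt wordlist.txt > wordsfound.txt
```

With roughly 8,000 distinct characters, most absent words are rejected after a few hash lookups instead of a full scan of text.txt.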
What is the fastest way to create the list containing only those words from the first file that are also found somewhere within the second?