I have a file containing a bunch of strings. I have another file containing a bunch of words. I want to print all the lines in the first file that contain one of the first twenty words from the second file. I've been trying to do this with sed, but would grep or awk be a better alternative?
2 Answers
The question asks about "words," so I thought about what that should mean while making as few assumptions as possible about the format of file2: it might be another book, a phrase, or a comma- or tab-delimited list. I settled on the following rules:
- We likely want to match whole words, so that "home" in file2 doesn't match "homely" in file1.
- Strings with numbers, dashes, pluses, etc. are not English words and should not be considered.
- Hyphenated words and possessives should be retained.
- As we are matching on "words," case should be ignored (this is easy to reverse by dropping the -i option).
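As a rough check of the second and third rules, the sketch below runs a few made-up tokens through the same awk pattern used in the answer that follows. The sample tokens and the function name valid_words are only illustrative, and it assumes an awk that supports POSIX character classes.

```shell
# valid_words (hypothetical helper): keep only tokens that look like
# English words: letters, at most one internal hyphen, and an
# optional possessive ending ('s or s').
valid_words() {
  awk "/^[[:alpha:]]+(-[[:alpha:]]+)?('s|s')?$/"
}

# Made-up sample tokens, one per line.
valid=$(printf '%s\n' home 1The over- x86-64 empty-headed "Frank's" "dogs'" 'C++' |
  valid_words)
printf '%s\n' "$valid"
# prints:
# home
# empty-headed
# Frank's
# dogs'
```

Tokens containing digits, a trailing dash, or plus signs are dropped, while the hyphenated word and both possessive forms survive.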
If, however, we are allowed to restrict the format of file2, see the simplified egrep/sed answers toward the end.
The following answer first processes file2 in a sub-shell: it handles punctuation and delimiters, identifies the first twenty valid words, and builds a regular expression from that word list. The script then applies the regular expression (the output of the sub-shell) to filter file1.
egrep -i "$(tr -c "[:alnum:]-'" '\n' < file2 | awk "/^[[:alpha:]]+(-[[:alpha:]]+)?('s|s')?$/ { print; i++ } i==20 { exit 0 }" | sed '1h; 1!H; $!d; g; s/\n/ /g; s/^/\\</; s/ /\\>|\\</g; s/$/\\>/')" file1
To explain further... If we have the following file2 as our example:
$ cat file2
1The quick brown fox
jumps over- Frank's (empty-headed) lazy dog.
The tr statement in the sub-shell pipeline replaces unwanted delimiters with newlines, producing a newline-delimited list of candidate words (blank lines produced by consecutive delimiters are omitted here):
$ tr -c "[:alnum:]-'" '\n' < file2
1The
quick
brown
fox
jumps
over-
Frank's
empty-headed
lazy
dog
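Because tr maps every delimiter to its own newline, a run of delimiters such as a space followed by a parenthesis leaves blank lines in the raw output. The awk stage discards those anyway, but as a small optional variation (not part of the original command), adding -s squeezes each run into a single newline:

```shell
# One line from the sample file2. -c complements the set, so every
# character that is not alphanumeric, a dash, or an apostrophe becomes
# a newline; -s then squeezes each run of newlines into one.
squeezed=$(printf '%s\n' "over- Frank's (empty-headed) lazy dog." |
  tr -cs "[:alnum:]-'" '\n')
printf '%s\n' "$squeezed"
```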
The awk statement in the sub-shell pipeline filters for valid words and prints up to 20 words.
$ tr -c "[:alnum:]-'" '\n' < file2 | awk "/^[[:alpha:]]+(-[[:alpha:]]+)?('s|s')?$/ { print; i++ } i==20 { exit 0 }"
quick
brown
fox
jumps
Frank's
empty-headed
lazy
dog
The last statement in the sub-shell pipeline formats the list of words into a regular expression.
$ tr -c "[:alnum:]-'" '\n' < file2 | awk "/^[[:alpha:]]+(-[[:alpha:]]+)?('s|s')?$/ { print; i++ } i==20 { exit 0 }" | sed '1h; 1!H; $!d; g; s/\n/ /g; s/^/\\</; s/ /\\>|\\</g; s/$/\\>/'
\<quick\>|\<brown\>|\<fox\>|\<jumps\>|\<Frank's\>|\<empty-headed\>|\<lazy\>|\<dog\>
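The sed one-liner is dense, so here is the same script spread over several lines with comments. This is only an annotated sketch of the command above; the function name words_to_regex is made up for illustration.

```shell
# words_to_regex (hypothetical name): read one word per line and emit a
# single alternation of whole-word patterns.
words_to_regex() {
  sed '
    # First line: copy it into the hold space.
    1h
    # Every later line: append it to the hold space.
    1!H
    # Not yet at the last line: delete and move on, printing nothing.
    $!d
    # Last line: fetch everything collected in the hold space.
    g
    # Join the lines with spaces.
    s/\n/ /g
    # Open the first word with a start-of-word anchor.
    s/^/\\</
    # Turn each separator into: end-of-word, alternation, start-of-word.
    s/ /\\>|\\</g
    # Close the last word with an end-of-word anchor.
    s/$/\\>/
  '
}

regex=$(printf '%s\n' quick brown fox | words_to_regex)
printf '%s\n' "$regex"
# prints: \<quick\>|\<brown\>|\<fox\>
```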
If we use egrep to filter with this expression against a well-known text:
$ egrep -i "\<quick\>|\<brown\>|\<fox\>|\<jumps\>|\<Frank's\>|\<empty-headed\>|\<lazy\>|\<dog\>" kjv.txt | head -n 5
Ge30:32 I will pass through all thy flock to day, removing from thence all the speckled and spotted cattle, and all the brown cattle among the sheep, and the spotted and speckled among the goats: and of such shall be my hire.
Ge30:33 So shall my righteousness answer for me in time to come, when it shall come for my hire before thy face: every one that is not speckled and spotted among the goats, and brown among the sheep, that shall be counted stolen with me.
Ge30:35 And he removed that day the he goats that were ringstraked and spotted, and all the she goats that were speckled and spotted, and every one that had some white in it, and all the brown among the sheep, and gave them into the hand of his sons.
Ge30:40 And Jacob did separate the lambs, and set the faces of the flocks toward the ringstraked, and all the brown in the flock of Laban; and he put his own flocks by themselves, and put them not unto Laban's cattle.
Exo11:7 But against any of the children of Israel shall not a dog move his tongue, against man or beast: that ye may know how that the LORD doth put a difference between the Egyptians and Israel.
Putting it all together...
egrep -i "$(tr -c "[:alnum:]-'" '\n' < file2 | awk "/^[[:alpha:]]+(-[[:alpha:]]+)?('s|s')?$/ { print; i++ } i==20 { exit 0 }" | sed '1h; 1!H; $!d; g; s/\n/ /g; s/^/\\</; s/ /\\>|\\</g; s/$/\\>/')" kjv.txt | head -n 5
Ge30:32 I will pass through all thy flock to day, removing from thence all the speckled and spotted cattle, and all the brown cattle among the sheep, and the spotted and speckled among the goats: and of such shall be my hire.
Ge30:33 So shall my righteousness answer for me in time to come, when it shall come for my hire before thy face: every one that is not speckled and spotted among the goats, and brown among the sheep, that shall be counted stolen with me.
Ge30:35 And he removed that day the he goats that were ringstraked and spotted, and all the she goats that were speckled and spotted, and every one that had some white in it, and all the brown among the sheep, and gave them into the hand of his sons.
Ge30:40 And Jacob did separate the lambs, and set the faces of the flocks toward the ringstraked, and all the brown in the flock of Laban; and he put his own flocks by themselves, and put them not unto Laban's cattle.
Exo11:7 But against any of the children of Israel shall not a dog move his tongue, against man or beast: that ye may know how that the LORD doth put a difference between the Egyptians and Israel.
The solution runs fairly quickly on my year-old laptop:
$ wc -lw kjv.txt
31102 820736 kjv.txt
$ time egrep -i "$(tr -c "[:alnum:]-'" '\n' < file2 | awk "/^[[:alpha:]]+(-[[:alpha:]]+)?('s|s')?$/ { print; i++ } i==20 { exit 0 }" | sed '1h; 1!H; $!d; g; s/\n/ /g; s/^/\\</; s/ /\\>|\\</g; s/$/\\>/')" kjv.txt > /dev/null
real 0m0.021s
user 0m0.016s
sys 0m0.000s
Simplified Answer
The above handles the complicated case where file2 is "noisy." What if file2 is defined to be a newline-delimited list of words, so that we don't have to check for valid words? We can then eliminate the first two stages of the previous sub-shell pipeline:
egrep -i "$(head -n20 file2 | sed '1h; 1!H; $!d; g; s/\n/ /g; s/^/\\</; s/ /\\>|\\</g; s/$/\\>/')" file1
Finally, what is the solution if the constraints are the same as in the preceding case, but the list of words in file2 is delimited by single spaces?
egrep -i "$(awk 'NF>20{NF=20}1' file2 | sed 's/^/\\</; s/ /\\>|\\</g; s/$/\\>/')" file1
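For the newline-delimited case, grep can also read the word list directly and skip the regex-building step. This is a different technique from the egrep/sed commands above, sketched under the assumption of GNU grep (where -f - reads patterns from standard input and -w matches whole words only). The function name and sample data are made up for illustration.

```shell
# match_first_words (hypothetical helper)
#   $1: word list, one word per line
#   $2: text file to search
# -w whole words only, -i ignore case, -F treat patterns as fixed
# strings, -f - read the patterns from standard input.
match_first_words() {
  head -n 20 "$1" | grep -wiF -f - "$2"
}

# Made-up sample data in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' home dog > "$tmp/file2"
printf '%s\n' 'A homely cat' 'The DOG barked' 'Home at last' > "$tmp/file1"

result=$(match_first_words "$tmp/file2" "$tmp/file1")
printf '%s\n' "$result"
# prints:
# The DOG barked
# Home at last

rm -rf "$tmp"
```

With -F the words are never interpreted as regular expressions, so no escaping is needed, and an empty word list simply matches nothing.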
