
I need to write a command-line script in Linux to do the following:

  • read a list of words from a text file (one word per line), say w_i

  • for each w_i, compute the word count in a different text file

  • sum over these counts

Some help here would be really appreciated!

MAZDAK

3 Answers


This grep line may work for you; give it a try:

 grep -oFwf wordlist textfile|wc -l
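
For reference, a quick breakdown of what the flags do (as comments):

 # -f wordlist  read one pattern per line from wordlist
 # -F           treat each pattern as a fixed string, not a regex
 # -w           only count matches that form whole words
 # -o           print each match on its own line, so wc -l counts matches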

I just did this small test; it seems to work as you expected.

(PS: I inserted those words into file2 using vim, so I know how many I inserted.)

kent$  head file1 file2
==> file1 <==
foo
bar
baz
hello
world

==> file2 <==
 foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar
 hello world hello world hello world hello world hello world hello world hello world hello world hello world hello world hello world hello world hello world 
blah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo ba 

kent$  grep -oFwf file1 file2|wc -l
66
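
If you also want a per-word breakdown rather than only the grand total, a plain shell loop over the word list does the job too. This is just a minimal sketch, assuming the same file1/file2 as above; -F and -w keep the matching literal and word-bounded, and -e protects entries that start with a dash:

total=0
while IFS= read -r w; do
    n=$(grep -owF -e "$w" file2 | wc -l)   # occurrences of this word (or phrase)
    printf '%s\t%s\n' "$w" "$n"
    total=$((total + n))
done < file1
printf 'Total\t%s\n' "$total"
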
Kent
  • This sounds about right, except I always get zero for the total count. The same goes for kamituel's and sudo_O's solutions. I think it has something to do with looking for the count of e.g. foo\s+ rather than just foo in file2. Plus, in file1 I also have entries like "a lot", which are bigrams. – MAZDAK Apr 05 '13 at 11:48
  • I think it has something to do with the data in your two files, e.g. `foo` would be counted in `foo` but not in `foowhatever`. – Kent Apr 05 '13 at 11:53
  • Well, that doesn't really cause any problem in my analysis. Do you have any suggestions for fixing the bigram or trigram issue? For instance, some of the entries of file1 look like this: foo foo, the world, a lot, etc. – MAZDAK Apr 05 '13 at 12:48
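
Regarding the multi-word entries raised in the comments: because -F treats every line of the word list as one fixed string, phrases such as "a lot" are matched as a whole by the grep above. A quick way to check, as a sketch with made-up sample files:

 printf 'a lot\nfoo\n' > patterns      # word list containing a bigram
 printf 'foo a lot of foo\n' > text    # sample text containing both entries
 grep -oFwf patterns text | wc -l      # should print 3 (foo, "a lot", foo)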

Here is a one-liner using awk that prints the word counts and the total:

awk 'NR==FNR{w[$1];next}{for(i=1;i<=NF;i++)if($i in w)w[$i]++}END{for(k in w){print k,w[k];s+=w[k]}print "Total",s}' file1 file2
hello 13
foo 20
world 13
baz
bar 20
Total 66

Note: this uses Kent's example input.

The more readable script version:

BEGIN {
    OFS="\t"                              # Space the output with a tab 
}
NR==FNR {                                 # Only true in file1
    word_count[$1]                        # Build keys for all words           
    next                                  # Get next line
}
{                                         # In file2 here
    for(i=1;i<=NF;i++)                    # For each word on the current line
        if($i in word_count)              # If the word has a key in the array
            word_count[$i]++              # Increment the count
}
END {                                     # After all files have been read
    for (word in word_count) {            # For each word in the array
        print word,int(word_count[word])  # Print the word and the count
        sum+=word_count[word]             # Sum the values
    }
    print "Total",sum                     # Print the total
}

Save as script.awk and run like:

$ awk -f script.awk file1 file2
hello   13
foo     20
world   13
baz     0
bar     20
Total   66
Chris Seymour
  • Can you try it with this file1 (next comment) and see what the problem is? I think the problem in not being able to capture the right word count has to do with the end-of-line character, and also with the bigrams that I have in the list. – MAZDAK Apr 05 '13 at 11:52
  • a lot a posteriori a priori abaft abandon abandoned abandoning abashed abductor abiding – MAZDAK Apr 05 '13 at 11:55
  • Fixing for bigrams isn't an issue, but it is pointless to test with `file1` without the corresponding `file2`. You have already marked an answer as accepted; is your question now solved? – Chris Seymour Apr 05 '13 at 12:02
  • The bigram problem still persists (I also have trigrams and sometimes even 4-grams). So how would I resolve it? – MAZDAK Apr 05 '13 at 12:37
  • I would really appreciate it if you completed your answer, because although it is very well done, it is of no use to me without being able to capture bi-/trigrams. – MAZDAK Apr 05 '13 at 16:18
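
For the multi-word entries raised in the comments, one option is to treat every line of file1 as a literal phrase and count its occurrences with index(). This is only a sketch: it counts plain substring occurrences, so a short entry can also match inside a longer word.

NR==FNR {                                 # Only true in file1
    if (NF) phrase[$0]                    # Whole (non-blank) lines become phrases
    next
}
{
    for (p in phrase) {                   # Scan the current line for each phrase
        line = $0
        while ((pos = index(line, p)) > 0) {
            phrase[p]++                   # Count one literal occurrence
            line = substr(line, pos + length(p))
        }
    }
}
END {
    for (p in phrase) {
        print p, int(phrase[p])           # Phrases never seen print as 0
        sum += phrase[p]
    }
    print "Total", sum
}

Save it as, for example, phrases.awk (a name chosen here) and run it like the script above: awk -f phrases.awk file1 file2.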

Assuming you have a file (called file here) containing one word per line, and a second file corpus, you can use the following command:

$ cat file | xargs -I% sh -c '{ echo "%\c"; grep -o "%" corpus | wc -l; }' | \
  tee /dev/tty | awk '{ sum+=$2} END {print "Total " sum}'

For example, given file:

car
plane
bike

And for corpus:

car is a plane is on a car
or in the car via a plane
plane plane
car    

The output would be:

$ cat file | xargs -I% sh -c '{ echo "%\c"; grep -o "%" corpus | wc -l; }' | \
  tee /dev/tty | awk '{ sum+=$2} END {print "Total " sum}'
car       4
plane       4
bike       0
Total 8
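
A close variant passes each word to sh as a positional argument instead of splicing it into the command string, uses printf (the \c escape of echo is not portable across shells), and adds -F and -w so that matches are literal whole words, e.g. car does not also match inside carpet. Just a sketch, assuming the same file and corpus:

$ xargs -I{} sh -c 'printf "%s\t" "$1"; grep -owF -e "$1" corpus | wc -l' _ {} < file | \
  tee /dev/tty | awk '{ sum+=$2 } END { print "Total " sum }'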
kamituel