record the lines in which each word in a given file appears using awk

Question

I'm having a few problems doing this. The output needs to be in the following format: each line starts with a word, followed by a colon “:”, then a space, and then the comma-separated list of the line numbers on which the word appears. If a word appears on a line multiple times, that line number should be reported only once.

Command line: index.awk test1.txt > new.output.txt

My code (currently):

    #!/bin/awk -f

    Begin {lineCount=1}                    # start line count at 1

    {
        for (i = 1; i <= NF; i++)          # loop through starting with position 1
           for ( j = 2; j <= NF; j++)      # have something to compare
              if ( $i == $j )              # see if they match
                  print $i ":" lineCount   # if they do print the word and line number
                  lineCount++              # increment the line number

    }

You'll notice in the sample output below that it completely skips over the first line of the input text file. It counts correctly from there. How can I print the occurrences of a word if it appears more than once? Also, is there a function native to awk that can account for erroneous characters such as punctuation, numbers, [], (), etc.?

(EDIT: gsub(regexp, replacement, target) can omit these erroneous characters from the text.)

Sample INPUT: I would like to print out each word, and the corresponding lines which the word occurs on. I need to make sure I omit the punctuation's from the strings when printing them out. As well, I need to make sure if the word occurs more than once on a line not to print the line number twice.

SAMPLE OUTPUT: 

I:
would:
like:
to:
print:
out:
each:
word:
and,:
the:1
corresponding:
lines:
which:
the:
word:
occurs:
on.:
I:1
need:1
to:1
make:1
sure:1
..... etc. (outputs the line numbers correctly from here)

Why are `and,` and `on.` considered "words"? Define what a "word" means to you. — Ed Morton, Oct 13 '14 at 22:53
@Ryan recognizing words IS the big picture, the rest is trivial. Update your question to show some input cases you think will be difficult to handle (e.g. are `There`, `there`, and `there's` the same word? Is the trailing `s` a word? Is `7` a word? What about `7th`?) and the output you actually want. — Ed Morton, Oct 13 '14 at 23:07
Very good point. I will update for others who look at this in future. @EdMorton — chomp, Oct 13 '14 at 23:44
You've accepted an answer while the rest of us were waiting for you to tell us what the question was. Consequently it's entirely possible that the answer you selected doesn't actually do what you want in general (i.e. much outside of the small, very restricted input set you posted) or that there are simpler answers out there. For example, the suggested way of removing punctuation (`gsub(/[-.,"!?/]/," ")`) is wrong. It may be too late now to have people re-look at this question even if you deselected the answer, so you might want to post a new question with the input/output I suggested. — Ed Morton, Oct 14 '14 at 00:55
gsub() removes all the erroneous cases I was looking to omit from the text. — chomp, Oct 14 '14 at 20:10
Cool. So would doing it the right way and it'd remove the rest of the punctuation that you apparently haven't included in your test input yet, e.g. an apostrophe. As long as you're happy with it, great, but for anyone else reading this - this is not the right way to solve this problem in general, it will only work for a specific set of input and restrictive requirements. — Ed Morton, Oct 15 '14 at 14:44

John1024 · Accepted Answer · 2014-10-14T00:16:08.687

    awk '{delete u;for (i=1;i<=NF;i++) u[$i]=1; for (i in u) cnt[i]=cnt[i]NR","} END{for (i in cnt) {sub(/,$/,"",cnt[i]); printf "%s: %s\n",i,cnt[i]}}' input

As an example (somewhat shorter text than your example):

    $ cat file
    I and I and I went
    here and here and there
    and then home

    $ awk '{delete u;for (i=1;i<=NF;i++) u[$i]=1; for (i in u) cnt[i]=cnt[i]NR","} END{for (i in cnt) {sub(/,$/,"",cnt[i]); printf "%s: %s\n",i,cnt[i]}}' file
    there: 2
    went: 1
    here: 2
    and: 1,2,3
    then: 3
    I: 1
    home: 3
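
The order of the lines above is arbitrary: awk does not define the order in which `for (i in cnt)` visits array elements. If a stable order is wanted, one option is to pipe the result through sort. This is only a sketch and not part of the original command; `LC_ALL=C` (byte order, so uppercase sorts before lowercase) and the output file name index.txt are assumptions made for the example.

```shell
# Same sample input as above, written to a scratch file.
printf '%s\n' 'I and I and I went' 'here and here and there' 'and then home' > file

# The awk program is unchanged; only the sort stage and the redirect are new.
awk '{delete u; for (i=1;i<=NF;i++) u[$i]=1; for (i in u) cnt[i]=cnt[i]NR","}
     END{for (i in cnt) {sub(/,$/,"",cnt[i]); printf "%s: %s\n",i,cnt[i]}}' file |
    LC_ALL=C sort > index.txt

cat index.txt
```

The line numbers within each entry need no sorting, because they are appended in the order the lines are read.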

How it works

The program uses three variables: i, u and cnt. u is used to create a unique list of words on each line. cnt is used to track the line numbers for each word. i is used as a temporary variable in loops.

This code uses the fact that awk implicitly loops over every line in a file. After the last line is read, the END clause is executed, which displays the results.

Considering each command in turn:

    delete u

At the start of each line, we want the array u to be empty.

    for (i=1;i<=NF;i++) u[$i]=1

Create an entry in array u for each word on the line. A word that occurs several times on the line still produces only one entry.

    for (i in u) cnt[i]=cnt[i]NR","

For each distinct word on the line, append the current line number and a comma to that word's entry in the array cnt.

    END{for (i in cnt) {sub(/,$/,"",cnt[i]); printf "%s: %s\n",i,cnt[i]}}

After processing the last line, print out each entry in array cnt. Each entry in cnt has an extra trailing comma. That comma is removed with the sub command. Then printf formats the output.
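
The steps above can also be written out as a script file, which is closer to the `index.awk test1.txt` style of invocation in the question. This is only a sketch: the logic is that of the one-liner, but the names `seen`, `lines` and `word` are renamed stand-ins for u, cnt and i, chosen here for readability.

```shell
# Sample input from the example above.
printf '%s\n' 'I and I and I went' 'here and here and there' 'and then home' > file

cat > index.awk <<'EOF'
# Runs once per input line.
{
    delete seen                       # forget the words of the previous line
    for (i = 1; i <= NF; i++)         # mark each word on this line once
        seen[$i] = 1
    for (word in seen)                # append this line number to the list for each word
        lines[word] = lines[word] NR ","
}
# Runs once, after the last line has been read.
END {
    for (word in lines) {
        sub(/,$/, "", lines[word])    # drop the trailing comma
        printf "%s: %s\n", word, lines[word]
    }
}
EOF

awk -f index.awk file
```

Running it with `awk -f` avoids depending on where awk is installed, which a `#!` line would have to hard-code.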

Refinements

Suppose that we want to ignore differences in case. To do that, we can convert all words to lower case:

    $0=tolower($0)

If we also want to ignore punctuation, we can remove it:

    gsub(/[-.,"!?\/]/," ")

Putting it all together:

    awk '{delete u;$0=tolower($0);gsub(/[-.,"!?\/]/," ");for (i=1;i<=NF;i++) u[$i]=1; for (i in u) cnt[i]=cnt[i]NR","} END{for (i in cnt) {sub(/,$/,"",cnt[i]); printf "%s: %s\n",i,cnt[i]}}' file
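
As a comment on the question points out, listing punctuation characters one by one only removes the characters that are listed. A minimal sketch of the opposite approach follows, under the assumption (made for this example, not stated by the asker) that a word is any run of ASCII letters and digits: after lowercasing, every other character is turned into a space. The file names sample.txt and sample.index.txt are hypothetical.

```shell
# Two lines containing punctuation not covered by the list above.
printf '%s\n' 'Stop! and/or go.' "There's the end, (and) THE end." > sample.txt

awk '{
    delete u
    $0 = tolower($0)
    gsub(/[^a-z0-9]+/, " ")    # anything that is not a letter or digit separates words
    for (i = 1; i <= NF; i++) u[$i] = 1
    for (i in u) cnt[i] = cnt[i] NR ","
}
END {
    for (i in cnt) { sub(/,$/, "", cnt[i]); printf "%s: %s\n", i, cnt[i] }
}' sample.txt | LC_ALL=C sort > sample.index.txt

cat sample.index.txt
```

With this definition `and/or` becomes the two words `and` and `or`, but `There's` is split into `there` and `s`, so what counts as a word still has to be decided for the input at hand.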

I appreciate it! That was very informative, and best of all it works! I wish I could upvote you, unfortunately I would need a rep of 15 to do that :/ such a newb. @John1024 — chomp, Oct 13 '14 at 23:42
@Ryan : You can "accept" this answer by selecting the check-mark in the middle of the up-down-vote counter at the top left of this answer. — shellter, Oct 13 '14 at 23:44
@John1024 the sub(/[-.,"!?/," ") doesn't remove the punctuation from the output. For example, if the input contains the string and/or, it counts as one word instead of being separated into the two words "and" and "or". As well, punctuation is still sticking to the words: "end.", ",and", etc. — chomp, Oct 14 '14 at 00:12
@Ryan Oops. That should have been `gsub` not `sub`. As for `and/or`, the punctuation that is removed is only the punctuation that is specified by the regex in the `gsub` command. That is your choice. I updated the answer to include `/` among the characters removed. — John1024, Oct 14 '14 at 00:19
