I'm having a few problems with this. The output needs to be in the following format: on each line, a word is printed first, followed by a colon ":", then a space, and then the list of line numbers where the word appears (separated by commas). If a word appears on a line multiple times, that line should be reported only once.
Command line: index.awk test1.txt > new.output.txt
My code (currently):
#!/bin/awk -f
Begin {lineCount=1} # start line count at 1
{
    for (i = 1; i <= NF; i++)            # loop through the fields, starting at position 1
        for (j = 2; j <= NF; j++)        # have something to compare against
            if ($i == $j)                # see if they match
                print $i ":" lineCount   # if they do, print the word and line number
    lineCount++                          # increment the line number
}
You'll notice in the sample output below that it completely skips over the first line of the input text file, but counts correctly from there. How can I print all the occurrences when a word appears more than once? Also, is there a built-in awk function that can deal with unwanted characters such as punctuation, numbers, [], (), etc.?
(EDIT: gsub(regexp, replacement, target) can strip these unwanted characters from the text.)
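To illustrate the gsub call mentioned in the edit, here is a minimal sketch that deletes everything except ASCII letters, spaces, and tabs from each line before printing it. The helper name strip_punct and the choice of character class are assumptions made for this example, not something from the original script. When the third argument is omitted, gsub works on $0, and awk re-splits the fields afterwards.

```sh
# strip_punct is a hypothetical helper name used only for this illustration.
strip_punct() {
    awk '{
        # With no target argument, gsub edits $0 in place;
        # awk then re-splits the line into fields.
        gsub(/[^A-Za-z \t]/, "")
        print
    }' "$@"
}

# Example call on a single line of text.
printf '%s\n' 'word, and the [lines] (on2) line.' | strip_punct
```

Deleting the characters turns a word like "punctuation's" into "punctuations"; replacing them with a space instead would split it into two words, so which one is right depends on how such words should be indexed.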
Sample INPUT: I would like to print out each word, and the corresponding lines which the word occurs on. I need to make sure I omit the punctuation's from the strings when printing them out. As well, I need to make sure if the word occurs more than once on a line not to print the line number twice.
SAMPLE OUTPUT:
I:
would:
like:
to:
print:
out:
each:
word:
and,:
the:1
corresponding:
lines:
which:
the:
word:
occurs:
on.:
I:1
need:1
to:1
make:1
sure:1
... etc. (outputs the line numbers correctly from here)
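One possible way to get the requested "word: 1,2,3" format is sketched below, under a few assumptions: words are compared case-sensitively, anything that is not an ASCII letter is simply deleted, and words are listed in order of first appearance. The names build_index, seen, lines, and order are hypothetical and chosen for this example. It relies on awk's built-in NR instead of a hand-maintained counter. Note also that awk's special pattern is spelled BEGIN in upper case; Begin is treated as an ordinary uninitialized variable, so the action attached to it never runs and lineCount stays empty while the first line is processed.

```sh
# build_index is a hypothetical wrapper name; pass it input files or pipe text in.
build_index() {
    awk '
    {
        gsub(/[^A-Za-z \t]/, "")      # drop punctuation and digits from $0
        split("", seen)               # portable way to empty the per-line set
        for (i = 1; i <= NF; i++) {
            if ($i in seen) continue  # word already recorded for this line
            seen[$i] = 1
            if ($i in lines) {
                lines[$i] = lines[$i] "," NR   # append this line number
            } else {
                order[++n] = $i       # remember first-appearance order
                lines[$i] = NR
            }
        }
    }
    END {
        for (k = 1; k <= n; k++)
            print order[k] ": " lines[order[k]]
    }' "$@"
}

# Example call on two lines of text.
printf '%s\n' 'the cat, the dog.' 'A dog (and) the cat!' | build_index
```

With the file from the command line above, this could be run as build_index test1.txt > new.output.txt. Collecting the line numbers in an array and printing in END is what allows each word to appear once with its full list, rather than once per occurrence.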