2

I have the following lines:

123;123;#rss
123;123;#site #design #rss
123;123;#rss
123;123;#rss
123;123;#site #design

and I need to count how many times each tag appears. I do the following:

grep -Eo '#[a-z].*' ./1.txt | tr "\ " "\n" | uniq -c

That is, I first select only the tags from each line, then split them one per line and count them.

output:

   1 #rss
   1 #site
   1 #design
   3 #rss
   1 #site
   1 #design

instead of the expected:

   2 #site
   4 #rss
   2 #design

I suspect the problem is non-printable characters, which make the counting incorrect. Or is it something else? Can anyone suggest a correct solution?

Inian
KarlsD
  • `uniq` requires the input to already be sorted; one quick fix would be `... | sort | uniq -c`; the `.*` says to match zero or more of any character (including whitespace and non-printing characters) ... try `'#[a-z]+'` to limit to just lower case letters – markp-fuso Feb 10 '21 at 14:57
  • Please have a look at [What should I do when someone answers my question?](https://stackoverflow.com/help/someone-answers) – Socowi Feb 16 '21 at 00:16
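
To make the regex point in the first comment concrete, here is a minimal sketch run against one line of the question's sample data. The variable names `line`, `greedy` and `per_tag` are made up for illustration, and it assumes a `grep` that supports `-o` (GNU and BSD grep both do).

```shell
# One multi-tag line taken from the question's sample data.
line='123;123;#site #design #rss'

# Greedy '.*' runs to the end of the line, so all three tags
# come out as a single match.
greedy=$(printf '%s\n' "$line" | grep -Eo '#[a-z].*')

# '[a-z]+' stops at the first non-letter, so grep -o emits
# each tag as its own match, one per line.
per_tag=$(printf '%s\n' "$line" | grep -Eo '#[a-z]+')

printf 'greedy match:\n%s\n\nper-tag matches:\n%s\n' "$greedy" "$per_tag"
```

This is why the original pipeline needed `tr` to split the tags, while the narrower pattern does not.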

5 Answers

2

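uniq -c counts only runs of adjacent identical lines, so it gives per-tag totals only on sorted input.
Also, you can drop the tr by changing the regex to #[a-z]*, because grep -o already prints each match on its own line.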

grep -Eo '#[a-z]*' ./1.txt | sort | uniq -c

prints

  2 #design
  4 #rss
  2 #site

as expected.
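
As a small illustration of the sorting requirement, the sketch below feeds the same four tags to `uniq -c` with and without `sort`. The tag order and the variable names are made up for the demo; they are not taken from the question's file.

```shell
# Tags in an arbitrary, unsorted order (illustrative data only).
tags=$(printf '%s\n' '#rss' '#site' '#rss' '#rss')

# Without sort: uniq -c counts each run of adjacent identical
# lines, so '#rss' is reported twice (a run of 1 and a run of 2).
unsorted_counts=$(printf '%s\n' "$tags" | uniq -c)

# With sort: identical lines become adjacent, so every tag
# ends up with a single total.
sorted_counts=$(printf '%s\n' "$tags" | sort | uniq -c)

printf 'unsorted:\n%s\n\nsorted:\n%s\n' "$unsorted_counts" "$sorted_counts"
```

The leading padding that `uniq -c` puts before each count differs between implementations, so only the numbers and tags are meaningful here.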

Socowi
1

It can be done in a single GNU awk command (RT is a gawk extension that holds the text matched by RS):

awk -v RS='#[a-zA-Z]+' 'RT {++freq[RT]} END {for (i in freq) print freq[i], i}' file

2 #site
2 #design
4 #rss

Or else a grep + awk solution:

grep -iEo '#[a-z]+' file |
awk '{++freq[$1]} END {for (i in freq) print freq[i], i}'

2 #site
2 #design
4 #rss
anubhava
0

Using awk as an alternative:

awk -F [" "\;] '{ for(i=3;i<=NF;i++) {  map[$i]++ } } END { for (i in map) { print map[i]" "i} }' file

Set the field separator to a space or a ";". Then loop from the third field to the last field (NF), using each field as an index into the array map and incrementing its counter. After the whole file has been processed, loop through the map array and print each count followed by its tag.

Raman Sailopal
  • `-F [" "\;]` should be `-F '[ ;]'`. Your array is keeping a count, not providing a mapping, so `cnt[]` or similar would be a more useful name for it than `map[]`. Also - `print map[i]" "i` = `print map[i], i` - let OFS have its reason to live :-). – Ed Morton Feb 11 '21 at 22:47
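
Applying the suggestions from the comment above, the command might look like the sketch below. The temporary file and the names `file`, `cnt` and `counts` are assumptions for the demo, and the trailing `sort -k2` is an addition of mine that only makes the output order stable, since awk's `for (... in ...)` order is unspecified.

```shell
# Hypothetical sample file reproducing the question's data.
file=$(mktemp)
printf '%s\n' \
  '123;123;#rss' \
  '123;123;#site #design #rss' \
  '123;123;#rss' \
  '123;123;#rss' \
  '123;123;#site #design' > "$file"

# -F '[ ;]' splits fields on a space or a semicolon; the quotes keep
# the shell from treating the brackets as a glob pattern.
# cnt[] holds the per-tag count; "print a, b" joins the values with OFS.
counts=$(awk -F '[ ;]' '
  { for (i = 3; i <= NF; i++) cnt[$i]++ }
  END { for (tag in cnt) print cnt[tag], tag }
' "$file" | sort -k2)

printf '%s\n' "$counts"
rm -f "$file"
```

Like the original, this assumes the tags always start at the third field, so it depends on the first two columns never containing a space or semicolon.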
0

With your shown samples, could you please try the following? Written and tested in GNU awk.

awk '
{
  while(match($0,/#[^ ]*/)){
    count[substr($0,RSTART,RLENGTH)]++
    $0=substr($0,RSTART+RLENGTH)
  }
}
END{
  for(key in count){
    print count[key],key
  }
}' Input_file

Output will be as follows.

2 #site
2 #design
4 #rss

Explanation: a detailed, commented version of the above.

awk '                                     ##Start the awk program.
{
  while(match($0,/#[^ ]*/)){              ##Loop while the regex #[^ ]* still matches in the current line; match() sets RSTART and RLENGTH.
    count[substr($0,RSTART,RLENGTH)]++    ##Use the matched substring as the index of the count array and increase its value by 1.
    $0=substr($0,RSTART+RLENGTH)          ##Put the rest of the line after the match into the current line.
  }
}
END{                                      ##Start the END block of this program.
  for(key in count){                      ##Use a for loop to go through the count array.
    print count[key],key                  ##Print the count stored under key, followed by key itself.
  }
}
' Input_file                              ##Mention the Input_file name here.
RavinderSingh13
0
$ cut -d';' -f3 file | tr ' ' '\n' | sort | uniq -c
      2 #design
      4 #rss
      2 #site
Ed Morton