4

I have a file strings.txt listing strings, which I am processing like this:

sort strings.txt | uniq -c | sort -n > uniq.counts

The resulting file uniq.counts lists the unique strings sorted in ascending order of their counts, something like this:

 1 some string with    spaces
 5 some-other,string
25 most;frequent:string

Note that strings in strings.txt may contain spaces, commas, semicolons and other separators, except for the tab. How can I get uniq.counts to be in this format:

 1<tab>some string with    spaces
 5<tab>some-other,string
25<tab>most;frequent:string
Charles Duffy
I Z
  • This isn't really a question about `sort` (or, rather, changing the delimiter used by `sort` is both trivial and shown in the man page, as `-t` aka `--field-separator`; thus `sort -t $'\t'` would suffice to answer the whole of the question posed in the original title); the interesting part is how to change the delimiter used by `uniq -c` to a tab. – Charles Duffy Jul 12 '16 at 19:26
  • As chepner briefly commented -- even with IFS at defaults, `while read -r count content; do ...` would succeed in parsing the count from the rest of the output in `uniq.counts` with the original output format, without need for a distinct character. – Charles Duffy Jul 12 '16 at 19:31
  • Does this answer your question? [Why uniq -c output with space instead of \t?](https://stackoverflow.com/questions/11670393/why-uniq-c-output-with-space-instead-of-t) – Pablo Bianchi Apr 01 '20 at 02:22
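The `read`-based parsing mentioned in the comments could be sketched roughly as below. `tabify_counts` is a hypothetical helper name, not something from the thread. With the default IFS, `read` drops the leading padding and keeps the spaces inside the string, but it also trims whitespace at the very start and end of the string itself, so this sketch assumes the strings do not begin or end with spaces.

```shell
# Hypothetical helper: reads `uniq -c` style lines on stdin and
# writes "count<TAB>string" lines on stdout.
tabify_counts() {
    # Default IFS: the first word goes to $count, the rest of the line to $content.
    while read -r count content; do
        printf '%s\t%s\n' "$count" "$content"
    done
}

# Small demonstration on inline sample data.
printf '      1 some string with    spaces\n     25 most;frequent:string\n' | tabify_counts
```

In the original pipeline this would be used as `sort strings.txt | uniq -c | sort -n | tabify_counts > uniq.counts`.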

3 Answers

4

You can do:

sort strings.txt | uniq -c | sort -n | sed -E 's/^ *//; s/ /\t/' > uniq.counts

sed first strips the leading spaces before the count, then replaces the first remaining space (the one right after the count) with a tab character.
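Interpreting `\t` in the replacement is a GNU sed extension. Where that is not available (BSD/macOS sed, for instance), one possible workaround is to pass a literal tab generated by `printf`. The following is a minimal sketch under that assumption; `counts_to_tsv` is a hypothetical helper name.

```shell
# Hypothetical helper: same two substitutions as above, but with a
# literal tab character instead of the GNU-specific \t escape.
counts_to_tsv() {
    tab=$(printf '\t')            # literal tab character
    sed "s/^ *//; s/ /$tab/"      # strip padding, then first space -> tab
}

# Small demonstration on inline sample data.
printf '      1 some string with    spaces\n' | counts_to_tsv
```

It drops into the pipeline the same way: `sort strings.txt | uniq -c | sort -n | counts_to_tsv > uniq.counts`.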

glicerico
anubhava
3

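You can pipe the output of the sort pipeline through sed before writing to uniq.counts. Since uniq -c pads the count with leading spaces, the pattern allows them before the digits and replaces the single space after the count with a tab (\t in the replacement requires GNU sed), e.g. add:

| sed -e 's/^\( *[0-9][0-9]*\) /\1\t/' > uniq.counts

The full expression would be:

$ sort strings.txt | uniq -c | sort -n | \
sed -e 's/^\( *[0-9][0-9]*\) /\1\t/' > uniq.counts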

(line continuation included for clarity)

David C. Rankin
2

With GNU sed:

sort strings.txt | uniq -c | sort -n | sed -r 's/([0-9]) /\1\t/' > uniq.counts

Output to uniq.counts:

 1      some string with    spaces
 5      some-other,string
25      most;frequent:string

If you want to edit your file "in place" use sed's option -i.
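As a rough illustration of the in-place variant, assuming GNU sed (for `-r`, `-i` without a backup suffix, and `\t` in the replacement) and a counts file that already exists in the space-separated format; `tabify_in_place` is a hypothetical helper name:

```shell
# Hypothetical helper: rewrites the given file in place, turning the
# space after the count into a tab. Leading padding is left untouched.
tabify_in_place() {
    sed -i -r 's/([0-9]) /\1\t/' "$1"
}
```

For example, `tabify_in_place uniq.counts` would convert a file previously produced by `sort strings.txt | uniq -c | sort -n > uniq.counts`.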

Cyrus