Parsing a .csv-like file in bash

Question

I have a file formatted as follows:

string1,string2,string3,...
...

I have to analyze the second column, count the occurrences of each string, and produce a file formatted as follows:

"number of occurrences of x",x
"number of occurrences of y",y        
...

I managed to write the following script, which works fine:

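#!/bin/bash

# Truncate (or create) the output file.
> output
# Matches one line of `uniq -c` output: padding, the count, one space, the string.
regExp='^\s*([0-9]+) (.+)$'
while IFS= read -r line
do
    if [[ "$line" =~ $regExp ]]
    then
        # Pass the captures as arguments so a % or \ in the data is not read as a format.
        printf '%s,%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" >> output
    fi
done <<< "$(gawk -F , '!/^$/ {print $2}' "$1" | sort | uniq -c)"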

My question is: is there a better and simpler way to do the job?

In particular, I don't know how to fix this:

gawk -F , '!/^$/ {print $2}' miocsv.csv | sort | uniq -c | gawk '{print $1","$2}'

The problem is that string2 can contain whitespace; if it does, the second gawk call truncates the string. I also don't know how to print all the fields "from 2 to NF" while preserving the delimiter, which can occur several times in succession.

Thank you very much.

EDIT:

As requested, here is some sample data:

(It is an exercise, sorry for the contrived data.)

Input:

*,*,*
test,  test  ,test
prova, * , prova
test,test,test
prova,  prova   ,prova
leonardo,da vinci,leonardo
in,o    u   t   ,pr
, spaces ,
, spaces ,
leonardo,da vinci,leonardo
leonardo,da vinci,leonardo
leonardo,da vinci,leonardo
in,o    u   t   ,pr
test,  test  ,test
,   tabs    ,
,   tabs    ,
po,po,po
po,po,po
po,po,po
prova, * , prova
prova, * , prova
*,*,*
*,*,*
*,*,*
, spaces ,
,   tabs    ,

Output:

3, * 
4,*
4,da vinci
2,o u   t   
3,po
1,  prova   
3, spaces 
3,  tabs    
1,test
2,  test

Filipe Gonçalves · Accepted Answer · 2015-09-08T18:39:11.770

5

A one-liner in awk:

awk -F, '{ x[$2]++ } END { for (i in x) print x[i] "," i }' input.csv

It stores the count for each 2nd-column string in the associative array x and, at the end, loops through the array and prints the results.

To get the exact output you showed for this example, you need to pipe it to sort(1), setting the field delimiter to , and the sort key to the 2nd field:

awk -F, '{ x[$2]++ } END { for (i in x) print x[i] "," i }' input.csv | sort -t, -k2,2

The only condition, of course, is that the 2nd column of each line doesn't contain a comma.
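The two commands above can be wrapped in a small function, as a minimal sketch. The name count_col2 is hypothetical, and two details are additions not found in the answer: the NF test, which skips empty lines the way the question's `!/^$/` does, and LC_ALL=C, which makes sort compare bytes so the order does not depend on the locale.

```shell
#!/bin/bash
# count_col2 FILE: print "count,string" for every distinct 2nd-column string.
# (hypothetical helper name, used only for this illustration)
count_col2() {
    # NF is 0 on an empty line, so such lines are not counted.
    # count[$2]++ tallies each distinct 2nd field; END prints the tallies.
    awk -F, 'NF { count[$2]++ }
             END { for (s in count) print count[s] "," s }' "$1" |
        LC_ALL=C sort -t, -k2,2   # order by the string, in byte order
}

# Demonstration on a throwaway file (the empty line is ignored).
tmp=$(mktemp)
printf '%s\n' 'leonardo,da vinci,leonardo' 'po,po,po' '' \
              'leonardo,da vinci,leonardo' > "$tmp"
count_col2 "$tmp"
# prints:
# 2,da vinci
# 1,po
rm -f "$tmp"
```

Because the string is used whole as the array key, any leading, trailing, or repeated whitespace inside the 2nd column is preserved in the output.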

edited Sep 08 '15 at 18:39

answered Sep 08 '15 at 18:25

Filipe Gonçalves


Thank you! unfortunately I'm not good with awk... It's incredible what it can do – Luca Sep 08 '15 at 18:44
@Nopaste Indeed, it is a very powerful tool. I recommend reading *The awk programming language* if you have the time, it'll teach you this (and much more). – Filipe Gonçalves Sep 08 '15 at 18:45

score 1 · Answer 2 · answered Sep 08 '15 at 18:25

1

You can make your final awk:

gawk '{ sub(" *","",$0); sub(" ",",",$0); print }'

or use sed for this sort of thing:

sed 's/ *\([0-9]*\) /\1,/'
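For context, here is a sketch of the whole pipeline with sed as the last stage. The function name col2_counts is hypothetical, the pattern is anchored with `^` and requires at least one digit (close to the variant mentioned in the comments, but without `-r`), and LC_ALL=C is an added assumption to keep the ordering independent of the locale.

```shell
#!/bin/bash
# col2_counts FILE: count the 2nd-column strings with sort | uniq -c,
# then turn the "   N string" lines into "N,string".
# (hypothetical helper name, used only for this illustration)
col2_counts() {
    # 1. print field 2 of every non-empty line
    # 2. sort so that equal strings are adjacent, then count them
    # 3. drop the padding and replace the one space after the count with a comma
    awk -F, '!/^$/ { print $2 }' "$1" |
        LC_ALL=C sort |
        LC_ALL=C uniq -c |
        sed 's/^ *\([0-9][0-9]*\) /\1,/'
}

# Demonstration on a throwaway file.
tmp=$(mktemp)
printf '%s\n' 'in,o  u  t,pr' 'x,po,y' 'in,o  u  t,pr' > "$tmp"
col2_counts "$tmp"
# prints:
# 2,o  u  t
# 1,po
rm -f "$tmp"
```

sed replaces only the first match on each line, so spaces inside the string itself are left alone.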

answered Sep 08 '15 at 18:25

meuh


Thanks.. I think I'll go for the sed version, seems to be the easiest way! – Luca Sep 08 '15 at 18:46
I only did a little change: `sed -r 's/^ *([0-9]+) /\1,/'` – Luca Sep 08 '15 at 19:15

Chris Koknat · Answer 3 · 2015-10-07T16:48:10.347

0

Here is a Perl one-liner, similar to Filipe's awk solution:

perl -F, -lane '$x{$F[1]}++; END{ for $i (sort keys %x) { print "$x{$i},$i" } }' input.csv

The output is sorted alphabetically by the second column.
Note that the @F autosplit array starts at $F[0], whereas awk fields start at $1.

edited Oct 07 '15 at 16:48

answered Sep 08 '15 at 21:37

Chris Koknat

