2

I need to delete all the rows in a csv file which have more than a certain number of columns.

This happens because sometimes the code, which generates the csv file, skips some values and prints the following on the same line.

Example: Consider the following file to parse. I want to remove all the rows which have more than 3 columns (i.e. the columns of the header):

timestamp,header2,header3
1,1val2,1val3
2,2val2,2val3
3,4,4val2,4val3
5val1,5val2,5val3
6,6val2,6val3

The output file I would like to have is:

timestamp,header2,header3
1,1val2,1val3
2,2val2,2val3
5val1,5val2,5val3
6,6val2,6val3

I don't care if the row with timestamp 4 is missing.

I would prefer a solution in bash or perhaps using awk, rather than a python one, so that I can learn how to use it.

fedorqui
  • 275,237
  • 103
  • 548
  • 598
Paolo
  • 183
  • 1
  • 6

2 Answers2

8

This can be done straight forward with awk:

awk -F, 'NF<=3' file

This uses the awk variable NF that holds the number of fields in the current line. Since we have set the field separator to the comma (with -F, or, equivalent, -v FS=","), then it is just a matter of checking when the number of fields is not higher than 3. This is done with NF<=3: when this is true, the line will be printed automatically.

Test

$ awk -F, 'NF<=3' a
timestamp,header2,header3
1,1val2,1val3
2,2val2,2val3
5val1,5val2,5val3
6,6val2,6val3
fedorqui
  • 275,237
  • 103
  • 548
  • 598
  • 1
    This is a pretty and smarter solution you could use instead of mine. – sweepy_ Apr 02 '15 at 12:11
  • Thanks a lot. Just for curiosity, what if I wanted only the rows with exactly 3 columns? I tried and I cannot simply replace '<=' with '==' or '='. If it is hard to be done, don't worry. It was just for curiosity. – Paolo Apr 02 '15 at 13:14
  • @Paolo you were doing it properly! `awk -F, 'NF==3' file` is the way to go. – fedorqui Apr 02 '15 at 13:20
1

Try the following (do not omit to replace your file path and your max column):

#! /bin/bash

filepath=test.csv
max_columns=3

for line in $(cat $filepath);
do
    count=$(echo "$line" | grep -o "," | wc -l)
    if [ $(($count + 1)) -le $max_columns ]
            then
            echo $line
    fi
done

Copy this in a .sh file (cropper.sh for example), make it executable chmod +x cropper.sh and run ./cropper.sh).

This will output only the valid lines. You can then catch the result in a file this way:

./cropper.sh > result.txt

sweepy_
  • 1,333
  • 1
  • 9
  • 28