Remove entire column with same strings (with header remains) using grep or awk

Question

I have a file as below:

name1   name2   name3   name4    
AA  BB  BB  CC   
AA  AA  BB  CC   
AA  CC  BB  CC   
AA  DD  BB  DD   
AA  DD  BB  AA

Columns 1 and 3 each contain a single repeated value (every row of column 1 is AA and every row of column 3 is BB). I want to remove any such column entirely, while keeping the header row for the columns that remain, so that the file ends up like this:

name2   name4   
BB  CC         
AA  CC   
CC  CC   
DD  DD   
DD  AA

Is there any way to do so using grep or awk? Thanks a lot!

In your example, column 1 and 3 *don't* have the same strings at all, anywhere. I'm confused, and your question is confusing. — Nathan Tuggy, Jan 22 '15 at 03:22
Clarification needed. If _every_ line of the file has column 1 and 3 equal (other than header of course), you want to remove column 1 and 3 from entire file? — paxdiablo, Jan 22 '15 at 03:26

score 1 · Answer 1 · answered Jan 22 '15 at 03:35

This is not optimal in terms of performance (it runs awk, sort and wc once per column), but it does use awk and it does work for your sample input:

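file=$1

# Read the header row; its whitespace-separated names drive the loop below.
header=$(head -n 1 "$file")
i=1
goodcols=""
for colname in $header; do
  # Count the distinct values in column i, ignoring the header row.
  count=$(awk "NR>1 {print \$$i}" "$file" | sort -u | wc -l)
  # Keep the column only if it holds more than one distinct value.
  if [ $count -gt 1 ]; then
    if [ -z "$goodcols" ]; then
      goodcols="\$$i"
    else
      goodcols="$goodcols, \$$i"
    fi
  fi
  i=$((i+1))
done

# goodcols is now a field list such as "$2, $4"; print just those columns.
awk "{print $goodcols}" "$file"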
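
A possible lower-cost variant of the same idea, offered as a sketch rather than a drop-in replacement: let a single awk process read the file twice, counting distinct values per column on the first pass and printing only the non-constant columns on the second. The function name drop_constant_cols is made up for illustration, and the sketch assumes whitespace-separated fields and a regular file (the input is opened twice, so it cannot come from a pipe).

```shell
# Hypothetical helper: print FILE without the columns whose data rows
# all hold the same value. The header row is kept for surviving columns.
drop_constant_cols() {
    awk '
        # Pass 1 (first read of the file): count distinct values per column,
        # skipping the header row.
        NR == FNR {
            if (FNR > 1)
                for (c = 1; c <= NF; c++)
                    if (!seen[c, $c]++) distinct[c]++
            next
        }
        # Pass 2 (second read): print every column that is not constant.
        {
            sep = ""
            for (c = 1; c <= NF; c++)
                if (distinct[c] != 1) {
                    printf "%s%s", sep, $c
                    sep = OFS
                }
            print ""
        }
    ' "$1" "$1"
}
```

Called as drop_constant_cols file, this prints only the name2 and name4 columns for the sample input, separated by a single space (awk's default OFS).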

paxdiablo · Answer 2 · 2015-01-22T04:03:24.340

If your intent is to print only columns 2 and 4 when every data line has the same value in column 1 as in column 3, and to print the entire file unchanged otherwise, the following script will do it:

same=$(awk 'BEGIN{same=1}NR==1{next}$1!=$3{same=0;exit}{}END{print same}' qq.in)
if [[ $same -eq 1 ]] ; then
    awk '{print $2" "$4}' qq.in
else
    cat qq.in
fi

The first awk outputs 1 if all lines (other than the header of course) have an identical column1/3 value. Otherwise it outputs 0.

Then you simply use that to either filter the columns, or output the file as is.

If, instead, you want to strip columns 1 and 3 only if all the values in column 1 are identical and all the values in column 3 are identical (as per your test data), change the first line to the following, and test $allsame rather than $same in the if statement:

allsame=$(awk 'BEGIN{allsame=1}NR==1{next}NR==2{val1=$1;val3=$3;next}$1!=val1||$3!=val3{allsame=0;exit}{}END{print allsame}' qq.in)
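
As an aside, the 0/1 flag does not have to be printed and captured; it can travel in awk's exit status instead. The following is only a sketch of that variant for the second interpretation (columns 1 and 3 each constant). The function names cols_1_3_constant and strip_if_constant are invented here and are not part of the script above.

```shell
# Hypothetical helper: exit 0 when columns 1 and 3 are each constant across
# all data rows of FILE, exit 1 otherwise.
cols_1_3_constant() {
    awk '
        NR == 1 { next }                        # skip the header row
        NR == 2 { v1 = $1; v3 = $3; next }      # remember the first data row
        $1 != v1 || $3 != v3 { bad = 1; exit }  # first mismatch: jump to END
        END { exit bad + 0 }                    # 0 = constant, 1 = not constant
    ' "$1"
}

# Hypothetical wrapper: strip columns 1 and 3 only when the check passes,
# otherwise print the file as is.
strip_if_constant() {
    if cols_1_3_constant "$1"; then
        awk '{ print $2, $4 }' "$1"
    else
        cat "$1"
    fi
}
```

Usage would be strip_if_constant qq.in. The design choice is simply that the shell's if can test a command's exit status directly, so no intermediate variable is needed.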

score 0 · Answer 3 · answered Jan 22 '15 at 04:02

The UNIX shell is simply an environment from which to call UNIX tools. The UNIX tool for general text manipulation is awk so just use it:

$ cat tst.awk
{
    for (col=1; col<=NF; col++) {
        val[NR,col] = $col
        if ( (NR>1) && (!seen[col,$col]++) ) {
            cnt[col]++
        }
    }
}
END {
    for (row=1; row<=NR; row++) {
        ofs = ""
        for (col=1; col<=NF; col++) {
            if (cnt[col] != 1) {
                printf "%s%s", ofs, val[row,col]
                ofs = OFS
            }
        }
        print ""
    }
}

$ awk -f tst.awk file
name2 name4
BB CC
AA CC
CC CC
DD DD
DD AA
