Remove blank spaces comm output

Question

I have two lists of IDs that I am comparing with comm command. My problem is that output looks like this:

YAL002W
YAL003W
        YAL004W
        YAL005C
                YAL008W
        YAL011W

All I want to do is try to pipe it somehow so the file is written with out the empty spcaces, that translate into white cell when I open this files in excel. I have tried every possible combination I have found of grep, awk and sed to remove blank spaces without luck...
So I have came to the conclusion that columns are separated by one or two tabs respectively, therefore I can not remove them as easily as removing blank spaces without removing the formating of the file.

any help or suggestion will be welcomed. Thanks

EDIT:

I want my output to be three columns, tab delimited without the blank spaces

YAL002W YAL004W YAL008W
YAL003W YAL005C
        YAL011W

EDIT2 to avoit XY Problem as referenced:

Original problem (X): I have to lists and I want to find common and unique words between both lists (To generate a Venn diagram later on). So comm seemed like the perfect solution since I get all three lists at the same time, which I can later on import into excel easily.

Secondary problem (Y): The three columns that are generated are not three columns (or so I am starting to think) since I can't cut -f them, nor I can't remove the blank spaces with usual awk 'NF' or grep . (for example).

@JuanLB Don't try to put it in a comment, edit the question and show it. — Barmar, Apr 05 '18 at 21:44
Interesting. Kind of like Tetris. Every solution will be a treasure. — James Brown, Apr 05 '18 at 21:55

Ed Morton · Accepted Answer · 2018-04-05T22:39:21.673

Given this input and comm output:

$ cat file1
YAL002W
YAL003W
YAL008W

$ cat file2
YAL004W
YAL005C
YAL008W
YAL011W

$ comm file1 file2
YAL002W
YAL003W
        YAL004W
        YAL005C
                YAL008W
        YAL011W

This will do what you asked for:

$ cat tst.awk
BEGIN { FS=OFS="\t" }
{
    colNr = NF
    rowNr = ++rowNrs[colNr]
    val[rowNr,colNr] = $NF
    numCols = (colNr > numCols ? colNr : numCols)
    numRows = (rowNr > numRows ? rowNr : numRows)
}
END {
    for (rowNr=1; rowNr<=numRows; rowNr++) {
        for (colNr=1; colNr<=numCols; colNr++) {
            printf "%s%s", val[rowNr,colNr], (colNr<numCols ? OFS : ORS)
        }
    }
}

.

$ comm file1 file2 | awk -f tst.awk
YAL002W YAL004W YAL008W
YAL003W YAL005C
        YAL011W

but of course you could just skip the call to comm and use awk right off the bat:

$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR==FNR {
    file1[$0]
    next
}
{
    if ($0 in file1) {
        colNr = 3
        delete file1[$0]
    }
    else {
        colNr = 2
    }
    rowNr = ++rowNrs[colNr]
    val[rowNr,colNr] = $0
}
END {
    for (v in file1) {
        colNr = 1
        rowNr = ++rowNrs[colNr]
        val[rowNr,colNr] = v
    }

    numRows = (rowNrs[1] > rowNrs[2] ? rowNrs[1] : rowNrs[2])
    numRows = (numRows   > rowNrs[3] ? numRows   : rowNrs[3])
    numCols = 3
    for (rowNr=1; rowNr<=numRows; rowNr++) {
        for (colNr=1; colNr<=numCols; colNr++) {
            printf "%s%s", val[rowNr,colNr], (colNr<numCols ? OFS : ORS)
        }
    }
}

.

$ awk -f tst.awk file1 file2
YAL002W YAL004W YAL008W
YAL003W YAL005C
        YAL011W

You mean that I can compare the two files wiht awk directly? I will try that!, thanks! — Juan LB, Apr 05 '18 at 22:11
sorry for the n00b question but tst.awk can not be written direclty on terminal window, right? I have to generate a file with the code and then run it with `$ awk -f tst.awk file`, am I right? — Juan LB, Apr 05 '18 at 22:17
No, You can include the script within single quotes like `comm f1 f2 | awk '{print $1}'` — Ed Morton, Apr 05 '18 at 22:19
I updated my answer to show you how to do it all in awk without calling comm first. — Ed Morton, Apr 05 '18 at 22:39

glenn jackman · Answer 2 · 2018-04-05T23:14:23.517

For kicks, implementing without comm.

gawk '
    NR == FNR {file1[$0]; next}
    $0 in file1 {common[$0]; delete file1[$0]; next}
    {file2[$0]}
    END {
        PROCINFO["sorted_in"] = "@ind_str_asc"
        c=0; for (e in file1)  v[1,++c] = e; max = c
        c=0; for (e in file2)  v[2,++c] = e; if (c > max) max = c
        c=0; for (e in common) v[3,++c] = e; if (c > max) max = c
        for (i=1; i<=max; i++)
            printf "%s\t%s\t%s\n", v[1,i], v[2,i], v[3,i]
    }
' file1 file2

YAL002W YAL004W YAL008W
YAL003W YAL005C 
    YAL011W

Requires GNU awk for the use of PROCINFO.

score 1 · Answer 3 · answered Apr 05 '18 at 23:26

For more kicks, a non-awk answer

comm file1 file2 | ruby -e '
    data = Array.new(3) {Array.new}
    readlines.each {|line| 
        fields = line.chomp.split("\t")
        data[fields.length - 1] << fields[-1]
    }
    m = data.map(&:length).max
    data.collect {|lst| (lst + [""] * m).first(m)}   # pad shorter lists
        .transpose
        .each {|row| puts row.join("\t")}
'

YAL002W YAL004W YAL008W
YAL003W YAL005C 
        YAL011W

score 0 · Answer 4 · answered Apr 05 '18 at 22:01

Use awk with TAB as the field separator. Find the first field that's filled in, and add that to the corresponding column of a multi-dimensional array. Use separate counters for each column so you fill in the next row in that array.

awk -F'\t' 'BEGIN {col1=0; col2=0; col3=0; max=0; SUBSEP="\t"}
            { if (!out[max]) {out[max,1] = ""; out[max,2] = ""; out[max,3] = ""} }
            length($1) { out[col1, 1] = $1; if(col1 > max) max = col1++; next }
            length($2) { out[col2, 2] = $2; if(col2 > max) max = col2++; next }
            length($3) { out[col3, 3] = $3; if(col3 > max) max = col3++;  }
            END { for (i = 0; i < max; i++) { print(out[i]; }'

score 0 · Answer 5 · answered Apr 05 '18 at 22:04

for comm output, the number of columns is fixed, here is a tailored solution

$ awk 'BEGIN {FS=OFS="\t"}
             {for(i=1;i<=3;i++) 
                if($i) {a[i,++c[i]]=$i; if(max<c[i]) max=c[i]}} 
       END   {for(i=1;i<=max;i++) print a[1,i],a[2,i],a[3,i]}' file

YAL002W YAL004W YAL008W
YAL003W YAL005C
        YAL011W

Remove blank spaces comm output

5 Answers5