Display unique values in each column using awk

Question

I'm relatively new to using awk/grep etc and want to filter some data. I have a large spreadsheet which I want to display the unique values column by column. For example I want to change this:

DS571187    DS571220    DS571200    DS571194  
contig1     contig3     contig4     contig7  
contig2     contig3     contig4     contig7  
contig1     contig4     contig6     contig8  
contig1     contig5     contig6     contig9  
contig2     contig4     contig6     contig9  
contig2         
contig2

to something that looks like this:

DS571187    DS571220    DS571200    DS571194
contig1     contig3     contig4     contig7
contig2     contig4     contig6     contig8
            contig5                 contig9

Basically I'm trying to sort each column as its own list and get the unique values this way. Any help would be appreciated.

Amber

Do you know about the Excel feature 'Remove Duplicates'? Its on the Data Ribbon, 2nd in the 'Data Tools' segment. Unfortunately you have to process 1 column at a time to get the results you want. But still may be faster than exporting file, uniqing it, and reloading. Oh, ... spreadsheet, not necessarily Excel. Well just in case, leaving this comment. Good luck. — shellter, Aug 22 '16 at 15:20
Hi Shellter, I tried the remove duplicates function but I have literally thousands of columns to work through. Thanks anyway! :) — A. Leckenby, Aug 22 '16 at 15:33
`awk` solution is great. But maybe it's time to learn a little bit of VBA scripting. You can probably search for "Excel for loop columns remove duplicates" and get something you can use. OR can you process data before bringing into spreadsheet? No need to reply, just some ideas. Good luck! — shellter, Aug 22 '16 at 16:57
Yep, and guess where :-) http://stackoverflow.com/questions/31654461/excel-to-remove-duplicates-one-column-at-a-time-for-many-columns ... Good luck. — shellter, Aug 22 '16 at 16:59
Aah that link has exactly what I have spent days looking for and works perfectly! Thank you! :) — A. Leckenby, Aug 23 '16 at 13:19

Ed Morton · Answer 1 · 2016-08-22T14:55:08.900

Assuming your input file is tab-separated as it appears to be:

$ cat tst.awk
BEGIN { FS=OFS="\t" }
{
    for (colNr=1;colNr<=NF;colNr++) {
        if (!seen[colNr,$colNr]++) {
            val[++colRowNr[colNr],colNr] = $colNr
            numRows = (colRowNr[colNr] > numRows ? colRowNr[colNr] : numRows)
        }
    }
    numCols = (NF > numCols ? NF : numCols)
}
END {
    for (rowNr=1;rowNr<=numRows;rowNr++) {
        for (colNr=1;colNr<=numCols;colNr++) {
            printf "%s%s", val[rowNr,colNr], (colNr<numCols ? OFS : ORS)
        }
    }
}

$ awk -f tst.awk file | column -s$'\t' -t
DS571187  DS571220  DS571200  DS571194
contig1   contig3   contig4   contig7
contig2   contig4   contig6   contig8
          contig5             contig9

The call to column is just to make the alignment look pretty on the site.

If it's not tab-separated then to to this concisely and robustly you need GNU awk for FIELDWIDTHS to identify possibly empty fields in mid-line like this input (which you should test other potential solutions against as later input columns being shorter than earlier ones I expect can happen in your real data and makes this a harder problem to solve):

$ column -s$'\t' -t file
DS571187  DS571220  DS571200  DS571194
contig1   contig3   contig4   contig7
contig2   contig3             contig7
contig1   contig4             contig8
          contig5             contig9
                              contig9

$ awk -f tst.awk file | column -s$'\t' -t
DS571187  DS571220  DS571200  DS571194
contig1   contig3   contig4   contig7
contig2   contig4             contig8
          contig5             contig9

Hi Ed, I don't know if I'm doing something wrong but when I run that it works great for the first few columns and then after that it seems to list unique contigs followed by a space and then some more contigs which are not related to the header (i.e. the DS57...) I have tried filling in all my gaps in my spreadsheet with 0 to create a 'full matrix structure' incase it was some blank spaces that were causing it to throw the script off. Any ideas appreciated :) — A. Leckenby, Aug 23 '16 at 12:52
Blank spaces will not cause any problems for that script. **IS** your input file tab-separated (make SURE it is)? Could it contain control-Ms at the end of each line (try `cat -v file` to see them and `dos2unix` or similar to remove them). If none of that helps, reduce your file to the smallest possible number of rows and columns that can reproduce the problem and then upload it to some code-sharing site so we can see it. — Ed Morton, Aug 23 '16 at 14:30

James Brown · Answer 2 · 2016-08-22T16:18:33.243

Gawk probably needed, tab expected as delimiter, any one character delimiter works (-F"\t" below):

$ cat > cs.awk
NR==1 {
    nf=NF
    $1=$1
    print
}
NR>1 {
    for(i=1;i<=NF;i++)
        if($i!="")
            a[i][$i]++
}
END {
    for(i=1;i<=nf;i++)
        n[i]=asorti(a[i])
    j=asort(n)
    for(i=1;i<=n[j];i++)
        for(k=1;k<=nf;k++)
            printf "%-8s%s", a[k][i], (k<nf?OFS:ORS)
}
$ awk -F"\t" -f cs.awk cs_by_ed.txt
DS571187 DS571220 DS571200 DS571194
contig1  contig3  contig4  contig7
contig2  contig4           contig8
         contig5           contig9

score 0 · Answer 3 · answered Aug 22 '16 at 17:22

0

a different approach, not necessarily efficient but easier to understand. Last two lines are for pretty printing.

$ function f() { cut -d$'\t' -f$1 file1 | sed '/^$/d' | sort -u; }; 
  paste -d$'\t' <(f 1) <(f 2) <(f 3) <(f 4) | 
  sed 's/\t/ \t/g' | 
  column -ts$'\t'

DS571187   DS571220   DS571200   DS571194
contig1    contig3    contig4    contig7
contig2    contig4    contig6    contig8
           contig5               contig9

answered Aug 22 '16 at 17:22

karakfa

66,216
7
41
56

Pretty neat but OP mentions to _have literally thousands of columns to work through_ so you should script those "`f`s" out. – James Brown Aug 22 '16 at 17:41
This is clearly not scalable, appropriate only for few columns. – karakfa Aug 22 '16 at 17:50
tab is the default separator for cut and paste so no need for `-d$'\t'` on those. You could get rid of the last sed since it's just adding one more blank char between fields and the OP just said he wants `something that looks like` the output in the question so he probably doesn't care if it's 2 vs 3 blanks between columns and if he DID want that you could use `column -o' '...` to specify 3 blanks instead of 2 instead of having a separate sed before it. – Ed Morton Aug 22 '16 at 20:29

Display unique values in each column using awk

3 Answers3

Linked