3

I'm relatively new to using awk/grep etc and want to filter some data. I have a large spreadsheet which I want to display the unique values column by column. For example I want to change this:

DS571187    DS571220    DS571200    DS571194  
contig1     contig3     contig4     contig7  
contig2     contig3     contig4     contig7  
contig1     contig4     contig6     contig8  
contig1     contig5     contig6     contig9  
contig2     contig4     contig6     contig9  
contig2         
contig2 

to something that looks like this:

DS571187    DS571220    DS571200    DS571194
contig1     contig3     contig4     contig7
contig2     contig4     contig6     contig8
            contig5                 contig9     

Basically I'm trying to sort each column as its own list and get the unique values this way. Any help would be appreciated.

Amber

  • Do you know about the Excel feature 'Remove Duplicates'? Its on the Data Ribbon, 2nd in the 'Data Tools' segment. Unfortunately you have to process 1 column at a time to get the results you want. But still may be faster than exporting file, uniqing it, and reloading. Oh, ... spreadsheet, not necessarily Excel. Well just in case, leaving this comment. Good luck. – shellter Aug 22 '16 at 15:20
  • Hi Shellter, I tried the remove duplicates function but I have literally thousands of columns to work through. Thanks anyway! :) – A. Leckenby Aug 22 '16 at 15:33
  • `awk` solution is great. But maybe it's time to learn a little bit of VBA scripting. You can probably search for "Excel for loop columns remove duplicates" and get something you can use. OR can you process data before bringing into spreadsheet? No need to reply, just some ideas. Good luck! – shellter Aug 22 '16 at 16:57
  • Yep, and guess where :-) http://stackoverflow.com/questions/31654461/excel-to-remove-duplicates-one-column-at-a-time-for-many-columns ... Good luck. – shellter Aug 22 '16 at 16:59
  • Aah that link has exactly what I have spent days looking for and works perfectly! Thank you! :) – A. Leckenby Aug 23 '16 at 13:19

3 Answers3

2

Assuming your input file is tab-separated as it appears to be:

$ cat tst.awk
BEGIN { FS=OFS="\t" }
{
    for (colNr=1;colNr<=NF;colNr++) {
        if (!seen[colNr,$colNr]++) {
            val[++colRowNr[colNr],colNr] = $colNr
            numRows = (colRowNr[colNr] > numRows ? colRowNr[colNr] : numRows)
        }
    }
    numCols = (NF > numCols ? NF : numCols)
}
END {
    for (rowNr=1;rowNr<=numRows;rowNr++) {
        for (colNr=1;colNr<=numCols;colNr++) {
            printf "%s%s", val[rowNr,colNr], (colNr<numCols ? OFS : ORS)
        }
    }
}

$ awk -f tst.awk file | column -s$'\t' -t
DS571187  DS571220  DS571200  DS571194
contig1   contig3   contig4   contig7
contig2   contig4   contig6   contig8
          contig5             contig9

The call to column is just to make the alignment look pretty on the site.

If it's not tab-separated then to to this concisely and robustly you need GNU awk for FIELDWIDTHS to identify possibly empty fields in mid-line like this input (which you should test other potential solutions against as later input columns being shorter than earlier ones I expect can happen in your real data and makes this a harder problem to solve):

$ column -s$'\t' -t file
DS571187  DS571220  DS571200  DS571194
contig1   contig3   contig4   contig7
contig2   contig3             contig7
contig1   contig4             contig8
          contig5             contig9
                              contig9

$ awk -f tst.awk file | column -s$'\t' -t
DS571187  DS571220  DS571200  DS571194
contig1   contig3   contig4   contig7
contig2   contig4             contig8
          contig5             contig9
Ed Morton
  • 188,023
  • 17
  • 78
  • 185
  • Hi Ed, I don't know if I'm doing something wrong but when I run that it works great for the first few columns and then after that it seems to list unique contigs followed by a space and then some more contigs which are not related to the header (i.e. the DS57...) I have tried filling in all my gaps in my spreadsheet with 0 to create a 'full matrix structure' incase it was some blank spaces that were causing it to throw the script off. Any ideas appreciated :) – A. Leckenby Aug 23 '16 at 12:52
  • Blank spaces will not cause any problems for that script. **IS** your input file tab-separated (make SURE it is)? Could it contain control-Ms at the end of each line (try `cat -v file` to see them and `dos2unix` or similar to remove them). If none of that helps, reduce your file to the smallest possible number of rows and columns that can reproduce the problem and then upload it to some code-sharing site so we can see it. – Ed Morton Aug 23 '16 at 14:30
0

Gawk probably needed, tab expected as delimiter, any one character delimiter works (-F"\t" below):

$ cat > cs.awk
NR==1 {
    nf=NF
    $1=$1
    print
}
NR>1 {
    for(i=1;i<=NF;i++)
        if($i!="")
            a[i][$i]++
}
END {
    for(i=1;i<=nf;i++)
        n[i]=asorti(a[i])
    j=asort(n)
    for(i=1;i<=n[j];i++)
        for(k=1;k<=nf;k++)
            printf "%-8s%s", a[k][i], (k<nf?OFS:ORS)
}
$ awk -F"\t" -f cs.awk cs_by_ed.txt
DS571187 DS571220 DS571200 DS571194
contig1  contig3  contig4  contig7
contig2  contig4           contig8
         contig5           contig9
James Brown
  • 36,089
  • 7
  • 43
  • 59
0

a different approach, not necessarily efficient but easier to understand. Last two lines are for pretty printing.

$ function f() { cut -d$'\t' -f$1 file1 | sed '/^$/d' | sort -u; }; 
  paste -d$'\t' <(f 1) <(f 2) <(f 3) <(f 4) | 
  sed 's/\t/ \t/g' | 
  column -ts$'\t'

DS571187   DS571220   DS571200   DS571194
contig1    contig3    contig4    contig7
contig2    contig4    contig6    contig8
           contig5               contig9
karakfa
  • 66,216
  • 7
  • 41
  • 56
  • Pretty neat but OP mentions to _have literally thousands of columns to work through_ so you should script those "`f`s" out. – James Brown Aug 22 '16 at 17:41
  • This is clearly not scalable, appropriate only for few columns. – karakfa Aug 22 '16 at 17:50
  • tab is the default separator for cut and paste so no need for `-d$'\t'` on those. You could get rid of the last sed since it's just adding one more blank char between fields and the OP just said he wants `something that looks like` the output in the question so he probably doesn't care if it's 2 vs 3 blanks between columns and if he DID want that you could use `column -o' '...` to specify 3 blanks instead of 2 instead of having a separate sed before it. – Ed Morton Aug 22 '16 at 20:29