Select matrix first row based on 1st, 8th and 9th column value with awk or sed

Question

I have some rows where the 1st, 8th and 9th columns are mostly the same. The total number of rows is well over 60K. Now I want to simplify keeping only the first rows for which the 1st,8th and 9th column are same.

Input file:

chr exon_start  exon_end    cnv tumor_DOC   control_DOC rationormalized_after_smoothing CNV_start   CNV_end seg_mean
chr1    762097  762270  3   821 717 1.456610215 762097  6706109 1.297328502
chr1    861281  861490  3   101 117 1.29744744  762097  6706109 1.297328502
chr1    7868860 7869039 2   78  119 1.123385189 7796356 8921423 1.088752407
chr1    7869841 7870041 2   140 169 1.123385189 7796356 8921423 1.088752407
chr1    7870411 7870596 2   83  163 1.123385189 7796356 8921423 1.088752407
chr1    7879297 7879467 2   290 360 1.024742732 7796356 8921423 1.088752407
chr1    21012415    21012609    3   89  135 1.230421209 19536504    21054539    1.247494175
chr1    21013924    21014512    3   234 219 1.359224182 19536504    21054539    1.247494175
chr1    21016588    21016803    3   172 179 1.230421209 19536504    21054539    1.247494175
chr1    21024895    21025101    3   147 120 1.230421209 19536504    21054539    1.247494175
chr14   20920169    20920704    3   211 214 1.254261327 20840851    20923828    1.288877208
chr14   20922716    20922919    3   253 262 1.228396526 20840851    20923828    1.288877208
chr14   20923634    20923828    3   188 201 1.206226522 20840851    20923828    1.288877208
chr14   20924141    20924329    2   244 344 0.902299535 20924141    21465086    1.088234038
chr14   20924787    20925701    2   314 306 1.305351797 20924141    21465086    1.088234038
chr14   20926636    20926836    2   134 136 1.206226522 20924141    21465086    1.088234038

Desired output:

chr exon_start  exon_end    cnv tumor_DOC   control_DOC rationormalized_after_smoothing CNV_start   CNV_end seg_mean
chr1    762097  762270  3   821 717 1.456610215 762097  6706109 1.297328502
chr1    7869841 7870041 2   140 169 1.123385189 7796356 8921423 1.088752407
chr1    21024895    21025101    3   147 120 1.230421209 19536504    21054539    1.247494175
chr14   20922716    20922919    3   253 262 1.228396526 20840851    20923828    1.288877208
chr14   20924141    20924329    2   244 344 0.902299535 20924141    21465086    1.088234038

I only keep one row for each distinct categories which have similar column1, column 8th and column 9, best would be to just keep the first row whenever there is a change.

How can I achieve this in awk, sed or in R?

fedorqui · Answer 1 · 2015-04-23T07:34:31.757

6

Just one awk line:

awk '!seen[$1,$8,$9]++' file

This keeps track in the array seen[] of how many times a given tuple (field1, field8, field9) appeared so far. When it sees one, the counter increments by one. When the value was already 1 or bigger, !value will be evaluated to False, so that awk won't print the line.

First time:

seen[$1,$8,$9] is 0 (default value when not set).
!0 evaluates to True, so the line is printed.
seen[$1,$8,$9] increments by one.

Next times:

seen[$1,$8,$9] is 1 or more.
!1 evaluates to False, so the line is not printed.
seen[$1,$8,$9] increments by one.

Test

$ awk '!seen[$1,$8,$9]++' a
chr exon_start  exon_end    cnv tumor_DOC   control_DOC rationormalized_after_smoothing CNV_start   CNV_end seg_mean
chr1    762097  762270  3   821 717 1.456610215 762097  6706109 1.297328502
chr1    7868860 7869039 2   78  119 1.123385189 7796356 8921423 1.088752407
chr1    21012415    21012609    3   89  135 1.230421209 19536504    21054539    1.247494175
chr14   20920169    20920704    3   211 214 1.254261327 20840851    20923828    1.288877208
chr14   20924141    20924329    2   244 344 0.902299535 20924141    21465086    1.088234038

edited Apr 23 '15 at 07:34

answered Apr 22 '15 at 14:35

fedorqui

275,237
103
548
598

Nice, I was about to suggest the very same. I suspect that there is already a canonical question for this? – Tom Fenech Apr 22 '15 at 14:38
sorry I did not get the answer.. here a is the file name right? – ivivek_ngs Apr 22 '15 at 14:38
Thanks it works perfectly, sorry , I was a bit confused.. later I realized the color change and understood the file naming convention – ivivek_ngs Apr 22 '15 at 14:40
@TomFenech yes, I am sure this has been answered in a very similar way many times. Couldn't find an exact duplicate (multiple fields, not just one), but it may be time to find a good canonical one and point all of these questions there. – fedorqui Apr 22 '15 at 14:42
@vchris_ngs no problem! Glad to read that it worked to you :) – fedorqui Apr 22 '15 at 14:43
1

@ fedorqui , perfectly explained as well.. thanks a lot.. sometimes I need a bit of time lapse to switch between R and awk.. – ivivek_ngs Apr 22 '15 at 14:47
1

Normally we name the array `seen` for this. I think by now when people see an array named `seen` they know what that array is being used for so giving the array that name is more than just a cosmetic preference. – Ed Morton Apr 22 '15 at 17:04

score 4 · Accepted Answer · answered Apr 22 '15 at 14:46

Import your data into R (you would specify the file):

DF <- read.table(text = "chr exon_start  exon_end    cnv tumor_DOC   control_DOC rationormalized_after_smoothing CNV_start   CNV_end seg_mean
chr1    762097  762270  3   821 717 1.456610215 762097  6706109 1.297328502
chr1    861281  861490  3   101 117 1.29744744  762097  6706109 1.297328502
chr1    7868860 7869039 2   78  119 1.123385189 7796356 8921423 1.088752407
chr1    7869841 7870041 2   140 169 1.123385189 7796356 8921423 1.088752407
chr1    7870411 7870596 2   83  163 1.123385189 7796356 8921423 1.088752407
chr1    7879297 7879467 2   290 360 1.024742732 7796356 8921423 1.088752407
chr1    21012415    21012609    3   89  135 1.230421209 19536504    21054539    1.247494175
chr1    21013924    21014512    3   234 219 1.359224182 19536504    21054539    1.247494175
chr1    21016588    21016803    3   172 179 1.230421209 19536504    21054539    1.247494175
chr1    21024895    21025101    3   147 120 1.230421209 19536504    21054539    1.247494175
chr14   20920169    20920704    3   211 214 1.254261327 20840851    20923828    1.288877208
chr14   20922716    20922919    3   253 262 1.228396526 20840851    20923828    1.288877208
chr14   20923634    20923828    3   188 201 1.206226522 20840851    20923828    1.288877208
chr14   20924141    20924329    2   244 344 0.902299535 20924141    21465086    1.088234038
chr14   20924787    20925701    2   314 306 1.305351797 20924141    21465086    1.088234038
chr14   20926636    20926836    2   134 136 1.206226522 20924141    21465086    1.088234038", header = TRUE)

Extract rows where columns 1, 8, 9 are not duplicates of earlier rows:

DF[!duplicated(DF[, c(1,8,9)]),]
#     chr exon_start exon_end cnv tumor_DOC control_DOC rationormalized_after_smoothing CNV_start  CNV_end seg_mean
#1   chr1     762097   762270   3       821         717                       1.4566102    762097  6706109 1.297329
#3   chr1    7868860  7869039   2        78         119                       1.1233852   7796356  8921423 1.088752
#7   chr1   21012415 21012609   3        89         135                       1.2304212  19536504 21054539 1.247494
#11 chr14   20920169 20920704   3       211         214                       1.2542613  20840851 20923828 1.288877
#14 chr14   20924141 20924329   2       244         344                       0.9022995  20924141 21465086 1.088234

Select matrix first row based on 1st, 8th and 9th column value with awk or sed

2 Answers2

Test