Remove lines based on missing information using sed

Question

I have a data set that is sectioned very specifically, but rather inconveniently, like this:

data <- textConnection("rs1050,15,234323,C,T
1,7329,0.1147,-0.0024,0.0048
1,9810,0.6399,0.001174,0.006095
1,16550,0.648541108,0.0061,0.0070
rs7895,NA,NA,A,C
1,997,NA,NA,0.0732
1,9810,0.0339,-0.016131,0.021611
1,16550,0.32739678,0.0014,0.0053
rs995,18,100336,C,T
1,7385,0.2692,-0.0063,0.0035
1,9810,0.5397,-0.002697,0.006012
1,16550,0.651147483,-0.0045,0.0053")
test_data <- read.csv(data, header = FALSE, sep = ",")

If a line containing rs#### includes NA (in one or more columns), it needs to be removed. That would be no problem in and of itself, but in this case the three lines below that row also need to be removed (regardless of whether all the data is present in those lines).

So, in the case of the above data, lines 5-8 would all be removed.

Any solution would be great, but my efforts thus far have been based on sed. Something like this?

sed -i '/rs*\t*\tNA\tNA\t*/~1-3d' test_data

sed -i '/rs*\t*\tNA\tNA\t*/,+3d' test_data

I feel like I'm close, any thoughts would be appreciated!

Do you want to delete the lines *after* the first rs7895, too (the one with the two NAs)? And the second rs line (rs995)? – wildplasser, Jan 30 '13 at 20:00
I want to delete the line containing rs7895 and the three lines after it (4 lines total). In other words, I want to delete all lines associated with and including the line containing rs7895. – mfk534, Jan 30 '13 at 20:04
Exactly three: `sed '/^rs7895,NA,NA,/,+3d'`? All rsxxx,NA lines plus the three following lines: `sed -E '/^rs[0-9]+,NA,NA,/,+3d'`. BTW: \t is TAB, I see no tabs in the above fragment. – wildplasser, Jan 30 '13 at 20:11
My actual data is tab-delimited; I just used csv to share a small sample. I should have made that clear, sorry! This line isn't working for me, unfortunately. It just prints out the same output. – mfk534, Jan 30 '13 at 20:27
OK, then just replace the commas in my fragment with \t again: `sed -E '/^rs[0-9]+\tNA\tNA\t/,+3d'` should work. Also note that in *true* regexps, '*' does not stand for "any stretch of characters", but for "the previous pattern can repeat zero or more times". – wildplasser, Jan 30 '13 at 20:32
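
To make the point about '*' concrete, here is a small sketch that tests the question's pattern and a regex-style rewrite against one tab-separated header line. The variable names (`line`, `tab`, `glob_style`, `regex_style`) are illustrative only, and `grep` is used merely as a convenient way to count matches; literal tabs are built with `printf` so the example does not depend on `\t` support in the regex engine.

```shell
# One header line in the tab-delimited form the asker describes.
line=$(printf 'rs7895\tNA\tNA\tA\tC')
tab=$(printf '\t')

# The question's pattern: 'rs*' means "r, then zero or more s",
# so the digits after "rs" are never consumed and the match fails.
glob_style="rs*${tab}*${tab}NA${tab}NA"

# Regex-style rewrite: '[0-9]*' consumes the digits of the ID.
regex_style="^rs[0-9]*${tab}NA${tab}NA"

# grep -c prints the number of matching lines; '|| true' keeps a
# zero-match result from being treated as a failure.
glob_hits=$(printf '%s\n' "$line" | grep -c "$glob_style" || true)
regex_hits=$(printf '%s\n' "$line" | grep -c "$regex_style" || true)

echo "glob-style pattern matches:  $glob_hits"
echo "regex-style pattern matches: $regex_hits"
```

The first count comes out as zero because nothing in the pattern can match the digits between "rs" and the first tab.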

score 1 · Accepted Answer · answered Jan 30 '13 at 20:42

This should work fine, unless your actual data has a trailing ")...

sed '/^rs.*NA/,+3d' test_data

Faiz

Thanks! This does exactly what I need it to. – mfk534 Jan 30 '13 at 20:52
The original poster doesn't say what platform she's on, but note that this answer requires GNU sed. – Rob Davis Jan 30 '13 at 21:38
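
For readers without GNU sed (the `addr,+N` address form is a GNU extension), here is a hedged sketch of the same "drop the matching line plus the next three" behaviour using a counter in awk. `drop_block` is a hypothetical helper name, and the sample rows are the comma-separated ones from the question; the pattern itself does not depend on the delimiter.

```shell
# Sample rows copied from the question.
sample='rs1050,15,234323,C,T
1,7329,0.1147,-0.0024,0.0048
1,9810,0.6399,0.001174,0.006095
1,16550,0.648541108,0.0061,0.0070
rs7895,NA,NA,A,C
1,997,NA,NA,0.0732
1,9810,0.0339,-0.016131,0.021611
1,16550,0.32739678,0.0014,0.0053
rs995,18,100336,C,T
1,7385,0.2692,-0.0063,0.0035
1,9810,0.5397,-0.002697,0.006012
1,16550,0.651147483,-0.0045,0.0053'

# drop_block (hypothetical name): filter stdin to stdout.
drop_block() {
  awk '
    skip > 0   { skip--; next }     # still inside a dropped block
    /^rs.*NA/  { skip = 3; next }   # drop this line and the next 3
               { print }            # everything else passes through
  '
}

result=$(printf '%s\n' "$sample" | drop_block)
printf '%s\n' "$result"
```

Checking the counter before the pattern mirrors how a sed range behaves: lines inside a block being dropped are not tested for a new match.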

score 0 · Answer 2 · answered Jan 30 '13 at 21:04

sed -E '/^rs[0-9]+\tNA\tNA\t/,+3d' <input_data >output_data

wildplasser

score 0 · Answer 3 · answered Jan 30 '13 at 21:12

Obligatory alternative using awk:

awk '/^rs.*NA/ { output = 0; } /^rs/ && !/NA/ { output = 1; } output { print }'

Could probably be a little better optimized, but there's the proverbial exercise for the reader...

This has three parts - if a line starts with rs and contains NA, it turns off the output variable. If a line starts with rs and doesn't contain NA, it turns output back on. Then, if output is currently on, it prints the line, regardless of whether it contains rs or NA.

edited Jan 30 '13 at 21:41

answered Jan 30 '13 at 21:12

twalberg

What does this do with the lines that look like this: `1,997,NA,NA,0.0732` ? – Rob Davis Jan 30 '13 at 21:36
@RobDavis Prints them if `output == 1`. The original question, at least to my reading, only cared about `NA` if the line started with `rs`, but I may be reading that wrong... Added a little more clarification to the answer. – twalberg Jan 30 '13 at 21:39
Oh wait, I see. Right. Carry on. :-) – Rob Davis Jan 30 '13 at 21:42
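
As one possible take on the "exercise for the reader", the two rs rules in the awk answer can be folded into a single assignment. This is only a sketch, not the answerer's own code: `keep` is an illustrative variable name, and the sample is a shortened version of the question's data.

```shell
# Shortened sample from the question: one complete block, one NA block.
sample='rs1050,15,234323,C,T
1,7329,0.1147,-0.0024,0.0048
rs7895,NA,NA,A,C
1,997,NA,NA,0.0732
1,9810,0.0339,-0.016131,0.021611
rs995,18,100336,C,T
1,7385,0.2692,-0.0063,0.0035'

# On every rs line, recompute the flag: keep the block only if the
# header has no NA. The bare pattern "keep" prints while the flag is set.
result=$(printf '%s\n' "$sample" | awk '/^rs/ { keep = ($0 !~ /NA/) } keep')
printf '%s\n' "$result"
```

Unlike the `,+3d` range in the sed answers, this flag-based form drops everything up to the next rs line, so it does not assume that each block has exactly three data rows.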
