I have a data set that is sectioned very specifically, but rather inconveniently, like this:

data <- textConnection("rs1050,15,234323,C,T
1,7329,0.1147,-0.0024,0.0048
1,9810,0.6399,0.001174,0.006095
1,16550,0.648541108,0.0061,0.0070
rs7895,NA,NA,A,C
1,997,NA,NA,0.0732
1,9810,0.0339,-0.016131,0.021611
1,16550,0.32739678,0.0014,0.0053
rs995,18,100336,C,T
1,7385,0.2692,-0.0063,0.0035
1,9810,0.5397,-0.002697,0.006012
1,16550,0.651147483,-0.0045,0.0053")
test_data <- read.csv(data, header = FALSE, sep = ",")

If a line containing rs#### has NA in one or more columns, that line needs to be removed. That would be easy on its own, but in this case the three lines below it also need to be removed, regardless of whether those lines have complete data.

So, in the case of the above data, lines 5-8 would all be removed.

Any solution would be great, but my efforts thus far have been based on sed. Something like this?

sed -i '/rs*\t*\tNA\tNA\t*/~1-3d' test_data

sed -i '/rs*\t*\tNA\tNA\t*/,+3d' test_data

I feel like I'm close, any thoughts would be appreciated!

mfk534
  • Do you want to delete the lines *after* the first rs7895, too (the one with the two NA's)? And the second rs line (rs995)? – wildplasser Jan 30 '13 at 20:00
  • I want to delete the line containing rs7895, and the three lines after it (4 lines total). In other words, I want to delete all lines associated with and including the line containing rs7895. – mfk534 Jan 30 '13 at 20:04
  • Exactly three: `sed '/^rs7895,NA,NA,/,+3d'` ? All rsxxx,NA lines plus the three following lines: `sed '/^rs[0-9]+,NA,NA,/,+3d'` BTW: \t is TAB, I see no tabs in the above fragment. – wildplasser Jan 30 '13 at 20:11
  • My actual data is tab-delimited, I just used csv to share a small sample - should have made that clear, sorry! This line isn't working for me, unfortunately. It just prints out the same output. – mfk534 Jan 30 '13 at 20:27
  • Ok, then just replace the commas in my fragment with \t again: `sed '/^rs[0-9]+\tNA\tNA\t/,+3d'` should work. Also note that in *true* regexps, '*' does not stand for "any stretch of characters", but for "the previous pattern can repeat zero or more times". – wildplasser Jan 30 '13 at 20:32
  • I did not know that about "*" - thanks for the heads up! – mfk534 Jan 30 '13 at 20:51
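
To make the point about `*` concrete, here is a minimal sketch using `grep -x` (whole-line match) on three made-up strings. The strings and variable names are illustrative only, not taken from the real data.

```shell
# Three hypothetical test strings, one per line.
sample=$(printf 'r\nrs1050\nrsss\n')

# 'rs*' is "r, then zero or more s": it matches "r" and "rsss", not "rs1050".
star_only=$(printf '%s\n' "$sample" | grep -x 'rs*')

# 'rs.*' is "rs, then any characters": it matches "rs1050" and "rsss", not "r".
dot_star=$(printf '%s\n' "$sample" | grep -x 'rs.*')

printf 'rs*  matched:\n%s\n' "$star_only"
printf 'rs.* matched:\n%s\n' "$dot_star"
```

So a shell-glob style `rs*` has to be written as `rs.*` in a regular expression.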

3 Answers

This should work fine, unless your actual data has a trailing `")`...

sed '/^rs.*NA/,+3d' test_data
Faiz
sed -E '/^rs[0-9]+\tNA\tNA\t/,+3d' <input_data >output_data
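
A note on the `+` in this pattern: in GNU sed's default (basic) regular expressions a bare `+` is a literal plus sign, so the pattern only behaves as intended with `-E` (or with `\+`). The sketch below assumes GNU sed, since both `\t` in the pattern and the `,+3` address are GNU extensions. It rebuilds the question's sample as a tab-delimited file; the file names `sample.tsv` and `filtered.tsv` are made up for the example.

```shell
# Recreate the question's sample, converting commas to tabs.
tr ',' '\t' > sample.tsv <<'EOF'
rs1050,15,234323,C,T
1,7329,0.1147,-0.0024,0.0048
1,9810,0.6399,0.001174,0.006095
1,16550,0.648541108,0.0061,0.0070
rs7895,NA,NA,A,C
1,997,NA,NA,0.0732
1,9810,0.0339,-0.016131,0.021611
1,16550,0.32739678,0.0014,0.0053
rs995,18,100336,C,T
1,7385,0.2692,-0.0063,0.0035
1,9810,0.5397,-0.002697,0.006012
1,16550,0.651147483,-0.0045,0.0053
EOF

# -E turns on extended regexes, so + means "one or more digits";
# ,+3 extends the deleted range to the three lines after each match.
sed -E '/^rs[0-9]+\tNA\tNA\t/,+3d' sample.tsv > filtered.tsv

cat filtered.tsv
```

On this sample the rs7895 block (four lines) is dropped and the rs1050 and rs995 blocks are kept.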
wildplasser

Obligatory alternative using awk:

awk '/^rs.*NA/ { output = 0; } /^rs/ && !/NA/ { output = 1; } output { print }'

Could probably be a little better optimized, but there's the proverbial exercise for the reader...

This has three parts - if a line starts with rs and contains NA, it turns off the output variable. If a line starts with rs and doesn't contain NA, it turns output back on. Then, if output is currently on, it prints the line, regardless of whether it contains rs or NA.
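
As a quick check of that description, here is a sketch that runs the same awk program on a trimmed subset of the question's comma-separated sample. The subset and the variable names `sample` and `filtered` are only for illustration.

```shell
# Trimmed subset of the question's sample: one good block, one NA block, one good block.
sample='rs1050,15,234323,C,T
1,7329,0.1147,-0.0024,0.0048
rs7895,NA,NA,A,C
1,997,NA,NA,0.0732
1,9810,0.0339,-0.016131,0.021611
rs995,18,100336,C,T
1,7385,0.2692,-0.0063,0.0035'

# An rs line with NA switches output off; an rs line without NA switches it on;
# every line is printed only while output is on.
filtered=$(printf '%s\n' "$sample" |
  awk '/^rs.*NA/ { output = 0 } /^rs/ && !/NA/ { output = 1 } output { print }')

printf '%s\n' "$filtered"
```

Note that, unlike the `,+3` sed range, this drops everything up to the next clean rs line, however many data lines follow the NA header.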

twalberg
  • What does this do with the lines that look like this: `1,997,NA,NA,0.0732` ? – Rob Davis Jan 30 '13 at 21:36
  • @RobDavis Prints them if `output == 1`. The original question, at least to my reading, only cared about `NA` if the line started with `rs`, but I may be reading that wrong... Added a little more clarification to the answer. – twalberg Jan 30 '13 at 21:39
  • Oh wait, I see. Right. Carry on. :-) – Rob Davis Jan 30 '13 at 21:42