I have a data set that is sectioned very specifically, but rather inconveniently, like this:
data <- textConnection("rs1050,15,234323,C,T
1,7329,0.1147,-0.0024,0.0048
1,9810,0.6399,0.001174,0.006095
1,16550,0.648541108,0.0061,0.0070
rs7895,NA,NA,A,C
1,997,NA,NA,0.0732
1,9810,0.0339,-0.016131,0.021611
1,16550,0.32739678,0.0014,0.0053
rs995,18,100336,C,T
1,7385,0.2692,-0.0063,0.0035
1,9810,0.5397,-0.002697,0.006012
1,16550,0.651147483,-0.0045,0.0053")
test_data <- read.csv(data, header = FALSE, sep = ",")
If a line containing rs#### includes NA (in one or multiple columns), it needs to be removed. This would be no problem to do in and of itself, but in this case, the three lines below that line also need to be removed (regardless of whether all the data is present in those lines).
So, in the case of the above data, lines 5-8 would all be removed.
Any solution would be great, but my efforts thus far have been based on sed. Something like this?
sed -i '/rs*\t*\tNA\tNA\t*/~1-3d' test_data
sed -i '/rs*\t*\tNA\tNA\t*/,+3d' test_data
I feel like I'm close; any thoughts would be appreciated!
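To make the intended rule concrete, here is a minimal sketch of the behaviour I'm after, written with awk rather than sed. It assumes the data is stored as plain comma-separated text; the file names test_data.csv and filtered.csv are hypothetical, and the rows are copied from the sample above. It is only an illustration of the rule, not necessarily the best way to do it.

```shell
# Hypothetical input file holding the sample rows from above.
cat > test_data.csv <<'EOF'
rs1050,15,234323,C,T
1,7329,0.1147,-0.0024,0.0048
1,9810,0.6399,0.001174,0.006095
1,16550,0.648541108,0.0061,0.0070
rs7895,NA,NA,A,C
1,997,NA,NA,0.0732
1,9810,0.0339,-0.016131,0.021611
1,16550,0.32739678,0.0014,0.0053
rs995,18,100336,C,T
1,7385,0.2692,-0.0063,0.0035
1,9810,0.5397,-0.002697,0.006012
1,16550,0.651147483,-0.0045,0.0053
EOF

# When a row starts with "rs" and any later field is exactly NA, set a
# countdown of 4 so that row and the three rows after it are skipped.
# An rs row without NA resets the countdown and is printed as usual.
awk -F',' '
  $1 ~ /^rs/ {
    skip = 0
    for (i = 2; i <= NF; i++) if ($i == "NA") skip = 4
  }
  skip > 0 { skip--; next }
  { print }
' test_data.csv > filtered.csv
```

On the sample this should drop the rs7895 block (lines 5-8) and keep the other two blocks. Note that the NA check is only applied to rs#### rows, so an NA in one of the data rows does not trigger a deletion by itself. As far as I know, the addr,+N address form from my second sed attempt is a GNU sed extension, and my patterns above look for tabs while this sample is comma-separated, which may be part of why they don't match.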