I have a 7 column file like this:
ID ANNOTATION OR PVAL VAR_INFO INFO_TAGS_USED_TO_ANNOTATE INFO_TAGS_USED_TO_ANNOTATE
1 ANN1 1.66 0.0028 1:154837796(1.12e-06,0) 1:154834092(1.49e-05,0)|1:154834911(1.2e-05,1)| 1:155008318(0.000201,0)|1:155008973(0.000177,0)|
1 ANN1 1.66 0.0028 3:53707953(1.21e-06,0) 3:53708850(1.68e-06,1)|3:53711735(8.52e-07,1)| |3:53765419(0.0021,0)|
1 ANN1 1.66 0.0028 12:109966662(6.723e-06,0) 12:109888779(3.01e-05,0)|12:109889704(5.242e-05,0)|12:109890955(4.79e-05,0)| 12:110105520(NA,0)|12:110132792(0.00015,0)|12:110155571(3.93e-05,0)|
I'm trying to find a solution using grep, sed, awk or R. From a space separated text file I need to extract all in instances in column 6 (the INFO_TAGS_USED_TO_ANNOTATE column
) where the value 1 occurs after a comma.
Whenever a 1 occurs, the information I need to retain for each instance is, using row 1 as an example, 1:154834911(1.2e-05,1)
. This information corresponds to SNP information i.e. chromosome number:base position(p-value,annotation=1)
.
If more than one instance of a 1 occurs in a particular row then I need every instance, and corresponding SNP info reported. So using row 2 as an example, two instances would be reported 3:53708850(1.68e-06,1)|3:53711735(8.52e-07,1)
.
If possible, this extracted info for each row would be appended into a new column.
Each entry in column 6 is separated by |
. It would be useful to use the |
as the delimiter in the new column if multiple entries are reported for a particular row.
The output for the last column for the above output would look like this
NewCol
1 1:154834911(1.2e-05,1)
2 3:53708850(1.68e-06,1)|3:53711735(8.52e-07,1)
3 <empty>
Please note that the information in columns 5, 6 and 7 is very similar which can be confusing.
Any suggestions would be greatly appreciated.