Complex extract of all row entries based on string pattern using awk, sed or R

Question

I have a 7 column file like this:

ID ANNOTATION OR PVAL VAR_INFO INFO_TAGS_USED_TO_ANNOTATE INFO_TAGS_USED_TO_ANNOTATE
1 ANN1 1.66 0.0028 1:154837796(1.12e-06,0) 1:154834092(1.49e-05,0)|1:154834911(1.2e-05,1)| 1:155008318(0.000201,0)|1:155008973(0.000177,0)|
1 ANN1 1.66 0.0028 3:53707953(1.21e-06,0) 3:53708850(1.68e-06,1)|3:53711735(8.52e-07,1)| |3:53765419(0.0021,0)|
1 ANN1 1.66 0.0028 12:109966662(6.723e-06,0) 12:109888779(3.01e-05,0)|12:109889704(5.242e-05,0)|12:109890955(4.79e-05,0)| 12:110105520(NA,0)|12:110132792(0.00015,0)|12:110155571(3.93e-05,0)|

I'm trying to find a solution using grep, sed, awk or R. From a space separated text file I need to extract all in instances in column 6 (the INFO_TAGS_USED_TO_ANNOTATE column) where the value 1 occurs after a comma.

Whenever a 1 occurs, the information I need to retain for each instance is, using row 1 as an example, 1:154834911(1.2e-05,1). This information corresponds to SNP information i.e. chromosome number:base position(p-value,annotation=1).

If more than one instance of a 1 occurs in a particular row then I need every instance, and corresponding SNP info reported. So using row 2 as an example, two instances would be reported 3:53708850(1.68e-06,1)|3:53711735(8.52e-07,1).

If possible, this extracted info for each row would be appended into a new column.

Each entry in column 6 is separated by |. It would be useful to use the | as the delimiter in the new column if multiple entries are reported for a particular row.

The output for the last column for the above output would look like this

NewCol
1 1:154834911(1.2e-05,1)
2 3:53708850(1.68e-06,1)|3:53711735(8.52e-07,1)
3 <empty>

Please note that the information in columns 5, 6 and 7 is very similar which can be confusing.

Any suggestions would be greatly appreciated.

score 1 · Answer 1 · answered Jul 10 '19 at 12:39

perl will work well here:

perl -lane '
    if ($. == 1) {
        $val = "NewCol";
    } else {
        $val = join "|", grep {/,1\)/} split /\|/, $F[5];
    }
    print join " ", @F, $val;
' file

ID ANNOTATION OR PVAL VAR_INFO INFO_TAGS_USED_TO_ANNOTATE INFO_TAGS_USED_TO_ANNOTATE NewCol
1 ANN1 1.66 0.0028 1:154837796(1.12e-06,0) 1:154834092(1.49e-05,0)|1:154834911(1.2e-05,1)| 1:155008318(0.000201,0)|1:155008973(0.000177,0)| 1:154834911(1.2e-05,1)
1 ANN1 1.66 0.0028 3:53707953(1.21e-06,0) 3:53708850(1.68e-06,1)|3:53711735(8.52e-07,1)| |3:53765419(0.0021,0)| 3:53708850(1.68e-06,1)|3:53711735(8.52e-07,1)
1 ANN1 1.66 0.0028 12:109966662(6.723e-06,0) 12:109888779(3.01e-05,0)|12:109889704(5.242e-05,0)|12:109890955(4.79e-05,0)| 12:110105520(NA,0)|12:110132792(0.00015,0)|12:110155571(3.93e-05,0)|

This is the magic: $val = join "|", grep {/,1\)/} split /\|/, $F[5]; -- you have to read that right to left:

first split the 6th field (perl uses 0-based indexing) using a pipe (which is a special regex character, so care is needed,
then keep only the bits with a "1" after a comma and before a close parenthesis,
then join the remaining bits with a pipe.

If you want to annoy your colleagues, this can be golfed down to

perl -lape '$_=join" ",@F,$.==1?"NewCol":join"|",grep{/,1\)/}split/[|]/,$F[5]' file

G. Grothendieck · Accepted Answer · 2019-07-10T23:26:34.357

Using R read in myfile as in the commented out line and then use gsub to remove the parts not wanted putting what is left into a new column. The data used is shown at the end.

library(gsubfn)
# DF <- read.table("myfile", header = TRUE, as.is = TRUE, 
#   check.names = FALSE, strip.white = TRUE)
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE, 
  check.names = FALSE, strip.white = TRUE)

transform(DF, NEWCOL = gsub("([^,]+),[^1]\\)\\|", "", INFO_TAGS_USED_TO_ANNOTATE))

Note

Lines <- "
ID ANNOTATION OR PVAL VAR_INFO INFO_TAGS_USED_TO_ANNOTATE INFO_TAGS_USED_TO_ANNOTATE
1 ANN1 1.66 0.0028 1:154837796(1.12e-06,0) 1:154834092(1.49e-05,0)|1:154834911(1.2e-05,1)| 1:155008318(0.000201,0)|1:155008973(0.000177,0)|
1 ANN1 1.66 0.0028 3:53707953(1.21e-06,0) 3:53708850(1.68e-06,1)|3:53711735(8.52e-07,1)| |3:53765419(0.0021,0)|
1 ANN1 1.66 0.0028 12:109966662(6.723e-06,0) 12:109888779(3.01e-05,0)|12:109889704(5.242e-05,0)|12:109890955(4.79e-05,0)| 12:110105520(NA,0)|12:110132792(0.00015,0)|12:110155571(3.93e-05,0)|
"

Ed Morton · Answer 3 · 2019-07-10T15:02:42.403

Using any awk in any shell on every UNIX box:

$ cat tst.awk
NR==1 { print $0, "NewCol"; next }
{
    new = sep = ""
    n = split($6,f,/[|]/)
    for (i=1; i<=n; i++) {
        if ( f[i] ~ /,1/ ) {
            new = new sep f[i]
            sep = "|"
        }
    }
    print $0, new
}

$ awk -f tst.awk file
ID ANNOTATION OR PVAL VAR_INFO INFO_TAGS_USED_TO_ANNOTATE INFO_TAGS_USED_TO_ANNOTATE NewCol
1 ANN1 1.66 0.0028 1:154837796(1.12e-06,0) 1:154834092(1.49e-05,0)|1:154834911(1.2e-05,1)| 1:155008318(0.000201,0)|1:155008973(0.000177,0)| 1:154834911(1.2e-05,1)
1 ANN1 1.66 0.0028 3:53707953(1.21e-06,0) 3:53708850(1.68e-06,1)|3:53711735(8.52e-07,1)| |3:53765419(0.0021,0)| 3:53708850(1.68e-06,1)|3:53711735(8.52e-07,1)
1 ANN1 1.66 0.0028 12:109966662(6.723e-06,0) 12:109888779(3.01e-05,0)|12:109889704(5.242e-05,0)|12:109890955(4.79e-05,0)| 12:110105520(NA,0)|12:110132792(0.00015,0)|12:110155571(3.93e-05,0)|

score 1 · Answer 4 · answered Jul 10 '19 at 17:47

This might work for you (GNU sed):r

sed -E 'h;s/\S+/\n&\n/6;s/.*\n(.*)\n.*/\1/;s/[^,]+,([^1)]|[^)]{2,})\|//g;s/\|$//;1s/.*/newcol/;H;g;s/\n/ /' file

Make a copy of the line, isolate the 6th field, remove strings not containing 1 following a comma, remove the final | and append the result to the original line.

Complex extract of all row entries based on string pattern using awk, sed or R

4 Answers4

Note