
I am trying to replace values in a large space-delimited text file and could not find a suitable answer for this specific problem:

Say I have a file "OLD_FILE", containing a header and approximately 2 million rows:

COL1 COL2 COL3 COL4 COL5
rs10 7 92221824 C A 
rs1000000 12 125456933 G A 
rs10000010 4 21227772 T C 
rs10000012 4 1347325 G C 
rs10000013 4 36901464 C A 
rs10000017 4 84997149 T C 
rs1000002 3 185118462 T C 
rs10000023 4 95952929 T G 
...

I want to replace the first value of each row with a corresponding value, using a large (2.8M rows) conversion table. In this conversion table, the first column lists the value I want to have replaced, and the second column lists the corresponding new values:

COL1_b36       COL2_b37
rs10    7_92383888
rs1000000       12_126890980
rs10000010      4_21618674
rs10000012      4_1357325
rs10000013      4_37225069
rs10000017      4_84778125
rs1000002       3_183635768
rs10000023      4_95733906
...

The desired output would be a file where all values in the first column have been changed according to the conversion table:

COL1 COL2 COL3 COL4 COL5
7_92383888 7 92221824 C A 
12_126890980 12 125456933 G A 
4_21618674 4 21227772 T C 
4_1357325 4 1347325 G C 
4_37225069 4 36901464 C A 
4_84778125 4 84997149 T C 
3_183635768 3 185118462 T C 
4_95733906 4 95952929 T G 
...

Additional info:

  • Performance is an issue (the following command takes approximately a year):

    while read a b; do sed -i "s/\b$a\b/$b/g" OLD_FILE ; done < CONVERSION_TABLE

  • A complete match is necessary before replacing
  • Not every value in the OLD_FILE can be found in the conversion table...
  • ...but every value that can be replaced can be found in the conversion table.

Any help is very much appreciated.

  • sort the conversion table (as you know it beforehand). split it in as many little files as you find reasonable. Then parse the input file, and depending on the first (or first 2, or ...) letters, search for its conversion in the appropriate split conversion file. – Olivier Dulac Jan 09 '13 at 12:30
  • another approach: in perl, use a hash table so that each possibly converted string (the 1st column in the conversion file) is a key that corresponds to its conversion value. If it holds in memory, it'll be **really** fast. – Olivier Dulac Jan 09 '13 at 12:34
  • a similar approach is to find a way to "hash" the 1st column of the conversion file uniquely, so that for each entry there is one (and only one) value. To test that: once you have a hash function, try each entry and note its output value, then verify (cat | sort | uniq -c | sort -rn) that it is unique. Once you have found an appropriate hash function, you just need one file per hash value (hashvalue.txt) containing the conversion string. – Olivier Dulac Jan 09 '13 at 12:38
  • These are all very helpful suggestions, although a little bit more elaborate to get working, which is why I prefer the awk-oneliner. – KJ_ Jan 09 '13 at 17:25
  • check here: http://theunixshell.blogspot.com/2012/12/file-comparisons-using-awk-match-columns.html – Vijay Jan 10 '13 at 08:45
  • @sarathi: Thanks. It turns out that learning awk and sed could increase productivity even more than I expected. – KJ_ Jan 11 '13 at 14:28

4 Answers


Here's one way using awk:

awk 'NR==1 { next } FNR==NR { a[$1]=$2; next } $1 in a { $1=a[$1] }1' TABLE OLD_FILE

Results:

COL1 COL2 COL3 COL4 COL5
7_92383888 7 92221824 C A
12_126890980 12 125456933 G A
4_21618674 4 21227772 T C
4_1357325 4 1347325 G C
4_37225069 4 36901464 C A
4_84778125 4 84997149 T C
3_183635768 3 185118462 T C
4_95733906 4 95952929 T G

Explanation, in order of appearance:

NR==1 { next }            # simply skip processing the first line (header) of
                          # the first file in the arguments list (TABLE)

FNR==NR { ... }           # This is a construct that only returns true for the
                          # first file in the arguments list (TABLE)

a[$1]=$2                  # So when we loop through the TABLE file, we add
                          # column one as a key in an associative array and
                          # assign it the value of column two

next                      # This simply skips processing the remainder of the
                          # code by forcing awk to read the next line of input

$1 in a { ... }           # Now when awk has finished processing the TABLE file,
                          # it will begin reading the second file in the
                          # arguments list, which is OLD_FILE. So this construct
                          # is a condition that is true when the value in column
                          # one exists as a key in the array

$1=a[$1]                  # re-assign column one's value to be the value held
                          # in the array

1                         # The 1 on the end simply enables default printing. It
                          # would be like saying: $1 in a { $1=a[$1]; print $0 }
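
For completeness, a small usage sketch (CONVERSION_TABLE and OLD_FILE are the names from the question; NEW_FILE is just an illustrative output name). Since awk loads the whole table into memory and then streams the data file once, the converted file can simply be redirected:

# build the lookup array from the table, then rewrite column one on the fly
awk 'NR==1 { next } FNR==NR { a[$1]=$2; next } $1 in a { $1=a[$1] }1' CONVERSION_TABLE OLD_FILE > NEW_FILE
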
Steve
  • Thank you! This solution turned out to be the fastest (just a couple of minutes on a regular computer). Unfortunately, my skills with awk are very basic so I don't understand what is happening in the script. Would you mind briefly explaining it? – KJ_ Jan 09 '13 at 16:28
  • @KJ_: No problem! I've added a short explanation of how the script works to my answer above. Please let me know if I haven't explained things thoroughly enough. Cheers. – Steve Jan 10 '13 at 00:36

This might work for you (GNU sed):

sed -r '1d;s|(\S+)\s*(\S+).*|/^\1\\>/s//\2/;t|' table | sed -f - file
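
To see roughly what this does: the first sed turns every row of table into a substitution command (the t command skips the remaining rules once one has matched), producing a script along these lines from the sample table above, which the second sed then applies to file:

/^rs10\>/s//7_92383888/;t
/^rs1000000\>/s//12_126890980/;t
...
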
potong

You can use join:

join -o '2.2 1.2 1.3 1.4 1.5' <(tail -n+2 file1 | sort) <(tail -n+2 file2 | sort)

This drops the headers of both files; you can add the header back with head -n1 file1.

Output:

12_126890980 12 125456933 G A
4_21618674 4 21227772 T C
4_1357325 4 1347325 G C
4_37225069 4 36901464 C A
4_84778125 4 84997149 T C
3_183635768 3 185118462 T C
4_95733906 4 95952929 T G
7_92383888 7 92221824 C A
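
Note that join drops rows with no match in the conversion table (see the -a option if unmatched rows must be kept), and the rows come back in sort order rather than the original order. A minimal sketch for keeping the header in the same pass, assuming bash (file1 is OLD_FILE, file2 the conversion table, and NEW_FILE just an illustrative output name):

{ head -n1 file1; join -o '2.2 1.2 1.3 1.4 1.5' <(tail -n+2 file1 | sort) <(tail -n+2 file2 | sort); } > NEW_FILE
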
Thor

Another way with join. Assuming the files are sorted on the 1st column:

head -1 OLD_FILE
join <(tail -n+2 CONVERSION_TABLE) <(tail -n+2 OLD_FILE) | cut -f 2-6 -d' '

But with data of this size you should consider using a database engine.
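
For example, a rough sketch with sqlite3 (assuming it is installed; the database, table, and column names are illustrative, and whitespace is normalized first so that .import sees exactly the expected number of columns):

# strip the headers and squeeze runs of whitespace to single spaces
tail -n+2 OLD_FILE | sed -e 's/[[:space:]]\{1,\}/ /g' -e 's/ *$//' > old.txt
tail -n+2 CONVERSION_TABLE | sed -e 's/[[:space:]]\{1,\}/ /g' -e 's/ *$//' > conv.txt

sqlite3 lookup.db <<'EOF'
CREATE TABLE old(c1 TEXT, c2 TEXT, c3 TEXT, c4 TEXT, c5 TEXT);
CREATE TABLE conv(old_id TEXT PRIMARY KEY, new_id TEXT);
.separator " "
.import old.txt old
.import conv.txt conv
.output NEW_FILE
SELECT new_id, c2, c3, c4, c5 FROM old JOIN conv ON old.c1 = conv.old_id;
EOF

As with the join answers, the header is dropped and rows without a match in the conversion table are not output.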

Lev Levitsky