1

I have a pdb text file with about 200 000 rows. Every rows looks like this :

COMPND
SOURCE    
HETATM    1  CT  100     1     -23.207  17.632  14.543
HETATM    2  CT   99     1     -22.069  18.353  15.280
HETATM    3  OH  101     1     -21.074  18.762  14.358
HETATM    4  F   103     1     -23.816  18.483  13.675
HETATM    5  F   103     1     -24.119  17.162  15.433
HETATM    6  F   103     1     -22.680  16.591  13.841
HETATM    7  HC  104     1     -21.623  17.681  16.014
HETATM    8  HC  104     1     -22.451  19.218  15.823
HETATM    9  HO  102     1     -21.040  18.108  13.673
HETATM   10  CT  100     2      -4.340 -29.478  45.144
HETATM   11  CT   99     2      -3.051 -29.846  44.395
HETATM   12  OH  101     2      -1.968 -29.072  44.880
HETATM   13  F   103     2      -4.217 -29.778  46.464
HETATM   14  F   103     2      -5.396 -30.156  44.621
HETATM   15  F   103     2      -4.551 -28.140  45.015
HETATM   16  HC  104     2      -3.178 -29.656  43.329
HETATM   17  HC  104     2      -2.829 -30.908  44.511
HETATM   18  HO  102     2      -2.315 -28.222  45.119
HETATM   19  CT  100     3     -49.455 -17.542 -31.718
HETATM   20  CT   99     3     -49.981 -18.984 -31.736
HETATM   21  OH  101     3     -48.905 -19.897 -31.607
HETATM   22  F   103     3     -48.867 -17.273 -30.521
HETATM   23  F   103     3     -50.474 -16.668 -31.929
HETATM   24  F   103     3     -48.527 -17.405 -32.704
...

I have to change all first CT for C1 and second CT for C2, and the same for F1, F2, F3 and HC to H1, H2.

Is it possible to change them with awk and sed in a small script? Each C1-C2 and F1,F2,F3 are part of the same molecule (trifluoroethanol - TFE) but there is many molecules of TFE to be defined.

So I want it to look like this :

COMPND
SOURCE    
HETATM    1  C1  100     1     -23.207  17.632  14.543
HETATM    2  C2   99     1     -22.069  18.353  15.280
HETATM    3  OH  101     1     -21.074  18.762  14.358
HETATM    4  F1  103     1     -23.816  18.483  13.675
HETATM    5  F2  103     1     -24.119  17.162  15.433
HETATM    6  F3  103     1     -22.680  16.591  13.841
HETATM    7  H1  104     1     -21.623  17.681  16.014
HETATM    8  H2  104     1     -22.451  19.218  15.823
HETATM    9  HO  102     1     -21.040  18.108  13.673
HETATM   10  C1  100     2      -4.340 -29.478  45.144
HETATM   11  C2   99     2      -3.051 -29.846  44.395
HETATM   12  OH  101     2      -1.968 -29.072  44.880
HETATM   13  F1  103     2      -4.217 -29.778  46.464
HETATM   14  F2  103     2      -5.396 -30.156  44.621
HETATM   15  F3  103     2      -4.551 -28.140  45.015
HETATM   16  H1  104     2      -3.178 -29.656  43.329
HETATM   17  H2  104     2      -2.829 -30.908  44.511
HETATM   18  HO  102     2      -2.315 -28.222  45.119
HETATM   19  C1  100     3     -49.455 -17.542 -31.718
HETATM   20  C2   99     3     -49.981 -18.984 -31.736
HETATM   21  OH  101     3     -48.905 -19.897 -31.607
HETATM   22  F1  103     3     -48.867 -17.273 -30.521
HETATM   23  F2  103     3     -50.474 -16.668 -31.929
HETATM   24  F3  103     3     -48.527 -17.405 -32.704
...

Thanks

Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
Grego
  • 59
  • 2
  • 9

2 Answers2

1

You can use awk more easily than sed, though I have little doubt that it could be done in sed too if you really wanted to.

You need to:

  • Print lines where number of fields is 1 (or 2 — less than 3).
  • Keep track of last value in column 3 when there are at least 3 columns.
  • If the current column is one of CT, F or HC:
    • If the last value in column 3 was different, replace the input column 3 with the first letter plus 1; record that it was 1 you output.
    • Otherwise, increment the count and output the first letter plus the counter.
  • Otherwise output the line unchanged.

Which being translated to an awk script in a file, awk.script, could be:

NF < 3 { print; next }
$3 != "CT" && $3 != "F" && $3 != "HC" { print; next }
{ if (old_col3 != $3) { counter = 0 }
  old_col3 = $3
  $3 = substr($3, 1, 1) ++counter
  print
}

And, when that is run on your data file (named, unoriginally, data), I get:

$ awk -f awk.script data
COMPND
SOURCE    
HETATM 1 C1 100 1 -23.207 17.632 14.543
HETATM 2 C2 99 1 -22.069 18.353 15.280
HETATM    3  OH  101     1     -21.074  18.762  14.358
HETATM 4 F1 103 1 -23.816 18.483 13.675
HETATM 5 F2 103 1 -24.119 17.162 15.433
HETATM 6 F3 103 1 -22.680 16.591 13.841
HETATM 7 H1 104 1 -21.623 17.681 16.014
HETATM 8 H2 104 1 -22.451 19.218 15.823
HETATM    9  HO  102     1     -21.040  18.108  13.673
HETATM 10 C1 100 2 -4.340 -29.478 45.144
HETATM 11 C2 99 2 -3.051 -29.846 44.395
HETATM   12  OH  101     2      -1.968 -29.072  44.880
HETATM 13 F1 103 2 -4.217 -29.778 46.464
HETATM 14 F2 103 2 -5.396 -30.156 44.621
HETATM 15 F3 103 2 -4.551 -28.140 45.015
HETATM 16 H1 104 2 -3.178 -29.656 43.329
HETATM 17 H2 104 2 -2.829 -30.908 44.511
HETATM   18  HO  102     2      -2.315 -28.222  45.119
HETATM 19 C1 100 3 -49.455 -17.542 -31.718
HETATM 20 C2 99 3 -49.981 -18.984 -31.736
HETATM   21  OH  101     3     -48.905 -19.897 -31.607
HETATM 22 F1 103 3 -48.867 -17.273 -30.521
HETATM 23 F2 103 3 -50.474 -16.668 -31.929
HETATM 24 F3 103 3 -48.527 -17.405 -32.704
$

This doesn't preserve all the spacing in the modified lines, but is otherwise what you need. If you do need to preserve the spacing, you have to write a printf() statement to format the fields correctly (in place of the print in the last block of code:

printf("%s %4s %3s %4s %5s %11s %7s %7s\n", $1, $2, $3, $4, $5, $6, $7, $8);

This does preserve the spacing, but makes the code less robust in general. It exploits the property that strings shorter than the n in %ns are right-justified. That yields:

COMPND
SOURCE    
HETATM    1  C1  100     1     -23.207  17.632  14.543
HETATM    2  C2   99     1     -22.069  18.353  15.280
HETATM    3  OH  101     1     -21.074  18.762  14.358
HETATM    4  F1  103     1     -23.816  18.483  13.675
HETATM    5  F2  103     1     -24.119  17.162  15.433
HETATM    6  F3  103     1     -22.680  16.591  13.841
HETATM    7  H1  104     1     -21.623  17.681  16.014
HETATM    8  H2  104     1     -22.451  19.218  15.823
HETATM    9  HO  102     1     -21.040  18.108  13.673
HETATM   10  C1  100     2      -4.340 -29.478  45.144
HETATM   11  C2   99     2      -3.051 -29.846  44.395
HETATM   12  OH  101     2      -1.968 -29.072  44.880
HETATM   13  F1  103     2      -4.217 -29.778  46.464
HETATM   14  F2  103     2      -5.396 -30.156  44.621
HETATM   15  F3  103     2      -4.551 -28.140  45.015
HETATM   16  H1  104     2      -3.178 -29.656  43.329
HETATM   17  H2  104     2      -2.829 -30.908  44.511
HETATM   18  HO  102     2      -2.315 -28.222  45.119
HETATM   19  C1  100     3     -49.455 -17.542 -31.718
HETATM   20  C2   99     3     -49.981 -18.984 -31.736
HETATM   21  OH  101     3     -48.905 -19.897 -31.607
HETATM   22  F1  103     3     -48.867 -17.273 -30.521
HETATM   23  F2  103     3     -50.474 -16.668 -31.929
HETATM   24  F3  103     3     -48.527 -17.405 -32.704

Since it appears that when you got to 10,000 records, the HETATM column and the number column following are merged into a single column:

HETATM   21  OH  101     3     -48.905 -19.897 -31.607
…
HETATM 9999  HO  102  1111     -24.504 -16.257 -35.613
HETATM10000  CT  100  1112       9.045  23.978  29.038
HETATM10001  CT   99  1112      10.488  24.501  29.083
HETATM10002  OH  101  1112      11.370  23.545  28.522
HETATM10003  F   103  1112       8.650  23.804  27.749
HETATM10004  F   103  1112       8.209  24.855  29.654
HETATM10005  F   103  1112       8.996  22.779  29.679

It isn't clear what happens when the numbers reach 100,000 and above. However, it is possible to deal with it (for the most part) by counting columns and working appropriately.

NF < 7 { print; next }
NF == 8 && $3 != "CT" && $3 != "F" && $3 != "HC" { print; next }
NF == 7 && $2 != "CT" && $2 != "F" && $2 != "HC" { print; next }
NF == 8 {
          if (old_mark != $3) { counter = 0 }
          old_mark = $3
          $3 = substr($3, 1, 1) ++counter
          printf("%s %4s %3s %4s %5s %11s %7s %7s\n", $1, $2, $3, $4, $5, $6, $7, $8);
        }
NF == 7 {
          if (old_mark != $2) { counter = 0 }
          old_mark = $2
          $2 = substr($2, 1, 1) ++counter
          printf("%s %3s %4s %5s %11s %7s %7s\n", $1, $2, $3, $4, $5, $6, $7);
        }

Note the use of the 'column number neutral' name old_mark. If row 9,999 contains CT and row 10,000 also contains CT, then the mapping needs to be continuous (C1, C2) etc. You could use:

NF < 7 { print; next }
NF == 8 && $3 != "CT" && $3 != "F" && $3 != "HC" { print; next }
NF == 7 && $2 != "CT" && $2 != "F" && $2 != "HC" { print; next }
{
    colnum = NF - 5
    if (old_mark != $colnum) { counter = 0 }
    old_mark = $colnum
    $colnum = substr($colnum, 1, 1) ++counter
    if (NF == 7)
        printf("%s %3s %4s %5s %11s %7s %7s\n", $1, $2, $3, $4, $5, $6, $7);
    else
        printf("%s %4s %3s %4s %5s %11s %7s %7s\n", $1, $2, $3, $4, $5, $6, $7, $8);
}

There might be a way to use one printf() call, but I doubt if it is worth the effort.

Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
  • Thanks but there is still a problem remaining. It stops at the 9999 row. Is there something to do? – Grego May 15 '15 at 13:27
  • `HETATM 9999 HO 102 1111 -24.504 -16.257 -35.613 HETATM10000 CT 100 1112 9.045 23.978 29.038 HETATM10001 CT 99 1112 10.488 24.501 29.083 HETATM10002 OH 101 1112 11.370 23.545 28.522 HETATM10003 F 103 1112 8.650 23.804 27.749 HETATM10004 F 103 1112 8.209 24.855 29.654 HETATM10005 F 103 1112 8.996 22.779 29.679` – Grego May 15 '15 at 13:36
  • I found my problem! `NF < 3 { print; next } $2 != "CT" && $2 != "F" && $2 != "HC" { print; next } { if (old_col2 != $2) { counter = 0 } old_col2 = $2 $2 = substr($2, 1, 1) ++counter printf("%s %3s %4s %5s %11s %7s %7s \n", $1, $2, $3, $4, $5, $6, $7); }` – Grego May 15 '15 at 13:46
  • This illustrates that it is hard to code reliably when you don't have all the information needed. It wasn't clear from the sample data that you'd 'lose' a field when the second column reached 10,000. Many formats preserve white-space separated columns. This one doesn't. I wonder what happens if you ever get to a million rows in the file? However, adaptive code is probably best here; `if (NF == 8) { …do it the 8-column way… } else { …do it the 7-column way… }`. One gotcha to watch for is if column 9999 and 10,000 are both 'F' columns: then the old column has to be recognized properly. – Jonathan Leffler May 15 '15 at 14:26
0

Here's one way to solve this with a while read loop, grep and sed:

counter=0
while read line; do 
  # if a line has CT, CF, F in it... 
  if echo $line | grep -Pq '(CT|HC|F) '; then 
    # increment the counter and...
    counter=$((counter+1))

    # replace the 15th character with the counter!
    echo $line | sed "s/./$counter/15"
  else 

    # otherwise, reset the counter, and echo the line
    counter=0
    echo $line
  fi
done < molecule.txt

You can then pipe that to another file or stdout!

lxe
  • 7,051
  • 2
  • 20
  • 32
  • All the echo commands should use `echo "$line"` to preserve spaces (the first doesn't matter too much; the other two probably do matter). It seems a trifle clumsy to be at least one command per line of input when it can be done with a single command for the entire process. – Jonathan Leffler May 14 '15 at 21:55
  • The spaces are fine here. I'm just echoing stuff, not storing it to variables or iterating on anything. – lxe May 14 '15 at 21:57
  • Though the whole `if echo | grep -q` could be better expressed with a `case` statement. – tripleee May 15 '15 at 20:53