Insert a column into an existing file

Question

I am trying to insert a column containing the letter "A" into an existing PDB (Protein Data Bank) file while keeping the field separation the same. Please note that the actual data contains different text in the fourth column.

ATOM      1  N   LYS     1      27.426  26.010  24.339  1.00  0.00           N  
ATOM      2  H1  LYS     1      27.291  25.736  24.387  1.00  0.00           H  
ATOM      3  H2  LYS     1      27.286  25.739  24.374  1.00  0.00           H

What I would like is:

ATOM      1  N   LYS A   1      27.426  26.010  24.339  1.00  0.00           N  
ATOM      2  H1  LYS A   1      27.291  25.736  24.387  1.00  0.00           H  
ATOM      3  H2  LYS A   1      27.286  25.739  24.374  1.00  0.00           H

Any help would be greatly appreciated.
Dan.

Edit:

A reminder of the old PDB standard, which defines the column layout for the ATOM record, from the Protein Data Bank Contents Guide: Atomic Coordinate Entry Format Description, Version 3.30:

Record Format: Every PDB file is presented in a number of lines. Each line in the PDB entry file consists of 80 columns. The last character in each PDB entry should be an end-of-line indicator. Each line in the PDB file is self-identifying. The first six columns of every line contains a record name, that is left-justified and separated by a blank. The record name must be an exact match to one of the stated record names in this format guide.
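
To make the fixed-width layout concrete, here is a minimal Python sketch that slices an ATOM line by column position. The quoted passage only covers the record name, so the other column ranges are an assumption based on the ATOM record table in the version 3.3 guide and should be checked against the specification. The names `ATOM_FIELDS` and `parse_atom` are made up for this example.

```python
# Column ranges (1-based, inclusive) for a few ATOM record fields.
# Assumed from the v3.3 format guide; verify before relying on them.
ATOM_FIELDS = {
    "record":   (1, 6),
    "serial":   (7, 11),
    "name":     (13, 16),
    "res_name": (18, 20),
    "chain_id": (22, 22),
    "res_seq":  (23, 26),
}

def parse_atom(line):
    """Return the selected fixed-width fields of one ATOM line, stripped."""
    return {key: line[start - 1:end].strip()
            for key, (start, end) in ATOM_FIELDS.items()}

line = "ATOM      1  N   LYS     1      27.426  26.010  24.339  1.00  0.00           N"
fields = parse_atom(line)
print(fields)
```

In the sample data the `chain_id` field comes back empty, which suggests the task is filling an existing blank column (column 22) rather than shifting the other columns.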

With GNU sed: `sed -E 's/(.{21})./\1A/' file` – Cyrus Mar 01 '23 at 21:46

score 2 · Answer 1 · answered Mar 01 '23 at 20:54

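Overwrite column 22 with an A (an empty FS makes every character its own field; GNU awk supports this, but POSIX leaves the behavior unspecified):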

awk 'BEGIN{FS=OFS=""}{$22="A";print}' file

Output:

ATOM      1  N   LYS A   1      27.426  26.010  24.339  1.00  0.00           N  
ATOM      2  H1  LYS A   1      27.291  25.736  24.387  1.00  0.00           H  
ATOM      3  H2  LYS A   1      27.286  25.739  24.374  1.00  0.00           H

See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
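
The awk one-liner rewrites column 22 on every line of the file, whatever the record type. As a hedged alternative, the following Python sketch applies the same overwrite only to ATOM and HETATM lines. The function name `set_chain_id` and the choice of record types are assumptions for illustration, not part of the answer.

```python
def set_chain_id(line, chain="A"):
    """Overwrite column 22 (1-based) of an ATOM/HETATM line with `chain`."""
    if not line.startswith(("ATOM", "HETATM")):
        return line                      # leave other record types untouched
    body = line.rstrip("\n")
    newline = line[len(body):]           # preserve the original line ending
    body = body.ljust(22)                # pad short lines so column 22 exists
    return body[:21] + chain + body[22:] + newline

lines = [
    "ATOM      1  N   LYS     1      27.426  26.010  24.339  1.00  0.00           N\n",
    "END\n",
]
result = [set_chain_id(ln) for ln in lines]
print("".join(result), end="")
```

To process a real file, the same function could be applied to each line read from the input and the results written to a new file, so that the original stays intact until the output has been checked.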

score 0 · Answer 2 · answered Mar 01 '23 at 20:48

If the A column always comes after LYS then the easiest option is a simple string replacement with sed. Assuming that this file uses tabs, you can do:

sed -i 's/LYS/LYS\tA/' file

Otherwise, if it is done with an exact number of spaces, do:

sed -i 's/LYS  /LYS A/' file

answered Mar 01 '23 at 20:48

match


Hello, thank you for the suggestion. There would be other amino acids in the column with LYS in it e.g. GLY, ALA, HIS etc. How could you implement your suggestion with different text in that column? Dan. – Dan Mar 01 '23 at 21:09
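
One way to address this follow-up is to match any three-character residue name instead of the literal LYS. The sketch below is in Python rather than sed. It assumes that residue names are exactly three uppercase letters or digits in columns 18-20 and that column 22 is currently blank; `PATTERN` and `add_chain` are hypothetical names.

```python
import re

# Group 1: record name (columns 1-6), eleven more characters (columns 7-17),
# a three-character residue name (columns 18-20) and the blank in column 21.
# The single space after the group is column 22, which gets replaced.
PATTERN = re.compile(r"^((?:ATOM  |HETATM).{11}[A-Z0-9]{3} ) ")

def add_chain(line, chain="A"):
    """Put `chain` into column 22 if that column is blank on an ATOM/HETATM line."""
    return PATTERN.sub(lambda m: m.group(1) + chain, line)

samples = [
    "ATOM      1  N   LYS     1      27.426",
    "ATOM      4  CA  GLY     2      25.112",
    "REMARK   1 some free text",
]
converted = [add_chain(s) for s in samples]
print("\n".join(converted))
```

Lines that do not match, such as other record types or lines that already carry a chain identifier, are returned unchanged.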

markp-fuso · Answer 3 · 2023-03-02T21:53:37.007

Assumptions:

actual width of columns, and spacing between columns, is not known beforehand (eg, could vary from file to file); UPDATE: a comment from pippo1980 indicates this is an invalid assumption
in this case 'inserting a column' is the same (visually) as appending A to the 4th column
OP has access to GNU awk (aka gawk) - for the 4th argument to split()

One GNU awk idea:

awk '
{ n=split($0,fld,FS,sep)                 # split current line, fields go into array fld[] while individual field separators go into array sep[]
  fld[4]=fld[4] " A"                     # append " A" to 4th field
# sep[4]=substr(sep[4],3)                # uncomment to maintain position of current 5th-nth columns
  for (i=1;i<=n;i++)                     # loop through array indices ...
      printf "%s%s",fld[i],sep[i]        # printing each field and its associated (suffix) separator
  print ""                               # terminate current line
}
' file

This generates:

ATOM      1  N   LYS A     1      27.426  26.010  24.339  1.00  0.00           N
ATOM      2  H1  LYS A     1      27.291  25.736  24.387  1.00  0.00           H
ATOM      3  H2  LYS A     1      27.286  25.739  24.374  1.00  0.00           H
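
For readers without GNU awk, the same idea of splitting a line while keeping the separators can be sketched in Python: `re.split` with a capturing group returns fields and separators interleaved. This illustrates the technique and is not a translation of the exact script above. It assumes lines do not start with whitespace, and `append_to_field` and its parameters are made-up names.

```python
import re

def append_to_field(line, field_no=4, suffix=" A", keep_positions=False):
    """Append `suffix` to the 1-based whitespace-delimited field `field_no`."""
    parts = re.split(r"(\s+)", line)     # even indices: fields, odd: separators
    idx = 2 * (field_no - 1)
    if idx >= len(parts) or parts[idx] == "":
        return line                      # too few fields; leave the line alone
    parts[idx] += suffix
    if (keep_positions and idx + 1 < len(parts)
            and len(parts[idx + 1]) > len(suffix)):
        # shorten the following separator so later columns stay in place
        parts[idx + 1] = parts[idx + 1][len(suffix):]
    return "".join(parts)

line = "ATOM      1  N   LYS     1      27.426"
shifted = append_to_field(line)
aligned = append_to_field(line, keep_positions=True)
print(shifted)
print(aligned)
```

With `keep_positions=True` the sketch mirrors the commented-out `sep[4]=substr(sep[4],3)` line: the separator after the fourth field loses as many characters as were appended, so the remaining columns do not move.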

Once the results are validated, and if OP wishes to update the source file, with GNU awk we can use the -i inplace option to facilitate the 'inplace' update:

$ awk -i inplace '{ n=split($0,fld,FS,sep) .... }' file

$ cat file
ATOM      1  N   LYS A     1      27.426  26.010  24.339  1.00  0.00           N
ATOM      2  H1  LYS A     1      27.291  25.736  24.387  1.00  0.00           H
ATOM      3  H2  LYS A     1      27.286  25.739  24.374  1.00  0.00           H

edited Mar 02 '23 at 21:53

answered Mar 01 '23 at 21:23

markp-fuso


"actual width of columns, and spacing between columns, is not known beforehand": this is not correct; the PDB file standard defines fixed-width columns, each with a defined meaning – pippo1980 Mar 02 '23 at 15:53
@pippo1980 that kind of detail should've been included in the question as opposed to assuming everyone seeing this question knows what a `pdb` file format looks like; does the `pdb` standard (you mention) also state *all* `pdb` files will have the same set of columns and said columns will always be in the same order? thanks – markp-fuso Mar 02 '23 at 18:51
also, OP has stated they want to *insert* a column; does that mean this 'new' column was there all along (per the `pdb` file standard) and was just empty? alternatively, if OP really is inserting a *new* column ... this brings up the question of what the `pdb` file standard really says ... does it say the 4th field is exactly 3 characters wide, or does it say the 4th field always starts in column #18? – markp-fuso Mar 02 '23 at 18:58
https://files.wwpdb.org/pub/pdb/doc/format_descriptions/Format_v33_Letter.pdf Protein Data Bank Contents Guide: Atomic Coordinate Entry Format Description Version 3.30, a document published by the wwPDB, now superseded by PDBx/mmCIF https://mmcif.wwpdb.org/ – pippo1980 Mar 02 '23 at 21:16
Record Format: Every PDB file is presented in a number of lines. Each line in the PDB entry file consists of 80 columns. The last character in each PDB entry should be an end-of-line indicator. Each line in the PDB file is self-identifying. The first six columns of every line contains a record name, that is left-justified and separated by a blank. The record name must be an exact match to one of the stated record names in this format guide. – pippo1980 Mar 02 '23 at 21:19
Why wasn't my edit, which carried the ATOM record column specification, approved? – pippo1980 Mar 03 '23 at 09:07
@pippo1980 How would you add a number to the residue number? I am trying to add 30 to it so that the numbering goes 29,30,TER,31,32 etc., not 29,30,TER,1,2,3 – Dan Mar 06 '23 at 11:24
@Dan No idea; I have no knowledge of bash, sed, or awk. I am using Python, specifically Biopython, to deal with PDB files. There is also pdb-tools: http://www.bonvinlab.org/pdb-tools/ – pippo1980 Mar 06 '23 at 12:04
@Dan see an example about how biopython parser would work in a case similar to yours https://stackoverflow.com/questions/71427946/how-to-renumber-residues-start-from-1-in-continuation-among-chains-in-pdb-file/71428814#71428814 – pippo1980 Mar 06 '23 at 17:25
@Dan TER marks the end of a chain, so if your numbers keep growing after TER you should rename the chain too – pippo1980 Mar 06 '23 at 17:26
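
The comments above point to Biopython and pdb-tools for renumbering. As a rough plain-Python alternative, the sketch below adds an offset to the residue number of every ATOM/HETATM line that follows the first TER record. It assumes the residue number sits in columns 23-26 (taken from the v3.3 ATOM record layout, not from this thread), that those columns always hold an integer, and that the result still fits in four characters. Both function names are hypothetical.

```python
def shift_residue_number(line, offset):
    """Add `offset` to the residue number in columns 23-26 of an ATOM/HETATM line."""
    if not line.startswith(("ATOM", "HETATM")):
        return line
    res_seq = int(line[22:26])           # assumes the field holds an integer
    return line[:22] + f"{res_seq + offset:>4d}" + line[26:]

def renumber_after_first_ter(lines, offset):
    """Shift residue numbers only on lines that come after the first TER record."""
    out, seen_ter = [], False
    for line in lines:
        if line.startswith("TER"):
            seen_ter = True
            out.append(line)
        elif seen_ter:
            out.append(shift_residue_number(line, offset))
        else:
            out.append(line)
    return out

lines = [
    "ATOM    200  N   GLY A  30",
    "TER",
    "ATOM    201  N   ALA A   1",
]
renumbered = renumber_after_first_ter(lines, 30)
print("\n".join(renumbered))
```

This does not touch the chain identifier or the TER record itself, so the advice above about renaming the chain still applies.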

score 0 · Answer 4 · answered Mar 01 '23 at 21:42

An approach using Ruby from the shell (Ruby strings are zero-indexed, hence `22-1`):

% ruby -npe '$_[22-1] = "A"' file.pdb
ATOM      1  N   LYS A   1      27.426  26.010  24.339  1.00  0.00           N
ATOM      2  H1  LYS A   1      27.291  25.736  24.387  1.00  0.00           H
ATOM      3  H2  LYS A   1      27.286  25.739  24.374  1.00  0.00           H
