I have an output of .xyz file from a molecular simulation software. I need to calculate the distance between two sets of atoms. First line is the number of atoms (1046), the next line can be seen as a comment which I wouldn't need. Then comes the coordinates of the atoms. The general view of the file is as follows.
1046
i = 13641, time = 5456.400, E = -6478.1065220464
O -7.4658679231 -8.2711817669 -9.8539631371
H -7.6241360163 -9.2582538006 -9.9522290769
H -8.2358222851 -7.6941601822 -9.9653770757
O -4.9711266650 -4.7190696213 -15.2513827675
H -4.0601366272 -4.8452939622 -14.9451462873
H -5.3574156180 -5.6550789412 -15.1154558067
... ~ 1000 more lines
O -3.7163764338 -18.4917410571 -11.7137020838
H -3.3000068500 -18.5292231200 -12.6331415740
H -4.3493512803 -19.2443154891 -11.6751925772
1046
i = 13642, time = 5456.800, E = -6478.1027892656
O -7.4669935102 -8.2716185134 -9.8549232159
H -7.6152044024 -9.2599276969 -9.9641510528
H -8.2364333010 -7.6943001983 -9.9565217204
O -4.9709831745 -4.7179801609 -15.2530422573
H -4.0670595153 -4.8459686871 -14.9472675802
H -5.3460565517 -5.6569802374 -15.1037050119
...
The indices of first set of atoms that I want to extract goes like 0,3,6,9,...,360. Considering the first two lines of header in the file, I implemented this as
w_index = list(range(2,362,3))
And the other set is only 8 atoms, which I again gave as a list.
c_index = [420,488,586,688,757,847,970,1031]
My thinking is to append the corresponding lines to separate lists in order to operate on them by the function 'dist_calculate_with_pbc_correction'.
def open_file(filename):
global step
waters = []
carbons = []
step=0
with open(filename, 'r') as infile:
for index, line in enumerate(infile):
items = line.split()
if index % (natoms+2) in w_index:
kind, x, y, z = items[0], float(items[1]), float(items[2]), float(items[3])
waters.append([kind,x,y,z])
if (index - 2)% (natoms+2) in c_index:
kind, x, y, z = items[0], float(items[1]), float(items[2]), float(items[3])
carbons.append([kind,x,y,z])
if index > 0 and index % (natoms+2) == natoms:
dist_calculate_with_pbc_correction(carbons,waters)
carbons, waters = [], []
step+=1
And this is the function that does all the calculation and write the outcomes to a file.
def dist_calculate_with_pbc_correction(c,w):
write_file(output_fn,'%s %r \n' % ('#Step:',step))
for i in range(len(c)):
hydration = 0
min_d = 100
for j in range(len(w)):
x_diff, y_diff, z_diff = c[i][1] - w[j][1], c[i][2] - w[j][2], c[i][3] - w[j][3]
if x_diff > pbc_a/2:
x_diff -= pbc_a
elif x_diff < -pbc_a/2:
x_diff += pbc_a
if y_diff > pbc_b/2:
y_diff -= pbc_b
elif y_diff < -pbc_b/2:
y_diff += pbc_b
if z_diff > pbc_c/2:
z_diff -= pbc_c
elif z_diff < -pbc_c/2:
z_diff += pbc_c
dist = math.sqrt(x_diff**2 + y_diff**2 + z_diff**2)
if dist < min_d:
min_d, min_index = dist, w_index[j]-2
if dist < r_cutoff :
hydration +=1
write_file(output_fn,'%d %s %d %d \n' % (c_index[i], round(min_d,3), min_index, hydration))
def write_file(filename,out):
with open(filename,'a') as g:
g.write(out)
The problem is that, this code works but it works not fast enough (read extremely slow). It takes about 27 minutes to go through the whole file of ~200K steps (~200M lines). I say 'extremely slow' since a colleague of mine came up with her own version in Fortran -that does the exact same thing if not more- runs 10 times faster. Her code runs the entire file under 3 minutes even though she bothers to calculate in order to determine the smallest distance between all the atoms unlike the way I dictate the indices of atoms. I am aware that the majority of time is spent on selecting the atoms in the 'open_file' function to add them to corresponding lists, but I don't know how to improve that. Any help will be appreciated.
Here is a sample xyz file maker of carbons if you want to have a sample file handy.
import numpy as np
def sample_xyz(steps,natoms):
with open('sample_xyz.xyz','w') as f:
for step in range(steps):
f.write(str(natoms)+'\n')
f.write('i = {} \n'.format(step))
for foo in range(natoms):
line = 'C {} {} {}\n'.format(np.random.random()*10, np.random.random()*10, np.random.random()*10)
f.write(line)