Reading a huge xyz file in python (200M lines, ~15 GB) and performing calculations on it

Question

I have an output of .xyz file from a molecular simulation software. I need to calculate the distance between two sets of atoms. First line is the number of atoms (1046), the next line can be seen as a comment which I wouldn't need. Then comes the coordinates of the atoms. The general view of the file is as follows.

    1046
 i =    13641, time =     5456.400, E =     -6478.1065220464
  O        -7.4658679231       -8.2711817669       -9.8539631371
  H        -7.6241360163       -9.2582538006       -9.9522290769
  H        -8.2358222851       -7.6941601822       -9.9653770757
  O        -4.9711266650       -4.7190696213      -15.2513827675
  H        -4.0601366272       -4.8452939622      -14.9451462873
  H        -5.3574156180       -5.6550789412      -15.1154558067
... ~ 1000 more lines
  O        -3.7163764338      -18.4917410571      -11.7137020838
  H        -3.3000068500      -18.5292231200      -12.6331415740
  H        -4.3493512803      -19.2443154891      -11.6751925772
    1046
 i =    13642, time =     5456.800, E =     -6478.1027892656
  O        -7.4669935102       -8.2716185134       -9.8549232159
  H        -7.6152044024       -9.2599276969       -9.9641510528
  H        -8.2364333010       -7.6943001983       -9.9565217204
  O        -4.9709831745       -4.7179801609      -15.2530422573
  H        -4.0670595153       -4.8459686871      -14.9472675802
  H        -5.3460565517       -5.6569802374      -15.1037050119
...

The indices of first set of atoms that I want to extract goes like 0,3,6,9,...,360. Considering the first two lines of header in the file, I implemented this as

w_index = list(range(2,362,3))

And the other set is only 8 atoms, which I again gave as a list.

c_index = [420,488,586,688,757,847,970,1031]

My thinking is to append the corresponding lines to separate lists in order to operate on them by the function 'dist_calculate_with_pbc_correction'.

def open_file(filename):

    global step

    waters = []
    carbons = []
    step=0

    with open(filename, 'r') as infile:

        for index, line in enumerate(infile):
            items = line.split()

            if index % (natoms+2) in w_index:
                kind, x, y, z = items[0], float(items[1]), float(items[2]), float(items[3])
                waters.append([kind,x,y,z])

            if (index - 2)% (natoms+2) in c_index:
                kind, x, y, z = items[0], float(items[1]), float(items[2]), float(items[3])
                carbons.append([kind,x,y,z])

            if index > 0 and index % (natoms+2) == natoms:
                dist_calculate_with_pbc_correction(carbons,waters)
                carbons, waters = [], []
                step+=1

And this is the function that does all the calculation and write the outcomes to a file.

def dist_calculate_with_pbc_correction(c,w):

    write_file(output_fn,'%s %r \n' % ('#Step:',step))

    for i in range(len(c)):
        hydration = 0
        min_d = 100
        for j in range(len(w)):

            x_diff, y_diff, z_diff = c[i][1] - w[j][1], c[i][2] - w[j][2], c[i][3] - w[j][3]

            if x_diff > pbc_a/2:
                x_diff -= pbc_a
            elif x_diff < -pbc_a/2:
                x_diff += pbc_a

            if y_diff > pbc_b/2:
                y_diff -= pbc_b
            elif y_diff < -pbc_b/2:
                y_diff += pbc_b

            if z_diff > pbc_c/2:
                z_diff -= pbc_c
            elif z_diff < -pbc_c/2:
                z_diff += pbc_c

            dist = math.sqrt(x_diff**2 + y_diff**2 + z_diff**2)
            if dist < min_d:
                min_d, min_index = dist, w_index[j]-2
            if dist < r_cutoff :
                hydration +=1

        write_file(output_fn,'%d %s %d %d \n' % (c_index[i], round(min_d,3), min_index, hydration))


def write_file(filename,out):
    with open(filename,'a') as g:

        g.write(out)

The problem is that, this code works but it works not fast enough (read extremely slow). It takes about 27 minutes to go through the whole file of ~200K steps (~200M lines). I say 'extremely slow' since a colleague of mine came up with her own version in Fortran -that does the exact same thing if not more- runs 10 times faster. Her code runs the entire file under 3 minutes even though she bothers to calculate in order to determine the smallest distance between all the atoms unlike the way I dictate the indices of atoms. I am aware that the majority of time is spent on selecting the atoms in the 'open_file' function to add them to corresponding lists, but I don't know how to improve that. Any help will be appreciated.

Here is a sample xyz file maker of carbons if you want to have a sample file handy.

import numpy as np
def sample_xyz(steps,natoms):

    with open('sample_xyz.xyz','w') as f:
        for step in range(steps):
            f.write(str(natoms)+'\n')
            f.write('i = {} \n'.format(step))

            for foo in range(natoms):
                line = 'C {} {} {}\n'.format(np.random.random()*10, np.random.random()*10, np.random.random()*10)
                f.write(line)

What is your question? why is this slow? have you tried profiling? — shx2, May 07 '20 at 18:35
Yes I am wondering why the code is running slower compared to its Fortran version. And no, I haven't tried profiling, would you care to elaborate? — user9115191, May 07 '20 at 18:52
any float will do. that won't matter. you can just take them pbc_a, pbc_b, pbc_c = 10.,10.,70. and r_cutoff=4.0 — user9115191, May 07 '20 at 20:00
@user9115191--from profiling the software it seems most of the time is spent parsing the text file (i.e. reading and placing into arrays), while relatively little is spent on calculations for routine dist_calculate_with_pbc_correction. Is your colleague also reading a text (rather than binary) file? Considering this seems to be IO limited I'm surprised her Fortran version is 10X faster if she is also reading a text file. — DarrylG, May 08 '20 at 15:48
Yes I thought that too, most of the time is spent at open_file function. She is reading exactly the same file as I do. I can't really understand. I will try to improve the parsing function to avoid if clauses as best as possible. By the way what can I use to profile as you have done? Thanks. — user9115191, May 09 '20 at 16:09

Reading a huge xyz file in python (200M lines, ~15 GB) and performing calculations on it

0 Answers0