0

I want to calculate mean square displacement of N particles for which I have the trajectory of particles' positions over time. The code I've written has 3 for loops that makes it extremely slow. Could you please help me how I can replace for loops with some sort of vectorized capabilities of numpy or pandas?

Here is my code:

ntime = 10 # number of times represented in data
atom_count = 3 # number of particles 
norigin = 5 # number of origins is half number of time steps
nmin = 2 # minimum number of intervals to contribute to diffusivity
nmax = norigin # maximum number of intervals to contribute to diffusivity
dt = 1.0 # timestep

# creating sample trajectory of particles
traj = pd.DataFrame(np.random.rand(ntime*atom_count,3), columns=['x', 'y', 'z'])
traj['frame_id'] = np.repeat(np.arange(ntime)+1, atom_count)
traj['particle_id'] = np.tile(np.arange(atom_count)+1, ntime)
traj = traj[['frame_id', 'particle_id', 'x', 'y', 'z']]
print(traj.head(6))

ndata = traj.shape[0] # number of rows of data

# store mean square displacements in msd
time_vec= np.arange(dt, norigin*dt+1, dt)
msd_xyz = np.zeros((norigin,3))

# loop over all particles
for i in range(atom_count):
    # loop over all time origins
    for j in range(norigin):
        jstart = j*atom_count + i
        # loop over all time windows
        for k in range(nmin, nmax):
            kend = jstart + k*atom_count
            msd_xyz[k, :] += (traj.ix[kend, ['x', 'y', 'z']].values - 
                              traj.ix[jstart, ['x', 'y', 'z']].values)**2
msd_xyz = msd_xyz / (atom_count * norigin)   

msd = np.mean(msd_xyz, axis=1) # total MSD averaged over x, y, z directions

print()
print("MSD (averaged over all particles and time origins):")
print(msd)
OriolAbril
  • 7,315
  • 4
  • 29
  • 40
Frank63
  • 69
  • 1
  • 4

1 Answers1

1

Using the indexing capabilities of numpy, all 3 nested loops can be vectorized with the help of a meshgrid.

The key for this to work is that numpy arrays support list or array indexing of any shape:

a = np.arange(5,10)
b = np.array([[0,2,4],[3,3,0]])
print(a[b])
# Output
[[5 7 9]
 [8 8 5]]

Therefore, we can define a meshgrid from the arrays used as iterators in the loops in order to retrieve all the combinations of i,j and k from the loop at once, and then sum over i and j.

It is important to note that the array indexing has been moved after the .values method because numpy supports this kind of indexing, but pandas only does for 1D arrays.

# define indexing arrays 
k = np.arange(nmin,nmax)
j = np.arange(norigin)
i = np.arange(atom_count)
I,J,K = np.meshgrid(i,j,k) # the meshgrid contains all the combinations of i,j,k, 
                           # it is equivalent to the 3 nested loops

jstart = J*atom_count + I
kend = jstart + K*atom_count
msd_xyz[k,:] = np.sum(np.sum((traj[['x', 'y', 'z']].values[kend,:] - 
                     traj[['x', 'y', 'z']].values[jstart,:])**2,axis=0),axis=0)
msd_xyz = msd_xyz / (atom_count * norigin)   
msd = np.mean(msd_xyz, axis=1) # total MSD averaged over x, y, z directions

With the dimensions of the example data in the question, this achieves a x60 speedup with respect to the 3 nested loops. However, for large dataframes, this will probably use too much memory and become slower, in which case it would be better to combine loops and vectorization and only vectorize one or 2 loops to avoid excessive memory usage.

OriolAbril
  • 7,315
  • 4
  • 29
  • 40