Speedup MSD calculation in Python

Question

It's a call to the community to see if anyone has an idea to improve the speed of this MSD calculation implementation. It is largely based on the implementation from this blog post : http://damcb.com/mean-square-disp.html

For now the current implementation takes about 9s for a 2D trajectory of 5 000 points. It's really way too much if you need to compute a lot of trajectories...

I didn't try to parallelize it (with multiprocess or joblib) but I have the feeling that creating new processes will be too heavy for this kind of algorithm.

Here is the code :

import os

import matplotlib
import matplotlib.pyplot as plt

import pandas as pd
import numpy as np

# Parameters
N = 5000
max_time = 100
dt = max_time / N

# Generate 2D brownian motion

t = np.linspace(0, max_time, N)
xy = np.cumsum(np.random.choice([-1, 0, 1], size=(N, 2)), axis=0)
traj = pd.DataFrame({'t': t, 'x': xy[:,0], 'y': xy[:,1]})
print(traj.head())

# Draw motion
ax = traj.plot(x='x', y='y', alpha=0.6, legend=False)

# Set limits
ax.set_xlim(traj['x'].min(), traj['x'].max())
ax.set_ylim(traj['y'].min(), traj['y'].max())

And the output :

          t  x  y
0  0.000000 -1 -1
1  0.020004 -1  0
2  0.040008 -1 -1
3  0.060012 -2 -2
4  0.080016 -2 -2

def compute_msd(trajectory, t_step, coords=['x', 'y']):

    tau = trajectory['t'].copy()
    shifts = np.floor(tau / t_step).astype(np.int)
    msds = np.zeros(shifts.size)
    msds_std = np.zeros(shifts.size)

    for i, shift in enumerate(shifts):
        diffs = trajectory[coords] - trajectory[coords].shift(-shift)
        sqdist = np.square(diffs).sum(axis=1)
        msds[i] = sqdist.mean()
        msds_std[i] = sqdist.std()

    msds = pd.DataFrame({'msds': msds, 'tau': tau, 'msds_std': msds_std})
    return msds

# Compute MSD
msd = compute_msd(traj, t_step=dt, coords=['x', 'y'])
print(msd.head())

# Plot MSD
ax = msd.plot(x="tau", y="msds", logx=True, logy=True, legend=False)
ax.fill_between(msd['tau'], msd['msds'] - msd['msds_std'], msd['msds'] + msd['msds_std'], alpha=0.2)

And the output :

       msds  msds_std       tau
0  0.000000  0.000000  0.000000
1  1.316463  0.668169  0.020004
2  2.607243  2.078604  0.040008
3  3.891935  3.368651  0.060012
4  5.200761  4.685497  0.080016

And some profiling :

%timeit msd = compute_msd(traj, t_step=dt, coords=['x', 'y'])

Give this :

1 loops, best of 3: 8.53 s per loop

Any idea ?

Since you already have working code, this could be a good candidate for *codereview*. — cel, Oct 07 '15 at 09:18
Oh I didn't knew _codereview_. Can a moderator confirm this and I will move it to _codereview_ ? — hadim, Oct 07 '15 at 09:24
I am a moderator on Code Review and I have flagged this question for migration to Code Review. All we can do is to wait to see if the Stack Overflow moderators will agree with that. — Simon Forsberg, Oct 07 '15 at 09:38
I am getting a NA and the `floor` function in the second line of `compute_msd` throws an exception when trying to convert to `int`. ( numpy 1.9.2, Py2.7.10, OSX) Anyone else? — rll, Oct 07 '15 at 10:09
It's working for me with numpy 1.9.3, pandas 0.16.2 and python 3.4 on Ubuntu... — hadim, Oct 07 '15 at 10:25
@tll - The code above requires true division in `dt = max_time / N` so that's why it doesn't work on Python 2.7. — , Oct 07 '15 at 10:47

score 5 · Answer 1 · 2015-10-07T13:11:21.793

It did some profiling line by line and it appears that pandas is making this slow. This pure numpy version is about 14x faster:

def compute_msd_np(xy, t, t_step):
    shifts = np.floor(t / t_step).astype(np.int)
    msds = np.zeros(shifts.size)
    msds_std = np.zeros(shifts.size)

    for i, shift in enumerate(shifts):
        diffs = xy[:-shift if shift else None] - xy[shift:]
        sqdist = np.square(diffs).sum(axis=1)
        msds[i] = sqdist.mean()
        msds_std[i] = sqdist.std(ddof=1)

    msds = pd.DataFrame({'msds': msds, 'tau': t, 'msds_std': msds_std})
    return msds

score 3 · Answer 2 · answered Oct 07 '15 at 13:49

Adding to moarningsun answer above:

you can speed up using numexpr

if you plot the MSD in log scale anyway, you don't need to compute it for every time

import numpy as np
import numexpr

def logSpaced(L, pointsPerDecade=15):
    """Generate an array of log spaced integers smaller than L"""
    nbdecades = np.log10(L)
    return np.unique(np.logspace(
        start=0, stop=nbdecades, 
        num=nbdecades * pointsPerDecade, 
        base=10, endpoint=False
        ).astype(int))

def compute_msd(xy, pointsPerDecade=15):
    dts = logSpaced(len(xy), pointsPerDecade)
    msd = np.zeros(len(idts))
    msd_std = np.zeros(len(idts))
    for i, dt in enumerate(dts):
        sqdist = numexpr.evaluate(
            '(a-b)**2',
            {'a': xy[:-dt], 'b':xy[dt:]}
            ).sum(axis=-1)
        msd[i] = sqdist.mean()
        msd_std[i] = sqdist.std(ddof=1)
    msds = pd.DataFrame({'msds': msd, 'tau': dt, 'msds_std': msd_std})
    return msds

Thank you. Did you compare the speed of the numexpr version against the one of moarningsun ? — hadim, Oct 07 '15 at 14:32

score 2 · Accepted Answer · edited May 23 '17 at 12:02

2

The MSD calculations mentioned so far are all O(N**2) where N is the number of time steps. Using the FFT this can be reduced to O(N*log(N)). See this question and answer for an explanation and an implementation in python.

EDIT: A small benchmark (I also added this benchmark to this answer): Generate a trajectory with

r = np.cumsum(np.random.choice([-1., 0., 1.], size=(N, 3)), axis=0)

For N=100.000, we get

$ %timeit msd_straight_forward(r)
1 loops, best of 3: 2min 1s per loop

$ %timeit msd_fft(r)
10 loops, best of 3: 253 ms per loop

edited May 23 '17 at 12:02

Community

1
1

answered Dec 11 '15 at 14:10

thomasfermi

1,183
1
10
20

I'm glad if it helps somebody :) – thomasfermi Dec 11 '15 at 18:41

score 1 · Answer 4 · answered Oct 21 '15 at 10:21

With the comments I designed this function :

def get_msd(traj, dt, with_nan=True):

    shifts = np.arange(1, len(traj), dtype='int')
    msd = np.empty((len(shifts), 2), dtype='float')
    msd[:] = np.nan

    msd[:, 1] = shifts * dt

    for i, shift in enumerate(shifts):
        diffs = traj[:-shift] - traj[shift:]
        if with_nan:
            diffs = diffs[~np.isnan(diffs).any(axis=1)]
        diffs = np.square(diffs).sum(axis=1)

        if len(diffs) > 0:
            msd[i, 0] = np.mean(diffs)

    msd = pd.DataFrame(msd)
    msd.columns = ["msd", "delay"]

    msd.set_index('delay', drop=True, inplace=True)
    msd.dropna(inplace=True)

    return msd

With the following features :

It takes numpy array as trajectory input.
It returns a pandas.DataFrame with almost no overlay.
with_nan allows to deal with trajectory containing NaN values but it adds a big overhead (more than 100%) so I put it as a function parameter.
It can deal with multi dimensional trajectories (1D, 2D, 3D, etc)

Some profiling :

$ print(traj.shape)
(2108, 2)

$ %timeit get_msd(traj, with_nan=True, dt=0.1)
10 loops, best of 3: 143 ms per loop

$ %timeit get_msd(traj, with_nan=False, dt=0.1)
10 loops, best of 3: 68 ms per loop

score 0 · Answer 5 · edited Apr 13 '17 at 16:01

0

Maybe is not the topic, however MSD has to be calculated not be the mean like in line 37:

msds[i] = sqdist.mean()

Taking as mean=N

You must divide by:

msds[i] = sqdist/N-1 // for lag1

Then:

msds[i] = sqdist/N-2 // for lag2 .... msds[i] = sqdist/N-n // for lag n

And so on.

As a result you don't get standard deviation, just the MSD for a single trajectory

edited Apr 13 '17 at 16:01

Caleb Kleveter

11,170
8
62
92

answered Apr 13 '17 at 15:11

Jonathan Pacheco

531
1
6
16

Speedup MSD calculation in Python

5 Answers5

Linked