11

I'm trying to make an interactive program which primarily uses matplotlib to make scatter plots of rather a lot of points (10k-100k or so). Right now it works, but changes take too long to render. Small numbers of points are ok, but once the number rises things get frustrating in a hurry. So, I'm working on ways to speed up scatter, but I'm not having much luck

There's the obvious way to do thing (the way it's implemented now) (I realize the plot redraws without updating. I didn't want to alter the fps result with large calls to random).

import matplotlib.pyplot as plt
import numpy as np
import matplotlib as mpl
import time


X = np.random.randn(10000)  #x pos
Y = np.random.randn(10000)  #y pos
C = np.random.random(10000) #will be color
S = (1+np.random.randn(10000)**2)*3 #size

#build the colors from a color map
colors = mpl.cm.jet(C)
#there are easier ways to do static alpha, but this allows 
#per point alpha later on.
colors[:,3] = 0.1

fig, ax = plt.subplots()

fig.show()
background = fig.canvas.copy_from_bbox(ax.bbox)

#this makes the base collection
coll = ax.scatter(X,Y,facecolor=colors, s=S, edgecolor='None',marker='D')

fig.canvas.draw()

sTime = time.time()
for i in range(10):
    print i
    #don't change anything, but redraw the plot
    ax.cla()
    coll = ax.scatter(X,Y,facecolor=colors, s=S, edgecolor='None',marker='D')
    fig.canvas.draw()
print '%2.1f FPS'%( (time.time()-sTime)/10 )

Which gives a speedy 0.7 fps

Alternatively, I can edit the collection returned by scatter. For that, I can change color and position, but don't know how to change the size of each point. That would I think look something like this

import matplotlib.pyplot as plt
import numpy as np
import matplotlib as mpl
import time


X = np.random.randn(10000)  #x pos
Y = np.random.randn(10000)  #y pos
C = np.random.random(10000) #will be color
S = (1+np.random.randn(10000)**2)*3 #size

#build the colors from a color map
colors = mpl.cm.jet(C)
#there are easier ways to do static alpha, but this allows 
#per point alpha later on.
colors[:,3] = 0.1

fig, ax = plt.subplots()

fig.show()
background = fig.canvas.copy_from_bbox(ax.bbox)

#this makes the base collection
coll = ax.scatter(X,Y,facecolor=colors, s=S, edgecolor='None', marker='D')

fig.canvas.draw()

sTime = time.time()
for i in range(10):
    print i
    #don't change anything, but redraw the plot
    coll.set_facecolors(colors)
    coll.set_offsets( np.array([X,Y]).T )
    #for starters lets not change anything!
    fig.canvas.restore_region(background)
    ax.draw_artist(coll)
    fig.canvas.blit(ax.bbox)
print '%2.1f FPS'%( (time.time()-sTime)/10 )

This results in a slower 0.7 fps. I wanted to try using CircleCollection or RegularPolygonCollection, as this would allow me to change the sizes easily, and I don't care about changing the marker. But, I can't get either to draw so I have no idea if they'd be faster. So, at this point I'm looking for ideas.

george
  • 161
  • 1
  • 6

2 Answers2

8

I've been through this a few times trying to speed up scatter plots with large numbers of points, variously trying:

  • Different marker types
  • Limiting colours
  • Cutting down the dataset
  • Using a heatmap / grid instead of a scatter plot

And none of these things worked. Matplotlib is just not very performant when it comes to scatter plots. My only recommendation is to use a different plotting library, though I haven't personally found one that was suitable. I know this doesn't help much, but it may save you some hours of fruitless tinkering.

John Lyon
  • 11,180
  • 4
  • 36
  • 44
  • 2
    I was really hoping that wouldn't be the answer, matplotlib is extremely convenient. Any chance you can mention some of the non-suitable matplotlib replacements you've tried so I can not spend time finding out they won't work? Right now top on my list of things to try out is chaco. – george Aug 12 '13 at 05:57
  • I only futzed with a couple, but we kept falling back to matplotlib as it's the most convenient and well-supported. My next port of call would be rpy2 if I need to do fast stuff like your question - R is designed for big data and one would assume their plots are fairly snappy: http://rpy.sourceforge.net/rpy2/doc-2.2/html/graphics.html – John Lyon Aug 12 '13 at 06:12
  • I'd recommend the last 2 options. If you just want a nice visual presentation of some sample, there's no point plotting the entire thing. A subsample scatter plot should usually be OK. Alternatively you can bin the sample and display some coarse-grained (and also possibly smoothed, depending on your need) image, changing color/intensity to match the value in each bin (or "pixel"). It allows you to keep `matplotlib` without subjecting it to too large a problem for it to tackle. – Cong Ma Aug 03 '15 at 15:34
7

We are actively working on performance for large matplotlib scatter plots. I'd encourage you to get involved in the conversation (http://matplotlib.1069221.n5.nabble.com/mpl-1-2-1-Speedup-code-by-removing-startswith-calls-and-some-for-loops-td41767.html) and, even better, test out the pull request that has been submitted to make life much better for a similar case (https://github.com/matplotlib/matplotlib/pull/2156).

HTH

pelson
  • 21,252
  • 4
  • 92
  • 99