
I currently have a problem plotting a huge amount of X,Y data in a scatter chart using Plotly and Python: the browser can't render this many points without crashing after some time. (I've also tried the Scattergl option: https://plot.ly/python/webgl-vs-svg/)

Is there an algorithm to reduce this huge number of points without losing the original shape of the scatter chart? Maybe something like the iterative end-point fit algorithm?

EDIT:

Some code:

import plotly.graph_objs as go
from plotly.offline import plot

import numpy as np

N = 1000000
trace = go.Scattergl(
    x = np.random.randn(N),
    y = np.random.randn(N),
    mode = 'markers',
    marker = dict(
        line = dict(
            width = 1,
            color = '#404040')
    )
)
data = [trace]

layout = go.Layout(title='A Simple Plot', width=1000, height=350)

fig = go.Figure(data=data, layout=layout)

plot(fig)

3 Answers


One way would be to randomly sample from the scatter points. As long as you sample enough points, it is extremely likely that the downsampled chart will have a similar shape.

For example, to randomly sample 10,000 of the 1 million points, you could use:

i_plot = np.random.choice(N, size=10000, replace=False)
trace = go.Scattergl(
    x = np.random.randn(N)[i_plot],
    y = np.random.randn(N)[i_plot],
    mode = 'markers',
    marker = dict(
        line = dict(
            width = 1,
            color = '#404040')
    )
)

This snippet might look silly, but in reality you'll have actual arrays instead of np.random.randn(N), so it will make sense to randomly sample from those arrays.
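
For instance, if your real data already lives in two NumPy arrays, called x and y here purely as placeholder names, the sampling step would look like:

i_plot = np.random.choice(len(x), size=10000, replace=False)
trace = go.Scattergl(
    x = x[i_plot],
    y = y[i_plot],
    mode = 'markers'
)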

You'll want to test different sample sizes, probably increasing the sample up to the maximum number of points the engine can handle without lagging or crashing.

Jeremy McGibbon

If you are just trying to visualize the regions where the data points exist, it might be more effective to convert the x-y data into a grid of densities. This may be better than a scatter plot because, with a very large number of points, the points obscure each other, so you really have no idea how many points there are in certain areas.

I'm not familiar with plotly (I use matplotlib.pyplot) but I see there is at least one way to do this.
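
As a rough sketch of the idea (not necessarily the built-in Plotly feature, just one way to do it with NumPy): precompute a 2D histogram and plot only that small grid of counts, so the browser never sees the million raw points.

import numpy as np
import plotly.graph_objs as go
from plotly.offline import plot

N = 1000000
x = np.random.randn(N)
y = np.random.randn(N)

# bin the points into a 200x150 grid of counts; only this small grid
# is handed to the browser, not the million raw points
counts, xedges, yedges = np.histogram2d(x, y, bins=(200, 150))

trace = go.Heatmap(
    x = 0.5 * (xedges[:-1] + xedges[1:]),   # bin centres
    y = 0.5 * (yedges[:-1] + yedges[1:]),
    z = counts.T                            # Heatmap expects z[y][x]
)

layout = go.Layout(title='Point density', width=1000, height=350)
plot(go.Figure(data=[trace], layout=layout))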

Bill

You should try the Datashader package (http://datashader.readthedocs.io/en/latest/), which focuses on exactly this: transforming a huge number of data points into something more amenable to visualization. The authors also explain why their approach might be better than a simple heatmap: https://anaconda.org/jbednar/plotting_pitfalls/notebook
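
A rough sketch of the typical Datashader workflow (the exact API may differ between versions): it rasterizes the points onto a fixed-size canvas, so the number of input points no longer affects rendering.

import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

N = 1000000
df = pd.DataFrame({'x': np.random.randn(N), 'y': np.random.randn(N)})

# rasterize: each pixel of the canvas gets the count of points falling into it
canvas = ds.Canvas(plot_width=1000, plot_height=350)
agg = canvas.points(df, 'x', 'y')

# shade the counts into an image; 'log' scaling keeps sparse regions visible
img = tf.shade(agg, how='log')
img.to_pil().save('scatter_density.png')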

lomereiter