43

I have to represent about 30,000 points in a scatter plot in matplotlib. These points belong to two different classes, so I want to depict them with different colors.

I succeded in doing so, but there is an issue. The points overlap in many regions and the class that I depict for last will be visualized on top of the other one, hiding it. Furthermore, with the scatter plot is not possible to show how many points lie in each region. I have also tried to make a 2d histogram with histogram2d and imshow, but it's difficult to show the points belonging to both classes in a clear way.

Can you suggest a way to make clear both the distribution of the classes and the concentration of the points?

EDIT: To be more clear, this is the link to my data file in the format "x,y,class"

papafe
  • 2,959
  • 4
  • 41
  • 72
  • Why not a histogram with two colors? Doesn't it look good enough? – Ofir Israel Sep 28 '13 at 09:44
  • @OfirIsrael I have tried to use histogram2d and imshow with alpha levels to have two overlapping histograms, but the result seems to be very poor – papafe Sep 28 '13 at 10:28
  • Have you tried showing the histograms using contour instead of alpha blending? http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.contour – grep Sep 28 '13 at 12:43
  • do not add noise to your data, that is crossing the line into immoral data manipulation. – tacaswell Sep 28 '13 at 17:52
  • 3
    @tcaswell It is crossing the line into immoral data manipulation if and only if you hide it. – Pietro Battiston Jun 16 '16 at 11:25

3 Answers3

48

One approach is to plot the data as a scatter plot with a low alpha, so you can see the individual points as well as a rough measure of density. (The downside to this is that the approach has a limited range of overlap it can show -- i.e., a maximum density of about 1/alpha.)

Here's an example:

enter image description here

As you can imagine, because of the limited range of overlaps that can be expressed, there's a tradeoff between visibility of the individual points and the expression of amount of overlap (and the size of the marker, plot, etc).

import numpy as np
import matplotlib.pyplot as plt

N = 10000
mean = [0, 0]
cov = [[2, 2], [0, 2]]
x,y = np.random.multivariate_normal(mean, cov, N).T

plt.scatter(x, y, s=70, alpha=0.03)
plt.ylim((-5, 5))
plt.xlim((-5, 5))
plt.show()

(I'm assuming here you meant 30e3 points, not 30e6. For 30e6, I think some type of averaged density plot would be necessary.)

tom10
  • 67,082
  • 10
  • 127
  • 137
41

You could also colour the points by first computing a kernel density estimate of the distribution of the scatter, and using the density values to specify a colour for each point of the scatter. To modify the code in the earlier example :

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde as kde
from matplotlib.colors import Normalize
from matplotlib import cm

N = 10000
mean = [0,0]
cov = [[2,2],[0,2]]

samples = np.random.multivariate_normal(mean,cov,N).T
densObj = kde( samples )

def makeColours( vals ):
    colours = np.zeros( (len(vals),3) )
    norm = Normalize( vmin=vals.min(), vmax=vals.max() )

    #Can put any colormap you like here.
    colours = [cm.ScalarMappable( norm=norm, cmap='jet').to_rgba( val ) for val in vals]

    return colours

 colours = makeColours( densObj.evaluate( samples ) )

 plt.scatter( samples[0], samples[1], color=colours )
 plt.show()

Scatter plot with density information

I learnt this trick a while ago when I noticed the documentation of the scatter function --

c : color or sequence of color, optional, default : 'b'

c can be a single color format string, or a sequence of color specifications of length N, or a sequence of N numbers to be mapped to colors using the cmap and norm specified via kwargs (see below). Note that c should not be a single numeric RGB or RGBA sequence because that is indistinguishable from an array of values to be colormapped. c can be a 2-D array in which the rows are RGB or RGBA, however, including the case of a single row to specify the same color for all points.

vishakad
  • 567
  • 1
  • 6
  • 7
  • 4
    This is a stunning solution to one of the most often encountered problems in plotting large datasets. Excellent work! – Stephen McAteer May 01 '18 at 01:36
  • Is there any way to add a colorbar to the above figure to indicate the density of each color? – Zack Eriksen Apr 29 '21 at 02:38
  • You can optimize this solution by simply using the `cmap` kwarg of the scatter method (i.e. `plt.scatter(samples[0], samples[1], c=densObj.evaluate(samples), cmap="jet")` with no need for the extra function. – jacob Aug 20 '21 at 21:32
  • @ZackEriksen try `plt.colorbar()` – jacob Aug 20 '21 at 21:33
4

My answer may not perfectly answer your question, however, I too tried to plot overlapping points, but mine were perfectly overlapped. I therefore came up with this function in order to offset identical points.

import numpy as np

def dodge_points(points, component_index, offset):
    """Dodge every point by a multiplicative offset (multiplier is based on frequency of appearance)

    Args:
        points (array-like (2D)): Array containing the points
        component_index (int): Index / column on which the offset will be applied 
        offset (float): Offset amount. Effective offset for each point is `index of appearance` * offset

    Returns:
        array-like (2D): Dodged points
    """

    # Extract uniques points so we can map an offset for each
    uniques, inv, counts = np.unique(
        points, return_inverse=True, return_counts=True, axis=0
    )

    for i, num_identical in enumerate(counts):
        # Prepare dodge values
        dodge_values = np.array([offset * i for i in range(num_identical)])
        # Find where the dodge values must be applied, in order
        points_loc = np.where(inv == i)[0]
        #Apply the dodge values
        points[points_loc, component_index] += dodge_values

    return points

Here is an example of before and after.

Before:

Before dodge

After:

After Dodge

This method only works for EXACTLY overlapping points (or if you are willing to round points off in a way that np.unique finds matching points).

Quoding
  • 176
  • 8