I have two data sets and can see that they are correlated, but the line of best fit is strongly influenced by a few denser regions in the scatter plot. So I decided to use matplotlib.pyplot.hist2d for 2D binning. Now I would like to check whether this improves the estimate of the correlation, i.e. whether the line of best fit represents the actual correlation without being dominated by the bin counts.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

num_samples = 400

# The desired mean values of the sample.
mu = np.array([5.0, 0.0, 10.0])

# The desired covariance matrix.
r = np.array([
        [  3.40, -2.75, -2.00],
        [ -2.75,  5.50,  1.50],
        [ -2.00,  1.50,  1.25]
    ])

# Generate the random samples.
rng = np.random.default_rng()
y = rng.multivariate_normal(mu, r, size=num_samples)

plt.subplot(111)
plt.plot(y[:,1], y[:,2], 'b.', alpha=0.25)
plt.plot(mu[1], mu[2], 'ro', ms=3.5)
plt.xlabel('x')
plt.ylabel('y')
plt.axis('equal')
plt.grid(True)

plt.show()


# Plot the 2D histogram (2D binning) and keep the returned counts and bin edges.
counts, xedges, yedges, im = plt.hist2d(y[:,1], y[:,2], bins=30, cmap='rainbow')  #, cmin=5, cmax=33)
cb = plt.colorbar()
cb.set_label('counts in bin')
plt.show()

##showing the pearson correlation values for the fit
pccSi_raw = stats.pearsonr(y[:,1], y[:,2])
# pccSi_afterbinning = stats.pearsonr(xedges, yedges)
print("Python's Pearson Correlation Coefficient for the raw data: " + str(pccSi_raw))
# print("Python's Pearson Correlation Coefficient for the binned data: " + str(pccSi_afterbinning))

The commented-out lines above show what I tried (passing the bin edges to pearsonr, which is not the correct approach anyway). What I want to know is how to get two 1D arrays out of the binning so that I can do the fit on them. Thank you for your input.
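To make clear what I mean by "two 1D arrays after binning", here is a minimal sketch of one possible approach, not necessarily the right one. It assumes that each occupied bin should count once, represented by its bin centre, so that dense regions no longer dominate. The helper name binned_xy and its min_count parameter are hypothetical names used only for illustration. np.histogram2d returns the same counts and edges as plt.hist2d, without the plot.

```python
import numpy as np
from scipy import stats

def binned_xy(x, y, bins=30, min_count=1):
    """Return bin-centre coordinates and counts of the occupied 2D bins.

    Hypothetical helper: every bin with at least `min_count` points
    contributes exactly one (x, y) pair, regardless of how full it is.
    """
    counts, xedges, yedges = np.histogram2d(x, y, bins=bins)
    # Bin centres are the midpoints of consecutive edges.
    xcenters = 0.5 * (xedges[:-1] + xedges[1:])
    ycenters = 0.5 * (yedges[:-1] + yedges[1:])
    # counts[i, j] belongs to x bin i and y bin j, hence indexing="ij".
    xx, yy = np.meshgrid(xcenters, ycenters, indexing="ij")
    mask = counts >= min_count
    return xx[mask], yy[mask], counts[mask]

# Synthetic data with the same x/y covariance as in my example above.
rng = np.random.default_rng(0)
cov = np.array([[5.50, 1.50],
                [1.50, 1.25]])
data = rng.multivariate_normal([0.0, 10.0], cov, size=400)

xb, yb, w = binned_xy(data[:, 0], data[:, 1], bins=30)

# Pearson r on the raw points versus on the occupied bin centres.
r_raw = stats.pearsonr(data[:, 0], data[:, 1])[0]
r_binned = stats.pearsonr(xb, yb)[0]

# Straight-line fit through the occupied bin centres (unweighted).
slope, intercept = np.polyfit(xb, yb, 1)

print("r (raw data):    ", r_raw)
print("r (bin centres): ", r_binned)
print("fit: y = %.3f * x + %.3f" % (slope, intercept))
```

The returned counts (w here) are kept so that a weighted fit is still possible, but weighting by them would bring back the influence of the dense regions. Whether the unweighted bin centres are a statistically sound basis for the correlation is exactly the part I am unsure about.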

Vara