I am calculating the Pearson correlation. At the end I get a result (correlation1) like the one below. Why is the second value 0.0 in every tuple of correlation1? Can anybody explain? Also, my correlation code is slow. How can I make it faster?

Result (sample):
(0.52543523179249552, 0.0), (0.52543905756911169, 0.0), (0.52544196572206603, 0.0), (0.52545010637443945, 0.0)...

from scipy.stats import pearsonr

s1_list = []
s2_list = []
s3_list = []
s4_list = []

zip_list1 = []
zip_list2 = []

correlation1 = []
for x, y in zip(speed1_list, speed2_list):
    zip1 = {"s1": float(x), "s2": float(y)}
    s1_list.append(zip1["s1"])
    s2_list.append(zip1["s2"])
    zip_list1.append(zip1)
    correlation1.append(pearsonr(s1_list,s2_list))

print(correlation1)

Inputs:

speed1_list: [113.0, 116.0, 120.0, 120.0, 117.0, 127.0, 124.0, 118.0, 124.0, 128.0, 128.0, 125.0, 112.0, 122.0, 125.0, 133.0, 128.0, 129.0, 126.0, 123.0, 120.0, 118.0, 114.0, 119.0, 129.0, 127.0, 128.0, 122.0, 120.0, 125.0, 119.0...]

speed2_list: [125.0, 123.0, 120.0, 115.0, 124.0, 120.0, 120.0, 119.0, 119.0, 122.0, 121.0, 116.0, 116.0, 119.0, 116.0, 113.0, 113.0, 115.0, 120.0, 122.0, 122.0, 113.0, 118.0, 121.0, 120.0, 119.0, 116.0...]

dirn
serenade

1 Answer

If you read the documentation of the pearsonr function, you will see that the second value is the p-value: roughly, the probability that uncorrelated data would produce a Pearson correlation at least as extreme as the one computed from your dataset.
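To see how the second value behaves, here is a minimal sketch that unpacks the pair returned by pearsonr for growing sample sizes. The data are synthetic and the helper name p_values_by_sample_size is made up for illustration; neither comes from the question.

```python
import numpy as np
from scipy.stats import pearsonr


def p_values_by_sample_size(sizes, seed=0):
    """Return {n: (r, p)} for synthetic data whose true correlation is about 0.5.

    Hypothetical helper: the data are generated here and are NOT the
    speed lists from the question.
    """
    rng = np.random.default_rng(seed)
    n_max = max(sizes)
    x = rng.normal(size=n_max)
    # y = x + sqrt(3) * noise has a theoretical correlation of 0.5 with x.
    y = x + np.sqrt(3.0) * rng.normal(size=n_max)

    results = {}
    for n in sizes:
        # pearsonr returns the coefficient first and the p-value second.
        r, p = pearsonr(x[:n], y[:n])
        results[n] = (float(r), float(p))
    return results


if __name__ == "__main__":
    for n, (r, p) in p_values_by_sample_size([10, 100, 1000, 100000]).items():
        print(n, r, p)
```

The coefficient should settle near 0.5 while the p-value falls sharply as more samples are used.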

If I run your code on your sample lists, only one of the p-values is 0:

correlation1 = [(nan, nan), (-1.0, 0.0), (-0.99946642948624609, 0.020797462218684917), (-0.87259228616792028, 0.12740771383207972), (-0.82714719627765909, 0.083995277603981247), (-0.58025386521762756, 0.22730335863992135), (-0.57868746304695651, 0.17345428063365897), (-0.53247171319158504, 0.17427615080621298), ...

But I guess the values you gave for correlation1 come from further along in the list, where you have so many samples that a correlation of about 0.525 is overwhelmingly significant. The p-value then becomes too small to represent as a float and is reported as 0.0.
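On the speed part of the question: the loop calls pearsonr on the whole list at every step, so the total work grows quadratically with the number of samples. If only the coefficient is needed at each step (an assumption, since this drops the p-value), it can be updated from running sums in a single pass. The function name running_pearson is made up for this sketch.

```python
import math


def running_pearson(xs, ys):
    """Return the Pearson coefficient of each prefix of the paired data.

    Entries are nan where the coefficient is undefined (a single sample,
    or one series constant so far). Only r is computed, not the p-value.
    """
    n = 0
    sum_x = sum_y = sum_xx = sum_yy = sum_xy = 0.0
    coefficients = []
    for x, y in zip(xs, ys):
        x, y = float(x), float(y)
        n += 1
        sum_x += x
        sum_y += y
        sum_xx += x * x
        sum_yy += y * y
        sum_xy += x * y

        # n^2 times the covariance and variances; the n^2 factors cancel in r.
        cov = n * sum_xy - sum_x * sum_y
        var_x = n * sum_xx - sum_x * sum_x
        var_y = n * sum_yy - sum_y * sum_y

        if var_x <= 0.0 or var_y <= 0.0:
            coefficients.append(float("nan"))
        else:
            coefficients.append(cov / math.sqrt(var_x * var_y))
    return coefficients


if __name__ == "__main__":
    # First values of speed1_list and speed2_list from the question.
    print(running_pearson([113.0, 116.0, 120.0], [125.0, 123.0, 120.0]))
```

The sum-of-squares form can lose precision when values are very large relative to their spread; for speeds in the hundreds this should not matter, but a Welford-style update would be the more careful choice.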

Math
  • Thank you. Then how can I plot the correlation result together with the raw data? For instance, I am plotting the raw data with scatter(speed1_list, speed2_list, marker='.', color='pink'), and I want to add a correlation plot on top of the raw data plot. Can you help me? Thanks. @Math – serenade Feb 18 '16 at 14:02
  • I don't get what you want to do. Correlation is between -1 and 1, while your data range in the hundreds, so you won't see the variations of the correlation on that plot. What you could do is plot correlation vs. index if you want to show some convergence, but on its own plot. – Math Feb 18 '16 at 14:45