1

I'm trying to figure how to calculate the pearson correlation coefficient using sql. Here is the formula I'm using: enter image description here and here is the table I'm using: enter image description here

This is what I have so far for a query but it's giving me this message: Invalid use of group function

select first_id, second_id, movie_id, first_score, second_score,  count(*) as n, 
sum((first_score-avg(first_score))*(second_score-avg(second_score)))/
(
sqrt(sum(first_score-avg(first_score)))*
sqrt(sum(second_score-avg(second_score))))
as pearson
from connections
group by second_id

Thanks for helping

Connor Pearson
  • 63,902
  • 28
  • 145
  • 142
  • Just looking at it, you can't have a sum or average inside a sum or average. Could you post the SQL on SQLfiddle.com? – Brian Hoover Dec 22 '13 at 04:01

2 Answers2

2

Here is a query that does the calculation in the formula:

select sum((first_score - avg_first_score)*(second_score - avg_second_score)) /
       (sqrt(sum(pow((first_score - avg_first_score), 2)))*
        sqrt(sum(pow((second_score - avg_second_score), 2)))
       ) as r      
from connections c cross join
     (select avg(first_score) as avg_first_score, avg(second_score) as avg_second_score
      from connections
     ) const;

There are numerous issues with your attempt. This precalculates the average values for the two scores. It then applies the formula pretty much as written.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
0

From a purely syntactic perspective, you've got a problem with your group by clause. It should list every non-aggregated column to work properly. It should be:

group by first_id, second_id, movie_id, first_score, second_score
Bohemian
  • 412,405
  • 93
  • 575
  • 722