how to decide two variables are correlated

Question

Running the below command in R:

cor.test(loandata$Age,loandata$Losses.in.Thousands)

loandata is the name of the dataset
Age is the independent Variable
Losses.in.Thousands is the dependent variable

Below is the result in R:

Pearson's product-moment correlation

data:  loandata$Age and loandata$Losses.in.Thousands
t = -61.09, df = 15288, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.4556139 -0.4301315
sample estimates:
       cor 
-0.4429622

How do I decide whether Age is correlated with Losses.in.Thousands? How do I make that decision by looking at the p-value with alpha = 0.05?

Would the down voters please provide comments explaining their downvotes? Bala - Welcome to Stack Overflow. Please read [How to create a Complete, Minimal, and Verifiable Example](https://stackoverflow.com/help/mcve) and update your post. Note that since your question is about the interpretation of the correlation coefficient as opposed to how to write R code to calculate a correlation, it's probably better suited to [CrossValidated](https://stats.stackexchange.com), the Stack Exchange site for statistics questions. — Len Greski, May 06 '18 at 22:45
I didn't downvote, but I don't see much evidence of research effort. If I google "correlation test interpretation" I get a lot of useful material ... — Ben Bolker, May 07 '18 at 01:32

Len Greski · Answer 1 · 2018-05-07T11:39:03.277

As stated in the other answer, the correlation coefficient produced by cor.test() in the OP is -0.4429. The Pearson correlation coefficient is a measure of the linear association between two variables. It varies between -1.0 (perfect negative linear association) and 1.0 (perfect positive linear association). Its magnitude is the absolute value of the coefficient, that is, its distance from 0 (no linear association).
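To illustrate the "linear" part of that definition, here is a minimal sketch with made-up data (the vectors `x` and `y` are hypothetical and have nothing to do with `loandata`). `y` is completely determined by `x`, yet the Pearson coefficient is essentially zero because the relationship is not linear.

```r
# Hypothetical data: x is symmetric around 0 and y depends on x exactly.
x <- (-100:100) / 100
y <- x^2

# Pearson correlation measures linear association only, so it is ~0 here
# even though y is a deterministic function of x.
r_quadratic <- cor(x, y)
print(r_quadratic)

# An exactly linear relationship with positive slope gives a coefficient of 1.
r_linear <- cor(x, 2 * x + 1)
print(r_linear)
```

A coefficient near zero therefore rules out a linear trend, not every kind of dependence.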

The t-test indicates whether the correlation is significantly different from zero, given its magnitude relative to its standard error. In this case, the probability value for the t-test, p < 2.2e-16, indicates that we should reject the null hypothesis that the correlation is zero.
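As a rough check, the t statistic reported by cor.test() can be reproduced from the correlation and the degrees of freedom alone, using the standard relationship t = r * sqrt(df) / sqrt(1 - r^2). The sketch below plugs in the numbers from the OP's output; the variable names are only illustrative.

```r
# Values copied from the cor.test() output in the question.
r  <- -0.4429622
df <- 15288          # n - 2 for a Pearson correlation test

# t statistic for H0: rho = 0.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
print(round(t_stat, 2))

# Two-sided p-value from the t distribution with df degrees of freedom.
# It is far too small to print exactly, which is why R reports "< 2.2e-16".
p_value <- 2 * pt(-abs(t_stat), df)
print(p_value < 2.2e-16)
```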

That said, the OP question:

How to decide whether Age is correlated with Losses.in.Thousands?

has two elements: statistical significance and substantive meaning.

From the perspective of statistical significance, the t-test indicates that the correlation is non-zero. Since the standard error of a correlation shrinks as the degrees of freedom grow (roughly in proportion to 1/sqrt(df)), the very large number of degrees of freedom listed in the OP (15,288) means that a much smaller correlation would still result in a statistically significant t-test. This is why one must consider substantive significance in addition to statistical significance.
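To get a feel for how small a correlation would still pass the test at this sample size, one can invert the t formula and solve for the smallest |r| that reaches the two-sided critical value. This is a sketch under the usual assumptions of the Pearson test; `r_crit` is just an illustrative name.

```r
df    <- 15288   # degrees of freedom from the OP's output
alpha <- 0.05

# Two-sided critical value of the t distribution.
t_crit <- qt(1 - alpha / 2, df)

# Solving t = r * sqrt(df) / sqrt(1 - r^2) for r gives
# r = t / sqrt(t^2 + df): the smallest |r| that is significant at alpha.
r_crit <- t_crit / sqrt(t_crit^2 + df)
print(r_crit)
```

With this many observations the threshold comes out at roughly 0.016, so almost any non-trivial sample correlation would be flagged as statistically significant.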

From a substantive significance perspective, interpretations vary. Hemphill 2003 cites Cohen's (1988) rule of thumb for correlation magnitudes in psychology studies:

0.10 - low
0.30 - medium
0.50 - high

Hemphill goes on to conduct a meta-analysis of correlation coefficients in psychology studies, which he summarizes in the following table.

As we can see from the table, Hemphill's empirical guidelines are much less stringent than Cohen's prior recommendations.

Alternative: coefficient of determination

As an alternative, the coefficient of determination, r^2, can be used as a proportional reduction of error measure. In this case, r^2 = 0.1962, and we can interpret it as "If we know one's age, we can reduce our error in predicting losses in thousands by approximately 20%."
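The arithmetic is simply the square of the reported correlation. The sketch below also uses simulated data (the `age` and `loss` vectors are invented for illustration, not taken from `loandata`) to show that the squared Pearson correlation equals the R-squared of a simple linear regression of one variable on the other.

```r
# Squaring the correlation reported in the question.
r <- -0.4429622
r_squared <- r^2
print(round(r_squared, 4))

# Hypothetical data, only to show the link with simple linear regression.
set.seed(1)
age  <- runif(500, min = 18, max = 80)
loss <- 100 - 0.5 * age + rnorm(500, sd = 15)

r2_from_cor <- cor(age, loss)^2
r2_from_lm  <- summary(lm(loss ~ age))$r.squared

# The two quantities agree up to floating-point error.
print(all.equal(r2_from_cor, r2_from_lm))
```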

Reference: Burt Gerstman's Statistics Primer, San Jose State University.

Conclusion: Interpretation varies by domain

Given the problem domain, if the literature accepts a correlation magnitude of 0.45 as "large," then treat it as large, as is the case in many of the social sciences. In other domains, however, a much higher magnitude is required for a correlation to be considered "large."

Sometimes, even a "small" correlation is substantively meaningful as Hemphill 2003 notes in his conclusion.

For example, even though the correlation between aspirin taking and preventing a heart attack is only r=0.03 in magnitude, (see Rosenthal 1991, p. 136) -- small by most statistical standards -- this value may be socially important and nonetheless influence social policy.

Small but important distinction is that the association with a Pearson correlation is a linear association. A Pearson correlation of zero simply indicates no linear association :) — rmilletich, May 07 '18 at 04:13
This is quite a big explanation, but you mentioned this: "the probability value for the t-test, p < 2.2e-16, indicates that we should reject the null hypothesis that the correlation is zero." How can this be? If we reject the null hypothesis, we should accept the alternative, so the variables are correlated with each other, right? But you say the "correlation is zero". How is that? — Bala, May 07 '18 at 08:53
@Bala - I did not say "correlation is zero." I said that "we should reject the null hypothesis that the correlation is zero," which means that we should accept the alternate hypothesis that the correlation is *not* zero. — Len Greski, May 07 '18 at 11:52

score 0 · Answer 2 · answered May 06 '18 at 19:49

To know if the variables are correlated, the value to look at is cor = -0.4429

In your case, the values are negatively correlated; however, the magnitude of the correlation isn't very high.

A simpler, less confusing way to check whether two variables are correlated is:

cor(loandata$Age,loandata$Losses.in.Thousands)
[1] -0.4429622

byouness · Answer 3 · 2018-05-07T11:50:23.150

-1

The null hypothesis of the Pearson test is that the two variables are not correlated: H0: rho = 0

The p-value is the probability, assuming the null hypothesis is true, that the test statistic (or its absolute value for a two-tailed test) would be at least as extreme as the actually observed result (or its absolute value for a two-tailed test). You can reject the null hypothesis if the p-value is smaller than the significance level alpha (here 0.05). This is the case in your test, which means the correlation is significantly different from zero.
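In code, the decision comes down to comparing the `p.value` element of the object returned by cor.test() with the chosen alpha. The sketch below uses simulated data as a stand-in, since `loandata` is not available here; `x`, `y`, and `reject_h0` are illustrative names.

```r
# Hypothetical data standing in for Age and Losses.in.Thousands.
set.seed(42)
n <- 200
x <- rnorm(n)
y <- -0.4 * x + rnorm(n)

alpha <- 0.05
res   <- cor.test(x, y)   # Pearson test by default, two-sided

# Reject H0: rho = 0 when the p-value falls below alpha.
reject_h0 <- res$p.value < alpha
print(res$estimate)
print(res$p.value)
print(reject_h0)
```

With the real data, the same comparison would use `cor.test(loandata$Age, loandata$Losses.in.Thousands)$p.value`.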

edited May 07 '18 at 11:50

answered May 06 '18 at 22:18

not quite. Unless you're doing a Bayesian analysis, you **cannot** correctly make statements like "there is a 95% probability that the correlation value lies [within the confidence interval]". – Ben Bolker May 07 '18 at 01:30
That's what a confidence interval is for, you know the variance and distribution of your statistic, so you deduce an interval where the real value (the one we're trying to estimate) lies. Could you clarify why you think this is not correct? – byouness May 07 '18 at 08:12
@uness - a more accurate interpretation of the 95% confidence interval is, "If we take thousands of samples of size *n* from the population and measure the correlation between `Age` and `Losses in Thousands` for each, the proportion of those confidence intervals that will contain the true population correlation is 1 - alpha, or 95%." – Len Greski May 07 '18 at 11:47
Thanks for the clarification. Agreed, I will remove the second part of my answer. – byouness May 07 '18 at 11:50
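The repeated-sampling interpretation described in the comments can be checked with a small simulation. This is only a sketch: the population correlation `rho`, the sample size `n`, and the number of replications `n_sim` are arbitrary choices, and the data are simulated bivariate normal draws rather than the OP's dataset.

```r
set.seed(123)
rho   <- -0.44   # assumed "true" population correlation
n     <- 200     # size of each simulated sample
n_sim <- 2000    # number of repeated samples

covered <- replicate(n_sim, {
  # Draw a sample with population correlation rho.
  x  <- rnorm(n)
  y  <- rho * x + sqrt(1 - rho^2) * rnorm(n)
  # 95% confidence interval from cor.test() for this sample.
  ci <- cor.test(x, y)$conf.int
  # Does this interval contain the true value?
  ci[1] <= rho && rho <= ci[2]
})

# Proportion of intervals that contain rho; it should be close to 0.95.
coverage <- mean(covered)
print(coverage)
```

The 95% describes the long-run behaviour of the procedure across samples, not the probability that any single computed interval contains the true correlation.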
