
New to R, and I have two data sets -- they have the same x-axis values, but the y-values differ.

I'm trying to find the correlation between the two. When I use R's abline to draw fitted lines through the scatter plot, I get two lines of best fit that seemingly place one data set higher than the other -- but I'd really like to know the p-value for the difference between these two data sets, so I can judge the effect.

After looking it up, it seems like I should use t.test -- but I'm unsure how to run the two data sets against each other.

For example, if I run:

t.test(t1$xaxis, t1$yaxis1)
t.test(t2$xaxis, t2$yaxis2)

It gives me the right means of x and y (t1: 16.84, 88.58 and t2: 14.79, 86.14) -- but for the rest, I'm not sure:

t1: t = -43.8061, df = 105.994, p-value < 2.2e-16

t2: t = -60.1593, df = 232.742, p-value < 2.2e-16

Obviously the p-values given are microscopic, but they describe each data set individually -- I don't know how to make the test tell me about the relationship between the two data sets.

Any help is greatly appreciated -- thanks!

Ryan
  • Are you talking about a two-sample t-test? – Rich Scriven Mar 26 '14 at 02:57
  • The question doesn't make (statistical) sense. Correlation between two data sets? A p-value between two data sets? What are the two "lines of best fit"? – djas Mar 26 '14 at 02:58
  • Do you want the p-values or the correlation matrix? I'm thinking `cor` might be what you want. – Rich Scriven Mar 26 '14 at 02:59
  • @RichardScriven I believe I am -- but I only suggested using a t-test, because after researching R (self-taught, if you couldn't tell), it seemed like the closest thing. I noticed that when I ran code like `t.test(t1$xaxis,t1$yaxis)` the result was a Welch Two Sample t-test, which I posted above. – Ryan Mar 26 '14 at 03:06
  • @djas Sorry -- you might be right. Essentially, I just want to know if my alternate hypothesis (which is that the y-values of t1 are greater than the y-values of t2) has a low enough p-value to reject the null hypothesis (that there's no difference). Does that make more sense? – Ryan Mar 26 '14 at 03:08
  • @RichardScriven But if you think the `cor` function will give me more of what I'm looking for based on what I described above and to @djas, that could be very helpful as well. Thanks! – Ryan Mar 26 '14 at 03:09
  • From the meager information in your question it doesn't seem like you should use t-tests or calculate correlations. Rather, I'd suggest a regression analysis. – Roland Mar 26 '14 at 08:05
  • @Roland Is there another question/answer somewhere that could give me steps on how to do so? – Ryan Mar 26 '14 at 21:59
  • I've added an answer. – Roland Mar 27 '14 at 08:05

4 Answers


Since you asked for it, here is how I understand your problem.

You have two groups of y values corresponding to identical x values. Here I assume that the relationship between y and x is linear. If it isn't, you could transform your variables, use a non-linear model, an additive model, ...

First let's simulate some data since you don't provide any:

set.seed(42)                    # make the simulated data reproducible
x <- 1:20
y1 <- 2.5 + 3 * x + rnorm(20)   # group 1: intercept 2.5, slope 3
y2 <- 4 + 2.5 * x + rnorm(20)   # group 2: intercept 4, slope 2.5

plot(y1 ~ x, col = "blue", ylab = "y")
points(y2 ~ x, col = "red")
legend("topleft", legend = c("y1", "y2"), col = c("blue", "red"), pch = 1)

(Plot: scatter of y1 in blue and y2 in red against x, produced by the code above.)

Now, we want to know if the two samples differ. We can find out by fitting a model:

DF <- cbind(stack(cbind.data.frame(y1, y2)), x)  # long format: one y column plus a group indicator
names(DF) <- c("y", "group", "x")

fit <- lm(y ~ x * group, data = DF)  # intercept and slope both allowed to vary by group
summary(fit)

Call:
lm(formula = y ~ x * group, data = DF)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2585 -0.4603 -0.1899  0.9008  2.2127 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.51769    0.55148   6.379 2.17e-07 ***
x            2.92136    0.04604  63.457  < 2e-16 ***
groupy2      0.67218    0.77991   0.862    0.394    
x:groupy2   -0.46525    0.06511  -7.146 2.11e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.187 on 36 degrees of freedom
Multiple R-squared:  0.9949,    Adjusted R-squared:  0.9945 
F-statistic:  2333 on 3 and 36 DF,  p-value: < 2.2e-16

The intercepts are not significantly different, but the slopes are. Whether group has a significant effect overall is best tested by comparing against a model that doesn't consider group at all:

fit0 <- lm(y ~ x, data = DF)  # reduced model that ignores group
anova(fit0, fit)              # F-test: does adding group improve the fit?

Analysis of Variance Table

Model 1: y ~ x
Model 2: y ~ x * group
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1     38 300.196                                  
2     36  50.738  2    249.46 88.498 1.267e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

As you can see, the two samples differ significantly.
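
If you want the estimated slope for each group from this fit, you can combine the coefficients shown in the summary above; a short follow-up using base R's coef and confint:

coef(fit)["x"]                           # slope for group y1 (about 2.92)
coef(fit)["x"] + coef(fit)["x:groupy2"]  # slope for group y2 (about 2.92 - 0.47 = 2.46)
confint(fit)                             # 95% confidence intervals for all terms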

Roland

Did you think about merging the data sets on the x-axis values, so that your data structure becomes:

X Y1 Y2

Then you can find the correlation between whichever columns you want.
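
For example, a minimal sketch using the data frame and column names from the question with made-up values -- base R's merge joins the two tables on the shared x column, after which cor compares the y columns directly:

t1 <- data.frame(xaxis = 1:10, yaxis1 = rnorm(10, mean = 88))  # hypothetical stand-ins
t2 <- data.frame(xaxis = 1:10, yaxis2 = rnorm(10, mean = 86))

merged <- merge(t1, t2, by = "xaxis")  # one row per shared x value: xaxis, yaxis1, yaxis2
cor(merged$yaxis1, merged$yaxis2)      # correlation between the two sets of y values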

Kuber

Judging by your comments above, it looks like you are after a two-sample test of means. If so:

set.seed(1)                   # make the example replicable
y1 <- rnorm(100)              # 100 values with mean 0
y2 <- rnorm(120, mean = 0.1)  # 120 values with mean 0.1 -- the lengths need not match

results <- t.test(y1, y2)    # Welch two-sample t-test comparing the means
results$p.value
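
Since the comments above say the alternative hypothesis is that the y values of t1 are greater than those of t2, you could also make the test one-sided; a small variation on the code above:

t.test(y1, y2, alternative = "greater")$p.value  # H1: mean of first sample > mean of second
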
djas
  • Thanks! This gave me a very small p-value as well -- but that could just be a good thing. And two questions just to be clear: (1) The set.seed(1) and defining y1 and y2 was just to give yourself 100 random variables [with different means] to run the t.test, right? And (2) Would this work differently in the future if I had different x-axes and/or more/less y values in one table than another? – Ryan Mar 26 '14 at 03:29
  • yes, it works for any two samples -- in fact, I just edited the answer to make this point. set.seed() just makes this example replicable. – djas Mar 26 '14 at 03:36

You can easily find the correlation between variables with the cor function. In this case, I use a data frame first, then a matrix. We can easily see the strength of the relationships between variables.

> d <- data.frame(y1 = runif(10), y2 = rnorm(10), y3 = rexp(10))
> cor(d)
##            y1         y2         y3
## y1  1.0000000 -0.3319495 -0.4013154
## y2 -0.3319495  1.0000000  0.1370312
## y3 -0.4013154  0.1370312  1.0000000

Using a matrix,

> m <- matrix(c(runif(10), rnorm(10), rexp(10)), 10, 3)
> cor(m)
##            [,1]       [,2]      [,3]
## [1,]  1.0000000 -0.1971826 0.3622307
## [2,] -0.1971826  1.0000000 0.4973368
## [3,]  0.3622307  0.4973368 1.0000000

Please see example(cor) for more.
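
If you also want a p-value attached to a single correlation -- which the question asks about -- base R's cor.test reports one for the null hypothesis of zero correlation. For example, with the data frame d from above:

ct <- cor.test(d$y1, d$y2)  # Pearson correlation test
ct$estimate                 # the correlation coefficient
ct$p.value                  # p-value for H0: true correlation is 0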

Rich Scriven
  • Thanks! This is very helpful -- is there also a similar way to do this if I don't have the same x variable in the future? – Ryan Mar 26 '14 at 03:24
  • Sure, just leave it out. That part really isn't relevant. I'll edit. – Rich Scriven Mar 26 '14 at 03:27
  • Thanks again! One more thing -- is there a similar way to do this if there are more/less rows in one data set over another? – Ryan Mar 26 '14 at 03:34
  • Yes, you can add `NA` values to the shorter vector(s) to make them the same length, then use the `use` argument of `cor` to determine how the `NA` values are handled. – Rich Scriven Mar 26 '14 at 03:45
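
To illustrate the padding described in the last comment, a sketch with made-up vectors:

a <- rnorm(10)
b <- rnorm(8)
b <- c(b, rep(NA, length(a) - length(b)))  # pad the shorter vector with NA
cor(a, b, use = "complete.obs")            # drop incomplete pairs before computing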