point biserial and p-value

Question

I am trying to get a point biserial correlation between a continuous vocabulary score and syntactic productivity (dichotomous: productive vs not_productive).

I tried both the ltm packages

> biserial.cor (lol$voc1_tvl, lol$synt, use = c("complete.obs"))

and the polycor package

> polyserial( lol$voc1_tvl, lol$synt, ML = FALSE, control = list(), std.err = FALSE, maxcor=.9999, bins=4)

The problem is that neither test gives me a p-value

How could I run a point biserial correlation test and get the associated p-value or alternatively calculate the p-value myself?

Friendly note: when trying to put quote into a question or answer, either add 4 spaces before the line (vs the '>' symbol) or select all the code you want and press the '{}' button to automatically do so. Also, making a reproducible example will increase your chances of getting a question answered. — cgage1, Mar 09 '16 at 01:16

score 3 · Answer 1 · answered Apr 17 '17 at 09:49

Since the point biserial correlation is just a particular case of the popular Peason's product-moment coefficient, you can use cor.test to approximate (more on that later) the correlation between a continuous X and a dichotomous Y. For example, given the following data:

set.seed(23049)
x <- rnorm(1e3)
y <- sample(0:1, 1e3, replace = TRUE)

Running cor.test(x, y) will give you the information you want.

    Pearson's product-moment correlation

data:  x and y
t = -1.1971, df = 998, p-value = 0.2316
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.09962497  0.02418410
sample estimates:
        cor 
-0.03786575

As an indication of the similarity between the coefficients, notice how the calculated correlation of -0.03786575 is similar to what ltm::biserial.cor gives you:

> library(ltm)
> biserial.cor(x, y, level = 2)
[1] -0.03784681

The diference lies on the fact that biserial.cor is calculated on the population, with standard deviations being divided by n, where cor and cor.test calculate standard deviations for a sample, dividing by n - 1.

As cgage noted, you can also use the polyserial() function, which in my example would yield

> polyserial(x, y, std.err = TRUE)

Polyserial Correlation, 2-step est. = -0.04748 (0.03956)
Test of bivariate normality: Chisquare = 1.891, df = 5, p = 0.864

Here, I believe the difference in the calculated correlation (-0.04748) is due to polyserial using an optimization algorithm to approximate the calculation (which is unnecessary unless Y has more than two levels).

score 2 · Answer 2 · answered Mar 09 '16 at 01:08

2

Using the ggplot2 dataset mpg as a reproducible example:

library(ggplot2)
# Use class as dichotomous variable (must subset)
newData = subset(mpg, class == 'midsize' | class == 'compact')

# Now getting p-value
library(ltm)
polyserial(newData$cty,newData$class, std.err = T)

You will see all the output you desire using std.err=T in polyserial

answered Mar 09 '16 at 01:08

cgage1

579
5
15

thank you so much for answering. So, If so understand correctly, the formula is: polyserial (dichotomous_variable, continuous_variable, std.err = T). Am I getting it right? – Pietre Mar 09 '16 at 01:50
Close, just swap the dichotomous and continuous variable. You can use `?polyserial` for more details. Good luck! – cgage1 Mar 09 '16 at 17:20

point biserial and p-value

2 Answers2