
I have a dataset containing two columns, X and Y. Column Y is binary, with values 0 and 1. There is also a reference range for column X, (150, 400), which represents the standard results. Which statistical test should I use to find out whether values of X outside this range affect the value of Y?

For now I have only this little part of an R script, and I have computed the proportions.

df <- data.frame(
  X = data$plt,
  Y = data$pe
)

outside <- subset(df, X < 150 | X > 400)
inside <- subset(df, X >= 150 & X <= 400)

prop.outside <- sum(outside$Y == 1) / nrow(outside)
prop.inside <- sum(inside$Y == 1) / nrow(inside)

I don't know what the next steps are.
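Would a two-sample proportion test be a reasonable next step to compare these two rates? Something like this (untested sketch, using the data frames built above):

```r
# Compare P(Y = 1) outside vs inside the reference range
prop.test(c(sum(outside$Y == 1), sum(inside$Y == 1)),
          c(nrow(outside), nrow(inside)))
```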

kaniosx
  • The Point-Biserial Correlation Coefficient is a correlation measure of the strength of association between a continuous-level variable (ratio or interval data) and a binary variable. https://stats.stackexchange.com/questions/102778/correlations-between-continuous-and-categorical-nominal-variables Or logistic regression? – Marco Jan 26 '23 at 10:24
  • A regression on Y ~ X_condition. Then plotting the results for some more insights than just inference. – Yacine Hajji Jan 26 '23 at 10:28
  • 2
    Greetings! Usually it is helpful to provide a minimally reproducible dataset for questions here so people can troubleshoot your problems (rather than a table or screenshot for example). One way of doing is by using the `dput` function on the data or a subset of the data you are using, then pasting the output into your question. You can find out how to use it here: https://youtu.be/3EID3P1oisg – Shawn Hemelstrand Jan 26 '23 at 10:29
  • 1
    Further, this is not a question about programming but statistics. As Marco mentions, you could do a logistic regression of the form `df$inside <- df$x >= 150 & df$x <= 400; fit <- glm(Y ~ x + x:inside, data = df, family = 'binomial'); drop1(fit, test = 'Rao')`. If `x:inside` is significant (usually standard p value < 0.05), then there is a different slope and thus an effect. You could also test `Y ~ x*inside` against the regression `Y ~ x`. This being significant could indicate two different regressions, one per interval, with slope and intercept differing. – Oliver Jan 26 '23 at 10:35
  • Thank you for all of your answers. I will try the AUC approach and Marco's answer. – kaniosx Jan 26 '23 at 11:44

1 Answer


Here is some minimal example data. Please tell us more about the background of your data: in general, having more or less data will change the relationship you see. As suggested in the comments, a logistic regression looks like this with more or less data:

library(tidyverse)
cars <- as_tibble(mtcars[,c("vs", "mpg")])

outside <- subset(cars, mpg < 17 | mpg > 23)
inside <- subset(cars, mpg >= 17 & mpg <= 23)

ggplot(cars, aes(x = mpg, y=vs)) + 
  geom_point(size=2) +
  geom_smooth(method = "glm", 
              method.args = list(family = "binomial"), 
              se = FALSE, colour="black") +
  geom_point(data=inside, aes(x = mpg, y=vs), size=5, col="blue") +
  geom_smooth(data = inside,
              method = "glm", 
              method.args = list(family = "binomial"), 
              se = FALSE, colour="blue")  +
  geom_point(data=outside, aes(x = mpg, y=vs), size=3, col="red")+
  geom_smooth(data = outside,
              method = "glm", 
              method.args = list(family = "binomial"), 
              se = FALSE, colour="red") 

(plot: fitted logistic curves for the full sample in black, the inside range in blue, and the outside range in red)

Do you consider the outside values outliers? Although we see a curve for each dataset (total, inside, outside), the relationship is not always statistically significant. Here are the logistic regressions on the samples:

model_inside <- glm(vs ~ mpg, family = binomial(link = "logit"), data = inside)
model_outside <- glm(vs ~ mpg, family = binomial(link = "logit"), data = outside)
model_complete <- glm(vs ~ mpg, family = binomial(link = "logit"), data = cars)

library(stargazer)
stargazer(model_inside, model_outside, model_complete, type = "text")

===============================================
                       Dependent variable:     
                  -----------------------------
                               vs              
                     (1)       (2)       (3)   
-----------------------------------------------
mpg                 0.403     0.614   0.430*** 
                   (0.343)   (0.418)   (0.158) 
                                               
Constant           -7.791    -15.058  -8.833***
                   (6.879)  (10.782)   (3.162) 
                                               
-----------------------------------------------
Observations         14        18        32    
Log Likelihood     -8.799    -2.240    -12.767 
Akaike Inf. Crit.  21.597     8.479    29.533  
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01

In each subsample (inside/outside) there is no significant relationship.
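To test formally whether being inside the range changes the relationship, you can follow the suggestion from the comments and compare a model with an interaction term against one without. A sketch on the same mtcars example (`inside_range` is a flag I add here; it is not part of the original code):

```r
# Flag observations inside the reference range
cars$inside_range <- cars$mpg >= 17 & cars$mpg <= 23

# Full model: slope and intercept may differ by range membership
fit_full <- glm(vs ~ mpg * inside_range, data = cars, family = binomial)
# Reduced model: one common relationship
fit_reduced <- glm(vs ~ mpg, data = cars, family = binomial)

# Likelihood-ratio test: is the interaction model significantly better?
anova(fit_reduced, fit_full, test = "LRT")
```

A significant result would suggest that the mpg–vs relationship differs between the two ranges.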

Is there a relationship between X and Y in the outside range?

cor.test(outside$mpg,outside$vs)

Pearson's product-moment correlation

data:  outside$mpg and outside$vs
t = 7.7269, df = 16, p-value = 8.677e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7195044 0.9578132
sample estimates:
      cor 
0.8880613 

For this test data, yes.
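If the question is simply whether the event rate differs between the inside and outside ranges (as in your proportion code), you can also tabulate range membership against the outcome and run an exact test. A sketch on the same example data; with cell counts this small, Fisher's exact test is usually preferred over a chi-squared test:

```r
# 2x2 table: range membership vs binary outcome
cars$inside_range <- cars$mpg >= 17 & cars$mpg <= 23
tab <- table(inside = cars$inside_range, vs = cars$vs)
fisher.test(tab)
```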

Marco