Run a regression by each unique id in a column, with multiple observations for each unique ID

Question

I have a data frame with multiple observations for each unique row id (subjectID), which looks like this (a short example):

df <- data.frame(subjectID = rep(c(1:4), each=3),
                  y = sample(0:2, 4, replace = TRUE),
                  x = rep(c(0:2), times = 4))

df <- as.tibble(df)
> df
# A tibble: 12 x 3
   subjectID     y     x
       <int> <int> <int>
 1         1     0     0
 2         1     1     1
 3         1     1     2
 4         2     2     0
 5         2     0     1
 6         2     1     2
 7         3     1     0
 8         3     2     1
 9         3     0     2
10         4     1     0
11         4     1     1
12         4     2     2

For each unique subjectID "unique(df$subjectID)" (i.e. 4 in the example above) I would like to:

  #1. regress y on x (3 x- and y-observations for each subjectID ) to obtain coefficients
  reg <- lm(y~x)

  #2. store intercept 
  intercept <- reg$coefficients[1]

  #3. store slope
  slope <- reg$coefficients[2]

  #4. calculate Spearman's rho with library(ggpubr)
  rho <- cor(x, y, method = c("spearman"))

  #5. calculate the Spearman p-value with library(ggpubr)
  pvalue <- cor.test(x, y, method=c("spearman"))

The results I would like to store in a new data frame like this:

# creating an data frame to store regression output for each unique subjectID
output <- tibble(
  subjectID = rep(c(1:4), each=1),
  intercept = rep(c(0), each=4),
  slope = rep(c(0), each=4),
  rho = rep(c(0), each=4),
  pvalue = rep(c(0), each=4))


> output
# A tibble: 4 x 5
  subjectID intercept slope   rho   pvalue
      <int>     <dbl> <dbl> <dbl>    <dbl>
1         1         0     0     0        0
2         2         0     0     0        0
3         3         0     0     0        0
4         4         0     0     0        0

I have 21 x- and y-obs for each unique subjectID, and more than 100 unique subjectIDs in the 'real' data frame, so I would like to avoid to do it manually. So I was wondering if a loop-function can be created which achieves this?

In dplyr or base R. With dplyr I tried to use pipes and group_by(subjectID), but don't manage to set up the function.

In `dplyr`, the current favorite way to do this is with `nest_by`. The `?nest_by` help page has an example of fitting a linear model by groups to a data frame and extracting coefficients. You can probably modify the example to add the `cor` and `cor.test` bits as well. — Gregor Thomas, Oct 20 '20 at 19:58

score 0 · Accepted Answer · answered Oct 20 '20 at 22:55

This is a bit verbose but it gets at what @GregorThomas is referring to:

library(dplyr)
library(tidyr)
library(purrr)
library(broom)

set.seed(41)
df <- data.frame(
  subjectID = rep(c(1:4), each = 3),
  y = sample(0:2, 4, replace = TRUE),
  x = rep(c(0:2), times = 4)
)

df <- as_tibble(df)

df %>%
  nest(c(y, x)) %>%
  mutate(reg = map(data, ~ lm(y ~ x, data = .x))) %>%
  mutate(cor = map(data, ~ with(.x, cor.test(y, x)))) %>%
  mutate(tidied_reg = map(reg, tidy), tidied_cor = map(cor, tidy)) %>%
  select(subjectID, tidied_reg, tidied_cor) %>%
  unnest() %>%
  select(subjectID, term, estimate, statistic1, p.value1) %>%
  pivot_wider(names_from = term, values_from = estimate) %>%
  select(subjectID, intercept = `(Intercept)`, slope = x, rho = statistic1, pvalue = p.value1)
#> # A tibble: 4 x 5
#>   subjectID intercept  slope    rho pvalue
#>       <int>     <dbl>  <dbl>  <dbl>  <dbl>
#> 1         1     1.5   -0.5   -0.577  0.667
#> 2         2     1.5   -0.5   -0.577  0.667
#> 3         3     0.833  0.500  1.73   0.333
#> 4         4     0.167  0.5    1.73   0.333

Run a regression by each unique id in a column, with multiple observations for each unique ID

1 Answers1

Linked