
How can I easily generate (simulate) meaningful example data for modelling? For example, I would like to ask for n rows of data in 2 groups whose sex distributions and mean ages differ by X and Y units, respectively. Is there a simple way to do this automatically? Are there any packages for it?

For example, what would be the simplest way to generate data like this?

  • groups: two groups: A, B
  • sex: different sex distributions: A 30%, B 70%
  • age: different mean ages: A 50, B 70

PS! Tidyverse solutions are especially welcome.

My best attempt so far still takes quite a lot of code:

library(dplyr) # bind_rows(), %>%, and the re-exported tibble()

n <- 100
d <- bind_rows(
  # group A females
  tibble(group = "A",
         sex = "Female",
         age = rnorm(n * 0.4, 50, 4)),
  # group B females
  tibble(group = "B",
         sex = "Female",
         age = rnorm(n * 0.3, 45, 4)),
  # group A males
  tibble(group = "A",
         sex = "Male",
         age = rnorm(n * 0.2, 60, 6)),
  # group B males
  tibble(group = "B",
         sex = "Male",
         age = rnorm(n * 0.1, 55, 4)))


d %>% group_by(group, sex) %>% 
  summarise(n = n(),
            mean_age = mean(age))


st4co4
  • There are data simulation packages on CRAN, e.g. [simstudy](https://cran.r-project.org/web/packages/simstudy/), [faux](https://cran.r-project.org/web/packages/faux/index.html), [simglm](https://cran.r-project.org/web/packages/simglm/). – dipetkov Mar 08 '22 at 12:07

1 Answer


There are lots of ways to sample from vectors and to draw from random distributions in R. For example, a data set along the lines of your request could be created like this:

set.seed(69) # Makes samples reproducible

df <- data.frame(groups = rep(c("A", "B"), each = 100),
                 # Group A: 30% male; group B: 50% male
                 sex = c(sample(c("M", "F"), 100, replace = TRUE, prob = c(0.3, 0.7)),
                         sample(c("M", "F"), 100, replace = TRUE, prob = c(0.5, 0.5))),
                 # Uniform ages with means 50 (group A) and 70 (group B)
                 age = c(runif(100, 25, 75), runif(100, 50, 90)))

And we can use the tidyverse to show it does what was expected:

library(dplyr)

df %>% 
  group_by(groups) %>% 
  summarise(age = mean(age),
            # Count of males; equals the percentage only because each group has 100 rows
            percent_male = length(which(sex == "M")))
#> # A tibble: 2 x 3
#>   groups   age percent_male
#>   <chr>  <dbl>        <int>
#> 1 A       49.4           29
#> 2 B       71.0           50
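
If you would rather state the targets once and generate everything from them, the same idea can be sketched with a small parameter table. This is only a sketch under assumptions that are not in the question: the column names (`p_male`, `mean_age`, `sd_age`) are made up for illustration, the 30%/70% figures are read as the share of males, ages are drawn from a normal distribution, and the standard deviation of 10 and the seed are arbitrary.

```r
library(dplyr)
library(tidyr)

set.seed(1) # arbitrary seed, only for reproducibility

# One row per group; column names and sd_age are illustrative assumptions
params <- tibble(
  group    = c("A", "B"),
  n        = c(100, 100),
  p_male   = c(0.3, 0.7),  # assumed to mean the share of males
  mean_age = c(50, 70),
  sd_age   = c(10, 10)
)

sim <- params %>%
  uncount(n) %>%  # repeat each parameter row n times (drops the n column)
  mutate(
    sex = if_else(runif(n()) < p_male, "M", "F"),
    age = rnorm(n(), mean = mean_age, sd = sd_age)
  ) %>%
  select(group, sex, age)
```

Changing the number of rows, the proportions, or the means then only requires editing `params`, and adding a third group is one more element in each vector. The same `group_by()` / `summarise()` check as above can be run on `sim` to compare the sample values with the targets.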
Allan Cameron