I am trying to simulate answers to a multiple-choice question (MCQ) test. Currently, I use the following code to simulate the answers to a test with only two questions:

answers <- data.frame(
  Q1 = sample(LETTERS[1:5], 10, replace = TRUE, prob = c(0.1, 0.6, 0.1, 0.1, 0.1)),
  Q2 = sample(LETTERS[1:5], 10, replace = TRUE, prob = c(0.5, 0.1, 0.1, 0.2, 0.1))
)

The answers B and A are, respectively, the correct answers to Q1 and Q2.

My difficulty is introducing correlation among the answers, so that, for instance, a good student tends to select the correct answer to all questions. How can I accomplish that?

jay.sf
PaulS
  • What do you mean by `introduce correlation among the answers`? What is wrong with the current procedure? Except for the fact that you have to make up the probabilities. – user2974951 Sep 29 '21 at 12:13
  • Suppose a good student. If she chooses B in Q1, the probability of choosing A in Q2 should be greater than the one of choosing any other answer. – PaulS Sep 29 '21 at 12:17
  • But you already did that, answer A for Q2 has higher probability than the other answers, so it is quite likely that if she chooses the correct answer for Q1 she will also select the correct answer for Q2, although not necessarily. – user2974951 Sep 29 '21 at 12:19
  • Yes, but according to the process I used above, all students are intrinsically equally good -- no differentiation among students and the distribution of the grades will not be normal, I guess. I am aiming at getting a normal distribution of grades! – PaulS Sep 29 '21 at 12:23
  • Or are you trying to base all the probabilities from Q2 onwards on the first answer? If she chooses the correct answer on Q1, then she is one of the good students and should mostly select correct answers from then on, while if she chooses a wrong answer on Q1, then she is a trash student and will also select wrong answers from Q2 onwards? – user2974951 Sep 29 '21 at 12:23
  • Thanks, I guess my previous comment clarifies what is wanted. – PaulS Sep 29 '21 at 12:27

2 Answers

You could fill up the data with completely correct answers, assign a level of proficiency to each individual student and then randomly change values in their exams, depending on their proficiency:

correct <- c(2, 1, 3)  ## Index of the correct option (B, A, C) for each question
nstudents <- 20
## Start from an exam in which every student answers every question correctly
exam <- matrix(LETTERS[rep(correct, nstudents)], ncol = length(correct), byrow = TRUE)
colnames(exam) <- paste0("Q", seq_along(correct))

proficiency <- runif(nstudents, 1, 5) / 5  ## Each student has a level of expertise in [0.2, 1]

for (question in seq_along(correct)) {
  ## Random difficulty for each question and student (may be made more or less difficult)
  difficulty <- runif(nstudents, 1, 10) / 10
  mistaken <- proficiency < difficulty
  ## Redraw these answers at random; the draw can still hit the correct option by chance
  exam[mistaken, question] <- sample(LETTERS[1:5], sum(mistaken), replace = TRUE)
}

exam <- as.data.frame(exam)

The result would be a data frame in which some students hardly ever make mistakes while others hardly ever get something right.

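EDIT: The proficiency, in this case, follows a uniform distribution. If you need it to be normally distributed, build the proficiency vector with rnorm() instead of runif(), choosing a mean and standard deviation that keep the values comparable to the difficulty range.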
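As a minimal sketch of the rnorm() variant mentioned in the EDIT (not part of the original answer): the mean of 0.6, the standard deviation of 0.15, the clipping to [0.05, 0.95], and the choice to draw mistakes only from the wrong options are all assumptions made here for illustration. Proficiency is treated as the probability of answering any given question correctly.

```r
set.seed(1)                        # for reproducibility only
correct    <- c(2, 1, 3)           # index of the correct option per question (as in the answer)
nstudents  <- 200                  # hypothetical class size
nquestions <- length(correct)

# Assumption: proficiency ~ N(0.6, 0.15), clipped so it stays a valid probability
proficiency <- pmin(pmax(rnorm(nstudents, mean = 0.6, sd = 0.15), 0.05), 0.95)

# Start from a fully correct exam, one row per student
exam <- matrix(LETTERS[rep(correct, nstudents)], ncol = nquestions, byrow = TRUE,
               dimnames = list(NULL, paste0("Q", seq_len(nquestions))))

for (q in seq_len(nquestions)) {
  wrong <- runif(nstudents) > proficiency              # TRUE where the student misses question q
  distractors <- setdiff(LETTERS[1:5], LETTERS[correct[q]])
  exam[wrong, q] <- sample(distractors, sum(wrong), replace = TRUE)
}

# Number of correct answers per student
key   <- matrix(LETTERS[correct], nstudents, nquestions, byrow = TRUE)
score <- rowSums(exam == key)
```

Because every question reuses the same per-student proficiency, correct answers tend to cluster within students; hist(score) or cor(proficiency, score) gives a quick look at the resulting grade distribution.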

Martin Wettstein

Here is a method that imposes the correlation through the covariance matrix Sigma of MASS::mvrnorm().

n <- 15
r <- .9
set.seed(42)
library('MASS')
M <- abs(mvrnorm(n=n, mu=c(1, 500), Sigma=matrix(c(1, r, r, 1), nrow=2), 
                empirical=TRUE)) |>
  as.data.frame() |>
  setNames(c('Q1', 'Q2'))

We get the correlated levels A, ..., E by cutting the random numbers along custom quantiles (probabilities taken from the OP),

f <- \(x, q) cut(x, breaks=c(0, quantile(x, cumsum(q))), include.lowest=T, 
                 labels=LETTERS[1:5])

p1 <- c(0.1, 0.6, 0.1, 0.1, 0.1)
p2 <- c(0.5, 0.1, 0.1, 0.2, 0.1)

in a Map() call.

dat <- Map(f, M, list(p1, p2)) |>
  as.data.frame()
dat
#    Q1 Q2
# 1   A  A
# 2   B  A
# 3   E  E
# 4   D  E
# 5   A  A
# 6   B  A
# 7   C  D
# 8   B  A
# 9   B  A
# 10  B  A
# 11  B  C
# 12  B  B
# 13  E  D
# 14  B  A
# 15  C  D

Check

dat_check <- lapply(dat, as.integer) |> as.data.frame()
cor(dat_check)  ## correlation
#         Q1      Q2
# Q1 1.00000 0.85426
# Q2 0.85426 1.00000

lapply(dat, table)  ## students' answers
# $Q1
# 
# A B C D E 
# 2 8 2 1 2 
# 
# $Q2
# 
# A B C D E 
# 8 1 1 3 2 
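The check above measures correlation between letter codes. If the goal is for correct answers to co-occur across many questions, a related idea can be sketched in base R with a one-factor latent normal model. This is a swapped-in technique rather than the mvrnorm() call above, and n, k, r, and p_correct are hypothetical values chosen for illustration.

```r
set.seed(42)                   # for reproducibility only
n <- 500                       # students (hypothetical)
k <- 10                        # questions (hypothetical)
r <- 0.5                       # assumed latent correlation between any two questions
p_correct <- rep(0.6, k)       # assumed marginal probability of a correct answer

ability <- rnorm(n)                          # one shared factor per student
noise   <- matrix(rnorm(n * k), nrow = n)    # question-specific noise
# Each column has unit variance and pairwise correlation r with every other column
latent  <- sqrt(r) * ability + sqrt(1 - r) * noise

# Question j counts as correct when its latent value exceeds the (1 - p_j) normal quantile
is_correct <- sweep(latent, 2, qnorm(1 - p_correct), FUN = ">")

score <- rowSums(is_correct)   # number of correct answers per student
```

Wrong answers could then be filled with distractor letters drawn at random, and raising r concentrates correct answers more strongly within the same students.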
jay.sf