0

I have this data set

study_ID title                  experiment question_ID participant_ID estimate_level estimate correct_answer question                      type   category   age gender
      <dbl> <chr>                       <dbl> <chr>                <int> <chr>             <dbl>          <dbl> <chr>                         <chr>  <chr>    <int> <chr> 
 1       11 Dallacker_Parents'_co…          1 1                        1 individual          3             10   How many sugar cubes does or… unlim… nutriti…    32 Female
 2       11 Dallacker_Parents'_co…          1 2                        1 individual         10             11.5 How many sugar cubes does a … unlim… nutriti…    32 Female
 3       11 Dallacker_Parents'_co…          1 3                        1 individual          7              6.5 How many sugar cubes does a … unlim… nutriti…    32 Female
 4       11 Dallacker_Parents'_co…          1 4                        1 individual          1             16.5 How many sugar cubes does a … unlim… nutriti…    32 Female
 5       11 Dallacker_Parents'_co…          1 5                        1 individual          7             11   How many sugar cubes does a … unlim… nutriti…    32 Female
 6       11 Dallacker_Parents'_co…          1 6                        1 individual          5              2.5 How many sugar cubes does a … unlim… nutriti…    32 Female
 7       11 Dallacker_Parents'_co…          1 1                        2 individual          2             10   How many sugar cubes does or… unlim… nutriti…    29 Female
 8       11 Dallacker_Parents'_co…          1 2                        2 individual         10             11.5 How many sugar cubes does a … unlim… nutriti…    29 Female
 9       11 Dallacker_Parents'_co…          1 3                        2 individual          1.5            6.5 How many sugar cubes does a … unlim… nutriti…    29 Female
10       11 Dallacker_Parents'_co…          1 4                        2 individual          2             16.5 How many sugar cubes does a … unlim… nutriti…    29 Female

There are 6 questions in this data set , each of which has a correct_answer column, and an estimate column. I am trying to compute a magnitude for each question, so that I get a percentage of people who under- or overestimated and who estimated correctly.

For instance, for each of the 6 questions, it would return something like this: 80 percent underestimated, 10 overestimated, and 10 percent answered correctly.

How can I do this? I am stumped. Thanks in advance!

Here is the dput

dput(head(DF, 10))
structure(list(study_ID = c(5, 5, 5, 5, 5, 5, 5, 5, 5, 5), title = c("5_Jayles_Debiasing_The_Crowd", 
"5_Jayles_Debiasing_The_Crowd", "5_Jayles_Debiasing_The_Crowd", 
"5_Jayles_Debiasing_The_Crowd", "5_Jayles_Debiasing_The_Crowd", 
"5_Jayles_Debiasing_The_Crowd", "5_Jayles_Debiasing_The_Crowd", 
"5_Jayles_Debiasing_The_Crowd", "5_Jayles_Debiasing_The_Crowd", 
"5_Jayles_Debiasing_The_Crowd"), experiment = c(1, 1, 1, 1, 1, 
1, 1, 1, 1, 1), question_ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), 
    participant_ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), estimate_level = c("individual", 
    "individual", "individual", "individual", "individual", "individual", 
    "individual", "individual", "individual", "individual"), 
    estimate = c(2e+07, 4500000, 21075541, 2e+07, 1e+06, 1.1e+07, 
    2.5e+07, 8e+06, 1.6e+07, 9800000), correct = c(3.8e+07, 3.8e+07, 
    3.8e+07, 3.8e+07, 3.8e+07, 3.8e+07, 3.8e+07, 3.8e+07, 3.8e+07, 
    3.8e+07), question = c("What is the population of Tokyo and its agglomeration?", 
    "What is the population of Tokyo and its agglomeration?", 
    "What is the population of Tokyo and its agglomeration?", 
    "What is the population of Tokyo and its agglomeration?", 
    "What is the population of Tokyo and its agglomeration?", 
    "What is the population of Tokyo and its agglomeration?", 
    "What is the population of Tokyo and its agglomeration?", 
    "What is the population of Tokyo and its agglomeration?", 
    "What is the population of Tokyo and its agglomeration?", 
    "What is the population of Tokyo and its agglomeration?"), 
    type = c("unlimited", "unlimited", "unlimited", "unlimited", 
    "unlimited", "unlimited", "unlimited", "unlimited", "unlimited", 
    "unlimited"), category = c("demographics", "demographics", 
    "demographics", "demographics", "demographics", "demographics", 
    "demographics", "demographics", "demographics", "demographics"
    ), age = c("NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", 
    "NA", "NA"), gender = c("NA", "NA", "NA", "NA", "NA", "NA", 
    "NA", "NA", "NA", "NA")), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame"))
dampfy
  • 41
  • 5

2 Answers2

0

Here's a dplyr approach:

library(dplyr)
df %>%
  group_by(question_ID) %>%
  summarize(prop_over = mean(estimate > correct),
            prop_under = mean(estimate < correct),
            prop_correct = mean(estimate == correct)
  )
# `summarise()` ungrouping output (override with `.groups` argument)
# # A tibble: 1 x 4
#   question_ID prop_over prop_under prop_correct
#         <dbl>     <dbl>      <dbl>        <dbl>
# 1           1         0          1            0
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • Thank you for the reply! Would this be finding the >, <, and == to the `correct_answer` column? It looks like this would be doing this for the mean for the estimate of each question. – dampfy Oct 01 '20 at 14:40
  • This is comparing the column named `estimate` column to the column named `correct`. The data you shared doesn't have a column named `correct_answer`, so I can't tell you much about that, but I'm sure you can use this template to compare whatever columns you need to. – Gregor Thomas Oct 01 '20 at 14:54
  • `mean` is there to calculate a proportion. `estimate > correct` is `TRUE` when estimate is greater than correct, and `FALSE` otherwise. When you do math on true/false values, `TRUE` is `1`, and `FALSE` is `0`. So the sum of a true/false column is the count of `TRUE` values, and the mean of a true/false column is the proportion of `TRUE` values. You can, of course, multiply by 100 if you want to turn a proportion into a percent. In many programming languages `sum` is the standard way to count things, and `mean` is the standard way of calculating proportions - as long as the input is binary. – Gregor Thomas Oct 01 '20 at 14:57
  • Thank you! This worked. Sorry I thought I mentioned that before. – dampfy Oct 06 '20 at 10:41
0
list1 <- lapply(split(DF, DF$question_ID), function (x) {
  overestimated <- 100 * length(which(x$estimate > x$correct)) / length(x$estimate)
  underestimated <- 100 * length(which(x$estimate < x$correct)) / length(x$estimate)
  correct <- 100 * length(which(x$estimate == x$correct)) / length(x$estimate)
  data.frame(overestimated, underestimated, correct)
})
list2 <- mapply(function (x, y) {
  x$question_ID <- y
  return (x)
}, x = list1, y = names(list1), SIMPLIFY = F)
Percent_Data <- do.call("rbind", list2)
Percent_Data <- Percent_Data[, c(which(colnames(Percent_Data) == "question_ID"), which(colnames(Percent_Data) != "question_ID"))]
Percent_Data
#   question_ID overestimated underestimated correct
# 1           1             0            100       0
David Moore
  • 670
  • 3
  • 15