How do you compute magnitudes (percentages) that are over and under a specific number in another column?

Question

I have this data set

study_ID title                  experiment question_ID participant_ID estimate_level estimate correct_answer question                      type   category   age gender
      <dbl> <chr>                       <dbl> <chr>                <int> <chr>             <dbl>          <dbl> <chr>                         <chr>  <chr>    <int> <chr> 
 1       11 Dallacker_Parents'_co…          1 1                        1 individual          3             10   How many sugar cubes does or… unlim… nutriti…    32 Female
 2       11 Dallacker_Parents'_co…          1 2                        1 individual         10             11.5 How many sugar cubes does a … unlim… nutriti…    32 Female
 3       11 Dallacker_Parents'_co…          1 3                        1 individual          7              6.5 How many sugar cubes does a … unlim… nutriti…    32 Female
 4       11 Dallacker_Parents'_co…          1 4                        1 individual          1             16.5 How many sugar cubes does a … unlim… nutriti…    32 Female
 5       11 Dallacker_Parents'_co…          1 5                        1 individual          7             11   How many sugar cubes does a … unlim… nutriti…    32 Female
 6       11 Dallacker_Parents'_co…          1 6                        1 individual          5              2.5 How many sugar cubes does a … unlim… nutriti…    32 Female
 7       11 Dallacker_Parents'_co…          1 1                        2 individual          2             10   How many sugar cubes does or… unlim… nutriti…    29 Female
 8       11 Dallacker_Parents'_co…          1 2                        2 individual         10             11.5 How many sugar cubes does a … unlim… nutriti…    29 Female
 9       11 Dallacker_Parents'_co…          1 3                        2 individual          1.5            6.5 How many sugar cubes does a … unlim… nutriti…    29 Female
10       11 Dallacker_Parents'_co…          1 4                        2 individual          2             16.5 How many sugar cubes does a … unlim… nutriti…    29 Female

There are 6 questions in this data set, each of which has a correct_answer column and an estimate column. For each question, I am trying to compute the percentage of people who underestimated, overestimated, or estimated correctly.

For instance, for each of the 6 questions, it would return something like this: 80 percent underestimated, 10 percent overestimated, and 10 percent answered correctly.

How can I do this? I am stumped. Thanks in advance!

Here is the dput output:

dput(head(DF, 10))
structure(list(study_ID = c(5, 5, 5, 5, 5, 5, 5, 5, 5, 5), title = c("5_Jayles_Debiasing_The_Crowd", 
"5_Jayles_Debiasing_The_Crowd", "5_Jayles_Debiasing_The_Crowd", 
"5_Jayles_Debiasing_The_Crowd", "5_Jayles_Debiasing_The_Crowd", 
"5_Jayles_Debiasing_The_Crowd", "5_Jayles_Debiasing_The_Crowd", 
"5_Jayles_Debiasing_The_Crowd", "5_Jayles_Debiasing_The_Crowd", 
"5_Jayles_Debiasing_The_Crowd"), experiment = c(1, 1, 1, 1, 1, 
1, 1, 1, 1, 1), question_ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), 
    participant_ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), estimate_level = c("individual", 
    "individual", "individual", "individual", "individual", "individual", 
    "individual", "individual", "individual", "individual"), 
    estimate = c(2e+07, 4500000, 21075541, 2e+07, 1e+06, 1.1e+07, 
    2.5e+07, 8e+06, 1.6e+07, 9800000), correct = c(3.8e+07, 3.8e+07, 
    3.8e+07, 3.8e+07, 3.8e+07, 3.8e+07, 3.8e+07, 3.8e+07, 3.8e+07, 
    3.8e+07), question = c("What is the population of Tokyo and its agglomeration?", 
    "What is the population of Tokyo and its agglomeration?", 
    "What is the population of Tokyo and its agglomeration?", 
    "What is the population of Tokyo and its agglomeration?", 
    "What is the population of Tokyo and its agglomeration?", 
    "What is the population of Tokyo and its agglomeration?", 
    "What is the population of Tokyo and its agglomeration?", 
    "What is the population of Tokyo and its agglomeration?", 
    "What is the population of Tokyo and its agglomeration?", 
    "What is the population of Tokyo and its agglomeration?"), 
    type = c("unlimited", "unlimited", "unlimited", "unlimited", 
    "unlimited", "unlimited", "unlimited", "unlimited", "unlimited", 
    "unlimited"), category = c("demographics", "demographics", 
    "demographics", "demographics", "demographics", "demographics", 
    "demographics", "demographics", "demographics", "demographics"
    ), age = c("NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", 
    "NA", "NA"), gender = c("NA", "NA", "NA", "NA", "NA", "NA", 
    "NA", "NA", "NA", "NA")), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame"))

please provide a dput but I think that some group bys should easily solve this question — Bruno, Oct 01 '20 at 12:55
The data frame has 1,800 rows, so I can't copy and paste the entire dput. Unless I am doing something wrong. I am not very knowledgeable sorry! — dampfy, Oct 01 '20 at 13:29

score 0 · Accepted Answer · answered Oct 01 '20 at 14:30

Here's a dplyr approach:

library(dplyr)
DF %>%
  group_by(question_ID) %>%
  summarize(prop_over = mean(estimate > correct),
            prop_under = mean(estimate < correct),
            prop_correct = mean(estimate == correct)
  )
# `summarise()` ungrouping output (override with `.groups` argument)
# # A tibble: 1 x 4
#   question_ID prop_over prop_under prop_correct
#         <dbl>     <dbl>      <dbl>        <dbl>
# 1           1         0          1            0
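If the column is named `correct_answer`, as in the table printed in the question, the same template should carry over with only the column name changed. The sketch below is an illustration under that assumption: it rebuilds the three relevant columns from the ten rows shown in the question (the object names `tbl` and `pct` are made up here) and multiplies by 100 to report percentages rather than proportions.

```r
library(dplyr)

# The ten rows printed in the question, reduced to the columns that matter.
# `tbl` and `pct` are illustrative names, not from the original post.
tbl <- tibble(
  question_ID    = c("1", "2", "3", "4", "5", "6", "1", "2", "3", "4"),
  estimate       = c(3, 10, 7, 1, 7, 5, 2, 10, 1.5, 2),
  correct_answer = c(10, 11.5, 6.5, 16.5, 11, 2.5, 10, 11.5, 6.5, 16.5)
)

pct <- tbl %>%
  group_by(question_ID) %>%
  summarize(
    pct_over    = 100 * mean(estimate > correct_answer),   # share of TRUE, as a percent
    pct_under   = 100 * mean(estimate < correct_answer),
    pct_correct = 100 * mean(estimate == correct_answer),
    .groups = "drop"                                        # silences the ungrouping message
  )
pct
```

If some estimates are missing, `mean()` returns `NA` for that question unless `na.rm = TRUE` is passed.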

Gregor Thomas

Thank you for the reply! Would this be finding the >, <, and == to the `correct_answer` column? It looks like this would be doing this for the mean for the estimate of each question. – dampfy Oct 01 '20 at 14:40
This is comparing the column named `estimate` to the column named `correct`. The data you shared doesn't have a column named `correct_answer`, so I can't tell you much about that, but I'm sure you can use this template to compare whatever columns you need to. – Gregor Thomas Oct 01 '20 at 14:54
`mean` is there to calculate a proportion. `estimate > correct` is `TRUE` when estimate is greater than correct, and `FALSE` otherwise. When you do math on true/false values, `TRUE` is `1`, and `FALSE` is `0`. So the sum of a true/false column is the count of `TRUE` values, and the mean of a true/false column is the proportion of `TRUE` values. You can, of course, multiply by 100 if you want to turn a proportion into a percent. In many programming languages `sum` is the standard way to count things, and `mean` is the standard way of calculating proportions - as long as the input is binary. – Gregor Thomas Oct 01 '20 at 14:57
Thank you! This worked. Sorry I thought I mentioned that before. – dampfy Oct 06 '20 at 10:41
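The point about doing arithmetic on true/false values can be checked directly on the sample data from the question. This is only a small sketch using the ten `estimate` values and the single `correct` value from the dput; the variable name `under` is made up for illustration.

```r
# Values copied from the dput in the question
estimate <- c(2e+07, 4500000, 21075541, 2e+07, 1e+06, 1.1e+07,
              2.5e+07, 8e+06, 1.6e+07, 9800000)
correct <- 3.8e+07

under <- estimate < correct  # logical vector: TRUE where the estimate is too low

sum(under)         # number of TRUE values (count of underestimates)
mean(under)        # proportion of TRUE values
100 * mean(under)  # the same proportion expressed as a percent
```

Every estimate in this sample is below 3.8e+07, which is why the answer's output shows `prop_under` equal to 1.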

score 0 · Answer 2 · answered Oct 01 '20 at 14:38

list1 <- lapply(split(DF, DF$question_ID), function (x) {
  overestimated <- 100 * length(which(x$estimate > x$correct)) / length(x$estimate)
  underestimated <- 100 * length(which(x$estimate < x$correct)) / length(x$estimate)
  correct <- 100 * length(which(x$estimate == x$correct)) / length(x$estimate)
  data.frame(overestimated, underestimated, correct)
})
list2 <- mapply(function(x, y) {
  x$question_ID <- y
  return(x)
}, x = list1, y = names(list1), SIMPLIFY = FALSE)
Percent_Data <- do.call("rbind", list2)
Percent_Data <- Percent_Data[, c(which(colnames(Percent_Data) == "question_ID"), which(colnames(Percent_Data) != "question_ID"))]
Percent_Data
#   question_ID overestimated underestimated correct
# 1           1             0            100       0
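For comparison, the same base R idea can be written more compactly by taking `mean()` of each logical comparison instead of `length(which(...)) / length(...)`. This is a sketch rather than a drop-in replacement: it returns a matrix with one row per `question_ID` instead of a data frame, and `DF` is rebuilt here from only the three columns of the dput that the calculation needs.

```r
# Minimal stand-in for DF, using the values from the dput in the question
DF <- data.frame(
  question_ID = rep(1, 10),
  estimate    = c(2e+07, 4500000, 21075541, 2e+07, 1e+06, 1.1e+07,
                  2.5e+07, 8e+06, 1.6e+07, 9800000),
  correct     = rep(3.8e+07, 10)
)

# One named vector of percentages per question, then transpose so that
# questions are rows and the three categories are columns
pct <- t(sapply(split(DF, DF$question_ID), function(x) {
  100 * c(overestimated  = mean(x$estimate > x$correct),
          underestimated = mean(x$estimate < x$correct),
          correct        = mean(x$estimate == x$correct))
}))
pct
```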
