1

I have a vector of 10,000 simulated values, generated with:

data = rbinom(10000, size=10, prob=1/4)

I need to find the mean and standard deviation of the data values >=5.

There are approx 766 values as per:

sum(data >=5)

The comparison data >= 5 (like every other approach I can think of) returns only TRUE/FALSE values, so passing it to mean or sd does not give statistics of the values themselves. How do I extract the actual values that are >= 5?

Ben Bolker
Peppa

3 Answers

2

To get the values of data that are greater than or equal to 5, rather than a logical vector marking which values meet that condition, subset data with that logical vector: data[data >= 5].

So we can do:

data = rbinom(10000, size=10, prob=1/4)

mean(data[data >= 5])
#> [1] 5.298153

sd(data[data >= 5])
#> [1] 0.5567141
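
Because the simulated data changes on every run without a seed, the same idea may be easier to check on a tiny vector. The following is only an illustrative sketch: the vector x and the names keep and subset_x are made up for this example and are not part of the question.

```r
# Hypothetical, hand-checkable values (not the simulated data from the question)
x <- c(2, 5, 7, 3, 6)

keep <- x >= 5        # logical vector: FALSE TRUE TRUE FALSE TRUE
subset_x <- x[keep]   # only the values where keep is TRUE: 5 7 6

mean(subset_x)        # mean of 5, 7, 6 is 6
sd(subset_x)          # sample sd of 5, 7, 6 is 1
```

Storing the subset once, as in subset_x, avoids repeating the comparison in every call.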
Allan Cameron
0

Maybe try this:

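library(dplyr)

# Put the vector in a named column so later steps can refer to it by name
data.frame(value = data) %>%
  filter(value >= 5) %>%          # keep only rows where value >= 5
  summarise(mean = mean(value),
            sd = sd(value))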

Output:

      mean        sd
1 5.297092 0.5815554

Data

data = rbinom(10000, size=10, prob=1/4)
Quinten
0

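TRUE and FALSE values can be used in mean(), sum(), sd(), etc., because they are treated as the numbers 1 and 0, respectively. So sum(data >= 5) counts the values that are >= 5, and mean(data >= 5) gives the proportion of such values.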

set.seed(456)
data = rbinom(10000, size=10, prob=1/4)
mean(data >= 5)
#> [1] 0.0779
sum(data >= 5)
#> [1] 779
sd(data >= 5)
#> [1] 0.2680276

Created on 2022-05-14 by the reprex package (v2.0.1)

DaveArmstrong
  • I read the question as getting the mean and sd of all the values in data that are greater than or equal to 5 – Allan Cameron May 14 '22 at 11:23
  • @AllanCameron sorry, I read it as wanting the `mean(data >= 5)` rather than `mean(data[data >= 5])`. I'll leave the answer here for now, but re-reading the question, I suspect you're right. – DaveArmstrong May 14 '22 at 11:34
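
To make the distinction discussed in these comments concrete, here is a small sketch on a made-up vector (y is hypothetical and chosen only so the results can be checked by hand). The first two calls summarise the TRUE/FALSE indicator; the last two summarise the selected values.

```r
# Hypothetical values, chosen so the results are easy to verify by hand
y <- c(1, 8, 4, 6, 2, 10)

# Summaries of the logical indicator (TRUE counts as 1, FALSE as 0)
sum(y >= 5)        # 3    : how many values are >= 5
mean(y >= 5)       # 0.5  : proportion of values that are >= 5

# Summaries of the values themselves
mean(y[y >= 5])    # 8    : mean of 8, 6, 10
sd(y[y >= 5])      # 2    : sample sd of 8, 6, 10
```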