Calculate Mean and Standard Deviation Using R

Question

New to R. I'm trying to calculate the rate at which each batter hits into double plays, using a data set covering 2006-2016. My code is flawed, though, and I'm not sure why: Rate1 comes out the same for every batter. Once I have Rate1 for each batter I want an overall mean and standard deviation, but I haven't gotten to that point yet...

This is a subset of the data frame...

     BAT_ID    DP_FL
2    hanim001  FALSE
18   hereg002  FALSE
40   pujoa001  TRUE
50   espid001  TRUE
97   troum001  FALSE
131  calhk001  FALSE
136  hanim001  FALSE
148  hanim001  FALSE
165  mottt001  FALSE
215  calhk001  TRUE
238  calhk001  FALSE
255  napom001  FALSE
264  gomec002  FALSE
267  maybc001  TRUE
271  napom001  FALSE
279  rua-r001  FALSE
283  simma001  TRUE
286  mazan001  FALSE
318  martj007  FALSE
322  choos001  TRUE
356  gomec002  FALSE


#Percent groundball double play
library(plyr)
mean1<-ddply(all_data_gnd, .(BAT_ID), summarize,  Rate1= 
(sum(as.numeric(which(all_data_gnd$DP_FL==1))) / 
(sum(as.numeric(which(all_data_gnd$DP_FL==0))) + 
sum(as.numeric(which(all_data_gnd$DP_FL==1))))))
head(mean1)

> head(mean1)
    BAT_ID     Rate1
1 abrej003 0.1741862
2 adamc001 0.1741862
3 adaml001 0.1741862
4 adamm002 0.1741862
5 adduj002 0.1741862
6 adlet001 0.1741862

How is it "flawed"? Do you get an error, or just the wrong answers? — kdopen, Apr 02 '18 at 13:48

score 0 · Answer 1 · answered Apr 02 '18 at 14:36

Your sample data is too small to demonstrate on, so I'll generate some fake data:

n <- 1e4
set.seed(2)
fakedata <- data.frame(
  bat_id = sample(letters[1:5], size=n, replace=TRUE),
  dp_fl = sample(c(TRUE, FALSE), size=n, replace=TRUE),
  stringsAsFactors = FALSE
)
head(fakedata)
#   bat_id dp_fl
# 1      a  TRUE
# 2      d  TRUE
# 3      c  TRUE
# 4      a FALSE
# 5      e  TRUE
# 6      e FALSE

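Rate1 is identical for every batter because the expression refers to all_data_gnd$DP_FL, which is the entire column of the full data frame, so the grouping by BAT_ID never affects it. In addition, which() returns row positions, so sum(which(...)) adds up indices rather than counting rows. You don't need as.numeric, and what you were aiming for, (count of ==1) / (count of ==0 + count of ==1), is simply the mean of the logicals. There are several ways you can summarize: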

stack(by(fakedata$dp_fl, fakedata$bat_id, mean))
stack(tapply(fakedata$dp_fl, fakedata$bat_id, mean))

Each results in

#      values ind
# 1 0.4935000   a
# 2 0.5015322   b
# 3 0.4869432   c
# 4 0.5223735   d
# 5 0.5041810   e

A call to colnames is useful here to replace the default column names, values and ind.
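
As a sketch of that rename in base R (the new names rate and bat_id are just my choice, and the fake data is regenerated so the snippet runs on its own):

```r
# Regenerate the fake data so this snippet is self-contained
set.seed(2)
fakedata <- data.frame(
  bat_id = sample(letters[1:5], size = 1e4, replace = TRUE),
  dp_fl = sample(c(TRUE, FALSE), size = 1e4, replace = TRUE),
  stringsAsFactors = FALSE
)

# Per-batter mean of the logical column, stacked into a data frame
rates <- stack(tapply(fakedata$dp_fl, fakedata$bat_id, mean))

# stack() names its columns "values" and "ind"; rename them (illustrative names)
colnames(rates) <- c("rate", "bat_id")

# Optional: put the id column first
rates <- rates[, c("bat_id", "rate")]
head(rates)
```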

You can also use:

library(dplyr)

fakedata %>%
  group_by(bat_id) %>%
  summarize(dp_fl = mean(dp_fl))

# # A tibble: 5 × 2
#   bat_id     dp_fl
#    <chr>     <dbl>
# 1      a 0.4935000
# 2      b 0.5015322
# 3      c 0.4869432
# 4      d 0.5223735
# 5      e 0.5041810
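
If you would rather stay with plyr, the same idea applies to your original call: use the bare column name inside summarize so that it is evaluated per group. Below is a minimal sketch, assuming all_data_gnd has the columns BAT_ID and DP_FL as in your excerpt. I rebuilt the data frame from the first 11 rows of that excerpt so the snippet runs on its own, and the names rates, overall_mean and overall_sd are mine. It also covers the overall mean and standard deviation you mentioned.

```r
library(plyr)

# First 11 rows of the excerpt shown in the question
all_data_gnd <- data.frame(
  BAT_ID = c("hanim001", "hereg002", "pujoa001", "espid001", "troum001",
             "calhk001", "hanim001", "hanim001", "mottt001", "calhk001",
             "calhk001"),
  DP_FL = c(FALSE, FALSE, TRUE, TRUE, FALSE,
            FALSE, FALSE, FALSE, FALSE, TRUE,
            FALSE),
  stringsAsFactors = FALSE
)

# Bare DP_FL (not all_data_gnd$DP_FL) is evaluated within each BAT_ID group;
# the mean of a logical vector is the share of TRUE values
rates <- ddply(all_data_gnd, .(BAT_ID), summarize, Rate1 = mean(DP_FL))
rates

# Mean and standard deviation of the per-batter rates
overall_mean <- mean(rates$Rate1)
overall_sd <- sd(rates$Rate1)
```

Note that this averages the per-batter rates, so every batter counts equally no matter how many rows he has. If you want the pooled rate over all rows instead, that is mean(all_data_gnd$DP_FL).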
