I am trying to center values around the mean of an entire column. I need to do this for an entire (large) data frame, so first I tried colMeans.

colMeans(data, na.rm = TRUE)

From this, I get an answer like 5.567 for the first column of my data set. However, I wanted to double-check this. When I use the mean function, `mean(data$first_column, na.rm = TRUE)`, I get 8.466 instead. When I calculate the mean in an Excel sheet, I get something around 6.5.

I haven't been able to recreate this problem with a generated data set, so here is a link to a GoogleDoc with the first two columns of my data set.

The end goal is to center the values around the mean for nearly every column in the data set, and I assumed I would do this with lapply(). But before I do that, I want to understand why I am getting so many different mean values. I assume it has to do with NAs or something, but I'm not quite grasping it.
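A minimal sketch of how NA handling alone can change a column mean, using a made-up data frame `toy` (its values are assumptions, not the data from the linked sheet). It compares per-column NA removal with dropping every incomplete row:

```r
# Toy data (hypothetical values, not the poster's data set)
toy <- data.frame(
  a = c(1, 2, NA, 4),
  b = c(10, NA, 30, 40)
)

# Per-column means: each column drops only its own NAs
per_column <- colMeans(toy, na.rm = TRUE)
same_thing <- sapply(toy, mean, na.rm = TRUE)

# Complete-case means: any row with an NA is dropped from every column
complete_only <- colMeans(toy[complete.cases(toy), ])

print(per_column)
print(same_thing)
print(complete_only)
```

In this sketch `colMeans(..., na.rm = TRUE)` and `mean(..., na.rm = TRUE)` agree column by column, while the complete-case means differ because whole rows are removed first.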

Thanks in advance for your help.

wissem
  • I am unable to recreate your error. I downloaded the Google Doc as a CSV and got 6.502439 using `colMeans` and `mean`. – Ian Wesley Jun 22 '17 at 18:25
  • Try using `complete.cases()` on your data frame so that all of the NAs are removed: `data <- data[complete.cases(data), ]` – sweetmusicality Jun 22 '17 at 18:26
  • I agree with @IanWesley. The problem is not reproducible; 6.502439 is the mean value of `Irritability` – Marco Sandri Jun 22 '17 at 18:33
  • Could it be that `data$first_column` is not `data$Irritability` ? – R. Schifini Jun 22 '17 at 20:53
  • Thank you all for trying it. I set it up so that it IS the first column of the code when `colMeans` is used. – wissem Jun 23 '17 at 03:05
  • @sweetmusicality Why wouldn't the `na.rm = TRUE` work? Various patients are missing an array of variables, so there is a different number of complete cases per column – wissem Jun 23 '17 at 03:06
  • I find `complete.cases` more comprehensive...but it was just a suggestion...any luck with it? – sweetmusicality Jun 23 '17 at 16:53
  • @sweetmusicality no luck. Now I am getting the same value with `colMeans` and `mean`, but the value they are spitting out is 8.5. – wissem Jun 23 '17 at 17:38
  • @sweetmusicality do you know if `colMeans` handles NAs by excluding only the NAs in that column, or does it exclude every observation that is not complete? – wissem Jun 23 '17 at 17:39

1 Answer

After a lot of trying, here is the code I have. I am still getting mean values that are off, but `colMeans()` and `mean()` now produce the same answer, so I think the problem lies in the varying number of NAs in my data rather than in the functions. I'm still examining that, but I figured out how to replace the NAs with the column mean as part of centering the values around the mean. This post helped me figure it out, specifically @Itsa's code.

### center values first

library(dplyr)

# Other_Variables stands in for the remaining columns (e.g. srs_tot_raw, Group.y)
center_asd_prep <- autgi %>% select(ID, Irritability, Other_Variables)

# make sure the score is stored as numeric
center_asd_prep$srs_tot_raw <- as.numeric(center_asd_prep$srs_tot_raw)

# remove categorical info
center_asd_mean <- center_asd_prep %>% select(-ID, -Group.y)

# replace each NA with its column mean (this line imputes only; it does not subtract the mean)
center_asd_mean[] <- lapply(center_asd_mean, function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))

# add ID info back
center_asd <- data.frame(center_asd_mean, ID = center_asd_prep$ID, Group = center_asd_prep$Group.y)
center_asd
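The `lapply()` call above fills each NA with its column mean; subtracting that mean is a separate step. Here is a rough sketch of doing both in one pass, assuming every column is already numeric. The data frame `scores` and the helper `impute_and_center` are hypothetical names used only for illustration:

```r
# Hypothetical numeric columns standing in for center_asd_mean
scores <- data.frame(
  Irritability = c(4, NA, 8, 6),
  other_score  = c(1, 3, NA, 5)
)

# Impute each NA with its column mean, then subtract that mean,
# so imputed entries end up at 0 after centering
impute_and_center <- function(x) {
  col_mean <- mean(x, na.rm = TRUE)
  x[is.na(x)] <- col_mean
  x - col_mean
}

centered <- scores
centered[] <- lapply(scores, impute_and_center)

print(centered)
print(colMeans(centered))
```

Base R's `scale(x, center = TRUE, scale = FALSE)` also subtracts column means, but it leaves the NAs in place, so whether to impute first is a separate decision.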

I'll update this post if I figure out why I'm getting such high mean values, but I have 14 observations that have a high number of NAs, and I think that this is impacting the results because my N=218. Hypothetically, this code should work if anyone runs into the same problem as me.
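Two quick checks that might help narrow this down, sketched on made-up data (the data frame `toy` and its values are assumptions). The first counts NAs per column and per row. The second illustrates one possible cause that the post does not confirm: calling `as.numeric()` on a factor returns the internal level codes rather than the original numbers.

```r
# Hypothetical data: a score column stored as a factor, plus sparse rows
toy <- data.frame(
  score = factor(c("12", "7", "30", NA)),
  other = c(1, NA, NA, 4)
)

# How many NAs does each column and each row have?
na_per_column <- colSums(is.na(toy))
na_per_row    <- rowSums(is.na(toy))

# as.numeric() on a factor returns the level codes, not the values;
# converting through as.character() recovers the original numbers
codes  <- as.numeric(toy$score)
values <- as.numeric(as.character(toy$score))

print(na_per_column)
print(na_per_row)
print(mean(codes, na.rm = TRUE))
print(mean(values, na.rm = TRUE))
```

If `sapply(data, class)` reports `factor` or `character` for a column that should be numeric, the two means printed at the end will generally differ, which is worth ruling out before blaming the NAs.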

wissem