
I want to write a function that takes a data frame as input and, for each numeric variable in the data frame, returns that variable's mean, median, and interquartile range in the form of a list.

The data frame is below:

'data.frame':   271 obs. of  6 variables:
 $ sample.id: int  1 2 4 5 6 7 8 9 12 13 ...
 $ zip      : int  48504 48507 48504 48507 48505 48507 48507 48503 48507 48505 ...
 $ ward     : int  6 9 1 8 3 9 9 5 9 3 ...
 $ Pb1      : num  0.344 8.133 1.111 8.007 1.951 ...
 $ Pb2      : num  0.226 10.77 0.11 7.446 0.048 ...
 $ Pb3      : num  0.145 2.761 0.123 3.384 0.035 ...

The output should be like:

$Pb1
    Mean   Median      IQR 
10.76687  3.56400  7.75100 

$Pb2
    Mean   Median      IQR 
10.43467  1.40000  4.50100 

$Pb3
    Mean   Median      IQR 
3.701434 0.839000 2.429500 

Here is my code:

df.numeric.summary <- function(data) {
  for (i in 1:ncol(data)) {
    if (is.numeric(data[[i]]) == TRUE) {
      variable_mean <- mean(data[[i]])
      variable_median <- median(data[[i]])
      variable_IQR <- IQR(data[[i]])
      variable_data <- data.frame(Mean = variable_mean, Median = variable_median, IQR = variable_IQR)
    }
  }
  return(variable_data)
}

My code only returns the result for Pb3. I think a for statement may not be the right tool here, but how else can I get the values for all three variables? Also, how do I return the result as a list?

lebelinoz
    I think you should reconsider your choice of accepted answers here. Growing a data frame inside a `for` loop is one of the least efficient operations in all of R. It should *never* be used. – Rich Scriven Oct 14 '17 at 21:09

2 Answers


There are a variety of degrees to which you can simplify/collapse this, but how about:

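# Summary statistics for a single numeric vector
df.numeric.val <- function(col) {
  return(c(Mean = mean(col), Median = median(col), IQR = IQR(col)))
}
# Apply the helper to every numeric column; lapply returns a named list
df.numeric.summary <- function(data) {
  numcols <- sapply(data, is.numeric)
  vals <- lapply(data[numcols], df.numeric.val)
  return(vals)
}
df.numeric.summary(mtcars)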
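
As one example of collapsing this further, here is a minimal sketch that folds the helper into an anonymous function and uses `Filter` to pick out the numeric columns. The name `df.numeric.summary2` is made up for this example, and the element names `Mean`, `Median`, and `IQR` are chosen to match the output requested in the question.

```r
# Hypothetical name; a one-function sketch of the same lapply approach.
df.numeric.summary2 <- function(data) {
  # Filter() keeps only the columns for which is.numeric() is TRUE
  lapply(Filter(is.numeric, data), function(col) {
    c(Mean = mean(col), Median = median(col), IQR = IQR(col))
  })
}

res <- df.numeric.summary2(iris)
names(res)        # the numeric columns; Species (a factor) is dropped
res$Sepal.Length  # named vector with Mean, Median, and IQR
```

Whether the shorter form is easier to read than the two-function version is a matter of taste; both return a named list with one element per numeric column.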
Ben Bolker

While there are much better ways to do this sort of thing in R (I suggest you look at how to use lapply, as suggested in at least one other answer and one comment), I will focus on your for-loop approach.

Your mistake is that you recreate variable_data from scratch at each pass through the loop. It's as if you've gone:

for (i in 1:3) {
  x <- i   # x is overwritten on every pass
}
x  # <-- This will be 3

The solution might be to define variable_data before the for-loop, and use rbind to append to it:

df.numeric.summary <- function(data) {
  # Start with an empty data frame whose column names match the rows appended below
  variable_data <- data.frame(Mean = numeric(0), Median = numeric(0), IQR = numeric(0))
  for (i in seq_along(data)) {
    if (is.numeric(data[[i]])) {
      variable_mean <- mean(data[[i]])
      variable_median <- median(data[[i]])
      variable_IQR <- IQR(data[[i]])
      # Append one row for each numeric column
      variable_data <- rbind(variable_data, data.frame(Mean = variable_mean, Median = variable_median, IQR = variable_IQR))
    }
  }
  return(variable_data)
}
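
If the goal is a list rather than a data frame, the loop can also store each column's statistics directly in a named list element, which avoids calling rbind on every pass. This is only a sketch of that idea; the function name `df.numeric.summary.list` is made up for illustration.

```r
# Hypothetical name; a loop-based sketch that returns a named list.
df.numeric.summary.list <- function(data) {
  result <- list()
  for (name in names(data)) {
    col <- data[[name]]
    if (is.numeric(col)) {
      # Assigning by name adds one list element per numeric column
      result[[name]] <- c(Mean = mean(col), Median = median(col), IQR = IQR(col))
    }
  }
  result
}

df.numeric.summary.list(mtcars[1:3])
```

Each element is named after its column, so the printed result has the `$name` layout shown in the question.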

As for converting a data frame to a list, this is a separate question and has already been answered by this Stack Overflow question. The most popular answer is:

xy.list <- split(xy.df, seq(nrow(xy.df)))

where xy.df is the name of your dataframe.
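
To see what that call produces, here is a small sketch on a toy data frame (the values are made up purely for illustration):

```r
# Toy data frame, invented for this example
xy.df <- data.frame(x = 1:2, y = c("a", "b"))

# One group per row number, so each list element is a one-row data frame
xy.list <- split(xy.df, seq(nrow(xy.df)))

length(xy.list)  # one element per row
names(xy.list)   # elements are named after the row positions
xy.list[["1"]]   # the first row, still a data frame
```

Note that the list elements are named by row position rather than by variable, so any variable labels would need to be carried in a column or set with `names()` afterwards.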

lebelinoz