I have a large df (100k+ rows, see snapshot of data below) and I'm trying to summarize a variable (salinity) by group (species) in a table (min, mean, median, max, etc.) using tapply. If I use the whole dataset (which contains a few NAs, but not in every group) instead of a random subset, the resulting table has an extra column called "NA.s" with a value for every group. I'm not sure what this column is or how it's created. A subset of randomly chosen rows from the df doesn't reproduce the issue, so I'm not sure how to share a reproducible example here...

I load my data (screenshot not available),

then I run this code:

sum_stats <- tapply(df$salinity, df$species, summary)

This seems to create a list of doubles (no NULLs); the screenshot of the list is not available.

Clicking one of them yields this, all good:

> sum_stats[["Albula vulpes"]]
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  15.49   15.49   15.49   15.49   15.49   15.49 

Then creating a dataframe creates the issue somehow?

sum_data_table<-data.frame(do.call("rbind", sum_stats))

# Not sure what this is either

Warning message:
In rbind(`Achirus lineatus` = c(Min. = 6.11, `1st Qu.` = 20.97,  :
  number of columns of result is not a multiple of vector length (arg 1)

The new unknown column is NA.s (circled in red in a screenshot that is not available).

Nate
  • What happens if you add in the argument to ignore NAs? `tapply(df$salinity, df$species, summary, na.rm=TRUE)` – AndrewGB Dec 10 '21 at 20:41
  • NA.s column still there :( – Nate Dec 10 '21 at 20:44
  • Can you provide some of your data? You can just do `dput(head(df))` to provide a few rows. – AndrewGB Dec 10 '21 at 20:45
  • There's something about running the code on a random subset of rows that makes that column not appear. I use this: `mini_data <- df[sample(nrow(df), 100), ]` – Nate Dec 10 '21 at 20:47
  • `dput(head(df))` gives: `structure(list(species = c("Hippolyte pleuracanthus", "Hippolyte pleuracanthus", "Hippolyte pleuracanthus", "Hippolyte pleuracanthus", "Hippolyte pleuracanthus", "Hippolyte pleuracanthus"), salinity = c(37, 37, 37, 37, 37, 37)), row.names = c(NA, 6L), class = "data.frame")` – Nate Dec 10 '21 at 20:50
  • There are 154 species and an associated salinity value (can be any random number for these purposes). – Nate Dec 10 '21 at 20:52
  • I just realized it's the same number as the 'min' value, for some reason... – Nate Dec 10 '21 at 21:28

1 Answer

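This happens because some of the species have NAs while others don't. summary only includes the NA's element when values are missing, so it returns 7 values for species with missing salinity and 6 for the rest. rbind doesn't know how to handle a differing number of columns: it recycles each shorter vector to the widest length, which is what the warning is about. For species with nothing missing, the extra column is therefore the first element repeated (the min). For species that have some missing, you'll see the count of NAs.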

set.seed(1)
x <- expand.grid(letters, letters, letters)
x <- paste0(x$Var1, x$Var2, x$Var3)[1:150]
df <- data.frame(
  species = rep(x, 100),
  salinity = rnorm(15000)
)
df$salinity[df$salinity < -2.2] <- NA 

sum_stats <- tapply(df$salinity, df$species, summary)

sum_data_table<-data.frame(do.call("rbind", sum_stats))
#> Warning in rbind(aaa = structure(c(Min. = -1.7072100148514, `1st Qu.` =
#> -0.755344253350053, : number of columns of result is not a multiple of vector
#> length (arg 4)
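The mechanism is easier to see on two tiny vectors. This is only an illustrative sketch: the values and the names `no_na`, `with_na` and `combined` are made up for the example and are not part of the question's data.

```r
# Two toy vectors (hypothetical values): one complete, one with a missing value.
no_na   <- summary(c(1, 2, 3))
with_na <- summary(c(1, 2, NA))

length(no_na)    # 6 statistics
length(with_na)  # 7: the same six plus "NA's"

# rbind() pads the shorter row by recycling it from its first element,
# which is what triggers the "not a multiple of vector length" warning
# (suppressed here so the example runs quietly).
combined <- suppressWarnings(rbind(with_na = with_na, no_na = no_na))

# For the complete vector, the 7th column is just a copy of the minimum.
combined["no_na", 7] == combined["no_na", 1]
```

After `data.frame()`, the `NA's` column name is made syntactically valid, which is where `NA.s` comes from.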

Removing the missing values first will work, since every summary then has the same six elements:

sum_stats <- with(df[!is.na(df$salinity), ], tapply(salinity, species, summary))

sum_data_table<-data.frame(do.call("rbind", sum_stats))

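Or you could use bind_rows from dplyr instead, if you would like a summary of the missingness as well. It matches columns by name and fills the NA's column with NA for species that have nothing missing, rather than recycling values.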

library(dplyr)
sum_data_table<-data.frame(do.call(dplyr::bind_rows, sum_stats))
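If you would rather stay in base R and keep the species as row names, one possible approach is to make the summary function always return the same number of elements, so rbind has nothing to recycle. The sketch below is an assumption on my part, not something from the question: `fixed_summary`, the column names and the toy data are all hypothetical.

```r
# Hypothetical helper: always returns 7 named values, whether or not x has NAs.
fixed_summary <- function(x) {
  q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE, names = FALSE)
  c(
    min = q[1], q1 = q[2], median = q[3],
    mean = mean(x, na.rm = TRUE),
    q3 = q[4], max = q[5],
    n_missing = sum(is.na(x))  # 0 when nothing is missing
  )
}

# Toy data (made up): species "a" has one missing value, "b" has none.
toy <- data.frame(
  species  = c("a", "a", "a", "b", "b", "b"),
  salinity = c(1, 2, NA, 4, 5, 6)
)

stats <- tapply(toy$salinity, toy$species, fixed_summary)
tbl   <- do.call(rbind, stats)  # equal-length rows, so no warning
sum_data_table <- data.frame(tbl)
```

Here `n_missing` is an honest count for every species, including a 0 where nothing is missing.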
TrainingPizza