I have a large data frame (100k+ rows; see the snapshot of the data below) and I'm trying to summarize a variable (salinity) by group (species) in a table of min, mean, median, max, etc., using tapply. When I use the whole dataset (which contains a few NAs, but not in every group) instead of a random subset, the table I end up with has an extra column called "NA.s", and it has a value for every group. I'm not sure what this column is or how it's created. A subset of randomly chosen rows from the df doesn't recreate the issue, so I'm not sure how to reproduce my data here...
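For illustration only, a small stand-in with the same two columns could be built as below. The species labels and salinity values are invented, not taken from the real dataset; the one assumption carried over from the description is that some groups contain an NA and others do not.

```r
# Made-up stand-in for the real data: same column names as in the question,
# invented values. Species "B" has one missing salinity, species "A" has none.
df <- data.frame(
  species  = c("A", "A", "A", "B", "B", "B"),
  salinity = c(10, 20, 30, 5, NA, 15)
)

# Count of missing salinity values per species
print(tapply(is.na(df$salinity), df$species, sum))
```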
Then I run this code:
sum_stats <- tapply(df$salinity, df$species, summary)
This seems to create a list of doubles (no NULLs) that looks like this:
Clicking one of them yields this, all good:
> sum_stats[["Albula vulpes"]]
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  15.49   15.49   15.49   15.49   15.49   15.49
Then creating a data frame from that list somehow introduces the issue:
sum_data_table <- data.frame(do.call("rbind", sum_stats))
# Not sure what this is either
Warning message:
In rbind(`Achirus lineatus` = c(Min. = 6.11, `1st Qu.` = 20.97, :
number of columns of result is not a multiple of vector length (arg 1)
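A small-scale sketch of how the extra column and the warning might arise, assuming both come from summary() returning an additional "NA's" element only for groups that contain missing values. The `toy` data frame, its values, and the names `toy_stats`, `stat_lengths` and `toy_table` are made up for this example and are not the real data.

```r
# Made-up toy data (not the real dataset): species "B" has one missing
# salinity value, species "A" has none.
toy <- data.frame(
  species  = c("A", "A", "A", "B", "B", "B"),
  salinity = c(10, 20, 30, 5, NA, 15)
)

toy_stats <- tapply(toy$salinity, toy$species, summary)

# summary() appends an "NA's" count only when the vector contains NAs,
# so the per-group results can differ in length.
stat_lengths <- sapply(toy_stats, length)
print(stat_lengths)

# rbind() recycles the shorter vectors to fill the extra column (which is
# what triggers the warning), and data.frame() rewrites the column name
# "NA's" as "NA.s" because check.names = TRUE by default.
toy_table <- suppressWarnings(data.frame(do.call("rbind", toy_stats)))
print(toy_table)

# Groups whose summaries are longer than the shortest one contain NAs.
print(names(toy_stats)[stat_lengths > min(stat_lengths)])
```

If the real data behaves the same way, the "NA.s" entry for a group without missing values would just be that group's first statistic recycled by rbind(), and only the groups that actually contain NAs would hold a real count. Running the same length check on `sum_stats` should show which species those are.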