1

I would like to calculate simple summary metrics for subsets of certain columns in a data frame, where the subsets are based on information in other columns of the same data frame. Let me illustrate:

colA <- c(NA,2,3,NA,NA,3,9,5,6,1)
colB <- c(9,3,NA,2,2,4,6,1,9,9)
colC <- c(NA,NA,5,7,3,9,8,1,2,3)
colAA <- c(NA,NA,6,NA,NA,NA,1,7,9,4)
colBB <- c(NA,2,NA,7,8,NA,2,7,9,4)
colCC <- c(NA,NA,3,7,5,8,9,9,NA,3)

df <- data.frame(colA,colB,colC,colAA,colBB,colCC)

> df
   colA colB colC colAA colBB colCC
1    NA    9   NA    NA    NA    NA
2     2    3   NA    NA     2    NA
3     3   NA    5     6    NA     3
4    NA    2    7    NA     7     7
5    NA    2    3    NA     8     5
6     3    4    9    NA    NA     8
7     9    6    8     1     2     9
8     5    1    1     7     7     9
9     6    9    2     9     9    NA
10    1    9    3     4     4     3

Here colAA should be subsetted by colA so that rows containing NAs in colA are removed:

> df1 <- subset(df, !is.na(colA))
> df1
   colA colB colC colAA colBB colCC
2     2    3   NA    NA     2    NA
3     3   NA    5     6    NA     3
6     3    4    9    NA    NA     8
7     9    6    8     1     2     9
8     5    1    1     7     7     9
9     6    9    2     9     9    NA
10    1    9    3     4     4     3

Now I would like to calculate e.g. column length and percentage of non-NA values within the column:

> length(df1$colAA)
[1] 7
> (nrow(subset(df1, !is.na(colAA)))/length(df1$colAA))*100
[1] 71.42857

In an ideal world, the output would be written to another data frame, e.g.:

cat n perc_n
1 colAA 7     71
2 colBB 9     78
3 colCC 8     88

Any way to achieve this for all columns in a slighty more elegant/efficient manner? Any suggestions will be much appreciated!

M.Teich
  • 575
  • 5
  • 22
  • 2
    `Map(function(x,y) summary(y[!is.na(x)]), df[,1:3], df[, 4:6])` where you replace `summary` with whatever you want. With `summary`, you could follow it with `data.table::rbindlist(lapply(res, as.list), id="col")` to get a table... – Frank Jul 18 '18 at 15:22
  • 1
    Don't do `data.frame(cbind(.))`, just `data.frame(.)` will do it. – Rui Barradas Jul 18 '18 at 15:37
  • 2
    @Frank: Thanks, that works for me! If you repost your comment as an answer, I'll tag it accordingly. – M.Teich Jul 18 '18 at 20:19
  • @RuiBarradas: True. I've edited the post accordingly. – M.Teich Jul 18 '18 at 20:21

1 Answers1

2

You can pass the two sets of columns to Map:

res = Map(function(x,y) summary(y[!is.na(x)]), df[,1:3], df[, 4:6])

Since the post is tagged with data.table, I'd also recommend making a table like

data.table::rbindlist(lapply(res, as.list), id="col")

#     col Min. 1st Qu. Median  Mean 3rd Qu. Max. NA's
# 1: colA    1       4      6 5.400     7.0    9    2
# 2: colB    2       3      7 5.571     7.5    9    2
# 3: colC    3       4      7 6.286     8.5    9    1

You can replace summary with whatever function you like that returns a named vector and it should still work with as.list + rbindlist.

Frank
  • 66,179
  • 8
  • 96
  • 180
  • If I replace `summary` with e.g. `length`, I get the following error message: `STRING_ELT() can only be applied to a 'character vector', not a 'NULL'`. Obviously the output no longer matches the expected format. Any chance you can help me out? – M.Teich Jul 19 '18 at 11:29
  • 1
    @M.Teich Yeah, it needs to be a *named* vector (at least for this approach), so you can do `res = Map(function(x,y) c(n = length(y[!is.na(x)])), df[,1:3], df[, 4:6])`. Alternately, you could skip the as.list in the second step and do `res = Map(function(x,y) list(n = length(y[!is.na(x)])), df[,1:3], df[, 4:6]); rbindlist(res, id = "col")` – Frank Jul 19 '18 at 12:14
  • I'm trying to adapt your code to a similar problem, where I would like to calculate median by group, using `aggregate`. Assuming A and B are grouping variables not shown in the above example, colA:colC contain values (without NAs) and colAA:colCC identify rows to be kept (i.e. rows with NAs are dropped) I have tried this: `Map(function(x,y) aggregate(.~A+B,df[!is.na(y)],FUN=median), x=df[,3:5], y=df[, 6:8])`. This produces an error. Any suggestions on how to do this correctly? Thanks! – M.Teich Aug 28 '18 at 14:59
  • @M.Teich I'm not sure. I rarely use aggregate, but maybe if you post a new question, someone else can figure it out. – Frank Aug 28 '18 at 20:27
  • Cheers, will do. – M.Teich Aug 29 '18 at 06:10