R Aggregate with a yet undefined range of columns (including factors)

Question

I probably lack the right words to find my answer using the search function. I will have a dataset with an as-yet-unknown number of columns, because the columns are produced by work in another program (MaxQDA), and later changes there will change the number of variables in the dataset. However, the dataset has a clear structure: 6 variables at the beginning (including the below-mentioned code, a factor variable, and the year), and from the 7th column onward all the other variables that result from the work in MaxQDA.

So I would like a flexible way to refer to columns 7 to N in an aggregate call, to replace the dot in the following code, which, as I understand it, stands for all columns.

dataset2 <- aggregate(. ~ code+jahr, 
                   data = dataset, 
                   sum, 
                   na.action=na.pass
)

Suggestions from here do not help, as I don't know how to carry code+jahr over into the other suggested ways of writing the aggregate call.

Addendum: Or, put differently: I wish to exclude a few columns from the aggregate function while summing up a range of other columns.

Since there was confusion about vector types: I have some factor data like ID and Name. The data would look like this:

set.seed(42)
test2 <- as.data.frame(matrix(sample(16 * 4, replace=TRUE), ncol=16, nrow=4))
code <-c("aaa", "bbb","aaa", "ddd")
jahr <- c("1990", "1993", "2007", "2020")
id <- c("id1", "id2", "id3", "id4")
Name <- c("bla", "bla2", "bla3", "bla4")
test <- data.frame(code, jahr, id, Name)
dataset <- data.frame(test, test2)
dataset[1:4] <- lapply(dataset[, 1:4], as.factor)

Can you do `.[,7:n]` to call all the columns? Perhaps preface with `n <- ncol(dataset)`? — dyrland, Nov 10 '20 at 14:23
first suggestion gives me: "Error in eval(predvars, data, env) : object '.' not found". Same for the second suggestion. Even if I try to make n numeric. — slinel, Nov 10 '20 at 14:26

dcarlson · Accepted Answer · 2020-11-16T18:32:02.640


Using the dataset above, we want to remove id and Name from the aggregation, since they are factors that are not used to define groups. The simplest way to do that is to drop those columns from the data:

dataset2 <- aggregate(. ~ code+jahr, data = dataset[ , -(3:4)], sum, na.action=na.pass)
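
If the value columns always run from a known position to the last column, as the question describes, the range can also be computed with ncol() instead of being typed out. The following is only a sketch of one possible way to do that, using the data-frame method of aggregate() rather than the formula method shown above. The name first_value_col is made up for illustration; it is 5 for the sample data and would presumably be 7 for the layout described in the question.

```r
# Rebuild the sample data from the question.
set.seed(42)
test2 <- as.data.frame(matrix(sample(16 * 4, replace = TRUE), ncol = 16, nrow = 4))
dataset <- data.frame(
  code = c("aaa", "bbb", "aaa", "ddd"),
  jahr = c("1990", "1993", "2007", "2020"),
  id   = c("id1", "id2", "id3", "id4"),
  Name = c("bla", "bla2", "bla3", "bla4"),
  test2
)
dataset[1:4] <- lapply(dataset[1:4], as.factor)

# Assumption: the value columns start at a fixed position and run to the end.
# 5 matches the sample data; the question's real data would use 7.
first_value_col <- 5
value_cols <- first_value_col:ncol(dataset)

# Data-frame method: sum the value columns within each code/jahr group.
dataset3 <- aggregate(dataset[value_cols],
                      by = dataset[c("code", "jahr")],
                      FUN = sum)
```

Because the value columns and the grouping columns are passed separately, id and Name never enter the call. Note that this method omits rows with a missing value in a grouping column, while missing values in the value columns are passed on to sum as they are.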

A slightly more complicated method is to build a logical vector that marks the columns to keep: the grouping variables plus every numeric column. Factors that are not used for grouping are then dropped automatically. The main advantages are that you do not have to figure out column numbers and that it is relatively simple to change the grouping variables:

keep <- colnames(dataset) %in% c("code", "jahr") | sapply(dataset, is.numeric)
dataset2 <- aggregate(. ~ code+jahr, data = dataset[, keep], sum, na.action=na.pass)

Both produce the same result.
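
To illustrate how the second approach adapts when the grouping variables change, here is a minimal sketch that wraps it in a function. The helper sum_by and its argument names are hypothetical, not part of base R or of the code above; the formula is assembled from the grouping names with as.formula().

```r
# Rebuild the sample data from the question.
set.seed(42)
test2 <- as.data.frame(matrix(sample(16 * 4, replace = TRUE), ncol = 16, nrow = 4))
dataset <- data.frame(
  code = c("aaa", "bbb", "aaa", "ddd"),
  jahr = c("1990", "1993", "2007", "2020"),
  id   = c("id1", "id2", "id3", "id4"),
  Name = c("bla", "bla2", "bla3", "bla4"),
  test2
)
dataset[1:4] <- lapply(dataset[1:4], as.factor)

# Hypothetical helper: sum all numeric columns within the given groups.
sum_by <- function(data, group_vars) {
  # Keep the grouping columns plus every numeric column.
  keep <- names(data) %in% group_vars | sapply(data, is.numeric)
  # Build a formula such as `. ~ code + jahr` from the grouping names.
  f <- as.formula(paste(". ~", paste(group_vars, collapse = " + ")))
  aggregate(f, data = data[, keep, drop = FALSE], FUN = sum, na.action = na.pass)
}

by_code_jahr <- sum_by(dataset, c("code", "jahr"))
by_code      <- sum_by(dataset, "code")  # jahr is a factor, so it is dropped here
```

With the sample data, grouping by code alone merges the two "aaa" rows, so by_code has three rows instead of four.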

edited Nov 16 '20 at 18:32

answered Nov 10 '20 at 18:16


Hi, thank you! That does not work, since it turns my factors into characters, which is not allowed by the function and gives the following error: Error in FUN(X[[i]], ...) : invalid 'type' (character) of argument. However, having "cols" as data.frame does not work either, "Error in model.frame.default(formula = cols ~ code + jahr, data = datI, : invalid type (list) for variable 'cols'" – slinel Nov 13 '20 at 10:49
You did not mention factors in your data other than the ones defining the subsets. You will have to exclude from `aggregate` any factors that are not used to subset the data. Provide a sample of your data if you want suggestions that can be tested first. – dcarlson Nov 13 '20 at 17:14
I did not know that classes would be a problem. See the test data now added to my original post. – slinel Nov 16 '20 at 08:12
