2

I've been banging my head against a brick wall for days on this issue; I wonder if anyone can see what is wrong with my code, or tell me if I am overlooking something obvious.

I have this data.frame, where most columns are vectors, either numerical or character, and one column is a list of character vectors:

t0g2 <- structure(list(P = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 
4, 4, 5, 5, 5, 5), ID = c(8, 10, 7, 9, 5, 2, 3, 4, 8, 9, 1, 2, 
8, 1, 4, 10, 4, 10, 2, 7), SC = c("A", "D", "A", "B", "B", "A", 
"A", "E", "A", "B", "D", "A", "A", "D", "E", "D", "E", "D", "A", 
"A"), FP = list(`40,41,37,8,11` = c("40", "41", "37", "8", "11"
), `49,28,16,41` = c("49", "28", "16", "41"), `15,49` = c("15", 
"49"), `27,12,20,35,45` = c("27", "12", "20", "35", "45"), `1,34,43,37` = c("1", 
"34", "43", "37"), `41,7,30,2,34,43` = c("41", "7", "30", "2", 
"34", "43"), `22,35,31,10,3` = c("22", "35", "31", "10", "3"), 
    `29,6,15` = c("29", "6", "15"), `40,41,37,8,11` = c("40", 
    "41", "37", "8", "11"), `27,12,20,35,45` = c("27", "12", 
    "20", "35", "45"), `10,49,28` = c("10", "49", "28"), `41,7,30,2,34,43` = c("41", 
    "7", "30", "2", "34", "43"), `40,41,37,8,11` = c("40", "41", 
    "37", "8", "11"), `10,49,28` = c("10", "49", "28"), `29,6,15` = c("29", 
    "6", "15"), `49,28,16,41` = c("49", "28", "16", "41"), `29,6,15` = c("29", 
    "6", "15"), `49,28,16,41` = c("49", "28", "16", "41"), `41,7,30,2,34,43` = c("41", 
    "7", "30", "2", "34", "43"), `15,49` = c("15", "49"))), class = "data.frame", row.names = c("8", 
"10", "7", "9", "5", "2", "3", "4", "81", "91", "1", "21", "82", 
"11", "41", "101", "42", "102", "22", "71"))

I want to aggregate it by one of the columns, with the function for the other columns being simply the concatenation of unique values. [Yes, I know this can be done with many ad hoc packages, but I need to do it with base R].

This works perfectly well if I choose numeric column "ID" as the column to aggregate on:

aggregate(x=t0g2[, !(colnames(t0g2) %in% c("ID"))], by=list(ID=t0g2[["ID"]]), 
          FUN=function(y) unique(unlist(y)))
#  ID       P SC                   FP
#1  1    3, 4  D           10, 49, 28
#2  2 2, 3, 5  A 41, 7, 30, 2, 34, 43
#3  3       2  A    22, 35, 31, 10, 3
#4  4 2, 4, 5  E            29, 6, 15
#5  5       2  B        1, 34, 43, 37
#6  7    1, 5  A               15, 49
#7  8 1, 3, 4  A    40, 41, 37, 8, 11
#8  9    1, 3  B   27, 12, 20, 35, 45
#9 10 1, 4, 5  D       49, 28, 16, 41

or with character column "SC":

aggregate(x=t0g2[, !(colnames(t0g2) %in% c("SC"))], by=list(SC=t0g2[["SC"]]), 
          FUN=function(y) unique(unlist(y)))
#  SC             P         ID                                                             FP
#1  A 1, 2, 3, 4, 5 8, 7, 2, 3 40, 41, 37, 8, 11, 15, 49, 7, 30, 2, 34, 43, 22, 35, 31, 10, 3
#2  B       1, 2, 3       9, 5                              27, 12, 20, 35, 45, 1, 34, 43, 37
#3  D    1, 3, 4, 5      10, 1                                             49, 28, 16, 41, 10
#4  E       2, 4, 5          4                                                      29, 6, 15

However, if I try with "P", which as far as I know is just another numerical column, this is what I get:

aggregate(x=t0g2[, !(colnames(t0g2) %in% c("P"))], by=list(P=t0g2[["P"]]), 
          FUN=function(y) unique(unlist(y)))
#   P ID.1 ID.2 ID.3 ID.4 SC.1 SC.2 SC.3                                                                  FP
#1  1    8   10    7    9    A    D    B               40, 41, 37, 8, 11, 49, 28, 16, 15, 27, 12, 20, 35, 45
#2  2    5    2    3    4    B    A    E           1, 34, 43, 37, 41, 7, 30, 2, 22, 35, 31, 10, 3, 29, 6, 15
#3  3    8    9    1    2    A    B    D 40, 41, 37, 8, 11, 27, 12, 20, 35, 45, 10, 49, 28, 7, 30, 2, 34, 43
#4  4    8    1    4   10    A    D    E                        40, 41, 37, 8, 11, 10, 49, 28, 29, 6, 15, 16
#5  5    4   10    2    7    E    D    A                         29, 6, 15, 49, 28, 16, 41, 7, 30, 2, 34, 43

Does anybody know what is going on, why this happens? Literally going mental with this stuff...


EDIT: adding an example of the desired output from aggregating on "P", as requested by jay.sf.

#  P          ID      SC                                                                  FP
#1 1 8, 10, 7, 9 A, D, B               40, 41, 37, 8, 11, 49, 28, 16, 15, 27, 12, 20, 35, 45
#2 2  5, 2, 3, 4 B, A, E           1, 34, 43, 37, 41, 7, 30, 2, 22, 35, 31, 10, 3, 29, 6, 15
#3 3  8, 9, 1, 2 A, B, D 40, 41, 37, 8, 11, 27, 12, 20, 35, 45, 10, 49, 28, 7, 30, 2, 34, 43
#4 4 8, 1, 4, 10 A, D, E                        40, 41, 37, 8, 11, 10, 49, 28, 29, 6, 15, 16
#5 5 4, 10, 2, 7 E, D, A                         29, 6, 15, 49, 28, 16, 41, 7, 30, 2, 34, 43

In fact, I found out that by setting simplify=F in aggregate, it works as I want.
I hope this won't backfire.


EDIT 2: it did backfire...

I don't want all my columns to become lists even when they can be vectors, but with simplify = F they do become lists:

sapply(aggregate(x=t0g2[,!(colnames(t0g2) %in% c("P"))],by=list(P=t0g2[["P"]]),FUN=function(y) unique(unlist(y)), simplify = F),class)
#        P        ID        SC        FP 
#"numeric"    "list"    "list"    "list" 

sapply(aggregate(x=t0g2[,!(colnames(t0g2) %in% c("ID"))],by=list(ID=t0g2[["ID"]]),FUN=function(y) unique(unlist(y)), simplify = T),class)
#         ID           P          SC          FP 
#  "numeric"      "list" "character"      "list" 

sapply(aggregate(x=t0g2[,!(colnames(t0g2) %in% c("ID"))],by=list(ID=t0g2[["ID"]]),FUN=function(y) unique(unlist(y)), simplify = F),class)
#       ID         P        SC        FP 
#"numeric"    "list"    "list"    "list" 

So I still don't have a solution... :(


EDIT 3: maybe a viable (if rather clumsy) solution?

t0g2_by_ID <- aggregate(x=t0g2[,!(colnames(t0g2) %in% c("ID"))],by=list(ID=t0g2[["ID"]]),FUN=function(y) unique(unlist(y)), simplify = F)

sapply(t0g2_by_ID,class)
#       ID         P        SC        FP 
#"numeric"    "list"    "list"    "list" 

for (i in 1:NCOL(t0g2_by_ID)) {y = t0g2_by_ID[,i]; if ((class(y) == "list") & (length(y) == length(unlist(y)))) {t0g2_by_ID[,i] <- unlist(y)} }

sapply(t0g2_by_ID,class)
#       ID           P          SC          FP 
#"numeric"      "list" "character"      "list" 

I tried to obviate to the inelegant loop using sapply, but then any cbind operation goes back to a data.frame of lists.

This is the best I can come up with.

If anyone can suggest how to do this better using only base R, that'd be great.

user6376297
  • 575
  • 2
  • 15
  • 2
    What would be the desired output? Please show a sample. – jay.sf Jun 19 '19 at 08:12
  • I think they have to be lists because the cells contain vectors with n > 1. What about turning them into strings with `paste` + `collapse = ", "`? –  Jun 19 '19 at 09:46
  • mmh... but then why when I aggregate on "ID" with `simplify=F` the "SC" column is a vector? – user6376297 Jun 19 '19 at 10:02
  • When I run your above code with `simplify = F`, then `SC` is returned as a list column of character vectors, each with n = 1, and *not* as an atomic vector. If you mean `simplify = T`, then yes, an atomic vector is returned, but only because the length of each element of `SC` == 1. Also, aggregating on `P` returns vectors with n > 1, so the only way around list columns in this case is to combine the vector values into strings. See my edited answer for an example. –  Jun 19 '19 at 10:18
  • Yes, sorry, I meant `simplify = T`. Thanks for your answer, but I don't want to convert what was originally a numeric vector to a character vector separated by ',', it's really not usable for my purposes. I may have come up with something else though. – user6376297 Jun 19 '19 at 11:44
  • You don't need a for loop to convert the column classes when the same can be achieved using `simplify = T` in `aggregate`. You're adding an unnecessary step. The for loop also doesn't fix the issue with `P` because in that case `length(y)` *does not* equal ` length(unlist(y))` for every column but `P`. So `sapply(t0g2_by_P, class)`, where `t0g2_by_P` contains data aggregated by `P`, returns `"numeric" "list" "list" "list"` before and after the loop. `SC` is returned as an atomic vector when `simplify = T` because each `ID` corresponds to exactly one unique value of `SC`. –  Jun 19 '19 at 12:47
  • I suspect we are not understanding each other. I showed that `aggregate(x=t0g2[, !(colnames(t0g2) %in% c("P"))], by=list(P=t0g2[["P"]]), FUN=function(y) unique(unlist(y)),simplify=T)` does not give me the output I need because some columns are expanded to matrices instead of becoming lists of vectors. I showed that `simplify=F` does not do what I need either, because while it 'fixes' the problem when I aggregate by "P", it creates another problem when I aggregate e.g. by "ID", as it makes lists even for columns that could be vectors. My final method works; if it can be improved, I am all ears. – user6376297 Jun 19 '19 at 13:20
  • Hi @user6376297 I think I understand *what* you are trying to do, I just don't quite understand *why*. Do you need something more flexible, i.e. for an unknown number of columns? In that case you should be explicit about that in your question. A recommendation: Your question has expanded a bit. I would reduce your question to your original question ("Why is aggregate behaving differently…"), accept one of the answers below so others can learn from it, and then ask a new question along the lines of "How can I control `aggregate` to simplify selectively…" with a reference to this question. –  Jun 20 '19 at 07:59

2 Answers2

1

aggregate obviously tries to give a matrix where this is possible. See This example:

# data
n <- 10
df <- data.frame(id= rep(1:2, each= n/2),
             value= 1:n)

length(unique(df$value[df$id == 1])) == length(unique(df$value[df$id == 2]))
TRUE

Here the length of unique is same for every id value, thus aggregate provides a matrix

aggregate(x= df[, "value"], by=list(id=df[, "id"]), 
      FUN=function(y) unique(unlist(y)))
   id x.1 x.2 x.3 x.4 x.5
1  1   1   2   3   4   5
2  2   6   7   8   9  10

Now we change data so that length of unique per id is not equal

df$value[2] <- 1
length(unique(df$value[df$id == 1])) == length(unique(df$value[df$id == 2]))
FALSE

In this case we get an output with values separated by ,:

aggregate(x= df[, "value"], by=list(id=df[, "id"]), 
      FUN=function(y) unique(unlist(y)))
  id              x
1  1     1, 3, 4, 5
2  2 6, 7, 8, 9, 10

In your case you have for every P value exactly 4 unique ID values and exactly 3 unique SC values, hence, aggregate shows those results as a matrix. This is not true for FP: here aggregate can't provide a matrix, hence, we get the values separated by ,

  • Thanks schwantke, but I can't change my data. I edited my post with the observation that `simplify=F` seems to do the job. Fingers crossed. – user6376297 Jun 19 '19 at 09:24
0

aggregate has an argument simplify that is TRUE by default, which means it tries to simplify to a vector or matrix when possible. All groups in P have n = 4, so your aggregated data is being simplified to a matrix. Just set simpflify = FALSE to change this behavior:

aggregate(x=t0g2[, !(colnames(t0g2) %in% c("P"))], by=list(P=t0g2[["P"]]), 
          FUN=function(y) unique(unlist(y)), simplify = F)

#### OUTPUT ####

  P          ID      SC                                                                  FP
1 1 8, 10, 7, 9 A, D, B               40, 41, 37, 8, 11, 49, 28, 16, 15, 27, 12, 20, 35, 45
2 2  5, 2, 3, 4 B, A, E           1, 34, 43, 37, 41, 7, 30, 2, 22, 35, 31, 10, 3, 29, 6, 15
3 3  8, 9, 1, 2 A, B, D 40, 41, 37, 8, 11, 27, 12, 20, 35, 45, 10, 49, 28, 7, 30, 2, 34, 43
4 4 8, 1, 4, 10 A, D, E                        40, 41, 37, 8, 11, 10, 49, 28, 29, 6, 15, 16
5 5 4, 10, 2, 7 E, D, A                         29, 6, 15, 49, 28, 16, 41, 7, 30, 2, 34, 43
  • 1
    Thanks gersht, I had just found the same by trial and error, after schwantke mentioned that aggregate 'tries to give a matrix', which was reminiscent of simplification to me. – user6376297 Jun 19 '19 at 09:26
  • @user6376297 okay, I see that I answered seconds before you commented schwantke's answer. It might still be a good idea to accept an answer so anyone else with the same question can see what worked for you. You can also write your own answer and accept it. –  Jun 19 '19 at 09:38
  • I edited my answer with the simplify = F detail before answering anyone's comment. As for accepting, I can't do that because so far I don't have a valid solution. See last edit to the post. – user6376297 Jun 19 '19 at 09:40
  • As I mentioned in the above comment, the fact that "P" becomes a character vector is not good for my purposes. I will edit my original post with what seems to be a solution, although a clumsy one. – user6376297 Jun 19 '19 at 11:46