0

I want to take a dataframe of participant IDs and the languages they speak, then create a new column which sums all of the languages spoken by each participant. The columns are the ID, each language with 0 = "does not speak" and 1 = "does speak", including a column for "Other", then a separate column which specifies what this other language is, "Other.Lang". I want to subset just the columns which have binary values and create this new column with the sums for each participant.

First here is my dataframe.


      Participant.Private.ID French Spanish Dutch Czech Russian Hebrew Chinese German Italian Japanese Korean Portuguese Other Other.Lang
    1                5133249      0       0     0     0       0      0       0      0       0        0      0          0     0          0
    2                5136082      0       0     0     0       0      0       0      0       0        0      0          0     0          0
    3                5140442      0       1     0     0       0      0       0      0       0        0      0          0     0          0
    4                5141991      0       1     0     0       0      0       0      0       1        0      0          0     0          0
    5                5143476      0       0     0     0       0      0       0      0       0        0      0          0     0          0
    6                5145250      0       0     0     0       0      0       0      0       0        0      0          0     1      Malay
    7                5146081      0       0     0     0       0      0       0      0       0        0      0          0     0          0

Here is the structure:


    str(part_langs)
    
    grouped_df [7 x 15] (S3: grouped_df/tbl_df/tbl/data.frame)
     $ Participant.Private.ID: num [1:7] 5133249 5136082 5140442 5141991 5143476 ...
     $ French                : num [1:7] 0 0 0 0 0 0 0
     $ Spanish               : num [1:7] 0 0 1 1 0 0 0
     $ Dutch                 : num [1:7] 0 0 0 0 0 0 0
     $ Czech                 : num [1:7] 0 0 0 0 0 0 0
     $ Russian               : num [1:7] 0 0 0 0 0 0 0
     $ Hebrew                : num [1:7] 0 0 0 0 0 0 0
     $ Chinese               : num [1:7] 0 0 0 0 0 0 0
     $ German                : num [1:7] 0 0 0 0 0 0 0
     $ Italian               : num [1:7] 0 0 0 1 0 0 0
     $ Japanese              : num [1:7] 0 0 0 0 0 0 0
     $ Korean                : num [1:7] 0 0 0 0 0 0 0
     $ Portuguese            : num [1:7] 0 0 0 0 0 0 0
     $ Other                 : num [1:7] 0 0 0 0 0 1 0
     $ Other.Lang            : chr [1:7] "0" "0" "0" "0" ...
     - attr(*, "groups")= tibble [7 x 2] (S3: tbl_df/tbl/data.frame)
      ..$ Participant.Private.ID: num [1:7] 5133249 5136082 5140442 5141991 5143476 ...

I thought that this should work:


    num <- part_langs %>%
      mutate(num.langs = rowSums(part_langs[2:14]))
    num

However, I keep getting this error message:


    Error: Problem with `mutate()` input `num.langs`.
    x Input `num.langs` can't be recycled to size 1.
    i Input `num.langs` is `rowSums(part_langs[2:14])`.
    i Input `num.langs` must be size 1, not 7.
    i The error occurred in group 1: Participant.Private.ID = 5133249.

What is really strange is that when I try to create a simplified version of this problem to create a reproducible example, it works fine.

First I create a dataset.


    test <- matrix(c(1, 1, 1, 0, 0, "",
                   2, 1, 0, 1, 0, "",
                   3, 0, 0, 0, 1, "Chinese"), ncol = 6, byrow=TRUE)
    
    test<-as.data.frame(test)
    
    colnames(test) <- c("ID", "English", "French", "Italian", "Other", "Other.Lang")
    
    str(test)

Converting binary columns to numeric:


    test$ID <- as.numeric(test$ID)
    test$English <- as.numeric(test$English)
    test$French <- as.numeric(test$French)
    test$Italian <- as.numeric(test$Italian)
    test$Other <- as.numeric(test$Other)

Here's the same code as above, but with this simplified data set.


    num <- test %>%
      mutate(num.langs = rowSums(test[2:5]))
    num

Here is the output. It works exactly as I want:


    "ID","English","French","Italian","Other","Other.Lang","num.langs"
     1,     1,        1,       0,        0,        "",         2
     2,     1,        0,       1,        0,        "",         2
     3,     0,        0,       0,        1,     "Chinese",     1

So I know I have screwed up somewhere in my real data, but I can't understand where. Could anyone advise?

  • The difference in result might be due to the fact that `part_langs` is `grouped_df [7 x 15] (S3: grouped_df/tbl_df/tbl/data.frame)`. Maybe `ungroup`, like so: `library(dplyr); part_langs <- part_langs %>% ungroup`? – Chris Ruehlemann Nov 25 '21 at 16:17
  • Or avoid the using mutate and and try `part_langs$num.langs = rowSums(part_langs[2:14])` – Dave2e Nov 25 '21 at 16:20
  • @ChrisRuehlemann You were right! Adding in the ungroup code and running it again fixes this problem. Thanks very much! – talkingood Nov 25 '21 at 16:28
  • 1
    @Dave2e You're absolutely right, this is by far the most sensible way to approach it...but I was frustrated with not knowing why mutate() wasn't working the way I thought it ought to, and I had to understand it first. I appreciate your suggestion too. – talkingood Nov 25 '21 at 16:29
  • If it helped I'm going to add this comment as an asnwer hoping you will accept/upvote it! – Chris Ruehlemann Nov 25 '21 at 16:33

2 Answers2

0

A different approach which relies more on dplyrwould be to use rowwise and c_across:

test %>%
  rowwise() %>%
  mutate(num.lang = sum(c_across(English:Other)))
Cettt
  • 11,460
  • 7
  • 35
  • 58
0

The difference in result might be due to the fact that part_langs is a grouped dataframe, as can be seen from the output of strshown in your post:

grouped_df [7 x 15] (S3: grouped_df/tbl_df/tbl/data.frame). 

If this is the reason, then ungroup first and rerun your code:

library(dplyr)
part_langs <- part_langs %>% ungroup
Chris Ruehlemann
  • 20,321
  • 4
  • 12
  • 34