
I'm trying to use rowSums to count, for each row, how many values in a set of columns are equal to each other.

Here is an example of my data frame, which is based on surveys. Rows refer to participants, and each column holds the date of birth of one of the participant's children.

  b3_01 b3_02 b3_03 b3_04 b3_05 b3_06
1  1360  1360  1266  1228  1181  1158    
2  1362  1342  1301  1264  1245  1191 
3  1379    NA    NA    NA    NA    NA  
4  1355  1330  1293  1293  1227  1208  
5  1391  1371  1358  1334  1311  1311
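For anyone who wants to reproduce this, the sample above can be rebuilt roughly as follows. This is only a sketch: the values are copied from the table, and `df` is simply the name I use for the data frame further down.

```r
# Rebuild the five example rows shown above; NA marks a missing birth date.
df <- data.frame(
  b3_01 = c(1360, 1362, 1379, 1355, 1391),
  b3_02 = c(1360, 1342, NA, 1330, 1371),
  b3_03 = c(1266, 1301, NA, 1293, 1358),
  b3_04 = c(1228, 1264, NA, 1293, 1334),
  b3_05 = c(1181, 1245, NA, 1227, 1311),
  b3_06 = c(1158, 1191, NA, 1208, 1311)
)
print(df)
```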

Here, identical dates within a row indicate twins. I would like to create a new column that tells me, for each row, how many times a value is repeated across those columns. That would give me something like:

  b3_01 b3_02 b3_03 b3_04 b3_05 b3_06 twins
1  1360  1360  1266  1228  1181  1158     1
2  1362  1342  1301  1264  1245  1191     0
3  1379    NA    NA    NA    NA    NA     0
4  1355  1330  1293  1293  1227  1208     1
5  1391  1371  1358  1334  1311  1311     1

EDIT: Sorry, I forgot to say that if any number appears 3 or more times, it should not be counted as a twin. The end goal is to have 4 columns: one for singletons (numbers that appear only once), one for twins (numbers that appear exactly twice), one for triplets (numbers that appear three times), and one for quadruplets.

I'm working with dplyr. As the data.frame is very large, I need to specify the range of columns I want the comparison to be done on. I have tried the following code, along with variants:

twins<-df%>%
  mutate(twins= rowSums(select(.,starts_with("b3_")) == select(.,starts_with("b3_")),na.rm=TRUE))

This does not work. I have tried other functions too, but could not figure out a solution.

Do you have any idea on how to achieve this? I feel like the solution is simple, but I am an absolute beginner in R.

  • If you have a row with 3 same numbers (all others distinct), will you count them as one twin or as two? More generally, if a row contains the same value `n` times, do you want to count `n-1` twins or 1 twin? – Jonas Feb 05 '21 at 15:52
  • I would not count them as twins. Ideally I want to create 4 new columns: one for singletons (every number appears once), one for twins (if any number appears ONLY twice), one for triplets (if any number appears three times), and one for quadruplets. Does that validate or invalidate your answer? Thanks for it, by the way – Maxime Besson Feb 05 '21 at 20:15
  • Then the logic with `table` in my answer is quite nice. You will get the number of singletons as `sum(table==1)`, the twins as `sum(table==2)` and the triplets as `sum(table==3)`. I will update my answer. – Jonas Feb 05 '21 at 20:27

4 Answers


An easy solution is

twins <- df%>%
  mutate(twins = apply(., 1, function(x) sum(duplicated(x, incomparables=NA))))
Taufi
  • I quite like your solution! However, do you know a way to count values duplicated twice only? I would like to create separate columns for values repeated two, three or four times. (see EDIT) – Maxime Besson Feb 05 '21 at 22:06

Referring to my comment and assuming that n same values in a row are counted as n-1 twins, define

countTwins <- function(row) {
  # drop missing values first, otherwise repeated NAs are counted as twins
  row <- row[!is.na(row)]
  length(row)-length(unique(row))
}

and get the column twins as

twinCol <- apply(df,1,countTwins)

If you want to count n same values as 1 twin, use instead the function

countTwins2 <- function(row) {
  sum(table(unname(unlist(row)))>1)
}
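To make the difference between the two conventions concrete, here is a small sketch on a made-up row (not taken from the question's data) in which one value appears three times. The names `vals`, `n_minus_one` and `one_per_value` are only for illustration.

```r
# Hypothetical row: one value appears three times, the others once.
row <- c(1360, 1360, 1360, 1228, 1181, NA)

# Drop missing dates before comparing lengths.
vals <- row[!is.na(row)]

# Convention 1: n equal values count as n - 1 twins.
n_minus_one <- length(vals) - length(unique(vals))

# Convention 2: n equal values count as a single twin.
# table() ignores NA by default, so no filtering is needed here.
one_per_value <- sum(table(row) > 1)

print(c(n_minus_one = n_minus_one, one_per_value = one_per_value))
```

Under the first convention the triple contributes 2, under the second it contributes 1.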

Update according to my comment:

countSinglesTwinsAndTriplets <- function(row) {
  tt <- table(unname(unlist(row)))
  c(sum(tt==1),sum(tt==2),sum(tt==3)) #nr of singletons,twins,triplets
}

addCols <- setNames(data.frame(t(apply(df,1,countSinglesTwinsAndTriplets))),c("singletons","twins","triplets"))
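The same idea should extend to quadruplets, and it can be restricted to the `b3_` columns when the data frame holds other variables. The following is a minimal sketch in base R under those assumptions; the toy data frame, its `id` column and the name `countMultiples` are made up for illustration.

```r
# Sketch: count how many distinct values occur exactly 1, 2, 3 or 4 times
# per row, looking only at the columns whose names start with "b3_".
# The small data frame and the `id` column are hypothetical.
df <- data.frame(
  id    = 1:3,
  b3_01 = c(1360, 1379, 1400),
  b3_02 = c(1360, NA,   1400),
  b3_03 = c(1266, NA,   1400),
  b3_04 = c(1228, NA,   1400)
)

countMultiples <- function(row) {
  tt <- table(unname(unlist(row)))   # table() drops NA by default
  c(singletons  = sum(tt == 1),
    twins       = sum(tt == 2),
    triplets    = sum(tt == 3),
    quadruplets = sum(tt == 4))
}

# Keep only the b3_ columns, apply row-wise, and bind the counts back on.
b3 <- df[, startsWith(names(df), "b3_"), drop = FALSE]
counts <- as.data.frame(t(apply(b3, 1, countMultiples)))
result <- cbind(df, counts)
print(result)
```

Selecting the columns before calling `apply` keeps unrelated variables such as `id` out of the comparison.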
Jonas

An additional solution, in base R:

df$twins <- apply(df, 1, function(x) length(x) - length(unique(x)) - sum(is.na(x)) + any(is.na(x)))

  b3_01 b3_02 b3_03 b3_04 b3_05 b3_06 twins
1  1360  1360  1266  1228  1181  1158     1
2  1362  1342  1301  1264  1245  1191     0
3  1379    NA    NA    NA    NA    NA     0
4  1355  1330  1293  1293  1227  1208     1
5  1391  1371  1358  1334  1311  1311     1
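As a rough illustration of why the two NA terms are there, the arithmetic can be traced on row 3 of the example. The intermediate names (`naive`, `n_na`, `na_unique`, `corrected`) are only for this sketch.

```r
# Row 3 of the example: one birth date and five missing values.
x <- c(1379, NA, NA, NA, NA, NA)

# Without any correction, the repeated NAs look like duplicates: 6 - 2 = 4.
naive <- length(x) - length(unique(x))

# Number of missing values in the row (5 here).
n_na <- sum(is.na(x))

# TRUE when NA is one of the unique values, so it must be added back once.
na_unique <- any(is.na(x))

# Non-missing values minus unique non-missing values: 4 - 5 + 1 = 0.
corrected <- naive - n_na + na_unique
print(corrected)
```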
Yuriy Saraykin

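Similar logic to the one used by @Taufi, but with purrr. Note that `pmap_int()` is used so that `twins` is an integer column rather than a list column:

library(dplyr)
library(purrr)
df %>%
 mutate(twins = pmap_int(across(everything()), ~ sum(duplicated(na.omit(c(...))))))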

  b3_01 b3_02 b3_03 b3_04 b3_05 b3_06 twins
1  1360  1360  1266  1228  1181  1158     1
2  1362  1342  1301  1264  1245  1191     0
3  1379    NA    NA    NA    NA    NA     0
4  1355  1330  1293  1293  1227  1208     1
5  1391  1371  1358  1334  1311  1311     1
tmfmnk