R - Compute Mismatch By Group

Question

I was wondering how I could compute mismatching cases by group.

Let us imagine that this is my data:

sek = rbind(c(1, 'a', 'a', 'a'), 
        c(1, 'a', 'a', 'a'), 
        c(2, 'b', 'b', 'b'), 
        c(2, 'c', 'b', 'b'))

colnames(sek) <- c('Group', paste('t', 1:3, sep = ''))

The data look like this:

     Group t1  t2  t3 
[1,] "1"   "a" "a" "a"
[2,] "1"   "a" "a" "a"
[3,] "2"   "b" "b" "b"
[4,] "2"   "c" "b" "b"

I would like to get something like:

Group 1 : 0 
Group 2 : 1

It would be nice to use the stringdist library to compute this.

Something like:

seqdistgroupStr = function(x) stringdistmatrix(x, method = 'hamming')

sek %>% 
  as.data.frame() %>% 
  group_by(Group) %>% 
  seqdistgroupStr()

But it is not working.

Any ideas?

Quick update: how would you handle weights? For example, how could I pass an argument, a value (1, 2, 3, ...), that sets the cost of a mismatch between two characters? For instance, the mismatch between b and c would cost 2 while the mismatch between a and c would cost 1, and so on.
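
A minimal base R sketch of one way to handle such weights. It assumes the mismatch is measured position by position between every pair of rows within a group. The cost values (in particular the a/b cost of 1) and the helper name `weighted_mismatch` are hypothetical, chosen only for illustration.

```r
# Rebuild the example data as a data frame (same values as above).
sek <- as.data.frame(rbind(c(1, 'a', 'a', 'a'),
                           c(1, 'a', 'a', 'a'),
                           c(2, 'b', 'b', 'b'),
                           c(2, 'c', 'b', 'b')),
                     stringsAsFactors = FALSE)
colnames(sek) <- c('Group', paste('t', 1:3, sep = ''))

# Hypothetical substitution costs: rows and columns are the states a, b, c.
states <- c('a', 'b', 'c')
cost <- matrix(c(0, 1, 1,
                 1, 0, 2,
                 1, 2, 0),
               nrow = 3, byrow = TRUE,
               dimnames = list(states, states))

# Sum the position-wise costs over every pair of rows in one group.
weighted_mismatch <- function(g) {
  m <- as.matrix(g[, -1, drop = FALSE])   # drop the Group column
  if (nrow(m) < 2) return(0)              # a single row has nothing to compare
  pairs <- combn(nrow(m), 2)              # each column is one row pair (i, j)
  sum(apply(pairs, 2, function(p) {
    # Look up cost[state_i, state_j] for each time point t1..t3.
    sum(cost[cbind(m[p[1], ], m[p[2], ])])
  }))
}

res <- sapply(split(sek, sek$Group), weighted_mismatch)
res
# Group 1 gives 0; Group 2 gives 2 (one b/c mismatch at t1).
```

With every off-diagonal cost set to 1, this reduces to a plain pairwise Hamming distance.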

so by "mismatch", you mean the number of rows for each group where at least one element is different? — user1357015, Jul 07 '15 at 23:36
I believe there is a typo in your definition of `sek`, the last line should be `c(2, 'c', 'b', 'b'))` — Alex, Jul 07 '15 at 23:49

score 6 · Answer 1 · answered Jul 07 '15 at 23:53

Here is another dplyr solution that does not require any transformation of the data into long/wide forms:

library(dplyr)
sek = rbind(c(1, 'a', 'a', 'a'), 
            c(1, 'a', 'a', 'a'), 
            c(2, 'b', 'b', 'b'), 
            c(2, 'c', 'b', 'b')) %>%
    data.frame

colnames(sek) <- c('Group', paste('t', 1:3, sep = ''))

sek %>% 
    group_by(Group) %>%
    distinct(t1, t2, t3) %>%
    summarise(number_of_mismatches = n() - 1)

Alex

I see you now use `distinct` ;) – Steven Beaupré Jul 07 '15 at 23:57
live and learn! note, I am using it in the way I think it should be sensibly used, not to find distinct cases of the grouping variables ;) – Alex Jul 07 '15 at 23:59
Nice solution. Avoids reshaping data. – Pierre L Jul 08 '15 at 00:00

eipi10 · Accepted Answer · 2015-07-08T00:18:11.860

The code below will give you the number of mismatches by group, where a mismatch is defined as one less than the number of unique values in each column t1, t2, etc. for each level of Group. I think you would need to bring in a string distance measure only if you need more than a binary measure of mismatch, but a binary measure suffices for the example you gave. Also, if all you want is the number of distinct rows in each group, then @Alex's solution is more concise.

library(dplyr)
library(reshape2)

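sek %>% as.data.frame %>%
  # Reshape to long form: one row per (Group, column, value)
  melt(id.vars = "Group") %>%
  group_by(Group, variable) %>%
  # Per group and column: number of unique values minus one
  summarise(mismatch = length(unique(value)) - 1) %>%
  group_by(Group) %>%
  # Add up the per-column mismatches within each group
  summarise(mismatch = sum(mismatch))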

  Group mismatch
1     1        0
2     2        1
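
The two definitions mentioned in this answer (per-column unique values versus distinct rows) agree on the example data, but they can diverge. A small base R sketch on a hypothetical extra group `g3`, which is not part of the original data, illustrates the difference:

```r
# Hypothetical extra group: two rows that differ in two of three columns.
g3 <- data.frame(Group = c('3', '3'),
                 t1 = c('a', 'b'),
                 t2 = c('a', 'b'),
                 t3 = c('a', 'a'),
                 stringsAsFactors = FALSE)

# Definition A: number of distinct rows minus one.
distinct_rows <- nrow(unique(g3)) - 1

# Definition B: per column, unique values minus one, summed over columns.
per_column <- sum(sapply(g3[-1], function(col) length(unique(col)) - 1))

c(distinct_rows = distinct_rows, per_column = per_column)
# distinct_rows is 1, per_column is 2
```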

Here's a shorter dplyr method to count individual mismatches. It doesn't require reshaping, but it requires other data gymnastics:

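# Note: summarise_each() and funs() are deprecated in current dplyr;
# across() is the replacement.
sek %>% as.data.frame %>%
  group_by(Group) %>%
  summarise_each(funs(length(unique(.)) - 1)) %>%
  mutate(mismatch = rowSums(.[-1])) %>%
  select(-matches("^t[1-3]$"))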
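
If a string distance is wanted after all, a possible sketch with the stringdist package follows. It assumes the package is installed, and that pasting each row into a single string and summing the pairwise Hamming distances within a group is what the question intends; that reading is an assumption. `stringdistmatrix()` takes a character vector, not a grouped data frame, which may be why the pipeline in the question fails.

```r
library(stringdist)

sek <- rbind(c(1, 'a', 'a', 'a'),
             c(1, 'a', 'a', 'a'),
             c(2, 'b', 'b', 'b'),
             c(2, 'c', 'b', 'b'))
colnames(sek) <- c('Group', paste('t', 1:3, sep = ''))

# Collapse t1..t3 into one string per row, e.g. "cbb".
seqs <- apply(sek[, -1], 1, paste, collapse = '')

# Within each group, sum the pairwise Hamming distances between rows.
ham <- tapply(seqs, sek[, 'Group'], function(s) {
  sum(as.numeric(stringdistmatrix(s, method = 'hamming')))
})
ham
# Group 1 gives 0; Group 2 gives 1.
```

Hamming distance requires strings of equal length, which holds here because every row has three time points.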

You could use `num_range("t", 1:3)` instead of `matches()` – Steven Beaupré Jul 08 '15 at 00:27

score 3 · Answer 3 · answered Jul 07 '15 at 23:48

Another idea:

library(dplyr)
library(tidyr)

data.frame(sek) %>%
  gather(key, value, -Group) %>%
  group_by(Group) %>%
  summarise(dist = n_distinct(value)-1)

Which gives:

#Source: local data frame [2 x 2]
#
#  Group dist
#1     1    0
#2     2    1

Steven Beaupré

score 2 · Answer 4 · answered Jul 07 '15 at 23:41

library(dplyr)

# Paste t1..t3 into one string per row, then count distinct strings per group.
m <- matrix(apply(sek[,-1], 1, paste, collapse=''))
newdf <- as.data.frame(cbind(sek[,1], m))
names(newdf) <- c('Group', 'value')
newdf %>% group_by(Group) %>% summarize(count = length(unique(value))-1)
#  Group count
#1     1     0
#2     2     1

Pierre L

mpalanco · Answer 5 · 2015-07-11T20:02:54.660

Base package:

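# The grouping column in sek is named Group; as.data.frame() makes the
# input an explicit data frame before the duplicate rows are dropped.
aggregate(cbind(dist = Group) ~ Group, 
          data = unique(as.data.frame(sek)), 
          FUN = function(x){NROW(x)-1})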

With sqldf:

library(sqldf)
df <- rbind(c(1, "a", "a", "a"), 
            c(1, "a", "a", "a"), 
            c(2, "b", "b", "b"), 
            c(2, "c", "b", "b"))
df <- as.data.frame(df)
colnames(df)[1] <- "Groups"
sqldf("SELECT Groups, COUNT(Groups)-1 AS Dist 
      FROM (SELECT DISTINCT * FROM df) 
      GROUP BY Groups")

Output:

  Groups Dist
1      1    0
2      2    1
