Grouping of two variables in r

Question

I have a data frame like below.

dat <- data.frame(v1=c("a","b","c","c","a","w","f"),
              v2=c("z","a","a","w","p","e","h"))

 v1 v2
1  a  z
2  b  a
3  c  a
4  c  w
5  a  p
6  w  e
7  f  h

I want to add a group column based on whether these letters appear in the same row.

   v1 v2  gp
1  a  z   1
2  b  a   1
3  c  a   1
4  c  w   1
5  a  p   1
6  w  e   1
7  f  h   2

My idea is to first assign the first row to group 1, and then any row that v1 or v2 is "a" or "z" will also be assigned to group 1.

There are scenarios like row 3 and 4, where c is assigned to group 1, because, in row 3, v2 is "a". And "w" is assigned to group 1 because in row 4 v1 is "c", which is assigned to group 1 previously. But my list is very long, so I cannot keep checking all the "descendants".

I wonder if there is a way to group these letters, and return a list with group number. Something like the below table will do.

score 1 · Accepted Answer · answered Sep 09 '17 at 21:19

One way to solve this is to consider the letters as vertices of a graph and being in the same row as a link between the vertices. Then what you are asking for is the connected components of the graph. All of that is easy using the igraph package in R.

library(igraph)
G = graph_from_edgelist(as.matrix(dat), directed=FALSE)
letters = sort(unique(c(as.character(dat$v1), as.character(dat$v2))))
(gp = components(G)$membership[letters])
a b c e f h p w z 
1 1 1 1 2 2 1 1 1

If you want a data.frame containing this information

(Groups = data.frame(letters, gp, row.names=NULL))
  letters gp
1       a  1
2       b  1
3       c  1
4       e  1
5       f  2
6       h  2
7       p  1
8       w  1
9       z  1

In order to think through why this works, it may help you to look at the graph that was created and think how that represents your problem.

Thank you so much. This is very helpful. – cyrusjan Sep 09 '17 at 22:14 — cyrusjan, Sep 09 '17 at 22:14

Grouping of two variables in r

1 Answers1