Identifying and summarizing discrete groups of nodes in R

Question

I am working on a network problem related to family/household composition. I have multiple edge tables, each containing id1, id2 and a relationship code stating the type of relationship between the two IDs. These tables are large, upwards of 7 million rows each. I also have a node table which contains the same IDs and various attributes.

What I want to achieve is a cross-tabulation of households by number of adults and number of children, giving summary statistics similar to this:

                      Children

             1  2  3  4   total 
            --------------------
          1 | 1  0  1  0    2
            |
 Adults   2 | 3  5  4  1    13  
            |
          3 | 1  2  0  0    3
            |
      total | 5  7  5  1    18

Essentially I want to be able to identify and count distinct networks in my data.

My data is in the form:

             ID1  ID2   Relationship_Code

              X1   X2    Married 
              X1   X3    Parent/Child
              X1   X4    Parent/Child 
              X5   X6    Married
              X5   X7    Parent/Child 
              X6   X5    Married
               .    .     .
               .    .     .
               .    .     .

I also have a node table which contains date of birth and other variables from which adult/child status can be identified.

Any tips/hints on how to extract this summary information from the graph data frame would be very helpful and much appreciated.

Thanks

Show what your input data looks like (small example), and show what your end result should look like. — Andre Elrico, Oct 16 '18 at 11:44
If you can't publish the data, INVENT data that has similar form. — Andre Elrico, Oct 16 '18 at 11:56
Apologies, I have edited the question with an example of the form. — williamg15, Oct 16 '18 at 12:02
Is your sample data correct? I would think that if you had the first three relations, you would also have two additional relations that say that X2 has a Parent/Child relation with X3 and X4. — G5W, Oct 16 '18 at 12:48
Yes you are correct, I just wanted to show the form the data is in. There would be additional relationships within the table where X2-X3 and X2-X4 are parent/child relationships. Also, we would have another 'duplicate' relationship between X2-X1 (married) — williamg15, Oct 16 '18 at 13:04
Do you have single people in households? How are they represented? — G5W, Oct 16 '18 at 13:09
In the case of the edge table there are no single person households. They could be obtained by joining the edge tables with the node table. — williamg15, Oct 16 '18 at 13:19
Your example output includes households with 3 adults. What do the relationships look like there? — G5W, Oct 16 '18 at 13:32
@G5W Using the date of birth variable in the node table I would be able to determine the age of the person, and thus deduce whether a given person is an adult or a child. — williamg15, Oct 16 '18 at 13:54

score 2 · Accepted Answer · answered Oct 16 '18 at 14:19

Getting the final table that you want requires the node table, which you have not shown us, but I can get you most of the way there.

I think that the key to getting your result is identifying the households. You can do this in igraph using components. The connected components are households. I will illustrate with a slightly more elaborate version of your example.

Data:

Census = read.table(text="ID1  ID2   Relationship_Code
              X1   X2    Married 
              X2   X1    Married 
              X1   X3    Parent/Child
              X1   X4    Parent/Child 
              X2   X3    Parent/Child
              X2   X4    Parent/Child 
              X5   X6    Married
              X5   X7    Parent/Child 
              X6   X7    Parent/Child 
              X6   X5    Married
              X8   X9    Married
              X9   X8    Married",
    header=TRUE)

Now turn it into a graph, find the components and check by plotting.

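library(igraph)
# Use only the two ID columns as the edge list
EL = as.matrix(Census[,1:2])
Pop = graph_from_edgelist(EL)
# Each (weakly) connected component is one household
Households = components(Pop)
# Color each vertex by the household it belongs to
plot(Pop, vertex.color=rainbow(3, alpha=0.5)[Households$membership])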
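
With the full data it may be easier to build the graph from the edge table and the node table together. The following is only a sketch under assumptions: the node table (called Nodes here, with hypothetical columns ID and DOB), the reference date, and the age-18 cutoff are all invented for illustration. graph_from_data_frame() takes the first column of its vertices argument as the vertex name, so people who appear in no edge are kept as isolated vertices, that is, as single-person households.

```r
library(igraph)

# Hypothetical edge table in the same form as the question's data
Edges <- data.frame(
  ID1 = c("X1", "X2", "X1", "X2", "X5", "X6"),
  ID2 = c("X2", "X1", "X3", "X3", "X6", "X5"),
  Relationship_Code = c("Married", "Married", "Parent/Child",
                        "Parent/Child", "Married", "Married")
)

# Hypothetical node table; X7 appears in no edge (single-person household)
Nodes <- data.frame(
  ID  = c("X1", "X2", "X3", "X5", "X6", "X7"),
  DOB = as.Date(c("1980-03-01", "1982-07-15", "2010-01-20",
                  "1975-11-30", "1977-05-05", "1990-09-09"))
)

# Assumed rule: adult means at least 18 years old on an assumed reference date
RefDate <- as.Date("2018-10-16")
AgeYears <- as.numeric(RefDate - Nodes$DOB) / 365.25
Nodes$AdultChild <- ifelse(AgeYears >= 18, "A", "C")

# The first column of `vertices` supplies the vertex names; the remaining
# columns become vertex attributes. Every ID in Edges must occur in Nodes.
Pop2 <- graph_from_data_frame(Edges, directed = FALSE,
                              vertices = Nodes[, c("ID", "AdultChild")])

Households2 <- components(Pop2)
Households2$no      # number of households, counting isolated people
Households2$csize   # size of each household
```

Because the labels travel with the vertices, V(Pop2)$AdultChild is already in vertex order and does not have to be typed in by hand.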

You said that you could label the nodes as to whether they represent adults or children. I will assume that we have such a labeling. From that, it is easy to count the number of adults and the number of children in each household, and to make a table of household composition by adults and children.

# Assumed labels, given in vertex order X1..X9
V(Pop)$AdultChild = c('A', 'A', 'C', 'C', 'A', 'A', 'C', 'A', 'A')
# Number of adults in each household
AdultsByHousehold = aggregate(V(Pop)$AdultChild, list(Households$membership), 
    function(p) sum(p=='A'))
AdultsByHousehold
  Group.1 x
1       1 2
2       2 2
3       3 2

# Number of children in each household
ChildrenByHousehold = aggregate(V(Pop)$AdultChild, list(Households$membership), 
    function(p) sum(p=='C'))
ChildrenByHousehold
  Group.1 x
1       1 2
2       2 1
3       3 0

table(AdultsByHousehold$x, ChildrenByHousehold$x)
    0 1 2
  2 1 1 1

In my bogus example, all households have two adults, so the table has only one row.
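
The table in the question also carries row and column totals. Here is a minimal sketch of adding them with base R's addmargins(). It assumes the component membership and the adult/child labels from the example above, copied in by hand so the snippet runs on its own. The names adults, children and HouseholdTable are only illustrative.

```r
# Values copied from the example above: household membership of X1..X9
# and the assumed adult/child labels, in the same vertex order.
membership <- c(1, 1, 1, 1, 2, 2, 2, 3, 3)
AdultChild <- c("A", "A", "C", "C", "A", "A", "C", "A", "A")

# Count adults and children within each household
adults   <- as.vector(tapply(AdultChild == "A", membership, sum))
children <- as.vector(tapply(AdultChild == "C", membership, sum))

# Cross-tabulate the households, then append "Sum" margins for the totals
HouseholdTable <- addmargins(table(Adults = adults, Children = children))
HouseholdTable
```

Each cell counts households, so the bottom-right "Sum" cell is the total number of households.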
