I am currently working through some graph theory problems and have a question I can't seem to find an answer to. When creating a graph using:

x <- graph_from_data_frame(el, directed = F, vertices = x)

The addition of the vertices = x creates components of size = 1.

I want to look at cluster size i.e. extracting the components and looking at a table of size using:

comp <- components(x)
table(comp$csize)

Given the nature of edge lists, I would expect no cluster to have size < 2, since each row of the edge list is a relationship between two nodes. If I run the exact same code without vertices = x, my table starts with clusters of size 2.

Why does the addition of vertices = x do this?

Thanks

EDIT:

My edgelist has the variables:

ID   ID.2  source
x1   x2    healthcare
x1   x3    child benefit

The vertices data frame contains general information for the nodes (IDs):

ID   date_of_birth   nationality
x1   02/09/1999      French
x2   12/12/1997      French
x3   22/01/2002      French
williamg15
  • The `vertices` argument is there to include vertex metadata. Without knowing what is in `x` it's hard to say. If you post some of your data with `dput()` or make a minimal reproducible example, it would be easier to diagnose. – gfgm Nov 21 '18 at 12:48
  • Hi, thanks for the quick response. I have edited the thread and added a small reproducible example. – williamg15 Nov 21 '18 at 12:57

1 Answer

I suspect that your data frame of node metadata x contains IDs that do not appear in the edge list. igraph adds these nodes to the graph as isolated vertices, and each isolated vertex is a component of size 1. Some sample code below to illustrate the problem:

library(igraph)

# generate some fake data
set.seed(42)
e1 <- data.frame(ID = sample(1:10, 5), ID.2 = sample(1:10, 5))
head(e1)
#>   ID ID.2
#> 1 10    6
#> 2  9    7
#> 3  3    2
#> 4  6    5
#> 5  4    9

# make the desired graph object
x <- graph_from_data_frame(e1, directed = FALSE)

# make some attribute data that only matches the nodes that have edges
v_atts1 <- data.frame(ID = names(V(x)), foo = rnorm(length(names(V(x)))))
v_atts1
#>   ID         foo
#> 1 10 -0.10612452
#> 2  9  1.51152200
#> 3  3 -0.09465904
#> 4  6  2.01842371
#> 5  4 -0.06271410
#> 6  7  1.30486965
#> 7  2  2.28664539
#> 8  5 -1.38886070

g1 <- graph_from_data_frame(e1, directed = FALSE, vertices = v_atts1)

# we can see only groups of size 2 and greater
comp1 <- components(g1)
table(comp1$csize)
#> 
#> 2 3 
#> 1 2

# now make attribute data that includes nodes that don't appear in e1
v_atts2 <- data.frame(ID = 1:10, foo=rnorm(10))
g2 <- graph_from_data_frame(e1, directed = FALSE, vertices = v_atts2) 

# now we see that there are isolated nodes
comp2 <- components(g2)
table(comp2$csize)
#> 
#> 1 2 3 
#> 2 1 2

# and inspecting the number of vertices we see that
# this is because the graph has incorporated vertices
# that appear in the metadata but not the edge list
length(V(g1))
#> [1] 8
length(V(g2))
#> [1] 10
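
To see which vertices are the isolated ones, rather than only how many there are, something like the following could be used. This is a minimal sketch on hypothetical toy data (the `edges` and `meta` data frames and their column names are made up for illustration), and it assumes the graph has no self-loops, so that a vertex of degree 0 is the same thing as a component of size 1.

```r
library(igraph)

# Hypothetical toy data: vertex "d" has metadata but no edges
edges <- data.frame(from = c("a", "b"), to = c("b", "c"),
                    stringsAsFactors = FALSE)
meta  <- data.frame(name = c("a", "b", "c", "d"),
                    stringsAsFactors = FALSE)
g <- graph_from_data_frame(edges, directed = FALSE, vertices = meta)

# Names of vertices that have no edges at all
isolated <- V(g)$name[degree(g) == 0]
isolated

# The same vertices found via component sizes:
# look up the size of each vertex's component and keep those of size 1
comp <- components(g)
singletons <- V(g)$name[comp$csize[comp$membership] == 1]
singletons
```

Comparing these names with the IDs in the edge list should confirm whether the size-1 components come from the metadata alone.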

If you want to avoid this, subset the metadata to the IDs that occur in the edge list before building the graph, for example graph_from_data_frame(e1, directed = FALSE, vertices = x[x$ID %in% c(e1$ID, e1$ID.2), ]), where x is your vertex data frame. This keeps only the vertices that have at least one edge. You may also want to check that your IDs are not encoded as factors with levels that do not appear in the data.
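
As a rough sketch of that fix on data shaped like the question's, the snippet below shows two options: subsetting the metadata first, or building the full graph and then dropping vertices without edges. The `el` and `nodes` data frames are hypothetical, and the extra node x4 is invented here to stand in for an ID that has metadata but no edges.

```r
library(igraph)

# Hypothetical data modelled on the question; x4 has metadata but no edges
el <- data.frame(
  ID     = c("x1", "x1"),
  ID.2   = c("x2", "x3"),
  source = c("healthcare", "child benefit"),
  stringsAsFactors = FALSE
)
nodes <- data.frame(
  ID          = c("x1", "x2", "x3", "x4"),
  nationality = c("French", "French", "French", "French"),
  stringsAsFactors = FALSE
)

# Option 1: keep only metadata rows whose ID occurs in the edge list
connected <- nodes[nodes$ID %in% c(el$ID, el$ID.2), ]
g_sub <- graph_from_data_frame(el, directed = FALSE, vertices = connected)

# Option 2: build the full graph, then delete vertices with degree 0
g_full <- graph_from_data_frame(el, directed = FALSE, vertices = nodes)
g_trim <- delete_vertices(g_full, V(g_full)[degree(g_full) == 0])

# Component size tables before and after the fix
table(components(g_full)$csize)
table(components(g_sub)$csize)
```

Option 1 never creates the isolated vertices, while option 2 keeps the full graph around in case the unconnected nodes are needed for something else.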

gfgm