I've started using graph-tool, hoping it would be a Python library that will allow me to analyze large graphs (~8M vertices, ~22M edges, in a Pandas DataFrame / CSV). The 'source' and 'target' columns are user ids for a certain digital service.
I started out with a toy example, following the method in this post.
import pandas as pd
from graph_tool import Graph

df = pd.DataFrame({'source': range(11, 15), 'target': range(12, 16)})
g = Graph(directed=True)
g.add_edge_list(df.values)  # each row is one (source, target) edge
As you can see, my dummy example has only 5 distinct vertices (11, 12, 13, 14, 15). However, when I generate the graph, 16 vertices are created, seemingly filling the gap between 0 and the max node value.
g.get_vertices()
returns:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], dtype=uint64)
I assume that graph-tool 'reads' the values of the df as indices, not as the actual vertices' names. This follows from the docs:
Each vertex in a graph has an unique index, which is always between 0 and N-1, where N is the number of vertices.
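A quick check with plain pandas (no graph-tool calls) is consistent with that reading: if the values are taken as 0-based indices, the vertex count should be the largest id plus one, not the number of distinct ids. The variable names `n_distinct` and `n_if_indices` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'source': range(11, 15), 'target': range(12, 16)})

# Number of distinct user ids that actually appear in the edge list.
n_distinct = pd.unique(df[['source', 'target']].values.ravel()).size

# Number of vertices needed if every id is treated as a 0-based index.
n_if_indices = int(df[['source', 'target']].values.max()) + 1

print("distinct ids: %d, vertices if ids are indices: %d"
      % (n_distinct, n_if_indices))
```

For the toy data this gives 5 distinct ids versus 16 index slots, which matches the 16 vertices reported by g.get_vertices().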
How do I create a graph without these redundant vertices (which, with my real data, could number in the millions), and how can I keep my user ids from being treated as indices? I've been rummaging through the available methods and documentation, and couldn't figure out how to do this for a bulk import from a DataFrame.
What else I tried:
import graph_tool

df.to_csv('test.csv', index=False)  # , header=False)
g2 = graph_tool.load_graph_from_csv('test.csv', skip_first=True)  # skip the header row
This does seem to create a graph with only 5 vertices, but 'loses' their names (user ids).
g2.get_vertices()
returns
array([0, 1, 2, 3, 4], dtype=uint64)
instead of [11, 12, 13, 14, 15].
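To make concrete what I'm after, here is a minimal sketch in plain pandas of the mapping I would like to end up with: compact indices 0..4 for the edges, plus a lookup back to the original user ids. This is only a workaround sketch under my own assumptions, not graph-tool's mechanism; `codes`, `user_ids` and `edges` are names I made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'source': range(11, 15), 'target': range(12, 16)})

# Flatten both columns so that source and target share one id space.
flat = df[['source', 'target']].values.ravel()

# codes[i] is the compact 0-based index assigned to flat[i];
# user_ids[k] is the original user id behind compact index k.
codes, user_ids = pd.factorize(flat)

# Edge list expressed in compact indices, one (source, target) row per edge.
edges = codes.reshape(-1, 2)

# Looking the indices up again recovers the original user ids.
restored = user_ids[edges]
```

An array like `edges` has no gaps between 0 and the largest index, so passing it to g.add_edge_list instead of df.values should not create extra vertices, but I would then have to carry `user_ids` around separately, which is what I'd like to avoid if graph-tool can store the ids itself.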
Appreciate your help! Thanks in advance.
I am using Python 2.7 on Jupyter/Anaconda.