I have a 1.2GB list of edges from a graph in a text file. My Ubuntu PC has 8GB of RAM. Each line in the input looks like
287111206 357850135
I would like to convert it into a sparse adjacency matrix and output that to a file.
Some statistics for my data:
Number of edges: around 62500000
Number of vertices: around 31250000
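To put those numbers in perspective, here is my own back-of-the-envelope arithmetic (assumptions, not measurements):
# Rough sizes implied by the statistics above.
m = 62500000   # edges
n = 31250000   # vertices
print("edge array as uint32: %.1f GB" % (m * 2 * 4 / 1e9))                          # ~0.5 GB
print("edge array as int64:  %.1f GB" % (m * 2 * 8 / 1e9))                          # ~1.0 GB
print("dense adjacency matrix, 1 byte per cell: %.0f TB" % (float(n) * n / 1e12))   # ~977 TB
So the adjacency matrix clearly has to be sparse, and every full-size temporary array costs a sizeable fraction of the 8GB.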
I asked much the same question before at https://stackoverflow.com/a/38667644/2179021 and got a great answer. The problem is that I can't get it to work.
I first tried np.loadtxt to load the file, but it was very slow and used a huge amount of memory. So instead I moved to pandas.read_csv, which is very fast but has caused its own problems. This is my current code:
import pandas
import numpy as np
from scipy import sparse

# Load the edge list as two uint32 columns.
data = pandas.read_csv("edges.txt", sep=" ", header=None, dtype=np.uint32)
A = data.as_matrix()
print type(A)

# Map the raw vertex ids to 0..|V|-1 and recover row/column index arrays.
k1, k2, k3 = np.unique(A, return_inverse=True, return_index=True)
rows, cols = k3.reshape(A.shape).T
M = sparse.coo_matrix((np.ones(rows.shape, int), (rows, cols)))
print type(M)
The problem is that the pandas dataframe data is huge and I am effectively making a copy of it in A, which is inefficient. However, things are even worse: the code crashes with a MemoryError inside np.unique, which (as far as I understand) has to sort the 125-million-entry array and allocate extra index arrays of around 1GB each:
raph@raph-desktop:~/python$ python make-sparse-matrix.py
<type 'numpy.ndarray'>
Traceback (most recent call last):
  File "make-sparse-matrix.py", line 12, in <module>
    k1,k2,k3=np.unique(A,return_inverse=True,return_index=True)
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/arraysetops.py", line 209, in unique
    iflag = np.cumsum(flag) - 1
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 2115, in cumsum
    return cumsum(axis, dtype, out)
MemoryError
So my questions are:
- Can I avoid having both the 1.2GB pandas dataframe and the 1.2GB numpy array copy in memory? (A sketch of the kind of thing I am hoping for follows these questions.)
- Is there some way to get the code to complete in 8GB of RAM?
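To make the first question concrete, here is a minimal sketch of the kind of construction I am hoping for. It assumes the vertex ids can be used directly as row/column indices (which is not true for my raw file, hence the np.unique step above and the remapping described in the update below), and it assumes that pulling a column out of a single-dtype dataframe with .values does not force yet another full copy:
import numpy as np
import pandas
from scipy import sparse, io

# Read the two endpoint columns straight into uint32 (about 0.5GB in memory).
data = pandas.read_csv("edges.txt", sep=" ", header=None, dtype=np.uint32)

# Take the columns as plain numpy arrays (hopefully without another full copy).
rows = data[0].values
cols = data[1].values
n = int(max(rows.max(), cols.max())) + 1

# One uint8 per edge instead of a default int64 array of ones.
vals = np.ones(rows.shape[0], dtype=np.uint8)

# scipy may still convert the index arrays to its own index dtype internally.
M = sparse.coo_matrix((vals, (rows, cols)), shape=(n, n))

# Write the matrix out in a format scipy can read back.
io.mmwrite("adjacency.mtx", M)
Even if the column extractions do copy, two uint32 arrays of 62.5 million entries come to about 0.5GB, which should still fit comfortably alongside the dataframe in 8GB of RAM.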
You can reproduce a test input of the size I am trying to process with:
import random
# Number of edges, vertices
m = 62500000
n = m/2
for i in xrange(m):
    fromnode = str(random.randint(0, n-1)).zfill(9)
    tonode = str(random.randint(0, n-1)).zfill(9)
    print fromnode, tonode
Update
I have now tried a number of different approaches, all of which have failed. Here is a summary.
- Using igraph with g = Graph.Read_Ncol('edges.txt'). This uses a huge amount of RAM, which crashes my computer.
- Using networkit with G = networkit.graphio.readGraph("edges.txt", networkit.Format.EdgeList, separator=" ", continuous=False). This uses a huge amount of RAM, which crashes my computer.
- The code above in this question, but using np.loadtxt("edges.txt") instead of pandas. This uses a huge amount of RAM, which crashes my computer.
I then wrote separate code that remapped all the vertex names to numbers from 1..|V|, where |V| is the total number of vertices. This should save the code that imports the edge list from having to build up its own table mapping the vertex names.
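For reference, here is a rough sketch of the kind of remapping pass I mean (not my exact script; it assumes the name-to-index dictionary fits in RAM and it numbers the vertices from 0):
# Assign each vertex name a contiguous integer id the first time it appears,
# then rewrite every edge using the new ids.
ids = {}
with open("edges.txt") as fin, open("edges-contig.txt", "w") as fout:
    for line in fin:
        u, v = line.split()
        iu = ids.setdefault(u, len(ids))
        iv = ids.setdefault(v, len(ids))
        fout.write("%d %d\n" % (iu, iv))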
Using this remapped edge list I tried:
- Using igraph again with g = Graph.Read_Edgelist("edges-contig.txt"). This now works, although it takes 4GB of RAM (which is way more than the theoretical amount it should). However, there is no igraph function to write out a sparse adjacency matrix from a graph. The recommended solution is to convert the graph to a coo_matrix. Unfortunately this uses a huge amount of RAM, which crashes my computer.
- Using the remapped edge list with networkit and G = networkit.readGraph("edges-contig.txt", networkit.Format.EdgeListSpaceOne). This also works, using less than the 4GB that igraph needs. networkit also comes with a function to write Matlab files (which is a form of sparse adjacency matrix that scipy can read). However, networkit.graphio.writeMat(G, "test.mat") uses a huge amount of RAM, which crashes my computer.
Finally, sascha's answer below does complete, but it takes about 40 minutes.