13

I have a 1.2GB list of edges from a graph in a text file. My Ubuntu PC has 8GB of RAM. Each line in the input looks like

287111206 357850135

I would like to convert it into a sparse adjacency matrix and output that to a file.

Some statistics for my data:

Number of edges: around 62500000
Number of vertices: around 31250000

I asked much the same question before at https://stackoverflow.com/a/38667644/2179021 and got a great answer. The problem is that I can't get it to work.

I first tried np.loadtxt to load the file, but it was very slow and used a huge amount of memory. So instead I moved to pandas.read_csv, which is very fast but caused its own problems. This is my current code:

import pandas
import numpy as np
from scipy import sparse

data = pandas.read_csv("edges.txt", sep=" ", header= None, dtype=np.uint32)
A = data.as_matrix()
print type(A)
k1,k2,k3=np.unique(A,return_inverse=True,return_index=True)
rows,cols=k3.reshape(A.shape).T
M=sparse.coo_matrix((np.ones(rows.shape,int),(rows,cols)))
print type(M)

The problem is that the pandas dataframe data is huge and I am effectively making a copy of it in A, which is inefficient. However, things are even worse: the code crashes. The first traceback below came from a run where I had written A = data.as_matrix without the parentheses (hence the instancemethod type); the second run, with A as a proper ndarray, then dies with a MemoryError inside np.unique:

<type 'instancemethod'>
Traceback (most recent call last):
  File "make-sparse-matrix.py", line 13, in <module>
    rows,cols=k3.reshape(A.shape).T
AttributeError: 'function' object has no attribute 'shape'
raph@raph-desktop:~/python$ python make-sparse-matrix.py 
<type 'numpy.ndarray'>
Traceback (most recent call last):
  File "make-sparse-matrix.py", line 12, in <module>
    k1,k2,k3=np.unique(A,return_inverse=True,return_index=True)
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/arraysetops.py", line 209, in unique
    iflag = np.cumsum(flag) - 1
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 2115, in cumsum
    return cumsum(axis, dtype, out)
MemoryError

So my questions are:

  1. Can I avoid having both the 1.2GB pandas dataframe and the 1.2GB numpy array copy in memory?
  2. Is there some way to get the code to complete in 8GB of RAM?

You can reproduce a test input of the size I am trying to process with:

import random
#Number of edges, vertices
m = 62500000
n = m/2
for i in xrange(m):
    fromnode = str(random.randint(0, n-1)).zfill(9)
    tonode = str(random.randint(0, n-1)).zfill(9)
    print fromnode, tonode
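
A small variation of the above (just a sketch) that writes straight to a file instead of stdout; the %09d format is meant to match the zfill(9) above:

import random

m = 62500000
n = m // 2
with open('edges.txt', 'w') as f:
    for i in xrange(m):
        f.write('%09d %09d\n' % (random.randint(0, n-1), random.randint(0, n-1)))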

Update

I have now tried a number of different approaches, all of which have failed. Here is a summary.

  1. Using igraph with g = Graph.Read_Ncol('edges.txt'). This uses a huge amount of RAM which crashes my computer.
  2. Using networkit with G= networkit.graphio.readGraph("edges.txt", networkit.Format.EdgeList, separator=" ", continuous=False). This uses a huge amount of RAM which crashes my computer.
  3. The code above in this question but using np.loadtxt("edges.txt") instead of pandas. This uses a huge amount of RAM which crashes my computer.

I then wrote separate code which remapped all the vertex names to numbers from 1..|V|, where |V| is the total number of vertices. This should save the code that imports the edge list from having to build up a table that maps the vertex names (a rough sketch of the remapping idea follows the list below). Using this I tried:

  1. Using this new remapped edge list file I used igraph again with g = Graph.Read_Edgelist("edges-contig.txt"). This now works, although it takes 4GB of RAM (which is far more than it theoretically should need). However, there is no igraph function to write out a sparse adjacency matrix from a graph. The recommended solution is to convert the graph to a coo_matrix. Unfortunately this uses a huge amount of RAM, which crashes my computer.
  2. Using the remapped edge list file I used networkit with G = networkit.readGraph("edges-contig.txt", networkit.Format.EdgeListSpaceOne). This also works, using less than the 4GB that igraph needs. networkit also comes with a function to write Matlab files (a form of sparse adjacency matrix that scipy can read). However, networkit.graphio.writeMat(G,"test.mat") uses a huge amount of RAM, which crashes my computer.
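
For reference, the remapping script I mention above was roughly along these lines (a sketch, not my exact code; note that the Python dictionary holding ~31 million id strings is itself several GB):

# Pass 1: assign each vertex name a contiguous index (0-based here; shift by one for 1..|V|).
mapping = {}
with open('edges.txt') as f:
    for line in f:
        for name in line.split():
            if name not in mapping:
                mapping[name] = len(mapping)

# Pass 2: rewrite the edge list using the new ids.
with open('edges.txt') as fin, open('edges-contig.txt', 'w') as fout:
    for line in fin:
        a, b = line.split()
        fout.write('%d %d\n' % (mapping[a], mapping[b]))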

Finally sascha's answer below does complete but takes about 40 minutes.

Simd
  • Are all the numbers in each column unique? – smac89 Jul 31 '16 at 20:38
  • @khredos No they are not. My sample code to produce fake data is not realistic in that respect. – Simd Jul 31 '16 at 20:38
  • @eleanora What's the shape of A? Well... nevermind... A seems not to be a array/matrix. that's bad! Show us data.head()! And how did you use np.loadtxt()? – sascha Jul 31 '16 at 20:39
  • And regarding the copy: you could just use ```data=None``` after .as_matrix. But something before that is the problem here! – sascha Jul 31 '16 at 20:44
  • @sascha You are right there is something generally wrong with the way I am reading in the data. data.head() gives 0 1 0 287111206 357850135 1 512616930 441657273 2 530905858 562056765 3 524113870 320749289 4 149911066 964526673 – Simd Jul 31 '16 at 20:45
  • @eleanora So what happens when starting with A = np.loadtxt(edges.txt, dtype=np.uint32) ? – sascha Jul 31 '16 at 20:50
  • @sascha I added this to the end of the question. There is already a bug it seems. – Simd Jul 31 '16 at 20:50
  • @eleanora After your last edit it get's complicated as this is now a new error. I do believe this whole task is quite easy to implement by your own (even if the code from the linked answer might be much more advanced and efficient). why not keeping it simple and do small steps first. Read A in, and see if you got enough memory to build a set from iterating over all values in column 0, continue with column 1. A set is wasting memory, but i suspect we got enough for that. If it fails, we need something more clever. But working with a code one does not understand can be hard! – sascha Jul 31 '16 at 20:55
  • 1
    perhaps this solves the problem: http://stackoverflow.com/questions/1938894/csv-to-sparse-matrix-in-python – Patrick the Cat Jul 31 '16 at 22:22
  • WIth the last error I would print the shape (and dtype) of the 3 input arrays to `coo_matrix`. It's complaining that the lengths are wrong. What are the lengths? The `sparse` error message should have printed those lengths, but the code hasn't been refined that much. – hpaulj Jul 31 '16 at 22:45
  • In the `pandas` case the `type(A)` value of `` looks funny. Pandas docs says it's better to use `.values`. In any case, I'd check the shape and dtype of `A` before passing to `unique`. And as in the `loadtxt` case, do the same checks for the inputs to `coo`. – hpaulj Jul 31 '16 at 22:49
  • @hpaulj In the simple np.loadtxt case I get from type(A): and the shape is (10,2). np.ones(k1.shape,int).shape gives (20,) but rows.shape and cols.shape are (10,) – Simd Aug 01 '16 at 05:46
  • Use row.shape instead of k1.shape. You want the data array to be ones matching row and col in shape. – hpaulj Aug 01 '16 at 06:05
  • The approach here (http://stackoverflow.com/questions/6475328/read-large-text-files-in-python-line-by-line-without-loading-it-in-to-memory) is something I use whenever dealing with large data. Assuming i) your code works for actually working with the data to generate the sparse data structure and ii) your sparse data structure can fit in memory, then I think this approach will be the answer to both of your (related) questions – Ryan Aug 01 '16 at 06:22
  • @Ryan It's not at all clear how to make the matrix from the input line by line. You need to know how many distinct nodes there are to start with. – Simd Aug 01 '16 at 06:23
  • @hpaulj My previous deleted comment was my mistake. I hadn't realised that print M.A would also make the code crash. From looking at top it is clear that first the RAM usage is doubled by the copy from pandas to A (I wasn't sure what you had in mind to fix that ) and then k1,k2,k3=np.unique(A,return_inverse=True,return_index=True) uses all the RAM until it crashes. – Simd Aug 01 '16 at 06:25
  • Why not go through the file once and create the distinct node list, then again to fill in the sparse data? (both times using the line by line approach) – Ryan Aug 01 '16 at 06:25
  • I edited my answer on the other question, http://stackoverflow.com/questions/38665388/how-to-read-in-an-edge-list-to-make-a-scipy-sparse-matrix to clarify the `k1` error. – hpaulj Aug 01 '16 at 06:56
  • @hpaulj Thanks. In my case I can afford to make a small number of copies of the data in memory as each int should only take 4 bytes and I have at most 125 million of them. However the RAM problems seem to be even worse than that here for some reason. – Simd Aug 01 '16 at 07:28
  • In your other question the node ids also contained characters. Can it be assumed that the ids are between 10^8 and 10^9-1? – ead Aug 04 '16 at 05:09
  • @ead I have both cases. I will have to translate the node ids if it turns out to be too hard to handle those that contain characters. – Simd Aug 04 '16 at 06:37

5 Answers

14

Here's my solution:

import numpy as np
import pandas as pd
import scipy.sparse as ss

def read_data_file_as_coo_matrix(filename='edges.txt'):
    "Read data file and return sparse matrix in coordinate format."
    data = pd.read_csv(filename, sep=' ', header=None, dtype=np.uint32)
    rows = data[0]  # Not a copy, just a reference.
    cols = data[1]
    ones = np.ones(len(rows), np.uint32)
    matrix = ss.coo_matrix((ones, (rows, cols)))
    return matrix

Pandas does the heavy lifting of parsing with read_csv, and it already stores the data in columnar format, so data[0] and data[1] are just references, not copies. I then feed those to coo_matrix. Benchmarked locally:

In [1]: %timeit -n1 -r5 read_data_file_as_coo_matrix()
1 loop, best of 5: 14.2 s per loop

Then to save a csr-matrix to a file:

def save_csr_matrix(filename, matrix):
    """Save compressed sparse row (csr) matrix to file.

    Based on http://stackoverflow.com/a/8980156/232571

    """
    assert filename.endswith('.npz')
    attributes = {
        'data': matrix.data,
        'indices': matrix.indices,
        'indptr': matrix.indptr,
        'shape': matrix.shape,
    }
    np.savez(filename, **attributes)

Benchmarked locally:

In [3]: %timeit -n1 -r5 save_csr_matrix('edges.npz', matrix.tocsr())
1 loop, best of 5: 13.4 s per loop

And later load it back from a file:

def load_csr_matrix(filename):
    """Load compressed sparse row (csr) matrix from file.

    Based on http://stackoverflow.com/a/8980156/232571

    """
    assert filename.endswith('.npz')
    loader = np.load(filename)
    args = (loader['data'], loader['indices'], loader['indptr'])
    matrix = ss.csr_matrix(args, shape=loader['shape'])
    return matrix

Benchmarked locally:

In [4]: %timeit -n1 -r5 load_csr_matrix('edges.npz')
1 loop, best of 5: 881 ms per loop

And finally test it all:

def test():
    "Test data file parsing and matrix serialization."
    coo_matrix = read_data_file_as_coo_matrix()
    csr_matrix = coo_matrix.tocsr()
    save_csr_matrix('edges.npz', csr_matrix)
    loaded_csr_matrix = load_csr_matrix('edges.npz')
    # Comparison based on http://stackoverflow.com/a/30685839/232571
    assert (csr_matrix != loaded_csr_matrix).nnz == 0

if __name__ == '__main__':
    test()

When running test(), it takes about 30 seconds:

$ time python so_38688062.py 
real    0m30.401s
user    0m27.257s
sys     0m2.759s

And the memory high-water mark was ~1.79 GB.

Note that once you've converted the "edges.txt" to "edges.npz" in CSR-matrix format, loading it will take less than a second.
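
As a quick usage sketch (assuming the helpers above are in scope and "edges.npz" has already been written), reloading the matrix and computing, say, the out-degree of every vertex is then cheap:

import numpy as np

matrix = load_csr_matrix('edges.npz')                 # helper defined above
out_degrees = np.asarray(matrix.sum(axis=1)).ravel()  # row sums of the adjacency matrix
print(out_degrees[:10])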

GrantJ
  • 1
    Well! I added csrmat = matrix.tocsr() save_sparse_csr("test", csrmat) (using stackoverflow.com/a/8980156/2179021) at the end and it seems to work perfectly! – Simd Aug 03 '16 at 07:32
  • 2
    If I understand your solution right, you don't relabel the nodes and the resulting sparse matrix has the dimension 10^9x10^9? I don't know whether it is an issue but if the resulting matrix should be converted to CSR or CSC format there will be some penalty (compared to relabeled nodes, which would result in a 10^8x10^8 matrix) – ead Aug 03 '16 at 19:57
  • I think it's polite to let a bounty run for a few days in case someone else is working on it. I will accept if there are no other answers in a couple of days. Thanks again. – Simd Aug 04 '16 at 06:39
  • Did you see Walter's comment about your answer? – Simd Aug 05 '16 at 06:26
  • @eleanora Thanks. I updated the code and benchmarks. – GrantJ Aug 05 '16 at 17:25
  • I get a `MemoryError` on calling `csr_matrix = coo_matrix.tocsr()`. The total size of my edge list is just 193kb with 25571 edges. – Pranav Nov 05 '17 at 10:30
4

In my answer I consider the case where the node ids are given by 9-character strings, each character from [0-9A-Za-z]. The n distinct node ids should be mapped onto the values [0, n-1] (which might not be necessary for your application, but is still of general interest).

The following considerations, which I'm sure you are aware of, are listed here for the sake of completeness:

  1. Memory is the bottleneck.
  2. There are around 10^8 strings in the file.
  3. A 9-character string + int32 value pair costs around 120 bytes in a dictionary, resulting in about 12GB of memory for the whole file.
  4. A string id from the file can be mapped onto an int64: there are 62 different characters, so each can be encoded with 6 bits; with 9 characters per string, 6*9 = 54 < 64 bits. See also the toInt64() method further below.
  5. There are int64 + int32 = 12 bytes of "real" information per pair => ca. 1.2GB could be enough; however, the cost of such a pair in a dictionary is around 60 bytes (around 6GB of RAM needed).
  6. Creating small objects (on the heap) results in a lot of memory overhead, thus bundling these objects in arrays is advantageous. Interesting information about the memory used by Python objects can be found in this tutorial-style article; interesting experiences with reducing memory usage are made public in this blog entry.
  7. A Python list is out of the question as a data structure, as is a dictionary. array.array could be an alternative, but we use np.array (because there are sorting algorithms for np.array but not for array.array).

Step 1: reading the file and mapping the strings to int64. It is a pain to let an np.array grow dynamically, so we assume we know the number of edges in the file (it would be nice to have it in a header, but it can also be deduced from the file size):

import numpy as np

def read_nodes(filename, EDGE_CNT):
    nodes = np.zeros(EDGE_CNT*2, dtype=np.int64)
    cnt = 0
    for line in open(filename, "r"):
        nodes[cnt:cnt+2] = map(toInt64, line.split())  # use map(int, line.split()) for ids without letters
        cnt += 2
    return nodes
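
Since each line here is fixed width (two 9-character ids, one space, one newline, i.e. 20 bytes), EDGE_CNT can be deduced from the file size as mentioned above; a small sketch (the 20 bytes per line is an assumption, adjust e.g. for \r\n line endings):

import os

def count_edges(filename, bytes_per_line=20):
    # two 9-char ids + 1 space + 1 newline = 20 bytes per line (assumed)
    return os.path.getsize(filename) // bytes_per_line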

Step 2: converting the int64 values into values in [0, n-1]:

Possibility A, needs 3*0.8GB:

def maps_to_ids(filename, EDGE_CNT):
    """ return number of different node ids, and the mapped nodes"""
    nodes = read_nodes(filename, EDGE_CNT)
    unique_ids, nodes = np.unique(nodes, return_inverse=True)
    return (len(unique_ids), nodes)

Possibility B, needs 2*0.8GB, but is somewhat slower:

def maps_to_ids(filename, EDGE_CNT):
    """ return number of different node ids, and the mapped nodes"""
    nodes=read_nodes(filename, EDGE_CNT)
    unique_map = np.unique(nodes)
    for i in xrange(len(nodes)):
        node_id=np.searchsorted(unique_map, nodes[i]) # faster than bisect.bisect
        nodes[i]=node_id  
    return (len(unique_map), nodes)  
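
The Python-level loop in Possibility B can be sped up without giving up the in-place idea by calling np.searchsorted on fixed-size chunks; a sketch (the chunk size is arbitrary):

def maps_to_ids_chunked(filename, EDGE_CNT, chunk=10**6):
    """ like Possibility B, but vectorized in chunks """
    nodes = read_nodes(filename, EDGE_CNT)
    unique_map = np.unique(nodes)
    for start in xrange(0, len(nodes), chunk):
        # the temporary array created by searchsorted is only `chunk` elements long
        nodes[start:start+chunk] = np.searchsorted(unique_map, nodes[start:start+chunk])
    return (len(unique_map), nodes)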

Step 3: putting it all into a coo_matrix:

from scipy import sparse
def data_as_coo_matrix(filename, EDGE_CNT):
    node_cnt, nodes = maps_to_ids(filename, EDGE_CNT)
    rows = nodes[::2]   # only a view, not a copy
    cols = nodes[1::2]  # only a view, not a copy

    return sparse.coo_matrix((np.ones(len(rows), dtype=bool), (rows, cols)), shape=(node_cnt, node_cnt))

Calling data_as_coo_matrix("data.txt", 62500000), the memory need peaks at 2.5GB (but with int32 instead of int64, only 1.5GB is needed). It took around 5 minutes on my machine, but my machine is pretty slow...

So what is different from your solution?

  1. I get only unique values from np.unique (and not all the indices and the inverse), so there is some memory saved - I can replace the old ids with the new in-place.
  2. I have no experience with pandas so maybe there is some copying involved between pandas<->numpy data structures?

What is the difference from sascha's solution?

  1. There is no need to keep the list sorted all the time - it is enough to sort once all items are in the list, which is what np.unique() does. sascha's solution keeps the list sorted the whole time - you have to pay for this with at least a constant factor, even if the running time stays O(n log(n)). I assumed an add operation would be O(n), but as pointed out it is O(log(n)).

What is the difference to GrantJ's solution?

  1. The size of the resulting sparse matrix is NxN - with N the number of different nodes - and not 2^54x2^54 (with very many empty rows and columns).

PS:
Here is my idea of how the 9-character string id can be mapped onto an int64 value, but I guess this function could become a bottleneck the way it is written and should get optimized.

def toInt64(string):
    res = 0L
    for ch in string:
        res *= 62
        if ch <= '9':
            res += ord(ch) - ord('0')
        elif ch <= 'Z':
            res += ord(ch) - ord('A') + 10
        else:
            res += ord(ch) - ord('a') + 36
    return res
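
One possible way to optimize it (an untested sketch, assuming the ids have first been read into a NumPy array of dtype 'S9') is to translate all characters via a lookup table and fold the base-62 digits with a dot product:

import numpy as np

LOOKUP = np.zeros(256, dtype=np.int64)
for i, ch in enumerate("0123456789"
                       "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                       "abcdefghijklmnopqrstuvwxyz"):
    LOOKUP[ord(ch)] = i

def toInt64_bulk(ids):
    """ ids: np.array with dtype 'S9'; returns the int64 codes of all ids at once """
    digits = LOOKUP[ids.view(np.uint8).reshape(len(ids), 9)]  # per-character values
    weights = 62 ** np.arange(8, -1, -1, dtype=np.int64)      # base-62 place values
    return digits.dot(weights)
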
ead
  • Didn't check your solution, but the answer looks good (assumptions, description, comparisons). Just wanted to say, that adding an element to a *sortedlist* has a complexity of ```O(log(n))``` (not ```O(n)```); at least amortized. But i agree with your critic!. @ead – sascha Aug 04 '16 at 16:09
  • I wonder if pandas can do step 1 is some efficient way. Pandas experts? – Simd Aug 04 '16 at 17:05
3

Updated version

As indicated in the comments, the approach did not fit your use-case. Let's make some changes:

  • use pandas for reading in the data (instead of numpy: I'm quite surprised np.loadtxt performs that badly!)
  • use external library sortedcontainers for a more memory-efficient approach (instead of a dictionary)
  • the basic approach is the same

This approach will take ~45 minutes (which is slow; but you could pickle/save the result so you need to do it only once) and ~5 GB of memory to prepare the sparse-matrix for your data, generated with:

import random
N = 62500000
for i in xrange(N):
    print random.randint(10**8,10**9-1), random.randint(10**8,10**9-1)

Code

import numpy as np
from scipy.sparse import coo_matrix
import pandas as pd
from sortedcontainers import SortedList
import time

# Read data
# global memory usage after: one big array
df = pd.read_csv('EDGES.txt', delimiter=' ', header=None, dtype=np.uint32)
data = df.as_matrix()
df = None
n_edges = data.shape[0]

# Learn mapping to range(0, N_VERTICES)  # N_VERTICES unknown
# global memory usage after: one big array + one big searchtree
print('fit mapping')
start = time.time()
observed_vertices = SortedList()
mappings = np.arange(n_edges*2, dtype=np.uint32)  # upper bound on vertices
for column in range(data.shape[1]):
    for row in range(data.shape[0]):
        # double-loop: slow, but easy to understand space-complexity
        val = data[row, column]
        if val not in observed_vertices:
            observed_vertices.add(val)
mappings = mappings[:len(observed_vertices)]
n_vertices = len(observed_vertices)
end = time.time()
print(' secs: ', end-start)

print('transform mapping')
# Map original data (in-place !)
# global memory usage after: one big array + one big searchtree(can be deleted!)
start = time.time()
for column in range(data.shape[1]):
    for row in range(data.shape[0]):
        # double-loop: slow, but easy to understand space-complexity
        val = data[row, column]
        mapper_pos = observed_vertices.index(val)
        data[row, column] = mappings[mapper_pos]
end = time.time()
print(' secs: ', end-start)
observed_vertices = None  # if not needed anymore
mappings = None  # if not needed anymore

# Create sparse matrix (only caring about a single triangular part for now)
# if needed: delete dictionary before as it's not needed anymore!
sp_mat = coo_matrix((np.ones(n_edges, dtype=bool), (data[:, 0], data[:, 1])), shape=(n_vertices, n_vertices))
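
To avoid repeating this ~45 minute preprocessing, the result can be written to disk right away; a minimal sketch (scipy.sparse.save_npz/load_npz need SciPy >= 0.19; otherwise the np.savez route from GrantJ's answer works as well):

from scipy import sparse

sparse.save_npz('sp_mat.npz', sp_mat.tocsr())  # persist the CSR form once
sp_mat = sparse.load_npz('sp_mat.npz')         # cheap to reload later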

First version

Here is a very simple and very inefficient (in regards to both time and space) code to build this sparse matrix. I post this code because I believe it is important to understand the core parts if one is using them in something bigger.

Let's see if this code is efficient enough for your use-case or if it needs work. From a distance it's hard to tell, because we don't have your data.

The dictionary part, used for the mapping, is a candidate to blow up your memory. But it's pointless to optimize this without knowing if that's needed at all, especially because this part of the code depends on the number of vertices in your graph (and I don't have any knowledge of this cardinality).

""" itertools.count usage here would need changes for py2 """

import numpy as np
from itertools import count
from scipy.sparse import coo_matrix


# Read data
# global memory usage after: one big array
data = np.loadtxt('edges.txt', np.uint32)
n_edges = data.shape[0]
#print(data)
#print(data.shape)

# Learn mapping to range(0, N_VERTICES)  # N_VERTICES unknown
# global memory usage after: one big array + one big dict 
index_gen = count()
mapper = {}
for column in range(data.shape[1]):
    for row in range(data.shape[0]):
        # double-loop: slow, but easy to understand space-complexity
        val = data[row, column]
        if val not in mapper:
            mapper[val] = next(index_gen)
n_vertices = len(mapper)

# Map original data (in-place !)
# global memory usage after: one big array + one big dict (can be deleted!)
for column in range(data.shape[1]):
    for row in range(data.shape[0]):
        # double-loop: slow, but easy to understand space-complexity
        data[row, column] = mapper[data[row, column]]
#print(data)

# Create sparse matrix (only caring about a single triangular part for now)
# if needed: delete dictionary before as it's not needed anymore!
sp_mat = coo_matrix((np.ones(n_edges, dtype=bool), (data[:, 0], data[:, 1])), shape=(n_vertices, n_vertices))
#print(sp_mat)

Output for edges-10.txt:

[[287111206 357850135]
 [512616930 441657273]
 [530905858 562056765]
 [524113870 320749289]
 [149911066 964526673]
 [169873523 631128793]
 [646151040 986572427]
 [105290138 382302570]
 [194873438 968653053]
 [912211115 195436728]]
(10, 2)
[[ 0 10]
 [ 1 11]
 [ 2 12]
 [ 3 13]
 [ 4 14]
 [ 5 15]
 [ 6 16]
 [ 7 17]
 [ 8 18]
 [ 9 19]]
  (0, 10)   True
  (1, 11)   True
  (2, 12)   True
  (3, 13)   True
  (4, 14)   True
  (5, 15)   True
  (6, 16)   True
  (7, 17)   True
  (8, 18)   True
  (9, 19)   True
sascha
  • Unfortunately this doesn't work. To start `np.loadtxt` is wildly inefficient it seems (which is why I switched to pandas). I tested it on a file with 1/5 of the edges that I need as it just crashes when I uses the whole file. It takes more than 1 minute to load (it should take a few seconds) and it uses about 2GB of RAM just to load the data! – Simd Aug 01 '16 at 06:00
  • Are you sure, that your file is not broken somehow? Missing values or stuff like that? This should not be that inefficient and also that weird pandas behavour from before. – sascha Aug 01 '16 at 10:50
  • I don't think anything is broken. You can reproduce the problem trivially just by running for `i in xrange(N): print random.randint(10**8,10**9-1), random.randint(10**8,10**9-1)` and outputting to a file and then running your code. There was a bug in my code before but I fixed that. – Simd Aug 01 '16 at 12:09
  • @eleanora I made some changes above. This should work, if you got >= 5GB of memory. You might need to modify the code to store the final matrix & an inverse-mapper if needed later. I would only do this conversion once. – sascha Aug 01 '16 at 16:52
  • Thanks for the update. I did try your new code and it still kills my computer. Would you mind testing it on data produced from my new edge list creation code in the question? If you run it with "/usr/bin/time -v python sascha.py" it will tell you how much RAM it used, assuming you are in linux. – Simd Aug 01 '16 at 21:05
  • @eleanora No, sorry. That's just too many changes & incomprehensible stuff. The above uses < 5GB of ram with your old example. The old example is also an upper limit, as the old generation has ~15 times more vertices than edges (the unique/mapping op gets it's worst-case regarding memory-complexity). The new example will be easier. If it kills your computer (which is not an error-description at all), than it has to be some other problem or you don't have 5 GB of memory available (e.g. 32-bit binaries). The only thing you can do: update numpy &. co. and try py3 if you didn't. That's my setup. – sascha Aug 02 '16 at 00:14
  • I got your code to run in the end. It takes about 4GB which is good and 40 minutes. Do you have any idea why it is so slow? – Simd Aug 02 '16 at 20:48
  • @eleanora Sure... that's basic computer-science stuff. Look at the timings and you see, that both the mapping-learning and the final mapping is the bottleneck. Think about what SortedList is, and what theoretical complexities we got. What's so surprising? It's something like ```n_vertices * log(n_vertices) = ~ 1012113751 operations``` each. Sometimes you can't have it all: space- and time-complexity. But nobody will run that twice. It's something like a costly preprocessing-step. – sascha Aug 02 '16 at 22:34
3

I tried different methods apart from the ones already used, and found that the following perform well.

Method 1 - Reading the file into a string, parsing the string into a 1-D array using numpy's fromstring.

import numpy as np
import scipy.sparse as sparse

def readEdges():
    with open('edges.txt') as f:
        data = f.read()  
    edges = np.fromstring(data, dtype=np.int32, sep=' ')
    edges = np.reshape(edges, (edges.shape[0]/2, 2))
    ones = np.ones(len(edges), np.uint32)
    cooMatrix = sparse.coo_matrix((ones, (edges[:,0], edges[:,1])))
%timeit -n5 readEdges()

Output:

5 loops, best of 3: 13.6 s per loop

Method 2 - Same as Method 1, except that instead of loading the file into a string, it uses the memory-mapped interface.

import contextlib
import mmap

def readEdgesMmap():
    with open('edges.txt') as f:
        with contextlib.closing(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)) as m:
            edges = np.fromstring(m, dtype=np.int32, sep=' ')
            edges = np.reshape(edges, (edges.shape[0]/2, 2))
            ones = np.ones(len(edges), np.uint32)
            cooMatrix = sparse.coo_matrix((ones, (edges[:,0], edges[:,1])))
%timeit -n5 readEdgesMmap()

Output:

5 loops, best of 3: 12.7 s per loop

Monitored using /usr/bin/time, both methods use a maximum of around ~2GB of memory.
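
For reference, the peak RSS can also be read from inside the script itself (a small sketch; on Linux, resource reports the value in kilobytes):

import resource

peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print('peak memory: %.2f GB' % (peak_kb / 1024.0 / 1024.0))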

A few notes:

  1. It seems to do slightly better than pandas read_csv. Using pandas read_csv, the output on the same machine is

    5 loops, best of 3: 16.2 s per loop

  2. Conversion from COO to CSR/CSC consumes significant time too. In @GrantJ's answer, it takes less time because the COO matrix initialization is incorrect. The argument needs to be given as a tuple. I wanted to leave a comment there but I don't have commenting rights yet.

  3. My guess on why this is slightly better than pandas read_csv is the prior assumption of 1D data.

Walter
  • You have enough rep to comment now :) – Simd Aug 05 '16 at 06:24
  • I am a little confused by the shape of the matrix in your answer. It should be " number of nodes" by "number of nodes" shouldn't it? Is that what you have? – Simd Aug 05 '16 at 06:45
  • 1
    Yes, it should be that. I have used the code shared by you for testing. The size I get for cooMatrix above is (31250000, 31250000). – Walter Aug 05 '16 at 06:52
  • @eleanora The initialization of the COO matrix is done using `coo_matrix((data, (i, j)), [shape=(M, N)])` where `A[i[k], j[k]] = data[k]` and if shape is not specified, it is inferred from the index arrays [From Scipy documentation] . If we want the size to be fixed from before, and that might be desirable in certain use cases, the shape parameter can be passed too. – Walter Aug 05 '16 at 06:58
  • 1
    One could also use `edges = np.fromfile('data.txt', dtype=np.int32, sep=' ')`, which uses less memory because the whole text file is not in memory the whole time. However it is a little bit slower (don't know why...) – ead Aug 05 '16 at 09:57
  • 1
    Thanks for catching the bug. That's an unfortunate function signature. I have updated my code and benchmarks. – GrantJ Aug 05 '16 at 17:29
0

You might want to take a look at the igraph project; it is a GPL library of C code which is designed for this kind of thing, and it has a nice Python API. I think in your case the Python code would be something like

from igraph import Graph
g = Graph.Read_Edgelist('edges.txt')
g.write_adjacency('adjacency_matrix.txt')
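
If a SciPy sparse matrix is wanted rather than a dense text adjacency matrix, one possible route is to go through the edge list (a sketch only; as the question's update notes, building this kind of intermediate array ran out of RAM on the full data):

import numpy as np
from scipy import sparse
from igraph import Graph

g = Graph.Read_Edgelist('edges.txt')  # assumes vertex ids are already 0..|V|-1
edges = np.array(g.get_edgelist(), dtype=np.uint32)  # full copy of all edges
M = sparse.coo_matrix((np.ones(len(edges), dtype=np.uint8),
                       (edges[:, 0], edges[:, 1])),
                      shape=(g.vcount(), g.vcount()))
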
maxymoo
  • 1
    It should be g = Graph.Read_Ncol('edges.txt') I believe. One problem is that g.write_adjacency writes a dense matrix which will be far too big. Maybe there is a fix for this? – Simd Aug 01 '16 at 07:32
  • 1
    Another problem is that g = Graph.Read_Ncol('edges.txt') uses a huge amount of RAM it seems. You can test it using my data generation code in the question with N = 12500000. – Simd Aug 01 '16 at 07:41