
I have a text file containing an upper triangular matrix, with the lower values omitted (example below):

3 5 3 5 1 8 1 6 5 8
5 8 1 1 6 2 9 6 4
2 0 5 2 1 0 0 3
2 2 5 1 0 1 0
1 3 6 3 6 1
4 2 4 3 7
4 0 0 1
0 1 8
2 1
1

Since the file in question is ~10000 lines long, I was wondering if there was a 'smart' way to generate a numpy matrix from it, e.g. using the genfromtxt function. However, using it directly throws an error along the lines of Line #12431 (got 6 columns instead of 12437), and using filling_values won't work because the missing values have no placeholders to designate.
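For reference, a direct call like the following is what triggers that error (data.txt is a stand-in for the actual file name):

import numpy as np

# Fails with a ValueError because genfromtxt expects every row to have
# as many columns as the first row of the file.
arr = np.genfromtxt('data.txt')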

Right now I have to resort to manually opening and reading the file:

import numpy as np

def load_updiag(filename, size):
    # Pre-allocate the full square matrix and fill the upper triangle row by row.
    output = np.zeros((size, size))
    with open(filename) as f:
        for i, line in enumerate(f):
            output[i, i:size] = line.split()
    return output
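For the 10-line sample above this would be called as, say (data.txt again standing in for the actual file name):

matrix = load_updiag('data.txt', 10)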

I feel this is probably not very scalable for large files. Is there a way to properly use genfromtxt (or any other optimized function from numpy's library) on such matrices?

1 Answer


You can read the raw data from the file into a string, and then use np.fromstring to get a 1-d array of the upper triangular part of the matrix:

with open('data.txt') as data_file:
    data = data_file.read()

arr = np.fromstring(data, sep=' ')

Alternatively, you can define a generator to read one line of your file at a time, then use np.fromiter to read a 1-d array from this generator:

def iter_data(path):
    with open(path) as data_file:
        for line in data_file:
            yield from line.split()

arr = np.fromiter(iter_data('data.txt'), int)

If you know the size of the matrix (which you can determine from the first line of the file), you can specify the count keyword argument of np.fromiter so that the function pre-allocates exactly the right amount of memory, which is faster: the first row has n entries, so the triangle contains n*(n+1)/2 values in total. That's what these functions do:

def iter_data(fileobj):
    for line in fileobj:
        yield from line.split()

def read_triangular_array(path):
    with open(path) as fileobj:
        n = len(fileobj.readline().split())

    count = int(n*(n+1)/2)

    with open(path) as fileobj:
        return np.fromiter(iter_data(fileobj), int, count=count)

This "wastes" a little work, since it opens the file twice to read the first line and get the count of entries. An "improvement" would be to save the first line and chain it with the iterator over the rest of the file, as in this code:

from itertools import chain

def iter_data(fileobj):
    for line in fileobj:
        yield from line.split()

def read_triangular_array(path):
    with open(path) as fileobj:
        first = fileobj.readline().split()
        n = len(first)
        count = int(n*(n+1)/2)
        data = chain(first, iter_data(fileobj))
        return np.fromiter(data, int, count=count)
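Either version is called the same way (with data.txt as a placeholder path):

arr = read_triangular_array('data.txt')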

All of these approaches yield the same flattened upper triangle (shown here with the float dtype produced by np.fromstring; the np.fromiter versions return integers):

>>> arr
array([ 3.,  5.,  3.,  5.,  1.,  8.,  1.,  6.,  5.,  8.,  5.,  8.,  1.,
        1.,  6.,  2.,  9.,  6.,  4.,  2.,  0.,  5.,  2.,  1.,  0.,  0.,
        3.,  2.,  2.,  5.,  1.,  0.,  1.,  0.,  1.,  3.,  6.,  3.,  6.,
        1.,  4.,  2.,  4.,  3.,  7.,  4.,  0.,  0.,  1.,  0.,  1.,  8.,
        2.,  1.,  1.])

This compact representation might be all you need, but if you want the full square matrix you can allocate a zeros matrix of the right size and copy arr into it using np.triu_indices_from (sketched below), or you can use scipy.spatial.distance.squareform.
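A minimal sketch of the np.triu_indices_from route (assuming arr holds the 55 entries of a 10x10 upper triangle, diagonal included, as above):

import numpy as np

n = 10  # recoverable from arr.size, since n*(n+1)/2 == arr.size
full = np.zeros((n, n))
full[np.triu_indices_from(full)] = arr           # fill the upper triangle, diagonal included
full = full + full.T - np.diag(full.diagonal())  # mirror to make the matrix symmetric

The squareform route looks like this: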

>>> from scipy.spatial.distance import squareform
>>> squareform(arr)
array([[ 0.,  3.,  5.,  3.,  5.,  1.,  8.,  1.,  6.,  5.,  8.],
       [ 3.,  0.,  5.,  8.,  1.,  1.,  6.,  2.,  9.,  6.,  4.],
       [ 5.,  5.,  0.,  2.,  0.,  5.,  2.,  1.,  0.,  0.,  3.],
       [ 3.,  8.,  2.,  0.,  2.,  2.,  5.,  1.,  0.,  1.,  0.],
       [ 5.,  1.,  0.,  2.,  0.,  1.,  3.,  6.,  3.,  6.,  1.],
       [ 1.,  1.,  5.,  2.,  1.,  0.,  4.,  2.,  4.,  3.,  7.],
       [ 8.,  6.,  2.,  5.,  3.,  4.,  0.,  4.,  0.,  0.,  1.],
       [ 1.,  2.,  1.,  1.,  6.,  2.,  4.,  0.,  0.,  1.,  8.],
       [ 6.,  9.,  0.,  0.,  3.,  4.,  0.,  0.,  0.,  2.,  1.],
       [ 5.,  6.,  0.,  1.,  6.,  3.,  0.,  1.,  2.,  0.,  1.],
       [ 8.,  4.,  3.,  0.,  1.,  7.,  1.,  8.,  1.,  1.,  0.]])
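Note that squareform treats its input as a condensed distance matrix whose diagonal is omitted, so with 55 values it builds an 11x11 matrix with a zero diagonal rather than reproducing the original 10x10 matrix; use the np.triu_indices_from sketch above if the diagonal entries matter.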
jme
  • This is interesting. I'd rather avoid functions like `read()` or `readlines()` if at all possible, though, as they can be hard on memory for large matrices. If I can't find any other solution I'll accept your answer. – skcidereves Nov 13 '15 at 09:23
  • @skcidereves Sure, I've added a solution that uses `np.fromiter` and doesn't read the full file using `read()`. Note that all solutions of any nature are going to take `O(n^2)` simply because the matrix has `O(n^2)` entries, and they must eventually be stored in a numpy array. But this solution using `np.fromiter` will halve the memory requirements, roughly. – jme Nov 13 '15 at 16:58
  • Thank you! I specifically appreciate the compact representation as a bonus. – skcidereves Nov 16 '15 at 09:26