
I have a text file that looks like this:

...
5   [0, 1]  [512, 479]  991
10  [1, 0]  [706, 280]  986
15  [1, 0]  [807, 175]  982
20  [1, 0]  [895, 92]   987
...

Each column is tab separated, but there are arrays in some of the columns. Can I import these with np.genfromtxt in some way?

The resulting unpacked lists should be, for example:

data1 = [..., 5, 10, 15, 20, ...]
data2 = [..., [512, 479], [706, 280], ... ] (i.e. a 2D list)
etc.

I tried

data1, data2, data3, data4 = np.genfromtxt('data.txt', dtype=None, delimiter='\t', unpack=True)

but data2 and data3 are lists containing 'nan'.

  • If genfromtxt isn't working, try something else like iterating over the lines and constructing lists that can be used by numpy to make an array. – wwii Aug 30 '16 at 14:59
  • You could probably make use of the `usecols` and `converters` parameters of genfromtxt – wwii Aug 30 '16 at 15:07
  • The brackets make using the stock txt loaders more difficult. Have you tried reading the file line by line and parsing each line yourself? – hpaulj Aug 30 '16 at 15:26
  • I now have something like this, based on importing the text file into one big 'data' object and parsing the strings created: `datastr = data[i][1][1:-1].split(',') dataarray = [] for j in range(0, len(datastr)): dataarray.append(int(datastr[j])) data2.append(dataarray)` It works but seems very clunky. –  Aug 30 '16 at 15:40
  • I don't think a `genfromtxt` `converter` can be used to split one column into two. – hpaulj Aug 30 '16 at 20:03

3 Answers


A potential approach for the given data, though not using numpy:

import ast

data1, data2, data3, data4 = [],[],[],[]

with open('data.txt') as f:
    for line in f:
        fields = line.split('\t')

        data1.append(int(fields[0]))
        data2.append(ast.literal_eval(fields[1]))
        data3.append(ast.literal_eval(fields[2]))
        data4.append(int(fields[3]))

print('data1', data1)
print('data2', data2)
print('data3', data3)
print('data4', data4)

Gives

data1 [5, 10, 15, 20]
data2 [[0, 1], [1, 0], [1, 0], [1, 0]]
data3 [[512, 479], [706, 280], [807, 175], [895, 92]]
data4 [991, 986, 982, 987]
LukeBowl
  • Those `datan` lists can be easily turned into arrays at the end. So there's no loss of speed or functionality to do this. – hpaulj Aug 30 '16 at 19:09
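As the comment says, those lists convert directly to arrays at the end (a minimal sketch, using the sample values from the answer's output):

```python
import numpy as np

# lists as produced by the parsing loop above (values from the sample output)
data1 = [5, 10, 15, 20]
data2 = [[0, 1], [1, 0], [1, 0], [1, 0]]

arr1 = np.array(data1)  # 1D integer array, shape (4,)
arr2 = np.array(data2)  # 2D array, shape (4, 2): equal-length inner lists stack cleanly
```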

Brackets in a csv file are clunky no matter how you look at it. The default csv structure is 2d: rows and uniform columns. The brackets add a level of nesting. But the fact that the columns are tab separated while the nested blocks are comma separated makes it a bit easier.

Your comment code is (with newlines added):

datastr = data[i][1][1:-1].split(',') 
dataarray = [] 
for j in range(0, len(datastr)): 
     dataarray.append(int(datastr[j])) 
data2.append(dataarray)

I assume data[i] looks something like (after a tab split):

['5', '[0, 1]', '[512, 479]',  '991']

So for the '[0, 1]' you strip off the `[]`, split the rest on commas, convert each piece to an int, and append that list to `data2`.

That certainly looks like a viable approach. `genfromtxt` does not have special handling for brackets or quotes. The `csv` reader can handle quoted text, and might be adapted to treat `[]` as quotes. But other than that, I think the `[]` have to be handled with some sort of string processing, as you do.

Keep in mind that genfromtxt just reads lines, parses them, and collects the resulting lists in a master list. It then converts that list to an array at the end. So doing your own line by line, string by string parsing is not inferior.
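To illustrate that point, here is a minimal hand-rolled loader (a sketch, not from the answer) that parses line by line, collects rows in a master list, and converts once at the end, just as genfromtxt does:

```python
import numpy as np

# two lines in the question's tab-separated format (inlined here for the sketch)
lines = [
    "5\t[0, 1]\t[512, 479]\t991",
    "10\t[1, 0]\t[706, 280]\t986",
]

rows = []
for line in lines:
    flat = []
    for field in line.split('\t'):
        # strip brackets/whitespace, split nested values on commas, convert to int
        flat.extend(int(x) for x in field.strip('[] \n').split(','))
    rows.append(flat)

data = np.array(rows)  # master list -> (2, 6) integer array
```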

=============

With your sample as a text file:

In [173]: txt=b"""
     ...: 5  \t [0, 1] \t [512, 479] \t 991
     ...: 10 \t [1, 0] \t [706, 280] \t 986
     ...: 15 \t [1, 0] \t [807, 175] \t 982
     ...: 20 \t [1, 0] \t [895, 92]  \t 987"""

A simple genfromtxt call with dtype=None:

In [186]: data = np.genfromtxt(txt.splitlines(), dtype=None, delimiter='\t', autostrip=True)

The result is a structured array with integer and string fields:

In [187]: data
Out[187]: 
array([(5, b'[0, 1]', b'[512, 479]', 991),
       (10, b'[1, 0]', b'[706, 280]', 986),
       (15, b'[1, 0]', b'[807, 175]', 982),
       (20, b'[1, 0]', b'[895, 92]', 987)], 
      dtype=[('f0', '<i4'), ('f1', 'S6'), ('f2', 'S10'), ('f3', '<i4')])

Fields are accessed by name:

In [188]: data['f0']
Out[188]: array([ 5, 10, 15, 20])
In [189]: data['f1']
Out[189]: 
array([b'[0, 1]', b'[1, 0]', b'[1, 0]', b'[1, 0]'], 
      dtype='|S6')

If we can deal with the `[]`, your data could be nicely represented as a structured array with a compound dtype:

In [191]: dt=np.dtype('i,2i,2i,i')
In [192]: np.ones((3,),dtype=dt)
Out[192]: 
array([(1, [1, 1], [1, 1], 1), (1, [1, 1], [1, 1], 1),
       (1, [1, 1], [1, 1], 1)], 
      dtype=[('f0', '<i4'), ('f1', '<i4', (2,)), ('f2', '<i4', (2,)), ('f3', '<i4')])

where the 'f1' field is a (3,2) array.

One approach is to pass the text/file through a function that filters out the extra characters. genfromtxt works with anything that will feed it a line at a time.

def afilter(txt):
    for line in txt.splitlines():
        line=line.replace(b'[', b' ').replace(b']', b'').replace(b',' ,b'\t')
        yield line

This generator strips out the `[]` and replaces each `,` with a tab, in effect producing a flat csv file:

In [205]: list(afilter(txt))
Out[205]: 
[b'',
 b'5  \t  0\t 1  \t  512\t 479  \t 991',
 b'10 \t  1\t 0  \t  706\t 280  \t 986',
 b'15 \t  1\t 0  \t  807\t 175  \t 982',
 b'20 \t  1\t 0  \t  895\t 92   \t 987']

genfromtxt with dtype=None will produce an array with 6 columns.

In [209]: data=np.genfromtxt(afilter(txt),delimiter='\t',dtype=None)
In [210]: data
Out[210]: 
array([[  5,   0,   1, 512, 479, 991],
       [ 10,   1,   0, 706, 280, 986],
       [ 15,   1,   0, 807, 175, 982],
       [ 20,   1,   0, 895,  92, 987]])
In [211]: data.shape
Out[211]: (4, 6)

But if I give it the dt dtype I defined above, I get a structured array:

In [206]: data=np.genfromtxt(afilter(txt),delimiter='\t',dtype=dt)
In [207]: data
Out[207]: 
array([(5, [0, 1], [512, 479], 991), (10, [1, 0], [706, 280], 986),
       (15, [1, 0], [807, 175], 982), (20, [1, 0], [895, 92], 987)], 
      dtype=[('f0', '<i4'), ('f1', '<i4', (2,)), ('f2', '<i4', (2,)), ('f3', '<i4')])
In [208]: data['f1']
Out[208]: 
array([[0, 1],
       [1, 0],
       [1, 0],
       [1, 0]], dtype=int32)

The brackets could be dealt with at several levels. I don't think there's much advantage to one approach over another.

hpaulj

As an alternative to genfromtxt you might try fromregex. You basically set up an analogy between regular expression groups and the fields of a numpy structured dtype.

In this example, I parse out all numbers without worrying about whether they are single numbers or arrays. Then I switch to a dtype that specifies which columns have arrays.

import numpy as np

# regular expression that will extract 6 numbers from each line
re = 6 * r"(\d+)\D*"

# dtype for a single number (np.int was removed in newer NumPy versions)
dt_num = np.int64

# structured dtype of 6 numbers
dt = 6 * [('', dt_num)]

# parse the file
a = np.fromregex("data.txt", re, dt)

# change to a more descriptive structured dtype
a.dtype = [
  ('data1', dt_num),
  ('data2', dt_num, (2,)),
  ('data3', dt_num, (2,)),
  ('data4', dt_num)
]

print(a['data1'])
print(a['data2'])
print(a['data3'])
print(a['data4'])

The nice thing about switching the dtype of a numpy array is that it does not process or copy the data; it just reinterprets what you get when you access it.
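That no-copy behavior can be checked directly: a dtype view shares the same buffer, so writes through the new view show up in the original (a small sketch with a made-up two-field array):

```python
import numpy as np

a = np.zeros(4, dtype=[('x', np.int64), ('y', np.int64)])
a['x'] = [1, 2, 3, 4]

# reinterpret the two int64 fields as one length-2 subarray field; no data is copied
b = a.view([('pair', np.int64, (2,))])

b['pair'][0, 1] = 99  # write through the view...
# ...and the change is visible in the original array: a['y'][0] is now 99
```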

One of the downsides of this solution is that building complex regular expressions and structured dtypes can get ugly. And in this case, you have to keep the two in sync with each other.
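One way around that sync problem (just a sketch, not part of the answer) is to derive both the regex and the dtype from a single field count, so they cannot drift apart:

```python
import numpy as np

n_fields = 6  # single source of truth for both pieces

regex = n_fields * r"(\d+)\D*"              # one capture group per field
dt = np.dtype(n_fields * [('', np.int64)])  # one dtype field per group (auto-named f0..f5)
```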

j.p.