Is there a more efficient way to convert a multi-line string to a NumPy array?

Question

I am converting a multi-line string to a NumPy array like this:

import numpy as np

names = """
1 2 1
1 1 0
0 1 1
"""
names_list = names.splitlines()
tem = []
for i in [row for row in names_list if row]:
    tem.append([col for col in list(i) if col != ' '])

np.array(tem, dtype=int)  # np.int was an alias of int and was removed in NumPy 1.24

This code works, but I would like to know whether there is a more efficient way to do this.

if all entries are separated by a space, you can call `i.split(" ")` on the strings. — warped, Apr 30 '19 at 09:15

score 3 · Accepted Answer · answered May 01 '19 at 23:35

One answer was flagged as low quality for not explaining itself. But none of the other three explain themselves either, and they are just replicas of each other.

In [227]: names = """ 
     ...: 1 2 1 
     ...: 1 1 0 
     ...: 0 1 1 
     ...: """    

In [238]: np.genfromtxt(StringIO(names), dtype=int)                                  
Out[238]: 
array([[1, 2, 1],
       [1, 1, 0],
       [0, 1, 1]])
In [239]: timeit np.genfromtxt(StringIO(names), dtype=int)                           
135 µs ± 286 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Actually, we don't need the StringIO layer; genfromtxt also accepts a list of lines, so we can just split the string (sometimes an encoding=None parameter is needed as well):

In [242]: np.genfromtxt(names.splitlines(), dtype=int)                               
Out[242]: 
array([[1, 2, 1],
       [1, 1, 0],
       [0, 1, 1]])
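The IPython session above does not show its imports, so here is a minimal sketch of the same two calls as a plain script. It assumes Python 3 and a NumPy version in which genfromtxt accepts text input; the variable names `from_file_like` and `from_lines` are only illustrative.

```python
from io import StringIO

import numpy as np

names = """
1 2 1
1 1 0
0 1 1
"""

# Variant 1: wrap the string in a file-like object.
from_file_like = np.genfromtxt(StringIO(names), dtype=int)

# Variant 2: pass the lines directly; genfromtxt also accepts a list of strings.
from_lines = np.genfromtxt(names.splitlines(), dtype=int)

print(from_file_like)
print(np.array_equal(from_file_like, from_lines))
```

Both variants skip the empty first line and should produce the same 3x3 integer array.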

The original function from the question is roughly 10x faster than the genfromtxt approach used in the other answers:

def orig(names):
    names_list = names.splitlines()
    tem = []
    for i in [row for row in names_list if row]:
        tem.append([col for col in list(i) if col != ' '])
    return np.array(tem, dtype=int)

In [244]: orig(names)                                                                
Out[244]: 
array([[1, 2, 1],
       [1, 1, 0],
       [0, 1, 1]])
In [245]: timeit orig(names)                                                         
11.1 µs ± 194 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

genfromtxt does basically the same thing: it splits lines, collects values in a list of lists, and turns that into an array. It is written in Python, not compiled code.

The flagged answer replaces the list comprehension with a split method:

def czisws(names):
    names_list = names.splitlines()
    tem = []
    for i in [row for row in names_list if row]:
        tem.append(i.split())
    return np.array(tem, dtype=int)

In [247]: timeit czisws(names)                                                       
8.58 µs ± 274 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

It is faster, which isn't surprising: split is a string method. Built-in methods are typically faster, and they are preferable even when they aren't.
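To reproduce this kind of comparison outside IPython, a rough sketch using the standard timeit module could look like the following. The function names `parse_chars`, `parse_split` and `parse_genfromtxt` are made up for this example, and the absolute numbers will differ by machine and NumPy version, so no timings are quoted here.

```python
import timeit
from io import StringIO

import numpy as np

names = """
1 2 1
1 1 0
0 1 1
"""

def parse_chars(text):
    # Character-based parsing as in the question (single-digit values only).
    rows = [[ch for ch in line if ch != " "] for line in text.splitlines() if line]
    return np.array(rows, dtype=int)

def parse_split(text):
    # Same loop, but each line is tokenized with str.split().
    rows = [line.split() for line in text.splitlines() if line]
    return np.array(rows, dtype=int)

def parse_genfromtxt(text):
    # Let genfromtxt do the line splitting and conversion.
    return np.genfromtxt(StringIO(text), dtype=int)

repeats = 2000
for func in (parse_chars, parse_split, parse_genfromtxt):
    seconds = timeit.timeit(lambda: func(names), number=repeats)
    print(f"{func.__name__}: {seconds / repeats * 1e6:.1f} µs per call")
```

All three functions return the same array for this input, so only their speed differs.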

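split is also more general purpose, because it keeps multi-character tokens intact, whereas the character-by-character comprehension breaks them apart: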

In [251]: 'abc de f'.split()                                                         
Out[251]: ['abc', 'de', 'f']
In [252]: [i for i in list('abc de f') if i!=' ']                                    
Out[252]: ['a', 'b', 'c', 'd', 'e', 'f']
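The same difference matters for numbers with more than one digit. The sketch below is an illustration with made-up input (`text`) and a hypothetical helper name (`parse_split`), not code from any of the answers. It shows a compact split-based parser and how the character-based approach would mangle the value 10.

```python
import numpy as np

text = """
10 2 1
1 11 0
"""

def parse_split(text):
    # Skip blank lines, split each remaining line on whitespace,
    # and let NumPy convert the string tokens to integers.
    rows = [line.split() for line in text.splitlines() if line.strip()]
    return np.array(rows, dtype=int)

result = parse_split(text)
print(result)

# The character-based approach breaks "10" into "1" and "0":
chars = [ch for ch in "10 2 1" if ch != " "]
print(chars)  # ['1', '0', '2', '1']
```

With the character-based rows, the first line would yield four values instead of three, so the rows would no longer have equal length.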

score 2 · Answer 2 · Rakesh · edited Apr 30 '19 at 09:21

You can use np.genfromtxt

Example:

import numpy as np
from io import BytesIO

names = """
1 2 1
1 1 0
0 1 1
"""
# On Python 3, BytesIO requires bytes, so encode the string first.
# On Python 2, BytesIO(names) works directly.
print(np.genfromtxt(BytesIO(names.encode('utf-8')), dtype=int))

Output:

[[1 2 1]
 [1 1 0]
 [0 1 1]]

edited Apr 30 '19 at 09:21

answered Apr 30 '19 at 09:16

Rakesh


https://stackoverflow.com/questions/48039690/numpy-throws-error-while-using-genfromtxt-function-in-python-3 – Rakesh Apr 30 '19 at 09:19

score 1 · Answer 3 · answered Apr 30 '19 at 09:19

You can use np.genfromtxt as follows for Python 3

import numpy as np
from io import BytesIO

names = """
1 2 1
1 1 0
0 1 1
"""
print(np.genfromtxt(BytesIO(names.encode('utf-8')), dtype=int))
# For Python 2: print(np.genfromtxt(BytesIO(names), dtype=int))

This produces the following output:

[[1 2 1]
 [1 1 0]
 [0 1 1]]
