When you use names=True, np.genfromtxt returns a structured array. Notice that the columns labelled col in data.dat get disambiguated to column names of the form col_n:
In [114]: arr = np.genfromtxt('data.dat', comments='#', delimiter='\t', dtype=None, names=True)
In [115]: arr
Out[115]:
array([(1, 2, 3, 1, 2, 3), (4, 3, 2, 3, 2, 4), (1, 4, 3, 1, 4, 3),
(5, 6, 4, 5, 6, 4)],
dtype=[('column_1', '<i8'), ('col', '<i8'), ('col_1', '<i8'), ('col_2', '<i8'), ('col_3', '<i8'), ('col_4', '<i8')])
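The renaming can be reproduced without the file. The following is a minimal sketch, assuming data.dat is tab-separated with a header row of column_1 followed by five col labels (inferred from the output above, not taken from the actual file). The prefix test used to recover the renamed fields is a hypothetical workaround for illustration only.

```python
import io
import numpy as np

# Inline stand-in for data.dat (assumed layout, reconstructed from the output above).
SAMPLE = (
    "column_1\tcol\tcol\tcol\tcol\tcol\n"
    "1\t2\t3\t1\t2\t3\n"
    "4\t3\t2\t3\t2\t4\n"
    "1\t4\t3\t1\t4\t3\n"
    "5\t6\t4\t5\t6\t4\n"
)

arr = np.genfromtxt(io.StringIO(SAMPLE), comments='#', delimiter='\t',
                    dtype=None, names=True)

# Field names after genfromtxt has made the duplicates unique.
print(arr.dtype.names)

# Recovering "everything that was originally called col" now needs a
# name-based heuristic, which breaks if another header starts with "col_".
col_fields = [n for n in arr.dtype.names if n == 'col' or n.startswith('col_')]

# Stack the selected 1-D fields side by side into an ordinary 2-D array.
stacked = np.column_stack([arr[n] for n in col_fields])
print(stacked.shape)
```

The extra bookkeeping here is the reason the header-based approach is simpler.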
So once you use names=True, it becomes harder to select all the data associated with the column name col. Moreover, a structured array is one-dimensional (one record per row), so you cannot slice a range of columns with ordinary 2D indexing such as arr[:, 1:]. It would therefore be more convenient to load the data into an array of homogeneous dtype instead (which is what you would get without names=True):
# Open in text mode so the header line is a str that can be split on '\t'
with open('data.dat', 'r') as f:
    header = f.readline().strip().split('\t')
    arr = np.genfromtxt(f, comments='#', delimiter='\t', dtype=None)
Then you can find the numerical index of those columns whose name is col:
idx = [i for i, col in enumerate(header) if col=='col']
and select all the data with
y = arr[:, idx]
For example,
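import numpy as np

# Read the header row ourselves, then let genfromtxt parse the remaining lines
with open('data.dat', 'r') as f:
    header = f.readline().strip().split('\t')
    arr = np.genfromtxt(f, comments='#', delimiter='\t', dtype=None)

# Positions of every column whose header is exactly 'col'
idx = [i for i, col in enumerate(header) if col == 'col']
y = arr[:, idx]
print(y)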
yields
[[2 3 1 2 3]
 [3 2 3 2 4]
 [4 3 1 4 3]
 [6 4 5 6 4]]
If you want y to be 1-dimensional, you could use ravel():
print(y.ravel())
yields
[2 3 1 2 3 3 2 3 2 4 4 3 1 4 3 6 4 5 6 4]
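For anyone without data.dat at hand, the same selection can be tried on an in-memory file. This is a sketch under the assumption that the header row is column_1 followed by five col labels (inferred from the output shown earlier, not from the real file). It also swaps the index list for a boolean mask, which is an equivalent alternative rather than the method used above.

```python
import io
import numpy as np

# Inline stand-in for data.dat (assumed layout: tab-separated, one header row).
SAMPLE = (
    "column_1\tcol\tcol\tcol\tcol\tcol\n"
    "1\t2\t3\t1\t2\t3\n"
    "4\t3\t2\t3\t2\t4\n"
    "1\t4\t3\t1\t4\t3\n"
    "5\t6\t4\t5\t6\t4\n"
)

f = io.StringIO(SAMPLE)
# Consume the header line; genfromtxt then reads only the data rows.
header = f.readline().strip().split('\t')
arr = np.genfromtxt(f, comments='#', delimiter='\t', dtype=None)

# Boolean mask over the columns: True where the header is exactly 'col'.
mask = np.array(header) == 'col'
y = arr[:, mask]
print(y)
print(y.ravel())
```

The mask has one entry per column, so it can be combined with other conditions (for example `mask & other_mask`) before indexing.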