numpy genfromtxt - infer column header if headers not provided

Question

I understand that with genfromtxt, the defaultfmt parameter controls how default column names are generated, which is useful when the input data has no column names. If defaultfmt is not provided, it defaults to f%i. For example:

>>> from io import StringIO
>>> import numpy as np
>>> data = StringIO("1 2 3\n 4 5 6")
>>> np.genfromtxt(data, dtype=(int, float, int))
array([(1, 2.0, 3), (4, 5.0, 6)],
  dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '<i8')])

So here we have autogenerated column names f0, f1, f2.

But what if I want numpy to infer both the column headers and the data types? I thought you do it with dtype=None, like this:

>>> data3 = StringIO("1 2 3\n 4 5 6")
>>> np.genfromtxt(data3, dtype=None, ???)  # some parameter combo
array([(1, 2, 3), (4, 5, 6)],
  dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8')])

I still want the automatically generated column names of f0, f1...etc. And I want numpy to automatically determine the datatypes based on the data, which I thought was the whole point of doing dtype=None.

EDIT But unfortunately that doesn't ALWAYS work.

This case works when I have both floats and ints.

>>> data3b = StringIO("1 2 3.0\n 4 5 6.0")
>>> np.genfromtxt(data3b, dtype=None)
array([(1, 2, 3.), (4, 5, 6.)],
  dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<f8')])

So numpy correctly inferred a datatype of i8 for the first 2 columns, and f8 for the last column.

But if I provide all ints, the inferred column names disappear.

>>> data3c = StringIO("1 2 3\n 4 5 6")
>>> np.genfromtxt(data3c, dtype=None)
array([[1, 2, 3],
   [4, 5, 6]])

My identical code may or may not work depending on the input data? That doesn't sound right.

And yes I know there's pandas. But I'm not using pandas on purpose. So please bear with me on that.

Looks like the values are all integers, so the default action is to return a regular 2d array rather than a structured array. – hpaulj, Nov 04 '20 at 15:09
The dtype doesn't have to have the names, e.g. `dtype='i,f,i'` or `['i','f','i']` – hpaulj, Nov 04 '20 at 15:35
thanks. Are you talking about dtype being passed in? The thing is, I don't want to pass in anything for dtype. As for all integers vs mix of integer/float - it seems like numpy does what I want if it's mixed, but not if all ints. – user3240688, Nov 04 '20 at 16:01

score 0 · Answer 1 · answered Nov 04 '20 at 16:06

In [2]: txt = '''1,2,3
   ...: 4,5,6'''.splitlines()

Default 2d array of floats:

In [6]: np.genfromtxt(txt, delimiter=',',encoding=None)
Out[6]: 
array([[1., 2., 3.],
       [4., 5., 6.]])

2d of ints:

In [7]: np.genfromtxt(txt, dtype=None, delimiter=',',encoding=None)
Out[7]: 
array([[1, 2, 3],
       [4, 5, 6]])

Specified field dtypes:

In [8]: np.genfromtxt(txt, dtype='i,i,i', delimiter=',',encoding=None)
Out[8]: 
array([(1, 2, 3), (4, 5, 6)],
      dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])

Specified field names:

In [9]: np.genfromtxt(txt, dtype=None, delimiter=',',encoding=None, names=['a','b','c'])
Out[9]: 
array([(1, 2, 3), (4, 5, 6)],
      dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')])

The unstructured array can be converted to structured with:

In [10]: import numpy.lib.recfunctions as rf
In [11]: rf.unstructured_to_structured(Out[7])
Out[11]: 
array([(1, 2, 3), (4, 5, 6)],
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8')])

In numpy the default, preferred array is multidimensional numeric. That's why it produces Out[7] if it can.
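To get f0, f1, ... names regardless of whether the columns share a type, the two steps above (dtype=None, then a conversion when the result is unstructured) can be wrapped together. This is only a sketch: `genfromtxt_structured` is a hypothetical helper name, not part of numpy, and it assumes the input has at least two rows and two columns so that genfromtxt returns either a 2d array or a 1d structured array.

```python
from io import StringIO

import numpy as np
import numpy.lib.recfunctions as rf


def genfromtxt_structured(source, **kwargs):
    """Hypothetical helper: infer dtypes and always return a structured array.

    Assumes the data has at least two rows and two columns.
    """
    # dtype=None asks genfromtxt to infer a type for each column.
    arr = np.genfromtxt(source, dtype=None, encoding=None, **kwargs)
    # If every column got the same type, the result is a plain 2d array
    # with no field names, so convert it; default names are f0, f1, ...
    if arr.dtype.names is None:
        arr = rf.unstructured_to_structured(arr)
    return arr


all_ints = genfromtxt_structured(StringIO("1 2 3\n 4 5 6"))
mixed = genfromtxt_structured(StringIO("1 2 3.0\n 4 5 6.0"))
print(all_ints.dtype.names)  # ('f0', 'f1', 'f2')
print(mixed.dtype.names)     # ('f0', 'f1', 'f2')
```

Extra keyword arguments such as delimiter=',' are passed straight through to genfromtxt.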

answered Nov 04 '20 at 16:06

hpaulj


thank you. can you elaborate on the last statement? So `numpy` defaults to unstructured if it can? And if I'm understanding you correctly, `numpy` thinks unstructured is fine if everything is an `int`. But if we have a mix of `float`s and `int`s, unstructured is not possible. So `genfromtxt` automatically gave me structured. Is that correct? – user3240688 Nov 04 '20 at 16:44
`np.array([[1,2,3],[4,5,6]])` produces a (2,3) int dtype array. You have to use an expression like `Out[11]` to produce the structured array. In other words it has to be a list of tuples, with a fully specified `dtype`. – hpaulj Nov 04 '20 at 18:19
thanks. and what's the reason `np.genfromtxt(StringIO("1 2 3.0\n 4 5 6.0"), dtype=None)` results in fully structured array with `dtype`? I just want to understand when do I need to do `Out[11]`, because it doesn't seem like it's always necessary. – user3240688 Nov 04 '20 at 18:30
With `dtype=None` it notes that some columns are float, and some integer. To preserve that mix it has to use the structured dtype. My previous comment was about making an array directly, with the `np.array` command (not via a string and `genfromtxt`). Do you realize that structured and unstructured arrays behave differently when doing calculations and indexing? Don't skimp on the basic `numpy` reading. – hpaulj Nov 04 '20 at 18:36
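A small sketch of the difference in indexing and calculations mentioned in the last comment, using the same 2x3 integer data as the answer. The variable names are illustrative only.

```python
import numpy as np
import numpy.lib.recfunctions as rf

plain = np.array([[1, 2, 3], [4, 5, 6]])       # shape (2, 3), a single dtype
struct = rf.unstructured_to_structured(plain)  # shape (2,), fields f0, f1, f2

# Indexing: a column is selected by position in the plain array,
# and by field name in the structured one.
col_plain = plain[:, 1]
col_struct = struct["f1"]

# Calculations across columns work directly on the plain array.
row_sums = plain.sum(axis=1)

# For the structured array, convert back to a plain array first.
row_sums_struct = rf.structured_to_unstructured(struct).sum(axis=1)
```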
