
My CSV has a mix of string and numeric columns. numpy.recfromcsv correctly inferred them (woo-hoo), giving a dtype of

dtype=[('null', 'S7'), ('00', '<f8'), ('nsubj', 'S20'), ('g', 'S1'), ...

So a mix of strings and numbers as you can see. But numpy.shape(csv) gives me

(133433,)

Which confuses me, since the dtype implied it was column-aware. Furthermore, indexing works intuitively:

csv[1]
> ('def', 0.0, 'prep_to', 'g', 'query_w', 'indef', 0.0, ...

I also get the error

cannot perform reduce with flexible type

on operations like .all(), even when using a numeric column. I'm not sure whether I'm really working with a table-like entity (two dimensions) or just a one-dimensional list of something. Why is the dtype inconsistent with the shape?

djechlin
  • Take a look at my recent answer regarding `genfromtxt` and `dtype`, http://stackoverflow.com/a/36814096/901925. I'm not as familiar with `recfromcsv`, but I expect the arrays will be similar, a 1d array with a compound `dtype`. You access rows (records) by number, and fields (columns) by name. For a `recarray`, `csv.null` should give you an array of the 1st column, the `S7` names. – hpaulj Apr 25 '16 at 02:30

1 Answer


A recarray is an array of records. Each record can have multiple fields. A record is sort of like a struct in C.

If the shape of the recarray is (133433,), then it is a 1-dimensional array of records.
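
For example, here is a minimal sketch (with made-up field names and data, not your actual CSV) of a structured array like the one recfromcsv returns:

import numpy as np

# Hypothetical 1-D structured array: each element is one record
# with three named fields of different types.
dt = [('label', 'S7'), ('score', '<f8'), ('dep', 'S20')]
csv = np.array([(b'def', 0.0, b'prep_to'),
                (b'indef', 1.5, b'nsubj')], dtype=dt)

csv.shape        # (2,)  -- one axis: the number of records
csv[1]           # (b'indef', 1.5, b'nsubj')  -- a whole record
csv.dtype.names  # ('label', 'score', 'dep')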

The fields of the recarray may be accessed by name-based indexing. For example, csv['nsubj'] is essentially equivalent to

np.array([record['nsubj'] for record in csv])

This special name-based indexing supports the illusion that a 1-dimensional recarray is a 2-dimensional array -- csv[intval] selects rows, and csv[fieldname] selects "columns". Strictly speaking, however, if the shape is (133433,) then the array is 1-dimensional.
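
Here is a small sketch of that distinction (again with made-up field names); it also shows why a reduction such as .all() fails on the whole array but works on a single numeric field:

import numpy as np

dt = [('label', 'S7'), ('score', '<f8'), ('dep', 'S20')]
csv = np.array([(b'def', 0.0, b'prep_to'),
                (b'indef', 1.5, b'nsubj')], dtype=dt)

csv[1]          # integer index -> one record (a "row")
csv['dep']      # field name    -> 1-D array of that field (a "column")

# A single numeric field is a plain float array, so reductions work:
csv['score'].all()   # False (the first score is 0.0)

# The whole array has a compound ("flexible") dtype, so reductions fail:
# csv.all()  ->  TypeError: cannot perform reduce with flexible type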

Note that not all recarrays are 1-dimensional. It is possible to have a higher-dimensional recarray:

In [142]: arr = np.zeros((3,2), dtype=[('foo', 'int'), ('bar', 'float')])

In [143]: arr
Out[143]: 
array([[(0, 0.0), (0, 0.0)],
       [(0, 0.0), (0, 0.0)],
       [(0, 0.0), (0, 0.0)]], 
      dtype=[('foo', '<i8'), ('bar', '<f8')])

In [144]: arr.shape
Out[144]: (3, 2)

This is a 2-dimensional array, whose elements are records.

Here are the bar field values in the arr[:, 0] slice:

In [148]: arr[:, 0]['bar']
Out[148]: array([ 0.,  0.,  0.])

Here are all the bar field values in the 2D array:

In [151]: arr['bar']
Out[151]: 
array([[ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.]])

In [160]: arr['bar'].all()
Out[160]: False

Note that an alternative to using recarrays is a Pandas DataFrame. There are many more methods available for manipulating DataFrames than recarrays, so you might find them more convenient.
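
For instance, a minimal sketch of the conversion (assuming pandas is installed; the field names are made up):

import numpy as np
import pandas as pd

dt = [('label', 'S7'), ('score', '<f8'), ('dep', 'S20')]
csv = np.array([(b'def', 0.0, b'prep_to'),
                (b'indef', 1.5, b'nsubj')], dtype=dt)

# Each field of the compound dtype becomes a named DataFrame column.
df = pd.DataFrame.from_records(csv)

df.dtypes          # per-column dtypes, one per field
df['score'].all()  # column-wise operations work as expected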

unutbu
  • the dtype seemed to store all the type information for each column -- it does this without treating an array of records as multidimensional? – djechlin Apr 25 '16 at 05:18
  • Apparently, yes. I just also learned this from unutbu's answer. But the answer and your observation are consistent. The type encapsulates the 2nd dimension. So your example behaves more like a list of lists and less like a 2D array. – roadrunner66 Apr 25 '16 at 06:01
  • 1
    With in a `dtype` different fields can have different `dtype` and size. In an `n-d` array each element has the same `dtype` and `nbytes`. A compound `dtype` adds a new kind of dimensionality within the `n-d` array. There's an overlap in concepts, but also a fundamental discontinuity. – hpaulj Apr 25 '16 at 17:53