
I have two different arrays, one with strings and another with ints. I want to concatenate them into one array where each column keeps its original datatype. My current solution (see below) converts the entire array to dtype = string, which seems very memory inefficient.

combined_array = np.concatenate((A, B), axis = 1)

Is it possible to have multiple dtypes in combined_array when A.dtype = string and B.dtype = int?
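For context, a minimal reproduction of the promotion described above (the contents of A and B here are made up; only the dtypes matter):

import numpy as np

A = np.array([['a'], ['b']])            # string column, dtype '<U1'
B = np.array([[1], [2]])                # integer column
combined_array = np.concatenate((A, B), axis=1)
combined_array.dtype                    # a string dtype such as '<U21': every int was cast to a string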

veor
  • The question is about using a NumPy array. However, if having a NumPy array is not essential then a [Pandas DataFrame](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe) would work well for this situation. – crayzeewulf May 03 '15 at 01:42
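A minimal sketch of the DataFrame approach that comment suggests, assuming A holds the strings and B the ints (the column names here are just for the example):

import numpy as np
import pandas as pd

A = np.array(['a', 'b', 'c'])
B = np.array([1, 2, 3])
df = pd.DataFrame({'keys': A, 'data': B})   # each column keeps its own dtype
df.dtypes                                   # keys: object, data: int64 (exact int width may vary)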

3 Answers

One approach might be to use a record array. The "columns" won't be like the columns of standard numpy arrays, but for most use cases, this is sufficient:

>>> a = numpy.array(['a', 'b', 'c', 'd', 'e'])
>>> b = numpy.arange(5)
>>> records = numpy.rec.fromarrays((a, b), names=('keys', 'data'))
>>> records
rec.array([('a', 0), ('b', 1), ('c', 2), ('d', 3), ('e', 4)], 
      dtype=[('keys', '|S1'), ('data', '<i8')])
>>> records['keys']
rec.array(['a', 'b', 'c', 'd', 'e'], 
      dtype='|S1')
>>> records['data']
array([0, 1, 2, 3, 4])

Note that you can also do something similar with a standard array by specifying the datatype of the array. This is known as a "structured array":

>>> arr = numpy.array([('a', 0), ('b', 1)], 
                      dtype=([('keys', '|S1'), ('data', 'i8')]))
>>> arr
array([('a', 0), ('b', 1)], 
      dtype=[('keys', '|S1'), ('data', '<i8')])

The difference is that record arrays also allow attribute access to individual data fields. Standard structured arrays do not.

>>> records.keys
chararray(['a', 'b', 'c', 'd', 'e'], 
      dtype='|S1')
>>> arr.keys
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'numpy.ndarray' object has no attribute 'keys'
senderle
  • `arr = np.array([('cat', 5), ('dog', 20)], dtype=[('name', np.object), ('age', np.int)])`; the name column can be accessed by `arr['name']` in a structured array. – Bharath Ram May 02 '20 at 11:15
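Note that the np.object and np.int aliases used in that comment were deprecated in NumPy 1.20 and removed in 1.24; the same structured array on current NumPy could be written with the plain Python types:

import numpy as np

arr = np.array([('cat', 5), ('dog', 20)], dtype=[('name', object), ('age', int)])
arr['name']   # array(['cat', 'dog'], dtype=object)
arr['age']    # array([ 5, 20])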

A simple solution: convert your data to the object ('O') dtype:

z = np.zeros((2,2), dtype='U2')
o = np.ones((2,1), dtype='O')
np.hstack([o, z])

creates the array:

array([[1, '', ''],
       [1, '', '']], dtype=object)
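Applied to the question's setup, the same trick could look like this (A and B are hypothetical stand-ins for the original arrays):

import numpy as np

A = np.array([['a'], ['b']])                          # strings
B = np.array([[1], [2]])                              # ints
combined = np.hstack([A.astype('O'), B.astype('O')])
combined               # array([['a', 1], ['b', 2]], dtype=object)
type(combined[0, 1])   # the integers survive as integers instead of being cast to strings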
codeMonkey
  • This causes all kinds of problems down the line if you actually want to do any meaningful operations on the slices of that array. – Astrid Mar 26 '18 at 12:52
  • What kind of problems? – matthieu May 03 '19 at 08:00
  • @Astrid could you elaborate on your thoughts? – flow2k Sep 29 '19 at 05:11
  • Suppose, for argument's sake, that you turned that into a dataframe and then wanted to filter it, say with `df.loc[(df.col == item)]`. That would not work, because when pandas does the filtering it expects all the items to be of the same type. So if, for example, you mix strings and integers in the same column, you are effectively comparing apples and oranges, and pandas would throw an error. – Astrid Sep 29 '19 at 14:07
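A small illustration of the friction these comments describe, staying with the object array from this answer (a sketch; the exact error message may vary):

import numpy as np

z = np.zeros((2, 2), dtype='U2')
o = np.ones((2, 1), dtype='O')
mixed = np.hstack([o, z])      # object array mixing ints and strings

mixed[:, 0].sum()              # 2, the purely numeric column still behaves
try:
    mixed[0].sum()             # a row that mixes int and str
except TypeError as exc:
    print(exc)                 # adding an int to a str is undefined, so the reduction fails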

Referring to the NumPy documentation, there is a function named numpy.lib.recfunctions.merge_arrays which can be used to merge NumPy arrays with different data types into either a structured array or a record array.

Example:

>>> from numpy.lib import recfunctions as rfn
>>> A = np.array([1, 2, 3])
>>> B = np.array(['a', 'b', 'c'])
>>> b = rfn.merge_arrays((A, B))
>>> b
array([(1, 'a'), (2, 'b'), (3, 'c')], dtype=[('f0', '<i4'), ('f1', '<U1')])

For more detail, please refer to the NumPy documentation.
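If attribute access is wanted as well, merge_arrays also accepts an asrecarray flag; a short sketch (f0 and f1 are the default field names it assigns):

import numpy as np
from numpy.lib import recfunctions as rfn

A = np.array([1, 2, 3])
B = np.array(['a', 'b', 'c'])
merged = rfn.merge_arrays((A, B), asrecarray=True)
merged.f0    # the integer column, still an integer dtype
merged.f1    # the string column, dtype '<U1'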

lX-Xl