To give a sense of how an array of strings is stored, I'll make one, and view it in several ways:
In [654]: np.array(['one','two','three','four'],dtype='S5')
Out[654]:
array([b'one', b'two', b'three', b'four'],
dtype='|S5')
In [655]: x=np.array(['one','two','three','four'],dtype='S5')
In [656]: x.tostring()
Out[656]: b'one\x00\x00two\x00\x00threefour\x00'
In [657]: x.view(np.uint8)
Out[657]:
array([111, 110, 101, 0, 0, 116, 119, 111, 0, 0, 116, 104, 114,
101, 101, 102, 111, 117, 114, 0], dtype=uint8)
So its databuffer consists of 20 bytes (4 elements * 5 bytes each for S5). Strings shorter than 5 bytes are padded with 0 bytes up to the full width.
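As a rough sketch of that layout, the raw buffer can be cut into itemsize-wide fields to see the padding directly. I use tobytes here because tostring is its deprecated older name; the variable names raw and fields are just for illustration.

```python
import numpy as np

x = np.array(['one', 'two', 'three', 'four'], dtype='S5')

print(x.itemsize)  # bytes reserved per element
print(x.nbytes)    # total size of the data buffer

# Copy the data buffer out as a bytes object.
raw = x.tobytes()

# Cut the buffer into fixed-width fields, one per array element.
fields = [raw[i:i + x.itemsize] for i in range(0, len(raw), x.itemsize)]
print(fields)
```

The shorter strings show up with trailing \x00 bytes, while 'three' fills its field completely.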
Yes, there are C functions for creating new arrays of a given size and dtype, and functions for copying blocks of data to those arrays. Look at the C side of the numpy documentation, or look at some of the numpy code in its GitHub repository.
Regarding the pandas transfer, beware that pandas readily changes the dtype of its columns. For example, if you put None or nan in a column, it is likely to change it to object dtype.
Object arrays store pointers in the databuffer.
In [658]: y=np.array(['one','two','three','four'],dtype=object)
In [659]: y
Out[659]: array(['one', 'two', 'three', 'four'], dtype=object)
In [660]: y.tostring()
Out[660]: b'\xe0\x0f\xc5\xb5\xa0\xfah\xb5\x80\x0b\x8c\xb4\xc09\x8b\xb4'
If I interpret that right, the databuffer has 16 bytes: four 4-byte pointers (this was a 32-bit build; on a 64-bit build each pointer takes 8 bytes). The strings themselves are stored elsewhere in memory as regular Python strings (unicode strings, since this is Py3).
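One way to check the pointer interpretation, as a sketch that assumes CPython (where id() returns an object's memory address): read the buffer back as pointer-sized unsigned integers and compare them with the ids of the stored strings. The names pointer_size, addresses and ids are mine.

```python
import ctypes
import numpy as np

y = np.array(['one', 'two', 'three', 'four'], dtype=object)

# Size of a C pointer on this build: 4 on 32-bit, 8 on 64-bit.
pointer_size = ctypes.sizeof(ctypes.c_void_p)
print(y.itemsize, pointer_size)

# Reinterpret a copy of the data buffer as pointer-sized unsigned integers.
addresses = np.frombuffer(y.tobytes(), dtype=np.uintp)

# CPython-specific assumption: id() is the address of the object.
ids = [id(obj) for obj in y]
print(addresses.tolist() == ids)
```

The buffer holds only the addresses; the string data itself lives in the separate Python objects.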
=================
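fromstring and frombuffer let me recreate an array from a buffer. (In current NumPy, tostring is deprecated in favor of tobytes, and fromstring on binary data is deprecated in favor of frombuffer.)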
In [696]: x=np.array(['one','two','three','four'],dtype='S5')
In [697]: xs=x.tostring()
In [698]: np.fromstring(xs,'S5')
Out[698]:
array([b'one', b'two', b'three', b'four'],
dtype='|S5')
In [700]: np.frombuffer(xs,'S5')
Out[700]:
array([b'one', b'two', b'three', b'four'],
dtype='|S5')
frombuffer works without copying the buffer; fromstring makes a copy.
However, if there are multiple strings in different parts of memory, then building an array from them will require copying them into one contiguous buffer.
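A minimal sketch of both cases. I use a bytearray (my choice, not from the session above) so the shared memory can be demonstrated by changing the buffer after the array is created; pieces stands in for strings that live in separate places.

```python
import numpy as np

# Case 1: one contiguous buffer, wrapped without copying.
buf = bytearray(b'one\x00\x00two\x00\x00threefour\x00')
shared = np.frombuffer(buf, dtype='S5')

# Changing the buffer shows up in the array, so the memory is shared.
buf[0] = ord('O')
print(shared[0])

# Case 2: separate Python bytes objects, each with its own memory.
pieces = [b'one', b'two', b'three', b'four']

# Building the array copies every piece into one new padded buffer.
packed = np.array(pieces, dtype='S5')
print(packed.tobytes())
```

In the second case the copy cannot be avoided with a fixed-width dtype, because the array needs all elements side by side in a single block.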