
I'm working on converting some old text logs to a usable format in Python. The files are huge, so I'm writing my own C extensions to run through the files as quickly as possible and parse out the relevant fields with regular expressions. My ultimate goal is to export these fields into NumPy arrays of strings. I know it's possible to create the NumPy array as a PyObject in C and then call SetItem on each element, but I'm looking to optimize as much as I can.

Can I use something like memcpy or PyBuffer_FromMemory to read the C strings into a NumPy string array directly? I understand that NumPy arrays are internally similar to C arrays, but do I have to ensure the NumPy array will be contiguously allocated?

I intend to use the NumPy arrays to build Pandas columns for statistical analysis. As I understand it, Pandas uses NumPy arrays to store the columns of a DataFrame, so I won't have much overhead going from NumPy into Pandas. I'd like to avoid Cython if possible.

Max
    regular expressions are not slower in Python than in C. – Daniel Aug 24 '16 at 21:58
  • Yes, I'm aware the speed of compiled regexes is comparable between C and Python. I'm looking to avoid overhead in other areas by building the NumPy array in C. – Max Aug 25 '16 at 10:21

1 Answer


To give a sense of how an array of strings is stored, I'll make one, and view it in several ways:

In [654]: np.array(['one','two','three','four'],dtype='S5')
Out[654]: 
array([b'one', b'two', b'three', b'four'], 
      dtype='|S5')
In [655]: x=np.array(['one','two','three','four'],dtype='S5')
In [656]: x.tostring()
Out[656]: b'one\x00\x00two\x00\x00threefour\x00'
In [657]: x.view(np.uint8)
Out[657]: 
array([111, 110, 101,   0,   0, 116, 119, 111,   0,   0, 116, 104, 114,
       101, 101, 102, 111, 117, 114,   0], dtype=uint8)

So its databuffer consists of 20 bytes (4 * 5, for dtype 'S5'). For strings shorter than 5 characters, the remaining bytes are left as 0 (padding).

Yes, there are C functions for creating new arrays of a given size and dtype, and functions for copying blocks of data into those arrays. Look at the C-API side of the NumPy documentation, or at some of the NumPy code in its GitHub repository.
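
For example, here is a minimal sketch of that approach (not from the original answer): it allocates a 1-D array of fixed-width byte strings with PyArray_New and then memcpys each parsed field into the buffer returned by PyArray_BYTES. The helper fields_to_array and its arguments fields/nfields/width are hypothetical stand-ins for whatever the parser produces, and the module init is assumed to have called import_array().

#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
#include <Python.h>
#include <numpy/arrayobject.h>
#include <string.h>

/* Hypothetical helper: copy nfields parsed C strings into a new 'S<width>' array. */
static PyObject *
fields_to_array(char **fields, npy_intp nfields, int width)
{
    npy_intp dims[1] = {nfields};

    /* New, NumPy-owned, C-contiguous array of fixed-width byte strings. */
    PyArrayObject *arr = (PyArrayObject *)PyArray_New(
        &PyArray_Type, 1, dims, NPY_STRING,
        NULL,   /* strides: default */
        NULL,   /* data: NULL, so NumPy allocates the buffer */
        width,  /* itemsize, i.e. the 5 in 'S5' */
        0, NULL);
    if (arr == NULL) {
        return NULL;
    }

    char *buf = PyArray_BYTES(arr);
    memset(buf, 0, (size_t)(nfields * width));   /* zero padding, like the 'S5' example above */

    for (npy_intp i = 0; i < nfields; i++) {
        size_t n = strlen(fields[i]);
        if (n > (size_t)width) {
            n = (size_t)width;                   /* truncate, as np.array(..., dtype='S5') would */
        }
        memcpy(buf + i * width, fields[i], n);   /* straight into the contiguous databuffer */
    }
    return (PyObject *)arr;
}

The returned array can then be handed back to Python and passed to pandas like any other NumPy array.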

Regarding the pandas transfer, beware that pandas readily changes the dtype of its columns. For example, if you put None or nan in a column, it is likely to change it to object dtype.

Object arrays store pointers in the databuffer.

In [658]: y=np.array(['one','two','three','four'],dtype=object)
In [659]: y
Out[659]: array(['one', 'two', 'three', 'four'], dtype=object)
In [660]: y.tostring()
Out[660]: b'\xe0\x0f\xc5\xb5\xa0\xfah\xb5\x80\x0b\x8c\xb4\xc09\x8b\xb4'

If I interpret that right, the databuffer has 16 bytes: four 4-byte pointers. The strings are stored elsewhere in memory as regular Python strings (in this case unicode strings, since this is Py3).

=================

fromstring and frombuffer let me recreate an array from a buffer:

In [696]: x=np.array(['one','two','three','four'],dtype='S5')
In [697]: xs=x.tostring()
In [698]: np.fromstring(xs,'S5')
Out[698]: 
array([b'one', b'two', b'three', b'four'], 
      dtype='|S5')
In [700]: np.frombuffer(xs,'S5')
Out[700]: 
array([b'one', b'two', b'three', b'four'], 
      dtype='|S5')

frombuffer works without copying the buffer; fromstring makes a copy.

However, if there are multiple strings in different parts of memory, then building an array from them will require copying them into one contiguous buffer.
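
On the C side, roughly the analogue of frombuffer is to wrap an existing contiguous buffer instead of copying it: passing a non-NULL data pointer to PyArray_New makes NumPy use that memory without copying or freeing it. A rough sketch under that assumption (buf, nfields and width are hypothetical, and buf must outlive the returned array, e.g. by tying its owner to the array with PyArray_SetBaseObject):

/* Hypothetical helper: view an existing 'S<width>'-layout buffer as a NumPy array, no copy. */
static PyObject *
wrap_buffer(char *buf, npy_intp nfields, int width)
{
    npy_intp dims[1] = {nfields};

    return PyArray_New(
        &PyArray_Type, 1, dims, NPY_STRING,
        NULL,              /* strides: default, C-contiguous */
        buf,               /* existing data: NumPy will not copy or free it */
        width,             /* itemsize */
        NPY_ARRAY_CARRAY,  /* writable, aligned, C-contiguous */
        NULL);
}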

hpaulj
  • Thanks for the well-informed answer. I would add that converting a NumPy array of strings into an array of objects basically kills any performance gain you get out of NumPy. After reading your response, I think the best thing for me to do is (1) create a NumPy array of appropriate size and dtype, (2) obtain the array's data pointer in C with PyArray_BYTES, and then (3) copy my strings into that memory with memcpy. – Max Aug 25 '16 at 10:47