
I'm looking for guidance on how to efficiently filter out unneeded parts of my data before converting to a numpy array and/or pandas dataframe. Data is delivered to my program as string buffers (each record separately), and I'm currently using np.frombuffer to construct an array once all records are retrieved.

The problem I'm having is that individual records can be quite long (thousands of fields), and sometimes I only want a small subset of them. Filtering out the unneeded fields adds extra steps, though, and significantly slows down the data import.

Without any filtering, my current process is:

import numpy as np
import pandas as pd

# assume some function here that retrieves one record at a time and appends it to 'data'
data = [b'\x00\x00\x00\x00\x00\x00\xf0?one     \x00\x00\x00\x00\x00\x00Y@',
        b'\x00\x00\x00\x00\x00\x00\x00@two     \x00\x00\x00\x00\x00\x00i@',
        b'\x00\x00\x00\x00\x00\x00\x08@three   \x00\x00\x00\x00\x00\xc0r@',
        b'\x00\x00\x00\x00\x00\x00\x10@four    \x00\x00\x00\x00\x00\x00y@']

# each 24-byte record: float64, 8-byte text field (read here as 'S8'), float64
struct_dtypes = np.dtype([('n1', 'd'), ('ch', 'S8'), ('n2', 'd')])

final_data = b''.join(data)

arr = np.frombuffer(final_data, dtype=struct_dtypes)
df = pd.DataFrame(arr)

# dataframe
    n1     ch     n2
0  1.0    one  100.0
1  2.0    two  200.0
2  3.0  three  300.0
3  4.0   four  400.0

My current solution for filtering is essentially:

final_data = b''.join(b''.join(buffer[offset: offset + 8] for offset in [0, 16]) for buffer in data)

struct_dtypes = np.dtype([('n1', 'd'), ('n2', 'd')])
arr = np.frombuffer(final_data, dtype=struct_dtypes)
df = pd.DataFrame(arr)

    n1     n2
0  1.0  100.0
1  2.0  200.0
2  3.0  300.0
3  4.0  400.0

That middle step to slice and rejoin each record makes filtering slower than just reading everything. If I construct the full array first and then return only the specified columns, isn't that just a waste of memory? What's an appropriate way to read only the portions of the string buffers I want?

Update using accepted answer

struct_dtypes = np.dtype({'names': ['n1', 'ch'],
                          'formats': ['d', 'V8'],
                          'offsets': [0, 8],
                          'itemsize': 24})

final_data = b''.join(data)

arr = np.frombuffer(final_data, dtype=struct_dtypes)
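
A minimal follow-up sketch (not part of the original post), assuming the 8-byte ASCII text layout shown in the sample data: if the text is needed later, the void field can be reinterpreted as fixed-width bytes.

# hypothetical follow-up: reinterpret the raw bytes of the void field
# as fixed-width 8-byte strings
ch = np.frombuffer(arr['ch'].tobytes(), dtype='S8')
# array([b'one     ', b'two     ', b'three   ', b'four    '], dtype='|S8')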

1 Answer


You can specify an offset for each field during dtype construction:

struct_dtypes = np.dtype({'names': ['n1', 'n2'], 'formats': ['d', 'd'], 'offsets': [0, 16]})

or

struct_dtypes = np.dtype({'n1': ('d', 0), 'n2': ('d', 16)})
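
For example, here is a quick sketch (not part of the original answer) applying the first form to the sample records from the question. Because the last field being read (n2) ends at byte 24, the automatically computed itemsize already matches the 24-byte record:

import numpy as np
import pandas as pd

data = [b'\x00\x00\x00\x00\x00\x00\xf0?one     \x00\x00\x00\x00\x00\x00Y@',
        b'\x00\x00\x00\x00\x00\x00\x00@two     \x00\x00\x00\x00\x00\x00i@',
        b'\x00\x00\x00\x00\x00\x00\x08@three   \x00\x00\x00\x00\x00\xc0r@',
        b'\x00\x00\x00\x00\x00\x00\x10@four    \x00\x00\x00\x00\x00\x00y@']

# offsets select only the two float64 fields; the 8-byte text field is skipped
struct_dtypes = np.dtype({'names': ['n1', 'n2'], 'formats': ['d', 'd'], 'offsets': [0, 16]})

arr = np.frombuffer(b''.join(data), dtype=struct_dtypes)
df = pd.DataFrame(arr)
#     n1     n2
# 0  1.0  100.0
# 1  2.0  200.0
# 2  3.0  300.0
# 3  4.0  400.0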

Update (see comments below):
If you don't read the last element in the record, you need to specify the itemsize:

struct_dtypes = np.dtype({'names': ['n1', 'ch'],
                          'formats': ['d', 'V8'],
                          'offsets': [0, 8],
                          'itemsize': 24})
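
With the itemsize set to the full record length, the whole buffer can again be read in one call; a small sketch (not from the original answer), reusing data and struct_dtypes as defined just above:

# itemsize=24 tells NumPy to skip the trailing n2 field of every 24-byte record
arr = np.frombuffer(b''.join(data), dtype=struct_dtypes)
arr['n1']  # array([1., 2., 3., 4.])
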
  • Depending on the field subset I use, I'm getting an error: buffer size must be a multiple of element size – StevenS Jul 23 '22 at 20:22
  • for what subset do you get this error? does the subset include strings? – Stef Jul 23 '22 at 20:36
  • `struct_dtypes = np.dtype({'n1': ('d', 0), 'ch': ('8V', 8)})` - yes, it contains strings, but I'm importing them as void types for now. – StevenS Jul 23 '22 at 20:41
  • it works if I use `np.frombuffer(count=1)`, but I need to do testing on overall speed because converting each record separately and then concatenating might be expensive. – StevenS Jul 23 '22 at 20:48
  • so it seems slicing each record to the appropriate end position is sufficient and not very expensive, at least compared to extracting many slices and rejoining as I was doing earlier. Thanks! – StevenS Jul 23 '22 at 21:01
  • @StevenS please see my updated answer for this case, so that you can read the whole buffer in one run – Stef Jul 23 '22 at 21:14