
I created a parser for some complex binary files using numpy.fromfile, defining the various dtypes necessary for reading each portion of the binary file. The resulting numpy array was then placed into a pandas dataframe, and the same dtype that was used to convert the binary file into the numpy array was reused to define the column names for the dataframe.

I was hoping to replicate this process using python struct but ran into an issue. If part of my structure requires a value to be a group of 3 ints, I can define the dtype as numpy.dtype([('NameOfField', '>i4', 3)]) and the value read from the binary file is [int, int, int]. Can this be replicated using struct, or do I need to regroup the values in the returned tuple based on the dtype before ingesting it into my pandas dataframe? I have read the python struct documentation and have not found any examples of this.

Using a struct format of >3i returns a result of int, int, int instead of the [int, int, int] that I need.
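To illustrate the difference (a minimal sketch, faking the binary data with struct.pack rather than reading a real file):

```python
import struct

import numpy as np

# fake 12 bytes of binary data: three big-endian int32s
buf = struct.pack('>3i', 1, 2, 3)

# numpy's subarray dtype nests the three ints under a single named field
dt = np.dtype([('NameOfField', '>i4', 3)])
arr = np.frombuffer(buf, dtype=dt)
print(arr[0]['NameOfField'])    # [1 2 3]

# struct flattens everything into one tuple of scalars
print(struct.unpack('>3i', buf))    # (1, 2, 3)
```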

Edit ... Below is a generic example. This method using numpy.fromfile works perfectly, but it is slow on my huge binary files, so I am trying to implement it using struct.

    import numpy as np
    import pandas as pd

    def example_structure():
        dt = np.dtype([
                 ('ExampleFieldName', '>i4', 3)
             ])

        return dt


    # filename of binary file
    file_name = 'example_binary_file'

    # define the dtype for this chunk of binary data
    d_type = example_structure()

    # define initial index for the file in memory
    start_ind = 0
    end_ind = 0

    # read in the entire file generically
    x = np.fromfile(file_name, dtype='u1')

    # based on the dtype find the chunk size
    chunk_size = d_type.itemsize

    # define the start and end index based on the chunk size
    start_ind = end_ind
    end_ind = chunk_size + start_ind

    # extract just the first chunk
    temp = x[start_ind:end_ind]

    # reinterpret the raw bytes as the defined dtype
    temp = temp.view(d_type)

    # store the chunk in its own pandas dataframe
    example_df = pd.DataFrame(temp.tolist(), columns=temp.dtype.names)

This returns a temp[0] value of [int, int, int], which is then read into the pandas dataframe as a single entry under the column ExampleFieldName. If I attempt to replicate this using struct, the temp[0] value is int, int, int, which is not read properly into pandas. Is there a way to make struct group values like I can do using numpy?

  • Your question isn't clear; I think you need to add some code and/or examples. For example `[int, int, int]` looks like a list, or is it a 1d array? `int, int, int` is that a tuple, `(int, int, int)`, or something else? – hpaulj Nov 28 '17 at 19:16
  • @hpaulj ok will do – btathalon Nov 28 '17 at 19:16
  • I haven't used the Python `struct` much, and not with `numpy`. However this https://stackoverflow.com/questions/30035287/passing-structured-array-to-cython-failed-i-think-it-is-a-cython-bug `cython` question shows that there is a certain relatedness between `c` struct, python `struct` and `numpy` compound dtypes. – hpaulj Nov 28 '17 at 19:18

1 Answer


I'd suggest just splitting it up into a list of tuples after unpacking. It won't be as fast as numpy for huge inputs, but then that's why numpy exists :P. Assuming data holds the raw bytes that you want to split into groups of 5 uint32_ts (obviously the data must also be in this shape):

    import struct

    # unpack everything into one flat tuple, then regroup into fives
    flat = struct.unpack("5I" * (len(data) // struct.calcsize("5I")), data)
    output = [flat[i:i+5] for i in range(0, len(flat), 5)]

Of course, this means iterating over the data twice, and since struct.unpack doesn't yield successive values (afaik), doing it in one line won't help with that. It might be faster to iterate over the data directly - I haven't run any tests - like this:

    import struct

    output, itemsize = [], struct.calcsize("5I")
    for i in range(0, len(data), itemsize):
        output.append(struct.unpack("5I", data[i:i+itemsize]))
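For what it's worth, struct.iter_unpack (added in Python 3.4) does yield successive records, one tuple per format-sized chunk, so the regrouping happens as you iterate (sketch below fakes the data with struct.pack):

```python
import struct

# fake data: two records of five uint32s each
data = struct.pack("5I", 1, 2, 3, 4, 5) * 2

# iter_unpack walks the buffer in calcsize("5I")-byte steps,
# yielding one 5-tuple per record
output = list(struct.iter_unpack("5I", data))
print(output)   # [(1, 2, 3, 4, 5), (1, 2, 3, 4, 5)]
```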
