How to properly select wanted data and discard unwanted data from binary files

Question

I'm working on a project where I'm trying to convert old 16-bit binary data files into 32-bit data files for later use.

Straight conversion is no issue, but then I noticed I needed to remove the header data from the data files.

The data consists of 8206-byte frames. Each frame consists of a 14-byte header and a data block of 4096 16-bit values (8192 bytes). Depending on the file, there are either 70313 or 70312 frames in each file.
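
The sizes can be cross-checked with a few lines of arithmetic. This is only a sketch: the constant names are made up for illustration, and it assumes every sample is a 2-byte value, as the uint16 reads in the code below suggest. Under that assumption an 8206-byte frame with a 14-byte header leaves room for 4096 samples, and a 70313-frame file holds exactly the 288494239 values that the code below tests for.

```python
# Hypothetical constant names; the numbers come from the description above.
FRAME_BYTES = 8206
HEADER_BYTES = 14
BYTES_PER_SAMPLE = 2  # assumption: 16-bit samples

data_bytes = FRAME_BYTES - HEADER_BYTES               # payload bytes per frame
samples_per_frame = data_bytes // BYTES_PER_SAMPLE    # 16-bit samples per frame
words_per_frame = FRAME_BYTES // BYTES_PER_SAMPLE     # 16-bit words per frame, header included
header_words = HEADER_BYTES // BYTES_PER_SAMPLE       # 16-bit words in the header

print(data_bytes, samples_per_frame, words_per_frame, header_words)
print(70313 * words_per_frame)  # total 16-bit words in a 70313-frame file
```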

I couldn't find a neat way to find all the headers, remove them, and save only the data blocks to a new file.

So here's what I did:

    results_array = np.empty([0,1], np.uint16)

    for filename in file_list:
        num_files += 1
        # read data from file as 16bit's and save it as 32bit
        data16 = np.fromfile(data_dir + "/" + filename, dtype=np.uint16)
        filesize = np.prod(data16.shape)
        if filesize == 288494239:
            total_frames = 70313
            #total_frames = 3000
        else:
            total_frames = 70312
            #total_frames = 3000

        frame_count = 0
        chunksize = 4103

        with open(data_dir + "/" + filename, 'rb') as file:
            while frame_count < total_frames:
                frame_count += 1
                read_data = file.read(chunksize)
                if not read_data:
                    break
                data = read_data[7:4103]
                results_array = np.append(results_array,data)
                converted = np.frombuffer(results_array, np.uint16)
                print(str(frame_count) + "/" + str(total_frames))

            converted = np.frombuffer(results_array, np.uint16)
            data32 = converted.astype(dtype=np.uint32) * 256

It works (I think it does, at least), but it is very, very slow.

So the question is: is there a way to do the above much faster, maybe with some built-in function in NumPy or something else?

Thanks in advance

Why are you doing `converted = np.frombuffer(results_array, np.uint16)` in while loop for each frame? — MT-FreeHK, Aug 14 '18 at 06:09
I had to add it because reading the file with 'rb' turned the data into hex values, and I couldn't convert that to 32-bit; frombuffer converted the hex value array back to a decimal array. Edit: oh wait, I see, I don't need to do it in every loop, I'll fix that :) — Nanoni, Aug 14 '18 at 06:35
Is it still slow after you move that sentence outside the loop? — MT-FreeHK, Aug 14 '18 at 07:00
Oh yes, I don't think that really had an effect. I need a totally different kind of function for reading only certain bytes and discarding the rest. — Nanoni, Aug 14 '18 at 07:31
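
On the "hex values" point: in 'rb' mode, `file.read()` returns a `bytes` object, and `np.frombuffer` reinterprets those bytes as numbers without copying them. If a frame-by-frame loop is still wanted, much of the cost can likely be avoided by reading one whole 8206-byte frame at a time, keeping the pieces in a list, and concatenating once at the end, since `np.append` copies the entire array on every call. A minimal sketch, assuming 2-byte samples in native byte order; `read_frames_loop` and the constant names are hypothetical, not from the post:

```python
import numpy as np

FRAME_BYTES = 8206   # assumed layout: 14-byte header + 8192 bytes of samples
HEADER_BYTES = 14

def read_frames_loop(path):
    """Read a file frame by frame and return all samples as one uint16 array."""
    blocks = []
    with open(path, "rb") as f:
        while True:
            frame = f.read(FRAME_BYTES)
            if len(frame) < FRAME_BYTES:   # end of file or truncated last frame
                break
            # View the payload bytes as 16-bit values (native byte order).
            blocks.append(np.frombuffer(frame[HEADER_BYTES:], dtype=np.uint16))
    if not blocks:
        return np.empty(0, dtype=np.uint16)
    # One concatenation at the end instead of one np.append per frame.
    return np.concatenate(blocks)
```

Note that the read size and the header offset are given in bytes here (8206 and 14), not in 16-bit words (4103 and 7), because `file.read()` and slicing a `bytes` object both count bytes.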

score 0 · Accepted Answer · answered Aug 25 '18 at 07:53

Finally managed to crack this one, and it is 100x faster than the initial approach :)

    import numpy as np

    # Read the whole file as 16-bit values
    data = np.fromfile(read_dir + "/" + file, dtype=np.int16)
    frames = len(data) // 4103  # frame length: 7 header words + 4096 data words

    # Reshape into an array such that each row is a frame
    data = np.reshape(data[:frames * 4103], (frames, 4103))

    # Remove the 7-word (14-byte) headers and convert to int32
    data = data[:, 7:].astype(np.int32) * 256
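
The same reshape idea can be wrapped into a function that also writes the stripped data to a new file, which was the original goal. This is a sketch rather than the poster's exact code: `strip_headers`, `src_path`, `dst_path`, and the constant names are hypothetical, and it reads the samples as unsigned (`np.uint16`) as the question did. With `np.int16`, as in the snippet above, raw values of 32768 and higher would be interpreted as negative numbers, so the right dtype depends on how the data was recorded.

```python
import numpy as np

WORDS_PER_FRAME = 4103   # 7 header words + 4096 data words, 16 bits each
HEADER_WORDS = 7

def strip_headers(src_path, dst_path):
    """Drop the per-frame header, scale to 32 bits, and save the result."""
    data = np.fromfile(src_path, dtype=np.uint16)   # assumption: unsigned samples
    frames = len(data) // WORDS_PER_FRAME           # ignore a trailing partial frame
    data = data[:frames * WORDS_PER_FRAME].reshape(frames, WORDS_PER_FRAME)
    # Widen to 32 bits before multiplying so the product cannot overflow 16 bits.
    out = data[:, HEADER_WORDS:].astype(np.uint32) * 256
    out.tofile(dst_path)                            # raw values only, in row order
    return out
```

The output file contains only the scaled samples, so it can be read back with `np.fromfile(dst_path, dtype=np.uint32)`.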
