Problems when I write np array to binary file, new file is only half of the original one

Question

I am trying to remove the top 24 lines of a raw file. I opened the original raw file (call it raw1.raw) and converted it to a NumPy array, then initialized a new array without the top 24 lines. After writing the new array to a new binary file (raw2.raw), I found that raw2.raw is only 15.2 MB, while the original raw1.raw is about 30.6 MB. My code:

import numpy as np
import imageio
import rawpy
import cv2


def ave():
    
    fd = open('raw1.raw', 'rb')
    rows = 3000 #around 3000, not the real rows
    cols = 5100 #around 5100, not the real cols
    f = np.fromfile(fd, dtype=np.uint8,count=rows*cols)
    I_array = f.reshape((rows, cols)) #notice row, column format
    #print(I_array)
   
    fd.close()

    im = np.zeros((rows - 24 , cols))
    for i in range (len(I_array) - 24):
        for j in range(len(I_array[i])):
            im[i][j] = I_array[i + 24][j]
            
    #print(im)

    newFile = open("raw2.raw", "wb")
    
    im.astype('uint8').tofile(newFile)
    newFile.close()


if __name__ == "__main__":
    ave()

I tried using im.astype('uint16') when writing the binary file, but the values come out wrong with uint16.

Unrelated to your question, but you can do `im = I_array[24:,:]` to lop off the first 24 rows. — mtrw, Feb 11 '21 at 02:21
yea, but they are the same, what I am confused is about the file size — user916169, Feb 11 '21 at 04:18

Bobby Ocean · Answer 1 · 2021-02-12T19:18:33.270

There must clearly be more data in your 'raw1.raw' file that you are not using. Are you sure that file wasn't created using 'uint16' data and you are just pulling out the first half as 'uint8' data? I just checked the writing of random data.

import os
import numpy as np

x = np.random.randint(0, 256, size=(3000, 5100), dtype='uint8')
x.tofile('testfile.raw')  # pass the path so the file is written in binary mode and closed
print(os.stat('testfile.raw').st_size)  # 15300000 bytes, about 15.3 MB

So a 3000 by 5100 array of 'uint8' clearly takes up 15.3 MB. I don't know how you got 30+.
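One way to test the 'uint16' guess is to divide the file size on disk by the number of pixels. This is only a minimal sketch: it uses a small synthetic file in place of raw1.raw, the file name `demo.raw` and the helper `bytes_per_pixel` are made up for illustration, and it assumes the file has no header.

```python
import os
import tempfile
import numpy as np

def bytes_per_pixel(path, rows, cols):
    """Hypothetical helper: file size divided by pixel count (assumes no header)."""
    return os.path.getsize(path) / (rows * cols)

rows, cols = 30, 51  # small stand-ins for the real dimensions
path = os.path.join(tempfile.mkdtemp(), "demo.raw")

# Write 16-bit pixels as a stand-in for a file that stores 2 bytes per value.
np.arange(rows * cols, dtype=np.uint16).reshape(rows, cols).tofile(path)

bpp = bytes_per_pixel(path, rows, cols)
print(bpp)  # 2.0 here, meaning each pixel occupies two bytes
```

A result close to 1 would point to 'uint8' data, and a result close to 2 to a 16-bit format such as 'uint16'.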

############################ EDIT #########

Just to add more clarification. Do you realize that dtype does nothing more than change the "view" of your data? It doesn't affect the actual data that is saved in memory. This also goes for data that you read from a file. Take for example:

import numpy as np

#x occupies 12 bytes in memory and uses them to hold 3 values: the first
#4 bytes are the first value, the next 4 bytes are the second, and so on.
x = np.array([1,2,3],dtype='uint32')

#Display those 12 bytes as 6 different values. Doing this does NOT change
#the data that the array is holding. You are only changing the 'view' of
#the data. (.view() is used here because assigning to x.dtype directly is
#discouraged in the NumPy docs.)
y = x.view('uint16')
print(y)

In general (there are a few special cases), changing the dtype doesn't change the underlying data. However, the conversion function .astype() does change the underlying data. If you have an array of 12 bytes viewed as 'uint32', then running .astype('uint8') will take each entry (4 bytes) and convert it (known as casting) to a uint8 entry (1 byte). The new array will only have 3 bytes for the 3 entries. You can see this literally:

x = np.array([1,2,3],dtype='uint32')
print(x.tobytes())
y = x.astype('uint8')
print(y.tobytes())
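The same contrast can be checked by byte counts instead of raw bytes. The sketch below is an illustration added alongside the example above; the variable names `v` and `c` are arbitrary.

```python
import numpy as np

x = np.array([1, 2, 3], dtype=np.uint32)

# view(): the same 12 bytes, reinterpreted as six 16-bit values.
v = x.view(np.uint16)
# astype(): each value is converted, producing a new 3-byte array.
c = x.astype(np.uint8)

print(x.nbytes, v.nbytes, c.nbytes)  # 12 12 3
print(v.size, c.size)                # 6 3
print(np.shares_memory(x, v))        # True: the view reuses x's buffer
print(np.shares_memory(x, c))        # False: astype made a copy
```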

So, when we say that a file is 30 MB, we mean that the file holds (minus any header information) about 30,000,000 bytes, and each byte is exactly one uint8. If an array has 6000 by 5100 uint8s (bytes), then the array has 30,600,000 bytes of information in memory.

Likewise, if you read a file (it DOES NOT MATTER which file) and write np.fromfile(..., dtype=np.uint8, count=15_300_000), then you have told Python to read EXACTLY 15_300_000 bytes (again, 1 byte is 1 uint8) of information (15 MB). Whether your file is 100 MB, 40 MB, or even 30 MB is completely irrelevant, because you told Python to read only the first 15 MB of data.
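To make this concrete, here is a minimal sketch of the whole round trip on a small synthetic file. It assumes the pixels are little-endian 16-bit values (`'<u2'`) with no header, which has not been confirmed for the real raw1.raw. The file names `raw1_demo.raw` and `raw2_demo.raw` and the small dimensions are made up for illustration.

```python
import os
import tempfile
import numpy as np

rows, cols, skip = 30, 51, 24  # small stand-ins for the real dimensions
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "raw1_demo.raw")
dst = os.path.join(tmp, "raw2_demo.raw")

# Synthetic 16-bit image standing in for raw1.raw (assumed format).
original = np.arange(rows * cols, dtype="<u2").reshape(rows, cols)
original.tofile(src)

# Reading rows*cols items as uint8 covers only half of the bytes.
half = np.fromfile(src, dtype=np.uint8, count=rows * cols)
print(half.nbytes, os.path.getsize(src))  # 1530 3060

# Reading the same number of items as uint16 covers the whole file.
image = np.fromfile(src, dtype="<u2", count=rows * cols).reshape(rows, cols)

# Drop the top rows by slicing and write the rest without changing the dtype.
trimmed = image[skip:, :]
trimmed.tofile(dst)

print(os.path.getsize(dst))  # (rows - skip) * cols * 2 = 612
```

Because the dtype used for reading matches the one used for writing, no cast is involved and the pixel values are carried over unchanged.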

I mean the original raw file is 30+ mb, after I open it with uint 8 and write in the new binary file, it became 15mb. — user916169, Feb 11 '21 at 03:34
But if you opened the 30mb file and read in exactly 3000 by 5100 uint8s, then you must have only read half of the file. Try making the matrix 6000 by 5100 and reading uint8 from that file. — Bobby Ocean, Feb 11 '21 at 20:24
No, the file is exactly 3000 * 5100; I just read it as uint8. I don't know the original type of the values. When I tried uint16, the output size was right, but the pixel values were wrong. I am really confused. — user916169, Feb 11 '21 at 23:01
I don't know what to tell you. I don't know anything about your file, or how it was written. Did you run my code above? You can clearly see the file should be 15mb if it was 3000by5100 uint8s. Your file clearly has 30mb of data that is literally 30,000,000 bytes, which is approximately 6000by5100 uint8s (30,600,000 bytes). I don't know how to explain that any other way. — Bobby Ocean, Feb 12 '21 at 18:51
I added an update, I think maybe dtype is a point of confusion. Feel free to correct me if I am mistaken. — Bobby Ocean, Feb 12 '21 at 19:11
Thank you for your answer. I just found that every pixel value might use 2 bytes, so the final size would be double. Now I am wondering how to save each value using 2 bytes. — user916169, Feb 17 '21 at 22:44
Two bytes just means int16 or uint16 is the format of your data. — Bobby Ocean, Feb 21 '21 at 20:52
