I am wondering how NumPy's `tofile` actually stores the data, because to me the result looks completely strange. Say I have the following code:

import numpy as np
filename = "test.dat"
print(filename)
fileobj = open(filename, mode='wb')
off = np.array([1, 300], dtype=np.int32)
off.tofile(fileobj)
fileobj.close()

fileobj2 = open(filename, mode='rb')
off = np.fromfile(fileobj2, dtype = np.int32)
print(off)
fileobj2.close()

Now I expect 8 bytes inside the file, where each element is represented by 4 bytes (and I could live with any endianness). However when I open up the file in a hex editor (used notepad++ with hex editor plugin) I get the following bytes:

01 00 C4 AC 00

5 bytes, and I have no idea at all what it represents. The first byte looks like it is the number, but then what follows is something weird, certainly not "300".

Yet reloading shows the original array.

Is this something I don't understand in python, or is it a problem in notepad++? - I notice the hex looks different if I select a different "encoding" (huh?). Also: Windows does report it being 8 bytes long.

Guillaume Jacquenot
paul23
  • 2
    First, an `int32` takes 4 bytes, not 2. – abarnert Apr 22 '15 at 22:26
  • 3
    Next, have you tried reading the file in any other program besides Notepad++? You can do it pretty easily in Python itself; instead of `off = np.fromfile(fileobj2, dtype=np.int32)`, just do `off = fileobj2.read()`, then print the bytes. You should see something like `b'\x01\x00\x00\x00,\x01\x00\x00'`; if you instead see `b'\x01\x00\xc4\xac\x00'`, then you know it's the file that's broken, not Notepad++. – abarnert Apr 22 '15 at 22:28
  • @abarnert that was actually a typo (notice I said already "I expect 8 bytes..."). Hmm it seems indeed notepad++ is broken, that's weird, never had that happen before :/, can I ask/convert this question to "how to make notepad++ work with the hex plugin" or is that too offtopic here? – paul23 Apr 22 '15 at 22:32
  • You shouldn't try to convert a question into a different question. Just ask a new one. I think that new one would be more on-topic somewhere like [SuperUser](http://superuser.com/), but you should read the help on both sites (and some other Stack Exchange sites that sound relevant) and decide for yourself. – abarnert Apr 23 '15 at 21:44
  • Could it have something to do with the version of notepad++ or the plugin? http://sourceforge.net/p/notepad-plus/discussion/482781/thread/651ff890/ mentions a null character problem in the Npp plugin several years ago. – hpaulj Apr 23 '15 at 22:19

2 Answers

2

You can tell very easily that the file actually does have 8 bytes, the same 8 bytes you'd expect (`01 00 00 00 2C 01 00 00`), just by using anything other than Notepad++ to look at the file, including just replacing your `off = np.fromfile(fileobj2, dtype=np.int32)` with `off = fileobj2.read()` and then printing the bytes (which will give you `b'\x01\x00\x00\x00,\x01\x00\x00'`*).

And, from your comments, after I suggested that, you tried it, and saw exactly that.

Which means this is either a bug in Notepad++, or a problem with the way you're using it; Python, NumPy, and your own code are perfectly fine.


* In case it isn't clear: `'\x2c'` and `','` are the same character, and `bytes` uses the printable ASCII representation for printable ASCII characters, as well as familiar escapes like `'\n'`, when possible, only using the hex backslash escape for other values.
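As a quick check that needs no hex editor, the file can be dumped from Python itself. This is a minimal sketch: the temporary path is an assumption standing in for `test.dat`, and the explicit `'<i4'` dtype is used so the byte order is little-endian on any platform (the question's `np.int32` uses the native byte order).

```python
import os
import tempfile

import numpy as np

# Hypothetical path in a temp directory, standing in for "test.dat".
path = os.path.join(tempfile.mkdtemp(), "test.dat")

# '<i4' = little-endian 32-bit int, so the dump is the same on any platform.
np.array([1, 300], dtype="<i4").tofile(path)

with open(path, "rb") as f:
    raw = f.read()

size = len(raw)
hex_dump = " ".join(f"{b:02X}" for b in raw)

print(size)      # 8
print(hex_dump)  # 01 00 00 00 2C 01 00 00
print(raw)       # b'\x01\x00\x00\x00,\x01\x00\x00'
```

If this prints the expected 8 bytes, the file is fine and the odd display comes from the viewer.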

abarnert
  • why are we expecting only eight bytes? is no representation of the dimensionality of the array saved? if not, how would, for example, a 5-D array be distinguished from a 10-D one, if both had the same number of elements? – abcd Apr 23 '15 at 22:15
  • 1
    @dbliss: The short answer is, no there isn't, they're not distinguished, and that's why we're expecting 8 bytes. The `tofile`/`fromfile` docs explain this. But you can test it easily by `np.array([[1,2], [3,4]], dtype=np.int8).tofile(f)`, then `a = np.fromfile(f, dtype=np.int8)`; you get back `[1,2,3,4]`, not `[[1,2],[3,4]]`. – abarnert Apr 23 '15 at 22:27
  • 1
    @dbliss: Notice that it's not even storing the data type. This means that in addition to losing information about dimension (and C vs. Fortran striding), you also lose information about endianness, platform float implementation differences, etc. It's meant for "quick storage", where you dump the data and read it back in the same session—e.g., you don't have enough memory to store all your arrays at once, or you want to pass them to a `multiprocessing` child. – abarnert Apr 23 '15 at 22:30
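The point made in these comments can be checked directly. A sketch, assuming a temporary directory for the files: `tofile`/`fromfile` round-trip only the raw elements, whereas `np.save`/`np.load` (the `.npy` format, brought in here as an alternative that the thread itself does not discuss) also record shape and dtype.

```python
import os
import tempfile

import numpy as np

tmpdir = tempfile.mkdtemp()  # hypothetical scratch location
a = np.array([[1, 2], [3, 4]], dtype=np.int8)

# Raw dump: 4 bytes, no header, so shape and dtype are not stored.
raw_path = os.path.join(tmpdir, "a.dat")
a.tofile(raw_path)
flat = np.fromfile(raw_path, dtype=np.int8)
print(flat.shape)  # (4,)

# The caller has to supply the shape again.
restored = flat.reshape(2, 2)

# .npy stores a header with shape and dtype alongside the data.
npy_path = os.path.join(tmpdir, "a.npy")
np.save(npy_path, a)
loaded = np.load(npy_path)
print(loaded.shape, loaded.dtype)  # (2, 2) int8
```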
1

What are you expecting 300 to look like?

Write the array, and read it back as binary (in ipython):

In [478]: np.array([1,300],np.int32).tofile('test')

In [479]: with open('test','rb') as f: print(f.read())
b'\x01\x00\x00\x00,\x01\x00\x00'

There are 8 bytes; `,` is just the displayable form of the byte `0x2C`.

Actually, I don't have to go through a file to get this:

In [505]: np.array([1,300],np.int32).tobytes()
Out[505]: b'\x01\x00\x00\x00,\x01\x00\x00'

Do the same with:

[255]    
b'\xff\x00\x00\x00'

[256]
b'\x00\x01\x00\x00'

[300]
b',\x01\x00\x00'

[1,255]
b'\x01\x00\x00\x00\xff\x00\x00\x00'

With powers of 2 (and 1 less) it is easy to identify a pattern in the bytes.
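The pattern follows from little-endian byte order: the least significant byte comes first, and 300 = 1*256 + 44, where 44 is `0x2C`, the ASCII code of `,`. A small sketch using only the standard library (`int.to_bytes` is used here for illustration; it is not something the answer relies on):

```python
for n in (255, 256, 300):
    # 4-byte signed little-endian, the layout of an int32 on typical PCs.
    le = n.to_bytes(4, byteorder="little", signed=True)
    print(n, le.hex(" "), le)

# 255 ff 00 00 00 b'\xff\x00\x00\x00'
# 256 00 01 00 00 b'\x00\x01\x00\x00'
# 300 2c 01 00 00 b',\x01\x00\x00'

# Split 300 into its low and high bytes.
low, high = 300 % 256, 300 // 256
print(low, high, chr(low))  # 44 1 ,
```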


`frombuffer` converts a byte string back to an array:

In [513]: np.frombuffer(np.array([1,300]).tobytes(),int)
Out[513]: array([  1, 300])

In [514]: np.frombuffer(np.array([1,300]).data,int)
Out[514]: array([  1, 300])

Judging from this last expression, `tofile` just writes the array buffer to the file as raw bytes.
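That inference can be tested by comparing the file contents with `tobytes()`. A sketch, assuming a temporary file path in place of a real one:

```python
import os
import tempfile

import numpy as np

arr = np.array([1, 300], dtype=np.int32)
path = os.path.join(tempfile.mkdtemp(), "test.dat")  # hypothetical path
arr.tofile(path)

with open(path, "rb") as f:
    on_disk = f.read()

# The file holds exactly the array's bytes: no header, no padding.
same = on_disk == arr.tobytes()
print(same)          # True
print(len(on_disk))  # 8 (2 elements * 4 bytes)
```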

hpaulj
  • But his question is why he gets the 5 bytes `01 00 C4 AC 00` instead of the 8 bytes `01 00 00 00 2C 01 00 00`; this doesn't answer that at all. – abarnert Apr 23 '15 at 21:39
  • But it does give him a way of checking the file from within Python. Assuming he gets the same thing, then the problem is clearly with `notepad++`. I let others address that issue, since I don't have that editor on the linux side of my machine. Plus testing with an easy-to-recognize value like 255 or 256 might help. – hpaulj Apr 23 '15 at 21:59
  • If you read the comments, he had already done an equivalent test 7 hours before you wrote the answer. Yes, it would have been better if he edited the question to make that clear instead of just writing a comment, but that still doesn't mean this is a useful answer to the question, either as asked or as intended. – abarnert Apr 23 '15 at 22:02