
I am creating a binary file containing a 5×7 integer matrix in Python, called random_from_python_int.dat, and then reading that binary file from C. Somehow I cannot get the correct numbers. Here is my Python code to generate the matrix:

import numpy as np
np.random.seed(10)
filename = "random_from_python_int.dat"
fileobj = open(filename, mode='wb')
b = np.random.randint(100, size=(5,7))
b.tofile(fileobj)
fileobj.close()

this will generate a matrix

[ [  9 15 64 28 89 93 29]
  [  8 73 0  40 36 16 11]
  [ 54 88 62 33 72 78 49]
  [ 51 54 77 69 13 25 13]
  [ 92 86 30 30 89 12 65] ]

But when I read it from the C code from below:

#include <stdio.h>
#include <math.h>
int main()
{
  /* later changed 'double' to 'int', but that still had issues */
  double randn[5][7];

  char buff[256];
  FILE *latfile;

  sprintf(buff,"%s","random_from_python_int.dat");
  latfile=fopen(buff,"r");
  fread(&(randn[0][0]),sizeof(int),35,latfile);
  fclose(latfile);
  printf("\n %d     %d     %d     %d     %d     %d     %d",randn[0][0],randn[0][1],randn[0][2],randn[0][3],randn[0][4],randn[0][5],randn[0][6]);
  printf("\n %d     %d     %d     %d     %d     %d     %d",randn[1][0],randn[1][1],randn[1][2],randn[1][3],randn[1][4],randn[1][5],randn[1][6]);
  printf("\n %d     %d     %d     %d     %d     %d     %d",randn[2][0],randn[2][1],randn[2][2],randn[2][3],randn[2][4],randn[2][5],randn[2][6]);
  printf("\n %d     %d     %d     %d     %d     %d     %d",randn[3][0],randn[3][1],randn[3][2],randn[3][3],randn[3][4],randn[3][5],randn[3][6]);
  printf("\n %d     %d     %d     %d     %d     %d     %d\n",randn[4][0],randn[4][1],randn[4][2],randn[4][3],randn[4][4],randn[4][5],randn[4][6]);
}

It gives me (spacing adjusted to avoid horizontal scrolling):

      28      15         64      93         29 -163754450   9
      40      73          0      16         11 -163754450   8
      33      88         62      17         91 -163754450  54
     256       0 1830354560       0    4196011 -163754450 119
 4197424 4197493 1826683808 4196128 2084711472 -163754450  12

I am not sure what is wrong. I have tried writing a float matrix from Python and reading it as double in C, and that works fine, but this integer matrix just does not work.

harmony
  • You read integers into doubles. – 0andriy Aug 25 '17 at 19:41
  • So after the integer vs. double confusion, the remaining question is: How do you know that the "integer" numpy writes is the same size as the "int" C uses? – ndim Aug 25 '17 at 19:44
  • Oops! But after I changed the double to int, I got 9 0 15 0 64 0 28 0 89 0 93 0 29 0 8 0 73 0 0 0 40 0 36 0 16 0 11 0 54 0 88 0 62 0 33. – harmony Aug 25 '17 at 19:58
  • There are zeros mixed in with the data. I don't know where they come from. – harmony Aug 25 '17 at 19:58
  • "*How do you know that the "integer" numpy writes is the same size as the "int" C uses*" - that's a very good thought. You can maybe google it. In C you can force a size by using the types int32_t, int64_t – rustyx Aug 25 '17 at 20:13
  • Step 1: `latfile=fopen(buff,"r");` --> `latfile=fopen(buff,"rb");` (Add `b`) – chux - Reinstate Monica Aug 25 '17 at 21:05
  • What is the size of file `random_from_python_int.dat`? – chux - Reinstate Monica Aug 25 '17 at 21:08
  • Apparently, you can give the desired integer size to `numpy.random.randint` as the optional `dtype` parameter. Default is `np.int` (whatever that may be in C), but explicit integer sizes 32 and 64 are also available, which can be combined with explicit integer sizes 32 and 64 in C easily. – ndim Aug 25 '17 at 21:25
  • And if you want to use a specific C type as the type for your values in numpy, https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.scalars.html has a list of supported types, many of which directly relate to C types. – ndim Aug 25 '17 at 21:34

2 Answers


As @tdube writes, the quick summary of your issue is: your numpy implementation writes 64-bit integers, while your C code reads 32-bit integers.

As for some more details, read on.

When you write and read integers as two's complement binary data, you need to make certain that the following three integer properties are the same for both the producer and the consumer of the binary data: integer size, integer endianness, integer signedness.

The signedness is signed for both numpy and C, so we have a match here.

The endianness is not an issue here because both numpy and the C program are on the same machine, and thus you probably have the same endianness (regardless of what endianness it might actually be).
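
If you ever do need to check, both sides can report their byte order; here is a quick Python-side check (an illustration, not part of the original answer):

```python
import sys
import numpy as np

# Both programs run on the same machine, so their byte order
# necessarily agrees -- which is why endianness is not the
# problem in this question.
print(sys.byteorder)                 # 'little' on x86/x86-64 and most ARM
print(np.dtype(np.int64).byteorder)  # '=' means native byte order
```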

However, the size is the issue.

By default, numpy.random.randint uses np.int as its dtype. The documentation does not specify the size of np.int, but it turns out to be 64 bits on your system.
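
You can confirm the default size on any given system by inspecting the array's dtype (a quick sanity check):

```python
import numpy as np

b = np.random.randint(100, size=(5, 7))

# itemsize is the per-element storage size in bytes: 8 means the
# 64-bit integers this question's file contains. The default is
# platform-dependent (on Windows it has historically been 32-bit).
print(b.dtype)
print(b.dtype.itemsize)
```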

The numpy scalars reference lists a few integer types (remarkably not including np.int), of which three combinations are interesting for robustly interfacing with programs outside of numpy:

 # | numpy    | C
---+----------+---------
 1 | np.int32 | int32_t
 2 | np.int64 | int64_t
 3 | np.intc  | int

If you only happen to interface your numpy based software to the same C environment used to build numpy, using the (np.intc, int) pair of types (from case 3) looks safe.

However, I would strongly prefer one of the explicitly sized types (cases 1 and 2) for the following reasons:

  • It is absolutely obvious what size the integer is in both numpy and C.

  • You can thus use your numpy generated output to interface to a program compiled with a different C environment which may have a different size int.

  • You can even use your numpy generated output to interface to a program written in a completely different language or compiled for and running on a completely different machine. You have to consider endianness for the different machine scenario, though.
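
As a sketch of that recommendation, the writing side can pin down both size and endianness with an explicit dtype string ('<i4' is numpy's spelling for little-endian signed 32-bit); the matching C reader would then use int32_t from <stdint.h>:

```python
import numpy as np

np.random.seed(10)
# '<i4': little-endian, signed, 4 bytes -- fully specified, so the
# file layout no longer depends on which machine runs the writer.
b = np.random.randint(100, size=(5, 7)).astype('<i4')
b.tofile("random_from_python_int.dat")
```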

ndim
  • np.int is not a numpy type. It's just a confusing alias for the built-in python `int` that it's too late to remove. – Eric Aug 28 '17 at 00:48
  • The facts are: `np.int` is not documented in the numpy scalars reference, but `numpy.random.randint` is documented to use `np.int` as the default type. Amend the docs maybe? – ndim Aug 28 '17 at 01:00
  • That's a mistake in the randint docs, which I think I fixed recently in master. `np.int` shouldn't be on the numpy scalar page as it is not one - but perhaps there should be a warning there that explicitly says that. (And for np.float, np.complex, ...) – Eric Aug 28 '17 at 01:03
  • Good catch. Seems I missed the pyx files in [#9517](https://github.com/numpy/numpy/pull/9517) – Eric Aug 28 '17 at 01:33
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/152976/discussion-between-ndim-and-eric). – ndim Aug 28 '17 at 02:02

Short Answer

Your Python program outputs 64-bit integers, not the 32-bit integers your C program is trying to read.

You can change the following line of code:

b = np.random.randint(100, size=(5,7), dtype=np.int32)

Now you will see 32-bit integers in the output file.

How to Tell What Your Python Code Outputs

Your Python code dumps 64-bit integers, based on the following analysis of a hexdump of your output file. You can examine the binary data file with any hex editor application.

$ hexdump random_from_python_int.dat
0000000 09 00 00 00 00 00 00 00 0f 00 00 00 00 00 00 00
0000010 40 00 00 00 00 00 00 00 1c 00 00 00 00 00 00 00
0000020 59 00 00 00 00 00 00 00 5d 00 00 00 00 00 00 00

As @ndim points out in his answer, two's complement integer representation consists of three major elements: [storage] size, endianness and signedness. I will not repeat information which he provides in his answer except to show how to deduce those from the above output which was what I started to do in my original answer.

In your case of multi-dimensional arrays, you may also need to know the order of elements in linear storage.

Deducing Integer Storage Size

Since you pass np.random.randint() a non-inclusive maximum random value of (decimal) 100, your values fall in the decimal range [0, 100), or [0x0, 0x64) in hexadecimal, so each value fits in a single byte. Note that none of the non-00 bytes in the hexdump above falls outside this range. As you can see, a total of 8 bytes is used to represent each integer value (1 non-00 byte and 7 00 bytes, given the range of numbers in this case).
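
The same deduction can be reproduced from Python without a hex editor by looking at the raw bytes directly (a small sketch):

```python
import numpy as np

np.random.seed(10)
b = np.random.randint(100, size=(5, 7))
raw = b.tobytes()

# All values are below 0x64, so each element has exactly one non-00
# byte; the spacing between those bytes reveals the element size.
print(raw[:16].hex(' '))   # first elements, e.g. '09 00 00 00 00 00 00 00 0f ...'
print(len(raw) // b.size)  # bytes per element (8 for a 64-bit dtype)
```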

Deducing Endianness

Furthermore, you can now also deduce the endianness of the integer representation, which is little endian in this case: the least significant byte of each value comes first in linear storage.

Deducing Signedness

In this case, you cannot deduce signedness, because your sample contains no negative values. If it did, in two's complement representation you would see the sign bit set to 1. I won't delve into the details of two's complement negative integer representation, which would be off-topic for this question.
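
For completeness, the sign bit is easy to see once a negative value is present; dumping a single -1 shows the all-ones two's complement pattern (an illustration, not part of the question's data):

```python
import numpy as np

# -1 in 32-bit two's complement is all bits set, so every byte is 0xff
# regardless of endianness; any negative value sets the top bit of the
# most significant byte.
a = np.array([-1], dtype=np.int32)
print(a.tobytes().hex())   # 'ffffffff'
```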

Deducing Multi-dimensional Array Order

An examination of the first two 8-byte, little-endian integers in the above output, at file offsets (0x) 0000000 and 0000008 (the latter is not labeled), gives the hexadecimal values 0x00000000 00000009 and 0x00000000 0000000f, which are the decimal values 9 and 15 respectively. The decimal value 9 would come first in either row-major or column-major order, but 15 appearing second in linear storage indicates row-major ordering, since the row elements are in contiguous storage.

The hexadecimal value of the third integer's value located at file offset (0x) 0000010 is 0x00000000 00000040 which in decimal is the numeric value 64. This value is the third value in your expected output in row-major order.

For completeness, column-major order would output the decimal value of 8 as the second integer represented in linear storage.
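
The two orderings are easy to compare from Python for a small array (a sketch; the [::4] slice picks out the low byte of each 32-bit element on a little-endian machine):

```python
import numpy as np

a = np.array([[9, 15, 64],
              [8, 73,  0]], dtype=np.int32)

# 'C' order (row-major, numpy's default) stores rows contiguously;
# 'F' order (column-major, Fortran-style) stores columns contiguously.
print(list(a.tobytes(order='C')[::4]))   # [9, 15, 64, 8, 73, 0]
print(list(a.tobytes(order='F')[::4]))   # [9, 8, 15, 73, 64, 0]
```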

How to Make Numpy Dump 32-bit Numbers in Your Python Code

To make your code dump 32-bit numbers, which is a common implementation size for int (though the C standard leaves it implementation-defined and only mandates a minimum range), change the following line of code:

b = np.random.randint(100, size=(5,7), dtype=np.int32)

Now you will see 32-bit integers in the output file.

$ hexdump random_from_python_int.dat
0000000 09 00 00 00 0f 00 00 00 40 00 00 00 1c 00 00 00
0000010 59 00 00 00 5d 00 00 00 1d 00 00 00 08 00 00 00
0000020 49 00 00 00 00 00 00 00 28 00 00 00 24 00 00 00

NOTE: The actual storage size (precision) of C int variables is "implementation defined", which means you may need to adjust the numpy array integer storage size before output for maximum compatibility with C. See @ndim's excellent answer that provides more detail regarding this.

Changes to Your C Code

Your C code must be updated to reflect the change in data types for the two-dimensional array: in your code, double randn[5][7] should become int randn[5][7]. You could also make the type int32_t, as @ndim pointed out; note that int32_t requires #include <stdint.h>, otherwise the compiler will report an unknown type. After making that change and compiling, I get the following output:

 9     15     64     28     89     93     29
 8     73     0     40     36     16     11
 54     88     62     33     72     78     49
 51     54     77     69     13     25     13
 92     86     30     30     89     12     65

UPDATE (See also UPDATE #2)

Per @ndim's comment below, you can also use np.intc as below. This option is likely the best option unless you are targeting a specific storage size for integer representation.

b = np.random.randint(100, size=(5,7), dtype=np.intc)

I tested this and it also produces 32-bit integers.
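
Before relying on np.intc, you can check what it maps to on your platform (a quick check, not from the original answer):

```python
import numpy as np

# np.intc mirrors the C 'int' of the compiler numpy was built with;
# on mainstream desktop and server platforms that is 4 bytes, but it
# is a platform property, not a guarantee.
print(np.dtype(np.intc).itemsize)   # typically 4
```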

UPDATE #2

I totally agree with @ndim that specifying the integer size is best for maximizing compatibility. The principle of least surprise applies here.

tdube
  • According to the numpy docs, `np.intc` will use and write to the file in numpy whatever C sees as an `int`. – ndim Aug 25 '17 at 21:35