pandas.factorize with custom array datatype

Question

Let's start with a random (reproducible) data array:

# Setup (assumes: import numpy as np, import pandas as pd)
In [11]: np.random.seed(0)
    ...: a = np.random.randint(0,9,(7,2))
    ...: a[2] = a[0]
    ...: a[4] = a[1]
    ...: a[6] = a[1]

# Check values
In [12]: a
Out[12]: 
array([[5, 0],
       [3, 3],
       [5, 0],
       [5, 2],
       [3, 3],
       [6, 8],
       [3, 3]])

# Check its itemsize
In [13]: a.dtype.itemsize
Out[13]: 8

Let's view each row as a scalar using a custom datatype that covers both elements. We will use a void dtype for this purpose. Going by the docs (https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.dtypes.html#specifying-and-constructing-data-types, https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.interface.html#arrays-interface) and Stack Overflow Q&A, it seems that would be:

In [23]: np.dtype((np.void, 16)) # 8 is the itemsize, so 8x2=16
Out[23]: dtype('V16')

# Create new view of the input
In [14]: b = a.view('V16').ravel()

# Check new view array
In [15]: b
Out[15]: 
array([b'\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
       b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
       b'\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
       b'\x05\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00',
       b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
       b'\x06\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00',
       b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00'],
      dtype='|V16')
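
The width of the void dtype does not have to be hardcoded. A minimal sketch, assuming a C-contiguous 2-D integer array; the names row_bytes and void_dt are illustrative, and the array is written out explicitly as int64 instead of being generated from the seed:

```python
import numpy as np

# Same values as `a` above, spelled out so the example does not depend on the RNG.
a = np.array([[5, 0], [3, 3], [5, 0], [5, 2], [3, 3], [6, 8], [3, 3]],
             dtype=np.int64)

# A row-wise view needs contiguous memory; this is a no-op if `a` already is.
a = np.ascontiguousarray(a)

# Bytes per row = itemsize * number of columns (8 * 2 = 16 here).
row_bytes = a.dtype.itemsize * a.shape[1]
void_dt = np.dtype((np.void, row_bytes))

# One void scalar per row.
b = a.view(void_dt).ravel()

print(b.dtype, b.shape)
```

Rows that hold the same values map to identical 16-byte scalars, which is what makes the view usable as a per-row key.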

# Use pandas.factorize on the new view
In [16]: pd.factorize(b)
Out[16]: 
(array([0, 1, 0, 0, 1, 2, 1]),
 array(['\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
        '\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
        '\x06\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00'],
       dtype=object))

Two things about factorize's output that I could not understand, hence the follow-up questions:

1. The fourth element of the first output (=0) looks wrong: it has the same ID as the third element, but in b the fourth and third elements are different. Why?
2. Why does the second output have an object dtype, while the dtype of b was V16? Is this also causing the wrong value mentioned in 1?

A bigger question could be: does pandas.factorize cover custom datatypes? From the docs, I see:

values : sequence
    A 1-D sequence. Sequences that aren’t pandas objects are coerced to ndarrays before factorization.

The sample input is already a NumPy array, so one would expect no issues with it, unless the docs simply don't address custom datatypes?

System setup : Ubuntu 16.04, Python : 2.7.12, NumPy : 1.16.2, Pandas : 0.24.2.

On Python-3.x

System setup : Ubuntu 16.04, Python : 3.5.2, NumPy : 1.16.2, Pandas : 0.24.2.

Running the same setup, I get:

In [18]: b
Out[18]: 
array([b'\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
       b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
       b'\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
       b'\x05\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00',
       b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
       b'\x06\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00',
       b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00'],
      dtype='|V16')

In [19]: pd.factorize(b)
Out[19]: 
(array([0, 1, 0, 2, 1, 3, 1]),
 array([b'\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
        b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
        b'\x05\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00',
        b'\x06\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00'],
       dtype=object))

So the first output of factorize looks right here. But the second output again has object dtype, which differs from the input. Hence the same question: why does the dtype change?

Compiling the questions/tl;dr

With such a custom datatype:

1. On Python 2.x, why are the labels and uniques wrong, and why does the uniques dtype differ from the input?
2. On Python 3.x, why does the uniques dtype differ from the input?

score 8 · Answer 1 · answered Apr 21 '19 at 13:46

As for why V16 is coerced to object: many functions in pandas first convert the data to one of the data types their internal routines can handle. If the data type is not among them, it becomes object, and it appears that pandas doesn't convert the result back into the original dtype.

Regarding the discrepancy between Python 2 and Python 3: There's only one pandas codebase for both, so why do they give different results?

It turns out that Python 2 uses the string type (which is just an array of bytes) to represent your data¹, while Python 3 uses the bytes type. As a result, Python 2 uses a StringHashTable for the factorization and Python 3 uses a PyObjectHashTable, and the StringHashTable gives incorrect results in your case. I believe this is because the strings in the StringHashTable are assumed to be zero-terminated, which is not the case for your strings. Indeed, if you only compare the rows up to the first zero byte, the first and fourth rows look identical.
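
To illustrate the zero-termination hypothesis, here is a sketch in plain Python. It does not touch pandas' hashtables at all; it only mimics what comparing the rows as C strings would do. The helper factorize_keys is made up for this illustration, and the data is fixed to little-endian int64 so the byte layout matches the output in the question:

```python
import numpy as np

# Rows from the question, as little-endian 64-bit integers (16 bytes per row).
a = np.array([[5, 0], [3, 3], [5, 0], [5, 2], [3, 3], [6, 8], [3, 3]],
             dtype='<i8')
rows = [row.tobytes() for row in a]


def factorize_keys(keys):
    """Assign each distinct key an integer label in order of first appearance."""
    seen = {}
    return [seen.setdefault(key, len(seen)) for key in keys]


# Comparing all 16 bytes of every row.
full_labels = factorize_keys(rows)

# Comparing only the bytes before the first zero byte,
# as code that assumes zero-terminated strings would.
truncated_labels = factorize_keys([r.split(b'\x00', 1)[0] for r in rows])

print(full_labels)       # [0, 1, 0, 2, 1, 3, 1]
print(truncated_labels)  # [0, 1, 0, 0, 1, 2, 1]
```

The truncated keys reproduce the labels reported on Python 2, which is consistent with the hypothesis but does not prove it.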

Conclusion: It's a bug, and we should probably file an issue for it.
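
In the meantime, one possible workaround (my suggestion, not something taken from the pandas docs) is to skip the void view and let NumPy deduplicate the rows directly with np.unique(..., axis=0). Note that the uniques come out sorted instead of in order of first appearance, so the labels will not match what factorize would return:

```python
import numpy as np

# Rows from the question, written out explicitly.
a = np.array([[5, 0], [3, 3], [5, 0], [5, 2], [3, 3], [6, 8], [3, 3]],
             dtype=np.int64)

# Unique rows (sorted lexicographically) and, for each input row,
# the index of its unique row.
uniques, inverse = np.unique(a, axis=0, return_inverse=True)

# Flatten defensively: the shape of the inverse has varied across NumPy versions.
labels = np.asarray(inverse).ravel()

print(labels)
print(uniques)
```

Indexing uniques with labels gives back the original array, so the pair carries the same information as a factorization.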

¹ More detail: The call to ensure_object returns an array of strings in Python 2, but an array of bytes in Python 3 (as can be seen from the b prefix). Correspondingly, a different hashtable is chosen.
