
I have a Pandas dataframe, named "impression_data," which includes a column called "site.id," like this:

>>> impression_data['site.id']
0      62
1     189
2     191
3      62
...

Each item in this column has the datatype numpy.int64, like this:

>>> for i in impression_data['site.id']:
    print type(i)

<type 'numpy.int64'>
<type 'numpy.int64'>
<type 'numpy.int64'>
...

And as expected, membership testing works well so long as I test integers:

>>> 62 in impression_data['site.id']
True

But here's the unexpected result: I was under the impression that a column of np.int64's ought not to include any decimal values whatsoever. Apparently I'm wrong. What's going on here?

>>> 62.5 in impression_data['site.id']
True

Edit 1: All values in the column ought to be integers by construction. For completeness, I have also performed the following casting operation and encountered no errors:

impression_data['site.id'] = impression_data['site.id'].astype('int')

As per @BrenBarn's suggestion in the comments, I tried

impression_data['site.id'].map(type).unique()

which produces

[<type 'numpy.int64'>]

The real data file I'm working with is here:

https://dl.dropboxusercontent.com/u/28347262/SE%20Pandas%20Int64%20Membership%20Testing/cm_impression.csv

and a minimal example is here:

https://dl.dropboxusercontent.com/u/28347262/SE%20Pandas%20Int64%20Membership%20Testing/ExampleCode.py

avn2109
  • Are you sure every single value is an int? What does `impression_data['site.id'].map(type).unique()` give? Can you provide example code and data that demonstrate the problem? – BrenBarn Jan 26 '14 at 19:15
  • Thanks for your quick response, @BrenBarn. I took your advice on trying `impression_data['site.id'].map(type).unique()` and edited my question to reflect that. Example code and data to follow shortly. – avn2109 Jan 26 '14 at 19:24
  • Historically, using `in` with numpy arrays can produce odd results; I would suggest something like `np.any(df['site.id'].isin([62.5]))`. – Daniel Jan 26 '14 at 19:32

2 Answers


This is a bug in pandas. The value is cast to the type of the index before the containment test is done, so 62.5 is converted to 62. (Note that `in` for a Series checks whether the value is in the index, not the values.)

I believe you can get what you want by doing `62.5 in impression_data['site.id'].values`.
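As a minimal sketch of the difference between testing the index and testing the values, assuming a small made-up Series in place of the asker's data (the variable names are illustrative only):

```python
import pandas as pd

# Hypothetical stand-in for impression_data['site.id'].
s = pd.Series([62, 189, 191, 62], name="site.id")

# `in` on a Series looks at the index labels (0..3 here), not the data.
index_hit = 62 in s

# Testing against the underlying NumPy array looks at the data itself.
value_hit = 62 in s.values
frac_in_values = 62.5 in s.values

# isin() also compares against the data and returns a boolean Series.
frac_isin = bool(s.isin([62.5]).any())

print(index_hit, value_hit, frac_in_values, frac_isin)
```

With a default integer index this short, `62 in s` is false even though 62 appears in the data, which is why the question's `62 in impression_data['site.id']` only appeared to work: the index was long enough to contain the label 62.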

BrenBarn
  • `62.5 in impression_data['site.id'].values` produces `False`, just as you predict. An easy fix! – avn2109 Jan 26 '14 at 19:56
  • @avn2109 Please note that using `in` with numpy arrays does not always produce the desired result, especially when the numpy array has more than one dimension. Be very careful of this. – Daniel Jan 26 '14 at 22:52
  • @Ophion: Can you be more specific? In any case, here we're only using a 1D array (in the form of a Series). – BrenBarn Jan 26 '14 at 22:54

First, membership tests in Series are of the index, not the values:

>>> s = pd.Series([10,20,30])
>>> s
0    10
1    20
2    30
dtype: int64
>>> 0 in s
True
>>> 10 in s
False

But you're right:

>>> 1.5 in s
True

After some work, this seems to be because of `__contains__` in `Int64HashTable`:

cdef class Int64HashTable: #(HashTable):
    [...]
    def __contains__(self, object key):
        cdef khiter_t k
        k = kh_get_int64(self.table, key)
        return k != self.table.n_buckets

`key` comes in as a float, but we have

inline khint_t kh_get_int64(kh_int64_t*, int64_t)

and so I think it's coerced to an integer before the comparison is made.
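The suspected effect can be imitated in pure Python. This is only a sketch of the idea, not the pandas implementation: `TruncatingIndex` and `ExactIndex` are hypothetical classes, and the `int(key)` call stands in for the implicit conversion to `int64_t` described above.

```python
class TruncatingIndex:
    """Toy lookup table that casts the key to int before searching."""

    def __init__(self, labels):
        self._table = {int(x) for x in labels}

    def __contains__(self, key):
        # Mimics the suspected coercion: 1.5 becomes 1 before the lookup.
        return int(key) in self._table


class ExactIndex(TruncatingIndex):
    """Same table, but rejects keys that change value when cast to int."""

    def __contains__(self, key):
        try:
            as_int = int(key)
        except (TypeError, ValueError, OverflowError):
            return False
        # Only accept the key if the cast did not alter it.
        return as_int == key and as_int in self._table


labels = [0, 1, 2]
truncating_result = 1.5 in TruncatingIndex(labels)
exact_result = 1.5 in ExactIndex(labels)
exact_whole_float = 1.0 in ExactIndex(labels)

print(truncating_result, exact_result, exact_whole_float)
```

The first lookup reproduces the surprising `True`, while the second shows one way a containment check could guard against a lossy cast.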

DSM