
Following up on a previous question, is there a preferred efficient manner to get the type of each object within a column? This is specifically for the case where the dtype of the column is object to allow for heterogeneous types among the elements of the column (in particular, allowing for numeric NaN without changing the data type of the other elements to float).
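
For concreteness, here is a minimal sketch of the kind of column involved; the column name 'A' and the example values are just illustrative:

import numpy as np
import pandas

# An object-dtype column mixing ints, a string, and missing values
# (None / NaN), so the non-missing elements keep their original types
# instead of being upcast to float.
dfrm = pandas.DataFrame({'A': [1, 'two', 3, None, np.nan]})
dfrm['A'].dtype  # dtype('O'), i.e. object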

I haven't done time benchmarking, but I am skeptical of the following immediately obvious way that comes to mind (and variants that might use map or filter). The use cases of interest need to quickly get info on the types of all elements, so generators and the like probably won't be an efficiency boon here.

# dfrm is a pandas DataFrame with some column 'A', such that
# dfrm['A'].dtype is 'object'

dfrm['A'].apply(type)  # Or np.dtype, but this will fail for native types.

Another thought was to use the NumPy vectorize function, but is this really going to be more efficient? For example, with the same setup as above, I could try:

import numpy as np
vtype = np.vectorize(lambda x: type(x))  # np.vectorize(type) raises an error, hence the lambda

vtype(dfrm['A'])

Both ideas lead to workable output, but it's the efficiency I'm worried about.

Added

I went ahead and did a tiny benchmark in IPython. First is for vtype above, then for the apply route. I repeated it a dozen or so times, and this example run is pretty typical on my machine.

The apply() approach clearly wins, so is there good reason to expect that I can't do better than apply()?

For vtype()

In [49]: for ii in [100,1000,10000,100000,1000000,10000000]:
   ....:     dfrm = pandas.DataFrame({'A':np.random.rand(ii)})
   ....:     dfrm['A'] = dfrm['A'].astype(object)
   ....:     dfrm['A'][0:-1:2] = None
   ....:     st_time = time.time()
   ....:     tmp = vtype(dfrm['A'])
   ....:     ed_time = time.time()
   ....:     print "%s:\t\t %s"%(ii, ed_time-st_time)
   ....:     
100:         0.0351531505585
1000:        0.000324010848999
10000:       0.00209212303162
100000:      0.0224051475525
1000000:     0.211136102676
10000000:    2.2215731144

For apply()

In [50]: for ii in [100,1000,10000,100000,1000000,10000000]:
   ....:     dfrm = pandas.DataFrame({'A':np.random.rand(ii)})
   ....:     dfrm['A'] = dfrm['A'].astype(object)
   ....:     dfrm['A'][0:-1:2] = None
   ....:     st_time = time.time()
   ....:     tmp = dfrm['A'].apply(type)
   ....:     ed_time = time.time()
   ....:     print "%s:\t %s"%(ii, ed_time-st_time)
   ....:     
100:         0.000900983810425
1000:        0.000159025192261
10000:       0.00117015838623
100000:      0.0111050605774
1000000:     0.103563070297
10000000:    1.03093600273
ely
    Minor note: `lambda x: type(x)` will be slower than simply `type`, I think. – DSM Jul 19 '12 at 03:20
  • Yeah, that's true. No need for the lambdas. Edited. – ely Jul 19 '12 at 03:23
  • Although, for my `vtype` function, I am getting an error when I don't define it with the lambda. – ely Jul 19 '12 at 03:26
  • Yeah, vectorize is a little more finicky about what it accepts. If the other objects are numerical you could use `.apply(isnan)`, I guess, but that won't work if they're strings. – DSM Jul 19 '12 at 03:40
  • Yeah, it will need to be more than just nan checking. Also, pandas provides the `.isnull()` function which I generally find to be better than NumPy `isnan()`. – ely Jul 19 '12 at 03:42
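
To summarize the comment thread, a minimal sketch using the same setup as the benchmarks above (the behavior described in the comments reflects the pandas/NumPy versions of the time):

import numpy as np
import pandas

dfrm = pandas.DataFrame({'A': np.random.rand(10)})
dfrm['A'] = dfrm['A'].astype(object)
dfrm['A'][0:-1:2] = None

# Passing the builtin directly avoids the extra lambda call overhead.
types = dfrm['A'].apply(type)

# Series.isnull() catches both None and NaN, whereas np.isnan() fails
# on non-numeric elements such as strings.
missing = dfrm['A'].isnull()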

1 Answer


Series.apply and Series.map use a specialized Cython method (pandas.lib.map_infer) I wrote that is roughly 2x faster than using numpy.vectorize.
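
As a rough sketch, Series.map can be timed alongside Series.apply with the same setup as the benchmarks above; the harness is only illustrative and the numbers will vary by machine and version:

import time
import numpy as np
import pandas

dfrm = pandas.DataFrame({'A': np.random.rand(1000000)})
dfrm['A'] = dfrm['A'].astype(object)
dfrm['A'][0:-1:2] = None

# Both map and apply dispatch a plain elementwise callable through the
# same Cython path, per the answer above.
st_time = time.time()
tmp = dfrm['A'].map(type)
print("map:   %s" % (time.time() - st_time))

st_time = time.time()
tmp = dfrm['A'].apply(type)
print("apply: %s" % (time.time() - st_time))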

Wes McKinney
  • I appreciate the feedback. We are also noticing that for small benchmark tests (columns up to length 10,000,000), `dfrm['A'].apply(type).unique()` seems to perform basically the same as `set([type(x) for x in dfrm['A'] if x is not None])` ... is this because of some NumPy overhead in returning a NumPy array at the end? – ely Jul 19 '12 at 18:30
  • I think in both cases you're going to be limited by hash table performance. But I could be wrong; there is additional overhead in converting the internal result of unique to an ndarray, so I'm sure that is contributing. – Wes McKinney Jul 23 '12 at 23:00
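
For reference, a minimal sketch of the two approaches being compared in these comments (illustrative only; relative timings will depend on the versions in use):

import numpy as np
import pandas

dfrm = pandas.DataFrame({'A': np.random.rand(1000000)})
dfrm['A'] = dfrm['A'].astype(object)
dfrm['A'][0:-1:2] = None

# Unique element types via apply(); the result is an object ndarray
# of type objects.
types_via_apply = dfrm['A'].apply(type).unique()

# The same information via a plain set comprehension, skipping None.
types_via_set = set([type(x) for x in dfrm['A'] if x is not None])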