Following up on a previous question, is there a preferred, efficient way to get the type of each object within a column? This is specifically for the case where the dtype of the column is object, to allow for heterogeneous types among the elements of the column (in particular, allowing for numeric NaN without changing the data type of the other elements to float).
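To make the setup concrete, here is a minimal sketch of the kind of column in question. The values are made up for illustration; only the object dtype and the presence of NaN matter.

```python
import numpy as np
import pandas as pd

# Hypothetical data: mixed Python objects plus a float NaN.
mixed = pd.Series([1, "two", np.nan, 4.0], dtype=object)

# With dtype=object each element keeps its own Python type.
element_types = [type(x) for x in mixed]
print(mixed.dtype)
print(element_types)

# For contrast: without dtype=object, a numeric column containing NaN
# is stored as float64, so the integers are converted to floats.
numeric = pd.Series([1, np.nan, 3])
print(numeric.dtype)
```

The first column reports int, str, float, float for its elements, while the second is coerced to a single float dtype.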
I haven't done time benchmarking, but I am skeptical of the following immediately obvious approach (and variants that might use map or filter). The use cases of interest need type information for all elements quickly, so generators and other lazy approaches probably won't be an efficiency boon here.
# dfrm is a pandas DataFrame with some column 'A', such that
# dfrm['A'].dtype is 'object'
dfrm['A'].apply(type)  # Or np.dtype, but this will fail for native types.
Another thought was to use the NumPy vectorize function, but is this really going to be more efficient? For example, with the same setup as above, I could try:
import numpy as np
vtype = np.vectorize(lambda x: type(x)) # Gives error without lambda
vtype(dfrm['A'])
Both ideas lead to workable output, but it's the efficiency I'm worried about.
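As a sanity check that the two routes really do agree, here is a small Python 3 sketch on made-up data. The column contents and the use of collections.Counter to summarize the result are my own assumptions for illustration, and otypes=[object] is passed so that np.vectorize does not have to infer the output type.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Hypothetical column: None at even positions, Python floats elsewhere.
values = [None if i % 2 == 0 else float(i) for i in range(10)]
dfrm = pd.DataFrame({"A": pd.Series(values, dtype=object)})

# Route 1: Series.apply with the built-in type.
via_apply = dfrm["A"].apply(type)

# Route 2: np.vectorize wrapping a lambda, as in the snippet above.
vtype = np.vectorize(lambda x: type(x), otypes=[object])
via_vectorize = vtype(dfrm["A"].to_numpy())

# Both routes should report the same type for every element.
same_result = list(via_apply) == list(via_vectorize)

# Summarize how many elements have each type.
type_counts = Counter(via_apply)
print(same_result, type_counts)
```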
Added
I went ahead and did a tiny benchmark in IPython. First is for vtype above, then for the apply route. I repeated it a dozen or so times, and this example run is pretty typical on my machine.
The apply() approach clearly wins (it is roughly twice as fast for 1,000 rows and up), so is there a good reason to expect that I won't get more efficient than with apply()?
For vtype()
In [49]: for ii in [100,1000,10000,100000,1000000,10000000]:
   ....:     dfrm = pandas.DataFrame({'A':np.random.rand(ii)})
   ....:     dfrm['A'] = dfrm['A'].astype(object)
   ....:     dfrm['A'][0:-1:2] = None
   ....:     st_time = time.time()
   ....:     tmp = vtype(dfrm['A'])
   ....:     ed_time = time.time()
   ....:     print "%s:\t\t %s"%(ii, ed_time-st_time)
   ....:
100: 0.0351531505585
1000: 0.000324010848999
10000: 0.00209212303162
100000: 0.0224051475525
1000000: 0.211136102676
10000000: 2.2215731144
For apply()
In [50]: for ii in [100,1000,10000,100000,1000000,10000000]:
   ....:     dfrm = pandas.DataFrame({'A':np.random.rand(ii)})
   ....:     dfrm['A'] = dfrm['A'].astype(object)
   ....:     dfrm['A'][0:-1:2] = None
   ....:     st_time = time.time()
   ....:     tmp = dfrm['A'].apply(type)
   ....:     ed_time = time.time()
   ....:     print "%s:\t %s"%(ii, ed_time-st_time)
   ....:
100: 0.000900983810425
1000: 0.000159025192261
10000: 0.00117015838623
100000: 0.0111050605774
1000000: 0.103563070297
10000000: 1.03093600273
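For anyone wanting to repeat this, here is a rough Python 3 sketch of the same comparison using timeit instead of time.time(). The helper names make_column and best_time are hypothetical, the sizes are kept small so it runs quickly, and the list comprehension is included only as one more variant to compare. Taking the best of several repeats may also smooth out one-off effects such as the 100-row vtype timing above being slower than the 1000-row one. No timings are claimed here; they will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

vtype = np.vectorize(lambda x: type(x), otypes=[object])


def make_column(n):
    """Build an object-dtype column of floats with None at every other slot."""
    values = np.random.rand(n).astype(object)
    values[0:-1:2] = None
    return pd.Series(values, name="A")


def best_time(func, repeats=3):
    """Return the fastest of several single-run timings, in seconds."""
    return min(timeit.repeat(func, number=1, repeat=repeats))


timings = {}
for n in [100, 1000, 10000]:
    col = make_column(n)
    timings[n] = {
        "vectorize": best_time(lambda: vtype(col.to_numpy())),
        "apply": best_time(lambda: col.apply(type)),
        "list comprehension": best_time(lambda: [type(x) for x in col]),
    }
    print(n, timings[n])
```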