8

I have a large csv with two strings per row in this form:

g,k
a,h
c,i
j,e
d,i
i,h
b,b
d,d
i,a
d,h

I read in the first two columns and recode the strings to integers as follows:

import pandas as pd
df = pd.read_csv("test.csv", usecols=[0,1], prefix="ID_", header=None)
from sklearn.preprocessing import LabelEncoder

# Initialize the LabelEncoder.
le = LabelEncoder()
le.fit(df.values.flat)

# Convert to digits.
df = df.apply(le.transform)

This code is from https://stackoverflow.com/a/39419342/2179021.

The code works very well but is slow when df is large. I timed each step and the result was surprising to me.

  • pd.read_csv takes about 40 seconds.
  • le.fit(df.values.flat) takes about 30 seconds.
  • df = df.apply(le.transform) takes about 250 seconds.

Is there any way to speed up this last step? It feels like it should be the fastest step of them all!


Update: more timings for the recoding step, on a computer with 4 GB of RAM

The answer below by maxymoo is fast but doesn't give the right answer. Taking the example csv from the top of the question, it translates it to:

   0  1
0  4  6
1  0  4
2  2  5
3  6  3
4  3  5
5  5  4
6  1  1
7  3  2
8  5  0
9  3  4

Notice that 'd' is mapped to 3 in the first column but 2 in the second.

I tried the solution from https://stackoverflow.com/a/39356398/2179021 and get the following.

df = pd.DataFrame({'ID_0':np.random.randint(0,1000,1000000), 'ID_1':np.random.randint(0,1000,1000000)}).astype(str)
df.info()
memory usage: 7.6MB
%timeit x = (df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack())
1 loops, best of 3: 1.7 s per loop

Then I increased the dataframe size by a factor of 10.

df = pd.DataFrame({'ID_0':np.random.randint(0,1000,10000000), 'ID_1':np.random.randint(0,1000,10000000)}).astype(str) 
df.info()
memory usage: 76.3+ MB
%timeit x = (df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack())
MemoryError                               Traceback (most recent call last)

This method appears to use so much RAM trying to translate this relatively small dataframe that it crashes.

I also timed LabelEncoder with the larger dataset of 10 million rows. It runs without crashing, but the fit line alone took 50 seconds. The df.apply(le.transform) step took about 80 seconds.

How can I:

  1. Get something with roughly the speed of maxymoo's answer and roughly the memory usage of LabelEncoder, but that gives the right answer when the dataframe has two columns?
  2. Store the mapping so that I can reuse it for different data (in the way LabelEncoder allows me to)?
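
For concreteness, here is a rough sketch of the kind of approach I have in mind (combining pd.unique over both columns with Series.map; untested, just to illustrate the two requirements):

import numpy as np
import pandas as pd

# Learn one global mapping from the values in both columns (reusable later).
uniques = np.sort(pd.unique(df.values.ravel()))
mapping = pd.Series(np.arange(len(uniques)), index=uniques)

# Apply the same mapping to every column, so e.g. 'd' gets the same code everywhere.
encoded = df.apply(lambda col: col.map(mapping))

The mapping Series could then be saved (e.g. pickled) and reused on other data.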
Simd
  • How big is your csv - rows and columns? How many le.classes do you have? – wwii Sep 13 '16 at 17:02
  • @wwii About 10 million rows and 1 million le.classes. – Simd Sep 13 '16 at 17:03
  • What does the `groupby....` accomplish? – wwii Sep 13 '16 at 17:07
  • @wwii It counts duplicates. – Simd Sep 13 '16 at 17:09
  • And which version of Python are you using? – wwii Sep 13 '16 at 17:14
  • Python 2 . Thank you for any help. – Simd Sep 13 '16 at 18:18
  • What does a couple of rows of your final DataFrame look like? – wwii Sep 14 '16 at 18:59
  • @wwii See the example in the question. – Simd Sep 14 '16 at 19:00
  • You could use `numpy` to load the entire file into memory and pre-process it relatively quickly. That assumes it fits in memory. Another option is to switch to `pypy` if your library is supported. Or even pre-process as a separate step, in a separate process. – Dima Tisnek Sep 21 '16 at 07:41
  • @qarma If you can show timings that are faster assuming it all fits in memory that would be awesome. – Simd Sep 21 '16 at 08:04
  • first step: http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html second step, not sure why second and third are separate -- it's possible to do those in one step with `fit_transform`. Overall it's a trade-off between developer time (using `sklearn` primitives) and computer time (`numpy`, custom processing). Current trade-off seems reasonable to me. – Dima Tisnek Sep 27 '16 at 07:07

3 Answers

10

It looks like it will be much faster to use the pandas category datatype; internally this uses a hash table, whereas LabelEncoder uses a sorted search:

In [87]: df = pd.DataFrame({'ID_0':np.random.randint(0,1000,1000000), 
                            'ID_1':np.random.randint(0,1000,1000000)}).astype(str)

In [88]: le.fit(df.values.flat) 
         %time x = df.apply(le.transform)
CPU times: user 6.28 s, sys: 48.9 ms, total: 6.33 s
Wall time: 6.37 s

In [89]: %time x = df.apply(lambda x: x.astype('category').cat.codes)
CPU times: user 301 ms, sys: 28.6 ms, total: 330 ms
Wall time: 331 ms
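
To keep the encoding consistent across both columns (see the first comment below), one option is to build a single list of categories from all of the values first and encode each column against it; a rough, untimed sketch:

# Build one shared category list from both columns, then encode each column
# against it so the same string always gets the same code.
cats = pd.unique(df.values.ravel())
x = df.apply(lambda col: pd.Categorical(col, categories=cats).codes)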

EDIT: Here is a custom transformer class that you could use (you probably won't see this in an official scikit-learn release, since the maintainers don't want to have pandas as a dependency):

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class PandasLabelEncoder(BaseEstimator, TransformerMixin):
    def fit(self, y):
        # Record the unique values seen during fit (pd.unique preserves order).
        self.classes_ = pd.unique(y)
        return self

    def transform(self, y):
        # Encode against the fitted categories; values not seen in fit become -1.
        s = pd.Series(y).astype(pd.CategoricalDtype(categories=self.classes_))
        return s.cat.codes
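
A possible usage sketch: fit once on the flattened values so that both columns share one encoding, then keep (or pickle) the fitted encoder to reuse on other data.

enc = PandasLabelEncoder()
enc.fit(df.values.ravel())           # one shared set of classes for both columns
encoded = df.apply(enc.transform)    # identical strings get identical codes

# the fitted encoder can be saved and reapplied to new data later
import pickle
with open("encoder.pkl", "wb") as f:
    pickle.dump(enc, f)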
maxymoo
  • Thank you although I do need a consistent encoding across the two columns and I won't have every value represented in both columns. – Simd Sep 15 '16 at 06:41
  • I need to be able to apply this mapping to another data frame. Is there any way to store it as labelencoder does? – Simd Sep 15 '16 at 20:54
3

I tried this with the DataFrame:

In [xxx]: import string
In [xxx]: letters = np.array([c for c in string.ascii_lowercase])
In [249]: df = pd.DataFrame({'ID_0': np.random.choice(letters, 10000000), 'ID_1':np.random.choice(letters, 10000000)})

It looks like this:

In [261]: df.head()
Out[261]: 
  ID_0 ID_1
0    v    z
1    i    i
2    d    n
3    z    r
4    x    x

In [262]: df.shape
Out[262]: (10000000, 2)

So, 10 million rows. Locally, my timings are:

In [257]: % timeit le.fit(df.values.flat)
1 loops, best of 3: 17.2 s per loop

In [258]: % timeit df2 = df.apply(le.transform)
1 loops, best of 3: 30.2 s per loop

Then I made a dict mapping letters to numbers and used pandas.Series.map:

In [248]: letters = np.array([l for l in string.ascii_lowercase])
In [263]: d = dict(zip(letters, range(26)))

In [273]: %timeit for c in df.columns: df[c] = df[c].map(d)
1 loops, best of 3: 1.12 s per loop

In [274]: df.head()
Out[274]: 
   ID_0  ID_1
0    21    25
1     8     8
2     3    13
3    25    17
4    23    23

So that might be an option. The dict just needs to have all of the values that occur in the data.
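
If the set of values isn't known up front, the same dict can be built from the data itself; a rough, untimed sketch using pd.unique:

# Build the mapping once from everything that occurs in either column...
uniques = np.sort(pd.unique(df.values.ravel()))
d = dict(zip(uniques, range(len(uniques))))

# ...then map each column through it, exactly as above.
for c in df.columns:
    df[c] = df[c].map(d)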

EDIT: The OP asked what timing I get for the second option from the question (the stack/category approach). Here it is:

In [40]: %timeit   x=df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack()
1 loops, best of 3: 13.5 s per loop

EDIT: per the 2nd comment:

In [45]: %timeit uniques = np.sort(pd.unique(df.values.ravel()))
1 loops, best of 3: 933 ms per loop

In [46]: %timeit  dfc = df.apply(lambda x: x.astype('category', categories=uniques))
1 loops, best of 3: 1.35 s per loop
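
For readers on newer pandas versions: the astype('category', categories=...) call above has since been removed; an equivalent (untested here) goes through CategoricalDtype:

# Same idea with the current API: one shared dtype for both columns.
cat_dtype = pd.CategoricalDtype(categories=uniques)
dfc = df.apply(lambda x: x.astype(cat_dtype))
codes = dfc.apply(lambda x: x.cat.codes)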
Dthal
  • This is a great answer, I didn't know you could map a dictionary like that, very nice trick. BTW a more functional syntax would be `df.apply(lambda c: c.map(d))` (although this is 0.5s slower for some reason) – maxymoo Sep 17 '16 at 09:05
  • Thank you. Could you time `uniques = np.sort(pd.unique(df.values.ravel())) df.apply(lambda x: x.astype('category', categories=uniques))` too please? – Simd Sep 18 '16 at 10:09
  • Thank you for doing the timing I didn't mean the code you ran. I meant literally the two lines I pasted in the comment. You shouldn't need anything else should you (for example stack and unstack)? – Simd Sep 18 '16 at 21:25
0

I would like to point out an alternate solution that should serve many readers well. Although I prefer to have a known set of IDs, it is not always necessary if this is strictly a one-way remapping.

Instead of

df[c] = le.transform(df[c])

or

dict_table = {val: i for i, val in enumerate(uniques)}
df[c] = df[c].map(dict_table)

or (check out `_encode()` and `_encode_python()` in the sklearn source code; I assume this approach is faster on average than the other methods mentioned)

df[c] = np.array([dict_table[v] for v in df[c].values])

you can instead do

df[c] = df[c].apply(hash)

Pros: much faster, less memory needed, no training, and hashes can be reduced to smaller representations (casting to a smaller dtype, at the cost of more collisions).

Cons: gives funky numbers, can have collisions (not guaranteed to be perfectly unique), and can't guarantee the function won't change with a new version of Python.

Note that secure hash functions will have fewer collisions, at the cost of speed.
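
If the "may change with a new version of Python" point matters (in Python 3 the built-in hash() of strings is also randomized per process unless PYTHONHASHSEED is set), a stable variant can be built on hashlib; a rough Python 3 sketch, noticeably slower than hash():

import hashlib

def stable_hash(value):
    # hashlib digests are stable across runs and Python versions,
    # unlike the built-in hash(), which can differ between processes.
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    # Keep the first 8 bytes as a signed 64-bit integer.
    return int.from_bytes(digest[:8], "little", signed=True)

for c in df.columns:
    df[c] = df[c].apply(stable_hash)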

Example of when to use: you have somewhat long strings that are mostly unique and the data set is huge. Most importantly, you don't care about rare hash collisions, even though they can be a source of noise in your model's predictions.

I've tried all of the methods above; my workload took about 90 minutes to learn the encoding from the training data (1M rows and 600 features) and reapply it to several test sets, while also dealing with new values. The hash method brought that down to a few minutes, and I don't need to save any model.

ldmtwo