I have a large csv with two strings per row in this form:
g,k
a,h
c,i
j,e
d,i
i,h
b,b
d,d
i,a
d,h
I read in the first two columns and recode the strings to integers as follows:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Read only the first two columns; the file has no header row.
df = pd.read_csv("test.csv", usecols=[0, 1], prefix="ID_", header=None)

# Fit a single encoder on all values from both columns so the mapping is shared.
le = LabelEncoder()
le.fit(df.values.flat)

# Convert each column of strings to its integer codes.
df = df.apply(le.transform)
This code is from https://stackoverflow.com/a/39419342/2179021.
The code works very well but is slow when df is large. I timed each step and the result was surprising to me.
- pd.read_csv takes about 40 seconds.
- le.fit(df.values.flat) takes about 30 seconds.
- df = df.apply(le.transform) takes about 250 seconds.
Is there any way to speed up this last step? It feels like it should be the fastest step of them all!
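For concreteness, the slow step applies le.transform once per column through df.apply. One variant I can sketch (just an illustration, not something I have timed) is to transform the flattened values in a single call and reshape back, assuming df still holds the original strings and le has been fit as above:

# Sketch: one transform call over all values, then reshape back to two columns.
codes = le.transform(df.values.ravel())
df_encoded = pd.DataFrame(codes.reshape(df.shape), columns=df.columns, index=df.index)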
Here are some more timings for the recoding step, run on a computer with 4 GB of RAM.
The answer below by maxymoo is fast but doesn't give the right answer. Taking the example csv from the top of the question, it translates it to:
0 1
0 4 6
1 0 4
2 2 5
3 6 3
4 3 5
5 5 4
6 1 1
7 3 2
8 5 0
9 3 4
Notice that 'd' is mapped to 3 in the first column but 2 in the second.
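To make the requirement concrete, here is a small illustrative check (my own helper; mapping_is_consistent is not from either answer): a correct recoding must assign every string the same integer in whichever column it appears.

def mapping_is_consistent(strings, codes):
    # Pair each string with the code it received, across both columns, and
    # require that every distinct string was paired with exactly one code.
    pairs = pd.DataFrame({"string": strings.values.ravel(),
                          "code": codes.values.ravel()})
    return bool((pairs.groupby("string")["code"].nunique() == 1).all())

By this check, the output above fails, because 'd' is paired with both 3 and 2.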
I tried the solution from https://stackoverflow.com/a/39356398/2179021 and got the following.
df = pd.DataFrame({'ID_0':np.random.randint(0,1000,1000000), 'ID_1':np.random.randint(0,1000,1000000)}).astype(str)
df.info()
memory usage: 7.6MB
%timeit x = (df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack())
1 loops, best of 3: 1.7 s per loop
Then I increased the dataframe size by a factor of 10.
df = pd.DataFrame({'ID_0':np.random.randint(0,1000,10000000), 'ID_1':np.random.randint(0,1000,10000000)}).astype(str)
df.info()
memory usage: 76.3+ MB
%timeit x = (df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack())
MemoryError Traceback (most recent call last)
This method appears to use so much RAM translating this relatively small dataframe that it fails with a MemoryError.
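One middle-ground idea I can sketch (untested at this scale, so I don't know whether it actually avoids the memory problem): build the shared vocabulary once with pd.unique instead of stacking twice, then encode each column against that same category list. The array cats would itself be the stored mapping, with code i meaning cats[i].

# Shared vocabulary over both columns, in order of first appearance.
cats = pd.unique(df.values.ravel())
# Encode each column against the same categories; codes are positions in cats.
encoded = pd.DataFrame({c: pd.Categorical(df[c], categories=cats).codes for c in df.columns},
                       index=df.index)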
I also timed LabelEncoder with the larger dataset of 10 million rows. It ran without crashing, but the fit line alone took 50 seconds and the df.apply(le.transform) step took about 80 seconds.
How can I:
- Get something with roughly the speed of maxymoo's answer and roughly the memory usage of LabelEncoder, but that gives a correct, column-consistent encoding when the dataframe has two columns?
- Store the mapping so that I can reuse it on different data, the way LabelEncoder lets me do? (A sketch of what I mean follows below.)
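On the second point, here is a sketch of what I mean by a reusable mapping, assuming it is kept as the encoder's classes_ array (the file name and new_df are just placeholders):

import numpy as np

# Persist the fitted vocabulary so the same string -> integer mapping can be
# applied to other data later.
np.save("classes.npy", le.classes_)

# Later: rebuild an encoder around the stored vocabulary and reuse it.
le_saved = LabelEncoder()
le_saved.classes_ = np.load("classes.npy", allow_pickle=True)
new_codes = new_df.apply(le_saved.transform)  # new_df: another two-column frame of strings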