I am fairly new to Python and I'm trying to use the crosstab function on a very large file (around 19,500,000 rows, 30 columns).
Here is an example of what I'm trying to do.
This is the data I have, reduced from the full 19.5 million rows and 30 columns:
>>> df
COMMUNE DCRAN IPONDI ACHLR
1 01001 01001 3.901356 3
2 01001 01001 3.901356 3
3 01001 01001 3.901356 5
4 01001 01001 3.901356 4
5 01001 01001 3.901356 4
6 01001 01001 3.901356 5
7 01002 01001 3.901356 1
8 01002 01001 3.901356 4
9 01002 01001 3.901356 4
10 01002 01001 3.901356 4
and this is the result I expect:
COMMUNE DCRAN IPONDI ACHLR_1 ACHLR_2 ACHLR_3 ACHLR_4 ACHLR_5
1 01001 01001 3.901356 0 0 2 2 2
2 01002 01001 3.901356 1 0 0 3 0
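For what it's worth, on the small sample above the expected output can be reproduced without melt at all, using groupby / value_counts / unstack. This is only a sketch (the sample values are copied from the example data):

```python
import pandas as pd

# Small sample mirroring the question's data.
df = pd.DataFrame({
    "COMMUNE": ["01001"] * 6 + ["01002"] * 4,
    "DCRAN":   ["01001"] * 10,
    "IPONDI":  [3.901356] * 10,
    "ACHLR":   [3, 3, 5, 4, 4, 5, 1, 4, 4, 4],
})

# Count occurrences of each ACHLR value per (COMMUNE, DCRAN, IPONDI) group,
# then pivot those counts into one column per ACHLR value.
counts = (df.groupby(["COMMUNE", "DCRAN", "IPONDI"])["ACHLR"]
            .value_counts()
            .unstack(fill_value=0)
            .add_prefix("ACHLR_")
            .reset_index())
print(counts)
```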
I have been able to make it work on 500,000 rows, but it crashes when I run it on the entire file. I suspect my method is not memory-efficient and that I could use another function, or use crosstab better, but I'm stuck.
Basically, I melt the DataFrame before using the crosstab function in pandas to aggregate the rows, reproducing what I was doing in R. Unlike in R, I can melt the entire file (which produces a loooot of rows), but it crashes at the crosstab step.
melted = pd.melt(essai_df, id_vars=['COMMUNE', 'DCRAN', 'IPONDI'],
                 value_vars='ACHLR')
aggregated = pd.crosstab([melted.COMMUNE, melted.IPONDI],
                         columns=[melted.variable, melted.value])
Is there a way to be more memory-efficient, for instance by skipping the melt step entirely? I thought about something like
df = pd.crosstab([df.COMMUNE, df.DCRAN, df.IPONDI], df.ACHLR)
df
ACHLR 1 3 4 5
COMMUNE DCRAN IPONDI
01001 01001 3.901356 0 2 2 2
01002 01001 3.901356 1 0 3 0
But if I add more variables than just 'ACHLR' (for instance 'AGEMEN8'), it returns:
ACHLR 1 3 4 5
AGEMEN8 25 40 55 65 25 40 55 65 25 40 55 65 25 40 55 65
COMMUNE DCRAN IPONDI
01001 01001 3.901356 0 0 0 0 0 0 2 0 0 2 0 0 1 0 0 1
01002 01001 3.901356 1 0 0 0 0 0 0 0 0 3 0 0 1 0 0 1
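That nested header is crosstab building the full ACHLR x AGEMEN8 cross product. If independent counts per variable are what's wanted instead, one option is to crosstab each variable separately against the same index and concatenate the results side by side. A sketch only; the AGEMEN8 values below are invented for illustration:

```python
import pandas as pd

# Sample with a second categorical column (AGEMEN8 values are made up).
df = pd.DataFrame({
    "COMMUNE": ["01001", "01001", "01002", "01002"],
    "DCRAN":   ["01001"] * 4,
    "IPONDI":  [3.901356] * 4,
    "ACHLR":   [3, 5, 1, 4],
    "AGEMEN8": [40, 55, 25, 40],
})

index = [df.COMMUNE, df.DCRAN, df.IPONDI]
# One crosstab per variable, against the same grouping index.
parts = {var: pd.crosstab(index, df[var]) for var in ["ACHLR", "AGEMEN8"]}
# keys from the dict label each block of columns with its variable name,
# giving columns like ('ACHLR', 3) and ('AGEMEN8', 40) instead of a cross product.
wide = pd.concat(parts, axis=1)
print(wide)
```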
What are your suggestions to solve this?
[Updates]
I have tried JohnE's suggestion, but it didn't make a significant difference.
Additionally, on 2,000,000 lines I compared doing the crosstab directly against melting first and then cross-tabulating, and the second was much, much faster.