
I am fairly new to Python and I'm trying to use the crosstab function on a very large file (around 19,500,000 rows, 30 columns).

Here is an example of what I'm trying to do.

This is a small excerpt of my data (the full file has the 19.5 million rows and 30 columns):

>>> df

    COMMUNE  DCRAN   IPONDI ACHLR 
 1    01001  01001 3.901356     3 
 2    01001  01001 3.901356     3 
 3    01001  01001 3.901356     5 
 4    01001  01001 3.901356     4 
 5    01001  01001 3.901356     4 
 6    01001  01001 3.901356     5 
 7    01002  01001 3.901356     1 
 8    01002  01001 3.901356     4 
 9    01002  01001 3.901356     4 
 10   01002  01001 3.901356     4 

and this is the result I expect:

    COMMUNE DCRAN   IPONDI ACHLR_1  ACHLR_2  ACHLR_3  ACHLR_4  ACHLR_5 
1    01001  01001 3.901356       0        0        2        2        2
2    01002  01001 3.901356       1        0        0        3        0

I have been able to make it work for 500,000 rows, but it crashes on the entire file. I suspect my method is not the most memory-efficient and that I could use another function, or use crosstab better, to solve this, but I'm stuck.

Basically, I melted the DataFrame before using the pandas crosstab function to aggregate the rows, reproducing what I was doing in R. Unlike in R, I can melt the entire file (which produces a huge number of rows), but it then crashes at the crosstab step.

melted = pd.melt(essai_df, id_vars=['COMMUNE', 'DCRAN', 'IPONDI'],
                 value_vars='ACHLR')

aggregated = pd.crosstab([melted.COMMUNE, melted.IPONDI],
                         columns=[melted.variable, melted.value])

Is there a way to be more memory-efficient, for instance by skipping the first step? I thought about something like

 df = pd.crosstab([df.COMMUNE, df.DCRAN, df.IPONDI], df.ACHLR)

 df

                       ACHLR    1   3   4   5
 COMMUNE    DCRAN     IPONDI                
 01001      01001   3.901356    0   2   2   2
 01002      01001   3.901356    1   0   3   0
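For what it's worth, here is a minimal sketch of another approach I'm considering, on a toy frame mimicking the excerpt above (the data is illustrative): `groupby().size().unstack()` should produce the same counts as crosstab without building the melted intermediate:

```python
import pandas as pd

# Toy frame with the same structure as the excerpt above.
df = pd.DataFrame({
    'COMMUNE': ['01001'] * 6 + ['01002'] * 4,
    'DCRAN':   ['01001'] * 10,
    'IPONDI':  [3.901356] * 10,
    'ACHLR':   [3, 3, 5, 4, 4, 5, 1, 4, 4, 4],
})

# Count occurrences of each ACHLR value per (COMMUNE, DCRAN, IPONDI)
# group, then pivot ACHLR into columns -- same result as crosstab,
# but with no melted intermediate frame.
counts = (df.groupby(['COMMUNE', 'DCRAN', 'IPONDI', 'ACHLR'])
            .size()
            .unstack('ACHLR', fill_value=0))
print(counts)
```

I don't know yet whether this is lighter on memory at 19.5 million rows, but it avoids duplicating the id columns the way melt does.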

But if I add more variables than just 'ACHLR' — for instance 'AGEMEN8' — it returns:

                    ACHLR             1               3              4               5
                    AGEMEN8     25  40  55  65  25  40  55  65  25  40  55  65  25  40  55  65
 COMMUNE    DCRAN   IPONDI
 01001      01001   3.901356    0   0   0   0   0   0   2   0   0   2   0   0   1   0   0   1
 01002      01001   3.901356    1   0   0   0   0   0   0   0   0   3   0   0   1   0   0   1

What are your suggestions to solve this?

[Updates]

I have tried JohnE's suggestion, but it didn't make a significant difference.

Additionally, I have compared the time taken by a direct crosstab versus melting and then crosstab-ing, on 2,000,000 rows, and the second was much faster.

Hugo
    I'm not sure about crosstab specifically but one thing that can shrink data is to store variables as categoricals ('commune','dcran') and store integers in smallest sizes (int8 instead of default int64), where possible. – JohnE Jul 28 '15 at 15:51
  • thanks JohnE, I'll try that. – Hugo Jul 28 '15 at 16:01
  • Concerning memory problems: http://hansmelberg.blogspot.fr/2013/06/reducing-memory-problems-with-large.html – Hugo Jul 29 '15 at 08:42
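For reference, here is a minimal sketch of what I understand JohnE's suggestion to be, on toy data with the question's column names (values are illustrative): store the repetitive code columns as categoricals and downcast the integer column to the smallest size that fits:

```python
import pandas as pd

# Toy frame; column names from the question, values illustrative.
df = pd.DataFrame({
    'COMMUNE': ['01001', '01001', '01002', '01002'],
    'DCRAN':   ['01001', '01001', '01001', '01001'],
    'ACHLR':   [3, 5, 1, 4],
})

# Repeated string codes take far less memory as categoricals, and
# small integers fit in int8 instead of the default int64.
df['COMMUNE'] = df['COMMUNE'].astype('category')
df['DCRAN'] = df['DCRAN'].astype('category')
df['ACHLR'] = pd.to_numeric(df['ACHLR'], downcast='integer')

print(df.dtypes)
print(df.memory_usage(deep=True))
```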
