
I export a large DataFrame (18 million observations, 5 columns) called SalesData to Stata's native file format using pandas' to_stata:

SalesData.to_stata(sales)

It works, but it is extremely slow, to the point of being unusable in production. I think I understand why: as an examination of the resulting Stata file shows, pandas assigns every string column a width of 244 characters regardless of the column's actual content, so the Stata file is needlessly huge. Running Stata's compress command on the file reduces its size by a factor of 10, without any data loss.
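For what it's worth, the waste is easy to quantify before exporting. A minimal sketch (using a small hypothetical DataFrame, not the actual SalesData) that measures the true maximum width of each string column, i.e. the width Stata actually needs instead of str244:

```python
import pandas as pd

# Hypothetical stand-in for SalesData: two short string columns.
df = pd.DataFrame({
    "region": ["EU", "US", "APAC"],
    "sku": ["A1", "B22", "C333"],
})

# Actual maximum character width per string column.
# Stata would only need str4 here, not str244.
widths = {col: int(df[col].str.len().max())
          for col in df.select_dtypes("object").columns}
print(widths)  # → {'region': 4, 'sku': 4}
```

Comparing these widths with 244 gives a rough estimate of how much smaller the file could be, which matches the ~10x reduction that compress achieves.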

I can't seem to locate any option on the to_stata method to control this behaviour.

Any suggestions? Thanks

Charles
  • The [online docs](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.stata.StataWriter.write_file.html#pandas.io.stata.StataWriter.write_file) don't say too much; it may be worth posting an [issue](https://github.com/pydata/pandas/issues) – EdChum Oct 08 '14 at 13:20
  • This was optimized very much in 0.15.0, see here: http://pandas.pydata.org/pandas-docs/version/0.15.0/whatsnew.html#whatsnew-0150-performance (the 0.15.0RC1 is now available), try with it – Jeff Oct 08 '14 at 13:26
  • Thanks Jeff, great. I will try it (need to figure out how to install 0.15.0RC1 first..) – Charles Oct 08 '14 at 13:36
  • Now that pandas 0.15.0 is out, I have tried it. Unfortunately, this does not improve things for my problem, albeit with different symptoms: the to_stata method used on the same DF as above (18 million obs, 5 columns) brings my computer to its knees as RAM usage shoots up from about 1.5 GB to the max (8 GB). The process cannot be completed as the machine stalls and needs to be rebooted. – Charles Oct 26 '14 at 13:36

0 Answers