
I have a large SPSS file (a little over 1 million records, just under 150 columns) that I want to convert to a Pandas DataFrame.

It takes a few minutes to convert the file to a list, then another couple of minutes to convert that list to a DataFrame, and then a few more minutes to set the column headers.

Are there any optimizations I'm missing?

import pandas as pd
import numpy as np
import savReaderWriter as spss

raw_data = spss.SavReader('largefile.sav', returnHeader=True)  # this is fast
raw_data_list = list(raw_data)                                 # this is slow
data = pd.DataFrame(raw_data_list)                             # this is slow
data = data.rename(columns=data.loc[0]).iloc[1:]               # setting column headers, this is slow too
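One saving that is independent of the file format: pull the header row off the reader first and pass it straight to the DataFrame constructor, so the separate rename-and-slice pass disappears. A minimal sketch, assuming SavReader is used as a context manager as the savReaderWriter docs show:

import pandas as pd
import savReaderWriter as spss

with spss.SavReader('largefile.sav', returnHeader=True) as reader:
    records = iter(reader)
    header = next(records)  # with returnHeader=True, the first yielded row is the header
    data = pd.DataFrame(list(records), columns=header)  # column names set in one step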
– bowlby
  • Converting to a list will be very slow. What is the format of the file? Can it be converted to a CSV first? Pandas is very fast at reading CSV files, and when reading a CSV you can specify which row holds the header. You can also set the param `inplace=True` to rename without returning a copy: `data.rename(columns=data.loc[0], inplace=True).iloc[1:]` – EdChum Aug 07 '14 at 11:22
  • @EdChum `inplace=True` the way you are using it is chaining and will not work, and it rarely actually saves anything. – Jeff Aug 07 '14 at 11:33
  • Converting it to CSV made a huge difference! I used PSPP for that, and the column names are set at the same time. Thanks! – bowlby Aug 07 '14 at 12:30
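For reference, once the file has been exported to CSV (e.g. with PSPP, as above), the pandas side reduces to a single call. A sketch, assuming a hypothetical largefile.csv whose first row holds the variable names:

import pandas as pd

# header=0 is the default: the first CSV row becomes the column
# names, so no separate rename pass is needed
data = pd.read_csv('largefile.csv')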

1 Answer


You can use `rawMode=True` to speed things up a bit, as in:

raw_data = spss.SavReader('largefile.sav', returnHeader=True, rawMode=True)

This way, datetime variables (if any) won't be converted to ISO strings, SPSS $sysmis values won't be converted to None, and a few other conversions are skipped.
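Note that with `rawMode=True` those conversions become your responsibility. A sketch of one way to handle system-missing values afterwards; it assumes $sysmis is stored as the lowest representable double (-sys.float_info.max), which is how SPSS encodes it, so check that against your file:

import sys
import numpy as np
import pandas as pd
import savReaderWriter as spss

with spss.SavReader('largefile.sav', returnHeader=True, rawMode=True) as reader:
    records = iter(reader)
    header = next(records)
    data = pd.DataFrame(list(records), columns=header)

# assumption: $sysmis == -sys.float_info.max in this file; replace the
# raw marker with NaN in one vectorized pass, which is cheaper than
# the reader's per-cell None conversion
data = data.replace(-sys.float_info.max, np.nan)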

– Albert-Jan