Converting .sav file to pandas dataframe

Question

I'm in the process of writing a data checker to review spss files and need to programmatically handle different checks. The first step is to access an spss file, convert it to a pandas dataframe and run my checks from there. The only way I've found to do this is through RPY2. I know very little R unfortunately and can't get either solution below to work. Any help/literature would be much appreciated.

I've pulled some stuff from other posts and created this:

Using RPY2

from rpy2.robjects import pandas2ri
from rpy2.robjects import r
from pathlib import Path
import pyreadstat
pandas2ri.activate()

w = r('foreign::read.spss("%s", to.data.frame=TRUE)' % filename)
df = pandas2ri.ri2py(w)
df.head()
w.head()

Error:

rpy2.rinterface_lib.embedded.RRuntimeError: Error in foreign::read.spss("path to test.sav",  : 
  error reading system-file header

Using pyreadstat (this gives me the columns, but errors out when I attempt to get the underlying data)

    meta = pyreadstat.read_sav(filename, metadataonly=True)
    cols = [x for x in meta[0]]
    df, meta = pyreadstat.read_sav(filename, usecols=cols)
    print(df)

Error:

pyreadstat._readstat_parser.PyreadstatError: STRING type with value 4/23/19 17:50 with date type

UPDATE:

Using haven now but still getting errors:

rdf = r(f'haven::read_sav("{filename}")')

Error:

ValueError: Invalid value NaN (not a number)

For pyreadstat please submit an issue in the github repo WITH an example file. Those formatting errors you are describing are known and just need an example file for the C library to learn how to parse them correctly. — Otto Fajardo, Sep 28 '19 at 08:29

score 3 · Answer 1 · edited Mar 09 '20 at 23:19

I did it using pyreadstat, your second option:

df, metadata = pyreadstat.read_sav("path to file", metadataonly=True)

This returns an empty DF (only column´s names) and all the metadata. With metadata.variable_value_labels you get a dictionary with the values of the variables.

df, metadata = pyreadstat.read_sav("path to file", apply_value_formats=True)

This will return the DF with all the values already replaced.

This may help https://ofajardo.github.io/pyreadstat_documentation/_build/html/index.html

score 3 · Answer 2 · answered Oct 25 '20 at 03:47

To build off/update DiegoC's answer on pyreadstat, you can use pd.read_spss if you have pyreadstat installed. So, it's as easy as:

df = pd.read_spss("path_to_sav_file.sav")

Again, you need pyreadstat to use pd.read_spss so if an error pops up telling you to install pyreadstat, go ahead and do what it tells you. For the ultra noobers:

$ pip install pyreadstat

or

$ conda install pyreadstat

score 2 · Answer 3 · edited May 09 '19 at 05:29

2

Rather you can use scipy.io.readsav library to convert the .sav file to a dictionary

scipy.io.readsav(file_name, idict=None, python_dict=False,
 uncompressed_file_name=None, verbose=False)

Then the dictionary can be easily converted into a pandas dataframe.

edited May 09 '19 at 05:29

horace_vr

3,026
6
26
48

answered May 08 '19 at 01:50

Protik Kumar Biswas

29
2

1

scipy.io.readsav is to read an IDL file, not a SPSS file – Otto Fajardo Jul 12 '20 at 08:53

score 2 · Answer 4 · answered Jul 30 '19 at 22:01

I would use the SavReaderWriter package:

pip install savReaderWriter --upgrade

import pandas
import savReaderWriter

with savReaderWriter.SavReader('mydata.sav', ioUtf8 = True) as reader:
    df = pandas.DataFrame(reader.all(), columns = [s.decode('CP1252') for s in reader.header])
df.head()

Converting .sav file to pandas dataframe

Using RPY2

Error:

Using pyreadstat (this gives me the columns, but errors out when I attempt to get the underlying data)

Error:

UPDATE:

Error:

4 Answers4