4

I'm in the process of writing a data checker to review spss files and need to programmatically handle different checks. The first step is to access an spss file, convert it to a pandas dataframe and run my checks from there. The only way I've found to do this is through RPY2. I know very little R unfortunately and can't get either solution below to work. Any help/literature would be much appreciated.

I've pulled some stuff from other posts and created this:

Using RPY2

from rpy2.robjects import pandas2ri
from rpy2.robjects import r
from pathlib import Path
import pyreadstat
pandas2ri.activate()

w = r('foreign::read.spss("%s", to.data.frame=TRUE)' % filename)
df = pandas2ri.ri2py(w)
df.head()
w.head()

Error:

rpy2.rinterface_lib.embedded.RRuntimeError: Error in foreign::read.spss("path to test.sav",  : 
  error reading system-file header

Using pyreadstat (this gives me the columns, but errors out when I attempt to get the underlying data)

    meta = pyreadstat.read_sav(filename, metadataonly=True)
    cols = [x for x in meta[0]]
    df, meta = pyreadstat.read_sav(filename, usecols=cols)
    print(df)

Error:

pyreadstat._readstat_parser.PyreadstatError: STRING type with value 4/23/19 17:50 with date type

UPDATE:

Using haven now but still getting errors:

rdf = r(f'haven::read_sav("{filename}")')

Error:

ValueError: Invalid value NaN (not a number)

Community
  • 1
  • 1
grigs
  • 1,140
  • 3
  • 15
  • 28
  • For pyreadstat please submit an issue in the github repo WITH an example file. Those formatting errors you are describing are known and just need an example file for the C library to learn how to parse them correctly. – Otto Fajardo Sep 28 '19 at 08:29

4 Answers4

3

I did it using pyreadstat, your second option:

df, metadata = pyreadstat.read_sav("path to file", metadataonly=True)

This returns an empty DF (only column´s names) and all the metadata. With metadata.variable_value_labels you get a dictionary with the values of the variables.

df, metadata = pyreadstat.read_sav("path to file", apply_value_formats=True) 

This will return the DF with all the values already replaced.

This may help https://ofajardo.github.io/pyreadstat_documentation/_build/html/index.html

Code Enjoyer
  • 681
  • 4
  • 18
DiegoC
  • 31
  • 1
3

To build off/update DiegoC's answer on pyreadstat, you can use pd.read_spss if you have pyreadstat installed. So, it's as easy as:

df = pd.read_spss("path_to_sav_file.sav")

Again, you need pyreadstat to use pd.read_spss so if an error pops up telling you to install pyreadstat, go ahead and do what it tells you. For the ultra noobers:

$ pip install pyreadstat

or

$ conda install pyreadstat
saetch_g
  • 1,427
  • 10
  • 10
2

Rather you can use scipy.io.readsav library to convert the .sav file to a dictionary

scipy.io.readsav(file_name, idict=None, python_dict=False,
 uncompressed_file_name=None, verbose=False)

Then the dictionary can be easily converted into a pandas dataframe.

horace_vr
  • 3,026
  • 6
  • 26
  • 48
2

I would use the SavReaderWriter package:

pip install savReaderWriter --upgrade

import pandas
import savReaderWriter

with savReaderWriter.SavReader('mydata.sav', ioUtf8 = True) as reader:
    df = pandas.DataFrame(reader.all(), columns = [s.decode('CP1252') for s in reader.header])
df.head()
Anthop
  • 138
  • 8