I'm writing a data checker to review SPSS files and need to run a series of checks programmatically. The first step is to read an SPSS (.sav) file into a pandas DataFrame and run my checks from there. The only way I've found to do this is through rpy2. Unfortunately I know very little R, and I can't get any of the approaches below to work. Any help or pointers to documentation would be much appreciated.
I've pulled some stuff from other posts and created this:
Using rpy2:
from rpy2.robjects import pandas2ri
from rpy2.robjects import r
from pathlib import Path
import pyreadstat
pandas2ri.activate()
w = r('foreign::read.spss("%s", to.data.frame=TRUE)' % filename)
df = pandas2ri.ri2py(w)
df.head()
w.head()
Error:
rpy2.rinterface_lib.embedded.RRuntimeError: Error in foreign::read.spss("path to test.sav", :
error reading system-file header
Using pyreadstat (this gives me the column names, but raises an error when I try to read the underlying data):
meta = pyreadstat.read_sav(filename, metadataonly=True)
cols = [x for x in meta[0]]
df, meta = pyreadstat.read_sav(filename, usecols=cols)
print(df)
Error:
pyreadstat._readstat_parser.PyreadstatError: STRING type with value 4/23/19 17:50 with date type
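The value in that message, `4/23/19 17:50`, looks like a month/day/year timestamp stored as text in a column that SPSS has marked as a date type. Below is a minimal sketch of how such a value could be parsed by hand once the column is available as plain strings (for instance through pyreadstat's `disable_datetime_conversion=True` option, which I have not verified against this file). The helper name `parse_text_timestamp` and the `%m/%d/%y %H:%M` format are assumptions based only on the single value shown in the error:

```python
from datetime import datetime
from typing import Optional

# Assumed format, inferred from the single value in the error message
# ("4/23/19 17:50"): month/day/two-digit year, 24-hour time.
ASSUMED_FORMAT = "%m/%d/%y %H:%M"


def parse_text_timestamp(value: str, fmt: str = ASSUMED_FORMAT) -> Optional[datetime]:
    """Hypothetical helper: parse a text timestamp, or return None if it does not match."""
    try:
        return datetime.strptime(value.strip(), fmt)
    except (ValueError, AttributeError):
        # ValueError: the text does not match fmt.
        # AttributeError: the value is not a string (e.g. a missing value).
        return None


print(parse_text_timestamp("4/23/19 17:50"))
print(parse_text_timestamp("not a date"))
```

Values that come back as `None` would be the ones worth flagging in the data checker, since they do not match the assumed format.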
UPDATE:
I'm now calling haven through rpy2 instead of foreign, but I still get an error:
rdf = r(f'haven::read_sav("{filename}")')
Error:
ValueError: Invalid value NaN (not a number)