
I have an SPSS (.sav) file with over 90,000 columns and around 1800 rows. Previously I have used the code below (taken from this answer), which has worked well.

import savReaderWriter as spss
import pandas as pd

raw_data = spss.SavReader('largefile.sav', returnHeader=True)  # first record returned is the header
raw_data_list = list(raw_data)                                 # read all records into memory
data = pd.DataFrame(raw_data_list)
data = data.rename(columns=data.loc[0]).iloc[1:]               # promote the header row to column names

However, now some of the columns contain special characters (including Chinese characters and accented characters). According to the documentation, passing ioUtf8=True to SavReader should achieve what I'm aiming for. So I do the following:

raw_data = spss.SavReader('largefile.sav', returnHeader=True, ioUtf8=True)  # read in unicode mode
raw_data_list = list(raw_data)
data = pd.DataFrame(raw_data_list)
data = data.rename(columns=data.loc[0]).iloc[1:]

The first line runs fine, but the second line (the list(raw_data) call) raises the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 6: invalid continuation byte

How can I get around the problem?

OD1995
  • Something similar happens to me when my non-Latin characters are truncated. Are you sure your texts are stored at full length? Try loading parts of your data to see where it breaks (which row and column). – horace_vr Oct 19 '18 at 11:10
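(A minimal sketch of horace_vr's suggestion, assuming the same savReaderWriter setup as in the question: step through the records one at a time and report the index where decoding first breaks.)

import savReaderWriter as spss

with spss.SavReader('largefile.sav', returnHeader=True, ioUtf8=True) as reader:
    records = iter(reader)
    row = 0
    while True:
        try:
            next(records)          # decoding happens during iteration
        except StopIteration:
            break                  # end of file reached cleanly
        except UnicodeDecodeError as err:
            print('first undecodable record is row %d: %s' % (row, err))
            break
        row += 1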

1 Answer


It looks like your dataset contains characters that can't be decoded as UTF-8, namely an "à" encoded as latin-1:

c = '\xe0'  # the single byte that encodes 'à' in latin-1

print c.decode('utf-8')
>>> UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 0: unexpected end of data

print c.decode('latin-1')
>>> à
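If you are not sure which legacy encoding the stray bytes use, the third-party chardet package can make a guess (a sketch, assuming chardet is installed via pip install chardet; it is not part of the question's setup):

import chardet

# longer samples give more reliable guesses than a single byte
sample = b'\xe0 plus more bytes taken from the offending column'
print(chardet.detect(sample))
# -> something like {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}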

You could try saving your dataset in Unicode format, in case it is not Unicode already (make a backup before you do this, just in case). Try the following: open SPSS without any data open and type

set unicode on. 

Then open your dataset and save it; it should now be in Unicode format. Now try running your import code again.
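To check beforehand whether the file is already in Unicode mode, you can run

show unicode.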

*** Update

You could try reading your file row by row and handling the decoding errors as they come up:

from savReaderWriter import SavReader
import pandas as pd

rawdata = []
# open without ioUtf8 so the reader returns raw byte strings instead of
# trying (and failing) to decode them itself; decode each cell below,
# falling back to latin-1 for values that are not valid UTF-8
with SavReader('largefile.sav', returnHeader=True) as reader:
    for record in reader:
        row = []
        for value in record:
            if isinstance(value, bytes):
                try:
                    value = value.decode('utf-8')
                except UnicodeDecodeError:
                    value = value.decode('latin-1')
            row.append(value)
        rawdata.append(row)

data = pd.DataFrame(rawdata)
data = data.rename(columns=data.loc[0]).iloc[1:]

Because you have Chinese characters as well, this may not be enough on its own: latin-1 decodes any byte sequence without error, so Chinese text stored in another encoding (e.g. GBK) would come out garbled rather than raising an exception. In that case, add the right encoding to the fallback chain, as in the sketch below.
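For example (a sketch only: decode_cell is a hypothetical helper, and 'gbk' is just one common guess for Chinese text; substitute whatever encoding your data actually uses). You would call value = decode_cell(value) in place of the inner try/except above:

def decode_cell(value, encodings=('utf-8', 'gbk', 'latin-1')):
    # try each encoding in turn; latin-1 maps every possible byte,
    # so it always succeeds and belongs last as the catch-all
    if not isinstance(value, bytes):
        return value  # numeric cells need no decoding
    for enc in encodings:
        try:
            return value.decode(enc)
        except UnicodeDecodeError:
            continue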

ragamuffin
  • Thanks for your suggestion, but I still get the same error. I tested whether the dataset was already in Unicode format using `SHOW UNICODE` before doing `set unicode on.`, and it was. Do you have any suggestions for how I can dig deeper? – OD1995 Oct 19 '18 at 14:35
  • I updated my answer. If that doesn't work, I'm out of ideas – ragamuffin Oct 22 '18 at 06:05
  • Thanks for your updated suggestion, but now I'm getting a `UnicodeDecodeError` on the `for record in reader` line, screenshot [here](https://imgur.com/FynaQzO). This seems to go against what is suggested in the [documentation](https://pythonhosted.org/savReaderWriter/index.html) under the **Reading a file in unicode mode (default in SPSS v21 and up)** heading, so I will try to log an issue somewhere – OD1995 Oct 22 '18 at 09:40