Encoding Error When Reading .dta Files with Chinese Characters

Question

I am trying to read .dta files with pandas:

import pandas as pd
my_data = pd.read_stata('filename', encoding='utf-8')

The error message is:

ValueError: Unknown encoding. Only latin-1 and ascii supported.

Other encodings for handling Chinese characters, such as gb18030 or gb2312, didn't work either. If I remove the encoding parameter, the DataFrame is full of garbled values.

score 2 · Accepted Answer · answered Mar 23 '18 at 12:22

Simply read the data with the default encoding, then re-decode it with the expected encoding. Suppose the column with garbled text is column1:

import pandas as pd
dta = pd.read_stata('filename.dta')
print(dta['column1'][0].encode('latin-1').decode('gb18030'))

The printed result shows normal Chinese characters. Decoding with gb2312 also works.
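To fix more than a single cell, the same encode/decode round trip can be wrapped in a small helper. This is only a sketch: the name `redecode` and its parameters are made up for illustration, and it assumes the default read decoded the file's bytes as latin-1 and that the original bytes were GB18030.

```python
def redecode(value, wrong="latin-1", right="gb18030"):
    """Undo a wrong decode: turn the text back into bytes, then decode properly.

    `redecode`, `wrong`, and `right` are hypothetical names for this sketch.
    """
    if not isinstance(value, str):
        return value  # leave numbers, NaN, None untouched
    try:
        return value.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text was never garbled, or the bytes are not valid
        # in the target encoding; return it unchanged in both cases.
        return value


# Simulate what a default read would produce: GB18030 bytes decoded as latin-1.
garbled = "中文".encode("gb18030").decode("latin-1")
fixed = redecode(garbled)
print(fixed)
```

With a DataFrame read as above, something like `dta['column1'] = dta['column1'].map(redecode)` should then repair the whole column, again assuming every string in it was garbled the same way.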

score 0 · Answer 2 · answered Mar 15 '18 at 09:47

Looking at the source code of pandas (version 0.22.0), the supported encodings for read_stata are ('ascii', 'us-ascii', 'latin-1', 'latin_1', 'iso-8859-1', 'iso8859-1', '8859', 'cp819', 'latin', 'latin1', 'L1'). So you can only choose from this list.
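As a side note, the names in that list are all aliases of just two Python codecs, which can be checked with the standard `codecs` module. The sketch below also shows why reading with latin-1 loses nothing: it maps each of the 256 byte values to its own character, so the original bytes can be recovered later and decoded with another encoding. The variable names are only for illustration.

```python
import codecs

# The encoding names quoted from the pandas 0.22.0 source above.
supported = ('ascii', 'us-ascii', 'latin-1', 'latin_1', 'iso-8859-1',
             'iso8859-1', '8859', 'cp819', 'latin', 'latin1', 'L1')

# Resolve every alias to the canonical name of the codec Python would use.
canonical = {name: codecs.lookup(name).name for name in supported}
distinct_codecs = set(canonical.values())
print(distinct_codecs)

# latin-1 round-trips every possible byte value without loss.
all_bytes = bytes(range(256))
round_trip = all_bytes.decode('latin-1').encode('latin-1')
print(round_trip == all_bytes)
```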

Xiaodong

Thanks! But what function should I use then? – Jiahui Zhang Mar 15 '18 at 16:35
Sorry, I don't know, because I don't have Stata. – Xiaodong Mar 20 '18 at 16:57
@Xiaodong `read_stata` in pandas 0.23.4 will fail with `ValueError: Unknown encoding. Only latin-1 and ascii supported.` – 0range Nov 15 '18 at 22:46
