I am trying to read .dta files with pandas:

import pandas as pd
my_data = pd.read_stata('filename', encoding='utf-8')

The error message is:

ValueError: Unknown encoding. Only latin-1 and ascii supported.

Other encodings, such as gb18030 or gb2312 for handling Chinese characters, did not work either. If I remove the encoding parameter, the file loads, but the text in the DataFrame is garbled.

Jiahui Zhang

2 Answers

Read the data with the default encoding, then re-decode the strings into the expected encoding. Suppose the column with garbled text is column1:

import pandas as pd
dta = pd.read_stata('filename.dta')
# Encode back to the raw bytes with latin-1, then decode those bytes as GB18030
print(dta['column1'][0].encode('latin-1').decode('gb18030'))

The printed result shows the correct Chinese characters. Decoding with gb2312 also works.
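
To repair every value rather than a single cell, the same round trip can be wrapped in a small helper. This is only a minimal sketch and not part of pandas: the name `redecode` and its parameters are hypothetical, and it assumes the strings were decoded as latin-1 while the underlying bytes are really GB18030.

```python
def redecode(value, source="latin-1", target="gb18030"):
    """Re-decode text that was read as `source` but is really `target`.

    Hypothetical helper for illustration; not a pandas function.
    """
    # Leave non-string values (numbers, None, NaN) untouched.
    if not isinstance(value, str):
        return value
    try:
        # Recover the raw bytes, then decode them with the intended codec.
        return value.encode(source).decode(target)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The value was not garbled in the assumed way; return it unchanged.
        return value


# Simulate the problem: GB18030 bytes misread as latin-1.
garbled = "中文".encode("gb18030").decode("latin-1")
print(redecode(garbled))
```

With the DataFrame from the snippet above, the helper could be applied to a whole column with `dta['column1'] = dta['column1'].map(redecode)`.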

Jiahui Zhang

Looking at the source code of pandas (version 0.22.0), the supported encodings for read_stata are ('ascii', 'us-ascii', 'latin-1', 'latin_1', 'iso-8859-1', 'iso8859-1', '8859', 'cp819', 'latin', 'latin1', 'L1'). In that version, the encoding argument must be one of these names; anything else, such as utf-8 or gb18030, raises the ValueError shown in the question.
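
The effect of such a whitelist can be illustrated with a short sketch. It is not pandas's actual code: the function name `check_stata_encoding` is hypothetical, and it assumes the check is a plain membership test against the tuple quoted above.

```python
# Tuple copied from the answer; the names below are hypothetical.
SUPPORTED_ENCODINGS = ('ascii', 'us-ascii', 'latin-1', 'latin_1',
                       'iso-8859-1', 'iso8859-1', '8859', 'cp819',
                       'latin', 'latin1', 'L1')


def check_stata_encoding(encoding):
    """Raise ValueError if `encoding` is given but not in the whitelist."""
    if encoding is not None and encoding not in SUPPORTED_ENCODINGS:
        raise ValueError("Unknown encoding. Only latin-1 and ascii supported.")
    return encoding


# Try the encodings mentioned in the question.
for name in ("latin-1", "utf-8", "gb18030"):
    try:
        check_stata_encoding(name)
        print(name, "accepted")
    except ValueError as exc:
        print(name, "rejected:", exc)
```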

Xiaodong