Encoding discrepancy with Iris Dataset

Question

After downloading the dataset as iris.data, I renamed it to iris.data.txt to try to get around this error, which has been reported on SO before:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 8: invalid continuation byte

After reading up, I tried this:

dataset = pd.read_csv('iris.data.txt', header=None, names=names, encoding="ISO-8859-1")

This partly solved the error but some rows were still garbage.

Then I opened the file in Sublime, saved it with UTF-8 encoding, and tried:

dataset = pd.read_csv('iris.data.txt', header=None, names=names, encoding="utf-8")

But this doesn't solve the problem either. I'm running Python 3 on macOS. How can I get the data to load correctly?

[EDIT]: The file type is shown as: Web archive. In Spyder, the file appears as iris.data.webarchive

If I try dataset = pd.read_csv('iris.data.webarchive', header=None), it gives this traceback:

ParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 5

If I try dataset = pd.read_csv('iris.data', header=None), it gives FileNotFoundError: File b'iris.data' does not exist

Strange, a simple pd.read_csv('iris.data', header=None) works for me... — iamklaus, Sep 01 '18 at 03:57
How is the data separated? Try giving the `sep` argument to `read_csv` — Sreeram TP, Sep 01 '18 at 07:59
@SreeramTP: It's a popular dataset. I'm not sure if we need the sep here — srkdb, Sep 01 '18 at 16:14

score 0 · Answer 1 · answered Sep 03 '18 at 16:47

I figured out my rookie mistake: I had to save the page as 'source' instead of 'webarchive' (which is the default format when saving on a Mac).
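For anyone who already has the .webarchive file and would rather not download it again, here is a minimal sketch of recovering the CSV from it. It assumes the file is a Safari web archive, which is a binary property list that begins with the 8-byte header b'bplist00' (that would be consistent with the decode error showing up at position 8) and keeps the saved page body under WebMainResource -> WebResourceData. The column names and the helper names is_webarchive and load_iris are made up for illustration; the question never shows what `names` contains.

```python
import io
import plistlib

import pandas as pd

# Assumed column names; the question passes `names` without defining it.
NAMES = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]


def is_webarchive(raw: bytes) -> bool:
    """Return True if the bytes look like a binary property list."""
    return raw.startswith(b"bplist00")


def load_iris(raw: bytes) -> pd.DataFrame:
    """Parse iris CSV bytes, unwrapping a Safari web archive if needed."""
    if is_webarchive(raw):
        archive = plistlib.loads(raw)
        # The original page body is stored as raw bytes inside the archive.
        raw = archive["WebMainResource"]["WebResourceData"]
    return pd.read_csv(io.BytesIO(raw), header=None, names=NAMES)
```

To use it, read the file in binary mode, e.g. open('iris.data.webarchive', 'rb'), and pass the bytes to load_iris. A plain-text iris.data goes through the same function unchanged, so it also works once the page is saved as source.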

srkdb
