Issue: I received a CSV file (with delimiter ~) from a third party. It has about 4000 records and 150 columns with real column names such as FirstName~LastName~OrderID~City~...... But when the file is loaded into a pandas dataframe df and I run print(list(df.columns)), it displays the column names as follows (simplified for brevity):

['ÿþA', 'Unnamed: 1', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4',,,,,'Unnamed: 49']

Question: What might I be doing wrong, and how can I fix this so that df shows the real column names? I'm using the latest version of Python. I've seen some relevant articles such as this one, but they all deal with a single column.

Remark: It's a UTF-16 LE file with a BOM. I discovered the issue when my code referenced a column as df['OrderID'] and I got the well-known KeyError, which means you are referencing a column that does not exist.

Code:

import pandas as pd

df = pd.read_csv('/dbfs/FileStore/tables/MyDataFile.txt', sep='~', low_memory=False, quotechar='"', header='infer', encoding='cp1252')

print(df['OrderID'])

MyDataFile.txt sample:

FirstName~LastName~OrderID~City~.....
Kim~Doe~1234~New York~...............
Bob~Mason~456~Seattle~...............
..................
nam
  • Are you sure you have the right character encoding? See also https://stackoverflow.com/questions/25063084/what-is-this-%C3%BF%C3%BEa – Nick ODell Jun 17 '22 at 20:43
  • @NickODell. Good question. I just updated my **Remark** section. – nam Jun 17 '22 at 20:57
  • @NickODell Your comment and the link you provided resolved the issue (thank you). You may want to write your thoughts into a `Response` and I'll mark it as an `Answer`. Users benefit more when they see a response that resolved the issue. – nam Jun 17 '22 at 21:11

2 Answers

Are you sure you have the right encoding?

I see your data file starts with ÿþ when read in a cp1252 encoding. That looks like a UTF-16 byte order mark (BOM). Wikipedia has a table of these, and if you look at that table, you'll see it matches UTF-16 LE (little endian).
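
To make the BOM check concrete, here is a minimal sketch using only the standard library. The helper name sniff_bom and the sample bytes are hypothetical, made up for illustration; the BOM constants come from Python's codecs module. It shows why a UTF-16 LE BOM turns into ÿþ under cp1252, and how the first bytes of a file can suggest an encoding name.

```python
import codecs

# BOM signatures, longest first so that UTF-32 LE (FF FE 00 00)
# is not mistaken for UTF-16 LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(head: bytes):
    """Return a Python codec name suggested by a leading BOM, or None.

    Hypothetical helper: it only recognizes files that start with a BOM.
    """
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None


# Hypothetical sample: the first bytes of a UTF-16 LE file with a BOM.
sample = codecs.BOM_UTF16_LE + "FirstName~LastName".encode("utf-16-le")

# Decoding the two BOM bytes as cp1252 gives the characters from the question.
mojibake = sample[:2].decode("cp1252")

print(sniff_bom(sample))
```

On a real file, the first few bytes could be read with open(path, 'rb').read(4) and passed to the helper. A file without a BOM returns None, in which case the encoding has to be determined some other way.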

Once you figure out the right encoding, you can tell pandas which encoding to use by calling pd.read_csv(..., encoding='...'). To figure out what to put in the encoding field, you can consult this table. If you want UTF-16 LE, that's 'utf_16_le'.
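
As a rough illustration, the sketch below builds a small stand-in for the file and reads it back with an explicit encoding. The sample contents and temporary path are assumptions, not the real MyDataFile.txt. It also uses 'utf-16' rather than 'utf_16_le': Python's 'utf-16' codec reads the BOM to pick the byte order and consumes it, whereas the explicit-endian codecs leave the BOM in the decoded text, so whether the first column name comes out clean with 'utf_16_le' may depend on the parser.

```python
import codecs
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for MyDataFile.txt: UTF-16 LE text preceded by a BOM.
text = (
    "FirstName~LastName~OrderID~City\n"
    "Kim~Doe~1234~New York\n"
    "Bob~Mason~456~Seattle\n"
)
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF16_LE + text.encode("utf-16-le"))

# 'utf-16' lets the codec detect the byte order from the BOM and strip it.
df = pd.read_csv(path, sep="~", encoding="utf-16")

print(list(df.columns))
print(df["OrderID"].tolist())
```

With the encoding set this way, df['OrderID'] should resolve without a KeyError, because the header row is decoded into the real column names.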

More information:

Pandas docs on read_csv

What is this "ÿþA"? This is the same question, but about R instead of Python.

Nick ODell
  • I have a similar question posted [here](https://stackoverflow.com/q/72665825/1232087) where I am using `Apache Spark` instead of `pandas` – nam Jun 18 '22 at 00:25

As I understand it, the column names you expect don't exist in the dataframe, so you can't use them directly. One way out is to rename the columns.

Try using:

df.rename(columns={'Unnamed: 0':'new name0','Unnamed: 1':'new name1'}, inplace=True)

  • If not this, can you share a part of the CSV for reference, or you can make it up – Arunbh Yashaswi Jun 17 '22 at 20:46
  • I just added a sample. File has `150` columns and about `4000` records. Also updated my Remark section. – nam Jun 17 '22 at 21:01
  • @nam The file is not a CSV in my understanding, as it's not delimited by commas. Just try changing the delimiter to a comma; it will work, in my opinion. You can test it on a piece of code. – Arunbh Yashaswi Jun 17 '22 at 21:03