
I'm trying to use pandas to read in a delimited file. The separator is a Greek character, lowercase rho (þ).

I'm struggling to define the correct read_table parameters so that the resulting data frame is correctly formatted.

Does anyone have any experience or suggestions with this?

An example of the file is below:

TimeþUser-IDþAdvertiser-IDþOrder-IDþAd-IDþCreative-IDþCreative-VersionþCreative-Size-IDþSite-IDþPage-IDþCountry-IDþState/ProvinceþBrowser-IDþBrowser-VersionþOS-IDþDMA-IDþCity-IDþZip-CodeþSite-DataþTime-UTC-Sec
03-28-2016-00:50:03þ0þ3893600þ7786669þ298662779þ67802437þ1þ300x250þ1722397þ125754620þ68þþ30þ0.0þ501012þ0þ3711þþþ1459122603
03-28-2016-00:24:29þ0þ3893600þ7352234þ290743769þ55727503þ1þ1x1þ1602646þ117915815þ68þþ31þ0.0þ501012þ0þ3711þþþ1459121069
03-28-2016-00:13:42þ0þ3893600þ7352234þ290743769þ55727503þ1þ1x1þ1602646þ117915815þ68þþ31þ0.0þ501012þ0þ3711þþþ1459120422
03-28-2016-00:21:09þ0þ3893600þ7352234þ290743769þ55727503þ1þ1x1þ1602646þ117915815þ68þþ31þ0.0þ501012þ0þ3711þþþ1459120869

  • So you're saying that `read_table(file, sep=r'ρ')` doesn't work? Or with the additional param `encoding='utf-8'` or `encoding='utf-16'`? – EdChum Apr 22 '16 at 15:24
  • I'm on a Windows machine, which might not be helping, but I want to check that my syntax is fine first. I have tried the following: `import pandas as pd` then `data = pd.read_table('C:\Users\robin.sheridan\Documents\RCode\NetworkImpression_5684_03-28-2016',sep=r'ρ',nrows=10,encoding='utf-16')` then `print(data)` – Robin Sheridan Apr 22 '16 at 15:57
  • And variations of that, e.g. changing the encoding, etc. – Robin Sheridan Apr 22 '16 at 15:58
  • Prefix your path string with r. Backslashes are normally treated as escape characters, but with the r prefix Python creates a raw string and keeps them literally – EdChum Apr 22 '16 at 16:05
  • No luck I'm afraid - I still see the question-mark symbols where the rho should be! – Robin Sheridan Apr 22 '16 at 16:16
  • Can you edit your question with the first few rows or better to post a link to the file, thanks – EdChum Apr 22 '16 at 16:18
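
A minimal sketch of the raw-string advice from the comments above. The path here is hypothetical and is never opened; the point is only how Python treats the backslashes. In Python 3, an ordinary string literal containing `\U` (as in `C:\Users`) is generally rejected as a truncated Unicode escape, so the r prefix or doubled backslashes are needed.

```python
# Raw string: every backslash is kept literally.
raw_path = r"C:\Users\example\data.txt"  # hypothetical path, for illustration only

# Equivalent ordinary string with each backslash doubled.
escaped_path = "C:\\Users\\example\\data.txt"

print(raw_path == escaped_path)

# Without the r prefix, "\t" is a single tab character;
# with it, "\t" stays a backslash followed by the letter t.
print(len(r"\t"), len("\t"))
```

Forward slashes (`"C:/Users/example/data.txt"`) are another option that Python accepts on Windows.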

1 Answer


I think what's happening is that the C engine isn't working here. If we switch to the Python engine, which is more powerful but slower, it seems to behave. For example, with the default C engine:

>>> df = pd.read_csv("out.rsv",sep="þ")
>>> df.iloc[:,:5]
  TimeþUser-IDþAdvertiser-IDþOrder-IDþAd-IDþCreative-IDþCreative-VersionþCreative-Size-IDþSite-IDþPage-IDþCountry-IDþState/ProvinceþBrowser-IDþBrowser-VersionþOS-IDþDMA-IDþCity-IDþZip-CodeþSite-DataþTime-UTC-Sec
0  03-28-2016-00:50:03þ0þ3893600þ7786669þ29866277...                                                                                                                                                               
1  03-28-2016-00:24:29þ0þ3893600þ7352234þ29074376...                                                                                                                                                               
2  03-28-2016-00:13:42þ0þ3893600þ7352234þ29074376...                                                                                                                                                               
3  03-28-2016-00:21:09þ0þ3893600þ7352234þ29074376...    

But with Python:

>>> df = pd.read_csv("out.rsv",sep="þ", engine="python")
>>> df.iloc[:,:5]
                  Time  User-ID  Advertiser-ID  Order-ID      Ad-ID
0  03-28-2016-00:50:03        0        3893600   7786669  298662779
1  03-28-2016-00:24:29        0        3893600   7352234  290743769
2  03-28-2016-00:13:42        0        3893600   7352234  290743769
3  03-28-2016-00:21:09        0        3893600   7352234  290743769
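
If the separator still shows up as question marks, the file's encoding may not match what pandas assumes. The following is only a sketch under stated assumptions: the sample rows are a shortened, made-up version of the data in the question, and the file is assumed to be saved as Latin-1 (where þ is a single byte). Neither assumption is confirmed by the question, so the `encoding` value may need to change for the real file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample: a header plus two thorn-delimited rows.
sample = (
    "TimeþUser-IDþAdvertiser-ID\n"
    "03-28-2016-00:50:03þ0þ3893600\n"
    "03-28-2016-00:24:29þ0þ3893600\n"
)

# Assumption: the file on disk is Latin-1 encoded.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="latin-1", newline="") as fh:
    fh.write(sample)

# Tell pandas both the separator and the encoding, and use the Python engine.
df = pd.read_csv(path, sep="þ", engine="python", encoding="latin-1")
print(df.columns.tolist())
print(df.shape)
```

Passing the same `encoding` that the file was written with is what lets the þ in `sep` match the þ in the data.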

But seriously, þ? You're using þ as a delimiter? The only search hits Google gives me for "rho delimited file" are all related to this question!

Note that you say lowercase rho, but it looks like a thorn to me. Maybe it's a lowercase rho on your end and it got confused in posting?
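
One way to settle the rho-versus-thorn question is to look at the raw bytes rather than at how an editor renders them. The helper below is a rough heuristic, not a real encoding detector, and its name `describe_delimiter_bytes` is made up for this sketch. It relies on the fact that thorn (U+00FE) is the byte FE in Latin-1 and cp1252 and the bytes C3 BE in UTF-8, while rho (U+03C1) is CF 81 in UTF-8.

```python
def describe_delimiter_bytes(raw: bytes) -> str:
    """Heuristically guess how a thorn or rho delimiter is encoded in raw bytes."""
    # A UTF-16 byte-order mark at the start is the strongest hint.
    if raw.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "UTF-16 (byte-order mark present)"
    if "þ".encode("utf-8") in raw:  # bytes C3 BE
        return "thorn, UTF-8"
    if "ρ".encode("utf-8") in raw:  # bytes CF 81
        return "rho, UTF-8"
    if b"\xfe" in raw:  # single byte FE
        return "thorn, Latin-1 or cp1252"
    return "unknown"


# In-memory demonstration with made-up header fragments.
print(describe_delimiter_bytes("TimeþUser-ID".encode("utf-8")))
print(describe_delimiter_bytes("TimeþUser-ID".encode("latin-1")))
print(describe_delimiter_bytes("TimeρUser-ID".encode("utf-8")))
```

For a real file, the first line can be read in binary mode (`open(path, "rb").readline()`) and passed to the helper; the result then suggests which `sep` and `encoding` to try.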

DSM
  • Yeah, my bad, it's a thorn. (The only way I could see it was in a shitty text editor...!) Bizarrely, that still isn't working. I'm going to try on my Mac over the weekend. I have a strong suspicion that my Windows machine is as much of a problem as the stupid separator. (Obviously not my choice.) Thanks for your help! – Robin Sheridan Apr 22 '16 at 16:54