To reproduce the issue, the notebook, data, and output are available here: github link
My dataset has a Contract variable/column whose values all look like numbers (for example 9940242774) but are actually categorical.

When the data is read with pandas, df.info() reports the column as int. Since the metadata I received says the Contract variable is a category, I changed its type manually, as below:

df['Contract'] = df['Contract'].astype('category')
df.dtypes  # now shows the modified dtype
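
A minimal sketch of that cast on stand-in data (the values below are made up for illustration and are not taken from the real dataset). Note that pandas names this dtype 'category':

```python
import pandas as pd

# Hypothetical stand-in for the real data: contract IDs that look numeric
# but are really labels.
df = pd.DataFrame({"Contract": [9940242774, 9940242775, 9940242774]})

# pandas reads the column as an integer dtype by default.
print(df["Contract"].dtype)

# Cast to the pandas categorical dtype, whose string alias is "category".
df["Contract"] = df["Contract"].astype("category")
print(df["Contract"].dtype)
print(list(df["Contract"].cat.categories))
```

After the cast, df.dtypes reports category for Contract, which is the state the rest of the question starts from.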

I then tried to generate a report with pandas_profiling. The generated report still interprets Contract as a real number, even though I changed the type from int to str/category.

from pandas_profiling import ProfileReport

# Tried both, with the same result.
ProfileReport(df)
df.profile_report()


What is the right way to make pandas_profiling respect the data types I set, i.e., to treat the Contract variable as categorical?

Mohith7548
  • rather than `.astype("categorical")`, try `.astype(string)` – Paul H Jan 20 '21 at 08:04
  • I did that as well `astype('str')` & `astype('category')`. But same result – Mohith7548 Jan 20 '21 at 08:17
  • I recommend appending a character to the front of each value in that column then (e.g., `9940242774` becomes `"c9940242774"`) – Paul H Jan 20 '21 at 08:41
  • This is exactly what I'm using at the moment. But there has to be some way right? Kindly share the question/upvote/comment for better reachability. – Mohith7548 Jan 20 '21 at 11:15
  • can you make a reproducible example? – Paul H Jan 20 '21 at 16:07
  • https://stackoverflow.com/help/minimal-reproducible-example – Paul H Jan 20 '21 at 16:08
  • Here's the link to sample/reproducible data: https://github.com/mohith7548/Pandas-Profiling-issue-recreation – Mohith7548 Jan 24 '21 at 07:05
  • That belongs in the question, not in a gist – Paul H Jan 24 '21 at 16:04
  • The repo link above has an ipynb file that includes the dataframe creation step. You can chop the `Contract` column from there. – Mohith7548 Jan 24 '21 at 17:03
  • copy and paste that info into the question. links rot and this question won't make any sense – Paul H Jan 24 '21 at 17:04
  • https://stackoverflow.com/help/how-to-ask – Paul H Jan 24 '21 at 17:04
  • @PaulH I think the notebook in the GitHub link is good enough to understand, and it also has comments. Let me know if you need any other info – Mohith7548 Jan 26 '21 at 11:29
  • I'm telling you that it's not. One day, you're going to clean up your GitHub profile, that link will go away, and future readers who might benefit from this question will be left in the dark. – Paul H Jan 26 '21 at 18:01
  • I ain't gonna clean up that repo. Kindly go through the ipynb notebook. Here I raised an issue as well https://github.com/pandas-profiling/pandas-profiling/issues/676. Now I understand why you can't answer this question ; ) – Mohith7548 Jan 27 '21 at 05:38
  • @Mohith7548 So looking at your github issue, is the basic conclusion that there is no way to stop the report generator from just inferring the type, end of story? – lampShadesDrifter May 21 '21 at 03:50
  • @Mohith7548 Actually, just saw your contrib (https://github.com/pandas-profiling/pandas-profiling/issues/676#issuecomment-770263553). You should post this as the answer to your question (would be helpful to others esp. since this info does not appear to be anywhere in the `pandas_profiling` docs). – lampShadesDrifter May 21 '21 at 06:12

1 Answer

A long time after posting this question, raising an issue, and creating a pull request for it on the pandas-profiling GitHub page, I had almost forgotten about this question. Thanks to lampShadesDrifter for reminding me to close it by answering.

This behavior of pandas-profiling is actually expected. pandas-profiling tries to infer the data type that best suits each column, and that is how it was originally written. Since there was no way to turn this off, it drove me to create my first ever pull request on GitHub.

Now with the newly added parameter infer_dtypes in ProfileReport / profile_report, we can explicitly ask pandas-profiling not to infer any data type, but rather use the data type from pandas (df.dtypes).

# For the df in the question:

df['Contract'] = df['Contract'].astype('category')

# With `infer_dtypes=False`, `pandas-profiling` does not infer dtypes on its own;
# it uses the dtypes as understood by pandas instead,
# so `Contract` is treated as `category`, as type-cast above.
ProfileReport(df, infer_dtypes=False)  # or
df.profile_report(infer_dtypes=False)
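
To see why the flag matters, here is a toy model of the two modes in plain Python. It is only a sketch and not the library's actual inference logic: the helpers `infer_kind` and `resolve_kind` are hypothetical names invented for this illustration.

```python
def infer_kind(values):
    """Toy inference: call a column numeric if every value parses as a number."""
    try:
        for value in values:
            float(value)
    except (TypeError, ValueError):
        return "categorical"
    return "numeric"


def resolve_kind(values, declared, infer_dtypes=True):
    """Pick the reported kind: inferred from the values, or the declared one."""
    if infer_dtypes:
        # Inference ignores what the user declared and looks only at the values.
        return infer_kind(values)
    # With inference off, the declared kind is trusted as-is.
    return declared


contracts = [9940242774, 9940242775, 9940242774]

print(resolve_kind(contracts, declared="categorical", infer_dtypes=True))
print(resolve_kind(contracts, declared="categorical", infer_dtypes=False))

# The prefix workaround from the comments also defeats this toy inference.
prefixed = ["c" + str(value) for value in contracts]
print(resolve_kind(prefixed, declared="categorical", infer_dtypes=True))
```

In this toy model, number-like values are reported as numeric whenever inference is on, regardless of the declared kind. That mirrors the symptom in the question and shows why either switching inference off or prefixing the values changes the result.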

Please feel free to contribute to this answer if you find anything worth mentioning.

Mohith7548