
I am trying to run the line of code:

pd.get_dummies(pd_df, columns = ['ethnicity'])

However, I keep getting the error 'DataFrame' object has no attribute '_internal'. It looks like it's linked to the ...pyspark/pandas/namespace.py file, so I am not sure how to fix it.

Unfortunately, the dataframe itself is private, so I can't show or describe it on Stack Overflow, but any information about why this could be happening would be greatly appreciated!

I can make the example below work perfectly, but it won't work on my code even though it is exactly the same; I just have a different DataFrame that has been converted from PySpark to pandas:

import numpy as np
import pandas as pd

sales_data = pd.DataFrame({"name": ["William", "Emma", "Sofia", "Markus", "Edward", "Thomas", "Ethan", "Olivia", "Arun", "Anika", "Paulo"],
                           "sales": [50000, 52000, 90000, 34000, 42000, 72000, 49000, 55000, 67000, 65000, 67000],
                           "region": ["East", "North", "East", "South", "West", "West", "South", "West", "West", "East", np.nan]})
pd.get_dummies(sales_data, columns=['region'])

ajnabz
  • Is `pd_df` a PySpark dataframe or a pandas dataframe? – Ben.T Nov 21 '22 at 21:41
  • Pandas dataframe :) @Ben.T – ajnabz Nov 21 '22 at 21:42
  • Do you build it from a PySpark dataframe? I'm asking because you seem to say it comes from the file `...pyspark/pandas/namespace.py`, and you also talk about `show`, which is not in pandas (as far as I know). If yes, it may be related to [this Q&A](https://stackoverflow.com/questions/65474079/attributeerror-dataframe-object-has-no-attribute-data), even if it is not strictly the same error – Ben.T Nov 21 '22 at 21:49
  • Yes, it is a PySpark dataframe which I then convert with ```.toPandas()```. Thank you, I will have a look! – ajnabz Nov 21 '22 at 21:52
  • 1
    @Ben.T I dont think it is to do with the version as I am able to use it perfectly with the example I have included in the question. Thank you though – ajnabz Nov 21 '22 at 22:13

1 Answer


I had this same error. I was mixing up the two libraries, using ps (pyspark.pandas) where I meant pd (plain pandas).

Ensure your aliases are correct and that you haven't accidentally bound the pandas alias to pyspark.pandas, e.g.:

import pyspark.pandas as pd   # wrong: 'pd' is now pyspark.pandas, not plain pandas
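
As a rough sketch of how to keep the two aliases apart (the data below is made up; only the 'ethnicity' column name comes from the question, and the Spark DataFrame is a hypothetical stand-in for the private one):

from pyspark.sql import SparkSession
import pandas as pd             # plain pandas
import pyspark.pandas as ps     # pandas API on Spark, under its own alias

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for the private DataFrame described in the question
spark_df = spark.createDataFrame(
    [("Alice", "A"), ("Bob", "B"), ("Cara", "A")],
    ["name", "ethnicity"],
)

# Option 1: convert to plain pandas and use pandas' own get_dummies
pd_df = spark_df.toPandas()
encoded = pd.get_dummies(pd_df, columns=["ethnicity"])

# Option 2: stay on Spark and use the pandas-on-Spark equivalent
ps_df = spark_df.pandas_api()   # PySpark 3.2+
encoded_ps = ps.get_dummies(ps_df, columns=["ethnicity"])

Either way, the key is that pd.get_dummies only receives a plain pandas DataFrame and ps.get_dummies only receives a pandas-on-Spark one.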
straka86