
I have a PySpark dataframe that I want to convert into a pandas dataframe; however, a column containing an array of JSON gets converted into a string in pandas.

import pandas as pd
import pyspark.sql.functions as F

my_df = spark.createDataFrame(
    pd.DataFrame([['Scott', 50], ['Jeff', 45], ['Thomas', 54], ['Ann', 34]],
                 columns=['id', 'score'])
)

mypandasdf = (
    my_df
    .groupBy('id')
    .agg(
        F.to_json(
            F.sort_array(
                F.collect_list(F.col('score')), asc=False
            )
        ).alias('preds')
    )
    .toPandas()
)

I need to execute `mypandasdf['preds'] = mypandasdf.preds.apply(eval)` to cast my string column to a list of dicts.
I was wondering if there is a more efficient way to do it. Any help? Thanks.
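Since the `preds` strings come from `to_json`, they are valid JSON, so one option is to parse them with `json.loads` instead of `eval`, which avoids executing arbitrary code and is typically faster. A minimal pandas-only sketch, using a hypothetical frame that stands in for the `toPandas()` output:

```python
import json

import pandas as pd

# Hypothetical frame standing in for the toPandas() result:
# 'preds' holds JSON strings as produced by Spark's to_json.
mypandasdf = pd.DataFrame({
    'id': ['Scott', 'Jeff'],
    'preds': ['[50]', '[45]'],
})

# json.loads parses each JSON string into a real Python list/dict,
# without eval's security and performance downsides.
mypandasdf['preds'] = mypandasdf['preds'].apply(json.loads)
```

After this, `mypandasdf['preds'].iloc[0]` is a Python list rather than a string.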

3nomis
    What happens if you remove `to_json`? `to_json` is to convert object into JSON string. – Emma Sep 28 '22 at 14:16
  • If I don't use `to_json`, the pandas column is just a list of: `[Row(score=0.08888888888888889, id='900048333'),...]` – 3nomis Sep 29 '22 at 10:01
  • Could you add a sample data? – Emma Sep 29 '22 at 13:23
  • There is no list of dicts in this sample data. If I execute your code minus the `to_json`, the `preds` column is an array of integers (`[50]` for 'Scott') and `toPandas()` will keep it as an integer array. – Emma Oct 03 '22 at 15:25

0 Answers