
I have a PySpark dataframe that I want to convert into a pandas dataframe; however, a column containing an array of JSON gets converted into a string in pandas.

import pandas as pd
import pyspark.sql.functions as F

my_df = spark.createDataFrame(
    pd.DataFrame([['Scott', 50], ['Jeff', 45], ['Thomas', 54], ['Ann', 34]],
                 columns=['id', 'score'])
)

mypandasdf = (
    my_df
    .groupBy('id')
    .agg(
        F.to_json(
            F.sort_array(
                F.collect_list(F.col('score')), asc=False
            )
        ).alias('preds')
    )
    .toPandas()
)

I need to execute `mypandasdf['preds'] = mypandasdf.preds.apply(eval)` to cast my string column to a list of dicts.
I was wondering if there is a more efficient way to do it. Any help? Thanks.
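Since the `preds` strings come from `to_json`, they are valid JSON, so one option is to parse them with `json.loads` instead of `eval`, which avoids executing arbitrary code and is typically faster. A minimal pandas-only sketch, using a hypothetical frame that stands in for the `toPandas()` output:

```python
import json

import pandas as pd

# Hypothetical frame standing in for the toPandas() result:
# 'preds' holds JSON strings as produced by Spark's to_json.
mypandasdf = pd.DataFrame({
    'id': ['Scott', 'Jeff'],
    'preds': ['[50]', '[45]'],
})

# json.loads parses each JSON string into a real Python list/dict,
# without eval's security and performance downsides.
mypandasdf['preds'] = mypandasdf['preds'].apply(json.loads)
```

After this, `mypandasdf['preds'].iloc[0]` is a Python list rather than a string.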

3nomis
    What happens if you remove `to_json`? `to_json` is to convert object into JSON string. – Emma Sep 28 '22 at 14:16
  • If I don't use `to_json`, the pandas column is just a list of: `[Row(score=0.08888888888888889, id='900048333'),...]` – 3nomis Sep 29 '22 at 10:01
  • Could you add a sample data? – Emma Sep 29 '22 at 13:23
  • There is no list of dicts in this sample data. If I execute your code minus the `to_json`, the `preds` column is an array of integers (`[50]` for 'Scott') and `toPandas()` will keep it as an integer array. – Emma Oct 03 '22 at 15:25

0 Answers