I am stuck on this problem.

I have a PySpark DataFrame that looks like this:

id year data timestamp
1 2019 {'living_costs[1]':'','living_costs[2]':'','living_costs[3]':'','living_costs[4]':''} 2019-09-04T02:32:00.990+0000
2 2020 {'living_costs[1]':'','living_costs[2]':'','living_costs[3]':'','living_costs[4]':''} 2020-09-04T02:32:00.990+0000
3 2021 {'living_costs[1]':'','living_costs[2]':'','living_costs[3]':'','living_costs[4]':''} 2021-09-04T02:32:00.990+0000
4 2022 {'living_costs[1]':'','living_costs[2]':'','living_costs[3]':'','living_costs[4]':''} 2022-09-04T02:32:00.990+0000

In my expected output, '[' and ']' are replaced with '_' and 'living_costs' with 'lc', so it looks like this:

id year data timestamp
1 2019 {'lc_1_':'','lc_2_':'','lc_3_':'','lc_4_':''} 2019-09-04T02:32:00.990+0000
2 2020 {'lc_1_':'','lc_2_':'','lc_3_':'','lc_4_':''} 2020-09-04T02:32:00.990+0000
3 2021 {'lc_1_':'','lc_2_':'','lc_3_':'','lc_4_':''} 2021-09-04T02:32:00.990+0000
4 2022 {'lc_1_':'','lc_2_':'','lc_3_':'','lc_4_':''} 2022-09-04T02:32:00.990+0000

Can someone please help me with this?

1 Answer

I found a solution, and since no one has replied yet, I am posting it. I used transform_keys (available since Spark 3.1). The code looks like this:


from pyspark.sql.functions import col, regexp_replace, transform_keys

# First rename the 'living_costs' prefix to 'lc' in every map key,
# then replace the brackets with underscores. Raw strings keep the
# regex escapes intact.
df = (df.withColumn('data', transform_keys(col('data'), lambda k, v: regexp_replace(k, 'living_costs', 'lc')))
        .withColumn('data', transform_keys(col('data'), lambda k, v: regexp_replace(k, r'\[|\]', '_'))))
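If you want to sanity-check the renaming logic without spinning up a Spark session, the same two substitutions can be sketched in plain Python with the re module (rename_key is a hypothetical helper mirroring the two regexp_replace calls, not part of the Spark code above):

```python
import re

def rename_key(key):
    # Step 1: swap the 'living_costs' prefix for 'lc'.
    key = re.sub(r'living_costs', 'lc', key)
    # Step 2: replace '[' and ']' with '_'.
    return re.sub(r'\[|\]', '_', key)

print(rename_key('living_costs[1]'))  # → lc_1_
```

This is just the per-key transformation; transform_keys applies the equivalent function to every key of the map column.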

Thank you! I hope this helps someone.