I am stuck on this problem.

I have a PySpark DataFrame that looks like this:

id year data timestamp
1 2019 {'living_costs[1]':'','living_costs[2]':'','living_costs[3]':'','living_costs[4]':''} 2019-09-04T02:32:00.990+0000
2 2020 {'living_costs[1]':'','living_costs[2]':'','living_costs[3]':'','living_costs[4]':''} 2020-09-04T02:32:00.990+0000
3 2021 {'living_costs[1]':'','living_costs[2]':'','living_costs[3]':'','living_costs[4]':''} 2021-09-04T02:32:00.990+0000
4 2022 {'living_costs[1]':'','living_costs[2]':'','living_costs[3]':'','living_costs[4]':''} 2022-09-04T02:32:00.990+0000

In my expected output, '[' and ']' are replaced with '_' and 'living_costs' with 'lc', so it looks like this:

id year data timestamp
1 2019 {'lc_1_':'','lc_2_':'','lc_3_':'','lc_4_':''} 2019-09-04T02:32:00.990+0000
2 2020 {'lc_1_':'','lc_2_':'','lc_3_':'','lc_4_':''} 2020-09-04T02:32:00.990+0000
3 2021 {'lc_1_':'','lc_2_':'','lc_3_':'','lc_4_':''} 2021-09-04T02:32:00.990+0000
4 2022 {'lc_1_':'','lc_2_':'','lc_3_':'','lc_4_':''} 2022-09-04T02:32:00.990+0000

Can someone please help me with this?

1 Answer

I found a solution, and since no one has replied yet, I am posting it. I used transform_keys (available since Spark 3.1). The code looks like this:


from pyspark.sql.functions import col, regexp_replace, transform_keys

# First rename the 'living_costs' prefix to 'lc' in every map key,
# then replace the brackets with underscores. Raw strings keep the
# regex escapes intact.
df = (df.withColumn('data', transform_keys(col('data'), lambda k, v: regexp_replace(k, 'living_costs', 'lc')))
        .withColumn('data', transform_keys(col('data'), lambda k, v: regexp_replace(k, r'\[|\]', '_'))))
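If you want to sanity-check the renaming logic without spinning up a Spark session, the same two substitutions can be sketched in plain Python with the re module (rename_key is a hypothetical helper mirroring the two regexp_replace calls, not part of the Spark code above):

```python
import re

def rename_key(key):
    # Step 1: swap the 'living_costs' prefix for 'lc'.
    key = re.sub(r'living_costs', 'lc', key)
    # Step 2: replace '[' and ']' with '_'.
    return re.sub(r'\[|\]', '_', key)

print(rename_key('living_costs[1]'))  # → lc_1_
```

This is just the per-key transformation; transform_keys applies the equivalent function to every key of the map column.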

Thank you! I hope this helps someone.