I am stuck on this problem.
I have a PySpark dataframe that looks like this:
id | year | data | timestamp |
---|---|---|---|
1 | 2019 | {'living_costs[1]':'','living_costs[2]':'','living_costs[3]':'','living_costs[4]':''} | 2019-09-04T02:32:00.990+0000 |
2 | 2020 | {'living_costs[1]':'','living_costs[2]':'','living_costs[3]':'','living_costs[4]':''} | 2020-09-04T02:32:00.990+0000 |
3 | 2021 | {'living_costs[1]':'','living_costs[2]':'','living_costs[3]':'','living_costs[4]':''} | 2021-09-04T02:32:00.990+0000 |
4 | 2022 | {'living_costs[1]':'','living_costs[2]':'','living_costs[3]':'','living_costs[4]':''} | 2022-09-04T02:32:00.990+0000 |
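For reproducibility, this is roughly how my sample dataframe could be constructed (I'm treating `data` as a plain string column here, which may not exactly match my real schema):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows matching the table above; `data` is kept as a single string.
payload = "{'living_costs[1]':'','living_costs[2]':'','living_costs[3]':'','living_costs[4]':''}"
rows = [
    (1, 2019, payload, "2019-09-04T02:32:00.990+0000"),
    (2, 2020, payload, "2020-09-04T02:32:00.990+0000"),
    (3, 2021, payload, "2021-09-04T02:32:00.990+0000"),
    (4, 2022, payload, "2022-09-04T02:32:00.990+0000"),
]
df = spark.createDataFrame(rows, ["id", "year", "data", "timestamp"])
df.show(truncate=False)
```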
My expected output replaces '[' and ']' with '_' and 'living_costs' with 'lc' in the keys of the `data` column, so it looks like this:
id | year | data | timestamp |
---|---|---|---|
1 | 2019 | {'lc_1_':'','lc_2_':'','lc_3_':'','lc_4_':''} | 2019-09-04T02:32:00.990+0000 |
2 | 2020 | {'lc_1_':'','lc_2_':'','lc_3_':'','lc_4_':''} | 2020-09-04T02:32:00.990+0000 |
3 | 2021 | {'lc_1_':'','lc_2_':'','lc_3_':'','lc_4_':''} | 2021-09-04T02:32:00.990+0000 |
4 | 2022 | {'lc_1_':'','lc_2_':'','lc_3_':'','lc_4_':''} | 2022-09-04T02:32:00.990+0000 |
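Something along these lines might work if `data` really is just a string column, but I'm not sure it's the right approach:

```python
from pyspark.sql import functions as F

# Rewrite 'living_costs[n]' to 'lc_n_' inside the string using a regex
# with a capture group for the index; $1 refers to that group.
df2 = df.withColumn(
    "data",
    F.regexp_replace("data", r"living_costs\[(\d+)\]", r"lc_$1_"),
)
df2.show(truncate=False)
```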
Can someone please help me with this?