I have a dataframe which contains negative numbers, with accountancy notation i.e.:
df.select('sales').distinct().show()
+------------+
| sales |
+------------+
| 18 |
| 3 |
| 10 |
| (5)|
| 4 |
| 40 |
| 0 |
| 8 |
| 16 |
| (2)|
| 2 |
| (1)|
| 14 |
| (3)|
| 9 |
| 19 |
| (6)|
| 1 |
| (9)|
| (4)|
+------------+
only showing top 20 rows
The numbers wrapped in ()
are negative. How can I replace them to have minus values instead i.e. (5)
becomes -5
and so on.
Here is what I have tried:
sales = (
df
.select('sales')
.withColumn('sales_new',
sf.when(sf.col('sales').substr(1,1) == '(',
sf.concat(sf.lit('-'), sf.col('sales').substr(2,3)))
.otherwise(sf.col('sales')))
)
sales.show(20,False)
+---------+---------+
|salees |sales_new|
+---------+---------+
| 151 | 151 |
| 134 | 134 |
| 151 | 151 |
|(151) |-151 |
|(134) |-134 |
|(151) |-151 |
| 151 | 151 |
| 50 | 50 |
| 101 | 101 |
| 134 | 134 |
|(134) |-134 |
| 46 | 46 |
| 151 | 151 |
| 134 | 134 |
| 185 | 185 |
| 84 | 84 |
| 188 | 188 |
|(94) |-94) |
| 38 | 38 |
| 21 | 21 |
+---------+---------+
The issue is that the length of sales can vary so hardcoding a value into the substring() won't work in some cases.
I have tried using regexp_replace
but get an error that:
PatternSyntaxException: Unclosed group near index 1
sales = (
df
.select('sales')
.withColumn('sales_new', regexp_replace(sf.col('sales'), '(', ''))
)