Replace accounting notation for negative number with minus value

Question

I have a dataframe which contains negative numbers, with accountancy notation i.e.:

df.select('sales').distinct().show()

+------------+
|    sales   |
+------------+
|         18 |
|          3 |
|         10 |
|         (5)|
|          4 |
|         40 |
|          0 |
|          8 |
|         16 |
|         (2)|
|          2 |
|         (1)|
|         14 |
|         (3)|
|          9 |
|         19 |
|         (6)|
|          1 |
|         (9)|
|         (4)|
+------------+
only showing top 20 rows

The numbers wrapped in () are negative. How can I replace them with minus values instead, i.e. (5) becomes -5 and so on?

Here is what I have tried:

sales = (
    df
    .select('sales')
    .withColumn('sales_new',
               sf.when(sf.col('sales').substr(1,1) == '(',
                       sf.concat(sf.lit('-'), sf.col('sales').substr(2,3)))
               .otherwise(sf.col('sales')))
    
)

sales.show(20,False)

+---------+---------+
|sales    |sales_new|
+---------+---------+
| 151     | 151     |
| 134     | 134     |
| 151     | 151     |
|(151)    |-151     |
|(134)    |-134     |
|(151)    |-151     |
| 151     | 151     |
| 50      | 50      |
| 101     | 101     |
| 134     | 134     |
|(134)    |-134     |
| 46      | 46      |
| 151     | 151     |
| 134     | 134     |
| 185     | 185     |
| 84      | 84      |
| 188     | 188     |
|(94)     |-94)     |
| 38      | 38      |
| 21      | 21      |
+---------+---------+

The issue is that the length of sales can vary so hardcoding a value into the substring() won't work in some cases.

I have also tried using regexp_replace, but I get this error:

PatternSyntaxException: Unclosed group near index 1

sales = (
    df
    .select('sales')
    .withColumn('sales_new', regexp_replace(sf.col('sales'), '(', ''))
)

My approach was to prepend a `-` to rows starting with `(` and then remove both `(` and `)` but don't think it's an ideal method — cs_guy, Jun 22 '21 at 16:04
Seems reasonable to me. Why don't you think it's ideal? [Edit] a [mre] into your question and ask the _specific_ question you seem to be asking: "this approach is no good for me because ... and how can I improve it to achieve ..." — Pranav Hosangadi, Jun 22 '21 at 16:05
As I have hardcoded the value into `substr()` which means if the `sales` is a different length than expected, the output will become incorrect. This is now visible in the question — cs_guy, Jun 22 '21 at 16:23
So don't do a `substr()`. Instead replace `'('` and `')'` with `''` https://stackoverflow.com/questions/37038014/pyspark-replace-strings-in-spark-dataframe-column — Pranav Hosangadi, Jun 22 '21 at 16:29

score 2 · Accepted Answer · answered Jun 22 '21 at 18:18

This can be solved with a case statement and regular expression together:

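import pyspark.sql.functions as sf
from pyspark.sql.functions import regexp_replace

sales = (
    df
    .select('sales')
    # Rows starting with '(' are negative: strip both parentheses and prepend '-'.
    # The parentheses are escaped (in a raw string) so the regex treats them as literals.
    .withColumn('sales_new', sf.when(sf.col('sales').substr(1,1) == '(',
                sf.concat(sf.lit('-'), regexp_replace(sf.col('sales'), r'\(|\)', '')))
                .otherwise(sf.col('sales')))
)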

sales.show(20,False)

+---------+---------+
|sales    |sales_new|
+---------+---------+
|151      |151      |
|134      |134      |
|151      |151      |
|(151)    |-151     |
|(134)    |-134     |
|(151)    |-151     |
|151      |151      |
|50       |50       |
|101      |101      |
|134      |134      |
|(134)    |-134     |
|46       |46       |
|151      |151      |
|134      |134      |
|185      |185      |
|84       |84       |
|188      |188      |
|(94)     |-94      |
|38       |38       |
|21       |21       |
+---------+---------+
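The `PatternSyntaxException` in the question comes from the unescaped `(`, which the regex engine reads as the start of a group. As a rough illustration outside Spark, here is a plain-Python sketch using the standard `re` module (not PySpark) that shows the same kind of failure and the escaped pattern from the answer. The helper name `to_minus` and the sample values are made up for this example.

```python
import re

# An unescaped "(" opens a group that is never closed, so compilation fails.
# Python raises re.error here; Spark's Java regex engine raises
# PatternSyntaxException for the same reason.
try:
    re.compile('(')
    unescaped_ok = True
except re.error:
    unescaped_ok = False

# Escaping the parentheses makes them literal characters.
PARENS = re.compile(r'\(|\)')

def to_minus(value):
    """Hypothetical helper: turn '(94)' into '-94', leave other strings alone."""
    value = value.strip()
    if value.startswith('('):
        return '-' + PARENS.sub('', value)
    return value

samples = ['151', '(151)', '(94)', '0', '(5)']
converted = [to_minus(s) for s in samples]
print(unescaped_ok)
print(converted)
```

The same escaping rule applies to the pattern passed to `regexp_replace` in the answer above.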

score 0 · Answer 2 · answered Jun 22 '21 at 16:51

You can slice the string from the second character to the second last character, convert it to float, and negate it, for example:

def convert(number):
    try:
        # Plain values such as '18' parse directly.
        return float(number)
    except ValueError:
        # Accounting negatives such as '(5)': drop the parentheses and negate.
        return -float(number[1:-1])

You can iterate through all the elements and apply this function.
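A minimal sketch of that iteration in plain Python, assuming the values have already been collected into a list of strings. The name `parse_accounting` and the sample list are hypothetical, and the parenthesised values are negated so that `(5)` ends up as `-5.0`.

```python
def parse_accounting(text):
    """Hypothetical helper: '(5)' -> -5.0, '18' -> 18.0."""
    text = text.strip()
    if text.startswith('(') and text.endswith(')'):
        # Drop the surrounding parentheses and flip the sign.
        return -float(text[1:-1])
    return float(text)

raw = ['18', '3', '(5)', '0', '(2)']
parsed = [parse_accounting(v) for v in raw]
print(parsed)
```

On a Spark DataFrame this kind of function would have to be wrapped in a UDF, which is usually slower than built-in column functions such as `regexp_replace`.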
