
I'd like to convert a float to a currency string using Babel and PySpark.

sample data:

amount       currency
2129.9       RON
1700         EUR
1268         GBP
741.2        USD
142.08091153 EUR
4.7E7        USD
0            GBP

I tried:

df = df.withColumn(F.col('amount'), format_currency(F.col('amount'), F.col('currency'),locale='be_BE'))

or

df = df.withColumn(F.col('amount'), format_currency(F.col('amount'), 'EUR',locale='be_BE'))

They both give me an error:
[screenshot of a `decimal.InvalidOperation` traceback]

John Doe
  • Can you provide sample data for the `amount` column? Since the error comes from the decimal conversion, you can also check whether there is any `/` anywhere in the `amount` column. – think-maths Feb 02 '21 at 06:59
  • So you are saying there are no characters in the entire `amount` column apart from numeric ones? There is no other reason why it would raise a `decimal conversion` error. – think-maths Feb 02 '21 at 07:36
  • I checked the table and there's an `E` in the column, e.g. `4.7E7` – John Doe Feb 02 '21 at 07:39
  • Remove those characters and try again. Also, to check whether the rows are numeric or not, you can use `df.select("amount", F.col("amount").cast("int").isNotNull().alias("Truth")).show()` – think-maths Feb 02 '21 at 07:50
  • What's the best way to remove those characters, `regexp_extract`? – John Doe Feb 02 '21 at 07:58
  • Yes, you can use `regexp_extract` to keep just the numeric part, or even `regexp_replace` to replace the non-numeric part with `""` (see the sketch below). – think-maths Feb 02 '21 at 08:40
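
For what it's worth, a minimal sketch of that cleanup, assuming `amount` is a string column; the regex below is an assumption, not from the thread (it deliberately keeps the `E`, sign, and decimal point so values like `4.7E7` survive):

import pyspark.sql.functions as F

# Strip everything except digits, '.', '+', '-', and 'E'/'e', so values
# like 4.7E7 survive while stray characters such as '/' are removed.
df_clean = df.withColumn(
    'amount', F.regexp_replace('amount', '[^0-9.eE+-]', '')
)

# Flag rows that still fail a numeric cast after the cleanup.
df_clean.select(
    'amount',
    F.col('amount').cast('double').isNotNull().alias('is_numeric')
).show()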

2 Answers

There seems to be a problem in the pre-processing of the `amount` column of your dataframe. From the error it is evident that the value, after conversion to string, is not purely numeric, which it has to be according to this table, and has some additional characters as well. You can inspect the column to find and remove the unnecessary characters. As an example:

>>> import decimal
>>> value = '10.0'
>>> value = decimal.Decimal(str(value))
>>> value
Decimal('10.0')
>>> value = '10.0e'
>>> value = decimal.Decimal(str(value))
Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    value = decimal.Decimal(str(value))
decimal.InvalidOperation: [<class 'decimal.ConversionSyntax'>]   # as '10.0e' is not just numeric
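
To surface the offending rows directly in Spark, a quick check along these lines may help (a sketch; the helper name is hypothetical and simply reuses the `Decimal` test above):

import decimal
import pyspark.sql.functions as F

# Hypothetical helper: returns True when Decimal() rejects the value,
# i.e. for the rows that would make the currency formatting fail.
def is_bad_decimal(value):
    try:
        decimal.Decimal(str(value))
        return False
    except decimal.InvalidOperation:
        return True

is_bad_decimal_udf = F.udf(is_bad_decimal, 'boolean')
df.filter(is_bad_decimal_udf('amount')).show()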
think-maths

To use Python libraries with Spark dataframes, you need to use a UDF:

from babel.numbers import format_currency
import pyspark.sql.functions as F

# Wrap Babel's format_currency in a UDF so it runs row by row;
# the default return type (string) suits the formatted amount.
format_currency_udf = F.udf(lambda a, c: format_currency(a, c))

df2 = df.withColumn(
    'amount',
    format_currency_udf('amount', 'currency')
)

df2.show()
+----------------+--------+
|          amount|currency|
+----------------+--------+
|     RON2,129.90|     RON|
|       €1,700.00|     EUR|
|       £1,268.00|     GBP|
|       US$741.20|     USD|
|         €142.08|     EUR|
|US$47,000,000.00|     USD|
+----------------+--------+
mck
  • Works for most values, but I'm still getting an `InvalidOperation: []` error. I checked all values but I can't find the cause. What's the best way to find the value causing the error? Some values are `0`; can that cause the problem? – John Doe Feb 02 '21 at 11:54
  • @JohnDoe it will fail for nulls. You can add a check in the UDF to catch this (see the sketch below). – mck Feb 02 '21 at 12:06
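
For reference, a null-safe variant of the UDF from the answer above, along the lines mck suggests; a sketch (the helper name is made up) that returns null instead of raising when either column is missing:

from babel.numbers import format_currency
import pyspark.sql.functions as F

# Guard against nulls before handing the value to Babel,
# since Decimal(str(None)) raises InvalidOperation.
def safe_format_currency(amount, currency):
    if amount is None or currency is None:
        return None
    return format_currency(amount, currency)

safe_format_currency_udf = F.udf(safe_format_currency, 'string')

df2 = df.withColumn('amount', safe_format_currency_udf('amount', 'currency'))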