
I have an xlsx file which has a single column:

percentage
30%
40%
50%
-10%
0.00%
0%
0.10%
110%
99.99%
99.98%
-99.99%
-99.98%

When I read this using Apache Spark, the output I get is:

+----------+
|percentage|
+----------+
|       0.3|
|       0.4|
|       0.5|
|      -0.1|
|       0.0|
|       0.0|
|     0.001|
|       1.1|
|    0.9999|
|    0.9998|
+----------+

The expected output is:

+----------+
|percentage|
+----------+
|       30%|
|       40%|
|       50%|
|      -10%|
|     0.00%|
|        0%|
|     0.10%|
|      110%|
|    99.99%|
|    99.98%|
+----------+

My code:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("trimTest")
  .master("local[*]")
  .getOrCreate()

val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .option("maxRowsInMemory", 1000)
  .option("inferSchema", "true")
  .load("data/percentage.xlsx")

df.printSchema()
df.show(10)

I don't want to use casting or turn inferSchema off; I want a way to read the percentage values as percentages, not as doubles or strings.

  • There is no 'percentage' type in Spark, so you should choose between "string", "double", or better, DecimalType. – Artem Aliev Dec 01 '21 at 13:59
  • And then multiply by 100 and concatenate it with %? But when dealing with big files which have millions of rows, the process will become super slow, and I will need to specify conditions to read files and check for %. Won't all this make the process super slow? – Sarvesh Singh Dec 01 '21 at 14:07
  • Try defining a custom schema `option("schema", customSchema)` to specify the column `percentage` as a string instead of a number. – blackbishop Dec 01 '21 at 14:25
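
For reference, a minimal sketch of blackbishop's custom-schema suggestion, here written with the standard DataFrameReader.schema(...) call rather than an option. Note that depending on the spark-excel version, the string read back for a percent-formatted cell may be the formatted text ("30%") or the underlying number ("0.3"):

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Read the percentage column as a plain string instead of letting
// inferSchema turn it into a double
val customSchema = StructType(Seq(
  StructField("percentage", StringType, nullable = true)
))

val dfStr = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .option("maxRowsInMemory", 1000)
  .schema(customSchema) // explicit schema, no inference
  .load("data/percentage.xlsx")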

1 Answer


Well, percentages ARE doubles: 30% = 0.3

The only difference is the way they are displayed and, as @Artem_Aliev wrote in the comments, there is no percentage type in Spark that would print out as you expect. But once again: percentages are doubles, same thing, different notation.

The question is, what do you want to do with those percentages?

  • to "apply" them on something else, i.e. use multiply, then just use the double type column
  • to have a nice print, convert to the suitable string before printing:
val percentString = format_string("%.2f%%", $"percentage" * 100)
ds.withColumn("percentage", percentString).show()
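
With the sample data above, show() should then print something like:

+----------+
|percentage|
+----------+
|    30.00%|
|    40.00%|
|    50.00%|
|   -10.00%|
|     0.00%|
|     0.00%|
|     0.10%|
|   110.00%|
|    99.99%|
|    99.98%|
+----------+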
– Juh_
  • I have a product which takes data from users and processes it. Whenever a user uploads an Excel file with a column containing, for example, 30%, the preview he gets shows 0.3. That is correct, I know, but if I get an Excel file with 20 columns of different datatypes, I want % to be shown as it is, and I can't define a new schema, as the other 19 columns would be affected. Casting, or taking a specific column out of the file, operating on it, inserting it back into the dataframe, and then sending it to the user, is a time-consuming process which is not ideal. – Sarvesh Singh Dec 02 '21 at 05:47
  • Well, Spark just doesn't have a percentage type so, I guess, it converts Excel percentages to their closest equivalent: double. I don't think the cost of conversion to string is really an issue here; Excel files are not that big compared to what Spark is made to process. But you need to have some information to know when a column should actually be displayed as a percentage. Maybe in the column name? – Juh_ Dec 02 '21 at 15:11
  • Percentage is an exception in Spark. Can you tell me a way to catch a percentage column in a dataframe? Suppose we have a double column and a percentage column in Excel and I want to print them as they were; for that I would need to catch the percentage column with some regex method and then operate on it. – Sarvesh Singh Dec 03 '21 at 04:54
  • I don't see a way: once it is read into a dataset, the column is Double and there is no way to know if it is meant to be a percentage. All I can think of is that maybe in your data the column names include "percentage" or something. Then you can detect them and apply the code above to display them as percentages, as sketched below. – Juh_ Dec 03 '21 at 15:06
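
For completeness, a minimal sketch of the name-based detection described in the last comment, assuming (and this is only an assumption, since Spark itself cannot tell) that percentage columns can be recognized by "percent" appearing in their name:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, format_string}

// Assumption: a column holds percentages iff its name contains "percent";
// once read, Spark cannot distinguish a percentage from any other double
def formatPercentColumns(df: DataFrame): DataFrame =
  df.columns
    .filter(_.toLowerCase.contains("percent"))
    .foldLeft(df) { (acc, name) =>
      acc.withColumn(name, format_string("%.2f%%", col(name) * 100))
    }

formatPercentColumns(df).show()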