1

I have a group of Excel sheets, that I am trying to read via spark through com.crealytics.spark.excel package. In my excel sheet I have a column Survey ID that contains integer IDs. When I read the data through spark I see the values are converted to double value.

How can I retain the format of the integer values while reading from excel sheet ?

This is what I tried :

val df = spark.read.format("com.crealytics.spark.excel")
      .option("location", <somelocation>)
      .option("useHeader", "true")
      .option("treatEmptyValuesAsNulls", "true")
      .option("inferSchema", "true")
      .option("addColorColumns","False")
      .load()

Actual Value

enter image description here

Value read via Spark

+-----------+
|  Survey ID|
+-----------+
|1.7632889E7|
|1.7632889E7|
|1.7632934E7|
|1.7633233E7|
|1.7633534E7|
|1.7655812E7|
|1.7656079E7|
|1.7930478E7|
|1.7944498E7|
|1.8071246E7|

If I cast the column to integer I get the required formatted data. But is there a better way to do this?

val finalDf=df.withColumn("Survey ID", col("Survey ID").cast(sql.types.IntegerType))
baitmbarek
  • 2,440
  • 4
  • 18
  • 26
Ayan Biswas
  • 1,641
  • 9
  • 39
  • 66
  • I would suggest not to infer the schema (this would make all the columns string type) and then cast the columns accordingly. This is how handled this scenario in one of my project due to time crunch but this good thing was in my case that the columns were few in numbers. – Goldie Aug 05 '19 at 17:36
  • inferschema=false , also does not make any difference – Ayan Biswas Aug 05 '19 at 17:37
  • Okay. I just realised that i also used `org.apache.poi.ss.usermodel.WorkbookFactory` to read the data from excel through Iterator and created the dataframe manually(I don't recall the exact issue i was facing with `om.crealytics.spark.excel` that time). If you don't have any option and stuck with it, I can share that code. Let me know. – Goldie Aug 05 '19 at 18:07
  • for now I have converted the data back to integer .Seems to be working – Ayan Biswas Aug 05 '19 at 18:15

1 Answers1

1

There is a bug (or rather missing setting) in the excel library which renders column with large numbers as scientific notation. See https://github.com/crealytics/spark-excel/issues/126

D3V
  • 1,543
  • 11
  • 21