Reading a Excel file in Spark with an integer column

Question

I have a group of Excel sheets, that I am trying to read via spark through com.crealytics.spark.excel package. In my excel sheet I have a column Survey ID that contains integer IDs. When I read the data through spark I see the values are converted to double value.

How can I retain the format of the integer values while reading from excel sheet ?

This is what I tried :

val df = spark.read.format("com.crealytics.spark.excel")
      .option("location", <somelocation>)
      .option("useHeader", "true")
      .option("treatEmptyValuesAsNulls", "true")
      .option("inferSchema", "true")
      .option("addColorColumns","False")
      .load()

Actual Value

Value read via Spark

+-----------+
|  Survey ID|
+-----------+
|1.7632889E7|
|1.7632889E7|
|1.7632934E7|
|1.7633233E7|
|1.7633534E7|
|1.7655812E7|
|1.7656079E7|
|1.7930478E7|
|1.7944498E7|
|1.8071246E7|

If I cast the column to integer I get the required formatted data. But is there a better way to do this?

val finalDf=df.withColumn("Survey ID", col("Survey ID").cast(sql.types.IntegerType))

I would suggest not to infer the schema (this would make all the columns string type) and then cast the columns accordingly. This is how handled this scenario in one of my project due to time crunch but this good thing was in my case that the columns were few in numbers. — Goldie, Aug 05 '19 at 17:36
Okay. I just realised that i also used `org.apache.poi.ss.usermodel.WorkbookFactory` to read the data from excel through Iterator and created the dataframe manually(I don't recall the exact issue i was facing with `om.crealytics.spark.excel` that time). If you don't have any option and stuck with it, I can share that code. Let me know. — Goldie, Aug 05 '19 at 18:07
for now I have converted the data back to integer .Seems to be working — Ayan Biswas, Aug 05 '19 at 18:15

score 1 · Answer 1 · answered Aug 05 '19 at 17:48

1

There is a bug (or rather missing setting) in the excel library which renders column with large numbers as scientific notation. See https://github.com/crealytics/spark-excel/issues/126

answered Aug 05 '19 at 17:48

D3V

1,543
11
21

so is there an alternate way ? – Ayan Biswas Aug 05 '19 at 17:51
You can cast column to `int` or `decimal(8,0)`. The bigger values in spark are represented as scientific notation. Excel lib is only reading it as double. – D3V Aug 06 '19 at 06:07

Reading a Excel file in Spark with an integer column

1 Answers1