1

I have an excel file with damaged rows on the top (3 first rows) which needs to be skipped, I'm using spark-excel library to read the excel file, on their github there no such functionality, so is there a way to achieve this?

This my code:

Dataset<Row> ds = session.read().format("com.crealytics.spark.excel")
                                .option("location", filePath)
                                .option("sheetName", "Feuil1")
                                .option("useHeader", "true")
                                .option("delimiter", "|")
                                .option("treatEmptyValuesAsNulls", "true")
                                .option("inferSchema", "true")
                                .option("addColorColumns", "false")
                                .load(filePath);
baitmbarek
  • 2,440
  • 4
  • 18
  • 26
Abdennacer Lachiheb
  • 4,388
  • 7
  • 30
  • 61

4 Answers4

1

I have looked at the source code and there is no option for the same

https://github.com/crealytics/spark-excel/blob/master/src/main/scala/com/crealytics/spark/excel/DefaultSource.scala

You should fix your excel file and remove the first 3 rows. Or else you would need to create a patched version of the code to allow you the same. Which would be way more effort then having a correct excel sheet

Tarun Lalwani
  • 142,312
  • 9
  • 204
  • 265
0

This issue is fixed with spark excel 0.9.16, issue link in github

Abdennacer Lachiheb
  • 4,388
  • 7
  • 30
  • 61
0

You can use the skipFirstRows option (I believe it is deprecated after version 0.11)

Library Dependency : "com.crealytics" %% "spark-excel" % "0.10.2"

Sample Code :

val df = sparkSession.read.format("com.crealytics.spark.excel")
      .option("location", inputLocation)
      .option("sheetName", "sheet1")
      .option("useHeader", "true")
      .option("skipFirstRows", "2") // Mention the number of top rows to be skipped
      .load(inputLocation)

Hope it helps! Feel free to let me know in comments if you have any doubts/issues. Thanks!

0

skipFirstRows was deprecated in favor of more generic dataAddress option. For your specific example, you can skip rows by specifying start range for your data:

Dataset<Row> ds = session.read().format("com.crealytics.spark.excel")
                                .option("location", filePath)
                                .option("useHeader", "true")
                                .option("delimiter", "|")
                                .option("treatEmptyValuesAsNulls", "true")
                                .option("inferSchema", "true")
                                .option("addColorColumns", "false")
                                .option("dataAddress", "'Feuil1'!A3") // From the docs: Start cell of the data. Reading will return all rows below and all columns to the right
                                .load(filePath);
qaziqarta
  • 1,782
  • 1
  • 4
  • 11