Spark : skip top rows with spark-excel

Question

I have an Excel file whose first 3 rows are damaged and need to be skipped. I'm using the spark-excel library to read the file, but I can't find any such option documented on its GitHub page. Is there a way to achieve this?

This is my code:

Dataset<Row> ds = session.read().format("com.crealytics.spark.excel")
                                .option("location", filePath)
                                .option("sheetName", "Feuil1")
                                .option("useHeader", "true")
                                .option("delimiter", "|")
                                .option("treatEmptyValuesAsNulls", "true")
                                .option("inferSchema", "true")
                                .option("addColorColumns", "false")
                                .load(filePath);

score 1 · Answer 1 · answered May 07 '18 at 16:59

I looked at the source code, and there is no option for this:

https://github.com/crealytics/spark-excel/blob/master/src/main/scala/com/crealytics/spark/excel/DefaultSource.scala

You should fix your Excel file by removing the first 3 rows. Otherwise, you would need to create a patched version of the library that supports skipping rows, which would be far more effort than producing a correct Excel sheet.

score 0 · Accepted Answer · answered Jul 27 '18 at 13:37

This issue is fixed in spark-excel 0.9.16 (see the corresponding issue on GitHub).

Abdennacer Lachiheb

score 0 · Answer 3 · answered Apr 25 '22 at 08:39

You can use the skipFirstRows option (I believe it was deprecated after version 0.11).

Library dependency: "com.crealytics" %% "spark-excel" % "0.10.2"

Sample code:

val df = sparkSession.read.format("com.crealytics.spark.excel")
      .option("location", inputLocation)
      .option("sheetName", "sheet1")
      .option("useHeader", "true")
      .option("skipFirstRows", "2") // Mention the number of top rows to be skipped
      .load(inputLocation)

Hope it helps! Feel free to let me know in comments if you have any doubts/issues. Thanks!

score 0 · Answer 4 · answered Sep 13 '22 at 05:27

skipFirstRows was deprecated in favor of the more generic dataAddress option. For your specific example, you can skip rows by specifying the start cell of your data:

Dataset<Row> ds = session.read().format("com.crealytics.spark.excel")
                                .option("location", filePath)
                                .option("useHeader", "true")
                                .option("delimiter", "|")
                                .option("treatEmptyValuesAsNulls", "true")
                                .option("inferSchema", "true")
                                .option("addColorColumns", "false")
                                // From the docs: start cell of the data. Reading will return all rows below and all columns to the right.
                                // Row 4 is the first row after the 3 damaged ones; with useHeader=true it is read as the header.
                                .option("dataAddress", "'Feuil1'!A4")
                                .load(filePath);
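
Because the start cell is just a string, it is easy to get the row number off by one. As a minimal sketch (the class DataAddress and the method skipTopRows are hypothetical names for illustration, not part of spark-excel), the address can be derived from the sheet name and the number of rows to skip. It assumes the data begins in column A and that the sheet name contains no apostrophes:

```java
// Hypothetical helper, not part of spark-excel: builds a dataAddress start cell.
final class DataAddress {
    private DataAddress() {}

    /** Returns a start-cell address in column A, on the first row after the skipped ones. */
    static String skipTopRows(String sheetName, int rowsToSkip) {
        if (rowsToSkip < 0) {
            throw new IllegalArgumentException("rowsToSkip must be >= 0");
        }
        // Excel rows are 1-based, so skipping n rows means starting at row n + 1.
        return "'" + sheetName + "'!A" + (rowsToSkip + 1);
    }

    public static void main(String[] args) {
        System.out.println(skipTopRows("Feuil1", 3)); // prints 'Feuil1'!A4
    }
}
```

The result can then be passed to the reader as .option("dataAddress", DataAddress.skipTopRows("Feuil1", 3)).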
