I'm reading an XLS file using spark-excel. My program expects a base format (the same columns and headers), but individual users may add extra columns that my program doesn't require.
Is there a way to define the schema, or at least the subset of columns I want, when reading the XLS, or should I simply infer the schema?
For example, I have this schema:
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val mySchema = StructType(Array(
StructField("Name", StringType),
StructField("Date", StringType),
StructField("Location", StringType)
))
val df = spark.read
.format("com.crealytics.spark.excel")
.option("dataAddress", s"'${sheet}'!A1")
.option("header", "true")
.option("treatEmptyValuesAsNulls", "true")
.option("setErrorCellsToFallbackValues", "true")
.option("usePlainNumberFormat", "false")
.option("addColorColumns", "true")
.schema(mySchema)
.load(xls)
This works provided Name, Date, and Location are columns 0 through 2 (in that order). However, if a user wants to keep track of someone's nickname and puts a Nickname column between Name and Date, the dataframe is busted, since the schema appears to be applied by position rather than by header name.
I'm fine with the "Nickname" column being in the dataframe; I just won't use it.
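For context, the fallback I'm considering is a minimal sketch along these lines, assuming spark-excel's inferSchema option behaves like the CSV reader's: infer the schema on read, then select and cast only the columns the program needs by name, so an extra column like Nickname is harmless wherever it lands.
import org.apache.spark.sql.functions.col

// Possible workaround (sketch): let spark-excel infer the schema,
// then pick out the required columns by name instead of by position.
val raw = spark.read
  .format("com.crealytics.spark.excel")
  .option("dataAddress", s"'${sheet}'!A1")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(xls)

// Extra user columns (e.g. "Nickname") are simply never selected.
val df = raw.select(
  col("Name").cast("string"),
  col("Date").cast("string"),
  col("Location").cast("string")
)
Even if name-based selection isn't supported at read time, this would at least decouple my program from the column order, though I'd prefer to express it in the schema if possible.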