I'm reading an XLS file using spark-excel. My program expects a base format (the same columns and headers), but individual users may add extra columns that my program doesn't require.
Is there a way to define the schema, or at least the subset of columns I want, when reading the XLS, or should I simply infer the schema?
For example, I have this schema:
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val mySchema = StructType(Array(
StructField("Name", StringType),
StructField("Date", StringType),
StructField("Location", StringType)
))
val df = spark.read
.format("com.crealytics.spark.excel")
.option("dataAddress", s"'${sheet}'!A1")
.option("header", "true")
.option("treatEmptyValuesAsNulls", "true")
.option("setErrorCellsToFallbackValues", "true")
.option("usePlainNumberFormat", "false")
.option("addColorColumns", "true")
.schema(mySchema)
.load(xls)
This works provided Name, Date, and Location are columns 0 through 2 (in that order). However, if a user wants to keep track of someone's nickname and puts a Nickname column between Name and Date, the dataframe is busted, since the schema appears to be applied by position rather than by header name.
I'm fine with the "Nickname" column being in the dataframe; I just won't use it.
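For context, the fallback I'm considering is a minimal sketch along these lines, assuming spark-excel's inferSchema option behaves like the CSV reader's: infer the schema on read, then select and cast only the columns the program needs by name, so an extra column like Nickname is harmless wherever it lands.
import org.apache.spark.sql.functions.col

// Possible workaround (sketch): let spark-excel infer the schema,
// then pick out the required columns by name instead of by position.
val raw = spark.read
  .format("com.crealytics.spark.excel")
  .option("dataAddress", s"'${sheet}'!A1")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(xls)

// Extra user columns (e.g. "Nickname") are simply never selected.
val df = raw.select(
  col("Name").cast("string"),
  col("Date").cast("string"),
  col("Location").cast("string")
)
Even if name-based selection isn't supported at read time, this would at least decouple my program from the column order, though I'd prefer to express it in the schema if possible.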