We have fairly well-settled source and target data, with Spark SQL in Scala used in between. In some cases the target schema is more restrictive than the source schema, but the business says the target schema is the more accurate one. At this point we cannot change the schemas. We could live with simply casting down, but in case the business people are wrong we would like some kind of sanity check: safe downcasts that do not silently truncate data.
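To illustrate what I mean by silent truncation, here is a minimal sketch with a made-up column name; as far as I can tell, with the default settings an out-of-range Long just wraps around when cast to Int, with no error or warning:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

val spark = SparkSession.builder().master("local[*]").appName("downcast-demo").getOrCreate()
import spark.implicits._

// Hypothetical column; the second value does not fit in an Int.
val source = Seq(42L, 3000000000L).toDF("amount")

// The "simple casting down" we could live with, except that the
// out-of-range value silently comes out as a wrapped negative number.
source.withColumn("amount", col("amount").cast(IntegerType)).show()
```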
We use DataFrames; the source and target are Parquet files, and at the very end we convert to strongly typed Datasets in Scala.
What would be the best approach to do this in a generic way? Is there anything that can be done at read time, with a schema that errors out instead of truncating? Or should we have some sort of UDFs and validate as we load the data, roughly along the lines of the sketch below?
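This is the kind of thing I have in mind, sketched with built-in casts and a round-trip comparison rather than a UDF; the helper name is my own and it assumes every target column exists in the source under the same name:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Hypothetical helper: cast every column to the type declared in
// `targetSchema`, but fail fast if any non-null value would not survive a
// round trip back to its source type (i.e. the cast yields null or the
// value changes, which would mean silent truncation).
def safeDowncast(df: DataFrame, targetSchema: StructType): DataFrame = {
  val sourceTypes = df.schema.fields.map(f => f.name -> f.dataType).toMap

  // One "lossy" predicate per column to be downcast.
  val lossyConditions = targetSchema.fields.map { field =>
    val srcType = sourceTypes(field.name)
    col(field.name).isNotNull &&
      (col(field.name).cast(field.dataType).isNull ||
        col(field.name).cast(field.dataType).cast(srcType) =!= col(field.name))
  }

  val lossyCount = df.filter(lossyConditions.reduce(_ || _)).count()
  require(lossyCount == 0, s"$lossyCount row(s) would be truncated by the downcast")

  // Only downcast once we know no row loses information.
  targetSchema.fields.foldLeft(df) { (acc, field) =>
    acc.withColumn(field.name, col(field.name).cast(field.dataType))
  }
}
```

I'm not sure whether this is the idiomatic way to do it, or whether it scans the data more often than necessary.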
It seems like a problem others must have faced, and I'm something of a novice with Spark. I'm just looking for an established practice so I don't have to reinvent the wheel.