Is there a method I can use to find out how a Transformer changes the schema without providing the data? For example, I have a large DataFrame, but I don't want to run it through the transformer; I just want to know the resulting schema transformation without using the full data.

o-0
1 Answer
Transformers are lazy (there is no fit stage), so even if you pass the data, there should be no significant delay.
However, all PipelineStages (which include both Transformers and Estimators) provide a transformSchema method, which can be called directly with a StructType as an argument. For example, if you have a StringIndexer like this one:
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer().setInputCol("foo").setOutputCol("foo_indexed")
and a schema like this one:
import org.apache.spark.sql.types._
val schema = StructType(Seq(StructField("foo", StringType)))
you can apply it as follows:
indexer.transformSchema(schema)
and get
org.apache.spark.sql.types.StructType = StructType(StructField(foo,StringType,true), StructField(foo_indexed,DoubleType,false))
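The same call should also work on a whole Pipeline, since a Pipeline is itself a PipelineStage and passes the schema through its stages in order. Below is a minimal sketch, assuming spark-mllib is on the classpath; the text column and the Tokenizer stage are hypothetical additions for illustration and are not part of the question. With an existing DataFrame you would normally pass df.schema instead of building the StructType by hand, which reads only metadata and should not need to scan the data.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{StringIndexer, Tokenizer}
import org.apache.spark.sql.types._

// Hypothetical input schema; with a real DataFrame, use df.schema here.
val schema = StructType(Seq(
  StructField("foo", StringType),
  StructField("text", StringType)
))

// Stages are only configured here; nothing is fitted or transformed.
val indexer = new StringIndexer().setInputCol("foo").setOutputCol("foo_indexed")
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")

val pipeline = new Pipeline().setStages(Array[PipelineStage](indexer, tokenizer))

// Derives the output schema from the input schema alone.
val outSchema = pipeline.transformSchema(schema)
outSchema.printTreeString()
```

Because transformSchema also validates the input columns, a missing or wrongly typed input column should surface here as an exception, before any data is processed.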

user10938362