
Is there a way to find out how a Transformer changes the schema without providing the data? For example, I have a large DataFrame, but I don't want to run it through the transformer; I only want to see the resulting schema transformation.

o-0

1 Answer


Transformers are lazy (there is no fit stage), so even if you pass the data, there should be no significant delay.

However, all PipelineStages (which include both Transformers and Estimators) provide a transformSchema method, which can be called directly with a StructType as its argument. For example, if you have a StringIndexer like this one

import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer().setInputCol("foo").setOutputCol("foo_indexed")

and schema like this one

import org.apache.spark.sql.types._

val schema = StructType(Seq(StructField("foo", StringType)))

you can apply it as follows:

indexer.transformSchema(schema)

and get

org.apache.spark.sql.types.StructType = StructType(StructField(foo,StringType,true), StructField(foo_indexed,DoubleType,false))
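
The same idea should extend to a whole Pipeline, since a Pipeline is itself a PipelineStage and passes the schema through its stages in order. Below is a minimal sketch, not a definitive recipe: the second column "bar", the VectorAssembler stage, and the "features" output column are made up for illustration. With a real DataFrame you could pass df.schema instead of building the StructType by hand; reading the schema does not process the data.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.types._

// Hypothetical input schema; for an existing DataFrame `df`, use `df.schema` instead.
val inputSchema = StructType(Seq(
  StructField("foo", StringType),
  StructField("bar", DoubleType)
))

// Same indexer as above.
val indexer = new StringIndexer().setInputCol("foo").setOutputCol("foo_indexed")

// Illustrative second stage that consumes the indexer's output column.
val assembler = new VectorAssembler()
  .setInputCols(Array("foo_indexed", "bar"))
  .setOutputCol("features")

val pipeline = new Pipeline().setStages(Array(indexer, assembler))

// Only the schema is passed through the stages; no data is touched.
val outputSchema = pipeline.transformSchema(inputSchema)

outputSchema.printTreeString()
```

If a stage refers to a column that is missing from the schema, or one with an unsupported type, transformSchema should fail with an exception, so this also works as a cheap sanity check before running anything on the full data.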
user10938362