
When I apply a StringIndexer to df_notnull (a DataFrame), which contains the following columns:

scala> df_notnull.printSchema
root
 |-- L0_S22_F545: string (nullable = true)
 |-- L0_S0_F0: double (nullable = true)
 |-- L0_S0_F2: double (nullable = true)
 |-- L0_S0_F4: double (nullable = true)

Only these columns are left afterwards:

scala> indexed.printSchema
root
 |-- L0_S22_F545: string (nullable = true)
 |-- L0_S22_F545Index: double (nullable = true)

This is my code:

:paste
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val indexer = new StringIndexer()
  .setInputCol("L0_S22_F545")
  .setOutputCol("L0_S22_F545Index")

val indexed = indexer.fit(df_notnull).transform(df_notnull)
indexed.printSchema

I want to keep all columns and only add some new ones. What am I doing wrong?

Romeo Kienzler

1 Answer


Found the solution here. The transformers should not be used standalone but as stages of a Pipeline; in my case the original columns were then preserved:

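import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.OneHotEncoder

// One-hot encode the index column produced by `indexer` (defined in the question).
// Column names are taken from the output shown below.
val encoder = new OneHotEncoder()
  .setInputCol("L0_S22_F545Index")
  .setOutputCol("L0_S22_F545Vec")

val transformers = Array[PipelineStage](
    indexer,
    encoder
)

// Fit all stages on the data, then apply them in order.
val pipeline = new Pipeline().setStages(transformers).fit(df_notnull)

val transformed = pipeline.transform(df_notnull)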

This is what the result looks like:

scala> transformed.show
+-----------+--------+--------+--------+----------------+--------------+        
|L0_S22_F545|L0_S0_F0|L0_S0_F2|L0_S0_F4|L0_S22_F545Index|L0_S22_F545Vec|
+-----------+--------+--------+--------+----------------+--------------+
|         NA|    0.03|  -0.034|  -0.197|             0.0|(13,[0],[1.0])|
|         NA|     0.0|     0.0|     0.0|             0.0|(13,[0],[1.0])|
|         NA|   0.088|   0.086|   0.003|             0.0|(13,[0],[1.0])|
|         NA|  -0.036|  -0.064|   0.294|             0.0|(13,[0],[1.0])|
|         NA|  -0.055|  -0.086|   0.294|             0.0|(13,[0],[1.0])|
|         NA|   0.003|   0.019|   0.294|             0.0|(13,[0],[1.0])|
|         NA|     0.0|     0.0|     0.0|             0.0|(13,[0],[1.0])|
|         NA|     0.0|     0.0|     0.0|             0.0|(13,[0],[1.0])|
|         NA|  -0.016|  -0.041|  -0.179|             0.0|(13,[0],[1.0])|
|         NA|     0.0|     0.0|     0.0|             0.0|(13,[0],[1.0])|
|         NA|   0.016|   0.093|  -0.015|             0.0|(13,[0],[1.0])|
|         NA|  -0.062|  -0.153|  -0.197|             0.0|(13,[0],[1.0])|
|         NA|  -0.075|  -0.093|   0.367|             0.0|(13,[0],[1.0])|
|         NA|  -0.003|  -0.093|  -0.161|             0.0|(13,[0],[1.0])|
|         NA|  -0.016|  -0.138|  -0.197|             0.0|(13,[0],[1.0])|
|         NA|   0.252|    0.25|   0.003|             0.0|(13,[0],[1.0])|
|         NA|     0.0|     0.0|     0.0|             0.0|(13,[0],[1.0])|
|         NA|  -0.016|  -0.041|   0.003|             0.0|(13,[0],[1.0])|
|         NA|     0.0|     0.0|     0.0|             0.0|(13,[0],[1.0])|
|         NA|   0.088|   0.033|    0.33|             0.0|(13,[0],[1.0])|
+-----------+--------+--------+--------+----------------+--------------+
only showing top 20 rows
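
To check on a small scale whether the original columns survive the pipeline, something like the following sketch could be used. It assumes a local SparkSession and a made-up three-row stand-in for df_notnull (the names `df`, `model`, `out` and `missing` and the sample values are illustrative, not from the original data); in spark-shell the `spark` session already exists and the builder line can be dropped.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// Local session for experimentation only (assumption: Spark is on the classpath).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("keep-columns-sketch")
  .getOrCreate()
import spark.implicits._

// Toy stand-in for df_notnull; the values are invented for illustration.
val df = Seq(
  ("NA", 0.03, -0.034, -0.197),
  ("T1", 0.0, 0.0, 0.0),
  ("T2", 0.088, 0.086, 0.003)
).toDF("L0_S22_F545", "L0_S0_F0", "L0_S0_F2", "L0_S0_F4")

val indexer = new StringIndexer()
  .setInputCol("L0_S22_F545")
  .setOutputCol("L0_S22_F545Index")

val encoder = new OneHotEncoder()
  .setInputCol("L0_S22_F545Index")
  .setOutputCol("L0_S22_F545Vec")

// Fit both stages, then transform the same DataFrame.
val model = new Pipeline()
  .setStages(Array[PipelineStage](indexer, encoder))
  .fit(df)
val out = model.transform(df)

// List any input column that is absent from the transformed DataFrame.
val missing = df.columns.filterNot(c => out.columns.contains(c))
println(s"missing columns: ${missing.mkString(", ")}")
out.printSchema()
```

If `missing` comes back empty here but columns still disappear with the real data, it may be worth comparing `df_notnull.columns` with the columns of the DataFrame that was actually passed to `transform`.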
Romeo Kienzler