15
Dataset<Row> dataFrame = ... ;   
StringIndexerModel labelIndexer = new StringIndexer()
               .setInputCol("label")
               .setOutputCol("indexedLabel")
               .fit(dataFrame);

 VectorIndexerModel featureIndexer = new VectorIndexer()
               .setInputCol("s")
               .setOutputCol("indexedFeatures")
               .setMaxCategories(4)
               .fit(dataFrame);
IndexToString labelConverter = new IndexToString()
               .setInputCol("prediction")
               .setOutputCol("predictedLabel")
               .setLabels(labelIndexer.labels());

What is StringIndexer, VectorIndexer, IndexToString and what is the difference between them? How and When should I use them?

2 Answers2

16

String Indexer - Use it if you want the Machine Learning algorithm to identify column as categorical variable or if want to convert the textual data to numeric data keeping the categorical context.

e,g converting days(Monday, Tuesday...) to numeric representation.

Vector Indexer- use this if we do not know the types of data incoming. so we leave the logic of differentiating between categorical and non categorical data to the algorithm using Vector Indexer.

e,g - Data coming from 3rd Party API, where data is hidden and is ingested directly to the training model.

Indexer to string- just opposite of String indexer, use this if the final output column was indexed using String Indexer and now we want to convert back its numeric representation to textual so that result can be understood better.

Faisal Ahmad
  • 161
  • 1
  • 2
10

I know only about those two:

StringIndexer and VectorIndexer

StringIndexer:

  • converts a single column to an index column (similar to a factor column in R)

VectorIndexer:

  • is used to index categorical predictors in a featuresCol column. Remember that featuresCol is a single column consisting of vectors (refer to featuresCol and labelCol). Each row is a vector which contains values from each predictors.
  • if you have string type predictors, you will first need to use index those columns with StringIndexer. featuresCol contains vectors, and vectors does not contain string values.

Take a look here for example: https://mingchen0919.github.io/learning-apache-spark/StringIndexer-and-VectorIndexer.html

Amit Haim
  • 273
  • 1
  • 3
  • 14