I am trying to implement a document classifier using Apache Spark MLlib, and I am having some problems representing the data. My code is the following:

import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.mllib.regression.LabeledPoint

val sql = new SQLContext(sc)

// Load raw data from a TSV file
val raw = sc.textFile("data.tsv").map(_.split("\t").toSeq)

// Convert the RDD to a dataframe
val schema = StructType(List(StructField("class", StringType), StructField("content", StringType)))
val dataframe = sql.createDataFrame(raw.map(row => Row(row(0), row(1))), schema)

// Tokenize
val tokenizer = new Tokenizer().setInputCol("content").setOutputCol("tokens")
val tokenized = tokenizer.transform(dataframe)

// TF-IDF
val htf = new HashingTF().setInputCol("tokens").setOutputCol("rawFeatures").setNumFeatures(500)
val tf = htf.transform(tokenized)
tf.cache
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(tf)
val tfidf = idfModel.transform(tf)

// Create labeled points (this is where it breaks: row.get(4) returns Any, not Vector)
val labeled = tfidf.map(row => LabeledPoint(row.getDouble(0), row.get(4)))

I need to use DataFrames to generate the tokens and create the TF-IDF features. The problem appears when I try to convert this dataframe to an RDD[LabeledPoint]: I map over the dataframe rows, but the get method of Row returns Any, not the type defined in the dataframe schema (Vector). Therefore, I cannot construct the RDD I need to train an ML model.

What is the best option to get an RDD[LabeledPoint] after calculating TF-IDF?

– Miguel

2 Answers


Casting the object worked for me.

Try:

// Create labeled points
// Import Spark's Vector explicitly so the cast does not resolve to Scala's built-in Vector
import org.apache.spark.mllib.linalg.Vector

val labeled = tfidf.map(row => LabeledPoint(row.getDouble(0), row(4).asInstanceOf[Vector]))
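For completeness, a minimal sketch of how the resulting RDD[LabeledPoint] might then be used. NaiveBayes is only an illustrative choice here (it is not part of the original question); any MLlib algorithm that trains on an RDD[LabeledPoint] would do:

import org.apache.spark.mllib.classification.NaiveBayes

// Train a classifier on the labeled TF-IDF vectors; Naive Bayes is applicable
// here because TF-IDF features are non-negative
val model = NaiveBayes.train(labeled)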
– zzztimbo

You need to use `getAs[T](i: Int): T`:

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Create labeled points
val labeled = tfidf.map(row => LabeledPoint(row.getDouble(0), row.getAs[Vector](4)))
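For what it's worth, Row also has a name-based overload, getAs[T](fieldName: String), in newer Spark versions, which avoids hard-coding column positions. A sketch building on the code above; note that the question's schema declares "class" as StringType, so the label likely needs an explicit conversion as well:

val labeled = tfidf.map { row =>
  // Look the columns up by name; "class" is a string in the question's schema
  LabeledPoint(row.getAs[String]("class").toDouble, row.getAs[Vector]("features"))
}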
– Chris
    I get this error: error: kinds of the type arguments (Vector) do not conform to the expected kinds of the type parameters (type T). Vector's type parameters do not match type T's expected parameters: type Vector has one type parameter, but type T has none – Miguel Jun 21 '15 at 10:03
    @Miguel I got the same error and found a good fix from [here](https://community.hortonworks.com/questions/6020/type-error-when-attempting-linear-regression.html) You need to import the Spark Vector class explicitly since Scala imports its in-built Vector type by default. `import org.apache.spark.mllib.linalg.{Vector, Vectors}` and then Chris's code will work. – Ben Jan 28 '16 at 22:46