
Each row of a Spark DataFrame df contains a tab-separated string in a column rawFV. I already know that splitting on tabs yields an array of 3 strings for every row. This can be verified by:

df.map(row => row.getAs[String]("rawFV").split("\t").length != 3).filter(identity).count()

and making sure that the count is indeed 0.

My question is: how can I perform this check using the pipeline API?

Here's what I tried:

val tabTok = new RegexTokenizer().setInputCol("rawFV").setOutputCol("tk").setPattern("\t")
val pipeline = new Pipeline().setStages(Array(tabTok))
val transf = pipeline.fit(df)
val df2 = transf.transform(df)
df2.map(row => row.getAs[Seq[String]]("tk").length != 3).filter(identity).count()

which is NOT equal to 0.

The issue has to do with the presence of missing values. For example, consider a tab-separated file with these two lines:

a\ta\ta
b\t\tb

The pipeline code with RegexTokenizer returns 3 fields for the first line but only 2 for the second, because the empty middle field is dropped. The plain split check above, on the other hand, correctly returns 3 fields for both lines.
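
For reference, here is a minimal sketch that reproduces the discrepancy (a sketch assuming a Spark 1.6 shell where sqlContext is in scope; the df and rawFV names match the question):

import org.apache.spark.ml.feature.RegexTokenizer

// Two rows; the second has an empty field between consecutive tabs.
val df = sqlContext.createDataFrame(Seq(
  Tuple1("a\ta\ta"),
  Tuple1("b\t\tb")
)).toDF("rawFV")

// A plain split keeps the empty middle field: both rows yield 3 elements.
df.map(row => row.getAs[String]("rawFV").split("\t").length).collect()
// Array(3, 3)

// The tokenizer drops tokens shorter than the default minTokenLength = 1,
// so the empty field in the second row disappears.
val tabTok = new RegexTokenizer()
  .setInputCol("rawFV")
  .setOutputCol("tk")
  .setPattern("\t")
tabTok.transform(df).map(row => row.getAs[Seq[String]]("tk").length).collect()
// Array(3, 2)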

  • It would be much more useful if you provide example data which can be used to reproduce the problem. – zero323 Jan 06 '16 at 13:26
  • It is related to the presence of missing values. For example, if you have a tab-separated file like this: "a\ta\ta\nb\t\tb". I would get 3 fields on the first line but only 2 in the second – ranlot Jan 06 '16 at 14:10
  • Could you add this to the question? – zero323 Jan 06 '16 at 14:13

1 Answer


This is expected behavior. By default, the minTokenLength parameter is equal to 1, to avoid empty strings in the output. If you want to keep empty strings, set it to 0:

new RegexTokenizer()
  .setInputCol("rawFV")
  .setOutputCol("tk")
  .setPattern("\t")
  .setMinTokenLength(0) // keep empty strings produced by consecutive tabs
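
With minTokenLength set to 0, the empty strings between consecutive tabs are kept, so the check from the question should return 0. A minimal verification sketch, under the same assumptions as above:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.RegexTokenizer

val fixedTok = new RegexTokenizer()
  .setInputCol("rawFV")
  .setOutputCol("tk")
  .setPattern("\t")
  .setMinTokenLength(0)

val pipeline = new Pipeline().setStages(Array(fixedTok))
pipeline.fit(df).transform(df)
  .map(row => row.getAs[Seq[String]]("tk").length != 3)
  .filter(identity)
  .count()
// 0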