
Each row of a Spark DataFrame df contains a tab-separated string in a column rawFV. I already know that splitting on tabs yields an array of 3 strings for every row. This can be verified by:

df.map(row => row.getAs[String]("rawFV").split("\t").length != 3).filter(identity).count()

and making sure that the count is indeed 0.

My question is: how can I perform this check using the pipeline API?

Here's what I tried:

val tabTok = new RegexTokenizer().setInputCol("rawFV").setOutputCol("tk").setPattern("\t")
val pipeline = new Pipeline().setStages(Array(tabTok))
val transf = pipeline.fit(df)
val df2 = transf.transform(df)
df2.map(row => row.getAs[Seq[String]]("tk").length != 3).filter(identity).count()

which is NOT equal to 0.

The issue has to do with the presence of missing values. For example, consider a tab-separated file with these two lines:

a\ta\ta
b\t\tb

The pipeline code with RegexTokenizer returns 3 fields for the first line but only 2 for the second, because the empty middle field is dropped. The plain split check above, on the other hand, correctly returns 3 fields for both lines.
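
For reference, here is a minimal sketch that reproduces the discrepancy (a sketch assuming a Spark 1.6 shell where sqlContext is in scope; the df and rawFV names match the question):

import org.apache.spark.ml.feature.RegexTokenizer

// Two rows; the second has an empty field between consecutive tabs.
val df = sqlContext.createDataFrame(Seq(
  Tuple1("a\ta\ta"),
  Tuple1("b\t\tb")
)).toDF("rawFV")

// A plain split keeps the empty middle field: both rows yield 3 elements.
df.map(row => row.getAs[String]("rawFV").split("\t").length).collect()
// Array(3, 3)

// The tokenizer drops tokens shorter than the default minTokenLength = 1,
// so the empty field in the second row disappears.
val tabTok = new RegexTokenizer()
  .setInputCol("rawFV")
  .setOutputCol("tk")
  .setPattern("\t")
tabTok.transform(df).map(row => row.getAs[Seq[String]]("tk").length).collect()
// Array(3, 2)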

  • It would be much more useful if you provide example data which can be used to reproduce the problem. – zero323 Jan 06 '16 at 13:26
  • It is related to the presence of missing values. For example, if you have a tab-separated file like this: "a\ta\ta\nb\t\tb". I would get 3 fields on the first line but only 2 in the second – ranlot Jan 06 '16 at 14:10
  • Could you add this to the question? – zero323 Jan 06 '16 at 14:13

1 Answer


This is expected behavior. By default, the minTokenLength parameter is equal to 1, to avoid empty strings in the output. If you want to keep empty strings, set it to 0:

new RegexTokenizer()
  .setInputCol("rawFV")
  .setOutputCol("tk")
  .setPattern("\t")
  .setMinTokenLength(0) // keep empty strings produced by consecutive tabs
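
With minTokenLength set to 0, the empty strings between consecutive tabs are kept, so the check from the question should return 0. A minimal verification sketch, under the same assumptions as above:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.RegexTokenizer

val fixedTok = new RegexTokenizer()
  .setInputCol("rawFV")
  .setOutputCol("tk")
  .setPattern("\t")
  .setMinTokenLength(0)

val pipeline = new Pipeline().setStages(Array(fixedTok))
pipeline.fit(df).transform(df)
  .map(row => row.getAs[Seq[String]]("tk").length != 3)
  .filter(identity)
  .count()
// 0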