If I have a dataframe with fields ['did','doc'], such as:
# sc is an existing SparkContext (e.g. the one the pyspark shell provides)
data = sc.parallelize(['This is a test',
                       'This is also a test',
                       'These sentence are tests',
                       'This tests these sentences'])\
          .zipWithIndex()\
          .map(lambda x: (x[1], x[0]))\
          .toDF(['did', 'doc'])
data.show()
+---+--------------------+
|did|                 doc|
+---+--------------------+
|  0|      This is a test|
|  1| This is also a test|
|  2|These sentence ar...|
|  3|This tests these ...|
+---+--------------------+
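For what it's worth, I think the same starting dataframe could also be built directly with createDataFrame (assuming a SparkSession is available as spark), so the RDD construction above isn't essential:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# The same four documents, built without the RDD round trip
data = spark.createDataFrame(
    [(0, 'This is a test'),
     (1, 'This is also a test'),
     (2, 'These sentence are tests'),
     (3, 'This tests these sentences')],
    ['did', 'doc'])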
Then I do some transformations on that dataframe, like tokenizing and finding 2-grams:
from pyspark.ml.feature import NGram, Tokenizer
# Tokenize each document, then build 2-grams from the tokens
data = Tokenizer(inputCol='doc', outputCol='words').transform(data)
data = NGram(n=2, inputCol='words', outputCol='grams').transform(data)
data.show()
+---+--------------------+--------------------+--------------------+
|did| doc| words| grams|
+---+--------------------+--------------------+--------------------+
| 0| This is a test| [this, is, a, test]|[this is, is a, a...|
| 1| This is also a test|[this, is, also, ...|[this is, is also...|
| 2|These sentence ar...|[these, sentence,...|[these sentence, ...|
| 3|This tests these ...|[this, tests, the...|[this tests, test...|
+---+--------------------+--------------------+--------------------+
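At this point both new columns are arrays of strings, which printSchema confirms (output reproduced roughly from memory; the exact nullability flags may differ):
data.printSchema()
# root
#  |-- did: long (nullable = true)
#  |-- doc: string (nullable = true)
#  |-- words: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- grams: array (nullable = true)
#  |    |-- element: string (containsNull = false)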
Then, at the end, I want to combine the 2-grams and the words into a single column of features with a VectorAssembler:
from pyspark.ml.feature import VectorAssembler
data = VectorAssembler(inputCols=['words', 'grams'],
                       outputCol='features').transform(data)
Then I get the following error:
Py4JJavaError: An error occurred while calling o504.transform.
: java.lang.IllegalArgumentException: Data type ArrayType(StringType,true) is not supported.
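For what it's worth, the assembler itself seems fine with numeric columns, so it really is the array-of-strings type it objects to (quick sanity check on a throwaway dataframe; demo, x, and y are just placeholder names):
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Two numeric columns assemble without complaint
demo = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ['x', 'y'])
VectorAssembler(inputCols=['x', 'y'], outputCol='features').transform(demo).show()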
So the problem is that the VectorAssembler doesn't work with arrays of strings. To get around that, I can drop down to the RDD, map each row so the two arrays are concatenated, and convert it back into a dataframe, like so:
from pyspark.sql import Row
data = data.rdd.map(lambda x: Row(did=x['did'],
                                  features=x['words'] + x['grams']))\
              .toDF(['did', 'features'])
That's not a problem for this tiny dataset, but it is prohibitively expensive for a large one.
Is there any way to achieve this more efficiently than the above?