
If I have a dataframe with fields ['did','doc'] such as

data = sc.parallelize(['This is a test',
                   'This is also a test',
                   'These sentence are tests',
                   'This tests these sentences'])\
         .zipWithIndex()\
         .map(lambda x: (x[1],x[0]))\
         .toDF(['did','doc'])
data.show()
+---+--------------------+
|did|                 doc|
+---+--------------------+
|  0|      This is a test|
|  1| This is also a test|
|  2|These sentence ar...|
|  3|This tests these ...|
+---+--------------------+

and I do some transformations on that document like tokenizing and finding 2-grams:

from pyspark.ml.feature import Tokenizer, NGram

data = Tokenizer(inputCol='doc', outputCol='words').transform(data)
data = NGram(n=2, inputCol='words', outputCol='grams').transform(data)
data.show()
+---+--------------------+--------------------+--------------------+
|did|                 doc|               words|               grams|
+---+--------------------+--------------------+--------------------+
|  0|      This is a test| [this, is, a, test]|[this is, is a, a...|
|  1| This is also a test|[this, is, also, ...|[this is, is also...|
|  2|These sentence ar...|[these, sentence,...|[these sentence, ...|
|  3|This tests these ...|[this, tests, the...|[this tests, test...|
+---+--------------------+--------------------+--------------------+

then at the end I want to combine the two-grams and words into a single column of features with a VectorAssembler:

from pyspark.ml.feature import VectorAssembler

data = VectorAssembler(inputCols=['words', 'grams'],
                       outputCol='features').transform(data)

then I get the following error:

Py4JJavaError: An error occurred while calling o504.transform.
: java.lang.IllegalArgumentException: Data type ArrayType(StringType,true) is not supported.

because VectorAssembler doesn't work with arrays of strings. To get around that, I can drop the DataFrame down to an RDD, map the RDD to appropriate rows, and build it back up into a DataFrame:

from pyspark.sql import Row

data = data.rdd.map(lambda x: Row(did=x['did'],
                                  features=x['words'] + x['grams'])) \
               .toDF(['did', 'features'])

That's not a problem for this tiny dataset, but it is prohibitively expensive for a large one.

Is there any way to achieve this more efficiently than the above?


1 Answer


You can use a udf to create the features column like this:

import pyspark.sql.functions as f
import pyspark.sql.types as t


udf_add = f.udf(lambda x, y: x + y, t.ArrayType(t.StringType()))
data.withColumn('features', udf_add('words', 'grams')).select('features').show()

[Row(features=['this', 'is', 'a', 'test', 'this is', 'is a', 'a test']),
Row(features=['this', 'is', 'also', 'a', 'test', 'this is', 'is also', 'also a', 'a test']),
Row(features=['these', 'sentence', 'are', 'tests', 'these sentence', 'sentence are', 'are tests']),
Row(features=['this', 'tests', 'these', 'sentences', 'this tests', 'tests these', 'these sentences'])]
  • This will achieve it, but udf is pretty slow as well. Is it actually faster than dragging the data from dataframe to rdd, map and zipping it back up to a dataframe again? – nbk Feb 17 '18 at 17:51
  • I haven't checked the solution for speed. As a general rule, DataFrame operations are supposed to be much faster than RDD operations. – pauli Feb 18 '18 at 02:28
  • May I ask you to check a similar [question](https://stackoverflow.com/questions/69195968/how-can-reach-the-list-of-characters-using-the-bigram-n-gram-algorithm-in-pyspar) kindly? – Mario Sep 15 '21 at 16:16