I have a DataFrame consisting of texts and their languages:
sf = spark.createDataFrame([
    ('eng', 'I saw the red balloon'),
    ('eng', 'She was drinking tea from a black mug'),
    ('ger', 'Er ging heute sehr weit'),
    ('ger', 'Ich habe dich seit hundert Jahren nicht mehr gesehen')
], ["lang", "text"])
sf.show()
Output:
+----+--------------------+
|lang| text|
+----+--------------------+
| eng|I saw the red bal...|
| eng|She was drinking ...|
| ger|Er ging heute seh...|
| ger|Ich habe dich sei...|
+----+--------------------+
I want to remove the stop words from each text. For this I create a dictionary mapping each language code to its stop word list:
from pyspark.ml.feature import StopWordsRemover
ger_stopwords = StopWordsRemover.loadDefaultStopWords("german")
eng_stopwords = StopWordsRemover.loadDefaultStopWords("english")
stopwords = {'eng': eng_stopwords,
             'ger': ger_stopwords}
Now I don't understand how to apply the matching stop word list to col('text') using a UDF. StopWordsRemover.transform() won't work for me here, because the stop word list has to change per row depending on the language.
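Something along these lines is what I'm imagining: broadcast the dictionary and do the filtering in a plain Python UDF (the names stopwords_bc, remove_stopwords and clean_text are just ones I made up for this sketch), but I'm not sure it's correct or the right way to do it:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Ship the per-language stop word lists to the executors once
stopwords_bc = spark.sparkContext.broadcast({k: set(v) for k, v in stopwords.items()})

@F.udf(returnType=StringType())
def remove_stopwords(lang, text):
    # Pick the stop word set for this row's language; leave the text untouched if the language is unknown
    words = stopwords_bc.value.get(lang)
    if words is None:
        return text
    return " ".join(w for w in text.split() if w.lower() not in words)

result = sf.withColumn("clean_text", remove_stopwords(F.col("lang"), F.col("text")))
result.show(truncate=False)

Is a UDF like this the right approach, or is there a better way to apply a different stop word list per language?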