I am working with a PySpark DataFrame. My df looks like this:

df.select('words').show(5, truncate=130)
+----------------------------------------------------------------------------------------------------------------------------------+
| words |
+----------------------------------------------------------------------------------------------------------------------------------+
|[content, type, multipart, alternative, boundary, nextpart, da, df, nextpart, da, df, content, type, text, plain, charset, asci...|
|[receive, ameurht, eop, eur, prod, protection, outlook, com, cyprmb, namprd, prod, outlook, com, https, via, cyprca, namprd, pr...|
|[plus, every, photographer, need, mm, lens, digital, photography, school, email, newsletter, http, click, aweber, com, ct, l, m...|
|[content, type, multipart, alternative, boundary, nextpart, da, beb, nextpart, da, beb, content, type, text, plain, charset, as...|
|[original, message, customer, service, mailto, ilpjmwofnst, qssadxnvrvc, narrig, stepmotherr, eviews, com, send, thursday, dece...|
+----------------------------------------------------------------------------------------------------------------------------------+
only showing top 5 rows
I need to use LanguageDetectorDL from Spark NLP on the words column, which is of type array<string>, so that it detects English and keeps only the English words, removing the others.
I have already used DocumentAssembler() to transform the data into annotation format:

documentAssembler = DocumentAssembler().setInputCol('words').setOutputCol('document')
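As far as I can tell from the docs, DocumentAssembler expects a plain string column rather than array<string>, so I also tried joining the tokens back into a single string first (words_str is just a name I made up):

from pyspark.sql import functions as F
from sparknlp.base import DocumentAssembler

# join the token array into one string, since DocumentAssembler
# seems to want a string column, not array<string>
df = df.withColumn('words_str', F.concat_ws(' ', 'words'))

documentAssembler = DocumentAssembler() \
    .setInputCol('words_str') \
    .setOutputCol('document')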
But I am not sure how to use LanguageDetectorDL on this column and get rid of the non-English words.
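This is a rough sketch of what I think the pipeline should look like, based on the Spark NLP docs (the output column name 'language' is my own choice, and I am not sure the default pretrained model is the right one here):

from pyspark.ml import Pipeline
from sparknlp.annotator import LanguageDetectorDL

# default pretrained multi-language detector (downloads on first use)
languageDetector = LanguageDetectorDL.pretrained() \
    .setInputCols(['document']) \
    .setOutputCol('language')

pipeline = Pipeline(stages=[documentAssembler, languageDetector])
result = pipeline.fit(df).transform(df)

# detected language code per row, e.g. 'en'
result.select('language.result').show(5, truncate=False)

But this seems to give one language label per row/document, and I don't see how to go from that to dropping the individual non-English words from the words array.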