
I am working with a PySpark DataFrame.
I have a df that looks like this:

df.select('words').show(5, truncate=130)

+----------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                   words          |
+----------------------------------------------------------------------------------------------------------------------------------+
|[content, type, multipart, alternative, boundary, nextpart, da, df, nextpart, da, df, content, type, text, plain, charset, asci...|
|[receive, ameurht, eop, eur, prod, protection, outlook, com, cyprmb, namprd, prod, outlook, com, https, via, cyprca, namprd, pr...|
|[plus, every, photographer, need, mm, lens, digital, photography, school, email, newsletter, http, click, aweber, com, ct, l, m...|
|[content, type, multipart, alternative, boundary, nextpart, da, beb, nextpart, da, beb, content, type, text, plain, charset, as...|
|[original, message, customer, service, mailto, ilpjmwofnst, qssadxnvrvc, narrig, stepmotherr, eviews, com, send, thursday, dece...|
+----------------------------------------------------------------------------------------------------------------------------------+
only showing top 5 rows

I need to use LanguageDetectorDL from Spark NLP on the words column, which is of type array<string>, so that it detects English and keeps only the English words, removing the others.

I have already used DocumentAssembler() to transform the data into the annotation format:

documentAssembler = DocumentAssembler().setInputCol('words').setOutputCol('document')

But I am not sure how to use LanguageDetectorDL on this column to get rid of the non-English words.

Samiksha
  • it's very hard to detect language on the word level - a single word isn't enough information for reliable detection... – Alex Ott Apr 04 '21 at 10:58
  • @AlexOtt thank you for looking into it. If I convert the array back to a string, i.e. if I have sentences, is it possible then? – Samiksha Apr 04 '21 at 14:26
  • for instance, if there is a sentence `protection outlook com cyprmb namprd prod outlook com`, is it possible to get `protection outlook com prod outlook com`? – Samiksha Apr 04 '21 at 14:41
  • I'm not sure that it's possible - only if you use some kind of dictionary lookup or something like that. – Alex Ott Apr 04 '21 at 17:46
  • but why do you need that? if you're doing some kind of text classification or similar, then most probably these words will have very low TF-IDF (or another measure) and will be excluded from the "dictionary" anyway (a sketch of this follows below) – Alex Ott Apr 04 '21 at 17:46
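A minimal sketch of the point in the last comment, using PySpark's CountVectorizer (the minDF threshold of 5.0 is an assumption; tune it to your corpus). Tokens that appear in fewer than minDF documents never enter the vocabulary, so random junk strings drop out automatically:

from pyspark.ml.feature import CountVectorizer

# sketch only: exclude rare junk tokens from the vocabulary;
# minDF=5.0 means a token must appear in at least 5 documents to be kept
cv = CountVectorizer(inputCol='words', outputCol='features', minDF=5.0)
cv_model = cv.fit(df)

# junk strings such as 'ghurbgiueg' should be absent from the vocabulary
print(cv_model.vocabulary[:20])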

1 Answer


The language detector in Spark NLP works at the character level, which means it doesn't use a dictionary to match words. It will definitely work better if you provide entire sentences, but it should perform acceptably well if you just pass a large string of concatenated tokens in the language you want to detect. For example, with this pretrained model that detects 21 different languages:

from sparknlp.pretrained import PretrainedPipeline

# download a pretrained pipeline that detects 21 languages
language_detector_pipeline = PretrainedPipeline('detect_language_21', lang='xx')

# annotate a single string; the result includes the detected language code
language_detector_pipeline.annotate("«Нападение на 13-й участок»")


{'document': ['«Нападение на 13-й участок»'],
 'sentence': ['«Нападение на 13-й участок»'],
 'language': ['bg']}

Check that the languages you will be working with are among the ones supported by the model:

https://nlp.johnsnowlabs.com/2020/12/05/detect_language_21_xx.html

and also make sure you pass a string around 150 characters long to give the model a better chance of returning a good answer.
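
For the DataFrame in the question, a minimal sketch of that approach might look like this (array_join is the assumed way to flatten the token array, and text is the input column name the pretrained pipeline expects):

from pyspark.sql import functions as F
from sparknlp.pretrained import PretrainedPipeline

language_detector_pipeline = PretrainedPipeline('detect_language_21', lang='xx')

# the pretrained pipeline reads a string column named 'text',
# so join the token array into one space-separated string first
df_text = df.withColumn('text', F.array_join(F.col('words'), ' '))

result = language_detector_pipeline.transform(df_text)
result.select('text', 'language.result').show(5, truncate=60)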

AlbertoAndreotti
  • You'll have to convert the array of strings in your dataframe to a big string separated by spaces and pass that as the input of the pipeline. – AlbertoAndreotti Apr 05 '21 at 18:38
  • thank you for looking into it. If I pass a long English sentence that has some junk words in it, for example `ghurbgiueg, zasa, fjhirgre`, the whole sentence will still be classified as English because the model is trained on entire sentences, not just a few words, if I am right. My doubt is whether it is possible to detect language at the word level, somehow map each word to a language, and remove the non-English ones with a filter? (a sketch of the dictionary-lookup approach follows below) – Samiksha Apr 05 '21 at 19:47
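
Following up on the word-level question in the last comment: the language detector itself won't give per-word labels, but the dictionary-lookup idea from the comments under the question could be sketched like this (english_words and words.txt are hypothetical; you would have to supply a real English word list, e.g. /usr/share/dict/words or the NLTK words corpus):

from pyspark.sql import functions as F

# hypothetical word list loaded into a set and broadcast to the executors;
# 'spark' is assumed to be an existing SparkSession
english_words = set(line.strip().lower() for line in open('words.txt'))
bc_words = spark.sparkContext.broadcast(english_words)

@F.udf('array<string>')
def keep_english(tokens):
    # keep only tokens found in the English word list
    return [t for t in tokens if t in bc_words.value]

df_filtered = df.withColumn('words', keep_english(F.col('words')))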