
I am working with a PySpark DataFrame. I need to perform TF-IDF, and for that I am using the prior steps of tokenizing, normalization, etc. with Spark NLP.

I have a df that looks like this after applying the tokenizer:

df.select('tokenized').show(5, truncate = 130)

+----------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                  tokenized       |
+----------------------------------------------------------------------------------------------------------------------------------+
|[content, type, multipart, alternative, boundary, nextpart, da, df, nextpart, da, df, content, type, text, plain, charset, asci...|
|[receive, ameurht, eop, eur, prod, protection, outlook, com, cyprmb, namprd, prod, outlook, com, https, via, cyprca, namprd, pr...|
|[plus, every, photographer, need, mm, lens, digital, photography, school, email, newsletter, http, click, aweber, com, ct, l, m...|
|[content, type, multipart, alternative, boundary, nextpart, da, beb, nextpart, da, beb, content, type, text, plain, charset, as...|
|[original, message, customer, service, mailto, ilpjmwofnst, qssadxnvrvc, narrig, stepmotherr, eviews, com, send, thursday, dece...|
+----------------------------------------------------------------------------------------------------------------------------------+
only showing top 5 rows

The next step is to apply the normalizer.

I want to set multiple clean up patterns:

1) remove all purely numeric tokens and strip digits from within words
-> example: [jhghgb56, 5897t95, fhgbg4, 7474, hfgbgb]
-> expected output: [jhghgb, fhgbg, hfgbgb]

2) remove all words shorter than 4 characters
-> example: [gfh, ehfufibf, hi, df, jdfh]
-> expected output: [ehfufibf, jdfh]

I tried this:

from sparknlp.annotator import Tokenizer, Normalizer

tokenizer = Tokenizer()\
     .setInputCols(['document'])\
     .setOutputCol('tokenized')\
     .setMinLength(3)

cleanup = ["[^A-Za-z]"]
normalizer = Normalizer()\
     .setInputCols(['tokenized'])\
     .setOutputCol('normalized')\
     .setLowercase(True)\
     .setCleanupPatterns(cleanup)

So far cleanup = ["[^A-Za-z]"] fulfils the first condition, but now I get clean words that are shorter than 4 characters and I don't understand how to remove those words. Help would be much appreciated!
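To make the intent concrete, this is roughly what I am after, sketched with a minimum-length setting on the Normalizer (this assumes a Spark NLP release that exposes setMinLength on this annotator; I have not verified that against my version):

normalizer = Normalizer()\
     .setInputCols(['tokenized'])\
     .setOutputCol('normalized')\
     .setLowercase(True)\
     .setCleanupPatterns(["[^A-Za-z]"])\
     .setMinLength(4)   # drop anything shorter than 4 characters, if the setter exists

If that setter is not available in the installed version, the short tokens would presumably have to be filtered out after the pipeline in plain Spark.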

  • why can't you simply use the character count in a Spark DataFrame `filter`? – UninformedUser Mar 27 '21 at 14:17
  • I did try that but I am working with a df that consists of millions of rows and the filter operation is very time consuming. – Samiksha Mar 27 '21 at 14:49
  • I see, but the normalizer also has to process each row; I doubt this matters, especially with filter being perfectly parallelizable. The normalizer does nothing different, it also processes each row. It's basically a `map` operation. So given that the normalizer doesn't have an option to clean up by length, you could simply replace the normalizer with a simple Spark `map` operation where you do the steps for each row yourself, i.e. the regex, the lower-casing and the string-length check for each token in a row. – UninformedUser Mar 28 '21 at 07:59
  • as an alternative, you could use a cleanup pattern that does also check for string length? Did you try `^[a-zA-Z]{4,}` for example? – UninformedUser Mar 28 '21 at 08:03
  • @UninformedUser `^[a-zA-Z]{4,}` rather gives numeric and alphanumeric tokens, and it also does not fix the length. – Samiksha Mar 28 '21 at 19:57
  • Also I think it might not be possible with the Normalizer. Let's say there is a token 'ab123'; after going through the normalizer (remove characters other than a-z or A-Z) it becomes 'ab', and the normalizer cannot then apply the cleanup pattern (remove words of length less than 3) to this word 'ab' again so that it gets removed. – Samiksha Mar 28 '21 at 21:15
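A rough sketch of the pure-Spark alternative UninformedUser describes above, doing the regex clean-up, lowercasing and length check directly on the token array instead of in the Normalizer (assumes Spark >= 2.4 for the higher-order transform/filter SQL functions; the normalized_tokens column name is just illustrative):

from pyspark.sql import functions as F

# work directly on the string results of the 'tokenized' annotation column:
# strip non-letters, lowercase, then drop anything shorter than 4 characters
df_norm = df.withColumn(
     'normalized_tokens',
     F.expr("""
         filter(
             transform(tokenized.result, t -> lower(regexp_replace(t, '[^A-Za-z]', ''))),
             t -> length(t) >= 4
         )
     """)
)

df_norm.select('normalized_tokens').show(5, truncate = 130)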
