
I have to preprocess NLP data, so I need to remove stop words (from the nltk library) from a TensorFlow dataset. I tried many things like this:

docs = tf.data.Dataset.from_tensor_slices([['Never tell me the odds.'], ["It's a trap!"]])
tokenizer = text.WhitespaceTokenizer()
tokenized_docs = docs.map(lambda x: tokenizer.tokenize(x))
data = tokenized_docs.filter(lambda x: x not in stop_words)

or this:

tokens = docs.map(lambda x: tokenizer.tokenize(x))
data = tokens.filter(lambda x: tf.strings.strip(x).ref() not in stopwords)

But neither worked. The first snippet raises an error like: RaggedTensor is unhashable.

  • Could you please elaborate: how is the error related to TensorFlow Extended? Which TensorFlow Extended components are giving you this error? – Feb 22 '21 at 13:09

2 Answers


From what I can tell, TensorFlow supports basic string normalization (lowercasing plus punctuation stripping) through the `standardize` argument of the `TextVectorization` layer. There doesn't appear to be built-in support for more advanced options, such as removing stop words, without doing it yourself.
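For completeness, here is a minimal sketch (not from the original answer) of "doing it yourself" by passing a custom callable as `standardize`. The short stop-word list is a hypothetical stand-in for nltk's:

```python
import tensorflow as tf

# Hypothetical short stop-word list standing in for nltk's.
stop_words = ["the", "me", "a", "s", "it"]
stop_re = r"\b(" + "|".join(stop_words) + r")\b\s*"

def custom_standardize(text):
    # Lowercase, replace punctuation/digits with spaces, then drop stop words.
    text = tf.strings.lower(text)
    text = tf.strings.regex_replace(text, r"[^a-z ]", " ")
    return tf.strings.regex_replace(text, stop_re, "")

vectorize = tf.keras.layers.TextVectorization(standardize=custom_standardize)
vectorize.adapt(["Never tell me the odds.", "It's a trap!"])
print(vectorize.get_vocabulary())
```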

It's probably easier to just do the standardization beforehand, outside of TensorFlow, and then pass the result on.

import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')


def parse_text(text):
    print(f'Input: {text}')

    text = re.sub("[^a-zA-Z]", ' ', text)
    print(f'Remove punctuation and numbers: {text}')

    text = text.lower().split()
    print(f'Lowercase and split: {text}')

    swords = set(stopwords.words("english"))
    text = [w for w in text if w not in swords]
    print(f'Remove stop words: {text}')

    text = " ".join(text)
    print(f'Final: {text}')

    return text


list1 = [["Never tell me the odds."], ["It's a trap!"]]

for sublist in list1:
    for i in range(len(sublist)):
        sublist[i] = parse_text(sublist[i])

print(list1)
# [['never tell odds'], ['trap']]
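The cleaned strings can then be handed to `tf.data` as in the question; a quick sketch:

```python
import tensorflow as tf

# Output of the preprocessing step above.
cleaned = [['never tell odds'], ['trap']]
docs = tf.data.Dataset.from_tensor_slices(cleaned)

for doc in docs:
    print(doc.numpy())  # one cleaned document per element
```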

mike v

You can use this to remove stop words when using TFX:

import tensorflow as tf
from nltk.corpus import stopwords

# One regex alternation matching any English stop word as a whole word.
stopword_re = r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*'
outputs['review'] = tf.strings.regex_replace(inputs['review'], stopword_re, "")