
I have to preprocess NLP data, so I need to remove stop words (from the nltk library) from a TensorFlow dataset. I tried many things like this:

docs = tf.data.Dataset.from_tensor_slices([['Never tell me the odds.'], ["It's a trap!"]])
tokenizer = text.WhitespaceTokenizer()
tokenized_docs = docs.map(lambda x: tokenizer.tokenize(x))
data = tokenized_docs.filter(lambda x: x not in stop_words)

or this:

tokens = docs.map(lambda x: tokenizer.tokenize(x))
data = tokens.filter(lambda x: tf.strings.strip(x).ref() not in stopwords)

But neither worked. The first snippet raises an error like: RaggedTensor is unhashable.

  • Could you please elaborate: how is the error related to TensorFlow Extended? Which TensorFlow Extended components are giving you this error? – Feb 22 '21 at 13:09

2 Answers


From what I can tell, TensorFlow supports basic string normalization (lowercasing plus punctuation stripping) through the `standardize` argument of the `TextVectorization` layer. There doesn't appear to be built-in support for more advanced options, such as removing stop words, without doing it yourself.
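For completeness, here is a minimal sketch (not from the original answer) of "doing it yourself" by passing a custom callable as `standardize`. The short stop-word list is a hypothetical stand-in for nltk's:

```python
import tensorflow as tf

# Hypothetical short stop-word list standing in for nltk's.
stop_words = ["the", "me", "a", "s", "it"]
stop_re = r"\b(" + "|".join(stop_words) + r")\b\s*"

def custom_standardize(text):
    # Lowercase, replace punctuation/digits with spaces, then drop stop words.
    text = tf.strings.lower(text)
    text = tf.strings.regex_replace(text, r"[^a-z ]", " ")
    return tf.strings.regex_replace(text, stop_re, "")

vectorize = tf.keras.layers.TextVectorization(standardize=custom_standardize)
vectorize.adapt(["Never tell me the odds.", "It's a trap!"])
print(vectorize.get_vocabulary())
```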

It's probably easier to just do the standardization beforehand, outside of TensorFlow, and then pass the result on.

import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')


def parse_text(text):
    print(f'Input: {text}')

    text = re.sub("[^a-zA-Z]", ' ', text)
    print(f'Remove punctuation and numbers: {text}')

    text = text.lower().split()
    print(f'Lowercase and split: {text}')

    swords = set(stopwords.words("english"))
    text = [w for w in text if w not in swords]
    print(f'Remove stop words: {text}')

    text = " ".join(text)
    print(f'Final: {text}')

    return text


list1 = [["Never tell me the odds."], ["It's a trap!"]]

for sublist in list1:
    for i in range(len(sublist)):
        sublist[i] = parse_text(sublist[i])

print(list1)
# [['never tell odds'], ['trap']]
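The cleaned strings can then be handed to `tf.data` as in the question; a quick sketch:

```python
import tensorflow as tf

# Output of the preprocessing step above.
cleaned = [['never tell odds'], ['trap']]
docs = tf.data.Dataset.from_tensor_slices(cleaned)

for doc in docs:
    print(doc.numpy())  # one cleaned document per element
```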

mike v

You can use this to remove stop words when using TFX:

import tensorflow as tf
from nltk.corpus import stopwords

# One regex alternation matching any English stop word as a whole word.
stopword_re = r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*'
outputs['review'] = tf.strings.regex_replace(inputs['review'], stopword_re, "")