I am trying to write a function that takes a tensor of strings as input and returns a sparse tensor of ones and zeros, so that each row is a bag-of-words representation of one input string.

About input

  • each input string is a name consisting of 1 to 7 words
  • a word occurs at most once in a name
  • inputs are lowercase and alphanumeric only
  • a name can occur many times in the input tensor
  • some words are repeated among different names

Requirements

  • the function should take a parameter k_top that gives the number of most frequent words to consider (this is probably what is called vocabulary size)
  • the function should consist only of operations allowed in graph mode (e.g. tensor.numpy() won't work)
  • it should be compatible with TensorFlow 2.5.0

Example

# tensor of strings, shape: (7,)
inputs = tf.strings.lower([
    "one two three",
    "three",
    "one two",
    "one two cat",
    "three cat",
    "one cat two three",
    "banana one"
]) 

Word frequencies:

"one": 5
"two": 4
"three": 4
"cat": 3
"banana": 1

After calling the function with k_top = 2 (take the words with the two highest counts; two words share the count 4), each string is represented as a vector of ones and zeros indicating whether "one", "two" and "three" are present:

"one two three" -> [1,1,1]
"three" -> [0,0,1]
"one two" -> [1,1,0]
"one two cat" -> [1,1,0]
"three cat" -> [0,0,1]
"one cat two three" -> [1,1,1]
"banana one" -> [1,0,0]

I've been trying for a few days to combine different functions from the tf.Transform module, but I keep getting errors instead of a result. That is probably because I'm new to TensorFlow, and debugging is difficult because it's hard to see the contents of a tensor outside eager mode (see the edit here: Tensorflow2: How to print value of a tensor returned from tf.function when eager execution is disabled?). Any help would be greatly appreciated!

Brzoskwinia

1 Answer

There is no direct method for this, but the following workaround gets close to what you need.

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

inputs = [
    "one two three",
    "three",
    "one two",
    "one two cat",
    "three cat",
    "one cat two three",
    "banana one"
]

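# The OOV token is given index 1, so real words start at index 2.
tf_keras_tokenizer = Tokenizer(oov_token=0)
tf_keras_tokenizer.fit_on_texts(inputs)
tf_keras_encoded = tf_keras_tokenizer.texts_to_sequences(inputs)
# Pad shorter names with 0 at the end so every row has the same length.
tf_keras_encoded = tf.keras.preprocessing.sequence.pad_sequences(tf_keras_encoded, padding="post")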

word_index = tf_keras_tokenizer.word_index
# {0: 1, 'one': 2, 'two': 3, 'three': 4, 'cat': 5, 'banana': 6}

word_counts = tf_keras_tokenizer.word_counts
# OrderedDict([('one', 5), ('two', 4), ('three', 4), ('cat', 3), ('banana', 1)])

#Below is the encoded result based on the word index.
tf_keras_encoded
array([[2, 3, 4, 0],
       [4, 0, 0, 0],
       [2, 3, 0, 0],
       [2, 3, 5, 0],
       [4, 5, 0, 0],
       [2, 5, 3, 4],
       [6, 2, 0, 0]], dtype=int32)

Now, we need to mask the values with 1s and 0s based on the top n frequent words.

num_words = 2
# word_counts is ordered by first appearance, not by frequency, so sort the distinct counts explicitly.
top_counts = sorted(set(word_counts.values()), reverse=True)[:num_words]
# Keep every word whose count is one of the top counts (tied words are all kept).
n_frequent_words = [key for key, value in word_counts.items() if value in top_counts]
# Look up the tokenizer index of each kept word.
frequent_word_index = [value for key, value in word_index.items() if key in n_frequent_words]

Now let's replace the values conditionally with np.where. First, iterate over the frequent word indices and replace each with 1; this is safe because index 1 is reserved for the OOV token, so no real word uses it. After the loop, replace everything that is not 1 with 0.

import numpy as np
for i in frequent_word_index:
    tf_keras_encoded = np.where(tf_keras_encoded == i, 1, tf_keras_encoded)
tf_keras_encoded = np.where(tf_keras_encoded == 1, 1, 0)
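
The loop above can also be written as a single vectorized mask. This is a minimal sketch, assuming the same encoded array and frequent word indices as in this example; the names encoded and mask are only illustrative.

```python
import numpy as np

# Encoded names from the tokenizer step (0 is padding).
encoded = np.array([[2, 3, 4, 0],
                    [4, 0, 0, 0],
                    [2, 3, 0, 0],
                    [2, 3, 5, 0],
                    [4, 5, 0, 0],
                    [2, 5, 3, 4],
                    [6, 2, 0, 0]], dtype=np.int32)
# Tokenizer indices of the frequent words ("one", "two", "three").
frequent_word_index = [2, 3, 4]

# True where a token is one of the frequent words, then cast to 0/1.
mask = np.isin(encoded, frequent_word_index).astype(np.int32)
print(mask)
```

Because the mask is built in one step, it does not rely on index 1 being unused by real words.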

Result. Note that the columns follow the position of each word within its input string, not a fixed vocabulary order; if you need a different layout, adapt the code accordingly.

tf_keras_encoded
array([[1, 1, 1, 0],
       [1, 0, 0, 0],
       [1, 1, 0, 0],
       [1, 1, 0, 0],
       [1, 0, 0, 0],
       [1, 0, 1, 1],
       [0, 1, 0, 0]])
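
For the layout asked for in the question (one column per kept word, built only from graph-compatible ops), something along these lines might work. This is a rough sketch and not an existing API: the function name bag_of_words and the returned (sparse, vocab) pair are my own choices. It assumes that each word occurs at most once per name, as stated in the question, and that k_top counts distinct frequency values, so tied words are all kept.

```python
import tensorflow as tf


@tf.function
def bag_of_words(inputs, k_top):
    # Split each name on whitespace: RaggedTensor of shape (batch, None).
    words = tf.strings.split(inputs)
    # Unique words (in first-appearance order), word id per token, count per word.
    vocab, token_ids, counts = tf.unique_with_counts(words.flat_values)

    # Find the k_top highest distinct counts and use the smallest as a threshold.
    distinct_counts, _ = tf.unique(counts)
    k = tf.minimum(k_top, tf.size(distinct_counts))
    top_counts = tf.math.top_k(distinct_counts, k=k).values
    keep = counts >= tf.reduce_min(top_counts)

    # Assign consecutive column ids to kept words, -1 to dropped words.
    new_ids = tf.cumsum(tf.cast(keep, tf.int64)) - 1
    col_of_word = tf.where(keep, new_ids, -tf.ones_like(new_ids))
    kept_vocab = tf.boolean_mask(vocab, keep)

    # One (row, column) index per token whose word was kept.
    token_cols = tf.gather(col_of_word, token_ids)
    token_kept = token_cols >= 0
    rows = tf.boolean_mask(words.value_rowids(), token_kept)
    cols = tf.boolean_mask(token_cols, token_kept)
    indices = tf.stack([rows, cols], axis=1)

    sparse = tf.sparse.SparseTensor(
        indices=indices,
        values=tf.ones_like(rows, dtype=tf.int32),
        dense_shape=tf.stack(
            [words.nrows(), tf.size(kept_vocab, out_type=tf.int64)]),
    )
    # Sort indices into row-major order so the result can be densified.
    return tf.sparse.reorder(sparse), kept_vocab


inputs = tf.constant([
    "one two three",
    "three",
    "one two",
    "one two cat",
    "three cat",
    "one cat two three",
    "banana one",
])
sparse, vocab = bag_of_words(inputs, k_top=2)
print(vocab.numpy())
print(tf.sparse.to_dense(sparse).numpy())
```

The columns come out in the order in which the kept words first appear in the input. If a word could repeat within a name, the sparse tensor would contain duplicate indices, so the tokens would need to be deduplicated per row first.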