I am trying to write a function that takes a tensor of strings as input and returns a sparse tensor of ones and zeros, so that each row is a bag-of-words representation of one string from the input.
About input
- each input string is a name consisting of 1 to 7 words
- a word occurs at most once in any one name
- inputs are lowercase and alphanumeric only
- a name can occur many times in input tensor
- some words are repeated among different names
Requirements
- the function should take a parameter k_top that indicates the number of most popular words that are considered (it's probably called vocabulary size)
- the function should consist only of operations allowed in graph mode (e.g. tensor.numpy() won't work)
- compatible with TensorFlow 2.5.0
Example
import tensorflow as tf

# tensor of strings, shape: (7,)
inputs = tf.strings.lower([
"one two three",
"three",
"one two",
"one two cat",
"three cat",
"one cat two three",
"banana one"
])
Word frequencies:
"one": 5
"two": 4
"three": 4
"cat": 3
"banana": 1
After calling the function with k_top = 2 (take the words with the two highest counts; two words tie at count 4), each string is represented as a vector of ones and zeros indicating whether "one", "two", "three" is present:
"one two three" -> [1,1,1]
"three" -> [0,0,1]
"one two" -> [1,1,0]
"one two cat" -> [1,1,0]
"three cat" -> [0,0,1]
"one cat two three" -> [1,1,1]
"banana one" -> [1,0,0]
I've been trying for a few days, combining different functions from the tf.Transform module, and I still get errors instead of a result. That is probably because I'm new to TensorFlow, and debugging is difficult because it's hard to see the contents of a tensor when not in eager mode (see the edit here: Tensorflow2: How to print value of a tensor returned from tf.function when eager execution is disabled?).
Any help would be greatly appreciated!