For some models (such as USE and ELMo), the vocabulary is serialized inside the SavedModel protocol buffer itself, so you have to locate it manually within the SavedModel and extract it (I adapted the logic for extracting the USE vocab from here):
import tensorflow_hub as hub
from tensorflow.python.saved_model.loader_impl import parse_saved_model
# `hub.resolve` downloads the model (if not cached yet) and returns its local cache path.
model_path = hub.resolve("https://tfhub.dev/google/universal-sentence-encoder/4")
saved_model = parse_saved_model(model_path)
# The location of the tensor holding the vocab is model-specific.
graph = saved_model.meta_graphs[0].graph_def
function_ = graph.library.function
embedding_node = function_[5].node_def[1] # Node name is "Embedding_words".
words_tensor = embedding_node.attr.get("value").tensor
word_list = [s.decode('utf-8') for s in words_tensor.string_val]
word_list[100:105] # ['best', ',▁but', 'no', 'any', 'more']
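Since the vocab tensor's location differs between models, a small helper can scan every function's nodes for large string-valued constants instead of hardcoding the indices. This is my own sketch, not part of the original recipe; it only assumes the proto structure returned by parse_saved_model above:

```python
def find_string_tensors(saved_model, min_size=1000):
    """Yield (function_name, node_name, num_strings) for constant nodes
    whose tensor holds at least `min_size` strings -- vocab candidates."""
    for meta_graph in saved_model.meta_graphs:
        for function in meta_graph.graph_def.library.function:
            for node in function.node_def:
                # Only constant-like nodes carry a "value" attr.
                if "value" not in node.attr:
                    continue
                strings = node.attr["value"].tensor.string_val
                if len(strings) >= min_size:
                    yield function.signature.name, node.name, len(strings)
```

Running list(find_string_tensors(saved_model)) on the USE proto points you at the embedding node without guessing function_[5].node_def[1] by hand.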
For other models, like google/Wiki-words-500/2, we are luckier: the vocabulary has been exported to the assets/ directory:
model_path = hub.resolve("https://tfhub.dev/google/Wiki-words-500/2")
!head -n 40000 {model_path}/assets/tokens.txt | tail
# Antisense
# Antiseptic
# Antiseptics
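Once you have located the assets file, you can read it from Python directly instead of shelling out. A minimal sketch, assuming the file follows the usual one-token-per-line layout of tokens.txt (the path argument is whatever assets/tokens.txt path the cache holds on your machine):

```python
def load_vocab(path):
    """Read a one-token-per-line vocab file into a list.

    Line order matters: line i corresponds to row i of the embedding matrix.
    """
    with open(path, encoding="utf-8") as f:
        return f.read().splitlines()
```

This gives you the same index-to-token mapping the model uses internally, e.g. load_vocab(path)[39999] matches the last line printed by the head/tail pipeline above.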