Multi-hot encoding in TensorFlow (Google Cloud Machine Learning, tf Estimator API)

Question

I have a feature like a post tag. For each observation, the post_tag feature might be a selection of tags such as "oscars,brad-pitt,awards". I'd like to pass this as a feature to a TensorFlow model built using the Estimator API and running on Google Cloud Machine Learning (as per this example, but adapted for my own problem).

I'm just not sure how to transform this into a multi-hot encoded feature in TensorFlow. Ideally, I'm trying to get something similar to MultiLabelBinarizer in sklearn.

I think this is sort of related, but not quite what I need.

So say I have data like:

id,post_tag
1,[oscars,brad-pitt,awards]
2,[oscars,film,reviews]
3,[matt-damon,bourne]

I want to featurize it, as part of preprocessing within TensorFlow, as:

id,post_tag_oscars,post_tag_brad_pitt,post_tag_awards,post_tag_film,post_tag_reviews,post_tag_matt_damon,post_tag_bourne
1,1,1,1,0,0,0,0
2,1,0,0,1,1,0,0
3,0,0,0,0,0,1,1

Update

Suppose post_tag_list is a string like "oscars,brad-pitt,awards" in the input CSV. If I then try:

INPUT_COLUMNS = [
...
tf.contrib.lookup.HashTable(tf.contrib.lookup.KeyValueTensorInitializer('post_tag_list',
                                            tf.range(0, 10, dtype=tf.int64),
                                            tf.string, tf.int64),
                           default_value=10, name='post_tag_list'),
...]

I get this error:

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/andrew_maguire/localDev/codeBase/pmc-analytical-data-mart/clickmodel/trainer/task.py", line 4, in <module>
    import model
  File "trainer/model.py", line 49, in <module>
    default_value=10, name='post_tag_list'),
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/lookup_ops.py", line 276, in __init__
    super(HashTable, self).__init__(table_ref, default_value, initializer)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/lookup_ops.py", line 162, in __init__
    self._init = initializer.initialize(self)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/lookup_ops.py", line 348, in initialize
    table.table_ref, self._keys, self._values, name=scope)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_lookup_ops.py", line 205, in _initialize_table_v2
    values=values, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2632, in create_op
    set_shapes_for_outputs(ret)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1911, in set_shapes_for_outputs
    shapes = shape_func(op)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1861, in call_with_requiring
    return call_cpp_shape_fn(op, require_shape_fn=True)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/common_shapes.py", line 595, in call_cpp_shape_fn
    require_shape_fn)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/common_shapes.py", line 659, in _call_cpp_shape_fn_impl
    raise ValueError(err.message)
ValueError: Shape must be rank 1 but is rank 0 for 'key_value_init' (op: 'InitializeTableV2') with input shapes: [], [], [10].

If I were to pad each post_tag_list to something like "oscars,brad-pitt,awards,OTHER,OTHER,OTHER,OTHER,OTHER,OTHER,OTHER", so that it's always 10 long, would that be a potential solution here?

Or do I need to somehow know the full set of post tags I might ever pass in here (kind of ill-defined, as new ones are created all the time)?

It's actually open ended. When someone creates a post they can also make a new tag if they can't find an existing one. So far its about 7k and we try to encourage them not to make new tags but they are open ended in sense of if a new news story breaks about something new it might get a new tag. So it might be that actually treating the tags as words similar to a post title and putting them into an embedding space might be more appropriate. I am training a doc2vec on the tags and posts and going to pass those vectors in as dense features. Was hoping to have tag dummies to use for wide cols. — andrewm4894, Oct 12 '17 at 11:04
Previously unseen tags are going to get mapped to one or more "unseen" weight vectors. So 7K is what I was looking for -- order of magnitude, mostly. — rhaertel80, Oct 12 '17 at 19:07

score 2 · Answer 1 · answered Oct 11 '17 at 08:39

Have you tried tf.contrib.lookup.HashTable?

Here is an example from my own project: https://github.com/TensorLab/tensorfx/blob/master/src/data/_transforms.py#L160 and a made-up snippet based on it:

import tensorflow as tf
session = tf.InteractiveSession()

entries = ['red', 'blue', 'green']
table = tf.contrib.lookup.HashTable(
    tf.contrib.lookup.KeyValueTensorInitializer(entries,
                                                tf.range(0, len(entries), dtype=tf.int64),
                                                tf.string, tf.int64),
    default_value=len(entries), name='entries')
tf.tables_initializer().run()

value = tf.constant([['blue', 'red'], ['green', 'red']])
print(table.lookup(value).eval())

I believe lookup works for both regular tensors and SparseTensors (you might end up with the latter, given your variable-length list of values).
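
The lookup above only yields integer ids, so one more step is needed to reach the multi-hot rows asked for in the question. Below is a minimal plain-Python sketch (deliberately not TensorFlow) of that step, assuming the same `entries` list and the same "unknown maps to `len(entries)`" convention as the HashTable snippet. The helper names `lookup` and `multi_hot` and the extra `'purple'` row are made up for illustration. In TensorFlow itself, one option might be `tf.one_hot` on the ids followed by `tf.reduce_max` over the tag axis, but treat that as an untested suggestion.

```python
# Plain-Python illustration of "lookup ids, then multi-hot"; no TensorFlow needed.
entries = ['red', 'blue', 'green']
default_id = len(entries)  # plays the role of default_value in the HashTable

# Vocabulary: known strings map to 0..len(entries)-1.
index = {key: i for i, key in enumerate(entries)}


def lookup(row):
    """Map each string to its id; unknown strings get default_id."""
    return [index.get(value, default_id) for value in row]


def multi_hot(ids, depth):
    """Return a 0/1 list of length `depth` with a 1 at every id present."""
    row = [0] * depth
    for i in ids:
        row[i] = 1
    return row


# 'purple' is a hypothetical unseen value, added to show the default slot.
rows = [['blue', 'red'], ['green', 'red'], ['purple']]
id_rows = [lookup(r) for r in rows]
encoded = [multi_hot(ids, len(entries) + 1) for ids in id_rows]

print(id_rows)   # [[1, 0], [2, 0], [3]]
print(encoded)   # [[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 0, 1]]
```

The encoded rows are one column wider than the vocabulary so that unknown values have a slot of their own instead of being dropped.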

Nikhil Kothari

great - looks exactly like what i need, thanks a million. I just need to play around a bit to figure out how i can get my string in the csv which is actually looking like "red|blue|green" into a list like in your example. So am thinking something like `post_tags = post_tag_list.split("|")` and then something like `INPUT_COLUMNS = [ ... tf.contrib.lookup.KeyValueTensorInitializer('post_tags', tf.range(0, len(post_tags), dtype=tf.int64), tf.string, tf.int64), ... ]` – andrewm4894 Oct 11 '17 at 10:37
ugh.. `Traceback (most recent call last): File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main "__main__", fname, loader, pkg_name) File "/usr/lib/python2.7/runpy.py", line 72, in _run_code exec code in run_globals File "/home/andrew_maguire/localDev/codeBase/pmc-analytical-data-mart/clickmodel/trainer/task.py", line 4, in import model File "trainer/model.py", line 32, in post_tags = post_tag_list.split("|") NameError: name 'post_tag_list' is not defined` – andrewm4894 Oct 11 '17 at 10:42
I'm not sure if I can do the .split("|") while defining INPUT_COLUMNS. I'm also not sure what type to read the [post_tag_list] field (which looks like "red|green|blue") in as, so that it ends up as a list like entries in your example that I can then pass to the HashTable() – andrewm4894 Oct 11 '17 at 10:45
i'll amend my etl to store it as "red,green,blue" in the input csv to try simplify. – andrewm4894 Oct 11 '17 at 10:46
I haven't gotten as far yet in my project, but you could string split using tf.string_split to go from pipe delimited column value representing a variable length list. – Nikhil Kothari Oct 11 '17 at 14:51

score 2 · Answer 2 · answered Oct 12 '17 at 19:56

There are a couple of issues to tackle here. The first is the question of a tag set that keeps growing. The second is how to parse variable-length data from CSV.

To handle a growing tag set, you'll need to use either OOV (out-of-vocabulary) buckets or feature hashing. Nikhil showed the latter, so I'll show the former.
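
To make the difference between the two options concrete, here is a small plain-Python sketch (not TensorFlow). It assumes a tiny made-up vocabulary and made-up bucket counts. The names `stable_hash`, `oov_id`, `hashed_id`, `NUM_HASH` and the use of MD5 are illustrative choices only and do not reflect how TensorFlow hashes strings internally.

```python
import hashlib

TAG_SET = ["oscars", "brad-pitt", "awards"]  # hypothetical known vocabulary
NUM_OOV = 2    # hypothetical number of OOV buckets
NUM_HASH = 8   # hypothetical number of feature-hashing buckets

vocab = {tag: i for i, tag in enumerate(TAG_SET)}


def stable_hash(tag):
    """Deterministic hash; Python's built-in hash() is salted per process."""
    return int(hashlib.md5(tag.encode("utf-8")).hexdigest(), 16)


def oov_id(tag):
    """Known tags keep their own id; unseen tags share the OOV buckets."""
    if tag in vocab:
        return vocab[tag]
    return len(TAG_SET) + stable_hash(tag) % NUM_OOV


def hashed_id(tag):
    """Feature hashing: every tag, known or not, is hashed into a bucket."""
    return stable_hash(tag) % NUM_HASH


for tag in ["oscars", "awards", "bourne"]:
    print(tag, oov_id(tag), hashed_id(tag))
```

With OOV buckets, known tags never collide with each other and only unseen tags share ids. With feature hashing, no vocabulary has to be maintained, at the cost of possible collisions between any two tags.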

How to parse variable-length data from CSV

Let's suppose the column with variable length data uses | as a separator, e.g.

csv = [
  "1,oscars|brad-pitt|awards",
  "2,oscars|film|reviews",
  "3,matt-damon|bourne",
]

You can use code like this to convert those to a SparseTensor.

import tensorflow as tf

# Purposefully omitting "bourne" to demonstrate OOV mappings.
TAG_SET = ["oscars", "brad-pitt", "awards", "film", "reviews", "matt-damon"]
NUM_OOV = 1

def sparse_from_csv(csv):
  ids, post_tags_str = tf.decode_csv(csv, [[-1], [""]])
  table = tf.contrib.lookup.index_table_from_tensor(
      mapping=TAG_SET, num_oov_buckets=NUM_OOV, default_value=-1)
  split_tags = tf.string_split(post_tags_str, "|")
  return ids, tf.SparseTensor(
      indices=split_tags.indices,
      values=table.lookup(split_tags.values),
      dense_shape=split_tags.dense_shape)

# Optionally create an embedding for this.
TAG_EMBEDDING_DIM = 3

ids, tags = sparse_from_csv(csv)

embedding_params = tf.Variable(tf.truncated_normal([len(TAG_SET) + NUM_OOV, TAG_EMBEDDING_DIM]))
embedded_tags = tf.nn.embedding_lookup_sparse(embedding_params, sp_ids=tags, sp_weights=None)

# Test it out
with tf.Session() as s:
  s.run([tf.global_variables_initializer(), tf.tables_initializer()])
  print(s.run([ids, embedded_tags]))

You'll see output like so (since the embedding is random, exact numbers will change):

[array([1, 2, 3], dtype=int32), array([[ 0.16852427,  0.26074541, -0.4237918 ],
       [-0.38550434,  0.32314634,  0.858069  ],
       [ 0.19339906, -0.24429649, -0.08393878]], dtype=float32)]

You can see that each column in the CSV is represented as an ndarray: the first holds the ids, and in the second the tags of each row are combined into a single 3-dimensional embedding.
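
If wide-column tag dummies are wanted instead of an embedding, the same sparse ids can be turned into multi-hot rows. The sketch below is plain Python rather than TensorFlow. It mimics the (indices, values, dense_shape) triple that the code above builds, then sets a 1 for every id in each row. The helper names `to_sparse` and `to_multi_hot` are made up for illustration, and a single OOV bucket is assumed so that every unseen tag maps to `len(TAG_SET)`. TensorFlow 1.x has `tf.sparse_to_indicator`, which may do this last step directly on the SparseTensor, but check it against your version.

```python
TAG_SET = ["oscars", "brad-pitt", "awards", "film", "reviews", "matt-damon"]
NUM_OOV = 1
DEPTH = len(TAG_SET) + NUM_OOV  # one column per known tag plus the OOV column

rows = ["oscars|brad-pitt|awards", "oscars|film|reviews", "matt-damon|bourne"]
vocab = {tag: i for i, tag in enumerate(TAG_SET)}


def to_sparse(rows, sep="|"):
    """Split each row and return (indices, values, dense_shape), COO style."""
    indices, values = [], []
    max_len = 0
    for r, row in enumerate(rows):
        tags = row.split(sep)
        max_len = max(max_len, len(tags))
        for c, tag in enumerate(tags):
            indices.append((r, c))
            # With a single OOV bucket, every unseen tag gets id len(TAG_SET).
            values.append(vocab.get(tag, len(TAG_SET)))
    return indices, values, (len(rows), max_len)


def to_multi_hot(indices, values, num_rows, depth):
    """Set out[row][tag_id] = 1 for every (row, position) -> tag_id entry."""
    out = [[0] * depth for _ in range(num_rows)]
    for (r, _), tag_id in zip(indices, values):
        out[r][tag_id] = 1
    return out


indices, values, dense_shape = to_sparse(rows)
multi_hot = to_multi_hot(indices, values, dense_shape[0], DEPTH)
for row in multi_hot:
    print(row)
# [1, 1, 1, 0, 0, 0, 0]
# [1, 0, 0, 1, 1, 0, 0]
# [0, 0, 0, 0, 0, 1, 1]
```

The last column is the OOV slot: "bourne" is not in TAG_SET, so it lands there together with any other unseen tag.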

Thanks a million - this is very useful. Will try and incorporate and let you know how i get on. — andrewm4894, Oct 16 '17 at 15:05
