I have a feature like a post tag. So for each observation the post_tag feature might be a selection of tags like "oscars,brad-pitt,awards". I'd like to be able to pass this as a feature to a tensorflow model build using the estimator api running on google cloud machine learning (as per this example but adapted for my own problem).
I'm just not sure how to transform this into a multi-hot encoded feature in tensorflow. I'm trying to get something similar to MultiLabelBinarizer in sklearn ideally.
I think this is sort of related but not quite what i need.
So say i have data like:
id,post_tag
1,[oscars,brad-pitt,awards]
2,[oscars,film,reviews]
3,[matt-damon,bourne]
I want to featurize it, as part of preprocessing within tensorflow, as:
id,post_tag_oscars,post_tag_brad_pitt,post_tag_awards,post_tag_film,post_tag_reviews,post_tag_matt_damon,post_tag_bourne
1,1,1,1,0,0,0,0
2,1,0,0,1,1,0,0
3,0,0,0,0,0,1,1
Update
If i have post_tag_list be a string like "oscars,brad-pitt,awards" in the input csv. And if i try then do:
INPUT_COLUMNS = [
...
tf.contrib.lookup.HashTable(tf.contrib.lookup.KeyValueTensorInitializer('post_tag_list',
tf.range(0, 10, dtype=tf.int64),
tf.string, tf.int64),
default_value=10, name='post_tag_list'),
...]
I get this error:
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/home/andrew_maguire/localDev/codeBase/pmc-analytical-data-mart/clickmodel/trainer/task.py", line 4, in <module>
import model
File "trainer/model.py", line 49, in <module>
default_value=10, name='post_tag_list'),
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/lookup_ops.py", line 276, in __init__
super(HashTable, self).__init__(table_ref, default_value, initializer)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/lookup_ops.py", line 162, in __init__
self._init = initializer.initialize(self)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/lookup_ops.py", line 348, in initialize
table.table_ref, self._keys, self._values, name=scope)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_lookup_ops.py", line 205, in _initialize_table_v2
values=values, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2632, in create_op
set_shapes_for_outputs(ret)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1911, in set_shapes_for_outputs
shapes = shape_func(op)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1861, in call_with_requiring
return call_cpp_shape_fn(op, require_shape_fn=True)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/common_shapes.py", line 595, in call_cpp_shape_fn
require_shape_fn)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/common_shapes.py", line 659, in _call_cpp_shape_fn_impl
raise ValueError(err.message)
ValueError: Shape must be rank 1 but is rank 0 for 'key_value_init' (op: 'InitializeTableV2') with input shapes: [], [], [10].
If i was to pad each post_tag_list to be like "oscars,brad-pitt,awards,OTHER,OTHER,OTHER,OTHER,OTHER,OTHER,OTHER" so it's always 10 long. Would that be a potential solution here.
Or do i need to in some way know the size of all post tags i might ever be passing in here (kinda ill defined as new ones created all the time).