I have data with different dtypes and I would like to build a windowed dataset. Previously, I asked this question, where I dealt with homogeneous data. If I have a dataframe with different dtypes, I need to use a dictionary, and the accepted solution that uses flat_map doesn't work (AttributeError: 'dict' object has no attribute 'batch').
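For reference, this is roughly the flat_map pattern I mean (a sketch from memory, not the exact code from that answer). With a dict-structured dataset, each window element is itself a dict of _VariantDataset objects, so calling .batch on it fails:

import numpy as np
import pandas as pd
import tensorflow as tf

df = pd.DataFrame({'col1': list('abc'), 'col2': np.arange(3)})
ds = tf.data.Dataset.from_tensor_slices(dict(df)).window(2, shift=1, drop_remainder=True)
# each window is {'col1': _VariantDataset, 'col2': _VariantDataset}, i.e. a dict, not a dataset
ds = ds.flat_map(lambda window: window.batch(2))  # AttributeError: 'dict' object has no attribute 'batch'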
For example, if I don't use flat_map:
import numpy as np
import pandas as pd
import tensorflow as tf

x = pd.DataFrame({'col1': list('abcdefghij'), 'col2': np.arange(10), 'col3': np.arange(10)})
y = np.arange(10)
Create TensorFlow dataset:
window_size_x = 3
window_size_y = 2
shift_size = 1
# Align x and y so each window of features is paired with the y values that follow it
x = x[:-window_size_y]
y = y[window_size_x:]
ds_x = tf.data.Dataset.from_tensor_slices(dict(x)).window(window_size_x, shift=shift_size, drop_remainder=True)
ds_y = tf.data.Dataset.from_tensor_slices(y).window(window_size_y, shift=shift_size, drop_remainder=True)
dataset = tf.data.Dataset.zip((ds_x, ds_y))
dataset = dataset.batch(1)
# Test dataset
for i, j in dataset.take(1):
    print(i, j)
Output:
{'col1': <_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>, 'col2': <_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>, 'col3': <_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>} <_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>
Create preprocessor for different dtypes:
inputs = {'col1': tf.keras.Input(shape=(), name='col1', dtype=tf.string),
          'col2': tf.keras.Input(shape=(), name='col2', dtype=tf.float32),
          'col3': tf.keras.Input(shape=(), name='col3', dtype=tf.float32)}
vocab = sorted(set(x['col1']))
lookup = tf.keras.layers.StringLookup(vocabulary=vocab, output_mode='one_hot')
lookup = lookup(inputs['col1'][:tf.newaxis])
numeric = tf.stack([tf.cast(inputs[i], dtype=tf.float32) for i in ['col2', 'col3']], axis=-1)
result = tf.concat([lookup, numeric], axis=-1)
preprocessor = tf.keras.Model(inputs, result)
# Test preprocessor
preprocessor(dict(x))
Output:
<tf.Tensor: shape=(8, 11), dtype=float32, numpy=
array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 1.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 2., 2.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 3., 3.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 4., 4.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 5., 5.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 6., 6.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 7., 7.]], dtype=float32)>
Create model and fit:
body = tf.keras.models.Sequential([tf.keras.layers.Dense(8),
                                   tf.keras.layers.Dense(window_size_y)])
x = preprocessor(inputs)
result = body(x)
model = tf.keras.Model(inputs, result)
model.compile(loss='mae', optimizer='adam')
model.fit(dataset)
Error message:
TypeError: Inputs to a layer should be tensors. Got: <_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>
I think I need to transform the nested _VariantDataset objects into tensors.
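Something along these lines is what I have in mind (an untested sketch; flatten_window is just a name I made up), batching each per-column window so the nested _VariantDataset objects become plain tensors again:

def flatten_window(features, labels):
    # features is a dict of per-column _VariantDataset windows, labels is a _VariantDataset window
    features = {key: win.batch(window_size_x) for key, win in features.items()}
    return tf.data.Dataset.zip((tf.data.Dataset.zip(features), labels.batch(window_size_y)))

dataset = tf.data.Dataset.zip((ds_x, ds_y)).flat_map(flatten_window)
dataset = dataset.batch(1)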
UPDATE: When I use the answers from @AloneTogether and @thushv89, I get the following error:
ValueError: Exception encountered when calling layer "string_lookup_21" (type StringLookup).
When output_mode is not `'int'`, maximum supported output rank is 2. Received output_mode one_hot and input shape (None, None), which would result in output rank 3.
Call arguments received:
• inputs=tf.Tensor(shape=(None, None), dtype=string)
It looks like the StringLookup layer doesn't like the shape of the input tensor.
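If it helps, I think the problem can be reproduced in isolation (a minimal sketch; I'm assuming the windowed pipeline now feeds col1 with shape (batch, window_size), and the vocabulary here is just the letters from the example):

import tensorflow as tf

lookup = tf.keras.layers.StringLookup(vocabulary=list('abcdefgh'), output_mode='one_hot')
print(lookup(tf.constant(['a', 'b', 'c'])).shape)  # (3, 9): rank-1 input works
lookup(tf.constant([['a', 'b', 'c']]))             # rank-2 input raises the ValueError above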