
I have data with mixed dtypes and would like to build a windowed dataset. In a previous question I dealt with homogeneous data. With a dataframe that has different dtypes I need to pass a dictionary, and the accepted solution from that question, which uses flat_map, no longer works (AttributeError: 'dict' object has no attribute 'batch'). For example, without flat_map:

x = pd.DataFrame({'col1': list('abcdefghij'), 'col2': np.arange(10), 'col3': np.arange(10)})
y = np.arange(10)

Create TensorFlow dataset:

window_size_x = 3
window_size_y = 2
shift_size = 1

x = x[:-window_size_y]
y = y[window_size_x:]

ds_x = tf.data.Dataset.from_tensor_slices(dict(x)).window(window_size_x, shift=shift_size, drop_remainder=True)
ds_y = tf.data.Dataset.from_tensor_slices(y).window(window_size_y, shift=shift_size, drop_remainder=True)
dataset = tf.data.Dataset.zip((ds_x, ds_y))
dataset = dataset.batch(1)

# Test dataset
for i, j in dataset.take(1):
  print(i, j)

Output:

{'col1': <_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>, 'col2': <_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>, 'col3': <_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>} <_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>
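
For reference, the pairing that the two window() calls above are meant to produce can be mimicked in plain Python. This is only a sketch: sliding_windows is a hypothetical helper, not a TensorFlow API, and a plain list of integers stands in for the dataframe columns.

```python
def sliding_windows(seq, size, shift=1):
    """Return every full window of `size` items, advancing `shift` items each time.

    Incomplete trailing windows are skipped, mirroring drop_remainder=True.
    """
    return [list(seq[i:i + size]) for i in range(0, len(seq) - size + 1, shift)]


window_size_x, window_size_y, shift_size = 3, 2, 1
values = list(range(10))

# Same trimming as in the question: features stop early, targets start late
x_part = values[:-window_size_y]
y_part = values[window_size_x:]

x_windows = sliding_windows(x_part, window_size_x, shift_size)
y_windows = sliding_windows(y_part, window_size_y, shift_size)

# Each feature window is paired with the target window that follows it
pairs = list(zip(x_windows, y_windows))
print(pairs[0])   # ([0, 1, 2], [3, 4])
print(pairs[-1])  # ([5, 6, 7], [8, 9])
```

So each element of the final dataset should hold three consecutive feature rows and the two target values that come right after them.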

Create preprocessor for different dtypes:

inputs = {'col1': tf.keras.Input(shape=(), name='col1', dtype=tf.string),
          'col2': tf.keras.Input(shape=(), name='col2', dtype=tf.float32),
          'col3': tf.keras.Input(shape=(), name='col3', dtype=tf.float32)}

vocab = sorted(set(x['col1']))
lookup = tf.keras.layers.StringLookup(vocabulary=vocab, output_mode='one_hot')
lookup = lookup(inputs['col1'][:, tf.newaxis])

numeric = tf.stack([tf.cast(inputs[i], dtype=tf.float32) for i in ['col2', 'col3']], axis=-1)
result = tf.concat([lookup, numeric], axis=-1)

preprocessor = tf.keras.Model(inputs, result)

# Test preprocessor
preprocessor(dict(x))

Output:

<tf.Tensor: shape=(8, 11), dtype=float32, numpy=
array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 1.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 2., 2.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 3., 3.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 4., 4.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 5., 5.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 6., 6.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 7., 7.]], dtype=float32)>

Create model and fit:

body = tf.keras.models.Sequential([tf.keras.layers.Dense(8),
                                   tf.keras.layers.Dense(window_size_y)])
x = preprocessor(inputs)
result = body(x)
model = tf.keras.Model(inputs, result)

model.compile(loss='mae', optimizer='adam')
model.fit(dataset)

Error message:

TypeError: Inputs to a layer should be tensors. Got: <_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>

I think I need to convert the nested _VariantDataset objects into tensors.

UPDATE: When I use answers from @AloneTogether and @thushv89, I get the following error:

ValueError: Exception encountered when calling layer "string_lookup_21" (type StringLookup).

When output_mode is not `'int'`, maximum supported output rank is 2. Received output_mode one_hot and input shape (None, None), which would result in output rank 3.

Call arguments received:
  • inputs=tf.Tensor(shape=(None, None), dtype=string)

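It looks like the StringLookup layer cannot one-hot encode this input: the windowed tensor has shape (batch, window), so the one-hot output would have rank 3, which the layer does not support.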
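
One possible way around the rank limit, sketched under assumptions rather than tested against the full pipeline above: keep StringLookup in 'int' mode, which accepts inputs of any rank, and one-hot encode the resulting indices afterwards with tf.one_hot. The two sample windows and the final reshape are illustrative choices, not part of the original code.

```python
import tensorflow as tf

vocab = list('abcdefgh')  # same values as sorted(set(x['col1'])) after trimming

# 'int' mode has no output-rank restriction, unlike 'one_hot'
lookup = tf.keras.layers.StringLookup(vocabulary=vocab, output_mode='int')

# Two hypothetical windows of length 3, i.e. shape (batch, window)
windows = tf.constant([['a', 'b', 'c'], ['b', 'c', 'd']])

# Index 0 is reserved for out-of-vocabulary tokens, so 'a' maps to 1
ids = lookup(windows)  # shape (2, 3)

# One-hot encode outside the layer; depth includes the OOV slot
one_hot = tf.one_hot(ids, depth=lookup.vocabulary_size())  # shape (2, 3, 9)

# Flatten the window and vocabulary axes if the next layer expects rank 2
flat = tf.reshape(one_hot, (one_hot.shape[0], -1))  # shape (2, 27)
```

Whether to flatten or to keep the window axis depends on what the downstream layers expect; the numeric columns would need the same treatment before concatenation.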

Mykola Zotko

2 Answers


You can do this with a bit of a hack. It gets messy when you try to zip() data with different structures (e.g. a dict of arrays (x) and a plain array (y)). I'm not sure whether that is possible (I got weird errors), so I'm collating both x and y into a single dict.

import tensorflow as tf
import pandas as pd
import numpy as np

window_size_x = 3
window_size_y = 2
shift_size = 1


x = pd.DataFrame({'col1': list('abcdefghij'), 'col2': np.arange(10)})
y = np.arange(10)

x = x[:-window_size_y]
y = y[window_size_x:]

ds_x = tf.data.Dataset.from_tensor_slices(dict(x)).window(window_size_x, shift=shift_size, drop_remainder=True)
ds_y = tf.data.Dataset.from_tensor_slices(y).window(window_size_y, shift=shift_size, drop_remainder=True)
dataset = tf.data.Dataset.zip((ds_x, ds_y)).flat_map(
    # zip the data in the dict to a tf.data.Dataset
    lambda window_x, window_y: tf.data.Dataset.zip(
      # Here we are collating x and y to a single dict
      {**{k: v.batch(window_size_x) for k, v in window_x.items()}, "y": window_y.batch(window_size_y)}
    )
)

If you don't want x and y in the same dict, you can split them apart again with map():

dataset = dataset.map(lambda data_dict: ({k: v for k, v in data_dict.items() if k != "y"}, data_dict["y"]))
thushv89

Maybe something like this:

import tensorflow as tf
import pandas as pd
import numpy as np

window_size_x = 3
window_size_y = 2
shift_size = 1

x = pd.DataFrame({'col1': list('abcdefghij'), 'col2': np.arange(10), 'col3': np.arange(10)})
y = np.arange(10)

x = x[:-window_size_y]
y = y[window_size_x:]

# Batch each column's window into one tensor, then zip the columns back together
ds_x = (
    tf.data.Dataset.from_tensor_slices(dict(x))
    .window(window_size_x, shift=shift_size, drop_remainder=True)
    .flat_map(lambda w: tf.data.Dataset.zip(tuple(w[col].batch(window_size_x) for col in ['col1', 'col2', 'col3'])))
)
# Each y window is a dataset of scalars; batching turns it into one tensor
ds_y = (
    tf.data.Dataset.from_tensor_slices(y)
    .window(window_size_y, shift=shift_size, drop_remainder=True)
    .flat_map(lambda w: w.batch(window_size_y))
)
dataset = tf.data.Dataset.zip((ds_x, ds_y))

for i, j in dataset.take(1):
  print(i, j)

Output:

(<tf.Tensor: shape=(3,), dtype=string, numpy=array([b'a', b'b', b'c'], dtype=object)>, <tf.Tensor: shape=(3,), dtype=int64, numpy=array([0, 1, 2])>, <tf.Tensor: shape=(3,), dtype=int64, numpy=array([0, 1, 2])>) tf.Tensor([3 4], shape=(2,), dtype=int64)

Mykola Zotko
AloneTogether