
How does one transform (e.g., one-hot encode, index, bucketize, embed, etc.) labels natively in TensorFlow? tf.feature_column is the preferred way for features, but what about the labels (i.e., targets)? Those too may often need to be transformed and treated as a layer in the overall Keras pipeline. The problem is that tf.feature_column only acts on the features, not the labels.

Consider, for example, a CSV

F1     F2    T 
3.7    2.0   A
1.7    3.5   B
6.0    6.6   A
0.7    3.2   A

where F1 and F2 are the features and T the target. I'd then naturally call make_csv_dataset(..., label_name='T') to generate my dataset. But then how do I transform the targets so that all data processing is neatly wrapped in a layer (the way feature columns are via tf.keras.layers.DenseFeatures)?
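For concreteness, a minimal sketch of that first step (assuming the table above is saved under the placeholder name data.csv):

import tensorflow as tf

# 'data.csv' is a placeholder for the file holding the table above.
dataset = tf.data.experimental.make_csv_dataset(
    'data.csv', batch_size=2, label_name='T', num_epochs=1)
# Yields (features_dict, label) pairs; the label is still a raw string.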

Has the tf.data team at TensorFlow overlooked the fact that labels are often categorical and therefore need to be transformed?

EDIT: I would like to avoid any use of pandas since it's not scalable, hence my emphasis on the "native" tools of tf.data (e.g., make_csv_dataset() or otherwise).

Tfovid
  • For labels you may use: `tf.keras.utils.to_categorical(y)` – Kaveh Aug 03 '21 at 07:57
  • Have you resolved your issue? I'm ending up in this scenario too often and I'm looking for a way to perform label encoding with tf.data or some scalable way. I also want to avoid pandas or one hot encoding as much as possible as this does not scale. Currently have to do it at data set building stage (with spark, Flink...) – Alexandre Pieroux Nov 28 '22 at 11:43

1 Answer


In this case you have two options:

  1. Convert the labels to integer class indices and use sparse categorical cross entropy loss (see the sketch right after this list).
  2. Transform the class indices to a one-hot encoding; in that case you have to use categorical cross entropy loss (see the EDIT below).
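A minimal sketch of option 1 (the model architecture here is assumed, not taken from the question):

import tensorflow as tf

# Hypothetical two-feature, two-class model.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation='relu', input_shape=(2,)),
    tf.keras.layers.Dense(2, activation='softmax'),
])

# Integer class indices (0, 1, ...) feed straight into this loss;
# no one-hot encoding of the labels is needed.
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])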

EDIT: Here is the function to one-hot encode the classes (option 2):

@tf.function
def map_labels(feature, target):
    # One-hot encode the integer class index; number of classes = 2.
    return feature, tf.one_hot(target, 2)

You can then build the dataset and apply it as follows if you are using the tf.data API:

import pandas as pd
import tensorflow as tf

df = pd.DataFrame({
    'F1': [20, 30, 40, 60],
    'F2': [10, 50, 300, 300],
    'label': ['A', 'B', 'A', 'B'],
})

# Map the string labels to integer class indices.
df['label'] = df['label'].replace({'A': 0, 'B': 1})

# Build (features, label) pairs; .values hands NumPy arrays to TensorFlow.
dataset = tf.data.Dataset.from_tensor_slices(
    (df.iloc[:, :-1].values, df.iloc[:, -1].values))

dataset = dataset.shuffle(len(dataset)).map(map_labels).batch(20)
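Since the question emphasizes staying within tf.data (e.g., make_csv_dataset()), here is a minimal pandas-free sketch using tf.lookup.StaticHashTable to index the string labels on the fly; the file name data.csv and the vocabulary ['A', 'B'] are assumptions based on the question's sample:

import tensorflow as tf

# 'data.csv' is a placeholder for the CSV from the question.
dataset = tf.data.experimental.make_csv_dataset(
    'data.csv', batch_size=2, label_name='T', num_epochs=1)

# Static string -> index lookup table over the assumed vocabulary.
table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(
        keys=tf.constant(['A', 'B']),
        values=tf.constant([0, 1], dtype=tf.int64)),
    default_value=-1)

def encode_label(features, label):
    # Integer indices work with sparse categorical cross entropy;
    # wrap in tf.one_hot(..., 2) for categorical cross entropy instead.
    return features, table.lookup(label)

dataset = dataset.map(encode_label)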
Edwin Cheong
  • This doesn't answer the question of how you'd do it on the *labels*, though. The problem with feature columns is that they only act on features, not labels. I'd like to use the native `tf.data`, not the less-scalable pandas data frames. – Tfovid Aug 03 '21 at 07:54
  • I have made an adjustment to my answer, but you can easily adapt it. – Edwin Cheong Aug 03 '21 at 08:08
  • Thanks... But the point is that I want to avoid pandas altogether as it's not scalable. (If it were for using pandas all of this would have been a non-issue, hence my emphasis on using solely `tf.data` with, e.g., `make_csv_dataset()`.) – Tfovid Aug 03 '21 at 08:18