I have built a model with different features. For the preprocessing I mainly used feature_columns, for instance for bucketizing GEO information or for embedding categorical data with a large number of distinct values. Additionally, I had to preprocess two of my features before using feature_columns:
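For illustration, the two approaches mentioned above might look roughly like this (a minimal sketch, not my actual code; the keys, boundaries and sizes are placeholders):

import numpy as np
import tensorflow as tf
from tensorflow import feature_column

# Sketch only: bucketize the latitude into a fixed number of GEO buckets
latitude = feature_column.numeric_column('LATITUDE')
latitude_buckets = feature_column.bucketized_column(
    latitude, boundaries=np.linspace(46.0, 49.0, 20).tolist())  # placeholder boundaries

# Sketch only: embed a categorical feature with many distinct values
sc_hashed = feature_column.categorical_column_with_hash_bucket('SC', hash_bucket_size=1000)
sc_embedding = feature_column.embedding_column(sc_hashed, dimension=8)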
Feature “STREET”
def __preProcessStreet(data, tokenizer=None):
    # Normalize the street names (strip common suffixes such as "gasse", "straße", ...)
    data['STREETPRO'] = data['STREET'].apply(
        lambda x: __getNormalizedString(x, ["gasse", "straße", "strasse", "str.", "g.", " "], False))
    # Fit a new tokenizer only if none was passed in (e.g. for the training data)
    if tokenizer is None:
        tokenizer = Tokenizer(split='XXX')
        tokenizer.fit_on_texts(data['STREETPRO'])
    street_tokenized = tokenizer.texts_to_sequences(data['STREETPRO'])
    data['STREETW'] = tf.keras.preprocessing.sequence.pad_sequences(street_tokenized, maxlen=1)
    return data, tokenizer
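A hypothetical usage (the variable names are assumptions, not my actual code) would be to fit the tokenizer on the training data and reuse it unchanged for the test data:

train_data, street_tokenizer = __preProcessStreet(train_data)
test_data, _ = __preProcessStreet(test_data, tokenizer=street_tokenizer)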
As you can see, I did the preprocessing steps directly on the loaded Pandas DataFrame. Afterwards I processed this new column with the help of the feature_columns mentioned above:
def __getFutureColumnStreet(street_num_words):
    street_voc = tf.feature_column.categorical_column_with_identity(
        key='STREETW', num_buckets=street_num_words)
    dim = __getNumberOfDimensions(street_num_words)
    street_embedding = feature_column.embedding_column(street_voc, dimension=dim)
    return street_embedding
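__getNumberOfDimensions is not shown here; as an assumption, it could implement the common fourth-root heuristic for choosing the embedding size, e.g.:

def __getNumberOfDimensions(num_words):
    # Assumed heuristic (not necessarily my implementation):
    # embedding dimension ~ vocabulary size ** 0.25
    return int(num_words ** 0.25)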
Feature “NAME1”
The preprocessing steps for the NAME1 column are quite similar, except that I split the NAME1 field into two separate fields, “NAME1W1” and “NAME1W2”, which contain the two most common words of the vocabulary:
def __preProcessName(data, tokenizer=None):
    # Normalize the names (strip tokens such as "(asg)", "poasg", ...)
    data['NAME1PRO'] = data['NAME1'].apply(
        lambda x: __getNormalizedString(x, ["(asg)", "asg", "(poasg)", "poasg"]))
    # Fit a new tokenizer only if none was passed in (e.g. for the training data)
    if tokenizer is None:
        tokenizer = Tokenizer()
        tokenizer.fit_on_texts(data['NAME1PRO'])
    name1_tokenized = tokenizer.texts_to_sequences(data['NAME1PRO'])
    name1_tokenized_pad = tf.keras.preprocessing.sequence.pad_sequences(name1_tokenized, maxlen=2, truncating='pre')
    # Split the two padded token ids into the separate columns NAME1W1 and NAME1W2
    data = pd.concat([data, pd.DataFrame(name1_tokenized_pad, columns=['NAME1W1', 'NAME1W2'])], axis=1)
    return data, tokenizer
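To illustrate what this produces (made-up data, not my real dataset): tokenizing and padding to maxlen=2 yields two integer token ids per name, which then fill NAME1W1 and NAME1W2:

import pandas as pd
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

demo = pd.DataFrame({'NAME1PRO': ['max mustermann', 'erika musterfrau gmbh']})
tok = Tokenizer()
tok.fit_on_texts(demo['NAME1PRO'])
seqs = tok.texts_to_sequences(demo['NAME1PRO'])
# truncating='pre' drops leading tokens of sequences longer than maxlen
print(tf.keras.preprocessing.sequence.pad_sequences(seqs, maxlen=2, truncating='pre'))
# shape (2, 2): one row with two token ids per name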
Afterwards I also used feature_columns for the word embedding:
def __getFutureColumnsName(name_num_words):
    namew1_voc = tf.feature_column.categorical_column_with_identity(
        key='NAME1W1', num_buckets=name_num_words)
    namew2_voc = tf.feature_column.categorical_column_with_identity(
        key='NAME1W2', num_buckets=name_num_words)
    dim = __getNumberOfDimensions(name_num_words)
    namew1_embedding = feature_column.embedding_column(namew1_voc, dimension=dim)
    namew2_embedding = feature_column.embedding_column(namew2_voc, dimension=dim)
    return (namew1_embedding, namew2_embedding)
Model
I am using the TensorFlow (Keras) Functional API to construct my model:
print("start preprocessing...")
feature_columns = feature_selection.getFutureColumns(data, args.zip, args.sc, bucketSizeGEO, False)
feature_layer = tf.keras.layers.DenseFeatures(feature_columns, trainable=True)
print("preprocessing completed")
…
print("Step {}/{}".format(currentStep, stepNum))
feature_layer_inputs = feature_selection.getFeatureLayerInputs()
new_layer = feature_layer(feature_layer_inputs)
for _ in range(numLayers):
new_layer = tf.keras.layers.Dense(numNodes, activation=tf.nn.swish, kernel_regularizer=regularizers.l2(reg), bias_regularizer=regularizers.l2(reg))(new_layer)
new_layer = tf.keras.layers.Dropout(dropRate)(new_layer)
output_layer = tf.keras.layers.Dense(1, activation=tf.nn.sigmoid, kernel_regularizer=regularizers.l2(reg), bias_regularizer=regularizers.l2(reg))(new_layer)
model = tf.keras.Model(inputs=[v for v in feature_layer_inputs.values()], outputs=output_layer)
model.compile(optimizer=opt,
loss='binary_crossentropy',
metrics=['accuracy'])
paramString = "Arg-e{}-b{}-l{}-n{}-o{}-z{}-r{}-d{}".format(args.epoch, args.batchSize, numLayers, numNodes, opt, bucketSizeGEO, reg, dropRate)
log_dir = "logs\\neural\\" + paramString + "\\" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)
print("Start training with the following parameters:", paramString)
model.fit(train_ds,
validation_data=val_ds,
epochs=args.epoch,
callbacks=[tensorboard_callback])
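getFeatureLayerInputs is not shown here; it essentially returns a dict of Keras Input tensors keyed by feature name, roughly like the following sketch (the exact names, shapes and dtypes are assumptions):

def getFeatureLayerInputs():
    # One Input per raw feature that the DenseFeatures layer expects
    return {
        'NAME1W1': tf.keras.Input(shape=(1,), name='NAME1W1', dtype=tf.int32),
        'NAME1W2': tf.keras.Input(shape=(1,), name='NAME1W2', dtype=tf.int32),
        'STREETW': tf.keras.Input(shape=(1,), name='STREETW', dtype=tf.int32),
        'ZIP': tf.keras.Input(shape=(1,), name='ZIP', dtype=tf.string),
        'LONGITUDE': tf.keras.Input(shape=(1,), name='LONGITUDE', dtype=tf.float32),
        'LATITUDE': tf.keras.Input(shape=(1,), name='LATITUDE', dtype=tf.float32),
        # ... the remaining features (SC, AVIS_TYPE, ASG, ...) analogously
    }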
TensorFlow Serving
Logically, the two preprocessing steps that use the Tokenizer are not part of the model and therefore cannot be executed during serving, so a POST request to the model server currently looks like this (on Windows):
curl -d "{"""instances""": [{"""NAME1W1""": [12], """NAME1W2""": [2032], """ZIP""": [""1120""], """STREETW""": [1180], """LONGITUDE""": 16.47, """LATITUDE""": 48.22, """AVIS_TYPE""": [""E""],"""ASG""": [0], """SC""": [""101""], """PREDICT""": [0]}]}" -X POST http://localhost:8501/v1/models/my_model:predict
So at the moment I am trying to find a way to include these two preprocessing steps in my model, so that the POST command would look like this:
curl -d "{"""instances""": [{"""NAME1""": [“”Max Mustermann””], """ZIP""": [""1120""], """STREET""": [Teststraße], """LONGITUDE""": 16.47, """LATITUDE""": 48.22, """AVIS_TYPE""": [""E""],"""ASG""": [0], """SC""": [""101""], """PREDICT""": [0]}]}" -X POST http://localhost:8501/v1/models/my_model:predict
but with the same pre-processing steps inside the model.
I tried to use map functions on the datasets as well as preprocessing layers, but without success, because I am not sure whether I can combine them with the feature_columns. I also tried something similar to what is described here: https://keras.io/examples/structured_data/structured_data_classification_from_scratch/
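For reference, the direction I was experimenting with looks roughly like this (a sketch assuming the experimental Keras preprocessing layers in TF 2.x; this is not working code, and the open question remains how to combine its output with the DenseFeatures layer built from the feature_columns):

import tensorflow as tf

# Assumed sketch: replace the external Tokenizer with an in-model TextVectorization layer
street_input = tf.keras.Input(shape=(1,), name='STREET', dtype=tf.string)
street_vectorizer = tf.keras.layers.experimental.preprocessing.TextVectorization(
    output_mode='int', output_sequence_length=1)
street_vectorizer.adapt(data['STREETPRO'].values)  # fit on the normalized training text
street_tokens = street_vectorizer(street_input)    # integer token ids, now part of the graph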