
I'm not sure what the problem is with the code below. I read the documentation, and everything points to an approach similar to this one.

Here is a simple example that doesn't work. I expected a drift notice for the feature x1, since its distribution is very different between the two datasets.

import numpy as np
import pandas as pd
import tensorflow_data_validation as tfdv

NUM_VALS_TRAIN = 10000

# -------- Today --------

df = pd.DataFrame({
    'x1': np.random.normal(4, 3, NUM_VALS_TRAIN),
    'x2': np.random.normal(-3, 4, NUM_VALS_TRAIN)})

stats_train_today = tfdv.generate_statistics_from_dataframe(df)

# -------- Yesterday --------

df = pd.DataFrame({
    'x1': np.random.normal(400, 300, NUM_VALS_TRAIN),
    'x2': np.random.normal(-3, 4, NUM_VALS_TRAIN)})

stats_train_yesterday = tfdv.generate_statistics_from_dataframe(df)

# -------- Validate --------

schema = tfdv.infer_schema(stats_train_yesterday)

tfdv.get_feature(schema, 'x1').drift_comparator.infinity_norm.threshold = 0.01

anomalies = tfdv.validate_statistics(statistics=stats_train_today,
                                     schema=schema,
                                     previous_statistics=stats_train_yesterday)

tfdv.display_anomalies(anomalies)

The result is always "No anomalies found."

What is wrong with this code?

(Using tfx==0.24.1)

John Tartu

2 Answers


Okay, I found this at https://github.com/tensorflow/data-validation/blob/master/RELEASE.md

"Add support for detecting drift and distribution skew in numeric features"

Turns out drift detection for numeric features is not implemented in this version yet.

John Tartu
  • please check the links below for an implementation of JS divergence: https://www.tensorflow.org/tfx/data_validation/get_started and https://cloud.google.com/blog/topics/developers-practitioners/event-triggered-detection-data-drift-ml-workflows. Also, the following snippet is used for the same: tfdv.get_feature(schema1, 'duration').drift_comparator.jensen_shannon_divergence.threshold = 0.01 –  Jun 25 '21 at 14:04

Jensen-Shannon divergence is the drift comparator for numeric features, and the infinity norm (L-infinity distance) is for categorical features. You have used the infinity norm on a numeric feature, which is why no drift is ever reported. Check Drift Detection, which says:

"Drift detection is supported between consecutive spans of data (i.e., between span N and span N+1), such as between different days of training data. We express drift in terms of L-infinity distance for categorical features and approximate Jensen-Shannon divergence for numeric features. You can set the threshold distance so that you receive warnings when the drift is higher than is acceptable. Setting the correct distance is typically an iterative process requiring domain knowledge and experimentation."