I'm not sure what the problem is with the code below. I have read the documentation, and everything in it points to an approach similar to this one.
Here is a simple example that doesn't work. I expected an anomaly to be reported for the feature x1,
since its distribution is very different between the two datasets.
import numpy as np
import pandas as pd
import tensorflow_data_validation as tfdv
NUM_VALS_TRAIN = 10000
# -------- Today --------
df = pd.DataFrame({
    'x1': np.random.normal(4, 3, NUM_VALS_TRAIN),
    'x2': np.random.normal(-3, 4, NUM_VALS_TRAIN)})
stats_train_today = tfdv.generate_statistics_from_dataframe(df)
# -------- Yesterday --------
df = pd.DataFrame({
    'x1': np.random.normal(400, 300, NUM_VALS_TRAIN),
    'x2': np.random.normal(-3, 4, NUM_VALS_TRAIN)})
stats_train_yesterday = tfdv.generate_statistics_from_dataframe(df)
# -------- Validate --------
schema = tfdv.infer_schema(stats_train_yesterday)
tfdv.get_feature(schema, 'x1').drift_comparator.infinity_norm.threshold = 0.01
anomalies = tfdv.validate_statistics(statistics=stats_train_today,
                                     schema=schema,
                                     previous_statistics=stats_train_yesterday)
tfdv.display_anomalies(anomalies)
The result is always "No anomalies found."
What is wrong with this code?
(Using tfx==0.24.1)
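As a sanity check that the two x1 samples really do differ, here is a minimal NumPy-only sketch. It bins both samples on shared edges and takes the largest absolute difference between the normalized histograms. The helper name `linf_histogram_distance`, the choice of 10 bins, and the fixed seed are my own assumptions for illustration; this is not necessarily how TFDV computes its drift metric internally.

```python
import numpy as np


def linf_histogram_distance(a, b, bins=10):
    """Largest absolute gap between two normalized histograms on shared bins.

    Hypothetical helper for illustration only; not part of TFDV.
    """
    # Use one common set of bin edges so the two histograms are comparable.
    lo = min(a.min(), b.min())
    hi = max(a.max(), b.max())
    edges = np.linspace(lo, hi, bins + 1)

    counts_a, _ = np.histogram(a, bins=edges)
    counts_b, _ = np.histogram(b, bins=edges)

    # Convert counts to fractions so each histogram sums to 1.
    frac_a = counts_a / counts_a.sum()
    frac_b = counts_b / counts_b.sum()

    return float(np.abs(frac_a - frac_b).max())


rng = np.random.default_rng(0)  # fixed seed, only for repeatability
x1_today = rng.normal(4, 3, 10000)
x1_yesterday = rng.normal(400, 300, 10000)

distance = linf_histogram_distance(x1_today, x1_yesterday)
print(distance)
```

With these parameters the printed distance should be far above the 0.01 threshold set in the schema, which is why I expected x1 to be flagged.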