How to visualize a saved statistics artifact?

Question

I know of two ways to run a TFX pipeline. First, using a Jupyter notebook with InteractiveContext in a browser:

from tfx import v1 as tfx
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext


context = InteractiveContext(pipeline_root=_pipeline_data_folder)

example_gen = tfx.components.ImportExampleGen(input_base=_dataset_folder)
context.run(example_gen, enable_cache=True)

statistics_gen = tfx.components.StatisticsGen(examples=example_gen.outputs['examples'])
context.run(statistics_gen, enable_cache=True)

context.show(statistics_gen.outputs['statistics'])

This way, I can see the statistics artifact rendered in the browser.

The second way to run the same pipeline is with a Python script (no browser involved):

example_gen = tfx.components.ImportExampleGen(input_base=_dataset_folder)
statistics_gen = tfx.components.StatisticsGen(examples=example_gen.outputs['examples'])

components = [
    example_gen,
    statistics_gen,
]

pipeline = tfx.dsl.Pipeline(
    pipeline_name='sample_pipeline',
    pipeline_root=_pipeline_data_folder,
    metadata_connection_config=tfx.orchestration.metadata.sqlite_metadata_connection_config(
        f'{_pipeline_data_folder}/metadata.db'),
    components=components)

tfx.orchestration.LocalDagRunner().run(pipeline)

I understand that since there's no browser involved in the second approach, asking for a visualization there is pointless. But the same artifact that was created in the first approach is also created in the second one. So my question is: after the second pipeline has finished, how can I visualize the statistics artifact it created?

Mehran · Accepted Answer · 2023-01-05T06:09:38.823

It took me a whole day to figure this out, and I had to read the TFX source code to do it (there was hardly any documentation). An older approach to the same problem can be found in the TFX documentation, but it's dated and does not work with the latest version of TFX. This solution may well be short-lived too and stop working in a future release. But for the time being:

from tfx import types
from tfx import v1 as tfx
from tfx.orchestration.metadata import Metadata
from tfx.orchestration.experimental.interactive import visualizations
from tfx.orchestration.experimental.interactive import standard_visualizations
standard_visualizations.register_standard_visualizations()


sqlite_path = './pipeline_data/metadata.db'
pipeline_name = 'sample_pipeline'  # must match the pipeline_name given to tfx.dsl.Pipeline
component_name = 'StatisticsGen'
type_name = 'ExampleStatistics'
metadata_connection_config = tfx.orchestration.metadata.sqlite_metadata_connection_config(sqlite_path)

with Metadata(metadata_connection_config) as metadata:
    # Look up the 'node' context named '<pipeline_name>.<component_name>'.
    context = metadata.store.get_context_by_type_and_name('node', f'{pipeline_name}.{component_name}')
    artifacts = metadata.store.get_artifacts_by_context(context.id)
    artifact_type = metadata.store.get_artifact_type(type_name)
    # Keep only artifacts of the requested type and take the most recently updated one.
    latest_artifact = max(
        (a for a in artifacts if a.type_id == artifact_type.id),
        key=lambda a: a.last_update_time_since_epoch)
    # Wrap the MLMD record in a TFX Artifact so the visualization registry can display it.
    artifact = types.Artifact(artifact_type)
    artifact.set_mlmd_artifact(latest_artifact)
    visualization = visualizations.get_registry().get_visualization(artifact.type_name)
    visualization.display(artifact)
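
The selection step above (keep the artifacts of one type, then take the most recently updated one) fails with a ValueError from max() when no artifact of that type has been recorded. Here is a minimal sketch of that step in isolation with an explicit empty case. FakeArtifact is a hypothetical stand-in for the real MLMD artifact records and pick_latest is a hypothetical helper, neither is part of TFX; only the field names type_id and last_update_time_since_epoch come from the code above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FakeArtifact:
    """Hypothetical stand-in exposing the MLMD artifact fields used above."""
    id: int
    type_id: int
    last_update_time_since_epoch: int


def pick_latest(artifacts: Iterable[FakeArtifact], type_id: int) -> Optional[FakeArtifact]:
    """Return the most recently updated artifact of the given type, or None."""
    matching = [a for a in artifacts if a.type_id == type_id]
    if not matching:
        return None
    return max(matching, key=lambda a: a.last_update_time_since_epoch)


# Made-up records: two of type 7 and one of type 9.
artifacts = [
    FakeArtifact(id=1, type_id=7, last_update_time_since_epoch=100),
    FakeArtifact(id=2, type_id=7, last_update_time_since_epoch=300),
    FakeArtifact(id=3, type_id=9, last_update_time_since_epoch=500),
]
latest = pick_latest(artifacts, type_id=7)    # the artifact with id=2
missing = pick_latest(artifacts, type_id=42)  # None, no artifact of that type
```

Returning None instead of raising lets the caller print a clearer message before trying to display anything.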

Note that the code above, which looks up the node context, displays only the latest artifact produced by the statistics component of one specific pipeline. Alternatively, you can point to the artifact by its folder path (uri):

from tfx import types
from tfx import v1 as tfx
from tfx.orchestration.metadata import Metadata
from tfx.orchestration.experimental.interactive import visualizations
from tfx.orchestration.experimental.interactive import standard_visualizations
standard_visualizations.register_standard_visualizations()

sqlite_path = './pipeline_data/metadata.db'
uri = './pipeline_data/StatisticsGen/statistics/16'  # output folder of one StatisticsGen run
type_name = 'ExampleStatistics'
metadata_connection_config = tfx.orchestration.metadata.sqlite_metadata_connection_config(sqlite_path)

with Metadata(metadata_connection_config) as metadata:
    # Fetch every artifact recorded under this exact uri.
    artifacts = metadata.store.get_artifacts_by_uri(uri)
    artifact_type = metadata.store.get_artifact_type(type_name)
    # Keep only artifacts of the requested type and take the most recently updated one.
    latest_artifact = max(
        (a for a in artifacts if a.type_id == artifact_type.id),
        key=lambda a: a.last_update_time_since_epoch)
    # Wrap the MLMD record in a TFX Artifact so the visualization registry can display it.
    artifact = types.Artifact(artifact_type)
    artifact.set_mlmd_artifact(latest_artifact)
    visualization = visualizations.get_registry().get_visualization(type_name)
    visualization.display(artifact)
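
The trailing number in the uri above (16) differs from run to run, so hard-coding it is brittle. Here is a rough sketch of finding the highest-numbered folder instead. It assumes the layout seen in that uri (<pipeline root>/StatisticsGen/statistics/<number>) and assumes that a higher number means a later run; latest_numbered_dir is a hypothetical helper, not part of TFX.

```python
from pathlib import Path
from typing import Optional
import tempfile


def latest_numbered_dir(parent: Path) -> Optional[Path]:
    """Return the subfolder of `parent` with the largest numeric name, or None."""
    numbered = [p for p in parent.iterdir() if p.is_dir() and p.name.isdecimal()]
    if not numbered:
        return None
    # Compare as integers so that '16' ranks above '9'.
    return max(numbered, key=lambda p: int(p.name))


# Demo on a throwaway folder that mimics the layout from the uri above.
with tempfile.TemporaryDirectory() as root:
    stats_dir = Path(root) / 'StatisticsGen' / 'statistics'
    for run in ('2', '9', '16'):
        (stats_dir / run).mkdir(parents=True)
    latest_name = latest_numbered_dir(stats_dir).name
```

The resulting path, converted with str(), could then serve as the uri, but the lookup is by the uri string as it was recorded, so it may need the same form (relative or absolute) that the pipeline used.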

In the end, there may be a better way to do this that I missed.
