I'm trying to develop a custom pipeline with Kubeflow Pipelines (KFP) components on Vertex AI (Google Cloud Platform). The steps of the pipeline are:
- read data from a BigQuery table
- create a pandas DataFrame
- use the DataFrame to train a K-Means model
- deploy the model to an endpoint
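For context, the components are wired together roughly like this (a simplified sketch: the pipeline name is a placeholder and the deploy step is omitted):

from kfp.v2 import dsl

@dsl.pipeline(name="kmeans-pipeline")  # placeholder name
def pipeline(
    project: str,
    region: str,
    destination_dataset: str,
    destination_table_name: str,
    num_clusters: int,
):
    create_task = create_dataframe(
        project=project,
        region=region,
        destination_dataset=destination_dataset,
        destination_table_name=destination_table_name,
    )
    # the artifact produced by step 2 is passed to step 3 by its output name
    kmeans_training(
        dataset=create_task.outputs["df"],
        num_clusters=num_clusters,
    )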
Here is the code for step 2. I had to use Output[Artifact] as the output type because the pd.DataFrame type that I found here did not work.
@component(base_image="python:3.9", packages_to_install=["google-cloud-bigquery","pandas","pyarrow"])
def create_dataframe(
project: str,
region: str,
destination_dataset: str,
destination_table_name: str,
df: Output[Artifact],
):
from google.cloud import bigquery
client = bigquery.Client(project=project, location=region)
dataset_ref = bigquery.DatasetReference(project, destination_dataset)
table_ref = dataset_ref.table(destination_table_name)
table = client.get_table(table_ref)
df = client.list_rows(table).to_dataframe()
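From the KFP docs, my understanding is that an Output[Artifact] carries a .path attribute pointing at the file the pipeline tracks, so I suspect the last line should serialize the DataFrame to that location instead of just rebinding the local df name. A sketch of what I mean (untested, and CSV is only my guess at a suitable format):

    # replacement for the last line of create_dataframe:
    # write the rows to the artifact's backing file so downstream
    # components can read them; rebinding `df` discards the data
    client.list_rows(table).to_dataframe().to_csv(df.path, index=False)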
Here is the code for step 3:
@component(base_image="python:3.9", packages_to_install=['sklearn'])
def kmeans_training(
dataset: Input[Artifact],
model: Output[Model],
num_clusters: int,
):
from sklearn.cluster import KMeans
model = KMeans(num_clusters, random_state=220417)
model.fit(dataset)
The pipeline run fails with the following error:
TypeError: float() argument must be a string or a number, not 'Artifact'
Is it possible to convert an Artifact to a NumPy array or a DataFrame?
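My best guess so far is that kmeans_training should read the artifact's file back into a DataFrame and write the fitted model to model.path, something like the sketch below, but I'm not sure this is the intended approach (untested; it assumes the upstream step saved a CSV, uses joblib for serialization, and pandas would also have to be added to packages_to_install):

    import pandas as pd
    import joblib
    from sklearn.cluster import KMeans

    # load the serialized DataFrame from the input artifact's file
    # (assumes create_dataframe wrote a CSV to df.path)
    dataset_df = pd.read_csv(dataset.path)

    kmeans = KMeans(num_clusters, random_state=220417)
    kmeans.fit(dataset_df)

    # persist the fitted estimator to the output Model artifact's file
    joblib.dump(kmeans, model.path)

Is this the right pattern for passing a DataFrame between components, or is there a more direct way?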