
I am running a custom component in Kubeflow to do some data manipulation and then save the result as a BigQuery table. How do I register the table as an artifact so that I can pass it down to the later stages of the pipeline?

Eventually I am planning on using a ParallelFor op to create multiple BigQuery tables, from each of which I will build a machine learning model. I would like to be able to pass these tables to the next stage so that I can create the models from them.

Currently what I am doing is just saving the table URI into a pandas dataframe:

import pandas as pd
from kfp.v2.dsl import Dataset, Output

def get_the_data(
    project_id: str,
    url: str,
    dataset_uri: Output[Dataset],
    lag: int = 0,
):
    ## name of the new BigQuery table
    table_id = url + "_lag_" + str(lag)

    ## code to query and create new table
    ##
    ##

    ## store the table id in a one-row dataframe which can be passed to the next stage
    df = pd.DataFrame(data=[table_id], columns=["path"])
    df.to_csv(dataset_uri.path + ".csv", index=False, encoding="utf-8-sig")

As mentioned above, I am eventually going to use a ParallelFor op to run this component multiple times in parallel and create multiple tables. I don't know how to manage and collect the table ids so that I can run subsequent ops on them.
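
For reference, the fan-out I have in mind looks roughly like this (a sketch assuming `get_the_data` is wrapped as a KFP component and using `dsl.ParallelFor`; `my_pipeline`, `url_list`, and `train_the_model` are placeholders). What I am missing is the fan-in: collecting the table ids produced inside the loop so I can run further ops over all of them.

from typing import List
from kfp import dsl

@dsl.pipeline(name="multi-table-pipeline")
def my_pipeline(project_id: str, url_list: List[str]):
    ## fan out: run get_the_data once per url, creating one BigQuery table each
    with dsl.ParallelFor(url_list) as url:
        get_data_task = get_the_data(
            project_id=project_id,
            url=url,
            lag=0,
        )
        ## per-table downstream steps can be chained inside the loop, e.g.
        ## train_the_model(dataset_uri=get_data_task.outputs["dataset_uri"])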

  • What about saving the urls into different variables such as `table_id_1`? Also this [stack question](https://stackoverflow.com/questions/69977440/how-to-use-kfp-artifact-with-sklearn) might help you. – Jose Gutierrez Paliza Jun 17 '22 at 18:10

0 Answers