I am running a custom component in Kubeflow to do some data manipulation and then save the result as a BigQuery table. How do I register the table as an artifact so that I can pass it down to the later stages of the pipeline?
Eventually I am planning on using a ParallelFor op to create multiple BigQuery tables, from which I will create multiple machine learning models. I would like to be able to pass these tables to the next stage so that I can create models from them.
Currently what I am doing is just saving the URI into a pandas DataFrame:
from kfp import dsl
from kfp.dsl import Dataset, Output

@dsl.component(packages_to_install=["pandas"])
def get_the_data(
    project_id: str,
    url: str,
    dataset_uri: Output[Dataset],
    lag: int = 0,
):
    import pandas as pd

    ## table name
    table_id = url + "_lag_" + str(lag)
    ## code to query and create the new table
    ##
    ##
    ## store the URI in a DataFrame which can be passed to the next stage
    df = pd.DataFrame(data=[table_id], columns=["path"])
    df.to_csv(dataset_uri.path + ".csv", index=False, encoding="utf-8-sig")
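For reference, the next stage currently recovers the table ID by reading that CSV back. A minimal sketch of such a consumer, assuming KFP SDK v2 import paths and a hypothetical train_model component name:

from kfp import dsl
from kfp.dsl import Dataset, Input

@dsl.component(packages_to_install=["pandas"])
def train_model(
    project_id: str,
    dataset_uri: Input[Dataset],  # the artifact written by get_the_data
):
    import pandas as pd

    # read back the CSV the upstream component wrote next to its
    # artifact path, and pull out the BigQuery table ID
    df = pd.read_csv(dataset_uri.path + ".csv")
    table_id = df["path"].iloc[0]
    ## code to build a model from table_id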
As mentioned above, I am eventually going to use a ParallelFor op to run this component multiple times in parallel and create multiple tables. I don't know how to manage and collect the table IDs so that I can run subsequent ops on them.
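To make the intent concrete, here is a rough sketch of the fan-out I have in mind (the pipeline name and lag values are made up, and it assumes the KFP SDK v2, where dsl.ParallelFor is the loop construct); the open question is what to do after the loop:

from kfp import dsl

@dsl.pipeline(name="multi-table-training")
def pipeline(project_id: str, url: str):
    # fan out: one get_the_data run (and one BigQuery table) per lag
    with dsl.ParallelFor(items=[0, 7, 14]) as lag:
        data_op = get_the_data(
            project_id=project_id,
            url=url,
            lag=lag,
        )
        # each iteration can feed its own model-training op...
        train_model(
            project_id=project_id,
            dataset_uri=data_op.outputs["dataset_uri"],
        )
    # ...but how do I collect the table IDs from all iterations
    # here, outside the loop, for a single downstream op?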