More technically:
- You can leverage `union_many` in your transform (more verbose example here) and list its inputs manually. Note that you can use Data Lineage to quickly copy-paste the paths of all the datasets (select the datasets > top right > "view histogram" icon > "copy paths"). Basic usage:
```python
from transforms.api import transform_df, Input, Output
from transforms.verbs import dataframes as D


@transform_df(
    Output("/path/to/dataset/unioned"),
    source_df_1=Input("/path/to/dataset/one"),
    source_df_2=Input("/path/to/dataset/two"),
    source_df_3=Input("/path/to/dataset/three"),
)
def compute(source_df_1, source_df_2, source_df_3):
    return D.union_many(
        source_df_1,
        source_df_2,
        source_df_3,
    )
```
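If you prefer not to depend on `transforms.verbs`, a plain PySpark equivalent is a reduce over `unionByName`. This is a minimal sketch, not Foundry-specific API, and `union_many_fallback` is just an illustrative helper name:

```python
from functools import reduce

from pyspark.sql import DataFrame


def union_many_fallback(*dfs: DataFrame) -> DataFrame:
    # unionByName aligns columns by name rather than position;
    # allowMissingColumns=True (Spark 3.1+) fills columns absent
    # from a given input with nulls.
    return reduce(
        lambda left, right: left.unionByName(right, allowMissingColumns=True),
        dfs,
    )
```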
- In the same way, but easier to copy-paste, you can parameterize your transform to take an array of paths as its inputs:
```python
from transforms.api import transform_df, Input, Output
from transforms.verbs import dataframes as D

# List the paths of the datasets to union
list_datasets_paths = [
    "/path/to/dataset/one",
    "/path/to/dataset/two",
    "/path/to/dataset/three",
]

# Convert the list of paths into a dict of Input(),
# keyed by the last segment of each path
input_dict = {}
for dataset_path in list_datasets_paths:
    input_dict[dataset_path.split("/")[-1]] = Input(dataset_path)


# Provide the dict of Input() to the transform
@transform_df(
    Output("/path/to/dataset/unioned"),
    **input_dict
)
def compute_2(**inputs_dataframes):
    # Collect the dataframes from the input dict
    dataframes_list = inputs_dataframes.values()
    # Union all of them
    return D.union_many(*dataframes_list)
```
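One caveat with this pattern: the dict keys become Python keyword arguments, so they must be unique and valid identifiers. If two dataset paths end with the same name, or contain characters that are not legal in identifiers, you can derive safe keys instead. A minimal sketch (the `input_` prefix is arbitrary):

```python
# Keys like input_0, input_1, ... are always unique and valid identifiers
input_dict = {
    f"input_{i}": Input(dataset_path)
    for i, dataset_path in enumerate(list_datasets_paths)
}
```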
- If the set of datasets evolves over time, you can use Logic Flows. It will essentially list the rids (resource identifiers) of the resources in a given input folder and open a new pull request on the repository with a file containing those rids/paths. Note this is a beta product.
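As an illustration, assuming the generated file exposes a plain Python list of paths (the `generated_inputs` module and `DATASET_PATHS` variable below are hypothetical names, not a documented Logic Flows format), the transform can consume it exactly like the array-of-paths example above:

```python
from transforms.api import transform_df, Input, Output
from transforms.verbs import dataframes as D

# Hypothetical module kept up to date by the Logic Flow's pull requests;
# assumed to expose a list of dataset paths.
from generated_inputs import DATASET_PATHS

input_dict = {
    f"input_{i}": Input(path) for i, path in enumerate(DATASET_PATHS)
}


@transform_df(Output("/path/to/dataset/unioned"), **input_dict)
def compute_from_generated(**inputs_dataframes):
    return D.union_many(*inputs_dataframes.values())
```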
Note: you also have other tools to build pipelines, and hence union datasets, such as Pipeline Builder (docs).