Union 60 DataFrames in a Palantir Foundry directory using PySpark

Question

I have a directory with 60+ Foundry datasets in it. I just want to read all the datasets and union them into a single dataframe.

Path = input("MySpend new files P2P/2020/")

Dataset1
Dataset2
Dataset3
...
Dataset60

Output("MySpend new files P2P/2020/UnionAll")

Hi Biplab1985, any chance you could better formulate your question? https://stackoverflow.com/help/how-to-ask I think I know what you want and don't mind dropping an example, but the whole Stack Overflow community would benefit from a properly formulated question instead of a statement like the one above. - fmsf, Mar 09 '23 at 08:00

score 1 · Answer 1 · answered Mar 09 '23 at 09:56

More technically:

You can use union_many from transforms.verbs in your transform and list its inputs manually. Data Lineage lets you quickly copy and paste the paths of all the datasets (select the datasets > top right > "view histogram" icon > "copy paths"). Basic usage:

from transforms.api import transform_df, Input, Output
from transforms.verbs import dataframes as D

@transform_df(
    Output("/path/to/dataset/unioned"),
    source_df_1=Input("/path/to/dataset/one"),
    source_df_2=Input("/path/to/dataset/two"),
    source_df_3=Input("/path/to/dataset/three"),
)
def compute(source_df_1, source_df_2, source_df_3):
    return D.union_many(
        source_df_1,
        source_df_2,
        source_df_3,
    )
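
If you are curious what a by-name union of many dataframes amounts to, or you are working outside Foundry, here is a minimal sketch in plain PySpark. Note that union_by_name is a hypothetical helper name, not part of transforms.verbs, and I am not claiming it matches union_many exactly (for example in how mismatched schemas are handled). It only relies on DataFrame.unionByName, so it assumes every input has the same set of column names.

```python
from functools import reduce


def union_by_name(dataframes):
    """Union an iterable of Spark DataFrames, matching columns by name.

    Hypothetical helper for illustration; assumes all inputs share
    the same column names.
    """
    dataframes = list(dataframes)
    if not dataframes:
        raise ValueError("expected at least one DataFrame")
    # Fold the list pairwise: ((df1 U df2) U df3) U ...
    return reduce(lambda left, right: left.unionByName(right), dataframes)
```

Called as union_by_name([source_df_1, source_df_2, source_df_3]), it returns a single dataframe. Like any Spark union, it keeps duplicate rows.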

The same union_many transform, but easier to copy and paste: parameterize it with a list of paths as input.

from transforms.verbs import dataframes as D
from transforms.api import transform_df, Input, Output

# List the paths of the datasets to union
list_datasets_paths = [
    "/path/to/dataset/one",
    "/path/to/dataset/two",
    "/path/to/dataset/three"]


# Convert the list of paths into a dict of Input(), keyed by the last path segment
input_dict = {}
for dataset_path in list_datasets_paths:
    input_dict[dataset_path.split("/")[-1]] = Input(dataset_path)


# Provide the dict of Input() to the transform
@transform_df(
    Output("/path/to/dataset/unioned"),
    **input_dict
)
def compute_2(**inputs_dataframes):
    # Create a list of dataframes from the input dict
    dataframes_list = list(inputs_dataframes.values())
    # Union the list of dataframes
    return D.union_many(*dataframes_list)
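
One caveat with keying the dict on the last path segment: two paths ending in the same name silently overwrite each other, and names containing spaces or starting with a digit are not valid Python identifiers. Below is a small sketch of a helper that produces unique, identifier-safe names. build_input_names and the "ds_" prefix are my own hypothetical choices, not part of the transforms API.

```python
import re


def build_input_names(dataset_paths):
    """Map a unique, identifier-safe name to each dataset path.

    Hypothetical helper for illustration only.
    """
    names = {}
    for path in dataset_paths:
        last_segment = path.rstrip("/").split("/")[-1]
        # Replace every non-word character (spaces, dashes, dots...) with "_"
        name = re.sub(r"\W", "_", last_segment)
        # Identifiers cannot be empty or start with a digit
        if not name or name[0].isdigit():
            name = "ds_" + name
        # Fail loudly instead of silently dropping a dataset
        if name in names:
            raise ValueError(
                f"duplicate input name {name!r} for {path!r} and {names[name]!r}"
            )
        names[name] = path
    return names
```

You would then build the inputs with input_dict = {name: Input(path) for name, path in build_input_names(list_datasets_paths).items()} and keep the rest of the transform unchanged.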

If the set of datasets evolves over time, you can use Logic Flows. It essentially lists the rids (resource identifiers) of the resources in a given input folder and opens a new pull request on the repository with a file containing those rids/paths. Note that this is a beta product.

Note: there are also other tools for building pipelines, and hence unioning datasets, such as Pipeline Builder.
