I have several datasets I want to union in Palantir Foundry. I know what the datasets are ahead of time. All of the datasets have the same schema (i.e. the same column names and column types).

What is the best way to combine (union) these datasets?

Dataset A:

col1  col2
1     a
2     b

Dataset B:

col1  col2
2     c
3     d

Dataset C:

col1  col2
1     e
1     f

Desired output:

col1  col2
1     a
2     b
2     c
3     d
1     e
1     f
domdomegg

3 Answers

You can use a dataset view for this. A dataset view is a Palantir Foundry dataset that holds no data files of its own; when it is read, it is composed of the union of other datasets (known as backing datasets). Building a view is therefore very fast, and views are space-efficient because they don't duplicate the data.

To create a view:

  1. Navigate to where you want to create the View.
  2. Click the green + New button and select 'View' in the dropdown.
  3. In the newly created View, open the 'Details' tab.
  4. Click the + Add backing dataset button and add the datasets you want to union.

You can then use the View as if it were the result of the union of the datasets. For example, you could use it as the underlying dataset for a Contour analysis or to back an ontology object.

More documentation about Views can be found in the Foundry in-platform documentation, by searching for the 'Views' product.

domdomegg

To do this in a Python transform with two datasets in Foundry Code Repositories or Code Workbook, you can use PySpark's DataFrame.unionByName method, which matches columns by name rather than by position:

from transforms.api import transform_df, Input, Output

@transform_df(
    Output("/path/to/dataset/unioned"),
    source_df_1=Input("/path/to/dataset/one"),
    source_df_2=Input("/path/to/dataset/two"),
)
def compute(source_df_1, source_df_2):
    # Columns are matched by name; duplicate rows are kept
    return source_df_1.unionByName(source_df_2)
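
For more than two inputs, the same method can be chained. The following is a minimal sketch, not part of the original answer: `union_by_name_all` is a hypothetical helper name, and it only assumes that each input exposes PySpark's `unionByName` method.

```python
from functools import reduce


def union_by_name_all(dataframes):
    """Union any number of DataFrames by column name, left to right.

    Hypothetical helper: it relies only on each input having a
    unionByName method, as PySpark DataFrames do.
    """
    dataframes = list(dataframes)
    if not dataframes:
        raise ValueError("at least one DataFrame is required")
    # Fold the list pairwise: ((df1 U df2) U df3) U ...
    return reduce(lambda left, right: left.unionByName(right), dataframes)
```

Inside a transform, this could be called as `union_by_name_all([source_df_1, source_df_2, source_df_3])`.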
domdomegg

To do this in a Python transform with several datasets in Foundry Code Repositories or Code Workbook, you can use the D.union_many helper from transforms.verbs:

from transforms.api import transform_df, Input, Output
from transforms.verbs import dataframes as D

@transform_df(
    Output("/path/to/dataset/unioned"),
    source_df_1=Input("/path/to/dataset/one"),
    source_df_2=Input("/path/to/dataset/two"),
    source_df_3=Input("/path/to/dataset/three"),
)
def compute(source_df_1, source_df_2, source_df_3):
    # Pass every input DataFrame to union_many in a single call
    return D.union_many(
        source_df_1,
        source_df_2,
        source_df_3,
    )
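
The question assumes every input has the same column names and types. A check like the one below could make that assumption explicit before unioning. It is only a sketch: `assert_same_schema` is a hypothetical helper, and it relies on the `dtypes` attribute of PySpark DataFrames, which lists `(column_name, type_string)` pairs.

```python
def assert_same_schema(dataframes):
    """Raise ValueError unless all DataFrames share column names and types.

    Hypothetical helper: column order is ignored, since a union by name
    does not depend on it.
    """
    dataframes = list(dataframes)
    if not dataframes:
        return
    # DataFrame.dtypes is a list of (column_name, type_string) pairs
    expected = dict(dataframes[0].dtypes)
    for index, df in enumerate(dataframes[1:], start=1):
        actual = dict(df.dtypes)
        if actual != expected:
            raise ValueError(
                f"input {index} has schema {actual}, expected {expected}"
            )
```

Calling `assert_same_schema([source_df_1, source_df_2, source_df_3])` at the top of `compute` would fail the build early if one input drifts from the others.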
domdomegg