
I would like a transformation to be run on a group of datasets, like all the datasets in a folder.

The use case is that I have users who need to upload multiple groups of JSON files as datasets. The groupings are defined by the user, and no indication of their differences is stored in the JSON files themselves; the user makes that distinction by uploading the files to a particular dataset or by creating a new one. I then need to automatically run a transform on all of these to turn them into structured datasets. I don't want the user to have to go to the data transform to add the new datasets, and then to Data Lineage to add a new schedule for each dataset.

Is there any way to do this? Is there such thing as a dataset of datasets?

lectrician1

2 Answers


It isn't a full answer to your questions, but you can solve part of your problem with transform generation!

An example of this can be found below!

from transforms.api import transform_df, Input, Output

def transform_generator(sources):
    # type: (List[str]) -> List[transforms.api.Transform]
    transforms = []
    # This example uses multiple input datasets. You can also generate multiple outputs
    # from a single input dataset.
    for source in sources:
        @transform_df(
            Output('/sources/{source}/output'.format(source=source)),
            my_input=Input('/sources/{source}/input'.format(source=source))
        )
        def compute_function(my_input, source=source):
            # To capture the source variable in the function, you pass it as a defaulted keyword argument.
            return my_input.filter(my_input.source == source)
        transforms.append(compute_function)
    return transforms

TRANSFORMS = transform_generator(['src1', 'src2', 'src3'])
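Note the source=source default argument in compute_function: Python closures bind loop variables late, so without it every generated function would capture the same variable and all of the transforms would end up filtering on the last value in the list ('src3' here).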

You would also need to register the generated transforms with the following code in pipeline.py.

from transforms.api import Pipeline

import my_module

my_pipeline = Pipeline()
my_pipeline.add_transforms(*my_module.TRANSFORMS)

More information on transform generation can be found here. Instead of the hardcoded values ['src1', 'src2', 'src3'], you would need some way of getting all the datasets in the folder; judging from this other post, that may not actually be possible.

tomwhittaker

You can use Foundry Logic Flows if it's enabled on your stack.

There is an automation called Compass File Lister that does exactly what you want. It updates a list of inputs from a folder and uses them in a transform.

  • The Compass File Lister creates a json file in my transform code repository. How do I read this json file to then run a transform generator on it? – lectrician1 Jun 21 '23 at 16:17
  • You can use it in a logic flow: https://www.palantir.com/docs/foundry/building-pipelines/create-a-connected-flow/ – Matija Herceg Jun 22 '23 at 11:22
  • That doesn't answer my question. That's a circular answer. I got the logic flow set up. It creates a JSON file *inside* a code repository. I don't know how to read that JSON file since it is not a standard dataset and I can't do a transform on it, or use with open(): either. I don't know how to read from it in the code repository. It's just sitting there. – lectrician1 Jun 22 '23 at 13:30
  • Well you read in the content and generate the transforms dynamically as described in the answer above. – nicornk Jun 22 '23 at 17:20
  • @nicornk How do I read in the content from the JSON file generated in the code repository? – lectrician1 Jun 23 '23 at 12:54
  • 1
    https://www.palantir.com/docs/foundry/transforms-python/read-files-repository/ – nicornk Jun 24 '23 at 14:59
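Tying the two answers together, here is a minimal sketch of the approach from the last comment's link: a file checked into the repository can be read with ordinary file I/O at import time, and the result passed to the generator from the first answer. The file name sources.json, its location next to this module, and the JSON shape (a plain array of source names) are all assumptions; adjust them to match what the Compass File Lister actually writes.

import json
from pathlib import Path

from my_module import transform_generator  # the generator from the first answer

# Assumed name and location of the file the Compass File Lister writes
# into the repository; point this at the actual path in your repo.
SOURCES_FILE = Path(__file__).parent / 'sources.json'

# The file is part of the checked-in repository, so it can be read with
# plain file I/O when the module is imported (i.e. at check/build time),
# before any dataset build runs.
with open(SOURCES_FILE) as f:
    sources = json.load(f)  # assumed shape: ["src1", "src2", ...]

TRANSFORMS = transform_generator(sources)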