
I have a folder where I will upload one file every month. The file will have the same format every month.

First problem:

The idea is to concatenate all the files in this folder into one file. Currently I am hardcoding the filenames (filenames[0], filenames[1], filenames[2], ...), but imagine that later I will have 50 files: should I explicitly add each of them to the transform_df decorator? Is there any other way to handle this?

Second problem:

Currently I have, let's say, 4 files (2021_07, 2021_08, 2021_09, 2021_10), and when I add the file representing the 2021_12 data I want to avoid changing the code. If I add input_5 = Input(path_to_2021_12_do_not_exists) before that file exists, the code will not run and will give an error.

How can I write the code so that it covers future files and ignores an input that does not exist yet, without manually adding a new value to my code each month?

Thank you

# from pyspark.sql import functions as F
from transforms.api import transform_df, Input, Output
from pyspark.sql.functions import to_date, year, col
from pyspark.sql.types import StringType
from myproject.datasets import utils
from pyspark.sql import DataFrame
from functools import reduce


input_dir = '/Company/Project_name/'
prefix_filename = 'DataInput1_'
suffixes = ['2021_07', '2021_08', '2021_09', '2021_10', '2021_11', '2021_12']

filenames = [input_dir + prefix_filename + suffixe for suffixe in suffixes]


@transform_df(
    Output("/Company/Project_name/Data/clean/File_concat"),
    input_1=Input(filenames[0]),
    input_2=Input(filenames[1]),
    input_3=Input(filenames[2]),
    input_4=Input(filenames[3]),
    )
def compute(input_1, input_2, input_3, input_4):
    input_dfs = [input_1, input_2, input_3, input_4]
    dfs = []

    def transformation_input(df):
        # some transformation
        return df
    for input_df in input_dfs:
        dfs.append(transformation_input(input_df))

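    # Union all the monthly dataframes by column name into one output dataframe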
    dfs = reduce(DataFrame.unionByName, dfs)
    return dfs
infinity911

3 Answers


This question comes up a lot; the simple answer is that you don't. Defining datasets and executing a build on them are two different steps executed at different stages.

Whenever you commit your code and run the checks, your overall Python code, except for the compute part, is executed during the renderShrinkwrap stage. This allows Foundry to discover what datasets exist and to publish them.

Publishing involves creating your dataset and putting whatever is inside your compute function into the jobspec of the dataset, so Foundry knows what code to execute whenever you run a build.

Once you hit build on the dataset, Foundry will only pick up whatever is on the jobspec and execute it. Any other code has already run during your checks, and it ran just once.

So any dynamic input/output would require you to re-run the checks on your repo, which means that some code change would have had to happen, since checks are part of the CI process, not part of the build.
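To make the timing concrete, here is a minimal sketch (using the same transforms.api pattern as in the question; the paths, suffixes, and names are placeholders): everything at module level, including the dict of Input objects, runs once at check time and is frozen into the jobspec, while only the body of compute runs at build time.

from functools import reduce

from pyspark.sql import DataFrame
from transforms.api import transform_df, Input, Output

# Module-level code: evaluated once, at check/CI time, when the jobspec is published.
# Whatever list exists here is frozen; adding a new month later needs a new commit and new checks.
suffixes = ['2021_07', '2021_08', '2021_09', '2021_10']
inputs = {'input_' + s: Input('/Company/Project_name/DataInput1_' + s) for s in suffixes}


@transform_df(
    Output("/Company/Project_name/Data/clean/File_concat"),
    **inputs
)
def compute(**input_dfs):
    # Only this body runs at build time; by then the set of inputs is already fixed.
    return reduce(DataFrame.unionByName, input_dfs.values())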

fmsf
  • Thank you for your detailed answer. What about having a dynamic list of filenames, by first checking whether the files exist, or by checking which files are currently available in a specific folder? That way I could construct my ```filenames``` list dynamically. – infinity911 Feb 16 '22 at 17:03
  • Unfortunately, at CI time you can't do these operations; the CI doesn't have access to the Foundry file system, so it can only access whatever is inside your repository. – fmsf Feb 17 '22 at 07:40
  • @fmsf Couldn't the files be uploaded to a filesystem (schema-less) dataset of files, which is then read (file paths) and iterated over at run time? – L99 Apr 06 '22 at 17:09

Taking a step back, assuming each of your input files has the same schema, Foundry would expect you to have all of those files in the same dataset as append transactions.

This might not be possible, though - for instance, if the only indication of the "year" of the data is embedded in the filename - but your sample code suggests that you expect all these datasets to have the same schema and to union together easily.

You can do this manually through the Dataset Preview - just use the Upload File button or drag-and-drop the new file into the Preview window - or, if it's an "end user" workflow, with a File Upload Widget in a Workshop app. You may need to coordinate with your Foundry support team if this widget isn't available.

Logan Rhyne
  • You'll also still have the individual source files listed in the Details tab on the Dataset Preview, so they're still available to reference/download. – Logan Rhyne Feb 16 '22 at 17:09
  • I thought about this solution, but the problem is that I have month-year columns in each input file. For example, I have the next 24 months (Jan 2022 to Dec 2023), and they can change from one version to another, meaning that I need to do a melt transformation before appending to the dataset. In this case, I am back to the original problem: either I do a manual (copy) data preparation for each dataset or I manually add the name of the file to the Python code. – infinity911 Feb 17 '22 at 09:01
  • So you don't necessarily have to have a schema applied to your input dataset - you can also have truly "raw" CSVs or really any other file type inside. You then need an initial transform step that uses the transform() decorator rather than the transform_df() decorator, so you can access the raw files in your code and do the necessary clean-up steps to output a unioned data frame. You can check out the Raw File Access docs on your instance at: /workspace/documentation/product/transforms/python-raw-file-access – Logan Rhyne Feb 18 '22 at 09:42
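A minimal, hedged sketch of the raw-file approach from the last comment (assuming the filesystem() raw file access described in those docs; the dataset path, glob, and CSV handling are illustrative placeholders, and the actual clean-up/melt logic is left out):

import csv

from transforms.api import transform, Input, Output


@transform(
    output=Output("/Company/Project_name/Data/clean/File_concat"),
    raw_files=Input("/Company/Project_name/raw_monthly_uploads"),  # schema-less dataset of uploaded CSVs
)
def compute(ctx, output, raw_files):
    fs = raw_files.filesystem()
    rows = []
    # Iterate over whatever files are currently in the dataset; no code change needed per month.
    for file_status in fs.ls(glob="*.csv"):
        with fs.open(file_status.path) as f:
            for record in csv.DictReader(f):
                record["source_file"] = file_status.path  # keep the month/filename for later melting
                rows.append(record)
    # Clean-up / melt steps would go here before writing the unioned output.
    output.write_dataframe(ctx.spark_session.createDataFrame(rows))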

A bit late to the post, but for anyone who is interested in an answer to most of the question: dynamically determining file names from within a folder is not doable, although having some level of dynamic input is possible, as follows:

# from pyspark.sql import functions as F
from transforms.api import transform, Input, Output
from pyspark.sql.functions import to_date, year, col
from pyspark.sql.types import StringType
from myproject.datasets import utils
from pyspark.sql import DataFrame

# from functools import reduce
from transforms.verbs.dataframes import union_many  # use this instead of reduce


input_dir = '/Company/Project_name/'
prefix_filename = 'DataInput1_'
suffixes = ['2021_07', '2021_08', '2021_09', '2021_10', '2021_11', '2021_12']
filenames = [input_dir + prefix_filename + suffixe for suffixe in suffixes]
inputs = {'input{}'.format(index): Input(filename) for index, filename in enumerate(filenames)}
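# inputs is e.g. {'input0': Input('/Company/Project_name/DataInput1_2021_07'), 'input1': Input('/Company/Project_name/DataInput1_2021_08'), ...}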


@transform(
    output=Output("/Company/Project_name/Data/clean/File_concat"),
    **inputs
    )
def compute(output, **kwargs):
    # Extract dataframes from input datasets
    input_dfs = [dataset_df.dataframe() for dataset_name, dataset_df in kwargs.items()]

    dfs = []
    def transformation_input(df):
        # some transformation
        return df
    for input_df in input_dfs:
        dfs.append(transformation_input(input_df))

    # dfs = reduce(DataFrame.unionByName, dfs)
    unioned_dfs = union_many(*dfs)
    return unioned_dfs

A couple of points:

  1. Created a dynamic input dict.
  2. That dict is read into the transform using **kwargs.
  3. Using the transform decorator rather than transform_df, we can extract the dataframes ourselves.
  4. (Not in the question) Combine multiple dataframes using the union_many function from the transforms.verbs library.
Kraeze