
I want to be able to generate a new dataset with each build, where the current date is appended to the name, like so:

dataset_output_2021-11-27
dataset_output_2021-11-28
dataset_output_2021-11-29

Is it possible to put a schedule on the build rather than a single dataset so that new datasets get generated daily?


2 Answers


I think another approach would be more elegant. Instead of creating a bunch of tables, save all the data in one table with an additional date column.

I think you already have a dataset which represents the data for the current day (e.g. input_data).

The following transformation adds a date column and appends the rows to an ever-growing history table, so you can always access the data for any date.

from transforms.api import transform, Output, Input, incremental
from pyspark.sql import functions as F


# snapshot_inputs tells the incremental decorator to read the full
# input on every run, while the output is still written incrementally
# (appended to) rather than overwritten.
@incremental(snapshot_inputs=['input_data'])
@transform(
    input_data=Input("/path/to/snapshot/input"),
    history=Output("/path/to/historical/dataset"),
)
def my_compute_function(input_data, history):
    input_df = input_data.dataframe()
    # Tag every row with the date of the build that produced it.
    input_df = input_df.withColumn('date', F.current_date())

    # In incremental mode this appends to the existing history
    # instead of replacing it.
    history.write_dataframe(input_df)

I took most of the code from the Foundry documentation. Try searching for "Create a historical dataset from snapshots" in your system.
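
If it helps, here is a minimal sketch of how a downstream transform could read one day's slice back out of the history table. The paths, the transform name, and the hard-coded date are placeholders, not anything from your project.

from transforms.api import transform_df, Input, Output
from pyspark.sql import functions as F


@transform_df(
    Output("/path/to/daily/view"),  # hypothetical output path
    history=Input("/path/to/historical/dataset"),
)
def daily_slice(history):
    # Keep only the rows that were appended on the date of interest.
    return history.filter(F.col('date') == '2021-11-27')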

ZygD

As of right now, deciding at runtime to create a new dataset is not supported.

If you can give some more detail in a separate question on what you're trying to accomplish, I might be able to give more tailored guidance.

However, if what you want is an efficient way to write data for each new day, you should check out your platform documentation on Hive-style partitioning! It's a great technique for laying out your data in a way that is fast to filter.
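
For illustration, here is a minimal plain-PySpark sketch of what Hive-style partitioning looks like; the paths are placeholders, and your platform's write APIs may expose the same option under a different name, so check the documentation for its equivalent of partitionBy.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/path/to/input")  # placeholder input path

# partitionBy writes one subdirectory per distinct value,
# e.g. /path/to/output/date=2021-11-27/, so filters on `date`
# only have to touch the matching directories.
(df.withColumn("date", F.current_date())
   .write
   .mode("append")
   .partitionBy("date")
   .parquet("/path/to/output"))  # placeholder output path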

vanhooser