
I am fairly new to writing code and am trying to teach myself Python and PySpark by searching the web for answers to my problems. I am trying to build a historical record set based on daily changes. I periodically have to bump the semantic version, but I do not want to lose the historical data I have already collected. If the job can run incrementally, it should perform the incremental transform as normal. Any and all help is appreciated.

from pyspark.sql import functions as F
from transforms.api import Input, Output, incremental, transform

SEMANTIC_VERSION = 1

# if job cannot run incrementally
# joins current snapshot data with already collected historical data
if cannot_run_incrementally:
    @transform(
        history=Output(historical_output),
        backup=Input(historical_output_backup),
        source=Input(order_input),
    )
    def my_compute_function(source, history, backup, ctx):
        input_df = (
            source.dataframe()
            .withColumn('record_date', F.current_date())
        )
        old_df = backup.dataframe()
        joined = old_df.unionByName(input_df)
        joined = joined.distinct()
        history.write_dataframe(joined)


# if job can run incrementally perform incremental transform normally
else:
    @incremental(snapshot_inputs=['source'], semantic_version=SEMANTIC_VERSION)
    @transform(
        history=Output(historical_output),
        backup=Output(historical_output_backup),
        source=Input(order_input),
    )
    def my_compute_function(source, history, backup):
        input_df = (
            source.dataframe()
            .withColumn('record_date', F.current_date())
        )
        history.write_dataframe(input_df.distinct()
                                .subtract(history.dataframe('previous', schema=input_df.schema)))
        backup.set_mode("replace")
        backup.write_dataframe(history.dataframe())

Working code, based on information from the selected answer and comments:

from pyspark.sql import functions as F
from transforms.api import Input, Output, incremental, transform

SEMANTIC_VERSION = 3


@incremental(snapshot_inputs=['source'], semantic_version=SEMANTIC_VERSION)
@transform(
    history=Output(),
    backup=Output(),
    source=Input(),
)
def compute(ctx, history, backup, source):
    # running incrementally
    if ctx.is_incremental:
        input_df = (
            source.dataframe()
            .withColumn('record_date', F.current_date())
        )
        history.write_dataframe(input_df.subtract(history.dataframe('previous', schema=input_df.schema)))
        backup.set_mode("replace")
        backup.write_dataframe(history.dataframe().distinct())

    # not running incrementally
    else:
        input_df = (
            source.dataframe()
            .withColumn('record_date', F.current_date())
        )
        backup.set_mode('modify')  # use 'replace' if you want to start fresh
        backup.write_dataframe(input_df)
        history.set_mode('replace')
        history.write_dataframe(backup.dataframe().distinct())
eruhl06

2 Answers


You can use the transform's 'IncrementalTransformContext' to determine whether it is running incrementally.

This can be seen in the code below.

from transforms.api import Input, Output, incremental, transform


@incremental()
@transform(
    x=Output(),
    y=Input(),
    z=Input(),
)
def compute(ctx, x, y, z):
    if ctx.is_incremental:
        pass  # incremental logic goes here
    else:
        pass  # snapshot (non-incremental) logic goes here

More information on IncrementalTransformContext can be found in your environment's documentation ({URL}/workspace/documentation/product/transforms/python-transforms-api-incrementaltransformcontext) or in the public docs (https://www.palantir.com/docs/foundry/transforms-python/transforms-python-api-classes/#incrementaltransformcontext).

tomwhittaker
  • Thank you. I did try to build my code this way before. The backup file seems to be my problem. I need to read the backup file if the transform cannot run incrementally and write the backup file if it can. When I do this, I get a circular dependency error. – eruhl06 Jul 25 '22 at 17:27
  • @eruhl06 the implementation here can get quite complicated. One possible solution would be to make the historical dataset append only (at least when running non-incrementally). You can do this by explicitly setting the write mode, i.e. `output_dataset.set_mode('modify')`. This would mean that the dataset would always have an append transaction (see the sketch after these comments). – tomwhittaker Jul 26 '22 at 11:31
  • You could probably achieve your desired outcome with the following: [Dataset S (Source)], [Dataset B (Backup) -> Append only update from S], [Dataset H (History) -> De-duplicate Dataset B]. Depending on your data scale and how likely you are to get duplicates when running the logic, B could get very big and poorly partitioned. It might be worth including some logic to re-partition B when it is running incrementally and you can read the previous transaction. – tomwhittaker Jul 26 '22 at 11:32
  • A quick note on the current implementation: I don't think the deduplication logic will currently work as expected. The column 'record_date' means that it will only catch duplicate records in the current run (rather than across the entire data), but I understand that could also be intentional depending on your use case! – tomwhittaker Jul 26 '22 at 11:36
  • I appreciate the help. I'm not sure I built my code how you suggested, but using `output_dataset.set_mode('modify')` was the ticket for me. I will need to add some re-partition logic though. – eruhl06 Jul 27 '22 at 17:10
  • Happy to hear you worked it out! – tomwhittaker Jul 27 '22 at 22:29
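For reference, here is a minimal sketch of the S -> B -> H pattern described in the comments above, split into two transforms so that the backup dataset is only ever written by one transform and only ever read by the other (which avoids the circular read/write problem on the backup). The dataset paths, function names, and the choice to de-duplicate on everything except 'record_date' are illustrative assumptions, not code from the original answer.

from pyspark.sql import functions as F
from transforms.api import Input, Output, incremental, transform


# Transform 1: append the current snapshot of S to B on every run.
@incremental(snapshot_inputs=['source'])
@transform(
    backup=Output('/path/to/backup'),   # Dataset B (placeholder path)
    source=Input('/path/to/source'),    # Dataset S (placeholder path)
)
def append_to_backup(backup, source):
    input_df = (
        source.dataframe()
        .withColumn('record_date', F.current_date())
    )
    # Force an APPEND ('modify') transaction even when the build is not
    # incremental, so rows already collected in B are never replaced.
    backup.set_mode('modify')
    backup.write_dataframe(input_df)


# Transform 2: rebuild H by de-duplicating the full contents of B.
@transform(
    history=Output('/path/to/history'),  # Dataset H (placeholder path)
    backup=Input('/path/to/backup'),     # Dataset B (placeholder path)
)
def dedupe_history(history, backup):
    backup_df = backup.dataframe()
    # De-duplicate on the business columns only; ignoring 'record_date' here
    # catches duplicates across the whole history rather than only within
    # the current run (see the deduplication note above). Keep 'record_date'
    # in the key instead if one row per day is intentional.
    business_cols = [c for c in backup_df.columns if c != 'record_date']
    history.write_dataframe(backup_df.dropDuplicates(business_cols))

Because B only ever receives append transactions it can accumulate many small files, so the re-partitioning logic mentioned in the comments (e.g. repartitioning B's rows before writing) would go in the first transform.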

In an incremental transform, there is a boolean flag property called 'is_incremental' in the incremental transform context object.

Therefore, I think you can use a single incremental transform definition and, based on the value of is_incremental, perform the operations you want. I would try something like this:

SEMANTIC_VERSION = 1

@incremental(snapshot_inputs=['source'], semantic_version=SEMANTIC_VERSION)
@transform(
    history=Output(historical_output),
    backup=Input(historical_output_backup),
    source=Input(order_input),
)
def my_compute_function(source, history, backup, ctx):
    input_df = (
        source.dataframe()
        .withColumn('record_date', F.current_date())
    )
    # if job cannot run incrementally
    # joins current snapshot data with already collected historical data
    if not ctx.is_incremental:
        old_df = backup.dataframe()
        joined = old_df.unionByName(input_df)
        joined = joined.distinct()
        history.write_dataframe(joined)

    else:  # if job can run incrementally perform incremental transform normally
        history.write_dataframe(input_df.distinct()
                                .subtract(history.dataframe('previous', schema=input_df.schema)))
        backup.set_mode("replace")
        backup.write_dataframe(history.dataframe())
wazzup
  • Thank you for the reply. I did have my code written similarly to this before. My problem comes with the backup file: under the if portion I am reading the backup as an input, and under the else portion I am writing the backup file. I am basically trying to create an automatic backup file every time the job runs. – eruhl06 Jul 25 '22 at 17:21