
I am migrating data from an RDS Postgres database to S3 via the AWS DMS service.

The DMS task type is full load and CDC together. Let's say I have some data in a Postgres table named employee, e.g.:

emp_id  emp_name
1       John
2       Angel

When the task is initially created, a full load is done and the LOAD00000____.parquet file gets created in the S3 target location. Now I insert another row into the table:

emp_id  emp_name
3       Ram

Now a CDC action happens, and a date folder (20220101/) with a parquet file in it gets created.
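For illustration, the target prefix then looks roughly like this (the bucket name, schema/table path, and exact file names here are placeholders; DMS generates its own names):

s3://my-bucket/public/employee/LOAD00000001.parquet                  <- full load file
s3://my-bucket/public/employee/20220101/20220101-120000000.parquet   <- CDC file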

I am actually trying to retain the table in the target despite a truncate/drop operation that happens in Postgres after a table reload.

"ChangeProcessingDdlHandlingPolicy": {
  "HandleSourceTableDropped": false,
  "HandleSourceTableTruncated": false,
  "HandleSourceTableAltered": false
}

I have this configuration in my task settings, expecting that when I truncate/drop the table in Postgres and then do a reload, the target data should not be truncated/dropped. However, irrespective of the values that I give to the HandleSourceTableDropped and HandleSourceTableTruncated keys, the target folders get deleted.

My task_setting.json file also has:

"TargetTablePrepMode": "TRUNCATE_BEFORE_LOAD",

Questions:

  1. Why does the S3 folder get deleted on reload, irrespective of the values (true/false) that I provide to the keys in ChangeProcessingDdlHandlingPolicy?
  2. What does the ChangeProcessingDdlHandlingPolicy configuration object mean?

1 Answer


Answering your questions first, as per my best understanding of this:

The ChangeProcessingDdlHandlingPolicy setting relates to DDL handling, which mostly matters when your target is a database instance. More on this here.

Since in your case the target is S3, these settings don't really matter and are ignored.

Now, if you want your target folders to be unaffected when you do a reload, you may want to try "TargetTablePrepMode": "DO_NOTHING".
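For reference, the valid values for TargetTablePrepMode are DO_NOTHING, DROP_AND_CREATE, and TRUNCATE_BEFORE_LOAD, so the change in your task settings would look like this (trimmed to the relevant key):

"FullLoadSettings": {
  "TargetTablePrepMode": "DO_NOTHING"
}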

However, irrespective of this, the data at the target will be overwritten when you reload, i.e. restart, the task. Though you can resume the task instead if you want to keep the existing data, and the migration will start from the last stopped point.
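If you drive the task from the AWS CLI, the difference between the two is the start type; a quick sketch (the task ARN below is a placeholder):

# Resume from the last checkpoint, keeping what is already in the target
aws dms start-replication-task \
  --replication-task-arn arn:aws:dms:us-east-1:123456789012:task:EXAMPLETASK \
  --start-replication-task-type resume-processing

# Restart the migration from scratch; the target data gets rewritten
aws dms start-replication-task \
  --replication-task-arn arn:aws:dms:us-east-1:123456789012:task:EXAMPLETASK \
  --start-replication-task-type reload-target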

More on this here.

Mradul Yd
  • In the S3 folder I am storing the data in parquet files. When an ALTER command is executed on the source, that gets reflected in the parquet files. So you're saying delete/truncate alone does not work well here? Am I correct? – Nandini Raja Nov 26 '22 at 09:09
  • As ALTER is a DDL operation, it changes the schema of your tables/database at the source. Now your target is S3, and your expectation might be that these schema changes get applied to your parquet files as soon as they take place at the source. Since every parquet file has its own schema and there is no central schema as long as you don't use the Glue Data Catalog (i.e. run a crawler on the folder and create metadata, thus a table), there will not be direct schema changes inside S3. So whenever there are schema changes, you anyway need to overwrite everything inside your S3 target... continued... – Mradul Yd Nov 28 '22 at 11:11
  • Since in your case I believe it is a full load + ongoing replication task, you want to start from the point you last left off when the task is stopped/failed, or reload from the start whenever there are changes in the source. So DO_NOTHING is recommended here. However, you may need to play with all three settings on a smaller dataset, with the stop-after-full-load setting, to check what actually happens when each setting is applied. In my understanding, this will only vary when the schema is pre-created, which is not the case when we store parquet files inside S3, as the crawler detects it. – Mradul Yd Nov 28 '22 at 11:18