
Incremental update of S3 buckets without natural keys

I need to design an ETL flow. OLTP systems share customer, product, campaign and sales records via files. I want to transfer these files incrementally into AWS S3 buckets.

Assume that I want to transfer the customer file into the related AWS S3 bucket. The customer file contains a customer ID. This field is PII (personally identifiable information).

In the bulk (initial) load phase, I will first generate a new field, CUSTOMER_SK, which maps to the customer ID. Then I need to replace the customer ID with CUSTOMER_SK.
For example, if my customer ID is 9887345, I generate a new number, 93453423, and replace the customer ID value 9887345 with the new value 93453423. Finally, I can copy the file to the AWS S3 bucket. Since the customer ID has been replaced with CUSTOMER_SK, the S3 buckets do not contain PII data.
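To make the swap concrete, here is a minimal sketch of what I have in mind, assuming CSV files, a column named CUSTOMER_ID, and a local SQLite table holding the customer_id to customer_sk mapping that never leaves the on-premise environment (all file, table and column names are just placeholders):

    import csv
    import secrets
    import sqlite3

    # Local-only mapping store; this database stays on premise.
    conn = sqlite3.connect("id_map.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS id_map ("
        "customer_id TEXT PRIMARY KEY, customer_sk TEXT UNIQUE)"
    )

    def get_or_create_sk(customer_id: str) -> str:
        """Return the surrogate key for customer_id, generating one on first sight."""
        row = conn.execute(
            "SELECT customer_sk FROM id_map WHERE customer_id = ?", (customer_id,)
        ).fetchone()
        if row:
            return row[0]
        while True:
            # Random number with no relation to the real ID; retry on the rare collision.
            customer_sk = str(secrets.randbelow(10**12))
            try:
                conn.execute(
                    "INSERT INTO id_map (customer_id, customer_sk) VALUES (?, ?)",
                    (customer_id, customer_sk),
                )
                conn.commit()
                return customer_sk
            except sqlite3.IntegrityError:
                continue

    def anonymize_file(src: str, dst: str) -> None:
        """Copy src to dst, replacing the CUSTOMER_ID column with CUSTOMER_SK."""
        with open(src, newline="") as f_in, open(dst, "w", newline="") as f_out:
            reader = csv.DictReader(f_in)
            out_fields = ["CUSTOMER_SK"] + [c for c in reader.fieldnames if c != "CUSTOMER_ID"]
            writer = csv.DictWriter(f_out, fieldnames=out_fields)
            writer.writeheader()
            for row in reader:
                row["CUSTOMER_SK"] = get_or_create_sk(row.pop("CUSTOMER_ID"))
                writer.writerow(row)

    # Bulk load: anonymize the initial extract, then upload the result to S3.
    anonymize_file("customer_bulk.csv", "customer_bulk_anonymized.csv")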

In the daily ETL load, if the customer is a new customer, I can insert it into AWS S3. If the customer is an existing customer, for example he/she corrected the birth year field, I need to update the related record in the AWS S3 bucket. However, the AWS S3 bucket does not include the customer_id field, and the OLTP system does not know the customer_sk field. So I need to swap the customer_id value with the customer_sk value, and then I can copy the file to AWS S3.
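For the daily flow, this swap would be the only piece kept on premise; the sketch below reuses the anonymize_file and get_or_create_sk helpers from above (existing customers reuse their stored customer_sk, new customers get a fresh one). The bucket name and key are made up, and the actual merge of the delta into the existing data would run on AWS:

    import boto3

    def push_daily_delta(src: str, bucket: str, key: str) -> None:
        # Same swap as in the bulk load: look up or create the surrogate key
        # in the local mapping table, then upload only the anonymized delta.
        anonymize_file(src, "customer_delta_anonymized.csv")
        boto3.client("s3").upload_file("customer_delta_anonymized.csv", bucket, key)

    # Hypothetical bucket and key; the job running on AWS would then merge this
    # delta into the full dataset using CUSTOMER_SK as the join key.
    push_daily_delta("customer_delta.csv", "my-etl-bucket", "customer/daily/delta.csv")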

Because of regulations, the security unit does not allow us to expose PII (personally identifiable information) data to the business units in the AWS environment.

We can transfer the whole daily file in the daily ETL jobs, but file transfer takes time, so transferring all historical data to S3 is not feasible.

How can I implement this scenario? Do we need to run ETL jobs both on premise and on AWS? I want to build the ETL flow on AWS; the only thing I need to do locally is swap the ID fields. I do not want to build an ETL job on premise just for swapping fields, because then I would have to maintain jobs in both systems.

Thanks in advance
