1

I want to scrub(or encrypt) the email information from a few tables which are older than a few years.

This I am planning to do as part of a job, next time when I run the job how can I omit the rows which are already scrubbed or encrypted.

I am looking for an approach which will be having good performance.

APC
  • 144,005
  • 19
  • 170
  • 281
TomJava
  • 499
  • 1
  • 8
  • 24

2 Answers2

2

"I want to scrub(or encrypt) the email information from a few tables which are older than a few years"

I hope this means you have a date column on these tables which you can use to determine which ones need to be scrubbed. The most efficient way of tackling the job is to track that date in an operational table, recording the most recent date scrubbed.

For example you have ten years' worth of data, and you need to scrub records which are more than four years old. Now this would work:

  update t23
  set email = null
   where date_created < add_months(sysdate, -48);

But it seems like you want to batch things up. So build a tracking table, which at its simplest would be

create table tracker (
    last_date_scrubbed);

Populate the last_date_scrubbed with a really old date say date '2010-01-01'

Now you can write a query like this

update t23
  set email = null
   where date_created 
     < (select last_date_scrubbed + interval '1' year from tracker);

That will clean all records older than 2011. Increment the date in the tracker table by one year. Run the query again to clean stuff from 2011. Repeat until you get to your target state of cleanliness. At which point you can switch to running the query monthly , with an interval of one month , or whatever.

Obviously you should proceduralize this. A procedure is the best way to encapsulate the steps and make sure everything is kept in step. Also you can use the database scheduler to run the procedure.


"there is one downside to this approach. I thought that you want to be free upon choosing which rows to be updated."

I don't see any requirement to track which individual rows have been scrubbed. After all, the end state is that every record older than a certain date has been scrubbed. When I have done jobs like this previously all anybody wanted to know was, "how many rows have we done so far and how many have we still got to do?" Which can be answered by tracking the sql%rowcount for each run.

APC
  • 144,005
  • 19
  • 170
  • 281
  • Thank you APC. Let me try to build an approach based on this. – TomJava Sep 13 '17 at 08:23
  • there is one downside to this approach. I thought that you wanted to be free upon choosing which rows get updated. so i didn't want to mess up with your where clause on the already existing columns in your table. that's why i proposed to add another column or another table that holds the updated rows ids in it. – Moher Sep 13 '17 at 08:34
  • @moher - it's a fair point; my response is too long for a comment ;-) – APC Sep 13 '17 at 08:46
1

For The best performance, you can add a Flag Column to your main table. a Column like IsEncrypted. then every time you try to run any query for the "not Encrypted rows" you easily use WHERE when IsEncrypted Column is false to condition on those rows only.
there are other ways though.

EDIT
another way is to create a logger table. basically what this table does, is that it records any more information you want about a certain ID in another table. have another table called EncryptionLogger, in it you would have at least two columns: EmailTableId, IsEncrypted. then in any query you can simply get any rows WHERE their Ids are NOT IN this table.

Moher
  • 741
  • 7
  • 17
  • I cannot alter the table as it is used by a product. – TomJava Sep 13 '17 at 08:22
  • i don't think if you add a nullable column to a table, the product gets broken. if that's the case, the application is too fragile – Moher Sep 13 '17 at 08:40
  • It is true. I can create new tables and modify the data. Other objects I will have to leave as it is. – TomJava Sep 13 '17 at 08:47
  • The performance of the `where id not in` subquery will degrade as the number of rows scrubbed increases. And we still need to filter on the date and/or whatever other criteria are used to determine which records are to be scrubbed in a given run. – APC Sep 13 '17 at 10:59
  • yeap, that's right. i think the information given by the question is not good enough. but it seams what you have proposed is better. upvoted – Moher Sep 13 '17 at 11:06