
I am working with a client that wants to import metadata into Snowflake, and I want to know if my 3rd-party tool will 1) work and 2) align with the data strategy of the existing architecture.

I collect data about users that browse my manufacturing marketplace. My website tracks sessions, views, and the carts of both logged-in and not-logged-in users. I use Fivetran to import that collected web information (unstructured JSON) into my existing architecture. This information is archived after 30 days and is used to show a 30-day and a quarterly summary. I want this data replicated, preserved, and transformed into Snowflake after 30 days.

Data: 
[
  {
    "Session id": "87461c424d2d0f29a6b0075d",
    "isActive": true,
    "email": "xx",
    "date": "",
    "cart": ["Item 1", "Item 2"],
    "Sale": true
  },
  {
    "Session id": "87461c424d2d0f29a6b0075d",
    "isActive": false,
    "email": "xx",
    "date": "",
    "cart": ["Item 2"],
    "Sale": false
  }
]
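For reference, a minimal sketch of how records like these could land in Snowflake and be queried — the table and column names (`raw_events`, `v`) are my assumptions, not anything Fivetran is guaranteed to produce:

```sql
-- Assumed landing table: one VARIANT column per raw JSON event.
CREATE OR REPLACE TABLE raw_events (v VARIANT);

-- Load one sample record (Fivetran would normally do this).
INSERT INTO raw_events
  SELECT PARSE_JSON('{"Session id":"87461c424d2d0f29a6b0075d","isActive":true,"email":"xx","date":"","cart":["Item 1","Item 2"],"Sale":true}');

-- Pull typed columns back out; FLATTEN expands the cart array to one row per item.
SELECT v:"Session id"::STRING AS session_id,
       v:email::STRING        AS email,
       f.value::STRING        AS cart_item
FROM raw_events,
     LATERAL FLATTEN(input => v:cart) f;
```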

In order to comply with HIPAA and GDPR I need to be able to delete or remove a user's information. Can I use Snowflake's metadata to do this the same way I do in the software collecting my website's information? Can I use a stored procedure to remove the information based on an email a user gives me, and then recalculate all of my previously calculated summaries? Will Snowflake's metadata do that for me?
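To make the stored-procedure part of the question concrete, here is a hedged Snowflake Scripting sketch of email-based deletion followed by a summary rebuild. Every name in it (`raw_events`, `v`, `session_summary_30d`, `purge_user`) is an assumption for illustration, not part of my actual schema:

```sql
CREATE OR REPLACE PROCEDURE purge_user(user_email STRING)
RETURNS STRING
LANGUAGE SQL
AS
$$
BEGIN
  -- Remove every raw event tied to the e-mail the user supplied.
  DELETE FROM raw_events WHERE v:email::STRING = :user_email;

  -- Rebuild the 30-day summary from whatever remains.
  CREATE OR REPLACE TABLE session_summary_30d AS
  SELECT v:"Session id"::STRING AS session_id,
         COUNT(*)               AS events,
         MAX(v:"Sale"::BOOLEAN) AS converted
  FROM   raw_events
  WHERE  TRY_TO_DATE(v:date::STRING) >= DATEADD(day, -30, CURRENT_DATE())
  GROUP BY 1;

  RETURN 'purged';
END;
$$;

CALL purge_user('user@example.com');
```

Note this is the "alter the data" approach from the comments below, not anything Snowflake's table metadata does for you automatically.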

  • It feels like you are mixing up the words "metadata": if you have data from somewhere and you want to put it into Snowflake, it's just "data" at this point. You can put whatever data you want. If you want to alter that data later, does Snowflake have functions that allow you to alter VARIANT (JSON) data to cleanse it? Sure. You still have to know the nature of the data to cleanse it. Snowflake's "metadata" relates to what's in your tables, but it is really just there to optimize queries. Given you are "storing data", you just need to "alter the data" to comply with the rules. – Simeon Pilgrim Mar 10 '20 at 22:41
  • Now a second point might be: does the way Snowflake stores data align with altering data some time later (like 30 days later, or N days for per-customer retention policies)? All "rows" stored in a partition together with the rows you update will be rewritten, so that can end up being almost a full table rewrite (in the worst case, that is exactly what it is). And if you are using CLUSTERING you will get bitten when you update, as the order of the data is not sorted, so it might be random, requiring yet more partition rewrites. – Simeon Pilgrim Mar 10 '20 at 22:47
  • Currently AUTOCLUSTERING is costing us 5x full table rewrites for rewriting 1/3 of the partitions for GDPR-related PII cleaning alterations, due to how the data spans the partitions and how the UPDATE command has no sort. So if you are clustering, consider always doing full table rewrites (with an ORDER BY the same as your clustering) as the cost of large updates to tables. – Simeon Pilgrim Mar 10 '20 at 22:49
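The full-table-rewrite approach the last comment recommends could be sketched as follows. The table name, the purge predicate, and the clustering expression are all assumptions; the point is only that the rewrite carries an ORDER BY matching the clustering key, which a plain DELETE or UPDATE does not:

```sql
-- Sketch, assuming raw_events(v VARIANT) is clustered on the event date.
-- Instead of DELETE/UPDATE (which rewrites touched partitions in no
-- particular order), rewrite the whole table in clustering order in one pass.
INSERT OVERWRITE INTO raw_events
SELECT v
FROM raw_events
WHERE v:email::STRING <> 'user@example.com'  -- drop the PII rows being purged
ORDER BY TRY_TO_DATE(v:date::STRING);        -- same order as the clustering key
```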

0 Answers