
I have a bunch of data in files stored in Amazon S3 and am planning to use it to build a Data Vault in Redshift. My first question is whether the right approach is to build the DV and Data Marts both in Redshift, or whether I should treat S3 as my Data Lake and have only the Data Marts in Redshift?

In my architecture I'm currently considering the former (i.e. S3 Data Lake + Redshift Vault and Marts). However, I don't know whether I can create ETL processes directly in Redshift to populate the Marts with data from the Vault, or whether I'll have to use, for example, Amazon EMR to process the raw data in S3, generate new files there, and finally load them into the Marts.

So, my second question is: What should the ETL strategy be? Thanks.

Cesar Vinas
  • S3 is not relational; it's basically for storing files. Data Vault is implemented in a relational database, so you can't create a Data Vault on S3. How would you build your hubs, links, and satellites in S3? If you're looking for terminology, then S3 might be called your 'data lake' (urgh, I feel dirty now) – Nick.Mc Aug 22 '16 at 01:00
  • @Nick.McDermaid, you are right. I've been thinking of S3 more as a Data Lake in my architecture. My question is more about having the DV and Data Marts in Redshift, or just Data Marts that are populated from data in the lake (I'll edit the question). What I'm thinking is to load the Vault with data from S3 and then load the Marts with data from the Vault. However, I don't know whether I'll be able to create ETL processes directly in Redshift, or whether I'll have to do it with, for example, Amazon EMR by processing the files in S3, generating new files with data for the Marts, and loading them into Redshift. – Cesar Vinas Aug 22 '16 at 01:10
  • Well, if you have a need for a data vault, it'll need to be built in Redshift. You can load files into the DV or DM. If you decide you want a DV, you can then load the DV into the DM via views and delete/insert/update statements - you don't have to go the roundabout way of exporting and importing files again. Loading files into Redshift would require some kind of scheduling / job-running tool (on AWS or not) to connect to Redshift and run the `COPY` command inside Redshift to load the data (a sketch of such a `COPY` follows after these comments). I investigated a project doing this but never started it, so this is all theory for me right now. – Nick.Mc Aug 22 '16 at 01:17
  • But you should definitely be able to have one schema for your data vault and another schema for your data mart, and load data from DV > DM directly, without needing to re-export and import files. – Nick.Mc Aug 22 '16 at 01:18
  • I'm now in the same situation, but I'm thinking of building a schema over the S3 files using Glue; then I can use DV over Athena. This solution is still under research and evaluation as well. – Youssef essa Mar 16 '19 at 08:08
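
A minimal sketch of the scheduled `COPY` load described in the comments above. The staging table, S3 path, and IAM role are hypothetical placeholders, not details from the thread:

```sql
-- Load raw files from S3 into a Redshift staging table.
-- stg.customer_raw, the bucket path, and the IAM role are
-- all hypothetical names.
COPY stg.customer_raw
FROM 's3://my-data-lake/customer/2016/08/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
FORMAT AS CSV
GZIP;
```

Whatever scheduler you pick (Data Pipeline, cron, or similar) only needs to connect to Redshift and issue this statement on the desired interval.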

5 Answers


Apologies! I don't have the reputation to comment, which is why I am writing in the answer section.

I am exactly in the same boat as you: trying to perform my ETL operations in Redshift. As of now I have 3 billion rows and expect that to grow drastically.
Right now I'm loading data into the data marts in Redshift using DMLs that are called at regular intervals from AWS Lambda. In my opinion, it is very difficult to create a Data Vault in Redshift.
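
For illustration, a minimal sketch of the kind of delete/insert DML such a scheduled Lambda might issue; every table and column name here is hypothetical:

```sql
-- Refresh the most recent day's slice of a mart table in one
-- transaction. mart.daily_sales and stg.sales are made-up names.
BEGIN;

DELETE FROM mart.daily_sales
WHERE sale_date >= DATEADD(day, -1, CURRENT_DATE);

INSERT INTO mart.daily_sales (sale_date, store_id, total_amount)
SELECT sale_date, store_id, SUM(amount)
FROM stg.sales
WHERE sale_date >= DATEADD(day, -1, CURRENT_DATE)
GROUP BY sale_date, store_id;

COMMIT;
```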

Rahul Gupta
  • It sounds like you've made more progress than me already. I'm currently doing some research and defining what I call v0.1 of my flow and data architecture. The Data Vault strategy makes a lot of sense to me, so now I'm trying to understand the implementation implications of using Redshift. Did you look at using AWS Data Pipeline for the ETL between the Vault and Marts? From what I've read, it looks like a feasible approach at a small additional cost. – Cesar Vinas Aug 22 '16 at 20:22
  • Yes, you can use Data Pipeline, but be aware that its minimum scheduling interval is 15 minutes and my business demands are more frequent. Also, I could not find any documentation on performing SQL transformations on my staging tables between Redshift (source) and Redshift (destination). – Rahul Gupta Aug 30 '16 at 15:12

I'm a bit late to the party and no doubt you've resolved this, but it might still be relevant, so I thought I'd share my opinion. One solution is to use S3 and Hive as a Persistent Staging Area (a Data Lake, if you will) to land the data from sources, and construct your DV entirely in Redshift. You will still need a staging area in Redshift in order to ingest the files from S3, ensuring the hashes are calculated on the way into the Redshift staging tables (that's where EMR/Hive comes in). You could add the hashes directly in Redshift instead, but that could put Redshift under duress depending on volume. Push data from staging into the DV via plain old bulk insert and update statements, and then virtualise your marts in Redshift using views.
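
To make that flow concrete, here is a minimal sketch of the staging-to-DV load and a virtualised mart. It assumes hypothetical hub/satellite tables, and that the hash key (customer_hk) and row hash (hash_diff) were computed upstream as described above:

```sql
-- 1. Insert hub rows from staging that don't exist yet
--    (hashes were computed on the way into staging).
INSERT INTO dv.hub_customer (customer_hk, customer_bk, load_dts, record_source)
SELECT s.customer_hk, s.customer_bk, GETDATE(), s.record_source
FROM stg.customer s
LEFT JOIN dv.hub_customer h ON h.customer_hk = s.customer_hk
WHERE h.customer_hk IS NULL;

-- 2. Append satellite rows only when the attribute hash has changed.
INSERT INTO dv.sat_customer (customer_hk, load_dts, hash_diff, name, city)
SELECT s.customer_hk, GETDATE(), s.hash_diff, s.name, s.city
FROM stg.customer s
LEFT JOIN dv.sat_customer sat
       ON sat.customer_hk = s.customer_hk
      AND sat.hash_diff = s.hash_diff
WHERE sat.hash_diff IS NULL;

-- 3. Virtualise a mart dimension as a view over the vault,
--    keeping only the latest satellite row per hub key.
CREATE OR REPLACE VIEW mart.dim_customer AS
SELECT customer_id, name, city
FROM (
    SELECT h.customer_bk AS customer_id,
           s.name,
           s.city,
           ROW_NUMBER() OVER (PARTITION BY s.customer_hk
                              ORDER BY s.load_dts DESC) AS rn
    FROM dv.hub_customer h
    JOIN dv.sat_customer s ON s.customer_hk = h.customer_hk
) latest
WHERE rn = 1;
```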

You could use any data pipeline tool to achieve this; Lambda could also be a candidate for you, or another workflow/pipeline tool.


I strongly recommend you check out Matillion for Redshift: https://redshiftsupport.matillion.com/customer/en/portal/articles/2775397-building-a-data-vault

It's fantastic and affordable for Redshift ETL and has a Data Vault sample project.

Chris KL

I recommend you read this article and consider following the design explained there in much detail if AWS is your potential tech stack. I do not want to copy-paste the article here, but it is really a recipe for how to implement a Data Vault, and I believe it addresses your requirements.

aviad

S3 is just a key-value store for files; you can't create a DV or DW there. So you can use Redshift or EMR to process the data into the relational format for your DV. Which one you choose is up to you; EMR has specific use cases, IMO.

simplycoding
  • I have it clear that S3 is not for the DV. The question is about having the DV + DMs in Redshift vs. only the DMs there. I also asked about ETL strategies to populate the relational model in Redshift with data from S3. – Cesar Vinas Aug 30 '16 at 11:59
  • If you only have your DM in Redshift, where are you going to store your DV? EMR? That's one expensive solution. Or it seems that you want to avoid using a DV completely? I really am not understanding your approach here. Why would you decide on a DV purely based on what you can do in terms of ETL? – simplycoding Aug 30 '16 at 17:29
  • I should've been clearer with my question. I want to get input on having a DV from which to populate the DMs, vs. having only DMs in Redshift that are populated from data in S3 that would most probably have to be pre-processed via EMR. I've read about DV and see many advantages, but I'd still like to hear from other experiences. Assuming the recommendation is to do DV, I'd like to know how others are implementing the transformation and loading of data from the DV (Redshift) to the DMs (Redshift). Using Lambda, Data Pipeline, SQL in Redshift, or EMR to get data from the DV and put it in the DMs? – Cesar Vinas Aug 30 '16 at 20:22
  • I've not seen DV implemented very much, but in the cases that I have, they were based on pre-existing data in a data warehouse. So essentially land your data in your database, create a DV, and then build your DM on top of it. Isn't that the purpose of a DV? I thought DV was supposed to consistently exist in your relational database. Either way, sounds like you have a lot of research to do before making a decision. Lambda probably is not the right answer here. Data Pipeline is good for batch jobs, moving S3 files to RS, SQL in RS is the most likely answer, and EMR doesn't sound like the answer – simplycoding Aug 31 '16 at 04:32
  • S3 is not a key-value store for files. S3 is an object store that has enhanced functionality for security, data retention policy and intelligent tiering. – aviad Feb 20 '23 at 09:54