I am new to Azure Data Lake and am currently using Data Factory v2 to move data from my transactional database into Azure Data Lake Storage.
Consider a scenario:
A company has multiple data sources.
- Team A is responsible for Source A
- Team B is responsible for Source B
- Team C is responsible for Source C
Multiple Writers
Each team is responsible for moving its data into the data lake.
- Team A moves data under
- /TeamA/entity01.csv
- /TeamA/entity02.csv
- ..
- Team B moves data under
- /TeamB/entity03.csv
- ..
Multiple Readers
- Team Analytics can read the data and perform calculations in a Databricks environment
- Team Power BI can fetch the data, transform it, and copy it into single-tenant folders
- Tenant1/entity01.csv
- Tenant2/entity02.csv
Question
- How can the readers read without conflicting with the writers, i.e. how can I guarantee that while a reader is reading a file, that file is NOT simultaneously being written to by a Team X Data Factory update activity?
What I was thinking / what I have tried:
I was thinking of having a shared source of metadata (maybe as a table in Azure Table Storage, accessible by all the readers), e.g.:
"teamA/entity1" : [
"TeamA/Entity1/01-02-2018/0000/data.csv",
"TeamA/Entity1/01-01-2018/0000/data.csv",
]
"teamA/entity2" : [
"TeamA/Entity2/01-01-2018/1200/data.csv"
"TeamA/Entity2/01-01-2018/0600/data.csv"
"TeamA/Entity2/01-01-2018/0000/data.csv"
]
"teamB/entity3" : [
"TeamA/Entity3/01-01-2018/0600/data.csv"
"TeamA/Entity3/01-01-2018/0000/data.csv"
]
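To make this concrete, below is a minimal sketch of the writer side, assuming the azure-data-tables Python package. The table name `datasetversions`, the helper `register_version`, and the inverted-timestamp RowKey scheme are my own illustration, not an established pattern: inverting the timestamp makes the newest version sort first in Table Storage's ascending RowKey order.

```python
from datetime import datetime, timezone

from azure.data.tables import TableClient

MAX_MICROS = 10**17  # constant used to invert timestamps (illustrative)

def register_version(conn_str: str, dataset: str, blob_path: str) -> None:
    """Record a newly written, immutable file version in the metadata table."""
    now = datetime.now(timezone.utc)
    # Inverted, zero-padded timestamp: the newest version gets the
    # lexically smallest RowKey and is therefore returned first.
    row_key = str(MAX_MICROS - int(now.timestamp() * 1_000_000)).zfill(18)
    entity = {
        "PartitionKey": dataset.replace("/", "-"),  # '/' is not allowed in keys
        "RowKey": row_key,
        "Path": blob_path,
        "WrittenAtUtc": now.isoformat(),
    }
    with TableClient.from_connection_string(conn_str, table_name="datasetversions") as table:
        table.create_entity(entity=entity)
```

The writer would only register a path after the file is completely written, so every path listed in the table points at a finished, immutable file.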
- the writers will have the added responsibility of maintaining a set of versions, so that existing data is never deleted or overwritten in place.
- the readers will have the added responsibility of performing a lookup here first and then reading the data.
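On the reader side, the lookup could then be as simple as fetching the first entity in the dataset's partition (same assumed SDK, table name, and key scheme as in the writer sketch above):

```python
from azure.data.tables import TableClient

def latest_version_path(conn_str: str, dataset: str) -> str:
    """Look up the blob path of the newest registered version of a dataset."""
    pk = dataset.replace("/", "-")
    with TableClient.from_connection_string(conn_str, table_name="datasetversions") as table:
        # RowKeys are inverted timestamps, so ascending RowKey order
        # (Table Storage's default) yields the newest version first.
        versions = table.query_entities(f"PartitionKey eq '{pk}'")
        newest = next(iter(versions), None)
    if newest is None:
        raise LookupError(f"no versions registered for {dataset}")
    return newest["Path"]

# Usage in a (hypothetical) Databricks notebook:
# path = latest_version_path(conn_str, "teamA/entity1")
# df = spark.read.csv(f"adl://mylake.azuredatalakestore.net/{path}", header=True)
```

Because a reader only ever opens paths that were registered after being fully written, and writers never modify a registered file, a reader should never observe a half-written file; the trade-off is the extra lookup on every read and the storage cost of keeping old versions around.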