
I am new to Azure Data Lake and am currently using Data Factory v2 to move data from my transactional database to Azure Data Lake Storage.

Consider a scenario

The company has multiple data sources:

  • Team A is responsible for Source A
  • Team B is responsible for Source B
  • Team C is responsible for Source C

Multiple Writers

Each team is responsible for moving its data into the data lake.

  • Team A moves data under
    • /TeamA/entity01.csv
    • /TeamA/entity02.csv
    • ..
  • Team B moves data under
    • /TeamB/entity03.csv
    • ..

Multiple Readers

  • Team Analytics can read the data and perform calculations in a Databricks environment
  • Team Power BI can fetch the data, transform it, and copy it into single-tenant folders
    • Tenant1/entity01.csv
    • Tenant2/entity02.csv

Question

  • How can the readers read without conflicting with the writers, so that while a reader is reading data, the file is NOT being written to by a Team X Data Factory update activity?

What I was thinking / what I have tried:

I was thinking of having a shared source of metadata (maybe in Table Storage, accessible by all the readers), for example:

"teamA/entity1" : [ 
                   "TeamA/Entity1/01-02-2018/0000/data.csv",
                   "TeamA/Entity1/01-01-2018/0000/data.csv",
                   ]
"teamA/entity2" : [
                   "TeamA/Entity2/01-01-2018/1200/data.csv"
                   "TeamA/Entity2/01-01-2018/0600/data.csv"
                   "TeamA/Entity2/01-01-2018/0000/data.csv"
                 ]
"teamB/entity3" : [
                   "TeamA/Entity3/01-01-2018/0600/data.csv"
                   "TeamA/Entity3/01-01-2018/0000/data.csv"
                 ]
  • the writers will have the added responsibility of maintaining a set of versions, to avoid deleting/overwriting data.
  • the readers will have the added responsibility of performing a lookup there and then reading the data (a sketch of this contract follows below).
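
To make the contract concrete, here is a minimal sketch of what I have in mind, using a plain in-memory dict as a stand-in for the shared metadata store (in practice this would be an Azure Table Storage table); the function names are mine and only illustrative:

from datetime import datetime, timezone

# In-memory stand-in for the shared metadata store (e.g. an Azure Table Storage table).
# Each entity key maps to a newest-first list of immutable, fully written versions.
metadata = {
    "teamA/entity1": [
        "TeamA/Entity1/01-02-2018/0000/data.csv",
        "TeamA/Entity1/01-01-2018/0000/data.csv",
    ],
}

def publish_version(entity_key, path):
    """Writer side: called only AFTER the file at `path` has been completely written.
    New versions are prepended; old versions are never overwritten or deleted."""
    metadata.setdefault(entity_key, []).insert(0, path)

def latest_version(entity_key):
    """Reader side: resolve the newest completed version, then read that file.
    Because published versions are immutable, no writer will touch it again."""
    return metadata[entity_key][0]

# A writer finishes a new drop and publishes it; readers pick it up on their next lookup.
stamp = datetime(2018, 1, 3, tzinfo=timezone.utc).strftime("%d-%m-%Y/%H%M")
publish_version("teamA/entity1", "TeamA/Entity1/" + stamp + "/data.csv")
print(latest_version("teamA/entity1"))   # TeamA/Entity1/03-01-2018/0000/data.csv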

1 Answer

Data Lake writes to temporary files in the background before subsequently writing to the actual file, which will likely mitigate this problem; however, I'm unsure whether this will 100% avoid clashes.
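
As a rough local illustration of that stage-then-swap pattern (not ADF's actual mechanism; the function name and paths below are placeholders):

import os
import tempfile

def publish_atomically(data, final_path):
    """Write to a temporary file first, then swap it into place in one step,
    so readers only ever see either the old file or the complete new one."""
    directory = os.path.dirname(final_path) or "."
    os.makedirs(directory, exist_ok=True)
    # Stage the write in the target directory so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            tmp_file.write(data)
        os.replace(tmp_path, final_path)   # atomic swap on a local filesystem
    except Exception:
        os.remove(tmp_path)
        raise

publish_atomically(b"id,value\n1,42\n", "TeamA/entity01.csv")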

If you are willing to have the pipelines in one factory, you could use the in-built chaining of activities to let Data Factory manage the dependencies.
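
For example, in the pipeline JSON that chaining is expressed via "dependsOn"; here is a trimmed fragment mirrored as a Python dict, with placeholder activity names:

# Fragment of an ADF v2 pipeline definition, mirrored as a Python dict.
# The transform activity only starts once the copy activity has succeeded.
# Activity names are placeholders; dataset and linked service details are omitted.
pipeline_fragment = {
    "activities": [
        {
            "name": "CopyTeamAEntity01",
            "type": "Copy",
            # source, sink and dataset references omitted for brevity
        },
        {
            "name": "TransformForTenants",
            "type": "DatabricksNotebook",
            "dependsOn": [
                {
                    "activity": "CopyTeamAEntity01",
                    "dependencyConditions": ["Succeeded"],
                }
            ],
        },
    ]
}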

We typically write to "serving storage" such as SQL Server rather than letting Power BI have direct access to Data Lake Store, which may help separate things (it also benefits from DirectQuery etc.). However, I haven't seen Databricks support for this yet; I'd bet it is coming, similar to how HDInsight can be used.
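
A sketch of what that serving step can look like from a Databricks notebook, pushing a curated result to SQL Server over JDBC (the server, table and credentials below are placeholders, and `spark` is the session the notebook provides):

# PySpark (Databricks): land a curated result set in SQL Server "serving storage",
# so Power BI queries SQL (optionally via DirectQuery) instead of the lake directly.
# The JDBC URL, table name and credentials are placeholders.
curated_df = spark.read.csv("/mnt/datalake/TeamA/entity01.csv", header=True, inferSchema=True)

(curated_df.write
    .format("jdbc")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=serving")
    .option("dbtable", "dbo.Entity01")
    .option("user", "serving_user")
    .option("password", "<secret>")
    .mode("overwrite")
    .save())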

Notably, as you are finding, Data Lake Store is not an OLTP data source, and this sort of thing isn't what it is meant for; this Stack Overflow post discusses it in more detail: Concurrent read/write to ADLA
