
I am working on a project in which I have to track the lineage of file transformations. Assume a file called SomeTextFile.txt goes through multiple Hive actions and, in the final stage, produces the required result.

Case 1: Hive actions are applied to the file, so it goes through the following stages:

File-->FileAfterAction1-->FileAfterAction2--->FinalResultantFile

In this case I am using a Hive hook that stores data about each intermediate process applied to the file, say in a text file. The lineageEngine code then reads that text file and generates the lineage of the final file.
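To make the Case 1 flow concrete, here is a minimal sketch of the lineage step. It is not the actual lineageEngine code: the record layout `source|action|target`, the helper names `parse_hook_log` and `lineage_of`, and the action name `Action3` are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical record layout written by the hook:
# one line per action, formatted as "source|action|target".
Step = namedtuple("Step", ["source", "action", "target"])


def parse_hook_log(lines):
    """Parse hook output lines into Step records, skipping blank lines."""
    steps = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        source, action, target = line.split("|")
        steps.append(Step(source, action, target))
    return steps


def lineage_of(final_file, steps):
    """Walk backwards from final_file to the original input file."""
    produced_by = {step.target: step for step in steps}
    chain = [final_file]
    seen = {final_file}
    current = final_file
    while current in produced_by:
        current = produced_by[current].source
        if current in seen:  # guard against cyclic records
            break
        seen.add(current)
        chain.append(current)
    chain.reverse()
    return chain


# Sample records mirroring the chain above; "Action3" is a made-up name.
log = [
    "File|Action1|FileAfterAction1",
    "FileAfterAction1|Action2|FileAfterAction2",
    "FileAfterAction2|Action3|FinalResultantFile",
]
print("-->".join(lineage_of("FinalResultantFile", parse_hook_log(log))))
```

The same reader would work for Case 2 only if something on the Spark side emitted equivalent records, which is exactly the missing piece.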

Spark is now also part of the tech stack, and the client can apply Spark actions to the file as well.

Case 2: The same thing happens to the file, but now with Spark actions.

Question: Is there any way to capture intermediate information about what happened to the file between the start and the end of the transformations?

What I have found on the web so far is that Spark transformations expose an intermediate graph, but in my case the client will apply Spark actions rather than Spark transformations. Please look into this if you have some bandwidth.

Sachin

2 Answers


https://issues.apache.org/jira/browse/SPARK-18127

This functionality is expected to be implemented in Spark 2.2.

Serge Harnyk

Spline can track the lineage for you.

Felipe Martins Melo