
I am trying to determine the best way to represent data lineage for image processing. I have images stored in S3, and I want to process them and then place the results back in S3. I would then like to run a query that shows all the images and processes before and after a given image in a chain. For example:

Image1 -ProcessA-> Image2 -ProcessB-> Image3

I would expect a search for the "lineage" of Image2 would yield the above information.

I know this looks like a cookie-cutter case for a graph database but I am not super familiar with them, especially for a production workflow. I have been fighting with how to implement this model in a relational database, but feel like I am just trying to put the square peg in the round hole.

  • Is a graph DB the only option? Which flavor would you suggest?
  • Is there a way to make this work in a relational model that I have not considered?

1 Answer


You are correct when you say this is a cookie-cutter case for a graph database, and any of the available graph database products will likely be able to meet your requirements. You can also solve this problem using a relational database but, as you indicated, it would be like putting a square peg in a round hole.
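For comparison, here is a minimal sketch of how the relational route could look, using a recursive common table expression in SQLite through Python's standard `sqlite3` module. The table names (`image`, `step`), the column names, the bucket URIs, and the `lineage` helper are all assumptions made up for illustration; they are not part of the question or of any product.

```python
import sqlite3

# Hypothetical schema: one row per image, one row per processing step.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE image (id TEXT PRIMARY KEY, uri TEXT NOT NULL);
CREATE TABLE step (
    input_id  TEXT NOT NULL REFERENCES image(id),
    process   TEXT NOT NULL,
    output_id TEXT NOT NULL REFERENCES image(id)
);
""")
conn.executemany(
    "INSERT INTO image VALUES (?, ?)",
    [
        ("Image1", "s3://example-bucket/image1.png"),  # placeholder URIs
        ("Image2", "s3://example-bucket/image2.png"),
        ("Image3", "s3://example-bucket/image3.png"),
    ],
)
conn.executemany(
    "INSERT INTO step VALUES (?, ?, ?)",
    [("Image1", "ProcessA", "Image2"), ("Image2", "ProcessB", "Image3")],
)


def lineage(conn, image_id):
    """Return every step upstream or downstream of image_id, sorted."""
    rows = conn.execute(
        """
        WITH RECURSIVE
          ancestors(input_id, process, output_id) AS (
            SELECT input_id, process, output_id FROM step
            WHERE output_id = :id
            UNION
            SELECT s.input_id, s.process, s.output_id
            FROM step s JOIN ancestors a ON s.output_id = a.input_id
          ),
          descendants(input_id, process, output_id) AS (
            SELECT input_id, process, output_id FROM step
            WHERE input_id = :id
            UNION
            SELECT s.input_id, s.process, s.output_id
            FROM step s JOIN descendants d ON s.input_id = d.output_id
          )
        SELECT input_id, process, output_id FROM ancestors
        UNION
        SELECT input_id, process, output_id FROM descendants
        """,
        {"id": image_id},
    ).fetchall()
    return sorted(rows)


for input_id, process, output_id in lineage(conn, "Image2"):
    print(f"{input_id} -{process}-> {output_id}")
```

This works, but each new kind of relationship tends to mean another join or another recursive query, which is where the square-peg feeling comes from.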

Disclosure: I work for Objectivity, maker of the InfiniteGraph product.

I have solved similar data lineage problems using InfiniteGraph. The basic idea is to separate your data from your metadata. The "lineage" information is metadata, so let's put that in the graph database. The lineage information will include objects (nodes) that contain the metadata for images and for the workflow process steps that consume images as input and generate images or other information as output.

We might define an ImageMD type in InfiniteGraph to contain the metadata for an image, including a URI that defines where the image data is currently stored, and the size and format of the image. We might define a ProcessMD type to describe an application that operates on images. Its attributes might include the name and version of the application as well as its deployment timestamp and the host location where it is running.
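To make the shape of those two types concrete, here is a rough sketch using plain Python dataclasses. This is not the InfiniteGraph schema API; the field names (`uri`, `size_bytes`, `format`, `deployed_at`, `host`) and the sample values are assumptions chosen to mirror the attributes described above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ImageMD:
    """Metadata node for one image; the image bytes themselves stay in S3."""
    uri: str          # where the image data is currently stored
    size_bytes: int   # size of the image
    format: str       # e.g. "png" or "jpeg"


@dataclass(frozen=True)
class ProcessMD:
    """Metadata node for an application that operates on images."""
    name: str
    version: str
    deployed_at: datetime  # deployment timestamp
    host: str              # host location where it is running


# Hypothetical sample values, for illustration only.
image1 = ImageMD(uri="s3://example-bucket/image1.png", size_bytes=204800, format="png")
process_a = ProcessMD(
    name="ProcessA",
    version="1.0.0",
    deployed_at=datetime(2020, 1, 1, tzinfo=timezone.utc),
    host="worker-01.example.internal",
)
print(image1)
print(process_a)
```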

You are going to end up with an environment that looks something like the following diagram.

[Image: lineage graph diagram]

Then, given an image, you can track its lineage backward to see its history and forward to see how it or its derivative components evolved or were used.
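The backward and forward tracking can be sketched in a few lines of plain Python over an in-memory edge list. This is only an illustration of the traversal idea, not InfiniteGraph code; for brevity it treats each process as an edge label rather than as its own node, and the `edges` data and `walk` helper are made up for the example from the question.

```python
from collections import defaultdict, deque

# (input image, process, output image), taken from the question's example.
edges = [
    ("Image1", "ProcessA", "Image2"),
    ("Image2", "ProcessB", "Image3"),
]

# Build one adjacency map per direction.
forward = defaultdict(list)   # image -> [(process, image produced from it)]
backward = defaultdict(list)  # image -> [(process, image it was produced from)]
for src, process, dst in edges:
    forward[src].append((process, dst))
    backward[dst].append((process, src))


def walk(start, adjacency):
    """Breadth-first walk from start; returns (from, process, to) steps."""
    seen = {start}
    queue = deque([start])
    steps = []
    while queue:
        node = queue.popleft()
        for process, neighbor in adjacency.get(node, []):
            steps.append((node, process, neighbor))
            if neighbor not in seen:  # guard against revisiting nodes
                seen.add(neighbor)
                queue.append(neighbor)
    return steps


print("history of Image2:", walk("Image2", backward))
print("uses of Image2:   ", walk("Image2", forward))
```

A graph database runs this kind of traversal for you as a query, so the application code only has to name the starting node and the direction.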

This metadata graph approach is the basis for the Objectivity, Inc. application Metadata Connect.

djhallx