
I have several data-dependent tasks/pipelines, some of which depend on the completion of another. What makes it even harder is that the data can arrive asynchronously, meaning certain tasks need to wait until all the files or tasks in the previous step have been processed.

Here is an example:

Let's say we have a raw file x[i,j], where index i stands for one particular subcategory inside the main category j.

I need to run the following pipelines:

  1. pipeline 1: clean the raw file x[i,j] and store it as x_clean[i,j]
  2. pipeline 2: once pipeline 1 is done for all i inside j, aggregate the results from x_clean[i,j] and store it as y_clean[j]
  3. pipeline 3: clean a raw file z[j] and store it as z_clean[j]
  4. pipeline 4: once pipelines 2 and 3 are done, combine z_clean[j] and y_clean[j] and store it as w_clean[j].
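
In task terms, the dependency structure would look roughly like this (a sketch only; all worker names here are hypothetical placeholders):

    using System;
    using System.Linq;
    using System.Threading.Tasks;

    class PipelineSketch
    {
        // hypothetical placeholder workers; each returns the name of the artifact it stores
        static Task<string> CleanXAsync(int i, int j) => Task.FromResult($"x_clean[{i},{j}]");
        static Task<string> AggregateAsync(string[] parts, int j) => Task.FromResult($"y_clean[{j}]");
        static Task<string> CleanZAsync(int j) => Task.FromResult($"z_clean[{j}]");
        static Task<string> CombineAsync(string y, string z, int j) => Task.FromResult($"w_clean[{j}]");

        static async Task<string> RunCategoryAsync(int j, int subcategoryCount)
        {
            var zClean = CleanZAsync(j);  // pipeline 3 runs independently of pipelines 1-2

            // pipeline 1 for every i inside j, then pipeline 2 once all of them are done
            var parts = await Task.WhenAll(
                Enumerable.Range(0, subcategoryCount).Select(i => CleanXAsync(i, j)));
            var yClean = await AggregateAsync(parts, j);

            // pipeline 4 waits on both branches
            return await CombineAsync(yClean, await zClean, j);
        }

        static async Task Main()
            => Console.WriteLine(await RunCategoryAsync(j: 0, subcategoryCount: 3));
    }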

What kind of model could I apply to handle such a data flow? Is there a methodology behind this kind of data processing task? Does GCP have something built for these kinds of problems?


1 Answer


In a production process...

  • steps depend on completion of other steps.

  • material can arrive asynchronously, meaning subsequent steps wait for the product they work on to arrive. However, be aware this does not mean unlimited material can arrive out of control: only the material to be consumed for that specific manufacture order arrives. If your scenario allows an unlimited stream of data to pour in, then you must organize it pre-process to avoid mixing the components of different products. Don't compromise the structure of the process by trying to handle asynchronously arriving data in some buffer, because manufacturing data products involves relational data, not raw material (a collector sketch follows this list).

  • subcomponents may be completed in joining branches, meaning the assembling step waits for the coordinated set of related components to arrive before assembly begins.
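
To make that coordination concrete, here is a minimal sketch of a collector (my illustration, not part of POWER; it assumes the expected component count per order is known up front):

    using System.Collections.Generic;

    // Hypothetical collector: holds components as they arrive and releases the
    // complete set for an order only once the expected count has been reached.
    class OrderCollector<T>
    {
        private readonly Dictionary<int, List<T>> parts = new Dictionary<int, List<T>>();
        private readonly IReadOnlyDictionary<int, int> expected;

        public OrderCollector(IReadOnlyDictionary<int, int> expectedCountsPerOrder)
        {
            expected = expectedCountsPerOrder;
        }

        // Returns the full component set when the order is complete, otherwise null.
        public List<T> Add(int orderId, T component)
        {
            if (!parts.TryGetValue(orderId, out var list))
                parts[orderId] = list = new List<T>();
            list.Add(component);
            return list.Count == expected[orderId] ? list : null;
        }
    }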

I am the creator of POWER, the only collaborative (manufacturing) architecture to date. There is a lot to learn about this subject, but you can find my articles and code online: http://www.powersemantics.com/

Here is what your process looks like in manufacturing's model for work:

    class MyProduct
    {
        public object[,] x_clean { get; set; }  // indexed by [i, j]
        public object[] y_clean { get; set; }   // indexed by [j]
        public object[] z_clean { get; set; }   // indexed by [j]
        // final product
        public object[] w_clean { get; set; }   // indexed by [j]
    }
    class MyProcess : Producer<MyProduct>, IProcess, IMachine, IOrganize
    {
        // process inputs
        public object[,] x { get; set; }  // raw file, indexed by [i, j]
        public object[] z { get; set; }   // raw file, indexed by [j]

        // machines
        public CleanerA Cleaner1 { get; set; }
        public Aggregator Aggregator1 { get; set; }
        public CleanerB Cleaner2 { get; set; }
        public Assembler Assembler1 { get; set; }

        public void D()
        {
            // instantiates properties and machines
        }
        public void O()
        {
            // bind machines to work on the same data points
            // allows maintenance to later remove cleaners if it becomes possible
            // for the process to receive data in the correct form
            Cleaner1.x = x;
            Cleaner1.Product.x_clean = Product.x_clean;

            Aggregator1.x_clean = Product.x_clean;
            Aggregator1.Product.y_clean = Product.y_clean;

            Cleaner2.z = z;
            Cleaner2.Product.z_clean = Product.z_clean;

            Assembler1.z_clean = Product.z_clean;
            Assembler1.y_clean = Product.y_clean;
            Assembler1.Product.w_clean = Product.w_clean;
        }

        // hardcoded synchronous controller
        public void M()
        {
            Cleaner1.M();
            Aggregator1.M();
            Cleaner2.M();
            Assembler1.M();
        }
    }

    // these class pairs are Custom Machines, very specific work organized
    // by user requirements rather than in terms of domain-specific operations
    class CleanerAProduct
    {
        public object[,] x_clean { get; set; }
    }
    class CleanerA : Producer<CleanerAProduct>, IMachine
    {
        public object[,] x { get; set; }  // raw file
        public void M()
        {
            // clean the raw file x[i,j] and store it as x_clean[i,j]
        }
    }


    class AggregatorProduct
    {
        public object[] y_clean { get; set; }
    }
    class Aggregator : Producer<AggregatorProduct>, IMachine
    {
        public object[,] x_clean { get; set; }
        public void M()
        {
            // aggregate the results from x_clean[i,j] and store it as y_clean[j]
        }
    }


    class CleanerBProduct
    {
        public object[] z_clean { get; set; }
    }
    class CleanerB : Producer<CleanerBProduct>, IMachine
    {
        public object[] z { get; set; }
        public void M()
        {
            // clean a raw file z[j] and store it as z_clean[j]
        }
    }


    class AssemblerProduct
    {
        public object[] w_clean { get; set; }
    }
    class Assembler : Producer<AssemblerProduct>, IMachine
    {
        public object[] y_clean { get; set; }
        public object[] z_clean { get; set; }
        public void M()
        {
            // combine z_clean[j] and y_clean[j] and store it as w_clean[j]
        }
    }

Normal usage of a production process class:

  1. Instantiate. Call D() to instantiate machines and product.
  2. Assign any inputs to the process.
  3. Call O() to have the process distribute those inputs to machines as well as bind the machines to operate on the end product. This is your last chance to override those assignments before production.
  4. Call M() to execute the process.
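
A minimal sketch of those steps with the MyProcess class above (LoadRawX and LoadRawZ are hypothetical stand-ins for whatever actually reads your raw files):

    var process = new MyProcess();
    process.D();                    // 1. instantiate machines and the product
    process.x = LoadRawX();         // 2. assign the raw inputs
    process.z = LoadRawZ();
    process.O();                    // 3. distribute inputs and bind machines to the product
    process.M();                    // 4. execute the process
    object[] result = process.Product.w_clean;  // the finished product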

Most source code welds producers and consumers together within the same function body and thereby becomes a pain to maintain later; functions then e-mail the data to one another like useless office workers who don't keep an e-mail trail. That causes problems when you later want to make vertical integration decisions such as replacing a machine or extending the process, all of which I've documented with sources. POWER is the only architecture which avoids complexities like centralization. I released it in February.

There are ETL tools and other solutions like TPL Dataflow, but production processes are not going to organize or manage themselves for programmers. All programmers need to learn POWER to correctly handle the responsibilities of waste, integration, control and instrumentation. Employers look at us funny when we write automated code and then can't stop live execution on a dime, but our education only prepares us to create processes, not to architect them the way manufacturing does.
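
For comparison, here is a rough sketch of the same pipeline in TPL Dataflow (my reading of the question, not POWER; it assumes the number of subcategories per category is known up front, and a real version would need to key the batch and the join by j rather than rely on arrival order):

    using System;
    using System.Linq;
    using System.Threading.Tasks;
    using System.Threading.Tasks.Dataflow;  // NuGet package: System.Threading.Tasks.Dataflow

    class DataflowSketch
    {
        static async Task Main()
        {
            const int partsPerCategory = 3;  // assumed to be known up front

            // pipeline 1: clean each x[i,j] as it arrives
            var cleanX = new TransformBlock<string, string>(x => "cleaned " + x);

            // pipeline 2: wait for all i inside j, then aggregate into y_clean[j]
            var batchX = new BatchBlock<string>(partsPerCategory);
            var aggregate = new TransformBlock<string[], string>(
                parts => "y_clean from " + string.Join(", ", parts));

            // pipeline 3: clean z[j]
            var cleanZ = new TransformBlock<string, string>(z => "cleaned " + z);

            // pipeline 4: join y_clean[j] with z_clean[j] and combine into w_clean[j]
            var join = new JoinBlock<string, string>();
            var combine = new ActionBlock<Tuple<string, string>>(
                pair => Console.WriteLine($"w_clean = [{pair.Item1}] + [{pair.Item2}]"));

            var opts = new DataflowLinkOptions { PropagateCompletion = true };
            cleanX.LinkTo(batchX, opts);
            batchX.LinkTo(aggregate, opts);
            aggregate.LinkTo(join.Target1, opts);
            cleanZ.LinkTo(join.Target2, opts);
            join.LinkTo(combine, opts);

            // feed one category's worth of asynchronously arriving files
            foreach (var i in Enumerable.Range(0, partsPerCategory))
                await cleanX.SendAsync($"x[{i},0]");
            await cleanZ.SendAsync("z[0]");

            cleanX.Complete();
            cleanZ.Complete();
            await combine.Completion;
        }
    }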

  • If your actual need is to coordinate the deliveries themselves so that processing can be done on each related input set, which is the norm for production, then design and build that system separately. – RBJ Apr 30 '20 at 15:43