Questions tagged [dataflow]

Dataflow programming is a programming paradigm in which computations are modeled through directed graphs: nodes are instructions and data flows through the connections between them.

Dataflow programming is a programming paradigm which models programs as directed graphs, where calculation proceeds in a way similar to signals propagating through an electrical circuit. More precisely:

  • nodes are instructions that take one or more inputs, perform a calculation on them, and present the result(s) as output;
  • edges connect inputs and outputs of the instructions -- this way the output of one instruction can be fed directly to the input of another node to trigger another calculation;
  • data "travels" along the directed edges and triggers the instructions as it passes through the nodes.

Often dataflow programming languages are visual, the most prominent example being LabVIEW.
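
For illustration, here is a minimal sketch of the paradigm in plain Python (not any particular dataflow framework): nodes are functions, edges are queues, and an item arriving on an edge triggers the downstream node.

    # Minimal dataflow sketch: nodes are functions, edges are queues,
    # and data arriving on an edge triggers the downstream node.
    import queue
    import threading

    def node(fn, inbox, outbox):
        """Apply fn to every item that arrives on the inbox edge."""
        while True:
            item = inbox.get()
            if item is None:            # sentinel: propagate shutdown downstream
                if outbox:
                    outbox.put(None)
                break
            result = fn(item)
            if outbox:
                outbox.put(result)      # feed the next node's input
            else:
                print(result)           # terminal node

    edge_ab = queue.Queue()             # edge: source -> "double" node
    edge_bc = queue.Queue()             # edge: "double" -> "increment" node

    threading.Thread(target=node, args=(lambda x: x * 2, edge_ab, edge_bc)).start()
    threading.Thread(target=node, args=(lambda x: x + 1, edge_bc, None)).start()

    for value in range(3):
        edge_ab.put(value)              # data "travels" along the edges
    edge_ab.put(None)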


1152 questions
7 votes, 1 answer

GCP Dataflow: System Lag for streaming from Pub/Sub IO

We use "System Lag" to check the health of our Dataflow jobs. For example, if we see an increase in system lag, we will try to see how to bring this metric down. There are a few questions regarding this metric. 1) What does system lag exactly…
user_1357
7 votes, 2 answers

How to make a fast producer pause when the consumer is overwhelmed?

I have a producer/consumer pattern in my app implemented using TPL Dataflow. I have a big dataflow mesh with about 40 blocks in it. There are two main functional parts in the mesh: a producer part and a consumer part. The producer is supposed to continuously…
kseen
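
In TPL Dataflow the usual answer to this question is to set BoundedCapacity on the downstream block, so that SendAsync awaits (and the producer pauses) once the buffer is full. As a language-neutral illustration, here is the same backpressure idea sketched in plain Python with a bounded queue (sizes and timings are made up):

    # Backpressure sketch: a bounded buffer makes a fast producer block
    # whenever the slow consumer falls behind.
    import queue
    import threading
    import time

    buffer = queue.Queue(maxsize=10)    # bounded "edge" between the two parts

    def producer():
        for i in range(100):
            buffer.put(i)               # blocks while 10 items are already pending
        buffer.put(None)                # sentinel: no more work

    def consumer():
        while (item := buffer.get()) is not None:
            time.sleep(0.01)            # stand-in for slow processing

    threading.Thread(target=producer).start()
    consumer()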
7 votes, 6 answers

Reference manual for Apache Pig Latin

Pig is a dataflow programming environment for processing very large files. Pig's language is called Pig Latin. Does anyone know of a good reference manual for Pig Latin? I'm looking for something that includes all the syntax and command descriptions…
Ori lahav
7 votes, 3 answers

MQ to process, aggregate and publish data asynchronously

Some background, before getting to the real question: I am working on a back-end application that consists of several different modules. Each module is, currently, a command-line Java application, which is run "on demand" (more details later). Each…
Lorenzo Dematté
6 votes, 2 answers

Spring Data Flow: IAM role assignment to pods using pod annotations

We are currently in the process of deploying a new Spring Data Flow stream application in our AWS EKS cluster. As part of this, the pods launched by the Skipper should have the IAM roles defined in the annotation so that they can access the required…
6 votes, 1 answer

Prevent fusion in Apache Beam / Dataflow streaming (Python) pipelines to remove a pipeline bottleneck

We are currently working on a streaming pipeline on Apache Beam with DataflowRunner. We read messages from Pub/Sub and do some processing on them, and afterwards we window them into sliding windows (currently the window size is 3 seconds and…
Sven.DG
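
A common way to break fusion in the Beam Python SDK is to insert a Reshuffle between the read and the expensive step; a minimal sketch (the topic and step names are illustrative, and the streaming pipeline options are omitted for brevity):

    # Sketch: beam.Reshuffle() forces a materialization boundary, so the
    # expensive step is not fused with the upstream read.
    import apache_beam as beam

    p = beam.Pipeline()                 # streaming options omitted for brevity
    (p
     | 'Read' >> beam.io.ReadFromPubSub(topic='projects/my-project/topics/my-topic')
     | 'PreventFusion' >> beam.Reshuffle()
     | 'ExpensiveStep' >> beam.Map(lambda msg: msg.upper()))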
6 votes, 2 answers

Exception Handling in Apache Beam pipelines using Python

I'm building a simple pipeline using Apache Beam in Python (on GCP Dataflow) to read from Pub/Sub and write to BigQuery, but I can't handle exceptions in the pipeline to create alternative flows. On a simple WriteToBigQuery example: output = json_output |…
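
A common pattern for this is to catch the exception inside a DoFn and divert failures to a tagged side output, building an alternative "dead letter" flow; a minimal sketch (transform and tag names are illustrative):

    # Sketch: bad records go to an 'errors' output instead of failing the job.
    import json
    import apache_beam as beam
    from apache_beam import pvalue

    class ParseJson(beam.DoFn):
        def process(self, element):
            try:
                yield json.loads(element)
            except Exception:
                yield pvalue.TaggedOutput('errors', element)   # alternative flow

    with beam.Pipeline() as p:
        results = (p
                   | beam.Create(['{"a": 1}', 'not json'])
                   | beam.ParDo(ParseJson()).with_outputs('errors', main='parsed'))
        parsed, errors = results.parsed, results.errors        # route each as needed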
6 votes, 1 answer

How to solve a "Duplicate values" exception when I create a PCollectionView

I'm setting up a slow-changing lookup map in my Apache Beam pipeline. It continuously updates the lookup map. For each key in the lookup map, I retrieve the latest value in the global window with accumulating mode. But it always hits an exception:…
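
The usual fix is to reduce the collection to a single value per key before building the map-valued side input. The question itself concerns the Java SDK's View.asMap; a rough Python analogue using Latest.PerKey:

    # Sketch: keep only one (the latest) value per key, then build the
    # dict-valued side input from the de-duplicated pairs.
    import apache_beam as beam

    with beam.Pipeline() as p:
        kvs = p | 'Pairs' >> beam.Create([('k', 1), ('k', 2), ('j', 9)])
        latest = kvs | beam.combiners.Latest.PerKey()   # one value per key
        lookup = beam.pvalue.AsDict(latest)             # AsDict expects unique keys
        (p
         | 'Keys' >> beam.Create(['k', 'j'])
         | 'Lookup' >> beam.Map(lambda key, d: d[key], d=lookup))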
6 votes, 1 answer

Difference Between Processor Properties and FlowFile Attributes in Apache NiFi

My current understanding is that NiFi processor properties are specific to that processor. So a new property added to a processor will only be visible within that processor and will not be passed on to later processor blocks? This is why UpdateAttribute…
Adam
6 votes, 2 answers

TPL Dataflow vs plain Semaphore

I have a requirement to make a scalable process. The process has mainly I/O operations with some minor CPU operations (mainly deserializing strings). The process queries the database for a list of URLs, then fetches data from these URLs, deserializes…
BornToCode
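
For comparison, the "plain semaphore" side of that trade-off looks roughly like this in Python's asyncio, where a semaphore caps how many I/O calls are in flight at once (the URLs and the limit of 10 are made up):

    # Sketch: a semaphore as the concurrency throttle for I/O-bound work.
    import asyncio

    async def fetch(url, limit):
        async with limit:                   # at most N fetches run concurrently
            await asyncio.sleep(0.1)        # stand-in for the real network call
            return f'fetched {url}'

    async def main():
        limit = asyncio.Semaphore(10)
        urls = [f'https://example.com/{i}' for i in range(100)]
        return await asyncio.gather(*(fetch(u, limit) for u in urls))

    asyncio.run(main())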
6 votes, 3 answers

Has anyone used dataflow programming in a real project with a mainstream language?

I am looking at using some dataflow programming techniques in a Clojure program, but I am having difficulty finding much information from projects using Java, C#, or other mainstream languages that have used such techniques in the real world. I…
yazz.com
6 votes, 1 answer

Convert Java object to BigQuery TableRow

I am exploring Google Cloud Dataflow. I was wondering whether automatic conversion from a Java object or JSON to a TableRow can be done, just like we can automatically parse JSON to a POJO class. I could not find relevant information. Hope not to duplicate…
user101010
6 votes, 1 answer

Apache NiFi with IoT sensors

I'm new to Apache NiFi, and I have a use case in which I need to parse and decode different kinds of messages from sensors, transform the data, and load it into HBase. All my sensors send data every 10 minutes through an API via a POST request. What I have…
azelix
6 votes, 3 answers

Opening a gzip file in Python Apache Beam

Is it currently possible to read from a gzip file in Python using Apache Beam? My pipeline is pulling gzip files from GCS with this line of code: beam.io.Read(beam.io.TextFileSource('gs://bucket/file.gz', compression_type='GZIP')) But I am…
agsolid
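
The TextFileSource in the excerpt comes from an early SDK; in current Beam Python releases the equivalent is ReadFromText, which handles gzip either by auto-detecting the .gz extension or when told explicitly (the bucket and file names are illustrative):

    # Sketch: reading a gzip-compressed text file with the current API.
    import apache_beam as beam
    from apache_beam.io.filesystem import CompressionTypes

    p = beam.Pipeline()                  # pipeline options omitted for brevity
    lines = (p
             | beam.io.ReadFromText('gs://bucket/file.gz',   # illustrative path
                                    compression_type=CompressionTypes.GZIP))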
6 votes, 1 answer

Can Datastore input in a Google Dataflow pipeline be processed in batches of N entries at a time?

I am trying to execute a dataflow pipeline job which would execute one function on N entries at a time from Datastore. In my case this function is sending a batch of 100 entries to some REST service as its payload. This means that I want to go through all…
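
One way to do this in the Beam Python SDK is the BatchElements transform, which regroups a PCollection into bounded-size lists; a sketch in which the Datastore read is replaced by a Create stand-in and the REST endpoint is hypothetical:

    # Sketch: batch entities 100 at a time before calling the REST service.
    import apache_beam as beam

    def send_batch(batch):
        # batch is a list of up to 100 entities; post it as one payload, e.g.
        # requests.post('https://rest.example/ingest', json=batch)  # hypothetical
        return len(batch)

    with beam.Pipeline() as p:
        (p
         | 'ReadEntities' >> beam.Create(range(1000))   # stand-in for the Datastore read
         | 'Batch' >> beam.BatchElements(min_batch_size=100, max_batch_size=100)
         | 'Send' >> beam.Map(send_batch))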