Questions tagged [dataflow]

Dataflow programming is a programming paradigm in which computations are modeled as directed graphs: nodes are instructions, and data flows along the connections between them.

Dataflow programming is a programming paradigm that models programs as directed graphs, with calculation proceeding much like current through an electrical circuit. More precisely:

  • nodes are instructions that take one or more inputs, perform a calculation on them, and present the result(s) as output;
  • edges connect the inputs and outputs of instructions -- this way, the output of one instruction can be fed directly into the input of another node to trigger a further calculation;
  • data "travels" along the directed edges and triggers the instructions as it passes through the nodes.

Dataflow programming languages are often visual, the most prominent example being LabVIEW.
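A minimal sketch of the execution model in plain Python follows; the Node class and the (a + b) * c graph are purely illustrative, not any particular library's API:

```python
# Toy dataflow graph: a node fires as soon as all of its input slots are filled.
class Node:
    def __init__(self, func, num_inputs):
        self.func = func
        self.num_inputs = num_inputs
        self.pending = {}    # input slot -> value received so far
        self.edges = []      # outgoing edges: (target_node, target_slot)

    def connect(self, target, slot):
        """Directed edge from this node's output to an input slot of target."""
        self.edges.append((target, slot))

    def receive(self, slot, value):
        """Data arriving on an edge; fire the instruction once inputs are complete."""
        self.pending[slot] = value
        if len(self.pending) == self.num_inputs:
            result = self.func(*(self.pending[i] for i in range(self.num_inputs)))
            self.pending = {}
            for target, target_slot in self.edges:
                target.receive(target_slot, result)

# Build the graph for (a + b) * c and let the data trigger the computation.
add = Node(lambda a, b: a + b, 2)
mul = Node(lambda x, y: x * y, 2)
out = Node(print, 1)
add.connect(mul, 0)
mul.connect(out, 0)

add.receive(0, 2)  # a = 2
add.receive(1, 3)  # b = 3 -> add fires; 5 flows along the edge into mul
mul.receive(1, 4)  # c = 4 -> mul fires; prints 20
```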

1152 questions
4 votes • 1 answer

Issues with throttling a TPL Dataflow with SemaphoreSlim

Scope: I want to process a large file (1 GB+) by splitting it into smaller, manageable chunks (partitions), persisting them on some storage infrastructure (local disk, blob, network, etc.), and processing them one by one, in memory. I want to achieve…
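The question is about .NET's TPL Dataflow, but the underlying pattern, capping how many partitions are in flight with a semaphore, can be sketched with Python's asyncio as an illustrative analogue:

```python
import asyncio

MAX_IN_FLIGHT = 4  # process at most 4 partitions concurrently

async def process_chunk(chunk_id: int, semaphore: asyncio.Semaphore) -> None:
    async with semaphore:            # wait for a free slot before starting
        await asyncio.sleep(0.1)     # stand-in for real, in-memory chunk processing
        print(f"processed chunk {chunk_id}")
    # the slot is released on exit, letting the next queued chunk start

async def main() -> None:
    semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)
    await asyncio.gather(*(process_chunk(i, semaphore) for i in range(20)))

asyncio.run(main())
```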
4 votes • 1 answer

Examples of monadic effects inside a rewrite function in Hoopl?

The type of (forward) rewriting functions in Hoopl is given by the mkFRewrite function: mkFRewrite :: (FuelMonad m) => (forall e x. n e x -> f -> m (Maybe (hoopl-3.8.6.1:Compiler.Hoopl.Dataflow.Graph n e x))) -> FwdRewrite m…
Justin Bailey • 1,487 • 11 • 15
4 votes • 2 answers

Dataflow Flex template job is Queued

I am trying to reproduce this tutorial to run a Flex Template on Dataflow. When I submit the job, I can see it in the console, but it is not started and is marked as Queued. Does this mean the job was submitted in FlexRS mode? How can I start…
farhawa • 10,120 • 16 • 49 • 91
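For context: FlexRS jobs are delay-tolerant and sit in the Queued state until Dataflow schedules them. In the Python SDK, FlexRS is opted into via the flexrs_goal pipeline option; a hedged sketch, with placeholder project and bucket names:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Setting flexrs_goal makes the job delay-tolerant, so "Queued" is expected;
# leave the option out if the job should start immediately.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                 # placeholder
    region="us-central1",                 # placeholder
    temp_location="gs://my-bucket/tmp",   # placeholder
    flexrs_goal="COST_OPTIMIZED",
)
```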
4 votes • 3 answers

What are schemas for in Apache Beam?

I was reading the docs about schemas in Apache Beam, but I cannot understand what their purpose is, or how, why, and in which cases I should use them. What is the difference between using schemas and using a class that extends the Serializable…
Sergio Fonseca • 326 • 2 • 11
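As one hedged illustration in the Python SDK: a PCollection acquires a schema when its elements are beam.Row values, which is what lets transforms refer to fields by name rather than by position:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | beam.Create([("alice", 3), ("bob", 5), ("alice", 2)])
        # Wrapping elements in beam.Row attaches named, typed fields (a schema)...
        | beam.Map(lambda kv: beam.Row(user=kv[0], amount=kv[1]))
        # ...so schema-aware transforms can address those fields by name.
        | beam.GroupBy("user").aggregate_field("amount", sum, "total")
        | beam.Map(print)
    )
```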
4 votes • 2 answers

Dataflow fails when I add requirements.txt [Python]

When I try to run Dataflow with the DataflowRunner and include a requirements.txt that looks like this (google-cloud-storage==1.28.1, pandas==1.0.3, smart-open==2.0.0), it fails every time on this line…
Alex Fragotsis • 1,248 • 18 • 36
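For reference, the file is handed to the runner with the requirements_file pipeline option; a minimal hedged sketch with placeholder names. Dataflow stages the listed packages from PyPI at submission time, which is where failures like this usually surface:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                  # placeholder
    temp_location="gs://my-bucket/tmp",    # placeholder
    requirements_file="requirements.txt",  # the pinned dependencies to stage
)

with beam.Pipeline(options=options) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(print)
```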
4 votes • 1 answer

Static dataflow graph generator for Python?

I've been struggling for quite some time to find a static dataflow graph generator for Python. This is my ideal: given a small Python 3 script example.py, return some representation of the dataflow graph. I was able to achieve…
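Absent an off-the-shelf tool, Python's own ast module gives a crude starting point: record which names each simple assignment reads from. This is a def-use edge extractor, far short of a full dataflow analysis:

```python
import ast

def dataflow_edges(source: str):
    """Yield (read_var, assigned_var) edges for simple assignments."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign):
            targets = [t.id for t in node.targets if isinstance(t, ast.Name)]
            reads = [n.id for n in ast.walk(node.value)
                     if isinstance(n, ast.Name)]
            for target in targets:
                for read in reads:
                    yield (read, target)

example = """
a = 1
b = a + 2
c = a * b
"""
print(list(dataflow_edges(example)))  # [('a', 'b'), ('a', 'c'), ('b', 'c')]
```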
4 votes • 2 answers

Cloud SQL to BigQuery incrementally

I need some suggestions for one of the use cases I am working on. Use case: we have data in Cloud SQL, around 5-10 tables, some treated as lookup and others transactional. We need to get this into BigQuery in a way that makes 3-4 tables (flattened,…
4 votes • 1 answer

Side inputs vs normal constructor parameters in Apache Beam

I have a general question on side inputs and broadcasting in the context of Apache Beam. Do any additional variables, lists, or maps that are needed for computation during processElement need to be passed as side inputs? Is it OK if they are passed as…
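A hedged illustration of the distinction in the Python SDK: values fixed at pipeline-construction time can simply be captured by the function, while values computed by the pipeline itself must arrive as side inputs:

```python
import apache_beam as beam
from apache_beam.pvalue import AsList

with beam.Pipeline() as p:
    words = p | beam.Create(["a", "bb", "ccc"])
    lengths = words | beam.Map(len)

    # Known before the pipeline runs: a plain closure capture is fine.
    threshold = 2
    long_words = words | beam.Filter(lambda w: len(w) >= threshold)

    # Computed by the pipeline itself: must be broadcast as a side input.
    with_max = words | beam.Map(
        lambda w, all_lengths: (w, max(all_lengths)),
        AsList(lengths),
    )
    with_max | beam.Map(print)
```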
4 votes • 1 answer

Beam / Dataflow Custom Python job - Cloud Storage to PubSub

I need to perform a very simple transformation on some data (extract a string from JSON), then write it to PubSub -- I'm attempting to use a custom Python Dataflow job to do so. I've written a job which successfully writes back to Cloud Storage, but…
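A hedged sketch of that pipeline shape; the bucket, topic, and "message" field are placeholders, and WriteToPubSub expects bytes:

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # Pub/Sub writes on Dataflow need streaming mode

with beam.Pipeline(options=options) as p:
    (
        p
        | beam.io.ReadFromText("gs://my-bucket/input/*.json")           # placeholder path
        | beam.Map(lambda line: json.loads(line)["message"])            # hypothetical field
        | beam.Map(lambda s: s.encode("utf-8"))                         # WriteToPubSub takes bytes
        | beam.io.WriteToPubSub("projects/my-project/topics/my-topic")  # placeholder topic
    )
```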
4 votes • 1 answer

BigQueryIO Read vs fromQuery

Say that in a Dataflow/Apache Beam program I am trying to read a table whose data is growing exponentially. I want to improve the performance of the read. BigQueryIO.Read.from("projectid:dataset.tablename") or BigQueryIO.Read.fromQuery("SELECT A,…
Roshan Fernando • 493 • 11 • 31
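The excerpt refers to the Java SDK; the same trade-off can be sketched with the Python SDK's ReadFromBigQuery. A full table read exports the table as-is, while a query lets BigQuery project and filter first, so less data reaches the pipeline (table names below reuse the question's placeholders):

```python
import apache_beam as beam

with beam.Pipeline() as p:
    # Direct table read: straight export, every column and row.
    full = p | "table" >> beam.io.ReadFromBigQuery(
        table="projectid:dataset.tablename")

    # Query read: BigQuery runs the query first, so the pipeline only
    # receives the columns/rows it actually needs.
    subset = p | "query" >> beam.io.ReadFromBigQuery(
        query="SELECT a, b FROM `projectid.dataset.tablename`",
        use_standard_sql=True)
```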
4 votes • 2 answers

Apache Beam: ReadFromText versus ReadAllFromText

I'm running an Apache Beam pipeline that reads text files from Google Cloud Storage, performs some parsing on those files, and then writes the parsed data to BigQuery. Ignoring the parsing and google_cloud_options here for the sake of keeping it…
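For reference: ReadFromText takes one file pattern fixed at construction time, whereas ReadAllFromText consumes a PCollection of patterns, so the set of files can itself be computed by the pipeline. A hedged sketch with placeholder paths:

```python
import apache_beam as beam
from apache_beam.io.textio import ReadAllFromText

with beam.Pipeline() as p:
    # Pattern known when the pipeline is built:
    fixed = p | beam.io.ReadFromText("gs://my-bucket/logs/2020-01-*.txt")

    # Patterns determined at runtime (e.g. read from another source):
    patterns = p | beam.Create([
        "gs://my-bucket/logs/2020-01-*.txt",
        "gs://my-bucket/logs/2020-02-*.txt",
    ])
    dynamic = patterns | ReadAllFromText()
```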
4 votes • 3 answers

SIGNAL vs Esterel vs Lustre

I'm very interested in dataflow- and concurrency-focused languages. I've read up on the subject, and I repeatedly see SIGNAL, Esterel, and Lustre mentioned, so I take it they're prominent players in those fields. However, many of their links in the…
4 votes • 2 answers

Easiest way to convert a TableRow to JSON-formatted String, in dataflow 2.x?

Short of writing my own function to do it, what is the easiest way to convert a TableRow object, inside a dataflow 2.x pipeline, to a JSON-formatted String? I thought the code below would work, but it isn't correctly inserting quotes in between…
Max • 808 • 11 • 25
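The question concerns the Java SDK's TableRow type; for comparison only, in the Python SDK BigQuery rows surface as plain dicts, so the standard json module handles the quoting that naive string conversion gets wrong:

```python
import json
import apache_beam as beam

with beam.Pipeline() as p:
    rows = p | beam.Create([{"name": "alice", "score": 3}])  # stand-in for BigQuery rows
    rows | beam.Map(json.dumps) | beam.Map(print)            # proper quotes and escaping
```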
4 votes • 2 answers

Apache Beam, NoSuchMethodError on BigQueryIO.WriteTableRows()?

I've recently upgraded an existing pipeline from dataflow 1.x to dataflow 2.x, and I'm seeing an error that doesn't make sense to me. I'll put the relevant code below, then include the error I'm seeing. // This is essentially the final step in our…
Max • 808 • 11 • 25
4 votes • 1 answer

Apache Beam Error - AsList object is not iterable

I'm trying to make a side input from a PCollection in Apache Beam with Python. This is my code: from apache_beam.pvalue import AsList locations_dim = p | beam.io.Read(beam.io.BigQuerySource( query='SELECT a, b, c, d FROM test.testing_table')) |…
SaadK • 256 • 2 • 10
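The usual cause of that error is treating AsList(...) as the list itself; it is a deferred value that only materializes inside the transform it is passed to as a side input. A hedged sketch of the working shape, with Create standing in for the BigQuery read:

```python
import apache_beam as beam
from apache_beam.pvalue import AsList

with beam.Pipeline() as p:
    locations = p | "locations" >> beam.Create([("a", 1), ("b", 2)])
    events = p | "events" >> beam.Create(["a", "b", "a"])

    # Pass AsList(locations) as a side input; Beam hands the realized
    # list to the lambda at execution time.
    joined = events | beam.Map(
        lambda key, locs: (key, dict(locs)[key]),
        AsList(locations),
    )
    joined | beam.Map(print)
```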