Questions tagged [data-pipeline]
168 questions
0
votes
0 answers
Is it possible to configure dependencies in Azkaban to start a job after completion of either Job A or Job B, without requiring both of them to finish?
I have a scenario where I have three jobs in my Azkaban workflow. I want to ensure that Job C starts only after the completion of either Job A or Job B. It doesn't matter which of the two jobs finishes first; as soon as either Job A or Job B…

Bora Çolakoğlu
- 69
- 2
- 7
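A hedged sketch of what this usually looks like in Azkaban's Flow 2.0 YAML, assuming the built-in one_success condition macro (available in newer 3.x releases); job names and commands are placeholders, and exact evaluation timing may vary by version:

nodes:
  - name: JobA
    type: command
    config:
      command: echo "run A"
  - name: JobB
    type: command
    config:
      command: echo "run B"
  - name: JobC
    type: command
    # one_success: JobC needs at least one dependency to succeed, not both
    condition: one_success
    dependsOn:
      - JobA
      - JobB
    config:
      command: echo "run C"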
0
votes
1 answer
Data Flow ERROR java.lang.OutOfMemoryError: Java heap space
I have to create a pipeline to transfer data from BigQuery and save it as a JSON file, but I got this error. The result of the SQL query is 30 million records. How can I improve this code?
Error:
[error] (run-main-0) java.lang.OutOfMemoryError: Java…

P.pp
- 9
- 2
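The sbt-style error suggests a Scio/Beam job materializing the whole result set in memory; a minimal sketch of the usual fix, shown here with Beam's Python SDK as an assumption (table and bucket names are placeholders), is to let the runner stream rows straight to sharded output files:

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        # rows arrive as dicts one at a time, never all 30M in memory
        | beam.io.ReadFromBigQuery(
            query="SELECT * FROM `project.dataset.table`",
            use_standard_sql=True)
        | beam.Map(lambda row: json.dumps(row, default=str))
        # sharded newline-delimited JSON files on GCS
        | beam.io.WriteToText("gs://my-bucket/export/part",
                              file_name_suffix=".json")
    )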
0
votes
2 answers
Error: "zsh: no matches found: apache-beam[gcp]" while installing Apache Beam
I am working on a project and trying to install Apache Beam from the terminal using this command: pip3 install apache-beam[gcp]. However, I get this error: zsh: no matches found: apache-beam[gcp]
I created a virtual env using these commands:
pip3…
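zsh expands square brackets as glob patterns, so quoting the requirement (pip3 install 'apache-beam[gcp]') is the usual fix. An equivalent that sidesteps shell globbing entirely is to invoke pip from Python, where each argument is passed verbatim:

import subprocess
import sys

# argv elements go to pip untouched; no shell, so no glob expansion
subprocess.check_call(
    [sys.executable, "-m", "pip", "install", "apache-beam[gcp]"])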
0
votes
0 answers
How is result data published by the respective boards available on other portals just minutes later?
I was wondering how results published by state boards like the Bihar, MP, MH, and Jharkhand boards are available on other result portals like Indiaresults.com almost immediately. Is there any way to copy the whole data set without having access to the DB or a server-side script? How…

Sanjay Kumar
- 145
- 1
- 1
- 10
0
votes
1 answer
TensorFlow: How to add a property to an execution object in the MLMD MetadataStore?
I'm using the MLMD MetadataStore to manage my data pipelines, and I need to add an execution property in MLMD so I can retrieve it later.
I'm trying to add it with this:
from ml_metadata.proto import metadata_store_pb2
from ml_metadata.metadata_store…

natielle
- 380
- 3
- 14
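A minimal sketch of attaching and reading back a property on an Execution, assuming MLMD's Python API with the in-memory fake database from its docs; type and property names are placeholders:

from ml_metadata import metadata_store
from ml_metadata.proto import metadata_store_pb2

config = metadata_store_pb2.ConnectionConfig()
config.fake_database.SetInParent()  # in-memory store for testing
store = metadata_store.MetadataStore(config)

# declared properties must be typed on the ExecutionType up front
exec_type = metadata_store_pb2.ExecutionType(name="Trainer")
exec_type.properties["state"] = metadata_store_pb2.STRING
type_id = store.put_execution_type(exec_type)

execution = metadata_store_pb2.Execution(type_id=type_id)
execution.properties["state"].string_value = "RUNNING"
# custom_properties need no prior type declaration
execution.custom_properties["note"].string_value = "first run"
[execution_id] = store.put_executions([execution])

[fetched] = store.get_executions_by_id([execution_id])
print(fetched.custom_properties["note"].string_value)  # "first run"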
0
votes
0 answers
Migrating Aurora DB cluster to Snowflake and daily incremental refresh
I am looking to migrate multiple Aurora DB clusters of around a few TB each to Snowflake, performing a daily incremental refresh. I am wondering about the best practices and tools for achieving this objective. Should I consider the path of Aurora DB cluster…

Code Warrior
- 133
- 1
- 3
- 15
0
votes
1 answer
Beam pipeline spark runner issue
I have a Beam pipeline that reads from a Kinesis stream, deserializes the protobuf data inside, converts it to a byte array, and writes it to another Kinesis stream (just a dummy pipeline).
This pipeline executes successfully if I run
mvn compile exec:java…

Viswajith Kalavapudi
- 189
- 1
- 3
- 16
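The question's pipeline is Java/Maven and the failing detail is cut off, but for orientation, a hedged sketch of selecting Beam's portable Spark runner with the Python SDK (the Java equivalent is the analogous --runner=SparkRunner flag):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# "SparkRunner" resolves to the portable Spark runner, which starts a
# local job server unless one is configured explicitly
opts = PipelineOptions(["--runner=SparkRunner"])
with beam.Pipeline(options=opts) as p:
    (p
     | beam.Create([b"kinesis-record"])  # stand-in for the Kinesis read
     | beam.Map(len)
     | beam.Map(print))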
0
votes
2 answers
Stream data between tasks in a pipeline orchestration tool (Prefect/Dagster/Airflow)
How can I stream data between tasks in a workflow with the help of a data pipeline orchestration tool like Prefect, Dagster or Airflow?
I am looking for a good data pipeline orchestration tool.
I think I have a fairly decent overview now of what…

phobic
- 914
- 10
- 24
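These orchestrators generally pass small serialized results between tasks rather than streaming records; the common pattern is to hand off a reference (a path or table name) while the bytes move through shared storage. A minimal sketch with Airflow 2's TaskFlow API, assuming a filesystem the workers share:

import pendulum
from airflow.decorators import dag, task

@dag(schedule=None, start_date=pendulum.datetime(2023, 1, 1),
     catchup=False)
def handoff_by_reference():
    @task
    def produce() -> str:
        path = "/tmp/payload.csv"  # in real use: object storage, not /tmp
        with open(path, "w") as f:
            f.write("id,value\n1,42\n")
        return path  # only this small reference travels via XCom

    @task
    def consume(path: str) -> None:
        with open(path) as f:
            print(f.read())

    consume(produce())

handoff_by_reference()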
0
votes
0 answers
Airflow log by attempts takes too long to show how the process is going
My team has developed some pipelines in Airflow, and we are really impressed by how we can set multiple tasks to run and have data flow from sources directly into our data lake. However, we have some complex tasks, and logging can take up to 40 minutes to be…
0
votes
1 answer
Azure Data Factory - Retrieve next pagination link (decoded) from response headers in a copy data activity
I have created a copy data activity in Azure Data Factory; this data pipeline pulls data from an API (via a REST activity source) and writes the response body (JSON) to a file kept in Azure Blob Storage.
The API which I am fetching the…

Rubin Shah
- 1
- 1
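ADF's REST source exposes pagination rules that can take the next link from a response header; outside ADF, the loop the question describes looks roughly like this Python sketch (the header name, URL, and save_page helper are all hypothetical, and the requests library is assumed):

import urllib.parse
import requests

url = "https://api.example.com/items"
while url:
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    save_page(resp.json())  # hypothetical helper writing to blob storage
    # the next page link arrives URL-encoded in a response header
    next_link = resp.headers.get("X-Next-Page")
    url = urllib.parse.unquote(next_link) if next_link else None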
0
votes
1 answer
How do I trigger Apache Beam side inputs periodically?
I have a Dataflow Pipeline with streaming data, and I am using an Apache Beam Side Input of a bounded data source, which may have updates. How do I trigger a periodic update of this side input? E.g. The side input should be refreshed once every 12…

yeong
- 1
- 1
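The Beam documentation describes a "slowly updating side input" pattern built on PeriodicImpulse; a hedged sketch of its shape with the Python SDK on a streaming runner, where load_table() stands in for re-reading the bounded source:

import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.periodicsequence import PeriodicImpulse

def load_table(_):
    # placeholder: re-read the bounded source (file, table, ...)
    return {"rate": 1.07}

INTERVAL = 12 * 60 * 60  # refresh every 12 hours

with beam.Pipeline() as p:
    side = (
        p
        # apply_windowing puts each firing in its own window, so the
        # newest refresh replaces the previous side input value
        | PeriodicImpulse(fire_interval=INTERVAL, apply_windowing=True)
        | beam.Map(load_table)
    )
    main = (
        p
        | beam.Create(["event"])  # stand-in for the streaming source
        | beam.WindowInto(window.FixedWindows(INTERVAL))
    )
    (main
     | beam.Map(lambda x, table: (x, table),
                table=beam.pvalue.AsSingleton(side))
     | beam.Map(print))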
0
votes
0 answers
How to design a cron job to transfer data from a DigitalOcean database (MySQL) into Google BigQuery hourly?
At my workplace I was tasked with the work below:
What is the most cost-effective method to create a cron job that would run hourly (or maybe twice a day) to copy the company application's new data from a DigitalOcean database (MySQL) into a Google…

Jackk-Doe
- 109
- 7
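One low-cost shape for this: an hourly cron entry running a small script that pulls only rows newer than the last watermark and appends them to BigQuery. A hedged sketch assuming the pymysql and google-cloud-bigquery packages; host, table, and column names are placeholders:

import pymysql
from google.cloud import bigquery

def load_increment(last_seen: str) -> None:
    # pull only rows created since the previous run (the watermark)
    conn = pymysql.connect(host="do-db-host", user="app",
                           password="...", database="app")
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, payload, created_at FROM events "
            "WHERE created_at > %s", (last_seen,))
        rows = [{"id": r[0], "payload": r[1],
                 "created_at": r[2].isoformat()}
                for r in cur.fetchall()]
    conn.close()
    if rows:
        bigquery.Client().insert_rows_json("project.dataset.events", rows)

# crontab entry (hourly):
# 0 * * * * /usr/bin/python3 /opt/etl/load_increment.py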
0
votes
0 answers
Which services to use to periodically load data from multiple data sources, aggregate it, and provide fast search?
Please propose a solution design for my case. The data comes from various sources, some from APIs, some from CSV files. A user will search using filters.
Ex: Product data (source 1) and Product Reviews (source 2). A user will search for a product with its…

jasy
- 1
- 1
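One common shape for this: a scheduled job merges the API and CSV sources into one denormalized document per product, then indexes it into a search engine so a single filtered query answers the search. A hedged sketch assuming Elasticsearch and its official Python client; index and field names are placeholders:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# one document per product with reviews folded in
doc = {
    "name": "Widget",
    "price": 9.99,
    "reviews": [{"rating": 5, "text": "great"}],
}
es.index(index="products", id="widget-1", document=doc)

hits = es.search(index="products",
                 query={"range": {"price": {"lte": 10}}})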
0
votes
1 answer
ValueError when running Python function in data pipeline
I'm building a data pipeline using Python and I'm running into an issue when trying to execute a certain function. The error message I'm receiving is: ValueError: Could not convert string to float: 'N/A'
Here is the function in question:
def…

Kingdavid Ochai
- 15
- 4
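The error means a sentinel string like 'N/A' reached float(); the usual fix is to map such sentinels to a missing-value marker before casting. A minimal sketch:

import math

def to_float(value: str) -> float:
    try:
        return float(value)
    except ValueError:
        return math.nan  # or None / a domain default, as the pipeline needs

print(to_float("3.14"))  # 3.14
print(to_float("N/A"))   # nan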
0
votes
0 answers
Azkaban 3.44 conditional flow not working (including the example from the official documentation)
I'm trying to use conditional flows in Azkaban. When I submit/upload my project on the web node, I receive this error:
Validator Directory Flow reports errors: Error loading flow yaml file sample.flow: Cannot create property=nodes for…

Holsi
- 21
- 3
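That validator message is usually a YAML shape problem rather than the condition itself. A hedged sketch of the minimal Flow 2.0 layout conditional flows expect, with the condition syntax modeled on the official docs (mixing tabs with spaces, or omitting the project file, are common causes of the 'Cannot create property=nodes' parse error):

# flow20.project
azkaban-flow-version: 2.0

# sample.flow
nodes:
  - name: JobA
    type: command
    config:
      command: bash ./write_props.sh  # writes param1 to JOB_OUTPUT_PROP_FILE
  - name: JobB
    type: command
    dependsOn:
      - JobA
    condition: ${JobA:param1} == 1
    config:
      command: echo "JobB runs only when param1 == 1"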