Questions tagged [data-pipeline]

168 questions
0 votes, 0 answers

Is it possible to configure dependencies in Azkaban to start a job after completion of either Job A or Job B, without requiring both of them to finish?

I have a scenario where I have three jobs in my Azkaban workflow. I want to ensure that Job C starts only after the completion of either Job A or Job B. It doesn't matter which of the two jobs finishes first; as soon as either Job A or Job B…
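
For reference, Azkaban's Flow 2.0 conditional workflows expose status macros (all_success, all_done, one_success, one_failed, ...) that cover this case. A minimal sketch of a .flow file, assuming a version with conditional-workflow support; job names and commands are placeholders, and whether one_success fires eagerly (as soon as the first parent succeeds) is worth verifying for the version in use:

    nodes:
      - name: JobA
        type: command
        config:
          command: echo "running A"

      - name: JobB
        type: command
        config:
          command: echo "running B"

      - name: JobC
        type: command
        # one_success: at least one dependency succeeded
        condition: one_success
        dependsOn:
          - JobA
          - JobB
        config:
          command: echo "running C"
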
0 votes, 1 answer

Data Flow ERROR java.lang.OutOfMemoryError: Java heap space

I have to create a pipeline that transfers data from BigQuery and saves it as a JSON file, but I got this error. The result of the SQL query is 30 million records. How can I improve this code? Error: [error] (run-main-0) java.lang.OutOfMemoryError: Java…
P.pp • 9 • 2
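
The usual cure for this class of error is to never materialize all 30 million rows in one process. A hedged sketch in Python with Apache Beam (the sbt-style stack trace suggests the asker is on Scala, so this only illustrates the approach; the query and output path are placeholders):

    import json
    import apache_beam as beam

    # Workers each read a slice of the BigQuery result and write JSON lines,
    # so no single process holds the full result set in memory.
    with beam.Pipeline() as p:
        (
            p
            | "Read" >> beam.io.ReadFromBigQuery(
                query="SELECT * FROM `project.dataset.table`",  # placeholder
                use_standard_sql=True,
            )
            | "ToJson" >> beam.Map(lambda row: json.dumps(row, default=str))
            | "Write" >> beam.io.WriteToText(
                "gs://my-bucket/output/result",  # placeholder
                file_name_suffix=".json",
            )
        )

If the code must stay single-process, the same idea applies: iterate the query result page by page and write each page out, rather than collecting everything into one list first.
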
0 votes, 2 answers

Error: "zsh: no matches found: apache-beam[gcp]" while installing Apache Beam

I am working on a project and trying to install Apache Beam from the terminal using this command: pip3 install apache-beam[gcp]. However, I get this error: zsh: no matches found: apache-beam[gcp]. I created a virtual env using these commands: pip3…
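
This error comes from the shell, not from pip: zsh treats the square brackets in apache-beam[gcp] as a glob pattern and aborts when nothing matches. Quoting (or escaping) the argument resolves it:

    # quote so zsh passes the brackets through to pip verbatim
    pip3 install 'apache-beam[gcp]'

    # escaping the brackets works too
    pip3 install apache-beam\[gcp\]
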
0 votes, 0 answers

How is result data published by the respective boards available on other portals just a few minutes later?

I was wondering how the results published by state boards like Bihar, MP, MH, and Jharkhand become available on other result portals like Indiaresults.com. Is there any way to copy the whole dataset without having access to the DB or a server-side script? How…
Sanjay Kumar • 145 • 1 • 1 • 10
0 votes, 1 answer

TensorFlow: how to add a property to an execution object in the MLMD MetadataStore?

I'm using the MLMD MetadataStore to manage the data pipelines, and I need to add an execution property in MLMD so I can retrieve it later. I'm trying to add it with this: from ml_metadata.proto import metadata_store_pb2 from ml_metadata.metadata_store…
natielle • 380 • 3 • 14
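
For context, MLMD distinguishes typed properties, which must be declared on the ExecutionType, from custom_properties, which can be set ad hoc. A minimal sketch against an in-memory store (type and property names are placeholders):

    from ml_metadata import metadata_store
    from ml_metadata.proto import metadata_store_pb2

    # In-memory SQLite store, just for the sketch
    config = metadata_store_pb2.ConnectionConfig()
    config.sqlite.SetInParent()
    store = metadata_store.MetadataStore(config)

    # A typed property must be declared on the ExecutionType first
    execution_type = metadata_store_pb2.ExecutionType()
    execution_type.name = "Trainer"  # placeholder
    execution_type.properties["my_property"] = metadata_store_pb2.STRING
    type_id = store.put_execution_type(execution_type)

    execution = metadata_store_pb2.Execution()
    execution.type_id = type_id
    execution.properties["my_property"].string_value = "some value"
    # custom_properties need no prior declaration
    execution.custom_properties["run_label"].string_value = "experiment-1"
    [execution_id] = store.put_executions([execution])

    # Retrieve the property later
    stored = store.get_executions_by_id([execution_id])[0]
    print(stored.properties["my_property"].string_value)
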
0 votes, 0 answers

Migrating Aurora DB cluster to Snowflake and daily incremental refresh

I am looking to migrate multiple Aurora DB clusters, around a few TB each, to Snowflake, and to perform a daily incremental refresh. I am wondering about the best practices and tools for achieving this objective. Should I consider the path of Aurora DB cluster…
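
One commonly cited pattern, sketched rather than prescribed (bucket, table, and credential names are assumptions): land incremental exports from Aurora, e.g. via AWS DMS, in S3, and let Snowpipe auto-ingest each new file:

    -- External stage over the S3 landing area
    CREATE OR REPLACE STAGE aurora_stage
      URL = 's3://my-bucket/aurora-export/orders/'
      CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...');

    -- Snowpipe appends each newly arrived file to a staging table
    CREATE OR REPLACE PIPE orders_pipe AUTO_INGEST = TRUE AS
      COPY INTO staging.orders
      FROM @aurora_stage
      FILE_FORMAT = (TYPE = PARQUET)
      MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

AUTO_INGEST additionally requires S3 event notifications wired to the pipe's queue, and merging the staged increments into final tables is a separate step.
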
0 votes, 1 answer

Beam pipeline Spark runner issue

I have a Beam pipeline that reads from a Kinesis stream, deserializes the protobuf data inside, converts it to a byte array, and writes it to another Kinesis stream (just a dummy pipeline). This pipeline executes successfully if I run mvn compile exec:java…
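
A first thing to check when a pipeline works under the direct runner but not on Spark: whether the Spark runner is on the classpath and actually selected. Following the Beam quickstart convention (the profile name comes from Beam's example archetype; the main class here is a placeholder):

    mvn compile exec:java \
      -Dexec.mainClass=com.example.MyKinesisPipeline \
      -Dexec.args="--runner=SparkRunner --sparkMaster=local[4]" \
      -Pspark-runner
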
0 votes, 2 answers

Stream data between tasks in a pipeline orchestration tool (Prefect/Dagster/Airflow)

How can I stream data between tasks in a workflow with the help of a data pipeline orchestration tool like Prefect, Dagster or Airflow? I am looking for a good data pipeline orchestration tool. I think I have a fairly decent overview now of what…
phobic • 914 • 10 • 24
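
As a point of reference: none of the three orchestrators streams bytes between concurrently running tasks out of the box; task results are handed over only once a task finishes. The usual workaround is to pass a reference (a path or URI) between tasks and stream inside each task. A minimal sketch with Prefect 2 (paths are placeholders):

    from prefect import flow, task

    @task
    def extract(path: str) -> str:
        # Write raw data to shared storage; return a reference, not the data
        with open(path, "w") as f:
            f.write("record-1\nrecord-2\n")
        return path

    @task
    def transform(path: str) -> str:
        out_path = path + ".transformed"
        # Stream line by line inside the task boundary
        with open(path) as src, open(out_path, "w") as dst:
            for line in src:
                dst.write(line.upper())
        return out_path

    @flow
    def etl():
        raw = extract("/tmp/raw.txt")  # placeholder path
        transform(raw)

    if __name__ == "__main__":
        etl()
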
0 votes, 0 answers

Airflow log by attempts takes too long to show how the process is going

My team has developed some pipelines in Airflow, and we are really amazed at how we can set multiple tasks to run so that data flows from sources directly into our data lake. However, we have some complex tasks, and logging can take up to 40 minutes to be…
0 votes, 1 answer

Azure Data Factory: retrieve the next pagination link (decoded) from response headers in a Copy Data activity

I have created a Copy Data activity in Azure Data Factory, and this data pipeline pulls data from an API (via a REST source) and writes the response body (JSON) to a file kept in Azure Blob Storage. The API which I am fetching the…
0 votes, 1 answer

How do I trigger Apache Beam side inputs periodically?

I have a Dataflow Pipeline with streaming data, and I am using an Apache Beam Side Input of a bounded data source, which may have updates. How do I trigger a periodic update of this side input? E.g. The side input should be refreshed once every 12…
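
This is Beam's documented "slowly updating side input" pattern: a PeriodicImpulse fires on an interval, the bounded source is re-read on each firing, and windowing makes the latest snapshot available as a side input. A sketch in Python (the lookup reader and the main source are placeholders):

    import apache_beam as beam
    from apache_beam.transforms import window
    from apache_beam.transforms.periodicsequence import PeriodicImpulse

    REFRESH_SECONDS = 12 * 60 * 60  # refresh every 12 hours

    def read_lookup_table(_):
        # Placeholder: re-read the bounded source (file, table, ...) here
        return {"key": "value"}

    with beam.Pipeline() as p:
        side = (
            p
            | "Tick" >> PeriodicImpulse(fire_interval=REFRESH_SECONDS,
                                        apply_windowing=True)
            | "Reload" >> beam.Map(read_lookup_table)
        )
        main = (
            p
            | "ReadStream" >> beam.io.ReadFromPubSub(
                topic="projects/my-project/topics/my-topic")  # placeholder
            | "Window" >> beam.WindowInto(window.FixedWindows(60))
        )
        _ = main | "Enrich" >> beam.Map(
            lambda msg, lookup: (msg, lookup.get("key")),
            lookup=beam.pvalue.AsSingleton(side),
        )
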
0 votes, 0 answers

How to design a cron job to transfer data from a DigitalOcean database (MySQL) into Google BigQuery hourly?

In my workplace I was tasked with the work below: what is the most cost-effective method to create a cron job that would run hourly (or maybe twice a day) to copy the company application's new data from a DigitalOcean database (MySQL) into a Google…
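
A cost-effective baseline is a small script on cron that copies only rows newer than the last watermark and appends them to BigQuery. A hedged sketch; table, column, connection details, and watermark persistence are all assumptions:

    import pymysql
    from google.cloud import bigquery

    def sync(last_sync_ts: str) -> None:
        # Incremental pull: only rows changed since the previous run
        conn = pymysql.connect(host="db.example.com", user="app",
                               password="...", database="app")  # placeholders
        with conn.cursor(pymysql.cursors.DictCursor) as cur:
            cur.execute("SELECT * FROM orders WHERE updated_at > %s",
                        (last_sync_ts,))
            rows = cur.fetchall()
        conn.close()

        # Make values JSON-safe (datetimes -> ISO strings), then append
        payload = [
            {k: (v.isoformat() if hasattr(v, "isoformat") else v)
             for k, v in row.items()}
            for row in rows
        ]
        client = bigquery.Client()
        errors = client.insert_rows_json("project.dataset.orders", payload)
        if errors:
            raise RuntimeError(f"BigQuery insert failed: {errors}")

Scheduling is then an ordinary crontab entry such as 0 * * * * python3 /opt/sync.py (path is a placeholder). For large hourly batches, a file-based load job is cheaper than streaming inserts.
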
0 votes, 0 answers

Which services should I use to periodically load data from multiple data sources, aggregate it, and provide fast search?

Please propose a solution design for my case. The data comes from various sources: some from APIs, some from CSV files. A user will search using filters. Ex: product data (source 1) and product reviews (source 2). A user will search for a product with its…
jasy • 1 • 1
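
One common shape for this kind of requirement, sketched rather than prescribed: a scheduled job pulls from each source, denormalizes reviews onto their products, and indexes the merged documents into a search engine such as Elasticsearch/OpenSearch, which then serves the filtered searches. Index and field names below are assumptions:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

    def index_products(products, reviews_by_product_id):
        for product in products:
            doc = {
                **product,
                # Denormalize: one document answers product + review filters
                "reviews": reviews_by_product_id.get(product["id"], []),
            }
            es.index(index="products", id=product["id"], document=doc)
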
0 votes, 1 answer

ValueError when running Python function in data pipeline

I'm building a data pipeline using Python, and I'm running into an issue when trying to execute a certain function. The error message I'm receiving is: ValueError: could not convert string to float: 'N/A'. Here is the function in question: def…
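
Without the full function, only the generic cause can be shown: float("N/A") always raises exactly this ValueError, so sentinel strings have to be handled before conversion. A sketch (treating 'N/A' as missing is an assumption about the asker's data):

    from typing import Optional

    def to_float(value: str) -> Optional[float]:
        # Turn non-numeric sentinels like 'N/A' into None instead of crashing
        try:
            return float(value)
        except (TypeError, ValueError):
            return None

With pandas, pd.to_numeric(column, errors="coerce") achieves the same thing column-wide, mapping unparseable strings to NaN.
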
0 votes, 0 answers

Azkaban 3.44 conditional flow not working (including the example from the official documentation)

I'm trying to use conditional flows in Azkaban. When I submit/upload my project via the web node, I receive this error: Validator Directory Flow reports errors: Error loading flow yaml file sample.flow: Cannot create property=nodes for…
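
An error of the form "Cannot create property=nodes" usually means the YAML loader rejected the shape of the .flow file; also, conditional flows are a Flow 2.0 feature, so the project must be marked as Flow 2.0, and 3.44 may simply predate conditional-flow support (worth checking the release notes). A minimal well-formed pair of files, with placeholder job names:

    # flow20.project
    azkaban-flow-version: 2.0

    # sample.flow -- 'nodes' must be the top-level key, with list items
    # indented consistently beneath it
    nodes:
      - name: jobA
        type: command
        config:
          command: echo "A"

      - name: jobB
        type: command
        dependsOn:
          - jobA
        condition: one_success
        config:
          command: echo "B"
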