Questions tagged [data-pipeline]
168 questions
3
votes
2 answers
Google data fusion Execution error "INVALID_ARGUMENT: Insufficient 'DISKS_TOTAL_GB' quota. Requested 3000.0, available 2048.0."
I am trying to load a simple CSV file from GCS to BigQuery using the Google Data Fusion free edition. The pipeline is failing with an error that reads:
com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Insufficient…

user11953315
- 33
- 1
- 3
3
votes
1 answer
Workflow orchestration tool compatible with Windows Server 2013?
My current project requires automation and scheduled execution of a number of tasks (copy a file, send an email when a new file arrives in a directory, execute an analytics job, etc.). My plan is to write a number of individual shell scripts for each…

Praveen Thirukonda
- 365
- 1
- 4
- 16
3
votes
2 answers
Undo/rollback the effects of a data processing pipeline
I have a workflow that I'll describe as follows:
[ Dump(query) ] ---+
                   |
                   +---> [ Parquet(dump, schema) ] ---> [ Hive(parquet) ]
                   |
[ Schema(query) ] -+
Where:
query is a query to an…
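One common way to make a pipeline like this undoable is to avoid in-place writes altogether: each stage writes into a staging directory that is atomically renamed into place only on success, so a failed stage leaves nothing to roll back. A minimal stdlib sketch of the pattern (`run_stage` and `produce` are hypothetical names, not part of the workflow above):

```python
import os
import shutil
import tempfile

def run_stage(output_dir: str, produce) -> None:
    """Run one pipeline stage so its output appears atomically, or not at all.

    `produce` is a callable that writes the stage's files into the directory
    it is given. On success the staging directory is renamed into place
    (atomic on POSIX when staging and target share a filesystem); on failure
    the staging directory is removed, leaving no partial output to undo.
    """
    parent = os.path.dirname(os.path.abspath(output_dir))
    staging = tempfile.mkdtemp(dir=parent)
    try:
        produce(staging)
        os.replace(staging, output_dir)
    except Exception:
        shutil.rmtree(staging, ignore_errors=True)
        raise
```

The same idea transfers to the Parquet and Hive steps: materialize to a scratch location, then swap it in (or register the Hive partition) as the last, cheapest-to-undo operation.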

stefanobaghino
- 11,253
- 4
- 35
- 63
3
votes
1 answer
Is it possible to create an EMR cluster with auto scaling using Data Pipeline?
I am new to AWS. I have created an EMR cluster with an auto-scaling policy through the AWS console. I have also created a data pipeline which can use this cluster to perform the activities.
I am also able to create an EMR cluster dynamically through data…

Bharani
- 429
- 1
- 8
- 18
2
votes
1 answer
Dagster sensor to check for new records in a table
I have two tables, where the second is dependent on the first. Whenever new records are added to the first, I want to run a Dagster job. I came across sensors, but I am not sure whether my requirement can be fulfilled using the functionality they provide. Any ideas?
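Sensors can cover this: a sensor keeps a cursor (e.g. the highest id it has seen) and yields one RunRequest per new row. The polling logic is sketched below with stdlib sqlite3 so it runs standalone; inside Dagster the same query would sit in an `@sensor`-decorated function that reads `context.cursor` and calls `context.update_cursor`. Table and column names here are made up for the example.

```python
import sqlite3

def new_record_ids(conn, cursor_value: int):
    """Return ids added since the stored cursor, plus the new cursor value.

    This mirrors what a Dagster sensor body would do: read the last-seen id
    from context.cursor, query for anything newer, and yield a RunRequest
    per new row before updating the cursor.
    """
    rows = conn.execute(
        "SELECT id FROM source_table WHERE id > ? ORDER BY id",
        (cursor_value,),
    ).fetchall()
    ids = [r[0] for r in rows]
    return ids, (ids[-1] if ids else cursor_value)

# demo with an in-memory database standing in for the first table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_table (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO source_table VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c")])
ids, cursor = new_record_ids(conn, 1)  # pretend the sensor last saw id 1
```

If the table has no monotonically increasing key, a `created_at` timestamp column serialized into the cursor works the same way.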

Abi
- 83
- 6
2
votes
1 answer
How can we generate multiple output files in Benthos?
Input Data:
{ "name": "Coffee", "price": "$1.00" }
{ "name": "Tea", "price": "$2.00" }
{ "name": "Coke", "price": "$3.00" }
{ "name": "Water", "price": "$4.00" }
extension.yaml
input:
  label: ""
  file:
    paths: [ ./input/* ]
    codec: lines
…
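One way to get a file per record (a sketch, not the only option): Benthos's `file` output interpolates Bloblang expressions in `path`, so each message can be routed to its own file named after one of its fields. The `./output/` directory and `.json` suffix below are assumptions; `name` is taken from the sample input.

```yaml
output:
  file:
    path: './output/${! json("name") }.json'
    codec: lines
```

With the sample input this would produce Coffee.json, Tea.json, Coke.json and Water.json, each holding its record.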

Yash Chauhan
- 174
- 13
2
votes
0 answers
How to convert tensor byte string to string format?
I am trying to use TensorFlow's tf.data to build a pipeline.
import tensorflow as tf
import numpy as np
import cv2
There are six images, named 1.jpeg, 2.jpeg, up to 6.jpeg. I want to load the images using the map() function of tf.data.
tfds =…

DextroLaev
- 106
- 6
2
votes
1 answer
Copy and Extracting Zipped XML files from HTTP Link Source to Azure Blob Storage using Azure Data Factory
I am trying to establish an Azure Data Factory copy data pipeline. The source is an open HTTP link (URL: https://clinicaltrials.gov/AllPublicXML.zip), i.e. a zipped folder containing many XML files. I want…
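A hedged sketch of the usual approach: ADF datasets accept a `compression` block, and setting its type to `ZipDeflate` makes the copy activity unzip the archive in flight so the individual XML files land in Blob Storage. The dataset name, linked service name, and the exact property layout below are illustrative, not verified against this pipeline:

```json
{
  "name": "ZippedXmlSource",
  "properties": {
    "type": "Binary",
    "linkedServiceName": {
      "referenceName": "ClinicalTrialsHttp",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": { "type": "HttpServerLocation" },
      "compression": { "type": "ZipDeflate" }
    }
  }
}
```

The sink dataset would be a plain (uncompressed) Binary dataset on the Blob Storage linked service, so the copy activity decompresses on the way through.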

Aditya Bhattacharya
- 914
- 2
- 9
- 22
2
votes
0 answers
Feasible Streaming Suggestions | Is it possible to use Apache Nifi + Apache Beam (on Flink Cluster) with Real Time Streaming Data
I am very new to all of the Apache frameworks I am trying to use. I would like your suggestions on a couple of workflow designs for an IoT streaming application:
As we have NiFi connectors available for Flink, and we can easily use the Beam abstraction…

Subham Agrawal
- 55
- 9
2
votes
1 answer
padding in tf.data.Dataset in tensorflow
Code:
a = training_dataset.map(lambda x, y: (tf.pad(x, tf.constant([[13 - int(tf.shape(x)[0]), 0], [0, 0]])), y))
gives the following error:
TypeError: in user code:
:1 None *
a=training_dataset.map(lambda x,y:…

shresth_mehta
- 81
- 7
2
votes
1 answer
Airflow on Google Cloud Composer vs Docker
I can't find much information on the differences between running Airflow on Google Cloud Composer and in Docker. I am trying to switch our data pipelines, currently on Google Cloud Composer, over to Docker so they run locally, but am trying to…

Erika_Marsha
- 23
- 1
- 5
2
votes
1 answer
Google Data Fusion: "Looping" over input data to then execute multiple Restful API calls per input row
I have the following challenge I would like to solve preferably in Google Data Fusion:
I have one web service that returns about 30-50 elements describing an invoice in a JSON payload like this:
{
  "invoice-services": [
    {
      "serviceId":…

JensU
- 21
- 1
2
votes
2 answers
Luigi not picking up next task to run, bunch of pending tasks left, no failed tasks
I am running a big Luigi workflow that is supposed to run over a hundred tasks in total. The workflow goes well for quite a while, but at one stage it reaches a point where 15 tasks are pending, all other tasks are successfully done, and no…

Asif Iqbal
- 4,562
- 5
- 27
- 31
2
votes
1 answer
Python psycopg2: Copy result of query to another table
I am having a problem with psycopg2 in Python.
I have two disparate connections with corresponding cursors:
1. Source connection - source_cursor
2. Destination connection - dest_cursor
Let's say there is a select query that I want to execute on…
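Since the two cursors belong to different connections (possibly different servers), a single INSERT ... SELECT cannot span them; the usual pattern is to stream batches from the source cursor into `executemany` on the destination. Sketched here with stdlib sqlite3 so it runs as-is; the DB-API calls are the same under psycopg2 (which uses `%s` placeholders rather than `?`, and also offers `copy_expert` for faster bulk COPY moves):

```python
import sqlite3

def copy_query_result(source_cursor, dest_cursor, select_sql, insert_sql,
                      batch_size=1000):
    """Stream rows from one connection's cursor into another's in batches.

    Works with any DB-API driver; fetchmany keeps memory bounded for large
    result sets instead of pulling everything with fetchall.
    """
    source_cursor.execute(select_sql)
    while True:
        rows = source_cursor.fetchmany(batch_size)
        if not rows:
            break
        dest_cursor.executemany(insert_sql, rows)

# demo: two separate in-memory databases stand in for the two servers
src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
src.execute("CREATE TABLE t (id INTEGER, name TEXT)")
src.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b")])
dst.execute("CREATE TABLE t (id INTEGER, name TEXT)")
copy_query_result(src.cursor(), dst.cursor(),
                  "SELECT id, name FROM t", "INSERT INTO t VALUES (?, ?)")
dst.commit()
```

Remember to commit on the destination connection; the source side needs no commit for a read-only query.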

skybunk
- 833
- 2
- 12
- 17
2
votes
1 answer
How to configure AWS data pipeline using serverless.yml?
I am new to both Data Pipeline and Serverless. I want to know how I can automate an AWS data pipeline using Serverless. Below is my diagram of the AWS data pipeline, which exports a DynamoDB table to S3.
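A sketch of one route: the Serverless Framework has no first-class Data Pipeline support, but its `resources` section passes raw CloudFormation through, and CloudFormation defines an `AWS::DataPipeline::Pipeline` resource. Names, roles, and the abbreviated pipeline objects below are illustrative; the real DynamoDB, EMR, and S3 objects would mirror the console-built pipeline's definition.

```yaml
resources:
  Resources:
    DynamoExportPipeline:
      Type: AWS::DataPipeline::Pipeline
      Properties:
        Name: dynamo-to-s3-export
        Activate: true
        PipelineObjects:
          - Id: Default
            Name: Default
            Fields:
              - Key: scheduleType
                StringValue: cron
              - Key: role
                StringValue: DataPipelineDefaultRole
              - Key: resourceRole
                StringValue: DataPipelineDefaultResourceRole
          # ... DynamoDBDataNode, EmrActivity/EmrCluster and S3DataNode
          # objects follow here
```

One way to recover the object definitions is to export the existing console pipeline's JSON and translate each object into a `PipelineObjects` entry.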

deosha
- 972
- 5
- 20