Questions tagged [data-pipeline]

168 questions
3 votes · 2 answers

Google Data Fusion execution error "INVALID_ARGUMENT: Insufficient 'DISKS_TOTAL_GB' quota. Requested 3000.0, available 2048.0."

I am trying to load a simple CSV file from GCS to BQ using the Google Data Fusion free version. The pipeline is failing with an error that reads: com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Insufficient…
3 votes · 1 answer

Workflow orchestration tool compatible with Windows Server 2013?

My current project requires automation and scheduled execution of a number of tasks (copy a file, send an email when a new file arrives in a directory, execute an analytics job, etc.). My plan is to write a number of individual shell scripts for each…
3 votes · 2 answers

Undo/rollback the effects of a data processing pipeline

I have a workflow that I'll describe as follows:

[ Dump(query) ] -----+
                     |
                     +---> [ Parquet(dump, schema) ] ---> [ Hive(parquet) ]
                     |
[ Schema(query) ] ---+

Where: query is a query to an…
stefanobaghino
3 votes · 1 answer

Is it possible to create an EMR cluster with auto scaling using Data Pipeline?

I am new to AWS. I have created an EMR cluster with an auto scaling policy through the AWS console. I have also created a data pipeline that can use this cluster to perform the activities. I am also able to create an EMR cluster dynamically through data…
2 votes · 1 answer

Dagster sensor to check for new records in a table

I have two tables, where the second depends on the first. Whenever new records are added to the first, I want to run a Dagster job. I came across sensors, but I am not sure whether my requirement can be fulfilled using the functionality they provide. Any ideas?
Abi
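Sensors are a reasonable fit here: a Dagster sensor can poll the first table on an interval, keep the highest id it has seen in the sensor cursor (`context.cursor` / `context.update_cursor`), and yield a `RunRequest` when new rows show up. A minimal sketch of the polling half, with sqlite3 standing in for the real database (table and column names are invented):

```python
import sqlite3

def poll_new_records(conn, last_seen_id):
    """Return rows with id greater than the stored cursor, plus the new cursor."""
    rows = conn.execute(
        "SELECT id, payload FROM source_table WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
    new_cursor = rows[-1][0] if rows else last_seen_id
    return rows, new_cursor

# Demo with an in-memory table standing in for the real source table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_table (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO source_table (payload) VALUES (?)", [("a",), ("b",), ("c",)]
)
rows, cursor = poll_new_records(conn, 0)         # picks up all three rows
rows2, cursor2 = poll_new_records(conn, cursor)  # nothing new yet
```

Inside a function decorated with `@sensor(job=...)`, you would read `last_seen_id` from `context.cursor`, yield a `RunRequest` for the new batch, and then call `context.update_cursor` with the new value.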
2 votes · 1 answer

How can we generate multiple output files in Benthos?

Input data:
{ "name": "Coffee", "price": "$1.00" }
{ "name": "Tea", "price": "$2.00" }
{ "name": "Coke", "price": "$3.00" }
{ "name": "Water", "price": "$4.00" }
extension.yaml:
input:
  label: ""
  file:
    paths: [./input/*]
    codec: lines …
Yash Chauhan
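A common approach in Benthos is to interpolate a field from each document into the file output's `path`, so every message lands in its own file. A minimal sketch (the output directory and the choice of the `name` field are assumptions based on the sample data):

```yaml
# Hypothetical sketch: route each input document to its own file,
# naming the file after the document's "name" field.
output:
  file:
    path: './output/${! json("name") }.json'
    codec: lines
```

With the four sample documents this would produce Coffee.json, Tea.json, Coke.json, and Water.json under ./output/.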
2 votes · 0 answers

How to convert tensor byte string to string format?

I am trying to use TensorFlow's tf.data to build a pipeline.
import tensorflow as tf
import numpy as np
import cv2
There are 6 images, named 1.jpeg, 2.jpeg, up to 6.jpeg. I want to load the images using the map() function of tf.data. tfds =…
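In eager mode, calling `.numpy()` on a scalar `tf.string` tensor returns a Python `bytes` object, so the last step is an ordinary `decode`; inside a graph (e.g. within `map`) the value should instead stay a `tf.string` and be manipulated with `tf.strings` ops. The bytes-to-str step itself is plain Python:

```python
# What tensor.numpy() yields for a scalar tf.string tensor is a bytes object,
# e.g. one of the filenames from the question.
raw = b"1.jpeg"
text = raw.decode("utf-8")  # plain Python bytes -> str
```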
2 votes · 1 answer

Copy and Extracting Zipped XML files from HTTP Link Source to Azure Blob Storage using Azure Data Factory

I am trying to establish an Azure Data Factory copy data pipeline. The source is an open HTTP link (URL reference: https://clinicaltrials.gov/AllPublicXML.zip). So basically the source is a zip archive containing many XML files. I want…
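One way this is commonly handled in Data Factory is to declare the HTTP source dataset as Binary with `ZipDeflate` compression, so the copy activity decompresses the archive as it copies. A trimmed sketch of the dataset JSON (the dataset and linked-service names are made up):

```json
{
  "name": "ZippedXmlSource",
  "properties": {
    "type": "Binary",
    "linkedServiceName": {
      "referenceName": "ClinicalTrialsHttp",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": { "type": "HttpServerLocation" },
      "compression": { "type": "ZipDeflate" }
    }
  }
}
```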
2 votes · 0 answers

Feasible streaming suggestions | Is it possible to use Apache NiFi + Apache Beam (on a Flink cluster) with real-time streaming data?

So, I am very new to all of the Apache frameworks I am trying to use. I want your suggestions on a couple of workflow designs for an IoT streaming application: as we have NiFi connectors available for Flink, and we can easily use the Beam abstraction…
2 votes · 1 answer

Padding in tf.data.Dataset in TensorFlow

Code:
a = training_dataset.map(lambda x, y: (tf.pad(x, tf.constant([[13 - int(tf.shape(x)[0]), 0], [0, 0]])), y))
gives the following error:
TypeError: in user code: :1 None * a=training_dataset.map(lambda x,y:…
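The `TypeError` most likely comes from `int(tf.shape(x)[0])`: inside `Dataset.map` the shape is a symbolic tensor and cannot be cast with Python's `int()`. Passing the paddings as tensor arithmetic, e.g. `tf.pad(x, [[13 - tf.shape(x)[0], 0], [0, 0]])`, keeps the whole expression in graph ops (`Dataset.padded_batch` also covers this case). As a pure-Python analogue of what that padding computes along the first axis:

```python
def pad_rows_front(rows, target_len, fill_row):
    # Mirrors tf.pad(x, [[target_len - tf.shape(x)[0], 0], [0, 0]]):
    # prepend (target_len - len(rows)) fill rows, leave columns untouched.
    return [fill_row] * (target_len - len(rows)) + list(rows)

padded = pad_rows_front([[1, 2], [3, 4]], 4, [0, 0])
# padded is [[0, 0], [0, 0], [1, 2], [3, 4]]
```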
2 votes · 1 answer

Airflow on Google Cloud Composer vs Docker

I can't find much information on the differences between running Airflow on Google Cloud Composer vs in Docker. I am trying to switch our data pipelines, currently on Google Cloud Composer, over to Docker to just run locally, but am trying to…
2 votes · 1 answer

Google Data Fusion: "Looping" over input data to execute multiple RESTful API calls per input row

I have the following challenge, which I would like to solve preferably in Google Data Fusion: I have one web service that returns about 30-50 elements describing an invoice in a JSON payload like this: { "invoice-services": [ { "serviceId":…
JensU
2 votes · 2 answers

Luigi not picking up next task to run, bunch of pending tasks left, no failed tasks

I am running a big Luigi workflow that is supposed to run over a hundred tasks in total. The workflow goes well for quite a while, but at one stage it reaches a point where 15 tasks are pending, all other tasks are successfully done, and no…
Asif Iqbal
2 votes · 1 answer

Python psycopg2: Copy result of query to another table

I am having a problem with psycopg2 in Python. I have two separate connections with corresponding cursors:
1. Source connection - source_cursor
2. Destination connection - dest_cursor
Let's say there is a select query that I want to execute on…
skybunk
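The straightforward pattern is to fetch through the source cursor and bulk-insert through the destination cursor. The sketch below demonstrates it with stdlib sqlite3 standing in for the two Postgres connections (table and column names are invented):

```python
import sqlite3

# Two independent connections standing in for source and destination databases.
source_conn = sqlite3.connect(":memory:")
dest_conn = sqlite3.connect(":memory:")

source_conn.execute("CREATE TABLE events (id INTEGER, kind TEXT)")
source_conn.executemany(
    "INSERT INTO events VALUES (?, ?)", [(1, "click"), (2, "view"), (3, "click")]
)
dest_conn.execute("CREATE TABLE clicks (id INTEGER, kind TEXT)")

# Run the select on the source, then bulk-insert the rows at the destination.
source_cursor = source_conn.cursor()
source_cursor.execute("SELECT id, kind FROM events WHERE kind = ?", ("click",))
rows = source_cursor.fetchall()

dest_cursor = dest_conn.cursor()
dest_cursor.executemany("INSERT INTO clicks VALUES (?, ?)", rows)
dest_conn.commit()
```

With psycopg2, the fetchall/executemany pair can be replaced by streaming through an in-memory buffer: `source_cursor.copy_expert("COPY (SELECT ...) TO STDOUT", buf)` on an `io.StringIO`, then `buf.seek(0)` and `dest_cursor.copy_expert("COPY clicks FROM STDIN", buf)`, which avoids materializing every row as Python tuples.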
2 votes · 1 answer

How to configure AWS Data Pipeline using serverless.yml?

I am new to both Data Pipeline and Serverless. I want to know how I can automate AWS Data Pipeline using the Serverless Framework. Below is my diagram of the AWS data pipeline, which exports a DynamoDB table to S3.
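serverless.yml has no first-class Data Pipeline support, but it does let you embed raw CloudFormation under a `resources:` section, and CloudFormation provides an `AWS::DataPipeline::Pipeline` type. A heavily trimmed sketch (the logical name, roles, and on-demand schedule are illustrative, and the DynamoDB export and S3 output objects would still need to be filled in):

```yaml
# Hypothetical sketch: raw CloudFormation embedded in serverless.yml.
resources:
  Resources:
    DynamoExportPipeline:
      Type: AWS::DataPipeline::Pipeline
      Properties:
        Name: dynamo-to-s3-export
        Activate: true
        PipelineObjects:
          - Id: Default
            Name: Default
            Fields:
              - Key: type
                StringValue: Default
              - Key: scheduleType
                StringValue: ondemand
              - Key: role
                StringValue: DataPipelineDefaultRole
              - Key: resourceRole
                StringValue: DataPipelineDefaultResourceRole
          # ... EmrActivity / DynamoDBDataNode / S3DataNode objects go here
```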