Questions tagged [data-pipeline]

168 questions
1 vote • 0 answers

batch_size and max_time in an LSTM in TensorFlow

Background: I am trying to model a multi-layered LSTM in TensorFlow. I am using a common function to unroll the LSTM, tf.nn.dynamic_rnn. Here I am using time_major=True, so my data has to be in the format [max_time, batch_size, depth]. According to my…
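
A minimal TF 1.x sketch of the shape contract the excerpt describes; all dimensions here are hypothetical:

    import tensorflow as tf  # TensorFlow 1.x API

    max_time, batch_size, depth = 20, 32, 50
    num_units, num_layers = 128, 2

    # With time_major=True, dynamic_rnn expects [max_time, batch_size, depth].
    inputs = tf.placeholder(tf.float32, [max_time, batch_size, depth])

    cells = [tf.nn.rnn_cell.LSTMCell(num_units) for _ in range(num_layers)]
    stacked = tf.nn.rnn_cell.MultiRNNCell(cells)

    # outputs: [max_time, batch_size, num_units]; state: tuple of LSTMStateTuples
    outputs, state = tf.nn.dynamic_rnn(stacked, inputs,
                                       time_major=True, dtype=tf.float32)

With time_major=False the same tensors would instead be shaped [batch_size, max_time, depth].
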
1 vote • 1 answer

Kinesis triggers lambda with small batch size

I have a Lambda configured as a consumer of a Kinesis data stream, with a batch size of 10,000 (the maximum). The Lambda parses the given records and inserts them into Aurora PostgreSQL (using an INSERT command). Somehow, I see that the Lambda is…
Ronyis • 1,803 • 16 • 17
1 vote • 0 answers

Incremental update of the data in AWS S3

Incremental update of S3 buckets without natural keys: I need to design an ETL flow. OLTP systems share customer, product, campaign, and sales records via files. I want to transfer these files incrementally into AWS S3 buckets. Assume that I…
user125687 • 85 • 1 • 4 • 15
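
One common pattern, sketched with boto3 (bucket, prefix, and file names are hypothetical): land each extract under a date partition, so downstream jobs read only the new prefix instead of rescanning the whole bucket.

    import boto3
    from datetime import date

    s3 = boto3.client("s3")
    partition = date.today().isoformat()

    # Each day's OLTP extract lands under its own dt= prefix; incremental
    # consumers then process just the newest partition.
    s3.upload_file("customer_extract.csv", "my-etl-landing",
                   f"customer/dt={partition}/customer_extract.csv")
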
1 vote • 1 answer

AWS Data Pipeline: incorrect Java version

I am trying to execute a jar file in my data pipeline, and it is erroring out in a fashion that indicates the version of Java installed in my pipeline is lower than the one required by the executable jar. I have tried to add a command…
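
A sketch of one workaround: have the activity install a newer JRE before launching the jar. The package name and paths assume an Amazon Linux AMI and are unverified; a real definition also needs the Default and Ec2Resource pipeline objects, omitted here for brevity.

    import boto3

    # Hypothetical ShellCommandActivity that installs Java 8 first.
    run_jar = {
        "id": "RunJar",
        "name": "RunJar",
        "fields": [
            {"key": "type", "stringValue": "ShellCommandActivity"},
            {"key": "command", "stringValue":
                "sudo yum install -y java-1.8.0-openjdk && "
                "java -version && java -jar /home/ec2-user/app.jar"},
            {"key": "runsOn", "refValue": "Ec2Instance"},
        ],
    }

    dp = boto3.client("datapipeline")
    dp.put_pipeline_definition(pipelineId="df-EXAMPLE",  # hypothetical id
                               pipelineObjects=[run_jar])
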
1 vote • 1 answer

Is Airflow supported across data centers?

We would like to use Apache Airflow to orchestrate work across global data centers (regions). From what I can tell, the only way to make this work is to give all tasks access/permission to write directly to some cloud-exposed database. Does…
hhop • 33 • 3
1 vote • 1 answer

"Connection timed out (Connection timed out)" Error for SQLActivity

I get a connection timed out error on my Data Pipeline job that runs a simple SQL script. The script is stored in S3. The data pipeline itself is in the us-east-1 region; my database is in us-east-2. When I first ran the pipeline I got the error…
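
Before touching the pipeline, a quick probe can tell whether the timeout is a networking gap (for example a security group that does not admit cross-region traffic) rather than a SQL problem. Host and credentials below are placeholders:

    import psycopg2

    try:
        conn = psycopg2.connect(
            host="mydb.xxxx.us-east-2.rds.amazonaws.com",
            port=5432, dbname="mydb", user="etl", password="...",
            connect_timeout=10,
        )
        print("reachable")
    except psycopg2.OperationalError as exc:
        # A timeout here usually points at security groups, routing, or
        # the cross-region hop, not at the SQL script itself.
        print("unreachable:", exc)
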
1 vote • 1 answer

Checking status of AWS Data Pipeline using Go SDK

Situation: I have two data pipelines that run on demand. Pipeline B cannot run until Pipeline A has completed. I'm trying to automate running both pipelines in a single script/program, but I'm unsure how to do all of this in Go. I have some Go code…
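
The flow is: poll DescribePipelines until Pipeline A reaches a terminal state, then call ActivatePipeline for B. The Go SDK exposes the same operations; the shape of it is sketched here with boto3, with the pipeline IDs and the terminal-state set as assumptions to verify against the Data Pipeline docs:

    import time
    import boto3

    dp = boto3.client("datapipeline")

    TERMINAL = {"FINISHED", "FAILED", "CANCELED"}  # assumed; verify in docs

    def wait_for(pipeline_id, poll_seconds=30):
        # DescribePipelines returns the pipeline's fields, including
        # '@pipelineState'; the Go SDK call returns the same structure.
        while True:
            desc = dp.describe_pipelines(pipelineIds=[pipeline_id])
            fields = desc["pipelineDescriptionList"][0]["fields"]
            state = next(f["stringValue"] for f in fields
                         if f["key"] == "@pipelineState")
            if state in TERMINAL:
                return state
            time.sleep(poll_seconds)

    # IDs are hypothetical: activate B only once A has finished cleanly.
    if wait_for("df-PIPELINE-A") == "FINISHED":
        dp.activate_pipeline(pipelineId="df-PIPELINE-B")
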
1 vote • 0 answers

Why does my CloudFormation Data Pipeline fail on my Ec2Resource?

I'm trying to run a Data Pipeline inside a CloudFormation stack. This stack references the exports of another stack, which contains a Redshift cluster. When I run it, I get an error stating "'Ec2Instance', errors = Internal error during validation…
1 vote • 1 answer

Time Series Windowing for streaming applications

We are developing a data pipeline app using Kafka, Storm, and Redis. Real-time events from different systems are published to Kafka, and Storm does the event processing based on configured rules. State is managed in Redis. We have a requirement to…
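
For the sliding-window part, one common Redis idiom is a sorted set scored by event time, pruned as the window advances. A minimal sketch (window length and key names are hypothetical):

    import time
    import redis  # assumes a reachable Redis instance

    r = redis.Redis()
    WINDOW_SECONDS = 300  # hypothetical 5-minute sliding window

    def record_event(key, event_id, ts=None):
        # Score each event by its timestamp, then drop everything that
        # has fallen out of the window.
        ts = ts or time.time()
        r.zadd(key, {event_id: ts})
        r.zremrangebyscore(key, "-inf", ts - WINDOW_SECONDS)

    def window_count(key):
        return r.zcard(key)

    record_event("events:login", "evt-123")
    print(window_count("events:login"))
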
1 vote • 1 answer

What is the best way to automate replication of RDS (MySQL) schema to AWS Redshift?

We use Ruby scripts to migrate data from MySQL to Redshift (PostgreSQL). Currently we use YAML configuration files to maintain schema information (column names and types), so whenever a MySQL table is altered, we need to manually change the YAML…
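
One way to remove the manual YAML step is to generate the Redshift DDL from MySQL's information_schema on each run. A sketch in Python (the question's scripts are Ruby; the type map here is partial and assumed):

    import pymysql  # assumes network access to the MySQL instance

    TYPE_MAP = {"int": "INTEGER", "bigint": "BIGINT", "varchar": "VARCHAR",
                "datetime": "TIMESTAMP", "text": "VARCHAR(65535)"}  # partial

    def redshift_ddl(conn, schema, table):
        # Read live column metadata so schema changes flow through
        # automatically instead of being hand-edited into YAML.
        with conn.cursor() as cur:
            cur.execute(
                "SELECT column_name, data_type, character_maximum_length "
                "FROM information_schema.columns "
                "WHERE table_schema=%s AND table_name=%s "
                "ORDER BY ordinal_position", (schema, table))
            cols = []
            for name, dtype, maxlen in cur.fetchall():
                rs_type = TYPE_MAP.get(dtype, "VARCHAR(65535)")
                if dtype == "varchar" and maxlen:
                    rs_type = f"VARCHAR({maxlen})"
                cols.append(f'"{name}" {rs_type}')
        return f"CREATE TABLE {table} (\n  " + ",\n  ".join(cols) + "\n);"

    conn = pymysql.connect(host="mysql-host", user="etl",
                           password="...", database="shop")  # placeholders
    print(redshift_ddl(conn, "shop", "orders"))
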
1 vote • 0 answers

Is there any blueprint for a data-pipeline?

I use Spark for data processing, but starting from the data sources (mostly CSV files) I would like to put in place a data pipeline with the right stages to control/test/manipulate data and deploy it to different "stages"…
Randomize • 8,651 • 18 • 78 • 133
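
A common blueprint is extract → validate → transform → load as separate, individually testable functions, parameterized by the deployment stage. A minimal PySpark sketch (paths and the quality gate are hypothetical):

    from pyspark.sql import SparkSession, DataFrame

    spark = SparkSession.builder.appName("pipeline").getOrCreate()

    def extract(path: str) -> DataFrame:
        return spark.read.option("header", True).csv(path)

    def validate(df: DataFrame) -> DataFrame:
        # Hypothetical quality gate: fail fast on empty input.
        if df.rdd.isEmpty():
            raise ValueError("no rows read from source")
        return df

    def transform(df: DataFrame) -> DataFrame:
        return df.dropDuplicates()

    def load(df: DataFrame, stage: str) -> None:
        # Each environment ("stage") writes to its own path.
        df.write.mode("overwrite").parquet(f"/data/{stage}/output")

    load(transform(validate(extract("/data/raw/input.csv"))), stage="dev")
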
0 votes • 0 answers

PyTest for data pipelines

My project is based on collecting infrastructure metrics such as CPU, memory, disk, and hits for various servers and applications, via the Splunk REST API, HTTP API calls, and shell scripts. The Python code is procedural in nature. I need to implement…
Yavnica Saini • 33 • 1 • 1 • 8
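
The usual first step is to factor the procedural script into small pure functions, which pytest can then exercise without hitting Splunk or the servers. A minimal sketch with a hypothetical parser:

    import pytest

    # Hypothetical pure function factored out of the collection script.
    def parse_cpu_metric(payload: dict) -> float:
        return float(payload["cpu_pct"])

    def test_parse_cpu_metric():
        assert parse_cpu_metric({"cpu_pct": "42.5"}) == 42.5

    def test_parse_cpu_metric_missing_key():
        with pytest.raises(KeyError):
            parse_cpu_metric({})

The REST and shell calls themselves can then be stubbed with pytest's monkeypatch fixture, so the tests never depend on the live infrastructure.
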
0 votes • 0 answers

Kafka, Kafka Connect, and HDFS with Docker Compose

I am slowly working my way into the world of Docker Compose. I would like to create a data pipeline. I think something is not working with my connector: the CSV I later send to Kafka is read in, but the data from the connector is not sent to…
Mauz • 1 • 1
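
A quick way to check whether the connector is the problem is to push its config and read its status through the Kafka Connect REST API. Host, topic, and connector settings below are assumptions:

    import requests

    BASE = "http://localhost:8083"  # assumed Connect REST port

    # PUT on /config is idempotent: it creates or updates the connector.
    config = {
        "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
        "topics": "csv-input",                # assumed topic name
        "hdfs.url": "hdfs://namenode:8020",   # assumed HDFS service name
        "flush.size": "3",
    }
    resp = requests.put(f"{BASE}/connectors/hdfs-sink/config", json=config)
    resp.raise_for_status()

    # The status endpoint reports RUNNING or FAILED per task, with the
    # stack trace for failures.
    print(requests.get(f"{BASE}/connectors/hdfs-sink/status").json())
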
0 votes • 0 answers

Is it possible to ALTER MULTIPLE HIVE VIEWS at once? (ACID-like schema changes?)

I have a "private" Hive database with 24 tables populated by externally located Parquet files as part of a Spark data pipeline. I have a "public" Hive database intended for public (downstream) usage, with 24 views selecting content out of…
Rimer • 2,054 • 6 • 28 • 43
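
Hive DDL is not transactional, so there is no true atomic multi-view change; the closest approximation is to regenerate the views in a loop and accept that readers may briefly see mixed versions. A sketch using PyHive (database and table names are hypothetical):

    from pyhive import hive  # assumes HiveServer2 is reachable

    cursor = hive.connect(host="hive-server").cursor()

    TABLES = ["customers", "orders"]  # the 24 tables in practice

    # Each ALTER VIEW commits independently; keep the loop short so the
    # window of mixed old/new views stays small.
    for t in TABLES:
        cursor.execute(
            f"ALTER VIEW public_db.{t} AS SELECT * FROM private_db.{t}"
        )
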
0 votes • 2 answers

How to execute a step based on a condition at the transformation level in Pentaho?

I know that I can use conditional execution at the job level, like below, but I want conditional execution at the transformation level. For example, I have a simple Table Input step with a query like "select id from tableA". Now, based on the value of…