Questions tagged [amazon-data-pipeline]

Simple service to transfer data between Amazon data storage services, kick off Elastic MapReduce jobs, and connect with outside data services.

From the AWS Data Pipeline homepage:

AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services as well as on-premises data sources at specified intervals. With AWS Data Pipeline, you can regularly access your data where it’s stored, transform and process it at scale, and efficiently transfer the results to AWS services such as Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon Elastic MapReduce (EMR).

AWS Data Pipeline helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available. You don’t have to worry about ensuring resource availability, managing inter-task dependencies, retrying transient failures or timeouts in individual tasks, or creating a failure notification system. AWS Data Pipeline also allows you to move and process data that was previously locked up in on-premises data silos.

470 questions
6 votes, 0 answers

How to upgrade Data Pipeline definition from EMR 3.x to 4.x/5.x?

I would like to upgrade my AWS data pipeline definition to EMR 4.x or 5.x, so I can take advantage of Hive's latest features (version 2.0+), such as CURRENT_DATE and CURRENT_TIMESTAMP, etc. The change from EMR 3.x to 4.x/5.x requires the use of…
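The usual change here is in the EmrCluster resource of the pipeline definition: EMR 3.x clusters are pinned with an amiVersion field, while 4.x/5.x clusters use releaseLabel instead (and list the applications to install). A minimal sketch via boto3, where the pipeline id, object ids, and release label are placeholders:

    import boto3

    dp = boto3.client("datapipeline", region_name="us-east-1")

    emr_cluster = {
        "id": "EmrClusterForHive",            # placeholder object id
        "name": "EmrClusterForHive",
        "fields": [
            {"key": "type", "stringValue": "EmrCluster"},
            # EMR 3.x style, to be removed:
            # {"key": "amiVersion", "stringValue": "3.9.0"},
            # EMR 4.x/5.x style:
            {"key": "releaseLabel", "stringValue": "emr-5.13.0"},
            {"key": "applications", "stringValue": "Hive"},
            {"key": "terminateAfter", "stringValue": "2 Hours"},
        ],
    }

    dp.put_pipeline_definition(
        pipelineId="df-EXAMPLE1234",          # placeholder pipeline id
        pipelineObjects=[emr_cluster],        # plus the rest of the existing definition
    )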
6 votes, 2 answers

Is it possible to dump an RDS database to S3 using AWS Data Pipeline?

Basically I want to pg_dump my RDS database to S3 using AWS Data Pipeline. I am not 100% sure if this is possible. I got to the stage where the SqlDataNode wants a selectQuery, at which point I am wondering what to do. Below is my template so…
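One commonly suggested workaround, since a SqlDataNode expects a selectQuery rather than a full dump, is to skip the data node and run pg_dump from a ShellCommandActivity on an EC2 resource. A hedged sketch of such an activity object; the host, database, bucket, and ids are placeholders, and it assumes pg_dump and the AWS CLI are available on the instance:

    shell_dump_activity = {
        "id": "DumpRdsToS3",                  # placeholder object id
        "name": "DumpRdsToS3",
        "fields": [
            {"key": "type", "stringValue": "ShellCommandActivity"},
            {"key": "runsOn", "refValue": "Ec2Instance"},   # an Ec2Resource defined elsewhere
            {"key": "command", "stringValue": (
                # credentials left out; a ~/.pgpass file or a pipeline parameter can supply them
                "pg_dump -h mydb.example.rds.amazonaws.com -U master mydb "
                "| gzip | aws s3 cp - s3://my-bucket/backups/mydb.sql.gz"
            )},
        ],
    }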
6 votes, 1 answer

AWS Data Pipeline: Tez fails on simple HiveActivity

I'm trying to run a simple AWS Data Pipeline for my POC. The case I have is the following: get data from a CSV stored on S3, perform a simple Hive query on it, and put the results back to S3. I've created a very basic pipeline definition and tried to run it…
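For reference, a minimal HiveActivity of the shape described (staged S3 input and output, one pass-through query) might look like the sketch below; whether it runs on Tez or MapReduce depends on the EMR release and hive-site settings. All ids and refs are placeholders:

    hive_activity = {
        "id": "SimpleHiveActivity",           # placeholder object id
        "name": "SimpleHiveActivity",
        "fields": [
            {"key": "type", "stringValue": "HiveActivity"},
            {"key": "runsOn", "refValue": "EmrClusterForHive"},   # EmrCluster defined elsewhere
            {"key": "input", "refValue": "S3CsvInput"},           # S3DataNode over the CSVs
            {"key": "output", "refValue": "S3ResultOutput"},      # S3DataNode for the results
            {"key": "stage", "stringValue": "true"},              # exposes ${input1}/${output1}
            {"key": "hiveScript", "stringValue":
                "INSERT OVERWRITE TABLE ${output1} SELECT * FROM ${input1};"},
        ],
    }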
6 votes, 1 answer

Can AWS Redshift drop a table that is wrapped in transaction?

During the ETL we do the following operations: begin transaction; drop table if exists target_tmp; create table target_tmp like target; insert into target_tmp select * from source_a inner join source_b on ...; analyze table…
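A hedged sketch of running the table swap described above as a single transaction from Python with psycopg2; the cluster endpoint, credentials, and join key are placeholders:

    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.example.redshift.amazonaws.com",  # placeholder endpoint
        port=5439, dbname="dev", user="etl", password="placeholder",
    )
    try:
        with conn:                            # commit on success, roll back on error
            with conn.cursor() as cur:
                cur.execute("DROP TABLE IF EXISTS target_tmp;")
                cur.execute("CREATE TABLE target_tmp (LIKE target);")
                cur.execute(
                    "INSERT INTO target_tmp "
                    "SELECT * FROM source_a "
                    "JOIN source_b ON source_a.id = source_b.id;"  # placeholder join key
                )
                cur.execute("ANALYZE target_tmp;")
    finally:
        conn.close()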
6 votes, 2 answers

How do I create a parameter in the console of AWS Data Pipeline

I'd like to define some parameters in the console of AWS DataPipeline, but am not able to do so. The parameters are going to be called in a SqlActivity, so when I try to refer to them in the in-line SQL script and save the pipeline, I'm getting…
simplycoding • 2,770 • 9 • 46 • 91
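For context, the console's Parameters section maps to parameter objects and values in the pipeline definition, and a SqlActivity script references them with the #{...} syntax. A hedged sketch via boto3 with placeholder names and ids:

    import boto3

    dp = boto3.client("datapipeline", region_name="us-east-1")

    parameter_objects = [{
        "id": "myTableName",                  # user-defined parameter ids must start with "my"
        "attributes": [
            {"key": "type", "stringValue": "String"},
            {"key": "description", "stringValue": "Table targeted by the SqlActivity"},
        ],
    }]
    parameter_values = [{"id": "myTableName", "stringValue": "events"}]   # placeholder value

    sql_activity = {
        "id": "MySqlActivity",
        "name": "MySqlActivity",
        "fields": [
            {"key": "type", "stringValue": "SqlActivity"},
            {"key": "database", "refValue": "MyRdsDatabase"},   # database object defined elsewhere
            {"key": "runsOn", "refValue": "Ec2Instance"},
            # the in-line script refers to the parameter with #{...}:
            {"key": "script", "stringValue": "DELETE FROM #{myTableName} WHERE ds = '2015-01-01';"},
        ],
    }

    dp.put_pipeline_definition(
        pipelineId="df-EXAMPLE1234",          # placeholder pipeline id
        pipelineObjects=[sql_activity],       # plus the rest of the definition
        parameterObjects=parameter_objects,
        parameterValues=parameter_values,
    )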
6 votes, 5 answers

Amazon Redshift: Copying Data Between Databases

I am looking to copy data between databases on Amazon Redshift. Before this, I was copying data from a Redshift database to a PostgreSQL database hosted on an EC2 instance for analytical purposes. I had a Ruby script that would do it using the dblink extension. But…
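One approach that comes up often when both databases live on Redshift is to UNLOAD from the source database to S3 and COPY into the target, rather than using dblink. A hedged sketch; endpoints, credentials, bucket, table, and IAM role are placeholders:

    import psycopg2

    UNLOAD_SQL = """
    UNLOAD ('SELECT * FROM public.events')
    TO 's3://my-bucket/stage/events_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyUnload'
    GZIP;
    """

    COPY_SQL = """
    COPY public.events
    FROM 's3://my-bucket/stage/events_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyUnload'
    GZIP;
    """

    def run(dsn, sql):
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(sql)

    run("host=source.example.redshift.amazonaws.com port=5439 dbname=db_a user=etl password=x", UNLOAD_SQL)
    run("host=target.example.redshift.amazonaws.com port=5439 dbname=db_b user=etl password=x", COPY_SQL)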
6 votes, 1 answer

S3 to Redshift input data format

I'm trying to run a simple s3 -> pipeline -> redshift chain, but I've gotten completely stuck on the input data format. Here's my file: 1,Toyota Park,Bridgeview,IL 2,Columbus Crew Stadium,Columbus,OH 3,RFK Stadium,Washington,DC 4,CommunityAmerica…
KorsaR • 536 • 1 • 10 • 26
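For comma-separated rows like these, the COPY statement that ultimately runs needs CSV-style options; in a RedshiftCopyActivity that is usually expressed through commandOptions. A hedged sketch of such an activity object, with placeholder ids and refs:

    redshift_copy = {
        "id": "S3ToRedshiftCopy",             # placeholder object id
        "name": "S3ToRedshiftCopy",
        "fields": [
            {"key": "type", "stringValue": "RedshiftCopyActivity"},
            {"key": "input", "refValue": "S3CsvInput"},          # S3DataNode over the file
            {"key": "output", "refValue": "RedshiftStadiums"},   # RedshiftDataNode for the table
            {"key": "runsOn", "refValue": "Ec2Instance"},
            {"key": "insertMode", "stringValue": "TRUNCATE"},
            # passed through to the Redshift COPY command:
            {"key": "commandOptions", "stringValue": "CSV"},
        ],
    }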
5 votes, 2 answers

AWSGlue AccessDeniedException, Status Code 400

I am trying to build a data pipeline for a data engineering project with the help of S3, Glue, Athena, etc. I am stuck when setting up the Glue crawler for indexing the data. Even though I set up the role according to the need, it's still giving me the…
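A frequent cause of this error is the crawler's role lacking S3 read access to the data path on top of the Glue service policy. A hedged sketch of attaching both via boto3; the role name and bucket are placeholders:

    import json
    import boto3

    iam = boto3.client("iam")

    s3_access = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my-datalake-bucket",        # placeholder bucket
                "arn:aws:s3:::my-datalake-bucket/*",
            ],
        }],
    }

    iam.put_role_policy(
        RoleName="MyGlueCrawlerRole",                     # placeholder role name
        PolicyName="GlueCrawlerS3Access",
        PolicyDocument=json.dumps(s3_access),
    )
    iam.attach_role_policy(
        RoleName="MyGlueCrawlerRole",
        PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    )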
5 votes, 2 answers

AWS Data Pipeline: Issue with permissions S3 Access for IAM role

I'm using the Load S3 data into RDS MySQL table template in AWS Data Pipeline to import CSVs from an S3 bucket into our RDS MySQL. However, I (as an IAM user with full admin rights) run into a warning I can't solve: Object:Ec2Instance - WARNING: Could…
5 votes, 5 answers

ETL pipeline in AWS with S3 as data lake: how to handle incremental updates

I have set up an ETL pipeline in AWS as follows: input_rawdata -> s3 -> lambda -> trigger spark etl script (via aws glue) -> output (s3, parquet files). My question is: assuming the above is the initial load of the data, how do I set it up to run incremental…
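One common way to make such a pipeline incremental is to enable Glue job bookmarks so each run of the Spark ETL script only processes S3 objects it has not seen before. A hedged sketch of creating the job with bookmarks enabled; the job name, role, and script location are placeholders:

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    glue.create_job(
        Name="rawdata-to-parquet",                                # placeholder job name
        Role="arn:aws:iam::123456789012:role/MyGlueJobRole",      # placeholder role
        Command={
            "Name": "glueetl",
            "ScriptLocation": "s3://my-bucket/scripts/etl.py",    # placeholder script path
        },
        DefaultArguments={
            # subsequent runs skip S3 objects already processed by this job
            "--job-bookmark-option": "job-bookmark-enable",
        },
    )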
5 votes, 1 answer

Exporting AWS Data Pipeline as CloudFormation template to use it in Terraform

I'm trying to export an existing AWS Data Pipeline task to Terraform infrastructure somehow. According to this issue, there is no direct support for Data Pipelines, but it still seems achievable using CloudFormation templates (terraform…
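A reasonable first step is to pull the live definition via the API so it can be translated into an AWS::DataPipeline::Pipeline CloudFormation resource, which Terraform can then wrap in an aws_cloudformation_stack. A hedged sketch with a placeholder pipeline id:

    import json
    import boto3

    dp = boto3.client("datapipeline", region_name="us-east-1")

    definition = dp.get_pipeline_definition(pipelineId="df-EXAMPLE1234")   # placeholder id
    print(json.dumps(
        {
            "pipelineObjects": definition["pipelineObjects"],
            "parameterObjects": definition.get("parameterObjects", []),
            "parameterValues": definition.get("parameterValues", []),
        },
        indent=2, default=str,
    ))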
5 votes, 1 answer

Does it make sense to use Google DataFlow/Apache Beam to parallelize image processing or crawling tasks?

I am considering Google DataFlow as an option for running a pipeline that involves steps like: Downloading images from the web; Processing images. I like that DataFlow manages the lifetime of VMs required to complete the job, so I don't need to…
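A sketch of how such a job could be expressed with Beam's Python SDK on the Dataflow runner is below; the project, bucket, and the body of process_image are placeholders for whatever per-image work is actually needed:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def download(url):
        import requests                      # imported inside the function so workers have it
        return url, requests.get(url, timeout=30).content

    def process_image(element):
        url, raw_bytes = element
        # placeholder: resize, run a model, extract features, etc.
        return url, len(raw_bytes)

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-gcp-project",            # placeholder project
        region="us-central1",
        temp_location="gs://my-bucket/tmp",  # placeholder bucket
    )

    with beam.Pipeline(options=options) as p:
        (p
         | "ReadUrls" >> beam.io.ReadFromText("gs://my-bucket/image_urls.txt")
         | "Download" >> beam.Map(download)
         | "Process" >> beam.Map(process_image)
         | "Write" >> beam.io.WriteToText("gs://my-bucket/results/part"))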
5 votes, 1 answer

Hadoop streaming job using MXNet failing in AWS EMR

I have set up an EMR step in AWS Data Pipeline. The step command looks like this:…
ishan3243 • 1,870 • 4 • 30 • 49
5 votes, 1 answer

Parallelization of sklearn Pipeline

I have a set of Pipelines and want to have a multi-threaded architecture. My typical Pipeline is shown below: huber_pipe = Pipeline([ ("DATA_CLEANER", DataCleaner()), ("DATA_ENCODING", Encoder(encoder_name='code')), ("SCALE",…
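The sequential steps inside one sklearn Pipeline cannot be parallelized against each other, but independent pipelines can be fitted in parallel with joblib. A hedged sketch; DataCleaner and Encoder are the asker's custom transformers, so stand-in steps and generated data are used to keep the example self-contained:

    from joblib import Parallel, delayed
    from sklearn.datasets import make_regression
    from sklearn.linear_model import HuberRegressor, Ridge
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_regression(n_samples=1000, n_features=10, random_state=0)

    pipelines = {
        "huber": Pipeline([("SCALE", StandardScaler()), ("MODEL", HuberRegressor())]),
        "ridge": Pipeline([("SCALE", StandardScaler()), ("MODEL", Ridge())]),
    }

    def fit_one(name, pipe):
        return name, pipe.fit(X, y)

    # one worker per pipeline; use prefer="threads" if the estimators release the GIL
    fitted = dict(Parallel(n_jobs=-1)(delayed(fit_one)(n, p) for n, p in pipelines.items()))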
5 votes, 1 answer

Copy selected items from AWS DynamoDB table to another table

I want to copy the data from one Amazon DynamoDB table to another Amazon DynamoDB table (in the same region). 1] I have a table called MUSIC which has 20 items. 2] I have another table MUSIC_ST (with the same schema as table MUSIC). Now I want to migrate…
Tedd • 53 • 1 • 4
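For a small table like the 20-item MUSIC table, a plain scan-and-batch-write copy is usually enough. A hedged sketch with boto3, using the table names from the question; add a FilterExpression to the scan to copy only selected items:

    import boto3

    dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
    source = dynamodb.Table("MUSIC")
    target = dynamodb.Table("MUSIC_ST")

    scan_kwargs = {}                          # add a FilterExpression here for selected items
    with target.batch_writer() as batch:
        while True:
            page = source.scan(**scan_kwargs)
            for item in page["Items"]:
                batch.put_item(Item=item)
            if "LastEvaluatedKey" not in page:
                break
            scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]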