Questions tagged [amazon-data-pipeline]

Simple service to transfer data between Amazon data storage services, kick off Elastic MapReduce jobs, and connect with outside data services.

From the AWS Data Pipeline homepage:

AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services as well as on-premises data sources at specified intervals. With AWS Data Pipeline, you can regularly access your data where it’s stored, transform and process it at scale, and efficiently transfer the results to AWS services such as Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon Elastic MapReduce (EMR).

AWS Data Pipeline helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available. You don’t have to worry about ensuring resource availability, managing inter-task dependencies, retrying transient failures or timeouts in individual tasks, or creating a failure notification system. AWS Data Pipeline also allows you to move and process data that was previously locked up in on-premises data silos.

470 questions
6 votes, 0 answers

How to upgrade Data Pipeline definition from EMR 3.x to 4.x/5.x?

I would like to upgrade my AWS data pipeline definition to EMR 4.x or 5.x, so I can take advantage of Hive's latest features (version 2.0+), such as CURRENT_DATE and CURRENT_TIMESTAMP, etc. The change from EMR 3.x to 4.x/5.x requires the use of…
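The usual change here is in the EmrCluster resource of the pipeline definition: EMR 3.x clusters are pinned with an amiVersion field, while 4.x/5.x clusters use releaseLabel instead (and list the applications to install). A minimal sketch via boto3, where the pipeline id, object ids, and release label are placeholders:

    import boto3

    dp = boto3.client("datapipeline", region_name="us-east-1")

    emr_cluster = {
        "id": "EmrClusterForHive",            # placeholder object id
        "name": "EmrClusterForHive",
        "fields": [
            {"key": "type", "stringValue": "EmrCluster"},
            # EMR 3.x style, to be removed:
            # {"key": "amiVersion", "stringValue": "3.9.0"},
            # EMR 4.x/5.x style:
            {"key": "releaseLabel", "stringValue": "emr-5.13.0"},
            {"key": "applications", "stringValue": "Hive"},
            {"key": "terminateAfter", "stringValue": "2 Hours"},
        ],
    }

    dp.put_pipeline_definition(
        pipelineId="df-EXAMPLE1234",          # placeholder pipeline id
        pipelineObjects=[emr_cluster],        # plus the rest of the existing definition
    )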
6 votes, 2 answers

Is it possible to dump an RDS database to S3 using AWS Data Pipeline?

Basically I want to pg_dump my RDS database to S3 using AWS Data Pipeline. I am not 100% sure if this is possible. I got to the stage where the SqlDataNode wants a selectQuery, at which point I am wondering what to do. Below is my template so…
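One commonly suggested workaround, since a SqlDataNode expects a selectQuery rather than a full dump, is to skip the data node and run pg_dump from a ShellCommandActivity on an EC2 resource. A hedged sketch of such an activity object; the host, database, bucket, and ids are placeholders, and it assumes pg_dump and the AWS CLI are available on the instance:

    shell_dump_activity = {
        "id": "DumpRdsToS3",                  # placeholder object id
        "name": "DumpRdsToS3",
        "fields": [
            {"key": "type", "stringValue": "ShellCommandActivity"},
            {"key": "runsOn", "refValue": "Ec2Instance"},   # an Ec2Resource defined elsewhere
            {"key": "command", "stringValue": (
                # credentials left out; a ~/.pgpass file or a pipeline parameter can supply them
                "pg_dump -h mydb.example.rds.amazonaws.com -U master mydb "
                "| gzip | aws s3 cp - s3://my-bucket/backups/mydb.sql.gz"
            )},
        ],
    }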
6 votes, 1 answer

AWS Data Pipeline: Tez fails on simple HiveActivity

I'm trying to run a simple AWS Data Pipeline for my POC. The case I have is the following: get data from a CSV stored on S3, perform a simple Hive query on it, and put the results back to S3. I've created a very basic pipeline definition and tried to run it…
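For reference, a minimal HiveActivity of the shape described (staged S3 input and output, one pass-through query) might look like the sketch below; whether it runs on Tez or MapReduce depends on the EMR release and hive-site settings. All ids and refs are placeholders:

    hive_activity = {
        "id": "SimpleHiveActivity",           # placeholder object id
        "name": "SimpleHiveActivity",
        "fields": [
            {"key": "type", "stringValue": "HiveActivity"},
            {"key": "runsOn", "refValue": "EmrClusterForHive"},   # EmrCluster defined elsewhere
            {"key": "input", "refValue": "S3CsvInput"},           # S3DataNode over the CSVs
            {"key": "output", "refValue": "S3ResultOutput"},      # S3DataNode for the results
            {"key": "stage", "stringValue": "true"},              # exposes ${input1}/${output1}
            {"key": "hiveScript", "stringValue":
                "INSERT OVERWRITE TABLE ${output1} SELECT * FROM ${input1};"},
        ],
    }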
6 votes, 1 answer

Can AWS Redshift drop a table that is wrapped in transaction?

During the ETL we do the following operations: begin transaction; drop table if exists target_tmp; create table target_tmp like target; insert into target_tmp select * from source_a inner join source_b on ...; analyze table…
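A hedged sketch of running the table swap described above as a single transaction from Python with psycopg2; the cluster endpoint, credentials, and join key are placeholders:

    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.example.redshift.amazonaws.com",  # placeholder endpoint
        port=5439, dbname="dev", user="etl", password="placeholder",
    )
    try:
        with conn:                            # commit on success, roll back on error
            with conn.cursor() as cur:
                cur.execute("DROP TABLE IF EXISTS target_tmp;")
                cur.execute("CREATE TABLE target_tmp (LIKE target);")
                cur.execute(
                    "INSERT INTO target_tmp "
                    "SELECT * FROM source_a "
                    "JOIN source_b ON source_a.id = source_b.id;"  # placeholder join key
                )
                cur.execute("ANALYZE target_tmp;")
    finally:
        conn.close()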
6 votes, 2 answers

How do I create a parameter in the console of AWS Data Pipeline

I'd like to define some parameters in the console of AWS DataPipeline, but am not able to do so. The parameters are going to be called in a SqlActivity, so when I try to refer to them in the in-line SQL script and save the pipeline, I'm getting…
simplycoding • 2,770 • 9 • 46 • 91
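For context, the console's Parameters section maps to parameter objects and values in the pipeline definition, and a SqlActivity script references them with the #{...} syntax. A hedged sketch via boto3 with placeholder names and ids:

    import boto3

    dp = boto3.client("datapipeline", region_name="us-east-1")

    parameter_objects = [{
        "id": "myTableName",                  # user-defined parameter ids must start with "my"
        "attributes": [
            {"key": "type", "stringValue": "String"},
            {"key": "description", "stringValue": "Table targeted by the SqlActivity"},
        ],
    }]
    parameter_values = [{"id": "myTableName", "stringValue": "events"}]   # placeholder value

    sql_activity = {
        "id": "MySqlActivity",
        "name": "MySqlActivity",
        "fields": [
            {"key": "type", "stringValue": "SqlActivity"},
            {"key": "database", "refValue": "MyRdsDatabase"},   # database object defined elsewhere
            {"key": "runsOn", "refValue": "Ec2Instance"},
            # the in-line script refers to the parameter with #{...}:
            {"key": "script", "stringValue": "DELETE FROM #{myTableName} WHERE ds = '2015-01-01';"},
        ],
    }

    dp.put_pipeline_definition(
        pipelineId="df-EXAMPLE1234",          # placeholder pipeline id
        pipelineObjects=[sql_activity],       # plus the rest of the definition
        parameterObjects=parameter_objects,
        parameterValues=parameter_values,
    )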
6 votes, 5 answers

Amazon Redshift: Copying Data Between Databases

I am looking to copy data between databases on Amazon Redshift. Before this, I was copying data from a Redshift database to a PostgreSQL database hosted on an EC2 instance for analytical purposes. I had a Ruby script that would do it using the dblink extension. But…
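One approach that comes up often when both databases live on Redshift is to UNLOAD from the source database to S3 and COPY into the target, rather than using dblink. A hedged sketch; endpoints, credentials, bucket, table, and IAM role are placeholders:

    import psycopg2

    UNLOAD_SQL = """
    UNLOAD ('SELECT * FROM public.events')
    TO 's3://my-bucket/stage/events_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyUnload'
    GZIP;
    """

    COPY_SQL = """
    COPY public.events
    FROM 's3://my-bucket/stage/events_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyUnload'
    GZIP;
    """

    def run(dsn, sql):
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(sql)

    run("host=source.example.redshift.amazonaws.com port=5439 dbname=db_a user=etl password=x", UNLOAD_SQL)
    run("host=target.example.redshift.amazonaws.com port=5439 dbname=db_b user=etl password=x", COPY_SQL)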
6 votes, 1 answer

S3 to Redshift input data format

I'm trying to run a simple s3 -> pipeline -> redshift chain, but I've gotten completely stuck on the input data format. Here's my file: 1,Toyota Park,Bridgeview,IL 2,Columbus Crew Stadium,Columbus,OH 3,RFK Stadium,Washington,DC 4,CommunityAmerica…
KorsaR • 536 • 1 • 10 • 26
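For comma-separated rows like these, the COPY statement that ultimately runs needs CSV-style options; in a RedshiftCopyActivity that is usually expressed through commandOptions. A hedged sketch of such an activity object, with placeholder ids and refs:

    redshift_copy = {
        "id": "S3ToRedshiftCopy",             # placeholder object id
        "name": "S3ToRedshiftCopy",
        "fields": [
            {"key": "type", "stringValue": "RedshiftCopyActivity"},
            {"key": "input", "refValue": "S3CsvInput"},          # S3DataNode over the file
            {"key": "output", "refValue": "RedshiftStadiums"},   # RedshiftDataNode for the table
            {"key": "runsOn", "refValue": "Ec2Instance"},
            {"key": "insertMode", "stringValue": "TRUNCATE"},
            # passed through to the Redshift COPY command:
            {"key": "commandOptions", "stringValue": "CSV"},
        ],
    }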
5 votes, 2 answers

AWSGlue AccessDeniedException, Status Code 400

I am trying to build a data pipeline for a data engineering project with the help of S3, Glue, Athena, etc. I am stuck when setting up the Glue crawler for indexing the data. Even though I set up the role according to the need, it's still giving me the…
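A frequent cause of this error is the crawler's role lacking S3 read access to the data path on top of the Glue service policy. A hedged sketch of attaching both via boto3; the role name and bucket are placeholders:

    import json
    import boto3

    iam = boto3.client("iam")

    s3_access = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my-datalake-bucket",        # placeholder bucket
                "arn:aws:s3:::my-datalake-bucket/*",
            ],
        }],
    }

    iam.put_role_policy(
        RoleName="MyGlueCrawlerRole",                     # placeholder role name
        PolicyName="GlueCrawlerS3Access",
        PolicyDocument=json.dumps(s3_access),
    )
    iam.attach_role_policy(
        RoleName="MyGlueCrawlerRole",
        PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    )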
5 votes, 2 answers

AWS Data Pipeline: Issue with permissions S3 Access for IAM role

I'm using the Load S3 data into RDS MySQL table template in AWS Data Pipeline to import CSVs from an S3 bucket into our RDS MySQL. However, I (as an IAM user with full admin rights) run into a warning I can't solve: Object:Ec2Instance - WARNING: Could…
5 votes, 5 answers

ETL pipeline in AWS with S3 as data lake: how to handle incremental updates

I have set up an ETL pipeline in AWS as follows: input_rawdata -> s3 -> lambda -> trigger spark etl script (via aws glue) -> output (s3, parquet files). My question is: assuming the above is the initial load of the data, how do I set it up to run incremental…
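One common way to make such a pipeline incremental is to enable Glue job bookmarks so each run of the Spark ETL script only processes S3 objects it has not seen before. A hedged sketch of creating the job with bookmarks enabled; the job name, role, and script location are placeholders:

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    glue.create_job(
        Name="rawdata-to-parquet",                                # placeholder job name
        Role="arn:aws:iam::123456789012:role/MyGlueJobRole",      # placeholder role
        Command={
            "Name": "glueetl",
            "ScriptLocation": "s3://my-bucket/scripts/etl.py",    # placeholder script path
        },
        DefaultArguments={
            # subsequent runs skip S3 objects already processed by this job
            "--job-bookmark-option": "job-bookmark-enable",
        },
    )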
5 votes, 1 answer

Exporting AWS Data Pipeline as CloudFormation template to use it in Terraform

I'm trying to export an existing AWS Data Pipeline task to Terraform infrastructure somehow. According to this issue, there is no direct support for Data Pipelines, but it still seems achievable using CloudFormation templates (terraform…
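A reasonable first step is to pull the live definition via the API so it can be translated into an AWS::DataPipeline::Pipeline CloudFormation resource, which Terraform can then wrap in an aws_cloudformation_stack. A hedged sketch with a placeholder pipeline id:

    import json
    import boto3

    dp = boto3.client("datapipeline", region_name="us-east-1")

    definition = dp.get_pipeline_definition(pipelineId="df-EXAMPLE1234")   # placeholder id
    print(json.dumps(
        {
            "pipelineObjects": definition["pipelineObjects"],
            "parameterObjects": definition.get("parameterObjects", []),
            "parameterValues": definition.get("parameterValues", []),
        },
        indent=2, default=str,
    ))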
5 votes, 1 answer

Does it make sense to use Google DataFlow/Apache Beam to parallelize image processing or crawling tasks?

I am considering Google DataFlow as an option for running a pipeline that involves steps like: Downloading images from the web; Processing images. I like that DataFlow manages the lifetime of VMs required to complete the job, so I don't need to…
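A sketch of how such a job could be expressed with Beam's Python SDK on the Dataflow runner is below; the project, bucket, and the body of process_image are placeholders for whatever per-image work is actually needed:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def download(url):
        import requests                      # imported inside the function so workers have it
        return url, requests.get(url, timeout=30).content

    def process_image(element):
        url, raw_bytes = element
        # placeholder: resize, run a model, extract features, etc.
        return url, len(raw_bytes)

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-gcp-project",            # placeholder project
        region="us-central1",
        temp_location="gs://my-bucket/tmp",  # placeholder bucket
    )

    with beam.Pipeline(options=options) as p:
        (p
         | "ReadUrls" >> beam.io.ReadFromText("gs://my-bucket/image_urls.txt")
         | "Download" >> beam.Map(download)
         | "Process" >> beam.Map(process_image)
         | "Write" >> beam.io.WriteToText("gs://my-bucket/results/part"))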
5 votes, 1 answer

Hadoop streaming job using MXNet failing in AWS EMR

I have set up an EMR step in AWS Data Pipeline. The step command looks like this:…
ishan3243 • 1,870 • 4 • 30 • 49
5 votes, 1 answer

Parallelization of sklearn Pipeline

I have a set of Pipelines and want to have a multi-threaded architecture. My typical Pipeline is shown below: huber_pipe = Pipeline([ ("DATA_CLEANER", DataCleaner()), ("DATA_ENCODING", Encoder(encoder_name='code')), ("SCALE",…
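The sequential steps inside one sklearn Pipeline cannot be parallelized against each other, but independent pipelines can be fitted in parallel with joblib. A hedged sketch; DataCleaner and Encoder are the asker's custom transformers, so stand-in steps and generated data are used to keep the example self-contained:

    from joblib import Parallel, delayed
    from sklearn.datasets import make_regression
    from sklearn.linear_model import HuberRegressor, Ridge
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_regression(n_samples=1000, n_features=10, random_state=0)

    pipelines = {
        "huber": Pipeline([("SCALE", StandardScaler()), ("MODEL", HuberRegressor())]),
        "ridge": Pipeline([("SCALE", StandardScaler()), ("MODEL", Ridge())]),
    }

    def fit_one(name, pipe):
        return name, pipe.fit(X, y)

    # one worker per pipeline; use prefer="threads" if the estimators release the GIL
    fitted = dict(Parallel(n_jobs=-1)(delayed(fit_one)(n, p) for n, p in pipelines.items()))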
5 votes, 1 answer

Copy selected items from AWS DynamoDB table to another table

I want to copy the data from one Amazon DynamoDB table to another Amazon DynamoDB table (in the same region). 1] I have a table called MUSIC which has 20 items. 2] I have another table MUSIC_ST (with the same schema as table MUSIC). Now I want to migrate…
Tedd • 53 • 1 • 4
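For a small table like the 20-item MUSIC table, a plain scan-and-batch-write copy is usually enough. A hedged sketch with boto3, using the table names from the question; add a FilterExpression to the scan to copy only selected items:

    import boto3

    dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
    source = dynamodb.Table("MUSIC")
    target = dynamodb.Table("MUSIC_ST")

    scan_kwargs = {}                          # add a FilterExpression here for selected items
    with target.batch_writer() as batch:
        while True:
            page = source.scan(**scan_kwargs)
            for item in page["Items"]:
                batch.put_item(Item=item)
            if "LastEvaluatedKey" not in page:
                break
            scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]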