Questions tagged [amazon-data-pipeline]

Simple service to transfer data between Amazon data storage services, kick off Elastic MapReduce jobs, and connect with outside data services.

From the AWS Data Pipeline homepage:

AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services as well as on-premise data sources at specified intervals. With AWS Data Pipeline, you can regularly access your data where it’s stored, transform and process it at scale, and efficiently transfer the results to AWS services such as Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon Elastic MapReduce (EMR).

AWS Data Pipeline helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available. You don’t have to worry about ensuring resource availability, managing inter-task dependencies, retrying transient failures or timeouts in individual tasks, or creating a failure notification system. AWS Data Pipeline also allows you to move and process data that was previously locked up in on-premise data silos.
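For orientation, here is a minimal sketch of driving the service from Python with boto3. The pipeline name, uniqueId, and IAM role names are placeholder assumptions, not values from any question below; a real definition would add data nodes, activities, and resources to the object list.

```python
import boto3

# Minimal sketch: create, define, and activate a pipeline with boto3.
# Names and IAM roles below are placeholders for illustration only.
dp = boto3.client("datapipeline", region_name="us-east-1")

pipeline_id = dp.create_pipeline(name="example-pipeline",
                                 uniqueId="example-pipeline-001")["pipelineId"]

# A bare-bones definition: just the Default object; real pipelines add
# data nodes, activities, and resources as further objects in this list.
dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[{
        "id": "Default",
        "name": "Default",
        "fields": [
            {"key": "scheduleType", "stringValue": "ONDEMAND"},
            {"key": "failureAndRerunMode", "stringValue": "CASCADE"},
            {"key": "role", "stringValue": "DataPipelineDefaultRole"},
            {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
        ],
    }],
)

dp.activate_pipeline(pipelineId=pipeline_id)
```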

470 questions
5 votes, 2 answers

Copying DynamoDB table to another DynamoDB table with transforms

I have two DynamoDB tables: Table_1 and Table_2. I am trying to deprecate Table_1 and copy information into Table_2 from Table_1, which has different GSIs and different LSIs. Table_1 attributes are: Id, state, isReused, empty, normal Table_2…
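Outside Data Pipeline, a copy like the one in the question above can also be scripted directly. Below is a minimal boto3 sketch (table names and the transform are placeholders) that scans the source table and rewrites each item into the target.

```python
import boto3

# Sketch: copy every item from one DynamoDB table to another, applying a
# per-item transform along the way. Table names and the transform body are
# placeholders for illustration only.
dynamodb = boto3.resource("dynamodb")
source = dynamodb.Table("Table_1")
target = dynamodb.Table("Table_2")

def transform(item):
    # Adjust attributes here so the item fits Table_2's keys/GSIs/LSIs.
    return item

scan_kwargs = {}
with target.batch_writer() as batch:
    while True:
        page = source.scan(**scan_kwargs)
        for item in page["Items"]:
            batch.put_item(Item=transform(item))
        if "LastEvaluatedKey" not in page:
            break
        scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```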
5 votes, 1 answer

Amazon Data Pipeline "Load S3 Data to RDS MySQL" query format?

I was wondering what the SQL query format for inserting data from a CSV into MySQL would be. The template it gives is, "INSERT INTO tablename (col1, col2, col3) VALUES (?,?,?);" Because the values are dynamic and different in each CSV file,…
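The key idea behind that template is that the ? placeholders bind to CSV columns positionally, so the same query works for every row. A small standalone illustration of that positional binding follows; sqlite3 is used only because it shares the ? placeholder style, and the table and file names are made up.

```python
import csv
import sqlite3

# Illustration of positional "?" binding: each CSV row supplies the values
# for (col1, col2, col3) in order. sqlite3 is used only because it shares
# the "?" placeholder style; the real template targets MySQL on RDS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (col1 TEXT, col2 TEXT, col3 TEXT)")

# Assumes data.csv has exactly three columns per row.
with open("data.csv", newline="") as f:
    rows = list(csv.reader(f))

conn.executemany("INSERT INTO tablename (col1, col2, col3) VALUES (?,?,?)", rows)
conn.commit()
```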
5 votes, 1 answer

Amazon EMR job with multiple input parameters

In Amazon Data Pipeline, I am creating an activity to copy from S3 to EMR using Hive. To achieve this I have to pass two input parameters into the EMR job as a step. I have searched almost every piece of Data Pipeline documentation but did not find a way to specify…
Irfan.gwb • 668 • 2 • 13 • 35
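One commonly cited approach for the question above (not confirmed by its answers) is to pass Hive script parameters as -d arguments inside the EmrActivity step string, which is comma-separated. A rough sketch of such a pipeline object as a Python dict; every path, id, and parameter name here is a placeholder assumption.

```python
# Rough sketch of an EmrActivity definition that passes two parameters to a
# Hive script via "-d" arguments in the comma-separated step string.
# All bucket names, script paths, and ids are placeholders/assumptions.
emr_activity = {
    "id": "HiveCopyStep",
    "name": "HiveCopyStep",
    "fields": [
        {"key": "type", "stringValue": "EmrActivity"},
        {"key": "runsOn", "refValue": "EmrClusterResource"},
        {"key": "step", "stringValue": (
            "s3://elasticmapreduce/libs/script-runner/script-runner.jar,"
            "s3://elasticmapreduce/libs/hive/hive-script,--run-hive-script,"
            "--args,-f,s3://my-bucket/scripts/copy.q,"
            "-d,INPUT=s3://my-bucket/input/,"
            "-d,OUTPUT=s3://my-bucket/output/"
        )},
    ],
}
```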
5 votes, 2 answers

AWS Data Pipeline datetime variable

I am using AWS Data Pipeline to save a text file to my S3 bucket from RDS. I would like the file name to include the date and the hour, like: myfile-YYYYMMDD-HH.txt myfile-20140813-12.txt I have specified my S3DataNode FilePath…
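The usual answer to this kind of question is Data Pipeline's expression language, where #{format(@scheduledStartTime, ...)} can be embedded in the S3 path. A sketch of such an S3DataNode definition as a Python dict; the bucket name and node id are placeholders.

```python
# Sketch of an S3DataNode whose filePath embeds the scheduled start time.
# The bucket name and node id are placeholders; #{format(...)} is Data
# Pipeline's expression syntax, evaluated at run time.
s3_output_node = {
    "id": "S3OutputNode",
    "name": "S3OutputNode",
    "fields": [
        {"key": "type", "stringValue": "S3DataNode"},
        {"key": "filePath", "stringValue":
            "s3://my-bucket/exports/myfile-#{format(@scheduledStartTime, 'YYYYMMdd-HH')}.txt"},
    ],
}
```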
5 votes, 1 answer

How to use scriptVariables in hive (AWS Data Pipeline)

We can pass script variables into an AWS Data Pipeline HiveActivity using the following construct: "scriptVariable" : [ "param1=value1", "param2=value2" ] How do we access these variables in the Hive script? I have been trying to use them…
Santanu C • 1,362 • 3 • 20 • 38
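In Hive itself, variables passed this way are typically referenced with ${...} substitution. A hedged sketch for the question above, pairing a HiveActivity's scriptVariable entries with a script that uses them; ids, table names, and values are placeholders.

```python
# Sketch: scriptVariable entries become ${param1} / ${param2} substitutions
# inside the Hive script. Ids, table names, and values are placeholders;
# ${input1}/${output1} refer to the activity's staged data nodes.
hive_activity = {
    "id": "HiveActivity",
    "name": "HiveActivity",
    "fields": [
        {"key": "type", "stringValue": "HiveActivity"},
        {"key": "scriptVariable", "stringValue": "param1=value1"},
        {"key": "scriptVariable", "stringValue": "param2=value2"},
        {"key": "hiveScript", "stringValue": (
            "INSERT OVERWRITE TABLE ${output1} "
            "SELECT * FROM ${input1} "
            "WHERE col1 = '${param1}' AND col2 = '${param2}';"
        )},
    ],
}
```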
5 votes, 2 answers

AWS Copy S3 to RDS

I am trying to copy from S3 (.csv file) to RDS (MySQL) using Amazon Data Pipeline, and my error is: Error copying record Cause: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure The last packet sent successfully to…
4 votes, 2 answers

AWS Data Pipeline not showing EC2 instance role

I am trying to get data from S3 to DynamoDB using AWS Data Pipeline. The issue I am facing is that my Data Pipeline isn't showing the EC2 instance role even though I have created one in IAM. I have created default roles for Pipeline and…
4 votes, 0 answers

HIVE_CURSOR_ERROR: Unexpected end of input stream

I'm moving data from MySQL to S3 using Data Pipeline, and it creates an empty file for a couple of days. I believe this is making my Athena query fail with "HIVE_CURSOR_ERROR: Unexpected end of input stream". Below is my script CREATE EXTERNAL…
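A common workaround for empty exports tripping up Athena (not necessarily the fix this question settled on) is to drop zero-byte objects from the query location before running the query. A boto3 sketch with a placeholder bucket and prefix:

```python
import boto3

# Sketch: remove zero-byte objects under an S3 prefix so Athena does not
# choke on empty export files. Bucket and prefix are placeholders.
s3 = boto3.client("s3")
bucket, prefix = "my-export-bucket", "mysql-export/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        if obj["Size"] == 0:
            s3.delete_object(Bucket=bucket, Key=obj["Key"])
```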
4 votes, 2 answers

Bulk add TTL column to DynamoDB table

I have a use case where I need to add a TTL column to an existing table. Currently, this table has more than 2 billion records. Is there any existing solution built around this? Or is EMR the path forward?
Vivek Goel • 22,942 • 29 • 114 • 186
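At 2 billion items a parallel EMR/Glue job is the usual route, but the mechanics of the backfill itself look like this hedged boto3 sketch; the table name, key attribute, TTL attribute name, and retention period are all placeholders.

```python
import time
import boto3

# Sketch: backfill a "ttl" attribute (epoch seconds) on every existing item.
# Table name, key attribute, and the 90-day retention are placeholders; at
# ~2 billion items this should really run as a parallel EMR/Glue job.
table = boto3.resource("dynamodb").Table("my-table")
expiry = int(time.time()) + 90 * 24 * 3600

scan_kwargs = {"ProjectionExpression": "Id"}
while True:
    page = table.scan(**scan_kwargs)
    for item in page["Items"]:
        table.update_item(
            Key={"Id": item["Id"]},
            UpdateExpression="SET #ttl = :t",
            ExpressionAttributeNames={"#ttl": "ttl"},
            ExpressionAttributeValues={":t": expiry},
        )
    if "LastEvaluatedKey" not in page:
        break
    scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```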
4 votes, 4 answers

What's the best way to run a python script daily?

I have a python script that connects to Redshift, executes a series of SQL commands, and generates a new derived table. But for the life of me, I can't figure out a way to have it automatically run every day. I've tried AWS Data Pipeline but my…
ScottieB • 3,958 • 6 • 42 • 60
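One alternative to Data Pipeline for "run this daily" is a CloudWatch Events/EventBridge schedule pointing at a Lambda (or other target) that runs the script. A hedged boto3 sketch; the rule name, schedule, and Lambda ARN are placeholders, and the Lambda would also need permission to be invoked by the rule.

```python
import boto3

# Sketch: create a daily schedule rule and point it at an existing Lambda
# function that runs the Redshift script. Rule name, schedule, and ARN are
# placeholders; the Lambda also needs an invoke permission for this rule.
events = boto3.client("events")

events.put_rule(
    Name="daily-redshift-refresh",
    ScheduleExpression="cron(0 6 * * ? *)",  # 06:00 UTC every day
    State="ENABLED",
)

events.put_targets(
    Rule="daily-redshift-refresh",
    Targets=[{
        "Id": "redshift-refresh-lambda",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:refresh-derived-table",
    }],
)
```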
4 votes, 0 answers

Output an AWS Data Pipeline TableBackupActivity to multiple S3 locations?

I have set up an AWS Data Pipeline that exports DynamoDB data into an S3DataNode, using the DynamoDB->Export menu option that sets up the basic pipeline template. I run that once a day, and it outputs into an S3 folder like "TableName/DATE/". I set that…
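If the pipeline itself can't write to two S3DataNodes, a post-export copy step is a simple fallback. Here is a hedged boto3 sketch that mirrors one day's export prefix to a second location; the bucket names, prefix, and date value are placeholders.

```python
import boto3

# Sketch: after the daily export lands in "TableName/DATE/", mirror those
# objects to a second bucket. Bucket names, prefix, and date are placeholders.
s3 = boto3.resource("s3")
src_bucket, dst_bucket = "backup-bucket", "secondary-bucket"
prefix = "TableName/2024-01-01/"

for obj in s3.Bucket(src_bucket).objects.filter(Prefix=prefix):
    s3.Object(dst_bucket, obj.key).copy({"Bucket": src_bucket, "Key": obj.key})
```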
4 votes, 1 answer

Creating column headers in CSV/TSV files using AWS Data Pipeline?

I'm creating CSV & TSV files using AWS Data Pipeline. The files are created just fine, but I can't figure out how to create files with column headers. At first, I expected the headers to be generated automatically based on the SQL query I'm running to…
T. Brian Jones • 13,002 • 25 • 78 • 117
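A common workaround (not necessarily the one this question settled on) is to prepend a header line to the exported object after the pipeline finishes. A boto3 sketch with a placeholder bucket, key, and column names:

```python
import boto3

# Sketch: download the pipeline's CSV output, prepend a header row, and
# write it back. Bucket, key, and column names are placeholders.
s3 = boto3.client("s3")
bucket, key = "my-export-bucket", "exports/report.csv"
header = "col1,col2,col3\n"

body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
s3.put_object(Bucket=bucket, Key=key, Body=(header + body).encode("utf-8"))
```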
4 votes, 1 answer

AWS Data Pipeline ServiceAccessSecurityGroup

When I try to create an EmrCluster resource with these properties: Emr Managed Master Security Group Id, Emr Managed Slave Security Group Id, I get this error: Terminated with errors. You must also specify a ServiceAccessSecurityGroup if you use…
bbenjii123 • 41 • 1 • 6
4 votes, 1 answer

AWS Data Pipeline - How to set global pipeline variable from ShellCommandActivity

I am trying to augment my pipeline (which migrates data from RDS to Redshift) so that it selects all rows whose id is greater than the maximum id that exists in Redshift. I have a script in Python that calculates this value and returns it to the output. I…
user2694306 • 3,832 • 10 • 47 • 95
4 votes, 1 answer

Can I have a data-pipeline as a part of my cloud-formation template?

My app has an S3 bucket with daily feeds, 2 DynamoDB tables that store this data, an ELB application that exposes a JSON API to that data, and a data pipeline flow that processes the incoming data and uploads it into the tables. My CloudFormation…
Zach Moshe • 2,782 • 4 • 24 • 40
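CloudFormation does have an AWS::DataPipeline::Pipeline resource type, so the pipeline can live in the same template as the bucket, tables, and ELB. A hedged sketch of such a template fragment, built as a Python dict; every logical id, name, and field value is a placeholder, and a real pipeline would add activities, data nodes, and resources (plus any required parameter objects) to the definition.

```python
import json

# Sketch of a CloudFormation template fragment declaring a pipeline resource.
# Logical ids, names, and field values are placeholders; a real pipeline adds
# activities, data nodes, and resources to PipelineObjects.
template = {
    "Resources": {
        "DailyFeedPipeline": {
            "Type": "AWS::DataPipeline::Pipeline",
            "Properties": {
                "Name": "daily-feed-pipeline",
                "Activate": True,
                "PipelineObjects": [{
                    "Id": "Default",
                    "Name": "Default",
                    "Fields": [
                        {"Key": "scheduleType", "StringValue": "ONDEMAND"},
                        {"Key": "role", "StringValue": "DataPipelineDefaultRole"},
                        {"Key": "resourceRole", "StringValue": "DataPipelineDefaultResourceRole"},
                    ],
                }],
            },
        }
    }
}

print(json.dumps(template, indent=2))
```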