Questions tagged [amazon-data-pipeline]

Simple service to transfer data between Amazon data storage services, kick off Elastic MapReduce jobs, and connect with outside data services.

From the AWS Data Pipeline homepage:

AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premises data sources, at specified intervals. With AWS Data Pipeline, you can regularly access your data where it’s stored, transform and process it at scale, and efficiently transfer the results to AWS services such as Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon Elastic MapReduce (EMR).

AWS Data Pipeline helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available. You don’t have to worry about ensuring resource availability, managing inter-task dependencies, retrying transient failures or timeouts in individual tasks, or creating a failure notification system. AWS Data Pipeline also allows you to move and process data that was previously locked up in on-premises data silos.

470 questions
3
votes
1 answer

Data Pipeline dump from DynamoDB to S3 fails every time

I followed the instructions to set up dumps for DynamoDB: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-importexport-ddb-part2.html The Data Pipeline setup was fine, but after executing the task I get the same error every time. I researched this…
Vladimir Gilevich
  • 861
  • 1
  • 10
  • 17
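The excerpt cuts off before the actual error, but with a repeatedly failing export the usual first step is to pull the failure reason off the failed instances. A minimal boto3 sketch, assuming a hypothetical pipeline ID:

```python
import boto3

client = boto3.client("datapipeline")
PIPELINE_ID = "df-EXAMPLE123"  # hypothetical pipeline ID

# List object instances from recent runs, then inspect their status fields.
ids = client.query_objects(pipelineId=PIPELINE_ID, sphere="INSTANCE")["ids"]
if ids:
    for obj in client.describe_objects(pipelineId=PIPELINE_ID,
                                       objectIds=ids)["pipelineObjects"]:
        fields = {f["key"]: f.get("stringValue", "") for f in obj["fields"]}
        if fields.get("@status") == "FAILED":
            print(obj["id"], fields.get("@failureReason", "(no reason recorded)"))
```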
3
votes
2 answers

Using Amazon's Data Pipeline to back up an S3 bucket -- how to skip existing files and avoid unnecessary overwriting?

I'm using Amazon's Data Pipeline to copy an S3 bucket to another bucket. It's a pretty straightforward setup, and runs nightly. However, every subsequent run copies the same files over and over; I'd rather it just skip existing files and copy only…
trevorhinesley
  • 845
  • 1
  • 10
  • 36
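The built-in CopyActivity recopies everything, so an incremental copy has to diff the buckets itself. A minimal boto3 sketch (bucket names are placeholders) that copies only keys missing from the destination:

```python
import boto3

s3 = boto3.client("s3")
SRC, DST = "source-bucket", "backup-bucket"  # hypothetical bucket names

# Collect keys already present in the backup bucket.
existing = set()
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=DST):
    existing.update(o["Key"] for o in page.get("Contents", []))

# Copy only the keys that are missing from the destination.
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=SRC):
    for obj in page.get("Contents", []):
        if obj["Key"] not in existing:
            s3.copy_object(Bucket=DST, Key=obj["Key"],
                           CopySource={"Bucket": SRC, "Key": obj["Key"]})
```

Comparing ETag or LastModified as well would also catch changed files, not just new ones.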
3
votes
3 answers

AWS Data Pipeline - Error when trying to re-run a failed activity

My data pipeline has many activities (ShellCommandActivity), one of which failed due to a programming issue. However, when I try to re-run the failed activity after fixing the issue… The failure and rerun mode is cascade, and the schedule type is…
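For reference, failed activity instances can be marked for re-execution through the SetStatus API; with cascade failure mode, dependents in CASCADE_FAILED should be picked up as well. A sketch with hypothetical IDs:

```python
import boto3

client = boto3.client("datapipeline")
PIPELINE_ID = "df-EXAMPLE123"  # hypothetical
# Instance IDs as shown in the console's instance view (placeholder value).
FAILED_OBJECT_IDS = ["@MyShellCmd_2017-01-01T00:00:00"]

# RERUN re-executes the named instances once their dependencies are satisfied.
client.set_status(pipelineId=PIPELINE_ID,
                  objectIds=FAILED_OBJECT_IDS,
                  status="RERUN")
```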
3
votes
2 answers

How to restart an AWS Data Pipeline

I have a scheduled AWS Data Pipeline that failed partway through its execution. I fixed the problem without modifying the Pipeline in any way (changed a script in S3). However, there seems to be no good way to restart the Pipeline from the…
Simon Lepkin
  • 1,021
  • 1
  • 13
  • 25
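One hedged option when rerunning individual instances isn't enough: deactivate the pipeline and reactivate it from a chosen point in the schedule. A boto3 sketch (pipeline ID and timestamp are placeholders):

```python
from datetime import datetime, timezone

import boto3

client = boto3.client("datapipeline")
PIPELINE_ID = "df-EXAMPLE123"  # hypothetical

# Cancel anything still running, then resume from a chosen schedule point.
client.deactivate_pipeline(pipelineId=PIPELINE_ID, cancelActive=True)
client.activate_pipeline(
    pipelineId=PIPELINE_ID,
    startTimestamp=datetime(2017, 1, 1, tzinfo=timezone.utc),  # hypothetical resume point
)
```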
3
votes
1 answer

Flattening a JSON file while transferring from S3 to Redshift using AWS Data Pipeline

I have a JSON file on S3 that I want to transfer to Redshift. One catch is that the file contains entries in the following format: { "user_id":1, "metadata": { "connection_type":"WIFI", "device_id":"1234" …
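Nested fields like these can usually be flattened at load time with a JSONPaths file rather than a separate transform step. A sketch, with all S3 paths, table, and role names hypothetical:

```python
# JSONPaths file to upload to S3 (e.g. s3://my-bucket/flatten.jsonpaths -- hypothetical path).
JSONPATHS = """{
  "jsonpaths": [
    "$.user_id",
    "$.metadata.connection_type",
    "$.metadata.device_id"
  ]
}"""

# COPY statement to run against Redshift (e.g. via psycopg2); the column list
# must line up positionally with the jsonpaths entries.
COPY_SQL = """
COPY my_table (user_id, connection_type, device_id)
FROM 's3://my-bucket/input.json'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftCopyRole'
JSON 's3://my-bucket/flatten.jsonpaths';
"""
```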
3
votes
1 answer

Data Pipeline: Use only the first 4 values from a CSV in the pipeline

I have a CSV with a variable structure, and I only want to take the first 4 values from it. The CSV stored in S3 has between 7 and 8 fields, and I would like to take just the first 4. I have attempted to use the following prepared…
dojogeorge
  • 1,674
  • 3
  • 25
  • 35
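One common pattern is to trim the rows in a staging step (for example a ShellCommandActivity) before the copy runs. A minimal Python sketch with placeholder file names:

```python
import csv

# Trim each row to its first four fields ahead of the load step.
with open("input.csv", newline="") as src, \
     open("trimmed.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        writer.writerow(row[:4])  # keep only the first 4 values
```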
3
votes
1 answer

Importing data from S3 to DynamoDB

I am trying to import a JSON file, which has been uploaded to S3, into DynamoDB. I followed the tutorial Amazon has given: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-importexport-ddb-console-start.html But when I try to…
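Worth noting: the import template expects Data Pipeline's own export format, so an arbitrary JSON file generally won't load as-is. A hedged boto3 alternative, assuming one JSON object per line and placeholder bucket/table names:

```python
import json
from decimal import Decimal

import boto3

s3 = boto3.resource("s3")
table = boto3.resource("dynamodb").Table("MyTable")  # hypothetical table name

# parse_float=Decimal because the DynamoDB resource rejects Python floats.
body = s3.Object("my-bucket", "data/items.json").get()["Body"]
with table.batch_writer() as batch:
    for line in body.iter_lines():
        batch.put_item(Item=json.loads(line, parse_float=Decimal))
```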
3
votes
3 answers

Insert blanks as NULL into MySQL

I'm building an AWS pipeline to insert CSV files from S3 into an RDS MySQL DB. The problem I'm facing is that when it attempts to load the file, it treats blanks as empty strings instead of NULLs. For example, line 1 of the CSV…
rodrigocf
  • 1,951
  • 13
  • 39
  • 62
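If the load step itself can't be coerced, converting empty strings to None before the insert sidesteps the issue. A sketch using pymysql (an assumed driver; all connection, table, and column names are placeholders):

```python
import csv

import pymysql  # assumed MySQL driver; any DB-API client behaves the same

conn = pymysql.connect(host="my-rds-host", user="user",
                       password="secret", database="mydb")
cur = conn.cursor()
with open("input.csv", newline="") as f:
    for row in csv.reader(f):
        values = [v if v != "" else None for v in row]  # blank field -> SQL NULL
        cur.execute("INSERT INTO my_table (a, b, c) VALUES (%s, %s, %s)", values)
conn.commit()
```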
3
votes
3 answers

Need strategy advice for migrating large tables from RDS to DynamoDB

We have a couple of huge MySQL tables in RDS (over 700 GB) that we'd like to migrate to a DynamoDB table. Can you suggest a strategy, or a direction, to do this in a clean, parallelized way? Perhaps using EMR or AWS Data Pipeline.
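One hedged direction: shard the table by primary-key range and run one copier per range in parallel (on EMR, Data Pipeline, or plain EC2). A per-worker sketch with placeholder names:

```python
import boto3
import pymysql  # assumed driver

# Streaming cursor so 700 GB is never pulled into memory at once.
conn = pymysql.connect(host="my-rds-host", user="user", password="secret",
                       database="mydb",
                       cursorclass=pymysql.cursors.SSDictCursor)
table = boto3.resource("dynamodb").Table("MyTable")  # hypothetical

cur = conn.cursor()
# Each worker gets a disjoint key range, so many copies run side by side.
cur.execute("SELECT * FROM big_table WHERE id BETWEEN %s AND %s", (0, 1000000))
with table.batch_writer() as batch:
    for row in cur:
        # Drop NULLs; float columns may also need converting to Decimal.
        batch.put_item(Item={k: v for k, v in row.items() if v is not None})
```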
3
votes
2 answers

AWS Data Pipeline Redshift "delimiter not found" error

I'm working on a data pipeline. In one of the steps, a CSV from S3 is consumed by a Redshift DataNode. My Redshift table has 78 columns, checked with: SELECT COUNT(*) FROM information_schema.columns WHERE table_name = 'my_table'; After the failed…
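After a failed COPY, stl_load_errors usually pinpoints the offending line and column, which tends to reveal an embedded delimiter, stray quote, or short row. A sketch using psycopg2 (an assumed driver; connection details are placeholders):

```python
import psycopg2  # assumed Postgres driver for Redshift

conn = psycopg2.connect(host="my-cluster.redshift.amazonaws.com",  # placeholder
                        port=5439, dbname="mydb", user="user", password="secret")
cur = conn.cursor()
# Show the most recent load failures with the raw line Redshift choked on.
cur.execute("""
    SELECT starttime, filename, line_number, colname, err_reason, raw_line
    FROM stl_load_errors
    ORDER BY starttime DESC
    LIMIT 5;
""")
for row in cur.fetchall():
    print(row)
```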
3
votes
1 answer

Data Pipeline S3 logs not written (only written if using Amazon Linux)

With the exact same Data Pipeline configuration, differing only in the AMI used (Amazon Linux vs. Ubuntu), my Data Pipeline execution succeeds in both cases, but it only writes logs to S3 when using Amazon Linux. With Amazon Linux With…
deprecated
  • 5,142
  • 3
  • 41
  • 62
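For comparison, the log destination itself is set by pipelineLogUri on the Default object, so confirming it is present is a cheap first check when switching AMIs. A sketch of that object (shown in isolation; a real PutPipelineDefinition call must include the full object set, and all IDs and paths are placeholders):

```python
import boto3

client = boto3.client("datapipeline")

# Default object carrying the S3 log destination for Task Runner output.
default_obj = {
    "id": "Default",
    "name": "Default",
    "fields": [
        {"key": "pipelineLogUri", "stringValue": "s3://my-bucket/logs/"},
        {"key": "scheduleType", "stringValue": "cron"},
    ],
}
# NOTE: put_pipeline_definition replaces the whole definition; include every
# pipeline object in a real call, not just Default.
client.put_pipeline_definition(pipelineId="df-EXAMPLE123",
                               pipelineObjects=[default_obj])
```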
3
votes
2 answers

How can I specify an EBS volume when adding an EC2 resource to AWS Data Pipeline?

When I try to create an EC2 resource with AWS Data Pipeline, I don't see an option for defining the EBS volume that will be associated with that compute instance. Is it possible to set the volume size? If so, can someone give me an example?
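As far as I can tell, Ec2Resource exposes no EBS-size field directly; the usual workaround is a custom AMI whose root volume is already the size you need, referenced via imageId. A sketch of the object definition (AMI ID and names are placeholders):

```python
# Ec2Resource pipeline object; the custom AMI supplies the larger root volume.
ec2_resource = {
    "id": "MyEC2Resource",
    "name": "MyEC2Resource",
    "fields": [
        {"key": "type", "stringValue": "Ec2Resource"},
        {"key": "instanceType", "stringValue": "m3.xlarge"},
        {"key": "imageId", "stringValue": "ami-0123456789abcdef0"},  # hypothetical custom AMI
        {"key": "terminateAfter", "stringValue": "2 Hours"},
    ],
}
```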
3
votes
1 answer

How to set instance role for EMR clusters launched via data pipeline?

I'm trying to attach an instance role to a cluster I'm running through Data Pipeline. I'd like to run my own mapper script that needs write permissions to DynamoDB (the "regular" Hive upload won't do the trick for me). I've gone through the API docs…
Zach Moshe
  • 2,782
  • 4
  • 24
  • 40
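For reference, EmrCluster takes both role (the Data Pipeline service role) and resourceRole (the instance profile the cluster nodes assume), so DynamoDB write permissions belong on the latter. A sketch with placeholder role names:

```python
# EmrCluster pipeline object; resourceRole is the instance profile whose
# policy needs the DynamoDB write permissions.
emr_cluster = {
    "id": "MyEmrCluster",
    "name": "MyEmrCluster",
    "fields": [
        {"key": "type", "stringValue": "EmrCluster"},
        {"key": "role", "stringValue": "DataPipelineDefaultRole"},
        {"key": "resourceRole", "stringValue": "MyDynamoWriterInstanceProfile"},
        {"key": "coreInstanceCount", "stringValue": "2"},
    ],
}
```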
3
votes
4 answers

Do I need to set up a backup data pipeline for AWS DynamoDB on a daily basis?

I am considering using AWS DynamoDB for an application we are building. I understand that setting up a backup job that exports data from DynamoDB to S3 involves a data pipeline with EMR. But my question is: do I need to worry about having a backup job…
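If the EMR-based pipeline feels heavyweight, a hedged alternative is DynamoDB's on-demand backup API driven by cron or a scheduled Lambda. A minimal sketch (table and backup names are placeholders):

```python
from datetime import datetime, timezone

import boto3

# Daily on-demand backup, named by date so runs don't collide.
dynamodb = boto3.client("dynamodb")
dynamodb.create_backup(
    TableName="MyTable",  # hypothetical table
    BackupName="MyTable-%s" % datetime.now(timezone.utc).strftime("%Y-%m-%d"),
)
```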
3
votes
1 answer

Using a custom AMI (with s3cmd) in a Data Pipeline

How can I install s3cmd on an AMI that is used in the pipeline? This should be a fairly basic thing to do, but I can't seem to get it done. Here's what I've tried: started a pipeline without the image-id option => everything works fine; navigated to…
Biffy
  • 871
  • 2
  • 10
  • 21
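A hedged alternative to baking s3cmd into the AMI: install it at runtime from the activity itself. A sketch of a ShellCommandActivity object (the runsOn reference is a placeholder):

```python
# ShellCommandActivity that installs s3cmd before using it, avoiding the
# custom-AMI step entirely.
install_and_run = {
    "id": "InstallS3cmd",
    "name": "InstallS3cmd",
    "fields": [
        {"key": "type", "stringValue": "ShellCommandActivity"},
        {"key": "runsOn", "refValue": "MyEC2Resource"},  # hypothetical resource
        {"key": "command",
         "stringValue": "sudo pip install s3cmd && s3cmd --version"},
    ],
}
```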