Questions tagged [amazon-data-pipeline]

Simple service to transfer data between Amazon data storage services, kick off Elastic MapReduce jobs, and connect with outside data services.

From the AWS Data Pipeline homepage:

AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services as well as on-premise data sources at specified intervals. With AWS Data Pipeline, you can regularly access your data where it’s stored, transform and process it at scale, and efficiently transfer the results to AWS services such as Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon Elastic MapReduce (EMR).

AWS Data Pipeline helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available. You don’t have to worry about ensuring resource availability, managing inter-task dependencies, retrying transient failures or timeouts in individual tasks, or creating a failure notification system. AWS Data Pipeline also allows you to move and process data that was previously locked up in on-premise data silos.

470 questions
4 votes, 2 answers

AWS Data Pipeline Support for SQL Server RDS

I am trying to find documentation regarding the supported data sources for AWS Data Pipeline. What I need to do is export SQL Server RDS data to S3. I am finding plenty of documentation saying that Data Pipeline can use RDS as a source, but every…
Brian Amersi • 111 • 1 • 6
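
For the question above, documentation on engine support is indeed thin. The approach that usually gets suggested is to define the SQL Server instance as a generic JdbcDatabase rather than an RdsDatabase and copy from a SqlDataNode to an S3DataNode. Below is a rough boto3 sketch of that shape; the object types and field names come from the Data Pipeline object reference, but the connection string, driver class, roles, bucket, and credentials are placeholders, and a real pipeline will likely also need a dataFormat on the S3 node and possibly a jdbcDriverJarUri for the SQL Server driver.

    import boto3

    def f(key, value, ref=False):
        # One pipeline-object field in the put_pipeline_definition key/value format.
        return {'key': key, 'refValue': value} if ref else {'key': key, 'stringValue': value}

    dp = boto3.client('datapipeline', region_name='us-east-1')
    pipeline_id = dp.create_pipeline(name='sqlserver-to-s3',
                                     uniqueId='sqlserver-to-s3-demo')['pipelineId']

    objects = [
        {'id': 'Default', 'name': 'Default', 'fields': [
            f('scheduleType', 'ondemand'),
            f('failureAndRerunMode', 'CASCADE'),
            f('role', 'DataPipelineDefaultRole'),
            f('resourceRole', 'DataPipelineDefaultResourceRole'),
            f('pipelineLogUri', 's3://my-log-bucket/datapipeline/')]},
        {'id': 'SourceDb', 'name': 'SourceDb', 'fields': [
            f('type', 'JdbcDatabase'),
            f('connectionString', 'jdbc:sqlserver://my-endpoint:1433;databaseName=mydb'),
            f('jdbcDriverClass', 'com.microsoft.sqlserver.jdbc.SQLServerDriver'),
            f('username', 'my_user'),
            f('*password', 'my_password')]},
        {'id': 'SourceTable', 'name': 'SourceTable', 'fields': [
            f('type', 'SqlDataNode'),
            f('database', 'SourceDb', ref=True),
            f('table', 'orders'),
            f('selectQuery', 'select * from orders')]},
        {'id': 'OutputS3', 'name': 'OutputS3', 'fields': [
            f('type', 'S3DataNode'),
            f('directoryPath', 's3://my-export-bucket/orders/')]},
        {'id': 'CopyRunner', 'name': 'CopyRunner', 'fields': [
            f('type', 'Ec2Resource'),
            f('instanceType', 't1.micro'),
            f('terminateAfter', '2 Hours')]},
        {'id': 'CopyOrders', 'name': 'CopyOrders', 'fields': [
            f('type', 'CopyActivity'),
            f('input', 'SourceTable', ref=True),
            f('output', 'OutputS3', ref=True),
            f('runsOn', 'CopyRunner', ref=True)]},
    ]

    dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=objects)
    dp.activate_pipeline(pipelineId=pipeline_id)
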
4 votes, 2 answers

Amazon AWS: DataPipelineDefaultRole/EDPSession not authorized to perform iam:ListRolePolicies

I have been assigned an IAM role in AWS by my manager and am trying to set up an AWS Data Pipeline. I repeatedly run into permission and authorization issues like the following when trying to activate the pipeline. WARNING: Error…
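
The error in the question above names both the role and the missing action, so a common first step is to have whoever administers IAM check what is attached to DataPipelineDefaultRole and, if appropriate, attach AWS's managed Data Pipeline role policy. A hedged boto3 sketch of that check and fix follows; the policy ARN is the managed policy name as I recall it, so verify it in the IAM console, and the attach call itself needs IAM admin rights the asker may not have.

    import boto3

    iam = boto3.client('iam')

    # See what the role can currently do.
    attached = iam.list_attached_role_policies(RoleName='DataPipelineDefaultRole')
    print([p['PolicyName'] for p in attached['AttachedPolicies']])

    # Attach the AWS managed policy for the Data Pipeline service role (assumption:
    # this is the policy the role is missing; requires IAM admin permissions).
    iam.attach_role_policy(
        RoleName='DataPipelineDefaultRole',
        PolicyArn='arn:aws:iam::aws:policy/service-role/AWSDataPipelineRole',
    )
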
3 votes, 1 answer

How to export DynamoDB table data without the point in time recovery?

I am trying to export the last 15 days of data from a DynamoDB table, but unfortunately point-in-time recovery is not enabled, so I can't use the new DynamoDB export-to-S3 feature because it's not retroactive. I have tried using the AWS Data…
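
Since point-in-time recovery only covers the window after it is enabled, there is no retroactive export; the usual fallback is a filtered Scan of the live table. A minimal sketch, assuming a hypothetical created_at ISO-8601 string attribute and a placeholder table name; note that a Scan still reads every item and consumes read capacity for the whole table.

    import json
    from datetime import datetime, timedelta, timezone

    import boto3
    from boto3.dynamodb.conditions import Attr

    table = boto3.resource('dynamodb').Table('my-table')
    cutoff = (datetime.now(timezone.utc) - timedelta(days=15)).isoformat()

    # Scan the whole table, keep only items newer than the cutoff, write JSON lines.
    kwargs = {'FilterExpression': Attr('created_at').gte(cutoff)}
    with open('export.jsonl', 'w') as out:
        while True:
            page = table.scan(**kwargs)
            for item in page['Items']:
                out.write(json.dumps(item, default=str) + '\n')
            if 'LastEvaluatedKey' not in page:
                break
            kwargs['ExclusiveStartKey'] = page['LastEvaluatedKey']
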
3 votes, 1 answer

AWS Data Pipeline import from an S3 bucket to a DynamoDB table in a different region gives an error

When I use a data pipeline to import into a DynamoDB table that is in the same region as the pipeline, it works without error. When I modify the EMRClusterForLoad step to use a region that is different from the region that the…
3 votes, 1 answer

How to solve "DriverClass not found for database:mariadb" with AWS data pipeline?

I'm trying to play with AWS Data Pipeline (and then Glue later) and am following Copy MySQL Data Using the AWS Data Pipeline Console. However, when I execute the pipeline, I get "DriverClass not found for database:mariadb". I would expect this to…
Chris F • 14,337 • 30 • 94 • 192
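
The usual diagnosis here is that the tutorial's RdsDatabase object lets Data Pipeline pick the driver from the reported RDS engine, and newer MySQL-compatible instances report an engine the bundled drivers don't map to. A workaround that comes up is declaring the database as a JdbcDatabase with an explicit MySQL driver class (and, if needed, a jdbcDriverJarUri pointing at a driver JAR in S3). A sketch of that object in the put_pipeline_definition field format, with a placeholder endpoint and credentials:

    # Hypothetical JdbcDatabase object to use in place of the tutorial's RdsDatabase.
    rds_mysql = {
        'id': 'rds_mysql',
        'name': 'rds_mysql',
        'fields': [
            {'key': 'type', 'stringValue': 'JdbcDatabase'},
            {'key': 'connectionString',
             'stringValue': 'jdbc:mysql://my-instance.us-east-1.rds.amazonaws.com:3306/mydb'},
            {'key': 'jdbcDriverClass', 'stringValue': 'com.mysql.jdbc.Driver'},
            {'key': 'username', 'stringValue': 'my_user'},
            {'key': '*password', 'stringValue': 'my_password'},
        ],
    }
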
3 votes, 0 answers

When using the Data Pipeline to backup a DynamoDB table, does readThroughputPercent account for autoscaling?

Suppose I set my DDB table to autoscale at 80% utilization and set the backup data pipeline's read throughput ratio to 0.85. Does the pipeline use the read throughput it determined initially, or does it scale up along with the table?
3 votes, 1 answer

AWS Data Pipeline S3 to DynamoDB JSON Error

I'm trying to import a TSV file from S3 into DynamoDB using Data Pipeline, but I keep hitting a MalformedJsonException. I've validated both pieces of JSON that I provide, the definition of the data pipeline and the manifest of the S3 folder, so…
tghw • 25,208 • 13 • 70 • 96
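
One thing worth ruling out before blaming the pipeline itself: MalformedJsonException can come from the EMR import step even when each file parses, but it is quick to confirm the two inputs really are syntactically valid and to get an exact line and column if one is not. A small sketch; the file names are placeholders for local copies of the pipeline definition and the manifest.

    import json

    for path in ('pipeline-definition.json', 'manifest'):
        with open(path) as fh:
            try:
                json.load(fh)
                print(f'{path}: parses as JSON')
            except json.JSONDecodeError as err:
                print(f'{path}: line {err.lineno}, column {err.colno}: {err.msg}')
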
3 votes, 1 answer

What should the return value from a ShellCommandPrecondition in AWS Data Pipeline be?

I am writing a shell script that will be executed by a ShellCommandPrecondition in AWS Data Pipeline. The AWS documentation doesn't specify what the return value from the script should be. Can I just return 0 on success and 1 (or any other value) if…
Shekhar • 11,438 • 36 • 130 • 186
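
The convention appears to be the ordinary shell one: the precondition is satisfied when the command exits with status 0 and unsatisfied on any non-zero status. A minimal precondition script sketch in Python; the bucket and key it checks are made-up examples.

    import sys
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client('s3')
    try:
        s3.head_object(Bucket='my-input-bucket', Key='incoming/today.csv')
        sys.exit(0)   # exit 0: precondition is met
    except ClientError:
        sys.exit(1)   # non-zero exit: precondition is not met
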
3 votes, 1 answer

Data Pipeline failing for EMR Activity

I am trying to run a Spark step on AWS Data Pipeline. I am getting the following exception: amazonaws.datapipeline.taskrunner.TaskExecutionException: Failed to complete EMR transform. at …
Sanchay • 1,053 • 1 • 16 • 33
3 votes, 1 answer

How to compute 'DynamoDB read throughput ratio' while setting up DataPipeline to export DynamoDB data to S3

I have a DynamoDB table with ~16M records, each about 4 KB in size. The table is configured for autoscaling with a target utilization of 70%, a minimum provisioned read capacity of 250, and a maximum provisioned write capacity of 3000. I am trying to…
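
For this question (and the autoscaling one a few entries up), a back-of-the-envelope calculation makes the trade-off concrete. Using the figures given here, roughly 16M items of about 4 KB each and 250 provisioned RCUs, and assuming the export uses eventually consistent reads (one RCU covering two 4 KB reads per second), the read ratio mainly sets how long the export runs. Treat the numbers as an estimate, not the exporter's exact behavior.

    # Rough export-time estimate for a full-table read at a given throughput ratio.
    items = 16_000_000
    item_size_kb = 4
    provisioned_rcu = 250      # table's provisioned read capacity
    ratio = 0.70               # DynamoDB read throughput ratio given to the pipeline

    # Assumption: eventually consistent reads; one read of a 4 KB item costs 0.5 RCU.
    rcu_per_item = 0.5 * max(item_size_kb / 4, 1)
    items_per_second = provisioned_rcu * ratio / rcu_per_item
    hours = items / items_per_second / 3600
    print(f'~{hours:.1f} hours at ratio {ratio}')   # ~12.7 hours with these numbers
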
3 votes, 1 answer

How to set up Google Cloud Storage correctly for a Spark application using AWS Data Pipeline

I am setting up the cluster step to run a Spark application using AWS Data Pipeline. My job reads data from S3, processes it, and writes the results to Google Cloud Storage. For Google Cloud Storage, I am using a service account with a key file.…
3 votes, 1 answer

EMR activity using data pipeline for spark job

I am trying to run a JAR file for a Spark job in Data Pipeline, but I am not sure exactly what I need to pass in the EMR step.
Monika Patel • 35 • 1 • 6
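
For this question and the EMR-transform failure a few entries up, the detail that most often trips people up is the step format: an EmrActivity takes each step as a single comma-separated string, and the commas become the arguments handed to the cluster. On release-label EMR clusters a Spark job goes through command-runner.jar and spark-submit. A sketch of such a step as a pipeline object; the JAR path, main class, and input argument are placeholders, and the runsOn reference assumes an EmrCluster object named EmrClusterForJob elsewhere in the definition.

    # Comma-separated EMR step that submits a Spark job via command-runner.jar.
    spark_step = ','.join([
        'command-runner.jar',
        'spark-submit',
        '--deploy-mode', 'cluster',
        '--class', 'com.example.MySparkJob',       # placeholder main class
        's3://my-bucket/jars/my-spark-job.jar',    # placeholder application JAR
        's3://my-bucket/input/',                   # placeholder argument
    ])

    emr_activity = {
        'id': 'SparkStep',
        'name': 'SparkStep',
        'fields': [
            {'key': 'type', 'stringValue': 'EmrActivity'},
            {'key': 'runsOn', 'refValue': 'EmrClusterForJob'},   # hypothetical cluster id
            {'key': 'step', 'stringValue': spark_step},
        ],
    }
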
3 votes, 1 answer

Is it possible to create an EMR cluster with Auto Scaling using Data Pipeline?

I am new to AWS. I have created an EMR cluster with an auto scaling policy through the AWS console. I have also created a data pipeline that can use this cluster to perform the activities. I am also able to create an EMR cluster dynamically through data…
3 votes, 1 answer

Importing data from Excel sheet to DynamoDB table

I am having a problem importing data from an Excel sheet to an Amazon DynamoDB table. I have the Excel sheet in an Amazon S3 bucket and want to import data from it into a DynamoDB table. Currently I am following Import and Export DynamoDB…
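
The Data Pipeline DynamoDB import expects its own export format rather than a spreadsheet, so for a one-off load it can be simpler to bypass the pipeline and write the rows directly. A sketch assuming a hypothetical workbook with id and name columns, plus placeholder bucket, key, and table names; it needs pandas and openpyxl installed where it runs.

    # Read an .xlsx file from S3 and batch-write its rows into DynamoDB.
    import boto3
    import pandas as pd

    s3 = boto3.client('s3')
    s3.download_file('my-bucket', 'imports/items.xlsx', '/tmp/items.xlsx')

    df = pd.read_excel('/tmp/items.xlsx')            # uses openpyxl under the hood
    table = boto3.resource('dynamodb').Table('my-table')

    with table.batch_writer() as batch:
        for row in df.to_dict(orient='records'):
            batch.put_item(Item={'id': str(row['id']), 'name': str(row['name'])})
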
3 votes, 0 answers

How to mark an AWS data pipeline as FINISHED rather than ERROR when an S3 precondition fails?

I've been struggling for a few weeks to find a configuration that works the way I expect; maybe what I want isn't possible... Here's what I'm trying to do: check an S3 bucket for any files. If there aren't any, don't spin up a cluster; just mark the…
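
One possible way around this, sketched below with heavy caveats: instead of an S3 precondition (whose failure eventually surfaces as an error), run the existence check itself as a ShellCommandActivity on a small Ec2Resource and always exit 0, so an empty bucket ends the run as FINISHED; the decision to kick off the heavier work then has to live downstream of this check. Whether that fits depends on how the rest of the pipeline is wired, so treat it as a sketch rather than a confirmed pattern. Bucket and prefix are placeholders.

    # Exit 0 whether or not input files exist, so the activity (and the run) ends
    # as FINISHED instead of ERROR.
    import sys
    import boto3

    s3 = boto3.client('s3')
    resp = s3.list_objects_v2(Bucket='my-input-bucket', Prefix='incoming/', MaxKeys=1)

    if resp.get('KeyCount', 0) == 0:
        print('no input files; nothing to do')
        sys.exit(0)

    print('input files found; downstream work would start here')
    sys.exit(0)
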