Questions tagged [amazon-data-pipeline]

Simple service to transfer data between Amazon data storage services, kick off Elastic MapReduce jobs, and connect with outside data services.

From the AWS Data Pipeline homepage:

AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services as well as on-premise data sources at specified intervals. With AWS Data Pipeline, you can regularly access your data where it’s stored, transform and process it at scale, and efficiently transfer the results to AWS services such as Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon Elastic MapReduce (EMR).

AWS Data Pipeline helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available. You don’t have to worry about ensuring resource availability, managing inter-task dependencies, retrying transient failures or timeouts in individual tasks, or creating a failure notification system. AWS Data Pipeline also allows you to move and process data that was previously locked up in on-premise data silos.
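For orientation, a minimal sketch of driving the service from boto3, assuming credentials and region are already configured; the pipeline name and description are placeholders, and the definition/activation calls are only indicated:

```python
import boto3

# Minimal sketch of the AWS Data Pipeline API flow with boto3:
# create a pipeline shell, upload a definition, then activate it.
client = boto3.client("datapipeline", region_name="us-east-1")

# create_pipeline only registers an empty pipeline; uniqueId guards
# against accidental duplicates on retries.
pipeline = client.create_pipeline(
    name="example-pipeline",           # placeholder name
    uniqueId="example-pipeline-v1",
    description="Demo pipeline created from boto3",
)
pipeline_id = pipeline["pipelineId"]

# The definition itself (activities, data nodes, schedule) is pushed
# separately, then the pipeline is started:
# client.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=[...])
# client.activate_pipeline(pipelineId=pipeline_id)
print(pipeline_id)
```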

470 questions
0 votes • 2 answers

Backing up DynamoDB tables via Data Pipeline vs. manually creating a JSON for DynamoDB

I need to back up a few DynamoDB tables, which are not too big for now, to S3. However, these are tables that another team uses and works on, not me. These backups need to happen once a week, and will only be used to restore the DynamoDB tables in…
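For the "manually create a JSON" option, a rough sketch of what a scheduled script could look like; table, bucket and key names are placeholders, and a full Scan is only sensible for small tables, as in the question:

```python
import json
import boto3

# Hedged sketch of a "manual" weekly backup for a small DynamoDB table:
# scan everything and drop the items into S3 as one JSON file.
dynamodb = boto3.client("dynamodb")
s3 = boto3.client("s3")

def backup_table(table_name, bucket, key):
    items = []
    paginator = dynamodb.get_paginator("scan")
    for page in paginator.paginate(TableName=table_name):
        items.extend(page["Items"])  # items keep DynamoDB's typed JSON format
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(items).encode("utf-8"))

backup_table("my-small-table", "my-backup-bucket", "backups/my-small-table.json")
```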
0 votes • 1 answer

How to filter out data from DynamoDB using Amazon Data Pipeline and Hive?

Currently the logs are stored in DynamoDB. We want to filter out unnecessary rows from that table and store the output in a different table (e.g. exclude rows whose "value" field contains "bot", "python", "requests", etc.). So far I have come up with…
Dmitrijs Zubriks • 2,696 • 6 • 22 • 33
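One way this kind of filter is often expressed in Data Pipeline is a HiveCopyActivity between two DynamoDB data nodes with a filterSql expression. A hedged fragment in the put_pipeline_definition wire format; the ids, refs and filter expression are placeholders, and the data nodes, schedule and EMR resource are assumed to be defined elsewhere in the pipeline:

```python
# Hedged sketch: a HiveCopyActivity that copies only the rows passing
# filterSql from one DynamoDB table to another.
hive_copy_activity = {
    "id": "FilterLogs",
    "name": "FilterLogs",
    "fields": [
        {"key": "type", "stringValue": "HiveCopyActivity"},
        {"key": "input", "refValue": "SourceDynamoDBTable"},
        {"key": "output", "refValue": "FilteredDynamoDBTable"},
        {"key": "filterSql", "stringValue": "value NOT LIKE '%bot%'"},
        {"key": "runsOn", "refValue": "EmrClusterForCopy"},
    ],
}
```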
0 votes • 2 answers

How to run a Spark or MapReduce job on hourly aggregated data on HDFS produced by Spark Streaming at 5-minute intervals

I have a scenario where I am using Spark Streaming to collect data from the Kinesis service using https://spark.apache.org/docs/1.2.0/streaming-kinesis-integration.html Now in streaming I am doing some aggregation on the data and emitting it to HDFS. I am…
Sam • 1,333 • 5 • 23 • 36
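A common pattern for the batch side is a separate Spark job, scheduled hourly, that reads whatever the streaming job emitted for that hour. A rough PySpark sketch with a modern API; the hour-partitioned path layout and the column names are made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hedged sketch of an hourly batch rollup over data a streaming job
# wrote to HDFS in 5-minute files, assuming directories partitioned
# by date and hour (adjust to match the actual writer).
spark = SparkSession.builder.appName("hourly-rollup").getOrCreate()

hour_path = "hdfs:///data/events/2016-01-01/13/*"   # placeholder partition
events = spark.read.json(hour_path)

# Example rollup: event count and byte total per key for the hour.
rollup = events.groupBy("user_id").agg(
    F.count("*").alias("events"),
    F.sum("bytes").alias("total_bytes"),
)
rollup.write.mode("overwrite").parquet("hdfs:///data/rollups/2016-01-01/13")
spark.stop()
```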
0 votes • 1 answer

Running a Hadoop Pig script on a Kinesis stream through AWS EMR

I am trying to batch process some data in a Kinesis stream using a Pig script on AWS EMR. I just need to group the stream data and move it to S3. I'm trying to run this every couple of hours. At first glance it seems like a great fit for AWS Data Pipeline,…
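If the grouping job ends up as a plain EMR step rather than a Data Pipeline activity, it can be submitted from a small script and scheduled separately. A hedged sketch; the cluster id and the S3 location of the Pig script are placeholders:

```python
import boto3

# Hedged sketch: submit a Pig script stored in S3 as a step on an
# existing EMR cluster via command-runner.jar.
emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",          # existing cluster id (placeholder)
    Steps=[
        {
            "Name": "group-kinesis-batch",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["pig", "-f", "s3://my-bucket/scripts/group.pig"],
            },
        }
    ],
)
```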
0 votes • 3 answers

Process S3 access logs using AWS Data Pipeline

My use case is to process S3 access logs (which have those 18 fields) periodically and push them to a table in RDS. I'm using AWS Data Pipeline for this task, running every day to process the previous day's logs. I decided to split the task into two activities: 1.…
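For the parsing half of that split, a rough sketch of pulling the 18 fields out of a classic S3 server access log line in Python; the pattern assumes the older default format with no extra fields appended, and the RDS insert is left out:

```python
import re

# Hedged sketch: regex for the classic 18-field S3 server access log line.
S3_LOG = re.compile(
    r'(\S+) (\S+) \[(.*?)\] (\S+) (\S+) (\S+) (\S+) (\S+) "(.*?)" '
    r'(\S+) (\S+) (\S+) (\S+) (\S+) (\S+) "(.*?)" "(.*?)" (\S+)'
)

FIELDS = [
    "bucket_owner", "bucket", "time", "remote_ip", "requester", "request_id",
    "operation", "key", "request_uri", "http_status", "error_code",
    "bytes_sent", "object_size", "total_time_ms", "turnaround_time_ms",
    "referrer", "user_agent", "version_id",
]

def parse_line(line):
    """Return a dict of the 18 fields, or None if the line doesn't match."""
    match = S3_LOG.match(line)
    return dict(zip(FIELDS, match.groups())) if match else None
```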
0 votes • 3 answers

Run a SQL script file with multiple complex queries using Amazon Data Pipeline

I have just created an account on AWS and I am going to use Data Pipeline to schedule my queries. Is it possible to run multiple complex SQL queries from a .sql file using the SqlActivity of Data Pipeline? My overall objective is to process the raw…
rahul_raj • 275 • 2 • 3 • 9
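If SqlActivity turns out to accept only a single statement, a common fallback is a ShellCommandActivity that feeds the whole .sql file to the database's own client. A hedged fragment; the bucket, host, credentials and the referenced Ec2Resource id are placeholders, and a MySQL-compatible target is assumed:

```python
# Hedged sketch: run a multi-statement .sql file from S3 through the
# mysql CLI inside a ShellCommandActivity.
shell_activity = {
    "id": "RunSqlScript",
    "name": "RunSqlScript",
    "fields": [
        {"key": "type", "stringValue": "ShellCommandActivity"},
        {"key": "command", "stringValue": (
            "aws s3 cp s3://my-bucket/scripts/etl.sql /tmp/etl.sql && "
            "mysql -h my-db-host -u etl_user -p\"$DB_PASSWORD\" mydb < /tmp/etl.sql"
        )},
        {"key": "runsOn", "refValue": "Ec2Instance"},
    ],
}
```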
0 votes • 1 answer

Using boto to create an AWS data pipeline for the RedshiftCopyActivity

I am trying to move data from S3 into Redshift and want to enforce uniqueness on primary keys in Redshift. I realized that the COPY command itself can't do this. However, I noticed that the RedshiftCopyActivity available through the AWS data…
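A hedged fragment of the relevant pipeline object: a RedshiftCopyActivity whose insertMode controls how existing rows are handled, with OVERWRITE_EXISTING being the mode that deduplicates against the table's primary key. Ids and refs are placeholders, and the data nodes and resource are assumed to exist elsewhere in the definition:

```python
# Hedged sketch: S3 -> Redshift copy with primary-key based overwrite.
redshift_copy = {
    "id": "S3ToRedshiftCopy",
    "name": "S3ToRedshiftCopy",
    "fields": [
        {"key": "type", "stringValue": "RedshiftCopyActivity"},
        {"key": "input", "refValue": "S3InputDataNode"},
        {"key": "output", "refValue": "RedshiftTableDataNode"},
        {"key": "insertMode", "stringValue": "OVERWRITE_EXISTING"},
        {"key": "runsOn", "refValue": "Ec2Instance"},
    ],
}
```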
0 votes • 2 answers

Limiting EC2 resources used by AWS Data Pipeline during DynamoDB table backups

I need to back up 6 DynamoDB tables every couple of hours. I've created 6 pipelines from templates and they ran great, except that this created 6 or more virtual machines which mostly stayed up. That's not an expense I can afford. Does anyone…
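One lever here is the resource object's terminateAfter setting, combined with putting all six table backups into a single pipeline so they share one resource via runsOn. A hedged fragment with an EmrCluster resource (the templates typically use EMR for DynamoDB exports); instance types and the timeout are placeholders:

```python
# Hedged sketch: a shared EMR resource that is terminated shortly after
# its activities finish, instead of staying up between schedules.
emr_cluster = {
    "id": "BackupCluster",
    "name": "BackupCluster",
    "fields": [
        {"key": "type", "stringValue": "EmrCluster"},
        {"key": "masterInstanceType", "stringValue": "m1.medium"},
        {"key": "coreInstanceType", "stringValue": "m1.medium"},
        {"key": "coreInstanceCount", "stringValue": "1"},
        {"key": "terminateAfter", "stringValue": "1 Hour"},
    ],
}
```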
0 votes • 1 answer

Creating an email alert in AWS Data Pipeline

I know AWS Data Pipeline supports and allows SNS alerts, but I want an alert or email sent only if a query returns anything. Basically, I want to run an SqlActivity with a very simple SELECT query, and if that query returns anything, I want to send an email…
simplycoding • 2,770 • 9 • 46 • 91
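One way to get the conditional behaviour is to replace the SqlActivity with a small script (run from a ShellCommandActivity, for instance) that executes the query itself and only publishes to SNS when rows come back. A hedged sketch using psycopg2 against a Redshift/Postgres endpoint; the connection details, query and topic ARN are placeholders:

```python
import boto3
import psycopg2

# Hedged sketch: run the check query directly and publish to SNS only
# when it returns rows; SNS delivers the email to the topic subscribers.
conn = psycopg2.connect(
    host="my-db-host", dbname="mydb", user="etl_user", password="secret"
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT id FROM orders WHERE status = 'stuck' LIMIT 10")
    rows = cur.fetchall()

if rows:
    sns = boto3.client("sns")
    sns.publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:data-quality-alerts",
        Subject="Query returned rows",
        Message="The check query returned %d row(s): %s" % (len(rows), rows),
    )
```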
0 votes • 1 answer

How does AWS Data Pipeline scheduling work?

I noticed some strange behavior by AWS Data Pipeline. The execution start time is before the scheduled start time. Please refer to the screenshot below. Am I missing something here? Is this acceptable behavior for AWS Data Pipeline? What are the…
Kalyanaraman Santhanam • 1,371 • 1 • 18 • 30
0 votes • 1 answer

Error with Data Pipeline backup when I transfer my data from DynamoDB to S3

I have to back up my DynamoDB table to S3, but when I launch this service I receive this error after three attempts: private.com.amazonaws.AmazonServiceException: User: …
0 votes • 2 answers

SSH to EC2 instance and execute

I have a Data Pipeline application that I need to respond to. When it finishes, I SSH to an EC2 instance and execute a script. What is the best way to SSH to that box after the pipeline finishes? Should I use a Lambda function and have it listen…
bjamin • 131 • 1 • 5
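Whatever does the triggering (a final ShellCommandActivity in the pipeline, or an SNS-driven Lambda), the SSH-and-run part itself is small. A hedged sketch with paramiko; the hostname, user, key path and remote script are placeholders:

```python
import paramiko

# Hedged sketch: once the pipeline is known to be finished, run a script
# on the target instance over SSH.
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(
    "ec2-203-0-113-10.compute-1.amazonaws.com",   # placeholder host
    username="ec2-user",
    key_filename="/path/to/key.pem",
)

stdin, stdout, stderr = client.exec_command("/home/ec2-user/post_pipeline.sh")
print(stdout.read().decode())
print(stderr.read().decode())
client.close()
```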
0 votes • 2 answers

Call a pipeline from a pipeline in Amazon Data Pipeline

My team at work is currently looking for a replacement for a rather expensive ETL tool that, at this point, we are using as a glorified scheduler. Any of the integrations offered by the ETL tool we have already improved upon with our own Python code, so I…
jpavs • 648 • 5 • 17
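One simple chaining approach is for the last activity of the parent pipeline to run a script that activates the child pipeline by id. A hedged sketch; the child pipeline id is a placeholder, and the script is assumed to run with IAM permissions for datapipeline:ActivatePipeline:

```python
import boto3

# Hedged sketch: kick off a second, already-defined pipeline from a
# script (for example, one run by a final ShellCommandActivity).
client = boto3.client("datapipeline", region_name="us-east-1")
client.activate_pipeline(pipelineId="df-XXXXXXXXXXXXXXXXXXXX")

# Optionally confirm the child pipeline is now visible and named as expected.
description = client.describe_pipelines(pipelineIds=["df-XXXXXXXXXXXXXXXXXXXX"])
print(description["pipelineDescriptionList"][0]["name"])
```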
0 votes • 2 answers

More that one object matches the predicate (2 in total) in AWS Data Pipeline

In the AWS Data Pipeline console, when I upload a pipeline definition file, I always get this error: Pipeline creation failed. Data Pipeline failed to create pipeline : More that one object matches the predicate (2 in total). (Service: null; Status…
v01d • 569 • 6 • 10
0 votes • 2 answers

Want to server-side encrypt S3 data node file created by ShellCommandActivity

I created a ShellCommandActivity with stage = "true". The shell command creates a new file and stores it in ${OUTPUT1_STAGING_DIR}. I want this new file to be server-side encrypted in S3. According to the documentation, all files created in an S3 data node are…
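If the staged file does land in S3 unencrypted, one workaround is a follow-up step that rewrites the object in place with SSE enabled. A hedged sketch with boto3; the bucket and key are placeholders, and it relies on the fact that changing the encryption attributes makes a copy-onto-itself legal in S3:

```python
import boto3

# Hedged workaround sketch: re-write an existing S3 object in place with
# server-side encryption (AES256) enabled.
s3 = boto3.client("s3")
bucket, key = "my-output-bucket", "pipeline-output/part-000"

s3.copy_object(
    Bucket=bucket,
    Key=key,
    CopySource={"Bucket": bucket, "Key": key},
    ServerSideEncryption="AES256",  # changing encryption makes the in-place copy valid
)
```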