Questions tagged [aws-data-pipeline]

Use amazon-data-pipeline tag instead

Simple service to transfer data between Amazon data storage services, kick off Elastic MapReduce jobs, and connect with outside data services.

80 questions
1 vote · 1 answer

# of records loaded through AWS Redshift

Is there a way through the AWS console to understand the number of records that got loaded into a Redshift table using AWS Data Pipeline?
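
One answer-style sketch (not from the question itself): after a COPY into Redshift, per-file row counts land in the STL_LOAD_COMMITS system table, which can be queried without the console. A minimal boto3 sketch using the Redshift Data API; the cluster identifier, database, and user are placeholders:

```python
import boto3

# Placeholders: cluster "my-cluster", database "dev", user "awsuser".
client = boto3.client("redshift-data")

# STL_LOAD_COMMITS records one row per file loaded by COPY,
# including how many lines were scanned from that file.
resp = client.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=(
        "select query, trim(filename) as filename, lines_scanned, curtime "
        "from stl_load_commits order by curtime desc limit 20;"
    ),
)
print(resp["Id"])  # poll describe_statement / get_statement_result with this Id
```
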
1 vote · 0 answers

Crawler not creating table in data lake from Postgres partitioned table

My table is partitioned in Postgres. I have created a Glue crawler to create the table. I selected the option "Update all new and existing partitions with metadata from the table" under "Configure the crawler's output". Since it's partitioned, the table is…
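
For reference, the console option quoted above maps to the crawler's Configuration JSON. A hedged boto3 sketch; the crawler name, role, database, and Glue connection are placeholders, not taken from the question:

```python
import json
import boto3

glue = boto3.client("glue")

# "Update all new and existing partitions with metadata from the table"
# corresponds to InheritFromTable in the crawler's Configuration JSON.
glue.create_crawler(
    Name="postgres-partitions-crawler",                      # placeholder
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",   # placeholder
    DatabaseName="datalake",                                 # placeholder
    Targets={"JdbcTargets": [{
        "ConnectionName": "postgres-conn",  # placeholder Glue connection
        "Path": "mydb/public/%",            # schema path to crawl
    }]},
    Configuration=json.dumps({
        "Version": 1.0,
        "CrawlerOutput": {
            "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}
        },
    }),
)
```
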
1 vote · 0 answers

Using AWS Data Pipeline to move data from AWS RDS to S3

I was trying to move data from RDS to S3 as a backup. I used DBeaver on my local PC to establish a connection with AWS RDS and uploaded a CSV file. I then tried to create a Data Pipeline to send data from RDS to S3. Initially, I got an error DBInstance…
kiran · 11 · 3
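
A hedged sketch of the usual shape of such a pipeline, pushed with boto3. All ids, connection details, and paths below are placeholders rather than values from the question:

```python
import boto3

dp = boto3.client("datapipeline")
pid = dp.create_pipeline(name="rds-to-s3-backup",
                         uniqueId="rds-to-s3-backup-1")["pipelineId"]

# Minimal CopyActivity graph: RDS table -> CSV files in S3.
dp.put_pipeline_definition(
    pipelineId=pid,
    pipelineObjects=[
        {"id": "Default", "name": "Default", "fields": [
            {"key": "scheduleType", "stringValue": "ondemand"},
            {"key": "role", "stringValue": "DataPipelineDefaultRole"},
            {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
        ]},
        {"id": "rds", "name": "rds", "fields": [
            {"key": "type", "stringValue": "RdsDatabase"},
            {"key": "rdsInstanceId", "stringValue": "my-rds-instance"},
            {"key": "username", "stringValue": "admin"},
            {"key": "*password", "stringValue": "secret"},
        ]},
        {"id": "source", "name": "source", "fields": [
            {"key": "type", "stringValue": "SqlDataNode"},
            {"key": "database", "refValue": "rds"},
            {"key": "table", "stringValue": "my_table"},
            {"key": "selectQuery", "stringValue": "select * from my_table"},
        ]},
        {"id": "dest", "name": "dest", "fields": [
            {"key": "type", "stringValue": "S3DataNode"},
            {"key": "directoryPath", "stringValue": "s3://my-backup-bucket/rds/"},
        ]},
        {"id": "ec2", "name": "ec2", "fields": [
            {"key": "type", "stringValue": "Ec2Resource"},
            {"key": "terminateAfter", "stringValue": "1 Hour"},
        ]},
        {"id": "copy", "name": "copy", "fields": [
            {"key": "type", "stringValue": "CopyActivity"},
            {"key": "input", "refValue": "source"},
            {"key": "output", "refValue": "dest"},
            {"key": "runsOn", "refValue": "ec2"},
        ]},
    ],
)
dp.activate_pipeline(pipelineId=pid)
```
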
1 vote · 1 answer

Data migration from S3 to RDS

I am working on a requirement where I am doing a multipart upload of a CSV file from an on-prem server to an S3 bucket. To achieve this, I create a presigned URL using AWS Lambda, and I upload the CSV file using this URL. Now, once I have the file in…
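
For the presigned-URL step described above, a minimal boto3 sketch; the bucket and key are placeholders. The downstream S3-to-RDS load would then typically be a separate COPY/INSERT job:

```python
import boto3

s3 = boto3.client("s3")

# Inside the Lambda handler: return a time-limited PUT URL for the CSV.
# Bucket and key are placeholders, not taken from the question.
url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "my-landing-bucket", "Key": "uploads/data.csv"},
    ExpiresIn=3600,  # seconds
)
# The on-prem client then uploads with:
#   requests.put(url, data=open("data.csv", "rb"))
```
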
1 vote · 0 answers

Which file format is suitable for unstructured data?

I am creating a data repository, more like a data lake, for a NoSQL DB. I have some fields which don't have a proper schema; they have mixed-type objects, like a field value of {a:2} or {b:2, c:4, a: {1,2}}, etc. I can use CSV format so I can save…
Manish Trivedi · 3,481 · 5 · 23 · 29
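
Since the records above have no fixed schema, line-delimited JSON (JSONL) is a common fit: each row is a self-describing object, which Glue/Athena can crawl. A small illustrative sketch (the `a: {1,2}` value from the question is written as a list, since `{1,2}` is not valid JSON):

```python
import json

# Mixed-shape records like those in the question; keys vary per row.
records = [
    {"a": 2},
    {"b": 2, "c": 4, "a": [1, 2]},
]

# One JSON object per line: no shared schema required, unlike CSV
# where every row must fit the same columns.
with open("records.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```
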
1 vote · 1 answer

AWS Data Pipeline: Upload CSV file from S3 to DynamoDB

I'm attempting to migrate CSV data from S3 to DynamoDB using Data Pipeline. The data is not in a DynamoDB export format but instead a normal CSV. I understand that Data Pipeline is more typically used for import or export of the DynamoDB format rather…
Mike S. · 185 · 2 · 9
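
A common workaround when the CSV is not in DynamoDB export format is to skip Data Pipeline and bulk-load with boto3. A hedged sketch; the bucket, key, and table name are placeholders, and the CSV is assumed to contain the table's key attribute as a column:

```python
import csv
import io
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("my-table")  # placeholder table

# Read the plain CSV straight from S3 (bucket/key are placeholders).
body = s3.get_object(Bucket="my-bucket", Key="data.csv")["Body"].read().decode("utf-8")

# batch_writer buffers and retries BatchWriteItem calls for us.
with table.batch_writer() as batch:
    for row in csv.DictReader(io.StringIO(body)):
        batch.put_item(Item=row)  # each CSV column becomes a string attribute
```
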
1 vote · 1 answer

Airflow - Tasks that write files locally (GCS)

I'm in the process of building a few pipelines in Airflow after having spent the last few years using AWS Data Pipeline. I have a couple of questions I'm foggy on and hope for some clarification. For context, I'm using Google Cloud Composer. In…
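
On the local-file part of the title specifically: in Composer, files a task writes locally are only reliable within that task, so the usual pattern is write-to-/tmp-then-upload. A sketch assuming Airflow 2 with the Google provider installed; the bucket and paths are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.hooks.gcs import GCSHook


def extract_and_upload():
    # Local files only live on this worker for the duration of the task,
    # so write to /tmp and ship to GCS before returning.
    local_path = "/tmp/output.csv"
    with open(local_path, "w") as f:
        f.write("id,value\n1,foo\n")
    GCSHook().upload(bucket_name="my-bucket",          # placeholder
                     object_name="exports/output.csv",
                     filename=local_path)


with DAG("local_file_to_gcs", start_date=datetime(2021, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    PythonOperator(task_id="extract_and_upload",
                   python_callable=extract_and_upload)
```
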
1 vote · 0 answers

Is there a way PigActivity in AWS Data Pipeline can read the schema from Athena tables created on S3 buckets

I have a lot of legacy Pig scripts that run on an on-prem cluster. We are trying to move to AWS Data Pipeline (PigActivity) and want these Pig scripts to read data from the S3 buckets where my source data would reside. On-prem Pig scripts use…
1 vote · 0 answers

ShellCommandActivity timing out despite setting 3 hours as the timeout value

I'm using a CloudFormation template to spin up an EC2 instance to execute a shell script. For the EC2 resource, I've specified the terminateAfter value as 3 Hours. Similarly, for the ShellCommandActivity I've specified the attemptTimeout value as 3…
user795028 · 113 · 10
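
For comparison, the two timeout fields mentioned above live on different objects. A hedged fragment of the pipeline definition in boto3 form; the ids and the script path are placeholders:

```python
# Fragment of pipelineObjects for put_pipeline_definition; ids/paths are placeholders.
shell_activity = {
    "id": "RunScript",
    "name": "RunScript",
    "fields": [
        {"key": "type", "stringValue": "ShellCommandActivity"},
        {"key": "scriptUri", "stringValue": "s3://my-bucket/run.sh"},
        {"key": "runsOn", "refValue": "MyEc2"},
        # Timeout for a single attempt of the activity.
        {"key": "attemptTimeout", "stringValue": "3 Hours"},
    ],
}
ec2_resource = {
    "id": "MyEc2",
    "name": "MyEc2",
    "fields": [
        {"key": "type", "stringValue": "Ec2Resource"},
        # Hard stop for the instance itself, independent of attemptTimeout.
        {"key": "terminateAfter", "stringValue": "3 Hours"},
    ],
}
```
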
1 vote · 1 answer

Has anyone used an AWS Systems Manager parameter in Data Pipeline to allocate a value to a pipeline parameter?

"id": "myS3Bucket", "type": "String", "default": "\"aws ssm get-parameters --names variable --query \"Parameters[*].{myS3Bucket:Value}\"\"" I tried this , Where I created a variable in AWS parameter and was able to retrieve the value using this…
1 vote · 2 answers

Spark Streaming scheduling best practices

We have a Spark Streaming job that runs every 30 mins and takes 15s to complete. What are the suggested best practices in this scenario? I am thinking I can schedule an AWS Data Pipeline to run every 30 mins so that EMR terminates after 15…
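
One of the patterns implied above, a transient cluster per run, looks roughly like this with boto3. The name, release label, instance types, and spark-submit arguments are all placeholders:

```python
import boto3

emr = boto3.client("emr")

# Transient cluster: run the step, then terminate, so you only pay for
# the short job plus cluster startup each 30-minute cycle.
emr.run_job_flow(
    Name="spark-30min-batch",     # placeholder
    ReleaseLabel="emr-6.10.0",    # placeholder
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceCount": 3,
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate after the step
    },
    Steps=[{
        "Name": "spark-job",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/job.py"],  # placeholder
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```
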
1 vote · 1 answer

Processing parameters passed to SQL activity in AWS Data Pipeline

I am working with AWS Data Pipeline. In this context, I am passing several parameters from the pipeline definition to a SQL file as follows: s3://reporting/preprocess.sql,-d,RUN_DATE=#{@scheduledStartTime.format('YYYYMMdd')} My SQL file looks like…
Joy · 4,197 · 14 · 61 · 131
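
For context on the snippet above: the `#{@scheduledStartTime.format('YYYYMMdd')}` expression is expanded by Data Pipeline before the arguments reach the script. A hedged fragment showing where such arguments sit on a SqlActivity; the ids and the database ref are placeholders:

```python
# Fragment of pipelineObjects; ids and the database ref are placeholders.
sql_activity = {
    "id": "Preprocess",
    "name": "Preprocess",
    "fields": [
        {"key": "type", "stringValue": "SqlActivity"},
        {"key": "database", "refValue": "MyDatabase"},
        {"key": "scriptUri", "stringValue": "s3://reporting/preprocess.sql"},
        # Data Pipeline expands the #{...} expression before the run,
        # so the script receives the literal date string.
        {"key": "scriptArgument",
         "stringValue": "RUN_DATE=#{@scheduledStartTime.format('YYYYMMdd')}"},
    ],
}
```
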
1 vote · 0 answers

How to run multiple steps in AWS Data Pipeline using the AWS console

I have a use case of scheduling my Spark jobs on EMR. Every time, we will be spinning up a new cluster and running a Spark job. I went through the documentation provided by AWS, but it is not extensive enough to give a clear picture of how to do it. If any…
Raghav salotra · 820 · 1 · 11 · 23
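
In Data Pipeline terms, an EmrActivity accepts multiple step fields, one per EMR step, each in the comma-separated jar,arg,arg form. A hedged fragment; the cluster ref and step strings are placeholders:

```python
# Fragment of pipelineObjects; the EmrCluster ref and step strings are placeholders.
emr_activity = {
    "id": "RunSparkJobs",
    "name": "RunSparkJobs",
    "fields": [
        {"key": "type", "stringValue": "EmrActivity"},
        {"key": "runsOn", "refValue": "MyEmrCluster"},
        # One "step" field per EMR step, executed in order.
        {"key": "step", "stringValue":
            "command-runner.jar,spark-submit,s3://my-bucket/job1.py"},
        {"key": "step", "stringValue":
            "command-runner.jar,spark-submit,s3://my-bucket/job2.py"},
    ],
}
```
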
1 vote · 1 answer

Unresolved resource dependencies [DefaultSchedule] in the Resources block of the template

I am working with a CloudFormation script to create an AWS Data Pipeline. I have created the script according to the documentation, but I am facing one error: Template validation error: Template format error: Unresolved resource dependencies…
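
That particular error usually means the template refers to DefaultSchedule with a CloudFormation Ref, while inside AWS::DataPipeline::Pipeline a schedule is referenced through a field's RefValue pointing at another pipeline object's Id. A hedged fragment of the PipelineObjects property, written as a Python dict for illustration; the names are placeholders:

```python
# Fragment of the AWS::DataPipeline::Pipeline "PipelineObjects" property.
# "DefaultSchedule" is a pipeline-object Id, not a CloudFormation resource,
# so it must be wired with RefValue rather than {"Ref": "DefaultSchedule"}.
pipeline_objects = [
    {
        "Id": "DefaultSchedule",
        "Name": "Every1Day",
        "Fields": [
            {"Key": "type", "StringValue": "Schedule"},
            {"Key": "period", "StringValue": "1 Day"},
            {"Key": "startAt", "StringValue": "FIRST_ACTIVATION_DATE_TIME"},
        ],
    },
    {
        "Id": "Default",
        "Name": "Default",
        "Fields": [
            {"Key": "scheduleType", "StringValue": "cron"},
            # Correct: reference the schedule object by its Id.
            {"Key": "schedule", "RefValue": "DefaultSchedule"},
        ],
    },
]
```
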
1 vote · 1 answer

AWS Data Pipeline incorrect Java version

I am trying to execute a jar file in my Data Pipeline, and it is erroring out in a fashion that indicates the version of Java installed in my pipeline is lower than that required by the executable jar. I have tried to add a command…
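
A common workaround, assumed here rather than confirmed by the question, is to install the needed JRE on the resource before running the jar, e.g. in the ShellCommandActivity's command string. The package name and jar path are placeholders:

```python
# Placeholder command string for a ShellCommandActivity: install Java 8 on
# the Amazon Linux resource, then invoke the jar with that JRE explicitly.
command = (
    "sudo yum install -y java-1.8.0-openjdk && "
    "/usr/lib/jvm/jre-1.8.0/bin/java -jar my-app.jar"  # placeholder jar path
)
shell_activity_field = {"key": "command", "stringValue": command}
```
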