
I have the following scenario:

An SQL table needs to be transferred to a MySQL database daily. I tried Data Pipeline with CopyActivity, but the exported CSV has empty fields instead of \N or NULL, so MySQL imports those fields as "", which is not good for our application.
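For context, here is a minimal sketch of the distinction (database, table, and file names are made up): with MySQL's LOAD DATA, an empty field loads as the empty string '' while a literal \N loads as SQL NULL, and that is exactly the difference our application cares about.

printf '1,,apple\n'    > /tmp/empty.csv   # second column would load as ''
printf '1,\\N,apple\n' > /tmp/null.csv    # second column would load as NULL
mysql --local-infile -e "LOAD DATA LOCAL INFILE '/tmp/null.csv'
  INTO TABLE mydb.mytable
  FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';"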

I then tried a slightly different approach: export the table to S3 via CopyActivity, then have a ShellCommandActivity download the file, execute the script below, and upload the result back to S3:

#!/bin/bash
sed -i -e 's/^,/\\N,/' -e 's/,$/,\\N/' -e 's/,,/,\\N,/g' -e 's/,,/,\\N,/g' ${INPUT1_STAGING_DIR}/*.csv |cat ${INPUT1_STAGING_DIR}/*.csv > ${OUTPUT1_STAGING_DIR}/sqltable.csv
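For what it's worth, the sed expressions themselves can be sanity-checked locally on a sample line (just a sketch; it assumes a plain comma-separated file with no quoted fields containing commas):

echo '1,,foo,,,bar,' | sed -e 's/^,/\\N,/' -e 's/,$/,\\N/' -e 's/,,/,\\N,/g' -e 's/,,/,\\N,/g'
# prints: 1,\N,foo,\N,\N,bar,\N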

The script above works perfectly on my test Linux instance, but nothing happens when it's executed on the temporary EC2 resource. I get no errors whatsoever, just the same useless CSV with empty fields on the output S3 data node.

I have no idea what I'm doing wrong or why the script doesn't behave the same way as it does on my test Linux instance.

Pipeline log:

18 Jul 2016 10:23:06,470 [INFO] (TaskRunnerService-resource:df-09799242T7UHHPMT072T_@ResourceId_x5OCd_2016-07-18T10:18:38-1) df-09799242T7UHHPMT072T amazonaws.datapipeline.taskrunner.TaskPoller: Executing: amazonaws.datapipeline.activity.ShellCommandActivity@515aa023
18 Jul 2016 10:23:06,648 [INFO] (TaskRunnerService-resource:df-09799242T7UHHPMT072T_@ResourceId_x5OCd_2016-07-18T10:18:38-1) df-09799242T7UHHPMT072T amazonaws.datapipeline.connector.staging.S3Helper: Begin Downloading files from S3 Path:s3://s3-bucket/mysqlexport/sqltable.csv to output/staging/df-09799242T7UHHPMT072T_input1_7c583c0755eb46f5b518feffa314fccd
18 Jul 2016 10:23:06,648 [INFO] (TaskRunnerService-resource:df-09799242T7UHHPMT072T_@ResourceId_x5OCd_2016-07-18T10:18:38-1) df-09799242T7UHHPMT072T amazonaws.datapipeline.connector.staging.S3Helper: Local File Relative compared to Input Root Path:s3://s3-bucket/mysqlexport/sqltable.csv is 
18 Jul 2016 10:23:06,648 [INFO] (TaskRunnerService-resource:df-09799242T7UHHPMT072T_@ResourceId_x5OCd_2016-07-18T10:18:38-1) df-09799242T7UHHPMT072T amazonaws.datapipeline.connector.staging.S3Helper: Download just the root file to the local dir. Updated File Relative compared to Input Root Path:s3://s3-bucket/mysqlexport/sqltable.csv is sqltable.csv
18 Jul 2016 10:23:06,649 [INFO] (TaskRunnerService-resource:df-09799242T7UHHPMT072T_@ResourceId_x5OCd_2016-07-18T10:18:38-1) df-09799242T7UHHPMT072T amazonaws.datapipeline.connector.staging.S3Helper: Begin Downloading S3 file s3://s3-bucket/mysqlexport/sqltable.csv to /media/ephemeral0/mnt/taskRunner/output/staging/df-09799242T7UHHPMT072T_input1_7c583c0755eb46f5b518feffa314fccd/sqltable.csv
18 Jul 2016 10:23:06,824 [INFO] (TaskRunnerService-resource:df-09799242T7UHHPMT072T_@ResourceId_x5OCd_2016-07-18T10:18:38-1) df-09799242T7UHHPMT072T amazonaws.datapipeline.connector.staging.S3Helper: Completed Downloading files from S3 Path:s3://s3-bucket/mysqlexport/sqltable.csv to output/staging/df-09799242T7UHHPMT072T_input1_7c583c0755eb46f5b518feffa314fccd
18 Jul 2016 10:23:06,862 [INFO] (TaskRunnerService-resource:df-09799242T7UHHPMT072T_@ResourceId_x5OCd_2016-07-18T10:18:38-1) df-09799242T7UHHPMT072T amazonaws.datapipeline.objects.CommandRunner: Executing command: #!/bin/bash
sed -i -e 's/^,/\\N,/' -e 's/,$/,\\N/' -e 's/,,/,\\N,/g' -e 's/,,/,\\N,/g' ${INPUT1_STAGING_DIR}/sqltable.csv |cat ${INPUT1_STAGING_DIR}/sqltable.csv > ${OUTPUT1_STAGING_DIR}/sqltable.csv
18 Jul 2016 10:23:06,865 [INFO] (TaskRunnerService-resource:df-09799242T7UHHPMT072T_@ResourceId_x5OCd_2016-07-18T10:18:38-1) df-09799242T7UHHPMT072T amazonaws.datapipeline.objects.CommandRunner: configure ApplicationRunner with stdErr file: output/logs/df-09799242T7UHHPMT072T/ShellCommandActivityId_18OqM/@ShellCommandActivityId_18OqM_2016-07-18T10:18:38/@ShellCommandActivityId_18OqM_2016-07-18T10:18:38_Attempt=1/StdError  and stdout file :output/logs/df-09799242T7UHHPMT072T/ShellCommandActivityId_18OqM/@ShellCommandActivityId_18OqM_2016-07-18T10:18:38/@ShellCommandActivityId_18OqM_2016-07-18T10:18:38_Attempt=1/StdOutput
18 Jul 2016 10:23:06,866 [INFO] (TaskRunnerService-resource:df-09799242T7UHHPMT072T_@ResourceId_x5OCd_2016-07-18T10:18:38-1) df-09799242T7UHHPMT072T amazonaws.datapipeline.objects.CommandRunner: Executing command: output/tmp/df-09799242T7UHHPMT072T-de05e7a112c440b4a42df69d554d8a9a/ShellCommandActivityId18OqM20160718T101838Attempt1_command.sh with env variables :{INPUT1_STAGING_DIR=/media/ephemeral0/mnt/taskRunner/output/staging/df-09799242T7UHHPMT072T_input1_7c583c0755eb46f5b518feffa314fccd, OUTPUT1_STAGING_DIR=/media/ephemeral0/mnt/taskRunner/output/staging/df-09799242T7UHHPMT072T_output1_7c8b2db30c16473f844db5eb21cb000e} with argument : null
18 Jul 2016 10:23:06,952 [INFO] (TaskRunnerService-resource:df-09799242T7UHHPMT072T_@ResourceId_x5OCd_2016-07-18T10:18:38-1) df-09799242T7UHHPMT072T amazonaws.datapipeline.connector.staging.S3Helper: Begin Uploading local directory:output/staging/df-09799242T7UHHPMT072T_output1_7c8b2db30c16473f844db5eb21cb000e to S3 s3://s3-bucket/mysqlexport/
18 Jul 2016 10:23:06,977 [INFO] (TaskRunnerService-resource:df-09799242T7UHHPMT072T_@ResourceId_x5OCd_2016-07-18T10:18:38-1) df-09799242T7UHHPMT072T amazonaws.datapipeline.connector.staging.S3Helper: Begin Upload single file to S3:s3://s3-bucket/mysqlexport/sqltable.csv
18 Jul 2016 10:23:06,978 [INFO] (TaskRunnerService-resource:df-09799242T7UHHPMT072T_@ResourceId_x5OCd_2016-07-18T10:18:38-1) df-09799242T7UHHPMT072T amazonaws.datapipeline.connector.staging.S3Helper: Begin upload of file /media/ephemeral0/mnt/taskRunner/output/staging/df-09799242T7UHHPMT072T_output1_7c8b2db30c16473f844db5eb21cb000e/sqltable.csv to  S3 paths3://s3-bucket/mysqlexport/sqltable.csv
18 Jul 2016 10:23:07,040 [INFO] (TaskRunnerService-resource:df-09799242T7UHHPMT072T_@ResourceId_x5OCd_2016-07-18T10:18:38-1) df-09799242T7UHHPMT072T amazonaws.datapipeline.connector.staging.S3Helper: Completed upload of file /media/ephemeral0/mnt/taskRunner/output/staging/df-09799242T7UHHPMT072T_output1_7c8b2db30c16473f844db5eb21cb000e/sqltable.csv to  S3 paths3://s3-bucket/mysqlexport/sqltable.csv
18 Jul 2016 10:23:07,040 [INFO] (TaskRunnerService-resource:df-09799242T7UHHPMT072T_@ResourceId_x5OCd_2016-07-18T10:18:38-1) df-09799242T7UHHPMT072T amazonaws.datapipeline.connector.staging.S3Helper: Completed uploading of all files
18 Jul 2016 10:23:07,040 [INFO] (TaskRunnerService-resource:df-09799242T7UHHPMT072T_@ResourceId_x5OCd_2016-07-18T10:18:38-1) df-09799242T7UHHPMT072T amazonaws.datapipeline.connector.staging.S3Helper: Completed upload of local dir output/staging/df-09799242T7UHHPMT072T_output1_7c8b2db30c16473f844db5eb21cb000e to s3://s3-bucket/mysqlexport/
18 Jul 2016 10:23:07,040 [INFO] (TaskRunnerService-resource:df-09799242T7UHHPMT072T_@ResourceId_x5OCd_2016-07-18T10:18:38-1) df-09799242T7UHHPMT072T amazonaws.datapipeline.connector.staging.StageFromS3Connector: cleaning up directory /media/ephemeral0/mnt/taskRunner/output/staging/df-09799242T7UHHPMT072T_input1_7c583c0755eb46f5b518feffa314fccd
18 Jul 2016 10:23:07,050 [INFO] (TaskRunnerService-resource:df-09799242T7UHHPMT072T_@ResourceId_x5OCd_2016-07-18T10:18:38-1) df-09799242T7UHHPMT072T amazonaws.datapipeline.connector.staging.StageInS3Connector: cleaning up directory /media/ephemeral0/mnt/taskRunner/output/staging/df-09799242T7UHHPMT072T_output1_7c8b2db30c16473f844db5eb21cb000e
18 Jul 2016 10:23:07,051 [INFO] (TaskRunnerService-resource:df-09799242T7UHHPMT072T_@ResourceId_x5OCd_2016-07-18T10:18:38-1) df-09799242T7UHHPMT072T amazonaws.datapipeline.taskrunner.HeartBeatService: Finished waiting for heartbeat thread @DefaultShellCommandActivity1_2016-07-18T10:18:38_Attempt=1
18 Jul 2016 10:23:07,052 [INFO] (TaskRunnerService-resource:df-09799242T7UHHPMT072T_@ResourceId_x5OCd_2016-07-18T10:18:38-1) df-09799242T7UHHPMT072T amazonaws.datapipeline.taskrunner.TaskPoller: Work ShellCommandActivity took 0:0 to complete
  • Try doing it in simple steps: start simple and build up to see where the problem is. – TenG Jul 18 '16 at 11:11
  • It seems sed doesn't see the downloaded csv file in the staging input folder. I created a simple ShellCommandActivity pipeline to download the exported csv and run the script against it. It exited with the error sed: can't read /media/ephemeral0/mnt/taskRunner/output/staging/df-04031231KGO5SUFX4TZL_input1_8bd192f49d6d472f8cb47291eda9c970/sqltable.csv: No such file or directory. I then wrote a script to output ls -l ${INPUT1_STAGING_DIR} > ${OUTPUT1_STAGING_DIR}/test.txt and it shows total 168 -rw-rw-r-- 1 ec2-user ec2-user 168448 Jul 18 13:42 sqltable.csv. I'm clueless. (A fuller version of this check is sketched below.) – cirk0 Jul 18 '16 at 14:13
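Expanding on the ls check in the last comment, a slightly fuller debugging sketch is to dump whatever the activity actually sees in its staging directories to the output S3 node (the debug.txt name is arbitrary):

#!/bin/bash
{
  echo "input dir: ${INPUT1_STAGING_DIR}"
  ls -l "${INPUT1_STAGING_DIR}"
  echo "output dir: ${OUTPUT1_STAGING_DIR}"
  ls -l "${OUTPUT1_STAGING_DIR}"
} > "${OUTPUT1_STAGING_DIR}/debug.txt" 2>&1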

1 Answer


I'm not sure exactly what the problem was; either the bash script itself or ShellCommandActivity doesn't like the piping. Anyway, first of all, thanks to TenG for reminding me to analyze each step of the process separately instead of looking at it as a whole.

Basically, instead of downloading a script, I entered the sed and copy commands directly in the ShellCommandActivity's Command field:

sed -i -e 's/^,/\\\\N,/' -e 's/,,/,\\\\N,/g' -e 's/,$/,\\\\N/' ${INPUT1_STAGING_DIR}/sqltable.csv  
cp ${INPUT1_STAGING_DIR}/sqltable.csv ${OUTPUT1_STAGING_DIR}/sqltable.csv
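If you want to confirm the substitution actually ran before MySQL picks the file up, a simple check (my own addition, not part of the pipeline) is to count the \N markers in the staged output. Note also the doubled backslashes compared to the downloaded-script version; the Command field appears to consume one level of escaping, so \\\\N ends up as \N in the file.

grep -c '\\N' ${OUTPUT1_STAGING_DIR}/sqltable.csv   # should be greater than 0 if the sed ran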

Anyway, I'm still disappointed in AWS Data Pipeline: why claim MySQL support when it doesn't even recognise standard MySQL null markers like \N and NULL in a CSV file? It would be so easy and efficient if I could use Data Pipeline for both the export and the import, but unfortunately it's not suitable for the latter if your application distinguishes between empty fields and NULL values.
