We are using AWS Data Pipeline to push data from S3 to an Amazon Redshift table. We are pushing the data using a JSONPaths mapping file:

{
  "jsonpaths": [
    "$['data1']",
    "$['data1']",
    "$['data1']['data1']",
    "$['data1']['data1']",
    "$['data1']['data1']",
    "$['data1']['data1']",
    "$['data1']",
    "$['data1']",
    "$['data1']",
    "$['data1']['data1']",
    "$['data1']['data1']"
  ]
}

I recently wanted to add a created_at column to my Redshift table, which should take the current timestamp. I have added the column to the Redshift table with its default value set to the current timestamp. I have not defined this column in the JSON mapping file (for obvious reasons). But the pipeline is failing, stating that the number of columns and the JSON mapping do not match. I have tried reading multiple documents but have not found a proper solution for this. Can anyone help me with how to insert a default current timestamp into a Redshift table from S3 using AWS Data Pipeline?

user3681970

1 Answer

There is an option in the COPY command to specify the columns you wish to populate. We use this in our ETL, and it fills in the default value for any skipped columns.

COPY my_tablename
  (C1
  ,C2
  ,C3, ...
  )
FROM 's3://<<YOUR-BUCKET>>/<<YOUR-FILE>>'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>';

So the same could be applied in the case of the pipeline, like below, though I have not tested this solution:

copy My_table(C1,C2,C3,...)
from 's3://mybucket/category_object_paths.json'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' 
json 's3://mybucket/category_jsonpath.json';

EDIT After Comment

There is a transformSql option on RedshiftCopyActivity that you can use to copy selective columns.

Here is the reference: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftcopyactivity.html
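
For reference, here is a minimal, untested sketch of how the transformSql field might sit on the RedshiftCopyActivity object in the pipeline definition. The ids, node refs, and column names are placeholders, and the SELECT is assumed to run against the staging table the activity creates:

{
  "id": "RedshiftCopyActivityId",
  "type": "RedshiftCopyActivity",
  "schedule": { "ref": "ScheduleId" },
  "input": { "ref": "S3InputDataNodeId" },
  "output": { "ref": "RedshiftOutputDataNodeId" },
  "runsOn": { "ref": "Ec2ResourceId" },
  "insertMode": "APPEND",
  "transformSql": "SELECT c1, c2, c3, SYSDATE AS created_at FROM staging"
}

The idea is that the SELECT appends the timestamp column (SYSDATE, as in the comment below), so the data handed to Redshift lines up with the target table's columns.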

Red Boy
  • Thanks @Red Boy. I have seen this solution in the AWS docs as well. However, I am not sure how and where we should add this data pipeline configuration. Do you have any idea on this? – user3681970 Jun 21 '18 at 07:26
  • Great. Transform SQL shows the way forward. I have been stuck on this for a couple of hours. Now I am not sure how to select the data from staging and append a datetime to it using transform SQL. Any idea will help. – user3681970 Jun 21 '18 at 09:35
  • Selecting the default should be easy, unless you have some complex case; it could be done with something like `select c1, c2, sysdate as current_date **here is the default** from stg_test`. – Red Boy Jun 21 '18 at 09:50
  • It's not working. Transform SQL is a pain. I have one redshift-copy-activity which has the destination Redshift table and the input as S3, with a transform SQL. Why on earth does the pipeline run the number-of-columns check before applying the transform SQL on the staging table? I don't know whether I am missing something, but of course this simple thing does not warrant this much time. – user3681970 Jun 21 '18 at 14:02