DataPipeline: Use only first 4 values from CSV in pipeline

Question

I have a CSV, which has a variable structure, which I only want to take the first 4 values from. The CSV stored in S3 has between 7 and 8 fields in it, and I would like to take just the first 4. I have attempted to use the following prepared statement:

INSERT INTO locations (timestamp, item_id, latitude, longitude) VALUES (?, ?, ?, ?);

However I am getting:

Parameter index out of range (5 > number of parameters, which is 4).

Which I believe means that it is attempting to load in the other variables in the CSV. Is it possible to just take the first 4 values? Or otherwise deal with a variable length CSV?

score 2 · Answer 1 · answered May 19 '16 at 13:17

Use transformSql option. You didn't mention what you are loading into, from redshift docs :

The SQL SELECT expression used to transform the input data. When you copy data from DynamoDB or Amazon S3, AWS Data Pipeline creates a table called staging and initially loads it in there. Data from this table is used to update the target table. If the transformSql option is specified, a second staging table is created from the specified SQL statement. The data from this second staging table is then updated in the final target table. transformSql must be run on the table named staging and the output schema of transformSql must match the final target table's schema.

DataPipeline: Use only first 4 values from CSV in pipeline

1 Answers1