
For the sake of example, let's say I have a parquet file in S3 partitioned by the column date with the following format:

s3://my_bucket/path/my_table/date=*

So when I load the table using Spark, for example, it shows the following:

+---+-------+--------+
| id|country|    date|
+---+-------+--------+
|  1|  Italy|20200101|
|  2| France|20200101|
|  3|  Japan|20200102|
|  4|    USA|20200103|
+---+-------+--------+

I am trying to load this table into Redshift doing something like this:

create table my_table
(
 id BIGINT ENCODE az64,
 country VARCHAR(256) ENCODE lzo,
 date VARCHAR(256) ENCODE lzo
);

copy my_table
from 's3://my_bucket/path/my_table/'
iam_role 'my-role'
format as parquet
partition by 'date';

Is there a way to insert this data into Redshift using the COPY command or any other method?

*I already tried using a manifest file, but could not find a way to load the partition column as well.

Henrique Florencio

1 Answer


I see 2 ways of doing this:

  1. Perform N COPYs (one per currency) and manually set the currency column to the correct value with each COPY. Since the S3 key contains the currency name, it would be fairly easy to script this up. COPY all the data for one currency into the table, leaving the "currency" column NULL, then set the currency column to the correct value for that COPY with WHERE currency IS NULL. Repeat. You'll need to vacuum the table afterwards, so it's not ideal. (See the first sketch below.)

  2. Set up an external table (Spectrum) pointing to these S3 objects that is also partitioned. Since the partition key is now part of this external table, you just need an INSERT into the local table including "currency". This is cleaner, but you need to go through the effort of setting up the external schema and table with partitions. If this is a one-time thing then the down-and-dirty script might be the way to go. (See the second sketch below.)
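
Here is a rough sketch of option 1 for a single partition (20200101), reusing the table and role names from the question. It assumes your cluster accepts a column list with FORMAT AS PARQUET so the missing date column is left NULL; if it doesn't, COPY into a two-column staging table and INSERT ... SELECT instead. You would loop this over every date prefix and VACUUM when done:

-- load one partition's files, leaving date NULL
copy my_table (id, country)
from 's3://my_bucket/path/my_table/date=20200101/'
iam_role 'my-role'
format as parquet;

-- backfill the partition value for the rows just loaded
update my_table
set date = '20200101'
where date is null;

And a rough sketch of option 2, where spectrum_schema and spectrum_db are hypothetical names and one partition is registered explicitly (this can also be scripted, or the partitions can come from an existing Glue catalog; double-quote the date identifier if it clashes with a reserved word on your cluster). The partition column becomes queryable on the external table, so a single INSERT ... SELECT carries it into the local table:

create external schema spectrum_schema
from data catalog
database 'spectrum_db'
iam_role 'my-role'
create external database if not exists;

create external table spectrum_schema.my_table
(
 id BIGINT,
 country VARCHAR(256)
)
partitioned by (date VARCHAR(256))
stored as parquet
location 's3://my_bucket/path/my_table/';

-- register the partition so its files (and partition value) are visible
alter table spectrum_schema.my_table
add if not exists partition (date='20200101')
location 's3://my_bucket/path/my_table/date=20200101/';

-- load into the local table, including the partition column
insert into my_table
select id, country, date
from spectrum_schema.my_table;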

Both should work, just with different pros and cons.

Bill Weiner
  • I think I was not clear enough, so I have changed the column currency to date. In the real problem I have a lot more partitions and the data changes every day, so the first way would not be good because of performance issues, and for the second way, Spectrum is not available because of cost issues. – Henrique Florencio May 11 '22 at 13:29
  • Understood, and the partition value in S3 is now "date". If you have a lot of partitions and partition "columns", #1 will need some programming to automate the whole thing. Not too hard, but it would take time to write and verify. For #2, if this is just a replacement for a once-daily COPY I don't think the costs would be high. Or is this a policy against Spectrum in general? I'll keep thinking about ways to do this that aren't too slow. – Bill Weiner May 11 '22 at 16:13