
I have around 500 GB of compressed data in Amazon S3 that I want to load into Amazon Redshift. To do this, I created a table in AWS Athena and I am trying to load the data from it into an internal table in Amazon Redshift.

Loading this much data into Amazon Redshift takes more than an hour. The problem is that when I run the load query, it gets aborted after 1 hour. I tried it 2-3 times, and it is aborted after 1 hour every time. I am using the Aginity tool to run the query; Aginity still shows the query as currently running, with the loader spinning.

More details: the Redshift cluster has 12 nodes with 2 TB of space per node, and I have used 1.7 TB. The S3 files are not all the same size: one of them is 250 GB, and some are only a few MB.

I am using the command

create table table_name as select * from athena_schema.table_name

It stops exactly after 1 hour.

Note: I have set the current query timeout in Aginity to 90,000 seconds.
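
One thing I still want to rule out on the cluster side is whether a statement_timeout or a WLM query monitoring rule is killing the statement at 3,600 seconds. A minimal diagnostic sketch, assuming the standard Redshift parameter and system-table names (setting the timeout to 0 disables the limit for the session):

-- show the timeout currently in effect for this session (0 means no limit)
show statement_timeout;

-- raise or disable it for this session before re-running the load
set statement_timeout to 0;

-- check whether a WLM query monitoring rule aborted the query
select query, service_class, rule, action, recordtime
from stl_wlm_rule_action
order by recordtime desc
limit 20;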

RohanB
  • Why do it via Athena and not just run a COPY command? – AlexYes Apr 11 '18 at 06:26
  • I agree - a copy command is likely faster and cheaper! Anyway: is it EXACTLY 1 hour? What size is your Redshift cluster? Are you sure you have enough space? Is your S3 data split into small files? What size? Are they all about the same size? What command are you running exactly? If you are using Spectrum, what SerDe are you using (perhaps show your table definition)? Please answer as many of the above as possible and add this detail to your question. – Jon Scott Apr 11 '18 at 08:32
  • @JonScott I have added more details to this question. Thanks. – RohanB Apr 13 '18 at 16:51
  • Are the S3 files gzipped? Does that query include the 250 GB file? – Jon Scott Apr 13 '18 at 17:06
  • @JonScott Yes, the S3 files are gzipped and the gzipped size is 250 GB. – RohanB Apr 13 '18 at 18:14
  • You need to split out your files more; break that 250 GB file into ~1000 smaller files. – Jon Scott Apr 13 '18 at 18:15

2 Answers


I know this is an old thread, but for anyone coming here because of the same issue: I realised that, at least in my case, the problem was the Aginity client. It is not related to Redshift or its Workload Manager, only to that third-party client. In summary, use a different client such as SQL Workbench and run the COPY command from there.

Hope this helps! Carlos C.

More information about my environment:

Redshift:

Cluster Type: Multi Node
Cluster: ds2.xlarge
Nodes: 4
Cluster Version: 1.0.4852 

Client Environment:

Aginity Workbench for Redshift
Version 4.9.1.2686 (build 05/11/17)
Microsoft Windows NT 6.2.9200.0 (64-bit)

Network:

Connected over OpenVPN, via SSH port tunneling.
The connection is not being dropped and remains active; the issue only affects the COPY command.

Command:

copy tbl_XXXXXXX 
from 's3://***************'
iam_role 'arn:aws:iam::***************:role/***************';

S3 Structure:

120 files of 6.2 GB each, plus 20 files of 874 MB.

Output:

ERROR: 57014: Query (22381) cancelled on user's request

Statistics:

Start:    ***************
End:      ***************
Duration: 3,600.2420863 seconds
CContreras
  • By the way, maybe the Aginity client has the option to set this to a higher value (>3600 secs) or to disable it. I didn't check, since I had my SQL Workbench open already. – CContreras Nov 30 '18 at 16:07

I'm not sure whether the following will solve your exact problem of a timeout at exactly 1 hour, but based on my experience, loading data into Redshift via the COPY command is the best and fastest way, so I feel the timeout issue shouldn't happen at all in your case.

The COPY command in Redshift can load data from Amazon S3, Amazon EMR, DynamoDB, or remote hosts via SSH.

e.g. a simple COPY from Amazon EMR:

copy sales from 'emr://j-SAMPLE2B500FC/myoutput/part-*'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '\t' lzop;
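
The example above reads from Amazon EMR; for gzipped files already sitting in S3 (as in the question), a similar sketch, with a placeholder bucket, prefix, and role, and assuming tab-delimited data:

copy sales
from 's3://my-bucket/my-prefix/'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '\t' gzip;

Pointing COPY at a key prefix rather than a single object lets Redshift pick up all matching files and load them in parallel.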

e.g. using a manifest:

copy customer
from 's3://mybucket/cust.manifest' 
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest;

PS: If you use a manifest and divide your data into multiple files, the load will also be faster, because Redshift loads the files in parallel.
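
For reference, the manifest itself is just a small JSON file stored in S3 that lists the objects to load; a minimal sketch with hypothetical keys ("mandatory": true makes the COPY fail if that file is missing):

{
  "entries": [
    {"url": "s3://mybucket/cust.part.000.gz", "mandatory": true},
    {"url": "s3://mybucket/cust.part.001.gz", "mandatory": true}
  ]
}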

Red Boy
  • I tried using the COPY command, but now it aborts after 7 hr 20 min. I have a single file to load into Redshift, which is 250 GB in gzipped format. – RohanB Apr 16 '18 at 21:23
  • @RohanB In your case, I think you need to unzip the 250 GB file, split it into 3-4 files of say 70 GB each, zip them individually, and upload them to S3; then create the manifest file, upload it to S3, and fire the COPY command with the **manifest** option, similar to what I have suggested. You could also use an "ignore" (error-tolerance) option so that the few records that may get corrupted by the split are skipped and imported later. – Red Boy Apr 17 '18 at 06:45
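
Putting the comment above together into a single statement, a hedged sketch of what the resulting load could look like (the "ignore" option most likely maps to COPY's MAXERROR; the table, bucket, role, delimiter, and error count are placeholders):

copy table_name
from 's3://mybucket/split/table_name.manifest'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest gzip
delimiter '\t'
maxerror 100;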