5

When I unload a table from amazon redshift to S3, it always splits the table into two parts no matter how small the table. I have read the redshift documentation regarding unloading, but no answers other than it says sometimes it splits the table (I've never seen it not do that). I have two questions:

  • Has anybody every seen a case where only one file is created?

  • Is there a way to force redshift to unload into a single file?

tshepang
  • 12,111
  • 21
  • 91
  • 136
Elm
  • 1,355
  • 6
  • 22
  • 33

3 Answers3

4

Amazon recently added support for unloading to a single file by using PARALLEL OFF in the UNLOAD statement. Note that you still can end up with more than one file if it is bigger than 6.2GB.

1

By default, each slice creates one file (explanation below). There is a known workaround - adding a LIMIT to the outermost query will force the leader node to process whole response - thus it will create only one file.

SELECT * FROM (YOUR_QUERY) LIMIT 2147483647;

This only works as long as your inner query returns fewer than 2^31 - 1 records, as a LIMIT clause takes an unsigned integer argument.

How files are created? http://docs.aws.amazon.com/redshift/latest/dg/t_Unloading_tables.html

Amazon Redshift splits the results of a select statement across a set of files, one or more files per node slice, to simplify parallel reloading of the data.

So now we know that at least one file per slice is created. But what is a slice? http://docs.aws.amazon.com/redshift/latest/dg/t_Distributing_data.html

The number of slices is equal to the number of processor cores on the node. For example, each XL compute node has two slices, and each 8XL compute node has 16 slices.

It seems that the minimal number of slices is 2, and it will grow larger when more nodes or more powerful nodes is added.

Tomasz Tybulewicz
  • 8,487
  • 3
  • 42
  • 44
  • I tried to use "select ... limit" in unload clause, it says "ERROR: Limit clause is not supported" – ciphor Dec 02 '13 at 08:14
  • @ciphor, that is something new, maybe some change at Dec 1? I've tested that wrapping all the query inside another SELECT * FROM () is working, but I wasn't spending a lot of time on that - I know that I got single file, but for how long will it work? I have no idea... – Tomasz Tybulewicz Dec 02 '13 at 09:12
  • @TomaszTybulewicz Do you have a documentation link for your comment about using a LIMIT to force only the leader node to do the processing and the 2^31-1 records? I can't seem to find it anywhere in the docs. – Brent Writes Code Jun 24 '15 at 17:10
1

As of May 6, 2014 UNLOAD queries support a new PARALLEL options. Passing PARALLEL OFF will output a single file if your data is less than 6.2 gigs (data is split into 6.2 GB chunks).

Evan
  • 2,983
  • 8
  • 31
  • 35