0

what happen if create table X (...) clustered by(date) sorted by (time)

but inserted without sort insert into x select * from raw

Will data be sorted after fetched from raw before inserted?

If unsorted data inserted What does "sorted by" do in create table statement. It works just hint for later select queries?

jbapple
  • 3,297
  • 1
  • 24
  • 38
Jihun No
  • 1,201
  • 1
  • 14
  • 29

2 Answers2

3

The documentation explains:

The CLUSTERED BY and SORTED BY creation commands do not affect how data is inserted into a table – only how it is read. This means that users must be careful to insert data correctly by specifying the number of reducers to be equal to the number of buckets, and using CLUSTER BY and SORT BY commands in their query.

I think it is clear that you want to insert the data sorted if you are using that option.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
1

No, the data will not be sorted.

As another answer explains, the SORTED BY and CLUSTERED BY options do not change how data will be returned from queries. While the documentation is technically accurate, the purpose of CLUSTER BY is to write underlying data to HDFS in a way that will make subsequent queries faster in some cases. Clustering (bucketing) is similar to partitioning as it allows the query processor to skip reading rows ... If the cluster is chosen wisely. A common use of buckets is sampling data, where you explicitly include only certain buckets, thereby avoiding reads against those excluded.

Tom Harrison
  • 13,533
  • 3
  • 49
  • 77
  • thanks @Tom Harrison Jr. then what will be the best insert query for the fast select time if table created by "create table aaa ... clustered by (x,y ) into 256 bucket". is "insert overwrite aaa select * from bbb distribute by (x,y)" ok? – Jihun No Apr 28 '16 at 09:33
  • It entirely depends on your data and on your needs. It's unclear from your example, but the act of creating a table schema is trivial -- the work depends largely on how much data you write and what queries you'll run. You'll need to try various options and compare write performance, the read/query performance. There are literally hundreds of factors that can affect the outcome. – Tom Harrison Apr 30 '16 at 02:23