1

say i have a order table, which contains multi time column(spend_time,expire_time,withdraw_time), usually,i will query the table with the above column independently,so how do i create the partitions?

order_no | spend_time   | expire_time   | withdraw_time | spend_amount  
A001     | 2017/5/1     | 2017/6/1      | 2017/6/2      | 100   
A002     | 2017/4/1     | 2017/4/19     | 2017/4/25     | 500   
A003     | 2017/3/1     | 2017/3/19     | 2017/3/25     | 1000  

Usually the business situation is to calculate total spend_amount between certain spend_time or expire_time or withdraw_time, or the combination of the 3.

But with 3 time dimensions cross combination(each has about 1000 partitions) can be a lot of partitions(1000*1000*1000),is that ok and efficient?

my solution is that i create 3 tables with 3 different columns.Is this a efficient way to solve this problem?

lei yu
  • 58
  • 6
  • Do you have the basic knowledge regarding partitions in Hive? – David דודו Markovitz Jun 02 '17 at 10:02
  • yes,but not much,the partitions are stored as file folders,if i create 3 partitions(A,B,C), then the table is stored like tablename/A-partition/B-partition/C-partition/file-block. My questions is that, if i only query where filter C or even deeper inside partitions ,will the query be efficient?(of course it's not as efficient as A due to the partition order) Is there a balance between partition numbers and performance? – lei yu Jun 05 '17 at 07:21
  • O.K. Please show a data sample and example of the partitions values you want to use – David דודו Markovitz Jun 05 '17 at 07:24
  • (1) Please delete the comment and add the additional information to your post (2) Give a data sample with multiple rows. (3) Is that your actual date format?(!) – David דודו Markovitz Jun 05 '17 at 07:32
  • Sorry about the format,i have add a picture about the data. It's a table with 3 time dimensions: spend_time,expire_time,withdraw_time. Sometimes i need to query with only one dimension,sometimes might be two, how do i arrange the partitions? Just by the frequency of the dimensions i used? But with 3 time dimensions cross combination(each has about 1000 partitions) can be a lot of partitions(1000*1000*1000),is that ok and efficient? – lei yu Jun 05 '17 at 07:42
  • i have re-format the question,please check it again,thanks! – lei yu Jun 05 '17 at 08:11
  • (1) How many records do you have in your table? (2) Is that the real date format you are using? (3) You can't partition an existing table. You'll have to create a new table. (4) One of the options is to partition by months (YYYY-MM). – David דודו Markovitz Jun 05 '17 at 12:14
  • About 30million, not very big,but used very frequent, partition by months might be a good choice. – lei yu Jun 06 '17 at 13:40

0 Answers0