
I have big-data events (TBs of them) that I need to query, and I am trying to partition them correctly.

I have clients, and each client has many games. The problem is that some of the fields we query on may be null in some events, so they cannot be used as partition keys (for example: segment).

I thought about 2 strategies:

  1. partition by client/game/date (S3)
  2. a different table per client or game, partitioned only by date, in different buckets

Option 1 is simple: I filter in the WHERE clause. Option 2 will require UNIONs.
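For concreteness, here is a minimal sketch of what the two layouts look like as S3 object keys. The bucket names, field values, and file names are hypothetical, not taken from the question:

```python
# Sketch of the two partitioning strategies as S3 key layouts.
# Bucket names and field values are hypothetical.

def option1_key(client, game, date, filename):
    """Option 1: one table, Hive-style partitions client/game/date."""
    return f"s3://events/client={client}/game={game}/date={date}/{filename}"

def option2_key(client, date, filename):
    """Option 2: one bucket (table) per client, partitioned by date only."""
    return f"s3://events-{client}/date={date}/{filename}"

print(option1_key("acme", "chess", "2019-06-19", "part-0.parquet"))
print(option2_key("acme", "2019-06-19", "part-0.parquet"))
```

With option 1 a single table covers everything and predicates on client, game, and date all prune partitions; with option 2 each client's table must be queried (and UNIONed) separately.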

What is the correct way to partition such data? And by correct I mean most efficient and most cost-effective.

Regards, Ido

  • I think a `segment` can be used as a partition key even if it's nullable. In the partition name a null value should be replaced with `__HIVE_DEFAULT_PARTITION__`. – Piotr Findeisen Jun 19 '19 at 07:31
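Piotr's point can be sketched in a few lines: Hive-style writers render a null partition value as the `__HIVE_DEFAULT_PARTITION__` sentinel in the directory name, so a nullable column can still be a partition key. The function name here is a hypothetical illustration:

```python
# Hive-style partition directory naming; a null value is encoded with the
# __HIVE_DEFAULT_PARTITION__ sentinel rather than being disallowed.
HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"

def partition_dir(column, value):
    """Render one partition directory, substituting the sentinel for null."""
    return f"{column}={HIVE_DEFAULT_PARTITION if value is None else value}"

print(partition_dir("segment", "paying"))  # segment=paying
print(partition_dir("segment", None))      # segment=__HIVE_DEFAULT_PARTITION__
```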

1 Answer


As described, the events follow this shape: multiple clients, each client with multiple games, and each game with multiple events that can be partitioned by date.

Now, the event schema may differ between games, which is why queries across games can return null values. The schema does not depend on the client: for different clients and the same game, the event schema should be the same.

So, between client/game/date and game/client/date, the better choice is game/client/date, because after the first level of partitioning the event schema is uniform. For queries that do not filter on the game field, the partition order makes no difference, but when the game partition field is used in a query, this layout results in higher efficiency.
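The pruning effect described above can be shown with a toy model that enumerates partition directories and counts how many a query must scan; the field values are hypothetical:

```python
# Toy model of partition pruning: enumerate Hive-style partition paths and
# see how many remain after an equality predicate. Values are hypothetical.
from itertools import product

def partitions(order, values):
    """All partition paths for a given partition-key order."""
    return ["/".join(f"{key}={val}" for key, val in zip(order, combo))
            for combo in product(*(values[key] for key in order))]

def prune(paths, predicate):
    """Partitions left to scan after a predicate such as 'game=chess'."""
    return [p for p in paths if predicate in p]

values = {"game": ["chess", "poker"],
          "client": ["acme", "beta"],
          "date": ["2019-06-18", "2019-06-19"]}

all_parts = partitions(["game", "client", "date"], values)
scanned = prune(all_parts, "game=chess")
# With game as the first-level key, a game filter discards half the tree
# at the very first directory listing.
print(len(all_parts), len(scanned))
```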