0

How I should implement indexing strategy this fact table ? Contain about 5 mlns of rows

Marked here ZOrder by chassis where we have 2 files after optimize

Here is ZOrder by day_id. Separated it to 3 files for better presentation

Is worth to add bloom filter index here also ? If so, in which way ?

Alex Ott
  • 80,552
  • 8
  • 87
  • 132
Bartosz
  • 35
  • 4

1 Answers1

0

I personally would just start only with the ZOrder only on machine_id and day_id and test the performance. Given that you have only 5 million rows, then most probably will create a single file for it, so bloom filter won't make any sense.

P.S. Bloom filters are good when you have a lot of distinct values where you're searching by the equality.

Alex Ott
  • 80,552
  • 8
  • 87
  • 132
  • I tested alittle, how do You think, which scenario will be better ?. I can see ordering only on machine_id creates more sensible ranges when ordering by machine_id and day_id makes range looks weird. And then we do not have chance to hit fewer number of files. Look : – Bartosz Feb 11 '23 at 11:13
  • how many files have you got after you run the `optimize zorder by` ? – Alex Ott Feb 11 '23 at 11:15
  • before optimize 43 files and after optimize 7 files – Bartosz Feb 11 '23 at 13:00
  • You can add bloom filter on `machine_id` as it's a string value and you're search by exact value. But you need to test if it will improve performance - I'm really not sure that it will add significant performance improvement. – Alex Ott Feb 11 '23 at 13:58
  • So what You will choose from provided 3 configuration examples ? – Bartosz Feb 11 '23 at 18:18
  • as I said, I would start without bloom filter and test performance. if it's not enough, then I would test with bloom filter, but I doubt that it will add significant performance improvement – Alex Ott Feb 11 '23 at 18:57
  • I mean, which option will You choose for zordering here and why with day_id and not same machine_i. I can see ordering by machine only produce looking more clean ranges – Bartosz Feb 12 '23 at 14:45
  • When you get 2 variables then data will be better collocated. If you do only by machine, then day will be spreaded not do good – Alex Ott Feb 12 '23 at 17:01