
I'm starting to work with Netezza through Aginity to handle millions of records per table. I've been reading up online on how to optimize tables so that loading the data and querying it will be relatively quick, and I have a few questions.

1) One thing I read was that the distribution you pick when creating a table will affect how fast you can query it. For the most part, is it best to distribute on the primary key? And if a table has no primary key but does have foreign keys, is it better to distribute on both foreign keys or just one?
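For reference, this is the kind of DDL I'm asking about (the table and column names are just placeholders I made up, not a real table of mine):

CREATE TABLE SALES_FACT
(
    SALE_ID     BIGINT,
    CUSTOMER_ID INTEGER,       -- foreign key to a customer dimension
    PRODUCT_ID  INTEGER,       -- foreign key to a product dimension
    SALE_DATE   DATE,
    AMOUNT      NUMERIC(12,2)
)
DISTRIBUTE ON (CUSTOMER_ID);   -- vs. DISTRIBUTE ON (CUSTOMER_ID, PRODUCT_ID) or DISTRIBUTE ON RANDOM

So the question is really whether the primary key, one foreign key, or both foreign keys belong in that DISTRIBUTE ON clause.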

2) I read that you can see the skew of your table (which I guess depends a lot on your distribution) by running this query:

SELECT COUNT(*), datasliceid FROM SCHEMA.TableName GROUP BY datasliceid ORDER BY datasliceid;

In doing this, I noticed that the minimum number of rows per datasliceid was 21,530 and the maximum was 22,456, across the 240 datasliceids that were returned. Is this good, or should the data be distributed much more evenly?
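(The min/max above came from eyeballing that output; an aggregate like the following, which just wraps the same per-slice counts, gives the spread in one row. The ratio column is my own, not something built into Netezza.)

SELECT MIN(CNT)                   AS MIN_ROWS_PER_SLICE,
       MAX(CNT)                   AS MAX_ROWS_PER_SLICE,
       MAX(CNT)::FLOAT / MIN(CNT) AS MAX_TO_MIN_RATIO
FROM (
    SELECT datasliceid, COUNT(*) AS CNT
    FROM SCHEMA.TableName
    GROUP BY datasliceid
) PER_SLICE;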

3) Last but not least, I'm trying to run:

GENERATE STATISTICS ON SCHEMA.TABLENAME;

inside of Aginity, but when I do, I just get the message "The command completed successfully" and nothing else shows up. How do I view the statistics that were generated?

tee
1 Answer


I'll give it a try:

1) Only worry about distribution for REALLY large tables (go RANDOM on all the small/medium ones). Apart from that: only distribute on one column, and only if you join on it a LOT.

2) That distribution is not skewed in my opinion. Don't worry.

3) When you update statistics in a database, it's primarily for the benefit of the database optimizer, so it can make better choices when you run (complex) SQL against it. You can, however, read that info back from the catalog tables afterwards. Queries for that can be found online :)
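To make point 3 concrete, a rough sketch (the catalog view and filter column names below are from memory, so double-check them against your NPS release before relying on them):

-- Statistics can also be refreshed for selected columns only (placeholder column names)
GENERATE STATISTICS ON SCHEMA.TABLENAME (COLUMN1, COLUMN2);

-- Read the per-column metadata back from the catalog afterwards
-- (view and filter column names from memory -- verify on your system)
SELECT *
FROM _V_RELATION_COLUMN
WHERE NAME = 'TABLENAME';

GENERATE STATISTICS itself never returns rows, which is why Aginity only reports that the command completed; querying the _v_* catalog views is how you look at what it maintains.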

Lars G Olsen
  • What is considered a REALLY large table? For example, I have a table here with 57 columns and over 34 million rows. – tee Oct 22 '18 at 11:24
  • Since you mention 240 data slices, it's a Mako/Striper -010, and on those, 34 million rows is only 'medium'. Approximately 10 million evenly distributed rows will result in only one I/O operation per disk. – Lars G Olsen Oct 23 '18 at 12:19
  • So is this the best I can do, or are you saying it's bad? – tee Oct 23 '18 at 12:24
  • It's probably not significant, but the only way to determine it is to create a few SQL statements for testing, then make multiple copies of the involved tables and measure the difference. But only try to distribute on one of the columns you join on, and compare that to 'random' (see the first sketch after these comments). – Lars G Olsen Oct 24 '18 at 05:32
  • More important: have you worked with 'organize on' + 'groom records all'? This is designed to reduce the number of I/Os, and in my experience that's more likely to be the bottleneck. Choose a column you put simple where-clauses on (date/time is a good guess); see the second sketch after these comments. – Lars G Olsen Oct 24 '18 at 05:35
  • Sorry, I'm confused. Would we want to reduce the number of I/Os or increase it? Are you recommending that I use 'organize on' and 'groom records all' to reduce the number of I/Os? – tee Oct 26 '18 at 17:19
  • Always decrease disk I/O. Read up on zone maps in Netezza. It's a really powerful feature, especially if you try to optimize the use of them. But only do that when you have a problem; most queries run nicely on Netezza anyway. – Lars G Olsen Oct 27 '18 at 11:07
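First sketch, to expand on the testing suggestion above (the copy names and the CUSTOMER_ID column are placeholders, not from the thread): make two copies of the table with CTAS, one distributed on a column you join on and one distributed randomly, then run the same joins against both copies and compare plans and timings.

-- Copy distributed on a column that is joined on a lot (placeholder column name)
CREATE TABLE TABLENAME_BYKEY AS
SELECT * FROM SCHEMA.TABLENAME
DISTRIBUTE ON (CUSTOMER_ID);

-- Copy distributed randomly (round-robin across data slices)
CREATE TABLE TABLENAME_RANDOM AS
SELECT * FROM SCHEMA.TABLENAME
DISTRIBUTE ON RANDOM;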
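Second sketch, for the 'organize on' + 'groom records all' suggestion (with a made-up SALE_DATE column as the organizing key): declaring organizing columns and then grooming rewrites the rows in that order, which makes the zone maps on those columns much more selective, so queries with simple where-clauses on them read far fewer disk pages.

-- Organize on a column that simple where-clauses filter on (placeholder column name)
ALTER TABLE SCHEMA.TABLENAME ORGANIZE ON (SALE_DATE);

-- Physically reorder the existing rows according to the organizing key(s)
GROOM TABLE SCHEMA.TABLENAME RECORDS ALL;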