
I am currently working on a project that collects a customer's demographics weekly and stores the delta (from previous weeks) as a new record. This process will encompass 160 variables and a couple hundred million people (my management and a consulting firm require this, although ~100 of the variables are seemingly useless). These variables will be collected from 9 different tables in our Teradata warehouse.

I am planning to split this into 2 tables.

  1. Table with often-used demographics (~60 variables sourced from 3 tables)
    • Normalized (one customer ID and add date for each demographic variable)
  2. Table with rarely used or unused demographics (~100 variables sourced from 6 tables)
    • Normalized (one customer ID and add date for each demographic variable)

MVC (multi-value compression) is utilized to save as much space as possible, as the database this will live on is limited in size due to backup limitations. (Note that the customer ID currently consumes 30% (3.5 GB) of table 1's size, so additional tables would add that storage cost.)
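A minimal DDL sketch of the planned two-table split, under stated assumptions: the table names (db1.demo_core, db1.demo_rare), the handful of demographic columns, and the COMPRESS value lists are placeholders standing in for the real 160 variables, and the actual MVC lists would be tuned from the data:

CREATE MULTISET TABLE db1.demo_core          -- often-used demographics (~60 columns in practice)
(
    cus_id      INTEGER NOT NULL,
    add_dt      DATE NOT NULL,
    gender      CHAR(1)  COMPRESS ('M','F'),            -- MVC on the most frequent values
    state_cd    CHAR(2)  COMPRESS ('CA','TX','NY','FL'),
    income_band BYTEINT  COMPRESS (1,2,3,4,5)
)
PRIMARY INDEX (cus_id);

CREATE MULTISET TABLE db1.demo_rare          -- rarely used demographics (~100 columns in practice)
(
    cus_id       INTEGER NOT NULL,
    add_dt       DATE NOT NULL,
    pet_owner_fl CHAR(1) COMPRESS ('Y','N')
)
PRIMARY INDEX (cus_id);                      -- same PI as demo_core keeps any join AMP-local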

The table(s) will be accessed by finding the most recent record in relation to the date the Analyst has selected:

SELECT cus_id, demo
    FROM db1.demo_test
    WHERE (cus_id, add_dt) IN (
        SELECT cus_id, MAX(add_dt)
            FROM db1.demo_test            -- same table: find each customer's latest record
            WHERE add_dt <= '2013-03-01'  -- Analyst-selected point-in-time date
         GROUP BY 1)
GROUP BY 1,2

This data will be used for modeling purposes, so a reasonable SELECT speed is acceptable.

  1. Does this approach seem sound for storage and querying?
    • Is any individual table too large?
  2. Is there a better suggested approach?
    • My concerns with splitting further are
      • Space, due to uncompressible fields such as dates and customer IDs
      • Speed when joining 2-3 tables (I suspect an inner join may use very few resources.)

Please excuse my ignorance in this matter. I usually work with large tables that do not persist for long (I am a Data Analyst by profession), or the tables I build for long-term data collection only contain a handful of columns.

Dustin

2 Answers


The width of the table at 160 columns, sparsely populated, is not necessarily an incorrect physical implementation (normalized in 3NF or slightly de-normalized). I have also seen situations where attributes that are not regularly accessed are moved to a documentation table. If you elect to implement the latter in your physical design, it would be in your best interest that each table share the same primary index. This allows the join of these two tables (60 attributes and 100 attributes) to be AMP-local on Teradata.
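As an illustration of the AMP-local join (the table and column names are assumptions mirroring the sketch in the question), joining the two pieces on the shared primary index plus add_dt avoids any row redistribution:

SELECT c.cus_id,
       c.gender,
       r.pet_owner_fl
FROM db1.demo_core AS c
INNER JOIN db1.demo_rare AS r
    ON  c.cus_id = r.cus_id      -- PI-to-PI join: matching rows are co-located on the same AMP
    AND c.add_dt = r.add_dt
WHERE c.add_dt <= DATE '2013-03-01';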

If access to the table(s) will also include the add_dt column, you may wish to create a partitioned primary index on this column. This will allow the optimizer to eliminate the other partitions from being scanned when the add_dt column is included in the WHERE clause of a query. Another option would be to test the behavior of a value-ordered secondary index on the add_dt column.
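Sketches of both options, reusing the hypothetical db1.demo_core table from the question; the RANGE_N boundaries and the monthly interval are placeholders to be adjusted to the real date range:

-- Option 1: partitioned primary index on add_dt, declared when the table is created
CREATE MULTISET TABLE db1.demo_core_ppi
(
    cus_id INTEGER NOT NULL,
    add_dt DATE NOT NULL,
    gender CHAR(1) COMPRESS ('M','F')
)
PRIMARY INDEX (cus_id)
PARTITION BY RANGE_N(add_dt BETWEEN DATE '2013-01-01'
                                AND DATE '2023-12-31'
                                EACH INTERVAL '1' MONTH);

-- Option 2: value-ordered secondary index on add_dt on the existing table
CREATE INDEX (add_dt) ORDER BY VALUES (add_dt) ON db1.demo_core;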

Rob Paller
  • Thanks for the follow-up! Yes, I have a PPI using the add date at the moment. My main concern with the collection of data and the PPI is that the changes will be slow (due to the contract with our demographic provider and people changing relatively infrequently), so the secondary index may provide better performance until the table matures over several years. I will research the value-ordered secondary index suggestion to see what benefit it may add over the PPI. – Dustin Sep 17 '13 at 19:23

In addition to Rob's remarks:

What is your current PI/partitioning?

Is the current performance unsatisfactory?

How do the analysts access the data besides the point-in-time date? Are there any other common conditions?

Depending on your needs, a (prev_dt, add_dt) pair might be better than a single add_dt. It is more overhead to load, but querying might be as simple as date ... BETWEEN prev_dt AND add_dt.
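A sketch of that access pattern, assuming a hypothetical prev_dt column so that each row's pair of dates bounds the interval for which it is the current record:

-- Point-in-time lookup without a MAX() subquery
SELECT cus_id, demo
FROM db1.demo_test
WHERE DATE '2013-03-01' BETWEEN prev_dt AND add_dt;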

A Join Index on (cus_id), (add_dt) might be helpful, too.
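One possible form of that suggestion, sketched as a single-table join index (the index name and column list are assumptions) with cus_id as its primary index and the rows value-ordered by add_dt:

CREATE JOIN INDEX db1.demo_test_ji AS
SELECT cus_id, add_dt, demo
FROM db1.demo_test
PRIMARY INDEX (cus_id)
ORDER BY VALUES (add_dt);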

You might replace the MAX() subquery with RANK (MAX is usually slower; only when cus_id is the PI might RANK be worse):

SELECT *
FROM db1.demo_test
WHERE add_dt <= '2013-03-01'   -- Analyst-selected point-in-time date, as in the original query
QUALIFY
  RANK() OVER (PARTITION BY cus_id ORDER BY add_dt DESC) = 1

In TD14 you might split your single table into two row containers of a column-partitioned table.
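A rough sketch of what that could look like (column names are placeholders; TD14 column-partitioned tables must be NoPI, and the actual grouping would depend on which columns are queried together):

CREATE MULTISET TABLE db1.demo_cp
(
    cus_id       INTEGER NOT NULL,
    add_dt       DATE NOT NULL,
    gender       CHAR(1),
    pet_owner_fl CHAR(1)
)
NO PRIMARY INDEX
PARTITION BY COLUMN (
    ROW(cus_id, add_dt, gender),   -- often-used columns in one row container
    ROW(pet_owner_fl)              -- rarely used columns in the other
);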

...

dnoeth
  • We are currently on TD 13.10, and 14 will not be online until late next year (best case). Current PI is cus_id and add_dt. I am building the table at the moment, so I am trying to make sure it is done "correctly" the first time. All that is on the table is cus_id, add_dt and the demographic variables, so those 2 are the most common conditions. I will explore prev_dt, the join index, and RANK (although cus_id is part of the PI). – Dustin Sep 17 '13 at 19:37
  • Is the PI both columns: (cus_id, add_dt) partitioned by add_dt? Only cus_id as NUPI might be better, distribution should not change, but table must be defined as MultiSet. – dnoeth Sep 17 '13 at 19:46
  • Thanks for the supplemental information, @dnoeth. Insightful as always. :) – Rob Paller Sep 17 '13 at 20:00
  • Yes it is cus_id,add_dt and partitioned by add_dt. I could leverage upd_dt from the base demographic tables (all 3 tables dates match) for the initial population to help the distribution of add_dt. What is the expected real world gain? – Dustin Sep 17 '13 at 20:03
  • PI(cus_id) will facilitate your point-in-time query because all rows of a customer are stored on a single AMP. – dnoeth Sep 17 '13 at 20:30
  • dnoeth, Is your suggestion for MultiSet due to the lack of overhead from the duplicate row check, since I wouldn't be using a UPI and it wouldn't be necessary? (I should just manage the potential (unlikely) duplicate records on insert?) Thank you for your help with my issue. – Dustin Sep 18 '13 at 12:30
  • Yes, I don't know how many rows per customer exist; if it's a 3-or-more-digit number you simply need MultiSet to avoid the duplicate row checks consuming lots of CPU during Inserts/Updates. – dnoeth Sep 18 '13 at 13:17
  • Thanks again for your and Rob's help. I will need to do more testing as the table matures, but based on my initial build the NUPI as cus_id with a Partition on add_dt is the fastest based on the query in my original post (with the suggested modifications). The difference is 10 seconds vs 20 seconds, so inconsequential at the moment. – Dustin Sep 18 '13 at 15:16
  • Wall Clock time is only part of the equation and will vary based on the system's load at any point in time. Look at the differences in AMP CPU Time and TotalIOCount in the Database Query Logs. – Rob Paller Sep 19 '13 at 00:23
  • Rob, I understand that the DBS Time (what I quoted) is affected by load, so I did further testing after hours (only one on the server) to try to get a better read on performance, as I don't have access to the query logs at the moment. I'm working with my DBAs to get access to the QryLog table, but it may take some time. – Dustin Sep 19 '13 at 13:07
  • Just following up with Query Log information: A NUPI of cus_id and a Value Ordered NUSI add_dt uses substantially more IO, a little more spool and less CPU time than a NUPI of cus_id and a Partition of add_dt. It seems like each is a trade off of CPU Time vs IO/Spool. – Dustin Sep 19 '13 at 19:49