2

Found a similar question but without an answer that worked succuessfully.

I need to select a sample of 50 of each status type within a single table.

TABLE1

MEMBER  STATUS
1234       A
1324       A
3424       R
3432       S
3232       R
2783       A
2413       S
4144       R
2387       S

I tried:

SEL Member, status FROM TABLE1 Qualify Row_Number ( ) OVER (PARTITION BY status ORDER BY random (1,10000)) <=50

As suggested in the previous question/answer but Teradata does not like RANDOM in an Aggregate or Ordered Analytical Function.

Community
  • 1
  • 1
Donna Panzarotti
  • 63
  • 2
  • 2
  • 8

2 Answers2

4

Since, as you say in the comments, you do not need to use random you can do this:

SEL     Member, status 
FROM    TABLE1 
QUALIFY ROW_NUMBER() OVER (
            PARTITION BY status 
            ORDER BY NULL) <= 50;

Edit: based on a question in the comments related to a spool-space error, an approach we can try when the table is large is to create a temporary table with a portion of the records from the original table.

CREATE MULTISET VOLATILE TABLE tmp_tbl AS (
    SEL     Member, status
    FROM    TABLE1
    WHERE   somefield = 'somecriterion'
) WITH DATA 
ON COMMIT PRESERVE ROWS;

And then try again:

SEL     Member, status 
FROM    tmp_tbl /* now from temporary table */
QUALIFY ROW_NUMBER() OVER (
            PARTITION BY status 
            ORDER BY NULL) <= 50;
mechanical_meat
  • 163,903
  • 24
  • 228
  • 223
  • Thank you, bernie. I believe your query will work but my table has over 100M records and I ran out of spool space before it completed. I did try to limit the query using a WHERE clause before the QUALIFY clause but still ran out of space. Is there any way to make this more efficient that you know of? Or do you think the WHERE clause would have made it less efficient? – Donna Panzarotti Jun 08 '12 at 16:44
  • You are welcome. If your `WHERE` clause removes a significant portion of the 100M records, you might try creating a temporary table and running the above query on the temp table. Let me know if you need an example of that. – mechanical_meat Jun 08 '12 at 16:52
  • I would love an example of that! – Donna Panzarotti Jun 08 '12 at 16:54
  • Thank you again, I really appreciate your time!! I used the create table but it came up empty. Also, the where clause that I used only eliminated about 3M rows, not enough to make a difference anyway. I will look to see how else I can limit the population but the create table didn't work. – Donna Panzarotti Jun 08 '12 at 17:27
  • You are right, 3M is not enough. The approach is sound if you can eliminate more records up front. If you would like to describe a little more about how the temporary table " came up empty" I might be able to help. After you run that statement you should see a message to this effect: "0 rows affected". Despite that misleading message, the data should be there. – mechanical_meat Jun 08 '12 at 17:40
  • Yes, the response was "Completed. 0 rows processed." – Donna Panzarotti Jun 08 '12 at 17:59
3

If you have a discrete number of Status values the following approach may work given that the TOP operator is processed after the WHERE clause has been evaluated by Teradata. TOP n is actually the preferred method over the Window Aggregate approach of QUALIFY ROW_NUMBER() OVER() or QUALIFY RANK () OVER() providing better performance at best or comparable performance at worst:

SELECT TOP 50 
       Member
     , Status
FROM Table1
WHERE Status = 'A'
UNION ALL
SELECT TOP 50 
       Member
     , Status
From Table1
WHERE Status = 'R'
UNION ALL
SELECT TOP 50 
       Member
     , Status
FROM Table1
WHERE Status = 'S';

You may be also have success using multiple queries with the SAMPLE clause filtering each query by the status code you wish to sample.

Rob Paller
  • 7,736
  • 29
  • 26
  • Thank you for your reply. I did end up running multiple queries with the Sample clause - that was easy enough with the handful of one of the statuses I had but I have another table with about 100 different possible values that I need to process. I'll give your code a shot and in the meantime, I have someone working on providing me more bandwidth to run these queries. It's not part of my regular day job so that level of access isn't already set up for me. Thanks again for the reply! – Donna Panzarotti Jun 12 '12 at 01:19
  • 2
    If I understand the question correctly, you want a "stratified sample". I'd suggest using SAMPLE rather than TOP; it may take a little longer to run but you can "defend" your sample methodology if challenged. And don't worry about UNIONing a query for each status value; it's the correct thing to do. – BellevueBob Jun 15 '12 at 22:07