
UPDATE: I'm looking for a technique to compute data for all edge cases of my algorithm (or of an arbitrary algorithm, for that matter).
What I have tried so far is thinking about possible edge cases and producing some "random" data, but I don't know how to be more confident that I haven't missed something real users will manage to mess up.

I want to check that I didn't miss anything important in my algorithm, and I don't know how to generate test data to cover all possible situations:

The task is to report snapshots of the data for every event_date, but to make a separate row for edits that may belong to the next event_date - see Group 2 in the input and output data illustration:

[input and output data illustration]

My algorithm:

  1. make a list of event_dates and compute the next_event_date for each of them
  2. join the results to main_audit_table and compute the biggest transaction_id for each snapshot (Groups 1-4 in my illustration) - grouped by id, by event_date, and by whether transaction_date < next_event_date is true or not
  3. join main_audit_table to the results to get the other data from the same transaction_id
  4. join costs_audit_table to the results - use the biggest transaction_id that is smaller than the transaction_id from the result

My question(s):

  • How can I generate test data that covers all possible scenarios, so I know I got the algorithm right?
  • Can you see any mistakes in my algorithm's logic?
  • Is there a better forum for this kind of question?

My code (that needs to be tested):

select
    snapshots.id,
    snapshots.event_date,
    main.event,
    main.transaction_date as last_change,
    costs.costs as costs_2012
  from (
    --snapshots that return correct transaction ids grouped by event_date
    select
      main_grp.id,
      main_grp.event_date,
      max(main_grp.transaction_id) main_transaction_id,
      max(costs_grp.transaction_id) costs_transaction_id
    from main_audit_table main_grp
    join (
      --list of all event_dates and their next_event_dates
      select
        id,
        event_date,
        coalesce(lead(event_date) over (partition by id order by event_date),
                 '1.1.2099') next_event_date --placeholder "infinity" date for the last event
      from main_audit_table
      group by id, event_date
    ) list on list.id = main_grp.id and list.event_date = main_grp.event_date
    left join costs_audit_table costs_grp
      on costs_grp.id = main_grp.id and
         costs_grp.year = 2012 and
         costs_grp.transaction_id <= main_grp.transaction_id
    group by
      main_grp.id,
      main_grp.event_date,
      case when main_grp.transaction_date < list.next_event_date
           then 1
           else 0 end
  ) snapshots
  join main_audit_table main
    on main.id = snapshots.id and
       main.transaction_id = snapshots.main_transaction_id
  left join costs_audit_table costs
    on costs.id = snapshots.id and
       costs.transaction_id = snapshots.costs_transaction_id
Aprillion
  • Can you clarify how this data is modeled, and how you managed to assign those groups? – Kodra Apr 24 '12 at 20:55
  • @Kodra as for a model - they are *IBM Tivoli Service Request Manager* audit tables (a_workorder with dozens of custom fields) + custom audit tables - without up-to-date documentation, and my reverse engineering skills are as good as yours. – Aprillion Apr 24 '12 at 22:20
  • @Kodra the group assignment should be clear from point 2 of my algorithm - if not, please tell me what exactly is unclear so I can rephrase it, thanks – Aprillion Apr 24 '12 at 22:21
  • Are you really storing dates as strings, or is that the 'default' representation of a date on your system (`1.1.2099`)? – Clockwork-Muse Apr 25 '12 at 00:00
  • @X-Zero it is, but I guess I will use `DATE(TIMESTAMP_FORMAT())` in the production code, thanks – Aprillion Apr 25 '12 at 05:49

1 Answer


Common table expressions (CTEs) are not only a good way to bury complexity and reduce the duplication of longer SQL snippets, but also an easy way to represent test data as if it came from a permanent table. At the very least, CTEs will focus the major components of your query at the top, allowing you to refer to them by their label throughout the rest of the statement. Graeme Birchall's DB2 SQL Cookbook (a well-maintained, free e-book) has some good examples of this and other advanced SQL patterns. Joe Celko is another good source for ideas on how to make SQL do more of the heavy lifting for you.
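As a minimal sketch of that idea (the column names and sample rows below are only guesses inferred from the query in your question - adjust them to the real table definitions), the two audit tables can be mocked as CTEs built from VALUES rows. The query under test then reads the hand-picked rows as if they came from the permanent tables, because a CTE name takes precedence over a table of the same name within the statement:

with main_audit_table (id, event_date, transaction_id, transaction_date, event) as (
  values
    --ordinary case: the edit happens before the next event_date
    (1, date('2012-01-01'), 100, date('2012-01-01'), 'created'),
    --edge case (Group 2): an edit whose transaction_date falls after the next event_date
    (1, date('2012-01-01'), 101, date('2012-02-15'), 'edited'),
    --last event_date for this id: next_event_date falls back to the placeholder
    (1, date('2012-02-01'), 102, date('2012-02-01'), 'created')
),
costs_audit_table (id, year, transaction_id, costs) as (
  values
    (1, 2012, 100, 50.00),
    --costs row with a transaction_id between two main transaction_ids,
    --to exercise the "biggest smaller transaction_id" join
    (1, 2012, 101, 75.00)
)
select *
from main_audit_table --replace this line with the full query under test
order by id, event_date, transaction_id

Each row targets one branch of the algorithm (an ordinary snapshot, a Group 2 edit, the last event_date falling back to the placeholder next_event_date), which is easier to audit than randomly generated data, and the rows double as a regression test once a case is covered.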

Fred Sobotka
  • +1 for a great tip, but actually I don't have a problem with inserting data into real tables - how can I compute the edge cases that need to be tested? (I can generate a lot of random data and still miss a case real users will produce.) Do you know any technique for this? – Aprillion Apr 25 '12 at 21:10