4

I have some fairly simple SQL that should provide 1 row per quarter per asset1. Instead, I get multiple rows per group by.

Below is the SQL, a SAS data step, and some of the output data. The number of duplicate rows (in the below data, 227708) is equal to Num_borrowers, which is the asset1 number of rows.

proc sql outobs=max;

create table table1 as
select 
    case 
        when period_dt ='01DEC2003'd then '2003Q4'
        when period_dt ='01DEC2004'd then '2004Q4'
        when period_dt ='01DEC2005'd then '2005Q4'
        when period_dt ='01DEC2006'd then '2006Q4'
        when period_dt ='01DEC2007'd then '2007Q4' 
        when period_dt ='01DEC2008'd then '2008Q4'
        when period_dt ='01DEC2009'd then '2009Q4'
        when period_dt ='01DEC2010'd then '2010Q4'
        when period_dt ='01DEC2011'd then '2011Q4'
        when period_dt ='01DEC2012'd then '2012Q4'
        when period_dt ='01DEC2013'd then '2013Q4'
        when period_dt ='01JUN2014'd then '2014Q2'   
    end as QTR,
    case 
        when MM_ASSET in ('C&I', 'Foreign', 'Leasing','Scored-WF','Scored-WB')  THEN 'C&I'
        when MM_ASSET='Construction' THEN 'Construction RE'
        when MM_ASSET='Mortgage-IP' THEN 'Income Producing RE'
        when MM_ASSET='Mortgage-OO' THEN 'Owner Occupied RE'
        when MM_ASSET='Mortgage-SF' THEN 'Mortgage-SF'
        when MM_ASSET='Unknown' THEN 'Other'
    end as asset1,
    count (period_dt) as Num_Borrowers, 
    exposure,
    co_itd,
    MM_NINEQTR_LOSS,
    MM_LIFE_LOSS
  from td_prod.OBLIGOR_COMBINED
  where period_dt in ('01DEC2003'd,'01DEC2004'd,'01DEC2005'd,'01DEC2006'd,'01DEC2007'd,'01DEC2008'd, '01DEC2009'd,'01DEC2010'd,'01DEC2011'd,'01DEC2012'd,'01DEC2013'd,'01JUN2014'd)
  and mm_asset in ('C&I','Foreign','Leasing','Construction','Mortgage-IP','Scored-WF','Scored-WB'
               'Mortgage-OO','Mortgage-SF','Unknown')
  group by 1,2
  order by 1,2;

quit;



data table2; set table1;

  Total_Exposure = exposure/1000000;
  if total_exposure = 0 then total_exposure=.;
  Total_Charge_Offs =co_itd/1000000;
  Total_9Q_Losses = MM_NINEQTR_LOSS/1000000;
  Total_Life_Losses = MM_LIFE_LOSS/1000000;
  avg_borrower_exp = total_exposure/num_borrowers;
  co_rate = total_charge_offs/total_exposure;
  life_lossR = Total_life_losses/total_exposure;
  nineQtr_lossR = total_9q_losses/total_exposure;

run;



*** sample of output data set ***;
qtr             asset1      num_borrowers
2003Q4          C&I             227708
2003Q4          C&I             227708
2003Q4          C&I             227708
2003Q4          C&I             227708
2003Q4          C&I             227708
2003Q4          C&I             227708
2003Q4          C&I             227708
2003Q4          C&I             227708
2003Q4          C&I             227708
2003Q4          C&I             227708
2003Q4          C&I             227708
2003Q4          C&I             227708
2003Q4          C&I             227708
2003Q4          C&I             227708
2003Q4          C&I             227708
2003Q4          C&I             227708
kristof
  • 52,923
  • 24
  • 87
  • 110
pgrebus
  • 41
  • 1
  • 1
  • 2
  • what database are you using here? Seems MySQL, you are selecting more fields in non-aggregate form than you are grouping by. How does your database know how to handle multiple combinations of exposure, co_itd, MM_NINEQTR_LOSS, MM_LIFE_LOSS ? – Twelfth Aug 27 '14 at 23:38
  • 4
    Assume using SAS, per the tag and PROC SQL. SAS SQL does an odd non-ANSI standard thing when you put a column on the select statement which is not one of the group by columns and is not computed from an aggregate function. It returns every record from the source table. And gives you a note in the log, "this step required re-merging" or something to that effect. Sometimes this remerging is helpful. But I have known true SQL experts who did not like this feature, and avoided it like the plague. Many databases would throw an error from such a SELECT statement. – Quentin Aug 28 '14 at 00:06
  • Ugh, you mean MySQL isn't the only thing that does that, SAS SQl does as well? Argh!!!!!! What is with these languages doing the wrong thing instead of returning an error? In any case, solution = figure out what to do with those 4 columns you exclude from the sample set of results you provided – Twelfth Aug 28 '14 at 00:12

1 Answers1

7

Realized my comment above is more like an answer.

In SAS SQL, in a query with a group by clause that includes extraneous columns on the select statement (i.e. columns not part of the group by and not derived from an aggregating function), SAS "remerges" the summary statistics back to the original data (with a note to that effect). Most SQLs would just throw an error. Below is an example:

data have;
  input gender $ age score;
  cards;
M 10 100
M 20 200
F 30 300
F 40 400
;
run;

proc sql;
  select gender, mean(age) as AvgAge, SCore
    from have
    group by gender
  ;
quit;

returns:

 gender      AvgAge     score
 F               35       300
 F               35       400
 M               15       100
 M               15       200

In your code, exposure, co_itd, MM_NINEQTR_LOSS, and MM_LIFE_LOSS are all extraneous columns, causing SAS to remerge.

Whenever remerging occures you will see the following message in SAS log:

NOTE: The query requires remerging summary statistics back with the original data.

See remerging data section in SAS documentation on summary-function for more details

kristof
  • 52,923
  • 24
  • 87
  • 110
Quentin
  • 5,960
  • 1
  • 13
  • 21
  • 5
    To disallow remerge-ing behaviour use NOREMERGE proc sql option or NOSQLREMERGE system option. – vasja Jul 24 '15 at 11:24