Sum distinct, duplicate rows

Question

The tl/dr summary: 3 tables with hierarchical relationship, a number field in the middle level, need a sum of that number without duplicating because of the lower level, looking for an alternative using OLAP functions in DB2.

This somewhat revisits these two topics (SUM(DISTINCT) Based on Other Columns and Sum Values based on Distinct Guids) - but I'm bumping as a separate topic because I'm wondering if there's a way to accomplish this with OLAP functions.

I'm working in DB2. The scenario (not the actual tables, due to client confidentiality) is:

   Table: NEIGHBORHOOD, field NEIGHBORHOOD_NAME
   Table: HOUSEHOLD, fields NEIGHBORHOOD_NAME, HOUSEHOLD_NAME, and HOUSEHOLD_INCOME
   Table: HOUSEHOLD_MEMBER, fields HOUSEHOLD_NAME, PERSON_NAME

Right now we've got the data pulled by a single flatten-it-all view. So we would get something like

 Shady Acres, 123 Shady Lane, 25000, Jane
 Shady Acres, 123 Shady Lane, 25000, Mary
 Shady Acres, 123 Shady Lane, 25000, Robert
 Shady Acres, 126 Shady Lane, 15000, George
 Shady Acres, 126 Shady Lane, 15000, Tom
 Shady Acres, 126 Shady Lane, 15000, Betsy
 Shady Acres, 126 Shady Lane, 15000, Timmy

If I want

    Shady Acres, 123 Shady Lane, 25000, 3  (household income, count of members)
    Shady Acres, 125 Shady Lane, 15000, 4

it's no problem:

SELECT N.NEIGHBORHOOD_NAME, H.HOUSEHOLD_NAME, H.HOUSEHOLD_INCOME, count(1)
from NEIGHBORHOOD N join HOUSEHOLD H on N.HOUSEHOLD_NAME = H.HOUSEHOLD_NAME
join HOUSEHOLD_MEMBER M on H.HOUSEHOLD_NAME = M.HOUSEHOLD_NAME
group by N.NEIGHBORHOOD_NAME, H.HOUSEHOLD_NAME, H.HOUSEHOLD_INCOME

However, if I want

   Shady Acres, 2, 40000, 7 (i.e. neighborhood, number of households, sum of income, count of members)

I can't accomplish it without a subquery, as seen in the related links.

The best I've gotten so far is

select NEIGHBORHOOD.NEIGHBORHOOD_NAME,
count(distinct HOUSEHOLD.HOUSEHOLD_NAME) household_Count,
sum(distinct HOUSEHOLD.HOUSEHOLD_INCOME) total_income,
count(1) household_members group by N.NEIGHBORHOOD_NAME

This won't work if you have two households with the same income, of course. I was frankly surprised that "sum(distinct)" even worked, since it just doesn't make sense to me.

I tried

sum(household_income) over (partition by household.household_name)

and it threw an error:

An‬‎ ‪expression‬‎ ‪starting‬‎ ‪with‬‎ ‪‬‎"HOUSEHOLD_INCOME"‬‎ ‪specified‬‎ ‪in‬‎ ‪a‬‎ ‪SELECT‬‎ ‪clause‬‎,‪‬‎ ‪HAVING‬‎ ‪clause‬‎,‪‬‎ ‪or‬‎ ‪ORDER‬‎ ‪BY‬‎ ‪clause‬‎ ‪is‬‎ ‪not‬‎ ‪specified‬‎ ‪in‬‎ ‪the‬‎ ‪GROUP‬‎ ‪BY‬‎ ‪clause‬‎ ‪or‬‎ ‪it‬‎ ‪is‬‎ ‪in‬‎ ‪a‬‎ ‪SELECT‬‎ ‪clause‬‎,‪‬‎ ‪HAVING‬‎ ‪clause‬‎,‪‬‎ ‪or‬‎ ‪ORDER‬‎ ‪BY‬‎ ‪clause‬‎ ‪with‬‎ ‪a‬‎ ‪column‬‎ ‪function‬‎ ‪and‬‎ ‪no‬‎ ‪GROUP‬‎ ‪BY‬‎ ‪clause‬‎ ‪is‬‎ ‪specified‬‎.‪‬‎.‪‬‎ ‪SQLCODE‬‎=‪‬‎-‪119‬‎,‪‬‎ ‪SQLSTATE‬‎=‪42803‬‎,‪‬‎ ‪DRIVER‬‎=‪4‬‎.‪19‬‎.‪56

Attempting to add either HOUSEHOLD_INCOME or HOUSEHOLD_NAME to the grouping causes the wrong results since we don't want to break it out by those fields.

It's entirely possible that there's no solution to this aside from using a subquery, but we'd have to do some significant redesign of the underlying view (including adding additional views), so I figured it couldn't hurt to ask.

It would be helpful if you posted the full query where you got the `-119` error. As it currently written, you have no `FROM` clause in your query. It is difficult to debug a partial query — Paul Vernon, Oct 08 '19 at 17:21

score 0 · Answer 1 · answered Oct 08 '19 at 17:50

I agree that this is not possible without a sub-query if you want to use OLAP fucntions

A co-related sub-select would work, but is inelegant, inefficient and likely not what you want

WITH NEIGHBORHOOD(NEIGHBORHOOD_NAME) AS (VALUES ('Shady Acres'))
,   HOUSEHOLD (NEIGHBORHOOD_NAME, HOUSEHOLD_NAME, HOUSEHOLD_INCOME)
AS (VALUES 
   ('Shady Acres', '123 Shady Lane', 25000)
  ,('Shady Acres', '126 Shady Lane', 15000) 
  )
, HOUSEHOLD_MEMBER ( HOUSEHOLD_NAME, PERSON_NAME )
AS(VALUES
      ('123 Shady Lane', 'Jane'  )
     ,('123 Shady Lane', 'Mary'  )
     ,('123 Shady Lane', 'Robert')
     ,('126 Shady Lane', 'George')
     ,('126 Shady Lane', 'Tom'   )
     ,('126 Shady Lane', 'Betsy' )
     ,('126 Shady Lane', 'Timmy' )    
)
SELECT
    NEIGHBORHOOD_NAME
,   COUNT(DISTINCT HOUSEHOLD_NAME  )  AS HOUSEHOLD_COUNT
--,   SUM(DISTINCT   HOUSEHOLD_INCOME)  AS TOTAL_INCOME       -- not valid if two househols have the same income
--,   SUM(HOUSEHOLD_INCOME) OVER (PARTITION BY HOUSEHOLD_NAME)  -- not valid unless we GROUP BY HOUSEHOLD_NAME in the main body
,   SUM( (SELECT SUM(S.HOUSEHOLD_INCOME) FROM HOUSEHOLD S 
            WHERE S.HOUSEHOLD_NAME = H.HOUSEHOLD_NAME
            AND   M.PERSON_NAME = (SELECT MAX(SS.PERSON_NAME) 
                                   FROM HOUSEHOLD_MEMBER SS
                                   WHERE SS.HOUSEHOLD_NAME = H.HOUSEHOLD_NAME))
           )            AS TOTAL_INCOME
,   COUNT(1)                          AS HOUSEHOLD_MEMBERS
FROM NEIGHBORHOOD N
JOIN HOUSEHOLD    H     USING ( NEIGHBORHOOD_NAME )
JOIN HOUSEHOLD_MEMBER M USING ( HOUSEHOLD_NAME )
GROUP BY N.NEIGHBORHOOD_NAME

returns

 NEIGHBORHOOD_NAME  HOUSEHOLD_COUNT     TOTAL_INCOME    HOUSEHOLD_MEMBERS
 -----------------  ---------------     ------------    -----------------
 Shady Acres                      2     40000           7

score 0 · Answer 2 · edited Oct 09 '19 at 20:36

Another option could be

with base (NEIGHBORHOOD_NAME,HOUSEHOLD_NAME, HOUSEHOLD_INCOME, HOUSEHOLD_MEMBER) as (
 values ('Shady Acres', '123 Shady Lane', 25000, 'Jane')
,( 'Shady Acres', '123 Shady Lane', 25000, 'Mary')
,( 'Shady Acres', '123 Shady Lane', 25000, 'Robert')
,( 'Shady Acres', '126 Shady Lane', 15000, 'George')
,( 'Shady Acres', '126 Shady Lane', 15000, 'Tom')
,( 'Shady Acres', '126 Shady Lane', 15000, 'Betsy')
,( 'Shady Acres', '126 Shady Lane', 15000, 'Timmy')
)
, temp as (
select NEIGHBORHOOD_NAME,HOUSEHOLD_NAME, HOUSEHOLD_INCOME, HOUSEHOLD_MEMBER
     , row_number() over (partition by NEIGHBORHOOD_NAME,HOUSEHOLD_NAME order by HOUSEHOLD_MEMBER) as rownum_asc
     , row_number() over (partition by NEIGHBORHOOD_NAME,HOUSEHOLD_NAME order by HOUSEHOLD_MEMBER desc) as rownum_desc
FROM base
)
SELECT NEIGHBORHOOD_NAME, sum(HOUSEHOLD_INCOME) as TOTAL_INCOME, sum(rownum_desc) as member_count

  FROM temp
 WHERE rownum_asc = 1
 GROUP BY NEIGHBORHOOD_NAME

There is a little trick with the rownumber in both directions - one to select only one of the rows per Household and the other to count the members so a sum would do the job in the end.

OLAP functions do not reduce the number of rows - this is a mayor difference and often advantage of OLAP functions compared to GROUP BY and column functions. But with you flattened base table a GROUP BY is needed.

@PaulVernon: Please use the editing functionality only for corrections and do not add stuff that express your opinion in my answer. If you think it is important you are free to comment on my answer. I do not necessarily disagree in this case but for the future I would be glad if you could respect this - thanks — MichaelTiefenbacher, Oct 09 '19 at 17:28
Apologies Micheal. I intended that comment to be on my answer not yours (likely I got confused using my phone trying to edit my answer). I've rolled it back — Paul Vernon, Oct 09 '19 at 20:37

score 0 · Answer 3 · answered Oct 09 '19 at 08:18

So this would be a very hacky solution, that will work as long as you get no hash collisions on the Key of the denormalised / duplicated data and you don't have more duplicates than than the number of decimal places I pushed the HASH value out of the way for the SUM()

WITH NEIGHBORHOOD(NEIGHBORHOOD_NAME) AS (VALUES ('Shady Acres'))
,   HOUSEHOLD (NEIGHBORHOOD_NAME, HOUSEHOLD_NAME, HOUSEHOLD_INCOME)
AS (VALUES 
   ('Shady Acres', '123 Shady Lane', 25000)
  ,('Shady Acres', '126 Shady Lane', 25000) 
  )
, HOUSEHOLD_MEMBER ( HOUSEHOLD_NAME, PERSON_NAME )
AS(VALUES
      ('123 Shady Lane', 'Jane'  )
     ,('123 Shady Lane', 'Mary'  )
     ,('123 Shady Lane', 'Robert')
     ,('126 Shady Lane', 'George')
     ,('126 Shady Lane', 'Tom'   )
     ,('126 Shady Lane', 'Betsy' )
     ,('126 Shady Lane', 'Timmy' )    
)
SELECT
    NEIGHBORHOOD_NAME
,   COUNT(DISTINCT HOUSEHOLD_NAME  )  AS HOUSEHOLD_COUNT
,   BIGINT(SUM(DISTINCT DECFLOAT(HOUSEHOLD_INCOME || '.000000' || ABS(HASH4(HOUSEHOLD_NAME))))) AS TOTAL_INCOME
,   COUNT(1)                          AS HOUSEHOLD_MEMBERS
FROM NEIGHBORHOOD N
JOIN HOUSEHOLD    H     USING ( NEIGHBORHOOD_NAME )
JOIN HOUSEHOLD_MEMBER M USING ( HOUSEHOLD_NAME )
GROUP BY N.NEIGHBORHOOD_NAME

which returns

 NEIGHBORHOOD_NAME  HOUSEHOLD_COUNT     TOTAL_INCOME    HOUSEHOLD_MEMBERS
 -----------------  ---------------     ------------    -----------------
 Shady Acres                      2        50000                    7

Note I made the income of the two households the same to prove that this solution works in that scenario

I guess you could argue that SQL is missing some syntactic feature where OLAP functions using the DISTINCT keyword should be able to define what is being made DISTINCT separately from what is being aggregated.

score 0 · Answer 4 · answered Oct 09 '19 at 20:03

The following query returns the result you need:

WITH 
  NEIGHBORHOOD(NEIGHBORHOOD_NAME) AS 
(
VALUES ('Shady Acres')
)
, HOUSEHOLD (NEIGHBORHOOD_NAME, HOUSEHOLD_NAME, HOUSEHOLD_INCOME) AS 
(
VALUES 
  ('Shady Acres', '123 Shady Lane', 25000)
, ('Shady Acres', '126 Shady Lane', 15000) 
)
, HOUSEHOLD_MEMBER (HOUSEHOLD_NAME, PERSON_NAME) AS
(
VALUES
      ('123 Shady Lane', 'Jane'  )
     ,('123 Shady Lane', 'Mary'  )
     ,('123 Shady Lane', 'Robert')
     ,('126 Shady Lane', 'George')
     ,('126 Shady Lane', 'Tom'   )
     ,('126 Shady Lane', 'Betsy' )
     ,('126 Shady Lane', 'Timmy' )    
)
, TMP AS 
(
SELECT 
  N.NEIGHBORHOOD_NAME
, CASE WHEN H.HOUSEHOLD_NAME = LAG(H.HOUSEHOLD_NAME) OVER (PARTITION BY N.NEIGHBORHOOD_NAME ORDER BY H.HOUSEHOLD_NAME) THEN 0 ELSE 1 END AS HOUSEHOLD_NAME_CHANGED
, H.HOUSEHOLD_INCOME
FROM NEIGHBORHOOD N
JOIN HOUSEHOLD H ON H.NEIGHBORHOOD_NAME = N.NEIGHBORHOOD_NAME
JOIN HOUSEHOLD_MEMBER M ON M.HOUSEHOLD_NAME = H.HOUSEHOLD_NAME
)
SELECT 
  NEIGHBORHOOD_NAME
, SUM(HOUSEHOLD_NAME_CHANGED) AS HOUSEHOLD_NAMES
, SUM(HOUSEHOLD_NAME_CHANGED * HOUSEHOLD_INCOME) AS HOUSEHOLD_INCOME
, COUNT(1) AS MEMBERS
FROM TMP
GROUP BY NEIGHBORHOOD_NAME;

Sum distinct, duplicate rows

4 Answers4