The tl/dr summary: 3 tables with hierarchical relationship, a number field in the middle level, need a sum of that number without duplicating because of the lower level, looking for an alternative using OLAP functions in DB2.
This somewhat revisits these two topics (SUM(DISTINCT) Based on Other Columns and Sum Values based on Distinct Guids) - but I'm bumping as a separate topic because I'm wondering if there's a way to accomplish this with OLAP functions.
I'm working in DB2. The scenario (not the actual tables, due to client confidentiality) is:
Table: NEIGHBORHOOD, field NEIGHBORHOOD_NAME Table: HOUSEHOLD, fields NEIGHBORHOOD_NAME, HOUSEHOLD_NAME, and HOUSEHOLD_INCOME Table: HOUSEHOLD_MEMBER, fields HOUSEHOLD_NAME, PERSON_NAME
Right now we've got the data pulled by a single flatten-it-all view. So we would get something like
Shady Acres, 123 Shady Lane, 25000, Jane Shady Acres, 123 Shady Lane, 25000, Mary Shady Acres, 123 Shady Lane, 25000, Robert Shady Acres, 126 Shady Lane, 15000, George Shady Acres, 126 Shady Lane, 15000, Tom Shady Acres, 126 Shady Lane, 15000, Betsy Shady Acres, 126 Shady Lane, 15000, Timmy
If I want
Shady Acres, 123 Shady Lane, 25000, 3 (household income, count of members) Shady Acres, 125 Shady Lane, 15000, 4
it's no problem:
SELECT N.NEIGHBORHOOD_NAME, H.HOUSEHOLD_NAME, H.HOUSEHOLD_INCOME, count(1)
from NEIGHBORHOOD N join HOUSEHOLD H on N.HOUSEHOLD_NAME = H.HOUSEHOLD_NAME
join HOUSEHOLD_MEMBER M on H.HOUSEHOLD_NAME = M.HOUSEHOLD_NAME
group by N.NEIGHBORHOOD_NAME, H.HOUSEHOLD_NAME, H.HOUSEHOLD_INCOME
However, if I want
Shady Acres, 2, 40000, 7 (i.e. neighborhood, number of households, sum of income, count of members)
I can't accomplish it without a subquery, as seen in the related links.
The best I've gotten so far is
select NEIGHBORHOOD.NEIGHBORHOOD_NAME,
count(distinct HOUSEHOLD.HOUSEHOLD_NAME) household_Count,
sum(distinct HOUSEHOLD.HOUSEHOLD_INCOME) total_income,
count(1) household_members group by N.NEIGHBORHOOD_NAME
This won't work if you have two households with the same income, of course. I was frankly surprised that "sum(distinct)" even worked, since it just doesn't make sense to me.
I tried
sum(household_income) over (partition by household.household_name)
and it threw an error:
An expression starting with "HOUSEHOLD_INCOME" specified in a SELECT clause, HAVING clause, or ORDER BY clause is not specified in the GROUP BY clause or it is in a SELECT clause, HAVING clause, or ORDER BY clause with a column function and no GROUP BY clause is specified.. SQLCODE=-119, SQLSTATE=42803, DRIVER=4.19.56
Attempting to add either HOUSEHOLD_INCOME or HOUSEHOLD_NAME to the grouping causes the wrong results since we don't want to break it out by those fields.
It's entirely possible that there's no solution to this aside from using a subquery, but we'd have to do some significant redesign of the underlying view (including adding additional views), so I figured it couldn't hurt to ask.