1

I have a table of data. Column 1 is a list of categories, and column 2 is a boolean. I have N number of categories, with N number of rows per category.

I would like to return a table with the data grouped by category, and summary of the number of rows for each category, and the sum of the boolean column (number of rows with value = 1).

I would also like to return a summary of: (Sum(BooleanField)/Count(BooleanField))/(Sum(BooleanField)/Count(BooleanField)), where the numerator does not include rows with the category (Category_name) that my Group By function returns for, and the denominator is all-inclusive (all categories).

So far, I have the code

SELECT(Category_name),
COUNT(BooleanField),
SUM(BooleanField),
SUM(BooleanField)/COUNT(BooleanField) -- this is % True for each category

-- some logic that takes the % true for all categories except the category 
-- that we are grouping by later / by the % true overall (all observations) 

FROM Data.Source
GROUP BY Category_Name

This code so far is just exploratory.

The "magic number" column explains what I am looking for next, with the other columns representing what is being returned by my code so far: https://docs.google.com/spreadsheets/d/17oienILCeATmH-kNzBZqz0s0Bj9ptjKZ9HfcQJCvAdA/edit#gid=0

Thanks for any help.

Sample Data:

Category BooleanField
Cat1    0
cat1    1
cat2    1
cat2    1
cat2    1
Cat2    0
Cat2    0
Cat2    1
Cat2    1
Cat2    1
Cat3    0
Cat3    0
Cat3    0
Cat3    1
Cat4    1
Cat4    0
Cat4    0
Cat4    0
Cat4    0
Cat4    1

Desired Result

Category    Percent True    Sum Count   Magic Number
 Cat1       50.00%            1   2      1.0000
 Cat2       60.00%            6   8      0.6667
 Cat3       25.00%            1   4      1.1250
 Cat4       33.33%            2   6      1.1429

The magic number column is the trouble I'm finding. I need to find this magic number column so that I can determine which categories are driving down the overall % true the most. Such that removing the most influentially negative category would increase the overall %T the most.

GMB
  • 216,147
  • 25
  • 84
  • 135
Emery
  • 31
  • 1
  • 8
  • 2
    Please provide sample data and desired results. It is not really clear which logic you want for the summary column. – GMB Oct 23 '20 at 20:50
  • 1
    Sorry @GMB, just added some, hopefully its helpful. – Emery Oct 23 '20 at 21:06

2 Answers2

0

You can use an OVER() window function to sum all values, and then substract the current row.

with data as (
    select $1 num, $2 str
    from (values (1, 'one'), (2, 'two'), (3, 'three'))
)

select num
  , sum(num) over() everything_added
  , sum(num) over() - num everything_but_this_row
from data;

enter image description here

With these basic components, you can now build any desired formula, like in "where the numerator does not include rows with the category".

Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325
0

You can use window functions. I think the logic you want is:

select category, 
    avg(cast(booleanField as int)) as percent_true,
    sum(cast(booleanField as int)) as total,
    count(*) cnt,
    (sum(sum(cast(booleanField as int))) over() - sum(cast(booleanField as int))) 
        / (sum(count(*)) over() - count(*))
        / avg(cast(booleanField as int))
        as magic_number
from mytable
group by category
order by category
GMB
  • 216,147
  • 25
  • 84
  • 135
  • Unfortunately, this returns a division by 0 error. I'm unfamiliar with the cast function so I'm having a bit of difficulty figuring out why. Also, should the Percent_true be as an integer? by nature of being a percent, it should be less than 1. – Emery Oct 26 '20 at 13:56
  • The 5th - 8th lines are throwing the divide by 0 error. trying to decipher the issue myself, but can't seem to crack it. – Emery Oct 26 '20 at 14:44
  • @Emery: presumably, you have categories whose boolean field is always `0`. You can use `nullif()` to avoid the division by `0`, like: `... / nullif((sum(count(*)) over() - count(*)), 0) / nullif(avg(cast(booleanField as int)), 0)` – GMB Oct 26 '20 at 18:14
  • `percent_true` is the *average* of an integer, so that's indeed a value between `0` and `1`. – GMB Oct 26 '20 at 18:14
  • I think that just about does it. Thank you I really appreciate the help. – Emery Oct 26 '20 at 20:39
  • Actually, it didn't work. I am getting some numbers that are quite a big larger than expected for the Magic_number column. This is close though. I'll keep working on it. – Emery Oct 26 '20 at 21:48