2

I'm doing my first foray into something proper in SQL, except having trouble replicating something which I've been easily able to do via Alteryx.

Essentially, some basic data wrangling needed to create a summary of a Table which I've already created in Impala/Hive. The basic table needs to be broken into smaller tables (unpivoted and pivoted), which are then unioned together to create an aggregated table.

Table is as following:

Run_Code | ID | ColB | ColC | ColD | ColE | ColF | ColG | TaxExpense | RetainedExpense | IncomeExpense | Year
-----------------------------------------------------------------------------------------------------------------
run1     | 21 | 1234 | 1234 | 1234 | 1234 | 1234 | 1234 | 1234.56789 | 1234.56789      |  1234.56789   | Year1 
run1     | 22 | 1234 | 1234 | 1234 | 1234 | 1234 | 1234 | 1234.56789 | 1234.56789      |  1234.56789   | Year2
run1     | 23 | 1234 | 1234 | 1234 | 1234 | 1234 | 1234 | 1234.56789 | 1234.56789      |  1234.56789   | Year3
run1     | 24 | 1234 | 1234 | 1234 | 1234 | 1234 | 1234 | 1234.56789 | 1234.56789      |  1234.56789   | Year4

Currently the following is done in Alteryx; whereby only TaxExpense is selected and then un-pivoted for Year, and then Pivoted back to as a column.

Run_Code | ID | ColB | ColC | ColD |     Name    | Year1 | Year2 | Year3 | Year4
-----------------------------------------------------------------------------------------------------------------
run1     | 21 | 1234 | 1234 | 1234 | Tax Expense | (sum) | (sum) | (sum) | (sum)
run1     | 21 | 1234 | 1234 | 1234 | Tax Expense | (sum) | (sum) | (sum) | (sum)
run1     | 21 | 1234 | 1234 | 1234 | Tax Expense | (sum) | (sum) | (sum) | (sum)
run1     | 21 | 1234 | 1234 | 1234 | Tax Expense | (sum) | (sum) | (sum) | (sum)

The same is done for Retained Expense and then to IncomeExpense as well.

Run_Code | ID | ColB | ColC | ColD |      Name      | Year1 | Year2 | Year3 | Year4
-----------------------------------------------------------------------------------------------------------------
run1     | 21 | 1234 | 1234 | 1234 | RetainedExpense | (sum) | (sum) | (sum) | (sum)
run1     | 21 | 1234 | 1234 | 1234 | RetainedExpense | (sum) | (sum) | (sum) | (sum)
run1     | 21 | 1234 | 1234 | 1234 | RetainedExpense | (sum) | (sum) | (sum) | (sum)
run1     | 21 | 1234 | 1234 | 1234 | RetainedExpense | (sum) | (sum) | (sum) | (sum)

End Result is the below and the desired is the below:

Run_Code | ID | ColB | ColC | ColD |      Name       | Year1 | Year2 | Year3 | Year4
-----------------------------------------------------------------------------------------------------------------
run1     | 21 | 1234 | 1234 | 1234 | TaxExpense      | (sum) | (sum) | (sum) | (sum)
run1     | 21 | 1234 | 1234 | 1234 | TaxExpense      | (sum) | (sum) | (sum) | (sum)
run1     | 21 | 1234 | 1234 | 1234 | TaxExpense      | (sum) | (sum) | (sum) | (sum)
run1     | 21 | 1234 | 1234 | 1234 | TaxExpense      | (sum) | (sum) | (sum) | (sum)
run1     | 21 | 1234 | 1234 | 1234 | RetainedExpense | (sum) | (sum) | (sum) | (sum)
run1     | 21 | 1234 | 1234 | 1234 | RetainedExpense | (sum) | (sum) | (sum) | (sum)
run1     | 21 | 1234 | 1234 | 1234 | RetainedExpense | (sum) | (sum) | (sum) | (sum)
run1     | 21 | 1234 | 1234 | 1234 | RetainedExpense | (sum) | (sum) | (sum) | (sum)
run1     | 21 | 1234 | 1234 | 1234 | IncomeExpense   | (sum) | (sum) | (sum) | (sum)
run1     | 21 | 1234 | 1234 | 1234 | IncomeExpense   | (sum) | (sum) | (sum) | (sum)
run1     | 21 | 1234 | 1234 | 1234 | IncomeExpense   | (sum) | (sum) | (sum) | (sum)
run1     | 21 | 1234 | 1234 | 1234 | IncomeExpense   | (sum) | (sum) | (sum) | (sum)

Appreciate any help in creating an SQL which solves the above.

ShushKebab
  • 35
  • 4

1 Answers1

1

Hmmm . . . If I understand correctly, ou can unpivot and reaggregate:

select Run_Code, ID, ColB, ColC, ColD,
       sum(case when year = 'year1' then expense end) as year_1,
       sum(case when year = 'year2' then expense end) as year_2,
       sum(case when year = 'year3' then expense end) as year_3,
       sum(case when year = 'year4' then expense end) as year_4
from ((select Run_Code, ID, ColB, ColC, ColD, 'TaxExpense' as name, TaxExpense as expense, year
       from t
      ) union all
      (select Run_Code, ID, ColB, ColC, ColD, 'RetainedExpense' as name, RetainedExpense, year
       from t
      ) union all
      (select Run_Code, ID, ColB, ColC, ColD, 'IncomeExpense' as name, IncomeExpense, year
       from t
      )
     ) t
group by Run_Code, ID, ColB, ColC, ColD, name
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • I'm getting the following error: AnalysisException: SUM requires a numeric parameter: sum(CASE WHEN year = 'YEAR1' THEN Name END). When I try MAX instead of Sum, the Name field doesn't appear and they instead they populate under the years of Year_1, Year_2 etc. – ShushKebab Nov 25 '20 at 01:10