2

Lets say I have flight data (from Foundry Academy).

Starting dataset:

Date flight_id origin_state carrier_name
jan 000000001 California delta air
jan 000000002 Alabama delta air
jan 000000003 California southwest
feb 000000004 California southwest
... ... ... ...

I'm doing monthly data aggregation by state and by carrier. Header of my aggregated data looks like this:

origin state carrier name jan feb ...
Alabama delta air 1 0 ...
California delta air 1 0 ...
California southwest 1 1 ...

I need to get subtotals for each state; I need to sort by most flights; and I want it to be sorted by states, then by carrier.

desired output

origin state carrier name jan feb ...
California null 2 1 ...
California delta air 1 0 ...
California southwest 1 1 ...
Alabama null 1 0 ...
Alabama delta air 1 0 ...

PIVOT - doesn't provide subtotals for categories;

EXPRESSION - doesn't offer possibility to split date column into columns.

  • Hi @Jonas any possibility you can provide sample/similar data and your desire output? – R. Baraiya May 31 '22 at 09:14
  • 1
    How are you doing the aggregation? In Contour? – Andrew Andrade May 31 '22 at 19:08
  • @Jonas-Reklaitis there are N different dev tools in foundry that you can do what you are describing. Could you please better describe what you are trying to do, what you have tried and where please? We would love to help you, but your question is very vague as it stands. – fmsf May 31 '22 at 21:34

1 Answers1

2

I solved it on Contour. not the prettiest solution, but it works.

I've created two paths to the same dataset:

| Date | flight_id | origin_state | carrier_name |
| ---- | --------- | ------------ | ------------ |
| ...  | ...       | ...          | ...          |

1st path was used to calculate full aggregation. pivot table and switch to pivoted data:

Switch to pivoted data: using column "date", 
grouped by "origin_state" and "carrier_name",
aggregated by Count

2nd path was used to get subtotals:

Switch to pivoted data: using column "date", 
grouped by "origin_state",
aggregated by Count

Afterwards I've added empty column "carrier_name" to second dataset. And made union of both datasets

Add rows that appear in "second_path" by column name

After that I've added additional column with expression

Add new column "order" from max("Jan") OVER (
PARTITION BY "origin_state" )

After that I sorted resulting dataset.

Sort dataset by "order" descending, then by "Jan" descending

I received result. but it has additional column, and now I wish to change row formatting of subtotals.

Other approaches are welcome. as my real data has more hierarchical levels.

  • Thank you for answering your own question :) – fmsf Jun 01 '22 at 15:52
  • I just wish for more streamlined solution. it gets very cumbersome if there are more hierarchical levels (e.g. ISIC codes). also adding parameters (especially for "Date" column), makes resulting dataset unreadable for casual user. – Jonas Reklaitis Jun 02 '22 at 06:45
  • Is your intent for the dataset, within the general dataset view, to be readable for the user in a particular order? I would be careful with that since the dataset view doesn't necessarily respect the ordering of the data and it only displays a random subset of data. You may be able to get it if you repartition(1) in pyspark after sorting, but you should keep your partitions ≈ 128 MB each. – fmsf Jun 02 '22 at 13:12