How do I sort hierarchical data in Palantir?

Question

Lets say I have flight data (from Foundry Academy).

Starting dataset:

Date	flight_id	origin_state	carrier_name
jan	000000001	California	delta air
jan	000000002	Alabama	delta air
jan	000000003	California	southwest
feb	000000004	California	southwest
...	...	...	...

I'm doing monthly data aggregation by state and by carrier. Header of my aggregated data looks like this:

origin state	carrier name	jan	feb	...
Alabama	delta air	1	0	...
California	delta air	1	0	...
California	southwest	1	1	...

I need to get subtotals for each state; I need to sort by most flights; and I want it to be sorted by states, then by carrier.

desired output

origin state	carrier name	jan	feb	...
California	null	2	1	...
California	delta air	1	0	...
California	southwest	1	1	...
Alabama	null	1	0	...
Alabama	delta air	1	0	...

PIVOT - doesn't provide subtotals for categories;

EXPRESSION - doesn't offer possibility to split date column into columns.

Hi @Jonas any possibility you can provide sample/similar data and your desire output? — R. Baraiya, May 31 '22 at 09:14
@Jonas-Reklaitis there are N different dev tools in foundry that you can do what you are describing. Could you please better describe what you are trying to do, what you have tried and where please? We would love to help you, but your question is very vague as it stands. — fmsf, May 31 '22 at 21:34

score 2 · Answer 1 · answered Jun 01 '22 at 08:38

I solved it on Contour. not the prettiest solution, but it works.

I've created two paths to the same dataset:

| Date | flight_id | origin_state | carrier_name |
| ---- | --------- | ------------ | ------------ |
| ...  | ...       | ...          | ...          |

1st path was used to calculate full aggregation. pivot table and switch to pivoted data:

Switch to pivoted data: using column "date", 
grouped by "origin_state" and "carrier_name",
aggregated by Count

2nd path was used to get subtotals:

Switch to pivoted data: using column "date", 
grouped by "origin_state",
aggregated by Count

Afterwards I've added empty column "carrier_name" to second dataset. And made union of both datasets

Add rows that appear in "second_path" by column name

After that I've added additional column with expression

Add new column "order" from max("Jan") OVER (
PARTITION BY "origin_state" )

After that I sorted resulting dataset.

Sort dataset by "order" descending, then by "Jan" descending

I received result. but it has additional column, and now I wish to change row formatting of subtotals.

Other approaches are welcome. as my real data has more hierarchical levels.

I just wish for more streamlined solution. it gets very cumbersome if there are more hierarchical levels (e.g. ISIC codes). also adding parameters (especially for "Date" column), makes resulting dataset unreadable for casual user. — Jonas Reklaitis, Jun 02 '22 at 06:45
Is your intent for the dataset, within the general dataset view, to be readable for the user in a particular order? I would be careful with that since the dataset view doesn't necessarily respect the ordering of the data and it only displays a random subset of data. You may be able to get it if you repartition(1) in pyspark after sorting, but you should keep your partitions ≈ 128 MB each. — fmsf, Jun 02 '22 at 13:12

How do I sort hierarchical data in Palantir?

1 Answers1