I determine the number of employees and enterprises of activities sectors, city by city:
|codeCommune|nomCommune |regroupement|section|libelleAPE |nombreEntreprises|nombreSalaries|nombreActifs|
+-----------+--------------------+------------+-------+---------------------------------------------------------------------------------------------------------------------------------+-----------------+--------------+------------+
|14654 |Saint-Pierre-en-Auge|84.11Z |O |Administration publique générale |3 |153.5 |169.5 |
|14654 |Saint-Pierre-en-Auge|16.24Z |C |Fabrication d'emballages en bois |1 |149.5 |150.5 |
|14654 |Saint-Pierre-en-Auge|10.11Z |C |Transformation et conservation de la viande de boucherie |1 |149.5 |150.5 |
with a grouping level (regroupement
below) that is set by the user:
+-----------+--------------------+------------+-------+------------------------------------------------------------------------------------------------------------------------------------------+-----------------+--------------+------------+
|codeCommune|nomCommune |regroupement|section|libelleAPE |nombreEntreprises|nombreSalaries|nombreActifs|
+-----------+--------------------+------------+-------+------------------------------------------------------------------------------------------------------------------------------------------+-----------------+--------------+------------+
|14654 |Saint-Pierre-en-Auge|10 |C |Industries alimentaires |16 |208.0 |225.0 |
|14654 |Saint-Pierre-en-Auge|86 |Q |Activités pour la santé humaine |46 |169.5 |218.5 |
|14654 |Saint-Pierre-en-Auge|84 |O |Administration publique et défense ; sécurité sociale obligatoire |5 |153.5 |171.5 |
The job is done that way:
From a
Dataset
of enterprises and establishments, partitioned by a department code (roughly the two first characters of a city code), these columns are selected:city_code
,city_name
,grouping
(the part of the activity code we are retaining:84.11Z
or84
),section
(a code summarizing the sector of an activity: industrial, commercial, etc.),activity_description
,siren
(the enterprise identifier: an enterprise might have many establishments),number_of_workers
,number_of_actives_people
a
groupBy
is done :
RelationalGroupedDataset group = enterprisesAndEstablishments
.groupBy("city_code", "city_name", "grouping", "section", "activity_description");
- I perform my calculations by an aggregation then:
group.agg(countDistinct("siren").as("nombreEntreprises"),
sum("number_of_workers").as("nombreSalaries"),
sum("number_of_actives_people").as("nombreActifs"));
My problem is that the groupBy
method doesn't care about the dataset partitions and is gathering its data from any partition of the dataset enterprisesAndEstablishments
and sorts globally a massive amount of data,
when targeting only a part would be more efficient: all activities in this sample are in the partition [codeDepartement=14]
.
I would like it to respect these partitions and do this groupBy
at their levels, to avoid shuffle.
I'm looking for the sortWithPartitions
companion of groupBy
. Something that would be called groupWithinPartitions
but I don't find it.
What is the best way to achieve what I'm looking for,
or if there's no tool, what alternative should I choose ?