For example Im executing some queries on spark, and in the spark UI I can see that some queries have more shuffle , and this shuffle seems that is the amount of data read locally and read between executors.
But so Im not understanding one thing, for example this query below loaded 7GB from HDFS but the suffle read + shuffled write is more than 10GB. But I saw other queries that load also 7GB from HDFS and the shuffle is like 500kb. So Im not understanding this, can you please help? The amount of data shuffled is not related in the data read from hdfs?
select
nation, o_year, sum(amount) as sum_profit
from
(
select
n_name as nation, year(o_orderdate) as o_year,
l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount
from
orders o join
(select l_extendedprice, l_discount, l_quantity, l_orderkey, n_name, ps_supplycost
from part p join
(select l_extendedprice, l_discount, l_quantity, l_partkey, l_orderkey,
n_name, ps_supplycost
from partsupp ps join
(select l_suppkey, l_extendedprice, l_discount, l_quantity, l_partkey,
l_orderkey, n_name
from
(select s_suppkey, n_name
from nation n join supplier s on n.n_nationkey = s.s_nationkey
) s1 join lineitem l on s1.s_suppkey = l.l_suppkey
) l1 on ps.ps_suppkey = l1.l_suppkey and ps.ps_partkey = l1.l_partkey
) l2 on p.p_name like '%green%' and p.p_partkey = l2.l_partkey
) l3 on o.o_orderkey = l3.l_orderkey
)profit
group by nation, o_year
order by nation, o_year desc;