I have a Hive SQL query (run on AWS Athena) that I'm trying to convert to PySpark. Both the SQL query and the PySpark code run against the same data, but when I compare the output row counts, they are significantly different.
The SQL query returns a count of 250,000 rows, while the PySpark code returns a count of 307 million rows. I've carefully checked the PySpark code and it looks equivalent to the SQL query, so I can't see where the extra rows are coming from.
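For reference, this is how I'm getting each count: on the Athena side it's a plain COUNT(*) over the CTE's output, and on the Spark side it's .count() on the final DataFrame (self_joined_tbl_df, which is defined in the PySpark code further below):

# self_joined_tbl_df is the DataFrame built in the PySpark code below.
spark_rows = self_joined_tbl_df.count()
print(spark_rows)  # ~307 million for me; Athena's COUNT(*) gives ~250,000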
Here is the Hive SQL query that I'm trying to convert (it's one CTE from a larger WITH query; the row count above is taken over this CTE's output):
self_joined_tbl AS (
    SELECT
        tbl1.item_id,
        tbl1.p_date,
        tbl1.uid,
        tbl1.direct_uid,
        tbl1.related_uid,
        tbl1.uid_type,
        COALESCE(tbl2.related_uid, tbl3.related_uid) AS new_related_uid
    FROM input_table AS tbl1
    LEFT JOIN input_table AS tbl2
        ON tbl1.direct_uid = tbl2.direct_uid
        AND tbl1.item_id = tbl2.item_id
        AND tbl2.uid_type IN ('new')
    LEFT JOIN input_table AS tbl3
        ON tbl1.direct_uid = tbl3.direct_uid
        AND tbl1.item_id = tbl3.item_id
        AND tbl3.uid_type IN ('old')
)
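For context, input_table_df on the Spark side is just the same table loaded into a DataFrame (the path below is illustrative; the real source is the same S3 data that Athena scans):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
# Illustrative path; in reality this points at the same S3 location Athena queries.
input_table_df = spark.read.parquet("s3://my-bucket/input_table/")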
And here is the PySpark code I converted the SQL query to:
self_joined_tbl_df = (
    input_table_df.alias("tbl1")
    # First self-join: match rows with uid_type = 'new'
    .join(
        input_table_df.alias("tbl2"),
        on=(
            (F.col("tbl1.direct_uid") == F.col("tbl2.direct_uid"))
            & (F.col("tbl1.item_id") == F.col("tbl2.item_id"))
            & (F.col("tbl2.uid_type").isin("new"))
        ),
        how="left",
    )
    # Second self-join: match rows with uid_type = 'old'
    .join(
        input_table_df.alias("tbl3"),
        on=(
            (F.col("tbl1.direct_uid") == F.col("tbl3.direct_uid"))
            & (F.col("tbl1.item_id") == F.col("tbl3.item_id"))
            & (F.col("tbl3.uid_type").isin("old"))
        ),
        how="left",
    )
    .select(
        F.col("tbl1.item_id"),
        F.col("tbl1.p_date"),
        F.col("tbl1.uid"),
        F.col("tbl1.direct_uid"),
        F.col("tbl1.related_uid"),
        F.col("tbl1.uid_type"),
        F.coalesce(
            F.col("tbl2.related_uid"), F.col("tbl3.related_uid")
        ).alias("new_related_uid"),
    )
)
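In case it helps, here are the sanity checks I'm planning to run next, in case either the inputs differ between the two engines or the join keys fan out (just a sketch):

# 1) Confirm both engines see the same input: compare this number against
#    SELECT COUNT(*) FROM input_table in Athena.
print(input_table_df.count())

# 2) Check for join-key fan-out: if (direct_uid, item_id) is duplicated within
#    uid_type 'new' or 'old', each tbl1 row matches several tbl2/tbl3 rows,
#    and the two left joins multiply the output row count.
dup_keys_df = (
    input_table_df
    .filter(F.col("uid_type").isin("new", "old"))
    .groupBy("direct_uid", "item_id", "uid_type")
    .agg(F.count("*").alias("n"))
    .filter(F.col("n") > 1)
)
print(dup_keys_df.count())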
I'm hoping someone can help me identify why the PySpark code returns so many more rows than the SQL query, and suggest changes to the PySpark code to fix the issue. Thank you!