
I have a Hive SQL query (run on AWS Athena) that I'm trying to convert to PySpark code. Both the SQL query and the PySpark code run on the same data, but when I compare the output row counts, they are significantly different.

The SQL query returns a count of 250,000 rows, while the PySpark code returns a count of 307 million rows. I've carefully checked the PySpark code and it looks equivalent to the SQL query, so I'm not sure why it returns so many more rows.

Here is the Hive SQL query that I'm trying to convert:

self_joined_tbl AS (
    SELECT
        tbl1.item_id,
        tbl1.p_date,
        tbl1.uid,
        tbl1.direct_uid,
        tbl1.related_uid,
        tbl1.uid_type,
        COALESCE(tbl2.related_uid, tbl3.related_uid) AS new_related_uid
    FROM input_table AS tbl1
    LEFT JOIN input_table AS tbl2
        ON tbl1.direct_uid = tbl2.direct_uid
        AND tbl1.item_id = tbl2.item_id
        AND tbl2.uid_type IN ('new')
    LEFT JOIN input_table AS tbl3
        ON tbl1.direct_uid = tbl3.direct_uid
        AND tbl1.item_id = tbl3.item_id
        AND tbl3.uid_type IN ('old')
)

And here is the PySpark code that I converted from the SQL query:

self_joined_tbl_df = (
        input_table_df.alias("tbl1")
        .join(
            input_table_df.alias("tbl2"),
            on=(
                (F.col("tbl1.direct_uid") == F.col("tbl2.direct_uid"))
                & (F.col("tbl1.item_id") == F.col("tbl2.item_id"))
                & (F.col("tbl2.uid_type").isin("new"))
            ),
            how="left"
        )
        .join(
            input_table_df.alias("tbl3"),
            on=(
                (F.col("tbl1.direct_uid") == F.col("tbl3.direct_uid"))
                & (F.col("tbl1.item_id") == F.col("tbl3.item_id"))
                & (F.col("tbl3.uid_type").isin("old"))
            ),
            how="left"
        )
        .select(
            F.col("tbl1.item_id"),
            F.col("tbl1.item_type"),
            F.col("tbl1.item_name"),
            F.col("tbl1.p_date"),
            F.col("tbl1.uid"),
            F.col("tbl1.direct_uid"),
            F.col("tbl1.related_uid"),
            F.col("tbl1.uid_type"),
            F.coalesce(F.col("tbl2.related_uid"),
                       F.col("tbl3.related_uid")).alias("new_related_uid")
        )
)

I'm hoping someone can help me identify why the PySpark code returns so many more rows than the SQL query, and suggest changes to the PySpark code to fix the issue. Thank you!

WarBoy
  • It's due to the use of the alias `tbl1` in the second join. PySpark isn't executed like SQL: in your case, you are joining the third table to the result of `tbl1 join tbl2`, so you'll need to give an alias to the result of `tbl1 join tbl2` and then move forward. – samkart Apr 27 '23 at 09:03
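
A minimal, untested sketch of what that comment suggests: select from the `tbl1`/`tbl2` join and give the result its own alias before joining `tbl3`, so the second join condition cannot resolve both sides against the same copy of input_table_df. The names step1_df, t12 and tbl2_related_uid are made up for illustration; the columns and conditions follow the question's code.

from pyspark.sql import functions as F

# Sketch only: materialise the tbl1-tbl2 join under its own alias ("t12")
# before joining the third copy of the table.
step1_df = (
    input_table_df.alias("tbl1")
    .join(
        input_table_df.alias("tbl2"),
        on=(
            (F.col("tbl1.direct_uid") == F.col("tbl2.direct_uid"))
            & (F.col("tbl1.item_id") == F.col("tbl2.item_id"))
            & (F.col("tbl2.uid_type").isin("new"))
        ),
        how="left",
    )
    # keep tbl1's columns and rename tbl2's related_uid to avoid ambiguity
    .select("tbl1.*", F.col("tbl2.related_uid").alias("tbl2_related_uid"))
    .alias("t12")
)

self_joined_tbl_df = (
    step1_df
    .join(
        input_table_df.alias("tbl3"),
        on=(
            (F.col("t12.direct_uid") == F.col("tbl3.direct_uid"))
            & (F.col("t12.item_id") == F.col("tbl3.item_id"))
            & (F.col("tbl3.uid_type").isin("old"))
        ),
        how="left",
    )
    .select(
        "t12.*",
        F.coalesce(F.col("t12.tbl2_related_uid"),
                   F.col("tbl3.related_uid")).alias("new_related_uid"),
    )
    .drop("tbl2_related_uid")
)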

1 Answer


I would suggest looking into non-deterministic behavior in PySpark. I recently had a similar problem, but in my case it was dropDuplicates that introduced the variable results. See whether this is your problem by executing your code several times and previewing the count on some condition (it might vary after 3-5 executions).
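
A quick sketch of that check, assuming the DataFrame from the question; identical numbers across runs would point away from non-determinism:

# Run the count a few times and compare the results.
counts = [self_joined_tbl_df.count() for _ in range(3)]
print(counts)  # differing values would indicate non-deterministic behavior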

euh
  • I've executed it multiple times but, unfortunately, I'm running into the same issue over and over again. – WarBoy Apr 27 '23 at 14:09
  • Does your output vary, though, if you preview it with count or for `new_related_uid`? – euh Apr 27 '23 at 14:53
  • No, actually, every time it's the same. Is there any chance that the PySpark code might generate a Cartesian product (cross join)? – WarBoy Apr 27 '23 at 16:05
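
One way to check the Cartesian-product suspicion raised in the last comment (a sketch, assuming the DataFrame from the question) is to inspect the physical plan:

self_joined_tbl_df.explain()
# In the plan output, a CartesianProduct or BroadcastNestedLoopJoin node,
# or a join whose keys are not direct_uid/item_id, would suggest that a
# join condition was dropped or resolved trivially.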