
To build up the historical layer of our lakehouse architecture, we read CSV files and save them as Delta tables. Part of this logic is a simple count() operation that checks whether the number of new records matches what we expect.

This count() operation behaves inconsistently, and we have not been able to fix the problem. It also appears to be related to the cluster size.

datDf = (
    spark.read
    .option("quote", "")
    .load(
        "abfss://dls@xxxx.dfs.core.windows.net/xx/xx/xx/2022/10/20/xx.dat",
        format="csv",
        sep=";",
        header=False,
    )
)

# The same DataFrame is counted repeatedly; every call should return the same value
print(datDf.count())
print(datDf.count())
print(datDf.count())
print(datDf.count())
print(datDf.count())
print(datDf.count())
print(datDf.count())
print(datDf.count())
print(datDf.count())
print(datDf.count())
print(datDf.count())

Large (16 vCores / 128 GB), 3 to 15 nodes, results in:

50270
50270
50270
45935
50270
50270
41640
50270
50270
50270
45974

Small (4 vCores / 32 GB), 3 to 15 nodes, results in:

50270
50270
50270
50270
50270
50270
50270
50270
50270
50270
50270
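
For context, this is a minimal sketch of the load-and-validate pattern described above. The source path is the placeholder from the question; the expected count, the validation rule, and the Delta target path are assumptions added here for illustration, not the actual production values.

# Sketch of the intended flow: read the CSV, validate the record count,
# then append the batch to the historical Delta table.
# Expected count and Delta target path below are placeholders.
datDf = (
    spark.read
    .option("quote", "")
    .csv(
        "abfss://dls@xxxx.dfs.core.windows.net/xx/xx/xx/2022/10/20/xx.dat",
        sep=";",
        header=False,
    )
)

expected_count = 50270  # example value; in practice this comes from our expectations per file
actual_count = datDf.count()

if actual_count != expected_count:
    raise ValueError(
        f"Unexpected record count: got {actual_count}, expected {expected_count}"
    )

# Append the validated batch to the historical Delta table (placeholder path).
datDf.write.format("delta").mode("append").save(
    "abfss://dls@xxxx.dfs.core.windows.net/xx/delta/historical"
)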
  • After some testing, it looks like the Intelligent Cache percentage setting has something to do with this behavior. After setting it to 0% instead of the default 50% on a large pool, the counts are stable (see the configuration sketch after these comments). – justsander Nov 10 '22 at 10:35
  • Consider posting it as an answer so it might help others facing the same issue. – Saideep Arikontham Nov 11 '22 at 04:06
  • Unfortunately, it did not solve our problem. Microsoft advised us to create a ticket, and I will update this post once I have an answer. – justsander Nov 14 '22 at 08:40
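
For anyone who wants to experiment with the cache setting mentioned in the comments at session level rather than on the pool itself, a sketch along these lines might work. The configuration key used below is an assumption and should be verified against the intelligent cache documentation for your Synapse runtime; the documented approach is to change the cache percentage on the Spark pool configuration.

# Sketch only: attempt to disable the Synapse intelligent cache for the current session.
# The config key "spark.synapse.vegas.useCache" is an assumption and may differ per
# Synapse runtime; depending on the runtime it may need to be set at session start
# (e.g. via %%configure) rather than after the session has been created.
spark.conf.set("spark.synapse.vegas.useCache", "false")

# Re-run the counts to see whether they are now stable.
for _ in range(5):
    print(datDf.count())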

0 Answers