To build the historical layer of our lakehouse architecture we read CSV files and save them as Delta tables. Part of this logic is a simple count() check that verifies whether the number of new records matches what we expect.
This count() operation behaves inconsistently and we have not been able to fix the problem. Moreover, the issue seems to be related to the cluster size.
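For context, the load-and-validate step looks roughly like the sketch below; the storage path, table name and expected count are placeholders for illustration, not our real values:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholders for illustration only.
source_path = "abfss://dls@<account>.dfs.core.windows.net/<path>/2022/10/20/<file>.dat"
target_table = "historical.my_table"
expected_count = 50270

# Read the semicolon-separated file with no header row and quoting disabled.
df = spark.read.csv(source_path, sep=";", header=False, quote="")

# Validate the batch size before persisting it.
actual_count = df.count()
if actual_count != expected_count:
    raise ValueError(f"Record count mismatch: expected {expected_count}, got {actual_count}")

# Append the validated batch to the Delta table of the historical layer.
df.write.format("delta").mode("append").saveAsTable(target_table)

To isolate the issue we reduced it to a single read followed by repeated counts of the same DataFrame: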
datDf = (spark.read
         .option("quote", "")
         .load("abfss://dls@xxxx.dfs.core.windows.net/xx/xx/xx/2022/10/20/xx.dat",
               format="csv", sep=";", header=False))
print(datDf.count())
print(datDf.count())
print(datDf.count())
print(datDf.count())
print(datDf.count())
print(datDf.count())
print(datDf.count())
print(datDf.count())
print(datDf.count())
print(datDf.count())
print(datDf.count())
Large (16 vCores / 128 GB) - 3 to 15 nodes results in:
50270
50270
50270
45935
50270
50270
41640
50270
50270
50270
45974
Small (4 vCores / 32 GB) - 3 to 15 nodes results in:
50270
50270
50270
50270
50270
50270
50270
50270
50270
50270
50270