
To build up the historical layer of our lakehouse architecture, we read CSV files and save them as Delta tables. Part of this logic is a simple count() operation that checks whether the number of new records matches what we expect.

This count() operation behaves inconsistently, and we have not been able to fix the problem. It also appears to be related to the cluster size.

datDf = (
    spark.read
    .option("quote", "")
    .load(
        "abfss://dls@xxxx.dfs.core.windows.net/xx/xx/xx/2022/10/20/xx.dat",
        format="csv",
        sep=";",
        header=False,
    )
)

# The same DataFrame is counted repeatedly; every call should return the same value
print(datDf.count())
print(datDf.count())
print(datDf.count())
print(datDf.count())
print(datDf.count())
print(datDf.count())
print(datDf.count())
print(datDf.count())
print(datDf.count())
print(datDf.count())
print(datDf.count())

Large (16 vCores / 128 GB), 3 to 15 nodes, results in:

50270
50270
50270
45935
50270
50270
41640
50270
50270
50270
45974

Small (4 vCores / 32 GB), 3 to 15 nodes, results in:

50270
50270
50270
50270
50270
50270
50270
50270
50270
50270
50270
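
For context, this is a minimal sketch of the load-and-validate pattern described above. The source path is the placeholder from the question; the expected count, the validation rule, and the Delta target path are assumptions added here for illustration, not the actual production values.

# Sketch of the intended flow: read the CSV, validate the record count,
# then append the batch to the historical Delta table.
# Expected count and Delta target path below are placeholders.
datDf = (
    spark.read
    .option("quote", "")
    .csv(
        "abfss://dls@xxxx.dfs.core.windows.net/xx/xx/xx/2022/10/20/xx.dat",
        sep=";",
        header=False,
    )
)

expected_count = 50270  # example value; in practice this comes from our expectations per file
actual_count = datDf.count()

if actual_count != expected_count:
    raise ValueError(
        f"Unexpected record count: got {actual_count}, expected {expected_count}"
    )

# Append the validated batch to the historical Delta table (placeholder path).
datDf.write.format("delta").mode("append").save(
    "abfss://dls@xxxx.dfs.core.windows.net/xx/delta/historical"
)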
  • After some testing, it looks like the Intelligent Cache percentage setting has something to do with this behavior. After setting it to 0% instead of the default 50% on a large pool, the counts are stable (see the configuration sketch after these comments). – justsander Nov 10 '22 at 10:35
  • Consider posting it as an answer so it might help others facing the same issue. – Saideep Arikontham Nov 11 '22 at 04:06
  • Unfortunately, it did not solve our problem. Microsoft advised us to create a ticket, and I will update this post once I have an answer. – justsander Nov 14 '22 at 08:40
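
For anyone who wants to experiment with the cache setting mentioned in the comments at session level rather than on the pool itself, a sketch along these lines might work. The configuration key used below is an assumption and should be verified against the intelligent cache documentation for your Synapse runtime; the documented approach is to change the cache percentage on the Spark pool configuration.

# Sketch only: attempt to disable the Synapse intelligent cache for the current session.
# The config key "spark.synapse.vegas.useCache" is an assumption and may differ per
# Synapse runtime; depending on the runtime it may need to be set at session start
# (e.g. via %%configure) rather than after the session has been created.
spark.conf.set("spark.synapse.vegas.useCache", "false")

# Re-run the counts to see whether they are now stable.
for _ in range(5):
    print(datDf.count())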

0 Answers