What is the reason for inconsistent counts in Pyspark, Spark SQL and toPandas().shape?

Question

I am working on databricks cloud 5.4 ML and I created a training dataset for my classification problem. When counting the records I get inconsistencies in the counts I cannot explain.

Furthermore I have checked that my Spark DataFrame does not contain null or nan values. I also did the same check in Pandas.

#train is spark dataframe
train.select([F.count(F.when(F.isnan(c) | F.col(c).isNull(),c)).alias(c) 
   for c in train.columns]).show() #all counts are 0
train.filter((F.col('colname') == "") | F.col('colname').isNull() 
   | F.isnan(F.col('colname'))).count() #count is 0
train.toPandas().isnull().sum().sum() #count is 0

Here under are the different count values:

train.count() #count is 6011
train.toPandas().shape[0] #count is 6022

-- Spark SQL
-- count is 6012
SELECT COUNT(*)
FROM train

-- Spark SQL
-- count is 5985
SELECT COUNT(colname)
FROM train

I would expect that the record count generated by Pyspark, Spark SQL and toPandas().shape to be the same. What are your thoughts? Solutions?

Thanks a lot.

score 3 · Answer 1 · answered Jul 25 '19 at 13:04

I found the answer to my problem. I created my dataset using subsampling. it turned out that I needed to cache the train dataframe. Otherwise the lazy evaluation recreates the train spark dataframe every time.

ids = dataset.select('id').distinct()
train_ids = ids.sample(fraction=0.8, seed=3 ,withReplacement=False)
train_ids.cache()
train = dataset.join(F.broadcast(train_ids), ['id'], 'inner')

Thanks and take care.

What is the reason for inconsistent counts in Pyspark, Spark SQL and toPandas().shape?

1 Answers1