I am new to Databricks and working with PySpark DataFrames. In my code, I join two DataFrames using the join function and then call count to get the count of the new DataFrame. I then sort the DataFrame with orderBy and call count again, but this time the count is different. Moreover, the two counts are never the same across runs and return a different value every time. The code is something like this:

newDF = df1.join(df2, df1.col1 == df2.col2, 'inner')
newDF.count()
newDF = newDF.orderBy('col1')
newDF.count()
ASD
  • Is the underlying data source changing? – Robert Kossendey Sep 28 '22 at 12:25
  • I tried to reproduce your scenario and I am getting the same count for both. If your data source is changing, you can get this mismatch. See here: https://i.imgur.com/PDEHtFH.png – Pratik Lad Sep 28 '22 at 12:42
  • No, the underlying data is the same. I have checked the counts of the two DataFrames used in the join to create the new DataFrame; those counts remain the same in every run. Also, at the very least the count before and after sorting should be the same, but that is different too. My DataFrame has 10 million records; does that create any problem? – ASD Sep 28 '22 at 12:56
  • See this: https://i.stack.imgur.com/e9TPm.png – ASD Sep 28 '22 at 13:05

2 Answers

This is due to the "lazy" nature of Spark. In your screenshots it just looks like you are querying the same data, but in the background each .count() query actually retrieves the data from the underlying data source again. I assume that the underlying data source, whatever it might be, changes due to inserts/updates.

You could call .cache() on your Mapped_Data DataFrame, as sketched below.
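A minimal sketch of that suggestion, applied to the simplified snippet from the question (caching the join result; only the snippet's placeholder names are used here, not the Mapped_Data frame from the screenshots):

# Cache the join result so the first action materializes it once;
# later actions reuse the cached data instead of re-reading the
# possibly changing source tables.
newDF = df1.join(df2, df1.col1 == df2.col2, 'inner').cache()

newDF.count()                  # first action runs the join and fills the cache
newDF = newDF.orderBy('col1')
newDF.count()                  # computed from the cached join, not a fresh read

Note that orderBy returns a new DataFrame, but its input is the cached one, so both counts are taken from the same snapshot of the source data.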

Robert Kossendey
  • Like I mentioned, the counts for both Mapped_data and Data_sector_curve remain the same just before the join in every run, but the count before and after sorting changes in every run. So there are two issues: 1) the count is different before and after the sort, and 2) the count changes in every run. – ASD Sep 28 '22 at 14:21
  • Again, this is probably due to the lazy nature of Spark. As you can see, your queries run for more than 10 minutes; in the meantime new data must have arrived in the source tables. – Robert Kossendey Sep 28 '22 at 14:52

If your df1 and df2 hold the same data across runs, then the count will be the same. Is the underlying data that df1 and df2 read changing between runs? Are you doing any sampling, or limiting the data being read into df1 and df2?
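One way to check is to pin both inputs in memory and compare the counts within a single run. A minimal sketch, assuming df1 and df2 are read from tables (the table names here are hypothetical placeholders):

# Hypothetical table names; substitute whatever df1 and df2 actually read from.
df1 = spark.read.table("source_table_1").cache()
df2 = spark.read.table("source_table_2").cache()
print(df1.count(), df2.count())        # materializes both caches

newDF = df1.join(df2, df1.col1 == df2.col2, 'inner')
print(newDF.count())
print(newDF.orderBy('col1').count())   # should match the line above

If the two joined counts now agree, the earlier mismatch came from the source data changing between reads, not from the join or the sort.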

inder
  • No, I am only ordering the data, as you can see in the code and the pics attached in the comment – ASD Oct 04 '22 at 18:47