Can anybody explain the following behavior?
import pyspark.pandas as ps
loan_information = ps.read_sql_query([blah])
loan_information.shape
# (748834, 84)
loan_information.apply(lambda col: col.shape)
# Each column comes back as 75 separate shapes: the first 74 are (10000,), the last is (8834,)
# The sizes still sum to 748834 rows, but this hardly seems like desirable behavior
My guess is that batches of 10000 rows are being fed to the executors one at a time, so the lambda runs once per batch rather than once per column. Again, this seems like pretty undesirable behavior.
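For what it's worth, here is a rough check of that guess, assuming the 10000-row chunks come from Arrow's per-batch record limit (spark.sql.execution.arrow.maxRecordsPerBatch, which defaults to 10000) rather than from pyspark.pandas itself. If that's right, lowering the limit should change the chunk sizes the lambda sees:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Current Arrow batch size (falls back to the documented default of 10000)
print(spark.conf.get("spark.sql.execution.arrow.maxRecordsPerBatch", "10000"))

# If the chunking follows this setting, shrinking it should produce more,
# smaller batches inside apply()
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "1000")
loan_information.apply(lambda col: col.shape)
# Expected under this assumption: shapes of (1000,) instead of (10000,)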