My question is about the execution time of PySpark code in Zeppelin.
I have several notes in which I run SQL queries. In one of them, I convert my Spark DataFrame to a pandas DataFrame with the .toPandas() method. The data is about 600 MB.
My problem is that this conversion takes a long time.
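For reference, the note does roughly this (a minimal sketch, using the spark session that Zeppelin's PySpark interpreter provides; my_table is a placeholder for my real source):

df = spark.sql("SELECT * FROM my_table")  # my_table stands in for my actual table
pdf = df.toPandas()  # collects the full ~600 MB result to the driver, then converts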
If I sample first, for example like this:
df.sample(False, 0.7).toPandas()  # withReplacement=False, keep ~70% of the rows
it finishes correctly and in an acceptable time.
The other strange point is that when I run this note several times, it is sometimes fast and sometimes slow; for example, the first run after restarting the PySpark interpreter works faster.
How can I make Zeppelin run in a stable, predictable way, and which parameters matter most for getting Spark code to run in an acceptable time?
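For example, I know I can set Spark options from the note itself or in the interpreter settings, like this (the option below is just one I have read about, and I am not sure it applies to my Spark version):

spark.conf.set("spark.sql.execution.arrow.enabled", "true")  # Arrow-backed toPandas(), Spark 2.3+

But I don't know which settings (driver memory, Arrow, partitioning, etc.) actually make a difference for toPandas().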