
My question is about the execution time of PySpark code in Zeppelin.

I have several notes in which I work with SQL queries. In one of them, I convert my DataFrame to a pandas DataFrame with the .toPandas() function. The size of my data is about 600 MB.

My problem is that this takes a long time.

If I use sampling, for example like this:

df.sample(False, 0.7).toPandas()

it works correctly and in an acceptable time.

Another strange point is that when I run this note several times, it sometimes runs fast and sometimes slowly. For example, the first run after restarting the PySpark interpreter is faster.

How can I get Zeppelin to run in a stable state? And which parameters affect whether Spark code runs in an acceptable time?


1 Answer


The problem here is not Zeppelin, but you as a programmer. Spark is a distributed (cluster-computing) data analysis engine written in Scala, which therefore runs in a JVM. PySpark is the Python API for Spark; it uses the Py4j library to provide an interface to JVM objects.

Methods like .toPandas() or .collect() return a Python object that is not just an interface to JVM objects (i.e. it actually contains your data). They are costly because they require transferring your (distributed) data from the JVM to the Python interpreter inside the Spark driver. Therefore you should only call them when the resulting data is small, and work with PySpark DataFrames as long as possible.
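As a rough sketch of that advice (the table name and columns below are placeholders, not from your notebook): do the filtering and aggregation on the Spark side, and only convert the small result to pandas.

    from pyspark.sql import SparkSession, functions as F

    # In Zeppelin the session usually already exists; getOrCreate() reuses it.
    spark = SparkSession.builder.getOrCreate()

    # Placeholder input: replace "my_table" and the columns
    # "category" / "amount" with your own data.
    df = spark.table("my_table")

    # Heavy lifting (filter, group, aggregate) stays distributed in Spark ...
    summary = (
        df.filter(F.col("amount") > 0)
          .groupBy("category")
          .agg(F.sum("amount").alias("total_amount"),
               F.count("*").alias("n_rows"))
    )

    # ... and only the small, aggregated result is shipped to the driver.
    # This transfers a handful of rows instead of ~600 MB.
    summary_pdf = summary.toPandas()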

Your other issue regarding varying execution times is something to discuss with your cluster admin. Network spikes and jobs submitted by other users can heavily influence your execution time. I am also surprised that your first run after a restart of the Spark interpreter is faster, because during the first run the SparkContext is created and cluster resources are allocated, which adds some overhead.
