
I'm trying to use Spark on my desktop computer, which runs Windows 7 (locally, not on a cluster or anything, just to get some practice with it), through PySpark in an IPython notebook. I found a package called 'findspark' (available on pip) which can be used to avoid going through the full Spark setup.

Basically, I just download a Spark version pre-built for Hadoop from the official site, decompress the file and then run something like this in Python:

import findspark
findspark.init('spark_directory')  # point findspark at the unpacked Spark folder
import pyspark
sc = pyspark.SparkContext()  # default local master and driver memory

and I get a working SparkContext without setting up anything else. However, it runs quite slowly, to the point that if I run something like:

print(sc.parallelize([1]).collect())

it takes over a second to produce the result. More expensive computations are also quite slow, and RAM usage is limited (i.e. it doesn't exceed a certain point even if the computation needs it). For comparison, I also ran the same operations from an already-configured Linux virtual machine that I downloaded for a MOOC, and everything ran a lot faster there.

I was wondering what I can do or what I can configure to speed it up. My aim is to have a functional instance of Spark on my local machine to practice with PySpark in an IPython notebook.
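For reference, this is the kind of thing I imagine tweaking, though I'm only guessing that the master string and the driver memory are the relevant knobs (the 4g value and the PYSPARK_SUBMIT_ARGS route are just my assumptions):

import os

# Guess: driver memory has to be set before the JVM starts, so it may need
# to go through PYSPARK_SUBMIT_ARGS rather than SparkConf
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-memory 4g pyspark-shell'

import findspark
findspark.init('spark_directory')

from pyspark import SparkConf, SparkContext

# Use all local cores instead of whatever the default is
conf = SparkConf().setMaster('local[*]').setAppName('notebook-practice')
sc = SparkContext(conf=conf)
print(sc.parallelize(range(100)).sum())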

anymous_asker
  • Did you take a look [here](http://stackoverflow.com/q/33326749/3415409)? – eliasah Nov 09 '15 at 07:44
  • Thanks for the link, but it deals with how to get pyspark running in Eclipse (from what I understood, it had to do with setting an environment variable pointing to the Spark directory; I've sketched below these comments what I understood that setup to be). What I meant to ask was more along the lines of what I could tweak/configure/set up so that it doesn't run slowly and doesn't have a memory usage limit. – anymous_asker Nov 09 '15 at 07:54
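
From what I understood, the manual setup in the linked answer (which is what findspark automates) is roughly the following; the py4j zip name depends on the Spark version, so these paths are just illustrative:

import os, sys

# Point SPARK_HOME at the unpacked distribution and put the Python bindings
# on sys.path (the py4j zip file name is version-dependent)
spark_home = 'spark_directory'
os.environ['SPARK_HOME'] = spark_home
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python', 'lib', 'py4j-0.8.2.1-src.zip'))

import pyspark
sc = pyspark.SparkContext()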

0 Answers