I’m trying to use Spark on my desktop computer, which runs Windows 7 (locally, not on a cluster, just to get some practice with it), through pySpark in an IPython notebook. I found a package called ‘findspark’ (available on pip) which can be used to avoid having to go through the full Spark setup.
Basically, I just download a Spark version pre-built for Hadoop from the official website, decompress the file, and then run something like this in Python:
import findspark
findspark.init('spark_directory')
import pyspark
sc = pyspark.SparkContext()
and I get a working Spark context without having to set up anything else. However, it runs quite slowly, to the point that if I run something like:
print(sc.parallelize([1]).collect())
it takes over a second to produce the result. More expensive computations are also quite slow, and RAM usage appears to be capped (i.e. it doesn’t exceed a certain point even when the computation would need more memory). For comparison, I ran the same operations on an already-configured Linux virtual machine that I downloaded for a MOOC, and everything ran a lot faster there.
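For what it’s worth, this is roughly how I’ve been measuring it in the notebook, with IPython’s %time magic (the exact figure varies between runs):

%time print(sc.parallelize([1]).collect())
# Wall time comes out above one second on my machine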
I was wondering what I can do or what I can configure to speed it up. My aim is to have a functional local instance of Spark to practice pyspark with, in an IPython notebook.
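In case it clarifies what I mean by “configure”: this is the kind of thing I imagine I should be passing, though I’m only guessing at which settings are relevant (‘local[*]’ to use all cores and ‘spark.driver.memory’ to raise the RAM cap are my assumptions, and ‘spark_directory’ is just a placeholder for my unpacked Spark folder):

import findspark
findspark.init('spark_directory')  # path to the unpacked Spark distribution

import pyspark

# My guesses at relevant settings: use all local cores,
# and allow the driver JVM more memory than the default.
conf = pyspark.SparkConf() \
    .setMaster('local[*]') \
    .set('spark.driver.memory', '4g')
sc = pyspark.SparkContext(conf=conf)

Is something along these lines the right approach, or is the slowness caused by something else entirely?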