7

I just started learning Spark, and I'm a bit confused by this concept. From the Spark installation we get pyspark under the Spark installation sub-folders, which I understand is a shell. We can also install the Python package through pip install pyspark, so we can run the Python code instead of submitting it to the cluster. What's the difference between these two? Also, in Anaconda we can use findspark and use pyspark from there, so does that mean it's not using the pyspark from the Python package?
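For example, in a Jupyter notebook under Anaconda I do something roughly like this (the app name is just a placeholder):

```python
import findspark
findspark.init()  # locates the Spark installation (e.g. via SPARK_HOME) and adds it to sys.path

from pyspark.sql import SparkSession

# plain local session just to check that PySpark is importable and working
spark = SparkSession.builder.master("local[*]").appName("findspark-test").getOrCreate()
print(spark.version)
spark.stop()
```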

Plus, in real-world Spark application development, what is used in which scenario? Thanks in advance.


2 Answers

3

If you pip install, that's only going to install the necessary Python libraries locally; it will not include the spark-submit script or the other Spark configuration files that you'd otherwise get by downloading all of Spark.
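As a rough sketch of what the pip install alone gives you: you can create and use a local session purely from Python, with no spark-submit involved (the app name and data below are just examples):

```python
# Runs with only `pip install pyspark`; no separate Spark download is needed for local mode.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")              # local mode, using all available cores
         .appName("pip-only-example")     # example app name
         .getOrCreate())

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()
spark.stop()
```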

Therefore, in the "real world" of Spark outside of notebooks, you'd package the Python code as a zip, then submit it to a cluster using that submit script, or otherwise set up the master and all Spark options within the code itself, which is not as flexible.
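For illustration, hard-coding the connection details in the application looks roughly like this (the master URL, config value, and file names are made up):

```python
from pyspark.sql import SparkSession

# Everything is fixed in the code, so changing clusters or memory settings means editing the app.
spark = (SparkSession.builder
         .master("spark://my-master:7077")        # hypothetical cluster URL
         .config("spark.executor.memory", "4g")   # example option
         .appName("my-app")
         .getOrCreate())

# The more flexible route is to leave these out of the code and pass them at submission time
# with the spark-submit script from the full Spark download, along with the zipped dependencies:
#   spark-submit --master spark://my-master:7077 --py-files deps.zip main.py
```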

0

For Spark versions lower than 2.2, you need to install Spark itself and then perform some additional setup steps. For higher versions, pip install pyspark is enough.
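For example, a quick way to check that the pip-installed package works on its own (assuming version 2.2 or later):

```python
import pyspark
print(pyspark.__version__)  # should be 2.2+ for the pip-only setup to be sufficient

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.range(5).count())  # prints 5 if the local session works
spark.stop()
```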