I understand that pyspark shell uses Python interpreter. How is it possible to import a jar to it? What happens in the backstage that makes it possible?
-
What do you mean by "import a jar"? – Alper t. Turker Jul 20 '18 at 13:04
-
dupe @user8371915 ? https://stackoverflow.com/questions/31684842/calling-java-scala-function-from-a-task/34412182#34412182 – eliasah Jul 20 '18 at 13:39
-
@eliasah Maybe, but I wouldn't cast a vote without clarification. [This comment](https://stackoverflow.com/questions/51443114/what-happens-in-the-backstage-when-we-import-a-jar-to-pyspark-shell?noredirect=1#comment89858719_51443712) suggests that OP is not really interested in PySpark. – Alper t. Turker Jul 20 '18 at 14:25
-
But the OP accepted the answer... – eliasah Jul 20 '18 at 14:27
-
@eliasah Maybe dupe of [Calling Java from Python](https://stackoverflow.com/q/3652554/8371915)? – Alper t. Turker Jul 20 '18 at 14:27
-
Definitely but I’ve already cast a vote as not clear... – eliasah Jul 20 '18 at 14:28
-
@user8371915 Am interested in understanding how things work and was confused about having jars imported to Python. I'm currently building an application that should read data from Kudu and load it to Hive using PySpark. Currently the only way I can use the kudu-spark lib is through PySpark/Spark-Submit – Guigs Jul 20 '18 at 14:36
-
In that case please follow the link shared by eliasah. It explains how to interface Python and Java in Spark and links to more detailed explanations – Alper t. Turker Jul 20 '18 at 15:15
-
@user8371915 yep, am doing that. Should I delete this question then? – Guigs Jul 20 '18 at 16:43
1 Answer
In short, nothing, because you don't really import a jar into the Python interpreter (well, unless you use Jython, but that's a different story).
In PySpark, the Python interpreter communicates with the JVM using sockets.
- Python serializes data (some form of it) or a command and sends it over a socket to the JVM process.
- The JVM process deserializes it, decides what to do with it, computes the result, and sends it back over the socket to the Python interpreter.
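The round trip in the two steps above can be sketched with plain Python sockets. This is a toy illustration only: the "server" thread here stands in for the JVM, and the pickle-over-socket protocol is an assumption for demonstration, not Py4J's actual wire format (which is a text-based protocol of its own).

```python
# Toy sketch of the Python <-> "JVM" round trip (NOT Py4J's real protocol).
import pickle
import socket
import threading

def fake_jvm_server(server_sock):
    """Stand-in for the JVM process: deserialize a command, compute, reply."""
    conn, _ = server_sock.accept()
    with conn:
        command, args = pickle.loads(conn.recv(4096))  # deserializes the thing
        if command == "sum":                           # decides what to do with it
            reply = sum(args)                          # computes the result
        else:
            reply = None
        conn.sendall(pickle.dumps(reply))              # sends it back over the socket

server = socket.socket()
server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]
threading.Thread(target=fake_jvm_server, args=(server,), daemon=True).start()

# "Python side": serialize a command and ship it across the socket.
client = socket.socket()
client.connect(("127.0.0.1", port))
client.sendall(pickle.dumps(("sum", [1, 2, 3])))
result = pickle.loads(client.recv(4096))
client.close()
print(result)  # -> 6
```

In real PySpark the same pattern is driven by Py4J: the Python side holds proxies for JVM objects and every method call is serialized, sent to a `GatewayServer` running in the JVM, executed there, and the result shipped back.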
If any import from a jar happens, it happens on the JVM, in its "natural" environment.
The specific tool used is Py4J, so you can look into it if you're interested in the implementation details, but other similar tools exist as well.

user10111189
-
Thank you! So if I want to use the original Python interpreter (not PySpark) and import a jar that contains classes and methods, I'll need to implement this myself using Py4J? – Guigs Jul 20 '18 at 14:07