I read the article below, which explains how PyFlink works across the Python interpreter and the JVM. https://www.alibabacloud.com/blog/the-flink-ecosystem-a-quick-start-to-pyflink_596150
However, I couldn't figure out whether a job is executed across several processes, among other details.
So I've written down my interpretation of the internal architecture here, and I'd like to ask you to correct it.
My Interpretation:
- The user writes the program in Python.
- The Python interpreter runs in a process (Process 1) that manages the connection to the JVM gateway server (Process 2).
- Process 1 builds the streaming job graph and submits it to the JVM gateway server through a socket, which is also used to pass input data and receive output data.
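To make the Process 1 / Process 2 split concrete, here is a minimal stdlib-only analogy of the gateway pattern: a "driver" connects to a local "gateway" socket, submits a plan, and reads a reply. This is only an illustration of the shape of the interaction; PyFlink actually uses Py4J with its own protocol, and every name and message format below is made up for the sketch.

```python
import json
import socket
import threading

def gateway_server(server_sock):
    """Toy 'JVM gateway' (Process 2): accept one connection, handle one command."""
    conn, _ = server_sock.accept()
    with conn:
        request = json.loads(conn.recv(4096).decode())
        # Pretend to build a job graph from the submitted plan.
        reply = {"status": "submitted", "job": request["plan"]}
        conn.sendall(json.dumps(reply).encode())

# Start the 'gateway' on an ephemeral local port (Py4J also picks a port).
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
threading.Thread(target=gateway_server, args=(server,), daemon=True).start()

# The 'Python driver' side (Process 1): connect and submit a plan.
client = socket.create_connection(server.getsockname())
client.sendall(json.dumps({"plan": ["source", "map", "sink"]}).encode())
response = json.loads(client.recv(4096).decode())
client.close()
print(response["status"])  # -> submitted
```

In the real system the same Py4J channel carries method calls on JVM objects rather than a one-shot JSON message, but the process boundary and socket transport are the same idea.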
- Next, a JobManager (Process 3) plans the actual execution and allocates sufficient task slots (Processes 4). At this point there is an execution graph that is parallelized and ready to execute.
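One detail worth noting about slot allocation: with Flink's default slot sharing, a single slot can hold one parallel subtask of every operator in the pipeline, so the number of slots needed equals the maximum operator parallelism rather than the sum. A tiny sketch (the operator list is hypothetical):

```python
# Hypothetical pipeline: operator name -> parallelism.
operators = {"source": 2, "map": 4, "sink": 2}

# With default slot sharing, slots needed = max parallelism, not the sum.
slots_needed = max(operators.values())
print(slots_needed)  # -> 4
```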
- Python sometimes needs to be called back, because some operators use a user-defined function, a lambda, etc. that has no Java implementation. In that case, the operator (e.g. map, flatMap) (Processes 4) sends state and related information to a Python process (Process 5) created to execute the UDF or lambda, and the result is returned to the operator in the JVM.
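The operator-to-Python-worker round trip in the last bullet can be sketched with stdlib pipes: the "operator" (parent) streams records to a separate "Python worker" process, which applies the user function and streams results back. Real PyFlink runs Python UDF workers over a gRPC-based protocol (via Apache Beam's portability framework), so the line-based JSON protocol here is purely illustrative.

```python
import json
import subprocess
import sys
import textwrap

# Worker script (stand-in for Process 5): read JSON records from stdin,
# apply the 'UDF', write JSON results to stdout.
worker_src = textwrap.dedent("""
    import json, sys
    udf = lambda x: x * x  # stand-in for a user-defined function
    for line in sys.stdin:
        print(json.dumps(udf(json.loads(line))), flush=True)
""")

# 'Operator' side (stand-in for Processes 4): launch the worker and
# stream records across the process boundary.
worker = subprocess.Popen(
    [sys.executable, "-c", worker_src],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)
out, _ = worker.communicate("\n".join(json.dumps(x) for x in [1, 2, 3]))
results = [json.loads(line) for line in out.splitlines()]
print(results)  # -> [1, 4, 9]
```

The key point the sketch shows is the cost model: every record that hits a Python UDF crosses a process boundary and is serialized both ways, which is why operators with pure-JVM implementations avoid this round trip entirely.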