Let me add a few thoughts:
occam-pi is a true-[PARALLEL] language, well fitted onto the parallel INMOS T414 Transputer hardware ( actually a hardware network of Transputers ). Its process flow was based on a scheduling strategy with formal guarantees, and its coordination, grounded in Hoare's seminal CSP process calculus, was no constraint to achieving true-[PARALLEL] execution, pure-[SERIAL] execution ( where feasible ) and opportunistic "just"-[CONCURRENT] execution, where required.
So the language ( paradigm ) does not uniquely map onto any single archetype form of parallelism from the Wikipedia list above. The properties of the external code-execution eco-system also matter.
Python, on the other hand, has been, since Guido van Rossum's design decision, a purely sequential interpreter process: whatever number of threads one may have instantiated, the central Global Interpreter Lock, the GIL, knowingly chops the flow of time and permits one and only one thread to execute, all others waiting for the GIL, so the code principally avoids many forms of "just"-[CONCURRENT]-related collisions ( race conditions to acquire a resource, reads colliding with writes, et al. ).
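The GIL's cooperative time-slicing can be glimpsed from inside CPython itself: `sys.getswitchinterval()` reports the slice after which the interpreter asks the running thread to offer the GIL to others. A minimal sketch ( CPython-specific; the 0.005 s default may differ per build ) of several CPU-bound threads that finish correctly, yet run interleaved rather than in true-[PARALLEL]:

```python
import sys
import threading

# CPython requests a GIL hand-over roughly every switch-interval seconds;
# only one thread executes Python bytecode at any instant.
print(sys.getswitchinterval())          # 0.005 by default on CPython

def count(n, out):
    total = 0
    for _ in range(n):
        total += 1                      # CPU-bound work, serialised by the GIL
    out.append(total)                   # list.append is atomic in CPython

results = []
threads = [threading.Thread(target=count, args=(100_000, results))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All four threads completed, but the wall-clock time is ~4x one thread's
# work -- the GIL re-[SERIAL]-ised their execution.
print(sum(results))                     # 400000
```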
Python can use message passing via MPI or ZeroMQ, can use a CSP-paradigm module, and has modules that provide actor-model behaviour ( for example, mimicking the XEROX PARC invention of Model-View-Controller coordination ), so the language itself typically does not constrain the paradigm being used on a higher layer of abstraction ( while the lower-level constraints do limit how hard-real-time any such abstracted form of execution may get: any low-level limitation extends the latency of all upper abstraction layers and may introduce fine-grain blocking state(s) that lie outside the domain of control of the upper-layer, abstracted code-execution behaviour(s) ).
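To illustrate the CSP flavour without any third-party module, a bounded `queue.Queue` can stand in for a channel between two thread-"processes" ( a hypothetical minimal sketch, not a real CSP library; dedicated modules such as PyCSP offer proper channel/choice constructs ):

```python
import queue
import threading

# Queue(maxsize=1) approximates a near-synchronous CSP channel:
# the producer blocks until the consumer has drained the slot.
chan = queue.Queue(maxsize=1)
SENTINEL = object()                       # end-of-stream marker

def producer():
    for i in range(3):
        chan.put(i * i)                   # rendezvous-like hand-off
    chan.put(SENTINEL)                    # signal completion

def consumer(out):
    while True:
        item = chan.get()
        if item is SENTINEL:
            break
        out.append(item)

out = []
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer, args=(out,))
t1.start(); t2.start()
t1.join(); t2.join()
print(out)                                # [0, 1, 4]
```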
Python can use multiprocessing ( joblib-decorated or not ) - it helped to partially escape from that principal limit, and, as Guido van Rossum has himself expressed, the GIL will remain a natural part of the Python interpreter unless an immense-scale, total re-design of the whole interpreter concept is undertaken ( which is not, in his view, a probable direction of further efforts spent in this domain ). Attempts to escape from the otherwise known and ever-present, GIL-orchestrated re-[SERIAL]-isation of any number of threads' execution have been developed, yet each one comes at a cost. Human-related: refactoring the code. System-related: re-spawning full, identical copies of the original Python interpreter state ( the only option under Windows-class O/S-es; partial or ad-hoc copies under Linux fork or forkserver ). This makes trouble for both newbies and practitioners, who ignore or wrongly guess the Amdahl's-Law add-on costs incurred right at process instantiation ( spawn TIME + RAM-allocation TIME + RAM-to-RAM copy TIME + parameter / inter-process SER/DES add-on TIME ), the sum of which may easily wipe out any promised or wished-for speedup from going into the "just"-[CONCURRENT] or true-[PARALLEL] code-execution domain.
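The overhead-aware reading of Amdahl's Law hinted at above can be sketched as a back-of-the-envelope model ( the parameter names and the example numbers below are illustrative assumptions, not measurements ):

```python
def effective_speedup(p, n, overhead_s, serial_runtime_s):
    """Amdahl's Law with a one-off add-on overhead folded in.

    p                : parallelisable fraction of the work (0..1)
    n                : number of workers
    overhead_s       : add-on costs (spawn + RAM copies + SER/DES), seconds
    serial_runtime_s : runtime of the pure-[SERIAL] code, seconds
    """
    parallel_runtime = serial_runtime_s * ((1 - p) + p / n) + overhead_s
    return serial_runtime_s / parallel_runtime

# Zero overhead -> classic Amdahl: 4 workers, 95 % parallelisable, 10 s job
print(round(effective_speedup(0.95, 4, 0.0, 10.0), 2))   # 3.48

# The same job with 8 s of process-instantiation + SER/DES add-on costs
# yields a net SLOWDOWN ( speedup < 1.0 ) despite 4 workers
print(round(effective_speedup(0.95, 4, 8.0, 10.0), 2))   # 0.92
```

The second call is the "wipe out any promises" case from the paragraph above: the wished-for speedup is paid for, but never delivered.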
Python can, like most of the other mentioned examples, participate in a distributed-computing infrastructure, where higher-layer paradigms control the mode of cooperative execution, so the macro-system may exhibit higher levels of concurrency, not visible from "inside" a Python node.
The above-listed "forms"-of are sort of academic ( missing hardware-based ILP parallelism, and the AND-based and OR-based fine-grain forms of parallelism ). PRAMs were a subject of CS research as far back as the late 1960s and early 1970s, when it was concluded that even PRAM-based architectures cannot escape from the Class-2 computing taxonomy:
"Section 4.3 ( IS THERE ANY CHANCE FOR A GIANT LEAP ) BEYOND THE CLASS-2 COMPUTER BOUNDARIES (2)

The main practical - though negative - implication of the previous thoughts is the fact that within Class-2 computing, no efficient solution is to be expected for sequentially intractable problems. Nevertheless, a question arises here whether some other sort of parallel computers could be imagined that would be computationally more efficient than Class-2 computers. Indications, coming from many known, conceptually different C2-class computer models, suggest that without adding some other, fundamental computing capability, parallelism per se does not suffice to overcome the C2-class boundaries, irrespective of how we try to modify, within all thinkable possibilities, the architectures of such computers.

As a matter of fact, it turns out that the C2-class boundaries would be crossed if non-determinism were added to an MIMD-type parallelism ( Ref. Section 3.5 ). A non-deterministic PRAM (+) can, as an example, solve ( intractable ) problems from the NPTIME class in polylogarithmic time, and problems of demonstrably exponential sequential complexity in polynomial time. Because, in the context of computers, non-determinism is about as technically feasible to implement as clairvoyance, the C2 computer class seems to represent, from the efficiency point of view, the ultimate class of parallel computers, the borders of which will never be crossed.

+) PRAM: a Parallel-RAM, not a SIMD-only processor, as demonstrated by Savitch and Stimson in 1979 (1)

(1) SAVITCH, W. J. - STIMSON, M. J.: Time bounded random access machines with parallel processing. J. ACM 26, 1979, pp. 103-118.
(2) WIEDERMANN, J.: Efficiency boundaries of parallel computing systems ( Medze efektivnosti paralelných výpočtových systémov ). Advances in Mathematics, Physics and Astronomy ( Pokroky matematiky, fyziky a astronomie ), Vol. 33 (1988), No. 2, pp. 81-94"
Both process-based and thread-based code may, per se, use, or participate in, a gang of coordinated actors in almost any of the above-listed forms of concurrency.
The code implementation, plus all the underlying resource-management constraints ( hardware + O/S + resource-management policy in the respective context of use ), actually decide which forms remain achievable in the field, and when and how any piece of code gets executed. I.e., your code design may be of any level of architectural ingenuity - if the O/S policy restricts your code to execute on one and only one CPU-core ( due to an affinity-mapping constraint enforced by the user-process's effective rights ), any such smart code will again result in re-[SERIAL]-ised code execution ( paying all the add-on overhead costs of the wished-for [CONCURRENT] execution, but getting nothing in return for having spent, and continuing to spend, such add-on costs ), much like the straightforward, pure-[SERIAL] code does [ which also remains free of any wasted add-on costs, so it delivers results faster, often also enjoying the benefit of non-depleted, CPU-core-local L1/L2 cache hierarchies, if HPC-grade computing was carefully designed-in :o) ]
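Whether the O/S affinity policy has indeed pinned the interpreter to a single core can be checked from inside the process. A sketch assuming the POSIX `os.sched_getaffinity` API ( available on Linux only; elsewhere we fall back to `os.cpu_count()`, which is blind to affinity masks ):

```python
import os

def usable_cores():
    # Linux exposes the effective CPU-affinity mask of the calling process;
    # if the policy pins us to one core, any multi-threaded / multi-process
    # CPU-bound design degenerates into re-[SERIAL]-ised execution.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1          # portable, affinity-blind fallback

n = usable_cores()
print(n)
# With n == 1, every add-on concurrency overhead is paid for nothing --
# the straightforward pure-[SERIAL] code wins outright.
```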