The FFP Machine is a very fine-grained parallel computer architecture:
each processor holds a single symbol / atom / value.
It uses a string reduction model of computation in which
innermost function applications are found and replaced by their
equivalent result (eager evaluation).
Where a result is used in several places, it tends to be re-evaluated
instead of incurring the costs of accessing some global store
(but see Mago's paper on "Copying Operands vs Copying Results", or better yet Mago's "Data Sharing in an FFP Machine" in the 1982 Functional Programming Languages and Computer Architecture conference).
The L cells holding the FFP expression being reduced
communicate through a tree structured arrangement of T cells.
Note that IC's are basically two dimensional and with wiring,
circuits can move towards being three dimensional in physical space.
Interconnection networks that occupy higher dimensions
(such as the Hypercube, Omega, Banyan, Star, etc. networks)
will eventually be unable to perform near their theoretical limit.
This communication network is circuit-switched rather than being packet-switched.
Data packets contain no addresses and do not need routing.
Packets from distinct reductions cannot meet, cannot conflict
and cannot experience congestion with each other.
The configuring activity (called "Partitioning") is performed
in a single sweep upwards in the tree, using a handful
of logic operations on 3-bit messages, leaving "area machines" in its wake,
each created to advance at most a single reducible application.
While it is technically logarithmic in time,
the resulting area machines can begin communicating
in a pipelined fashion behind the partitioning wave,
practically costing a constant time penalty.
(The dismantling of area machines remains a logarithmic cost in time).
Packets within a single reduction should, and must, meet
and thus provide a often-useful synchronization.
Sequences of packets are sorted and combined as they rise
within an area, to be broadcast from the root of the area machine.
Parallel Prefix and Parallel Suffix operations are provided
to reduce area traffic, since there remains a potential bottleneck
within an individual reducible application.
This is accomplished without the need exhibited in
the Ultracomputer (Jack (Jacob?) Schwartz at NYU)
for a separate logarithmic-sized cache memory in each
communication node.
Each T cell (internal tree node) only needs a FIFO buffer
(for efficiency) of size greater than the pipeline path to
the top of the tree and back down.
(This latter is a conjecture of mine, but it seems reasonable).
Since the tree maintains the left-to-right order of data
(unlike some other combining networks), the system enables cells
to rotate their data in logarithmic rather than linear time,
avoiding the plausible congestion at the root of the area machine.
It's worth noting again, that the parallelism within an area
machine is independent of the simultaneous parallelism in other
area machines, and has available to it a number of processors
proportional to the quantity of data in the operand.