Space efficient data bus implementations

Question

I'm writing a microcontroller in VHDL and have essentially got a core for my actual microcontroller section down. I'm now getting to the point however of starting to include memory mapped peripherals. I'm using a very simple bus consisting of a single master (the CPU) and multiple slaves (the peripherals/RAM). My bus works through an acknowledge CPU->perip and acknowledge perip->CPU. The CPU also has separate input and output data buses to avoid tristates.

I've chosen this method as I wish to have the ability for peripherals to stall the CPU. A bus transaction is achieved by: The master places the data, address and read/write bit on the bus, bringing the ack(c->p) high. Once the slave has successfully received the information and has placed the response back on the data (p->c) bus, the slave sets its ack(p->c) high. The master notes the slave has successfully placed the data, takes the data for processing and releases the ack(c->p). The bus is now in idle state again, ready for further transactions.

Obviously this is a very simple bus protocol and doesn't include burst features, variable word sizes or other more complex features. My question however is what space efficient methods can be used to connect peripherals to a master CPU?

I've looked into 3 different methods as of yet. I'm currently using a single output data bus from the master to all of the peripherals, with the data outputs from all the peripherals being or'd, along with their ack(p->c) outputs. Each peripheral contains a small address mux which only allows a slave to respond if the address is within a predefined range. This reduces the logic for switching between peripherals but obviously will infer lots of logic/peripheral for the address muxes which leads me to believe that future scalability will be impacted.

Another method I though of was having a single large address mux connected from the master which decodes the address and sends it, along with the data and ack signals to each slave. The output data is then muxed back into the master. This seems a slightly more efficient method though I always seem to end up with ridiculously long data vectors and its a bit of a chore to keep track of.

A third method I thought of was to have it arranged in a ring like fashion. The master address goes to all of the slaves, with a smaller mux which merely chooses which ack signals to send out. The data output from the master then travels serially through each slave. Each slave contains a mux which can allow it to either let the data coming into it pass through unaffected OR to allow the slave to place its own data on the bus. I feel this will work best for slow systems as there is only one small mux/slave required to mux between the incoming data and that slave's data, along with a small mux that decodes the address and sends out the ack signals. The issue here I believe however is that with lots of peripherals, the propagation delay from the output of the master to the input of the master would be pretty large as it has to travel through each slave!

Could anybody give me suitable reasoning for the different methods? I'm using Quartus to synthesize and route for an Altera EP4CE10E22C8 FPGA and I'm looking for the smallest implementation with regards to FPGA LUTs. My system uses a 16bit address and data bus. I'm looking to achieve at minimum ~50MHz under ideal memory conditions (i.e no wait states) and would be looking to have around 12 slaves, each with between 8 and 16bits of addressable space.

Thanks!

Your problem is not sufficiently well described: you asked for *space efficient methods* but you forgot to explain what is your hardware target, what tools you are using and what metric you consider as a *space* metric (transistors, gates, silicon area, FPGA LUTs...) You forgot also to tell us how many slaves you consider and what your target clock frequency is. Finally, if you could give us an idea of the complexity of your address space (how many bits of the address bus are needed to uniquely identify the target slave), it could also help. — Renaud Pacalet, Dec 04 '15 at 07:25
@RenaudPacalet I've now hopefully improved my question, thank you for your help. — Pyrohaz, Dec 04 '15 at 14:19
OK, I updated my answer to take these extra info into account. — Renaud Pacalet, Dec 04 '15 at 15:16

Renaud Pacalet · Accepted Answer · 2015-12-04T15:15:51.150

I suggest that you download the AMBA specification from the ARM web site (http://www.arm.com/) and look at the AXI4-lite bus or the much older APB bus. In most bus standards with a single master there is no multiplexer on the addresses, only an address decoder that drives the peripheral selection signals. It is only the response data from the slaves that are multiplexed to the master, thanks to the "response valid" signals from the slaves. It is scalable if you pipeline it when the number of slaves increases and you cannot reach your target clock frequency any more. The hardware cost is mainly due to the read data multiplexing, that is, a N-bits P-to-one multiplexer.

This is almost your second option.

The first option is a variant of the second where read data multiplexers are replaced by or gates. I do not think it will change much the hardware cost: or gates are less complex than multiplexers but each slave will now have to zero its read data bus, which adds as many and gates. A good point is, maybe, a reduced activity and thus a lower power consumption: slaves that are not accessed by the master will keep their read data bus low. But as you synthesize all this with a logic synthesizer and place and route it with a CAD tool, I am almost sure that you will end up with the same results (area, power, frequency) as for the more classical second option.

Your third option reminds me the principles of the daisy chain or the token ring. But as you want to avoid 3-states I doubt that it will bring any benefit in terms of hardware cost. If you pipeline it correctly (each slave samples the incoming master requests and processes them or passes them to the next) you will probably reach higher clock frequencies than with the classical bus, especially with a large number of slaves, but as, in average, a complete transaction will take more clock cycles, you will not improve the performance neither.

For really small (but slow) interconnection networks you could also have a look at the Serial Peripheral Interface (SPI) protocols. This is what they are made for: drive several slaves from a single master with few wires.

Considering your target hardware (Altera Cyclone IV), your target clock frequency (50MHz) and your other specifications I would first try the classical bus. The address decoder will produce one select signal for each of your 12 slaves, based on the 8 most significant bits of your 16-bits address bus. The cost will be negligible. Apart these individual select signals, all slaves will receive all other signals (address bus, write data bus, read enable, write enable(s)). The 16-bits read data bus of your master will be the output of a 16-bits 12-to-1 multiplexer that selects one slave response among 12. This will be the part that consumes most of the resources of your interconnect. But it should be OK and run at 50 MHz without problem... if you avoid combinatorial paths between master requests and salve responses.

This is a really good answer. I'm going to look to implementing an address decoder to select the slaves along side a 12-to-1 mux. Thank you for your time! — Pyrohaz, Dec 04 '15 at 15:59
AMBA is a good suggestion, but if you are aiming for space efficiency then you should not use AXI. AXI also tends to have higher latency than traditional busses. APB works, but has its limitations. Inbetween APB and AXI you will find AHB and AHB-Lite. My suggestion would be AHB-Lite. It has low latency, decent bandwidth, good space efficiency, easy to implement, and allows simple yet efficient multi-master systems. — Timmy Brolin, Dec 06 '15 at 19:00

Martin Zabel · Answer 2 · 2015-12-04T09:36:05.927

2

A good starter is the WISHBONE SoC Interconnect from OpenCores.org. The classic read and write cycles are easy to implement. Beyond that, also burst transfers are specified for high throughput and much more. The website also hosts a lot of WISHBONE compatible projects providing a wide range of I/O devices.

And last but not least, the WISHBONE standard is in the public domain.

edited Dec 04 '15 at 09:36

answered Dec 04 '15 at 08:42

Martin Zabel

3,589
3
19
34

Space efficient data bus implementations

2 Answers2