I'm writing a microcontroller in VHDL and have essentially got a core for my actual microcontroller section down. I'm now getting to the point however of starting to include memory mapped peripherals. I'm using a very simple bus consisting of a single master (the CPU) and multiple slaves (the peripherals/RAM). My bus works through an acknowledge CPU->perip and acknowledge perip->CPU. The CPU also has separate input and output data buses to avoid tristates.
I've chosen this method as I wish to have the ability for peripherals to stall the CPU. A bus transaction is achieved by: The master places the data, address and read/write bit on the bus, bringing the ack(c->p) high. Once the slave has successfully received the information and has placed the response back on the data (p->c) bus, the slave sets its ack(p->c) high. The master notes the slave has successfully placed the data, takes the data for processing and releases the ack(c->p). The bus is now in idle state again, ready for further transactions.
Obviously this is a very simple bus protocol and doesn't include burst features, variable word sizes or other more complex features. My question however is what space efficient methods can be used to connect peripherals to a master CPU?
I've looked into 3 different methods as of yet. I'm currently using a single output data bus from the master to all of the peripherals, with the data outputs from all the peripherals being or'd, along with their ack(p->c) outputs. Each peripheral contains a small address mux which only allows a slave to respond if the address is within a predefined range. This reduces the logic for switching between peripherals but obviously will infer lots of logic/peripheral for the address muxes which leads me to believe that future scalability will be impacted.
Another method I though of was having a single large address mux connected from the master which decodes the address and sends it, along with the data and ack signals to each slave. The output data is then muxed back into the master. This seems a slightly more efficient method though I always seem to end up with ridiculously long data vectors and its a bit of a chore to keep track of.
A third method I thought of was to have it arranged in a ring like fashion. The master address goes to all of the slaves, with a smaller mux which merely chooses which ack signals to send out. The data output from the master then travels serially through each slave. Each slave contains a mux which can allow it to either let the data coming into it pass through unaffected OR to allow the slave to place its own data on the bus. I feel this will work best for slow systems as there is only one small mux/slave required to mux between the incoming data and that slave's data, along with a small mux that decodes the address and sends out the ack signals. The issue here I believe however is that with lots of peripherals, the propagation delay from the output of the master to the input of the master would be pretty large as it has to travel through each slave!
Could anybody give me suitable reasoning for the different methods? I'm using Quartus to synthesize and route for an Altera EP4CE10E22C8 FPGA and I'm looking for the smallest implementation with regards to FPGA LUTs. My system uses a 16bit address and data bus. I'm looking to achieve at minimum ~50MHz under ideal memory conditions (i.e no wait states) and would be looking to have around 12 slaves, each with between 8 and 16bits of addressable space.
Thanks!