I have reached a point in my design where we need to massively increase parallelisation, and we have plenty of resources to spare in the FPGA.

To that end, I have the type defined as

type LargeByteArray is array(0 to 10000) of std_logic_vector(7 downto 0);

I have two of these that I want to average "byte-wise" in as few operations as possible, using a right shift to divide by two. So for example, avg(0) should be an 8-bit standard logic vector equal to (a_in(0) + b_in(0)) / 2, avg(1) should be (a_in(1) + b_in(1)) / 2, and so on. Assume for the moment that we don't care that two 8-bit numbers add up to a 9-bit result. I want to be able to do all 10000 operations in parallel.

I think I need to use an intermediate step to be able to bitshift like this, using the Signal "inter".

entity Large_adder is
Port ( a_in    : in  LargeByteArray;
       b_in    : in  LargeByteArray;
       avg_out : out LargeByteArray);
end Large_adder;

architecture arch of Large_adder is
    SIGNAL inter : LargeByteArray;
begin

My current code looks a bit like this:

inter(0) <= std_logic_vector((unsigned(a_in(0)) + unsigned(b_in(0))));
inter(1) <= std_logic_vector((unsigned(a_in(1)) + unsigned(b_in(1))));

10000 lines later...

inter(10000) <= std_logic_vector((unsigned(a_in(10000)) + unsigned(b_in(10000))));

And a similar story for finally assigning the output with the bit shift

avg_out(0) <= '0' & inter(0)(7 downto 1);
avg_out(1) <= '0' & inter(1)(7 downto 1);

All the way down to 10000.

Surely there is a more space efficient way to specify this.

I have tried

inter <= std_logic_vector((unsigned(a_in) + unsigned(b_in)));

but I get an error about found '0' matching definitions for <= operator.

Now obviously the number could be decreased from 10000 in case this question looks stupid in what I'm trying to achieve, but in general, how do you write these sort of operations elegantly without a line for every element of my Type?

If I had to guess I would say we can describe to the "<=" operator what to do when met with LargeByteArray types. But I do not know how to do so or where to define this behaviour.
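
For reference, signal assignment itself cannot be overloaded in VHDL, but a function taking and returning LargeByteArray can do the element-wise work. A minimal sketch of that idea, assuming a hypothetical package byte_array_pkg and a hypothetical function name average (neither is part of my design), and keeping the carry bit rather than dropping it:

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical package: holds the array type plus an element-wise average.
package byte_array_pkg is
  type LargeByteArray is array (0 to 10000) of std_logic_vector(7 downto 0);
  function average(a, b : LargeByteArray) return LargeByteArray;
end package byte_array_pkg;

package body byte_array_pkg is
  function average(a, b : LargeByteArray) return LargeByteArray is
    variable sum    : unsigned(8 downto 0);  -- 9 bits so the carry is kept
    variable result : LargeByteArray;
  begin
    for i in a'range loop
      sum       := resize(unsigned(a(i)), 9) + resize(unsigned(b(i)), 9);
      result(i) := std_logic_vector(sum(8 downto 1));  -- drop the LSB: divide by two
    end loop;
    return result;
  end function average;
end package body byte_array_pkg;
```

With the package made visible through use work.byte_array_pkg.all, the architecture body would then reduce to a single concurrent assignment: avg_out <= average(a_in, b_in);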

Thanks

2 Answers

You have two choices. Either a for loop inside a process:

  process (a_in, b_in)
  begin
    for I in 0 to 10000 loop
      inter(I) <= std_logic_vector((unsigned(a_in(I)) + unsigned(b_in(I))));
    end loop;
  end process;

  process (inter)
  begin
    for I in 0 to 10000 loop
      avg_out(I) <= '0' & inter(I)(7 downto 1);
    end loop;
  end process;

or a generate loop outside a process:

G1: for I in 0 to 10000 generate
  inter(I) <= std_logic_vector((unsigned(a_in(I)) + unsigned(b_in(I))));
end generate;

G2: for I in 0 to 10000 generate
  avg_out(I) <= '0' & inter(I)(7 downto 1);
end generate;

https://www.edaplayground.com/x/3hJV

The simulator executes the lines inside the for loop (inside the process) sequentially, because simulators always execute lines inside a process sequentially (but concurrently with other processes and concurrent assignments). The simulator executes the lines inside the generate loop concurrently, because a generate loop is a language construct that is used to generate multiple concurrent things. Because of the topology of your circuit (everything is parallel), both methods will behave the same in simulation and in synthesis.
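
To show how the generate version fits into a complete design unit, here is a minimal sketch. It assumes the array type is declared in a package (the name large_adder_pkg is hypothetical), since a type used in a port list has to be visible to the entity. The arithmetic is the same as above, so the carry out of the 8-bit addition is still dropped.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical package name; it only exists to make the type visible to the entity.
package large_adder_pkg is
  type LargeByteArray is array (0 to 10000) of std_logic_vector(7 downto 0);
end package large_adder_pkg;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.large_adder_pkg.all;

entity Large_adder is
  port (
    a_in    : in  LargeByteArray;
    b_in    : in  LargeByteArray;
    avg_out : out LargeByteArray
  );
end entity Large_adder;

architecture arch of Large_adder is
  signal inter : LargeByteArray;
begin
  -- Each iteration elaborates into its own pair of concurrent assignments.
  G_AVG : for I in a_in'range generate
    -- 8-bit sum: the carry is discarded, as in the loops above.
    inter(I)   <= std_logic_vector(unsigned(a_in(I)) + unsigned(b_in(I)));
    -- Shift right by one to divide by two.
    avg_out(I) <= '0' & inter(I)(7 downto 1);
  end generate G_AVG;
end architecture arch;
```

Using a_in'range instead of the literal bounds means the loop follows the type if the array size changes.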

Matthew Taylor
  • Thank you - For some reason I had it in my head that the for loop would be sequential, and I didn't realise generate loops could be used like this. Cheers! – Will Haward Jul 05 '18 at 14:01
Use a regular process:

process(a_in, b_in)
  variable tmp: unsigned(8 downto 0);
begin
  for i in a_in'range loop
    tmp := unsigned('0' & a_in(i)) +  unsigned('0' & b_in(i));
    avg_out(i) <= std_logic_vector(tmp(8 downto 1));
  end loop;
end process;

It looks sequential but it is not, for reasons rooted in VHDL semantics that would take too long to explain here. Your synthesizer will do what you want.

And, by the way, the sum of two 8-bit unsigned numbers is a 9-bit unsigned number (which is why variable tmp is declared as unsigned(8 downto 0)). Dividing by two simply consists of shifting right by one position (if the least significant bit is the rightmost, which is usually the case). So, if you want an 8-bit result, just left-extend your operands by one bit, add them and drop the LSB of the result, as in the process above. If, instead, you add them without extension, the carry is lost and any sum above 255 gives a wrong average.
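
A worked example of that overflow, written as a small self-checking testbench sketch. The entity name avg_width_tb and the operand values 200 and 100 are chosen only for illustration. Their true average is 150, but the 8-bit sum wraps from 300 to 44, so the unextended version yields 22.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical testbench, for illustration only.
entity avg_width_tb is
end entity avg_width_tb;

architecture sim of avg_width_tb is
begin
  process
    constant a      : unsigned(7 downto 0) := to_unsigned(200, 8);
    constant b      : unsigned(7 downto 0) := to_unsigned(100, 8);
    variable narrow : unsigned(7 downto 0);
    variable wide   : unsigned(8 downto 0);
  begin
    -- Without extension: 200 + 100 = 300 wraps to 44, and 44 / 2 = 22.
    narrow := a + b;
    assert to_integer(narrow(7 downto 1)) = 22
      report "unexpected result for the 8-bit sum" severity failure;

    -- With one bit of extension: 300 fits in 9 bits, and 300 / 2 = 150.
    wide := resize(a, 9) + resize(b, 9);
    assert to_integer(wide(8 downto 1)) = 150
      report "unexpected result for the 9-bit sum" severity failure;

    report "both checks passed" severity note;
    wait;  -- stop the process so the simulation can end
  end process;
end architecture sim;
```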

Renaud Pacalet