A more accurate answer would need more of your code, plus specifics about the architecture you are synthesizing for.
As for your last question, the standard strategy is to minimize the amount of combinational logic between registers, that is, to pipeline the design. For your code, that would look like this:
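// Assumes these assignments sit in a clocked block; clk is an assumed clock name.
always @(posedge clk) begin
    // Stage 1: register the sums
    t0 <= h0 + A;
    t1 <= h1 + B;
    t2 <= h2 + C;
    t3 <= h3 + D;
    t4 <= h4 + E;
    // Stage 2: copy the registered sums into h0..h4 one clock later
    h0 <= t0;
    h1 <= t1;
    h2 <= t2;
    h3 <= t3;
    h4 <= t4;
end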
For a plain addition, though, I doubt this improves anything. The combinational block (that is, the adder) is still there, so the path between registers is just as long as before.
Now assume that your target architecture cannot implement fast large adders, only fast small ones, and that your registers are very wide. You could then split each large addition into smaller additions that run in parallel on the available resources and are combined one clock later, although I expect the synthesis tool already does that by itself. For example, this:
reg [127:0] a,b,c;
always @(posedge clk)
    a <= b + c;
Becomes:
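reg [63:0] ah,al;   // partial sums of the high and low halves
reg cy;             // carry out of the low half
reg [127:0] a,b,c;
always @(posedge clk) begin
    // Stage 1: two independent 64-bit additions
    {cy,al} <= b[63:0] + c[63:0];
    ah <= b[127:64] + c[127:64];
    // Stage 2: add the carry into the high half and merge.
    // a now lags b and c by two clocks instead of one.
    a <= {ah+cy, al};
end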
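If you want to convince yourself that the split version still computes the same sum, a small self-checking testbench helps. What follows is only a sketch: the module name split_add128, the testbench name, and the signals sum_d1 and sum_d2 are my own assumptions and not from your code. It compares the split adder against a plain 128-bit addition delayed by two clocks to match the extra latency.

```verilog
// Sketch only: module, port, and signal names are assumptions for illustration.
module split_add128 (
    input  wire         clk,
    input  wire [127:0] b,
    input  wire [127:0] c,
    output reg  [127:0] a
);
    reg [63:0] ah, al;
    reg        cy;

    always @(posedge clk) begin
        {cy, al} <= b[63:0] + c[63:0];      // stage 1: low half and its carry
        ah       <= b[127:64] + c[127:64];  // stage 1: high half, carry not yet applied
        a        <= {ah + cy, al};          // stage 2: apply the carry and merge
    end
endmodule

module split_add128_tb;
    reg          clk = 1'b0;
    reg  [127:0] b, c;
    wire [127:0] a;
    reg  [127:0] sum_d1, sum_d2;  // reference sum delayed by one and two clocks
    integer      i, errors;

    split_add128 dut (.clk(clk), .b(b), .c(c), .a(a));

    always #5 clk = ~clk;

    // Reference model: the plain 128-bit sum, delayed to match the two-clock latency.
    always @(posedge clk) begin
        sum_d1 <= b + c;
        sum_d2 <= sum_d1;
    end

    initial begin
        errors = 0;
        // The first vector overflows the low half, so the carry path is exercised.
        b = {64'd0, {64{1'b1}}};
        c = 128'd1;
        for (i = 0; i < 100; i = i + 1) begin
            @(posedge clk);
            #1;  // let the nonblocking updates settle before checking
            // After the first edge the pipeline is not full yet, so skip i == 0.
            if (i >= 1 && a !== sum_d2) begin
                errors = errors + 1;
                $display("mismatch at time %0t", $time);
            end
            // Drive new random operands for the next clock.
            b = {$random, $random, $random, $random};
            c = {$random, $random, $random, $random};
        end
        $display("done, %0d mismatch(es)", errors);
        $finish;
    end
endmodule
```

This only checks functional equivalence in simulation. Whether the split actually helps timing is something only the timing report of your synthesis tool for your target can tell you.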