I've started to implement an 8086/8088 with the goal of being cycle-exact. I can understand the reasoning behind the number of clock cycles for most instructions; however, I must say I'm quite puzzled by the Effective Address (EA) calculation time.

More specifically, why does computing BP + DI or BX + SI take 7 cycles, but computing BP + SI or BX + DI take 8 cycles?

I could just wait for a given number of cycles, but I'm really interested in knowing why there's this 1-cycle difference (and overall why it takes so many cycles to do any EA calculation, since EA uses the ALU for computing addresses, and an ADD between registers is just 3 cycles).
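For reference, the "wait for a given number of cycles" approach could look like the sketch below. It only records the EA timings as published in Intel's 8086 instruction timing tables and explains nothing about where they come from. The function name `ea_cycles` and its parameters are hypothetical, chosen for illustration.

```python
# Sketch only: EA calculation cycle counts as documented for the 8086/8088.
# The function name and parameters are illustrative assumptions.

def ea_cycles(base=None, index=None, has_disp=False, seg_override=False):
    """Return the documented EA calculation time in clock cycles.

    base:  None, "BX" or "BP"
    index: None, "SI" or "DI"
    Note: there is no encoding for a bare [BP]; assemblers emit [BP+0],
    so callers should pass has_disp=True for that case.
    """
    if base not in (None, "BX", "BP") or index not in (None, "SI", "DI"):
        raise ValueError("invalid base or index register")

    if base is None and index is None:
        if not has_disp:
            raise ValueError("an operand needs a register or a displacement")
        cycles = 6                      # direct address, e.g. [1234h]
    elif base is None or index is None:
        cycles = 9 if has_disp else 5   # single register, with or without disp
    else:
        # BP+DI and BX+SI are the 7-cycle pairs; BP+SI and BX+DI take 8.
        fast_pair = (base, index) in {("BP", "DI"), ("BX", "SI")}
        cycles = 7 if fast_pair else 8
        if has_disp:
            cycles += 4                 # 11 or 12 with a displacement
    if seg_override:
        cycles += 2                     # segment override prefix
    return cycles


print(ea_cycles("BP", "DI"))  # 7
print(ea_cycles("BP", "SI"))  # 8
```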

Matthieu Wipliez
  • Where do you buy your 8086 these days? :) – BitTickler Apr 24 '15 at 08:46
  • lol. I heard that NASA bought some on eBay: http://www.omgfacts.com/lists/15256/NASA-bought-parts-of-the-Space-Shuttle-on-eBay-because-Intel-wasn-t-making-them-anymore This is among other things the reason why I'm doing this: I intend to design an 8086 and use it on an FPGA :-) – Matthieu Wipliez Apr 24 '15 at 09:03
  • 8086 heavily used micro-code, it had only 20,000 active transistors. You'll have to break in to a vault somewhere near Santa Clara to know what it looked like. – Hans Passant Apr 24 '15 at 09:46
  • I'm pretty sure they've advanced to a state-of-the-art 386 by now. http://www.cpushack.com/space-craft-cpu.html – Leeor Apr 28 '15 at 23:28

1 Answer

Without reverse engineering the chip, I don't think it's possible to explain the difference in cycles between [BP + SI] and [BP + DI]. (It's not entirely out of the question that someone has done, or will do, the necessary reverse engineering; it's been done for some of the chips in the Commodore 64 in order to create more exact emulators.)

It is, however, fairly easy to explain why effective address calculations in general take so long. The calculation for [BX + SI] is actually DS * 16 + BX + SI, so it's two adds, not just one. It's also a 20-bit calculation, and the ALU is only 16 bits wide, so it takes one more add to calculate the upper bits of the 20-bit physical address. That's the equivalent of three register-to-register adds, which would cost a total of 9 cycles at 3 cycles each, and that assumes the 4-bit shift is free, so the EA calculation is actually faster than the equivalent instructions.
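To make the arithmetic concrete, here is a minimal Python sketch of the DS * 16 + BX + SI calculation. It models only the result, not the timing or how the chip sequences the adds. The function name `physical_address` is made up for illustration. The sketch assumes the usual real-mode behaviour: the offset is summed modulo 2^16 before the shifted segment is added, and the final sum is truncated to 20 bits.

```python
# Sketch only: computes the value of the address, not how long it takes.
# The function name and parameters are illustrative assumptions.

def physical_address(segment, base, index, disp=0):
    """Return the 20-bit physical address for [base + index + disp]."""
    # The effective address (offset) is a 16-bit quantity, so it wraps at 64K.
    offset = (base + index + disp) & 0xFFFF
    # Segment * 16 is a 4-bit left shift; the 8086 has 20 address lines,
    # so the final sum wraps at 1 MB.
    return ((segment << 4) + offset) & 0xFFFFF


# DS = 1000h, BX = 0010h, SI = 0020h
print(hex(physical_address(0x1000, 0x0010, 0x0020)))  # 0x10030
# BX + SI overflows 16 bits, so the offset wraps to 0001h first
print(hex(physical_address(0x1000, 0xFFFF, 0x0002)))  # 0x10001
```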

Ross Ridge
  • I see, that explains it, I thought that the 16 to 20-bit calculation was done with a specific adder. I'm curious though: what makes you think that this isn't the case? Would a dedicated 20-bit adder have been too expensive (in terms of silicon) at the time, i.e. the ALU-based solution was sufficient given their 29K transistors budget? – Matthieu Wipliez Apr 24 '15 at 17:37
  • That's what I remembered being told back in the day (mid to late 80's). Intel processors didn't get dedicated hardware for this until the 80286, which could do a 24-bit address calculation in one or two cycles. – Ross Ridge Apr 24 '15 at 18:53
  • Actually the 80186 also had dedicated address calculation hardware, though I'm not sure if it actually was the first. Both came out in 1982. – Ross Ridge Apr 24 '15 at 19:58
  • Thank you for the information. This is why experience is invaluable: "in 1982", I wasn't even born! :-) – Matthieu Wipliez Apr 24 '15 at 20:07
  • Also, a possible explanation for the difference in the number of cycles was suggested on reddit: this would happen if BP and SI were in a (single-ported) register file, and BX and DI in another one. Link: https://www.reddit.com/r/programming/comments/33q4ff/effective_address_calculation_time_on_80868088/ – Matthieu Wipliez Apr 24 '15 at 20:12
  • @user3144770 Good suggestion, though I went with replacing BP with BX instead in order to fix the problem with my example using the wrong segment register. Thanks. – Ross Ridge Apr 26 '15 at 15:43
  • @RossRidge maybe that's what you meant, but just to clarify - the displacement calculation is actually a 16-bit calculation. "DS * 16 + BX + SI" is correct, but if the sum of BX and SI is greater than 0xFFFF then it wraps around, and only then added to the "DS * 16" in a 20-bit calculation... – obe Oct 12 '19 at 21:56