177

I've seen programmers use the formula

mid = start + (end - start) / 2

instead of using the simpler formula

mid = (start + end) / 2

for finding the middle element in the array or list.

Why do they use the former one?

cadaniluk
Pallavi Chauhan
  • 54
    Wild guess: `(start + end)` might overflow, while `(end - start)` cannot. – cadaniluk Jul 31 '16 at 20:15
  • 32
    because the latter does not work when `start` and `end` are pointers. – ensc Jul 31 '16 at 20:20
  • 25
    [Extra, Extra — Read All About It: Almost Every Binary Search and Mergesort is Broken](https://research.googleblog.com/2006/06/extra-extra-read-all-about-it-nearly.html) – Jonathan Leffler Jul 31 '16 at 21:08
  • 21
    `start + (end - start) / 2` also carries semantic meaning: `(end - start)` is the length, so this says: `start + half the length`. – njzk2 Aug 01 '16 at 02:16
  • @JonathanLeffler why anyone would use a *signed* integer as array index is beyond me. I normally use `size_t` for them these days… if you find a case where that’s incorrect, do tell, but I think it works for both char arrays, pointer arrays, and arrays of structs. – mirabilos Aug 01 '16 at 12:14
  • @mirabilos unfortunately, that's just not the type of an array index. native c array indices are signed and of type ptrdiff_t – Steve Cox Aug 01 '16 at 21:13
  • @SteveCox hrm, ok. But `size_t` is usually the same as `ptrdiff_t`, just unsigned (making it easier to handle) and more portable (available on many more platforms, especially older ones). – mirabilos Aug 01 '16 at 21:52
  • @mirabilos you can like, size_t better, but `a[-2]` is a valid C expression indexing into an array. this isn't really a question of preference. C array indexing is signed, and an unsigned type is not sufficient for the task. – Steve Cox Aug 01 '16 at 22:08
  • @SteveCox `a[-2]` is not a valid index into an array defined as `a[]`, only if `char b[]; char *a = b + 2;` or somesuch, yes. But wrapping around *is* defined for unsigned types, so it would work at least for char types, but probably for others too… – mirabilos Aug 01 '16 at 23:58
  • some other duplicates [Binary search using iterators, why do we use “(end - begin)/2”?](http://stackoverflow.com/q/38560566/995714), [Overflow issues when implementing math formulas](http://stackoverflow.com/q/10882368/995714). Btw this should be explained in every questions about binary search – phuclv Aug 02 '16 at 07:11
  • 2
    @LưuVĩnhPhúc: Doesn't this question have the best answers and the most votes? If so, the other questions should probably be closed as a dup of this one. The age of the posts is irrelevant. – Nisse Engström Aug 02 '16 at 12:08
  • Can we also do mid= (start)/2+(end)/2 ? – Nikhil Goyal Mar 13 '21 at 21:55

4 Answers

228

There are three reasons.

First of all, start + (end - start) / 2 works even if you are using pointers, as long as end - start doesn't overflow [1].

int *start = ..., *end = ...;
int *mid = start + (end - start) / 2; // works as expected
int *mid = (start + end) / 2;         // type error, won't compile
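To show the pointer form in context, here is a minimal sketch of a binary search over a sorted int array. The function name find_sorted and the half-open range convention [start, end) are assumptions made for illustration; they are not part of the answer above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: search the sorted half-open range [start, end)
 * for key. Returns a pointer to a matching element, or NULL if absent.
 * Both pointers must point into (or one past the end of) the same array. */
const int *find_sorted(const int *start, const int *end, int key)
{
    assert(start <= end);
    while (start < end) {
        /* end - start is an element count of type ptrdiff_t; adding half
         * of it to start stays inside the range. Writing (start + end) / 2
         * would not compile, because two pointers cannot be added. */
        const int *mid = start + (end - start) / 2;
        if (*mid < key)
            start = mid + 1;   /* key can only be to the right of mid */
        else if (*mid > key)
            end = mid;         /* key can only be to the left of mid */
        else
            return mid;
    }
    return NULL;
}
```

Calling find_sorted(a, a + n, key) on a sorted array a of n elements searches the whole array.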

Second of all, start + (end - start) / 2 won't overflow if start and end are large positive numbers. With signed operands, overflow is undefined:

int start = 0x7ffffffe, end = 0x7fffffff;
int mid = start + (end - start) / 2; // works as expected
int mid = (start + end) / 2;         // overflow... undefined

(Note that end - start may overflow, but only if start < 0 or end < 0.)

Or with unsigned arithmetic, overflow is defined but gives you the wrong answer. However, for unsigned operands, start + (end - start) / 2 will never overflow as long as end >= start.

unsigned start = 0xfffffffeu, end = 0xffffffffu;
unsigned mid = start + (end - start) / 2; // works as expected
unsigned mid = (start + end) / 2;         // mid = 0x7ffffffe
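The unsigned case can be checked without depending on the width of unsigned by using uint32_t. This is only a sketch; the function names mid_naive and mid_safe are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Naive form: the sum is reduced modulo 2^32 before the division,
 * so the high bit of the true sum is lost. */
uint32_t mid_naive(uint32_t start, uint32_t end)
{
    uint32_t sum = start + end;   /* stored as uint32_t, so it wraps here */
    return sum / 2;
}

/* Offset form: end - start cannot wrap when start <= end, and the
 * result is never larger than end. */
uint32_t mid_safe(uint32_t start, uint32_t end)
{
    assert(start <= end);
    return start + (end - start) / 2;
}
```

With start = 0xfffffffe and end = 0xffffffff, mid_naive returns 0x7ffffffe, which lies far outside the range, while mid_safe returns 0xfffffffe.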

Finally, you often want to round towards the start element.

int start = -3, end = 0;
int mid = start + (end - start) / 2; // -2, closer to start
int mid = (start + end) / 2;         // -1, surprise!
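The rounding difference comes from integer division truncating toward zero (guaranteed since C99). A small sketch, with the hypothetical names mid_offset and mid_average, makes the two behaviors easy to compare on small values where neither form overflows:

```c
#include <assert.h>

/* Offset form: (end - start) is non-negative, so the division rounds
 * down and the result always rounds toward start. */
int mid_offset(int start, int end)
{
    assert(start <= end);
    return start + (end - start) / 2;
}

/* Average form: the sum may be negative, and a negative quotient is
 * truncated toward zero, i.e. rounded up toward end. */
int mid_average(int start, int end)
{
    return (start + end) / 2;
}
```

For the range from -3 to -2, mid_offset gives -3 but mid_average gives -2, which equals end. In a search over a half-open range [start, end), that value would be outside the range.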

Footnotes

1 According to the C standard, if the result of pointer subtraction is not representable as a ptrdiff_t, then the behavior is undefined. However, in practice, this requires allocating a char array using at least half the entire address space.

Dietrich Epp
  • result of `(end - start)` in the `signed int` case is undefined when it overflows. – ensc Jul 31 '16 at 20:52
  • Can you prove that `end-start` wont overflow? AFAIK if you take a negative `start` it should be possible to make it overflow. Sure, most of the times when you compute the average you know that values are `>= 0` ... – Bakuriu Jul 31 '16 at 21:28
  • 12
    @Bakuriu: It's impossible to prove something which is not true. – Dietrich Epp Jul 31 '16 at 23:07
  • @DietrichEpp Then why do you state so in your answer? You should clarify that you are excluding negative `start` values from your observation, even though in some circumstances they are perfectly valid. – Bakuriu Aug 01 '16 at 08:28
  • 4
    It's of particular interest in C, since pointer subtraction (per the standard) is broken by design. Implementations are permitted to create arrays so big that `end - start` is undefined, because object sizes are unsigned whereas pointer differences are signed. So `end - start` "works even using pointers", provided you also somehow keep the size of the array below `PTRDIFF_MAX`. To be fair to the standard, that's not much of an obstruction on most architectures since that's half the size of the memory map. – Steve Jessop Aug 01 '16 at 11:03
  • 3
    @Bakuriu: By the way, there is an "edit" button on the post you can use to suggest changes (or make them yourself) if you think I've missed something, or something is unclear. I'm only human, and this post has been seen by over two thousand pairs of eyeballs. The kind of comment, "You should clarify..." really rubs me the wrong way. – Dietrich Epp Aug 01 '16 at 14:02
  • @SteveJessop: Two's-complement implementations often define pointer subtraction in such a way that if `p2>=p1`, then `(size_t)(p2-p1)` will always yield the correct difference, whether or not the result could be represented in a `ptrdiff_t`. – supercat Aug 01 '16 at 19:24
  • @SteveJessop Completely true, but you might want to also mention that you can only run into this issue with char arrays. arrays of larger elements (on implementations with larger elements) never have this issue whatsoever. – Steve Cox Aug 01 '16 at 21:22
  • @LittleAlien where did you get 2 from? The midpoint of [0,-3) is -1.5. The `(start + end) / 2` method rounds toward zero, which means that it rounds up or down depending on sign, giving -1. If you test with positive numbers this may surprise you. It doesn't have translation invariance, for one thing. – Dietrich Epp Aug 02 '16 at 13:16
  • @Steve Cox The issue is not limited to arrays. Consider pointer arithmetic and `double *x = calloc(SIZE_MAX, sizeof *x);`. Some platforms support allocations exceeding `SIZE_MAX`. `SIZE_MAX` is the limit of the index, not the allocation size limit. – chux - Reinstate Monica Aug 09 '16 at 02:33
  • @chux: You say that some platforms support this. Can you name a such platform? I am skeptical. – Dietrich Epp Aug 09 '16 at 02:42
  • Example: Allocations greater than `SIZE_MAX` were accomplished with `calloc()` with the DOS [huge memory model](http://users.pja.edu.pl/~jms/qnx/help/watcom/compiler16/wmodels.html#16BitDataModels). This is compliant C behavior. Certainly not as common as flat memory models today. Not so difficult to write code that handles both. – chux - Reinstate Monica Aug 09 '16 at 12:52
22

We can take a simple example to demonstrate this. Suppose that, in a certain large array, we are trying to find the midpoint of the range [1000, INT_MAX]. INT_MAX is the largest value the int data type can store, so adding even 1 to it overflows: formally the behavior is undefined, and in practice the value typically wraps around and becomes negative.

Here, start = 1000 and end = INT_MAX.

Using the formula: (start + end)/2,

the mid-point will be

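(1000 + INT_MAX)/2, where the sum overflows. On a typical two's-complement machine it wraps to INT_MIN + 999, so the result is (INT_MIN + 999)/2, which is negative and may cause a segmentation fault if we try to index using this value.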

But, using the formula, (start + (end-start)/2), we get:

(1000 + (INT_MAX-1000)/2) = (1000 + INT_MAX/2 - 500) = (INT_MAX/2 + 500) which will not overflow.
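The arithmetic above can be checked against a reference computed in a wider type. This is a sketch only: the names midpoint and midpoint_wide are invented for the example, and it assumes long long is wider than int, which holds on common platforms but is not guaranteed by the C standard.

```c
#include <assert.h>
#include <limits.h>

/* Offset form for non-negative indices: no intermediate value
 * exceeds end, so nothing can overflow. */
int midpoint(int start, int end)
{
    assert(0 <= start && start <= end);
    return start + (end - start) / 2;
}

/* Reference value: the sum is formed in long long, which is assumed
 * here to be wide enough to hold start + end exactly. */
long long midpoint_wide(int start, int end)
{
    return ((long long)start + (long long)end) / 2;
}
```

For start = 1000 and end = INT_MAX, both functions give INT_MAX/2 + 500, matching the hand calculation.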

Shubham
  • 1
    If you add 1 to `INT_MAX`, the result will not be negative, but undefined. – celtschk Aug 01 '16 at 21:46
  • @celtschk Theoretically, yes. Practically, it will often wrap around, going from `INT_MAX` to `INT_MIN` (that is, `-INT_MAX - 1`). It's a bad habit to rely upon that though. – Mast Aug 02 '16 at 09:46
19

To add to what others have already said, the first one explains its meaning more clearly to those less mathematically minded:

mid = start + (end - start) / 2

reads as:

mid equals start plus half of the length.

whereas:

mid = (start + end) / 2

reads as:

mid equals half of start plus end

Which does not seem as clear as the first, at least when expressed like that.

As Kos pointed out, it can also read:

mid equals the average of start and end

Which is clearer but still not, at least in my opinion, as clear as the first.

TheLethalCoder
  • 3
    I see your point, but this really is a stretch. If you see "e - s" and think "length" then you almost surely see "(s+e)/2" and think "average" or "mid." – djechlin Aug 01 '16 at 22:57
  • 2
    @djechlin Programmers are poor at math. They are busy doing their work. They have no time to attend the math classes. – Little Alien Aug 02 '16 at 05:01
1

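start + (end - start) / 2 avoids the possible overflow in start + end. For example, with a 32-bit int, start = 2^30 and end = 2^30 + 2^20 make start + end exceed INT_MAX, while end - start is only 2^20. (The pair start = 2^20 and end = 2^30 still fits, since their sum is below 2^31.)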

fight_club