As a general proposition, using size_t and ptrdiff_t is vastly preferred over using, say, plain unsigned int and int. size_t and ptrdiff_t are pretty much the only way of writing a robust and widely portable program. However: there is no such thing as a free lunch. Properly using size_t takes some work, too -- it's just that, if you know what you're doing, it takes less work than trying to achieve the same result without using size_t.
Also, size_t has the problem that you can't print it using %d or %u. Ideally you want to use %zu, but, tragically, not all implementations have supported it.
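If you have to cope with pre-C99 implementations that never learned %zu, the usual workaround is to cast to the widest unsigned type you can portably print. A minimal sketch of both approaches:

#include <stdio.h>

int main(void)
{
    size_t n = sizeof(double);

    printf("%zu\n", n);                 /* C99 and later: z matches size_t */
    printf("%lu\n", (unsigned long)n);  /* pre-C99 fallback */
    return 0;
}

The cast-to-unsigned-long trick is safe as long as the value actually fits in an unsigned long -- which it will for anything you're realistically printing, though on 64-bit Windows, where unsigned long is only 32 bits, a truly huge size_t would be truncated.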
If you have a large and badly written program that doesn't use size_t, it's probably full of bugs. Some of those bugs will have been masked or worked around. If you try to change it to use size_t, a certain number of the program's workarounds will fail, perhaps uncovering once-hidden bugs. Eventually you'll work those out and achieve the more-robust and more-reliable and more-portable program you desire, but the process will be a rocky one. I suspect that's what the author means by "it is most likely that due to this replacement, new errors will appear".
Changing a program over to use size_t is sort of like trying to add const in all the right places. You make the changes you think you need to make, and recompile, and you get a bunch of errors and warnings, and you fix those and recompile, and you get a bunch more errors and warnings, etc. It's at least a nuisance, and sometimes a ton of work. But it's generally the only way to go if you want to make the code more robust and portable.
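To make the "uncovered bugs" point concrete, here's the classic trap (the function and names here are made up for illustration). A countdown loop that was fine with a signed index breaks the moment the index becomes size_t, because an unsigned i satisfies i >= 0 forever:

#include <stddef.h>

/* Hypothetical example: zero an array from the top down. */
void clear_backwards(double *a, size_t n)
{
    /* Broken once i is unsigned: the test i >= 0 is always true,
       so i wraps around past zero and the loop never terminates.

       for (size_t i = n - 1; i >= 0; i--)
           a[i] = 0;
    */

    /* One conventional rewrite: test i > 0 and index a[i-1]. */
    for (size_t i = n; i > 0; i--)
        a[i-1] = 0;
}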
A big part of the problem is keeping the compiler happy. It's going to warn about a bunch of stuff, and you'll generally want to fix everything it complains about, even though some of what it complains about is ticky-tack and unlikely to cause a problem. But it's perilous to say, "Yeah, I can ignore this particular warning", so in the end, as I said, you'll generally want to fix everything.
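The most common of those warnings is the signed/unsigned comparison. Something like the following (a made-up function, purely for illustration) is enough to draw a signedness-comparison diagnostic from gcc or clang under -Wextra, even though it's harmless whenever count fits in an int:

#include <stddef.h>

long sum(const int *a, size_t count)
{
    long total = 0;
    for (int i = 0; i < count; i++)   /* warns: signed i compared
                                         against unsigned count */
        total += a[i];
    return total;
}

The quiet version declares i as a size_t, too -- which is exactly the sort of cascading change described above.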
The author's most eye-catching claim is that "memory size needed for the program will greatly increase as well."
I suspect this is an exaggeration -- in most cases I doubt that memory will "greatly" increase -- but it's likely to increase at least a little bit. The issue is that on a 64-bit system, size_t and ptrdiff_t are likely to be 64-bit types. If for whatever reason you have large arrays of these, or large arrays of structures containing these, and if you had been using some 32-bit type (perhaps plain int or unsigned int) before, yes, you're going to see a memory increase.
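As a rough illustration of the arithmetic (assuming a typical 64-bit platform, where unsigned int is 4 bytes and size_t is 8):

#include <stdio.h>

int main(void)
{
    /* A million stored sizes: the array doubles in size when
       each element grows from 4 bytes to 8. */
    printf("as unsigned int: %zu bytes\n", 1000000 * sizeof(unsigned int));
    printf("as size_t:       %zu bytes\n", 1000000 * sizeof(size_t));
    return 0;
}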
And then you're going to want to ask, Do I really need to be able to describe 64-bit sizes? 64-bit programming gives you two things: (a) the ability to address more than 4 GB of memory, and (b) the ability to have a single object greater than 4 GB. If you want to have a total data usage greater than 4 GB, but you don't ever need to have a single object bigger than 4 GB, and if you never want to read more than 4 GB of data at a time from a file (using a single read or fread call, that is), you don't really need 64-bit size variables everywhere.
So to avoid bloat, you might make an informed choice to use, say, unsigned int (or even unsigned short) instead of size_t in some places. As a trivial example, if you had
size_t x = sizeof(int);
printf("%zu\n", x);
you could change this to
unsigned int x = sizeof(int);
printf("%u\n", x);
without any loss in portability, because I can quite confidently guarantee your code is never going to find itself running on a machine with 34359738368-bit ints (or at least, not in our lifetimes :-) ).
But this last example, trivial as it is, also illustrates the other issues that tend to intrude. The similar code
unsigned int x = sizeof(y);
printf("%u\n", x);
is not so obviously safe, because whatever y is, there's a chance it could be so big that its size doesn't fit in an unsigned int. So if you or your compiler really care about type correctness, there may be warnings about possible data loss when assigning size_t to unsigned int. And to shut off those warnings, you may need explicit casts, as in
unsigned int x = (unsigned int)sizeof(int);
And this cast is, arguably, perfectly appropriate. The compiler is operating under the assumption that any object might be really big, and that any attempt to jam a size_t into an unsigned int might lose data. The cast says you've thought about this case: you're saying, "Yes, I know that, but in this case, I know it won't overflow, so please don't warn me about this one any more, but please do warn me about any others, that might not be so safe."
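If you'd rather have that "I know it won't overflow" claim checked at run time than merely recorded in a cast, one option is to funnel the conversions through a small checked helper (narrow_size is a made-up name, purely for illustration):

#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Narrow a size_t to unsigned int, trapping in a debug build
   instead of silently truncating if the value doesn't fit. */
unsigned int narrow_size(size_t s)
{
    assert(s <= UINT_MAX);
    return (unsigned int)s;
}

Then the example above becomes unsigned int x = narrow_size(sizeof(y)); and the no-overflow assumption is enforced rather than merely assumed.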
P.S. I'm being downvoted, so in case I've given the wrong impression, let me make clear that (as I said in my opening paragraph) size_t and ptrdiff_t are vastly preferred. In general there's every reason to use them, no good reason not to use them. (Come to that, Karpov wasn't saying not to use them, either -- merely highlighting some of the issues that might come up along the way.)