
I am recreating the entire standard C library and I'm working on an implementation for strlen that I would like to be the basis of all my other str functions.

My current implementation is as follows:

int ft_strlen(char const *str)
{
    int length;

    length = 0;
    while (str[length] != '\0' || str[length + 1] == '\0')
        length++;

    return length;
}

My question is that when I pass a str like:

char str[6] = "hi!";

As expected, the memory reads:

['h']['i']['!']['\0']['\0']['\0']

If you look at my implementation, you can expect that I would get a return of 6 - as opposed to 3 (my previous approach) so that I can check strlen potentially including extra allocated memory.

The catch here is that I will have to read outside of initialized memory by 1 byte to fail my last loop condition at final null terminator - which is the behavior I WANT. However this is generally considered bad practice and by some an automatic error.

Is reading outside of your initialized value a bad idea even when you are very specifically intending to read into a junk value (to ensure it DOES NOT contain '\0')?

If so, why?

I understand that:

"buffer overruns are a favorite avenue for attacking secure programs"

Still, I can't see the problem if I'm simply trying to ensure I've hit the end of initialized values...

Also, I realize this problem can be avoided - I have already sidestepped with a value set to 1 and then only reading initialized values - that's not the point, this is more of a fundamental question about C, runtime behavior and best practices ;)

[EDITS:]

Comment to previous post:

OK. Fair enough - but as to the question "Is it always a bad idea (danger from intentional manipulation or runtime stability) to read after initialized values" - do you have an answer? Please read the accepted answer for an example of the nature of the question. I really don't need this code fixed, nor do I need a better understanding of data types, POSIX specs or common standards. My question is related to WHY such standards may exist - why it may be important to never read past initialized memory (if such reasons exist)? What is the potential fallout of reading past initialized values IN GENERAL?

Please all - I'm trying to better understand aspects of how systems operate and I have a VERY SPECIFIC question.

chux - Reinstate Monica
MJHd
  • 1
    There is no guarantee that the byte(s) following the array are not zero, so your function could exceed the buffer by an arbitrary amount. Or it could hit an address not mapped to physical storage and blow up with a segfault. Or... (see UB). – rici Jul 18 '17 at 06:39
  • Using uninitialized variables is enumerated as UB(undefined behaviour). – BLUEPIXY Jul 18 '17 at 07:43
  • J.2 Undefined behavior _The value of an object with automatic storage duration is used while it is indeterminate_ – BLUEPIXY Jul 18 '17 at 07:49
  • Is the approach to managing the undefined behavior I have built into my example acceptable and/or advisable -> or is it unstable and dangerous? – MJHd Jul 18 '17 at 07:50
  • That is my ONLY question -> stability and safety – MJHd Jul 18 '17 at 07:50
  • 5
    There's no such thing as "*managing undefined behavior*". Undefined is undefined, period. –  Jul 18 '17 at 07:51
  • 1
    Also 6.5.6 Additive operators p8 _If the result points one past the last element of the array object, it shall not be used as the operand of a unary * operator that is evaluated._ So undefined is undefined. Every action is not stipulated. – BLUEPIXY Jul 18 '17 at 07:57
  • 3.4.3 _1 undefined behavior behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements 2 NOTE Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic_ – BLUEPIXY Jul 18 '17 at 08:02
  • So can my program be attacked or could this approach cause crash - system error - exposure of sensitive information? It's not important to me if this behavior is defined or undefined, stipulated or unstipulated - will it cause crashes/system errors/seg fault(or other runtime) be prone to attack or reveal sensitive/secure data etc - I'm looking for computer behavior - not categorical fiat. What could the computer possibly do as a result of reading past initialized memory that will make me frown? – MJHd Jul 18 '17 at 08:19
  • You already got all the answers possible here. "*It's not important to me if this behavior is defined or undefined*" <- again wrong. UB *will* lead to crashes and exploitable holes. You already got examples of *how* this could happen with your broken code here. –  Jul 18 '17 at 08:27
  • @MJHd To direct a comment, use @+Username. Without that, as [here](https://stackoverflow.com/users/3809007/mjhd), the comment is directed to everyone. – chux - Reinstate Monica Jul 18 '17 at 15:36
  • Note: "working on an implementation for strlen" and `int ft_strlen(char const *str)` does not return the same type as `size_t strlen()`. – chux - Reinstate Monica Jul 18 '17 at 15:37
  • @chux, No, it won't stop at `hi!\0\0`, because at that point `str[length + 1] == '\0'` evaluates true and therefore it runs another iteration. In fact `str[length + 1] == '\0'` will not evaluate false before `str[length + 1]` is the element after the end of the array (though it might not even then). – 8bittree Jul 18 '17 at 16:00
  • @8bittree True, I read the code incorrectly, instead code attempts to read outside `str[]` --> UB. Comment removed. – chux - Reinstate Monica Jul 18 '17 at 16:08

6 Answers

2

Instead of the read of uninitialized memory, which IMHO is just a symptom here, let's focus on your idea and on why it is wrong:

char str[6] = "hi!";
strlen(str); // evaluates to 3

This is what the C standard mandates and it's what everyone would expect. An implementation returning 6 here is just wrong. This has its reason in the way C handles arrays and strings:

Let's leave VLAs (variable length arrays) aside here, because they're just a special case with somewhat similar rules. Then the size of an array is fixed: in your code above, sizeof(str) is 6, and this is a compile-time constant. This size is only known where the array is in scope.

According to the specification of C, the identifier of an array evaluates to a pointer to its first element, except when used with sizeof, _Alignof or &. As one consequence, it's impossible to pass an array to a function; what you actually pass is the pointer. If you write a function to accept an array type, this type is adjusted to be a pointer type instead. ("adjusted" is the wording of the C standard; it's commonly said that the array decays to a pointer.)

This specification allows C to treat an array as nothing more than a contiguous sequence of objects of the same type -- there is no metadata (like e.g. the length) stored with it.

So, if you're passing "arrays" around, therefore just having pointers to their first elements, how do you know the size of the array? There are two possibilities:

  1. pass the size in a separate parameter of type size_t.
  2. have a sentinel value at the end of your array.

Now, talking about strings in C: A string isn't a first-class citizen in C, it doesn't have its own type. It's defined as a sequence of char, ending with '\0'. Therefore you can store a string in a char[] and when you're working with strings, you don't need to pass lengths, because the sentinel value is already defined: every string ends with '\0'. But this also means whatever might come after a first '\0' is not part of the string.
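
The two possibilities can be sketched roughly as follows. This is only an illustration, not part of the original answer; the function names `count_with_size` and `length_until_sentinel` are made up for the example.

```c
#include <stddef.h>

/* Possibility 1: the caller passes the element count explicitly.
 * Counts how many of the first `size` elements equal `value`. */
size_t count_with_size(const char *buf, size_t size, char value)
{
    size_t hits = 0;

    for (size_t i = 0; i < size; i++) {
        if (buf[i] == value)
            hits++;
    }
    return hits;
}

/* Possibility 2: the data carries a sentinel; for C strings it is '\0'.
 * Stops at the FIRST '\0' and never looks at anything after it. */
size_t length_until_sentinel(const char *str)
{
    size_t length = 0;

    while (str[length] != '\0')
        length++;
    return length;
}
```

With `char str[6] = "hi!";`, `length_until_sentinel(str)` yields 3, whereas `count_with_size(str, sizeof str, '\0')` can inspect all six elements only because `sizeof str` is available at the call site, where the array is still in scope.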

So, with your idea, you mix up two things. You somehow want to have a function that returns the size of your array, something that isn't possible in general. You're using your array to store a string that's smaller than the array. Still, a function called strlen() is supposed to return the length of the string, which is an entirely different thing than the size of the array you use to hold your string.

You could even write something like this:

char foo[3] = "hi!";

This would initialize foo from the string constant "hi!", but foo would not contain a string, because it doesn't have the '\0' terminator. It would still be a valid char[]. But of course, you can't write a function finding out its size.


Summary: The size of an array is something completely different from the length of a string. You're mixing up both; the ill assumption that the size of an array could be determined in a function leads to code with UB, and of course, this is potentially dangerous code that could crash or worse (be exploited).

  • This has nothing to do with my question - I was very explicit in my question for this EXACT reason... As I said in the post I have already refactored the function as well as its helpers - I am not interested in how you would optimize my code or how the Clibs work. – MJHd Jul 18 '17 at 07:23
  • As the question states: this is a general purpose question about whether or not it is generally bad practice to read outside of initialized memory. The code provided is to illustrate the point. – MJHd Jul 18 '17 at 07:25
  • 1
    Then your question is entirely pointless. You don't just read uninitialized memory but even dereference a pointer that isn't necessarily valid in your program. –  Jul 18 '17 at 07:49
  • 1
    Still, this answer is **very** important, because your "example" shows a complete lack of understanding of basic concepts in C. –  Jul 18 '17 at 07:50
  • If you don't want to be corrected, better don't ask here. It's still very important for the purpose of this side to point out thinking errors and explain them. It's not *just for you*. –  Jul 18 '17 at 08:04
  • @chux not sure you're citing the correct passage for what you want to say, but I still get it. I'll have a look at the standard to find a better wording. –  Jul 19 '17 at 07:03
  • 1
    @chux it's §6.3.2.1 section **3**: "*Except when it is the operand of the `sizeof` operator, the `_Alignof` operator, or the unary `&` operator, or is a string literal used to initialize an array, an expression that has type ‘‘array of type’’ is converted to an expression with type ‘‘pointer to type’’ that points to the initial element of the array object and is not an lvalue.*" -- I'll fix it :) –  Jul 19 '17 at 07:09
2

ft_strlen() can read beyond the array the string resides in. This is often undefined behavior (UB).

Even with conditions that do not read into "un-owned" memory, the result is not 6 or a value that depends on array length.

#include <stdio.h>
#include <string.h>

int ft_strlen(char const *str);  /* the implementation from the question */

int main(void) {

  struct xx {
    char str_pre[6];
    char str[6];
    char str_post[6];
    char str_postpost[6];
  } x = { "", "Hi!", "", "x" };
  printf("%d\n", ft_strlen(x.str));  /* --> 11, loop was stopped by "x" */

  char str[6] = "1234y";
  strcpy(str, "Hi!");
  printf("%d\n", ft_strlen(str));    /* --> 3, loop was stopped by "y" */

  return 0;
}

ft_strlen() is not reliable code to determine array size nor string length.
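
For comparison, here is a minimal sketch of a scan that stays inside the array, assuming the caller knows the capacity and passes it in. The name `ft_strnlen` is hypothetical; it mirrors the behavior of POSIX `strnlen()`.

```c
#include <stddef.h>

/* Hypothetical helper: examines at most `cap` bytes, so it never reads
 * outside a buffer whose capacity the caller knows.
 * Returns the string length, or `cap` if no '\0' was found. */
size_t ft_strnlen(const char *str, size_t cap)
{
    size_t length = 0;

    while (length < cap && str[length] != '\0')
        length++;
    return length;
}
```

The bound is checked before the element is read, so the loop cannot touch `str[cap]` even when the array holds no terminator at all.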


Is it always a bad idea to read after initialized values?

Clarity:

char str[6] = "hi!"; initializes all 6 of str[6]. In C, there is no partial initialization - it is all or nothing.

Assignment can be partial.

char str[6];        // str uninitialized
strcpy(str, "Hi!"); // Only first 4 `char` assigned.
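
The difference between the two cases can be checked with a small sketch like the one below. The helper names are invented for this illustration, and every read stays inside the six-element array.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if every byte from index `from` up to `size - 1` is zero. */
int tail_is_zero(const char *buf, size_t from, size_t size)
{
    for (size_t i = from; i < size; i++) {
        if (buf[i] != '\0')
            return 0;
    }
    return 1;
}

/* Initialization: the elements not covered by "hi!" are zero-filled. */
int initialized_tail_is_zero(void)
{
    char str[6] = "hi!";

    return tail_is_zero(str, 3, sizeof str);
}

/* Assignment: strcpy() writes only 'H', 'i', '!', '\0'.
 * Indices 4 and 5 keep their earlier contents ('y' and '\0' here). */
int assigned_tail_is_zero(void)
{
    char str[6] = "1234y";

    strcpy(str, "Hi!");
    return tail_is_zero(str, 3, sizeof str);
}
```

Here `initialized_tail_is_zero()` returns 1, while `assigned_tail_is_zero()` returns 0 because the leftover 'y' at index 4 is still there.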

Reading after some initialized values implies reading into another object or, worse, outside the code's accessible memory. Attempting such an access is undefined behavior (UB) and is bad.

My question is related to WHY such standards may exist - why it may be important to never read past initialized memory.

This is really a core question about the design of C. C is a compromise. It is a language designed to work on many different platforms. To achieve that, it must be adaptable to all sorts of memory architectures. If C were to specify the result of "read after initialized values", then C would need 1) seg-faulting, 2) bounds checking, or 3) some other software/hardware mechanism to implement that detection. This might make C more robust at error detection, but it would also enlarge and slow the emitted code. IOW, C trusts that the programmer is doing the right thing and does not try to catch such errors. An implementation might detect the issue, it might not. It is UB. C is coding on a tight-rope without a net.

What is the potential fallout of reading past initialized values IN GENERAL (?)

C does not specify the result of attempting to do such a read so there is no general result of this UB. Common results, which may vary each time the code is run, include:

  1. A zero is read.
  2. A consistent garbage value is read.
  3. An inconsistent garbage value is read.
  4. A trap value is read. (Never applies to unsigned char though.)
  5. Seg-fault or other stoppage of code.
  6. Code invokes an executive handler (one step in a typical hacker exploit).
  7. Code ventures off and does something else.
chux - Reinstate Monica
  • Ok. Fair enough - but as to the question "Is it always a bad idea (danger from intentional manipulation or runtime stability) to read after initialized values" - do you have an answer? Please read the accepted answer for an example of the nature of the question. I really don't need this code fixed, nor do I need a better understanding data types, POSIX specs or common standards. My question is related to WHY such standards may exist - why it may be important to never read past initialized memory (if such reasons exist) What is the potential fallout of reading past initialized values IN GENERAL – MJHd Jul 18 '17 at 17:41
  • 1
    @MJHd Answers/comments are not only posted for OP but for all viewers. So from time-to-time, additional info is relevant. – chux - Reinstate Monica Jul 18 '17 at 18:11
  • True - I've actually already gotten a rundown on this issue from my professor (who has also expressed his frustration with the responses I've received I might add) - at this point I'm only responding for the sake of those who may share my original question and become confused with answers to completely different questions - if someone benefits indirectly from the other answers that's great! Just not my concern at this point - thank you though :) – MJHd Jul 18 '17 at 21:05
  • I might add - that while I greatly appreciate that members want to leave useful information for others who find this post - it is pretty common courtesy (and general logic IMHO) to answer the question asked first. I am CERTAIN no one would talk to me like this in person - so I have a hard time believing it is a misunderstanding of some sort... – MJHd Jul 18 '17 at 21:13
  • @MJHd "*I am CERTAIN no one would talk to me like this in person*" -- I'm certain if you go to a colleague (or your instructor, if you're an apprentice) showing **this** code and asking **this** question, you would **of course** get some explanation about why your idea behind this code is ill-advised. Feeling insulted for receiving some explanations won't get you far anywhere. Stripping down your question to just "what could be possible consequences of this UB" would make it a duplicate of countless other questions. –  Jul 19 '17 at 07:17
  • Fair enough. You've been patient enough. I really don't want to waste any more time with this. Thanks to everyone for the advice. – MJHd Jul 19 '17 at 07:46
1

Reading uninitialized memory can return data previously stored there. If your program processes sensitive data (such as passwords or cryptographic keys) and you disclose the uninitialized data to some party (expecting that it is valid), you might reveal confidential information.

Furthermore, if you read beyond the end of an array, the memory might not be mapped, and you will get a segmentation fault and a crash.

The compiler can also assume that your code is correct and will not read uninitialized memory, and make optimization decisions based on that, so even reading uninitialized memory can have arbitrary side effects.

Florian Weimer
  • Ok - these are really good points. Some questions though: 1) If the compiler ignores attempts to read uninitialized values - wouldn't I get the correct return value without runtime error or potential for malicious code injection? 2) Similarly, if the register is read but not used or returned - could the potentially sensitive value inside ever be recovered? Once the function returns shouldn't all local data in its stack be gone? – MJHd Jul 18 '17 at 06:54
0

Have you heard about the "buffer overflow problem"? When you read outside the "buffer", i.e. into uninitialized memory, malicious code could be hidden on the stack (and when you read it, the malicious code could be executed). More info here: https://en.wikipedia.org/wiki/Buffer_overflow

Therefore it is very, very bad to read outside the initialized memory, but most compilers protect against that by not allowing you to do it, or by giving you a warning, to protect the stack.

Hamuel
  • True - but the value is never returned or used - so could it actually affect the program execution to read malicious code but do nothing with it? – MJHd Jul 18 '17 at 06:47
0

It appears you want to keep track of allocated and used string memory. There is nothing wrong with that (although it's contrary to C's standard library approach). What is wrong, however, is trying to build this on a foundation that relies on UB. There are easier ways to shoot yourself in the foot.

Done right, you should rather follow a path that relies on clean code. One possible approach could be:

struct string_t
{
    int length;
    char strdata[];   /* flexible array member (C99); storage comes from malloc() */
};

Then you would have to provide a suitable set of functions to deal with your own string type like

#include <stdlib.h>

/* Allocates room for `length` characters plus the terminating '\0'. */
struct string_t *str_alloc(int length)
{
    struct string_t *s;

    s = malloc(sizeof(struct string_t) + length + 1);

    if (s)
        s->length = length;

    return s;
}

void str_free(struct string_t *s)
{
    free(s);
}

Might be a good exercise to go through the implementation of this with more functions like str_cat(), str_cpy() and more. This will probably also show you why the standard library does things just the way it does.
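
As a starting point for that exercise, a copy function might look something like the sketch below. It uses its own hypothetical type `tracked_str`, which stores both the capacity and the used length, so it is a variation on the idea above rather than the exact struct from this answer.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical length-tracking string; not a standard type. */
struct tracked_str {
    size_t capacity;   /* usable bytes in data, excluding the '\0' */
    size_t length;     /* bytes currently used */
    char   data[];     /* flexible array member (C99) */
};

/* Allocates an empty string with room for `capacity` characters. */
struct tracked_str *tracked_alloc(size_t capacity)
{
    struct tracked_str *s = malloc(sizeof *s + capacity + 1);

    if (s) {
        s->capacity = capacity;
        s->length = 0;
        s->data[0] = '\0';
    }
    return s;
}

/* Copies src into s; returns 0 on success, -1 if src does not fit. */
int tracked_cpy(struct tracked_str *s, const char *src)
{
    size_t n = strlen(src);

    if (n > s->capacity)
        return -1;
    memcpy(s->data, src, n + 1);   /* n + 1 includes the terminator */
    s->length = n;
    return 0;
}
```

Because the capacity travels with the data, "how much room is left" becomes `s->capacity - s->length`, and no function ever needs to probe memory it does not own.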

mfro
  • 3,286
  • 1
  • 19
  • 28
0

-- Big final last edit --

So the correct "not an answer to my question" answer to my question fell into my lap today...

It turns out I am not the first person who ever thought it would be useful to be able to count available, allocated, and initialized (zero/null term/other) memory values.

The correct way to handle this situation is to bookend memory allocations for specific uses with the ASCII char 'us' (decimal: 31).

'us' is the unit separator - its purpose is to define a use-specific unit. The original IBM manual states: "its specific meaning has to be specified for each application". In our case, to signal the end of available safe write space in an array.

So my mem block should have read:

['h']['i']['!']['\0']['\0']['\0']['\0']['us']

Thus eliminating the need to EVER read outside of memory.
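
A rough sketch of how such a scan could look, assuming the code that sets up the buffer really did write the 'us' byte as its last element and that 'us' never occurs in the payload itself. The names `UNIT_SEP` and `capacity_until_us` are made up for this example.

```c
#include <stddef.h>

#define UNIT_SEP '\x1F'   /* ASCII US (unit separator), decimal 31 */

/* Counts the bytes that precede the first US marker.
 * Assumption: the caller placed a US byte at the end of the buffer,
 * so the loop always terminates inside memory the caller owns. */
size_t capacity_until_us(const char *buf)
{
    size_t n = 0;

    while (buf[n] != UNIT_SEP)
        n++;
    return n;
}
```

For the block shown above this counts 7 bytes before the marker. As with '\0', the scheme is only as reliable as the code that writes the marker.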

You're welcome person this answer is for C:

MJHd