Will a string initializer somewhat waste memory?

Question

To initialize a char array, usually I write:

char string[] = "some text";

But today, one of my classmates said that one should use:

char string[] = {'s', 'o', 'm', 'e', ' ', 't', 'e', 'x', 't', '\0'};

I told him it is crazy to abandon readability and brevity, but he argued that initializing a char array with a string literal actually creates two strings: one that resides on the stack and another in read-only memory. When working with embedded devices, this can result in an unacceptable waste of memory.

Of course, string initializers seem clearer, so I'll keep using them in my programs. But the question is: will a string initializer create two identical strings? Or are string initializers just syntactic sugar?

They will take the exact same amount of memory. There is no need for the compiler to create an anonymous string constant for this case. Your friend seems to be under the misconception that some kind of dynamic copying takes place. It does not. – Tom Karzes, Feb 17 '16 at 06:59
You should ask your classmate where he thinks the data of the second initializer comes from. — user694733, Feb 17 '16 at 07:03
@TomKarzes Your comment is somewhat misleading. There is a hidden `memcpy` when the array is initialized and the initialization data is copied from ROM to RAM. – user694733, Feb 17 '16 at 07:12
@user694733 In both cases, `string` should be allocated in the data section, initialized with the desired string value. There should be no difference at all. Any load-time issues, whether from disk or from ROM or whatever, should be identical in the two cases. That was the point of the question, and that is as direct an answer as can be given. — Tom Karzes, Feb 17 '16 at 07:37
@TomKarzes I know, both are equal. I was just saying that on a typical embedded system there *is* copying that takes place at initialization. – user694733, Feb 17 '16 at 07:42
@TomKarzes It isn't obvious what scope `string` has; you can't assume that it will sit at file scope. And your assumptions only apply on a RAM-based, PC-like system. On embedded systems with true ROM and no hard drive, even static storage duration variables are initialized at "run time" with a copy-down from ROM. – Lundin, Feb 17 '16 at 07:43
At any rate, the string literal will be allocated in the executable. There is no way around that. — Lundin, Feb 17 '16 at 07:46
@Lundin None of that matters. I made no assumptions. The two cases should be identical. Why do you think they aren't? They are two different syntaxes for the same array initializer. No more, no less. — Tom Karzes, Feb 17 '16 at 07:47
@TomKarzes If the variable is allocated on the stack, it will obviously have to be initialized at run time, each time execution enters the scope in which the variable resides. Which means that the actual literal must be allocated in read-only memory. --> – Lundin, Feb 17 '16 at 07:53
@TomKarzes If the variable has static storage duration, then it will end up in the `.data` segment, which on a RAM-based PC system is pre-loaded when the program is loaded into RAM at startup. Then there is no need to have the actual literal accessible in run-time, so it doesn't have to be allocated in addressable memory. On a ROM-based system (flash memory etc) however, the literal sits in ROM and at program start-up it gets copied into `.data` by your own program. So you always get a copy of it in addressable ROM no matter what you do. — Lundin, Feb 17 '16 at 07:54
@Lundin Yes, of course. Why are you harping on that? That isn't the issue. The issue is whether there is some kind of benefit from writing the initializer as a string literal vs. using array initializer syntax. And the answer is that *there is no difference*. – Tom Karzes, Feb 17 '16 at 07:56
@TomKarzes Because I was referring to your comment "There is no need for the compiler to create an anonymous string constant for this case. Your friend seems to be under the misconception that some kind of dynamic copying takes place. It does not." which is not true for all systems. Dynamic copying is exactly what takes place, at least on every ROM-based system. – Lundin, Feb 17 '16 at 07:57
@Lundin ok, regarding that comment, yes, I was assuming static storage. If it's automatic, then a copy of the initializer needs to be placed in static storage. But the two cases will still be handled the same, that's the real point. OP's friend seemed to think they were different. — Tom Karzes, Feb 17 '16 at 07:59

Lundin · Accepted Answer · 2016-02-17T07:37:57.720

7

char string[] = "some text";

is 100% equivalent to

char string[] = {'s', 'o', 'm', 'e', ' ', 't', 'e', 'x', 't', '\0'};

Your friend is confused: if string is a local variable, then in both cases you create two strings. Your variable string which resides on the stack and a read-only string literal which resides in read-only memory (.rodata).

There is no way to avoid the read-only storage, since all data must be allocated somewhere. You cannot allocate string literals in thin air. All you can do is to move it from one read-only memory segment to another, which will give you the very same program size in the end anyway.
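As a rough illustration of the local-variable case, the sketch below shows that the array is a fresh run-time copy on every call, which is why the initial data has to live somewhere else in the program image as well. The function name `first_char_then_clobber` is made up for this example, and where the compiler actually keeps the initial data is implementation-specific.

```c
/* Hypothetical helper: returns the first character seen on entry,
   then overwrites it in the local copy. */
char first_char_then_clobber(void)
{
    char string[] = "some text";  /* initialized again on every call */
    char seen = string[0];

    string[0] = 'X';              /* changes only this call's copy */
    return seen;
}
```

Calling the function repeatedly should keep returning 's', because each call starts from the stored initial data rather than from the previously modified array.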

The former style is preferred in general, as it is more readable. It is indeed a form of syntactic sugar.

But it is also preferred because it might enable a compiler optimization known as "string pooling", which allows the compiler to store the string literal "some text" in more memory-efficient ways. If you initialize the string character by character, the compiler may or may not realize that it is a read-only string constant.

edited Feb 17 '16 at 07:37

answered Feb 17 '16 at 07:31

Lundin

I feel that this is the best answer as it answers from the practical point of view and covers all questions. – user694733 Feb 17 '16 at 07:44
Maybe you could provide evidence for your statements quoting the standard - I personally am not 100% sure what you are saying is true. Also providing real-life examples will be good too - though not so important as the first one. – AnArrayOfFunctions Feb 24 '16 at 13:06
@FISOCPP The C standard does not specify where data is stored, because that's outside the scope of the language standard. To find something to quote, you'd have to dig into compiler docs and elf format specifications etc. But you could simply look at the memory map of any program and see for yourself. Now if you are going to down vote my answer because _you_ don't know the scope of the C standard, or because _you_ have never seen the .rodata section in a map file (which is everyday trivial stuff for anyone who has ever done hardware-related programming), then I am not going to teach it to you. – Lundin Feb 24 '16 at 13:53
Calm down. All I am saying is that from your post I can't understand why the two forms are equivalent. You just state it without any proof. For me this is not evident at first sight. – AnArrayOfFunctions Feb 24 '16 at 16:30
I can even tell you why - I remember that sometime in the past I've been testing those 2 forms and they were producing different code. In the first case there was one globally stored string literal which was copied to locally allocated variable and in the second one the string itself was encoded in the function code as move instructions. Maybe that is not standard but from your post I can't tell for sure unless I completely trust your unproven opinion. – AnArrayOfFunctions Feb 24 '16 at 16:40
@FISOCPP "All you can do is to move it from one read-only memory segment to another, which will give you the very same program size in the end anyway." Also see the last paragraph of my answer: if you don't use the string format, some compilers may not realize that it is a string and store it in some other part of the read-only memory than the segment used for strings. You still allocate it, it has to be allocated somewhere. I still don't get it why some people think a program can magically know about other things besides those stored in its own memory. – Lundin Feb 25 '16 at 07:27

Some programmer dude · Answer 2 · 2016-02-17T06:55:48.590

After your edit, there is no difference between the two definitions. Both produce an array of ten characters initialized to the same contents.

This is actually easy to verify: first check what sizeof gives you for the two arrays, then use e.g. memcmp to compare the two arrays.
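A minimal sketch of that check, assuming nothing beyond the standard library; the function name `initializers_match` is invented for this example:

```c
#include <string.h>

/* Returns 1 when both initializer styles give arrays of equal size and
   equal contents, 0 otherwise. */
int initializers_match(void)
{
    char from_literal[] = "some text";
    char from_chars[] = {'s', 'o', 'm', 'e', ' ', 't', 'e', 'x', 't', '\0'};

    if (sizeof from_literal != sizeof from_chars)
        return 0;

    return memcmp(from_literal, from_chars, sizeof from_literal) == 0;
}
```

On a conforming compiler this should return 1, since both arrays have ten elements holding the same bytes.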

~~The second initialization is almost equal to the first, with one crucial difference: The second array is not terminated as a string.~~

The first creates an array of ten characters (including the terminator) and the second creates an array of nine characters. If you don't use the array as a string, then yes, you will save one element with the second initialization.

Sorry, I didn't mean this. It's a mistake, I'm going to fix that. – nalzok, Feb 17 '16 at 06:55

score 3 · Answer 3 · answered Feb 17 '16 at 07:26

The C standard has a "special case" that allows you to initialize an array with a string literal:

§6.7.9/14 An array of character type may be initialized by a character string literal or UTF−8 string literal, optionally enclosed in braces. Successive bytes of the string literal (including the terminating null character if there is room or if the array is of unknown size) initialize the elements of the array.

That's it. It doesn't say anything else, which would be an implementation detail of the platform and compiler. Unlike C++, which explicitly gives string literals static storage duration, the C standard doesn't. It's implied. There are common extensions that allow you to modify string literals, meaning that it's not guaranteed that they will be placed in read-only memory.
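The terminator rule in the quoted paragraph can be seen with a small sketch like the one below. The function names are made up for illustration, and some compilers may warn about the exact-size case even though it is valid C.

```c
#include <stddef.h>

/* Size unknown: the terminating null is included, giving 10 elements. */
size_t size_when_unsized(void)
{
    char s[] = "some text";
    return sizeof s;
}

/* Braces around the literal are optional and change nothing. */
size_t size_when_braced(void)
{
    char s[] = {"some text"};
    return sizeof s;
}

/* Exactly 9 elements: there is no room for the terminator, so none is
   stored. Valid C, but the result is not a null-terminated string. */
size_t size_when_exact(void)
{
    char s[9] = "some text";
    return sizeof s;
}
```

The first two functions should report 10 and the third 9, matching the "if there is room or if the array is of unknown size" wording.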

user5939003

Does the term "string literal" imply it's static? As far as I know, string literals are guaranteed to be static in C. – nalzok Feb 17 '16 at 07:31
@sunqingyao The string literal is preserved throughout the execution of the program, so in that way it is "static". But the C term and keyword `static`, which affects storage duration and scope, doesn't really apply to string literals. – Lundin Feb 17 '16 at 07:37
@sunqingyao No, it's implied because it's undefined behavior to modify a string literal (which means that it's likely to be placed in ROM) and because of 6.7.9/4 "All the expressions in an initializer for an object that has static or thread storage duration shall be constant expressions or string literals." Whereas C++ outright says "string literals have static storage duration". – user5939003 Feb 17 '16 at 07:37

score 3 · Answer 4 · answered Feb 17 '16 at 16:47

Semantically, the two lines are identical. But the practical consequences will depend on the compiler.

Experimenting with http://gcc.godbolt.org/ shows a variety of strategies:

Fill in the arrays one character at a time using a series of movb instructions (or equivalent) with immediate operands.
Fill in the arrays one doubleword at a time using movabsq / movq pairs, where the first has an immediate doubleword operand.
Copy the data into the arrays from a string constant stored in the .rodata section.

Different compilers used different strategies for the two cases. In particular, gcc found the movabsq optimization only for the case of char string[] = "string literal";, which makes your friend's strategy somewhat bulkier (because the generated code has more bytes).

Trying different optimization settings would probably have produced even more variations.

It's clear that the base data has to be stored somewhere in the program, whether it is in the data section or as a series of immediate operands in executable code. Since it is not practical to figure out or guess how a particular style might affect a given compiler's ability to optimize, the only rational approach is to use the style which is easiest to read and maintain. (The useful corollary is that the compiler will probably also have the easiest time with the most common style.)

In the unlikely event that this is actually performance-critical, you would have to examine the code produced by the actual compiler being used. But you should first ask yourself whether an initialized mutable buffer is really necessary.
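For anyone who wants to repeat that kind of comparison, one possible setup is sketched below. The names `sink_fn`, `fill_from_literal`, `fill_from_chars`, `captured` and `capture` are invented for this example; handing the array to a callback is just one way to stop the compiler from optimizing the array away. Compiling with something like `gcc -O2 -S` and reading the two `fill_*` functions should show which strategy a given compiler picked.

```c
#include <string.h>

/* Hypothetical sink type: passing the array to an opaque callback forces
   the compiler to actually build the array in memory. */
typedef void (*sink_fn)(const char *);

void fill_from_literal(sink_fn sink)
{
    char string[] = "some text";
    sink(string);
}

void fill_from_chars(sink_fn sink)
{
    char string[] = {'s', 'o', 'm', 'e', ' ', 't', 'e', 'x', 't', '\0'};
    sink(string);
}

/* Hypothetical sink that records what it was given, for checking that
   both functions hand over the same contents. */
char captured[16];

void capture(const char *s)
{
    strncpy(captured, s, sizeof captured - 1);
    captured[sizeof captured - 1] = '\0';
}
```

Both functions behave identically from the caller's point of view; only the generated instructions may differ between compilers and optimization levels.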

score 1 · Answer 5 · answered Feb 18 '16 at 23:51

In the first case, string[] is initialised using a literal string constant of length 10 bytes, which will be instantiated in a read-only segment.
In the second case string[] is initialised using a constant array of literal character constants of length 10 bytes, which will be instantiated in a read-only segment.

Both cases are identical both semantically and in memory requirement. The first is merely syntactic sugar for the second (and much more convenient, and less error prone).

If you need to initialise non-read-only data with compile-time constant data, then the constant initialiser will necessarily be compiled in regardless of the syntax used. You cannot get something for nothing. If, however, the data is constant, you can use a single read-only copy by declaring:

const char* string = "some text" ;

This will create a pointer string to a constant string, and can save memory when compared with, say:

#define string "some text"

which may generate multiple copies of "some text" everywhere the macro string is used (although most modern compiler/linker toolchains are able to remove duplicate strings in any case). With the pointer, you can be sure that the address held in string is identical for all references, while with the macro the address may differ for each reference if the duplicates are not optimised away. Another semantic difference is that for const char* string, sizeof(string) is the size of a pointer, whereas for string[] it is the length of the initialiser (including the nul terminator).
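The sizeof difference can be checked with a short sketch such as this one; the identifiers `as_pointer`, `as_array`, `pointer_size` and `array_size` are invented for the example:

```c
#include <stddef.h>

const char *as_pointer = "some text";  /* pointer to one read-only literal */
char as_array[] = "some text";         /* writable array with its own copy */

/* Size of the pointer object itself, whatever the string length. */
size_t pointer_size(void)
{
    return sizeof as_pointer;
}

/* Size of the array: 9 characters plus the nul terminator. */
size_t array_size(void)
{
    return sizeof as_array;
}
```

Here `array_size()` should give 10, while `pointer_size()` gives the platform's pointer size (commonly 4 or 8 bytes) and does not depend on the text.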

score 0 · Answer 6 · answered Feb 17 '16 at 07:02

will a string initializer create two same string? Or string initializers are just syntax sugars?

The two cases are totally different:

1st case:

char string[] = "some text";  // <-- string initialization

This syntax is string-specific and cannot be applied to any other data type. It automatically adds a \0 character at the end, so it's guaranteed that library functions such as printf know where to end the output (with the %s conversion specifier).

2nd case:

char string[] = {'s', 'o', 'm', 'e', ' ', 't', 'e', 'x', 't'};  // <--  array initialization

This syntax initializes an array BUT not a ~~string~~. The syntax can be used to initialize other types of array (such as int, long, etc.). It NEVER automatically adds a \0 at the end of the array. So it'd be WRONG to printf this char array using the %s conversion specifier.

In short, these are two different initialization syntaxes used for different purposes. If you need a string, use the first syntax; if you need a character array, use the second.

Ok I saw that - as @Joachim said they are simply the same then. — artm, Feb 17 '16 at 07:05
You should either delete the answer or edit it reflect the updated question. — user694733, Feb 17 '16 at 07:17
