
I went through this article. I understand the rules explained, but I am wondering what exactly blocks the compiler from accepting the following syntax when defining a constant multi-dimensional array and directly initializing it with known values of a given type:

const int multi_arr1[][] = {{1,2,3}, {1,2,3}}; // why not?
const int multi_arr2[][3] = {{1,2,3}, {1,2,3}}; // OK

error: declaration of 'multi_arr1' as multidimensional array must have bounds
       for all dimensions except the first

What prevents the compiler from looking to the right and realizing that we are dealing with 3 elements for each "subarray", or from returning an error only in cases where the programmer passes a different number of elements for each subarray, e.g. {1,2,3}, {1,2,3,4}?

For example, when dealing with a 1D char array, the compiler can look at the string on the right-hand side of = and this is valid:

const char str[] = "Str";

I would like to understand why the compiler is not able to deduce the array dimensions and calculate the size for allocation, since it seems to me that it has all the information needed to do so. What am I missing here?

esgaldir
    What "blocks" the compiler is adherence to the standard (for C _or_ C++, they're different standards, pick one). What blocks the standard from allowing this is _no-one wrote a standards proposal for implementing it which was subsequently accepted_. – Useless Feb 19 '18 at 11:15
  • ^ That. Which tells you a lot about how often a true need for this feature comes up in practice. – StoryTeller - Unslander Monica Feb 19 '18 at 11:20
  • The fight over whether different-sized initialisers should be an error or the dimension should be that of the largest one would last for decades. – molbdnilo Feb 19 '18 at 11:33
  • "What prevents compiler from looking ..." --> Little prevents it. "Why ... not possible" --> C lacks features: binary constants, function overloading. Needs work on nascent Unicode support, _Generic. `[][] = {{…}, {…}}` is not a priority to change the Spec - even though it is interesting. – chux - Reinstate Monica Feb 19 '18 at 12:14

5 Answers


Requiring the compiler to infer inner dimensions from the initializers would require the compiler to work retroactively in a way the standard avoids.

The standard allows objects being initialized to refer to themselves. For example:

struct foo { struct foo *next; int value; } head = { &head, 0 };

This defines a node of a linked list that points to itself initially. (Presumably, more nodes would be inserted later.) This is valid because C 2011 [N1570] 6.2.1 7 says the identifier head “has scope that begins just after the completion of its declarator.” A declarator is the part of the grammar of a declaration that includes the identifier name along with the array, function, and/or pointer parts of the declaration (for example, f(int, float) and *a[3] are declarators, in declarations such as float f(int, float) or int *a[3]).
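A quick check of that definition (a minimal sketch assuming a hosted environment; not part of the original answer) confirms the self-reference:

#include <stdio.h>

/* head's scope begins just after its declarator (C11 6.2.1 7),
   so &head may appear in its own initializer */
struct foo { struct foo *next; int value; } head = { &head, 0 };

int main(void) {
    printf("%s\n", head.next == &head ? "head points to itself" : "?");
    return 0;
}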

Because of 6.2.1 7, a programmer could write this definition:

void *p[][1] = { { p[1] }, { p[0] } };

Consider the initializer p[1]. This is an array, so it is automatically converted to a pointer to its first element, p[1][0]. The compiler knows that address because it knows p[i] is an array of 1 void * (for any value of i). If the compiler did not know how big p[i] was, it could not calculate this address. So, if the C standard allowed us to write:

void *p[][] = { { p[1] }, { p[0] } };

then the compiler would have to continue scanning past p[1] so it can count the number of initializers given for the second dimension (just one in this case, but we have to scan at least to the } to see that, and it could be many more), then go back and calculate the value of p[1].

The standard avoids forcing compilers to do this sort of multiple-pass work. Requiring compilers to infer the inner dimensions would violate this goal, so the standard does not do it.

(In fact, I think the standard might not require the compiler to do any more than a finite amount of look-ahead, possibly just a few characters during tokenization and a single token while parsing the grammar, but I am not sure. Some things have values not known until link time, such as void (*p)(void) = &SomeFunction;, but those are filled in by the linker.)
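Meanwhile, the valid form with the inner bound given resolves its self-references without any such look-ahead; a quick check (a minimal sketch, not part of the original answer):

#include <stdio.h>

/* the inner bound is given, so the compiler can compute the
   addresses &p[0][0] and &p[1][0] while it is still scanning
   the initializer */
void *p[][1] = { { p[1] }, { p[0] } };

int main(void) {
    printf("%d\n", p[0][0] == &p[1][0]);  /* prints 1 */
    printf("%d\n", p[1][0] == &p[0][0]);  /* prints 1 */
    return 0;
}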

Additionally, consider a definition such as:

char x[][] =
    {
        {  0,  1 },
        { 10, 11 },
        { 20, 21, 22 }
    };

As the compiler reads the first two lines of initial values, it may want to prepare a copy of the array in memory. So, when it reads the first line, it will store two values. Then it sees the closing } at the end of the line, so it can assume for the moment the inner dimension is 2, forming char x[][2]. When it sees the second line, it allocates more memory (as with realloc) and continues, storing the next two values, 10 and 11, in their appropriate places.

When it reads the third line and sees 22, it realizes the inner dimension is at least three. Now the compiler cannot simply allocate more memory. It has to rearrange where 10 and 11 are in memory relative to 0 and 1, because there is a new element between them; x[0][2] now exists and has a value of 0 (so far). So requiring the compiler to infer the inner dimensions, while also allowing different numbers of initializers in each subarray (and inferring the inner dimension from the maximum number of initializers seen throughout the entire list), can burden the compiler with a lot of memory motion.
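Concretely, if the inner dimension were inferred from the largest row (3 here), the definition above would presumably have to end up equivalent to this (a sketch of the layout the compiler would need to produce):

char x[3][3] =
    {
        {  0,  1,  0 },   /* x[0][2] must be zero-filled */
        { 10, 11,  0 },   /* x[1][2] must be zero-filled */
        { 20, 21, 22 }
    };

In memory that is 0, 1, 0, 10, 11, 0, 20, 21, 22, so the 10 and 11 already stored under the tentative char x[][2] layout would have to move.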

Eric Postpischil
  • 3
    Incidentally, C99 does allow for something like: `int *q[5] = {(int[]){1,2,3,-1}, (int[]){1,2,-1}, (int[]){1,2,3,4,5,6,7,-1}};`. The syntax is a bit awkward, code would need to expect to use an `(int*)[]` rather than a two-dimensional array, and there would be no nice way of finding out the inner dimensions unless it were implied by the data [e.g. by including sentinels at the end of each row] but the approach could be more efficient than trying to use a two-dimensional array if the rows would have different numbers of initializers. – supercat Feb 19 '18 at 18:18
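A runnable version of the approach in that comment might look like this (a minimal sketch: the outer bound is left for the compiler to infer rather than fixed at 5, and the comment's -1 sentinels mark each row's end):

#include <stdio.h>

/* a jagged array built from C99 compound literals; each row has
   its own length and ends with a -1 sentinel */
int *q[] = {
    (int[]){ 1, 2, 3, -1 },
    (int[]){ 1, 2, -1 },
    (int[]){ 1, 2, 3, 4, 5, 6, 7, -1 },
};

int main(void) {
    for (size_t i = 0; i < sizeof q / sizeof q[0]; i++) {
        for (int *e = q[i]; *e != -1; e++)
            printf("%d ", *e);
        putchar('\n');
    }
    return 0;
}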

There is nothing impossible about implementing a compiler that deduces the innermost dimensions of a multidimensional array in the presence of an initializer. However, it is a feature that is NOT supported by the C or C++ standards, and evidently there has been no demand great enough to bother adding it.

In other words, what you're after is not supported by the standard language. It could be supported if enough people needed it. They don't.

Armen Tsirunyan

To briefly expand on the comment:

What "blocks" the compiler is adherence to the standard (for C or C++, they're different standards, pick one).

What "blocks" the standard from allowing this is no-one wrote a standards proposal for implementing it which was subsequently accepted.

So, all you're asking is why no-one was motivated to do something you feel would be useful, and I can only see that as opinion-based.

There may also be practical difficulties implementing this, or keeping consistent semantics; that's not precisely the question you asked, but it might at least be objectively answerable. I suspect someone could work through those difficulties if sufficiently motivated. Presumably no-one was.

For example, the syntax a[] really means array of unknown bound. Because the bound can be inferred in the special case when it's declared using aggregate initialization, you're treating it as something like a[auto]. Maybe that would be a better proposal, since it doesn't have the historical baggage. Feel free to write it up yourself if you think the benefits justify the effort.

Useless

The rule is that the compiler determines only the first dimension of the array from the given initializer list. It expects every inner dimension to be specified explicitly. Period.

haccks

With an array, the compiler has to know how big each element is so that it can do index calculation. For example

int a[3];

is an integer array. The compiler knows how big an int is (usually 4 bytes) so it can calculate the address of a[x] where x is an index between 0 and 2.

A two dimensional array can be thought of as a one dimensional array of arrays. e.g.

int b[2][3];

is a two dimensional array of int but it is also a one dimensional array of arrays of int. i.e. b[x] refers to an array of three ints.

Even with arrays of arrays, the rule that the compiler must know the size of each element still applies, which means that in an array of arrays, the inner arrays must be of fixed size. If they were not, the compiler couldn't calculate the address when indexing, i.e. b[x] would be impossible to compute. Hence the reason why multi_arr2 in your example is OK, but multi_arr1 is not.
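To make the address calculation concrete (a minimal sketch, not part of the original answer):

#include <stdio.h>

int b[2][3] = { { 1, 2, 3 }, { 4, 5, 6 } };

int main(void) {
    int x = 1, y = 2;
    /* the compiler computes &b[x][y] as x*3 + y ints past
       &b[0][0]; without the inner bound 3 it could not do this
       (striding across row boundaries like this is shown only
       to illustrate the arithmetic) */
    printf("%d\n", b[x][y]);                  /* 6 */
    printf("%d\n", *(&b[0][0] + x * 3 + y));  /* 6 again */
    return 0;
}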

What prevents the compiler from looking to the right and realizing that we are dealing with 3 elements for each "subarray", or from returning an error only in cases where the programmer passes a different number of elements for each subarray, e.g. {1,2,3}, {1,2,3,4}?

Probably a limitation of the parser. By the time it gets to the initialiser, the parser has already gone past the declaration. The earliest C compilers were pretty limited, and this behaviour was settled long before modern compilers arrived.

JeremyP
  • There's also the problem of passing an array to a function. Unless the compiler passes additional (hidden) information, the function has no way of knowing the dimensions. And passing that information would break the "array is just a pointer" paradigm. – jamesqf Feb 19 '18 at 17:53
  • @jamesqf I was going to write something about passing parameters but I forgot. In a function declaration, you need to specify the size of the last index in a 2D array. – JeremyP Feb 20 '18 at 10:09
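A sketch of the point in these two comments (the function names are made up for illustration):

/* fine: the parameter decays to int (*)[3], so the element
   size is known and a[i][j] can be computed */
void takes_rows(int a[][3], int rows);

/* error: the element type int[] is incomplete, so the element
   size is unknown */
/* void broken(int a[][], int rows); */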