Implement the internal processing for C #include directives

Question

I'm trying to work out the cleanest possible way to implement the functionality of the #include directive for a C compiler.

I only know how to implement the external part of the processing: detecting the '#' char at the beginning of the line to run a preprocessor-only loop, and gathering the "include" string and the string between <> or "".

What I don't know is the best way to implement the internal processing to run the actual effect of the #include directive: Expand the full path for library header files (using <>) but not for the ones using "" (it's probably cleaner and more flexible to assume that they are in the current directory as that would also allow for including source files with the full path correctly).

The tasks I think I would need to implement would be:

The main C file passed as a command-line parameter to the compiler should be processed just like a #include "mainfile.c" directive to start the compilation in a uniform way.
Expand the path for files included with quotes ("", are single quotes '' valid at least for some compilers?)
Put the file in a list of files, also indicating in which line and in which file we found the #include directive
In the preprocessor stage, see if it's an #include directive and try to open the specified file unconditionally to try to properly get all files from the start. If a file doesn't exist, don't signal an error at the preprocessor stage, only when we have marked them as usable, when we determine whether we should include them or not due to #ifdef or #elif, while trying to translate the actual C code.
After processing all #includes in the code, process the rest of the preprocessor code, now with the full set of potential files to include.

I think that using a stack of files would be useful but only after completing the preprocessor stage, and when we are already translating the whole code and adding files (pushing file indexes on the source file stack at #include and popping file indexes at the end of a source file.)

I think that the easiest way to handle the code would be to inspect all of the files pointed to by #include, make a list of them, and then later mark as usable only the ones that I will actually include and process fully (those that meet #ifdef or #elif conditions), but for that I need to see which included files there are in the whole set of source files.

There is an interesting irony you hit on in your first and last statements. Includes are unclean by definition :) — Aluan Haddad, Feb 12 '18 at 18:25
Note that different compilers treat path names in "" and <> differently, though the prevalence of the GNU Compiler Collection has forced some reconciliation. In general, the name of the file included by `"q-char-sequence"` or `<h-char-sequence>` is located by searching through a series of alternate paths, usually including the place the file doing the `#include` came from at the front in case there are multiple qualifying paths (e.g., /usr/include/foo.h and /usr/include/sys/foo.h plus -Ipath which found path/bar.h when there is a path/foo.h). But... — torek, Feb 12 '18 at 21:48
... you can also use techniques like `#define HEADER "foo.h"` followed by `#include HEADER`, so you can use pp-token-construction techniques to build path names somewhat dynamically. This means that in general you have to interpret the path names as you come across them and expand pp-tokens. — torek, Feb 12 '18 at 21:51
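
To illustrate the macro-built header name mentioned in the comment above, here is a small sketch. The macro name STRING_HEADER and the function header_name_length are made up for this example; the only point is that the preprocessor has to expand the macro before it can tell which file is being included.

```c
#include <stddef.h>

/* The header name is produced by macro expansion, so a preprocessor that
   only looks for a literal "..." or <...> after #include would miss it. */
#define STRING_HEADER <string.h>
#include STRING_HEADER

/* Hypothetical helper: it only compiles if <string.h> really was included. */
size_t header_name_length(void)
{
    return strlen("foo.h");
}
```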

score 2 · Accepted Answer · edited Feb 12 '18 at 21:39

You usually process all preprocessor directives as you read them. So when you see an #include, you get the file name, search through the include path, open the file and start processing it -- no need to defer things. Once you get to the end of the included file, you continue processing the original file.
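
A minimal sketch of that process-as-you-read approach, in C. It is not a full preprocessor: it only copies lines and expands `#include "..."` and `#include <...>` in place by recursing into the included file. All names here (expand_file, expand_stream, open_include, parse_include, include_dirs, MAX_DEPTH) are hypothetical, and it assumes '/' as the path separator and lines shorter than 1024 bytes. Macro-built names and conditionals are deliberately left out.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_DEPTH 64  /* arbitrary guard against runaway recursive inclusion */

/* Hypothetical search path; a real compiler would build it from its options. */
static const char *include_dirs[] = { ".", NULL };

/* If `line` is an #include with a literal name, copy the name into `name`,
   set *quoted for the "..." form, and return 1. Otherwise return 0. */
static int parse_include(const char *line, char *name, size_t cap, int *quoted)
{
    while (isspace((unsigned char)*line)) line++;
    if (*line != '#') return 0;
    line++;
    while (isspace((unsigned char)*line)) line++;
    if (strncmp(line, "include", 7) != 0) return 0;
    line += 7;
    while (isspace((unsigned char)*line)) line++;
    char closer = (*line == '"') ? '"' : (*line == '<') ? '>' : 0;
    if (!closer) return 0;              /* e.g. #include MACRO: not handled */
    *quoted = (closer == '"');
    const char *end = strchr(++line, closer);
    if (!end || (size_t)(end - line) >= cap) return 0;
    memcpy(name, line, (size_t)(end - line));
    name[end - line] = '\0';
    return 1;
}

/* For "..." names, try the directory of the including file first, then
   each entry of include_dirs. The path that worked is left in `path`. */
static FILE *open_include(const char *from, const char *name, int quoted,
                          char *path, size_t cap)
{
    FILE *f;
    if (quoted) {
        const char *slash = strrchr(from, '/');
        int dirlen = slash ? (int)(slash - from + 1) : 0;
        snprintf(path, cap, "%.*s%s", dirlen, from, name);
        if ((f = fopen(path, "r")) != NULL) return f;
    }
    for (size_t i = 0; include_dirs[i]; i++) {
        snprintf(path, cap, "%s/%s", include_dirs[i], name);
        if ((f = fopen(path, "r")) != NULL) return f;
    }
    return NULL;
}

/* Copy `in` to `out`, expanding each #include at the point where it occurs.
   Returns 0 on success, -1 on error. */
static int expand_stream(const char *path, FILE *in, FILE *out, int depth)
{
    char line[1024], name[512], child[1024];
    int quoted, lineno = 0;
    if (depth > MAX_DEPTH) {
        fprintf(stderr, "%s: #include nested too deeply\n", path);
        return -1;
    }
    while (fgets(line, sizeof line, in)) {
        lineno++;
        if (!parse_include(line, name, sizeof name, &quoted)) {
            fputs(line, out);           /* ordinary line: pass it through */
            continue;
        }
        FILE *f = open_include(path, name, quoted, child, sizeof child);
        if (!f) {
            fprintf(stderr, "%s:%d: cannot find %s\n", path, lineno, name);
            return -1;
        }
        int rc = expand_stream(child, f, out, depth + 1);
        fclose(f);
        if (rc != 0) return rc;
        /* back in the original file: carry on with its next line */
    }
    return 0;
}

/* The main source file is handled the same way as an included one. */
int expand_file(const char *path, FILE *out)
{
    FILE *in = fopen(path, "r");
    if (!in) return -1;
    int rc = expand_stream(path, in, out, 0);
    fclose(in);
    return rc;
}
```

Calling `expand_file("main.c", stdout)` would print main.c with every included file spliced in where its directive appeared. The recursion plays the role of the file stack: each nested call is a push, and each return is a pop.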

Similarly with an #if, you read the condition and decide if it is true or false. If false, you then start skipping over the input, ignoring it until you find the matching #else or #endif. So if there's an #include in there, you just skip it.
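
The skipping step could look roughly like the sketch below. The names classify, skip_group and the enum values are invented for illustration. It tracks nesting so that an inner `#endif` is not mistaken for the one being searched for, and it treats every other directive, `#include` among them, as plain text to discard. `#elif` is left out for brevity; a real preprocessor would have to stop at it and evaluate its condition.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

enum directive { D_NONE, D_IF, D_ELSE, D_ENDIF, D_OTHER };

/* Classify one line: #if, #ifdef and #ifndef all open a conditional group;
   #else and #endif are reported separately; any other directive is D_OTHER. */
static enum directive classify(const char *s)
{
    while (isspace((unsigned char)*s)) s++;
    if (*s != '#') return D_NONE;
    s++;
    while (isspace((unsigned char)*s)) s++;
    size_t n = 0;
    while (isalpha((unsigned char)s[n])) n++;
    if (n == 2 && strncmp(s, "if", 2) == 0) return D_IF;
    if (n == 5 && strncmp(s, "ifdef", 5) == 0) return D_IF;
    if (n == 6 && strncmp(s, "ifndef", 6) == 0) return D_IF;
    if (n == 4 && strncmp(s, "else", 4) == 0) return D_ELSE;
    if (n == 5 && strncmp(s, "endif", 5) == 0) return D_ENDIF;
    return D_OTHER;
}

/* Call this right after a false #if. It discards lines until the matching
   #else or #endif and returns which one it found, or D_NONE if the input
   ends first (an unterminated conditional). */
enum directive skip_group(FILE *in)
{
    char line[1024];
    int depth = 0;                      /* nested conditionals seen so far */
    while (fgets(line, sizeof line, in)) {
        switch (classify(line)) {
        case D_IF:
            depth++;
            break;
        case D_ENDIF:
            if (depth == 0) return D_ENDIF;
            depth--;
            break;
        case D_ELSE:
            if (depth == 0) return D_ELSE;
            break;                      /* #else of a nested group */
        default:
            break;                      /* text and #include lines are dropped */
        }
    }
    return D_NONE;
}
```

Because the skipped lines are never interpreted, an `#include` naming a file that does not exist causes no error while it sits inside a false branch.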

edited Feb 12 '18 at 21:39

torek

answered Feb 12 '18 at 19:08

Chris Dodd

Would that force the preprocessor to run separately for each source file instead of in a single pass over the complete code? At most, it would force the compiler to recognize only things that have been previously declared; anything used before its declaration in the sequence of the code would be considered undeclared. – alt.126 Feb 12 '18 at 19:17
Parsing preprocessor directives per individual source file instead of for the whole source code in a single preprocessor run would add complexity to the compiler code because the parsing loop would need to call the preprocessor and arrange what it finds instead of having it fixed from the start. Running the preprocessor only once would intrinsically miss if something was declared before or after being used unless we use an offset for the whole code to know where it was first declared. – alt.126 Feb 12 '18 at 19:28

score 0 · Answer 2 · answered Feb 13 '18 at 14:28

It seems that the preprocessor code needs to be parsed in order so that we properly know whether we have already performed tasks like defining things or including files and can avoid doing so again. So the preprocessor directives really need to be processed as we find them, in the order in which we find them.

The actual C code can probably be analyzed in any order by doing several passes, mostly over the globally declared items, so that things can be used before they are declared, but the preprocessor directives need to be processed in order so that we can selectively define and include things.
