Notes on the C Preprocessor: Token Pasting

A feature of the preprocessor little known outside of hardcore C programmers is token pasting (also called concatenation).  As the name suggests, token pasting lets the programmer join two tokens inside a macro body into a single, brand new token.  Developers can and do use token pasting in all sorts of clever ways.  The manual for the preprocessor provides an example of using token pasting to generate code from boilerplate:

#define COMMAND(NAME) { #NAME, NAME ## _command }
struct command commands[] =
{
  COMMAND (quit),  // expands to { "quit", quit_command }
  COMMAND (help),  // expands to { "help", help_command }
  …
};

The COMMAND macro uses the token paste operator ## to construct identifiers with the _command suffix, helping to eliminate repetitive code.  This pattern is used to generate all sorts of things, even entire function definitions.

Token pasting can be a serious pain for program analyses, code browsers, and really any tool that depends on the syntax tree of a program, because it alters the stream of tokens that the parser sees.  A tool can opt to resolve token pasting first, but this requires evaluating at least some macros, resulting in code that might be dramatically different from the source code the developer is writing.  Even worse, the resulting tokens can depend on configuration options when used with #ifdefs.  Let’s look at a few examples, and see why this is so hard.

Notes on the C Preprocessor: Problems with Parentheses

The use of preprocessor macros is full of pitfalls. In fact, the preprocessor’s user manual includes a section called “Macro Pitfalls”. One of these pitfalls revolves around operator precedence. Operator precedence problems arise when expectations about pass-by-value, the default for C functions, are applied to function-like macros. Function-like macros behave more like pass-by-name: the arguments are textually substituted into the macro body. This is especially confusing because C functions and function-like macros share the same invocation syntax.

Here is an example of what can go wrong:

 #define square(a) a * a
 int z = square(x + y) * 2;

This use of square looks perfectly sensible, and it would be if square were an ordinary C function. Let’s see what happens when we preprocess this example. We get the following:

 int z = x + y * x + y * 2;

Notes on the C Preprocessor: Introduction

My graduate work on SuperC made me way too familiar with the C preprocessor’s ins and outs, more than I ever could have imagined (or wanted). SuperC’s novel preprocessing and parsing algorithms let you parse a program without having to run the preprocessor first. Solving this challenge exposed me to interesting quirks of the preprocessor and strange usage patterns that appear in the wild. I’d like to share these and bring attention to this underrated aspect of compilers, hopefully providing insight for future language development and software tools.

Because the preprocessor lurks between the lexer and the parser, it can be hard to distinguish it from the C language itself. For instance, #include is not part of the C language, but a preprocessing feature that basically just copies in a given file before compilation. This and the rest of the preprocessor constructs, macros (#define) and conditional compilation (#ifdef), are completely distinct from the C language, sharing only its lexical specification. This makes for a powerful tool that is used to augment the diminutive C language, even enabling what resemble generics, iterators, modules, and more.