Notes on the C Preprocessor: Token Pasting

Token pasting (also called concatenation) is a preprocessor feature that is little known outside hardcore C programming circles. As the name suggests, it lets a macro join two adjacent tokens in its expansion into a single, brand new token. Developers can and do use token pasting in all sorts of clever ways. The GNU C preprocessor manual provides an example that uses token pasting to generate boilerplate code:

    #define COMMAND(NAME) { #NAME, NAME ## _command }
    struct command commands[] =
    {
      COMMAND (quit),  // expands to { "quit", quit_command }
      COMMAND (help),  // expands to { "help", help_command }
      …
    };

The COMMAND macro uses the stringification operator # to produce the command name as a string and the token paste operator ## to construct an identifier with the _command suffix, which eliminates repetitive code. The same pattern is used to generate all sorts of things, even entire function definitions.
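To make the manual's fragment concrete, here is a minimal sketch of how such a table might be filled in and searched. The struct layout, the two handlers and their return values, and the find_command helper are assumptions made for illustration; they are not taken from the manual.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical handlers: only the _command suffix matters to the macro. */
static int quit_command(void) { return 0; }
static int help_command(void) { return 1; }

/* Assumed table entry layout: a name plus the function that handles it. */
struct command {
    const char *name;
    int (*function)(void);
};

/* #NAME stringifies the argument; NAME ## _command pastes the suffix on. */
#define COMMAND(NAME) { #NAME, NAME ## _command }

static const struct command commands[] = {
    COMMAND(quit),  /* expands to { "quit", quit_command } */
    COMMAND(help),  /* expands to { "help", help_command } */
};

/* Return the table entry whose name matches, or NULL if there is none. */
static const struct command *find_command(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof commands / sizeof commands[0]; i++) {
        if (strcmp(commands[i].name, name) == 0)
            return &commands[i];
    }
    return NULL;
}
```

A caller could then dispatch with something like find_command("help")->function(), after checking the result for NULL. Adding a command means writing one handler and one COMMAND line, with no chance of the string and the function name drifting apart.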

Token pasting can be a serious pain for program analyses, code browsers, and really any tool that depends on the syntax tree of a program, because it alters the stream of tokens that the parser sees. A tool can opt to resolve token pasting first, but this requires evaluating at least some macros, resulting in code that might be dramatically different from the source code the developer is writing. Even worse, the resulting tokens can depend on configuration options when used with #ifdefs. Let’s look at a few examples, and see why this is so hard.

Developers have found very creative uses for token pasting. This [example](https://elixir.bootlin.com/linux/v2.6.33/source/include/linux/compiler-gcc.h), from a past version of the Linux kernel source, includes a header file whose name depends on the version of GCC being used to compile the source code:

    1: #define __gcc_header(x) #x
    2: #define _gcc_header(x) __gcc_header(linux/compiler-gcc##x.h)
    3: #define gcc_header(x) _gcc_header(x)
    4: #include gcc_header(__GNUC__)

Line 4 is the actual include, which invokes a macro to build a file name from __GNUC__, a predefined macro that expands to the major version number of the compiler. The gcc_header macro on line 3 does nothing but forward its argument to _gcc_header on line 2. This extra level of indirection is necessary because the preprocessor does not expand a macro argument that is an operand of ##. Calling _gcc_header(__GNUC__) directly would paste the literal token __GNUC__, whereas going through line 3 first replaces __GNUC__ with its value (4, say), so that line 2 produces linux/compiler-gcc4.h. The macro on line 1 then stringifies (with operator #) the resulting header file name, turning C tokens into a C string, yet another feature of the preprocessor.
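The effect of the extra macro is easiest to see by comparing the two spellings side by side. The sketch below mirrors the structure of the kernel macros but is not the kernel code: VERSION is a hypothetical stand-in for __GNUC__ so that the result does not depend on the compiler, and the macro and function names are made up for illustration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for __GNUC__, so the result is the same everywhere. */
#define VERSION 4

#define STRINGIFY(x) #x
/* x is an operand of ##, so it is pasted exactly as written. */
#define HEADER_DIRECT(x) STRINGIFY(compiler-gcc##x.h)
/* x is not an operand of # or ## here, so it is expanded before forwarding. */
#define HEADER(x) HEADER_DIRECT(x)

static const char *header_without_indirection(void)
{
    return HEADER_DIRECT(VERSION);
}

static const char *header_with_indirection(void)
{
    return HEADER(VERSION);
}

/* Checks both spellings against the strings the preprocessor produces. */
static void check_headers(void)
{
    assert(strcmp(header_without_indirection(), "compiler-gccVERSION.h") == 0);
    assert(strcmp(header_with_indirection(), "compiler-gcc4.h") == 0);
}
```

Without the forwarding macro, the name of the macro ends up in the file name instead of its value, which is almost never what the developer wants.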

If this example weren’t confusing enough, token pasting can even muck with types, since pasting can construct identifiers of any kind, including type names. The following [example](https://elixir.bootlin.com/linux/v2.6.33/source/fs/udf/balloc.c#L41), also from the Linux kernel, constructs different type names depending on whether we want to build for a 32-bit or a 64-bit architecture.

    #define uintBPL_t uint(BITS_PER_LONG)
    #define uint(x) xuint(x)
    #define xuint(x) __le ## x
    uintBPL_t *p = ... ;

BITS_PER_LONG is either 64 or 32 depending on another configuration option, so uintBPL_t expands to either __le64 or __le32. Trying to reason about this piece of code is tricky, particularly for an automated tool that, for instance, looks for bugs or does refactoring.
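A self-contained version of the same trick might look like the sketch below. It is an illustration under stated assumptions rather than the kernel code: le32 and le64 are hypothetical stand-ins for the kernel's __le32 and __le64, and BITS_PER_LONG is derived from ULONG_MAX instead of a kernel configuration option, assuming that unsigned long is either 32 or 64 bits wide.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's __le32 and __le64 types. */
typedef uint32_t le32;
typedef uint64_t le64;

/* Assumption: unsigned long is either 32 or 64 bits wide on this target. */
#if ULONG_MAX == 0xFFFFFFFFUL
#define BITS_PER_LONG 32
#else
#define BITS_PER_LONG 64
#endif

#define uintBPL_t uint(BITS_PER_LONG)
#define uint(x) xuint(x)   /* expands x to 32 or 64 before forwarding */
#define xuint(x) le ## x   /* pastes the width on, giving le32 or le64 */

/* Fails at compile time if the pasted type does not match the word size. */
static_assert(sizeof(uintBPL_t) * CHAR_BIT == BITS_PER_LONG,
              "uintBPL_t should be exactly one long wide");

/* Width in bits of whichever type uintBPL_t expanded to. */
static size_t bpl_width_in_bits(void)
{
    uintBPL_t word = 0;
    return sizeof word * CHAR_BIT;
}
```

Nothing in the declaration of word says which type it has. A tool has to evaluate the #if, expand three macros, and perform the paste before it can even look the type up.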