1ucpp-1.3 is a C preprocessor compliant with ISO-C99.
2
3Author: Thomas Pornin <pornin@bolet.org>
4Main site: http://pornin.nerim.net/ucpp/
5
6
7
8INTRODUCTION
9------------
10
11A C preprocessor is a part of a C compiler responsible for macro
12replacement, conditional compilation and inclusion of header files.
13It is often found as a stand-alone program on Unix systems.
14
15ucpp is such a preprocessor; it is designed to be quick and light,
16yet fully compliant with the ISO standard 9899:1999, also known
17as C99. ucpp can be compiled as a stand-alone program, or linked to
18some other code; in the latter case, ucpp will output tokens, one
19at a time, on demand, as an integrated lexer.
20
21ucpp operates in two modes:
22-- lexer mode: ucpp is linked to some other code and outputs a stream of
23tokens (each call to the lex() function will yield one token)
24-- non-lexer mode: ucpp preprocesses text and outputs the resulting text
25to a file descriptor; if linked to some other code, the cpp() function
26must be called repeatedly, otherwise ucpp is a stand-alone binary.
27
28
29
30INSTALLATION
31------------
32
331. Uncompress the archive file and extract the source files.
34
352. Edit tune.h. Here is a short explanation of compile-time options:
36
37  LOW_MEM
38     Enable memory-saving functions; this is for low-end and old systems,
39     but seems to be good for larger systems too. Keep it.
40  NO_LIBC_BUF
41  NO_UCPP_BUF
42     Two options used to disable the two bufferings inside ucpp. Define
43     both options for maximum memory savings but you will probably want
44     to keep libc buffering for decent performance. Define neither on large
45     systems (modern 32 or 64-bit systems).
46  UCPP_MMAP
47     With this option, if ucpp internal buffering is active, ucpp will
48     try to mmap() the input files. This might yield a slight performance
49     improvement, but will work only on a limited set of architectures.
50  PRAGMA_TOKENIZE
51     Make ucpp generate tokenized PRAGMA tokens on #pragma and _Pragma();
52     tokenization is made this way: tokens are assembled as a null
53     terminated array of unsigned chars; if a token has a string value
54     (as defined by the STRING_TOKEN macro), the value follows the token,
55     terminated by PRAGMA_TOKEN_END (by default, a newline character cast
56     to unsigned char). Whitespace tokens are skipped. The "name" value
57     of the PRAGMA token is a pointer to that array. This setting is
58     irrelevant in non-lexer mode.
59  PRAGMA_DUMP
60     In non-lexer mode, keep #pragma in output; non-void _Pragma() are
61     translated to the equivalent #pragma. Irrelevant in lexer mode.
62  NO_PRAGMA_IN_DIRECTIVE
63     Do not evaluate _Pragma() inside #if, #include, #include_next and #line
64     directives; instead, emit an error (since the remaining _Pragma will
65     surely imply a syntax error).
66  DSHARP_TOKEN_MERGE
67     When two tokens are to be merged with the `##' operator, but fail
68     because they do not merge into a single valid token, ucpp keeps those
69     two tokens separate by adding an extra space between them in text
70     output. With this option on, that extra space is not added, which means
71     that some tokens may merge partially if the text output is preprocessed
72     again. See tune.h for details.
73  INMACRO_FLAG
74     In lexer mode, set the inmacro flag to 1 if the current token comes
75     from a macro replacement, 0 otherwise. macro_count maintains an
76     increasing counter of such replacements. CONTEXT tokens count as
77     one macro replacement each. #pragma, and _Pragma() that do not come
78     from a macro replacement, also count as one macro replacement each.
79     This setting is irrelevant in non-lexer mode.
80  STD_INCLUDE_PATH
81     Default include path in stand-alone ucpp.
82  STD_MACROS
83     Default predefined macros in stand-alone ucpp.
84  STD_ASSERT
85     Default assertions in stand-alone ucpp.
86  NATIVE_SIGNED
87  NATIVE_UNSIGNED
88  NATIVE_UNSIGNED_BITS
89  NATIVE_SIGNED_MIN
90  NATIVE_SIGNED_MAX
91  SIMUL_ARITH_SUBTYPE
92  SIMUL_SUBTYPE_BITS
93  SIMUL_NUMBITS
94  WCHAR_SIGNEDNESS
95     Those options define how #if expressions are evaluated; see the
96     cross-compilation section of this file for more info, and the
97     comments in tune.h. Extra info is found in arith.h and arith.c,
98     at the possible expense of your mental health.
99  DEFAULT_LEXER_FLAGS
100  DEFAULT_CPP_FLAGS
101     Default flags in respectively lexer and non-lexer modes.
102  POSIX_JMP
103     Define this if your architecture defines sigsetjmp() and
104     siglongjmp(); it is known to (very slightly) improve performance
105     on AIX systems.
106  MAX_CHAR_VAL
107     ucpp will consider characters whose value is equal to or above
108     MAX_CHAR_VAL as outside the C source charset (so they will be
109     treated just like '@', for instance). For ASCII systems, 128
110     is fine. 256 is a safer value, but uses more (static) memory.
111     For performance reasons, use a power of two. If MAX_CHAR_VAL is
112     correctly adjusted, ucpp should be compatible with any character
113     set.
114  UNBREAKABLE_SPACE
115     If you want an extra-whitespace character, define this macro to that
116     character. For instance, define this to 160 on an ISO-8859-1 system
117     if you want the 'unbreakable space' to be considered as whitespace.
118  SEMPER_FIDELIS
119     With this option set, ucpp, when used as a lexer, will pass
120     whitespace tokens to its caller, and those tokens will have their
121     true content; this is intended for reconstruction of the source
122     line. Beware that some comments may have embedded newlines.
123  COPY_LINE_LENGTH
124     ucpp can maintain a copy of the current source line, up to that
125     length. Irrelevant to stand-alone version.
126  *_MEMG
127     Those settings modify ucpp behaviour, wrt memory allocations. With
128     higher values, ucpp will perform fewer malloc() calls and will run
129     faster, but it will use more memory. Reduce INPUT_BUF_MEMG and
130     OUTPUT_BUF_MEMG on low-memory systems, if you kept ucpp buffering
131     (see NO_UCPP_BUF option).
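
The PRAGMA_TOKENIZE layout described above can be walked with a loop
such as the one below. This is only a minimal sketch: the token codes
(TOK_NAME and friends) and the is_string_token() predicate are
hypothetical stand-ins for the real token types and the STRING_TOKEN
macro from the ucpp headers, and the input array is assumed to be well
formed. Only the layout itself (a type byte, an optional value ended by
PRAGMA_TOKEN_END, and a final zero byte) is taken from the description.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: a newline, as in the default setting described above. */
#define PRAGMA_TOKEN_END ((unsigned char)'\n')

/* Hypothetical token codes; the real ones come from the ucpp headers. */
enum { TOK_NAME = 1, TOK_NUMBER = 2, TOK_LPAR = 3, TOK_RPAR = 4 };

/* Hypothetical stand-in for the STRING_TOKEN macro. */
static int is_string_token(int type)
{
	return type == TOK_NAME || type == TOK_NUMBER;
}

/*
 * Walk a tokenized pragma and return the number of tokens in it.
 * *value_bytes receives the total length of all string values.
 */
static size_t count_pragma_tokens(const unsigned char *p,
	size_t *value_bytes)
{
	size_t count = 0;

	assert(p != NULL && value_bytes != NULL);
	*value_bytes = 0;
	for (; *p != 0; p++) {
		int type = *p;

		count++;
		if (is_string_token(type)) {
			/* the value follows the type byte */
			for (p++; *p != PRAGMA_TOKEN_END; p++)
				(*value_bytes)++;
			/* p is now on PRAGMA_TOKEN_END; the outer
			   loop increment steps past it */
		}
	}
	return count;
}
```

With these hypothetical codes, a pragma such as "pack(4)" would be
stored as TOK_NAME, 'p', 'a', 'c', 'k', newline, TOK_LPAR, TOK_NUMBER,
'4', newline, TOK_RPAR, 0.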
132
1333. Edit the Makefile. You should define the variables CC and FLAGS;
134   there are the following options:
135
136  -DAUDIT
137     Enable internal sanity checks; this slows ucpp down a bit. Do not
138     define unless you plan to debug ucpp.
139  -DMEM_CHECK
140     With this setting, ucpp will check for the return value of malloc()
141     and exit with a diagnostic when out of memory. MEM_CHECK is implied
142     by AUDIT.
143  -DMEM_DEBUG
144     Enable memory debug code. This will track memory leaks and several
145     occurrences of memory management errors; it will also slow down
146     things and increase memory consumption, so you probably do not
147     want to use this option.
148  -DINLINE=foobar
149     The ucpp code uses the "inline" qualifier for some functions; by
150     default, that qualifier is macro-replaced with nothing. Define
151     INLINE to the correct replacement for your compiler, if supported.
152     Note that all "inline" functions in ucpp are also "static". For any
153     C99-compliant compiler, the GNU compiler (gcc), and the Compaq C
154     compiler under Linux/Alpha, no -DINLINE is needed (see tune.h for
155     details).
156
1574. Compile by typing "make". This should produce the ucpp executable
158   file. You might see some warning messages, especially with gcc:
159   gcc believes some variables might be used prior to their
160   initialization; ignore those messages.
161
1625. Install wherever you want the binary and the man page ucpp.1. I
163   have not provided an install sequence because I didn't bother.
164
1656. If you do not have the make utility, compile each file separately
166   and link them together. The exact details depend on your compiler.
167   You must define the macro STAND_ALONE when compiling cpp.c (there
168   is such a definition, commented out, in cpp.c, line 34).
169
170There is no "configure" script because:
171-- I do not like the very idea of a "configure" script.
172-- ucpp is written in ANSI-C and should be fairly portable.
173-- There is no such thing as "standard" settings for a C preprocessor.
174   The predefined system macros, standard assertions,... must be tuned
175   by the sysadmin.
176-- The primary goal of ucpp is to be included in compilers. The
177   stand-alone version is mainly a debugging tool.
178
179Please note that you need an ISO-C90 (formerly ANSI) C compiler suite
180(including the standard library) to compile ucpp. If your compiler is
181not C99 (or later), read the cross-compilation section in this README
182file.
183
184The C90 standard states that external linkage names might be
185considered equal or different based upon only their first 6 characters
186(C99 raised that limit to 31); this rule might make ucpp not compile on
187a conformant C90 implementation. I have yet to see one, however.
188
189If you want to use ucpp as an integrated preprocessor and lexer, see the
190section REUSE. Compiling ucpp as a library is an exercise left to the
191reader.
192
193With the LOW_MEM code enabled, ucpp can run on a Minix-i86 or Msdos
19416-bit small-memory-model machine. On such an architecture it will
195not be fully compliant with C99, since C99 states that at least one
196source code with 4095 simultaneously defined macros must be processed;
197ucpp will be limited to about 1500 macros (at most) due to memory
198restrictions. At least ucpp can preprocess its own code in these
199conditions. LOW_MEM is on by default because it seems to improve
200performance on large systems.
201
202
203
204LICENSE
205-------
206
207The copyright notice and license is at the beginning of the Makefile and
208each source file. It is basically a BSD license, without the advertising
209subclause (which BSD dropped recently anyway) and with no reference to
210Berkeley (since the code is all mine, written from scratch). Informally,
211this means that you can reuse and redistribute the code as you want,
212provided that you state in the documentation (or any substantial part of
213the software) of redistributed code that I am the original author. (If
214you press a cdrom with 200 software packages, I do not insist on having
215my name on the cover of the cdrom -- just keep a Readme file somewhere
216on the cdrom, with the copyright notice included.)
217
218As a courteous gesture, if you reuse my code, please drop me a mail.
219It raises my self-esteem.
220
221
222
223REUSE
224-----
225
226The code was designed as part of a bigger project; it might be
227used as an integrated lexer, that will read files, process them as a
228C preprocessor, and output a stream of C tokens. To include this code
229into a project, compile with STAND_ALONE undefined.
230
231To use the preprocessor and lexer, several steps should be performed.
232See the file 'sample.c' for an example.
233
2341. call init_cpp(). This function initializes the lexer automaton.
235
2362. set the following global variables:
237	no_special_macros
238		non-zero if the special macros (__FILE__ and others)
239		should not be defined. This is a global flag since
240		it affects the redefinition of such macros (which is
241		allowed if the special macros are not defined)
242	c99_compliant
243		if non-zero, define __STDC_VERSION__ to 199901L; this
244		is the default; otherwise, do not define __STDC_VERSION__.
245		Note that ucpp will agree to undefine __STDC_VERSION__
246		with a #undef directive.
247	c99_hosted
248		if strictly positive, define __STDC_HOSTED__ to 1.
249		If zero, define __STDC_HOSTED__ to 0. If negative,
250		do not define __STDC_HOSTED__. The default is 1.
251	emit_defines and emit_assertions should be set to 0 for
252	step 3.
253
2543. call init_tables(). This function initializes the macro table
255   and other things; it will initialize assertions if it has a non-zero
256   argument.
257
2584. call init_include_path(). This function will reset the include
259   path to the list of paths given as argument.
260
2615. set the following global variables
262	emit_dependencies
263		set to 1 if dependencies should be emitted during
264		preprocessing
265		set to 2 if dependencies should also be emitted for
266		system include files
267	emit_defines
268		set to non-zero if #define macro definitions should be
269		emitted when macros are defined
270	emit_assertions
271		set to non-zero if #assert directives should be
272		emitted when assertions are made
273	emit_output
274		the FILE * where the above items are sent if one of the
275		three emit_ variables is set to non zero
276	transient_characters
277		this is for some cross-compilation; see the relevant
278		part in this README file for details
279
2806. call set_init_filename() with the initial filename as argument;
281   the second argument indicates whether the filename is real or
282   conventional ("real" means "an fopen() on it will work").
283
2847. initialize your struct lexer_state:
285	call init_lexer_state()
286	call init_lexer_mode() if the preprocessor is supposed to
287	   output a list of tokens, otherwise set the flags field
288	   to DEFAULT_CPP_FLAGS and set the output field to the
289	   FILE * where output should be sent
290	(init_lexer_mode(), if called at all, must be called after
291	 init_lexer_state())
292	adjust the flags field; here is the meaning of flags:
293
294WARN_STANDARD
295	emit the standard warnings
296WARN_ANNOYING
297	emit the useless and annoying warnings
298WARN_TRIGRAPHS
299	count trigraphs encountered; it is up to the caller to emit
300	a warning if some trigraphs were indeed encountered; the count
301	is stored in the count_trigraphs field of the struct lexer_state
302WARN_TRIGRAPHS_MORE
303	emit a warning for each trigraph encountered
304WARN_PRAGMA
305	emit a warning for each non-void _Pragma encountered in non-lexer
306	mode (because these are dumped as #pragma in the output) and for each
307	#pragma too, if ucpp was compiled without PRAGMA_DUMP
308FAIL_SHARP
309	emit errors on '#' tokens beginning a line and not followed
310	by a valid cpp directive
311CCHARSET
312	emit errors when non-C characters are encountered; if this flag
313	is not set, each non-C character will be considered as a BUNCH
314	token (since C99 states that non-C characters are allowed as
315	long as they "disappear" during preprocessing [through macro
316	replacement and stringification for instance], this flag must
317	not be set, for maximum C99 compliance)
318DISCARD_COMMENTS
319	do not keep comments in output (irrelevant in lexer mode)
320CPLUSPLUS_COMMENTS
321	understand new style comments (//) (mandatory for C99)
322LINE_NUM
323	emit #line directives when entering a file, if not in lexer mode;
324	emit CONTEXT token in lexer mode for #line and new files
325GCC_LINE_NUM
326	if LINE_NUM is set, emit gcc-like directives instead of #line
327HANDLE_ASSERTIONS
328	understand assertions in #if expressions (and #assert, #unassert)
329HANDLE_PRAGMA
330	make PRAGMA tokens for #pragma; irrelevant in non-lexer mode
331	(handling of some pragmas is required in C99 but is not the
332	responsibility of the preprocessor; without this flag, ucpp will
333	ignore the contents of #pragma and _Pragma directives)
334MACRO_VAARG
335	understand macros with a variable number of arguments (mandatory
336	for C99)
337UTF8_SOURCE
338	understand UTF-8 encoding: multibyte characters are considered
339	equivalent to letters as far as syntax is concerned (they can
340	be used in identifiers)
341LEXER
342	act as a lexer, outputting tokens
343TEXT_OUTPUT
344	this flag should be set to 0 if ucpp works as a lexer, 1 otherwise.
345	It is somewhat redundant with the LEXER flag, but the presence of
346	those two different flags is needed in ucpp.
347KEEP_OUTPUT
348	in non-lexer mode, emit the result of preprocessing
349COPY_LINE
350	maintain a copy of the last read line in the copy_line field of
351	the struct lexer_state ; see below for how to use this buffer
352HANDLE_TRIGRAPHS
353	understand trigraphs, such as ??/ for \. This option should be
354	set by default, except for some legacy code.
355
356	There are other flags, but they are for private usage of ucpp.
357
3588. adjust the input field in the lexer_state to the FILE * from which
359   the source file is read. If you use the UCPP_MMAP compile-time option,
360   and your input file is eligible to mmap(), then you can call
361   fopen_mmap_file() to open it, then set_input_file() to set ls->input
362   and some other internal options. Do not call set_input_file() unless
363   you called fopen_mmap_file() on the same file just before.
364
3659. call add_incpath() to add an include path, define_macro() and
366   undef_macro() to add or remove macros, make_assertion() and
367   destroy_assertion() to add or remove assertions.
368
36910. call enter_file() (this is needed only in non-lexer mode, or if
370    LINE_NUM is set).
371
372
373Afterwards:
374
375-- if you are in lexer mode, call lex(); each call will make the ctok
376   field point to the next token. A non-zero return value is an error.
377   lex() skips whitespace tokens. The memory used by the string value
378   of some tokens (identifiers, numbers...) is automatically freed,
379   so copy the contents of each such token if you want to keep it
380   (tokens with a string content are identified by the STRING_TOKEN
381   macro applied to their type).
382   When lex() returns a non-zero value: if it is CPPERR_EOF, then
383   end-of-input was reached. Otherwise, it is a genuine error and
384   ls->ctok is an undefined token; skip it and call lex() again to
385   ignore the error.
386
387-- otherwise, call cpp(); each call will analyze one or more tokens
388   (one token if it found neither a cpp directive nor a macro name).
389   A positive return value is an error.
390
391For both functions, if the return value is CPPERR_EOF (which is a
392strictly positive value), then it means that the end of file was
393reached. Call check_cpp_errors() after end of file for pending errors
394(unfinished #if constructions for instance). In non-lexer mode,
395call flush_output().
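
The lexer-mode loop described above (stop on the end-of-input code,
skip genuine errors, copy token strings before the next call) can be
sketched as follows. To stay self-contained, this sketch does not call
ucpp at all: mock_lex() and struct mock_lexer are hypothetical
stand-ins for lex() and struct lexer_state, and MOCK_EOF stands in for
CPPERR_EOF with an arbitrary value. See sample.c for the real calls.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical value; the real CPPERR_EOF comes from the ucpp headers. */
#define MOCK_EOF 1000

/* Mock token source standing in for a struct lexer_state. */
struct mock_lexer {
	const char *const *words; /* NULL-terminated; "" simulates an error */
	size_t pos;
	char text[32];            /* overwritten on every call, standing in
	                             for token memory that ucpp frees
	                             automatically */
};

/* Mock of lex(): 0 on success, MOCK_EOF at end, 1 on error. */
static int mock_lex(struct mock_lexer *ml)
{
	const char *w = ml->words[ml->pos];

	if (w == NULL) return MOCK_EOF;
	ml->pos++;
	if (w[0] == '\0') return 1;
	strncpy(ml->text, w, sizeof ml->text - 1);
	ml->text[sizeof ml->text - 1] = '\0';
	return 0;
}

/*
 * Drain the token source. Each token string is copied with malloc()
 * because the source reuses its storage. Returns the number of copies
 * stored in out[] (at most max); *errors counts the skipped errors.
 */
static size_t collect_tokens(struct mock_lexer *ml, char **out,
	size_t max, size_t *errors)
{
	size_t n = 0;
	int r;

	assert(ml != NULL && out != NULL && errors != NULL);
	*errors = 0;
	while ((r = mock_lex(ml)) != MOCK_EOF) {
		if (r != 0) {
			/* genuine error: no usable token, keep going */
			(*errors)++;
			continue;
		}
		if (n < max) {
			char *copy = malloc(strlen(ml->text) + 1);

			if (copy == NULL) break;
			strcpy(copy, ml->text);
			out[n++] = copy;
		}
	}
	return n;
}
```

The caller owns the copied strings and should free() them when done.
In real code, the end of the loop is also the point where
check_cpp_errors() would be called.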
396
397In the struct lexer_state, the following fields might be read:
398	line		   the current input line number
399	oline		   the current output line number (in non-lexer mode)
400	flags		   the flags described above
401	count_trigraphs	   the number of trigraphs encountered
402	inmacro		   the current token comes from a macro
403	macro_count	   the current macro counter
404"flags" is an unsigned long and might be modified; line, oline and
405count_trigraphs are of long type.
406
407
408To perform another preprocessing: use free_lexer_state() to release
409memory used by the buffers referenced in lexer_state, and go back to
410step 2. The different tables (macros, assertions...) should be reset to
411their respective initial contents.
412
413There is also the wipeout() function: when called, it should release
414(almost) all memory blocks allocated dynamically. After a wipeout(),
415ucpp should be back to its state at step 2 (init_cpp() initializes only
416static tables, that are never freed nor modified afterwards).
417
418
419The COPY_LINE buffer: the struct lexer_state contains two interesting
420fields, copy_line[] and cli. If the COPY_LINE flag is on, each read
421line is stored in this buffer, up to (at most) COPY_LINE_LENGTH - 1
422characters (COPY_LINE_LENGTH is defined in tune.h). The last character
423of the buffer is always a zero, and if the line was read entirely, it is
424zero terminated; the trailing newline is not included.
425
426The purpose of this buffer is error-reporting. When an error occurs
427(cpp() returns a strictly positive value, or lex() returns a non-zero
428value), if your struct lexer_state is called ls, use this code:
429
430	if (ls.cli != 0) ls.copy_line[ls.cli] = 0;
431
432This will add a trailing 0 if the line was not read entirely.
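
The effect of that fix-up can be seen on a small mock-up. The struct
below is not the real struct lexer_state: it only mimics the two
fields, COPY_LINE_LENGTH is given an arbitrary value, and the meaning
assumed for cli (index of the next free slot, zero once a line is
complete) is an assumption made for this sketch.

```c
#include <assert.h>
#include <string.h>

#define COPY_LINE_LENGTH 80   /* assumption; the real value is in tune.h */

/* Mock of the two fields discussed above. */
struct mock_state {
	unsigned char copy_line[COPY_LINE_LENGTH];
	long cli;   /* assumed: next free index, 0 when the line is complete */
};

/* Simulate reading part of a line, with no newline seen yet. */
static void mock_read_partial(struct mock_state *ls, const char *text)
{
	size_t len = strlen(text);

	assert(len < COPY_LINE_LENGTH);
	memcpy(ls->copy_line, text, len);
	ls->cli = (long)len;
}

/* Apply the fix-up quoted above and return the line as a C string. */
static const char *error_line(struct mock_state *ls)
{
	if (ls->cli != 0) ls->copy_line[ls->cli] = 0;
	return (const char *)ls->copy_line;
}
```

Without the fix-up, a partially read line would be followed by
whatever the buffer held before, which makes for confusing error
messages.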
433
434You can disable the COPY_LINE buffer by defining NO_UCPP_COPY_LINE
435(in tune.h, for example). This will make the code slightly faster.
436
437
438ucpp may be configured at runtime to accept alternate characters as
439possible parts of identifiers. Typical intended usage is for the '$'
440and '@' characters. The two relevant functions are set_identifier_char()
441and unset_identifier_char(). When this call is issued:
442	set_identifier_char('$');
443then for all the remaining input, the '$' character will be considered
444as just another letter, as far as identifier tokenizing is concerned. This
445is for identifiers only; numeric constants are not modified by that setting.
446This call resets things back:
447	unset_identifier_char('$');
448Those two functions modify the static table which is initialized by
449init_cpp(). You may call init_cpp() at any time to restore the table
450to its standard state.
451
452When using this feature, pay attention to the following points:
453
454-- Do NOT use a character whose numeric value (as an `unsigned char'
455cast into an `int') is greater than or equal to MAX_CHAR_VAL (in tune.h).
456This would lead to unpredictable results, including an abrupt crash of
457ucpp. ucpp makes absolutely no check whatsoever on that matter: this is
458the programmer's responsibility.
459
460-- If you use a standard character such as '+' or '{', tokens which
461begin with those characters cease to exist. This can be troublesome.
462If you use set_identifier_char() on the '<' character, the handling of
463#include directives will be greatly disturbed. Therefore the use of any
464standard C character in set_identifier_char() or unset_identifier_char()
465is declared unsupported, forbidden and altogether unwise.
466
467-- Stricto sensu, when an extra character is declared as part of an
468identifier, ucpp behaviour ceases to conform to C99, which mandates that
469characters such as '$' or '@' must be treated as independent tokens of
470their own. Therefore, if your purpose is to use ucpp in a conformant
471C implementation, the use of set_identifier_char() should be made at
472least a runtime option.
473
474-- When enabling a new character in the middle of a macro replacement,
475the effect of that replacement may be delayed up to the end of that
476macro (but this is a "may" !). If you wish to trigger this feature with
477a custom #pragma or _Pragma(), you should remember it (for instance,
478using _Pragma() in a macro replacement, and then the extra character
479in the same macro replacement, is not reliable).
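
Since ucpp performs no check on the character it is given, a caller
may want to guard the call itself. The sketch below is one possible
guard, not part of ucpp: is_safe_extra_identifier_char() and
set_if_safe() are hypothetical helpers, MAX_CHAR_VAL is given an
assumed value of 128, and the setter is passed as a function pointer
(for instance a thin wrapper around set_identifier_char()) so that no
particular ucpp signature is assumed. mock_set_char() only records its
argument.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define MAX_CHAR_VAL 128   /* assumption; use the value from your tune.h */

/*
 * Return 1 if c is below MAX_CHAR_VAL and is not a character that
 * standard C already gives a meaning to; return 0 otherwise.
 */
static int is_safe_extra_identifier_char(char c)
{
	int v = (unsigned char)c;

	if (v == 0 || v >= MAX_CHAR_VAL) return 0;
	if (isalnum(v) || isspace(v) || iscntrl(v)) return 0;
	/* punctuation used by C tokens */
	if (strchr("_!\"#%&'()*+,-./:;<=>?[\\]^{|}~", v) != NULL) return 0;
	return 1;
}

/* Mock setter used for demonstration; it only records its argument. */
static int mock_last_char = -1;
static void mock_set_char(int c) { mock_last_char = c; }

/* Call set_char only when the character passes the check above. */
static int set_if_safe(char c, void (*set_char)(int))
{
	assert(set_char != NULL);
	if (!is_safe_extra_identifier_char(c)) return 0;
	set_char((unsigned char)c);
	return 1;
}
```

With an ASCII character set, '$' and '@' pass the check, while '<',
'+' and any character with a value of 128 or more are refused.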
480
481
482
483REENTRANT API
484-------------
485
486You can build ucpp with UCPP_REENTRANT defined if you plan to use
487the ucpp lexer in a multithreaded application.
488
489See the file 'sample_r.c' for an example.
490
491When using the reentrant API, you can create multiple preprocessor
492objects. A new object is created using new_cpp(). It should finally
493be destroyed using del_cpp(). Furthermore, most API functions expect
494a pointer to a valid preprocessor object as their first argument.
495These functions are:
496
497	add_incpath()
498	check_cpp_errors()
499	cpp()
500	destroy_assertion()
501	enter_file()
502	flush_output()
503	fopen_mmap_file()
504	init_assertions()
505	init_cpp()
506	init_include_path()
507	init_tables()
508	define_macro()
509	undef_macro()
510	lex()
511	make_assertion()
512	print_assertions()
513	report_context()
514	set_identifier_char()
515	set_init_filename()
516	set_input_file()
517	unset_identifier_char()
518	init_macros()
519	print_defines()
520	is_macro_defined()
521	get_macro_definition()
522	iterate_macros()
523	wipeout()
524
525Additionally, ucpp_ouch(), ucpp_error() and ucpp_warning() are no
526longer global functions. They can also be defined separately for
527each preprocessor object:
528
529	struct CPP *cpp;
530	cpp = new_cpp();
531	cpp->ucpp_ouch = my_ouch_func;
532	/* ... */
533	del_cpp(cpp);
534
535Each of these functions receives a pointer to the corresponding
536preprocessor object as its first parameter. There is also a
537callback_arg member defined in the preprocessor object structure
538which can be used to pass an additional pointer to the callback
539function.
540
541If you additionally define UCPP_CLONE, you can also clone an
542existing preprocessor object:
543
544	clone = clone_cpp(original);
545
546The cloned object will be identical to the original object, except
547for its internal lexer states, which means you cannot clone a
548preprocessor object while it is preprocessing source code.
549
550
551
552COMPATIBILITY NOTES
553-------------------
554
555The C language has a lengthening history. Nowadays, C comes in three
556flavours:
557
558-- Traditional C, aka "K&R". This is the language first described by
559Brian Kernighan and Dennis Ritchie, and implemented in the first C
560compiler that was ever coded. There are actually several dialects of
561K&R, and all of them are considered deprecated.
562
563-- ISO 9899:1990, aka C90, aka C89, aka ANSI-C. Formalized by ANSI
564in 1989 and adopted by ISO the next year, it is the C flavour many C
565compilers understand. It is mostly backward compatible with K&R C, but
566with enhancements, clarifications and several new features.
567
568-- ISO 9899:1999, aka C99. This is an evolution on C90, almost fully
569backward compatible with C90. C99 introduces many new and useful
570features, however, including in the preprocessor.
571
572There was also a normative addendum in 1995, that added a few features
573to C90 (for instance, digraphs) that are also present in C99. It is
574sometimes referred to as "C95" or "AMD 1".
575
576
577ucpp implements the C99 standard, but can be used in a stricter mode,
578to enforce C90 compatibility (it will, however, still recognize some
579constructions that are not in plain C90).
580
581ucpp also knows about several extensions to C99:
582
583-- Assertions: this is an extension to the defined() operator, with
584   its own namespace. Assertions seem to be used in several places,
585   therefore ucpp knows about them. It is recommended to enable
586   assertions by default on Solaris systems.
587-- Unicode: the C99 norm specifies that extended characters, from
588   the ISO-10646 charset (aka "unicode") can be used in identifiers
589   with the notations \u and \U. ucpp also accepts (with the proper
590   flag) the UTF-8 encoding in the source file for such characters.
591-- #include_next directive: it works as a #include, but will look
592   for files only in the directories specified in the include path
593   after the one in which the current file was found. This is a GNU-ism that
594   is useful for writing transparent wrappers around header files.
595
596Assertions and unicode are activated by specific flags; the #include_next
597support is always active.
598
599The ucpp code itself should be compatible with any ISO-C90 compiler.
600The cpp.c file is rather big (~ 64kB), it might confuse old 16-bit C
601compilers; the macro.c file is somewhat large also (~ 47kB).
602
603The evaluation of #if expressions is subject to some subtleties, see the
604section "cross-compilation".
605
606The lexer code makes no assumption about the source character set, but
607the following: source characters (those which have a syntactic value in
608C; comment and string literal contents are not concerned) must have a
609strictly positive value that is strictly lower than MAX_CHAR_VAL. The
610strict positivity is already assured by the C standard, so you just need
611to adjust MAX_CHAR_VAL.
612
613ucpp has been tested successfully on ASCII/ISO-8859-1 and EBCDIC systems.
614Beware that UTF-8 is NOT compatible with EBCDIC.
615
616Pragma handling: when used in non-lexer mode, ucpp tries to output a
617source text that, when read again, will yield the exact same stream of
618tokens. This is not completely true with regards to line numbering in
619some tricky macro replacements, but it should work correctly otherwise,
620especially with pragma directives if the compile-time option PRAGMA_DUMP
621was set: #pragma are dumped, non-void _Pragma() are converted to the
622corresponding #pragma and dumped also.
623
624ucpp does not macro-replace the contents of #pragma and _Pragma();
625if you want a macro-replaced pragma, use this:
626
627#define pragma_(x)	_Pragma(#x)
628#define pragma(x)	pragma_(x)
629
630Anyway, pragmas do not nest (a _Pragma() cannot be evaluated if it is
631inside a #pragma or another _Pragma).
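
The two-level definition works because a macro argument is fully
macro-replaced before substitution, unless it is an operand of # or
##. The sketch below makes this visible at run time. The text_() and
text() macros, the ALIGNMENT macro and the helper functions are
hypothetical names introduced for illustration; they mirror pragma_()
and pragma() but yield the string that _Pragma() would receive instead
of issuing a pragma.

```c
#include <assert.h>
#include <string.h>

/* The two-level idiom from the text above. */
#define pragma_(x)	_Pragma(#x)
#define pragma(x)	pragma_(x)

/* Same shape, but yielding the string instead of a pragma. */
#define text_(x)	#x
#define text(x)		text_(x)

#define ALIGNMENT	4

/* One level: the argument is stringified without being replaced. */
static const char *one_level(void)
{
	return text_(pack(ALIGNMENT));
}

/* Two levels: ALIGNMENT is replaced by 4 before stringification. */
static const char *two_levels(void)
{
	return text(pack(ALIGNMENT));
}

/* Return non-zero if needle occurs in s. */
static int mentions(const char *s, const char *needle)
{
	assert(s != NULL && needle != NULL);
	return strstr(s, needle) != NULL;
}
```

Here one_level() still contains the name ALIGNMENT, whereas
two_levels() contains its value; pragma(pack(ALIGNMENT)) hands the
latter form to _Pragma().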
632
633
634I wrote ucpp according to what is found in "The C Programming Language"
635by Brian Kernighan and Dennis Ritchie (2nd edition) and the C99
636standard; but I could have misinterpreted some points. On some tricky
637points I got help from the helpful people from the comp.std.c newsgroup.
638For assertions and #include_next, I mimicked the behaviour of GNU cpp,
639as is stated in the GNU cpp info documentation. An open question is
640related to the following code:
641
642#define undefined	!
643#define makeun(x)	un ## x
644#if makeun(defined foo)
645qux
646#else
647bar
648#endif
649
650ucpp will replace 'defined foo' with 0 first (since foo is not defined),
651then it will replace the macro makeun, and the expression will become
652'un0', which is replaced by 0 since this is a remaining identifier. The
653expression evaluates to false, and 'bar' is emitted.
654However, some other preprocessors will replace makeun first, considering
655that it is not part of a 'defined' operator application; this will
656produce the macro 'undefined', which is replaced, and the expression
657becomes '!foo'. 'foo' is replaced by 0, the expression evaluates to
658true, and 'qux' is emitted.
659
660My opinion is that the behaviour is undefined, because use of the
661'defined' operator does not match an allowed form prior to macro
662replacement (I mean, its syntax matches, but that use is then made
663nonexistent and therefore no longer matches). Other people
664think that the behaviour is well-specified, and contrary to what ucpp
665does. The only thing clear to me is that the wording of the standard
666(paragraph 6.10.1.3) is unclear.
667
668Since the ucpp behaviour makes ucpp code simpler and cleaner, and
669since it is unlikely that any real-life code would ever be disturbed
670by that interpretation of the standard, ucpp will keep its current
671behaviour until convincing evidence of my misinterpretation of the
672standard is given to me. The problem can only occur if one uses ## to
673make a 'defined' operator disappear from a #if expression (everybody
674agrees that the generation of a 'defined' operator triggers undefined
675behaviour).
676
677
678Another point about macro replacement has been discussed at length on
679several occasions. It is about the following code:
680
681#define CAT(a, b)    CAT_(a, b)
682#define CAT_(a, b)   a ## b
683#define AB(x, y)     CAT(x, y)
684CAT(A, B)(X, Y)
685
686ucpp will produce `CAT(X,Y)' as replacement for the last line, whereas
687some other preprocessors output `XY'. The answer to the question
688"which behaviour is correct" seems to be "this is not defined by the
689C standard". It is the answer that has been actually given by the C
690standardization committee in 1992, to the defect report #017, question
69123, which asked that very same question. Since the wording of the
692standard has not changed in these parts from the 1990 to the 1999
693version, the preprocessor behaviour on the above-stated code should
694still be considered as undefined.
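
To see which of the two behaviours a given preprocessor exhibits, the
macros above can be wrapped in a stringification helper. The following
is only a sketch: STR, STR_, cat_expansion() and is_known_outcome() are
names made up for this illustration and are not part of ucpp. Since the
behaviour is undefined, outcomes other than the two discussed here
cannot be ruled out.

```c
#include <string.h>

#define CAT(a, b)    CAT_(a, b)
#define CAT_(a, b)   a ## b
#define AB(x, y)     CAT(x, y)

/* Hypothetical helpers: STR() expands its argument, then stringifies it. */
#define STR_(x)      #x
#define STR(x)       STR_(x)

/* Returns the text this preprocessor produced for CAT(A, B)(X, Y). */
static const char *cat_expansion(void)
{
    return STR(CAT(A, B)(X, Y));
}

/* Returns 1 if s, ignoring spaces, is "XY" or "CAT(X,Y)", else 0. */
static int is_known_outcome(const char *s)
{
    char buf[32];
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (*s == ' ')
            continue;
        if (n + 1 >= sizeof buf)
            return 0;           /* too long to be either outcome */
        buf[n++] = *s;
    }
    buf[n] = '\0';
    return strcmp(buf, "XY") == 0 || strcmp(buf, "CAT(X,Y)") == 0;
}
```

Printing the string returned by cat_expansion() shows what the
compiler's own preprocessor does with this input; is_known_outcome()
merely tells whether it is one of the two results discussed above.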

It seems, however, that there used to be a time (around 1988) when the
committee members agreed upon a precise macro-replacement algorithm,
which specified quite clearly the preprocessor behaviour in such
situations. ucpp behaviour is occasionally claimed to be "incorrect"
with regard to that algorithm. Since that macro replacement algorithm
has never been published, and the committee itself backed away from it
in 1992, I decided to disregard those feeble claims.

It is possible, however, that at some point in the future I will rewrite
the ucpp macro replacement code, since that code is a bit messy and might
be made to use less memory on some occasions. It is then possible that,
in the aftermath of such a rewrite, the ucpp behaviour for the
above-stated code will become tunable. Don't hold your breath, though.


About _Pragma: the standard is not clear about when this operator is
evaluated, or whether it is allowed inside #if directives and such. For
ucpp, I coded _Pragma as a special macro with lazy replacement: it will
be evaluated wherever a macro could be replaced, and only at the end of
the macro replacement (for practical purposes, _Pragma can be considered
as a macro taking one argument, and being replaced by nothing, except
for some tricky uses of the # and ## operators). This means that, by
default, ucpp will evaluate _Pragma inside some directives (mainly, #if,
#include, #include_next and #line), but it can be taught not to do so by
defining NO_PRAGMA_IN_DIRECTIVE in tune.h.



CROSS-COMPILATION
-----------------

If compiled with a C99 development suite, ucpp should be fully
C99-compliant on the host platform (to the best of my understanding of
the standard -- remember that this software is distributed as-is, without
any guarantee). However, if a pre-C99 compiler is used, or if the
target machine is not the host machine (for instance when you build a
cross-compiler), the evaluation of #if expressions is subject to some
cross-compiling issues:


-- character constants: when evaluating expressions, character constants
are interpreted in the source character set context; this is allowed
by the standard but this can lead to problems with code that expects
this interpretation to match the one made in the C code. To ease
cross-compilation, you can define a conversion array, and make the
global variable transient_characters point to it. The array should
contain 256 ints; transient_characters[x] is the converted value of the
character whose value is x in the source character set.

This facility is provided for inclusion of ucpp inside another code;
if you want a stand-alone ucpp with that conversion, hard-code the
conversion table into eval.c and make transient_characters[] statically
point to it. Alternatively, you could add a command-line option to
supply such a table, if you feel like it.
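
As an illustration, here is a sketch of such a conversion array for an
ASCII source character set and an EBCDIC target. It covers only the
space, the digits and the unaccented letters; every other entry is left
as the identity, which a real table would not do. The function names
are invented for this example, and the declaration of
transient_characters as an `int *' is an assumption drawn from the
description above; check the ucpp sources for the actual declaration.

```c
/* Stand-in for the ucpp global; assumed here to be a pointer to int.
   When linking against ucpp, use the real variable instead. */
static int *transient_characters = 0;

static int ascii_to_ebcdic[256];

/* Map n consecutive source codes starting at src to the n
   consecutive target codes starting at dst. */
static void map_run(int *table, int src, int dst, int n)
{
    int i;

    for (i = 0; i < n; i++)
        table[src + i] = dst + i;
}

/* Hypothetical helper: fill a 256-entry ASCII-to-EBCDIC table. */
static void fill_ascii_to_ebcdic(int *table)
{
    int i;

    for (i = 0; i < 256; i++)
        table[i] = i;                /* identity placeholder */
    map_run(table, 0x20, 0x40, 1);   /* space    */
    map_run(table, 0x30, 0xF0, 10);  /* '0'..'9' */
    map_run(table, 0x41, 0xC1, 9);   /* 'A'..'I' */
    map_run(table, 0x4A, 0xD1, 9);   /* 'J'..'R' */
    map_run(table, 0x53, 0xE2, 8);   /* 'S'..'Z' */
    map_run(table, 0x61, 0x81, 9);   /* 'a'..'i' */
    map_run(table, 0x6A, 0x91, 9);   /* 'j'..'r' */
    map_run(table, 0x73, 0xA2, 8);   /* 's'..'z' */
}

/* Build the table and make the global point to it. */
static void install_table(void)
{
    fill_ascii_to_ebcdic(ascii_to_ebcdic);
    transient_characters = ascii_to_ebcdic;
}
```

The source codes are written as numbers rather than as character
constants so that the table does not depend on the character set of the
machine that compiles it.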


-- wide character constants signedness: by default, ucpp gives wide
characters the same signedness as plain chars on the build host. To
force wide character constant signedness, define WCHAR_SIGNEDNESS to 0
(for unsigned) or 1 (for signed). Beware, however, that "native" wide
character constants, even signed, are considered positive. Non-wide
character constants are, according to the C99 standard, of type int, and
therefore always signed.


-- evaluation type: C90 states that all constants in #if expressions
are considered as either long or unsigned long, and that the evaluation
is performed with operands of that size. In C99, the situation is
equivalent, except that the types used are intmax_t and uintmax_t, as
defined in <stdint.h>.
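
The evaluation type of the preprocessor at hand can be probed from C
code. The sketch below relies on unsigned wraparound: in a C99
preprocessor, `0u - 1' is computed in uintmax_t and therefore equals
UINTMAX_MAX, whereas a C90 preprocessor computes it in unsigned long.
The macro and function names are made up for this example.

```c
#include <limits.h>
#include <stdint.h>

/* 0u - 1 wraps around to the largest value of the evaluation type. */
#if (0u - 1) == UINTMAX_MAX
#define PP_EVAL_IS_UINTMAX 1
#else
#define PP_EVAL_IS_UINTMAX 0
#endif

#if (0u - 1) == ULONG_MAX
#define PP_EVAL_IS_ULONG 1
#else
#define PP_EVAL_IS_ULONG 0
#endif

/* Report what the preprocessor decided when this file was compiled. */
static int pp_eval_is_uintmax(void) { return PP_EVAL_IS_UINTMAX; }
static int pp_eval_is_ulong(void)   { return PP_EVAL_IS_ULONG; }
```

Both probes report 1 on platforms where unsigned long and uintmax_t
have the same width, so the second one is only informative when the two
types differ.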

ucpp can use two expression evaluators: one uses native integer types
(one signed and one unsigned), the other evaluator emulates big integer
numbers by representing them with two values of some unsigned type. The
emulated type handles signed values in two's complement representation,
and can be any width ranging from 2 bits to twice the size of the
underlying native unsigned type used. An odd width is allowed. When
right shifting an emulated signed negative value, it is left-padded with
bits set to 1 (this is sign extension).
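
The sign-extending right shift can be sketched as follows for one fixed
case: a 64-bit signed value held in two 32-bit unsigned halves. This is
a simplified illustration, not ucpp's actual code; ucpp supports
arbitrary widths, and the type emu64 and the function emu64_sar() are
names invented here.

```c
#include <stdint.h>

/* A 64-bit two's complement value stored as two unsigned halves. */
typedef struct {
    uint32_t hi;   /* upper 32 bits, including the sign bit */
    uint32_t lo;   /* lower 32 bits */
} emu64;

/* Right shift by n bits, left-padding with copies of the sign bit. */
static emu64 emu64_sar(emu64 v, unsigned n)
{
    uint32_t fill = (v.hi & 0x80000000u) ? 0xFFFFFFFFu : 0u;
    emu64 r;

    if (n == 0)
        return v;
    if (n >= 64) {                 /* everything is shifted out */
        r.hi = fill;
        r.lo = fill;
        return r;
    }
    if (n >= 32) {                 /* the upper half moves into the lower */
        unsigned k = n - 32;

        r.hi = fill;
        r.lo = (k == 0) ? v.hi : ((v.hi >> k) | (fill << (32 - k)));
        return r;
    }
    /* 1 <= n <= 31: bits cross from the upper half into the lower */
    r.lo = (v.lo >> n) | (v.hi << (32 - n));
    r.hi = (v.hi >> n) | (fill << (32 - n));
    return r;
}
```

The special cases for n == 0 and n >= 32 are there so that no native
shift count ever reaches 32, which would be undefined behaviour on a
32-bit operand.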

When the ARITHMETIC_CHECKS macro is defined in tune.h, all occurrences
of implementation-defined or undefined behaviour during arithmetic
evaluation are reported as errors or warnings. This includes all
overflows and underflows on signed quantities, constants too large,
and so on. Errors (which immediately terminate evaluation) are emitted
for division by 0 (on / and % operators) and overflow (on / operator);
otherwise, warnings are emitted and the faulty evaluation takes place.
This prevents ucpp from crashing on typical x86 machines, while still
allowing the use of some extensions.
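
On x86, dividing by zero or dividing the most negative value by -1
raises a hardware exception instead of producing a result, which is why
a guard is needed before the native division is attempted. A minimal
sketch of such a guard, using intmax_t as the evaluation type; the enum
and function names are invented for this example and do not come from
ucpp:

```c
#include <stdint.h>

enum div_status { DIV_OK, DIV_BY_ZERO, DIV_OVERFLOW };

/* Store a / b in *q if it is well defined, else report why it is not. */
static enum div_status checked_div(intmax_t a, intmax_t b, intmax_t *q)
{
    if (b == 0)
        return DIV_BY_ZERO;
    if (a == INTMAX_MIN && b == -1)
        return DIV_OVERFLOW;       /* -INTMAX_MIN is not representable */
    *q = a / b;
    return DIV_OK;
}
```

A caller would treat both failure codes as errors that end the
evaluation, and leave *q untouched in those cases.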



FUTURE EVOLUTIONS
-----------------

ucpp is quite complete now. There was a longstanding project of
"traditional" preprocessing, but I dropped it because it would not
map cleanly onto the token-based ucpp structure. Maybe I will code a
string-based preprocessor one day; it would certainly use some of the
code from lexer.c, eval.c, mem.c and nhash.c. However, making such a
tool is almost irrelevant nowadays. If someone wants to take on such a
project, using ucpp as a code base, I would happily provide some help,
if needed.



CHANGES
-------

From 1.2 to 1.3:

* brand new integer evaluation code, with precise evaluation and checks
* new hash table implementation, with binary trees
* relaxed attitude on failed `##' operators
* bugfix on macro definition on command-line wrt nesting macros
* support for up to 32766 macro arguments in LOW_MEM code
* support for optional additional "identifier" characters such as '$' or '@'

From 1.1 to 1.2:

* bugfix: numerous memory leaks
* new function: wipeout(); this should release all malloc() blocks
* bugfix: missing "newline" and trailing "context" tokens
* improved included files name caching
* included memory leak detection code

From 1.0 to 1.1:

* bugfix: missing newline when exiting from a non-newline-terminated file
* bugfix: crash when resetting due to definition of the _Pragma pseudo-macro
* bugfix: handling of additional "optional" whitespace with SEMPER_FIDELIS
* improved handling of unreplaced arg macros wrt output line
* tricky handling of utterly tricky #include
* bugfix: spurious token `~=' eliminated

From 0.9 to 1.0:

* bugfix: crash after erroneous #assert
* changed ERR_SHARP to FAIL_SHARP, EMUL_UINTMAX to SIMUL_UINTMAX
* made "inline" default on gcc and DEC ccc (Linux/Alpha)
* semantic of -I is now Unix-like (added directories are looked first)
* added -J flag (to add include directories after the system ones)
* cleaned up non-ascii issues
* bugfix: missing brace in no-LOW_MEM code
* bugfix: argument number check in variadic macros
* bugfix: crash in non-lexer mode after some cases of unreplaced macro
* bugfix: _Pragma() handling wrt # and ##
* made evaluation of _Pragma() optional in #if, #include and #line
* bugfix: re-dump of multiline #pragma
* added the inmacro and macro_count flags
* added mmap() support
* added option to retain whitespace content in lexer mode

From 0.8 to 0.9:

* added check for division by 0 in #if evaluation
* added check for non-standard line numbers
* added check for trailing garbage in most directives
* corrected signedness of char constants (always int, therefore always signed)
* made LOW_MEM code, so that ucpp runs smoothly on low memory architectures
* multiple bugfixes (using the GNU cpp testsuite)
* added handling of _Pragma (as a macro)
* added tokenization of pragma directives
* added conservation of pragma directives in text output
* produced Msdos 16-bit small memory model executable
* produced Minix-86 executable

From 0.7 to 0.8:

* added some support for Amiga systems
* fixed extra spacing in stringified tokens
* fixed bug related to %:% and tolerated rogue sharps
* namespace cleanup
* bugfix for macro redefinition
* added warning for evaluated comma operators in #if (ISO requirement)
* -Dfoo now defines foo with content 1 (and not void content)
* trigraphs can be disabled (for incorrect but legacy code)
* fixed semantics for #include "file" (local directory)
* fixed detection of protected files
* produced a Msdos 16-bit executable

From 0.6 to 0.7:

* officially changed the goal to full C99 compliance
* added the CONTEXT token and let NEWLINE tokens go
* added report_context() for error reporting
* enforced matching of #if/#endif (file-global nesting level = 0)
* added support of C99 digraphs
* added UTF-8 encoding support
* added universal character names
* rewrote #if expressions (sizes fixed, bignum, signed/unsigned fixed)
* fixed incomplete evaluation of #if expressions
* added transient_characters[]

From 0.5 to 0.6:

* disappearance of error_nonl()
* added extra optional warnings for trigraphs
* some bugfixes, especially in lexer mode
* handled Macintosh files correctly

From 0.4 to 0.5:

* nicer #pragma handling (a token can be emitted)
* bugfix in lexer mode after #line and #error
* sample.c: an example of code linked with ucpp
* made #if expressions conforming to standard signed/unsigned handling
* added the copy_line[] buffer feature

From 0.3 to 0.4:

* relaxed interpretation of '#include foo' when foo ends up, after macro
  substitution, with a '<bar>' content
* corrected the 'double-dot' bug
* corrected two bugs related to the treatment of macro aborted calls (due
  to lack of arguments)
* some namespaces cleanup, to ease integration into other code
* documented the way to include ucpp into another program
* made newlines embedded into strings illegal (and reported as such)

From 0.2 to 0.3:

* added support for system predefined macros
* made several bugfixes
* checked C99 compliance for most of the features
* ucpp now accepts non-C characters on standard when used stand-alone
* removed many useless spaces in the output

From 0.1 to 0.2:

* added support for assertions
* added support for macros with variable arguments
* split the pharaonic cpp.c file into many
* made several bugfixes
* relaxed the behaviour with regards to the void arguments
* made C++-like comments an option



THANKS TO
---------

Volker Barthelmann, Neil Booth, Stephen Davies, Stéphane Ecolivet,
Marc Espie, Marcus Holland-Moritz, Antoine Leca, Cyrille Lefevre,
Dave Rivers, Loic Tortay and Laurent Wacrenier, for suggestions and
beta-testing.

Paul Eggert, Douglas A. Gwyn, Clive D.W. Feather, and the other guys from
comp.std.c, for explanations about the standard.

Dave Brolley, Jamie Lokier and Neil Booth, for discussion about tricky
points on nesting macros.

Brian Kernighan and Dennis Ritchie, for bringing C to mortal Men.
