.\"	$Id: mandoc_escape.3,v 1.4 2017/07/04 23:40:01 schwarze Exp $
.\"
.\" Copyright (c) 2014 Ingo Schwarze <schwarze@openbsd.org>
.\"
.\" Permission to use, copy, modify, and distribute this software for any
.\" purpose with or without fee is hereby granted, provided that the above
.\" copyright notice and this permission notice appear in all copies.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
.\" WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
.\" MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
.\" ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
.\" WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
.\" ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
.\" OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
.\"
.Dd $Mdocdate: July 4 2017 $
.Dt MANDOC_ESCAPE 3
.Os
.Sh NAME
.Nm mandoc_escape
.Nd parse roff escape sequences
.Sh SYNOPSIS
.In sys/types.h
.In mandoc.h
.Ft "enum mandoc_esc"
.Fo mandoc_escape
.Fa "const char **end"
.Fa "const char **start"
.Fa "int *sz"
.Fc
.Sh DESCRIPTION
This function scans a
.Xr roff 7
escape sequence.
.Pp
An escape sequence consists of
.Bl -dash -compact -width 2n
.It
an initial backslash character
.Pq Sq \e ,
.It
a single ASCII character called the escape sequence identifier,
.It
and, with only a few exceptions, an argument.
.El
.Pp
Arguments can be given in the following forms; some escape sequence
identifiers only accept some of these forms as specified below.
The first three forms are called the standard forms.
.Bl -tag -width 2n
.It \&In brackets: Ic \&[ Ns Ar argument Ns Ic \&]
The argument starts after the initial
.Sq \&[ ,
ends before the final
.Sq \&] ,
and the escape sequence ends with the final
.Sq \&] .
.It Two-character argument short form: Ic \&( Ns Ar ar
This form can only be used for arguments
consisting of exactly two characters.
It has the same effect as
.Ic \&[ Ns Ar ar Ns Ic \&] .
.It One-character argument short form: Ar a
This form can only be used for arguments
consisting of exactly one character.
It has the same effect as
.Ic \&[ Ns Ar a Ns Ic \&] .
.It Delimited form: Ar C Ns Ar argument Ns Ar C
The argument starts after the initial delimiter character
.Ar C ,
ends before the next occurrence of the delimiter character
.Ar C ,
and the escape sequence ends with that second
.Ar C .
Some escape sequences allow arbitrary characters
.Ar C
as quoting characters, while others restrict the range of characters
that can be used as quoting characters.
.El
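.Pp
As an illustration of the three standard forms, consider the following
minimal sketch.
The helper
.Fn scan_std_arg
is hypothetical and not part of the mandoc library;
it handles neither the delimited form nor nested escape sequences,
and returning
.Dv NULL
on truncated input is an assumption of this sketch.
.Bd -literal -offset indent
```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the mandoc library.
 * Scan an argument in one of the three standard forms, starting at p.
 * On success, store the argument in *start and *sz and return a
 * pointer to the character after the end of the sequence.
 * Return NULL if the input ends before the argument is complete.
 */
static const char *
scan_std_arg(const char *p, const char **start, int *sz)
{
	const char *close;

	switch (*p) {
	case '[':	/* bracketed form: [argument] */
		close = strchr(p + 1, ']');
		if (close == NULL)
			return NULL;
		*start = p + 1;
		*sz = (int)(close - (p + 1));
		return close + 1;
	case '(':	/* two-character short form: (ar */
		if (p[1] == 0 || p[2] == 0)
			return NULL;
		*start = p + 1;
		*sz = 2;
		return p + 3;
	case 0:		/* end of input, no argument */
		return NULL;
	default:	/* one-character short form: a */
		*start = p;
		*sz = 1;
		return p + 1;
	}
}
```
.Ed
.Pp
Given the input
.Ql [rs]x ,
the sketch reports the argument
.Ql rs
with a length of 2 and returns a pointer to the
.Ql x .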
.Pp
Upon function entry,
.Fa end
is expected to point to the escape sequence identifier.
The values passed in as
.Fa start
and
.Fa sz
are ignored and overwritten.
.Pp
By design, this function cannot handle those
.Xr roff 7
escape sequences that require in-place expansion, in particular
user-defined strings
.Ic \e* ,
number registers
.Ic \en ,
width measurements
.Ic \ew ,
and numerical expression control
.Ic \eB .
These are handled by
.Fn roff_res ,
a private preprocessor function called from
.Fn roff_parseln ,
see the file
.Pa roff.c .
.Pp
The function
.Fn mandoc_escape
is used
.Bl -dash -compact -width 2n
.It
recursively by itself, because some escape sequence arguments can
in turn contain other escape sequences,
.It
for error detection internally by the
.Xr roff 7
parser part of the
.Xr mandoc 3
library, see the file
.Pa roff.c ,
.It
above all externally by the
.Xr mandoc 1
formatting modules, in particular
.Fl Tascii
and
.Fl Thtml ,
for formatting purposes, see the files
.Pa term.c
and
.Pa html.c ,
.It
and rarely externally by high-level utilities using the mandoc library,
for example
.Xr makewhatis 8 ,
to purge escape sequences from text.
.El
.Sh RETURN VALUES
Upon function return, the pointer
.Fa end
is set to the character after the end of the escape sequence,
such that the calling higher-level parser can easily continue.
.Pp
For escape sequences taking an argument, the pointer
.Fa start
is set to the beginning of the argument and
.Fa sz
is set to the length of the argument.
For escape sequences not taking an argument,
.Fa start
is set to the character after the end of the sequence and
.Fa sz
is set to 0.
Both
.Fa start
and
.Fa sz
may be
.Dv NULL ;
in that case, the argument and the length are not returned.
.Pp
For sequences taking an argument, the function
.Fn mandoc_escape
returns one of the following values:
.Bl -tag -width 2n
.It Dv ESCAPE_FONT
The escape sequence
.Ic \ef
taking an argument in standard form:
.Ic \ef[ , \ef( , \ef Ns Ar a .
Two-character arguments starting with the character
.Sq C
are reduced to one-character arguments by skipping the
.Sq C .
More specific values are returned for the most commonly used arguments:
.Bl -column "argument" "ESCAPE_FONTITALIC"
.It argument Ta return value
.It Cm R No or Cm 1 Ta Dv ESCAPE_FONTROMAN
.It Cm I No or Cm 2 Ta Dv ESCAPE_FONTITALIC
.It Cm B No or Cm 3 Ta Dv ESCAPE_FONTBOLD
.It Cm P Ta Dv ESCAPE_FONTPREV
.It Cm BI Ta Dv ESCAPE_FONTBI
.El
.It Dv ESCAPE_SPECIAL
The escape sequence
.Ic \eC
taking an argument delimited with the single quote character
and, as a special exception, the escape sequences
.Em not
having an identifier, that is, those where the argument, in standard
form, directly follows the initial backslash:
.Ic \eC' , \e[ , \e( , \e Ns Ar a .
Note that the one-character argument short form can only be used for
argument characters that do not clash with escape sequence identifiers.
.Pp
If the argument matches one of the forms described below under
.Dv ESCAPE_UNICODE ,
that value is returned instead.
.Pp
The
.Dv ESCAPE_SPECIAL
special character escape sequences can be rendered using the functions
.Fn mchars_spec2cp
and
.Fn mchars_spec2str
described in the
.Xr mchars_alloc 3
manual.
.It Dv ESCAPE_UNICODE
Escape sequences of the same format as described above under
.Dv ESCAPE_SPECIAL ,
but with an argument of the forms
.Ic u Ns Ar XXXX ,
.Ic u Ns Ar YXXXX ,
or
.Ic u10 Ns Ar XXXX
where
.Ar X
and
.Ar Y
are hexadecimal digits and
.Ar Y
is not zero:
.Ic \eC'u , \e[u .
As a special exception,
.Fa start
is set to the character after the
.Ic u ,
and the
.Fa sz
return value does not include the
.Ic u
either.
.Pp
Such Unicode character escape sequences can be rendered using the function
.Fn mchars_num2uc
described in the
.Xr mchars_alloc 3
manual.
.It Dv ESCAPE_NUMBERED
The escape sequence
.Ic \eN
followed by a delimited argument.
The delimiter character is arbitrary except that digits cannot be used.
If a digit is encountered instead of the opening delimiter, that
digit is considered to be the argument and the end of the sequence, and
.Dv ESCAPE_IGNORE
is returned.
.Pp
Such ASCII character escape sequences can be rendered using the function
.Fn mchars_num2char
described in the
.Xr mchars_alloc 3
manual.
.It Dv ESCAPE_OVERSTRIKE
The escape sequence
.Ic \eo
followed by an argument delimited by an arbitrary character.
.It Dv ESCAPE_IGNORE
.Bl -bullet -width 2n
.It
The escape sequence
.Ic \es
followed by an argument in standard form or by an argument delimited
by the single quote character:
.Ic \es' , \es[ , \es( , \es Ns Ar a .
As a special exception, an optional
.Sq +
or
.Sq \-
character is allowed after the
.Sq s
for all forms.
.It
The escape sequences
.Ic \eF ,
.Ic \eg ,
.Ic \ek ,
.Ic \eM ,
.Ic \em ,
.Ic \en ,
.Ic \eV ,
and
.Ic \eY
followed by an argument in standard form.
.It
The escape sequences
.Ic \eA ,
.Ic \eb ,
.Ic \eD ,
.Ic \eR ,
.Ic \eX ,
and
.Ic \eZ
followed by an argument delimited by an arbitrary character.
.It
The escape sequences
.Ic \eH ,
.Ic \eh ,
.Ic \eL ,
.Ic \el ,
.Ic \eS ,
.Ic \ev ,
and
.Ic \ex
followed by an argument delimited by a character that cannot occur
in numerical expressions.
However, if any character that can occur in numerical expressions
is found instead of a delimiter, the sequence is considered to end
with that character, and
.Dv ESCAPE_ERROR
is returned.
.El
.It Dv ESCAPE_ERROR
Escape sequences taking an argument but not matching any of the above patterns.
In particular, that happens if the end of the logical input line
is reached before the end of the argument.
.El
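.Pp
The argument forms listed above under
.Dv ESCAPE_UNICODE
can be checked with a minimal sketch like the following.
The helper
.Fn unicode_arg_digits
is hypothetical and not part of the mandoc library;
it accepts hexadecimal digits in either case,
which is an assumption and may be more permissive than the actual parser.
.Bd -literal -offset indent
```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of the mandoc library.
 * Return the number of hexadecimal digits following the 'u' if the
 * argument arg of length sz has one of the forms uXXXX, uYXXXX
 * (Y not zero), or u10XXXX; return 0 otherwise.
 */
static int
unicode_arg_digits(const char *arg, int sz)
{
	int i, ndigits;

	if (sz < 5 || sz > 7 || arg[0] != 'u')
		return 0;
	ndigits = sz - 1;
	/* Every character after the 'u' must be a hexadecimal digit. */
	for (i = 1; i < sz; i++)
		if (!isxdigit((unsigned char)arg[i]))
			return 0;
	/* Five digits: the leading digit Y must not be zero. */
	if (ndigits == 5 && arg[1] == '0')
		return 0;
	/* Six digits: the argument must start with "10". */
	if (ndigits == 6 && (arg[1] != '1' || arg[2] != '0'))
		return 0;
	return ndigits;
}
```
.Ed
.Pp
For a matching argument, the value returned by the sketch corresponds to
the length reported in
.Fa sz ,
which does not include the
.Ic u .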
.Pp
For sequences that do not take an argument, the function
.Fn mandoc_escape
returns one of the following values:
.Bl -tag -width 2n
.It Dv ESCAPE_SKIPCHAR
The escape sequence
.Qq \ez .
.It Dv ESCAPE_NOSPACE
The escape sequence
.Qq \ec .
.It Dv ESCAPE_IGNORE
The escape sequences
.Qq \ed
and
.Qq \eu .
.El
.Sh FILES
This function is implemented in
.Pa mandoc.c .
.Sh SEE ALSO
.Xr mchars_alloc 3 ,
.Xr mandoc_char 7 ,
.Xr roff 7
.Sh HISTORY
This function has been available since mandoc 1.11.2.
.Sh AUTHORS
.An Kristaps Dzonsons Aq Mt kristaps@bsd.lv
.An Ingo Schwarze Aq Mt schwarze@openbsd.org
.Sh BUGS
The function doesn't cleanly distinguish between sequences that are
valid and supported, valid and ignored, valid and unsupported,
syntactically invalid, or undefined.
For sequences that are ignored or unsupported, it doesn't tell
whether that deficiency is likely to cause major formatting problems
and/or loss of document content.
The function is already rather complicated and still parses some
sequences incorrectly.
.
.ig
For these sequences, the list given below specifies a starting string
and either the length of the argument or an ending character.
The argument starts after the starting string.
In the former case, the sequence ends with the end of the argument.
In the latter case, the argument ends before the ending character,
and the sequence ends with the ending character.
..