-- Copyright 2006-2016 Mitchell mitchell.att.foicica.com. See LICENSE.

local M = {}

--[=[ This comment is for LuaDoc.
---
-- Lexes Scintilla documents with Lua and LPeg.
--
-- ## Overview
--
-- Lexers highlight the syntax of source code. Scintilla (the editing component
-- behind [Textadept][] and [SciTE][]) traditionally uses static, compiled C++
-- lexers which are notoriously difficult to create and/or extend. On the other
-- hand, Lua makes it easy to rapidly create new lexers, extend existing
-- ones, and embed lexers within one another. Lua lexers tend to be more
-- readable than C++ lexers too.
--
-- Lexers are Parsing Expression Grammars, or PEGs, composed with the Lua
-- [LPeg library][]. The following table comes from the LPeg documentation and
-- summarizes all you need to know about constructing basic LPeg patterns. This
-- module provides convenience functions for creating and working with other
-- more advanced patterns and concepts.
--
-- Operator             | Description
-- ---------------------|------------
-- `lpeg.P(string)`     | Matches `string` literally.
-- `lpeg.P(`_`n`_`)`    | Matches exactly _`n`_ characters.
-- `lpeg.S(string)`     | Matches any character in set `string`.
-- `lpeg.R("`_`xy`_`")` | Matches any character in the range `x` to `y`.
-- `patt^`_`n`_         | Matches at least _`n`_ repetitions of `patt`.
-- `patt^-`_`n`_        | Matches at most _`n`_ repetitions of `patt`.
-- `patt1 * patt2`      | Matches `patt1` followed by `patt2`.
-- `patt1 + patt2`      | Matches `patt1` or `patt2` (ordered choice).
-- `patt1 - patt2`      | Matches `patt1` if `patt2` does not match.
-- `-patt`              | Equivalent to `("" - patt)`.
-- `#patt`              | Matches `patt` but consumes no input.
--
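-- For example, a few of these operators combine into a pattern that matches a
-- typical identifier (a minimal sketch; `lpeg` is the LPeg module and `ident`
-- is an illustrative name, not part of this module):
--
--     local ident = (lpeg.R('AZ', 'az') + '_') *
--                   (lpeg.R('AZ', 'az', '09') + '_')^0
--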
-- The first part of this document deals with rapidly constructing a simple
-- lexer. The next part deals with more advanced techniques, such as custom
-- coloring and embedding lexers within one another. Following that is a
-- discussion about code folding, or being able to tell Scintilla which code
-- blocks are "foldable" (temporarily hideable from view). After that are
-- instructions on how to use LPeg lexers with the aforementioned Textadept and
-- SciTE editors. Finally there are comments on lexer performance and
-- limitations.
--
-- [LPeg library]: http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html
-- [Textadept]: http://foicica.com/textadept
-- [SciTE]: http://scintilla.org/SciTE.html
--
-- ## Lexer Basics
--
-- The *lexers/* directory contains all lexers, including your new one. Before
-- attempting to write one from scratch though, first determine if your
-- programming language is similar to any of the 80+ languages supported. If so,
-- you may be able to copy and modify that lexer, saving some time and effort.
-- The filename of your lexer should be the name of your programming language in
-- lower case followed by a *.lua* extension. For example, a new Lua lexer has
-- the name *lua.lua*.
--
-- Note: Try to refrain from using one-character language names like "c", "d",
-- or "r". For example, Scintillua uses "ansi_c", "dmd", and "rstats",
-- respectively.
--
-- ### New Lexer Template
--
-- There is a *lexers/template.txt* file that contains a simple template for a
-- new lexer. Feel free to use it, replacing the '?'s with the name of your
-- lexer:
--
--     -- ? LPeg lexer.
--
--     local l = require('lexer')
--     local token, word_match = l.token, l.word_match
--     local P, R, S = lpeg.P, lpeg.R, lpeg.S
--
--     local M = {_NAME = '?'}
--
--     -- Whitespace.
--     local ws = token(l.WHITESPACE, l.space^1)
--
--     M._rules = {
--       {'whitespace', ws},
--     }
--
--     M._tokenstyles = {
--
--     }
--
--     return M
--
-- The first 3 lines of code simply define often used convenience variables. The
-- 4th and last lines define and return the lexer object Scintilla uses; they
-- are very important and must be part of every lexer. The fifth line defines
-- something called a "token", an essential building block of lexers. You will
-- learn about tokens shortly. The rest of the code defines a set of grammar
-- rules and token styles. You will learn about those later. Note, however, the
-- `M.` prefix in front of `_rules` and `_tokenstyles`: not only do these tables
-- belong to their respective lexers, but any non-local variables need the `M.`
-- prefix too so as not to affect Lua's global environment. All in all, this is
-- a minimal, working lexer that you can build on.
--
-- ### Tokens
--
-- Take a moment to think about your programming language's structure. What kind
-- of key elements does it have? In the template shown earlier, one predefined
-- element all languages have is whitespace. Your language probably also has
-- elements like comments, strings, and keywords. Lexers refer to these elements
-- as "tokens". Tokens are the fundamental "building blocks" of lexers. Lexers
-- break down source code into tokens for coloring, which results in the syntax
-- highlighting familiar to you. It is up to you how specific your lexer is when
-- it comes to tokens. Perhaps only distinguishing between keywords and
-- identifiers is necessary, or maybe recognizing constants and built-in
-- functions, methods, or libraries is desirable. The Lua lexer, for example,
-- defines 11 tokens: whitespace, comments, strings, numbers, keywords, built-in
-- functions, constants, built-in libraries, identifiers, labels, and operators.
-- Even though constants, built-in functions, and built-in libraries are subsets
-- of identifiers, Lua programmers find it helpful for the lexer to distinguish
-- between them all. It is perfectly acceptable to just recognize keywords and
-- identifiers.
--
-- In a lexer, tokens consist of a token name and an LPeg pattern that matches a
-- sequence of characters recognized as an instance of that token. Create tokens
-- using the [`lexer.token()`]() function. Let us examine the "whitespace" token
-- defined in the template shown earlier:
--
--     local ws = token(l.WHITESPACE, l.space^1)
--
-- At first glance, the first argument does not appear to be a string name and
-- the second argument does not appear to be an LPeg pattern. Perhaps you
-- expected something like:
--
--     local ws = token('whitespace', S('\t\v\f\n\r ')^1)
--
-- The `lexer` (`l`) module actually provides a convenient list of common token
-- names and common LPeg patterns for you to use. Token names include
-- [`lexer.DEFAULT`](), [`lexer.WHITESPACE`](), [`lexer.COMMENT`](),
-- [`lexer.STRING`](), [`lexer.NUMBER`](), [`lexer.KEYWORD`](),
-- [`lexer.IDENTIFIER`](), [`lexer.OPERATOR`](), [`lexer.ERROR`](),
-- [`lexer.PREPROCESSOR`](), [`lexer.CONSTANT`](), [`lexer.VARIABLE`](),
-- [`lexer.FUNCTION`](), [`lexer.CLASS`](), [`lexer.TYPE`](), [`lexer.LABEL`](),
-- [`lexer.REGEX`](), and [`lexer.EMBEDDED`](). Patterns include
-- [`lexer.any`](), [`lexer.ascii`](), [`lexer.extend`](), [`lexer.alpha`](),
-- [`lexer.digit`](), [`lexer.alnum`](), [`lexer.lower`](), [`lexer.upper`](),
-- [`lexer.xdigit`](), [`lexer.cntrl`](), [`lexer.graph`](), [`lexer.print`](),
-- [`lexer.punct`](), [`lexer.space`](), [`lexer.newline`](),
-- [`lexer.nonnewline`](), [`lexer.nonnewline_esc`](), [`lexer.dec_num`](),
-- [`lexer.hex_num`](), [`lexer.oct_num`](), [`lexer.integer`](),
-- [`lexer.float`](), and [`lexer.word`](). You may use your own token names if
-- none of the above fit your language, but an advantage to using predefined
-- token names is that your lexer's tokens will inherit the universal syntax
-- highlighting color theme used by your text editor.
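--
-- For instance, an identifier token built entirely from predefined names and
-- patterns might look like this (a sketch; `l` is the `lexer` module as in the
-- template):
--
--     local identifier = token(l.IDENTIFIER, l.word)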
--
-- #### Example Tokens
--
-- So, how might you define other tokens like comments, strings, and keywords?
-- Here are some examples.
--
-- **Comments**
--
-- Line-style comments with a prefix character (or characters) are easy to
-- express with LPeg:
--
--     local shell_comment = token(l.COMMENT, '#' * l.nonnewline^0)
--     local c_line_comment = token(l.COMMENT, '//' * l.nonnewline_esc^0)
--
-- The comments above start with a '#' or "//" and go to the end of the line.
-- The second comment recognizes the next line also as a comment if the current
-- line ends with a '\' escape character.
--
-- C-style "block" comments with a start and end delimiter are also easy to
-- express:
--
--     local c_comment = token(l.COMMENT, '/*' * (l.any - '*/')^0 * P('*/')^-1)
--
-- This comment starts with a "/\*" sequence and contains anything up to and
-- including an ending "\*/" sequence. The ending "\*/" is optional so the lexer
-- can recognize unfinished comments as comments and highlight them properly.
--
-- **Strings**
--
-- It is tempting to think that a string is not much different from the block
-- comment shown above in that both have start and end delimiters:
--
--     local dq_str = '"' * (l.any - '"')^0 * P('"')^-1
--     local sq_str = "'" * (l.any - "'")^0 * P("'")^-1
--     local simple_string = token(l.STRING, dq_str + sq_str)
--
-- However, most programming languages allow escape sequences in strings such
-- that a sequence like "\\"" in a double-quoted string indicates that the
-- '"' is not the end of the string. The above token incorrectly matches
-- such a string. Instead, use the [`lexer.delimited_range()`]() convenience
-- function.
--
--     local dq_str = l.delimited_range('"')
--     local sq_str = l.delimited_range("'")
--     local string = token(l.STRING, dq_str + sq_str)
--
-- In this case, the lexer treats '\' as an escape character in a string
-- sequence.
--
-- **Keywords**
--
-- Instead of matching _n_ keywords with _n_ `P('keyword_`_`n`_`')` ordered
-- choices, use another convenience function: [`lexer.word_match()`](). It is
-- much easier and more efficient to write word matches like:
--
--     local keyword = token(l.KEYWORD, l.word_match{
--       'keyword_1', 'keyword_2', ..., 'keyword_n'
--     })
--
--     local case_insensitive_keyword = token(l.KEYWORD, l.word_match({
--       'KEYWORD_1', 'keyword_2', ..., 'KEYword_n'
--     }, nil, true))
--
--     local hyphened_keyword = token(l.KEYWORD, l.word_match({
--       'keyword-1', 'keyword-2', ..., 'keyword-n'
--     }, '-'))
--
-- By default, characters considered to be in keywords are in the set of
-- alphanumeric characters and underscores. The last token demonstrates how to
-- allow '-' (hyphen) characters to be in keywords as well.
--
-- **Numbers**
--
-- Most programming languages have the same format for integer and float tokens,
-- so it might be as simple as using a couple of predefined LPeg patterns:
--
--     local number = token(l.NUMBER, l.float + l.integer)
--
-- However, some languages allow postfix characters on integers.
--
--     local integer = P('-')^-1 * (l.dec_num * S('lL')^-1)
--     local number = token(l.NUMBER, l.float + l.hex_num + integer)
--
-- Your language may need other tweaks, but it is up to you how fine-grained you
-- want your highlighting to be. After all, you are not writing a compiler or
-- interpreter!
--
-- ### Rules
--
-- Programming languages have grammars, which specify valid token structure. For
-- example, comments usually cannot appear within a string. Grammars consist of
-- rules, which are simply combinations of tokens. Recall from the lexer
-- template the `_rules` table, which defines all the rules used by the lexer
-- grammar:
--
--     M._rules = {
--       {'whitespace', ws},
--     }
--
-- Each entry in a lexer's `_rules` table consists of a rule name and its
-- associated pattern. Rule names are completely arbitrary and serve only to
-- identify and distinguish between different rules. Rule order is important: if
-- text does not match the first rule, the lexer tries the second rule, and so
-- on. This simple grammar says to match whitespace tokens under a rule named
-- "whitespace".
--
-- To illustrate the importance of rule order, here is an example of a
-- simplified Lua grammar:
--
--     M._rules = {
--       {'whitespace', ws},
--       {'keyword', keyword},
--       {'identifier', identifier},
--       {'string', string},
--       {'comment', comment},
--       {'number', number},
--       {'label', label},
--       {'operator', operator},
--     }
--
-- Note how identifiers come after keywords. In Lua, as with most programming
-- languages, the characters allowed in keywords and identifiers are in the same
-- set (alphanumerics plus underscores). If the lexer specified the "identifier"
-- rule before the "keyword" rule, all keywords would match identifiers and thus
-- incorrectly highlight as identifiers instead of keywords. The same idea
-- applies to function, constant, etc. tokens that you may want to distinguish
-- between: their rules should come before identifiers.
--
-- So what about text that does not match any rules? For example in Lua, the '!'
-- character is meaningless outside a string or comment. Normally the lexer
-- skips over such text. If instead you want to highlight these "syntax errors",
-- add an additional end rule:
--
--     M._rules = {
--       {'whitespace', ws},
--       {'error', token(l.ERROR, l.any)},
--     }
--
-- This identifies and highlights any character not matched by an existing
-- rule as a `lexer.ERROR` token.
--
-- Even though the rules defined in the examples above contain a single token,
-- rules may consist of multiple tokens. For example, a rule for an HTML tag
-- could consist of a tag token followed by an arbitrary number of attribute
-- tokens, allowing the lexer to highlight all tokens separately. The rule might
-- look something like this:
--
--     {'tag', tag_start * (ws * attributes)^0 * tag_end^-1}
--
-- Note however that lexers with complex rules like these are more prone to lose
-- track of their state.
--
-- ### Summary
--
-- Lexers primarily consist of tokens and grammar rules. At your disposal are a
-- number of convenience patterns and functions for rapidly creating a lexer. If
-- you choose to use predefined token names for your tokens, you do not have to
-- define how the lexer highlights them. The tokens will inherit the default
-- syntax highlighting color theme your editor uses.
--
-- ## Advanced Techniques
--
-- ### Styles and Styling
--
-- The most basic form of syntax highlighting is assigning different colors to
-- different tokens. Instead of highlighting with just colors, Scintilla allows
-- for more rich highlighting, or "styling", with different fonts, font sizes,
-- font attributes, and foreground and background colors, just to name a few.
-- The unit of this rich highlighting is called a "style". Styles are simply
-- strings of comma-separated property settings. By default, lexers associate
-- predefined token names like `lexer.WHITESPACE`, `lexer.COMMENT`,
-- `lexer.STRING`, etc. with particular styles as part of a universal color
-- theme. These predefined styles include [`lexer.STYLE_CLASS`](),
-- [`lexer.STYLE_COMMENT`](), [`lexer.STYLE_CONSTANT`](),
-- [`lexer.STYLE_ERROR`](), [`lexer.STYLE_EMBEDDED`](),
-- [`lexer.STYLE_FUNCTION`](), [`lexer.STYLE_IDENTIFIER`](),
-- [`lexer.STYLE_KEYWORD`](), [`lexer.STYLE_LABEL`](), [`lexer.STYLE_NUMBER`](),
-- [`lexer.STYLE_OPERATOR`](), [`lexer.STYLE_PREPROCESSOR`](),
-- [`lexer.STYLE_REGEX`](), [`lexer.STYLE_STRING`](), [`lexer.STYLE_TYPE`](),
-- [`lexer.STYLE_VARIABLE`](), and [`lexer.STYLE_WHITESPACE`](). Like with
-- predefined token names and LPeg patterns, you may define your own styles. At
-- their core, styles are just strings, so you may create new ones and/or modify
-- existing ones. Each style consists of the following comma-separated settings:
--
-- Setting        | Description
-- ---------------|------------
-- font:_name_    | The name of the font the style uses.
-- size:_int_     | The size of the font the style uses.
-- [not]bold      | Whether or not the font face is bold.
-- weight:_int_   | The weight or boldness of a font, between 1 and 999.
-- [not]italics   | Whether or not the font face is italic.
-- [not]underlined| Whether or not the font face is underlined.
-- fore:_color_   | The foreground color of the font face.
-- back:_color_   | The background color of the font face.
-- [not]eolfilled | Whether or not the background color extends to the line end.
-- case:_char_    | The case of the font ('u': upper, 'l': lower, 'm': normal).
-- [not]visible   | Whether or not the text is visible.
-- [not]changeable| Whether the text is changeable or read-only.
--
-- Specify font colors in either "#RRGGBB" format, "0xBBGGRR" format, or the
-- decimal equivalent of the latter. As with token names, LPeg patterns, and
-- styles, there is a set of predefined color names, but they vary depending on
-- the current color theme in use. Therefore, it is generally not a good idea to
-- manually define colors within styles in your lexer since they might not fit
-- into a user's chosen color theme. Try to refrain from even using predefined
-- colors in a style because that color may be theme-specific. Instead, the best
-- practice is to either use predefined styles or derive new color-agnostic
-- styles from predefined ones. For example, Lua "longstring" tokens use the
-- existing `lexer.STYLE_STRING` style instead of defining a new one.
--
-- #### Example Styles
--
-- Defining styles is pretty straightforward. An empty style that inherits the
-- default theme settings is simply an empty string:
--
--     local style_nothing = ''
--
-- A similar style but with a bold font face looks like this:
--
--     local style_bold = 'bold'
--
-- If you want the same style, but also with an italic font face, define the new
-- style in terms of the old one:
--
--     local style_bold_italic = style_bold..',italics'
--
-- This allows you to derive new styles from predefined ones without having to
-- rewrite them. This operation leaves the old style unchanged. Thus if you
-- had a "static variable" token whose style you wanted to base off of
-- `lexer.STYLE_VARIABLE`, it would probably look like:
--
--     local style_static_var = l.STYLE_VARIABLE..',italics'
--
-- The color theme files in the *lexers/themes/* folder give more examples of
-- style definitions.
--
-- ### Token Styles
--
-- Lexers use the `_tokenstyles` table to assign tokens to particular styles.
-- Recall the token definition and `_tokenstyles` table from the lexer template:
--
--     local ws = token(l.WHITESPACE, l.space^1)
--
--     ...
--
--     M._tokenstyles = {
--
--     }
--
-- Why is a style not assigned to the `lexer.WHITESPACE` token? As mentioned
-- earlier, lexers automatically associate tokens that use predefined token
-- names with a particular style. Only tokens with custom token names need
-- manual style associations. As an example, consider a custom whitespace token:
--
--     local ws = token('custom_whitespace', l.space^1)
--
-- Assigning a style to this token looks like:
--
--     M._tokenstyles = {
--       custom_whitespace = l.STYLE_WHITESPACE
--     }
--
-- Do not confuse token names with rule names. They are completely different
-- entities. In the example above, the lexer assigns the "custom_whitespace"
-- token the existing style for `WHITESPACE` tokens. If instead you want to
-- color the background of whitespace a shade of grey, it might look like:
--
--     local custom_style = l.STYLE_WHITESPACE..',back:$(color.grey)'
--     M._tokenstyles = {
--       custom_whitespace = custom_style
--     }
--
-- Notice that the lexer performs Scintilla/SciTE-style "$()" property
-- expansion. You may also use "%()". Remember to refrain from assigning
-- specific colors in styles; in this case, though, all user color themes
-- probably define the "color.grey" property.
--
-- ### Line Lexers
--
-- By default, lexers match the arbitrary chunks of text passed to them by
-- Scintilla. These chunks may be a full document, only the visible part of a
-- document, or even just portions of lines. Some lexers need to match whole
-- lines. For example, a lexer for the output of a file "diff" needs to know if
-- the line started with a '+' or '-' and then style the entire line
-- accordingly. To indicate that your lexer matches by line, use the
-- `_LEXBYLINE` field:
--
--     M._LEXBYLINE = true
--
-- Now the input text for the lexer is a single line at a time. Keep in mind
-- that line lexers do not have the ability to look ahead at subsequent lines.
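--
-- For the "diff" example above, whole-line rules might look like the following
-- sketch (the "addition" and "deletion" token names are illustrative and would
-- need `_tokenstyles` entries):
--
--     local add_line = token('addition', '+' * l.nonnewline^0)
--     local del_line = token('deletion', '-' * l.nonnewline^0)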
--
-- ### Embedded Lexers
--
-- Lexers embed within one another very easily, requiring minimal effort. In the
-- following sections, the lexer being embedded is called the "child" lexer and
-- the lexer a child is being embedded in is called the "parent". For example,
-- consider an HTML lexer and a CSS lexer. Either lexer stands alone for styling
-- their respective HTML and CSS files. However, CSS can be embedded inside
-- HTML. In this specific case, the CSS lexer is the "child" lexer with the HTML
-- lexer being the "parent". Now consider an HTML lexer and a PHP lexer. This
-- sounds a lot like the case with CSS, but there is a subtle difference: PHP
-- _embeds itself_ into HTML while CSS is _embedded in_ HTML. This fundamental
-- difference results in two types of embedded lexers: a parent lexer that
-- embeds other child lexers in it (like HTML embedding CSS), and a child lexer
-- that embeds itself within a parent lexer (like PHP embedding itself in HTML).
--
-- #### Parent Lexer
--
-- Before embedding a child lexer into a parent lexer, the parent lexer needs to
-- load the child lexer. This is done with the [`lexer.load()`]() function. For
-- example, loading the CSS lexer within the HTML lexer looks like:
--
--     local css = l.load('css')
--
-- The next part of the embedding process is telling the parent lexer when to
-- switch over to the child lexer and when to switch back. The lexer refers to
-- these indications as the "start rule" and "end rule", respectively; they are
-- just LPeg patterns. Continuing with the HTML/CSS example, the transition from
-- HTML to CSS is when the lexer encounters a "style" tag with a "type"
-- attribute whose value is "text/css":
--
--     local css_tag = P('<style') * P(function(input, index)
--       if input:find('^[^>]+type="text/css"', index) then
--         return index
--       end
--     end)
--
-- This pattern looks for the beginning of a "style" tag and searches its
-- attribute list for the text "`type="text/css"`". (In this simplified example,
-- the Lua pattern does not allow whitespace around the '=' nor does it allow
-- single quotes around the attribute value, though both are valid.) If there is
-- a match, the functional pattern returns a value instead of `nil`. In this
-- case, the value returned does not matter because we ultimately want to style
-- the "style" tag as an HTML tag, so the actual start rule looks like this:
--
--     local css_start_rule = #css_tag * tag
--
-- Now that the parent knows when to switch to the child, it needs to know when
-- to switch back. In the case of HTML/CSS, the switch back occurs when the
-- lexer encounters an ending "style" tag, though the lexer should still style
-- the tag as an HTML tag:
--
--     local css_end_rule = #P('</style>') * tag
--
-- Once the parent loads the child lexer and defines the child's start and end
-- rules, it embeds the child with the [`lexer.embed_lexer()`]() function:
--
--     l.embed_lexer(M, css, css_start_rule, css_end_rule)
--
-- The first parameter is the parent lexer object to embed the child in, which
-- in this case is `M`. The other three parameters are the child lexer object
-- loaded earlier followed by its start and end rules.
--
-- #### Child Lexer
--
-- The process for instructing a child lexer to embed itself into a parent is
-- very similar to embedding a child into a parent: first, load the parent lexer
-- into the child lexer with the [`lexer.load()`]() function and then create
-- start and end rules for the child lexer. However, in this case, swap the
-- lexer object arguments to [`lexer.embed_lexer()`](). For example, in the PHP
-- lexer:
--
--     local html = l.load('html')
--     local php_start_rule = token('php_tag', '<?php ')
--     local php_end_rule = token('php_tag', '?>')
--     l.embed_lexer(html, M, php_start_rule, php_end_rule)
--
-- ### Lexers with Complex State
--
-- A vast majority of lexers are not stateful and can operate on any chunk of
-- text in a document. However, there may be rare cases where a lexer does need
-- to keep track of some sort of persistent state. Rather than using `lpeg.P`
-- function patterns that set state variables, it is recommended to make use of
-- Scintilla's built-in, per-line state integers via [`lexer.line_state`](). It
-- was designed to accommodate up to 32 bit flags for tracking state.
-- [`lexer.line_from_position()`]() will return the line for any position given
-- to an `lpeg.P` function pattern. (Any positions derived from that position
-- argument will also work.)
--
-- Writing stateful lexers is beyond the scope of this document.
--
-- ## Code Folding
--
-- When reading source code, it is occasionally helpful to temporarily hide
-- blocks of code like functions, classes, comments, etc. This is the concept of
-- "folding". In the Textadept and SciTE editors for example, little indicators
-- in the editor margins appear next to code that can be folded at places called
-- "fold points". When the user clicks an indicator, the editor hides the code
-- associated with the indicator until the user clicks the indicator again. The
-- lexer specifies these fold points and what code exactly to fold.
--
-- The fold points for most languages occur on keywords or character sequences.
-- Examples of fold keywords are "if" and "end" in Lua and examples of fold
-- character sequences are '{', '}', "/\*", and "\*/" in C for code block and
-- comment delimiters, respectively. However, these fold points cannot occur
-- just anywhere. For example, lexers should not recognize fold keywords that
-- appear within strings or comments. The lexer's `_foldsymbols` table allows
-- you to conveniently define fold points with such granularity. For example,
-- consider C:
--
--     M._foldsymbols = {
--       [l.OPERATOR] = {['{'] = 1, ['}'] = -1},
--       [l.COMMENT] = {['/*'] = 1, ['*/'] = -1},
--       _patterns = {'[{}]', '/%*', '%*/'}
--     }
--
-- The first assignment states that any '{' or '}' that the lexer recognizes as
-- a `lexer.OPERATOR` token is a fold point. The integer `1` indicates the
-- match is a beginning fold point and `-1` indicates the match is an ending
-- fold point. Likewise, the second assignment states that any "/\*" or "\*/"
-- that the lexer recognizes as part of a `lexer.COMMENT` token is a fold point.
-- The lexer does not consider any occurrences of these characters outside their
-- defined tokens (such as in a string) as fold points. Finally, every
-- `_foldsymbols` table must have a `_patterns` field that contains a list of
-- [Lua patterns][] that match fold points. If the lexer encounters text that
-- matches one of those patterns, the lexer looks up the matched text in its
-- token's table in order to determine whether or not the text is a fold point.
-- In the example above, the first Lua pattern matches any '{' or '}'
-- characters. When the lexer comes across one of those characters, it checks if
-- the match is a `lexer.OPERATOR` token. If so, the lexer identifies the match
-- as a fold point. The same idea applies for the other patterns. (The '%' is in
-- the other patterns because '\*' is a special character in Lua patterns that
-- needs escaping.) How do you specify fold keywords? Here is an example for
-- Lua:
--
--     M._foldsymbols = {
--       [l.KEYWORD] = {
--         ['if'] = 1, ['do'] = 1, ['function'] = 1,
--         ['end'] = -1, ['repeat'] = 1, ['until'] = -1
--       },
--       _patterns = {'%l+'}
--     }
--
-- Any time the lexer encounters a lower case word, if that word is a
-- `lexer.KEYWORD` token and in the associated list of fold points, the lexer
-- identifies the word as a fold point.
--
-- If your lexer has case-insensitive keywords as fold points, simply add a
-- `_case_insensitive = true` option to the `_foldsymbols` table and specify
-- keywords in lower case.
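--
-- For example, a hypothetical language with case-insensitive "BEGIN" and "END"
-- fold keywords might use the following sketch:
--
--     M._foldsymbols = {
--       [l.KEYWORD] = {['begin'] = 1, ['end'] = -1},
--       _patterns = {'%a+'},
--       _case_insensitive = true
--     }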
--
-- If your lexer needs to do some additional processing to determine if a match
-- is a fold point, assign a function that returns an integer. Returning `1` or
-- `-1` indicates the match is a fold point. Returning `0` indicates it is not.
-- For example:
--
--     local function fold_strange_token(text, pos, line, s, match)
--       if ... then
--         return 1 -- beginning fold point
--       elseif ... then
--         return -1 -- ending fold point
--       end
--       return 0
--     end
--
--     M._foldsymbols = {
--       ['strange_token'] = {['|'] = fold_strange_token},
--       _patterns = {'|'}
--     }
--
-- Any time the lexer encounters a '|' that is a "strange_token", it calls the
-- `fold_strange_token` function to determine if '|' is a fold point. The lexer
-- calls these functions with the following arguments: the text to identify fold
-- points in, the beginning position of the current line in the text to fold,
-- the current line's text, the position in the current line the matched text
-- starts at, and the matched text itself.
--
-- [Lua patterns]: http://www.lua.org/manual/5.2/manual.html#6.4.1
--
-- ### Fold by Indentation
--
-- Some languages have significant whitespace and/or no delimiters that indicate
-- fold points. If your lexer falls into this category and you would like to
-- mark fold points based on changes in indentation, use the
-- `_FOLDBYINDENTATION` field:
--
--     M._FOLDBYINDENTATION = true
--
-- ## Using Lexers
--
-- ### Textadept
--
-- Put your lexer in your *~/.textadept/lexers/* directory so you do not
-- overwrite it when upgrading Textadept. Also, lexers in this directory
-- override default lexers. Thus, Textadept loads a user *lua* lexer instead of
-- the default *lua* lexer. This is convenient for tweaking a default lexer to
-- your liking. Then add a [file type][] for your lexer if necessary.
--
-- [file type]: _M.textadept.file_types.html
--
-- ### SciTE
--
-- Create a *.properties* file for your lexer and `import` it in either your
-- *SciTEUser.properties* or *SciTEGlobal.properties*. The contents of the
-- *.properties* file should contain:
--
--     file.patterns.[lexer_name]=[file_patterns]
--     lexer.$(file.patterns.[lexer_name])=[lexer_name]
--
-- where `[lexer_name]` is the name of your lexer (minus the *.lua* extension)
-- and `[file_patterns]` is a set of file extensions to use your lexer for.
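--
-- For example, a *.properties* entry for a Lua lexer might look like the
-- following (the file pattern shown is only illustrative):
--
--     file.patterns.lua=*.lua
--     lexer.$(file.patterns.lua)=lua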
--
-- Please note that Lua lexers ignore any styling information in *.properties*
-- files. Your theme file in the *lexers/themes/* directory contains styling
-- information.
--
-- ## Considerations
--
-- ### Performance
--
-- There might be some slight overhead when initializing a lexer, but loading a
-- file from disk into Scintilla is usually more expensive. On modern computer
-- systems, I see no difference in speed between LPeg lexers and Scintilla's C++
-- ones. Optimize lexers for speed by re-arranging rules in the `_rules` table
-- so that the most common rules match first. Do keep in mind that order matters
-- for similar rules.
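--
-- For instance, in a comment-heavy language it may pay to try comments early
-- (a sketch; keyword and identifier order still matters, as noted earlier):
--
--     M._rules = {
--       {'whitespace', ws},
--       {'comment', comment},
--       {'keyword', keyword},
--       {'identifier', identifier},
--       {'string', string},
--     }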
--
-- ### Limitations
--
-- Embedded preprocessor languages like PHP cannot completely embed themselves
-- in their parent languages because the parent's tokens do not support start
-- and end rules. This mostly goes unnoticed, but code like
--
--     <div id="<?php echo $id; ?>">
--
-- or
--
--     <div <?php if ($odd) { echo 'class="odd"'; } ?>>
--
-- will not style correctly.
--
-- ### Troubleshooting
--
-- Errors in lexers can be tricky to debug. Lexers print Lua errors to
-- `io.stderr` and `_G.print()` statements to `io.stdout`. Running your editor
-- from a terminal is the easiest way to see errors as they occur.
--
-- ### Risks
--
-- Poorly written lexers have the ability to crash Scintilla (and thus its
-- containing application), so unsaved data might be lost. However, I have only
-- observed these crashes in early lexer development, when syntax errors or
-- pattern errors are present. Once the lexer actually starts styling text
-- (either correctly or incorrectly, it does not matter), I have not observed
-- any crashes.
--
-- ### Acknowledgements
--
-- Thanks to Peter Odding for his [lexer post][] on the Lua mailing list
-- that inspired me, and thanks to Roberto Ierusalimschy for LPeg.
--
-- [lexer post]: http://lua-users.org/lists/lua-l/2007-04/msg00116.html
-- @field LEXERPATH (string)
--   The path used to search for a lexer to load.
--   Identical in format to Lua's `package.path` string.
--   The default value is `package.path`.
-- @field DEFAULT (string)
--   The token name for default tokens.
-- @field WHITESPACE (string)
--   The token name for whitespace tokens.
-- @field COMMENT (string)
--   The token name for comment tokens.
-- @field STRING (string)
--   The token name for string tokens.
-- @field NUMBER (string)
--   The token name for number tokens.
-- @field KEYWORD (string)
--   The token name for keyword tokens.
-- @field IDENTIFIER (string)
--   The token name for identifier tokens.
-- @field OPERATOR (string)
--   The token name for operator tokens.
-- @field ERROR (string)
--   The token name for error tokens.
-- @field PREPROCESSOR (string)
--   The token name for preprocessor tokens.
-- @field CONSTANT (string)
--   The token name for constant tokens.
-- @field VARIABLE (string)
--   The token name for variable tokens.
-- @field FUNCTION (string)
--   The token name for function tokens.
-- @field CLASS (string)
--   The token name for class tokens.
-- @field TYPE (string)
--   The token name for type tokens.
-- @field LABEL (string)
--   The token name for label tokens.
-- @field REGEX (string)
--   The token name for regex tokens.
-- @field STYLE_CLASS (string)
--   The style typically used for class definitions.
-- @field STYLE_COMMENT (string)
--   The style typically used for code comments.
-- @field STYLE_CONSTANT (string)
--   The style typically used for constants.
-- @field STYLE_ERROR (string)
--   The style typically used for erroneous syntax.
-- @field STYLE_FUNCTION (string)
--   The style typically used for function definitions.
-- @field STYLE_KEYWORD (string)
--   The style typically used for language keywords.
-- @field STYLE_LABEL (string)
--   The style typically used for labels.
-- @field STYLE_NUMBER (string)
--   The style typically used for numbers.
-- @field STYLE_OPERATOR (string)
--   The style typically used for operators.
-- @field STYLE_REGEX (string)
--   The style typically used for regular expression strings.
-- @field STYLE_STRING (string)
--   The style typically used for strings.
-- @field STYLE_PREPROCESSOR (string)
--   The style typically used for preprocessor statements.
-- @field STYLE_TYPE (string)
--   The style typically used for static types.
-- @field STYLE_VARIABLE (string)
--   The style typically used for variables.
-- @field STYLE_WHITESPACE (string)
--   The style typically used for whitespace.
-- @field STYLE_EMBEDDED (string)
--   The style typically used for embedded code.
-- @field STYLE_IDENTIFIER (string)
--   The style typically used for identifier words.
-- @field STYLE_DEFAULT (string)
--   The style all styles are based off of.
-- @field STYLE_LINENUMBER (string)
--   The style used for all margins except fold margins.
-- @field STYLE_BRACELIGHT (string)
--   The style used for highlighted brace characters.
-- @field STYLE_BRACEBAD (string)
--   The style used for unmatched brace characters.
-- @field STYLE_CONTROLCHAR (string)
--   The style used for control characters.
--   Color attributes are ignored.
-- @field STYLE_INDENTGUIDE (string)
--   The style used for indentation guides.
-- @field STYLE_CALLTIP (string)
--   The style used by call tips if [`buffer.call_tip_use_style`]() is set.
--   Only the font name, size, and color attributes are used.
-- @field any (pattern)
--   A pattern that matches any single character.
-- @field ascii (pattern)
--   A pattern that matches any ASCII character (codes 0 to 127).
-- @field extend (pattern)
--   A pattern that matches any ASCII extended character (codes 0 to 255).
-- @field alpha (pattern)
--   A pattern that matches any alphabetic character ('A'-'Z', 'a'-'z').
-- @field digit (pattern)
--   A pattern that matches any digit ('0'-'9').
-- @field alnum (pattern)
--   A pattern that matches any alphanumeric character ('A'-'Z', 'a'-'z',
--   '0'-'9').
-- @field lower (pattern)
--   A pattern that matches any lower case character ('a'-'z').
-- @field upper (pattern)
--   A pattern that matches any upper case character ('A'-'Z').
-- @field xdigit (pattern)
--   A pattern that matches any hexadecimal digit ('0'-'9', 'A'-'F', 'a'-'f').
-- @field cntrl (pattern)
--   A pattern that matches any control character (ASCII codes 0 to 31).
-- @field graph (pattern)
--   A pattern that matches any graphical character ('!' to '~').
-- @field print (pattern)
--   A pattern that matches any printable character (' ' to '~').
-- @field punct (pattern)
--   A pattern that matches any punctuation character ('!' to '/', ':' to '@',
--   '[' to '`', '{' to '~').
-- @field space (pattern)
--   A pattern that matches any whitespace character ('\t', '\v', '\f', '\n',
--   '\r', space).
-- @field newline (pattern)
--   A pattern that matches any set of end of line characters.
-- @field nonnewline (pattern)
--   A pattern that matches any single, non-newline character.
-- @field nonnewline_esc (pattern)
--   A pattern that matches any single, non-newline character or any set of end
--   of line characters escaped with '\'.
-- @field dec_num (pattern)
--   A pattern that matches a decimal number.
-- @field hex_num (pattern)
--   A pattern that matches a hexadecimal number.
-- @field oct_num (pattern)
--   A pattern that matches an octal number.
-- @field integer (pattern)
--   A pattern that matches either a decimal, hexadecimal, or octal number.
-- @field float (pattern)
--   A pattern that matches a floating point number.
-- @field word (pattern)
--   A pattern that matches a typical word. Words begin with a letter or
--   underscore and consist of alphanumeric and underscore characters.
-- @field FOLD_BASE (number)
--   The initial (root) fold level.
-- @field FOLD_BLANK (number)
--   Flag indicating that the line is blank.
-- @field FOLD_HEADER (number)
--   Flag indicating that the line is a fold point.
-- @field fold_level (table, Read-only)
--   Table of fold level bit-masks for line numbers starting from zero.
--   Fold level masks are composed of an integer level combined with any of the
--   following bits:
--
--   * `lexer.FOLD_BASE`
--     The initial fold level.
--   * `lexer.FOLD_BLANK`
--     The line is blank.
--   * `lexer.FOLD_HEADER`
--     The line is a header, or fold point.
-- @field indent_amount (table, Read-only)
--   Table of indentation amounts in character columns, for line numbers
--   starting from zero.
-- @field line_state (table)
--   Table of integer line states for line numbers starting from zero.
--   Line states can be used by lexers for keeping track of persistent states.
-- @field property (table)
--   Map of key-value string pairs.
-- @field property_expanded (table, Read-only)
--   Map of key-value string pairs with `$()` and `%()` variable replacement
--   performed in values.
-- @field property_int (table, Read-only)
--   Map of key-value pairs with values interpreted as numbers, or `0` if not
--   found.
-- @field style_at (table, Read-only)
--   Table of style names at positions in the buffer starting from 1.
module('lexer')]=]

--local lpeg = require('lpeg')
local lpeg_P, lpeg_R, lpeg_S, lpeg_V = lpeg.P, lpeg.R, lpeg.S, lpeg.V
local lpeg_Ct, lpeg_Cc, lpeg_Cp = lpeg.Ct, lpeg.Cc, lpeg.Cp
local lpeg_Cmt, lpeg_C = lpeg.Cmt, lpeg.C
local lpeg_match = lpeg.match

M.LEXERPATH = package.path

-- Table of loaded lexers.
local lexers = {}

-- Keep track of the last parent lexer loaded. This lexer's rules are used for
-- proxy lexers (those that load parent and child lexers to embed) that do not
-- declare a parent lexer.
local parent_lexer

if not package.searchpath then
  -- Searches for the given *name* in the given *path*.
  -- This is an implementation of Lua 5.2's `package.searchpath()` function for
  -- Lua 5.1.
  function package.searchpath(name, path)
    local tried = {}
    for part in path:gmatch('[^;]+') do
      local filename = part:gsub('%?', name)
      local f = io.open(filename, 'r')
      if f then f:close() return filename end
      tried[#tried + 1] = ("no file '%s'"):format(filename)
    end
    return nil, table.concat(tried, '\n')
  end
end

-- Adds a rule to a lexer's current ordered list of rules.
-- @param lexer The lexer to add the given rule to.
-- @param id The name associated with this rule. It is used for other lexers
--   to access this particular rule from the lexer's `_RULES` table. It does not
--   have to be the same as the name passed to `token`.
-- @param rule The LPeg pattern of the rule.
local function add_rule(lexer, id, rule)
  if not lexer._RULES then
    lexer._RULES = {}
    -- Contains an ordered list (by numerical index) of rule names. This is used
    -- in conjunction with lexer._RULES for building _TOKENRULE.
    lexer._RULEORDER = {}
  end
  lexer._RULES[id] = rule
  lexer._RULEORDER[#lexer._RULEORDER + 1] = id
end

-- Adds a new Scintilla style to Scintilla.
-- @param lexer The lexer to add the given style to.
-- @param token_name The name of the token associated with this style.
-- @param style A Scintilla style created from `style()`.
-- @see style
local function add_style(lexer, token_name, style)
  local num_styles = lexer._numstyles
  if num_styles == 32 then num_styles = num_styles + 8 end -- skip predefined
  if num_styles >= 255 then print('Too many styles defined (255 MAX)') end
  lexer._TOKENSTYLES[token_name], lexer._numstyles = num_styles, num_styles + 1
  lexer._EXTRASTYLES[token_name] = style
end

-- (Re)constructs `lexer._TOKENRULE`.
-- @param lexer The lexer to (re)construct the token rule for.
local function join_tokens(lexer)
  local patterns, order = lexer._RULES, lexer._RULEORDER
  local token_rule = patterns[order[1]]
  for i = 2, #order do token_rule = token_rule + patterns[order[i]] end
  lexer._TOKENRULE = token_rule + M.token(M.DEFAULT, M.any)
  return lexer._TOKENRULE
end

-- Adds a given lexer and any of its embedded lexers to a given grammar.
-- @param grammar The grammar to add the lexer to.
-- @param lexer The lexer to add.
local function add_lexer(grammar, lexer)
  local token_rule = join_tokens(lexer)
  local lexer_name = lexer._NAME
  for i = 1, #lexer._CHILDREN do
    local child = lexer._CHILDREN[i]
    if child._CHILDREN then add_lexer(grammar, child) end
    local child_name = child._NAME
    local rules = child._EMBEDDEDRULES[lexer_name]
    local rules_token_rule = grammar['__'..child_name] or rules.token_rule
    grammar[child_name] = (-rules.end_rule * rules_token_rule)^0 *
                          rules.end_rule^-1 * lpeg_V(lexer_name)
    local embedded_child = '_'..child_name
    grammar[embedded_child] = rules.start_rule * (-rules.end_rule *
                              rules_token_rule)^0 * rules.end_rule^-1
    token_rule = lpeg_V(embedded_child) + token_rule
  end
  grammar['__'..lexer_name] = token_rule -- can contain embedded lexer rules
  grammar[lexer_name] = token_rule^0
end

-- (Re)constructs `lexer._GRAMMAR`.
-- @param lexer The parent lexer.
-- @param initial_rule The name of the rule to start lexing with. The default
--   value is `lexer._NAME`. Multilang lexers use this to start with a child
--   rule if necessary.
local function build_grammar(lexer, initial_rule)
  local children = lexer._CHILDREN
  if children then
    local lexer_name = lexer._NAME
    if not initial_rule then initial_rule = lexer_name end
    local grammar = {initial_rule}
    add_lexer(grammar, lexer)
    lexer._INITIALRULE = initial_rule
    lexer._GRAMMAR = lpeg_Ct(lpeg_P(grammar))
  else
    lexer._GRAMMAR = lpeg_Ct(join_tokens(lexer)^0)
  end
end

local string_upper = string.upper
-- Default styles.
local default = {
  'nothing', 'whitespace', 'comment', 'string', 'number', 'keyword',
  'identifier', 'operator', 'error', 'preprocessor', 'constant', 'variable',
  'function', 'class', 'type', 'label', 'regex', 'embedded'
}
for i = 1, #default do
  local name, upper_name = default[i], string_upper(default[i])
  M[upper_name], M['STYLE_'..upper_name] = name, '$(style.'..name..')'
end
-- Predefined styles.
local predefined = {
  'default', 'linenumber', 'bracelight', 'bracebad', 'controlchar',
  'indentguide', 'calltip'
}
for i = 1, #predefined do
  local name, upper_name = predefined[i], string_upper(predefined[i])
  M[upper_name], M['STYLE_'..upper_name] = name, '$(style.'..name..')'
end

---
-- Initializes or loads and returns the lexer of string name *name*.
-- Scintilla calls this function in order to load a lexer. Parent lexers also
-- call this function in order to load child lexers and vice-versa. The user
-- calls this function in order to load a lexer when using Scintillua as a Lua
-- library.
-- @param name The name of the lexing language.
-- @param alt_name The alternate name of the lexing language. This is useful for
--   embedding the same child lexer with multiple sets of start and end tokens.
-- @return lexer object
-- @name load
function M.load(name, alt_name)
  if lexers[alt_name or name] then return lexers[alt_name or name] end
  parent_lexer = nil -- reset

  -- When using Scintillua as a stand-alone module, the `property` and
  -- `property_int` tables do not exist (they are not useful). Create them to
  -- prevent errors from occurring.
  if not M.property then
    M.property, M.property_int = {}, setmetatable({}, {
      __index = function(t, k) return tonumber(M.property[k]) or 0 end,
      __newindex = function() error('read-only property') end
    })
  end

  -- Load the language lexer with its rules, styles, etc.
  M.WHITESPACE = (alt_name or name)..'_whitespace'
  local lexer = dofile(assert(package.searchpath(name, M.LEXERPATH)))
  if alt_name then lexer._NAME = alt_name end

  -- Create the initial maps for token names to style numbers and styles.
  local token_styles = {}
  for i = 1, #default do token_styles[default[i]] = i - 1 end
  for i = 1, #predefined do token_styles[predefined[i]] = i + 31 end
  lexer._TOKENSTYLES, lexer._numstyles = token_styles, #default
  lexer._EXTRASTYLES = {}

  -- If the lexer is a proxy (loads parent and child lexers to embed) and does
  -- not declare a parent, try and find one and use its rules.
  if not lexer._rules and not lexer._lexer then lexer._lexer = parent_lexer end

  -- If the lexer is a proxy or a child that embedded itself, add its rules and
  -- styles to the parent lexer. Then set the parent to be the main lexer.
  if lexer._lexer then
    local l, _r, _s = lexer._lexer, lexer._rules, lexer._tokenstyles
    if not l._tokenstyles then l._tokenstyles = {} end
    if _r then
      for i = 1, #_r do
        -- Prevent rule id clashes.
        l._rules[#l._rules + 1] = {lexer._NAME..'_'.._r[i][1], _r[i][2]}
      end
    end
    if _s then
      for token, style in pairs(_s) do l._tokenstyles[token] = style end
    end
    lexer = l
  end

  -- Add the lexer's styles and build its grammar.
  if lexer._rules then
    if lexer._tokenstyles then
      for token, style in pairs(lexer._tokenstyles) do
        add_style(lexer, token, style)
      end
    end
    for i = 1, #lexer._rules do
      add_rule(lexer, lexer._rules[i][1], lexer._rules[i][2])
    end
    build_grammar(lexer)
  end
  -- Add the lexer's unique whitespace style.
  add_style(lexer, lexer._NAME..'_whitespace', M.STYLE_WHITESPACE)

  -- Process the lexer's fold symbols.
  if lexer._foldsymbols and lexer._foldsymbols._patterns then
    local patterns = lexer._foldsymbols._patterns
    for i = 1, #patterns do patterns[i] = '()('..patterns[i]..')' end
  end

  lexer.lex, lexer.fold = M.lex, M.fold
  lexers[alt_name or name] = lexer
  return lexer
end

---
-- Lexes a chunk of text *text* (that has an initial style number of
-- *init_style*) with lexer *lexer*.
-- If *lexer* has a `_LEXBYLINE` flag set, the text is lexed one line at a time.
-- Otherwise the text is lexed as a whole.
-- @param lexer The lexer object to lex with.
-- @param text The text in the buffer to lex.
-- @param init_style The current style. Multiple-language lexers use this to
--   determine which language to start lexing in.
-- @return table of token names and positions.
-- @name lex
function M.lex(lexer, text, init_style)
  if not lexer._GRAMMAR then return {M.DEFAULT, #text + 1} end
  if not lexer._LEXBYLINE then
    -- For multilang lexers, build a new grammar whose initial_rule is the
    -- current language.
    if lexer._CHILDREN then
      for style, style_num in pairs(lexer._TOKENSTYLES) do
        if style_num == init_style then
          local lexer_name = style:match('^(.+)_whitespace') or lexer._NAME
          if lexer._INITIALRULE ~= lexer_name then
            build_grammar(lexer, lexer_name)
          end
          break
        end
      end
    end
    return lpeg_match(lexer._GRAMMAR, text)
  else
    local tokens = {}
    local function append(tokens, line_tokens, offset)
      for i = 1, #line_tokens, 2 do
        tokens[#tokens + 1] = line_tokens[i]
        tokens[#tokens + 1] = line_tokens[i + 1] + offset
      end
    end
    local offset = 0
    local grammar = lexer._GRAMMAR
    for line in text:gmatch('[^\r\n]*\r?\n?') do
      local line_tokens = lpeg_match(grammar, line)
      if line_tokens then append(tokens, line_tokens, offset) end
      offset = offset + #line
      -- Use the default style to the end of the line if none was specified.
      if tokens[#tokens] ~= offset then
        tokens[#tokens + 1], tokens[#tokens + 2] = 'default', offset + 1
      end
    end
    return tokens
  end
end

---
-- Determines fold points in a chunk of text *text* with lexer *lexer*.
-- *text* starts at position *start_pos* on line number *start_line* with a
-- beginning fold level of *start_level* in the buffer. If *lexer* has a `_fold`
-- function or a `_foldsymbols` table, that field is used to perform folding.
-- Otherwise, if *lexer* has a `_FOLDBYINDENTATION` field set, or if a
-- `fold.by.indentation` property is set, folding by indentation is done.
-- @param lexer The lexer object to fold with.
-- @param text The text in the buffer to fold.
-- @param start_pos The position in the buffer *text* starts at, starting at
--   zero.
-- @param start_line The line number *text* starts on.
-- @param start_level The fold level *text* starts on.
-- @return table of fold levels.
-- @name fold
function M.fold(lexer, text, start_pos, start_line, start_level)
  local folds = {}
  if text == '' then return folds end
  local fold = M.property_int['fold'] > 0
  local FOLD_BASE = M.FOLD_BASE
  local FOLD_HEADER, FOLD_BLANK = M.FOLD_HEADER, M.FOLD_BLANK
  if fold and lexer._fold then
    return lexer._fold(text, start_pos, start_line, start_level)
  elseif fold and lexer._foldsymbols then
    local lines = {}
    for p, l in (text..'\n'):gmatch('()(.-)\r?\n') do
      lines[#lines + 1] = {p, l}
    end
    local fold_zero_sum_lines = M.property_int['fold.on.zero.sum.lines'] > 0
    local fold_symbols = lexer._foldsymbols
    local fold_symbols_patterns = fold_symbols._patterns
    local fold_symbols_case_insensitive = fold_symbols._case_insensitive
    local style_at, fold_level = M.style_at, M.fold_level
    local line_num, prev_level = start_line, start_level
    local current_level = prev_level
    for i = 1, #lines do
      local pos, line = lines[i][1], lines[i][2]
      if line ~= '' then
        if fold_symbols_case_insensitive then line = line:lower() end
        local level_decreased = false
        for j = 1, #fold_symbols_patterns do
          for s, match in line:gmatch(fold_symbols_patterns[j]) do
            local symbols = fold_symbols[style_at[start_pos + pos + s - 1]]
            local l = symbols and symbols[match]
            if type(l) == 'function' then l = l(text, pos, line, s, match) end
            if type(l) == 'number' then
              current_level = current_level + l
              if l < 0 and current_level < prev_level then
                -- Potential zero-sum line. If the level were to go back up on
                -- the same line, the line may be marked as a fold header.
                level_decreased = true
              end
            end
          end
        end
        folds[line_num] = prev_level
        if current_level > prev_level then
          folds[line_num] = prev_level + FOLD_HEADER
        elseif level_decreased and current_level == prev_level and
               fold_zero_sum_lines then
          if line_num > start_line then
            folds[line_num] = prev_level - 1 + FOLD_HEADER
          else
            -- Typing within a zero-sum line.
            local level = fold_level[line_num - 1] - 1
            if level > FOLD_HEADER then level = level - FOLD_HEADER end
            if level > FOLD_BLANK then level = level - FOLD_BLANK end
            folds[line_num] = level + FOLD_HEADER
            current_level = current_level + 1
          end
        end
        if current_level < FOLD_BASE then current_level = FOLD_BASE end
        prev_level = current_level
      else
        folds[line_num] = prev_level + FOLD_BLANK
      end
      line_num = line_num + 1
    end
  elseif fold and (lexer._FOLDBYINDENTATION or
                   M.property_int['fold.by.indentation'] > 0) then
    -- Indentation based folding.
    -- Calculate indentation per line.
    local indentation = {}
    for indent, line in (text..'\n'):gmatch('([\t ]*)([^\r\n]*)\r?\n') do
      indentation[#indentation + 1] = line ~= '' and #indent
    end
    -- Find the first non-blank line before start_line. If the current line is
    -- indented, make that previous line a header and update the levels of any
    -- blank lines in between. If the current line is blank, match the level of
    -- the previous non-blank line.
    local current_level = start_level
    for i = start_line - 1, 0, -1 do
      local level = M.fold_level[i]
      if level >= FOLD_HEADER then level = level - FOLD_HEADER end
      if level < FOLD_BLANK then
        local indent = M.indent_amount[i]
        if indentation[1] and indentation[1] > indent then
          folds[i] = FOLD_BASE + indent + FOLD_HEADER
          for j = i + 1, start_line - 1 do
            folds[j] = start_level + FOLD_BLANK
          end
        elseif not indentation[1] then
          current_level = FOLD_BASE + indent
        end
        break
      end
    end
    -- Iterate over lines, setting fold numbers and fold flags.
    for i = 1, #indentation do
      if indentation[i] then
        current_level = FOLD_BASE + indentation[i]
        folds[start_line + i - 1] = current_level
        for j = i + 1, #indentation do
          if indentation[j] then
            if FOLD_BASE + indentation[j] > current_level then
              folds[start_line + i - 1] = current_level + FOLD_HEADER
              current_level = FOLD_BASE + indentation[j] -- for any blanks below
            end
            break
          end
        end
      else
        folds[start_line + i - 1] = current_level + FOLD_BLANK
      end
    end
  else
    -- No folding, reset fold levels if necessary.
    local current_line = start_line
    for _ in text:gmatch('\r?\n') do
      folds[current_line] = start_level
      current_line = current_line + 1
    end
  end
  return folds
end

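-- As an illustration of the `_foldsymbols` branch above, a language lexer
-- might declare its fold points like this (a minimal sketch; the patterns and
-- symbols are illustrative, not taken from any particular lexer):
--
--     M._foldsymbols = {
--       _patterns = {'[{}]', '/%*', '%*/'},
--       [l.OPERATOR] = {['{'] = 1, ['}'] = -1},
--       [l.COMMENT] = {['/*'] = 1, ['*/'] = -1}
--     }
--
-- Whitespace-significant languages can instead set
-- `M._FOLDBYINDENTATION = true` to use the indentation-based branch.
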
-- The following are utility functions lexers will have access to.

-- Common patterns.
M.any = lpeg_P(1)
M.ascii = lpeg_R('\000\127')
M.extend = lpeg_R('\000\255')
M.alpha = lpeg_R('AZ', 'az')
M.digit = lpeg_R('09')
M.alnum = lpeg_R('AZ', 'az', '09')
M.lower = lpeg_R('az')
M.upper = lpeg_R('AZ')
M.xdigit = lpeg_R('09', 'AF', 'af')
M.cntrl = lpeg_R('\000\031')
M.graph = lpeg_R('!~')
M.print = lpeg_R(' ~')
M.punct = lpeg_R('!/', ':@', '[`', '{~')
M.space = lpeg_S('\t\v\f\n\r ')

M.newline = lpeg_S('\r\n\f')^1
M.nonnewline = 1 - M.newline
M.nonnewline_esc = 1 - (M.newline + '\\') + '\\' * M.any

M.dec_num = M.digit^1
M.hex_num = '0' * lpeg_S('xX') * M.xdigit^1
M.oct_num = '0' * lpeg_R('07')^1
M.integer = lpeg_S('+-')^-1 * (M.hex_num + M.oct_num + M.dec_num)
M.float = lpeg_S('+-')^-1 *
          ((M.digit^0 * '.' * M.digit^1 + M.digit^1 * '.' * M.digit^0) *
           (lpeg_S('eE') * lpeg_S('+-')^-1 * M.digit^1)^-1 +
           (M.digit^1 * lpeg_S('eE') * lpeg_S('+-')^-1 * M.digit^1))

M.word = (M.alpha + '_') * (M.alnum + '_')^0

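-- These shorthand patterns compose directly into tokens. For example, generic
-- number and identifier tokens might look like this (a sketch using the
-- definitions above; `l` is the name lexers conventionally give this module):
--
--     local number = token(l.NUMBER, l.float + l.integer)
--     local identifier = token(l.IDENTIFIER, l.word)
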
---
-- Creates and returns a token pattern with token name *name* and pattern
-- *patt*.
-- If *name* is not a predefined token name, its style must be defined in the
-- lexer's `_tokenstyles` table.
-- @param name The name of the token. If this name is not a predefined token
--   name, then a style needs to be associated with it in the lexer's
--   `_tokenstyles` table.
-- @param patt The LPeg pattern associated with the token.
-- @return pattern
-- @usage local ws = token(l.WHITESPACE, l.space^1)
-- @usage local annotation = token('annotation', '@' * l.word)
-- @name token
function M.token(name, patt)
  return lpeg_Cc(name) * patt * lpeg_Cp()
end

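-- The returned pattern produces two captures per match: the token name and
-- the position immediately following the matched text. For instance, matching
-- the `ws` pattern from the @usage above against "  foo" captures
-- 'whitespace' and 3 (an illustrative walkthrough).
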
---
-- Creates and returns a pattern that matches a range of text bounded by
-- *chars* characters.
-- This is a convenience function for matching more complicated delimited ranges
-- like strings with escape characters and balanced parentheses. *single_line*
-- indicates whether or not the range must be on a single line, *no_escape*
-- indicates whether or not to ignore '\' as an escape character, and *balanced*
-- indicates whether or not to handle balanced ranges like parentheses, which
-- requires *chars* to be composed of two characters.
-- @param chars The character(s) that bound the matched range.
-- @param single_line Optional flag indicating whether or not the range must be
--   on a single line.
-- @param no_escape Optional flag indicating whether or not to disable '\\' as
--   an escape character within the range.
-- @param balanced Optional flag indicating whether or not to match a balanced
--   range, like the "%b" Lua pattern. This flag only applies if *chars*
--   consists of two different characters (e.g. "()").
-- @return pattern
-- @usage local dq_str_escapes = l.delimited_range('"')
-- @usage local dq_str_noescapes = l.delimited_range('"', false, true)
-- @usage local unbalanced_parens = l.delimited_range('()')
-- @usage local balanced_parens = l.delimited_range('()', false, false, true)
-- @see nested_pair
-- @name delimited_range
function M.delimited_range(chars, single_line, no_escape, balanced)
  local s = chars:sub(1, 1)
  local e = #chars == 2 and chars:sub(2, 2) or s
  local range
  local b = balanced and s or ''
  local n = single_line and '\n' or ''
  if no_escape then
    local invalid = lpeg_S(e..n..b)
    range = M.any - invalid
  else
    local invalid = lpeg_S(e..n..b) + '\\'
    range = M.any - invalid + '\\' * M.any
  end
  if balanced and s ~= e then
    return lpeg_P{s * (range + lpeg_V(1))^0 * e}
  else
    return s * range^0 * lpeg_P(e)^-1
  end
end

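-- For instance, a string token recognizing both single- and double-quoted
-- strings could be written as (a sketch, not from the original docs):
--
--     local str = token(l.STRING, l.delimited_range("'") +
--                                 l.delimited_range('"'))
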
---
-- Creates and returns a pattern that matches pattern *patt* only at the
-- beginning of a line.
-- @param patt The LPeg pattern to match on the beginning of a line.
-- @return pattern
-- @usage local preproc = token(l.PREPROCESSOR, l.starts_line('#') *
--   l.nonnewline^0)
-- @name starts_line
function M.starts_line(patt)
  return lpeg_Cmt(lpeg_C(patt), function(input, index, match, ...)
    local pos = index - #match
    if pos == 1 then return index, ... end
    local char = input:sub(pos - 1, pos - 1)
    if char == '\n' or char == '\r' or char == '\f' then return index, ... end
  end)
end

---
-- Creates and returns a pattern that matches only when the first
-- non-whitespace character behind the current match position is in string set
-- *s*.
-- @param s String character set like one passed to `lpeg.S()`.
-- @return pattern
-- @usage local regex = l.last_char_includes('+-*!%^&|=,([{') *
--   l.delimited_range('/')
-- @name last_char_includes
function M.last_char_includes(s)
  s = '['..s:gsub('[-%%%[]', '%%%1')..']'
  return lpeg_P(function(input, index)
    if index == 1 then return index end
    local i = index
    while input:sub(i - 1, i - 1):match('[ \t\r\n\f]') do i = i - 1 end
    if input:sub(i - 1, i - 1):match(s) then return index end
  end)
end

---
-- Returns a pattern that matches a balanced range of text that starts with
-- string *start_chars* and ends with string *end_chars*.
-- With single-character delimiters, this function is identical to
-- `delimited_range(start_chars..end_chars, false, true, true)`.
-- @param start_chars The string starting a nested sequence.
-- @param end_chars The string ending a nested sequence.
-- @return pattern
-- @usage local nested_comment = l.nested_pair('/*', '*/')
-- @see delimited_range
-- @name nested_pair
function M.nested_pair(start_chars, end_chars)
  local s, e = start_chars, lpeg_P(end_chars)^-1
  return lpeg_P{s * (M.any - s - end_chars + lpeg_V(1))^0 * e}
end

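-- Because the grammar above recurses on *start_chars*, `nested_comment` from
-- the @usage line matches the entirety of "/* a /* nested */ comment */"
-- instead of stopping at the first "*/" (an illustrative example).
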
---
-- Creates and returns a pattern that matches any single word in list *words*.
-- Words consist of alphanumeric and underscore characters, as well as the
-- characters in string set *word_chars*. *case_insensitive* indicates whether
-- or not to ignore case when matching words.
-- This is a convenience function for simplifying a set of ordered choice word
-- patterns.
-- @param words A table of words.
-- @param word_chars Optional string of additional characters considered to be
--   part of a word. By default, word characters are alphanumerics and
--   underscores ("%w_" in Lua). This parameter may be `nil` or the empty string
--   in order to indicate no additional word characters.
-- @param case_insensitive Optional boolean flag indicating whether or not the
--   word match is case-insensitive. The default is `false`.
-- @return pattern
-- @usage local keyword = token(l.KEYWORD, word_match{'foo', 'bar', 'baz'})
-- @usage local keyword = token(l.KEYWORD, word_match({'foo-bar', 'foo-baz',
--   'bar-foo', 'bar-baz', 'baz-foo', 'baz-bar'}, '-', true))
-- @name word_match
function M.word_match(words, word_chars, case_insensitive)
  local word_list = {}
  for i = 1, #words do
    word_list[case_insensitive and words[i]:lower() or words[i]] = true
  end
  local chars = M.alnum + '_'
  if word_chars then chars = chars + lpeg_S(word_chars) end
  -- Capture the matched word so the match-time function receives it.
  return lpeg_Cmt(lpeg_C(chars^1), function(input, index, word)
    if case_insensitive then word = word:lower() end
    return word_list[word] and index or nil
  end)
end

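-- Note the design: rather than building an ordered choice of word patterns,
-- `word_match` matches a maximal run of word characters and then checks it
-- against a hash table, which avoids re-scanning the input for each word in
-- the list.
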
---
-- Embeds child lexer *child* in parent lexer *parent* using patterns
-- *start_rule* and *end_rule*, which signal the beginning and end of the
-- embedded lexer, respectively.
-- @param parent The parent lexer.
-- @param child The child lexer.
-- @param start_rule The pattern that signals the beginning of the embedded
--   lexer.
-- @param end_rule The pattern that signals the end of the embedded lexer.
-- @usage l.embed_lexer(M, css, css_start_rule, css_end_rule)
-- @usage l.embed_lexer(html, M, php_start_rule, php_end_rule)
-- @usage l.embed_lexer(html, ruby, ruby_start_rule, ruby_end_rule)
-- @name embed_lexer
function M.embed_lexer(parent, child, start_rule, end_rule)
  -- Add child rules.
  if not child._EMBEDDEDRULES then child._EMBEDDEDRULES = {} end
  if not child._RULES then -- creating a child lexer to be embedded
    if not child._rules then error('Cannot embed language with no rules') end
    for i = 1, #child._rules do
      add_rule(child, child._rules[i][1], child._rules[i][2])
    end
  end
  child._EMBEDDEDRULES[parent._NAME] = {
    start_rule = start_rule,
    token_rule = join_tokens(child),
    end_rule = end_rule
  }
  if not parent._CHILDREN then parent._CHILDREN = {} end
  local children = parent._CHILDREN
  children[#children + 1] = child
  -- Add child styles.
  if not parent._tokenstyles then parent._tokenstyles = {} end
  local tokenstyles = parent._tokenstyles
  tokenstyles[child._NAME..'_whitespace'] = M.STYLE_WHITESPACE
  if child._tokenstyles then
    for token, style in pairs(child._tokenstyles) do
      tokenstyles[token] = style
    end
  end
  child._lexer = parent -- use parent's tokens if child is embedding itself
  parent_lexer = parent -- use parent's tokens if the calling lexer is a proxy
end

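-- As a fuller illustration of the first @usage line above, an HTML lexer
-- embedding CSS might do something like the following (a sketch; the
-- start/end rule patterns are simplified stand-ins for the real HTML lexer's
-- rules):
--
--     local css = l.load('css')
--     -- Start CSS highlighting at an opening <style> tag and stop it at the
--     -- closing </style> tag.
--     local css_start_rule = #P('<style') * M._RULES['tag']
--     local css_end_rule = #P('</style>') * M._RULES['tag']
--     l.embed_lexer(M, css, css_start_rule, css_end_rule)
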
-- Determines if the previous line is a comment.
-- This is used for determining if the current comment line is a fold point.
-- @param prefix The prefix string defining a comment.
-- @param text The text passed to a fold function.
-- @param pos The pos passed to a fold function.
-- @param line The line passed to a fold function.
-- @param s The s passed to a fold function.
local function prev_line_is_comment(prefix, text, pos, line, s)
  local start = line:find('%S')
  if start < s and not line:find(prefix, start, true) then return false end
  local p = pos - 1
  if text:sub(p, p) == '\n' then
    p = p - 1
    if text:sub(p, p) == '\r' then p = p - 1 end
    if text:sub(p, p) ~= '\n' then
      while p > 1 and text:sub(p - 1, p - 1) ~= '\n' do p = p - 1 end
      while text:sub(p, p):find('^[\t ]$') do p = p + 1 end
      return text:sub(p, p + #prefix - 1) == prefix
    end
  end
  return false
end

-- Determines if the next line is a comment.
-- This is used for determining if the current comment line is a fold point.
-- @param prefix The prefix string defining a comment.
-- @param text The text passed to a fold function.
-- @param pos The pos passed to a fold function.
-- @param line The line passed to a fold function.
-- @param s The s passed to a fold function.
local function next_line_is_comment(prefix, text, pos, line, s)
  local p = text:find('\n', pos + s)
  if p then
    p = p + 1
    while text:sub(p, p):find('^[\t ]$') do p = p + 1 end
    return text:sub(p, p + #prefix - 1) == prefix
  end
  return false
end

---
-- Returns a fold function (to be used within the lexer's `_foldsymbols` table)
-- that folds consecutive line comments that start with string *prefix*.
-- @param prefix The prefix string defining a line comment.
-- @usage [l.COMMENT] = {['--'] = l.fold_line_comments('--')}
-- @usage [l.COMMENT] = {['//'] = l.fold_line_comments('//')}
-- @name fold_line_comments
function M.fold_line_comments(prefix)
  local property_int = M.property_int
  return function(text, pos, line, s)
    if property_int['fold.line.comments'] == 0 then return 0 end
    if s > 1 and line:match('^%s*()') < s then return 0 end
    local prev_line_comment = prev_line_is_comment(prefix, text, pos, line, s)
    local next_line_comment = next_line_is_comment(prefix, text, pos, line, s)
    if not prev_line_comment and next_line_comment then return 1 end
    if prev_line_comment and not next_line_comment then return -1 end
    return 0
  end
end

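-- For example, in a run of three consecutive '--' comment lines, the returned
-- function yields `1` on the first line (no comment above, comment below),
-- `0` on the middle line, and `-1` on the last, so the first line becomes the
-- fold header for the run (an illustrative walkthrough, not from the original
-- docs).
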
M.property_expanded = setmetatable({}, {
  -- Returns the string property value associated with string property *key*,
  -- replacing any "$()" and "%()" expressions with the values of their keys.
  __index = function(t, key)
    return M.property[key]:gsub('[$%%]%b()', function(key)
      return t[key:sub(3, -2)]
    end)
  end,
  __newindex = function() error('read-only property') end
})

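-- For example, given the (illustrative) properties `property['os'] = 'unix'`
-- and `property['kind'] = '$(os)-like'`, indexing `property_expanded['kind']`
-- yields "unix-like". Expansion is recursive, so nested "$()" and "%()"
-- references are resolved as well.
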
--[[ The functions and fields below were defined in C.

---
-- Returns the line number of the line that contains position *pos*, where
-- *pos* starts from 1.
-- @param pos The position to get the line number of.
-- @return number
local function line_from_position(pos) end

---
-- Individual fields for a lexer instance.
-- @field _NAME The string name of the lexer.
-- @field _rules An ordered list of rules for a lexer grammar.
--   Each rule is a table containing an arbitrary rule name and the LPeg pattern
--   associated with the rule. The order of rules is important, as rules are
--   matched sequentially.
--   Child lexers should not use this table to access and/or modify their
--   parent's rules and vice-versa. Use the `_RULES` table instead.
-- @field _tokenstyles A map of non-predefined token names to styles.
--   Remember to use token names, not rule names. It is recommended to use
--   predefined styles or color-agnostic styles derived from predefined styles
--   to ensure compatibility with user color themes.
-- @field _foldsymbols A table of recognized fold points for the lexer.
--   Keys are token names with table values defining fold points. Those table
--   values have string keys of keywords or characters that indicate a fold
--   point whose values are integers. A value of `1` indicates a beginning fold
--   point and a value of `-1` indicates an ending fold point. Values can also
--   be functions that return `1`, `-1`, or `0` (indicating no fold point) for
--   keys which need additional processing.
--   There is also a required `_patterns` key whose value is a table containing
--   Lua pattern strings that match all fold points (the string keys contained
--   in token name table values). When the lexer encounters text that matches
--   one of those patterns, the matched text is looked up in its token's table
--   to determine whether or not it is a fold point.
--   There is also an optional `_case_insensitive` key that indicates whether
--   or not fold point keys are case-insensitive. If `true`, fold point keys
--   should be in lower case.
-- @field _fold If this function exists in the lexer, it is called for folding
--   the document instead of using `_foldsymbols` or indentation.
-- @field _lexer The parent lexer object whose rules should be used. This field
--   is only necessary to disambiguate a proxy lexer that loaded parent and
--   child lexers for embedding and ended up having multiple parents loaded.
-- @field _RULES A map of rule name keys with their associated LPeg pattern
--   values for the lexer.
--   This is constructed from the lexer's `_rules` table and accessible to other
--   lexers for embedded lexer applications like modifying parent or child
--   rules.
-- @field _LEXBYLINE Indicates the lexer can only process one whole line of text
--   (instead of an arbitrary chunk of text) at a time.
--   The default value is `false`. Line lexers cannot look ahead to subsequent
--   lines.
-- @field _FOLDBYINDENTATION Declares the lexer does not define fold points and
--   that fold points should be calculated based on changes in indentation.
-- @class table
-- @name lexer
local lexer
]]

return M