## Scintillua API Documentation

### Overview

The Scintillua Scintilla lexer has its own API in order to avoid any
modifications to Scintilla itself. It is invoked using
[`SCI_PRIVATELEXERCALL`][]. Please note that some of the names of the API
calls do not make perfect sense. This is a tradeoff that keeps Scintilla
unmodified.

[`SCI_PRIVATELEXERCALL`]: https://scintilla.org/ScintillaDoc.html#LexerObjects

The following notation is used:

    SCI_PRIVATELEXERCALL (int operation, void *pointer)

This means you would call Scintilla like this:

    SendScintilla(sci, SCI_PRIVATELEXERCALL, operation, pointer);

### Scintillua Usage Example

Here is a pseudo-code example:

    init_app() {
      SetLibraryProperty("lpeg.home", "/home/mitchell/app/lexers")
      SetLibraryProperty("lpeg.color.theme", "light")
      sci = scintilla_new()
    }

    create_doc() {
      doc = SendScintilla(sci, SCI_CREATEDOCUMENT)
      SendScintilla(sci, SCI_SETDOCPOINTER, 0, doc)
      SendScintilla(sci, SCI_SETILEXER, 0, CreateLexer(NULL))
      fn = SendScintilla(sci, SCI_GETDIRECTFUNCTION)
      SendScintilla(sci, SCI_PRIVATELEXERCALL, SCI_GETDIRECTFUNCTION, fn)
      psci = SendScintilla(sci, SCI_GETDIRECTPOINTER)
      SendScintilla(sci, SCI_PRIVATELEXERCALL, SCI_SETDOCPOINTER, psci)
      SendScintilla(sci, SCI_PRIVATELEXERCALL, SCI_SETLEXERLANGUAGE, "lua")
    }

    set_lexer(lang) {
      psci = SendScintilla(sci, SCI_GETDIRECTPOINTER)
      SendScintilla(sci, SCI_PRIVATELEXERCALL, SCI_SETDOCPOINTER, psci)
      SendScintilla(sci, SCI_PRIVATELEXERCALL, SCI_SETLEXERLANGUAGE, lang)
    }

### Functions defined by `Scintillua`

<a id="SCI_CHANGELEXERSTATE"></a>
#### `SCI_PRIVATELEXERCALL`(SCI\_CHANGELEXERSTATE, lua)

Tells Scintillua to use `lua` as its Lua state instead of creating a separate
state.

`lua` must have already opened the "base", "string", "table", and "lpeg"
libraries.

Scintillua will create a single `lexer` package (that can be used with Lua's
`require()`), as well as a number of other variables in the
`LUA_REGISTRYINDEX` table with the "sci_" prefix.

Instead of including the path to Scintillua's lexers in the `package.path` of
the given Lua state, set the "lexer.lpeg.home" property appropriately.
Scintillua uses that property to find and load lexers.

Fields:

* `SCI_CHANGELEXERSTATE`:
* `lua`: (`lua_State *`) The Lua state to use.

Usage:

* `lua = luaL_newstate()`
* `SendScintilla(sci, SCI_PRIVATELEXERCALL, SCI_CHANGELEXERSTATE, lua)`

<a id="SCI_GETDIRECTFUNCTION"></a>
#### `SCI_PRIVATELEXERCALL`(SCI\_GETDIRECTFUNCTION, SciFnDirect)

Tells Scintillua the address of the function that handles Scintilla messages.

Despite the name `SCI_GETDIRECTFUNCTION`, it only notifies Scintillua of the
value of `SciFnDirect` obtained from [`SCI_GETDIRECTFUNCTION`][]. It does not
return anything.

Use this call if you would like the Scintillua lexer to set all Lua LPeg
lexer styles automatically. This is useful for maintaining a consistent color
theme. Do not use it if your application maintains its own color theme.

If you use this call, it *must* be made *once* for each Scintilla buffer that
was created using [`SCI_CREATEDOCUMENT`][]. You must also use the
[`SCI_SETDOCPOINTER()`](#SCI_SETDOCPOINTER) Scintillua API call.

[`SCI_GETDIRECTFUNCTION`]: https://scintilla.org/ScintillaDoc.html#SCI_GETDIRECTFUNCTION
[`SCI_CREATEDOCUMENT`]: https://scintilla.org/ScintillaDoc.html#SCI_CREATEDOCUMENT

Fields:

* `SCI_GETDIRECTFUNCTION`:
* `SciFnDirect`: The pointer returned by [`SCI_GETDIRECTFUNCTION`][].

Usage:

* `fn = SendScintilla(sci, SCI_GETDIRECTFUNCTION)`
* `SendScintilla(sci, SCI_PRIVATELEXERCALL, SCI_GETDIRECTFUNCTION, fn)`

See also:

* [`SCI_SETDOCPOINTER`](#SCI_SETDOCPOINTER)

<a id="SCI_GETLEXERLANGUAGE"></a>
#### `SCI_PRIVATELEXERCALL`(SCI\_GETLEXERLANGUAGE, languageName)

Returns the length of the string name of the current Lua LPeg lexer, or
stores the name into the given buffer. If the buffer is long enough, the name
is terminated by a `0` character.

For parent lexers with embedded children or child lexers embedded into
parents, the name is in "lexer/current" format, where "lexer" is the actual
lexer's name and "current" is the parent or child lexer at the current caret
position. In order for this to work, you must have called
[`SCI_GETDIRECTFUNCTION`](#SCI_GETDIRECTFUNCTION) and
[`SCI_SETDOCPOINTER`](#SCI_SETDOCPOINTER).

Fields:

* `SCI_GETLEXERLANGUAGE`:
* `languageName`: (`char *`) If `NULL`, returns the length that should be
  allocated to store the string Lua LPeg lexer name. Otherwise fills the
  buffer with the name.

<a id="SCI_GETNAMEDSTYLES"></a>
#### `SCI_PRIVATELEXERCALL`(SCI\_GETNAMEDSTYLES, styleName)

Returns the style number associated with *styleName*, or `STYLE_DEFAULT`
if *styleName* is not known.

Fields:

* `SCI_GETNAMEDSTYLES`:
* `styleName`: (`const char *`) Style name to get the style number of.

Usage:

* `SendScintilla(sci, SCI_PRIVATELEXERCALL, SCI_GETNAMEDSTYLES, "error")`
* `SendScintilla(sci, SCI_ANNOTATIONSETSTYLE, line, style) // match error style`

<a id="SCI_GETSTATUS"></a>
#### `SCI_PRIVATELEXERCALL`(SCI\_GETSTATUS)

Returns the error message of the Scintillua or Lua LPeg lexer error that
occurred (if any).

If no error occurred, the returned message will be empty.

Since Scintillua does not throw errors as they occur, errors can only be
handled passively. Note that Scintillua prints all errors to stderr.

Fields:

* `SCI_GETSTATUS`:

Usage:

* `SendScintilla(sci, SCI_PRIVATELEXERCALL, SCI_GETSTATUS, errmsg)`
* `if (strlen(errmsg) > 0) { /* handle error */ }`

<a id="SCI_LOADLEXERLIBRARY"></a>
#### `SCI_PRIVATELEXERCALL`(SCI\_LOADLEXERLIBRARY, path)

Tells Scintillua that the given path is where Scintillua's lexers are
located, or is a path that contains additional lexers and/or themes to load
(e.g. user-defined lexers/themes).

This call may be made multiple times in order to support lexers and themes
across multiple directories.

Fields:

* `SCI_LOADLEXERLIBRARY`:
* `path`: (`const char *`) A path containing Scintillua lexers and/or
  themes.

Usage:

* `SendScintilla(sci, SCI_PRIVATELEXERCALL, SCI_LOADLEXERLIBRARY,
  "path/to/lexers")`

<a id="SCI_PROPERTYNAMES"></a>
#### `SCI_PRIVATELEXERCALL`(SCI\_PROPERTYNAMES, names)

Returns the length of a '\n'-separated list of known lexer names, or stores
the lexer list into the given buffer. If the buffer is long enough, the
string is terminated by a `0` character.

The lexers in this list can be passed to the
[`SCI_SETLEXERLANGUAGE`](#SCI_SETLEXERLANGUAGE) Scintillua API call.

Fields:

* `SCI_PROPERTYNAMES`:
* `names`: (`char *`) If `NULL`, returns the length that should be
  allocated to store the list of lexer names. Otherwise fills the buffer with
  the names.

Usage:

* `SendScintilla(sci, SCI_PRIVATELEXERCALL, SCI_PROPERTYNAMES, lexers)`
* `// lexers now contains a '\n'-separated list of known lexer names`

See also:

* [`SCI_SETLEXERLANGUAGE`](#SCI_SETLEXERLANGUAGE)

<a id="SCI_SETDOCPOINTER"></a>
#### `SCI_PRIVATELEXERCALL`(SCI\_SETDOCPOINTER, sci)

Tells Scintillua the address of the Scintilla window currently in use.

Despite the name `SCI_SETDOCPOINTER`, it has no relationship to Scintilla
documents.

Use this call only if you are using the
[`SCI_GETDIRECTFUNCTION()`](#SCI_GETDIRECTFUNCTION) Scintillua API call. It
*must* be made *before* each call to the
[`SCI_SETLEXERLANGUAGE()`](#SCI_SETLEXERLANGUAGE) Scintillua API call.

Fields:

* `SCI_SETDOCPOINTER`:
* `sci`: The pointer returned by [`SCI_GETDIRECTPOINTER`][].

[`SCI_GETDIRECTPOINTER`]: https://scintilla.org/ScintillaDoc.html#SCI_GETDIRECTPOINTER

Usage:

* `SendScintilla(sci, SCI_PRIVATELEXERCALL, SCI_SETDOCPOINTER, sci)`

See also:

* [`SCI_GETDIRECTFUNCTION`](#SCI_GETDIRECTFUNCTION)
* [`SCI_SETLEXERLANGUAGE`](#SCI_SETLEXERLANGUAGE)

<a id="SCI_SETLEXERLANGUAGE"></a>
#### `SCI_PRIVATELEXERCALL`(SCI\_SETLEXERLANGUAGE, languageName)

Sets the current Lua LPeg lexer to `languageName`.

If you are having the Scintillua lexer set the Lua LPeg lexer styles
automatically, make sure you call the
[`SCI_SETDOCPOINTER()`](#SCI_SETDOCPOINTER) Scintillua API *first*.

Fields:

* `SCI_SETLEXERLANGUAGE`:
* `languageName`: (`const char *`) The name of the Lua LPeg lexer to use.

Usage:

* `SendScintilla(sci, SCI_PRIVATELEXERCALL, SCI_SETLEXERLANGUAGE, "lua")`

See also:

* [`SCI_SETDOCPOINTER`](#SCI_SETDOCPOINTER)
* [`SCI_PROPERTYNAMES`](#SCI_PROPERTYNAMES)

<a id="styleNum"></a>
#### `SCI_PRIVATELEXERCALL`(styleNum, style)

Returns the length of the associated SciTE-formatted style definition for the
given style number, or stores that string into the given buffer. If the
buffer is long enough, the string is terminated by a `0` character.

Please see the [SciTE documentation][] for the style definition format
specified by `style.*.stylenumber`. You can parse these definitions to set
Lua LPeg lexer styles manually if you choose not to have them set
automatically using the [`SCI_GETDIRECTFUNCTION()`](#SCI_GETDIRECTFUNCTION)
and [`SCI_SETDOCPOINTER()`](#SCI_SETDOCPOINTER) Scintillua API calls.

[SciTE documentation]: https://scintilla.org/SciTEDoc.html

Fields:

* `styleNum`: (`int`) For the range `-STYLE_MAX <= styleNum < 0`, uses the
  Scintilla style number `-styleNum - 1` for returning SciTE-formatted style
  definitions. (Style `0` would be `-1`, style `1` would be `-2`, and so on.)
* `style`: (`char *`) If `NULL`, returns the length that should be
  allocated to store the associated string. Otherwise fills the buffer with
  the string.


---
<a id="lexer"></a>
## The `lexer` Lua Module
---

Lexes Scintilla documents and source code with Lua and LPeg.

### Writing Lua Lexers

Lexers highlight the syntax of source code. Scintilla (the editing component
behind [Textadept][] and [SciTE][]) traditionally uses static, compiled C++
lexers which are notoriously difficult to create and/or extend. On the other
hand, Lua makes it easy to rapidly create new lexers, extend existing ones,
and embed lexers within one another. Lua lexers tend to be more readable than
C++ lexers too.

Lexers are Parsing Expression Grammars, or PEGs, composed with the Lua
[LPeg library][]. The following table comes from the LPeg documentation and
summarizes all you need to know about constructing basic LPeg patterns. This
module provides convenience functions for creating and working with other,
more advanced patterns and concepts.

Operator             | Description
---------------------|------------
`lpeg.P(string)`     | Matches `string` literally.
`lpeg.P(`_`n`_`)`    | Matches exactly _`n`_ characters.
`lpeg.S(string)`     | Matches any character in set `string`.
`lpeg.R("`_`xy`_`")` | Matches any character between range `x` and `y`.
`patt^`_`n`_         | Matches at least _`n`_ repetitions of `patt`.
`patt^-`_`n`_        | Matches at most _`n`_ repetitions of `patt`.
`patt1 * patt2`      | Matches `patt1` followed by `patt2`.
`patt1 + patt2`      | Matches `patt1` or `patt2` (ordered choice).
`patt1 - patt2`      | Matches `patt1` if `patt2` does not also match.
`-patt`              | Equivalent to `("" - patt)`.
`#patt`              | Matches `patt` but consumes no input.

The first part of this document deals with rapidly constructing a simple
lexer. The next part deals with more advanced techniques, such as custom
coloring and embedding lexers within one another. Following that is a
discussion about code folding, or being able to tell Scintilla which code
blocks are "foldable" (temporarily hideable from view). After that are
instructions on how to use Lua lexers with the aforementioned Textadept and
SciTE editors. Finally there are comments on lexer performance and
limitations.

[LPeg library]: http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html
[Textadept]: https://orbitalquark.github.io/textadept
[SciTE]: https://scintilla.org/SciTE.html

### Lexer Basics

The *lexers/* directory contains all lexers, including your new one.
Before attempting to write one from scratch though, first determine if your
programming language is similar to any of the 100+ languages supported. If
so, you may be able to copy and modify that lexer, saving some time and
effort. The filename of your lexer should be the name of your programming
language in lower case followed by a *.lua* extension. For example, a new Lua
lexer has the name *lua.lua*.

Note: Try to refrain from using one-character language names like "c", "d",
or "r". For example, Scintillua uses "ansi_c", "dmd", and "rstats",
respectively.

#### New Lexer Template

There is a *lexers/template.txt* file that contains a simple template for a
new lexer. Feel free to use it, replacing the '?'s with the name of your
lexer. Consider this snippet from the template:

    -- ? LPeg lexer.

    local lexer = require('lexer')
    local token, word_match = lexer.token, lexer.word_match
    local P, S = lpeg.P, lpeg.S

    local lex = lexer.new('?')

    -- Whitespace.
    local ws = token(lexer.WHITESPACE, lexer.space^1)
    lex:add_rule('whitespace', ws)

    [...]

    return lex

The first three lines of code simply define often-used convenience variables.
The fourth and last lines [define](#lexer.new) and return the lexer object
Scintilla uses; they are very important and must be part of every lexer. The
fifth line defines something called a "token", an essential building block of
lexers. You will learn about tokens shortly. The sixth line defines a lexer
grammar rule, which you will learn about later, as well as token styles. (Be
aware that it is common practice to combine these two lines for short rules.)
Note, however, the `local` prefix in front of variables, which is needed so
as not to affect Lua's global environment. All in all, this is a minimal,
working lexer that you can build on.
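
To see how little is needed beyond the template, here is a sketch that fleshes
it out into a complete lexer for a hypothetical INI-like language. The
language, its name "rc", and its keyword set are invented for illustration;
the helper patterns used (`lexer.word_match()`, `lexer.to_eol()`,
`lexer.range()`, `lexer.word`, and `lexer.number`) are all described later in
this document.

```lua
-- rc LPeg lexer (hypothetical language, for illustration only).

local lexer = require('lexer')
local token, word_match = lexer.token, lexer.word_match
local P, S = lpeg.P, lpeg.S

local lex = lexer.new('rc')

-- Whitespace.
lex:add_rule('whitespace', token(lexer.WHITESPACE, lexer.space^1))

-- Keywords (an invented set).
lex:add_rule('keyword', token(lexer.KEYWORD, word_match[[true false on off]]))

-- Identifiers.
lex:add_rule('identifier', token(lexer.IDENTIFIER, lexer.word))

-- Strings (double-quoted).
lex:add_rule('string', token(lexer.STRING, lexer.range('"')))

-- Comments ('#' to the end of the line).
lex:add_rule('comment', token(lexer.COMMENT, lexer.to_eol('#')))

-- Numbers.
lex:add_rule('number', token(lexer.NUMBER, lexer.number))

-- Operators.
lex:add_rule('operator', token(lexer.OPERATOR, S('=[]')))

return lex
```

Because only predefined token names are used, this sketch needs no style
definitions at all; the editor's color theme supplies them.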
387 388#### Tokens 389 390Take a moment to think about your programming language's structure. What kind 391of key elements does it have? In the template shown earlier, one predefined 392element all languages have is whitespace. Your language probably also has 393elements like comments, strings, and keywords. Lexers refer to these elements 394as "tokens". Tokens are the fundamental "building blocks" of lexers. Lexers 395break down source code into tokens for coloring, which results in the syntax 396highlighting familiar to you. It is up to you how specific your lexer is when 397it comes to tokens. Perhaps only distinguishing between keywords and 398identifiers is necessary, or maybe recognizing constants and built-in 399functions, methods, or libraries is desirable. The Lua lexer, for example, 400defines 11 tokens: whitespace, keywords, built-in functions, constants, 401built-in libraries, identifiers, strings, comments, numbers, labels, and 402operators. Even though constants, built-in functions, and built-in libraries 403are subsets of identifiers, Lua programmers find it helpful for the lexer to 404distinguish between them all. It is perfectly acceptable to just recognize 405keywords and identifiers. 406 407In a lexer, tokens consist of a token name and an LPeg pattern that matches a 408sequence of characters recognized as an instance of that token. Create tokens 409using the [`lexer.token()`](#lexer.token) function. Let us examine the "whitespace" token 410defined in the template shown earlier: 411 412 local ws = token(lexer.WHITESPACE, lexer.space^1) 413 414At first glance, the first argument does not appear to be a string name and 415the second argument does not appear to be an LPeg pattern. Perhaps you 416expected something like: 417 418 local ws = token('whitespace', S('\t\v\f\n\r ')^1) 419 420The `lexer` module actually provides a convenient list of common token names 421and common LPeg patterns for you to use. 
Token names include
[`lexer.DEFAULT`](#lexer.DEFAULT), [`lexer.WHITESPACE`](#lexer.WHITESPACE), [`lexer.COMMENT`](#lexer.COMMENT),
[`lexer.STRING`](#lexer.STRING), [`lexer.NUMBER`](#lexer.NUMBER), [`lexer.KEYWORD`](#lexer.KEYWORD),
[`lexer.IDENTIFIER`](#lexer.IDENTIFIER), [`lexer.OPERATOR`](#lexer.OPERATOR), [`lexer.ERROR`](#lexer.ERROR),
[`lexer.PREPROCESSOR`](#lexer.PREPROCESSOR), [`lexer.CONSTANT`](#lexer.CONSTANT), [`lexer.VARIABLE`](#lexer.VARIABLE),
[`lexer.FUNCTION`](#lexer.FUNCTION), [`lexer.CLASS`](#lexer.CLASS), [`lexer.TYPE`](#lexer.TYPE), [`lexer.LABEL`](#lexer.LABEL),
[`lexer.REGEX`](#lexer.REGEX), and [`lexer.EMBEDDED`](#lexer.EMBEDDED). Patterns include
[`lexer.any`](#lexer.any), [`lexer.alpha`](#lexer.alpha), [`lexer.digit`](#lexer.digit), [`lexer.alnum`](#lexer.alnum),
[`lexer.lower`](#lexer.lower), [`lexer.upper`](#lexer.upper), [`lexer.xdigit`](#lexer.xdigit), [`lexer.graph`](#lexer.graph),
[`lexer.print`](#lexer.print), [`lexer.punct`](#lexer.punct), [`lexer.space`](#lexer.space), [`lexer.newline`](#lexer.newline),
[`lexer.nonnewline`](#lexer.nonnewline), [`lexer.dec_num`](#lexer.dec_num), [`lexer.hex_num`](#lexer.hex_num),
[`lexer.oct_num`](#lexer.oct_num), [`lexer.integer`](#lexer.integer), [`lexer.float`](#lexer.float),
[`lexer.number`](#lexer.number), and [`lexer.word`](#lexer.word). You may use your own token names if
none of the above fit your language, but an advantage to using predefined
token names is that your lexer's tokens will inherit the universal syntax
highlighting color theme used by your text editor.

##### Example Tokens

So, how might you define other tokens like keywords, comments, and strings?
Here are some examples.

**Keywords**

Instead of matching _n_ keywords with _n_ `P('keyword_`_`n`_`')` ordered
choices, use another convenience function: [`lexer.word_match()`](#lexer.word_match).
It is much easier and more efficient to write word matches like:

    local keyword = token(lexer.KEYWORD, lexer.word_match[[
      keyword_1 keyword_2 ... keyword_n
    ]])

    local case_insensitive_keyword = token(lexer.KEYWORD, lexer.word_match([[
      KEYWORD_1 keyword_2 ... KEYword_n
    ]], true))

    local hyphened_keyword = token(lexer.KEYWORD, lexer.word_match[[
      keyword-1 keyword-2 ... keyword-n
    ]])

In order to more easily separate or categorize keyword sets, you can use Lua
line comments within keyword strings. Such comments will be ignored. For
example:

    local keyword = token(lexer.KEYWORD, lexer.word_match[[
      -- Version 1 keywords.
      keyword_11, keyword_12 ... keyword_1n
      -- Version 2 keywords.
      keyword_21, keyword_22 ... keyword_2n
      ...
      -- Version N keywords.
      keyword_m1, keyword_m2 ... keyword_mn
    ]])

**Comments**

Line-style comments with one or more prefix characters are easy to express
with LPeg:

    local shell_comment = token(lexer.COMMENT, lexer.to_eol('#'))
    local c_line_comment = token(lexer.COMMENT, lexer.to_eol('//', true))

The comments above start with a '#' or "//" and go to the end of the line.
The second comment also recognizes the next line as a comment if the current
line ends with a '\' escape character.

C-style "block" comments with a start and end delimiter are also easy to
express:

    local c_comment = token(lexer.COMMENT, lexer.range('/*', '*/'))

This comment starts with a "/\*" sequence and contains anything up to and
including an ending "\*/" sequence. The ending "\*/" is optional so the lexer
can recognize unfinished comments as comments and highlight them properly.

**Strings**

Most programming languages allow escape sequences in strings such that a
sequence like "\\"" in a double-quoted string indicates that the '"' is not
the end of the string.
[`lexer.range()`](#lexer.range) handles escapes inherently.

    local dq_str = lexer.range('"')
    local sq_str = lexer.range("'")
    local string = token(lexer.STRING, dq_str + sq_str)

In this case, the lexer treats '\' as an escape character in a string
sequence.

**Numbers**

Most programming languages have the same format for integer and float tokens,
so it might be as simple as using a predefined LPeg pattern:

    local number = token(lexer.NUMBER, lexer.number)

However, some languages allow postfix characters on integers.

    local integer = P('-')^-1 * (lexer.dec_num * S('lL')^-1)
    local number = token(lexer.NUMBER, lexer.float + lexer.hex_num + integer)

Your language may need other tweaks, but it is up to you how fine-grained you
want your highlighting to be. After all, you are not writing a compiler or
interpreter!

#### Rules

Programming languages have grammars, which specify valid token structure. For
example, comments usually cannot appear within a string. Grammars consist of
rules, which are simply combinations of tokens. Recall from the lexer
template the [`lexer.add_rule()`](#lexer.add_rule) call, which adds a rule to
the lexer's grammar:

    lex:add_rule('whitespace', ws)

Each rule has an associated name, but rule names are completely arbitrary and
serve only to identify and distinguish between different rules. Rule order is
important: if text does not match the first rule added to the grammar, the
lexer tries to match the second rule added, and so on. Right now this lexer
simply matches whitespace tokens under a rule named "whitespace".

To illustrate the importance of rule order, here is an example of a
simplified Lua lexer:

    lex:add_rule('whitespace', token(lexer.WHITESPACE, ...))
    lex:add_rule('keyword', token(lexer.KEYWORD, ...))
    lex:add_rule('identifier', token(lexer.IDENTIFIER, ...))
    lex:add_rule('string', token(lexer.STRING, ...))
    lex:add_rule('comment', token(lexer.COMMENT, ...))
    lex:add_rule('number', token(lexer.NUMBER, ...))
    lex:add_rule('label', token(lexer.LABEL, ...))
    lex:add_rule('operator', token(lexer.OPERATOR, ...))

Note how identifiers come after keywords. In Lua, as with most programming
languages, the characters allowed in keywords and identifiers are in the same
set (alphanumerics plus underscores). If the lexer added the "identifier"
rule before the "keyword" rule, all keywords would match identifiers and thus
incorrectly highlight as identifiers instead of keywords. The same idea
applies to function, constant, etc. tokens that you may want to distinguish
between: their rules should come before identifiers.

So what about text that does not match any rules? For example in Lua, the '!'
character is meaningless outside a string or comment. Normally the lexer
skips over such text. If instead you want to highlight these "syntax errors",
add an additional end rule:

    lex:add_rule('whitespace', ws)
    ...
    lex:add_rule('error', token(lexer.ERROR, lexer.any))

This identifies and highlights any character not matched by an existing
rule as a `lexer.ERROR` token.

Even though the rules defined in the examples above contain a single token,
rules may consist of multiple tokens. For example, a rule for an HTML tag
could consist of a tag token followed by an arbitrary number of attribute
tokens, allowing the lexer to highlight all tokens separately.
That rule might look something like this:

    lex:add_rule('tag', tag_start * (ws * attributes)^0 * tag_end^-1)

Note however that lexers with complex rules like these are more prone to lose
track of their state, especially if they span multiple lines.

#### Summary

Lexers primarily consist of tokens and grammar rules. At your disposal are a
number of convenience patterns and functions for rapidly creating a lexer. If
you choose to use predefined token names for your tokens, you do not have to
define how the lexer highlights them. The tokens will inherit the default
syntax highlighting color theme your editor uses.

### Advanced Techniques

#### Styles and Styling

The most basic form of syntax highlighting is assigning different colors to
different tokens. Instead of highlighting with just colors, Scintilla allows
for richer highlighting, or "styling", with different fonts, font sizes, font
attributes, and foreground and background colors, just to name a few. The
unit of this rich highlighting is called a "style". Styles are simply Lua
tables of properties. By default, lexers associate predefined token names
like `lexer.WHITESPACE`, `lexer.COMMENT`, `lexer.STRING`, etc. with
particular styles as part of a universal color theme. These predefined styles
are contained in [`lexer.styles`](#lexer.styles), and you may define your own
styles. See that table's documentation for more information. As with token
names, LPeg patterns, and styles, there is a set of predefined color names,
but they vary depending on the current color theme in use. Therefore, it is
generally not a good idea to manually define colors within styles in your
lexer since they might not fit into a user's chosen color theme. Try to
refrain from even using predefined colors in a style because that color may
be theme-specific.
Instead, the best practice is to either use predefined styles or derive new
color-agnostic styles from predefined ones. For example, Lua "longstring"
tokens use the existing `lexer.styles.string` style instead of defining a new
one.

##### Example Styles

Defining styles is pretty straightforward. An empty style that inherits the
default theme settings is simply an empty table:

    local style_nothing = {}

A similar style but with a bold font face looks like this:

    local style_bold = {bold = true}

You can derive new styles from predefined ones without having to rewrite
them. This operation leaves the old style unchanged. For example, if you had
a "static variable" token whose style you wanted to base off of
`lexer.styles.variable`, it would probably look like:

    local style_static_var = lexer.styles.variable .. {italics = true}

The color theme files in the *lexers/themes/* folder give more examples of
style definitions.

#### Token Styles

Lexers use the [`lexer.add_style()`](#lexer.add_style) function to assign
styles to particular tokens. Recall the token definition from the lexer
template:

    local ws = token(lexer.WHITESPACE, lexer.space^1)
    lex:add_rule('whitespace', ws)

Why is a style not assigned to the `lexer.WHITESPACE` token? As mentioned
earlier, lexers automatically associate tokens that use predefined token
names with a particular style. Only tokens with custom token names need
manual style associations. As an example, consider a custom whitespace token:

    local ws = token('custom_whitespace', lexer.space^1)

Assigning a style to this token looks like:

    lex:add_style('custom_whitespace', lexer.styles.whitespace)

Do not confuse token names with rule names. They are completely different
entities.
In the example above, the lexer associates the "custom_whitespace" token with
the existing style for `lexer.WHITESPACE` tokens. If instead you prefer to
color the background of whitespace a shade of grey, it might look like:

    lex:add_style('custom_whitespace',
      lexer.styles.whitespace .. {back = lexer.colors.grey})

Remember to refrain from assigning specific colors in styles, but in this
case, all user color themes probably define `colors.grey`.

#### Line Lexers

By default, lexers match the arbitrary chunks of text passed to them by
Scintilla. These chunks may be a full document, only the visible part of a
document, or even just portions of lines. Some lexers need to match whole
lines. For example, a lexer for the output of a file "diff" needs to know if
the line started with a '+' or '-' and then style the entire line
accordingly. To indicate that your lexer matches by line, create the lexer
with an extra parameter:

    local lex = lexer.new('?', {lex_by_line = true})

Now the input text for the lexer is a single line at a time. Keep in mind
that line lexers do not have the ability to look ahead at subsequent lines.

#### Embedded Lexers

Lexers embed within one another very easily, requiring minimal effort. In the
following sections, the lexer being embedded is called the "child" lexer and
the lexer a child is being embedded in is called the "parent". For example,
consider an HTML lexer and a CSS lexer. Either lexer stands alone for styling
its respective HTML or CSS files. However, CSS can be embedded inside HTML.
In this specific case, the CSS lexer is the "child" lexer with the HTML lexer
being the "parent". Now consider an HTML lexer and a PHP lexer. This sounds a
lot like the case with CSS, but there is a subtle difference: PHP _embeds
itself into_ HTML while CSS is _embedded in_ HTML.
This fundamental difference results in two types of embedded lexers: a parent
lexer that embeds other child lexers in it (like HTML embedding CSS), and a
child lexer that embeds itself into a parent lexer (like PHP embedding itself
in HTML).

##### Parent Lexer

Before embedding a child lexer into a parent lexer, the parent lexer needs to
load the child lexer. This is done with the [`lexer.load()`](#lexer.load)
function. For example, loading the CSS lexer within the HTML lexer looks
like:

    local css = lexer.load('css')

The next part of the embedding process is telling the parent lexer when to
switch over to the child lexer and when to switch back. The lexer refers to
these indications as the "start rule" and "end rule", respectively; they are
just LPeg patterns. Continuing with the HTML/CSS example, the transition from
HTML to CSS occurs when the lexer encounters a "style" tag with a "type"
attribute whose value is "text/css":

    local css_tag = P('<style') * P(function(input, index)
      if input:find('^[^>]+type="text/css"', index) then
        return index
      end
    end)

This pattern looks for the beginning of a "style" tag and searches its
attribute list for the text "`type="text/css"`". (In this simplified example,
the Lua pattern does not consider whitespace around the '=', nor does it
consider that using single quotes is valid.) If there is a match, the
functional pattern returns a value instead of `nil`. In this case, the value
returned does not matter because we ultimately want to style the "style" tag
as an HTML tag, so the actual start rule looks like this:

    local css_start_rule = #css_tag * tag

Now that the parent knows when to switch to the child, it needs to know when
to switch back.
In the case of HTML/CSS, the switch back occurs when the lexer encounters an
ending "style" tag, though the lexer should still style the tag as an HTML
tag:

    local css_end_rule = #P('</style>') * tag

Once the parent loads the child lexer and defines the child's start and end
rules, it embeds the child with the [`lexer.embed()`](#lexer.embed) function:

    lex:embed(css, css_start_rule, css_end_rule)

##### Child Lexer

The process for instructing a child lexer to embed itself into a parent is
very similar to embedding a child into a parent: first, load the parent lexer
into the child lexer with the [`lexer.load()`](#lexer.load) function and then
create start and end rules for the child lexer. However, in this case, call
[`lexer.embed()`](#lexer.embed) with switched arguments. For example, in the
PHP lexer:

    local html = lexer.load('html')
    local php_start_rule = token('php_tag', '<?php ')
    local php_end_rule = token('php_tag', '?>')
    lex:add_style('php_tag', lexer.styles.embedded)
    html:embed(lex, php_start_rule, php_end_rule)

#### Lexers with Complex State

The vast majority of lexers are not stateful and can operate on any chunk of
text in a document. However, there may be rare cases where a lexer does need
to keep track of some sort of persistent state. Rather than using `lpeg.P`
function patterns that set state variables, it is recommended to make use of
Scintilla's built-in, per-line state integers via
[`lexer.line_state`](#lexer.line_state). It was designed to accommodate up to
32 bit flags for tracking state.
[`lexer.line_from_position()`](#lexer.line_from_position) will return the
line for any position given to an `lpeg.P` function pattern. (Any positions
derived from that position argument will also work.)

Writing stateful lexers is beyond the scope of this document.
### Code Folding

When reading source code, it is occasionally helpful to temporarily hide
blocks of code like functions, classes, comments, etc. This is the concept of
"folding". In the Textadept and SciTE editors, for example, little indicators
in the editor margins appear next to code that can be folded at places called
"fold points". When the user clicks an indicator, the editor hides the code
associated with the indicator until the user clicks the indicator again. The
lexer specifies these fold points and exactly what code to fold.

The fold points for most languages occur on keywords or character sequences.
Examples of fold keywords are "if" and "end" in Lua, and examples of fold
character sequences are '{', '}', "/\*", and "\*/" in C for code block and
comment delimiters, respectively. However, these fold points cannot occur
just anywhere. For example, lexers should not recognize fold keywords that
appear within strings or comments. The [`lexer.add_fold_point()`](#lexer.add_fold_point) function
allows you to conveniently define fold points with such granularity. For
example, consider C:

    lex:add_fold_point(lexer.OPERATOR, '{', '}')
    lex:add_fold_point(lexer.COMMENT, '/*', '*/')

The first call states that any '{' or '}' that the lexer recognizes as a
`lexer.OPERATOR` token is a fold point. Likewise, the second call states that
any "/\*" or "\*/" that the lexer recognizes as part of a `lexer.COMMENT`
token is a fold point. The lexer does not consider any occurrences of these
characters outside their defined tokens (such as in a string) as fold points.
How do you specify fold keywords?
Here is an example
for Lua:

    lex:add_fold_point(lexer.KEYWORD, 'if', 'end')
    lex:add_fold_point(lexer.KEYWORD, 'do', 'end')
    lex:add_fold_point(lexer.KEYWORD, 'function', 'end')
    lex:add_fold_point(lexer.KEYWORD, 'repeat', 'until')

If your lexer has case-insensitive keywords as fold points, simply add a
`case_insensitive_fold_points = true` option to [`lexer.new()`](#lexer.new), and
specify keywords in lower case.

If your lexer needs to do some additional processing in order to determine if
a token is a fold point, pass a function that returns an integer to
`lex:add_fold_point()`. Returning `1` indicates the token is a beginning fold
point and returning `-1` indicates the token is an ending fold point.
Returning `0` indicates the token is not a fold point. For example:

    local function fold_strange_token(text, pos, line, s, symbol)
      if ... then
        return 1 -- beginning fold point
      elseif ... then
        return -1 -- ending fold point
      end
      return 0
    end

    lex:add_fold_point('strange_token', '|', fold_strange_token)

Any time the lexer encounters a '|' that is a "strange_token", it calls the
`fold_strange_token` function to determine if '|' is a fold point. The lexer
calls these functions with the following arguments: the text to identify fold
points in, the beginning position of the current line in the text to fold,
the current line's text, the position in the current line the fold point text
starts at, and the fold point text itself.

#### Fold by Indentation

Some languages have significant whitespace and/or no delimiters that indicate
fold points.
If your lexer falls into this category and you would like to
mark fold points based on changes in indentation, create the lexer with a
`fold_by_indentation = true` option:

    local lex = lexer.new('?', {fold_by_indentation = true})

### Using Lexers

**Textadept**

Put your lexer in your *~/.textadept/lexers/* directory so you do not
overwrite it when upgrading Textadept. Also, lexers in this directory
override default lexers. Thus, Textadept loads a user *lua* lexer instead of
the default *lua* lexer. This is convenient for tweaking a default lexer to
your liking. Then add a [file type](#textadept.file_types) for your lexer if
necessary.

**SciTE**

Create a *.properties* file for your lexer and `import` it in either your
*SciTEUser.properties* or *SciTEGlobal.properties*. The *.properties* file
should contain:

    file.patterns.[lexer_name]=[file_patterns]
    lexer.$(file.patterns.[lexer_name])=[lexer_name]

where `[lexer_name]` is the name of your lexer (minus the *.lua* extension)
and `[file_patterns]` is a set of file extensions to use your lexer for.

Please note that Lua lexers ignore any styling information in *.properties*
files. Your theme file in the *lexers/themes/* directory contains styling
information.

### Migrating Legacy Lexers

Legacy lexers are of the form:

    local l = require('lexer')
    local token, word_match = l.token, l.word_match
    local P, R, S = lpeg.P, lpeg.R, lpeg.S

    local M = {_NAME = '?'}

    [... token and pattern definitions ...]

    M._rules = {
      {'rule', pattern},
      [...]
    }

    M._tokenstyles = {
      ['token'] = 'style',
      [...]
    }

    M._foldsymbols = {
      _patterns = {...},
      ['token'] = {['start'] = 1, ['end'] = -1},
      [...]
    }

    return M

While Scintillua will handle such legacy lexers just fine without any
changes, it is recommended that you migrate yours. The migration process is
fairly straightforward:

1. Replace all instances of `l` with `lexer`, as it's better practice and
   results in less confusion.
2. Replace `local M = {_NAME = '?'}` with `local lex = lexer.new('?')`, where
   `?` is the name of your legacy lexer. At the end of the lexer, change
   `return M` to `return lex`.
3. Instead of defining rules towards the end of your lexer, define your rules
   as you define your tokens and patterns using
   [`lex:add_rule()`](#lexer.add_rule).
4. Similarly, any custom token names should have their styles immediately
   defined using [`lex:add_style()`](#lexer.add_style).
5. Convert any table arguments passed to [`lexer.word_match()`](#lexer.word_match) to a
   space-separated string of words.
6. Replace any calls to `lexer.embed(M, child, ...)` and
   `lexer.embed(parent, M, ...)` with
   [`lex:embed`](#lexer.embed)`(child, ...)` and `parent:embed(lex, ...)`,
   respectively.
7. Define fold points with simple calls to
   [`lex:add_fold_point()`](#lexer.add_fold_point). No need to mess with Lua
   patterns anymore.
8. Any legacy lexer options such as `M._FOLDBYINDENTATION`, `M._LEXBYLINE`,
   `M._lexer`, etc. should be added as table options to [`lexer.new()`](#lexer.new).
9. Any external lexer rule fetching and/or modifications via `lexer._RULES`
   should be changed to use [`lexer.get_rule()`](#lexer.get_rule) and
   [`lexer.modify_rule()`](#lexer.modify_rule).
As an example, consider the following sample legacy lexer:

    local l = require('lexer')
    local token, word_match = l.token, l.word_match
    local P, R, S = lpeg.P, lpeg.R, lpeg.S

    local M = {_NAME = 'legacy'}

    local ws = token(l.WHITESPACE, l.space^1)
    local comment = token(l.COMMENT, '#' * l.nonnewline^0)
    local string = token(l.STRING, l.delimited_range('"'))
    local number = token(l.NUMBER, l.float + l.integer)
    local keyword = token(l.KEYWORD, word_match{'foo', 'bar', 'baz'})
    local custom = token('custom', P('quux'))
    local identifier = token(l.IDENTIFIER, l.word)
    local operator = token(l.OPERATOR, S('+-*/%^=<>,.()[]{}'))

    M._rules = {
      {'whitespace', ws},
      {'keyword', keyword},
      {'custom', custom},
      {'identifier', identifier},
      {'string', string},
      {'comment', comment},
      {'number', number},
      {'operator', operator}
    }

    M._tokenstyles = {
      ['custom'] = l.STYLE_KEYWORD .. ',bold'
    }

    M._foldsymbols = {
      _patterns = {'[{}]'},
      [l.OPERATOR] = {['{'] = 1, ['}'] = -1}
    }

    return M

Following the migration steps would yield:

    local lexer = require('lexer')
    local token, word_match = lexer.token, lexer.word_match
    local P, S = lpeg.P, lpeg.S

    local lex = lexer.new('legacy')

    lex:add_rule('whitespace', token(lexer.WHITESPACE, lexer.space^1))
    lex:add_rule('keyword', token(lexer.KEYWORD, word_match[[foo bar baz]]))
    lex:add_rule('custom', token('custom', P('quux')))
    lex:add_style('custom', lexer.styles.keyword .. {bold = true})
    lex:add_rule('identifier', token(lexer.IDENTIFIER, lexer.word))
    lex:add_rule('string', token(lexer.STRING, lexer.range('"')))
    lex:add_rule('comment', token(lexer.COMMENT, lexer.to_eol('#')))
    lex:add_rule('number', token(lexer.NUMBER, lexer.number))
    lex:add_rule('operator', token(lexer.OPERATOR, S('+-*/%^=<>,.()[]{}')))

    lex:add_fold_point(lexer.OPERATOR, '{', '}')

    return lex

### Considerations

#### Performance

There might be some slight overhead when initializing a lexer, but loading a
file from disk into Scintilla is usually more expensive. On modern computer
systems, I see no difference in speed between Lua lexers and Scintilla's C++
ones. Optimize lexers for speed by re-arranging `lexer.add_rule()` calls so
that the most common rules match first. Do keep in mind that order matters
for similar rules.

In some cases, folding may be far more expensive than lexing, particularly
in lexers with a lot of potential fold points. If your lexer is exhibiting
signs of slowness, try disabling folding in your text editor first. If that
speeds things up, you can try reducing the number of fold points you added,
overriding `lexer.fold()` with your own implementation, or simply eliminating
folding support from your lexer.

#### Limitations

Embedded preprocessor languages like PHP cannot completely embed themselves
in their parent languages because the parent's tokens do not support start
and end rules. This mostly goes unnoticed, but code like

    <div id="<?php echo $id; ?>">

will not style correctly.

#### Troubleshooting

Errors in lexers can be tricky to debug. Lexers print Lua errors to
`io.stderr` and `_G.print()` statements to `io.stdout`. Running your editor
from a terminal is the easiest way to see errors as they occur.
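Because Scintillua can also be used as a stand-alone Lua library, a lexer
can additionally be exercised entirely outside the editor. The following is
a rough, hypothetical sketch, not a supported tool: it assumes Scintillua's
`lexer` module is on `package.path`, that the "lexer.lpeg.home" property
points at your *lexers/* directory, and that the sample text and lexer name
are placeholders for whatever you are debugging:

    -- Rough sketch of driving a lexer from a plain Lua interpreter in
    -- order to debug it: lex a sample string and dump the result, a flat
    -- table of token names and positions.
    local lexer = require('lexer')
    local lex = lexer.load('lua') -- the name of the lexer being debugged

    local tokens = lex:lex('local x = 1 -- comment')
    for i = 1, #tokens, 2 do
      print(tokens[i], tokens[i + 1]) -- token name, position
    end

Running a script like this surfaces syntax and pattern errors immediately,
without involving Scintilla at all.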
#### Risks

Poorly written lexers have the ability to crash Scintilla (and thus its
containing application), so unsaved data might be lost. However, I have only
observed these crashes in early lexer development, when syntax errors or
pattern errors are present. Once the lexer actually starts styling text
(either correctly or incorrectly, it does not matter), I have not observed
any crashes.

#### Acknowledgements

Thanks to Peter Odding for his [lexer post][] on the Lua mailing list
that provided inspiration, and thanks to Roberto Ierusalimschy for LPeg.

[lexer post]: http://lua-users.org/lists/lua-l/2007-04/msg00116.html

### Fields defined by `lexer`

<a id="lexer.CLASS"></a>
#### `lexer.CLASS` (string)

The token name for class tokens.

<a id="lexer.COMMENT"></a>
#### `lexer.COMMENT` (string)

The token name for comment tokens.

<a id="lexer.CONSTANT"></a>
#### `lexer.CONSTANT` (string)

The token name for constant tokens.

<a id="lexer.DEFAULT"></a>
#### `lexer.DEFAULT` (string)

The token name for default tokens.

<a id="lexer.ERROR"></a>
#### `lexer.ERROR` (string)

The token name for error tokens.

<a id="lexer.FOLD_BASE"></a>
#### `lexer.FOLD_BASE` (number)

The initial (root) fold level.

<a id="lexer.FOLD_BLANK"></a>
#### `lexer.FOLD_BLANK` (number)

Flag indicating that the line is blank.

<a id="lexer.FOLD_HEADER"></a>
#### `lexer.FOLD_HEADER` (number)

Flag indicating the line is a fold point.

<a id="lexer.FUNCTION"></a>
#### `lexer.FUNCTION` (string)

The token name for function tokens.

<a id="lexer.IDENTIFIER"></a>
#### `lexer.IDENTIFIER` (string)

The token name for identifier tokens.
<a id="lexer.KEYWORD"></a>
#### `lexer.KEYWORD` (string)

The token name for keyword tokens.

<a id="lexer.LABEL"></a>
#### `lexer.LABEL` (string)

The token name for label tokens.

<a id="lexer.NUMBER"></a>
#### `lexer.NUMBER` (string)

The token name for number tokens.

<a id="lexer.OPERATOR"></a>
#### `lexer.OPERATOR` (string)

The token name for operator tokens.

<a id="lexer.PREPROCESSOR"></a>
#### `lexer.PREPROCESSOR` (string)

The token name for preprocessor tokens.

<a id="lexer.REGEX"></a>
#### `lexer.REGEX` (string)

The token name for regex tokens.

<a id="lexer.STRING"></a>
#### `lexer.STRING` (string)

The token name for string tokens.

<a id="lexer.TYPE"></a>
#### `lexer.TYPE` (string)

The token name for type tokens.

<a id="lexer.VARIABLE"></a>
#### `lexer.VARIABLE` (string)

The token name for variable tokens.

<a id="lexer.WHITESPACE"></a>
#### `lexer.WHITESPACE` (string)

The token name for whitespace tokens.

<a id="lexer.alnum"></a>
#### `lexer.alnum` (pattern)

A pattern that matches any alphanumeric character ('A'-'Z', 'a'-'z',
'0'-'9').

<a id="lexer.alpha"></a>
#### `lexer.alpha` (pattern)

A pattern that matches any alphabetic character ('A'-'Z', 'a'-'z').

<a id="lexer.any"></a>
#### `lexer.any` (pattern)

A pattern that matches any single character.

<a id="lexer.ascii"></a>
#### `lexer.ascii` (pattern)

A pattern that matches any ASCII character (codes 0 to 127).

<a id="lexer.cntrl"></a>
#### `lexer.cntrl` (pattern)

A pattern that matches any control character (ASCII codes 0 to 31).

<a id="lexer.dec_num"></a>
#### `lexer.dec_num` (pattern)

A pattern that matches a decimal number.
<a id="lexer.digit"></a>
#### `lexer.digit` (pattern)

A pattern that matches any digit ('0'-'9').

<a id="lexer.extend"></a>
#### `lexer.extend` (pattern)

A pattern that matches any ASCII extended character (codes 0 to 255).

<a id="lexer.float"></a>
#### `lexer.float` (pattern)

A pattern that matches a floating point number.

<a id="lexer.fold_by_indentation"></a>
#### `lexer.fold_by_indentation` (boolean)

Whether or not to fold based on indentation level if a lexer does not have
a folder. Some lexers automatically enable this option. It is disabled by
default. This is an alias for `lexer.property['fold.by.indentation'] = '1|0'`.

<a id="lexer.fold_compact"></a>
#### `lexer.fold_compact` (boolean)

Whether or not blank lines after an ending fold point are included in that
fold. This option is disabled by default.
This is an alias for `lexer.property['fold.compact'] = '1|0'`.

<a id="lexer.fold_level"></a>
#### `lexer.fold_level` (table, Read-only)

Table of fold level bit-masks for line numbers starting from 1.
Fold level masks are composed of an integer level combined with any of the
following bits:

* `lexer.FOLD_BASE`: The initial fold level.
* `lexer.FOLD_BLANK`: The line is blank.
* `lexer.FOLD_HEADER`: The line is a header, or fold point.

<a id="lexer.fold_line_groups"></a>
#### `lexer.fold_line_groups` (boolean)

Whether or not to fold multiple, consecutive line groups (such as line
comments and import statements) and only show the top line.
This option is disabled by default.
This is an alias for `lexer.property['fold.line.groups'] = '1|0'`.
<a id="lexer.fold_on_zero_sum_lines"></a>
#### `lexer.fold_on_zero_sum_lines` (boolean)

Whether or not to mark as a fold point lines that contain both an ending
and starting fold point. For example, `} else {` would be marked as a fold
point. This option is disabled by default.
This is an alias for `lexer.property['fold.on.zero.sum.lines'] = '1|0'`.

<a id="lexer.folding"></a>
#### `lexer.folding` (boolean)

Whether or not folding is enabled for the lexers that support it.
This option is disabled by default.
This is an alias for `lexer.property['fold'] = '1|0'`.

<a id="lexer.graph"></a>
#### `lexer.graph` (pattern)

A pattern that matches any graphical character ('!' to '~').

<a id="lexer.hex_num"></a>
#### `lexer.hex_num` (pattern)

A pattern that matches a hexadecimal number.

<a id="lexer.indent_amount"></a>
#### `lexer.indent_amount` (table, Read-only)

Table of indentation amounts in character columns, for line numbers
starting from 1.

<a id="lexer.integer"></a>
#### `lexer.integer` (pattern)

A pattern that matches either a decimal, hexadecimal, or octal number.

<a id="lexer.line_state"></a>
#### `lexer.line_state` (table)

Table of integer line states for line numbers starting from 1.
Line states can be used by lexers for keeping track of persistent states.

<a id="lexer.lower"></a>
#### `lexer.lower` (pattern)

A pattern that matches any lower case character ('a'-'z').

<a id="lexer.newline"></a>
#### `lexer.newline` (pattern)

A pattern that matches a sequence of end of line characters.

<a id="lexer.nonnewline"></a>
#### `lexer.nonnewline` (pattern)

A pattern that matches any single, non-newline character.
<a id="lexer.number"></a>
#### `lexer.number` (pattern)

A pattern that matches a typical number, either a floating point, decimal,
hexadecimal, or octal number.

<a id="lexer.oct_num"></a>
#### `lexer.oct_num` (pattern)

A pattern that matches an octal number.

<a id="lexer.print"></a>
#### `lexer.print` (pattern)

A pattern that matches any printable character (' ' to '~').

<a id="lexer.property"></a>
#### `lexer.property` (table)

Map of key-value string pairs.

<a id="lexer.property_expanded"></a>
#### `lexer.property_expanded` (table, Read-only)

Map of key-value string pairs with `$()` and `%()` variable replacement
performed in values.

<a id="lexer.property_int"></a>
#### `lexer.property_int` (table, Read-only)

Map of key-value pairs with values interpreted as numbers, or `0` if not
found.

<a id="lexer.punct"></a>
#### `lexer.punct` (pattern)

A pattern that matches any punctuation character ('!' to '/', ':' to '@',
'[' to '`', '{' to '~').

<a id="lexer.space"></a>
#### `lexer.space` (pattern)

A pattern that matches any whitespace character ('\t', '\v', '\f', '\n',
'\r', space).

<a id="lexer.style_at"></a>
#### `lexer.style_at` (table, Read-only)

Table of style names at positions in the buffer starting from 1.

<a id="lexer.upper"></a>
#### `lexer.upper` (pattern)

A pattern that matches any upper case character ('A'-'Z').

<a id="lexer.word"></a>
#### `lexer.word` (pattern)

A pattern that matches a typical word. Words begin with a letter or
underscore and consist of alphanumeric and underscore characters.

<a id="lexer.xdigit"></a>
#### `lexer.xdigit` (pattern)

A pattern that matches any hexadecimal digit ('0'-'9', 'A'-'F', 'a'-'f').
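The pattern fields above are ordinary LPeg patterns, so they compose with
the usual LPeg operators (`*` for concatenation, `+` for ordered choice,
`^` for repetition). As a brief, hypothetical illustration (the 'hexcolor'
token name is invented and would need a style via `lexer.add_style()`):

    -- Building tokens from the predefined patterns.
    local token = lexer.token
    local ws = token(lexer.WHITESPACE, lexer.space^1)
    local ident = token(lexer.IDENTIFIER, lexer.word)
    -- '#' followed by one or more hex digits.
    local hexcolor = token('hexcolor', '#' * lexer.xdigit^1)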
### Functions defined by `lexer`

<a id="lexer.add_fold_point"></a>
#### `lexer.add_fold_point`(lexer, token\_name, start\_symbol, end\_symbol)

Adds to lexer *lexer* a fold point whose beginning and end tokens are string
*token_name* tokens with string content *start_symbol* and *end_symbol*,
respectively.
In the event that *start_symbol* may or may not be a fold point depending on
context, and that additional processing is required, *end_symbol* may be a
function that ultimately returns `1` (indicating a beginning fold point),
`-1` (indicating an ending fold point), or `0` (indicating no fold point).
That function is passed the following arguments:

* `text`: The text being processed for fold points.
* `pos`: The position in *text* of the beginning of the line currently
  being processed.
* `line`: The text of the line currently being processed.
* `s`: The position of *start_symbol* in *line*.
* `symbol`: *start_symbol* itself.

Fields:

* `lexer`: The lexer to add a fold point to.
* `token_name`: The token name of text that indicates a fold point.
* `start_symbol`: The text that indicates the beginning of a fold point.
* `end_symbol`: Either the text that indicates the end of a fold point, or
  a function that returns whether or not *start_symbol* is a beginning fold
  point (1), an ending fold point (-1), or not a fold point at all (0).

Usage:

* `lex:add_fold_point(lexer.OPERATOR, '{', '}')`
* `lex:add_fold_point(lexer.KEYWORD, 'if', 'end')`
* `lex:add_fold_point(lexer.COMMENT, lexer.fold_consecutive_lines('#'))`
* `lex:add_fold_point('custom', function(text, pos, line, s, symbol)
  ... end)`

<a id="lexer.add_rule"></a>
#### `lexer.add_rule`(lexer, id, rule)

Adds pattern *rule* identified by string *id* to the ordered list of rules
for lexer *lexer*.
Fields:

* `lexer`: The lexer to add the given rule to.
* `id`: The id associated with this rule. It does not have to be the same
  as the name passed to `token()`.
* `rule`: The LPeg pattern of the rule.

See also:

* [`lexer.modify_rule`](#lexer.modify_rule)

<a id="lexer.add_style"></a>
#### `lexer.add_style`(lexer, token\_name, style)

Associates string *token_name* in lexer *lexer* with style table *style*.
*style* may have the following fields:

* `font`: String font name.
* `size`: Integer font size.
* `bold`: Whether or not the font face is bold. The default value is `false`.
* `weight`: Integer weight or boldness of a font, between 1 and 999.
* `italics`: Whether or not the font face is italic. The default value is
  `false`.
* `underlined`: Whether or not the font face is underlined. The default value
  is `false`.
* `fore`: Font face foreground color in `0xBBGGRR` or `"#RRGGBB"` format.
* `back`: Font face background color in `0xBBGGRR` or `"#RRGGBB"` format.
* `eolfilled`: Whether or not the background color extends to the end of the
  line. The default value is `false`.
* `case`: Font case, `'u'` for upper, `'l'` for lower, and `'m'` for normal,
  mixed case. The default value is `'m'`.
* `visible`: Whether or not the text is visible. The default value is `true`.
* `changeable`: Whether the text is changeable instead of read-only. The
  default value is `true`.

Field values may also contain "$(property.name)" expansions for properties
defined in Scintilla, theme files, etc.

Fields:

* `lexer`: The lexer to add a style to.
* `token_name`: The name of the token to associate with the style.
* `style`: A style string for Scintilla.

Usage:

* `lex:add_style('longstring', lexer.styles.string)`
* `lex:add_style('deprecated_func', lexer.styles['function'] ..
  {italics = true})`
* `lex:add_style('visible_ws', lexer.styles.whitespace ..
  {back = lexer.colors.grey})`

<a id="lexer.embed"></a>
#### `lexer.embed`(lexer, child, start\_rule, end\_rule)

Embeds child lexer *child* in parent lexer *lexer* using patterns
*start_rule* and *end_rule*, which signal the beginning and end of the
embedded lexer, respectively.

Fields:

* `lexer`: The parent lexer.
* `child`: The child lexer.
* `start_rule`: The pattern that signals the beginning of the embedded
  lexer.
* `end_rule`: The pattern that signals the end of the embedded lexer.

Usage:

* `html:embed(css, css_start_rule, css_end_rule)`
* `html:embed(lex, php_start_rule, php_end_rule) -- from php lexer`

<a id="lexer.fold"></a>
#### `lexer.fold`(lexer, text, start\_pos, start\_line, start\_level)

Determines fold points in a chunk of text *text* using lexer *lexer*,
returning a table of fold levels associated with line numbers.
*text* starts at position *start_pos* on line number *start_line* with a
beginning fold level of *start_level* in the buffer.

Fields:

* `lexer`: The lexer to fold text with.
* `text`: The text in the buffer to fold.
* `start_pos`: The position in the buffer *text* starts at, counting from
  1.
* `start_line`: The line number *text* starts on, counting from 1.
* `start_level`: The fold level *text* starts on.

Return:

* table of fold levels associated with line numbers.

<a id="lexer.fold_consecutive_lines"></a>
#### `lexer.fold_consecutive_lines`(prefix)

Returns for `lexer.add_fold_point()` the parameters needed to fold
consecutive lines that start with string *prefix*.

Fields:

* `prefix`: The prefix string (e.g. a line comment).
Usage:

* `lex:add_fold_point(lexer.COMMENT, lexer.fold_consecutive_lines('--'))`
* `lex:add_fold_point(lexer.COMMENT, lexer.fold_consecutive_lines('//'))`
* `lex:add_fold_point(
  lexer.KEYWORD, lexer.fold_consecutive_lines('import'))`

<a id="lexer.get_rule"></a>
#### `lexer.get_rule`(lexer, id)

Returns the rule identified by string *id*.

Fields:

* `lexer`: The lexer to fetch a rule from.
* `id`: The id of the rule to fetch.

Return:

* pattern

<a id="lexer.last_char_includes"></a>
#### `lexer.last_char_includes`(s)

Creates and returns a pattern that verifies that the first non-whitespace
character behind the current match position is in string set *s*.

Fields:

* `s`: String character set like one passed to `lpeg.S()`.

Usage:

* `local regex = lexer.last_char_includes('+-*!%^&|=,([{') *
  lexer.range('/')`

Return:

* pattern

<a id="lexer.lex"></a>
#### `lexer.lex`(lexer, text, init\_style)

Lexes a chunk of text *text* (that has an initial style number of
*init_style*) using lexer *lexer*, returning a table of token names and
positions.

Fields:

* `lexer`: The lexer to lex text with.
* `text`: The text in the buffer to lex.
* `init_style`: The current style. Multiple-language lexers use this to
  determine which language to start lexing in.

Return:

* table of token names and positions.

<a id="lexer.line_from_position"></a>
#### `lexer.line_from_position`(pos)

Returns the line number (starting from 1) of the line that contains position
*pos*, which starts from 1.

Fields:

* `pos`: The position to get the line number of.
Return:

* number

<a id="lexer.load"></a>
#### `lexer.load`(name, alt\_name, cache)

Initializes or loads and returns the lexer of string name *name*.
Scintilla calls this function in order to load a lexer. Parent lexers also
call this function in order to load child lexers and vice-versa. The user
calls this function in order to load a lexer when using Scintillua as a Lua
library.

Fields:

* `name`: The name of the lexing language.
* `alt_name`: The alternate name of the lexing language. This is useful for
  embedding the same child lexer with multiple sets of start and end tokens.
* `cache`: Flag indicating whether or not to load lexers from the cache.
  This should only be `true` when initially loading a lexer (e.g. not from
  within another lexer for embedding purposes).
  The default value is `false`.

Return:

* lexer object

<a id="lexer.modify_rule"></a>
#### `lexer.modify_rule`(lexer, id, rule)

Replaces in lexer *lexer* the existing rule identified by string *id* with
pattern *rule*.

Fields:

* `lexer`: The lexer to modify.
* `id`: The id associated with this rule.
* `rule`: The LPeg pattern of the rule.

<a id="lexer.new"></a>
#### `lexer.new`(name, opts)

Creates and returns a new lexer with the given name.

Fields:

* `name`: The lexer's name.
* `opts`: Table of lexer options. Options currently supported:
  * `lex_by_line`: Whether or not the lexer only processes whole lines of
    text (instead of arbitrary chunks of text) at a time.
    Line lexers cannot look ahead to subsequent lines.
    The default value is `false`.
  * `fold_by_indentation`: Whether or not the lexer defines no fold points,
    so that fold points should be calculated based on changes in line
    indentation.
    The default value is `false`.
  * `case_insensitive_fold_points`: Whether or not fold points added via
    `lexer.add_fold_point()` ignore case.
    The default value is `false`.
  * `inherit`: Lexer to inherit from.
    The default value is `nil`.

Usage:

* `lexer.new('rhtml', {inherit = lexer.load('html')})`

<a id="lexer.range"></a>
#### `lexer.range`(s, e, single\_line, escapes, balanced)

Creates and returns a pattern that matches a range of text bounded by strings
or patterns *s* and *e*.
This is a convenience function for matching more complicated ranges like
strings with escape characters, balanced parentheses, and block comments
(nested or not). *e* is optional and defaults to *s*. *single_line* indicates
whether or not the range must be on a single line; *escapes* indicates
whether or not to allow '\' as an escape character; and *balanced* indicates
whether or not to handle balanced ranges like parentheses, and requires *s*
and *e* to be different.

Fields:

* `s`: String or pattern start of a range.
* `e`: Optional string or pattern end of a range. The default value is *s*.
* `single_line`: Optional flag indicating whether or not the range must be
  on a single line. The default value is `false`.
* `escapes`: Optional flag indicating whether or not the range end may
  be escaped by a '\' character.
  The default value is `false` unless *s* and *e* are identical,
  single-character strings. In that case, the default value is `true`.
* `balanced`: Optional flag indicating whether or not to match a balanced
  range, like the "%b" Lua pattern. This flag only applies if *s* and *e* are
  different.
Usage:

* `local dq_str_escapes = lexer.range('"')`
* `local dq_str_noescapes = lexer.range('"', false, false)`
* `local unbalanced_parens = lexer.range('(', ')')`
* `local balanced_parens = lexer.range('(', ')', false, false, true)`

Return:

* pattern

<a id="lexer.starts_line"></a>
#### `lexer.starts_line`(patt)

Creates and returns a pattern that matches pattern *patt* only at the
beginning of a line.

Fields:

* `patt`: The LPeg pattern to match on the beginning of a line.

Usage:

* `local preproc = token(lexer.PREPROCESSOR,
  lexer.starts_line(lexer.to_eol('#')))`

Return:

* pattern

<a id="lexer.to_eol"></a>
#### `lexer.to_eol`(prefix, escape)

Creates and returns a pattern that matches from string or pattern *prefix*
until the end of the line.
*escape* indicates whether the end of the line can be escaped with a '\'
character.

Fields:

* `prefix`: String or pattern prefix to start matching at.
* `escape`: Optional flag indicating whether or not newlines can be escaped
  by a '\' character. The default value is `false`.

Usage:

* `local line_comment = lexer.to_eol('//')`
* `local line_comment = lexer.to_eol(S('#;'))`

Return:

* pattern

<a id="lexer.token"></a>
#### `lexer.token`(name, patt)

Creates and returns a token pattern with token name *name* and pattern
*patt*.
If *name* is not a predefined token name, its style must be defined via
`lexer.add_style()`.

Fields:

* `name`: The name of the token. If this name is not a predefined token
  name, then a style needs to be associated with it via `lexer.add_style()`.
* `patt`: The LPeg pattern associated with the token.

Usage:

* `local ws = token(lexer.WHITESPACE, lexer.space^1)`
* `local annotation = token('annotation', '@' * lexer.word)`

Return:

* pattern

<a id="lexer.word_match"></a>
#### `lexer.word_match`(words, case\_insensitive, word\_chars)

Creates and returns a pattern that matches any single word in string *words*.
*case_insensitive* indicates whether or not to ignore case when matching
words.
This is a convenience function for simplifying a set of ordered choice word
patterns.
If *words* is a multi-line string, it may contain Lua line comments (`--`)
that will ultimately be ignored.

Fields:

* `words`: A string list of words separated by spaces.
* `case_insensitive`: Optional boolean flag indicating whether or not the
  word match is case-insensitive. The default value is `false`.
* `word_chars`: Unused legacy parameter.

Usage:

* `local keyword = token(lexer.KEYWORD, word_match[[foo bar baz]])`
* `local keyword = token(lexer.KEYWORD, word_match([[foo-bar foo-baz
  bar-foo bar-baz baz-foo baz-bar]], true))`

Return:

* pattern


### Tables defined by `lexer`

<a id="lexer.colors"></a>
#### `lexer.colors`

Map of color name strings to color values in `0xBBGGRR` or `"#RRGGBB"`
format.
Note: for applications running within a terminal emulator, only 16 color
values are recognized, regardless of how many colors a user's terminal
actually supports. (A terminal emulator's settings determine how to actually
display these recognized color values, which may end up being mapped to a
completely different color set.) In order to use the light variant of a
color, some terminals require that a style's `bold` attribute be set along
with that normal color.
Recognized color values are black (0x000000), red (0x000080), green
(0x008000), yellow (0x008080), blue (0x800000), magenta (0x800080), cyan
(0x808000), white (0xC0C0C0), light black (0x404040), light red (0x0000FF),
light green (0x00FF00), light yellow (0x00FFFF), light blue (0xFF0000),
light magenta (0xFF00FF), light cyan (0xFFFF00), and light white (0xFFFFFF).

<a id="lexer.styles"></a>
#### `lexer.styles`

Map of style names to style definition tables.

Style names consist of the following default names as well as the token names
defined by lexers.

* `default`: The default style all others are based on.
* `line_number`: The line number margin style.
* `control_char`: The style of control character blocks.
* `indent_guide`: The style of indentation guides.
* `call_tip`: The style of call tip text. Only the `font`, `size`, `fore`,
  and `back` style definition fields are supported.
* `fold_display_text`: The style of text displayed next to folded lines.
* `class`, `comment`, `constant`, `embedded`, `error`, `function`,
  `identifier`, `keyword`, `label`, `number`, `operator`, `preprocessor`,
  `regex`, `string`, `type`, `variable`, `whitespace`: Some token names used
  by lexers. Some lexers may define more token names, so this list is not
  exhaustive.
* *`lang`*`_whitespace`: A special style for whitespace tokens in lexer name
  *lang*. It inherits from `whitespace`, and is used in place of it for all
  lexers.

Style definition tables may contain the following fields:

* `font`: String font name.
* `size`: Integer font size.
* `bold`: Whether or not the font face is bold. The default value is `false`.
* `weight`: Integer weight or boldness of a font, between 1 and 999.
* `italics`: Whether or not the font face is italic. The default value is
  `false`.
* `underlined`: Whether or not the font face is underlined.
  The default value is `false`.
* `fore`: Font face foreground color in `0xBBGGRR` or `"#RRGGBB"` format.
* `back`: Font face background color in `0xBBGGRR` or `"#RRGGBB"` format.
* `eolfilled`: Whether or not the background color extends to the end of the
  line. The default value is `false`.
* `case`: Font case: `'u'` for upper, `'l'` for lower, and `'m'` for normal,
  mixed case. The default value is `'m'`.
* `visible`: Whether or not the text is visible. The default value is `true`.
* `changeable`: Whether or not the text is changeable instead of read-only.
  The default value is `true`.

---
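
Taken together, the pattern functions and tables documented above compose
into a complete lexer module. The following is a minimal sketch, not part of
the Scintillua distribution; the language name `demo` and its keyword list
are invented for illustration:

```lua
-- A minimal sketch of a lexer built from the API documented above.
-- The language name ('demo') and keyword list are hypothetical.
local lexer = require('lexer')
local token, word_match = lexer.token, lexer.word_match

local lex = lexer.new('demo')

-- Whitespace (uses the predefined WHITESPACE token name).
lex:add_rule('whitespace', token(lexer.WHITESPACE, lexer.space^1))

-- Keywords via word_match().
lex:add_rule('keyword', token(lexer.KEYWORD, word_match[[if else while end]]))

-- Line comments via to_eol().
lex:add_rule('comment', token(lexer.COMMENT, lexer.to_eol('#')))

-- Double-quoted strings via range(); '\' escaping defaults to true here
-- because the start and end delimiters are identical single characters.
lex:add_rule('string', token(lexer.STRING, lexer.range('"')))

-- A custom token name needs a style via add_style(); this one reuses an
-- existing style definition table from lexer.styles.
lex:add_rule('annotation', token('annotation', '@' * lexer.word))
lex:add_style('annotation', lexer.styles.preprocessor)

return lex
```

Saved as `demo.lua` in the directory named by the "lexer.lpeg.home" property,
such a file would be loadable via `lexer.load('demo')` or selectable with the
`SCI_SETLEXERLANGUAGE` Scintillua API call.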