-- Copyright 2006-2016 Mitchell mitchell.att.foicica.com. See LICENSE.

local M = {}

--[=[ This comment is for LuaDoc.
---
-- Lexes Scintilla documents with Lua and LPeg.
--
-- ## Overview
--
-- Lexers highlight the syntax of source code. Scintilla (the editing component
-- behind [Textadept][] and [SciTE][]) traditionally uses static, compiled C++
-- lexers which are notoriously difficult to create and/or extend. On the other
-- hand, Lua makes it easy to rapidly create new lexers, extend existing
-- ones, and embed lexers within one another. Lua lexers tend to be more
-- readable than C++ lexers too.
--
-- Lexers are Parsing Expression Grammars, or PEGs, composed with the Lua
-- [LPeg library][]. The following table comes from the LPeg documentation and
-- summarizes all you need to know about constructing basic LPeg patterns. This
-- module provides convenience functions for creating and working with other
-- more advanced patterns and concepts.
--
-- Operator             | Description
-- ---------------------|------------
-- `lpeg.P(string)`     | Matches `string` literally.
-- `lpeg.P(`_`n`_`)`    | Matches exactly _`n`_ characters.
-- `lpeg.S(string)`     | Matches any character in set `string`.
-- `lpeg.R("`_`xy`_`")` | Matches any character in the range `x` to `y`.
-- `patt^`_`n`_         | Matches at least _`n`_ repetitions of `patt`.
-- `patt^-`_`n`_        | Matches at most _`n`_ repetitions of `patt`.
-- `patt1 * patt2`      | Matches `patt1` followed by `patt2`.
-- `patt1 + patt2`      | Matches `patt1` or `patt2` (ordered choice).
-- `patt1 - patt2`      | Matches `patt1` if `patt2` does not match.
-- `-patt`              | Equivalent to `("" - patt)`.
-- `#patt`              | Matches `patt` but consumes no input.
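--
-- As a brief example of composition, the following hypothetical pattern
-- matches a Lua-style identifier: a letter or underscore followed by any
-- number of letters, digits, or underscores (it mirrors the predefined
-- `lexer.word` pattern described later):
--
--     local word = (lpeg.R('AZ', 'az') + '_') *
--                  (lpeg.R('AZ', 'az', '09') + '_')^0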
--
-- The first part of this document deals with rapidly constructing a simple
-- lexer. The next part deals with more advanced techniques, such as custom
-- coloring and embedding lexers within one another. Following that is a
-- discussion about code folding, or being able to tell Scintilla which code
-- blocks are "foldable" (temporarily hideable from view). After that are
-- instructions on how to use LPeg lexers with the aforementioned Textadept and
-- SciTE editors. Finally there are comments on lexer performance and
-- limitations.
--
-- [LPeg library]: http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html
-- [Textadept]: http://foicica.com/textadept
-- [SciTE]: http://scintilla.org/SciTE.html
--
-- ## Lexer Basics
--
-- The *lexers/* directory contains all lexers, including your new one. Before
-- attempting to write one from scratch though, first determine if your
-- programming language is similar to any of the 80+ languages supported. If so,
-- you may be able to copy and modify that lexer, saving some time and effort.
-- The filename of your lexer should be the name of your programming language in
-- lower case followed by a *.lua* extension. For example, a new Lua lexer has
-- the name *lua.lua*.
--
-- Note: Try to refrain from using one-character language names like "c", "d",
-- or "r". For example, Scintillua uses "ansi_c", "dmd", and "rstats",
-- respectively.
--
-- ### New Lexer Template
--
-- There is a *lexers/template.txt* file that contains a simple template for a
-- new lexer. Feel free to use it, replacing the '?'s with the name of your
-- lexer:
--
--     -- ? LPeg lexer.
--
--     local l = require('lexer')
--     local token, word_match = l.token, l.word_match
--     local P, R, S = lpeg.P, lpeg.R, lpeg.S
--
--     local M = {_NAME = '?'}
--
--     -- Whitespace.
--     local ws = token(l.WHITESPACE, l.space^1)
--
--     M._rules = {
--       {'whitespace', ws},
--     }
--
--     M._tokenstyles = {
--
--     }
--
--     return M
--
-- The first three lines of code simply define often-used convenience
-- variables. The `local M = {_NAME = '?'}` line and the final `return M` line
-- define and return the lexer object Scintilla uses; they are very important
-- and must be part of every lexer. The line below the "Whitespace" comment
-- defines something called a "token", an essential building block of lexers.
-- You will learn about tokens shortly. The rest of the code defines a set of
-- grammar rules and token styles. You will learn about those later. Note,
-- however, the `M.` prefix in front of `_rules` and `_tokenstyles`: not only do
-- these tables belong to their respective lexers, but any non-local variables
-- need the `M.` prefix too so as not to affect Lua's global environment. All in
-- all, this is a minimal, working lexer that you can build on.
--
-- ### Tokens
--
-- Take a moment to think about your programming language's structure. What kind
-- of key elements does it have? In the template shown earlier, one predefined
-- element all languages have is whitespace. Your language probably also has
-- elements like comments, strings, and keywords. Lexers refer to these elements
-- as "tokens". Tokens are the fundamental "building blocks" of lexers. Lexers
-- break down source code into tokens for coloring, which results in the syntax
-- highlighting familiar to you. It is up to you how specific your lexer is when
-- it comes to tokens. Perhaps only distinguishing between keywords and
-- identifiers is necessary, or maybe recognizing constants and built-in
-- functions, methods, or libraries is desirable. The Lua lexer, for example,
-- defines 11 tokens: whitespace, comments, strings, numbers, keywords, built-in
-- functions, constants, built-in libraries, identifiers, labels, and operators.
-- Even though constants, built-in functions, and built-in libraries are subsets
-- of identifiers, Lua programmers find it helpful for the lexer to distinguish
-- between them all. It is perfectly acceptable to just recognize keywords and
-- identifiers.
--
-- In a lexer, tokens consist of a token name and an LPeg pattern that matches a
-- sequence of characters recognized as an instance of that token. Create tokens
-- using the [`lexer.token()`]() function. Let us examine the "whitespace" token
-- defined in the template shown earlier:
--
--     local ws = token(l.WHITESPACE, l.space^1)
--
-- At first glance, the first argument does not appear to be a string name and
-- the second argument does not appear to be an LPeg pattern. Perhaps you
-- expected something like:
--
--     local ws = token('whitespace', S('\t\v\f\n\r ')^1)
--
-- The `lexer` (`l`) module actually provides a convenient list of common token
-- names and common LPeg patterns for you to use. Token names include
-- [`lexer.DEFAULT`](), [`lexer.WHITESPACE`](), [`lexer.COMMENT`](),
-- [`lexer.STRING`](), [`lexer.NUMBER`](), [`lexer.KEYWORD`](),
-- [`lexer.IDENTIFIER`](), [`lexer.OPERATOR`](), [`lexer.ERROR`](),
-- [`lexer.PREPROCESSOR`](), [`lexer.CONSTANT`](), [`lexer.VARIABLE`](),
-- [`lexer.FUNCTION`](), [`lexer.CLASS`](), [`lexer.TYPE`](), [`lexer.LABEL`](),
-- [`lexer.REGEX`](), and [`lexer.EMBEDDED`](). Patterns include
-- [`lexer.any`](), [`lexer.ascii`](), [`lexer.extend`](), [`lexer.alpha`](),
-- [`lexer.digit`](), [`lexer.alnum`](), [`lexer.lower`](), [`lexer.upper`](),
-- [`lexer.xdigit`](), [`lexer.cntrl`](), [`lexer.graph`](), [`lexer.print`](),
-- [`lexer.punct`](), [`lexer.space`](), [`lexer.newline`](),
-- [`lexer.nonnewline`](), [`lexer.nonnewline_esc`](), [`lexer.dec_num`](),
-- [`lexer.hex_num`](), [`lexer.oct_num`](), [`lexer.integer`](),
-- [`lexer.float`](), and [`lexer.word`](). You may use your own token names if
-- none of the above fit your language, but an advantage to using predefined
-- token names is that your lexer's tokens will inherit the universal syntax
-- highlighting color theme used by your text editor.
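--
-- For instance, combining a predefined token name with a predefined pattern
-- yields a complete identifier token in a single line (a definition of this
-- sort appears in many of the stock lexers):
--
--     local identifier = token(l.IDENTIFIER, l.word)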
--
-- #### Example Tokens
--
-- So, how might you define other tokens like comments, strings, and keywords?
-- Here are some examples.
--
-- **Comments**
--
-- Line-style comments beginning with one or more prefix characters are easy to
-- express with LPeg:
--
--     local shell_comment = token(l.COMMENT, '#' * l.nonnewline^0)
--     local c_line_comment = token(l.COMMENT, '//' * l.nonnewline_esc^0)
--
-- The comments above start with a '#' or "//" and go to the end of the line.
-- The second comment recognizes the next line also as a comment if the current
-- line ends with a '\' escape character.
--
-- C-style "block" comments with a start and end delimiter are also easy to
-- express:
--
--     local c_comment = token(l.COMMENT, '/*' * (l.any - '*/')^0 * P('*/')^-1)
--
-- This comment starts with a "/\*" sequence and contains anything up to and
-- including an ending "\*/" sequence. The ending "\*/" is optional so the lexer
-- can recognize unfinished comments as comments and highlight them properly.
--
-- **Strings**
--
-- It is tempting to think that a string is not much different from the block
-- comment shown above in that both have start and end delimiters:
--
--     local dq_str = '"' * (l.any - '"')^0 * P('"')^-1
--     local sq_str = "'" * (l.any - "'")^0 * P("'")^-1
--     local simple_string = token(l.STRING, dq_str + sq_str)
--
-- However, most programming languages allow escape sequences in strings such
-- that a sequence like "\\"" in a double-quoted string indicates that the
-- '"' is not the end of the string. The above token incorrectly matches
-- such a string. Instead, use the [`lexer.delimited_range()`]() convenience
-- function.
--
--     local dq_str = l.delimited_range('"')
--     local sq_str = l.delimited_range("'")
--     local string = token(l.STRING, dq_str + sq_str)
--
-- In this case, the lexer treats '\' as an escape character in a string
-- sequence.
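--
-- If your language does not allow strings to span multiple lines, pass the
-- `single_line` flag to [`lexer.delimited_range()`]() so that the match stops
-- at a newline:
--
--     local dq_str = l.delimited_range('"', true)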
--
-- **Keywords**
--
-- Instead of matching _n_ keywords with _n_ `P('keyword_`_`n`_`')` ordered
-- choices, use another convenience function: [`lexer.word_match()`](). It is
-- much easier and more efficient to write word matches like:
--
--     local keyword = token(l.KEYWORD, l.word_match{
--       'keyword_1', 'keyword_2', ..., 'keyword_n'
--     })
--
--     local case_insensitive_keyword = token(l.KEYWORD, l.word_match({
--       'KEYWORD_1', 'keyword_2', ..., 'KEYword_n'
--     }, nil, true))
--
--     local hyphened_keyword = token(l.KEYWORD, l.word_match({
--       'keyword-1', 'keyword-2', ..., 'keyword-n'
--     }, '-'))
--
-- By default, characters considered to be in keywords are in the set of
-- alphanumeric characters and underscores. The last token demonstrates how to
-- allow '-' (hyphen) characters to be in keywords as well.
--
-- **Numbers**
--
-- Most programming languages have the same format for integer and float tokens,
-- so it might be as simple as using a couple of predefined LPeg patterns:
--
--     local number = token(l.NUMBER, l.float + l.integer)
--
-- However, some languages allow postfix characters on integers.
--
--     local integer = P('-')^-1 * (l.dec_num * S('lL')^-1)
--     local number = token(l.NUMBER, l.float + l.hex_num + integer)
--
-- Your language may need other tweaks, but it is up to you how fine-grained you
-- want your highlighting to be. After all, you are not writing a compiler or
-- interpreter!
--
-- ### Rules
--
-- Programming languages have grammars, which specify valid token structure. For
-- example, comments usually cannot appear within a string. Grammars consist of
-- rules, which are simply combinations of tokens. Recall from the lexer
-- template the `_rules` table, which defines all the rules used by the lexer
-- grammar:
--
--     M._rules = {
--       {'whitespace', ws},
--     }
--
-- Each entry in a lexer's `_rules` table consists of a rule name and its
-- associated pattern. Rule names are completely arbitrary and serve only to
-- identify and distinguish between different rules. Rule order is important: if
-- text does not match the first rule, the lexer tries the second rule, and so
-- on. This simple grammar says to match whitespace tokens under a rule named
-- "whitespace".
--
-- To illustrate the importance of rule order, here is an example of a
-- simplified Lua grammar:
--
--     M._rules = {
--       {'whitespace', ws},
--       {'keyword', keyword},
--       {'identifier', identifier},
--       {'string', string},
--       {'comment', comment},
--       {'number', number},
--       {'label', label},
--       {'operator', operator},
--     }
--
-- Note how identifiers come after keywords. In Lua, as with most programming
-- languages, the characters allowed in keywords and identifiers are in the same
-- set (alphanumerics plus underscores). If the lexer specified the "identifier"
-- rule before the "keyword" rule, all keywords would match identifiers and thus
-- incorrectly highlight as identifiers instead of keywords. The same idea
-- applies to function, constant, etc. tokens that you may want to distinguish
-- between: their rules should come before identifiers.
--
-- So what about text that does not match any rules? For example in Lua, the '!'
-- character is meaningless outside a string or comment. Normally the lexer
-- skips over such text.
-- If instead you want to highlight these "syntax errors",
-- add an additional end rule:
--
--     M._rules = {
--       {'whitespace', ws},
--       {'error', token(l.ERROR, l.any)},
--     }
--
-- This identifies and highlights any character not matched by an existing
-- rule as a `lexer.ERROR` token.
--
-- Even though the rules defined in the examples above contain a single token,
-- rules may consist of multiple tokens. For example, a rule for an HTML tag
-- could consist of a tag token followed by an arbitrary number of attribute
-- tokens, allowing the lexer to highlight all tokens separately. The rule might
-- look something like this:
--
--     {'tag', tag_start * (ws * attributes)^0 * tag_end^-1}
--
-- Note however that lexers with complex rules like these are more prone to lose
-- track of their state.
--
-- ### Summary
--
-- Lexers primarily consist of tokens and grammar rules. At your disposal are a
-- number of convenience patterns and functions for rapidly creating a lexer. If
-- you choose to use predefined token names for your tokens, you do not have to
-- define how the lexer highlights them. The tokens will inherit the default
-- syntax highlighting color theme your editor uses.
--
-- ## Advanced Techniques
--
-- ### Styles and Styling
--
-- The most basic form of syntax highlighting is assigning different colors to
-- different tokens. Instead of highlighting with just colors, Scintilla allows
-- for more rich highlighting, or "styling", with different fonts, font sizes,
-- font attributes, and foreground and background colors, just to name a few.
-- The unit of this rich highlighting is called a "style". Styles are simply
-- strings of comma-separated property settings. By default, lexers associate
-- predefined token names like `lexer.WHITESPACE`, `lexer.COMMENT`,
-- `lexer.STRING`, etc. with particular styles as part of a universal color
-- theme. These predefined styles include [`lexer.STYLE_CLASS`](),
-- [`lexer.STYLE_COMMENT`](), [`lexer.STYLE_CONSTANT`](),
-- [`lexer.STYLE_ERROR`](), [`lexer.STYLE_EMBEDDED`](),
-- [`lexer.STYLE_FUNCTION`](), [`lexer.STYLE_IDENTIFIER`](),
-- [`lexer.STYLE_KEYWORD`](), [`lexer.STYLE_LABEL`](), [`lexer.STYLE_NUMBER`](),
-- [`lexer.STYLE_OPERATOR`](), [`lexer.STYLE_PREPROCESSOR`](),
-- [`lexer.STYLE_REGEX`](), [`lexer.STYLE_STRING`](), [`lexer.STYLE_TYPE`](),
-- [`lexer.STYLE_VARIABLE`](), and [`lexer.STYLE_WHITESPACE`](). Like with
-- predefined token names and LPeg patterns, you may define your own styles. At
-- their core, styles are just strings, so you may create new ones and/or modify
-- existing ones. Each style consists of the following comma-separated settings:
--
-- Setting        | Description
-- ---------------|------------
-- font:_name_    | The name of the font the style uses.
-- size:_int_     | The size of the font the style uses.
-- [not]bold      | Whether or not the font face is bold.
-- weight:_int_   | The weight or boldness of a font, between 1 and 999.
-- [not]italics   | Whether or not the font face is italic.
-- [not]underlined| Whether or not the font face is underlined.
-- fore:_color_   | The foreground color of the font face.
-- back:_color_   | The background color of the font face.
-- [not]eolfilled | Does the background color extend to the end of the line?
-- case:_char_    | The case of the font ('u': upper, 'l': lower, 'm': normal).
-- [not]visible   | Whether or not the text is visible.
-- [not]changeable| Whether the text is changeable or read-only.
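--
-- For example, a complete style string that specifies a bold, 12-point font
-- might look like the following (a hypothetical illustration; in practice,
-- prefer predefined styles for the reasons given below):
--
--     local style_big_bold = 'font:Monospace,size:12,bold'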
--
-- Specify font colors in either "#RRGGBB" format, "0xBBGGRR" format, or the
-- decimal equivalent of the latter. As with token names, LPeg patterns, and
-- styles, there is a set of predefined color names, but they vary depending on
-- the current color theme in use. Therefore, it is generally not a good idea to
-- manually define colors within styles in your lexer since they might not fit
-- into a user's chosen color theme. Try to refrain from even using predefined
-- colors in a style because that color may be theme-specific. Instead, the best
-- practice is to either use predefined styles or derive new color-agnostic
-- styles from predefined ones. For example, Lua "longstring" tokens use the
-- existing `lexer.STYLE_STRING` style instead of defining a new one.
--
-- #### Example Styles
--
-- Defining styles is pretty straightforward. An empty style that inherits the
-- default theme settings is simply an empty string:
--
--     local style_nothing = ''
--
-- A similar style but with a bold font face looks like this:
--
--     local style_bold = 'bold'
--
-- If you want the same style, but also with an italic font face, define the new
-- style in terms of the old one:
--
--     local style_bold_italic = style_bold..',italics'
--
-- This allows you to derive new styles from predefined ones without having to
-- rewrite them. This operation leaves the old style unchanged. Thus if you
-- had a "static variable" token whose style you wanted to base off of
-- `lexer.STYLE_VARIABLE`, it would probably look like:
--
--     local style_static_var = l.STYLE_VARIABLE..',italics'
--
-- The color theme files in the *lexers/themes/* folder give more examples of
-- style definitions.
--
-- ### Token Styles
--
-- Lexers use the `_tokenstyles` table to assign tokens to particular styles.
-- Recall the token definition and `_tokenstyles` table from the lexer template:
--
--     local ws = token(l.WHITESPACE, l.space^1)
--
--     ...
--
--     M._tokenstyles = {
--
--     }
--
-- Why is a style not assigned to the `lexer.WHITESPACE` token? As mentioned
-- earlier, lexers automatically associate tokens that use predefined token
-- names with a particular style. Only tokens with custom token names need
-- manual style associations. As an example, consider a custom whitespace token:
--
--     local ws = token('custom_whitespace', l.space^1)
--
-- Assigning a style to this token looks like:
--
--     M._tokenstyles = {
--       custom_whitespace = l.STYLE_WHITESPACE
--     }
--
-- Do not confuse token names with rule names. They are completely different
-- entities. In the example above, the lexer assigns the "custom_whitespace"
-- token the existing style for `WHITESPACE` tokens. If instead you want to
-- color the background of whitespace a shade of grey, it might look like:
--
--     local custom_style = l.STYLE_WHITESPACE..',back:$(color.grey)'
--     M._tokenstyles = {
--       custom_whitespace = custom_style
--     }
--
-- Notice that the lexer performs Scintilla/SciTE-style "$()" property
-- expansion. You may also use "%()". Remember to refrain from assigning
-- specific colors in styles, but in this case, all user color themes probably
-- define the "color.grey" property.
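--
-- Putting these pieces together, a lexer fragment that defines a custom
-- "annotation" token, adds it to the grammar, and assigns it a style might
-- look like the following sketch (the "annotation" token name is invented for
-- illustration):
--
--     local annotation = token('annotation', '@' * l.word)
--
--     M._rules = {
--       {'whitespace', ws},
--       {'annotation', annotation},
--     }
--
--     M._tokenstyles = {
--       annotation = l.STYLE_PREPROCESSOR
--     }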
--
-- ### Line Lexers
--
-- By default, lexers match the arbitrary chunks of text passed to them by
-- Scintilla. These chunks may be a full document, only the visible part of a
-- document, or even just portions of lines. Some lexers need to match whole
-- lines. For example, a lexer for the output of a file "diff" needs to know if
-- the line started with a '+' or '-' and then style the entire line
-- accordingly. To indicate that your lexer matches by line, use the
-- `_LEXBYLINE` field:
--
--     M._LEXBYLINE = true
--
-- Now the input text for the lexer is a single line at a time. Keep in mind
-- that line lexers do not have the ability to look ahead at subsequent lines.
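--
-- For example, the core of a hypothetical line lexer for "diff" output might
-- look like the following sketch (the "addition" and "deletion" token names
-- are illustrative and would need entries in `_tokenstyles`):
--
--     M._LEXBYLINE = true
--
--     local addition = token('addition', '+' * l.nonnewline^0)
--     local deletion = token('deletion', '-' * l.nonnewline^0)
--
--     M._rules = {
--       {'whitespace', ws},
--       {'addition', addition},
--       {'deletion', deletion},
--     }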
--
-- ### Embedded Lexers
--
-- Lexers embed within one another very easily, requiring minimal effort. In the
-- following sections, the lexer being embedded is called the "child" lexer and
-- the lexer a child is being embedded in is called the "parent". For example,
-- consider an HTML lexer and a CSS lexer. Either lexer stands alone for styling
-- its respective HTML or CSS files. However, CSS can be embedded inside HTML.
-- In this specific case, the CSS lexer is the "child" lexer with the HTML
-- lexer being the "parent". Now consider an HTML lexer and a PHP lexer. This
-- sounds a lot like the case with CSS, but there is a subtle difference: PHP
-- _embeds itself_ into HTML while CSS is _embedded in_ HTML. This fundamental
-- difference results in two types of embedded lexers: a parent lexer that
-- embeds other child lexers in it (like HTML embedding CSS), and a child lexer
-- that embeds itself within a parent lexer (like PHP embedding itself in HTML).
--
-- #### Parent Lexer
--
-- Before embedding a child lexer into a parent lexer, the parent lexer needs to
-- load the child lexer. This is done with the [`lexer.load()`]() function. For
-- example, loading the CSS lexer within the HTML lexer looks like:
--
--     local css = l.load('css')
--
-- The next part of the embedding process is telling the parent lexer when to
-- switch over to the child lexer and when to switch back. The lexer refers to
-- these indications as the "start rule" and "end rule", respectively, and they
-- are just LPeg patterns. Continuing with the HTML/CSS example, the transition
-- from HTML to CSS is when the lexer encounters a "style" tag with a "type"
-- attribute whose value is "text/css":
--
--     local css_tag = P('<style') * P(function(input, index)
--       if input:find('^[^>]+type="text/css"', index) then
--         return index
--       end
--     end)
--
-- This pattern looks for the beginning of a "style" tag and searches its
-- attribute list for the text "`type="text/css"`". (In this simplified example,
-- the Lua pattern does not allow whitespace around the '=', nor does it allow
-- single-quoted attribute values, which are also valid.) If there is a match,
-- the functional pattern returns a value instead of `nil`. In this case, the
-- value returned does not matter because we ultimately want to style the
-- "style" tag as an HTML tag, so the actual start rule looks like this:
--
--     local css_start_rule = #css_tag * tag
--
-- Now that the parent knows when to switch to the child, it needs to know when
-- to switch back. In the case of HTML/CSS, the switch back occurs when the
-- lexer encounters an ending "style" tag, though the lexer should still style
-- the tag as an HTML tag:
--
--     local css_end_rule = #P('</style>') * tag
--
-- Once the parent loads the child lexer and defines the child's start and end
-- rules, it embeds the child with the [`lexer.embed_lexer()`]() function:
--
--     l.embed_lexer(M, css, css_start_rule, css_end_rule)
--
-- The first parameter is the parent lexer object to embed the child in, which
-- in this case is `M`. The other three parameters are the child lexer object
-- loaded earlier followed by its start and end rules.
--
-- #### Child Lexer
--
-- The process for instructing a child lexer to embed itself into a parent is
-- very similar to embedding a child into a parent: first, load the parent lexer
-- into the child lexer with the [`lexer.load()`]() function and then create
-- start and end rules for the child lexer. However, in this case, swap the
-- lexer object arguments to [`lexer.embed_lexer()`](). For example, in the PHP
-- lexer:
--
--     local html = l.load('html')
--     local php_start_rule = token('php_tag', '<?php ')
--     local php_end_rule = token('php_tag', '?>')
--     l.embed_lexer(html, M, php_start_rule, php_end_rule)
--
-- ### Lexers with Complex State
--
-- The vast majority of lexers are not stateful and can operate on any chunk of
-- text in a document. However, there may be rare cases where a lexer does need
-- to keep track of some sort of persistent state. Rather than using `lpeg.P`
-- function patterns that set state variables, it is recommended to make use of
-- Scintilla's built-in, per-line state integers via [`lexer.line_state`](). It
-- was designed to accommodate up to 32 bit flags for tracking state.
-- [`lexer.line_from_position()`]() will return the line for any position given
-- to an `lpeg.P` function pattern. (Any positions derived from that position
-- argument will also work.)
--
-- Writing stateful lexers is beyond the scope of this document.
--
-- ## Code Folding
--
-- When reading source code, it is occasionally helpful to temporarily hide
-- blocks of code like functions, classes, comments, etc. This is the concept of
-- "folding". In the Textadept and SciTE editors for example, little indicators
-- in the editor margins appear next to code that can be folded at places called
-- "fold points". When the user clicks an indicator, the editor hides the code
-- associated with the indicator until the user clicks the indicator again. The
-- lexer specifies these fold points and what code exactly to fold.
--
-- The fold points for most languages occur on keywords or character sequences.
-- Examples of fold keywords are "if" and "end" in Lua and examples of fold
-- character sequences are '{', '}', "/\*", and "\*/" in C for code block and
-- comment delimiters, respectively. However, these fold points cannot occur
-- just anywhere. For example, lexers should not recognize fold keywords that
-- appear within strings or comments. The lexer's `_foldsymbols` table allows
-- you to conveniently define fold points with such granularity.
-- For example, consider C:
--
--     M._foldsymbols = {
--       [l.OPERATOR] = {['{'] = 1, ['}'] = -1},
--       [l.COMMENT] = {['/*'] = 1, ['*/'] = -1},
--       _patterns = {'[{}]', '/%*', '%*/'}
--     }
--
-- The first assignment states that any '{' or '}' that the lexer recognizes as
-- part of a `lexer.OPERATOR` token is a fold point. The integer `1` indicates
-- the match is a beginning fold point and `-1` indicates the match is an ending
-- fold point. Likewise, the second assignment states that any "/\*" or "\*/"
-- that the lexer recognizes as part of a `lexer.COMMENT` token is a fold point.
-- The lexer does not consider any occurrences of these characters outside their
-- defined tokens (such as in a string) as fold points. Finally, every
-- `_foldsymbols` table must have a `_patterns` field that contains a list of
-- [Lua patterns][] that match fold points. If the lexer encounters text that
-- matches one of those patterns, the lexer looks up the matched text in its
-- token's table in order to determine whether or not the text is a fold point.
-- In the example above, the first Lua pattern matches any '{' or '}'
-- characters. When the lexer comes across one of those characters, it checks if
-- the match is a `lexer.OPERATOR` token. If so, the lexer identifies the match
-- as a fold point. The same idea applies for the other patterns. (The '%' is in
-- the other patterns because '\*' is a special character in Lua patterns that
-- needs escaping.) How do you specify fold keywords? Here is an example for
-- Lua:
--
--     M._foldsymbols = {
--       [l.KEYWORD] = {
--         ['if'] = 1, ['do'] = 1, ['function'] = 1,
--         ['end'] = -1, ['repeat'] = 1, ['until'] = -1
--       },
--       _patterns = {'%l+'}
--     }
--
-- Any time the lexer encounters a lower case word, if that word is a
-- `lexer.KEYWORD` token and in the associated list of fold points, the lexer
-- identifies the word as a fold point.
--
-- If your lexer has case-insensitive keywords as fold points, simply add a
-- `_case_insensitive = true` option to the `_foldsymbols` table and specify
-- keywords in lower case.
--
-- If your lexer needs to do some additional processing to determine if a match
-- is a fold point, assign a function that returns an integer. Returning `1` or
-- `-1` indicates the match is a fold point. Returning `0` indicates it is not.
-- For example:
--
--     local function fold_strange_token(text, pos, line, s, match)
--       if ... then
--         return 1 -- beginning fold point
--       elseif ... then
--         return -1 -- ending fold point
--       end
--       return 0
--     end
--
--     M._foldsymbols = {
--       ['strange_token'] = {['|'] = fold_strange_token},
--       _patterns = {'|'}
--     }
--
-- Any time the lexer encounters a '|' that is a "strange_token", it calls the
-- `fold_strange_token` function to determine if '|' is a fold point. The lexer
-- calls these functions with the following arguments: the text to identify fold
-- points in, the beginning position of the current line in the text to fold,
-- the current line's text, the position in the current line the matched text
-- starts at, and the matched text itself.
--
-- [Lua patterns]: http://www.lua.org/manual/5.2/manual.html#6.4.1
--
-- ### Fold by Indentation
--
-- Some languages have significant whitespace and/or no delimiters that indicate
-- fold points.
-- If your lexer falls into this category and you would like to
-- mark fold points based on changes in indentation, use the
-- `_FOLDBYINDENTATION` field:
--
--     M._FOLDBYINDENTATION = true
--
-- ## Using Lexers
--
-- ### Textadept
--
-- Put your lexer in your *~/.textadept/lexers/* directory so you do not
-- overwrite it when upgrading Textadept. Also, lexers in this directory
-- override default lexers. Thus, Textadept loads a user *lua* lexer instead of
-- the default *lua* lexer. This is convenient for tweaking a default lexer to
-- your liking. Then add a [file type][] for your lexer if necessary.
--
-- [file type]: _M.textadept.file_types.html
--
-- ### SciTE
--
-- Create a *.properties* file for your lexer and `import` it in either your
-- *SciTEUser.properties* or *SciTEGlobal.properties*. The *.properties* file
-- should contain:
--
--     file.patterns.[lexer_name]=[file_patterns]
--     lexer.$(file.patterns.[lexer_name])=[lexer_name]
--
-- where `[lexer_name]` is the name of your lexer (minus the *.lua* extension)
-- and `[file_patterns]` is a set of file extensions to use your lexer for.
--
-- Please note that Lua lexers ignore any styling information in *.properties*
-- files. Your theme file in the *lexers/themes/* directory contains styling
-- information.
--
-- ## Considerations
--
-- ### Performance
--
-- There might be some slight overhead when initializing a lexer, but loading a
-- file from disk into Scintilla is usually more expensive. On modern computer
-- systems, I see no difference in speed between LPeg lexers and Scintilla's C++
-- ones. Optimize lexers for speed by re-arranging rules in the `_rules` table
-- so that the most common rules match first. Do keep in mind that order matters
-- for similar rules.
--
-- ### Limitations
--
-- Embedded preprocessor languages like PHP cannot completely embed in their
-- parent languages in that the parent's tokens do not support start and end
-- rules. This mostly goes unnoticed, but code like
--
--     <div id="<?php echo $id; ?>">
--
-- or
--
--     <div <?php if ($odd) { echo 'class="odd"'; } ?>>
--
-- will not style correctly.
--
-- ### Troubleshooting
--
-- Errors in lexers can be tricky to debug. Lexers print Lua errors to
-- `io.stderr` and `_G.print()` statements to `io.stdout`. Running your editor
-- from a terminal is the easiest way to see errors as they occur.
--
-- ### Risks
--
-- Poorly written lexers have the ability to crash Scintilla (and thus its
-- containing application), so unsaved data might be lost. However, I have only
-- observed these crashes in early lexer development, when syntax errors or
-- pattern errors are present. Once the lexer actually starts styling text
-- (either correctly or incorrectly, it does not matter), I have not observed
-- any crashes.
--
-- ### Acknowledgements
--
-- Thanks to Peter Odding for his [lexer post][] on the Lua mailing list
-- that inspired me, and thanks to Roberto Ierusalimschy for LPeg.
--
-- [lexer post]: http://lua-users.org/lists/lua-l/2007-04/msg00116.html
-- @field LEXERPATH (string)
--   The path used to search for a lexer to load.
--   Identical in format to Lua's `package.path` string.
--   The default value is `package.path`.
-- @field DEFAULT (string)
--   The token name for default tokens.
-- @field WHITESPACE (string)
--   The token name for whitespace tokens.
-- @field COMMENT (string)
--   The token name for comment tokens.
-- @field STRING (string)
--   The token name for string tokens.
-- @field NUMBER (string)
--   The token name for number tokens.
-- @field KEYWORD (string)
--   The token name for keyword tokens.
-- @field IDENTIFIER (string)
--   The token name for identifier tokens.
-- @field OPERATOR (string)
--   The token name for operator tokens.
-- @field ERROR (string)
--   The token name for error tokens.
-- @field PREPROCESSOR (string)
--   The token name for preprocessor tokens.
-- @field CONSTANT (string)
--   The token name for constant tokens.
-- @field VARIABLE (string)
--   The token name for variable tokens.
-- @field FUNCTION (string)
--   The token name for function tokens.
-- @field CLASS (string)
--   The token name for class tokens.
-- @field TYPE (string)
--   The token name for type tokens.
-- @field LABEL (string)
--   The token name for label tokens.
-- @field REGEX (string)
--   The token name for regex tokens.
-- @field STYLE_CLASS (string)
--   The style typically used for class definitions.
-- @field STYLE_COMMENT (string)
--   The style typically used for code comments.
-- @field STYLE_CONSTANT (string)
--   The style typically used for constants.
-- @field STYLE_ERROR (string)
--   The style typically used for erroneous syntax.
-- @field STYLE_FUNCTION (string)
--   The style typically used for function definitions.
-- @field STYLE_KEYWORD (string)
--   The style typically used for language keywords.
-- @field STYLE_LABEL (string)
--   The style typically used for labels.
-- @field STYLE_NUMBER (string)
--   The style typically used for numbers.
-- @field STYLE_OPERATOR (string)
--   The style typically used for operators.
-- @field STYLE_REGEX (string)
--   The style typically used for regular expression strings.
-- @field STYLE_STRING (string)
--   The style typically used for strings.
-- @field STYLE_PREPROCESSOR (string)
--   The style typically used for preprocessor statements.
-- @field STYLE_TYPE (string)
--   The style typically used for static types.
-- @field STYLE_VARIABLE (string)
--   The style typically used for variables.
-- @field STYLE_WHITESPACE (string)
--   The style typically used for whitespace.
-- @field STYLE_EMBEDDED (string)
--   The style typically used for embedded code.
-- @field STYLE_IDENTIFIER (string)
--   The style typically used for identifier words.
-- @field STYLE_DEFAULT (string)
--   The style all styles are based off of.
-- @field STYLE_LINENUMBER (string)
--   The style used for all margins except fold margins.
-- @field STYLE_BRACELIGHT (string)
--   The style used for highlighted brace characters.
-- @field STYLE_BRACEBAD (string)
--   The style used for unmatched brace characters.
-- @field STYLE_CONTROLCHAR (string)
--   The style used for control characters.
--   Color attributes are ignored.
-- @field STYLE_INDENTGUIDE (string)
--   The style used for indentation guides.
-- @field STYLE_CALLTIP (string)
--   The style used by call tips if [`buffer.call_tip_use_style`]() is set.
--   Only the font name, size, and color attributes are used.
-- @field any (pattern)
--   A pattern that matches any single character.
-- @field ascii (pattern)
--   A pattern that matches any ASCII character (codes 0 to 127).
-- @field extend (pattern)
--   A pattern that matches any ASCII extended character (codes 0 to 255).
-- @field alpha (pattern)
--   A pattern that matches any alphabetic character ('A'-'Z', 'a'-'z').
-- @field digit (pattern)
--   A pattern that matches any digit ('0'-'9').
-- @field alnum (pattern)
--   A pattern that matches any alphanumeric character ('A'-'Z', 'a'-'z',
--   '0'-'9').
-- @field lower (pattern)
--   A pattern that matches any lower case character ('a'-'z').
-- @field upper (pattern)
--   A pattern that matches any upper case character ('A'-'Z').
-- @field xdigit (pattern)
--   A pattern that matches any hexadecimal digit ('0'-'9', 'A'-'F', 'a'-'f').
-- @field cntrl (pattern)
--   A pattern that matches any control character (ASCII codes 0 to 31).
-- @field graph (pattern)
--   A pattern that matches any graphical character ('!' to '~').
-- @field print (pattern)
--   A pattern that matches any printable character (' ' to '~').
-- @field punct (pattern)
--   A pattern that matches any punctuation character ('!' to '/', ':' to '@',
--   '[' to '`', '{' to '~').
-- @field space (pattern)
--   A pattern that matches any whitespace character ('\t', '\v', '\f', '\n',
--   '\r', space).
-- @field newline (pattern)
--   A pattern that matches any set of end of line characters.
-- @field nonnewline (pattern)
--   A pattern that matches any single, non-newline character.
-- @field nonnewline_esc (pattern)
--   A pattern that matches any single, non-newline character or any set of end
--   of line characters escaped with '\'.
-- @field dec_num (pattern)
--   A pattern that matches a decimal number.
-- @field hex_num (pattern)
--   A pattern that matches a hexadecimal number.
-- @field oct_num (pattern)
--   A pattern that matches an octal number.
-- @field integer (pattern)
--   A pattern that matches either a decimal, hexadecimal, or octal number.
-- @field float (pattern)
--   A pattern that matches a floating point number.
-- @field word (pattern)
--   A pattern that matches a typical word. Words begin with a letter or
--   underscore and consist of alphanumeric and underscore characters.
-- @field FOLD_BASE (number)
--   The initial (root) fold level.
-- @field FOLD_BLANK (number)
--   Flag indicating that the line is blank.
-- @field FOLD_HEADER (number)
--   Flag indicating that the line is a fold point.
-- @field fold_level (table, Read-only)
--   Table of fold level bit-masks for line numbers starting from zero.
--   Fold level masks are composed of an integer level combined with any of the
--   following bits:
--
--   * `lexer.FOLD_BASE`
--     The initial fold level.
--   * `lexer.FOLD_BLANK`
--     The line is blank.
--   * `lexer.FOLD_HEADER`
--     The line is a header, or fold point.
-- @field indent_amount (table, Read-only)
--   Table of indentation amounts in character columns, for line numbers
--   starting from zero.
-- @field line_state (table)
--   Table of integer line states for line numbers starting from zero.
--   Line states can be used by lexers for keeping track of persistent states.
-- @field property (table)
--   Map of key-value string pairs.
-- @field property_expanded (table, Read-only)
--   Map of key-value string pairs with `$()` and `%()` variable replacement
--   performed in values.
-- @field property_int (table, Read-only)
--   Map of key-value pairs with values interpreted as numbers, or `0` if not
--   found.
-- @field style_at (table, Read-only)
--   Table of style names at positions in the buffer starting from 1.
module('lexer')]=]

--local lpeg = require('lpeg')
local lpeg_P, lpeg_R, lpeg_S, lpeg_V = lpeg.P, lpeg.R, lpeg.S, lpeg.V
local lpeg_Ct, lpeg_Cc, lpeg_Cp = lpeg.Ct, lpeg.Cc, lpeg.Cp
local lpeg_Cmt, lpeg_C = lpeg.Cmt, lpeg.C
local lpeg_match = lpeg.match

M.LEXERPATH = package.path

-- Table of loaded lexers.
local lexers = {}

-- Keep track of the last parent lexer loaded. This lexer's rules are used for
-- proxy lexers (those that load parent and child lexers to embed) that do not
-- declare a parent lexer.
local parent_lexer

if not package.searchpath then
  -- Searches for the given *name* in the given *path*.
  -- This is an implementation of Lua 5.2's `package.searchpath()` function for
  -- Lua 5.1.
  function package.searchpath(name, path)
    local tried = {}
    for part in path:gmatch('[^;]+') do
      local filename = part:gsub('%?', name)
      local f = io.open(filename, 'r')
      if f then f:close() return filename end
      tried[#tried + 1] = ("no file '%s'"):format(filename)
    end
    return nil, table.concat(tried, '\n')
  end
end

-- Adds a rule to a lexer's current ordered list of rules.
-- @param lexer The lexer to add the given rule to.
-- @param id The name associated with this rule. It is used for other lexers
--   to access this particular rule from the lexer's `_RULES` table. It does not
--   have to be the same as the name passed to `token`.
-- @param rule The LPeg pattern of the rule.
local function add_rule(lexer, id, rule)
  if not lexer._RULES then
    lexer._RULES = {}
    -- Contains an ordered list (by numerical index) of rule names. This is used
    -- in conjunction with lexer._RULES for building _TOKENRULE.
    lexer._RULEORDER = {}
  end
  lexer._RULES[id] = rule
  lexer._RULEORDER[#lexer._RULEORDER + 1] = id
end

-- Adds a new Scintilla style to Scintilla.
-- @param lexer The lexer to add the given style to.
-- @param token_name The name of the token associated with this style.
-- @param style A Scintilla style created from `style()`.
-- @see style
local function add_style(lexer, token_name, style)
  local num_styles = lexer._numstyles
  if num_styles == 32 then num_styles = num_styles + 8 end -- skip predefined
  if num_styles >= 255 then print('Too many styles defined (255 MAX)') end
  lexer._TOKENSTYLES[token_name], lexer._numstyles = num_styles, num_styles + 1
  lexer._EXTRASTYLES[token_name] = style
end

-- (Re)constructs `lexer._TOKENRULE`.
-- @param lexer The parent lexer.
local function join_tokens(lexer)
  local patterns, order = lexer._RULES, lexer._RULEORDER
  local token_rule = patterns[order[1]]
  for i = 2, #order do token_rule = token_rule + patterns[order[i]] end
  lexer._TOKENRULE = token_rule + M.token(M.DEFAULT, M.any)
  return lexer._TOKENRULE
end

-- Adds a given lexer and any of its embedded lexers to a given grammar.
-- @param grammar The grammar to add the lexer to.
-- @param lexer The lexer to add.
local function add_lexer(grammar, lexer, token_rule)
  local token_rule = join_tokens(lexer)
  local lexer_name = lexer._NAME
  for i = 1, #lexer._CHILDREN do
    local child = lexer._CHILDREN[i]
    if child._CHILDREN then add_lexer(grammar, child) end
    local child_name = child._NAME
    local rules = child._EMBEDDEDRULES[lexer_name]
    local rules_token_rule = grammar['__'..child_name] or rules.token_rule
    grammar[child_name] = (-rules.end_rule * rules_token_rule)^0 *
                          rules.end_rule^-1 * lpeg_V(lexer_name)
    local embedded_child = '_'..child_name
    grammar[embedded_child] = rules.start_rule * (-rules.end_rule *
                              rules_token_rule)^0 * rules.end_rule^-1
    token_rule = lpeg_V(embedded_child) + token_rule
  end
  grammar['__'..lexer_name] = token_rule -- can contain embedded lexer rules
  grammar[lexer_name] = token_rule^0
end

-- (Re)constructs `lexer._GRAMMAR`.
-- @param lexer The parent lexer.
-- @param initial_rule The name of the rule to start lexing with. The default
--   value is `lexer._NAME`. Multilang lexers use this to start with a child
--   rule if necessary.
local function build_grammar(lexer, initial_rule)
  local children = lexer._CHILDREN
  if children then
    local lexer_name = lexer._NAME
    if not initial_rule then initial_rule = lexer_name end
    local grammar = {initial_rule}
    add_lexer(grammar, lexer)
    lexer._INITIALRULE = initial_rule
    lexer._GRAMMAR = lpeg_Ct(lpeg_P(grammar))
  else
    lexer._GRAMMAR = lpeg_Ct(join_tokens(lexer)^0)
  end
end

local string_upper = string.upper
-- Default styles.
local default = {
  'nothing', 'whitespace', 'comment', 'string', 'number', 'keyword',
  'identifier', 'operator', 'error', 'preprocessor', 'constant', 'variable',
  'function', 'class', 'type', 'label', 'regex', 'embedded'
}
for i = 1, #default do
  local name, upper_name = default[i], string_upper(default[i])
  M[upper_name], M['STYLE_'..upper_name] = name, '$(style.'..name..')'
end
-- Predefined styles.
local predefined = {
  'default', 'linenumber', 'bracelight', 'bracebad', 'controlchar',
  'indentguide', 'calltip'
}
for i = 1, #predefined do
  local name, upper_name = predefined[i], string_upper(predefined[i])
  M[upper_name], M['STYLE_'..upper_name] = name, '$(style.'..name..')'
end

---
-- Initializes or loads and returns the lexer of string name *name*.
-- Scintilla calls this function in order to load a lexer. Parent lexers also
-- call this function in order to load child lexers and vice-versa. The user
-- calls this function in order to load a lexer when using Scintillua as a Lua
-- library.
-- @param name The name of the lexing language.
-- @param alt_name The alternate name of the lexing language. This is useful for
--   embedding the same child lexer with multiple sets of start and end tokens.
-- @return lexer object
-- @name load
function M.load(name, alt_name)
  if lexers[alt_name or name] then return lexers[alt_name or name] end
  parent_lexer = nil -- reset

  -- When using Scintillua as a stand-alone module, the `property` and
  -- `property_int` tables do not exist (they are not useful). Create them to
  -- prevent errors from occurring.
  if not M.property then
    M.property, M.property_int = {}, setmetatable({}, {
      __index = function(t, k) return tonumber(M.property[k]) or 0 end,
      __newindex = function() error('read-only property') end
    })
  end

  -- Load the language lexer with its rules, styles, etc.
  M.WHITESPACE = (alt_name or name)..'_whitespace'
  local lexer = dofile(assert(package.searchpath(name, M.LEXERPATH)))
  if alt_name then lexer._NAME = alt_name end

  -- Create the initial maps for token names to style numbers and styles.
  local token_styles = {}
  for i = 1, #default do token_styles[default[i]] = i - 1 end
  for i = 1, #predefined do token_styles[predefined[i]] = i + 31 end
  lexer._TOKENSTYLES, lexer._numstyles = token_styles, #default
  lexer._EXTRASTYLES = {}

  -- If the lexer is a proxy (loads parent and child lexers to embed) and does
  -- not declare a parent, try and find one and use its rules.
  if not lexer._rules and not lexer._lexer then lexer._lexer = parent_lexer end

  -- If the lexer is a proxy or a child that embedded itself, add its rules and
  -- styles to the parent lexer. Then set the parent to be the main lexer.
  if lexer._lexer then
    local l, _r, _s = lexer._lexer, lexer._rules, lexer._tokenstyles
    if not l._tokenstyles then l._tokenstyles = {} end
    if _r then
      for i = 1, #_r do
        -- Prevent rule id clashes.
        l._rules[#l._rules + 1] = {lexer._NAME..'_'.._r[i][1], _r[i][2]}
      end
    end
    if _s then
      for token, style in pairs(_s) do l._tokenstyles[token] = style end
    end
    lexer = l
  end

  -- Add the lexer's styles and build its grammar.
  if lexer._rules then
    if lexer._tokenstyles then
      for token, style in pairs(lexer._tokenstyles) do
        add_style(lexer, token, style)
      end
    end
    for i = 1, #lexer._rules do
      add_rule(lexer, lexer._rules[i][1], lexer._rules[i][2])
    end
    build_grammar(lexer)
  end
  -- Add the lexer's unique whitespace style.
  add_style(lexer, lexer._NAME..'_whitespace', M.STYLE_WHITESPACE)

  -- Process the lexer's fold symbols.
  if lexer._foldsymbols and lexer._foldsymbols._patterns then
    local patterns = lexer._foldsymbols._patterns
    for i = 1, #patterns do patterns[i] = '()('..patterns[i]..')' end
  end

  lexer.lex, lexer.fold = M.lex, M.fold
  lexers[alt_name or name] = lexer
  return lexer
end

---
-- Lexes a chunk of text *text* (that has an initial style number of
-- *init_style*) with lexer *lexer*.
-- If *lexer* has a `_LEXBYLINE` flag set, the text is lexed one line at a time.
-- Otherwise the text is lexed as a whole.
-- @param lexer The lexer object to lex with.
-- @param text The text in the buffer to lex.
-- @param init_style The current style. Multiple-language lexers use this to
--   determine which language to start lexing in.
-- @return table of token names and positions.
-- @name lex
function M.lex(lexer, text, init_style)
  if not lexer._GRAMMAR then return {M.DEFAULT, #text + 1} end
  if not lexer._LEXBYLINE then
    -- For multilang lexers, build a new grammar whose initial_rule is the
    -- current language.
    if lexer._CHILDREN then
      for style, style_num in pairs(lexer._TOKENSTYLES) do
        if style_num == init_style then
          local lexer_name = style:match('^(.+)_whitespace') or lexer._NAME
          if lexer._INITIALRULE ~= lexer_name then
            build_grammar(lexer, lexer_name)
          end
          break
        end
      end
    end
    return lpeg_match(lexer._GRAMMAR, text)
  else
    local tokens = {}
    local function append(tokens, line_tokens, offset)
      for i = 1, #line_tokens, 2 do
        tokens[#tokens + 1] = line_tokens[i]
        tokens[#tokens + 1] = line_tokens[i + 1] + offset
      end
    end
    local offset = 0
    local grammar = lexer._GRAMMAR
    for line in text:gmatch('[^\r\n]*\r?\n?') do
      local line_tokens = lpeg_match(grammar, line)
      if line_tokens then append(tokens, line_tokens, offset) end
      offset = offset + #line
      -- Use the default style to the end of the line if none was specified.
      if tokens[#tokens] ~= offset then
        tokens[#tokens + 1], tokens[#tokens + 2] = 'default', offset + 1
      end
    end
    return tokens
  end
end

---
-- Determines fold points in a chunk of text *text* with lexer *lexer*.
-- *text* starts at position *start_pos* on line number *start_line* with a
-- beginning fold level of *start_level* in the buffer. If *lexer* has a `_fold`
-- function or a `_foldsymbols` table, that field is used to perform folding.
-- Otherwise, if *lexer* has a `_FOLDBYINDENTATION` field set, or if a
-- `fold.by.indentation` property is set, folding by indentation is done.
-- @param lexer The lexer object to fold with.
-- @param text The text in the buffer to fold.
-- @param start_pos The position in the buffer *text* starts at, starting at
--   zero.
-- @param start_line The line number *text* starts on.
-- @param start_level The fold level *text* starts on.
-- @return table of fold levels.
-- @name fold
function M.fold(lexer, text, start_pos, start_line, start_level)
  local folds = {}
  if text == '' then return folds end
  local fold = M.property_int['fold'] > 0
  local FOLD_BASE = M.FOLD_BASE
  local FOLD_HEADER, FOLD_BLANK = M.FOLD_HEADER, M.FOLD_BLANK
  if fold and lexer._fold then
    return lexer._fold(text, start_pos, start_line, start_level)
  elseif fold and lexer._foldsymbols then
    local lines = {}
    for p, l in (text..'\n'):gmatch('()(.-)\r?\n') do
      lines[#lines + 1] = {p, l}
    end
    local fold_zero_sum_lines = M.property_int['fold.on.zero.sum.lines'] > 0
    local fold_symbols = lexer._foldsymbols
    local fold_symbols_patterns = fold_symbols._patterns
    local fold_symbols_case_insensitive = fold_symbols._case_insensitive
    local style_at, fold_level = M.style_at, M.fold_level
    local line_num, prev_level = start_line, start_level
    local current_level = prev_level
    for i = 1, #lines do
      local pos, line = lines[i][1], lines[i][2]
      if line ~= '' then
        if fold_symbols_case_insensitive then line = line:lower() end
        local level_decreased = false
        for j = 1, #fold_symbols_patterns do
          for s, match in line:gmatch(fold_symbols_patterns[j]) do
            local symbols = fold_symbols[style_at[start_pos + pos + s - 1]]
            local l = symbols and symbols[match]
            if type(l) == 'function' then l = l(text, pos, line, s, match) end
            if type(l) == 'number' then
              current_level = current_level + l
              if l < 0 and current_level < prev_level then
                -- Potential zero-sum line. If the level were to go back up on
                -- the same line, the line may be marked as a fold header.
                level_decreased = true
              end
            end
          end
        end
        folds[line_num] = prev_level
        if current_level > prev_level then
          folds[line_num] = prev_level + FOLD_HEADER
        elseif level_decreased and current_level == prev_level and
               fold_zero_sum_lines then
          if line_num > start_line then
            folds[line_num] = prev_level - 1 + FOLD_HEADER
          else
            -- Typing within a zero-sum line.
            local level = fold_level[line_num - 1] - 1
            if level > FOLD_HEADER then level = level - FOLD_HEADER end
            if level > FOLD_BLANK then level = level - FOLD_BLANK end
            folds[line_num] = level + FOLD_HEADER
            current_level = current_level + 1
          end
        end
        if current_level < FOLD_BASE then current_level = FOLD_BASE end
        prev_level = current_level
      else
        folds[line_num] = prev_level + FOLD_BLANK
      end
      line_num = line_num + 1
    end
  elseif fold and (lexer._FOLDBYINDENTATION or
                   M.property_int['fold.by.indentation'] > 0) then
    -- Indentation based folding.
    -- Calculate indentation per line.
    local indentation = {}
    for indent, line in (text..'\n'):gmatch('([\t ]*)([^\r\n]*)\r?\n') do
      indentation[#indentation + 1] = line ~= '' and #indent
    end
    -- Find the first non-blank line before start_line. If the current line is
    -- indented, make that previous line a header and update the levels of any
    -- blank lines in between. If the current line is blank, match the level of
    -- the previous non-blank line.
    local current_level = start_level
    for i = start_line - 1, 0, -1 do
      local level = M.fold_level[i]
      if level >= FOLD_HEADER then level = level - FOLD_HEADER end
      if level < FOLD_BLANK then
        local indent = M.indent_amount[i]
        if indentation[1] and indentation[1] > indent then
          folds[i] = FOLD_BASE + indent + FOLD_HEADER
          for j = i + 1, start_line - 1 do
            folds[j] = start_level + FOLD_BLANK
          end
        elseif not indentation[1] then
          current_level = FOLD_BASE + indent
        end
        break
      end
    end
    -- Iterate over lines, setting fold numbers and fold flags.
    for i = 1, #indentation do
      if indentation[i] then
        current_level = FOLD_BASE + indentation[i]
        folds[start_line + i - 1] = current_level
        for j = i + 1, #indentation do
          if indentation[j] then
            if FOLD_BASE + indentation[j] > current_level then
              folds[start_line + i - 1] = current_level + FOLD_HEADER
              current_level = FOLD_BASE + indentation[j] -- for any blanks below
            end
            break
          end
        end
      else
        folds[start_line + i - 1] = current_level + FOLD_BLANK
      end
    end
  else
    -- No folding, reset fold levels if necessary.
    local current_line = start_line
    for _ in text:gmatch('\r?\n') do
      folds[current_line] = start_level
      current_line = current_line + 1
    end
  end
  return folds
end
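
-- As a usage sketch, when using this module as a stand-alone Lua library (see
-- `load()` above), loading a lexer and lexing a string might look like the
-- following, assuming this file and the lexer files are on `package.path`:
--
--   local l = require('lexer')
--   local lua_lexer = l.load('lua')
--   local tokens = lua_lexer:lex('print("hello")')
--   for i = 1, #tokens, 2 do print(tokens[i], tokens[i + 1]) end
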

-- The following are utility functions lexers will have access to.

-- Common patterns.
M.any = lpeg_P(1)
M.ascii = lpeg_R('\000\127')
M.extend = lpeg_R('\000\255')
M.alpha = lpeg_R('AZ', 'az')
M.digit = lpeg_R('09')
M.alnum = lpeg_R('AZ', 'az', '09')
M.lower = lpeg_R('az')
M.upper = lpeg_R('AZ')
M.xdigit = lpeg_R('09', 'AF', 'af')
M.cntrl = lpeg_R('\000\031')
M.graph = lpeg_R('!~')
M.print = lpeg_R(' ~')
M.punct = lpeg_R('!/', ':@', '[`', '{~')
M.space = lpeg_S('\t\v\f\n\r ')

M.newline = lpeg_S('\r\n\f')^1
M.nonnewline = 1 - M.newline
M.nonnewline_esc = 1 - (M.newline + '\\') + '\\' * M.any

M.dec_num = M.digit^1
M.hex_num = '0' * lpeg_S('xX') * M.xdigit^1
M.oct_num = '0' * lpeg_R('07')^1
M.integer = lpeg_S('+-')^-1 * (M.hex_num + M.oct_num + M.dec_num)
M.float = lpeg_S('+-')^-1 *
          ((M.digit^0 * '.' * M.digit^1 + M.digit^1 * '.' * M.digit^0) *
           (lpeg_S('eE') * lpeg_S('+-')^-1 * M.digit^1)^-1 +
           (M.digit^1 * lpeg_S('eE') * lpeg_S('+-')^-1 * M.digit^1))

M.word = (M.alpha + '_') * (M.alnum + '_')^0
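
-- For example, a lexer file typically composes these common patterns into
-- tokens (a minimal sketch, where `token` and `l` are the usual convenience
-- locals from the lexer template; the rules themselves are illustrative):
--
--     local number = token(l.NUMBER, l.float + l.integer)
--     local identifier = token(l.IDENTIFIER, l.word)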

---
-- Creates and returns a token pattern with token name *name* and pattern
-- *patt*.
-- If *name* is not a predefined token name, its style must be defined in the
-- lexer's `_tokenstyles` table.
-- @param name The name of the token. If this name is not a predefined token
--   name, then a style needs to be associated with it in the lexer's
--   `_tokenstyles` table.
-- @param patt The LPeg pattern associated with the token.
-- @return pattern
-- @usage local ws = token(l.WHITESPACE, l.space^1)
-- @usage local annotation = token('annotation', '@' * l.word)
-- @name token
function M.token(name, patt)
  return lpeg_Cc(name) * patt * lpeg_Cp()
end

---
-- Creates and returns a pattern that matches a range of text bounded by
-- *chars* characters.
-- This is a convenience function for matching more complicated delimited ranges
-- like strings with escape characters and balanced parentheses. *single_line*
-- indicates whether or not the range must be on a single line, *no_escape*
-- indicates whether or not to ignore '\' as an escape character, and *balanced*
-- indicates whether or not to handle balanced ranges like parentheses, and
-- requires *chars* to be composed of two characters.
-- @param chars The character(s) that bound the matched range.
-- @param single_line Optional flag indicating whether or not the range must be
--   on a single line.
-- @param no_escape Optional flag indicating whether or not to ignore '\\' as
--   an escape character (i.e. the range end character cannot be escaped).
-- @param balanced Optional flag indicating whether or not to match a balanced
--   range, like the "%b" Lua pattern. This flag only applies if *chars*
--   consists of two different characters (e.g. "()").
-- @return pattern
-- @usage local dq_str_escapes = l.delimited_range('"')
-- @usage local dq_str_noescapes = l.delimited_range('"', false, true)
-- @usage local unbalanced_parens = l.delimited_range('()')
-- @usage local balanced_parens = l.delimited_range('()', false, false, true)
-- @see nested_pair
-- @name delimited_range
function M.delimited_range(chars, single_line, no_escape, balanced)
  local s = chars:sub(1, 1)
  local e = #chars == 2 and chars:sub(2, 2) or s
  local range
  local b = balanced and s or ''
  local n = single_line and '\n' or ''
  if no_escape then
    local invalid = lpeg_S(e..n..b)
    range = M.any - invalid
  else
    local invalid = lpeg_S(e..n..b) + '\\'
    range = M.any - invalid + '\\' * M.any
  end
  if balanced and s ~= e then
    return lpeg_P{s * (range + lpeg_V(1))^0 * e}
  else
    return s * range^0 * lpeg_P(e)^-1
  end
end
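
-- For example, a lexer might combine `token()` and `delimited_range()` to
-- match single-line, single-quoted strings (an illustrative sketch):
--
--     local sq_str = token(l.STRING, l.delimited_range("'", true))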

---
-- Creates and returns a pattern that matches pattern *patt* only at the
-- beginning of a line.
-- @param patt The LPeg pattern to match on the beginning of a line.
-- @return pattern
-- @usage local preproc = token(l.PREPROCESSOR, l.starts_line('#') *
--   l.nonnewline^0)
-- @name starts_line
function M.starts_line(patt)
  return lpeg_Cmt(lpeg_C(patt), function(input, index, match, ...)
    local pos = index - #match
    if pos == 1 then return index, ... end
    local char = input:sub(pos - 1, pos - 1)
    if char == '\n' or char == '\r' or char == '\f' then return index, ... end
  end)
end

---
-- Creates and returns a pattern that verifies that string set *s* contains
-- the first non-whitespace character behind the current match position.
-- @param s String character set like one passed to `lpeg.S()`.
-- @return pattern
-- @usage local regex = l.last_char_includes('+-*!%^&|=,([{') *
--   l.delimited_range('/')
-- @name last_char_includes
function M.last_char_includes(s)
  s = '['..s:gsub('[-%%%[]', '%%%1')..']'
  return lpeg_P(function(input, index)
    if index == 1 then return index end
    local i = index
    while input:sub(i - 1, i - 1):match('[ \t\r\n\f]') do i = i - 1 end
    if input:sub(i - 1, i - 1):match(s) then return index end
  end)
end

---
-- Returns a pattern that matches a balanced range of text that starts with
-- string *start_chars* and ends with string *end_chars*.
-- With single-character delimiters, this function is identical to
-- `delimited_range(start_chars..end_chars, false, true, true)`.
-- @param start_chars The string starting a nested sequence.
-- @param end_chars The string ending a nested sequence.
-- @return pattern
-- @usage local nested_comment = l.nested_pair('/*', '*/')
-- @see delimited_range
-- @name nested_pair
function M.nested_pair(start_chars, end_chars)
  local s, e = start_chars, lpeg_P(end_chars)^-1
  return lpeg_P{s * (M.any - s - end_chars + lpeg_V(1))^0 * e}
end

---
-- Creates and returns a pattern that matches any single word in list *words*.
-- Words consist of alphanumeric and underscore characters, as well as the
-- characters in string set *word_chars*. *case_insensitive* indicates whether
-- or not to ignore case when matching words.
-- This is a convenience function for simplifying a set of ordered choice word
-- patterns.
-- @param words A table of words.
-- @param word_chars Optional string of additional characters considered to be
--   part of a word. By default, word characters are alphanumerics and
--   underscores ("%w_" in Lua). This parameter may be `nil` or the empty
--   string to indicate no additional word characters.
-- @param case_insensitive Optional boolean flag indicating whether or not the
--   word match is case-insensitive. The default is `false`.
-- @return pattern
-- @usage local keyword = token(l.KEYWORD, word_match{'foo', 'bar', 'baz'})
-- @usage local keyword = token(l.KEYWORD, word_match({'foo-bar', 'foo-baz',
--   'bar-foo', 'bar-baz', 'baz-foo', 'baz-bar'}, '-', true))
-- @name word_match
function M.word_match(words, word_chars, case_insensitive)
  local word_list = {}
  for i = 1, #words do
    word_list[case_insensitive and words[i]:lower() or words[i]] = true
  end
  local chars = M.alnum + '_'
  if word_chars then chars = chars + lpeg_S(word_chars) end
  return lpeg_Cmt(chars^1, function(input, index, word)
    if case_insensitive then word = word:lower() end
    return word_list[word] and index or nil
  end)
end

---
-- Embeds child lexer *child* in parent lexer *parent* using patterns
-- *start_rule* and *end_rule*, which signal the beginning and end of the
-- embedded lexer, respectively.
-- @param parent The parent lexer.
-- @param child The child lexer.
-- @param start_rule The pattern that signals the beginning of the embedded
--   lexer.
-- @param end_rule The pattern that signals the end of the embedded lexer.
-- @usage l.embed_lexer(M, css, css_start_rule, css_end_rule)
-- @usage l.embed_lexer(html, M, php_start_rule, php_end_rule)
-- @usage l.embed_lexer(html, ruby, ruby_start_rule, ruby_end_rule)
-- @name embed_lexer
function M.embed_lexer(parent, child, start_rule, end_rule)
  -- Add child rules.
  if not child._EMBEDDEDRULES then child._EMBEDDEDRULES = {} end
  if not child._RULES then -- creating a child lexer to be embedded
    if not child._rules then error('Cannot embed language with no rules') end
    for i = 1, #child._rules do
      add_rule(child, child._rules[i][1], child._rules[i][2])
    end
  end
  child._EMBEDDEDRULES[parent._NAME] = {
    start_rule = start_rule,
    token_rule = join_tokens(child),
    end_rule = end_rule
  }
  if not parent._CHILDREN then parent._CHILDREN = {} end
  local children = parent._CHILDREN
  children[#children + 1] = child
  -- Add child styles.
  if not parent._tokenstyles then parent._tokenstyles = {} end
  local tokenstyles = parent._tokenstyles
  tokenstyles[child._NAME..'_whitespace'] = M.STYLE_WHITESPACE
  if child._tokenstyles then
    for token, style in pairs(child._tokenstyles) do
      tokenstyles[token] = style
    end
  end
  child._lexer = parent -- use parent's tokens if child is embedding itself
  parent_lexer = parent -- use parent's tokens if the calling lexer is a proxy
end
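
-- A sketch of a complete embedding setup as it might appear in a parent
-- lexer file, loosely modeled on the HTML lexer (the rule names, token
-- names, and patterns here are illustrative only):
--
--     local css = l.load('css')
--     local css_start_rule = #P('<style') * token('element', P('<style'))
--     local css_end_rule = #P('</style>') * token('element', P('</style>'))
--     l.embed_lexer(M, css, css_start_rule, css_end_rule)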

-- Determines if the previous line is a comment.
-- This is used to determine whether the current comment line is a fold point.
-- @param prefix The prefix string defining a comment.
-- @param text The text passed to a fold function.
-- @param pos The pos passed to a fold function.
-- @param line The line passed to a fold function.
-- @param s The s passed to a fold function.
local function prev_line_is_comment(prefix, text, pos, line, s)
  local start = line:find('%S')
  if start < s and not line:find(prefix, start, true) then return false end
  local p = pos - 1
  if text:sub(p, p) == '\n' then
    p = p - 1
    if text:sub(p, p) == '\r' then p = p - 1 end
    if text:sub(p, p) ~= '\n' then
      while p > 1 and text:sub(p - 1, p - 1) ~= '\n' do p = p - 1 end
      while text:sub(p, p):find('^[\t ]$') do p = p + 1 end
      return text:sub(p, p + #prefix - 1) == prefix
    end
  end
  return false
end

-- Determines if the next line is a comment.
-- This is used to determine whether the current comment line is a fold point.
-- @param prefix The prefix string defining a comment.
-- @param text The text passed to a fold function.
-- @param pos The pos passed to a fold function.
-- @param line The line passed to a fold function.
-- @param s The s passed to a fold function.
local function next_line_is_comment(prefix, text, pos, line, s)
  local p = text:find('\n', pos + s)
  if p then
    p = p + 1
    while text:sub(p, p):find('^[\t ]$') do p = p + 1 end
    return text:sub(p, p + #prefix - 1) == prefix
  end
  return false
end

---
-- Returns a fold function (to be used within the lexer's `_foldsymbols` table)
-- that folds consecutive line comments that start with string *prefix*.
-- @param prefix The prefix string defining a line comment.
-- @usage [l.COMMENT] = {['--'] = l.fold_line_comments('--')}
-- @usage [l.COMMENT] = {['//'] = l.fold_line_comments('//')}
-- @name fold_line_comments
function M.fold_line_comments(prefix)
  local property_int = M.property_int
  return function(text, pos, line, s)
    if property_int['fold.line.comments'] == 0 then return 0 end
    if s > 1 and line:match('^%s*()') < s then return 0 end
    local prev_line_comment = prev_line_is_comment(prefix, text, pos, line, s)
    local next_line_comment = next_line_is_comment(prefix, text, pos, line, s)
    if not prev_line_comment and next_line_comment then return 1 end
    if prev_line_comment and not next_line_comment then return -1 end
    return 0
  end
end

M.property_expanded = setmetatable({}, {
  -- Returns the string property value associated with string property *key*,
  -- replacing any "$()" and "%()" expressions with the values of their keys.
  __index = function(t, key)
    return M.property[key]:gsub('[$%%]%b()', function(key)
      return t[key:sub(3, -2)]
    end)
  end,
  __newindex = function() error('read-only property') end
})
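
-- For example, assuming `M.property['foo'] = 'bar'` and
-- `M.property['baz'] = '$(foo),qux'` (the key names are illustrative only),
-- then `M.property_expanded['baz']` evaluates to 'bar,qux'.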

--[[ The functions and fields below were defined in C.

---
-- Returns the line number of the line that contains position *pos*, which
-- starts from 1.
-- @param pos The position to get the line number of.
-- @return number
local function line_from_position(pos) end

---
-- Individual fields for a lexer instance.
-- @field _NAME The string name of the lexer.
-- @field _rules An ordered list of rules for a lexer grammar.
--   Each rule is a table containing an arbitrary rule name and the LPeg
--   pattern associated with the rule. The order of rules is important, as
--   rules are matched sequentially.
--   Child lexers should not use this table to access and/or modify their
--   parent's rules and vice-versa. Use the `_RULES` table instead.
-- @field _tokenstyles A map of non-predefined token names to styles.
--   Remember to use token names, not rule names. It is recommended to use
--   predefined styles or color-agnostic styles derived from predefined styles
--   to ensure compatibility with user color themes.
-- @field _foldsymbols A table of recognized fold points for the lexer.
--   Keys are token names with table values defining fold points. Those table
--   values have string keys of keywords or characters that indicate a fold
--   point whose values are integers. A value of `1` indicates a beginning fold
--   point and a value of `-1` indicates an ending fold point. Values can also
--   be functions that return `1`, `-1`, or `0` (indicating no fold point) for
--   keys that need additional processing.
--   There is also a required `_patterns` key whose value is a table containing
--   Lua pattern strings that match all fold points (the string keys contained
--   in token name table values). When the lexer encounters text that matches
--   one of those patterns, the matched text is looked up in its token's table
--   to determine whether or not it is a fold point.
--   There is also an optional `_case_insensitive` option that indicates
--   whether or not fold point keys are case-insensitive. If `true`, fold
--   point keys should be in lower case.
-- @field _fold If this function exists in the lexer, it is called for folding
--   the document instead of using `_foldsymbols` or indentation.
-- @field _lexer The parent lexer object whose rules should be used. This field
--   is only necessary to disambiguate a proxy lexer that loaded parent and
--   child lexers for embedding and ended up having multiple parents loaded.
-- @field _RULES A map of rule name keys with their associated LPeg pattern
--   values for the lexer.
--   This is constructed from the lexer's `_rules` table and accessible to
--   other lexers for embedded lexer applications like modifying parent or
--   child rules.
-- @field _LEXBYLINE Indicates the lexer can only process one whole line of
--   text (instead of an arbitrary chunk of text) at a time.
--   The default value is `false`. Line lexers cannot look ahead to subsequent
--   lines.
-- @field _FOLDBYINDENTATION Declares the lexer does not define fold points and
--   that fold points should be calculated based on changes in indentation.
-- @class table
-- @name lexer
local lexer
]]

return M