Module lexer
Performs lexing of Scintilla documents.
Overview
Dynamic lexers are more flexible than Scintilla's static ones. They are often more readable as well. This document provides all the information necessary in order to write a new lexer. For illustrative purposes, a Lua lexer will be created. Lexers are written using Parsing Expression Grammars or PEGs with the Lua LPeg library. Please familiarize yourself with LPeg's documentation before proceeding.
Writing a Dynamic Lexer
Rather than writing a lexer from scratch, first see if your language is similar to any of the 70+ languages supported. If so, you can copy and modify that lexer, saving some time and effort.
Introduction
All lexers are contained in the lexers/ directory. To begin, create a Lua
script with the name of your lexer and open it for editing.
$> cd lexers
$> textadept lua.lua
Inside the lexer, the heading should look like the following:
-- Lua LPeg lexer
local l = lexer
local token, word_match = l.token, l.word_match
local P, R, S, V = l.lpeg.P, l.lpeg.R, l.lpeg.S, l.lpeg.V
module(...)
Each lexer is a module so the global namespace is not cluttered with lexer
patterns and variables. The ... is there for a reason! Do not replace it
with the name of your lexer. This is done by Lua automatically.
The local variables above the module give easy access to the many useful functions available for creating lexers.
Lexer Language Structure
It is important to spend some time considering the structure of the language you are creating the lexer for. What kinds of tokens does it have? Comments, strings, keywords, etc.? Lua has 9 tokens: whitespace, comments, strings, numbers, keywords, functions, constants, identifiers, and operators.
Tokens
In a lexer, tokens consist of a token type followed by an LPeg pattern.
They are created using the token() function. A whitespace token
typically looks like:
local ws = token('whitespace', S('\t\v\f\n\r ')^1)
It is difficult to remember that a whitespace character is one of \t, \v,
\f, \n, \r, or a space. The lexer (l) module provides you with a
shortcut for this and many other character sequences. They are:
- any: Matches any single character.
- ascii: Matches any ASCII character (0..127).
- extend: Matches any ASCII extended character (0..255).
- alpha: Matches any alphabetic character (A-Z, a-z).
- digit: Matches any digit (0-9).
- alnum: Matches any alphanumeric character (A-Z, a-z, 0-9).
- lower: Matches any lowercase character (a-z).
- upper: Matches any uppercase character (A-Z).
- xdigit: Matches any hexadecimal digit (0-9, A-F, a-f).
- cntrl: Matches any control character (0..31).
- graph: Matches any graphical character (! to ~).
- print: Matches any printable character (space to ~).
- punct: Matches any punctuation character not alphanumeric (! to /, : to @, [ to ', { to ~).
- space: Matches any whitespace character (\t, \v, \f, \n, \r, space).
- newline: Matches any newline characters.
- nonnewline: Matches any non-newline character.
- nonnewline_esc: Matches any non-newline character excluding newlines escaped with \\.
- dec_num: Matches a decimal number.
- hex_num: Matches a hexadecimal number.
- oct_num: Matches an octal number.
- integer: Matches a decimal, hexadecimal, or octal number.
- float: Matches a floating point number.
- word: Matches a typical word starting with a letter or underscore and then any alphanumeric or underscore characters.
The above whitespace token can be rewritten more simply as:
local ws = token('whitespace', l.space^1)
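To make the idea of a token concrete, here is a minimal sketch in plain LPeg, outside the lexer module. The `token` function below is a hypothetical stand-in, not the module's implementation: it captures the token name and the position just after the match, which is what the function reference describes. It assumes the `lpeg` module is installed.

```lua
local lpeg = require('lpeg')
local S, Ct, Cc, Cp = lpeg.S, lpeg.Ct, lpeg.Cc, lpeg.Cp

-- Hypothetical stand-in for l.token(): capture the name and the end position.
local function token(name, patt)
  return Ct(Cc(name) * patt * Cp())
end

-- Same character set as the hand-written whitespace pattern in the text.
local space = S('\t\v\f\n\r ')
local ws = token('whitespace', space^1)

-- Three whitespace characters are consumed, so the end position is 4.
local result = lpeg.match(ws, ' \t\nfoo')
print(result[1], result[2])
```

A match yields a small table holding the name and end position; input that does not start with whitespace yields nil.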
The next Lua token is a comment. Short comments beginning with -- are easy
to express with LPeg:
local line_comment = '--' * l.nonnewline^0
On the other hand, long comments are more difficult to express because they have levels. See the Lua Reference Manual for more information. As a result, a functional pattern is necessary:
local longstring = #('[[' + ('[' * P('=')^0 * '[')) *
  P(function(input, index)
    local level = input:match('^%[(=*)%[', index)
    if level then
      local _, stop = input:find(']'..level..']', index, true)
      return stop and stop + 1 or #input + 1
    end
  end)
local block_comment = '--' * longstring
The token for a comment is then:
local comment = token('comment', line_comment + block_comment)
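The level-matching logic inside the long comment pattern can be tried on its own with plain Lua string functions. The sketch below lifts the function body into a standalone helper; the name `match_long_bracket` is hypothetical and used only for illustration.

```lua
-- Hypothetical standalone copy of the function used in the long string pattern.
-- Returns the position just past the closing long bracket, or one past the end
-- of the input if the bracket is never closed, or nil if no long bracket
-- starts at `index`.
local function match_long_bracket(input, index)
  local level = input:match('^%[(=*)%[', index)
  if level then
    -- Plain (non-pattern) search for the closing bracket of the same level.
    local _, stop = input:find(']' .. level .. ']', index, true)
    return stop and stop + 1 or #input + 1
  end
end

print(match_long_bracket('[[abc]] x', 1))
print(match_long_bracket('[==[a]]b]==]', 1))
print(match_long_bracket('[[abc', 1))
print(match_long_bracket('abc', 1))
```

Note how the level-2 bracket skips over the inner `]]` and only stops at `]==]`, and how an unterminated bracket runs to the end of the input.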
It is worth noting that while token names are arbitrary, you are encouraged
to use the ones listed in the tokens table because a standard
color theme is applied to them. If you wish to create a unique token, no
problem. You can specify how it will be colored later on.
Lua strings should be easy to express because they are just characters
surrounded by ' or " characters, right? Not quite. Lua strings contain
escape sequences (\char) so a \' sequence in a single-quoted string
does not indicate the end of a string and must be handled appropriately.
Fortunately, this is a common occurrence in many programming languages, so a
convenient function is provided: delimited_range().
local sq_str = l.delimited_range("'", '\\', true)
local dq_str = l.delimited_range('"', '\\', true)
Lua also has multi-line strings, but they have the same format as block comments. All strings can be combined into a token:
local string = token('string', sq_str + dq_str + longstring)
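For readers curious how escape handling can be expressed directly in LPeg, the following is a rough sketch of a quoted-string pattern with a backslash escape and an optional closing quote. The `quoted` helper is hypothetical; it only approximates the behaviour described for delimited_range() and the real function may differ (for example in how it treats newlines). It assumes the `lpeg` module is installed.

```lua
local lpeg = require('lpeg')
local P, S = lpeg.P, lpeg.S

-- Hypothetical approximation of l.delimited_range(quote, '\\', true).
local function quoted(quote)
  local escaped = P('\\') * P(1)         -- a backslash plus any one character
  local plain = P(1) - S(quote .. '\\')  -- anything but the quote or a backslash
  return P(quote) * (escaped + plain)^0 * P(quote)^-1  -- closing quote optional
end

local sq_str = quoted("'")

-- The escaped quote does not end the string; the match stops after the
-- real closing quote.
local input = [['it\'s' .. x]]
print(lpeg.match(sq_str, input))

-- An unterminated string matches to the end of the input.
print(lpeg.match(sq_str, "'abc"))
```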
Numbers are easy in Lua using lexer's predefined patterns.
local lua_integer = P('-')^-1 * (l.hex_num + l.dec_num)
local number = token('number', l.float + lua_integer)
Keep in mind that the predefined patterns may not be completely accurate for
your language, so you may have to create your own variants. In the above
case, Lua integers do not have octal sequences, so the l.integer pattern is
not used.
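When you do need your own number variants, the order of alternatives matters. The sketch below rebuilds the integer pattern from assumed definitions of `hex_num` and `dec_num` (the module's own definitions may differ) and shows what happens if the decimal alternative is tried first. It assumes the `lpeg` module is installed.

```lua
local lpeg = require('lpeg')
local P, R, S = lpeg.P, lpeg.R, lpeg.S

-- Assumed definitions, for illustration only.
local digit = R('09')
local xdigit = R('09', 'AF', 'af')
local hex_num = P('0') * S('xX') * xdigit^1
local dec_num = digit^1

-- Hexadecimal is tried first, as in the pattern in the text.
local lua_integer = P('-')^-1 * (hex_num + dec_num)
-- With the alternatives swapped, the decimal pattern matches the leading '0'
-- of a hexadecimal literal and the rest is left unmatched.
local wrong_order = P('-')^-1 * (dec_num + hex_num)

print(lpeg.match(lua_integer, '0x1F'))  -- whole literal consumed
print(lpeg.match(wrong_order, '0x1F'))  -- only the leading '0' consumed
print(lpeg.match(lua_integer, '-42'))
```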
Depending on the number of keywords for a particular language, a simple
P(keyword1) + P(keyword2) + ... + P(keywordN) pattern can get quite large.
In fact, LPeg has a limit on pattern size. Also, if the keywords are not case
sensitive, additional complexity arises, so a better approach is necessary.
Once again, lexer has a shortcut function: word_match().
local keyword = token('keyword', word_match {
'and', 'break', 'do', 'else', 'elseif', 'end', 'false', 'for',
'function', 'if', 'in', 'local', 'nil', 'not', 'or', 'repeat',
'return', 'then', 'true', 'until', 'while'
})
If keywords were case-insensitive, an additional parameter would be specified
in the call to word_match(); no other action is needed.
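One way to avoid a long chain of alternatives is to match a whole word once and then look it up in a set. The sketch below illustrates that idea; `word_match_sketch` is a hypothetical helper, not the module's word_match(), and its parameter list is simplified. It assumes the `lpeg` module is installed.

```lua
local lpeg = require('lpeg')
local P, R, C, Cmt = lpeg.P, lpeg.R, lpeg.C, lpeg.Cmt

-- Hypothetical sketch: match one whole word, then accept it only if it is
-- in the set of known words.
local function word_match_sketch(words, case_insensitive)
  local set = {}
  for _, w in ipairs(words) do
    set[case_insensitive and w:lower() or w] = true
  end
  local word = (R('AZ', 'az', '09') + P('_'))^1
  -- The match-time function receives the subject, the position after the
  -- word, and the captured word. Returning the position accepts the match;
  -- returning nil rejects it.
  return Cmt(C(word), function(_, pos, w)
    if case_insensitive then w = w:lower() end
    return set[w] and pos or nil
  end)
end

local kw = word_match_sketch { 'while', 'end' }
local kw_ci = word_match_sketch({ 'while', 'end' }, true)

print(lpeg.match(kw, 'while x'))  -- matches the whole word
print(lpeg.match(kw, 'whilex'))   -- nil: 'whilex' is not a keyword
print(lpeg.match(kw_ci, 'WHILE')) -- matches when case is ignored
```

The pattern stays the same size no matter how many words are in the set, and a keyword that is only a prefix of a longer word is not matched.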
Lua functions and constants are specified like keywords:
local func = token('function', word_match {
'assert', 'collectgarbage', 'dofile', 'error', 'getfenv',
'getmetatable', 'gcinfo', 'ipairs', 'loadfile', 'loadlib',
'loadstring', 'next', 'pairs', 'pcall', 'print', 'rawequal',
'rawget', 'rawset', 'require', 'setfenv', 'setmetatable',
'tonumber', 'tostring', 'type', 'unpack', 'xpcall'
})
local constant = token('constant', word_match {
'_G', '_VERSION', 'LUA_PATH', '_LOADED', '_REQUIREDNAME', '_ALERT',
'_ERRORMESSAGE', '_PROMPT'
})
Unlike most programming languages, Lua allows an additional range of
characters in its identifier names (variables, functions, modules, etc.) so
the usual l.word cannot be used. Instead, identifiers are represented by:
local word = (R('AZ', 'az', '\127\255') + '_') * (l.alnum + '_')^0
local identifier = token('identifier', word)
Finally, an operator character is one of the following:
local operator = token('operator', '~=' + S('+-*/%^#=<>;:,.{}[]()'))
Rules
Rules are just a combination of tokens. In Lua, all rules consist of a single token, but other languages may have two or more tokens in a rule. For example, an HTML tag consists of an element token followed by an optional set of attribute tokens. This allows each part of the tag to be colored distinctly.
The set of rules that comprises Lua is specified in a _rules table for the
lexer.
_rules = {
{ 'whitespace', ws },
{ 'keyword', keyword },
{ 'function', func },
{ 'constant', constant },
{ 'identifier', identifier },
{ 'string', string },
{ 'comment', comment },
{ 'number', number },
{ 'operator', operator },
{ 'any_char', l.any_char },
}
Each entry is a rule name and its associated pattern. Please note that the names of the rules can be completely different than the names of the tokens contained within them.
The order of the rules is important because of the nature of LPeg. LPeg tries
to apply the first rule to the current position in the text it is matching.
If there is a match, it colors that section appropriately and moves on. If
there is not a match, it tries the next rule, and so on. Suppose instead that
the identifier rule was before the keyword rule. It can be seen that all
keywords satisfy the requirements for being an identifier, so any keywords
would be incorrectly colored as identifiers. This is why identifier is
where it is in the _rules table.
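The effect of rule order can be reproduced with LPeg's ordered choice. The patterns below are simplified stand-ins for the keyword and identifier rules (two keywords, ASCII identifiers only), and each produces its rule name as a capture. The sketch assumes the `lpeg` module is installed.

```lua
local lpeg = require('lpeg')
local P, R, Cc = lpeg.P, lpeg.R, lpeg.Cc

local word_char = R('AZ', 'az', '09') + P('_')
local identifier = Cc('identifier') * (R('AZ', 'az') + P('_')) * word_char^0
-- Whole-word keywords only: a keyword must not be followed by a word character.
local keyword = Cc('keyword') * (P('if') + P('end')) * -word_char

local keyword_first = keyword + identifier
local identifier_first = identifier + keyword

print(lpeg.match(keyword_first, 'if'))     -- keyword
print(lpeg.match(identifier_first, 'if'))  -- identifier: the keyword rule is never tried
print(lpeg.match(keyword_first, 'iffy'))   -- identifier: 'iffy' is not a keyword
```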
You might be wondering what that any_char is doing at the bottom of
_rules. Its purpose is to match anything not accounted for in the above
rules. For example, suppose the ! character is in the input text. It will
not be matched by any of the first 9 rules, so without any_char, the text
would not match at all, and no coloring would occur. any_char matches one
single character and moves on. It may be colored red (indicating a syntax
error) if desired because it is a token, not just a pattern.
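The role of the fallback rule can be seen in a toy lexer with only two real rules. In this sketch the `token` helper and the way rules are combined are assumptions made for illustration, not the module's actual grammar; here, lexing without a fallback simply stops at the first unmatched character. It assumes the `lpeg` module is installed.

```lua
local lpeg = require('lpeg')
local P, R, S, Ct, Cc, Cp = lpeg.P, lpeg.R, lpeg.S, lpeg.Ct, lpeg.Cc, lpeg.Cp

-- Hypothetical token helper: a table holding the name and the end position.
local function token(name, patt)
  return Ct(Cc(name) * patt * Cp())
end

local ws = token('whitespace', S(' \t\r\n')^1)
local number = token('number', R('09')^1)
local any_char = token('any_char', P(1))  -- one character, whatever it is

local without_fallback = Ct((ws + number)^0)
local with_fallback = Ct((ws + number + any_char)^0)

local input = '1 ! 2'
local a = lpeg.match(without_fallback, input)
local b = lpeg.match(with_fallback, input)

print(#a)                   -- tokens found before the '!'
print(#b, b[3][1], b[5][2]) -- all tokens, the fallback's name, the final position
```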
Summary
The above method of defining tokens and rules is sufficient for a majority of
lexers. The lexer module provides many useful patterns and functions for
constructing a working lexer quickly and efficiently. In most cases, the
amount of knowledge of LPeg required to write a lexer is minimal.
As long as you used token names listed in tokens, you do not
have to specify any coloring (or styling) information in the lexer; it is
taken care of by the user's color theme.
The rest of this document is devoted to more complex lexer techniques.
Styling Tokens
The term for coloring text is styling. Just like with predefined LPeg
patterns in lexer, predefined styles are available.
- style_nothing: Typically used for whitespace.
- style_char: Typically used for character literals.
- style_class: Typically used for class definitions.
- style_comment: Typically used for code comments.
- style_constant: Typically used for constants.
- style_definition: Typically used for definitions.
- style_error: Typically used for erroneous syntax.
- style_function: Typically used for function definitions.
- style_keyword: Typically used for language keywords.
- style_number: Typically used for numbers.
- style_operator: Typically used for operators.
- style_string: Typically used for strings.
- style_preproc: Typically used for preprocessor statements.
- style_tag: Typically used for markup tags.
- style_type: Typically used for static types.
- style_variable: Typically used for variables.
- style_embedded: Typically used for embedded code.
- style_identifier: Typically used for identifier words.
Each style consists of a set of attributes:
- font: The style's font name.
- size: The style's font size.
- bold: Flag indicating whether or not the font is boldface.
- italic: Flag indicating whether or not the font is italic.
- underline: Flag indicating whether or not the font is underlined.
- fore: The color of the font face.
- back: The color of the font background.
- eolfilled: Flag indicating whether or not to color the end of the line.
- characterset: The character set of the font.
- case: The case of the font. 1 for upper case, 2 for lower case, 0 for normal case.
- visible: Flag indicating whether or not the text is visible.
- changeable: Flag indicating whether or not the text can be changed; if false, the text is read-only.
- hotspot: Flag indicating whether or not the style is clickable.
Styles are created with style(). For example:
-- style with default theme settings
local style_nothing = l.style { }
-- style with bold text with default theme font
local style_bold = l.style { bold = true }
-- style with bold italic text with default theme font
local style_bold_italic = l.style { bold = true, italic = true }
The style_bold_italic style can be rewritten in terms of style_bold:
local style_bold_italic = style_bold..{ italic = true }
In this way you can build on previously defined styles without having to rewrite them. Note the previous style is left unchanged.
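The `..` behaviour described above can be modelled in plain Lua with a `__concat` metamethod. This is only a sketch of the idea under that assumption; the `style` function here is a hypothetical stand-in and says nothing about how the module actually stores styles.

```lua
-- Hypothetical stand-in for l.style(): a table whose `..` operator merges
-- properties into a new table.
local style_mt = {}

local function style(props)
  return setmetatable(props, style_mt)
end

style_mt.__concat = function(base, extra)
  local merged = {}
  for k, v in pairs(base) do merged[k] = v end
  for k, v in pairs(extra) do merged[k] = v end  -- later values win
  return style(merged)
end

local style_bold = style { bold = true }
local style_bold_italic = style_bold .. { italic = true }

print(style_bold_italic.bold, style_bold_italic.italic)
print(style_bold.italic)  -- nil: the original style is untouched
```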
Style colors are different than the #rrggbb RGB notation you may be familiar
with. Instead, create a color using color().
local red = l.color('FF', '00', '00')
local green = l.color('00', 'FF', '00')
local blue = l.color('00', '00', 'FF')
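Scintilla itself stores a colour as a single integer with the red component in the lowest byte (0xBBGGRR), which is why the familiar #rrggbb order does not carry over directly. The sketch below shows one way three hexadecimal strings could be turned into such an integer; it is an assumption about what color() computes, not a copy of it.

```lua
-- Assumption: the result is a Scintilla colour integer laid out as 0xBBGGRR.
local function color(r, g, b)
  return tonumber(b .. g .. r, 16)
end

local red = color('FF', '00', '00')
local green = color('00', 'FF', '00')
local blue = color('00', '00', 'FF')

print(string.format('0x%06X 0x%06X 0x%06X', red, green, blue))
```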
As you might have guessed, lexer has a set of default colors.
- green
- blue
- red
- yellow
- teal
- white
- black
- grey
- purple
- orange
It is recommended to use them to stay consistent with a user's color theme.
Finally, styles are assigned to tokens via a _tokenstyles table in the
lexer. Styles do not have to be assigned to standard tokens; it is done
automatically. You only have to assign styles for tokens you create. For
example:
local lua = token('lua', P('lua'))
-- ... other patterns and tokens ...
_tokenstyles = {
{ 'lua', l.style_keyword },
}
Each entry is the token name the style is for and the style itself. The order
of styles in _tokenstyles does not matter.
For examples of how styles are created, please see the theme files in the
lexers/themes/ folder.
Line Lexer
Sometimes it is advantageous to lex input text line by line rather than a chunk at a time. This occurs particularly in diff, patch, or make files. Put
_LEXBYLINE = true
somewhere in your lexer in order to do this.
Embedded Lexers
A particular advantage that dynamic lexers have over static ones is that lexers can be embedded within one another very easily, requiring minimal effort. There are two kinds of embedded lexers: a parent lexer that embeds other child lexers in it, and a child lexer that embeds itself within a parent lexer.
Parent Lexer with Children
An example of this kind of lexer is HTML with embedded CSS and Javascript.
After creating the parent lexer, load the children lexers in it using
lexer.load(). For example:
local css = l.load('css')
There needs to be a transition from the parent HTML lexer to the child CSS
lexer. This is something of the form <style type="text/css">. Similarly,
the transition from child to parent is </style>.
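local css_start_rule = #(P('<') * P('style') *
  P(function(input, index)
    -- Anchor the search at the current position so that only this tag's
    -- attributes are examined, not the whole input.
    if input:find('^[^>]+type%s*=%s*(["\'])text/css%1', index) then
      return index
    end
  end)) * tag
local css_end_rule = #(P('</') * P('style') * ws^0 * P('>')) * tag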
where tag and ws have been previously defined in the HTML lexer. Recall
that an any_char rule matches anything not matched previously in a lexer.
This rule exists in the CSS lexer, but we want it to stop matching when it
encounters </style> (otherwise the rest of the input would be counted as
CSS) without modifying the lexer file itself. The solution is to edit the
any_char rule from within the css._RULES table:
css._RULES['any_char'] = token('css_default', l.any - css_end_rule)
Now the CSS lexer can be embedded using embed_lexer():
l.embed_lexer(_M, css, css_start_rule, css_end_rule)
What is _M? It is the parent HTML lexer object, not the string ... or
'html'. The lexer object is needed by embed_lexer().
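The type-attribute check performed inside a start rule can be exercised with plain Lua strings. The sketch below is written with the search anchored at `index`, on the assumption that only the attributes of the current tag should be inspected; `has_css_type` is a hypothetical helper name.

```lua
-- Returns true if the text starting at `index` (just after '<style') carries
-- a type="text/css" attribute before the tag's closing '>'.
local function has_css_type(input, index)
  return input:find('^[^>]+type%s*=%s*(["\'])text/css%1', index) ~= nil
end

local html = '<style type="text/css">p {}</style><style>q {}</style>'

-- Position just after the first '<style'.
local first = #'<style' + 1
-- Position just after the second '<style' (plain search past the first tag).
local _, e = html:find('<style', first, true)
local second = e + 1

print(has_css_type(html, first))   -- true
print(has_css_type(html, second))  -- false: the second tag has no type attribute

-- An unanchored search from the start of the input finds the first tag's
-- attribute regardless of which tag is being examined.
print(html:find('[^>]+type%s*=%s*(["\'])text/css%1') ~= nil)
```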
The same procedure can be done for Javascript, but there is a wrinkle:
the child to parent transition (</script>) starts with a <, which is an
operator in Javascript. Therefore the operator rule must be edited in
addition to any_char.
local js = l.load('javascript')
local js_start_rule = #(P('<') * P('script') *
  P(function(input, index)
    -- Anchor the search at the current position so that only this tag's
    -- attributes are examined, not the whole input.
    if input:find('^[^>]+type%s*=%s*(["\'])text/javascript%1', index) then
      return index
    end
  end)) * tag
local js_end_rule = #('</' * P('script') * ws^0 * '>') * tag
js._RULES['operator'] = token('operator', S('+-/*%^!=&|?:;.()[]{}>') +
  '<' * -('/' * P('script')))
js._RULES['any_char'] = token('js_default', l.any - js_end_rule)
l.embed_lexer(_M, js, js_start_rule, js_end_rule)
Note the tokens css_default and js_default that were added. Since they
are not standard tokens, styles must be added for them. If _tokenstyles has
already been defined in the parent lexer, styles are added this way:
_tokenstyles[#_tokenstyles + 1] = { 'css_default', l.style_nothing }
_tokenstyles[#_tokenstyles + 1] = { 'js_default', l.style_nothing }
Child Lexer Within Parent
An example of this kind of lexer is PHP embedded in HTML. After creating the child lexer, load the parent lexer. As an example:
local html = l.load('hypertext')
Since HTML should be the main lexer (PHP is just a preprocessing language), the following statement changes the main lexer from PHP to HTML:
_lexer = html
Like in the previous section, transitions from HTML to PHP and back are specified:
local php_start_rule = token('php_tag', '<?' * ('php' * l.space)^-1)
local php_end_rule = token('php_tag', '?>')
And PHP is embedded as a preprocessing language:
l.embed_lexer(html, _M, php_start_rule, php_end_rule, true)
If PHP were not a preprocessing language, the lexer would be finished.
However, PHP can appear anywhere within an HTML document, so the HTML lexer
needs to have this indicated in its rules -- for example within strings.
First, it is necessary to obtain the PHP rule (<?php ... ?> sequence).
local php_rules = _M._EMBEDDEDRULES[html._NAME]
local php_rule = php_rules.start_rule * php_rules.token_rule^0 *
php_rules.end_rule^-1
Now, string patterns with embedded PHP need to be created. The explanation on
how to do so is beyond the scope of this tutorial. Suffice it to say that a
shortcut function, delimited_range_with_embedded(), is available:
local embedded_sq_str =
l.delimited_range_with_embedded("'", '\\', 'string', php_rule)
local embedded_dq_str =
l.delimited_range_with_embedded('"', '\\', 'string', php_rule)
The HTML string rule can now be modified:
html._RULES['string'] = embedded_sq_str + embedded_dq_str
This procedure should be repeated for other rules, but is not shown here. You
can look at lexers/php.lua for more information.
Code Folding (Optional)
It is sometimes convenient to "fold", or not show blocks of text. These blocks can be functions, classes, comments, etc. A folder iterates over each line of input text and assigns a fold level to it. Certain lines can be specified as fold points that fold subsequent lines with a higher fold level.
In order to implement a folder, define the following function in your lexer:
function _fold(input, start_pos, start_line, start_level)
end
- input: The text to fold.
- start_pos: Current position in the buffer of the text (used for obtaining style information from the document).
- start_line: The line number the text starts at.
- start_level: The fold level of the text at start_line.
The function must return a table whose indices are line numbers and whose values are tables containing the fold level and optionally a fold flag.
The following Scintilla fold flags are available:
- SC_FOLDLEVELBASE: The initial (root) fold level.
- SC_FOLDLEVELWHITEFLAG: Flag indicating that the line is blank.
- SC_FOLDLEVELHEADERFLAG: Flag indicating the line is a fold point.
- SC_FOLDLEVELNUMBERMASK: Flag used with SCI_GETFOLDLEVEL(line) to get the fold level of a line.
Have your fold function iterate over each line, setting fold levels. You can
use the get_style_at(), get_property(),
get_fold_level(), and
get_indent_amount() functions as necessary to determine
the fold level for each line. The following example sets fold points by
changes in indentation.
function _fold(input, start_pos, start_line, start_level)
  local folds = {}
  local current_line = start_line
  local prev_level = start_level
  for indent, line in input:gmatch('([\t ]*)(.-)\r?\n') do
    if #line > 0 then
      local current_level = l.get_indent_amount(current_line)
      if current_level > prev_level then -- next level
        local i = current_line - 1
        while folds[i] and folds[i][2] == l.SC_FOLDLEVELWHITEFLAG do
          i = i - 1
        end
        if folds[i] then
          folds[i][2] = l.SC_FOLDLEVELHEADERFLAG -- low indent
        end
        folds[current_line] = { current_level } -- high indent
      elseif current_level < prev_level then -- prev level
        if folds[current_line - 1] then
          folds[current_line - 1][1] = prev_level -- high indent
        end
        folds[current_line] = { current_level } -- low indent
      else -- same level
        folds[current_line] = { prev_level }
      end
      prev_level = current_level
    else
      folds[current_line] = { prev_level, l.SC_FOLDLEVELWHITEFLAG }
    end
    current_line = current_line + 1
  end
  return folds
end
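The shape of the returned table is easier to see in a version that runs without Scintilla. The sketch below is a simplified, self-contained variant of indentation folding: it measures indentation by counting leading whitespace characters instead of calling get_indent_amount(), adds the base level itself, and uses the numeric values Scintilla defines for the fold constants. The function name and these simplifications are assumptions made for illustration.

```lua
-- Scintilla's fold constants (SC_FOLDLEVELBASE, SC_FOLDLEVELWHITEFLAG,
-- SC_FOLDLEVELHEADERFLAG).
local BASE, WHITE, HEADER = 0x400, 0x1000, 0x2000

-- Hypothetical standalone folder: `text` must end with a newline.
-- Returns a table indexed by line number; each value is { level [, flag] }.
local function fold_by_indent(text, start_line)
  local folds = {}
  local line_num = start_line
  local last_nonblank     -- line number of the most recent non-blank line
  local prev_indent = 0
  for indent, rest in text:gmatch('([\t ]*)([^\r\n]*)\r?\n') do
    if #rest > 0 then
      local level = #indent
      if last_nonblank and level > prev_indent then
        -- Deeper indentation: the previous non-blank line is a fold point.
        folds[last_nonblank][2] = HEADER
      end
      folds[line_num] = { BASE + level }
      prev_indent, last_nonblank = level, line_num
    else
      -- Blank lines inherit the previous level and carry the white flag.
      folds[line_num] = { BASE + prev_indent, WHITE }
    end
    line_num = line_num + 1
  end
  return folds
end

local folds = fold_by_indent('function f()\n  return 1\n\nend\n', 1)
for i = 1, 4 do
  print(i, folds[i][1] - BASE, folds[i][2])
end
```

Line 1 ends up flagged as a fold point because line 2 is indented further, and the blank line 3 keeps line 2's level.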
SciTE users note: do not use get_property for getting fold options from a
.properties file because SciTE is not set up to forward them to your lexer.
Instead, you can provide options that can be set at the top of the lexer.
Using the Lexer with SciTE
Create a .properties file for your lexer and import it in either your
SciTEUser.properties or SciTEGlobal.properties. The contents of the
.properties file should contain:
file.patterns.[lexer_name]=[file_patterns]
lexer.$(file.patterns.[lexer_name])=[lexer_name]
where [lexer_name] is the name of your lexer (minus the .lua extension)
and [file_patterns] is a set of file extensions matched to your lexer.
Please note any styling information in .properties files is ignored.
Using the Lexer with Textadept
Put your lexer in your ~/.textadept/lexers/ directory. That way
your lexer will not be overwritten when upgrading. Also, lexers in this
directory override default lexers. (A user lua lexer would be loaded
instead of the default lua lexer. This is convenient if you wish to tweak
a default lexer to your liking.) Do not forget to add a
mime-type for your lexer.
Optimization
Lexers can usually be optimized for speed by re-arranging tokens so that the
most common ones are recognized first. Keep in mind the issue that was raised
earlier: if you put similar tokens like identifiers before keywords, the
latter will not be styled correctly.
Troubleshooting
Errors in lexers can be tricky to debug. Lua errors and _G.print()
statements in lexers are printed to STDOUT.
Limitations
Lexers can have up to 32757 elements in them. So unless the lexer is written very poorly, or has a dozen embedded languages, this limitation is not a problem.
Performance
There might be some slight overhead when initializing a lexer, but loading a file from disk into Scintilla is usually more expensive.
On modern computer systems, I see no difference in speed between LPeg lexers and Scintilla's C++ ones for single language lexers. There may be differences for multiple language lexers though, depending on the size of the file since the entire document must be lexed to ensure accuracy.
Risks
Poorly written lexers have the ability to crash Scintilla, so unsaved data might be lost. However, these crashes have only been observed in early lexer development, when syntax errors or pattern errors are present. Once the lexer actually starts styling text (either correctly or incorrectly; it does not matter), no crashes have occurred.
Acknowledgements
Thanks to Peter Odding for his lexer post on the Lua mailing list that inspired me, and of course thanks to Roberto Ierusalimschy for LPeg.
Functions
| add_rule (lexer, id, rule, name) | [Local function] Adds a rule to a lexer's current ordered list of rules. |
| add_style (lexer, token_name, style) | [Local function] Adds a new Scintilla style to Scintilla. |
| build_grammar (lexer) | [Local function] (Re)constructs lexer._GRAMMAR. |
| color (r, g, b) | Creates a Scintilla color. |
| delimited_range (chars, escape, end_optional, balanced, forbidden) | Creates an LPeg pattern that matches a range of characters delimited by a specific character(s). |
| delimited_range_with_embedded (chars, escape, token_name, patt, forbidden) | Similar to `delimited_range()`, but includes embedded patterns. |
| embed_lexer (parent, child, start_rule, end_rule, preproc) | Embeds a child lexer language in a parent one. |
| fold (text, start_pos, start_line, start_level) | Folds the given text. |
| get_fold_level (line, line_number) | Returns the fold level for a given line. |
| get_indent_amount (line) | Returns the indent amount of text for a given line. |
| get_property (key, default) | Returns an integer property value for a given key. |
| get_style_at (pos) | Returns the integer style number at a given position. |
| join_tokens (lexer, parent) | [Local function] (Re)constructs lexer._TOKENRULE. |
| lex (text) | Lexes the given text. |
| load (lexer_name) | Initializes the specified lexer. |
| nested_pair (start_chars, end_chars, end_optional) | Similar to `delimited_range()`, but allows for multi-character delimiters. |
| starts_line (patt) | Creates an LPeg pattern from a given pattern that matches the beginning of a line and returns it. |
| style (style_table) | Creates a Scintilla style from a table of style properties. |
| token (name, patt) | Creates an LPeg capture table index with the name and position of the token. |
| word_match (words, word_chars, case_insensitive) | Creates an LPeg pattern that matches a set of words. |
Tables
| _EMBEDDEDRULES | Set of rules for an embedded lexer. |
| _RULES | List of rule names with associated LPeg patterns for a specific lexer. |
| tokens | [Local table] Default tokens. |
Functions
- add_rule (lexer, id, rule, name)
-
[Local function] Adds a rule to a lexer's current ordered list of rules.
Parameters
- lexer: The lexer to add the given rule to.
- id:
- rule: The LPeg pattern of the rule.
- name: The name associated with this rule. It is used for other lexers to access this particular rule from the lexer's `_RULES` table. It does not have to be the same as the name passed to `token`.
- add_style (lexer, token_name, style)
-
[Local function] Adds a new Scintilla style to Scintilla.
Parameters
- lexer: The lexer to add the given style to.
- token_name: The name of the token associated with this style.
- style: A Scintilla style created from style().
- build_grammar (lexer)
-
[Local function] (Re)constructs lexer._GRAMMAR.
Parameters
- lexer: The parent lexer.
- color (r, g, b)
-
Creates a Scintilla color.
Parameters
- r: The string red component of the hexadecimal color.
- g: The string green component of the color.
- b: The string blue component of the color.
Usage:
local red = color('FF', '00', '00')
- delimited_range (chars, escape, end_optional, balanced, forbidden)
-
Creates an LPeg pattern that matches a range of characters delimited by a specific character(s). This can be used to match a string, parentheses, etc.
Parameters
- chars: The character(s) that bound the matched range.
- escape: Optional escape character. This parameter may be omitted, nil, or the empty string.
- end_optional: Optional flag indicating whether the ending delimiter is optional. If true, the range begun by the start delimiter matches until an end delimiter or the end of the input is reached.
- balanced: Optional flag indicating whether or not a balanced range is matched, like `%b` in Lua's `string.find`. This flag only applies if `chars` consists of two different characters (e.g. '()').
- forbidden: Optional string of characters forbidden in a delimited range. Each character is part of the set.
Usage
- local sq_str_noescapes = delimited_range("'")
- local sq_str_escapes = delimited_range("'", '\\', true)
- local unbalanced_parens = delimited_range('()', '\\', true)
- local balanced_parens = delimited_range('()', '\\', true, true)
- delimited_range_with_embedded (chars, escape, token_name, patt, forbidden)
-
Similar to `delimited_range()`, but includes embedded patterns. This is useful for embedding additional lexers inside strings. Do not enclose this range with `token()`. Instead specify the token name in the `token_name` parameter.
Parameters
- chars: The character(s) that bound the matched range.
- escape: Escape character or nil.
- token_name: Token name for the characters in the range excluding the embedded pattern. Use this instead of `token()`.
- patt: Pattern embedded in the range.
- forbidden: Optional string of characters forbidden in a delimited range. Each character is part of the set.
Usage:
local embedded_sq_str = l.delimited_range_with_embedded("'", '\\', 'string', php_rule)
- embed_lexer (parent, child, start_rule, end_rule, preproc)
-
Embeds a child lexer language in a parent one.
Parameters
- parent: The parent lexer.
- child: The child lexer.
- start_rule: The token that signals the beginning of the embedded lexer.
- end_rule: The token that signals the end of the embedded lexer.
- preproc: Boolean flag specifying if the child lexer is a preprocessor language.
Usage
- embed_lexer(_M, css, css_start_rule, css_end_rule)
- embed_lexer(html, _M, php_start_rule, php_end_rule, true)
- embed_lexer(html, ruby, ruby_start_rule, ruby_end_rule, true)
- fold (text, start_pos, start_line, start_level)
-
Folds the given text. Called by LexLPeg.cxx; do not call from Lua. If the current lexer has no _fold function, folding by indentation is performed if the 'fold.by.indentation' property is set.
Parameters
- text: The document text to fold.
- start_pos: The position in the document at which text starts.
- start_line: The line number text starts on.
- start_level: The fold level text starts on.
Return value:
Table of fold levels.
- get_fold_level (line, line_number)
-
Returns the fold level for a given line. This level already has `SC_FOLDLEVELBASE` added to it, so you do not need to add it yourself.
Parameters
- line:
- line_number: The line number to get the fold level of.
- get_indent_amount (line)
-
Returns the indent amount of text for a given line.
Parameters
- line: The line number to get the indent amount of.
- get_property (key, default)
-
Returns an integer property value for a given key.
Parameters
- key: The property key.
- default: Optional integer value to return if key is not set.
- get_style_at (pos)
-
Returns the integer style number at a given position.
Parameters
- pos: The position to get the style for.
- join_tokens (lexer, parent)
-
[Local function] (Re)constructs lexer._TOKENRULE.
Parameters
- lexer:
- parent: The parent lexer.
- lex (text)
-
Lexes the given text. Called by LexLPeg.cxx; do not call from Lua. If the lexer has a _LEXBYLINE flag set, the text is lexed one line at a time. Otherwise the text is lexed as a whole.
Parameters
- text: The text to lex.
- load (lexer_name)
-
Initializes the specified lexer.
Parameters
- lexer_name: The name of the lexing language.
- nested_pair (start_chars, end_chars, end_optional)
-
Similar to `delimited_range()`, but allows for multi-character delimiters. This is useful for lexers with tokens such as nested block comments. With single-character delimiters, this function is identical to `delimited_range(start_chars..end_chars, nil, end_optional, true)`.
Parameters
- start_chars: The string starting a nested sequence.
- end_chars: The string ending a nested sequence.
- end_optional: Optional flag indicating whether the ending delimiter is optional. If true, the range begun by the start delimiter matches until an end delimiter or the end of the input is reached.
Usage:
local nested_comment = l.nested_pair('/*', '*/', true)
- starts_line (patt)
-
Creates an LPeg pattern from a given pattern that matches the beginning of a line and returns it.
Parameters
- patt: The LPeg pattern to match at the beginning of a line.
Usage:
local preproc = token('preprocessor', #P('#') * l.starts_line('#' * l.nonnewline^0))
- style (style_table)
-
Creates a Scintilla style from a table of style properties.
Parameters
- style_table: A table of style properties. Style properties available:
    font = [string]
    size = [integer]
    bold = [boolean]
    italic = [boolean]
    underline = [boolean]
    fore = [integer]*
    back = [integer]*
    eolfilled = [boolean]
    characterset = ?
    case = [integer]
    visible = [boolean]
    changeable = [boolean]
    hotspot = [boolean]
  * Use the value returned by `color()`.
Usage:
local bold_italic = style { bold = true, italic = true }
- token (name, patt)
-
Creates an LPeg capture table index with the name and position of the token.
Parameters
- name: The name of token. If this name is not in `l.tokens` then you will have to specify a style for it in `lexer._tokenstyles`.
- patt: The LPeg pattern associated with the token.
Usage
- local ws = token('whitespace', l.space^1)
- php_start_rule = token('php_tag', '<?' * ('php' * l.space)^-1)
- word_match (words, word_chars, case_insensitive)
-
Creates an LPeg pattern that matches a set of words.
Parameters
- words: A table of words.
- word_chars: Optional string of additional characters considered to be part of a word (default is `%w_`).
- case_insensitive: Optional boolean flag indicating whether the word match is case-insensitive.
Usage
- local keyword = token('keyword', word_match { 'foo', 'bar', 'baz' })
- local keyword = token('keyword', word_match({ 'foo-bar', 'foo-baz', 'bar-foo', 'bar-baz', 'baz-foo', 'baz-bar' }, '-', true))
Tables
- _EMBEDDEDRULES
- Set of rules for an embedded lexer. For a parent lexer name, contains child's `start_rule`, `token_rule`, and `end_rule` patterns.
- _RULES
- List of rule names with associated LPeg patterns for a specific lexer. It is accessible to other lexers for embedded lexer applications.
- tokens
- [Local table] Default tokens. Contains token identifiers and associated style numbers.
Fields
- default: The default type (0).
- whitespace: The whitespace type (1).
- comment: The comment type (2).
- string: The string type (3).
- number: The number type (4).
- keyword: The keyword type (5).
- identifier: The identifier type (6).
- operator: The operator type (7).
- error: The error type (8).
- preprocessor: The preprocessor type (9).
- constant: The constant type (10).
- function: The function type (11).
- class: The class type (12).
- type: The type type (13).