Module lexer

Performs lexing of Scintilla documents.

Overview

Dynamic lexers are more flexible than Scintilla's static ones. They are often more readable as well. This document provides all the information necessary in order to write a new lexer. For illustrative purposes, a Lua lexer will be created. Lexers are written using Parsing Expression Grammars or PEGs with the Lua LPeg library. Please familiarize yourself with LPeg's documentation before proceeding.

Writing a Dynamic Lexer

Rather than writing a lexer from scratch, first see if your language is similar to any of the 70+ languages supported. If so, you can copy and modify that lexer, saving some time and effort.

Introduction

All lexers are contained in the lexers/ directory. To begin, create a Lua script with the name of your lexer and open it for editing.

$> cd lexers
$> textadept lua.lua

Inside the lexer, the heading should look like the following:

-- Lua LPeg lexer

local l = lexer
local token, word_match = l.token, l.word_match
local P, R, S, V = l.lpeg.P, l.lpeg.R, l.lpeg.S, l.lpeg.V

module(...)

Each lexer is a module so the global namespace is not cluttered with lexer patterns and variables. The ... is there for a reason! Do not replace it with the name of your lexer. This is done by Lua automatically.

The local variables above the module give easy access to the many useful functions available for creating lexers.

Lexer Language Structure

It is important to spend some time considering the structure of the language you are creating the lexer for. What kinds of tokens does it have? Comments, strings, keywords, etc.? Lua has 9 tokens: whitespace, comments, strings, numbers, keywords, functions, constants, identifiers, and operators.

Tokens

In a lexer, a token consists of a token type followed by an LPeg pattern. Tokens are created using the token() function. A whitespace token typically looks like:

local ws = token('whitespace', S('\t\v\f\n\r ')^1)

It is difficult to remember that a whitespace character is either a \t, \v, \f, \n, \r, or a space. The lexer (l) module provides you with a shortcut for this and many other character sequences. They are:

  • any: Matches any single character.
  • ascii: Matches any ASCII character (0..127).
  • extend: Matches any ASCII extended character (0..255).
  • alpha: Matches any alphabetic character (A-Z, a-z).
  • digit: Matches any digit (0-9).
  • alnum: Matches any alphanumeric character (A-Z, a-z, 0-9).
  • lower: Matches any lowercase character (a-z).
  • upper: Matches any uppercase character (A-Z).
  • xdigit: Matches any hexadecimal digit (0-9, A-F, a-f).
  • cntrl: Matches any control character (0..31).
  • graph: Matches any graphical character (! to ~).
  • print: Matches any printable character (space to ~).
  • punct: Matches any punctuation character not alphanumeric (! to /, : to @, [ to `, { to ~).
  • space: Matches any whitespace character (\t, \v, \f, \n, \r, space).
  • newline: Matches any newline characters.
  • nonnewline: Matches any non-newline character.
  • nonnewline_esc: Matches any non-newline character excluding newlines escaped with \\.
  • dec_num: Matches a decimal number.
  • hex_num: Matches a hexadecimal number.
  • oct_num: Matches an octal number.
  • integer: Matches a decimal, hexadecimal, or octal number.
  • float: Matches a floating point number.
  • word: Matches a typical word starting with a letter or underscore and then any alphanumeric or underscore characters.

The above whitespace token can be rewritten more simply as:

local ws = token('whitespace', l.space^1)

The next Lua token is a comment. Short comments beginning with -- are easy to express with LPeg:

local line_comment = '--' * l.nonnewline^0

On the other hand, long comments are more difficult to express because they have levels. See the Lua Reference Manual for more information. As a result, a functional pattern is necessary:

local longstring = #('[[' + ('[' * P('=')^0 * '[')) *
  P(function(input, index)
    local level = input:match('^%[(=*)%[', index)
    if level then
      local _, stop = input:find(']'..level..']', index, true)
      return stop and stop + 1 or #input + 1
    end
  end)
local block_comment = '--' * longstring
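The matching function inside longstring is ordinary Lua, so it can be tried on its own. The following sketch lifts it into a standalone function (the name match_long_bracket is hypothetical) to show how the bracket level selects the closing delimiter:

```lua
-- Standalone copy of the function used in `longstring` above.
-- Returns the position just past the closing long bracket, the position
-- past the end of input if the bracket is unterminated, or nil if no long
-- bracket starts at `index`.
local function match_long_bracket(input, index)
  local level = input:match('^%[(=*)%[', index)
  if not level then return nil end
  -- Plain (non-pattern) search for a closing bracket of the same level.
  local _, stop = input:find(']' .. level .. ']', index, true)
  return stop and stop + 1 or #input + 1
end

local text = 'x = [==[ a ]] b ]==] y'
local stop = match_long_bracket(text, 5)
print(text:sub(5, stop - 1)) --> [==[ a ]] b ]==]
```

The level-0 closer ]] inside the text is skipped because only ]==] matches the opening level.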

The token for a comment is then:

local comment = token('comment', line_comment + block_comment)

It is worth noting that while token names are arbitrary, you are encouraged to use the ones listed in the tokens table because a standard color theme is applied to them. If you wish to create a unique token, no problem. You can specify how it will be colored later on.

Lua strings should be easy to express because they are just characters surrounded by ' or " characters, right? Not quite. Lua strings contain escape sequences (\char) so a \' sequence in a single-quoted string does not indicate the end of a string and must be handled appropriately. Fortunately, this is a common occurrence in many programming languages, so a convenient function is provided: delimited_range().

local sq_str = l.delimited_range("'", '\\', true)
local dq_str = l.delimited_range('"', '\\', true)
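The escape handling described above can be illustrated without LPeg. The following is a minimal plain-Lua sketch, not the implementation of delimited_range(); the function name match_delimited and its parameters are hypothetical. It scans from an opening delimiter and skips whatever character follows the escape character, so an escaped quote does not end the range.

```lua
-- Hypothetical helper: returns the position just past the closing delimiter.
-- If the range is unterminated, returns the position past the end of input
-- when `end_optional` is true, and nil otherwise.
local function match_delimited(input, index, delim, escape, end_optional)
  if input:sub(index, index) ~= delim then return nil end
  local i = index + 1
  while i <= #input do
    local c = input:sub(i, i)
    if c == escape then
      i = i + 2 -- skip the escape character and the character it escapes
    elseif c == delim then
      return i + 1
    else
      i = i + 1
    end
  end
  if end_optional then return #input + 1 end
  return nil
end

local src = "'it\\'s' rest"
local stop = match_delimited(src, 1, "'", '\\', true)
print(src:sub(1, stop - 1)) --> 'it\'s'
```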

Lua also has multi-line strings, but they have the same format as block comments. All strings can be combined into a token:

local string = token('string', sq_str + dq_str + longstring)

Numbers are easy in Lua using lexer's predefined patterns.

local lua_integer = P('-')^-1 * (l.hex_num + l.dec_num)
local number = token('number', l.float + lua_integer)

Keep in mind that the predefined patterns may not be completely accurate for your language, so you may have to create your own variants. In the above case, Lua integers do not have octal sequences, so the l.integer pattern is not used.

Depending on the number of keywords for a particular language, a simple P(keyword1) + P(keyword2) + ... + P(keywordN) pattern can get quite large. In fact, LPeg has a limit on pattern size. Also, if the keywords are not case sensitive, additional complexity arises, so a better approach is necessary. Once again, lexer has a shortcut function: word_match().

local keyword = token('keyword', word_match {
  'and', 'break', 'do', 'else', 'elseif', 'end', 'false', 'for',
  'function', 'if', 'in', 'local', 'nil', 'not', 'or', 'repeat',
  'return', 'then', 'true', 'until', 'while'
})

If keywords were case-insensitive, an additional parameter would be specified in the call to word_match(); no other action is needed.
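To see why a word lookup scales better than a long chain of alternatives, here is a plain-Lua sketch of the idea. It is an assumption about the general technique, not the source of word_match(); make_word_matcher is a hypothetical name, although its parameters mirror word_match(words, word_chars, case_insensitive). The candidate word is read once and looked up in a set, so the cost does not grow with the number of keywords.

```lua
-- Hypothetical stand-in for word_match(); returns a matcher function.
local function make_word_matcher(words, word_chars, case_insensitive)
  local set = {}
  for _, w in ipairs(words) do
    set[case_insensitive and w:lower() or w] = true
  end
  -- Escape extra word characters so they are literal in the character class.
  local extra = (word_chars or ''):gsub('%W', '%%%0')
  local pattern = '^[%w_' .. extra .. ']+'
  -- The matcher returns the position after the word, or nil on failure.
  return function(text, index)
    index = index or 1
    local word = text:match(pattern, index)
    if not word then return nil end
    if case_insensitive then word = word:lower() end
    if set[word] then return index + #word end
    return nil
  end
end

local keyword = make_word_matcher({ 'foo-bar', 'baz' }, '-', true)
print(keyword('FOO-BAR = 1')) --> 8
print(keyword('foo = 1'))     --> nil
```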

Lua functions and constants are specified like keywords:

local func = token('function', word_match {
  'assert', 'collectgarbage', 'dofile', 'error', 'getfenv',
  'getmetatable', 'gcinfo', 'ipairs', 'loadfile', 'loadlib',
  'loadstring', 'next', 'pairs', 'pcall', 'print', 'rawequal',
  'rawget', 'rawset', 'require', 'setfenv', 'setmetatable',
  'tonumber', 'tostring', 'type', 'unpack', 'xpcall'
})

local constant = token('constant', word_match {
  '_G', '_VERSION', 'LUA_PATH', '_LOADED', '_REQUIREDNAME', '_ALERT',
  '_ERRORMESSAGE', '_PROMPT'
})

Unlike most programming languages, Lua allows an additional range of characters in its identifier names (variables, functions, modules, etc.) so the usual l.word cannot be used. Instead, identifiers are represented by:

local word = (R('AZ', 'az', '\127\255') + '_') * (l.alnum + '_')^0
local identifier = token('identifier', word)

Finally, an operator character is one of the following:

local operator = token('operator', '~=' + S('+-*/%^#=<>;:,.{}[]()'))

Rules

Rules are just a combination of tokens. In Lua, all rules consist of a single token, but other languages may have two or more tokens in a rule. For example, an HTML tag consists of an element token followed by an optional set of attribute tokens. This allows each part of the tag to be colored distinctly.

The set of rules that comprises Lua is specified in a _rules table for the lexer.

_rules = {
  { 'whitespace', ws },
  { 'keyword', keyword },
  { 'function', func },
  { 'constant', constant },
  { 'identifier', identifier },
  { 'string', string },
  { 'comment', comment },
  { 'number', number },
  { 'operator', operator },
  { 'any_char', l.any_char },
}

Each entry is a rule name and its associated pattern. Please note that the names of the rules can be completely different than the names of the tokens contained within them.

The order of the rules is important because of the nature of LPeg. LPeg tries to apply the first rule to the current position in the text it is matching. If there is a match, it colors that section appropriately and moves on. If there is not a match, it tries the next rule, and so on. Suppose instead that the identifier rule was before the keyword rule. It can be seen that all keywords satisfy the requirements for being an identifier, so any keywords would be incorrectly colored as identifiers. This is why identifier is where it is in the _rules table.
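The effect of rule order can be reproduced in a few lines of plain Lua. This sketch uses Lua string patterns rather than LPeg, and the helper names (first_match, is_keyword, is_identifier) are hypothetical; it only mimics the first-match-wins behavior described above.

```lua
-- Returns the name of the first rule whose matcher accepts `text`.
local function first_match(rules, text)
  for _, rule in ipairs(rules) do
    local name, matcher = rule[1], rule[2]
    if matcher(text) then return name end
  end
  return nil
end

local keywords = { ['if'] = true, ['then'] = true, ['end'] = true }

local function is_identifier(text)
  return text:match('^[%a_][%w_]*') ~= nil
end

local function is_keyword(text)
  local word = text:match('^[%a_][%w_]*')
  return word ~= nil and keywords[word] == true
end

local good_order = { { 'keyword', is_keyword }, { 'identifier', is_identifier } }
local bad_order = { { 'identifier', is_identifier }, { 'keyword', is_keyword } }

print(first_match(good_order, 'then')) --> keyword
print(first_match(bad_order, 'then'))  --> identifier
```

With the identifier rule first, the keyword rule is never reached for any word.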

You might be wondering what that any_char is doing at the bottom of _rules. Its purpose is to match anything not accounted for in the above rules. For example, suppose the ! character is in the input text. It will not be matched by any of the first 9 rules, so without any_char, the text would not match at all, and no coloring would occur. any_char matches one single character and moves on. It may be colored red (indicating a syntax error) if desired because it is a token, not just a pattern.

Summary

The above method of defining tokens and rules is sufficient for a majority of lexers. The lexer module provides many useful patterns and functions for constructing a working lexer quickly and efficiently. In most cases, the amount of knowledge of LPeg required to write a lexer is minimal.

As long as you used token names listed in tokens, you do not have to specify any coloring (or styling) information in the lexer; it is taken care of by the user's color theme.

The rest of this document is devoted to more complex lexer techniques.

Styling Tokens

The term for coloring text is styling. Just like with predefined LPeg patterns in lexer, predefined styles are available.

  • style_nothing: Typically used for whitespace.
  • style_char: Typically used for character literals.
  • style_class: Typically used for class definitions.
  • style_comment: Typically used for code comments.
  • style_constant: Typically used for constants.
  • style_definition: Typically used for definitions.
  • style_error: Typically used for erroneous syntax.
  • style_function: Typically used for function definitions.
  • style_keyword: Typically used for language keywords.
  • style_number: Typically used for numbers.
  • style_operator: Typically used for operators.
  • style_string: Typically used for strings.
  • style_preproc: Typically used for preprocessor statements.
  • style_tag: Typically used for markup tags.
  • style_type: Typically used for static types.
  • style_variable: Typically used for variables.
  • style_embedded: Typically used for embedded code.
  • style_identifier: Typically used for identifier words.

Each style consists of a set of attributes:

  • font: The style's font name.
  • size: The style's font size.
  • bold: Flag indicating whether or not the font is boldface.
  • italic: Flag indicating whether or not the font is italic.
  • underline: Flag indicating whether or not the font is underlined.
  • fore: The color of the font face.
  • back: The color of the font background.
  • eolfilled: Flag indicating whether or not to color the end of the line.
  • characterset: The character set of the font.
  • case: The case of the font. 1 for upper case, 2 for lower case, 0 for normal case.
  • visible: Flag indicating whether or not the text is visible.
  • changeable: Flag indicating whether or not the text can be changed; when false, the text is read-only.
  • hotspot: Flag indicating whether or not the style is clickable.

Styles are created with style(). For example:

-- style with default theme settings
local style_nothing = l.style { }

-- style with bold text with default theme font
local style_bold = l.style { bold = true }

-- style with bold italic text with default theme font
local style_bold_italic = l.style { bold = true, italic = true }

The style_bold_italic style can be rewritten in terms of style_bold:

local style_bold_italic = style_bold..{ italic = true }

In this way you can build on previously defined styles without having to rewrite them. Note the previous style is left unchanged.
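One way the .. operator can produce a new style while leaving the original untouched is a __concat metamethod that copies before merging. The sketch below is an illustration under that assumption, not the module's actual code; make_style is a hypothetical stand-in for l.style.

```lua
local style_mt = {}

-- Hypothetical stand-in for l.style: copies the properties into a new table.
local function make_style(props)
  local s = {}
  for k, v in pairs(props) do s[k] = v end
  return setmetatable(s, style_mt)
end

-- `base .. extra` builds a fresh style; `base` itself is never modified.
style_mt.__concat = function(base, extra)
  local merged = make_style(base)
  for k, v in pairs(extra) do merged[k] = v end
  return merged
end

local style_bold = make_style { bold = true }
local style_bold_italic = style_bold .. { italic = true }

print(style_bold.italic)        --> nil
print(style_bold_italic.bold)   --> true
print(style_bold_italic.italic) --> true
```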

Style colors are different than the #rrggbb RGB notation you may be familiar with. Instead, create a color using color().

local red = l.color('FF', '00', '00')
local green = l.color('00', 'FF', '00')
local blue = l.color('00', '00', 'FF')
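Scintilla represents a color as a single integer with the red byte lowest and the blue byte highest (0xBBGGRR), which is why the familiar #rrggbb order does not apply. The sketch below shows such a conversion from three hexadecimal strings; it assumes, without confirmation from this document, that color() does something equivalent. The name bgr_color is hypothetical.

```lua
-- Hypothetical equivalent of color(): packs hex strings in BGR order.
local function bgr_color(r, g, b)
  return tonumber(b .. g .. r, 16)
end

print(bgr_color('FF', '00', '00')) --> 255 (red occupies the low byte)
print(bgr_color('00', '00', 'FF')) --> 16711680 (blue occupies the high byte)
```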

As you might have guessed, lexer has a set of default colors.

  • green
  • blue
  • red
  • yellow
  • teal
  • white
  • black
  • grey
  • purple
  • orange

It is recommended to use them to stay consistent with a user's color theme.

Finally, styles are assigned to tokens via a _tokenstyles table in the lexer. Styles do not have to be assigned to standard tokens; it is done automatically. You only have to assign styles for tokens you create. For example:

local lua = token('lua', P('lua'))

-- ... other patterns and tokens ...

_tokenstyles = {
  { 'lua', l.style_keyword },
}

Each entry is the token name the style is for and the style itself. The order of styles in _tokenstyles does not matter.

For examples of how styles are created, please see the theme files in the lexers/themes/ folder.

Line Lexer

Sometimes it is advantageous to lex input text line by line rather than a chunk at a time. This occurs particularly in diff, patch, or make files. Put

_LEXBYLINE = true

somewhere in your lexer in order to do this.

Embedded Lexers

A particular advantage that dynamic lexers have over static ones is that lexers can be embedded within one another very easily, requiring minimal effort. There are two kinds of embedded lexers: a parent lexer that embeds other child lexers in it, and a child lexer that embeds itself within a parent lexer.

Parent Lexer with Children

An example of this kind of lexer is HTML with embedded CSS and JavaScript. After creating the parent lexer, load the child lexers in it using lexer.load(). For example:

local css = l.load('css')

There needs to be a transition from the parent HTML lexer to the child CSS lexer. This is something of the form <style type="text/css">. Similarly, the transition from child to parent is </style>.

local css_start_rule = #(P('<') * P('style') *
  P(function(input, index)
    -- Search only the current tag, starting at the current position.
    if input:find('^[^>]+type%s*=%s*(["\'])text/css%1', index) then
      return index
    end
  end)) * tag
local css_end_rule = #(P('</') * P('style') * ws^0 * P('>')) * tag

where tag and ws have been previously defined in the HTML lexer. Recall that an any_char rule matches anything not matched previously in a lexer. This rule exists in the CSS lexer, but we want it to stop matching when it encounters </style> (otherwise the rest of the input would be counted as CSS) without modifying the lexer file itself. The solution is to edit the any_char rule from within the css._RULES table:

css._RULES['any_char'] = token('css_default', l.any - css_end_rule)

Now the CSS lexer can be embedded using embed_lexer():

l.embed_lexer(_M, css, css_start_rule, css_end_rule)

What is _M? It is the parent HTML lexer object, not the string ... or 'html'. The lexer object is needed by embed_lexer().

The same procedure can be done for JavaScript, but there is a wrinkle: the child to parent transition (</script>) starts with a <, which is an operator in JavaScript. Therefore the operator rule must be edited in addition to any_char.

local js = l.load('javascript')

local js_start_rule = #(P('<') * P('script') *
  P(function(input, index)
    -- Search only the current tag, starting at the current position.
    if input:find('^[^>]+type%s*=%s*(["\'])text/javascript%1', index) then
      return index
    end
  end)) * tag
local js_end_rule = #('</' * P('script') * ws^0 * '>') * tag
js._RULES['operator'] = token('operator', S('+-/*%^!=&|?:;.()[]{}>') +
                                          '<' * -('/' * P('script')))
js._RULES['any_char'] = token('js_default', l.any - js_end_rule)
l.embed_lexer(_M, js, js_start_rule, js_end_rule)

Note the tokens css_default and js_default that were added. Since they are not standard tokens, styles must be added for them. If _tokenstyles has already been defined in the parent lexer, styles are added this way:

_tokenstyles[#_tokenstyles + 1] = { 'css_default', l.style_nothing }
_tokenstyles[#_tokenstyles + 1] = { 'js_default', l.style_nothing }

Child Lexer Within Parent

An example of this kind of lexer is PHP embedded in HTML. After creating the child lexer, load the parent lexer. As an example:

local html = l.load('hypertext')

Since HTML should be the main lexer (PHP is just a preprocessing language), the following statement changes the main lexer from PHP to HTML:

_lexer = html

Like in the previous section, transitions from HTML to PHP and back are specified:

local php_start_rule = token('php_tag', '<?' * ('php' * l.space)^-1)
local php_end_rule = token('php_tag', '?>')

And PHP is embedded as a preprocessing language:

l.embed_lexer(html, _M, php_start_rule, php_end_rule, true)

If PHP were not a preprocessing language, the lexer would be finished. However, PHP can appear anywhere within an HTML document, so the HTML lexer needs to have this indicated in its rules -- for example within strings. First, it is necessary to obtain the PHP rule (<?php ... ?> sequence).

local php_rules = _M._EMBEDDEDRULES[html._NAME]
local php_rule = php_rules.start_rule * php_rules.token_rule^0 *
                 php_rules.end_rule^-1

Now, string patterns with embedded PHP need to be created. The explanation on how to do so is beyond the scope of this tutorial. Suffice it to say that a shortcut function, delimited_range_with_embedded(), is available:

local embedded_sq_str =
  l.delimited_range_with_embedded("'", '\\', 'string', php_rule)
local embedded_dq_str =
  l.delimited_range_with_embedded('"', '\\', 'string', php_rule)

The HTML string rule can now be modified:

html._RULES['string'] = embedded_sq_str + embedded_dq_str

This procedure should be repeated for other rules, but is not shown here. You can look at lexers/php.lua for more information.

Code Folding (Optional)

It is sometimes convenient to "fold", or not show blocks of text. These blocks can be functions, classes, comments, etc. A folder iterates over each line of input text and assigns a fold level to it. Certain lines can be specified as fold points that fold subsequent lines with a higher fold level.

In order to implement a folder, define the following function in your lexer:

function _fold(input, start_pos, start_line, start_level)

end

  • input: The text to fold.
  • start_pos: Current position in the buffer of the text (used for obtaining style information from the document).
  • start_line: The line number the text starts at.
  • start_level: The fold level of the text at start_line.

The function must return a table whose indices are line numbers and whose values are tables containing the fold level and optionally a fold flag.

The following Scintilla fold flags are available:

  • SC_FOLDLEVELBASE: The initial (root) fold level.
  • SC_FOLDLEVELWHITEFLAG: Flag indicating that the line is blank.
  • SC_FOLDLEVELHEADERFLAG: Flag indicating the line is a fold point.
  • SC_FOLDLEVELNUMBERMASK: Flag used with SCI_GETFOLDLEVEL(line) to get the fold level of a line.

Have your fold function iterate over each line, setting fold levels. You can use the get_style_at(), get_property(), get_fold_level(), and get_indent_amount() functions as necessary to determine the fold level for each line. The following example sets fold points by changes in indentation.

function _fold(input, start_pos, start_line, start_level)
  local folds = {}
  local current_line = start_line
  local prev_level = start_level
  for indent, line in input:gmatch('([\t ]*)(.-)\r?\n') do
    if #line > 0 then
      local current_level = l.get_indent_amount(current_line)
      if current_level > prev_level then -- next level
        local i = current_line - 1
        while folds[i] and folds[i][2] == l.SC_FOLDLEVELWHITEFLAG do
          i = i - 1
        end
        if folds[i] then
          folds[i][2] = l.SC_FOLDLEVELHEADERFLAG -- low indent
        end
        folds[current_line] = { current_level } -- high indent
      elseif current_level < prev_level then -- prev level
        if folds[current_line - 1] then
          folds[current_line - 1][1] = prev_level -- high indent
        end
        folds[current_line] = { current_level } -- low indent
      else -- same level
        folds[current_line] = { prev_level }
      end
      prev_level = current_level
    else
      folds[current_line] = { prev_level, l.SC_FOLDLEVELWHITEFLAG }
    end
    current_line = current_line + 1
  end
  return folds
end
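For comparison, here is a second sketch of the same return format that needs no document state, so it can be run outside the editor. It folds on { and } characters instead of indentation. The function name fold_braces is hypothetical, and the two flag constants are defined locally with Scintilla's values rather than taken from the lexer module.

```lua
-- Scintilla's flag values, defined locally so the sketch runs standalone.
local SC_FOLDLEVELWHITEFLAG = 0x1000
local SC_FOLDLEVELHEADERFLAG = 0x2000

-- Hypothetical brace-based folder. Returns a table indexed by line number
-- whose values are { fold_level } or { fold_level, fold_flag }.
local function fold_braces(input, start_line, start_level)
  local folds = {}
  local line_num, level = start_line, start_level
  -- Make sure the last line is terminated so the loop below sees it.
  if input:sub(-1) ~= '\n' then input = input .. '\n' end
  for line in input:gmatch('(.-)\r?\n') do
    if line:match('^%s*$') then
      folds[line_num] = { level, SC_FOLDLEVELWHITEFLAG }
    else
      local _, opens = line:gsub('{', '')
      local _, closes = line:gsub('}', '')
      if opens > closes then
        folds[line_num] = { level, SC_FOLDLEVELHEADERFLAG } -- fold point
      else
        folds[line_num] = { level }
      end
      level = level + opens - closes
    end
    line_num = line_num + 1
  end
  return folds
end

local source = 'function f() {\n  x = 1\n\n}\ndone'
local folds = fold_braces(source, 1, 0)
for n = 1, 5 do print(n, folds[n][1], folds[n][2]) end
```

The line that opens a block keeps the outer level and carries the header flag; the lines inside it, including the closing brace, sit one level deeper.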

SciTE users note: do not use get_property for getting fold options from a .properties file because SciTE is not set up to forward them to your lexer. Instead, you can provide options that can be set at the top of the lexer.

Using the Lexer with SciTE

Create a .properties file for your lexer and import it in either your SciTEUser.properties or SciTEGlobal.properties. The contents of the .properties file should contain:

file.patterns.[lexer_name]=[file_patterns]
lexer.$(file.patterns.[lexer_name])=[lexer_name]

where [lexer_name] is the name of your lexer (minus the .lua extension) and [file_patterns] is a set of file extensions matched to your lexer.

Please note any styling information in .properties files is ignored.

Using the Lexer with Textadept

Put your lexer in your ~/.textadept/lexers/ directory. That way your lexer will not be overwritten when upgrading. Also, lexers in this directory override default lexers. (A user lua lexer would be loaded instead of the default lua lexer. This is convenient if you wish to tweak a default lexer to your liking.) Do not forget to add a mime-type for your lexer.

Optimization

Lexers can usually be optimized for speed by re-arranging tokens so that the most common ones are recognized first. Keep in mind the issue that was raised earlier: if you put similar tokens like identifiers before keywords, the latter will not be styled correctly.

Troubleshooting

Errors in lexers can be tricky to debug. Lua errors and _G.print() statements in lexers are printed to STDOUT.

Limitations

Lexers can have up to 32757 elements in them. So unless the lexer is written very poorly, or has a dozen embedded languages, this limitation is not a problem.

Performance

There might be some slight overhead when initializing a lexer, but loading a file from disk into Scintilla is usually more expensive.

On modern computer systems, I see no difference in speed between LPeg lexers and Scintilla's C++ ones for single language lexers. There may be differences for multiple language lexers though, depending on the size of the file since the entire document must be lexed to ensure accuracy.

Risks

Poorly written lexers have the ability to crash Scintilla, so unsaved data might be lost. However, these crashes have only been observed in early lexer development, when syntax errors or pattern errors are present. Once the lexer actually starts styling text (either correctly or incorrectly; it does not matter), no crashes have occurred.

Acknowledgements

Thanks to Peter Odding for his lexer post on the Lua mailing list that inspired me, and of course thanks to Roberto Ierusalimschy for LPeg.

Functions

add_rule (lexer, id, rule, name) [Local function] Adds a rule to a lexer's current ordered list of rules.
add_style (lexer, token_name, style) [Local function] Adds a new Scintilla style to Scintilla.
build_grammar (lexer) [Local function] (Re)constructs lexer._GRAMMAR.
color (r, g, b) Creates a Scintilla color.
delimited_range (chars, escape, end_optional, balanced, forbidden) Creates an LPeg pattern that matches a range of characters delimited by a specific character(s).
delimited_range_with_embedded (chars, escape, token_name, patt, forbidden) Similar to `delimited_range()`, but includes embedded patterns.
embed_lexer (parent, child, start_rule, end_rule, preproc) Embeds a child lexer language in a parent one.
fold (text, start_pos, start_line, start_level) Folds the given text.
get_fold_level (line, line_number) Returns the fold level for a given line.
get_indent_amount (line) Returns the indent amount of text for a given line.
get_property (key, default) Returns an integer property value for a given key.
get_style_at (pos) Returns the integer style number at a given position.
join_tokens (lexer, parent) [Local function] (Re)constructs lexer._TOKENRULE.
lex (text) Lexes the given text.
load (lexer_name) Initializes the specified lexer.
nested_pair (start_chars, end_chars, end_optional) Similar to `delimited_range()`, but allows for multi-character delimiters.
starts_line (patt) Creates an LPeg pattern from a given pattern that matches the beginning of a line and returns it.
style (style_table) Creates a Scintilla style from a table of style properties.
token (name, patt) Creates an LPeg capture table index with the name and position of the token.
word_match (words, word_chars, case_insensitive) Creates an LPeg pattern that matches a set of words.

Tables

_EMBEDDEDRULES Set of rules for an embedded lexer.
_RULES List of rule names with associated LPeg patterns for a specific lexer.
tokens [Local table] Default tokens.


Functions

add_rule (lexer, id, rule, name)
[Local function] Adds a rule to a lexer's current ordered list of rules.

Parameters

  • lexer: The lexer to add the given rule to.
  • id:
  • rule: The LPeg pattern of the rule.
  • name: The name associated with this rule. It is used for other lexers to access this particular rule from the lexer's `_RULES` table. It does not have to be the same as the name passed to `token`.

add_style (lexer, token_name, style)
[Local function] Adds a new Scintilla style to Scintilla.

Parameters

  • lexer: The lexer to add the given style to.
  • token_name: The name of the token associated with this style.
  • style: A Scintilla style created from style().

build_grammar (lexer)
[Local function] (Re)constructs lexer._GRAMMAR.

Parameters

  • lexer: The parent lexer.

color (r, g, b)
Creates a Scintilla color.

Parameters

  • r: The string red component of the hexadecimal color.
  • g: The string green component of the color.
  • b: The string blue component of the color.

Usage:

local red = color('FF', '00', '00')

delimited_range (chars, escape, end_optional, balanced, forbidden)
Creates an LPeg pattern that matches a range of characters delimited by a specific character(s). This can be used to match a string, parenthesis, etc.

Parameters

  • chars: The character(s) that bound the matched range.
  • escape: Optional escape character. This parameter may be omitted, nil, or the empty string.
  • end_optional: Optional flag indicating whether or not an ending delimiter is optional. If true, the range begun by the start delimiter matches until an end delimiter or the end of the input is reached.
  • balanced: Optional flag indicating whether or not a balanced range is matched, like `%b` in Lua's `string.find`. This flag only applies if `chars` consists of two different characters (e.g. '()').
  • forbidden: Optional string of characters forbidden in a delimited range. Each character is part of the set.

Usage

  • local sq_str_noescapes = delimited_range("'")
  • local sq_str_escapes = delimited_range("'", '\\', true)
  • local unbalanced_parens = delimited_range('()', '\\', true)
  • local balanced_parens = delimited_range('()', '\\', true, true)

delimited_range_with_embedded (chars, escape, token_name, patt, forbidden)
Similar to `delimited_range()`, but includes embedded patterns. This is useful for embedding additional lexers inside strings. Do not enclose this range with `token()`. Instead specify the token name in the `token_name` parameter.

Parameters

  • chars: The character(s) that bound the matched range.
  • escape: Escape character or nil.
  • token_name: Token name for the characters in the range excluding the embedded pattern. Use this instead of `token()`.
  • patt: Pattern embedded in the range.
  • forbidden: Optional string of characters forbidden in a delimited range. Each character is part of the set.

Usage:

local embedded_sq_str = l.delimited_range_with_embedded("'", '\\', 'string', php_rule)

embed_lexer (parent, child, start_rule, end_rule, preproc)
Embeds a child lexer language in a parent one.

Parameters

  • parent: The parent lexer.
  • child: The child lexer.
  • start_rule: The token that signals the beginning of the embedded lexer.
  • end_rule: The token that signals the end of the embedded lexer.
  • preproc: Boolean flag specifying if the child lexer is a preprocessor language.

Usage

  • embed_lexer(_M, css, css_start_rule, css_end_rule)
  • embed_lexer(html, _M, php_start_rule, php_end_rule, true)
  • embed_lexer(html, ruby, ruby_start_rule, ruby_end_rule, true)

fold (text, start_pos, start_line, start_level)
Folds the given text. Called by LexLPeg.cxx; do not call from Lua. If the current lexer has no _fold function, folding by indentation is performed if the 'fold.by.indentation' property is set.

Parameters

  • text: The document text to fold.
  • start_pos: The position in the document at which text starts.
  • start_line: The line number text starts on.
  • start_level: The fold level text starts on.

Return value:

Table of fold levels.

get_fold_level (line, line_number)
Returns the fold level for a given line. This level already has `SC_FOLDLEVELBASE` added to it, so you do not need to add it yourself.

Parameters

  • line:
  • line_number: The line number to get the fold level of.

get_indent_amount (line)
Returns the indent amount of text for a given line.

Parameters

  • line: The line number to get the indent amount of.

get_property (key, default)
Returns an integer property value for a given key.

Parameters

  • key: The property key.
  • default: Optional integer value to return if key is not set.

get_style_at (pos)
Returns the integer style number at a given position.

Parameters

  • pos: The position to get the style for.

join_tokens (lexer, parent)
[Local function] (Re)constructs lexer._TOKENRULE.

Parameters

  • lexer:
  • parent: The parent lexer.

lex (text)
Lexes the given text. Called by LexLPeg.cxx; do not call from Lua. If the lexer has a _LEXBYLINE flag set, the text is lexed one line at a time. Otherwise the text is lexed as a whole.

Parameters

  • text: The text to lex.

load (lexer_name)
Initializes the specified lexer.

Parameters

  • lexer_name: The name of the lexing language.

nested_pair (start_chars, end_chars, end_optional)
Similar to `delimited_range()`, but allows for multi-character delimiters. This is useful for lexers with tokens such as nested block comments. With single-character delimiters, this function is identical to `delimited_range(start_chars..end_chars, nil, end_optional, true)`.

Parameters

  • start_chars: The string starting a nested sequence.
  • end_chars: The string ending a nested sequence.
  • end_optional: Optional flag indicating whether or not an ending delimiter is optional. If true, the range begun by the start delimiter matches until an end delimiter or the end of the input is reached.

Usage:

local nested_comment = l.nested_pair('/*', '*/', true)

starts_line (patt)
Creates an LPeg pattern from a given pattern that matches the beginning of a line and returns it.

Parameters

  • patt: The LPeg pattern to match at the beginning of a line.

Usage:

local preproc = token('preprocessor', #P('#') * l.starts_line('#' * l.nonnewline^0))

style (style_table)
Creates a Scintilla style from a table of style properties.

Parameters

  • style_table: A table of style properties. Style properties available:
      font = [string]
      size = [integer]
      bold = [boolean]
      italic = [boolean]
      underline = [boolean]
      fore = [integer]*
      back = [integer]*
      eolfilled = [boolean]
      characterset = ?
      case = [integer]
      visible = [boolean]
      changeable = [boolean]
      hotspot = [boolean]
    * Use the value returned by `color()`.

Usage:

local bold_italic = style { bold = true, italic = true }

token (name, patt)
Creates an LPeg capture table index with the name and position of the token.

Parameters

  • name: The name of the token. If this name is not in `l.tokens` then you will have to specify a style for it in `lexer._tokenstyles`.
  • patt: The LPeg pattern associated with the token.

Usage

  • local ws = token('whitespace', l.space^1)
  • php_start_rule = token('php_tag', '<?' * ('php' * l.space)^-1)

word_match (words, word_chars, case_insensitive)
Creates an LPeg pattern that matches a set of words.

Parameters

  • words: A table of words.
  • word_chars: Optional string of additional characters considered to be part of a word (default is `%w_`).
  • case_insensitive: Optional boolean flag indicating whether the word match is case-insensitive.

Usage

  • local keyword = token('keyword', word_match { 'foo', 'bar', 'baz' })
  • local keyword = token('keyword', word_match({ 'foo-bar', 'foo-baz', 'bar-foo', 'bar-baz', 'baz-foo', 'baz-bar' }, '-', true))

Tables

_EMBEDDEDRULES
Set of rules for an embedded lexer. For a parent lexer name, contains child's `start_rule`, `token_rule`, and `end_rule` patterns.

_RULES
List of rule names with associated LPeg patterns for a specific lexer. It is accessible to other lexers for embedded lexer applications.

tokens
[Local table] Default tokens. Contains token identifiers and associated style numbers.

Fields

  • default: The default type (0).
  • whitespace: The whitespace type (1).
  • comment: The comment type (2).
  • string: The string type (3).
  • number: The number type (4).
  • keyword: The keyword type (5).
  • identifier: The identifier type (6).
  • operator: The operator type (7).
  • error: The error type (8).
  • preprocessor: The preprocessor type (9).
  • constant: The constant type (10).
  • function: The function type (11).
  • class: The class type (12).
  • type: The type type (13).
