Tree-sitter relies on a language grammar to parse text in that language. In Emacs, a language grammar is represented by a symbol. For example, the C language grammar is represented by the symbol c, and c can be passed to tree-sitter functions as the language argument.
Tree-sitter language grammars are distributed as dynamic libraries. To use a language grammar in Emacs, make sure that its dynamic library is installed on the system. Emacs looks for language grammars in several places, in the following order:
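first, in the list of directories specified by the variable treesit-extra-load-path;
then, in the tree-sitter subdirectory of the directory specified by user-emacs-directory (see The Init File);
and finally, in the system’s default locations for dynamic libraries.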
In each of these directories, Emacs looks for a file with one of the file-name extensions specified by the variable dynamic-library-suffixes.
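As a rough sketch of the lookup described above, the following registers an extra grammar directory and lists the file names that could match a given language. The directory ~/grammars and the helper my-treesit-candidate-files are hypothetical. The helper only approximates the search: it assumes the conventional libtree-sitter-LANGUAGE base name, ignores the system's default library locations, and makes no claim about the exact order in which Emacs tries the extensions.

```emacs-lisp
;; Declared here only so the sketch also loads in an Emacs built
;; without tree-sitter; `defvar' leaves an existing value untouched.
(defvar treesit-extra-load-path nil)
(defvar dynamic-library-suffixes '(".so"))

;; Hypothetical directory holding grammars you built yourself.
(add-to-list 'treesit-extra-load-path (expand-file-name "~/grammars"))

(defun my-treesit-candidate-files (language)
  "Return file names that could hold the grammar for LANGUAGE.
This is an illustration, not the lookup code Emacs itself uses."
  (let ((base (format "libtree-sitter-%s" language))
        (dirs (append treesit-extra-load-path
                      (list (expand-file-name "tree-sitter"
                                              user-emacs-directory)))))
    ;; For every directory, pair the base name with every suffix.
    (mapcan (lambda (dir)
              (mapcar (lambda (suffix)
                        (expand-file-name (concat base suffix) dir))
                      dynamic-library-suffixes))
            dirs)))

;; List the candidates for the C grammar.
(my-treesit-candidate-files 'c)
```

Checking the returned names with file-exists-p is one way to see which copy of a grammar is likely to be picked up first.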
If Emacs cannot find the library or has problems loading it, Emacs signals the treesit-load-language-error error. The data of that signal can be one of the following:
(not-found error-msg …)
This means that Emacs could not find the language grammar library.
(symbol-error error-msg)
This means that Emacs could not find in the library the expected function that every language grammar library should export.
(version-mismatch error-msg)
This means that the version of the language grammar library is incompatible with that of the tree-sitter library.
In all of these cases, error-msg might provide additional details about the failure.
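A Lisp program can catch this error and branch on the first element of the signal data. The sketch below assumes that treesit-parser-create is the call that triggers loading of the grammar; the function name my-treesit-try-parser and the message texts are made up for illustration.

```emacs-lisp
(defun my-treesit-try-parser (language)
  "Create a parser for LANGUAGE in the current buffer, or say why not.
Return the parser on success and nil on failure."
  (if (not (fboundp 'treesit-parser-create))
      (progn (message "This Emacs was built without tree-sitter") nil)
    (condition-case err
        (treesit-parser-create language)
      (treesit-load-language-error
       ;; ERR is (treesit-load-language-error REASON ERROR-MSG ...),
       ;; so (cadr err) is the REASON symbol described above.
       (pcase (cadr err)
         ('not-found
          (message "No grammar library found for %s" language))
         ('symbol-error
          (message "Grammar library for %s lacks the expected function"
                   language))
         ('version-mismatch
          (message "Grammar library for %s has an incompatible version"
                   language))
         (_ (message "Cannot load grammar for %s: %S" language (cdr err))))
       nil))))
```

Returning nil instead of re-signaling lets a major mode fall back to a non-tree-sitter implementation.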
Function: treesit-language-available-p language &optional detail
This function returns non-nil if the language grammar for language exists and can be loaded.
If detail is non-nil, it returns (t . nil) when language is available, and (nil . data) when it’s unavailable. data is the signal data of treesit-load-language-error.
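For instance, a program can use treesit-language-available-p to test for a grammar before relying on it. The helper name my-treesit-report is hypothetical, and the sketch prints the failure data as is rather than assuming its exact shape.

```emacs-lisp
(defun my-treesit-report (language)
  "Return a string saying whether the grammar for LANGUAGE can be loaded."
  (if (not (fboundp 'treesit-language-available-p))
      "tree-sitter support is not compiled into this Emacs"
    ;; A non-nil second argument asks for the detailed (FLAG . DATA) form.
    (let ((result (treesit-language-available-p language t)))
      (if (car result)
          (format "%s: available" language)
        (format "%s: unavailable, %S" language (cdr result))))))

(message "%s" (my-treesit-report 'c))
```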
By convention, the file name of the dynamic library for language is libtree-sitter-language.ext, where ext is the system-specific extension for dynamic libraries. Also by convention, the function provided by that library is named tree_sitter_language. If a language grammar library doesn’t follow this convention, you should add an entry
(language library-base-name function-name)
to the list in the variable treesit-load-name-override-list, where library-base-name is the base name of the dynamic library’s file name (usually, libtree-sitter-language), and function-name is the function provided by the library (usually, tree_sitter_language). For example,
(cool-lang "libtree-sitter-coool" "tree_sitter_cooool")
for a language that considers itself too “cool” to abide by conventions.
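In an init file, the entry from the example above could be registered as follows. This is a minimal sketch; the defvar is there only so the snippet also loads in an Emacs built without tree-sitter.

```emacs-lisp
;; `defvar' with a value does nothing if the variable is already bound.
(defvar treesit-load-name-override-list nil)

;; Entry format: (LANGUAGE LIBRARY-BASE-NAME FUNCTION-NAME).
(add-to-list 'treesit-load-name-override-list
             '(cool-lang "libtree-sitter-coool" "tree_sitter_cooool"))
```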
Function: treesit-library-abi-version &optional min-compatible
This function returns the version of the language grammar Application Binary Interface (ABI) supported by the tree-sitter library. By default, it returns the latest ABI version supported by the library, but if min-compatible is non-nil, it returns the oldest ABI version that the library can still support. Language grammar libraries must be built for an ABI version between the oldest and the latest versions supported by the tree-sitter library; otherwise the library will be unable to load them.
Function: treesit-language-abi-version language
This function returns the ABI version of the language grammar library loaded by Emacs for language. If language is unavailable, this function returns nil.
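The two ABI functions, treesit-library-abi-version and treesit-language-abi-version, can be combined to show where a grammar falls in the supported range. The helper name my-treesit-abi-compatible-p is hypothetical. Since a grammar outside the range cannot be loaded in the first place, the check mainly serves as a diagnostic.

```emacs-lisp
(defun my-treesit-abi-compatible-p (language)
  "Return non-nil if the loaded grammar for LANGUAGE is in the ABI range.
Return nil if LANGUAGE is unavailable or tree-sitter is not built in."
  (when (and (fboundp 'treesit-language-abi-version)
             (fboundp 'treesit-library-abi-version))
    (let ((lang-abi (treesit-language-abi-version language))
          (newest (treesit-library-abi-version))
          (oldest (treesit-library-abi-version t)))
      ;; LANG-ABI is nil when the grammar is unavailable.
      (and lang-abi (<= oldest lang-abi newest)))))
```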
A syntax tree is what a parser generates. In a syntax tree, each node represents a piece of text, and nodes are connected to each other by parent-child relationships. For example, if the source text is
1 + 2
its syntax tree could be
                  +--------------+
                  | root "1 + 2" |
                  +--------------+
                         |
        +--------------------------------+
        |       expression "1 + 2"       |
        +--------------------------------+
           |             |            |
+------------+   +--------------+   +------------+
| number "1" |   | operator "+" |   | number "2" |
+------------+   +--------------+   +------------+
We can also represent it as an s-expression:
(root (expression (number) (operator) (number)))
Names like root, expression, number, and operator specify the type of the nodes. However, not all nodes in a syntax tree have a type. Nodes that don’t have a type are known as anonymous nodes, and nodes with a type are named nodes. Anonymous nodes are tokens with fixed spellings, including punctuation characters like the bracket ‘]’, and keywords like return.
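Because the s-expression form is ordinary Lisp syntax, the shape of the tree can be explored with plain list functions. The sketch below treats the s-expression shown earlier as quoted data and collects its node types; the names my-sample-tree and my-tree-node-types are hypothetical, and the list is only a model of the tree, not the node objects that the tree-sitter functions return.

```emacs-lisp
;; The s-expression from the text, quoted so that Lisp treats it as data.
(defvar my-sample-tree
  '(root (expression (number) (operator) (number))))

(defun my-tree-node-types (tree)
  "Return the node types in TREE in depth-first order.
TREE has the form (TYPE CHILD...), where each CHILD is itself a tree."
  (cons (car tree)
        (mapcan #'my-tree-node-types (cdr tree))))

(my-tree-node-types my-sample-tree)
;; => (root expression number operator number)
```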
To make the syntax tree easier to analyze, many language grammars assign field names to child nodes. For example, a function_definition node could have a declarator and a body:
(function_definition declarator: (declaration) body: (compound_statement))
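Field names let a program fetch a child by its role instead of by its position. The following sketch assumes that a C grammar is installed and that it names the node types and fields as shown above; the helper my-c-function-body-text is hypothetical and returns nil when those assumptions don't hold.

```emacs-lisp
(defun my-c-function-body-text (source)
  "Return the text of the body of the first function defined in SOURCE.
Return nil when no C grammar is available or when SOURCE doesn't
start with a function definition."
  (when (and (fboundp 'treesit-language-available-p)
             (treesit-language-available-p 'c))
    (with-temp-buffer
      (insert source)
      (let* ((parser (treesit-parser-create 'c))
             (root (treesit-parser-root-node parser))
             (first (treesit-node-child root 0)))
        (when (equal (treesit-node-type first) "function_definition")
          ;; Look the child up by its field name rather than its index.
          (let ((body (treesit-node-child-by-field-name first "body")))
            (and body (treesit-node-text body t))))))))

(my-c-function-body-text "int main(void) { return 0; }")
```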
To aid in understanding the syntax of a language and in debugging Lisp programs that use the syntax tree, Emacs provides an “explore” mode, which displays the syntax tree of the source in the current buffer in real time. Emacs also comes with an “inspect” mode, which displays information about the nodes at point in the mode-line.
Command: treesit-explore-mode
This mode pops up a window displaying the syntax tree of the source in the current buffer. Selecting text in the source buffer highlights the corresponding nodes in the syntax tree display. Clicking on nodes in the syntax tree highlights the corresponding text in the source buffer.
Command: treesit-inspect-mode
This minor mode displays on the mode-line the node that starts at point. For example, the mode-line can display
parent field: (node (child (…)))
where node, child, etc., are nodes which begin at point. parent is the parent of node. node is displayed in a bold typeface. field-names are field names of node and of child, etc.
If no node starts at point, i.e., point is in the middle of a node, then the mode line displays the earliest node that spans point, and its immediate parent.
This minor mode doesn’t create parsers on its own. It uses the first parser in (treesit-parser-list) (see Using Tree-sitter Parser).
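Since the inspect mode depends on an existing parser, one way to turn it on automatically is from the hook of a tree-sitter based major mode, after checking that a parser is present. This sketch assumes c-ts-mode as the major mode; the function name my-enable-treesit-inspect is hypothetical.

```emacs-lisp
(defun my-enable-treesit-inspect ()
  "Turn on `treesit-inspect-mode' if the current buffer has a parser."
  (when (and (fboundp 'treesit-parser-list)
             (treesit-parser-list))
    (treesit-inspect-mode 1)))

;; Run the check each time a buffer enters `c-ts-mode'.
(add-hook 'c-ts-mode-hook #'my-enable-treesit-inspect)
```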
Authors of language grammars define the grammar of a programming language, which determines how a parser constructs a concrete syntax tree out of the program text. In order to use the syntax tree effectively, you need to consult the grammar file.
The grammar file is usually grammar.js in a language grammar’s project repository. The link to a language grammar’s home page can be found on tree-sitter’s homepage.
The grammar definition is written in JavaScript. For example, the rule matching a function_definition node looks like
function_definition: $ => seq(
  $.declaration_specifiers,
  field('declarator', $.declaration),
  field('body', $.compound_statement)
)
The rules are represented by functions that take a single argument $, representing the whole grammar. The function itself is constructed by other functions: the seq function puts together a sequence of children; the field function annotates a child with a field name. If we write the above definition in the so-called Backus-Naur Form (BNF) syntax, it would look like
function_definition := <declaration_specifiers> <declaration> <compound_statement>
and the node returned by the parser would look like
(function_definition (declaration_specifier) declarator: (declaration) body: (compound_statement))
Below is a list of functions that one can see in a grammar definition. Each function takes other rules as arguments and returns a new rule.
seq(rule1, rule2, …)
matches each rule one after another.
choice(rule1, rule2, …)
matches one of the rules in its arguments.
repeat(rule)
matches rule for zero or more times. This is like the ‘*’ operator in regular expressions.
repeat1(rule)
matches rule for one or more times. This is like the ‘+’ operator in regular expressions.
optional(rule)
matches rule for zero or one time. This is like the ‘?’ operator in regular expressions.
field(name, rule)
assigns field name name to the child node matched by rule.
alias(rule, alias)
makes nodes matched by rule appear as alias in the syntax tree generated by the parser. For example,
alias(preprocessor_call_exp, call_expression)
makes any node matched by preprocessor_call_exp appear as call_expression.
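To make the analogy with regular expressions concrete, the sketch below translates rules written as Lisp forms into Emacs regexps. The function my-rule-to-regexp and the Lisp notation for rules are hypothetical and not part of Emacs or tree-sitter. It handles only seq, choice, repeat, repeat1, and optional over literal strings; real grammar rules can refer to each other recursively, which regexps cannot express.

```emacs-lisp
(defun my-rule-to-regexp (rule)
  "Return an Emacs regexp string equivalent to RULE.
RULE is a literal string or a list such as (seq RULE...),
(choice RULE...), (repeat RULE), (repeat1 RULE) or (optional RULE)."
  (pcase rule
    ((pred stringp) (regexp-quote rule))
    (`(seq . ,rules) (mapconcat #'my-rule-to-regexp rules ""))
    (`(choice . ,rules)
     (concat "\\(?:" (mapconcat #'my-rule-to-regexp rules "\\|") "\\)"))
    (`(repeat ,r) (concat "\\(?:" (my-rule-to-regexp r) "\\)*"))
    (`(repeat1 ,r) (concat "\\(?:" (my-rule-to-regexp r) "\\)+"))
    (`(optional ,r) (concat "\\(?:" (my-rule-to-regexp r) "\\)?"))
    (_ (error "Unknown rule: %S" rule))))

(defvar my-sample-regexp
  ;; Anchor the translated rule so that it must match the whole string.
  (concat "\\`"
          (my-rule-to-regexp
           '(seq "a" (repeat (choice "b" "c")) (optional "d")))
          "\\'"))

(string-match-p my-sample-regexp "abcbd")   ; => 0
(string-match-p my-sample-regexp "da")      ; => nil
```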
Below are grammar functions of lesser importance for reading a language grammar.
token(rule)
marks rule to produce a single leaf node. That is, instead of generating a parent node with individual child nodes under it, everything is combined into a single leaf node. See Retrieving Nodes.
token.immediate(rule)
Normally, grammar rules ignore preceding whitespace; this changes rule to match only when there is no preceding whitespace.
prec(n, rule)
gives rule the level-n precedence.
prec.left([n,] rule)
marks rule as left-associative, optionally with level n.
prec.right([n,] rule)
marks rule as right-associative, optionally with level n.
prec.dynamic(n, rule)
is like prec, but the precedence is applied at runtime instead.
The documentation of the tree-sitter project has more about writing a grammar. Read especially “The Grammar DSL” section.