It is possible to override the tokenizer in an EBNF-defined parser. Usually the input sequence to be parsed is an array of characters or a string, and terminals in a rule match successive characters in that array or string. For example:
USING: multiline peg.ebnf ;
EBNF: foo [=[
rule = "++" "--"
]=]
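In a listener session the generated foo word can be applied directly to either form of input. This is a usage sketch; the exact printed representation of the AST (a vector versus a literal array) may differ:
"++--" foo .                                ! V{ "++" "--" }
{ CHAR: + CHAR: + CHAR: - CHAR: - } foo .   ! V{ "++" "--" }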
This parser, when run with the string "++--" or the array { CHAR: + CHAR: + CHAR: - CHAR: - }, will succeed with an AST of { "++" "--" }. If you want to add whitespace handling to the grammar, you need to put it between the terminals:
USING: multiline peg.ebnf ;
EBNF: foo [=[
space = (" " | "\r" | "\t" | "\n")
spaces = space* => [[ drop ignore ]]
rule = spaces "++" spaces "--" spaces
]=]
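Because the ignored spaces are dropped from the sequence result, input containing arbitrary whitespace now parses to the same AST. A sketch of a listener session; the printed form may differ:
"  ++  --  " foo .   ! V{ "++" "--" }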
In a large grammar this gets tedious and makes the grammar hard to read. Instead, you can write a rule that splits the input sequence into tokens and have the grammar operate on those tokens. This is how the previous example might look:
USING: multiline peg.ebnf ;
EBNF: foo [=[
space = (" " | "\r" | "\t" | "\n")
spaces = space* => [[ drop ignore ]]
tokenizer = spaces ( "++" | "--" )
rule = "++" "--"
]=]
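With the tokenizer in place, the rule itself no longer mentions whitespace, yet the same kind of input still parses. Again a sketch; the printed form may differ:
"++ --" foo .   ! V{ "++" "--" }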
'tokenizer' is the name of a built-in rule. Once defined, it is called whenever the parser needs the next complete token from the input sequence. So when 'rule' tries to match "++", it calls the tokenizer, which skips spaces until it finds a "++" or "--". It is as if the input sequence for the parser were actually { "++" "--" } instead of the string "++--". With the new tokenizer, "..." terminals in the grammar are matched for equality against the whole token, rather than compared item by item against successive elements of the sequence. This can be used to match an AST produced by a tokenizer.
In the following example I split the tokenizer into a separate parser and use 'foreign' to call it from the main one. This allows the tokenizer to be tested separately:
USING: prettyprint peg peg.ebnf kernel math.parser strings
accessors math arrays multiline ;
IN: scratchpad
TUPLE: ast-number value ;
TUPLE: ast-string value ;
EBNF: foo-tokenizer [=[
space = (" " | "\r" | "\t" | "\n")
spaces = space* => [[ drop ignore ]]
number = [0-9]+ => [[ >string string>number ast-number boa ]]
operator = ("+" | "-")
token = spaces ( number | operator )
tokens = token*
]=]
EBNF: foo [=[
tokenizer = <foreign foo-tokenizer token>
number = . ?[ ast-number? ]? => [[ value>> ]]
string = . ?[ ast-string? ]? => [[ value>> ]]
rule = string:a number:b "+" number:c => [[ a b c + 2array ]]
]=]
"123 456 +" foo-tokenizer .
V{
T{ ast-number { value 123 } }
T{ ast-number { value 456 } }
"+"
}
The '.' EBNF production matches a single object in the source sequence. Usually this is a character; with the replacement tokenizer it is either a number object, a string object, or a string containing the operator. Using a tokenizer in language grammars makes it easier to deal with whitespace. Defining tokenizers this way also has the advantage that the tokenizer and parser work in a single pass: there is no tokenization of the whole string followed by a parse of that result; the input is tokenized as it is needed. You can even switch tokenizers multiple times within a grammar. A rule uses the tokenizer that was defined lexically before it. This is useful in the JavaScript grammar:
USING: multiline peg.ebnf ;
EBNF: javascript [=[
tokenizer = default
nl = "\r" "\n" | "\n"
tokenizer = <foreign tokenize-javascript Tok>
...
End = !(.)
Name = . ?[ ast-name? ]? => [[ value>> ]]
Number = . ?[ ast-number? ]? => [[ value>> ]]
String = . ?[ ast-string? ]? => [[ value>> ]]
RegExp = . ?[ ast-regexp? ]? => [[ value>> ]]
SpacesNoNl = (!(nl) Space)* => [[ ignore ]]
Sc = SpacesNoNl (nl | &("}") | End) | ";"
]=]
Here the rule 'nl' is defined using the default tokenizer of sequential characters ('default' has the special meaning of the built-in tokenizer). The JavaScript tokenizer is then used for the remaining rules. That tokenizer strips out whitespace and newlines, but some rules in the grammar still need to check for a newline, in particular the automatic semicolon insertion rule (handled by the 'Sc' rule here): if there is a newline, the semicolon can be optional in places.
Examples"do" Stmt:s "while" "(" Expr:c ")" Sc => [[ s c ast-do-while boa ]]
Even though the JavaScript tokenizer has removed the newlines, the 'nl' rule can still be used to detect them because it uses the default tokenizer. This allows grammars to mix and match tokenizers as required to make them more readable.