Please note that this article describes an initial implementation of the parser. If you want to browse the code while reading the article, make sure that you have switched to branch dsl_v1.
After a short break I'm returning to my "toy compiler" series. This time I'll describe the workings of the parser - the part of the code which reads the input (usually in tokenized form) and generates an internal representation of the program (in my case an abstract syntax tree - AST).
The goal of my project was to study the compilation step, and the parser was just a necessary evil that I had to deal with. That's why it is written in a pretty primitive manner, without any refinements such as a Pratt parser.
My parser exposes a very simple interface. The Parse function receives the code to be parsed (as a string), a reference to the tokenizer that should be used to read the input, and a reference to the root node of the resulting AST. It returns False if parsing fails, in which case the caller can cast the parser interface to ISimpleDSLErrorInfo to get more information about the error.
type
  ISimpleDSLParser = interface ['{73F3CBB3-3DEF-4573-B079-7EFB00631560}']
    function Parse(const code: string; const tokenizer: ISimpleDSLTokenizer;
      const ast: ISimpleDSLAST): boolean;
  end;
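To give an idea of how this is used, here is a minimal calling sketch. It is only an illustration: the factory functions (CreateSimpleDSLParser, CreateSimpleDSLTokenizer, CreateSimpleDSLAST) and the exact member of ISimpleDSLErrorInfo are assumptions, not copied from the project.

uses
  System.SysUtils; // for Supports

procedure CompileSource(const source: string);
var
  parser   : ISimpleDSLParser;
  errorInfo: ISimpleDSLErrorInfo;
begin
  parser := CreateSimpleDSLParser; // assumed factory function
  if not parser.Parse(source, CreateSimpleDSLTokenizer, CreateSimpleDSLAST) then begin
    // on failure, the same parser object can be queried for error details
    if Supports(parser, ISimpleDSLErrorInfo, errorInfo) then
      Writeln(errorInfo.ErrorInfo); // assumed member name
  end;
end;

The implementation of Parse itself is short - it stores the parameters into fields, resets the lookahead buffer, initializes the tokenizer and then parses one function after another until the end of the input.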
function TSimpleDSLParser.Parse(const code: string; const tokenizer: ISimpleDSLTokenizer;
  const ast: ISimpleDSLAST): boolean;
begin
  Result := false;
  FTokenizer := tokenizer;
  FAST := ast;
  FLookaheadIdent := #0;

  tokenizer.Initialize(code);
  while not tokenizer.IsAtEnd do
    if not ParseFunction then
      Exit;

  Result := true;
end;
ParseFunction first grabs an identifier (the function name). Then it creates an entry in the global function table, because we may need it in some other part of the parser if the function calls itself recursively. For the same reason it will also store away the interface of the function's AST node in FContext.CurrentFunc.
Next it will check for the tkLeftParen token. If it is present, it will enter a loop which looks for either an identifier (parameter name) or right parenthesis (end of parameter list) and store all parameter names into the AST (func.ParamNames.Add).
If everything is fine, it will call ParseBlock to parse the function body and store its output into the AST (func.Body := block).
function TSimpleDSLParser.ParseFunction: boolean;
var
  block   : IASTBlock;
  expected: TTokenKinds;
  func    : IASTFunction;
  funcName: string;
  ident   : string;
  token   : TTokenKind;
begin
  Result := false;

  /// function = identifier "(" [ identifier { "," identifier } ] ")" block

  // function name
  if not FetchToken([tkIdent], funcName, token) then
    Exit(token = tkEOF);

  func := AST.CreateFunction;
  func.Name := funcName;

  // we might need this function in the global table for recursive calls
  AST.Functions.Add(func);

  FContext.CurrentFunc := func;
  try
    // "("
    if not FetchToken([tkLeftParen]) then
      Exit;

    // parameter list, including ")"
    expected := [tkIdent, tkRightParen];
    repeat
      if not FetchToken(expected, ident, token) then
        Exit;
      if token = tkRightParen then
        break //repeat
      else if token = tkIdent then begin
        func.ParamNames.Add(ident);
        expected := expected - [tkIdent] + [tkComma, tkRightParen];
      end
      else if token = tkComma then
        expected := expected + [tkIdent] - [tkComma, tkRightParen]
      else begin
        LastError := 'Internal error in ParseFunction';
        Exit;
      end;
    until false;

    // function body
    if not ParseBlock(block) then
      Exit;

    func.Body := block;
    Result := true;
  finally
    FContext.CurrentFunc := nil;
  end;
end;
Tokens are always fetched with the FetchToken function, which accepts a list of valid tokens, skips all white space (the language is designed so that all white space can be ignored) and returns the token/identifier pair it found (or sets the error information, including the code location, if a wrong token is encountered).
function TSimpleDSLParser.FetchToken(allowed: TTokenKinds; var ident: string;
var token: TTokenKind): boolean;
var
loc: TPoint;
begin
Result := false;
while GetToken(token, ident) do
if token in allowed then
Exit(true)
else if token = tkWhitespace then
// do nothing
else begin
loc := FTokenizer.CurrentLocation;
LastError := Format('Invalid syntax in line %d, character %d', [loc.X, loc.Y]);
Exit;
end;
end;
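ParseFunction also calls FetchToken with just the list of allowed tokens (FetchToken([tkLeftParen])). This is presumably a convenience overload that throws the fetched token and identifier away; a minimal sketch of such an overload:

function TSimpleDSLParser.FetchToken(allowed: TTokenKinds): boolean;
var
  ident: string;
  token: TTokenKind;
begin
  // delegate to the three-parameter version and ignore what was read
  Result := FetchToken(allowed, ident, token);
end;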
Similarly to the tokenizer, the parser uses a 'push back' approach. When it finds out that it has read one token too many, it can push that token back by calling PushBack. The GetToken function first looks into this one-token buffer and calls tokenizer.GetToken only if the buffer is empty.
procedure TSimpleDSLParser.PushBack(token: TTokenKind; const ident: string);
begin
  Assert(FLookaheadIdent = #0, 'TSimpleDSLParser: Lookahead buffer is not empty');
  FLookaheadToken := token;
  FLookaheadIdent := ident;
end;
function TSimpleDSLParser.GetToken(var token: TTokenKind; var ident: string): boolean;
begin
  if FLookaheadIdent <> #0 then begin
    token := FLookaheadToken;
    ident := FLookaheadIdent;
    FLookaheadIdent := #0;
    Result := true;
  end
  else
    Result := FTokenizer.GetToken(token, ident);
end;
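To show how the two work together, here is a sketch of a one-token lookahead helper built on top of GetToken and PushBack. PeekToken is a hypothetical name, not a method from the project; it only demonstrates the pattern of reading a token, inspecting it, and returning it to the buffer so that the next GetToken or FetchToken call sees it again.

function TSimpleDSLParser.PeekToken(var token: TTokenKind): boolean;
var
  ident: string;
begin
  // read one token ahead ...
  Result := GetToken(token, ident);
  // ... and immediately push it back so it is not lost
  if Result then
    PushBack(token, ident);
end;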