Please note that this article describes an initial implementation of the tokenizer. If you want to browse the code while reading the article, make sure that you have switched to branch dsl_v1.
With this article I'm moving into the gritty part of the project - the code which reads the source code and turns it into a beautiful abstract syntax tree. In other words, I'll be talking about the parser.
I must admit that I spent as little time on the parser as I could. After all, my main goal was to convert the AST into executable code, not to parse input. Still, one cannot write a compiler without writing a parser, so ... here I am.
If you want to do anything with any data, you have to a) know in what format it is written and b) write some code that reads the input, decodes the format, and creates internal memory structures filled with data. As the first task of any good programmer is to create a library that will do the hard work for them ;) it was not long before such tools also appeared in the area of compiler writing.
Programmers also understood very early on that parsing is actually a two-part process. In the first step you split the data into lexical tokens. Each token represents the smallest part of the input that has a specific meaning. For example, if we split the Pascal statement a := func(i+42); into separate tokens, we get: identifier:a, whitespace, becomes (the ":=" operator), whitespace, identifier:func, left-parent, identifier:i, plus, number:42, right-parent, semicolon. Some characters are mapped to their own tokens (for example, ";" becomes a semicolon) and some are grouped into appropriate units ("42" becomes a number with value 42).
The second part - a parser - uses the sequence of tokens produced by the tokenizer and tries to understand its semantics - that is, how the tokens fit together into a formal language specification. That will be the topic of the next article.
Because of this separation, the supporting tools are also split into two areas - one to support the creation of tokenizers and another of parsers. In the Unix world, for example, we have Lex for lexical analysis, a.k.a. tokenization, and Yacc (Yet Another Compiler Compiler) for doing semantic analysis. In the Pascal world we also have our share of compiler-generating tools. There's a pretty decent port of these tools to Delphi, Plex+Pyacc for FreePascal, the quite old and unsupported (but free) Delphi Compiler Generator, and probably some other tools which I failed to find with my pretty superficial search.
Let's get back to the topic ...
In my case, the language is extremely simple and so is the tokenizer. Instead of using some specialized tool I just went ahead and coded the whole thing. As you'll see, it is pretty simple.
Tokens
Let's take a look at a demo program from part 1:

fib(i) {
  if i < 3 {
    return 1
  } else {
    return fib(i-2) + fib(i-1)
  }
}
From this example we can guess almost all tokens used in The Language: identifier ("fib", "i", "if", "return" ...), number ("1", "2", "3"), whitespace, left parenthesis ("("), right parenthesis (")"), left curly bracket ("{"), right curly bracket ("}"), less than ("<"), plus ("+") and minus ("-").
There are only two tokens this example fails to cover - a comma (",") and a semicolon (";"). The former is used to separate parameters in function definitions and in function calls and the latter is used to separate statements but doesn't appear in the sample code as it is optional immediately before the right curly bracket.
The following type from unit SimpleDSL.Compiler.Tokenizer enumerates all possibilities. In addition to the already mentioned token types, tkUnknown is used to represent unexpected input and tkEOF is used to signal that the end of the input stream was reached.
type
  TTokenKind = (tkUnknown, tkWhitespace, tkIdent, tkNumber,
    tkLeftParen, tkRightParen, tkLeftCurly, tkRightCurly,
    tkLessThan, tkPlus, tkMinus, tkComma, tkSemicolon, tkEOF);
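For debugging, and for the driver example shown later, it is handy to print token kinds by name. Here is a small hypothetical helper - TokenKindToStr is my name and is not part of the original unit:

function TokenKindToStr(kind: TTokenKind): string;
const
  // one readable name per TTokenKind value, in declaration order
  CTokenNames: array [TTokenKind] of string = (
    'unknown', 'whitespace', 'ident', 'number',
    'left-paren', 'right-paren', 'left-curly', 'right-curly',
    'less-than', 'plus', 'minus', 'comma', 'semicolon', 'EOF');
begin
  Result := CTokenNames[kind];
end;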
Interface
The tokenizer is accessed through a very simple interface:
ISimpleDSLTokenizer = interface
  function  CurrentLocation: TPoint;
  function  GetToken(var kind: TTokenKind; var identifier: string): boolean;
  procedure Initialize(const code: string);
  function  IsAtEnd: boolean;
end;
The most important function here is GetToken. It returns the next token from the source stream - kind contains the token kind and identifier contains the sequence of characters representing the token. The function returns False when the end of the stream is reached.
The other functions are all just simple helpers. IsAtEnd returns True when the end of the stream is reached, CurrentLocation returns the current line and column (useful for reporting errors), and Initialize initializes the tokenizer.
procedure TSimpleDSLTokenizer.Initialize(const code: string);
begin
  FProgram.Text := code;
  FNextLine := 0;
  FNextChar := 1;
  FLookahead := #0;
  FLastLine := FProgram.Count - 1;
  if FLastLine >= 0 then begin
    FLastLineLen := Length(FProgram[FLastLine]);
    FCurrentLine := FProgram[FNextLine];
  end;
end;
A program is internally stored in a string list FProgram. Other variables track the current location and are mostly used in the method GetChar.
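Before diving into the reading machinery, here is a minimal sketch of how a caller could drive this interface. DumpTokens is a hypothetical test procedure and CreateSimpleDSLTokenizer is a hypothetical factory function returning some ISimpleDSLTokenizer implementation - check the repository for the actual creation code:

procedure DumpTokens(const code: string);
var
  tokenizer : ISimpleDSLTokenizer;
  kind      : TTokenKind;
  ident     : string;
begin
  tokenizer := CreateSimpleDSLTokenizer; // hypothetical factory
  tokenizer.Initialize(code);
  // GetToken returns False when the end of the input is reached
  while tokenizer.GetToken(kind, ident) do
    Writeln(TokenKindToStr(kind), ': ', ident);
end;

// DumpTokens('fib(i-1)') would print: ident: fib, left-paren:,
// ident: i, minus:, number: 1, right-paren:
// (identifier stays empty for one-character tokens)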
Reading input
Let's take a look at the GetToken method.
function TSimpleDSLTokenizer.GetToken(var kind: TTokenKind; var identifier: string): boolean;
var
  ch: char;
begin
  identifier := '';
  Result := GetChar(ch);
  if not Result then begin
    kind := tkEOF;
    Exit;
  end;
  case ch of
    '(': kind := tkLeftParen;
    ')': kind := tkRightParen;
    '{': kind := tkLeftCurly;
    '}': kind := tkRightCurly;
    '+': kind := tkPlus;
    '-': kind := tkMinus;
    '<': kind := tkLessThan;
    ',': kind := tkComma;
    ';': kind := tkSemicolon;
    else if ch.IsLetter then begin
      kind := tkIdent;
      identifier := ch + GetIdent;
    end
    else if CharInSet(ch, ['0'..'9']) then begin
      kind := tkNumber;
      identifier := ch + GetNumber;
    end
    else if ch.IsWhiteSpace then begin
      kind := tkWhitespace;
      SkipWhitespace;
    end
    else
      kind := tkUnknown;
  end;
end;
First it reads the next character from the stream (via GetChar) and exits if there is no next character. Then it handles all one-character tokens internally. Finally, it hands the specific jobs of reading identifiers, numbers, and whitespace over to GetIdent, GetNumber, and SkipWhitespace, respectively.
GetIdent and GetNumber are very similar, so I'll only focus on one of them.
function TSimpleDSLTokenizer.GetIdent: string;
var
  ch: char;
begin
  Result := '';
  while GetChar(ch) do begin
    if ch.IsLetter or ch.IsNumber or (ch = '_') then
      Result := Result + ch
    else begin
      PushBack(ch);
      Exit;
    end;
  end;
end;
As GetToken has already read the first character of the identifier, GetIdent collects all following characters that are either a letter, a number, or an underscore. (And as you can see, the code is using the character class helpers IsLetter and IsNumber, so identifiers are actually Unicode-aware.)
When a non-identifier character appears, the simplest solution is just to say "Oops, I've read too far - please push this last character back into the stream."
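For completeness, here is what GetNumber might look like if it follows the same pattern as GetIdent (a sketch; the actual implementation is in the repository and may differ in details):

function TSimpleDSLTokenizer.GetNumber: string;
var
  ch: char;
begin
  Result := '';
  while GetChar(ch) do begin
    if CharInSet(ch, ['0'..'9']) then
      Result := Result + ch
    else begin
      // we've read one character too far - return it to the stream
      PushBack(ch);
      Exit;
    end;
  end;
end;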
One-character buffer
As we only ever read one character too far, this "please undo the last GetChar" operation is handled with a simple one-character buffer, used in PushBack and GetChar.
procedure TSimpleDSLTokenizer.PushBack(ch: char);
begin
  Assert(FLookahead = #0, 'TSimpleDSLTokenizer: Lookahead buffer is not empty');
  FLookahead := ch;
end;
function TSimpleDSLTokenizer.GetChar(var ch: char): boolean;
begin
  if FLookahead <> #0 then begin
    ch := FLookahead;
    FLookahead := #0;
    Result := true;
  end
  else begin
    Result := not IsAtEnd;
    if Result then begin
      ch := FCurrentLine[FNextChar];
      Inc(FNextChar);
      if FNextChar > Length(FCurrentLine) then begin
        Inc(FNextLine);
        if FNextLine < FProgram.Count then
          FCurrentLine := FProgram[FNextLine];
        FNextChar := 1;
      end;
    end;
  end;
end;
PushBack just stores the character in the FLookahead buffer (and asserts if the buffer is not empty, which can only happen if there's a bug in the tokenizer). GetChar has a bit more work to do, as in addition to handling the FLookahead buffer it must also handle the 'end of line of text' condition.
This 'push back' approach is also used in GetNumber and SkipWhitespace (for details see the code) and in the parser, as we'll see very soon.
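A sketch of SkipWhitespace, assuming it follows the same pattern as GetIdent (the repository contains the real code, which may differ):

procedure TSimpleDSLTokenizer.SkipWhitespace;
var
  ch: char;
begin
  while GetChar(ch) do
    if not ch.IsWhiteSpace then begin
      // the first non-whitespace character starts the next token
      PushBack(ch);
      Exit;
    end;
end;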
Comments
Nice article. Just some notes from my end:
Your identifiers cannot start with an underscore. Is this intended?
Your tokenizer actually treats whitespace as a token. I'd suggest, unless your language is whitespace-sensitive (sorry if I missed that somewhere), simply skipping whitespace until the next printable character comes along. Otherwise whitespace handling is transferred to the parser.
You fetch characters one by one. I'd suggest first determining the range of a token and then copying the range from the stream. Otherwise, when an identifier has a length of 12, you (re)allocate 12 strings. Or, if Delphi is smart, it just reallocates the same one over and over again. (See the sketch at the end of the comments.)
Underscores: intentional.
Whitespace: I agree. It just didn't cross my mind.
Fetch one by one: I was just lazy.
PRs are welcome :)
Funny thing is, by using IsLetter, your parser might be more "accurate" than the Delphi one. The Delphi compiler accepts anything which is > 255 and fits into UCS2 as a valid letter (which is something I preferred to do in my parsers :P). Left-Right Override, hearts, and other funny stuff in identifiers :D
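To illustrate the range-copying suggestion from the comments, here is a hypothetical sketch. GetIdentByRange is not part of the repository code, and it is simplified - it ignores the lookahead buffer and assumes the token does not span lines:

function TSimpleDSLTokenizer.GetIdentByRange: string;
var
  startChar: integer;
begin
  startChar := FNextChar;
  // scan to the end of the identifier without building a string
  while (FNextChar <= Length(FCurrentLine))
        and (FCurrentLine[FNextChar].IsLetter
             or FCurrentLine[FNextChar].IsNumber
             or (FCurrentLine[FNextChar] = '_')) do
    Inc(FNextChar);
  // one Copy call instead of one string reallocation per character
  Result := Copy(FCurrentLine, startChar, FNextChar - startChar);
end;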