Adding books
@@ -0,0 +1,54 @@
|
||||
TUTOR.ZIP
|
||||
|
||||
This file contains all of the installments of Jack Crenshaw's
|
||||
tutorial on compiler construction, including the new Installment 15.
|
||||
The intended audience is those folks who are not computer scientists,
|
||||
but who enjoy computing and have always wanted to know how compilers
|
||||
work. A lot of compiler theory has been left out, but the practical
|
||||
issues are covered. By the time you have completed the series, you
|
||||
should be able to design and build your own working compiler. It will
|
||||
not be the world's best, nor will it put out incredibly tight code.
|
||||
Your product will probably never put Borland or Microsoft out of
|
||||
business. But it will work, and it will be yours.
|
||||
|
||||
A word about the file format: The files were originally created using
|
||||
Borland's DOS editor, Sprint. Sprint could write to a text file only
|
||||
if you formatted the file to go to the selected printer. I used the
|
||||
most common printer I could think of, the Epson MX-80, but even then
|
||||
the files ended up with printer control sequences at the beginning
|
||||
and end of each page.
|
||||
|
||||
To bring the files up to date and get myself positioned to continue
|
||||
the series, I recently (1994) converted all the files to work with
|
||||
Microsoft Word for Windows. Unlike Sprint, Word allows you to write
|
||||
the file as a DOS text file. Unfortunately, this gave me a new
|
||||
problem, because when Word is writing to a text file, it doesn't
|
||||
write hard page breaks or page numbers. In other words, in six years
|
||||
we've gone from a file with page breaks and page numbers, but
|
||||
embedded escape sequences, to files with no embedded escape sequences
|
||||
but no page breaks or page numbers. Isn't progress wonderful?
|
||||
|
||||
Of course, it's possible for me to insert the page numbers as
|
||||
straight text, rather than asking the editor to do it for me. But
|
||||
since Word won't allow me to write page breaks to the file, we would
|
||||
end up with files with page numbers that may or may not fall at the
|
||||
ends of the pages, depending on your editor and your printer. It
|
||||
seems to me that almost every file I've ever downloaded from
|
||||
CompuServe or BBS's that had such page numbering was incompatible
|
||||
with my printer, and gave me pages that were one line short or one
|
||||
line long, with the page numbers consequently walking up the page.
|
||||
|
||||
So perhaps this new format is, after all, the safest one for general
|
||||
distribution. The files as they exist will look just fine if read
|
||||
into any text editor capable of reading DOS text files. Since most
|
||||
editors these days include rather sophisticated word processing
|
||||
capabilities, you should be able to get your editor to paginate for
|
||||
you, prior to printing.
|
||||
|
||||
I hope you like the tutorials. Much thought went into them.
|
||||
|
||||
|
||||
Jack W. Crenshaw
|
||||
|
||||
CompuServe 72325,1327
|
||||
|
||||
@@ -0,0 +1,398 @@
LET'S BUILD A COMPILER!
|
||||
|
||||
By
|
||||
|
||||
Jack W. Crenshaw, Ph.D.
|
||||
|
||||
24 July 1988
|
||||
|
||||
|
||||
Part I: INTRODUCTION
|
||||
|
||||
|
||||
*****************************************************************
|
||||
* *
|
||||
* COPYRIGHT NOTICE *
|
||||
* *
|
||||
* Copyright (C) 1988 Jack W. Crenshaw. All rights reserved. *
|
||||
* *
|
||||
*****************************************************************
|
||||
|
||||
|
||||
INTRODUCTION
|
||||
|
||||
|
||||
This series of articles is a tutorial on the theory and practice
|
||||
of developing language parsers and compilers. Before we are
|
||||
finished, we will have covered every aspect of compiler
|
||||
construction, designed a new programming language, and built a
|
||||
working compiler.
|
||||
|
||||
Though I am not a computer scientist by education (my Ph.D. is in
|
||||
a different field, Physics), I have been interested in compilers
|
||||
for many years. I have bought and tried to digest the contents
|
||||
of virtually every book on the subject ever written. I don't
|
||||
mind telling you that it was slow going. Compiler texts are
|
||||
written for Computer Science majors, and are tough sledding for
|
||||
the rest of us. But over the years a bit of it began to seep in.
|
||||
What really made it jell was when I branched off on my own and
began to try things on my own computer. Now I plan to
|
||||
share with you what I have learned. At the end of this series
|
||||
you will by no means be a computer scientist, nor will you know
|
||||
all the esoterics of compiler theory. I intend to completely
|
||||
ignore the more theoretical aspects of the subject. What you
|
||||
_WILL_ know is all the practical aspects that one needs to know
|
||||
to build a working system.
|
||||
|
||||
This is a "learn-by-doing" series. In the course of the series I
|
||||
will be performing experiments on a computer. You will be
|
||||
expected to follow along, repeating the experiments that I do,
|
||||
and performing some on your own. I will be using Turbo Pascal
|
||||
4.0 on a PC clone. I will periodically insert examples written
|
||||
in TP. These will be executable code, which you will be expected
|
||||
to copy into your own computer and run. If you don't have a copy
|
||||
of Turbo, you will be severely limited in how well you will be
|
||||
able to follow what's going on. If you don't have a copy, I urge
|
||||
you to get one. After all, it's an excellent product, good for
|
||||
many other uses!
|
||||
|
||||
Some articles on compilers show you examples, or show you (as in
|
||||
the case of Small-C) a finished product, which you can then copy
|
||||
and use without a whole lot of understanding of how it works. I
|
||||
hope to do much more than that. I hope to teach you HOW the
|
||||
things get done, so that you can go off on your own and not only
|
||||
reproduce what I have done, but improve on it.
|
||||
|
||||
This is admittedly an ambitious undertaking, and it won't be done
|
||||
in one page. I expect to do it in the course of a number of
|
||||
articles. Each article will cover a single aspect of compiler
|
||||
theory, and will pretty much stand alone. If all you're
|
||||
interested in at a given time is one aspect, then you need to
|
||||
look only at that one article. Each article will be uploaded as
|
||||
it is complete, so you will have to wait for the last one before
|
||||
you can consider yourself finished. Please be patient.
|
||||
|
||||
|
||||
|
||||
The average text on compiler theory covers a lot of ground that
|
||||
we won't be covering here. The typical sequence is:
|
||||
|
||||
o An introductory chapter describing what a compiler is.
|
||||
|
||||
o A chapter or two on syntax equations, using Backus-Naur Form
|
||||
(BNF).
|
||||
|
||||
o A chapter or two on lexical scanning, with emphasis on
|
||||
deterministic and non-deterministic finite automata.
|
||||
|
||||
o Several chapters on parsing theory, beginning with top-down
|
||||
recursive descent, and ending with LALR parsers.
|
||||
|
||||
o A chapter on intermediate languages, with emphasis on P-code
|
||||
and similar reverse polish representations.
|
||||
|
||||
o Many chapters on alternative ways to handle subroutines and
|
||||
parameter passing, type declarations, and such.
|
||||
|
||||
o A chapter toward the end on code generation, usually for some
|
||||
imaginary CPU with a simple instruction set. Most readers
|
||||
(and in fact, most college classes) never make it this far.
|
||||
|
||||
o A final chapter or two on optimization. This chapter often
|
||||
goes unread, too.
|
||||
|
||||
|
||||
I'll be taking a much different approach in this series. To
|
||||
begin with, I won't dwell long on options. I'll be giving you
|
||||
_A_ way that works. If you want to explore options, well and
|
||||
good ... I encourage you to do so ... but I'll be sticking to
|
||||
what I know. I also will skip over most of the theory that puts
|
||||
people to sleep. Don't get me wrong: I don't belittle the
|
||||
theory, and it's vitally important when it comes to dealing with
|
||||
the more tricky parts of a given language. But I believe in
|
||||
putting first things first. Here we'll be dealing with the 95%
|
||||
of compiler techniques that don't need a lot of theory to handle.
|
||||
|
||||
I also will discuss only one approach to parsing: top-down,
|
||||
recursive descent parsing, which is the _ONLY_ technique that's
|
||||
at all amenable to hand-crafting a compiler. The other
|
||||
approaches are only useful if you have a tool like YACC, and also
|
||||
don't care how much memory space the final product uses.
|
||||
|
||||
I also take a page from the work of Ron Cain, the author of the
|
||||
original Small C. Whereas almost all other compiler authors have
|
||||
historically used an intermediate language like P-code and
|
||||
divided the compiler into two parts (a front end that produces
|
||||
P-code, and a back end that processes P-code to produce
|
||||
executable object code), Ron showed us that it is a
|
||||
straightforward matter to make a compiler directly produce
|
||||
executable object code, in the form of assembler language
|
||||
statements. The code will _NOT_ be the world's tightest code ...
|
||||
producing optimized code is a much more difficult job. But it
|
||||
will work, and work reasonably well. Just so that I don't leave
|
||||
you with the impression that our end product will be worthless, I
|
||||
_DO_ intend to show you how to "soup up" the compiler with some
|
||||
optimization.
|
||||
|
||||
|
||||
|
||||
Finally, I'll be using some tricks that I've found to be most
|
||||
helpful in letting me understand what's going on without wading
|
||||
through a lot of boiler plate. Chief among these is the use of
|
||||
single-character tokens, with no embedded spaces, for the early
|
||||
design work. I figure that if I can get a parser to recognize
|
||||
and deal with I-T-L, I can get it to do the same with IF-THEN-
|
||||
ELSE. And I can. In the second "lesson," I'll show you just
|
||||
how easy it is to extend a simple parser to handle tokens of
|
||||
arbitrary length. As another trick, I completely ignore file
|
||||
I/O, figuring that if I can read source from the keyboard and
|
||||
output object to the screen, I can also do it from/to disk files.
|
||||
Experience has proven that once a translator is working
|
||||
correctly, it's a straightforward matter to redirect the I/O to
|
||||
files. The last trick is that I make no attempt to do error
|
||||
correction/recovery. The programs we'll be building will
|
||||
RECOGNIZE errors, and will not CRASH, but they will simply stop
|
||||
on the first error ... just like good ol' Turbo does. There will
|
||||
be other tricks that you'll see as you go. Most of them can't be
|
||||
found in any compiler textbook, but they work.
|
||||
|
||||
A word about style and efficiency. As you will see, I tend to
|
||||
write programs in _VERY_ small, easily understood pieces. None
|
||||
of the procedures we'll be working with will be more than about
|
||||
15-20 lines long. I'm a fervent devotee of the KISS (Keep It
|
||||
Simple, Sidney) school of software development. I try to never
|
||||
do something tricky or complex, when something simple will do.
|
||||
Inefficient? Perhaps, but you'll like the results. As Brian
|
||||
Kernighan has said, FIRST make it run, THEN make it run fast.
|
||||
If, later on, you want to go back and tighten up the code in one
|
||||
of our products, you'll be able to do so, since the code will be
|
||||
quite understandable. If you do so, however, I urge you to wait
|
||||
until the program is doing everything you want it to.
|
||||
|
||||
I also have a tendency to delay building a module until I
|
||||
discover that I need it. Trying to anticipate every possible
|
||||
future contingency can drive you crazy, and you'll generally
|
||||
guess wrong anyway. In this modern day of screen editors and
|
||||
fast compilers, I don't hesitate to change a module when I feel I
|
||||
need a more powerful one. Until then, I'll write only what I
|
||||
need.
|
||||
|
||||
One final caveat: One of the principles we'll be sticking to here
|
||||
is that we don't fool around with P-code or imaginary CPUs, but
|
||||
that we will start out on day one producing working, executable
|
||||
object code, at least in the form of assembler language source.
|
||||
However, you may not like my choice of assembler language ...
|
||||
it's 68000 code, which is what works on my system (under SK*DOS).
|
||||
I think you'll find, though, that the translation to any other
|
||||
CPU such as the 80x86 will be quite obvious, so I don't
|
||||
see a problem here. In fact, I hope someone out there who knows
|
||||
the '86 language better than I do will offer us the equivalent
|
||||
object code fragments as we need them.
|
||||
|
||||
|
||||
THE CRADLE
|
||||
|
||||
Every program needs some boiler plate ... I/O routines, error
|
||||
message routines, etc. The programs we develop here will be no
|
||||
exceptions. I've tried to hold this stuff to an absolute
|
||||
minimum, however, so that we can concentrate on the important
|
||||
stuff without losing it among the trees. The code given below
|
||||
represents about the minimum that we need to get anything done.
|
||||
It consists of some I/O routines, an error-handling routine and a
|
||||
skeleton, null main program. I call it our cradle. As we
|
||||
develop other routines, we'll add them to the cradle, and add the
|
||||
calls to them as we need to. Make a copy of the cradle and save
|
||||
it, because we'll be using it more than once.
|
||||
|
||||
There are many different ways to organize the scanning activities
|
||||
of a parser. In Unix systems, authors tend to use getc and
|
||||
ungetc. I've had very good luck with the approach shown here,
|
||||
which is to use a single, global, lookahead character. Part of
|
||||
the initialization procedure (the only part, so far!) serves to
|
||||
"prime the pump" by reading the first character from the input
|
||||
stream. No other special techniques are required with Turbo 4.0
|
||||
... each successive call to GetChar will read the next character
|
||||
in the stream.
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
program Cradle;

{--------------------------------------------------------------}
{ Constant Declarations }

const TAB = ^I;

{--------------------------------------------------------------}
{ Variable Declarations }

var Look: char;              { Lookahead Character }

{--------------------------------------------------------------}
{ Read New Character From Input Stream }

procedure GetChar;
begin
   Read(Look);
end;

{--------------------------------------------------------------}
{ Report an Error }

procedure Error(s: string);
begin
   WriteLn;
   WriteLn(^G, 'Error: ', s, '.');
end;

{--------------------------------------------------------------}
{ Report Error and Halt }

procedure Abort(s: string);
begin
   Error(s);
   Halt;
end;

{--------------------------------------------------------------}
{ Report What Was Expected }

procedure Expected(s: string);
begin
   Abort(s + ' Expected');
end;

{--------------------------------------------------------------}
{ Match a Specific Input Character }

procedure Match(x: char);
begin
   if Look = x then GetChar
   else Expected('''' + x + '''');
end;

{--------------------------------------------------------------}
{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;
begin
   IsAlpha := upcase(c) in ['A'..'Z'];
end;

{--------------------------------------------------------------}
{ Recognize a Decimal Digit }

function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;

{--------------------------------------------------------------}
{ Get an Identifier }

function GetName: char;
begin
   if not IsAlpha(Look) then Expected('Name');
   GetName := UpCase(Look);
   GetChar;
end;

{--------------------------------------------------------------}
{ Get a Number }

function GetNum: char;
begin
   if not IsDigit(Look) then Expected('Integer');
   GetNum := Look;
   GetChar;
end;

{--------------------------------------------------------------}
{ Output a String with Tab }

procedure Emit(s: string);
begin
   Write(TAB, s);
end;

{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }

procedure EmitLn(s: string);
begin
   Emit(s);
   WriteLn;
end;

{--------------------------------------------------------------}
{ Initialize }

procedure Init;
begin
   GetChar;
end;

{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
end.
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
That's it for this introduction. Copy the code above into TP and
|
||||
compile it. Make sure that it compiles and runs correctly. Then
|
||||
proceed to the first lesson, which is on expression parsing.
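
As written, the cradle reads one character and exits, so a
successful run shows nothing on the screen. One way to see it do
something is the sketch below: a trimmed copy of the cradle with
a main program that calls GetName and echoes the result. The
program name CradleTest, the merged Expected routine and the
'Name: ' label are assumptions made for this illustration; they
are not part of the original cradle.

```pascal
program CradleTest;
{ Hypothetical smoke test: a trimmed copy of the cradle plus a
  main program that exercises GetName and EmitLn. }

const TAB = ^I;

var Look: char;              { Lookahead Character }

{ Read New Character From Input Stream }
procedure GetChar;
begin
   Read(Look);
end;

{ Report What Was Expected and Halt (Error, Abort and Expected
  from the cradle, merged here only to keep the sketch short) }
procedure Expected(s: string);
begin
   WriteLn;
   WriteLn(^G, 'Error: ', s, ' Expected.');
   Halt;
end;

{ Recognize an Alpha Character }
function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;

{ Get an Identifier (a single letter) }
function GetName: char;
begin
   if not IsAlpha(Look) then Expected('Name');
   GetName := UpCase(Look);
   GetChar;
end;

{ Output a String with Tab and CRLF }
procedure EmitLn(s: string);
begin
   WriteLn(TAB, s);
end;

begin
   GetChar;                      { prime the pump }
   EmitLn('Name: ' + GetName);   { echo the letter in upper case }
end.
```

Type a single letter and press Enter; the letter should come back
in upper case after the label. Typing a digit instead should
produce the "Name Expected" message and stop the program.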
|
||||
|
||||
|
||||
*****************************************************************
|
||||
* *
|
||||
* COPYRIGHT NOTICE *
|
||||
* *
|
||||
* Copyright (C) 1988 Jack W. Crenshaw. All rights reserved. *
|
||||
* *
|
||||
*****************************************************************
@@ -0,0 +1,801 @@
LET'S BUILD A COMPILER!
|
||||
|
||||
By
|
||||
|
||||
Jack W. Crenshaw, Ph.D.
|
||||
|
||||
5 June 1989
|
||||
|
||||
|
||||
Part XII: MISCELLANY
|
||||
|
||||
|
||||
*****************************************************************
|
||||
* *
|
||||
* COPYRIGHT NOTICE *
|
||||
* *
|
||||
* Copyright (C) 1989 Jack W. Crenshaw. All rights reserved. *
|
||||
* *
|
||||
*****************************************************************
|
||||
|
||||
|
||||
INTRODUCTION
|
||||
|
||||
This installment is another one of those excursions into side
|
||||
alleys that don't seem to fit into the mainstream of this
|
||||
tutorial series. As I mentioned last time, it was while I was
|
||||
writing this installment that I realized some changes had to be
|
||||
made to the compiler structure. So I had to digress from this
|
||||
digression long enough to develop the new structure and show it
|
||||
to you.
|
||||
|
||||
Now that that's behind us, I can tell you what I set out to in
|
||||
the first place. This shouldn't take long, and then we can get
|
||||
back into the mainstream.
|
||||
|
||||
Several people have asked me about things that other languages
|
||||
provide, but that I haven't yet addressed in this series. The two
|
||||
biggies are semicolons and comments. Perhaps you've wondered
|
||||
about them, too, and wondered how things would change if we had
|
||||
to deal with them. Just so you can proceed with what's to come,
|
||||
without being bothered by that nagging feeling that something is
|
||||
missing, we'll address such issues here.
|
||||
|
||||
|
||||
SEMICOLONS
|
||||
|
||||
Ever since the introduction of Algol, semicolons have been a part
|
||||
of almost every modern language. We've all used them to the
|
||||
point that they are taken for granted. Yet I suspect that more
|
||||
compilation errors have occurred due to misplaced or missing
|
||||
semicolons than any other single cause. And if we had a penny
|
||||
for every extra keystroke programmers have used to type the
|
||||
little rascals, we could pay off the national debt.
|
||||
|
||||
Having been brought up with FORTRAN, it took me a long time to
|
||||
get used to using semicolons, and to tell the truth I've never
|
||||
quite understood why they were necessary. Since I program in
|
||||
Pascal, and since the use of semicolons in Pascal is particularly
|
||||
tricky, that one little character is still by far my biggest
|
||||
source of errors.
|
||||
|
||||
When I began developing KISS, I resolved to question EVERY
|
||||
construct in other languages, and to try to avoid the most common
|
||||
problems that occur with them. That puts the semicolon very high
|
||||
on my hit list.
|
||||
|
||||
To understand the role of the semicolon, you have to look at a
|
||||
little history.
|
||||
|
||||
Early programming languages were line-oriented. In FORTRAN, for
|
||||
example, various parts of the statement had specific columns or
|
||||
fields that they had to appear in. Since some statements were
|
||||
too long for one line, the "continuation card" mechanism was
|
||||
provided to let the compiler know that a given card was still
|
||||
part of the previous line. The mechanism survives to this day,
|
||||
even though punched cards are now things of the distant past.
|
||||
|
||||
When other languages came along, they also adopted various
|
||||
mechanisms for dealing with multiple-line statements. BASIC is a
|
||||
good example. It's important to recognize, though, that the
|
||||
FORTRAN mechanism was not so much required by the line
|
||||
orientation of that language, as by the column-orientation. In
|
||||
those versions of FORTRAN where free-form input is permitted,
|
||||
it's no longer needed.
|
||||
|
||||
When the fathers of Algol introduced that language, they wanted
|
||||
to get away from line-oriented programs like FORTRAN and BASIC,
|
||||
and allow for free-form input. This included the possibility of
|
||||
stringing multiple statements on a single line, as in
|
||||
|
||||
|
||||
a=b; c=d; e=e+1;
|
||||
|
||||
|
||||
In cases like this, the semicolon is almost REQUIRED. The same
|
||||
line, without the semicolons, just looks "funny":
|
||||
|
||||
|
||||
a=b c= d e=e+1
|
||||
|
||||
I suspect that this is the major ... perhaps ONLY ... reason for
|
||||
semicolons: to keep programs from looking funny.
|
||||
|
||||
But the idea of stringing multiple statements together on a
|
||||
single line is a dubious one at best. It's not very good
|
||||
programming style, and harks back to the days when it was
|
||||
considered important to conserve cards. In these days of CRT's
|
||||
and indented code, the clarity of programs is far better served
|
||||
by keeping statements separate. It's still nice to have the
|
||||
OPTION of multiple statements, but it seems a shame to keep
|
||||
programmers in slavery to the semicolon, just to keep that one
|
||||
rare case from "looking funny."
|
||||
|
||||
When I started in with KISS, I tried to keep an open mind. I
|
||||
decided that I would use semicolons when it became necessary for
|
||||
the parser, but not until then. I figured this would happen just
|
||||
about the time I added the ability to spread statements over
|
||||
multiple lines. But, as you can see, that never happened. The
|
||||
TINY compiler is perfectly happy to parse the most complicated
|
||||
statement, spread over any number of lines, without semicolons.
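
Why a recursive descent parser can find the statement boundaries
by itself is easiest to see in a small sketch. The program below
is NOT the TINY parser; it is a minimal illustration that assumes
single-character names and digits, only the operators '+' and
'-', and input taken from a string instead of the keyboard. The
names NoSemi, Src and Idx are made up for the example. An
expression simply ends at the first token that is not an
operator, so the next name starts a new statement.

```pascal
program NoSemi;
{ Sketch only: shows that an expression ends where the operators
  stop, so no statement delimiter is needed. }

var Src: string;      { text being parsed (hypothetical input) }
    Idx: integer;     { index of the lookahead character in Src }
    Look: char;       { lookahead character; #0 marks the end }

procedure GetChar;
begin
   if Idx < Length(Src) then begin
      Idx := Idx + 1;
      Look := Src[Idx];
   end
   else Look := #0;
end;

procedure SkipWhite;
begin
   while Look = ' ' do GetChar;
end;

procedure Expected(s: string);
begin
   WriteLn('Error: ', s, ' Expected.');
   Halt(1);
end;

function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;

function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;

{ <term> ::= name | digit }
function Term: string;
begin
   if not (IsAlpha(Look) or IsDigit(Look)) then
      Expected('Name or Digit');
   Term := Look;
   GetChar;
   SkipWhite;
end;

{ <expression> ::= <term> ( ('+' | '-') <term> )* }
function Expression: string;
var s: string;
begin
   s := Term;
   while Look in ['+', '-'] do begin
      s := s + Look;
      GetChar;
      SkipWhite;
      s := s + Term;
   end;
   Expression := s;      { stops at the first non-operator }
end;

{ <assignment> ::= name '=' <expression> }
procedure Assignment;
var target: char;
    e: string;
begin
   if not IsAlpha(Look) then Expected('Name');
   target := Look;
   GetChar;
   SkipWhite;
   if Look <> '=' then Expected('''=''');
   GetChar;
   SkipWhite;
   e := Expression;
   WriteLn('statement: ', target, '=', e);
end;

begin
   Src := 'a=b c=d e=e+1';
   Idx := 0;
   GetChar;
   SkipWhite;
   while Look <> #0 do Assignment;
end.
```

Run on the "funny looking" line from above, the sketch should
report three separate statements: a=b, c=d and e=e+1.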
|
||||
|
||||
Still, there are people who have used semicolons for so long,
|
||||
they feel naked without them. I'm one of them. Once I had KISS
|
||||
defined sufficiently well, I began to write a few sample programs
|
||||
in the language. I discovered, somewhat to my horror, that I
|
||||
kept putting semicolons in anyway. So now I'm facing the
|
||||
prospect of a NEW rash of compiler errors, caused by UNWANTED
|
||||
semicolons. Phooey!
|
||||
|
||||
Perhaps more to the point, there are readers out there who are
|
||||
designing their own languages, which may include semicolons, or
|
||||
who want to use the techniques of these tutorials to compile
|
||||
conventional languages like C. In either case, we need to be
|
||||
able to deal with semicolons.
|
||||
|
||||
|
||||
SYNTACTIC SUGAR
|
||||
|
||||
This whole discussion brings up the issue of "syntactic sugar"
|
||||
... constructs that are added to a language, not because they are
|
||||
needed, but because they help make the programs look right to the
|
||||
programmer. After all, it's nice to have a small, simple
|
||||
compiler, but it would be of little use if the resulting
|
||||
language were cryptic and hard to program. The language FORTH
|
||||
comes to mind (a premature OUCH! for the barrage I know that
|
||||
one's going to fetch me). If we can add features to the language
|
||||
that make the programs easier to read and understand, and if
|
||||
those features help keep the programmer from making errors, then
|
||||
we should do so. Particularly if the constructs don't add much
|
||||
to the complexity of the language or its compiler.
|
||||
|
||||
The semicolon could be considered an example, but there are
|
||||
plenty of others, such as the 'THEN' in an IF-statement, the 'DO'
|
||||
in a WHILE-statement, and even the 'PROGRAM' statement, which I
|
||||
came within a gnat's eyelash of leaving out of TINY. None of
|
||||
these tokens add much to the syntax of the language ... the
|
||||
compiler can figure out what's going on without them. But some
|
||||
folks feel that they DO add to the readability of programs, and
|
||||
that can be very important.
|
||||
|
||||
There are two schools of thought on this subject, which are well
|
||||
represented by two of our most popular languages, C and Pascal.
|
||||
|
||||
To the minimalists, all such sugar should be left out. They
|
||||
argue that it clutters up the language and adds to the number of
|
||||
keystrokes programmers must type. Perhaps more importantly,
|
||||
every extra token or keyword represents a trap lying in wait for
|
||||
the inattentive programmer. If you leave out a token, misplace
|
||||
it, or misspell it, the compiler will get you. So these people
|
||||
argue that the best approach is to get rid of such things. These
|
||||
folks tend to like C, which has a minimum of unnecessary keywords
|
||||
and punctuation.
|
||||
|
||||
Those from the other school tend to like Pascal. They argue that
|
||||
having to type a few extra characters is a small price to pay for
|
||||
legibility. After all, humans have to read the programs, too.
|
||||
Their best argument is that each such construct is an opportunity
|
||||
to tell the compiler that you really mean for it to do what you
|
||||
said to. The sugary tokens serve as useful landmarks to help you
|
||||
find your way.
|
||||
|
||||
The differences are well represented by the two languages. The
|
||||
most oft-heard complaint about C is that it is too forgiving.
|
||||
When you make a mistake in C, the erroneous code is too often
|
||||
another legal C construct. So the compiler just happily
|
||||
continues to compile, and leaves you to find the error during
|
||||
debug. I guess that's why debuggers are so popular with C
|
||||
programmers.
|
||||
|
||||
On the other hand, if a Pascal program compiles, you can be
|
||||
pretty sure that the program will do what you told it. If there
|
||||
is an error at run time, it's probably a design error.
|
||||
|
||||
The best example of useful sugar is the semicolon itself.
|
||||
Consider the code fragment:
|
||||
|
||||
|
||||
a=1+(2*b+c) b...
|
||||
|
||||
|
||||
Since there is no operator connecting the token 'b' with the rest
|
||||
of the statement, the compiler will conclude that the expression
|
||||
ends with the ')', and the 'b' is the beginning of a new
|
||||
statement. But suppose I have simply left out the intended
|
||||
operator, and I really want to say:
|
||||
|
||||
|
||||
a=1+(2*b+c)*b...
|
||||
|
||||
|
||||
In this case the compiler will get an error, all right, but it
|
||||
won't be very meaningful since it will be expecting an '=' sign
|
||||
after the 'b' that really shouldn't be there.
|
||||
|
||||
If, on the other hand, I include a semicolon after the 'b', THEN
|
||||
there can be no doubt where I intend the statement to end.
|
||||
Syntactic sugar, then, can serve a very useful purpose by
|
||||
providing some additional insurance that we remain on track.
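
The difference in error reporting can be tried out with a small
sketch. It uses a cut-down version of the fragment above,
'a=1+c b', in which the '*' between c and b has been left out.
Everything here is an assumption made for illustration: the
program is not TINY, names and digits are single characters,
there are no parentheses, and the routine FirstError (a made-up
name) records the first complaint instead of halting, so that
both cases can be run in one go.

```pascal
program SugarDemo;
{ Sketch only: compares the first error reported with and
  without a terminating semicolon. }

var Src: string;      { text being parsed }
    Idx: integer;     { index of the lookahead character }
    Look: char;       { lookahead character; #0 marks the end }
    Msg: string;      { first error found, or empty }

procedure GetChar;
begin
   if Idx < Length(Src) then begin
      Idx := Idx + 1;
      Look := Src[Idx];
   end
   else Look := #0;
end;

procedure SkipWhite;
begin
   while Look = ' ' do GetChar;
end;

{ Record the first error only; do not halt }
procedure Expected(s: string);
begin
   if Msg = '' then Msg := s + ' Expected';
end;

procedure Match(x: char);
begin
   if Look = x then begin
      GetChar;
      SkipWhite;
   end
   else Expected('''' + x + '''');
end;

{ A single letter or digit }
procedure Operand;
begin
   if UpCase(Look) in ['A'..'Z', '0'..'9'] then begin
      GetChar;
      SkipWhite;
   end
   else Expected('Operand');
end;

{ <expression> ::= <operand> ( ('+' | '*') <operand> )* }
procedure Expression;
begin
   Operand;
   while (Msg = '') and (Look in ['+', '*']) do begin
      GetChar;
      SkipWhite;
      Operand;
   end;
end;

{ <assignment> ::= name '=' <expression> }
procedure Assignment;
begin
   if UpCase(Look) in ['A'..'Z'] then begin
      GetChar;
      SkipWhite;
   end
   else Expected('Name');
   if Msg = '' then Match('=');
   if Msg = '' then Expression;
end;

{ Parse Line and return the first error message }
function FirstError(Line: string; UseSemi: boolean): string;
begin
   Src := Line;
   Idx := 0;
   Msg := '';
   GetChar;
   SkipWhite;
   while (Msg = '') and (Look <> #0) do begin
      Assignment;
      if UseSemi and (Msg = '') then Match(';');
   end;
   if Msg = '' then Msg := 'no error';
   FirstError := Msg;
end;

begin
   { The '*' between c and b was left out by mistake }
   WriteLn('without semicolons: ', FirstError('a=1+c b', false));
   WriteLn('with semicolons:    ', FirstError('a=1+c b;', true));
end.
```

Without semicolons the sketch complains that an '=' is expected,
and only after it has swallowed the 'b' as the start of a new
statement. With semicolons it complains that a ';' is expected at
the point where the expression actually went wrong.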
|
||||
|
||||
I find myself somewhere in the middle of all this. I tend to
|
||||
favor the Pascal-ers' view ... I'd much rather find my bugs at
|
||||
compile time than at run time. But I also hate to just throw
|
||||
verbosity in for no apparent reason, as in COBOL. So far I've
|
||||
consistently left most of the Pascal sugar out of KISS/TINY. But
|
||||
I certainly have no strong feelings either way, and I also can
|
||||
see the value of sprinkling a little sugar around just for the
|
||||
extra insurance that it brings. If you like this latter
|
||||
approach, things like that are easy to add. Just remember that,
|
||||
like the semicolon, each item of sugar is something that can
|
||||
potentially cause a compile error by its omission.
|
||||
|
||||
|
||||
DEALING WITH SEMICOLONS
|
||||
|
||||
There are two distinct ways in which semicolons are used in
|
||||
popular languages. In Pascal, the semicolon is regarded as a
|
||||
statement SEPARATOR. No semicolon is required after the last
|
||||
statement in a block. The syntax is:
|
||||
|
||||
|
||||
<block> ::= <statement> ( ';' <statement>)*
|
||||
|
||||
<statement> ::= <assignment> | <if> | <while> ... | null
|
||||
|
||||
|
||||
(The null statement is IMPORTANT!)
|
||||
|
||||
Pascal also defines some semicolons in other places, such as
|
||||
after the PROGRAM statement.
|
||||
|
||||
In C and Ada, on the other hand, the semicolon is considered a
|
||||
statement TERMINATOR, and follows all statements (with some
|
||||
embarrassing and confusing exceptions). The syntax for this is
|
||||
simply:
|
||||
|
||||
|
||||
<block> ::= ( <statement> ';')*
|
||||
|
||||
|
||||
Of the two syntaxes, the Pascal one seems on the face of it more
|
||||
rational, but experience has shown that it leads to some strange
|
||||
difficulties. People get so used to typing a semicolon after
|
||||
every statement that they tend to type one after the last
|
||||
statement in a block, also. That usually doesn't cause any harm
|
||||
... it just gets treated as a null statement. Many Pascal
|
||||
programmers, including yours truly, do just that. But there is
|
||||
one place you absolutely CANNOT type a semicolon, and that's
|
||||
right before an ELSE. This little gotcha has cost me many an
|
||||
extra compilation, particularly when the ELSE is added to
|
||||
existing code. So the C/Ada choice turns out to be better.
|
||||
Apparently Niklaus Wirth thinks so, too: In his Modula-2, he
|
||||
abandoned the Pascal approach.
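
The two grammars can be compared side by side in a small sketch.
This is an illustration under stated assumptions and not code
from the series: a statement is just the letter 's' (or nothing
at all, for the null statement), the whole input is one block,
and the names SemiStyles, SeparatorBlock and TerminatorBlock are
made up for the example.

```pascal
program SemiStyles;
{ Sketch only: recognizers for the separator and the terminator
  form of a block. }

var Src: string;      { text being parsed }
    Idx: integer;     { index of the lookahead character }
    Look: char;       { lookahead character; #0 marks the end }

procedure GetChar;
begin
   if Idx < Length(Src) then begin
      Idx := Idx + 1;
      Look := Src[Idx];
   end
   else Look := #0;
end;

procedure Start(Line: string);
begin
   Src := Line;
   Idx := 0;
   GetChar;
end;

{ <statement> ::= 's' | null }
procedure Statement;
begin
   if Look = 's' then GetChar;
end;

{ Pascal style: <block> ::= <statement> ( ';' <statement> )* }
function SeparatorBlock(Line: string): boolean;
begin
   Start(Line);
   Statement;
   while Look = ';' do begin
      GetChar;
      Statement;
   end;
   SeparatorBlock := Look = #0;
end;

{ C/Ada style: <block> ::= ( <statement> ';' )* }
function TerminatorBlock(Line: string): boolean;
var ok: boolean;
begin
   Start(Line);
   ok := true;
   while ok and (Look <> #0) do begin
      Statement;
      if Look = ';' then GetChar
      else ok := false;
   end;
   TerminatorBlock := ok;
end;

procedure Show(Line: string);
begin
   WriteLn(Line, '  separator: ', SeparatorBlock(Line),
                 '  terminator: ', TerminatorBlock(Line));
end;

begin
   Show('s;s');
   Show('s;s;');
end.
```

The input 's;s' should be accepted only by the separator form.
The input 's;s;' should be accepted by both: in the separator
form the trailing semicolon is followed by a null statement,
which is why the extra semicolon usually does no harm in Pascal.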
|
||||
|
||||
Given either of these two syntaxes, it's an easy matter (now that
we've reorganized the parser!) to add these features to our
parser. Let's take the last case first, since it's simpler.

To begin, I've made things easy by introducing a new recognizer:

{--------------------------------------------------------------}
{ Match a Semicolon }

procedure Semi;
begin
   MatchString(';');
end;
{--------------------------------------------------------------}

This procedure works very much like our old Match. It insists on
finding a semicolon as the next token. Having found it, it skips
to the next token.

Since a semicolon follows a statement, procedure Block is almost
the only one we need to change:

{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }

procedure Block;
begin
   Scan;
   while not(Token in ['e', 'l']) do begin
      case Token of
         'i': DoIf;
         'w': DoWhile;
         'R': DoRead;
         'W': DoWrite;
         'x': Assignment;
      end;
      Semi;
      Scan;
   end;
end;
{--------------------------------------------------------------}

Note carefully the subtle change in the case statement. The call
to Assignment is now guarded by a test on Token. This is to
avoid calling Assignment when the token is a semicolon (which
could happen if the statement is null).

Since declarations are also statements, we also need to add a
call to Semi within procedure TopDecls:

{--------------------------------------------------------------}
{ Parse and Translate Global Declarations }

procedure TopDecls;
begin
   Scan;
   while Token = 'v' do begin
      Alloc;
      while Token = ',' do
         Alloc;
      Semi;
   end;
end;
{--------------------------------------------------------------}

Finally, we need one for the PROGRAM statement:

{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   MatchString('PROGRAM');
   Semi;
   Header;
   TopDecls;
   MatchString('BEGIN');
   Prolog;
   Block;
   MatchString('END');
   Epilog;
end.
{--------------------------------------------------------------}

It's as easy as that. Try it with a copy of TINY and see how you
like it.

The Pascal version is a little trickier, but it still only
requires minor changes, and those only to procedure Block. To
keep things as simple as possible, let's split the procedure into
two parts. The following procedure handles just one statement:

{--------------------------------------------------------------}
{ Parse and Translate a Single Statement }

procedure Statement;
begin
   Scan;
   case Token of
      'i': DoIf;
      'w': DoWhile;
      'R': DoRead;
      'W': DoWrite;
      'x': Assignment;
   end;
end;
{--------------------------------------------------------------}

Using this procedure, we can now rewrite Block like this:

{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }

procedure Block;
begin
   Statement;
   while Token = ';' do begin
      Next;
      Statement;
   end;
end;
{--------------------------------------------------------------}

That sure didn't hurt, did it? We can now parse semicolons in
Pascal-like fashion.

A COMPROMISE

Now that we know how to deal with semicolons, does that mean that
I'm going to put them in KISS/TINY? Well, yes and no. I like
the extra sugar and the security that comes with knowing for sure
where the ends of statements are. But I haven't changed my
dislike for the compilation errors associated with semicolons.

So I have what I think is a nice compromise: Make them OPTIONAL!

Consider the following version of Semi:

{--------------------------------------------------------------}
{ Match a Semicolon }

procedure Semi;
begin
   if Token = ';' then Next;
end;
{--------------------------------------------------------------}

This procedure will ACCEPT a semicolon whenever it is called, but
it won't INSIST on one. That means that when you choose to use
semicolons, the compiler will use the extra information to help
keep itself on track. But if you omit one (or omit them all) the
compiler won't complain. The best of both worlds.

Put this procedure in place in the first version of your program
(the one for C/Ada syntax), and you have the makings of TINY
Version 1.2.

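If you'd like to watch the optional semicolon at work without
touching your copy of TINY, here is a minimal stand-alone sketch.
It is not part of TINY: the input is a string constant, each
letter stands in for a whole statement, and the names Source and
Cursor and this cut-down Next are assumptions made for the demo.
Only Semi is the procedure shown above.

```pascal
program OptionalSemi;
{ Stand-alone sketch, not part of TINY.  Each letter stands for one
  "statement"; a ';' after it is accepted but not required. }
const
   Source: string = 'a;b c;d';
var
   Cursor: integer;
   Token: char;

{ Fetch the next non-blank character, or #0 at the end of the input }
procedure Next;
begin
   repeat
      Inc(Cursor);
      if Cursor > Length(Source) then
         Token := #0
      else
         Token := Source[Cursor];
   until Token <> ' ';
end;

{ Match a Semicolon, if there is one }
procedure Semi;
begin
   if Token = ';' then Next;
end;

begin
   Cursor := 0;
   Next;
   while Token <> #0 do begin
      WriteLn('statement ', Token);
      Next;
      Semi;
   end;
end.
```

It should list statements a, b, c and d, even though only two of
them are followed by a semicolon.
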
COMMENTS

Up until now I have carefully avoided the subject of comments.
You would think that this would be an easy subject ... after all,
the compiler doesn't have to deal with comments at all; it should
just ignore them. Well, sometimes that's true.

Comments can be just about as easy or as difficult as you choose
to make them. At one extreme, we can arrange things so that
comments are intercepted almost the instant they enter the
compiler. At the other, we can treat them as lexical elements.
Things tend to get interesting when you consider things like
comment delimiters contained in quoted strings.

SINGLE-CHARACTER DELIMITERS

Here's an example. Suppose we assume the Turbo Pascal standard
and use curly braces for comments. In this case we have
single-character delimiters, so our parsing is a little easier.

One approach is to strip the comments out the instant we
encounter them in the input stream; that is, right in procedure
GetChar. To do this, first change the name of GetChar to
something else, say GetCharX. (For the record, this is going to
be a TEMPORARY change, so best not do this with your only copy of
TINY. I assume you understand that you should always do these
experiments with a working copy.)

Now, we're going to need a procedure to skip over comments. So
key in the following one:

{--------------------------------------------------------------}
{ Skip A Comment Field }

procedure SkipComment;
begin
   while Look <> '}' do
      GetCharX;
   GetCharX;
end;
{--------------------------------------------------------------}

Clearly, what this procedure is going to do is to simply read and
discard characters from the input stream, until it finds a right
curly brace. Then it reads one more character and returns it in
Look.

Now we can write a new version of GetChar that calls SkipComment
to strip out comments:

{--------------------------------------------------------------}
{ Get Character from Input Stream }
{ Skip Any Comments }

procedure GetChar;
begin
   GetCharX;
   if Look = '{' then SkipComment;
end;
{--------------------------------------------------------------}

Code this up and give it a try. You'll find that you can,
indeed, bury comments anywhere you like. The comments never even
get into the parser proper ... every call to GetChar just returns
any character that's NOT part of a comment.

As a matter of fact, while this approach gets the job done, and
may even be perfectly satisfactory for you, it does its job a
little TOO well. First of all, most programming languages
specify that a comment should be treated like a space, so that
comments aren't allowed to be embedded in, say, variable names.
This current version doesn't care WHERE you put comments.

Second, since the rest of the parser can't even receive a '{'
character, you will not be allowed to put one in a quoted string.

Before you turn up your nose at this simplistic solution, though,
I should point out that as respected a compiler as Turbo Pascal
also won't allow a '{' in a quoted string. Try it. And as for
embedding a comment in an identifier, I can't imagine why anyone
would want to do such a thing, anyway, so the question is moot.
For 99% of all applications, what I've just shown you will work
just fine.

But, if you want to be picky about it and stick to the
conventional treatment, then we need to move the interception
point downstream a little further.

To do this, first change GetChar back to the way it was and
change the name called in SkipComment. Then, let's add the left
brace as a possible whitespace character:

{--------------------------------------------------------------}
{ Recognize White Space }

function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB, CR, LF, '{'];
end;
{--------------------------------------------------------------}

Now, we can deal with comments in procedure SkipWhite:

{--------------------------------------------------------------}
{ Skip Over Leading White Space }

procedure SkipWhite;
begin
   while IsWhite(Look) do begin
      if Look = '{' then
         SkipComment
      else
         GetChar;
   end;
end;
{--------------------------------------------------------------}

Note that SkipWhite is written so that we will skip over any
combination of whitespace characters and comments, in one call.

OK, give this one a try, too. You'll find that it will let a
comment serve to delimit tokens. It's worth mentioning that this
approach also gives us the ability to handle curly braces within
quoted strings, since within such strings we will not be testing
for or skipping over whitespace.

There's one last item to deal with: Nested comments. Some
programmers like the idea of nesting comments, since it allows
you to comment out code during debugging. The code I've given
here won't allow that and, again, neither will Turbo Pascal.

But the fix is incredibly easy. All we need to do is to make
SkipComment recursive:

{--------------------------------------------------------------}
{ Skip A Comment Field }

procedure SkipComment;
begin
   while Look <> '}' do begin
      GetChar;
      if Look = '{' then SkipComment;
   end;
   GetChar;
end;
{--------------------------------------------------------------}

That does it. As sophisticated a comment-handler as you'll ever
need.

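To see the recursion at work outside the compiler, here is a
small stand-alone sketch. SkipComment is the recursive version
just given; everything around it is an assumption made for the
demo: the input comes from a string constant, GetChar is a
stand-in for the cradle's version, and a period marks the end of
the input.

```pascal
program NestedComments;
{ Stand-alone sketch: the input comes from a string constant rather
  than the keyboard, so GetChar here is only a stand-in for the
  cradle's version. }
const
   Source: string = 'a{one{two}three}b{four}c.';
var
   Cursor: integer;
   Look: char;

{ Read the next character; '.' is returned once the input runs out }
procedure GetChar;
begin
   Inc(Cursor);
   if Cursor <= Length(Source) then
      Look := Source[Cursor]
   else
      Look := '.';
end;

{ Skip A Comment Field, allowing nesting }
procedure SkipComment;
begin
   while Look <> '}' do begin
      GetChar;
      if Look = '{' then SkipComment;
   end;
   GetChar;
end;

begin
   Cursor := 0;
   GetChar;
   { Copy everything that is not inside a comment to the output }
   while Look <> '.' do
      if Look = '{' then
         SkipComment
      else begin
         Write(Look);
         GetChar;
      end;
   WriteLn;
end.
```

The inner comment {two} is swallowed by the recursive call, and
the outer comment carries on to its own closing brace, so the
program should print abc.
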
MULTI-CHARACTER DELIMITERS

That's all well and good for cases where a comment is delimited
by single characters, but what about the cases such as C or
standard Pascal, where two characters are required? Well, the
principles are still the same, but we have to change our approach
quite a bit. I'm sure it won't surprise you to learn that things
get harder in this case.

For the multi-character situation, the easiest thing to do is to
intercept the left delimiter back at the GetChar stage. We can
"tokenize" it right there, replacing it by a single character.

Let's assume we're using the C delimiters '/*' and '*/'. First,
we need to go back to the "GetCharX" approach. In yet another
copy of your compiler, rename GetChar to GetCharX and then enter
the following new procedure GetChar:

{--------------------------------------------------------------}
{ Read New Character. Intercept '/*' }

procedure GetChar;
begin
   if TempChar <> ' ' then begin
      Look := TempChar;
      TempChar := ' ';
   end
   else begin
      GetCharX;
      if Look = '/' then begin
         Read(TempChar);
         if TempChar = '*' then begin
            Look := '{';
            TempChar := ' ';
         end;
      end;
   end;
end;
{--------------------------------------------------------------}

As you can see, what this procedure does is to intercept every
occurrence of '/'. It then examines the NEXT character in the
stream. If the character is a '*', then we have found the
beginning of a comment, and GetChar will return a single
character replacement for it. (For simplicity, I'm using the
same '{' character as I did for Pascal. If you were writing a C
compiler, you'd no doubt want to pick some other character that's
not used elsewhere in C. Pick anything you like ... even $FF,
anything that's unique.)

If the character following the '/' is NOT a '*', then GetChar
tucks it away in the new global TempChar, and returns the '/'.

Note that you need to declare this new variable and initialize it
to ' '. I like to do things like that using the Turbo "typed
constant" construct:

   const TempChar: char = ' ';

Now we need a new version of SkipComment:

{--------------------------------------------------------------}
{ Skip A Comment Field }

procedure SkipComment;
begin
   GetCharX;
   repeat
      { Stop at each '*' without reading past it, so that a run }
      { of asterisks such as '**/' still ends the comment       }
      while Look <> '*' do
         GetCharX;
      GetCharX;
   until Look = '/';
   GetChar;
end;
{--------------------------------------------------------------}

A few things to note: first of all, function IsWhite and
procedure SkipWhite don't need to be changed, since GetChar
returns the '{' token. If you change that token character, then
of course you also need to change the character in those two
routines.

Second, note that SkipComment doesn't call GetChar in its loop,
but GetCharX. That means that the trailing '/' is not
intercepted and is seen by SkipComment. Third, although GetChar
is the procedure doing the work, we can still deal with the
comment characters embedded in a quoted string, by calling
GetCharX instead of GetChar while we're within the string.
Finally, note that we can again provide for nested comments by
adding a single statement to SkipComment, just as we did before.

ONE-SIDED COMMENTS

So far I've shown you how to deal with any kind of comment
delimited on the left and the right. That only leaves the
one-sided comments like those in assembler language or in Ada,
that are terminated by the end of the line. In a way, that case
is easier. The only procedure that would need to be changed is
SkipComment, which must now terminate at the newline characters:

{--------------------------------------------------------------}
{ Skip A Comment Field }

procedure SkipComment;
begin
   repeat
      GetCharX;
   until Look = CR;
   GetChar;
end;
{--------------------------------------------------------------}

If the leading character is a single one, as in the ';' of
assembly language, then we're essentially done. If it's a
two-character token, as in the '--' of Ada, we need only modify
the tests within GetChar. Either way, it's an easier problem
than the balanced case.

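For the Ada case, here is one way the modified tests in GetChar
might look, as a stand-alone sketch rather than a drop-in piece
of TINY. The assumptions are all mine: the input is a string
constant with CR at the end of each line, GetCharX reads from
that string instead of the keyboard, a period ends the input, and
'{' is reused as the single-character comment token. SkipComment
is the one-sided version shown above.

```pascal
program AdaComments;
{ Stand-alone sketch: GetCharX reads from a string constant, a
  stand-in for the cradle's keyboard input. }
const
   CR = #13;
   Source: string = 'x-1 -- first'#13'y--second'#13'.';
   TempChar: char = ' ';
var
   Cursor: integer;
   Look: char;

{ Read the next raw character; '.' once the input runs out }
procedure GetCharX;
begin
   Inc(Cursor);
   if Cursor <= Length(Source) then
      Look := Source[Cursor]
   else
      Look := '.';
end;

{ Read New Character.  Intercept '--' }
procedure GetChar;
begin
   if TempChar <> ' ' then begin
      Look := TempChar;
      TempChar := ' ';
   end
   else begin
      GetCharX;
      if Look = '-' then begin
         GetCharX;             { peek at the character after '-' }
         TempChar := Look;     { ' ' doubles as the "empty" mark }
         Look := '-';
         if TempChar = '-' then begin
            Look := '{';
            TempChar := ' ';
         end;
      end;
   end;
end;

{ Skip A Comment Field: discard the rest of the line }
procedure SkipComment;
begin
   repeat
      GetCharX;
   until Look = CR;
   GetChar;
end;

begin
   Cursor := 0;
   GetChar;
   while Look <> '.' do
      if Look = '{' then begin
         SkipComment;
         WriteLn;
      end
      else begin
         Write(Look);
         GetChar;
      end;
end.
```

A lone '-', as in x-1, comes through untouched, while each '--'
and the rest of its line are dropped. The output should be the
two lines of the input with their comments removed.
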
CONCLUSION

At this point we now have the ability to deal with both comments
and semicolons, as well as other kinds of syntactic sugar. I've
shown you several ways to deal with each, depending upon the
convention desired. The only issue left is: which of these
conventions should we use in KISS/TINY?

For the reasons that I've given as we went along, I'm choosing
the following:

   (1) Semicolons are TERMINATORS, not separators

   (2) Semicolons are OPTIONAL

   (3) Comments are delimited by curly braces

   (4) Comments MAY be nested

Put the code corresponding to these cases into your copy of TINY.
You now have TINY Version 1.2.

Now that we have disposed of these sideline issues, we can
finally get back into the mainstream. In the next installment,
we'll talk about procedures and parameter passing, and we'll add
these important features to TINY. See you then.

*****************************************************************
*                                                               *
*                        COPYRIGHT NOTICE                       *
*                                                               *
*   Copyright (C) 1989 Jack W. Crenshaw. All rights reserved.   *
*                                                               *
*****************************************************************

                     LET'S BUILD A COMPILER!

                                By

                     Jack W. Crenshaw, Ph.D.

                           24 July 1988


                   Part II: EXPRESSION PARSING


*****************************************************************
*                                                               *
*                        COPYRIGHT NOTICE                       *
*                                                               *
*   Copyright (C) 1988 Jack W. Crenshaw. All rights reserved.   *
*                                                               *
*****************************************************************

GETTING STARTED

If you've read the introduction document to this series, you will
already know what we're about. You will also have copied the
cradle software into your Turbo Pascal system, and have compiled
it. So you should be ready to go.

The purpose of this article is for us to learn how to parse and
translate mathematical expressions. What we would like to see as
output is a series of assembler-language statements that perform
the desired actions. For purposes of definition, an expression
is the right-hand side of an equation, as in

   x = 2*y + 3/(4*z)

In the early going, I'll be taking things in _VERY_ small steps.
That's so that the beginners among you won't get totally lost.
There are also some very good lessons to be learned early on,
that will serve us well later. For the more experienced readers:
bear with me. We'll get rolling soon enough.

SINGLE DIGITS

In keeping with the whole theme of this series (KISS, remember?),
let's start with the absolutely most simple case we can think of.
That, to me, is an expression consisting of a single digit.

Before starting to code, make sure you have a baseline copy of
the "cradle" that I gave last time. We'll be using it again for
other experiments. Then add this code:

{---------------------------------------------------------------}
{ Parse and Translate a Math Expression }

procedure Expression;
begin
   EmitLn('MOVE #' + GetNum + ',D0')
end;
{---------------------------------------------------------------}

And add the line "Expression;" to the main program so that it
reads:

{---------------------------------------------------------------}
begin
   Init;
   Expression;
end.
{---------------------------------------------------------------}

Now run the program. Try any single-digit number as input. You
should get a single line of assembler-language output. Now try
any other character as input, and you'll see that the parser
properly reports an error.

CONGRATULATIONS! You have just written a working translator!

OK, I grant you that it's pretty limited. But don't brush it off
too lightly. This little "compiler" does, on a very limited
scale, exactly what any larger compiler does: it correctly
recognizes legal statements in the input "language" that we have
defined for it, and it produces correct, executable assembler
code, suitable for assembling into object format. Just as
importantly, it correctly recognizes statements that are NOT
legal, and gives a meaningful error message. Who could ask for
more? As we expand our parser, we'd better make sure those two
characteristics always hold true.

There are some other features of this tiny program worth
mentioning. First, you can see that we don't separate code
generation from parsing ... as soon as the parser knows what we
want done, it generates the object code directly. In a real
compiler, of course, the reads in GetChar would be from a disk
file, and the writes to another disk file, but this way is much
easier to deal with while we're experimenting.

Also note that an expression must leave a result somewhere. I've
chosen the 68000 register D0. I could have made some other
choices, but this one makes sense.

BINARY EXPRESSIONS

Now that we have that under our belt, let's branch out a bit.
Admittedly, an "expression" consisting of only one character is
not going to meet our needs for long, so let's see what we can do
to extend it. Suppose we want to handle expressions of the form:

                     1+2
   or                4-3
   or, in general,   <term> +/- <term>

(That's a bit of Backus-Naur Form, or BNF.)

To do this we need a procedure that recognizes a term and leaves
its result somewhere, and another that recognizes and
distinguishes between a '+' and a '-' and generates the
appropriate code. But if Expression is going to leave its result
in D0, where should Term leave its result? Answer: the same
place. We're going to have to save the first result of Term
somewhere before we get the next one.

OK, basically what we want to do is have procedure Term do what
Expression was doing before. So just RENAME procedure Expression
as Term, and enter the following new version of Expression:

{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
   Term;
   EmitLn('MOVE D0,D1');
   case Look of
      '+': Add;
      '-': Subtract;
   else Expected('Addop');
   end;
end;
{--------------------------------------------------------------}

Next, just above Expression enter these two procedures:

{--------------------------------------------------------------}
{ Recognize and Translate an Add }

procedure Add;
begin
   Match('+');
   Term;
   EmitLn('ADD D1,D0');
end;

{-------------------------------------------------------------}
{ Recognize and Translate a Subtract }

procedure Subtract;
begin
   Match('-');
   Term;
   EmitLn('SUB D1,D0');
end;
{-------------------------------------------------------------}

When you're finished with that, the order of the routines should
be:

   o Term (The OLD Expression)
   o Add
   o Subtract
   o Expression

Now run the program. Try any combination you can think of: two
single digits, separated by a '+' or a '-'. You should get a
series of four assembler-language instructions out of each run.
Now try some expressions with deliberate errors in them. Does
the parser catch the errors?

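As a check on your results, here is what I would expect for the
input 1+2. The listing is traced by hand from the procedures
above, not captured from a run, and it assumes the cradle's
EmitLn, which writes a tab and then the instruction on a line of
its own.

```
	MOVE #1,D0
	MOVE D0,D1
	MOVE #2,D0
	ADD D1,D0
```

An input of 4-3 should give the same four lines with the digits
changed and SUB in place of ADD.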
Take a look at the object code generated. There are two
observations we can make. First, the code generated is NOT what
we would write ourselves. The sequence

   MOVE #n,D0
   MOVE D0,D1

is inefficient. If we were writing this code by hand, we would
probably just load the data directly to D1.

There is a message here: code generated by our parser is less
efficient than the code we would write by hand. Get used to it.
That's going to be true throughout this series. It's true of all
compilers to some extent. Computer scientists have devoted whole
lifetimes to the issue of code optimization, and there are indeed
things that can be done to improve the quality of code output.
Some compilers do quite well, but there is a heavy price to pay
in complexity, and it's a losing battle anyway ... there will
probably never come a time when a good assembler-language
programmer can't out-program a compiler. Before this session is
over, I'll briefly mention some ways that we can do a little
optimization, just to show you that we can indeed improve things
without too much trouble. But remember, we're here to learn, not
to see how tight we can make the object code. For now, and
really throughout this series of articles, we'll studiously
ignore optimization and concentrate on getting out code that
works.

Speaking of which: ours DOESN'T! The code is _WRONG_! As things
are working now, the subtraction process subtracts D1 (which has
the FIRST argument in it) from D0 (which has the second). That's
the wrong way, so we end up with the wrong sign for the result.
So let's fix up procedure Subtract with a sign-changer, so that
it reads

{-------------------------------------------------------------}
{ Recognize and Translate a Subtract }

procedure Subtract;
begin
   Match('-');
   Term;
   EmitLn('SUB D1,D0');
   EmitLn('NEG D0');
end;
{-------------------------------------------------------------}

Now our code is even less efficient, but at least it gives the
right answer! Unfortunately, the rules that give the meaning of
math expressions require that the terms in an expression come out
in an inconvenient order for us. Again, this is just one of
those facts of life you learn to live with. This one will come
back to haunt us when we get to division.

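To see why the NEG is needed, follow the registers through the
input 4-3. The instructions on the left are what the corrected
Subtract should emit, traced by hand; the register contents on
the right are my own annotations, not compiler output.

```
	MOVE #4,D0        D0 = 4
	MOVE D0,D1        D1 = 4
	MOVE #3,D0        D0 = 3
	SUB D1,D0         D0 = 3 - 4 = -1
	NEG D0            D0 = 1
```

Without the last line the result would be -1, which is 3-4 and
not the 4-3 we asked for.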
OK, at this point we have a parser that can recognize the sum or
difference of two digits. Earlier, we could only recognize a
single digit. But real expressions can have either form (or an
infinity of others). For kicks, go back and run the program with
the single input line '1'.

Didn't work, did it? And why should it? We just finished
telling our parser that the only kinds of expressions that are
legal are those with two terms. We must rewrite procedure
Expression to be a lot more broadminded, and this is where things
start to take the shape of a real parser.

GENERAL EXPRESSIONS

In the REAL world, an expression can consist of one or more
terms, separated by "addops" ('+' or '-'). In BNF, this is
written

   <expression> ::= <term> [<addop> <term>]*

We can accommodate this definition of an expression with the
addition of a simple loop to procedure Expression:

{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
   Term;
   while Look in ['+', '-'] do begin
      EmitLn('MOVE D0,D1');
      case Look of
         '+': Add;
         '-': Subtract;
      else Expected('Addop');
      end;
   end;
end;
{--------------------------------------------------------------}

NOW we're getting somewhere! This version handles any number of
terms, and it only cost us two extra lines of code. As we go on,
you'll discover that this is characteristic of top-down parsers
... it only takes a few lines of code to accommodate extensions
to the language. That's what makes our incremental approach
possible. Notice, too, how well the code of procedure Expression
matches the BNF definition. That, too, is characteristic of the
method. As you get proficient in the approach, you'll find that
you can turn BNF into parser code just about as fast as you can
type!

OK, compile the new version of our parser, and give it a try. As
usual, verify that the "compiler" can handle any legal
expression, and will give a meaningful error message for an
illegal one. Neat, eh? You might note that in our test version,
any error message comes out sort of buried in whatever code had
already been generated. But remember, that's just because we are
using the CRT as our "output file" for this series of
experiments. In a production version, the two outputs would be
separated ... one to the output file, and one to the screen.

USING THE STACK

At this point I'm going to violate my rule that we don't
introduce any complexity until it's absolutely necessary, long
enough to point out a problem with the code we're generating. As
things stand now, the parser uses D0 for the "primary" register,
and D1 as a place to store the partial sum. That works fine for
now, because as long as we deal with only the "addops" '+' and
'-', any new term can be added in as soon as it is found. But in
general that isn't true. Consider, for example, the expression

   1+(2-(3+(4-5)))

If we put the '1' in D1, where do we put the '2'? Since a
general expression can have any degree of complexity, we're going
to run out of registers fast!

Fortunately, there's a simple solution. Like every modern
microprocessor, the 68000 has a stack, which is the perfect place
to save a variable number of items. So instead of moving the term
in D0 to D1, let's just push it onto the stack. For the benefit
of those unfamiliar with 68000 assembler language, a push is
written

   -(SP)

and a pop, (SP)+ .

So let's change the EmitLn in Expression to read:

   EmitLn('MOVE D0,-(SP)');

and the two lines in Add and Subtract to

   EmitLn('ADD (SP)+,D0')

and

   EmitLn('SUB (SP)+,D0'),

respectively. Now try the parser again and make sure we haven't
broken it.

Once again, the generated code is less efficient than before, but
it's a necessary step, as you'll see.

MULTIPLICATION AND DIVISION
|
||||
|
||||
Now let's get down to some REALLY serious business. As you all
|
||||
know, there are other math operators than "addops" ...
|
||||
expressions can also have multiply and divide operations. You
|
||||
also know that there is an implied operator PRECEDENCE, or
|
||||
hierarchy, associated with expressions, so that in an expression
|
||||
like
|
||||
|
||||
2 + 3 * 4,
|
||||
|
||||
we know that we're supposed to multiply FIRST, then add. (See
|
||||
why we needed the stack?)
|
||||
|
||||
In the early days of compiler technology, people used some rather
|
||||
complex techniques to ensure that the operator precedence rules
|
||||
were obeyed. It turns out, though, that none of this is
|
||||
necessary ... the rules can be accommodated quite nicely by our
|
||||
top-down parsing technique. Up till now, the only form that
|
||||
we've considered for a term is that of a single decimal digit.
|
||||
|
||||
More generally, we can define a term as a PRODUCT of FACTORS;
|
||||
i.e.,
|
||||
|
||||
<term> ::= <factor> [ <mulop> <factor> ]*
|
||||
|
||||
What is a factor? For now, it's what a term used to be ... a
|
||||
single digit.
|
||||
|
||||
Notice the symmetry: a term has the same form as an expression.
|
||||
As a matter of fact, we can add to our parser with a little
|
||||
judicious copying and renaming. But to avoid confusion, the
|
||||
listing below is the complete set of parsing routines. (Note the
|
||||
way we handle the reversal of operands in Divide.)
|
||||
|
||||
|
||||
{---------------------------------------------------------------}
|
||||
{ Parse and Translate a Math Factor }
|
||||
|
||||
procedure Factor;
|
||||
begin
|
||||
EmitLn('MOVE #' + GetNum + ',D0')
|
||||
end;
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Recognize and Translate a Multiply }
|
||||
|
||||
procedure Multiply;
|
||||
begin
|
||||
Match('*');
|
||||
Factor;
|
||||
EmitLn('MULS (SP)+,D0');
|
||||
end;
|
||||
|
||||
|
||||
{-------------------------------------------------------------}
|
||||
{ Recognize and Translate a Divide }
|
||||
|
||||
procedure Divide;
|
||||
begin
|
||||
Match('/');
|
||||
Factor;
|
||||
EmitLn('MOVE (SP)+,D7');
|
||||
EmitLn('EXT.L D7');
|
||||
EmitLn('DIVS D0,D7');
|
||||
EmitLn('MOVE D7,D0');
|
||||
end;
|
||||
|
||||
|
||||
{---------------------------------------------------------------}
|
||||
{ Parse and Translate a Math Term }
|
||||
|
||||
procedure Term;
|
||||
begin
|
||||
Factor;
|
||||
while Look in ['*', '/'] do begin
|
||||
EmitLn('MOVE D0,-(SP)');
|
||||
case Look of
|
||||
'*': Multiply;
|
||||
'/': Divide;
|
||||
else Expected('Mulop');
|
||||
end;
|
||||
end;
|
||||
end;
|
||||
|
||||
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Recognize and Translate an Add }
|
||||
|
||||
procedure Add;
|
||||
begin
|
||||
Match('+');
|
||||
Term;
|
||||
EmitLn('ADD (SP)+,D0');
|
||||
end;
|
||||
|
||||
|
||||
{-------------------------------------------------------------}
|
||||
{ Recognize and Translate a Subtract }
|
||||
|
||||
procedure Subtract;
|
||||
begin
|
||||
Match('-');
|
||||
Term;
|
||||
EmitLn('SUB (SP)+,D0');
|
||||
EmitLn('NEG D0');
|
||||
end;
|
||||
|
||||
|
||||
{---------------------------------------------------------------}
|
||||
{ Parse and Translate an Expression }
|
||||
|
||||
procedure Expression;
|
||||
begin
|
||||
Term;
|
||||
while Look in ['+', '-'] do begin
|
||||
EmitLn('MOVE D0,-(SP)');
|
||||
case Look of
|
||||
'+': Add;
|
||||
'-': Subtract;
|
||||
else Expected('Addop');
|
||||
end;
|
||||
end;
|
||||
end;
|
||||
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
Hot dog! A NEARLY functional parser/translator, in only 55 lines
|
||||
of Pascal! The output is starting to look really useful, if you
|
||||
continue to overlook the inefficiency, which I hope you will.
|
||||
Remember, we're not trying to produce tight code here.
|
||||
|
||||
|
||||
PARENTHESES
|
||||
|
||||
We can wrap up this part of the parser with the addition of
|
||||
parentheses with math expressions. As you know, parentheses are
|
||||
a mechanism to force a desired operator precedence. So, for
|
||||
example, in the expression
|
||||
|
||||
2*(3+4) ,
|
||||
|
||||
the parentheses force the addition before the multiply. Much
|
||||
more importantly, though, parentheses give us a mechanism for
|
||||
defining expressions of any degree of complexity, as in
|
||||
|
||||
(1+2)/((3+4)+(5-6))
|
||||
|
||||
The key to incorporating parentheses into our parser is to
|
||||
realize that no matter how complicated an expression enclosed by
|
||||
parentheses may be, to the rest of the world it looks like a
|
||||
simple factor. That is, one of the forms for a factor is:
|
||||
|
||||
<factor> ::= (<expression>)
|
||||
|
||||
This is where the recursion comes in. An expression can contain a
|
||||
factor which contains another expression which contains a factor,
|
||||
etc., ad infinitum.
|
||||
|
||||
Complicated or not, we can take care of this by adding just a few
|
||||
lines of Pascal to procedure Factor:
|
||||
|
||||
|
||||
{---------------------------------------------------------------}
|
||||
{ Parse and Translate a Math Factor }
|
||||
|
||||
procedure Expression; Forward;
|
||||
|
||||
procedure Factor;
|
||||
begin
|
||||
if Look = '(' then begin
|
||||
Match('(');
|
||||
Expression;
|
||||
Match(')');
|
||||
end
|
||||
else
|
||||
EmitLn('MOVE #' + GetNum + ',D0');
|
||||
end;
|
||||
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
Note again how easily we can extend the parser, and how well the
|
||||
Pascal code matches the BNF syntax.
|
||||
|
||||
As usual, compile the new version and make sure that it correctly
|
||||
parses legal sentences, and flags illegal ones with an error
|
||||
message.
|
||||
|
||||
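If you'd like to convince yourself that this grammar really does
enforce precedence and nesting, one option is to make the same
three routines compute a value instead of emitting code.  What
follows is only a sketch under stated assumptions, not part of
the compiler we're building: the names Src, Cursor, Peek,
EvalFactor, EvalTerm, EvalExpression and Evaluate are made up for
the illustration, the input comes from a string rather than the
keyboard, numbers are single digits, and there is no error
checking at all.

```pascal
{ Illustration only: these names are hypothetical and do not }
{ appear in the tutorial's parser.                           }

var Src: string;          { the expression being evaluated }
    Cursor: integer;      { index of the lookahead character in Src }

{ Return the lookahead character, or #0 at the end of the input }
function Peek: char;
begin
   if Cursor <= Length(Src) then Peek := Src[Cursor]
   else Peek := #0;
end;

function EvalExpression: integer; forward;

{ <factor> ::= <digit> | (<expression>) }
function EvalFactor: integer;
begin
   if Peek = '(' then begin
      Cursor := Cursor + 1;                 { skip '(' }
      EvalFactor := EvalExpression;
      Cursor := Cursor + 1;                 { skip ')' }
   end
   else begin
      EvalFactor := Ord(Peek) - Ord('0');   { single decimal digit }
      Cursor := Cursor + 1;
   end;
end;

{ <term> ::= <factor> [ <mulop> <factor> ]* }
function EvalTerm: integer;
var Value: integer;
    Op: char;
begin
   Value := EvalFactor;
   while (Peek = '*') or (Peek = '/') do begin
      Op := Peek;
      Cursor := Cursor + 1;
      if Op = '*' then Value := Value * EvalFactor
      else Value := Value div EvalFactor;   { integer division }
   end;
   EvalTerm := Value;
end;

{ <expression> ::= <term> [ <addop> <term> ]* }
function EvalExpression: integer;
var Value: integer;
    Op: char;
begin
   Value := EvalTerm;
   while (Peek = '+') or (Peek = '-') do begin
      Op := Peek;
      Cursor := Cursor + 1;
      if Op = '+' then Value := Value + EvalTerm
      else Value := Value - EvalTerm;
   end;
   EvalExpression := Value;
end;

{ Evaluate a complete expression held in a string }
function Evaluate(const S: string): integer;
begin
   Src := S;
   Cursor := 1;
   Evaluate := EvalExpression;
end;
```

With these definitions, Evaluate('2+3*4') and Evaluate('2*(3+4)')
both return 14.  Each function has the same shape as the
corresponding parsing procedure; only the actions differ.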
|
||||
UNARY MINUS
|
||||
|
||||
At this point, we have a parser that can handle just about any
|
||||
expression, right? OK, try this input sentence:
|
||||
|
||||
-1
|
||||
|
||||
WOOPS! It doesn't work, does it? Procedure Expression expects
|
||||
everything to start with an integer, so it coughs up the leading
|
||||
minus sign. You'll find that +3 won't work either, nor will
|
||||
something like
|
||||
|
||||
-(3-2) .
|
||||
|
||||
There are a couple of ways to fix the problem. The easiest
|
||||
(although not necessarily the best) way is to stick an imaginary
|
||||
leading zero in front of expressions of this type, so that -3
|
||||
becomes 0-3. We can easily patch this into our existing version
|
||||
of Expression:
|
||||
|
||||
|
||||
|
||||
{---------------------------------------------------------------}
|
||||
{ Parse and Translate an Expression }
|
||||
|
||||
procedure Expression;
|
||||
begin
|
||||
if IsAddop(Look) then
|
||||
EmitLn('CLR D0')
|
||||
else
|
||||
Term;
|
||||
while IsAddop(Look) do begin
|
||||
EmitLn('MOVE D0,-(SP)');
|
||||
case Look of
|
||||
'+': Add;
|
||||
'-': Subtract;
|
||||
else Expected('Addop');
|
||||
end;
|
||||
end;
|
||||
end;
|
||||
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
I TOLD you that making changes was easy! This time it cost us
|
||||
only three new lines of Pascal. Note the new reference to
|
||||
function IsAddop. Since the test for an addop appeared twice, I
|
||||
chose to embed it in the new function. The form of IsAddop
|
||||
should be apparent from that for IsAlpha. Here it is:
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Recognize an Addop }
|
||||
|
||||
function IsAddop(c: char): boolean;
|
||||
begin
|
||||
IsAddop := c in ['+', '-'];
|
||||
end;
|
||||
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
OK, make these changes to the program and recompile. You should
|
||||
also include IsAddop in your baseline copy of the cradle. We'll
|
||||
be needing it again later. Now try the input -1 again. Wow!
|
||||
The efficiency of the code is pretty poor ... six lines of code
|
||||
just for loading a simple constant ... but at least it's correct.
|
||||
Remember, we're not trying to replace Turbo Pascal here.
|
||||
|
||||
At this point we're just about finished with the structure of our
|
||||
expression parser. This version of the program should correctly
|
||||
parse and compile just about any expression you care to throw at
|
||||
it. It's still limited in that we can only handle factors
|
||||
involving single decimal digits. But I hope that by now you're
|
||||
starting to get the message that we can accommodate further
|
||||
extensions with just some minor changes to the parser. You
|
||||
probably won't be surprised to hear that a variable or even a
|
||||
function call is just another kind of a factor.
|
||||
|
||||
In the next session, I'll show you just how easy it is to extend
|
||||
our parser to take care of these things too, and I'll also show
|
||||
you just how easily we can accommodate multicharacter numbers and
|
||||
variable names. So you see, we're not far at all from a truly
|
||||
useful parser.
|
||||
|
||||
|
||||
|
||||
|
||||
A WORD ABOUT OPTIMIZATION
|
||||
|
||||
Earlier in this session, I promised to give you some hints as to
|
||||
how we can improve the quality of the generated code. As I said,
|
||||
the production of tight code is not the main purpose of this
|
||||
series of articles. But you need to at least know that we aren't
|
||||
just wasting our time here ... that we can indeed modify the
|
||||
parser further to make it produce better code, without throwing
|
||||
away everything we've done to date. As usual, it turns out that
|
||||
SOME optimization is not that difficult to do ... it simply takes
|
||||
some extra code in the parser.
|
||||
|
||||
There are two basic approaches we can take:
|
||||
|
||||
o Try to fix up the code after it's generated
|
||||
|
||||
This is the concept of "peephole" optimization. The general
|
||||
idea is that we know what combinations of instructions the
|
||||
compiler is going to generate, and we also know which ones
|
||||
are pretty bad (such as the code for -1, above). So all we
|
||||
do is to scan the produced code, looking for those
|
||||
combinations, and replacing them by better ones. It's sort
|
||||
of a macro expansion, in reverse, and a fairly
|
||||
straightforward exercise in pattern-matching. The only
|
||||
complication, really, is that there may be a LOT of such
|
||||
combinations to look for. It's called peephole optimization
|
||||
simply because it only looks at a small group of instructions
|
||||
at a time. Peephole optimization can have a dramatic effect
|
||||
on the quality of the code, with little change to the
|
||||
structure of the compiler itself. There is a price to pay,
|
||||
though, in the speed, size, and complexity of the
|
||||
compiler. Looking for all those combinations calls for a lot
|
||||
of IF tests, each one of which is a source of error. And, of
|
||||
course, it takes time.
|
||||
|
||||
In the classical implementation of a peephole optimizer,
|
||||
it's done as a second pass to the compiler. The output code
|
||||
is written to disk, and then the optimizer reads and
|
||||
processes the disk file again. As a matter of fact, you can
|
||||
see that the optimizer could even be a separate PROGRAM from
|
||||
the compiler proper. Since the optimizer only looks at the
|
||||
code through a small "window" of instructions (hence the
|
||||
name), a better implementation would be to simply buffer up a
|
||||
few lines of output, and scan the buffer after each EmitLn.
|
||||
|
||||
o Try to generate better code in the first place
|
||||
|
||||
This approach calls for us to look for special cases BEFORE
|
||||
we Emit them. As a trivial example, we should be able to
|
||||
identify a constant zero, and Emit a CLR instead of a load,
|
||||
or even do nothing at all, as in an add of zero, for example.
|
||||
Closer to home, if we had chosen to recognize the unary minus
|
||||
in Factor instead of in Expression, we could treat constants
|
||||
like -1 as ordinary constants, rather than generating them
|
||||
from positive ones. None of these things are difficult to
|
||||
deal with ... they only add extra tests in the code, which is
|
||||
why I haven't included them in our program. The way I see
|
||||
it, once we get to the point that we have a working compiler,
|
||||
generating useful code that executes, we can always go back
|
||||
and tweak the thing to tighten up the code produced. That's
|
||||
why there are Release 2.0's in the world.
|
||||
|
||||
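As a sketch of the second approach, here is one way the
constant-zero case might be caught before anything is emitted.
LoadConstCode is a hypothetical helper, not something that
appears in our parser; it only builds the instruction text and
leaves the emitting to the caller.

```pascal
{ Hypothetical helper: return the instruction that loads the  }
{ constant Num into D0, using CLR when the constant is zero.  }
function LoadConstCode(const Num: string): string;
begin
   if Num = '0' then
      LoadConstCode := 'CLR D0'
   else
      LoadConstCode := 'MOVE #' + Num + ',D0';
end;
```

If you wanted to try it, the last line of Factor would become
EmitLn(LoadConstCode(GetNum)).  That is one extra test for every
constant, which is exactly the kind of cost described above.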
There IS one more type of optimization worth mentioning, that
|
||||
seems to promise pretty tight code without too much hassle. It's
|
||||
my "invention" in the sense that I haven't seen it suggested in
|
||||
print anywhere, though I have no illusions that it's original
|
||||
with me.
|
||||
|
||||
This is to avoid such a heavy use of the stack, by making better
|
||||
use of the CPU registers. Remember back when we were doing only
|
||||
addition and subtraction, that we used registers D0 and D1,
|
||||
rather than the stack? It worked, because with only those two
|
||||
operations, the "stack" never needs more than two entries.
|
||||
|
||||
Well, the 68000 has eight data registers. Why not use them as a
|
||||
privately managed stack? The key is to recognize that, at any
|
||||
point in its processing, the parser KNOWS how many items are on
|
||||
the stack, so it can indeed manage it properly. We can define a
|
||||
private "stack pointer" that keeps track of which stack level
|
||||
we're at, and addresses the corresponding register. Procedure
|
||||
Factor, for example, would not cause data to be loaded into
|
||||
register D0, but into whatever the current "top-of-stack"
|
||||
register happened to be.
|
||||
|
||||
What we're doing in effect is to replace the CPU's RAM stack with
|
||||
a locally managed stack made up of registers. For most
|
||||
expressions, the stack level will never exceed eight, so we'll
|
||||
get pretty good code out. Of course, we also have to deal with
|
||||
those odd cases where the stack level DOES exceed eight, but
|
||||
that's no problem either. We simply let the stack spill over
|
||||
into the CPU stack. For levels beyond eight, the code is no
|
||||
worse than what we're generating now, and for levels less than
|
||||
eight, it's considerably better.
|
||||
|
||||
For the record, I have implemented this concept, just to make
|
||||
sure it works before I mentioned it to you. It does. In
|
||||
practice, it turns out that you can't really use all eight levels
|
||||
... you need at least one register free to reverse the operand
|
||||
order for division (sure wish the 68000 had an XTHL, like the
|
||||
8080!). For expressions that include function calls, we would
|
||||
also need a register reserved for them. Still, there is a nice
|
||||
improvement in code size for most expressions.
|
||||
|
||||
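The bookkeeping for such a register stack can be sketched in a
few lines.  This is not the implementation mentioned above, just
a simplified variant under stated assumptions: D0 stays the
primary register, stack levels 1 through 6 live in D1..D6, D7 is
left free as scratch, and deeper levels spill to the CPU stack.
The names MaxRegLevels, StackLevel, RegName, PushCode and
PopOperand are invented for the illustration.

```pascal
{ Hypothetical sketch of a privately managed register stack }

const MaxRegLevels = 6;   { assumption: D1..D6 hold stack levels }

var StackLevel: integer;  { items now on the private stack;   }
                          { set it to 0 before parsing starts }

{ Name of the data register that holds stack level N (1..6) }
function RegName(N: integer): string;
begin
   RegName := 'D' + Chr(Ord('0') + N);
end;

{ Instruction that saves the primary register D0 on the stack }
function PushCode: string;
begin
   StackLevel := StackLevel + 1;
   if StackLevel <= MaxRegLevels then
      PushCode := 'MOVE D0,' + RegName(StackLevel)
   else
      PushCode := 'MOVE D0,-(SP)';      { spill to the CPU stack }
end;

{ Operand text for the top item; also removes it from the stack }
function PopOperand: string;
begin
   if StackLevel <= MaxRegLevels then
      PopOperand := RegName(StackLevel)
   else
      PopOperand := '(SP)+';            { item was spilled }
   StackLevel := StackLevel - 1;
end;
```

Under this scheme, Expression and Term would call
EmitLn(PushCode) where they now emit the push, and Add would
become EmitLn('ADD ' + PopOperand + ',D0').  Divide would have to
do its operand reversal in the scratch register.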
So, you see, getting better code isn't that difficult, but it
|
||||
does add complexity to our translator ... complexity we can
|
||||
do without at this point. For that reason, I STRONGLY suggest
|
||||
that we continue to ignore efficiency issues for the rest of this
|
||||
series, secure in the knowledge that we can indeed improve the
|
||||
code quality without throwing away what we've done.
|
||||
|
||||
Next lesson, I'll show you how to deal with variable factors and
|
||||
function calls. I'll also show you just how easy it is to handle
|
||||
multicharacter tokens and embedded white space.
|
||||
|
||||
*****************************************************************
|
||||
* *
|
||||
* COPYRIGHT NOTICE *
|
||||
* *
|
||||
* Copyright (C) 1988 Jack W. Crenshaw. All rights reserved. *
|
||||
* *
|
||||
*****************************************************************
|
||||
|
||||
|
||||
|
||||
|
||||
@@ -0,0 +1,946 @@
|
||||
|
||||
|
||||
LET'S BUILD A COMPILER!
|
||||
|
||||
By
|
||||
|
||||
Jack W. Crenshaw, Ph.D.
|
||||
|
||||
4 Aug 1988
|
||||
|
||||
|
||||
Part III: MORE EXPRESSIONS
|
||||
|
||||
|
||||
*****************************************************************
|
||||
* *
|
||||
* COPYRIGHT NOTICE *
|
||||
* *
|
||||
* Copyright (C) 1988 Jack W. Crenshaw. All rights reserved. *
|
||||
* *
|
||||
*****************************************************************
|
||||
|
||||
|
||||
INTRODUCTION
|
||||
|
||||
In the last installment, we examined the techniques used to parse
|
||||
and translate a general math expression. We ended up with a
|
||||
simple parser that could handle arbitrarily complex expressions,
|
||||
with two restrictions:
|
||||
|
||||
o No variables were allowed, only numeric factors
|
||||
|
||||
o The numeric factors were limited to single digits
|
||||
|
||||
In this installment, we'll get rid of those restrictions. We'll
|
||||
also extend what we've done to include assignment statements
|
||||
and function calls.  Remember, though, that the second
|
||||
restriction was mainly self-imposed ... a choice of convenience
|
||||
on our part, to make life easier and to let us concentrate on the
|
||||
fundamental concepts. As you'll see in a bit, it's an easy
|
||||
restriction to get rid of, so don't get too hung up about it.
|
||||
We'll use the trick when it serves us to do so, confident that we
|
||||
can discard it when we're ready to.
|
||||
|
||||
|
||||
VARIABLES
|
||||
|
||||
Most expressions that we see in practice involve variables, such
|
||||
as
|
||||
|
||||
b * b + 4 * a * c
|
||||
|
||||
No parser is much good without being able to deal with them.
|
||||
Fortunately, it's also quite easy to do.
|
||||
|
||||
Remember that in our parser as it currently stands, there are two
|
||||
kinds of factors allowed: integer constants and expressions
|
||||
within parentheses. In BNF notation,
|
||||
|
||||
<factor> ::= <number> | (<expression>)
|
||||
|
||||
The '|' stands for "or", meaning of course that either form is a
|
||||
legal form for a factor. Remember, too, that we had no trouble
|
||||
knowing which was which ... the lookahead character is a left
|
||||
paren '(' in one case, and a digit in the other.
|
||||
|
||||
It probably won't come as too much of a surprise that a variable
|
||||
is just another kind of factor. So we extend the BNF above to
|
||||
read:
|
||||
|
||||
|
||||
<factor> ::= <number> | (<expression>) | <variable>
|
||||
|
||||
|
||||
Again, there is no ambiguity: if the lookahead character is a
|
||||
letter, we have a variable; if a digit, we have a number. Back
|
||||
when we translated the number, we just issued code to load the
|
||||
number, as immediate data, into D0. Now we do the same, only we
|
||||
load a variable.
|
||||
|
||||
A minor complication in the code generation arises from the fact
|
||||
that most 68000 operating systems, including the SK*DOS that I'm
|
||||
using, require the code to be written in "position-independent"
|
||||
form, which basically means that everything is PC-relative. The
|
||||
format for a load in this language is
|
||||
|
||||
MOVE X(PC),D0
|
||||
|
||||
where X is, of course, the variable name. Armed with that, let's
|
||||
modify the current version of Factor to read:
|
||||
|
||||
|
||||
{---------------------------------------------------------------}
|
||||
{ Parse and Translate a Math Factor }
|
||||
|
||||
procedure Expression; Forward;
|
||||
|
||||
procedure Factor;
|
||||
begin
|
||||
if Look = '(' then begin
|
||||
Match('(');
|
||||
Expression;
|
||||
Match(')');
|
||||
end
|
||||
else if IsAlpha(Look) then
|
||||
EmitLn('MOVE ' + GetName + '(PC),D0')
|
||||
else
|
||||
EmitLn('MOVE #' + GetNum + ',D0');
|
||||
end;
|
||||
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
I've remarked before how easy it is to add extensions to the
|
||||
parser, because of the way it's structured. You can see that
|
||||
this still holds true here. This time it cost us all of two
|
||||
extra lines of code. Notice, too, how the if-else-else structure
|
||||
exactly parallels the BNF syntax equation.
|
||||
|
||||
OK, compile and test this new version of the parser. That didn't
|
||||
hurt too badly, did it?
|
||||
|
||||
|
||||
FUNCTIONS
|
||||
|
||||
There is only one other common kind of factor supported by most
|
||||
languages: the function call. It's really too early for us to
|
||||
deal with functions well, because we haven't yet addressed the
|
||||
issue of parameter passing. What's more, a "real" language would
|
||||
include a mechanism to support more than one type, one of which
|
||||
should be a function type. We haven't gotten there yet, either.
|
||||
But I'd still like to deal with functions now for a couple of
|
||||
reasons. First, it lets us finally wrap up the parser in
|
||||
something very close to its final form, and second, it brings up
|
||||
a new issue which is very much worth talking about.
|
||||
|
||||
Up till now, we've been able to write what is called a
|
||||
"predictive parser." That means that at any point, we can know
|
||||
by looking at the current lookahead character exactly what to do
|
||||
next. That isn't the case when we add functions. Every language
|
||||
has some naming rules for what constitutes a legal identifier.
|
||||
For the present, ours is simply that it is one of the letters
|
||||
'a'..'z'. The problem is that a variable name and a function
|
||||
name obey the same rules. So how can we tell which is which?
|
||||
One way is to require that they each be declared before they are
|
||||
used. Pascal takes that approach. The other is that we might
|
||||
require a function to be followed by a (possibly empty) parameter
|
||||
list. That's the rule used in C.
|
||||
|
||||
Since we don't yet have a mechanism for declaring types, let's
|
||||
use the C rule for now. Since we also don't have a mechanism to
|
||||
deal with parameters, we can only handle empty lists, so our
|
||||
function calls will have the form
|
||||
|
||||
x() .
|
||||
|
||||
Since we're not dealing with parameter lists yet, there is
|
||||
nothing to do but to call the function, so we need only to issue
|
||||
a BSR (call) instead of a MOVE.
|
||||
|
||||
Now that there are two possibilities for the "If IsAlpha" branch
|
||||
of the test in Factor, let's treat them in a separate procedure.
|
||||
Modify Factor to read:
|
||||
|
||||
|
||||
{---------------------------------------------------------------}
|
||||
{ Parse and Translate a Math Factor }
|
||||
|
||||
procedure Expression; Forward;
|
||||
|
||||
procedure Factor;
|
||||
begin
|
||||
if Look = '(' then begin
|
||||
Match('(');
|
||||
Expression;
|
||||
Match(')');
|
||||
end
|
||||
else if IsAlpha(Look) then
|
||||
Ident
|
||||
else
|
||||
EmitLn('MOVE #' + GetNum + ',D0');
|
||||
end;
|
||||
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
and insert before it the new procedure
|
||||
|
||||
|
||||
{---------------------------------------------------------------}
|
||||
{ Parse and Translate an Identifier }
|
||||
|
||||
procedure Ident;
|
||||
var Name: char;
|
||||
begin
|
||||
Name := GetName;
|
||||
if Look = '(' then begin
|
||||
Match('(');
|
||||
Match(')');
|
||||
EmitLn('BSR ' + Name);
|
||||
end
|
||||
else
|
||||
EmitLn('MOVE ' + Name + '(PC),D0')
|
||||
end;
|
||||
{---------------------------------------------------------------}
|
||||
|
||||
|
||||
OK, compile and test this version. Does it parse all legal
|
||||
expressions? Does it correctly flag badly formed ones?
|
||||
|
||||
The important thing to notice is that even though we no longer
|
||||
have a predictive parser, there is little or no complication
|
||||
added with the recursive descent approach that we're using. At
|
||||
the point where Factor finds an identifier (letter), it doesn't
|
||||
know whether it's a variable name or a function name, nor does it
|
||||
really care. It simply passes it on to Ident and leaves it up to
|
||||
that procedure to figure it out. Ident, in turn, simply tucks
|
||||
away the identifier and then reads one more character to decide
|
||||
which kind of identifier it's dealing with.
|
||||
|
||||
Keep this approach in mind. It's a very powerful concept, and it
|
||||
should be used whenever you encounter an ambiguous situation
|
||||
requiring further lookahead. Even if you had to look several
|
||||
tokens ahead, the principle would still work.
|
||||
|
||||
|
||||
MORE ON ERROR HANDLING
|
||||
|
||||
As long as we're talking philosophy, there's another important
|
||||
issue to point out: error handling. Notice that although the
|
||||
parser correctly rejects (almost) every malformed expression we
|
||||
can throw at it, with a meaningful error message, we haven't
|
||||
really had to do much work to make that happen. In fact, in the
|
||||
whole parser per se (from Ident through Expression) there are
|
||||
only two calls to the error routine, Expected. Even those aren't
|
||||
necessary ... if you'll look again in Term and Expression, you'll
|
||||
see that those statements can't be reached. I put them in early
|
||||
on as a bit of insurance, but they're no longer needed. Why
|
||||
don't you delete them now?
|
||||
|
||||
So how did we get this nice error handling virtually for free?
|
||||
It's simply that I've carefully avoided reading a character
|
||||
directly using GetChar. Instead, I've relied on the error
|
||||
handling in GetName, GetNum, and Match to do all the error
|
||||
checking for me. Astute readers will notice that some of the
|
||||
calls to Match (for example, the ones in Add and Subtract) are
|
||||
also unnecessary ... we already know what the character is by the
|
||||
time we get there ... but it maintains a certain symmetry to
|
||||
leave them in, and the general rule to always use Match instead
|
||||
of GetChar is a good one.
|
||||
|
||||
I mentioned an "almost" above. There is a case where our error
|
||||
handling leaves a bit to be desired. So far we haven't told our
|
||||
parser what an end-of-line looks like, or what to do with
|
||||
embedded white space. So a space character (or any other
|
||||
character not part of the recognized character set) simply causes
|
||||
the parser to terminate, ignoring the unrecognized characters.
|
||||
|
||||
It could be argued that this is reasonable behavior at this
|
||||
point. In a "real" compiler, there is usually another statement
|
||||
following the one we're working on, so any characters not treated
|
||||
as part of our expression will either be used for or rejected as
|
||||
part of the next one.
|
||||
|
||||
But it's also a very easy thing to fix up, even if it's only
|
||||
temporary. All we have to do is assert that the expression
|
||||
should end with an end-of-line, i.e., a carriage return.
|
||||
|
||||
To see what I'm talking about, try the input line
|
||||
|
||||
1+2 <space> 3+4
|
||||
|
||||
See how the space was treated as a terminator? Now, to make the
|
||||
compiler properly flag this, add the line
|
||||
|
||||
if Look <> CR then Expected('Newline');
|
||||
|
||||
in the main program, just after the call to Expression. That
|
||||
catches anything left over in the input stream. Don't forget to
|
||||
define CR in the const statement:
|
||||
|
||||
CR = ^M;
|
||||
|
||||
As usual, recompile the program and verify that it does what it's
|
||||
supposed to.
|
||||
|
||||
|
||||
ASSIGNMENT STATEMENTS
|
||||
|
||||
OK, at this point we have a parser that works very nicely. I'd
|
||||
like to point out that we got it using only 88 lines of
|
||||
executable code, not counting what was in the cradle. The
|
||||
compiled object file is a whopping 4752 bytes. Not bad,
|
||||
considering we weren't trying very hard to save either source
|
||||
code or object size. We just stuck to the KISS principle.
|
||||
|
||||
Of course, parsing an expression is not much good without having
|
||||
something to do with it afterwards. Expressions USUALLY (but not
|
||||
always) appear in assignment statements, in the form
|
||||
|
||||
<Ident> = <Expression>
|
||||
|
||||
We're only a breath away from being able to parse an assignment
|
||||
statement, so let's take that last step. Just after procedure
|
||||
Expression, add the following new procedure:
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Parse and Translate an Assignment Statement }
|
||||
|
||||
procedure Assignment;
|
||||
var Name: char;
|
||||
begin
|
||||
Name := GetName;
|
||||
Match('=');
|
||||
Expression;
|
||||
EmitLn('LEA ' + Name + '(PC),A0');
|
||||
EmitLn('MOVE D0,(A0)')
|
||||
end;
|
||||
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
Note again that the code exactly parallels the BNF. And notice
|
||||
further that the error checking was painless, handled by GetName
|
||||
and Match.
|
||||
|
||||
The reason for the two lines of assembler has to do with a
|
||||
peculiarity in the 68000, which requires this kind of construct
|
||||
for PC-relative code.
|
||||
|
||||
Now change the call to Expression, in the main program, to one to
|
||||
Assignment. That's all there is to it.
|
||||
|
||||
Son of a gun! We are actually compiling assignment statements.
|
||||
If those were the only kind of statements in a language, all we'd
|
||||
have to do is put this in a loop and we'd have a full-fledged
|
||||
compiler!
|
||||
|
||||
Well, of course they're not the only kind. There are also little
|
||||
items like control statements (IFs and loops), procedures,
|
||||
declarations, etc. But cheer up. The arithmetic expressions
|
||||
that we've been dealing with are among the most challenging in a
|
||||
language. Compared to what we've already done, control
|
||||
statements will be easy. I'll be covering them in the fifth
|
||||
installment. And the other statements will all fall in line, as
|
||||
long as we remember to KISS.
|
||||
|
||||
|
||||
MULTI-CHARACTER TOKENS
|
||||
|
||||
Throughout this series, I've been carefully restricting
|
||||
everything we do to single-character tokens, all the while
|
||||
assuring you that it wouldn't be difficult to extend to multi-
|
||||
character ones. I don't know if you believed me or not ... I
|
||||
wouldn't really blame you if you were a bit skeptical. I'll
|
||||
continue to use that approach in the sessions which follow,
|
||||
because it helps keep complexity away. But I'd like to back up
|
||||
those assurances, and wrap up this portion of the parser, by
|
||||
showing you just how easy that extension really is. In the
|
||||
process, we'll also provide for embedded white space. Before you
|
||||
make the next few changes, though, save the current version of
|
||||
the parser away under another name. I have some more uses for it
|
||||
in the next installment, and we'll be working with the single-
|
||||
character version.
|
||||
|
||||
Most compilers separate out the handling of the input stream into
|
||||
a separate module called the lexical scanner. The idea is that
|
||||
the scanner deals with all the character-by-character input, and
|
||||
returns the separate units (tokens) of the stream. There may
|
||||
come a time when we'll want to do something like that, too, but
|
||||
for now there is no need. We can handle the multi-character
|
||||
tokens that we need by very slight and very local modifications
|
||||
to GetName and GetNum.
|
||||
|
||||
The usual definition of an identifier is that the first character
|
||||
must be a letter, but the rest can be alphanumeric (letters or
|
||||
numbers). To deal with this, we need one other recognizer
|
||||
function:
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Recognize an Alphanumeric }
|
||||
|
||||
function IsAlNum(c: char): boolean;
|
||||
begin
|
||||
IsAlNum := IsAlpha(c) or IsDigit(c);
|
||||
end;
|
||||
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
Add this function to your parser. I put mine just after IsDigit.
|
||||
While you're at it, might as well include it as a permanent
|
||||
member of Cradle, too.
|
||||
|
||||
Now, we need to modify function GetName to return a string
|
||||
instead of a character:
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Get an Identifier }
|
||||
|
||||
function GetName: string;
|
||||
var Token: string;
|
||||
begin
|
||||
Token := '';
|
||||
if not IsAlpha(Look) then Expected('Name');
|
||||
while IsAlNum(Look) do begin
|
||||
Token := Token + UpCase(Look);
|
||||
GetChar;
|
||||
end;
|
||||
GetName := Token;
|
||||
end;
|
||||
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
Similarly, modify GetNum to read:
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Get a Number }
|
||||
|
||||
function GetNum: string;
|
||||
var Value: string;
|
||||
begin
|
||||
Value := '';
|
||||
if not IsDigit(Look) then Expected('Integer');
|
||||
while IsDigit(Look) do begin
|
||||
Value := Value + Look;
|
||||
GetChar;
|
||||
end;
|
||||
GetNum := Value;
|
||||
end;
|
||||
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
Amazingly enough, those are virtually all the changes required to
|
||||
the parser! The local variable Name in procedures Ident and
|
||||
Assignment was originally declared as "char", and must now be
|
||||
declared string[8]. (Clearly, we could make the string length
|
||||
longer if we chose, but most assemblers limit the length anyhow.)
|
||||
Make this change, and then recompile and test. _NOW_ do you
|
||||
believe that it's a simple change?
|
||||
|
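To see the two new scanners in isolation, here is a minimal sketch
that runs them over a fixed string.  The names Source, Cursor, and
Quit are my own assumptions for the demo and are not part of Cradle:
Source stands in for standard input, and Quit stands in for Expected.

```pascal
program TokenDemo;

{ Assumptions: input is taken from the string Source rather than from
  standard input, #0 marks the end of the input, and Quit stands in
  for the tutorial's Expected. }

var Source: string;
    Cursor: integer;      { index of the next unread character }
    Look: char;           { lookahead character }
    Name, Digits: string;

procedure GetChar;
begin
   if Cursor <= Length(Source) then begin
      Look := Source[Cursor];
      Inc(Cursor);
   end
   else
      Look := #0;
end;

procedure Quit(s: string);
begin
   WriteLn('Error: ', s, ' Expected.');
   Halt(1);
end;

function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;

function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;

function IsAlNum(c: char): boolean;
begin
   IsAlNum := IsAlpha(c) or IsDigit(c);
end;

{ First character must be a letter, the rest letters or digits }
function GetName: string;
var Token: string;
begin
   Token := '';
   if not IsAlpha(Look) then Quit('Name');
   while IsAlNum(Look) do begin
      Token := Token + UpCase(Look);
      GetChar;
   end;
   GetName := Token;
end;

{ One or more digits, returned as text }
function GetNum: string;
var Value: string;
begin
   Value := '';
   if not IsDigit(Look) then Quit('Integer');
   while IsDigit(Look) do begin
      Value := Value + Look;
      GetChar;
   end;
   GetNum := Value;
end;

begin
   Source := 'count42=1234';
   Cursor := 1;
   GetChar;
   Name := GetName;              { stops with Look = '=' }
   GetChar;                      { step over '=' }
   Digits := GetNum;
   WriteLn(Name, ' ', Digits);   { COUNT42 1234 }
end.
```

Note that the digits in "count42" end up in the name, because
GetName only insists on a letter for the first character.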
||||
|
||||
WHITE SPACE
|
||||
|
||||
Before we leave this parser for a while, let's address the issue
|
||||
of white space. As it stands now, the parser will barf (or
|
||||
simply terminate) on a single space character embedded anywhere
|
||||
in the input stream. That's pretty unfriendly behavior. So
|
||||
let's "productionize" the thing a bit by eliminating this last
|
||||
restriction.
|
||||
|
||||
The key to easy handling of white space is to come up with a
|
||||
simple rule for how the parser should treat the input stream, and
|
||||
to enforce that rule everywhere. Up till now, because white
|
||||
space wasn't permitted, we've been able to assume that after each
|
||||
parsing action, the lookahead character Look contains the next
|
||||
meaningful character, so we could test it immediately. Our
|
||||
design was based upon this principle.
|
||||
|
||||
It still sounds like a good rule to me, so that's the one we'll
|
||||
use. This means that every routine that advances the input
|
||||
stream must skip over white space, and leave the next non-white
|
||||
character in Look. Fortunately, because we've been careful to
|
||||
use GetName, GetNum, and Match for most of our input processing,
|
||||
it is only those three routines (plus Init) that we need to
|
||||
modify.
|
||||
|
||||
Not surprisingly, we start with yet another new recognizer
|
||||
routine:
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Recognize White Space }
|
||||
|
||||
function IsWhite(c: char): boolean;
|
||||
begin
|
||||
IsWhite := c in [' ', TAB];
|
||||
end;
|
||||
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
We also need a routine that will eat white-space characters,
|
||||
until it finds a non-white one:
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Skip Over Leading White Space }
|
||||
|
||||
procedure SkipWhite;
|
||||
begin
|
||||
while IsWhite(Look) do
|
||||
GetChar;
|
||||
end;
|
||||
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
Now, add calls to SkipWhite to Match, GetName, and GetNum as
|
||||
shown below:
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Match a Specific Input Character }
|
||||
|
||||
procedure Match(x: char);
|
||||
begin
|
||||
if Look <> x then Expected('''' + x + '''')
|
||||
else begin
|
||||
GetChar;
|
||||
SkipWhite;
|
||||
end;
|
||||
end;
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Get an Identifier }
|
||||
|
||||
function GetName: string;
|
||||
var Token: string;
|
||||
begin
|
||||
Token := '';
|
||||
if not IsAlpha(Look) then Expected('Name');
|
||||
while IsAlNum(Look) do begin
|
||||
Token := Token + UpCase(Look);
|
||||
GetChar;
|
||||
end;
|
||||
GetName := Token;
|
||||
SkipWhite;
|
||||
end;
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Get a Number }
|
||||
|
||||
function GetNum: string;
|
||||
var Value: string;
|
||||
begin
|
||||
Value := '';
|
||||
if not IsDigit(Look) then Expected('Integer');
|
||||
while IsDigit(Look) do begin
|
||||
Value := Value + Look;
|
||||
GetChar;
|
||||
end;
|
||||
GetNum := Value;
|
||||
SkipWhite;
|
||||
end;
|
||||
{--------------------------------------------------------------}
|
||||
|
||||
(Note that I rearranged Match a bit, without changing the
|
||||
functionality.)
|
||||
|
||||
Finally, we need to skip over leading blanks where we "prime the
|
||||
pump" in Init:
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Initialize }
|
||||
|
||||
procedure Init;
|
||||
begin
|
||||
GetChar;
|
||||
SkipWhite;
|
||||
end;
|
||||
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
Make these changes and recompile the program. You will find that
|
||||
you will have to move Match below SkipWhite, to avoid an error
|
||||
message from the Pascal compiler. Test the program as always to
|
||||
make sure it works properly.
|
||||
|
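The rule is easy to check by hand.  The sketch below is only an
illustration: Source, Cursor, and GetWord are hypothetical names, and
GetWord is a simplified stand-in for GetName and GetNum.  After every
scanning step, Look should hold the next meaningful character.

```pascal
program WhiteDemo;

{ Assumptions: input is taken from the string Source instead of
  standard input, and #0 marks the end of the input. }

const TAB = ^I;

var Source: string;
    Cursor: integer;
    Look: char;
    Token: string;

procedure GetChar;
begin
   if Cursor <= Length(Source) then begin
      Look := Source[Cursor];
      Inc(Cursor);
   end
   else
      Look := #0;
end;

function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB];
end;

procedure SkipWhite;
begin
   while IsWhite(Look) do
      GetChar;
end;

{ Read a run of letters and digits, then skip trailing white space }
function GetWord: string;
var Token: string;
begin
   Token := '';
   while UpCase(Look) in ['A'..'Z', '0'..'9'] do begin
      Token := Token + UpCase(Look);
      GetChar;
   end;
   GetWord := Token;
   SkipWhite;
end;

begin
   Source := '   total  =   42  ';
   Cursor := 1;
   GetChar;
   SkipWhite;                            { same priming as Init }
   Token := GetWord;
   WriteLn('[', Token, '] next: ', Look);   { [TOTAL] next: = }
   GetChar;
   SkipWhite;                            { what Match does on success }
   Token := GetWord;
   WriteLn('[', Token, ']');             { [42] }
   if Look = #0 then
      WriteLn('end of input reached');
end.
```

The blanks never show up in a token and never show up in Look, which
is exactly what the rest of the parser is counting on.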
||||
Since we've made quite a few changes during this session, I'm
|
||||
reproducing the entire parser below:
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
program parse;
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Constant Declarations }
|
||||
|
||||
const TAB = ^I;
|
||||
CR = ^M;
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Variable Declarations }
|
||||
|
||||
var Look: char; { Lookahead Character }
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Read New Character From Input Stream }
|
||||
|
||||
procedure GetChar;
|
||||
begin
|
||||
Read(Look);
|
||||
end;
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Report an Error }
|
||||
|
||||
procedure Error(s: string);
|
||||
begin
|
||||
WriteLn;
|
||||
WriteLn(^G, 'Error: ', s, '.');
|
||||
end;
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Report Error and Halt }
|
||||
|
||||
procedure Abort(s: string);
|
||||
begin
|
||||
Error(s);
|
||||
Halt;
|
||||
end;
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Report What Was Expected }
|
||||
|
||||
procedure Expected(s: string);
|
||||
begin
|
||||
Abort(s + ' Expected');
|
||||
end;
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Recognize an Alpha Character }
|
||||
|
||||
function IsAlpha(c: char): boolean;
|
||||
begin
|
||||
IsAlpha := UpCase(c) in ['A'..'Z'];
|
||||
end;
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Recognize a Decimal Digit }
|
||||
|
||||
function IsDigit(c: char): boolean;
|
||||
begin
|
||||
IsDigit := c in ['0'..'9'];
|
||||
end;
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Recognize an Alphanumeric }
|
||||
|
||||
function IsAlNum(c: char): boolean;
|
||||
begin
|
||||
IsAlNum := IsAlpha(c) or IsDigit(c);
|
||||
end;
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Recognize an Addop }
|
||||
|
||||
function IsAddop(c: char): boolean;
|
||||
begin
|
||||
IsAddop := c in ['+', '-'];
|
||||
end;
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Recognize White Space }
|
||||
|
||||
function IsWhite(c: char): boolean;
|
||||
begin
|
||||
IsWhite := c in [' ', TAB];
|
||||
end;
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Skip Over Leading White Space }
|
||||
|
||||
procedure SkipWhite;
|
||||
begin
|
||||
while IsWhite(Look) do
|
||||
GetChar;
|
||||
end;
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Match a Specific Input Character }
|
||||
|
||||
procedure Match(x: char);
|
||||
begin
|
||||
if Look <> x then Expected('''' + x + '''')
|
||||
else begin
|
||||
GetChar;
|
||||
SkipWhite;
|
||||
end;
|
||||
end;
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Get an Identifier }
|
||||
|
||||
function GetName: string;
|
||||
var Token: string;
|
||||
begin
|
||||
Token := '';
|
||||
if not IsAlpha(Look) then Expected('Name');
|
||||
while IsAlNum(Look) do begin
|
||||
Token := Token + UpCase(Look);
|
||||
GetChar;
|
||||
end;
|
||||
GetName := Token;
|
||||
SkipWhite;
|
||||
end;
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Get a Number }
|
||||
|
||||
function GetNum: string;
|
||||
var Value: string;
|
||||
begin
|
||||
Value := '';
|
||||
if not IsDigit(Look) then Expected('Integer');
|
||||
while IsDigit(Look) do begin
|
||||
Value := Value + Look;
|
||||
GetChar;
|
||||
end;
|
||||
GetNum := Value;
|
||||
SkipWhite;
|
||||
end;
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Output a String with Tab }
|
||||
|
||||
procedure Emit(s: string);
|
||||
begin
|
||||
Write(TAB, s);
|
||||
end;
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Output a String with Tab and CRLF }
|
||||
|
||||
procedure EmitLn(s: string);
|
||||
begin
|
||||
Emit(s);
|
||||
WriteLn;
|
||||
end;
|
||||
|
||||
|
||||
{---------------------------------------------------------------}
|
||||
{ Parse and Translate an Identifier }
|
||||
|
||||
procedure Ident;
|
||||
var Name: string[8];
|
||||
begin
|
||||
Name:= GetName;
|
||||
if Look = '(' then begin
|
||||
Match('(');
|
||||
Match(')');
|
||||
EmitLn('BSR ' + Name);
|
||||
end
|
||||
else
|
||||
EmitLn('MOVE ' + Name + '(PC),D0');
|
||||
end;
|
||||
|
||||
|
||||
{---------------------------------------------------------------}
|
||||
{ Parse and Translate a Math Factor }
|
||||
|
||||
procedure Expression; Forward;
|
||||
|
||||
procedure Factor;
|
||||
begin
|
||||
if Look = '(' then begin
|
||||
Match('(');
|
||||
Expression;
|
||||
Match(')');
|
||||
end
|
||||
else if IsAlpha(Look) then
|
||||
Ident
|
||||
else
|
||||
EmitLn('MOVE #' + GetNum + ',D0');
|
||||
end;
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Recognize and Translate a Multiply }
|
||||
|
||||
procedure Multiply;
|
||||
begin
|
||||
Match('*');
|
||||
Factor;
|
||||
EmitLn('MULS (SP)+,D0');
|
||||
end;
|
||||
|
||||
|
||||
{-------------------------------------------------------------}
|
||||
{ Recognize and Translate a Divide }
|
||||
|
||||
procedure Divide;
|
||||
begin
|
||||
Match('/');
|
||||
Factor;
|
||||
EmitLn('MOVE (SP)+,D1');
|
||||
EmitLn('EXT.L D0');
|
||||
EmitLn('DIVS D1,D0');
|
||||
end;
|
||||
|
||||
|
||||
{---------------------------------------------------------------}
|
||||
{ Parse and Translate a Math Term }
|
||||
|
||||
procedure Term;
|
||||
begin
|
||||
Factor;
|
||||
while Look in ['*', '/'] do begin
|
||||
EmitLn('MOVE D0,-(SP)');
|
||||
case Look of
|
||||
'*': Multiply;
|
||||
'/': Divide;
|
||||
end;
|
||||
end;
|
||||
end;
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Recognize and Translate an Add }
|
||||
|
||||
procedure Add;
|
||||
begin
|
||||
Match('+');
|
||||
Term;
|
||||
EmitLn('ADD (SP)+,D0');
|
||||
end;
|
||||
|
||||
|
||||
{-------------------------------------------------------------}
|
||||
{ Recognize and Translate a Subtract }
|
||||
|
||||
procedure Subtract;
|
||||
begin
|
||||
Match('-');
|
||||
Term;
|
||||
EmitLn('SUB (SP)+,D0');
|
||||
EmitLn('NEG D0');
|
||||
end;
|
||||
|
||||
|
||||
{---------------------------------------------------------------}
|
||||
{ Parse and Translate an Expression }
|
||||
|
||||
procedure Expression;
|
||||
begin
|
||||
if IsAddop(Look) then
|
||||
EmitLn('CLR D0')
|
||||
else
|
||||
Term;
|
||||
while IsAddop(Look) do begin
|
||||
EmitLn('MOVE D0,-(SP)');
|
||||
case Look of
|
||||
'+': Add;
|
||||
'-': Subtract;
|
||||
end;
|
||||
end;
|
||||
end;
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Parse and Translate an Assignment Statement }
|
||||
|
||||
procedure Assignment;
|
||||
var Name: string[8];
|
||||
begin
|
||||
Name := GetName;
|
||||
Match('=');
|
||||
Expression;
|
||||
EmitLn('LEA ' + Name + '(PC),A0');
|
||||
EmitLn('MOVE D0,(A0)')
|
||||
end;
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Initialize }
|
||||
|
||||
procedure Init;
|
||||
begin
|
||||
GetChar;
|
||||
SkipWhite;
|
||||
end;
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Main Program }
|
||||
|
||||
begin
|
||||
Init;
|
||||
Assignment;
|
||||
If Look <> CR then Expected('NewLine');
|
||||
end.
|
||||
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
Now the parser is complete. It's got every feature we can put in
|
||||
a one-line "compiler." Tuck it away in a safe place. Next time
|
||||
we'll move on to a new subject, but we'll still be talking about
|
||||
expressions for quite a while.  Next installment, I plan to talk a
|
||||
bit about interpreters as opposed to compilers, and show you how
|
||||
the structure of the parser changes a bit as we change what sort
|
||||
of action has to be taken. The information we pick up there will
|
||||
serve us in good stead later on, even if you have no interest in
|
||||
interpreters. See you next time.
|
||||
|
||||
|
||||
*****************************************************************
|
||||
* *
|
||||
* COPYRIGHT NOTICE *
|
||||
* *
|
||||
* Copyright (C) 1988 Jack W. Crenshaw. All rights reserved. *
|
||||
* *
|
||||
*****************************************************************
|
||||
|
||||
|
||||
|
||||
|
||||
@@ -0,0 +1,701 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
LET'S BUILD A COMPILER!
|
||||
|
||||
By
|
||||
|
||||
Jack W. Crenshaw, Ph.D.
|
||||
|
||||
24 July 1988
|
||||
|
||||
|
||||
Part IV: INTERPRETERS
|
||||
|
||||
|
||||
*****************************************************************
|
||||
* *
|
||||
* COPYRIGHT NOTICE *
|
||||
* *
|
||||
* Copyright (C) 1988 Jack W. Crenshaw. All rights reserved. *
|
||||
* *
|
||||
*****************************************************************
|
||||
|
||||
|
||||
INTRODUCTION
|
||||
|
||||
In the first three installments of this series, we've looked at
|
||||
parsing and compiling math expressions, and worked our way grad-
|
||||
ually and methodically from dealing with very simple one-term,
|
||||
one-character "expressions" up through more general ones, finally
|
||||
arriving at a very complete parser that could parse and translate
|
||||
complete assignment statements, with multi-character tokens,
|
||||
embedded white space, and function calls. This time, I'm going
|
||||
to walk you through the process one more time, only with the goal
|
||||
of interpreting rather than compiling object code.
|
||||
|
||||
Since this is a series on compilers, why should we bother with
|
||||
interpreters? Simply because I want you to see how the nature of
|
||||
the parser changes as we change the goals. I also want to unify
|
||||
the concepts of the two types of translators, so that you can see
|
||||
not only the differences, but also the similarities.
|
||||
|
||||
Consider the assignment statement
|
||||
|
||||
x = 2 * y + 3
|
||||
|
||||
In a compiler, we want the target CPU to execute this assignment
|
||||
at EXECUTION time. The translator itself doesn't do any arith-
|
||||
metic ... it only issues the object code that will cause the CPU
|
||||
to do it when the code is executed. For the example above, the
|
||||
compiler would issue code to compute the expression and store the
|
||||
results in variable x.
|
||||
|
||||
For an interpreter, on the other hand, no object code is gen-
|
||||
erated. Instead, the arithmetic is computed immediately, as the
|
||||
parsing is going on. For the example, by the time parsing of the
|
||||
statement is complete, x will have a new value.
|
||||
|
||||
The approach we've been taking in this whole series is called
|
||||
"syntax-driven translation." As you are aware by now, the struc-
|
||||
ture of the parser is very closely tied to the syntax of the
|
||||
productions we parse. We have built Pascal procedures that rec-
|
||||
ognize every language construct. Associated with each of these
|
||||
constructs (and procedures) is a corresponding "action," which
|
||||
does whatever makes sense to do once a construct has been
|
||||
recognized. In our compiler so far, every action involves
|
||||
emitting object code, to be executed later at execution time. In
|
||||
an interpreter, every action involves something to be done im-
|
||||
mediately.
|
||||
|
||||
What I'd like you to see here is that the layout ... the struc-
|
||||
ture ... of the parser doesn't change. It's only the actions
|
||||
that change. So if you can write an interpreter for a given
|
||||
language, you can also write a compiler, and vice versa. Yet, as
|
||||
you will see, there ARE differences, and significant ones.
|
||||
Because the actions are different, the procedures that do the
|
||||
recognizing end up being written differently. Specifically, in
|
||||
the interpreter the recognizing procedures end up being coded as
|
||||
FUNCTIONS that return numeric values to their callers. None of
|
||||
the parsing routines for our compiler did that.
|
||||
|
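To make the contrast concrete, here is a small sketch that handles
the same input both ways.  It is deliberately stripped down and the
names (Source, Cursor, Start, CompileSum, InterpretSum) are my own
assumptions, not routines from the series.  The point to notice is
that the compiling version is a procedure, while the interpreting
version is a function.

```pascal
program TwoActions;

{ Assumptions: input is the string Source, only single digits and '+'
  are handled, and error checking is left out to keep the sketch short. }

var Source: string;
    Cursor: integer;
    Look: char;

procedure GetChar;
begin
   if Cursor <= Length(Source) then begin
      Look := Source[Cursor];
      Inc(Cursor);
   end
   else
      Look := #0;
end;

{ Load a new input string and prime the lookahead character }
procedure Start(s: string);
begin
   Source := s;
   Cursor := 1;
   GetChar;
end;

{ Compiler-style action: emit code now, do the arithmetic later }
procedure CompileSum;
begin
   WriteLn('   MOVE #', Look, ',D0');
   GetChar;
   while Look = '+' do begin
      GetChar;
      WriteLn('   MOVE D0,-(SP)');
      WriteLn('   MOVE #', Look, ',D0');
      WriteLn('   ADD (SP)+,D0');
      GetChar;
   end;
end;

{ Interpreter-style action: do the arithmetic now, return the value }
function InterpretSum: integer;
var Value: integer;
begin
   Value := Ord(Look) - Ord('0');
   GetChar;
   while Look = '+' do begin
      GetChar;
      Value := Value + Ord(Look) - Ord('0');
      GetChar;
   end;
   InterpretSum := Value;
end;

begin
   Start('2+3');
   CompileSum;               { prints four lines of 68000 code }
   Start('2+3');
   WriteLn(InterpretSum);    { prints 5 }
end.
```

Both routines walk the input in exactly the same order.  Only what
they do at each step differs.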
||||
Our compiler, in fact, is what we might call a "pure" compiler.
|
||||
Each time a construct is recognized, the object code is emitted
|
||||
IMMEDIATELY. (That's one reason the code is not very efficient.)
|
||||
The interpreter we'll be building here is a pure interpreter, in
|
||||
the sense that there is no translation, such as "tokenizing,"
|
||||
performed on the source code. These represent the two extremes
|
||||
of translation. In the real world, translators are rarely so
|
||||
pure, but tend to have bits of each technique.
|
||||
|
||||
I can think of several examples. I've already mentioned one:
|
||||
most interpreters, such as Microsoft BASIC, translate the source
|
||||
code (tokenize it) into an intermediate form so that it will be
|
||||
easier to parse in real time.
|
||||
|
||||
Another example is an assembler. The purpose of an assembler, of
|
||||
course, is to produce object code, and it normally does that on a
|
||||
one-to-one basis: one object instruction per line of source code.
|
||||
But almost every assembler also permits expressions as arguments.
|
||||
In this case, the expressions are always constant expressions,
|
||||
and so the assembler isn't supposed to issue object code for
|
||||
them. Rather, it "interprets" the expressions and computes the
|
||||
corresponding constant result, which is what it actually emits as
|
||||
object code.
|
||||
|
||||
As a matter of fact, we could use a bit of that ourselves. The
|
||||
translator we built in the previous installment will dutifully
|
||||
spit out object code for complicated expressions, even though
|
||||
every term in the expression is a constant. In that case it
|
||||
would be far better if the translator behaved a bit more like an
|
||||
interpreter, and just computed the equivalent constant result.
|
||||
|
||||
There is a concept in compiler theory called "lazy" translation.
|
||||
The idea is that you typically don't just emit code at every
|
||||
action. In fact, at the extreme you don't emit anything at all,
|
||||
until you absolutely have to. To accomplish this, the actions
|
||||
associated with the parsing routines typically don't just emit
|
||||
code. Sometimes they do, but often they simply return in-
|
||||
formation back to the caller. Armed with such information, the
|
||||
caller can then make a better choice of what to do.
|
||||
|
||||
For example, given the statement
|
||||
|
||||
x = x + 3 - 2 - (5 - 4) ,
|
||||
|
||||
our compiler will dutifully spit out a stream of 18 instructions
|
||||
to load each parameter into registers, perform the arithmetic,
|
||||
and store the result. A lazier evaluation would recognize that
|
||||
the arithmetic involving constants can be evaluated at compile
|
||||
time, and would reduce the expression to
|
||||
|
||||
x = x + 0 .
|
||||
|
||||
An even lazier evaluation would then be smart enough to figure
|
||||
out that this is equivalent to
|
||||
|
||||
x = x ,
|
||||
|
||||
which calls for no action at all. We could reduce 18 in-
|
||||
structions to zero!
|
||||
|
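Here is a rough sketch of what "returning information to the caller"
can look like.  It is not how our translator works, and every name in
it (Operand, Folded, Loaded, Source, Cursor) is a made-up one.  It
also takes a shortcut: it handles only the right-hand side, with the
parentheses in the example worked out by hand, so the input is
x+3-2-1.

```pascal
program LazyDemo;

{ Assumptions: single-digit constants, single-letter variables, only
  '+' and '-', no parentheses, input taken from the string Source.
  Gathering the constants together also assumes that overflow is not
  a concern. }

var Source: string;
    Cursor: integer;
    Look: char;

procedure GetChar;
begin
   if Cursor <= Length(Source) then begin
      Look := Source[Cursor];
      Inc(Cursor);
   end
   else
      Look := #0;
end;

procedure EmitLn(s: string);
begin
   WriteLn(^I, s);
end;

{ Parse one operand and report what was found; emit nothing }
procedure Operand(var IsConst: boolean; var Value: integer;
                  var Name: char);
begin
   Value := 0;
   Name := ' ';
   IsConst := Look in ['0'..'9'];
   if IsConst then
      Value := Ord(Look) - Ord('0')
   else
      Name := UpCase(Look);
   GetChar;
end;

procedure Expression;
var IsConst, Loaded: boolean;
    Value, Folded: integer;
    Name, Op: char;
    S: string;
begin
   Folded := 0;          { running sum of every constant seen }
   Loaded := false;      { true once D0 holds a partial result }
   Op := '+';
   repeat
      Operand(IsConst, Value, Name);
      if IsConst then begin
         if Op = '+' then Folded := Folded + Value
         else Folded := Folded - Value;
      end
      else begin
         if not Loaded then begin
            if Op = '+' then
               EmitLn('MOVE ' + Name + '(PC),D0')
            else begin
               EmitLn('CLR D0');
               EmitLn('SUB ' + Name + '(PC),D0');
            end;
            Loaded := true;
         end
         else if Op = '+' then
            EmitLn('ADD ' + Name + '(PC),D0')
         else
            EmitLn('SUB ' + Name + '(PC),D0');
      end;
      Op := Look;
      if Op in ['+', '-'] then GetChar;
   until not (Op in ['+', '-']);
   { Only now decide what, if anything, the constants cost }
   Str(Folded, S);
   if not Loaded then
      EmitLn('MOVE #' + S + ',D0')
   else if Folded <> 0 then
      EmitLn('ADD #' + S + ',D0');
end;

begin
   Source := 'x+3-2-1';
   Cursor := 1;
   GetChar;
   Expression;           { emits only: MOVE X(PC),D0 }
end.
```

The constants cancel, so the only instruction left is the load of x.
Getting from there to no code at all would take one more piece of
information: that the target of the assignment is x itself.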
||||
Note that there is no chance of optimizing this way in our trans-
|
||||
lator as it stands, because every action takes place immediately.
|
||||
|
||||
Lazy expression evaluation can produce significantly better
|
||||
object code than we have been able to so far. I warn you,
|
||||
though: it complicates the parser code considerably, because each
|
||||
routine now has to make decisions as to whether to emit object
|
||||
code or not. Lazy evaluation is certainly not named that because
|
||||
it's easier on the compiler writer!
|
||||
|
||||
Since we're operating mainly on the KISS principle here, I won't
|
||||
go into much more depth on this subject. I just want you to be
|
||||
aware that you can get some code optimization by combining the
|
||||
techniques of compiling and interpreting. In particular, you
|
||||
should know that the parsing routines in a smarter translator
|
||||
will generally return things to their caller, and sometimes
|
||||
expect things as well. That's the main reason for going over
|
||||
interpretation in this installment.
|
||||
|
||||
|
||||
THE INTERPRETER
|
||||
|
||||
OK, now that you know WHY we're going into all this, let's do it.
|
||||
Just to give you practice, we're going to start over with a bare
|
||||
cradle and build up the translator all over again. This time, of
|
||||
course, we can go a bit faster.
|
||||
|
||||
Since we're now going to do arithmetic, the first thing we need
|
||||
to do is to change function GetNum, which up till now has always
|
||||
returned a character (or string). Now, it's better for it to
|
||||
return an integer. MAKE A COPY of the cradle (for goodness's
|
||||
sake, don't change the version in Cradle itself!!) and modify
|
||||
GetNum as follows:
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Get a Number }
|
||||
|
||||
function GetNum: integer;
|
||||
begin
|
||||
if not IsDigit(Look) then Expected('Integer');
|
||||
GetNum := Ord(Look) - Ord('0');
|
||||
GetChar;
|
||||
end;
|
||||
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
Now, write the following version of Expression:
|
||||
|
||||
|
||||
{---------------------------------------------------------------}
|
||||
{ Parse and Translate an Expression }
|
||||
|
||||
function Expression: integer;
|
||||
begin
|
||||
Expression := GetNum;
|
||||
end;
|
||||
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
Finally, insert the statement
|
||||
|
||||
|
||||
Writeln(Expression);
|
||||
|
||||
|
||||
at the end of the main program. Now compile and test.
|
||||
|
||||
All this program does is to "parse" and translate a single
|
||||
integer "expression." As always, you should make sure that it
|
||||
does that with the digits 0..9, and gives an error message for
|
||||
anything else. Shouldn't take you very long!
|
||||
|
||||
OK, now let's extend this to include addops. Change Expression
|
||||
to read:
|
||||
|
||||
|
||||
{---------------------------------------------------------------}
|
||||
{ Parse and Translate an Expression }
|
||||
|
||||
function Expression: integer;
|
||||
var Value: integer;
|
||||
begin
|
||||
if IsAddop(Look) then
|
||||
Value := 0
|
||||
else
|
||||
Value := GetNum;
|
||||
while IsAddop(Look) do begin
|
||||
case Look of
|
||||
'+': begin
|
||||
Match('+');
|
||||
Value := Value + GetNum;
|
||||
end;
|
||||
'-': begin
|
||||
Match('-');
|
||||
Value := Value - GetNum;
|
||||
end;
|
||||
end;
|
||||
end;
|
||||
Expression := Value;
|
||||
end;
|
||||
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
The structure of Expression, of course, parallels what we did
|
||||
before, so we shouldn't have too much trouble debugging it.
|
||||
There's been a SIGNIFICANT development, though, hasn't there?
|
||||
Procedures Add and Subtract went away! The reason is that the
|
||||
action to be taken requires BOTH arguments of the operation. I
|
||||
could have chosen to retain the procedures and pass into them the
|
||||
value of the expression to date, which is Value. But it seemed
|
||||
cleaner to me to keep Value as strictly a local variable, which
|
||||
meant that the code for Add and Subtract had to be moved in line.
|
||||
This result suggests that, while the structure we had developed
|
||||
was nice and clean for our simple-minded translation scheme, it
|
||||
probably wouldn't do for use with lazy evaluation. That's a
|
||||
little tidbit we'll probably want to keep in mind for later.
|
||||
|
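For comparison, here is a sketch of the road not taken: Add and
Subtract kept as procedures, with the running value handed to them as
a var parameter.  Source and Cursor are assumptions for the demo, and
a bare GetChar stands in for Match.

```pascal
program AddSubDemo;

{ Assumptions: input comes from the string Source, numbers are single
  digits, and error checking is omitted for brevity. }

var Source: string;
    Cursor: integer;
    Look: char;

procedure GetChar;
begin
   if Cursor <= Length(Source) then begin
      Look := Source[Cursor];
      Inc(Cursor);
   end
   else
      Look := #0;
end;

function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+', '-'];
end;

function GetNum: integer;
begin
   GetNum := Ord(Look) - Ord('0');
   GetChar;
end;

{ Consume '+' and a number, updating the caller's running value }
procedure Add(var Value: integer);
begin
   GetChar;                     { stands in for Match('+') }
   Value := Value + GetNum;
end;

{ Consume '-' and a number, updating the caller's running value }
procedure Subtract(var Value: integer);
begin
   GetChar;                     { stands in for Match('-') }
   Value := Value - GetNum;
end;

function Expression: integer;
var Value: integer;
begin
   if IsAddop(Look) then
      Value := 0
   else
      Value := GetNum;
   while IsAddop(Look) do
      case Look of
        '+': Add(Value);
        '-': Subtract(Value);
      end;
   Expression := Value;
end;

begin
   Source := '9-4+2';
   Cursor := 1;
   GetChar;
   WriteLn(Expression);         { 7 }
end.
```

It works, but Value is no longer private to Expression.  Whether that
trade is worth it is a matter of taste.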
||||
OK, did the translator work? Then let's take the next step.
|
||||
It's not hard to figure out what function Term should now look
|
||||
like. Change every call to GetNum in function Expression to a
|
||||
call to Term, and then enter the following form for Term:
|
||||
|
||||
|
||||
|
||||
|
||||
{---------------------------------------------------------------}
|
||||
{ Parse and Translate a Math Term }
|
||||
|
||||
function Term: integer;
|
||||
var Value: integer;
|
||||
begin
|
||||
Value := GetNum;
|
||||
while Look in ['*', '/'] do begin
|
||||
case Look of
|
||||
'*': begin
|
||||
Match('*');
|
||||
Value := Value * GetNum;
|
||||
end;
|
||||
'/': begin
|
||||
Match('/');
|
||||
Value := Value div GetNum;
|
||||
end;
|
||||
end;
|
||||
end;
|
||||
Term := Value;
|
||||
end;
|
||||
{--------------------------------------------------------------}
|
||||
|
||||
Now, try it out. Don't forget two things: first, we're dealing
|
||||
with integer division, so, for example, 1/3 should come out zero.
|
||||
Second, even though we can output multi-digit results, our input
|
||||
is still restricted to single digits.
|
||||
|
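If the behavior of integer division is not fresh in your mind, this
little program shows the effect.  The last two lines spell out, with
explicit parentheses, the left-to-right order in which Term works
through 6/4*2 and 6*2/4.

```pascal
program DivDemo;

begin
   WriteLn(1 div 3);          { 0: the fraction is simply dropped }
   WriteLn(7 div 2);          { 3 }
   WriteLn((6 div 4) * 2);    { 2: the order Term uses for 6/4*2 }
   WriteLn((6 * 2) div 4);    { 3: the order Term uses for 6*2/4 }
end.
```

So with integers, the order of multiplies and divides can change the
answer.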
||||
That seems like a silly restriction at this point, since we have
|
||||
already seen how easily function GetNum can be extended. So
|
||||
let's go ahead and fix it right now. The new version is
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
|
||||
{ Get a Number }
|
||||
|
||||
function GetNum: integer;
|
||||
var Value: integer;
|
||||
begin
|
||||
Value := 0;
|
||||
if not IsDigit(Look) then Expected('Integer');
|
||||
while IsDigit(Look) do begin
|
||||
Value := 10 * Value + Ord(Look) - Ord('0');
|
||||
GetChar;
|
||||
end;
|
||||
GetNum := Value;
|
||||
end;
|
||||
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
If you've compiled and tested this version of the interpreter,
|
||||
the next step is to install function Factor, complete with pa-
|
||||
renthesized expressions. We'll hold off a bit longer on the
|
||||
variable names. First, change the references to GetNum, in
|
||||
function Term, so that they call Factor instead. Now code the
|
||||
following version of Factor:
|
||||
|
||||
|
||||
|
||||
|
||||
{---------------------------------------------------------------}
|
||||
{ Parse and Translate a Math Factor }
|
||||
|
||||
function Expression: integer; Forward;
|
||||
|
||||
function Factor: integer;
|
||||
begin
|
||||
if Look = '(' then begin
|
||||
Match('(');
|
||||
Factor := Expression;
|
||||
Match(')');
|
||||
end
|
||||
else
|
||||
Factor := GetNum;
|
||||
end;
|
||||
{---------------------------------------------------------------}
|
||||
|
||||
That was pretty easy, huh? We're rapidly closing in on a useful
|
||||
interpreter.
|
||||
|
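In case you'd like to check the pieces together without typing at the
keyboard, here is a self-contained sketch of the interpreter as it
stands at this point.  The differences from the tutorial's code are
my own assumptions: the input comes from a string (Source, Cursor,
Evaluate), and a single routine Quit replaces Expected and Abort.

```pascal
program Interp;

{ Assumptions: input is the string Source (the tutorial reads standard
  input), #0 marks end of input, and Quit is a stand-in for the
  tutorial's Expected/Abort pair. }

var Source: string;
    Cursor: integer;
    Look: char;

procedure GetChar;
begin
   if Cursor <= Length(Source) then begin
      Look := Source[Cursor];
      Inc(Cursor);
   end
   else
      Look := #0;
end;

procedure Quit(s: string);
begin
   WriteLn('Error: ', s, ' Expected.');
   Halt(1);
end;

procedure Match(x: char);
begin
   if Look = x then GetChar
   else Quit('''' + x + '''');
end;

function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;

function GetNum: integer;
var Value: integer;
begin
   Value := 0;
   if not IsDigit(Look) then Quit('Integer');
   while IsDigit(Look) do begin
      Value := 10 * Value + Ord(Look) - Ord('0');
      GetChar;
   end;
   GetNum := Value;
end;

function Expression: integer; forward;

function Factor: integer;
begin
   if Look = '(' then begin
      Match('(');
      Factor := Expression;
      Match(')');
   end
   else
      Factor := GetNum;
end;

function Term: integer;
var Value: integer;
begin
   Value := Factor;
   while Look in ['*', '/'] do
      case Look of
        '*': begin Match('*'); Value := Value * Factor; end;
        '/': begin Match('/'); Value := Value div Factor; end;
      end;
   Term := Value;
end;

function Expression: integer;
var Value: integer;
begin
   if Look in ['+', '-'] then Value := 0
   else Value := Term;
   while Look in ['+', '-'] do
      case Look of
        '+': begin Match('+'); Value := Value + Term; end;
        '-': begin Match('-'); Value := Value - Term; end;
      end;
   Expression := Value;
end;

{ Load a string, prime the lookahead, and evaluate it }
function Evaluate(s: string): integer;
begin
   Source := s;
   Cursor := 1;
   GetChar;
   Evaluate := Expression;
end;

begin
   WriteLn(Evaluate('2*(3+4)'));      { 14 }
   WriteLn(Evaluate('12+30/4'));      { 19, since 30 div 4 is 7 }
   WriteLn(Evaluate('-(5-8)*2'));     { 6 }
end.
```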
||||
|
||||
A LITTLE PHILOSOPHY
|
||||
|
||||
Before going any further, there's something I'd like to call to
|
||||
your attention. It's a concept that we've been making use of in
|
||||
all these sessions, but I haven't explicitly mentioned it up till
|
||||
now. I think it's time, because it's a concept so useful, and so
|
||||
powerful, that it makes all the difference between a parser
|
||||
that's trivially easy, and one that's too complex to deal with.
|
||||
|
||||
In the early days of compiler technology, people had a terrible
|
||||
time figuring out how to deal with things like operator prece-
|
||||
dence ... the way that multiply and divide operators take
|
||||
precedence over add and subtract, etc. I remember a colleague of
|
||||
some thirty years ago, and how excited he was to find out how to
|
||||
do it. The technique used involved building two stacks, upon
|
||||
which you pushed each operator or operand. Associated with each
|
||||
operator was a precedence level, and the rules required that you
|
||||
only actually performed an operation ("reducing" the stack) if
|
||||
the precedence level showing on top of the stack was correct. To
|
||||
make life more interesting, an operator like ')' had different
|
||||
precedence levels, depending upon whether or not it was already
|
||||
on the stack. You had to give it one value before you put it on
|
||||
the stack, and another to decide when to take it off. Just for
|
||||
the experience, I worked all of this out for myself a few years
|
||||
ago, and I can tell you that it's very tricky.
|
||||
|
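For the curious, here is a simplified sketch of the two-stack idea.
It is not a reconstruction of any particular historical compiler, and
all of the names in it are hypothetical.  A left parenthesis is
always pushed when it arrives, but once it is on the stack it has the
lowest precedence of all, which is one way of getting the "two
different values" effect described above.

```pascal
program TwoStacks;

{ Assumptions: single-digit operands, the operators + - * / and
  parentheses, well-formed input, and no unary minus. }

const MaxDepth = 32;

var Nums: array[1..MaxDepth] of integer;   { operand stack }
    Ops:  array[1..MaxDepth] of char;      { operator stack }
    NumTop, OpTop: integer;

{ Precedence of a stacked operator; a stacked '(' gets zero }
function Prec(op: char): integer;
begin
   case op of
     '+', '-': Prec := 1;
     '*', '/': Prec := 2;
   else
     Prec := 0;
   end;
end;

{ Pop one operator and two operands, push the result }
procedure Reduce;
var a, b: integer;
    op: char;
begin
   op := Ops[OpTop];
   Dec(OpTop);
   b := Nums[NumTop];
   Dec(NumTop);
   a := Nums[NumTop];
   case op of
     '+': Nums[NumTop] := a + b;
     '-': Nums[NumTop] := a - b;
     '*': Nums[NumTop] := a * b;
     '/': Nums[NumTop] := a div b;
   end;
end;

function Evaluate(s: string): integer;
var i: integer;
    c: char;
begin
   NumTop := 0;
   OpTop := 0;
   for i := 1 to Length(s) do begin
      c := s[i];
      case c of
        '0'..'9':
           begin
              Inc(NumTop);
              Nums[NumTop] := Ord(c) - Ord('0');
           end;
        '(':
           begin
              Inc(OpTop);
              Ops[OpTop] := c;
           end;
        ')':
           begin
              while Ops[OpTop] <> '(' do Reduce;
              Dec(OpTop);            { discard the '(' }
           end;
        '+', '-', '*', '/':
           begin
              { reduce while the stacked operator binds as tightly }
              while (OpTop > 0) and (Prec(Ops[OpTop]) >= Prec(c)) do
                 Reduce;
              Inc(OpTop);
              Ops[OpTop] := c;
           end;
      end;
   end;
   while OpTop > 0 do Reduce;
   Evaluate := Nums[1];
end;

begin
   WriteLn(Evaluate('2+3*4'));       { 14 }
   WriteLn(Evaluate('(2+3)*4'));     { 20 }
   WriteLn(Evaluate('8-4-2'));       { 2 }
end.
```

Even in this stripped-down form, all the bookkeeping is out in the
open: two arrays, two stack pointers, and a precedence table.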
||||
We haven't had to do anything like that. In fact, by now the
|
||||
parsing of an arithmetic statement should seem like child's play.
|
||||
How did we get so lucky? And where did the precedence stacks go?
|
||||
|
||||
A similar thing is going on in our interpreter above. You just
|
||||
KNOW that in order for it to do the computation of arithmetic
|
||||
statements (as opposed to the parsing of them), there have to be
|
||||
numbers pushed onto a stack somewhere. But where is the stack?
|
||||
|
||||
Finally, in compiler textbooks, there are a number of places
|
||||
where stacks and other structures are discussed. In the other
|
||||
leading parsing method (LR), an explicit stack is used. In fact,
|
||||
the technique is very much like the old way of doing arithmetic
|
||||
expressions. Another concept is that of a parse tree. Authors
|
||||
like to draw diagrams of the tokens in a statement, connected
|
||||
into a tree with operators at the internal nodes. Again, where
|
||||
are the trees and stacks in our technique? We haven't seen any.
|
||||
The answer in all cases is that the structures are implicit, not
|
||||
explicit. In any computer language, there is a stack involved
|
||||
every time you call a subroutine. Whenever a subroutine is
|
||||
called, the return address is pushed onto the CPU stack. At the
|
||||
end of the subroutine, the address is popped back off and control
|
||||
is transferred there. In a recursive language such as Pascal,
|
||||
there can also be local data pushed onto the stack, and it, too,
|
||||
returns when it's needed.
|
||||
|
||||
For example, function Expression contains a local variable
called Value, which it fills by a call to Term. Suppose, in its
next call to Term for the second argument, that Term calls
Factor, which recursively calls Expression again. That
"instance" of Expression gets another value for its copy of Value.
|
||||
What happens to the first Value? Answer: it's still on the
|
||||
stack, and will be there again when we return from our call
|
||||
sequence.
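
To watch this happen, here is a small sketch of the same kind of
evaluator, rigged to read from a string instead of the keyboard
and to report how many copies of Expression are active. Source,
Cursor, Depth, and Answer are names assumed for the demonstration;
it handles only single digits, '+', '*', and parentheses, and it
does no error checking.

```pascal
program ImplicitStack;
{ Sketch only: shows each Expression call keeping its own Value. }

var
  Source: string;      { text being parsed (assumed test input) }
  Cursor: integer;     { index of the current character }
  Look: char;          { lookahead character, as in the tutorial }
  Depth: integer;      { number of active calls to Expression }
  Answer: integer;

procedure GetChar;
begin
  Inc(Cursor);
  if Cursor <= Length(Source) then Look := Source[Cursor]
  else Look := '.';    { '.' stands for end of input here }
end;

function Expression: integer; forward;

function Factor: integer;
begin
  if Look = '(' then begin
    GetChar;                        { skip '(' }
    Factor := Expression;           { recursion: a fresh Value is created }
    GetChar;                        { skip ')' }
  end
  else begin
    Factor := Ord(Look) - Ord('0');
    GetChar;
  end;
end;

function Term: integer;
var Value: integer;
begin
  Value := Factor;
  while Look = '*' do begin
    GetChar;
    Value := Value * Factor;
  end;
  Term := Value;
end;

function Expression: integer;
var Value: integer;                 { each call gets its own copy }
begin
  Inc(Depth);
  Value := Term;
  WriteLn('depth ', Depth, ': Value starts at ', Value);
  while Look = '+' do begin
    GetChar;
    Value := Value + Term;
  end;
  WriteLn('depth ', Depth, ': Value ends at ', Value);
  Dec(Depth);
  Expression := Value;
end;

begin
  Source := '1+(2+3)';
  Cursor := 0;
  Depth := 0;
  GetChar;
  Answer := Expression;
  WriteLn('result: ', Answer);
end.
```

The trace should show the outer Value (1) still intact after the
inner call has finished with its own Value (5), giving a result
of 6. No stack was declared anywhere; the CPU stack did the work.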
|
||||
|
||||
In other words, the reason things look so simple is that we've
|
||||
been making maximum use of the resources of the language. The
|
||||
hierarchy levels and the parse trees are there, all right, but
|
||||
they're hidden within the structure of the parser, and they're
|
||||
taken care of by the order in which the various procedures are
|
||||
called. Now that you've seen how we do it, it's probably hard to
|
||||
imagine doing it any other way. But I can tell you that it took
|
||||
a lot of years for compiler writers to get that smart. The early
|
||||
compilers were too complex to imagine. Funny how things get
|
||||
easier with a little practice.
|
||||
|
||||
The reason I've brought all this up is as both a lesson and a
|
||||
warning. The lesson: things can be easy when you do them right.
|
||||
The warning: take a look at what you're doing. If, as you branch
|
||||
out on your own, you begin to find a real need for a separate
|
||||
stack or tree structure, it may be time to ask yourself if you're
|
||||
looking at things the right way. Maybe you just aren't using the
|
||||
facilities of the language as well as you could be.
|
||||
|
||||
|
||||
The next step is to add variable names. Now, though, we have a
|
||||
slight problem. For the compiler, we had no problem in dealing
|
||||
with variable names ... we just issued the names to the assembler
|
||||
and let the rest of the program take care of allocating storage
|
||||
for them. Here, on the other hand, we need to be able to fetch
|
||||
the values of the variables and return them as the return values
|
||||
of Factor. We need a storage mechanism for these variables.
|
||||
|
||||
Back in the early days of personal computing, Tiny BASIC lived.
|
||||
It had a grand total of 26 possible variables: one for each
|
||||
letter of the alphabet. This fits nicely with our concept of
|
||||
single-character tokens, so we'll try the same trick. In the
|
||||
beginning of your interpreter, just after the declaration of
|
||||
variable Look, insert the line:
|
||||
|
||||
Table: Array['A'..'Z'] of integer;
|
||||
|
||||
We also need to initialize the array, so add this procedure:
|
||||
|
||||
|
||||
|
||||
|
||||
{---------------------------------------------------------------}
{ Initialize the Variable Area }

procedure InitTable;
var i: char;
begin
   for i := 'A' to 'Z' do
      Table[i] := 0;
end;
{---------------------------------------------------------------}
|
||||
|
||||
|
||||
You must also insert a call to InitTable, in procedure Init.
|
||||
DON'T FORGET to do that, or the results may surprise you!
|
||||
|
||||
Now that we have an array of variables, we can modify Factor to
|
||||
use it. Since we don't have a way (so far) to set the variables,
|
||||
Factor will always return zero values for them, but let's go
|
||||
ahead and extend it anyway. Here's the new version:
|
||||
|
||||
|
||||
{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

function Expression: integer; Forward;

function Factor: integer;
begin
   if Look = '(' then begin
      Match('(');
      Factor := Expression;
      Match(')');
      end
   else if IsAlpha(Look) then
      Factor := Table[GetName]
   else
      Factor := GetNum;
end;
{---------------------------------------------------------------}
|
||||
|
||||
|
||||
As always, compile and test this version of the program. Even
|
||||
though all the variables are now zeros, at least we can correctly
|
||||
parse the complete expressions, as well as catch any badly formed
|
||||
expressions.
|
||||
|
||||
I suppose you realize the next step: we need to do an assignment
|
||||
statement so we can put something INTO the variables. For now,
|
||||
let's stick to one-liners, though we will soon be handling
|
||||
multiple statements.
|
||||
|
||||
The assignment statement parallels what we did before:
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;
var Name: char;
begin
   Name := GetName;
   Match('=');
   Table[Name] := Expression;
end;
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
To test this, I added a temporary write statement in the main
|
||||
program, to print out the value of A. Then I tested it with
|
||||
various assignments to it.
|
||||
|
||||
Of course, an interpretive language that can only accept a single
|
||||
line of program is not of much value. So we're going to want to
|
||||
handle multiple statements. This merely means putting a loop
|
||||
around the call to Assignment. So let's do that now. But what
|
||||
should be the loop exit criterion? Glad you asked, because it
|
||||
brings up a point we've been able to ignore up till now.
|
||||
|
||||
One of the trickiest things to handle in any translator is to
|
||||
determine when to bail out of a given construct and go look for
|
||||
something else. This hasn't been a problem for us so far because
|
||||
we've only allowed for a single kind of construct ... either an
|
||||
expression or an assignment statement. When we start adding
|
||||
loops and different kinds of statements, you'll find that we have
|
||||
to be very careful that things terminate properly. If we put our
|
||||
interpreter in a loop, we need a way to quit. Terminating on a
|
||||
newline is no good, because that's what sends us back for another
|
||||
line. We could always let an unrecognized character take us out,
|
||||
but that would cause every run to end in an error message, which
|
||||
certainly seems uncool.
|
||||
|
||||
What we need is a termination character. I vote for Pascal's
|
||||
ending period ('.'). A minor complication is that Turbo ends
|
||||
every normal line with TWO characters, the carriage return (CR)
|
||||
and line feed (LF). At the end of each line, we need to eat
|
||||
these characters before processing the next one. A natural way
|
||||
to do this would be with procedure Match, except that Match's
|
||||
error message prints the character, which of course for the CR
|
||||
and/or LF won't look so great. What we need is a special
procedure for this, which we'll no doubt be using over and over. Here
|
||||
it is:
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
{ Recognize and Skip Over a Newline }

procedure NewLine;
begin
   if Look = CR then begin
      GetChar;
      if Look = LF then
         GetChar;
   end;
end;
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
Insert this procedure at any convenient spot ... I put mine just
|
||||
after Match. Now, rewrite the main program to look like this:
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   repeat
      Assignment;
      NewLine;
   until Look = '.';
end.
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
Note that the test for a CR is now gone, and that there are also
|
||||
no error tests within NewLine itself. That's OK, though ...
|
||||
whatever is left over in terms of bogus characters will be caught
|
||||
at the beginning of the next assignment statement.
|
||||
|
||||
Well, we now have a functioning interpreter. It doesn't do us a
|
||||
lot of good, however, since we have no way to read data in or
|
||||
write it out. Sure would help to have some I/O!
|
||||
|
||||
Let's wrap this session up, then, by adding the I/O routines.
|
||||
Since we're sticking to single-character tokens, I'll use '?' to
|
||||
stand for a read statement, and '!' for a write, with the
character immediately following them to be used as a one-token
|
||||
"parameter list." Here are the routines:
|
||||
|
||||
{--------------------------------------------------------------}
{ Input Routine }

procedure Input;
begin
   Match('?');
   Read(Table[GetName]);
end;


{--------------------------------------------------------------}
{ Output Routine }

procedure Output;
begin
   Match('!');
   WriteLn(Table[GetName]);
end;
{--------------------------------------------------------------}
|
||||
|
||||
They aren't very fancy, I admit ... no prompt character on input,
|
||||
for example ... but they get the job done.
|
||||
|
||||
The corresponding changes in the main program are shown below.
|
||||
Note that we use the usual trick of a case statement based upon
|
||||
the current lookahead character, to decide what to do.
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   repeat
      case Look of
       '?': Input;
       '!': Output;
       else Assignment;
      end;
      NewLine;
   until Look = '.';
end.
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
You have now completed a real, working interpreter. It's pretty
|
||||
sparse, but it works just like the "big boys." It includes three
|
||||
kinds of program statements (and can tell the difference!), 26
|
||||
variables, and I/O statements. The only things that it lacks,
|
||||
really, are control statements, subroutines, and some kind of
|
||||
program editing function. The program editing part, I'm going to
|
||||
pass on. After all, we're not here to build a product, but to
|
||||
learn things. The control statements, we'll cover in the next
|
||||
installment, and the subroutines soon after. I'm anxious to get
|
||||
on with that, so we'll leave the interpreter as it stands.
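
If you'd like to see all the pieces in one place, here is a
condensed sketch of the interpreter at this stage. It is an
approximation, not the listing you have been building: it reads
its program from a string (Source and Cursor are assumed names) so
that it runs unattended, it drops Match and the other error
checks, it omits unary minus and the '?' statement, and the output
routine is called WriteVar here.

```pascal
program MiniInterp;
{ Sketch only: string-fed version of the interpreter, no error checks. }
const
  CR = #13;
  LF = #10;
var
  Source: string;                      { the program text (assumed) }
  Cursor: integer;                     { index of current character }
  Look: char;                          { lookahead character }
  Table: array['A'..'Z'] of integer;   { the 26 variables }
  i: char;

procedure GetChar;
begin
  Inc(Cursor);
  if Cursor <= Length(Source) then Look := Source[Cursor]
  else Look := '.';
end;

function GetName: char;
begin
  GetName := UpCase(Look);
  GetChar;
end;

function GetNum: integer;
var Value: integer;
begin
  Value := 0;
  while Look in ['0'..'9'] do begin
    Value := 10 * Value + Ord(Look) - Ord('0');
    GetChar;
  end;
  GetNum := Value;
end;

function Expression: integer; forward;

function Factor: integer;
begin
  if Look = '(' then begin
    GetChar;                           { skip '(' }
    Factor := Expression;
    GetChar;                           { skip ')' }
  end
  else if UpCase(Look) in ['A'..'Z'] then
    Factor := Table[GetName]
  else
    Factor := GetNum;
end;

function Term: integer;
var Value: integer;
begin
  Value := Factor;
  while Look in ['*', '/'] do
    if Look = '*' then begin GetChar; Value := Value * Factor; end
    else begin GetChar; Value := Value div Factor; end;
  Term := Value;
end;

function Expression: integer;
var Value: integer;
begin
  Value := Term;
  while Look in ['+', '-'] do
    if Look = '+' then begin GetChar; Value := Value + Term; end
    else begin GetChar; Value := Value - Term; end;
  Expression := Value;
end;

procedure Assignment;
var Name: char;
begin
  Name := GetName;
  GetChar;                             { skip '=' }
  Table[Name] := Expression;
end;

procedure WriteVar;
begin
  GetChar;                             { skip '!' }
  WriteLn(Table[GetName]);
end;

procedure NewLine;
begin
  if Look = CR then begin
    GetChar;
    if Look = LF then GetChar;
  end;
end;

begin
  for i := 'A' to 'Z' do Table[i] := 0;
  Source := 'A=2' + CR + LF + 'B=A*3+1' + CR + LF + '!B' + CR + LF + '.';
  Cursor := 0;
  GetChar;
  repeat
    if Look = '!' then WriteVar
    else Assignment;
    NewLine;
  until Look = '.';
end.
```

The built-in program assigns A=2, then B=A*3+1, then writes B, so
it should print 7. Note that NewLine only recognizes lines that
begin their ending with CR, as in the text above.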
|
||||
|
||||
I hope that by now you're convinced that the limitation of
single-character names and the processing of white space are easily
|
||||
taken care of, as we did in the last session. This time, if
|
||||
you'd like to play around with these extensions, be my guest ...
|
||||
they're "left as an exercise for the student." See you next
|
||||
time.
|
||||
|
||||
*****************************************************************
|
||||
* *
|
||||
* COPYRIGHT NOTICE *
|
||||
* *
|
||||
* Copyright (C) 1988 Jack W. Crenshaw. All rights reserved. *
|
||||
* *
|
||||
*****************************************************************
|
||||
|
||||
|
||||
|
||||
|
||||
@@ -0,0 +1,525 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
LET'S BUILD A COMPILER!
|
||||
|
||||
By
|
||||
|
||||
Jack W. Crenshaw, Ph.D.
|
||||
|
||||
2 April 1989
|
||||
|
||||
|
||||
Part VIII: A LITTLE PHILOSOPHY
|
||||
|
||||
|
||||
*****************************************************************
|
||||
* *
|
||||
* COPYRIGHT NOTICE *
|
||||
* *
|
||||
* Copyright (C) 1989 Jack W. Crenshaw. All rights reserved. *
|
||||
* *
|
||||
*****************************************************************
|
||||
|
||||
|
||||
INTRODUCTION
|
||||
|
||||
This is going to be a different kind of session than the others
|
||||
in our series on parsing and compiler construction. For this
|
||||
session, there won't be any experiments to do or code to write.
|
||||
This once, I'd like to just talk with you for a while.
|
||||
Mercifully, it will be a short session, and then we can take up
|
||||
where we left off, hopefully with renewed vigor.
|
||||
|
||||
When I was in college, I found that I could always follow a
|
||||
prof's lecture a lot better if I knew where he was going with it.
|
||||
I'll bet you were the same.
|
||||
|
||||
So I thought maybe it's about time I told you where we're going
|
||||
with this series: what's coming up in future installments, and in
|
||||
general what all this is about. I'll also share some general
|
||||
thoughts concerning the usefulness of what we've been doing.
|
||||
|
||||
|
||||
THE ROAD HOME
|
||||
|
||||
So far, we've covered the parsing and translation of arithmetic
|
||||
expressions, Boolean expressions, and combinations connected by
|
||||
relational operators. We've also done the same for control
|
||||
constructs. In all of this we've leaned heavily on the use of
|
||||
top-down, recursive descent parsing, BNF definitions of the
|
||||
syntax, and direct generation of assembly-language code. We also
|
||||
learned the value of such tricks as single-character tokens to
|
||||
help us see the forest through the trees. In the last
|
||||
installment we dealt with lexical scanning, and I showed you
|
||||
simple but powerful ways to remove the single-character barriers.
|
||||
|
||||
Throughout the whole study, I've emphasized the KISS philosophy
|
||||
... Keep It Simple, Sidney ... and I hope by now you've realized
|
||||
just how simple this stuff can really be. While there are
certainly areas of compiler theory that are truly intimidating, the
|
||||
ultimate message of this series is that in practice you can just
|
||||
politely sidestep many of these areas. If the language
|
||||
definition cooperates or, as in this series, if you can define
|
||||
the language as you go, it's possible to write down the language
|
||||
definition in BNF with reasonable ease. And, as we've seen, you
|
||||
can crank out parse procedures from the BNF just about as fast as
|
||||
you can type.
|
||||
|
||||
As our compiler has taken form, it's gotten more parts, but each
|
||||
part is quite small and simple, and very much like all the
|
||||
others.
|
||||
|
||||
At this point, we have many of the makings of a real, practical
|
||||
compiler. As a matter of fact, we already have all we need to
|
||||
build a toy compiler for a language as powerful as, say, Tiny
|
||||
BASIC. In the next couple of installments, we'll go ahead and
|
||||
define that language.
|
||||
|
||||
To round out the series, we still have a few items to cover.
|
||||
These include:
|
||||
|
||||
o Procedure calls, with and without parameters
|
||||
|
||||
o Local and global variables
|
||||
|
||||
o Basic types, such as character and integer types
|
||||
|
||||
o Arrays
|
||||
|
||||
o Strings
|
||||
|
||||
o User-defined types and structures
|
||||
|
||||
o Tree-structured parsers and intermediate languages
|
||||
|
||||
o Optimization
|
||||
|
||||
These will all be covered in future installments. When we're
|
||||
finished, you'll have all the tools you need to design and build
|
||||
your own languages, and the compilers to translate them.
|
||||
|
||||
I can't design those languages for you, but I can make some
|
||||
comments and recommendations. I've already sprinkled some
|
||||
throughout past installments. You've seen, for example, the
|
||||
control constructs I prefer.
|
||||
|
||||
These constructs are going to be part of the languages I build.
|
||||
I have three languages in mind at this point, two of which you
|
||||
will see in installments to come:
|
||||
|
||||
TINY - A minimal, but usable language on the order of Tiny
|
||||
BASIC or Tiny C. It won't be very practical, but it will
|
||||
have enough power to let you write and run real programs
|
||||
that do something worthwhile.
|
||||
|
||||
KISS - The language I'm building for my own use. KISS is
|
||||
intended to be a systems programming language. It won't
|
||||
have strong typing or fancy data structures, but it will
|
||||
support most of the things I want to do with a higher-
|
||||
order language (HOL), except perhaps writing compilers.
|
||||
|
||||
I've also been toying for years with the idea of a HOL-like
|
||||
assembler, with structured control constructs and HOL-like
|
||||
assignment statements. That, in fact, was the impetus behind my
|
||||
original foray into the jungles of compiler theory. This one may
|
||||
never be built, simply because I've learned that it's actually
|
||||
easier to implement a language like KISS, that only uses a subset
|
||||
of the CPU instructions. As you know, assembly language can be
|
||||
bizarre and irregular in the extreme, and a language that maps
|
||||
one-for-one onto it can be a real challenge. Still, I've always
|
||||
felt that the syntax used in conventional assemblers is dumb ...
|
||||
why is
|
||||
|
||||
MOVE.L A,B
|
||||
|
||||
better, or easier to translate, than
|
||||
|
||||
B=A ?
|
||||
|
||||
I think it would be an interesting exercise to develop a
|
||||
"compiler" that would give the programmer complete access to and
|
||||
control over the full complement of the CPU instruction set, and
|
||||
would allow you to generate programs as efficient as assembly
|
||||
language, without the pain of learning a set of mnemonics. Can
|
||||
it be done? I don't know. The real question may be, "Will the
|
||||
resulting language be any easier to write than assembly?" If
|
||||
not, there's no point in it. I think that it can be done, but
|
||||
I'm not completely sure yet how the syntax should look.
|
||||
|
||||
Perhaps you have some comments or suggestions on this one. I'd
|
||||
love to hear them.
|
||||
|
||||
You probably won't be surprised to learn that I've already worked
|
||||
ahead in most of the areas that we will cover. I have some good
|
||||
news: Things never get much harder than they've been so far.
|
||||
It's possible to build a complete, working compiler for a real
|
||||
language, using nothing but the same kinds of techniques you've
|
||||
learned so far. And THAT brings up some interesting questions.
|
||||
|
||||
|
||||
WHY IS IT SO SIMPLE?
|
||||
|
||||
Before embarking on this series, I always thought that compilers
|
||||
were just naturally complex computer programs ... the ultimate
|
||||
challenge. Yet the things we have done here have usually turned
|
||||
out to be quite simple, sometimes even trivial.
|
||||
|
||||
For a while, I thought it was simply because I hadn't yet gotten
|
||||
into the meat of the subject. I had only covered the simple
|
||||
parts. I will freely admit to you that, even when I began the
|
||||
series, I wasn't sure how far we would be able to go before
|
||||
things got too complex to deal with in the ways we have so far.
|
||||
But at this point I've already been down the road far enough to
|
||||
see the end of it. Guess what?
|
||||
|
||||
|
||||
THERE ARE NO HARD PARTS!
|
||||
|
||||
|
||||
Then, I thought maybe it was because we were not generating very
|
||||
good object code. Those of you who have been following the
|
||||
series and trying sample compiles know that, while the code works
|
||||
and is rather foolproof, its efficiency is pretty awful. I
|
||||
figured that if we were concentrating on turning out tight code,
|
||||
we would soon find all that missing complexity.
|
||||
|
||||
To some extent, that one is true. In particular, my first few
|
||||
efforts at trying to improve efficiency introduced complexity at
|
||||
an alarming rate. But since then I've been tinkering around with
|
||||
some simple optimizations and I've found some that result in very
|
||||
respectable code quality, WITHOUT adding a lot of complexity.
|
||||
|
||||
Finally, I thought that perhaps the saving grace was the "toy
|
||||
compiler" nature of the study. I have made no pretense that we
|
||||
were ever going to be able to build a compiler to compete with
|
||||
Borland and Microsoft. And yet, again, as I get deeper into this
|
||||
thing the differences are starting to fade away.
|
||||
|
||||
Just to make sure you get the message here, let me state it flat
|
||||
out:
|
||||
|
||||
USING THE TECHNIQUES WE'VE USED HERE, IT IS POSSIBLE TO
|
||||
BUILD A PRODUCTION-QUALITY, WORKING COMPILER WITHOUT ADDING
|
||||
A LOT OF COMPLEXITY TO WHAT WE'VE ALREADY DONE.
|
||||
|
||||
|
||||
Since the series began I've received some comments from you.
|
||||
Most of them echo my own thoughts: "This is easy! Why do the
|
||||
textbooks make it seem so hard?" Good question.
|
||||
|
||||
Recently, I've gone back and looked at some of those texts again,
|
||||
and even bought and read some new ones. Each time, I come away
|
||||
with the same feeling: These guys have made it seem too hard.
|
||||
|
||||
What's going on here? Why does the whole thing seem difficult in
|
||||
the texts, but easy to us? Are we that much smarter than Aho,
|
||||
Ullman, Brinch Hansen, and all the rest?
|
||||
|
||||
Hardly. But we are doing some things differently, and more and
|
||||
more I'm starting to appreciate the value of our approach, and
|
||||
the way that it simplifies things. Aside from the obvious
|
||||
shortcuts that I outlined in Part I, like single-character tokens
|
||||
and console I/O, we have made some implicit assumptions and done
|
||||
some things differently from those who have designed compilers in
|
||||
the past. As it turns out, our approach makes life a lot easier.
|
||||
|
||||
So why didn't all those other guys use it?
|
||||
|
||||
You have to remember the context of some of the earlier compiler
|
||||
development. These people were working with very small computers
|
||||
of limited capacity. Memory was very limited, the CPU
|
||||
instruction set was minimal, and programs ran in batch mode
|
||||
rather than interactively. As it turns out, these caused some
|
||||
key design decisions that have really complicated the designs.
|
||||
Until recently, I hadn't realized how much of classical compiler
|
||||
design was driven by the available hardware.
|
||||
|
||||
Even in cases where these limitations no longer apply, people
|
||||
have tended to structure their programs in the same way, since
|
||||
that is the way they were taught to do it.
|
||||
|
||||
In our case, we have started with a blank sheet of paper. There
|
||||
is a danger there, of course, that you will end up falling into
|
||||
traps that other people have long since learned to avoid. But it
|
||||
also has allowed us to take different approaches that, partly by
|
||||
design and partly by pure dumb luck, have allowed us to gain
|
||||
simplicity.
|
||||
|
||||
Here are the areas that I think have led to complexity in the
|
||||
past:
|
||||
|
||||
o Limited RAM Forcing Multiple Passes
|
||||
|
||||
I just read "Brinch Hansen on Pascal Compilers" (an
|
||||
excellent book, BTW). He developed a Pascal compiler for a
|
||||
PC, but he started the effort in 1981 with a 64K system, and
|
||||
so almost every design decision he made was aimed at making
|
||||
the compiler fit into RAM. To do this, his compiler has
|
||||
three passes, one of which is the lexical scanner. There is
|
||||
no way he could, for example, use the distributed scanner I
|
||||
introduced in the last installment, because the program
|
||||
structure wouldn't allow it. He also required not one but
|
||||
two intermediate languages, to provide the communication
|
||||
between phases.
|
||||
|
||||
All the early compiler writers had to deal with this issue:
|
||||
Break the compiler up into enough parts so that it will fit
|
||||
in memory. When you have multiple passes, you need to add
|
||||
data structures to support the information that each pass
|
||||
leaves behind for the next. That adds complexity, and ends
|
||||
up driving the design. Lee's book, "The Anatomy of a
|
||||
Compiler," mentions a FORTRAN compiler developed for an IBM
|
||||
1401. It had no fewer than 63 separate passes! Needless to
|
||||
say, in a compiler like this the separation into phases
|
||||
would dominate the design.
|
||||
|
||||
Even in situations where RAM is plentiful, people have
|
||||
tended to use the same techniques because that is what
|
||||
they're familiar with. It wasn't until Turbo Pascal came
|
||||
along that we found how simple a compiler could be if you
|
||||
started with different assumptions.
|
||||
|
||||
|
||||
o Batch Processing
|
||||
|
||||
In the early days, batch processing was the only choice ...
|
||||
there was no interactive computing. Even today, compilers
|
||||
run in essentially batch mode.
|
||||
|
||||
In a mainframe compiler as well as many micro compilers,
|
||||
considerable effort is expended on error recovery ... it can
|
||||
consume as much as 30-40% of the compiler and completely
|
||||
drive the design. The idea is to avoid halting on the first
|
||||
error, but rather to keep going at all costs, so that you
|
||||
can tell the programmer about as many errors in the whole
|
||||
program as possible.
|
||||
|
||||
All of that harks back to the days of the early mainframes,
|
||||
where turnaround time was measured in hours or days, and it
|
||||
was important to squeeze every last ounce of information out
|
||||
of each run.
|
||||
|
||||
In this series, I've been very careful to avoid the issue of
|
||||
error recovery, and instead our compiler simply halts with
|
||||
an error message on the first error. I will frankly admit
|
||||
that it was mostly because I wanted to take the easy way out
|
||||
and keep things simple. But this approach, pioneered by
|
||||
Borland in Turbo Pascal, also has a lot going for it anyway.
|
||||
Aside from keeping the compiler simple, it also fits very
|
||||
well with the idea of an interactive system. When
|
||||
compilation is fast, and especially when you have an editor
|
||||
such as Borland's that will take you right to the point of
|
||||
the error, then it makes a lot of sense to stop there, and
|
||||
just restart the compilation after the error is fixed.
|
||||
|
||||
|
||||
o Large Programs
|
||||
|
||||
Early compilers were designed to handle large programs ...
|
||||
essentially infinite ones. In those days there was little
|
||||
choice; the ideas of subroutine libraries and separate
|
||||
compilation were still in the future. Again, this
|
||||
assumption led to multi-pass designs and intermediate files
|
||||
to hold the results of partial processing.
|
||||
|
||||
Brinch Hansen's stated goal was that the compiler should be
|
||||
able to compile itself. Again, because of his limited RAM,
|
||||
this drove him to a multi-pass design. He needed as little
|
||||
resident compiler code as possible, so that the necessary
|
||||
tables and other data structures would fit into RAM.
|
||||
|
||||
I haven't stated this one yet, because there hasn't been a
|
||||
need ... we've always just read and written the data as
|
||||
streams, anyway. But for the record, my plan has always
|
||||
been that, in a production compiler, the source and object
|
||||
data should all coexist in RAM with the compiler, a la the
|
||||
early Turbo Pascals. That's why I've been careful to keep
|
||||
routines like GetChar and Emit as separate routines, in
|
||||
spite of their small size. It will be easy to change them
|
||||
to read from and write to memory.
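
As a rough sketch of what that change might look like, here are
memory-based versions of GetChar, Emit, and EmitLn. SrcBuf,
ObjBuf, and SrcPos are hypothetical names, and a Pascal string is
used as the buffer only to keep the example short; a production
compiler would need something much larger.

```pascal
program MemoryIO;
{ Sketch only: GetChar and Emit redirected to in-memory buffers. }

const
  TAB = #9;

var
  SrcBuf: string;      { whole source program, held in RAM }
  ObjBuf: string;      { generated code accumulates here }
  SrcPos: integer;     { index of the current source character }
  Look: char;          { lookahead character, as in the tutorial }

{ Same interface as the keyboard version: load the next character
  into Look. Only the body knows where characters come from. }
procedure GetChar;
begin
  Inc(SrcPos);
  if SrcPos <= Length(SrcBuf) then Look := SrcBuf[SrcPos]
  else Look := #0;     { #0 marks end of source in this sketch }
end;

{ Same interface as the console version: a tab, then the string. }
procedure Emit(const S: string);
begin
  ObjBuf := ObjBuf + TAB + S;
end;

procedure EmitLn(const S: string);
begin
  Emit(S);
  ObjBuf := ObjBuf + #13#10;
end;

begin
  SrcBuf := '7';
  ObjBuf := '';
  SrcPos := 0;
  GetChar;
  EmitLn('MOVE #' + Look + ',D0');   { load a one-digit literal }
  Write(ObjBuf);                     { dump the object buffer }
end.
```

Nothing that calls GetChar or Emit has to change, which is the
reason for keeping them as separate routines.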
|
||||
|
||||
|
||||
o Emphasis on Efficiency
|
||||
|
||||
John Backus has stated that, when he and his colleagues
|
||||
developed the original FORTRAN compiler, they KNEW that they
|
||||
had to make it produce tight code. In those days, there was
|
||||
a strong sentiment against HOLs and in favor of assembly
|
||||
language, and efficiency was the reason. If FORTRAN didn't
|
||||
produce very good code by assembly standards, the users
|
||||
would simply refuse to use it. For the record, that FORTRAN
|
||||
compiler turned out to be one of the most efficient ever
|
||||
built, in terms of code quality. But it WAS complex!
|
||||
|
||||
Today, we have CPU power and RAM size to spare, so code
|
||||
efficiency is not so much of an issue. By studiously
|
||||
ignoring this issue, we have indeed been able to Keep It
|
||||
Simple. Ironically, though, as I have said, I have found
|
||||
some optimizations that we can add to the basic compiler
|
||||
structure, without having to add a lot of complexity. So in
|
||||
this case we get to have our cake and eat it too: we will
|
||||
end up with reasonable code quality, anyway.
|
||||
|
||||
|
||||
o Limited Instruction Sets
|
||||
|
||||
The early computers had primitive instruction sets. Things
|
||||
that we take for granted, such as stack operations and
|
||||
indirect addressing, came only with great difficulty.
|
||||
|
||||
Example: In most compiler designs, there is a data structure
|
||||
called the literal pool. The compiler typically identifies
|
||||
all literals used in the program, and collects them into a
|
||||
single data structure. All references to the literals are
|
||||
done indirectly to this pool. At the end of the
|
||||
compilation, the compiler issues commands to set aside
|
||||
storage and initialize the literal pool.
|
||||
|
||||
We haven't had to address that issue at all. When we want
|
||||
to load a literal, we just do it, in line, as in
|
||||
|
||||
MOVE #3,D0
|
||||
|
||||
There is something to be said for the use of a literal pool,
|
||||
particularly on a machine like the 8086 where data and code
|
||||
can be separated. Still, the whole thing adds a fairly
|
||||
large amount of complexity with little in return.
|
||||
|
||||
Of course, without the stack we would be lost. In a micro,
|
||||
both subroutine calls and temporary storage depend heavily
|
||||
on the stack, and we have used it even more than necessary
|
||||
to ease expression parsing.
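
To make the literal pool comparison concrete, here is a small
sketch of the two strategies side by side. The pool layout, the
label scheme (L0, L1, ...), and the routine names are assumptions
for illustration only; no real compiler's format is implied.

```pascal
program LiteralPool;
{ Sketch only: inline literals versus a literal pool. }

const
  MaxLits = 16;

var
  Pool: array[0..MaxLits - 1] of integer;   { collected literals }
  Count: integer;                           { number of pool entries }

{ Inline: the literal travels inside the instruction itself. }
procedure LoadInline(N: integer);
begin
  WriteLn(#9'MOVE #', N, ',D0');
end;

{ Pooled: record the literal, reusing a slot if it is already
  there, and reference it by label. }
procedure LoadPooled(N: integer);
var i, Slot: integer;
begin
  Slot := -1;
  for i := 0 to Count - 1 do
    if Pool[i] = N then Slot := i;
  if Slot < 0 then begin            { first use: add to the pool }
    Slot := Count;                  { no overflow check in this sketch }
    Pool[Count] := N;
    Inc(Count);
  end;
  WriteLn(#9'MOVE L', Slot, ',D0');
end;

{ The extra end-of-compilation step that the pool requires. }
procedure DumpPool;
var i: integer;
begin
  for i := 0 to Count - 1 do
    WriteLn('L', i, ':', #9'DC.W ', Pool[i]);
end;

begin
  Count := 0;
  LoadInline(3);
  LoadPooled(3);
  LoadPooled(5);
  LoadPooled(3);
  DumpPool;
end.
```

The inline version is one line. The pooled version needs a table,
a search, and a final pass to emit the storage, which is the
added complexity referred to above.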
|
||||
|
||||
|
||||
o Desire for Generality
|
||||
|
||||
Much of the content of the typical compiler text is taken up
|
||||
with issues we haven't addressed here at all ... things like
|
||||
automated translation of grammars, or generation of LALR
|
||||
parse tables. This is not simply because the authors want
|
||||
to impress you. There are good, practical reasons why the
|
||||
subjects are there.
|
||||
|
||||
We have been concentrating on the use of a recursive-descent
|
||||
parser to parse a deterministic grammar, i.e., a grammar
|
||||
that is not ambiguous and, moreover, can be parsed with one
|
||||
level of lookahead. I haven't made much of this limitation,
|
||||
but the fact is that this represents a small subset of
|
||||
possible grammars. In fact, there is an infinite number of
|
||||
grammars that we can't parse using our techniques. The LR
|
||||
technique is a more powerful one, and can deal with grammars
|
||||
that we can't.
|
||||
|
||||
In compiler theory, it's important to know how to deal with
|
||||
these other grammars, and how to transform them into
|
||||
grammars that are easier to deal with. For example, many
|
||||
(but not all) ambiguous grammars can be transformed into
|
||||
unambiguous ones. The way to do this is not always obvious,
|
||||
though, and so many people have devoted years to developing
|
||||
ways to transform them automatically.
|
||||
|
||||
In practice, these issues turn out to be considerably less
|
||||
important. Modern languages tend to be designed to be easy
|
||||
to parse, anyway. That was a key motivation in the design
|
||||
of Pascal. Sure, there are pathological grammars that you
|
||||
would be hard pressed to write unambiguous BNF for, but in
|
||||
the real world the best answer is probably to avoid those
|
||||
grammars!
|
||||
|
||||
In our case, of course, we have sneakily let the language
|
||||
evolve as we go, so we haven't painted ourselves into any
|
||||
corners here. You may not always have that luxury. Still,
|
||||
with a little care you should be able to keep the parser
|
||||
simple without having to resort to automatic translation of
|
||||
the grammar.
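
To make the point about transforming grammars concrete, here is a
minimal sketch. It is not part of the compiler built in this
series. The rule <expr> ::= <expr> '-' <expr> | <digit> is
ambiguous: 8-3-2 can be grouped as (8-3)-2 or as 8-(3-2). Rewritten
as <expr> ::= <digit> ('-' <digit>)*, it has only one reading, and
a recognizer with one character of lookahead can handle it. The
names Source, Cursor and GetDigit are assumptions made for the
example, and the input is held in a string rather than read from
the keyboard.

```pascal
program LeftAssoc;
{ Sketch: <expr> ::= <digit> ('-' <digit>)* with one-character lookahead }

var
   Source: string;      { input text; a string stands in for the keyboard }
   Cursor: integer;     { index of the next unread character }
   Look: char;          { lookahead character }

{ Read the next character of Source into Look }
procedure GetChar;
begin
   if Cursor <= Length(Source) then begin
      Look := Source[Cursor];
      Inc(Cursor);
   end
   else
      Look := #26;      { ^Z marks the end of the input }
end;

{ Recognize one digit and return its value }
function GetDigit: integer;
begin
   if not (Look in ['0'..'9']) then begin
      WriteLn('Error: Digit Expected');
      Halt(1);
   end;
   GetDigit := Ord(Look) - Ord('0');
   GetChar;
end;

{ The loop makes subtraction group from left to right }
function Expression: integer;
var Value: integer;
begin
   Value := GetDigit;
   while Look = '-' do begin
      GetChar;
      Value := Value - GetDigit;
   end;
   Expression := Value;
end;

begin
   Source := '8-3-2';
   Cursor := 1;
   GetChar;
   WriteLn(Expression);
end.
```

With the input 8-3-2 the loop works left to right, (8-3)-2, so the
program prints 3.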
|
||||
|
||||
|
||||
We have taken a vastly different approach in this series. We
|
||||
started with a clean sheet of paper, and developed techniques
|
||||
that work in the context that we are in; that is, a single-user
|
||||
PC with rather ample CPU power and RAM space. We have limited
|
||||
ourselves to reasonable grammars that are easy to parse, we have
|
||||
used the instruction set of the CPU to advantage, and we have not
|
||||
concerned ourselves with efficiency. THAT's why it's been easy.
|
||||
|
||||
Does this mean that we are forever doomed to be able to build
|
||||
only toy compilers? No, I don't think so. As I've said, we can
|
||||
add certain optimizations without changing the compiler
|
||||
structure. If we want to process large files, we can always add
|
||||
file buffering to do that. These things do not affect the
|
||||
overall program design.
|
||||
|
||||
And I think that's a key factor. By starting with small and
|
||||
limited cases, we have been able to concentrate on a structure
|
||||
for the compiler that is natural for the job. Since the
|
||||
structure naturally fits the job, it is almost bound to be simple
|
||||
and transparent. Adding capability doesn't have to change that
|
||||
basic structure. We can simply expand things like the file
|
||||
structure or add an optimization layer. I guess my feeling is
|
||||
that, back when resources were tight, the structures people ended
|
||||
up with were artificially warped to make them work under those
|
||||
conditions, and weren't optimum structures for the problem at
|
||||
hand.
|
||||
|
||||
|
||||
CONCLUSION
|
||||
|
||||
Anyway, that's my arm-waving guess as to how we've been able to
|
||||
keep things simple. We started with something simple and let it
|
||||
evolve naturally, without trying to force it into some
|
||||
traditional mold.
|
||||
|
||||
We're going to press on with this. I've given you a list of the
|
||||
areas we'll be covering in future installments. With those
|
||||
installments, you should be able to build complete, working
|
||||
compilers for just about any occasion, and build them simply. If
|
||||
you REALLY want to build production-quality compilers, you'll be
|
||||
able to do that, too.
|
||||
|
||||
For those of you who are champing at the bit for more parser code,
|
||||
I apologize for this digression. I just thought you'd like to
|
||||
have things put into perspective a bit. Next time, we'll get
|
||||
back to the mainstream of the tutorial.
|
||||
|
||||
So far, we've only looked at pieces of compilers, and while we
|
||||
have many of the makings of a complete language, we haven't
|
||||
talked about how to put it all together. That will be the
|
||||
subject of our next two installments. Then we'll press on into
|
||||
the new subjects I listed at the beginning of this installment.
|
||||
|
||||
See you then.
|
||||
|
||||
*****************************************************************
|
||||
* *
|
||||
* COPYRIGHT NOTICE *
|
||||
* *
|
||||
* Copyright (C) 1989 Jack W. Crenshaw. All rights reserved. *
|
||||
* *
|
||||
*****************************************************************
|
||||
|
||||
@@ -0,0 +1,821 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
LET'S BUILD A COMPILER!
|
||||
|
||||
By
|
||||
|
||||
Jack W. Crenshaw, Ph.D.
|
||||
|
||||
16 April 1989
|
||||
|
||||
|
||||
Part IX: A TOP VIEW
|
||||
|
||||
|
||||
*****************************************************************
|
||||
* *
|
||||
* COPYRIGHT NOTICE *
|
||||
* *
|
||||
* Copyright (C) 1989 Jack W. Crenshaw. All rights reserved. *
|
||||
* *
|
||||
*****************************************************************
|
||||
|
||||
|
||||
INTRODUCTION
|
||||
|
||||
In the previous installments, we have learned many of the
|
||||
techniques required to build a full-blown compiler. We've done
|
||||
assignment statements (with Boolean and arithmetic
|
||||
expressions), relational operators, and control constructs. We
|
||||
still haven't addressed procedure or function calls, but even so
|
||||
we could conceivably construct a mini-language without them.
|
||||
I've always thought it would be fun to see just how small a
|
||||
language one could build that would still be useful. We're
|
||||
ALMOST in a position to do that now. The problem is: though we
|
||||
know how to parse and translate the constructs, we still don't
|
||||
know quite how to put them all together into a language.
|
||||
|
||||
In those earlier installments, the development of our programs
|
||||
had a decidedly bottom-up flavor. In the case of expression
|
||||
parsing, for example, we began with the very lowest level
|
||||
constructs, the individual constants and variables, and worked
|
||||
our way up to more complex expressions.
|
||||
|
||||
Most people regard the top-down design approach as being better
|
||||
than the bottom-up one. I do too, but the way we did it
|
||||
certainly seemed natural enough for the kinds of things we were
|
||||
parsing.
|
||||
|
||||
You mustn't get the idea, though, that the incremental approach
|
||||
that we've been using in all these tutorials is inherently
|
||||
bottom-up. In this installment I'd like to show you that the
|
||||
approach can work just as well when applied from the top down ...
|
||||
maybe better. We'll consider languages such as C and Pascal, and
|
||||
see how complete compilers can be built starting from the top.
|
||||
|
||||
In the next installment, we'll apply the same technique to build
|
||||
a complete translator for a subset of the KISS language, which
|
||||
I'll be calling TINY. But one of my goals for this series is
|
||||
that you will not only be able to see how a compiler for TINY or
|
||||
KISS works, but that you will also be able to design and build
|
||||
compilers for your own languages. The C and Pascal examples will
|
||||
help. One thing I'd like you to see is that the natural
|
||||
structure of the compiler depends very much on the language being
|
||||
translated, so the simplicity and ease of construction of the
|
||||
compiler depends very much on letting the language set the
|
||||
program structure.
|
||||
|
||||
It's a bit much to produce a full C or Pascal compiler here, and
|
||||
we won't try. But we can flesh out the top levels far enough so
|
||||
that you can see how it goes.
|
||||
|
||||
Let's get started.
|
||||
|
||||
|
||||
THE TOP LEVEL
|
||||
|
||||
One of the biggest mistakes people make in a top-down design is
|
||||
failing to start at the true top. They think they know what the
|
||||
overall structure of the design should be, so they go ahead and
|
||||
write it down.
|
||||
|
||||
Whenever I start a new design, I always like to do it at the
|
||||
absolute beginning. In program design language (PDL), this top
|
||||
level looks something like:
|
||||
|
||||
|
||||
begin
   solve the problem
end
|
||||
|
||||
|
||||
OK, I grant you that this doesn't give much of a hint as to what
|
||||
the next level is, but I like to write it down anyway, just to
|
||||
give me that warm feeling that I am indeed starting at the top.
|
||||
|
||||
For our problem, the overall function of a compiler is to compile
|
||||
a complete program. Any definition of the language, written in
|
||||
BNF, begins here. What does the top level BNF look like? Well,
|
||||
that depends quite a bit on the language to be translated. Let's
|
||||
take a look at Pascal.
|
||||
|
||||
|
||||
THE STRUCTURE OF PASCAL
|
||||
|
||||
Most texts for Pascal include a BNF or "railroad-track"
|
||||
definition of the language. Here are the first few lines of one:
|
||||
|
||||
|
||||
<program> ::= <program-header> <block> '.'
|
||||
|
||||
<program-header> ::= PROGRAM <ident>
|
||||
|
||||
<block> ::= <declarations> <statements>
|
||||
|
||||
|
||||
We can write recognizers to deal with each of these elements,
|
||||
just as we've done before. For each one, we'll use our familiar
|
||||
single-character tokens to represent the input, then flesh things
|
||||
out a little at a time. Let's begin with the first recognizer:
|
||||
the program itself.
|
||||
|
||||
To translate this, we'll start with a fresh copy of the Cradle.
|
||||
Since we're back to single-character names, we'll just use a 'p'
|
||||
to stand for 'PROGRAM.'
|
||||
|
||||
To that fresh copy of the Cradle, add the following code, and insert
|
||||
a call to it from the main program:
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
{ Parse and Translate A Program }

procedure Prog;
var  Name: char;
begin
   Match('p');            { Handles program header part }
   Name := GetName;
   Prolog;
   Match('.');
   Epilog(Name);
end;
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
The procedures Prolog and Epilog perform whatever is required to
|
||||
let the program interface with the operating system, so that it
|
||||
can execute as a program. Needless to say, this part will be
|
||||
VERY OS-dependent. Remember, I've been emitting code for a 68000
|
||||
running under the OS I use, which is SK*DOS. I realize most of
|
||||
you are using PC's and would rather see something else, but I'm
|
||||
in this thing too deep to change now!
|
||||
|
||||
Anyhow, SK*DOS is a particularly easy OS to interface to. Here
|
||||
is the code for Prolog and Epilog:
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
{ Write the Prolog }

procedure Prolog;
begin
   EmitLn('WARMST EQU $A01E');
end;


{--------------------------------------------------------------}
{ Write the Epilog }

procedure Epilog(Name: char);
begin
   EmitLn('DC WARMST');
   EmitLn('END ' + Name);
end;
{--------------------------------------------------------------}
|
||||
|
||||
As usual, add this code and try out the "compiler." At this
|
||||
point, there is only one legal input:
|
||||
|
||||
|
||||
px. (where x is any single letter, the program name)
|
||||
|
||||
|
||||
Well, as usual our first effort is rather unimpressive, but by
|
||||
now I'm sure you know that things will get more interesting.
|
||||
There is one important thing to note: THE OUTPUT IS A WORKING,
|
||||
COMPLETE, AND EXECUTABLE PROGRAM (at least after it's assembled).
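
For readers who want to see that output without setting up the full
Cradle, here is a minimal, self-contained sketch. It assumes a
cut-down stand-in for the Cradle that takes its input from a string
(Source and Cursor are names invented for the example) instead of
the keyboard. Prog, Prolog and Epilog follow the text.

```pascal
program ProgStub;
{ Sketch: the top-level stub compiler, fed from a string }

var
   Source: string;      { input text; a string stands in for the keyboard }
   Cursor: integer;     { index of the next unread character }
   Look: char;          { lookahead character }

procedure GetChar;
begin
   if Cursor <= Length(Source) then begin
      Look := Source[Cursor];
      Inc(Cursor);
   end
   else
      Look := #26;      { ^Z marks the end of the input }
end;

procedure Expected(s: string);
begin
   WriteLn('Error: ', s, ' Expected');
   Halt(1);
end;

procedure Match(x: char);
begin
   if Look = x then GetChar
   else Expected('''' + x + '''');
end;

function GetName: char;
begin
   if not (UpCase(Look) in ['A'..'Z']) then Expected('Name');
   GetName := UpCase(Look);
   GetChar;
end;

{ Write one line of assembler output, preceded by a tab }
procedure EmitLn(s: string);
begin
   WriteLn(#9, s);
end;

procedure Prolog;
begin
   EmitLn('WARMST EQU $A01E');
end;

procedure Epilog(Name: char);
begin
   EmitLn('DC WARMST');
   EmitLn('END ' + Name);
end;

procedure Prog;
var Name: char;
begin
   Match('p');          { program header }
   Name := GetName;
   Prolog;
   Match('.');
   Epilog(Name);
end;

begin
   Source := 'px.';     { the only legal form of input at this stage }
   Cursor := 1;
   GetChar;
   Prog;
end.
```

Run as is, it prints three lines of assembler source: the WARMST
equate, DC WARMST, and END X.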
|
||||
|
||||
This is very important. The nice feature of the top-down
|
||||
approach is that at any stage you can compile a subset of the
|
||||
complete language and get a program that will run on the target
|
||||
machine. From here on, then, we need only add features by
|
||||
fleshing out the language constructs. It's all very similar to
|
||||
what we've been doing all along, except that we're approaching it
|
||||
from the other end.
|
||||
|
||||
|
||||
FLESHING IT OUT
|
||||
|
||||
To flesh out the compiler, we only have to deal with language
|
||||
features one by one. I like to start with a stub procedure that
|
||||
does nothing, then add detail in incremental fashion. Let's
|
||||
begin by processing a block, in accordance with its BNF above.
|
||||
We can do this in two stages. First, add the null procedure:
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
{ Parse and Translate a Pascal Block }

procedure DoBlock(Name: char);
begin
end;
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
and modify Prog to read:
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
{ Parse and Translate A Program }

procedure Prog;
var  Name: char;
begin
   Match('p');
   Name := GetName;
   Prolog;
   DoBlock(Name);
   Match('.');
   Epilog(Name);
end;
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
That certainly shouldn't change the behavior of the program, and
|
||||
it doesn't. But now the definition of Prog is complete, and we
|
||||
can proceed to flesh out DoBlock. That's done right from its BNF
|
||||
definition:
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
{ Parse and Translate a Pascal Block }

procedure DoBlock(Name: char);
begin
   Declarations;
   PostLabel(Name);
   Statements;
end;
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
The procedure PostLabel was defined in the installment on
|
||||
branches. Copy it into your cradle.
|
||||
|
||||
I probably need to explain the reason for inserting the label
|
||||
where I have. It has to do with the operation of SK*DOS. Unlike
|
||||
some OS's, SK*DOS allows the entry point to the main program to
|
||||
be anywhere in the program. All you have to do is to give that
|
||||
point a name. The call to PostLabel puts that name just before
|
||||
the first executable statement in the main program. How does
|
||||
SK*DOS know which of the many labels is the entry point, you ask?
|
||||
It's the one that matches the END statement at the end of the
|
||||
program.
|
||||
|
||||
OK, now we need stubs for the procedures Declarations and
|
||||
Statements. Make them null procedures as we did before.
|
||||
|
||||
Does the program still run the same? Then we can move on to the
|
||||
next stage.
|
||||
|
||||
|
||||
DECLARATIONS
|
||||
|
||||
The BNF for Pascal declarations is:
|
||||
|
||||
|
||||
<declarations> ::= ( <label list>    |
                     <constant list> |
                     <type list>     |
                     <variable list> |
                     <procedure>     |
                     <function>        )*
|
||||
|
||||
|
||||
(Note that I'm using the more liberal definition used by Turbo
|
||||
Pascal. In the standard Pascal definition, each of these parts
|
||||
must be in a specific order relative to the rest.)
|
||||
|
||||
As usual, let's let a single character represent each of these
|
||||
declaration types. The new form of Declarations is:
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
{ Parse and Translate the Declaration Part }

procedure Declarations;
begin
   while Look in ['l', 'c', 't', 'v', 'p', 'f'] do
      case Look of
       'l': Labels;
       'c': Constants;
       't': Types;
       'v': Variables;
       'p': DoProcedure;
       'f': DoFunction;
      end;
end;
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
Of course, we need stub procedures for each of these declaration
|
||||
types. This time, they can't quite be null procedures, since
|
||||
otherwise we'll end up with an infinite While loop. At the very
|
||||
least, each recognizer must eat the character that invokes it.
|
||||
Insert the following procedures:
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
{ Process Label Statement }

procedure Labels;
begin
   Match('l');
end;


{--------------------------------------------------------------}
{ Process Const Statement }

procedure Constants;
begin
   Match('c');
end;


{--------------------------------------------------------------}
{ Process Type Statement }

procedure Types;
begin
   Match('t');
end;


{--------------------------------------------------------------}
{ Process Var Statement }

procedure Variables;
begin
   Match('v');
end;


{--------------------------------------------------------------}
{ Process Procedure Definition }

procedure DoProcedure;
begin
   Match('p');
end;


{--------------------------------------------------------------}
{ Process Function Definition }

procedure DoFunction;
begin
   Match('f');
end;
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
Now try out the compiler with a few representative inputs. You
|
||||
can mix the declarations any way you like, as long as the last
|
||||
character in the program is '.' to indicate the end of the
|
||||
program. Of course, none of the declarations actually declare
|
||||
anything, so you don't need (and can't use) any characters other
|
||||
than those standing for the keywords.
|
||||
|
||||
We can flesh out the statement part in a similar way. The BNF
|
||||
for it is:
|
||||
|
||||
|
||||
<statements> ::= <compound statement>

<compound statement> ::= BEGIN <statement>
                               (';' <statement>)* END
|
||||
|
||||
|
||||
Note that statements can begin with any identifier except END.
|
||||
So the first stub form of procedure Statements is:
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
{ Parse and Translate the Statement Part }

procedure Statements;
begin
   Match('b');
   while Look <> 'e' do
      GetChar;
   Match('e');
end;
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
At this point the compiler will accept any number of
|
||||
declarations, followed by the BEGIN block of the main program.
|
||||
This block itself can contain any characters at all (except an
|
||||
END), but it must be present.
|
||||
|
||||
The simplest form of input is now
|
||||
|
||||
'pxbe.'
|
||||
|
||||
Try it. Also try some combinations of this. Make some
|
||||
deliberate errors and see what happens.
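
If you'd like to experiment without the full Cradle, the stages so
far can be gathered into one self-contained sketch. Several things
here are assumptions made for brevity and are not the text's code:
the input comes from a string (Source, Cursor), the six declaration
stubs are collapsed into a single GetChar because each one only
eats its keyword character, and Statements also stops at the end of
the input so that a missing 'e' produces an error instead of a loop
that never ends.

```pascal
program PascalTopLevel;
{ Sketch: Prog, DoBlock, Declarations and Statements as stubs }

var
   Source: string;      { input text; a string stands in for the keyboard }
   Cursor: integer;     { index of the next unread character }
   Look: char;          { lookahead character }

procedure GetChar;
begin
   if Cursor <= Length(Source) then begin
      Look := Source[Cursor];
      Inc(Cursor);
   end
   else
      Look := #26;      { ^Z marks the end of the input }
end;

procedure Expected(s: string);
begin
   WriteLn('Error: ', s, ' Expected');
   Halt(1);
end;

procedure Match(x: char);
begin
   if Look = x then GetChar
   else Expected('''' + x + '''');
end;

function GetName: char;
begin
   if not (UpCase(Look) in ['A'..'Z']) then Expected('Name');
   GetName := UpCase(Look);
   GetChar;
end;

procedure EmitLn(s: string);
begin WriteLn(#9, s) end;

procedure PostLabel(L: string);
begin WriteLn(L, ':') end;

procedure Prolog;
begin EmitLn('WARMST EQU $A01E') end;

procedure Epilog(Name: char);
begin
   EmitLn('DC WARMST');
   EmitLn('END ' + Name);
end;

{ Each keyword character stands for one whole declaration part }
procedure Declarations;
begin
   while Look in ['l', 'c', 't', 'v', 'p', 'f'] do
      GetChar;
end;

{ Skip the body of the BEGIN block; stop at 'e' or at end of input }
procedure Statements;
begin
   Match('b');
   while not (Look in ['e', #26]) do
      GetChar;
   Match('e');
end;

procedure DoBlock(Name: char);
begin
   Declarations;
   PostLabel(Name);
   Statements;
end;

procedure Prog;
var Name: char;
begin
   Match('p');
   Name := GetName;
   Prolog;
   DoBlock(Name);
   Match('.');
   Epilog(Name);
end;

begin
   Source := 'pxvcbe.'; { program x, a var part, a const part, empty block }
   Cursor := 1;
   GetChar;
   Prog;
end.
```

With the input pxvcbe. the only output beyond the prolog and epilog
is the label X: between them. Change Source to something like
pxb. to see the error path.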
|
||||
|
||||
At this point you should be beginning to see the drill. We begin
|
||||
with a stub translator to process a program, then we flesh out
|
||||
each procedure in turn, based upon its BNF definition. Just as
|
||||
the lower-level BNF definitions add detail and elaborate upon the
|
||||
higher-level ones, the lower-level recognizers will parse more
|
||||
detail of the input program. When the last stub has been
|
||||
expanded, the compiler will be complete. That's top-down
|
||||
design/implementation in its purest form.
|
||||
|
||||
You might note that even though we've been adding procedures, the
|
||||
output of the program hasn't changed. That's as it should be.
|
||||
At these top levels there is no emitted code required. The
|
||||
recognizers are functioning as just that: recognizers. They are
|
||||
accepting input sentences, catching bad ones, and channeling good
|
||||
input to the right places, so they are doing their job. If we
|
||||
were to pursue this a bit longer, code would start to appear.
|
||||
|
||||
The next step in our expansion should probably be procedure
|
||||
Statements. The Pascal definition is:
|
||||
|
||||
|
||||
<statement> ::= <simple statement> | <structured statement>

<simple statement> ::= <assignment> | <procedure call> | null

<structured statement> ::= <compound statement> |
                           <if statement>       |
                           <case statement>     |
                           <while statement>    |
                           <repeat statement>   |
                           <for statement>      |
                           <with statement>
|
||||
|
||||
|
||||
These are starting to look familiar. As a matter of fact, you
|
||||
have already gone through the process of parsing and generating
|
||||
code for both assignment statements and control structures. This
|
||||
is where the top level meets our bottom-up approach of previous
|
||||
sessions. The constructs will be a little different from those
|
||||
we've been using for KISS, but the differences are nothing you
|
||||
can't handle.
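
As a rough illustration of how that BNF might turn into a
recognizer, here is a self-contained sketch. Much of it is
assumption: the single-character tokens ('b' BEGIN, 'e' END, 'i'
IF, 'w' WHILE, 'r' REPEAT, 'f' FOR), the Stub helper, and the
string-based input (Source, Cursor) are all invented for the
example, and CASE and WITH are left out to keep the letters from
colliding. Each structured statement is a stub that eats its
keyword and reports what it saw. Only the compound statement is
fleshed out.

```pascal
program StatementSketch;
{ Sketch: dispatching on the lookahead character, one stub per construct }

var
   Source: string;      { input text; a string stands in for the keyboard }
   Cursor: integer;     { index of the next unread character }
   Look: char;          { lookahead character }

procedure GetChar;
begin
   if Cursor <= Length(Source) then begin
      Look := Source[Cursor];
      Inc(Cursor);
   end
   else
      Look := #26;      { ^Z marks the end of the input }
end;

procedure Expected(s: string);
begin
   WriteLn('Error: ', s, ' Expected');
   Halt(1);
end;

procedure Match(x: char);
begin
   if Look = x then GetChar
   else Expected('''' + x + '''');
end;

procedure CompoundStatement; forward;

{ Stub recognizer: eat one keyword character and report the construct }
procedure Stub(Keyword: char; What: string);
begin
   Match(Keyword);
   WriteLn('recognized ', What);
end;

{ <statement> ::= <simple statement> | <structured statement> }
procedure Statement;
begin
   case Look of
    'b': CompoundStatement;
    'i': Stub('i', 'if statement');
    'w': Stub('w', 'while statement');
    'r': Stub('r', 'repeat statement');
    'f': Stub('f', 'for statement');
    'e', ';': ;                         { null statement: eat nothing }
   else
      Stub(Look, 'simple statement');   { assignment or procedure call }
   end;
end;

{ <compound statement> ::= BEGIN <statement> (';' <statement>)* END }
procedure CompoundStatement;
begin
   Match('b');
   Statement;
   while Look = ';' do begin
      Match(';');
      Statement;
   end;
   Match('e');
end;

begin
   Source := 'bx;i;be;we';
   Cursor := 1;
   GetChar;
   CompoundStatement;
   if Look = #26 then WriteLn('accepted')
   else Expected('End of input');
end.
```

For the input bx;i;be;we the sketch reports a simple statement, an
if statement and a while statement, in that order, and then prints
"accepted". The nested 'be' is an empty compound statement and
produces no report of its own.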
|
||||
|
||||
I think you can get the picture now as to the procedure. We
|
||||
begin with a complete BNF description of the language. Starting
|
||||
at the top level, we code up the recognizer for that BNF
|
||||
statement, using stubs for the next-level recognizers. Then we
|
||||
flesh those lower-level statements out one by one.
|
||||
|
||||
As it happens, the definition of Pascal is very compatible with
|
||||
the use of BNF, and BNF descriptions of the language abound.
|
||||
Armed with such a description, you will find it fairly
|
||||
straightforward to continue the process we've begun.
|
||||
|
||||
You might have a go at fleshing a few of these constructs out,
|
||||
just to get a feel for it. I don't expect you to be able to
|
||||
complete a Pascal compiler here ... there are too many things
|
||||
such as procedures and types that we haven't addressed yet ...
|
||||
but it might be helpful to try some of the more familiar ones.
|
||||
It will do you good to see executable programs coming out the
|
||||
other end.
|
||||
|
||||
If I'm going to address those issues that we haven't covered yet,
|
||||
I'd rather do it in the context of KISS. We're not trying to
|
||||
build a complete Pascal compiler just yet, so I'm going to stop
|
||||
the expansion of Pascal here. Let's take a look at a very
|
||||
different language.
|
||||
|
||||
|
||||
THE STRUCTURE OF C
|
||||
|
||||
The C language is quite another matter, as you'll see. Texts on
|
||||
C rarely include a BNF definition of the language. Probably
|
||||
that's because the language is quite hard to write BNF for.
|
||||
|
||||
One reason I'm showing you these structures now is so that I can
|
||||
impress upon you these two facts:
|
||||
|
||||
(1) The definition of the language drives the structure of the
|
||||
compiler. What works for one language may be a disaster for
|
||||
another. It's a very bad idea to try to force a given
|
||||
structure upon the compiler. Rather, you should let the BNF
|
||||
drive the structure, as we have done here.
|
||||
|
||||
(2) A language that is hard to write BNF for will probably be
|
||||
hard to write a compiler for, as well. C is a popular
|
||||
language, and it has a reputation for letting you do
|
||||
virtually anything that is possible to do. Despite the
|
||||
success of Small C, C is _NOT_ an easy language to parse.
|
||||
|
||||
|
||||
A C program has less structure than its Pascal counterpart. At
|
||||
the top level, everything in C is a static declaration, either of
|
||||
data or of a function. We can capture this thought like this:
|
||||
|
||||
|
||||
<program> ::= ( <global declaration> )*
|
||||
|
||||
<global declaration> ::= <data declaration>  |
                         <function>
|
||||
|
||||
In Small C, functions can only have the default type int, which
|
||||
is not declared. This makes the input easy to parse: the first
|
||||
token is either "int," "char," or the name of a function. In
|
||||
Small C, the preprocessor commands are also processed by the
|
||||
compiler proper, so the syntax becomes:
|
||||
|
||||
|
||||
<global declaration> ::= '#' <preprocessor command>  |
                         'int' <data list>           |
                         'char' <data list>          |
                         <ident> <function body>
|
||||
|
||||
|
||||
Although we're really more interested in full C here, I'll show
|
||||
you the code corresponding to this top-level structure for Small
|
||||
C.
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
{ Parse and Translate A Program }

procedure Prog;
begin
   while Look <> ^Z do begin
      case Look of
        '#': PreProc;
        'i': IntDecl;
        'c': CharDecl;
      else DoFunction(Int);
      end;
   end;
end;
{--------------------------------------------------------------}
|
||||
|
||||
Note that I've had to use a ^Z to indicate the end of the source.
|
||||
C has no keyword such as END or the '.' to otherwise indicate the
|
||||
end.
|
||||
|
||||
With full C, things aren't even this easy. The problem comes
|
||||
about because in full C, functions can also have types. So when
|
||||
the compiler sees a keyword like "int," it still doesn't know
|
||||
whether to expect a data declaration or a function definition.
|
||||
Things get more complicated since the next token may not be a
|
||||
name ... it may start with an '*' or '(', or combinations of the
|
||||
two.
|
||||
|
||||
More specifically, the BNF for full C begins with:
|
||||
|
||||
|
||||
<program> ::= ( <top-level decl> )*
|
||||
|
||||
<top-level decl> ::= <function def> | <data decl>
|
||||
|
||||
<data decl> ::= [<class>] <type> <decl-list>
|
||||
|
||||
<function def> ::= [<class>] [<type>] <function decl>
|
||||
|
||||
|
||||
You can now see the problem: The first two parts of the
|
||||
declarations for data and functions can be the same. Because of
|
||||
the ambiguity in the grammar as written above, it's not a
|
||||
suitable grammar for a recursive-descent parser. Can we
|
||||
transform it into one that is suitable? Yes, with a little work.
|
||||
Suppose we write it this way:
|
||||
|
||||
|
||||
<top-level decl> ::= [<class>] <decl>
|
||||
|
||||
<decl> ::= <type> <typed decl> | <function decl>
|
||||
|
||||
<typed decl> ::= <data list> | <function decl>
|
||||
|
||||
|
||||
We can build a parsing routine for the class and type
|
||||
definitions, and have them store away their findings and go on,
|
||||
without their ever having to "know" whether a function or a data
|
||||
declaration is being processed.
|
||||
|
||||
To begin, key in the following version of the main program:
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   while Look <> ^Z do begin
      GetClass;
      GetType;
      TopDecl;
   end;
end.
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
For the first round, just make the three procedures stubs that do
|
||||
nothing _BUT_ call GetChar.
|
||||
|
||||
Does this program work? Well, it would be hard put NOT to, since
|
||||
we're not really asking it to do anything. It's been said that a
|
||||
C compiler will accept virtually any input without choking. It's
|
||||
certainly true of THIS compiler, since in effect all it does is
|
||||
to eat input characters until it finds a ^Z.
|
||||
|
||||
Next, let's make GetClass do something worthwhile. Declare the
|
||||
global variable
|
||||
|
||||
|
||||
var Class: char;
|
||||
|
||||
|
||||
and change GetClass to do the following:
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
{ Get a Storage Class Specifier }

procedure GetClass;
begin
   if Look in ['a', 'x', 's'] then begin
      Class := Look;
      GetChar;
      end
   else Class := 'a';
end;
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
Here, I've used three single characters to represent the three
|
||||
storage classes "auto," "extern," and "static." These are not
|
||||
the only three possible classes ... there are also "register" and
|
||||
"typedef," but this should give you the picture. Note that the
|
||||
default class is "auto."
|
||||
|
||||
We can do a similar thing for types. Enter the following
|
||||
procedure next:
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
{ Get a Type Specifier }

procedure GetType;
begin
   Typ := ' ';
   if Look = 'u' then begin
      Sign := 'u';
      Typ := 'i';
      GetChar;
      end
   else Sign := 's';
   if Look in ['i', 'l', 'c'] then begin
      Typ := Look;
      GetChar;
   end;
end;
{--------------------------------------------------------------}
|
||||
|
||||
Note that you must add two more global variables, Sign and Typ.
|
||||
|
||||
With these two procedures in place, the compiler will process the
|
||||
class and type definitions and store away their findings. We can
|
||||
now process the rest of the declaration.
|
||||
|
||||
We are by no means out of the woods yet, because there are still
|
||||
many complexities just in the definition of the type, before we
|
||||
even get to the actual data or function names. Let's pretend for
|
||||
the moment that we have passed all those gates, and that the next
|
||||
thing in the input stream is a name. If the name is followed by
|
||||
a left paren, we have a function declaration. If not, we have at
|
||||
least one data item, and possibly a list, each element of which
|
||||
can have an initializer.
|
||||
|
||||
Insert the following version of TopDecl:
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
{ Process a Top-Level Declaration }

procedure TopDecl;
var Name: char;
begin
   Name := GetName;
   if Look = '(' then
      DoFunc(Name)
   else
      DoData(Name);
end;
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
(Note that, since we have already read the name, we must pass it
|
||||
along to the appropriate routine.)
|
||||
|
||||
Finally, add the two procedures DoFunc and DoData:
|
||||
|
||||
|
||||
{--------------------------------------------------------------}
{ Process a Function Definition }

procedure DoFunc(n: char);
begin
   Match('(');
   Match(')');
   Match('{');
   Match('}');
   if Typ = ' ' then Typ := 'i';
   WriteLn(Class, Sign, Typ, ' function ', n);
end;

{--------------------------------------------------------------}
{ Process a Data Declaration }

procedure DoData(n: char);
begin
   if Typ = ' ' then Expected('Type declaration');
   WriteLn(Class, Sign, Typ, ' data ', n);
   while Look = ',' do begin
      Match(',');
      n := GetName;
      WriteLn(Class, Sign, Typ, ' data ', n);
   end;
   Match(';');
end;
{--------------------------------------------------------------}
|
||||
|
||||
|
||||
Since we're still a long way from producing executable code, I
|
||||
decided to just have these two routines tell us what they found.
|
||||
|
||||
OK, give this program a try. For data declarations, it's OK to
|
||||
give a list separated by commas. We can't process initializers
|
||||
as yet. We also can't process argument lists for the functions,
|
||||
but the "(){}" characters should be there.
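
Here is one way the pieces might be gathered into a single program
to try out. It is a sketch under a few assumptions: the input is
held in a string (Source and Cursor are invented names) rather than
typed at the keyboard, the cut-down GetChar, Match and GetName
stand in for the Cradle, and the variable the text calls Class is
renamed StorClass because "class" is a reserved word in some later
Pascal dialects.

```pascal
program CTopLevel;
{ Sketch: class, type and top-level declaration recognizers for C }

var
   Source: string;              { input text; stands in for the keyboard }
   Cursor: integer;             { index of the next unread character }
   Look: char;                  { lookahead character }
   StorClass, Sign, Typ: char;  { StorClass plays the role of Class }

procedure GetChar;
begin
   if Cursor <= Length(Source) then begin
      Look := Source[Cursor];
      Inc(Cursor);
   end
   else
      Look := #26;              { ^Z marks the end of the input }
end;

procedure Expected(s: string);
begin
   WriteLn('Error: ', s, ' Expected');
   Halt(1);
end;

procedure Match(x: char);
begin
   if Look = x then GetChar
   else Expected('''' + x + '''');
end;

function GetName: char;
begin
   if not (UpCase(Look) in ['A'..'Z']) then Expected('Name');
   GetName := UpCase(Look);
   GetChar;
end;

procedure GetClass;
begin
   if Look in ['a', 'x', 's'] then begin
      StorClass := Look;
      GetChar;
   end
   else StorClass := 'a';
end;

procedure GetType;
begin
   Typ := ' ';
   if Look = 'u' then begin
      Sign := 'u';
      Typ := 'i';
      GetChar;
   end
   else Sign := 's';
   if Look in ['i', 'l', 'c'] then begin
      Typ := Look;
      GetChar;
   end;
end;

procedure DoFunc(n: char);
begin
   Match('(');
   Match(')');
   Match('{');
   Match('}');
   if Typ = ' ' then Typ := 'i';
   WriteLn(StorClass, Sign, Typ, ' function ', n);
end;

procedure DoData(n: char);
begin
   if Typ = ' ' then Expected('Type declaration');
   WriteLn(StorClass, Sign, Typ, ' data ', n);
   while Look = ',' do begin
      Match(',');
      n := GetName;
      WriteLn(StorClass, Sign, Typ, ' data ', n);
   end;
   Match(';');
end;

procedure TopDecl;
var Name: char;
begin
   Name := GetName;
   if Look = '(' then DoFunc(Name)
   else DoData(Name);
end;

begin
   Source := 'ia,b;ucn;xlm;f(){}sig(){}';
   Cursor := 1;
   GetChar;
   while Look <> #26 do begin
      GetClass;
      GetType;
      TopDecl;
   end;
end.
```

The sample input stands for "int a, b; unsigned char n; extern
long m;" followed by an untyped function f and a static int
function g. Each report line shows class, sign and type, so the
first one reads "asi data A" and the last "ssi function G". Since
there is no scanner yet, a name that happens to be one of the
keyword letters will be misread when it appears where a class or
type is expected.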
|
||||
|
||||
We're still a _VERY_ long way from having a C compiler, but what
we have is starting to process the right kinds of inputs, and is
recognizing both good and bad inputs. In the process, the
natural structure of the compiler is starting to take form.

Can we continue this until we have something that acts more like
a compiler? Of course we can. Should we? That's another matter.
I don't know about you, but I'm beginning to get dizzy, and we've
still got a long way to go to even get past the data
declarations.

At this point, I think you can see how the structure of the
compiler evolves from the language definition. The structures
we've seen for our two examples, Pascal and C, are as different
as night and day. Pascal was designed at least partly to be easy
to parse, and that's reflected in the compiler. In general, in
Pascal there is more structure and we have a better idea of what
kinds of constructs to expect at any point. In C, on the other
hand, the program is essentially a list of declarations,
terminated only by the end of file.

We could pursue both of these structures much farther, but
remember that our purpose here is not to build a Pascal or a C
compiler, but rather to study compilers in general. For those of
you who DO want to deal with Pascal or C, I hope I've given you
enough of a start so that you can take it from here (although
you'll soon need some of the stuff we haven't covered yet,
such as typing and procedure calls). For the rest of you, stay
with me through the next installment. There, I'll be leading you
through the development of a complete compiler for TINY, a subset
of KISS.

See you then.


*****************************************************************
*                                                               *
*                        COPYRIGHT NOTICE                       *
*                                                               *
*   Copyright (C) 1989 Jack W. Crenshaw. All rights reserved.   *
*                                                               *
*****************************************************************