399 lines
14 KiB
Plaintext
399 lines
14 KiB
Plaintext
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
LET'S BUILD A COMPILER!
|
|
|
|
By
|
|
|
|
Jack W. Crenshaw, Ph.D.
|
|
|
|
24 July 1988
|
|
|
|
|
|
Part I: INTRODUCTION
|
|
|
|
|
|
*****************************************************************
|
|
* *
|
|
* COPYRIGHT NOTICE *
|
|
* *
|
|
* Copyright (C) 1988 Jack W. Crenshaw. All rights reserved. *
|
|
* *
|
|
*****************************************************************
|
|
|
|
|
|
INTRODUCTION
|
|
|
|
|
|
This series of articles is a tutorial on the theory and practice
|
|
of developing language parsers and compilers. Before we are
|
|
finished, we will have covered every aspect of compiler
|
|
construction, designed a new programming language, and built a
|
|
working compiler.
|
|
|
|
Though I am not a computer scientist by education (my Ph.D. is in
|
|
a different field, Physics), I have been interested in compilers
|
|
for many years. I have bought and tried to digest the contents
|
|
of virtually every book on the subject ever written. I don't
|
|
mind telling you that it was slow going. Compiler texts are
|
|
written for Computer Science majors, and are tough sledding for
|
|
the rest of us. But over the years a bit of it began to seep in.
|
|
What really caused it to jell was when I began to branch off on
|
|
my own and begin to try things on my own computer. Now I plan to
|
|
share with you what I have learned. At the end of this series
|
|
you will by no means be a computer scientist, nor will you know
|
|
all the esoterics of compiler theory. I intend to completely
|
|
ignore the more theoretical aspects of the subject. What you
|
|
_WILL_ know is all the practical aspects that one needs to know
|
|
to build a working system.
|
|
|
|
This is a "learn-by-doing" series. In the course of the series I
|
|
will be performing experiments on a computer. You will be
|
|
expected to follow along, repeating the experiments that I do,
|
|
and performing some on your own. I will be using Turbo Pascal
|
|
4.0 on a PC clone. I will periodically insert examples written
|
|
in TP. These will be executable code, which you will be expected
|
|
to copy into your own computer and run. If you don't have a copy
|
|
of Turbo, you will be severely limited in how well you will be
|
|
able to follow what's going on. If you don't have a copy, I urge
|
|
you to get one. After all, it's an excellent product, good for
|
|
many other uses!
|
|
|
|
Some articles on compilers show you examples, or show you (as in
|
|
the case of Small-C) a finished product, which you can then copy
|
|
and use without a whole lot of understanding of how it works. I
|
|
hope to do much more than that. I hope to teach you HOW the
|
|
things get done, so that you can go off on your own and not only
|
|
reproduce what I have done, but improve on it.
|
|
|
|
This is admittedly an ambitious undertaking, and it won't be done
|
|
in one page. I expect to do it in the course of a number of
|
|
articles. Each article will cover a single aspect of compiler
|
|
theory, and will pretty much stand alone. If all you're
|
|
interested in at a given time is one aspect, then you need to
|
|
look only at that one article. Each article will be uploaded as
|
|
it is complete, so you will have to wait for the last one before
|
|
you can consider yourself finished. Please be patient.
|
|
|
|
|
|
|
|
The average text on compiler theory covers a lot of ground that
|
|
we won't be covering here. The typical sequence is:
|
|
|
|
o An introductory chapter describing what a compiler is.
|
|
|
|
o A chapter or two on syntax equations, using Backus-Naur Form
|
|
(BNF).
|
|
|
|
o A chapter or two on lexical scanning, with emphasis on
|
|
deterministic and non-deterministic finite automata.
|
|
|
|
o Several chapters on parsing theory, beginning with top-down
|
|
recursive descent, and ending with LALR parsers.
|
|
|
|
o A chapter on intermediate languages, with emphasis on P-code
|
|
and similar reverse polish representations.
|
|
|
|
o Many chapters on alternative ways to handle subroutines and
|
|
parameter passing, type declarations, and such.
|
|
|
|
o A chapter toward the end on code generation, usually for some
|
|
imaginary CPU with a simple instruction set. Most readers
|
|
(and in fact, most college classes) never make it this far.
|
|
|
|
o A final chapter or two on optimization. This chapter often
|
|
goes unread, too.
|
|
|
|
|
|
I'll be taking a much different approach in this series. To
|
|
begin with, I won't dwell long on options. I'll be giving you
|
|
_A_ way that works. If you want to explore options, well and
|
|
good ... I encourage you to do so ... but I'll be sticking to
|
|
what I know. I also will skip over most of the theory that puts
|
|
people to sleep. Don't get me wrong: I don't belittle the
|
|
theory, and it's vitally important when it comes to dealing with
|
|
the more tricky parts of a given language. But I believe in
|
|
putting first things first. Here we'll be dealing with the 95%
|
|
of compiler techniques that don't need a lot of theory to handle.
|
|
|
|
I also will discuss only one approach to parsing: top-down,
|
|
recursive descent parsing, which is the _ONLY_ technique that's
|
|
at all amenable to hand-crafting a compiler. The other
|
|
approaches are only useful if you have a tool like YACC, and also
|
|
don't care how much memory space the final product uses.
|
|
|
|
I also take a page from the work of Ron Cain, the author of the
|
|
original Small C. Whereas almost all other compiler authors have
|
|
historically used an intermediate language like P-code and
|
|
divided the compiler into two parts (a front end that produces
|
|
P-code, and a back end that processes P-code to produce
|
|
executable object code), Ron showed us that it is a
|
|
straightforward matter to make a compiler directly produce
|
|
executable object code, in the form of assembler language
|
|
statements. The code will _NOT_ be the world's tightest code ...
|
|
producing optimized code is a much more difficult job. But it
|
|
will work, and work reasonably well. Just so that I don't leave
|
|
you with the impression that our end product will be worthless, I
|
|
_DO_ intend to show you how to "soup up" the compiler with some
|
|
optimization.
|
|
|
|
|
|
|
|
Finally, I'll be using some tricks that I've found to be most
|
|
helpful in letting me understand what's going on without wading
|
|
through a lot of boiler plate. Chief among these is the use of
|
|
single-character tokens, with no embedded spaces, for the early
|
|
design work. I figure that if I can get a parser to recognize
|
|
and deal with I-T-L, I can get it to do the same with IF-THEN-
|
|
ELSE. And I can. In the second "lesson," I'll show you just
|
|
how easy it is to extend a simple parser to handle tokens of
|
|
arbitrary length. As another trick, I completely ignore file
|
|
I/O, figuring that if I can read source from the keyboard and
|
|
output object to the screen, I can also do it from/to disk files.
|
|
Experience has proven that once a translator is working
|
|
correctly, it's a straightforward matter to redirect the I/O to
|
|
files. The last trick is that I make no attempt to do error
|
|
correction/recovery. The programs we'll be building will
|
|
RECOGNIZE errors, and will not CRASH, but they will simply stop
|
|
on the first error ... just like good ol' Turbo does. There will
|
|
be other tricks that you'll see as you go. Most of them can't be
|
|
found in any compiler textbook, but they work.
|
|
|
|
A word about style and efficiency. As you will see, I tend to
|
|
write programs in _VERY_ small, easily understood pieces. None
|
|
of the procedures we'll be working with will be more than about
|
|
15-20 lines long. I'm a fervent devotee of the KISS (Keep It
|
|
Simple, Sidney) school of software development. I try to never
|
|
do something tricky or complex, when something simple will do.
|
|
Inefficient? Perhaps, but you'll like the results. As Brian
|
|
Kernighan has said, FIRST make it run, THEN make it run fast.
|
|
If, later on, you want to go back and tighten up the code in one
|
|
of our products, you'll be able to do so, since the code will be
|
|
quite understandable. If you do so, however, I urge you to wait
|
|
until the program is doing everything you want it to.
|
|
|
|
I also have a tendency to delay building a module until I
|
|
discover that I need it. Trying to anticipate every possible
|
|
future contingency can drive you crazy, and you'll generally
|
|
guess wrong anyway. In this modern day of screen editors and
|
|
fast compilers, I don't hesitate to change a module when I feel I
|
|
need a more powerful one. Until then, I'll write only what I
|
|
need.
|
|
|
|
One final caveat: One of the principles we'll be sticking to here
|
|
is that we don't fool around with P-code or imaginary CPUs, but
|
|
that we will start out on day one producing working, executable
|
|
object code, at least in the form of assembler language source.
|
|
However, you may not like my choice of assembler language ...
|
|
it's 68000 code, which is what works on my system (under SK*DOS).
|
|
I think you'll find, though, that the translation to any other
|
|
CPU such as the 80x86 will be quite obvious, though, so I don't
|
|
see a problem here. In fact, I hope someone out there who knows
|
|
the '86 language better than I do will offer us the equivalent
|
|
object code fragments as we need them.
|
|
|
|
|
|
THE CRADLE
|
|
|
|
Every program needs some boiler plate ... I/O routines, error
|
|
message routines, etc. The programs we develop here will be no
|
|
exceptions. I've tried to hold this stuff to an absolute
|
|
minimum, however, so that we can concentrate on the important
|
|
stuff without losing it among the trees. The code given below
|
|
represents about the minimum that we need to get anything done.
|
|
It consists of some I/O routines, an error-handling routine and a
|
|
skeleton, null main program. I call it our cradle. As we
|
|
develop other routines, we'll add them to the cradle, and add the
|
|
calls to them as we need to. Make a copy of the cradle and save
|
|
it, because we'll be using it more than once.
|
|
|
|
There are many different ways to organize the scanning activities
|
|
of a parser. In Unix systems, authors tend to use getc and
|
|
ungetc. I've had very good luck with the approach shown here,
|
|
which is to use a single, global, lookahead character. Part of
|
|
the initialization procedure (the only part, so far!) serves to
|
|
"prime the pump" by reading the first character from the input
|
|
stream. No other special techniques are required with Turbo 4.0
|
|
... each successive call to GetChar will read the next character
|
|
in the stream.
|
|
|
|
|
|
{--------------------------------------------------------------}
|
|
program Cradle;
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Constant Declarations }
|
|
|
|
const TAB = ^I;
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Variable Declarations }
|
|
|
|
var Look: char; { Lookahead Character }
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Read New Character From Input Stream }
|
|
|
|
procedure GetChar;
|
|
begin
|
|
Read(Look);
|
|
end;
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Report an Error }
|
|
|
|
procedure Error(s: string);
|
|
begin
|
|
WriteLn;
|
|
WriteLn(^G, 'Error: ', s, '.');
|
|
end;
|
|
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Report Error and Halt }
|
|
|
|
procedure Abort(s: string);
|
|
begin
|
|
Error(s);
|
|
Halt;
|
|
end;
|
|
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Report What Was Expected }
|
|
|
|
procedure Expected(s: string);
|
|
begin
|
|
Abort(s + ' Expected');
|
|
end;
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Match a Specific Input Character }
|
|
|
|
procedure Match(x: char);
|
|
begin
|
|
if Look = x then GetChar
|
|
else Expected('''' + x + '''');
|
|
end;
|
|
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Recognize an Alpha Character }
|
|
|
|
function IsAlpha(c: char): boolean;
|
|
begin
|
|
IsAlpha := upcase(c) in ['A'..'Z'];
|
|
end;
|
|
|
|
|
|
{--------------------------------------------------------------}
|
|
|
|
{ Recognize a Decimal Digit }
|
|
|
|
function IsDigit(c: char): boolean;
|
|
begin
|
|
IsDigit := c in ['0'..'9'];
|
|
end;
|
|
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Get an Identifier }
|
|
|
|
function GetName: char;
|
|
begin
|
|
if not IsAlpha(Look) then Expected('Name');
|
|
GetName := UpCase(Look);
|
|
GetChar;
|
|
end;
|
|
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Get a Number }
|
|
|
|
function GetNum: char;
|
|
begin
|
|
if not IsDigit(Look) then Expected('Integer');
|
|
GetNum := Look;
|
|
GetChar;
|
|
end;
|
|
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Output a String with Tab }
|
|
|
|
procedure Emit(s: string);
|
|
begin
|
|
Write(TAB, s);
|
|
end;
|
|
|
|
|
|
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Output a String with Tab and CRLF }
|
|
|
|
procedure EmitLn(s: string);
|
|
begin
|
|
Emit(s);
|
|
WriteLn;
|
|
end;
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Initialize }
|
|
|
|
procedure Init;
|
|
begin
|
|
GetChar;
|
|
end;
|
|
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Main Program }
|
|
|
|
begin
|
|
Init;
|
|
end.
|
|
{--------------------------------------------------------------}
|
|
|
|
|
|
That's it for this introduction. Copy the code above into TP and
|
|
compile it. Make sure that it compiles and runs correctly. Then
|
|
proceed to the first lesson, which is on expression parsing.
|
|
|
|
|
|
*****************************************************************
|
|
* *
|
|
* COPYRIGHT NOTICE *
|
|
* *
|
|
* Copyright (C) 1988 Jack W. Crenshaw. All rights reserved. *
|
|
* *
|
|
*****************************************************************
|
|
|
|
|
|
|
|
|