1151 lines
43 KiB
Plaintext
1151 lines
43 KiB
Plaintext
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
LET'S BUILD A COMPILER!
|
|
|
|
By
|
|
|
|
Jack W. Crenshaw, Ph.D.
|
|
|
|
5 March 1994
|
|
|
|
|
|
Part 15: BACK TO THE FUTURE
|
|
|
|
|
|
|
|
*****************************************************************
|
|
* *
|
|
* COPYRIGHT NOTICE *
|
|
* *
|
|
* Copyright (C) 1994 Jack W. Crenshaw. All rights reserved. *
|
|
* *
|
|
*****************************************************************
|
|
|
|
|
|
INTRODUCTION
|
|
|
|
Can it really have been four years since I wrote installment
|
|
fourteen of this series? Is it really possible that six long
|
|
years have passed since I began it? Funny how time flies when
|
|
you're having fun, isn't it?
|
|
|
|
I won't spend a lot of time making excuses; only point out that
|
|
things happen, and priorities change. In the four years since
|
|
installment fourteen, I've managed to get laid off, get divorced,
|
|
have a nervous breakdown, begin a new career as a writer, begin
|
|
another one as a consultant, move, work on two real-time systems,
|
|
and raise fourteen baby birds, three pigeons, six possums, and a
|
|
duck. For awhile there, the parsing of source code was not high
|
|
on my list of priorities. Neither was writing stuff for free,
|
|
instead of writing stuff for pay. But I do try to be faithful,
|
|
and I do recognize and feel my responsibility to you, the reader,
|
|
to finish what I've started. As the tortoise said in one of my
|
|
son's old stories, I may be slow, but I'm sure. I'm sure that
|
|
there are people out there anxious to see the last reel of this
|
|
film, and I intend to give it to them. So, if you're one of those
|
|
who's been waiting, more or less patiently, to see how this thing
|
|
comes out, thanks for your patience. I apologize for the delay.
|
|
Let's move on.
|
|
|
|
|
|
NEW STARTS, OLD DIRECTIONS
|
|
|
|
Like many other things, programming languages and programming
|
|
styles change with time. In 1994, it seems a little anachronistic
|
|
to be programming in Turbo Pascal, when the rest of the world
|
|
seems to have gone bananas over C++. It also seems a little
|
|
strange to be programming in a classical style when the rest of
|
|
the world has switched to object-oriented methods. Still, in
|
|
spite of the four-year hiatus, it would be entirely too wrenching
|
|
a change, at this point, to switch to, say, C++ with object-
|
|
orientation . Anyway, Pascal is still not only a powerful
|
|
programming language (more than ever, in fact), but it's a
|
|
wonderful medium for teaching. C is a notoriously difficult
|
|
language to read ... it's often been accused, along with Forth, of
|
|
being a "write-only language." When I program in C++, I find
|
|
myself spending at least 50% of my time struggling with language
|
|
syntax rather than with concepts. A stray "&" or "*" can not only
|
|
change the functioning of the program, but its correctness as
|
|
well. By contrast, Pascal code is usually quite transparent and
|
|
easy to read, even if you don't know the language. What you see is
|
|
almost always what you get, and we can concentrate on concepts
|
|
rather than implementation details. I've said from the beginning
|
|
that the purpose of this tutorial series was not to generate the
|
|
world's fastest compiler, but to teach the fundamentals of
|
|
compiler technology, while spending the least amount of time
|
|
wrestling with language syntax or other aspects of software
|
|
implementation. Finally, since a lot of what we do in this course
|
|
amounts to software experimentation, it's important to have a
|
|
compiler and associated environment that compiles quickly and with
|
|
no fuss. In my opinion, by far the most significant time measure
|
|
in software development is the speed of the edit/compile/test
|
|
cycle. In this department, Turbo Pascal is king. The compilation
|
|
speed is blazing fast, and continues to get faster in every
|
|
release (how do they keep doing that?). Despite vast improvements
|
|
in C compilation speed over the years, even Borland's fastest
|
|
C/C++ compiler is still no match for Turbo Pascal. Further, the
|
|
editor built into their IDE, the make facility, and even their
|
|
superb smart linker, all complement each other to produce a
|
|
wonderful environment for quick turnaround. For all of these
|
|
reasons, I intend to stick with Pascal for the duration of this
|
|
series. We'll be using Turbo Pascal for Windows, one of the
|
|
compilers provided Borland Pascal with Objects, version 7.0. If
|
|
you don't have this compiler, don't worry ... nothing we do here
|
|
is going to count on your having the latest version. Using the
|
|
Windows version helps me a lot, by allowing me to use the
|
|
Clipboard to copy code from the compiler's editor into these
|
|
documents. It should also help you at least as much, copying the
|
|
code in the other direction.
|
|
|
|
I've thought long and hard about whether or not to introduce
|
|
objects to our discussion. I'm a big advocate of object-oriented
|
|
methods for all uses, and such methods definitely have their place
|
|
in compiler technology. In fact, I've written papers on just this
|
|
subject (Refs. 1-3). But the architecture of a compiler which is
|
|
based on object-oriented approaches is vastly different than that
|
|
of the more classical compiler we've been building. Again, it
|
|
would seem to be entirely too much to change these horses in mid-
|
|
stream. As I said, programming styles change. Who knows, it may
|
|
be another six years before we finish this thing, and if we keep
|
|
changing the code every time programming style changes, we may
|
|
NEVER finish.
|
|
|
|
So for now, at least, I've determined to continue the classical
|
|
style in Pascal, though we might indeed discuss objects and object
|
|
orientation as we go. Likewise, the target machine will remain
|
|
the Motorola 68000 family. Of all the decisions to be made here,
|
|
this one has been the easiest. Though I know that many of you
|
|
would like to see code for the 80x86, the 68000 has become, if
|
|
anything, even more popular as a platform for embedded systems,
|
|
and it's to that application that this whole effort began in the
|
|
first place. Compiling for the PC, MSDOS platform, we'd have to
|
|
deal with all the issues of DOS system calls, DOS linker formats,
|
|
the PC file system and hardware, and all those other complications
|
|
of a DOS environment. An embedded system, on the other hand, must
|
|
run standalone, and it's for this kind of application, as an
|
|
alternative to assembly language, that I've always imagined that a
|
|
language like KISS would thrive. Anyway, who wants to deal with
|
|
the 80x86 architecture if they don't have to?
|
|
|
|
The one feature of Turbo Pascal that I'm going to be making heavy
|
|
use of is units. In the past, we've had to make compromises
|
|
between code size and complexity, and program functionality. A
|
|
lot of our work has been in the nature of computer
|
|
experimentation, looking at only one aspect of compiler technology
|
|
at a time. We did this to avoid to avoid having to carry around
|
|
large programs, just to investigate simple concepts. In the
|
|
process, we've re-invented the wheel and re-programmed the same
|
|
functions more times than I'd like to count. Turbo units provide
|
|
a wonderful way to get functionality and simplicity at the same
|
|
time: You write reusable code, and invoke it with a single line.
|
|
Your test program stays small, but it can do powerful things.
|
|
|
|
One feature of Turbo Pascal units is their initialization block.
|
|
As with an Ada package, any code in the main begin-end block of a
|
|
unit gets executed as the program is initialized. As you'll see
|
|
later, this sometimes gives us neat simplifications in the code.
|
|
Our procedure Init, which has been with us since Installment 1,
|
|
goes away entirely when we use units. The various routines in the
|
|
Cradle, another key features of our approach, will get distributed
|
|
among the units.
|
|
|
|
The concept of units, of course, is no different than that of C
|
|
modules. However, in C (and C++), the interface between modules
|
|
comes via preprocessor include statements and header files. As
|
|
someone who's had to read a lot of other people's C programs, I've
|
|
always found this rather bewildering. It always seems that
|
|
whatever data structure you'd like to know about is in some other
|
|
file. Turbo units are simpler for the very reason that they're
|
|
criticized by some: The function interfaces and their
|
|
implementation are included in the same file. While this
|
|
organization may create problems with code security, it also
|
|
reduces the number of files by half, which isn't half bad.
|
|
Linking of the object files is also easy, because the Turbo
|
|
compiler takes care of it without the need for make files or other
|
|
mechanisms.
|
|
|
|
|
|
STARTING OVER?
|
|
|
|
Four years ago, in Installment 14, I promised you that our days of
|
|
re-inventing the wheel, and recoding the same software over and
|
|
over for each lesson, were over, and that from now on we'd stick
|
|
to more complete programs that we would simply add new features
|
|
to. I still intend to keep that promise; that's one of the main
|
|
purposes for using units. However, because of the long time since
|
|
Installment 14, it's natural to want to at least do some review,
|
|
and anyhow, we're going to have to make rather sweeping changes in
|
|
the code to make the transition to units. Besides, frankly, after
|
|
all this time I can't remember all the neat ideas I had in my head
|
|
four years ago. The best way for me to recall them is to retrace
|
|
some of the steps we took to arrive at Installment 14. So I hope
|
|
you'll be understanding and bear with me as we go back to our
|
|
roots, in a sense, and rebuild the core of the software,
|
|
distributing the routines among the various units, and
|
|
bootstrapping ourselves back up to the point we were at lo, those
|
|
many moons ago. As has always been the case, you're going to get
|
|
to see me make all the mistakes and execute changes of direction,
|
|
in real time. Please bear with me ... we'll start getting to the
|
|
new stuff before you know it.
|
|
|
|
Since we're going to be using multiple modules in our new
|
|
approach, we have to address the issue of file management. If
|
|
you've followed all the other sections of this tutorial, you know
|
|
that, as our programs evolve, we're going to be replacing older,
|
|
more simple-minded units with more capable ones. This brings us to
|
|
an issue of version control. There will almost certainly be times
|
|
when we will overlay a simple file (unit), but later wish we had
|
|
the simple one again. A case in point is embodied in our
|
|
predilection for using single-character variable names, keywords,
|
|
etc., to test concepts without getting bogged down in the details
|
|
of a lexical scanner. Thanks to the use of units, we will be
|
|
doing much less of this in the future. Still, I not only suspect,
|
|
but am certain that we will need to save some older versions of
|
|
files, for special purposes, even though they've been replaced by
|
|
newer, more capable ones.
|
|
|
|
To deal with this problem, I suggest that you create different
|
|
directories, with different versions of the units as needed. If
|
|
we do this properly, the code in each directory will remain self-
|
|
consistent. I've tentatively created four directories: SINGLE
|
|
(for single-character experimentation), MULTI (for, of course,
|
|
multi-character versions), TINY, and KISS.
|
|
|
|
Enough said about philosophy and details. Let's get on with the
|
|
resurrection of the software.
|
|
|
|
|
|
THE INPUT UNIT
|
|
|
|
A key concept that we've used since Day 1 has been the idea of an
|
|
input stream with one lookahead character. All the parsing
|
|
routines examine this character, without changing it, to decide
|
|
what they should do next. (Compare this approach with the C/Unix
|
|
approach using getchar and unget, and I think you'll agree that
|
|
our approach is simpler). We'll begin our hike into the future by
|
|
translating this concept into our new, unit-based organization.
|
|
The first unit, appropriately called Input, is shown below:
|
|
|
|
|
|
{--------------------------------------------------------------}
|
|
unit Input;
|
|
{--------------------------------------------------------------}
|
|
interface
|
|
var Look: char; { Lookahead character }
|
|
procedure GetChar; { Read new character }
|
|
|
|
{--------------------------------------------------------------}
|
|
implementation
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Read New Character From Input Stream }
|
|
|
|
procedure GetChar;
|
|
begin
|
|
Read(Look);
|
|
end;
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Unit Initialization }
|
|
begin
|
|
GetChar;
|
|
end.
|
|
{--------------------------------------------------------------}
|
|
|
|
|
|
As you can see, there's nothing very profound, and certainly
|
|
nothing complicated, about this unit, since it consists of only a
|
|
single procedure. But already, we can see how the use of units
|
|
gives us advantages. Note the executable code in the
|
|
initialization block. This code "primes the pump" of the input
|
|
stream for us, something we've always had to do before, by
|
|
inserting the call to GetChar in line, or in procedure Init. This
|
|
time, the call happens without any special reference to it on our
|
|
part, except within the unit itself. As I predicted earlier, this
|
|
mechanism is going to make our lives much simpler as we proceed.
|
|
I consider it to be one of the most useful features of Turbo
|
|
Pascal, and I lean on it heavily.
|
|
|
|
Copy this unit into your compiler's IDE, and compile it. To test
|
|
the software, of course, we always need a main program. I used
|
|
the following, really complex test program, which we'll later
|
|
evolve into the Main for our compiler:
|
|
|
|
|
|
{--------------------------------------------------------------}
|
|
program Main;
|
|
uses WinCRT, Input;
|
|
begin
|
|
WriteLn(Look);
|
|
end.
|
|
{--------------------------------------------------------------}
|
|
|
|
|
|
Note the use of the Borland-supplied unit, WinCRT. This unit is
|
|
necessary if you intend to use the standard Pascal I/O routines,
|
|
Read, ReadLn, Write, and WriteLn, which of course we intend to do.
|
|
If you forget to include this unit in the "uses" clause, you will
|
|
get a really bizarre and indecipherable error message at run time.
|
|
|
|
Note also that we can access the lookahead character, even though
|
|
it's not declared in the main program. All variables declared
|
|
within the interface section of a unit are global, but they're
|
|
hidden from prying eyes; to that extent, we get a modicum of
|
|
information hiding. Of course, if we were writing in an object-
|
|
oriented fashion, we should not allow outside modules to access
|
|
the units internal variables. But, although Turbo units have a
|
|
lot in common with objects, we're not doing object-oriented design
|
|
or code here, so our use of Look is appropriate.
|
|
|
|
Go ahead and save the test program as Main.pas. To make life
|
|
easier as we get more and more files, you might want to take this
|
|
opportunity to declare this file as the compiler's Primary file.
|
|
That way, you can execute the program from any file. Otherwise,
|
|
if you press Cntl-F9 to compile and run from one of the units,
|
|
you'll get an error message. You set the primary file using the
|
|
main submenu, "Compile," in the Turbo IDE.
|
|
|
|
I hasten to point out, as I've done before, that the function of
|
|
unit Input is, and always has been, considered to be a dummy
|
|
version of the real thing. In a production version of a compiler,
|
|
the input stream will, of course, come from a file rather than
|
|
from the keyboard. And it will almost certainly include line
|
|
buffering, at the very least, and more likely, a rather large text
|
|
buffer to support efficient disk I/O. The nice part about the
|
|
unit approach is that, as with objects, we can modify the code in
|
|
the unit to be as simple or as sophisticated as we like. As long
|
|
as the interface, as embodied in the public procedures and the
|
|
lookahead character, don't change, the rest of the program is
|
|
totally unaffected. And since units are compiled, rather than
|
|
merely included, the time required to link with them is virtually
|
|
nil. Again, the result is that we can get all the benefits of
|
|
sophisticated implementations, without having to carry the code
|
|
around as so much baggage.
|
|
|
|
In later installments, I intend to provide a full-blown IDE for
|
|
the KISS compiler, using a true Windows application generated by
|
|
Borland's OWL applications framework. For now, though, we'll obey
|
|
my #1 rule to live by: Keep It Simple.
|
|
|
|
|
|
|
|
THE OUTPUT UNIT
|
|
|
|
Of course, every decent program should have output, and ours is no
|
|
exception. Our output routines included the Emit functions. The
|
|
code for the corresponding output unit is shown next:
|
|
|
|
|
|
{--------------------------------------------------------------}
|
|
unit Output;
|
|
{--------------------------------------------------------------}
|
|
interface
|
|
procedure Emit(s: string); { Emit an instruction }
|
|
procedure EmitLn(s: string); { Emit an instruction line }
|
|
|
|
{--------------------------------------------------------------}
|
|
implementation
|
|
const TAB = ^I;
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Emit an Instruction }
|
|
|
|
procedure Emit(s: string);
|
|
begin
|
|
Write(TAB, s);
|
|
end;
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Emit an Instruction, Followed By a Newline }
|
|
|
|
procedure EmitLn(s: string);
|
|
begin
|
|
Emit(s);
|
|
WriteLn;
|
|
end;
|
|
|
|
end.
|
|
{--------------------------------------------------------------}
|
|
|
|
|
|
(Notice that this unit has no initialization clause, so it needs
|
|
no begin-block.)
|
|
|
|
Test this unit with the following main program:
|
|
|
|
{--------------------------------------------------------------}
|
|
program Test;
|
|
uses WinCRT, Input, Output, Scanner, Parser;
|
|
begin
|
|
WriteLn('MAIN:");
|
|
EmitLn('Hello, world!');
|
|
end.
|
|
{--------------------------------------------------------------}
|
|
|
|
Did you see anything that surprised you? You may have been
|
|
surprised to see that you needed to type something, even though
|
|
the main program requires no input. That's because of the
|
|
initialization in unit Input, which still requires something to
|
|
put into the lookahead character. Sorry, there's no way out of
|
|
that box, or rather, we don't _WANT_ to get out. Except for simple
|
|
test cases such as this, we will always want a valid lookahead
|
|
character, so the right thing to do about this "problem" is ...
|
|
nothing.
|
|
|
|
Perhaps more surprisingly, notice that the TAB character had no
|
|
effect; our line of "instructions" begins at column 1, same as the
|
|
fake label. That's right: WinCRT doesn't support tabs. We have a
|
|
problem.
|
|
|
|
There are a few ways we can deal with this problem. The one thing
|
|
we can't do is to simply ignore it. Every assembler I've ever
|
|
used reserves column 1 for labels, and will rebel to see
|
|
instructions starting there. So, at the very least, we must space
|
|
the instructions over one column to keep the assembler happy. .
|
|
That's easy enough to do: Simply change, in procedure Emit, the
|
|
line:
|
|
|
|
Write(TAB, s);
|
|
|
|
by:
|
|
|
|
Write(' ', s);
|
|
|
|
I must admit that I've wrestled with this problem before, and find
|
|
myself changing my mind as often as a chameleon changes color.
|
|
For the purposes we're going to be using, 99% of which will be
|
|
examining the output code as it's displayed on a CRT, it would be
|
|
nice to see neatly blocked out "object" code. The line:
|
|
|
|
SUB1: MOVE #4,D0
|
|
|
|
just plain looks neater than the different, but functionally
|
|
identical code,
|
|
|
|
SUB1:
|
|
MOVE #4,D0
|
|
|
|
In test versions of my code, I included a more sophisticated
|
|
version of the procedure PostLabel, that avoids having labels on
|
|
separate lines, but rather defers the printing of a label so it
|
|
can end up on the same line as the associated instruction. As
|
|
recently as an hour ago, my version of unit Output provided full
|
|
support for tabs, using an internal column count variable and
|
|
software to manage it. I had, if I do say so myself, some rather
|
|
elegant code to support the tab mechanism, with a minimum of code
|
|
bloat. It was awfully tempting to show you the "prettyprint"
|
|
version, if for no other reason than to show off the elegance.
|
|
|
|
Nevertheless, the code of the "elegant" version was considerably
|
|
more complex and larger. Since then, I've had second thoughts. In
|
|
spite of our desire to see pretty output, the inescapable fact is
|
|
that the two versions of the MAIN: code fragment shown above are
|
|
functionally identical; the assembler, which is the ultimate
|
|
destination of the code, couldn't care less which version it gets,
|
|
except that the prettier version will contain more characters,
|
|
therefore will use more disk space and take longer to assemble.
|
|
but the prettier one not only takes more code to generate, but
|
|
will create a larger output file, with many more space characters
|
|
than the minimum needed. When you look at it that way, it's not
|
|
very hard to decide which approach to use, is it?
|
|
|
|
What finally clinched the issue for me was a reminder to consider
|
|
my own first commandment: KISS. Although I was pretty proud of
|
|
all my elegant little tricks to implement tabbing, I had to remind
|
|
myself that, to paraphrase Senator Barry Goldwater, elegance in
|
|
the pursuit of complexity is no virtue. Another wise man once
|
|
wrote, "Any idiot can design a Rolls-Royce. It takes a genius to
|
|
design a VW." So the elegant, tab-friendly version of Output is
|
|
history, and what you see is the simple, compact, VW version.
|
|
|
|
|
|
THE ERROR UNIT
|
|
|
|
Our next set of routines are those that handle errors. To refresh
|
|
your memory, we take the approach, pioneered by Borland in Turbo
|
|
Pascal, of halting on the first error. Not only does this greatly
|
|
simplify our code, by completely avoiding the sticky issue of
|
|
error recovery, but it also makes much more sense, in my opinion,
|
|
in an interactive environment. I know this may be an extreme
|
|
position, but I consider the practice of reporting all errors in a
|
|
program to be an anachronism, a holdover from the days of batch
|
|
processing. It's time to scuttle the practice. So there.
|
|
|
|
In our original Cradle, we had two error-handling procedures:
|
|
Error, which didn't halt, and Abort, which did. But I don't think
|
|
we ever found a use for the procedure that didn't halt, so in the
|
|
new, lean and mean unit Errors, shown next, procedure Error takes
|
|
the place of Abort.
|
|
|
|
|
|
{--------------------------------------------------------------}
|
|
unit Errors;
|
|
{--------------------------------------------------------------}
|
|
interface
|
|
procedure Error(s: string);
|
|
procedure Expected(s: string);
|
|
|
|
{--------------------------------------------------------------}
|
|
implementation
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Write error Message and Halt }
|
|
|
|
procedure Error(s: string);
|
|
begin
|
|
WriteLn;
|
|
WriteLn(^G, 'Error: ', s, '.');
|
|
Halt;
|
|
end;
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Write "<something> Expected" }
|
|
|
|
procedure Expected(s: string);
|
|
begin
|
|
Error(s + ' Expected');
|
|
end;
|
|
|
|
end.
|
|
{--------------------------------------------------------------}
|
|
|
|
|
|
As usual, here's a test program:
|
|
|
|
|
|
|
|
|
|
{--------------------------------------------------------------}
|
|
program Test;
|
|
uses WinCRT, Input, Output, Errors;
|
|
|
|
begin
|
|
Expected('Integer');
|
|
end.
|
|
{--------------------------------------------------------------}
|
|
|
|
Have you noticed that the "uses" line in our main program keeps
|
|
getting longer? That's OK. In the final version, the main program
|
|
will only call procedures in our parser, so its use clause will
|
|
only have a couple of entries. But for now, it's probably best to
|
|
include all the units so we can test procedures in them.
|
|
|
|
|
|
SCANNING AND PARSING
|
|
|
|
The classical compiler architecture consists of separate modules
|
|
for the lexical scanner, which supplies tokens in the language,
|
|
and the parser, which tries to make sense of the tokens as syntax
|
|
elements. If you can still remember what we did in earlier
|
|
installments, you'll recall that we didn't do things that way.
|
|
Because we're using a predictive parser, we can almost always tell
|
|
what language element is coming next, just by examining the
|
|
lookahead character. Therefore, we found no need to prefetch
|
|
tokens, as a scanner would do.
|
|
|
|
But, even though there is no functional procedure called
|
|
"Scanner," it still makes sense to separate the scanning functions
|
|
from the parsing functions. So I've created two more units
|
|
called, amazingly enough, Scanner and Parser. The Scanner unit
|
|
contains all of the routines known as recognizers. Some of these,
|
|
such as IsAlpha, are pure boolean routines which operate on the
|
|
lookahead character only. The other routines are those which
|
|
collect tokens, such as identifiers and numeric constants. The
|
|
Parser unit will contain all of the routines making up the
|
|
recursive-descent parser. The general rule should be that unit
|
|
Parser contains all of the information that is language-specific;
|
|
in other words, the syntax of the language should be wholly
|
|
contained in Parser. In an ideal world, this rule should be true
|
|
to the extent that we can change the compiler to compile a
|
|
different language, merely by replacing the single unit, Parser.
|
|
|
|
In practice, things are almost never this pure. There's always a
|
|
small amount of "leakage" of syntax rules into the scanner as
|
|
well. For example, the rules concerning what makes up a legal
|
|
identifier or constant may vary from language to language. In
|
|
some languages, the rules concerning comments permit them to be
|
|
filtered by the scanner, while in others they do not. So in
|
|
practice, both units are likely to end up having language-
|
|
dependent components, but the changes required to the scanner
|
|
should be relatively trivial.
|
|
|
|
Now, recall that we've used two versions of the scanner routines:
|
|
One that handled only single-character tokens, which we used for a
|
|
number of our tests, and another that provided full support for
|
|
multi-character tokens. Now that we have our software separated
|
|
into units, I don't anticipate getting much use out of the single-
|
|
character version, but it doesn't cost us much to provide for
|
|
both. I've created two versions of the Scanner unit. The first
|
|
one, called Scanner1, contains the single-digit version of the
|
|
recognizers:
|
|
|
|
|
|
{--------------------------------------------------------------}
|
|
unit Scanner1;
|
|
{--------------------------------------------------------------}
|
|
interface
|
|
uses Input, Errors;
|
|
|
|
function IsAlpha(c: char): boolean;
|
|
function IsDigit(c: char): boolean;
|
|
function IsAlNum(c: char): boolean;
|
|
function IsAddop(c: char): boolean;
|
|
function IsMulop(c: char): boolean;
|
|
|
|
procedure Match(x: char);
|
|
function GetName: char;
|
|
function GetNumber: char;
|
|
|
|
{--------------------------------------------------------------}
|
|
implementation
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Recognize an Alpha Character }
|
|
|
|
function IsAlpha(c: char): boolean;
|
|
begin
|
|
IsAlpha := UpCase(c) in ['A'..'Z'];
|
|
end;
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Recognize a Numeric Character }
|
|
|
|
function IsDigit(c: char): boolean;
|
|
begin
|
|
IsDigit := c in ['0'..'9'];
|
|
end;
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Recognize an Alphanumeric Character }
|
|
|
|
function IsAlnum(c: char): boolean;
|
|
begin
|
|
IsAlnum := IsAlpha(c) or IsDigit(c);
|
|
end;
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Recognize an Addition Operator }
|
|
|
|
function IsAddop(c: char): boolean;
|
|
begin
|
|
IsAddop := c in ['+','-'];
|
|
end;
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Recognize a Multiplication Operator }
|
|
|
|
function IsMulop(c: char): boolean;
|
|
begin
|
|
IsMulop := c in ['*','/'];
|
|
end;
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Match One Character }
|
|
|
|
procedure Match(x: char);
|
|
begin
|
|
if Look = x then GetChar
|
|
else Expected('''' + x + '''');
|
|
end;
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Get an Identifier }
|
|
|
|
function GetName: char;
|
|
begin
|
|
if not IsAlpha(Look) then Expected('Name');
|
|
GetName := UpCase(Look);
|
|
GetChar;
|
|
end;
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Get a Number }
|
|
|
|
function GetNumber: char;
|
|
begin
|
|
if not IsDigit(Look) then Expected('Integer');
|
|
GetNumber := Look;
|
|
GetChar;
|
|
end;
|
|
|
|
end.
|
|
{--------------------------------------------------------------}
|
|
|
|
|
|
The following code fragment of the main program provides a good
|
|
test of the scanner. For brevity, I'll only include the
|
|
executable code here; the rest remains the same. Don't forget,
|
|
though, to add the name Scanner1 to the "uses" clause.
|
|
|
|
Write(GetName);
|
|
Match('=');
|
|
Write(GetNumber);
|
|
Match('+');
|
|
WriteLn(GetName);
|
|
|
|
This code will recognize all sentences of the form:
|
|
|
|
x=0+y
|
|
|
|
where x and y can be any single-character variable names, and 0
|
|
any digit. The code should reject all other sentences, and give a
|
|
meaningful error message. If it did, you're in good shape and we
|
|
can proceed.
|
|
|
|
|
|
THE SCANNER UNIT
|
|
|
|
The next, and by far the most important, version of the scanner is
|
|
the one that handles the multi-character tokens that all real
|
|
languages must have. Only the two functions, GetName and
|
|
GetNumber, change between the two units, but just to be sure there
|
|
are no mistakes, I've reproduced the entire unit here. This is
|
|
unit Scanner:
|
|
|
|
|
|
{--------------------------------------------------------------}
|
|
unit Scanner;
|
|
{--------------------------------------------------------------}
|
|
interface
|
|
uses Input, Errors;
|
|
|
|
function IsAlpha(c: char): boolean;
|
|
function IsDigit(c: char): boolean;
|
|
function IsAlNum(c: char): boolean;
|
|
function IsAddop(c: char): boolean;
|
|
function IsMulop(c: char): boolean;
|
|
|
|
procedure Match(x: char);
|
|
function GetName: string;
|
|
function GetNumber: longint;
|
|
|
|
{--------------------------------------------------------------}
|
|
implementation
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Recognize an Alpha Character }
|
|
|
|
function IsAlpha(c: char): boolean;
|
|
begin
|
|
IsAlpha := UpCase(c) in ['A'..'Z'];
|
|
end;
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Recognize a Numeric Character }
|
|
|
|
function IsDigit(c: char): boolean;
|
|
begin
|
|
IsDigit := c in ['0'..'9'];
|
|
end;
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Recognize an Alphanumeric Character }
|
|
|
|
function IsAlnum(c: char): boolean;
|
|
begin
|
|
IsAlnum := IsAlpha(c) or IsDigit(c);
|
|
end;
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Recognize an Addition Operator }
|
|
|
|
function IsAddop(c: char): boolean;
|
|
begin
|
|
IsAddop := c in ['+','-'];
|
|
end;
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Recognize a Multiplication Operator }
|
|
|
|
function IsMulop(c: char): boolean;
|
|
begin
|
|
IsMulop := c in ['*','/'];
|
|
end;
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Match One Character }
|
|
|
|
procedure Match(x: char);
|
|
begin
|
|
if Look = x then GetChar
|
|
else Expected('''' + x + '''');
|
|
end;
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Get an Identifier }
|
|
|
|
function GetName: string;
|
|
var n: string;
|
|
begin
|
|
n := '';
|
|
if not IsAlpha(Look) then Expected('Name');
|
|
while IsAlnum(Look) do begin
|
|
n := n + Look;
|
|
GetChar;
|
|
end;
|
|
GetName := n;
|
|
end;
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Get a Number }
|
|
|
|
function GetNumber: string;
|
|
var n: string;
|
|
begin
|
|
n := '';
|
|
if not IsDigit(Look) then Expected('Integer');
|
|
while IsDigit(Look) do begin
|
|
n := n + Look;
|
|
GetChar;
|
|
end;
|
|
GetNumber := n;
|
|
end;
|
|
|
|
end.
|
|
{--------------------------------------------------------------}
|
|
|
|
|
|
The same test program will test this scanner, also. Simply change
|
|
the "uses" clause to use Scanner instead of Scanner1. Now you
|
|
should be able to type multi-character names and numbers.
|
|
|
|
|
|
DECISIONS, DECISIONS
|
|
|
|
In spite of the relative simplicity of both scanners, a lot of
|
|
thought has gone into them, and a lot of decisions had to be made.
|
|
I'd like to share those thoughts with you now so you can make your
|
|
own educated decision, appropriate for your application. First,
|
|
note that both versions of GetName translate the input characters
|
|
to upper case. Obviously, there was a design decision made here,
|
|
and this is one of those cases where the language syntax splatters
|
|
over into the scanner. In the C language, the case of characters
|
|
in identifiers is significant. For such a language, we obviously
|
|
can't map the characters to upper case. The design I'm using
|
|
assumes a language like Pascal, where the case of characters
|
|
doesn't matter. For such languages, it's easier to go ahead and
|
|
map all identifiers to upper case in the scanner, so we don't have
|
|
to worry later on when we're comparing strings for equality.
|
|
|
|
We could have even gone a step further, and map the characters to
|
|
upper case right as they come in, in GetChar. This approach works
|
|
too, and I've used it in the past, but it's too confining.
|
|
Specifically, it will also map characters that may be part of
|
|
quoted strings, which is not a good idea. So if you're going to
|
|
map to upper case at all, GetName is the proper place to do it.
|
|
|
|
Note that the function GetNumber in this scanner returns a string,
|
|
just as GetName does. This is another one of those things I've
|
|
oscillated about almost daily, and the last swing was all of ten
|
|
minutes ago. The alternative approach, and one I've used many
|
|
times in past installments, returns an integer result.
|
|
|
|
Both approaches have their good points. Since we're fetching a
|
|
number, the approach that immediately comes to mind is to return
|
|
it as an integer. But bear in mind that the eventual use of the
|
|
number will be in a write statement that goes back to the outside
|
|
world. Someone -- either us or the code hidden inside the write
|
|
statement -- is going to have to convert the number back to a
|
|
string again. Turbo Pascal includes such string conversion
|
|
routines, but why use them if we don't have to? Why convert a
|
|
number from string to integer form, only to convert it right back
|
|
again in the code generator, only a few statements later?
|
|
|
|
Furthermore, as you'll soon see, we're going to need a temporary
|
|
storage spot for the value of the token we've fetched. If we treat
|
|
the number in its string form, we can store the value of either a
|
|
variable or a number in the same string. Otherwise, we'll have to
|
|
create a second, integer variable.
|
|
|
|
On the other hand, we'll find that carrying the number as a string
|
|
virtually eliminates any chance of optimization later on. As we
|
|
get to the point where we are beginning to concern ourselves with
|
|
code generation, we'll encounter cases in which we're doing
|
|
arithmetic on constants. For such cases, it's really foolish to
|
|
generate code that performs the constant arithmetic at run time.
|
|
Far better to let the parser do the arithmetic at compile time,
|
|
and merely code the result. To do that, we'll wish we had the
|
|
constants stored as integers rather than strings.
|
|
|
|
What finally swung me back over to the string approach was an
|
|
aggressive application of the KISS test, plus reminding myself
|
|
that we've studiously avoided issues of code efficiency. One of
|
|
the things that makes our simple-minded parsing work, without the
|
|
complexities of a "real" compiler, is that we've said up front
|
|
that we aren't concerned about code efficiency. That gives us a
|
|
lot of freedom to do things the easy way rather than the efficient
|
|
one, and it's a freedom we must be careful not to abandon
|
|
voluntarily, in spite of the urges for efficiency shouting in our
|
|
ear. In addition to being a big believer in the KISS philosophy,
|
|
I'm also an advocate of "lazy programming," which in this context
|
|
means, don't program anything until you need it. As P.J. Plauger
|
|
says, "Never put off until tomorrow what you can put off
|
|
indefinitely." Over the years, much code has been written to
|
|
provide for eventualities that never happened. I've learned that
|
|
lesson myself, from bitter experience. So the bottom line is: We
|
|
won't convert to an integer here because we don't need to. It's
|
|
as simple as that.
|
|
|
|
For those of you who still think we may need the integer version
|
|
(and indeed we may), here it is:
|
|
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Get a Number (integer version) }
|
|
|
|
function GetNumber: longint;
|
|
var n: longint;
|
|
begin
|
|
n := 0;
|
|
if not IsDigit(Look) then Expected('Integer');
|
|
while IsDigit(Look) do begin
|
|
n := 10 * n + (Ord(Look) - Ord('0'));
|
|
GetChar;
|
|
end;
|
|
GetNumber := n;
|
|
end;
|
|
{--------------------------------------------------------------}
|
|
|
|
You might file this one away, as I intend to, for a rainy day.
|
|
|
|
|
|
PARSING
|
|
|
|
At this point, we have distributed all the routines that made up
|
|
our Cradle into units that we can draw upon as we need them.
|
|
Obviously, they will evolve further as we continue the process of
|
|
bootstrapping ourselves up again, but for the most part their
|
|
content, and certainly the architecture that they imply, is
|
|
defined. What remains is to embody the language syntax into the
|
|
parser unit. We won't do much of that in this installment, but I
|
|
do want to do a little, just to leave us with the good feeling
|
|
that we still know what we're doing. So before we go, let's
|
|
generate just enough of a parser to process single factors in an
|
|
expression. In the process, we'll also, by necessity, find we
|
|
have created a code generator unit, as well.
|
|
|
|
Remember the very first installment of this series? We read an
|
|
integer value, say n, and generated the code to load it into the
|
|
D0 register via an immediate move:
|
|
|
|
MOVE #n,D0
|
|
|
|
Shortly afterwards, we repeated the process for a variable,
|
|
|
|
MOVE X(PC),D0
|
|
|
|
and then for a factor that could be either constant or variable.
|
|
For old times sake, let's revisit that process. Define the
|
|
following new unit:
|
|
|
|
|
|
{--------------------------------------------------------------}
|
|
unit Parser;
|
|
{--------------------------------------------------------------}
|
|
interface
|
|
uses Input, Scanner, Errors, CodeGen;
|
|
procedure Factor;
|
|
|
|
{--------------------------------------------------------------}
|
|
implementation
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Parse and Translate a Factor }
|
|
|
|
procedure Factor;
|
|
begin
|
|
LoadConstant(GetNumber);
|
|
end;
|
|
|
|
end.
|
|
{--------------------------------------------------------------}
|
|
|
|
|
|
As you can see, this unit calls a procedure, LoadConstant, which
|
|
actually effects the output of the assembly-language code. The
|
|
unit also uses a new unit, CodeGen. This step represents the last
|
|
major change in our architecture, from earlier installments: The
|
|
removal of the machine-dependent code to a separate unit. If I
|
|
have my way, there will not be a single line of code, outside of
|
|
CodeGen, that betrays the fact that we're targeting the 68000 CPU.
|
|
And this is one place I think that having my way is quite
|
|
feasible.
|
|
|
|
For those of you who wish I were using the 80x86 architecture (or
|
|
any other one) instead of the 68000, here's your answer: Merely
|
|
replace CodeGen with one suitable for your CPU of choice.
|
|
|
|
So far, our code generator has only one procedure in it. Here's
|
|
the unit:
|
|
|
|
|
|
{--------------------------------------------------------------}
|
|
unit CodeGen;
|
|
|
|
{--------------------------------------------------------------}
|
|
interface
|
|
uses Output;
|
|
procedure LoadConstant(n: string);
|
|
|
|
{--------------------------------------------------------------}
|
|
implementation
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Load the Primary Register with a Constant }
|
|
|
|
procedure LoadConstant(n: string);
|
|
begin
|
|
EmitLn('MOVE #' + n + ',D0' );
|
|
end;
|
|
|
|
end.
|
|
{--------------------------------------------------------------}
|
|
|
|
|
|
Copy and compile this unit, and execute the following main
|
|
program:
|
|
|
|
{--------------------------------------------------------------}
|
|
program Main;
|
|
uses WinCRT, Input, Output, Errors, Scanner, Parser;
|
|
begin
|
|
Factor;
|
|
end.
|
|
{--------------------------------------------------------------}
|
|
|
|
|
|
There it is, the generated code, just as we hoped it would be.
|
|
|
|
Now, I hope you can begin to see the advantage of the unit-based
|
|
architecture of our new design. Here we have a main program
|
|
that's all of five lines long. That's all of the program we need
|
|
to see, unless we choose to see more. And yet, all those units
|
|
are sitting there, patiently waiting to serve us. We can have our
|
|
cake and eat it too, in that we have simple and short code, but
|
|
powerful allies. What remains to be done is to flesh out the
|
|
units to match the capabilities of earlier installments. We'll do
|
|
that in the next installment, but before I close, let's finish out
|
|
the parsing of a factor, just to satisfy ourselves that we still
|
|
know how. The final version of CodeGen includes the new
|
|
procedure, LoadVariable:
|
|
|
|
{--------------------------------------------------------------}
|
|
unit CodeGen;
|
|
|
|
{--------------------------------------------------------------}
|
|
interface
|
|
uses Output;
|
|
procedure LoadConstant(n: string);
|
|
procedure LoadVariable(Name: string);
|
|
|
|
{--------------------------------------------------------------}
|
|
implementation
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Load the Primary Register with a Constant }
|
|
|
|
procedure LoadConstant(n: string);
|
|
begin
|
|
EmitLn('MOVE #' + n + ',D0' );
|
|
end;
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Load a Variable to the Primary Register }
|
|
|
|
procedure LoadVariable(Name: string);
|
|
begin
|
|
EmitLn('MOVE ' + Name + '(PC),D0');
|
|
end;
|
|
|
|
|
|
end.
|
|
{--------------------------------------------------------------}
|
|
|
|
|
|
The parser unit itself doesn't change, but we have a more complex
|
|
version of procedure Factor:
|
|
|
|
{--------------------------------------------------------------}
|
|
{ Parse and Translate a Factor }
|
|
|
|
procedure Factor;
|
|
begin
|
|
if IsDigit(Look) then
|
|
LoadConstant(GetNumber)
|
|
else if IsAlpha(Look)then
|
|
LoadVariable(GetName)
|
|
else
|
|
Error('Unrecognized character ' + Look);
|
|
end;
|
|
{--------------------------------------------------------------}
|
|
|
|
|
|
Now, without altering the main program, you should find that our
|
|
program will process either a variable or a constant factor. At
|
|
this point, our architecture is almost complete; we have units to
|
|
do all the dirty work, and enough code in the parser and code
|
|
generator to demonstrate that everything works. What remains is
|
|
to flesh out the units we've defined, particularly the parser and
|
|
code generator, to support the more complex syntax elements that
|
|
make up a real language. Since we've done this many times before
|
|
in earlier installments, it shouldn't take long to get us back to
|
|
where we were before the long hiatus. We'll continue this process
|
|
in Installment 16, coming soon. See you then.
|
|
|
|
|
|
|
|
REFERENCES
|
|
|
|
1. Crenshaw, J.W., "Object-Oriented Design of Assemblers and
|
|
Compilers," Proc. Software Development '91 Conference, Miller
|
|
Freeman, San Francisco, CA, February 1991, pp. 143-155.
|
|
|
|
2. Crenshaw, J.W., "A Perfect Marriage," Computer Language, Volume
|
|
8, #6, June 1991, pp. 44-55.
|
|
|
|
3. Crenshaw, J.W., "Syntax-Driven Object-Oriented Design," Proc.
|
|
1991 Embedded Systems Conference, Miller Freeman, San
|
|
Francisco, CA, September 1991, pp. 45-60.
|
|
|
|
|
|
*****************************************************************
|
|
* *
|
|
* COPYRIGHT NOTICE *
|
|
* *
|
|
* Copyright (C) 1994 Jack W. Crenshaw. All rights reserved. *
|
|
* *
|
|
* *
|
|
*****************************************************************
|
|
|
|
|