LET'S BUILD A COMPILER!

By

Jack W. Crenshaw, Ph.D.

2 April 1989


Part VIII: A LITTLE PHILOSOPHY


*****************************************************************
*                                                               *
*                        COPYRIGHT NOTICE                       *
*                                                               *
*   Copyright (C) 1989 Jack W. Crenshaw. All rights reserved.   *
*                                                               *
*****************************************************************


INTRODUCTION

This is going to be a different kind of session than the others
in our series on parsing and compiler construction. For this
session, there won't be any experiments to do or code to write.
This once, I'd like to just talk with you for a while.
Mercifully, it will be a short session, and then we can take up
where we left off, hopefully with renewed vigor.

When I was in college, I found that I could always follow a
prof's lecture a lot better if I knew where he was going with it.
I'll bet you were the same.

So I thought maybe it's about time I told you where we're going
with this series: what's coming up in future installments, and in
general what all this is about. I'll also share some general
thoughts concerning the usefulness of what we've been doing.


THE ROAD HOME

So far, we've covered the parsing and translation of arithmetic
expressions, Boolean expressions, and combinations connected by
relational operators. We've also done the same for control
constructs. In all of this we've leaned heavily on the use of
top-down, recursive descent parsing, BNF definitions of the
syntax, and direct generation of assembly-language code. We also
learned the value of such tricks as single-character tokens to
help us see the forest for the trees. In the last
installment we dealt with lexical scanning, and I showed you
simple but powerful ways to remove the single-character barriers.

Throughout the whole study, I've emphasized the KISS philosophy
... Keep It Simple, Sidney ... and I hope by now you've realized
just how simple this stuff can really be. While there are for
sure areas of compiler theory that are truly intimidating, the
ultimate message of this series is that in practice you can just
politely sidestep many of these areas. If the language
definition cooperates or, as in this series, if you can define
the language as you go, it's possible to write down the language
definition in BNF with reasonable ease. And, as we've seen, you
can crank out parse procedures from the BNF just about as fast as
you can type.

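To make the BNF-to-procedure claim concrete, here is a minimal
sketch of two BNF rules turned into two parse procedures. It is
written in Python rather than the Pascal used elsewhere in the
series, and the names (Parser, look, emit, term, expression) are
illustrative stand-ins, not code lifted from an earlier
installment. The 68000 instructions it emits follow the general
style of the earlier installments and should be read as
illustrative only.

```python
# Sketch only. Grammar, with single-character tokens:
#   <expression> ::= <term> [ <addop> <term> ]*
#   <term>       ::= <digit>

class Parser:
    def __init__(self, source):
        self.source = source
        self.pos = 0
        self.code = []          # emitted assembly lines

    @property
    def look(self):
        # One character of lookahead; '' at end of input.
        return self.source[self.pos] if self.pos < len(self.source) else ''

    def emit(self, line):
        self.code.append(line)

    def term(self):
        # <term> ::= <digit>
        if not self.look.isdigit():
            raise SyntaxError(f"Integer expected at position {self.pos}")
        self.emit(f"MOVE #{self.look},D0")
        self.pos += 1

    def expression(self):
        # <expression> ::= <term> [ <addop> <term> ]*
        self.term()
        while self.look in ('+', '-'):
            op = self.look
            self.pos += 1
            self.emit("MOVE D0,-(SP)")      # save the left operand
            self.term()
            if op == '+':
                self.emit("ADD (SP)+,D0")
            else:
                self.emit("SUB (SP)+,D0")   # gives right - left ...
                self.emit("NEG D0")         # ... so flip the sign
        return self.code
```

Each rule becomes one method, and the body of the method reads
almost word for word like the right-hand side of the rule. For
the input "1+2" the sketch emits four lines: load 1, push it,
load 2, add.
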
As our compiler has taken form, it's gotten more parts, but each
part is quite small and simple, and very much like all the
others.

At this point, we have many of the makings of a real, practical
compiler. As a matter of fact, we already have all we need to
build a toy compiler for a language as powerful as, say, Tiny
BASIC. In the next couple of installments, we'll go ahead and
define that language.

To round out the series, we still have a few items to cover.
These include:

o Procedure calls, with and without parameters

o Local and global variables

o Basic types, such as character and integer types

o Arrays

o Strings

o User-defined types and structures

o Tree-structured parsers and intermediate languages

o Optimization

These will all be covered in future installments. When we're
finished, you'll have all the tools you need to design and build
your own languages, and the compilers to translate them.

I can't design those languages for you, but I can make some
comments and recommendations. I've already sprinkled some
throughout past installments. You've seen, for example, the
control constructs I prefer.

These constructs are going to be part of the languages I build.
I have three languages in mind at this point, two of which you
will see in installments to come:

TINY - A minimal, but usable language on the order of Tiny
BASIC or Tiny C. It won't be very practical, but it will
have enough power to let you write and run real programs
that do something worthwhile.

KISS - The language I'm building for my own use. KISS is
intended to be a systems programming language. It won't
have strong typing or fancy data structures, but it will
support most of the things I want to do with a higher-order
language (HOL), except perhaps writing compilers.

I've also been toying for years with the idea of a HOL-like
assembler, with structured control constructs and HOL-like
assignment statements. That, in fact, was the impetus behind my
original foray into the jungles of compiler theory. This one may
never be built, simply because I've learned that it's actually
easier to implement a language like KISS, which uses only a
subset of the CPU instructions. As you know, assembly language
can be bizarre and irregular in the extreme, and a language that
maps one-for-one onto it can be a real challenge. Still, I've
always felt that the syntax used in conventional assemblers is
dumb ... why is

     MOVE.L A,B

better, or easier to translate, than

     B=A ?

I think it would be an interesting exercise to develop a
"compiler" that would give the programmer complete access to and
control over the full complement of the CPU instruction set, and
would allow you to generate programs as efficient as assembly
language, without the pain of learning a set of mnemonics. Can
it be done? I don't know. The real question may be, "Will the
resulting language be any easier to write than assembly?" If
not, there's no point in it. I think that it can be done, but
I'm not completely sure yet how the syntax should look.

Perhaps you have some comments or suggestions on this one. I'd
love to hear them.

You probably won't be surprised to learn that I've already worked
ahead in most of the areas that we will cover. I have some good
news: Things never get much harder than they've been so far.
It's possible to build a complete, working compiler for a real
language, using nothing but the same kinds of techniques you've
learned so far. And THAT brings up some interesting questions.


WHY IS IT SO SIMPLE?

Before embarking on this series, I always thought that compilers
were just naturally complex computer programs ... the ultimate
challenge. Yet the things we have done here have usually turned
out to be quite simple, sometimes even trivial.

For a while, I thought it was simply because I hadn't yet gotten
into the meat of the subject. I had only covered the simple
parts. I will freely admit to you that, even when I began the
series, I wasn't sure how far we would be able to go before
things got too complex to deal with in the ways we have so far.
But at this point I've already been down the road far enough to
see the end of it. Guess what?


THERE ARE NO HARD PARTS!


Then, I thought maybe it was because we were not generating very
good object code. Those of you who have been following the
series and trying sample compiles know that, while the code works
and is rather foolproof, its efficiency is pretty awful. I
figured that if we were concentrating on turning out tight code,
we would soon find all that missing complexity.

To some extent, that one is true. In particular, my first few
efforts at trying to improve efficiency introduced complexity at
an alarming rate. But since then I've been tinkering around with
some simple optimizations and I've found some that result in very
respectable code quality, WITHOUT adding a lot of complexity.

Finally, I thought that perhaps the saving grace was the "toy
compiler" nature of the study. I have made no pretense that we
were ever going to be able to build a compiler to compete with
Borland and Microsoft. And yet, again, as I get deeper into this
thing the differences are starting to fade away.

Just to make sure you get the message here, let me state it flat
out:

USING THE TECHNIQUES WE'VE USED HERE, IT IS POSSIBLE TO
BUILD A PRODUCTION-QUALITY, WORKING COMPILER WITHOUT ADDING
A LOT OF COMPLEXITY TO WHAT WE'VE ALREADY DONE.


Since the series began I've received some comments from you.
Most of them echo my own thoughts: "This is easy! Why do the
textbooks make it seem so hard?" Good question.

Recently, I've gone back and looked at some of those texts again,
and even bought and read some new ones. Each time, I come away
with the same feeling: These guys have made it seem too hard.

What's going on here? Why does the whole thing seem difficult in
the texts, but easy to us? Are we that much smarter than Aho,
Ullman, Brinch Hansen, and all the rest?

Hardly. But we are doing some things differently, and more and
more I'm starting to appreciate the value of our approach, and
the way that it simplifies things. Aside from the obvious
shortcuts that I outlined in Part I, like single-character tokens
and console I/O, we have made some implicit assumptions and done
some things differently from those who have designed compilers in
the past. As it turns out, our approach makes life a lot easier.

So why didn't all those other guys use it?

You have to remember the context of some of the earlier compiler
development. These people were working with very small computers
of limited capacity. Memory was very limited, the CPU
instruction set was minimal, and programs ran in batch mode
rather than interactively. As it turns out, these constraints
forced some key design decisions that have really complicated the
designs. Until recently, I hadn't realized how much of classical
compiler design was driven by the available hardware.

Even in cases where these limitations no longer apply, people
have tended to structure their programs in the same way, since
that is the way they were taught to do it.

In our case, we have started with a blank sheet of paper. There
is a danger there, of course, that you will end up falling into
traps that other people have long since learned to avoid. But it
also has allowed us to take different approaches that, partly by
design and partly by pure dumb luck, have allowed us to gain
simplicity.

Here are the areas that I think have led to complexity in the
past:

o Limited RAM Forcing Multiple Passes

I just read "Brinch Hansen on Pascal Compilers" (an
excellent book, BTW). He developed a Pascal compiler for a
PC, but he started the effort in 1981 with a 64K system, and
so almost every design decision he made was aimed at making
the compiler fit into RAM. To do this, his compiler has
three passes, one of which is the lexical scanner. There is
no way he could, for example, use the distributed scanner I
introduced in the last installment, because the program
structure wouldn't allow it. He also required not one but
two intermediate languages, to provide the communication
between phases.

All the early compiler writers had to deal with this issue:
Break the compiler up into enough parts so that it will fit
in memory. When you have multiple passes, you need to add
data structures to support the information that each pass
leaves behind for the next. That adds complexity, and ends
up driving the design. Lee's book, "The Anatomy of a
Compiler," mentions a FORTRAN compiler developed for an IBM
1401. It had no fewer than 63 separate passes! Needless to
say, in a compiler like this the separation into phases
would dominate the design.

Even in situations where RAM is plentiful, people have
tended to use the same techniques because that is what
they're familiar with. It wasn't until Turbo Pascal came
along that we found how simple a compiler could be if you
started with different assumptions.


o Batch Processing

In the early days, batch processing was the only choice ...
there was no interactive computing. Even today, compilers
run in essentially batch mode.

In a mainframe compiler as well as many micro compilers,
considerable effort is expended on error recovery ... it can
consume as much as 30-40% of the compiler and completely
drive the design. The idea is to avoid halting on the first
error, but rather to keep going at all costs, so that you
can tell the programmer about as many errors in the whole
program as possible.

All of that harks back to the days of the early mainframes,
where turnaround time was measured in hours or days, and it
was important to squeeze every last ounce of information out
of each run.

In this series, I've been very careful to avoid the issue of
error recovery, and instead our compiler simply halts with
an error message on the first error. I will frankly admit
that it was mostly because I wanted to take the easy way out
and keep things simple. But this approach, pioneered by
Borland in Turbo Pascal, has a lot going for it anyway.
Aside from keeping the compiler simple, it fits very
well with the idea of an interactive system. When
compilation is fast, and especially when you have an editor
such as Borland's that will take you right to the point of
the error, then it makes a lot of sense to stop there, and
just restart the compilation after the error is fixed.


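The halt-on-first-error policy described above can be sketched
in a few lines. This is a Python illustration, not code from
the series; CompileError, expected, match and
compile_assignment are hypothetical names, and the tiny
one-statement "language" (a letter, '=', a digit) exists only
to give the error routine something to do. The emitted
instructions are illustrative.

```python
class CompileError(Exception):
    """Raised once; the first error ends the compilation."""

def expected(what, source, pos):
    # Report the error together with its position, so that an editor
    # could put the cursor right where the problem is.
    found = source[pos] if pos < len(source) else "end of input"
    raise CompileError(f"{what} expected at position {pos}, found {found!r}")

def match(ch, source, pos):
    # Accept one specific character, or stop the whole compile.
    if pos < len(source) and source[pos] == ch:
        return pos + 1
    expected(repr(ch), source, pos)

def compile_assignment(source):
    # Hypothetical one-statement language: <letter> '=' <digit>
    pos = 0
    if not (pos < len(source) and source[pos].isalpha()):
        expected("Name", source, pos)
    name, pos = source[pos], pos + 1
    pos = match("=", source, pos)
    if not (pos < len(source) and source[pos].isdigit()):
        expected("Integer", source, pos)
    return [f"MOVE #{source[pos]},D0", f"LEA {name}(PC),A0", "MOVE D0,(A0)"]
```

There is no recovery logic anywhere: no resynchronizing on a
semicolon, no list of pending errors. Every routine either
succeeds or raises, and the position carried in the message is
all an interactive editor needs.
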
o Large Programs

Early compilers were designed to handle large programs ...
essentially infinite ones. In those days there was little
choice; the ideas of subroutine libraries and separate
compilation were still in the future. Again, this
assumption led to multi-pass designs and intermediate files
to hold the results of partial processing.

Brinch Hansen's stated goal was that the compiler should be
able to compile itself. Again, because of his limited RAM,
this drove him to a multi-pass design. He needed as little
resident compiler code as possible, so that the necessary
tables and other data structures would fit into RAM.

I haven't stated this one yet, because there hasn't been a
need ... we've always just read and written the data as
streams, anyway. But for the record, my plan has always
been that, in a production compiler, the source and object
data should all coexist in RAM with the compiler, a la the
early Turbo Pascals. That's why I've been careful to keep
routines like GetChar and Emit as separate routines, in
spite of their small size. It will be easy to change them
to read from and write to memory.


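Here is a rough sketch of why keeping GetChar and Emit separate
pays off. It is in Python, not the Pascal of the series; the
Compiler class and its compile_digits method are hypothetical,
and only the two routine names echo the ones mentioned above.
The point is that the parsing code never knows whether it is
talking to a stream or to memory.

```python
import io

class Compiler:
    def __init__(self, reader, writer):
        # reader/writer can be console streams or in-memory buffers;
        # nothing else in the class cares which.
        self.reader = reader
        self.writer = writer

    def get_char(self):
        # The only place that touches the input. Returns '' at the end.
        return self.reader.read(1)

    def emit(self, text):
        # The only place that touches the output.
        self.writer.write("\t" + text + "\n")

    def compile_digits(self):
        # Toy "compiler": emit one load instruction per digit read.
        ch = self.get_char()
        while ch.isdigit():
            self.emit(f"MOVE #{ch},D0")
            ch = self.get_char()

# In-memory operation: source and object code both live in RAM.
source = io.StringIO("12")
target = io.StringIO()
Compiler(source, target).compile_digits()
```

Passing sys.stdin and sys.stdout instead of the two StringIO
objects would give stream operation with no other change.
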
o Emphasis on Efficiency

John Backus has stated that, when he and his colleagues
developed the original FORTRAN compiler, they KNEW that they
had to make it produce tight code. In those days, there was
a strong sentiment against HOLs and in favor of assembly
language, and efficiency was the reason. If FORTRAN didn't
produce very good code by assembly standards, the users
would simply refuse to use it. For the record, that FORTRAN
compiler turned out to be one of the most efficient ever
built, in terms of code quality. But it WAS complex!

Today, we have CPU power and RAM size to spare, so code
efficiency is not so much of an issue. By studiously
ignoring this issue, we have indeed been able to Keep It
Simple. Ironically, though, as I have said, I have found
some optimizations that we can add to the basic compiler
structure, without having to add a lot of complexity. So in
this case we get to have our cake and eat it too: we will
end up with reasonable code quality, anyway.


o Limited Instruction Sets

The early computers had primitive instruction sets. Things
that we take for granted, such as stack operations and
indirect addressing, came only with great difficulty.

Example: In most compiler designs, there is a data structure
called the literal pool. The compiler typically identifies
all literals used in the program, and collects them into a
single data structure. All references to the literals are
done indirectly to this pool. At the end of the
compilation, the compiler issues commands to set aside
storage and initialize the literal pool.

We haven't had to address that issue at all. When we want
to load a literal, we just do it, in line, as in

     MOVE #3,D0

There is something to be said for the use of a literal pool,
particularly on a machine like the 8086 where data and code
can be separated. Still, the whole thing adds a fairly
large amount of complexity with little in return.

Of course, without the stack we would be lost. In a micro,
both subroutine calls and temporary storage depend heavily
on the stack, and we have used it even more than necessary
to ease expression parsing.


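To see how much bookkeeping a literal pool adds, compare the
two approaches side by side. The sketch below is Python, and
everything in it is an assumption made for illustration: the
LiteralPool class, the LITn label scheme, and the DC.W
directive used to reserve storage are not taken from any
particular compiler or from this series.

```python
def load_inline(value):
    # The approach used in this series: the literal rides along
    # inside the instruction itself.
    return [f"MOVE #{value},D0"]

class LiteralPool:
    def __init__(self):
        self.labels = {}    # literal value -> label in the pool

    def load(self, value):
        # Every reference goes through a label; a repeated literal
        # reuses the label it was given the first time.
        label = self.labels.setdefault(value, f"LIT{len(self.labels)}")
        return [f"MOVE {label},D0"]

    def dump(self):
        # At the end of compilation: set aside and initialize storage.
        return [f"{label}: DC.W {value}" for value, label in self.labels.items()]

pool = LiteralPool()
code = pool.load(3) + pool.load(7) + pool.load(3) + pool.dump()
```

The inline version is one line and keeps no state. The pooled
version needs a table that lives for the whole compilation and
a final pass to write it out, which is exactly the kind of
structure that ends up shaping a design.
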
o Desire for Generality

Much of the content of the typical compiler text is taken up
with issues we haven't addressed here at all ... things like
automated translation of grammars, or generation of LALR
parse tables. This is not simply because the authors want
to impress you. There are good, practical reasons why the
subjects are there.

We have been concentrating on the use of a recursive-descent
parser to parse a deterministic grammar, i.e., a grammar
that is not ambiguous and in which one level of lookahead is
always enough to choose the next production. I haven't made
much of this limitation, but the fact is that this represents
a small subset of possible grammars. In fact, there is an
infinite number of grammars that we can't parse using our
techniques. The LR technique is a more powerful one, and can
deal with grammars that we can't.

In compiler theory, it's important to know how to deal with
these other grammars, and how to transform them into
grammars that are easier to deal with. For example, many
(but not all) ambiguous grammars can be transformed into
unambiguous ones. The way to do this is not always obvious,
though, and so many people have devoted years to developing
ways to transform them automatically.

In practice, these issues turn out to be considerably less
important. Modern languages tend to be designed to be easy
to parse, anyway. That was a key motivation in the design
of Pascal. Sure, there are pathological grammars that you
would be hard pressed to write unambiguous BNF for, but in
the real world the best answer is probably to avoid those
grammars!

In our case, of course, we have sneakily let the language
evolve as we go, so we haven't painted ourselves into any
corners here. You may not always have that luxury. Still,
with a little care you should be able to keep the parser
simple without having to resort to automatic translation of
the grammar.


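One small, hand-done grammar transformation shows both why such
rewrites are needed and why they take a little care. The
left-recursive rule

     <expr> ::= <expr> '-' <digit> | <digit>

can't be coded directly as a recursive-descent procedure,
because the procedure would call itself before consuming any
input. The Python sketch below (function names are
hypothetical, and the input is assumed to be well formed)
compares two rewrites that accept the same strings.

```python
def parse_iterative(text):
    # <expr> ::= <digit> { '-' <digit> }
    # The loop folds each new operand into the running result,
    # so subtraction groups to the left, as it should.
    value = int(text[0])
    pos = 1
    while pos < len(text) and text[pos] == '-':
        value -= int(text[pos + 1])
        pos += 2
    return value

def parse_right_recursive(text, pos=0):
    # <expr> ::= <digit> '-' <expr> | <digit>
    # Same set of strings, but the recursion groups to the right.
    value = int(text[pos])
    if pos + 1 < len(text) and text[pos + 1] == '-':
        return value - parse_right_recursive(text, pos + 2)
    return value

left = parse_iterative("8-3-2")          # (8 - 3) - 2
right = parse_right_recursive("8-3-2")   # 8 - (3 - 2)
```

Both versions need only one character of lookahead, but only
the iterative one keeps the meaning of the original rule. When
you control the language, picking the form that is easy to
parse AND means what you want is usually a matter of minutes,
not a job for an automatic tool.
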
We have taken a vastly different approach in this series. We
started with a clean sheet of paper, and developed techniques
that work in the context that we are in; that is, a single-user
PC with rather ample CPU power and RAM space. We have limited
ourselves to reasonable grammars that are easy to parse, we have
used the instruction set of the CPU to advantage, and we have not
concerned ourselves with efficiency. THAT's why it's been easy.

Does this mean that we are forever doomed to be able to build
only toy compilers? No, I don't think so. As I've said, we can
add certain optimizations without changing the compiler
structure. If we want to process large files, we can always add
file buffering to do that. These things do not affect the
overall program design.

And I think that's a key factor. By starting with small and
limited cases, we have been able to concentrate on a structure
for the compiler that is natural for the job. Since the
structure naturally fits the job, it is almost bound to be simple
and transparent. Adding capability doesn't have to change that
basic structure. We can simply expand things like the file
structure or add an optimization layer. I guess my feeling is
that, back when resources were tight, the structures people ended
up with were artificially warped to make them work under those
conditions, and weren't optimum structures for the problem at
hand.


CONCLUSION

Anyway, that's my arm-waving guess as to how we've been able to
keep things simple. We started with something simple and let it
evolve naturally, without trying to force it into some
traditional mold.

We're going to press on with this. I've given you a list of the
areas we'll be covering in future installments. With those
installments, you should be able to build complete, working
compilers for just about any occasion, and build them simply. If
you REALLY want to build production-quality compilers, you'll be
able to do that, too.

For those of you who are champing at the bit for more parser
code, I apologize for this digression. I just thought you'd like
to have things put into perspective a bit. Next time, we'll get
back to the mainstream of the tutorial.

So far, we've only looked at pieces of compilers, and while we
have many of the makings of a complete language, we haven't
talked about how to put it all together. That will be the
subject of our next two installments. Then we'll press on into
the new subjects I listed at the beginning of this installment.

See you then.