Adding books

2026-05-29 00:39:46 +02:00
parent 34af1ebdc7
commit 4ce48dfb1a
309 changed files with 92526 additions and 0 deletions
@@ -0,0 +1,54 @@
+TUTOR.ZIP
+
+This file contains all of the installments of Jack Crenshaw's
+tutorial on compiler construction, including the new Installment 15. 
+The intended audience is those folks who are not computer scientists,
+but who enjoy computing and have always wanted to know how compilers
+work. A lot of compiler theory has been left out, but the practical
+issues are covered. By the time you have completed the series, you
+should be able to design and build your own working compiler. It will
+not be the world's best, nor will it put out incredibly tight code.
+Your product will probably never put Borland or Microsoft out of
+business.  But it will work, and it will be yours.
+
+A word about the file format: The files were originally created using
+Borland's DOS editor, Sprint.  Sprint could write to a text file only
+if you formatted the file to go to the selected printer.  I used the
+most common printer I could think of, the Epson MX-80, but even then
+the files ended up with printer control sequences at the beginning
+and end of each page.
+
+To bring the files up to date and get myself positioned to continue
+the series, I recently (1994) converted all the files to work with
+Microsoft Word for Windows.  Unlike Sprint, Word allows you to write
+the file as a DOS text file.  Unfortunately, this gave me a new
+problem, because when Word is writing to a text file, it doesn't
+write hard page breaks or page numbers.  In other words, in six years
+we've gone from a file with page breaks and page numbers, but
+embedded escape sequences, to files with no embedded escape sequences
+but no page breaks or page numbers.  Isn't progress wonderful?
+
+Of course, it's possible for me to insert the page numbers as
+straight text, rather than asking the editor to do it for me.  But
+since Word won't allow me to write page breaks to the file, we would
+end up with files with page numbers that may or may not fall at the
+ends of the pages, depending on your editor and your printer.  It
+seems to me that almost every file I've ever downloaded from
+CompuServe or BBS's that had such page numbering was incompatible
+with my printer, and gave me pages that were one line short or one
+line long, with the page numbers consequently walking up the page.  
+
+So perhaps this new format is, after all, the safest one for general
+distribution.  The files as they exist will look just fine if read
+into any text editor capable of reading DOS text files.  Since most
+editors these days include rather sophisticated word processing
+capabilities, you should be able to get your editor to paginate for
+you, prior to printing.
+
+I hope you like the tutorials.  Much thought went into them.
+
+
+									Jack W. Crenshaw
+
+								CompuServe 72325,1327
+
@@ -0,0 +1,398 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+                            LET'S BUILD A COMPILER!
+
+                                       By
+
+                            Jack W. Crenshaw, Ph.D.
+
+                                  24 July 1988
+
+
+                              Part I: INTRODUCTION
+
+
+*****************************************************************
+*                                                               *
+*                        COPYRIGHT NOTICE                       *
+*                                                               *
+*   Copyright (C) 1988 Jack W. Crenshaw. All rights reserved.   *
+*                                                               *
+*****************************************************************
+
+
+INTRODUCTION
+
+
+This series of articles is a tutorial on the theory  and practice
+of  developing language parsers and compilers.    Before  we  are
+finished,  we  will  have  covered  every   aspect   of  compiler
+construction, designed a new programming  language,  and  built a
+working compiler.
+
+Though I am not a computer scientist by education (my Ph.D. is in
+a different  field, Physics), I have been interested in compilers
+for many years.  I have  bought  and tried to digest the contents
+of virtually every  book  on  the  subject ever written.  I don't
+mind  telling you that it was slow going.    Compiler  texts  are
+written for Computer  Science  majors, and are tough sledding for
+the rest of us.  But over the years a bit of it began to seep in.
+What really caused it to jell was when I began  to  branch off on
+my own and try things on my own computer.  Now I plan to
+share with you what I have  learned.    At the end of this series
+you will by no means be  a  computer scientist, nor will you know
+all the esoterics of  compiler  theory.    I intend to completely
+ignore the more theoretical  aspects  of  the  subject.  What you
+_WILL_ know is all  the  practical aspects that one needs to know
+to build a working system.
+
+This is a "learn-by-doing" series.  In the course of the series I
+will be performing  experiments  on  a  computer.    You  will be
+expected to follow along,  repeating  the  experiments that I do,
+and  performing  some  on your own.  I will be using Turbo Pascal
+4.0 on a PC  clone.   I will periodically insert examples written
+in TP.  These will be executable code, which you will be expected
+to copy into your own computer and run.  If you don't have a copy
+of  Turbo,  you  will be severely limited in how well you will be
+able to follow what's going on.  If you don't have a copy, I urge
+you to get one.  After  all,  it's an excellent product, good for
+many other uses!
+
+Some articles on compilers show you examples, or show you  (as in
+the case of Small-C) a finished product, which you can  then copy
+and  use without a whole lot of understanding of how it works.  I
+hope to do much more  than  that.    I  hope to teach you HOW the
+things get done,  so that you can go off on your own and not only
+reproduce what I have done, but improve on it.
+                              
+This is admittedly an ambitious undertaking, and it won't be done
+in  one page.  I expect to do it in the course  of  a  number  of
+articles.    Each  article will cover a single aspect of compiler
+theory,  and  will  pretty  much  stand  alone.   If  all  you're
+interested in at a given time is one  aspect,  then  you  need to
+look only at that one article.  Each article will be  uploaded as
+it  is complete, so you will have to wait for the last one before
+you can consider yourself finished.  Please be patient.
+
+
+
+The average text on  compiler  theory covers a lot of ground that
+we won't be covering here.  The typical sequence is:
+
+ o An introductory chapter describing what a compiler is.
+
+ o A chapter or two on syntax equations, using Backus-Naur Form
+   (BNF).
+
+ o A chapter or two on lexical scanning, with emphasis on
+   deterministic and non-deterministic finite automata.
+
+ o Several chapters on parsing theory, beginning with top-down
+   recursive descent, and ending with LALR parsers.
+
+ o A chapter on intermediate languages, with emphasis on P-code
+   and similar reverse polish representations.
+
+ o Many chapters on alternative ways to handle subroutines and
+   parameter passing, type declarations, and such.
+
+ o A chapter toward the end on code generation, usually for some
+   imaginary CPU with a simple instruction set.  Most readers
+   (and in fact, most college classes) never make it this far.
+
+ o A final chapter or two on optimization. This chapter often
+   goes unread, too.
+
+
+I'll  be taking a much different approach in  this  series.    To
+begin  with,  I  won't dwell long on options.  I'll be giving you
+_A_ way that works.  If you want  to  explore  options,  well and
+good ...  I  encourage  you  to do so ... but I'll be sticking to
+what I know.   I also will skip over most of the theory that puts
+people  to  sleep.  Don't get me  wrong:  I  don't  belittle  the
+theory, and it's vitally important  when it comes to dealing with
+the more tricky  parts  of  a  given  language.  But I believe in
+putting first things first.    Here we'll be dealing with the 95%
+of compiler techniques that don't need a lot of theory to handle.
+
+I  also  will  discuss only one approach  to  parsing:  top-down,
+recursive descent parsing, which is the  _ONLY_  technique that's
+at  all   amenable  to  hand-crafting  a  compiler.    The  other
+approaches are only useful if you have a tool like YACC, and also
+don't care how much memory space the final product uses.
+                              
+I  also take a page from the work of Ron Cain, the author of  the
+original Small C.  Whereas almost all other compiler authors have
+historically  used  an  intermediate  language  like  P-code  and
+divided  the  compiler  into two parts (a front end that produces
+P-code,  and   a  back  end  that  processes  P-code  to  produce
+executable   object  code),  Ron  showed  us   that   it   is   a
+straightforward  matter  to  make  a  compiler  directly  produce
+executable  object  code,  in  the  form  of  assembler  language
+statements.  The code will _NOT_ be the world's tightest code ...
+producing optimized code is  a  much  more  difficult job. But it
+will work, and work reasonably well.  Just so that I  don't leave
+you with the impression that our end product will be worthless, I
+_DO_ intend to show you how  to  "soup up" the compiler with some
+optimization.
+
+
+
+Finally, I'll be  using  some  tricks  that I've found to be most
+helpful in letting  me  understand what's going on without wading
+through a lot of boiler plate.  Chief among these  is  the use of
+single-character tokens, with no embedded spaces,  for  the early
+design work.  I figure that  if  I  can get a parser to recognize
+and deal with I-T-L, I can  get  it  to do the same with IF-THEN-
+ELSE.  And I can.  In the second "lesson,"   I'll  show  you just
+how easy it  is  to  extend  a  simple parser to handle tokens of
+arbitrary length.  As another  trick,  I  completely  ignore file
+I/O, figuring that  if  I  can  read source from the keyboard and
+output object to the screen, I can also do it from/to disk files.
+Experience  has  proven  that  once  a   translator   is  working
+correctly, it's a  straightforward  matter to redirect the I/O to
+files.    The last trick is that I make no attempt  to  do  error
+correction/recovery.   The   programs   we'll  be  building  will
+RECOGNIZE errors, and will not CRASH, but they  will  simply stop
+on the first error ... just like good ol' Turbo does.  There will
+be  other tricks that you'll see as you go. Most of them can't be
+found in any compiler textbook, but they work.
+
+A word about style and efficiency.    As  you will see, I tend to
+write programs in  _VERY_  small, easily understood pieces.  None
+of the procedures we'll  be  working with will be more than about
+15-20 lines long.  I'm a fervent devotee  of  the  KISS  (Keep It
+Simple, Sidney) school of software development.  I  try  to never
+do something tricky or  complex,  when  something simple will do.
+Inefficient?  Perhaps, but you'll like the  results.    As  Brian
+Kernighan has said,  FIRST  make  it  run, THEN make it run fast.
+If, later on,  you want to go back and tighten up the code in one
+of  our products, you'll be able to do so, since the code will be
+quite understandable. If you  do  so, however, I urge you to wait
+until the program is doing everything you want it to.
+
+I  also  have  a  tendency  to  delay  building  a module until I
+discover that I need  it.    Trying  to anticipate every possible
+future contingency can  drive  you  crazy,  and  you'll generally
+guess wrong anyway.    In  this  modern day of screen editors and
+fast compilers, I don't hesitate to change a module when I feel I
+need a more powerful one.  Until then,  I'll  write  only  what I
+need.
+
+One final caveat: One of the principles we'll be sticking to here
+is that we don't  fool  around with P-code or imaginary CPUs, but
+that we will start out on day one  producing  working, executable
+object code, at least in the form of  assembler  language source.
+However, you may not  like  my  choice  of assembler language ...
+it's 68000 code, which is what works on my system (under SK*DOS).
+I  think  you'll  find, though, that the translation to any other
+CPU such as the 80x86 will be quite obvious, so I don't
+see  a problem here.  In fact, I hope someone out there who knows
+the '86 language better than I do will offer  us  the  equivalent
+object code fragments as we need them.
+
+
+THE CRADLE
+
+Every program needs some boiler  plate  ...  I/O  routines, error
+message routines, etc.   The  programs we develop here will be no
+exception.  I've tried to hold this stuff to an absolute
+minimum, however, so that we  can  concentrate  on  the important
+stuff without losing it  among  the  trees.  The code given below
+represents about the minimum that we need to  get  anything done.
+It consists of some I/O routines, an error-handling routine and a
+skeleton, null main program.   I  call  it  our  cradle.    As we
+develop other routines, we'll add them to the cradle, and add the
+calls to them as we  need to.  Make a copy of the cradle and save
+it, because we'll be using it more than once.
+
+There are many different ways to organize the scanning activities
+of  a  parser.   In Unix systems, authors tend to  use  getc  and
+ungetc.  I've had very good luck with the  approach  shown  here,
+which is to use  a  single, global, lookahead character.  Part of
+the initialization procedure  (the  only part, so far!) serves to
+"prime  the  pump"  by reading the first character from the input
+stream.  No other special  techniques are required with Turbo 4.0
+... each successive call to  GetChar will read the next character
+in the stream.
+
+
+{--------------------------------------------------------------}
+program Cradle;
+
+{--------------------------------------------------------------}
+{ Constant Declarations }
+
+const TAB = ^I;
+
+{--------------------------------------------------------------}
+{ Variable Declarations }
+
+var Look: char;              { Lookahead Character }
+                              
+{--------------------------------------------------------------}
+{ Read New Character From Input Stream }
+
+procedure GetChar;
+begin
+   Read(Look);
+end;
+
+{--------------------------------------------------------------}
+{ Report an Error }
+
+procedure Error(s: string);
+begin
+   WriteLn;
+   WriteLn(^G, 'Error: ', s, '.');
+end;
+
+
+{--------------------------------------------------------------}
+{ Report Error and Halt }
+
+procedure Abort(s: string);
+begin
+   Error(s);
+   Halt;
+end;
+
+
+{--------------------------------------------------------------}
+{ Report What Was Expected }
+
+procedure Expected(s: string);
+begin
+   Abort(s + ' Expected');
+end;
+
+{--------------------------------------------------------------}
+{ Match a Specific Input Character }
+
+procedure Match(x: char);
+begin
+   if Look = x then GetChar
+   else Expected('''' + x + '''');
+end;
+
+
+{--------------------------------------------------------------}
+{ Recognize an Alpha Character }
+
+function IsAlpha(c: char): boolean;
+begin
+   IsAlpha := upcase(c) in ['A'..'Z'];
+end;
+                              
+
+{--------------------------------------------------------------}
+{ Recognize a Decimal Digit }
+
+function IsDigit(c: char): boolean;
+begin
+   IsDigit := c in ['0'..'9'];
+end;
+
+
+{--------------------------------------------------------------}
+{ Get an Identifier }
+
+function GetName: char;
+begin
+   if not IsAlpha(Look) then Expected('Name');
+   GetName := UpCase(Look);
+   GetChar;
+end;
+
+
+{--------------------------------------------------------------}
+{ Get a Number }
+
+function GetNum: char;
+begin
+   if not IsDigit(Look) then Expected('Integer');
+   GetNum := Look;
+   GetChar;
+end;
+
+
+{--------------------------------------------------------------}
+{ Output a String with Tab }
+
+procedure Emit(s: string);
+begin
+   Write(TAB, s);
+end;
+
+
+
+
+{--------------------------------------------------------------}
+{ Output a String with Tab and CRLF }
+
+procedure EmitLn(s: string);
+begin
+   Emit(s);
+   WriteLn;
+end;
+
+{--------------------------------------------------------------}
+{ Initialize }
+
+procedure Init;
+begin
+   GetChar;
+end;
+
+
+{--------------------------------------------------------------}
+{ Main Program }
+
+begin
+   Init;
+end.
+{--------------------------------------------------------------}
+
+
+That's it for this introduction.  Copy the code above into TP and
+compile it.  Make sure that it compiles and runs  correctly. Then
+proceed to the first lesson, which is on expression parsing.
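
One way to check that the cradle's routines behave as described is a throwaway driver such as the sketch below. It is not part of the original series: the program name CradleTest and the two WriteLn calls in the main program are assumptions made for illustration. Only the routines needed for the test are copied from the cradle, and Error, Abort and Expected are folded into a single procedure for brevity. It should compile under Turbo Pascal or a compatible compiler such as Free Pascal.

```pascal
program CradleTest;
{ Hypothetical smoke test for the cradle; not from the original text. }

var Look: char;              { Lookahead Character }

{ Read New Character From Input Stream }
procedure GetChar;
begin
   Read(Look);
end;

{ Report What Was Expected and Halt (Error/Abort/Expected combined) }
procedure Expected(s: string);
begin
   WriteLn;
   WriteLn(^G, 'Error: ', s, ' Expected.');
   Halt;
end;

{ Recognize an Alpha Character }
function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;

{ Recognize a Decimal Digit }
function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;

{ Get an Identifier (a single letter at this stage) }
function GetName: char;
begin
   if not IsAlpha(Look) then Expected('Name');
   GetName := UpCase(Look);
   GetChar;
end;

{ Get a Number (a single digit at this stage) }
function GetNum: char;
begin
   if not IsDigit(Look) then Expected('Integer');
   GetNum := Look;
   GetChar;
end;

begin
   GetChar;                          { prime the lookahead, as Init does }
   WriteLn('Name:   ', GetName);     { consumes one letter }
   WriteLn('Number: ', GetNum);      { consumes one digit }
end.
```

Typing a letter followed by a digit (for example a1) and pressing Enter should echo the upper-cased letter and then the digit. Any other first character should produce the "Name Expected" message and halt, which is the stop-on-first-error behavior described earlier.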
+
+
+
+
+
+
@@ -0,0 +1,801 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+                     LET'S BUILD A COMPILER!
+
+                                By
+
+                     Jack W. Crenshaw, Ph.D.
+
+                           5 June 1989
+
+
+                       Part XII: MISCELLANY
+
+
+*****************************************************************
+*                                                               *
+*                        COPYRIGHT NOTICE                       *
+*                                                               *
+*   Copyright (C) 1989 Jack W. Crenshaw. All rights reserved.   *
+*                                                               *
+*****************************************************************
+
+
+INTRODUCTION
+
+This installment is another one  of  those  excursions  into side
+alleys  that  don't  seem to fit  into  the  mainstream  of  this
+tutorial  series.    As I mentioned last time, it was while I was
+writing this installment that I realized some changes  had  to be
+made  to  the  compiler structure.  So I had to digress from this
+digression long enough to develop the new structure  and  show it
+to you.
+
+Now that that's behind us, I can tell you what I set out to cover
+in the first place.  This shouldn't take long, and then we can get
+back into the mainstream.
+
+Several people have asked  me  about  things that other languages
+provide, but so far I haven't addressed in this series.   The two
+biggies are semicolons and  comments.    Perhaps  you've wondered
+about them, too, and  wondered  how things would change if we had
+to  deal with them.  Just so you can proceed with what's to come,
+without being  bothered by that nagging feeling that something is
+missing, we'll address such issues here.
+
+
+SEMICOLONS
+
+Ever since the introduction of Algol, semicolons have been a part
+of  almost every modern language.  We've all  used  them  to  the
+point that they are taken for  granted.   Yet I suspect that more
+compilation errors have  occurred  due  to  misplaced  or missing
+semicolons  than  any  other single cause.  And if we had a penny
+for  every  extra  keystroke programmers have used  to  type  the
+little rascals, we could pay off the national debt.
+
+Having  been  brought  up with FORTRAN, it took me a long time to
+get used to using semicolons, and to tell the  truth  I've  never
+quite understood why they  were  necessary.    Since I program in
+Pascal, and since the use of semicolons in Pascal is particularly
+tricky,  that one little character is still  by  far  my  biggest
+source of errors.
+
+When  I  began  developing  KISS,  I resolved to  question  EVERY
+construct in other languages, and to try to avoid the most common
+problems that occur with them.  That puts the semicolon very high
+on my hit list.
+
+To  understand  the  role of the semicolon, you have to look at a
+little history.
+
+Early programming languages were line-oriented.  In  FORTRAN, for
+example, various parts  of  the statement had specific columns or
+fields that they had to appear in.  Since  some  statements  were
+too  long for one line, the  "continuation  card"  mechanism  was
+provided to let  the  compiler  know  that a given card was still
+part of the previous  line.   The mechanism survives to this day,
+even though punched cards are now things of the distant past.
+
+When  other  languages  came  along,  they  also  adopted various
+mechanisms for dealing with multiple-line statements.  BASIC is a
+good  example.  It's important to  recognize,  though,  that  the
+FORTRAN  mechanism  was   not   so  much  required  by  the  line
+orientation of that  language,  as by the column-orientation.  In
+those versions of FORTRAN  where  free-form  input  is permitted,
+it's no longer needed.
+
+When the fathers  of  Algol introduced that language, they wanted
+to get away  from  line-oriented programs like FORTRAN and BASIC,
+and allow for free-form input.   This included the possibility of
+stringing multiple statements on a single line, as in
+
+
+     a=b; c=d; e=e+1;
+
+
+In cases like this,  the  semicolon is almost REQUIRED.  The same
+line, without the semicolons, just looks "funny":
+
+
+     a=b c= d e=e+1
+
+I suspect that this is the major ... perhaps ONLY ...  reason for
+semicolons: to keep programs from looking funny.
+
+But  the  idea  of stringing multiple statements  together  on  a
+single  line  is  a  dubious  one  at  best.  It's not very  good
+programming  style,  and  harks back to  the  days  when  it  was
+considered important to conserve cards.  In these  days  of CRT's
+and indented code, the clarity of programs is  far  better served
+by  keeping statements separate.  It's still  nice  to  have  the
+OPTION  of  multiple  statements,  but  it seems a shame to  keep
+programmers  in  slavery  to the semicolon, just to keep that one
+rare case from "looking funny."
+
+When I started in with KISS, I tried  to  keep  an  open mind.  I
+decided that I would use  semicolons when it became necessary for
+the parser, but not until then.  I figured this would happen just
+about  the time I added the ability  to  spread  statements  over
+multiple lines.  But, as you  can  see, that never happened.  The
+TINY compiler is perfectly  happy  to  parse the most complicated
+statement, spread over any number of lines, without semicolons.
+
+Still, there are people  who  have  used  semicolons for so long,
+they feel naked  without them.  I'm one of them.  Once I had KISS
+defined sufficiently well, I began to write a few sample programs
+in the language.    I  discovered,  somewhat to my horror, that I
+kept  putting  semicolons  in anyway.   So  now  I'm  facing  the
+prospect of a NEW  rash  of  compiler  errors, caused by UNWANTED
+semicolons.  Phooey!
+
+Perhaps more to the point, there are readers out  there  who  are
+designing their own languages, which may  include  semicolons, or
+who  want to use the techniques of  these  tutorials  to  compile
+conventional languages like  C.    In  either case, we need to be
+able to deal with semicolons.
+
+
+SYNTACTIC SUGAR
+
+This whole discussion brings  up  the  issue of "syntactic sugar"
+... constructs that are added to a language, not because they are
+needed, but because they help make the programs look right to the
+programmer.    After  all, it's nice  to  have  a  small,  simple
+compiler,    but  it  would  be  of  little  use if the resulting
+language  were  cryptic  and hard to program.  The language FORTH
+comes  to mind (a premature OUCH! for the  barrage  I  know  that
+one's going to fetch me).  If we can add features to the language
+that  make the programs easier to read  and  understand,  and  if
+those features  help keep the programmer from making errors, then
+we should do so.    Particularly if the constructs don't add much
+to the complexity of the language or its compiler.
+
+The  semicolon  could  be considered an example,  but  there  are
+plenty of others, such as the 'THEN' in an IF-statement, the 'DO'
+in a WHILE-statement,  and  even the 'PROGRAM' statement, which I
+came within a gnat's eyelash of leaving out  of  TINY.    None of
+these tokens  add  much  to  the  syntax  of the language ... the
+compiler can figure out  what's  going on without them.  But some
+folks feel that they  DO  add to the readability of programs, and
+that can be very important.
+
+There are two schools of thought on this subject, which  are well
+represented by two of our most popular languages, C and Pascal.
+
+To  the minimalists, all such sugar should be  left  out.    They
+argue that it clutters up the language and adds to the  number of
+keystrokes  programmers  must type.   Perhaps  more  importantly,
+every extra token or keyword represents a trap lying in wait for
+the inattentive programmer.  If you leave out  a  token, misplace
+it, or misspell it, the compiler  will  get you.  So these people
+argue that the best approach is to get rid of such things.  These
+folks tend to like C, which has a minimum of unnecessary keywords
+and punctuation.
+
+Those from the other school tend to like Pascal.  They argue that
+having to type a few extra characters is a small price to pay for
+legibility.    After  all, humans have to read the programs, too.
+Their best argument is that each such construct is an opportunity
+to tell the compiler that you really mean for it  to  do what you
+said to.  The sugary tokens serve as useful landmarks to help you
+find your way.
+
+The differences are well represented by the two  languages.   The
+most oft-heard complaint about  C  is  that  it is too forgiving.
+When you make a mistake in C, the  erroneous  code  is  too often
+another  legal  C  construct.    So  the  compiler  just  happily
+continues to compile, and  leaves  you  to  find the error during
+debug.    I guess that's why debuggers  are  so  popular  with  C
+programmers.
+
+On the  other  hand,  if  a  Pascal  program compiles, you can be
+pretty  sure that the program will do what you told it.  If there
+is an error at run time, it's probably a design error.
+
+The  best  example  of  useful  sugar  is  the semicolon  itself.
+Consider the code fragment:
+
+
+     a=1+(2*b+c)   b...
+
+
+Since there is no operator connecting the token 'b' with the rest
+of the  statement, the compiler will conclude that the expression
+ends  with  the  ')', and the 'b'  is  the  beginning  of  a  new
+statement.    But  suppose  I  have simply left out the  intended
+operator, and I really want to say:
+
+
+     a=1+(2*b+c)*b...
+
+
+In  this  case  the compiler will get an error, all right, but it
+won't be very meaningful  since  it will be expecting an '=' sign
+after the 'b' that really shouldn't be there.
+
+If, on the other hand, I include a semicolon after the  'b', THEN
+there  can  be no doubt where I  intend  the  statement  to  end.
+Syntactic  sugar,  then,  can  serve  a  very  useful purpose  by
+providing some additional insurance that we remain on track.
+
+I find  myself  somewhere  in  the middle of all this.  I tend to
+favor the Pascal-ers' view ... I'd much rather find  my  bugs  at
+compile time rather than run time.  But I also hate to just throw
+verbosity  in  for  no apparent reason, as in COBOL.  So far I've
+consistently left most of the Pascal sugar out of KISS/TINY.  But
+I certainly have no strong feelings either way, and  I  also  can
+see the value of sprinkling a little sugar around  just  for  the
+extra  insurance  that  it  brings.    If  you like  this  latter
+approach, things like that are easy to add.  Just  remember that,
+like  the semicolon, each item of sugar  is  something  that  can
+potentially cause a compile error by its omission.
+
+
+DEALING WITH SEMICOLONS
+
+There  are  two  distinct  ways  in which semicolons are used  in
+popular  languages.    In Pascal, the semicolon is regarded as a
+statement SEPARATOR.  No semicolon  is  required  after  the last
+statement in a block.  The syntax is:
+
+
+     <block> ::= <statement> ( ';' <statement>)*
+
+     <statement> ::= <assignment> | <if> | <while> ... | null
+
+
+(The null statement is IMPORTANT!)
+
+Pascal  also defines some semicolons in  other  places,  such  as
+after the PROGRAM statement.
+
+In  C  and  Ada, on the other hand, the semicolon is considered a
+statement TERMINATOR,  and  follows  all  statements  (with  some
+embarrassing and confusing  exceptions).   The syntax for this is
+simply:
+
+
+     <block> ::= ( <statement> ';')*
+
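
The difference between the two grammars can be seen in a small recognizer. The sketch below is hypothetical and not taken from the tutorial: it treats every single letter as a complete statement, reads its input from string constants rather than from the TINY scanner, and the names IsLetter, IsTerminated, IsSeparated and Show are made up for the illustration.

```pascal
program SemiStyles;
{ Hypothetical comparison of the two semicolon grammars. }

function IsLetter(c: char): boolean;
begin
   IsLetter := UpCase(c) in ['A'..'Z'];
end;

{ C/Ada style:  <block> ::= ( <statement> ';' )* }
function IsTerminated(s: string): boolean;
var i: integer;
    ok: boolean;
begin
   ok := not Odd(Length(s));       { statements and semicolons pair up }
   i := 1;
   while ok and (i < Length(s)) do begin
      ok := IsLetter(s[i]) and (s[i + 1] = ';');
      i := i + 2;
   end;
   IsTerminated := ok;
end;

{ Pascal style:  <block> ::= <statement> ( ';' <statement> )* }
{ where a statement may be null                              }
function IsSeparated(s: string): boolean;
var i: integer;
    ok: boolean;
begin
   ok := true;
   i := 1;
   { first statement, possibly null }
   if (i <= Length(s)) and IsLetter(s[i]) then i := i + 1;
   while ok and (i <= Length(s)) do
      if s[i] = ';' then begin
         i := i + 1;
         { statement after the separator, possibly null }
         if (i <= Length(s)) and IsLetter(s[i]) then i := i + 1;
      end
      else ok := false;
   IsSeparated := ok;
end;

procedure Show(s: string);
begin
   WriteLn(s, '   terminator: ', IsTerminated(s),
              '   separator: ', IsSeparated(s));
end;

begin
   Show('a;b;');
   Show('a;b');
   Show('ab');
end.
```

Under these assumptions the first input satisfies both grammars (the separator form reads the trailing semicolon as introducing a null statement), the second satisfies only the separator form, and the third satisfies neither. That is why the null statement matters so much to the Pascal syntax.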
+
+Of  the two syntaxes, the Pascal one seems on the face of it more
+rational, but experience has shown  that it leads to some strange
+difficulties.  People get  so  used  to  typing a semicolon after
+every  statement  that  they tend to  type  one  after  the  last
+statement in a block, also.  That usually doesn't cause  any harm
+...  it  just gets treated as a  null  statement.    Many  Pascal
+programmers, including yours truly,  do  just  that. But there is
+one  place you absolutely CANNOT type  a  semicolon,  and  that's
+right before an ELSE.  This little gotcha  has  cost  me  many an
+extra  compilation,  particularly  when  the  ELSE  is  added  to
+existing code.    So  the  C/Ada  choice  turns out to be better.
+Apparently Niklaus Wirth thinks so, too:  In his  Modula  2,  he
+abandoned the Pascal approach.
+
+Given either of these two syntaxes, it's an easy matter (now that
+we've  reorganized  the  parser!) to add these  features  to  our
+parser.  Let's take the last case first, since it's simpler.
+
+To begin, I've made things easy by introducing a new recognizer:
+
+
+{--------------------------------------------------------------}
+{ Match a Semicolon }
+
+procedure Semi;
+begin
+   MatchString(';');
+end;
+{--------------------------------------------------------------}
+
+
+This procedure works very much like our old Match.  It insists on
+finding a semicolon as the next token.  Having found it, it skips
+to the next one.
+
+Since a  semicolon follows a statement, procedure Block is almost
+the only one we need to change:
+
+
+{--------------------------------------------------------------}
+{ Parse and Translate a Block of Statements }
+
+procedure Block;
+begin
+   Scan;
+   while not(Token in ['e', 'l']) do begin
+      case Token of
+       'i': DoIf;
+       'w': DoWhile;
+       'R': DoRead;
+       'W': DoWrite;
+       'x': Assignment;
+      end;
+      Semi;
+      Scan;
+   end;
+end;
+{--------------------------------------------------------------}
+
+
+Note carefully the subtle change in the case statement.  The call
+to  Assignment  is now guarded by a test on Token.   This  is  to
+avoid calling Assignment when the  token  is  a  semicolon (which
+could happen if the statement is null).
+
+Since declarations are also  statements,  we  also  need to add a
+call to Semi within procedure TopDecls:
+
+
+{--------------------------------------------------------------}
+{ Parse and Translate Global Declarations }
+
+procedure TopDecls;
+begin
+   Scan;
+   while Token = 'v' do begin
+      Alloc;
+      while Token = ',' do
+         Alloc;
+      Semi;
+   end;
+end;
+{--------------------------------------------------------------}
+
+
+Finally, we need one for the PROGRAM statement:
+
+
+{--------------------------------------------------------------}
+{ Main Program }
+
+begin
+   Init;
+   MatchString('PROGRAM');
+   Semi;
+   Header;
+   TopDecls;
+   MatchString('BEGIN');
+   Prolog;
+   Block;
+   MatchString('END');
+   Epilog;
+end.
+{--------------------------------------------------------------}
+
+
+It's as easy as that.  Try it with a copy of TINY and see how you
+like it.
+
+The Pascal version  is  a  little  trickier,  but  it  still only
+requires  minor  changes,  and those only to procedure Block.  To
+keep things as simple as possible, let's split the procedure into
+two parts.  The following procedure handles just one statement:
+
+
+{--------------------------------------------------------------}
+{ Parse and Translate a Single Statement }
+
+procedure Statement;
+begin
+   Scan;
+   case Token of
+    'i': DoIf;
+    'w': DoWhile;
+    'R': DoRead;
+    'W': DoWrite;
+    'x': Assignment;
+   end;
+end;
+{--------------------------------------------------------------}
+
+
+Using this procedure, we can now rewrite Block like this:
+
+
+{--------------------------------------------------------------}
+{ Parse and Translate a Block of Statements }
+
+procedure Block;
+begin
+   Statement;
+   while Token = ';' do begin
+      Next;
+      Statement;
+   end;
+end;
+{--------------------------------------------------------------}
+
+
+That  sure  didn't  hurt, did it?  We can now parse semicolons in
+Pascal-like fashion.
+
+
+A COMPROMISE
+
+Now that we know how to deal with semicolons, does that mean that
+I'm going to put them in KISS/TINY?  Well, yes and  no.    I like
+the extra sugar and the security that comes with knowing for sure
+where the  ends  of  statements  are.    But I haven't changed my
+dislike for the compilation errors associated with semicolons.
+
+So I have what I think is a nice compromise: Make them OPTIONAL!
+
+Consider the following version of Semi:
+
+
+{--------------------------------------------------------------}
+{ Match a Semicolon }
+
+procedure Semi;
+begin
+   if Token = ';' then Next;
+end;
+{--------------------------------------------------------------}
+
+
+This procedure will ACCEPT a semicolon whenever it is called, but
+it won't INSIST on one.  That means that when  you  choose to use
+semicolons, the compiler  will  use the extra information to help
+keep itself on track.  But if you omit one (or omit them all) the
+compiler won't complain.  The best of both worlds.
+
+Put this procedure in place in the first version of  your program
+(the  one for C/Ada syntax), and you have  the  makings  of  TINY
+Version 1.2.
+
+
+COMMENTS
+
+Up  until  now  I have carefully avoided the subject of comments.
+You would think that this would be an easy subject ... after all,
+the compiler doesn't have to deal with comments at all; it should
+just ignore them.  Well, sometimes that's true.
+
+Comments can be just about as easy or as difficult as  you choose
+to make them.    At  one  extreme,  we can arrange things so that
+comments  are  intercepted  almost  the  instant  they  enter the
+compiler.  At the  other,  we can treat them as lexical elements.
+Things  tend to get interesting when  you  consider  things  like
+comment delimiters contained in quoted strings.
+
+
+SINGLE-CHARACTER DELIMITERS
+
+Here's an example.  Suppose we assume the  Turbo  Pascal standard
+and use curly braces for comments.  In this case we  have single-
+character delimiters, so our parsing is a little easier.
+
+One  approach  is  to  strip  the  comments  out the  instant  we
+encounter them in the input stream; that is,  right  in procedure
+GetChar.    To  do  this,  first  change  the  name of GetChar to
+something else, say GetCharX.  (For the record, this is  going to
+be a TEMPORARY change, so best not do this with your only copy of
+TINY.  I assume you understand that you should  always  do  these
+experiments with a working copy.)
+
+Now, we're going to need a  procedure  to skip over comments.  So
+key in the following one:
+
+
+{--------------------------------------------------------------}
+{ Skip A Comment Field }
+
+procedure SkipComment;
+begin
+   while Look <> '}' do
+      GetCharX;
+   GetCharX;
+end;
+{--------------------------------------------------------------}
+
+
+Clearly, what this procedure is going to do is to simply read and
+discard characters from the input  stream, until it finds a right
+curly brace.  Then it reads one more character and returns  it in
+Look.
+
+Now we can write a new version of GetChar that calls SkipComment
+to strip out comments:
+
+
+{--------------------------------------------------------------}
+{ Get Character from Input Stream }
+{ Skip Any Comments }
+
+procedure GetChar;
+begin
+   GetCharX;
+   while Look = '{' do SkipComment;   { comments may be back to back }
+end;
+{--------------------------------------------------------------}
+
+
+Code this up  and  give  it  a  try.    You'll find that you can,
+indeed, bury comments anywhere you like.  The comments never even
+get into the parser proper ... every call to GetChar just returns
+any character that's NOT part of a comment.
+
+As a matter of fact, while  this  approach gets the job done, and
+may even be  perfectly  satisfactory  for  you, it does its job a
+little  TOO  well.    First  of all, most  programming  languages
+specify that a comment should be treated like a  space,  so  that
+comments aren't allowed  to  be embedded in, say, variable names.
+This current version doesn't care WHERE you put comments.
+
+Second, since the  rest  of  the  parser can't even receive a '{'
+character, you will not be allowed to put one in a quoted string.
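+
+To see both effects without touching your compiler, here is a
+small stand-alone sketch.  StripComments is a name I made up for
+the illustration; it is not part of TINY.  It assumes the input
+is held in a string, and it filters that string the same way the
+GetChar-level interception filters the input stream:
+

```pascal
program StripDemo;
{ Sketch only: mimic the GetChar-level comment filter on a string. }

function StripComments(const S: string): string;
var
   i: integer;
   R: string;
begin
   R := '';
   i := 1;
   while i <= Length(S) do begin
      if S[i] = '{' then begin
         { discard everything up to and including the closing brace }
         while (i <= Length(S)) and (S[i] <> '}') do
            Inc(i);
         Inc(i);
      end
      else begin
         R := R + S[i];
         Inc(i);
      end;
   end;
   StripComments := R;
end;

begin
   { a comment buried inside a name }
   WriteLn(StripComments('co{hidden}unt := 1'));
   { an opening brace inside a quoted string }
   WriteLn(StripComments('x := ''{''; y := 2'));
end.
```

+
+The first line printed is  count := 1 , so the comment has quietly
+glued the two halves of a name together.  The second line stops
+right after the opening quote: the brace inside the string was
+taken for the start of a comment, and the rest of the line went
+with it.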
+
+Before you turn up your nose at this simplistic solution, though,
+I should point out  that  as respected a compiler as Turbo Pascal
+also won't allow  a  '{' in a quoted string.  Try it.  And as for
+embedding a comment in an  identifier, I can't imagine why anyone
+would want to do such a  thing,  anyway, so the question is moot.
+For 99% of all  applications,  what I've just shown you will work
+just fine.
+
+But,  if  you  want  to  be  picky  about it  and  stick  to  the
+conventional treatment, then we  need  to  move  the interception
+point downstream a little further.
+
+To  do  this,  first change GetChar back to the way  it  was  and
+change the name called in SkipComment.  Then, let's add  the left
+brace as a possible whitespace character:
+
+
+{--------------------------------------------------------------}
+{ Recognize White Space }
+
+function IsWhite(c: char): boolean;
+begin
+   IsWhite := c in [' ', TAB, CR, LF, '{'];
+end;
+{--------------------------------------------------------------}
+
+
+Now, we can deal with comments in procedure SkipWhite:
+
+
+{--------------------------------------------------------------}
+{ Skip Over Leading White Space }
+
+procedure SkipWhite;
+begin
+   while IsWhite(Look) do begin
+      if Look = '{' then
+         SkipComment
+      else
+         GetChar;
+   end;
+end;
+{--------------------------------------------------------------}
+
+
+Note  that SkipWhite is written so that we  will  skip  over  any
+combination of whitespace characters and comments, in one call.
+
+OK, give this one a try, too.   You'll  find  that  it will let a
+comment serve to delimit tokens.  It's worth mentioning that this
+approach also gives us the  ability to handle curly braces within
+quoted strings, since within such  strings we will not be testing
+for or skipping over whitespace.
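+
+If you'd like to watch a comment act as a delimiter, the sketch
+below may help.  It is not TINY code: the Names function, the
+string-based GetChar and the use of #0 as an end-of-input mark
+are all assumptions made to keep the example self-contained.
+IsWhite, SkipWhite and SkipComment follow the versions above.
+

```pascal
program CommentAsSpace;
{ Sketch only: comments treated as white space between names. }

var
   Src: string;      { input text, held in memory for this sketch }
   Idx: integer;     { position of Look within Src }
   Look: char;       { lookahead character }

procedure GetChar;
begin
   Inc(Idx);
   if Idx <= Length(Src) then Look := Src[Idx]
   else Look := #0;              { #0 stands for end of input }
end;

procedure SkipComment;
begin
   while (Look <> '}') and (Look <> #0) do
      GetChar;
   GetChar;
end;

function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', #9, #13, #10, '{'];
end;

procedure SkipWhite;
begin
   while IsWhite(Look) do
      if Look = '{' then SkipComment
      else GetChar;
end;

function IsAlNum(c: char): boolean;
begin
   IsAlNum := UpCase(c) in ['A'..'Z', '0'..'9'];
end;

{ Return every name found in S, each followed by a bar }
function Names(const S: string): string;
var
   R: string;
begin
   Src := S;
   Idx := 0;
   R := '';
   GetChar;
   SkipWhite;
   while Look <> #0 do begin
      if IsAlNum(Look) then begin
         while IsAlNum(Look) do begin
            R := R + Look;
            GetChar;
         end;
         R := R + '|';
      end
      else
         GetChar;                { ignore anything that is not a name }
      SkipWhite;
   end;
   Names := R;
end;

begin
   WriteLn(Names('count{a note}total'));
end.
```

+
+This time the output is  count|total| : the comment has split the
+text into two names, just as a blank would have.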
+
+There's one last  item  to  deal  with:  Nested  comments.   Some
+programmers like the idea  of  nesting  comments, since it allows
+you to comment out code during debugging.  The  code  I've  given
+here won't allow that and, again, neither will Turbo Pascal.
+
+But the fix is incredibly easy.  All  we  need  to  do is to make
+SkipComment recursive:
+
+
+{--------------------------------------------------------------}
+{ Skip A Comment Field }
+
+procedure SkipComment;
+begin
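+   GetChar;                      { step past the opening brace }
+   while Look <> '}' do
+      if Look = '{' then
+         SkipComment             { nested comment: skip it whole }
+      else
+         GetChar;
+   GetChar;                      { step past the closing brace }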
+end;
+{--------------------------------------------------------------}
+
+
+That does it.  As  sophisticated a comment-handler as you'll ever
+need.
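+
+Here is a stand-alone sketch of recursive skipping that you can
+run without the rest of the compiler.  The Strip function and the
+string-based GetChar are assumptions of mine, not part of TINY.
+Note one detail: the loop looks at each character BEFORE consuming
+it, so that two nested comments sitting side by side are both
+skipped.
+

```pascal
program NestedDemo;
{ Sketch only: recursive skipping of nested brace comments. }

var
   Src: string;      { input text, held in memory for this sketch }
   Idx: integer;     { position of Look within Src }
   Look: char;       { lookahead character }

procedure GetChar;
begin
   Inc(Idx);
   if Idx <= Length(Src) then Look := Src[Idx]
   else Look := #0;              { #0 stands for end of input }
end;

procedure SkipComment;
begin
   GetChar;                      { step past the opening brace }
   while (Look <> '}') and (Look <> #0) do
      if Look = '{' then
         SkipComment             { nested comment: skip it whole }
      else
         GetChar;
   GetChar;                      { step past the closing brace }
end;

{ Return S with all comments, nested or not, removed }
function Strip(const S: string): string;
var
   R: string;
begin
   Src := S;
   Idx := 0;
   R := '';
   GetChar;
   while Look <> #0 do
      if Look = '{' then
         SkipComment
      else begin
         R := R + Look;
         GetChar;
      end;
   Strip := R;
end;

begin
   WriteLn(Strip('a{one {two} {three} one}b'));
end.
```

+
+The program prints  ab : the outer comment and both of the
+comments nested inside it are gone.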
+
+
+MULTI-CHARACTER DELIMITERS
+
+That's all well and  good  for cases where a comment is delimited
+by single  characters,  but  what  about  the  cases such as C or
+standard Pascal, where two  characters  are  required?  Well, the
+principles are still the same, but we have to change our approach
+quite a bit.  I'm sure it won't surprise you to learn that things
+get harder in this case.
+
+For the multi-character situation, the  easiest thing to do is to
+intercept the left delimiter  back  at the GetChar stage.  We can
+"tokenize" it right there, replacing it by a single character.
+
+Let's assume we're using the C delimiters '/*' and '*/'.   First,
+we  need  to  go back to the "GetCharX" approach.  In yet another
+copy of your compiler, rename  GetChar to GetCharX and then enter
+the following new procedure GetChar:
+
+
+{--------------------------------------------------------------}
+{ Read New Character.  Intercept '/*' }
+
+procedure GetChar;
+begin
+   if TempChar <> ' ' then begin
+      Look := TempChar;
+      TempChar := ' ';
+      end
+   else begin
+      GetCharX;
+      if Look = '/' then begin
+         Read(TempChar);
+         if TempChar = '*' then begin
+            Look := '{';
+            TempChar := ' ';
+         end;
+      end;
+   end;
+end;
+{--------------------------------------------------------------}
+
+
+As you can see, what this procedure does is  to  intercept  every
+occurrence of '/'.  It then examines the NEXT  character  in  the
+stream.  If the character  is  a  '*',  then  we  have  found the
+beginning  of  a  comment,  and  GetChar  will  return  a  single
+character replacement for it.   (For  simplicity,  I'm  using the
+same '{' character  as I did for Pascal.  If you were writing a C
+compiler, you'd no doubt want to pick some other character that's
+not  used  elsewhere  in C.  Pick anything you like ... even $FF,
+anything that's unique.)
+
+If the character  following  the  '/'  is NOT a '*', then GetChar
+tucks it away in the new global TempChar, and  returns  the  '/'.
+
+Note that you need to declare this new variable and initialize it
+to ' '.  I like to do  things  like  that  using the Turbo "typed
+constant" construct:
+
+
+     const TempChar: char = ' ';
+
+
+Now we need a new version of SkipComment:
+
+
+{--------------------------------------------------------------}
+{ Skip A Comment Field }
+
+procedure SkipComment;
+begin
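+   GetCharX;
+   repeat
+      while Look <> '*' do       { find the next '*' }
+         GetCharX;
+      GetCharX;                  { look at the character after it }
+   until Look = '/';             { a run such as '**/' also ends here }
+   GetChar;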
+end;
+{--------------------------------------------------------------}
+
+
+A  few  things  to  note:  first  of  all, function  IsWhite  and
+procedure SkipWhite  don't  need  to  be  changed,  since GetChar
+returns the '{' token.  If you change that token  character, then
+of  course you also need to change the  character  in  those  two
+routines.
+
+Second, note that  SkipComment  doesn't call GetChar in its loop,
+but  GetCharX.    That  means   that  the  trailing  '/'  is  not
+intercepted and  is seen by SkipComment.  Third, although GetChar
+is the  procedure  doing  the  work,  we  can still deal with the
+comment  characters  embedded  in  a  quoted  string,  by calling
+GetCharX  instead  of  GetChar  while  we're  within  the string.
+Finally,  note  that  we can again provide for nested comments by
+adding a single statement to SkipComment, just as we did before.
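+
+For experimenting with two-character delimiters away from the
+compiler, here is a sketch that works on a string.  StripC is a
+made-up name.  Because the whole input sits in memory, the sketch
+can simply peek at the next character by index, so it needs no
+TempChar; that is a simplification, not the method used above.
+

```pascal
program CCommentDemo;
{ Sketch only: remove C-style comments from a string. }

function StripC(const S: string): string;
var
   i: integer;
   R: string;
begin
   R := '';
   i := 1;
   while i <= Length(S) do
      if (S[i] = '/') and (i < Length(S)) and (S[i + 1] = '*') then begin
         i := i + 2;             { step past the opening pair }
         { advance until the closing pair starts at position i }
         while (i < Length(S)) and
               not ((S[i] = '*') and (S[i + 1] = '/')) do
            Inc(i);
         i := i + 2;             { step past the closing pair }
      end
      else begin
         R := R + S[i];          { an ordinary character, or a lone '/' }
         Inc(i);
      end;
   StripC := R;
end;

begin
   WriteLn(StripC('a/b /* x **/ c'));
end.
```

+
+The lone '/' in a/b comes through untouched, while the comment,
+including its awkward '**/' ending, disappears.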
+
+
+ONE-SIDED COMMENTS
+
+So far I've shown you  how  to  deal  with  any  kind  of comment
+delimited on the left and the  right.   That only leaves the one-
+sided comments like those in assembler language or  in  Ada, that
+are terminated by the end of the line.  In a  way,  that  case is
+easier.   The only procedure that would need  to  be  changed  is
+SkipComment, which must now terminate at the newline characters:
+
+
+{--------------------------------------------------------------}
+{ Skip A Comment Field }
+
+procedure SkipComment;
+begin
+   repeat
+      GetCharX;
+   until Look = CR;
+   GetChar;
+end;
+{--------------------------------------------------------------}
+
+
+If the leading character is  a  single  one,  as  in  the  ';' of
+assembly language, then we're essentially done.  If  it's  a two-
+character token, as in the '--'  of  Ada, we need only modify the
+tests  within  GetChar.   Either way, it's an easier problem than
+the balanced case.
+
+
+CONCLUSION
+
+At this point we now have the ability to deal with  both comments
+and semicolons, as well as other kinds of syntactic sugar.   I've
+shown  you several ways to deal with  each,  depending  upon  the
+convention  desired.    The  only  issue left is: which of  these
+conventions should we use in KISS/TINY?
+
+For the reasons that I've given as we went  along,  I'm  choosing
+the following:
+
+
+ (1) Semicolons are TERMINATORS, not separators
+
+ (2) Semicolons are OPTIONAL
+
+ (3) Comments are delimited by curly braces
+
+ (4) Comments MAY be nested
+
+
+Put the code corresponding to these cases into your copy of TINY.
+You now have TINY Version 1.2.
+
+Now that we  have  disposed  of  these  sideline  issues,  we can
+finally get back into the mainstream.  In  the  next installment,
+we'll talk  about procedures and parameter passing, and we'll add
+these important features to TINY.  See you then.
+
+
+*****************************************************************
+*                                                               *
+*                        COPYRIGHT NOTICE                       *
+*                                                               *
+*   Copyright (C) 1989 Jack W. Crenshaw. All rights reserved.   *
+*                                                               *
+*****************************************************************
+
@@ -0,0 +1,792 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+                     LET'S BUILD A COMPILER!
+
+                                By
+
+                     Jack W. Crenshaw, Ph.D.
+
+                           24 July 1988
+
+
+                   Part II: EXPRESSION PARSING
+
+
+*****************************************************************
+*                                                               *
+*                        COPYRIGHT NOTICE                       *
+*                                                               *
+*   Copyright (C) 1988 Jack W. Crenshaw. All rights reserved.   *
+*                                                               *
+*****************************************************************
+
+
+GETTING STARTED
+
+If you've read the introduction document to this series, you will
+already know what  we're  about.    You will also have copied the
+cradle software  into your Turbo Pascal system, and have compiled
+it.  So you should be ready to go.
+
+
+The purpose of this article is for us to learn  how  to parse and
+translate mathematical expressions.  What we would like to see as
+output is a series of assembler-language statements  that perform
+the desired actions.    For purposes of definition, an expression
+is the right-hand side of an equation, as in
+
+               x = 2*y + 3/(4*z)
+
+In the early going, I'll be taking things in _VERY_  small steps.
+That's  so  that  the beginners among you won't get totally lost.
+There are also  some  very  good  lessons to be learned early on,
+that will serve us well later.  For the more experienced readers:
+bear with me.  We'll get rolling soon enough.
+
+SINGLE DIGITS
+
+In keeping with the whole theme of this series (KISS, remember?),
+let's start with the absolutely most simple case we can think of.
+That, to me, is an expression consisting of a single digit.
+
+Before starting to code, make sure you have a  baseline  copy  of
+the  "cradle" that I gave last time.  We'll be using it again for
+other experiments.  Then add this code:
+
+
+{---------------------------------------------------------------}
+{ Parse and Translate a Math Expression }
+
+procedure Expression;
+begin
+   EmitLn('MOVE #' + GetNum + ',D0')
+end;
+{---------------------------------------------------------------}
+
+
+And add the  line  "Expression;"  to  the main program so that it
+reads:
+                              
+
+{---------------------------------------------------------------}
+begin
+   Init;
+   Expression;
+end.
+{---------------------------------------------------------------}
+
+
+Now  run  the  program. Try any single-digit number as input. You
+should get a single line of assembler-language output.    Now try
+any  other character as input, and you'll  see  that  the  parser
+properly reports an error.
+
+
+CONGRATULATIONS! You have just written a working translator!
+
+OK, I grant you that it's pretty limited. But don't brush  it off
+too  lightly.  This little "compiler" does,  on  a  very  limited
+scale,  exactly  what  any larger compiler does:    it  correctly
+recognizes legal  statements in the input "language" that we have
+defined for it, and  it  produces  correct,  executable assembler
+code,  suitable  for  assembling  into  object  format.  Just  as
+importantly,  it correctly  recognizes  statements  that  are NOT
+legal, and gives a  meaningful  error message.  Who could ask for
+more?  As we expand our  parser,  we'd better make sure those two
+characteristics always hold true.
+
+There  are  some  other  features  of  this  tiny  program  worth
+mentioning.    First,  you  can  see that we don't separate  code
+generation from parsing ...  as  soon as the parser knows what we
+want  done, it generates the object code directly.    In  a  real
+compiler, of course, the reads in GetChar would be  from  a  disk
+file, and the writes to another  disk  file, but this way is much
+easier to deal with while we're experimenting.
+
+Also note that an expression must leave a result somewhere.  I've
+chosen the  68000  register  D0.    I  could have made some other
+choices, but this one makes sense.
+
+
+BINARY EXPRESSIONS
+
+Now that we have that under our belt,  let's  branch  out  a bit.
+Admittedly, an "expression" consisting of only  one  character is
+not going to meet our needs for long, so let's see what we can do
+to extend it. Suppose we want to handle expressions of the form:
+
+                         1+2
+     or                  4-3
+     or, in general, <term> +/- <term>
+
+(That's a bit of Backus-Naur Form, or BNF.)
+                              
+To do this we need a procedure that recognizes a term  and leaves
+its   result   somewhere,  and  another   that   recognizes   and
+distinguishes  between   a  '+'  and  a  '-'  and  generates  the
+appropriate code.  But if Expression is going to leave its result
+in D0, where should Term leave its result?    Answer:    the same
+place.  We're  going  to  have  to  save the first result of Term
+somewhere before we get the next one.
+
+OK, basically what we want to  do  is have procedure Term do what
+Expression was doing before.  So just RENAME procedure Expression
+as Term, and enter the following new version of Expression:
+
+
+
+
+{---------------------------------------------------------------}
+{ Parse and Translate an Expression }
+
+procedure Expression;
+begin
+   Term;
+   EmitLn('MOVE D0,D1');
+   case Look of
+    '+': Add;
+    '-': Subtract;
+   else Expected('Addop');
+   end;
+end;
+{--------------------------------------------------------------}
+
+
+Next, just above Expression enter these two procedures:
+
+
+{--------------------------------------------------------------}
+{ Recognize and Translate an Add }
+
+procedure Add;
+begin
+   Match('+');
+   Term;
+   EmitLn('ADD D1,D0');
+end;
+
+
+{-------------------------------------------------------------}
+{ Recognize and Translate a Subtract }
+
+procedure Subtract;
+begin
+   Match('-');
+   Term;
+   EmitLn('SUB D1,D0');
+end;
+{-------------------------------------------------------------}
+                              
+
+When you're finished with that,  the order of the routines should
+be:
+
+ o Term (The OLD Expression)
+ o Add
+ o Subtract
+ o Expression
+
+Now run the program.  Try any combination of two single digits you
+can think of, separated by a '+' or a '-'.  You should get a
+series of four assembler-language instructions out  of  each run.
+Now  try  some  expressions with deliberate errors in them.  Does
+the parser catch the errors?
+
+Take  a  look  at the object  code  generated.    There  are  two
+observations we can make.  First, the code generated is  NOT what
+we would write ourselves.  The sequence
+
+        MOVE #n,D0
+        MOVE D0,D1
+
+is inefficient.  If we were  writing  this code by hand, we would
+probably just load the data directly to D1.
+
+There is a  message  here:  code  generated by our parser is less
+efficient  than the code we would write by hand.  Get used to it.
+That's going to be true throughout this series.  It's true of all
+compilers to some extent.  Computer scientists have devoted whole
+lifetimes to the issue of code optimization, and there are indeed
+things that can be done to improve the quality  of  code  output.
+Some compilers do quite well, but  there  is a heavy price to pay
+in complexity, and it's  a  losing  battle  anyway ... there will
+probably never come a time when a good assembler-language
+programmer can't out-program a compiler.  Before this session is
+over, I'll briefly mention some ways that we can do a little
+optimization, just to show you that we can indeed improve things
+without too much trouble.  But remember, we're here to learn, not
+to see how tight we can make  the  object  code.    For  now, and
+really throughout  this  series  of  articles,  we'll  studiously
+ignore optimization and  concentrate  on  getting  out  code that
+works.
+
+Speaking of which: ours DOESN'T!  The code is _WRONG_!  As things
+are working  now, the subtraction process subtracts D1 (which has
+the FIRST argument in it) from D0 (which has the second).  That's
+the wrong way, so we end up with the wrong  sign  for the result.
+So let's fix up procedure Subtract with a  sign-changer,  so that
+it reads
+
+
+{-------------------------------------------------------------}
+{ Recognize and Translate a Subtract }
+
+procedure Subtract;
+begin
+   Match('-');
+   Term;
+   EmitLn('SUB D1,D0');
+   EmitLn('NEG D0');
+end;
+{-------------------------------------------------------------}
+
+
+Now  our  code  is even less efficient, but at least it gives the
+right answer!  Unfortunately, the  rules that give the meaning of
+math expressions require that the terms in an expression come out
+in an inconvenient  order  for  us.    Again, this is just one of
+those facts of life you learn to live with.   This  one will come
+back to haunt us when we get to division.
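+
+If the register shuffle is hard to follow, it may help to act it
+out with ordinary variables.  The sketch below is only a model:
+RunSubtract is a made-up name, and the two integers D0 and D1
+stand in for the 68000 registers.  Each statement mirrors one
+generated instruction.
+

```pascal
program SubDemo;
{ Sketch only: model the generated subtract code with two integers. }

function RunSubtract(First, Second: integer; WithNeg: boolean): integer;
var
   D0, D1: integer;
begin
   D0 := First;              { MOVE #first,D0 }
   D1 := D0;                 { MOVE D0,D1 }
   D0 := Second;             { MOVE #second,D0 }
   D0 := D0 - D1;            { SUB D1,D0  means  D0 := D0 - D1 }
   if WithNeg then
      D0 := -D0;             { NEG D0 }
   RunSubtract := D0;
end;

begin
   WriteLn(RunSubtract(4, 3, False));   { the code before the fix }
   WriteLn(RunSubtract(4, 3, True));    { the code with NEG D0 added }
end.
```

+
+For the input 4-3 the first call gives -1 and the second gives 1,
+which is why the extra NEG is needed.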
+
+OK,  at this point we have a parser that can recognize the sum or
+difference of two digits.    Earlier,  we  could only recognize a
+single digit.  But  real  expressions can have either form (or an
+infinity of others).  For kicks, go back and run the program with
+the single input line '1'.
+
+Didn't work, did it?   And  why  should  it?    We  just finished
+telling  our  parser  that the only kinds of expressions that are
+legal are those  with  two  terms.    We  must  rewrite procedure
+Expression to be a lot more broadminded, and this is where things
+start to take the shape of a real parser.
+
+
+
+
+GENERAL EXPRESSIONS
+
+In the  REAL  world,  an  expression  can  consist of one or more
+terms, separated  by  "addops"  ('+'  or  '-').   In BNF, this is
+written
+
+          <expression> ::= <term> [<addop> <term>]*
+
+
+We  can  accommodate  this definition of an  expression  with  the
+addition of a simple loop to procedure Expression:
+
+
+{---------------------------------------------------------------}
+{ Parse and Translate an Expression }
+
+procedure Expression;
+begin
+   Term;
+   while Look in ['+', '-'] do begin
+      EmitLn('MOVE D0,D1');
+      case Look of
+       '+': Add;
+       '-': Subtract;
+      else Expected('Addop');
+      end;
+   end;
+end;
+{--------------------------------------------------------------}
+
+
+NOW we're getting somewhere!   This version handles any number of
+terms, and it only cost us two extra lines of code.  As we go on,
+you'll discover that this is characteristic  of  top-down parsers
+... it only takes a few lines of code to accommodate extensions to
+the  language.    That's  what  makes  our  incremental  approach
+possible.  Notice, too, how well the code of procedure Expression
+matches the BNF definition.   That, too, is characteristic of the
+method.  As you get proficient in the approach, you'll  find that
+you can turn BNF into parser code just about as  fast  as you can
+type!
+
+OK, compile the new version of our parser, and give it a try.  As
+usual,  verify  that  the  "compiler"   can   handle   any  legal
+expression,  and  will  give a meaningful error  message  for  an
+illegal one.  Neat, eh?  You might note that in our test version,
+any error message comes  out  sort of buried in whatever code had
+already been  generated. But remember, that's just because we are
+using  the  CRT  as  our  "output  file"  for   this   series  of
+experiments.  In a production version, the two  outputs  would be
+separated ... one to the output file, and one to the screen.
+
+
+USING THE STACK
+
+At  this  point  I'm going to  violate  my  rule  that  we  don't
+introduce any complexity until  it's  absolutely  necessary, long
+enough to point out a problem with the code we're generating.  As
+things stand now, the parser  uses D0 for the "primary" register,
+and D1 as  a place to store the partial sum.  That works fine for
+now,  because  as  long as we deal with only the "addops" '+' and
+'-', any new term can be added in as soon as it is found.  But in
+general that isn't true.  Consider, for example, the expression
+
+               1+(2-(3+(4-5)))
+                              
+If we put the '1' in D1, where  do  we  put  the  '2'?    Since a
+general expression can have any degree of complexity, we're going
+to run out of registers fast!
+
+Fortunately,  there's  a  simple  solution.    Like  every modern
+microprocessor, the 68000 has a stack, which is the perfect place
+to save a variable number of items. So instead of moving the term
+in D0 to  D1, let's just push it onto the stack.  For the benefit
+of  those unfamiliar with 68000 assembler  language,  a  push  is
+written
+
+               -(SP)
+
+and a pop,     (SP)+ .
+
+
+So let's change the EmitLn in Expression to read:
+
+               EmitLn('MOVE D0,-(SP)');
+
+and the two lines in Add and Subtract to
+
+               EmitLn('ADD (SP)+,D0')
+
+and            EmitLn('SUB (SP)+,D0'),
+
+respectively.  Now try the parser again and make sure  we haven't
+broken it.
+
+Once again, the generated code is less efficient than before, but
+it's a necessary step, as you'll see.
+
+
+MULTIPLICATION AND DIVISION
+
+Now let's get down to some REALLY serious business.  As  you  all
+know,  there  are  other  math   operators   than   "addops"  ...
+expressions can also have  multiply  and  divide operations.  You
+also  know  that  there  is  an implied operator  PRECEDENCE,  or
+hierarchy, associated with expressions, so that in  an expression
+like
+
+                    2 + 3 * 4,
+
+we know that we're supposed to multiply FIRST, then  add.    (See
+why we needed the stack?)
+
+In the early days of compiler technology, people used some rather
+complex techniques to ensure that the  operator  precedence rules
+were  obeyed.    It turns out,  though,  that  none  of  this  is
+necessary ... the rules can be accommodated quite  nicely  by our
+top-down  parsing technique.  Up till now,  the  only  form  that
+we've considered for a term is that of a  single  decimal  digit.
+
+More generally, we  can  define  a  term as a PRODUCT of FACTORS;
+i.e.,
+
+          <term> ::= <factor>  [ <mulop> <factor> ]*
+
+What  is  a factor?  For now, it's what a term used to be  ...  a
+single digit.
+
+Notice the symmetry: a  term  has the same form as an expression.
+As a matter of fact, we can  add  to  our  parser  with  a little
+judicious  copying and renaming.  But  to  avoid  confusion,  the
+listing below is the complete set of parsing routines.  (Note the
+way we handle the reversal of operands in Divide.)
+
+
+{---------------------------------------------------------------}
+{ Parse and Translate a Math Factor }
+
+procedure Factor;
+begin
+   EmitLn('MOVE #' + GetNum + ',D0')
+end;
+
+
+{--------------------------------------------------------------}
+{ Recognize and Translate a Multiply }
+
+procedure Multiply;
+begin
+   Match('*');
+   Factor;
+   EmitLn('MULS (SP)+,D0');
+end;
+
+
+{-------------------------------------------------------------}
+{ Recognize and Translate a Divide }
+
+procedure Divide;
+begin
+   Match('/');
+   Factor;
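+   EmitLn('MOVE (SP)+,D1');   { D1 = dividend, D0 = divisor }
+   EmitLn('EXG D0,D1');       { swap: D0 = dividend, D1 = divisor }
+   EmitLn('EXT.L D0');        { DIVS wants a 32-bit dividend }
+   EmitLn('DIVS D1,D0');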
+end;
+
+
+{---------------------------------------------------------------}
+{ Parse and Translate a Math Term }
+
+procedure Term;
+begin
+   Factor;
+   while Look in ['*', '/'] do begin
+      EmitLn('MOVE D0,-(SP)');
+      case Look of
+       '*': Multiply;
+       '/': Divide;
+      else Expected('Mulop');
+      end;
+   end;
+end;
+
+
+
+
+{--------------------------------------------------------------}
+{ Recognize and Translate an Add }
+
+procedure Add;
+begin
+   Match('+');
+   Term;
+   EmitLn('ADD (SP)+,D0');
+end;
+
+
+{-------------------------------------------------------------}
+{ Recognize and Translate a Subtract }
+
+procedure Subtract;
+begin
+   Match('-');
+   Term;
+   EmitLn('SUB (SP)+,D0');
+   EmitLn('NEG D0');
+end;
+
+
+{---------------------------------------------------------------}
+{ Parse and Translate an Expression }
+
+procedure Expression;
+begin
+   Term;
+   while Look in ['+', '-'] do begin
+      EmitLn('MOVE D0,-(SP)');
+      case Look of
+       '+': Add;
+       '-': Subtract;
+      else Expected('Addop');
+      end;
+   end;
+end;
+{--------------------------------------------------------------}
+
+
+Hot dog!  A NEARLY functional parser/translator, in only 55 lines
+of Pascal!  The output is starting to look really useful,  if you
+continue to overlook the inefficiency,  which  I  hope  you will.
+Remember, we're not trying to produce tight code here.
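+
+If you don't have the cradle handy, the following stand-alone
+sketch lets you see the generated code for a given input.  It is
+an approximation of the listing above, not a copy: it reads from
+a string instead of the keyboard, collects the output in a string
+(so keep inputs short; a Pascal short string holds 255
+characters), folds Add, Subtract and Multiply into their callers,
+and leaves division out.  The names Compile, Src, Idx and Code are
+my own inventions.
+

```pascal
program ExprDemo;
{ Sketch only: expression translator working on a string. }

var
   Src: string;      { input text, held in memory for this sketch }
   Idx: integer;     { position of Look within Src }
   Look: char;       { lookahead character }
   Code: string;     { generated instructions, one per line }

procedure GetChar;
begin
   Inc(Idx);
   if Idx <= Length(Src) then Look := Src[Idx]
   else Look := #0;              { #0 stands for end of input }
end;

procedure EmitLn(const S: string);
begin
   Code := Code + S + #10;
end;

function GetNum: char;
begin
   if not (Look in ['0'..'9']) then begin
      WriteLn('Error: Integer Expected.');
      Halt(1);
   end;
   GetNum := Look;
   GetChar;
end;

procedure Factor;
begin
   EmitLn('MOVE #' + GetNum + ',D0');
end;

procedure Term;
begin
   Factor;
   while Look = '*' do begin
      EmitLn('MOVE D0,-(SP)');
      GetChar;                   { consume the '*' }
      Factor;
      EmitLn('MULS (SP)+,D0');
   end;
end;

procedure Expression;
begin
   Term;
   while Look in ['+', '-'] do begin
      EmitLn('MOVE D0,-(SP)');
      if Look = '+' then begin
         GetChar;                { consume the '+' }
         Term;
         EmitLn('ADD (SP)+,D0');
      end
      else begin
         GetChar;                { consume the '-' }
         Term;
         EmitLn('SUB (SP)+,D0');
         EmitLn('NEG D0');
      end;
   end;
end;

{ Translate S and return the generated instructions }
function Compile(const S: string): string;
begin
   Src := S;
   Idx := 0;
   Code := '';
   GetChar;
   Expression;
   Compile := Code;
end;

begin
   Write(Compile('2+3*4'));
end.
```

+
+For the input 2+3*4 the sketch emits seven instructions: load 2,
+push, load 3, push, load 4, MULS, ADD.  The multiply comes out
+ahead of the add, which is just what the precedence rules ask for.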
+
+
+PARENTHESES
+
+We  can  wrap  up this part of the parser with  the  addition  of
+parentheses to math expressions.  As you know, parentheses are
+a  mechanism to force a desired operator  precedence.    So,  for
+example, in the expression
+
+               2*(3+4) ,
+
+the parentheses force the addition  before  the  multiply.   Much
+more importantly, though, parentheses  give  us  a  mechanism for
+defining expressions of any degree of complexity, as in
+
+               (1+2)/((3+4)+(5-6))
+
+The  key  to  incorporating  parentheses  into our parser  is  to
+realize that  no matter how complicated an expression enclosed by
+parentheses may be,  to  the  rest  of  the world it looks like a
+simple factor.  That is, one of the forms for a factor is:
+
+          <factor> ::= (<expression>)
+
+This is where the recursion comes in. An expression can contain a
+factor which contains another expression which contains a factor,
+etc., ad infinitum.
+
+Complicated or not, we can take care of this by adding just a few
+lines of Pascal to procedure Factor:
+                             
+
+{---------------------------------------------------------------}
+{ Parse and Translate a Math Factor }
+
+procedure Expression; Forward;
+
+procedure Factor;
+begin
+   if Look = '(' then begin
+      Match('(');
+      Expression;
+      Match(')');
+      end
+   else
+      EmitLn('MOVE #' + GetNum + ',D0');
+end;
+{--------------------------------------------------------------}
+
+
+Note again how easily we can extend the parser, and how  well the
+Pascal code matches the BNF syntax.
+
+As usual, compile the new version and make sure that it correctly
+parses  legal sentences, and flags illegal  ones  with  an  error
+message.
+
+
+UNARY MINUS
+
+At  this  point,  we have a parser that can handle just about any
+expression, right?  OK, try this input sentence:
+
+                         -1
+
+WOOPS!  It doesn't work, does it?   Procedure  Expression expects
+everything to start with an integer, so it coughs up  the leading
+minus  sign.  You'll find that +3 won't  work  either,  nor  will
+something like
+
+                    -(3-2) .
+
+There  are  a  couple of ways to fix the problem.    The  easiest
+(although not necessarily the best)  way is to stick an imaginary
+leading zero in  front  of  expressions  of this type, so that -3
+becomes 0-3.  We can easily patch this into our  existing version
+of Expression:
+
+
+
+{---------------------------------------------------------------}
+{ Parse and Translate an Expression }
+
+procedure Expression;
+begin
+   if IsAddop(Look) then
+      EmitLn('CLR D0')
+   else
+      Term;
+   while IsAddop(Look) do begin
+      EmitLn('MOVE D0,-(SP)');
+      case Look of
+       '+': Add;
+       '-': Subtract;
+      else Expected('Addop');
+      end;
+   end;
+end;
+{--------------------------------------------------------------}
+ 
+
+I TOLD you that making changes  was  easy!   This time it cost us
+only  three  new lines of Pascal.   Note  the  new  reference  to
+function IsAddop.  Since the test for an addop appeared  twice, I
+chose  to  embed  it in the new function.  The  form  of  IsAddop
+should be apparent from that for IsAlpha.  Here it is:
+
+
+{--------------------------------------------------------------}
+{ Recognize an Addop }
+
+function IsAddop(c: char): boolean;
+begin
+   IsAddop := c in ['+', '-'];
+end;
+{--------------------------------------------------------------}
+
+
+OK, make these changes to the program and recompile.   You should
+also include IsAddop in your baseline copy of the cradle.   We'll
+be needing  it  again  later.   Now try the input -1 again.  Wow!
+The efficiency of the code is pretty  poor ... five lines of code
+just for loading a simple constant ... but at least it's correct.
+Remember, we're not trying to replace Turbo Pascal here.
+
+At this point we're just about finished with the structure of our
+expression parser.   This version of the program should correctly
+parse and compile just about any expression you care to  throw at
+it.    It's still limited in that  we  can  only  handle  factors
+involving single decimal digits.    But I hope that by now you're
+starting  to  get  the  message  that we can  accommodate further
+extensions  with  just  some  minor  changes to the parser.   You
+probably won't be  surprised  to  hear  that a variable or even a
+function call is just another kind of a factor.
+                             
+In  the next session, I'll show you just how easy it is to extend
+our parser to take care of  these  things too, and I'll also show
+you just how easily we can accommodate multicharacter numbers and
+variable names.  So you see,  we're  not  far at all from a truly
+useful parser.
+
+
+
+
+A WORD ABOUT OPTIMIZATION
+
+Earlier in this session, I promised to give you some hints  as to
+how we can improve the quality of the generated code.  As I said,
+the  production of tight code is not the  main  purpose  of  this
+series of articles.  But you need to at least know that we aren't
+just  wasting our time here ... that we  can  indeed  modify  the
+parser further to  make  it produce better code, without throwing
+away everything we've done to date.  As usual, it turns  out that
+SOME optimization is not that difficult to do ... it simply takes
+some extra code in the parser.
+
+There are two basic approaches we can take:
+
+  o Try to fix up the code after it's generated
+
+    This is  the concept of "peephole" optimization.  The general
+    idea is that we  know  what  combinations of instructions the
+    compiler  is  going  to generate, and we also know which ones
+    are pretty bad (such as the code for -1, above).    So all we
+    do  is  to   scan   the  produced  code,  looking  for  those
+    combinations, and replacing  them  by better ones.  It's sort
+    of   a   macro   expansion,   in   reverse,   and   a  fairly
+    straightforward  exercise  in   pattern-matching.   The  only
+    complication,  really, is that there may be  a  LOT  of  such
+    combinations to look for.  It's called  peephole optimization
+    simply because it only looks at a small group of instructions
+    at a time.  Peephole  optimization can have a dramatic effect
+    on  the  quality  of the code,  with  little  change  to  the
+    structure of the compiler  itself.   There is a price to pay,
+    though,  in  the  speed,  size,  and  complexity  of  the
+    compiler.  Looking for all those combinations calls for a lot
+    of IF tests, each one of which is a source of error.  And, of
+    course, it takes time.
+
+    In  the  classical  implementation  of  a peephole optimizer,
+    it's done as a second pass to the compiler.  The  output code
+    is  written  to  disk,  and  then  the  optimizer  reads  and
+    processes the disk file again.  As a matter of fact,  you can
+    see that the optimizer could  even be a separate PROGRAM from
+    the compiler proper.  Since the optimizer only  looks  at the
+    code through a  small  "window"  of  instructions  (hence the
+    name), a better implementation would be to simply buffer up a
+    few lines of output, and scan the buffer after each EmitLn.
+
+  o Try to generate better code in the first place
+                             
+    This approach calls for us to look for  special  cases BEFORE
+    we Emit them.  As a trivial example,  we  should  be  able to
+    identify a constant zero,  and  Emit a CLR instead of a load,
+    or even do nothing at all, as in an add of zero, for example.
+    Closer to home, if we had chosen to recognize the unary minus
+    in Factor  instead of in Expression, we could treat constants
+    like -1 as ordinary constants,  rather  than  generating them
+    from  positive  ones.   None of these things are difficult to
+    deal with ... they only add extra tests in the code, which is
+    why  I  haven't  included them in our program.  The way I see
+    it, once we get to the point that we have a working compiler,
+    generating useful code  that  executes, we can always go back
+    and tweak the thing to tighten up the code produced.   That's
+    why there are Release 2.0's in the world.
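+
+To make the first approach a little more concrete, here is a
+minimal sketch of a buffered EmitLn with a three-line window.  It
+is not part of the parser in this series.  The names Win,
+WinCount, FlushOldest, IsConstLoad, Peephole and FlushAll are
+hypothetical, and the single pattern it knows about (push D0,
+load a constant, add the popped value) is only one example.  The
+sketch assumes the TAB constant from the cradle, that WinCount is
+set to zero in Init, that FlushAll is called once at the end of
+the main program, and that short-circuit boolean evaluation (the
+Turbo Pascal default) is in effect.
+
+
+{--------------------------------------------------------------}
+{ Sketch: Peephole Window on the Output (hypothetical names) }
+
+const WinSize = 3;
+
+var Win: array[1..WinSize] of string;   { Buffered output lines }
+    WinCount: integer;                  { Lines now in buffer   }
+
+{ Write the oldest buffered line and close the gap }
+procedure FlushOldest;
+var i: integer;
+begin
+   WriteLn(TAB, Win[1]);
+   for i := 2 to WinCount do
+      Win[i - 1] := Win[i];
+   WinCount := WinCount - 1;
+end;
+
+{ True for a line of the form MOVE #n,D0 }
+function IsConstLoad(s: string): boolean;
+begin
+   IsConstLoad := (Copy(s, 1, 6) = 'MOVE #') and
+                  (Copy(s, Length(s) - 2, 3) = ',D0');
+end;
+
+{ Replace push / load constant / add-from-stack by one ADD }
+procedure Peephole;
+var n: string;
+begin
+   if (WinCount = WinSize) and (Win[1] = 'MOVE D0,-(SP)')
+      and IsConstLoad(Win[2])
+      and (Win[3] = 'ADD (SP)+,D0') then begin
+      n := Copy(Win[2], 7, Length(Win[2]) - 9);
+      Win[1] := 'ADD #' + n + ',D0';
+      WinCount := 1;
+   end;
+end;
+
+{ Replacement for the cradle's EmitLn }
+procedure EmitLn(s: string);
+begin
+   if WinCount = WinSize then FlushOldest;
+   WinCount := WinCount + 1;
+   Win[WinCount] := s;
+   Peephole;
+end;
+
+{ Write out whatever is still buffered }
+procedure FlushAll;
+begin
+   while WinCount > 0 do FlushOldest;
+end;
+{--------------------------------------------------------------}
+
+
+Assuming Add still emits ADD (SP)+,D0 as in the version we built
+earlier, the input 1+2 should then come out as MOVE #1,D0
+followed by ADD #2,D0, instead of four lines.  Each further
+pattern costs another test in Peephole, which is exactly the
+price mentioned above.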
+
+There IS one more type  of  optimization  worth  mentioning, that
+seems to promise pretty tight code without too much hassle.  It's
+my "invention" in the  sense  that I haven't seen it suggested in
+print anywhere, though I have  no  illusions  that  it's original
+with me.
+
+This  is to avoid such a heavy use of the stack, by making better
+use of the CPU registers.  Remember back when we were  doing only
+addition  and  subtraction,  that we used registers  D0  and  D1,
+rather than the stack?  It worked, because with  only  those  two
+operations, the "stack" never needs more than two entries.
+
+Well,  the 68000 has eight data registers.  Why not use them as a
+privately managed stack?  The key is to recognize  that,  at  any
+point in its processing,  the  parser KNOWS how many items are on
+the  stack, so it can indeed manage it properly.  We can define a
+private "stack pointer" that keeps  track  of  which  stack level
+we're at, and addresses the  corresponding  register.   Procedure
+Factor,  for  example,  would  not  cause data to be loaded  into
+register  D0,  but   into  whatever  the  current  "top-of-stack"
+register happened to be.
+
+What we're doing in effect is to replace the CPU's RAM stack with
+a  locally  managed  stack  made  up  of  registers.    For  most
+expressions, the stack level  will  never  exceed eight, so we'll
+get pretty good code out.  Of course, we also  have  to deal with
+those  odd cases where the stack level  DOES  exceed  eight,  but
+that's no problem  either.    We  simply let the stack spill over
+into the CPU  stack.    For  levels  beyond eight, the code is no
+worse  than  what  we're generating now, and for levels less than
+eight, it's considerably better.
+
+For the record, I  have  implemented  this  concept, just to make
+sure  it  works  before  I  mentioned  it to you.  It does.    In
+practice, it turns out that you can't really use all eight levels
+... you need at least one register free to  reverse  the  operand
+order for division  (sure  wish  the  68000 had an XTHL, like the
+8080!).  For expressions  that  include  function calls, we would
+also need a register reserved for them. Still, there  is  a  nice
+improvement in code size for most expressions.
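+
+As a rough illustration only, the bookkeeping might look like
+the sketch below.  It is not the implementation mentioned above.
+The names NumRegs, Depth, Reg, PushConst and PopOp are
+hypothetical, and the sketch simply aborts where a real version
+would spill over onto the CPU stack.  It assumes EmitLn and Abort
+from the cradle, and that Depth is set to zero in Init.
+
+
+{--------------------------------------------------------------}
+{ Sketch: Registers as a Private Stack (hypothetical names) }
+
+const NumRegs = 7;       { D0..D6 used as stack; D7 kept free }
+
+var Depth: integer;      { Values now on the register stack }
+
+{ Name of the register holding a given stack level }
+function Reg(Level: integer): string;
+begin
+   Reg := 'D' + Chr(Ord('0') + Level);
+end;
+
+{ Load a constant into the next free register }
+procedure PushConst(n: string);
+begin
+   if Depth >= NumRegs then Abort('Expression Too Deep');
+   EmitLn('MOVE #' + n + ',' + Reg(Depth));
+   Depth := Depth + 1;
+end;
+
+{ Combine the top two levels; result lands one level down }
+procedure PopOp(Op: string);
+begin
+   EmitLn(Op + ' ' + Reg(Depth - 1) + ',' + Reg(Depth - 2));
+   Depth := Depth - 1;
+end;
+{--------------------------------------------------------------}
+
+
+With these, Factor would call PushConst(GetNum), Add would end
+with PopOp('ADD'), and Expression would no longer emit the push
+to -(SP).  For the input 1+2 the sketch would produce MOVE #1,D0
+then MOVE #2,D1 then ADD D1,D0, leaving the result in D0.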
+
+So, you see, getting  better  code  isn't  that difficult, but it
+does add complexity to  our  translator  ...  complexity  we can
+do without at this point.  For that reason,  I  STRONGLY  suggest
+that we continue to ignore efficiency issues for the rest of this
+series,  secure  in  the knowledge that we can indeed improve the
+code quality without throwing away what we've done.
+
+Next lesson, I'll show you how to deal with variable  factors and
+function calls.  I'll also show you just how easy it is to handle
+multicharacter tokens and embedded white space.
+
+*****************************************************************
+*                                                               *
+*                        COPYRIGHT NOTICE                       *
+*                                                               *
+*   Copyright (C) 1988 Jack W. Crenshaw. All rights reserved.   *
+*                                                               *
+*****************************************************************
+ 
+
+
+
@@ -0,0 +1,946 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+                     LET'S BUILD A COMPILER!
+
+                                By
+
+                     Jack W. Crenshaw, Ph.D.
+
+                            4 Aug 1988
+
+
+                    Part III: MORE EXPRESSIONS
+
+
+*****************************************************************
+*                                                               *
+*                        COPYRIGHT NOTICE                       *
+*                                                               *
+*   Copyright (C) 1988 Jack W. Crenshaw. All rights reserved.   *
+*                                                               *
+*****************************************************************
+
+
+INTRODUCTION
+
+In the last installment, we examined the techniques used to parse
+and  translate a general math expression.  We  ended  up  with  a
+simple parser that  could handle arbitrarily complex expressions,
+with two restrictions:
+
+  o No variables were allowed, only numeric factors
+
+  o The numeric factors were limited to single digits
+
+In this installment, we'll get  rid of those restrictions.  We'll
+also extend what  we've  done  to  include  assignment statements
+and  function  calls.    Remember,   though,   that   the  second
+restriction was  mainly self-imposed  ... a choice of convenience
+on our part, to make life easier and to let us concentrate on the
+fundamental concepts.    As  you'll  see  in  a bit, it's an easy
+restriction to get rid of, so don't get  too  hung  up  about it.
+We'll use the trick when it serves us to do so, confident that we
+can discard it when we're ready to.
+
+
+VARIABLES
+
+Most expressions  that we see in practice involve variables, such
+as
+
+               b * b + 4 * a * c
+
+No  parser is much good without being able  to  deal  with  them.
+Fortunately, it's also quite easy to do.
+
+Remember that in our parser as it currently stands, there are two
+kinds of  factors  allowed:  integer  constants  and  expressions
+within parentheses.  In BNF notation,
+
+     <factor> ::= <number> | (<expression>)
+
+The '|' stands for "or", meaning of course that either form  is a
+legal form for a factor.   Remember,  too, that we had no trouble
+knowing which was which  ...  the  lookahead  character is a left
+paren '(' in one case, and a digit in the other.
+                              
+It probably won't come as too much of a surprise that  a variable
+is just another kind of factor.    So  we extend the BNF above to
+read:
+
+
+     <factor> ::= <number> | (<expression>) | <variable>
+
+
+Again, there is no  ambiguity:  if  the  lookahead character is a
+letter,  we  have  a variable; if a digit, we have a number. Back
+when we translated the number, we just issued code  to  load  the
+number,  as immediate data, into D0.  Now we do the same, only we
+load a variable.
+
+A minor complication in the  code generation arises from the fact
+that most  68000 operating systems, including the SK*DOS that I'm
+using, require the code to be  written  in "position-independent"
+form, which  basically means that everything is PC-relative.  The
+format for a load in this language is
+
+               MOVE X(PC),D0
+
+where X is, of course, the variable name.  Armed with that, let's
+modify the current version of Factor to read:
+
+
+{---------------------------------------------------------------}
+{ Parse and Translate a Math Factor }
+
+procedure Expression; Forward;
+
+procedure Factor;
+begin
+   if Look = '(' then begin
+      Match('(');
+      Expression;
+      Match(')');
+      end
+   else if IsAlpha(Look) then
+      EmitLn('MOVE ' + GetName + '(PC),D0')
+   else
+      EmitLn('MOVE #' + GetNum + ',D0');
+end;
+{--------------------------------------------------------------}
+
+
+I've  remarked before how easy it is to  add  extensions  to  the
+parser, because of  the  way  it's  structured.  You can see that
+this  still  holds true here.  This time it cost us  all  of  two
+extra lines of code.  Notice, too, how the if-else-else structure
+exactly parallels the BNF syntax equation.
+
+OK, compile and test this new version of the parser.  That didn't
+hurt too badly, did it?
+                              
+
+FUNCTIONS
+
+There is only one  other  common kind of factor supported by most
+languages: the function call.  It's really too early  for  us  to
+deal with functions well,  because  we  haven't yet addressed the
+issue of parameter passing.  What's more, a "real" language would
+include a mechanism to  support  more than one type, one of which
+should be a function type.  We haven't gotten there  yet, either.
+But I'd still like to deal with functions  now  for  a  couple of
+reasons.    First,  it  lets  us  finally  wrap  up the parser in
+something very close to its final form, and second, it  brings up
+a new issue which is very much worth talking about.
+
+Up  till  now,  we've  been  able  to  write  what  is  called  a
+"predictive parser."  That  means  that at any point, we can know
+by looking at the current  lookahead character exactly what to do
+next.  That isn't the case when we add functions.  Every language
+has some naming rules  for  what  constitutes a legal identifier.
+For the present, ours is simply that it  is  one  of  the letters
+'a'..'z'.  The  problem  is  that  a variable name and a function
+name obey  the  same  rules.   So how can we tell which is which?
+One way is to require that they each be declared before  they are
+used.    Pascal  takes that approach.  The other is that we might
+require a function to be followed by a (possibly empty) parameter
+list.  That's the rule used in C.
+
+Since  we  don't  yet have a mechanism for declaring types, let's
+use the C  rule for now.  Since we also don't have a mechanism to
+deal  with parameters, we can only handle  empty  lists,  so  our
+function calls will have the form
+
+                    x()  .
+
+Since  we're  not  dealing  with  parameter lists yet,  there  is
+nothing  to do but to call the function, so we need only to issue
+a BSR (call) instead of a MOVE.
+
+Now that there are two  possibilities for the "If IsAlpha" branch
+of the test in Factor, let's treat them in a  separate procedure.
+Modify Factor to read:
+
+
+{---------------------------------------------------------------}
+{ Parse and Translate a Math Factor }
+
+procedure Expression; Forward;
+
+procedure Factor;
+begin
+   if Look = '(' then begin
+      Match('(');
+      Expression;
+      Match(')');
+      end
+   else if IsAlpha(Look) then
+      Ident
+   else
+      EmitLn('MOVE #' + GetNum + ',D0');
+end;
+{--------------------------------------------------------------}
+
+
+and insert before it the new procedure
+
+
+{---------------------------------------------------------------}
+{ Parse and Translate an Identifier }
+
+procedure Ident;
+var Name: char;
+begin
+   Name := GetName;
+   if Look = '(' then begin
+      Match('(');
+      Match(')');
+      EmitLn('BSR ' + Name);
+      end
+   else
+      EmitLn('MOVE ' + Name + '(PC),D0')
+end;
+{---------------------------------------------------------------}
+
+
+OK, compile and  test  this  version.  Does  it  parse  all legal
+expressions?  Does it correctly flag badly formed ones?
+
+The important thing to notice is that even though  we  no  longer
+have  a predictive parser, there is  little  or  no  complication
+added with the recursive descent approach that we're  using.   At
+the point where  Factor  finds an identifier (letter), it doesn't
+know whether it's a variable name or a function name, nor does it
+really care.  It simply passes it on to Ident and leaves it up to
+that procedure to figure it out.  Ident, in  turn,  simply  tucks
+away the identifier and then reads one more  character  to decide
+which kind of identifier it's dealing with.
+
+Keep this approach in mind.  It's a very powerful concept, and it
+should be used  whenever  you  encounter  an  ambiguous situation
+requiring further lookahead.   Even  if  you  had to look several
+tokens ahead, the principle would still work.
+
+
+MORE ON ERROR HANDLING
+
+As long as we're talking  philosophy,  there's  another important
+issue to point out:  error  handling.    Notice that although the
+parser correctly rejects (almost)  every malformed  expression we
+can  throw at it, with a meaningful  error  message,  we  haven't
+really had to  do much work to make that happen.  In fact, in the
+whole parser per se (from  Ident  through  Expression)  there are
+only two calls to the error routine, Expected.  Even those aren't
+necessary ... if you'll look again in Term and Expression, you'll
+see that those statements can't be reached.  I put them  in early
+on as a  bit  of  insurance,  but  they're no longer needed.  Why
+don't you delete them now?
+
+So how did we get this nice error handling  virtually  for  free?
+It's simply  that  I've  carefully  avoided  reading  a character
+directly  using  GetChar.  Instead,  I've  relied  on  the  error
+handling in GetName,  GetNum,  and  Match  to  do  all  the error
+checking for me.    Astute  readers  will notice that some of the
+calls to Match (for example, the ones in Add  and  Subtract)  are
+also unnecessary ... we already know what the character is by the
+time  we get there ... but it maintains  a  certain  symmetry  to
+leave them in, and  the  general rule to always use Match instead
+of GetChar is a good one.
+
+I mentioned an "almost" above.   There  is a case where our error
+handling  leaves a bit to be desired.  So far we haven't told our
+parser what an  end-of-line  looks  like,  or  what  to  do  with
+embedded  white  space.  So  a  space  character  (or  any  other
+character not part of the recognized character set) simply causes
+the parser to terminate, ignoring the unrecognized characters.
+
+It  could  be  argued  that  this is reasonable behavior at  this
+point.  In a "real"  compiler, there is usually another statement
+following the one we're working on, so any characters not treated
+as part of our expression will either be used for or  rejected as
+part of the next one.
+
+But  it's  also a very easy thing to fix up, even  if  it's  only
+temporary.   All  we  have  to  do  is assert that the expression
+should end with an end-of-line, i.e., a carriage return.
+
+To see what I'm talking about, try the input line
+
+               1+2 <space> 3+4
+
+See  how the space was treated as a terminator?  Now, to make the
+compiler properly flag this, add the line
+
+               if Look <> CR then Expected('Newline');
+
+in the main  program,  just  after  the call to Expression.  That
+catches anything left over in the input stream.  Don't  forget to
+define CR in the const statement:
+
+               CR = ^M;
+
+As usual, recompile the program and verify that it does what it's
+supposed to.
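+
+For reference, with that line in place the main program would
+look roughly like the sketch below.  It assumes the main program
+from the cradle (a call to Init followed by the parsing routine)
+and a system that delivers ^M at the end of a line, as DOS does.
+On a system that ends lines with a line feed, you would have to
+test for ^J instead.
+
+
+{--------------------------------------------------------------}
+{ Main Program }
+
+begin
+   Init;
+   Expression;
+   if Look <> CR then Expected('Newline');
+end.
+{--------------------------------------------------------------}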
+
+
+ASSIGNMENT STATEMENTS
+
+OK,  at  this  point we have a parser that works very nicely. I'd
+like to  point  out  that  we  got  it  using  only  88  lines of
+executable code, not  counting  what  was  in  the  cradle.   The
+compiled  object  file  is  a  whopping  4752  bytes.   Not  bad,
+considering we weren't trying very  hard  to  save  either source
+code or object size.  We just stuck to the KISS principle.
+
+Of course, parsing an expression  is not much good without having
+something to do with it afterwards.  Expressions USUALLY (but not
+always) appear in assignment statements, in the form
+
+          <Ident> = <Expression>
+
+We're only a breath  away  from being able to parse an assignment
+statement, so let's take that  last  step.  Just  after procedure
+Expression, add the following new procedure:
+
+
+{--------------------------------------------------------------}
+{ Parse and Translate an Assignment Statement }
+
+procedure Assignment;
+var Name: char;
+begin
+   Name := GetName;
+   Match('=');
+   Expression;
+   EmitLn('LEA ' + Name + '(PC),A0');
+   EmitLn('MOVE D0,(A0)')
+end;
+{--------------------------------------------------------------}
+
+
+Note again that the  code  exactly parallels the BNF.  And notice
+further that  the error checking was painless, handled by GetName
+and Match.
+
+The reason for the two lines of assembler is a peculiarity of the
+68000:  PC-relative  addressing  is  allowed  only for  a  source
+operand, never for a destination.   So we first load the address
+of the variable into A0, and then store D0 through it.
+
+Now change the call to Expression, in the main program, to one to
+Assignment.  That's all there is to it.
+
+Son of a gun!  We are actually  compiling  assignment statements.
+If those were the only kind of statements in a language, all we'd
+have to  do  is  put  this in a loop and we'd have a full-fledged
+compiler!
+
+Well, of course they're not the only kind.  There are also little
+items  like  control  statements  (IFs  and  loops),  procedures,
+declarations, etc.  But cheer  up.    The  arithmetic expressions
+that we've been dealing with are among the most challenging  in a
+language.      Compared  to  what  we've  already  done,  control
+statements  will be easy.  I'll be covering  them  in  the  fifth
+installment.  And the other statements will all fall in  line, as
+long as we remember to KISS.
+
+
+MULTI-CHARACTER TOKENS
+
+Throughout  this   series,   I've   been   carefully  restricting
+everything  we  do  to  single-character  tokens,  all  the while
+assuring  you  that  it wouldn't be difficult to extend to multi-
+character ones.    I  don't  know if you believed me or not ... I
+wouldn't  really blame you if you were a  bit  skeptical.    I'll
+continue  to use  that approach in  the  sessions  which  follow,
+because it helps keep complexity away.    But I'd like to back up
+those  assurances, and wrap up this portion  of  the  parser,  by
+showing you  just  how  easy  that  extension  really is.  In the
+process, we'll also provide for embedded white space.  Before you
+make  the  next  few changes, though, save the current version of
+the parser away under another name.  I have some more uses for it
+in  the  next  installment, and we'll be working with the single-
+character version.
+
+Most compilers separate out the handling of the input stream into
+a separate module called  the  lexical scanner.  The idea is that
+the scanner deals with all the character-by-character  input, and
+returns the separate units  (tokens)  of  the  stream.  There may
+come a time when we'll want  to  do something like that, too, but
+for  now  there  is  no  need. We can handle the  multi-character
+tokens that we need by very slight and  very  local modifications
+to GetName and GetNum.
+
+The usual definition of an identifier is that the first character
+must be a letter, but the rest can be  alphanumeric  (letters  or
+digits).   To  deal  with  this,  we  need  one  other recognizer
+function:
+
+
+{--------------------------------------------------------------}
+{ Recognize an Alphanumeric }
+
+function IsAlNum(c: char): boolean;
+begin
+   IsAlNum := IsAlpha(c) or IsDigit(c);
+end;
+{--------------------------------------------------------------}
+
+
+Add this function to your parser.  I put mine just after IsDigit.
+While you're  at  it,  might  as  well  include it as a permanent
+member of Cradle, too.
+                              
+Now, we need  to  modify  function  GetName  to  return  a string
+instead of a character:
+
+
+{--------------------------------------------------------------}
+{ Get an Identifier }
+
+function GetName: string;
+var Token: string;
+begin
+   Token := '';
+   if not IsAlpha(Look) then Expected('Name');
+   while IsAlNum(Look) do begin
+      Token := Token + UpCase(Look);
+      GetChar;
+   end;
+   GetName := Token;
+end;
+{--------------------------------------------------------------}
+
+
+Similarly, modify GetNum to read:
+
+
+{--------------------------------------------------------------}
+{ Get a Number }
+
+function GetNum: string;
+var Value: string;
+begin
+   Value := '';
+   if not IsDigit(Look) then Expected('Integer');
+   while IsDigit(Look) do begin
+      Value := Value + Look;
+      GetChar;
+   end;
+   GetNum := Value;
+end;
+{--------------------------------------------------------------}
+
+
+Amazingly enough, those are virtually all the changes required to
+the  parser!  The local variable Name  in  procedures  Ident  and
+Assignment was originally declared as  "char",  and  must  now be
+declared string[8].  (Clearly,  we  could  make the string length
+longer if we chose, but most assemblers limit the length anyhow.)
+Make  this  change,  and  then  recompile and test. _NOW_ do  you
+believe that it's a simple change?
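+
+As a sketch of what that change amounts to, here are Ident and
+Assignment with the new declaration.  Only the var lines differ
+from the versions given earlier.  One thing to keep in mind: in
+Turbo Pascal, assigning a longer string to a string[8] truncates
+it without complaint, so two names that agree in their first
+eight characters would be treated as the same name.
+
+
+{--------------------------------------------------------------}
+{ Parse and Translate an Identifier }
+
+procedure Ident;
+var Name: string[8];
+begin
+   Name := GetName;
+   if Look = '(' then begin
+      Match('(');
+      Match(')');
+      EmitLn('BSR ' + Name);
+      end
+   else
+      EmitLn('MOVE ' + Name + '(PC),D0');
+end;
+
+
+{--------------------------------------------------------------}
+{ Parse and Translate an Assignment Statement }
+
+procedure Assignment;
+var Name: string[8];
+begin
+   Name := GetName;
+   Match('=');
+   Expression;
+   EmitLn('LEA ' + Name + '(PC),A0');
+   EmitLn('MOVE D0,(A0)')
+end;
+{--------------------------------------------------------------}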
+
+
+WHITE SPACE
+
+Before we leave this parser for a while, let's address  the issue
+of  white  space.   As it stands now, the parser  will  barf  (or
+simply terminate) on a single space  character  embedded anywhere
+in  the input stream.  That's pretty  unfriendly  behavior.    So
+let's "productionize" the thing  a  bit  by eliminating this last
+restriction.
+
+The  key  to easy handling of white space is to come  up  with  a
+simple rule for how the parser should treat the input stream, and
+to  enforce that rule everywhere.  Up  till  now,  because  white
+space wasn't permitted, we've been able to assume that after each
+parsing action, the lookahead character  Look  contains  the next
+meaningful  character,  so  we could test it  immediately.    Our
+design was based upon this principle.
+
+It still sounds like a good rule to me, so  that's  the one we'll
+use.    This  means  that  every routine that advances the  input
+stream must skip over white space, and leave  the  next non-white
+character in Look.   Fortunately,  because  we've been careful to
+use GetName, GetNum, and Match  for most of our input processing,
+it is  only  those  three  routines  (plus  Init) that we need to
+modify.
+
+Not  surprisingly,  we  start  with  yet  another  new recognizer
+routine:
+
+
+{--------------------------------------------------------------}
+{ Recognize White Space }
+
+function IsWhite(c: char): boolean;
+begin
+   IsWhite := c in [' ', TAB];
+end;
+{--------------------------------------------------------------}
+
+
+We  also need a routine that  will  eat  white-space  characters,
+until it finds a non-white one:
+
+
+{--------------------------------------------------------------}
+{ Skip Over Leading White Space }
+
+procedure SkipWhite;
+begin
+   while IsWhite(Look) do
+      GetChar;
+end;
+{--------------------------------------------------------------}
+
+
+Now,  add calls to SkipWhite to Match,  GetName,  and  GetNum  as
+shown below:
+
+
+{--------------------------------------------------------------}
+{ Match a Specific Input Character }
+
+procedure Match(x: char);
+begin
+   if Look <> x then Expected('''' + x + '''')
+   else begin
+      GetChar;
+      SkipWhite;
+   end;
+end;
+
+
+{--------------------------------------------------------------}
+{ Get an Identifier }
+
+function GetName: string;
+var Token: string;
+begin
+   Token := '';
+   if not IsAlpha(Look) then Expected('Name');
+   while IsAlNum(Look) do begin
+      Token := Token + UpCase(Look);
+      GetChar;
+   end;
+   GetName := Token;
+   SkipWhite;
+end;
+
+
+{--------------------------------------------------------------}
+{ Get a Number }
+
+function GetNum: string;
+var Value: string;
+begin
+   Value := '';
+   if not IsDigit(Look) then Expected('Integer');
+   while IsDigit(Look) do begin
+      Value := Value + Look;
+      GetChar;
+   end;
+   GetNum := Value;
+   SkipWhite;
+end;
+{--------------------------------------------------------------}
+
+(Note  that  I  rearranged  Match  a  bit,  without changing  the
+functionality.)
+
+Finally, we need to skip over leading blanks where we  "prime the
+pump" in Init:
+                             
+{--------------------------------------------------------------}
+{ Initialize }
+
+procedure Init;
+begin
+   GetChar;
+   SkipWhite;
+end;
+{--------------------------------------------------------------}
+
+
+Make these changes and recompile the program.  You will find that
+you will have to move Match below SkipWhite, to  avoid  an  error
+message from the Pascal compiler.  Test the program as  always to
+make sure it works properly.
+
+Since we've made quite  a  few  changes  during this session, I'm
+reproducing the entire parser below:
+
+
+{--------------------------------------------------------------}
+program parse;
+
+{--------------------------------------------------------------}
+{ Constant Declarations }
+
+const TAB = ^I;
+       CR = ^M;
+
+{--------------------------------------------------------------}
+{ Variable Declarations }
+
+var Look: char;              { Lookahead Character }
+
+{--------------------------------------------------------------}
+{ Read New Character From Input Stream }
+
+procedure GetChar;
+begin
+   Read(Look);
+end;
+
+{--------------------------------------------------------------}
+{ Report an Error }
+
+procedure Error(s: string);
+begin
+   WriteLn;
+   WriteLn(^G, 'Error: ', s, '.');
+end;
+
+
+{--------------------------------------------------------------}
+{ Report Error and Halt }
+                             
+procedure Abort(s: string);
+begin
+   Error(s);
+   Halt;
+end;
+
+
+{--------------------------------------------------------------}
+{ Report What Was Expected }
+
+procedure Expected(s: string);
+begin
+   Abort(s + ' Expected');
+end;
+
+
+{--------------------------------------------------------------}
+{ Recognize an Alpha Character }
+
+function IsAlpha(c: char): boolean;
+begin
+   IsAlpha := UpCase(c) in ['A'..'Z'];
+end;
+
+
+{--------------------------------------------------------------}
+{ Recognize a Decimal Digit }
+
+function IsDigit(c: char): boolean;
+begin
+   IsDigit := c in ['0'..'9'];
+end;
+
+
+{--------------------------------------------------------------}
+{ Recognize an Alphanumeric }
+
+function IsAlNum(c: char): boolean;
+begin
+   IsAlNum := IsAlpha(c) or IsDigit(c);
+end;
+
+
+{--------------------------------------------------------------}
+{ Recognize an Addop }
+
+function IsAddop(c: char): boolean;
+begin
+   IsAddop := c in ['+', '-'];
+end;
+
+
+{--------------------------------------------------------------}
+{ Recognize White Space }
+                             
+function IsWhite(c: char): boolean;
+begin
+   IsWhite := c in [' ', TAB];
+end;
+
+
+{--------------------------------------------------------------}
+{ Skip Over Leading White Space }
+
+procedure SkipWhite;
+begin
+   while IsWhite(Look) do
+      GetChar;
+end;
+
+
+{--------------------------------------------------------------}
+{ Match a Specific Input Character }
+
+procedure Match(x: char);
+begin
+   if Look <> x then Expected('''' + x + '''')
+   else begin
+      GetChar;
+      SkipWhite;
+   end;
+end;
+
+
+{--------------------------------------------------------------}
+{ Get an Identifier }
+
+function GetName: string;
+var Token: string;
+begin
+   Token := '';
+   if not IsAlpha(Look) then Expected('Name');
+   while IsAlNum(Look) do begin
+      Token := Token + UpCase(Look);
+      GetChar;
+   end;
+   GetName := Token;
+   SkipWhite;
+end;
+
+
+{--------------------------------------------------------------}
+{ Get a Number }
+
+function GetNum: string;
+var Value: string;
+begin
+   Value := '';
+   if not IsDigit(Look) then Expected('Integer');
+   while IsDigit(Look) do begin
+      Value := Value + Look;
+      GetChar;
+   end;
+   GetNum := Value;
+   SkipWhite;
+end;
+
+
+{--------------------------------------------------------------}
+{ Output a String with Tab }
+
+procedure Emit(s: string);
+begin
+   Write(TAB, s);
+end;
+
+
+{--------------------------------------------------------------}
+{ Output a String with Tab and CRLF }
+
+procedure EmitLn(s: string);
+begin
+   Emit(s);
+   WriteLn;
+end;
+
+
+{---------------------------------------------------------------}
+{ Parse and Translate an Identifier }
+
+procedure Ident;
+var Name: string[8];
+begin
+   Name:= GetName;
+   if Look = '(' then begin
+      Match('(');
+      Match(')');
+      EmitLn('BSR ' + Name);
+      end
+   else
+      EmitLn('MOVE ' + Name + '(PC),D0');
+end;
+
+
+{---------------------------------------------------------------}
+{ Parse and Translate a Math Factor }
+
+procedure Expression; Forward;
+
+procedure Factor;
+begin
+   if Look = '(' then begin
+      Match('(');
+      Expression;
+      Match(')');
+      end
+   else if IsAlpha(Look) then
+      Ident
+   else
+      EmitLn('MOVE #' + GetNum + ',D0');
+end;
+
+
+{--------------------------------------------------------------}
+{ Recognize and Translate a Multiply }
+
+procedure Multiply;
+begin
+   Match('*');
+   Factor;
+   EmitLn('MULS (SP)+,D0');
+end;
+
+
+{-------------------------------------------------------------}
+{ Recognize and Translate a Divide }
+
+procedure Divide;
+begin
+   Match('/');
+   Factor;
+   EmitLn('MOVE (SP)+,D1');
+   EmitLn('EXT.L D0');
+   EmitLn('DIVS D1,D0');
+end;
+
+
+{---------------------------------------------------------------}
+{ Parse and Translate a Math Term }
+
+procedure Term;
+begin
+   Factor;
+   while Look in ['*', '/'] do begin
+      EmitLn('MOVE D0,-(SP)');
+      case Look of
+       '*': Multiply;
+       '/': Divide;
+      end;
+   end;
+end;
+
+
+{--------------------------------------------------------------}
+{ Recognize and Translate an Add }
+
+procedure Add;
+begin
+   Match('+');
+   Term;
+   EmitLn('ADD (SP)+,D0');
+end;
+
+
+{-------------------------------------------------------------}
+{ Recognize and Translate a Subtract }
+
+procedure Subtract;
+begin
+   Match('-');
+   Term;
+   EmitLn('SUB (SP)+,D0');
+   EmitLn('NEG D0');
+end;
+
+
+{---------------------------------------------------------------}
+{ Parse and Translate an Expression }
+
+procedure Expression;
+begin
+   if IsAddop(Look) then
+      EmitLn('CLR D0')
+   else
+      Term;
+   while IsAddop(Look) do begin
+      EmitLn('MOVE D0,-(SP)');
+      case Look of
+       '+': Add;
+       '-': Subtract;
+      end;
+   end;
+end;
+
+
+{--------------------------------------------------------------}
+{ Parse and Translate an Assignment Statement }
+
+procedure Assignment;
+var Name: string[8];
+begin
+   Name := GetName;
+   Match('=');
+   Expression;
+   EmitLn('LEA ' + Name + '(PC),A0');
+   EmitLn('MOVE D0,(A0)')
+end;
+
+
+{--------------------------------------------------------------}
+{ Initialize }
+                             
+procedure Init;
+begin
+   GetChar;
+   SkipWhite;
+end;
+
+
+{--------------------------------------------------------------}
+{ Main Program }
+
+begin
+   Init;
+   Assignment;
+   If Look <> CR then Expected('NewLine');
+end.
+{--------------------------------------------------------------}
+
+
+Now the parser is complete.  It's got every feature we can put in
+a  one-line "compiler."  Tuck it away in a safe place.  Next time
+we'll move on to a new subject, but we'll still be  talking about
+expressions for quite a while.  Next installment, I plan to talk a
+bit about interpreters as opposed  to compilers, and show you how
+the structure of the parser changes a bit as we change  what sort
+of action has to be taken.  The information we pick up there will
+serve  us in good stead later on, even if you have no interest in
+interpreters.  See you next time.
+
+
+*****************************************************************
+*                                                               *
+*                        COPYRIGHT NOTICE                       *
+*                                                               *
+*   Copyright (C) 1988 Jack W. Crenshaw. All rights reserved.   *
+*                                                               *
+*****************************************************************
+
+
+
+
@@ -0,0 +1,701 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+                     LET'S BUILD A COMPILER!
+
+                                By
+
+                     Jack W. Crenshaw, Ph.D.
+
+                           24 July 1988
+
+
+                      Part IV: INTERPRETERS
+
+
+*****************************************************************
+*                                                               *
+*                        COPYRIGHT NOTICE                       *
+*                                                               *
+*   Copyright (C) 1988 Jack W. Crenshaw. All rights reserved.   *
+*                                                               *
+*****************************************************************
+
+
+INTRODUCTION
+
+In the first three installments of this series,  we've  looked at
+parsing and  compiling math expressions, and worked our way grad-
+ually and methodically from dealing  with  very  simple one-term,
+one-character "expressions" up through more general ones, finally
+arriving at a very complete parser that could parse and translate
+complete  assignment  statements,  with  multi-character  tokens,
+embedded white space, and function calls.  This  time,  I'm going
+to walk you through the process one more time, only with the goal
+of interpreting rather than compiling object code.
+
+Since this is a series on compilers, why should  we  bother  with
+interpreters?  Simply because I want you to see how the nature of
+the  parser changes as we change the goals.  I also want to unify
+the concepts of the two types of translators, so that you can see
+not only the differences, but also the similarities.
+
+Consider the assignment statement
+
+               x = 2 * y + 3
+
+In a compiler, we want the target CPU to execute  this assignment
+at EXECUTION time.  The translator itself doesn't  do  any arith-
+metic ... it only issues the object code that will cause  the CPU
+to do it when the code is executed.  For  the  example above, the
+compiler would issue code to compute the expression and store the
+results in variable x.
+
+For an interpreter,  on  the  other  hand, no object code is gen-
+erated.   Instead, the arithmetic is computed immediately, as the
+parsing is going on.  For the example, by the time parsing of the
+statement is complete, x will have a new value.
+
+The approach we've been  taking  in  this  whole series is called
+"syntax-driven translation."  As you are aware by now, the struc-
+ture of the  parser  is  very  closely  tied to the syntax of the
+productions we parse.  We  have built Pascal procedures that rec-
+ognize every language  construct.   Associated with each of these
+constructs (and procedures) is  a  corresponding  "action," which
+does  whatever  makes  sense to do  once  a  construct  has  been
+recognized.    In  our  compiler  so far, every  action  involves
+emitting object code, to be executed later at execution time.  In
+an interpreter, every action  involves  something  to be done im-
+mediately.
+
+What I'd like you to see here is that the  layout  ... the struc-
+ture ... of  the  parser  doesn't  change.  It's only the actions
+that change.   So  if  you  can  write an interpreter for a given
+language, you can also write a compiler, and vice versa.  Yet, as
+you  will  see,  there  ARE  differences,  and  significant ones.
+Because the actions are different,  the  procedures  that  do the
+recognizing end up being written differently.    Specifically, in
+the interpreter  the recognizing procedures end up being coded as
+FUNCTIONS that return numeric values to their callers.    None of
+the parsing routines for our compiler did that.
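+
+To make the contrast concrete, here is a minimal sketch of the
+same construct, the sum of two single digits, handled both ways.
+CompileAdd and InterpretAdd are names invented for this
+illustration; they are not part of the cradle.  The first is a
+procedure that only emits 68000 code, as our compiler does; the
+second is a function that does the arithmetic on the spot and
+hands the result back to its caller.
+
+```pascal
+program Actions;
+{ Hypothetical demo: one construct, two different actions. }
+
+{ Compiler-style action: emit code, compute nothing. }
+procedure CompileAdd(a, b: char);
+begin
+   WriteLn(#9, 'MOVE #', a, ',D0');
+   WriteLn(#9, 'MOVE D0,-(SP)');
+   WriteLn(#9, 'MOVE #', b, ',D0');
+   WriteLn(#9, 'ADD (SP)+,D0');
+end;
+
+{ Interpreter-style action: do the arithmetic now, return it. }
+function InterpretAdd(a, b: char): integer;
+begin
+   InterpretAdd := (Ord(a) - Ord('0')) + (Ord(b) - Ord('0'));
+end;
+
+begin
+   WriteLn('Compiling 2+3:');
+   CompileAdd('2', '3');
+   WriteLn('Interpreting 2+3: ', InterpretAdd('2', '3'));
+end.
+```
+
+Nothing about the recognition changed between the two versions.
+Only the action did, and with it the need for a return value.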
+
+Our compiler, in fact,  is  what we might call a "pure" compiler.
+Each time a construct is recognized, the object  code  is emitted
+IMMEDIATELY.  (That's one reason the code is not very efficient.)
+The interpreter we'll be building  here is a pure interpreter, in
+the sense that there is  no  translation,  such  as "tokenizing,"
+performed on the source code.  These represent  the  two extremes
+of translation.  In  the  real  world,  translators are rarely so
+pure, but tend to have bits of each technique.
+
+I can think of several examples.  I've already mentioned one:
+most interpreters, Microsoft BASIC among them, translate the
+source code (tokenize it) into an intermediate form so that
+it'll be easier to parse in real time.
+
+Another example is an assembler.  The purpose of an assembler, of
+course, is to produce object code, and it normally does that on a
+one-to-one basis: one object instruction per line of source code.
+But almost every assembler also permits expressions as arguments.
+In this case, the expressions  are  always  constant expressions,
+and  so the assembler isn't supposed to  issue  object  code  for
+them.  Rather,  it  "interprets" the expressions and computes the
+corresponding constant result, which is what it actually emits as
+object code.
+
+As a matter of fact, we  could  use  a bit of that ourselves. The
+translator we built in the  previous  installment  will dutifully
+spit out object code  for  complicated  expressions,  even though
+every term in  the  expression  is  a  constant.  In that case it
+would be far better if the translator behaved a bit more  like an
+interpreter, and just computed the equivalent constant result.
+
+There is  a concept in compiler theory called "lazy" translation.
+The  idea is that you typically don't just  emit  code  at  every
+action.  In fact, at the extreme you don't emit anything  at all,
+until  you  absolutely  have to.  To accomplish this, the actions
+associated with the parsing routines  typically  don't  just emit
+code.  Sometimes  they  do,  but  often  they  simply  return in-
+formation back to the caller.  Armed with  such  information, the
+caller can then make a better choice of what to do.
+
+For example, given the statement
+
+               x = x + 3 - 2 - (5 - 4)  ,
+
+our compiler will dutifully spit  out a stream of 18 instructions
+to load each parameter into  registers,  perform  the arithmetic,
+and store the result.  A lazier evaluation  would  recognize that
+the arithmetic involving constants can  be  evaluated  at compile
+time, and would reduce the expression to
+
+               x = x + 0  .
+
+An  even  lazier  evaluation would then be smart enough to figure
+out that this is equivalent to
+
+               x = x  ,
+
+which  calls  for  no  action  at  all.   We could reduce 18  in-
+structions to zero!
+
+Note that there is no chance of optimizing this way in our trans-
+lator as it stands, because every action takes place immediately.
+
+Lazy  expression  evaluation  can  produce  significantly  better
+object code than we have been able to produce so far.  I warn you,
+though: it complicates the parser code considerably, because each
+routine now has to make decisions as to whether  to  emit  object
+code or not.  Lazy evaluation is certainly not named that because
+it's easier on the compiler writer!
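+
+As a rough sketch of the idea, and nothing more, the program
+below folds constants in a chain of single-character operands
+joined by '+' and '-'.  Src, Idx and Operand are assumptions made
+for the demo, the mnemonics simply follow the style used earlier,
+and there are no parentheses or multiplication.  Constants are
+merely accumulated; code is emitted only for variables, and the
+decision about the constants is put off until the very end.
+
+```pascal
+program LazyDemo;
+{ Sketch only: folds constants in a chain of + and - operands. }
+const TAB = ^I;
+var Src: string;                 { stand-in for the input stream }
+    Idx: integer;
+    Look: char;
+
+procedure GetChar;
+begin
+   Inc(Idx);
+   if Idx <= Length(Src) then Look := Src[Idx] else Look := #0;
+end;
+
+{ Handle one operand.  Constants are only added into ConstPart;
+  variables are the only thing that causes code to be emitted. }
+procedure Operand(Sign: integer; var ConstPart: integer;
+                  var Loaded: boolean);
+begin
+   if Look in ['0'..'9'] then
+      ConstPart := ConstPart + Sign * (Ord(Look) - Ord('0'))
+   else if UpCase(Look) in ['A'..'Z'] then begin
+      if not Loaded then begin
+         WriteLn(TAB, 'MOVE ', UpCase(Look), '(PC),D0');
+         if Sign < 0 then WriteLn(TAB, 'NEG D0');
+         Loaded := true;
+      end
+      else if Sign > 0 then
+         WriteLn(TAB, 'ADD ', UpCase(Look), '(PC),D0')
+      else
+         WriteLn(TAB, 'SUB ', UpCase(Look), '(PC),D0');
+   end
+   else begin
+      WriteLn('Error: operand expected');
+      Halt;
+   end;
+   GetChar;
+end;
+
+procedure Expression;
+var ConstPart, Sign: integer;
+    Loaded: boolean;
+begin
+   ConstPart := 0;
+   Loaded := false;
+   Operand(1, ConstPart, Loaded);
+   while Look in ['+', '-'] do begin
+      if Look = '+' then Sign := 1 else Sign := -1;
+      GetChar;
+      Operand(Sign, ConstPart, Loaded);
+   end;
+   { Only now do we decide what, if anything, the constants cost. }
+   if not Loaded then
+      WriteLn(TAB, 'MOVE #', ConstPart, ',D0')
+   else if ConstPart <> 0 then
+      WriteLn(TAB, 'ADD #', ConstPart, ',D0');
+end;
+
+begin
+   Src := 'X+3-2-1';
+   Idx := 0;
+   GetChar;
+   Expression;
+end.
+```
+
+For the input X+3-2-1 the constants sum to zero, so the only
+instruction emitted should be the load of X.  The sketch still
+loads X; recognizing that x = x needs no code at all would take
+one more level of laziness.  Notice, too, that Operand has to
+hand information back through its var parameters, which is just
+the kind of traffic between routines described above.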
+
+Since we're operating mainly on  the KISS principle here, I won't
+go  into much more depth on this subject.  I just want you to  be
+aware  that  you  can get some code optimization by combining the
+techniques of compiling and  interpreting.    In  particular, you
+should know that the parsing  routines  in  a  smarter translator
+will generally  return  things  to  their  caller,  and sometimes
+expect things as  well.    That's  the main reason for going over
+interpretation in this installment.
+
+
+THE INTERPRETER
+
+OK, now that you know WHY we're going into all this, let's do it.
+Just to give you practice, we're going to start over with  a bare
+cradle and build up the translator all over again.  This time, of
+course, we can go a bit faster.
+
+Since we're now going  to  do arithmetic, the first thing we need
+to do is to change function GetNum, which up till now  has always
+returned a character  (or  string).    Now, it's better for it to
+return an integer.    MAKE  A  COPY of the cradle (for goodness's
+sake, don't change the version  in  Cradle  itself!!)  and modify
+GetNum as follows:
+
+
+{--------------------------------------------------------------}
+{ Get a Number }
+
+function GetNum: integer;
+begin
+   if not IsDigit(Look) then Expected('Integer');
+   GetNum := Ord(Look) - Ord('0');
+   GetChar;
+end;
+{--------------------------------------------------------------}
+
+
+Now, write the following version of Expression:
+
+
+{---------------------------------------------------------------}
+{ Parse and Translate an Expression }
+
+function Expression: integer;
+begin
+   Expression := GetNum;
+end;
+{--------------------------------------------------------------}
+
+
+Finally, insert the statement
+
+
+   Writeln(Expression);
+
+
+at the end of the main program.  Now compile and test.
+
+All this program  does  is  to  "parse"  and  translate  a single
+integer  "expression."    As always, you should make sure that it
+does that with the digits 0..9, and gives an  error  message  for
+anything else.  Shouldn't take you very long!
+
+OK, now let's extend this to include addops.    Change Expression
+to read:
+
+
+{---------------------------------------------------------------}
+{ Parse and Translate an Expression }
+
+function Expression: integer;
+var Value: integer;
+begin
+   if IsAddop(Look) then
+      Value := 0
+   else
+      Value := GetNum;
+   while IsAddop(Look) do begin
+      case Look of
+       '+': begin
+               Match('+');
+               Value := Value + GetNum;
+            end;
+       '-': begin
+               Match('-');
+               Value := Value - GetNum;
+            end;
+      end;
+   end;
+   Expression := Value;
+end;
+{--------------------------------------------------------------}
+
+
+The structure of Expression, of  course,  parallels  what  we did
+before,  so  we  shouldn't have too much  trouble  debugging  it.
+There's  been  a  SIGNIFICANT  development, though, hasn't there?
+Procedures Add and Subtract went away!  The reason  is  that  the
+action to be taken  requires  BOTH arguments of the operation.  I
+could have chosen to retain the procedures and pass into them the
+value of the expression to date,  which  is Value.  But it seemed
+cleaner to me to  keep  Value as strictly a local variable, which
+meant that the code for Add and Subtract had to be moved in line.
+This result suggests  that,  while the structure we had developed
+was nice and  clean  for our simple-minded translation scheme, it
+probably  wouldn't do for use with lazy  evaluation.    That's  a
+little tidbit we'll probably want to keep in mind for later.
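+
+For comparison, here is roughly what the road not taken might
+look like: Add and Subtract are kept, but as functions that
+receive the value to date and return the updated one.  This is
+only a sketch under assumptions of my choosing: Src and Idx stand
+in for the input stream, GetChar is used in place of Match, and
+error checking is omitted.
+
+```pascal
+program PassValue;
+{ Sketch: Add and Subtract as functions taking the running value. }
+var Src: string;                 { stand-in for the input stream }
+    Idx: integer;
+    Look: char;
+
+procedure GetChar;
+begin
+   Inc(Idx);
+   if Idx <= Length(Src) then Look := Src[Idx] else Look := #0;
+end;
+
+function GetNum: integer;
+begin
+   GetNum := Ord(Look) - Ord('0');
+   GetChar;
+end;
+
+function Add(Value: integer): integer;
+begin
+   GetChar;                      { skip the '+' }
+   Add := Value + GetNum;
+end;
+
+function Subtract(Value: integer): integer;
+begin
+   GetChar;                      { skip the '-' }
+   Subtract := Value - GetNum;
+end;
+
+function Expression: integer;
+var Value: integer;
+begin
+   Value := GetNum;
+   while Look in ['+', '-'] do
+      case Look of
+       '+': Value := Add(Value);
+       '-': Value := Subtract(Value);
+      end;
+   Expression := Value;
+end;
+
+begin
+   Src := '9-4+2';
+   Idx := 0;
+   GetChar;
+   WriteLn(Expression);
+end.
+```
+
+It works, and for 9-4+2 it prints 7, but every operator now
+costs a parameter going in and a result coming out.  The in-line
+version keeps all of that inside Expression.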
+
+OK,  did the translator work?  Then let's  take  the  next  step.
+It's not hard to  figure  out what function Term should now look
+like.  Change every call to GetNum in function  Expression  to  a
+call to Term, and then enter the following form for Term:
+
+
+
+
+{---------------------------------------------------------------}
+{ Parse and Translate a Math Term }
+
+function Term: integer;
+var Value: integer;
+begin
+   Value := GetNum;
+   while Look in ['*', '/'] do begin
+      case Look of
+       '*': begin
+               Match('*');
+               Value := Value * GetNum;
+            end;
+       '/': begin
+               Match('/');
+               Value := Value div GetNum;
+            end;
+      end;
+   end;
+   Term := Value;
+end;
+{--------------------------------------------------------------}
+
+Now, try it out.    Don't forget two things: first, we're dealing
+with integer division, so, for example, 1/3 should come out zero.
+Second, even  though we can output multi-digit results, our input
+is still restricted to single digits.
+
+That seems like a silly restriction at this point, since  we have
+already  seen how easily function GetNum can  be  extended.    So
+let's go ahead and fix it right now.  The new version is
+
+
+{--------------------------------------------------------------}
+{ Get a Number }
+
+function GetNum: integer;
+var Value: integer;
+begin
+   Value := 0;
+   if not IsDigit(Look) then Expected('Integer');
+   while IsDigit(Look) do begin
+      Value := 10 * Value + Ord(Look) - Ord('0');
+      GetChar;
+   end;
+   GetNum := Value;
+end;
+{--------------------------------------------------------------}
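+
+If the accumulation step isn't obvious, the little trace below
+may help.  It is only an illustration: the string Digits stands
+in for the input stream, and the loop body is the same arithmetic
+used in GetNum.
+
+```pascal
+program NumTrace;
+{ Illustration only: Digits plays the part of the input stream. }
+var Digits: string;
+    i, Value: integer;
+begin
+   Digits := '472';
+   Value := 0;
+   for i := 1 to Length(Digits) do begin
+      { Shift what we have one decimal place, add the new digit. }
+      Value := 10 * Value + Ord(Digits[i]) - Ord('0');
+      WriteLn('after ''', Digits[i], ''': ', Value);
+   end;
+end.
+```
+
+The trace shows Value going 4, 47, 472.  Keep in mind that a
+Turbo Pascal integer is only 16 bits, so a number above 32767
+will overflow; neither this sketch nor GetNum checks for that.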
+
+
+If you've compiled and  tested  this  version of the interpreter,
+the  next  step  is to install function Factor, complete with pa-
+renthesized  expressions.  We'll hold off a  bit  longer  on  the
+variable  names.    First, change the references  to  GetNum,  in
+function Term, so that they call Factor instead.   Now  code  the
+following version of Factor:
+
+
+
+
+{---------------------------------------------------------------}
+{ Parse and Translate a Math Factor }
+
+function Expression: integer; Forward;
+
+function Factor: integer;
+begin
+   if Look = '(' then begin
+      Match('(');
+      Factor := Expression;
+      Match(')');
+      end
+   else
+       Factor := GetNum;
+end;
+{---------------------------------------------------------------}
+
+That was pretty easy, huh?  We're rapidly closing in on  a useful
+interpreter.
+
+
+A LITTLE PHILOSOPHY
+
+Before going any further, there's something I'd like  to  call to
+your attention.  It's a concept that we've been making use  of in
+all these sessions, but I haven't explicitly mentioned it up till
+now.  I think it's time, because it's a concept so useful, and so
+powerful,  that  it  makes all the difference  between  a  parser
+that's trivially easy, and one that's too complex to deal with.
+
+In the early days of compiler technology, people  had  a terrible
+time  figuring  out  how to deal with things like operator prece-
+dence  ...  the  way  that  multiply  and  divide operators  take
+precedence over add and subtract, etc.  I remember a colleague of
+some  thirty years ago, and how excited he was to find out how to
+do it.  The technique used involved building two  stacks,    upon
+which you pushed each operator  or operand.  Associated with each
+operator was a precedence level,  and the rules required that you
+only actually performed an operation  ("reducing"  the  stack) if
+the precedence level showing on top of the stack was correct.  To
+make life more interesting,  an  operator  like ')' had different
+precedence levels, depending  upon  whether or not it was already
+on the stack.  You  had to give it one value before you put it on
+the stack, and another to decide when to take it  off.   Just for
+the experience, I worked all of  this  out for myself a few years
+ago, and I can tell you that it's very tricky.
+
+We haven't  had  to  do  anything like that.  In fact, by now the
+parsing of an arithmetic statement should seem like child's play.
+How did we get so lucky?  And where did the precedence stacks go?
+
+A similar thing is going on  in  our interpreter above.  You just
+KNOW that in  order  for  it  to do the computation of arithmetic
+statements (as opposed to the parsing of them), there have  to be
+numbers pushed onto a stack somewhere.  But where is the stack?
+
+Finally,  in compiler textbooks, there are  a  number  of  places
+where  stacks  and  other structures are discussed.  In the other
+leading parsing method (LR), an explicit stack is used.  In fact,
+the technique is very  much  like the old way of doing arithmetic
+expressions.  Another concept  is  that of a parse tree.  Authors
+like to draw diagrams  of  the  tokens  in a statement, connected
+into a tree with  operators  at the internal nodes.  Again, where
+are the trees and stacks in our technique?  We haven't seen any.
+
+The answer in all cases is that the structures are  implicit, not
+explicit.    In  any computer language, there is a stack involved
+every  time  you  call  a  subroutine.  Whenever a subroutine  is
+called, the return address is pushed onto the CPU stack.   At the
+end of the subroutine, the address is popped back off and control
+is  transferred  there.   In a recursive language such as Pascal,
+there can also be local data pushed onto the stack, and  it, too,
+returns when it's needed.
+
+For example,  function  Expression  contains  a  local  variable
+called  Value, which it fills by a call to Term.  Suppose, in its
+next call to  Term  for  the  second  argument,  that  Term calls
+Factor, which recursively  calls  Expression  again.    That "in-
+stance" of Expression gets another value for its  copy  of Value.
+What happens  to  the  first  Value?    Answer: it's still on the
+stack, and  will  be  there  again  when  we return from our call
+sequence.
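+
+You can watch this happen.  The sketch below is a cut-down
+version of this installment's Expression, Term and Factor,
+reading from a fixed string.  Src, Idx and the Depth counter are
+additions made for the demo, only '+' and '*' are handled, and
+error checking is left out.  Each instance of Expression reports
+its own copy of Value, labeled with its nesting depth.
+
+```pascal
+program StackTrace;
+{ Sketch: shows that each instance of Expression has its own Value. }
+var Src: string;                 { stand-in for the input stream }
+    Idx, Depth, Answer: integer;
+    Look: char;
+
+procedure GetChar;
+begin
+   Inc(Idx);
+   if Idx <= Length(Src) then Look := Src[Idx] else Look := #0;
+end;
+
+function Expression: integer; forward;
+
+function Factor: integer;
+begin
+   if Look = '(' then begin
+      GetChar;
+      Factor := Expression;      { the recursive call }
+      GetChar;                   { skip the closing parenthesis }
+   end
+   else begin
+      Factor := Ord(Look) - Ord('0');
+      GetChar;
+   end;
+end;
+
+function Term: integer;
+var Value: integer;
+begin
+   Value := Factor;
+   while Look = '*' do begin
+      GetChar;
+      Value := Value * Factor;
+   end;
+   Term := Value;
+end;
+
+function Expression: integer;
+var Value: integer;              { one copy per active call }
+begin
+   Inc(Depth);
+   Value := Term;
+   WriteLn('depth ', Depth, ': first term = ', Value);
+   while Look = '+' do begin
+      GetChar;
+      Value := Value + Term;
+      WriteLn('depth ', Depth, ': running Value = ', Value);
+   end;
+   Dec(Depth);
+   Expression := Value;
+end;
+
+begin
+   Src := '1+2*(3+4)';
+   Idx := 0;
+   Depth := 0;
+   GetChar;
+   Answer := Expression;
+   WriteLn('result = ', Answer);
+end.
+```
+
+With the input 1+2*(3+4), the inner instance (depth 2) builds up
+its own Value of 7 while the outer instance's Value of 1 waits
+untouched on the stack.  The outer instance then picks up where
+it left off and finishes with 15.  Nowhere did we declare a
+stack; the procedure calls supplied it.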
+
+In other words, the reason things look so simple  is  that  we've
+been making maximum use of the resources of the  language.    The
+hierarchy levels  and  the  parse trees are there, all right, but
+they're hidden within the  structure  of  the parser, and they're
+taken care of by the order in which the various procedures are
+called.  Now that you've seen how we do it, it's probably hard to
+imagine doing it  any other way.  But I can tell you that it took
+a lot of years for compiler writers to get that smart.  The early
+compilers were too complex to imagine.  Funny how things get
+easier with a little practice.
+
+The reason  I've  brought  all  this up is as both a lesson and a
+warning.  The lesson: things can be easy when you do  them right.
+The warning: take a look at what you're doing.  If, as you branch
+out on  your  own,  you  begin to find a real need for a separate
+stack or tree structure, it may be time to ask yourself if you're
+looking at things the right way.  Maybe you just aren't using the
+facilities of the language as well as you could be.
+
+
+The next step is to add variable names.  Now,  though,  we have a
+slight problem.  For  the  compiler, we had no problem in dealing
+with variable names ... we just issued the names to the assembler
+and let the rest  of  the program take care of allocating storage
+for  them.  Here, on the other hand, we need to be able to  fetch
+the values of the variables and return them as the  return values
+of Factor.  We need a storage mechanism for these variables.
+
+Back in the early days of personal computing,  Tiny  BASIC lived.
+It had  a  grand  total  of  26  possible variables: one for each
+letter of the  alphabet.    This  fits nicely with our concept of
+single-character tokens, so we'll  try  the  same  trick.  In the
+beginning of your  interpreter,  just  after  the  declaration of
+variable Look, insert the line:
+
+               Table: Array['A'..'Z'] of integer;
+
+We also need to initialize the array, so add this procedure:
+
+
+
+
+{---------------------------------------------------------------}
+{ Initialize the Variable Area }
+
+procedure InitTable;
+var i: char;
+begin
+   for i := 'A' to 'Z' do
+      Table[i] := 0;
+end;
+{---------------------------------------------------------------}
+
+
+You must also insert a call to InitTable, in procedure Init.
+DON'T FORGET to do that, or the results may surprise you!
+
+Now that we have an array  of  variables, we can modify Factor to
+use it.  Since we don't have a way (so far) to set the variables,
+Factor  will always return zero values for  them,  but  let's  go
+ahead and extend it anyway.  Here's the new version:
+
+
+{---------------------------------------------------------------}
+{ Parse and Translate a Math Factor }
+
+function Expression: integer; Forward;
+
+function Factor: integer;
+begin
+   if Look = '(' then begin
+      Match('(');
+      Factor := Expression;
+      Match(')');
+      end
+   else if IsAlpha(Look) then
+      Factor := Table[GetName]
+   else
+       Factor := GetNum;
+end;
+{---------------------------------------------------------------}
+
+
+As always, compile and test this version of the  program.    Even
+though all the variables are now zeros, at least we can correctly
+parse the complete expressions, as well as catch any badly formed
+expressions.
+
+I suppose you realize the next step: we need to do  an assignment
+statement so we can  put  something INTO the variables.  For now,
+let's  stick  to  one-liners,  though  we will soon  be  handling
+multiple statements.
+
+The assignment statement parallels what we did before:
+
+
+{--------------------------------------------------------------}
+{ Parse and Translate an Assignment Statement }
+
+procedure Assignment;
+var Name: char;
+begin
+   Name := GetName;
+   Match('=');
+   Table[Name] := Expression;
+end;
+{--------------------------------------------------------------}
+
+
+To test this,  I  added  a  temporary write statement in the main
+program,  to  print out the value of A.  Then I  tested  it  with
+various assignments to it.
+
+Of course, an interpretive language that can only accept a single
+line of program  is not of much value.  So we're going to want to
+handle multiple statements.  This  merely  means  putting  a loop
+around  the  call  to Assignment.  So let's do that now. But what
+should be the loop exit criterion?  Glad you  asked,  because  it
+brings up a point we've been able to ignore up till now.
+
+One of the most tricky things  to  handle in any translator is to
+determine when to bail out of  a  given construct and go look for
+something else.  This hasn't been a problem for us so far because
+we've only allowed for  a  single kind of construct ... either an
+expression  or an assignment statement.   When  we  start  adding
+loops and different kinds of statements, you'll find that we have
+to be very careful that things terminate properly.  If we put our
+interpreter in a loop, we need a way to quit.    Terminating on a
+newline is no good, because that's what sends us back for another
+line.  We could always let an unrecognized character take us out,
+but that would cause every run to end in an error  message, which
+certainly seems uncool.
+
+What we need is a termination character.  I vote for Pascal's
+ending period ('.').  A minor complication is that Turbo ends
+every normal line with TWO characters, the carriage return (CR)
+and line feed (LF).  At the end of each line, we need to eat
+these characters before processing the next one.  A natural way
+to do this would be with procedure Match, except that Match's
+error message prints the character, which of course for the CR
+and/or LF won't look so great.  What we need is a special
+procedure for this, which we'll no doubt be using over and over.
+The procedure refers to the constants CR and LF, so if your copy
+of the cradle doesn't declare them, add CR = ^M and LF = ^J to
+the constant declarations first.  Here it is:
+
+
+{--------------------------------------------------------------}
+{ Recognize and Skip Over a Newline }
+
+procedure NewLine;
+begin
+   if Look = CR then begin
+      GetChar;
+      if Look = LF then
+         GetChar;
+   end;
+end;
+{--------------------------------------------------------------}
+
+
+Insert this procedure at any convenient spot ... I put  mine just
+after Match.  Now, rewrite the main program to look like this:
+
+
+{--------------------------------------------------------------}
+{ Main Program }
+
+begin
+   Init;
+   repeat
+      Assignment;
+      NewLine;
+   until Look = '.';
+end.
+{--------------------------------------------------------------}
+
+
+Note that the  test for a CR is now gone, and that there are also
+no  error tests within NewLine itself.   That's  OK,  though  ...
+whatever is left over in terms of bogus characters will be caught
+at the beginning of the next assignment statement.
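+
+If you'd like to see the loop and NewLine cooperate without
+typing at the keyboard, here is a small sketch that runs the same
+repeat loop over a fixed string.  Src, Idx, Operand and the
+trimmed-down Expression are assumptions made for the demo, and
+error checking is omitted, so feed it only well-formed input.
+
+```pascal
+program LineLoop;
+{ Sketch: the repeat/NewLine loop run over a fixed string. }
+const CR = ^M;
+      LF = ^J;
+var Src: string;                 { stand-in for the input stream }
+    Idx: integer;
+    Look: char;
+    Table: array['A'..'Z'] of integer;
+    c: char;
+
+{ Past the end of Src we return '.', so the loop always stops. }
+procedure GetChar;
+begin
+   Inc(Idx);
+   if Idx <= Length(Src) then Look := Src[Idx] else Look := '.';
+end;
+
+procedure NewLine;
+begin
+   if Look = CR then begin
+      GetChar;
+      if Look = LF then
+         GetChar;
+   end;
+end;
+
+{ An operand is a single digit or a single-letter variable. }
+function Operand: integer;
+begin
+   if Look in ['0'..'9'] then
+      Operand := Ord(Look) - Ord('0')
+   else
+      Operand := Table[UpCase(Look)];
+   GetChar;
+end;
+
+function Expression: integer;
+var Value: integer;
+begin
+   Value := Operand;
+   while Look in ['+', '-'] do
+      if Look = '+' then begin
+         GetChar;
+         Value := Value + Operand;
+      end
+      else begin
+         GetChar;
+         Value := Value - Operand;
+      end;
+   Expression := Value;
+end;
+
+procedure Assignment;
+var Name: char;
+begin
+   Name := UpCase(Look);
+   GetChar;
+   GetChar;                      { skip the '=' }
+   Table[Name] := Expression;
+end;
+
+begin
+   for c := 'A' to 'Z' do Table[c] := 0;
+   Src := 'A=5' + CR + LF + 'B=A+2' + CR + LF + '.';
+   Idx := 0;
+   GetChar;
+   repeat
+      Assignment;
+      NewLine;
+   until Look = '.';
+   WriteLn('A = ', Table['A'], ', B = ', Table['B']);
+end.
+```
+
+Run as written, this should report A = 5, B = 7.  The second
+line can use A only because NewLine ate the CR and LF that
+followed the first.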
+
+Well, we now have a functioning interpreter.  It doesn't do  us a
+lot of  good,  however,  since  we have no way to read data in or
+write it out.  Sure would help to have some I/O!
+
+Let's wrap this session  up,  then,  by  adding the I/O routines.
+Since we're  sticking to single-character tokens, I'll use '?' to
+stand for a read statement, and  '!'  for a write, with the char-
+acter  immediately  following  them  to  be used as  a  one-token
+"parameter list."  Here are the routines:
+
+{--------------------------------------------------------------}
+{ Input Routine }
+
+procedure Input;
+begin
+   Match('?');
+   Read(Table[GetName]);
+end;
+
+
+{--------------------------------------------------------------}
+{ Output Routine }
+
+procedure Output;
+begin
+   Match('!');
+   WriteLn(Table[GetName]);
+end;
+{--------------------------------------------------------------}
+
+They aren't very fancy, I admit ... no prompt character on input,
+for example ... but they get the job done.
+
+The corresponding changes in  the  main  program are shown below.
+Note that we use the usual  trick  of a case statement based upon
+the current lookahead character, to decide what to do.
+
+
+{--------------------------------------------------------------}
+{ Main Program }
+
+begin
+   Init;
+   repeat
+      case Look of
+       '?': Input;
+       '!': Output;
+       else Assignment;
+      end;
+      NewLine;
+   until Look = '.';
+end.
+{--------------------------------------------------------------}
+
+
+You have now completed a  real, working interpreter.  It's pretty
+sparse, but it works just like the "big boys."  It includes three
+kinds of program statements  (and  can  tell the difference!), 26
+variables,  and  I/O  statements.  The only things that it lacks,
+really, are control statements,  subroutines,    and some kind of
+program editing function.  The program editing part, I'm going to
+pass on.  After all, we're  not  here  to build a product, but to
+learn  things.    The control statements, we'll cover in the next
+installment, and the subroutines soon  after.  I'm anxious to get
+on with that, so we'll leave the interpreter as it stands.
+
+I hope that by  now  you're convinced that the limitation of sin-
+gle-character names  and the processing of white space are easily
+taken  care  of, as we did in the last session.   This  time,  if
+you'd like to play around with these extensions, be my  guest ...
+they're  "left as an exercise for the student."    See  you  next
+time.
+
+*****************************************************************
+*                                                               *
+*                        COPYRIGHT NOTICE                       *
+*                                                               *
+*   Copyright (C) 1988 Jack W. Crenshaw. All rights reserved.   *
+*                                                               *
+*****************************************************************
+
+
+
@@ -0,0 +1,525 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+                     LET'S BUILD A COMPILER!
+
+                                By
+
+                     Jack W. Crenshaw, Ph.D.
+
+                           2 April 1989
+
+
+                  Part VIII: A LITTLE PHILOSOPHY
+
+
+*****************************************************************
+*                                                               *
+*                        COPYRIGHT NOTICE                       *
+*                                                               *
+*   Copyright (C) 1989 Jack W. Crenshaw. All rights reserved.   *
+*                                                               *
+*****************************************************************
+
+
+INTRODUCTION
+
+This is going to be a  different  kind of session than the others
+in our series on  parsing  and  compiler  construction.  For this
+session, there won't be  any  experiments to do or code to write.
+This  once,  I'd  like  to  just  talk  with  you  for  a  while.
+Mercifully, it will be a short  session,  and then we can take up
+where we left off, hopefully with renewed vigor.
+
+When  I  was  in college, I found that I could  always  follow  a
+prof's lecture a lot better if I knew where he was going with it.
+I'll bet you were the same.
+
+So I thought maybe it's about  time  I told you where we're going
+with this series: what's coming up in future installments, and in
+general what all  this  is  about.   I'll also share some general
+thoughts concerning the usefulness of what we've been doing.
+
+
+THE ROAD HOME
+
+So far, we've  covered  the parsing and translation of arithmetic
+expressions,  Boolean expressions, and combinations connected  by
+relational  operators.    We've also done the  same  for  control
+constructs.    In  all of this we've leaned heavily on the use of
+top-down, recursive  descent  parsing,  BNF  definitions  of  the
+syntax, and direct generation of assembly-language code.  We also
+learned the value of  such  tricks  as single-character tokens to
+help  us  see  the  forest  through  the  trees.    In  the  last
+installment  we dealt with lexical scanning,  and  I  showed  you
+simple but powerful ways to remove the single-character barriers.
+
+Throughout the whole study, I've emphasized  the  KISS philosophy
+... Keep It Simple, Sidney ... and I hope by now  you've realized
+just  how  simple  this stuff can really be.  While there are for
+sure areas of compiler  theory  that  are truly intimidating, the
+ultimate message of this series is that in practice you  can just
+politely  sidestep   many  of  these  areas.    If  the  language
+definition  cooperates  or,  as in this series, if you can define
+the language as you go, it's possible to write down  the language
+definition in BNF with reasonable ease.  And, as we've  seen, you
+can crank out parse procedures from the BNF just about as fast as
+you can type.
+
+As our compiler has taken form, it's gotten more parts,  but each
+part  is  quite small and simple, and  very  much  like  all  the
+others.
+
+At this point, we have many  of  the makings of a real, practical
+compiler.  As a matter of  fact,  we  already have all we need to
+build a toy  compiler  for  a  language as powerful as, say, Tiny
+BASIC.  In the next couple of installments, we'll  go  ahead  and
+define that language.
+
+To round out  the  series,  we  still  have a few items to cover.
+These include:
+
+   o Procedure calls, with and without parameters
+
+   o Local and global variables
+
+   o Basic types, such as character and integer types
+
+   o Arrays
+
+   o Strings
+
+   o User-defined types and structures
+
+   o Tree-structured parsers and intermediate languages
+
+   o Optimization
+
+These will all be  covered  in  future  installments.  When we're
+finished, you'll have all the tools you need to design  and build
+your own languages, and the compilers to translate them.
+
+I can't  design  those  languages  for  you,  but I can make some
+comments  and  recommendations.    I've  already  sprinkled  some
+throughout past installments.    You've  seen,  for  example, the
+control constructs I prefer.
+
+These constructs are going  to  be part of the languages I build.
+I  have  three  languages in mind at this point, two of which you
+will see in installments to come:
+
+TINY - A  minimal,  but  usable  language  on the order  of  Tiny
+       BASIC or Tiny C.  It won't be very practical, but  it will
+       have enough power to let you write and  run  real programs
+       that do something worthwhile.
+
+KISS - The  language  I'm  building for my  own  use.    KISS  is
+       intended to be  a  systems programming language.  It won't
+       have strong typing  or  fancy data structures, but it will
+       support most of  the  things  I  want to do with a higher-
+       order language (HOL), except perhaps writing compilers.
+                              
+I've also  been  toying  for  years  with  the idea of a HOL-like
+assembler,  with  structured  control  constructs   and  HOL-like
+assignment statements.  That, in  fact, was the impetus behind my
+original foray into the jungles of compiler theory.  This one may
+never be built, simply  because  I've  learned that it's actually
+easier to implement a language like KISS, that only uses a subset
+of the CPU instructions.    As you know, assembly language can be
+bizarre  and  irregular  in the extreme, and a language that maps
+one-for-one onto it can be a real challenge.  Still,  I've always
+felt that the syntax used  in conventional assemblers is dumb ...
+why is
+
+     MOVE.L A,B
+
+better, or easier to translate, than
+
+     B=A ?
+
+I  think  it  would  be  an  interesting  exercise to  develop  a
+"compiler" that  would give the programmer complete access to and
+control over the full complement  of the CPU instruction set, and
+would allow you to generate  programs  as  efficient  as assembly
+language, without the pain  of  learning a set of mnemonics.  Can
+it be done?  I don't  know.  The  real question may be, "Will the
+resulting language be any  easier  to  write  than assembly"?  If
+not, there's no point in it.  I think that it  can  be  done, but
+I'm not completely sure yet how the syntax should look.
+
+Perhaps you have some  comments  or suggestions on this one.  I'd
+love to hear them.
+
+You probably won't be surprised to learn that I've already worked
+ahead in most  of the areas that we will cover.  I have some good
+news:  Things  never  get  much  harder than they've been so far.
+It's  possible  to  build a complete, working compiler for a real
+language, using nothing  but  the same kinds of techniques you've
+learned so far.  And THAT brings up some interesting questions.
+
+
+WHY IS IT SO SIMPLE?
+
+Before embarking  on this series, I always thought that compilers
+were just naturally complex computer  programs  ...  the ultimate
+challenge.  Yet the things we have done here have  usually turned
+out to be quite simple, sometimes even trivial.
+
+For a while, I thought it was simply because I hadn't yet gotten
+into the meat  of  the  subject.    I had only covered the simple
+parts.  I will freely admit  to  you  that, even when I began the
+series,  I  wasn't  sure how far we would be able  to  go  before
+things got too complex to deal with in the ways  we  have so far.
+But at this point I've already  been  down the road far enough to
+see the end of it.  Guess what?
+                              
+
+                     THERE ARE NO HARD PARTS!
+
+
+Then, I thought maybe it was because we were not  generating very
+good object  code.    Those  of  you  who have been following the
+series and trying sample compiles know that, while the code works
+and  is  rather  foolproof,  its  efficiency is pretty awful.   I
+figured that if we were  concentrating on turning out tight code,
+we would soon find all that missing complexity.
+
+To  some  extent,  that one is true.  In particular, my first few
+efforts at trying to improve efficiency introduced  complexity at
+an alarming rate.  But since then I've been tinkering around with
+some simple optimizations and I've found some that result in very
+respectable code quality, WITHOUT adding a lot of complexity.
+
+Finally, I thought that  perhaps  the  saving  grace was the "toy
+compiler" nature of the study.   I  have made no pretense that we
+were  ever  going  to be able to build a compiler to compete with
+Borland and Microsoft.  And yet, again, as I get deeper into this
+thing the differences are starting to fade away.
+
+Just  to make sure you get the message here, let me state it flat
+out:
+
+   USING THE TECHNIQUES WE'VE USED  HERE,  IT  IS  POSSIBLE TO
+   BUILD A PRODUCTION-QUALITY, WORKING COMPILER WITHOUT ADDING
+   A LOT OF COMPLEXITY TO WHAT WE'VE ALREADY DONE.
+
+
+Since  the series began I've received  some  comments  from  you.
+Most of them echo my own thoughts:  "This is easy!    Why  do the
+textbooks make it seem so hard?"  Good question.
+
+Recently, I've gone back and looked at some of those texts again,
+and even bought and read some new ones.  Each  time,  I come away
+with the same feeling: These guys have made it seem too hard.
+
+What's going on here?  Why does the whole thing seem difficult in
+the texts, but easy to us?    Are  we that much smarter than Aho,
+Ullman, Brinch Hansen, and all the rest?
+
+Hardly.  But we  are  doing some things differently, and more and
+more  I'm  starting  to appreciate the value of our approach, and
+the way that  it  simplifies  things.    Aside  from  the obvious
+shortcuts that I outlined in Part I, like single-character tokens
+and console I/O, we have  made some implicit assumptions and done
+some things differently from those who have designed compilers in
+the past. As it turns out, our approach makes life a lot easier.
+
+So why didn't all those other guys use it?
+
+You have to remember the context of some of the  earlier compiler
+development.  These people were working with very small computers
+of  limited  capacity.      Memory  was  very  limited,  the  CPU
+instruction  set  was  minimal, and programs ran  in  batch  mode
+rather  than  interactively.   As it turns out, these caused some
+key design decisions that have  really  complicated  the designs.
+Until recently,  I hadn't realized how much of classical compiler
+design was driven by the available hardware.
+
+Even in cases where these  limitations  no  longer  apply, people
+have  tended  to  structure their programs in the same way, since
+that is the way they were taught to do it.
+
+In  our case, we have started with a blank sheet of paper.  There
+is a danger there, of course,  that  you will end up falling into
+traps that other people have long since learned to avoid.  But it
+also has allowed us to  take different approaches that, partly by
+design  and partly by pure dumb luck, have  allowed  us  to  gain
+simplicity.
+
+Here are the areas that I think have  led  to  complexity  in the
+past:
+
+  o  Limited RAM Forcing Multiple Passes
+
+     I  just  read  "Brinch  Hansen  on  Pascal   Compilers"  (an
+     excellent book, BTW).  He  developed a Pascal compiler for a
+     PC, but he started the effort in 1981 with a 64K system, and
+     so almost every design decision  he made was aimed at making
+     the compiler fit  into  RAM.    To do this, his compiler has
+     three passes, one of which is the lexical scanner.  There is
+     no way he could, for  example, use the distributed scanner I
+     introduced  in  the last installment,  because  the  program
+     structure wouldn't allow it.  He also required  not  one but
+     two intermediate  languages,  to  provide  the communication
+     between phases.
+
+     All the early compiler writers  had to deal with this issue:
+     Break the compiler up into enough parts so that it  will fit
+     in memory.  When  you  have multiple passes, you need to add
+     data structures to support the  information  that  each pass
+     leaves behind for the next.   That adds complexity, and ends
+     up driving the  design.    Lee's  book,  "The  Anatomy  of a
+     Compiler,"  mentions a FORTRAN compiler developed for an IBM
+     1401.  It had no fewer than 63 separate passes!  Needless to
+     say,  in a compiler like this  the  separation  into  phases
+     would dominate the design.
+
+     Even in  situations  where  RAM  is  plentiful,  people have
+     tended  to  use  the same techniques because  that  is  what
+     they're familiar with.   It  wasn't  until Turbo Pascal came
+     along that we found how simple a compiler could  be  if  you
+     started with different assumptions.
+
+
+  o  Batch Processing
+                              
+     In the early days, batch  processing was the only choice ...
+     there was no interactive computing.   Even  today, compilers
+     run in essentially batch mode.
+
+     In a mainframe compiler as  well  as  many  micro compilers,
+     considerable effort is expended on error recovery ... it can
+     consume as much as 30-40%  of  the  compiler  and completely
+     drive the design.  The idea is to avoid halting on the first
+     error, but rather to keep going at all costs,  so  that  you
+     can  tell  the  programmer about as many errors in the whole
+     program as possible.
+
+     All of that harks back to the days of the  early mainframes,
+     where turnaround time was measured  in hours or days, and it
+     was important to squeeze every last ounce of information out
+     of each run.
+
+     In this series, I've been very careful to avoid the issue of
+     error recovery, and instead our compiler  simply  halts with
+     an error message on  the  first error.  I will frankly admit
+     that it was mostly because I wanted to take the easy way out
+     and keep things simple.   But  this  approach,  pioneered by
+     Borland in Turbo Pascal, also has a lot going for it anyway.
+     Aside from keeping the  compiler  simple,  it also fits very
+     well  with   the  idea  of  an  interactive  system.    When
+     compilation is  fast, and especially when you have an editor
+     such as Borland's that  will  take you right to the point of
+     the error, then it makes a  lot  of sense to stop there, and
+     just restart the compilation after the error is fixed.
+
+
+  o  Large Programs
+
+     Early compilers were designed to handle  large  programs ...
+     essentially infinite ones.    In those days there was little
+     choice;  the  ideas  of  subroutine libraries and  separate
+     compilation  were  still  in  the  future.      Again,  this
+     assumption led to  multi-pass designs and intermediate files
+     to hold the results of partial processing.
+
+     Brinch Hansen's  stated goal was that the compiler should be
+     able to compile itself.   Again, because of his limited RAM,
+     this drove him to a multi-pass design.  He needed  as little
+     resident compiler code as possible,  so  that  the necessary
+     tables and other data structures would fit into RAM.
+
+     I haven't stated this one yet, because there  hasn't  been a
+     need  ... we've always just read and  written  the  data  as
+     streams, anyway.  But  for  the  record,  my plan has always
+     been that, in  a  production compiler, the source and object
+     data should all coexist  in  RAM with the compiler, a la the
+     early Turbo Pascals.  That's why I've been  careful  to keep
+     routines like GetChar  and  Emit  as  separate  routines, in
+     spite of their small size.   It  will be easy to change them
+     to read from and write to memory.
+
+
+  o  Emphasis on Efficiency
+
+     John  Backus has stated that, when  he  and  his  colleagues
+     developed the original FORTRAN compiler, they KNEW that they
+     had to make it produce tight code.  In those days, there was
+     a strong sentiment against HOLs  and  in  favor  of assembly
+     language, and  efficiency was the reason.  If FORTRAN didn't
+     produce very good  code  by  assembly  standards,  the users
+     would simply refuse to use it.  For the record, that FORTRAN
+     compiler turned out to  be  one  of  the most efficient ever
+     built, in terms of code quality.  But it WAS complex!
+
+     Today,  we have CPU power and RAM size  to  spare,  so  code
+     efficiency is not  so  much  of  an  issue.    By studiously
+     ignoring this issue, we  have  indeed  been  able to Keep It
+     Simple.    Ironically,  though, as I have said, I have found
+     some optimizations that we can  add  to  the  basic compiler
+     structure, without having to add a lot of complexity.  So in
+     this  case we get to have our cake and eat it too:  we  will
+     end up with reasonable code quality, anyway.
+
+
+  o  Limited Instruction Sets
+
+     The early computers had primitive instruction sets.   Things
+     that  we  take  for granted, such as  stack  operations  and
+     indirect addressing, came only with great difficulty.
+
+     Example: In most compiler designs, there is a data structure
+     called the literal pool.  The compiler  typically identifies
+     all literals used in the program, and collects  them  into a
+     single data structure.    All references to the literals are
+     done  indirectly  to  this  pool.    At  the   end   of  the
+     compilation, the  compiler  issues  commands  to  set  aside
+     storage and initialize the literal pool.
+
+     We haven't had to address that  issue  at all.  When we want
+     to load a literal, we just do it, in line, as in
+
+          MOVE #3,D0
+
+     There is something to be said for the use of a literal pool,
+     particularly on a machine like  the 8086 where data and code
+     can  be separated.  Still, the whole  thing  adds  a  fairly
+     large amount of complexity with little in return.
+
+     Of course, without the stack we would be lost.  In  a micro,
+     both  subroutine calls and temporary storage depend  heavily
+     on the stack, and  we  have used it even more than necessary
+     to ease expression parsing.
+
+
+  o  Desire for Generality
+
+     Much of the content of the typical compiler text is taken up
+     with issues we haven't addressed here at all ... things like
+     automated  translation  of  grammars,  or generation of LALR
+     parse tables.  This is not simply because  the  authors want
+     to impress you.  There are good, practical  reasons  why the
+     subjects are there.
+
+     We have been concentrating on the use of a recursive-descent
+     parser to parse a  deterministic  grammar,  i.e.,  a grammar
+     that is not ambiguous and, beyond that, can be parsed with one
+     level of lookahead.  I haven't made much of this limitation,
+     but  the  fact  is  that  this represents a small subset  of
+     possible grammars.  In fact,  there is an infinite number of
+     grammars that we can't parse using our techniques.    The LR
+     technique is a more powerful one, and can deal with grammars
+     that we can't.
+
+     In compiler theory, it's important  to know how to deal with
+     these  other  grammars,  and  how  to  transform  them  into
+     grammars  that  are  easier to deal with.  For example, many
+     (but not all) ambiguous  grammars  can  be  transformed into
+     unambiguous ones.  The way to do this is not always obvious,
+     though, and so many people  have  devoted  years  to develop
+     ways to transform them automatically.
+
+     In practice, these  issues  turn out to be considerably less
+     important.  Modern languages tend  to be designed to be easy
+     to parse, anyway.   That  was a key motivation in the design
+     of Pascal.   Sure,  there are pathological grammars that you
+     would be hard pressed to write unambiguous BNF  for,  but in
+     the  real  world  the best answer is probably to avoid those
+     grammars!
+
+     In  our  case,  of course, we have sneakily let the language
+     evolve  as  we  go, so we haven't painted ourselves into any
+     corners here.  You may not always have that luxury.   Still,
+     with a little  care  you  should  be able to keep the parser
+     simple without having to resort to automatic  translation of
+     the grammar.
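+
+To make the literal-pool contrast from the list above a little
+more concrete, here is a rough sketch in the style of our code
+generator.  Nothing in it is part of the compiler built in this
+series: LitPool, NLits, MaxLit, LoadConst, LoadPooled, and
+DumpPool are hypothetical names, and the LITn label scheme is an
+assumption.  EmitLn, Abort, and TAB are taken to be the usual
+Cradle items, and NLits is assumed to be set to zero in Init.
+
+{--------------------------------------------------------------}
+{ In-line Literals vs. a Literal Pool (illustrative sketch) }
+
+const MaxLit = 100;
+
+var LitPool: array[1..MaxLit] of integer;  { collected literals }
+    NLits: integer;                        { number collected }
+
+{ In-line style: the literal travels with the instruction }
+procedure LoadConst(N: integer);
+var S: string;
+begin
+   Str(N, S);
+   EmitLn('MOVE #' + S + ',D0');
+end;
+
+{ Pool style: save the value, load it indirectly by label }
+procedure LoadPooled(N: integer);
+var S: string;
+begin
+   if NLits = MaxLit then Abort('Literal pool full');
+   Inc(NLits);
+   LitPool[NLits] := N;
+   Str(NLits, S);
+   EmitLn('MOVE LIT' + S + '(PC),D0');
+end;
+
+{ Pool style: called once, at the end of compilation }
+procedure DumpPool;
+var I: integer;
+    S, V: string;
+begin
+   for I := 1 to NLits do begin
+      Str(I, S);
+      Str(LitPool[I], V);
+      WriteLn('LIT', S, ':', TAB, 'DC ', V);
+   end;
+end;
+{--------------------------------------------------------------}
+
+Even this stripped-down pool needs a table, an overflow check,
+and a pass at the end to dump it, where the in-line version
+needs a single EmitLn.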
+
+
+We have taken  a  vastly  different  approach in this series.  We
+started with a clean sheet  of  paper,  and  developed techniques
+that work in the context that  we  are in; that is, a single-user
+PC  with  rather  ample CPU power and RAM space.  We have limited
+ourselves to reasonable grammars that  are easy to parse, we have
+used the instruction set of the CPU to advantage, and we have not
+concerned ourselves with efficiency.  THAT's why it's been easy.
+
+Does this mean that we are forever doomed  to  be  able  to build
+only toy compilers?   No, I don't think so.  As I've said, we can
+add  certain   optimizations   without   changing   the  compiler
+structure.  If we want to process large files, we can  always add
+file  buffering  to do that.  These  things  do  not  affect  the
+overall program design.
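+
+As a sketch of how local such a change can be, here is one way
+GetChar might take its input from a memory buffer rather than
+the console.  The names Source, SrcLen, SrcPos, and MaxSrc are
+hypothetical, as is the use of ^Z to mark the end of input, and
+the sketch assumes that some other routine has filled the buffer
+and set SrcPos to 1.  The rest of the compiler still sees only
+Look.
+
+{--------------------------------------------------------------}
+{ Read New Character From a Memory Buffer (illustrative sketch) }
+
+const MaxSrc = 16384;
+
+var Source: array[1..MaxSrc] of char;   { whole source text }
+    SrcLen: integer;                    { characters loaded }
+    SrcPos: integer;                    { next character to read }
+
+procedure GetChar;
+begin
+   if SrcPos <= SrcLen then begin
+      Look := Source[SrcPos];
+      Inc(SrcPos);
+   end
+   else
+      Look := ^Z;                       { assumed end marker }
+end;
+{--------------------------------------------------------------}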
+
+And I think  that's  a  key  factor.   By starting with small and
+limited  cases,  we  have been able to concentrate on a structure
+for  the  compiler  that is natural  for  the  job.    Since  the
+structure naturally fits the job, it is almost bound to be simple
+and transparent.   Adding  capability doesn't have to change that
+basic  structure.    We  can  simply expand things like the  file
+structure or add an optimization layer.  I guess  my  feeling  is
+that, back when resources were tight, the structures people ended
+up  with  were  artificially warped to make them work under those
+conditions, and weren't optimum  structures  for  the  problem at
+hand.
+
+
+CONCLUSION
+
+Anyway, that's my arm-waving  guess  as to how we've been able to
+keep things simple.  We started with something simple and  let it
+evolve  naturally,  without  trying  to   force   it   into  some
+traditional mold.
+
+We're going to  press on with this.  I've given you a list of the
+areas  we'll  be  covering in future installments.    With  those
+installments, you  should  be  able  to  build  complete, working
+compilers for just about any occasion, and build them simply.  If
+you REALLY want to build production-quality compilers,  you'll be
+able to do that, too.
+
+For those of you who are champing at the bit for more parser code,
+I apologize for this digression.  I just thought  you'd  like  to
+have things put  into  perspective  a  bit.  Next time, we'll get
+back to the mainstream of the tutorial.
+
+So far, we've only looked at pieces of compilers,  and  while  we
+have  many  of  the  makings  of a complete language, we  haven't
+talked about how to put  it  all  together.    That  will  be the
+subject of our next  two  installments.  Then we'll press on into
+the new subjects I listed at the beginning of this installment.
+
+See you then.
+
+*****************************************************************
+*                                                               *
+*                        COPYRIGHT NOTICE                       *
+*                                                               *
+*   Copyright (C) 1989 Jack W. Crenshaw. All rights reserved.   *
+*                                                               *
+*****************************************************************
+
@@ -0,0 +1,821 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+                     LET'S BUILD A COMPILER!
+
+                                By
+
+                     Jack W. Crenshaw, Ph.D.
+
+                          16 April 1989
+
+
+                       Part IX: A TOP VIEW
+
+
+*****************************************************************
+*                                                               *
+*                        COPYRIGHT NOTICE                       *
+*                                                               *
+*   Copyright (C) 1989 Jack W. Crenshaw. All rights reserved.   *
+*                                                               *
+*****************************************************************
+
+
+INTRODUCTION
+
+In  the  previous  installments,  we  have  learned  many of  the
+techniques required to  build  a full-blown compiler.  We've done
+assignment   statements   (with   Boolean   and   arithmetic
+expressions),  relational operators, and control constructs.   We
+still haven't  addressed procedure or function calls, but even so
+we  could  conceivably construct a  mini-language  without  them.
+I've  always  thought  it would be fun to see just  how  small  a
+language  one  could  build  that  would still be useful.   We're
+ALMOST in a position to do that now.  The  problem  is: though we
+know  how  to  parse and translate the constructs, we still don't
+know quite how to put them all together into a language.
+
+In those earlier installments, the  development  of  our programs
+had  a decidedly bottom-up flavor.  In  the  case  of  expression
+parsing,  for  example,  we  began  with  the  very lowest  level
+constructs, the individual constants  and  variables,  and worked
+our way up to more complex expressions.
+
+Most people regard  the  top-down design approach as being better
+than  the  bottom-up  one.  I do too,  but  the  way  we  did  it
+certainly seemed natural enough for the kinds of  things  we were
+parsing.
+
+You mustn't get  the  idea, though, that the incremental approach
+that  we've  been  using  in  all these tutorials  is  inherently
+bottom-up.  In  this  installment  I'd  like to show you that the
+approach can work just as well when applied from the top down ...
+maybe better.  We'll consider languages such as C and Pascal, and
+see how complete compilers can be built starting from the top.
+
+In the next installment, we'll  apply the same technique to build
+a  complete  translator  for a subset of the KISS language, which
+I'll be  calling  TINY.    But one of my goals for this series is
+that you will  not only be able to see how a compiler for TINY or
+KISS  works,  but  that you will also be able to design and build
+compilers for your own languages.  The C and Pascal examples will
+help.    One  thing I'd like you  to  see  is  that  the  natural
+structure of the compiler depends very much on the language being
+translated, so the simplicity and  ease  of  construction  of the
+compiler  depends  very  much  on  letting the language  set  the
+program structure.
+                              
+It's  a bit much to produce a full C or Pascal compiler here, and
+we won't try.   But we can flesh out the top levels far enough so
+that you can see how it goes.
+
+Let's get started.
+
+
+THE TOP LEVEL
+
+One of the biggest  mistakes  people make in a top-down design is
+failing  to start at the true top.  They think they know what the
+overall structure of the  design  should be, so they go ahead and
+write it down.
+
+Whenever  I  start a new design, I always like to do  it  at  the
+absolute beginning.   In  program design language (PDL), this top
+level looks something like:
+
+
+     begin
+        solve the problem
+     end
+
+
+OK, I grant  you that this doesn't give much of a hint as to what
+the next level is, but I  like  to  write it down anyway, just to
+give me that warm feeling that I am indeed starting at the top.
+
+For our problem, the overall function of a compiler is to compile
+a complete program.  Any definition of the  language,  written in
+BNF,  begins here.  What does the top level BNF look like?  Well,
+that depends quite a bit on the language to be translated.  Let's
+take a look at Pascal.
+
+
+THE STRUCTURE OF PASCAL
+
+Most  texts  for  Pascal  include  a   BNF   or  "railroad-track"
+definition of the language.  Here are the first few lines of one:
+
+
+     <program> ::= <program-header> <block> '.'
+
+     <program-header> ::= PROGRAM <ident>
+
+     <block> ::= <declarations> <statements>
+
+
+We can write recognizers  to  deal  with  each of these elements,
+just as we've done before.  For each one, we'll use  our familiar
+single-character tokens to represent the input, then flesh things
+out a little at a time.    Let's begin with the first recognizer:
+the program itself.
+                              
+To translate this, we'll  start  with a fresh copy of the Cradle.
+Since we're back to single-character  names, we'll just use a 'p'
+to stand for 'PROGRAM.'
+
+Add the following code to the Cradle, and insert a call to Prog
+from the main program:
+
+
+{--------------------------------------------------------------}
+{ Parse and Translate A Program }
+
+procedure Prog;
+var  Name: char;
+begin
+   Match('p');            { Handles program header part }
+   Name := GetName;
+   Prolog;
+   Match('.');
+   Epilog(Name);
+end;
+{--------------------------------------------------------------}
+
+
+The procedures  Prolog and Epilog perform whatever is required to
+let the program interface with the operating system,  so  that it
+can execute as a program.  Needless to  say,  this  part  will be
+VERY OS-dependent.  Remember, I've been emitting code for a 68000
+running under the OS I use, which is SK*DOS.   I  realize most of
+you are using PC's  and  would rather see something else, but I'm
+in this thing too deep to change now!
+
+Anyhow, SK*DOS is a  particularly  easy OS to interface to.  Here
+is the code for Prolog and Epilog:
+
+
+{--------------------------------------------------------------}
+{ Write the Prolog }
+
+procedure Prolog;
+begin
+   EmitLn('WARMST EQU $A01E');
+end;
+
+
+{--------------------------------------------------------------}
+{ Write the Epilog }
+
+procedure Epilog(Name: char);
+begin
+   EmitLn('DC WARMST');
+   EmitLn('END ' + Name);
+end;
+{--------------------------------------------------------------}
+                              
+As usual, add  this  code  and  try  out the "compiler."  At this
+point, there is only one legal input:
+
+
+     px.   (where x is any single letter, the program name)
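+
+Assuming the standard Cradle, where GetName upcases the name and
+EmitLn writes a tab followed by the text, an input of
+
+     pa.
+
+should produce output along these lines (each line actually
+begins with a tab):
+
+     WARMST EQU $A01E
+     DC WARMST
+     END A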
+
+
+Well,  as  usual  our first effort is rather unimpressive, but by
+now  I'm sure you know that things  will  get  more  interesting.
+There is one important thing to  note:   THE OUTPUT IS A WORKING,
+COMPLETE, AND EXECUTABLE PROGRAM (at least after it's assembled).
+
+This  is  very  important.  The  nice  feature  of  the  top-down
+approach is that at any stage you can  compile  a  subset  of the
+complete language and get  a  program that will run on the target
+machine.    From here on, then, we  need  only  add  features  by
+fleshing out the language constructs.  It's all  very  similar to
+what we've been doing all along, except that we're approaching it
+from the other end.
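+
+As a concrete illustration (a sketch, assuming the cradle's EmitLn
+writes a tab ahead of each line and GetName folds the name to upper
+case, as in the earlier installments), the input "px." should
+produce output along these lines:
+
+
+     WARMST EQU $A01E
+     DC WARMST
+     END X
+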
+
+
+FLESHING IT OUT
+
+To flesh out  the  compiler,  we  only have to deal with language
+features  one by one.  I like to start with a stub procedure that
+does  nothing, then add detail in  incremental  fashion.    Let's
+begin  by  processing  a block, in accordance with its PDL above.
+We can do this in two stages.  First, add the null procedure:
+
+
+{--------------------------------------------------------------}
+{ Parse and Translate a Pascal Block }
+
+procedure DoBlock(Name: char);
+begin
+end;
+{--------------------------------------------------------------}
+
+
+and modify Prog to read:
+
+
+{--------------------------------------------------------------}
+{ Parse and Translate A Program }
+
+procedure Prog;
+var  Name: char;
+begin
+   Match('p');
+   Name := GetName;
+   Prolog;
+   DoBlock(Name);
+   Match('.');
+   Epilog(Name);
+end;
+{--------------------------------------------------------------}
+
+
+That certainly  shouldn't change the behavior of the program, and
+it doesn't.  But now the  definition  of Prog is complete, and we
+can proceed to flesh out DoBlock.  That's done right from its BNF
+definition:
+
+
+{--------------------------------------------------------------}
+{ Parse and Translate a Pascal Block }
+
+procedure DoBlock(Name: char);
+begin
+   Declarations;
+   PostLabel(Name);
+   Statements;
+end;
+{--------------------------------------------------------------}
+
+
+The  procedure  PostLabel  was  defined  in  the  installment  on
+branches.  Copy it into your cradle.
+
+I probably need to  explain  the  reason  for inserting the label
+where I have.  It has to do with the operation of SK*DOS.  Unlike
+some OS's,  SK*DOS allows the entry point to the main  program to
+be  anywhere in the program.  All you have to do is to give  that
+point a name.  The call  to  PostLabel puts that name just before
+the first executable statement  in  the  main  program.  How does
+SK*DOS know which of the many labels is the entry point, you ask?
+It's the one that matches the END statement  at  the  end  of the
+program.
+
+OK,  now  we  need  stubs  for  the  procedures Declarations  and
+Statements.  Make them null procedures as we did before.
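+
+For instance (a minimal sketch; both stubs must appear ahead of
+DoBlock in the source so that DoBlock can call them):
+
+
+{--------------------------------------------------------------}
+{ Stub: Parse and Translate the Declaration Part }
+
+procedure Declarations;
+begin
+end;
+
+
+{--------------------------------------------------------------}
+{ Stub: Parse and Translate the Statement Part }
+
+procedure Statements;
+begin
+end;
+{--------------------------------------------------------------}
+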
+
+Does the program  still run the same?  Then we can move on to the
+next stage.
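+
+For reference, here is roughly what to expect (a sketch, assuming
+PostLabel writes the label at the left margin followed by a colon,
+as in the installment on branches).  The input "px." should now
+give:
+
+
+     WARMST EQU $A01E
+X:
+     DC WARMST
+     END X
+
+
+The only visible change is the label X, which matches the name on
+the END line and so marks the entry point.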
+
+
+DECLARATIONS
+
+The BNF for Pascal declarations is:
+
+
+     <declarations> ::= ( <label list>    |
+                          <constant list> |
+                          <type list>     |
+                          <variable list> |
+                          <procedure>     |
+                          <function>         )*
+                              
+
+(Note  that  I'm  using the more liberal definition used by Turbo
+Pascal.  In the standard Pascal definition, each  of  these parts
+must be in a specific order relative to the rest.)
+
+As  usual,  let's  let a single character represent each of these
+declaration types.  The new form of Declarations is:
+
+
+{--------------------------------------------------------------}
+{ Parse and Translate the Declaration Part }
+
+procedure Declarations;
+begin
+   while Look in ['l', 'c', 't', 'v', 'p', 'f'] do
+      case Look of
+       'l': Labels;
+       'c': Constants;
+       't': Types;
+       'v': Variables;
+       'p': DoProcedure;
+       'f': DoFunction;
+      end;
+end;
+{--------------------------------------------------------------}
+
+
+Of course, we need stub  procedures for each of these declaration
+types.  This time,  they  can't  quite  be null procedures, since
+otherwise we'll end up with an infinite While loop.  At  the very
+least, each recognizer must  eat  the  character that invokes it.
+Insert the following procedures:
+
+
+{--------------------------------------------------------------}
+{ Process Label Statement }
+
+procedure Labels;
+begin
+   Match('l');
+end;
+
+
+{--------------------------------------------------------------}
+{ Process Const Statement }
+
+procedure Constants;
+begin
+   Match('c');
+end;
+
+
+{--------------------------------------------------------------}
+{ Process Type Statement }
+
+procedure Types;
+begin
+   Match('t');
+end;
+
+
+{--------------------------------------------------------------}
+{ Process Var Statement }
+
+procedure Variables;
+begin
+   Match('v');
+end;
+
+
+{--------------------------------------------------------------}
+{ Process Procedure Definition }
+
+procedure DoProcedure;
+begin
+   Match('p');
+end;
+
+
+{--------------------------------------------------------------}
+{ Process Function Definition }
+
+procedure DoFunction;
+begin
+   Match('f');
+end;
+{--------------------------------------------------------------}
+
+
+Now try out the  compiler  with a few representative inputs.  You
+can  mix  the  declarations any way you like, as long as the last
+character  in  the  program is '.' to indicate  the  end  of  the
+program.  Of course,  none  of  the declarations actually declare
+anything, so you don't need  (and can't use) any characters other
+than those standing for the keywords.
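+
+For comparison, a recognizer for the stricter ordering of standard
+Pascal might look like the sketch below.  It is not part of the
+compiler we're building; the name StrictDeclarations is made up for
+illustration, and it relies on the same stub procedures:
+
+
+{--------------------------------------------------------------}
+{ Sketch: Declaration Part in Standard Pascal Order }
+
+procedure StrictDeclarations;
+begin
+   if Look = 'l' then Labels;        { each part at most once, }
+   if Look = 'c' then Constants;     { and in this fixed order }
+   if Look = 't' then Types;
+   if Look = 'v' then Variables;
+   while Look in ['p', 'f'] do       { then any mix of these }
+      case Look of
+       'p': DoProcedure;
+       'f': DoFunction;
+      end;
+end;
+{--------------------------------------------------------------}
+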
+
+We can flesh out the statement  part  in  a similar way.  The BNF
+for it is:
+
+
+     <statements> ::= <compound statement>
+
+     <compound statement> ::= BEGIN <statement>
+                                   (';' <statement>)* END
+
+
+Note that statements can  begin  with  any identifier except END.
+So the first stub form of procedure Statements is:
+                              
+
+{--------------------------------------------------------------}
+{ Parse and Translate the Statement Part }
+
+procedure Statements;
+begin
+   Match('b');
+   while Look <> 'e' do
+      GetChar;
+   Match('e');
+end;
+{--------------------------------------------------------------}
+
+
+At  this  point  the  compiler   will   accept   any   number  of
+declarations, followed by the  BEGIN  block  of the main program.
+This  block  itself  can contain any characters at all (except an
+END), but it must be present.
+
+The simplest form of input is now
+
+     'pxbe.'
+
+Try  it.    Also  try  some  combinations  of  this.   Make  some
+deliberate errors and see what happens.
+
+At this point you should be beginning to see the drill.  We begin
+with a stub translator to process a program, then  we  flesh  out
+each procedure in turn,  based  upon its BNF definition.  Just as
+the lower-level BNF definitions add detail and elaborate upon the
+higher-level ones, the lower-level  recognizers  will  parse more
+detail  of  the  input  program.    When  the  last stub has been
+expanded,  the  compiler  will  be  complete.    That's  top-down
+design/implementation in its purest form.
+
+You might note that even though we've been adding procedures, the
+output of the program hasn't changed.  That's as  it  should  be.
+At these  top  levels  there  is  no  emitted code required.  The
+recognizers are  functioning as just that: recognizers.  They are
+accepting input sentences, catching bad ones, and channeling good
+input to the right places, so  they  are  doing their job.  If we
+were to pursue this a bit longer, code would start to appear.
+
+The  next  step  in our expansion should  probably  be  procedure
+Statements.  The Pascal definition is:
+
+
+    <statement> ::= <simple statement> | <structured statement>
+
+    <simple statement> ::= <assignment> | <procedure call> | null
+
+    <structured statement> ::= <compound statement> |
+                               <if statement>       |
+                               <case statement>     |
+                               <while statement>    |
+                               <repeat statement>   |
+                               <for statement>      |
+                               <with statement>
+
+
+These  are  starting  to look familiar.  As a matter of fact, you
+have already gone  through  the process of parsing and generating
+code for both assignment statements and control structures.  This
+is where the top level meets our bottom-up  approach  of previous
+sessions.  The constructs will be a little  different  from those
+we've  been  using  for KISS, but the differences are nothing you
+can't handle.
+
+I  think  you can get the picture now as to the  procedure.    We
+begin with a complete BNF  description of the language.  Starting
+at  the  top  level, we code  up  the  recognizer  for  that  BNF
+statement, using stubs  for  the next-level recognizers.  Then we
+flesh those lower-level statements out one by one.
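+
+A sketch of what that next step could look like follows.  It is an
+illustration only, not the final code; the names CompoundStatement
+and Statement are made up here.  The compound statement is coded
+from its BNF, with the next-level recognizer left as a stub:
+
+
+{--------------------------------------------------------------}
+{ Stub: Recognize a Single Statement }
+
+procedure Statement;
+begin
+end;                       { null statement: consumes nothing }
+
+
+{--------------------------------------------------------------}
+{ Sketch: Parse a Compound Statement }
+
+procedure CompoundStatement;
+begin
+   Match('b');
+   Statement;
+   while Look = ';' do begin
+      Match(';');
+      Statement;
+   end;
+   Match('e');
+end;
+{--------------------------------------------------------------}
+
+
+With Statements reduced to a call to CompoundStatement, inputs such
+as "pxbe." and "pxb;;e." are accepted, while anything else between
+the 'b' and the 'e' is rejected until Statement is fleshed out.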
+
+As it happens, the definition of Pascal is  very  compatible with
+the  use of BNF, and BNF descriptions  of  the  language  abound.
+Armed  with  such   a   description,  you  will  find  it  fairly
+straightforward to continue the process we've begun.
+
+You  might  have  a go at fleshing a few of these constructs out,
+just  to get a feel for it.  I don't expect you  to  be  able  to
+complete a Pascal compiler  here  ...  there  are too many things
+such  as  procedures  and types that we haven't addressed yet ...
+but  it  might  be helpful to try some of the more familiar ones.
+It will do  you  good  to  see executable programs coming out the
+other end.
+
+If I'm going to address those issues that we haven't covered yet,
+I'd rather  do  it  in  the context of KISS.  We're not trying to
+build a complete Pascal  compiler  just yet, so I'm going to stop
+the expansion of Pascal here.    Let's  take  a  look  at  a very
+different language.
+
+
+THE STRUCTURE OF C
+
+The C language is quite another matter, as you'll see.   Texts on
+C  rarely  include  a BNF definition of  the  language.  Probably
+that's because the language is quite hard to write BNF for.
+
+One reason I'm showing you these structures now is so that  I can
+impress upon you these two facts:
+
+ (1) The definition of  the  language drives the structure of the
+     compiler.  What works for one language may be a disaster for
+     another.    It's  a very bad idea to try to  force  a  given
+     structure upon the compiler.  Rather, you should let the BNF
+     drive the structure, as we have done here.
+                             
+ (2) A language that is hard to write BNF for  will  probably  be
+     hard  to  write  a compiler for, as well.  C  is  a  popular
+     language,  and  it  has  a  reputation  for  letting you  do
+     virtually  anything that is possible to  do.    Despite  the
+     success of Small C, C is _NOT_ an easy language to parse.
+
+
+A C program has  less  structure than its Pascal counterpart.  At
+the top level, everything in C is a static declaration, either of
+data or of a function.  We can capture this thought like this:
+
+
+     <program> ::= ( <global declaration> )*
+
+     <global declaration> ::= <data declaration>  |
+                              <function>
+
+In Small C, functions  can  only have the default type int, which
+is not declared.  This makes  the  input easy to parse: the first
+token is either "int," "char," or the name  of  a  function.   In
+Small  C, the preprocessor commands are  also  processed  by  the
+compiler proper, so the syntax becomes:
+
+
+     <global declaration> ::= '#' <preprocessor command>  |
+                              'int' <data list>           |
+                              'char' <data list>          |
+                              <ident> <function body>
+
+
+Although we're really more interested in full C  here,  I'll show
+you the  code corresponding to this top-level structure for Small
+C.
+
+
+{--------------------------------------------------------------}
+{ Parse and Translate A Program }
+
+procedure Prog;
+begin
+   while Look <> ^Z do begin
+      case Look of
+       '#': PreProc;
+       'i': IntDecl;
+       'c': CharDecl;
+      else DoFunction(Int);
+      end;
+   end;
+end;
+{--------------------------------------------------------------}
+
+Note that I've had to use a ^Z to indicate the end of the source.
+C has no keyword such as END or the '.' to otherwise indicate the
+end.
+                             
+With full C,  things  aren't  even  this easy.  The problem comes
+about because in full C, functions can also have types.   So when
+the compiler sees a  keyword  like  "int,"  it still doesn't know
+whether to expect a  data  declaration  or a function definition.
+Things get more  complicated  since  the  next token may not be a
+name  ... it may start with an '*' or '(', or combinations of the
+two.
+
+More specifically, the BNF for full C begins with:
+
+
+     <program> ::= ( <top-level decl> )*
+
+     <top-level decl> ::= <function def> | <data decl>
+
+     <data decl> ::= [<class>] <type> <decl-list>
+
+     <function def> ::= [<class>] [<type>] <function decl>
+
+
+You  can  now  see the problem:   The  first  two  parts  of  the
+declarations for data and functions can be the same.   Since both
+alternatives can begin with the same tokens, a parser that looks
+only one token ahead can't tell which one to choose, so the grammar
+as written above is not suitable for a recursive-descent parser
+like ours.  Can we transform it into one that is suitable?  Yes,
+with a little work.  Suppose we write it this way:
+
+
+     <top-level decl> ::= [<class>] <decl>
+
+     <decl> ::= <type> <typed decl> | <function decl>
+
+     <typed decl> ::= <data list> | <function decl>
+
+
+We  can  build  a  parsing  routine  for  the   class   and  type
+definitions, and have them store away their findings  and  go on,
+without their ever having to "know" whether a function or  a data
+declaration is being processed.
+
+To begin, key in the following version of the main program:
+
+
+{--------------------------------------------------------------}
+{ Main Program }
+
+begin
+   Init;
+   while Look <> ^Z do begin
+      GetClass;
+      GetType;
+      TopDecl;
+   end;
+end.
+
+{--------------------------------------------------------------}
+
+
+For the first round, just make the three procedures stubs that do
+nothing _BUT_ call GetChar.
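+
+In other words, something like this to start with (a sketch; each
+stub simply swallows one character):
+
+
+{--------------------------------------------------------------}
+{ Stubs for the First Round }
+
+procedure GetClass;
+begin
+   GetChar;
+end;
+
+procedure GetType;
+begin
+   GetChar;
+end;
+
+procedure TopDecl;
+begin
+   GetChar;
+end;
+{--------------------------------------------------------------}
+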
+
+Does this program work?  Well, it would be hard put NOT to, since
+we're not really asking it to do anything.  It's been said that a
+C compiler will accept virtually any input without choking.  It's
+certainly true of THIS  compiler,  since in effect all it does is
+to eat input characters until it finds a ^Z.
+
+Next, let's make  GetClass  do something worthwhile.  Declare the
+global variable
+
+
+     var Class: char;
+
+
+and change GetClass to do the following:
+
+
+{--------------------------------------------------------------}
+{  Get a Storage Class Specifier }
+
+procedure GetClass;
+begin
+   if Look in ['a', 'x', 's'] then begin
+      Class := Look;
+      GetChar;
+      end
+   else Class := 'a';
+end;
+{--------------------------------------------------------------}
+
+
+Here, I've used three  single  characters  to represent the three
+storage classes "auto," "extern,"  and  "static."   These are not
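+the only three possible classes ... there are also "register" and
+"typedef," but this should  give  you the picture.  Note that the
+default class here is "auto."  (Real C doesn't allow "auto" at the
+top level, so treat the 'a' as no more than a placeholder.)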
+
+We  can  do  a  similar  thing  for  types.   Enter the following
+procedure next:
+
+
+{--------------------------------------------------------------}
+{  Get a Type Specifier }
+
+procedure GetType;
+begin
+   Typ := ' ';
+   if Look = 'u' then begin
+      Sign := 'u';
+      Typ := 'i';
+      GetChar;
+      end
+   else Sign := 's';
+   if Look in ['i', 'l', 'c'] then begin
+      Typ := Look;
+      GetChar;
+   end;
+end;
+{--------------------------------------------------------------}
+
+Note that you must add two more global variables, Sign and Typ.
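+For example (a sketch, assuming both are single characters, like
+Class):
+
+
+     var Sign, Typ: char;
+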
+
+With these two procedures in place, the compiler will process the
+class and type definitions and store away their findings.  We can
+now process the rest of the declaration.
+
+We  are by no means out of the woods yet, because there are still
+many complexities just in the definition of the  type,  before we
+even get to the actual data or function names.  Let's pretend for
+the moment that we have passed all those gates, and that the next
+thing in the  input stream is a name.  If the name is followed by
+a left paren, we have a function declaration.  If not, we have at
+least one data item,  and  possibly a list, each element of which
+can have an initializer.
+
+Insert the following version of TopDecl:
+
+
+{--------------------------------------------------------------}
+{ Process a Top-Level Declaration }
+
+procedure TopDecl;
+var Name: char;
+begin
+   Name := GetName;
+   if Look = '(' then
+      DoFunc(Name)
+   else
+      DoData(Name);
+end;
+{--------------------------------------------------------------}
+
+
+(Note that, since we have already read the name, we must  pass it
+along to the appropriate routine.)
+
+Finally, add the two procedures DoFunc and DoData:
+
+
+{--------------------------------------------------------------}
+{ Process a Function Definition }
+
+procedure DoFunc(n: char);
+begin
+   Match('(');
+   Match(')');
+   Match('{');
+   Match('}');
+   if Typ = ' ' then Typ := 'i';
+   WriteLn(Class, Sign, Typ, ' function ', n);
+end;
+
+{--------------------------------------------------------------}
+{ Process a Data Declaration }
+
+procedure DoData(n: char);
+begin
+   if Typ = ' ' then Expected('Type declaration');
+   WriteLn(Class, Sign, Typ, ' data ', n);
+   while Look = ',' do begin
+      Match(',');
+      n := GetName;
+      WriteLn(Class, Sign, Typ, ' data ', n);
+   end;
+   Match(';');
+end;
+{--------------------------------------------------------------}
+
+
+Since  we're  still  a long way from producing executable code, I
+decided to just have these two routines tell us what they found.
+
+OK, give this program a try.    For data declarations, it's OK to
+give a list separated by commas.  We  can't  process initializers
+as yet.  We also can't process argument lists for  the functions,
+but the "(){}" characters should be there.
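+
+Here is what a few sample declarations should report (a sketch,
+assuming the cradle's GetName folds names to upper case and that
+each input is entered with no embedded spaces):
+
+
+     ix;          asi data X
+     sca,b;       ssc data A
+                  ssc data B
+     xulf(){}     xul function F
+     g(){}        asi function G
+     g;           error (a type declaration was expected)
+
+
+The three leading characters of each report are Class, Sign, and
+Typ, in that order.  The last two cases show the difference in
+defaults: a function with no type is taken to be int, while a data
+item with no type is an error.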
+
+We're still a _VERY_ long way from having a C compiler,  but what
+we have is starting to process the right kinds of inputs,  and is
+recognizing both good  and  bad  inputs.    In  the  process, the
+natural structure of the compiler is starting to take form.
+
+Can we continue this until we have something that acts  more like
+a compiler?  Of course we can.  Should we?  That's another matter.
+I don't know about you, but I'm beginning to get dizzy, and we've
+still  got  a  long  way  to  go  to  even  get  past   the  data
+declarations.
+
+At  this  point,  I think you can see how the  structure  of  the
+compiler evolves from the language  definition.    The structures
+we've seen for our  two  examples, Pascal and C, are as different
+as night and day.  Pascal was designed at least partly to be easy
+to parse, and that's  reflected  in the compiler.  In general, in
+Pascal there is more structure and we have a better idea  of what
+kinds of constructs to expect at any point.  In  C,  on the other
+hand,  the  program  is  essentially  a  list   of  declarations,
+terminated only by the end of file.
+
+We  could  pursue  both  of  these structures much  farther,  but
+remember that our purpose here is  not  to  build a Pascal or a C
+compiler, but rather to study compilers in general.  For those of
+you  who DO want to deal with Pascal or C, I hope I've given  you
+enough of a start so that you can  take  it  from  here (although
+you'll soon need some of the stuff  we  haven't  covered  yet,
+such as typing and procedure calls).    For the rest of you, stay
+with me through the next installment.  There, I'll be leading you
+through the development of a complete compiler for TINY, a subset
+of KISS.
+
+See you then.
+
+
+*****************************************************************
+*                                                               *
+*                        COPYRIGHT NOTICE                       *
+*                                                               *
+*   Copyright (C) 1989 Jack W. Crenshaw. All rights reserved.   *
+*                                                               *
+*****************************************************************
+