diff --git a/Assembler/Assembly_Language_Primer_for_the_IBM_PC_&_XT.pdf b/Assembler/Assembly_Language_Primer_for_the_IBM_PC_&_XT.pdf new file mode 100644 index 0000000..7bf8e2e --- /dev/null +++ b/Assembler/Assembly_Language_Primer_for_the_IBM_PC_&_XT.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:d650455cdbfa05ee641432d6ca84ac0bc300e6da2277f9619a4ff1ee9c2ef391 +size 11873892 diff --git a/Assembler/Assembly_Language_Succinctly.pdf b/Assembler/Assembly_Language_Succinctly.pdf new file mode 100644 index 0000000..11e197b --- /dev/null +++ b/Assembler/Assembly_Language_Succinctly.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:050015a23f70c763a8dc4a0b9effe1adf563f865e271c13be7ac0225604e576e +size 2291462 diff --git a/Assembler/Compute's_Beginners_Guide_to_Machine_Language_on_the_IBM_PC_and_PCjr.pdf b/Assembler/Compute's_Beginners_Guide_to_Machine_Language_on_the_IBM_PC_and_PCjr.pdf new file mode 100644 index 0000000..e6b77c9 --- /dev/null +++ b/Assembler/Compute's_Beginners_Guide_to_Machine_Language_on_the_IBM_PC_and_PCjr.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:8a74ff8b60dfbd60515a7c321c7c087f6948e85cd8d96b35d643850116af7fd7 +size 7489302 diff --git a/Assembler/Intro_to_x86_Assembly_with_FASM_Part_1.pdf b/Assembler/Intro_to_x86_Assembly_with_FASM_Part_1.pdf new file mode 100644 index 0000000..bf52b10 --- /dev/null +++ b/Assembler/Intro_to_x86_Assembly_with_FASM_Part_1.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:b763b62046ab62fb14a2151d1e49cef36e425cf96ec04c5640b62be11016474e +size 445468 diff --git a/Assembler/Intro_to_x86_Assembly_with_FASM_Part_2.pdf b/Assembler/Intro_to_x86_Assembly_with_FASM_Part_2.pdf new file mode 100644 index 0000000..96b8b0d --- /dev/null +++ b/Assembler/Intro_to_x86_Assembly_with_FASM_Part_2.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid 
sha256:ec2753b3be6fe601f7fe946ed255397999178bd3e3a9696a8aaf207dd133b540 +size 825226 diff --git a/Assembler/Introduction_to_x64_Assembly.pdf b/Assembler/Introduction_to_x64_Assembly.pdf new file mode 100644 index 0000000..29ee9bd --- /dev/null +++ b/Assembler/Introduction_to_x64_Assembly.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:c90e4188abbec2ce7f5a12f1032120ff39cb85436b7a7df0752237791d605770 +size 310818 diff --git a/Assembler/ProgrammingGroundUp-1-0-booksize.pdf b/Assembler/ProgrammingGroundUp-1-0-booksize.pdf new file mode 100644 index 0000000..1164b02 --- /dev/null +++ b/Assembler/ProgrammingGroundUp-1-0-booksize.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:663bd554622af154a94e0363fbd8b5b3e93137247f6eeada77005c911ec74513 +size 1383853 diff --git a/Assembler/The_Art_of_Computer_Programming - Vol 1.pdf b/Assembler/The_Art_of_Computer_Programming - Vol 1.pdf new file mode 100644 index 0000000..32a6e06 --- /dev/null +++ b/Assembler/The_Art_of_Computer_Programming - Vol 1.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:b8f1af511932ca09797319ce48e630cb4a9b03b32d6c77ccdba1ef9bae124be1 +size 36255819 diff --git a/Assembler/aoa.pdf b/Assembler/aoa.pdf new file mode 100644 index 0000000..3110355 --- /dev/null +++ b/Assembler/aoa.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:eedc201299c8f3e295efa92a0bcd88173f794e3cf2412a9fb27931123f0e8cdb +size 4503986 diff --git a/Assembler/assembly_tutorial.pdf b/Assembler/assembly_tutorial.pdf new file mode 100644 index 0000000..84a27ba --- /dev/null +++ b/Assembler/assembly_tutorial.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:a11626978eb766ee6775af2546bcbeb68f0867c019f257a9f283cf75e9f2efb4 +size 2113635 diff --git a/Assembler/flazx_32-64-bit-80x86-assembly-language-architecture.chm b/Assembler/flazx_32-64-bit-80x86-assembly-language-architecture.chm new file mode 100644 index 
0000000..6ca9f7e Binary files /dev/null and b/Assembler/flazx_32-64-bit-80x86-assembly-language-architecture.chm differ diff --git a/Assembler/flazx_6502-assembly-language-programming-self-teaching-guides-.pdf b/Assembler/flazx_6502-assembly-language-programming-self-teaching-guides-.pdf new file mode 100644 index 0000000..de7e57a --- /dev/null +++ b/Assembler/flazx_6502-assembly-language-programming-self-teaching-guides-.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:830c1af3a4b69c0c1c89966a3bae846b52c76cf709e017d59b37422fd880eed2 +size 7687692 diff --git a/Assembler/flazx_assembly-language-step-by-step.pdf b/Assembler/flazx_assembly-language-step-by-step.pdf new file mode 100644 index 0000000..e51ba43 --- /dev/null +++ b/Assembler/flazx_assembly-language-step-by-step.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:9445138c8b276fd90e63a761a930087c893077c14d0de08a85d3d8d740f3ca51 +size 7981280 diff --git a/Assembler/flazx_introduction-to-80x86-assembly-language-and-computer-architecture.chm b/Assembler/flazx_introduction-to-80x86-assembly-language-and-computer-architecture.chm new file mode 100644 index 0000000..b5ffa81 Binary files /dev/null and b/Assembler/flazx_introduction-to-80x86-assembly-language-and-computer-architecture.chm differ diff --git a/Assembler/flazx_professional-assembly-language.pdf b/Assembler/flazx_professional-assembly-language.pdf new file mode 100644 index 0000000..694fbc2 --- /dev/null +++ b/Assembler/flazx_professional-assembly-language.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:54b1e2d61d915ed99084d9af68a70faa930f5287694be8c7ad4da2d4708bbb84 +size 6933547 diff --git a/Assembler/flazx_the-art-of-assembly-language.chm b/Assembler/flazx_the-art-of-assembly-language.chm new file mode 100644 index 0000000..7be2f03 Binary files /dev/null and b/Assembler/flazx_the-art-of-assembly-language.chm differ diff --git a/Assembler/optimizing_assembly.pdf 
b/Assembler/optimizing_assembly.pdf new file mode 100644 index 0000000..cc4bd05 --- /dev/null +++ b/Assembler/optimizing_assembly.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:ecab9b3baa077c4975cc8a906128d727aa532ea12f558d53bfad777c512a9511 +size 1507033 diff --git a/Assembler/pc_underground.pdf b/Assembler/pc_underground.pdf new file mode 100644 index 0000000..679f377 --- /dev/null +++ b/Assembler/pc_underground.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:f3c335386e301ad7347c59fdbbd3678c4f0f63f6b0895789ac0f093328cbf8e3 +size 5205631 diff --git a/Assembler/pcasm-book.pdf b/Assembler/pcasm-book.pdf new file mode 100644 index 0000000..37a0607 --- /dev/null +++ b/Assembler/pcasm-book.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:74833408c26ea3071cdb0cf312d8d7362b3e2d0228feb9a91c4c27da63503e64 +size 1071411 diff --git a/Assembler/pdfaoa.zip b/Assembler/pdfaoa.zip new file mode 100644 index 0000000..83d1389 --- /dev/null +++ b/Assembler/pdfaoa.zip @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:ab30904849a90e5250a139f51e82c9492180d3b905a14c2ba9a4e7d8f8a0d695 +size 4594814 diff --git a/C++/ADDISON-WESLEY--Design-Patterns--Elements-of-Reusable.pdf b/C++/ADDISON-WESLEY--Design-Patterns--Elements-of-Reusable.pdf new file mode 100644 index 0000000..743da70 --- /dev/null +++ b/C++/ADDISON-WESLEY--Design-Patterns--Elements-of-Reusable.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:7f7174e1024bd9df0dd65a99146ee09d303e6270eab8c05c8f9693bffd44b601 +size 3608198 diff --git a/C++/Bjarne_Stroustrup_-_The_C++_Programming_Language_3rd_Ed.pdf b/C++/Bjarne_Stroustrup_-_The_C++_Programming_Language_3rd_Ed.pdf new file mode 100644 index 0000000..69bf2b6 --- /dev/null +++ b/C++/Bjarne_Stroustrup_-_The_C++_Programming_Language_3rd_Ed.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid 
sha256:9d8972eea0dfcd849665daf8d586f3979538624c9859dbf2d1ac718826b0b57d +size 4628794 diff --git a/C++/Borland_C___Power_Programming_Book_and_Disk__Programming_.pdf b/C++/Borland_C___Power_Programming_Book_and_Disk__Programming_.pdf new file mode 100644 index 0000000..e509403 --- /dev/null +++ b/C++/Borland_C___Power_Programming_Book_and_Disk__Programming_.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:58bc24acc9655b0dcbb7e46f612cf3e4475dfa328cfb0c69ed9e32f7c4a25f38 +size 3576258 diff --git a/C++/C++ A Beginner's Guide 2nd Edition (2003).pdf b/C++/C++ A Beginner's Guide 2nd Edition (2003).pdf new file mode 100644 index 0000000..a2fc237 --- /dev/null +++ b/C++/C++ A Beginner's Guide 2nd Edition (2003).pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:a0f4543375be402cb2ed3c08af2c1c592d31acc0d136728df58992a1383434a8 +size 11880323 diff --git a/C++/C++ Templates The Complete Guide.pdf b/C++/C++ Templates The Complete Guide.pdf new file mode 100644 index 0000000..55782fd --- /dev/null +++ b/C++/C++ Templates The Complete Guide.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:d586454e0f5738cc053b63a53e17f5e9d4b792212e917bc6ab8e2acf41163b53 +size 2850310 diff --git a/C++/C++.Standard.Library.The.A.Tutorial.And.Reference.chm b/C++/C++.Standard.Library.The.A.Tutorial.And.Reference.chm new file mode 100644 index 0000000..06058f5 Binary files /dev/null and b/C++/C++.Standard.Library.The.A.Tutorial.And.Reference.chm differ diff --git a/C++/C++.pdf b/C++/C++.pdf new file mode 100644 index 0000000..8cf97ec --- /dev/null +++ b/C++/C++.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:15ea92e29b55fe5c72fc8bd5ff9b05519160bad011fc52d0975dd5b43f5327a3 +size 10829969 diff --git a/C++/C++_tutorial.pdf b/C++/C++_tutorial.pdf new file mode 100644 index 0000000..680564c --- /dev/null +++ b/C++/C++_tutorial.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid 
sha256:aafd64b759b896ceed90c88625c08f215f2a3b0a01ccf47e64239875c5710aa6 +size 1272254 diff --git a/C++/C++_za_apsolutne_pocetnike.pdf b/C++/C++_za_apsolutne_pocetnike.pdf new file mode 100644 index 0000000..64a435f --- /dev/null +++ b/C++/C++_za_apsolutne_pocetnike.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:8433f9849797d92ba278174f59f2935f40e40d2ffb6b2923db9878fbc9a3e482 +size 6379147 diff --git a/C++/Cpp.FAQs.2nd.chm b/C++/Cpp.FAQs.2nd.chm new file mode 100644 index 0000000..a161370 Binary files /dev/null and b/C++/Cpp.FAQs.2nd.chm differ diff --git a/C++/CppEssentials.pdf b/C++/CppEssentials.pdf new file mode 100644 index 0000000..3d5ddba --- /dev/null +++ b/C++/CppEssentials.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:12b21152fc6b9e80b63a51e13d86a2a60446a454f2e7f682d90be749a9de60aa +size 553591 diff --git a/C++/Data Structures and Algorithm in C++.pdf b/C++/Data Structures and Algorithm in C++.pdf new file mode 100644 index 0000000..17f0ec3 --- /dev/null +++ b/C++/Data Structures and Algorithm in C++.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:86b603fe68d50bd86ce757a9070de0a04e75d517aaf0804976e2a5a12fbb5cb5 +size 2681639 diff --git a/C++/Getting Started with C++ Audio Programming for Game Development.pdf b/C++/Getting Started with C++ Audio Programming for Game Development.pdf new file mode 100644 index 0000000..f4bb273 --- /dev/null +++ b/C++/Getting Started with C++ Audio Programming for Game Development.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:966c2c5dfefbd4c1407b32ee3409ef433a6d9f5de3db53edf93b1d1967cf68ed +size 1787277 diff --git a/C++/How to think like a computer scientist(C++ Ver).pdf b/C++/How to think like a computer scientist(C++ Ver).pdf new file mode 100644 index 0000000..dc9fa9d --- /dev/null +++ b/C++/How to think like a computer scientist(C++ Ver).pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid 
sha256:3e0206dd41edc5741d6695a805b073c48aab46b9966ebbf33857c17c9494a885 +size 845116 diff --git a/C++/Introduction to C++ Programming and Graphics.pdf b/C++/Introduction to C++ Programming and Graphics.pdf new file mode 100644 index 0000000..3eceb07 --- /dev/null +++ b/C++/Introduction to C++ Programming and Graphics.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:3bbd2a931dae3c48f3e77da465dec5e3fd64db28cd245e3a78f75a1076be36d0 +size 3794721 diff --git a/C++/Object-Oriented.pdf b/C++/Object-Oriented.pdf new file mode 100644 index 0000000..3f62ebb --- /dev/null +++ b/C++/Object-Oriented.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:5960149f8261e67eb31290b70337cfed90c0cfc7713a748e482987117073ff1e +size 11253116 diff --git a/C++/Objektno_orijentisano_programiranje_C++.pdf b/C++/Objektno_orijentisano_programiranje_C++.pdf new file mode 100644 index 0000000..ca1b1d4 --- /dev/null +++ b/C++/Objektno_orijentisano_programiranje_C++.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:b915480b132b0246c5ad6b6f4c5b9b1f2852cc57075655c10ee21e29ff5b4004 +size 706887 diff --git a/C++/Optimizing Softwares in C++.pdf b/C++/Optimizing Softwares in C++.pdf new file mode 100644 index 0000000..c848054 --- /dev/null +++ b/C++/Optimizing Softwares in C++.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:9e41283f414ea8e14d453652a30e74c4853d06f15170d50b4c2e965ff3a4e267 +size 1008259 diff --git a/C++/Real Time C++(OOP).pdf b/C++/Real Time C++(OOP).pdf new file mode 100644 index 0000000..64617e2 --- /dev/null +++ b/C++/Real Time C++(OOP).pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:94c6caa7bd49df93662de2adb43ceb855adf17730f7b51093ee922a34c5ae069 +size 2884247 diff --git a/C++/Teach Yourself C++ in 21 Days 5th Edition.pdf b/C++/Teach Yourself C++ in 21 Days 5th Edition.pdf new file mode 100644 index 0000000..847ad85 --- /dev/null +++ b/C++/Teach Yourself 
C++ in 21 Days 5th Edition.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:786c6d4b3a763c347cf88629bc35a8ef4d0f1d4c972b1d57d4843d6e9f882107 +size 6989997 diff --git a/C++/The-c++-programming-language-3rdEdition-Bjarne.pdf b/C++/The-c++-programming-language-3rdEdition-Bjarne.pdf new file mode 100644 index 0000000..d195c2d --- /dev/null +++ b/C++/The-c++-programming-language-3rdEdition-Bjarne.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:cdadcedbeb2229a768ba8efc18c7c588e720aa70326176a6fe577efe00aa2a8b +size 3548194 diff --git a/C++/cpp_tutorial.pdf b/C++/cpp_tutorial.pdf new file mode 100644 index 0000000..eaff289 --- /dev/null +++ b/C++/cpp_tutorial.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:4b4f5b68093949fed55cfa5b7e339d5ef3fb433ec2778d65600c677195b5f24a +size 2641088 diff --git a/C++/flazx_c-primer-third-edition.pdf b/C++/flazx_c-primer-third-edition.pdf new file mode 100644 index 0000000..faf8dda --- /dev/null +++ b/C++/flazx_c-primer-third-edition.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:b1c0bf5628434b937c418b8e59666fa1b1453ff579a934bb86fb8dbdb8c65e4b +size 2491783 diff --git a/C++/lcpp.pdf b/C++/lcpp.pdf new file mode 100644 index 0000000..4ac7da6 --- /dev/null +++ b/C++/lcpp.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:c2693ef58ab80608bf43fc7c1f95fe758faad6752356772de9a416ece8cffaba +size 281668 diff --git a/C++/new_learning.pdf b/C++/new_learning.pdf new file mode 100644 index 0000000..6974837 --- /dev/null +++ b/C++/new_learning.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:27613ab2890a94defad6bf5945b15a5b84eed95ee265ec3de1d7dc44e6a21db5 +size 40923 diff --git a/C++/optimizing_cpp.pdf b/C++/optimizing_cpp.pdf new file mode 100644 index 0000000..a2700db --- /dev/null +++ b/C++/optimizing_cpp.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid 
sha256:e8543a97526d0660840657c340878def792ef740c26bccf1a4d4ae7a770cc7c2 +size 1833301 diff --git a/C++/teachyourselfcplusplusin21days.pdf b/C++/teachyourselfcplusplusin21days.pdf new file mode 100644 index 0000000..b852f3c --- /dev/null +++ b/C++/teachyourselfcplusplusin21days.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:b0b063a60755dc1d0b0680356a253f992d9cdcd21f242e4d95341ad2ec08fb04 +size 1747809 diff --git a/C++/vc++6kurs.pdf b/C++/vc++6kurs.pdf new file mode 100644 index 0000000..51085a5 --- /dev/null +++ b/C++/vc++6kurs.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:5e9a21f592a363d02bde06461b80b2f663edd4ce617c1364df889ba5ba468751 +size 757343 diff --git a/C/Apress.Beginning.C.From.Novice.to.Professional.4th.Edition.Oct.2006.pdf b/C/Apress.Beginning.C.From.Novice.to.Professional.4th.Edition.Oct.2006.pdf new file mode 100644 index 0000000..a2fdc2d --- /dev/null +++ b/C/Apress.Beginning.C.From.Novice.to.Professional.4th.Edition.Oct.2006.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:5c23185b3e081ed460b0208e24895b94f900b7529c447fe34aab96d2b9624e40 +size 13925078 diff --git a/C/C_Language_Users_Handbook.pdf b/C/C_Language_Users_Handbook.pdf new file mode 100644 index 0000000..58e5b71 --- /dev/null +++ b/C/C_Language_Users_Handbook.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:25571ed0dbbf496548744714c0e8d1b9b499a8f921aaeea592e1a3e76e28980c +size 33129520 diff --git a/C/C_programski_jezik.pdf b/C/C_programski_jezik.pdf new file mode 100644 index 0000000..3999215 --- /dev/null +++ b/C/C_programski_jezik.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:1e308d896e90f07b21fdeb72a68b600cad5a1db2f4ff28bb4f1adc5c2187d8f8 +size 834313 diff --git a/C/Learn C programming language in 24 hours.pdf b/C/Learn C programming language in 24 hours.pdf new file mode 100644 index 0000000..aa4e2cb --- /dev/null +++ b/C/Learn C programming 
language in 24 hours.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:59436df62804750e9e82b965426d4f3f793a85e317a20a5b6b472d80dc16f369 +size 830317 diff --git a/C/Objective-C Succinctly.pdf b/C/Objective-C Succinctly.pdf new file mode 100644 index 0000000..5863726 --- /dev/null +++ b/C/Objective-C Succinctly.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:608ebd051a2968277d899801553c241301f683a6db82bb67e06bc65678355932 +size 3649710 diff --git a/C/Practical C Programming, 3rd Edition (1997).pdf b/C/Practical C Programming, 3rd Edition (1997).pdf new file mode 100644 index 0000000..45d9be4 --- /dev/null +++ b/C/Practical C Programming, 3rd Edition (1997).pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:2bd5eacd609c704a76a052372d112c7e975ade7f80a310b0395269037ec0bfa3 +size 6152829 diff --git a/C/The C Programming Language ANSI C Version_2.pdf b/C/The C Programming Language ANSI C Version_2.pdf new file mode 100644 index 0000000..670cf7c --- /dev/null +++ b/C/The C Programming Language ANSI C Version_2.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:9ceb7afe41bafd627bdac60121e1072f713ac8f40b0e2a5d041b2e877980cdf2 +size 1346027 diff --git a/C/The_C_programming_language.pdf b/C/The_C_programming_language.pdf new file mode 100644 index 0000000..57a4eca --- /dev/null +++ b/C/The_C_programming_language.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:13f07729c0cf1dd7057734959a11c0f0a1571ab9dbe2a5b756dd940d782f6908 +size 2900111 diff --git a/C/Turbo_C_Users_Guide_1987.pdf b/C/Turbo_C_Users_Guide_1987.pdf new file mode 100644 index 0000000..d3fd182 --- /dev/null +++ b/C/Turbo_C_Users_Guide_1987.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:46ca5be305f919573a405d8b2666df1fa04402dbe938496ac15bbfc9c06f7b64 +size 10489658 diff --git a/Compiler construction/11-ghuloum.pdf b/Compiler construction/11-ghuloum.pdf new 
file mode 100644 index 0000000..7e9b5e6 --- /dev/null +++ b/Compiler construction/11-ghuloum.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:643a428b34248da898c82ea7bde4a9126a6f975165ba6d29c2b700bdd798cd83 +size 595682 diff --git a/Compiler construction/BookBody.pdf b/Compiler construction/BookBody.pdf new file mode 100644 index 0000000..4ebf884 --- /dev/null +++ b/Compiler construction/BookBody.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:d637366eacdc009656b9f80d48725d517d75d4cc2cea5a0caf76083b71b434d7 +size 1381114 diff --git a/Compiler construction/Compiler Design - Formal Syntax And Semanti.pdf b/Compiler construction/Compiler Design - Formal Syntax And Semanti.pdf new file mode 100644 index 0000000..099560d --- /dev/null +++ b/Compiler construction/Compiler Design - Formal Syntax And Semanti.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:01ccc994de164f2196ba950c31fb6d4f9a039384f363adfbaf28d40eb9aebb8c +size 1787529 diff --git a/Compiler construction/Compiler_construction.pdf b/Compiler construction/Compiler_construction.pdf new file mode 100644 index 0000000..1074ce9 --- /dev/null +++ b/Compiler construction/Compiler_construction.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:bb63b8da4582b32d82d29a423ec5835652d69fad05c14be76df3bab22ba61157 +size 6565683 diff --git a/Compiler construction/Engineering A Compiler.pdf b/Compiler construction/Engineering A Compiler.pdf new file mode 100644 index 0000000..f7057e1 --- /dev/null +++ b/Compiler construction/Engineering A Compiler.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:7dffbf0843128fa40373b45d3099d2f42b9d4aa3f7fb7650b9664319fad6346d +size 2309778 diff --git a/Compiler construction/Lets_build_compiler/readme.txt b/Compiler construction/Lets_build_compiler/readme.txt new file mode 100644 index 0000000..c637eab --- /dev/null +++ b/Compiler 
construction/Lets_build_compiler/readme.txt @@ -0,0 +1,54 @@ +TUTOR.ZIP + +This file contains all of the installments of Jack Crenshaw's +tutorial on compiler construction, including the new Installment 15. +The intended audience is those folks who are not computer scientists, +but who enjoy computing and have always wanted to know how compilers +work. A lot of compiler theory has been left out, but the practical +issues are covered. By the time you have completed the series, you +should be able to design and build your own working compiler. It will +not be the world's best, nor will it put out incredibly tight code. +Your product will probably never put Borland or MicroSoft out of +business. But it will work, and it will be yours. + +A word about the file format: The files were originally created using +Borland's DOS editor, Sprint. Sprint could write to a text file only +if you formatted the file to go to the selected printer. I used the +most common printer I could think of, the Epson MX-80, but even then +the files ended up with printer control sequences at the beginning +and end of each page. + +To bring the files up to date and get myself positioned to continue +the series, I recently (1994) converted all the files to work with +Microsoft Word for Windows. Unlike Sprint, Word allows you to write +the file as a DOS text file. Unfortunately, this gave me a new +problem, because when Word is writing to a text file, it doesn't +write hard page breaks or page numbers. In other words, in six years +we've gone from a file with page breaks and page numbers, but +embedded escape sequences, to files with no embedded escape sequences +but no page breaks or page numbers. Isn't progress wonderful? + +Of course, it's possible for me to insert the page numbers as +straight text, rather than asking the editor to do it for me. 
But +since Word won't allow me to write page breaks to the file, we would +end up with files with page numbers that may or may not fall at the +ends of the pages, depending on your editor and your printer. It +seems to me that almost every file I've ever downloaded from +CompuServe or BBS's that had such page numbering was incompatible +with my printer, and gave me pages that were one line short or one +line long, with the page numbers consequently walking up the page. + +So perhaps this new format is, after all, the safest one for general +distribution. The files as they exist will look just fine if read +into any text editor capable of reading DOS text files. Since most +editors these days include rather sophisticated word processing +capabilities, you should be able to get your editor to paginate for +you, prior to printing. + +I hope you like the tutorials. Much thought went into them. + + + Jack W. Crenshaw + + CompuServe 72325,1327 + \ No newline at end of file diff --git a/Compiler construction/Lets_build_compiler/tutor1.txt b/Compiler construction/Lets_build_compiler/tutor1.txt new file mode 100644 index 0000000..ab3bd9e --- /dev/null +++ b/Compiler construction/Lets_build_compiler/tutor1.txt @@ -0,0 +1,398 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + LET'S BUILD A COMPILER! + + By + + Jack W. Crenshaw, Ph.D. + + 24 July 1988 + + + Part I: INTRODUCTION + + +***************************************************************** +* * +* COPYRIGHT NOTICE * +* * +* Copyright (C) 1988 Jack W. Crenshaw. All rights reserved. * +* * +***************************************************************** + + +INTRODUCTION + + +This series of articles is a tutorial on the theory and practice +of developing language parsers and compilers. Before we are +finished, we will have covered every aspect of compiler +construction, designed a new programming language, and built a +working compiler. + +Though I am not a computer scientist by education (my Ph.D. 
is in +a different field, Physics), I have been interested in compilers +for many years. I have bought and tried to digest the contents +of virtually every book on the subject ever written. I don't +mind telling you that it was slow going. Compiler texts are +written for Computer Science majors, and are tough sledding for +the rest of us. But over the years a bit of it began to seep in. +What really caused it to jell was when I began to branch off on +my own and begin to try things on my own computer. Now I plan to +share with you what I have learned. At the end of this series +you will by no means be a computer scientist, nor will you know +all the esoterics of compiler theory. I intend to completely +ignore the more theoretical aspects of the subject. What you +_WILL_ know is all the practical aspects that one needs to know +to build a working system. + +This is a "learn-by-doing" series. In the course of the series I +will be performing experiments on a computer. You will be +expected to follow along, repeating the experiments that I do, +and performing some on your own. I will be using Turbo Pascal +4.0 on a PC clone. I will periodically insert examples written +in TP. These will be executable code, which you will be expected +to copy into your own computer and run. If you don't have a copy +of Turbo, you will be severely limited in how well you will be +able to follow what's going on. If you don't have a copy, I urge +you to get one. After all, it's an excellent product, good for +many other uses! + +Some articles on compilers show you examples, or show you (as in +the case of Small-C) a finished product, which you can then copy +and use without a whole lot of understanding of how it works. I +hope to do much more than that. I hope to teach you HOW the +things get done, so that you can go off on your own and not only +reproduce what I have done, but improve on it. + +This is admittedly an ambitious undertaking, and it won't be done +in one page. 
I expect to do it in the course of a number of +articles. Each article will cover a single aspect of compiler +theory, and will pretty much stand alone. If all you're +interested in at a given time is one aspect, then you need to +look only at that one article. Each article will be uploaded as +it is complete, so you will have to wait for the last one before +you can consider yourself finished. Please be patient. + + + +The average text on compiler theory covers a lot of ground that +we won't be covering here. The typical sequence is: + + o An introductory chapter describing what a compiler is. + + o A chapter or two on syntax equations, using Backus-Naur Form + (BNF). + + o A chapter or two on lexical scanning, with emphasis on + deterministic and non-deterministic finite automata. + + o Several chapters on parsing theory, beginning with top-down + recursive descent, and ending with LALR parsers. + + o A chapter on intermediate languages, with emphasis on P-code + and similar reverse polish representations. + + o Many chapters on alternative ways to handle subroutines and + parameter passing, type declarations, and such. + + o A chapter toward the end on code generation, usually for some + imaginary CPU with a simple instruction set. Most readers + (and in fact, most college classes) never make it this far. + + o A final chapter or two on optimization. This chapter often + goes unread, too. + + +I'll be taking a much different approach in this series. To +begin with, I won't dwell long on options. I'll be giving you +_A_ way that works. If you want to explore options, well and +good ... I encourage you to do so ... but I'll be sticking to +what I know. I also will skip over most of the theory that puts +people to sleep. Don't get me wrong: I don't belittle the +theory, and it's vitally important when it comes to dealing with +the more tricky parts of a given language. But I believe in +putting first things first. 
Here we'll be dealing with the 95% +of compiler techniques that don't need a lot of theory to handle. + +I also will discuss only one approach to parsing: top-down, +recursive descent parsing, which is the _ONLY_ technique that's +at all amenable to hand-crafting a compiler. The other +approaches are only useful if you have a tool like YACC, and also +don't care how much memory space the final product uses. + +I also take a page from the work of Ron Cain, the author of the +original Small C. Whereas almost all other compiler authors have +historically used an intermediate language like P-code and +divided the compiler into two parts (a front end that produces +P-code, and a back end that processes P-code to produce +executable object code), Ron showed us that it is a +straightforward matter to make a compiler directly produce +executable object code, in the form of assembler language +statements. The code will _NOT_ be the world's tightest code ... +producing optimized code is a much more difficult job. But it +will work, and work reasonably well. Just so that I don't leave +you with the impression that our end product will be worthless, I +_DO_ intend to show you how to "soup up" the compiler with some +optimization. + + + +Finally, I'll be using some tricks that I've found to be most +helpful in letting me understand what's going on without wading +through a lot of boiler plate. Chief among these is the use of +single-character tokens, with no embedded spaces, for the early +design work. I figure that if I can get a parser to recognize +and deal with I-T-L, I can get it to do the same with IF-THEN- +ELSE. And I can. In the second "lesson," I'll show you just +how easy it is to extend a simple parser to handle tokens of +arbitrary length. As another trick, I completely ignore file +I/O, figuring that if I can read source from the keyboard and +output object to the screen, I can also do it from/to disk files. 
+Experience has proven that once a translator is working +correctly, it's a straightforward matter to redirect the I/O to +files. The last trick is that I make no attempt to do error +correction/recovery. The programs we'll be building will +RECOGNIZE errors, and will not CRASH, but they will simply stop +on the first error ... just like good ol' Turbo does. There will +be other tricks that you'll see as you go. Most of them can't be +found in any compiler textbook, but they work. + +A word about style and efficiency. As you will see, I tend to +write programs in _VERY_ small, easily understood pieces. None +of the procedures we'll be working with will be more than about +15-20 lines long. I'm a fervent devotee of the KISS (Keep It +Simple, Sidney) school of software development. I try to never +do something tricky or complex, when something simple will do. +Inefficient? Perhaps, but you'll like the results. As Brian +Kernighan has said, FIRST make it run, THEN make it run fast. +If, later on, you want to go back and tighten up the code in one +of our products, you'll be able to do so, since the code will be +quite understandable. If you do so, however, I urge you to wait +until the program is doing everything you want it to. + +I also have a tendency to delay building a module until I +discover that I need it. Trying to anticipate every possible +future contingency can drive you crazy, and you'll generally +guess wrong anyway. In this modern day of screen editors and +fast compilers, I don't hesitate to change a module when I feel I +need a more powerful one. Until then, I'll write only what I +need. + +One final caveat: One of the principles we'll be sticking to here +is that we don't fool around with P-code or imaginary CPUs, but +that we will start out on day one producing working, executable +object code, at least in the form of assembler language source. +However, you may not like my choice of assembler language ... 
+it's 68000 code, which is what works on my system (under SK*DOS). +I think you'll find, though, that the translation to any other +CPU such as the 80x86 will be quite obvious, so I don't +see a problem here. In fact, I hope someone out there who knows +the '86 language better than I do will offer us the equivalent +object code fragments as we need them. + + +THE CRADLE + +Every program needs some boiler plate ... I/O routines, error +message routines, etc. The programs we develop here will be no +exceptions. I've tried to hold this stuff to an absolute +minimum, however, so that we can concentrate on the important +stuff without losing it among the trees. The code given below +represents about the minimum that we need to get anything done. +It consists of some I/O routines, an error-handling routine and a +skeleton, null main program. I call it our cradle. As we +develop other routines, we'll add them to the cradle, and add the +calls to them as we need to. Make a copy of the cradle and save +it, because we'll be using it more than once. + +There are many different ways to organize the scanning activities +of a parser. In Unix systems, authors tend to use getc and +ungetc. I've had very good luck with the approach shown here, +which is to use a single, global, lookahead character. Part of +the initialization procedure (the only part, so far!) serves to +"prime the pump" by reading the first character from the input +stream. No other special techniques are required with Turbo 4.0 +... each successive call to GetChar will read the next character +in the stream. 
+ + +{--------------------------------------------------------------} +program Cradle; + +{--------------------------------------------------------------} +{ Constant Declarations } + +const TAB = ^I; + +{--------------------------------------------------------------} +{ Variable Declarations } + +var Look: char; { Lookahead Character } + +{--------------------------------------------------------------} +{ Read New Character From Input Stream } + +procedure GetChar; +begin + Read(Look); +end; + +{--------------------------------------------------------------} +{ Report an Error } + +procedure Error(s: string); +begin + WriteLn; + WriteLn(^G, 'Error: ', s, '.'); +end; + + +{--------------------------------------------------------------} +{ Report Error and Halt } + +procedure Abort(s: string); +begin + Error(s); + Halt; +end; + + +{--------------------------------------------------------------} +{ Report What Was Expected } + +procedure Expected(s: string); +begin + Abort(s + ' Expected'); +end; + +{--------------------------------------------------------------} +{ Match a Specific Input Character } + +procedure Match(x: char); +begin + if Look = x then GetChar + else Expected('''' + x + ''''); +end; + + +{--------------------------------------------------------------} +{ Recognize an Alpha Character } + +function IsAlpha(c: char): boolean; +begin + IsAlpha := upcase(c) in ['A'..'Z']; +end; + + +{--------------------------------------------------------------} + +{ Recognize a Decimal Digit } + +function IsDigit(c: char): boolean; +begin + IsDigit := c in ['0'..'9']; +end; + + +{--------------------------------------------------------------} +{ Get an Identifier } + +function GetName: char; +begin + if not IsAlpha(Look) then Expected('Name'); + GetName := UpCase(Look); + GetChar; +end; + + +{--------------------------------------------------------------} +{ Get a Number } + +function GetNum: char; +begin + if not IsDigit(Look) then Expected('Integer'); + GetNum := 
Look; + GetChar; +end; + + +{--------------------------------------------------------------} +{ Output a String with Tab } + +procedure Emit(s: string); +begin + Write(TAB, s); +end; + + + + +{--------------------------------------------------------------} +{ Output a String with Tab and CRLF } + +procedure EmitLn(s: string); +begin + Emit(s); + WriteLn; +end; + +{--------------------------------------------------------------} +{ Initialize } + +procedure Init; +begin + GetChar; +end; + + +{--------------------------------------------------------------} +{ Main Program } + +begin + Init; +end. +{--------------------------------------------------------------} + + +That's it for this introduction. Copy the code above into TP and +compile it. Make sure that it compiles and runs correctly. Then +proceed to the first lesson, which is on expression parsing. + + +***************************************************************** +* * +* COPYRIGHT NOTICE * +* * +* Copyright (C) 1988 Jack W. Crenshaw. All rights reserved. * +* * +***************************************************************** + + + + diff --git a/Compiler construction/Lets_build_compiler/tutor10.txt b/Compiler construction/Lets_build_compiler/tutor10.txt new file mode 100644 index 0000000..f611a50 --- /dev/null +++ b/Compiler construction/Lets_build_compiler/tutor10.txt @@ -0,0 +1,3538 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + LET'S BUILD A COMPILER! + + By + + Jack W. Crenshaw, Ph.D. + + 21 May 1989 + + + Part X: INTRODUCING "TINY" + + +***************************************************************** +* * +* COPYRIGHT NOTICE * +* * +* Copyright (C) 1989 Jack W. Crenshaw. All rights reserved. * +* * +***************************************************************** + + +INTRODUCTION + +In the last installment, I showed you the general idea for the +top-down development of a compiler. 
I gave you the first few +steps of the process for compilers for Pascal and C, but I +stopped far short of pushing it through to completion. The +reason was simple: if we're going to produce a real, functional +compiler for any language, I'd rather do it for KISS, the +language that I've been defining in this tutorial series. + +In this installment, we're going to do just that, for a subset of +KISS which I've chosen to call TINY. + +The process will be essentially that outlined in Installment IX, +except for one notable difference. In that installment, I +suggested that you begin with a full BNF description of the +language. That's fine for something like Pascal or C, for which +the language definition is firm. In the case of TINY, however, +we don't yet have a full description ... we seem to be defining +the language as we go. That's OK. In fact, it's preferable, +since we can tailor the language slightly as we go, to keep the +parsing easy. + +So in the development that follows, we'll actually be doing a +top-down development of BOTH the language and its compiler. The +BNF description will grow along with the compiler. + +In this process, there will be a number of decisions to be made, +each of which will influence the BNF and therefore the nature of +the language. At each decision point I'll try to remember to +explain the decision and the rationale behind my choice. That +way, if you happen to hold a different opinion and would prefer a +different option, you can choose it instead. You now have the +background to do that. I guess the important thing to note is +that nothing we do here is cast in concrete. When YOU'RE +designing YOUR language, you should feel free to do it YOUR way. + +Many of you may be asking at this point: Why bother starting over +from scratch? We had a working subset of KISS as the outcome of +Installment VII (lexical scanning). Why not just extend it as +needed? The answer is threefold. 
First of all, I have been +making a number of changes to further simplify the program ... +changes like encapsulating the code generation procedures, so +that we can convert to a different target machine more easily. +Second, I want you to see how the development can indeed be done +from the top down as outlined in the last installment. Finally, +we both need the practice. Each time I go through this exercise, +I get a little better at it, and you will, also. + + +GETTING STARTED + +Many years ago there were languages called Tiny BASIC, Tiny +Pascal, and Tiny C, each of which was a subset of its parent full +language. Tiny BASIC, for example, had only single-character +variable names and global variables. It supported only a single +data type. Sound familiar? At this point we have almost all the +tools we need to build a compiler like that. + +Yet a language called Tiny-anything still carries some baggage +inherited from its parent language. I've often wondered if this +is a good idea. Granted, a language based upon some parent +language will have the advantage of familiarity, but there may +also be some peculiar syntax carried over from the parent that +may tend to add unnecessary complexity to the compiler. (Nowhere +is this more true than in Small C.) + +I've wondered just how small and simple a compiler could be made +and still be useful, if it were designed from the outset to be +both easy to use and to parse. Let's find out. This language +will just be called "TINY," period. It's a subset of KISS, which +I also haven't fully defined, so that at least makes us +consistent (!). I suppose you could call it TINY KISS. But that +opens up a whole can of worms involving cuter and cuter (and +perhaps more risque) names, so let's just stick with TINY. + +The main limitations of TINY will be because of the things we +haven't yet covered, such as data types. Like its cousins Tiny C +and Tiny BASIC, TINY will have only one data type, the 16-bit +integer. 
The first version we develop will also have no +procedure calls and will use single-character variable names, +although as you will see we can remove these restrictions without +much effort. + +The language I have in mind will share some of the good features +of Pascal, C, and Ada. Taking a lesson from the comparison of +the Pascal and C compilers in the previous installment, though, +TINY will have a decided Pascal flavor. Wherever feasible, a +language structure will be bracketed by keywords or symbols, so +that the parser will know where it's going without having to +guess. + +One other ground rule: As we go, I'd like to keep the compiler +producing real, executable code. Even though it may not DO much +at the beginning, it will at least do it correctly. + +Finally, I'll use a couple of Pascal restrictions that make +sense: All data and procedures must be declared before they are +used. That makes good sense, even though for now the only data +type we'll use is a word. This rule in turn means that the only +reasonable place to put the executable code for the main program +is at the end of the listing. + +The top-level definition will be similar to Pascal: + + + <program> ::= PROGRAM <top-level decls> <main> '.'
+ + +Already, we've reached a decision point. My first thought was to +make the main block optional. It doesn't seem to make sense to +write a "program" with no main program, but it does make sense if +we're allowing for multiple modules, linked together. As a +matter of fact, I intend to allow for this in KISS. But then we +begin to open up a can of worms that I'd rather leave closed for +now. For example, the term "PROGRAM" really becomes a misnomer. +The MODULE of Modula-2 or the Unit of Turbo Pascal would be more +appropriate. Second, what about scope rules? We'd need a +convention for dealing with name visibility across modules. +Better for now to just keep it simple and ignore the idea +altogether. + +There's also a decision in choosing to require the main program +to be last. I toyed with the idea of making its position +optional, as in C. The nature of SK*DOS, the OS I'm compiling +for, makes this very easy to do. But this doesn't really make +much sense in view of the Pascal-like requirement that all data +and procedures be declared before they're referenced. Since the +main program can only call procedures that have already been +declared, the only position that makes sense is at the end, a la +Pascal. 
+ +Given the BNF above, let's write a parser that just recognizes +the brackets: + + +{--------------------------------------------------------------} +{ Parse and Translate a Program } + +procedure Prog; +begin + Match('p'); + Header; + Prolog; + Match('.'); + Epilog; +end; +{--------------------------------------------------------------} + + +The procedure Header just emits the startup code required by the +assembler: + + +{--------------------------------------------------------------} +{ Write Header Info } + +procedure Header; +begin + WriteLn('WARMST', TAB, 'EQU $A01E'); +end; +{--------------------------------------------------------------} + + +The procedures Prolog and Epilog emit the code for identifying +the main program, and for returning to the OS (PostLabel is the +label-writing procedure from the earlier installments): + + +{--------------------------------------------------------------} +{ Write the Prolog } + +procedure Prolog; +begin + PostLabel('MAIN'); +end; + + +{--------------------------------------------------------------} +{ Write the Epilog } + +procedure Epilog; +begin + EmitLn('DC WARMST'); + EmitLn('END MAIN'); +end; +{--------------------------------------------------------------} + + +The main program just calls Prog, and then looks for a clean +ending (CR is the carriage-return constant, ^M, which must be +declared alongside TAB): + + +{--------------------------------------------------------------} +{ Main Program } + +begin + Init; + Prog; + if Look <> CR then Abort('Unexpected data after ''.'''); +end. +{--------------------------------------------------------------} + + +At this point, TINY will accept only one input "program," the +null program: + + + PROGRAM . (or 'p.' in our shorthand.) + +Note, though, that the compiler DOES generate correct code for +this program. It will run, and do what you'd expect the null +program to do, that is, nothing but return gracefully to the OS. + +As a matter of interest, one of my favorite compiler benchmarks +is to compile, link, and execute the null program in whatever +language is involved. 
You can learn a lot about the +implementation by measuring the overhead in time required to +compile what should be a trivial case. It's also interesting to +measure the amount of code produced. In many compilers, the code +can be fairly large, because they always include the whole run- +time library whether they need it or not. Early versions of +Turbo Pascal produced a 12K object file for this case. VAX C +generates 50K! + +The smallest null programs I've seen are those produced by +Modula-2 compilers, and they run about 200-800 bytes. + +In the case of TINY, we HAVE no run-time library as yet, so the +object code is indeed tiny: two bytes. That's got to be a +record, and it's likely to remain one since it is the minimum +size required by the OS. + +The next step is to process the code for the main program. I'll +use the Pascal BEGIN-block: + + + <main> ::= BEGIN <block> END
+ + +Here, again, we have made a decision. We could have chosen to +require a "PROCEDURE MAIN" sort of declaration, similar to C. I +must admit that this is not a bad idea at all ... I don't +particularly like the Pascal approach since I tend to have +trouble locating the main program in a Pascal listing. But the +alternative is a little awkward, too, since you have to deal with +the error condition where the user omits the main program or +misspells its name. Here I'm taking the easy way out. + +Another solution to the "where is the main program" problem might +be to require a name for the program, and then bracket the main +by + + + BEGIN <name> + END <name> + + +similar to the convention of Modula-2. This adds a bit of +"syntactic sugar" to the language. Things like this are easy to +add or change to your liking, if the language is your own design. + +To parse this definition of a main block, change procedure Prog +to read: + +{--------------------------------------------------------------} +{ Parse and Translate a Program } + +procedure Prog; +begin + Match('p'); + Header; + Main; + Match('.'); +end; +{--------------------------------------------------------------} + + +and add the new procedure: + + +{--------------------------------------------------------------} +{ Parse and Translate a Main Program } + +procedure Main; +begin + Match('b'); + Prolog; + Match('e'); + Epilog; +end; +{--------------------------------------------------------------} + + +Now, the only legal program is: + + + PROGRAM BEGIN END . (or 'pbe.') + + +Aren't we making progress??? Well, as usual it gets better. You +might try some deliberate errors here, like omitting the 'b' or +the 'e', and see what happens. As always, the compiler should +flag all illegal inputs. + + +DECLARATIONS + +The obvious next step is to decide what we mean by a declaration. +My intent here is to have two kinds of declarations: variables +and procedures/functions. 
At the top level, only global +declarations are allowed, just as in C. + +For now, there can only be variable declarations, identified by +the keyword VAR (abbreviated 'v'): + + + <top-level decls> ::= ( <data declaration> )* + + <data declaration> ::= VAR <var-list> + + +Note that since there is only one variable type, there is no need +to declare the type. Later on, for full KISS, we can easily add +a type description. + +The procedure Prog becomes: + + +{--------------------------------------------------------------} +{ Parse and Translate a Program } + +procedure Prog; +begin + Match('p'); + Header; + TopDecls; + Main; + Match('.'); +end; +{--------------------------------------------------------------} + + +Now, add the two new procedures: + + +{--------------------------------------------------------------} +{ Process a Data Declaration } + +procedure Decl; +begin + Match('v'); + GetChar; +end; + + +{--------------------------------------------------------------} +{ Parse and Translate Global Declarations } + +procedure TopDecls; +begin + while Look <> 'b' do + case Look of + 'v': Decl; + else Abort('Unrecognized Keyword ''' + Look + ''''); + end; +end; +{--------------------------------------------------------------} + + +Note that at this point, Decl is just a stub. It generates no +code, and it doesn't process a list ... every variable must occur +in a separate VAR statement. + +OK, now we can have any number of data declarations, each +starting with a 'v' for VAR, before the BEGIN-block. Try a few +cases and see what happens. + + +DECLARATIONS AND SYMBOLS + +That looks pretty good, but we're still only generating the null +program for output. A real compiler would issue assembler +directives to allocate storage for the variables. It's about +time we actually produced some code. + +With a little extra code, that's an easy thing to do from +procedure Decl. 
Modify it as follows: + + +{--------------------------------------------------------------} +{ Parse and Translate a Data Declaration } + +procedure Decl; +var Name: char; +begin + Match('v'); + Alloc(GetName); +end; +{--------------------------------------------------------------} + + +The procedure Alloc just issues a command to the assembler to +allocate storage: + + +{--------------------------------------------------------------} +{ Allocate Storage for a Variable } + +procedure Alloc(N: char); +begin + WriteLn(N, ':', TAB, 'DC 0'); +end; +{--------------------------------------------------------------} + + +Give this one a whirl. Try an input that declares some +variables, such as: + + pvxvyvzbe. + +See how the storage is allocated? Simple, huh? Note also that +the entry point, "MAIN," comes out in the right place. + +For the record, a "real" compiler would also have a symbol table +to record the variables being used. Normally, the symbol table +is necessary to record the type of each variable. But since in +this case all variables have the same type, we don't need a +symbol table for that reason. As it turns out, we're going to +find a symbol table necessary even without different types, but +let's postpone that need until it arises. + +Of course, we haven't really parsed the correct syntax for a data +declaration, since it involves a variable list. Our version only +permits a single variable. That's easy to fix, too. + +The BNF for <var-list> is + + + <var-list> ::= <ident> (, <ident>)* + + +Adding this syntax to Decl gives this new version: + + +{--------------------------------------------------------------} +{ Parse and Translate a Data Declaration } + +procedure Decl; +var Name: char; +begin + Match('v'); + Alloc(GetName); + while Look = ',' do begin + GetChar; + Alloc(GetName); + end; +end; +{--------------------------------------------------------------} + + +OK, now compile this code and give it a try. 
Try a number of +lines of VAR declarations, try a list of several variables on one +line, and try combinations of the two. Does it work? + + +INITIALIZERS + +As long as we're dealing with data declarations, one thing that's +always bothered me about Pascal is that it doesn't allow +initializing data items in the declaration. That feature is +admittedly sort of a frill, and it may be out of place in a +language that purports to be a minimal language. But it's also +SO easy to add that it seems a shame not to do so. The BNF +becomes: + + + <var-list> ::= <var> (, <var> )* + + <var> ::= <ident> [ = <integer> ] + +Change Alloc as follows: + + +{--------------------------------------------------------------} +{ Allocate Storage for a Variable } + +procedure Alloc(N: char); +begin + Write(N, ':', TAB, 'DC '); + if Look = '=' then begin + Match('='); + WriteLn(GetNum); + end + else + WriteLn('0'); +end; +{--------------------------------------------------------------} + + +There you are: an initializer with six added lines of Pascal. + +OK, try this version of TINY and verify that you can, indeed, +give the variables initial values. + +By golly, this thing is starting to look real! Of course, it +still doesn't DO anything, but it looks good, doesn't it? + +Before leaving this section, I should point out that we've used +two versions of function GetNum. One, the earlier one, returns a +character value, a single digit. The other accepts a multi-digit +integer and returns an integer value. Either one will work here, +since WriteLn will handle either type. But there's no reason to +limit ourselves to single-digit values here, so the correct +version to use is the one that returns an integer. 
Here it is: + + +{--------------------------------------------------------------} +{ Get a Number } + +function GetNum: integer; +var Val: integer; +begin + Val := 0; + if not IsDigit(Look) then Expected('Integer'); + while IsDigit(Look) do begin + Val := 10 * Val + Ord(Look) - Ord('0'); + GetChar; + end; + GetNum := Val; +end; +{--------------------------------------------------------------} + +As a matter of fact, strictly speaking we should allow for +expressions in the data field of the initializer, or at the very +least for negative values. For now, let's just allow for +negative values by changing the code for Alloc as follows: + + +{--------------------------------------------------------------} +{ Allocate Storage for a Variable } + +procedure Alloc(N: char); +begin + if InTable(N) then Abort('Duplicate Variable Name ' + N); + ST[N] := 'v'; + Write(N, ':', TAB, 'DC '); + if Look = '=' then begin + Match('='); + If Look = '-' then begin + Write(Look); + Match('-'); + end; + WriteLn(GetNum); + end + else + WriteLn('0'); +end; +{--------------------------------------------------------------} + + +Now you should be able to initialize variables with negative +and/or multi-digit values. + + +THE SYMBOL TABLE + +There's one problem with the compiler as it stands so far: it +doesn't do anything to record a variable when we declare it. So +the compiler is perfectly content to allocate storage for several +variables with the same name. You can easily verify this with an +input like + + + pvavavabe. + + +Here we've declared the variable A three times. As you can see, +the compiler will cheerfully accept that, and generate three +identical labels. Not good. + +Later on, when we start referencing variables, the compiler will +also let us reference variables that don't exist. The assembler +will catch both of these error conditions, but it doesn't seem +friendly at all to pass such errors along to the assembler. 
The +compiler should catch such things at the source language level. + +So even though we don't need a symbol table to record data types, +we ought to install one just to check for these two conditions. +Since at this point we are still restricted to single-character +variable names, the symbol table can be trivial. To provide for +it, first add the following declaration at the beginning of your +program: + + + var ST: array['A'..'Z'] of char; + + +and insert the following function: + + +{--------------------------------------------------------------} +{ Look for Symbol in Table } + +function InTable(n: char): Boolean; +begin + InTable := ST[n] <> ' '; +end; +{--------------------------------------------------------------} + + +We also need to initialize the table to all blanks. The +following lines in Init will do the job: + + +var i: char; +begin + for i := 'A' to 'Z' do + ST[i] := ' '; + ... + + +Finally, insert the following two lines at the beginning of +Alloc: + + + if InTable(N) then Abort('Duplicate Variable Name ' + N); + ST[N] := 'v'; + + +That should do it. The compiler will now catch duplicate +declarations. Later, we can also use InTable when generating +references to the variables. + + +EXECUTABLE STATEMENTS + +At this point, we can generate a null program that has some data +variables declared and possibly initialized. But so far we +haven't arranged to generate the first line of executable code. + +Believe it or not, though, we almost have a usable language! +What's missing is the executable code that must go into the main +program. But that code is just assignment statements and control +statements ... all stuff we have done before. So it shouldn't +take us long to provide for them, as well. + +The BNF definition given earlier for the main program included a +statement block, which we have so far ignored: + + + <main> ::= BEGIN <block> END
+ + +For now, we can just consider a block to be a series of +assignment statements: + + + <block> ::= (Assignment)* + + +Let's start things off by adding a parser for the block. We'll +begin with a stub for the assignment statement: + + +{--------------------------------------------------------------} +{ Parse and Translate an Assignment Statement } + +procedure Assignment; +begin + GetChar; +end; + + +{--------------------------------------------------------------} +{ Parse and Translate a Block of Statements } + +procedure Block; +begin + while Look <> 'e' do + Assignment; +end; +{--------------------------------------------------------------} + + +Modify procedure Main to call Block as shown below: + + +{--------------------------------------------------------------} +{ Parse and Translate a Main Program } + +procedure Main; +begin + Match('b'); + Prolog; + Block; + Match('e'); + Epilog; +end; +{--------------------------------------------------------------} + + +This version still won't generate any code for the "assignment +statements" ... all it does is to eat characters until it sees +the 'e' for 'END.' But it sets the stage for what is to follow. + +The next step, of course, is to flesh out the code for an +assignment statement. This is something we've done many times +before, so I won't belabor it. This time, though, I'd like to +deal with the code generation a little differently. Up till now, +we've always just inserted the Emits that generate output code in +line with the parsing routines. A little unstructured, perhaps, +but it seemed the most straightforward approach, and made it easy +to see what kind of code would be emitted for each construct. + +However, I realize that most of you are using an 80x86 computer, +so the 68000 code generated is of little use to you. Several of +you have asked me if the CPU-dependent code couldn't be collected +into one spot where it would be easier to retarget to another +CPU. The answer, of course, is yes. 
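
As a rough illustration of what "one spot" could mean, here is a
minimal sketch, not the tutorial's own code generator. Each
operation gets its own small procedure, and only the body of that
procedure knows the target's mnemonics. The Target switch, the CPU
type, and the 8086 strings are assumptions made purely for this
example; the routines in this series emit 68000 code only.

```pascal
program RetargetDemo;
{ Sketch only. Target, CPU and the 8086 strings are hypothetical; }
{ they are not part of TINY or of the cradle.                     }

const TAB = ^I;

type CPU = (M68000, I8086);

var Target: CPU;

{ Output a String with Tab and CRLF }
procedure EmitLn(s: string);
begin
   WriteLn(TAB, s);
end;

{ Negate the Primary Register: the only place that knows the opcode }
procedure Negate;
begin
   case Target of
     M68000: EmitLn('NEG D0');
     I8086:  EmitLn('NEG AX');
   end;
end;

{ Push Primary onto Stack }
procedure Push;
begin
   case Target of
     M68000: EmitLn('MOVE D0,-(SP)');
     I8086:  EmitLn('PUSH AX');
   end;
end;

begin
   { Emit the same two operations once per target }
   for Target := M68000 to I8086 do begin
      Negate;
      Push;
   end;
end.
```

Run as is, this should print the 68000 pair (NEG D0, then
MOVE D0,-(SP)) followed by the 8086 pair (NEG AX, then PUSH AX).
A parser written against Negate and Push never needs to know which
CPU is on the other side.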
+ +To accomplish this, insert the following "code generation" +routines: + + +{---------------------------------------------------------------} +{ Clear the Primary Register } + +procedure Clear; +begin + EmitLn('CLR D0'); +end; + + +{---------------------------------------------------------------} +{ Negate the Primary Register } + +procedure Negate; +begin + EmitLn('NEG D0'); +end; + + +{---------------------------------------------------------------} +{ Load a Constant Value to Primary Register } + +procedure LoadConst(n: integer); +begin + Emit('MOVE #'); + WriteLn(n, ',D0'); +end; + + +{---------------------------------------------------------------} +{ Load a Variable to Primary Register } + +procedure LoadVar(Name: char); +begin + if not InTable(Name) then Undefined(Name); + EmitLn('MOVE ' + Name + '(PC),D0'); +end; + + +{---------------------------------------------------------------} +{ Push Primary onto Stack } + +procedure Push; +begin + EmitLn('MOVE D0,-(SP)'); +end; + + +{---------------------------------------------------------------} +{ Add Top of Stack to Primary } + +procedure PopAdd; +begin + EmitLn('ADD (SP)+,D0'); +end; + + +{---------------------------------------------------------------} +{ Subtract Primary from Top of Stack } + +procedure PopSub; +begin + EmitLn('SUB (SP)+,D0'); + EmitLn('NEG D0'); +end; + + +{---------------------------------------------------------------} +{ Multiply Top of Stack by Primary } + +procedure PopMul; +begin + EmitLn('MULS (SP)+,D0'); +end; + + +{---------------------------------------------------------------} +{ Divide Top of Stack by Primary } + +procedure PopDiv; +begin + EmitLn('MOVE (SP)+,D7'); + EmitLn('EXT.L D7'); + EmitLn('DIVS D0,D7'); + EmitLn('MOVE D7,D0'); +end; + + +{---------------------------------------------------------------} +{ Store Primary to Variable } + +procedure Store(Name: char); +begin + if not InTable(Name) then Undefined(Name); + EmitLn('LEA ' + Name + '(PC),A0'); + EmitLn('MOVE 
D0,(A0)') +end; +{---------------------------------------------------------------} + + +The nice part of this approach, of course, is that we can +retarget the compiler to a new CPU simply by rewriting these +"code generator" procedures. In addition, we will find later +that we can improve the code quality by tweaking these routines a +bit, without having to modify the compiler proper. + +Note that both LoadVar and Store check the symbol table to make +sure that the variable is defined. The error handler Undefined +simply calls Abort: + + +{--------------------------------------------------------------} +{ Report an Undefined Identifier } + +procedure Undefined(n: string); +begin + Abort('Undefined Identifier ' + n); +end; +{--------------------------------------------------------------} + + +OK, we are now finally ready to begin processing executable code. +We'll do that by replacing the stub version of procedure +Assignment. + +We've been down this road many times before, so this should all +be familiar to you. In fact, except for the changes associated +with the code generation, we could just copy the procedures from +Part VII. Since we are making some changes, I won't just copy +them, but we will go a little faster than usual. + +The BNF for the assignment statement is: + + <assignment> ::= <ident> = <expression> + + <expression> ::= <first term> ( <addop> <term> )* + + <first term> ::= <first factor> <rest> + + <term> ::= <factor> <rest> + + <rest> ::= ( <mulop> <factor> )* + + <first factor> ::= [ <addop> ] <factor> + + <factor> ::= <var> | <number> | ( <expression> ) + + +This version of the BNF is also a bit different than we've used +before ... yet another "variation on the theme of an expression." +This particular version has what I consider to be the best +treatment of the unary minus. As you'll see later, it lets us +handle negative constant values efficiently. It's worth +mentioning here that we have often seen the advantages of +"tweaking" the BNF as we go, to help make the language easy to +parse. What you're looking at here is a bit different: we've +tweaked the BNF to make the CODE GENERATION more efficient! +That's a first for this series. 
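
To make the efficiency point concrete, here is a minimal sketch
contrasting two ways of compiling a leading minus in front of a
constant. The procedure names NaiveNeg and FoldedNeg are made up
for this illustration and are not part of TINY; the instruction
strings follow the MOVE and NEG forms used by the code generation
routines above.

```pascal
program NegConstDemo;
{ Sketch only: NaiveNeg and FoldedNeg are hypothetical names. }

const TAB = ^I;

{ Naive treatment: load the constant, then negate at run time }
procedure NaiveNeg(n: integer);
begin
   WriteLn(TAB, 'MOVE #', n, ',D0');
   WriteLn(TAB, 'NEG D0');
end;

{ Folded treatment: negate at compile time, emit one instruction }
procedure FoldedNeg(n: integer);
begin
   WriteLn(TAB, 'MOVE #', -n, ',D0');
end;

begin
   WriteLn('; naive code for -3');
   NaiveNeg(3);
   WriteLn('; folded code for -3');
   FoldedNeg(3);
end.
```

The folded form is what the grammar makes possible: because the
leading sign is recognized ahead of the factor, the parser can see
that the factor is a constant and negate it before any code is
emitted, saving one instruction per negative constant.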
+ +Anyhow, the following code implements the BNF: + + +{---------------------------------------------------------------} +{ Parse and Translate a Math Factor } + +procedure Expression; Forward; + +procedure Factor; +begin + if Look = '(' then begin + Match('('); + Expression; + Match(')'); + end + else if IsAlpha(Look) then + LoadVar(GetName) + else + LoadConst(GetNum); +end; + + +{--------------------------------------------------------------} +{ Parse and Translate a Negative Factor } + +procedure NegFactor; +begin + Match('-'); + if IsDigit(Look) then + LoadConst(-GetNum) + else begin + Factor; + Negate; + end; +end; + + +{--------------------------------------------------------------} +{ Parse and Translate a Leading Factor } + +procedure FirstFactor; +begin + case Look of + '+': begin + Match('+'); + Factor; + end; + '-': NegFactor; + else Factor; + end; +end; + + +{--------------------------------------------------------------} +{ Recognize and Translate a Multiply } + +procedure Multiply; +begin + Match('*'); + Factor; + PopMul; +end; + + +{-------------------------------------------------------------} +{ Recognize and Translate a Divide } + +procedure Divide; +begin + Match('/'); + Factor; + PopDiv; +end; + + +{---------------------------------------------------------------} +{ Common Code Used by Term and FirstTerm } + +procedure Term1; +begin + while IsMulop(Look) do begin + Push; + case Look of + '*': Multiply; + '/': Divide; + end; + end; +end; + + +{---------------------------------------------------------------} +{ Parse and Translate a Math Term } + +procedure Term; +begin + Factor; + Term1; +end; + + +{---------------------------------------------------------------} +{ Parse and Translate a Leading Term } + +procedure FirstTerm; +begin + FirstFactor; + Term1; +end; + + +{--------------------------------------------------------------} +{ Recognize and Translate an Add } + +procedure Add; +begin + Match('+'); + Term; + PopAdd; +end; + + 
+{-------------------------------------------------------------} +{ Recognize and Translate a Subtract } + +procedure Subtract; +begin + Match('-'); + Term; + PopSub; +end; + + +{---------------------------------------------------------------} +{ Parse and Translate an Expression } + +procedure Expression; +begin + FirstTerm; + while IsAddop(Look) do begin + Push; + case Look of + '+': Add; + '-': Subtract; + end; + end; +end; + + +{--------------------------------------------------------------} +{ Parse and Translate an Assignment Statement } + +procedure Assignment; +var Name: char; +begin + Name := GetName; + Match('='); + Expression; + Store(Name); +end; +{--------------------------------------------------------------} + + +OK, if you've got all this code inserted, then compile it and +check it out. You should be seeing reasonable-looking code, +representing a complete program that will assemble and execute. +We have a compiler! + + +BOOLEANS + +The next step should also be familiar to you. We must add +Boolean expressions and relational operations. Again, since +we've already dealt with them more than once, I won't elaborate +much on them, except where they are different from what we've +done before. Again, we won't just copy from other files because +I've changed a few things just a bit. Most of the changes just +involve encapsulating the machine-dependent parts as we did for +the arithmetic operations. I've also modified procedure +NotFactor somewhat, to parallel the structure of FirstFactor. +Finally, I corrected an error in the object code for the +relational operators: The Scc instruction I used only sets the +low 8 bits of D0. We want all 16 bits set for a logical true, so +I've added an instruction to sign-extend the low byte. 
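The sign-extension fix is easier to see with numbers. The sketch below is a simulation only, assuming Free Pascal (for the SmallInt type); SetByte and ExtWord are hypothetical stand-ins for what an Scc instruction and EXT do to the low word of D0.

```pascal
program SccDemo;
{ Hypothetical simulation of Scc followed by EXT on a 16-bit value. }

var D0: Word;

{ Scc writes only the low byte: $FF for true, $00 for false }
function SetByte(W: Word; Flag: boolean): Word;
begin
   if Flag then
      SetByte := (W and $FF00) or $00FF
   else
      SetByte := W and $FF00;
end;

{ EXT (word form): copy bit 7 into bits 8..15 }
function ExtWord(W: Word): Word;
begin
   if (W and $0080) <> 0 then
      ExtWord := W or $FF00
   else
      ExtWord := W and $00FF;
end;

begin
   D0 := $1234;                 { leftover contents of the register }
   D0 := SetByte(D0, true);
   WriteLn(SmallInt(D0));       { 4863: the upper byte is still stale }
   D0 := ExtWord(D0);
   WriteLn(SmallInt(D0));       { -1: all 16 bits set, a clean "true" }
end.
```

Without the extension step, a true result would depend on whatever happened to be in the upper byte beforehand.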
+ +To begin, we're going to need some more recognizers: + + +{--------------------------------------------------------------} +{ Recognize a Boolean Orop } + +function IsOrop(c: char): boolean; +begin + IsOrop := c in ['|', '~']; +end; + + +{--------------------------------------------------------------} +{ Recognize a Relop } + +function IsRelop(c: char): boolean; +begin + IsRelop := c in ['=', '#', '<', '>']; +end; +{--------------------------------------------------------------} + + +Also, we're going to need some more code generation routines: + + +{---------------------------------------------------------------} +{ Complement the Primary Register } + +procedure NotIt; +begin + EmitLn('NOT D0'); +end; +{---------------------------------------------------------------} +. +. +. +{---------------------------------------------------------------} +{ AND Top of Stack with Primary } + +procedure PopAnd; +begin + EmitLn('AND (SP)+,D0'); +end; + + +{---------------------------------------------------------------} +{ OR Top of Stack with Primary } + +procedure PopOr; +begin + EmitLn('OR (SP)+,D0'); +end; + + +{---------------------------------------------------------------} +{ XOR Top of Stack with Primary } + +procedure PopXor; +begin + EmitLn('EOR (SP)+,D0'); +end; + + +{---------------------------------------------------------------} +{ Compare Top of Stack with Primary } + +procedure PopCompare; +begin + EmitLn('CMP (SP)+,D0'); +end; + + +{---------------------------------------------------------------} +{ Set D0 If Compare was = } + +procedure SetEqual; +begin + EmitLn('SEQ D0'); + EmitLn('EXT D0'); +end; + + +{---------------------------------------------------------------} +{ Set D0 If Compare was != } + +procedure SetNEqual; +begin + EmitLn('SNE D0'); + EmitLn('EXT D0'); +end; + + +{---------------------------------------------------------------} +{ Set D0 If Compare was > } + +procedure SetGreater; +begin + EmitLn('SLT D0'); + EmitLn('EXT D0'); +end; + + 
+{---------------------------------------------------------------} +{ Set D0 If Compare was < } + +procedure SetLess; +begin + EmitLn('SGT D0'); + EmitLn('EXT D0'); +end; +{---------------------------------------------------------------} + +All of this gives us the tools we need. The BNF for the Boolean +expressions is: + + + <bool-expr> ::= <bool-term> ( <orop> <bool-term> )* + + <bool-term> ::= <not-factor> ( <andop> <not-factor> )* + + <not-factor> ::= [ '!' ] <relation> + + <relation> ::= <expression> [ <relop> <expression> ] + + +Sharp-eyed readers might note that this syntax does not include +the non-terminal "bool-factor" used in earlier versions. It was +needed then because I also allowed for the Boolean constants TRUE +and FALSE. But remember that in TINY there is no distinction +made between Boolean and arithmetic types ... they can be freely +intermixed. So there is really no need for these predefined +values ... we can just use -1 and 0, respectively. + +In C terminology, we could always use the defines: + + + #define TRUE -1 + #define FALSE 0 + + +(That is, if TINY had a preprocessor.) Later on, when we allow +for declarations of constants, these two values will be +predefined by the language. + +The reason that I'm harping on this is that I've already tried +the alternative, which is to include TRUE and FALSE as keywords. +The problem with that approach is that it then requires lexical +scanning for EVERY variable name in every expression. If you'll +recall, I pointed out in Installment VII that this slows the +compiler down considerably. As long as keywords can't be in +expressions, we need to do the scanning only at the beginning of +every new statement ... quite an improvement. So using the +syntax above not only simplifies the parsing, but speeds up the +scanning as well. 
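One reason -1 is a convenient choice for TRUE is that the logical operators in this compiler are plain bitwise instructions (NotIt, for instance, emits a NOT). A quick check, assuming a Pascal compiler such as Free Pascal in which not and and on integers are bitwise, shows how the two candidate truth values behave. The variable names are hypothetical.

```pascal
program TruthDemo;
{ Hypothetical check of -1 versus 1 as the "true" value }

var t, f, one: integer;

begin
   t := -1;                              { TINY's TRUE  }
   f := 0;                               { TINY's FALSE }
   one := 1;                             { the C-style alternative }
   WriteLn('not -1    = ', not t);       { 0: true becomes false }
   WriteLn('not  0    = ', not f);       { -1: false becomes true }
   WriteLn('not  1    = ', not one);     { -2: still nonzero, still "true" }
   WriteLn('-1 and  5 = ', t and 5);     { 5: true leaves the other operand alone }
end.
```

With all bits set, complementing TRUE gives exactly FALSE and vice versa, so no extra normalization is needed after a NOT.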
+ +OK, given that we're all satisfied with the syntax above, the +corresponding code is shown below: + + +{---------------------------------------------------------------} +{ Recognize and Translate a Relational "Equals" } + +procedure Equals; +begin + Match('='); + Expression; + PopCompare; + SetEqual; +end; + + +{---------------------------------------------------------------} +{ Recognize and Translate a Relational "Not Equals" } + +procedure NotEquals; +begin + Match('#'); + Expression; + PopCompare; + SetNEqual; +end; + + +{---------------------------------------------------------------} +{ Recognize and Translate a Relational "Less Than" } + +procedure Less; +begin + Match('<'); + Expression; + PopCompare; + SetLess; +end; + + +{---------------------------------------------------------------} +{ Recognize and Translate a Relational "Greater Than" } + +procedure Greater; +begin + Match('>'); + Expression; + PopCompare; + SetGreater; +end; + + +{---------------------------------------------------------------} +{ Parse and Translate a Relation } + + +procedure Relation; +begin + Expression; + if IsRelop(Look) then begin + Push; + case Look of + '=': Equals; + '#': NotEquals; + '<': Less; + '>': Greater; + end; + end; +end; + + +{---------------------------------------------------------------} +{ Parse and Translate a Boolean Factor with Leading NOT } + +procedure NotFactor; +begin + if Look = '!' 
then begin + Match('!'); + Relation; + NotIt; + end + else + Relation; +end; + + +{---------------------------------------------------------------} +{ Parse and Translate a Boolean Term } + +procedure BoolTerm; +begin + NotFactor; + while Look = '&' do begin + Push; + Match('&'); + NotFactor; + PopAnd; + end; +end; + + +{--------------------------------------------------------------} +{ Recognize and Translate a Boolean OR } + +procedure BoolOr; +begin + Match('|'); + BoolTerm; + PopOr; +end; + + +{--------------------------------------------------------------} +{ Recognize and Translate an Exclusive Or } + +procedure BoolXor; +begin + Match('~'); + BoolTerm; + PopXor; +end; + + +{---------------------------------------------------------------} +{ Parse and Translate a Boolean Expression } + +procedure BoolExpression; +begin + BoolTerm; + while IsOrOp(Look) do begin + Push; + case Look of + '|': BoolOr; + '~': BoolXor; + end; + end; +end; +{--------------------------------------------------------------} + + +To tie it all together, don't forget to change the references to +Expression in procedures Factor and Assignment so that they call +BoolExpression instead. + +OK, if you've got all that typed in, compile it and give it a +whirl. First, make sure you can still parse an ordinary +arithmetic expression. Then, try a Boolean one. Finally, make +sure that you can assign the results of relations. Try, for +example: + + pvx,y,zbx=z>ye. + +which stands for: + + PROGRAM + VAR X,Y,Z + BEGIN + X = Z > Y + END. + + +See how this assigns a Boolean value to X? + +CONTROL STRUCTURES + +We're almost home. With Boolean expressions in place, it's a +simple matter to add control structures. For TINY, we'll only +allow two kinds of them, the IF and the WHILE: + + + <if> ::= IF <bool-expression> <block> [ ELSE <block> ] ENDIF + + <while> ::= WHILE <bool-expression> <block> ENDWHILE + +Once again, let me spell out the decisions implicit in this +syntax, which departs strongly from that of C or Pascal. 
In both +of those languages, the "body" of an IF or WHILE is regarded as a +single statement. If you intend to use a block of more than one +statement, you have to build a compound statement using BEGIN-END +(in Pascal) or '{}' (in C). In TINY (and KISS) there is no such +thing as a compound statement ... single or multiple they're all +just blocks to these languages. + +In KISS, all the control structures will have explicit and unique +keywords bracketing the statement block, so there can be no +confusion as to where things begin and end. This is the modern +approach, used in such respected languages as Ada and Modula 2, +and it completely eliminates the problem of the "dangling else." + +Note that I could have chosen to use the same keyword END to end +all the constructs, as is done in Pascal. (The closing '}' in C +serves the same purpose.) But this has always led to confusion, +which is why Pascal programmers tend to write things like + + + end { loop } + +or end { if } + + +As I explained in Part V, using unique terminal keywords does +increase the size of the keyword list and therefore slows down +the scanning, but in this case it seems a small price to pay for +the added insurance. Better to find the errors at compile time +rather than run time. + +One last thought: The two constructs above each have the non- +terminals + + + <bool-expression> and <block> + + +juxtaposed with no separating keyword. In Pascal we would expect +the keywords THEN and DO in these locations. + +I have no problem with leaving out these keywords, and the parser +has no trouble either, ON CONDITION that we make no errors in the +bool-expression part. On the other hand, if we were to include +these extra keywords we would get yet one more level of insurance +at very little cost, and I have no problem with that, either. +Use your best judgment as to which way to go. + +OK, with that bit of explanation let's proceed. As usual, we're +going to need some new code generation routines. 
These generate +the code for conditional and unconditional branches: + +{---------------------------------------------------------------} +{ Branch Unconditional } + +procedure Branch(L: string); +begin + EmitLn('BRA ' + L); +end; + + +{---------------------------------------------------------------} +{ Branch False } + +procedure BranchFalse(L: string); +begin + EmitLn('TST D0'); + EmitLn('BEQ ' + L); +end; +{--------------------------------------------------------------} + + +Except for the encapsulation of the code generation, the code to +parse the control constructs is the same as you've seen before: + + +{---------------------------------------------------------------} +{ Recognize and Translate an IF Construct } + +procedure Block; Forward; + + +procedure DoIf; +var L1, L2: string; +begin + Match('i'); + BoolExpression; + L1 := NewLabel; + L2 := L1; + BranchFalse(L1); + Block; + if Look = 'l' then begin + Match('l'); + L2 := NewLabel; + Branch(L2); + PostLabel(L1); + Block; + end; + PostLabel(L2); + Match('e'); +end; + + +{--------------------------------------------------------------} +{ Parse and Translate a WHILE Statement } + +procedure DoWhile; +var L1, L2: string; +begin + Match('w'); + L1 := NewLabel; + L2 := NewLabel; + PostLabel(L1); + BoolExpression; + BranchFalse(L2); + Block; + Match('e'); + Branch(L1); + PostLabel(L2); +end; +{--------------------------------------------------------------} + + +To tie everything together, we need only modify procedure Block +to recognize the "keywords" for the IF and WHILE. 
As usual, we +expand the definition of a block like so: + + + <block> ::= ( <statement> )* + + +where + + + <statement> ::= <if> | <while> | <assignment> + + +The corresponding code is: + + +{--------------------------------------------------------------} +{ Parse and Translate a Block of Statements } + +procedure Block; +begin + while not(Look in ['e', 'l']) do begin + case Look of + 'i': DoIf; + 'w': DoWhile; + else Assignment; + end; + end; +end; +{--------------------------------------------------------------} + + +OK, add the routines I've given, compile and test them. You +should be able to parse the single-character versions of any of +the control constructs. It's looking pretty good! + +As a matter of fact, except for the single-character limitation +we've got a virtually complete version of TINY. I call it, with +tongue planted firmly in cheek, TINY Version 0.1. + + +LEXICAL SCANNING + +Of course, you know what's next: We have to convert the program +so that it can deal with multi-character keywords, newlines, and +whitespace. We have just gone through all that in Part VII. +We'll use the distributed scanner technique that I showed you in +that installment. The actual implementation is a little +different because the way I'm handling newlines is different. + +To begin with, let's simply allow for whitespace. This involves +only adding calls to SkipWhite at the end of the three routines, +GetName, GetNum, and Match. A call to SkipWhite in Init primes +the pump in case there are leading spaces. + +Next, we need to deal with newlines. This is really a two-step +process, since the treatment of the newlines with single- +character tokens is different from that for multi-character ones. +We can eliminate some work by doing both steps at once, but I +feel safer taking things one step at a time. 
+ +Insert the new procedure: + + +{--------------------------------------------------------------} +{ Skip Over an End-of-Line } + +procedure NewLine; +begin + while Look = CR do begin + GetChar; + if Look = LF then GetChar; + SkipWhite; + end; +end; +{--------------------------------------------------------------} + + +Note that we have seen this procedure before in the form of +Procedure Fin. I've changed the name since this new one seems +more descriptive of the actual function. I've also changed the +code to allow for multiple newlines and lines with nothing but +white space. + +The next step is to insert calls to NewLine wherever we decide a +newline is permissible. As I've pointed out before, this can be +very different in different languages. In TINY, I've decided to +allow them virtually anywhere. This means that we need calls to +NewLine at the BEGINNING (not the end, as with SkipWhite) of the +procedures GetName, GetNum, and Match. + +For procedures that have while loops, such as TopDecls, we need a +call to NewLine at the beginning of the procedure AND at the +bottom of each loop. That way, we can be assured that NewLine +has just been called at the beginning of each pass through the +loop. + +If you've got all this done, try the program out and verify that +it will indeed handle white space and newlines. + +If it does, then we're ready to deal with multi-character tokens +and keywords. 
To begin, add the additional declarations (copied +almost verbatim from Part VII): + + +{--------------------------------------------------------------} +{ Type Declarations } + +type Symbol = string[8]; + + SymTab = array[1..1000] of Symbol; + + TabPtr = ^SymTab; + + +{--------------------------------------------------------------} +{ Variable Declarations } + +var Look : char; { Lookahead Character } + Token: char; { Encoded Token } + Value: string[16]; { Unencoded Token } + + ST: Array['A'..'Z'] of char; + +{--------------------------------------------------------------} +{ Definition of Keywords and Token Types } + +const NKW = 9; + NKW1 = 10; + +const KWlist: array[1..NKW] of Symbol = + ('IF', 'ELSE', 'ENDIF', 'WHILE', 'ENDWHILE', + 'VAR', 'BEGIN', 'END', 'PROGRAM'); + +const KWcode: string[NKW1] = 'xilewevbep'; +{--------------------------------------------------------------} + + +Next, add the three procedures, also from Part VII: + + +{--------------------------------------------------------------} +{ Table Lookup } + +function Lookup(T: TabPtr; s: string; n: integer): integer; +var i: integer; + found: Boolean; +begin + found := false; + i := n; + while (i > 0) and not found do + if s = T^[i] then + found := true + else + dec(i); + Lookup := i; +end; +{--------------------------------------------------------------} +. +. +{--------------------------------------------------------------} +{ Get an Identifier and Scan it for Keywords } + +procedure Scan; +begin + GetName; + Token := KWcode[Lookup(Addr(KWlist), Value, NKW) + 1]; +end; +{--------------------------------------------------------------} +. +. +{--------------------------------------------------------------} +{ Match a Specific Input String } + +procedure MatchString(x: string); +begin + if Value <> x then Expected('''' + x + ''''); +end; +{--------------------------------------------------------------} + + +Now, we have to make a fairly large number of subtle changes to +the remaining procedures. 
First, we must change the function +GetName to a procedure, again as we did in Part VII: + + +{--------------------------------------------------------------} +{ Get an Identifier } + +procedure GetName; +begin + NewLine; + if not IsAlpha(Look) then Expected('Name'); + Value := ''; + while IsAlNum(Look) do begin + Value := Value + UpCase(Look); + GetChar; + end; + SkipWhite; +end; +{--------------------------------------------------------------} + + +Note that this procedure leaves its result in the global string +Value. + +Next, we have to change every reference to GetName to reflect its +new form. These occur in Factor, Assignment, and Decl: + + +{---------------------------------------------------------------} +{ Parse and Translate a Math Factor } + +procedure BoolExpression; Forward; + +procedure Factor; +begin + if Look = '(' then begin + Match('('); + BoolExpression; + Match(')'); + end + else if IsAlpha(Look) then begin + GetName; + LoadVar(Value[1]); + end + else + LoadConst(GetNum); +end; +{--------------------------------------------------------------} +. +. +{--------------------------------------------------------------} +{ Parse and Translate an Assignment Statement } + +procedure Assignment; +var Name: char; +begin + Name := Value[1]; + Match('='); + BoolExpression; + Store(Name); +end; +{---------------------------------------------------------------} +. +. +{--------------------------------------------------------------} +{ Parse and Translate a Data Declaration } + +procedure Decl; +begin + GetName; + Alloc(Value[1]); + while Look = ',' do begin + Match(','); + GetName; + Alloc(Value[1]); + end; +end; +{--------------------------------------------------------------} + + +(Note that we're still only allowing single-character variable +names, so we take the easy way out here and simply use the first +character of the string.) 
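Before changing the callers, it may help to see what Scan's one-line encoding actually produces. The sketch below is a stand-alone approximation rather than the compiler's own code: it uses the same KWlist and KWcode tables as the declarations above, but searches the list directly instead of going through Lookup and a TabPtr, and the function name Encode is hypothetical.

```pascal
program KeywordDemo;
{ Hypothetical stand-alone sketch of the keyword encoding step }

type Symbol = string[8];

const NKW = 9;

      KWlist: array[1..NKW] of Symbol =
              ('IF', 'ELSE', 'ENDIF', 'WHILE', 'ENDWHILE',
               'VAR', 'BEGIN', 'END', 'PROGRAM');

      KWcode: string[10] = 'xilewevbep';

{ Return the one-character code for s, or 'x' if s is not a keyword }
function Encode(s: string): char;
var i: integer;
begin
   i := NKW;
   while (i > 0) and (s <> KWlist[i]) do
      Dec(i);                  { i ends at 0 when nothing matches }
   Encode := KWcode[i + 1];    { slot 1 holds 'x', the identifier code }
end;

begin
   WriteLn('WHILE    -> ', Encode('WHILE'));
   WriteLn('ENDWHILE -> ', Encode('ENDWHILE'));
   WriteLn('ENDIF    -> ', Encode('ENDIF'));
   WriteLn('COUNT    -> ', Encode('COUNT'));
end.
```

Note that ENDIF, ENDWHILE and END all share the code 'e'. That is why the control-structure routines use MatchString on the full text in Value to tell them apart.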
+ +Finally, we must make the changes to use Token instead of Look as +the test character and to call Scan at the appropriate places. +Mostly, this involves deleting calls to Match, occasionally +replacing calls to Match by calls to MatchString, and Replacing +calls to NewLine by calls to Scan. Here are the affected +routines: + +{---------------------------------------------------------------} +{ Recognize and Translate an IF Construct } + +procedure Block; Forward; + + +procedure DoIf; +var L1, L2: string; +begin + BoolExpression; + L1 := NewLabel; + L2 := L1; + BranchFalse(L1); + Block; + if Token = 'l' then begin + L2 := NewLabel; + Branch(L2); + PostLabel(L1); + Block; + end; + PostLabel(L2); + MatchString('ENDIF'); +end; + + +{--------------------------------------------------------------} +{ Parse and Translate a WHILE Statement } + +procedure DoWhile; +var L1, L2: string; +begin + L1 := NewLabel; + L2 := NewLabel; + PostLabel(L1); + BoolExpression; + BranchFalse(L2); + Block; + MatchString('ENDWHILE'); + Branch(L1); + PostLabel(L2); +end; + + +{--------------------------------------------------------------} +{ Parse and Translate a Block of Statements } + +procedure Block; +begin + Scan; + while not(Token in ['e', 'l']) do begin + case Token of + 'i': DoIf; + 'w': DoWhile; + else Assignment; + end; + Scan; + end; +end; + + +{--------------------------------------------------------------} +{ Parse and Translate Global Declarations } + +procedure TopDecls; +begin + Scan; + while Token <> 'b' do begin + case Token of + 'v': Decl; + else Abort('Unrecognized Keyword ' + Value); + end; + Scan; + end; +end; + + +{--------------------------------------------------------------} +{ Parse and Translate a Main Program } + +procedure Main; +begin + MatchString('BEGIN'); + Prolog; + Block; + MatchString('END'); + Epilog; +end; + +{--------------------------------------------------------------} +{ Parse and Translate a Program } + +procedure Prog; +begin + 
MatchString('PROGRAM'); + Header; + TopDecls; + Main; + Match('.'); +end; + + +{--------------------------------------------------------------} +{ Initialize } + +procedure Init; +var i: char; +begin + for i := 'A' to 'Z' do + ST[i] := ' '; + GetChar; + Scan; +end; +{--------------------------------------------------------------} + + +That should do it. If all the changes got in correctly, you +should now be parsing programs that look like programs. (If you +didn't make it through all the changes, don't despair. A +complete listing of the final form is given later.) + +Did it work? If so, then we're just about home. In fact, with a +few minor exceptions we've already got a compiler that's usable. +There are still a few areas that need improvement. + + +MULTI-CHARACTER VARIABLE NAMES + +One of those is the restriction that we still have, requiring +single-character variable names. Now that we can handle multi- +character keywords, this one begins to look very much like an +arbitrary and unnecessary limitation. And indeed it is. +Basically, its only virtue is that it permits a trivially simple +implementation of the symbol table. But that's just a +convenience to the compiler writers, and needs to be eliminated. + +We've done this step before. This time, as usual, I'm doing it a +little differently. I think the approach used here keeps things +just about as simple as possible. + +The natural way to implement a symbol table in Pascal is by +declaring a record type, and making the symbol table an array of +such records. Here, though, we don't really need a type field +yet (there is only one kind of entry allowed so far), so we only +need an array of symbols. This has the advantage that we can use +the existing procedure Lookup to search the symbol table as well +as the keyword list. As it turns out, even when we need more +fields we can still use the same approach, simply by storing the +other fields in separate arrays. + +OK, here are the changes that need to be made. 
First, add the +new typed constant: + + + NEntry: integer = 0; + + +Then change the definition of the symbol table as follows: + + +const MaxEntry = 100; + +var ST : array[1..MaxEntry] of Symbol; + SType: array[1..MaxEntry] of char; + + +(Note that ST is _NOT_ declared as a SymTab. That declaration is +a phony one to get Lookup to work. A SymTab would take up too +much RAM space, and so one is never actually allocated.) + +Next, we need to replace InTable: + + +{--------------------------------------------------------------} +{ Look for Symbol in Table } + +function InTable(n: Symbol): Boolean; +begin + InTable := Lookup(@ST, n, MaxEntry) <> 0; +end; +{--------------------------------------------------------------} + + +We also need a new procedure, AddEntry, that adds a new entry to +the table: + + +{--------------------------------------------------------------} +{ Add a New Entry to Symbol Table } + +procedure AddEntry(N: Symbol; T: char); +begin + if InTable(N) then Abort('Duplicate Identifier ' + N); + if NEntry = MaxEntry then Abort('Symbol Table Full'); + Inc(NEntry); + ST[NEntry] := N; + SType[NEntry] := T; +end; +{--------------------------------------------------------------} + + +This procedure is called by Alloc: + + +{--------------------------------------------------------------} +{ Allocate Storage for a Variable } + +procedure Alloc(N: Symbol); +begin + if InTable(N) then Abort('Duplicate Variable Name ' + N); + AddEntry(N, 'v'); +. +. +. +{--------------------------------------------------------------} + + +Finally, we must change all the routines that currently treat the +variable name as a single character. These include LoadVar and +Store (just change the type from char to string), and Factor, +Assignment, and Decl (just change Value[1] to Value). 
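The parallel-array idea can be tried in isolation. The sketch below is a stand-alone approximation, not the compiler's own routines: Locate is a hypothetical helper that searches only the NEntry slots in use (the version above searches all MaxEntry slots and relies on Init having cleared them), and this AddEntry returns false instead of calling Abort so that the demo can keep running.

```pascal
program SymDemo;
{ Hypothetical stand-alone sketch of a symbol table kept in parallel arrays }

type Symbol = string[8];

const MaxEntry = 100;

var ST    : array[1..MaxEntry] of Symbol;   { names }
    SType : array[1..MaxEntry] of char;     { one type code per name }
    NEntry: integer;

{ Index of n in the table, or 0 if it is not there }
function Locate(n: Symbol): integer;
var i: integer;
begin
   i := NEntry;
   while (i > 0) and (ST[i] <> n) do
      Dec(i);
   Locate := i;
end;

{ Add a name and its type code; false on a duplicate or a full table }
function AddEntry(n: Symbol; t: char): boolean;
begin
   AddEntry := false;
   if (Locate(n) <> 0) or (NEntry = MaxEntry) then Exit;
   Inc(NEntry);
   ST[NEntry] := n;
   SType[NEntry] := t;
   AddEntry := true;
end;

begin
   NEntry := 0;
   WriteLn(AddEntry('COUNT', 'v'));    { TRUE }
   WriteLn(AddEntry('TOTAL', 'v'));    { TRUE }
   WriteLn(AddEntry('COUNT', 'v'));    { FALSE: duplicate name }
   WriteLn(Locate('TOTAL'));           { 2 }
   WriteLn(SType[Locate('TOTAL')]);    { v }
end.
```

Because the index returned by the search is valid for every array, adding another field later only means declaring one more array of the same length.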
+ +One last thing: change procedure Init to clear the array as +shown: + + +{--------------------------------------------------------------} +{ Initialize } + +procedure Init; +var i: integer; +begin + for i := 1 to MaxEntry do begin + ST[i] := ''; + SType[i] := ' '; + end; + GetChar; + Scan; +end; +{--------------------------------------------------------------} + + +That should do it. Try it out and verify that you can, indeed, +use multi-character variable names. + + +MORE RELOPS + +We still have one remaining single-character restriction: the one +on relops. Some of the relops are indeed single characters, but +others require two. These are '<=' and '>='. I also prefer the +Pascal '<>' for "not equals," instead of '#'. + +If you'll recall, in Part VII I pointed out that the conventional +way to deal with relops is to include them in the list of +keywords, and let the lexical scanner find them. But, again, +this requires scanning throughout the expression parsing process, +whereas so far we've been able to limit the use of the scanner to +the beginning of a statement. + +I mentioned then that we can still get away with this, since the +multi-character relops are so few and so limited in their usage. +It's easy to just treat them as special cases and handle them in +an ad hoc manner. + +The changes required affect only the code generation routines and +procedures Relation and friends. 
First, we're going to need two +more code generation routines: + + +{---------------------------------------------------------------} +{ Set D0 If Compare was <= } + +procedure SetLessOrEqual; +begin + EmitLn('SGE D0'); + EmitLn('EXT D0'); +end; + + +{---------------------------------------------------------------} +{ Set D0 If Compare was >= } + +procedure SetGreaterOrEqual; +begin + EmitLn('SLE D0'); + EmitLn('EXT D0'); +end; +{---------------------------------------------------------------} + + +Then, modify the relation parsing routines as shown below: + + +{---------------------------------------------------------------} +{ Recognize and Translate a Relational "Less Than or Equal" } + +procedure LessOrEqual; +begin + Match('='); + Expression; + PopCompare; + SetLessOrEqual; +end; + + +{---------------------------------------------------------------} +{ Recognize and Translate a Relational "Not Equals" } + +procedure NotEqual; +begin + Match('>'); + Expression; + PopCompare; + SetNEqual; +end; + + +{---------------------------------------------------------------} +{ Recognize and Translate a Relational "Less Than" } + +procedure Less; +begin + Match('<'); + case Look of + '=': LessOrEqual; + '>': NotEqual; + else begin + Expression; + PopCompare; + SetLess; + end; + end; +end; + + +{---------------------------------------------------------------} +{ Recognize and Translate a Relational "Greater Than" } + +procedure Greater; +begin + Match('>'); + if Look = '=' then begin + Match('='); + Expression; + PopCompare; + SetGreaterOrEqual; + end + else begin + Expression; + PopCompare; + SetGreater; + end; +end; +{---------------------------------------------------------------} + + +That's all it takes. Now you can process all the relops. Try +it. + + +INPUT/OUTPUT + +We now have a complete, working language, except for one minor +embarrassment: we have no way to get data in or out. We need some +I/O. 
+ +Now, the convention these days, established in C and continued in +Ada and Modula 2, is to leave I/O statements out of the language +itself, and just include them in the subroutine library. That +would be fine, except that so far we have no provision for +subroutines. Anyhow, with this approach you run into the problem +of variable-length argument lists. In Pascal, the I/O statements +are built into the language because they are the only ones for +which the argument list can have a variable number of entries. +In C, we settle for kludges like scanf and printf, and must pass +the argument count to the called procedure. In Ada and Modula 2 +we must use the awkward (and SLOW!) approach of a separate call +for each argument. + +So I think I prefer the Pascal approach of building the I/O in, +even though we don't need to. + +As usual, for this we need some more code generation routines. +These turn out to be the easiest of all, because all we do is to +call library procedures to do the work: + + +{---------------------------------------------------------------} +{ Read Variable to Primary Register } + +procedure ReadVar; +begin + EmitLn('BSR READ'); + Store(Value); +end; + + +{---------------------------------------------------------------} +{ Write Variable from Primary Register } + +procedure WriteVar; +begin + EmitLn('BSR WRITE'); +end; +{--------------------------------------------------------------} + + +The idea is that READ loads the value from input to the D0, and +WRITE outputs it from there. + +These two procedures represent our first encounter with a need +for library procedures ... the components of a Run Time Library +(RTL). Of course, someone (namely us) has to write these +routines, but they're not part of the compiler itself. I won't +even bother showing the routines here, since these are obviously +very much OS-dependent. I _WILL_ simply say that for SK*DOS, +they are particularly simple ... almost trivial. 
One reason I +won't show them here is that you can add all kinds of fanciness +to the things, for example by prompting in READ for the inputs, +and by giving the user a chance to reenter a bad input. + +But that is really separate from compiler design, so for now I'll +just assume that a library called TINYLIB.LIB exists. Since we now +need it loaded, we need to add a statement to include it in +procedure Header: + + +{--------------------------------------------------------------} +{ Write Header Info } + +procedure Header; +begin + + WriteLn('WARMST', TAB, 'EQU $A01E'); + EmitLn('LIB TINYLIB'); +end; +{--------------------------------------------------------------} + +That takes care of that part. Now, we also need to recognize the +read and write commands. We can do this by adding two more +keywords to our list: + + +{--------------------------------------------------------------} +{ Definition of Keywords and Token Types } + +const NKW = 11; + NKW1 = 12; + +const KWlist: array[1..NKW] of Symbol = + ('IF', 'ELSE', 'ENDIF', 'WHILE', 'ENDWHILE', + 'READ', 'WRITE', 'VAR', 'BEGIN', 'END', +'PROGRAM'); + +const KWcode: string[NKW1] = 'xileweRWvbep'; +{--------------------------------------------------------------} + + +(Note how I'm using upper case codes here to avoid conflict with +the 'w' of WHILE.) 
+ +Next, we need procedures for processing the read/write statement +and its argument list: + + +{--------------------------------------------------------------} +{ Process a Read Statement } +procedure DoRead; +begin + Match('('); + GetName; + ReadVar; + while Look = ',' do begin + Match(','); + GetName; + ReadVar; + end; + Match(')'); +end; + + +{--------------------------------------------------------------} +{ Process a Write Statement } + +procedure DoWrite; +begin + Match('('); + Expression; + WriteVar; + while Look = ',' do begin + Match(','); + Expression; + WriteVar; + end; + Match(')'); +end; +{--------------------------------------------------------------} + + +Finally, we must expand procedure Block to handle the new +statement types: + + +{--------------------------------------------------------------} +{ Parse and Translate a Block of Statements } + +procedure Block; +begin + Scan; + while not(Token in ['e', 'l']) do begin + case Token of + 'i': DoIf; + 'w': DoWhile; + 'R': DoRead; + 'W': DoWrite; + else Assignment; + end; + Scan; + end; +end; +{--------------------------------------------------------------} + +That's all there is to it. _NOW_ we have a language! + + +CONCLUSION + +At this point we have TINY completely defined. It's not much ... +actually a toy compiler. TINY has only one data type and no +subroutines ... but it's a complete, usable language. While +you're not likely to be able to write another compiler in it, or +do anything else very seriously, you could write programs to read +some input, perform calculations, and output the results. Not +too bad for a toy. + +Most importantly, we have a firm base upon which to build further +extensions. I know you'll be glad to hear this: this is the last +time I'll start over in building a parser ... from now on I +intend to just add features to TINY until it becomes KISS. 
Oh, +there'll be other times we will need to try things out with new +copies of the Cradle, but once we've found out how to do those +things they'll be incorporated into TINY. + +What will those features be? Well, for starters we need +subroutines and functions. Then we need to be able to handle +different types, including arrays, strings, and other structures. +Then we need to deal with the idea of pointers. All this will be +upcoming in future installments. + +See you then. + +For reference purposes, the complete listing of TINY Version 1.0 +is shown below: + + +{--------------------------------------------------------------} +program Tiny10; + +{--------------------------------------------------------------} +{ Constant Declarations } + +const TAB = ^I; + CR = ^M; + LF = ^J; + + LCount: integer = 0; + NEntry: integer = 0; + + +{--------------------------------------------------------------} +{ Type Declarations } + +type Symbol = string[8]; + + SymTab = array[1..1000] of Symbol; + TabPtr = ^SymTab; + + +{--------------------------------------------------------------} +{ Variable Declarations } + +var Look : char; { Lookahead Character } + Token: char; { Encoded Token } + Value: string[16]; { Unencoded Token } + + +const MaxEntry = 100; + +var ST : array[1..MaxEntry] of Symbol; + SType: array[1..MaxEntry] of char; + + +{--------------------------------------------------------------} +{ Definition of Keywords and Token Types } + +const NKW = 11; + NKW1 = 12; + +const KWlist: array[1..NKW] of Symbol = + ('IF', 'ELSE', 'ENDIF', 'WHILE', 'ENDWHILE', + 'READ', 'WRITE', 'VAR', 'BEGIN', 'END', +'PROGRAM'); + +const KWcode: string[NKW1] = 'xileweRWvbep'; + + +{--------------------------------------------------------------} +{ Read New Character From Input Stream } + +procedure GetChar; +begin + Read(Look); +end; + +{--------------------------------------------------------------} +{ Report an Error } + +procedure Error(s: string); +begin + WriteLn; + WriteLn(^G, 'Error: 
', s, '.'); +end; + + +{--------------------------------------------------------------} +{ Report Error and Halt } + +procedure Abort(s: string); +begin + Error(s); + Halt; +end; + + +{--------------------------------------------------------------} +{ Report What Was Expected } + +procedure Expected(s: string); +begin + Abort(s + ' Expected'); +end; + +{--------------------------------------------------------------} +{ Report an Undefined Identifier } + +procedure Undefined(n: string); +begin + Abort('Undefined Identifier ' + n); +end; + + +{--------------------------------------------------------------} +{ Recognize an Alpha Character } + +function IsAlpha(c: char): boolean; +begin + IsAlpha := UpCase(c) in ['A'..'Z']; +end; + + +{--------------------------------------------------------------} +{ Recognize a Decimal Digit } + +function IsDigit(c: char): boolean; +begin + IsDigit := c in ['0'..'9']; +end; + + +{--------------------------------------------------------------} +{ Recognize an AlphaNumeric Character } + +function IsAlNum(c: char): boolean; +begin + IsAlNum := IsAlpha(c) or IsDigit(c); +end; + + +{--------------------------------------------------------------} +{ Recognize an Addop } + +function IsAddop(c: char): boolean; +begin + IsAddop := c in ['+', '-']; +end; + + +{--------------------------------------------------------------} +{ Recognize a Mulop } + +function IsMulop(c: char): boolean; +begin + IsMulop := c in ['*', '/']; +end; + + +{--------------------------------------------------------------} +{ Recognize a Boolean Orop } + +function IsOrop(c: char): boolean; +begin + IsOrop := c in ['|', '~']; +end; + + +{--------------------------------------------------------------} +{ Recognize a Relop } + +function IsRelop(c: char): boolean; +begin + IsRelop := c in ['=', '#', '<', '>']; +end; + + +{--------------------------------------------------------------} +{ Recognize White Space } + +function IsWhite(c: char): boolean; +begin + IsWhite := c in 
[' ', TAB]; +end; + + +{--------------------------------------------------------------} +{ Skip Over Leading White Space } + +procedure SkipWhite; +begin + while IsWhite(Look) do + GetChar; +end; + + +{--------------------------------------------------------------} +{ Skip Over an End-of-Line } + +procedure NewLine; +begin + while Look = CR do begin + GetChar; + if Look = LF then GetChar; + SkipWhite; + end; +end; + + +{--------------------------------------------------------------} +{ Match a Specific Input Character } + +procedure Match(x: char); +begin + NewLine; + if Look = x then GetChar + else Expected('''' + x + ''''); + SkipWhite; +end; + + +{--------------------------------------------------------------} +{ Table Lookup } + +function Lookup(T: TabPtr; s: string; n: integer): integer; +var i: integer; + found: Boolean; +begin + found := false; + i := n; + while (i > 0) and not found do + if s = T^[i] then + found := true + else + dec(i); + Lookup := i; +end; + + +{--------------------------------------------------------------} +{ Locate a Symbol in Table } +{ Returns the index of the entry. Zero if not present. 
} + +function Locate(N: Symbol): integer; +begin + Locate := Lookup(@ST, n, MaxEntry); +end; + + +{--------------------------------------------------------------} +{ Look for Symbol in Table } + +function InTable(n: Symbol): Boolean; +begin + InTable := Lookup(@ST, n, MaxEntry) <> 0; +end; + + +{--------------------------------------------------------------} +{ Add a New Entry to Symbol Table } + +procedure AddEntry(N: Symbol; T: char); +begin + if InTable(N) then Abort('Duplicate Identifier ' + N); + if NEntry = MaxEntry then Abort('Symbol Table Full'); + Inc(NEntry); + ST[NEntry] := N; + SType[NEntry] := T; +end; + + +{--------------------------------------------------------------} +{ Get an Identifier } + +procedure GetName; +begin + NewLine; + if not IsAlpha(Look) then Expected('Name'); + Value := ''; + while IsAlNum(Look) do begin + Value := Value + UpCase(Look); + GetChar; + end; + SkipWhite; +end; + + +{--------------------------------------------------------------} +{ Get a Number } + +function GetNum: integer; +var Val: integer; +begin + NewLine; + if not IsDigit(Look) then Expected('Integer'); + Val := 0; + while IsDigit(Look) do begin + Val := 10 * Val + Ord(Look) - Ord('0'); + GetChar; + end; + GetNum := Val; + SkipWhite; +end; + + +{--------------------------------------------------------------} +{ Get an Identifier and Scan it for Keywords } + +procedure Scan; +begin + GetName; + Token := KWcode[Lookup(Addr(KWlist), Value, NKW) + 1]; +end; + + +{--------------------------------------------------------------} +{ Match a Specific Input String } + +procedure MatchString(x: string); +begin + if Value <> x then Expected('''' + x + ''''); +end; + + +{--------------------------------------------------------------} +{ Output a String with Tab } + +procedure Emit(s: string); +begin + Write(TAB, s); +end; + + +{--------------------------------------------------------------} +{ Output a String with Tab and CRLF } + +procedure EmitLn(s: string); +begin + Emit(s); 
+ WriteLn; +end; + + +{--------------------------------------------------------------} +{ Generate a Unique Label } + +function NewLabel: string; +var S: string; +begin + Str(LCount, S); + NewLabel := 'L' + S; + Inc(LCount); +end; + + +{--------------------------------------------------------------} +{ Post a Label To Output } + +procedure PostLabel(L: string); +begin + WriteLn(L, ':'); +end; + + +{---------------------------------------------------------------} +{ Clear the Primary Register } + +procedure Clear; +begin + EmitLn('CLR D0'); +end; + + +{---------------------------------------------------------------} +{ Negate the Primary Register } + +procedure Negate; +begin + EmitLn('NEG D0'); +end; + + +{---------------------------------------------------------------} +{ Complement the Primary Register } + +procedure NotIt; +begin + EmitLn('NOT D0'); +end; + + +{---------------------------------------------------------------} +{ Load a Constant Value to Primary Register } + +procedure LoadConst(n: integer); +begin + Emit('MOVE #'); + WriteLn(n, ',D0'); +end; + + +{---------------------------------------------------------------} +{ Load a Variable to Primary Register } + +procedure LoadVar(Name: string); +begin + if not InTable(Name) then Undefined(Name); + EmitLn('MOVE ' + Name + '(PC),D0'); +end; + + +{---------------------------------------------------------------} +{ Push Primary onto Stack } + +procedure Push; +begin + EmitLn('MOVE D0,-(SP)'); +end; + + +{---------------------------------------------------------------} +{ Add Top of Stack to Primary } + +procedure PopAdd; +begin + EmitLn('ADD (SP)+,D0'); +end; + + +{---------------------------------------------------------------} +{ Subtract Primary from Top of Stack } + +procedure PopSub; +begin + EmitLn('SUB (SP)+,D0'); + EmitLn('NEG D0'); +end; + + +{---------------------------------------------------------------} +{ Multiply Top of Stack by Primary } + +procedure PopMul; +begin + EmitLn('MULS (SP)+,D0'); 
+end; + + +{---------------------------------------------------------------} +{ Divide Top of Stack by Primary } + +procedure PopDiv; +begin + EmitLn('MOVE (SP)+,D7'); + EmitLn('EXT.L D7'); + EmitLn('DIVS D0,D7'); + EmitLn('MOVE D7,D0'); +end; + + +{---------------------------------------------------------------} +{ AND Top of Stack with Primary } + +procedure PopAnd; +begin + EmitLn('AND (SP)+,D0'); +end; + + +{---------------------------------------------------------------} +{ OR Top of Stack with Primary } + +procedure PopOr; +begin + EmitLn('OR (SP)+,D0'); +end; + + +{---------------------------------------------------------------} +{ XOR Top of Stack with Primary } + +procedure PopXor; +begin + EmitLn('EOR (SP)+,D0'); +end; + + +{---------------------------------------------------------------} +{ Compare Top of Stack with Primary } + +procedure PopCompare; +begin + EmitLn('CMP (SP)+,D0'); +end; + + +{---------------------------------------------------------------} +{ Set D0 If Compare was = } + +procedure SetEqual; +begin + EmitLn('SEQ D0'); + EmitLn('EXT D0'); +end; + + +{---------------------------------------------------------------} +{ Set D0 If Compare was != } + +procedure SetNEqual; +begin + EmitLn('SNE D0'); + EmitLn('EXT D0'); +end; + + +{---------------------------------------------------------------} +{ Set D0 If Compare was > } + +procedure SetGreater; +begin + EmitLn('SLT D0'); + EmitLn('EXT D0'); +end; + + +{---------------------------------------------------------------} +{ Set D0 If Compare was < } + +procedure SetLess; +begin + EmitLn('SGT D0'); + EmitLn('EXT D0'); +end; + + +{---------------------------------------------------------------} +{ Set D0 If Compare was <= } + +procedure SetLessOrEqual; +begin + EmitLn('SGE D0'); + EmitLn('EXT D0'); +end; + + +{---------------------------------------------------------------} +{ Set D0 If Compare was >= } + +procedure SetGreaterOrEqual; +begin + EmitLn('SLE D0'); + EmitLn('EXT D0'); +end; + + 
+{---------------------------------------------------------------} +{ Store Primary to Variable } + +procedure Store(Name: string); +begin + if not InTable(Name) then Undefined(Name); + EmitLn('LEA ' + Name + '(PC),A0'); + EmitLn('MOVE D0,(A0)') +end; + + +{---------------------------------------------------------------} +{ Branch Unconditional } + +procedure Branch(L: string); +begin + EmitLn('BRA ' + L); +end; + + +{---------------------------------------------------------------} +{ Branch False } + +procedure BranchFalse(L: string); +begin + EmitLn('TST D0'); + EmitLn('BEQ ' + L); +end; + + +{---------------------------------------------------------------} +{ Read Variable to Primary Register } + +procedure ReadVar; +begin + EmitLn('BSR READ'); + Store(Value); +end; + + +{---------------------------------------------------------------} +{ Write Variable from Primary Register } + +procedure WriteVar; +begin + EmitLn('BSR WRITE'); +end; + + +{--------------------------------------------------------------} +{ Write Header Info } + +procedure Header; +begin + WriteLn('WARMST', TAB, 'EQU $A01E'); + EmitLn('LIB TINYLIB'); +end; + + +{--------------------------------------------------------------} +{ Write the Prolog } + +procedure Prolog; +begin + PostLabel('MAIN'); +end; + + +{--------------------------------------------------------------} +{ Write the Epilog } + +procedure Epilog; +begin + EmitLn('DC WARMST'); + EmitLn('END MAIN'); +end; + + +{---------------------------------------------------------------} +{ Parse and Translate a Math Factor } + +procedure BoolExpression; Forward; + +procedure Factor; +begin + if Look = '(' then begin + Match('('); + BoolExpression; + Match(')'); + end + else if IsAlpha(Look) then begin + GetName; + LoadVar(Value); + end + else + LoadConst(GetNum); +end; + + +{--------------------------------------------------------------} +{ Parse and Translate a Negative Factor } + +procedure NegFactor; +begin + Match('-'); + if IsDigit(Look) then + LoadConst(-GetNum) + else begin + Factor; + Negate; + end; +end; + + 
+{--------------------------------------------------------------} +{ Parse and Translate a Leading Factor } + +procedure FirstFactor; +begin + case Look of + '+': begin + Match('+'); + Factor; + end; + '-': NegFactor; + else Factor; + end; +end; + + +{--------------------------------------------------------------} +{ Recognize and Translate a Multiply } + +procedure Multiply; +begin + Match('*'); + Factor; + PopMul; +end; + + +{-------------------------------------------------------------} +{ Recognize and Translate a Divide } + +procedure Divide; +begin + Match('/'); + Factor; + PopDiv; +end; + + +{---------------------------------------------------------------} +{ Common Code Used by Term and FirstTerm } + +procedure Term1; +begin + while IsMulop(Look) do begin + Push; + case Look of + '*': Multiply; + '/': Divide; + end; + end; +end; + + +{---------------------------------------------------------------} +{ Parse and Translate a Math Term } + +procedure Term; +begin + Factor; + Term1; +end; + + +{---------------------------------------------------------------} +{ Parse and Translate a Leading Term } + +procedure FirstTerm; +begin + FirstFactor; + Term1; +end; + + +{--------------------------------------------------------------} +{ Recognize and Translate an Add } + +procedure Add; +begin + Match('+'); + Term; + PopAdd; +end; + + +{-------------------------------------------------------------} +{ Recognize and Translate a Subtract } + +procedure Subtract; +begin + Match('-'); + Term; + PopSub; +end; + + +{---------------------------------------------------------------} +{ Parse and Translate an Expression } + +procedure Expression; +begin + FirstTerm; + while IsAddop(Look) do begin + Push; + case Look of + '+': Add; + '-': Subtract; + end; + end; +end; + + +{---------------------------------------------------------------} +{ Recognize and Translate a Relational "Equals" } + +procedure Equal; +begin + Match('='); + Expression; + PopCompare; + SetEqual; +end; + + 
+{---------------------------------------------------------------} +{ Recognize and Translate a Relational "Less Than or Equal" } + +procedure LessOrEqual; +begin + Match('='); + Expression; + PopCompare; + SetLessOrEqual; +end; + + +{---------------------------------------------------------------} +{ Recognize and Translate a Relational "Not Equals" } + +procedure NotEqual; +begin + Match('>'); + Expression; + PopCompare; + SetNEqual; +end; + + +{---------------------------------------------------------------} +{ Recognize and Translate a Relational "Less Than" } + +procedure Less; +begin + Match('<'); + case Look of + '=': LessOrEqual; + '>': NotEqual; + else begin + Expression; + PopCompare; + SetLess; + end; + end; +end; + + +{---------------------------------------------------------------} +{ Recognize and Translate a Relational "Greater Than" } + +procedure Greater; +begin + Match('>'); + if Look = '=' then begin + Match('='); + Expression; + PopCompare; + SetGreaterOrEqual; + end + else begin + Expression; + PopCompare; + SetGreater; + end; +end; + + +{---------------------------------------------------------------} +{ Parse and Translate a Relation } + + +procedure Relation; +begin + Expression; + if IsRelop(Look) then begin + Push; + case Look of + '=': Equal; + '<': Less; + '>': Greater; + end; + end; +end; + + +{---------------------------------------------------------------} +{ Parse and Translate a Boolean Factor with Leading NOT } + +procedure NotFactor; +begin + if Look = '!' 
then begin + Match('!'); + Relation; + NotIt; + end + else + Relation; +end; + + +{---------------------------------------------------------------} +{ Parse and Translate a Boolean Term } + +procedure BoolTerm; +begin + NotFactor; + while Look = '&' do begin + Push; + Match('&'); + NotFactor; + PopAnd; + end; +end; + + +{--------------------------------------------------------------} +{ Recognize and Translate a Boolean OR } + +procedure BoolOr; +begin + Match('|'); + BoolTerm; + PopOr; +end; + + +{--------------------------------------------------------------} +{ Recognize and Translate an Exclusive Or } + +procedure BoolXor; +begin + Match('~'); + BoolTerm; + PopXor; +end; + + +{---------------------------------------------------------------} +{ Parse and Translate a Boolean Expression } + +procedure BoolExpression; +begin + BoolTerm; + while IsOrOp(Look) do begin + Push; + case Look of + '|': BoolOr; + '~': BoolXor; + end; + end; +end; + + +{--------------------------------------------------------------} +{ Parse and Translate an Assignment Statement } + +procedure Assignment; +var Name: string; +begin + Name := Value; + Match('='); + BoolExpression; + Store(Name); +end; + + +{---------------------------------------------------------------} +{ Recognize and Translate an IF Construct } + +procedure Block; Forward; + + +procedure DoIf; +var L1, L2: string; +begin + BoolExpression; + L1 := NewLabel; + L2 := L1; + BranchFalse(L1); + Block; + if Token = 'l' then begin + L2 := NewLabel; + Branch(L2); + PostLabel(L1); + Block; + end; + PostLabel(L2); + MatchString('ENDIF'); +end; + + +{--------------------------------------------------------------} +{ Parse and Translate a WHILE Statement } + +procedure DoWhile; +var L1, L2: string; +begin + L1 := NewLabel; + L2 := NewLabel; + PostLabel(L1); + BoolExpression; + BranchFalse(L2); + Block; + MatchString('ENDWHILE'); + Branch(L1); + PostLabel(L2); +end; + + +{--------------------------------------------------------------} 
+{ Process a Read Statement } + +procedure DoRead; +begin + Match('('); + GetName; + ReadVar; + while Look = ',' do begin + Match(','); + GetName; + ReadVar; + end; + Match(')'); +end; + + +{--------------------------------------------------------------} +{ Process a Write Statement } + +procedure DoWrite; +begin + Match('('); + Expression; + WriteVar; + while Look = ',' do begin + Match(','); + Expression; + WriteVar; + end; + Match(')'); +end; + + +{--------------------------------------------------------------} +{ Parse and Translate a Block of Statements } + +procedure Block; +begin + Scan; + while not(Token in ['e', 'l']) do begin + case Token of + 'i': DoIf; + 'w': DoWhile; + 'R': DoRead; + 'W': DoWrite; + else Assignment; + end; + Scan; + end; +end; + + +{--------------------------------------------------------------} +{ Allocate Storage for a Variable } + +procedure Alloc(N: Symbol); +begin + if InTable(N) then Abort('Duplicate Variable Name ' + N); + AddEntry(N, 'v'); + Write(N, ':', TAB, 'DC '); + if Look = '=' then begin + Match('='); + If Look = '-' then begin + Write(Look); + Match('-'); + end; + WriteLn(GetNum); + end + else + WriteLn('0'); +end; + + +{--------------------------------------------------------------} +{ Parse and Translate a Data Declaration } + +procedure Decl; +begin + GetName; + Alloc(Value); + while Look = ',' do begin + Match(','); + GetName; + Alloc(Value); + end; +end; + + +{--------------------------------------------------------------} +{ Parse and Translate Global Declarations } + +procedure TopDecls; +begin + Scan; + while Token <> 'b' do begin + case Token of + 'v': Decl; + else Abort('Unrecognized Keyword ' + Value); + end; + Scan; + end; +end; + + +{--------------------------------------------------------------} +{ Parse and Translate a Main Program } + +procedure Main; +begin + MatchString('BEGIN'); + Prolog; + Block; + MatchString('END'); + Epilog; +end; + + 
+{--------------------------------------------------------------} +{ Parse and Translate a Program } + +procedure Prog; +begin + MatchString('PROGRAM'); + Header; + TopDecls; + Main; + Match('.'); +end; + + +{--------------------------------------------------------------} +{ Initialize } + +procedure Init; +var i: integer; +begin + for i := 1 to MaxEntry do begin + ST[i] := ''; + SType[i] := ' '; + end; + GetChar; + Scan; +end; + + +{--------------------------------------------------------------} +{ Main Program } + +begin + Init; + Prog; + if Look <> CR then Abort('Unexpected data after ''.'''); +end. +{--------------------------------------------------------------} + + + +***************************************************************** +* * +* COPYRIGHT NOTICE * +* * +* Copyright (C) 1989 Jack W. Crenshaw. All rights reserved. * +* * +***************************************************************** + diff --git a/Compiler construction/Lets_build_compiler/tutor11.txt b/Compiler construction/Lets_build_compiler/tutor11.txt new file mode 100644 index 0000000..860d361 --- /dev/null +++ b/Compiler construction/Lets_build_compiler/tutor11.txt @@ -0,0 +1,1741 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + LET'S BUILD A COMPILER! + + By + + Jack W. Crenshaw, Ph.D. + + 3 June 1989 + + + Part XI: LEXICAL SCAN REVISITED + + +***************************************************************** +* * +* COPYRIGHT NOTICE * +* * +* Copyright (C) 1989 Jack W. Crenshaw. All rights reserved. * +* * +***************************************************************** + + +INTRODUCTION + +I've got some good news and some bad news. The bad news is that +this installment is not the one I promised last time. What's +more, the one after this one won't be, either. + +The good news is the reason for this installment: I've found a +way to simplify and improve the lexical scanning part of the +compiler. Let me explain. 
+ + +BACKGROUND + +If you'll remember, we talked at length about the subject of +lexical scanners in Part VII, and I left you with a design for a +distributed scanner that I felt was about as simple as I could +make it ... more than most that I've seen elsewhere. We used +that idea in Part X. The compiler structure that resulted was +simple, and it got the job done. + +Recently, though, I've begun to have problems, and they're the +kind that send a message that you might be doing something wrong. + +The whole thing came to a head when I tried to address the issue +of semicolons. Several people have asked me about them, and +whether or not KISS will have them separating the statements. My +intention has been NOT to use semicolons, simply because I don't +like them and, as you can see, they have not proved necessary. + +But I know that many of you, like me, have gotten used to them, +and so I set out to write a short installment to show you how +they could easily be added, if you were so inclined. + +Well, it turned out that they weren't easy to add at all. In +fact it was darned difficult. + +I guess I should have realized that something was wrong, because +of the issue of newlines. In the last couple of installments +we've addressed that issue, and I've shown you how to deal with +newlines with a procedure called, appropriately enough, NewLine. +In TINY Version 1.0, I sprinkled calls to this procedure in +strategic spots in the code. + +It seems that every time I've addressed the issue of newlines, +though, I've found it to be tricky, and the resulting parser +turned out to be quite fragile ... one addition or deletion here +or there and things tended to go to pot. Looking back on it, I +realize that there was a message in this that I just wasn't +paying attention to. + +When I tried to add semicolons on top of the newlines, that was +the last straw. I ended up with much too complex a solution. I +began to realize that something fundamental had to change. 
+ +So, in a way this installment will cause us to backtrack a bit +and revisit the issue of scanning all over again. Sorry about +that. That's the price you pay for watching me do this in real +time. But the new version is definitely an improvement, and will +serve us well for what is to come. + +As I said, the scanner we used in Part X was about as simple as +one can get. But anything can be improved. The new scanner is +more like the classical scanner, and not as simple as before. +But the overall compiler structure is even simpler than before. +It's also more robust, and easier to add to and/or modify. I +think that's worth the time spent in this digression. So in this +installment, I'll be showing you the new structure. No doubt +you'll be happy to know that, while the changes affect many +procedures, they aren't very profound and so we lose very little +of what's been done so far. + +Ironically, the new scanner is much more conventional than the +old one, and is very much like the more generic scanner I showed +you earlier in Part VII. Then I started trying to get clever, +and I almost clevered myself clean out of business. You'd think +one day I'd learn: K-I-S-S! + + +THE PROBLEM + +The problem begins to show itself in procedure Block, which I've +reproduced below: + + +{--------------------------------------------------------------} +{ Parse and Translate a Block of Statements } + +procedure Block; +begin + Scan; + while not(Token in ['e', 'l']) do begin + case Token of + 'i': DoIf; + 'w': DoWhile; + 'R': DoRead; + 'W': DoWrite; + else Assignment; + end; + Scan; + end; +end; +{--------------------------------------------------------------} + + +As you can see, Block is oriented to individual program +statements. At each pass through the loop, we know that we are +at the beginning of a statement. We exit the block when we have +scanned an END or an ELSE. + +But suppose that we see a semicolon instead. 
The procedure as +it's shown above can't handle that, because procedure Scan only +expects and can only accept tokens that begin with a letter. + +I tinkered around for quite a while to come up with a fix. I +found many possible approaches, but none were very satisfying. I +finally figured out the reason. + +Recall that when we started with our single-character parsers, we +adopted a convention that the lookahead character would always be +prefetched. That is, we would have the character that +corresponds to our current position in the input stream fetched +into the global character Look, so that we could examine it as +many times as needed. The rule we adopted was that EVERY +recognizer, if it found its target token, would advance Look to +the next character in the input stream. + +That simple and fixed convention served us very well when we had +single-character tokens, and it still does. It would make a lot +of sense to apply the same rule to multi-character tokens. + +But when we got into lexical scanning, I began to violate that +simple rule. The scanner of Part X did indeed advance to the +next token if it found an identifier or keyword, but it DIDN'T do +that if it found a carriage return, a whitespace character, or an +operator. + +Now, that sort of mixed-mode operation gets us into deep trouble +in procedure Block, because whether or not the input stream has +been advanced depends upon the kind of token we encounter. If +it's a keyword or the target of an assignment statement, the +"cursor," as defined by the contents of Look, has been advanced +to the next token OR to the beginning of whitespace. If, on the +other hand, the token is a semicolon, or if we have hit a +carriage return, the cursor has NOT advanced. + +Needless to say, we can add enough logic to keep us on track. +But it's tricky, and makes the whole parser very fragile. 
+ +There's a much better way, and that's just to adopt that same +rule that's worked so well before, to apply to TOKENS as well as +single characters. In other words, we'll prefetch tokens just as +we've always done for characters. It seems so obvious once you +think about it that way. + +Interestingly enough, if we do things this way the problem that +we've had with newline characters goes away. We can just lump +them in as whitespace characters, which means that the handling +of newlines becomes very trivial, and MUCH less prone to error +than we've had to deal with in the past. + + +THE SOLUTION + +Let's begin to fix the problem by re-introducing the two +procedures: + +{--------------------------------------------------------------} +{ Get an Identifier } + +procedure GetName; +begin + SkipWhite; + if Not IsAlpha(Look) then Expected('Identifier'); + Token := 'x'; + Value := ''; + repeat + Value := Value + UpCase(Look); + GetChar; + until not IsAlNum(Look); +end; + + +{--------------------------------------------------------------} +{ Get a Number } + +procedure GetNum; +begin + SkipWhite; + if not IsDigit(Look) then Expected('Number'); + Token := '#'; + Value := ''; + repeat + Value := Value + Look; + GetChar; + until not IsDigit(Look); +end; +{--------------------------------------------------------------} + + +These two procedures are functionally almost identical to the +ones I showed you in Part VII. They each fetch the current +token, either an identifier or a number, into the global string +Value. They also set the encoded version, Token, to the +appropriate code. The input stream is left with Look containing +the first character NOT part of the token. 
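To see that last point in action, here is a small stand-alone
sketch. GetName and GetNum are the ones just shown; everything
else is an assumption made for the demo: the input comes from the
string Source rather than standard input, GetChar and Expected are
cut-down stand-ins, and SkipWhite only knows about blanks and
tabs. After each call it prints Token, Value and whatever is
sitting in Look.

```pascal
program LookDemo;

{ Sketch only: Source, SrcPos and this GetChar are assumptions that }
{ replace reading from standard input.                              }

var Source : string;        { Demo input line                }
    SrcPos : integer;       { Index of next unread character }
    Look   : char;          { Lookahead Character            }
    Token  : char;          { Encoded Token                  }
    Value  : string[16];    { Unencoded Token                }

procedure GetChar;
begin
   if SrcPos <= Length(Source) then begin
      Look := Source[SrcPos];
      Inc(SrcPos);
   end
   else
      Look := #0;           { end of the demo input }
end;

procedure Expected(s: string);
begin
   WriteLn('Error: ', s, ' Expected.');
   Halt;
end;

function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;

function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;

function IsAlNum(c: char): boolean;
begin
   IsAlNum := IsAlpha(c) or IsDigit(c);
end;

procedure SkipWhite;
begin
   while Look in [' ', #9] do
      GetChar;
end;

procedure GetName;
begin
   SkipWhite;
   if not IsAlpha(Look) then Expected('Identifier');
   Token := 'x';
   Value := '';
   repeat
      Value := Value + UpCase(Look);
      GetChar;
   until not IsAlNum(Look);
end;

procedure GetNum;
begin
   SkipWhite;
   if not IsDigit(Look) then Expected('Number');
   Token := '#';
   Value := '';
   repeat
      Value := Value + Look;
      GetChar;
   until not IsDigit(Look);
end;

begin
   Source := 'count1+25;';
   SrcPos := 1;
   GetChar;                 { prefetch the first character }
   GetName;
   WriteLn(Token, ' ', Value, '  Look=', Look);
   GetChar;                 { step over the '+' by hand }
   GetNum;
   WriteLn(Token, ' ', Value, '  Look=', Look);
end.
```

The first line should report COUNT1 with Look sitting on the '+',
and the second should report 25 with Look sitting on the ';'.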
+ +We can do the same thing for operators, even multi-character +operators, with a procedure such as: + + +{--------------------------------------------------------------} +{ Get an Operator } + +procedure GetOp; +begin + Token := Look; + Value := ''; + repeat + Value := Value + Look; + GetChar; + until IsAlpha(Look) or IsDigit(Look) or IsWhite(Look); +end; +{--------------------------------------------------------------} + +Note that GetOp returns, as its encoded token, the FIRST +character of the operator. This is important, because it means +that we can now use that single character to drive the parser, +instead of the lookahead character. + +We need to tie these procedures together into a single procedure +that can handle all three cases. The following procedure will +read any one of the token types and always leave the input stream +advanced beyond it: + + +{--------------------------------------------------------------} +{ Get the Next Input Token } + +procedure Next; +begin + SkipWhite; + if IsAlpha(Look) then GetName + else if IsDigit(Look) then GetNum + else GetOp; +end; +{--------------------------------------------------------------} + + +***NOTE that here I have put SkipWhite BEFORE the calls rather +than after. This means that, in general, the variable Look will +NOT have a meaningful value in it, and therefore we should NOT +use it as a test value for parsing, as we have been doing so far. +That's the big departure from our normal approach. + +Now, remember that before I was careful not to treat the carriage +return (CR) and line feed (LF) characters as white space. This +was because, with SkipWhite called as the last thing in the +scanner, the encounter with LF would trigger a read statement. +If we were on the last line of the program, we couldn't get out +until we input another line with a non-white character. That's +why I needed the second procedure, NewLine, to handle the CRLF's. 
+ +But now, with the call to SkipWhite coming first, that's exactly +the behavior we want. The compiler must know there's another +token coming or it wouldn't be calling Next. In other words, it +hasn't found the terminating END yet. So we're going to insist +on more data until we find something. + +All this means that we can greatly simplify both the program and +the concepts, by treating CR and LF as whitespace characters, and +eliminating NewLine. You can do that simply by modifying the +function IsWhite: + + +{--------------------------------------------------------------} +{ Recognize White Space } + +function IsWhite(c: char): boolean; +begin + IsWhite := c in [' ', TAB, CR, LF]; +end; +{--------------------------------------------------------------} + + +We've already tried similar routines in Part VII, but you might +as well try these new ones out. Add them to a copy of the Cradle +and call Next with the following main program: + + +{--------------------------------------------------------------} +{ Main Program } + +begin + Init; + repeat + Next; + WriteLn(Token, ' ', Value); + until Token = '.'; +end. +{--------------------------------------------------------------} + + +Compile it and verify that you can separate a program into a +series of tokens, and that you get the right encoding for each +token. + +This ALMOST works, but not quite. There are two potential +problems: First, in KISS/TINY almost all of our operators are +single-character operators. The only exceptions are the relops +>=, <=, and <>. It seems a shame to treat all operators as +strings and do a string compare, when only a single character +compare will almost always suffice. Second, and much more +important, the thing doesn't WORK when two operators appear +together, as in (a+b)*(c+d). Here the string following 'b' would +be interpreted as a single operator ")*(." + +It's possible to fix that problem. 
For example, we could just +give GetOp a list of legal characters, and we could treat the +parentheses as different operator types than the others. But +this begins to get messy. + +Fortunately, there's a better way that solves all the problems. +Since almost all the operators are single characters, let's just +treat them that way, and let GetOp get only one character at a +time. This not only simplifies GetOp, but also speeds things up +quite a bit. We still have the problem of the relops, but we +were treating them as special cases anyway. + +So here's the final version of GetOp: + + +{--------------------------------------------------------------} +{ Get an Operator } + +procedure GetOp; +begin + SkipWhite; + Token := Look; + Value := Look; + GetChar; +end; +{--------------------------------------------------------------} + + +Note that I still give the string Value a value. If you're truly +concerned about efficiency, you could leave this out. When we're +expecting an operator, we will only be testing Token anyhow, so +the value of the string won't matter. But to me it seems to be +good practice to give the thing a value just in case. + +Try this new version with some realistic-looking code. You +should be able to separate any program into its individual +tokens, with the caveat that the two-character relops will scan +into two separate tokens. That's OK ... we'll parse them that +way. + +Now, in Part VII the function of Next was combined with procedure +Scan, which also checked every identifier against a list of +keywords and encoded each one that was found. As I mentioned at +the time, the last thing we would want to do is to use such a +procedure in places where keywords should not appear, such as in +expressions. If we did that, the keyword list would be scanned +for every identifier appearing in the code. Not good. + +The right way to deal with that is to simply separate the +functions of fetching tokens and looking for keywords. 
The +version of Scan shown below does NOTHING but check for keywords. +Notice that it operates on the current token and does NOT advance +the input stream. + + +{--------------------------------------------------------------} +{ Scan the Current Identifier for Keywords } + +procedure Scan; +begin + if Token = 'x' then + Token := KWcode[Lookup(Addr(KWlist), Value, NKW) + 1]; +end; +{--------------------------------------------------------------} + + +There is one last detail. In the compiler there are a few places +that we must actually check the string value of the token. +Mainly, this is done to distinguish between the different END's, +but there are a couple of other places. (I should note in +passing that we could always eliminate the need for matching END +characters by encoding each one to a different character. Right +now we are definitely taking the lazy man's route.) + +The following version of MatchString takes the place of the +character-oriented Match. Note that, like Match, it DOES advance +the input stream. + + +{--------------------------------------------------------------} +{ Match a Specific Input String } + +procedure MatchString(x: string); +begin + if Value <> x then Expected('''' + x + ''''); + Next; +end; +{--------------------------------------------------------------} + + +FIXING UP THE COMPILER + +Armed with these new scanner procedures, we can now begin to fix +the compiler to use them properly. The changes are all quite +minor, but there are quite a few places where changes are +necessary. Rather than showing you each place, I will give you +the general idea and then just give the finished product. 
+ + +First of all, the code for procedure Block doesn't change, though +its function does: + + +{--------------------------------------------------------------} +{ Parse and Translate a Block of Statements } + +procedure Block; +begin + Scan; + while not(Token in ['e', 'l']) do begin + case Token of + 'i': DoIf; + 'w': DoWhile; + 'R': DoRead; + 'W': DoWrite; + else Assignment; + end; + Scan; + end; +end; +{--------------------------------------------------------------} + + +Remember that the new version of Scan doesn't advance the input +stream, it only scans for keywords. The input stream must be +advanced by each procedure that Block calls. + +In general, we have to replace every test on Look with a similar +test on Token. For example: + + +{---------------------------------------------------------------} +{ Parse and Translate a Boolean Expression } + +procedure BoolExpression; +begin + BoolTerm; + while IsOrOp(Token) do begin + Push; + case Token of + '|': BoolOr; + '~': BoolXor; + end; + end; +end; +{--------------------------------------------------------------} + + +In procedures like Add, we don't have to use Match anymore. We +need only call Next to advance the input stream: + + +{--------------------------------------------------------------} +{ Recognize and Translate an Add } + +procedure Add; +begin + Next; + Term; + PopAdd; +end; +{-------------------------------------------------------------} + + +Control structures are actually simpler. 
We just call Next to +advance over the control keywords: + + +{---------------------------------------------------------------} +{ Recognize and Translate an IF Construct } + +procedure Block; Forward; + +procedure DoIf; +var L1, L2: string; +begin + Next; + BoolExpression; + L1 := NewLabel; + L2 := L1; + BranchFalse(L1); + Block; + if Token = 'l' then begin + Next; + L2 := NewLabel; + Branch(L2); + PostLabel(L1); + Block; + end; + PostLabel(L2); + MatchString('ENDIF'); +end; +{--------------------------------------------------------------} + + +That's about the extent of the REQUIRED changes. In the listing +of TINY Version 1.1 below, I've also made a number of other +"improvements" that aren't really required. Let me explain them +briefly: + + (1) I've deleted the two procedures Prog and Main, and combined + their functions into the main program. They didn't seem to + add to program clarity ... in fact they seemed to just + muddy things up a little. + + (2) I've deleted the keywords PROGRAM and BEGIN from the + keyword list. Each one only occurs in one place, so it's + not necessary to search for it. + + (3) Having been bitten by an overdose of cleverness, I've + reminded myself that TINY is supposed to be a minimalist + program. Therefore I've replaced the fancy handling of + unary minus with the dumbest one I could think of. A giant + step backwards in code quality, but a great simplification + of the compiler. KISS is the right place to use the other + version. + + (4) I've added some error-checking routines such as CheckTable + and CheckDup, and replaced in-line code by calls to them. + This cleans up a number of routines. + + (5) I've taken the error checking out of code generation + routines like Store, and put it in the parser where it + belongs. See Assignment, for example. + + (6) There was an error in InTable and Locate that caused them + to search all locations instead of only those with valid + data in them. They now search only valid cells. 
This + allows us to eliminate the initialization of the symbol + table, which was done in Init. + + (7) Procedure AddEntry now has two arguments, which helps to + make things a bit more modular. + + (8) I've cleaned up the code for the relational operators by + the addition of the new procedures CompareExpression and + NextExpression. + + (9) I fixed an error in the Read routine ... the earlier version + did not check for a valid variable name. + + + CONCLUSION + +The resulting compiler for TINY is given below. Other than the +removal of the keyword PROGRAM, it parses the same language as +before. It's just a bit cleaner, and more importantly it's +considerably more robust. I feel good about it. + +The next installment will be another digression: the discussion +of semicolons and such that got me into this mess in the first +place. THEN we'll press on into procedures and types. Hang in +there with me. The addition of those features will go a long way +towards removing KISS from the "toy language" category. We're +getting very close to being able to write a serious compiler.
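Before you wade into the listing, one of the fixes above, item
(6), is easy to check on its own.  The following is only a
sketch, not lifted from the compiler:  the duplicate and overflow
checks are written in-line instead of going through CheckDup and
Abort, Locate does its own search rather than calling Lookup, and
the names ALPHA, BETA and GAMMA are made up for the demo.  The
point to notice is that the search starts at NEntry, so cells
that were never filled are never examined, and the table needs no
initialization.

```pascal
{--------------------------------------------------------------}
program SymDemo;

{ Stand-alone sketch; the identifiers entered below are made   }
{ up for the demo.                                             }

const MaxEntry = 100;

type Symbol = string[8];

var ST    : array[1..MaxEntry] of Symbol;
    SType : array[1..MaxEntry] of char;
    NEntry: integer;

{ Return the index of N, or zero if absent.  Only the first    }
{ NEntry cells are searched.                                   }
function Locate(N: Symbol): integer;
var i: integer;
    found: boolean;
begin
   found := false;
   i := NEntry;
   while (i > 0) and not found do
      if N = ST[i] then
         found := true
      else
         Dec(i);
   Locate := i;
end;

function InTable(N: Symbol): boolean;
begin
   InTable := Locate(N) <> 0;
end;

{ Add a name and its type code; refuse duplicates and overflow }
procedure AddEntry(N: Symbol; T: char);
begin
   if InTable(N) then begin
      WriteLn('Duplicate Identifier ', N);
      Halt;
   end;
   if NEntry = MaxEntry then begin
      WriteLn('Symbol Table Full');
      Halt;
   end;
   Inc(NEntry);
   ST[NEntry] := N;
   SType[NEntry] := T;
end;

begin
   NEntry := 0;
   AddEntry('ALPHA', 'v');
   AddEntry('BETA', 'v');
   WriteLn('ALPHA at ', Locate('ALPHA'));
   WriteLn('BETA at ', Locate('BETA'));
   WriteLn('GAMMA in table? ', InTable('GAMMA'));
end.
{--------------------------------------------------------------}
```

ALPHA and BETA should be reported in cells 1 and 2, and GAMMA as
not present.  Try adding ALPHA a second time to see the duplicate
check fire.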
+ + +TINY VERSION 1.1 + + +{--------------------------------------------------------------} +program Tiny11; + +{--------------------------------------------------------------} +{ Constant Declarations } + +const TAB = ^I; + CR = ^M; + LF = ^J; + + LCount: integer = 0; + NEntry: integer = 0; + + +{--------------------------------------------------------------} +{ Type Declarations } + +type Symbol = string[8]; + + SymTab = array[1..1000] of Symbol; + + TabPtr = ^SymTab; + + +{--------------------------------------------------------------} +{ Variable Declarations } + +var Look : char; { Lookahead Character } + Token: char; { Encoded Token } + Value: string[16]; { Unencoded Token } + + +const MaxEntry = 100; + +var ST : array[1..MaxEntry] of Symbol; + SType: array[1..MaxEntry] of char; + + +{--------------------------------------------------------------} +{ Definition of Keywords and Token Types } + +const NKW = 9; + NKW1 = 10; + +const KWlist: array[1..NKW] of Symbol = + ('IF', 'ELSE', 'ENDIF', 'WHILE', 'ENDWHILE', + 'READ', 'WRITE', 'VAR', 'END'); + +const KWcode: string[NKW1] = 'xileweRWve'; + + +{--------------------------------------------------------------} +{ Read New Character From Input Stream } + +procedure GetChar; +begin + Read(Look); +end; + +{--------------------------------------------------------------} +{ Report an Error } + +procedure Error(s: string); +begin + WriteLn; + WriteLn(^G, 'Error: ', s, '.'); +end; + + +{--------------------------------------------------------------} +{ Report Error and Halt } + +procedure Abort(s: string); +begin + Error(s); + Halt; +end; + + +{--------------------------------------------------------------} +{ Report What Was Expected } + +procedure Expected(s: string); +begin + Abort(s + ' Expected'); +end; + +{--------------------------------------------------------------} +{ Report an Undefined Identifier } + +procedure Undefined(n: string); +begin + Abort('Undefined Identifier ' + n); +end; + + 
+{--------------------------------------------------------------} +{ Report a Duplicate Identifier } + +procedure Duplicate(n: string); +begin + Abort('Duplicate Identifier ' + n); +end; + + +{--------------------------------------------------------------} +{ Check to Make Sure the Current Token is an Identifier } + +procedure CheckIdent; +begin + if Token <> 'x' then Expected('Identifier'); +end; + + +{--------------------------------------------------------------} +{ Recognize an Alpha Character } + +function IsAlpha(c: char): boolean; +begin + IsAlpha := UpCase(c) in ['A'..'Z']; +end; + + +{--------------------------------------------------------------} +{ Recognize a Decimal Digit } + +function IsDigit(c: char): boolean; +begin + IsDigit := c in ['0'..'9']; +end; + + +{--------------------------------------------------------------} +{ Recognize an AlphaNumeric Character } + +function IsAlNum(c: char): boolean; +begin + IsAlNum := IsAlpha(c) or IsDigit(c); +end; + + +{--------------------------------------------------------------} +{ Recognize an Addop } + +function IsAddop(c: char): boolean; +begin + IsAddop := c in ['+', '-']; +end; + + +{--------------------------------------------------------------} +{ Recognize a Mulop } + +function IsMulop(c: char): boolean; +begin + IsMulop := c in ['*', '/']; +end; + + +{--------------------------------------------------------------} +{ Recognize a Boolean Orop } + +function IsOrop(c: char): boolean; +begin + IsOrop := c in ['|', '~']; +end; + + +{--------------------------------------------------------------} +{ Recognize a Relop } + +function IsRelop(c: char): boolean; +begin + IsRelop := c in ['=', '#', '<', '>']; +end; + + +{--------------------------------------------------------------} +{ Recognize White Space } + +function IsWhite(c: char): boolean; +begin + IsWhite := c in [' ', TAB, CR, LF]; +end; + + +{--------------------------------------------------------------} +{ Skip Over Leading White Space } + 
+procedure SkipWhite; +begin + while IsWhite(Look) do + GetChar; +end; + + +{--------------------------------------------------------------} +{ Table Lookup } + +function Lookup(T: TabPtr; s: string; n: integer): integer; +var i: integer; + found: Boolean; +begin + found := false; + i := n; + while (i > 0) and not found do + if s = T^[i] then + found := true + else + dec(i); + Lookup := i; +end; + + +{--------------------------------------------------------------} +{ Locate a Symbol in Table } +{ Returns the index of the entry. Zero if not present. } + +function Locate(N: Symbol): integer; +begin + Locate := Lookup(@ST, n, NEntry); +end; + + +{--------------------------------------------------------------} +{ Look for Symbol in Table } + +function InTable(n: Symbol): Boolean; +begin + InTable := Lookup(@ST, n, NEntry) <> 0; +end; + + +{--------------------------------------------------------------} +{ Check to See if an Identifier is in the Symbol Table } +{ Report an error if it's not. } + + +procedure CheckTable(N: Symbol); +begin + if not InTable(N) then Undefined(N); +end; + + +{--------------------------------------------------------------} +{ Check the Symbol Table for a Duplicate Identifier } +{ Report an error if identifier is already in table. 
} + + +procedure CheckDup(N: Symbol); +begin + if InTable(N) then Duplicate(N); +end; + + +{--------------------------------------------------------------} +{ Add a New Entry to Symbol Table } + +procedure AddEntry(N: Symbol; T: char); +begin + CheckDup(N); + if NEntry = MaxEntry then Abort('Symbol Table Full'); + Inc(NEntry); + ST[NEntry] := N; + SType[NEntry] := T; +end; + + +{--------------------------------------------------------------} +{ Get an Identifier } + +procedure GetName; +begin + SkipWhite; + if Not IsAlpha(Look) then Expected('Identifier'); + Token := 'x'; + Value := ''; + repeat + Value := Value + UpCase(Look); + GetChar; + until not IsAlNum(Look); +end; + + +{--------------------------------------------------------------} +{ Get a Number } + +procedure GetNum; +begin + SkipWhite; + if not IsDigit(Look) then Expected('Number'); + Token := '#'; + Value := ''; + repeat + Value := Value + Look; + GetChar; + until not IsDigit(Look); +end; + + +{--------------------------------------------------------------} +{ Get an Operator } + +procedure GetOp; +begin + SkipWhite; + Token := Look; + Value := Look; + GetChar; +end; + + +{--------------------------------------------------------------} +{ Get the Next Input Token } + +procedure Next; +begin + SkipWhite; + if IsAlpha(Look) then GetName + else if IsDigit(Look) then GetNum + else GetOp; +end; + + +{--------------------------------------------------------------} +{ Scan the Current Identifier for Keywords } + +procedure Scan; +begin + if Token = 'x' then + Token := KWcode[Lookup(Addr(KWlist), Value, NKW) + 1]; +end; + + +{--------------------------------------------------------------} +{ Match a Specific Input String } + +procedure MatchString(x: string); +begin + if Value <> x then Expected('''' + x + ''''); + Next; +end; + + +{--------------------------------------------------------------} +{ Output a String with Tab } + +procedure Emit(s: string); +begin + Write(TAB, s); +end; + + 
+{--------------------------------------------------------------} +{ Output a String with Tab and CRLF } + +procedure EmitLn(s: string); +begin + Emit(s); + WriteLn; +end; + + +{--------------------------------------------------------------} +{ Generate a Unique Label } + +function NewLabel: string; +var S: string; +begin + Str(LCount, S); + NewLabel := 'L' + S; + Inc(LCount); +end; + + +{--------------------------------------------------------------} +{ Post a Label To Output } + +procedure PostLabel(L: string); +begin + WriteLn(L, ':'); +end; + + +{---------------------------------------------------------------} +{ Clear the Primary Register } + +procedure Clear; +begin + EmitLn('CLR D0'); +end; + + +{---------------------------------------------------------------} +{ Negate the Primary Register } + +procedure Negate; +begin + EmitLn('NEG D0'); +end; + + +{---------------------------------------------------------------} +{ Complement the Primary Register } + +procedure NotIt; +begin + EmitLn('NOT D0'); +end; + + +{---------------------------------------------------------------} +{ Load a Constant Value to Primary Register } + +procedure LoadConst(n: string); +begin + Emit('MOVE #'); + WriteLn(n, ',D0'); +end; + + +{---------------------------------------------------------------} +{ Load a Variable to Primary Register } + +procedure LoadVar(Name: string); +begin + if not InTable(Name) then Undefined(Name); + EmitLn('MOVE ' + Name + '(PC),D0'); +end; + + +{---------------------------------------------------------------} +{ Push Primary onto Stack } + +procedure Push; +begin + EmitLn('MOVE D0,-(SP)'); +end; + + +{---------------------------------------------------------------} +{ Add Top of Stack to Primary } + +procedure PopAdd; +begin + EmitLn('ADD (SP)+,D0'); +end; + + +{---------------------------------------------------------------} +{ Subtract Primary from Top of Stack } + +procedure PopSub; +begin + EmitLn('SUB (SP)+,D0'); + EmitLn('NEG D0'); +end; + + 
+{---------------------------------------------------------------} +{ Multiply Top of Stack by Primary } + +procedure PopMul; +begin + EmitLn('MULS (SP)+,D0'); +end; + + +{---------------------------------------------------------------} +{ Divide Top of Stack by Primary } + +procedure PopDiv; +begin + EmitLn('MOVE (SP)+,D7'); + EmitLn('EXT.L D7'); + EmitLn('DIVS D0,D7'); + EmitLn('MOVE D7,D0'); +end; + + +{---------------------------------------------------------------} +{ AND Top of Stack with Primary } + +procedure PopAnd; +begin + EmitLn('AND (SP)+,D0'); +end; + + +{---------------------------------------------------------------} +{ OR Top of Stack with Primary } + +procedure PopOr; +begin + EmitLn('OR (SP)+,D0'); +end; + + +{---------------------------------------------------------------} +{ XOR Top of Stack with Primary } + +procedure PopXor; +begin + EmitLn('EOR (SP)+,D0'); +end; + + +{---------------------------------------------------------------} +{ Compare Top of Stack with Primary } + +procedure PopCompare; +begin + EmitLn('CMP (SP)+,D0'); +end; + + +{---------------------------------------------------------------} +{ Set D0 If Compare was = } + +procedure SetEqual; +begin + EmitLn('SEQ D0'); + EmitLn('EXT D0'); +end; + + +{---------------------------------------------------------------} +{ Set D0 If Compare was != } + +procedure SetNEqual; +begin + EmitLn('SNE D0'); + EmitLn('EXT D0'); +end; + + +{---------------------------------------------------------------} +{ Set D0 If Compare was > } + +procedure SetGreater; +begin + EmitLn('SLT D0'); + EmitLn('EXT D0'); +end; + + +{---------------------------------------------------------------} +{ Set D0 If Compare was < } + +procedure SetLess; +begin + EmitLn('SGT D0'); + EmitLn('EXT D0'); +end; + + +{---------------------------------------------------------------} +{ Set D0 If Compare was <= } + +procedure SetLessOrEqual; +begin + EmitLn('SGE D0'); + EmitLn('EXT D0'); +end; + + 
+{---------------------------------------------------------------} +{ Set D0 If Compare was >= } + +procedure SetGreaterOrEqual; +begin + EmitLn('SLE D0'); + EmitLn('EXT D0'); +end; + + +{---------------------------------------------------------------} +{ Store Primary to Variable } + +procedure Store(Name: string); +begin + EmitLn('LEA ' + Name + '(PC),A0'); + EmitLn('MOVE D0,(A0)') +end; + + +{---------------------------------------------------------------} +{ Branch Unconditional } + +procedure Branch(L: string); +begin + EmitLn('BRA ' + L); +end; + + +{---------------------------------------------------------------} +{ Branch False } + +procedure BranchFalse(L: string); +begin + EmitLn('TST D0'); + EmitLn('BEQ ' + L); +end; + + +{---------------------------------------------------------------} +{ Read Variable to Primary Register } + +procedure ReadIt(Name: string); +begin + EmitLn('BSR READ'); + Store(Name); +end; + + +{ Write from Primary Register } + +procedure WriteIt; +begin + EmitLn('BSR WRITE'); +end; + + +{--------------------------------------------------------------} +{ Write Header Info } + +procedure Header; +begin + WriteLn('WARMST', TAB, 'EQU $A01E'); +end; + + +{--------------------------------------------------------------} +{ Write the Prolog } + +procedure Prolog; +begin + PostLabel('MAIN'); +end; + + +{--------------------------------------------------------------} +{ Write the Epilog } + +procedure Epilog; +begin + EmitLn('DC WARMST'); + EmitLn('END MAIN'); +end; + + +{---------------------------------------------------------------} +{ Allocate Storage for a Static Variable } + +procedure Allocate(Name, Val: string); +begin + WriteLn(Name, ':', TAB, 'DC ', Val); +end; + + +{---------------------------------------------------------------} +{ Parse and Translate a Math Factor } + +procedure BoolExpression; Forward; + +procedure Factor; +begin + if Token = '(' then begin + Next; + BoolExpression; + MatchString(')'); + end + else begin + if 
Token = 'x' then + LoadVar(Value) + else if Token = '#' then + LoadConst(Value) + else Expected('Math Factor'); + Next; + end; +end; + + +{--------------------------------------------------------------} +{ Recognize and Translate a Multiply } + +procedure Multiply; +begin + Next; + Factor; + PopMul; +end; + + +{-------------------------------------------------------------} +{ Recognize and Translate a Divide } + +procedure Divide; +begin + Next; + Factor; + PopDiv; +end; + + +{---------------------------------------------------------------} +{ Parse and Translate a Math Term } + +procedure Term; +begin + Factor; + while IsMulop(Token) do begin + Push; + case Token of + '*': Multiply; + '/': Divide; + end; + end; +end; + + +{--------------------------------------------------------------} +{ Recognize and Translate an Add } + +procedure Add; +begin + Next; + Term; + PopAdd; +end; + + +{-------------------------------------------------------------} +{ Recognize and Translate a Subtract } + +procedure Subtract; +begin + Next; + Term; + PopSub; +end; + + +{---------------------------------------------------------------} +{ Parse and Translate an Expression } + +procedure Expression; +begin + if IsAddop(Token) then + Clear + else + Term; + while IsAddop(Token) do begin + Push; + case Token of + '+': Add; + '-': Subtract; + end; + end; +end; + + +{---------------------------------------------------------------} +{ Get Another Expression and Compare } + +procedure CompareExpression; +begin + Expression; + PopCompare; +end; + + +{---------------------------------------------------------------} +{ Get The Next Expression and Compare } + +procedure NextExpression; +begin + Next; + CompareExpression; +end; + + +{---------------------------------------------------------------} +{ Recognize and Translate a Relational "Equals" } + +procedure Equal; +begin + NextExpression; + SetEqual; +end; + + +{---------------------------------------------------------------} +{ Recognize and 
Translate a Relational "Less Than or Equal" } + +procedure LessOrEqual; +begin + NextExpression; + SetLessOrEqual; +end; + + +{---------------------------------------------------------------} +{ Recognize and Translate a Relational "Not Equals" } + +procedure NotEqual; +begin + NextExpression; + SetNEqual; +end; + + +{---------------------------------------------------------------} +{ Recognize and Translate a Relational "Less Than" } + +procedure Less; +begin + Next; + case Token of + '=': LessOrEqual; + '>': NotEqual; + else begin + CompareExpression; + SetLess; + end; + end; +end; + + +{---------------------------------------------------------------} +{ Recognize and Translate a Relational "Greater Than" } + +procedure Greater; +begin + Next; + if Token = '=' then begin + NextExpression; + SetGreaterOrEqual; + end + else begin + CompareExpression; + SetGreater; + end; +end; + + +{---------------------------------------------------------------} +{ Parse and Translate a Relation } + + +procedure Relation; +begin + Expression; + if IsRelop(Token) then begin + Push; + case Token of + '=': Equal; + '<': Less; + '>': Greater; + end; + end; +end; + + +{---------------------------------------------------------------} +{ Parse and Translate a Boolean Factor with Leading NOT } + +procedure NotFactor; +begin + if Token = '!' 
then begin + Next; + Relation; + NotIt; + end + else + Relation; +end; + + +{---------------------------------------------------------------} +{ Parse and Translate a Boolean Term } + +procedure BoolTerm; +begin + NotFactor; + while Token = '&' do begin + Push; + Next; + NotFactor; + PopAnd; + end; +end; + + +{--------------------------------------------------------------} +{ Recognize and Translate a Boolean OR } + +procedure BoolOr; +begin + Next; + BoolTerm; + PopOr; +end; + + +{--------------------------------------------------------------} +{ Recognize and Translate an Exclusive Or } + +procedure BoolXor; +begin + Next; + BoolTerm; + PopXor; +end; + + +{---------------------------------------------------------------} +{ Parse and Translate a Boolean Expression } + +procedure BoolExpression; +begin + BoolTerm; + while IsOrOp(Token) do begin + Push; + case Token of + '|': BoolOr; + '~': BoolXor; + end; + end; +end; + + +{--------------------------------------------------------------} +{ Parse and Translate an Assignment Statement } + +procedure Assignment; +var Name: string; +begin + CheckTable(Value); + Name := Value; + Next; + MatchString('='); + BoolExpression; + Store(Name); +end; + + +{---------------------------------------------------------------} +{ Recognize and Translate an IF Construct } + +procedure Block; Forward; + +procedure DoIf; +var L1, L2: string; +begin + Next; + BoolExpression; + L1 := NewLabel; + L2 := L1; + BranchFalse(L1); + Block; + if Token = 'l' then begin + Next; + L2 := NewLabel; + Branch(L2); + PostLabel(L1); + Block; + end; + PostLabel(L2); + MatchString('ENDIF'); +end; + + +{--------------------------------------------------------------} +{ Parse and Translate a WHILE Statement } + +procedure DoWhile; +var L1, L2: string; +begin + Next; + L1 := NewLabel; + L2 := NewLabel; + PostLabel(L1); + BoolExpression; + BranchFalse(L2); + Block; + MatchString('ENDWHILE'); + Branch(L1); + PostLabel(L2); +end; + + 
+{--------------------------------------------------------------} +{ Read a Single Variable } + +procedure ReadVar; +begin + CheckIdent; + CheckTable(Value); + ReadIt(Value); + Next; +end; + + +{--------------------------------------------------------------} +{ Process a Read Statement } + +procedure DoRead; +begin + Next; + MatchString('('); + ReadVar; + while Token = ',' do begin + Next; + ReadVar; + end; + MatchString(')'); +end; + + +{--------------------------------------------------------------} +{ Process a Write Statement } + +procedure DoWrite; +begin + Next; + MatchString('('); + Expression; + WriteIt; + while Token = ',' do begin + Next; + Expression; + WriteIt; + end; + MatchString(')'); +end; + + +{--------------------------------------------------------------} +{ Parse and Translate a Block of Statements } + +procedure Block; +begin + Scan; + while not(Token in ['e', 'l']) do begin + case Token of + 'i': DoIf; + 'w': DoWhile; + 'R': DoRead; + 'W': DoWrite; + else Assignment; + end; + Scan; + end; +end; + + +{--------------------------------------------------------------} +{ Allocate Storage for a Variable } + +procedure Alloc; +begin + Next; + if Token <> 'x' then Expected('Variable Name'); + CheckDup(Value); + AddEntry(Value, 'v'); + Allocate(Value, '0'); + Next; +end; + + +{--------------------------------------------------------------} +{ Parse and Translate Global Declarations } + +procedure TopDecls; +begin + Scan; + while Token = 'v' do begin + Alloc; + while Token = ',' do + Alloc; + Scan; { the next word may be another VAR } + end; +end; + + +{--------------------------------------------------------------} +{ Initialize } + +procedure Init; +begin + GetChar; + Next; +end; + + +{--------------------------------------------------------------} +{ Main Program } + +begin + Init; + MatchString('PROGRAM'); + Header; + TopDecls; + MatchString('BEGIN'); + Prolog; + Block; + MatchString('END'); + Epilog; +end.
+{--------------------------------------------------------------} +***************************************************************** +* * +* COPYRIGHT NOTICE * +* * +* Copyright (C) 1989 Jack W. Crenshaw. All rights reserved. * +* * +***************************************************************** + diff --git a/Compiler construction/Lets_build_compiler/tutor12.txt b/Compiler construction/Lets_build_compiler/tutor12.txt new file mode 100644 index 0000000..7f3f6ce --- /dev/null +++ b/Compiler construction/Lets_build_compiler/tutor12.txt @@ -0,0 +1,801 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + LET'S BUILD A COMPILER! + + By + + Jack W. Crenshaw, Ph.D. + + 5 June 1989 + + + Part XII: MISCELLANY + + +***************************************************************** +* * +* COPYRIGHT NOTICE * +* * +* Copyright (C) 1989 Jack W. Crenshaw. All rights reserved. * +* * +***************************************************************** + + +INTRODUCTION + +This installment is another one of those excursions into side +alleys that don't seem to fit into the mainstream of this +tutorial series. As I mentioned last time, it was while I was +writing this installment that I realized some changes had to be +made to the compiler structure. So I had to digress from this +digression long enough to develop the new structure and show it +to you. + +Now that that's behind us, I can tell you what I set out to in +the first place. This shouldn't take long, and then we can get +back into the mainstream. + +Several people have asked me about things that other languages +provide, but so far I haven't addressed in this series. The two +biggies are semicolons and comments. Perhaps you've wondered +about them, too, and wondered how things would change if we had +to deal with them. Just so you can proceed with what's to come, +without being bothered by that nagging feeling that something is +missing, we'll address such issues here. 
+ + +SEMICOLONS + +Ever since the introduction of Algol, semicolons have been a part +of almost every modern language. We've all used them to the +point that they are taken for granted. Yet I suspect that more +compilation errors have occurred due to misplaced or missing +semicolons than any other single cause. And if we had a penny +for every extra keystroke programmers have used to type the +little rascals, we could pay off the national debt. + +Having been brought up with FORTRAN, it took me a long time to +get used to using semicolons, and to tell the truth I've never +quite understood why they were necessary. Since I program in +Pascal, and since the use of semicolons in Pascal is particularly +tricky, that one little character is still by far my biggest +source of errors. + +When I began developing KISS, I resolved to question EVERY +construct in other languages, and to try to avoid the most common +problems that occur with them. That puts the semicolon very high +on my hit list. + +To understand the role of the semicolon, you have to look at a +little history. + +Early programming languages were line-oriented. In FORTRAN, for +example, various parts of the statement had specific columns or +fields that they had to appear in. Since some statements were +too long for one line, the "continuation card" mechanism was +provided to let the compiler know that a given card was still +part of the previous line. The mechanism survives to this day, +even though punched cards are now things of the distant past. + +When other languages came along, they also adopted various +mechanisms for dealing with multiple-line statements. BASIC is a +good example. It's important to recognize, though, that the +FORTRAN mechanism was not so much required by the line +orientation of that language, as by the column-orientation. In +those versions of FORTRAN where free-form input is permitted, +it's no longer needed. 
+ +When the fathers of Algol introduced that language, they wanted +to get away from line-oriented programs like FORTRAN and BASIC, +and allow for free-form input. This included the possibility of +stringing multiple statements on a single line, as in + + + a=b; c=d; e=e+1; + + +In cases like this, the semicolon is almost REQUIRED. The same +line, without the semicolons, just looks "funny": + + + a=b c=d e=e+1 + +I suspect that this is the major ... perhaps ONLY ... reason for +semicolons: to keep programs from looking funny. + +But the idea of stringing multiple statements together on a +single line is a dubious one at best. It's not very good +programming style, and harks back to the days when it was +considered important to conserve cards. In these days of CRTs +and indented code, the clarity of programs is far better served +by keeping statements separate. It's still nice to have the +OPTION of multiple statements, but it seems a shame to keep +programmers in slavery to the semicolon, just to keep that one +rare case from "looking funny." + +When I started in with KISS, I tried to keep an open mind. I +decided that I would use semicolons when it became necessary for +the parser, but not until then. I figured this would happen just +about the time I added the ability to spread statements over +multiple lines. But, as you can see, that never happened. The +TINY compiler is perfectly happy to parse the most complicated +statement, spread over any number of lines, without semicolons. + +Still, there are people who have used semicolons for so long, +they feel naked without them. I'm one of them. Once I had KISS +defined sufficiently well, I began to write a few sample programs +in the language. I discovered, somewhat to my horror, that I +kept putting semicolons in anyway. So now I'm facing the +prospect of a NEW rash of compiler errors, caused by UNWANTED +semicolons. Phooey!
+ +Perhaps more to the point, there are readers out there who are +designing their own languages, which may include semicolons, or +who want to use the techniques of these tutorials to compile +conventional languages like C. In either case, we need to be +able to deal with semicolons. + + +SYNTACTIC SUGAR + +This whole discussion brings up the issue of "syntactic sugar" +... constructs that are added to a language, not because they are +needed, but because they help make the programs look right to the +programmer. After all, it's nice to have a small, simple +compiler, but it would be of little use if the resulting +language were cryptic and hard to program in. The language FORTH +comes to mind (a premature OUCH! for the barrage I know that +one's going to fetch me). If we can add features to the language +that make the programs easier to read and understand, and if +those features help keep the programmer from making errors, then +we should do so. Particularly if the constructs don't add much +to the complexity of the language or its compiler. + +The semicolon could be considered an example, but there are +plenty of others, such as the 'THEN' in an IF-statement, the 'DO' +in a WHILE-statement, and even the 'PROGRAM' statement, which I +came within a gnat's eyelash of leaving out of TINY. None of +these tokens add much to the syntax of the language ... the +compiler can figure out what's going on without them. But some +folks feel that they DO add to the readability of programs, and +that can be very important. + +There are two schools of thought on this subject, which are well +represented by two of our most popular languages, C and Pascal. + +To the minimalists, all such sugar should be left out. They +argue that it clutters up the language and adds to the number of +keystrokes programmers must type. Perhaps more importantly, +every extra token or keyword represents a trap lying in wait for +the inattentive programmer.
If you leave out a token, misplace +it, or misspell it, the compiler will get you. So these people +argue that the best approach is to get rid of such things. These +folks tend to like C, which has a minimum of unnecessary keywords +and punctuation. + +Those from the other school tend to like Pascal. They argue that +having to type a few extra characters is a small price to pay for +legibility. After all, humans have to read the programs, too. +Their best argument is that each such construct is an opportunity +to tell the compiler that you really mean for it to do what you +said to. The sugary tokens serve as useful landmarks to help you +find your way. + +The differences are well represented by the two languages. The +most oft-heard complaint about C is that it is too forgiving. +When you make a mistake in C, the erroneous code is too often +another legal C construct. So the compiler just happily +continues to compile, and leaves you to find the error during +debug. I guess that's why debuggers are so popular with C +programmers. + +On the other hand, if a Pascal program compiles, you can be +pretty sure that the program will do what you told it. If there +is an error at run time, it's probably a design error. + +The best example of useful sugar is the semicolon itself. +Consider the code fragment: + + + a=1+(2*b+c) b... + + +Since there is no operator connecting the token 'b' with the rest +of the statement, the compiler will conclude that the expression +ends with the ')', and the 'b' is the beginning of a new +statement. But suppose I have simply left out the intended +operator, and I really want to say: + + + a=1+(2*b+c)*b... + + +In this case the compiler will get an error, all right, but it +won't be very meaningful since it will be expecting an '=' sign +after the 'b' that really shouldn't be there. + +If, on the other hand, I include a semicolon after the 'b', THEN +there can be no doubt where I intend the statement to end. 
+Syntactic sugar, then, can serve a very useful purpose by +providing some additional insurance that we remain on track. + +I find myself somewhere in the middle of all this. I tend to +favor the Pascal-ers' view ... I'd much rather find my bugs at +compile time than at run time. But I also hate to just throw +verbosity in for no apparent reason, as in COBOL. So far I've +consistently left most of the Pascal sugar out of KISS/TINY. But +I certainly have no strong feelings either way, and I also can +see the value of sprinkling a little sugar around just for the +extra insurance that it brings. If you like this latter +approach, things like that are easy to add. Just remember that, +like the semicolon, each item of sugar is something that can +potentially cause a compile error by its omission. + + +DEALING WITH SEMICOLONS + +There are two distinct ways in which semicolons are used in +popular languages. In Pascal, the semicolon is regarded as a +statement SEPARATOR. No semicolon is required after the last +statement in a block. The syntax is: + + + <block> ::= <statement> ( ';' <statement> )* + + <statement> ::= <assignment> | <if> | <while> ... | null + + +(The null statement is IMPORTANT!) + +Pascal also defines some semicolons in other places, such as +after the PROGRAM statement. + +In C and Ada, on the other hand, the semicolon is considered a +statement TERMINATOR, and follows all statements (with some +embarrassing and confusing exceptions). The syntax for this is +simply: + + + <block> ::= ( <statement> ';' )* + + +Of the two syntaxes, the Pascal one seems on the face of it more +rational, but experience has shown that it leads to some strange +difficulties. People get so used to typing a semicolon after +every statement that they tend to type one after the last +statement in a block, also. That usually doesn't cause any harm +... it just gets treated as a null statement. Many Pascal +programmers, including yours truly, do just that. But there is +one place you absolutely CANNOT type a semicolon, and that's +right before an ELSE.
This little gotcha has cost me many an +extra compilation, particularly when the ELSE is added to +existing code. So the C/Ada choice turns out to be better. +Apparently Niklaus Wirth thinks so, too: In his Modula-2, he +abandoned the Pascal approach. + +Given either of these two syntaxes, it's an easy matter (now that +we've reorganized the parser!) to add these features to our +parser. Let's take the last case first, since it's simpler. + +To begin, I've made things easy by introducing a new recognizer: + + +{--------------------------------------------------------------} +{ Match a Semicolon } + +procedure Semi; +begin + MatchString(';'); +end; +{--------------------------------------------------------------} + + +This procedure works very much like our old Match. It insists on +finding a semicolon as the next token. Having found it, it skips +to the next one. + +Since a semicolon follows a statement, procedure Block is almost +the only one we need to change: + + +{--------------------------------------------------------------} +{ Parse and Translate a Block of Statements } + +procedure Block; +begin + Scan; + while not(Token in ['e', 'l']) do begin + case Token of + 'i': DoIf; + 'w': DoWhile; + 'R': DoRead; + 'W': DoWrite; + 'x': Assignment; + end; + Semi; + Scan; + end; +end; +{--------------------------------------------------------------} + + +Note carefully the subtle change in the case statement. The call +to Assignment is now guarded by a test on Token. This is to +avoid calling Assignment when the token is a semicolon (which +could happen if the statement is null).
+ +Since declarations are also statements, we also need to add a +call to Semi within procedure TopDecls: + + +{--------------------------------------------------------------} +{ Parse and Translate Global Declarations } + +procedure TopDecls; +begin + Scan; + while Token = 'v' do begin + Alloc; + while Token = ',' do + Alloc; + Semi; + end; +end; +{--------------------------------------------------------------} + + +Finally, we need one for the PROGRAM statement: + + +{--------------------------------------------------------------} +{ Main Program } + +begin + Init; + MatchString('PROGRAM'); + Semi; + Header; + TopDecls; + MatchString('BEGIN'); + Prolog; + Block; + MatchString('END'); + Epilog; +end. +{--------------------------------------------------------------} + + +It's as easy as that. Try it with a copy of TINY and see how you +like it. + +The Pascal version is a little trickier, but it still only +requires minor changes, and those only to procedure Block. To +keep things as simple as possible, let's split the procedure into +two parts. The following procedure handles just one statement: + + +{--------------------------------------------------------------} +{ Parse and Translate a Single Statement } + +procedure Statement; +begin + Scan; + case Token of + 'i': DoIf; + 'w': DoWhile; + 'R': DoRead; + 'W': DoWrite; + 'x': Assignment; + end; +end; +{--------------------------------------------------------------} + + +Using this procedure, we can now rewrite Block like this: + + +{--------------------------------------------------------------} +{ Parse and Translate a Block of Statements } + +procedure Block; +begin + Statement; + while Token = ';' do begin + Next; + Statement; + end; +end; +{--------------------------------------------------------------} + + +That sure didn't hurt, did it? We can now parse semicolons in +Pascal-like fashion. 
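
To see the two conventions side by side, here is a small
stand-alone sketch. It is NOT part of TINY, and it makes some
loud assumptions: the names Terminated and Separated are invented
for this illustration, a "statement" is taken to be a single
lowercase letter, and the input is a string rather than the input
stream. Each function simply answers whether the string fits the
corresponding syntax given earlier.

```pascal
program SemiDemo;

{ Hypothetical sketch, not part of TINY.  A "statement" is assumed }
{ to be one lowercase letter; the input is a string.               }

{ Terminator style:  <block> ::= ( <statement> ';' )* }
function Terminated(s: string): boolean;
var i: integer;
    ok: boolean;
begin
   i := 1;
   ok := true;
   while ok and (i <= Length(s)) do
      if (s[i] in ['a'..'z']) and (i < Length(s)) and (s[i + 1] = ';') then
         i := i + 2
      else
         ok := false;
   Terminated := ok;
end;

{ Separator style:  <block> ::= <statement> ( ';' <statement> )* }
{ where <statement> may be null                                  }
function Separated(s: string): boolean;
var i: integer;
begin
   i := 1;
   if (i <= Length(s)) and (s[i] in ['a'..'z']) then i := i + 1;
   while (i <= Length(s)) and (s[i] = ';') do begin
      i := i + 1;
      if (i <= Length(s)) and (s[i] in ['a'..'z']) then i := i + 1;
   end;
   Separated := i > Length(s);
end;

begin
   WriteLn('a;b;  ', Terminated('a;b;'), ' ', Separated('a;b;'));
   WriteLn('a;b   ', Terminated('a;b'), ' ', Separated('a;b'));
   WriteLn('ab    ', Terminated('ab'), ' ', Separated('ab'));
end.
```

Under these assumptions 'a;b;' satisfies both tests (the
separator version treats the trailing semicolon as introducing a
null statement), 'a;b' satisfies only the separator test, and
'ab' satisfies neither.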
+ + +A COMPROMISE + +Now that we know how to deal with semicolons, does that mean that +I'm going to put them in KISS/TINY? Well, yes and no. I like +the extra sugar and the security that comes with knowing for sure +where the ends of statements are. But I haven't changed my +dislike for the compilation errors associated with semicolons. + +So I have what I think is a nice compromise: Make them OPTIONAL! + +Consider the following version of Semi: + + +{--------------------------------------------------------------} +{ Match a Semicolon } + +procedure Semi; +begin + if Token = ';' then Next; +end; +{--------------------------------------------------------------} + + +This procedure will ACCEPT a semicolon whenever it is called, but +it won't INSIST on one. That means that when you choose to use +semicolons, the compiler will use the extra information to help +keep itself on track. But if you omit one (or omit them all) the +compiler won't complain. The best of both worlds. + +Put this procedure in place in the first version of your program +(the one for C/Ada syntax), and you have the makings of TINY +Version 1.2. + + +COMMENTS + +Up until now I have carefully avoided the subject of comments. +You would think that this would be an easy subject ... after all, +the compiler doesn't have to deal with comments at all; it should +just ignore them. Well, sometimes that's true. + +Comments can be just about as easy or as difficult as you choose +to make them. At one extreme, we can arrange things so that +comments are intercepted almost the instant they enter the +compiler. At the other, we can treat them as lexical elements. +Things tend to get interesting when you consider things like +comment delimiters contained in quoted strings. + + +SINGLE-CHARACTER DELIMITERS + +Here's an example. Suppose we assume the Turbo Pascal standard +and use curly braces for comments. In this case we have single- +character delimiters, so our parsing is a little easier. 
+ +One approach is to strip the comments out the instant we +encounter them in the input stream; that is, right in procedure +GetChar. To do this, first change the name of GetChar to +something else, say GetCharX. (For the record, this is going to +be a TEMPORARY change, so best not do this with your only copy of +TINY. I assume you understand that you should always do these +experiments with a working copy.) + +Now, we're going to need a procedure to skip over comments. So +key in the following one: + + +{--------------------------------------------------------------} +{ Skip A Comment Field } + +procedure SkipComment; +begin + while Look <> '}' do + GetCharX; + GetCharX; +end; +{--------------------------------------------------------------} + + +Clearly, what this procedure is going to do is to simply read and +discard characters from the input stream, until it finds a right +curly brace. Then it reads one more character and returns it in +Look. + +Now we can write a new version of GetChar that calls SkipComment +to strip out comments: + + +{--------------------------------------------------------------} +{ Get Character from Input Stream } +{ Skip Any Comments } + +procedure GetChar; +begin + GetCharX; + if Look = '{' then SkipComment; +end; +{--------------------------------------------------------------} + + +Code this up and give it a try. You'll find that you can, +indeed, bury comments anywhere you like. The comments never even +get into the parser proper ... every call to GetChar just returns +any character that's NOT part of a comment. + +As a matter of fact, while this approach gets the job done, and +may even be perfectly satisfactory for you, it does its job a +little TOO well. First of all, most programming languages +specify that a comment should be treated like a space, so that +comments aren't allowed to be embedded in, say, variable names. +This current version doesn't care WHERE you put comments.
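
The point about comments inside names can be seen with a
stand-alone sketch. StripComments below is a hypothetical helper,
not part of TINY: it applies the same strip-at-the-source idea to
a string instead of the input stream, and it assumes every left
brace has a matching right brace.

```pascal
program StripDemo;

{ Hypothetical sketch, not part of TINY: the GetChar/SkipComment  }
{ idea applied to a string.  Assumes every comment is closed.     }

function StripComments(s: string): string;
var i: integer;
    r: string;
begin
   r := '';
   i := 1;
   while i <= Length(s) do
      if s[i] = '{' then begin
         { discard everything up to the closing brace ... }
         while (i <= Length(s)) and (s[i] <> '}') do
            i := i + 1;
         i := i + 1;          { ... and the closing brace itself }
      end
      else begin
         r := r + s[i];
         i := i + 1;
      end;
   StripComments := r;
end;

begin
   WriteLn(StripComments('a=b {copy b} c=d'));
   WriteLn(StripComments('to{oops}tal=1'));
end.
```

The second call is the interesting one: the comment vanishes and
the two halves of the name join up into total=1, which is just
the kind of thing most languages rule out by treating a comment
as a space.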
+ +Second, since the rest of the parser can't even receive a '{' +character, you will not be allowed to put one in a quoted string. + +Before you turn up your nose at this simplistic solution, though, +I should point out that as respected a compiler as Turbo Pascal +also won't allow a '{' in a quoted string. Try it. And as for +embedding a comment in an identifier, I can't imagine why anyone +would want to do such a thing, anyway, so the question is moot. +For 99% of all applications, what I've just shown you will work +just fine. + +But, if you want to be picky about it and stick to the +conventional treatment, then we need to move the interception +point downstream a little further. + +To do this, first change GetChar back to the way it was and +change the name called in SkipComment. Then, let's add the left +brace as a possible whitespace character: + + +{--------------------------------------------------------------} +{ Recognize White Space } + +function IsWhite(c: char): boolean; +begin + IsWhite := c in [' ', TAB, CR, LF, '{']; +end; +{--------------------------------------------------------------} + + +Now, we can deal with comments in procedure SkipWhite: + + +{--------------------------------------------------------------} +{ Skip Over Leading White Space } + +procedure SkipWhite; +begin + while IsWhite(Look) do begin + if Look = '{' then + SkipComment + else + GetChar; + end; +end; +{--------------------------------------------------------------} + + +Note that SkipWhite is written so that we will skip over any +combination of whitespace characters and comments, in one call. + +OK, give this one a try, too. You'll find that it will let a +comment serve to delimit tokens. It's worth mentioning that this +approach also gives us the ability to handle curly braces within +quoted strings, since within such strings we will not be testing +for or skipping over whitespace. + +There's one last item to deal with: Nested comments. 
Some +programmers like the idea of nesting comments, since it allows +you to comment out code during debugging. The code I've given +here won't allow that and, again, neither will Turbo Pascal. + +But the fix is incredibly easy. All we need to do is to make +SkipComment recursive: + + +{--------------------------------------------------------------} +{ Skip A Comment Field } + +procedure SkipComment; +begin + while Look <> '}' do begin + GetChar; + if Look = '{' then SkipComment; + end; + GetChar; +end; +{--------------------------------------------------------------} + + +That does it. As sophisticated a comment-handler as you'll ever +need. + + +MULTI-CHARACTER DELIMITERS + +That's all well and good for cases where a comment is delimited +by single characters, but what about the cases such as C or +standard Pascal, where two characters are required? Well, the +principles are still the same, but we have to change our approach +quite a bit. I'm sure it won't surprise you to learn that things +get harder in this case. + +For the multi-character situation, the easiest thing to do is to +intercept the left delimiter back at the GetChar stage. We can +"tokenize" it right there, replacing it by a single character. + +Let's assume we're using the C delimiters '/*' and '*/'. First, +we need to go back to the "GetCharX" approach. In yet another +copy of your compiler, rename GetChar to GetCharX and then enter +the following new procedure GetChar: + + +{--------------------------------------------------------------} +{ Read New Character. Intercept '/*' } + +procedure GetChar; +begin + if TempChar <> ' ' then begin + Look := TempChar; + TempChar := ' '; + end + else begin + GetCharX; + if Look = '/' then begin + Read(TempChar); + if TempChar = '*' then begin + Look := '{'; + TempChar := ' '; + end; + end; + end; +end; +{--------------------------------------------------------------} + + +As you can see, what this procedure does is to intercept every +occurrence of '/'.
It then examines the NEXT character in the +stream. If the character is a '*', then we have found the +beginning of a comment, and GetChar will return a single +character replacement for it. (For simplicity, I'm using the +same '{' character as I did for Pascal. If you were writing a C +compiler, you'd no doubt want to pick some other character that's +not used elsewhere in C. Pick anything you like ... even $FF, +anything that's unique.) + +If the character following the '/' is NOT a '*', then GetChar +tucks it away in the new global TempChar, and returns the '/'. + +Note that you need to declare this new variable and initialize it +to ' '. I like to do things like that using the Turbo "typed +constant" construct: + + + const TempChar: char = ' '; + + +Now we need a new version of SkipComment: + + +{--------------------------------------------------------------} +{ Skip A Comment Field } +{ Look is tested BEFORE each read in the inner loop, so that a } +{ run of stars just ahead of the closing slash still ends the } +{ comment } + +procedure SkipComment; +begin + GetCharX; + repeat + while Look <> '*' do + GetCharX; + GetCharX; + until Look = '/'; + GetChar; +end; +{--------------------------------------------------------------} + + +A few things to note: first of all, function IsWhite and +procedure SkipWhite don't need to be changed, since GetChar +returns the '{' token. If you change that token character, then +of course you also need to change the character in those two +routines. + +Second, note that SkipComment doesn't call GetChar in its loop, +but GetCharX. That means that the trailing '/' is not +intercepted and is seen by SkipComment. Third, although GetChar +is the procedure doing the work, we can still deal with the +comment characters embedded in a quoted string, by calling +GetCharX instead of GetChar while we're within the string. +Finally, note that we can again provide for nested comments by +adding a single statement to SkipComment, just as we did before. + + +ONE-SIDED COMMENTS + +So far I've shown you how to deal with any kind of comment +delimited on the left and the right.
That only leaves the one- +sided comments like those in assembler language or in Ada, that +are terminated by the end of the line. In a way, that case is +easier. The only procedure that would need to be changed is +SkipComment, which must now terminate at the newline characters: + + +{--------------------------------------------------------------} +{ Skip A Comment Field } + +procedure SkipComment; +begin + repeat + GetCharX; + until Look = CR; + GetChar; +end; +{--------------------------------------------------------------} + + +If the leading character is a single one, as in the ';' of +assembly language, then we're essentially done. If it's a two- +character token, as in the '--' of Ada, we need only modify the +tests within GetChar. Either way, it's an easier problem than +the balanced case. + + +CONCLUSION + +At this point we now have the ability to deal with both comments +and semicolons, as well as other kinds of syntactic sugar. I've +shown you several ways to deal with each, depending upon the +convention desired. The only issue left is: which of these +conventions should we use in KISS/TINY? + +For the reasons that I've given as we went along, I'm choosing +the following: + + + (1) Semicolons are TERMINATORS, not separators + + (2) Semicolons are OPTIONAL + + (3) Comments are delimited by curly braces + + (4) Comments MAY be nested + + +Put the code corresponding to these cases into your copy of TINY. +You now have TINY Version 1.2. + +Now that we have disposed of these sideline issues, we can +finally get back into the mainstream. In the next installment, +we'll talk about procedures and parameter passing, and we'll add +these important features to TINY. See you then. + + +***************************************************************** +* * +* COPYRIGHT NOTICE * +* * +* Copyright (C) 1989 Jack W. Crenshaw. All rights reserved. 
* +* * +***************************************************************** + diff --git a/Compiler construction/Lets_build_compiler/tutor13.txt b/Compiler construction/Lets_build_compiler/tutor13.txt new file mode 100644 index 0000000..45c7ba9 --- /dev/null +++ b/Compiler construction/Lets_build_compiler/tutor13.txt @@ -0,0 +1,2416 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + LET'S BUILD A COMPILER! + + By + + Jack W. Crenshaw, Ph.D. + + 27 August 1989 + + + Part XIII: PROCEDURES + + +***************************************************************** +* * +* COPYRIGHT NOTICE * +* * +* Copyright (C) 1989 Jack W. Crenshaw. All rights reserved. * +* * +***************************************************************** + + +INTRODUCTION + +At last we get to the good part! + +At this point we've studied almost all the basic features of +compilers and parsing. We have learned how to translate +arithmetic expressions, Boolean expressions, control constructs, +data declarations, and I/O statements. We have defined a +language, TINY 1.3, that embodies all these features, and we have +written a rudimentary compiler that can translate them. By +adding some file I/O we could indeed have a working compiler that +could produce executable object files from programs written in +TINY. With such a compiler, we could write simple programs that +could read integer data, perform calculations with it, and output +the results. + +That's nice, but what we have is still only a toy language. We +can't read or write even a single character of text, and we still +don't have procedures. + +It's the features to be discussed in the next couple of +installments that separate the men from the toys, so to speak. +"Real" languages have more than one data type, and they support +procedure calls. More than any others, it's these two features +that give a language much of its character and personality. 
Once +we have provided for them, our languages, TINY and its +successors, will cease to be toys and will take on the +character of real languages, suitable for serious programming +jobs. + +For several installments now, I've been promising you sessions on +these two important subjects. Each time, other issues came up +that required me to digress and deal with them. Finally, we've +been able to put all those issues to rest and can get on with the +mainstream of things. In this installment, I'll cover +procedures. Next time, we'll talk about the basic data types. + + +ONE LAST DIGRESSION + +This has been an extraordinarily difficult installment for me to +write. The reason has nothing to do with the subject itself ... +I've known what I wanted to say for some time, and in fact I +presented most of this at Software Development '89, back in +February. It has more to do with the approach. Let me explain. + +When I first began this series, I told you that we would use +several "tricks" to make things easy, and to let us learn the +concepts without getting too bogged down in the details. Among +these tricks was the idea of looking at individual pieces of a +compiler at a time, i.e. performing experiments using the Cradle +as a base. When we studied expressions, for example, we dealt +with only that part of compiler theory. When we studied control +structures, we wrote a different program, still based on the +Cradle, to do that part. We only incorporated these concepts into +a complete language fairly recently. These techniques have served +us very well indeed, and led us to the development of a compiler +for TINY version 1.3. + +When I first began this session, I tried to build upon what we +had already done, and just add the new features to the existing +compiler. That turned out to be a little awkward and tricky ... +much too much to suit me. + +I finally figured out why.
In this series of experiments, I had +abandoned the very useful techniques that had allowed us to get +here, and without meaning to I had switched over into a new +method of working, that involved incremental changes to the full +TINY compiler. + +You need to understand that what we are doing here is a little +unique. There have been a number of articles, such as the Small +C articles by Cain and Hendrix, that presented finished compilers +for one language or another. This is different. In this series +of tutorials, you are watching me design and implement both a +language and a compiler, in real time. + +In the experiments that I've been doing in preparation for this +article, I was trying to inject the changes into the TINY +compiler in such a way that, at every step, we still had a real, +working compiler. In other words, I was attempting an +incremental enhancement of the language and its compiler, while +at the same time explaining to you what I was doing. + +That's a tough act to pull off! I finally realized that it was +dumb to try. Having gotten this far using the idea of small +experiments based on single-character tokens and simple, +special-purpose programs, I had abandoned them in favor of +working with the full compiler. It wasn't working. + +So we're going to go back to our roots, so to speak. In this +installment and the next, I'll be using single-character tokens +again as we study the concepts of procedures, unfettered by the +other baggage that we have accumulated in the previous sessions. +As a matter of fact, I won't even attempt, at the end of this +session, to merge the constructs into the TINY compiler. We'll +save that for later. + +After all this time, you don't need more buildup than that, so +let's waste no more time and dive right in. + + +THE BASICS + +All modern CPU's provide direct support for procedure calls, and +the 68000 is no exception. For the 68000, the call is a BSR +(PC-relative version) or JSR, and the return is RTS. 
All we have +to do is to arrange for the compiler to issue these commands at +the proper place. + +Actually, there are really THREE things we have to address. One +of them is the call/return mechanism. The second is the +mechanism for DEFINING the procedure in the first place. And, +finally, there is the issue of passing parameters to the called +procedure. None of these things are really very difficult, and +we can of course borrow heavily on what people have done in other +languages ... there's no need to reinvent the wheel here. Of the +three issues, that of parameter passing will occupy most of our +attention, simply because there are so many options available. + + +A BASIS FOR EXPERIMENTS + +As always, we will need some software to serve as a basis for +what we are doing. We don't need the full TINY compiler, but we +do need enough of a program so that some of the other constructs +are present. Specifically, we need at least to be able to handle +statements of some sort, and data declarations. + +The program shown below is that basis. It's a vestigial form of +TINY, with single-character tokens. It has data declarations, +but only in their simplest form ... no lists or initializers. It +has assignment statements, but only of the kind + + <ident> = <ident> + +In other words, the only legal expression is a single variable +name. There are no control constructs ... the only legal +statement is the assignment. + +Most of the program is just the standard Cradle routines.
I've +shown the whole thing here, just to make sure we're all starting +from the same point: + + +{--------------------------------------------------------------} +program Calls; + +{--------------------------------------------------------------} +{ Constant Declarations } + +const TAB = ^I; + CR = ^M; + LF = ^J; + +{--------------------------------------------------------------} +{ Variable Declarations } + +var Look: char; { Lookahead Character } + +var ST: Array['A'..'Z'] of char; + + +{--------------------------------------------------------------} +{ Read New Character From Input Stream } + +procedure GetChar; +begin + Read(Look); +end; + +{--------------------------------------------------------------} +{ Report an Error } + +procedure Error(s: string); +begin + WriteLn; + WriteLn(^G, 'Error: ', s, '.'); +end; + + +{--------------------------------------------------------------} +{ Report Error and Halt } + +procedure Abort(s: string); +begin + Error(s); + Halt; +end; + + +{--------------------------------------------------------------} +{ Report What Was Expected } + +procedure Expected(s: string); +begin + Abort(s + ' Expected'); +end; + + +{--------------------------------------------------------------} +{ Report an Undefined Identifier } + +procedure Undefined(n: string); +begin + Abort('Undefined Identifier ' + n); +end; + + +{--------------------------------------------------------------} +{ Report a Duplicate Identifier } + +procedure Duplicate(n: string); +begin + Abort('Duplicate Identifier ' + n); +end; + + +{--------------------------------------------------------------} +{ Get Type of Symbol } + +function TypeOf(n: char): char; +begin + TypeOf := ST[n]; +end; + + +{--------------------------------------------------------------} +{ Look for Symbol in Table } + +function InTable(n: char): Boolean; +begin + InTable := ST[n] <> ' '; +end; + + +{--------------------------------------------------------------} +{ Add a New Symbol to Table } + +procedure
AddEntry(Name, T: char); +begin + if InTable(Name) then Duplicate(Name); + ST[Name] := T; +end; + + +{--------------------------------------------------------------} +{ Check an Entry to Make Sure It's a Variable } + +procedure CheckVar(Name: char); +begin + if not InTable(Name) then Undefined(Name); + if TypeOf(Name) <> 'v' then + Abort(Name + ' is not a variable'); +end; + + +{--------------------------------------------------------------} +{ Recognize an Alpha Character } + +function IsAlpha(c: char): boolean; +begin + IsAlpha := upcase(c) in ['A'..'Z']; +end; + + +{--------------------------------------------------------------} +{ Recognize a Decimal Digit } + +function IsDigit(c: char): boolean; +begin + IsDigit := c in ['0'..'9']; +end; + + +{--------------------------------------------------------------} +{ Recognize an AlphaNumeric Character } + +function IsAlNum(c: char): boolean; +begin + IsAlNum := IsAlpha(c) or IsDigit(c); +end; + + +{--------------------------------------------------------------} +{ Recognize an Addop } + +function IsAddop(c: char): boolean; +begin + IsAddop := c in ['+', '-']; +end; + + +{--------------------------------------------------------------} +{ Recognize a Mulop } + +function IsMulop(c: char): boolean; +begin + IsMulop := c in ['*', '/']; +end; + + +{--------------------------------------------------------------} +{ Recognize a Boolean Orop } + +function IsOrop(c: char): boolean; +begin + IsOrop := c in ['|', '~']; +end; + + +{--------------------------------------------------------------} +{ Recognize a Relop } + +function IsRelop(c: char): boolean; +begin + IsRelop := c in ['=', '#', '<', '>']; +end; + + +{--------------------------------------------------------------} +{ Recognize White Space } + +function IsWhite(c: char): boolean; +begin + IsWhite := c in [' ', TAB]; +end; + + +{--------------------------------------------------------------} +{ Skip Over Leading White Space } + +procedure SkipWhite; +begin + while
IsWhite(Look) do + GetChar; +end; + + +{--------------------------------------------------------------} +{ Skip Over an End-of-Line } + +procedure Fin; +begin + if Look = CR then begin + GetChar; + if Look = LF then + GetChar; + end; +end; + + +{--------------------------------------------------------------} +{ Match a Specific Input Character } + +procedure Match(x: char); +begin + if Look = x then GetChar + else Expected('''' + x + ''''); + SkipWhite; +end; + + +{--------------------------------------------------------------} +{ Get an Identifier } + +function GetName: char; +begin + if not IsAlpha(Look) then Expected('Name'); + GetName := UpCase(Look); + GetChar; + SkipWhite; +end; + + +{--------------------------------------------------------------} +{ Get a Number } + +function GetNum: char; +begin + if not IsDigit(Look) then Expected('Integer'); + GetNum := Look; + GetChar; + SkipWhite; +end; + + +{--------------------------------------------------------------} +{ Output a String with Tab } + +procedure Emit(s: string); +begin + Write(TAB, s); +end; + + +{--------------------------------------------------------------} +{ Output a String with Tab and CRLF } + +procedure EmitLn(s: string); +begin + Emit(s); + WriteLn; +end; + + +{--------------------------------------------------------------} +{ Post a Label To Output } + +procedure PostLabel(L: string); +begin + WriteLn(L, ':'); +end; + + +{--------------------------------------------------------------} +{ Load a Variable to the Primary Register } + +procedure LoadVar(Name: char); +begin + CheckVar(Name); + EmitLn('MOVE ' + Name + '(PC),D0'); +end; + + +{--------------------------------------------------------------} +{ Store the Primary Register } + +procedure StoreVar(Name: char); +begin + CheckVar(Name); + EmitLn('LEA ' + Name + '(PC),A0'); + EmitLn('MOVE D0,(A0)') +end; + + +{--------------------------------------------------------------} +{ Initialize } + +procedure Init; +var i: char; +begin + GetChar; + 
SkipWhite; + for i := 'A' to 'Z' do + ST[i] := ' '; +end; + + +{--------------------------------------------------------------} +{ Parse and Translate an Expression } +{ Vestigial Version } + +procedure Expression; +begin + LoadVar(GetName); +end; + + +{--------------------------------------------------------------} +{ Parse and Translate an Assignment Statement } + +procedure Assignment; +var Name: char; +begin + Name := GetName; + Match('='); + Expression; + StoreVar(Name); +end; + + +{--------------------------------------------------------------} +{ Parse and Translate a Block of Statements } + +procedure DoBlock; +begin + while not(Look in ['e']) do begin + Assignment; + Fin; + end; +end; + + +{--------------------------------------------------------------} +{ Parse and Translate a Begin-Block } + +procedure BeginBlock; +begin + Match('b'); + Fin; + DoBlock; + Match('e'); + Fin; +end; + + +{--------------------------------------------------------------} +{ Allocate Storage for a Variable } + +procedure Alloc(N: char); +begin + if InTable(N) then Duplicate(N); + ST[N] := 'v'; + WriteLn(N, ':', TAB, 'DC 0'); +end; + + +{--------------------------------------------------------------} +{ Parse and Translate a Data Declaration } + +procedure Decl; +var Name: char; +begin + Match('v'); + Alloc(GetName); +end; + + +{--------------------------------------------------------------} +{ Parse and Translate Global Declarations } + +procedure TopDecls; +begin + while Look <> 'b' do begin + case Look of + 'v': Decl; + else Abort('Unrecognized Keyword ' + Look); + end; + Fin; + end; +end; + + +{--------------------------------------------------------------} +{ Main Program } + +begin + Init; + TopDecls; + BeginBlock; +end. +{--------------------------------------------------------------} + + +Note that we DO have a symbol table, and there is logic to check +a variable name to make sure it's a legal one.
It's also worth +noting that I have included the code you've seen before to +provide for white space and newlines. Finally, note that the +main program is delimited, as usual, by BEGIN-END brackets. + +Once you've copied the program to Turbo, the first step is to +compile it and make sure it works. Give it a few declarations, +and then a begin-block. Try something like: + + + va (for VAR A) + vb (for VAR B) + vc (for VAR C) + b (for BEGIN) + a=b + b=c + e. (for END.) + + +As usual, you should also make some deliberate errors, and verify +that the program catches them correctly. + + +DECLARING A PROCEDURE + +If you're satisfied that our little program works, then it's time +to deal with the procedures. Since we haven't talked about +parameters yet, we'll begin by considering only procedures that +have no parameter lists. + +As a start, let's consider a simple program with a procedure, and +think about the code we'd like to see generated for it: + + + PROGRAM FOO; + . + . + PROCEDURE BAR; BAR: + BEGIN . + . . + . . + END; RTS + + BEGIN { MAIN PROGRAM } MAIN: + . . + . . + BAR; BSR BAR + . . + . . + END. END MAIN + + +Here I've shown the high-order language constructs on the left, +and the desired assembler code on the right. The first thing to +notice is that we certainly don't have much code to generate +here! For the great bulk of both the procedure and the main +program, our existing constructs take care of the code to be +generated. + +The key to dealing with the body of the procedure is to recognize +that although a procedure may be quite long, declaring it is +really no different than declaring a variable. It's just one +more kind of declaration. We can write the BNF: + + + <declaration> ::= <data decl> | <procedure> + + +This means that it should be easy to modify TopDecls to deal with +procedures. What about the syntax of a procedure?
Well, here's +a suggested syntax, which is essentially that of Pascal: + + + <procedure> ::= PROCEDURE <ident> <begin-block> + + +There is practically no code generation required, other than that +generated within the begin-block. We need only emit a label at +the beginning of the procedure, and an RTS at the end. + +Here's the required code: + +{--------------------------------------------------------------} +{ Parse and Translate a Procedure Declaration } + +procedure DoProc; +var N: char; +begin + Match('p'); + N := GetName; + Fin; + if InTable(N) then Duplicate(N); + ST[N] := 'p'; + PostLabel(N); + BeginBlock; + Return; +end; +{--------------------------------------------------------------} + + +Note that I've added a new code generation routine, Return, which +merely emits an RTS instruction. The creation of that routine is +"left as an exercise for the student." + +To finish this version, add the following line within the Case +statement in TopDecls: + + + 'p': DoProc; + + +I should mention that this structure for declarations, and the +BNF that drives it, differs from standard Pascal. In the Jensen +& Wirth definition of Pascal, variable declarations, in fact ALL +kinds of declarations, must appear in a specific sequence, i.e. +labels, constants, types, variables, procedures, and main +program. To follow such a scheme, we should separate the two +declarations, and have code in the main program something like + + + DoVars; + DoProcs; + DoMain; + + +However, most implementations of Pascal, including Turbo, don't +require that order and let you freely mix up the various +declarations, as long as you still don't try to refer to +something before it's declared. Although it may be more +aesthetically pleasing to declare all the global variables at the +top of the program, it certainly doesn't do any HARM to allow +them to be sprinkled around. In fact, it may do some GOOD, in +the sense that it gives you the opportunity to do a little +rudimentary information hiding.
Variables that should be +accessed only by the main program, for example, can be declared +just before it and will thus be inaccessible by the procedures. + +OK, try this new version out. Note that we can declare as many +procedures as we choose (as long as we don't run out of single- +character names!), and the labels and RTS's all come out in the +right places. + +It's worth noting here that I do _NOT_ allow for nested +procedures. In TINY, all procedures must be declared at the +global level, the same as in C. There has been quite a +discussion about this point in the Computer Language Forum of +CompuServe. It turns out that there is a significant penalty in +complexity that must be paid for the luxury of nested procedures. +What's more, this penalty gets paid at RUN TIME, because extra +code must be added and executed every time a procedure is called. +I also happen to believe that nesting is not a good idea, simply +on the grounds that I have seen too many abuses of the feature. +Before going on to the next step, it's also worth noting that the +"main program" as it stands is incomplete, since it doesn't have +the label and END statement. Let's fix that little oversight: + + +{--------------------------------------------------------------} +{ Parse and Translate a Main Program } + +procedure DoMain; +begin + Match('b'); + Fin; + Prolog; + DoBlock; + Epilog; +end; +{--------------------------------------------------------------} +. +. +. +{--------------------------------------------------------------} +{ Main Program } + +begin + Init; + TopDecls; + DoMain; +end. +{--------------------------------------------------------------} + + +Note that DoProc and DoMain are not quite symmetrical. DoProc +uses a call to BeginBlock, whereas DoMain cannot. That's because +a procedure is signaled by the keyword PROCEDURE (abbreviated by +a 'p' here), while the main program gets no keyword other than +the BEGIN itself. + +And _THAT_ brings up an interesting question: WHY? 
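Before taking up that question, a practical note. The listings so far call three code generation routines that are never shown: Return, Prolog and Epilog. What follows is only a sketch, assuming nothing beyond the desired output shown earlier (an RTS to close a procedure, and a MAIN: label and an END MAIN around the main program). The copies of Emit, EmitLn and PostLabel are repeated from the listing just so the fragment compiles on its own; leave them out when adding the three routines to the translator.

```pascal
const TAB = ^I;

{ Copies of routines from the listing, repeated so this compiles alone }
procedure Emit(s: string);
begin
   Write(TAB, s);
end;

procedure EmitLn(s: string);
begin
   Emit(s);
   WriteLn;
end;

procedure PostLabel(L: string);
begin
   WriteLn(L, ':');
end;

{ Assumed sketch: end a procedure with a return instruction }
procedure Return;
begin
   EmitLn('RTS');
end;

{ Assumed sketch: post the label that starts the main program }
procedure Prolog;
begin
   PostLabel('MAIN');
end;

{ Assumed sketch: close the assembler source }
procedure Epilog;
begin
   EmitLn('END MAIN');
end;
```

Any real assembler may want more in the epilog than a bare END, so treat these as placeholders that match the example output and nothing more.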
+ +If we look at the structure of C programs, we find that all +functions are treated just alike, except that the main program +happens to be identified by its name, "main." Since C functions +can appear in any order, the main program can also be anywhere in +the compilation unit. + +In Pascal, on the other hand, all variables and procedures must +be declared before they're used, which means that there is no +point putting anything after the main program ... it could never +be accessed. The "main program" is not identified at all, other +than being that part of the code that comes after the global +BEGIN. In other words, if it ain't anything else, it must be the +main program. + +This causes no small amount of confusion for beginning +programmers, and for big Pascal programs sometimes it's difficult +to find the beginning of the main program at all. This leads to +conventions such as identifying it in comments: + + + BEGIN { of MAIN } + + +This has always seemed to me to be a bit of a kludge. The +question comes up: Why should the main program be treated so +much differently than a procedure? In fact, now that we've +recognized that procedure declarations are just that ... part of +the global declarations ... isn't the main program just one more +declaration, also? + +The answer is yes, and by treating it that way, we can simplify +the code and make it considerably more orthogonal. I propose +that we use an explicit keyword, PROGRAM, to identify the main +program (Note that this means that we can't start the file with +it, as in Pascal). In this case, our BNF becomes: + + + <declaration> ::= <data decl> | <procedure> | <main program>
+ + + <procedure> ::= PROCEDURE <ident> <begin-block> + + + <main program> ::= PROGRAM <ident> <begin-block> + + +The code also looks much better, at least in the sense that +DoMain and DoProc look more alike: + + +{--------------------------------------------------------------} +{ Parse and Translate a Main Program } + +procedure DoMain; +var N: char; +begin + Match('P'); + N := GetName; + Fin; + if InTable(N) then Duplicate(N); + Prolog; + BeginBlock; +end; +{--------------------------------------------------------------} +. +. +. +{--------------------------------------------------------------} +{ Parse and Translate Global Declarations } + +procedure TopDecls; +begin + while Look <> '.' do begin + case Look of + 'v': Decl; + 'p': DoProc; + 'P': DoMain; + else Abort('Unrecognized Keyword ' + Look); + end; + Fin; + end; +end; + + +{--------------------------------------------------------------} +{ Main Program } + +begin + Init; + TopDecls; + Epilog; +end. +{--------------------------------------------------------------} + + +Since the declaration of the main program is now within the loop +of TopDecls, that does present some difficulties. How do we +ensure that it's the last thing in the file? And how do we ever +exit from the loop? My answer for the second question, as you +can see, was to bring back our old friend the period. Once the +parser sees that, we're done. + +To answer the first question: it depends on how far we're +willing to go to protect the programmer from dumb mistakes. In +the code that I've shown, there's nothing to keep the programmer +from adding code after the main program ... even another main +program. The code will just not be accessible. However, we +COULD access it via a FORWARD statement, which we'll be providing +later. As a matter of fact, many assembler language programmers +like to use the area just after the program to declare large, +uninitialized data blocks, so there may indeed be some value in +not requiring the main program to be last. We'll leave it as it +is.
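For anyone who would rather not leave it as it is, flagging a second main program takes very little. This is only a sketch: the flag MainSeen and the function RegisterMain are hypothetical names, not part of the translator as listed, and the flag is assumed to be set to false in Init.

```pascal
var MainSeen: boolean;     { hypothetical flag; clear it in Init }

{ Note that a main program has been declared.
  Returns false if one had already been seen. }
function RegisterMain: boolean;
begin
   RegisterMain := not MainSeen;
   MainSeen := true;
end;
```

DoMain could then start with something like "if not RegisterMain then Abort('Duplicate Main Program');" right after the Match('P'). The same flag could just as well be tested in TopDecls to leave the loop once the main program has been processed.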
+ +If we decide that we should give the programmer a little more +help than that, it's pretty easy to add some logic to kick us out +of the loop once the main program has been processed. Or we +could at least flag an error if someone tries to include two +mains. + + +CALLING THE PROCEDURE + +If you're satisfied that things are working, let's address the +second half of the equation ... the call. + +Consider the BNF for a procedure call: + + + <proc_call> ::= <identifier> + + +For an assignment statement, on the other hand, the BNF is: + + + <assignment> ::= <identifier> '=' <expression> + + +At this point we seem to have a problem. The two BNF statements +both begin on the right-hand side with the token <identifier>. +How are we supposed to know, when we see the identifier, whether +we have a procedure call or an assignment statement? This looks +like a case where our parser ceases being predictive, and indeed +that's exactly the case. However, it turns out to be an easy +problem to fix, since all we have to do is to look at the type of +the identifier, as recorded in the symbol table. As we've +discovered before, a minor local violation of the predictive +parsing rule can be easily handled as a special case.
+ +Here's how to do it: + + +{--------------------------------------------------------------} +{ Parse and Translate an Assignment Statement } + +procedure Assignment(Name: char); +begin + Match('='); + Expression; + StoreVar(Name); +end; + + +{--------------------------------------------------------------} +{ Decide if a Statement is an Assignment or Procedure Call } + +procedure AssignOrProc; +var Name: char; +begin + Name := GetName; + case TypeOf(Name) of + ' ': Undefined(Name); + 'v': Assignment(Name); + 'p': CallProc(Name); + else Abort('Identifier ' + Name + + ' Cannot Be Used Here'); + end; +end; + + +{--------------------------------------------------------------} +{ Parse and Translate a Block of Statements } + +procedure DoBlock; +begin + while not(Look in ['e']) do begin + AssignOrProc; + Fin; + end; +end; +{--------------------------------------------------------------} + + +As you can see, procedure DoBlock now calls AssignOrProc instead of +Assignment. The function of this new procedure is to simply read +the identifier, determine its type, and then call whichever +procedure is appropriate for that type. Since the name has +already been read, we must pass it to the two procedures, and +modify Assignment to match. Procedure CallProc is a simple code +generation routine. Put it ahead of AssignOrProc in the source +file, since Pascal wants a procedure declared before it is called: + + +{--------------------------------------------------------------} +{ Call a Procedure } + +procedure CallProc(N: char); +begin + EmitLn('BSR ' + N); +end; +{--------------------------------------------------------------} + + +Well, at this point we have a compiler that can deal with +procedures. It's worth noting that procedures can call +procedures to any depth. So even though we don't allow nested +DECLARATIONS, there is certainly nothing to keep us from nesting +CALLS, just as we would expect to do in any language. We're +getting there, and it wasn't too hard, was it? + +Of course, so far we can only deal with procedures that have no +parameters.
The procedures can only operate on the global +variables by their global names. So at this point we have the +equivalent of BASIC's GOSUB construct. Not too bad ... after all +lots of serious programs were written using GOSUBs, but we can do +better, and we will. That's the next step. + + +PASSING PARAMETERS + +Again, we all know the basic idea of passed parameters, but let's +review them just to be safe. + +In general the procedure is given a parameter list, for example + + PROCEDURE FOO(X, Y, Z) + +In the declaration of a procedure, the parameters are called +formal parameters, and may be referred to in the body of the +procedure by those names. The names used for the formal +parameters are really arbitrary. Only the position really +counts. In the example above, the name 'X' simply means "the +first parameter" wherever it is used. + +When a procedure is called, the "actual parameters" passed to it +are associated with the formal parameters, on a one-for-one +basis. + +The BNF for the syntax looks something like this: + + + <procedure> ::= PROCEDURE <ident> + '(' <param-list> ')' <begin-block> + + + <param-list> ::= <parameter> ( ',' <parameter> )* | null + +Similarly, the procedure call looks like: + + + <proc call> ::= <ident> '(' <param-list> ')' + + +Note that there is already an implicit decision built into this +syntax. Some languages, such as Pascal and Ada, permit parameter +lists to be optional. If there are no parameters, you simply +leave off the parens completely. Other languages, like C and +Modula 2, require the parens even if the list is empty. Clearly, +the example we just finished corresponds to the former point of +view. But to tell the truth I prefer the latter. For procedures +alone, the decision would seem to favor the "listless" approach. +The statement + + + Initialize; + + +standing alone can only mean a procedure call. In the parsers +we've been writing, we've made heavy use of parameterless +procedures, and it would seem a shame to have to write an empty +pair of parens for each case. + +But later on we're going to be using functions, too.
And since +functions can appear in the same places as simple scalar +identifiers, you can't tell the difference between the two when +the parens are left off. You +have to go back to the declarations to find out. Some folks +consider this to be an advantage. Their argument is that an +identifier gets replaced by a value, and what do you care whether +it's done by substitution or by a function? But we sometimes +_DO_ care, because the function may be quite time-consuming. If, +by writing a simple identifier into a given expression, we can +incur a heavy run-time penalty, it seems to me we ought to be +made aware of it. + +Anyway, Niklaus Wirth designed both Pascal and Modula 2. I'll +give him the benefit of the doubt and assume that he had a good +reason for changing the rules the second time around! + +Needless to say, it's an easy thing to accommodate either point of +view as we design a language, so this one is strictly a matter of +personal preference. Do it whichever way you like best. + +Before we go any further, let's alter the translator to handle a +(possibly empty) parameter list. For now we won't generate any +extra code ... just parse the syntax.
The code for processing +the declaration has very much the same form we've seen before +when dealing with VAR-lists: + + +{--------------------------------------------------------------} +{ Process the Formal Parameter List of a Procedure } + +procedure FormalList; +begin + Match('('); + if Look <> ')' then begin + FormalParam; + while Look = ',' do begin + Match(','); + FormalParam; + end; + end; + Match(')'); +end; +{--------------------------------------------------------------} + + +Procedure DoProc needs to have a line added to call FormalList: + + +{--------------------------------------------------------------} +{ Parse and Translate a Procedure Declaration } + +procedure DoProc; +var N: char; +begin + Match('p'); + N := GetName; + FormalList; + Fin; + if InTable(N) then Duplicate(N); + ST[N] := 'p'; + PostLabel(N); + BeginBlock; + Return; +end; +{--------------------------------------------------------------} + + +For now, the code for FormalParam is just a dummy one that simply +skips the parameter name: + + +{--------------------------------------------------------------} +{ Process a Formal Parameter } + +procedure FormalParam; +var Name: char; +begin + Name := GetName; +end; +{--------------------------------------------------------------} + + +For the actual procedure call, there must be similar code to +process the actual parameter list: + + +{--------------------------------------------------------------} +{ Process an Actual Parameter } + +procedure Param; +var Name: char; +begin + Name := GetName; +end; + + +{--------------------------------------------------------------} +{ Process the Parameter List for a Procedure Call } + +procedure ParamList; +begin + Match('('); + if Look <> ')' then begin + Param; + while Look = ',' do begin + Match(','); + Param; + end; + end; + Match(')'); +end; + + +{--------------------------------------------------------------} +{ Process a Procedure Call } + +procedure CallProc(Name: char); +begin + ParamList; + 
Call(Name); +end; +{--------------------------------------------------------------} + + +Note here that CallProc is no longer just a simple code +generation routine. It has some structure to it. To handle +this, I've renamed the code generation routine to just Call, and +called it from within CallProc. + +OK, if you'll add all this code to your translator and try it +out, you'll find that you can indeed parse the syntax properly. +I'll note in passing that there is _NO_ checking to make sure +that the number (and, later, types) of formal and actual +parameters match up. In a production compiler, we must of course +do this. We'll ignore the issue now if for no other reason than +that the structure of our symbol table doesn't currently give us +a place to store the necessary information. Later on, we'll have +a place for that data and we can deal with the issue then. + + +THE SEMANTICS OF PARAMETERS + +So far we've dealt with the SYNTAX of parameter passing, and +we've got the parsing mechanisms in place to handle it. Next, we +have to look at the SEMANTICS, i.e., the actions to be taken when +we encounter parameters. This brings us square up against the +issue of the different ways parameters can be passed. + +There is more than one way to pass a parameter, and the way we do +it can have a profound effect on the character of the language. +So this is another of those areas where I can't just give you my +solution. Rather, it's important that we spend some time looking +at the alternatives so that you can go another route if you +choose to. + +There are two main ways parameters are passed: + + o By value + o By reference (address) + +The differences are best seen in the light of a little history. + +The old FORTRAN compilers passed all parameters by reference. In +other words, what was actually passed was the address of the +parameter. 
This meant that the called subroutine was free to +either read or write that parameter, as often as it chose to, +just as though it were a global variable. This was actually +quite an efficient way to do things, and it was pretty simple +since the same mechanism was used in all cases, with one +exception that I'll get to shortly. + +There were problems, though. Many people felt that this method +created entirely too much coupling between the called subroutine +and its caller. In effect, it gave the subroutine complete +access to all variables that appeared in the parameter list. + +Many times, we didn't want to actually change a parameter, but +only use it as an input. For example, we might pass an element +count to a subroutine, and wish we could then use that count +within a DO-loop. To avoid changing the value in the calling +program, we had to make a local copy of the input parameter, and +operate only on the copy. Some FORTRAN programmers, in fact, +made it a practice to copy ALL parameters except those that were +to be used as return values. Needless to say, all this copying +defeated a good bit of the efficiency associated with the +approach. + +There was, however, an even more insidious problem, which was not +really just the fault of the "pass by reference" convention, but +a bad convergence of several implementation decisions. + +Suppose we have a subroutine: + + + SUBROUTINE FOO(X, Y, N) + + +where N is some kind of input count or flag. Many times, we'd +like to be able to pass a literal or even an expression in place +of a variable, such as: + + + CALL FOO(A, B, J + 1) + + +Here the third parameter is not a variable, and so it has no +address. The earliest FORTRAN compilers did not allow such +things, so we had to resort to subterfuges like: + + + K = J + 1 + CALL FOO(A, B, K) + + +Here again, there was copying required, and the burden was on the +programmer to do it. Not good. 
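The coupling problem and the copying workaround are easy to reproduce in Pascal, where a VAR parameter is passed by reference. The sketch below is only an illustration; Count, Drain and DrainCopy are hypothetical names and have nothing to do with the translator.

```pascal
var Count: integer;            { hypothetical caller variable }

{ N arrives by reference, so this loop counts down the
  caller's own variable }
procedure Drain(var N: integer);
begin
   while N > 0 do
      Dec(N);
end;

{ Same loop, run on a local copy so the caller's value survives }
procedure DrainCopy(var N: integer);
var Local: integer;
begin
   Local := N;
   while Local > 0 do
      Dec(Local);
end;
```

After "Count := 3; Drain(Count);" the caller's Count is 0, while DrainCopy leaves it at 3. A call such as Drain(Count + 1) does not compile at all, because a VAR parameter has to be a variable with an address, which is the same restriction the earliest FORTRAN compilers imposed.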
+ +Later FORTRAN implementations got rid of this by allowing +expressions as parameters. What they did was to assign a +compiler-generated variable, store the value of the expression in +the variable, and then pass the address of that variable. + +So far, so good. Even if the subroutine mistakenly altered the +anonymous variable, who was to know or care? On the next call, +it would be recalculated anyway. + +The problem arose when someone decided to make things more +efficient. They reasoned, rightly enough, that the most common +kind of "expression" was a single integer value, as in: + + + CALL FOO(A, B, 4) + + +It seemed inefficient to go to the trouble of "computing" such an +integer and storing it in a temporary variable, just to pass it +through the calling list. Since we had to pass the address of +the thing anyway, it seemed to make lots of sense to just pass +the address of the literal integer, 4 in the example above. + +To make matters more interesting, most compilers, then and now, +identify all literals and store them separately in a "literal +pool," so that we only have to store one value for each unique +literal. That combination of design decisions: passing +expressions, optimization for literals as a special case, and use +of a literal pool, is what led to disaster. + +To see how it works, imagine that we call subroutine FOO as in +the example above, passing it a literal 4. Actually, what gets +passed is the address of the literal 4, which is stored in the +literal pool. This address corresponds to the formal parameter, +N, in the subroutine itself. + +Now suppose that, unbeknownst to the programmer, subroutine FOO +actually modifies N to be, say, -7. Suddenly, that literal 4 in +the literal pool gets CHANGED, to a -7. From then on, every +expression that uses a 4 and every subroutine that passes a 4 +will be using the value of -7 instead! Needless to say, this can +lead to some bizarre and difficult-to-find behavior.
The whole +thing gave the concept of pass-by-reference a bad name, although +as we have seen, it was really a combination of design decisions +that led to the problem. + +In spite of the problem, the FORTRAN approach had its good +points. Chief among them is the fact that we don't have to +support multiple mechanisms. The same scheme, passing the +address of the argument, works for EVERY case, including arrays. +So the size of the compiler can be reduced. + +Partly because of the FORTRAN gotcha, and partly just because of +the reduced coupling involved, modern languages like C, Pascal, +Ada, and Modula 2 generally pass scalars by value. + +This means that the value of the scalar is COPIED into a separate +value used only for the call. Since the value passed is a copy, +the called procedure can use it as a local variable and modify it +any way it likes. The value in the caller will not be changed. + +It may seem at first that this is a bit inefficient, because of +the need to copy the parameter. But remember that we're going to +have to fetch SOME value to pass anyway, whether it be the +parameter itself or an address for it. Inside the subroutine, +using pass-by-value is definitely more efficient, since we +eliminate one level of indirection. Finally, we saw earlier that +with FORTRAN, it was often necessary to make copies within the +subroutine anyway, so pass-by-value reduces the number of local +variables. All in all, pass-by-value is better. + +Except for one small detail: if all parameters are passed +by value, there is no way for a called procedure to return a +result to its caller through them! The parameter passed is NOT altered in the +caller, only in the called procedure. Clearly, that won't get +the job done. + +There have been two answers to this problem, which are +equivalent. In Pascal, Wirth provides for VAR parameters, which +are passed by reference.
What a VAR parameter is, in fact, is +none other than our old friend the FORTRAN parameter, with a new +name and paint job for disguise. Wirth neatly gets around the +"changing a literal" problem as well as the "address of an +expression" problem, by the simple expedient of allowing only a +variable to be the actual parameter. In other words, it's the +same restriction that the earliest FORTRANs imposed. + +C does the same thing, but explicitly. In C, _ALL_ parameters +are passed by value. One kind of variable that C supports, +however, is the pointer. So by passing a pointer by value, you +in effect pass what it points to by reference. In some ways this +works even better yet, because even though you can change the +variable pointed to all you like, you still CAN'T change the +pointer itself. In a function such as strcpy, for example, where +the pointers are incremented as the string is copied, we are +really only incrementing copies of the pointers, so the values of +those pointers in the calling procedure still remain as they +were. To modify a pointer, you must pass a pointer to the +pointer. + +Since we are simply performing experiments here, we'll look at +BOTH pass-by-value and pass-by-reference. That way, we'll be +able to use either one as we need to. It's worth mentioning that +it's going to be tough to use the C approach to pointers here, +since a pointer is a different type and we haven't studied types +yet! + + +PASS-BY-VALUE + +Let's just try some simple-minded things and see where they lead +us. Let's begin with the pass-by-value case. Consider the +procedure call: + + + FOO(X, Y) + + +Almost the only reasonable way to pass the data is through the +CPU stack. So the code we'd like to see generated might look +something like this: + + + MOVE X(PC),-(SP) ; Push X + MOVE Y(PC),-(SP) ; Push Y + BSR FOO ; Call FOO + + +That certainly doesn't seem too complex! + +When the BSR is executed, the CPU pushes the return address onto +the stack and jumps to FOO. 
At this point the stack will look +like this: + + . + . + Value of X (2 bytes) + Value of Y (2 bytes) + SP --> Return Address (4 bytes) + + +So the values of the parameters have addresses that are fixed +offsets from the stack pointer. In this example, the addresses +are: + + + X: 6(SP) + Y: 4(SP) + + +Now consider what the called procedure might look like: + + + PROCEDURE FOO(A, B) + BEGIN + A = B + END + +(Remember, the names of the formal parameters are arbitrary ... +only the positions count.) + +The desired output code might look like: + + + FOO: MOVE 4(SP),D0 + MOVE D0,6(SP) + RTS + + +Note that, in order to address the formal parameters, we're going +to have to know which position they have in the parameter list. +This means some changes to the symbol table stuff. In fact, for +our single-character case it's best to just create a new symbol +table for the formal parameters. + +Let's begin by declaring a new table: + + + var Params: Array['A'..'Z'] of integer; + + +We also will need to keep track of how many parameters a given +procedure has: + + + var NumParams: integer; + + +And we need to initialize the new table. Now, remember that the +formal parameter list will be different for each procedure that +we process, so we'll need to initialize that table anew for each +procedure. Here's the initializer: + + +{--------------------------------------------------------------} +{ Initialize Parameter Table to Null } + +procedure ClearParams; +var i: char; +begin + for i := 'A' to 'Z' do + Params[i] := 0; + NumParams := 0; +end; +{--------------------------------------------------------------} + + +We'll put a call to this procedure in Init, and also at the end +of DoProc: + + +{--------------------------------------------------------------} +{ Initialize } + +procedure Init; +var i: char; +begin + GetChar; + SkipWhite; + for i := 'A' to 'Z' do + ST[i] := ' '; + ClearParams; +end; +{--------------------------------------------------------------} +. +. +. 
+{--------------------------------------------------------------} +{ Parse and Translate a Procedure Declaration } + +procedure DoProc; +var N: char; +begin + Match('p'); + N := GetName; + FormalList; + Fin; + if InTable(N) then Duplicate(N); + ST[N] := 'p'; + PostLabel(N); + BeginBlock; + Return; + ClearParams; +end; +{--------------------------------------------------------------} + + +Note that the call within DoProc ensures that the table will be +clear when we're in the main program. + + +OK, now we need a few procedures to work with the table. The +next few functions are essentially copies of InTable, TypeOf, +etc.: + + +{--------------------------------------------------------------} +{ Find the Parameter Number } + +function ParamNumber(N: char): integer; +begin + ParamNumber := Params[N]; +end; + + +{--------------------------------------------------------------} +{ See if an Identifier is a Parameter } + +function IsParam(N: char): boolean; +begin + IsParam := Params[N] <> 0; +end; + + +{--------------------------------------------------------------} +{ Add a New Parameter to Table } + +procedure AddParam(Name: char); +begin + if IsParam(Name) then Duplicate(Name); + Inc(NumParams); + Params[Name] := NumParams; +end; +{--------------------------------------------------------------} + + +Finally, we need some code generation routines: + + +{--------------------------------------------------------------} +{ Load a Parameter to the Primary Register } + +procedure LoadParam(N: integer); +var Offset: integer; +begin + Offset := 4 + 2 * (NumParams - N); + Emit('MOVE '); + WriteLn(Offset, '(SP),D0'); +end; + + +{--------------------------------------------------------------} +{ Store a Parameter from the Primary Register } + +procedure StoreParam(N: integer); +var Offset: integer; +begin + Offset := 4 + 2 * (NumParams - N); + Emit('MOVE D0,'); + WriteLn(Offset, '(SP)'); +end; + + +{--------------------------------------------------------------} +{ Push The 
Primary Register to the Stack } + +procedure Push; +begin + EmitLn('MOVE D0,-(SP)'); +end; +{--------------------------------------------------------------} + + +( The last routine is one we've seen before, but it wasn't in +this vestigial version of the program.) + +With those preliminaries in place, we're ready to deal with the +semantics of procedures with calling lists (remember, the code to +deal with the syntax is already in place). + +Let's begin by processing a formal parameter. All we have to do +is to add each parameter to the parameter symbol table: + + +{--------------------------------------------------------------} +{ Process a Formal Parameter } + +procedure FormalParam; +begin + AddParam(GetName); +end; +{--------------------------------------------------------------} + + +Now, what about dealing with a formal parameter when it appears +in the body of the procedure? That takes a little more work. We +must first determine that it IS a formal parameter. To do this, +I've written a modified version of TypeOf: + + +{--------------------------------------------------------------} +{ Get Type of Symbol } + +function TypeOf(n: char): char; +begin + if IsParam(n) then + TypeOf := 'f' + else + TypeOf := ST[n]; +end; +{--------------------------------------------------------------} + + +(Note that, since TypeOf now calls IsParam, it may need to be +relocated in your source.) 
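If you'd like to check the offset arithmetic before wiring everything into the parser, a small stand-alone sketch can help. The program below is not part of the compiler: it copies the parameter table routines shown above (without the duplicate check) and adds a made-up procedure, Show, that prints where each formal parameter of FOO(A, B) would sit relative to SP. It assumes, as the text does, a 4-byte return address and 2-byte parameters.

```pascal
program ParamOffsets;
{ Stand-alone sketch, not part of the compiler. The program    }
{ name and the procedure Show are invented for illustration.   }

var Params: Array['A'..'Z'] of integer;
    NumParams: integer;

{ Initialize Parameter Table to Null }
procedure ClearParams;
var i: char;
begin
   for i := 'A' to 'Z' do
      Params[i] := 0;
   NumParams := 0;
end;

{ Add a New Parameter to Table (duplicate check omitted) }
procedure AddParam(Name: char);
begin
   Inc(NumParams);
   Params[Name] := NumParams;
end;

{ Print the SP-relative offset of one formal parameter }
procedure Show(Name: char);
var Offset: integer;
begin
   { 4 bytes of return address, then 2 bytes per parameter }
   Offset := 4 + 2 * (NumParams - Params[Name]);
   WriteLn(Name, ': ', Offset, '(SP)');
end;

begin
   ClearParams;
   AddParam('A');      { PROCEDURE FOO(A, B) }
   AddParam('B');
   Show('A');          { prints A: 6(SP) }
   Show('B');          { prints B: 4(SP) }
end.
```

The two lines of output agree with the stack picture drawn earlier: the first parameter is deepest in the stack, and the last one sits just above the return address.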
+ +We also must modify AssignOrProc to deal with this new type: + + +{--------------------------------------------------------------} +{ Decide if a Statement is an Assignment or Procedure Call } + +procedure AssignOrProc; +var Name: char; +begin + Name := GetName; + case TypeOf(Name) of + ' ': Undefined(Name); + 'v', 'f': Assignment(Name); + 'p': CallProc(Name); + else Abort('Identifier ' + Name + ' Cannot Be Used Here'); + end; +end; +{--------------------------------------------------------------} + + +Finally, the code to process an assignment statement and an +expression must be extended: + + +{--------------------------------------------------------------} +{ Parse and Translate an Expression } +{ Vestigial Version } + +procedure Expression; +var Name: char; +begin + Name := GetName; + if IsParam(Name) then + LoadParam(ParamNumber(Name)) + else + LoadVar(Name); +end; + + +{--------------------------------------------------------------} +{ Parse and Translate an Assignment Statement } + +procedure Assignment(Name: char); +begin + Match('='); + Expression; + if IsParam(Name) then + StoreParam(ParamNumber(Name)) + else + StoreVar(Name); +end; +{--------------------------------------------------------------} + + +As you can see, these procedures will treat every variable name +encountered as either a formal parameter or a global variable, +depending on whether or not it appears in the parameter symbol +table. Remember that we are using only a vestigial form of +Expression. In the final program, the change shown here will +have to be added to Factor, not Expression. + +The rest is easy. We need only add the semantics to the actual +procedure call, which we can do with one new line of code: + + +{--------------------------------------------------------------} +{ Process an Actual Parameter } + +procedure Param; +begin + Expression; + Push; +end; +{--------------------------------------------------------------} + + +That's it. 
Add these changes to your program and give it a try. +Try declaring one or two procedures, each with a formal parameter +list. Then do some assignments, using combinations of global and +formal parameters. You can call one procedure from within +another, but you cannot DECLARE a nested procedure. You can even +pass formal parameters from one procedure to another. If we had +the full syntax of the language here, you'd also be able to do +things like read or write formal parameters or use them in +complicated expressions. + + +WHAT'S WRONG? + +At this point, you might be thinking: Surely there's more to this +than a few pushes and pops. There must be more to passing +parameters than this. + +You'd be right. As a matter of fact, the code that we're +generating here leaves a lot to be desired in several respects. + +The most glaring oversight is that it's wrong! If you'll look +back at the code for a procedure call, you'll see that the caller +pushes each actual parameter onto the stack before it calls the +procedure. The procedure USES that information, but it doesn't +change the stack pointer. That means that the stuff is still +there when we return. SOMEBODY needs to clean up the stack, or +we'll soon be in very hot water! + +Fortunately, that's easily fixed. All we have to do is to +increment the stack pointer when we're finished. + +Should we do that in the calling program, or the called +procedure? Some folks let the called procedure clean up the +stack, since that requires less code to be generated per call, +and since the procedure, after all, knows how many parameters +it's got. But that means that it must do something with the +return address so as not to lose it. + +I prefer letting the caller clean up, so that the callee need +only execute a return. Also, it seems a bit more balanced, since +the caller is the one who "messed up" the stack in the first +place. But THAT means that the caller must remember how many +items it pushed. 
To make things easy, I've modified the +procedure ParamList to be a function instead of a procedure, +returning the number of bytes pushed: + + +{--------------------------------------------------------------} +{ Process the Parameter List for a Procedure Call } + +function ParamList: integer; +var N: integer; +begin + N := 0; + Match('('); + if Look <> ')' then begin + Param; + inc(N); + while Look = ',' do begin + Match(','); + Param; + inc(N); + end; + end; + Match(')'); + ParamList := 2 * N; +end; +{--------------------------------------------------------------} + + +Procedure CallProc then uses this to clean up the stack: + + +{--------------------------------------------------------------} +{ Process a Procedure Call } + +procedure CallProc(Name: char); +var N: integer; +begin + N := ParamList; + Call(Name); + CleanStack(N); +end; +{--------------------------------------------------------------} + + +Here I've created yet another code generation procedure: + + +{--------------------------------------------------------------} +{ Adjust the Stack Pointer Upwards by N Bytes } + +procedure CleanStack(N: integer); +begin + if N > 0 then begin + Emit('ADD #'); + WriteLn(N, ',SP'); + end; +end; +{--------------------------------------------------------------} + + +OK, if you'll add this code to your compiler, I think you'll find +that the stack is now under control. + +The next problem has to do with our way of addressing relative to +the stack pointer. That works fine in our simple examples, since +with our rudimentary form of expressions nobody else is messing +with the stack. But consider a different example as simple as: + + + PROCEDURE FOO(A, B) + BEGIN + A = A + B + END + + +The code generated by a simple-minded parser might be: + + + FOO: MOVE 6(SP),D0 ; Fetch A + MOVE D0,-(SP) ; Push it + MOVE 4(SP),D0 ; Fetch B + ADD (SP)+,D0 ; Add A + MOVE D0,6(SP) ; Store A + RTS + + +This would be wrong. 
When we push the first argument onto the +stack, the offsets for the two formal parameters are no longer 4 +and 6, but are 6 and 8. So the second fetch, at 4(SP), would not +fetch B at all; it would pick up part of the return address +instead. + +This is not the end of the world. I think you can see that all +we really have to do is to alter the offset every time we do a +push, and that in fact is what's done if the CPU has no support +for other methods. + +Fortunately, though, the 68000 does have such support. +Recognizing that this CPU would be used a lot with high-order +language compilers, Motorola decided to add direct support for +this kind of thing. + +The problem, as you can see, is that as the procedure executes, +the stack pointer bounces up and down, and so it becomes an +awkward thing to use as a reference to access the formal +parameters. The solution is to define some _OTHER_ register, and +use it instead. This register is typically set equal to the +original stack pointer, and is called the frame pointer. + +The 68000 instruction LINK lets you declare such a frame +pointer, and sets it equal to the stack pointer, all in one +instruction. As a matter of fact, it does even more than that. +Since this register may have been in use for something else in +the calling procedure, LINK also pushes the current value of that +register onto the stack. It can also add a value to the stack +pointer, to make room for local variables. + +The complement of LINK is UNLK, which simply restores the stack +pointer and pops the old value back into the register. + +Using these two instructions, the code for the previous example +becomes: + + + FOO: LINK A6,#0 + MOVE 10(A6),D0 ; Fetch A + MOVE D0,-(SP) ; Push it + MOVE 8(A6),D0 ; Fetch B + ADD (SP)+,D0 ; Add A + MOVE D0,10(A6) ; Store A + UNLK A6 + RTS + + +Fixing the compiler to generate this code is a lot easier than it +is to explain it. All we need to do is to modify the code +generation created by DoProc. 
Since that makes the code a little +more than one line, I've created new procedures to deal with it, +paralleling the Prolog and Epilog procedures called by DoMain: + + +{--------------------------------------------------------------} +{ Write the Prolog for a Procedure } + +procedure ProcProlog(N: char); +begin + PostLabel(N); + EmitLn('LINK A6,#0'); +end; + + +{--------------------------------------------------------------} +{ Write the Epilog for a Procedure } + +procedure ProcEpilog; +begin + EmitLn('UNLK A6'); + EmitLn('RTS'); +end; +{--------------------------------------------------------------} + + +Procedure DoProc now just calls these: + + +{--------------------------------------------------------------} +{ Parse and Translate a Procedure Declaration } + +procedure DoProc; +var N: char; +begin + Match('p'); + N := GetName; + FormalList; + Fin; + if InTable(N) then Duplicate(N); + ST[N] := 'p'; + ProcProlog(N); + BeginBlock; + ProcEpilog; + ClearParams; +end; +{--------------------------------------------------------------} + + +Finally, we need to change the references to SP in procedures +LoadParam and StoreParam: + + +{--------------------------------------------------------------} +{ Load a Parameter to the Primary Register } + +procedure LoadParam(N: integer); +var Offset: integer; +begin + Offset := 8 + 2 * (NumParams - N); + Emit('MOVE '); + WriteLn(Offset, '(A6),D0'); +end; + + +{--------------------------------------------------------------} +{ Store a Parameter from the Primary Register } + +procedure StoreParam(N: integer); +var Offset: integer; +begin + Offset := 8 + 2 * (NumParams - N); + Emit('MOVE D0,'); + WriteLn(Offset, '(A6)'); +end; +{--------------------------------------------------------------} + + +(Note that the Offset computation changes to allow for the extra +push of A6.) + +That's all it takes. Try this out and see how you like it. + +At this point we are generating some relatively nice code for +procedures and procedure calls. 
Within the limitation that there +are no local variables (yet) and that no procedure nesting is +allowed, this code is just what we need. + +There is still just one small problem remaining: + + + WE HAVE NO WAY TO RETURN RESULTS TO THE CALLER! + + +But that, of course, is not a limitation of the code we're +generating, but one inherent in the call-by-value protocol. +Notice that we CAN use formal parameters in any way inside the +procedure. We can calculate new values for them, use them as +loop counters (if we had loops, that is!), etc. So the code is +doing what it's supposed to. To get over this last problem, we +need to look at the alternative protocol. + + +CALL-BY-REFERENCE + +This one is easy, now that we have the mechanisms already in +place. We only have to make a few changes to the code +generation. Instead of pushing a value onto the stack, we must +push an address. As it turns out, the 68000 has an instruction, +PEA, that does just that. + +We'll be making a new version of the test program for this. +Before we do anything else, + +>>>> MAKE A COPY <<<< + +of the program as it now stands, because we'll be needing it +again later. + +Let's begin by looking at the code we'd like to see generated for +the new case. Using the same example as before, we need the call + + + FOO(X, Y) + + +to be translated to: + + + PEA X(PC) ; Push the address of X + PEA Y(PC) ; Push the address of Y + BSR FOO ; Call FOO + + +That's a simple matter of a slight change to Param: + + +{--------------------------------------------------------------} +{ Process an Actual Parameter } + +procedure Param; +begin + EmitLn('PEA ' + GetName + '(PC)'); +end; +{--------------------------------------------------------------} + + +(Note that with pass-by-reference, we can't have expressions in +the calling list, so Param can just read the name directly.) 
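To see the whole calling sequence for the by-reference case in one place, here is a stand-alone sketch. It is not part of the compiler: EmitRefCall is a made-up helper that takes the argument names as a string instead of reading them from the input stream, and EmitLn is a simplified stand-in for the routine of the same name. The stack cleanup assumes that each address pushed by PEA occupies four bytes, as it does on the 68000.

```pascal
program RefCall;
{ Stand-alone sketch, not part of the compiler. EmitRefCall is }
{ a made-up helper; EmitLn is a simplified stand-in.           }

const TAB = #9;

{ Output a string with a leading tab and a new line }
procedure EmitLn(s: string);
begin
   WriteLn(TAB, s);
end;

{ Emit a pass-by-reference call to Name, one address per argument }
procedure EmitRefCall(Name: char; Args: string);
var i: integer;
begin
   for i := 1 to Length(Args) do
      EmitLn('PEA ' + Args[i] + '(PC)');   { push an address }
   EmitLn('BSR ' + Name);
   if Length(Args) > 0 then begin
      Write(TAB, 'ADD #');
      WriteLn(4 * Length(Args), ',SP');    { 4 bytes per address }
   end;
end;

begin
   EmitRefCall('F', 'XY');   { stands for FOO(X, Y) }
end.
```

For the two-argument call this should print two PEA lines, the BSR, and a final ADD #8,SP, since the caller now has eight bytes to remove instead of four.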
+ +At the other end, the references to the formal parameters must be +given one level of indirection: + + + FOO: LINK A6,#0 + MOVE.L 12(A6),A0 ; Fetch the address of A + MOVE (A0),D0 ; Fetch A + MOVE D0,-(SP) ; Push it + MOVE.L 8(A6),A0 ; Fetch the address of B + MOVE (A0),D0 ; Fetch B + ADD (SP)+,D0 ; Add A + MOVE.L 12(A6),A0 ; Fetch the address of A + MOVE D0,(A0) ; Store A + UNLK A6 + RTS + + +All of this can be handled by changes to LoadParam and +StoreParam: + + +{--------------------------------------------------------------} +{ Load a Parameter to the Primary Register } + +procedure LoadParam(N: integer); +var Offset: integer; +begin + Offset := 8 + 4 * (NumParams - N); + Emit('MOVE.L '); + WriteLn(Offset, '(A6),A0'); + EmitLn('MOVE (A0),D0'); +end; + + +{--------------------------------------------------------------} +{ Store a Parameter from the Primary Register } + +procedure StoreParam(N: integer); +var Offset: integer; +begin + Offset := 8 + 4 * (NumParams - N); + Emit('MOVE.L '); + WriteLn(Offset, '(A6),A0'); + EmitLn('MOVE D0,(A0)'); +end; +{--------------------------------------------------------------} + +To get the count right, we must also change one line in +ParamList: + + + ParamList := 4 * N; + + +That should do it. Give it a try and see if it's generating +reasonable-looking code. As you will see, the code is hardly +optimal, since we reload the address register every time a +parameter is needed. But that's consistent with our KISS +approach here, of just being sure to generate code that works. +We'll just make a little note here, that here's yet another +candidate for optimization, and press on. + +Now we've learned to process parameters using pass-by-value and +pass-by-reference. In the real world, of course, we'd like to be +able to deal with BOTH methods. We can't do that yet, though, +because we have not yet had a session on types, and that has to +come first. 
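The only thing that differs between the two versions of LoadParam and StoreParam is the size of each stack slot, and a quick side-by-side comparison makes that plain. The following stand-alone sketch is not part of the compiler; ShowFrame and its Size argument are invented for the illustration. It applies the same formula used above, 8 + Size * (NumParams - N), to a two-parameter procedure such as FOO(A, B).

```pascal
program FrameOffsets;
{ Stand-alone sketch, not part of the compiler. ShowFrame and  }
{ its Size argument are invented for this illustration.        }

{ Print the A6-relative offset of each of Count parameters,    }
{ where every parameter occupies Size bytes on the stack.      }
procedure ShowFrame(Title: string; Count, Size: integer);
var N: integer;
begin
   WriteLn(Title);
   for N := 1 to Count do
      { 8 = saved A6 (4 bytes) + return address (4 bytes) }
      WriteLn('  parameter ', N, ': ', 8 + Size * (Count - N), '(A6)');
end;

begin
   ShowFrame('pass-by-value', 2, 2);       { 10(A6) and 8(A6) }
   ShowFrame('pass-by-reference', 2, 4);   { 12(A6) and 8(A6) }
end.
```

The last parameter always lands at 8(A6); only the spacing of the earlier ones changes, which is why the hand-written listings use 10(A6) for A in one case and 12(A6) in the other.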
+ +If we can only have ONE method, then of course it has to be the +good ol' FORTRAN method of pass-by-reference, since that's the +only way procedures can ever return values to their caller. + +This, in fact, will be one of the differences between TINY and +KISS. In the next version of TINY, we'll use pass-by-reference +for all parameters. KISS will support both methods. + + +LOCAL VARIABLES + +So far, we've said nothing about local variables, and our +definition of procedures doesn't allow for them. Needless to +say, that's a big gap in our language, and one that needs to be +corrected. + +Here again we are faced with a choice: Static or dynamic storage? + +In those old FORTRAN programs, local variables were given static +storage just like global ones. That is, each local variable got +a name and allocated address, like any other variable, and was +referenced by that name. + +That's easy for us to do, using the allocation mechanisms already +in place. Remember, though, that local variables can have the +same names as global ones. We need to somehow deal with that by +assigning unique names for these variables. + +The characteristic of static storage, of course, is that the data +survives a procedure call and return. When the procedure is +called again, the data will still be there. That can be an +advantage in some applications. In the FORTRAN days we used to +do tricks like initialize a flag, so that you could tell when you +were entering a procedure for the first time and could do any +one-time initialization that needed to be done. + +Of course, the same "feature" is also what makes recursion +impossible with static storage. Any new call to a procedure will +overwrite the data already in the local variables. + +The alternative is dynamic storage, in which storage is allocated +on the stack just as for passed parameters. We also have the +mechanisms already for doing this. 
In fact, the same routines +that deal with passed (by value) parameters on the stack can +easily deal with local variables as well ... the code to be +generated is the same. The offset in the 68000 +LINK instruction is there for just that reason: we can use it to +adjust the stack pointer to make room for locals. Dynamic +storage, of course, inherently supports recursion. + +When I first began planning TINY, I must admit to being +prejudiced in favor of static storage. That's simply because +those old FORTRAN programs were pretty darned efficient ... the +early FORTRAN compilers produced a quality of code that's still +rarely matched by modern compilers. Even today, a given program +written in FORTRAN is likely to outperform the same program +written in C or Pascal, sometimes by wide margins. (Whew! Am I +going to hear about THAT statement!) + +I've always supposed that the reason had to do with the two main +differences between FORTRAN implementations and the others: +static storage and pass-by-reference. I know that dynamic +storage supports recursion, but it's always seemed to me a bit +peculiar to be willing to accept slower code in the 95% of cases +that don't need recursion, just to get that feature when you need +it. The idea is that, with static storage, you can use absolute +addressing rather than indirect addressing, which should result +in faster code. + +More recently, though, several folks have pointed out to me that +there really is no performance penalty associated with dynamic +storage. With the 68000, for example, you shouldn't use absolute +addressing anyway ... most operating systems require position +independent code. And the 68000 instruction + + MOVE 8(A6),D0 + +has exactly the same timing as + + MOVE X(PC),D0. + +So I'm convinced, now, that there is no good reason NOT to use +dynamic storage. 
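The recursion argument above is easy to see in a small experiment. The program below is ordinary Pascal, not TINY or KISS, and it is only a sketch: the global variable Saved plays the part of a statically allocated local, while Mine is a true stack-allocated local. Both functions are meant to add up the numbers from 1 to n.

```pascal
program StaticVsDynamic;
{ Stand-alone sketch in ordinary Pascal, not TINY or KISS.     }
{ The global Saved stands in for a statically allocated local. }

var Saved: integer;

{ Sum 1..n, keeping n in "static" storage }
function SumStatic(n: integer): integer;
var Rest: integer;
begin
   Saved := n;
   if n = 0 then
      SumStatic := 0
   else begin
      Rest := SumStatic(n - 1);   { inner calls overwrite Saved }
      SumStatic := Rest + Saved;
   end;
end;

{ Same thing, keeping n in a true (stack) local }
function SumDynamic(n: integer): integer;
var Mine, Rest: integer;
begin
   Mine := n;
   if n = 0 then
      SumDynamic := 0
   else begin
      Rest := SumDynamic(n - 1);  { each call has its own Mine }
      SumDynamic := Rest + Mine;
   end;
end;

begin
   WriteLn('static:  ', SumStatic(3));    { prints 0 }
   WriteLn('dynamic: ', SumDynamic(3));   { prints 6 }
end.
```

By the time each outer call of SumStatic gets around to using Saved, the innermost call has already set it to zero, so the static version returns 0 where the dynamic one returns 6.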
+ +Since this use of local variables fits so well into the scheme of +pass-by-value parameters, we'll use that version of the +translator to illustrate it. (I _SURE_ hope you kept a copy!) + +The general idea is to keep track of how many local variables +there are. Then we use the integer in the LINK instruction to +adjust the stack pointer downward to make room for them. Formal +parameters are addressed as positive offsets from the frame +pointer, and locals as negative offsets. With a little bit of +work, the same procedures we've already created can take care of +the whole thing. + +Let's start by creating a new variable, Base: + + + var Base: integer; + +We'll use this variable, instead of NumParams, to compute stack +offsets. That means changing the two references to NumParams in +LoadParam and StoreParam: + + +{--------------------------------------------------------------} +{ Load a Parameter to the Primary Register } + +procedure LoadParam(N: integer); +var Offset: integer; +begin + Offset := 8 + 2 * (Base - N); + Emit('MOVE '); + WriteLn(Offset, '(A6),D0'); +end; + + +{--------------------------------------------------------------} +{ Store a Parameter from the Primary Register } + +procedure StoreParam(N: integer); +var Offset: integer; +begin + Offset := 8 + 2 * (Base - N); + Emit('MOVE D0,'); + WriteLn(Offset, '(A6)'); +end; +{--------------------------------------------------------------} + + +The idea is that the value of Base will be frozen after we have +processed the formal parameters, and won't increase further as +the new local variables are inserted in the symbol table. 
This +is taken care of at the end of FormalList: + + +{--------------------------------------------------------------} +{ Process the Formal Parameter List of a Procedure } + +procedure FormalList; +begin + Match('('); + if Look <> ')' then begin + FormalParam; + while Look = ',' do begin + Match(','); + FormalParam; + end; + end; + Match(')'); + Fin; + Base := NumParams; + NumParams := NumParams + 4; +end; +{--------------------------------------------------------------} + + +(We add four words to make allowances for the return address and +old frame pointer, which end up between the formal parameters and +the locals.) + +About all we need to do next is to install the semantics for +declaring local variables into the parser. The routines are very +similar to Decl and TopDecls: + + +{--------------------------------------------------------------} +{ Parse and Translate a Local Data Declaration } + +procedure LocDecl; +var Name: char; +begin + Match('v'); + AddParam(GetName); + Fin; +end; + + +{--------------------------------------------------------------} + + +{ Parse and Translate Local Declarations } + +function LocDecls: integer; +var n: integer; +begin + n := 0; + while Look = 'v' do begin + LocDecl; + inc(n); + end; + LocDecls := n; +end; +{--------------------------------------------------------------} + + +Note that LocDecls is a FUNCTION, returning the number of locals +to DoProc. + +Next, we modify DoProc to use this information: + + +{--------------------------------------------------------------} +{ Parse and Translate a Procedure Declaration } + +procedure DoProc; +var N: char; + k: integer; +begin + Match('p'); + N := GetName; + if InTable(N) then Duplicate(N); + ST[N] := 'p'; + FormalList; + k := LocDecls; + ProcProlog(N, k); + BeginBlock; + ProcEpilog; + ClearParams; +end; +{--------------------------------------------------------------} + + +(I've made a couple of changes here that weren't really +necessary. 
Aside from rearranging things a bit, I moved the call +to Fin to within FormalList, and placed one inside LocDecl as +well. Don't forget to put one at the end of FormalList, so that +we're together here.) + +Note the change in the call to ProcProlog. The new argument is +the number of WORDS (not bytes) to allocate space for. Here's +the new version of ProcProlog: + + +{--------------------------------------------------------------} +{ Write the Prolog for a Procedure } + +procedure ProcProlog(N: char; k: integer); +begin + PostLabel(N); + Emit('LINK A6,#'); + WriteLn(-2 * k); +end; +{--------------------------------------------------------------} + + +That should do it. Add these changes and see how they work. + + +CONCLUSION + +At this point you know how to compile procedure declarations and +procedure calls, with parameters passed by reference and by +value. You can also handle local variables. As you can see, the +hard part is not in providing the mechanisms, but in deciding +just which mechanisms to use. Once we make these decisions, the +code to translate the constructs is really not that difficult. +I didn't show you how to deal with the combination of local +variables and pass-by-reference parameters, but that's a +straightforward extension to what you've already seen. It just +gets a little more messy, that's all, since we need to support +both mechanisms instead of just one at a time. I'd prefer to +save that one until after we've dealt with ways to handle +different variable types. + +That will be the next installment, which will be coming soon to a +Forum near you. See you then. + + +***************************************************************** +* * +* COPYRIGHT NOTICE * +* * +* Copyright (C) 1989 Jack W. Crenshaw. All rights reserved. 
* +* * +***************************************************************** + diff --git a/Compiler construction/Lets_build_compiler/tutor14.txt b/Compiler construction/Lets_build_compiler/tutor14.txt new file mode 100644 index 0000000..3a58098 --- /dev/null +++ b/Compiler construction/Lets_build_compiler/tutor14.txt @@ -0,0 +1,2265 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + LET'S BUILD A COMPILER! + + By + + Jack W. Crenshaw, Ph.D. + + 26 May 1990 + + + Part XIV: TYPES + + +***************************************************************** +* * +* COPYRIGHT NOTICE * +* * +* Copyright (C) 1989 Jack W. Crenshaw. All rights reserved. * +* * +***************************************************************** + + +INTRODUCTION + +In the last installment (Part XIII: PROCEDURES) I mentioned that +in that part and this one, we would cover the two features that +tend to separate the toy language from a real, usable one. We +covered procedure calls in that installment. Many of you have +been waiting patiently, since August '89, for me to drop the +other shoe. Well, here it is. + +In this installment, we'll talk about how to deal with different +data types. As I did in the last segment, I will NOT incorporate +these features directly into the TINY compiler at this time. +Instead, I'll be using the same approach that has worked so well +for us in the past: using only fragments of the parser and +single-character tokens. As usual, this allows us to get +directly to the heart of the matter without having to wade +through a lot of unnecessary code. Since the major problems in +dealing with multiple types occur in the arithmetic operations, +that's where we'll concentrate our focus. + +A few words of warning: First, there are some types that I will +NOT be covering in this installment. Here we will ONLY be +talking about the simple, predefined types. 
We won't even deal +with arrays, pointers or strings in this installment; I'll be +covering them in the next few. + +Second, we also will not discuss user-defined types. That will +not come until much later, for the simple reason that I still +haven't convinced myself that user-defined types belong in a +language named KISS. In later installments, I do intend to cover +at least the general concepts of user-defined types, records, +etc., just so that the series will be complete. But whether or +not they will be included as part of KISS is still an open issue. +I am open to comments or suggestions on this question. + +Finally, I should warn you: what we are about to do CAN add +considerable extra complication to both the parser and the +generated code. Handling variables of different types is +straightforward enough. The complexity comes in when you add +rules about conversion between types. In general, you can make +the compiler as simple or as complex as you choose to make it, +depending upon the way you define the type-conversion rules. +Even if you decide not to allow ANY type conversions (as in Ada, +for example) the problem is still there, and is built into the +mathematics. When you multiply two short numbers, for example, +you can get a long result. + +I've approached this problem very carefully, in an attempt to +Keep It Simple. But we can't avoid the complexity entirely. As +has so often happened, we end up having to trade code quality +against complexity, and as usual I will tend to opt for the +simplest approach. + + +WHAT'S COMING NEXT? + +Before diving into the tutorial, I think you'd like to know where +we are going from here ... especially since it's been so long +since the last installment. + +I have not been idle in the meantime. What I've been doing is +reorganizing the compiler itself into Turbo Units. 
One of the +problems I've encountered is that as we've covered new areas and +thereby added features to the TINY compiler, it's been getting +longer and longer. I realized a couple of installments back that +this was causing trouble, and that's why I've gone back to using +only compiler fragments for the last installment and this one. +The problem is that it just seems dumb to have to reproduce the +code for, say, processing boolean exclusive OR's, when the +subject of the discussion is parameter passing. + +The obvious way to have our cake and eat it, too, is to break up +the compiler into separately compilable modules, and of course +the Turbo Unit is an ideal vehicle for doing this. This allows +us to hide some fairly complex code (such as the full arithmetic +and boolean expression parsing) into a single unit, and just pull +it in whenever it's needed. In that way, the only code I'll have +to reproduce in these installments will be the code that actually +relates to the issue under discussion. + +I've also been toying with Turbo 5.5, which of course includes +the Borland object-oriented extensions to Pascal. I haven't +decided whether to make use of these features, for two reasons. +First of all, many of you who have been following this series may +still not have 5.5, and I certainly don't want to force anyone to +have to go out and buy a new compiler just to complete the +series. Secondly, I'm not convinced that the O-O extensions have +all that much value for this application. We've been having some +discussions about that in CompuServe's CLM forum, and so far +we've not found any compelling reason to use O-O constructs. +This is another of those areas where I could use some feedback +from you readers. Anyone want to vote for Turbo 5.5 and O-O? + +In any case, after the next few installments in the series, the +plan is to upload to you a complete set of Units, and complete +functioning compilers as well. 
The plan, in fact, is to have +THREE compilers: One for a single-character version of TINY (to +use for our experiments), one for TINY and one for KISS. I've +pretty much isolated the differences between TINY and KISS, which +are these: + + o TINY will support only two data types: The character and the + 16-bit integer. I may also try to do something with + strings, since without them a compiler would be pretty + useless. KISS will support all the usual simple types, + including arrays and even floating point. + + o TINY will only have two control constructs, the IF and the + WHILE. KISS will support a very rich set of constructs, + including one we haven't discussed here before ... the CASE. + + o KISS will support separately compilable modules. + +One caveat: Since I still don't know much about 80x86 assembler +language, all these compiler modules will still be written to +support 68000 code. However, for the programs I plan to upload, +all the code generation has been carefully encapsulated into a +single unit, so that any enterprising student should be able to +easily retarget to any other processor. This task is "left as an +exercise for the student." I'll make an offer right here and +now: For the person who provides us the first robust retarget to +80x86, I will be happy to discuss shared copyrights and royalties +from the book that's upcoming. + +But enough talk. Let's get on with the study of types. As I +said earlier, we'll do this one as we did in the last +installment: by performing experiments using single-character +tokens. + + +THE SYMBOL TABLE + +It should be apparent that, if we're going to deal with variables +of different types, we're going to need someplace to record what +those types are. The obvious vehicle for that is the symbol +table, and we've already used it that way to distinguish, for +example, between local and global variables, and between +variables and procedures. 
+ +The symbol table structure for single-character tokens is +particularly simple, and we've used it several times before. To +deal with it, we'll steal some procedures that we've used before. + +First, we need to declare the symbol table itself: + + +{--------------------------------------------------------------} +{ Variable Declarations } + +var Look: char; { Lookahead Character } + + ST: Array['A'..'Z'] of char; { *** ADD THIS LINE ***} +{--------------------------------------------------------------} + + +Next, we need to make sure it's initialized as part of procedure +Init: + + +{--------------------------------------------------------------} +{ Initialize } + +procedure Init; +var i: char; +begin + for i := 'A' to 'Z' do + ST[i] := '?'; + GetChar; +end; +{--------------------------------------------------------------} + + +We don't really need the next procedure, but it will be helpful +for debugging. All it does is to dump the contents of the symbol +table: + + +{--------------------------------------------------------------} +{ Dump the Symbol Table } + +procedure DumpTable; +var i: char; +begin + for i := 'A' to 'Z' do + WriteLn(i, ' ', ST[i]); +end; +{--------------------------------------------------------------} + + +It really doesn't matter much where you put this procedure ... I +plan to cluster all the symbol table routines together, so I put +mine just after the error reporting procedures. + +If you're the cautious type (as I am), you might want to begin +with a test program that does nothing but initialize, then dump +the table. Just to be sure that we're all on the same wavelength +here, I'm reproducing the entire program below, complete with the +new procedures.
Note that this version includes support for +white space: + + +{--------------------------------------------------------------} +program Types; + +{--------------------------------------------------------------} +{ Constant Declarations } + +const TAB = ^I; + CR = ^M; + LF = ^J; + +{--------------------------------------------------------------} +{ Variable Declarations } + +var Look: char; { Lookahead Character } + + ST: Array['A'..'Z'] of char; + + +{--------------------------------------------------------------} +{ Read New Character From Input Stream } + +procedure GetChar; +begin + Read(Look); +end; + + +{--------------------------------------------------------------} +{ Report an Error } + +procedure Error(s: string); +begin + WriteLn; + WriteLn(^G, 'Error: ', s, '.'); +end; + + +{--------------------------------------------------------------} +{ Report Error and Halt } + +procedure Abort(s: string); +begin + Error(s); + Halt; +end; + + +{--------------------------------------------------------------} +{ Report What Was Expected } + +procedure Expected(s: string); +begin + Abort(s + ' Expected'); +end; + + +{--------------------------------------------------------------} +{ Dump the Symbol Table } + +procedure DumpTable; +var i: char; +begin + for i := 'A' to 'Z' do + WriteLn(i, ' ', ST[i]); +end; + + +{--------------------------------------------------------------} +{ Recognize an Alpha Character } + +function IsAlpha(c: char): boolean; +begin + IsAlpha := UpCase(c) in ['A'..'Z']; +end; + + +{--------------------------------------------------------------} +{ Recognize a Decimal Digit } + +function IsDigit(c: char): boolean; +begin + IsDigit := c in ['0'..'9']; +end; + + +{--------------------------------------------------------------} +{ Recognize an AlphaNumeric Character } + +function IsAlNum(c: char): boolean; +begin + IsAlNum := IsAlpha(c) or IsDigit(c); +end; + + +{--------------------------------------------------------------} +{ Recognize an Addop } + 
+function IsAddop(c: char): boolean; +begin + IsAddop := c in ['+', '-']; +end; + + +{--------------------------------------------------------------} +{ Recognize a Mulop } + +function IsMulop(c: char): boolean; +begin + IsMulop := c in ['*', '/']; +end; + + +{--------------------------------------------------------------} +{ Recognize a Boolean Orop } + +function IsOrop(c: char): boolean; +begin + IsOrop := c in ['|', '~']; +end; + + +{--------------------------------------------------------------} +{ Recognize a Relop } + +function IsRelop(c: char): boolean; +begin + IsRelop := c in ['=', '#', '<', '>']; +end; + + +{--------------------------------------------------------------} +{ Recognize White Space } + +function IsWhite(c: char): boolean; +begin + IsWhite := c in [' ', TAB]; +end; + + +{--------------------------------------------------------------} +{ Skip Over Leading White Space } + +procedure SkipWhite; +begin + while IsWhite(Look) do + GetChar; +end; + + +{--------------------------------------------------------------} +{ Skip Over an End-of-Line } + +procedure Fin; +begin + if Look = CR then begin + GetChar; + if Look = LF then + GetChar; + end; +end; + + +{--------------------------------------------------------------} +{ Match a Specific Input Character } + +procedure Match(x: char); +begin + if Look = x then GetChar + else Expected('''' + x + ''''); + SkipWhite; +end; + + +{--------------------------------------------------------------} +{ Get an Identifier } + +function GetName: char; +begin + if not IsAlpha(Look) then Expected('Name'); + GetName := UpCase(Look); + GetChar; + SkipWhite; +end; + + +{--------------------------------------------------------------} +{ Get a Number } + +function GetNum: char; +begin + if not IsDigit(Look) then Expected('Integer'); + GetNum := Look; + GetChar; + SkipWhite; +end; + + +{--------------------------------------------------------------} +{ Output a String with Tab } + +procedure Emit(s: string); +begin + 
Write(TAB, s); +end; + + +{--------------------------------------------------------------} +{ Output a String with Tab and CRLF } + +procedure EmitLn(s: string); +begin + Emit(s); + WriteLn; +end; + + +{--------------------------------------------------------------} +{ Initialize } + +procedure Init; +var i: char; +begin + for i := 'A' to 'Z' do + ST[i] := '?'; + GetChar; + SkipWhite; +end; + + +{--------------------------------------------------------------} +{ Main Program } + +begin + Init; + DumpTable; +end. +{--------------------------------------------------------------} + + +OK, run this program. You should get a (very fast) printout of +all the letters of the alphabet (potential identifiers), each +followed by a question mark. Not very exciting, but it's a +start. + +Of course, in general we only want to see the types of the +variables that have been defined. We can eliminate the others by +modifying DumpTable with an IF test. Change the loop to read: + + + for i := 'A' to 'Z' do + if ST[i] <> '?' then + WriteLn(i, ' ', ST[i]); + + +Now, run the program again. What did you get? + +Well, that's even more boring than before! There was no output +at all, since at this point NONE of the names have been declared. +We can spice things up a bit by inserting some statements +declaring some entries in the main program. Try these: + + + ST['A'] := 'a'; + ST['P'] := 'b'; + ST['X'] := 'c'; + + +This time, when you run the program, you should get an output +showing that the symbol table is working right. + + +ADDING ENTRIES + +Of course, writing to the table directly is pretty poor practice, +and not one that will help us much later. What we need is a +procedure to add entries to the table. At the same time, we know +that we're going to need to test the table, to make sure that we +aren't redeclaring a variable that's already in use (easy to do +with only 26 choices!). 
To handle all this, enter the following +new procedures: + + +{--------------------------------------------------------------} +{ Report Type of a Variable } + + +function TypeOf(N: char): char; +begin + TypeOf := ST[N]; +end; + + +{--------------------------------------------------------------} +{ Report if a Variable is in the Table } + + +function InTable(N: char): boolean; +begin + InTable := TypeOf(N) <> '?'; +end; + + +{--------------------------------------------------------------} +{ Check for a Duplicate Variable Name } + +procedure CheckDup(N: char); +begin + if InTable(N) then Abort('Duplicate Name ' + N); +end; + + +{--------------------------------------------------------------} +{ Add Entry to Table } + +procedure AddEntry(N, T: char); +begin + CheckDup(N); + ST[N] := T; +end; +{--------------------------------------------------------------} + + +Now change the three lines in the main program to read: + + + AddEntry('A', 'a'); + AddEntry('P', 'b'); + AddEntry('X', 'c'); + + +and run the program again. Did it work? Then we have the symbol +table routines needed to support our work on types. In the next +section, we'll actually begin to use them. + + +ALLOCATING STORAGE + +In other programs like this one, including the TINY compiler +itself, we have already addressed the issue of declaring global +variables, and the code generated for them. Let's build a +vestigial version of a "compiler" here, whose only function is to +allow us to declare variables. Remember, the syntax for a +declaration is: + + + <data decl> ::= VAR <identifier> + + +Again, we can lift a lot of the code from previous programs. The +following are stripped-down versions of those procedures. They +are greatly simplified since I have eliminated niceties like +variable lists and initializers.
In procedure Alloc, note that +the new call to AddEntry will also take care of checking for +duplicate declarations: + + +{--------------------------------------------------------------} +{ Allocate Storage for a Variable } + +procedure Alloc(N: char); +begin + AddEntry(N, 'v'); + WriteLn(N, ':', TAB, 'DC 0'); +end; + + +{--------------------------------------------------------------} +{ Parse and Translate a Data Declaration } + +procedure Decl; +var Name: char; +begin + Match('v'); + Alloc(GetName); +end; + + +{--------------------------------------------------------------} +{ Parse and Translate Global Declarations } + +procedure TopDecls; +begin + while Look <> '.' do begin + case Look of + 'v': Decl; + else Abort('Unrecognized Keyword ' + Look); + end; + Fin; + end; +end; +{--------------------------------------------------------------} + + +Now, in the main program, add a call to TopDecls and run the +program. Try allocating a few variables, and note the resulting +code generated. This is old stuff for you, so the results should +look familiar. Note from the code for TopDecls that the program +is ended by a terminating period. + +While you're at it, try declaring two variables with the same +name, and verify that the parser catches the error. + + +DECLARING TYPES + + +Allocating storage of different sizes is as easy as modifying +procedure TopDecls to recognize more than one keyword. There are +a number of decisions to be made here, in terms of what the +syntax should be, etc., but for now I'm going to duck all the +issues and simply declare by executive fiat that our syntax will +be: + + + <data decl> ::= <typename> <identifier> + +where: + + + <typename> ::= BYTE | WORD | LONG + + +(By an amazing coincidence, the first letters of these names +happen to be the same as the 68000 assembly code length +specifications, so this choice saves us a little work.) + +We can create the code to take care of these declarations with +only slight modifications.
In the routines below, note that I've +separated the code generation parts of Alloc from the logic +parts. This is in keeping with our desire to encapsulate the +machine-dependent part of the compiler. + + +{--------------------------------------------------------------} +{ Generate Code for Allocation of a Variable } + +procedure AllocVar(N, T: char); +begin + WriteLn(N, ':', TAB, 'DC.', T, ' 0'); +end; + + +{--------------------------------------------------------------} +{ Allocate Storage for a Variable } + +procedure Alloc(N, T: char); +begin + AddEntry(N, T); + AllocVar(N, T); +end; + + +{--------------------------------------------------------------} +{ Parse and Translate a Data Declaration } + +procedure Decl; +var Typ: char; +begin + Typ := GetName; + Alloc(GetName, Typ); +end; + + +{--------------------------------------------------------------} +{ Parse and Translate Global Declarations } + +procedure TopDecls; +begin + while Look <> '.' do begin + case Look of + 'b', 'w', 'l': Decl; + else Abort('Unrecognized Keyword ' + Look); + end; + Fin; + end; +end; +{--------------------------------------------------------------} + + +Make the changes shown to these procedures, and give the thing a +try. Use the single characters 'b', 'w', and 'l' for the +keywords (they must be lower case, for now). You will see that +in each case, we are allocating the proper storage size. Note +from the dumped symbol table that the sizes are also recorded for +later use. What later use? Well, that's the subject of the rest +of this installment. + + +ASSIGNMENTS + +Now that we can declare variables of different sizes, it stands +to reason that we ought to be able to do something with them. +For our first trick, let's just try loading them into our working +register, D0. It makes sense to use the same idea we used for +Alloc; that is, make a load procedure that can load more than one +size. We also want to continue to encapsulate the machine- +dependent stuff. 
The load procedure looks like this: + + +{---------------------------------------------------------------} +{ Load a Variable to Primary Register } + +procedure LoadVar(Name, Typ: char); +begin + Move(Typ, Name + '(PC)', 'D0'); +end; +{---------------------------------------------------------------} + + +On the 68000, at least, it happens that many instructions turn +out to be MOVE's. It turns out to be useful to create a separate +code generator just for these instructions, and then call it as +needed: + + +{---------------------------------------------------------------} +{ Generate a Move Instruction } + +procedure Move(Size: char; Source, Dest: String); +begin + EmitLn('MOVE.' + Size + ' ' + Source + ',' + Dest); +end; +{---------------------------------------------------------------} + + +Note that these two routines are strictly code generators; they +have no error-checking or other logic. To complete the picture, +we need one more layer of software that provides these functions. + +First of all, we need to make sure that the type we are dealing +with is a loadable type. 
This sounds like a job for another +recognizer: + + +{--------------------------------------------------------------} +{ Recognize a Legal Variable Type } + +function IsVarType(c: char): boolean; +begin + IsVarType := c in ['B', 'W', 'L']; +end; +{--------------------------------------------------------------} + + +Next, it would be nice to have a routine that will fetch the type +of a variable from the symbol table, while checking it to make +sure it's valid: + + +{--------------------------------------------------------------} +{ Get a Variable Type from the Symbol Table } + +function VarType(Name: char): char; +var Typ: char; +begin + Typ := TypeOf(Name); + if not IsVarType(Typ) then Abort('Identifier ' + Name + + ' is not a variable'); + VarType := Typ; +end; +{--------------------------------------------------------------} + + +Armed with these tools, a procedure to cause a variable to be +loaded becomes trivial: + + +{--------------------------------------------------------------} +{ Load a Variable to the Primary Register } + +procedure Load(Name: char); +begin + LoadVar(Name, VarType(Name)); +end; +{--------------------------------------------------------------} + + +(NOTE to the concerned: I know, I know, all this is very +inefficient. In a production program, we probably would take +steps to avoid such deep nesting of procedure calls. Don't worry +about it. This is an EXERCISE, remember? It's more important to +get it right and understand it, than it is to make it get the +wrong answer, quickly. If you get your compiler completed and +find that you're unhappy with the speed, feel free to come back +and hack the code to speed it up!) + +It would be a good idea to test the program at this point. Since +we don't have a procedure for dealing with assignments yet, I +just added the lines: + + + Load('A'); + Load('B'); + Load('C'); + Load('X'); + + +to the main program.
Thus, after the declaration section is +complete, they will be executed to generate code for the loads. +You can play around with this, and try different combinations of +declarations to see how the errors are handled. + +I'm sure you won't be surprised to learn that storing variables +is a lot like loading them. The necessary procedures are shown +next: + + +{---------------------------------------------------------------} +{ Store Primary to Variable } + +procedure StoreVar(Name, Typ: char); +begin + EmitLn('LEA ' + Name + '(PC),A0'); + Move(Typ, 'D0', '(A0)'); +end; + + +{--------------------------------------------------------------} +{ Store a Variable from the Primary Register } + +procedure Store(Name: char); +begin + StoreVar(Name, VarType(Name)); +end; +{--------------------------------------------------------------} + + +You can test this one the same way as the loads. + +Now, of course, it's a RATHER small step to use these to handle +assignment statements. What we'll do is to create a special +version of procedure Block that supports only assignment +statements, and also a special version of Expression that only +supports single variables as legal expressions. Here they are: + + +{---------------------------------------------------------------} +{ Parse and Translate an Expression } + +procedure Expression; +var Name: char; +begin + Load(GetName); +end; + + +{--------------------------------------------------------------} +{ Parse and Translate an Assignment Statement } + +procedure Assignment; +var Name: char; +begin + Name := GetName; + Match('='); + Expression; + Store(Name); +end; + + +{--------------------------------------------------------------} +{ Parse and Translate a Block of Statements } + +procedure Block; +begin + while Look <> '.' 
do begin + Assignment; + Fin; + end; +end; +{--------------------------------------------------------------} + + +(It's worth noting that the new procedures that +permit us to manipulate types are, if anything, even simpler and +cleaner than what we've seen before. This is mostly thanks to +our efforts to encapsulate the code generator procedures.) + +There is one small, nagging problem. Before, we used the Pascal +terminating period to get us out of procedure TopDecls. This is +now the wrong character ... it's used to terminate Block. In +previous programs, we've used the BEGIN symbol (abbreviated 'b') +to get us out. But that is now used as a type symbol. + +The solution, while somewhat of a kludge, is easy enough. We'll +use an UPPER CASE 'B' to stand for the BEGIN. So change the +character in the WHILE loop within TopDecls, from '.' to 'B', and +everything will be fine. + +Now, we can complete the task by changing the main program to +read: + + +{--------------------------------------------------------------} +{ Main Program } + +begin + Init; + TopDecls; + Match('B'); + Fin; + Block; + DumpTable; +end. +{--------------------------------------------------------------} + + +(Note that I've had to sprinkle a few calls to Fin around to get +us out of Newline troubles.) + +OK, run this program. Try the input: + + + ba { byte a } *** DON'T TYPE THE COMMENTS!!! *** + wb { word b } + lc { long c } + B { begin } + a=a + a=b + a=c + b=a + b=b + b=c + c=a + c=b + c=c + . + + +For each declaration, you should get code generated that +allocates storage. For each assignment, you should get code that +loads a variable of the correct size, and stores one, also of the +correct size. + +There's only one small little problem: The generated code is +WRONG! + +Look at the code for a=c above. The code is: + + + MOVE.L C(PC),D0 + LEA A(PC),A0 + MOVE.B D0,(A0) + + +This code is correct.
It will cause the lower eight bits of C to +be stored into A, which is a reasonable behavior. It's about all +we can expect to happen. + +But now, look at the opposite case. For c=a, the code generated +is: + + + MOVE.B A(PC),D0 + LEA C(PC),A0 + MOVE.L D0,(A0) + + +This is NOT correct. It will cause the byte variable A to be +stored into the lower eight bits of D0. According to the rules +for the 68000 processor, the upper 24 bits are unchanged. This +means that when we store the entire 32 bits into C, whatever +garbage was in those high bits will also get stored. Not +good. + +So what we have run into here, early on, is the issue of TYPE +CONVERSION, or COERCION. + +Before we do anything with variables of different types, even if +it's just to copy them, we have to face up to the issue. It is +not the easiest part of a compiler. Most of the bugs I have +seen in production compilers have had to do with errors in type +conversion for some obscure combination of arguments. As usual, +there is a tradeoff between compiler complexity and the potential +quality of the generated code, and as usual, we will take the +path that keeps the compiler simple. I think you'll find that, +with this approach, we can keep the potential complexity in check +rather nicely. + + +THE COWARD'S WAY OUT + +Before we get into the details (and potential complexity) of type +conversion, I'd like you to see that there is one super-simple +way to solve the problem: simply promote every variable to a long +integer when we load it! + +This takes the addition of only one line to LoadVar, although if +we are not going to COMPLETELY ignore efficiency, it should be +guarded by an IF test.
Here is the modified version: + + +{---------------------------------------------------------------} +{ Load a Variable to Primary Register } + +procedure LoadVar(Name, Typ: char); +begin + if Typ <> 'L' then + EmitLn('CLR.L D0'); + Move(Typ, Name + '(PC)', 'D0'); +end; +{---------------------------------------------------------------} + + +(Note that StoreVar needs no similar change.) + +If you run some tests with this new version, you will find that +everything works correctly now, albeit sometimes inefficiently. +For example, consider the case a=b (for the same declarations +shown above). Now the generated code turns out to be: + + + CLR.L D0 + MOVE.W B(PC),D0 + LEA A(PC),A0 + MOVE.B D0,(A0) + + +In this case, the CLR turns out not to be necessary, since the +result is going into a byte-sized variable. With a little bit of +work, we can do better. Still, this is not bad, and it is typical +of the kinds of inefficiencies that we've seen before in simple- +minded compilers. + +I should point out that, by setting the high bits to zero, we are +in effect treating the numbers as UNSIGNED integers. If we want +to treat them as signed ones instead (the more likely case) we +should do a sign extension after the load, instead of a clear +before it. Just to tie this part of the discussion up with a +nice, red ribbon, let's change LoadVar as shown below: + + +{---------------------------------------------------------------} +{ Load a Variable to Primary Register } + +procedure LoadVar(Name, Typ: char); +begin + if Typ = 'B' then + EmitLn('CLR.L D0'); + Move(Typ, Name + '(PC)', 'D0'); + if Typ = 'W' then + EmitLn('EXT.L D0'); +end; +{---------------------------------------------------------------} + + +With this version, a byte is treated as unsigned (as in Pascal +and C), while a word is treated as signed.
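
To see concretely what that last LoadVar produces, here is a minimal, self-contained sketch that pulls it into a throwaway program. The program name LoadDemo, the one-line EmitLn, and the three hard-coded calls are assumptions made purely for illustration; Move and LoadVar are copied from the text. It simply prints the code emitted for a byte, a word, and a long variable:

```pascal
program LoadDemo;

{ Illustration only: a stand-alone harness around the LoadVar shown above. }

const TAB = ^I;

{ Simplified stand-in for the tutorial's Emit/EmitLn pair }
procedure EmitLn(s: string);
begin
   WriteLn(TAB, s);
end;

{ Generate a Move Instruction (as in the text) }
procedure Move(Size: char; Source, Dest: string);
begin
   EmitLn('MOVE.' + Size + ' ' + Source + ',' + Dest);
end;

{ Load a Variable to Primary Register (last version in the text) }
procedure LoadVar(Name, Typ: char);
begin
   if Typ = 'B' then
      EmitLn('CLR.L D0');              { zero the register first: byte is unsigned }
   Move(Typ, Name + '(PC)', 'D0');
   if Typ = 'W' then
      EmitLn('EXT.L D0');              { sign-extend afterwards: word is signed }
end;

begin
   LoadVar('A', 'B');                  { byte a }
   LoadVar('B', 'W');                  { word b }
   LoadVar('C', 'L');                  { long c }
end.

{ Output (each line preceded by a tab):
     CLR.L D0
     MOVE.B A(PC),D0
     MOVE.W B(PC),D0
     EXT.L D0
     MOVE.L C(PC),D0
}
```

Note the asymmetry in the output: the byte gets its CLR before the move, the word gets its EXT after it, and the long needs neither.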
+ + +A MORE REASONABLE SOLUTION + +As we've seen, promoting every variable to long while it's in +memory solves the problem, but it can hardly be called efficient, +and probably wouldn't be acceptable even for those of us who +claim to be unconcerned about efficiency. It will mean that all +arithmetic operations will be done to 32-bit accuracy, which will +DOUBLE the run time for most operations, and make it even worse +for multiplication and division. For those operations, we would +need to call subroutines to do them, even if the data were byte +or word types. The whole thing is sort of a cop-out, too, since +it ducks all the real issues. + +OK, so that solution's no good. Is there still a relatively easy +way to get data conversion? Can we still Keep It Simple? + +Yes, indeed. All we have to do is to make the conversion at the +other end ... that is, we convert on the way _OUT_, when the data +is stored, rather than on the way in. + +But, remember, the storage part of the assignment is pretty much +independent of the data load, which is taken care of by procedure +Expression. In general the expression may be arbitrarily +complex, so how can procedure Assignment know what type of data +is left in register D0? + +Again, the answer is simple: We'll just _ASK_ procedure +Expression! The answer can be returned as a function value. + +All of this requires several procedures to be modified, but the +mods, like the method, are quite simple.
First of all, since we +aren't requiring LoadVar to do all the work of conversion, let's +go back to the simple version: + + +{---------------------------------------------------------------} +{ Load a Variable to Primary Register } + +procedure LoadVar(Name, Typ: char); +begin + Move(Typ, Name + '(PC)', 'D0'); +end; +{--------------------------------------------------------------} + + +Next, let's add a new procedure that will convert from one type +to another: + + +{---------------------------------------------------------------} +{ Convert a Data Item from One Type to Another } + + +procedure Convert(Source, Dest: char); +begin + if Source <> Dest then begin + if Source = 'B' then + EmitLn('AND.W #$FF,D0'); + if Dest = 'L' then + EmitLn('EXT.L D0'); + end; +end; +{--------------------------------------------------------------} + + +Next, we need to do the logic required to load and store a +variable of any type. Here are the routines for that: + + +{---------------------------------------------------------------} +{ Load a Variable to the Primary Register } + +function Load(Name: char): char; +var Typ : char; +begin + Typ := VarType(Name); + LoadVar(Name, Typ); + Load := Typ; +end; + + +{--------------------------------------------------------------} +{ Store a Variable from the Primary Register } + +procedure Store(Name, T1: char); +var T2: char; +begin + T2 := VarType(Name); + Convert(T1, T2); + StoreVar(Name, T2); +end; +{--------------------------------------------------------------} + + +Note that Load is a function, which not only emits the code for a +load, but also returns the variable type. In this way, we always +know what type of data we are dealing with. When we execute a +Store, we pass it the current type of the variable in D0. Since +Store also knows the type of the destination variable, it can +convert as necessary. + +Armed with all these new routines, the implementation of our +rudimentary assignment statement is essentially trivial. 
+Procedure Expression now becomes a function, which returns its +type to procedure Assignment: + + +{---------------------------------------------------------------} +{ Parse and Translate an Expression } + +function Expression: char; +begin + Expression := Load(GetName); +end; + + +{--------------------------------------------------------------} +{ Parse and Translate an Assignment Statement } + +procedure Assignment; +var Name: char; +begin + Name := GetName; + Match('='); + Store(Name, Expression); +end; +{--------------------------------------------------------------} + +Again, note how incredibly simple these two routines are. We've +encapsulated all the type logic into Load and Store, and the +trick of passing the type around makes the rest of the work +extremely easy. Of course, all of this is for our special, +trivial case of Expression. Naturally, for the general case it +will have to get more complex. But you're looking now at the +FINAL version of procedure Assignment! + +All this seems like a very simple and clean solution, and it is +indeed. Compile this program and run the same test cases as +before. You will see that all types of data are converted +properly, and there are few if any wasted instructions. Only the +byte-to-long conversion uses two instructions where one would do, +and we could easily modify Convert to handle this case, too. + +Although we haven't considered unsigned variables in this case, I +think you can see that we could easily fix up procedure Convert +to deal with these types as well. This is "left as an exercise +for the student." + + +LITERAL ARGUMENTS + +Sharp-eyed readers might have noticed, though, that we don't even +have a proper form of a simple factor yet, because we don't allow +for loading literal constants, only variables. Let's fix that +now. + +To begin with, we'll need a GetNum function. We've seen several +versions of this, some returning only a single character, some a +string, and some an integer. 
The one needed here will return a +LongInt, so that it can handle anything we throw at it. Note +that no type information is returned here: GetNum doesn't concern +itself with how the number will be used: + + +{--------------------------------------------------------------} +{ Get a Number } + +function GetNum: LongInt; +var Val: LongInt; +begin + if not IsDigit(Look) then Expected('Integer'); + Val := 0; + while IsDigit(Look) do begin + Val := 10 * Val + Ord(Look) - Ord('0'); + GetChar; + end; + GetNum := Val; + SkipWhite; +end; +{---------------------------------------------------------------} + + +Now, when dealing with literal data, we have one little small +problem. With variables, we know what type things should be +because they've been declared to be that type. We have no such +type information for literals. When the programmer says, "-1," +does that mean a byte, word, or longword version? We have no +clue. The obvious thing to do would be to use the largest type +possible, i.e. a longword. But that's a bad idea, because when +we get to more complex expressions, we'll find that it will cause +every expression involving literals to be promoted to long, as +well. + +A better approach is to select a type based upon the value of the +literal, as shown next: + + +{--------------------------------------------------------------} +{ Load a Constant to the Primary Register } + +function LoadNum(N: LongInt): char; +var Typ : char; +begin + if abs(N) <= 127 then + Typ := 'B' + else if abs(N) <= 32767 then + Typ := 'W' + else Typ := 'L'; + LoadConst(N, Typ); + LoadNum := Typ; +end; +{---------------------------------------------------------------} + + +(I know, I know, the number base isn't really symmetric. You can +store -128 in a single byte, and -32768 in a word. But that's +easily fixed, and not worth the time or the added complexity to +fool with it here. It's the thought that counts.) 
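
For readers who do want the asymmetric ranges handled, a minimal sketch of the fix follows. The function name SizeOfLiteral and the test program around it are hypothetical, not part of the tutorial's code; the idea is only that explicit range tests could replace the abs tests in LoadNum. The negative cases matter only if the number arrives already negated, which the GetNum shown above never does, since it reads digits only:

```pascal
program LiteralSize;

{ Hypothetical helper: returns the smallest size code ('B', 'W' or 'L')
  whose signed two's-complement range can hold N. }
function SizeOfLiteral(N: LongInt): char;
begin
   if (N >= -128) and (N <= 127) then
      SizeOfLiteral := 'B'
   else if (N >= -32768) and (N <= 32767) then
      SizeOfLiteral := 'W'
   else
      SizeOfLiteral := 'L';
end;

begin
   WriteLn(SizeOfLiteral(127));        { B }
   WriteLn(SizeOfLiteral(-128));       { B }
   WriteLn(SizeOfLiteral(128));        { W }
   WriteLn(SizeOfLiteral(-32768));     { W }
   WriteLn(SizeOfLiteral(-32769));     { L }
end.
```

Comparing against the bounds directly, rather than taking abs(N), also sidesteps the one value whose absolute value does not fit in a LongInt.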
+ +Note that LoadNum calls a new version of the code generator +routine LoadConst, which has an added argument to define the +type: + + +{---------------------------------------------------------------} +{ Load a Constant to the Primary Register } + +procedure LoadConst(N: LongInt; Typ: char); +var temp: string; +begin + Str(N, temp); + Move(Typ, '#' + temp, 'D0'); +end; +{--------------------------------------------------------------} + + +Now we can modify procedure Expression to accommodate the two +possible kinds of factors: + + +{---------------------------------------------------------------} +{ Parse and Translate an Expression } + +function Expression: char; +begin + if IsAlpha(Look) then + Expression := Load(GetName) + else + Expression := LoadNum(GetNum); +end; +{--------------------------------------------------------------} + + +(Wow, that sure didn't hurt too bad! Just a few extra lines do +the job.) + +OK, compile this code into your program and give it a try. +You'll see that it now works for either variables or constants as +valid expressions. + + +ADDITIVE EXPRESSIONS + +If you've been following this series from the beginning, I'm sure +you know what's coming next: We'll expand the form for an +expression to handle first additive expressions, then +multiplicative, then general expressions with parentheses. + +The nice part is that we already have a pattern for dealing with +these more complex expressions. All we have to do is to make +sure that all the procedures called by Expression (Term, Factor, +etc.) always return a type identifier. If we do that, the +program structure gets changed hardly at all.
+ +The first step is easy: We can rename our existing function +Expression to Term, as we've done so many times before, and +create the new version of Expression: + + +{---------------------------------------------------------------} +{ Parse and Translate an Expression } + +function Expression: char; +var Typ: char; +begin + if IsAddop(Look) then + Typ := Unop + else + Typ := Term; + while IsAddop(Look) do begin + Push(Typ); + case Look of + '+': Typ := Add(Typ); + '-': Typ := Subtract(Typ); + end; + end; + Expression := Typ; +end; +{--------------------------------------------------------------} + + +Note in this routine how each procedure call has become a +function call, and how the local variable Typ gets updated at +each pass. + +Note also the new call to a function Unop, which lets us deal +with a leading unary minus. This change is not necessary ... we +could still use a form more like what we've done before. I've +chosen to introduce UnOp as a separate routine because it will +make it easier, later, to produce somewhat better code than we've +been doing. In other words, I'm looking ahead to optimization +issues. + +For this version, though, we'll retain the same dumb old code, +which makes the new routine trivial: + + +{---------------------------------------------------------------} +{ Process a Term with Leading Unary Operator } + +function Unop: char; +begin + Clear; + Unop := 'W'; +end; +{---------------------------------------------------------------} + + +Procedure Push is a code-generator routine, and now has a type +argument: + + +{---------------------------------------------------------------} +{ Push Primary onto Stack } + +procedure Push(Size: char); +begin + Move(Size, 'D0', '-(SP)'); +end; +{---------------------------------------------------------------} + + +Now, let's take a look at functions Add and Subtract. In the +older versions of these routines, we let them call code generator +routines PopAdd and PopSub. 
We'll continue to do that, which +makes the functions themselves extremely simple: + + +{---------------------------------------------------------------} +{ Recognize and Translate an Add } + +function Add(T1: char): char; +begin + Match('+'); + Add := PopAdd(T1, Term); +end; + + +{-------------------------------------------------------------} +{ Recognize and Translate a Subtract } + +function Subtract(T1: char): char; +begin + Match('-'); + Subtract := PopSub(T1, Term); +end; +{---------------------------------------------------------------} + + +The simplicity is deceptive, though, because what we've done is +to defer all the logic to PopAdd and PopSub, which are no longer +just code generation routines. They must also now take care of +the type conversions required. + +And just what conversion is that? Simple: Both arguments must be +of the same size, and the result is also of that size. The +smaller of the two arguments must be "promoted" to the size of +the larger one. + +But this presents a bit of a problem. If the argument to be +promoted is the second argument (i.e. in the primary register +D0), we are in great shape. If it's not, however, we're in a +fix: we can't change the size of the information that's already +been pushed onto the stack. + +The solution is simple but a little painful: We must abandon those +lovely "pop the data and do something with it" instructions +thoughtfully provided by Motorola. + +The alternative is to assign a secondary register, which I've +chosen to be D7. (Why not D1? Because I have later plans for +the other registers.) + +The first step in this new structure is to introduce a Pop +procedure analogous to the Push.
This procedure will always Pop +the top element of the stack into D7: + + +{---------------------------------------------------------------} +{ Pop Stack into Secondary Register } + +procedure Pop(Size: char); +begin + Move(Size, '(SP)+', 'D7'); +end; +{---------------------------------------------------------------} + + +The general idea is that all the "Pop-Op" routines can call this +one. When this is done, we will then have both operands in +registers, so we can promote whichever one we need to. To deal +with this, procedure Convert needs another argument, the register +name: + + +{---------------------------------------------------------------} +{ Convert a Data Item from One Type to Another } + +procedure Convert(Source, Dest: char; Reg: String); +begin + if Source <> Dest then begin + if Source = 'B' then + EmitLn('AND.W #$FF,' + Reg); + if Dest = 'L' then + EmitLn('EXT.L ' + Reg); + end; +end; +{---------------------------------------------------------------} + + +The next function does a conversion, but only if the current type +T1 is smaller in size than the desired type T2. 
It is a +function, returning the final type to let us know what it decided +to do: + + +{---------------------------------------------------------------} +{ Promote the Size of a Register Value } + +function Promote(T1, T2: char; Reg: string): char; +var Typ: char; +begin + Typ := T1; + if T1 <> T2 then + if (T1 = 'B') or ((T1 = 'W') and (T2 = 'L')) then begin + Convert(T1, T2, Reg); + Typ := T2; + end; + Promote := Typ; +end; +{---------------------------------------------------------------} + + +Finally, the following function forces the two registers to be of +the same type: + + +{---------------------------------------------------------------} +{ Force both Arguments to Same Type } + +function SameType(T1, T2: char): char; +begin + T1 := Promote(T1, T2, 'D7'); + SameType := Promote(T2, T1, 'D0'); +end; +{---------------------------------------------------------------} + + +These new routines give us the ammunition we need to flesh out +PopAdd and PopSub: + + +{---------------------------------------------------------------} +{ Generate Code to Add Primary to the Stack } + +function PopAdd(T1, T2: char): char; +begin + Pop(T1); + T2 := SameType(T1, T2); + GenAdd(T2); + PopAdd := T2; +end; + + +{---------------------------------------------------------------} +{ Generate Code to Subtract Primary from the Stack } + +function PopSub(T1, T2: char): char; +begin + Pop(T1); + T2 := SameType(T1, T2); + GenSub(T2); + PopSub := T2; +end; +{---------------------------------------------------------------} + + +After all the buildup, the final results are almost +anticlimactic. Once again, you can see that the logic is quite +simple. All the two routines do is to pop the top-of-stack into +D7, force the two operands to be the same size, and then generate +the code. + +Note the new code generator routines GenAdd and GenSub. These +are vestigial forms of the ORIGINAL PopAdd and PopSub. 
That is, +they are pure code generators, producing a register-to-register +add or subtract: + + +{---------------------------------------------------------------} +{ Add Top of Stack to Primary } + +procedure GenAdd(Size: char); +begin + EmitLn('ADD.' + Size + ' D7,D0'); +end; + + +{---------------------------------------------------------------} +{ Subtract Primary from Top of Stack } + +procedure GenSub(Size: char); +begin + EmitLn('SUB.' + Size + ' D7,D0'); + EmitLn('NEG.' + Size + ' D0'); +end; +{---------------------------------------------------------------} + + +OK, I grant you: I've thrown a lot of routines at you since we +last tested the code. But you have to admit that each new +routine is pretty simple and transparent. If you (like me) don't +like to test so many new routines at once, that's OK. You can +stub out routines like Convert, Promote, and SameType, since they +don't read any inputs. You won't get the correct code, of +course, but things should work. Then flesh them out one at a +time. + +When testing the program, don't forget that you first have to +declare some variables, and then start the "body" of the program +with an upper-case 'B' (for BEGIN). You should find that the +parser will handle any additive expressions. Once all the +conversion routines are in, you should see that the correct code +is generated, with type conversions inserted where necessary. +Try mixing up variables of different sizes, and also literals. +Make sure that everything's working properly. As usual, it's a +good idea to try some erroneous expressions and see how the +compiler handles them. + + +WHY SO MANY PROCEDURES? + +At this point, you may think I've pretty much gone off the deep +end in terms of deeply nested procedures. There is admittedly a +lot of overhead here. But there's a method in my madness. As in +the case of UnOp, I'm looking ahead to the time when we're going +to want better code generation. 
The way the code is organized, +we can achieve this without major modifications to the program. +For example, in cases where the value pushed onto the stack does +_NOT_ have to be converted, it's still better to use the "pop and +add" instruction. If we choose to test for such cases, we can +embed the extra tests into PopAdd and PopSub without changing +anything else much. + + +MULTIPLICATIVE EXPRESSIONS + +The procedure for dealing with multiplicative operators is much +the same. In fact, at the first level, they are almost +identical, so I'll just show them here without much fanfare. The +first one is our general form for Factor, which includes +parenthetical subexpressions: + + +{---------------------------------------------------------------} +{ Parse and Translate a Factor } + +function Expression: char; Forward; + +function Factor: char; +begin + if Look = '(' then begin + Match('('); + Factor := Expression; + Match(')'); + end + else if IsAlpha(Look) then + Factor := Load(GetName) + else + Factor := LoadNum(GetNum); +end; + + +{--------------------------------------------------------------} +{ Recognize and Translate a Multiply } + +function Multiply(T1: char): char; +begin + Match('*'); + Multiply := PopMul(T1, Factor); +end; + + +{--------------------------------------------------------------} +{ Recognize and Translate a Divide } + +function Divide(T1: char): char; +begin + Match('/'); + Divide := PopDiv(T1, Factor); +end; + + +{---------------------------------------------------------------} +{ Parse and Translate a Math Term } + +function Term: char; +var Typ: char; +begin + Typ := Factor; + while IsMulop(Look) do begin + Push(Typ); + case Look of + '*': Typ := Multiply(Typ); + '/': Typ := Divide(Typ); + end; + end; + Term := Typ; +end; +{---------------------------------------------------------------} + + +These routines parallel the additive ones almost exactly. As +before, the complexity is encapsulated within PopMul and PopDiv.
+If you'd like to test the program before we get into that, you +can build dummy versions of them, similar to PopAdd and PopSub. +Again, the code won't be correct at this point, but the parser +should handle expressions of arbitrary complexity. + + +MULTIPLICATION + +Once you've convinced yourself that the parser itself is working +properly, we need to figure out what it will take to generate the +right code. This is where things begin to get a little sticky, +because the rules are more complex. + +Let's take the case of multiplication first. This operation is +similar to the "addops" in that both operands should be of the +same size. It differs in three important respects: + + + o The type of the product is typically not the same as that of + the two operands. For the product of two words, we get a + longword result. + + o The 68000 does not support a 32 x 32 multiply, so a call to + a software routine is needed. This routine will become part + of the run-time library. + + o It also does not support an 8 x 8 multiply, so all byte + operands must be promoted to words.
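To see why the product needs more room than its operands, here is a small standalone sketch (the program and function names are mine, purely for illustration). It multiplies the largest positive byte and word values using LongInt arithmetic on the host, so nothing is lost along the way.

```pascal
program ProductWidthDemo;

{ Illustration only: operands are passed as LongInts so that the host
  arithmetic itself cannot overflow for these values. }
function Product(A, B: LongInt): LongInt;
begin
   Product := A * B;
end;

begin
   { Largest positive signed bytes: result is too big for a byte,
     but fits in a word. }
   WriteLn(Product(127, 127));       { 16129 }

   { Largest positive signed words: result needs a longword. }
   WriteLn(Product(32767, 32767));   { 1073676289 }
end.
```

The same reasoning says a long-by-long product can need 64 bits, which is more than this compiler has a type for.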
+ + +The actions that we have to take are best shown in the following +table: +
+  T1 -->  |                 |                 |                 |
+      |   |        B        |        W        |        L        |
+  T2  V   |                 |                 |                 |
+-----------------------------------------------------------------
+      B   | Convert D0 to W | Convert D0 to W | Convert D0 to L |
+          | Convert D7 to W |                 |                 |
+          | MULS            | MULS            | JSR MUL32       |
+          | Result = W      | Result = L      | Result = L      |
+-----------------------------------------------------------------
+      W   | Convert D7 to W |                 | Convert D0 to L |
+          | MULS            | MULS            | JSR MUL32       |
+          | Result = L      | Result = L      | Result = L      |
+-----------------------------------------------------------------
+      L   | Convert D7 to L | Convert D7 to L |                 |
+          | JSR MUL32       | JSR MUL32       | JSR MUL32       |
+          | Result = L      | Result = L      | Result = L      |
+-----------------------------------------------------------------
+ +This table shows the actions to be taken for each combination of +operand types. There are three things to note: First, we assume +a library routine MUL32 which performs a 32 x 32 multiply, +leaving a >> 32-bit << (not 64-bit) product. If there is any +overflow in the process, we choose to ignore it and return only +the lower 32 bits. + +Second, note that the table is symmetric ... the two operands +enter in the same way. Finally, note that the product is ALWAYS +a longword, except when both operands are bytes. (It's worth +noting, in passing, that this means that many expressions will +end up being longwords, whether we like it or not. Perhaps the +idea of just promoting them all up front wasn't all that +outrageous, after all!) + +Now, clearly, we are going to have to generate different code for +the 16-bit and 32-bit multiplies.
This is best done by having +separate code generator routines for the two cases: + + +{---------------------------------------------------------------} +{ Multiply Top of Stack by Primary (Word) } + +procedure GenMult; +begin + EmitLn('MULS D7,D0') +end; + + +{---------------------------------------------------------------} +{ Multiply Top of Stack by Primary (Long) } + +procedure GenLongMult; +begin + EmitLn('JSR MUL32'); +end; +{---------------------------------------------------------------} + + +An examination of the code below for PopMul should convince you +that the conditions in the table are met: + + +{---------------------------------------------------------------} +{ Generate Code to Multiply Primary by Stack } + +function PopMul(T1, T2: char): char; +var T: char; +begin + Pop(T1); + T := SameType(T1, T2); + Convert(T, 'W', 'D7'); + Convert(T, 'W', 'D0'); + if T = 'L' then + GenLongMult + else + GenMult; + if T = 'B' then + PopMul := 'W' + else + PopMul:= 'L'; +end; +{---------------------------------------------------------------} + + +As you can see, the routine starts off just like PopAdd. The two +arguments are forced to the same type. The two calls to Convert +take care of the case where both operands are bytes. The data +themselves are promoted to words, but the routine remembers the +type so as to assign the correct type to the result. Finally, we +call one of the two code generator routines, and then assign the +result type. Not too complicated, really. + +At this point, I suggest that you go ahead and test the program. +Try all combinations of operand sizes. + + +DIVISION + +The case of division is not nearly so symmetric. I also have +some bad news for you: + +All modern 16-bit CPU's support integer divide. The +manufacturer's data sheet will describe this operation as a +32 x 16-bit divide, meaning that you can divide a 32-bit dividend +by a 16-bit divisor. Here's the bad news: + + + THEY'RE LYING TO YOU!!! 
+ + +If you don't believe it, try dividing any large 32-bit number +(meaning that it has non-zero bits in the upper 16 bits) by the +integer 1. You are guaranteed to get an overflow exception. + +The problem is that the instruction really requires that the +resulting quotient fit into a 16-bit result. This won't happen +UNLESS the divisor is sufficiently large. When any number is +divided by unity, the quotient will of course be the same as the +dividend, which had better fit into a 16-bit word. + +Since the beginning of time (well, computers, anyway), CPU +architects have provided this little gotcha in the division +circuitry. It provides a certain amount of symmetry in things, +since it is sort of the inverse of the way a multiply works. But +since unity is a perfectly valid (and rather common) number to +use as a divisor, the division as implemented in hardware needs +some help from us programmers. + +The implications are as follows: + + o The type of the quotient must always be the same as that of + the dividend. It is independent of the divisor. + + o In spite of the fact that the CPU supports a longword + dividend, the hardware-provided instruction can only be + trusted for byte and word dividends. For longword + dividends, we need another library routine that can return a + long result. 
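The quotient-size problem is easy to model on paper. The sketch below is only that, a model: QuotientFits is a hypothetical helper that checks whether the result of a 32-by-16 signed divide would fit in a signed 16-bit word. How a real processor reports the failure (a status flag or a trap) varies and is not modeled here, and the divisor is assumed to be non-zero.

```pascal
program DivOverflowDemo;

{ Hypothetical model of a 32-by-16 signed hardware divide: the
  operation only succeeds if the quotient fits in 16 signed bits.
  Assumes Divisor <> 0. }
function QuotientFits(Dividend, Divisor: LongInt): boolean;
var Q: LongInt;
begin
   Q := Dividend div Divisor;
   QuotientFits := (Q >= -32768) and (Q <= 32767);
end;

begin
   WriteLn(QuotientFits(100000, 1));   { FALSE: quotient is 100000 }
   WriteLn(QuotientFits(100000, 4));   { TRUE:  quotient is 25000  }
end.
```

Dividing by unity is the worst case, but any divisor too small to pull the quotient under 32768 fails the same way.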
+ + + +This looks like a job for another table, to summarize the +required actions: +
+  T1 -->  |                 |                 |                 |
+      |   |        B        |        W        |        L        |
+  T2  V   |                 |                 |                 |
+-----------------------------------------------------------------
+      B   | Convert D0 to W | Convert D0 to W | Convert D0 to L |
+          | Convert D7 to L | Convert D7 to L |                 |
+          | DIVS            | DIVS            | JSR DIV32       |
+          | Result = B      | Result = W      | Result = L      |
+-----------------------------------------------------------------
+      W   | Convert D7 to L | Convert D7 to L | Convert D0 to L |
+          | DIVS            | DIVS            | JSR DIV32       |
+          | Result = B      | Result = W      | Result = L      |
+-----------------------------------------------------------------
+      L   | Convert D7 to L | Convert D7 to L |                 |
+          | JSR DIV32       | JSR DIV32       | JSR DIV32       |
+          | Result = B      | Result = W      | Result = L      |
+-----------------------------------------------------------------
+ + +(You may wonder why it's necessary to do a 32-bit division, when +the dividend is, say, only a byte in the first place. Since the +number of bits in the result can only be as many as that in the +dividend, why bother? The reason is that, if the divisor is a +longword, and there are any high bits set in it, the result of +the division must be zero. We might not get that if we only use +the lower word of the divisor.)
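A quick numeric check of that argument, as a sketch with made-up values (LowWord is a hypothetical helper, and it assumes a non-negative argument): a byte-sized dividend divided by a longword divisor whose high bits are set should give zero, but if only the low word of the divisor is used the answer comes out badly wrong.

```pascal
program LongDivisorDemo;

{ Hypothetical helper: keep only the low 16 bits of a non-negative
  value, which is all a word-sized divide would see. }
function LowWord(N: LongInt): LongInt;
begin
   LowWord := N mod 65536;
end;

var Dividend, Divisor: LongInt;
begin
   Dividend := 100;
   Divisor := 65537;                         { high bits set; low word is 1 }
   WriteLn(Dividend div Divisor);            { 0, the correct quotient }
   WriteLn(Dividend div LowWord(Divisor));   { 100, the wrong answer }
end.
```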
+ +The following code provides the correct function for PopDiv: + + +{---------------------------------------------------------------} +{ Generate Code to Divide Stack by the Primary } + +function PopDiv(T1, T2: char): char; +begin + Pop(T1); + Convert(T1, 'L', 'D7'); + if (T1 = 'L') or (T2 = 'L') then begin + Convert(T2, 'L', 'D0'); + GenLongDiv; + PopDiv := 'L'; + end + else begin + Convert(T2, 'W', 'D0'); + GenDiv; + PopDiv := T1; + end; +end; +{---------------------------------------------------------------} + + +The two code generation procedures are: + + +{---------------------------------------------------------------} +{ Divide Top of Stack by Primary (Word) } + +procedure GenDiv; +begin + EmitLn('DIVS D0,D7'); + Move('W', 'D7', 'D0'); +end; + + +{---------------------------------------------------------------} +{ Divide Top of Stack by Primary (Long) } + +procedure GenLongDiv; +begin + EmitLn('JSR DIV32'); +end; +{---------------------------------------------------------------} + + +Note that we assume that DIV32 leaves the (longword) result in +D0. + +OK, install the new procedures for division. At this point you +should be able to generate code for any kind of arithmetic +expression. Give it a whirl! + + +BEGINNING TO WIND DOWN + +At last, in this installment, we've learned how to deal with +variables (and literals) of different types. As you can see, it +hasn't been too tough. In fact, in some ways most of the code +looks even more simple than it does in earlier programs. Only +the multiplication and division operators require a little +thinking and planning. + +The main concept that made things easy was that of converting +procedures such as Expression into functions that return the type +of the result. Once this was done, we were able to retain the +same general structure of the compiler. + +I won't pretend that we've covered every single aspect of the +issue. I conveniently ignored unsigned arithmetic. 
From what +we've done, I think you can see that to include them adds no new +challenges, just extra possibilities to test for. + +I've also ignored the logical operators And, Or, etc. It turns +out that these are pretty easy to handle. All the logical +operators are bitwise operations, so they are symmetric and +therefore work in the same fashion as PopAdd. There is one +difference, however: if it is necessary to extend the word +length for a logical variable, the extension should be done as an +UNSIGNED number. Floating point numbers, again, are +straightforward to handle ... just a few more procedures to be +added to the run-time library, or perhaps instructions for a math +chip. + +Perhaps more importantly, I have also skirted the issue of type +CHECKING, as opposed to conversion. In other words, we've +allowed for operations between variables of all combinations of +types. In general this will not be true ... certainly you don't +want to add an integer, for example, to a string. Most languages +also don't allow you to mix up character and integer variables. + +Again, there are really no new issues to be addressed in this +case. We are already checking the types of the two operands ... +much of this checking gets done in procedures like SameType. +It's pretty straightforward to include a call to an error +handler, if the types of the two operands are incompatible. + +In the general case, we can think of every single operator as +being handled by a different procedure, depending upon the type +of the two operands. This is straightforward, though tedious, to +implement simply by implementing a jump table with the operand +types as indices. In Pascal, the equivalent operation would +involve nested Case statements. Some of the called procedures +could then be simple error routines, while others could effect +whatever kind of conversion we need. 
As more types are added, +the number of procedures goes up by a square-law rule, but that's +still not an unreasonably large number of procedures. + +What we've done here is to collapse such a jump table into far +fewer procedures, simply by making use of symmetry and other +simplifying rules. + + +TO COERCE OR NOT TO COERCE + +In case you haven't gotten this message yet, it sure appears that +TINY and KISS will probably _NOT_ be strongly typed languages, +since I've allowed for automatic mixing and conversion of just +about any type. Which brings up the next issue: + + Is this really what we want to do? + +The answer depends on what kind of language you want, and the way +you'd like it to behave. What we have not addressed is the issue +of when to allow and when to deny the use of operations involving +different data types. In other words, what should be the +SEMANTICS of our compiler? Do we want automatic type conversion +for all cases, for some cases, or not at all? + +Let's pause here to think about this a bit more. To do so, it +will help to look at a bit of history. + +FORTRAN II supported only two simple data types: Integer and +Real. It allowed implicit type conversion between real and +integer types during assignment, but not within expressions. All +data items (including literal constants) on the right-hand side +of an assignment statement had to be of the same type. That made +things pretty easy ... much simpler than what we've had to do +here. + +This was changed in FORTRAN IV to support "mixed-mode" +arithmetic. If an expression had any real data items in it, they +were all converted to reals and the expression itself was real. +To round out the picture, functions were provided to explicitly +convert from one type to the other, so that you could force an +expression to end up as either type. + +This led to two things: code that was easier to write, and code +that was less efficient. 
That's because sloppy programmers would +write expressions with simple constants like 0 and 1 in them, +which the compiler would dutifully compile to convert at +execution time. Still, the system worked pretty well, which +would tend to indicate that implicit type conversion is a Good +Thing. + +C is also a weakly typed language, though it supports a larger +number of types. C won't complain if you try to add a character +to an integer, for example. Partly, this is helped by the C +convention of promoting every char to integer when it is loaded, +or passed through a parameter list. This simplifies the +conversions quite a bit. In fact, in subset C compilers that +don't support long or float types, we end up back where we were +in our earlier, simple-minded first try: every variable has the +same representation, once loaded into a register. Makes life +pretty easy! + +The ultimate language in the direction of automatic type +conversion is PL/I. This language supports a large number of +data types, and you can mix them all freely. If the implicit +conversions of FORTRAN seemed good, then those of PL/I should +have been Heaven, but it turned out to be more like Hell! The +problem was that with so many data types, there had to be a large +number of different conversions, AND a correspondingly large +number of rules about how mixed operands should be converted. +These rules became so complex that no one could remember what +they were! A lot of the errors in PL/I programs had to do with +unexpected and unwanted type conversions. Too much of a Good +Thing can be bad for you! + +Pascal, on the other hand, is a language which is "strongly +typed," which means that in general you can't mix types, even if +they differ only in _NAME_, and yet have the same base type! +Niklaus Wirth made Pascal strongly typed to help keep programmers +out of trouble, and the restrictions have indeed saved many a +programmer from himself, because the compiler kept him from doing +something dumb. 
Better to find the bug in compilation rather +than the debug phase. The same restrictions can also cause +frustration when you really WANT to mix types, and they tend to +drive an ex-C-programmer up the wall. + +Even so, Pascal does permit some implicit conversions. You can +assign an integer to a real value. You can also mix integer and +real types in expressions of type Real. The integers will be +automatically coerced to real, just as in FORTRAN (and with the +same hidden cost in run-time overhead). + +You can't, however, convert the other way, from real to integer, +without applying an explicit conversion function, Trunc. The +theory here is that, since the numerical value of a real number +is necessarily going to be changed by the conversion (the +fractional part will be lost), you really shouldn't do it in +"secret." + +In the spirit of strong typing, Pascal will not allow you to mix +Char and Integer variables, without applying the explicit +coercion functions Chr and Ord. + +Turbo Pascal also includes the types Byte, Word, and LongInt. +The first two are basically the same as unsigned integers. In +Turbo, these can be freely intermixed with variables of type +Integer, and Turbo will automatically handle the conversion. +There are run-time checks, though, to keep you from overflowing +or otherwise getting the wrong answer. Note that you still can't +mix Byte and Char types, even though they are stored internally +in the same representation. + +The ultimate in a strongly-typed language is Ada, which allows +_NO_ implicit type conversions at all, and also will not allow +mixed-mode arithmetic. Jean Ichbiah's position is that +conversions cost execution time, and you shouldn't be allowed to +build in such cost in a hidden manner. By forcing the programmer +to explicitly request a type conversion, you make it more +apparent that there could be a cost involved. 
+ +I have been using another strongly-typed language, a delightful +little language called Whimsical, by John Spray. Although +Whimsical is intended as a systems programming language, it also +requires explicit conversion EVERY time. There are NEVER any +automatic conversions, even the ones supported by Pascal. + +This approach does have certain advantages: The compiler never +has to guess what to do: the programmer always tells it precisely +what he wants. As a result, there tends to be a more nearly +one-to-one correspondence between source code and compiled code, +and John's compiler produces VERY tight code. + +On the other hand, I sometimes find the explicit conversions to +be a pain. If I want, for example, to add one to a character, or +AND it with a mask, there are a lot of conversions to make. If I +get it wrong, the only error message is "Types are not +compatible." As it happens, John's particular implementation of +the language in his compiler doesn't tell you exactly WHICH types +are not compatible ... it only tells you which LINE the error is +in. + +I must admit that most of my errors with this compiler tend to be +errors of this type, and I've spent a lot of time with the +Whimsical compiler, trying to figure out just WHERE in the line +I've offended it. The only real way to fix the error is to keep +trying things until something works. + +So what should we do in TINY and KISS? For the first one, I have +the answer: TINY will support only the types Char and Integer, +and we'll use the C trick of promoting Chars to Integers +internally. That means that the TINY compiler will be _MUCH_ +simpler than what we've already done. Type conversion in +expressions is sort of moot, since none will be required! Since +longwords will not be supported, we also won't need the MUL32 and +DIV32 run-time routines, nor the logic to figure out when to call +them. I _LIKE_ it! + +KISS, on the other hand, will support the type Long. 
+ +Should it support both signed and unsigned arithmetic? For the +sake of simplicity I'd rather not. It does add quite a bit to +the complexity of type conversions. Even Niklaus Wirth has +eliminated unsigned (Cardinal) numbers from his new language +Oberon, with the argument that 32-bit integers should be long +enough for anybody, in either case. + +But KISS is supposed to be a systems programming language, which +means that we should be able to do whatever operations that can +be done in assembler. Since the 68000 supports both flavors of +integers, I guess KISS should, also. We've seen that logical +operations need to be able to extend integers in an unsigned +fashion, so the unsigned conversion procedures are required in +any case. + + +CONCLUSION + +That wraps up our session on type conversions. Sorry you had to +wait so long for it, but hope you feel that it was worth the +wait. + +In the next few installments, we'll extend the simple types to +include arrays and pointers, and we'll have a look at what to do +about strings. That should pretty well wrap up the mainstream +part of the series. After that, I'll give you the new versions +of the TINY and KISS compilers, and then we'll start to look at +optimization issues. + +See you then. + +***************************************************************** +* * +* COPYRIGHT NOTICE * +* * +* Copyright (C) 1989 Jack W. Crenshaw. All rights reserved. * +* * +***************************************************************** + + diff --git a/Compiler construction/Lets_build_compiler/tutor15.txt b/Compiler construction/Lets_build_compiler/tutor15.txt new file mode 100644 index 0000000..0384e6d --- /dev/null +++ b/Compiler construction/Lets_build_compiler/tutor15.txt @@ -0,0 +1,1150 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + LET'S BUILD A COMPILER! + + By + + Jack W. Crenshaw, Ph.D. 
+ + 5 March 1994 + + +Part 15: BACK TO THE FUTURE + + + +***************************************************************** +* * +* COPYRIGHT NOTICE * +* * +* Copyright (C) 1994 Jack W. Crenshaw. All rights reserved. * +* * +***************************************************************** + + +INTRODUCTION + +Can it really have been four years since I wrote installment +fourteen of this series? Is it really possible that six long +years have passed since I began it? Funny how time flies when +you're having fun, isn't it? + +I won't spend a lot of time making excuses; only point out that +things happen, and priorities change. In the four years since +installment fourteen, I've managed to get laid off, get divorced, +have a nervous breakdown, begin a new career as a writer, begin +another one as a consultant, move, work on two real-time systems, +and raise fourteen baby birds, three pigeons, six possums, and a +duck. For awhile there, the parsing of source code was not high +on my list of priorities. Neither was writing stuff for free, +instead of writing stuff for pay. But I do try to be faithful, +and I do recognize and feel my responsibility to you, the reader, +to finish what I've started. As the tortoise said in one of my +son's old stories, I may be slow, but I'm sure. I'm sure that +there are people out there anxious to see the last reel of this +film, and I intend to give it to them. So, if you're one of those +who's been waiting, more or less patiently, to see how this thing +comes out, thanks for your patience. I apologize for the delay. +Let's move on. + + +NEW STARTS, OLD DIRECTIONS + +Like many other things, programming languages and programming +styles change with time. In 1994, it seems a little anachronistic +to be programming in Turbo Pascal, when the rest of the world +seems to have gone bananas over C++. It also seems a little +strange to be programming in a classical style when the rest of +the world has switched to object-oriented methods. 
Still, in +spite of the four-year hiatus, it would be entirely too wrenching +a change, at this point, to switch to, say, C++ with object- +orientation . Anyway, Pascal is still not only a powerful +programming language (more than ever, in fact), but it's a +wonderful medium for teaching. C is a notoriously difficult +language to read ... it's often been accused, along with Forth, of +being a "write-only language." When I program in C++, I find +myself spending at least 50% of my time struggling with language +syntax rather than with concepts. A stray "&" or "*" can not only +change the functioning of the program, but its correctness as +well. By contrast, Pascal code is usually quite transparent and +easy to read, even if you don't know the language. What you see is +almost always what you get, and we can concentrate on concepts +rather than implementation details. I've said from the beginning +that the purpose of this tutorial series was not to generate the +world's fastest compiler, but to teach the fundamentals of +compiler technology, while spending the least amount of time +wrestling with language syntax or other aspects of software +implementation. Finally, since a lot of what we do in this course +amounts to software experimentation, it's important to have a +compiler and associated environment that compiles quickly and with +no fuss. In my opinion, by far the most significant time measure +in software development is the speed of the edit/compile/test +cycle. In this department, Turbo Pascal is king. The compilation +speed is blazing fast, and continues to get faster in every +release (how do they keep doing that?). Despite vast improvements +in C compilation speed over the years, even Borland's fastest +C/C++ compiler is still no match for Turbo Pascal. Further, the +editor built into their IDE, the make facility, and even their +superb smart linker, all complement each other to produce a +wonderful environment for quick turnaround. 
For all of these +reasons, I intend to stick with Pascal for the duration of this +series. We'll be using Turbo Pascal for Windows, one of the +compilers provided with Borland Pascal with Objects, version 7.0. If +you don't have this compiler, don't worry ... nothing we do here +is going to count on your having the latest version. Using the +Windows version helps me a lot, by allowing me to use the +Clipboard to copy code from the compiler's editor into these +documents. It should also help you at least as much, copying the +code in the other direction. + +I've thought long and hard about whether or not to introduce +objects to our discussion. I'm a big advocate of object-oriented +methods for all uses, and such methods definitely have their place +in compiler technology. In fact, I've written papers on just this +subject (Refs. 1-3). But the architecture of a compiler which is +based on object-oriented approaches is vastly different from that +of the more classical compiler we've been building. Again, it +would seem to be entirely too much to change these horses in mid- +stream. As I said, programming styles change. Who knows, it may +be another six years before we finish this thing, and if we keep +changing the code every time programming style changes, we may +NEVER finish. + +So for now, at least, I've determined to continue the classical +style in Pascal, though we might indeed discuss objects and object +orientation as we go. Likewise, the target machine will remain +the Motorola 68000 family. Of all the decisions to be made here, +this one has been the easiest. Though I know that many of you +would like to see code for the 80x86, the 68000 has become, if +anything, even more popular as a platform for embedded systems, +and it's for that application that this whole effort began in the +first place.
Compiling for the PC, MSDOS platform, we'd have to +deal with all the issues of DOS system calls, DOS linker formats, +the PC file system and hardware, and all those other complications +of a DOS environment. An embedded system, on the other hand, must +run standalone, and it's for this kind of application, as an +alternative to assembly language, that I've always imagined that a +language like KISS would thrive. Anyway, who wants to deal with +the 80x86 architecture if they don't have to? + +The one feature of Turbo Pascal that I'm going to be making heavy +use of is units. In the past, we've had to make compromises +between code size and complexity, and program functionality. A +lot of our work has been in the nature of computer +experimentation, looking at only one aspect of compiler technology +at a time. We did this to avoid having to carry around +large programs, just to investigate simple concepts. In the +process, we've re-invented the wheel and re-programmed the same +functions more times than I'd like to count. Turbo units provide +a wonderful way to get functionality and simplicity at the same +time: You write reusable code, and invoke it with a single line. +Your test program stays small, but it can do powerful things. + +One feature of Turbo Pascal units is their initialization block. +As with an Ada package, any code in the main begin-end block of a +unit gets executed as the program is initialized. As you'll see +later, this sometimes gives us neat simplifications in the code. +Our procedure Init, which has been with us since Installment 1, +goes away entirely when we use units. The various routines in the +Cradle, another key feature of our approach, will get distributed +among the units. + +The concept of units, of course, is no different from that of C +modules. However, in C (and C++), the interface between modules +comes via preprocessor include statements and header files.
As +someone who's had to read a lot of other people's C programs, I've +always found this rather bewildering. It always seems that +whatever data structure you'd like to know about is in some other +file. Turbo units are simpler for the very reason that they're +criticized by some: The function interfaces and their +implementation are included in the same file. While this +organization may create problems with code security, it also +reduces the number of files by half, which isn't half bad. +Linking of the object files is also easy, because the Turbo +compiler takes care of it without the need for make files or other +mechanisms. + + +STARTING OVER? + +Four years ago, in Installment 14, I promised you that our days of +re-inventing the wheel, and recoding the same software over and +over for each lesson, were over, and that from now on we'd stick +to more complete programs that we would simply add new features +to. I still intend to keep that promise; that's one of the main +purposes for using units. However, because of the long time since +Installment 14, it's natural to want to at least do some review, +and anyhow, we're going to have to make rather sweeping changes in +the code to make the transition to units. Besides, frankly, after +all this time I can't remember all the neat ideas I had in my head +four years ago. The best way for me to recall them is to retrace +some of the steps we took to arrive at Installment 14. So I hope +you'll be understanding and bear with me as we go back to our +roots, in a sense, and rebuild the core of the software, +distributing the routines among the various units, and +bootstrapping ourselves back up to the point we were at lo, those +many moons ago. As has always been the case, you're going to get +to see me make all the mistakes and execute changes of direction, +in real time. Please bear with me ... we'll start getting to the +new stuff before you know it. 
+ +Since we're going to be using multiple modules in our new +approach, we have to address the issue of file management. If +you've followed all the other sections of this tutorial, you know +that, as our programs evolve, we're going to be replacing older, +more simple-minded units with more capable ones. This brings us to +an issue of version control. There will almost certainly be times +when we will overlay a simple file (unit), but later wish we had +the simple one again. A case in point is embodied in our +predilection for using single-character variable names, keywords, +etc., to test concepts without getting bogged down in the details +of a lexical scanner. Thanks to the use of units, we will be +doing much less of this in the future. Still, I not only suspect, +but am certain that we will need to save some older versions of +files, for special purposes, even though they've been replaced by +newer, more capable ones. + +To deal with this problem, I suggest that you create different +directories, with different versions of the units as needed. If +we do this properly, the code in each directory will remain self- +consistent. I've tentatively created four directories: SINGLE +(for single-character experimentation), MULTI (for, of course, +multi-character versions), TINY, and KISS. + +Enough said about philosophy and details. Let's get on with the +resurrection of the software. + + +THE INPUT UNIT + +A key concept that we've used since Day 1 has been the idea of an +input stream with one lookahead character. All the parsing +routines examine this character, without changing it, to decide +what they should do next. (Compare this approach with the C/Unix +approach using getchar and unget, and I think you'll agree that +our approach is simpler). We'll begin our hike into the future by +translating this concept into our new, unit-based organization. 
+The first unit, appropriately called Input, is shown below: + + +{--------------------------------------------------------------} +unit Input; +{--------------------------------------------------------------} +interface +var Look: char; { Lookahead character } +procedure GetChar; { Read new character } + +{--------------------------------------------------------------} +implementation + +{--------------------------------------------------------------} +{ Read New Character From Input Stream } + +procedure GetChar; +begin + Read(Look); +end; + +{--------------------------------------------------------------} +{ Unit Initialization } +begin + GetChar; +end. +{--------------------------------------------------------------} + + +As you can see, there's nothing very profound, and certainly +nothing complicated, about this unit, since it consists of only a +single procedure. But already, we can see how the use of units +gives us advantages. Note the executable code in the +initialization block. This code "primes the pump" of the input +stream for us, something we've always had to do before, by +inserting the call to GetChar in line, or in procedure Init. This +time, the call happens without any special reference to it on our +part, except within the unit itself. As I predicted earlier, this +mechanism is going to make our lives much simpler as we proceed. +I consider it to be one of the most useful features of Turbo +Pascal, and I lean on it heavily. + +Copy this unit into your compiler's IDE, and compile it. To test +the software, of course, we always need a main program. I used +the following, really complex test program, which we'll later +evolve into the Main for our compiler: + + +{--------------------------------------------------------------} +program Main; +uses WinCRT, Input; +begin + WriteLn(Look); +end. +{--------------------------------------------------------------} + + +Note the use of the Borland-supplied unit, WinCRT. 
This unit is +necessary if you intend to use the standard Pascal I/O routines, +Read, ReadLn, Write, and WriteLn, which of course we intend to do. +If you forget to include this unit in the "uses" clause, you will +get a really bizarre and indecipherable error message at run time. + +Note also that we can access the lookahead character, even though +it's not declared in the main program. All variables declared +within the interface section of a unit are global, but they're +hidden from prying eyes; to that extent, we get a modicum of +information hiding. Of course, if we were writing in an object- +oriented fashion, we should not allow outside modules to access +the unit's internal variables. But, although Turbo units have a +lot in common with objects, we're not doing object-oriented design +or code here, so our use of Look is appropriate. + +Go ahead and save the test program as Main.pas. To make life +easier as we get more and more files, you might want to take this +opportunity to declare this file as the compiler's Primary file. +That way, you can execute the program from any file. Otherwise, +if you press Ctrl-F9 to compile and run from one of the units, +you'll get an error message. You set the primary file using the +main submenu, "Compile," in the Turbo IDE. + +I hasten to point out, as I've done before, that unit Input is, +and always has been, considered to be a dummy +version of the real thing. In a production version of a compiler, +the input stream will, of course, come from a file rather than +from the keyboard. And it will almost certainly include line +buffering, at the very least, and more likely, a rather large text +buffer to support efficient disk I/O. The nice part about the +unit approach is that, as with objects, we can modify the code in +the unit to be as simple or as sophisticated as we like.
As long +as the interface, as embodied in the public procedures and the +lookahead character, doesn't change, the rest of the program is +totally unaffected. And since units are compiled, rather than +merely included, the time required to link with them is virtually +nil. Again, the result is that we can get all the benefits of +sophisticated implementations, without having to carry the code +around as so much baggage. + +In later installments, I intend to provide a full-blown IDE for +the KISS compiler, using a true Windows application generated by +Borland's OWL applications framework. For now, though, we'll obey +my #1 rule to live by: Keep It Simple. + + + +THE OUTPUT UNIT + +Of course, every decent program should have output, and ours is no +exception. Our output routines included the Emit functions. The +code for the corresponding output unit is shown next: + + +{--------------------------------------------------------------} +unit Output; +{--------------------------------------------------------------} +interface +procedure Emit(s: string); { Emit an instruction } +procedure EmitLn(s: string); { Emit an instruction line } + +{--------------------------------------------------------------} +implementation +const TAB = ^I; + +{--------------------------------------------------------------} +{ Emit an Instruction } + +procedure Emit(s: string); +begin + Write(TAB, s); +end; + +{--------------------------------------------------------------} +{ Emit an Instruction, Followed By a Newline } + +procedure EmitLn(s: string); +begin + Emit(s); + WriteLn; +end; + +end. +{--------------------------------------------------------------} + + +(Notice that this unit has no initialization clause, so it needs +no begin-block.) + +Test this unit with the following main program: + +{--------------------------------------------------------------} +program Test; +uses WinCRT, Input, Output; +begin + WriteLn('MAIN:'); + EmitLn('Hello, world!'); +end.
+{--------------------------------------------------------------} + +Did you see anything that surprised you? You may have been +surprised to see that you needed to type something, even though +the main program requires no input. That's because of the +initialization in unit Input, which still requires something to +put into the lookahead character. Sorry, there's no way out of +that box, or rather, we don't _WANT_ to get out. Except for simple +test cases such as this, we will always want a valid lookahead +character, so the right thing to do about this "problem" is ... +nothing. + +Perhaps more surprisingly, notice that the TAB character had no +effect; our line of "instructions" begins at column 1, same as the +fake label. That's right: WinCRT doesn't support tabs. We have a +problem. + +There are a few ways we can deal with this problem. The one thing +we can't do is to simply ignore it. Every assembler I've ever +used reserves column 1 for labels, and will rebel at seeing +instructions starting there. So, at the very least, we must space +the instructions over one column to keep the assembler happy. +That's easy enough to do: Simply change, in procedure Emit, the +line: + + Write(TAB, s); + +to: + + Write(' ', s); + +I must admit that I've wrestled with this problem before, and find +myself changing my mind as often as a chameleon changes color. +For our purposes, 99% of which will be +examining the output code as it's displayed on a CRT, it would be +nice to see neatly blocked out "object" code. The line: + +SUB1: MOVE #4,D0 + +just plain looks neater than the different, but functionally +identical code, + +SUB1: + MOVE #4,D0 + +In test versions of my code, I included a more sophisticated +version of the procedure PostLabel, that avoids having labels on +separate lines, but rather defers the printing of a label so it +can end up on the same line as the associated instruction.
As +recently as an hour ago, my version of unit Output provided full +support for tabs, using an internal column count variable and +software to manage it. I had, if I do say so myself, some rather +elegant code to support the tab mechanism, with a minimum of code +bloat. It was awfully tempting to show you the "prettyprint" +version, if for no other reason than to show off the elegance. + +Nevertheless, the code of the "elegant" version was considerably +more complex and larger. Since then, I've had second thoughts. In +spite of our desire to see pretty output, the inescapable fact is +that the two versions of the SUB1: code fragment shown above are +functionally identical; the assembler, which is the ultimate +destination of the code, couldn't care less which version it gets. +But the prettier one not only takes more code to generate, but +will also create a larger output file, with many more space characters +than the minimum needed, and will therefore use more disk space +and take longer to assemble. When you look at it that way, it's not +very hard to decide which approach to use, is it? + +What finally clinched the issue for me was a reminder to consider +my own first commandment: KISS. Although I was pretty proud of +all my elegant little tricks to implement tabbing, I had to remind +myself that, to paraphrase Senator Barry Goldwater, elegance in +the pursuit of complexity is no virtue. Another wise man once +wrote, "Any idiot can design a Rolls-Royce. It takes a genius to +design a VW." So the elegant, tab-friendly version of Output is +history, and what you see is the simple, compact, VW version. + + +THE ERROR UNIT + +Our next set of routines are those that handle errors. To refresh +your memory, we take the approach, pioneered by Borland in Turbo +Pascal, of halting on the first error.
Not only does this greatly +simplify our code, by completely avoiding the sticky issue of +error recovery, but it also makes much more sense, in my opinion, +in an interactive environment. I know this may be an extreme +position, but I consider the practice of reporting all errors in a +program to be an anachronism, a holdover from the days of batch +processing. It's time to scuttle the practice. So there. + +In our original Cradle, we had two error-handling procedures: +Error, which didn't halt, and Abort, which did. But I don't think +we ever found a use for the procedure that didn't halt, so in the +new, lean and mean unit Errors, shown next, procedure Error takes +the place of Abort. + + +{--------------------------------------------------------------} +unit Errors; +{--------------------------------------------------------------} +interface +procedure Error(s: string); +procedure Expected(s: string); + +{--------------------------------------------------------------} +implementation + +{--------------------------------------------------------------} +{ Write error Message and Halt } + +procedure Error(s: string); +begin + WriteLn; + WriteLn(^G, 'Error: ', s, '.'); + Halt; +end; + +{--------------------------------------------------------------} +{ Write " Expected" } + +procedure Expected(s: string); +begin + Error(s + ' Expected'); +end; + +end. +{--------------------------------------------------------------} + + +As usual, here's a test program: + + + + +{--------------------------------------------------------------} +program Test; +uses WinCRT, Input, Output, Errors; + +begin + Expected('Integer'); +end. +{--------------------------------------------------------------} + +Have you noticed that the "uses" line in our main program keeps +getting longer? That's OK. In the final version, the main program +will only call procedures in our parser, so its use clause will +only have a couple of entries. 
But for now, it's probably best to +include all the units so we can test procedures in them. + + +SCANNING AND PARSING + +The classical compiler architecture consists of separate modules +for the lexical scanner, which supplies tokens in the language, +and the parser, which tries to make sense of the tokens as syntax +elements. If you can still remember what we did in earlier +installments, you'll recall that we didn't do things that way. +Because we're using a predictive parser, we can almost always tell +what language element is coming next, just by examining the +lookahead character. Therefore, we found no need to prefetch +tokens, as a scanner would do. + +But, even though there is no functional procedure called +"Scanner," it still makes sense to separate the scanning functions +from the parsing functions. So I've created two more units +called, amazingly enough, Scanner and Parser. The Scanner unit +contains all of the routines known as recognizers. Some of these, +such as IsAlpha, are pure boolean routines which operate on the +lookahead character only. The other routines are those which +collect tokens, such as identifiers and numeric constants. The +Parser unit will contain all of the routines making up the +recursive-descent parser. The general rule should be that unit +Parser contains all of the information that is language-specific; +in other words, the syntax of the language should be wholly +contained in Parser. In an ideal world, this rule should be true +to the extent that we can change the compiler to compile a +different language, merely by replacing the single unit, Parser. + +In practice, things are almost never this pure. There's always a +small amount of "leakage" of syntax rules into the scanner as +well. For example, the rules concerning what makes up a legal +identifier or constant may vary from language to language. In +some languages, the rules concerning comments permit them to be +filtered by the scanner, while in others they do not. 
So in +practice, both units are likely to end up having language- +dependent components, but the changes required to the scanner +should be relatively trivial. + +Now, recall that we've used two versions of the scanner routines: +One that handled only single-character tokens, which we used for a +number of our tests, and another that provided full support for +multi-character tokens. Now that we have our software separated +into units, I don't anticipate getting much use out of the single- +character version, but it doesn't cost us much to provide for +both. I've created two versions of the Scanner unit. The first +one, called Scanner1, contains the single-digit version of the +recognizers: + + +{--------------------------------------------------------------} +unit Scanner1; +{--------------------------------------------------------------} +interface +uses Input, Errors; + +function IsAlpha(c: char): boolean; +function IsDigit(c: char): boolean; +function IsAlNum(c: char): boolean; +function IsAddop(c: char): boolean; +function IsMulop(c: char): boolean; + +procedure Match(x: char); +function GetName: char; +function GetNumber: char; + +{--------------------------------------------------------------} +implementation + +{--------------------------------------------------------------} +{ Recognize an Alpha Character } + +function IsAlpha(c: char): boolean; +begin + IsAlpha := UpCase(c) in ['A'..'Z']; +end; + +{--------------------------------------------------------------} +{ Recognize a Numeric Character } + +function IsDigit(c: char): boolean; +begin + IsDigit := c in ['0'..'9']; +end; + +{--------------------------------------------------------------} +{ Recognize an Alphanumeric Character } + +function IsAlnum(c: char): boolean; +begin + IsAlnum := IsAlpha(c) or IsDigit(c); +end; + +{--------------------------------------------------------------} +{ Recognize an Addition Operator } + +function IsAddop(c: char): boolean; +begin + IsAddop := c in ['+','-']; +end; + 
+{--------------------------------------------------------------} +{ Recognize a Multiplication Operator } + +function IsMulop(c: char): boolean; +begin + IsMulop := c in ['*','/']; +end; + +{--------------------------------------------------------------} +{ Match One Character } + +procedure Match(x: char); +begin + if Look = x then GetChar + else Expected('''' + x + ''''); +end; + +{--------------------------------------------------------------} +{ Get an Identifier } + +function GetName: char; +begin + if not IsAlpha(Look) then Expected('Name'); + GetName := UpCase(Look); + GetChar; +end; + +{--------------------------------------------------------------} +{ Get a Number } + +function GetNumber: char; +begin + if not IsDigit(Look) then Expected('Integer'); + GetNumber := Look; + GetChar; +end; + +end. +{--------------------------------------------------------------} + + +The following code fragment of the main program provides a good +test of the scanner. For brevity, I'll only include the +executable code here; the rest remains the same. Don't forget, +though, to add the name Scanner1 to the "uses" clause. + + Write(GetName); + Match('='); + Write(GetNumber); + Match('+'); + WriteLn(GetName); + +This code will recognize all sentences of the form: + + x=0+y + +where x and y can be any single-character variable names, and 0 +any digit. The code should reject all other sentences, and give a +meaningful error message. If it did, you're in good shape and we +can proceed. + + +THE SCANNER UNIT + +The next, and by far the most important, version of the scanner is +the one that handles the multi-character tokens that all real +languages must have. Only the two functions, GetName and +GetNumber, change between the two units, but just to be sure there +are no mistakes, I've reproduced the entire unit here. 
This is +unit Scanner: + + +{--------------------------------------------------------------} +unit Scanner; +{--------------------------------------------------------------} +interface +uses Input, Errors; + +function IsAlpha(c: char): boolean; +function IsDigit(c: char): boolean; +function IsAlNum(c: char): boolean; +function IsAddop(c: char): boolean; +function IsMulop(c: char): boolean; + +procedure Match(x: char); +function GetName: string; +function GetNumber: string; + +{--------------------------------------------------------------} +implementation + +{--------------------------------------------------------------} +{ Recognize an Alpha Character } + +function IsAlpha(c: char): boolean; +begin + IsAlpha := UpCase(c) in ['A'..'Z']; +end; + +{--------------------------------------------------------------} +{ Recognize a Numeric Character } + +function IsDigit(c: char): boolean; +begin + IsDigit := c in ['0'..'9']; +end; + +{--------------------------------------------------------------} +{ Recognize an Alphanumeric Character } + +function IsAlnum(c: char): boolean; +begin + IsAlnum := IsAlpha(c) or IsDigit(c); +end; + +{--------------------------------------------------------------} +{ Recognize an Addition Operator } + +function IsAddop(c: char): boolean; +begin + IsAddop := c in ['+','-']; +end; + +{--------------------------------------------------------------} +{ Recognize a Multiplication Operator } + +function IsMulop(c: char): boolean; +begin + IsMulop := c in ['*','/']; +end; + +{--------------------------------------------------------------} +{ Match One Character } + +procedure Match(x: char); +begin + if Look = x then GetChar + else Expected('''' + x + ''''); +end; + +{--------------------------------------------------------------} +{ Get an Identifier } + +function GetName: string; +var n: string; +begin + n := ''; + if not IsAlpha(Look) then Expected('Name'); + while IsAlnum(Look) do begin + n := n + UpCase(Look); { fold to upper case } + GetChar; + end; + GetName := n; +end;
+ +{--------------------------------------------------------------} +{ Get a Number } + +function GetNumber: string; +var n: string; +begin + n := ''; + if not IsDigit(Look) then Expected('Integer'); + while IsDigit(Look) do begin + n := n + Look; + GetChar; + end; + GetNumber := n; +end; + +end. +{--------------------------------------------------------------} + + +The same test program will test this scanner, also. Simply change +the "uses" clause to use Scanner instead of Scanner1. Now you +should be able to type multi-character names and numbers. + + +DECISIONS, DECISIONS + +In spite of the relative simplicity of both scanners, a lot of +thought has gone into them, and a lot of decisions had to be made. +I'd like to share those thoughts with you now so you can make your +own educated decision, appropriate for your application. First, +note that both versions of GetName translate the input characters +to upper case. Obviously, there was a design decision made here, +and this is one of those cases where the language syntax splatters +over into the scanner. In the C language, the case of characters +in identifiers is significant. For such a language, we obviously +can't map the characters to upper case. The design I'm using +assumes a language like Pascal, where the case of characters +doesn't matter. For such languages, it's easier to go ahead and +map all identifiers to upper case in the scanner, so we don't have +to worry later on when we're comparing strings for equality. + +We could have even gone a step further, and map the characters to +upper case right as they come in, in GetChar. This approach works +too, and I've used it in the past, but it's too confining. +Specifically, it will also map characters that may be part of +quoted strings, which is not a good idea. So if you're going to +map to upper case at all, GetName is the proper place to do it. + +Note that the function GetNumber in this scanner returns a string, +just as GetName does. 
This is another one of those things I've +oscillated about almost daily, and the last swing was all of ten +minutes ago. The alternative approach, and one I've used many +times in past installments, returns an integer result. + +Both approaches have their good points. Since we're fetching a +number, the approach that immediately comes to mind is to return +it as an integer. But bear in mind that the eventual use of the +number will be in a write statement that goes back to the outside +world. Someone -- either us or the code hidden inside the write +statement -- is going to have to convert the number back to a +string again. Turbo Pascal includes such string conversion +routines, but why use them if we don't have to? Why convert a +number from string to integer form, only to convert it right back +again in the code generator, only a few statements later? + +Furthermore, as you'll soon see, we're going to need a temporary +storage spot for the value of the token we've fetched. If we treat +the number in its string form, we can store the value of either a +variable or a number in the same string. Otherwise, we'll have to +create a second, integer variable. + +On the other hand, we'll find that carrying the number as a string +virtually eliminates any chance of optimization later on. As we +get to the point where we are beginning to concern ourselves with +code generation, we'll encounter cases in which we're doing +arithmetic on constants. For such cases, it's really foolish to +generate code that performs the constant arithmetic at run time. +Far better to let the parser do the arithmetic at compile time, +and merely code the result. To do that, we'll wish we had the +constants stored as integers rather than strings. + +What finally swung me back over to the string approach was an +aggressive application of the KISS test, plus reminding myself +that we've studiously avoided issues of code efficiency. 
One of +the things that makes our simple-minded parsing work, without the +complexities of a "real" compiler, is that we've said up front +that we aren't concerned about code efficiency. That gives us a +lot of freedom to do things the easy way rather than the efficient +one, and it's a freedom we must be careful not to abandon +voluntarily, in spite of the urges for efficiency shouting in our +ear. In addition to being a big believer in the KISS philosophy, +I'm also an advocate of "lazy programming," which in this context +means, don't program anything until you need it. As P.J. Plauger +says, "Never put off until tomorrow what you can put off +indefinitely." Over the years, much code has been written to +provide for eventualities that never happened. I've learned that +lesson myself, from bitter experience. So the bottom line is: We +won't convert to an integer here because we don't need to. It's +as simple as that. + +For those of you who still think we may need the integer version +(and indeed we may), here it is: + + +{--------------------------------------------------------------} +{ Get a Number (integer version) } + +function GetNumber: longint; +var n: longint; +begin + n := 0; + if not IsDigit(Look) then Expected('Integer'); + while IsDigit(Look) do begin + n := 10 * n + (Ord(Look) - Ord('0')); + GetChar; + end; + GetNumber := n; +end; +{--------------------------------------------------------------} + +You might file this one away, as I intend to, for a rainy day. + + +PARSING + +At this point, we have distributed all the routines that made up +our Cradle into units that we can draw upon as we need them. +Obviously, they will evolve further as we continue the process of +bootstrapping ourselves up again, but for the most part their +content, and certainly the architecture that they imply, is +defined. What remains is to embody the language syntax into the +parser unit. 
We won't do much of that in this installment, but I +do want to do a little, just to leave us with the good feeling +that we still know what we're doing. So before we go, let's +generate just enough of a parser to process single factors in an +expression. In the process, we'll also, by necessity, find we +have created a code generator unit, as well. + +Remember the very first installment of this series? We read an +integer value, say n, and generated the code to load it into the +D0 register via an immediate move: + + MOVE #n,D0 + +Shortly afterwards, we repeated the process for a variable, + + MOVE X(PC),D0 + +and then for a factor that could be either constant or variable. +For old times sake, let's revisit that process. Define the +following new unit: + + +{--------------------------------------------------------------} +unit Parser; +{--------------------------------------------------------------} +interface +uses Input, Scanner, Errors, CodeGen; +procedure Factor; + +{--------------------------------------------------------------} +implementation + +{--------------------------------------------------------------} +{ Parse and Translate a Factor } + +procedure Factor; +begin + LoadConstant(GetNumber); +end; + +end. +{--------------------------------------------------------------} + + +As you can see, this unit calls a procedure, LoadConstant, which +actually effects the output of the assembly-language code. The +unit also uses a new unit, CodeGen. This step represents the last +major change in our architecture, from earlier installments: The +removal of the machine-dependent code to a separate unit. If I +have my way, there will not be a single line of code, outside of +CodeGen, that betrays the fact that we're targeting the 68000 CPU. +And this is one place I think that having my way is quite +feasible. 
+ +For those of you who wish I were using the 80x86 architecture (or +any other one) instead of the 68000, here's your answer: Merely +replace CodeGen with one suitable for your CPU of choice. + +So far, our code generator has only one procedure in it. Here's +the unit: + + +{--------------------------------------------------------------} +unit CodeGen; + +{--------------------------------------------------------------} +interface +uses Output; +procedure LoadConstant(n: string); + +{--------------------------------------------------------------} +implementation + +{--------------------------------------------------------------} +{ Load the Primary Register with a Constant } + +procedure LoadConstant(n: string); +begin + EmitLn('MOVE #' + n + ',D0' ); +end; + +end. +{--------------------------------------------------------------} + + +Copy and compile this unit, and execute the following main +program: + +{--------------------------------------------------------------} +program Main; +uses WinCRT, Input, Output, Errors, Scanner, Parser; +begin + Factor; +end. +{--------------------------------------------------------------} + + +There it is, the generated code, just as we hoped it would be. + +Now, I hope you can begin to see the advantage of the unit-based +architecture of our new design. Here we have a main program +that's all of five lines long. That's all of the program we need +to see, unless we choose to see more. And yet, all those units +are sitting there, patiently waiting to serve us. We can have our +cake and eat it too, in that we have simple and short code, but +powerful allies. What remains to be done is to flesh out the +units to match the capabilities of earlier installments. We'll do +that in the next installment, but before I close, let's finish out +the parsing of a factor, just to satisfy ourselves that we still +know how. 
The final version of CodeGen includes the new +procedure, LoadVariable: + +{--------------------------------------------------------------} +unit CodeGen; + +{--------------------------------------------------------------} +interface +uses Output; +procedure LoadConstant(n: string); +procedure LoadVariable(Name: string); + +{--------------------------------------------------------------} +implementation + +{--------------------------------------------------------------} +{ Load the Primary Register with a Constant } + +procedure LoadConstant(n: string); +begin + EmitLn('MOVE #' + n + ',D0' ); +end; + +{--------------------------------------------------------------} +{ Load a Variable to the Primary Register } + +procedure LoadVariable(Name: string); +begin + EmitLn('MOVE ' + Name + '(PC),D0'); +end; + + +end. +{--------------------------------------------------------------} + + +The parser unit itself doesn't change, but we have a more complex +version of procedure Factor: + +{--------------------------------------------------------------} +{ Parse and Translate a Factor } + +procedure Factor; +begin + if IsDigit(Look) then + LoadConstant(GetNumber) + else if IsAlpha(Look)then + LoadVariable(GetName) + else + Error('Unrecognized character ' + Look); +end; +{--------------------------------------------------------------} + + +Now, without altering the main program, you should find that our +program will process either a variable or a constant factor. At +this point, our architecture is almost complete; we have units to +do all the dirty work, and enough code in the parser and code +generator to demonstrate that everything works. What remains is +to flesh out the units we've defined, particularly the parser and +code generator, to support the more complex syntax elements that +make up a real language. Since we've done this many times before +in earlier installments, it shouldn't take long to get us back to +where we were before the long hiatus. 
We'll continue this process +in Installment 16, coming soon. See you then. + + + +REFERENCES + +1. Crenshaw, J.W., "Object-Oriented Design of Assemblers and +Compilers," Proc. Software Development '91 Conference, Miller +Freeman, San Francisco, CA, February 1991, pp. 143-155. + +2. Crenshaw, J.W., "A Perfect Marriage," Computer Language, Volume +8, #6, June 1991, pp. 44-55. + +3. Crenshaw, J.W., "Syntax-Driven Object-Oriented Design," Proc. +1991 Embedded Systems Conference, Miller Freeman, San +Francisco, CA, September 1991, pp. 45-60. + + +***************************************************************** +* * +* COPYRIGHT NOTICE * +* * +* Copyright (C) 1994 Jack W. Crenshaw. All rights reserved. * +* * +* * +***************************************************************** + + diff --git a/Compiler construction/Lets_build_compiler/tutor16.txt b/Compiler construction/Lets_build_compiler/tutor16.txt new file mode 100644 index 0000000..51790ad --- /dev/null +++ b/Compiler construction/Lets_build_compiler/tutor16.txt @@ -0,0 +1,1172 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + LET'S BUILD A COMPILER! + + By + + Jack W. Crenshaw, Ph.D. + + 29 May, 1995 + + Part 16: UNIT CONSTRUCTION + + + +***************************************************************** +* * +* COPYRIGHT NOTICE * +* * +* Copyright (C) 1995 Jack W. Crenshaw. All rights reserved. * +* * +***************************************************************** + +INTRODUCTION + +This series of tutorials promises to be perhaps one of the longest- +running mini-series in history, rivalled only by the delay in Volume IV +of Knuth. Begun in 1988, the series ran into a four-year hiatus in 1990 +when the "cares of this world," changes in priorities and interests, and +the need to make a living seemed to stall it out after Installment 14. +Those of you with loads of patience were finally rewarded, in the spring +of last year, with the long-awaited Installment 15. 
In it, I began to +try to steer the series back on track, and in the process, to make it +easier to continue on to the goal, which is to provide you with not only +enough understanding of the difficult subject of compiler theory, but +also enough tools, in the form of canned subroutines and concepts, so +that you would be able to continue on your own and become proficient +enough to build your own parsers and translators. Because of that long +hiatus, I thought it appropriate to go back and review the concepts we +have covered so far, and to redo some of the software, as well. In the +past, we've never concerned ourselves much with the development of +production-quality software tools ... after all, I was trying to teach +(and learn) concepts, not production practice. To do that, I tended to +give you, not complete compilers or parsers, but only those snippets of +code that illustrated the particular point we were considering at the +moment. + +I still believe that's a good way to learn any subject; no one wants to +have to make changes to 100,000 line programs just to try out a new +idea. But the idea of just dealing with code snippets, rather than +complete programs, also has its drawbacks in that we often seemed to be +writing the same code fragments over and over. Although repetition has +been thoroughly proven to be a good way to learn new ideas, it's also +true that one can have too much of a good thing. By the time I had +completed Installment 14 I seemed to have reached the limits of my +abilities to juggle multiple files and multiple versions of the same +software functions. Who knows, perhaps that's one reason I seemed to +have run out of gas at that point. + +Fortunately, the later versions of Borland's Turbo Pascal allow us to +have our cake and eat it too. By using their concept of separately +compilable units, we can still write small subroutines and functions, +and keep our main programs and test programs small and simple. 
But, +once written, the code in the Pascal units will always be there for us +to use, and linking them in is totally painless and transparent. + +Since, by now, most of you are programming in either C or C++, I know +what you're thinking: Borland, with their Turbo Pascal (TP), certainly +didn't invent the concept of separately compilable modules. And of +course you're right. But if you've not used TP lately, or ever, you may +not realize just how painless the whole process is. Even in C or C++, +you still have to build a make file, either manually or by telling the +compiler how to do so. You must also list, using "extern" statements or +header files, the functions you want to import. In TP, you don't even +have to do that. You need only name the units you wish to use, and all +of their procedures automatically become available. + + +It's not my intention to get into a language-war debate here, so I won't +pursue the subject any further. Even I no longer use Pascal on my job +... I use C at work and C++ for my articles in Embedded Systems +Programming and other magazines. Believe me, when I set out to +resurrect this series, I thought long and hard about switching both +languages and target systems to the ones that we're all using these +days, C/C++ and PC architecture, and possibly object-oriented methods as +well. In the end, I felt it would cause more confusion than the hiatus +itself has. And after all, Pascal still remains one of the best possible +languages for teaching, not to mention production programming. Finally, +TP still compiles at the speed of light, much faster than competing +C/C++ compilers. And Borland's smart linker, used in TP but not in their +C++ products, is second to none. Aside from being much faster than +Microsoft-compatible linkers, the Borland smart linker will cull unused +procedures and data items, even to the extent of trimming them out of +defined objects if they're not needed. 
For one of the few times in our +lives, we don't have to compromise between completeness and efficiency. +When we're writing a TP unit, we can make it as complete as we like, +including any member functions and data items we may think we will ever +need, confident that doing so will not create unwanted bloat in the +compiled and linked executable. + +The point, really, is simply this: By using TP's unit mechanism, we can +have all the advantages and convenience of writing small, seemingly +stand-alone test programs, without having to constantly rewrite the +support functions that we need. Once written, the TP units sit there, +quietly waiting to do their duty and give us the support we need, when +we need it. + +Using this principle, in Installment 15 I set out to minimize our +tendency to re-invent the wheel by organizing our code into separate +Turbo Pascal units, each containing different parts of the compiler. We +ended up with the following units: + +* Input +* Output +* Errors +* Scanner +* Parser +* CodeGen + +Each of these units serves a different function, and encapsulates +specific areas of functionality. The Input and Output units, as their +name implies, provide character stream I/O and the all-important +lookahead character upon which our predictive parser is based. The +Errors unit, of course, provides standard error handling. The Scanner +unit contains all of our boolean functions such as IsAlpha, and the +routines GetName and GetNumber, which process multi-character tokens. + +The two units we'll be working with the most, and the ones that most +represent the personality of our compiler, are Parser and CodeGen. +Theoretically, the Parser unit should encapsulate all aspects of the +compiler that depend on the syntax of the compiled language (though, as +we saw last time, a small amount of this syntax spills over into +Scanner). Similarly, the code generator unit, CodeGen, contains all of +the code dependent upon the target machine. 
In this installment, we'll +be continuing with the development of the functions in these two all- +important units. + + + + +JUST LIKE CLASSICAL? + +Before we proceed, however, I think I should clarify the relationship +between, and the functionality of these units. Those of you who are +familiar with compiler theory as taught in universities will, of course, +recognize the names, Scanner, Parser, and CodeGen, all of which are +components of a classical compiler implementation. You may be thinking +that I've abandoned my commitment to the KISS philosophy, and drifted +towards a more conventional architecture than we once had. A closer +look, however, should convince you that, while the names are similar, +the functionalities are quite different. + +Together, the scanner and parser of a classical implementation comprise +the so-called "front end," and the code generator, the back end. The +front end routines process the language-dependent, syntax-related +aspects of the source language, while the code generator, or back end, +deals with the target machine-dependent parts of the problem. In +classical compilers, the two ends communicate via a file of instructions +written in an intermediate language (IL). + +Typically, a classical scanner is a single procedure, operating as a co- +procedure with the parser. It "tokenizes" the source file, reading it +character by character, recognizing language elements, translating them +into tokens, and passing them along to the parser. You can think of the +parser as an abstract machine, executing "op codes," which are the +tokens. Similarly, the parser generates op codes of a second abstract +machine, which mechanizes the IL. Typically, the IL file is written to +disk by the parser, and read back again by the code generator. + +Our organization is quite different. 
We have no lexical scanner, in the +classical sense; our unit Scanner, though it has a similar name, is not +a single procedure or co-procedure, but merely a set of separate +subroutines which are called by the parser as needed. + +Similarly, the classical code generator, the back end, is a translator +in its own right, reading an IL "source" file, and emitting an object +file. Our code generator doesn't work that way. In our compiler, there +IS no intermediate language; every construct in the source language +syntax is converted into assembly language as it is recognized by the +parser. Like Scanner, the unit CodeGen consists of individual +procedures which are called by the parser as needed. + +This "code 'em as you find 'em" philosophy may not produce the world's +most efficient code -- for example, we haven't provided (yet!) a +convenient place for an optimizer to work its magic -- but it sure does +simplify the compiler, doesn't it? + +And that observation prompts me to reflect, once again, on how we have +managed to reduce a compiler's functions to such comparatively simple +terms. I've waxed eloquent on this subject in past installments, so I +won't belabor the point too much here. However, because of the time +that's elapsed since those last soliloquies, I hope you'll grant me just +a little time to remind myself, as well as you, how we got here. We got +here by applying several principles that writers of commercial compilers +seldom have the luxury of using. These are: + +o The KISS philosophy -- Never do things the hard way without a +reason + +o Lazy coding -- Never put off until tomorrow what you can put +off forever (with credits to P.J. Plauger) + +o Skepticism -- Stubborn refusal to do something just because +that's the way it's always been done. 
+ +o Acceptance of inefficient code + +o Rejection of arbitrary constraints + +As I've reviewed the history of compiler construction, I've learned that +virtually every production compiler in history has suffered from pre- +imposed conditions that strongly influenced its design. The original +FORTRAN compiler of John Backus, et al, had to compete with assembly +language, and therefore was constrained to produce extremely efficient +code. The IBM compilers for the minicomputers of the 70's had to run in +the very small RAM memories then available -- as small as 4k. The early +Ada compiler had to compile itself. Per Brinch Hansen decreed that his +Pascal compiler developed for the IBM PC must execute in a 64k machine. +Compilers developed in Computer Science courses had to compile the +widest variety of languages, and therefore required LALR parsers. + +In each of these cases, these preconceived constraints literally +dominated the design of the compiler. + +A good example is Brinch Hansen's compiler, described in his excellent +book, "Brinch Hansen on Pascal Compilers" (highly recommended). Though +his compiler is one of the most clear and un-obscure compiler +implementations I've seen, that one decision, to compile large files in +a small RAM, totally drives the design, and he ends up with not just +one, but many intermediate files, together with the drivers to write and +read them. + +In time, the architectures resulting from such decisions have found +their way into computer science lore as articles of faith. In this one +man's opinion, it's time that they were re-examined critically. The +conditions, environments, and requirements that led to classical +architectures are not the same as the ones we have today. There's no +reason to believe the solutions should be the same, either. 
+ +In this tutorial, we've followed the leads of such pioneers in the world +of small compilers for PCs as Leor Zolman, Ron Cain, and James Hendrix, +who didn't know enough compiler theory to know that they "couldn't do it +that way." We have resolutely refused to accept arbitrary constraints, +but rather have done whatever was easy. As a result, we have evolved an +architecture that, while quite different from the classical one, gets +the job done in very simple and straightforward fashion. + +I'll end this philosophizing with an observation re the notion of an +intermediate language. While I've noted before that we don't have one +in our compiler, that's not exactly true; we _DO_ have one, or at least +are evolving one, in the sense that we are defining code generation +functions for the parser to call. In essence, every call to a code +generation procedure can be thought of as an instruction in an +intermediate language. Should we ever find it necessary to formalize an +intermediate language, this is the way we would do it: emit codes from +the parser, each representing a call to one of the code generator +procedures, and then process each code by calling those procedures in a +separate pass, implemented in a back end. Frankly, I don't see that +we'll ever find a need for this approach, but there is the connection, +if you choose to follow it, between the classical and the current +approaches. + + + +FLESHING OUT THE PARSER + +Though I promised you, somewhere along about Installment 14, that we'd +never again write every single function from scratch, I ended up +starting to do just that in Installment 15. One reason: that long +hiatus between the two installments made a review seem eminently +justified ... even imperative, both for you and for me. More +importantly, the decision to collect the procedures into modules +(units), forced us to look at each one yet again, whether we wanted to +or not. 
And, finally and frankly, I've had some new ideas in the last +four years that warranted a fresh look at some old friends. When I +first began this series, I was frankly amazed, and pleased, to learn +just how simple parsing routines can be made. But this last time +around, I've surprised myself yet again, and been able to make them just +that last little bit simpler, yet. + +Still, because of this total rewrite of the parsing modules, I was only +able to include so much in the last installment. Because of this, our +hero, the parser, when last seen, was a shadow of his former self, +consisting of only enough code to parse and process a factor consisting +of either a variable or a constant. The main effort of this current +installment will be to help flesh out the parser to its former glory. +In the process, I hope you'll bear with me if we sometimes cover ground +we've long since been over and dealt with. + +First, let's take care of a problem that we've addressed before: Our +current version of procedure Factor, as we left it in Installment 15, +can't handle negative arguments. To fix that, we'll introduce the +procedure SignedFactor: + + +{--------------------------------------------------------------} +{ Parse and Translate a Factor with Optional Sign } + +procedure SignedFactor; +var Sign: char; +begin + Sign := Look; + if IsAddop(Look) then + GetChar; + Factor; + if Sign = '-' then Negate; +end; +{--------------------------------------------------------------} + + +Note that this procedure calls a new code generation routine, Negate: + + +{--------------------------------------------------------------} +{ Negate Primary } + +procedure Negate; +begin + EmitLn('NEG D0'); +end; +{--------------------------------------------------------------} + + +(Here, and elsewhere in this series, I'm only going to show you the new +routines. I'm counting on you to put them into the proper unit, which +you should normally have no trouble identifying. 
Don't forget to add +the procedure's prototype to the interface section of the unit.) + +In the main program, simply change the procedure called from Factor to +SignedFactor, and give the code a test. Isn't it neat how the Turbo +linker and make facility handle all the details? + +Yes, I know, the code isn't very efficient. If we input a number, -3, +the generated code is: + + MOVE #3,D0 + NEG D0 + +which is really, really dumb. We can do better, of course, by simply +prepending a minus sign to the string passed to LoadConstant, but it +adds a few lines of code to SignedFactor, and I'm applying the KISS +philosophy very aggressively here. What's more, to tell the truth, I +think I'm subconsciously enjoying generating "really, really dumb" code, +so I can have the pleasure of watching it get dramatically better when +we get into optimization methods. + +Most of you have never heard of John Spray, so allow me to introduce him +to you here. John's from New Zealand, and used to teach computer +science at one of its universities. John wrote a compiler for the +Motorola 6809, based on a delightful, Pascal-like language of his own +design called "Whimsical." He later ported the compiler to the 68000, +and for a while it was the only compiler I had for my homebrewed 68000 +system. + +For the record, one of my standard tests for any new compiler is to see +how the compiler deals with a null program like: + + program main; + begin + end. + +My test is to measure the time required to compile and link, and the +size of the object file generated. The undisputed _LOSER_ in the test +is the DEC C compiler for the VAX, which took 60 seconds to compile, on +a VAX 11/780, and generated a 50k object file. John's compiler is the +undisputed, once, future, and forever king in the code size department. 
+Given the null program, Whimsical generates precisely two bytes of code, +implementing the one instruction, + + RET + +By setting a compiler option to generate an include file rather than a +standalone program, John can even cut this size, from two bytes to zero! +Sort of hard to beat a null object file, wouldn't you say? + +Needless to say, I consider John to be something of an expert on code +optimization, and I like what he has to say: "The best way to optimize +is not to have to optimize at all, but to produce good code in the first +place." Words to live by. When we get started on optimization, we'll +follow John's advice, and our first step will not be to add a peephole +optimizer or other after-the-fact device, but to improve the quality of +the code emitted before optimization. So make a note of SignedFactor as +a good first candidate for attention, and for now we'll leave it be. + +TERMS AND EXPRESSIONS + +I'm sure you know what's coming next: We must, yet again, create the +rest of the procedures that implement the recursive-descent parsing of +an expression. We all know that the hierarchy of procedures for +arithmetic expressions is: + +expression + term + factor + +However, for now let's continue to do things one step at a time, +and consider only expressions with additive terms in them. 
The +code to implement expressions, including a possibly signed first +term, is shown next: + + +{--------------------------------------------------------------} +{ Parse and Translate an Expression } + +procedure Expression; +begin + SignedFactor; + while IsAddop(Look) do + case Look of + '+': Add; + '-': Subtract; + end; +end; +{--------------------------------------------------------------} + + +This procedure calls two other procedures to process the +operations: + + +{--------------------------------------------------------------} +{ Parse and Translate an Addition Operation } + +procedure Add; +begin + Match('+'); + Push; + Factor; + PopAdd; +end; + + +{--------------------------------------------------------------} +{ Parse and Translate a Subtraction Operation } + +procedure Subtract; +begin + Match('-'); + Push; + Factor; + PopSub; +end; +{--------------------------------------------------------------} + + +The three procedures Push, PopAdd, and PopSub are new code generation +routines. As the name implies, procedure Push generates code to push +the primary register (D0, in our 68000 implementation) to the stack. +PopAdd and PopSub pop the top of the stack again, and add it to, or +subtract it from, the primary register. The code is shown next: + + + + +{--------------------------------------------------------------} +{ Push Primary to Stack } + +procedure Push; +begin + EmitLn('MOVE D0,-(SP)'); +end; + +{--------------------------------------------------------------} +{ Add TOS to Primary } + +procedure PopAdd; +begin + EmitLn('ADD (SP)+,D0'); +end; + +{--------------------------------------------------------------} +{ Subtract TOS from Primary } + +procedure PopSub; +begin + EmitLn('SUB (SP)+,D0'); + Negate; +end; +{--------------------------------------------------------------} + + +Add these routines to Parser and CodeGen, and change the main program to +call Expression. Voila! 
+ +The next step, of course, is to add the capability for dealing with +multiplicative terms. To that end, we'll add a procedure Term, and code +generation procedures PopMul and PopDiv. These code generation +procedures are shown next: + + +{--------------------------------------------------------------} +{ Multiply TOS by Primary } + +procedure PopMul; +begin + EmitLn('MULS (SP)+,D0'); +end; + +{--------------------------------------------------------------} +{ Divide Primary by TOS } + +procedure PopDiv; +begin + EmitLn('MOVE (SP)+,D7'); + EmitLn('EXT.L D7'); + EmitLn('DIVS D0,D7'); + EmitLn('MOVE D7,D0'); +end; +{--------------------------------------------------------------} + + +I admit, the division routine is a little busy, but there's no help for +it. Unfortunately, while the 68000 CPU allows a division using the top +of stack (TOS), it wants the arguments in the wrong order, just as it +does for subtraction. So our only recourse is to pop the stack to a +scratch register (D7), perform the division there, and then move the +result back to our primary register, D0. Note the use of signed multiply +and divide operations. This follows an implied, but unstated, +assumption, that all our variables will be signed 16-bit integers. This +decision will come back to haunt us later, when we start looking at +multiple data types, type conversions, etc. + +Our procedure Term is virtually a clone of Expression, and looks like +this: + + +{--------------------------------------------------------------} +{ Parse and Translate a Term } + +procedure Term; +begin + Factor; + while IsMulop(Look) do + case Look of + '*': Multiply; + '/': Divide; + end; +end; +{--------------------------------------------------------------} + + +Our next step is to change some names. 
SignedFactor now becomes +SignedTerm, and the calls to Factor in Expression, Add, Subtract and +SignedTerm get changed to call Term: + + +{--------------------------------------------------------------} +{ Parse and Translate a Term with Optional Leading Sign } + +procedure SignedTerm; +var Sign: char; +begin + Sign := Look; + if IsAddop(Look) then + GetChar; + Term; + if Sign = '-' then Negate; +end; +{--------------------------------------------------------------} +... +{--------------------------------------------------------------} +{ Parse and Translate an Expression } + +procedure Expression; +begin + SignedTerm; + while IsAddop(Look) do + case Look of + '+': Add; + '-': Subtract; + end; +end; +{--------------------------------------------------------------} + + +If memory serves me correctly, we once had BOTH a procedure SignedFactor +and a procedure SignedTerm. I had reasons for doing that at the time ... +they had to do with the handling of Boolean algebra and, in particular, +the Boolean "not" function. But certainly, for arithmetic operations, +that duplication isn't necessary. In an expression like: + + -x*y + +it's very apparent that the sign goes with the whole TERM, x*y, and not +just the factor x, and that's the way Expression is coded. + +Test this new code by executing Main. It still calls Expression, so you +should now be able to deal with expressions containing any of the four +arithmetic operators. + +Our last bit of business, as far as expressions goes, is to modify +procedure Factor to allow for parenthetical expressions. By using a +recursive call to Expression, we can reduce the needed code to virtually +nothing. 
Five lines added to Factor do the job: + + +{--------------------------------------------------------------} +{ Parse and Translate a Factor } + +procedure Factor; +begin + if Look ='(' then begin + Match('('); + Expression; + Match(')'); + end + else if IsDigit(Look) then + LoadConstant(GetNumber) + else if IsAlpha(Look)then + LoadVariable(GetName) + else + Error('Unrecognized character ' + Look); +end; +{--------------------------------------------------------------} + + +At this point, your "compiler" should be able to handle any legal +expression you can throw at it. Better yet, it should reject all +illegal ones! + +ASSIGNMENTS + +As long as we're this close, we might as well create the code to deal +with an assignment statement. This code needs only to remember the name +of the target variable where we are to store the result of an +expression, call Expression, then store the number. The procedure is +shown next: + + +{--------------------------------------------------------------} +{ Parse and Translate an Assignment Statement } + +procedure Assignment; +var Name: string; +begin + Name := GetName; + Match('='); + Expression; + StoreVariable(Name); +end; +{--------------------------------------------------------------} + +The assignment calls for yet another code generation routine: + + + +{--------------------------------------------------------------} +{ Store the Primary Register to a Variable } + +procedure StoreVariable(Name: string); +begin + EmitLn('LEA ' + Name + '(PC),A0'); + EmitLn('MOVE D0,(A0)'); +end; +{--------------------------------------------------------------} + + +Now, change the call in Main to call Assignment, and you should see a +full assignment statement being processed correctly. Pretty neat, eh? +And painless, too. + +In the past, we've always tried to show BNF relations to define the +syntax we're developing. I haven't done that here, and it's high time I +did. 
Here's the BNF: + + + <factor> ::= <variable> | <constant> | '(' <expression> ')' + <signed_term> ::= [<addop>] <term> + <term> ::= <factor> (<mulop> <factor>)* + <expression> ::= <signed_term> (<addop> <term>)* + <assignment> ::= <variable> '=' <expression> + +BOOLEANS + +The next step, as we've learned several times before, is to add Boolean +algebra. In the past, this step has at least doubled the amount of code +we've had to write. As I've gone over this step in my mind, I've found +myself diverging more and more from what we did in previous +installments. To refresh your memory, I noted that Pascal treats the +Boolean operators pretty much identically to the way it treats +arithmetic ones. A Boolean "and" has the same precedence level as +multiplication, and the "or" as addition. C, on the other hand, sets +them at different precedence levels, and all told has a whopping 17 +levels. In our earlier work, I chose something in between, with seven +levels. As a result, we ended up with things called Boolean +expressions, paralleling in most details the arithmetic expressions, but +at a different precedence level. All of this, as it turned out, came +about because I didn't like having to put parentheses around the Boolean +expressions in statements like: + + IF (c >= 'A') and (c <= 'Z') then ... + +In retrospect, that seems a pretty petty reason to add many layers of +complexity to the parser. Perhaps more to the point, I'm not sure I was +even able to avoid the parens. + +For kicks, let's start anew, taking a more Pascal-ish approach, and just +treat the Boolean operators at the same precedence level as the +arithmetic ones. We'll see where it leads us. If it seems to be down +the garden path, we can always backtrack to the earlier approach. + +For starters, we'll add the "addition-level" operators to Expression. 
+That's easily done; first, modify the function IsAddop in unit Scanner +to include two extra operators: '|' for "or," and '~' for "exclusive +or": + + + + +{--------------------------------------------------------------} +function IsAddop(c: char): boolean; +begin + IsAddop := c in ['+','-', '|', '~']; +end; +{--------------------------------------------------------------} + + +Next, we must include the parsing of the operators in procedure +Expression: + + +{--------------------------------------------------------------} +procedure Expression; +begin + SignedTerm; + while IsAddop(Look) do + case Look of + '+': Add; + '-': Subtract; + '|': _Or; + '~': _Xor; + end; +end; +{--------------------------------------------------------------} + + +(The underscores are needed, of course, because "or" and "xor" are +reserved words in Turbo Pascal.) + +Next, the procedures _Or and _Xor: + + +{--------------------------------------------------------------} +{ Parse and Translate a Boolean Or Operation } + +procedure _Or; +begin + Match('|'); + Push; + Term; + PopOr; +end; + +{--------------------------------------------------------------} +{ Parse and Translate an Exclusive Or Operation } + +procedure _Xor; +begin + Match('~'); + Push; + Term; + PopXor; +end; +{--------------------------------------------------------------} + +And, finally, the new code generator procedures: + + + + +{--------------------------------------------------------------} +{ Or TOS with Primary } + +procedure PopOr; +begin + EmitLn('OR (SP)+,D0'); +end; + +{--------------------------------------------------------------} +{ Exclusive-Or TOS with Primary } + +procedure PopXor; +begin + EmitLn('EOR (SP)+,D0'); +end; +{--------------------------------------------------------------} + +Now, let's test the translator (you might want to change the call +in Main back to a call to Expression, just to avoid having to type +"x=" for an assignment every time). + +So far, so good. 
The parser nicely handles expressions of the +form: + + x|y~z + +Unfortunately, it also does nothing to protect us from mixing +Boolean and arithmetic algebra. It will merrily generate code +for: + + (a+b)*(c~d) + +We've talked about this a bit, in the past. In general the rules +for what operations are legal or not cannot be enforced by the +parser itself, because they are not part of the syntax of the +language, but rather its semantics. A compiler that doesn't allow +mixed-mode expressions of this sort must recognize that c and d +are Boolean variables, rather than numeric ones, and balk at +multiplying them in the next step. But this "policing" can't be +done by the parser; it must be handled somewhere between the +parser and the code generator. We aren't in a position to enforce +such rules yet, because we haven't got either a way of declaring +types, or a symbol table to store the types in. So, for what +we've got to work with at the moment, the parser is doing +precisely what it's supposed to do. + +Anyway, are we sure that we DON'T want to allow mixed-type +operations? We made the decision some time ago (or, at least, I +did) to adopt the value 0000 as a Boolean "false," and -1, or +FFFFh, as a Boolean "true." The nice part about this choice is +that bitwise operations work exactly the same way as logical ones. +In other words, when we do an operation on one bit of a logical +variable, we do it on all of them. This means that we don't need +to distinguish between logical and bitwise operations, as is done +in C with the operators & and &&, and | and ||. Reducing the +number of operators by half certainly doesn't seem all bad. + +From the point of view of the data in storage, of course, the +computer and compiler couldn't care less whether the number FFFFh +represents the logical TRUE, or the numeric -1. Should we? I +sort of think not. 
I can think of many examples (though they +might be frowned upon as "tricky" code) where the ability to mix +the types might come in handy. For example, the Dirac delta function, +which could be coded in one simple line: + + -(x=0) + +or the absolute value function (DEFINITELY tricky code!): + + x*(1+2*(x<0)) + +Please note, I'm not advocating coding like this as a way of life. +I'd almost certainly write these functions in more readable form, +using IFs, just to keep from confusing later maintainers. Still, +a moral question arises: Do we have the right to ENFORCE our +ideas of good coding practice on the programmer, by writing the +language so he can't do anything else? That's what Niklaus Wirth +did, in many places in Pascal, and Pascal has been criticized for +it -- for not being as "forgiving" as C. + +An interesting parallel presents itself in the example of the +Motorola 68000 design. Though Motorola brags loudly about the +orthogonality of their instruction set, the fact is that it's far +from orthogonal. For example, you can read a variable from its +address: + + MOVE X,D0 (where X is the name of a variable) + +but you can't write in the same way. To write, you must load an +address register with the address of X. The same is true for PC- +relative addressing: + + MOVE X(PC),D0 (legal) + MOVE D0,X(PC) (illegal) + +When you begin asking how such non-orthogonal behavior came about, +you find that someone in Motorola had some theories about how +software should be written. Specifically, in this case, they +decided that self-modifying code, which you can implement using +PC-relative writes, is a Bad Thing. Therefore, they designed the +processor to prohibit it. Unfortunately, in the process they also +prohibited _ALL_ writes of the forms shown above, however benign. +Note that this was not something done by default. Extra design +work had to be done, and extra gates added, to destroy the natural +orthogonality of the instruction set. 
+ +One of the lessons I've learned from life: If you have two +choices, and can't decide which one to take, sometimes the best +thing to do is nothing. Why add extra gates to a processor to +enforce some stranger's idea of good programming practice? Leave +the instructions in, and let the programmers debate what good +programming practice is. Similarly, why should we add extra code +to our parser, to test for and prevent conditions that the user +might prefer to do, anyway? I'd rather leave the compiler simple, +and let the software experts debate whether the practices should +be used or not. + +All of which serves as rationalization for my decision as to how +to prevent mixed-type arithmetic: I won't. For a language +intended for systems programming, the fewer rules, the better. If +you don't agree, and want to test for such conditions, we can do +it once we have a symbol table. + +BOOLEAN "AND" + +With that bit of philosophy out of the way, we can press on to the +"and" operator, which goes into procedure Term. 
By now, you can +probably do this without me, but here's the code, anyway: + +In Scanner, + +{--------------------------------------------------------------} +function IsMulop(c: char): boolean; +begin + IsMulop := c in ['*','/', '&']; +end; +{--------------------------------------------------------------} + +In Parser, + + +{--------------------------------------------------------------} +procedure Term; +begin + Factor; + while IsMulop(Look) do + case Look of + '*': Multiply; + '/': Divide; + '&': _And; + end; +end; + +{--------------------------------------------------------------} +{ Parse and Translate a Boolean And Operation } + +procedure _And; +begin + Match('&'); + Push; + Factor; + PopAnd; +end; +{--------------------------------------------------------------} + +and in CodeGen, + + +{--------------------------------------------------------------} +{ And Primary with TOS } + +procedure PopAnd; +begin + EmitLn('AND (SP)+,D0'); +end; +{--------------------------------------------------------------} + +Your parser should now be able to process almost any sort of logical +expression, and (should you be so inclined), mixed-mode expressions as +well. + +Why not "all sorts of logical expressions"? Because, so far, we haven't +dealt with the logical "not" operator, and this is where it gets tricky. +The logical "not" operator seems, at first glance, to be identical in +its behavior to the unary minus, so my first thought was to let the +exclusive or operator, '~', double as the unary "not." That didn't +work. In my first attempt, procedure SignedTerm simply ate my '~', +because the character passed the test for an addop, but SignedTerm +ignores all addops except '-'. It would have been easy enough to add +another line to SignedTerm, but that would still not solve the problem, +because note that Expression only accepts a signed term for the _FIRST_ +argument. 
+ +Mathematically, an expression like: + + -a * -b + +makes little or no sense, and the parser should flag it as an error. +But the same expression, using a logical "not," makes perfect sense: + + not a and not b + +In the case of these unary operators, choosing to make them act the same +way seems an artificial force fit, sacrificing reasonable behavior on +the altar of implementational ease. While I'm all for keeping the +implementation as simple as possible, I don't think we should do so at +the expense of reasonableness. Patching like this would be missing the +main point, which is that the logical "not" is simply NOT the same kind +of animal as the unary minus. Consider the exclusive or, which is most +naturally written as: + + a~b ::= (a and not b) or (not a and b) + +If we allow the "not" to modify the whole term, the last term in +parentheses would be interpreted as: + + not(a and b) + +which is not the same thing at all. So it's clear that the logical +"not" must be thought of as connected to the FACTOR, not the term. + +The idea of overloading the '~' operator also makes no sense from a +mathematical point of view. The implication of the unary minus is that +it's equivalent to a subtraction from zero: + + -x <=> 0-x + +In fact, in one of my more simple-minded versions of Expression, I +reacted to a leading addop by simply preloading a zero, then processing +the operator as though it were a binary operator. But a "not" is not +equivalent to an exclusive or with zero ... that would just give back +the original number. Instead, it's an exclusive or with FFFFh, or -1. + +In short, the seeming parallel between the unary "not" and the unary +minus falls apart under closer scrutiny. "not" modifies the factor, not +the term, and it is related to neither the unary minus nor the +exclusive or. Therefore, it deserves a symbol to call its own. What +better symbol than the obvious one, also used by C, the '!' character? 
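The claim that "not" is an exclusive or with FFFFh, and not with zero, is easy to check in Pascal itself. The little program below is only an illustration and has nothing to do with the compiler; the variable names t and f are made up here, and it assumes 16-bit words with FALSE = 0 and TRUE = FFFFh, as in the text.

```pascal
program NotVersusMinus;
{ Illustrative only: 16-bit words, FALSE = 0 and TRUE = $FFFF (i.e. -1). }
var
  t, f: word;
begin
  t := $FFFF;
  f := 0;
  { Exclusive or with zero hands back the operand, so it cannot be "not". }
  WriteLn('TRUE  xor 0     = ', t xor 0);
  WriteLn('FALSE xor 0     = ', f xor 0);
  { Exclusive or with $FFFF flips every bit, which is the "not" we want. }
  WriteLn('TRUE  xor $FFFF = ', t xor $FFFF);
  WriteLn('FALSE xor $FFFF = ', f xor $FFFF);
end.
```

The first pair of lines should echo the operands unchanged, and the second pair should show them swapped. That is the sense in which the unary minus (0-x) and the "not" (FFFFh xor x) are different animals.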
+Using the rules about the way we think the "not" should behave, we +should be able to code the exclusive or (assuming we'd ever need to), in +the very natural form: + + a & !b | !a & b + +Note that no parentheses are required -- the precedence levels we've +chosen automatically take care of things. + +If you're keeping score on the precedence levels, this definition puts +the '!' at the top of the heap. The levels become: + +1. ! +2. - (unary) +3. *, /, & +4. +, -, |, ~ + +Looking at this list, it's certainly not hard to see why we had trouble +using '~' as the "not" symbol! + +So how do we mechanize the rules? In the same way as we did with +SignedTerm, but at the factor level. We'll define a procedure +NotFactor: + + +{--------------------------------------------------------------} +{ Parse and Translate a Factor with Optional "Not" } + +procedure NotFactor; +begin + if Look = '!' then begin + Match('!'); + Factor; + NotIt; + end + else + Factor; +end; +{--------------------------------------------------------------} + + +and call it from all the places where we formerly called Factor, i.e., +from Term, Multiply, Divide, and _And. Note the new code generation +procedure: + + +{--------------------------------------------------------------} +{ Bitwise Not Primary } + +procedure NotIt; +begin + EmitLn('EOR #-1,D0'); +end; + +{--------------------------------------------------------------} + + +Try this now, with a few simple cases. In fact, try that exclusive or +example, + + a&!b|!a&b + + +You should get the code (without the comments, of course): + + MOVE A(PC),D0 ; load a + MOVE D0,-(SP) ; push it + MOVE B(PC),D0 ; load b + EOR #-1,D0 ; not it + AND (SP)+,D0 ; and with a + MOVE D0,-(SP) ; push result + MOVE A(PC),D0 ; load a + EOR #-1,D0 ; not it + MOVE D0,-(SP) ; push it + MOVE B(PC),D0 ; load b + AND (SP)+,D0 ; and with !a + OR (SP)+,D0 ; or with first term + +That's precisely what we'd like to get. 
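As a cross-check on the logic (not on the generated 68000 code), Pascal's own operators can stand in for ours: Pascal ranks "not" above "and," and "and" above "or," which parallels the levels chosen here for '!', '&' and '|'. The sketch below is an illustration only. The names Vals and viaNot are made up, and it assumes the 0 / FFFFh convention for false and true.

```pascal
program XorFromNot;
{ Illustrative only: checks a & !b | !a & b against a ~ b for Boolean values. }
const
  Vals: array[0..1] of word = (0, $FFFF);    { FALSE and TRUE }
var
  i, j: integer;
  a, b, viaNot: word;
begin
  for i := 0 to 1 do
    for j := 0 to 1 do
    begin
      a := Vals[i];
      b := Vals[j];
      { No parentheses: "not" binds to the factor, "and" to the term. }
      viaNot := a and not b or not a and b;
      WriteLn(a, ' ', b, ' -> ', viaNot, ' ', a xor b);
    end;
end.
```

The last two columns of every row should agree. If the "not" were applied to the whole term instead of the factor, they would not.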
So, at least for both +arithmetic and logical operators, our new precedence and new, slimmer +syntax hang together. Even the peculiar, but legal, expression with +leading addop: + + ~x + +makes sense. SignedTerm ignores the leading '~', as it should, since +the expression is equivalent to: + + 0~x, + +which is equal to x. + +When we look at the BNF we've created, we find that our boolean algebra +now adds only one extra line: + + + <not_factor> ::= [!] <factor> + <factor> ::= <variable> | <constant> | '(' <expression> ')' + <signed_term> ::= [<addop>] <term> + <term> ::= <not_factor> (<mulop> <not_factor>)* + <expression> ::= <signed_term> (<addop> <term>)* + <assignment> ::= <variable> '=' <expression> + + +That's a big improvement over earlier efforts. Will our luck continue +to hold when we get to relational operators? We'll find out soon, but +it will have to wait for the next installment. We're at a good stopping +place, and I'm anxious to get this installment into your hands. It's +already been a year since the release of Installment 15. I blush to +admit that all of this current installment has been ready for almost as +long, with the exception of relational operators. But the information +does you no good at all, sitting on my hard disk, and by holding it back +until the relational operations were done, I've kept it out of your +hands for that long. It's time for me to let go of it and get it out +where you can get value from it. Besides, there are quite a number of +serious philosophical questions associated with the relational +operators, as well, and I'd rather save them for a separate installment +where I can do them justice. + +Have fun with the new, leaner arithmetic and logical parsing, and I'll +see you soon with relationals. + + + +***************************************************************** +* * +* COPYRIGHT NOTICE * +* * +* Copyright (C) 1995 Jack W. Crenshaw. All rights reserved. 
* +* * +* * +***************************************************************** + + + \ No newline at end of file diff --git a/Compiler construction/Lets_build_compiler/tutor2.txt b/Compiler construction/Lets_build_compiler/tutor2.txt new file mode 100644 index 0000000..c4fe8b9 --- /dev/null +++ b/Compiler construction/Lets_build_compiler/tutor2.txt @@ -0,0 +1,792 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + LET'S BUILD A COMPILER! + + By + + Jack W. Crenshaw, Ph.D. + + 24 July 1988 + + + Part II: EXPRESSION PARSING + + +***************************************************************** +* * +* COPYRIGHT NOTICE * +* * +* Copyright (C) 1988 Jack W. Crenshaw. All rights reserved. * +* * +***************************************************************** + + +GETTING STARTED + +If you've read the introduction document to this series, you will +already know what we're about. You will also have copied the +cradle software into your Turbo Pascal system, and have compiled +it. So you should be ready to go. + + +The purpose of this article is for us to learn how to parse and +translate mathematical expressions. What we would like to see as +output is a series of assembler-language statements that perform +the desired actions. For purposes of definition, an expression +is the right-hand side of an equation, as in + + x = 2*y + 3/(4*z) + +In the early going, I'll be taking things in _VERY_ small steps. +That's so that the beginners among you won't get totally lost. +There are also some very good lessons to be learned early on, +that will serve us well later. For the more experienced readers: +bear with me. We'll get rolling soon enough. + +SINGLE DIGITS + +In keeping with the whole theme of this series (KISS, remember?), +let's start with the absolutely most simple case we can think of. +That, to me, is an expression consisting of a single digit. + +Before starting to code, make sure you have a baseline copy of +the "cradle" that I gave last time. 
We'll be using it again for +other experiments. Then add this code: + + +{---------------------------------------------------------------} +{ Parse and Translate a Math Expression } + +procedure Expression; +begin + EmitLn('MOVE #' + GetNum + ',D0') +end; +{---------------------------------------------------------------} + + +And add the line "Expression;" to the main program so that it +reads: + + +{---------------------------------------------------------------} +begin + Init; + Expression; +end. +{---------------------------------------------------------------} + + +Now run the program. Try any single-digit number as input. You +should get a single line of assembler-language output. Now try +any other character as input, and you'll see that the parser +properly reports an error. + + +CONGRATULATIONS! You have just written a working translator! + +OK, I grant you that it's pretty limited. But don't brush it off +too lightly. This little "compiler" does, on a very limited +scale, exactly what any larger compiler does: it correctly +recognizes legal statements in the input "language" that we have +defined for it, and it produces correct, executable assembler +code, suitable for assembling into object format. Just as +importantly, it correctly recognizes statements that are NOT +legal, and gives a meaningful error message. Who could ask for +more? As we expand our parser, we'd better make sure those two +characteristics always hold true. + +There are some other features of this tiny program worth +mentioning. First, you can see that we don't separate code +generation from parsing ... as soon as the parser knows what we +want done, it generates the object code directly. In a real +compiler, of course, the reads in GetChar would be from a disk +file, and the writes to another disk file, but this way is much +easier to deal with while we're experimenting. + +Also note that an expression must leave a result somewhere. I've +chosen the 68000 register D0. 
I could have made some other +choices, but this one makes sense. + + +BINARY EXPRESSIONS + +Now that we have that under our belt, let's branch out a bit. +Admittedly, an "expression" consisting of only one character is +not going to meet our needs for long, so let's see what we can do +to extend it. Suppose we want to handle expressions of the form: + + 1+2 + or 4-3 + or, in general, <term> +/- <term> + +(That's a bit of Backus-Naur Form, or BNF.) + +To do this we need a procedure that recognizes a term and leaves +its result somewhere, and another that recognizes and +distinguishes between a '+' and a '-' and generates the +appropriate code. But if Expression is going to leave its result +in D0, where should Term leave its result? Answer: the same +place. We're going to have to save the first result of Term +somewhere before we get the next one. + +OK, basically what we want to do is have procedure Term do what +Expression was doing before. So just RENAME procedure Expression +as Term, and enter the following new version of Expression: + + + + +{---------------------------------------------------------------} +{ Parse and Translate an Expression } + +procedure Expression; +begin + Term; + EmitLn('MOVE D0,D1'); + case Look of + '+': Add; + '-': Subtract; + else Expected('Addop'); + end; +end; +{--------------------------------------------------------------} + + +Next, just above Expression enter these two procedures: + + +{--------------------------------------------------------------} +{ Recognize and Translate an Add } + +procedure Add; +begin + Match('+'); + Term; + EmitLn('ADD D1,D0'); +end; + + +{-------------------------------------------------------------} +{ Recognize and Translate a Subtract } + +procedure Subtract; +begin + Match('-'); + Term; + EmitLn('SUB D1,D0'); +end; +{-------------------------------------------------------------} + + +When you're finished with that, the order of the routines should +be: + + o Term (The OLD Expression) + o Add + o Subtract + o 
Expression + +Now run the program. Try any combination you can think of of two +single digits, separated by a '+' or a '-'. You should get a +series of four assembler-language instructions out of each run. +Now try some expressions with deliberate errors in them. Does +the parser catch the errors? + +Take a look at the object code generated. There are two +observations we can make. First, the code generated is NOT what +we would write ourselves. The sequence + + MOVE #n,D0 + MOVE D0,D1 + +is inefficient. If we were writing this code by hand, we would +probably just load the data directly to D1. + +There is a message here: code generated by our parser is less +efficient than the code we would write by hand. Get used to it. +That's going to be true throughout this series. It's true of all +compilers to some extent. Computer scientists have devoted whole +lifetimes to the issue of code optimization, and there are indeed +things that can be done to improve the quality of code output. +Some compilers do quite well, but there is a heavy price to pay +in complexity, and it's a losing battle anyway ... there will +probably never come a time when a good assembler-language pro- +grammer can't out-program a compiler. Before this session is +over, I'll briefly mention some ways that we can do a little op- +timization, just to show you that we can indeed improve things +without too much trouble. But remember, we're here to learn, not +to see how tight we can make the object code. For now, and +really throughout this series of articles, we'll studiously +ignore optimization and concentrate on getting out code that +works. + +Speaking of which: ours DOESN'T! The code is _WRONG_! As things +are working now, the subtraction process subtracts D1 (which has +the FIRST argument in it) from D0 (which has the second). That's +the wrong way, so we end up with the wrong sign for the result. 
+So let's fix up procedure Subtract with a sign-changer, so that +it reads + + +{-------------------------------------------------------------} +{ Recognize and Translate a Subtract } + +procedure Subtract; +begin + Match('-'); + Term; + EmitLn('SUB D1,D0'); + EmitLn('NEG D0'); +end; +{-------------------------------------------------------------} + + +Now our code is even less efficient, but at least it gives the +right answer! Unfortunately, the rules that give the meaning of +math expressions require that the terms in an expression come out +in an inconvenient order for us. Again, this is just one of +those facts of life you learn to live with. This one will come +back to haunt us when we get to division. + +OK, at this point we have a parser that can recognize the sum or +difference of two digits. Earlier, we could only recognize a +single digit. But real expressions can have either form (or an +infinity of others). For kicks, go back and run the program with +the single input line '1'. + +Didn't work, did it? And why should it? We just finished +telling our parser that the only kinds of expressions that are +legal are those with two terms. We must rewrite procedure +Expression to be a lot more broadminded, and this is where things +start to take the shape of a real parser. + + + + +GENERAL EXPRESSIONS + +In the REAL world, an expression can consist of one or more +terms, separated by "addops" ('+' or '-'). In BNF, this is +written + + <expression> ::= <term> [<addop> <term>]* + + +We can accommodate this definition of an expression with the +addition of a simple loop to procedure Expression: + + +{---------------------------------------------------------------} +{ Parse and Translate an Expression } + +procedure Expression; +begin + Term; + while Look in ['+', '-'] do begin + EmitLn('MOVE D0,D1'); + case Look of + '+': Add; + '-': Subtract; + else Expected('Addop'); + end; + end; +end; +{--------------------------------------------------------------} + + +NOW we're getting somewhere! 
This version handles any number of +terms, and it only cost us two extra lines of code. As we go on, +you'll discover that this is characteristic of top-down parsers +... it only takes a few lines of code to accommodate extensions to +the language. That's what makes our incremental approach +possible. Notice, too, how well the code of procedure Expression +matches the BNF definition. That, too, is characteristic of the +method. As you get proficient in the approach, you'll find that +you can turn BNF into parser code just about as fast as you can +type! + +OK, compile the new version of our parser, and give it a try. As +usual, verify that the "compiler" can handle any legal +expression, and will give a meaningful error message for an +illegal one. Neat, eh? You might note that in our test version, +any error message comes out sort of buried in whatever code had +already been generated. But remember, that's just because we are +using the CRT as our "output file" for this series of +experiments. In a production version, the two outputs would be +separated ... one to the output file, and one to the screen. + + +USING THE STACK + +At this point I'm going to violate my rule that we don't +introduce any complexity until it's absolutely necessary, long +enough to point out a problem with the code we're generating. As +things stand now, the parser uses D0 for the "primary" register, +and D1 as a place to store the partial sum. That works fine for +now, because as long as we deal with only the "addops" '+' and +'-', any new term can be added in as soon as it is found. But in +general that isn't true. Consider, for example, the expression + + 1+(2-(3+(4-5))) + +If we put the '1' in D1, where do we put the '2'? Since a +general expression can have any degree of complexity, we're going +to run out of registers fast! + +Fortunately, there's a simple solution. 
Like every modern +microprocessor, the 68000 has a stack, which is the perfect place +to save a variable number of items. So instead of moving the term +in D0 to D1, let's just push it onto the stack. For the benefit +of those unfamiliar with 68000 assembler language, a push is +written + + -(SP) + +and a pop, (SP)+ . + + +So let's change the EmitLn in Expression to read: + + EmitLn('MOVE D0,-(SP)'); + +and the two lines in Add and Subtract to + + EmitLn('ADD (SP)+,D0') + +and EmitLn('SUB (SP)+,D0'), + +respectively. Now try the parser again and make sure we haven't +broken it. + +Once again, the generated code is less efficient than before, but +it's a necessary step, as you'll see. + + +MULTIPLICATION AND DIVISION + +Now let's get down to some REALLY serious business. As you all +know, there are other math operators than "addops" ... +expressions can also have multiply and divide operations. You +also know that there is an implied operator PRECEDENCE, or +hierarchy, associated with expressions, so that in an expression +like + + 2 + 3 * 4, + +we know that we're supposed to multiply FIRST, then add. (See +why we needed the stack?) + +In the early days of compiler technology, people used some rather +complex techniques to ensure that the operator precedence rules +were obeyed. It turns out, though, that none of this is +necessary ... the rules can be accommodated quite nicely by our +top-down parsing technique. Up till now, the only form that +we've considered for a term is that of a single decimal digit. + +More generally, we can define a term as a PRODUCT of FACTORS; +i.e., + + <term> ::= <factor> [ <mulop> <factor> ]* + + <factor> ::= (<expression>) + +This is where the recursion comes in. An expression can contain a +factor which contains another expression which contains a factor, +etc., ad infinitum. 
+ +Complicated or not, we can take care of this by adding just a few +lines of Pascal to procedure Factor: + + +{---------------------------------------------------------------} +{ Parse and Translate a Math Factor } + +procedure Expression; Forward; + +procedure Factor; +begin + if Look = '(' then begin + Match('('); + Expression; + Match(')'); + end + else + EmitLn('MOVE #' + GetNum + ',D0'); +end; +{--------------------------------------------------------------} + + +Note again how easily we can extend the parser, and how well the +Pascal code matches the BNF syntax. + +As usual, compile the new version and make sure that it correctly +parses legal sentences, and flags illegal ones with an error +message. + + +UNARY MINUS + +At this point, we have a parser that can handle just about any +expression, right? OK, try this input sentence: + + -1 + +WOOPS! It doesn't work, does it? Procedure Expression expects +everything to start with an integer, so it coughs up the leading +minus sign. You'll find that +3 won't work either, nor will +something like + + -(3-2) . + +There are a couple of ways to fix the problem. The easiest +(although not necessarily the best) way is to stick an imaginary +leading zero in front of expressions of this type, so that -3 +becomes 0-3. We can easily patch this into our existing version +of Expression: + + + +{---------------------------------------------------------------} +{ Parse and Translate an Expression } + +procedure Expression; +begin + if IsAddop(Look) then + EmitLn('CLR D0') + else + Term; + while IsAddop(Look) do begin + EmitLn('MOVE D0,-(SP)'); + case Look of + '+': Add; + '-': Subtract; + else Expected('Addop'); + end; + end; +end; +{--------------------------------------------------------------} + + +I TOLD you that making changes was easy! This time it cost us +only three new lines of Pascal. Note the new reference to +function IsAddop. Since the test for an addop appeared twice, I +chose to embed it in the new function. 
The form of IsAddop +should be apparent from that for IsAlpha. Here it is: + + +{--------------------------------------------------------------} +{ Recognize an Addop } + +function IsAddop(c: char): boolean; +begin + IsAddop := c in ['+', '-']; +end; +{--------------------------------------------------------------} + + +OK, make these changes to the program and recompile. You should +also include IsAddop in your baseline copy of the cradle. We'll +be needing it again later. Now try the input -1 again. Wow! +The efficiency of the code is pretty poor ... six lines of code +just for loading a simple constant ... but at least it's correct. +Remember, we're not trying to replace Turbo Pascal here. + +At this point we're just about finished with the structure of our +expression parser. This version of the program should correctly +parse and compile just about any expression you care to throw at +it. It's still limited in that we can only handle factors +involving single decimal digits. But I hope that by now you're +starting to get the message that we can accommodate further +extensions with just some minor changes to the parser. You +probably won't be surprised to hear that a variable or even a +function call is just another kind of a factor. + +In the next session, I'll show you just how easy it is to extend +our parser to take care of these things too, and I'll also show +you just how easily we can accommodate multicharacter numbers and +variable names. So you see, we're not far at all from a truly +useful parser. + + + + +A WORD ABOUT OPTIMIZATION + +Earlier in this session, I promised to give you some hints as to +how we can improve the quality of the generated code. As I said, +the production of tight code is not the main purpose of this +series of articles. But you need to at least know that we aren't +just wasting our time here ... that we can indeed modify the +parser further to make it produce better code, without throwing +away everything we've done to date.
As usual, it turns out that +SOME optimization is not that difficult to do ... it simply takes +some extra code in the parser. + +There are two basic approaches we can take: + + o Try to fix up the code after it's generated + + This is the concept of "peephole" optimization. The general + idea is that we know what combinations of instructions the + compiler is going to generate, and we also know which ones + are pretty bad (such as the code for -1, above). So all we + do is to scan the produced code, looking for those + combinations, and replacing them by better ones. It's sort + of a macro expansion, in reverse, and a fairly + straightforward exercise in pattern-matching. The only + complication, really, is that there may be a LOT of such + combinations to look for. It's called peephole optimization + simply because it only looks at a small group of instructions + at a time. Peephole optimization can have a dramatic effect + on the quality of the code, with little change to the + structure of the compiler itself. There is a price to pay, + though, in the speed, size, and complexity of the + compiler. Looking for all those combinations calls for a lot + of IF tests, each one of which is a source of error. And, of + course, it takes time. + + In the classical implementation of a peephole optimizer, + it's done as a second pass to the compiler. The output code + is written to disk, and then the optimizer reads and + processes the disk file again. As a matter of fact, you can + see that the optimizer could even be a separate PROGRAM from + the compiler proper. Since the optimizer only looks at the + code through a small "window" of instructions (hence the + name), a better implementation would be to simply buffer up a + few lines of output, and scan the buffer after each EmitLn. + + o Try to generate better code in the first place + + This approach calls for us to look for special cases BEFORE + we Emit them.
As a trivial example, we should be able to + identify a constant zero, and Emit a CLR instead of a load, + or even do nothing at all, as in an add of zero, for example. + Closer to home, if we had chosen to recognize the unary minus + in Factor instead of in Expression, we could treat constants + like -1 as ordinary constants, rather than generating them + from positive ones. None of these things are difficult to + deal with ... they only add extra tests in the code, which is + why I haven't included them in our program. The way I see + it, once we get to the point that we have a working compiler, + generating useful code that executes, we can always go back + and tweak the thing to tighten up the code produced. That's + why there are Release 2.0's in the world. + +There IS one more type of optimization worth mentioning, that +seems to promise pretty tight code without too much hassle. It's +my "invention" in the sense that I haven't seen it suggested in +print anywhere, though I have no illusions that it's original +with me. + +This is to avoid such a heavy use of the stack, by making better +use of the CPU registers. Remember back when we were doing only +addition and subtraction, that we used registers D0 and D1, +rather than the stack? It worked, because with only those two +operations, the "stack" never needs more than two entries. + +Well, the 68000 has eight data registers. Why not use them as a +privately managed stack? The key is to recognize that, at any +point in its processing, the parser KNOWS how many items are on +the stack, so it can indeed manage it properly. We can define a +private "stack pointer" that keeps track of which stack level +we're at, and addresses the corresponding register. Procedure +Factor, for example, would not cause data to be loaded into +register D0, but into whatever the current "top-of-stack" +register happened to be.
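To make the idea a little more concrete, here is a minimal sketch of how a parser-managed register stack might emit code. It is only an illustration, not the author's implementation: the names StackLevel, RegName, LoadConst, and BinaryOp are hypothetical, and the sketch assumes the expression never needs more than eight levels, so it simply gives up if it runs out of registers.

```pascal
program RegStack;

{ Hypothetical sketch of a parser-managed register stack.   }
{ None of these names come from the parser in this series.  }

const MaxLevel = 8;              { the 68000 has D0..D7 }

var StackLevel: integer;         { how many items are "on the stack" }

{ Name of the data register that holds a given level (1..8) }
function RegName(Level: integer): string;
begin
   RegName := 'D' + Chr(Ord('0') + Level - 1);
end;

{ "Push" a constant: load it into the next free register }
procedure LoadConst(Value: string);
begin
   if StackLevel >= MaxLevel then begin
      WriteLn('Error: out of registers (not handled in this sketch).');
      Halt;
   end;
   StackLevel := StackLevel + 1;
   WriteLn(#9, 'MOVE #', Value, ',', RegName(StackLevel));
end;

{ Combine the top two levels; the result replaces the lower one }
procedure BinaryOp(Opcode: string);
begin
   WriteLn(#9, Opcode, ' ', RegName(StackLevel), ',',
           RegName(StackLevel - 1));
   StackLevel := StackLevel - 1;
end;

begin
   StackLevel := 0;
   { The calls a parser would make, in order, for 2 + 3 * 4 }
   LoadConst('2');
   LoadConst('3');
   LoadConst('4');
   BinaryOp('MULS');             { D1 := D1 * D2 }
   BinaryOp('ADD');              { D0 := D0 + D1 }
end.
```

Run as is, this prints five instructions for 2 + 3 * 4: three loads into D0, D1, and D2, then MULS D2,D1 and ADD D1,D0, with no pushes or pops at all. Only MULS and ADD are shown; division and subtraction would need their own care about operand order.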
+ +What we're doing in effect is to replace the CPU's RAM stack with +a locally managed stack made up of registers. For most +expressions, the stack level will never exceed eight, so we'll +get pretty good code out. Of course, we also have to deal with +those odd cases where the stack level DOES exceed eight, but +that's no problem either. We simply let the stack spill over +into the CPU stack. For levels beyond eight, the code is no +worse than what we're generating now, and for levels less than +eight, it's considerably better. + +For the record, I have implemented this concept, just to make +sure it works before I mentioned it to you. It does. In +practice, it turns out that you can't really use all eight levels +... you need at least one register free to reverse the operand +order for division (sure wish the 68000 had an XTHL, like the +8080!). For expressions that include function calls, we would +also need a register reserved for them. Still, there is a nice +improvement in code size for most expressions. + +So, you see, getting better code isn't that difficult, but it +does add complexity to our translator ... complexity we can +do without at this point. For that reason, I STRONGLY suggest +that we continue to ignore efficiency issues for the rest of this +series, secure in the knowledge that we can indeed improve the +code quality without throwing away what we've done. + +Next lesson, I'll show you how to deal with variables and +function calls. I'll also show you just how easy it is to handle +multicharacter tokens and embedded white space. + +***************************************************************** +* * +* COPYRIGHT NOTICE * +* * +* Copyright (C) 1988 Jack W. Crenshaw. All rights reserved.
* +* * +***************************************************************** + + + + diff --git a/Compiler construction/Lets_build_compiler/tutor3.txt b/Compiler construction/Lets_build_compiler/tutor3.txt new file mode 100644 index 0000000..27f4edb --- /dev/null +++ b/Compiler construction/Lets_build_compiler/tutor3.txt @@ -0,0 +1,946 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + LET'S BUILD A COMPILER! + + By + + Jack W. Crenshaw, Ph.D. + + 4 Aug 1988 + + + Part III: MORE EXPRESSIONS + + +***************************************************************** +* * +* COPYRIGHT NOTICE * +* * +* Copyright (C) 1988 Jack W. Crenshaw. All rights reserved. * +* * +***************************************************************** + + +INTRODUCTION + +In the last installment, we examined the techniques used to parse +and translate a general math expression. We ended up with a +simple parser that could handle arbitrarily complex expressions, +with two restrictions: + + o No variables were allowed, only numeric factors + + o The numeric factors were limited to single digits + +In this installment, we'll get rid of those restrictions. We'll +also extend what we've done to include assignment statements +and function calls. Remember, though, that the second +restriction was mainly self-imposed ... a choice of convenience +on our part, to make life easier and to let us concentrate on the +fundamental concepts. As you'll see in a bit, it's an easy +restriction to get rid of, so don't get too hung up about it. +We'll use the trick when it serves us to do so, confident that we +can discard it when we're ready to. + + +VARIABLES + +Most expressions that we see in practice involve variables, such +as + + b * b + 4 * a * c + +No parser is much good without being able to deal with them. +Fortunately, it's also quite easy to do. + +Remember that in our parser as it currently stands, there are two +kinds of factors allowed: integer constants and expressions +within parentheses.
In BNF notation, + + <factor> ::= <number> | (<expression>) + +The '|' stands for "or", meaning of course that either form is a +legal form for a factor. Remember, too, that we had no trouble +knowing which was which ... the lookahead character is a left +paren '(' in one case, and a digit in the other. + +It probably won't come as too much of a surprise that a variable +is just another kind of factor. So we extend the BNF above to +read: + + + <factor> ::= <number> | (<expression>) | <variable> + + +Again, there is no ambiguity: if the lookahead character is a +letter, we have a variable; if a digit, we have a number. Back +when we translated the number, we just issued code to load the +number, as immediate data, into D0. Now we do the same, only we +load a variable. + +A minor complication in the code generation arises from the fact +that most 68000 operating systems, including the SK*DOS that I'm +using, require the code to be written in "position-independent" +form, which basically means that everything is PC-relative. The +format for a load in this language is + + MOVE X(PC),D0 + +where X is, of course, the variable name. Armed with that, let's +modify the current version of Factor to read: + + +{---------------------------------------------------------------} +{ Parse and Translate a Math Factor } + +procedure Expression; Forward; + +procedure Factor; +begin + if Look = '(' then begin + Match('('); + Expression; + Match(')'); + end + else if IsAlpha(Look) then + EmitLn('MOVE ' + GetName + '(PC),D0') + else + EmitLn('MOVE #' + GetNum + ',D0'); +end; +{--------------------------------------------------------------} + + +I've remarked before how easy it is to add extensions to the +parser, because of the way it's structured. You can see that +this still holds true here. This time it cost us all of two +extra lines of code. Notice, too, how the if-else-else structure +exactly parallels the BNF syntax equation. + +OK, compile and test this new version of the parser. That didn't +hurt too badly, did it?
+ + +FUNCTIONS + +There is only one other common kind of factor supported by most +languages: the function call. It's really too early for us to +deal with functions well, because we haven't yet addressed the +issue of parameter passing. What's more, a "real" language would +include a mechanism to support more than one type, one of which +should be a function type. We haven't gotten there yet, either. +But I'd still like to deal with functions now for a couple of +reasons. First, it lets us finally wrap up the parser in +something very close to its final form, and second, it brings up +a new issue which is very much worth talking about. + +Up till now, we've been able to write what is called a +"predictive parser." That means that at any point, we can know +by looking at the current lookahead character exactly what to do +next. That isn't the case when we add functions. Every language +has some naming rules for what constitutes a legal identifier. +For the present, ours is simply that it is one of the letters +'a'..'z'. The problem is that a variable name and a function +name obey the same rules. So how can we tell which is which? +One way is to require that they each be declared before they are +used. Pascal takes that approach. The other is that we might +require a function to be followed by a (possibly empty) parameter +list. That's the rule used in C. + +Since we don't yet have a mechanism for declaring types, let's +use the C rule for now. Since we also don't have a mechanism to +deal with parameters, we can only handle empty lists, so our +function calls will have the form + + x() . + +Since we're not dealing with parameter lists yet, there is +nothing to do but to call the function, so we need only to issue +a BSR (call) instead of a MOVE. + +Now that there are two possibilities for the "If IsAlpha" branch +of the test in Factor, let's treat them in a separate procedure. 
+Modify Factor to read: + + +{---------------------------------------------------------------} +{ Parse and Translate a Math Factor } + +procedure Expression; Forward; + +procedure Factor; +begin + if Look = '(' then begin + Match('('); + Expression; + Match(')'); + end + else if IsAlpha(Look) then + Ident + else + EmitLn('MOVE #' + GetNum + ',D0'); +end; +{--------------------------------------------------------------} + + +and insert before it the new procedure + + +{---------------------------------------------------------------} +{ Parse and Translate an Identifier } + +procedure Ident; +var Name: char; +begin + Name := GetName; + if Look = '(' then begin + Match('('); + Match(')'); + EmitLn('BSR ' + Name); + end + else + EmitLn('MOVE ' + Name + '(PC),D0') +end; +{---------------------------------------------------------------} + + +OK, compile and test this version. Does it parse all legal +expressions? Does it correctly flag badly formed ones? + +The important thing to notice is that even though we no longer +have a predictive parser, there is little or no complication +added with the recursive descent approach that we're using. At +the point where Factor finds an identifier (letter), it doesn't +know whether it's a variable name or a function name, nor does it +really care. It simply passes it on to Ident and leaves it up to +that procedure to figure it out. Ident, in turn, simply tucks +away the identifier and then reads one more character to decide +which kind of identifier it's dealing with. + +Keep this approach in mind. It's a very powerful concept, and it +should be used whenever you encounter an ambiguous situation +requiring further lookahead. Even if you had to look several +tokens ahead, the principle would still work. + + +MORE ON ERROR HANDLING + +As long as we're talking philosophy, there's another important +issue to point out: error handling. 
Notice that although the +parser correctly rejects (almost) every malformed expression we +can throw at it, with a meaningful error message, we haven't +really had to do much work to make that happen. In fact, in the +whole parser per se (from Ident through Expression) there are +only two calls to the error routine, Expected. Even those aren't +necessary ... if you'll look again in Term and Expression, you'll +see that those statements can't be reached. I put them in early +on as a bit of insurance, but they're no longer needed. Why +don't you delete them now? + +So how did we get this nice error handling virtually for free? +It's simply that I've carefully avoided reading a character +directly using GetChar. Instead, I've relied on the error +handling in GetName, GetNum, and Match to do all the error +checking for me. Astute readers will notice that some of the +calls to Match (for example, the ones in Add and Subtract) are +also unnecessary ... we already know what the character is by the +time we get there ... but it maintains a certain symmetry to +leave them in, and the general rule to always use Match instead +of GetChar is a good one. + +I mentioned an "almost" above. There is a case where our error +handling leaves a bit to be desired. So far we haven't told our +parser what an end-of-line looks like, or what to do with +embedded white space. So a space character (or any other +character not part of the recognized character set) simply causes +the parser to terminate, ignoring the unrecognized characters. + +It could be argued that this is reasonable behavior at this +point. In a "real" compiler, there is usually another statement +following the one we're working on, so any characters not treated +as part of our expression will either be used for or rejected as +part of the next one. + +But it's also a very easy thing to fix up, even if it's only +temporary.
All we have to do is assert that the expression +should end with an end-of-line, i.e., a carriage return. + +To see what I'm talking about, try the input line + + 1+2 3+4 + +See how the space was treated as a terminator? Now, to make the +compiler properly flag this, add the line + + if Look <> CR then Expected('Newline'); + +in the main program, just after the call to Expression. That +catches anything left over in the input stream. Don't forget to +define CR in the const statement: + + CR = ^M; + +As usual, recompile the program and verify that it does what it's +supposed to. + + +ASSIGNMENT STATEMENTS + +OK, at this point we have a parser that works very nicely. I'd +like to point out that we got it using only 88 lines of +executable code, not counting what was in the cradle. The +compiled object file is a whopping 4752 bytes. Not bad, +considering we weren't trying very hard to save either source +code or object size. We just stuck to the KISS principle. + +Of course, parsing an expression is not much good without having +something to do with it afterwards. Expressions USUALLY (but not +always) appear in assignment statements, in the form + + <Ident> = <Expression> + +We're only a breath away from being able to parse an assignment +statement, so let's take that last step. Just after procedure +Expression, add the following new procedure: + + +{--------------------------------------------------------------} +{ Parse and Translate an Assignment Statement } + +procedure Assignment; +var Name: char; +begin + Name := GetName; + Match('='); + Expression; + EmitLn('LEA ' + Name + '(PC),A0'); + EmitLn('MOVE D0,(A0)') +end; +{--------------------------------------------------------------} + + +Note again that the code exactly parallels the BNF. And notice +further that the error checking was painless, handled by GetName +and Match. + +The reason for the two lines of assembler has to do with a +peculiarity in the 68000, which requires this kind of construct +for PC-relative code.
+ +Now change the call to Expression, in the main program, to one to +Assignment. That's all there is to it. + +Son of a gun! We are actually compiling assignment statements. +If those were the only kind of statements in a language, all we'd +have to do is put this in a loop and we'd have a full-fledged +compiler! + +Well, of course they're not the only kind. There are also little +items like control statements (IFs and loops), procedures, +declarations, etc. But cheer up. The arithmetic expressions +that we've been dealing with are among the most challenging in a +language. Compared to what we've already done, control +statements will be easy. I'll be covering them in the fifth +installment. And the other statements will all fall in line, as +long as we remember to KISS. + + +MULTI-CHARACTER TOKENS + +Throughout this series, I've been carefully restricting +everything we do to single-character tokens, all the while +assuring you that it wouldn't be difficult to extend to multi- +character ones. I don't know if you believed me or not ... I +wouldn't really blame you if you were a bit skeptical. I'll +continue to use that approach in the sessions which follow, +because it helps keep complexity away. But I'd like to back up +those assurances, and wrap up this portion of the parser, by +showing you just how easy that extension really is. In the +process, we'll also provide for embedded white space. Before you +make the next few changes, though, save the current version of +the parser away under another name. I have some more uses for it +in the next installment, and we'll be working with the single- +character version. + +Most compilers separate out the handling of the input stream into +a separate module called the lexical scanner. The idea is that +the scanner deals with all the character-by-character input, and +returns the separate units (tokens) of the stream. There may +come a time when we'll want to do something like that, too, but +for now there is no need. 
We can handle the multi-character +tokens that we need by very slight and very local modifications +to GetName and GetNum. + +The usual definition of an identifier is that the first character +must be a letter, but the rest can be alphanumeric (letters or +numbers). To deal with this, we need one other recognizer +function + + +{--------------------------------------------------------------} +{ Recognize an Alphanumeric } + +function IsAlNum(c: char): boolean; +begin + IsAlNum := IsAlpha(c) or IsDigit(c); +end; +{--------------------------------------------------------------} + + +Add this function to your parser. I put mine just after IsDigit. +While you're at it, might as well include it as a permanent +member of Cradle, too. + +Now, we need to modify function GetName to return a string +instead of a character: + + +{--------------------------------------------------------------} +{ Get an Identifier } + +function GetName: string; +var Token: string; +begin + Token := ''; + if not IsAlpha(Look) then Expected('Name'); + while IsAlNum(Look) do begin + Token := Token + UpCase(Look); + GetChar; + end; + GetName := Token; +end; +{--------------------------------------------------------------} + + +Similarly, modify GetNum to read: + + +{--------------------------------------------------------------} +{ Get a Number } + +function GetNum: string; +var Value: string; +begin + Value := ''; + if not IsDigit(Look) then Expected('Integer'); + while IsDigit(Look) do begin + Value := Value + Look; + GetChar; + end; + GetNum := Value; +end; +{--------------------------------------------------------------} + + +Amazingly enough, that is virtually all the changes required to +the parser! The local variable Name in procedures Ident and +Assignment was originally declared as "char", and must now be +declared string[8]. (Clearly, we could make the string length +longer if we chose, but most assemblers limit the length anyhow.) +Make this change, and then recompile and test. 
_NOW_ do you +believe that it's a simple change? + + +WHITE SPACE + +Before we leave this parser for awhile, let's address the issue +of white space. As it stands now, the parser will barf (or +simply terminate) on a single space character embedded anywhere +in the input stream. That's pretty unfriendly behavior. So +let's "productionize" the thing a bit by eliminating this last +restriction. + +The key to easy handling of white space is to come up with a +simple rule for how the parser should treat the input stream, and +to enforce that rule everywhere. Up till now, because white +space wasn't permitted, we've been able to assume that after each +parsing action, the lookahead character Look contains the next +meaningful character, so we could test it immediately. Our +design was based upon this principle. + +It still sounds like a good rule to me, so that's the one we'll +use. This means that every routine that advances the input +stream must skip over white space, and leave the next non-white +character in Look. Fortunately, because we've been careful to +use GetName, GetNum, and Match for most of our input processing, +it is only those three routines (plus Init) that we need to +modify. 
+ +Not surprisingly, we start with yet another new recognizer +routine: + + +{--------------------------------------------------------------} +{ Recognize White Space } + +function IsWhite(c: char): boolean; +begin + IsWhite := c in [' ', TAB]; +end; +{--------------------------------------------------------------} + + +We also need a routine that will eat white-space characters, +until it finds a non-white one: + + +{--------------------------------------------------------------} +{ Skip Over Leading White Space } + +procedure SkipWhite; +begin + while IsWhite(Look) do + GetChar; +end; +{--------------------------------------------------------------} + + +Now, add calls to SkipWhite to Match, GetName, and GetNum as +shown below: + + +{--------------------------------------------------------------} +{ Match a Specific Input Character } + +procedure Match(x: char); +begin + if Look <> x then Expected('''' + x + '''') + else begin + GetChar; + SkipWhite; + end; +end; + + +{--------------------------------------------------------------} +{ Get an Identifier } + +function GetName: string; +var Token: string; +begin + Token := ''; + if not IsAlpha(Look) then Expected('Name'); + while IsAlNum(Look) do begin + Token := Token + UpCase(Look); + GetChar; + end; + GetName := Token; + SkipWhite; +end; + + +{--------------------------------------------------------------} +{ Get a Number } + +function GetNum: string; +var Value: string; +begin + Value := ''; + if not IsDigit(Look) then Expected('Integer'); + while IsDigit(Look) do begin + Value := Value + Look; + GetChar; + end; + GetNum := Value; + SkipWhite; +end; +{--------------------------------------------------------------} + +(Note that I rearranged Match a bit, without changing the +functionality.) 
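If you'd like to watch the skip-after-every-token rule at work without the rest of the parser, here is a small self-contained sketch. It is only a test harness under my own assumptions: Source, Next, GetToken, and the sample input are hypothetical names that stand in for the keyboard input and for GetName/GetNum/Match, and GetChar here reads from a string and pretends the line ends with a CR.

```pascal
program WhiteDemo;

{ Hypothetical harness: Source, Next, and GetToken are not part }
{ of the parser; they stand in for keyboard input and scanning. }

const TAB = ^I;
      CR  = ^M;

var Source: string;    { the whole input line }
    Next:   integer;   { index of the next unread character }
    Look:   char;      { lookahead character }

{ Read the next character from Source; return CR at the end }
procedure GetChar;
begin
   if Next <= Length(Source) then begin
      Look := Source[Next];
      Next := Next + 1;
   end
   else
      Look := CR;
end;

function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB];
end;

procedure SkipWhite;
begin
   while IsWhite(Look) do
      GetChar;
end;

function IsAlNum(c: char): boolean;
begin
   IsAlNum := UpCase(c) in ['A'..'Z', '0'..'9'];
end;

{ Read one token: a run of alphanumerics, or one other character }
function GetToken: string;
var Token: string;
begin
   Token := '';
   if IsAlNum(Look) then
      while IsAlNum(Look) do begin
         Token := Token + UpCase(Look);
         GetChar;
      end
   else begin
      Token := Look;
      GetChar;
   end;
   GetToken := Token;
   SkipWhite;          { the rule: leave Look on a non-white character }
end;

begin
   Source := '  alpha =  12 +' + TAB + 'beta ';
   Next := 1;
   GetChar;            { load the first character ... }
   SkipWhite;          { ... and skip any leading blanks }
   while Look <> CR do
      WriteLn('[', GetToken, ']');
end.
```

With the sample line above, each token comes out on its own line in brackets, with every blank and tab gone. The main loop never has to think about white space, because every routine that consumes input leaves Look on the next meaningful character.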
+ +Finally, we need to skip over leading blanks where we "prime the +pump" in Init: + +{--------------------------------------------------------------} +{ Initialize } + +procedure Init; +begin + GetChar; + SkipWhite; +end; +{--------------------------------------------------------------} + + +Make these changes and recompile the program. You will find that +you will have to move Match below SkipWhite, to avoid an error +message from the Pascal compiler. Test the program as always to +make sure it works properly. + +Since we've made quite a few changes during this session, I'm +reproducing the entire parser below: + + +{--------------------------------------------------------------} +program parse; + +{--------------------------------------------------------------} +{ Constant Declarations } + +const TAB = ^I; + CR = ^M; + +{--------------------------------------------------------------} +{ Variable Declarations } + +var Look: char; { Lookahead Character } + +{--------------------------------------------------------------} +{ Read New Character From Input Stream } + +procedure GetChar; +begin + Read(Look); +end; + +{--------------------------------------------------------------} +{ Report an Error } + +procedure Error(s: string); +begin + WriteLn; + WriteLn(^G, 'Error: ', s, '.'); +end; + + +{--------------------------------------------------------------} +{ Report Error and Halt } + +procedure Abort(s: string); +begin + Error(s); + Halt; +end; + + +{--------------------------------------------------------------} +{ Report What Was Expected } + +procedure Expected(s: string); +begin + Abort(s + ' Expected'); +end; + + +{--------------------------------------------------------------} +{ Recognize an Alpha Character } + +function IsAlpha(c: char): boolean; +begin + IsAlpha := UpCase(c) in ['A'..'Z']; +end; + + +{--------------------------------------------------------------} +{ Recognize a Decimal Digit } + +function IsDigit(c: char): boolean; +begin + IsDigit := c 
in ['0'..'9']; +end; + + +{--------------------------------------------------------------} +{ Recognize an Alphanumeric } + +function IsAlNum(c: char): boolean; +begin + IsAlNum := IsAlpha(c) or IsDigit(c); +end; + + +{--------------------------------------------------------------} +{ Recognize an Addop } + +function IsAddop(c: char): boolean; +begin + IsAddop := c in ['+', '-']; +end; + + +{--------------------------------------------------------------} +{ Recognize White Space } + +function IsWhite(c: char): boolean; +begin + IsWhite := c in [' ', TAB]; +end; + + +{--------------------------------------------------------------} +{ Skip Over Leading White Space } + +procedure SkipWhite; +begin + while IsWhite(Look) do + GetChar; +end; + + +{--------------------------------------------------------------} +{ Match a Specific Input Character } + +procedure Match(x: char); +begin + if Look <> x then Expected('''' + x + '''') + else begin + GetChar; + SkipWhite; + end; +end; + + +{--------------------------------------------------------------} +{ Get an Identifier } + +function GetName: string; +var Token: string; +begin + Token := ''; + if not IsAlpha(Look) then Expected('Name'); + while IsAlNum(Look) do begin + Token := Token + UpCase(Look); + GetChar; + end; + GetName := Token; + SkipWhite; +end; + + +{--------------------------------------------------------------} +{ Get a Number } + +function GetNum: string; +var Value: string; +begin + Value := ''; + if not IsDigit(Look) then Expected('Integer'); + while IsDigit(Look) do begin + Value := Value + Look; + GetChar; + end; + GetNum := Value; + SkipWhite; +end; + + +{--------------------------------------------------------------} +{ Output a String with Tab } + +procedure Emit(s: string); +begin + Write(TAB, s); +end; + + +{--------------------------------------------------------------} +{ Output a String with Tab and CRLF } + +procedure EmitLn(s: string); +begin + Emit(s); + WriteLn; +end; + + 
+{---------------------------------------------------------------} +{ Parse and Translate an Identifier } + +procedure Ident; +var Name: string[8]; +begin + Name:= GetName; + if Look = '(' then begin + Match('('); + Match(')'); + EmitLn('BSR ' + Name); + end + else + EmitLn('MOVE ' + Name + '(PC),D0'); +end; + + +{---------------------------------------------------------------} +{ Parse and Translate a Math Factor } + +procedure Expression; Forward; + +procedure Factor; +begin + if Look = '(' then begin + Match('('); + Expression; + Match(')'); + end + else if IsAlpha(Look) then + Ident + else + EmitLn('MOVE #' + GetNum + ',D0'); +end; + + +{--------------------------------------------------------------} +{ Recognize and Translate a Multiply } + +procedure Multiply; +begin + Match('*'); + Factor; + EmitLn('MULS (SP)+,D0'); +end; + + +{-------------------------------------------------------------} +{ Recognize and Translate a Divide } + +procedure Divide; +begin + Match('/'); + Factor; + EmitLn('MOVE (SP)+,D1'); + EmitLn('EXT.L D0'); + EmitLn('DIVS D1,D0'); +end; + + +{---------------------------------------------------------------} +{ Parse and Translate a Math Term } + +procedure Term; +begin + Factor; + while Look in ['*', '/'] do begin + EmitLn('MOVE D0,-(SP)'); + case Look of + '*': Multiply; + '/': Divide; + end; + end; +end; + + +{--------------------------------------------------------------} +{ Recognize and Translate an Add } + +procedure Add; +begin + Match('+'); + Term; + EmitLn('ADD (SP)+,D0'); +end; + + +{-------------------------------------------------------------} +{ Recognize and Translate a Subtract } + +procedure Subtract; +begin + Match('-'); + Term; + EmitLn('SUB (SP)+,D0'); + EmitLn('NEG D0'); +end; + + +{---------------------------------------------------------------} +{ Parse and Translate an Expression } + +procedure Expression; +begin + if IsAddop(Look) then + EmitLn('CLR D0') + else + Term; + while IsAddop(Look) do begin + EmitLn('MOVE
D0,-(SP)'); + case Look of + '+': Add; + '-': Subtract; + end; + end; +end; + + +{--------------------------------------------------------------} +{ Parse and Translate an Assignment Statement } + +procedure Assignment; +var Name: string[8]; +begin + Name := GetName; + Match('='); + Expression; + EmitLn('LEA ' + Name + '(PC),A0'); + EmitLn('MOVE D0,(A0)') +end; + + +{--------------------------------------------------------------} +{ Initialize } + +procedure Init; +begin + GetChar; + SkipWhite; +end; + + +{--------------------------------------------------------------} +{ Main Program } + +begin + Init; + Assignment; + If Look <> CR then Expected('NewLine'); +end. +{--------------------------------------------------------------} + + +Now the parser is complete. It's got every feature we can put in +a one-line "compiler." Tuck it away in a safe place. Next time +we'll move on to a new subject, but we'll still be talking about +expressions for quite awhile. Next installment, I plan to talk a +bit about interpreters as opposed to compilers, and show you how +the structure of the parser changes a bit as we change what sort +of action has to be taken. The information we pick up there will +serve us in good stead later on, even if you have no interest in +interpreters. See you next time. + + +***************************************************************** +* * +* COPYRIGHT NOTICE * +* * +* Copyright (C) 1988 Jack W. Crenshaw. All rights reserved. * +* * +***************************************************************** + + + + diff --git a/Compiler construction/Lets_build_compiler/tutor4.txt b/Compiler construction/Lets_build_compiler/tutor4.txt new file mode 100644 index 0000000..bcaa889 --- /dev/null +++ b/Compiler construction/Lets_build_compiler/tutor4.txt @@ -0,0 +1,701 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + LET'S BUILD A COMPILER! + + By + + Jack W. Crenshaw, Ph.D. 
+ + 24 July 1988 + + + Part IV: INTERPRETERS + + +***************************************************************** +* * +* COPYRIGHT NOTICE * +* * +* Copyright (C) 1988 Jack W. Crenshaw. All rights reserved. * +* * +***************************************************************** + + +INTRODUCTION + +In the first three installments of this series, we've looked at +parsing and compiling math expressions, and worked our way grad- +ually and methodically from dealing with very simple one-term, +one-character "expressions" up through more general ones, finally +arriving at a very complete parser that could parse and translate +complete assignment statements, with multi-character tokens, +embedded white space, and function calls. This time, I'm going +to walk you through the process one more time, only with the goal +of interpreting rather than compiling object code. + +Since this is a series on compilers, why should we bother with +interpreters? Simply because I want you to see how the nature of +the parser changes as we change the goals. I also want to unify +the concepts of the two types of translators, so that you can see +not only the differences, but also the similarities. + +Consider the assignment statement + + x = 2 * y + 3 + +In a compiler, we want the target CPU to execute this assignment +at EXECUTION time. The translator itself doesn't do any arith- +metic ... it only issues the object code that will cause the CPU +to do it when the code is executed. For the example above, the +compiler would issue code to compute the expression and store the +results in variable x. + +For an interpreter, on the other hand, no object code is gen- +erated. Instead, the arithmetic is computed immediately, as the +parsing is going on. For the example, by the time parsing of the +statement is complete, x will have a new value. + +The approach we've been taking in this whole series is called +"syntax-driven translation." 
As you are aware by now, the struc- +ture of the parser is very closely tied to the syntax of the +productions we parse. We have built Pascal procedures that rec- +ognize every language construct. Associated with each of these +constructs (and procedures) is a corresponding "action," which +does whatever makes sense to do once a construct has been +recognized. In our compiler so far, every action involves +emitting object code, to be executed later at execution time. In +an interpreter, every action involves something to be done im- +mediately. + +What I'd like you to see here is that the layout ... the struc- +ture ... of the parser doesn't change. It's only the actions +that change. So if you can write an interpreter for a given +language, you can also write a compiler, and vice versa. Yet, as +you will see, there ARE differences, and significant ones. +Because the actions are different, the procedures that do the +recognizing end up being written differently. Specifically, in +the interpreter the recognizing procedures end up being coded as +FUNCTIONS that return numeric values to their callers. None of +the parsing routines for our compiler did that. + +Our compiler, in fact, is what we might call a "pure" compiler. +Each time a construct is recognized, the object code is emitted +IMMEDIATELY. (That's one reason the code is not very efficient.) +The interpreter we'll be building here is a pure interpreter, in +the sense that there is no translation, such as "tokenizing," +performed on the source code. These represent the two extremes +of translation. In the real world, translators are rarely so +pure, but tend to have bits of each technique. + +I can think of several examples. I've already mentioned one: +most interpreters, such as Microsoft BASIC, for example, trans- +late the source code (tokenize it) into an intermediate form so +that it'll be easier to parse real time. + +Another example is an assembler. 
The purpose of an assembler, of +course, is to produce object code, and it normally does that on a +one-to-one basis: one object instruction per line of source code. +But almost every assembler also permits expressions as arguments. +In this case, the expressions are always constant expressions, +and so the assembler isn't supposed to issue object code for +them. Rather, it "interprets" the expressions and computes the +corresponding constant result, which is what it actually emits as +object code. + +As a matter of fact, we could use a bit of that ourselves. The +translator we built in the previous installment will dutifully +spit out object code for complicated expressions, even though +every term in the expression is a constant. In that case it +would be far better if the translator behaved a bit more like an +interpreter, and just computed the equivalent constant result. + +There is a concept in compiler theory called "lazy" translation. +The idea is that you typically don't just emit code at every +action. In fact, at the extreme you don't emit anything at all, +until you absolutely have to. To accomplish this, the actions +associated with the parsing routines typically don't just emit +code. Sometimes they do, but often they simply return in- +formation back to the caller. Armed with such information, the +caller can then make a better choice of what to do. + +For example, given the statement + + x = x + 3 - 2 - (5 - 4) , + +our compiler will dutifully spit out a stream of 18 instructions +to load each parameter into registers, perform the arithmetic, +and store the result. A lazier evaluation would recognize that +the arithmetic involving constants can be evaluated at compile +time, and would reduce the expression to + + x = x + 0 . + +An even lazier evaluation would then be smart enough to figure +out that this is equivalent to + + x = x , + +which calls for no action at all. We could reduce 18 in- +structions to zero! 
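+
+To make the idea a little more concrete, here is a minimal
+sketch of the constant-folding half of it. It is NOT part of the
+parser we've been building: the program FoldDemo and the names
+Source, SrcPos, Konst, Operand and Sum are all hypothetical, and
+the sketch assumes sums only ('+', '-' and parentheses), single
+digits, single-letter variables and no white space. The point to
+notice is that a constant operand emits nothing; it just adds
+itself into Konst, and the caller decides at the very end whether
+any code is needed for it.
+
+{--------------------------------------------------------------}
+program FoldDemo;
+
+{ Hypothetical sketch: fold constant terms at translation time }
+
+const TAB = ^I;
+
+var Source: string;          { Text to be translated }
+    SrcPos: integer;         { Next position in Source }
+    Look:   char;            { Lookahead Character }
+    Konst:  integer;         { Running total of the constants }
+
+{ Read New Character From Source; '.' marks the end }
+
+procedure GetChar;
+begin
+   if SrcPos <= Length(Source) then begin
+      Look := Source[SrcPos];
+      SrcPos := SrcPos + 1;
+   end
+   else
+      Look := '.';
+end;
+
+{ Report Error and Halt }
+
+procedure Abort(s: string);
+begin
+   WriteLn;
+   WriteLn('Error: ', s, '.');
+   Halt;
+end;
+
+procedure Sum(Sign: integer); Forward;
+
+{ Translate one operand carrying the sign Sign (+1 or -1) }
+
+procedure Operand(Sign: integer);
+begin
+   if Look = '(' then begin
+      GetChar;
+      Sum(Sign);
+      if Look <> ')' then Abort(''')'' Expected');
+      GetChar;
+   end
+   else if Look in ['0'..'9'] then begin
+      { A constant: fold it, emit nothing }
+      Konst := Konst + Sign * (Ord(Look) - Ord('0'));
+      GetChar;
+   end
+   else if UpCase(Look) in ['A'..'Z'] then begin
+      { A variable: this one really does need code }
+      if Sign > 0 then
+         WriteLn(TAB, 'ADD ', UpCase(Look), '(PC),D0')
+      else
+         WriteLn(TAB, 'SUB ', UpCase(Look), '(PC),D0');
+      GetChar;
+   end
+   else
+      Abort('Operand Expected');
+end;
+
+{ Translate a sum whose terms all inherit the sign Sign }
+
+procedure Sum(Sign: integer);
+begin
+   Operand(Sign);
+   while Look in ['+', '-'] do
+      if Look = '+' then begin
+         GetChar;
+         Operand(Sign);
+      end
+      else begin
+         GetChar;
+         Operand(-Sign);
+      end;
+end;
+
+begin
+   Source := 'x+3-2-(5-4)';
+   SrcPos := 1;
+   Konst := 0;
+   GetChar;
+   WriteLn(TAB, 'CLR D0');
+   Sum(1);
+   { Only now do we decide whether the constants need any code }
+   if Konst > 0 then
+      WriteLn(TAB, 'ADD #', Konst, ',D0')
+   else if Konst < 0 then
+      WriteLn(TAB, 'SUB #', -Konst, ',D0');
+end.
+{--------------------------------------------------------------}
+
+For the statement above, the constants sum to zero, so the
+sketch should emit nothing for them: only the CLR and the ADD for
+x remain. Getting from there to no code at all would take the
+"even lazier" step of noticing that the result is stored straight
+back into x, which this sketch makes no attempt to do.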
+ +Note that there is no chance of optimizing this way in our trans- +lator as it stands, because every action takes place immediately. + +Lazy expression evaluation can produce significantly better +object code than we have been able to so far. I warn you, +though: it complicates the parser code considerably, because each +routine now has to make decisions as to whether to emit object +code or not. Lazy evaluation is certainly not named that because +it's easier on the compiler writer! + +Since we're operating mainly on the KISS principle here, I won't +go into much more depth on this subject. I just want you to be +aware that you can get some code optimization by combining the +techniques of compiling and interpreting. In particular, you +should know that the parsing routines in a smarter translator +will generally return things to their caller, and sometimes +expect things as well. That's the main reason for going over +interpretation in this installment. + + +THE INTERPRETER + +OK, now that you know WHY we're going into all this, let's do it. +Just to give you practice, we're going to start over with a bare +cradle and build up the translator all over again. This time, of +course, we can go a bit faster. + +Since we're now going to do arithmetic, the first thing we need +to do is to change function GetNum, which up till now has always +returned a character (or string). Now, it's better for it to +return an integer. MAKE A COPY of the cradle (for goodness's +sake, don't change the version in Cradle itself!!) 
and modify +GetNum as follows: + + +{--------------------------------------------------------------} +{ Get a Number } + +function GetNum: integer; +begin + if not IsDigit(Look) then Expected('Integer'); + GetNum := Ord(Look) - Ord('0'); + GetChar; +end; +{--------------------------------------------------------------} + + +Now, write the following version of Expression: + + +{---------------------------------------------------------------} +{ Parse and Translate an Expression } + +function Expression: integer; +begin + Expression := GetNum; +end; +{--------------------------------------------------------------} + + +Finally, insert the statement + + + Writeln(Expression); + + +at the end of the main program. Now compile and test. + +All this program does is to "parse" and translate a single +integer "expression." As always, you should make sure that it +does that with the digits 0..9, and gives an error message for +anything else. Shouldn't take you very long! + +OK, now let's extend this to include addops. Change Expression +to read: + + +{---------------------------------------------------------------} +{ Parse and Translate an Expression } + +function Expression: integer; +var Value: integer; +begin + if IsAddop(Look) then + Value := 0 + else + Value := GetNum; + while IsAddop(Look) do begin + case Look of + '+': begin + Match('+'); + Value := Value + GetNum; + end; + '-': begin + Match('-'); + Value := Value - GetNum; + end; + end; + end; + Expression := Value; +end; +{--------------------------------------------------------------} + + +The structure of Expression, of course, parallels what we did +before, so we shouldn't have too much trouble debugging it. +There's been a SIGNIFICANT development, though, hasn't there? +Procedures Add and Subtract went away! The reason is that the +action to be taken requires BOTH arguments of the operation. I +could have chosen to retain the procedures and pass into them the +value of the expression to date, which is Value. 
But it seemed +cleaner to me to keep Value as strictly a local variable, which +meant that the code for Add and Subtract had to be moved in line. +This result suggests that, while the structure we had developed +was nice and clean for our simple-minded translation scheme, it +probably wouldn't do for use with lazy evaluation. That's a +little tidbit we'll probably want to keep in mind for later. + +OK, did the translator work? Then let's take the next step. +It's not hard to figure out what procedure Term should now look +like. Change every call to GetNum in function Expression to a +call to Term, and then enter the following form for Term: + + + + +{---------------------------------------------------------------} +{ Parse and Translate a Math Term } + +function Term: integer; +var Value: integer; +begin + Value := GetNum; + while Look in ['*', '/'] do begin + case Look of + '*': begin + Match('*'); + Value := Value * GetNum; + end; + '/': begin + Match('/'); + Value := Value div GetNum; + end; + end; + end; + Term := Value; +end; +{--------------------------------------------------------------} + +Now, try it out. Don't forget two things: first, we're dealing +with integer division, so, for example, 1/3 should come out zero. +Second, even though we can output multi-digit results, our input +is still restricted to single digits. + +That seems like a silly restriction at this point, since we have +already seen how easily function GetNum can be extended. So +let's go ahead and fix it right now. 
The new version is + + +{--------------------------------------------------------------} +{ Get a Number } + +function GetNum: integer; +var Value: integer; +begin + Value := 0; + if not IsDigit(Look) then Expected('Integer'); + while IsDigit(Look) do begin + Value := 10 * Value + Ord(Look) - Ord('0'); + GetChar; + end; + GetNum := Value; +end; +{--------------------------------------------------------------} + + +If you've compiled and tested this version of the interpreter, +the next step is to install function Factor, complete with pa- +renthesized expressions. We'll hold off a bit longer on the +variable names. First, change the references to GetNum, in +function Term, so that they call Factor instead. Now code the +following version of Factor: + + + + +{---------------------------------------------------------------} +{ Parse and Translate a Math Factor } + +function Expression: integer; Forward; + +function Factor: integer; +begin + if Look = '(' then begin + Match('('); + Factor := Expression; + Match(')'); + end + else + Factor := GetNum; +end; +{---------------------------------------------------------------} + +That was pretty easy, huh? We're rapidly closing in on a useful +interpreter. + + +A LITTLE PHILOSOPHY + +Before going any further, there's something I'd like to call to +your attention. It's a concept that we've been making use of in +all these sessions, but I haven't explicitly mentioned it up till +now. I think it's time, because it's a concept so useful, and so +powerful, that it makes all the difference between a parser +that's trivially easy, and one that's too complex to deal with. + +In the early days of compiler technology, people had a terrible +time figuring out how to deal with things like operator prece- +dence ... the way that multiply and divide operators take +precedence over add and subtract, etc. I remember a colleague of +some thirty years ago, and how excited he was to find out how to +do it. 
The technique used involved building two stacks, upon +which you pushed each operator or operand. Associated with each +operator was a precedence level, and the rules required that you +only actually performed an operation ("reducing" the stack) if +the precedence level showing on top of the stack was correct. To +make life more interesting, an operator like ')' had different +precedence levels, depending upon whether or not it was already +on the stack. You had to give it one value before you put it on +the stack, and another to decide when to take it off. Just for +the experience, I worked all of this out for myself a few years +ago, and I can tell you that it's very tricky. + +We haven't had to do anything like that. In fact, by now the +parsing of an arithmetic statement should seem like child's play. +How did we get so lucky? And where did the precedence stacks go? + +A similar thing is going on in our interpreter above. You just +KNOW that in order for it to do the computation of arithmetic +statements (as opposed to the parsing of them), there have to be +numbers pushed onto a stack somewhere. But where is the stack? + +Finally, in compiler textbooks, there are a number of places +where stacks and other structures are discussed. In the other +leading parsing method (LR), an explicit stack is used. In fact, +the technique is very much like the old way of doing arithmetic +expressions. Another concept is that of a parse tree. Authors +like to draw diagrams of the tokens in a statement, connected +into a tree with operators at the internal nodes. Again, where +are the trees and stacks in our technique? We haven't seen any. +The answer in all cases is that the structures are implicit, not +explicit. In any computer language, there is a stack involved +every time you call a subroutine. Whenever a subroutine is +called, the return address is pushed onto the CPU stack. At the +end of the subroutine, the address is popped back off and control +is transferred there. 
In a recursive language such as Pascal, +there can also be local data pushed onto the stack, and it, too, +returns when it's needed. + +For example, function Expression contains a local variable +called Value, which it fills by a call to Term. Suppose, in its +next call to Term for the second argument, that Term calls +Factor, which recursively calls Expression again. That "in- +stance" of Expression gets another value for its copy of Value. +What happens to the first Value? Answer: it's still on the +stack, and will be there again when we return from our call +sequence. + +In other words, the reason things look so simple is that we've +been making maximum use of the resources of the language. The +hierarchy levels and the parse trees are there, all right, but +they're hidden within the structure of the parser, and they're +taken care of by the order in which the various procedures are +called. Now that you've seen how we do it, it's probably hard to +imagine doing it any other way. But I can tell you that it took +a lot of years for compiler writers to get that smart. The early +compilers were too complex to imagine. Funny how things get +easier with a little practice. + +The reason I've brought all this up is as both a lesson and a +warning. The lesson: things can be easy when you do them right. +The warning: take a look at what you're doing. If, as you branch +out on your own, you begin to find a real need for a separate +stack or tree structure, it may be time to ask yourself if you're +looking at things the right way. Maybe you just aren't using the +facilities of the language as well as you could be. + + +The next step is to add variable names. Now, though, we have a +slight problem. For the compiler, we had no problem in dealing +with variable names ... we just issued the names to the assembler +and let the rest of the program take care of allocating storage +for them.
Here, on the other hand, we need to be able to fetch +the values of the variables and return them as the return values +of Factor. We need a storage mechanism for these variables. + +Back in the early days of personal computing, Tiny BASIC lived. +It had a grand total of 26 possible variables: one for each +letter of the alphabet. This fits nicely with our concept of +single-character tokens, so we'll try the same trick. In the +beginning of your interpreter, just after the declaration of +variable Look, insert the line: + + Table: Array['A'..'Z'] of integer; + +We also need to initialize the array, so add this procedure: + + + + +{---------------------------------------------------------------} +{ Initialize the Variable Area } + +procedure InitTable; +var i: char; +begin + for i := 'A' to 'Z' do + Table[i] := 0; +end; +{---------------------------------------------------------------} + + +You must also insert a call to InitTable, in procedure Init. +DON'T FORGET to do that, or the results may surprise you! + +Now that we have an array of variables, we can modify Factor to +use it. Since we don't have a way (so far) to set the variables, +Factor will always return zero values for them, but let's go +ahead and extend it anyway. Here's the new version: + + +{---------------------------------------------------------------} +{ Parse and Translate a Math Factor } + +function Expression: integer; Forward; + +function Factor: integer; +begin + if Look = '(' then begin + Match('('); + Factor := Expression; + Match(')'); + end + else if IsAlpha(Look) then + Factor := Table[GetName] + else + Factor := GetNum; +end; +{---------------------------------------------------------------} + + +As always, compile and test this version of the program. Even +though all the variables are now zeros, at least we can correctly +parse the complete expressions, as well as catch any badly formed +expressions. 
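+
+If indexing an array by characters is new to you, the following
+stand-alone sketch may help. It is separate from the interpreter
+(the program name TableDemo and the value 42 are arbitrary), and
+it only shows the mechanism that Table relies on: the letter
+itself is the subscript. It also shows why it matters that
+GetName upcases what it reads, since 'x' and 'X' then land in
+the same slot.
+
+{---------------------------------------------------------------}
+program TableDemo;
+
+var Table: Array['A'..'Z'] of integer;
+    i: char;
+
+begin
+   { Same loop as InitTable: every variable starts at zero }
+   for i := 'A' to 'Z' do
+      Table[i] := 0;
+
+   { Store through a lower-case name, fetch through upper case }
+   Table[UpCase('x')] := 42;
+   WriteLn(Table['X']);
+
+   { A variable that was never assigned still reads as zero }
+   WriteLn(Table['Y']);
+end.
+{---------------------------------------------------------------}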
+ +I suppose you realize the next step: we need to do an assignment +statement so we can put something INTO the variables. For now, +let's stick to one-liners, though we will soon be handling +multiple statements. + +The assignment statement parallels what we did before: + + +{--------------------------------------------------------------} +{ Parse and Translate an Assignment Statement } + + + +procedure Assignment; +var Name: char; +begin + Name := GetName; + Match('='); + Table[Name] := Expression; +end; +{--------------------------------------------------------------} + + +To test this, I added a temporary write statement in the main +program, to print out the value of A. Then I tested it with +various assignments to it. + +Of course, an interpretive language that can only accept a single +line of program is not of much value. So we're going to want to +handle multiple statements. This merely means putting a loop +around the call to Assignment. So let's do that now. But what +should be the loop exit criterion? Glad you asked, because it +brings up a point we've been able to ignore up till now. + +One of the most tricky things to handle in any translator is to +determine when to bail out of a given construct and go look for +something else. This hasn't been a problem for us so far because +we've only allowed for a single kind of construct ... either an +expression or an assignment statement. When we start adding +loops and different kinds of statements, you'll find that we have +to be very careful that things terminate properly. If we put our +interpreter in a loop, we need a way to quit. Terminating on a +newline is no good, because that's what sends us back for another +line. We could always let an unrecognized character take us out, +but that would cause every run to end in an error message, which +certainly seems uncool. + +What we need is a termination character. I vote for Pascal's +ending period ('.'). 
A minor complication is that Turbo ends +every normal line with TWO characters, the carriage return (CR) +and line feed (LF). At the end of each line, we need to eat +these characters before processing the next one. A natural way +to do this would be with procedure Match, except that Match's +error message prints the character, which of course for the CR +and/or LF won't look so great. What we need is a special proce- +dure for this, which we'll no doubt be using over and over. Here +it is: + + +{--------------------------------------------------------------} +{ Recognize and Skip Over a Newline } + +procedure NewLine; +begin + if Look = CR then begin + GetChar; + if Look = LF then + GetChar; + end; +end; +{--------------------------------------------------------------} + + +Insert this procedure at any convenient spot ... I put mine just +after Match. (If your copy of the cradle doesn't already declare +them, you'll also need CR = ^M and LF = ^J in the constant +declarations.) Now, rewrite the main program to look like this: + + +{--------------------------------------------------------------} +{ Main Program } + +begin + Init; + repeat + Assignment; + NewLine; + until Look = '.'; +end. +{--------------------------------------------------------------} + + +Note that the test for a CR is now gone, and that there are also +no error tests within NewLine itself. That's OK, though ... +whatever is left over in terms of bogus characters will be caught +at the beginning of the next assignment statement. + +Well, we now have a functioning interpreter. It doesn't do us a +lot of good, however, since we have no way to read data in or +write it out. Sure would help to have some I/O! + +Let's wrap this session up, then, by adding the I/O routines. +Since we're sticking to single-character tokens, I'll use '?' to +stand for a read statement, and '!' for a write, with the char- +acter immediately following them to be used as a one-token +"parameter list."
Here are the routines: + +{--------------------------------------------------------------} +{ Input Routine } + +procedure Input; +begin + Match('?'); + Read(Table[GetName]); +end; + + +{--------------------------------------------------------------} +{ Output Routine } + +procedure Output; +begin + Match('!'); + WriteLn(Table[GetName]); +end; +{--------------------------------------------------------------} + +They aren't very fancy, I admit ... no prompt character on input, +for example ... but they get the job done. + +The corresponding changes in the main program are shown below. +Note that we use the usual trick of a case statement based upon +the current lookahead character, to decide what to do. + + +{--------------------------------------------------------------} +{ Main Program } + +begin + Init; + repeat + case Look of + '?': Input; + '!': Output; + else Assignment; + end; + NewLine; + until Look = '.'; +end. +{--------------------------------------------------------------} + + +You have now completed a real, working interpreter. It's pretty +sparse, but it works just like the "big boys." It includes three +kinds of program statements (and can tell the difference!), 26 +variables, and I/O statements. The only things that it lacks, +really, are control statements, subroutines, and some kind of +program editing function. The program editing part, I'm going to +pass on. After all, we're not here to build a product, but to +learn things. The control statements, we'll cover in the next +installment, and the subroutines soon after. I'm anxious to get +on with that, so we'll leave the interpreter as it stands. + +I hope that by now you're convinced that the limitation of sin- +gle-character names and the processing of white space are easily +taken care of, as we did in the last session. This time, if +you'd like to play around with these extensions, be my guest ... +they're "left as an exercise for the student." See you next +time. 
+ +***************************************************************** +* * +* COPYRIGHT NOTICE * +* * +* Copyright (C) 1988 Jack W. Crenshaw. All rights reserved. * +* * +***************************************************************** + + 1 -- + + diff --git a/Compiler construction/Lets_build_compiler/tutor5.txt b/Compiler construction/Lets_build_compiler/tutor5.txt new file mode 100644 index 0000000..448f293 --- /dev/null +++ b/Compiler construction/Lets_build_compiler/tutor5.txt @@ -0,0 +1,1575 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + LET'S BUILD A COMPILER! + + By + + Jack W. Crenshaw, Ph.D. + + 19 August 1988 + + + Part V: CONTROL CONSTRUCTS + + +***************************************************************** +* * +* COPYRIGHT NOTICE * +* * +* Copyright (C) 1988 Jack W. Crenshaw. All rights reserved. * +* * +***************************************************************** + + +INTRODUCTION + +In the first four installments of this series, we've been +concentrating on the parsing of math expressions and assignment +statements. In this installment, we'll take off on a new and +exciting tangent: that of parsing and translating control +constructs such as IF statements. + +This subject is dear to my heart, because it represents a turning +point for me. I had been playing with the parsing of +expressions, just as we have done in this series, but I still +felt that I was a LONG way from being able to handle a complete +language. After all, REAL languages have branches and loops and +subroutines and all that. Perhaps you've shared some of the same +thoughts. Awhile back, though, I had to produce control +constructs for a structured assembler preprocessor I was writing. +Imagine my surprise to discover that it was far easier than the +expression parsing I had already been through. I remember +thinking, "Hey! This is EASY!" After we've finished this session, +I'll bet you'll be thinking so, too. 
+ + +THE PLAN + +In what follows, we'll be starting over again with a bare cradle, +and as we've done twice before now, we'll build things up one at +a time. We'll also be retaining the concept of single-character +tokens that has served us so well to date. This means that the +"code" will look a little funny, with 'i' for IF, 'w' for WHILE, +etc. But it helps us get the concepts down pat without fussing +over lexical scanning. Fear not ... eventually we'll see +something looking like "real" code. + +I also don't want to have us get bogged down in dealing with +statements other than branches, such as the assignment statements +we've been working on. We've already demonstrated that we can +handle them, so there's no point carrying them around as excess +baggage during this exercise. So what I'll do instead is to use +an anonymous statement, "other", to take the place of the non- +control statements and serve as a place-holder for them. We have +to generate some kind of object code for them (we're back into +compiling, not interpretation), so for want of anything else I'll +just echo the character input. + +OK, then, starting with yet another copy of the cradle, let's +define the procedure: + + +{--------------------------------------------------------------} +{ Recognize and Translate an "Other" } + +procedure Other; +begin + EmitLn(GetName); +end; +{--------------------------------------------------------------} + + +Now include a call to it in the main program, thus: + + +{--------------------------------------------------------------} +{ Main Program } + +begin + Init; + Other; +end. +{--------------------------------------------------------------} + + +Run the program and see what you get. Not very exciting, is it? +But hang in there, it's a start, and things will get better. + +The first thing we need is the ability to deal with more than one +statement, since a single-line branch is pretty limited. 
+We did that in the last session on interpreting, but this time
+let's get a little more formal. Consider the following BNF:
+
+     <program> ::= <block> END
+
+     <block> ::= [ <statement> ]*
+
+This says that, for our purposes here, a program is defined as a
+block, followed by an END statement. A block, in turn, consists
+of zero or more statements. We only have one kind of statement,
+so far.
+
+What signals the end of a block? It's simply any construct that
+isn't an "other" statement. For now, that means only the END
+statement.
+
+Armed with these ideas, we can proceed to build up our parser.
+The code for a program (we have to call it DoProgram, or Pascal
+will complain), is:
+
+{--------------------------------------------------------------}
+{ Parse and Translate a Program }
+
+procedure DoProgram;
+begin
+   Block;
+   if Look <> 'e' then Expected('End');
+   EmitLn('END')
+end;
+{--------------------------------------------------------------}
+
+
+Notice that I've arranged to emit an "END" command to the
+assembler, which sort of punctuates the output code, and makes
+sense considering that we're parsing a complete program here.
+
+The code for Block is:
+
+
+{--------------------------------------------------------------}
+{ Recognize and Translate a Statement Block }
+
+procedure Block;
+begin
+   while not(Look in ['e']) do begin
+      Other;
+   end;
+end;
+{--------------------------------------------------------------}
+
+
+(From the form of the procedure, you just KNOW we're going to be
+adding to it in a bit!)
+
+OK, enter these routines into your program. Replace the call to
+Other in the main program with a call to DoProgram. Now try it
+and see how it works. Well, it's still not much, but we're
+getting closer.
+
+
+SOME GROUNDWORK
+
+Before we begin to define the various control constructs, we need
+to lay a bit more groundwork. First, a word of warning: I won't
+be using the same syntax for these constructs as you're familiar
+with from Pascal or C.
+For example, the Pascal syntax for an IF is:
+
+
+     IF <condition> THEN <statement>
+
+
+(where the statement, of course, may be compound).
+
+The C version is similar:
+
+
+     IF ( <condition> ) <statement>
+
+
+Instead, I'll be using something that looks more like Ada:
+
+
+     IF <condition> <block> ENDIF
+
+
+In other words, the IF construct has a specific termination
+symbol. This avoids the dangling-else of Pascal and C and also
+precludes the need for the brackets {} or begin-end. The syntax
+I'm showing you here, in fact, is that of the language KISS that
+I'll be detailing in later installments. The other constructs
+will also be slightly different. That shouldn't be a real
+problem for you. Once you see how it's done, you'll realize that
+it really doesn't matter so much which specific syntax is
+involved. Once the syntax is defined, turning it into code is
+straightforward.
+
+Now, all of the constructs we'll be dealing with here involve
+transfer of control, which at the assembler-language level means
+conditional and/or unconditional branches. For example, the
+simple IF statement
+
+
+     IF <condition> A ENDIF B ....
+
+must get translated into
+
+     Branch if NOT condition to L
+     A
+L:   B
+     ...
+
+
+It's clear, then, that we're going to need some more procedures
+to help us deal with these branches. I've defined two of them
+below. Procedure NewLabel generates unique labels. This is done
+via the simple expedient of calling every label 'Lnn', where nn
+is a label number starting from zero. Procedure PostLabel just
+outputs the labels at the proper place.
+
+Here are the two routines:
+
+
+{--------------------------------------------------------------}
+{ Generate a Unique Label }
+
+function NewLabel: string;
+var S: string;
+begin
+   Str(LCount, S);
+   NewLabel := 'L' + S;
+   Inc(LCount);
+end;
+
+
+{--------------------------------------------------------------}
+{ Post a Label To Output }
+
+procedure PostLabel(L: string);
+begin
+   WriteLn(L, ':');
+end;
+{--------------------------------------------------------------}
+
+
+Notice that we've added a new global variable, LCount, so you
+need to change the VAR declarations at the top of the program to
+look like this:
+
+
+var Look  : char;              { Lookahead Character }
+    LCount: integer;           { Label Counter }
+
+
+Also, add the following extra initialization to Init:
+
+
+   LCount := 0;
+
+(DON'T forget that, or your labels can look really strange!)
+
+
+At this point I'd also like to show you a new kind of notation.
+If you compare the form of the IF statement above with the
+assembler code that must be produced, you can see that there are
+certain actions associated with each of the keywords in the
+statement:
+
+
+     IF:    First, get the condition and issue the code for it.
+            Then, create a unique label and emit a branch if false.
+
+     ENDIF: Emit the label.
+
+
+These actions can be shown very concisely if we write the syntax
+this way:
+
+
+     IF
+     <condition>    { Condition;
+                      L = NewLabel;
+                      Emit(Branch False to L); }
+     <block>
+     ENDIF          { PostLabel(L) }
+
+
+This is an example of syntax-directed translation. We've been
+doing it all along ... we've just never written it down this way
+before. The stuff in curly brackets represents the ACTIONS to be
+taken. The nice part about this representation is that it not
+only shows what we have to recognize, but also the actions we
+have to perform, and in which order. Once we have this syntax,
+the code almost writes itself.
+
+About the only thing left to do is to be a bit more specific
+about what we mean by "Branch if false."
+
+I'm assuming that there will be code executed for <condition>
+that will perform Boolean algebra and compute some result. It
+should also set the condition flags corresponding to that result.
+Now, the usual convention for a Boolean variable is to let 0000
+represent "false," and anything else (some use FFFF, some 0001)
+represent "true."
+
+On the 68000 the condition flags are set whenever any data is
+moved or calculated. If the data is a 0000 (corresponding to a
+false condition, remember), the zero flag will be set. The code
+for "Branch on zero" is BEQ. So for our purposes here,
+
+
+     BEQ  <=> Branch if false
+     BNE  <=> Branch if true
+
+
+It's the nature of the beast that most of the branches we see
+will be BEQ's ... we'll be branching AROUND the code that's
+supposed to be executed when the condition is true.
+
+
+THE IF STATEMENT
+
+With that bit of explanation out of the way, we're finally ready
+to begin coding the IF-statement parser. In fact, we've almost
+already done it! As usual, I'll be using our single-character
+approach, with the character 'i' for IF, and 'e' for ENDIF (as
+well as END ... that dual nature causes no confusion). I'll
+also, for now, skip completely the character for the branch
+condition, which we still have to define.
+
+The code for DoIf is:
+
+{--------------------------------------------------------------}
+{ Recognize and Translate an IF Construct }
+
+procedure Block; Forward;
+
+
+procedure DoIf;
+var L: string;
+begin
+   Match('i');
+   L := NewLabel;
+   Condition;
+   EmitLn('BEQ ' + L);
+   Block;
+   Match('e');
+   PostLabel(L);
+end;
+{--------------------------------------------------------------}
+
+
+Add this routine to your program, and change Block to reference
+it as follows:
+
+
+{--------------------------------------------------------------}
+{ Recognize and Translate a Statement Block }
+
+procedure Block;
+begin
+   while not(Look in ['e']) do begin
+      case Look of
+       'i': DoIf;
+       else Other;
+      end;
+   end;
+end;
+{--------------------------------------------------------------}
+
+
+Notice the reference to procedure Condition. Eventually, we'll
+write a routine that can parse and translate any Boolean
+condition we care to give it. But that's a whole installment by
+itself (the next one, in fact). For now, let's just make it a
+dummy that emits some text. Write the following routine:
+
+
+{--------------------------------------------------------------}
+{ Parse and Translate a Boolean Condition }
+{ This version is a dummy }
+
+Procedure Condition;
+begin
+   EmitLn('<condition>');
+end;
+{--------------------------------------------------------------}
+
+
+Insert this procedure in your program just before DoIf. Now run
+the program. Try a string like
+
+     aibece
+
+As you can see, the parser seems to recognize the construct and
+inserts the object code at the right places. Now try a set of
+nested IF's, like
+
+     aibicedefe
+
+It's starting to look real, eh?
+
+Now that we have the general idea (and the tools such as the
+notation and the procedures NewLabel and PostLabel), it's a piece
+of cake to extend the parser to include other constructs. The
+first (and also one of the trickiest) is to add the ELSE clause
+to IF.
+The BNF is
+
+
+     IF <condition> <block> [ ELSE <block> ] ENDIF
+
+
+The tricky part arises simply because there is an optional part,
+which doesn't occur in the other constructs.
+
+The corresponding output code should be
+
+
+          <condition>
+          BEQ L1
+          <block>
+          BRA L2
+     L1:  <block>
+     L2:  ...
+
+
+This leads us to the following syntax-directed translation:
+
+
+     IF
+     <condition>    { L1 = NewLabel;
+                      L2 = NewLabel;
+                      Emit(BEQ L1) }
+     <block>
+     ELSE           { Emit(BRA L2);
+                      PostLabel(L1) }
+     <block>
+     ENDIF          { PostLabel(L2) }
+
+
+Comparing this with the case for an ELSE-less IF gives us a clue
+as to how to handle both situations. The code below does it.
+(Note that I use an 'l' for the ELSE, since 'e' is otherwise
+occupied):
+
+
+{--------------------------------------------------------------}
+{ Recognize and Translate an IF Construct }
+
+procedure DoIf;
+var L1, L2: string;
+begin
+   Match('i');
+   Condition;
+   L1 := NewLabel;
+   L2 := L1;
+   EmitLn('BEQ ' + L1);
+   Block;
+   if Look = 'l' then begin
+      Match('l');
+      L2 := NewLabel;
+      EmitLn('BRA ' + L2);
+      PostLabel(L1);
+      Block;
+   end;
+   Match('e');
+   PostLabel(L2);
+end;
+{--------------------------------------------------------------}
+
+
+There you have it. A complete IF parser/translator, in 19 lines
+of code.
+
+Give it a try now. Before you do, add 'l' to the set in Block's
+while-test, so that it reads  while not(Look in ['e', 'l']) do.
+A block must end at an ELSE as well as at an ENDIF; otherwise
+Block never hands the 'l' back to DoIf. Then try something like
+
+     aiblcede
+
+Did it work? Now, just to be sure we haven't broken the ELSE-
+less case, try
+
+     aibece
+
+Now try some nested IF's. Try anything you like, including some
+badly formed statements. Just remember that 'e' is not a legal
+"other" statement.
+
+
+THE WHILE STATEMENT
+
+The next type of statement should be easy, since we already have
+the process down pat. The syntax I've chosen for the WHILE
+statement is
+
+
+     WHILE <condition> <block> ENDWHILE
+
+
+I know, I know, we don't REALLY need separate kinds of
+terminators for each construct ... you can see that by the fact
+that in our one-character version, 'e' is used for all of them.
+But I also remember MANY debugging sessions in Pascal, trying to
+track down a wayward END that the compiler obviously thought I
+meant to put somewhere else. It's been my experience that
+specific and unique keywords, although they add to the vocabulary
+of the language, give a bit of error-checking that is worth the
+extra work for the compiler writer.
+
+Now, consider what the WHILE should be translated into. It
+should be:
+
+
+     L1:  <condition>
+          BEQ L2
+          <block>
+          BRA L1
+     L2:
+
+
+
+
+As before, comparing the two representations gives us the actions
+needed at each point.
+
+
+     WHILE          { L1 = NewLabel;
+                      PostLabel(L1) }
+     <condition>    { Emit(BEQ L2) }
+     <block>
+     ENDWHILE       { Emit(BRA L1);
+                      PostLabel(L2) }
+
+
+The code follows immediately from the syntax:
+
+
+{--------------------------------------------------------------}
+{ Parse and Translate a WHILE Statement }
+
+procedure DoWhile;
+var L1, L2: string;
+begin
+   Match('w');
+   L1 := NewLabel;
+   L2 := NewLabel;
+   PostLabel(L1);
+   Condition;
+   EmitLn('BEQ ' + L2);
+   Block;
+   Match('e');
+   EmitLn('BRA ' + L1);
+   PostLabel(L2);
+end;
+{--------------------------------------------------------------}
+
+
+Since we've got a new statement, we have to add a call to it
+within procedure Block:
+
+
+{--------------------------------------------------------------}
+{ Recognize and Translate a Statement Block }
+
+procedure Block;
+begin
+   while not(Look in ['e', 'l']) do begin
+      case Look of
+       'i': DoIf;
+       'w': DoWhile;
+       else Other;
+      end;
+   end;
+end;
+{--------------------------------------------------------------}
+
+
+No other changes are necessary.
+
+OK, try the new program. Note that this time, the <condition>
+code is INSIDE the upper label, which is just where we wanted it.
+Try some nested loops. Try some loops within IF's, and some IF's
+within loops. If you get a bit confused as to what you should
+type, don't be discouraged: you write bugs in other languages,
+too, don't you? It'll look a lot more meaningful when we get
+full keywords.
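If you'd like to predict the output before typing a test string into the Pascal program, a rough Python port of the parser at this stage (Other, IF/ELSE and WHILE only) can serve as a cross-check. This is only a sketch and not part of the tutorial's code: the class name Parser, the helper translate, and the use of SyntaxError in place of Abort are all assumptions made for illustration.

```python
class Parser:
    """Illustrative Python port of Block, DoIf, DoWhile and Other."""

    def __init__(self, source):
        self.src = source
        self.pos = 0        # index of the lookahead character
        self.lcount = 0     # label counter, like LCount
        self.out = []       # emitted lines

    @property
    def look(self):
        # Lookahead character; '' once the input is exhausted.
        return self.src[self.pos] if self.pos < len(self.src) else ''

    def match(self, ch):
        if self.look != ch:
            raise SyntaxError(f"'{ch}' expected at position {self.pos}")
        self.pos += 1

    def new_label(self):
        label = f"L{self.lcount}"
        self.lcount += 1
        return label

    def emit(self, text):
        self.out.append("\t" + text)

    def post_label(self, label):
        self.out.append(label + ":")

    def condition(self):
        self.emit("<condition>")      # dummy, as in the tutorial

    def other(self):
        if not self.look.isalpha():
            raise SyntaxError("Name expected")
        self.emit(self.look.upper())
        self.pos += 1

    def do_if(self):
        self.match('i')
        self.condition()
        l1 = self.new_label()
        l2 = l1
        self.emit("BEQ " + l1)
        self.block()
        if self.look == 'l':          # optional ELSE part
            self.match('l')
            l2 = self.new_label()
            self.emit("BRA " + l2)
            self.post_label(l1)
            self.block()
        self.match('e')
        self.post_label(l2)

    def do_while(self):
        self.match('w')
        l1 = self.new_label()
        l2 = self.new_label()
        self.post_label(l1)
        self.condition()
        self.emit("BEQ " + l2)
        self.block()
        self.match('e')
        self.emit("BRA " + l1)
        self.post_label(l2)

    def block(self):
        # 'e' and 'l' end a block; anything else is a statement.
        while self.look not in ('e', 'l'):
            if self.look == 'i':
                self.do_if()
            elif self.look == 'w':
                self.do_while()
            else:
                self.other()

    def program(self):
        self.block()
        if self.look != 'e':
            raise SyntaxError("End expected")
        self.emit("END")
        return self.out


def translate(source):
    """Return the emitted lines for a one-character-token program."""
    return Parser(source).program()


print("\n".join(translate("awibeece")))
```

The sample input is an IF nested inside a WHILE: 'a', then 'w' ... 'e' enclosing 'i' 'b' 'e', then 'c' and the final 'e' for END. Reading the printed labels against the WHILE and IF templates above is a quick way to check that each 'e' closed the construct you meant it to.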
+
+I hope by now that you're beginning to get the idea that this
+really IS easy. All we have to do to accommodate a new construct
+is to work out the syntax-directed translation of it. The code
+almost falls out from there, and it doesn't affect any of the
+other routines. Once you've gotten the feel of the thing, you'll
+see that you can add new constructs about as fast as you can
+dream them up.
+
+
+THE LOOP STATEMENT
+
+We could stop right here, and have a language that works. It's
+been shown many times that a high-order language with only two
+constructs, the IF and the WHILE, is sufficient to write
+structured code. But we're on a roll now, so let's richen up the
+repertoire a bit.
+
+This construct is even easier, since it has no condition test at
+all ... it's an infinite loop. What's the point of such a loop?
+Not much, by itself, but later on we're going to add a BREAK
+command, that will give us a way out. This makes the language
+considerably richer than Pascal, which has no break, and also
+avoids the funny WHILE(1) or WHILE TRUE of C and Pascal.
+
+The syntax is simply
+
+     LOOP <block> ENDLOOP
+
+and the syntax-directed translation is:
+
+
+     LOOP           { L = NewLabel;
+                      PostLabel(L) }
+     <block>
+     ENDLOOP        { Emit(BRA L) }
+
+
+The corresponding code is shown below. Since I've already used
+'l' for the ELSE, I've used the last letter, 'p', as the
+"keyword" this time.
+
+
+{--------------------------------------------------------------}
+{ Parse and Translate a LOOP Statement }
+
+procedure DoLoop;
+var L: string;
+begin
+   Match('p');
+   L := NewLabel;
+   PostLabel(L);
+   Block;
+   Match('e');
+   EmitLn('BRA ' + L);
+end;
+{--------------------------------------------------------------}
+
+
+When you insert this routine, don't forget to add a line in Block
+to call it.
+
+
+
+
+REPEAT-UNTIL
+
+Here's one construct that I lifted right from Pascal.
+The syntax is
+
+
+     REPEAT <block> UNTIL <condition> ,
+
+
+and the syntax-directed translation is:
+
+
+     REPEAT         { L = NewLabel;
+                      PostLabel(L) }
+     <block>
+     UNTIL
+     <condition>    { Emit(BEQ L) }
+
+
+As usual, the code falls out pretty easily:
+
+
+{--------------------------------------------------------------}
+{ Parse and Translate a REPEAT Statement }
+
+procedure DoRepeat;
+var L: string;
+begin
+   Match('r');
+   L := NewLabel;
+   PostLabel(L);
+   Block;
+   Match('u');
+   Condition;
+   EmitLn('BEQ ' + L);
+end;
+{--------------------------------------------------------------}
+
+
+As before, we have to add the call to DoRepeat within Block.
+This time, there's a difference, though. I decided to use 'r'
+for REPEAT (naturally), but I also decided to use 'u' for UNTIL.
+This means that the 'u' must be added to the set of characters in
+the while-test. These are the characters that signal an exit
+from the current block ... the "follow" characters, in compiler
+jargon.
+
+
+{--------------------------------------------------------------}
+{ Recognize and Translate a Statement Block }
+
+procedure Block;
+begin
+   while not(Look in ['e', 'l', 'u']) do begin
+      case Look of
+       'i': DoIf;
+       'w': DoWhile;
+       'p': DoLoop;
+       'r': DoRepeat;
+       else Other;
+      end;
+   end;
+end;
+{--------------------------------------------------------------}
+
+
+THE FOR LOOP
+
+The FOR loop is a very handy one to have around, but it's a bear
+to translate. That's not so much because the construct itself is
+hard ... it's only a loop after all ... but simply because it's
+hard to implement in assembler language. Once the code is
+figured out, the translation is straightforward enough.
+
+C fans love the FOR-loop of that language (and, in fact, it's
+easier to code), but I've chosen instead a syntax very much like
+the one from good ol' BASIC:
+
+
+     FOR <ident> = <expr1> TO <expr2> <block> ENDFOR
+
+
+The translation of a FOR loop can be just about as difficult as
+you choose to make it, depending upon the way you decide to
+define the rules as to how to handle the limits. Does expr2 get
+evaluated every time through the loop, for example, or is it
+treated as a constant limit? Do you always go through the loop
+at least once, as in FORTRAN, or not? It gets simpler if you
+adopt the point of view that the construct is equivalent to:
+
+
+     <ident> = <expr1>
+     TEMP = <expr2>
+     WHILE <ident> <= TEMP
+     <block>
+     ENDWHILE
+
+
+Notice that with this definition of the loop, <block> will not be
+executed at all if <expr1> is initially larger than <expr2>.
+
+The 68000 code needed to do this is trickier than anything we've
+done so far. I had a couple of tries at it, putting both the
+counter and the upper limit on the stack, both in registers,
+etc. I finally arrived at a hybrid arrangement, in which the
+loop counter is in memory (so that it can be accessed within the
+loop), and the upper limit is on the stack. The translated code
+came out like this:
+
+
+          <ident>             get name of loop counter
+          <expr1>             get initial value
+          LEA <ident>(PC),A0  address the loop counter
+          SUBQ #1,D0          predecrement it
+          MOVE D0,(A0)        save it
+          <expr2>             get upper limit
+          MOVE D0,-(SP)       save it on stack
+
+     L1:  LEA <ident>(PC),A0  address loop counter
+          MOVE (A0),D0        fetch it to D0
+          ADDQ #1,D0          bump the counter
+          MOVE D0,(A0)        save new value
+          CMP (SP),D0         check for range
+          BGT L2              skip out if D0 > (SP)
+          <block>
+          BRA L1              loop for next pass
+     L2:  ADDQ #2,SP          clean up the stack
+
+
+Wow! That seems like a lot of code ... the line containing
+<block> seems to almost get lost. But that's the best I could do
+with it. I guess it helps to keep in mind that it's really only
+sixteen words, after all. If anyone else can optimize this
+better, please let me know.
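To convince yourself that the pre-decrement / increment / compare sequence really behaves like the WHILE form, it can help to model the control flow in a few lines of Python. This is a sketch under stated assumptions, not generated code: the function names are hypothetical, the exit test is taken to be "leave when the counter exceeds the limit" (the BGT that procedure DoFor emits), 16-bit overflow is ignored, and the WHILE form is assumed to increment the counter at the end of each pass, which the equivalence in the text leaves implicit.

```python
def for_as_generated(start, limit):
    """Model the emitted loop; return the counter values seen by <block>."""
    seen = []
    counter = start - 1            # SUBQ #1,D0 / MOVE D0,(A0): pre-decrement
    while True:                    # L1:
        counter += 1               # ADDQ #1,D0 / MOVE D0,(A0)
        if counter > limit:        # CMP (SP),D0 / BGT L2
            break
        seen.append(counter)       # <block>
    return seen                    # L2: ADDQ #2,SP


def for_as_while(start, limit):
    """Model the WHILE definition, with an assumed increment per pass."""
    seen = []
    ident = start
    temp = limit                   # the limit is evaluated once
    while ident <= temp:
        seen.append(ident)         # <block>
        ident += 1                 # assumed: not shown in the definition
    return seen


print(for_as_generated(1, 3))
print(for_as_generated(4, 3))
```

The second call models the case the text points out: when the initial value is already past the limit, the body is never entered. One side effect the model makes visible is that the counter in memory is left at limit + 1 when the loop falls out normally.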
+
+Still, the parser routine is pretty easy now that we have the
+code:
+
+
+{--------------------------------------------------------------}
+{ Parse and Translate a FOR Statement }
+
+procedure DoFor;
+var L1, L2: string;
+    Name: char;
+begin
+   Match('f');
+   L1 := NewLabel;
+   L2 := NewLabel;
+   Name := GetName;
+   Match('=');
+   Expression;
+   EmitLn('SUBQ #1,D0');
+   EmitLn('LEA ' + Name + '(PC),A0');
+   EmitLn('MOVE D0,(A0)');
+   Expression;
+   EmitLn('MOVE D0,-(SP)');
+   PostLabel(L1);
+   EmitLn('LEA ' + Name + '(PC),A0');
+   EmitLn('MOVE (A0),D0');
+   EmitLn('ADDQ #1,D0');
+   EmitLn('MOVE D0,(A0)');
+   EmitLn('CMP (SP),D0');
+   EmitLn('BGT ' + L2);
+   Block;
+   Match('e');
+   EmitLn('BRA ' + L1);
+   PostLabel(L2);
+   EmitLn('ADDQ #2,SP');
+end;
+{--------------------------------------------------------------}
+
+
+Since we don't have expressions in this parser, I used the same
+trick as for Condition, and wrote the routine
+
+
+{--------------------------------------------------------------}
+{ Parse and Translate an Expression }
+{ This version is a dummy }
+
+Procedure Expression;
+begin
+   EmitLn('<expr>');
+end;
+{--------------------------------------------------------------}
+
+
+Give it a try. Once again, don't forget to add the call in
+Block. Since we don't have any input for the dummy version of
+Expression, a typical input line would look something like
+
+     afi=bece
+
+Well, it DOES generate a lot of code, doesn't it? But at least
+it's the RIGHT code.
+
+
+THE DO STATEMENT
+
+All this made me wish for a simpler version of the FOR loop. The
+reason for all the code above is the need to have the loop
+counter accessible as a variable within the loop. If all we need
+is a counting loop to make us go through something a specified
+number of times, but don't need access to the counter itself,
+there is a much easier solution. The 68000 has a "decrement and
+branch nonzero" instruction built in which is ideal for counting.
+For good measure, let's add this construct, too.
+This will be the last of our loop structures.
+
+The syntax and its translation is:
+
+
+     DO
+     <expr>         { Emit(SUBQ #1,D0);
+                      L = NewLabel;
+                      PostLabel(L);
+                      Emit(MOVE D0,-(SP)) }
+     <block>
+     ENDDO          { Emit(MOVE (SP)+,D0);
+                      Emit(DBRA D0,L) }
+
+
+That's quite a bit simpler! The loop will execute <expr> times.
+Here's the code:
+
+
+{--------------------------------------------------------------}
+{ Parse and Translate a DO Statement }
+
+procedure Dodo;
+var L: string;
+begin
+   Match('d');
+   L := NewLabel;
+   Expression;
+   EmitLn('SUBQ #1,D0');
+   PostLabel(L);
+   EmitLn('MOVE D0,-(SP)');
+   Block;
+   Match('e');
+   EmitLn('MOVE (SP)+,D0');
+   EmitLn('DBRA D0,' + L);
+end;
+{--------------------------------------------------------------}
+
+
+I think you'll have to agree, that's a whole lot simpler than the
+classical FOR. Still, each construct has its place.
+
+
+THE BREAK STATEMENT
+
+Earlier I promised you a BREAK statement to accompany LOOP. This
+is one I'm sort of proud of. On the face of it a BREAK seems
+really tricky. My first approach was to just use it as an extra
+terminator to Block, and split all the loops into two parts, just
+as I did with the ELSE half of an IF. That turns out not to
+work, though, because the BREAK statement is almost certainly not
+going to show up at the same level as the loop itself. The most
+likely place for a BREAK is right after an IF, which would cause
+it to exit to the IF construct, not the enclosing loop. WRONG.
+The BREAK has to exit the inner LOOP, even if it's nested down
+into several levels of IFs.
+
+My next thought was that I would just store away, in some global
+variable, the ending label of the innermost loop. That doesn't
+work either, because there may be a break from an inner loop
+followed by a break from an outer one. Storing the label for the
+inner loop would clobber the label for the outer one. So the
+global variable turned into a stack. Things were starting to get
+messy.
+
+Then I decided to take my own advice.
+Remember in the last session when I pointed out how well the
+implicit stack of a recursive descent parser was serving our
+needs? I said that if you begin to see the need for an external
+stack you might be doing something wrong. Well, I was. It is
+indeed possible to let the recursion built into our parser take
+care of everything, and the solution is so simple that it's
+surprising.
+
+The secret is to note that every BREAK statement has to occur
+within a block ... there's no place else for it to be. So all we
+have to do is to pass into Block the exit address of the
+innermost loop. Then it can pass the address to the routine that
+translates the break instruction. Since an IF statement doesn't
+change the loop level, procedure DoIf doesn't need to do anything
+except pass the label into ITS blocks (both of them). Since
+loops DO change the level, each loop construct simply ignores
+whatever label is above it and passes its own exit label along.
+
+All this is easier to show you than it is to describe. I'll
+demonstrate with the easiest loop, which is LOOP:
+
+
+{--------------------------------------------------------------}
+{ Parse and Translate a LOOP Statement }
+
+procedure DoLoop;
+var L1, L2: string;
+begin
+   Match('p');
+   L1 := NewLabel;
+   L2 := NewLabel;
+   PostLabel(L1);
+   Block(L2);
+   Match('e');
+   EmitLn('BRA ' + L1);
+   PostLabel(L2);
+end;
+{--------------------------------------------------------------}
+
+
+Notice that DoLoop now has TWO labels, not just one. The second
+is to give the BREAK instruction a target to jump to. If there
+is no BREAK within the loop, we've wasted a label and cluttered
+up things a bit, but there's no harm done.
+
+Note also that Block now has a parameter, which for loops will
+always be the exit address.
+The new version of Block is:
+
+
+{--------------------------------------------------------------}
+{ Recognize and Translate a Statement Block }
+
+procedure Block(L: string);
+begin
+   while not(Look in ['e', 'l', 'u']) do begin
+      case Look of
+       'i': DoIf(L);
+       'w': DoWhile;
+       'p': DoLoop;
+       'r': DoRepeat;
+       'f': DoFor;
+       'd': DoDo;
+       'b': DoBreak(L);
+       else Other;
+      end;
+   end;
+end;
+{--------------------------------------------------------------}
+
+
+Again, notice that all Block does with the label is to pass it
+into DoIf and DoBreak. The loop constructs don't need it,
+because they are going to pass their own label anyway.
+
+The new version of DoIf is:
+
+{--------------------------------------------------------------}
+{ Recognize and Translate an IF Construct }
+
+procedure Block(L: string); Forward;
+
+
+procedure DoIf(L: string);
+var L1, L2: string;
+begin
+   Match('i');
+   Condition;
+   L1 := NewLabel;
+   L2 := L1;
+   EmitLn('BEQ ' + L1);
+   Block(L);
+   if Look = 'l' then begin
+      Match('l');
+      L2 := NewLabel;
+      EmitLn('BRA ' + L2);
+      PostLabel(L1);
+      Block(L);
+   end;
+   Match('e');
+   PostLabel(L2);
+end;
+{--------------------------------------------------------------}
+
+
+Here, the only thing that changes is the addition of the
+parameter to procedure Block. An IF statement doesn't change the
+loop nesting level, so DoIf just passes the label along. No
+matter how many levels of IF nesting we have, the same label will
+be used.
+
+Now, remember that DoProgram also calls Block, so it now needs to
+pass it a label.
+An attempt to exit the outermost block is an error, so DoProgram
+passes a null label which is caught by DoBreak:
+
+
+{--------------------------------------------------------------}
+{ Recognize and Translate a BREAK }
+
+procedure DoBreak(L: string);
+begin
+   Match('b');
+   if L <> '' then
+      EmitLn('BRA ' + L)
+   else Abort('No loop to break from');
+end;
+
+
+{--------------------------------------------------------------}
+
+{ Parse and Translate a Program }
+
+procedure DoProgram;
+begin
+   Block('');
+   if Look <> 'e' then Expected('End');
+   EmitLn('END')
+end;
+{--------------------------------------------------------------}
+
+
+That ALMOST takes care of everything. Give it a try, see if you
+can "break" it. Careful, though. By this time we've used
+so many letters, it's hard to think of characters that aren't now
+representing reserved words. Remember: before you try the
+program, you're going to have to edit every occurrence of Block
+in the other loop constructs to include the new parameter. Do it
+just like I did for LOOP.
+
+I said ALMOST above. There is one slight problem: if you take a
+hard look at the code generated for DO, you'll see that if you
+break out of this loop, the value of the loop counter is still
+left on the stack. We're going to have to fix that! A shame ...
+that was one of our smaller routines, but it can't be helped.
+Here's a version that doesn't have the problem:
+
+
+
+
+{--------------------------------------------------------------}
+{ Parse and Translate a DO Statement }
+
+procedure Dodo;
+var L1, L2: string;
+begin
+   Match('d');
+   L1 := NewLabel;
+   L2 := NewLabel;
+   Expression;
+   EmitLn('SUBQ #1,D0');
+   PostLabel(L1);
+   EmitLn('MOVE D0,-(SP)');
+   Block(L2);
+   Match('e');
+   EmitLn('MOVE (SP)+,D0');
+   EmitLn('DBRA D0,' + L1);
+   EmitLn('SUBQ #2,SP');
+   PostLabel(L2);
+   EmitLn('ADDQ #2,SP');
+end;
+{--------------------------------------------------------------}
+
+
+The two extra instructions, the SUBQ and ADDQ, take care of
+leaving the stack in the right shape.
+
+
+CONCLUSION
+
+At this point we have created a number of control constructs ...
+a richer set, really, than that provided by almost any other
+programming language. And, except for the FOR loop, it was
+pretty easy to do. Even that one was tricky only because it's
+tricky in assembler language.
+
+I'll conclude this session here. To wrap the thing up with a red
+ribbon, we really should have a go at having real keywords
+instead of these mickey-mouse single-character things. You've
+already seen that the extension to multi-character words is not
+difficult, but in this case it will make a big difference in the
+appearance of our input code. I'll save that little bit for the
+next installment. In that installment we'll also address Boolean
+expressions, so we can get rid of the dummy version of Condition
+that we've used here. See you then.
+
+For reference purposes, here is the completed parser for this
+session:
+
+
+
+
+{--------------------------------------------------------------}
+program Branch;
+
+{--------------------------------------------------------------}
+{ Constant Declarations }
+
+const TAB = ^I;
+      CR  = ^M;
+
+
+{--------------------------------------------------------------}
+{ Variable Declarations }
+
+var Look  : char;              { Lookahead Character }
+    Lcount: integer;           { Label Counter }
+
+
+{--------------------------------------------------------------}
+{ Read New Character From Input Stream }
+
+procedure GetChar;
+begin
+   Read(Look);
+end;
+
+
+{--------------------------------------------------------------}
+{ Report an Error }
+
+procedure Error(s: string);
+begin
+   WriteLn;
+   WriteLn(^G, 'Error: ', s, '.');
+end;
+
+
+{--------------------------------------------------------------}
+{ Report Error and Halt }
+
+procedure Abort(s: string);
+begin
+   Error(s);
+   Halt;
+end;
+
+
+{--------------------------------------------------------------}
+{ Report What Was Expected }
+
+procedure Expected(s: string);
+begin
+   Abort(s + ' Expected');
+end;
+
+{--------------------------------------------------------------}
+{ Match a Specific Input Character }
+
+procedure Match(x: char);
+begin
+   if Look = x then GetChar
+   else Expected('''' + x + '''');
+end;
+
+
+{--------------------------------------------------------------}
+{ Recognize an Alpha Character }
+
+function IsAlpha(c: char): boolean;
+begin
+   IsAlpha := UpCase(c) in ['A'..'Z'];
+end;
+
+
+{--------------------------------------------------------------}
+{ Recognize a Decimal Digit }
+
+function IsDigit(c: char): boolean;
+begin
+   IsDigit := c in ['0'..'9'];
+end;
+
+
+{--------------------------------------------------------------}
+{ Recognize an Addop }
+
+function IsAddop(c: char): boolean;
+begin
+   IsAddop := c in ['+', '-'];
+end;
+
+
+{--------------------------------------------------------------}
+{ Recognize White
+Space }
+
+function IsWhite(c: char): boolean;
+begin
+   IsWhite := c in [' ', TAB];
+end;
+
+
+{--------------------------------------------------------------}
+{ Skip Over Leading White Space }
+
+procedure SkipWhite;
+begin
+   while IsWhite(Look) do
+      GetChar;
+end;
+
+
+{--------------------------------------------------------------}
+{ Get an Identifier }
+
+function GetName: char;
+begin
+   if not IsAlpha(Look) then Expected('Name');
+   GetName := UpCase(Look);
+   GetChar;
+end;
+
+
+
+
+{--------------------------------------------------------------}
+{ Get a Number }
+
+function GetNum: char;
+begin
+   if not IsDigit(Look) then Expected('Integer');
+   GetNum := Look;
+   GetChar;
+end;
+
+
+{--------------------------------------------------------------}
+{ Generate a Unique Label }
+
+function NewLabel: string;
+var S: string;
+begin
+   Str(LCount, S);
+   NewLabel := 'L' + S;
+   Inc(LCount);
+end;
+
+
+{--------------------------------------------------------------}
+{ Post a Label To Output }
+
+procedure PostLabel(L: string);
+begin
+   WriteLn(L, ':');
+end;
+
+
+{--------------------------------------------------------------}
+{ Output a String with Tab }
+
+procedure Emit(s: string);
+begin
+   Write(TAB, s);
+end;
+
+
+{--------------------------------------------------------------}
+
+{ Output a String with Tab and CRLF }
+
+procedure EmitLn(s: string);
+begin
+   Emit(s);
+   WriteLn;
+end;
+
+
+{--------------------------------------------------------------}
+{ Parse and Translate a Boolean Condition }
+
+procedure Condition;
+begin
+   EmitLn('<condition>');
+end;
+
+
+
+
+{--------------------------------------------------------------}
+{ Parse and Translate a Math Expression }
+
+procedure Expression;
+begin
+   EmitLn('<expr>');
+end;
+
+
+{--------------------------------------------------------------}
+{ Recognize and Translate an IF Construct }
+
+procedure Block(L: string); Forward;
+
+
+procedure DoIf(L: string);
+var L1, L2: string;
+begin
+   Match('i');
+   Condition;
+   L1 :=
NewLabel; + L2 := L1; + EmitLn('BEQ ' + L1); + Block(L); + if Look = 'l' then begin + Match('l'); + L2 := NewLabel; + EmitLn('BRA ' + L2); + PostLabel(L1); + Block(L); + end; + Match('e'); + PostLabel(L2); +end; + + +{--------------------------------------------------------------} +{ Parse and Translate a WHILE Statement } + +procedure DoWhile; +var L1, L2: string; +begin + Match('w'); + L1 := NewLabel; + L2 := NewLabel; + PostLabel(L1); + Condition; + EmitLn('BEQ ' + L2); + Block(L2); + Match('e'); + EmitLn('BRA ' + L1); + PostLabel(L2); +end; + + +{--------------------------------------------------------------} +{ Parse and Translate a LOOP Statement } + +procedure DoLoop; +var L1, L2: string; +begin + Match('p'); + L1 := NewLabel; + L2 := NewLabel; + PostLabel(L1); + Block(L2); + Match('e'); + EmitLn('BRA ' + L1); + PostLabel(L2); +end; + + +{--------------------------------------------------------------} +{ Parse and Translate a REPEAT Statement } + +procedure DoRepeat; +var L1, L2: string; +begin + Match('r'); + L1 := NewLabel; + L2 := NewLabel; + PostLabel(L1); + Block(L2); + Match('u'); + Condition; + EmitLn('BEQ ' + L1); + PostLabel(L2); +end; + + +{--------------------------------------------------------------} +{ Parse and Translate a FOR Statement } + +procedure DoFor; +var L1, L2: string; + Name: char; +begin + Match('f'); + L1 := NewLabel; + L2 := NewLabel; + Name := GetName; + Match('='); + Expression; + EmitLn('SUBQ #1,D0'); + EmitLn('LEA ' + Name + '(PC),A0'); + EmitLn('MOVE D0,(A0)'); + Expression; + EmitLn('MOVE D0,-(SP)'); + PostLabel(L1); + EmitLn('LEA ' + Name + '(PC),A0'); + EmitLn('MOVE (A0),D0'); + EmitLn('ADDQ #1,D0'); + EmitLn('MOVE D0,(A0)'); + EmitLn('CMP (SP),D0'); + EmitLn('BGT ' + L2); + Block(L2); + Match('e'); + EmitLn('BRA ' + L1); + PostLabel(L2); + EmitLn('ADDQ #2,SP'); +end; + + + + +{--------------------------------------------------------------} +{ Parse and Translate a DO Statement } + +procedure Dodo; +var L1, L2: string; 
+begin + Match('d'); + L1 := NewLabel; + L2 := NewLabel; + Expression; + EmitLn('SUBQ #1,D0'); + PostLabel(L1); + EmitLn('MOVE D0,-(SP)'); + Block(L2); + EmitLn('MOVE (SP)+,D0'); + EmitLn('DBRA D0,' + L1); + EmitLn('SUBQ #2,SP'); + PostLabel(L2); + EmitLn('ADDQ #2,SP'); +end; + + +{--------------------------------------------------------------} +{ Recognize and Translate a BREAK } + +procedure DoBreak(L: string); +begin + Match('b'); + EmitLn('BRA ' + L); +end; + + +{--------------------------------------------------------------} +{ Recognize and Translate an "Other" } + +procedure Other; +begin + EmitLn(GetName); +end; + + +{--------------------------------------------------------------} +{ Recognize and Translate a Statement Block } + +procedure Block(L: string); +begin + while not(Look in ['e', 'l', 'u']) do begin + case Look of + 'i': DoIf(L); + 'w': DoWhile; + 'p': DoLoop; + 'r': DoRepeat; + 'f': DoFor; + 'd': DoDo; + 'b': DoBreak(L); + else Other; + end; + end; +end; + + + + +{--------------------------------------------------------------} + +{ Parse and Translate a Program } + +procedure DoProgram; +begin + Block(''); + if Look <> 'e' then Expected('End'); + EmitLn('END') +end; + + +{--------------------------------------------------------------} + +{ Initialize } + +procedure Init; +begin + LCount := 0; + GetChar; +end; + + +{--------------------------------------------------------------} +{ Main Program } + +begin + Init; + DoProgram; +end. +{--------------------------------------------------------------} + + +***************************************************************** +* * +* COPYRIGHT NOTICE * +* * +* Copyright (C) 1988 Jack W. Crenshaw. All rights reserved. 
* +* * +***************************************************************** + + + + diff --git a/Compiler construction/Lets_build_compiler/tutor6.txt b/Compiler construction/Lets_build_compiler/tutor6.txt new file mode 100644 index 0000000..6bccecf --- /dev/null +++ b/Compiler construction/Lets_build_compiler/tutor6.txt @@ -0,0 +1,1120 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + LET'S BUILD A COMPILER! + + By + + Jack W. Crenshaw, Ph.D. + + 31 August 1988 + + + Part VI: BOOLEAN EXPRESSIONS + + +***************************************************************** +* * +* COPYRIGHT NOTICE * +* * +* Copyright (C) 1988 Jack W. Crenshaw. All rights reserved. * +* * +***************************************************************** + + +INTRODUCTION + +In Part V of this series, we took a look at control constructs, +and developed parsing routines to translate them into object +code. We ended up with a nice, relatively rich set of +constructs. + +As we left the parser, though, there was one big hole in our +capabilities: we did not address the issue of the branch +condition. To fill the void, I introduced to you a dummy parse +routine called Condition, which only served as a place-keeper for +the real thing. + +One of the things we'll do in this session is to plug that hole +by expanding Condition into a true parser/translator. + + +THE PLAN + +We're going to approach this installment a bit differently than +any of the others. In those other installments, we started out +immediately with experiments using the Pascal compiler, building +up the parsers from very rudimentary beginnings to their final +forms, without spending much time in planning beforehand. That's +called coding without specs, and it's usually frowned upon. We +could get away with it before because the rules of arithmetic are +pretty well established ... we know what a '+' sign is supposed +to mean without having to discuss it at length. The same is true +for branches and loops. 
But the ways in which programming
+languages implement logic vary quite a bit from language to
+language. So before we begin serious coding, we'd better first
+make up our minds what it is we want. And the way to do that is
+at the level of the BNF syntax rules (the GRAMMAR).
+
+
+THE GRAMMAR
+
+For some time now, we've been implementing BNF syntax equations
+for arithmetic expressions, without ever actually writing them
+down all in one place. It's time that we did so. They are:
+
+
+     <expression> ::= <unary op> <term> [<addop> <term>]*
+     <term>       ::= <factor> [<mulop> <factor>]*
+     <factor>     ::= <integer> | <variable> | ( <expression> )
+
+(Remember, the nice thing about this grammar is that it enforces
+the operator precedence hierarchy that we normally expect for
+algebra.)
+
+Actually, while we're on the subject, I'd like to amend this
+grammar a bit right now. The way we've handled the unary minus
+is a bit awkward. I've found that it's better to write the
+grammar this way:
+
+
+     <expression>    ::= <term> [<addop> <term>]*
+     <term>          ::= <signed factor> [<mulop> <factor>]*
+     <signed factor> ::= [<addop>] <factor>
+     <factor>        ::= <integer> | <variable> | (<expression>)
+
+
+This puts the job of handling the unary minus onto Factor, which
+is where it really belongs.
+
+This doesn't mean that you have to go back and recode the
+programs you've already written, although you're free to do so if
+you like. But I will be using the new syntax from now on.
+
+Now, it probably won't come as a shock to you to learn that we
+can define an analogous grammar for Boolean algebra. A typical
+set of rules is:
+
+
+     <b-expression> ::= <b-term> [<orop> <b-term>]*
+     <b-term>       ::= <not-factor> [AND <not-factor>]*
+     <not-factor>   ::= [NOT] <b-factor>
+     <b-factor>     ::= <b-literal> | <b-variable> | (<b-expression>)
+
+
+Notice that in this grammar, the operator AND is analogous to
+'*', and OR (and exclusive OR) to '+'. The NOT operator is
+analogous to a unary minus. This hierarchy is not absolutely
+standard ... some languages, notably Ada, treat all logical
+operators as having the same precedence level ... but it seems
+natural.
+
+Notice also the slight difference between the way the NOT and the
+unary minus are handled. In algebra, the unary minus is
+considered to go with the whole term, and so never appears but
+once in a given term.
So an expression like
+
+     a * -b
+
+or worse yet,
+
+     a - -b
+
+is not allowed. In Boolean algebra, though, the expression
+
+     a AND NOT b
+
+makes perfect sense, and the syntax shown allows for that.
+
+
+RELOPS
+
+OK, assuming that you're willing to accept the grammar I've shown
+here, we now have syntax rules for both arithmetic and Boolean
+algebra. The sticky part comes in when we have to combine the
+two. Why do we have to do that? Well, the whole subject came up
+because of the need to process the "predicates" (conditions)
+associated with control statements such as the IF. The predicate
+is required to have a Boolean value; that is, it must evaluate to
+either TRUE or FALSE. The branch is then taken or not taken,
+depending on that value. What we expect to see going on in
+procedure Condition, then, is the evaluation of a Boolean
+expression.
+
+But there's more to it than that. A pure Boolean expression can
+indeed be the predicate of a control statement ... things like
+
+
+     IF a AND NOT b THEN ....
+
+
+But more often, we see Boolean algebra show up in such things as
+
+
+     IF (x >= 0) and (x <= 100) THEN ...
+
+
+Here, the two terms in parens are Boolean expressions, but the
+individual terms being compared: x, 0, and 100, are NUMERIC in
+nature. The RELATIONAL OPERATORS >= and <= are the catalysts by
+which the Boolean and the arithmetic ingredients get merged
+together.
+
+Now, in the example above, the terms being compared are just
+that: terms. However, in general each side can be a math
+expression. So we can define a RELATION to be:
+
+
+     <relation> ::= <expression> <relop> <expression> ,
+
+
+where the expressions we're talking about here are the old
+numeric type, and the relops are any of the usual symbols
+
+
+     =, <> (or !=), <, >, <=, and >=
+
+
+If you think about it a bit, you'll agree that, since this kind
+of predicate has a single Boolean value, TRUE or FALSE, as its
+result, it is really just another kind of factor.
So we can
+expand the definition of a Boolean factor above to read:
+
+
+     <b-factor> ::=   <b-literal>
+                    | <b-variable>
+                    | (<b-expression>)
+                    | <relation>
+
+
+THAT's the connection! The relops and the relation they define
+serve to wed the two kinds of algebra. It is worth noting that
+this implies a hierarchy where the arithmetic expression has a
+HIGHER precedence than a Boolean factor, and therefore than all
+the Boolean operators. If you write out the precedence levels
+for all the operators, you arrive at the following list:
+
+
+     Level   Syntax Element     Operator
+
+     0       factor             literal, variable
+     1       signed factor      unary minus
+     2       term               *, /
+     3       expression         +, -
+     4       b-factor           literal, variable, relop
+     5       not-factor         NOT
+     6       b-term             AND
+     7       b-expression       OR, XOR
+
+
+If we're willing to accept that many precedence levels, this
+grammar seems reasonable. Unfortunately, it won't work! The
+grammar may be great in theory, but it's no good at all in the
+practice of a top-down parser. To see the problem, consider the
+code fragment:
+
+
+     IF ((((((A + B + C) < 0 ) AND ....
+
+
+When the parser is parsing this code, it knows after it sees the
+IF token that a Boolean expression is supposed to be next. So it
+can set up to begin evaluating such an expression. But the first
+expression in the example is an ARITHMETIC expression, A + B + C.
+What's worse, at the point that the parser has read this much of
+the input line:
+
+
+     IF ((((((A ,
+
+
+it still has no way of knowing which kind of expression it's
+dealing with. That won't do, because we must have different
+recognizers for the two cases. The situation can be handled
+without changing any of our definitions, but only if we're
+willing to accept an arbitrary amount of backtracking to work our
+way out of bad guesses. No compiler writer in his right mind
+would agree to that.
+
+What's going on here is that the beauty and elegance of BNF
+grammar has met face to face with the realities of compiler
+technology.
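+
+To make the difficulty concrete, consider the pair of predicates
+below. They are only an illustration, assuming A, B and C are
+variables of suitable types; both are legal under the grammar as
+it stands at this point, and both begin with exactly the same
+characters:
+
+
+     IF ((A + B) * C) < 0 ...
+     IF ((A < B) AND C) ...
+
+
+In the first, both pairs of parens belong to arithmetic factors,
+and nothing Boolean shows up until the final '<'. In the second,
+both pairs enclose Boolean expressions. A parser that looks only
+one character ahead has to choose a recognizer at the very first
+'(', well before it reaches anything that tells the two cases
+apart.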
+
+To deal with this situation, compiler writers have had to make
+compromises so that a single parser can handle the grammar
+without backtracking.
+
+
+FIXING THE GRAMMAR
+
+The problem that we've encountered comes up because our
+definitions of both arithmetic and Boolean factors permit the use
+of parenthesized expressions. Since the definitions are
+recursive, we can end up with any number of levels of
+parentheses, and the parser can't know which kind of expression
+it's dealing with.
+
+The solution is simple, although it ends up causing profound
+changes to our grammar. We can only allow parentheses in one
+kind of factor. The way to do that varies considerably from
+language to language. This is one place where there is NO
+agreement or convention to help us.
+
+When Niklaus Wirth designed Pascal, the desire was to limit the
+number of levels of precedence (fewer parse routines, after all).
+So the OR and exclusive OR operators are treated just like an
+Addop and processed at the level of a math expression.
+Similarly, the AND is treated like a Mulop and processed with
+Term. The precedence levels are
+
+
+     Level   Syntax Element     Operator
+
+     0       factor             literal, variable
+     1       signed factor      unary minus, NOT
+     2       term               *, /, AND
+     3       expression         +, -, OR
+
+
+Notice that there is only ONE set of syntax rules, applying to
+both kinds of operators. According to this grammar, then,
+expressions like
+
+     x + (y AND NOT z) DIV 3
+
+are perfectly legal. And, in fact, they ARE ... as far as the
+parser is concerned. Pascal doesn't allow the mixing of
+arithmetic and Boolean variables, and things like this are caught
+at the SEMANTIC level, when it comes time to generate code for
+them, rather than at the syntax level.
+
+The authors of C took a diametrically opposite approach: they
+treat the operators as different, and have something much more
+akin to our seven levels of precedence. In fact, in C there are
+no fewer than 17 levels!
That's because C also has the operators
+'=', '+=' and its kin, '<<', '>>', '++', '--', etc. Ironically,
+although in C the arithmetic and Boolean operators are treated
+separately, the variables are NOT ... there are no Boolean or
+logical variables in C, so a Boolean test can be made on any
+integer value.
+
+We'll do something that's sort of in-between. I'm tempted to
+stick mostly with the Pascal approach, since that seems the
+simplest from an implementation point of view, but it results in
+some funnies that I never liked very much, such as the fact that,
+in the expression
+
+     IF (c >= 'A') and (c <= 'Z') then ...
+
+the parens above are REQUIRED. I never understood why before,
+and neither my compiler nor any human ever explained it very
+well, either. But now, we can all see that the 'and' operator,
+having the precedence of a multiply, has a higher one than the
+relational operators, so without the parens the expression is
+equivalent to
+
+     IF c >= ('A' and c) <= 'Z' then
+
+which doesn't make sense.
+
+In any case, I've elected to separate the operators into
+different levels, although not as many as in C.
+
+
+     <b-expression>  ::= <b-term> [<orop> <b-term>]*
+     <b-term>        ::= <not-factor> [AND <not-factor>]*
+     <not-factor>    ::= [NOT] <b-factor>
+     <b-factor>      ::= <b-literal> | <b-variable> | <relation>
+     <relation>      ::= <expression> [<relop> <expression>]
+     <expression>    ::= <term> [<addop> <term>]*
+     <term>          ::= <signed factor> [<mulop> <factor>]*
+     <signed factor> ::= [<addop>] <factor>
+     <factor>        ::= <integer> | <variable> | (<b-expression>)
+
+
+This grammar results in the same set of seven levels that I
+showed earlier. Really, it's almost the same grammar ... I just
+removed the option of parenthesized b-expressions as a possible
+b-factor, and added the relation as a legal form of b-factor.
+
+There is one subtle but crucial difference, which is what makes
+the whole thing work. Notice the square brackets in the
+definition of a relation. This means that the relop and the
+second expression are OPTIONAL.
+
+A strange consequence of this grammar (and one shared by C) is
+that EVERY expression is potentially a Boolean expression. The
+parser will always be looking for a Boolean expression, but will
+"settle" for an arithmetic one.
To be honest, that's going to
+slow down the parser, because it has to wade through more layers
+of procedure calls. That's one reason why Pascal compilers tend
+to compile faster than C compilers. If it's raw speed you want,
+stick with the Pascal syntax.
+
+
+THE PARSER
+
+Now that we've gotten through the decision-making process, we can
+press on with development of a parser. You've done this with me
+several times now, so you know the drill: we begin with a fresh
+copy of the cradle, and begin adding procedures one by one. So
+let's do it.
+
+We begin, as we did in the arithmetic case, by dealing only with
+Boolean literals rather than variables. This gives us a new kind
+of input token, so we're also going to need a new recognizer, and
+a new procedure to read instances of that token type. Let's
+start by defining the two new procedures:
+
+
+{--------------------------------------------------------------}
+{ Recognize a Boolean Literal }
+
+function IsBoolean(c: char): Boolean;
+begin
+   IsBoolean := UpCase(c) in ['T', 'F'];
+end;
+
+
+{--------------------------------------------------------------}
+{ Get a Boolean Literal }
+
+function GetBoolean: Boolean;
+begin
+   if not IsBoolean(Look) then Expected('Boolean Literal');
+   GetBoolean := UpCase(Look) = 'T';
+   GetChar;
+end;
+{--------------------------------------------------------------}
+
+
+Type these routines into your program. You can test them by
+adding into the main program the print statement
+
+
+   WriteLn(GetBoolean);
+
+
+OK, compile the program and test it. As usual, it's not very
+impressive so far, but it soon will be.
+
+Now, when we were dealing with numeric data we had to arrange to
+generate code to load the values into D0. We need to do the same
+for Boolean data. The usual way to encode Boolean variables is
+to let 0 stand for FALSE, and some other value for TRUE. Many
+languages, such as C, use an integer 1 to represent it.
But I
+prefer FFFF hex (or -1), because a bitwise NOT also becomes a
+Boolean NOT. So now we need to emit the right assembler code to
+load those values. The first cut at the Boolean expression
+parser (BoolExpression, of course) is:
+
+
+{---------------------------------------------------------------}
+{ Parse and Translate a Boolean Expression }
+
+procedure BoolExpression;
+begin
+   if not IsBoolean(Look) then Expected('Boolean Literal');
+   if GetBoolean then
+      EmitLn('MOVE #-1,D0')
+   else
+      EmitLn('CLR D0');
+end;
+{---------------------------------------------------------------}
+
+
+Add this procedure to your parser, and call it from the main
+program (replacing the print statement you had just put there).
+As you can see, we still don't have much of a parser, but the
+output code is starting to look more realistic.
+
+Next, of course, we have to expand the definition of a Boolean
+expression. We already have the BNF rule:
+
+
+     <b-expression> ::= <b-term> [<orop> <b-term>]*
+
+
+I prefer the Pascal versions of the "orops", OR and XOR. But
+since we are keeping to single-character tokens here, I'll encode
+those with '|' and '~'.
The next version of BoolExpression is
+almost a direct copy of the arithmetic procedure Expression:
+
+
+{--------------------------------------------------------------}
+{ Recognize and Translate a Boolean OR }
+
+procedure BoolOr;
+begin
+   Match('|');
+   BoolTerm;
+   EmitLn('OR (SP)+,D0');
+end;
+
+
+{--------------------------------------------------------------}
+{ Recognize and Translate an Exclusive Or }
+
+procedure BoolXor;
+begin
+   Match('~');
+   BoolTerm;
+   EmitLn('EOR (SP)+,D0');
+end;
+
+
+{---------------------------------------------------------------}
+{ Parse and Translate a Boolean Expression }
+
+procedure BoolExpression;
+begin
+   BoolTerm;
+   while IsOrop(Look) do begin
+      EmitLn('MOVE D0,-(SP)');
+      case Look of
+       '|': BoolOr;
+       '~': BoolXor;
+      end;
+   end;
+end;
+{---------------------------------------------------------------}
+
+
+Note the new recognizer IsOrop, which is also a copy, this time
+of IsAddop:
+
+
+{--------------------------------------------------------------}
+{ Recognize a Boolean Orop }
+
+function IsOrop(c: char): Boolean;
+begin
+   IsOrop := c in ['|', '~'];
+end;
+{--------------------------------------------------------------}
+
+OK, rename the old version of BoolExpression to BoolTerm, then
+enter the code above. Compile and test this version. At this
+point, the output code is starting to look pretty good. Of
+course, it doesn't make much sense to do a lot of Boolean algebra
+on constant values, but we'll soon be expanding the types of
+Booleans we deal with.
+
+You've probably already guessed what the next step is: The
+Boolean version of Term.
+
+Rename the current procedure BoolTerm to NotFactor, and enter the
+following new version of BoolTerm. Note that it is much simpler
+than the numeric version, since there is no equivalent of
+division.
+
+
+{---------------------------------------------------------------}
+{ Parse and Translate a Boolean Term }
+
+procedure BoolTerm;
+begin
+   NotFactor;
+   while Look = '&' do begin
+      EmitLn('MOVE D0,-(SP)');
+      Match('&');
+      NotFactor;
+      EmitLn('AND (SP)+,D0');
+   end;
+end;
+{--------------------------------------------------------------}
+
+
+Now, we're almost home. We are translating complex Boolean
+expressions, although only for constant values. The next step is
+to allow for the NOT. Write the following procedure:
+
+
+{--------------------------------------------------------------}
+{ Parse and Translate a Boolean Factor with NOT }
+
+procedure NotFactor;
+begin
+   if Look = '!' then begin
+      Match('!');
+      BoolFactor;
+      EmitLn('EOR #-1,D0');
+      end
+   else
+      BoolFactor;
+end;
+{--------------------------------------------------------------}
+
+And rename the earlier procedure to BoolFactor. Now try that.
+At this point the parser should be able to handle any Boolean
+expression you care to throw at it. Does it? Does it trap badly
+formed expressions?
+
+If you've been following what we did in the parser for math
+expressions, you know that what we did next was to expand the
+definition of a factor to include variables and parens. We don't
+have to do that for the Boolean factor, because those little
+items get taken care of by the next step. It takes just a one
+line addition to BoolFactor to take care of relations:
+
+
+{--------------------------------------------------------------}
+{ Parse and Translate a Boolean Factor }
+
+procedure BoolFactor;
+begin
+   if IsBoolean(Look) then
+      if GetBoolean then
+         EmitLn('MOVE #-1,D0')
+      else
+         EmitLn('CLR D0')
+   else Relation;
+end;
+{--------------------------------------------------------------}
+
+
+You might be wondering when I'm going to provide for Boolean
+variables and parenthesized Boolean expressions. The answer is,
+I'm NOT! Remember, we took those out of the grammar earlier.
Right now all I'm doing is encoding the grammar we've already
+agreed upon. The compiler itself can't tell the difference
+between a Boolean variable or expression and an arithmetic one
+... all of those will be handled by Relation, either way.
+
+
+Of course, it would help to have some code for Relation. I don't
+feel comfortable, though, adding any more code without first
+checking out what we already have. So for now let's just write a
+dummy version of Relation that does nothing except eat the
+current character, and write a little message:
+
+
+{---------------------------------------------------------------}
+{ Parse and Translate a Relation }
+
+procedure Relation;
+begin
+   WriteLn('<Relation>');
+   GetChar;
+end;
+{--------------------------------------------------------------}
+
+OK, key in this code and give it a try. All the old things
+should still work ... you should be able to generate the code for
+ANDs, ORs, and NOTs. In addition, if you type any alphabetic
+character you should get a little <Relation> place-holder, where
+a Boolean factor should be. Did you get that? Fine, then let's
+move on to the full-blown version of Relation.
+
+To get that, though, there is a bit of groundwork that we must
+lay first. Recall that a relation has the form
+
+
+     <relation> ::= <expression> [<relop> <expression>]
+
+
+Since we have a new kind of operator, we're also going to need a
+new Boolean function to recognize it. That function is shown
+below. Because of the single-character limitation, I'm sticking
+to the four operators that can be encoded with such a character
+(the "not equals" is encoded by '#').
+
+
+{--------------------------------------------------------------}
+{ Recognize a Relop }
+
+function IsRelop(c: char): Boolean;
+begin
+   IsRelop := c in ['=', '#', '<', '>'];
+end;
+{--------------------------------------------------------------}
+
+
+Now, recall that we're using a zero or a -1 in register D0 to
+represent a Boolean value, and also that the loop constructs
+expect the flags to be set to correspond. In implementing all
+this on the 68000, things get a little bit tricky.
+
+Since the loop constructs operate only on the flags, it would be
+nice (and also quite efficient) just to set up those flags, and
+not load anything into D0 at all. This would be fine for the
+loops and branches, but remember that the relation can be used
+ANYWHERE a Boolean factor could be used. We may be storing its
+result to a Boolean variable.
Since we can't know at this point
+how the result is going to be used, we must allow for BOTH cases.
+
+Comparing numeric data is easy enough ... the 68000 has an
+operation for that ... but it sets the flags, not a value.
+What's more, the flags will always be set the same (zero if
+equal, etc.), while we need the zero flag set differently for
+each of the different relops.
+
+The solution is found in the 68000 instruction Scc, which sets a
+byte value to 00 or FF hex (funny how that works!) depending upon
+the result of the specified condition. If we make D0 the
+destination, we get the Boolean value needed.
+
+Unfortunately, there's one final complication: unlike almost
+every other instruction in the 68000 set, Scc does NOT reset the
+condition flags to match the data being stored. So we have to do
+one last step, which is to test D0 and set the flags to match it.
+It must seem to be a trip around the moon to get what we want: we
+first perform the test, then test the flags to set data into D0,
+then test D0 to set the flags again. It is sort of roundabout,
+but it's the most straightforward way to get the flags right, and
+after all it's only a couple of instructions.
+
+I might mention here that this area is, in my opinion, the one
+that represents the biggest difference between the efficiency of
+hand-coded assembler language and compiler-generated code. We
+have seen already that we lose efficiency in arithmetic
+operations, although later I plan to show you how to improve that
+a bit. We've also seen that the control constructs themselves
+can be done quite efficiently ... it's usually very difficult to
+improve on the code generated for an IF or a WHILE. But
+virtually every compiler I've ever seen generates terrible code,
+compared to assembler, for the computation of a Boolean function,
+and particularly for relations. The reason is just what I've
+hinted at above.
When I'm writing code in assembler, I go ahead
+and perform the test the most convenient way I can, and then set
+up the branch so that it goes the way it should. In effect, I
+"tailor" every branch to the situation. The compiler can't do
+that (practically), and it also can't know that we don't want to
+store the result of the test as a Boolean variable. So it must
+generate the code in a very strict order, and it often ends up
+loading the result as a Boolean that never gets used for
+anything.
+
+In any case, we're now ready to look at the code for Relation.
+It's shown below with its companion procedures:
+
+
+{---------------------------------------------------------------}
+{ Recognize and Translate a Relational "Equals" }
+
+procedure Equals;
+begin
+   Match('=');
+   Expression;
+   EmitLn('CMP (SP)+,D0');
+   EmitLn('SEQ D0');
+end;
+
+
+{---------------------------------------------------------------}
+{ Recognize and Translate a Relational "Not Equals" }
+
+procedure NotEquals;
+begin
+   Match('#');
+   Expression;
+   EmitLn('CMP (SP)+,D0');
+   EmitLn('SNE D0');
+end;
+
+
+{---------------------------------------------------------------}
+{ Recognize and Translate a Relational "Less Than" }
+{ CMP sets the flags for (second - first), so "first < second" }
+{ shows up as the GT condition }
+
+procedure Less;
+begin
+   Match('<');
+   Expression;
+   EmitLn('CMP (SP)+,D0');
+   EmitLn('SGT D0');
+end;
+
+
+{---------------------------------------------------------------}
+{ Recognize and Translate a Relational "Greater Than" }
+{ Same reasoning: "first > second" shows up as the LT condition }
+
+procedure Greater;
+begin
+   Match('>');
+   Expression;
+   EmitLn('CMP (SP)+,D0');
+   EmitLn('SLT D0');
+end;
+
+
+{---------------------------------------------------------------}
+{ Parse and Translate a Relation }
+
+procedure Relation;
+begin
+   Expression;
+   if IsRelop(Look) then begin
+      EmitLn('MOVE D0,-(SP)');
+      case Look of
+       '=': Equals;
+       '#': NotEquals;
+       '<': Less;
+       '>': Greater;
+      end;
+      EmitLn('EXT D0');    { Scc sets only the low byte of D0 }
+      EmitLn('TST D0');
+   end;
+end;
+{---------------------------------------------------------------}
+
+Now, that call to Expression looks familiar!
Here is where the
+editor of your system comes in handy. We have already generated
+code for Expression and its buddies in previous sessions. You
+can copy them into your file now. Remember to use the
+single-character versions. Just to be certain, I've duplicated
+the arithmetic procedures below. If you're observant, you'll
+also see that I've changed them a little to make them correspond
+to the latest version of the syntax. This change is NOT
+necessary, so you may prefer to hold off on that until you're
+sure everything is working.
+
+
+{---------------------------------------------------------------}
+{ Parse and Translate an Identifier }
+
+procedure Ident;
+var Name: char;
+begin
+   Name := GetName;
+   if Look = '(' then begin
+      Match('(');
+      Match(')');
+      EmitLn('BSR ' + Name);
+      end
+   else
+      EmitLn('MOVE ' + Name + '(PC),D0');
+end;
+
+
+{---------------------------------------------------------------}
+{ Parse and Translate a Math Factor }
+
+procedure Expression; Forward;
+
+procedure Factor;
+begin
+   if Look = '(' then begin
+      Match('(');
+      Expression;
+      Match(')');
+      end
+   else if IsAlpha(Look) then
+      Ident
+   else
+      EmitLn('MOVE #' + GetNum + ',D0');
+end;
+
+
+{---------------------------------------------------------------}
+{ Parse and Translate the First Math Factor }
+
+procedure SignedFactor;
+begin
+   if Look = '+' then
+      GetChar;
+   if Look = '-' then begin
+      GetChar;
+      if IsDigit(Look) then
+         EmitLn('MOVE #-' + GetNum + ',D0')
+      else begin
+         Factor;
+         EmitLn('NEG D0');
+      end;
+   end
+   else Factor;
+end;
+
+
+{--------------------------------------------------------------}
+{ Recognize and Translate a Multiply }
+
+procedure Multiply;
+begin
+   Match('*');
+   Factor;
+   EmitLn('MULS (SP)+,D0');
+end;
+
+
+{-------------------------------------------------------------}
+{ Recognize and Translate a Divide }
+{ The first operand (the dividend) is on the stack; the second }
+{ (the divisor) is in D0 }
+
+procedure Divide;
+begin
+   Match('/');
+   Factor;
+   EmitLn('MOVE (SP)+,D1');   { D1 = dividend }
+   EmitLn('EXT.L D1');        { DIVS wants a 32-bit dividend }
+   EmitLn('DIVS D0,D1');      { D1 = D1 / D0 }
+   EmitLn('MOVE D1,D0');      { quotient back into D0 }
+end;
+
+
+{---------------------------------------------------------------}
+{ Parse and Translate a Math Term }
+
+procedure Term;
+begin
+   SignedFactor;
+   while Look in ['*', '/'] do begin
+      EmitLn('MOVE D0,-(SP)');
+      case Look of
+       '*': Multiply;
+       '/': Divide;
+      end;
+   end;
+end;
+
+
+{---------------------------------------------------------------}
+{ Recognize and Translate an Add }
+
+procedure Add;
+begin
+   Match('+');
+   Term;
+   EmitLn('ADD (SP)+,D0');
+end;
+
+
+{---------------------------------------------------------------}
+{ Recognize and Translate a Subtract }
+
+procedure Subtract;
+begin
+   Match('-');
+   Term;
+   EmitLn('SUB (SP)+,D0');
+   EmitLn('NEG D0');
+end;
+
+
+{---------------------------------------------------------------}
+{ Parse and Translate an Expression }
+
+procedure Expression;
+begin
+   Term;
+   while IsAddop(Look) do begin
+      EmitLn('MOVE D0,-(SP)');
+      case Look of
+       '+': Add;
+       '-': Subtract;
+      end;
+   end;
+end;
+{---------------------------------------------------------------}
+
+
+There you have it ... a parser that can handle both arithmetic
+AND Boolean algebra, and things that combine the two through the
+use of relops. I suggest you file away a copy of this parser in
+a safe place for future reference, because in our next step we're
+going to be chopping it up.
+
+
+MERGING WITH CONTROL CONSTRUCTS
+
+At this point, let's go back to the file we had previously built
+that parses control constructs. Remember those little dummy
+procedures called Condition and Expression? Now you know what
+goes in their places!
+
+I warn you, you're going to have to do some creative editing
+here, so take your time and get it right. What you need to do is
+to copy all of the procedures from the logic parser, from Ident
+through BoolExpression, into the parser for control constructs.
+Insert them at the current location of Condition. Then delete
+that procedure, as well as the dummy Expression.
Next, change +every call to Condition to refer to BoolExpression instead. +Finally, copy the procedures IsMulop, IsOrOp, IsRelop, IsBoolean, +and GetBoolean into place. That should do it. + +Compile the resulting program and give it a try. Since we +haven't used this program in a while, don't forget that we used +single-character tokens for IF, WHILE, etc. Also don't forget +that any letter not a keyword just gets echoed as a block. + +Try + + ia=bxlye + +which stands for "IF a=b X ELSE Y ENDIF". + +What do you think? Did it work? Try some others. + + +ADDING ASSIGNMENTS + +As long as we're this far, and we already have the routines for +expressions in place, we might as well replace the "blocks" with +real assignment statements. We've already done that before, so +it won't be too hard. Before taking that step, though, we need +to fix something else. + + + +We're soon going to find that the one-line "programs" that we're +having to write here will really cramp our style. At the moment +we have no cure for that, because our parser doesn't recognize +the end-of-line characters, the carriage return (CR) and the line +feed (LF). So before going any further let's plug that hole. + +There are a couple of ways to deal with the CR/LFs. One (the +C/Unix approach) is just to treat them as additional white space +characters and ignore them. That's actually not such a bad +approach, but it does sort of produce funny results for our +parser as it stands now. If it were reading its input from a +source file as any self-respecting REAL compiler does, there +would be no problem. But we're reading input from the keyboard, +and we're sort of conditioned to expect something to happen when +we hit the return key. It won't, if we just skip over the CR and +LF (try it). So I'm going to use a different method here, which +is NOT necessarily the best approach in the long run. Consider +it a temporary kludge until we're further along.
+ +Instead of skipping the CR/LF, we'll let the parser go ahead and +catch them, then introduce a special procedure, analogous to +SkipWhite, that skips them only in specified "legal" spots. + +Here's the procedure: + + +{--------------------------------------------------------------} +{ Skip a CRLF } + +procedure Fin; +begin + if Look = CR then GetChar; + if Look = LF then GetChar; +end; + +{--------------------------------------------------------------} + + +Now, add two calls to Fin in procedure Block, like this: + + +{--------------------------------------------------------------} +{ Recognize and Translate a Statement Block } + +procedure Block(L: string); +begin + while not(Look in ['e', 'l', 'u']) do begin + Fin; + case Look of + 'i': DoIf(L); + 'w': DoWhile; + 'p': DoLoop; + 'r': DoRepeat; + 'f': DoFor; + 'd': DoDo; + 'b': DoBreak(L); + else Other; + end; + Fin; + end; +end; +{--------------------------------------------------------------} + + + +Now, you'll find that you can use multiple-line "programs." The +only restriction is that you can't separate an IF or WHILE token +from its predicate. + +Now we're ready to include the assignment statements. Simply +change that call to Other in procedure Block to a call to +Assignment, and add the following procedure, copied from one of +our earlier programs. Note that Assignment now calls +BoolExpression, so that we can assign Boolean variables. + + +{--------------------------------------------------------------} +{ Parse and Translate an Assignment Statement } + +procedure Assignment; +var Name: char; +begin + Name := GetName; + Match('='); + BoolExpression; + EmitLn('LEA ' + Name + '(PC),A0'); + EmitLn('MOVE D0,(A0)'); +end; +{--------------------------------------------------------------} + + +With that change, you should now be able to write reasonably +realistic-looking programs, subject only to our limitation on +single-character tokens. My original intention was to get rid of +that limitation for you, too.
However, that's going to require a +fairly major change to what we've done so far. We need a true +lexical scanner, and that requires some structural changes. They +are not BIG changes that require us to throw away all of what +we've done so far ... with care, it can be done with very minimal +changes, in fact. But it does require that care. + +This installment has already gotten pretty long, and it contains +some pretty heavy stuff, so I've decided to leave that step until +next time, when you've had a little more time to digest what +we've done and are ready to start fresh. + +In the next installment, then, we'll build a lexical scanner and +eliminate the single-character barrier once and for all. We'll +also write our first complete compiler, based on what we've done +in this session. See you then. + + +***************************************************************** +* * +* COPYRIGHT NOTICE * +* * +* Copyright (C) 1988 Jack W. Crenshaw. All rights reserved. * +* * +***************************************************************** + + + \ No newline at end of file diff --git a/Compiler construction/Lets_build_compiler/tutor7.txt b/Compiler construction/Lets_build_compiler/tutor7.txt new file mode 100644 index 0000000..4b6f04a --- /dev/null +++ b/Compiler construction/Lets_build_compiler/tutor7.txt @@ -0,0 +1,2299 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + LET'S BUILD A COMPILER! + + By + + Jack W. Crenshaw, Ph.D. + + 7 November 1988 + + + Part VII: LEXICAL SCANNING + + +***************************************************************** +* * +* COPYRIGHT NOTICE * +* * +* Copyright (C) 1988 Jack W. Crenshaw. All rights reserved. * +* * +***************************************************************** + + +INTRODUCTION + +In the last installment, I left you with a compiler that would +ALMOST work, except that we were still limited to single- +character tokens. The purpose of this session is to get rid of +that restriction, once and for all. 
This means that we must deal +with the concept of the lexical scanner. + +Maybe I should mention why we need a lexical scanner at all ... +after all, we've been able to manage all right without one, up +till now, even when we provided for multi-character tokens. + +The ONLY reason, really, has to do with keywords. It's a fact of +computer life that the syntax for a keyword has the same form as +that for any other identifier. We can't tell until we get the +complete word whether or not it IS a keyword. For example, the +variable IFILE and the keyword IF look just alike, until you get +to the third character. In the examples to date, we were always +able to make a decision based upon the first character of the +token, but that's no longer possible when keywords are present. +We need to know that a given string is a keyword BEFORE we begin +to process it. And that's why we need a scanner. + +In the last session, I also promised that we would be able to +provide for normal tokens without making wholesale changes to +what we have already done. I didn't lie ... we can, as you will +see later. But every time I set out to install these elements of +the software into the parser we have already built, I had bad +feelings about it. The whole thing felt entirely too much like a +band-aid. I finally figured out what was causing the problem: I +was installing lexical scanning software without first explaining +to you what scanning is all about, and what the alternatives are. +Up till now, I have studiously avoided giving you a lot of +theory, and certainly not alternatives. I generally don't +respond well to the textbooks that give you twenty-five different +ways to do something, but no clue as to which way best fits your +needs. I've tried to avoid that pitfall by just showing you ONE +method, that WORKS. + +But this is an important area. 
While the lexical scanner is +hardly the most exciting part of a compiler, it often has the +most profound effect on the general "look & feel" of the +language, since after all it's the part closest to the user. I +have a particular structure in mind for the scanner to be used +with KISS. It fits the look & feel that I want for that +language. But it may not work at all for the language YOU'RE +cooking up, so in this one case I feel that it's important for +you to know your options. + +So I'm going to depart, again, from my usual format. In this +session we'll be getting much deeper than usual into the basic +theory of languages and grammars. I'll also be talking about +areas OTHER than compilers in which lexical scanning plays an +important role. Finally, I will show you some alternatives for +the structure of the lexical scanner. Then, and only then, will +we get back to our parser from the last installment. Bear with +me ... I think you'll find it's worth the wait. In fact, since +scanners have many applications outside of compilers, you may +well find this to be the most useful session for you. + + +LEXICAL SCANNING + +Lexical scanning is the process of scanning the stream of input +characters and separating it into strings called tokens. Most +compiler texts start here, and devote several chapters to +discussing various ways to build scanners. This approach has its +place, but as you have already seen, there is a lot you can do +without ever even addressing the issue, and in fact the scanner +we'll end up with here won't look much like what the texts +describe. The reason? Compiler theory and, consequently, the +programs resulting from it, must deal with the most general kind +of parsing rules. We don't. In the real world, it is possible +to specify the language syntax in such a way that a pretty simple +scanner will suffice. And as always, KISS is our motto. 
+ +Typically, lexical scanning is done in a separate part of the +compiler, so that the parser per se sees only a stream of input +tokens. Now, theoretically it is not necessary to separate this +function from the rest of the parser. There is only one set of +syntax equations that define the whole language, so in theory we +could write the whole parser in one module. + +Why the separation? The answer has both practical and +theoretical bases. + +In 1956, Noam Chomsky defined the "Chomsky Hierarchy" of +grammars. They are: + + o Type 0: Unrestricted (e.g., English) + + o Type 1: Context-Sensitive + + o Type 2: Context-Free + + o Type 3: Regular + +A few features of the typical programming language (particularly +the older ones, such as FORTRAN) are Type 1, but for the most +part all modern languages can be described using only the last +two types, and those are all we'll be dealing with here. + +The neat part about these two types is that there are very +specific ways to parse them. It has been shown that any regular +grammar can be parsed using a particular form of abstract machine +called the state machine (finite automaton). We have already +implemented state machines in some of our recognizers. + +Similarly, Type 2 (context-free) grammars can always be parsed +using a push-down automaton (a state machine augmented by a +stack). We have also implemented these machines. Instead of +implementing a literal stack, we have relied on the built-in +stack associated with recursive coding to do the job, and that in +fact is the preferred approach for top-down parsing. 
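To make the point about recursion standing in for the stack a little more concrete, here is a minimal sketch of a recognizer for nested parentheses, something no plain state machine can handle. It is only an illustration under stated assumptions: the names Src, Idx, Peek, Nest, and Balanced are hypothetical and are not part of the cradle.

```pascal
program Nesting;
{ Sketch only: Src, Idx, Peek, Nest and Balanced are hypothetical
  names, not part of the tutorial's cradle. }

var Src: string;     { text being checked }
    Idx: integer;    { position of the current character }

{ Return the current character, or #0 past the end of Src }
function Peek: char;
begin
  if Idx <= Length(Src) then
    Peek := Src[Idx]
  else
    Peek := #0;
end;

{ <nest> ::= [ '(' <nest> ')' ]*
  Each recursive call remembers one open parenthesis, so the
  call stack does the job of the push-down automaton's stack. }
function Nest: boolean;
var ok: boolean;
begin
  ok := true;
  while ok and (Peek = '(') do begin
    Inc(Idx);
    ok := Nest;
    if ok then
      if Peek = ')' then
        Inc(Idx)
      else
        ok := false;
  end;
  Nest := ok;
end;

{ True if s consists of properly nested parentheses only }
function Balanced(s: string): boolean;
var ok: boolean;
begin
  Src := s;
  Idx := 1;
  ok := Nest;
  Balanced := ok and (Idx > Length(Src));
end;

begin
  WriteLn(Balanced('(()())'));
  WriteLn(Balanced('(()'));
end.
```

The first call should report TRUE and the second FALSE. Notice that there is no explicit stack anywhere in the code; the depth of nesting is held entirely in the chain of pending calls to Nest.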
+ +Now, it happens that in real, practical grammars, the parts that +qualify as regular expressions tend to be the lower-level parts, +such as the definition of an identifier: + + <ident> ::= <letter> [ <letter> | <digit> ]* + +Since it takes a different kind of abstract machine to parse the +two types of grammars, it makes sense to separate these lower- +level functions into a separate module, the lexical scanner, +which is built around the idea of a state machine. The idea is to +use the simplest parsing technique needed for the job. + +There is another, more practical reason for separating scanner +from parser. We like to think of the input source file as a +stream of characters, which we process left to right without +backtracking. In practice that isn't possible. Almost every +language has certain keywords such as IF, WHILE, and END. As I +mentioned earlier, we can't really know whether a given +character string is a keyword, until we've reached the end of it, +as defined by a space or other delimiter. So in that sense, we +MUST save the string long enough to find out whether we have a +keyword or not. That's a limited form of backtracking. + +So the structure of a conventional compiler involves splitting up +the functions of the lower-level and higher-level parsing. The +lexical scanner deals with things at the character level, +collecting characters into strings, etc., and passing them along +to the parser proper as indivisible tokens. It's also considered +normal to let the scanner have the job of identifying keywords. + + +STATE MACHINES AND ALTERNATIVES + +I mentioned that the regular expressions can be parsed using a +state machine. In most compiler texts, and indeed in most +compilers as well, you will find this taken literally. There is +typically a real implementation of the state machine, with +integers used to define the current state, and a table of actions +to take for each combination of current state and input +character.
If you write a compiler front end using the popular +Unix tools LEX and YACC, that's what you'll get. The output of +LEX is a state machine implemented in C, plus a table of actions +corresponding to the input grammar given to LEX. The YACC output +is similar ... a canned table-driven parser, plus the table +corresponding to the language syntax. + +That is not the only choice, though. In our previous +installments, you have seen over and over that it is possible to +implement parsers without dealing specifically with tables, +stacks, or state variables. In fact, in Installment V I warned +you that if you find yourself needing these things you might be +doing something wrong, and not taking advantage of the power of +Pascal. There are basically two ways to define a state machine's +state: explicitly, with a state number or code, and implicitly, +simply by virtue of the fact that I'm at a certain place in the +code (if it's Tuesday, this must be Belgium). We've relied +heavily on the implicit approaches before, and I think you'll +find that they work well here, too. + +In practice, it may not even be necessary to HAVE a well-defined +lexical scanner. This isn't our first experience at dealing with +multi-character tokens. In Installment III, we extended our +parser to provide for them, and we didn't even NEED a lexical +scanner. That was because in that narrow context, we could +always tell, just by looking at the single lookahead character, +whether we were dealing with a number, a variable, or an +operator. In effect, we built a distributed lexical scanner, +using procedures GetName and GetNum. + +With keywords present, we can't know anymore what we're dealing +with, until the entire token is read. This leads us to a more +localized scanner; although, as you will see, the idea of a +distributed scanner still has its merits. + + +SOME EXPERIMENTS IN SCANNING + +Before getting back to our compiler, it will be useful to +experiment a bit with the general concepts. 
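As a point of comparison for the experiments that follow, here is a rough sketch of what the EXPLICIT style looks like for an identifier: the state lives in a variable, and every character is dispatched on it. The type State and the function IsIdent are hypothetical names invented for this illustration; the implicit versions we actually use come next.

```pascal
program ExplicitFSM;
{ Sketch only: State and IsIdent are hypothetical names used to
  illustrate an explicit state variable. }

type State = (Start, InName, Bad);

function IsAlpha(c: char): boolean;
begin
  IsAlpha := UpCase(c) in ['A'..'Z'];
end;

function IsDigit(c: char): boolean;
begin
  IsDigit := c in ['0'..'9'];
end;

{ True if s is a letter followed by any number of letters or digits }
function IsIdent(s: string): boolean;
var st: State;
    i: integer;
begin
  st := Start;
  i := 1;
  while (st <> Bad) and (i <= Length(s)) do begin
    case st of
      Start:  if IsAlpha(s[i]) then st := InName else st := Bad;
      InName: if not (IsAlpha(s[i]) or IsDigit(s[i])) then st := Bad;
    end;
    Inc(i);
  end;
  IsIdent := st = InName;   { must have seen at least one letter }
end;

begin
  WriteLn(IsIdent('IFILE'));
  WriteLn(IsIdent('9X'));
end.
```

The next action here depends only on st and the current character, which is exactly the definition of a state machine. The price is the bookkeeping: a state type, a dispatch, and a loop that the implicit form gets for free from the flow of the code.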
+ +Let's begin with the two definitions most often seen in real +programming languages: + + <ident> ::= <letter> [ <letter> | <digit> ]* + <number> ::= [<digit>]+ + +(Remember, the '*' indicates zero or more occurrences of the terms +in brackets, and the '+', one or more.) + +We have already dealt with similar items in Installment III. +Let's begin (as usual) with a bare cradle. Not surprisingly, we +are going to need a new recognizer: + + +{--------------------------------------------------------------} +{ Recognize an Alphanumeric Character } + +function IsAlNum(c: char): boolean; +begin + IsAlNum := IsAlpha(c) or IsDigit(c); +end; +{--------------------------------------------------------------} + + +Using this let's write the following two routines, which are very +similar to those we've used before: + + +{--------------------------------------------------------------} +{ Get an Identifier } + +function GetName: string; +var x: string[8]; +begin + x := ''; + if not IsAlpha(Look) then Expected('Name'); + while IsAlNum(Look) do begin + x := x + UpCase(Look); + GetChar; + end; + GetName := x; +end; + + +{--------------------------------------------------------------} +{ Get a Number } + +function GetNum: string; +var x: string[16]; +begin + x := ''; + if not IsDigit(Look) then Expected('Integer'); + while IsDigit(Look) do begin + x := x + Look; + GetChar; + end; + GetNum := x; +end; +{--------------------------------------------------------------} + + +(Notice that this version of GetNum returns a string, not an +integer as before.) + +You can easily verify that these routines work by calling them +from the main program, as in + + WriteLn(GetName); + +This program will print any legal name typed in (maximum eight +characters, since that's what we told GetName). It will reject +anything else. + +Test the other routine similarly. + + +WHITE SPACE + +We also have dealt with embedded white space before, using the +two routines IsWhite and SkipWhite.
Make sure that these +routines are in your current version of the cradle, and add the +line + + SkipWhite; + +at the end of both GetName and GetNum. + +Now, let's define the new procedure: + + +{--------------------------------------------------------------} +{ Lexical Scanner } + +Function Scan: string; +begin + if IsAlpha(Look) then + Scan := GetName + else if IsDigit(Look) then + Scan := GetNum + else begin + Scan := Look; + GetChar; + end; + SkipWhite; +end; +{--------------------------------------------------------------} + + +We can call this from the new main program: + + +{--------------------------------------------------------------} +{ Main Program } + + +begin + Init; + repeat + Token := Scan; + writeln(Token); + until Token = CR; +end. +{--------------------------------------------------------------} + + +(You will have to add the declaration of the string Token at the +beginning of the program. Make it any convenient length, say 16 +characters.) + +Now, run the program. Note how the input string is, indeed, +separated into distinct tokens. + + +STATE MACHINES + +For the record, a parse routine like GetName does indeed +implement a state machine. The state is implicit in the current +position in the code. A very useful trick for visualizing what's +going on is the syntax diagram, or "railroad-track" diagram. +It's a little difficult to draw one in this medium, so I'll use +them very sparingly, but the figure below should give you the +idea: + + + |-----> Other---------------------------> Error + | + Start -------> Letter ---------------> Other -----> Finish + ^ V + | | + |<----- Letter <---------| + | | + |<----- Digit <---------- + + +As you can see, this diagram shows how the logic flows as +characters are read. Things begin, of course, in the start +state, and end when a character other than an alphanumeric is +found. If the first character is not alpha, an error occurs.
+Otherwise the machine will continue looping until the terminating +delimiter is found. + +Note that at any point in the flow, our position is entirely +dependent on the past history of the input characters. At that +point, the action to be taken depends only on the current state, +plus the current input character. That's what makes this a state +machine. + +Because of the difficulty of drawing railroad-track diagrams in +this medium, I'll continue to stick to syntax equations from now +on. But I highly recommend the diagrams to you for anything you +do that involves parsing. After a little practice you can begin +to see how to write a parser directly from the diagrams. +Parallel paths get coded into guarded actions (guarded by IF's or +CASE statements), serial paths into sequential calls. It's +almost like working from a schematic. + +We didn't even discuss SkipWhite, which was introduced earlier, +but it also is a simple state machine, as is GetNum. So is their +parent procedure, Scan. Little machines make big machines. + +The neat thing that I'd like you to note is how painlessly this +implicit approach creates these state machines. I personally +prefer it a lot over the table-driven approach. It also results +in a small, tight, and fast scanner. + + +NEWLINES + +Moving right along, let's modify our scanner to handle more than +one line. As I mentioned last time, the most straightforward way +to do this is to simply treat the newline characters, carriage +return and line feed, as white space. This is, in fact, the way +the C standard library routine, isspace, works. We didn't +actually try this before. I'd like to do it now, so you can get +a feel for the results. + +To do this, simply modify the single executable line of IsWhite +to read: + + + IsWhite := c in [' ', TAB, CR, LF]; + + +We need to give the main program a new stop condition, since it +will never see a CR. Let's just use: + + + until Token = '.'; + + +OK, compile this program and run it.
Try a couple of lines, +terminated by the period. I used: + + + now is the time + for all good men. + +Hey, what happened? When I tried it, I didn't get the last +token, the period. The program didn't halt. What's more, when I +pressed the 'enter' key a few times, I still didn't get the +period. + +If you're still stuck in your program, you'll find that typing a +period on a new line will terminate it. + +What's going on here? The answer is that we're hanging up in +SkipWhite. A quick look at that routine will show that as long +as we're typing null lines, we're going to just continue to loop. +After SkipWhite encounters an LF, it tries to execute a GetChar. +But since the input buffer is now empty, GetChar's read statement +insists on having another line. Procedure Scan gets the +terminating period, all right, but it calls SkipWhite to clean +up, and SkipWhite won't return until it gets a non-null line. + +This kind of behavior is not quite as bad as it seems. In a real +compiler, we'd be reading from an input file instead of the +console, and as long as we have some procedure for dealing with +end-of-files, everything will come out OK. But for reading data +from the console, the behavior is just too bizarre. The fact of +the matter is that the C/Unix convention is just not compatible +with the structure of our parser, which calls for a lookahead +character. The code that the Bell wizards have implemented +doesn't use that convention, which is why they need 'ungetc'. + +OK, let's fix the problem. To do that, we need to go back to the +old definition of IsWhite (delete the CR and LF characters) and +make use of the procedure Fin that I introduced last time. If +it's not in your current version of the cradle, put it there now. 
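In case your cradle has drifted, here is a short recap of the three routines involved, wrapped in a tiny test program. IsWhite, SkipWhite, and Fin follow the earlier installments. The buffer Src, the index Idx, and this version of GetChar are stand-ins assumed here so the fragment can run without the keyboard; your cradle's GetChar reads from the console instead.

```pascal
program FinDemo;
{ Recap sketch. IsWhite, SkipWhite and Fin follow the earlier
  installments; Src, Idx and this GetChar are hypothetical
  stand-ins for the keyboard-driven cradle. }

const TAB = ^I;
      CR  = ^M;
      LF  = ^J;

var Look: char;      { lookahead character }
    Src: string;     { hypothetical input buffer }
    Idx: integer;    { next position to read in Src }

{ Stand-in GetChar: reads from Src, then returns #0 forever }
procedure GetChar;
begin
  if Idx <= Length(Src) then begin
    Look := Src[Idx];
    Inc(Idx);
  end
  else
    Look := #0;
end;

{ White space is blanks and tabs only; newlines are NOT included }
function IsWhite(c: char): boolean;
begin
  IsWhite := c in [' ', TAB];
end;

procedure SkipWhite;
begin
  while IsWhite(Look) do
    GetChar;
end;

{ Skip one CR/LF pair, but only where the caller asks for it }
procedure Fin;
begin
  if Look = CR then GetChar;
  if Look = LF then GetChar;
end;

begin
  Src := 'a  ' + CR + LF + 'b';
  Idx := 1;
  GetChar;                 { prime the lookahead with 'a' }
  GetChar;                 { step past 'a' }
  SkipWhite;               { stops AT the CR, does not swallow it }
  if Look = CR then Fin;   { guarded call }
  WriteLn(Look);           { first character of the next line }
end.
```

The point to notice is that SkipWhite halts at the CR. The newline is consumed only when the caller decides it is legal to do so, which is what keeps the scanner from reading a line ahead.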
+ +Also, modify the main program to read: + + +{--------------------------------------------------------------} +{ Main Program } + + +begin + Init; + repeat + Token := Scan; + writeln(Token); + if Token = CR then Fin; + until Token = '.'; +end. +{--------------------------------------------------------------} + + +Note the "guard" test preceding the call to Fin. That's what +makes the whole thing work, and ensures that we don't try to read +a line ahead. + +Try the code now. I think you'll like it better. + +If you refer to the code we did in the last installment, you'll +find that I quietly sprinkled calls to Fin throughout the code, +wherever a line break was appropriate. This is one of those +areas that really affects the look & feel that I mentioned. At +this point I would urge you to experiment with different +arrangements and see how you like them. If you want your +language to be truly free-field, then newlines should be +transparent. In this case, the best approach is to put the +following lines at the BEGINNING of Scan: + + + while Look = CR do + Fin; + + +If, on the other hand, you want a line-oriented language like +Assembler, BASIC, or FORTRAN (or even Ada... note that it has +comments terminated by newlines), then you'll need for Scan to +return CR's as tokens. It must also eat the trailing LF. The +best way to do that is to use this line, again at the beginning +of Scan: + + if Look = LF then Fin; + + +For other conventions, you'll have to use other arrangements. +In my example of the last session, I allowed newlines only at +specific places, so I was somewhere in the middle ground. In the +rest of these sessions, I'll be picking ways to handle newlines +that I happen to like, but I want you to know how to choose other +ways for yourselves. + + +OPERATORS + +We could stop now and have a pretty useful scanner for our +purposes. 
In the fragments of KISS that we've built so far, the +only tokens that have multiple characters are the identifiers and +numbers. All operators were single characters. The only +exception I can think of is the relops <=, >=, and <>, but they +could be dealt with as special cases. + +Still, other languages have multi-character operators, such as +the ':=' of Pascal or the '++' and '>>' of C. So while we may +not need multi-character operators, it's nice to know how to get +them if necessary. + +Needless to say, we can handle operators very much the same way +as the other tokens. Let's start with a recognizer: + + +{--------------------------------------------------------------} +{ Recognize Any Operator } + +function IsOp(c: char): boolean; +begin + IsOp := c in ['+', '-', '*', '/', '<', '>', ':', '=']; +end; +{--------------------------------------------------------------} + + +It's important to note that we DON'T have to include every +possible operator in this list. For example, the parentheses +aren't included, nor is the terminating period. The current +version of Scan handles single-character operators just fine as +it is. The list above includes only those characters that can +appear in multi-character operators. (For specific languages, of +course, the list can always be edited.) + +Now, let's modify Scan to read: + + +{--------------------------------------------------------------} +{ Lexical Scanner } + +Function Scan: string; +begin + while Look = CR do + Fin; + if IsAlpha(Look) then + Scan := GetName + else if IsDigit(Look) then + Scan := GetNum + else if IsOp(Look) then + Scan := GetOp + else begin + Scan := Look; + GetChar; + end; + SkipWhite; +end; +{--------------------------------------------------------------} + + +Try the program now. You will find that any code fragments you +care to throw at it will be neatly broken up into individual +tokens.
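One loose end: Scan calls GetOp, which has not been shown at this point. What follows is a minimal sketch of what it might look like, modeled on GetNum; treat it as an assumption rather than the definitive routine. As in the other sketches, Src, Idx, and this GetChar are hypothetical stand-ins so the fragment runs on its own. In the real cradle you would drop them, call Expected('Operator') for the error, and add SkipWhite at the end as we did for GetName and GetNum.

```pascal
program GetOpDemo;
{ Sketch only. GetOp is an assumption modeled on GetNum; Src, Idx
  and this GetChar are hypothetical stand-ins for the cradle. }

var Look: char;      { lookahead character }
    Src: string;     { hypothetical input buffer }
    Idx: integer;    { next position to read in Src }

{ Stand-in GetChar: reads from Src, then returns #0 forever }
procedure GetChar;
begin
  if Idx <= Length(Src) then begin
    Look := Src[Idx];
    Inc(Idx);
  end
  else
    Look := #0;
end;

{ Recognize Any Operator (as in the text) }
function IsOp(c: char): boolean;
begin
  IsOp := c in ['+', '-', '*', '/', '<', '>', ':', '='];
end;

{ Get an Operator: collect a run of operator characters }
function GetOp: string;
var x: string;
begin
  x := '';
  if not IsOp(Look) then begin
    WriteLn('Error: Operator Expected.');
    Halt;
  end;
  while IsOp(Look) do begin
    x := x + Look;
    GetChar;
  end;
  GetOp := x;
end;

begin
  Src := '>=7';
  Idx := 1;
  GetChar;           { prime the lookahead }
  WriteLn(GetOp);    { the two-character operator }
  WriteLn(Look);     { lookahead now rests on the digit }
end.
```

One caution about this design: the loop gathers ANY run of operator characters, so two operators typed with no space between them (say, '=' followed by a unary '-') come back fused into a single token. For a real language you would either insist on separators or match against the list of legal operators.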
+ + +LISTS, COMMAS AND COMMAND LINES + +Before getting back to the main thrust of our study, I'd like to +get on my soapbox for a moment. + +How many times have you worked with a program or operating system +that had rigid rules about how you must separate items in a list? +(Try, the last time you used MSDOS!) Some programs require +spaces as delimiters, and some require commas. Worst of all, +some require both, in different places. Most are pretty +unforgiving about violations of their rules. + +I think this is inexcusable. It's too easy to write a parser +that will handle both spaces and commas in a flexible way. +Consider the following procedure: + + +{--------------------------------------------------------------} +{ Skip Over a Comma } + +procedure SkipComma; +begin + SkipWhite; + if Look = ',' then begin + GetChar; + SkipWhite; + end; +end; +{--------------------------------------------------------------} + + +This eight-line procedure will skip over a delimiter consisting +of any number (including zero) of spaces, with zero or one comma +embedded in the string. + +TEMPORARILY, change the call to SkipWhite in Scan to a call to +SkipComma, and try inputting some lists. Works nicely, eh? +Don't you wish more software authors knew about SkipComma? + +For the record, I found that adding the equivalent of SkipComma +to my Z80 assembler-language programs took all of 6 (six) extra +bytes of code. Even in a 64K machine, that's not a very high +price to pay for user-friendliness! + +I think you can see where I'm going here. Even if you never +write a line of compiler code in your life, there are places in +every program where you can use the concepts of parsing. Any +program that processes a command line needs them. In fact, if +you think about it for a bit, you'll have to conclude that any +time you write a program that processes user inputs, you're +defining a language.
People communicate with languages, and the +syntax implicit in your program defines that language. The real +question is: are you going to define it deliberately and +explicitly, or just let it turn out to be whatever the program +ends up parsing? + +I claim that you'll have a better, more user-friendly program if +you'll take the time to define the syntax explicitly. Write down +the syntax equations or draw the railroad-track diagrams, and +code the parser using the techniques I've shown you here. You'll +end up with a better program, and it will be easier to write, to +boot. + + +GETTING FANCY + +OK, at this point we have a pretty nice lexical scanner that will +break an input stream up into tokens. We could use it as it +stands and have a serviceable compiler. But there are some other +aspects of lexical scanning that we need to cover. + +The main consideration is efficiency. Remember when we +were dealing with single-character tokens, every test was a +comparison of a single character, Look, with a byte constant. We +also used the Case statement heavily. + +With the multi-character tokens being returned by Scan, all those +tests now become string comparisons. Much slower. And not only +slower, but more awkward, since there is no string equivalent of +the Case statement in Pascal. It seems especially wasteful to +test for what used to be single characters ... the '=', '+', and +other operators ... using string comparisons. + +Using string comparison is not impossible ... Ron Cain used just +that approach in writing Small C. Since we're sticking to the +KISS principle here, we would be truly justified in settling for +this approach. But then I would have failed to tell you about +one of the key approaches used in "real" compilers. + +You have to remember: the lexical scanner is going to be called a +_LOT_! Once for every token in the whole source program, in +fact.
Experiments have indicated that the average compiler +spends anywhere from 20% to 40% of its time in the scanner +routines. If there were ever a place where efficiency deserves +real consideration, this is it. + +For this reason, most compiler writers ask the lexical scanner to +do a little more work, by "tokenizing" the input stream. The +idea is to match every token against a list of acceptable +keywords and operators, and return unique codes for each one +recognized. In the case of ordinary variable names or numbers, +we just return a code that says what kind of token they are, and +save the actual string somewhere else. + +One of the first things we're going to need is a way to identify +keywords. We can always do it with successive IF tests, but it +surely would be nice if we had a general-purpose routine that +could compare a given string with a table of keywords. (By the +way, we're also going to need such a routine later, for dealing +with symbol tables.) This usually presents a problem in Pascal, +because standard Pascal doesn't allow for arrays of variable +lengths. It's a real bother to have to declare a different +search routine for every table. Standard Pascal also doesn't +allow for initializing arrays, so you tend to see code like + + Table[1] := 'IF'; + Table[2] := 'ELSE'; + . + . + Table[n] := 'END'; + +which can get pretty old if there are many keywords. + +Fortunately, Turbo Pascal 4.0 has extensions that eliminate both +of these problems. Constant arrays can be declared using TP's +"typed constant" facility, and the variable dimensions can be +handled with its C-like extensions for pointers. + +First, modify your declarations like this: + + +{--------------------------------------------------------------} +{ Type Declarations } + +type Symbol = string[8]; + + SymTab = array[1..1000] of Symbol; + + TabPtr = ^SymTab; +{--------------------------------------------------------------} + + +(The dimension used in SymTab is not real ... 
no storage is +allocated by the declaration itself, and the number need only be +"big enough.") + +Now, just beneath those declarations, add the following: + + +{--------------------------------------------------------------} +{ Definition of Keywords and Token Types } + +const KWlist: array [1..4] of Symbol = + ('IF', 'ELSE', 'ENDIF', 'END'); + +{--------------------------------------------------------------} + + +Next, insert the following new function: + + +{--------------------------------------------------------------} +{ Table Lookup } + +{ If the input string matches a table entry, return the entry + index. If not, return a zero. } + +function Lookup(T: TabPtr; s: string; n: integer): integer; +var i: integer; + found: boolean; +begin + found := false; + i := n; + while (i > 0) and not found do + if s = T^[i] then + found := true + else + dec(i); + Lookup := i; +end; +{--------------------------------------------------------------} + + +To test it, you can temporarily change the main program as +follows: + + +{--------------------------------------------------------------} +{ Main Program } + + +begin + ReadLn(Token); + WriteLn(Lookup(Addr(KWList), Token, 4)); +end. +{--------------------------------------------------------------} + + +Notice how Lookup is called: The Addr function sets up a pointer +to KWList, which gets passed to Lookup. + +OK, give this a try. Since we're bypassing Scan here, you'll +have to type the keywords in upper case to get any matches. + +Now that we can recognize keywords, the next thing is to arrange +to return codes for them. + +So what kind of code should we return? There are really only two +reasonable choices. This seems like an ideal application for the +Pascal enumerated type. For example, you can define something +like + + SymType = (IfSym, ElseSym, EndifSym, EndSym, Ident, Number, + Operator); + +and arrange to return a variable of this type. Let's give it a +try. Insert the line above into your type definitions. 
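
As a quick check on the idea, here is a minimal stand-alone sketch (not part of the scanner itself) of how a table index can be turned into a value of the enumerated type. It assumes the KWlist table shown above. KeywordIndex and Classify are hypothetical helper names used only for this illustration, and Operator has been renamed OpSym here because some newer compilers, Free Pascal among them, reserve the word "operator".

```pascal
program EnumDemo;

{ Stand-alone sketch: map a keyword table index onto an enumerated type. }
{ KeywordIndex, Classify and OpSym are illustrative names, not part of   }
{ the tutorial's code.                                                   }

type Symbol  = string[8];

     SymType = (IfSym, ElseSym, EndifSym, EndSym, Ident, Number, OpSym);

const KWlist: array [1..4] of Symbol =
              ('IF', 'ELSE', 'ENDIF', 'END');

{ Return the 1-based table index of s, or 0 if s is not a keyword }
function KeywordIndex(s: string): integer;
var i: integer;
    found: boolean;
begin
   found := false;
   i := 4;
   while (i > 0) and not found do
      if s = KWlist[i] then
         found := true
      else
         dec(i);
   KeywordIndex := i;
end;

{ Convert a name to its token type }
function Classify(s: string): SymType;
var k: integer;
begin
   k := KeywordIndex(s);
   if k = 0 then
      Classify := Ident
   else
      Classify := SymType(k - 1);   { ordinals start at 0, the table at 1 }
end;

begin
   WriteLn(Ord(Classify('IF')));    { prints 0, the ordinal of IfSym  }
   WriteLn(Ord(Classify('END')));   { prints 3, the ordinal of EndSym }
   WriteLn(Ord(Classify('X')));     { prints 4, the ordinal of Ident  }
end.
```

The "k - 1" is needed only because the keyword table is indexed from 1 while enumeration ordinals start at 0. The trick also depends on the first four enumeration values being listed in the same order as the entries of KWlist.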
+ +Now, add the two variable declarations: + + + Token: Symtype; { Current Token } + Value: String[16]; { String Token of Look } + + +Modify the scanner to read: + + +{--------------------------------------------------------------} +{ Lexical Scanner } + +procedure Scan; +var k: integer; +begin + while Look = CR do + Fin; + if IsAlpha(Look) then begin + Value := GetName; + k := Lookup(Addr(KWlist), Value, 4); + if k = 0 then + Token := Ident + else + Token := SymType(k - 1); + end + else if IsDigit(Look) then begin + Value := GetNum; + Token := Number; + end + else if IsOp(Look) then begin + Value := GetOp; + Token := Operator; + end + else begin + Value := Look; + Token := Operator; + GetChar; + end; + SkipWhite; +end; +{--------------------------------------------------------------} + + +(Notice that Scan is now a procedure, not a function.) + + +Finally, modify the main program to read: + + +{--------------------------------------------------------------} +{ Main Program } + +begin + Init; + repeat + Scan; + case Token of + Ident: write('Ident '); + Number: Write('Number '); + Operator: Write('Operator '); + IfSym, ElseSym, EndifSym, EndSym: Write('Keyword '); + end; + Writeln(Value); + until Token = EndSym; +end. +{--------------------------------------------------------------} + + +What we've done here is to replace the string Token used earlier +with an enumerated type. Scan returns the type in variable Token, +and returns the string itself in the new variable Value. + +OK, compile this and give it a whirl. If everything goes right, +you should see that we are now recognizing keywords. + +What we have now is working right, and it was easy to generate +from what we had earlier. However, it still seems a little +"busy" to me. We can simplify things a bit by letting GetName, +GetNum, GetOp, and Scan be procedures working with the global +variables Token and Value, thereby eliminating the local copies. 
+It also seems a little cleaner to move the table lookup into +GetName. The new form for the four procedures is, then: + + +{--------------------------------------------------------------} +{ Get an Identifier } + +procedure GetName; +var k: integer; +begin + Value := ''; + if not IsAlpha(Look) then Expected('Name'); + while IsAlNum(Look) do begin + Value := Value + UpCase(Look); + GetChar; + end; + k := Lookup(Addr(KWlist), Value, 4); + if k = 0 then + Token := Ident + else + Token := SymType(k-1); +end; + +{--------------------------------------------------------------} +{ Get a Number } + +procedure GetNum; +begin + Value := ''; + if not IsDigit(Look) then Expected('Integer'); + while IsDigit(Look) do begin + Value := Value + Look; + GetChar; + end; + Token := Number; +end; + + +{--------------------------------------------------------------} +{ Get an Operator } + +procedure GetOp; +begin + Value := ''; + if not IsOp(Look) then Expected('Operator'); + while IsOp(Look) do begin + Value := Value + Look; + GetChar; + end; + Token := Operator; +end; + + +{--------------------------------------------------------------} +{ Lexical Scanner } + +procedure Scan; +var k: integer; +begin + while Look = CR do + Fin; + if IsAlpha(Look) then + GetName + else if IsDigit(Look) then + GetNum + else if IsOp(Look) then + GetOp + else begin + Value := Look; + Token := Operator; + GetChar; + end; + SkipWhite; +end; +{--------------------------------------------------------------} + + +RETURNING A CHARACTER + +Essentially every scanner I've ever seen that was written in +Pascal used the mechanism of an enumerated type that I've just +described. It is certainly a workable mechanism, but it doesn't +seem the simplest approach to me. + +For one thing, the list of possible symbol types can get pretty +long. Here, I've used just one symbol, "Operator," to stand for +all of the operators, but I've seen other designs that actually +return different codes for each one. 
+ +There is, of course, another simple type that can be returned as +a code: the character. Instead of returning the enumeration +value 'Operator' for a '+' sign, what's wrong with just returning +the character itself? A character is just as good a variable for +encoding the different token types, it can be used in case +statements easily, and it's sure a lot easier to type. What +could be simpler? + +Besides, we've already had experience with the idea of encoding +keywords as single characters. Our previous programs are already +written that way, so using this approach will minimize the +changes to what we've already done. + +Some of you may feel that this idea of returning character codes +is too mickey-mouse. I must admit it gets a little awkward for +multi-character operators like '<='. If you choose to stay with +the enumerated type, fine. For the rest, I'd like to show you +how to change what we've done above to support that approach. + +First, you can delete the SymType declaration now ... we won't be +needing that. And you can change the type of Token to char. + +Next, to replace SymType, add the following constant string: + + + const KWcode: string[5] = 'xilee'; + + +(I'll be encoding all idents with the single character 'x'.) 
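
The way KWcode lines up with KWlist can be checked with a small stand-alone sketch. It assumes the same four-entry keyword table used earlier. KeywordIndex is a hypothetical helper that stands in for Lookup so that the example needs no pointer types.

```pascal
program CodeDemo;

{ Stand-alone sketch: encode keywords as single characters via KWcode. }
{ KeywordIndex is an illustrative stand-in for the tutorial's Lookup.  }

type Symbol = string[8];

const KWlist: array [1..4] of Symbol =
              ('IF', 'ELSE', 'ENDIF', 'END');

const KWcode: string[5] = 'xilee';

{ Return the 1-based table index of s, or 0 if s is not a keyword }
function KeywordIndex(s: string): integer;
var i: integer;
    found: boolean;
begin
   found := false;
   i := 4;
   while (i > 0) and not found do
      if s = KWlist[i] then
         found := true
      else
         dec(i);
   KeywordIndex := i;
end;

begin
   { Index 0 (no match) selects KWcode[1], which is 'x' }
   WriteLn(KWcode[KeywordIndex('COUNT') + 1]);   { prints x }
   WriteLn(KWcode[KeywordIndex('IF') + 1]);      { prints i }
   WriteLn(KWcode[KeywordIndex('ELSE') + 1]);    { prints l }
   WriteLn(KWcode[KeywordIndex('ENDIF') + 1]);   { prints e }
   WriteLn(KWcode[KeywordIndex('END') + 1]);     { prints e }
end.
```

Because a failed search returns zero, adding one to the result makes the first character of KWcode the code for "not a keyword". Note that ENDIF and END share the code 'e', so the code alone cannot tell them apart.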
+ + +Lastly, modify Scan and its relatives as follows: + + +{--------------------------------------------------------------} +{ Get an Identifier } + +procedure GetName; +begin + Value := ''; + if not IsAlpha(Look) then Expected('Name'); + while IsAlNum(Look) do begin + Value := Value + UpCase(Look); + GetChar; + end; + Token := KWcode[Lookup(Addr(KWlist), Value, 4) + 1]; +end; + + +{--------------------------------------------------------------} +{ Get a Number } + +procedure GetNum; +begin + Value := ''; + if not IsDigit(Look) then Expected('Integer'); + while IsDigit(Look) do begin + Value := Value + Look; + GetChar; + end; + Token := '#'; +end; + + +{--------------------------------------------------------------} +{ Get an Operator } + +procedure GetOp; +begin + Value := ''; + if not IsOp(Look) then Expected('Operator'); + while IsOp(Look) do begin + Value := Value + Look; + GetChar; + end; + if Length(Value) = 1 then + Token := Value[1] + else + Token := '?'; +end; + + +{--------------------------------------------------------------} +{ Lexical Scanner } + +procedure Scan; +var k: integer; +begin + while Look = CR do + Fin; + if IsAlpha(Look) then + GetName + else if IsDigit(Look) then + GetNum + else if IsOp(Look) then + GetOp + else begin + Value := Look; + Token := '?'; + GetChar; + end; + SkipWhite; +end; + + +{--------------------------------------------------------------} +{ Main Program } + + +begin + Init; + repeat + Scan; + case Token of + 'x': write('Ident '); + '#': Write('Number '); + 'i', 'l', 'e': Write('Keyword '); + else Write('Operator '); + end; + Writeln(Value); + until Value = 'END'; +end. +{--------------------------------------------------------------} + + +This program should work the same as the previous version. A +minor difference in structure, maybe, but it seems more +straightforward to me.
+ + +DISTRIBUTED vs CENTRALIZED SCANNERS + +The structure for the lexical scanner that I've just shown you is +very conventional, and about 99% of all compilers use something +very close to it. This is not, however, the only possible +structure, or even always the best one. + +The problem with the conventional approach is that the scanner +has no knowledge of context. For example, it can't distinguish +between the assignment operator '=' and the relational operator +'=' (perhaps that's why both C and Pascal use different strings +for the two). All the scanner can do is to pass the operator +along to the parser, which can hopefully tell from the context +which operator is meant. Similarly, a keyword like 'IF' has no +place in the middle of a math expression, but if one happens to +appear there, the scanner will see no problem with it, and will +return it to the parser, properly encoded as an 'IF'. + +With this kind of approach, we are not really using all the +information at our disposal. In the middle of an expression, for +example, the parser "knows" that there is no need to look for +keywords, but it has no way of telling the scanner that. So the +scanner continues to do so. This, of course, slows down the +compilation. + +In real-world compilers, the designers often arrange for more +information to be passed between parser and scanner, just to +avoid this kind of problem. But that can get awkward, and +certainly destroys a lot of the modularity of the structure. + +The alternative is to seek some way to use the contextual +information that comes from knowing where we are in the parser. +This leads us back to the notion of a distributed scanner, in +which various portions of the scanner are called depending upon +the context. + +In KISS, as in most languages, keywords ONLY appear at the +beginning of a statement. In places like expressions, they are +not allowed. 
Also, with one minor exception (the multi-character +relops) that is easily handled, all operators are single +characters, which means that we don't need GetOp at all. + +So it turns out that even with multi-character tokens, we can +still always tell from the current lookahead character exactly +what kind of token is coming, except at the very beginning of a +statement. + +Even at that point, the ONLY kind of token we can accept is an +identifier. We need only to determine if that identifier is a +keyword or the target of an assignment statement. + +We end up, then, still needing only GetName and GetNum, which are +used very much as we've used them in earlier installments. + +It may seem at first to you that this is a step backwards, and a +rather primitive approach. In fact, it is an improvement over +the classical scanner, since we're using the scanning routines +only where they're really needed. In places where keywords are +not allowed, we don't slow things down by looking for them. + + +MERGING SCANNER AND PARSER + +Now that we've covered all of the theory and general aspects of +lexical scanning that we'll be needing, I'm FINALLY ready to back +up my claim that we can accommodate multi-character tokens with +minimal change to our previous work. To keep things short and +simple I will restrict myself here to a subset of what we've done +before; I'm allowing only one control construct (the IF) and no +Boolean expressions. That's enough to demonstrate the parsing of +both keywords and expressions. The extension to the full set of +constructs should be pretty apparent from what we've already +done. + +All the elements of the program to parse this subset, using +single-character tokens, exist already in our previous programs. +I built it by judicious copying of these files, but I wouldn't +dare try to lead you through that process.
Instead, to avoid any +confusion, the whole program is shown below: + + +{--------------------------------------------------------------} +program KISS; + +{--------------------------------------------------------------} +{ Constant Declarations } + +const TAB = ^I; + CR = ^M; + LF = ^J; + +{--------------------------------------------------------------} +{ Type Declarations } + +type Symbol = string[8]; + + SymTab = array[1..1000] of Symbol; + + TabPtr = ^SymTab; + + +{--------------------------------------------------------------} +{ Variable Declarations } + +var Look : char; { Lookahead Character } + Lcount: integer; { Label Counter } + + +{--------------------------------------------------------------} +{ Read New Character From Input Stream } + +procedure GetChar; +begin + Read(Look); +end; + + +{--------------------------------------------------------------} +{ Report an Error } + +procedure Error(s: string); +begin + WriteLn; + WriteLn(^G, 'Error: ', s, '.'); +end; + + +{--------------------------------------------------------------} +{ Report Error and Halt } + +procedure Abort(s: string); +begin + Error(s); + Halt; +end; + + +{--------------------------------------------------------------} +{ Report What Was Expected } + +procedure Expected(s: string); +begin + Abort(s + ' Expected'); +end; + +{--------------------------------------------------------------} +{ Recognize an Alpha Character } + +function IsAlpha(c: char): boolean; +begin + IsAlpha := UpCase(c) in ['A'..'Z']; +end; + + +{--------------------------------------------------------------} +{ Recognize a Decimal Digit } + +function IsDigit(c: char): boolean; +begin + IsDigit := c in ['0'..'9']; +end; + + +{--------------------------------------------------------------} +{ Recognize an AlphaNumeric Character } + +function IsAlNum(c: char): boolean; +begin + IsAlNum := IsAlpha(c) or IsDigit(c); +end; + +{--------------------------------------------------------------} +{ Recognize an Addop } + 
+function IsAddop(c: char): boolean; +begin + IsAddop := c in ['+', '-']; +end; + + +{--------------------------------------------------------------} +{ Recognize a Mulop } + +function IsMulop(c: char): boolean; +begin + IsMulop := c in ['*', '/']; +end; + + +{--------------------------------------------------------------} +{ Recognize White Space } + +function IsWhite(c: char): boolean; +begin + IsWhite := c in [' ', TAB]; +end; + + +{--------------------------------------------------------------} +{ Skip Over Leading White Space } + +procedure SkipWhite; +begin + while IsWhite(Look) do + GetChar; +end; + + +{--------------------------------------------------------------} +{ Match a Specific Input Character } + +procedure Match(x: char); +begin + if Look <> x then Expected('''' + x + ''''); + GetChar; + SkipWhite; +end; + + +{--------------------------------------------------------------} +{ Skip a CRLF } + +procedure Fin; +begin + if Look = CR then GetChar; + if Look = LF then GetChar; + SkipWhite; +end; + + +{--------------------------------------------------------------} +{ Get an Identifier } + +function GetName: char; +begin + while Look = CR do + Fin; + if not IsAlpha(Look) then Expected('Name'); + Getname := UpCase(Look); + GetChar; + SkipWhite; +end; + + +{--------------------------------------------------------------} +{ Get a Number } + +function GetNum: char; +begin + if not IsDigit(Look) then Expected('Integer'); + GetNum := Look; + GetChar; + SkipWhite; +end; + + +{--------------------------------------------------------------} +{ Generate a Unique Label } + +function NewLabel: string; +var S: string; +begin + Str(LCount, S); + NewLabel := 'L' + S; + Inc(LCount); +end; + + +{--------------------------------------------------------------} +{ Post a Label To Output } + +procedure PostLabel(L: string); +begin + WriteLn(L, ':'); +end; + + +{--------------------------------------------------------------} +{ Output a String with Tab } + +procedure Emit(s: 
string); +begin + Write(TAB, s); +end; + + +{--------------------------------------------------------------} + +{ Output a String with Tab and CRLF } + +procedure EmitLn(s: string); +begin + Emit(s); + WriteLn; +end; + + +{---------------------------------------------------------------} +{ Parse and Translate an Identifier } + +procedure Ident; +var Name: char; +begin + Name := GetName; + if Look = '(' then begin + Match('('); + Match(')'); + EmitLn('BSR ' + Name); + end + else + EmitLn('MOVE ' + Name + '(PC),D0'); +end; + + +{---------------------------------------------------------------} +{ Parse and Translate a Math Factor } + +procedure Expression; Forward; + +procedure Factor; +begin + if Look = '(' then begin + Match('('); + Expression; + Match(')'); + end + else if IsAlpha(Look) then + Ident + else + EmitLn('MOVE #' + GetNum + ',D0'); +end; + + +{---------------------------------------------------------------} +{ Parse and Translate the First Math Factor } + + +procedure SignedFactor; +var s: boolean; +begin + s := Look = '-'; + if IsAddop(Look) then begin + GetChar; + SkipWhite; + end; + Factor; + if s then + EmitLn('NEG D0'); +end; + + +{--------------------------------------------------------------} +{ Recognize and Translate a Multiply } + +procedure Multiply; +begin + Match('*'); + Factor; + EmitLn('MULS (SP)+,D0'); +end; + + +{-------------------------------------------------------------} +{ Recognize and Translate a Divide } + +procedure Divide; +begin + Match('/'); + Factor; + EmitLn('MOVE (SP)+,D1'); + EmitLn('EXT.L D0'); + EmitLn('DIVS D1,D0'); +end; + + +{---------------------------------------------------------------} +{ Completion of Term Processing (called by Term and FirstTerm) } + +procedure Term1; +begin + while IsMulop(Look) do begin + EmitLn('MOVE D0,-(SP)'); + case Look of + '*': Multiply; + '/': Divide; + end; + end; +end; + + +{---------------------------------------------------------------} +{ Parse and Translate a Math Term } +
+procedure Term; +begin + Factor; + Term1; +end; + + +{---------------------------------------------------------------} +{ Parse and Translate a Math Term with Possible Leading Sign } + +procedure FirstTerm; +begin + SignedFactor; + Term1; +end; + + +{---------------------------------------------------------------} +{ Recognize and Translate an Add } + +procedure Add; +begin + Match('+'); + Term; + EmitLn('ADD (SP)+,D0'); +end; + + +{---------------------------------------------------------------} +{ Recognize and Translate a Subtract } + +procedure Subtract; +begin + Match('-'); + Term; + EmitLn('SUB (SP)+,D0'); + EmitLn('NEG D0'); +end; + + +{---------------------------------------------------------------} +{ Parse and Translate an Expression } + +procedure Expression; +begin + FirstTerm; + while IsAddop(Look) do begin + EmitLn('MOVE D0,-(SP)'); + case Look of + '+': Add; + '-': Subtract; + end; + end; +end; + + +{---------------------------------------------------------------} +{ Parse and Translate a Boolean Condition } +{ This version is a dummy } + +Procedure Condition; +begin + EmitLn('Condition'); +end; + + +{---------------------------------------------------------------} +{ Recognize and Translate an IF Construct } + +procedure Block; + Forward; + +procedure DoIf; +var L1, L2: string; +begin + Match('i'); + Condition; + L1 := NewLabel; + L2 := L1; + EmitLn('BEQ ' + L1); + Block; + if Look = 'l' then begin + Match('l'); + L2 := NewLabel; + EmitLn('BRA ' + L2); + PostLabel(L1); + Block; + end; + PostLabel(L2); + Match('e'); +end; + + +{--------------------------------------------------------------} +{ Parse and Translate an Assignment Statement } + +procedure Assignment; +var Name: char; +begin + Name := GetName; + Match('='); + Expression; + EmitLn('LEA ' + Name + '(PC),A0'); + EmitLn('MOVE D0,(A0)'); +end; + + +{--------------------------------------------------------------} +{ Recognize and Translate a Statement Block } + +procedure Block; +begin + while 
not(Look in ['e', 'l']) do begin + case Look of + 'i': DoIf; + CR: while Look = CR do + Fin; + else Assignment; + end; + end; +end; + + +{--------------------------------------------------------------} +{ Parse and Translate a Program } + +procedure DoProgram; +begin + Block; + if Look <> 'e' then Expected('END'); + EmitLn('END') +end; + + +{--------------------------------------------------------------} + +{ Initialize } + +procedure Init; +begin + LCount := 0; + GetChar; +end; + + +{--------------------------------------------------------------} +{ Main Program } + +begin + Init; + DoProgram; +end. +{--------------------------------------------------------------} + + +A couple of comments: + + (1) The form for the expression parser, using FirstTerm, etc., + is a little different from what you've seen before. It's + yet another variation on the same theme. Don't let it throw + you ... the change is not required for what follows. + + (2) Note that, as usual, I had to add calls to Fin at strategic + spots to allow for multiple lines. + +Before we proceed to adding the scanner, first copy this file and +verify that it does indeed parse things correctly. Don't forget +the "codes": 'i' for IF, 'l' for ELSE, and 'e' for END or ENDIF. + +If the program works, then let's press on. In adding the scanner +modules to the program, it helps to have a systematic plan. In +all the parsers we've written to date, we've stuck to a +convention that the current lookahead character should always be +a non-blank character. We preload the lookahead character in +Init, and keep the "pump primed" after that. To keep the thing +working right at newlines, we had to modify this a bit and treat +the newline as a legal token. + +In the multi-character version, the rule is similar: The current +lookahead character should always be left at the BEGINNING of the +next token, or at a newline. + +The multi-character version is shown next. 
To get it, I've made +the following changes: + + + o Added the variables Token and Value, and the type definitions + needed by Lookup. + + o Added the definitions of KWList and KWcode. + + o Added Lookup. + + o Replaced GetName and GetNum by their multi-character versions. + (Note that the call to Lookup has been moved out of GetName, + so that it will not be executed for calls within an + expression.) + + o Created a new, vestigial Scan that calls GetName, then scans + for keywords. + + o Created a new procedure, MatchString, that looks for a + specific keyword. Note that, unlike Match, MatchString does + NOT read the next keyword. + + o Modified Block to call Scan. + + o Changed the calls to Fin a bit. Fin is now called within + GetName. + +Here is the program in its entirety: + + +{--------------------------------------------------------------} +program KISS; + +{--------------------------------------------------------------} +{ Constant Declarations } + +const TAB = ^I; + CR = ^M; + LF = ^J; + +{--------------------------------------------------------------} +{ Type Declarations } + +type Symbol = string[8]; + + SymTab = array[1..1000] of Symbol; + + TabPtr = ^SymTab; + + +{--------------------------------------------------------------} +{ Variable Declarations } + +var Look : char; { Lookahead Character } + Token : char; { Encoded Token } + Value : string[16]; { Unencoded Token } + Lcount: integer; { Label Counter } + + +{--------------------------------------------------------------} +{ Definition of Keywords and Token Types } + +const KWlist: array [1..4] of Symbol = + ('IF', 'ELSE', 'ENDIF', 'END'); + +const KWcode: string[5] = 'xilee'; + + +{--------------------------------------------------------------} +{ Read New Character From Input Stream } + +procedure GetChar; +begin + Read(Look); +end; + +{--------------------------------------------------------------} +{ Report an Error } + +procedure Error(s: string); +begin + WriteLn; + WriteLn(^G, 'Error: ', s, 
'.'); +end; + + +{--------------------------------------------------------------} +{ Report Error and Halt } + +procedure Abort(s: string); +begin + Error(s); + Halt; +end; + + +{--------------------------------------------------------------} +{ Report What Was Expected } + +procedure Expected(s: string); +begin + Abort(s + ' Expected'); +end; + +{--------------------------------------------------------------} +{ Recognize an Alpha Character } + +function IsAlpha(c: char): boolean; +begin + IsAlpha := UpCase(c) in ['A'..'Z']; +end; + + +{--------------------------------------------------------------} +{ Recognize a Decimal Digit } + +function IsDigit(c: char): boolean; +begin + IsDigit := c in ['0'..'9']; +end; + + +{--------------------------------------------------------------} +{ Recognize an AlphaNumeric Character } + +function IsAlNum(c: char): boolean; +begin + IsAlNum := IsAlpha(c) or IsDigit(c); +end; + + +{--------------------------------------------------------------} +{ Recognize an Addop } + +function IsAddop(c: char): boolean; +begin + IsAddop := c in ['+', '-']; +end; + + +{--------------------------------------------------------------} +{ Recognize a Mulop } + +function IsMulop(c: char): boolean; +begin + IsMulop := c in ['*', '/']; +end; + + +{--------------------------------------------------------------} +{ Recognize White Space } + +function IsWhite(c: char): boolean; +begin + IsWhite := c in [' ', TAB]; +end; + + +{--------------------------------------------------------------} +{ Skip Over Leading White Space } + +procedure SkipWhite; +begin + while IsWhite(Look) do + GetChar; +end; + + +{--------------------------------------------------------------} +{ Match a Specific Input Character } + +procedure Match(x: char); +begin + if Look <> x then Expected('''' + x + ''''); + GetChar; + SkipWhite; +end; + + +{--------------------------------------------------------------} +{ Skip a CRLF } + +procedure Fin; +begin + if Look = CR then GetChar; + if 
Look = LF then GetChar; + SkipWhite; +end; + + +{--------------------------------------------------------------} +{ Table Lookup } + +function Lookup(T: TabPtr; s: string; n: integer): integer; +var i: integer; + found: boolean; +begin + found := false; + i := n; + while (i > 0) and not found do + if s = T^[i] then + found := true + else + dec(i); + Lookup := i; +end; + + +{--------------------------------------------------------------} +{ Get an Identifier } + +procedure GetName; +begin + while Look = CR do + Fin; + if not IsAlpha(Look) then Expected('Name'); + Value := ''; + while IsAlNum(Look) do begin + Value := Value + UpCase(Look); + GetChar; + end; + SkipWhite; +end; + + +{--------------------------------------------------------------} +{ Get a Number } + +procedure GetNum; +begin + if not IsDigit(Look) then Expected('Integer'); + Value := ''; + while IsDigit(Look) do begin + Value := Value + Look; + GetChar; + end; + Token := '#'; + SkipWhite; +end; + + +{--------------------------------------------------------------} +{ Get an Identifier and Scan it for Keywords } + +procedure Scan; +begin + GetName; + Token := KWcode[Lookup(Addr(KWlist), Value, 4) + 1]; +end; + + +{--------------------------------------------------------------} +{ Match a Specific Input String } + +procedure MatchString(x: string); +begin + if Value <> x then Expected('''' + x + ''''); +end; + + +{--------------------------------------------------------------} +{ Generate a Unique Label } + +function NewLabel: string; +var S: string; +begin + Str(LCount, S); + NewLabel := 'L' + S; + Inc(LCount); +end; + + +{--------------------------------------------------------------} +{ Post a Label To Output } + +procedure PostLabel(L: string); +begin + WriteLn(L, ':'); +end; + + +{--------------------------------------------------------------} +{ Output a String with Tab } + +procedure Emit(s: string); +begin + Write(TAB, s); +end; + + +{--------------------------------------------------------------} 
+{ Output a String with Tab and CRLF } + +procedure EmitLn(s: string); +begin + Emit(s); + WriteLn; +end; + + +{---------------------------------------------------------------} +{ Parse and Translate an Identifier } + +procedure Ident; +begin + GetName; + if Look = '(' then begin + Match('('); + Match(')'); + EmitLn('BSR ' + Value); + end + else + EmitLn('MOVE ' + Value + '(PC),D0'); +end; + + +{---------------------------------------------------------------} +{ Parse and Translate a Math Factor } + +procedure Expression; Forward; + +procedure Factor; +begin + if Look = '(' then begin + Match('('); + Expression; + Match(')'); + end + else if IsAlpha(Look) then + Ident + else begin + GetNum; + EmitLn('MOVE #' + Value + ',D0'); + end; +end; + + +{---------------------------------------------------------------} +{ Parse and Translate the First Math Factor } + +procedure SignedFactor; +var s: boolean; +begin + s := Look = '-'; + if IsAddop(Look) then begin + GetChar; + SkipWhite; + end; + Factor; + if s then + EmitLn('NEG D0'); +end; + + +{--------------------------------------------------------------} +{ Recognize and Translate a Multiply } + +procedure Multiply; +begin + Match('*'); + Factor; + EmitLn('MULS (SP)+,D0'); +end; + + +{-------------------------------------------------------------} +{ Recognize and Translate a Divide } + +procedure Divide; +begin + Match('/'); + Factor; + EmitLn('MOVE (SP)+,D1'); + EmitLn('EXT.L D0'); + EmitLn('DIVS D1,D0'); +end; + + +{---------------------------------------------------------------} +{ Completion of Term Processing (called by Term and FirstTerm) } + +procedure Term1; +begin + while IsMulop(Look) do begin + EmitLn('MOVE D0,-(SP)'); + case Look of + '*': Multiply; + '/': Divide; + end; + end; +end; +{---------------------------------------------------------------} +{ Parse and Translate a Math Term } + +procedure Term; +begin + Factor; + Term1; +end; + + +{---------------------------------------------------------------} +{
Parse and Translate a Math Term with Possible Leading Sign } + +procedure FirstTerm; +begin + SignedFactor; + Term1; +end; + + +{---------------------------------------------------------------} +{ Recognize and Translate an Add } + +procedure Add; +begin + Match('+'); + Term; + EmitLn('ADD (SP)+,D0'); +end; + + +{---------------------------------------------------------------} +{ Recognize and Translate a Subtract } + +procedure Subtract; +begin + Match('-'); + Term; + EmitLn('SUB (SP)+,D0'); + EmitLn('NEG D0'); +end; + + +{---------------------------------------------------------------} +{ Parse and Translate an Expression } + +procedure Expression; +begin + FirstTerm; + while IsAddop(Look) do begin + EmitLn('MOVE D0,-(SP)'); + case Look of + '+': Add; + '-': Subtract; + end; + end; +end; + + +{---------------------------------------------------------------} +{ Parse and Translate a Boolean Condition } +{ This version is a dummy } + +Procedure Condition; +begin + EmitLn('Condition'); +end; + + +{---------------------------------------------------------------} +{ Recognize and Translate an IF Construct } + +procedure Block; Forward; + + +procedure DoIf; +var L1, L2: string; +begin + Condition; + L1 := NewLabel; + L2 := L1; + EmitLn('BEQ ' + L1); + Block; + if Token = 'l' then begin + L2 := NewLabel; + EmitLn('BRA ' + L2); + PostLabel(L1); + Block; + end; + PostLabel(L2); + MatchString('ENDIF'); +end; + + +{--------------------------------------------------------------} +{ Parse and Translate an Assignment Statement } + +procedure Assignment; +var Name: string; +begin + Name := Value; + Match('='); + Expression; + EmitLn('LEA ' + Name + '(PC),A0'); + EmitLn('MOVE D0,(A0)'); +end; + + +{--------------------------------------------------------------} +{ Recognize and Translate a Statement Block } + +procedure Block; +begin + Scan; + while not (Token in ['e', 'l']) do begin + case Token of + 'i': DoIf; + else Assignment; + end; + Scan; + end; +end; + + 
+{--------------------------------------------------------------} + +{ Parse and Translate a Program } + +procedure DoProgram; +begin + Block; + MatchString('END'); + EmitLn('END') +end; + + +{--------------------------------------------------------------} + +{ Initialize } + +procedure Init; +begin + LCount := 0; + GetChar; +end; + + +{--------------------------------------------------------------} +{ Main Program } + +begin + Init; + DoProgram; +end. +{--------------------------------------------------------------} + + +Compare this program with its single-character counterpart. I +think you will agree that the differences are minor. + + +CONCLUSION + +At this point, you have learned how to parse and generate code +for expressions, Boolean expressions, and control structures. +You have now learned how to develop lexical scanners, and how to +incorporate their elements into a translator. You have still not +seen ALL the elements combined into one program, but on the basis +of what we've done before you should find it a straightforward +matter to extend our earlier programs to include scanners. + +We are very close to having all the elements that we need to +build a real, functional compiler. There are still a few things +missing, notably procedure calls and type definitions. We will +deal with those in the next few sessions. Before doing so, +however, I thought it would be fun to turn the translator above +into a true compiler. That's what we'll be doing in the next +installment. + +Up till now, we've taken a rather bottom-up approach to parsing, +beginning with low-level constructs and working our way up. In +the next installment, I'll also be taking a look from the top +down, and we'll discuss how the structure of the translator is +altered by changes in the language definition. + +See you then. + +***************************************************************** +* * +* COPYRIGHT NOTICE * +* * +* Copyright (C) 1988 Jack W. Crenshaw. All rights reserved. 
* +* * +***************************************************************** + + + + diff --git a/Compiler construction/Lets_build_compiler/tutor8.txt b/Compiler construction/Lets_build_compiler/tutor8.txt new file mode 100644 index 0000000..935aa0b --- /dev/null +++ b/Compiler construction/Lets_build_compiler/tutor8.txt @@ -0,0 +1,525 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + LET'S BUILD A COMPILER! + + By + + Jack W. Crenshaw, Ph.D. + + 2 April 1989 + + + Part VIII: A LITTLE PHILOSOPHY + + +***************************************************************** +* * +* COPYRIGHT NOTICE * +* * +* Copyright (C) 1989 Jack W. Crenshaw. All rights reserved. * +* * +***************************************************************** + + +INTRODUCTION + +This is going to be a different kind of session than the others +in our series on parsing and compiler construction. For this +session, there won't be any experiments to do or code to write. +This once, I'd like to just talk with you for a while. +Mercifully, it will be a short session, and then we can take up +where we left off, hopefully with renewed vigor. + +When I was in college, I found that I could always follow a +prof's lecture a lot better if I knew where he was going with it. +I'll bet you were the same. + +So I thought maybe it's about time I told you where we're going +with this series: what's coming up in future installments, and in +general what all this is about. I'll also share some general +thoughts concerning the usefulness of what we've been doing. + + +THE ROAD HOME + +So far, we've covered the parsing and translation of arithmetic +expressions, Boolean expressions, and combinations connected by +relational operators. We've also done the same for control +constructs. In all of this we've leaned heavily on the use of +top-down, recursive descent parsing, BNF definitions of the +syntax, and direct generation of assembly-language code. 
We also +learned the value of such tricks as single-character tokens to +help us see the forest through the trees. In the last +installment we dealt with lexical scanning, and I showed you +simple but powerful ways to remove the single-character barriers. + +Throughout the whole study, I've emphasized the KISS philosophy +... Keep It Simple, Sidney ... and I hope by now you've realized +just how simple this stuff can really be. While there are for +sure areas of compiler theory that are truly intimidating, the +ultimate message of this series is that in practice you can just +politely sidestep many of these areas. If the language +definition cooperates or, as in this series, if you can define +the language as you go, it's possible to write down the language +definition in BNF with reasonable ease. And, as we've seen, you +can crank out parse procedures from the BNF just about as fast as +you can type. + +As our compiler has taken form, it's gotten more parts, but each +part is quite small and simple, and very much like all the +others. + +At this point, we have many of the makings of a real, practical +compiler. As a matter of fact, we already have all we need to +build a toy compiler for a language as powerful as, say, Tiny +BASIC. In the next couple of installments, we'll go ahead and +define that language. + +To round out the series, we still have a few items to cover. +These include: + + o Procedure calls, with and without parameters + + o Local and global variables + + o Basic types, such as character and integer types + + o Arrays + + o Strings + + o User-defined types and structures + + o Tree-structured parsers and intermediate languages + + o Optimization + +These will all be covered in future installments. When we're +finished, you'll have all the tools you need to design and build +your own languages, and the compilers to translate them. + +I can't design those languages for you, but I can make some +comments and recommendations. 
I've already sprinkled some +throughout past installments. You've seen, for example, the +control constructs I prefer. + +These constructs are going to be part of the languages I build. +I have three languages in mind at this point, two of which you +will see in installments to come: + +TINY - A minimal, but usable language on the order of Tiny + BASIC or Tiny C. It won't be very practical, but it will + have enough power to let you write and run real programs + that do something worthwhile. + +KISS - The language I'm building for my own use. KISS is + intended to be a systems programming language. It won't + have strong typing or fancy data structures, but it will + support most of the things I want to do with a higher- + order language (HOL), except perhaps writing compilers. + +I've also been toying for years with the idea of a HOL-like +assembler, with structured control constructs and HOL-like +assignment statements. That, in fact, was the impetus behind my +original foray into the jungles of compiler theory. This one may +never be built, simply because I've learned that it's actually +easier to implement a language like KISS, that only uses a subset +of the CPU instructions. As you know, assembly language can be +bizarre and irregular in the extreme, and a language that maps +one-for-one onto it can be a real challenge. Still, I've always +felt that the syntax used in conventional assemblers is dumb ... +why is + + MOVE.L A,B + +better, or easier to translate, than + + B=A ? + +I think it would be an interesting exercise to develop a +"compiler" that would give the programmer complete access to and +control over the full complement of the CPU instruction set, and +would allow you to generate programs as efficient as assembly +language, without the pain of learning a set of mnemonics. Can +it be done? I don't know. The real question may be, "Will the +resulting language be any easier to write than assembly"? If +not, there's no point in it. 
I think that it can be done, but +I'm not completely sure yet how the syntax should look. + +Perhaps you have some comments or suggestions on this one. I'd +love to hear them. + +You probably won't be surprised to learn that I've already worked +ahead in most of the areas that we will cover. I have some good +news: Things never get much harder than they've been so far. +It's possible to build a complete, working compiler for a real +language, using nothing but the same kinds of techniques you've +learned so far. And THAT brings up some interesting questions. + + +WHY IS IT SO SIMPLE? + +Before embarking on this series, I always thought that compilers +were just naturally complex computer programs ... the ultimate +challenge. Yet the things we have done here have usually turned +out to be quite simple, sometimes even trivial. + +For a while, I thought it was simply because I hadn't yet gotten +into the meat of the subject. I had only covered the simple +parts. I will freely admit to you that, even when I began the +series, I wasn't sure how far we would be able to go before +things got too complex to deal with in the ways we have so far. +But at this point I've already been down the road far enough to +see the end of it. Guess what? + + + THERE ARE NO HARD PARTS! + + +Then, I thought maybe it was because we were not generating very +good object code. Those of you who have been following the +series and trying sample compiles know that, while the code works +and is rather foolproof, its efficiency is pretty awful. I +figured that if we were concentrating on turning out tight code, +we would soon find all that missing complexity. + +To some extent, that one is true. In particular, my first few +efforts at trying to improve efficiency introduced complexity at +an alarming rate. But since then I've been tinkering around with +some simple optimizations and I've found some that result in very +respectable code quality, WITHOUT adding a lot of complexity. 
+ +Finally, I thought that perhaps the saving grace was the "toy +compiler" nature of the study. I have made no pretense that we +were ever going to be able to build a compiler to compete with +Borland and Microsoft. And yet, again, as I get deeper into this +thing the differences are starting to fade away. + +Just to make sure you get the message here, let me state it flat +out: + + USING THE TECHNIQUES WE'VE USED HERE, IT IS POSSIBLE TO + BUILD A PRODUCTION-QUALITY, WORKING COMPILER WITHOUT ADDING + A LOT OF COMPLEXITY TO WHAT WE'VE ALREADY DONE. + + +Since the series began I've received some comments from you. +Most of them echo my own thoughts: "This is easy! Why do the +textbooks make it seem so hard?" Good question. + +Recently, I've gone back and looked at some of those texts again, +and even bought and read some new ones. Each time, I come away +with the same feeling: These guys have made it seem too hard. + +What's going on here? Why does the whole thing seem difficult in +the texts, but easy to us? Are we that much smarter than Aho, +Ullman, Brinch Hansen, and all the rest? + +Hardly. But we are doing some things differently, and more and +more I'm starting to appreciate the value of our approach, and +the way that it simplifies things. Aside from the obvious +shortcuts that I outlined in Part I, like single-character tokens +and console I/O, we have made some implicit assumptions and done +some things differently from those who have designed compilers in +the past. As it turns out, our approach makes life a lot easier. + +So why didn't all those other guys use it? + +You have to remember the context of some of the earlier compiler +development. These people were working with very small computers +of limited capacity. Memory was very limited, the CPU +instruction set was minimal, and programs ran in batch mode +rather than interactively. As it turns out, these caused some +key design decisions that have really complicated the designs. 
+Until recently, I hadn't realized how much of classical compiler +design was driven by the available hardware. + +Even in cases where these limitations no longer apply, people +have tended to structure their programs in the same way, since +that is the way they were taught to do it. + +In our case, we have started with a blank sheet of paper. There +is a danger there, of course, that you will end up falling into +traps that other people have long since learned to avoid. But it +also has allowed us to take different approaches that, partly by +design and partly by pure dumb luck, have allowed us to gain +simplicity. + +Here are the areas that I think have led to complexity in the +past: + + o Limited RAM Forcing Multiple Passes + + I just read "Brinch Hansen on Pascal Compilers" (an + excellent book, BTW). He developed a Pascal compiler for a + PC, but he started the effort in 1981 with a 64K system, and + so almost every design decision he made was aimed at making + the compiler fit into RAM. To do this, his compiler has + three passes, one of which is the lexical scanner. There is + no way he could, for example, use the distributed scanner I + introduced in the last installment, because the program + structure wouldn't allow it. He also required not one but + two intermediate languages, to provide the communication + between phases. + + All the early compiler writers had to deal with this issue: + Break the compiler up into enough parts so that it will fit + in memory. When you have multiple passes, you need to add + data structures to support the information that each pass + leaves behind for the next. That adds complexity, and ends + up driving the design. Lee's book, "The Anatomy of a + Compiler," mentions a FORTRAN compiler developed for an IBM + 1401. It had no fewer than 63 separate passes! Needless to + say, in a compiler like this the separation into phases + would dominate the design. 
+ + Even in situations where RAM is plentiful, people have + tended to use the same techniques because that is what + they're familiar with. It wasn't until Turbo Pascal came + along that we found how simple a compiler could be if you + started with different assumptions. + + + o Batch Processing + + In the early days, batch processing was the only choice ... + there was no interactive computing. Even today, compilers + run in essentially batch mode. + + In a mainframe compiler as well as many micro compilers, + considerable effort is expended on error recovery ... it can + consume as much as 30-40% of the compiler and completely + drive the design. The idea is to avoid halting on the first + error, but rather to keep going at all costs, so that you + can tell the programmer about as many errors in the whole + program as possible. + + All of that harks back to the days of the early mainframes, + where turnaround time was measured in hours or days, and it + was important to squeeze every last ounce of information out + of each run. + + In this series, I've been very careful to avoid the issue of + error recovery, and instead our compiler simply halts with + an error message on the first error. I will frankly admit + that it was mostly because I wanted to take the easy way out + and keep things simple. But this approach, pioneered by + Borland in Turbo Pascal, also has a lot going for it anyway. + Aside from keeping the compiler simple, it also fits very + well with the idea of an interactive system. When + compilation is fast, and especially when you have an editor + such as Borland's that will take you right to the point of + the error, then it makes a lot of sense to stop there, and + just restart the compilation after the error is fixed. + + + o Large Programs + + Early compilers were designed to handle large programs ... + essentially infinite ones. 
In those days there was little + choice; the idea of subroutine libraries and separate + compilation were still in the future. Again, this + assumption led to multi-pass designs and intermediate files + to hold the results of partial processing. + + Brinch Hansen's stated goal was that the compiler should be + able to compile itself. Again, because of his limited RAM, + this drove him to a multi-pass design. He needed as little + resident compiler code as possible, so that the necessary + tables and other data structures would fit into RAM. + + I haven't stated this one yet, because there hasn't been a + need ... we've always just read and written the data as + streams, anyway. But for the record, my plan has always + been that, in a production compiler, the source and object + data should all coexist in RAM with the compiler, a la the + early Turbo Pascals. That's why I've been careful to keep + routines like GetChar and Emit as separate routines, in + spite of their small size. It will be easy to change them + to read to and write from memory. + + + o Emphasis on Efficiency + + John Backus has stated that, when he and his colleagues + developed the original FORTRAN compiler, they KNEW that they + had to make it produce tight code. In those days, there was + a strong sentiment against HOLs and in favor of assembly + language, and efficiency was the reason. If FORTRAN didn't + produce very good code by assembly standards, the users + would simply refuse to use it. For the record, that FORTRAN + compiler turned out to be one of the most efficient ever + built, in terms of code quality. But it WAS complex! + + Today, we have CPU power and RAM size to spare, so code + efficiency is not so much of an issue. By studiously + ignoring this issue, we have indeed been able to Keep It + Simple. Ironically, though, as I have said, I have found + some optimizations that we can add to the basic compiler + structure, without having to add a lot of complexity. 
So in + this case we get to have our cake and eat it too: we will + end up with reasonable code quality, anyway. + + + o Limited Instruction Sets + + The early computers had primitive instruction sets. Things + that we take for granted, such as stack operations and + indirect addressing, came only with great difficulty. + + Example: In most compiler designs, there is a data structure + called the literal pool. The compiler typically identifies + all literals used in the program, and collects them into a + single data structure. All references to the literals are + done indirectly to this pool. At the end of the + compilation, the compiler issues commands to set aside + storage and initialize the literal pool. + + We haven't had to address that issue at all. When we want + to load a literal, we just do it, in line, as in + + MOVE #3,D0 + + There is something to be said for the use of a literal pool, + particularly on a machine like the 8086 where data and code + can be separated. Still, the whole thing adds a fairly + large amount of complexity with little in return. + + Of course, without the stack we would be lost. In a micro, + both subroutine calls and temporary storage depend heavily + on the stack, and we have used it even more than necessary + to ease expression parsing. + + + o Desire for Generality + + Much of the content of the typical compiler text is taken up + with issues we haven't addressed here at all ... things like + automated translation of grammars, or generation of LALR + parse tables. This is not simply because the authors want + to impress you. There are good, practical reasons why the + subjects are there. + + We have been concentrating on the use of a recursive-descent + parser to parse a deterministic grammar, i.e., a grammar + that is not ambiguous and, therefore, can be parsed with one + level of lookahead. I haven't made much of this limitation, + but the fact is that this represents a small subset of + possible grammars. 
In fact, there is an infinite number of + grammars that we can't parse using our techniques. The LR + technique is a more powerful one, and can deal with grammars + that we can't. + + In compiler theory, it's important to know how to deal with + these other grammars, and how to transform them into + grammars that are easier to deal with. For example, many + (but not all) ambiguous grammars can be transformed into + unambiguous ones. The way to do this is not always obvious, + though, and so many people have devoted years to develop + ways to transform them automatically. + + In practice, these issues turn out to be considerably less + important. Modern languages tend to be designed to be easy + to parse, anyway. That was a key motivation in the design + of Pascal. Sure, there are pathological grammars that you + would be hard pressed to write unambiguous BNF for, but in + the real world the best answer is probably to avoid those + grammars! + + In our case, of course, we have sneakily let the language + evolve as we go, so we haven't painted ourselves into any + corners here. You may not always have that luxury. Still, + with a little care you should be able to keep the parser + simple without having to resort to automatic translation of + the grammar. + + +We have taken a vastly different approach in this series. We +started with a clean sheet of paper, and developed techniques +that work in the context that we are in; that is, a single-user +PC with rather ample CPU power and RAM space. We have limited +ourselves to reasonable grammars that are easy to parse, we have +used the instruction set of the CPU to advantage, and we have not +concerned ourselves with efficiency. THAT's why it's been easy. + +Does this mean that we are forever doomed to be able to build +only toy compilers? No, I don't think so. As I've said, we can +add certain optimizations without changing the compiler +structure. 
If we want to process large files, we can always add +file buffering to do that. These things do not affect the +overall program design. + +And I think that's a key factor. By starting with small and +limited cases, we have been able to concentrate on a structure +for the compiler that is natural for the job. Since the +structure naturally fits the job, it is almost bound to be simple +and transparent. Adding capability doesn't have to change that +basic structure. We can simply expand things like the file +structure or add an optimization layer. I guess my feeling is +that, back when resources were tight, the structures people ended +up with were artificially warped to make them work under those +conditions, and weren't optimum structures for the problem at +hand. + + +CONCLUSION + +Anyway, that's my arm-waving guess as to how we've been able to +keep things simple. We started with something simple and let it +evolve naturally, without trying to force it into some +traditional mold. + +We're going to press on with this. I've given you a list of the +areas we'll be covering in future installments. With those +installments, you should be able to build complete, working +compilers for just about any occasion, and build them simply. If +you REALLY want to build production-quality compilers, you'll be +able to do that, too. + +For those of you who are chafing at the bit for more parser code, +I apologize for this digression. I just thought you'd like to +have things put into perspective a bit. Next time, we'll get +back to the mainstream of the tutorial. + +So far, we've only looked at pieces of compilers, and while we +have many of the makings of a complete language, we haven't +talked about how to put it all together. That will be the +subject of our next two installments. Then we'll press on into +the new subjects I listed at the beginning of this installment. + +See you then. 
+ +***************************************************************** +* * +* COPYRIGHT NOTICE * +* * +* Copyright (C) 1989 Jack W. Crenshaw. All rights reserved. * +* * +***************************************************************** + diff --git a/Compiler construction/Lets_build_compiler/tutor9.txt b/Compiler construction/Lets_build_compiler/tutor9.txt new file mode 100644 index 0000000..c23ea02 --- /dev/null +++ b/Compiler construction/Lets_build_compiler/tutor9.txt @@ -0,0 +1,821 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + LET'S BUILD A COMPILER! + + By + + Jack W. Crenshaw, Ph.D. + + 16 April 1989 + + + Part IX: A TOP VIEW + + +***************************************************************** +* * +* COPYRIGHT NOTICE * +* * +* Copyright (C) 1989 Jack W. Crenshaw. All rights reserved. * +* * +***************************************************************** + + +INTRODUCTION + +In the previous installments, we have learned many of the +techniques required to build a full-blown compiler. We've done +both assignment statements (with Boolean and arithmetic +expressions), relational operators, and control constructs. We +still haven't addressed procedure or function calls, but even so +we could conceivably construct a mini-language without them. +I've always thought it would be fun to see just how small a +language one could build that would still be useful. We're +ALMOST in a position to do that now. The problem is: though we +know how to parse and translate the constructs, we still don't +know quite how to put them all together into a language. + +In those earlier installments, the development of our programs +had a decidedly bottom-up flavor. In the case of expression +parsing, for example, we began with the very lowest level +constructs, the individual constants and variables, and worked +our way up to more complex expressions. + +Most people regard the top-down design approach as being better +than the bottom-up one. 
I do too, but the way we did it +certainly seemed natural enough for the kinds of things we were +parsing. + +You mustn't get the idea, though, that the incremental approach +that we've been using in all these tutorials is inherently +bottom-up. In this installment I'd like to show you that the +approach can work just as well when applied from the top down ... +maybe better. We'll consider languages such as C and Pascal, and +see how complete compilers can be built starting from the top. + +In the next installment, we'll apply the same technique to build +a complete translator for a subset of the KISS language, which +I'll be calling TINY. But one of my goals for this series is +that you will not only be able to see how a compiler for TINY or +KISS works, but that you will also be able to design and build +compilers for your own languages. The C and Pascal examples will +help. One thing I'd like you to see is that the natural +structure of the compiler depends very much on the language being +translated, so the simplicity and ease of construction of the +compiler depends very much on letting the language set the +program structure. + +It's a bit much to produce a full C or Pascal compiler here, and +we won't try. But we can flesh out the top levels far enough so +that you can see how it goes. + +Let's get started. + + +THE TOP LEVEL + +One of the biggest mistakes people make in a top-down design is +failing to start at the true top. They think they know what the +overall structure of the design should be, so they go ahead and +write it down. + +Whenever I start a new design, I always like to do it at the +absolute beginning. In program design language (PDL), this top +level looks something like: + + + begin + solve the problem + end + + +OK, I grant you that this doesn't give much of a hint as to what +the next level is, but I like to write it down anyway, just to +give me that warm feeling that I am indeed starting at the top. 
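The PDL above can also be written down as a compilable skeleton. What follows is only a minimal sketch, assuming Turbo Pascal or Free Pascal; the program name TopLevel and the procedure name SolveTheProblem are hypothetical placeholders, not part of the Cradle.

```pascal
program TopLevel;

{ Hypothetical stub standing in for "solve the problem".  }
{ For a compiler, this is the routine that would grow     }
{ into "compile a complete program".                      }

procedure SolveTheProblem;
begin
   WriteLn('solve the problem (not yet implemented)');
end;

begin
   SolveTheProblem;
end.
```

The point of such a skeleton is only that it already compiles and runs, so every later refinement starts from a working program.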
+ +For our problem, the overall function of a compiler is to compile +a complete program. Any definition of the language, written in +BNF, begins here. What does the top level BNF look like? Well, +that depends quite a bit on the language to be translated. Let's +take a look at Pascal. + + +THE STRUCTURE OF PASCAL + +Most texts for Pascal include a BNF or "railroad-track" +definition of the language. Here are the first few lines of one: + + + <program> ::= <program-header> <block> '.' + + <program-header> ::= PROGRAM <ident> + + <block> ::= <declarations> <statements> + + +We can write recognizers to deal with each of these elements, +just as we've done before. For each one, we'll use our familiar +single-character tokens to represent the input, then flesh things +out a little at a time. Let's begin with the first recognizer: +the program itself. + +To translate this, we'll start with a fresh copy of the Cradle. +Since we're back to single-character names, we'll just use a 'p' +to stand for 'PROGRAM.' + +To a fresh copy of the cradle, add the following code, and insert +a call to it from the main program: + + +{--------------------------------------------------------------} +{ Parse and Translate A Program } + +procedure Prog; +var Name: char; +begin + Match('p'); { Handles program header part } + Name := GetName; + Prolog; + Match('.'); + Epilog(Name); +end; +{--------------------------------------------------------------} + + +The procedures Prolog and Epilog perform whatever is required to +let the program interface with the operating system, so that it +can execute as a program. Needless to say, this part will be +VERY OS-dependent. Remember, I've been emitting code for a 68000 +running under the OS I use, which is SK*DOS. I realize most of +you are using PC's and would rather see something else, but I'm +in this thing too deep to change now! + +Anyhow, SK*DOS is a particularly easy OS to interface to. 
Here +is the code for Prolog and Epilog: + + +{--------------------------------------------------------------} +{ Write the Prolog } + +procedure Prolog; +begin + EmitLn('WARMST EQU $A01E'); +end; + + +{--------------------------------------------------------------} +{ Write the Epilog } + +procedure Epilog(Name: char); +begin + EmitLn('DC WARMST'); + EmitLn('END ' + Name); +end; +{--------------------------------------------------------------} + +As usual, add this code and try out the "compiler." At this +point, there is only one legal input: + + + px. (where x is any single letter, the program name) + + +Well, as usual our first effort is rather unimpressive, but by +now I'm sure you know that things will get more interesting. +There is one important thing to note: THE OUTPUT IS A WORKING, +COMPLETE, AND EXECUTABLE PROGRAM (at least after it's assembled). + +This is very important. The nice feature of the top-down +approach is that at any stage you can compile a subset of the +complete language and get a program that will run on the target +machine. From here on, then, we need only add features by +fleshing out the language constructs. It's all very similar to +what we've been doing all along, except that we're approaching it +from the other end. + + +FLESHING IT OUT + +To flesh out the compiler, we only have to deal with language +features one by one. I like to start with a stub procedure that +does nothing, then add detail in incremental fashion. Let's +begin by processing a block, in accordance with its BNF above. +We can do this in two stages. 
First, add the null procedure: + + +{--------------------------------------------------------------} +{ Parse and Translate a Pascal Block } + +procedure DoBlock(Name: char); +begin +end; +{--------------------------------------------------------------} + + +and modify Prog to read: + + +{--------------------------------------------------------------} +{ Parse and Translate A Program } + +procedure Prog; +var Name: char; +begin + Match('p'); + Name := GetName; + Prolog; + DoBlock(Name); + Match('.'); + Epilog(Name); +end; +{--------------------------------------------------------------} + + +That certainly shouldn't change the behavior of the program, and +it doesn't. But now the definition of Prog is complete, and we +can proceed to flesh out DoBlock. That's done right from its BNF +definition: + + +{--------------------------------------------------------------} +{ Parse and Translate a Pascal Block } + +procedure DoBlock(Name: char); +begin + Declarations; + PostLabel(Name); + Statements; +end; +{--------------------------------------------------------------} + + +The procedure PostLabel was defined in the installment on +branches. Copy it into your cradle. + +I probably need to explain the reason for inserting the label +where I have. It has to do with the operation of SK*DOS. Unlike +some OS's, SK*DOS allows the entry point to the main program to +be anywhere in the program. All you have to do is to give that +point a name. The call to PostLabel puts that name just before +the first executable statement in the main program. How does +SK*DOS know which of the many labels is the entry point, you ask? +It's the one that matches the END statement at the end of the +program. + +OK, now we need stubs for the procedures Declarations and +Statements. Make them null procedures as we did before. + +Does the program still run the same? Then we can move on to the +next stage. + + +DECLARATIONS + +The BNF for Pascal declarations is: + + + ::= (