Compiler Design: Theory, Tools, and Examples (study notes)

A textbook on Compiler Design written by Seth D. Bergmann from Rowan University in 2010. It covers topics such as the phases of a compiler, lexical analysis, syntax analysis, and implementation techniques. The author emphasizes the importance of compiler design in computer science and its relevance to programming languages, command languages, and theoretic topics. The book is a revised edition that accommodates C++ as the primary language in the undergraduate curriculum.

Compiler Design:
Theory, Tools, and Examples
C/C++ Edition
Seth D. Bergmann
Rowan University
2010


Table of Contents

Preface

  • Chapter 1 Introduction
  • 1.1 What is a Compiler?
  • 1.2 The Phases of a Compiler
  • 1.3 Implementation Techniques
  • 1.4 Case Study: MiniC
  • 1.5 Chapter Summary
  • Chapter 2 Lexical Analysis
  • 2.0 Formal Languages
  • 2.1 Lexical Tokens
  • 2.2 Implementation with Finite State Machines
  • 2.3 Lexical Tables
  • 2.4 Lex
  • 2.5 Case Study: Lexical Analysis for MiniC
  • 2.6 Chapter Summary
  • Chapter 3 Syntax Analysis
  • 3.0 Grammars, Languages, and Pushdown Machines
  • 3.1 Ambiguities in Programming Languages
  • 3.2 The Parsing Problem
  • 3.3 Chapter Summary
  • Chapter 4 Top Down Parsing
  • 4.0 Relations and Closure
  • 4.1 Simple Grammars
  • 4.2 Quasi-Simple Grammars
  • 4.3 LL(1) Grammars
  • 4.4 Parsing Arithmetic Expressions Top Down
  • 4.5 Syntax-Directed Translation
  • 4.6 Attributed Grammars
  • 4.7 An Attributed Translation Grammar for Expressions
  • 4.8 MiniC Expressions
  • 4.9 Translating Control Structures
  • 4.10 Case Study: A Top Down Parser for MiniC
  • 4.11 Chapter Summary
  • Chapter 5 Bottom Up Parsing
  • 5.1 Shift Reduce Parsing
  • 5.2 LR Parsing With Tables
  • 5.3 Yacc
  • 5.4 Arrays
  • 5.5 Case Study: Syntax Analysis for MiniC
  • 5.6 Chapter Summary
  • Chapter 6 Code Generation
  • 6.1 Introduction to Code Generation
  • 6.2 Converting Atoms to Instructions
  • 6.3 Single Pass vs. Multiple Passes
  • 6.4 Register Allocation
  • 6.5 Case Study: A MiniC Code Generator for the Mini Architecture
  • 6.6 Chapter Summary
  • Chapter 7 Optimization
  • 7.1 Introduction and View of Optimization
  • 7.2 Global Optimization
  • 7.3 Local Optimization
  • 7.4 Chapter Summary
  • Glossary
  • Appendix A MiniC Grammar
  • Appendix B MiniC Compiler
  • B.1 Software Files
  • B.2 Lexical Phase
  • B.3 Syntax Analysis
  • B.4 Code Generator
  • Appendix C Mini Simulator
  • Bibliography
  • Index

Preface

Compiler design is a subject which many believe to be fundamental and vital to computer science. It is a subject which has been studied intensively since the early 1950s and continues to be an important research field today. Compiler design is an important part of the undergraduate curriculum for many reasons: (1) It provides students with a better understanding of and appreciation for programming languages. (2) The techniques used in compilers can be used in other applications with command languages. (3) It provides motivation for the study of theoretic topics. (4) It is a good vehicle for an extended programming project.

There are several compiler design textbooks available today, but most have been written for graduate students. Here at Rowan University (formerly Glassboro State College), our students have had difficulty reading these books. However, I felt it was not the subject matter that was the problem, but the way it was presented. I was sure that if concepts were presented at a slower pace, with sample problems and diagrams to illustrate the concepts, our students would be able to master them. This is what I have attempted to do in writing this book.

This book is a revision of an earlier edition that was written for a Pascal based curriculum. As many computer science departments have moved to C++ as the primary language in the undergraduate curriculum, I have produced this edition to accommodate those departments. This book is not intended to be strictly an object-oriented approach to compiler design.

The most essential prerequisites for this book are courses in C or C++ programming, Data Structures, Assembly Language or Computer Architecture, and possibly Programming Languages. If the student has not studied formal languages and automata, this book includes introductory sections on these theoretic topics, but in this case it is not likely that all seven chapters will be covered in a one semester course. Students who have studied the theory will be able to skip the preliminary sections (2.0, 3.0, 4.0) without loss of continuity.

The concepts of compiler design are applied to a case study which is an implementation of a subset of C which I call MiniC. Chapters 2, 4, 5, and 6 include a section devoted to explaining how the relevant part of the MiniC compiler is designed. This public domain software is presented in full in the appendices and is available on the Internet. Students can benefit by enhancing or changing the MiniC compiler provided.

Chapters 6 and 7 focus on the back end of the compiler (code generation and optimization). Here I rely on a fictitious computer, called Mini, as the target machine. I use a fictitious machine for three reasons: (1) I can design it for simplicity so that the compiler design concepts are not obscured by architectural requirements, (2) it is available to anyone who has a C compiler (the Mini simulator, written in C, is available also), and (3) the teacher or student can modify the Mini machine to suit his/her tastes.

Recently the phrase user interface has received much attention in the computer industry. A user interface is the mechanism through which the user of a device communicates with the device. Since digital computers are programmed using a complex system of binary codes and memory addresses, we have developed sophisticated user interfaces, called programming languages, which enable us to specify computations in ways that seem more natural. This book will describe the implementation of this kind of interface, the rationale being that even if you never need to design or implement a programming language, the lessons learned here will still be valuable to you. You will be a better programmer as a result of understanding how programming languages are implemented, and you will have a greater appreciation for programming languages. In addition, the techniques which are presented here can be used in the construction of other user interfaces, such as the query language for a database management system.

1.1 What is a Compiler?

Recall from your study of assembly language or computer organization the kinds of instructions that the computer’s CPU is capable of executing. In general, they are very simple, primitive operations. For example, there are often instructions which do the following kinds of operations: (1) add two numbers stored in memory, (2) move numbers from one location in memory to another, (3) move information between the CPU and memory. But there is certainly no single instruction capable of computing an arbitrary expression such as ((x-x0)^2 + (x-x1)^2)^(1/2), and there is no way to do the following with a single instruction:

if (array6[loc]<MAX) sum = 0; else array6[loc] = 0;

Chapter 1

Introduction


These capabilities are implemented with a software translator, known as a compiler. The function of the compiler is to accept statements such as those above and translate them into sequences of machine language operations which, if loaded into memory and executed, would carry out the intended computation. It is important to bear in mind that when processing a statement such as x = x ∗ 9; the compiler does not perform the multiplication. The compiler generates, as output, a sequence of instructions, including a "multiply" instruction.

Languages which permit complex operations, such as the ones above, are called high-level languages, or programming languages. A compiler accepts as input a program written in a particular high-level language and produces as output an equivalent program in machine language for a particular machine called the target machine. We say that two programs are equivalent if they always produce the same output when given the same input. The input program is known as the source program, and its language is the source language. The output program is known as the object program, and its language is the object language. A compiler translates source language programs into equivalent object language programs. Some examples of compilers are:

A Java compiler for the Apple Macintosh A COBOL compiler for the SUN A C++ compiler for the Apple Macintosh

If a portion of the input to a C++ compiler looked like this:

A = B + C ∗ D;

the output corresponding to this input might look something like this:

LOD R1,C      // Load the value of C into reg 1
MUL R1,D      // Multiply the value of D by reg 1
STO R1,TEMP1  // Store the result in TEMP1
LOD R1,B      // Load the value of B into reg 1
ADD R1,TEMP1  // Add value of TEMP1 to register 1
STO R1,TEMP2  // Store the result in TEMP2
MOV A,TEMP2   // Move TEMP2 to A, the final result

The compiler must be smart enough to know that the multiplication should be done before the addition even though the addition is read first when scanning the input. The compiler must also be smart enough to know whether the input is a correctly formed program (this is called checking for proper syntax), and to issue helpful error messages if there are syntax errors. Note the somewhat convoluted logic after the Test instruction in Sample Problem 1.1(a) (see p. 3). Why didn’t it simply branch to L3 if the condition code indicated that the first operand (X) was greater than or equal to the second operand (Temp1), thus eliminating an unnecessary branch instruction and label? Some compilers might actually do this, but the point is that even if the architecture of the target machine


Figure 1.1 A Compiler and Interpreter Producing Very Different Output for the Same Input

rather than generating a machine language program, the interpreter actually carries out the computations specified in the source program. In other words, the output of a compiler is a program, whereas the output of an interpreter is the source program’s output. Figure 1.1 shows that although the input may be identical, compilers and interpreters produce very different output. Nevertheless, many of the techniques used in designing compilers are also applicable to interpreters.

Students are often confused about the difference between a compiler and an interpreter. Many commercial compilers come packaged with a built-in edit-compile-run front end. In effect, the student is not aware that after compilation is finished, the object program must be loaded into memory and executed, because this all happens automatically. As larger programs are needed to solve more complex problems, programs are divided into manageable source modules, each of which is compiled separately to an object module. The object modules can then be linked to form a single, complete, machine language program. In this mode, it is more clear that there is a distinction between compile time, the time at which a source program is compiled, and run time, the time at which the resulting object program is loaded and executed. Syntax errors are reported by the compiler at compile time and are shown at the left, below, as compile-time errors. Other kinds of errors not generally detected by the compiler are called run-time errors and are shown at the right below:

Compile-Time Errors              Run-Time Errors

a = ((b+c)∗d;                    x = a-a;
                                 y = 100/x;         // division by 0

if x<b fn1();                    ptr = NULL;
else fn2();                      data = ptr->info;  // use of null pointer

Input (identical for both):

    a = 3; b = 4; cout << a*b;

Compiler output (a machine language program):

    Mov a,='3'
    Mov b,='4'
    Lod 1,a
    Mul 1,b
    Sto 1,Tmp
    Push Tmp
    Call Write

Interpreter output (the result of the computation):

    12


Sample Problem 1.1 (b)

Show the compiler output and the interpreter output for the following C++ source code:

for (i=1; i<=4; i++) cout << i*3;

Solution:

Compiler output:

        LOD R1,='4'
        STO R1,Temp1
        MOV i,='1'
    L1: CMP i,Temp1
        BH L2          {Branch if i>Temp1}
        LOD R1,i
        MUL R1,='3'
        STO R1,Temp2
        PUSH Temp2
        CALL Write
        ADD i,='1'     {Add 1 to i}
        B L1
    L2:

Interpreter output:

    3 6 9 12

It is important to remember that a compiler is a program, and it must be written in some language (machine, assembly, high-level). In describing this program, we are dealing with three languages: (1) the source language, i.e. the input to the compiler, (2) the object language, i.e. the output of the compiler, and (3) the language in which the compiler is written, or the language in which it exists, since it might have been translated into a language foreign to the one in which it was originally written. For example, it is possible to have a compiler that translates Java programs into Macintosh machine language. That compiler could have been written in the C language, and translated into Macintosh (or some other) machine language. Note that if the language in which the compiler is written is a machine language, it need not be the same as the object language. For example, a compiler that produces Macintosh machine language could run on a Sun computer. Also, the object language need not be a machine or assembly language, but could be a high-level language. A concise notation describing compilers is given by Aho[1986] and is shown in Figure 1.2 (see p. 6). In these diagrams, the large C stands for Compiler (not the C programming language), the superscript describes the intended translation of the compiler, and the subscript shows the language in which the compiler exists. Figure 1.2 (a) shows a Java compiler for the Macintosh. Figure 1.2 (b) shows a compiler which translates Java programs into equivalent Macintosh machine language, but it exists in Sun machine language, and consequently it will run only on a Sun. Figure 1.2 (c) shows a compiler which translates PC machine language programs into equivalent Java programs. It is written in Ada and will not run in that form on any machine.


Exercises 1.1

1. Show assembly language for a machine of your choice, corresponding to each of the following C/C++ statements:

   (a) A = B + C;

   (b) A = (B+C) ∗ (C-D);

   (c) for (I=1; I<=10; I++) A = A+I;

2. Show the difference between compiler output and interpreter output for each of the following source inputs:

   (a) A = 12;              (b) A = 12;
       B = 6;                   B = 6;
       C = A+B;                 if (A<B) cout << A;
       cout <<C<<A<<B;          else cout << B;

   (c) A = 12;
       B = 6;
       while (B<A)
       {  A = A-1;
          cout << A << B << endl;  }

3. Which of the following C/C++ source errors would be detected at compile time, and which would be detected at run time?

   (a) A = B+C = 3;

   (b) if (X<3) then A = 2;
       else A = X;

   (c) if (A>0) X = 20;
       else if (A<0) X = 10;
       else X = X/A;

   (d) while ((p->info>0) && (p!=0))
           p = p->next;

       /* assume p points to a struct named node with
          these field definitions:
          int info;
          node * next;  */

4. Using the big C notation, show the symbol for each of the following:

   (a) A compiler which translates COBOL source programs to PC machine language and runs on a PC.

   (b) A compiler, written in Java, which translates FORTRAN source programs to Mac machine language.

   (c) A compiler, written in Java, which translates Sun machine language programs to Java.


Sample Problem 1.2 (a)

Show the token classes, or “words”, put out by the lexical analysis phase corresponding to this C++ source input:

sum = sum + unit ∗ /∗ accumulate sum ∗/ 1.2e-12 ;

Solution:

identifier (sum)
assignment (=)
identifier (sum)
operator (+)
identifier (unit)
operator (∗)
numeric constant (1.2e-12)
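The classification above can be sketched as a small hand-written scanner. This is not the book's MiniC lexer (which is developed in Chapter 2 using finite state machines and lex); the struct and function names here are illustrative only.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

struct Token {
    std::string cls;   // token class, e.g. "identifier"
    std::string text;  // the characters matched
};

// A minimal scanner: skips white space and /* ... */ comments, and
// classifies identifiers, numeric constants, and a few operators.
std::vector<Token> scan(const std::string& src) {
    std::vector<Token> out;
    size_t i = 0;
    while (i < src.size()) {
        unsigned char c = src[i];
        if (isspace(c)) { ++i; }
        else if (src.compare(i, 2, "/*") == 0) {       // comment: discarded
            size_t end = src.find("*/", i + 2);
            i = (end == std::string::npos) ? src.size() : end + 2;
        } else if (isalpha(c)) {                       // identifier
            size_t j = i + 1;
            while (j < src.size() && isalnum((unsigned char)src[j])) ++j;
            out.push_back({"identifier", src.substr(i, j - i)});
            i = j;
        } else if (isdigit(c)) {                       // numeric constant, e.g. 1.2e-12
            size_t j = i + 1;
            while (j < src.size() &&
                   (isdigit((unsigned char)src[j]) || src[j] == '.' || src[j] == 'e' ||
                    ((src[j] == '-' || src[j] == '+') && src[j - 1] == 'e')))
                ++j;
            out.push_back({"numeric constant", src.substr(i, j - i)});
            i = j;
        } else if (c == '=') { out.push_back({"assignment", "="}); ++i; }
        else if (c == '+' || c == '*') {
            out.push_back({"operator", std::string(1, src[i])}); ++i;
        } else { ++i; }  // anything else (e.g. the ';') is ignored in this sketch
    }
    return out;
}
```

Scanning the sample input `sum = sum + unit * /* accumulate sum */ 1.2e-12 ;` yields exactly the seven tokens listed in the solution, with the comment discarded.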

and moving data in memory. Each operation could have 0 or more operands also listed in the atom: (operation, operand1, operand2, operand3). The meaning of the following atom would be to add A and B, and store the result into C:

(ADD, A, B, C)

In Sample Problem 1.2 (b), below, each atom consists of three or four parts: an operation, one or two operands, and a result. Note that the compiler must put out the MULT atom before the ADD atom, despite the fact that the addition is encountered first in the source statement. To implement transfer of control, we could use label atoms, which serve only to mark a spot in the object program to which we might wish to branch in implementing a control structure such as if or while. A label atom with the name L1 would be (LBL,L1). We could use a jump atom for an unconditional branch, and a test atom for a conditional branch: The atom (JMP, L1) would be an unconditional branch to the

Sample Problem 1.2 (b)

Show atoms corresponding to the following C/C++ statement:

A = B + C ∗ D ;

Solution:

(MULT,C,D,TEMP1)
(ADD,B,TEMP1,TEMP2)
(MOVE,TEMP2,A)
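The atom notation used in these solutions can be sketched as a small data type. The struct and function names are illustrative, not the MiniC compiler's actual representation.

```cpp
#include <cassert>
#include <string>

// One atom: an operation, one or two operands, and a result,
// e.g. (MULT, C, D, TEMP1).
struct Atom {
    std::string op, left, right, result;
};

// Render an atom in the textual form used in the examples; a MOVE
// atom has only one operand, so an empty right field is omitted.
std::string show(const Atom& a) {
    std::string s = "(" + a.op + "," + a.left;
    if (!a.right.empty()) s += "," + a.right;
    return s + "," + a.result + ")";
}
```

With this representation the solution above is the three atoms `{"MULT","C","D","TEMP1"}`, `{"ADD","B","TEMP1","TEMP2"}`, and `{"MOVE","TEMP2","","A"}`.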


Sample Problem 1.2 (c)

Show atoms corresponding to the C/C++ statement:

while (A<=B) A = A + 1;

Solution:

(LBL, L1)
(TEST, A, <=, B, L2)
(JMP, L3)
(LBL, L2)
(ADD, A, 1, A)
(JMP, L1)
(LBL, L3)
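The pattern in this solution generalizes: any while loop gets the same label/test/jump skeleton wrapped around its body. A sketch of a routine that emits it (the routine and its label counter are illustrative, not the book's code generator):

```cpp
#include <cassert>
#include <string>
#include <vector>

static int labelCount = 0;
std::string newLabel() { return "L" + std::to_string(++labelCount); }

// Emit the atoms for:  while (a <= b) <bodyAtoms>
std::vector<std::string> genWhile(const std::string& a, const std::string& b,
                                  const std::vector<std::string>& bodyAtoms) {
    std::string l1 = newLabel(), l2 = newLabel(), l3 = newLabel();
    std::vector<std::string> out = {
        "(LBL," + l1 + ")",
        "(TEST," + a + ",<=," + b + "," + l2 + ")",  // condition true: enter body
        "(JMP," + l3 + ")",                          // condition false: exit loop
        "(LBL," + l2 + ")",
    };
    out.insert(out.end(), bodyAtoms.begin(), bodyAtoms.end());
    out.push_back("(JMP," + l1 + ")");               // repeat the test
    out.push_back("(LBL," + l3 + ")");
    return out;
}
```

Calling `genWhile("A", "B", {"(ADD,A,1,A)"})` reproduces the seven atoms of the solution above.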

label L1. The atom (TEST, A, <=, B, L2) would be a conditional branch to the label L2, if A<=B is true. Some parsers put out syntax trees as an intermediate data structure, rather than atom strings. A syntax tree indicates the structure of the source statement, and object code can be generated directly from the syntax tree. A syntax tree for the expression A = B + C ∗ D is shown in Figure 1.3, below. In syntax trees, each interior node represents an operation or control structure and each leaf node represents an operand. A statement such as if (Expr) Stmt1 else Stmt2 could be implemented as a node having three children: one for the conditional expression, one for the true part (Stmt1), and one for the else statement (Stmt2). The while control structure would have two children: one for the loop condition, and one for the statement to be repeated. The compound statement could be treated in a few different ways. The compound statement could have an unlimited number of children, one for each statement in the compound statement. The other way would be to treat the semicolon like a statement concatenation operator, yielding a binary tree.

Once a syntax tree has been created, it is not difficult to generate code from the syntax tree; a postfix traversal of the tree is all that is needed. In a postfix traversal, for each node, N, the algorithm visits all the subtrees of N, and visits the node N last, at which point the instruction(s) corresponding to node N can be generated.
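A postfix traversal that generates atoms from an expression tree can be sketched as follows. The Node layout and the two-operator mapping are illustrative simplifications, not the book's implementation.

```cpp
#include <cassert>
#include <string>
#include <vector>

// A toy expression-tree node: interior nodes hold an operator label,
// leaf nodes hold an operand name and have no children.
struct Node {
    std::string label;
    Node* left = nullptr;
    Node* right = nullptr;
};

static int tempCount = 0;

// Postfix (postorder) code generation: visit both subtrees first,
// then emit the atom for the node itself.  Returns the name of the
// location holding the subtree's value.
std::string gen(Node* n, std::vector<std::string>& atoms) {
    if (n->left == nullptr) return n->label;              // leaf: an operand
    std::string l = gen(n->left, atoms);
    std::string r = gen(n->right, atoms);
    std::string t = "TEMP" + std::to_string(++tempCount);
    std::string op = (n->label == "*") ? "MULT" : "ADD";  // only two ops in this sketch
    atoms.push_back("(" + op + "," + l + "," + r + "," + t + ")");
    return t;
}
```

For the subtree B + C * D this emits (MULT,C,D,TEMP1) and then (ADD,B,TEMP1,TEMP2): the multiplication atom comes out first, exactly as required in Sample Problem 1.2(b).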

Section 1.2 The Phases of a Compiler

Figure 1.3 A Syntax Tree for A = B + C ∗ D

            =
           / \
          A   +
             / \
            B   ∗
               / \
              C   D

stmt2 and stmt3 can never be executed. They are unreachable and can be eliminated from the object program. A second example of global optimization is shown below:

for (i=1; i<=100000; i++)
{  x = sqrt (y);          // square root function
   cout << x+i << endl;
}

In this case, the assignment to x need not be inside the loop since y doesn’t change as the loop repeats (it is a loop invariant ). In the global optimization phase, the compiler would move the assignment to x out of the loop in the object program:

x = sqrt (y);             // loop invariant
for (i=1; i<=100000; i++)
   cout << x+i << endl;

This would eliminate 99,999 unnecessary calls to the sqrt function at run time. The reader is cautioned that global optimization can have a serious impact on run-time debugging. For example, if the value of y in the above example was negative, causing a run-time error in the sqrt function, the user would be unaware of the actual location of that portion of code which called the sqrt function, because the compiler would have moved the offending statement (usually without informing the programmer). Most compilers that perform global optimization also have a switch with which the user can turn optimization on or off. When debugging the program, the switch would be off. When the program is correct, the switch would be turned on to generate an optimized version for the user. One of the most difficult problems for the compiler writer is making sure that the compiler generates optimized and unoptimized object modules, from the same source module, which are equivalent.
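The equivalence requirement in the last sentence can be checked directly on the square-root example: the two versions below must produce identical results. This is a sketch with a smaller loop bound, and the function names are illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Unoptimized: sqrt(y) is recomputed on every iteration.
std::vector<double> unoptimized(double y, int n) {
    std::vector<double> out;
    for (int i = 1; i <= n; i++) {
        double x = std::sqrt(y);   // loop invariant: y never changes in the loop
        out.push_back(x + i);
    }
    return out;
}

// After global optimization: the invariant computation is hoisted.
std::vector<double> optimized(double y, int n) {
    std::vector<double> out;
    double x = std::sqrt(y);       // computed once, before the loop
    for (int i = 1; i <= n; i++)
        out.push_back(x + i);
    return out;
}
```

Both functions perform exactly the same arithmetic per element, so their outputs are identical; the optimized one simply calls sqrt once instead of n times.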

1.2.4 Code Generation

It is assumed that the student has had some experience with assembly language and machine language, and is aware that the computer is capable of executing only a limited number of primitive operations on operands with numeric memory addresses, all encoded as binary values. In the code generation phase, atoms or syntax trees are translated to machine language (binary) instructions, or to assembly language, in which case the assembler is invoked to produce the object program. Symbolic addresses (statement labels) are translated to relocatable memory addresses at this time. For target machines with several CPU registers, the code generator is responsible for register allocation. This means that the compiler must be aware of which registers are being used for particular purposes in the generated program, and which become available as code is generated. For example, an ADD atom might be translated to three machine language instructions: (1) load the first operand into a register, (2) add the second operand to that


register, and (3) store the result, as shown for the atom (ADD, A, B, Temp):

LOD R1,A      // Load A into reg. 1
ADD R1,B      // Add B to reg. 1
STO R1,Temp   // Store reg. 1 in Temp

In Sample Problem 1.2 (e), below, the destination for the MOV instruction is the first operand, and the source is the second operand, which is the reverse of the operand positions in the MOVE atom. It is not uncommon for the object language to be another high-level language. This is done in order to improve portability of the language being implemented.
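The load/add/store pattern for an ADD atom, shown above, can be sketched as a translation routine. The instruction strings follow the examples in this section; the routine itself is illustrative, not the book's code generator.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Translate the atom (ADD, a, b, result) into the three-instruction
// load/add/store pattern, using register 1.
std::vector<std::string> genAdd(const std::string& a,
                                const std::string& b,
                                const std::string& result) {
    return {
        "LOD R1," + a,       // load first operand into register 1
        "ADD R1," + b,       // add second operand to register 1
        "STO R1," + result,  // store register 1 into the result location
    };
}
```

For a machine with several registers, a real code generator would also decide *which* register to use here; that is the register allocation problem mentioned above.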

1.2.5 Local Optimization

The local optimization phase is also optional and is needed only to make the object program more efficient. It involves examining sequences of instructions put out by the code generator to find unnecessary or redundant instructions. For this reason, local optimization is often called machine-dependent optimization. An addition operation in the source program might result in three instructions in the object program: (1) Load one operand into a register, (2) add the other operand to the register, and (3) store the result. Consequently, the expression A + B + C in the source program might result in the following instructions as code generator output:

Sample Problem 1.2 (e)

Show assembly language instructions corresponding to the following atom string:

(ADD, A, B, Temp1)
(TEST, A, ==, B, L1)
(MOVE, Temp1, A)
(LBL, L1)
(MOVE, Temp1, B)

Solution:

    LOD R1,A
    ADD R1,B
    STO R1,Temp1   // (ADD, A, B, Temp1)
    CMP A,B
    BE L1          // (TEST, A, ==, B, L1)
    MOV A,Temp1    // (MOVE, Temp1, A)
L1: MOV B,Temp1    // (MOVE, Temp1, B)
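The redundancy that Section 1.2.5 describes can be sketched as a peephole pass over the code generator's output: a store immediately followed by a load of the same location into the same register is deleted, on the assumption that the stored temporary is not needed later. The pass below is an illustrative sketch, not the book's optimizer.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Peephole pass: delete the pair  STO R1,X / LOD R1,X  since the
// value is already in the register.  (Valid only when X is a dead
// temporary, as it is in straight-line expression code.)
std::vector<std::string> peephole(const std::vector<std::string>& code) {
    std::vector<std::string> out;
    for (size_t i = 0; i < code.size(); i++) {
        if (i + 1 < code.size() &&
            code[i].compare(0, 4, "STO ") == 0 &&
            code[i + 1].compare(0, 4, "LOD ") == 0 &&
            code[i].substr(4) == code[i + 1].substr(4)) {
            i++;          // skip both instructions of the redundant pair
            continue;
        }
        out.push_back(code[i]);
    }
    return out;
}
```

Applied to the six instructions a code generator might emit for A + B + C (two load/add/store triples), the pass removes the STO/LOD pair in the middle, leaving LOD R1,A / ADD R1,B / ADD R1,C / STO R1,Temp2.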