Prepared by MOHIT KUMAR for and on behalf of Meerut Institute of Engineering and Technology, Meerut.
COMPILER DESIGN
(TCS-502)
COURSE FILE
FOR
Bachelor of Technology
IN
Computer Science and Engineering
Session: 2007-
Department of Computer Science and Engineering
MEERUT INSTITUTE OF ENGINEERING AND TECHNOLOGY
MEERUT
MIET TCS-502 COMPILER DESIGN Course File II
CONTENTS
PREAMBLE
SYLLABUS
LECTURE PLAN
LECTURE NOTES
01 Introduction to Compilers and its Phases
02 Lexical Analysis
03 Basics of Syntax Analysis
04 Top-Down Parsing
05 Basic Bottom-Up Parsing Techniques
06 LR Parsing
07 Syntax-Directed Translation
08 Symbol Tables
09 Run Time Administration
10 Error Detection and Recovery
11 Code Optimization
EXERCISES
Practice Questions
Examination Question Papers
Laboratory Assignments
SYLLABUS
(As laid down by Uttar Pradesh Technical University, Lucknow)
UNIT I:
Introduction to Compiler: Phases and passes, Bootstrapping, Finite state machines and regular expressions and their applications to lexical analysis, Implementation of lexical analyzers, lexical- analyzer generator, LEX-compiler, Formal grammars and their application to syntax analysis, BNF notation, ambiguity, YACC. The syntactic specification of programming languages: Context free grammars, derivation and parse trees, capabilities of CFG.
UNIT II:
Basic Parsing Techniques: Parsers, Shift reduce parsing, operator precedence parsing, top down parsing, predictive parsers Automatic Construction of efficient Parsers: LR parsers, the canonical Collection of LR (O) items, constructing SLR parsing tables, constructing Canonical LR parsing tables, Constructing LALR parsing tables, using ambiguous grammars, an automatic parser generator, implementation of LR parsing tables, constructing LALR sets of items.
UNIT III:
Syntax-directed Translation: Syntax-directed Translation schemes, Implementation of Syntax- directed Translators, Intermediate code, postfix notation, Parse trees & syntax trees, three address code, quadruple & triples, translation of assignment statements, Boolean expressions, statements that alter the flow of control, postfix translation, translation with a top down parser. More about translation: Array references in arithmetic expressions, procedures call, declarations, case statements.
UNIT IV:
Symbol Tables: Data structure for symbols tables, representing scope information. Run-Time Administration: Implementation of simple stack allocation scheme, storage allocation in block structured language. Error Detection & Recovery: Lexical Phase errors, syntactic phase errors semantic errors.
UNIT V:
Introduction to code optimization: Loop optimization, the DAG representation of basic blocks, value numbers and algebraic laws, Global Data-Flow analysis.
TEXTBOOK:
Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, “Compilers: Principles, Techniques, and Tools”, Addison-Wesley.
LECTURE PLAN
(PERIOD OF STUDY: From August 2007 to November 2007)
TEACHING AIDS (all units): Transparencies on an Overhead Projector, or PowerPoint presentations on an LCD Projector.

UNIT FIRST (AUGUST)
Competencies: Introduction to the subject, its pre-requisites, objectives, content and plan of study; basic concepts of the compiler; overview of the passes, phases, lexical analyzers and CFGs.
Topics: Introduction to Compilers and its Phases; Lexical Analysis (3 hours); Basics of Syntax Analysis (3 hours).

UNIT SECOND (SEPTEMBER)
Competencies: Basic idea about the different types of parsers and their working mechanism.
Topics: Top-Down Parsing (4 hours); Basic Bottom-Up Parsing Techniques (3 hours).

UNIT THIRD (OCTOBER)
Competencies: Internal details about translation; actions to be attached to productions that shall produce the desired code.
Topics: LR Parsing (6 hours); Syntax-Directed Translation.

UNIT FOURTH (OCTOBER)
Competencies: Data structures related with the compiler, the scope of the information stored, and the possible errors that may arise.
Topics: Symbol Tables (1 hour); Run Time Storage Organization; Error Detection and Recovery.

UNIT FIFTH (NOVEMBER)
Competencies: Optimization techniques related with the compilation process.
Topics: Code Optimization.

TOTAL NUMBER OF LECTURE HOURS FOR THE COURSE: 35
- Puts information about identifiers into the symbol table.
- Regular expressions are used to describe tokens (lexical constructs).
- A (Deterministic) Finite State Automaton can be used in the implementation of a lexical analyzer.
1.2.2 Syntax Analyzer
- A Syntax Analyzer creates the syntactic structure (generally a parse tree) of the given program.
- A syntax analyzer is also called a parser.
- A parse tree describes a syntactic structure.
Example: For the line of code newval := oldval + 12, the parse tree will be:

assignment
 ├── identifier (newval)
 ├── :=
 └── expression
      ├── expression
      │    └── identifier (oldval)
      ├── +
      └── expression
           └── number (12)
- The syntax of a language is specified by a context free grammar (CFG).
- The rules in a CFG are mostly recursive.
- A syntax analyzer checks whether a given program satisfies the rules implied by a CFG or not.
- If it does, the syntax analyzer creates a parse tree for the given program.
Example: The CFG used for the above parse tree is:
assignment → identifier := expression
expression → identifier
expression → number
expression → expression + expression
- Depending on how the parse tree is created, there are different parsing techniques.
- These parsing techniques are categorized into two groups:
- Top-Down Parsing,
- Bottom-Up Parsing
- Top-Down Parsing:
- Construction of the parse tree starts at the root, and proceeds towards the leaves.
- Efficient top-down parsers can be easily constructed by hand.
- Recursive Predictive Parsing, Non-Recursive Predictive Parsing (LL Parsing).
- Bottom-Up Parsing:
- Construction of the parse tree starts at the leaves, and proceeds towards the root.
- Normally efficient bottom-up parsers are created with the help of some software tools.
- Bottom-up parsing is also known as shift-reduce parsing.
- Operator-Precedence Parsing – simple, restrictive, easy to implement
- LR Parsing – a much more general form of shift-reduce parsing: LR, SLR, LALR
1.2.3 Semantic Analyzer
- A semantic analyzer checks the source program for semantic errors and collects the type information for the code generation.
- Type-checking is an important part of the semantic analyzer.
- Normally, semantic information cannot be represented by the context-free languages used in syntax analyzers.
- The context-free grammars used in syntax analysis are therefore integrated with attributes (semantic rules). The result is syntax-directed translation and attribute grammars.
Example: In the line of code newval := oldval + 12 , the type of the identifier newval must match with type of the expression (oldval+12).
1.2.4 Intermediate Code Generation
- A compiler may produce explicit intermediate code representing the source program.
- This intermediate code is generally machine-architecture independent, but its level is close to the level of machine code.
Example:
newval := oldval * fact + 1
id1 := id2 * id3 + 1
MULT id2, id3, temp1
ADD temp1, #1, temp2
MOV temp2, id1
The last form is the Intermediate Code (Quadruples).
1.2.5 Code Optimizer
- The code optimizer optimizes the code produced by the intermediate code generator in terms of time and space.
Example: The above piece of intermediate code can be reduced as follows:
MULT id2, id3, temp1
ADD temp1, #1, id1
1.2.6 Code Generator
- Produces the target code for a specific machine architecture.
2. LEXICAL ANALYSIS
The Lexical Analyzer reads the source program character by character to produce tokens. Normally a lexical analyzer does not return a list of tokens in one shot; it returns a token only when the parser asks for one.
2.1 Token
- A token represents a set of strings described by a pattern. For example, an identifier represents the set of strings that start with a letter and continue with letters and digits. The actual string is called a lexeme.
- Since a token can represent more than one lexeme, additional information must be held for that specific lexeme. This additional information is called the attribute of the token.
- For simplicity, a token may have a single attribute which holds the required information for that token. For identifiers, this attribute is a pointer into the symbol table, and the symbol table holds the actual attributes for that token.
- Examples:
- <identifier, attribute> where attribute is pointer to the symbol table
- for tokens such as operators and keywords, no attribute is needed (the token type identifies the lexeme)
- <number, value> where value is the actual value of the number
- Token type and its attribute uniquely identify a lexeme.
- Regular expressions are widely used to specify patterns.
2.2 Languages
2.2.1 Terminology
- Alphabet: a finite set of symbols (e.g. the ASCII characters)
- String: a finite sequence of symbols over an alphabet
- Sentence and word are also used as synonyms for string
- ε is the empty string
- |s| is the length of string s.
- Language: a set of strings over some fixed alphabet
- ∅ the empty set is a language.
- {ε} the set containing empty string is a language
- The set of all possible identifiers is a language.
- Operators on Strings:
- Concatenation: xy represents the concatenation of strings x and y.
- sε = εs = s
- s^n = s s s ... s (n times); s^0 = ε
2.2.2. Operations on Languages
- Concatenation: L 1 L 2 = { s 1 s 2 | s 1 ∈ L 1 and s 2 ∈ L 2 }
- Union: L 1 ∪ L 2 = { s | s ∈ L 1 or s ∈ L 2 }
- Exponentiation: L^0 = {ε}, L^1 = L, L^2 = LL, ..., L^i = L L^(i-1)
- Kleene Closure: L* = L^0 ∪ L^1 ∪ L^2 ∪ ...
- Positive Closure: L+ = L^1 ∪ L^2 ∪ L^3 ∪ ...
Examples:
- L 1 = {a,b,c,d} L 2 = {1,2}
- L 1 L 2 = {a1,a2,b1,b2,c1,c2,d1,d2}
- L 1 ∪ L 2 = {a,b,c,d,1,2}
- L1^3 = all strings of length three over {a,b,c,d}
- L1* = all strings over {a,b,c,d}, including the empty string
- L1+ = all strings over {a,b,c,d}, excluding the empty string
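These operations can be checked directly on the small example languages. Since L* is infinite, the sketch below approximates the Kleene closure up to a bound; the function names are illustrative.

```python
def concat(L1, L2):
    """Concatenation: every s1 s2 with s1 in L1 and s2 in L2."""
    return {s1 + s2 for s1 in L1 for s2 in L2}

def power(L, n):
    """L^n: L^0 = {""} (just the empty string), L^n = L . L^(n-1)."""
    result = {""}
    for _ in range(n):
        result = concat(result, L)
    return result

def kleene_star(L, up_to):
    """Approximate L* by the union of L^0 .. L^up_to."""
    star = set()
    for i in range(up_to + 1):
        star |= power(L, i)
    return star

L1, L2 = {"a", "b", "c", "d"}, {"1", "2"}
print(sorted(concat(L1, L2)))     # a1, a2, b1, b2, c1, c2, d1, d2
print(L1 | L2)                    # the union
print(len(power(L1, 3)))          # 64 strings of length three
print("" in kleene_star(L1, 2))   # True: L* contains the empty string
```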
2.3 Regular Expressions and Finite Automata
2.3.1 Regular Expressions
- We use regular expressions to describe tokens of a programming language.
- A regular expression is built up of simpler regular expressions (using defining rules)
- Each regular expression denotes a language.
- A language denoted by a regular expression is called a regular set.
For Regular Expressions over alphabet Σ
Regular Expression    Language it denotes
ε                     {ε}
a ∈ Σ                 {a}
(r1) | (r2)           L(r1) ∪ L(r2)
(r1) (r2)             L(r1) L(r2)
(r)*                  (L(r))*
(r)                   L(r)
- (r)+ = (r)(r)*
- (r)? = (r) | ε
- We may remove parentheses by using precedence rules:
- * highest
- concatenation next
- | lowest
- ab*|c means (a(b*))|(c)
Examples:
- Σ = {0,1}
- 0|1 = {0,1}
- (0|1)(0|1) = {00,01,10,11}
- 0* = {ε, 0, 00, 000, 0000, ...}
- (0|1) *^ = All strings with 0 and 1, including the empty string
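The same examples can be verified with Python's re module, whose |, * and parentheses behave as described above (Python is used here only for illustration):

```python
import re

assert re.fullmatch(r"0|1", "0")
assert re.fullmatch(r"(0|1)(0|1)", "10")
assert not re.fullmatch(r"(0|1)(0|1)", "1")   # length must be exactly two
assert re.fullmatch(r"0*", "")                # 0* includes the empty string
assert re.fullmatch(r"0*", "0000")
assert re.fullmatch(r"(0|1)*", "010011")
assert re.fullmatch(r"ab*|c", "abbb")         # precedence: (a(b*))|(c)
assert re.fullmatch(r"ab*|c", "c")
print("all regular-expression checks passed")
```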
2.3.2 Finite Automata
- A recognizer for a language is a program that takes a string x, and answers “yes” if x is a sentence of that language, and “no” otherwise.
- We call the recognizer of the tokens a finite automaton.
- A finite automaton can be: deterministic (DFA) or non-deterministic (NFA)
- This means that we may use a deterministic or non-deterministic automaton as a lexical analyzer.
- Both deterministic and non-deterministic finite automaton recognize regular sets.
- Which one?
- deterministic – a faster recognizer, but it may take more space
- non-deterministic – slower, but it may take less space
- Deterministic automata are widely used in lexical analyzers.
Example:
The DFA to recognize the language (a|b)*ab is as follows.
(Transition graph omitted; its moves are listed in the transition function below.)
- Start state: 0
- Set of final states: F = {2}
- Alphabet: Σ = {a,b}
- Set of states: S = {0,1,2}
Transition Function:
        a    b
  0     1    0
  1     1    2
  2     1    0
Note that the entries in this function are single values, not sets of values (unlike an NFA).
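A sketch of how this transition table drives a recognizer for (a|b)*ab (Python used for illustration; the names are mine):

```python
# The transition function from the table above, as a dictionary.
TRANSITION = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 0,
}
START, FINAL = 0, {2}

def accepts(word):
    state = START
    for ch in word:
        state = TRANSITION[(state, ch)]   # deterministic: exactly one next state
    return state in FINAL

print(accepts("aab"))    # True: the string ends in ab
print(accepts("abab"))   # True
print(accepts("ba"))     # False
```

Because each (state, symbol) pair has a single next state, no backtracking is ever needed; this is the speed advantage of the DFA noted above.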
2.3.5 Converting an RE to an NFA (Thompson's Construction)
- This is one way to convert a regular expression into an NFA.
- There are other (more efficient) ways to do the conversion.
- Thompson's Construction is a simple and systematic method.
- It guarantees that the resulting NFA has exactly one final state and one start state.
- Construction starts from the simplest parts (alphabet symbols).
- To create an NFA for a complex regular expression, the NFAs of its sub-expressions are combined.
- To recognize the empty string ε: an NFA with a single ε-transition from start state i to final state f.
- To recognize a symbol a in the alphabet Σ: an NFA with a single a-transition from start state i to final state f.
- For regular expression r1 | r2: a new start state has ε-transitions into N(r1) and N(r2), and their final states have ε-transitions into a new final state. N(r1) and N(r2) are the NFAs for the regular expressions r1 and r2.
- For regular expression r1 r2: N(r1) and N(r2) are connected in sequence. The start state of N(r1) becomes the start state of N(r1r2), the final state of N(r1) is joined to the start state of N(r2), and the final state of N(r2) becomes the final state of N(r1r2).
- For regular expression r*: a new start state i and a new final state f are added, with ε-transitions from i into N(r) and directly to f, and from the final state of N(r) both back into N(r)'s start state and on to f.
Example: For a RE (a|b) * a, the NFA construction is shown below.
(Figures: the NFAs for a and b; N(a|b) built with the union rule; N((a|b)*) built with the closure rule; and the NFA for (a|b)*a, obtained by concatenating N((a|b)*) with the NFA for a.)
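The construction steps above can be sketched in code. This is a minimal illustrative sketch, not the notes' own algorithm: the function names, the (start, final, transitions) representation, and the small ε-closure matcher are all my choices.

```python
from itertools import count

# Each constructor returns (start, final, transitions) with exactly one start
# and one final state; transitions maps (state, symbol-or-None) to a set of
# successor states, with None marking an epsilon-move.
_ids = count()

def _new():
    return next(_ids)

def _merge(*tables):
    out = {}
    for t in tables:
        for k, v in t.items():
            out.setdefault(k, set()).update(v)
    return out

def symbol(a):                       # base case: recognize the symbol a
    i, f = _new(), _new()
    return i, f, {(i, a): {f}}

def union(n1, n2):                   # N(r1 | r2)
    i, f = _new(), _new()
    trans = _merge(n1[2], n2[2],
                   {(i, None): {n1[0], n2[0]},
                    (n1[1], None): {f}, (n2[1], None): {f}})
    return i, f, trans

def concat(n1, n2):                  # N(r1 r2): final of N(r1) feeds N(r2)
    trans = _merge(n1[2], n2[2], {(n1[1], None): {n2[0]}})
    return n1[0], n2[1], trans

def star(n):                         # N(r*): loop back, and allow skipping N(r)
    i, f = _new(), _new()
    trans = _merge(n[2], {(i, None): {n[0], f},
                          (n[1], None): {n[0], f}})
    return i, f, trans

def _closure(states, trans):
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in trans.get((s, None), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def matches(nfa, word):
    start, final, trans = nfa
    current = _closure({start}, trans)
    for ch in word:
        moved = set().union(*(trans.get((s, ch), set()) for s in current))
        current = _closure(moved, trans)
    return final in current

# NFA for (a|b)* a, built from its sub-expressions:
nfa = concat(star(union(symbol("a"), symbol("b"))), symbol("a"))
print(matches(nfa, "bba"))   # True
print(matches(nfa, "ab"))    # False: the string must end with a
```

By construction every sub-NFA contributes fresh states, so the combined automaton keeps the single-start, single-final property the notes describe.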
(Figures: the generic constructions – N(r*) with ε-transitions around N(r), N(r1|r2) with a new start and final state, and N(r1 r2) with N(r1) and N(r2) in sequence.)
Example:
S0 = ε-closure({0}) = {0,1,2,4,7}; put S0 into DS as an unmarked state

⇓ mark S0
ε-closure(move(S0,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1; put S1 into DS
ε-closure(move(S0,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2; put S2 into DS
transfunc[S0,a] ← S1    transfunc[S0,b] ← S2

⇓ mark S1
ε-closure(move(S1,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1
ε-closure(move(S1,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2
transfunc[S1,a] ← S1    transfunc[S1,b] ← S2

⇓ mark S2
ε-closure(move(S2,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1
ε-closure(move(S2,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2
transfunc[S2,a] ← S1    transfunc[S2,b] ← S2

S0 is the start state of the DFA, since 0 is a member of S0 = {0,1,2,4,7}.
S1 is an accepting state of the DFA, since 8 is a member of S1 = {1,2,3,4,6,7,8}.
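The computation above can be reproduced with a short subset-construction sketch. The NFA transition data below encodes the Thompson NFA for (a|b)*a with states 0 to 8; the variable names are illustrative, not from the notes.

```python
# epsilon-moves and symbol-moves of the Thompson NFA for (a|b)*a, states 0-8
NFA_MOVE = {(2, "a"): {3}, (4, "b"): {5}, (7, "a"): {8}}
EPS = {0: {1, 7}, 1: {2, 4}, 3: {6}, 5: {6}, 6: {1, 7}}
START, NFA_FINAL = 0, 8

def eps_closure(states):
    """All NFA states reachable from `states` via epsilon-moves alone."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in EPS.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def move(states, symbol):
    return set().union(*(NFA_MOVE.get((s, symbol), set()) for s in states))

# Work-list algorithm: start from S0 = eps-closure({START}) and mark DFA
# states one at a time, exactly as in the trace above.
S0 = eps_closure({START})
dstates, unmarked, transfunc = {S0: "S0"}, [S0], {}
while unmarked:
    S = unmarked.pop(0)                 # "mark" the next unmarked DFA state
    for a in "ab":
        T = eps_closure(move(S, a))
        if T not in dstates:            # a new DFA state: put it into DS
            dstates[T] = f"S{len(dstates)}"
            unmarked.append(T)
        transfunc[(dstates[S], a)] = dstates[T]

print(sorted(S0))                       # [0, 1, 2, 4, 7]
print(transfunc)
print([name for S, name in dstates.items() if NFA_FINAL in S])  # ['S1']
```

The run discovers exactly the three DFA states S0, S1, S2 and the same transfunc entries as the hand computation.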
2.4 Lexical Analyzer Generator
(Residual figures from the previous example: the Thompson NFA for (a|b)*a with states 0 to 8, and the resulting DFA with states S0, S1 and S2.)
(Figure: Regular Expressions → Lexical Analyzer Generator → Lexical Analyzer; Source Program → Lexical Analyzer → Tokens.)
LEX is an example of a Lexical Analyzer Generator.
2.4.1 Input to LEX
- The input to LEX consists primarily of Auxiliary Definitions and Translation Rules.
- Writing the regular expression for some languages can be difficult, because their regular expressions can be quite complex. In those cases, we may use Auxiliary Definitions.
- We can give names to regular expressions, and we can use these names as symbols to define other regular expressions.
- An Auxiliary Definition is a sequence of definitions of the form:
d1 → r1
d2 → r2
...
dn → rn
where each di is a distinct name and each ri is a regular expression over the symbols in Σ ∪ {d1, d2, ..., di-1} (i.e. the basic symbols and the previously defined names).
Example: For identifiers in Pascal:
letter → A | B | ... | Z | a | b | ... | z
digit → 0 | 1 | ... | 9
id → letter ( letter | digit )*
If we try to write the regular expression representing identifiers without using regular definitions, that regular expression will be complex. (A|...|Z|a|...|z) ( (A|...|Z|a|...|z) | (0|...|9) ) *
Example: For unsigned numbers in Pascal:
digit → 0 | 1 | ... | 9
digits → digit+
opt-fraction → ( . digits )?
opt-exponent → ( E (+|-)? digits )?
unsigned-num → digits opt-fraction opt-exponent
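The same layering of definitions can be sketched with Python's re syntax. This is an illustrative sketch, not LEX input syntax; the variable names mirror the definitions above.

```python
import re

# Each named definition is built from the previously defined names, just as
# in the auxiliary definitions above.
digit        = r"[0-9]"
digits       = digit + r"+"
opt_fraction = r"(\." + digits + r")?"
opt_exponent = r"(E[+-]?" + digits + r")?"
unsigned_num = digits + opt_fraction + opt_exponent

letter = r"[A-Za-z]"
ident  = letter + r"(" + letter + r"|" + digit + r")*"

for lexeme in ["42", "3.14", "6E-2", "12.5E+3"]:
    assert re.fullmatch(unsigned_num, lexeme), lexeme
assert not re.fullmatch(unsigned_num, ".5")   # the fraction needs leading digits
assert re.fullmatch(ident, "newval2")
assert not re.fullmatch(ident, "2newval")
print("all definitions behave as expected")
```

Without the names, unsigned-num would be the single opaque expression [0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?, which illustrates why auxiliary definitions help.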
- Translation Rules comprise an ordered list of Regular Expressions and the Program Code to be executed when that Regular Expression is encountered:
R1 P1
R2 P2
...
Rn Pn
- The list is ordered, i.e. the REs are checked in order. If a string matches more than one RE, the RE occurring higher in the list is given preference and its Program Code is executed.
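A sketch of such an ordered rule list, with Python functions standing in for the program code. The token names are illustrative, and real LEX additionally prefers the longest match before applying rule order; this sketch shows only the ordering rule described above.

```python
import re

# Each rule pairs a regular expression with program code. Rules earlier in
# the list win, so the keyword "begin" is listed before the general
# identifier rule.
RULES = [
    (r"begin",                 lambda lex: ("KEYWORD", lex)),
    (r"[A-Za-z][A-Za-z0-9]*",  lambda lex: ("ID", lex)),
    (r"[0-9]+",                lambda lex: ("NUM", int(lex))),
    (r"\s+",                   lambda lex: None),            # ignore whitespace
]

def scan(text):
    pos, out = 0, []
    while pos < len(text):
        for pattern, action in RULES:          # check the REs in order
            m = re.match(pattern, text[pos:])
            if m:
                token = action(m.group())
                if token is not None:
                    out.append(token)
                pos += m.end()
                break
        else:
            raise ValueError(f"lexical error at position {pos}")
    return out

print(scan("begin count 42"))
# -> [('KEYWORD', 'begin'), ('ID', 'count'), ('NUM', 42)]
```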
3. BASICS OF SYNTAX ANALYSIS
- Syntax Analyzer creates the syntactic structure of the given source program.
- This syntactic structure is mostly a parse tree.
- Syntax Analyzer is also known as parser.
- The syntax of a programming language is described by a context-free grammar (CFG). We will use BNF (Backus-Naur Form) notation in the description of CFGs.
- The syntax analyzer (parser) checks whether a given source program satisfies the rules implied by a context-free grammar or not.
- If it does, the parser creates the parse tree of that program.
- Otherwise, the parser gives error messages.
- A context-free grammar:
- gives a precise syntactic specification of a programming language;
- its design is an initial phase of the design of a compiler;
- can be directly converted into a parser by some tools.
3.1 Parser
- Parser works on a stream of tokens.
- The smallest item is a token.
- We categorize the parsers into two groups:
- Top-Down Parser
- the parse tree is created top to bottom, starting from the root.
- Bottom-Up Parser
- the parse tree is created bottom to top, starting from the leaves.
- Both top-down and bottom-up parsers scan the input from left to right (one symbol at a time).
- Efficient top-down and bottom-up parsers can be implemented only for sub-classes of context-free grammars.
- LL for top-down parsing
- LR for bottom-up parsing
3.2 Context Free Grammars
- Inherently recursive structures of a programming language are defined by a context-free grammar.
- In a context-free grammar, we have:
- A finite set of terminals (in our case, this will be the set of tokens)
- A finite set of non-terminals (syntactic-variables)
- A finite set of production rules of the form A → α, where A is a non-terminal and α is a string of terminals and non-terminals (including the empty string)
- A start symbol (one of the non-terminal symbols)
- L(G) is the language of G (the language generated by G) which is a set of sentences.
- A sentence of L(G) is a string of terminal symbols of G.
- If S is the start symbol of G, then ω is a sentence of L(G) iff S ⇒ ω (in one or more derivation steps), where ω is a string of terminals of G.
- If G is a context-free grammar, L(G) is a context-free language.
(Figure: source program → Lexical Analyzer → token → Parser → parse tree; the Parser requests each token with "get next token".)
- Two grammars are equivalent if they produce the same language.
- If S ⇒ α:
- if α contains non-terminals, it is called a sentential form of G;
- if α does not contain non-terminals, it is called a sentence of G.
3.2.1 Derivations
Example: E → E + E | E – E | E * E | E / E | - E | ( E ) | id
- E ⇒ E+E means that E+E derives from E:
- we can replace E by E+E;
- to be able to do this, we must have the production rule E → E+E in our grammar.
- E ⇒ E+E ⇒ id+E ⇒ id+id: such a sequence of replacements of non-terminal symbols is called a derivation of id+id from E.
- In general, a derivation step is αAβ ⇒ αγβ if there is a production rule A → γ in our grammar, where α and β are arbitrary strings of terminal and non-terminal symbols.
α1 ⇒ α2 ⇒ ... ⇒ αn (we say that α1 derives αn)
- At each derivation step, we can choose any of the non-terminals in the sentential form of G for the replacement.
- If we always choose the left-most non-terminal in each derivation step, the derivation is called the left-most derivation.
Example:
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)
- If we always choose the right-most non-terminal in each derivation step, the derivation is called the right-most derivation.
Example:
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(E+id) ⇒ -(id+id)
- We will see that the top-down parsers try to find the left-most derivation of the given source program.
- We will see that the bottom-up parsers try to find the right-most derivation of the given source program in the reverse order.
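The two derivations above can be reproduced mechanically. This is a sketch: the rule numbering and the helper function are illustrative choices, using a subset of the example grammar.

```python
# Grammar: E -> E + E | ( E ) | - E | id  (rule indices 0..3)
RULES = {"E": [["E", "+", "E"], ["(", "E", ")"], ["-", "E"], ["id"]]}

def derive(start, choices, leftmost=True):
    """Apply the given rule choices, always expanding the left-most
    (or right-most) non-terminal; return every sentential form."""
    form, history = [start], [[start]]
    for rule_index in choices:
        positions = [i for i, sym in enumerate(form) if sym in RULES]
        i = positions[0] if leftmost else positions[-1]
        form = form[:i] + RULES[form[i]][rule_index] + form[i + 1:]
        history.append(list(form))
    return history

# E => -E => -(E) => -(E+E) => -(id+E) => -(id+id)
left = derive("E", [2, 1, 0, 3, 3], leftmost=True)
print([" ".join(f) for f in left])

# The right-most derivation uses the same rules but reaches the sentence
# through different sentential forms (-(E+id) instead of -(id+E)).
right = derive("E", [2, 1, 0, 3, 3], leftmost=False)
print(" ".join(right[-1]))   # - ( id + id )
```

Both derivations end in the same sentence, differing only in the order in which the non-terminals are replaced, which is exactly the distinction the two examples above illustrate.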
3.2.2 Parse Tree
- Inner nodes of a parse tree are non-terminal symbols.
- The leaves of a parse tree are terminal symbols.
- A parse tree can be seen as a graphical representation of a derivation.