TRAJECTORY EDUCATION
UGC–NET/COMPUTER SCIENCE/SLOT 10
CHAPTER 3

Revision of last chapter readings
1. High-level languages have to be processed by either a compiler or an interpreter
2. Difference between compiler and interpreter
3. Compiler steps
   a. Analysis (carried out by the front end)
      i. Lexical analysis
         - A generic lexical analyzer : lex
      ii. Syntax analysis
         - Context-free grammar
         - How to build parse trees
         - A generic parser : yacc
      iii. Semantic analysis
   b. Synthesis (carried out by the back end)
      i. Intermediate code generation
      ii. Code optimization
      iii. Object code generation

Continuing syntax analysis :

We already talked about derivations and their graphical representation, i.e. the parse tree. There are two types of derivations:
○ Leftmost derivation : at each step, the leftmost non-terminal is replaced;
e.g. E => E * E => id * E => id * id
○ Rightmost derivation : at each step, the rightmost non-terminal is replaced;
e.g. E => E * E => E * id => id * id

Every parse tree has a unique leftmost (and a unique rightmost) derivation. Note that a sentence can have many parse trees, but a given parse tree has a unique leftmost derivation.

Main Office, 126 2nd Floor, Kingsway Camp, Delhi-09, 011-47041845, www.trajectoryeducation.com
Evaluation of a parse tree always happens bottom-up and left to right.

Ambiguity : A grammar is ambiguous if some sentence has more than one parse tree, i.e., more than one leftmost (or rightmost) derivation of the sentence is possible.

Example : Given the grammar (set of productions)
E -> E + E
E -> E * E
E -> id
a sentence such as id + id * id has two parse trees: one in which + is applied last, and one in which * is applied last.
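The ambiguity of this grammar can be checked mechanically. The sketch below counts the parse trees of a token string under E -> E + E | E * E | id by trying every way to split the string at an operator; the function name count_trees is our own, not part of the text.

```python
# A brute-force count of parse trees for a token string under the
# ambiguous grammar E -> E + E | E * E | id from the example above.
from functools import lru_cache

def count_trees(tokens):
    toks = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):  # number of parse trees deriving toks[i:j] from E
        if j - i == 1:
            return 1 if toks[i] == "id" else 0
        total = 0
        # try every split E op E, with the operator at position k
        for k in range(i + 1, j - 1):
            if toks[k] in ("+", "*"):
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(toks))

print(count_trees(["id", "+", "id", "*", "id"]))  # 2 parse trees
```

A count greater than one for any sentence is exactly what "ambiguous" means; for the unambiguous grammar built later in this chapter the count is always one.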
How to resolve ambiguity : Write an unambiguous grammar. This can be achieved by defining precedence rules with extra non-terminals.

Example : Another ambiguous grammar
E -> E + E | E - E | E * E | E / E | -E | id

Problem : How to convert the above ambiguous grammar into an unambiguous one.
Solution : Apply precedence rules with extra non-terminals.
Usual precedence order from highest to lowest is : - (unary minus), * | /, + | -

Golden rule : Build the grammar from lowest to highest precedence.

Goal -> Expr
Expr -> Expr + Term | Expr - Term | Term
Term -> Term * Factor | Term / Factor | Factor
Factor -> -Primary | Primary
Primary -> id

Now the leftmost derivation for - id + id * id is
Goal => Expr
     => Expr + Term
     => Term + Term
     => Factor + Term
     => -Primary + Term
     => -id + Term
     => -id + Term * Factor
     => -id + Factor * Factor
     => -id + Primary * Factor
     => -id + id * Factor
     => -id + id * Primary
     => -id + id * id

There are three new non-terminals (Term, Factor, Primary). You cannot have two parse trees for the above sentence using this grammar.

Parser : A program that, given a sentence, reconstructs a derivation for that sentence ---- if it does so successfully, it "recognizes" the sentence. All parsers read their input left-to-right, but construct the parse tree differently. There are two types of parsers:
a. Top-down parsers --- construct the tree from root to leaves
b. Bottom-up parsers --- construct the tree from leaves to root
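The precedence grammar above can be turned directly into a small parser/evaluator. The sketch below is our own illustration: numbers stand in for the terminal id, and the left-recursive rules (Expr -> Expr + Term, Term -> Term * Factor) are implemented with loops, since naive recursion on them would never terminate.

```python
# A hand-written evaluator following the precedence grammar
# Goal/Expr/Term/Factor/Primary from the text.

def evaluate(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def primary():            # Primary -> number (plays the role of id)
        return float(eat())

    def factor():             # Factor -> -Primary | Primary
        if peek() == "-":
            eat()
            return -primary()
        return primary()

    def term():               # Term -> Term * Factor | Term / Factor | Factor
        value = factor()
        while peek() in ("*", "/"):
            if eat() == "*":
                value *= factor()
            else:
                value /= factor()
        return value

    def expr():               # Expr -> Expr + Term | Expr - Term | Term
        value = term()
        while peek() in ("+", "-"):
            if eat() == "+":
                value += term()
            else:
                value -= term()
        return value

    return expr()

print(evaluate(["-", "2", "+", "3", "*", "4"]))  # (-2) + (3*4) = 10.0
```

Note how the nesting of the functions mirrors the precedence levels: expr calls term, term calls factor, so * binds tighter than +, exactly as the derivation of - id + id * id shows.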
Top-down parser : (LL parser) It attempts to derive a string matching the source string through a sequence of derivations starting with the start symbol of the grammar. In other words, it constructs the parse tree by starting at the start symbol and "guessing" at each derivation step, using the next input symbol from the sentence to guide the "guessing". For a valid input string a, a top-down parser thus determines a derivation sequence

S => … => … => a

In top-down parsing every derivation step is leftmost while matching the input string, which is why top-down parsing is also termed LL parsing: the first L is because the parser reads the input from left to right, and the second L stands for the leftmost derivation.

There are three main concepts in top-down parsing:
1. Start symbol - selection of the start symbol (the root of the parse tree) is very important.
2. Guessing the right derivation that can lead to a match of the input sentence. This is called 'prediction'.
3. If the guess is wrong then one needs to revert the guess and try again. This is called 'backtracking'.
High-level flow of top-down parsing :
Step 1 : Identify the start symbol and start with it
Step 2 : Guess a production (which can lead to a match of the input) and apply it -- Prediction
Step 3 : Match the input string
Step 4 : If it matches, go to step 2 until the complete sentence is matched.
Else it was a wrong guess: revert the derivation and go to step 2 -- Backtrack

If the prediction matches the input string then no backtracking is needed; otherwise the parser backtracks.

Some disadvantages of top-down parsing. Two problems arise due to the possibility of backtracking:
a. Semantic analysis cannot be performed while making a prediction. Any semantic action must be delayed until the prediction is known to be part of a successful parse, i.e. you do not yet know whether the prediction is correct or not.
b. A source string is known to be erroneous only after all predictions have failed. This makes it very inefficient.

Based on prediction and backtracking, top-down parsers can be categorized into two categories:

1. Recursive-Descent Parsing (RD) - A top-down parser with backtracking
• Backtracking is needed (if a choice of production rule does not work, we backtrack to try the other alternatives.)
• It is a general parsing technique, but not widely used. Not efficient. Can be used for quick-and-dirty parsing.
• At each derivation it uses the RHS of a production from left to right.
• A grammar with right recursion is suitable for this technique and does not enter an infinite loop while making predictions.
• Why the name recursive-descent? The parser is recursive in nature (recursive derivations), and descent because it goes from top to down.

Example : S -> aBc
          B -> bc | b
(Here it tries bc for B first and then b.)
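The prediction/backtracking cycle for this tiny grammar can be sketched as follows. Each function returns the input positions it can reach, so trying B -> bc first and falling back to B -> b is just iteration over alternatives; the function names are our own.

```python
# A tiny recursive-descent recognizer with backtracking for the
# grammar above: S -> aBc, B -> bc | b.

def parse_B(s, i):
    """Positions reachable after matching B starting at s[i:]."""
    results = []
    if s[i:i + 2] == "bc":   # first prediction: B -> bc
        results.append(i + 2)
    if s[i:i + 1] == "b":    # backtracking alternative: B -> b
        results.append(i + 1)
    return results

def parse_S(s):
    """True if the whole string s derives from S -> aBc."""
    if s[:1] != "a":
        return False
    for j in parse_B(s, 1):              # try each guess for B
        if s[j:] == "c":                 # the rest must be the final c
            return True                  # a prediction succeeded
    return False                         # all predictions failed

print(parse_S("abcc"))  # True  (uses B -> bc)
print(parse_S("abc"))   # True  (backtracks to B -> b)
print(parse_S("abd"))   # False
```

On the input abc, the first guess B -> bc matches bc but leaves nothing for the trailing c, so the parser reverts and succeeds with B -> b: exactly the Step 4 backtrack described above.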
Grammar suitable for recursive-descent top-down parsing

Grammars containing left recursion (the NT appears at the left end of the RHS of its own production) are not suitable for top-down parsing.

Example : for the string id + id * id
E => E + T | T
T => T * V | V
V => id

The first production used would be E => E + T. Now E has to be replaced, since in top-down parsing the leftmost derivation takes place. In recursive-descent parsing E would again be replaced by E + T, which creates an infinite loop of prediction making.

Grammars containing right recursion are suitable for top-down parsing and never enter an infinite loop. However, this method is time consuming and error-prone for large grammars.

Example : The above grammar can be rewritten with right recursion as follows.
E => T + E | T
T => V * T | V
V => id

The first production used would be E => T + E, and T has to be replaced next (as top-down parsing uses the leftmost derivation). Here is the complete sequence:
E => T + E
  => V + E
  => id + E
  => id + T
  => id + V * T
  => id + id * T
  => id + id * V
  => id + id * id

2. Predictive Parsing (PP) - (Also called recursive predictive parsing)
• Predictive parsing is a special form of recursive-descent parsing without backtracking.
• Efficient.
• Needs a special form of CFG known as an LL(k) grammar; it is possible only for LL(k) grammars.

LL(k) grammar - the context-free grammars for which there exists some positive integer k that allows a recursive-descent parser to decide which production to use by examining only the next k tokens of input. Here are some of their properties:
  Subset of the CFGs
  Permits deterministic left-to-right recognition with a lookahead of k symbols
  Builds the parse tree top-down
  If a parse table can be constructed for the grammar, then it is LL(k); if it cannot, it is not LL(k)
  Each LL(k) grammar is unambiguous
  An LL(k) grammar has no left recursion. It may have right recursion, but in that case the same non-terminal must also have a production for epsilon. With left recursion the parser can loop forever, which makes a correct prediction from k lookahead symbols impossible.
  A left-recursive grammar can be converted to an LL(k) grammar using left-recursion elimination, usually combined with left factoring.
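The conversion just mentioned can be sketched mechanically. The function below applies the textbook rewrite for immediate left recursion, A -> Aα | β becoming A -> βA', A' -> αA' | e; the primed name and the epsilon marker "e" follow the notation of this chapter's grammar examples, while the function name is our own.

```python
# A sketch of immediate left-recursion elimination.
# Productions are lists of symbols; "e" marks epsilon.

def eliminate_left_recursion(nt, productions):
    """Rewrite the productions of one non-terminal nt."""
    recursive = [p[1:] for p in productions if p and p[0] == nt]   # the alpha parts
    others = [p for p in productions if not p or p[0] != nt]       # the beta parts
    if not recursive:
        return {nt: productions}                                   # nothing to do
    prime = nt + "'"
    return {
        nt: [beta + [prime] for beta in others],                   # A  -> beta A'
        prime: [alpha + [prime] for alpha in recursive] + [["e"]], # A' -> alpha A' | e
    }

# E -> E + T | T   becomes   E -> T E',  E' -> + T E' | e
print(eliminate_left_recursion("E", [["E", "+", "T"], ["T"]]))
```

Running it on E -> E + T | T reproduces exactly the E -> TE', E' -> +TE' | e rewrite shown in the worked example that follows.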
Left factoring : Take the common parts of productions and form a new non-terminal. After left factoring each production (i.e. each non-terminal) becomes non-recursive or right recursive, and every right-recursive non-terminal also has a production for e (epsilon).

Examples : How to convert a left-recursive grammar into an LL(k) grammar

E => E + T | T
T => T * V | V      ----> left recursive (not suitable for any top-down parsing)
V => id
        |
        v
E => T + E | T
T => V * T | V      ---> right recursive (suitable for top-down recursive-descent parsing)
V => id
        |
        v
E => T E'
E' => + T E' | e
T => V T'           ---> left-factored LL(k) grammar (suitable for top-down predictive parsing)
T' => * V T' | e
V => id

(Note that every recursive production has a derivation to e (epsilon).)
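A predictive parser for the final left-factored grammar needs no backtracking at all: one token of lookahead picks each production. The sketch below is our own; it raises SyntaxError on a bad input instead of backtracking.

```python
# A predictive (no-backtracking) recursive-descent parser for the
# left-factored grammar above: E -> T E', E' -> + T E' | e,
# T -> V T', T' -> * V T' | e, V -> id.

def parse(tokens):
    toks = tokens + ["$"]          # end-of-input marker
    pos = 0

    def look():
        return toks[pos]

    def match(t):
        nonlocal pos
        if look() != t:
            raise SyntaxError(f"expected {t}, got {look()}")
        pos += 1

    def E():                       # E -> T E'
        T(); Eprime()

    def Eprime():                  # E' -> + T E' | e
        if look() == "+":
            match("+"); T(); Eprime()
        # else: predict the epsilon production, consuming nothing

    def T():                       # T -> V T'
        V(); Tprime()

    def Tprime():                  # T' -> * V T' | e
        if look() == "*":
            match("*"); V(); Tprime()

    def V():                       # V -> id
        match("id")

    E()
    match("$")                     # the whole input must be consumed
    return True

print(parse(["id", "+", "id", "*", "id"]))  # True
```

Compare this with the backtracking recognizer earlier: here every choice is made once, by looking at a single token, which is exactly the LL(1) property.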
Other examples on how to reduce grammar
The LL(k) grammars therefore exclude all ambiguous grammars, as well as all grammars that contain left recursion.

LL(1) --> a recursive-descent parser can decide which production to apply by examining only the next 1 token of input.
• The predictive parser which uses an LL(1) grammar is known as an LL(1) parser.

Something more about the LL(1) parser
○ LL(1) means that the input is processed left-to-right, a leftmost derivation is constructed, and the method uses at most one lookahead token.
○ An LL(1) parser is a table-driven parser for LL parsing.
○ The '1' in LL(1) indicates that the grammar uses a lookahead of one source symbol, i.e. the prediction to be made is determined by the next source symbol.
○ It expects an LL(1) grammar.
• There are two important concepts in LL(1) parsing
○ The parsing table, and the algorithm to create it
○ The algorithm for derivations

About the parsing table :
○ The parsing table has a row for each non-terminal (NT) in the production rules.
○ The parsing table has a column for each terminal (T) in the production rules.
○ A parsing table entry PT(NT, T) indicates what prediction should be made if NT is the leftmost non-terminal in a sentential form and T is the next source symbol.
○ A blank entry in the parsing table indicates an error. A multiple entry in the table indicates a conflict, which tells us that the grammar is not LL(1). There must be at most one entry in a cell.
○ There is a special column which depicts the end of the input; it is marked $.

Algorithm to create the parsing table : it uses two sets, FIRST and FOLLOW:
a ∈ First(alpha) ==> alpha can derive a string starting with a
b ∈ Follow(A) ==> b can follow a string derived from A

Example :
Here is an example of an LL(1) grammar for arithmetic operations and the corresponding parsing table.
Grammar :
Input string :
Parsing Table :
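The FIRST/FOLLOW computation behind such a table can be sketched as below, run on the left-factored grammar from this chapter (E -> TE', E' -> +TE' | e, T -> VT', T' -> *VT' | e, V -> id). The representation choices (empty list for the epsilon body, "$" for end of input) and all names are our own.

```python
# A sketch of the FIRST/FOLLOW computation used to build an
# LL(1) parsing table, iterated to a fixed point.

GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],      # [] is the epsilon (e) production
    "T":  [["V", "T'"]],
    "T'": [["*", "V", "T'"], []],
    "V":  [["id"]],
}
START = "E"
NTS = set(GRAMMAR)
EPS = "e"

def first_of(seq, first):
    """FIRST of a symbol sequence, given the current FIRST sets."""
    out = set()
    for sym in seq:
        if sym not in NTS:
            out.add(sym)               # a terminal stops the scan
            return out
        out |= first[sym] - {EPS}
        if EPS not in first[sym]:
            return out
    out.add(EPS)                       # the whole sequence can vanish
    return out

def compute():
    first = {nt: set() for nt in NTS}
    follow = {nt: set() for nt in NTS}
    follow[START].add("$")
    changed = True
    while changed:                     # iterate until nothing changes
        changed = False
        for nt, prods in GRAMMAR.items():
            for prod in prods:
                for sym in first_of(prod, first):          # FIRST rules
                    if sym not in first[nt]:
                        first[nt].add(sym); changed = True
                for i, sym in enumerate(prod):             # FOLLOW rules
                    if sym in NTS:
                        tail = first_of(prod[i + 1:], first)
                        new = (tail - {EPS}) | (follow[nt] if EPS in tail else set())
                        if not new <= follow[sym]:
                            follow[sym] |= new; changed = True
    return first, follow

first, follow = compute()
print(sorted(first["E"]))    # ['id']
print(sorted(follow["T'"]))  # ['$', '+']
```

The table entry PT(NT, T) then holds the production NT -> alpha whenever T is in FIRST(alpha), plus the epsilon production for every T in FOLLOW(NT) when alpha can vanish.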
LL(1) parsing algorithm

{Input : A string ω and a parsing table M for grammar G.}
{Output : If ω is in L(G), a leftmost derivation of ω; otherwise, an error indication.}

Initially, the parser is in a configuration in which it has $S on the stack, with S, the start symbol of G, on top, and ω$ in the input buffer.

Set ip to point to the first symbol of ω$.
Repeat
    Let X be the top stack symbol and a the symbol pointed to by ip.
    if X is a terminal or $
        if X = a
            Pop X from the stack and advance ip.
        else
            error()
        end if
    else    {X is a nonterminal}
        if M[X, a] = X → Y1 Y2 ⋅⋅⋅ Yk
            Pop X from the stack.
            Push Yk, Yk–1, ⋅⋅⋅, Y1 onto the stack, with Y1 on top.
            Output the production X → Y1 Y2 ⋅⋅⋅ Yk.
        else
            error()
        end if
    end if
until X = $    {stack is empty}
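The algorithm above can be sketched in executable form as follows. This is our own illustration: the table M is hand-built for the left-factored expression grammar from this chapter, "$" is the end marker, and an empty body stands for an epsilon production.

```python
# A sketch of the table-driven LL(1) algorithm, returning the
# productions output along the way (a leftmost derivation).

M = {
    ("E",  "id"): ["T", "E'"],
    ("E'", "+"):  ["+", "T", "E'"], ("E'", "$"): [],
    ("T",  "id"): ["V", "T'"],
    ("T'", "*"):  ["*", "V", "T'"], ("T'", "+"): [], ("T'", "$"): [],
    ("V",  "id"): ["id"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "V"}

def ll1_parse(tokens, start="E"):
    """Return the productions used, or raise SyntaxError."""
    input_buf = tokens + ["$"]
    stack = ["$", start]            # start symbol on top
    ip = 0
    output = []
    while True:
        X = stack[-1]
        a = input_buf[ip]
        if X not in NONTERMINALS:   # X is a terminal or $
            if X != a:
                raise SyntaxError(f"expected {X}, got {a}")
            stack.pop()
            ip += 1
            if X == "$":
                return output       # stack empty: accept
        else:
            body = M.get((X, a))
            if body is None:        # blank table entry = error
                raise SyntaxError(f"no entry M[{X}, {a}]")
            stack.pop()
            stack.extend(reversed(body))   # push Yk ... Y1, Y1 on top
            output.append(f"{X} -> {' '.join(body) or 'e'}")

print(ll1_parse(["id", "+", "id"])[0])  # E -> T E'
```

Reading the output list top to bottom reproduces the leftmost derivation of the input, which is exactly what the algorithm promises.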
Parsing steps :
Bottom-up parser : (constructs the parse tree "bottom-up" --- from the leaves to the root)

As the name suggests, bottom-up parsing works in the opposite direction from top-down. A top-down parser begins with the start symbol at the top of the parse tree and works downward, applying productions in forward order until it gets to the terminal leaves. A bottom-up parser starts with the string of terminals itself and builds from the leaves upward, working backwards to the start symbol by applying the productions in reverse. Along the way, a bottom-up parser searches for substrings of the working string that match the right side of some production. When it finds such a substring, it reduces it, i.e., substitutes the left-side non-terminal for the matching right side. The goal is to reduce all the way up to the start symbol and report a successful parse.

In general, bottom-up parsing algorithms are more powerful than top-down methods, but not surprisingly, the constructions required are also more complex. It is difficult to write a bottom-up parser by hand for anything but trivial grammars, but fortunately there are excellent parser-generator tools like yacc that build a parser from an input specification.

Some features of bottom-up parsing
○ Bottom-up parsing always constructs a rightmost derivation (in reverse).
○ It attempts to build the tree upward toward the start symbol.
○ More complex than top-down, but efficient.

Types of bottom-up parsers (2 types - shift-reduce and precedence)

• Shift-reduce parser
Shift-reduce parsing is the most commonly used and the most powerful of the bottom-up techniques. It takes as input a stream of tokens and develops the list of productions used to build the parse tree, but the productions are discovered in reverse order compared to a top-down parser. Like a table-driven predictive parser, a bottom-up parser makes use of a stack to keep track of the position in the parse and a parsing table to determine what to do next.
To illustrate stack-based shift-reduce parsing, consider this simplified expression grammar:
S -> E
E -> T | E + T
T -> id | (E)
The shift-reduce strategy divides the string that we are trying to parse into two parts: an undigested part and a semi-digested part. The undigested part contains the tokens that are still to come in the input, and the semi-digested part is put on a stack. If parsing the string v, it starts out completely undigested, so the input is initialized to v and the stack is initialized to empty. A shift-reduce parser proceeds by taking one of three actions at each step:

○ Reduce: If we can find a rule A -> w, and if the contents of the stack are qw for some q (q may be empty), then we can reduce the stack to qA. We are applying the production for the nonterminal A backwards. There is also one special case: reducing the entire contents of the stack to the start symbol with no remaining input means we have recognized the input as a valid sentence (e.g., the stack contains just w, the input is empty, and we apply S -> w). This is the last step in a successful parse. The w being reduced is referred to as a handle.

○ Shift: If it is impossible to perform a reduction and there are tokens remaining in the undigested input, then we transfer a token from the input onto the stack. This is called a shift. For example, using the grammar above, suppose the stack contained ( and the input contained id+id). It is impossible to perform a reduction on ( since it does not match the entire right side of any of our productions. So, we shift the first character of the input onto the stack, giving us (id on the stack and +id) remaining in the input.

○ Error: If neither of the two above cases applies, we have an error. If the sequence on the stack does not match the right-hand side of any production, we cannot reduce. And if shifting the next input token would create a sequence on the stack that cannot eventually be reduced to the start symbol, a shift action would be futile. Thus, we have hit a dead end where the next token conclusively determines that the input cannot form a valid sentence. This would happen in the above grammar on the input id+). The first id would be shifted, then reduced to T and again to E, next + is shifted. At this point, the stack contains E+ and the next input token is ). The sequence on the stack cannot be reduced, and shifting the ) would create a sequence that is not viable, so we have an error.
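The three actions can be sketched for this grammar as below. This is our own simplified illustration: it reduces greedily, preferring longer right-hand sides, and applies S -> E only when the stack holds exactly E and the input is exhausted; a real shift-reduce parser would consult an action table instead of this heuristic.

```python
# A minimal shift-reduce sketch for the grammar
# S -> E, E -> T | E + T, T -> id | (E).

RULES = [                    # (LHS, RHS), longest RHS first
    ("E", ["E", "+", "T"]),
    ("T", ["(", "E", ")"]),
    ("T", ["id"]),
    ("E", ["T"]),
]

def shift_reduce(tokens):
    stack, rest = [], list(tokens)
    used = []
    while True:
        for lhs, rhs in RULES:                      # Reduce if a handle matches
            if stack[-len(rhs):] == rhs:
                del stack[-len(rhs):]
                stack.append(lhs)
                used.append(f"{lhs} -> {' '.join(rhs)}")
                break
        else:
            if rest:                                # otherwise Shift
                stack.append(rest.pop(0))
            elif stack == ["E"]:                    # final reduction S -> E
                used.append("S -> E")
                return used
            else:
                raise SyntaxError("dead end")       # Error action

print(shift_reduce(["id", "+", "id"])[-1])  # S -> E
```

On the erroneous input id+) from the text, the parser shifts and reduces until the stack holds E + ) with nothing left to do, and the Error branch fires.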
The general idea is to read tokens from the input and push them onto the stack attempting to build sequences that we recognize as the right side of a production. When we find a match, we replace that sequence with the nonterminal from the left side and continue working our way up the parse tree. This process builds the parse tree from the leaves upward, the inverse of the top-down parser. If all goes well, we will end up moving everything from the input to the stack and eventually construct a sequence on the stack that we recognize as a right-hand side for the start symbol. Example : Grammar :
Input :
Another example :
Another Example: E -> E + E | E * E | ( E ) | a | b | c
Conflicts in shift-reduce parsing : ambiguous grammars lead to parsing conflicts. Conflicts can be fixed by rewriting the grammar, or by making a decision during parsing.

• shift/reduce (SR) conflicts : choose between a reduce and a shift action (e.g. the dangling else)
S -> if E then S | if E then S else S | ......
• reduce/reduce (RR) conflicts : choose between two reductions
LR Parsing : a table-driven shift-reduce parser

LR parsers ("L" for left-to-right scan of the input, "R" for rightmost derivation) are efficient, table-driven shift-reduce parsers. The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars that can be parsed with predictive LL parsers. In fact, virtually all programming-language constructs for which CFGs can be written can be parsed with LR techniques. As an added advantage, there is no need for lots of grammar rearrangement to make it acceptable
for LR parsing the way that LL parsing requires. The primary disadvantage is the amount of work it takes to build the tables by hand, which makes it infeasible to hand-code an LR parser for most grammars. Fortunately, there are LR parser generators that create the parser from an unambiguous CFG specification. The parser tool does all the tedious and complex work to build the necessary tables and can report any ambiguities or language constructs that interfere with the ability to parse the language using LR techniques.

Rather than reading and shifting tokens onto a stack, an LR parser pushes "states" onto the stack; these states describe what is on the stack so far. An LR parser uses two tables:

1. The action table : Action[s,a] tells the parser what to do when the state on top of the stack is s and terminal a is the next input token. The possible actions are to shift a state onto the stack, to reduce the handle on top of the stack, to accept the input, or to report an error.

2. The goto table : Goto[s,X] indicates the new state to place on top of the stack after a reduction of the nonterminal X while state s is on top of the stack.

LR Parser Types

There are three types of LR parsers: LR(k), simple LR(k), and lookahead LR(k) (abbreviated LR(k), SLR(k), LALR(k)). The k identifies the number of tokens of lookahead. We will usually only concern ourselves with 0 or 1 tokens of lookahead, but the techniques do generalize to k > 1. Here are some widely used LR parsers based on the value of k:

○ LR(0) - no lookahead symbol
○ SLR(1) - simple, with one lookahead symbol
○ LALR(1) - lookahead bottom-up; not as powerful as full LR(1) but simpler to
implement. YACC deals with this kind of grammar.
○ LR(1) - the most general, but the most complex to implement.

LR(0) is the simplest of all the LR parsing methods. It is also the weakest, and although of theoretical importance, it is not used much in practice because of its limitations. LR(0) parses without using any lookahead at all. Adding just one token of lookahead to get LR(1) vastly increases the parsing power. Very few grammars can be parsed with LR(0), but most
unambiguous CFGs can be parsed with LR(1). The drawback of adding the lookahead is that the algorithm becomes somewhat more complex and the parsing table gets much, much bigger. The full LR(1) parsing table for a typical programming language has many thousands of states, compared to the few hundred needed for LR(0). A compromise in the middle is found in the two variants SLR(1) and LALR(1), which also use one token of lookahead but employ techniques to keep the table as small as LR(0)'s. SLR(k) is an improvement over LR(0) but much weaker than full LR(k) in terms of the number of grammars for which it is applicable. LALR(k) parses a larger set of languages than SLR(k) but not quite as many as LR(k). LALR(1) is the method used by the yacc parser generator.

• Precedence parsers
   Simple precedence parser
   Operator-precedence parser
   Extended precedence parser

Questions from the previous papers

1. Explain the function of the software tool YACC.
2. With the help of a diagram explain the process of parsing. What is the output generated after the parsing process?
3. What is semantic analysis? Explain the semantic analysis of an arithmetic expression with an example.
4. Explain the principle of lex and yacc. How do they communicate with each other?
   OR
5. Explain the utility of Lex and Yacc in the construction of a compiler.
6. Explain the different phases of a compiler.
7. What is a cross compiler?
8. What is a bootstrap compiler?
   Bootstrapping is a term used in computer science to describe the techniques involved in writing a compiler (or assembler) in the target programming language which it is intended to compile. A compiler bootstrapped in this way has two benefits: improvements to the compiler's back-end improve not only general-purpose
programs but also the compiler itself, and it is a comprehensive consistency check, since the compiler should be able to reproduce its own object code. Earlier versions are written for a subset of the language; the compiler then compiles itself and is incrementally completed.

9. Explain lexical analysis. Which tool can be used to generate the lexical analyzer?
Explain a bit about the tool.
10. Explain the various tasks performed during lexical analysis. Also explain the relevance of
regular expressions in lexical analysis.
11. What is a context-free grammar? Write down a CFG for the for loop of the 'C' language.
12. What is a symbol table?
   An essential function of a compiler is to record the identifiers and the relevant information about their attributes: type, scope, and, in the case of a procedure or function, its name, arguments, and return type. A symbol table is a table containing a record for each identifier, with fields for the attributes of the identifier. This table is used by all the phases of the compiler, to access the data as well as to report errors.
13. Generate parse trees for the following sentences based on the standard arithmetic CFG:
   a - b * c
   a + b * c - d / (e * f)
   a + b * c - d + e - f / (g + h)
   a + b * c / d + e - f
   a / b + c * d + e - f
   9 * 7 + 5 - 2
   Use the following grammar if none is given:
   E => E + T | E - T | T
   T => T * V | T / V | V
   V => id | (E)

a - b * c
First of all find a starting prediction. (Always start from the lowest-precedence operator in the input string.)

E => E - T
  => T - T
  => V - T
  => a - T
  => a - T * V
  => a - V * V
  => a - b * V
  => a - b * c

a + b * c - d / (e * f)
(two operators at the lowest precedence, + and -; choose the one at the rightmost side, i.e. -)

E => E - T
  => E + T - T
  => T + T - T
  => V + T - T
  => a + T - T
  => a + T * V - T
  => a + V * V - T
  => a + b * V - T
  => a + b * c - T
  => a + b * c - T / V
  => a + b * c - V / V
  => a + b * c - d / V
  => a + b * c - d / (E)
  => a + b * c - d / (T)
  => a + b * c - d / (T * V)
  => a + b * c - d / (V * V)
  => a + b * c - d / (e * V)
  => a + b * c - d / (e * f)