We dedicate this work to the great persons throughout history who gave everything to their nations and asked for nothing in return, and to every person around this world who searches for pure love and pure peace among civilizations, believing that this love can save the world from its narrow disputes.
SE4 compiler design workgroup
SE4 Compiler Design Documentation
Chapter One: Introduction
1.1 Preface
1.2 Project Proposal
1.3 The Plan
1.4 Setting Milestones
1.5 Report Overview

Chapter Two: Compiler Analysis
2.1 What Is a Compiler
2.2 History of the Compiler
2.3 Compiler vs. Interpreter
2.4 Compiler Phases
  2.4.1 Front End Phases
  2.4.2 Back End Phases
2.5 Object Oriented Development
2.6 Compilation Processes
  2.6.1 Lexical Analyzer Phase
  2.6.2 Syntax Analyzer Phase
    2.6.2.1 Context Free Grammar (CFG)
    2.6.2.2 Derivation and the Parse Tree
  2.6.3 Semantic Analyzer Phase
    2.6.3.1 Types and Declarations
    2.6.3.2 Type Checking
    2.6.3.3 Scope Checking
  2.6.4 Intermediate Code Generator Phase
  2.6.5 Code Optimizer Phase
    2.6.5.1 Types of Optimizations
  2.6.6 Code Generator Phase
  2.6.7 Error Handling Methods
    2.6.7.1 Recovery from Lexical Phase Errors
    2.6.7.2 Recovery from Syntax Phase Errors
  2.6.8 Symbol Table & Symbol Table Manager

Chapter Three: Compiler Design
3.1 Compiler Design
3.2 The Context of the System
  3.2.1 Compiler Architectural Design
    3.2.1.1 Compiler Structuring
      3.2.1.1.1 Compiler Organization
    3.2.1.2 Compiler Control Modeling
    3.2.1.3 Compiler Modular Decomposition
    3.2.1.4 Compiler Domain Specific Architectures
3.3 Compiler Use Case Models
  3.3.1 Lexical Analyzer Use Case Diagram
  3.3.2 Syntax Analyzer Use Case Diagram
  3.3.3 Semantic Analyzer Use Case Diagram
  3.3.4 I.R Code Generator Use Case Diagram
  3.3.5 I.R Code Optimizer Use Case Diagram
  3.3.6 Code Generator Use Case Diagram
  3.3.7 Symbol Table Manager Use Case Diagram
  3.3.8 Error Handler Use Case Diagram
3.4 Compiler's Subsystems Activity Diagrams
  3.4.1 Lexical Analyzer Activity Diagram
  3.4.2 Syntax Analyzer Activity Diagram
  3.4.3 Semantic Analyzer Activity Diagram
  3.4.4 I.R Code Generator Activity Diagram
  3.4.5 I.R Code Optimizer Activity Diagram
  3.4.6 Code Generator Activity Diagram
  3.4.7 Symbol Table Manager Activity Diagram
  3.4.8 Error Handler Activity Diagram
3.5 Compiler Class Diagram
3.6 Compiler Sequence Diagrams
  3.6.1 Lexical Analyzing Sequence Diagram
  3.6.2 Syntax Analyzing Sequence Diagram
  3.6.3 Semantic Analyzing Sequence Diagram
  3.6.4 I.R Code Generation Sequence Diagram
  3.6.5 I.R Code Optimizing Sequence Diagram
  3.6.6 Code Generation Sequence Diagram
  3.6.7 Additional Sequence Diagram

Glossary
Bibliography
Chapter One
Introduction
1.1 Preface
Compiler design and construction is an exercise in engineering design. The compiler writer must choose a path through a decision space that is filled with diverse alternatives, each with distinct costs, advantages, and complexity. Each decision has an impact on the resulting compiler, and the quality of the end product depends on informed decisions at each step of the way. This project, a short study, tries to explore the design space: to present some of the ways used to design and construct a compiler, side by side with the principles of the software engineering phases. One advantage of this report, which we did not find in the books that illustrate compiler construction, is the set of UML (Unified Modeling Language) diagrams that illustrate the compiler development process from many sides and also describe the smaller procedures that make up the compiler's subsystems and the communication between them that leads to producing the compiler's target file.
1.2 Project Proposal
In the early days, the approach taken to compiler design used to be directly affected by the complexity of the processing, the experience of the person(s) designing it, and the resources available. A compiler for a relatively simple language written by one person might be a single, monolithic piece of software. When the source language is large and complex, and high quality output is required, the design may be split into a number of relatively independent phases. Having separate phases means development can be parceled up into small parts and given to different people. It also becomes much easier to replace a single phase by an improved one, or to insert new phases later (e.g., additional optimizations). So our project is to design a small functional compiler by taking the above hints into consideration, implementing the software design principles in our work and establishing it step by step until the end of the development.
1.3 The Plan
We planned to carry out the common tasks that every software developer performs in his/her work plan: gathering information, analyzing this information, designing the basic diagrams, and implementing the project.
1.4 Setting Milestones
In order to establish a sense of progress towards the goals of the project, certain milestones have been set:
• Implementation of the object oriented strategy in the analysis phase of the development.
• Implementation of the object oriented strategy in the design phase of the development.
• Implementation of the UML (Unified Modeling Language) diagrams in the design phase, as follows:
  o Implementation of the use case diagrams.
  o Implementation of the activity diagrams.
  o Implementation of the class diagrams and the relations between these classes.
  o Implementation of the sequence diagrams.
1.5 Report Overview
Chapter 1 gives a brief overview of what will follow in the report.
Chapter 2 (Compiler Analysis) describes why we use the object oriented technique in the analysis phase and how a compiler is produced, by illustrating the theoretical concepts and principles behind the compiler phases.
Chapter 3 (Compiler Design) describes the main design aspects of the project, commencing with the requirements analysis and concluding with the specific design of the project parts, and gives the main diagrams produced during the design phase to simplify the implementation work done by the programmers.
Chapter Two
Compiler Analysis
2.1 What Is a Compiler
A compiler is a special type of computer program that translates a human readable text file into a form that the computer can more easily understand. At its most basic level, a computer can only understand two things, a 1 and a 0. At this level, a human will operate very slowly and find the information contained in the long string of 1s and 0s incomprehensible. A compiler is a computer program that bridges this gap.

In the beginning, compilers were very simple programs that could only translate symbols into the bits, the 1s and 0s, the computer understood. Programs were also very simple, composed of a series of steps that were originally translated by hand into data the computer could understand. This was a very time consuming task, so portions of it were automated or programmed, and the first compiler was written. This program assembled, or compiled, the steps required to execute the step by step program. These simple compilers were used to write a more sophisticated compiler. With the newer version, more rules could be added to the compiler program to allow a more natural language structure for the human programmer to work with. This made writing programs easier and allowed more people to begin writing programs. As more people started writing programs, more ideas about writing programs were offered and used to make more sophisticated compilers. In this way, compiler programs continue to evolve, improve and become easier to use.

Compiler programs can also be specialized. Certain language structures are better suited for a particular task than others, so specific compilers were developed for specific tasks or languages. Some compilers are multistage or multiple pass. A first pass could take a very natural language and make it closer to a computer understandable language. A second or even a third pass could take it to the final stage, the executable file. The intermediate output in a multistage compiler is usually called pseudo-code, since it is not usable by the computer. Pseudo-code is very structured, like a computer program, not free flowing and verbose like a more natural language. The final output is called the executable file, since it is what is actually executed or run by the computer.

Splitting the task up like this made it easier to write more sophisticated compilers, as each sub task is different. It also made it easier for the computer to point out where it had trouble understanding what it was being asked to do. Errors that limit the compiler in understanding a program are called syntax errors. Errors in the way the program functions are called logic errors. Logic errors are much harder to spot and correct. Syntax errors are like spelling mistakes, whereas logic errors are a bit more like grammatical errors.

Cross compiler programs have also been developed. A cross compiler allows a text file set of instructions that is written for one computer designed by a specific manufacturer to be compiled and run for a different computer by a different manufacturer. For example, a program that was written to run on an Intel computer can sometimes be cross compiled to run on a computer developed by Motorola. This frequently does not work very well. At the level at which computer programs operate, the computer hardware can look very different, even if the machines look similar to you.
Cross compilation is different from having one computer emulate another computer. If a computer is emulating a different computer, it is pretending to be that other computer. Emulation is frequently slower than cross compilation, since two programs are running at once: the program that is pretending to be the other computer and the program that is actually being run. However, for cross compilation to work, you need both the original source text that describes the program and a target computer that is sufficiently similar to the original computer for the program to function on it. This is not always possible, so both techniques are in use.
2.2 History of the Compiler
Software for early computers was primarily written in assembly language for many years. Higher level programming languages were not invented until the benefits of being able to reuse software on different kinds of CPUs started to become significantly greater than the cost of writing a compiler. The very limited memory capacity of early computers also created many technical problems when implementing a compiler.

Towards the end of the 1950s, machine-independent programming languages were first proposed. Subsequently, several experimental compilers were developed. The first compiler was written by Grace Hopper, in 1952, for the A-0 programming language. The FORTRAN team led by John Backus at IBM is generally credited with having introduced the first complete compiler, in 1957. COBOL was an early language to be compiled on multiple architectures, in 1960.

In many application domains the idea of using a higher level language quickly caught on. Because of the expanding functionality supported by newer programming languages and the increasing complexity of computer architectures, compilers have become more and more complex. Early compilers were written in assembly language. The first self-hosting compiler, capable of compiling its own source code in a high-level language, was created for Lisp by Tim Hart and Mike Levin at MIT in 1962. Since the 1970s it has become common practice to implement a compiler in the language it compiles, although both Pascal and C have been popular choices for the implementation language. Building a self-hosting compiler is a bootstrapping problem: the first such compiler for a language must be compiled either by a compiler written in a different language, or (as in Hart and Levin's Lisp compiler) compiled by running the compiler in an interpreter.
2.3 Compiler vs. Interpreter
We usually prefer to write computer programs in languages we understand rather than in machine language, but the processor can only understand machine language. So we need a way of converting our instructions (source code) into machine language. This is done by an interpreter or a compiler.

An interpreter reads the source code one instruction or line at a time, converts this line into machine code and executes it. The machine code is then discarded and the next line is read. The advantage of this is that it is simple and you can interrupt it while it is running, change the program and either continue or start again. The disadvantage is that every line has to be translated every time it is executed, even if it is executed many times as the program runs. Because of this, interpreters tend to be slow. Examples of interpreters are Basic on older home computers, script interpreters such as JavaScript, and languages such as Lisp and Forth.

A compiler reads the whole source code and translates it into a complete machine code program to perform the required tasks, which is output as a new file. This completely separates the source code from the executable file. The biggest advantage of this is that the translation is done once only and as a separate process. The program that is run is already translated into machine code so it is much faster in execution. The disadvantage is that you cannot change the program without going back to the original source code, editing
that and recompiling (though for a professional software developer this is more of an advantage because it stops source code being copied). Current examples of compilers are Visual Basic, C, C++, C#, Fortran, Cobol, Ada, Pascal and so on.
You will sometimes see reference to a third type of translation program: an assembler. This is like a compiler, but works at a much lower level, where one source code line usually translates directly into one machine code instruction. Assemblers are normally used only by people who want to squeeze the last bit of performance out of a processor by working at machine code level.
2.4 Compiler Phases
In the early days, the approach taken to compiler design used to be directly affected by the complexity of the processing, the experience of the person(s) designing it, and the resources available. A compiler for a relatively simple language written by one person might be a single, monolithic piece of software. When the source language is large and complex, and high quality output is required, the design may be split into a number of relatively independent phases. Having separate phases means development can be parceled up into small parts and given to different people. It also becomes much easier to replace a single phase by an improved one, or to insert new phases later (e.g., additional optimizations).
2.4.1 Front End Phases
The front end analyzes the source code to build an internal representation of the program, called the intermediate representation or IR. It also manages the symbol table, a data structure mapping each symbol in the source code to associated information such as location, type and scope. This is done over several phases, which include some of the following:
1. Line reconstruction. Languages which strop their keywords or allow arbitrary spaces within identifiers require a phase before parsing, which converts the input character sequence to a canonical form ready for the parser. The top-down, recursive-descent, table-driven parsers used in the 1960s typically read the source one character at a time and did not require a separate tokenizing phase. Atlas Autocode and Imp (and some implementations of Algol and Coral66) are examples of stropped languages whose compilers would have a line reconstruction phase.
2. Lexical analysis breaks the source code text into small pieces called tokens. Each token is a single atomic unit of the language, for instance a keyword, identifier or symbol name. The token syntax is typically a regular language, so a finite state automaton constructed from a regular expression can be
used to recognize it. This phase is also called lexing or scanning, and the software doing lexical analysis is called a lexical analyzer or scanner.
3. Preprocessing. Some languages, e.g., C, require a preprocessing phase which supports macro substitution and conditional compilation. Typically the preprocessing phase occurs before syntactic or semantic analysis; e.g., in the case of C, the preprocessor manipulates lexical tokens rather than syntactic forms. However, some languages such as Scheme support macro substitutions based on syntactic forms.
4. Syntax analysis involves parsing the token sequence to identify the syntactic structure of the program. This phase typically builds a parse tree, which replaces the linear sequence of tokens with a tree structure built according to the rules of a formal grammar which define the language's syntax. The parse tree is often analyzed, augmented, and transformed by later phases in the compiler.
5. Semantic analysis is the phase in which the compiler adds semantic information to the parse tree and builds the symbol table. This phase performs semantic checks such as type checking (checking for type errors), object binding (associating variable and function references with their definitions), or definite assignment (requiring all local variables to be initialized before use), rejecting incorrect programs or issuing warnings. Semantic analysis usually requires a complete parse tree, meaning that this phase logically follows the parsing phase and logically precedes the code generation phase, though it is often possible to fold multiple phases into one pass over the code in a compiler implementation.
2.4.2 Back End Phases
The term back end is sometimes confused with code generator because of the overlapping functionality of generating assembly code. Some literature uses middle end to distinguish the generic analysis and optimization phases in the back end from the machine-dependent code generators. The main phases of the back end include the following:
1. Analysis: this is the gathering of program information from the intermediate representation derived from the input. Typical analyses are data flow analysis to build use-define chains, dependence analysis, alias analysis, pointer analysis, escape analysis, etc. Accurate analysis is the basis for any compiler optimization. The call graph and control flow graph are usually also built during the analysis phase.
2. Optimization: the intermediate language representation is transformed into functionally equivalent but faster (or smaller) forms. Popular optimizations are inline expansion, dead code elimination, constant propagation, loop transformation, register allocation and even automatic parallelization.
3. Code generation: the transformed intermediate language is translated into the output language, usually the native machine language of the system. This involves resource and storage decisions, such as deciding which variables to fit into registers and memory, and the selection and scheduling of appropriate machine instructions along with their associated addressing modes (see also the Sethi-Ullman algorithm).
2.5 Object Oriented Development
Object-oriented strategy is a development strategy where system analysts, designers and programmers think in terms of 'things' instead of operations or functions. The executing system is made up of interacting objects that maintain their own local state and provide operations on that state information. They hide information about the representation of the state and hence limit access to it. Object-oriented development can be briefly described by the following points:
• Object-oriented analysis is concerned with developing an object-oriented model of the application domain. The identified objects reflect entities and operations that are associated with the problem to be solved. In our project we use this stage of development to analyze our compiler, because we believe that object-oriented analysis is the best way to collect our data from previous compilers and from the reference books that discuss compilers; a compiler is a software system, and the analysis stage of our project could not have been completed correctly with another strategy. After this short illustration of the object-oriented development process, we will give the full details of the terminology used in our work and illustrate each item carefully, because this is the baseline information on which we can start our compiler design.
• Object-oriented design is concerned with developing an object-oriented model of a software system to implement the identified requirements. The objects in an object-oriented design are related to the solution of the problem that is being solved. There may be close relationships between some problem objects and some solution objects, but the designer inevitably has to add new objects and to transform problem objects to implement the solution.
• Object-oriented programming is concerned with realizing a software design using an object-oriented programming language. An object-oriented programming language, such as Java, supports the direct implementation of objects and provides facilities to define object classes.
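To make the idea of object classes concrete, the short Java sketch below models one compiler entity, a token, as an object that keeps its own local state, hides its representation, and exposes operations on that state. The class name, fields and token kinds are our own illustrative assumptions, not part of the SE4 design itself.

```java
// A minimal sketch of modeling a compiler entity as an object.
// The names and token kinds are illustrative assumptions only.
public class Token {
    public enum Type { IDENTIFIER, NUMBER, OPERATOR, KEYWORD, PUNCTUATION }

    private final Type type;      // the token class this lexeme belongs to
    private final String lexeme;  // the actual character sequence from the source
    private final int line;       // source line, kept for error reporting

    public Token(Type type, String lexeme, int line) {
        this.type = type;
        this.lexeme = lexeme;
        this.line = line;
    }

    public Type getType()     { return type; }
    public String getLexeme() { return lexeme; }
    public int getLine()      { return line; }

    @Override
    public String toString() { return type + "('" + lexeme + "', line " + line + ")"; }
}
```

The state is private and only reachable through the accessor operations, which is exactly the information hiding described above.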
2.6 Compilation Processes
Compilation refers to the compiler's process of translating a high-level language program into a low-level language program. This process is very complex; hence, from the logical as well as an implementation point of view, it is customary to partition the compilation process into several phases, which are nothing more than logically cohesive operations that input one representation of a source program and output another representation. A typical compilation, broken down into phases, is shown in Figure 2.1.

Figure 2.1: The compilation phases (Source Code → Lexical Analyzer Phase → Syntax Analyzer Phase → Semantic Analyzer Phase → Intermediate Code Generator Phase → Code Optimizer Phase → Code Generator Phase → Object Code)

2.6.1 Lexical Analyzer Phase
The lexical analyzer is the part of the compiler in which the stream of characters making up the source program is read from left to right and grouped into tokens. Tokens are sequences of characters with a collective meaning. There are usually only a small number of token kinds for a programming language: constants (integer, double, char, string, etc.), operators (arithmetic, relational, logical), punctuation, and reserved words. A simple example of a lexical analyzer is shown in Figure 2.2.

Figure 2.2

The lexical analyzer takes a source program as input and produces a stream of tokens as output; this process is called tokenization. The lexical analyzer might recognize particular instances of tokens such as 3 or 255 for an integer constant token, "Fred" or "Wilma" for a string constant token, numTickets or queue for a variable token. Such specific instances are called lexemes. A lexeme is the actual character sequence forming a token, while the token is the general class that a lexeme belongs to. Some tokens have exactly one lexeme (e.g., the > character); for others, there are many lexemes (e.g., integer constants).
2.6.1.1 Specification of Tokens
Regular expression notation can be used for the specification of tokens because tokens constitute a regular set. It is compact and precise, and a deterministic finite automaton (DFA) that accepts the language specified by the regular expression can be constructed from it. The DFA is used to recognize the language specified by the regular expression notation, making the automatic construction of a recognizer of tokens possible. Therefore, the study of regular expression notation and finite automata becomes necessary. A simple regular expression used to separate the different kinds of tokens is shown in Figure 2.3.
Figure 2.3
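As an illustration of how a regular-expression specification can drive a scanner, the sketch below uses Java's java.util.regex package to break an input string into tokens. The token classes and patterns chosen here (numbers, identifiers, operators, punctuation, whitespace) are assumptions made for this example; they are not the actual specification of Figure 2.3.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleScanner {
    // One named alternative per token class: the union of these regular
    // expressions is the lexical specification used by this sketch.
    private static final Pattern TOKEN_PATTERN = Pattern.compile(
            "(?<NUMBER>\\d+)"
          + "|(?<IDENT>[A-Za-z_][A-Za-z_0-9]*)"
          + "|(?<OP>[+\\-*/=<>])"
          + "|(?<PUNCT>[();,])"
          + "|(?<WS>\\s+)");

    public static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN_PATTERN.matcher(source);
        int pos = 0;
        while (m.find()) {
            if (m.start() != pos) {              // a gap means characters matched no token class
                throw new IllegalArgumentException("Lexical error at position " + pos);
            }
            if (m.group("WS") == null) {         // skip whitespace, keep real lexemes
                String kind = m.group("NUMBER") != null ? "NUMBER"
                            : m.group("IDENT")  != null ? "IDENT"
                            : m.group("OP")     != null ? "OP" : "PUNCT";
                tokens.add(kind + ":" + m.group());
            }
            pos = m.end();
        }
        if (pos != source.length()) {
            throw new IllegalArgumentException("Lexical error at position " + pos);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("count = count + 1;"));
        // [IDENT:count, OP:=, IDENT:count, OP:+, NUMBER:1, PUNCT:;]
    }
}
```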
2.6.2 Syntax Analyzer Phase
In the syntax-analysis phase, a compiler verifies whether or not the tokens generated by the lexical analyzer are grouped according to the syntactic rules of the language. If the tokens in a string are grouped according to the language's rules of syntax, then the string of tokens generated by the lexical analyzer is accepted as a valid construct of the language; otherwise, an error handler is called. Hence, two issues are involved when designing the syntax-analysis phase of a compilation process:
1. All valid constructs of a programming language must be specified; and by using these specifications, a valid program is formed. That is, we form a specification of what tokens the lexical analyzer will return, and we specify in what manner these tokens are to be grouped so that the result of the grouping will be a valid construct of the language.
2. A suitable recognizer will be designed to recognize whether a string of tokens generated by the lexical analyzer is a valid construct or not.
Therefore, suitable notation must be used to specify the constructs of a language. The notation for the construct specifications should be compact, precise, and easy to understand. The syntax-structure
specification for the programming language (i.e., the valid constructs of the language) uses context-free grammar (CFG), because a regular expression is not powerful enough to represent languages which require parenthesis matching to arbitrary depths. Moreover, for certain classes of grammar, we can automatically construct an efficient parser that determines whether a source program is syntactically correct. Hence, CFG notation is a required topic of study.
2.6.2.1 Context Free Grammar (CFG)
CFG notation specifies a context-free language that consists of terminals, nonterminals, a start symbol, and productions. The terminals are nothing more than tokens of the language, used to form the language constructs. Nonterminals are the variables that denote a set of strings. For example, S and E are nonterminals that denote statement strings and expression strings, respectively, in a typical programming language. The nonterminals define the sets of strings that are used to define the language generated by the grammar. They also impose a hierarchical structure on the language, which is useful for both syntax analysis and translation. Grammar productions specify the manner in which the terminals and the string sets defined by the nonterminals can be combined to form a set of strings defined by a particular nonterminal. For example, consider the production S → aSb, shown in Figure 2.4. This production specifies that the set of strings defined by the nonterminal S is obtained by concatenating terminal a with any string belonging to the set of strings defined by nonterminal S, and then with terminal b. Each production consists of a nonterminal on the left-hand side, and a string of terminals and nonterminals on the right-hand side. The left-hand side of a production is separated from the right-hand side by the "→" symbol, which is used to identify a relation on the set (V ∪ T)*.

Figure 2.4

Therefore, a context-free grammar is a four-tuple denoted as G = (V, T, P, S), where:
1. V is a finite set of symbols called nonterminals or variables,
2. T is a set of symbols called terminals,
3. P is a set of productions, and
4. S is a member of V, called the start symbol.
2.6.2.2 Derivation and the Parse Tree
When deriving a string w from S, if every derivation is considered to be a step in the tree construction, then we get the graphical display of the derivation of string w as a tree. This is called a "derivation tree" or a
"parse tree" of string w. Therefore, a derivation tree or parse tree is the display of the derivations as a tree. Note that a tree is a derivation tree if it satisfies the following requirements: 1. All the leaf nodes of the tree are labeled by terminals of the grammar. 2. The root node of the tree is labeled by the start symbol of the grammar. 3. The interior nodes are labeled by the nonterminals. 4. If an interior node has a label A, and it has n descendents with labels X1, X2, …, Xn from lef to right, then the production rule A → X1 X2 X3 …… Xn must exist in the grammar. For example, consider a grammar whose list of productions is: E→ E+E E→ E*E E→ id The tree shown in Figure 1.5 is a derivation tree for a string id + id * id.
Figure 2.5
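To show how such a grammar can be recognized in code, the following sketch is a small recursive-descent recognizer for an unambiguous, non-left-recursive variant of the expression grammar above (E → T ('+' T)*, T → F ('*' F)*, F → id), which is an assumption we make purely for illustration. It only checks syntactic validity; building the actual parse-tree nodes is omitted.

```java
// Minimal recursive-descent recognizer for: E -> T ('+' T)*, T -> F ('*' F)*, F -> 'id'
// Illustrative sketch only; a real syntax analyzer would build parse-tree nodes.
public class ExprParser {
    private final String[] tokens;   // e.g. ["id", "+", "id", "*", "id"]
    private int pos = 0;

    public ExprParser(String[] tokens) { this.tokens = tokens; }

    private String peek() { return pos < tokens.length ? tokens[pos] : "<eof>"; }

    private void expect(String t) {
        if (!peek().equals(t))
            throw new RuntimeException("Syntax error: expected " + t + " but found " + peek());
        pos++;
    }

    private void parseE() { parseT(); while (peek().equals("+")) { pos++; parseT(); } }
    private void parseT() { parseF(); while (peek().equals("*")) { pos++; parseF(); } }
    private void parseF() { expect("id"); }

    public boolean parse() {
        parseE();
        return pos == tokens.length;   // accept only if every token was consumed
    }

    public static void main(String[] args) {
        System.out.println(new ExprParser(new String[]{"id", "+", "id", "*", "id"}).parse()); // true
    }
}
```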
2.6.3 Semantic Analyzer Phase
After completing the lexical and syntax functions without errors, we move forward to the semantic analyzer, where we delve even deeper to check whether the constructs form a sensible set of instructions in the programming language. Whereas any old noun phrase followed by some verb phrase makes a syntactically correct English sentence, a semantically correct one has subject-verb agreement, proper use of gender, and components that go together to express an idea that makes sense. For a program to be semantically valid, all variables, functions, classes, etc. must be properly defined, expressions and variables must be used in ways that respect the type system, access control must be respected, and so forth. Semantic analysis is the front end's penultimate phase and the compiler's last chance to weed out incorrect programs. We need to ensure the program is sound enough to carry on to code generation.

A large part of semantic analysis consists of tracking variable/function/type declarations and type checking. In many languages, identifiers have to be declared before they are used. As the compiler encounters a new declaration, it records the type information assigned to that identifier. Then, as it continues examining the rest of the program, it verifies that the type of an identifier is respected in terms of the operations being performed. For example, the type of the right side expression of an assignment statement should match the type of the left side, and the left side needs to be a properly declared and assignable identifier. The parameters of a function should match the arguments of a function call in both number and type. The language may require that identifiers be unique, thereby forbidding two global declarations from sharing the same name. Arithmetic operands will need to be of numeric type, perhaps even the exact same type (no automatic int-to-double conversion, for instance). These are examples of the things checked in the semantic analyzer phase. Below we describe some of the functions performed by the semantic analyzer, guided by the context sensitive grammar (CSG).
2.6.3.1 Types and Declarations
We begin with some basic definitions to set the stage for performing semantic analysis. A type is a set of values and a set of operations operating on those values. There are three categories of types in most programming languages:
• Base types: int, float, double, char, bool, etc. These are the primitive types provided directly by the underlying hardware. There may be a facility for user-defined variants on the base types (such as C enums).
• Compound types: arrays, pointers, records, structs, unions, classes, and so on. These types are constructed as aggregations of the base types and simple compound types.
• Complex types: lists, stacks, queues, trees, heaps, tables, etc. You may recognize these as abstract data types. A language may or may not have support for this sort of higher-level abstraction.
In many languages, a programmer must first establish the name and type of any data object (e.g., variable, function, type, etc.). In addition, the programmer usually defines the lifetime. A declaration is a statement in a program that communicates this information to the compiler. The basic declaration is just a name and type, but in many languages it may include modifiers that control visibility and lifetime (i.e., static in C, private in Java). Some languages also allow declarations to initialize variables, such as in C, where you can declare and initialize in one statement.
2.6.3.2 Type Checking
Type checking is the process of verifying that each operation executed in a program respects the type system of the language. This generally means that all operands in any expression are of appropriate types and number. Much of what we do in the semantic analyzer phase is type checking. Sometimes the rules regarding operations are defined by other parts of the code (as in function prototypes), and sometimes such rules are a part of the definition of the language itself (as in "both operands of a binary arithmetic operation must be of the same type"). If a problem is found, e.g., one tries to add a char pointer to a double in C, we encounter a type error. A language is considered strongly typed if each and every type error is detected during compilation. Type checking can be done during compilation, during execution, or divided across both.
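A hedged sketch of the kind of rule described above follows: verifying that the two operands of a binary arithmetic operator have the same numeric type. The Type enum, the method shape and the error wording are assumptions made for this illustration; they are not part of any particular language definition.

```java
// Illustrative sketch of one semantic check: both operands of a binary
// arithmetic operator must have the same numeric type (no implicit conversion).
public class TypeChecker {
    enum Type { INT, DOUBLE, BOOL, ERROR }

    static Type checkArith(Type left, String op, Type right, int line) {
        boolean numeric = (left == Type.INT || left == Type.DOUBLE)
                       && (right == Type.INT || right == Type.DOUBLE);
        if (!numeric || left != right) {
            System.err.println("line " + line + ": operands of '" + op
                    + "' must be of the same numeric type (found " + left + " and " + right + ")");
            return Type.ERROR;          // propagate an error type instead of stopping compilation
        }
        return left;                    // the result type of the expression
    }

    public static void main(String[] args) {
        System.out.println(checkArith(Type.INT, "+", Type.INT, 3));     // INT
        System.out.println(checkArith(Type.INT, "*", Type.DOUBLE, 7));  // error reported, ERROR
    }
}
```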
2.6.3.3 Scope Checking
To understand how this is handled in a compiler, we need a few definitions. A scope is a section of program text enclosed by basic program delimiters, e.g., {} in C, or begin-end in Pascal. Many languages allow nested scopes, that is, scopes defined within other scopes. The scope defined by the innermost such unit is called the current scope. The scopes defined by the current scope and by any enclosing program units are known as open scopes. Any other scope is a closed scope. As we encounter identifiers in a program, we need to determine if the identifier is accessible at that point in the program. This is called scope checking. If we try to access a local variable declared in one function from another function, we should get an error message. This is because only variables declared in the current scope and in the open scopes containing the current scope are accessible.
2.6.4 Intermediate Code Generator Phase
The intermediate code generator is the part of the compilation process in which the compiler converts an internal representation of the source program into a lower-level form that can more readily be turned into code executed by a machine (often a computer).
The input to the intermediate code generator typically consists of a parse tree or an abstract syntax tree. The tree is converted into a linear sequence of instructions, usually in an intermediate language such as three address code. Further stages of compilation may or may not be referred to as "code generation", depending on whether they involve a significant change in the representation of the program. (For example, a peephole optimization pass would not likely be called "code generation", although a code generator might incorporate a peephole optimization pass.)

Sophisticated compilers typically perform multiple passes over various intermediate forms. This multi-stage process is used because many algorithms for code optimization are easier to apply one at a time, or because the input to one optimization relies on the processing performed by another optimization. This organization also facilitates the creation of a single compiler that can target multiple architectures, as only the last of the code generation stages (the backend) needs to change from target to target.
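As a hedged illustration of flattening a tree into a linear intermediate form, the sketch below walks a tiny expression tree and emits three-address code, introducing a fresh temporary for every operator node. The node classes and the textual instruction format are assumptions invented for this example, not the project's actual IR.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: flatten an expression tree such as (a + b) * c
// into three-address code, one temporary per operator node.
public class TacGenerator {
    interface Node {}
    static class Var implements Node { final String name; Var(String n) { name = n; } }
    static class BinOp implements Node {
        final String op; final Node left, right;
        BinOp(String op, Node l, Node r) { this.op = op; left = l; right = r; }
    }

    private final List<String> code = new ArrayList<>();
    private int tempCount = 0;

    // Returns the name (variable or temporary) holding the value of the subtree.
    String gen(Node n) {
        if (n instanceof Var) return ((Var) n).name;
        BinOp b = (BinOp) n;
        String l = gen(b.left);
        String r = gen(b.right);
        String t = "t" + (++tempCount);
        code.add(t + " = " + l + " " + b.op + " " + r);
        return t;
    }

    public static void main(String[] args) {
        TacGenerator g = new TacGenerator();
        Node tree = new BinOp("*", new BinOp("+", new Var("a"), new Var("b")), new Var("c"));
        g.gen(tree);
        g.code.forEach(System.out::println);   // t1 = a + b   then   t2 = t1 * c
    }
}
```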
2.6.5 Code Optimizer Phase
The code optimizer takes as its input the intermediate representation code generated by the intermediate code generator and performs the optimization process that finally produces the optimized code, which becomes the input to the code generator. The compiler optimization process can be defined as tuning the intermediate representation code to minimize or maximize some attribute of an executable computer program. The most common requirement is to minimize the time taken to execute a program; a less common one is to minimize the amount of memory occupied. The growth of portable computers has also created a market for minimizing the power consumed by a program.

Code optimization refers to the techniques used by the compiler to improve the execution efficiency of the generated object code. It involves a complex analysis of the intermediate code and the performance of various transformations; but every optimizing transformation must also preserve the semantics of the program. That is, a compiler should not attempt any optimization that would lead to a change in the program's semantics.

Optimization can be machine-independent or machine-dependent. Machine-independent optimizations can be performed independently of the target machine for which the compiler is generating code; that is, the optimizations are not tied to the target machine's specific platform or language. Examples of machine-independent optimizations are: elimination of loop invariant computation, induction variable elimination, and elimination of common subexpressions. On the other hand, machine-dependent optimization requires knowledge of the target machine. An attempt to generate object code that will utilize the target machine's registers more efficiently is an example of machine-dependent code optimization. Actually, code optimization is a misnomer; even after performing various optimizing transformations, there is no guarantee that the generated object code will be optimal. Hence, we are actually
performing code improvement. When attempting any optimizing transformation, the following criteria should be applied:
1. The optimization should capture most of the potential improvements without an unreasonable amount of effort.
2. The optimization should be such that the meaning of the source program is preserved.
3. The optimization should, on average, reduce the time and space expended by the object code.
2.6.5.1 Types of Optimizations
Techniques used in optimization can be broken up among various scopes, which can affect anything from a single statement to the entire program. Generally speaking, locally scoped techniques are easier to implement than global ones but result in smaller gains. Some examples of scopes include:
• Local optimizations: These only consider information local to a function definition. This reduces the amount of analysis that needs to be performed (saving time and reducing storage requirements) but means that worst case assumptions have to be made when function calls occur or global variables are accessed (because little information about them is available).
• Loop optimizations: These act on the statements which make up a loop, such as a for loop (e.g., loop-invariant code motion). Loop optimizations can have a significant impact because many programs spend a large percentage of their time inside loops.
• Peephole optimizations: Usually performed late in the compilation process after machine code has been generated. This form of optimization examines a few adjacent instructions (like "looking through a peephole" at the code) to see whether they can be replaced by a single instruction or a shorter sequence of instructions. For instance, a multiplication of a value by 2 might be more efficiently executed by left-shifting the value or by adding the value to itself. (This example is also an instance of strength reduction.)
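As a small, hedged example of a local optimization, the sketch below performs constant folding on three-address instructions of the form t = c1 op c2: when both operands are integer literals, the instruction is replaced by a direct assignment of the computed value. The textual instruction format is the same assumption used in the earlier three-address code sketch.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative local optimization: fold "t = 4 * 8" into "t = 32".
public class ConstantFolder {
    static String fold(String instr) {
        String[] p = instr.split("\\s+");            // expected shape: t = x op y
        if (p.length != 5 || !p[1].equals("=")) return instr;
        try {
            long a = Long.parseLong(p[2]);
            long b = Long.parseLong(p[4]);
            long v;
            switch (p[3]) {
                case "+": v = a + b; break;
                case "-": v = a - b; break;
                case "*": v = a * b; break;
                default:  return instr;              // leave other operators untouched
            }
            return p[0] + " = " + v;
        } catch (NumberFormatException e) {
            return instr;                            // operands are not both constants
        }
    }

    public static void main(String[] args) {
        List<String> before = List.of("t1 = 4 * 8", "t2 = t1 + x");
        List<String> after = before.stream().map(ConstantFolder::fold).collect(Collectors.toList());
        System.out.println(after);                   // [t1 = 32, t2 = t1 + x]
    }
}
```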
2.6.6 Code Generator Phase
This is the last phase of the compiler's operations, and it is a machine-dependent phase: it is not possible to generate good code without considering the details of the particular machine for which the compiler is expected to generate code. Even so, a carefully selected code-generation algorithm can produce code that is twice as fast as code generated by an ill-considered code-generation algorithm.
The main operations of this phase are mapping the logical addresses of the compiled program to physical addresses, allocating the registers of the target machine, and linking the contents of the symbol table with the optimized code generated by the code optimizer, to finally produce the target file or object file.
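A hedged sketch of this final translation step follows: each three-address instruction x = y op z is expanded into a load, an operation, and a store for an imaginary single-register accumulator machine. The mnemonics (LOAD, ADD, MUL, STORE) and the register name are invented for this illustration and do not correspond to any real target instruction set.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: expand one three-address instruction "x = y op z"
// into LOAD / ADD-or-MUL / STORE for an imaginary accumulator machine.
public class NaiveCodeGenerator {
    static List<String> emit(String tac) {
        String[] p = tac.split("\\s+");                    // expected shape: x = y op z
        String opcode = p[3].equals("+") ? "ADD" : "MUL";  // only two operators in this sketch
        List<String> asm = new ArrayList<>();
        asm.add("LOAD R1, " + p[2]);                       // bring the first operand into a register
        asm.add(opcode + " R1, " + p[4]);                  // combine with the second operand
        asm.add("STORE R1, " + p[0]);                      // write the result back to its home location
        return asm;
    }

    public static void main(String[] args) {
        emit("t1 = a + b").forEach(System.out::println);
        // LOAD R1, a
        // ADD R1, b
        // STORE R1, t1
    }
}
```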
2.6.7 Error Handling Methods
One of the important tasks that a compiler must perform is the detection of and recovery from errors. Recovery from errors is important, because the compiler will be scanning and compiling the entire program, perhaps in the presence of errors; so as many errors as possible need to be detected. Every phase of a compilation expects the input to be in a particular format, and whenever that input is not in the required format, an error is returned. When detecting an error, a compiler scans some of the tokens that are ahead of the error's point of occurrence. The fewer the number of tokens that must be scanned ahead of the point of error occurrence, the better the compiler's error-detection capability. For example, consider the following statement:

if a = b then x: = y + z;

The error in the above statement will be detected in the syntactic analysis phase, but not before the syntax analyzer sees the token "then", even though the first token, itself, is in error. After detecting an error, the first thing that a compiler is supposed to do is to report the error by producing a suitable diagnostic. A good error diagnostic should possess the following properties:
1. The message should be produced in terms of the original source program rather than in terms of some internal representation of the source program. For example, the message should be produced along with the line numbers of the source program.
2. The error message should be easy to understand by the user.
3. The error message should be specific and should localize the problem. For example, an error message should read, "x is not declared in function fun," and not just, "missing declaration".
4. The message should not be redundant; that is, the same message should not be produced again and again.
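A small hedged sketch of a diagnostic routine that follows these properties appears below: it reports in terms of the source line, carries a specific message, and suppresses exact duplicates. The class and method names are our own assumptions for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative error reporter: messages carry the source line number,
// are specific, and the same message is never printed twice.
public class ErrorReporter {
    private final Set<String> seen = new HashSet<>();
    private int errorCount = 0;

    public void report(int line, String message) {
        String full = "line " + line + ": error: " + message;
        if (seen.add(full)) {            // add() returns false for duplicates
            System.err.println(full);
            errorCount++;
        }
    }

    public int getErrorCount() { return errorCount; }

    public static void main(String[] args) {
        ErrorReporter r = new ErrorReporter();
        r.report(12, "x is not declared in function fun");
        r.report(12, "x is not declared in function fun");  // suppressed as redundant
        System.out.println(r.getErrorCount());              // 1
    }
}
```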
2.6.7.1 Recovery from Lexical Phase Errors
The lexical analyzer detects an error when it discovers that an input's prefix does not fit the specification of any token class. After detecting an error, the lexical analyzer can invoke an error recovery routine. This can entail a variety of remedial actions. The simplest possible error recovery is to skip the erroneous characters until the lexical analyzer finds another token. But this is likely to cause the parser to read a deletion error, which can cause severe difficulties in the syntax-analysis and remaining phases. One way the parser can help the lexical analyzer improve its ability to recover from errors is to make its list of legitimate tokens (in the current context) available to the error recovery routine. The error-recovery routine can then decide whether a remaining input's prefix matches one of these tokens closely enough to be treated as that token.
2.6.7.2 Recovery from Syntax Phase Errors
A syntax analyzer detects an error when it has no legal move from its current configuration. It is capable of announcing an error as soon as it reads an input that is not a valid continuation of the previous input's prefix. This is the earliest time that a left-to-right parser can announce an error, but there are a variety of other syntax analyzers that do not necessarily have this property. The advantage of using a syntax analyzer with the valid-prefix property is that it reports an error as soon as possible, and it minimizes the amount of erroneous output passed to subsequent phases of the compiler.
2.6.8 Symbol Table & Symbol Table Manager
A symbol table is a data structure used by a compiler to keep track of scope and binding information about names. This information is used in the source program to identify the various program elements, like variables, constants, procedures, and the labels of statements. The symbol table is searched every time a name is encountered in the source text. When a new name or new information about an existing name is discovered, the content of the symbol table changes. Therefore, a symbol table must have an efficient mechanism for accessing the information held in the table as well as for adding new entries to the symbol table. For efficiency, our choice of the implementation data structure for the symbol table and the organization of its contents should impose a minimal cost when adding new entries or accessing the information on existing entries. Also, if the symbol table can grow dynamically as necessary, then it is more useful for a compiler. So, in advanced compilers there is a symbol table manager that helps in arranging the data in the symbol table and provides a well-defined access path for the other components of the compiler that need to use the symbol table.
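One common way to organize such a table is sketched below: a stack of hash maps, one map per open scope, with insertion into the current scope and lookup walking outward through the enclosing scopes. The class shape and the string-valued type information are assumptions made for illustration, not the design mandated by this project.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Illustrative symbol table: a stack of maps, one map per open scope.
public class SymbolTable {
    private final Deque<Map<String, String>> scopes = new ArrayDeque<>();

    public void enterScope() { scopes.push(new HashMap<>()); }
    public void exitScope()  { scopes.pop(); }

    // Insert into the current (innermost) scope; false if already declared there.
    public boolean insert(String name, String typeInfo) {
        return scopes.peek().putIfAbsent(name, typeInfo) == null;
    }

    // Look the name up in the current scope first, then in the enclosing open scopes.
    public String lookup(String name) {
        for (Map<String, String> scope : scopes) {        // iterates from innermost outward
            String info = scope.get(name);
            if (info != null) return info;
        }
        return null;                                      // not accessible here
    }

    public static void main(String[] args) {
        SymbolTable st = new SymbolTable();
        st.enterScope();                 // global scope
        st.insert("count", "int");
        st.enterScope();                 // a nested scope
        st.insert("count", "double");    // shadows the outer declaration
        System.out.println(st.lookup("count"));  // double
        st.exitScope();
        System.out.println(st.lookup("count"));  // int
    }
}
```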
This is all the information required to construct a complete picture of the compiler in the mind of any interested person who has a little knowledge of compilers and automata theory. Now we are ready to go on to the next, and most important, chapter, Compiler Design, because this project or study is an exercise in the principles of the Software Design subject.
Chapter Three
Compiler Design
3.1 Compiler Design
As we mentioned above regarding the object oriented development strategy, and why we use this strategy in our work and implement it in all stages, we also used this strategy in the design process of our compiler. In this chapter, we will show how to design the compiler using the object oriented strategy, together with the standard notations and diagrams of the UML (Unified Modeling Language) that help the programmer to implement this system.
3.2 The Context of the System
When we look at the compiler as a subsystem, we can observe that, if we consider any low or high level language as a system, we can identify many subsystems such as the compiler, the linker, the loader, the libraries and the report generator (if it is a high level language), as shown in Figure 3.1. This figure shows that our system is a subsystem in a larger system, and that it interacts with the other subsystems to complete their functions without any conflict. Using object oriented design, we will now show the subsystems of our compiler and how these subsystems are interconnected with each other, i.e. we will establish the architectural design of our compiler.

Figure 3.1
3.2.1 Compiler Architectural Design
In this process we define the subsystems making up our main system, and how these subsystems interact to produce the target file that the compiler exists to generate.
3.2.1.1 Compiler Structuring
The main output of system structuring is the definition of the subsystems and of how these subsystems communicate. When we look at Figure 3.2, we can easily identify the
subsystems and follow the multiple stages of the compiler, from receiving the source code through the lexical analyzer, syntax analyzer, semantic analyzer, etc., up to producing the target file. The lines linking these subsystems represent the relations between them.
3.2.1.1.1 Compiler Organization
Also from Figure 3.2, we can determine the organization models used in our compiler. These models help to illustrate how the subsystems
Figure 3.2
share their data and communicate with each other. We can easily find the following models in our compiler:
• Repository Model: found in the symbol table manager, which is connected with all the other subsystems in order to check, update, and save any subsystem's data in the symbol table (see Figure 3.2). From this model we can say that our compiler has a subsystem that shares data with all the others, so all of our subsystems have a flexible interface, or the same interface, when getting or setting data through the symbol table manager. The repository model also appears in the error handler subsystem, which communicates with all the other subsystems to report the errors, if any are found, at any stage.
• Abstract Machine (Layered Model): this is the main model that the compiler depends on, because it handles the main function of the compiler. The compilation process runs to completion through the layered model. At the beginning, the lexical analyzer receives the source code from the editor screen, if the language has a UI, or from the file the source code was written to. When the lexical analyzer finishes its function of separating the tokens that the source contains, it saves the stream of tokens in the symbol table with help from the symbol table manager. After that, the syntax analyzer receives this stream of tokens, performs the parsing operation on them and produces the parse tree into the symbol table, again with help from the symbol table manager. Then the semantic analyzer receives this parse tree and checks for
logical errors, then produces the abstract syntax tree. After that, the next phase begins with the intermediate representation code generator, which receives the abstract syntax tree, performs the lower-level work of mapping towards the machine instructions and the instruction set architecture, and produces the intermediate representation code. This intermediate representation code is then received by the I.R code optimizer, which puts additional effort into optimizing this code, for example reducing the CPU time needed to execute the program and reducing the amount of memory required to complete the execution; this is done using many functions that work in the low level environment and solve problems such as loop optimization, parallel optimization, and so on. The output of this stage is the optimized code, which is received by the code generator; the code generator works with it and with the contents of the symbol table to carry out several operations, such as handling run time errors, transforming the logical addresses of the program being compiled into physical addresses in memory, and allocating the registers, to finally produce the target code. The meaning of this illustration is conveyed in the following Figure 3.3, which represents the layered model of our system.
Figure 3.3
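To make the layered organization concrete, the following Java sketch chains the phases in the order described above, each phase consuming the previous phase's output. The phase names mirror our subsystems, but the method signatures and the string-based intermediate results are simplifications assumed only for this illustration; a real implementation would pass richer structures (token streams, parse trees, IR) between the layers.

```java
// Illustrative sketch of the layered model: each layer's output is the
// next layer's input. Strings stand in for the real intermediate structures.
public class CompilerPipeline {
    String lexicalAnalyze(String source)     { return "tokens(" + source + ")"; }
    String syntaxAnalyze(String tokens)      { return "parseTree(" + tokens + ")"; }
    String semanticAnalyze(String parseTree) { return "ast(" + parseTree + ")"; }
    String generateIR(String ast)            { return "ir(" + ast + ")"; }
    String optimizeIR(String ir)             { return "optimized(" + ir + ")"; }
    String generateCode(String optimizedIr)  { return "target(" + optimizedIr + ")"; }

    public String compile(String source) {
        // The output of every layer is the input of the next one.
        return generateCode(optimizeIR(generateIR(semanticAnalyze(syntaxAnalyze(lexicalAnalyze(source))))));
    }

    public static void main(String[] args) {
        System.out.println(new CompilerPipeline().compile("x = 1 + 2;"));
    }
}
```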
3.2.1.2 Compiler Control Modeling
At this level we work with the controls to answer questions such as: how does the sequence of control flow through these subsystems? When does this sequence pass from one subsystem to another? And so on. After examining this, we can say that the control of our compiler is event-based control, because timing in our system is very important: the compilation process is not instantaneous, it takes some time, and it moves from one subsystem to another depending on the output of each subsystem, so there is an ordering that governs this. We can therefore identify the control model of our system as the broadcast model, because when any subsystem is ready to do its job, it first performs a monitoring
operation to determine when to start, depending on the events broadcast to all subsystems. We can say that this is an efficient way of integrating subsystems, although a subsystem does not know when an event will be handled.
3.2.1.3 Compiler Modular Decomposition
In this step we show the strategy used in the decomposition of our subsystem components. It is a very important step, because if it is completed successfully it gives great help to the programmers who will write the code of our compiler. As we mentioned above, we use the object oriented strategy to analyze and design our system, so we also use the object model to carry out the modular decomposition of the subsystem components, by identifying the needed classes and identifying their attributes and operations. We will illustrate this step when we reach the UML class diagram in this chapter.
3.2.1.4 Compiler Domain Specific Architectures
Here we determine the domain specific model of our compiler. This is one of the important steps of the architectural design of any system we want to build. In the SE4 compiler design, to analyze the domain specific architecture of our compiler we use the generic model, which emerges from analyzing previous compilers and extracting their good features to implement in our compiler; i.e. our compiler comes from the study of existing compilers that have the same objectives as ours.
3.3 Compiler Use Case Models
Use case models or diagrams are important for describing our subsystems' functions, because the programmers need diagrams rather than natural language to implement our system. So we developed a use case model for each of the subsystems of the compiler. We start with the lexical analyzer use case diagram.
3.3.1 Lexical Analyzer Use Case Diagram
The diagram shown in Figure 3.4 specifies the lexical analyzer use case diagram, whose main function is to generate the stream of tokens. As we can see in this diagram, it also has many additional functions.
Figure 3.4
3.3.2 Syntax Analyzer Use Case Diagram
The syntax analyzer use case diagram is shown in Figure 3.5. Its main use case is parsing the stream of tokens generated by the lexical analyzer.
Figure 3.5
3.3.3 Semantic Analyzer Use Case Diagram
The semantic analyzer performs the functions of detecting logical errors, by checking the parse tree generated by the syntax analyzer against its CSG (context sensitive grammar), and also produces the AST (abstract syntax tree), as shown in Figure 3.6.
Figure 3.6
3.3.4 I.R Code Generator Use Case Diagram
For the intermediate representation code generator, this subsystem's main use case is to produce the simple intermediate representation code. Its use case diagram is shown in Figure 3.7.
Figure 3.7
3.3.5 I.R Code Optimizer Use Case Diagram
Figure 3.8 shows the use case diagram of the I.R code optimizer, which performs its operations on the simple I.R code generated by the I.R code generator to finally produce the simple optimized code.
Figure 3.8
3.3.6 Code Generator Use Case Diagram
The main function of this subsystem, and of the system as a whole, is to produce the target file; this function is accomplished together with the many supporting functions shown in Figure 3.9.
Figure 3.9
3.3.7 Symbol Table Manager Use Case Diagram
Through this subsystem, the other components can access the symbol table, so this subsystem has many functions applied to the symbol table. Its use case diagram is shown in Figure 3.10.
Figure 3.10
3.3.8 Error Handler Use Case Diagram
The function of the error handler is to report any errors that occur and to recognize the types of errors occurring during the compilation process. This is shown in Figure 3.11.
Figure 3.11
3.4 Compiler's Subsystems Activity Diagrams
The activity diagrams are the diagrams that translate the use cases into a sequence of operations, together with the conditions allowed along this sequence. They therefore describe each function from its beginning to its end.
3.4.1 Lexical Analyzer Activity Diagram
The activity diagram of the lexical analyzer subsystem illustrates the process of tokenization, how this process happens, and the possible outcomes. Here the result may be the stream of tokens or an error that occurred. Figure 3.12 shows this diagram.
Figure 3.12
3.4.2 Syntax Analyzer Activity Diagram
The syntax analyzer activity diagram illustrates how the syntax analyzer works internally to finally produce the parse tree, or how a syntax error occurs that does not allow this stage to complete its sequence and produce the parse tree. Figure 3.13 shows the activity diagram of the syntax analyzer subsystem.
Figure 3.13
3.4.3 Semantic Analyzer Activity Diagram
The sequence the semantic analyzer follows to produce the abstract syntax tree, if no error occurs, is shown in Figure 3.14.
Figure 3.14
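As an illustration of the kind of logical checking this stage performs, the following sketch walks a small abstract syntax tree and verifies operand types; the AST shape and type names are assumptions for demonstration, not the SE4 data structures.

# A minimal semantic-checking sketch: it walks a nested-tuple AST and
# verifies that the operands of an arithmetic operator have compatible
# types, reporting a logical error otherwise.
def check_types(node, symbol_types):
    """Return the type of an expression node or raise TypeError."""
    kind = node[0]
    if kind == "NUMBER":
        return "int"
    if kind == "IDENT":
        name = node[1]
        if name not in symbol_types:
            raise TypeError(f"undeclared identifier {name!r}")
        return symbol_types[name]
    if kind in ("+", "-", "*", "/"):
        left = check_types(node[1], symbol_types)
        right = check_types(node[2], symbol_types)
        if left != right:
            raise TypeError(f"type mismatch: {left} {kind} {right}")
        return left
    raise TypeError(f"unknown node kind {kind!r}")

# Example: 'a + 2' is well typed when 'a' is declared as an int.
ast = ("+", ("IDENT", "a"), ("NUMBER", "2"))
print(check_types(ast, {"a": "int"}))   # -> int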
3.4.4 From this subsystem until the end of the compilation process the work is done at a low level, meaning that it depends on the machine itself. Furthermore, if the compilation process reaches this stage without any error, the user's requirements have been satisfied and the compiler proceeds to finish the operation and produce the target file with as little negative effect on the user's machine as possible. Figure 3.15 shows how the I.R code generator works internally to produce the simple I.R code.
Figure 3.15
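A common way to realize a simple intermediate representation is three-address code; the sketch below illustrates that idea under the same assumed AST shape as above, and is not the SE4 I.R format itself.

import itertools

# A minimal three-address-code sketch: the instruction format below is
# hypothetical, used only to show how an expression AST can be flattened
# into simple instructions with numbered temporaries.
def gen_ir(node, code, temps=None):
    """Append instructions to `code` and return the name holding the result."""
    if temps is None:
        temps = itertools.count(1)
    kind = node[0]
    if kind in ("NUMBER", "IDENT"):
        return node[1]                      # literals and names are used as-is
    left = gen_ir(node[1], code, temps)
    right = gen_ir(node[2], code, temps)
    temp = f"t{next(temps)}"
    code.append(f"{temp} = {left} {kind} {right}")
    return temp

# Example: (a + 2) * b is flattened into two simple instructions.
ast = ("*", ("+", ("IDENT", "a"), ("NUMBER", "2")), ("IDENT", "b"))
code = []
gen_ir(ast, code)
print("\n".join(code))      # t1 = a + 2
                            # t2 = t1 * b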
3.4.5 The I.R code optimizer activity diagram is shown in Figure 3.16. Its main function is to produce optimized code from the simple I.R code produced by the I.R code generator.
Figure 3.16
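As one example of the optimizations this stage may perform, the sketch below applies constant folding to the same illustrative AST; the actual transformations of the SE4 optimizer are described by the diagram, not by this code.

import operator

# A minimal optimization sketch: constant folding evaluates an operation on
# two literal operands at compile time, so the generated code carries the
# result instead of recomputing it at run time.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold_constants(node):
    """Return an AST in which constant sub-expressions are replaced by their value."""
    if node[0] in ("NUMBER", "IDENT"):
        return node
    op, left, right = node[0], fold_constants(node[1]), fold_constants(node[2])
    if op in OPS and left[0] == "NUMBER" and right[0] == "NUMBER":
        return ("NUMBER", str(OPS[op](int(left[1]), int(right[1]))))
    return (op, left, right)

# Example: b * (3 + 4) is simplified to b * 7 before code generation.
ast = ("*", ("IDENT", "b"), ("+", ("NUMBER", "3"), ("NUMBER", "4")))
print(fold_constants(ast))   # ('*', ('IDENT', 'b'), ('NUMBER', '7'))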
3.4.6 The activity diagram of this subsystem is shown in Figure 3.17; the figure shows the sequence this subsystem follows to produce the target file, or code.
Figure 3.17
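To illustrate this final lowering step, the sketch below translates three-address instructions into a made-up pseudo-assembly; the mnemonics and register names are placeholders, since the actual SE4 target machine is not assumed here.

# A minimal code-generation sketch: each three-address instruction
# "t = a op b" is lowered to a load/operate/store triple.
OP_MNEMONIC = {"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}

def gen_target(three_address_code):
    lines = []
    for instr in three_address_code:
        dest, expr = instr.split(" = ")
        left, op, right = expr.split()
        lines.append(f"LOAD  R1, {left}")
        lines.append(f"{OP_MNEMONIC[op]}   R1, {right}")
        lines.append(f"STORE R1, {dest}")
    return "\n".join(lines)

# Example: lower the two I.R. instructions produced earlier.
print(gen_target(["t1 = a + 2", "t2 = t1 * b"]))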
3.4.7 In the activity diagram of the symbol table manager we illustrate the sequence of this subsystem's operations, which are shown in Figure 3.18.
Figure 3.18
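A minimal sketch of a possible symbol table manager interface follows; the method names and the scope-stack design are assumptions used only to illustrate insertion, lookup, and nested scopes, not the SE4 class itself.

# A minimal symbol-table-manager sketch: it supports nested scopes,
# insertion of names, and lookup from the innermost scope outward.
class SymbolTableManager:
    def __init__(self):
        self.scopes = [{}]                 # stack of scopes; index 0 is global

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def insert(self, name, info):
        self.scopes[-1][name] = info       # bind the name in the current scope

    def lookup(self, name):
        for scope in reversed(self.scopes):  # search the innermost scope first
            if name in scope:
                return scope[name]
        return None

# Example: a local 'x' shadows the global one while its scope is open.
table = SymbolTableManager()
table.insert("x", {"type": "int"})
table.enter_scope()
table.insert("x", {"type": "double"})
print(table.lookup("x"))    # {'type': 'double'}
table.exit_scope()
print(table.lookup("x"))    # {'type': 'int'}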
3.4.8 In the error handler activity diagram we illustrate the operations that are executed when any type of error occurs at any stage of the compilation process. Figure 3.19 shows this diagram.
Figure 3.19
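The sketch below shows one possible shape for such an error handler, collecting errors from every phase in one place so they can be reported together; the interface is an assumption for illustration, not the SE4 class.

# A minimal error-handler sketch: every phase reports errors through one
# object, which records the phase, the location, and the message, so
# compilation can continue and all errors are listed at the end.
class ErrorHandler:
    def __init__(self):
        self.errors = []

    def report(self, phase, line, message):
        self.errors.append((phase, line, message))

    def summary(self):
        return "\n".join(f"[{phase}] line {line}: {message}"
                         for phase, line, message in self.errors)

# Example: errors from different phases collected in one place.
handler = ErrorHandler()
handler.report("lexical", 3, "illegal character '@'")
handler.report("semantic", 9, "type mismatch: int + string")
print(handler.summary())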
3.5 In this subsection of the design process we show the classes needed to implement our compiler. Their attributes are extracted from the use case and activity diagrams discussed above. In determining these classes we also identify the auxiliary classes that the main classes need in order to complete their functions. The large Figure 3.20 that follows shows the classes of our compiler.
Figure 3.20
3.6 A sequence diagram describes the interaction among the actors of a system; that is, it describes how the subsystems communicate and in which order the different paths of communication between them are followed.
3.6.1 For the lexical analyzing stage we produce the sequence diagram of the interaction among the interface, the lexical analyzer, the symbol table manager, the symbol table, and the error handler actors. This sequence diagram is shown in Figure 3.21.
Figure 3.21
3.6.2 For the syntax analyzing stage we produce the sequence diagram of the interaction among the interface, the syntax analyzer, the symbol table manager, the symbol table, and the error handler actors. This sequence diagram is shown in Figure 3.22.
Figure 3.22
3.6.3 For the semantic analyzing stage we produce the sequence diagram of the interaction among the interface, the semantic analyzer, the symbol table manager, the symbol table, and the error handler actors. This sequence diagram is shown in Figure 3.23.
Figure 3.23
3.6.4 For the I.R code generation stage we produce the sequence diagram of the interaction among the intermediate representation code generator, the symbol table manager, and the symbol table. This sequence diagram is shown in Figure 3.24.
Figure 3.24
3.6.5 For the I.R code optimizing stage we produce the sequence diagram of the interaction among the intermediate representation code optimizer, the symbol table manager, and the symbol table. This sequence diagram is shown in Figure 3.25.
Figure 3.25
3.6.6 For the code generation stage we produce the sequence diagram of the interaction among the code generator, the symbol table manager, and the symbol table. This sequence diagram is shown in Figure 3.26.
Figure 3.26
3.6.7 The sequence diagrams shown in Figure 3.27 represent the sequences that are executed only once, when our compiler is first installed, to feed our subsystems with their grammars.
Figure 3.27
A compiler is a special type of computer program that translates a human readable text file into a form that the computer can more easily understand.
An interpreter reads the source code one instruction or line at a time, converts this line into machine code and executes it.
Object-oriented development is a development strategy in which system analysts, designers, and programmers think in terms of 'things' instead of operations or functions.
The lexical analyzer is the part where the stream of characters making up the source program is read from left to right and grouped into tokens.
Tokens are sequences of characters with a collective meaning. There are usually only a small number of tokens for a programming language: constants (integer, double, char, string, etc.), operators (arithmetic, relational, logical), punctuation, and reserved words.
A lexeme is the actual character sequence forming a token.
A regular set is a set of strings for which there exists some finite automata that accepts that set. That is, if R is a regular set, then R = L(M) for some finite automata M. Similarly, if M is a finite automata, then L(M) is always a regular set.
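For illustration, the following small automaton (an assumption for this document, not the SE4 lexer) accepts the regular set of identifiers of the form letter-or-underscore followed by letters, digits, or underscores, showing the correspondence between a regular set and a finite automaton.

# A tiny deterministic finite automaton with states start, ident, and reject;
# the language it accepts is therefore a regular set by the definition above.
def accepts_identifier(text):
    state = "start"
    for ch in text:
        if state == "start":
            state = "ident" if (ch.isalpha() or ch == "_") else "reject"
        elif state == "ident":
            state = "ident" if (ch.isalnum() or ch == "_") else "reject"
        else:                      # the reject state is a dead end
            return False
    return state == "ident"

print(accepts_identifier("count_1"))   # True:  in the regular set
print(accepts_identifier("1count"))    # False: not in the regular set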
The syntax‐analysis is the part of a compiler that verifies whether or not the tokens generated by the lexical analyzer are grouped according to the syntactic rules of the language.
CFG notation specifies a context‐free language that consists of terminals, nonterminals, a start symbol, and productions.
The terminals are nothing more than tokens of the language, used to form the language constructs.
The nonterminals define the sets of strings that are used to define the language generated by the grammar.
Derivation refers to replacing an instance of a given string's nonterminal, by the right‐hand side of the production rule, whose left‐hand side contains the nonterminal to be replaced.
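As a small worked example (the grammar here is illustrative, not the SE4 grammar), take the productions

E -> E + T | T
T -> id | num

with nonterminals E and T, terminals id, num, and +, and start symbol E. A leftmost derivation of the sentence id + num replaces the leftmost nonterminal at each step:

E => E + T => T + T => id + T => id + num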
Semantic analysis is the part of the compiler where we delve even deeper to check whether the program constructs form a sensible set of instructions in the programming language.
A type is a set of values and a set of operations operating on those values.
Type checking is the process of verifying that each operation executed in a program respects the type system of the language.
A scope is a section of program text enclosed by basic program delimiters.
Scope checking is the determination of whether an identifier is accessible at a given point in the program.
Intermediate code generation is the part of the compilation process by which a compiler's code generator converts some internal representation of source code into a form that can be readily executed by a machine.
Code optimization refers to techniques a compiler can employ in order to produce improved object code for a given source program.
Code generation is the last phase of the compiler's operation; it converts the optimized code into machine-dependent code.
One of the important tasks that a compiler must perform is the detection of and recovery from errors.
A symbol table is a data structure used by a compiler to keep track of scope/ binding information about names. This information is used in the source program to identify the various program elements, like variables, constants, procedures, and the labels of statements.
The symbol table manager is the part of the compiler that provides the other compiler components with access to the symbol table.