Role of a Lexical Analyser
The lexical analyzer is the first phase of a compiler. A program or function which performs lexical analysis is called a lexical analyzer, lexer, or scanner. A lexer often exists as a single function which is called by a parser or another function.
Its main task is to read the input characters from the source program and produce as output a sequence of tokens that the parser uses for syntax analysis. Its duties include:

- grouping the input characters into lexemes and producing as output a sequence of tokens, the input for the syntactic analyzer
- interacting with the symbol table, e.g. inserting identifiers
- stripping out comments and whitespace (blanks, newlines, tabs, and other separators)
- correlating error messages generated by the compiler with the source program, keeping track of the number of newlines seen so that a line number can be associated with each error message
- expanding macros
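The whitespace- and comment-stripping duties above can be sketched as a small helper. This is a minimal illustration, not the analyzer designed in these notes; the function name and the C-style comment syntax it assumes are choices made for the example. Note how newlines are counted as they are skipped, so the lexer can report a line number with each error message.

```python
def skip_whitespace_and_comments(text, pos, line):
    """Advance `pos` past blanks, tabs, newlines, and C-style
    comments, counting newlines so errors can carry a line number."""
    while pos < len(text):
        ch = text[pos]
        if ch == '\n':
            line += 1
            pos += 1
        elif ch in ' \t':
            pos += 1
        elif text.startswith('//', pos):
            # line comment: skip to end of line
            while pos < len(text) and text[pos] != '\n':
                pos += 1
        elif text.startswith('/*', pos):
            # block comment: skip to the closing */ (assumed present)
            end = text.find('*/', pos + 2)
            line += text.count('\n', pos, end)
            pos = end + 2
        else:
            break  # a real token starts here
    return pos, line
```

A real scanner would also handle an unterminated block comment as a lexical error rather than assuming `*/` is present.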
Upon receiving a "get next token" command from the parser, the lexical analyzer reads input characters until it can identify the next token. It then returns to the parser a representation of the token it has found. The representation is an integer code if the token is a simple construct such as a parenthesis, comma, or colon. It is a pair consisting of an integer code and a pointer to a table if the token is a more complex element such as an identifier or constant; the integer code gives the token type, and the pointer points to the value of that token.
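The two token representations just described can be sketched as follows. This is an assumed layout for illustration (the names `TokType`, `make_token`, and the list-backed symbol table are not from the notes): simple tokens carry only a type code, while identifiers and constants carry a type code plus a pointer, modeled here as an index into a symbol table.

```python
from enum import Enum, auto

class TokType(Enum):
    LPAREN = auto()   # simple constructs: the code alone suffices
    COMMA = auto()
    COLON = auto()
    IDENT = auto()    # complex elements: need a symbol-table entry
    NUMBER = auto()

symbol_table = []     # grows as identifiers and constants are seen

def make_token(tok_type, lexeme=None):
    """Return (code, None) for a simple token, or
    (code, table index) for an identifier or constant."""
    if lexeme is None:
        return (tok_type, None)
    symbol_table.append(lexeme)
    return (tok_type, len(symbol_table) - 1)
```

A production compiler would deduplicate symbol-table entries and store attributes (type, scope) alongside the lexeme; the sketch keeps only the pairing of code and pointer.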
Sometimes, lexical analyzers are divided into a cascade of two phases, the first called "scanning" and the second "lexical analysis".
The scanner is responsible for simple tasks, while the lexical analyzer proper performs the more complex operations. The lexical analyzer we have designed takes its input from an input file. It reads one character at a time and continues until the end of the file is reached. It recognizes valid identifiers and keywords and specifies the token values of the keywords. It also identifies header files, #define statements, numbers, special characters, and various relational and logical operators; it ignores whitespace and comments. It prints its output to a separate file, specifying the line number.
Token

A token is a string of characters, categorized according to the rules as a symbol (e.g., IDENTIFIER, NUMBER, COMMA). The process of forming tokens from an input stream of characters is called tokenization, and the lexer categorizes them according to a symbol type. A token can look like anything that is useful for processing an input text stream or text file. A lexical analyzer generally does nothing with combinations of tokens, a task left for a parser. For example, a typical lexical analyzer recognizes parentheses as tokens but does nothing to ensure that each '(' is matched with a ')'. Consider this expression in the C programming language:

sum=3+2;
Tokenized in the following table:

Lexeme    Token type
sum       Identifier
=         Assignment operator
3         Number
+         Addition operator
2         Number
;         End of statement
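The table above can be reproduced by a toy tokenizer. This is a sketch, not a real C lexer: the token names match the table, but the regular expressions (`TOKEN_SPEC`) are assumptions covering only this one expression.

```python
import re

# Hypothetical token spec, just enough for sum=3+2;
# (order matters: identifiers are tried before numbers, etc.)
TOKEN_SPEC = [
    ("Identifier",          r"[A-Za-z_]\w*"),
    ("Number",              r"\d+"),
    ("Assignment operator", r"="),
    ("Addition operator",   r"\+"),
    ("End of statement",    r";"),
]

def tokenize(source):
    """Return (lexeme, token type) pairs, scanning left to right."""
    pattern = "|".join(f"(?P<g{i}>{rx})" for i, (_, rx) in enumerate(TOKEN_SPEC))
    tokens = []
    for m in re.finditer(pattern, source):
        name = TOKEN_SPEC[int(m.lastgroup[1:])][0]
        tokens.append((m.group(), name))
    return tokens
```

Running `tokenize("sum=3+2;")` yields exactly the lexeme/token pairs of the table, in order.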
Tokens are frequently defined by regular expressions, which are understood by a lexical analyzer generator such as lex. Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.
Take, for example: The quick brown fox jumps over the lazy dog
The string isn't implicitly segmented on spaces, as an English speaker would do. The raw input, the 43 characters, must be explicitly split into the 9 tokens with a given space delimiter (i.e., matching the string " " or the regular expression /\s{1}/). The tokens could be represented in XML:

<sentence>
  <word>The</word>
  <word>quick</word>
  <word>brown</word>
  <word>fox</word>
  <word>jumps</word>
  <word>over</word>
  <word>the</word>
  <word>lazy</word>
  <word>dog</word>
</sentence>
Or as an s-expression:

(sentence ((word The) (word quick) (word brown) (word fox) (word jumps) (word over) (word the) (word lazy) (word dog)))
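The whitespace-delimited tokenization above can be sketched directly. The function names are illustrative; the split uses the whitespace delimiter mentioned in the text, and the s-expression printer mirrors the representation shown above.

```python
import re

def tokenize_words(text):
    """Explicitly split the raw characters into word tokens
    on the whitespace delimiter (regular expression \\s+)."""
    return [("word", w) for w in re.split(r"\s+", text) if w]

def to_sexpr(tokens):
    """Render the token list in the s-expression form shown above."""
    inner = " ".join(f"({t} {v})" for t, v in tokens)
    return f"(sentence ({inner}))"
```

Applied to "The quick brown fox jumps over the lazy dog", `tokenize_words` produces the 9 word tokens, and `to_sexpr` reproduces the sentence s-expression.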
Examples of Tokens
Dealing With Errors

When the lexical analyzer is unable to proceed because no pattern matches the remaining input, it must recover. Panic-mode recovery deletes successive characters from the remaining input until a token is found. Other possible repair actions are:

- insert a missing character
- delete a character
- replace a character by another
- transpose two adjacent characters
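Panic-mode recovery, the simplest of these strategies, can be sketched as follows. The function name and the `can_start_token` predicate are assumptions for the example; the predicate stands in for "some pattern matches starting at this character".

```python
def panic_mode_recover(source, pos, can_start_token):
    """Delete successive characters from the remaining input
    until a position is reached where some token can start."""
    while pos < len(source) and not can_start_token(source[pos]):
        pos += 1  # discard one offending character
    return pos
```

For example, with a predicate accepting letters, digits, and underscores, recovery on the input "@@#x+1" discards the three illegal characters and resumes scanning at 'x'. The character-level repairs (insert, delete, replace, transpose) are harder, since the lexer must guess which single edit turns the remaining input into a legal token.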