
Compiler Design
By: Haftu Hagos

Chapter One
Compilers
Introduction
Computers are a balanced mix of software and hardware. Hardware is just a piece of mechanical equipment whose functions are controlled by compatible software. Hardware understands instructions in the form of electronic charge, which is the counterpart of binary language in software programming. Binary language has an alphabet of only two symbols, 0 and 1. To instruct the hardware, codes must be written in binary format, which is simply a series of 1s and 0s. Writing such codes directly would be a difficult and cumbersome task for computer programmers, which is why we have compilers to produce them.

1.1 Language Processing System


We have learnt that any computer system is made of hardware and software. The hardware understands a language that humans cannot easily read or write. So we write programs in a high-level language, which is easier for us to understand and remember. These programs are then fed into a series of tools and OS components to obtain the desired code that can be used by the machine. This is known as the Language Processing System.

The high-level language is converted into binary language in various phases. A compiler is a program that converts high-level language to assembly language. Similarly, an assembler is a program that converts the assembly language to machine-level language. Let us first understand how a program, using the C compiler, is executed on a host machine.

1. The user writes a program in the C language (high-level language).
2. The C compiler compiles the program and translates it to an assembly program (low-level language).
3. An assembler then translates the assembly program into machine code (object code).
4. A linker tool is used to link all the parts of the program together for execution (executable machine code).
5. A loader loads all of them into memory and then the program is executed (a small illustration follows).
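As a concrete illustration (the file names are illustrative and not prescribed by this text), a C source file typically flows through these tools as follows:

hello.c  --(preprocessor)--> expanded source text
         --(compiler)------> hello.s  (assembly program)
         --(assembler)-----> hello.o  (relocatable object code)
         --(linker)--------> hello    (executable machine code)
         --(loader)--------> program image in memory, then execution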

Before diving straight into the concepts of compilers, we should understand a few other tools that
work closely with compilers.

1.1.1 Preprocessor
A preprocessor, generally considered a part of the compiler, is a tool that produces input for compilers. It may perform the following functions (a short illustration follows the list):
1. Macro processing: A preprocessor may allow a user to define macros that are shorthands for longer constructs.
2. File inclusion: A preprocessor may include header files into the program text.
3. Rational preprocessing: These preprocessors augment older languages with more modern flow-of-control and data-structuring facilities.
4. Language extensions: These preprocessors attempt to add capabilities to the language by means of built-in macros.
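For instance, a minimal sketch of macro processing and file inclusion as handled by the C preprocessor (the macro name AREA is illustrative):

#include <stdio.h>                      /* file inclusion: the header text is inserted here */
#define AREA(r) (3.14159 * (r) * (r))   /* macro processing: shorthand for a longer construct */

int main(void) {
    printf("%f\n", AREA(2.0));          /* expands to (3.14159 * (2.0) * (2.0)) before compilation */
    return 0;
}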

1.1.2 Interpreter
An interpreter, like a compiler, translates high-level language into low-level machine language. The difference lies in the way they read the source code or input. A compiler reads the whole source code at once, creates tokens, checks semantics, generates intermediate code, translates the whole program, and may involve many passes. In contrast, an interpreter reads a statement from the input, converts it to an intermediate code, executes it, then takes the next statement in sequence. If an error occurs, an interpreter stops execution and reports it, whereas a compiler reads the whole program even if it encounters several errors.

Languages such as BASIC, SNOBOL, and LISP can be translated using interpreters. Java also uses an interpreter. The process of interpretation can be carried out in the following phases:
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Direct execution
Advantages:
Modification of the user program can easily be made and applied as execution proceeds.
The type of object that a variable denotes may change dynamically.
Debugging a program and finding errors is a simplified task for a program used for interpretation.
The interpreter for the language makes the program machine independent.
Disadvantages:
The execution of the program is slower.
Memory consumption is higher.

1.1.3 Assembler
Programmers found it difficult to write or read programs in machine language. They began to use mnemonics (symbols) for each machine instruction, which they would subsequently translate into machine language. Such a mnemonic machine language is now called an assembly language. Programs known as assemblers were written to automate the translation of assembly language into machine language. The input to an assembler is called the source program; the output is a machine-language translation (object program).
The output of an assembler is called an object file, which contains a combination of machine instructions as well as the data required to place these instructions in memory.

1.1.4 Linker
A linker is a computer program that links and merges various object files together in order to make an executable file. All these files might have been compiled by separate assemblers. The major task of a linker is to search for and locate referenced modules/routines in a program and to determine the memory locations where these codes will be loaded, making the program instructions have absolute references.

1.1.5 Loader
A loader is a part of the operating system and is responsible for loading executable files into memory and executing them. It calculates the size of a program (instructions and data) and creates memory space for it. It initializes various registers to initiate execution.

1.1.6 Cross-compiler
A compiler that runs on platform (A) and is capable of generating executable code for platform
(B) is called a cross-compiler.

1.1.7 Source-to-source Compiler
A compiler that takes the source code of one programming language and translates it into the
source code of another programming language is called a source-to-source compiler.

1.2 The Compiler


A compiler is a translator program that translates a program written in a high-level language (HLL), the source program, into an equivalent program in a machine-level language (MLL), the target program. An important part of a compiler is reporting errors in the source program to the programmer.

1.3 Compiler Architecture


A compiler can broadly be divided into two phases based on the way it compiles.

1.3.1 Analysis Phase


Known as the front end of the compiler, the analysis phase reads the source program, divides it into core parts, and checks it for lexical, grammatical, and syntactic errors. The analysis phase generates an intermediate representation of the source program and a symbol table, which are fed to the synthesis phase as input.

1.3.2 Synthesis Phase
Known as the back end of the compiler, the synthesis phase generates the target program with the help of the intermediate source code representation and the symbol table.
A compiler can have many phases and passes.

Pass: A pass refers to one traversal of the compiler through the entire program.
Phase: A phase of a compiler is a distinguishable stage, which takes input from the previous stage, processes it, and yields output that can be used as input for the next stage. A pass can have more than one phase.

1.4 Phases of Compilers


The compilation process is a sequence of various phases. Each phase takes input from its previous stage, has its own representation of the source program, and feeds its output to the next phase of the compiler. Let us understand the phases of a compiler.

Lexical analysis: This is the initial part of reading and analyzing the program text. The text is read and divided into tokens, each of which corresponds to a symbol in the programming language, e.g., a variable name, keyword, or number.
Syntax analysis: This phase takes the list of tokens produced by the lexical analysis and arranges them in a tree structure (called the syntax tree) that reflects the structure of the program. This phase is often called parsing.
Semantic Analysis: Semantic analysis checks whether the parse tree constructed follows the rules of the language. For example, assignment of values must be between compatible data types, and adding a string to an integer must be rejected. The semantic analyzer also keeps track of identifiers, their types and

expressions, and whether identifiers are declared before use. The semantic analyzer produces an annotated syntax tree as its output.
It reports an error, for example, if a variable is used but not declared, or if it is used in a context that does not make sense given the type of the variable, such as trying to use a boolean value as a function pointer.
Intermediate Code Generation: After semantic analysis, the compiler generates an intermediate representation of the source code for the target machine. It represents a program for some abstract machine and lies between the high-level language and the machine language. This intermediate code should be generated in such a way that it is easy to translate into the target machine code.
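As a hedged illustration (a classic textbook example, not taken from this text), the C statement position = initial + rate * 60; might be translated into three-address intermediate code such as:

t1 = rate * 60
t2 = initial + t1
position = t2

Each instruction has at most one operator on its right-hand side, which makes the later optimization and code-generation phases simpler.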
Code Optimization: The next phase performs optimization of the intermediate code. Optimization can be viewed as removing unnecessary code lines and arranging the sequence of statements in order to speed up program execution without wasting resources (CPU, memory).
Code Generation: In this phase, the code generator takes the optimized representation of the intermediate code and maps it to the target machine language. The code generator translates the intermediate code into a sequence of (generally) relocatable machine code. This sequence of machine instructions performs the same task as the intermediate code would.
Symbol Table: It is a data structure maintained throughout all the phases of a compiler. All the identifier names along with their types are stored here. The symbol table makes it easier for the compiler to quickly search for an identifier record and retrieve it. The symbol table is also used for scope management.
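A minimal sketch of a symbol table in C++, assuming one record per identifier (the field and type names are illustrative, not mandated by this text):

#include <string>
#include <unordered_map>

// Illustrative per-identifier record kept by the compiler.
struct SymbolInfo {
    std::string type;   // e.g., "int", "float"
    int firstLine;      // line where the identifier was first seen
};

// The table maps identifier lexemes to their records; a real compiler would
// also keep one table per scope to support scope management.
using SymbolTable = std::unordered_map<std::string, SymbolInfo>;

int main() {
    SymbolTable table;
    table["value"] = {"int", 3};            // install an identifier
    bool declared = table.count("value");   // quick lookup before use
    return declared ? 0 : 1;
}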

1.5 Why learn about compilers


Some typical reasons are:
a) It is considered a topic that you should know in order to be well-cultured in computer
science.
b) A good craftsman should know his tools, and compilers are important tools for
programmers and computer scientists.
c) The techniques used for constructing a compiler are useful for other purposes as well.
d) There is a good chance that a programmer or computer scientist will need to write a
compiler or interpreter for a domain-specific language.

Review Questions
1) What is the difference between high-level languages and machine languages?
2) Define the terms computer hardware and software.
3) Write the phases of the language processing system neatly.
4) What is the difference between a cross-compiler and a source-to-source compiler?
5) Write in detail all the core parts handled in the front end of the compiler.
6) Write in detail all the core parts handled in the back end of the compiler.
7) Define in detail linkers and loaders.

========================THE END====================

Chapter Two
Lexical Analysis
Lexical analysis is the first phase of a compiler. It takes the modified source code from language preprocessors, written in the form of sentences. The lexical analyzer breaks this text into a series of tokens, removing any whitespace and comments from the source code.
If the lexical analyzer finds a token invalid, it generates an error. The lexical analyzer works closely with the syntax analyzer. It reads character streams from the source code, checks for legal tokens, and passes the data to the syntax analyzer on demand.
A lexical analyzer, or lexer for short, takes as its input a string of individual characters and divides this string into tokens. Additionally, it filters out whatever separates the tokens (the so-called whitespace), i.e., layout characters (spaces, newlines, etc.) and comments.

The main purpose of lexical analysis is to make life easier for the subsequent syntax analysis
phase. In theory, the work that is done during lexical analysis can be made an integral part of
syntax analysis, and in simple systems this is indeed often done. However, there are
reasons for keeping the phases separate:

Efficiency: A lexer may do the simple parts of the work faster than the more general
parser can. Furthermore, the size of a system that is split in two may be smaller than a
combined system. This may seem paradoxical but, as we shall see, there is a non-linear
factor involved which may make a separated system smaller than a combined system.
Modularity: The syntactical description of the language need not be cluttered with small
lexical details such as white-space and comments.
Tradition: Languages are often designed with separate lexical and syntactical phases in
mind, and the standard documents of such languages typically separate lexical and
syntactical elements of the languages.
For lexical analysis, specifications are traditionally written using regular expressions: An
algebraic notation for describing sets of strings. The generated lexers are in a class of extremely
simple programs called finite automata.

2.1 Tokens, Lexemes and Patterns


When discussing lexical analysis, we use three related but distinct terms:
A token is a pair consisting of a token name and an optional attribute value. The token
name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword,
or a sequence of input characters denoting an identifier. The token names are the input
symbols that the parser processes. In what follows, we shall generally write the name of a
token in boldface. We will often refer to a token by its token name.
A pattern is a description of the form that the lexemes of a token may take. In the case of
a keyword as a token, the pattern is just the sequence of characters that form the keyword.
For identifiers and some other tokens, the pattern is a more complex structure that is
matched by many strings.
A lexeme is a sequence of characters in the source program that matches the pattern for a
token and is identified by the lexical analyzer as an instance of that token.
Example 2.1: Figure 2.1 gives some typical tokens, their informally described patterns, and some sample lexemes. To see how these concepts are used in practice, consider the C statement:

printf("Total = %d\n", score);
Both printf and score are lexemes matching the pattern for the token id, and "Total = %d\n" is a lexeme matching the token literal.

2.1.1 Tokens
Lexemes are said to be sequences of characters (alphanumeric) in a token. There are some predefined rules for every lexeme to be identified as a valid token. These rules are defined by grammar rules, by means of a pattern. A pattern explains what can be a token, and these patterns are defined by means of regular expressions.
In a programming language, keywords, constants, identifiers, strings, numbers, operators, and punctuation symbols can be considered tokens.
For example, in the C language, the variable declaration line
int value = 100;
contains the tokens:
int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol).
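A minimal sketch in C++ of how a lexer might represent such tokens, assuming a token is a (name, attribute) pair as described later in this chapter (the enum values and field names are illustrative):

#include <string>

// Illustrative token names for the declaration "int value = 100;".
enum class TokenName { KEYWORD, ID, OPERATOR, CONSTANT, SYMBOL };

// A token pairs an abstract name with an attribute value; here the attribute
// is simply the matched lexeme (a real lexer might store a pointer to a
// symbol-table entry instead).
struct Token {
    TokenName name;
    std::string attribute;
};

int main() {
    Token tokens[] = {
        {TokenName::KEYWORD, "int"}, {TokenName::ID, "value"},
        {TokenName::OPERATOR, "="},  {TokenName::CONSTANT, "100"},
        {TokenName::SYMBOL, ";"}
    };
    return sizeof(tokens) / sizeof(tokens[0]) == 5 ? 0 : 1;
}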

Let us see how language theory treats the following terms:

Alphabets
Any finite set of symbols is an alphabet: {0,1} is a set of binary alphabets, {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is a set of hexadecimal alphabets, and {a-z, A-Z} is a set of English language alphabets.

Strings
Any finite sequence of alphabets is called a string. The length of a string is the total number of occurrences of alphabets in it; e.g., the length of the string St. Mary is 8, denoted by |St. Mary| = 8. A string having no alphabets, i.e., a string of zero length, is known as an empty string and is denoted by ε (epsilon).

Special Symbols
A typical high-level language contains the following symbols:-

Arithmetic Symbols: Addition (+), Subtraction (-), Modulo (%), Multiplication (*), Division (/)
Punctuation: Comma (,), Semicolon (;), Dot (.), Arrow (->)
Assignment: =
Special Assignment: +=, /=, *=, -=
Comparison: ==, !=, <, <=, >, >=
Preprocessor: #
Location Specifier: &
Logical: &, &&, |, ||, !
Shift Operator: >>, >>>, <<, <<<

Language
A language is considered as a finite set of strings over some finite set of alphabets. Computer
languages are considered as finite sets, and mathematically set operations can be performed on
them. Finite languages can be described by means of regular expressions.

2.2 Regular Expressions


The lexical analyzer needs to scan and identify only a finite set of valid strings/tokens/lexemes that belong to the language in hand. It searches for the patterns defined by the language rules.
Regular expressions have the capability to express finite languages by defining a pattern for finite strings of symbols. The grammar defined by regular expressions is known as regular grammar. The language defined by a regular grammar is known as a regular language.
A regular expression is an important notation for specifying patterns. Each pattern matches a set of strings, so regular expressions serve as names for sets of strings. Programming language tokens can be described by regular languages. The specification of regular expressions is an example of a recursive definition. Regular languages are easy to understand and have efficient implementations.
There are a number of algebraic laws that are obeyed by regular expressions, which can be used to manipulate regular expressions into equivalent forms.
Operations
The various operations on languages are:
Union of two languages L and M is written as
L U M = {s | s is in L or s is in M}

Concatenation of two languages L and M is written as
LM = {st | s is in L and t is in M}
The Kleene Closure of a language L is written as
L* = Zero or more occurrence of language L.
Notations
If r and s are regular expressions denoting the languages L(r) and L(s), then
Union : (r)|(s) is a regular expression denoting L(r) U L(s)
Concatenation : (r)(s) is a regular expression denoting L(r)L(s)
Kleene closure : (r)* is a regular expression denoting (L(r))*
(r) is a regular expression denoting L(r)
Precedence and Associativity
*, concatenation (.), and | (pipe sign) are left associative
* has the highest precedence
Concatenation (.) has the second highest precedence.
| (pipe sign) has the lowest precedence of all.
Representing valid tokens of a language in regular expressions
If x is a regular expression, then:
x* means zero or more occurrences of x, i.e., it can generate { ε, x, xx, xxx, ... }.
x+ means one or more occurrences of x, i.e., it can generate { x, xx, xxx, ... }; equivalently, x+ = x.x*.
x? means at most one occurrence of x, i.e., it can generate either {x} or {ε}.
[a-z] denotes all the lower-case letters of the English language.
[A-Z] denotes all the upper-case letters of the English language.
[0-9] denotes all the digits used in mathematics.
Details:
If Σ = {0, 1, 2}, then Σ+ denotes the set of all non-empty strings over Σ, and Σ* = Σ+ U { ε }.
Concatenation: if X = 01101 and Y = 110, then XY = 01101110.
For any string x, εx = xε = x.
Representing occurrences of symbols using regular expressions
letter = [a-z] | [A-Z]
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]

sign = [ + | - ]
Representing language tokens using regular expressions
Decimal = (sign)?(digit)+
Identifier = (letter)(letter | digit)*
The only problem left with the lexical analyzer is how to verify the validity of a regular
expression used in specifying the patterns of keywords of a language. A well-accepted solution is
to use finite automata for verification.
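As a hedged illustration of such pattern matching in code, using the C++ standard <regex> library (the patterns below are simplified versions of the definitions above, not the exact ones a real lexer would use):

#include <iostream>
#include <regex>
#include <string>

int main() {
    // Identifier = (letter)(letter | digit)*   (simplified; no underscore here)
    std::regex identifier("[A-Za-z][A-Za-z0-9]*");
    // Decimal = (sign)?(digit)+
    std::regex decimal("[+-]?[0-9]+");

    std::cout << std::regex_match("value", identifier) << "\n";   // 1 (matches)
    std::cout << std::regex_match("9lives", identifier) << "\n";  // 0 (does not match)
    std::cout << std::regex_match("-42", decimal) << "\n";        // 1 (matches)
    return 0;
}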
Example 2.2: Let L be the set of letters {A, B, ..., Z, a, b, ..., z} and let D be the set of digits {0, 1, ..., 9}. We may think of L and D in two, essentially equivalent, ways. One way is that L and D are, respectively, the alphabets of uppercase and lowercase letters and of digits. The second way is that L and D are languages, all of whose strings happen to be of length one. Here are some other languages that can be constructed from the languages L and D.
1. L U D is the set of letters and digits - strictly speaking, the language with 62 strings of length one, each of which is either one letter or one digit.
2. LD is the set of 520 strings of length two, each consisting of one letter followed by one digit.
3. L4 is the set of all 4-letter strings.
4. L* is the set of all strings of letters, including ε, the empty string.
5. L(L U D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.

2.3 Regular Definitions


For notational convenience, we may wish to give names to certain regular expressions and use those names in subsequent expressions, as if the names were themselves symbols. If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:
d1 → r1
d2 → r2
...
dn → rn
where:
1. Each di is a new symbol, not in Σ and not the same as any other of the d's, and
2. Each ri is a regular expression over the alphabet Σ U {d1, d2, ..., di-1}.

Example 2.3: C identifiers are strings of letters, digits, and underscores. Here is a regular definition for the language of C identifiers. We shall conventionally use italics for the symbols defined in regular definitions.
letter_ → A | B | ... | Z | a | b | ... | z | _
digit → 0 | 1 | ... | 9
id → letter_ ( letter_ | digit )*
Example 2.4: Unsigned numbers (integer or floating point) are strings such as 5280, 0.01234, 6.336E4, or 1.89E-4. The following regular definition describes them:

digit → 0 | 1 | ... | 9
digits → digit digit*
optionalFraction → . digits | ε
optionalExponent → ( E ( + | - | ε ) digits ) | ε
number → digits optionalFraction optionalExponent

2.4 Token Recognition


In the previous section we learned how to express patterns using regular expressions. Now, we
must study how to take the patterns for all the needed tokens and build a piece of code that
examines the input string and finds a prefix that is a lexeme matching one of the patterns. Our
discussion will make use of the following running example.

stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | number
The terminals of the grammar, which are if, then, else, relop, id, and number, are the names of
tokens as far as the lexical analyzer is concerned. The patterns for these tokens are described
using regular definitions.
digit → [0-9]
digits → digit+
number → digits (. digits)? (E [+-]? digits)?
letter → [A-Za-z]
id → letter (letter | digit)*
if → if
then → then
else → else
relop → < | > | <= | >= | = | <>
For this language, the lexical analyzer will recognize the keywords if, then, and else, as well as lexemes that match the patterns for relop, id, and number. To simplify matters, we make the common assumption that keywords are also reserved words: that is, they are not identifiers, even though their lexemes match the pattern for identifiers.
In addition, we assign the lexical analyzer the job of stripping out whitespace, by recognizing the
"token" ws defined by:
ws → ( blank | tab | newline )+
Here, blank, tab, and newline are abstract symbols that we use to express the ASCII characters of
the same names. Token ws is different from the other tokens in that, when we recognize it, we do
not return it to the parser, but rather restart the lexical analysis from the character that follows the
whitespace. It is the following token that gets returned to the parser.
Our goal for the lexical analyzer is summarized in Fig. 2.1. That table shows, for each lexeme or family of lexemes, which token name is returned to the parser and what attribute value, as discussed in Section 2.6, is returned. Note that for the six relational operators, symbolic constants LT, LE, and so on are used as the attribute value, in order to indicate which instance of the token relop we have found. The particular operator found will influence the code that is output from the compiler.

2.5 Finite Automata


Finite Automata = Abstract Computing Devices.

A finite automaton is a state machine that takes a string of symbols as input and changes its state accordingly. A finite automaton is a recognizer for regular expressions. When a regular-expression string is fed into a finite automaton, it changes its state for each literal. If the input string is successfully processed and the automaton reaches its final state, the string is accepted, i.e., the string just fed is said to be a valid token of the language in hand.

The mathematical model of a finite automaton consists of:
a finite set of states (Q)
a finite set of input symbols (Σ)
one start state (q0)
a set of final states (qf)
a transition function (δ)
The transition function (δ) maps a state and an input symbol to a state:
δ : Q × Σ → Q

2.5.1 Finite Automata Construction


Let L(r) be a regular language recognized by some finite automaton (FA).
States: The states of an FA are represented by circles. State names are written inside the circles.
Start state: The state from which the automaton starts is known as the start state. The start state has an arrow pointing towards it.
Intermediate states: All intermediate states have at least two arrows; one pointing to them and another pointing out from them.
Final state: If the input string is successfully parsed, the automaton is expected to be in this state. A final state is represented by double circles. It may have any odd number of arrows pointing to it and an even number of arrows pointing out from it; the number of incoming (odd) arrows is one greater than the number of outgoing (even) arrows, i.e., odd = even + 1.
Transition: The transition from one state to another happens when a desired symbol is found in the input. Upon transition, the automaton can either move to the next state or stay in the same state. Movement from one state to another is shown as a directed arrow, where the arrow points to the destination state. If the automaton stays in the same state, an arrow pointing from the state to itself is drawn.

Example: We assume an FA that accepts any three-digit binary value ending in the digit 1, so that FA = ({q0, ..., qf}, {0, 1}, q0, {qf}, δ).
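A minimal sketch of such a recognizer in C++, using a table-driven deterministic automaton (the state encoding is illustrative; state 4 is the only final state):

#include <iostream>
#include <string>

// DFA accepting three-digit binary strings that end in 1.
// States: 0 = start, 1 = one digit read, 2 = two digits read,
// 3 = three digits read ending in 0, 4 = three digits read ending in 1 (final),
// 5 = dead state (input too long).
bool accepts(const std::string& input) {
    static const int transition[6][2] = {   // transition[state][symbol]
        {1, 1}, {2, 2}, {3, 4}, {5, 5}, {5, 5}, {5, 5}
    };
    int state = 0;                          // q0
    for (char c : input) {
        if (c != '0' && c != '1') return false;   // symbol not in the alphabet
        state = transition[state][c - '0'];
    }
    return state == 4;                      // accept only in the final state
}

int main() {
    std::cout << accepts("101") << " " << accepts("110") << "\n";  // prints "1 0"
    return 0;
}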

2.5.2 Transition diagram


We shall assume that all our transition diagrams are deterministic, meaning that there is never more than one edge out of a given state with a given symbol among its labels.
Some important conventions about transition diagrams are:

1. Certain states are said to be accepting, or final. These states indicate that a lexeme has
been found, although the actual lexeme may not consist of all positions between the
lexemeBegin and forward pointers. We always indicate an accepting state by a double
circle, and if there is an action to be taken - typically returning a token and an attribute
value to the parser - we shall attach that action to the accepting state.
2. In addition, if it is necessary to retract the forward pointer one position (i.e., the lexeme
does not include the symbol that got us to the accepting state), then we shall additionally
place a * near that accepting state. In our example, it is never necessary to retract forward
by more than one position, but if it were, we could attach any number of *'s to the
accepting state.
Example 2.5:

Figure 2.2 is a transition diagram that recognizes the lexemes matching the token relop. We
begin in state 0, the start state. If we see < as the first input symbol, then among the lexemes that
match the pattern for relop we can only be looking at <, <>, or <=. We therefore go to state 1,
and look at the next character. If it is =, then we recognize lexeme <=, enter state 2, and return
the token relop with attribute LE, the symbolic constant representing this particular comparison
operator. If in state 1 the next character is >, then instead we have lexeme <>, and enter state 3 to
return an indication that the not-equals operator has been found. On any other character, the
lexeme is <, and we enter state 4 to return that information. Note, however, that state 4 has a * to
indicate that we must retract the input one position.

Figure 2.2: transition diagram for relop.

2.5.2.1 Recognizing reserved words and Identifiers
Recognizing keywords and identifiers presents a problem. Usually, keywords like if or then are reserved (as they are in our running example), so they are not identifiers even though they look like identifiers. Thus, although we typically use a transition diagram like that of Fig. 2.3 to search for identifier lexemes, this diagram will also recognize the keywords if, then, and else of our running example.

Figure 2.3: Transition diagram for keywords and identifiers


There are two ways that we can handle reserved words that look like identifiers:
1. Install the reserved words in the symbol table initially. A field of the symbol-table entry
indicates that these strings are never ordinary identifiers, and tells which token they
represent. We have supposed that this method is in use in Fig. 2.3. When we find an
identifier, a call to installID places it in the symbol table if it is not already there and
returns a pointer to the symbol-table entry for the lexeme found. Of course, any identifier
not in the symbol table during lexical analysis cannot be a reserved word, so its token is
id. The function getToken examines the symbol table entry for the lexeme found, and
returns whatever token name the symbol table says this lexeme represents - either id or
one of the keyword tokens that was initially installed in the table.
2. Create separate transition diagrams for each keyword, and try these diagrams before the diagram for identifiers, so that a reserved word is recognized as its own token rather than as an identifier.
Example 2.6:
The transition diagram for token number is shown in Fig. 2.4, and is so far the most complex
diagram we have seen. Beginning in state 12, if we see a digit, we go to state 13. In that state, we
can read any number of additional digits. However, if we see anything but a digit or a dot, we
have seen a number in the form of an integer; 123 is an example. That case is handled by
entering state 20, where we return token number and a pointer to a table of constants where the
found lexeme is entered. These mechanics are not shown on the diagram but are analogous to the
way we handled identifiers.

Figure 2.4: transition diagram for unsigned numbers

Example 2.7:
The final transition diagram, shown in Fig. 2.5, is for whitespace. In that diagram, we look for one or more
"whitespace" characters, represented by delim in that diagram - typically these characters would be blank, tab,
newline, and perhaps other characters that are not considered by the language design to be part of any token.

Figure 2.5: transition diagram for white space.


Note that in state 24, we have found a block of consecutive whitespace characters, followed by a non-whitespace character. We retract the input to begin at the non-whitespace character, but we do not return to the parser. Rather, we must restart the process of lexical analysis after the whitespace.

2.6 Attributes for Tokens


When more than one lexeme can match a pattern, the lexical analyzer must provide the subsequent compiler phases additional information about the particular lexeme that matched.
This is done by using an attribute value that describes the lexeme represented by the token.
The token name influences parsing decisions, while the attribute value influences the translation of tokens after the parse.
Practically, a token has one attribute: a pointer to the symbol-table entry in which the information about the token is kept.
The symbol-table entry contains various information about the token such as the lexeme, its type, the line number in which it was first seen, etc.
For example, in the FORTRAN assignment statement E = M * C ** 2, the tokens and their attributes are written as:
<id, pointer to symbol-table entry for E>
<assign_op>
<id, pointer to symbol-table entry for M>
<mult_op>
<id, pointer to symbol-table entry for C>
<exp_op>
<number, integer value 2>

2.7 Lexical Errors


It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error.
For example, suppose the string fi is encountered for the first time in a C program in the context:
fi (a == f(x)) . . .

a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier, since fi is a valid lexeme for the token id.
When an error occurs, the lexical analyzer recovers by:
Skipping (deleting) successive characters from the remaining input until the lexical analyzer can find a well-formed token (known as panic-mode recovery)
Deleting extraneous characters from the remaining input
Inserting missing characters into the remaining input
Replacing an incorrect character by a correct character
Transposing two adjacent characters
A minimal sketch of panic-mode recovery is shown below.
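A minimal sketch of panic-mode recovery, assuming a hypothetical helper that tests whether a character may begin a token (the function names are illustrative, not from this text):

#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical test: can this character start a well-formed token?
static bool canStartToken(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_' || c == '(';
}

// Skip characters starting at position pos until one that may start a token is found.
std::size_t panicModeRecover(const std::string& input, std::size_t pos) {
    while (pos < input.size() && !canStartToken(input[pos]))
        ++pos;                      // delete successive offending characters
    return pos;                     // lexical analysis restarts here
}

int main() {
    return panicModeRecover("@# fi(x)", 0) == 3 ? 0 : 1;  // skips "@# " and stops at 'f'
}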

2.8 Input Buffering


Here we will look at some ways in which the task of reading the source program can be sped up.
This task is made difficult by the fact that we often have to look one or more characters beyond the next lexeme before we can be sure we have the right lexeme.
We shall introduce a two-buffer scheme that handles large lookaheads safely.
We then consider an improvement involving sentinels that saves time checking for the ends of buffers.
Buffer Pairs
Because of the amount of time taken to process characters, and the large number of characters that must be processed during the compilation of a large source program, specialized buffering techniques have been developed to reduce the amount of overhead required to process a single input character.
An important scheme involves two buffers that are alternately reloaded, as suggested in Figure 2.6.
Figure 2.6: Using a pair of input buffers (the forward and lexemeBegin pointers are shown, with eof marking the end of input)
Each buffer is of the same size N, and N is usually the size of a disk block, e.g.,
4096 bytes
Using one system read command we can read N characters into a buffer, rather
than using one system call per character
If fewer than N characters remain in the input file, then a special character,
represented by eof, marks the end of the source file and is different from any
possible character of the source program
Two pointers to the input are maintained:
1. Pointer lexemeBegin, marks the beginning of the current lexeme, whose
extent we are attempting to determine
2. Pointer forward scans ahead until a pattern match is found

Once the lexeme is determined, forward is set to the character at its right end
(involves retracting)
Then, after the lexeme is recorded as an attribute value of the token returned to
the parser, lexemeBegin is set to the character immediately after the lexeme
just found
Advancing forward requires that we first test whether we have reached the end
of one of the buffers, and if so, we must reload the other buffer from the input,
and move forward to the beginning of the newly loaded buffer
Sentinels
If we use the previous scheme, we must check, each time we advance forward, that we have not moved off one of the buffers; if we have, then we must also reload the other buffer.
Thus, for each character read, we make two tests: one for the end of the buffer, and one to determine what character is read.
We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end.
Figure 2.7: Sentinels at the end of each buffer
The sentinel is a special character that cannot be part of the source program, and a natural choice is the character eof.
Note that eof retains its use as a marker for the end of the entire input.
Any eof that appears other than at the end of a buffer means that the input is at an end.
Figure 2.7 shows the same arrangement as Figure 2.6, but with the sentinels added.
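A minimal sketch of the forward-advance logic with sentinels, assuming two buffers of size N, each ending in an eof sentinel slot (the buffer handling and names are illustrative, not from this text):

#include <cstddef>

const std::size_t N = 4096;            // buffer size, e.g., one disk block
const char EOF_CHAR = '\0';            // stands in for the special eof character

char buffer1[N + 1];                   // the last slot of each buffer holds the sentinel
char buffer2[N + 1];
char* forward = buffer1;               // the scanning pointer

// Illustrative stub: a real lexer would read up to N source characters here
// and place the eof sentinel after the last character read.
void reload(char* buf) { buf[0] = EOF_CHAR; buf[N] = EOF_CHAR; }

// Advance forward by one character; in the common case only one test is made.
char nextChar() {
    char c = *forward++;
    if (c == EOF_CHAR) {                          // sentinel or true end of input
        if (forward == buffer1 + N + 1) {         // ran off the end of buffer 1
            reload(buffer2);
            forward = buffer2;
            c = *forward++;
        } else if (forward == buffer2 + N + 1) {  // ran off the end of buffer 2
            reload(buffer1);
            forward = buffer1;
            c = *forward++;
        }
        // otherwise eof marks the end of the entire input and is returned as-is
    }
    return c;
}

int main() {
    reload(buffer1);                   // start with an (empty) first buffer
    return nextChar() == EOF_CHAR ? 0 : 1;
}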

2.9 Architecture of a Transition-Diagram-Based Lexical Analyzer


There are several ways that a collection of transition diagrams can be used to build a lexical analyzer. Regardless of the overall strategy, each state is represented by a piece of code. We may imagine a variable state holding the number of the current state for a transition diagram. A switch based on the value of state takes us to code for each of the possible states, where we find the action of that state. Often, the code for a state is itself a switch statement or multiway branch that determines the next state by reading and examining the next input character.
Example 2.8:
In Fig. 2.8 we see a sketch of getRelop(), a C++ function whose job is to simulate the transition diagram of Fig. 2.2 and return an object of type TOKEN, that is, a pair consisting of the token name (which must be relop in this case) and an attribute value (the code for one of the six comparison operators in this case). getRelop() first creates a new object retToken and initializes its first component to RELOP, the symbolic code for token relop. We see the typical behavior of a state in case 0, the case where the current

state is 0. A function nextchar() obtains the next character from the input and assigns it to the local variable c. We then check c for the three characters we expect to find, making the state transition dictated by the transition diagram of Fig. 2.2 in each case. For example, if the next input character is =, we go to state 5.
If the next input character is not one that can begin a comparison operator, then a function fail() is called. What fail() does depends on the global error-recovery strategy of the lexical analyzer. It should reset the forward pointer to lexemeBegin, in order to allow another transition diagram to be applied to the true beginning of the unprocessed input.

We also show the action for state 8 in Fig. 2.8. Because state 8 bears a *, we must retract the input pointer one position (i.e., put c back on the input stream). That task is accomplished by the function retract(). Since state 8 represents the recognition of the lexeme >, we set the second component of the returned object, which we suppose is named attribute, to GT, the code for this operator.
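The figure itself is not reproduced here; the following is a hedged C++ sketch of what getRelop() might look like, based only on the description above (the TOKEN type, nextchar(), retract(), fail(), and the symbolic constants are stubbed in for illustration and would normally be defined elsewhere in the lexer):

#include <cstddef>
#include <string>

// Illustrative symbolic constants and token type; a real lexer defines these elsewhere.
enum { RELOP = 1 };
enum { LT, LE, EQ, NE, GT, GE };
struct TOKEN { int name; int attribute; };

// Hypothetical helpers standing in for the lexer's input machinery.
static std::string input = "<=";        // sample input, for illustration only
static std::size_t pos = 0;
char nextchar() { return pos < input.size() ? input[pos++] : '\0'; }
void retract()  { if (pos > 0) --pos; } // put the last character back
void fail()     { /* reset forward to lexemeBegin and try another diagram */ }

TOKEN getRelop() {
    TOKEN retToken = { RELOP, 0 };      // the token name is relop in every case
    int state = 0;
    char c;
    while (true) {
        switch (state) {
        case 0:
            c = nextchar();
            if (c == '<') state = 1;
            else if (c == '=') { retToken.attribute = EQ; return retToken; }      // state 5: "="
            else if (c == '>') state = 6;
            else { fail(); return retToken; }   // not a relop; caller tries another diagram
            break;
        case 1:                                 // seen '<'
            c = nextchar();
            if (c == '=') { retToken.attribute = LE; return retToken; }           // "<="
            else if (c == '>') { retToken.attribute = NE; return retToken; }      // "<>"
            else { retract(); retToken.attribute = LT; return retToken; }         // "<"
        case 6:                                 // seen '>'
            c = nextchar();
            if (c == '=') { retToken.attribute = GE; return retToken; }           // ">="
            state = 8;                          // state 8 bears a *: retraction needed
            break;
        case 8:
            retract();                          // lexeme is just ">"
            retToken.attribute = GT;
            return retToken;
        }
    }
}

int main() { return getRelop().attribute == LE ? 0 : 1; }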


Review Questions
Ques 1: In a compiler, the module that checks every character of the source text is called
A) The code generator
B) The code optimizer
C) Lexical analyzer
D) Syntax analyzer

Ques 2: Which of the following strings can definitely be said to be tokens without looking at the next input character while compiling a Pascal program?
I. Begin
II. Program
III. <>
A) I
B) II
C) III
Ques 3: In some programming languages, an identifier is permitted to be a letter followed by any number of letters or digits. If L and D denote the sets of letters and digits respectively, which of the following expressions defines an identifier?
A) (L+D)+
B) L(L+D)*
C) (L.D)*
D) L(L.D)*

Ques 4: How many tokens are there in the following C statement?
printf("j=%d, &j=%x", j, &j);
A) 4
B) 5
C) 9
D) 10

Ques 5: In a compiler, the data structure responsible for the management of information about variables and their attributes is
A) Semantic stack
B) Parser table
C) Symbol table
D) Abstract syntax tree

Chapter Summary
Tokens. The lexical analyzer scans the source program and produces as output a sequence
of tokens, which are normally passed, one at a time to the parser. Some tokens may
consist only of a token name while others may also have an associated lexical value that
gives information about the particular instance of the token that has been found on the
input.
Lexemes. Each time the lexical analyzer returns a token to the parser, it has an associated lexeme - the sequence of input characters that the token represents.
Buffering. Because it is often necessary to scan ahead on the input in order to see where
the next lexeme ends, it is usually necessary for the lexical analyzer to buffer its input.
Using a pair of buffers cyclically and ending each buffer's contents with a sentinel that
warns of its end are two techniques that accelerate the process of scanning the input.
Patterns. Each token has a pattern that describes which sequences of characters can form
the lexemes corresponding to that token. The set of words or strings of characters that
match a given pattern is called a language.
Regular Expressions. These expressions are commonly used to describe patterns. Regular
expressions are built from single characters, using union, concatenation, and the Kleene
closure, or any-number-of, operator.
Regular Definitions. Complex collections of languages, such as the patterns that describe
the tokens of a programming language, are often defined by a regular definition, which is
a sequence of statements that each define one variable to stand for some regular
expression. The regular expression for one variable can use previously defined variables
in its regular expression.

Extended Regular-Expression Notation. A number of additional operators may appear as shorthands in regular expressions, to make it easier to express patterns. Examples include the + operator (one-or-more-of), ? (zero-or-one-of), and character classes (the union of the strings each consisting of one of the characters).
Transition Diagrams. The behavior of a lexical analyzer can often be described by a
transition diagram. These diagrams have states, each of which represents something about
the history of the characters seen during the current search for a lexeme that matches one
of the possible patterns. There are arrows, or transitions, from one state to another, each
of which indicates the possible next input characters that cause the lexical analyzer to
make that change of state.
========================THE END====================

Chapter Three
Syntax Analysis
By design, every programming language has precise rules that prescribe the syntactic structure of well-formed programs.
The syntax of programming language constructs can be specified by context-free grammars (CFGs) or BNF notation (both are discussed in the previous course).
The use of CFGs has several advantages:
it helps in identifying ambiguities
a grammar gives a precise yet easy-to-understand syntactic specification of a programming language
it is possible to have a tool that automatically produces a parser from the grammar
a properly designed grammar makes it easy to modify the parser when the language changes
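For illustration (a small example grammar, not taken from this text), an arithmetic-expression fragment of a language might be specified as:

expr → expr + term | term
term → term * factor | factor
factor → ( expr ) | id

Such a grammar can be handed to a parser-generating tool, and extending the language (for example, adding a new operator) usually requires only adding or modifying productions.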
3.1 The Role of the Parser

In our compiler model, the parser obtains a string of tokens from the lexical analyzer, as shown in Figure 3.1, and verifies that the string of token names can be generated by the grammar of the source language.
It is expected that the parser reports any syntax errors in an intelligible fashion and recovers from commonly occurring errors so that it can continue processing the remainder of the program.
Conceptually, for well-formed programs, the parser constructs a parse tree and passes it to the rest of the compiler for further processing.
The methods commonly used in compilers can be classified as being either top-down or bottom-up.
Top-down methods build parse trees from the top (root) to the bottom (leaves), while bottom-up methods start from the leaves and work their way up to the root.


In either case, the input to the parser is scanned from left to right, one symbol at a time.

Figure 3.1: position of parser in compiler model


There are three general types of parsers for grammars: universal, top-down, and bottom-up. Universal parsing methods such as the Cocke-Younger-Kasami algorithm and Earley's algorithm can parse any grammar (see the bibliographic notes). These general methods are, however, too inefficient to use in production compilers.
The most efficient top-down and bottom-up methods work only for subclasses of grammars, but several of these classes, particularly the LL and LR grammars, are expressive enough to describe most of the syntactic constructs in modern programming languages.
In practice, there are a number of tasks that might be conducted during parsing, such as collecting information about various tokens into the symbol table, performing type checking and other kinds of semantic analysis, and generating intermediate code.
