Compiler Design
By: Haftu Hagos
Chapter One
Compilers
1.1 Introduction
Computers are a balanced mix of software and hardware. Hardware is just a piece of mechanical
equipment, and its functions are controlled by compatible software. Hardware understands
instructions in the form of electronic charge, which is the counterpart of binary language in
software programming. Binary language has only two symbols, 0 and 1. To instruct the
hardware, codes must be written in binary format, which is simply a series of 1s and 0s. Writing
such codes would be a difficult and cumbersome task for computer programmers, which is why
we have compilers.
The high-level language is converted into binary language in various phases. A compiler is a
program that converts a high-level language into assembly language. Similarly, an assembler is a
program that converts assembly language into machine-level language. Let us first understand
how a program, using a C compiler, is executed on a host machine.
A linker tool is used to link all the parts of the program together into executable machine code.
A loader loads all of them into memory, and then the program is executed.
Before diving straight into the concepts of compilers, we should understand a few other tools that
work closely with compilers.
1.1.1 Preprocessor
A preprocessor, generally considered a part of the compiler, is a tool that produces input for
compilers. It may perform the following functions:
1. Macro processing: A preprocessor may allow a user to define macros that are
shorthands for longer constructs.
2. File inclusion: A preprocessor may include header files into the program text.
3. Rational preprocessing: These preprocessors augment older languages with more modern
flow-of-control and data-structuring facilities.
4. Language extensions: These preprocessors attempt to add capabilities to the language by
means of built-in macros.
1.1.2 Interpreter
An interpreter, like a compiler, translates high-level language into low-level machine language.
The difference lies in the way they read the source code. A compiler reads the whole
source code at once, creates tokens, checks semantics, generates intermediate code, translates the
whole program, and may involve many passes. In contrast, an interpreter reads a statement from
the input, converts it to an intermediate code, executes it, then takes the next statement in
sequence. If an error occurs, an interpreter stops execution and reports it, whereas a compiler
reads the whole program even if it encounters several errors.
Languages such as BASIC, SNOBOL, and LISP can be translated using interpreters. Java also
uses an interpreter. The process of interpretation can be carried out in the following phases:
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Direct execution
Advantages:
Modification of the user program can easily be made and applied as execution proceeds.
The type of object that a name denotes may change dynamically.
Debugging a program and finding errors is a simplified task for a program used for
interpretation.
The interpreter for the language makes it machine independent.
Disadvantages:
The execution of the program is slower.
Memory consumption is higher.
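The statement-at-a-time behaviour described above can be sketched as follows. This is a toy illustration, not a real interpreter: the assignment-only "language" and the use of Python's eval as stand-in "intermediate code execution" are assumptions made purely for brevity.

```python
# Minimal sketch of statement-at-a-time interpretation: each input line is
# converted to a simple intermediate form and executed immediately, stopping
# at the first error -- unlike a compiler, which processes the whole program
# before any execution takes place.

def interpret(lines):
    env = {}                                   # runtime environment: name -> value
    for lineno, line in enumerate(lines, 1):
        name, _, expr = line.partition("=")    # "x = 2" -> ("x ", "=", " 2")
        name, expr = name.strip(), expr.strip()
        if not name.isidentifier():
            return f"error at line {lineno}: bad identifier"   # stop on first error
        try:
            # the "intermediate code" here is just a Python expression
            # evaluated against the current environment
            env[name] = eval(expr, {}, dict(env))
        except Exception:
            return f"error at line {lineno}: bad expression"
    return env
```

Note how execution halts as soon as a bad statement is reached, even if later statements are valid; a compiler would instead report all errors before producing any code.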
1.1.3 Assembler
Programmers found it difficult to write or read programs in machine language. They began to use
a mnemonic (symbol) for each machine instruction, which they would subsequently translate
into machine language. Such a mnemonic machine language is now called an assembly
language. Programs known as assemblers were written to automate the translation of assembly
language into machine language. The input to an assembler program is called the source program;
the output is a machine-language translation (object program).
The output of an assembler is called an object file, which contains a combination of machine
instructions as well as the data required to place these instructions in memory.
1.1.4 Linker
A linker is a computer program that links and merges various object files together in order to make
an executable file. All these files might have been compiled by separate assemblers. The major
task of a linker is to search for and locate referenced modules/routines in a program and to determine
the memory locations where these codes will be loaded, making the program instructions have
absolute references.
1.1.5 Loader
A loader is a part of the operating system and is responsible for loading executable files into memory
and executing them. It calculates the size of a program (instructions and data) and creates memory
space for it. It initializes various registers to initiate execution.
1.1.6 Cross-compiler
A compiler that runs on platform (A) and is capable of generating executable code for platform
(B) is called a cross-compiler.
1.1.7 Source-to-source Compiler
A compiler that takes the source code of one programming language and translates it into the
source code of another programming language is called a source-to-source compiler.
1.3.2 Synthesis Phase
Known as the back-end of the compiler, the synthesis phase generates the target program with
the help of the intermediate code representation and the symbol table.
A compiler can have many phases and passes.
Lexical analysis: This is the initial part of reading and analyzing the program text. The text is
read and divided into tokens, each of which corresponds to a symbol in the programming
language, e.g., a variable name, keyword, or number.
Syntax analysis: This phase takes the list of tokens produced by the lexical analysis and arranges
these in a tree-structure (called the syntax tree) that reflects the structure of the program. This
phase is often called parsing.
Semantic Analysis: Semantic analysis checks whether the parse tree constructed follows the
rules of the language: for example, that values are assigned only between compatible data types,
and that a string is not added to an integer. The semantic analyzer also keeps track of identifiers,
their types, and expressions, and checks whether identifiers are declared before use. It produces
an annotated syntax tree as output. It reports an error, e.g., if a variable is used but not declared,
or if it is used in a context that does not make sense given its type, such as trying to use a
boolean value as a function pointer.
Intermediate Code Generation: After semantic analysis, the compiler generates an
intermediate code of the source code for the target machine. It represents a program for some
abstract machine. It is in between the high-level language and the machine language. This
intermediate code should be generated in such a way that it makes it easier to be translated into
the target machine code.
Code Optimization: The next phase does code optimization of the intermediate code.
Optimization can be assumed as something that removes unnecessary code lines, and arranges
the sequence of statements in order to speed up the program execution without wasting resources
(CPU, memory).
Code Generation: In this phase, the code generator takes the optimized representation of the
intermediate code and maps it to the target machine language. The code generator translates the
intermediate code into a sequence of (generally) relocatable machine code. This sequence of
machine-code instructions performs the task just as the intermediate code would.
Symbol Table: This is a data structure maintained throughout all the phases of a compiler. All the
identifier names along with their types are stored here. The symbol table makes it easier for the
compiler to quickly search for an identifier record and retrieve it. The symbol table is also used for
scope management.
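As a sketch of the scope management just mentioned (a hypothetical layout, not any particular compiler's implementation), a symbol table can be kept as a stack of per-scope dictionaries:

```python
# Hypothetical scope-aware symbol table: one dictionary per scope, with
# lookups searching from the innermost scope outward.
class SymbolTable:
    def __init__(self):
        self.scopes = [{}]                 # start with the global scope

    def enter_scope(self):
        self.scopes.append({})             # a new block opens a new scope

    def exit_scope(self):
        self.scopes.pop()                  # leaving the block discards its names

    def declare(self, name, type_):
        self.scopes[-1][name] = type_      # record identifier and its type

    def lookup(self, name):
        # search innermost scope first, then the enclosing scopes
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        return None                        # undeclared identifier
```

An inner declaration shadows an outer one with the same name, and becomes invisible again once its scope is exited, which is exactly the behaviour block-structured languages require.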
Review Questions
1) What is the difference between high-level languages and machine languages?
2) Define the terms computer hardware and software.
3) Write the phases of the language processing system neatly.
4) What is the difference between a cross-compiler and a source-to-source compiler?
5) Write in detail all the core parts being done in the front-end of the compiler.
6) Write in detail all the core parts being done in the back-end of the compiler.
7) Describe linkers and loaders in detail.
========================THE END====================
Chapter Two
Lexical Analysis
Lexical analysis is the first phase of a compiler. It takes the modified source code from language
preprocessors, written in the form of sentences. The lexical analyzer breaks this text into a series
of tokens, removing any whitespace and comments in the source code.
If the lexical analyzer finds a token invalid, it generates an error. The lexical analyzer works
closely with the syntax analyzer. It reads character streams from the source code, checks for legal
tokens, and passes the data to the syntax analyzer on demand.
A lexical analyzer, or lexer for short, takes as its input a string of individual letters and divides
this string into tokens. Additionally, it filters out whatever separates the tokens (the so-called
whitespace), i.e., layout characters (spaces, newlines, etc.) and comments.
The main purpose of lexical analysis is to make life easier for the subsequent syntax analysis
phase. In theory, the work that is done during lexical analysis can be made an integral part of
syntax analysis, and in simple systems this is indeed often done. However, there are
reasons for keeping the phases separate:
Efficiency: A lexer may do the simple parts of the work faster than the more general
parser can. Furthermore, the size of a system that is split in two may be smaller than a
combined system. This may seem paradoxical but, as we shall see, there is a non-linear
factor involved which may make a separated system smaller than a combined system.
Modularity: The syntactical description of the language need not be cluttered with small
lexical details such as white-space and comments.
Tradition: Languages are often designed with separate lexical and syntactical phases in
mind, and the standard documents of such languages typically separate lexical and
syntactical elements of the languages.
For lexical analysis, specifications are traditionally written using regular expressions: An
algebraic notation for describing sets of strings. The generated lexers are in a class of extremely
simple programs called finite automata.
printf("Total = %d\n", score);
both printf and score are lexemes matching the pattern for the token id, and "Total = %d\n" is
a lexeme matching the pattern for the token literal.
2.1.1 Tokens
Lexemes are said to be a sequence of characters (alphanumeric) in a token. There are some
predefined rules for every lexeme to be identified as a valid token. These rules are defined by
grammar rules, by means of a pattern. A pattern explains what can be a token, and these patterns
are defined by means of regular expressions.
In a programming language, keywords, constants, identifiers, strings, numbers, operators, and
punctuation symbols can be considered tokens.
For example, in the C language, the variable declaration line
int value = 100;
contains the tokens:
int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol).
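A minimal sketch of how such tokens could be matched with regular expressions (illustrative Python; real lexers are usually generated from the patterns by tools such as lex, and the token names and patterns below are assumptions made for this example):

```python
import re

# Toy tokenizer for the declaration above. Order matters: the KEYWORD
# alternative must come before IDENTIFIER so that "int" is classified
# as a keyword rather than as an ordinary identifier.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:int|if|then|else)\b"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("CONSTANT",   r"\d+"),
    ("OPERATOR",   r"="),
    ("SYMBOL",     r";"),
    ("WS",         r"\s+"),            # whitespace is matched but discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    return [(m.lastgroup, m.group())
            for m in MASTER.finditer(text)
            if m.lastgroup != "WS"]    # filter out the whitespace "token"
```

Running it on the declaration above yields exactly the five tokens listed in the text.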
Let us understand how language theory treats the following terms:
Alphabets
Any finite set of symbols is an alphabet: {0,1} is the set of binary symbols,
{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the set of hexadecimal symbols, and {a-z, A-Z} is the set of
English-language letters.
Strings
Any finite sequence of symbols from an alphabet is called a string. The length of a string is the
total number of occurrences of symbols in it; e.g., the length of the string St. Mary is 8, denoted
|St. Mary| = 8. A string having no symbols, i.e., a string of zero length, is known as the empty
string and is denoted by ε (epsilon).
Special Symbols
A typical high-level language contains the following symbols:-
Arithmetic Symbols: Addition (+), Subtraction (-), Multiplication (*), Division (/), Modulo (%)
Punctuation: Comma (,), Semicolon (;), Dot (.)
Assignment: =
Special Assignment: +=, -=, *=, /=
Comparison: ==, !=, <, <=, >, >=
Preprocessor: #
Location Specifier: &
Logical: &, &&, |, ||, !
Shift Operator: >>, <<
Language
A language is considered as a finite set of strings over some finite set of alphabets. Computer
languages are considered as finite sets, and mathematically set operations can be performed on
them. Finite languages can be described by means of regular expressions.
Concatenation of two languages L and M is written as
LM = {st | s is in L and t is in M}
The Kleene Closure of a language L is written as
L* = zero or more occurrences of language L.
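The two operations above can be illustrated on small finite languages (a Python sketch; since L* is an infinite set, the closure is bounded here by a maximum string length, which is an assumption made only so the example terminates):

```python
# Set-level view of concatenation and (length-bounded) Kleene closure.
def concat(L, M):
    # LM = { st | s in L and t in M }
    return {s + t for s in L for t in M}

def kleene(L, max_len):
    # L* = { empty string } union L union LL union ..., cut off at max_len
    result = {""}                 # L^0 contains only the empty string
    frontier = {""}
    while True:
        frontier = {s + t for s in frontier for t in L if len(s + t) <= max_len}
        if not frontier - result: # no new strings appeared: stop
            break
        result |= frontier
    return result
```

For example, concatenating {a, b} with {c} gives {ac, bc}, and the closure of {a} bounded at length 3 is {ε, a, aa, aaa}.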
Notations
If r and s are regular expressions denoting the languages L(r) and L(s), then
Union : (r)|(s) is a regular expression denoting L(r) U L(s)
Concatenation : (r)(s) is a regular expression denoting L(r)L(s)
Kleene closure : (r)* is a regular expression denoting (L(r))*
(r) is a regular expression denoting L(r)
Precedence and Associativity
*, concatenation (.), and | (pipe sign) are left associative
* has the highest precedence
Concatenation (.) has the second highest precedence.
| (pipe sign) has the lowest precedence of all.
Representing valid tokens of a language in regular expression
If x is a regular expression, then:
x* means zero or more occurrences of x,
i.e., it can generate { ε, x, xx, xxx, xxxx, … }
x+ means one or more occurrences of x,
i.e., it can generate { x, xx, xxx, xxxx, … }; x+ is equivalent to x.x*
x? means at most one occurrence of x,
i.e., it can generate either {x} or {ε}.
[a-z] is all lower-case letters of the English language.
[A-Z] is all upper-case letters of the English language.
[0-9] is all natural digits used in mathematics.
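These operators can be checked directly with Python's re module (a quick sketch; \Z anchors each pattern at the end of the string so only full matches count):

```python
import re

# The regular-expression operators above, as re patterns:
star  = re.compile(r"x*\Z")       # zero or more x
plus  = re.compile(r"x+\Z")       # one or more x
opt   = re.compile(r"x?\Z")       # at most one x
lower = re.compile(r"[a-z]+\Z")   # one or more lower-case letters
```

Note that star accepts the empty string while plus does not, and opt rejects anything longer than a single x, exactly as the generated sets above say.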
Details:
L* = L^0 ∪ L^1 ∪ L^2 ∪ …
L+ = L^1 ∪ L^2 ∪ …
L+ = L* − { ε }
L* = L+ ∪ { ε }
Concatenation:
If X = 01101 and Y = 110, then XY = 01101110.
For any string x, εx = xε = x.
Representing occurrence of symbols using regular expressions
letter = [a-z] or [A-Z]
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]
sign = [ + | - ]
Representing language tokens using regular expressions
Decimal = (sign)?(digit)+
Identifier = (letter)(letter | digit)*
The only problem left with the lexical analyzer is how to verify the validity of a regular
expression used in specifying the patterns of keywords of a language. A well-accepted solution is
to use finite automata for verification.
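As a sketch, the two token patterns above can be written and tested directly in Python's re syntax:

```python
import re

# Decimal = (sign)?(digit)+  and  Identifier = (letter)(letter | digit)*
decimal    = re.compile(r"[+-]?[0-9]+\Z")
identifier = re.compile(r"[A-Za-z][A-Za-z0-9]*\Z")
```

A signed integer such as -42 matches the first pattern, while 4.2 does not (there is no fraction in the definition); value1 matches the second pattern, while 1value does not, because an identifier must start with a letter.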
Example 2.2: A regular definition is a sequence of definitions of the form:
d1 → r1
d2 → r2
…
dn → rn
where:
1. Each di is a new symbol, not in Σ and not the same as any other of the d's, and
2. Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, …, di−1}.
Example 2.3: C identifiers are strings of letters, digits, and underscores. Here is a regular
definition for the language of C identifiers. We shall conventionally use italics for the
symbols defined in regular definitions.
letter_ → A | B | … | Z | a | b | … | z | _
digit → 0 | 1 | … | 9
id → letter_ ( letter_ | digit )*
Example 2.4: Unsigned numbers (integer or floating point) are strings such as 5280, 0.01234, 6.336E4, or
1.89E-4. The regular definition:
digit → 0 | 1 | … | 9
digits → digit digit*
optionalFraction → . digits | ε
optionalExponent → ( E ( + | - | ε ) digits ) | ε
number → digits optionalFraction optionalExponent
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | number
The terminals of the grammar, which are if, then, else, relop, id, and number, are the names of
tokens as far as the lexical analyzer is concerned. The patterns for these tokens are described
using regular definitions.
digit → [0-9]
digits → digit+
number → digits ( . digits )? ( E [+-]? digits )?
letter → [A-Za-z]
id → letter ( letter | digit )*
if → if
then → then
else → else
relop → < | > | <= | >= | = | <>
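A sketch of the number and relop definitions above in Python's re syntax. Note that in the relop pattern the longer alternatives such as <= must precede < so that the longest possible lexeme wins:

```python
import re

# number -> digits ( . digits )? ( E [+-]? digits )?
number = re.compile(r"[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?\Z")

# relop -> < | > | <= | >= | = | <>  (longest alternatives listed first)
relop = re.compile(r"(?:<=|>=|<>|<|>|=)\Z")
```

All four sample numbers from Example 2.4 (5280, 0.01234, 6.336E4, 1.89E-4) match the number pattern, and the six relational operators match relop.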
For this language, the lexical analyzer will recognize the keywords if, then, and else, as well as
lexemes that match the patterns for relop, id, and number. To simplify matters, we make the
common assumption that keywords are also reserved words: that is, they are not identifiers, even
though their lexemes match the pattern for identifiers.
In addition, we assign the lexical analyzer the job of stripping out whitespace, by recognizing the
"token" ws defined by:
ws → ( blank | tab | newline )+
Here, blank, tab, and newline are abstract symbols that we use to express the ASCII characters of
the same names. Token ws is different from the other tokens in that, when we recognize it, we do
not return it to the parser, but rather restart the lexical analysis from the character that follows the
whitespace. It is the following token that gets returned to the parser.
Our goal for the lexical analyzer is summarized in Fig. 2.1. That table shows, for each lexeme or
family of lexemes, which token name is returned to the parser and what attribute value, as
discussed in Section 3.1.3, is returned. Note that for the six relational operators, symbolic
constants LT, LE, and so on are used as the attribute value, in order to indicate which instance of
the token relop we have found. The particular operator found will influence the code that is
output from the compiler.
A finite automaton is a state machine that takes a string of symbols as input and changes its state
accordingly. A finite automaton is a recognizer for regular expressions. When an input
string is fed into a finite automaton, it changes its state for each symbol. If the input string
is successfully processed and the automaton reaches a final state, the string is accepted, i.e., the
string just fed is a valid token of the language in hand.
The mathematical model of finite automata consists of:
Finite set of states (Q)
Finite set of input symbols (Σ)
One start state (q0)
Set of final states (qf)
Transition function (δ)
The transition function δ maps a state from the finite set of states Q and an input symbol from Σ
to a state in Q:
δ : Q × Σ → Q
2.5.1 Example
We assume an FA that accepts any three-digit binary value ending in the digit 1:
FA = (Q, {0, 1}, q0, {qf}, δ), where Q is the set of states containing the start state q0 and the
final state qf.
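The example above can be simulated directly. In this Python sketch the intermediate state numbers are assumptions, since only q0 and qf are named in the text: state 0 is q0, state 4 is the final state qf, and states 1-3 track how many digits have been read.

```python
# A DFA for the example: binary strings of length three that end in 1.
# The transition function delta is a plain dictionary on (state, symbol).
DELTA = {
    (0, "0"): 1, (0, "1"): 1,   # after the first digit
    (1, "0"): 2, (1, "1"): 2,   # after the second digit
    (2, "0"): 3, (2, "1"): 4,   # the third digit decides acceptance
}
START, FINAL = 0, {4}

def accepts(s):
    state = START
    for ch in s:
        if (state, ch) not in DELTA:
            return False          # no transition: string too long or bad symbol
        state = DELTA[(state, ch)]
    return state in FINAL         # accept only if we ended in a final state
```

Strings like 101 and 111 are accepted, while 100 (ends in 0), 1 (too short), and 1011 (too long) are all rejected.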
1. Certain states are said to be accepting, or final. These states indicate that a lexeme has
been found, although the actual lexeme may not consist of all positions between the
lexemeBegin and forward pointers. We always indicate an accepting state by a double
circle, and if there is an action to be taken - typically returning a token and an attribute
value to the parser - we shall attach that action to the accepting state.
2. In addition, if it is necessary to retract the forward pointer one position (i.e., the lexeme
does not include the symbol that got us to the accepting state), then we shall additionally
place a * near that accepting state. In our example, it is never necessary to retract forward
by more than one position, but if it were, we could attach any number of *'s to the
accepting state.
Example 2.3 :
Figure 2.2 is a transition diagram that recognizes the lexemes matching the token relop. We
begin in state 0, the start state. If we see < as the first input symbol, then among the lexemes that
match the pattern for relop we can only be looking at <, <>, or <=. We therefore go to state 1,
and look at the next character. If it is =, then we recognize lexeme <=, enter state 2, and return
the token relop with attribute LE, the symbolic constant representing this particular comparison
operator. If in state 1 the next character is >, then instead we have lexeme <>, and enter state 3 to
return an indication that the not-equals operator has been found. On any other character, the
lexeme is <, and we enter state 4 to return that information. Note, however, that state 4 has a * to
indicate that we must retract the input one position.
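The transition diagram described above translates almost line-for-line into code. In this Python sketch, the retracting states 4 and 8 simply consume one input character instead of two, which has the same effect as reading ahead and then retracting:

```python
# Direct transcription of the relop transition diagram (states numbered
# as in the text; 4 and 8 are the retracting states marked with *).
LT, LE, EQ, NE, GT, GE = "LT", "LE", "EQ", "NE", "GT", "GE"

def relop(text, pos):
    """Return (token attribute, next position) or None if no relop starts at pos."""
    c = text[pos] if pos < len(text) else ""
    nxt = text[pos + 1] if pos + 1 < len(text) else ""
    if c == "<":
        if nxt == "=": return (LE, pos + 2)   # state 2: lexeme <=
        if nxt == ">": return (NE, pos + 2)   # state 3: lexeme <>
        return (LT, pos + 1)                  # state 4*: retract; lexeme is <
    if c == "=":
        return (EQ, pos + 1)                  # state 5: lexeme =
    if c == ">":
        if nxt == "=": return (GE, pos + 2)   # state 7: lexeme >=
        return (GT, pos + 1)                  # state 8*: retract; lexeme is >
    return None                               # no comparison operator here
```

The returned position already accounts for retraction: for the input <x the function reports LT and leaves the scanner positioned at the x.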
2.5.2.1 Recognizing reserved words and identifiers
Recognizing keywords and identifiers presents a problem. Usually, keywords like if or then are
reserved (as they are in our running example), so they are not identifiers even though they look
like identifiers. Thus, although we typically use a transition diagram like that of Fig. 2.3 to
search for identifier lexemes, this diagram will also recognize the keywords if, then, and else of
our running example.
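One common way to resolve this, sketched here in Python: match the identifier pattern first, then consult a reserved-word table to decide whether the lexeme is a keyword or an ordinary identifier.

```python
import re

# Reserved-word table for the running example's keywords.
RESERVED = {"if", "then", "else"}
ID = re.compile(r"[A-Za-z][A-Za-z0-9]*")

def classify(lexeme):
    # First check that the lexeme matches the identifier pattern at all.
    if not ID.fullmatch(lexeme):
        return None
    # The same pattern matched both keywords and identifiers; the table
    # lookup is what separates them.
    return ("KEYWORD", lexeme) if lexeme in RESERVED else ("ID", lexeme)
```

Under this scheme the misspelling fi from the error-recovery discussion below is happily classified as an ordinary identifier, which is precisely why the lexer alone cannot flag it as an error.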
Example 2.5:
The final transition diagram, shown in Fig. 2.5, is for whitespace. In that diagram, we look for one or more
"whitespace" characters, represented by delim in that diagram - typically these characters would be blank, tab,
newline, and perhaps other characters that are not considered by the language design to be part of any token.
A lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an
undeclared function identifier, since fi is a valid lexeme for the token identifier.
When an error occurs, the lexical analyzer recovers by:
Skipping (deleting) successive characters from the remaining input until the
lexical analyzer can find a well-formed token (known as panic-mode recovery)
Deleting an extraneous character from the remaining input
Inserting a missing character into the remaining input
Replacing an incorrect character by a correct character
Transposing two adjacent characters
Figure 2.6: Using a pair of input buffers (with pointers lexemeBegin and forward, and eof
marking the end of the input)
Each buffer is of the same size N, and N is usually the size of a disk block, e.g.,
4096 bytes
Using one system read command we can read N characters into a buffer, rather
than using one system call per character
If fewer than N characters remain in the input file, then a special character,
represented by eof, marks the end of the source file and is different from any
possible character of the source program
Two pointers to the input are maintained:
1. Pointer lexemeBegin, marks the beginning of the current lexeme, whose
extent we are attempting to determine
2. Pointer forward scans ahead until a pattern match is found
Once the lexeme is determined, forward is set to the character at its right end
(involves retracting)
Then, after the lexeme is recorded as an attribute value of the token returned to
the parser, lexemeBegin is set to the character immediately after the lexeme
just found
Advancing forward requires that we first test whether we have reached the end
of one of the buffers, and if so, we must reload the other buffer from the input,
and move forward to the beginning of the newly loaded buffer
Sentinels
If we use the previous scheme, we must check, each time we advance forward,
that we have not moved off one of the buffers; if we have, then we must also
reload the other buffer
Thus, for each character read, we must make two tests: one for the end of the
buffer, and one to determine which character was read
We can combine the buffer-end test with the test for the current character if we
extend each buffer to hold a sentinel character at the end
Figure 2.7: Sentinels at the end of each buffer (an eof after each buffer's contents, with
pointers lexemeBegin and forward as before)
The sentinel is a special character that cannot be part of the source program,
and a natural choice is the character eof
Note that eof retains its use as a marker for the end of the entire input
Any eof that appears other than at the end of a buffer means the input is at an end
Figure 2.7 shows the same arrangement as Figure 2.6, but with the sentinels
added
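The single-test loop this scheme enables can be sketched as follows (illustrative Python; a real lexer would reload each buffer from the input file with a system read, rather than take the buffers as a ready-made list):

```python
# Sketch of sentinel-based scanning: one comparison per character, with
# extra work only in the rare case that the sentinel eof is actually seen.
EOF = "\0"   # stand-in for the eof sentinel character

def scan(buffers):
    """Count source characters across buffers, each of which ends in EOF."""
    count, i, buf = 0, 0, 0
    while True:
        ch = buffers[buf][i]
        i += 1
        if ch == EOF:                      # the single combined test
            if buf + 1 < len(buffers):     # eof at end of buffer: switch/reload
                buf, i = buf + 1, 0
            else:
                return count               # eof with nothing left: end of input
        else:
            count += 1                     # an ordinary source character
```

The common path (an ordinary character) costs one comparison; distinguishing "end of buffer" from "end of input" happens only after the sentinel is hit.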
The initial state is 0. A function nextchar() obtains the next character from the input and assigns
it to the local variable c. We then check c for the three characters we expect to find, making the
state transition dictated by the transition diagram of Fig. 2.2 in each case. For example, if the
next input character is =, we go to state 5.
If the next input character is not one that can begin a comparison operator, then a function fail()
is called. What fail() does depends on the global error-recovery strategy of the lexical analyzer.
It should reset the forward pointer to lexemeBegin, in order to allow another transition diagram
to be applied to the true beginning of the unprocessed input.
We also show the action for state 8. Because state 8 bears a *, we must retract the
input pointer one position (i.e., put c back on the input stream). That task is accomplished by the
function retract(). Since state 8 represents the recognition of lexeme >, we set the second
component of the returned object, which we suppose is named attribute, to GT, the code for
this operator.
Review Questions
Ques: 1
In a compiler the module that checks every character of the source text is called
A)
B)
C)
D)
Ques: 2
Which of the following strings can definitely be said to be tokens without looking at the next
input character while compiling a Pascal program?
I. Begin
II. Program
III. <>
A) I
B) II
C) III
Ques: 3
In some programming languages, an identifier is permitted to be a letter followed by any
number of letters or digits. If L and D denote the sets of letters and digits respectively, which of
the following expressions defines an identifier?
A) (L+D)+
B) L(L+D)*
C) (L.D)*
D) L(L.D)*
Ques:4
How many tokens are there in the following C statement?
printf("j=%d, &j=%x", j, &j);
A) 4
B) 5
C) 9
D) 10
Ques: 5
In a compiler, the data structure responsible for the management of information about variables
and their attributes is
A) Semantic stack
B) Parser table
C) Symbol table
D) Abstract syntax-tree
Chapter Summary
Tokens. The lexical analyzer scans the source program and produces as output a sequence
of tokens, which are normally passed, one at a time, to the parser. Some tokens may
consist only of a token name, while others may also have an associated lexical value that
gives information about the particular instance of the token that has been found on the
input.
Lexemes. Each time the lexical analyzer returns a token to the parser, it has an associated
lexeme: the sequence of input characters that the token represents.
Buffering. Because it is often necessary to scan ahead on the input in order to see where
the next lexeme ends, it is usually necessary for the lexical analyzer to buffer its input.
Using a pair of buffers cyclically and ending each buffer's contents with a sentinel that
warns of its end are two techniques that accelerate the process of scanning the input.
Patterns. Each token has a pattern that describes which sequences of characters can form
the lexemes corresponding to that token. The set of words or strings of characters that
match a given pattern is called a language.
Regular Expressions. These expressions are commonly used to describe patterns. Regular
expressions are built from single characters, using union, concatenation, and the Kleene
closure, or any-number-of, operator.
Regular Definitions. Complex collections of languages, such as the patterns that describe
the tokens of a programming language, are often defined by a regular definition, which is
a sequence of statements that each define one variable to stand for some regular
expression. The regular expression for one variable can use previously defined variables
in its regular expression.
Extended Regular-Expression Notation. A number of additional operators may appear as
short hands in regular expressions, to make it easier to express patterns. Examples
include the + operator (one-or-more-of), ? (zero-or-one-of), and character classes (the
union of the strings each consisting of one of the characters).
Transition Diagrams. The behavior of a lexical analyzer can often be described by a
transition diagram. These diagrams have states, each of which represents something about
the history of the characters seen during the current search for a lexeme that matches one
of the possible patterns. There are arrows, or transitions, from one state to another, each
of which indicates the possible next input characters that cause the lexical analyzer to
make that change of state.
========================THE END====================
Chapter Three
Syntax Analysis
By design, every programming language has precise rules that prescribe the
syntactic structure of well-formed programs
The syntax of programming-language constructs can be specified by context-free
grammars or BNF notation (both are discussed in the previous course)
The use of CFGs has several advantages:
they help in identifying ambiguities
a grammar gives a precise yet easy-to-understand syntactic specification of a
programming language
it is possible to have a tool which automatically produces a parser from the
grammar
a properly designed grammar helps in modifying the parser easily when the
language changes
3.1 The Role of the Parser
In our compiler model, the parser obtains a string of tokens from the
lexical analyzer, as shown in Figure 3.1, and verifies that the string of
token names can be generated by the grammar for the source language
Top-down methods build parse trees from the top (root) to the bottom
(leaves), while bottom-up methods start from the leaves and work their
way up to the root
In either case, the input to the parser is scanned from left to right, one
symbol at a time