
Compiler Construction Lecture Notes

Introduction
o Lecture 1
Lexical Analysis
o Lecture 2
o Lecture 3
o Lecture 4
o Lecture 5
o Lecture 6
o Lecture 7
Syntax Analysis
o Lecture 8
o Lecture 9
o Lecture 10
o Lecture 11
o Lecture 12
o Lecture 13
Semantic Analysis
o Lecture 14
o Lecture 15
o Lecture 16
o Lecture 17
Intermediate Code Generation
o Lecture 18
o Lecture 19
o Lecture 20
o Lecture 21
o Lecture 22
Final Code Generation
o Lecture 23
o Lecture 24
o Lecture 25
o Lecture 26

lecture #1 began here


Why study compilers?
Most CS students do not go on to write a commercial compiler
someday, but that's not why we study compilers. We study
compiler construction for the following reasons:

Writing a compiler gives experience with large-scale
applications development. Your compiler program may be the
largest program you write as a student. Experience working with
really big data structures and complex interactions between
algorithms will help you out on your next big programming
project.

Compiler writing is one of the shining triumphs of CS theory. It
demonstrates the value of theory over the impulse to just "hack
up" a solution.

Compiler writing is a basic element of programming language
research. Many language researchers write compilers for the
languages they design.

Many applications have similar properties to one or more phases
of a compiler, and compiler expertise and tools can help an
application programmer working on other projects besides
compilers.
CS 370 is labor intensive. Famous computer scientist Dan Berry
of the University of Waterloo has argued convincingly that there
is no software development method for writing large programs
that doesn't involve pain: pain is inevitable in software
development (Berry's Theorem). From my own experience as a
student, I postulate Jeffery's Corollary: there is no way to learn
the skills necessary for writing big programs without pain. A
good CS course includes pain, and teaches pain management
and minimization.
The questions we should ask, then, are: (a) should CS majors be
required to spend a lot of time becoming really good
programmers? and (b) are we providing students with the
assistance and access to the tools and information they need to
accomplish their goals with the minimal doses of inevitable pain
that are required?
Some Tools we will use

Labs and lectures will discuss all of these, but if you do not
know them already, the sooner you go learn them, the better.
C and "make".
If you are not expert with these yet, you will be a lot closer
by the time you pass this class.
lex and yacc
These are compiler writers' tools, but they are useful for
other kinds of applications; almost anything that has to read
in a complex file format can benefit from them.
gdb
If you do not know a source-level debugger well, start
learning. You will need one to survive this class.
e-mail
Regularly e-mailing your instructor is a crucial part of class
participation. If you aren't asking questions, you aren't
doing your job as a student.
web
This is where you get your lecture notes, homeworks, and
labs, and turn in all your work.
virtual environment
We have a 3D video game / chat tool available that can
help us handle questions when one of us is not on campus.
Compilers - What Are They and What Kinds of Compilers
are Out There?
The purpose of a compiler is to translate a program in some
language (the source language) into a lower-level language
(the target language). The compiler itself is written in some
language, called the implementation language. To write a
compiler you have to be very good at programming in the
implementation language, and you have to think about and
understand the source language and target language.
There are several major kinds of compilers:
Native Code Compiler
Translates source code into hardware (assembly or machine
code) instructions. Example: gcc.
Virtual Machine Compiler
Translates source code into an abstract machine code, for
execution by a virtual machine interpreter. Example: javac.
JIT Compiler
Translates virtual machine code to native code. Operates
within a virtual machine. Example: Sun's HotSpot java
machine.
Preprocessor
Translates source code into simpler or slightly lower level
source code, for compilation by another compiler.
Examples: cpp, m4.
Pure interpreter
Executes source code on the fly, without generating
machine code. Example: Lisp.
Phases of a Compiler
Lexical Analysis:
Converts a sequence of characters into words, or tokens
Syntax Analysis:
Converts a sequence of tokens into a parse tree
Semantic Analysis:
Manipulates parse tree to verify symbol and type
information

Intermediate Code Generation:
Converts parse tree into a sequence of intermediate code
instructions
Optimization:
Manipulates intermediate code to produce a more efficient
program
Final Code Generation:
Translates intermediate code into final (machine/assembly)
code
Example of the Compilation Process
Consider the example statement; its translation to machine code
illustrates some of the issues involved in compiling.
position = initial + rate * 60
30 or so characters, from a single line of source code, are first
transformed by lexical analysis into a sequence of 7 tokens.
Those tokens are then used to build a tree of height 4 during
syntax analysis. Semantic analysis may transform the tree into
one of height 5 that includes a type conversion necessary for
real addition on an integer operand. Intermediate code
generation uses a simple traversal algorithm to linearize the tree
back into a sequence of machine-independent three-address code
instructions.
t1 = inttoreal(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3

Optimization of the intermediate code allows the four
instructions to be reduced to two machine-independent
instructions. Final code generation might implement these two
instructions using 5 machine instructions, in which the actual
registers and addressing modes of the CPU are utilized.
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
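
For reference, the two machine-independent instructions left after
optimization might look like the following (a sketch based on the
classic textbook version of this example; the notes above do not
list them explicitly):

t1 = id3 * 60.0
id1 = id2 + t1

Here the inttoreal conversion has been folded into the constant
60.0, and the temporaries t2 and t3 have been eliminated.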
lecture #2 began here
Announcements
Reading!
I hope you have already been reading! Make sure you read the
class lecture notes, the related sections of the text, and please
ask questions about whatever is not totally clear. You can Ask
Questions in class, via e-mail, in the virtual environment, or on
the class message board.
Note: although last year's CS 370 lecture notes are ALL
available to you up front, I generally revise each lecture's notes,
making additions, corrections and adaptations to this year's
homeworks, the night before each lecture. The best time to print
hard copies of the lecture notes is one day at a time, right before
the lecture is given.
Overview of Lexical Analysis

A lexical analyzer, also called a scanner, typically has the
following functionality and characteristics.
Its primary function is to convert from an (often very long)
sequence of characters into a (much shorter, perhaps 10X
shorter) sequence of tokens. This means less work for
subsequent phases of the compiler.
The scanner must Identify and Categorize specific
character sequences into tokens. It must know whether
every two adjacent characters in the file belong together in
the same token, or whether the second character must be in
a different token.
Most lexical analyzers discard comments & whitespace. In
most languages these characters serve to separate tokens
from each other, but once lexical analysis is completed they
serve no purpose. On the other hand, the exact line # and/or
column # may be useful in reporting errors, so some record
of what whitespace has occurred may be retained. Note: in
some languages, even popular ones, whitespace is
significant.
Handle lexical errors (illegal characters, malformed tokens)
by reporting them intelligibly to the user.
Efficiency is crucial; a scanner may perform elaborate input
buffering.
Token categories can be (precisely, formally) specified
using regular expressions, e.g.

IDENTIFIER=[a-zA-Z][a-zA-Z0-9]*
Lexical Analyzers can be written by hand, or implemented
automatically using finite automata.
What is a "token" ?

In compilers, a "token" is:
1. a single word of source code input (a.k.a. "lexeme")
2. an integer code that refers to a single word of input
3. a set of lexical attributes computed from a single word of
input
Programmers think about all this in terms of #1. Syntax
checking uses #2. Error reporting, semantic analysis, and code
generation require #3. In a compiler written in C, you allocate a
C struct to store (3) for each token.
Worth Mentioning
Here are the names of several important tools closely related to
compilers. You should learn those of these terms that you don't
already know.
interpreter
a language processor program that translates and executes
source code directly, without compiling it to machine code.
assembler
a translator from human readable (ASCII text) files of
machine instructions into the actual binary code (object
files) of a machine.
linker
a program that combines (multiple) object files to make an
executable. Converts names of variables and functions to
numbers (machine addresses).
loader
Program to load code. On some systems, different
executables start at different base addresses, so the loader
must patch the executable with the actual base address of
the executable.

preprocessor
Program that processes the source code before the compiler
sees it. Usually, it implements macro expansion, but it can
do much more.
editor
Editors may operate on plain text, or they may be wired
into the rest of the compiler, highlighting syntax errors as
you go, or allowing you to insert or delete entire syntax
constructs at a time.
debugger
Program to help you see what's going on when your
program runs. Can print the values of variables, show what
procedure called what procedure to get where you are, run
up to a particular line, run until a particular variable gets a
special value, etc.
profiler
Program to help you see where your program is spending
its time, so you can tell where you need to speed it up.
Auxiliary data structures
You were presented with the phases of the compiler, from
lexical and syntax analysis, through semantic analysis, and
intermediate and final code generation. Each phase has an input
and an output to the next phase. But there are a few data
structures we will build that survive across multiple phases: the
literal table, the symbol table, and the error handler.
lexeme table
a table that stores lexeme values, such as strings and
variable names, that may occur in many places. Only one
copy of each unique string and name needs to be allocated
in memory.
symbol table
a table that stores the names defined (and visible within) each
particular scope. Scopes include: global, and procedure
(local). More advanced languages have more scopes such
as class (or record) and package.
error handler
errors in lexical, syntax, or semantic analysis all need a
common reporting mechanism, that shows where the error
occurred (filename, line number, and maybe column
number are useful).
Reading Named Files in C using stdio
In this class you are opening and reading files. Hopefully this is
review for you; if not, you will need to learn it quickly. To do
any "standard I/O" file processing, you start by including the
header:
#include <stdio.h>
This defines a data type (FILE *) and gives prototypes for
relevant functions. The following code opens a file using a
string filename, reads the first character (into an int variable, not
a char, so that it can detect end-of-file; EOF is not a legal char
value).
FILE *f = fopen(filename, "r");
int i = fgetc(f);
if (i == EOF) /* empty file... */
Command line argument handling and file processing in C

The following example is from Kernighan & Ritchie's "The C
Programming Language", page 162.
#include <stdio.h>

/* cat: concatenate files, version 1 */
int main(int argc, char *argv[])
{
    FILE *fp;
    void filecopy(FILE *, FILE *);

    if (argc == 1)
        filecopy(stdin, stdout);
    else
        while (--argc > 0)
            if ((fp = fopen(*++argv, "r")) == NULL) {
                printf("cat: can't open %s\n", *argv);
                return 1;
            } else {
                filecopy(fp, stdout);
                fclose(fp);
            }
    return 0;
}

void filecopy(FILE *ifp, FILE *ofp)
{
    int c;

    while ((c = getc(ifp)) != EOF)
        putc(c, ofp);
}
Warning: while using and adapting the above code is fair game
in this class, the yylex() function is very different from the
filecopy() function! It takes no parameters! It returns an integer
every time it finds a token! So if you "borrow" from this
example, delete filecopy() and write yylex() from scratch.
Multiple students have fallen into this trap before you.
A Brief Introduction to Make
It is not a good idea to write a large program like a compiler as a
single source file. For one thing, every time you make a small
change, you would need to recompile the whole program, which
will end up being many thousands of lines. For another thing,
parts of your compiler may be generated by "compiler
construction tools" which will write separate files. In any case,
this class will require you to use multiple source files, compiled
separately, and linked together to form your executable program.
This would be a pain, except we have "make" which takes care
of it for us. Make uses an input file named "makefile", which
stores in ASCII text form a collection of rules for how to build a
program from its pieces. Each rule shows how to build a file
from its source files, or dependencies. For example, to compile a
file under C:
foo.o : foo.c
gcc -c foo.c
The first line says that to build foo.o you need foo.c, and the
second line, which must begin with a tab, gives a command line to
execute whenever foo.o should be rebuilt, i.e. when it is missing
or when foo.c has been changed and needs to be recompiled.
The first rule in the makefile is what "make" builds by default,
but note that make dependencies are recursive: before it checks
whether it needs to rebuild foo.o from foo.c it will check
whether foo.c needs to be rebuilt using some other rule. Because
of this post-order traversal of the "dependency graph", the first
rule in your makefile is usually the last one that executes when
you type "make". For a C program, the first rule in your
makefile would usually be the "link" step that assembles objects
files into an executable as in:
compiler: foo.o bar.o baz.o
gcc -o compiler foo.o bar.o baz.o
There is a lot more to "make" but we will take it one step at a
time. You can find useful on-line documentation on "make"
(manual pages, Internet reference guides, etc.) if you look.
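
Putting those two kinds of rules together, a complete (if
hypothetical) makefile for a small compiler project might look
like the following sketch; the file names are made up, and
remember that each command line must begin with a tab:

compiler: foo.o bar.o baz.o
	gcc -o compiler foo.o bar.o baz.o

foo.o: foo.c
	gcc -c foo.c

bar.o: bar.c
	gcc -c bar.c

baz.o: baz.c
	gcc -c baz.c

clean:
	rm -f compiler *.o

Typing "make" builds (or rebuilds) compiler; typing "make clean"
removes the generated files so you can rebuild from scratch.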
A couple finer points for HW#1
extern vs. #include: when do you use the one, when the other?
public interface to yylex(): no, you can't add your own
parameters
Regular Expressions
The notation we use to precisely capture all the variations that a
given category of token may take is called "regular expressions"
(or, less formally, "patterns"; the word "pattern" is really vague
and there are lots of other notations for patterns besides regular
expressions). Regular expressions are a shorthand notation for
sets of strings. In order to even talk about "strings" you have to
first define an alphabet, the set of characters which can appear.
1. Epsilon (ε) is a regular expression denoting the set
containing the empty string
2. Any letter in the alphabet is also a regular expression
denoting the set containing a one-letter string consisting of
that letter.
3. For regular expressions r and s,
r|s
is a regular expression denoting the union of r and s
4. For regular expressions r and s,
rs
is a regular expression denoting the set of strings consisting
of a member of r followed by a member of s
5. For regular expression r,
r*
is a regular expression denoting the set of strings consisting
of zero or more occurrences of r.
6. You can parenthesize a regular expression to specify
operator precedence (otherwise, alternation is like plus,
concatenation is like times, and closure is like
exponentiation)
Although these operators are sufficient to describe all regular
languages, in practice everybody uses extensions:
For regular expression r,
r+
is a regular expression denoting the set of strings consisting
of one or more occurrences of r. Equivalent to rr*

For regular expression r,
r?
is a regular expression denoting the set of strings consisting
of zero or one occurrence of r. Equivalent to r|ε
The notation [abc] is short for a|b|c. [a-z] is short for
a|b|...|z. [^abc] is short for: any character other than a, b, or
c.
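
To make these operators concrete, here are a few small examples
over the alphabet {a, b}: a|b denotes the set {a, b}; (a|b)(a|b)
denotes {aa, ab, ba, bb}; a* denotes {ε, a, aa, aaa, ...}; and
(a|b)*abb denotes the set of all strings of a's and b's that end
in abb.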

lecture #3 began here


What is a "lexical attribute" ?
A lexical attribute is a piece of information about a token. These
typically include:
category
an integer code used to check syntax
lexeme
actual string contents of the token
line, column, file where the lexeme occurs in source code
value
for literals, the binary data they represent
Homework #2
Avoid These Common Bugs in Your Homeworks!
1. yytext or yyinput were not declared global
2. main() does not have its required argc, argv parameters!
3. main() does not call yylex() in a loop or check its return
value
4. getc() EOF handling is missing or wrong! check EVERY
call to getc() for EOF!
5. opened files not (all) closed! file handle leak!
6. end-of-comment code doesn't check for */
7. yylex() is not doing the file reading
8. yylex() does not skip multiple spaces, mishandles spaces at
the front of input, or requires certain spaces in order to
function OK
9. extra or bogus output not in assignment spec
10. = instead of ==
Some Regular Expression Examples
In a previous lecture we saw regular expressions, the preferred
notation for specifying patterns of characters that define token
categories. The best way to get a feel for regular expressions is
to see examples. Note that regular expressions form the basis for
pattern matching in many UNIX tools such as grep, awk, perl,
etc.
What is the regular expression for each of the different lexical
items that appear in C programs? How does this compare with
another, possibly simpler programming language such as
BASIC?
lexical category: operators
BASIC: the characters themselves
C: For operators that are also regular expression operators we
need to mark them with double quotes or backslashes to indicate
we mean the character, not the regular expression operator. Note
that several operators have a common prefix. The lexical analyzer
needs to look ahead to tell whether an = is an assignment, or is
followed by another =, for example.

lexical category: reserved words
BASIC: the concatenation of characters; case insensitive
C: Reserved words are also matched by the regular expression for
identifiers, so a disambiguating rule is needed.

lexical category: identifiers
BASIC: no _; $ at the ends of some; 2 letters significant!?;
case insensitive
C: [a-zA-Z_][a-zA-Z0-9_]*

lexical category: numbers
BASIC: ints and reals, starting with [0-9]+
C: 0x[0-9a-fA-F]+ etc.

lexical category: comments
BASIC: REM.*
C: C's comments are tricky regexp's

lexical category: strings
BASIC: almost ".*"; no escapes
C: escaped quotes

what else?
lex(1) and flex(1)
These programs generally take a lexical specification given in a
.l file and create a corresponding C language lexical analyzer in
a file named lex.yy.c. The lexical analyzer is then linked with
the rest of your compiler.
The C code generated by lex has the following public interface.
Note the use of global variables instead of parameters, and the
use of the prefix yy to distinguish scanner names from your
program names. This prefix is also used in the YACC parser
generator.

FILE *yyin;     /* set this variable prior to calling yylex() */
int yylex();    /* call this function once for each token */
char yytext[];  /* yylex() writes the token's lexeme to an array */
                /* note: with flex, I believe extern declarations
                   must read: extern char *yytext; */
int yywrap();   /* called by lex when it hits end-of-file; see below */
The .l file format consists of a mixture of lex syntax and C code
fragments. The percent sign (%) is used to signify lex elements.
The whole file is divided into three sections separated by %%:
header
%%
body
%%
helper functions
The header consists of C code fragments enclosed in %{ and %}
as well as macro definitions consisting of a name and a regular
expression denoted by that name. lex macros are invoked
explicitly by enclosing the macro name in curly braces.
Following are some example lex macros.
letter  [a-zA-Z]
digit   [0-9]
ident   {letter}({letter}|{digit})*

The body consists of a sequence of regular expressions for
different token categories and other lexical entities. Each regular
expression can have a C code fragment enclosed in curly braces
that executes when that regular expression is matched. For most
of the regular expressions this code fragment (also called
a semantic action) consists of returning an integer that identifies
the token category to the rest of the compiler, particularly for
use by the parser to check syntax. Some typical regular
expressions and semantic actions might include:
""
{ /* no-op, discard whitespace */ }
{ident}
{ return IDENTIFIER; }
"*"
{ return ASTERISK; }
"."
{ return PERIOD; }
You also need regular expressions for lexical errors such as
unterminated character constants, or illegal characters.
The helper functions in a lex file typically compute lexical
attributes, such as the actual integer or string values denoted by
literals. One helper function you have to write is yywrap(),
which is called when lex hits end of file. If you just want lex to
quit, have yywrap() return 1. If your yywrap() switches yyin to a
different file and you want lex to continue processing, have
yywrap() return 0. The lex and flex libraries (-ll or -lfl) have a
default yywrap() function which returns 1, and flex has the
directive %option noyywrap which allows you to skip writing
this function.
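
If you are not linking with the library and not using %option
noyywrap, the simplest version you can write yourself (a minimal
sketch for the helper-functions section) is:

int yywrap()
{
   return 1;   /* no more input files: tell lex to stop at end-of-file */
}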
A Short Comment on Lexing C Reals

C float and double constants have to have at least one digit,
either before or after the required decimal point. This is a pain:

([0-9]+"."[0-9]* | [0-9]*"."[0-9]+) ...

You might almost be happier if you wrote

([0-9]*"."[0-9]*)  { return (strcmp(yytext,".")) ? REAL : PERIOD; }

You-all know C's ternary e1 ? e2 : e3 operator, don't ya? It's an
if-then-else expression, very slick.
Lex extended regular expressions
Lex further extends the regular expressions with several helpful
operators. Lex's regular expressions include:
c
normal characters mean themselves
\c
backslash escapes remove the meaning from most operator
characters. Inside character sets and quotes, backslash
performs C-style escapes.
"s"
Double quotes mean to match the C string given as itself.
This is particularly useful for multi-byte operators and may
be more readable than using backslash multiple times.
[s]
This character set operator matches any one character
among those in s.
[^s]
A negated-set matches any one character not among those
in s.

.
The dot operator matches any one character except
newline: [^\n]
r*
match r 0 or more times.
r+
match r 1 or more times.
r?
match r 0 or 1 time.
r{m,n}
match r between m and n times.
r1r2
concatenation. match r1 followed by r2
r1|r2
alternation. match r1 or r2
(r)
parentheses specify precedence but do not match anything
r1/r2
lookahead. match r1 when r2 follows, without consuming r2
^r
match r only when it occurs at the beginning of a line
r$
match r only when it occurs at the end of a line
lecture #4 began here
Announcements
Next homework I promise: I will ask the TA to run your
program with a nonexistent file as a command-line argument!
Lexical Attributes and Token Objects

Besides the token's category, the rest of the compiler may need
several pieces of information about a token in order to perform
semantic analysis, code generation, and error handling. These
are stored in an object instance of class Token, or in C, a struct.
The fields are generally something like:
struct token {
   int category;
   char *text;
   int linenumber;
   int column;
   char *filename;
   union literal value;
};
The union literal will hold computed values of integers, real
numbers, and strings. In your homework assignment, I am
requiring you to compute column #'s; not all compilers require
them, but they are easy. Also: in our compiler project we are not
worrying about optimizing our use of memory, so I am not
requiring you to use a union.
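
To make this concrete, here is a minimal sketch of a helper that
your yylex() actions could call to build such a token. It assumes
the struct token declaration above is visible (e.g. from a header);
the helper name maketoken and the globals yycolumn and
current_filename are hypothetical names you would maintain
yourself, while yytext comes from the scanner and yylineno is
available with flex's %option yylineno.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

extern char *yytext;            /* current lexeme, from the scanner */
extern int yylineno;            /* with %option yylineno */
extern int yycolumn;            /* hypothetical column counter */
extern char *current_filename;  /* hypothetical, set when opening the file */

struct token *maketoken(int category)
{
   struct token *t = (struct token *)malloc(sizeof(struct token));
   if (t == NULL) { fprintf(stderr, "out of memory\n"); exit(1); }
   t->category = category;
   t->text = strdup(yytext);
   t->linenumber = yylineno;
   t->column = yycolumn;
   t->filename = current_filename;
   /* for literals, also fill in t->value, e.g. with atoi()/atof() */
   return t;
}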
Flex Manpage Examplefest
To read a UNIX "man page", or manual page, you type
"man command" where command is the UNIX program or
library function you need information on. Read the man page for
man to learn more advanced uses ("man man").
It turns out the flex man page is intended to be pretty complete,
enough so that we can draw our examples from it. Perhaps what
you should figure out from these examples is that flex is
actually... flexible. The first several examples use flex as a filter
from standard input to standard output.

sneaky string removal tool:

%%
"zap me"

excess whitespace trimmer:

%%
[ \t]+    putchar( ' ' );
[ \t]+$   /* ignore this token */

sneaky string substitution tool:

%%
username  printf( "%s", getlogin() );
Line Counter/Word Counter

        int num_lines = 0, num_chars = 0;

%%
\n      ++num_lines; ++num_chars;
.       ++num_chars;

%%
main()
{
        yylex();
        printf( "# of lines = %d, # of chars = %d\n",
                num_lines, num_chars );
}
Toy compiler example

        /* scanner for a toy Pascal-like language */

%{
/* need this for the call to atof() below */
#include <math.h>
%}

DIGIT    [0-9]
ID       [a-z][a-z0-9]*

%%

{DIGIT}+ {
            printf( "An integer: %s (%d)\n", yytext,
                    atoi( yytext ) );
         }

{DIGIT}+"."{DIGIT}* {
            printf( "A float: %s (%g)\n", yytext,
                    atof( yytext ) );
         }

if|then|begin|end|procedure|function {
            printf( "A keyword: %s\n", yytext );
         }

{ID}     printf( "An identifier: %s\n", yytext );

"+"|"-"|"*"|"/"  printf( "An operator: %s\n", yytext );

"{"[^}\n]*"}"    /* eat up one-line comments */

[ \t\n]+         /* eat up whitespace */

.        printf( "Unrecognized character: %s\n", yytext );

%%

main( argc, argv )
int argc;
char **argv;
{
   ++argv, --argc;   /* skip over program name */
   if ( argc > 0 )
      yyin = fopen( argv[0], "r" );
   else
      yyin = stdin;

   yylex();
}

On the use of character sets (square brackets) in lex and
similar tools
A student recently sent me an example regular expression for
comments that read:
COMMENT [/*][[^*/]*[*]*]]*[*/]
One problem here is that square brackets are not parentheses,
they do not nest, they do not support concatenation or other
regular expression operators. They mean exactly: "match any
one of these characters" or for ^: "match any one character that
is not one of these characters". Note also that you can't use ^ as
a "not" operator outside of square brackets: you can't write the
expression for "stuff that isn't */" by saying (^ "*/")
lecture #5 began here
Finite Automata
A finite automaton (FA) is an abstract, mathematical machine,
also known as a finite state machine, with the following
components:
1. A set of states S
2. A set of input symbols E (the alphabet)
3. A transition function move(state, symbol) : new state(s)
4. A start state S0
5. A set of final states F
The word finite refers to the set of states: there is a fixed size to
this machine. No "stacks", no "virtual memory", just a known
number of states. The word automaton refers to the execution
mode: there is no instruction set, there is no sequence of
instructions, there is just a hardwired short loop that executes the
same instruction over and over:

while ((c=getchar()) != EOF) S := move(S, c);
DFAs
The type of finite automata that is easiest to understand and
simplest to implement (say, even in hardware) is called a
deterministic finite automaton (DFA). The
word deterministic here refers to the return value of function
move(state, symbol), which goes to at most one state. Example:
S = {s0, s1, s2}
E = {a, b, c}
move = { (s0,a):s1; (s1,b):s2; (s2,c):s2 }
S0 = s0
F = {s2}
Finite automata correspond in a 1:1 relationship to transition
diagrams; from any transition diagram one can write down the
formal automaton in terms of items #1-#5 above, and vice versa.
To draw the transition diagram for a finite automaton:
draw a circle for each state s in S; put a label inside the
circles to identify each state by number or name
draw an arrow between Si and Sj, labeled with x, whenever
the transition function says move(Si, x) : Sj
draw a "wedgie" into the start state S0 to identify it
draw a second circle inside each of the final states in F
The Automaton Game

If I give you a transition diagram of a finite automaton, you can
hand-simulate the operation of that automaton on any input I
give you.
DFA Implementation
The nice part about DFA's is that they are efficiently
implemented on computers. What DFA does the following code
correspond to? What is the corresponding regular expression?
You can speed this code fragment up even further if you are
willing to use goto's or write it in assembler.
int state = 0;
int input = getchar();
for (;;)
   switch (state) {
   case 0:
      switch (input) {
      case 'a': state = 1; input = getchar(); break;
      case 'b': input = getchar(); break;
      default: printf("dfa error\n"); exit(1);
      }
      break;
   case 1:
      switch (input) {
      case EOF: printf("accept\n"); exit(0);
      default: printf("dfa error\n"); exit(1);
      }
   }
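
(One reading of this code, sketched here so you can check your own
answer: state 0 stays put on b and moves to state 1 on a; state 1
accepts only when the input is exhausted. So the machine accepts
strings matching the regular expression b*a.)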
Deterministic Finite Automata Examples
A lexical analyzer might associate different final states with
different token categories:

C Comments:

Nondeterministic Finite Automata (NFA's)


Notational convenience motivates more flexible machines in
which function move() can go to more than one state on a given
input symbol, and some states can move to other states even
without consuming an input symbol (ε-transitions).
Fortunately, one can prove that for any NFA, there is an
equivalent DFA. They are just a notational convenience. So,
finite automata help us get from a set of regular expressions to a
computer program that recognizes them efficiently.
NFA Examples
ε-transitions make it simpler to merge automata:

multiple transitions on the same symbol handle common prefixes:

factoring may optimize the number of states. Is this picture
OK/correct?
C Pointers, malloc, and your future


For most of you, success as a computer scientist may boil down
to whether you can master the concept of dynamically allocated
memory. In C this means pointers and the malloc() family of
functions. Here are some tips:
Draw "memory box" pictures of your variables. Pencil and
paper understanding of memory leads to correct running
programs.
Always initialize local pointer variables. Consider this
code:

void f() {
   int i = 0;
   struct tokenlist *current, *head;
   ...
   foo(current);
}
Here, current is passed in as a parameter to foo, but it is a
pointer that hasn't been pointed at anything. I cannot tell
you how many times I personally have written bugs myself
or fixed bugs in student code, caused by reading or writing
to pointers that weren't pointing at anything in particular.
Local variables that weren't initialized point at random
garbage. If you are lucky this is a coredump, but you might
not be lucky, you might not find out where the mistake
was, you might just get a wrong answer. This can all be
fixed by
struct tokenlist *current = NULL, *head = NULL;
Avoid this common C bug:
struct token *t = (struct token *)malloc(sizeof(struct token *));
This compiles, but causes coredumps during program
execution. Why?
Check your malloc() return value to be sure it is not NULL.
Sure, modern programs will "never run out of memory".
Wrong. malloc() can return NULL even on big machines.
Operating systems often place limits on memory so as to
protect themselves from runaway programs or hacker
attacks.

Regular expression examples

Can you draw an NFA corresponding to the following?

(a|c)*b(a|c)*
(a|c)*|(a|c)*b(a|c)*
(a|c)*(b|ε)(a|c)*

Regular expressions can be converted automatically to
NFA's
Each rule in the definition of regular expressions has a
corresponding NFA; NFA's are composed using ε-transitions.
This is called "Thompson's construction". We will work
examples such as (a|b)*abb in class and during lab.
1. For ε, draw two states with a single ε-transition.
2. For any letter in the alphabet, draw two states with a single
transition labeled with that letter.
3. For regular expressions r and s, draw r | s by adding a new
start state with ε-transitions to the start states of r and s, and
a new final state with ε-transitions from each final state in r
and s.
4. For regular expressions r and s, draw rs by adding
ε-transitions from the final states of r to the start state of s.
5. For regular expression r, draw r* by adding new start and
final states, and ε-transitions
o from the start state to the final state,
o from the final state back to the start state,
o from the new start to the old start and from the old
final states to the new final state.
6. For parenthesized regular expression (r) you can use the
NFA for r.
lecture #6 began here
NFA's can be converted automatically to DFA's
In: NFA N
Out: DFA D
Method: Construct transition table Dtran (a.k.a. the "move
function"). Each DFA state is a set of NFA states. Dtran
simulates in parallel all possible moves N can make on a given
string.
Operations to keep track of sets of NFA states:
ε-closure(s)
set of states reachable from state s via ε-transitions
ε-closure(T)
set of states reachable from any state in set T via ε-transitions
move(T,a)
set of states to which there is an NFA transition from states
in T on symbol a

NFA to DFA Algorithm:

Dstates := { ε-closure(start_state) }
while T := unmarked_member(Dstates) do {
   mark(T)
   for each input symbol a do {
      U := ε-closure(move(T,a))
      if not member(Dstates, U) then
         insert(Dstates, U)
      Dtran[T,a] := U
   }
}
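
As a quick sanity check on the ε-closure operation: if an NFA has
ε-transitions s0 -> s1 and s1 -> s2, then ε-closure({s0}) =
{s0, s1, s2}, since a state is always in its own ε-closure and the
closure follows chains of ε-transitions.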
Practice converting NFA to DFA
OK, you've seen the algorithm, now can you use it?
[worked NFA-to-DFA examples were done in class; the transition
diagrams are not reproduced in these notes]
lecture #7 began here


Some Remarks
I have a collection of compiler textbooks in my office,
which I will make available as "loaners" from class period
to class period; all you have to do is sign a return contract
in blood.
If you checked out the class web page, you saw a solution
to HW#1 was posted awhile ago... I will try to do this for
future assignments also, but not immediately, so as to allow
students a few days of lateness without a heavy penalty.
Whether we return the same or a different category for
integer constants and for line numbers depends very much
on the grammar we use to parse our language.

Lexical Analysis and the Literal Table
In many compilers, the memory management components of the
compiler interact with several phases of compilation, starting
with lexical analysis.
Efficient storage is necessary to handle large input files.
There is a colossal amount of duplication in lexical data:
variable names, strings and other literal values duplicate
frequently
What token type to use may depend on previous
declarations.
A hash table or other efficient data structure can avoid this
duplication. The software engineering design pattern to use is
called the "flyweight".
Major Data Structures in a Compiler
token
contains an integer category, lexeme, line #, column #,
filename... We could build these into a link list, but instead
we'll use them as leaves in a tree structure.
syntax tree
contains grammar information about a sequence of related
tokens. Leaves contain lexical information (tokens). Internal
nodes contain grammar rules and pointers to tokens or
other tree nodes.
symbol table
contains variable names, types, and information needed to
generate code for a name (such as its address, or constant
value). Lookups are by name, so we'll need a hash table.
intermediate & final code
We'll need linked lists or similar structures to hold sequences
of machine instructions
Literal Table: Usage Example
Example abbreviated from [ASU86]: Figure 3.18, p. 109. Use
"install_id()" instead of "strdup()" to avoid duplication in the
lexical data.
%{
/* #define's for token categories LT, LE, etc. */
%}
white  [ \t\n]+
digit  [0-9]
id     [a-zA-Z_][a-zA-Z_0-9]*
num    {digit}+(\.{digit}+)?
%%
{white} { /* discard */ }
if      { return IF; }
then    { return THEN; }
else    { return ELSE; }
{id}    { yylval.id = install_id(); return ID; }
{num}   { yylval.num = install_num(); return NUMBER; }
"<"     { yylval.op = LT; return RELOP; }
">"     { yylval.op = GT; return RELOP; }
%%
install_id()
{
/* insert yytext into the literal table */
}
install_num()
{
/* insert (binary number corresponding to?) yytext into the
literal table */
}
So how would you implement a literal table using a hash table?
We will see more hash tables when it comes time to construct
the symbol tables with which variable names and scopes are
managed, so you had better become fluent.
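
One possible answer, sketched below under the assumption that the
literal table only needs to store strings: a fixed-size chained
hash table whose lookup function returns the already-stored copy of
a string if it has been seen before, and inserts it otherwise. The
names here (install_string, NBUCKETS, and so on) are made up for
this sketch, not required by the assignment.

#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1024

struct entry {
   char *s;
   struct entry *next;
};

static struct entry *bucket[NBUCKETS];

static unsigned hash(const char *s)
{
   unsigned h = 0;
   while (*s) h = h * 31 + (unsigned char)*s++;
   return h % NBUCKETS;
}

/* return the canonical copy of s, allocating it only the first time */
char *install_string(const char *s)
{
   unsigned h = hash(s);
   struct entry *e;
   for (e = bucket[h]; e != NULL; e = e->next)
      if (strcmp(e->s, s) == 0)
         return e->s;                 /* already present: reuse it */
   e = (struct entry *)malloc(sizeof(struct entry));
   if (e == NULL) abort();            /* always check malloc, per the advice above */
   e->s = strdup(s);
   e->next = bucket[h];
   bucket[h] = e;
   return e->s;
}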
lecture #8 began here
Constructing your Token inside yylex()
A student recently asked if it was OK to allocate a token
structure inside main() after yylex() returns the token. This is not
OK because in the next phase of your compiler, you are not calling
yylex(); the automatically generated parser will call yylex().
There is a way for the parser to grab your token if you've stored
it in a global variable, but there is not a way for the parser to
build the token structure itself.
Syntax Analysis
Parsing is the act of performing syntax analysis to verify an
input program's compliance with the source language. A byproduct of this process is typically a tree that represents the
structure of the program.
Context Free Grammars
A context free grammar G has:
A set of terminal symbols, T
A set of nonterminal symbols, N
A start symbol, s, which is a member of N
A set of production rules of the form A -> w, where A is a
nonterminal and w is a string of terminal and nonterminal
symbols.
A context free grammar can be used to generate strings in the
corresponding language as follows:
let X = the start symbol s
while there is some nonterminal Y in X do
apply any one production rule using Y, e.g. Y -> w
When X consists only of terminal symbols, it is a string of the
language denoted by the grammar. Each iteration of the loop is
a derivation step. If an iteration has several nonterminals to
choose from at some point, the rules of derivation would allow
any of these to be applied. In practice, parsing algorithms tend to
always choose the leftmost nonterminal, or the rightmost
nonterminal, resulting in strings that are leftmost
derivations or rightmost derivations.
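
For example, with the two rules E -> E + E and E -> id, one
leftmost derivation of the string "id + id" is:

E => E + E => id + E => id + id

Each => step replaces the leftmost nonterminal using one
production rule.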

Context Free Grammar Examples
Well, OK, so how much of the C language grammar can we
come up with in class today? Start with expressions, work on up
to statements, and work from there up to entire functions, and
programs.
lecture #9 began here
Dr. Pontelli is looking for a web developer, did everyone see
that ad? I too am looking for student research assistants.
Grammar Ambiguity
The grammar
E -> E + E
E -> E * E
E -> ( E )
E -> ident
allows two different derivations for strings such as "x + y * z".
The grammar is ambiguous, but the semantics of the language
dictate a particular operator precedence that should be used. One
way to eliminate such ambiguity is to rewrite the grammar. For
example, we can force the precedence we want by adding some
nonterminals and production rules.
E -> E + T
E -> T
T -> T * F
T -> F
F -> ( E )

F -> ident
Given the arithmetic expression grammar from last lecture:
How can a program figure out that x + y * z is legal?
How can a program figure out that x + y (* z) is illegal?
A brief aside on casting your mallocs
If you don't put a prototype for malloc(), C thinks it returns an
int.
#include <stdlib.h>
includes prototypes for malloc(), free(), etc. malloc() returns a
void *.
void * means "pointer that points at nothing", or "pointer that
points at anything". You need to cast it to what you are really
pointing at, as in:
union lexval *l = (union lexval *)malloc(sizeof(union lexval));
Note the stupid duplication of type information; no language is
perfect! Anyhow, always cast your mallocs. The program may
work without the cast, but you need to fix every warning, so you
don't accidentally let a serious one through.
Recursive Descent Parsing
Perhaps the simplest parsing method, for a large subset of
context free grammars, is called recursive descent. It is simple
because the algorithm closely follows the production rules of
nonterminal symbols.
Write 1 procedure per nonterminal rule
Within each procedure, a) match terminals at appropriate
positions, and b) call procedures for non-terminals.
Pitfalls:
1. left recursion is FATAL
2. must distinguish between several production rules, or
potentially, one has to try all of them via backtracking.

Recursive Descent Parsing Example #1


Consider the grammar we gave above. There will be functions
for E, T, and F. The function for F() is the "easiest" in some
sense: based on a single token it can decide which production
rule to use. The parsing functions return 0 (failed to parse) if the
nonterminal in question cannot be derived from the tokens at the
current point. A nonzero return value of N would indicate
success in parsing using production rule #N.
int F()
{
   int t = yylex();
   if (t == IDENT) return 6;
   else if (t == LP) {
      if (E() && (yylex() == RP)) return 5;
   }
   return 0;
}
Comment #1: if F() is in the middle of a larger parse of E() or
T(), F() may succeed, but the subsequent parsing may fail. The
parse may have to backtrack, which would mean we'd have to
be able to put tokens back for later parsing. Add a memory (say,
a gigantic array or link list for example) of already-parsed

tokens to the lexical analyzer, plus backtracking logic to E() or


T() as needed. The call to F() may get repeated following a
different production rule for a higher nonterminal.
Comment #2: in a real compiler we need more than "yes it
parsed" or "no it didn't": we need a parse tree if it succeeds, and
we need a useful error message if it didn't.
Question: for E() and T(), how do we know which production
rule to try? Option A: just blindly try each one in turn. Option B:
look at the first (current) token, only try those rules that start
with that token (1 character lookahead). If you are lucky, that
one character will uniquely select a production rule. If that is
always true through the whole grammar, no backtracking is
needed.
Question: how do we know which rules start with whatever
token we are looking at? Can anyone suggest a solution, or are
we stuck?
lecture #10 began here
Announcements
Homework #3 minor extension
Midterm exam: Thursday March 16
The first midterm exam will cover lexical analysis and
syntax analysis

Removing Left Recursion

E -> E + T | T
T -> T * F | F
F -> ( E ) | ident

We can remove the left recursion by introducing new
nonterminals and new production rules.

E -> T E'
E' -> + T E' | ε
T -> F T'
T' -> * F T' | ε
F -> ( E ) | ident

Getting rid of such immediate left recursion is not enough; one
must also get rid of indirect left recursion, where two or more
nonterminals are mutually left-recursive. One can rewrite any CFG
to remove left recursion (Algorithm 4.1):

for i := 1 to n do
   for j := 1 to i-1 do begin
      replace each production of the form Ai -> Aj gamma
      with the productions Ai -> delta1 gamma | delta2 gamma | ... | deltak gamma,
      where Aj -> delta1 | delta2 | ... | deltak are all the current Aj productions
   end
   eliminate the immediate left recursion among the Ai productions
Removing Left Recursion, part 2
Left recursion can be broken into three cases.
case 1: trivial
A : A α | β
The recursion must always terminate by A finally deriving β, so
you can rewrite it to the equivalent
A : β A'
A' : α A' | ε
Example:
E : E op T | T
can be rewritten
E : T E'
E' : op T E' | ε
case 2: non-trivial, but immediate
In the more general case, there may be multiple recursive
productions and/or multiple non-recursive productions.
A : A α1 | A α2 | ... | β1 | β2 | ...
As in the trivial case, you get rid of the left-recursing A and
introduce an A'
A : β1 A' | β2 A' | ...
A' : α1 A' | α2 A' | ... | ε
case 3: mutual recursion
1. Order the nonterminals in some order 1 to N.
2. Rewrite production rules to eliminate all nonterminals in
leftmost positions that refer to a "previous" nonterminal.
When finished, all productions' right hand sides start
with a terminal or a nonterminal that is numbered equal or
higher than the nonterminal on the left hand side.
3. Eliminate the direct left recursion as per cases 1-2.

Left Recursion Versus Right Recursion: When does it
Matter?
A student came to me once with what they described as an
operator precedence problem where 5-4+3 was computing the
wrong value (-2 instead of 4). What it really was, was an
associativity problem due to the grammar:
E : T + E | T - E | T
The problem here is that right recursion is forcing right
associativity, but normal arithmetic requires left associativity.
Several solutions are: (a) rewrite the grammar to be left
recursive, or (b) rewrite the grammar with more nonterminals to
force the correct precedence/associativity, or (c) if using YACC
or Bison, there are "cheat codes" we will discuss later to allow it
to be majorly ambiguous and specify associativity separately
(look for %left and %right in YACC manuals).
Recursive Descent Parsing Example #2
The grammar
S -> A B C
A -> a A
A ->
B -> b
C -> c
maps to pseudocode like the following. (:= is an assignment
operator)
procedure S()
if A() & B() & C() then succeed # matched S, we win

end
procedure A()
if yychar == a then { # use production 2
yychar := scan()
return A()
}
else
succeed # production rule 3, match
end
procedure B()
if yychar == b then {
yychar := scan()
succeed
}
else fail
end
procedure C()
if yychar == c then {
yychar := scan()
succeed
}
else fail
end
Backtracking?

Could your current token begin more than one of your possible
production rules? Try all of them, remember and reset state for
each try.
S -> cAd
A -> ab
A -> a
Left factoring can often solve such problems:
S -> cAd
A -> a A'
A'-> b
A'-> ()
One can also perform left factoring to reduce or eliminate the
lookahead or backtracking needed to tell which production rule
to use. If the end result has no lookahead or backtracking
needed, the resulting CFG can be solved by a "predictive parser"
and coded easily in a conventional language. If backtracking is
needed, a recursive descent parser takes more work to
implement, but is still feasible. As a more concrete example:
S -> if E then S
S -> if E then S1 else S2
can be factored to:
S -> if E then S S'
S' -> else S2 | ε
Some More Parsing Theory

Automatic techniques for constructing parsers start with
computing some basic functions for symbols in the grammar.
These functions are useful in understanding both recursive
descent and bottom-up LR parsers.
First(a)
First(a) is the set of terminals that begin strings derived from a,
which can include ε.
1. First(X) starts with the empty set.
2. if X is a terminal, First(X) is {X}.
3. if X -> ε is a production, add ε to First(X).
4. if X is a non-terminal and X -> Y1 Y2 ... Yk is a production,
add First(Y1) to First(X), and then
for (i = 1; Yi can derive ε; i++)
   add First(Yi+1) to First(X)
First(a) examples
by the way, this stuff is all in section 4.3 in your text.
Last time we looked at an example with E, T, and F, and + and
*. The first-set computation was not too exciting and we need
more examples.
stmt : if-stmt | OTHER
if-stmt : IF LP expr RP stmt else-part
else-part : ELSE stmt | ε
expr : IDENT | INTLIT
What are the First() sets of each nonterminal?
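
(Working them out from the rules above, one gets: First(expr) =
{IDENT, INTLIT}; First(else-part) = {ELSE, ε}; First(if-stmt) =
{IF}; and First(stmt) = {IF, OTHER}. Check these yourself before
relying on them.)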

Follow(A)
Follow(A) for nonterminal A is the set of terminals that can
appear immediately to the right of A in some sentential form
derived from the start symbol S. To compute Follow, apply these
rules to all nonterminals in the grammar:
1. Add $ to Follow(S).
2. if A -> aBb then add First(b) - {ε} to Follow(B).
3. if A -> aB, or A -> aBb where ε is in First(b), then add
Follow(A) to Follow(B).
On resizing arrays in C
The sval attribute in homework #2 is a perfect example of a
problem which a BCS major might not be expected to manage,
but a CS major should be able to do by the time they graduate.
This is not to encourage any of you to consider BCS, but rather,
to encourage you to learn how to solve problems like these.
The problem can be summarized as: step through yytext,
copying each piece out to sval, removing doublequotes and
plusses between the pieces, and evaluating CHR$() constants.
Space allocated with malloc() can be increased in size by
realloc(). realloc() is awesome. But, it COPIES and MOVES the
old chunk of space you had to the new, resized chunk of space,
and frees the old space, so you had better not have any other
pointers pointing at that space if you realloc(), and you have to
update your pointer to point at the new location realloc() returns.
i = 0; j = 0;
while (yytext[i] != '\0') {
   if (yytext[i] == '\"') {
      /* copy string contents into sval */
      i++;
      while (yytext[i] != '\"') {
         sval[j++] = yytext[i++];
      }
   }
   else if ((yytext[i] == 'C') || (yytext[i] == 'c')) {
      /* handle CHR$(...) */
      i += 5;
      k = atoi(yytext + i);
      sval[j++] = k;             /* might check for 0-255 */
      while (yytext[i] != ')') i++;
   }
   /* else we can just skip it */
   i++;
}
sval[j] = '\0';   /* NUL-terminate our string */
There is one more problem: how do we allocate memory for
sval, and how big should it be?
Solution #1: sval = malloc(strlen(yytext)+1) is very safe,
but wastes space.
Solution #2: you could malloc a small amount and grow the
array as needed.

sval = strdup("");
...
sval = appendstring(sval, yytext[i]);  /* instead of sval[j++] = yytext[i] */
where the function appendstring could be:

char *appendstring(char *s, char c)
{
   int i = strlen(s);
   s = realloc(s, i+2);
   s[i] = c;
   s[i+1] = '\0';
   return s;
}
Note: it is very inefficient to grow your array one character
at a time; in real life people grow arrays in large chunks at
a time.
Solution #3: use solution one and then shrink your array
when you find out how big it actually needs to be.

sval = malloc(strlen(yytext)+1);
/* ... do the code copying into sval; be sure to
NUL-terminate */

sval = realloc(sval, strlen(sval)+1);

lecture #11 began here

YACC

YACC ("yet another compiler compiler") is a popular tool which
originated at AT&T Bell Labs. YACC takes a context free grammar
as input, and generates a parser as output. Several independent,
compatible implementations (AT&T yacc, Berkeley yacc, GNU Bison)
for C exist, as well as many implementations for other popular
languages.

YACC files end in .y and take the form

declarations
%%
grammar
%%
subroutines

The declarations section defines the terminal symbols (tokens)
and nonterminal symbols. The most useful declarations are:
%token a
declares terminal symbol a; YACC can generate a set of #define's
that map these symbols onto integers, in a y.tab.h file. Note:
don't #include your y.tab.h file from your grammar .y file; YACC
generates the same definitions and declarations directly in the
.c file, and including the .tab.h file will cause duplication
errors.
%start A
specifies the start symbol for the grammar (defaults to the
nonterminal on the left side of the first production rule).

The grammar gives the production rules, interspersed with program
code fragments called semantic actions that let the programmer do
what's desired when the grammar productions are reduced. They
follow the syntax

A : body ;

where body is a sequence of 0 or more terminals, nonterminals,
or semantic actions (code, in curly braces) separated by spaces.
As a notational convenience, multiple production rules may be
grouped together using the vertical bar (|).

Bottom Up Parsing

Bottom up parsers start from the sequence of terminal symbols and
work their way back up to the start symbol by repeatedly replacing
grammar rules' right hand sides by the corresponding non-terminal.
This is the reverse of the derivation process, and is called
"reduction".

Example. For the grammar

(1) S -> a A B e
(2) A -> A b c
(3) A -> b
(4) B -> d

the string "abbcde" can be parsed bottom-up by the following
reduction steps:

abbcde
aAbcde
aAde
aABe
S

Handles

Definition: a handle is a substring that
1. matches a right hand side of a production rule in the
grammar, and
2. whose reduction to the nonterminal on the left hand side of
that grammar rule is a step along the reverse of a rightmost
derivation.

Shift Reduce Parsing

A shift-reduce parser performs its parsing using the following
structure:

Stack        Input
$            w$

At each step, the parser performs one of the following actions.

1. Shift one symbol from the input onto the parse stack.
2. Reduce one handle on the top of the parse stack. The symbols
from the right hand side of a grammar rule are popped off the
stack, and the nonterminal symbol is pushed on the stack in
their place.
3. Accept is the operation performed when the start symbol is
alone on the parse stack and the input is empty.
4. Error actions occur when no successful parse is possible.

The YACC Value Stack

YACC's parse stack contains only "states".
YACC maintains a parallel value stack of semantic values.
$ is used in semantic actions to name elements on the value stack:
$$ denotes the value associated with the LHS (nonterminal)
symbol, and $n denotes the value associated with the RHS symbol
at position n.
The value stack is typically used to construct the parse tree.
A typical rule with a semantic action: A : b C d { $$ =
tree(R,3,$1,$2,$3); }
The default value stack is an array of integers.
The value stack can hold arbitrary values in an array of unions.
The union type is declared with %union and is named YYSTYPE.

Getting Lex and Yacc to talk

YACC uses a global variable named yylval, of type YYSTYPE, to
receive lexical information from the scanner. Whatever is in this
variable each time yylex() returns to the parser will get copied
over to the top of the value stack when the token is shifted onto
the parse stack.

You can either declare that struct token may appear in the %union,
and put a mixture of struct node and struct token on the value
stack, or you can allocate a "leaf" tree node, and point it at
your struct token. Or you can use a tree type that allows tokens
to include their lexical information directly in the tree nodes.
If you have more than one %union type possible, be prepared to see
type conflicts and to declare the types of all your nonterminals.

Getting all this straight takes some time; you can plan on it.
Your best bet is to draw pictures of how you want the trees to
look, and then make the code match the pictures. No pictures ==
"Dr. J will ask to see your pictures and not be able to help if
you can't describe your trees."
Declaring value stack types for terminal and nonterminal
symbols

Unless you are going to use the default (integer) value stack, you
will have to declare the types of the elements on the value stack.
Actually, you do this by declaring which union member is to be
used for each terminal and nonterminal in the grammar.
Example: in the cocogram.y that I gave you we could add a %union
declaration with a union member named treenode:

%union {
   nodeptr treenode;
}

This will produce a compile error if you haven't declared a
nodeptr type using a typedef, but that is another story. To
declare that a nonterminal uses this union member, write something
like:

%type < treenode > function_definition

Terminal symbols use %token to perform the corresponding
declaration. If you had a second %union member (say struct token
*tokenptr) you might write:

%token < tokenptr > SEMICOL

Announcements

Having trouble debugging your grammar? "bison -v" generates a
.output file that gives the gory details of conflicts and such.

lecture #12 began here

Announcements

In honor of Dr. Jeffery's 10th anniversary, a minor extension
in Homework #3.

Conflicts in Shift-Reduce Parsing

"Conflicts" occur when an ambiguity in the grammar creates a


situation
where the parser does not know which step to perform at a given
point
during parsing. There are two kinds of conflicts that occur.

shift-reduce
a shift reduce conflict occurs when the grammar indicates
that
different successful parses might occur with either a
shift or a reduce
at a given point during parsing. The vast majority of
situations where
this conflict occurs can be correctly resolved by shifting.
reduce-reduce
a reduce reduce conflict occurs when the parser has two or
more
handles at the same time on the top of the
stack. Whatever choice
the parser makes is just as likely to be wrong as not. In
this case
it is usually best to rewrite the grammar to eliminate the
conflict,
possibly by factoring.

Example shift-reduce conflict:

S -> if E then S
S -> if E then S else S

In many languages two nested "if" statements produce a situation
where an "else" clause could legally belong to either "if". The
usual rule (to shift) attaches the else to the nearest (i.e. inner)
if statement.

Example reduce-reduce conflict:

(1) S -> id LP plist RP
(2) S -> E GETS E
(3) plist -> plist , p
(4) plist -> p
(5) p -> id
(6) E -> id LP elist RP
(7) E -> id
(8) elist -> elist , E
(9) elist -> E

By the point the stack holds ...id LP id
the parser will not know which rule to use to reduce the id: (5)
or (7).
Further Discussion of Reduce-Reduce and Shift-Reduce Conflicts

The following grammar, based loosely on our expression grammar from last time, illustrates a reduce-reduce conflict, and how you have to exercise care when using epsilon productions. Epsilon productions were helpful for some of the grammar rewriting methods, such as removing left recursion, but used indiscriminately, they can cause much trouble.

T : F | F T2 ;
T2 : p F T2 | ;
F : l T r | v ;

The reduce-reduce conflict occurs after you have seen an F. If the next symbol is a p there is no question of what to do, but if the next symbol is the end of file, do you reduce by rule #1 or #4?
A slightly different grammar is needed to demonstrate a shift-reduce conflict:

T : F g ;
T : F T2 g ;
T2 : t F T2 ;
T2 : ;
F : l T r ;
F : v ;

This grammar is not much different from before, and has the same problem, but the surrounding context (the "calling environments") of F causes the grammar to have a shift-reduce conflict instead of a reduce-reduce conflict. Once again, the trouble comes after you have seen an F: do you reduce the epsilon production, or instead shift, upon seeing a token g?

The .output file generated by "bison -v" explains these conflicts


in
considerable detail. Part of what you need to interpret them are
the
concepts of "items" and "sets of items" discussed below.

YACC precedence and associativity declarations

YACC headers can specify precedence and associativity rules for otherwise heavily ambiguous grammars. Precedence is determined by increasing order of these declarations: each successive %left or %right line declares operators that bind more tightly than those declared on earlier lines. Example:

%right ASSIGN
%left PLUS MINUS
%left TIMES DIVIDE
%right POWER
%%
expr: expr ASSIGN expr
| expr PLUS expr
| expr MINUS expr
| expr TIMES expr
| expr DIVIDE expr
| expr POWER expr
;

YACC error handling and recovery

Use the special predefined token error where errors are expected.

On an error, the parser pops states until it enters one that has an action on the error token.

For example: statement: error ';' ;

The parser must see 3 good tokens before it decides it has recovered.

yyerrok tells the parser to skip the 3-token recovery rule.

yyclearin throws away the current (error-causing?) token.

yyerror(s) is called when a syntax error occurs (s is the error message).

Improving YACC's Error Reporting

yyerror(s) overrides the default error message, which usually just says either "syntax error" or "parse error", or "stack overflow".

You can easily add information in your own yyerror() function. For example, GCC emits messages that look like:

goof.c:1: parse error before '}' token

using a yyerror function that looks like

void yyerror(char *s)
{
   fprintf(stderr, "%s:%d: %s before '%s' token\n",
           yyfilename, yylineno, s, yytext);
}

You could instead use the error recovery mechanism to produce better messages. For example:

lbrace : LBRACE | { error_code=MISSING_LBRACE; } error ;

where LBRACE is the expected token '{'. This uses a global variable error_code to pass parse information to yyerror().

Another related option is to call yyerror() explicitly with a better message string, and tell the parser to recover explicitly:

package_declaration: PACKAGE_TK error
   { yyerror("Missing name"); yyerrok; } ;

But, using error recovery to perform better error reporting runs against conventional wisdom that you should use error tokens very sparingly. What information from the parser determined we had an error in the first place? Can we use that information to produce a better error message?

LR Syntax Error Messages: Advanced Methods

The pieces of information that YACC/Bison use to determine that there is an error in the first place are the parse state (yystate) and the current input token (yychar). These are exactly the pieces of information one might use to produce better diagnostic error messages without relying on the error recovery mechanism and mucking up the grammar with a lot of extra production rules that feature the error token.

Even just the parse state is enough to do pretty good error messages. yystate is not part of YACC's public interface, though, so you may have to play some tricks to pass it as a parameter into yyerror() from yyparse(). Say, for example:

#define yyerror(s) __yyerror(s,yystate)

Inside __yyerror(msg, yystate) you can use a switch statement or a global array to associate messages with specific parse states. But figuring out which parse state means which syntax error message would be by trial and error.
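A sketch of what such a function might look like, with made-up parse state numbers and messages (you would discover the real state numbers by the trial and error just mentioned, or by reading the bison -v .output file):

#include <stdio.h>

extern int yylineno;                 /* assumes your lexer maintains yylineno */

void __yyerror(char *msg, int state)
{
   char *hint;
   switch (state) {
   case 12: hint = "missing ';' at end of declaration"; break;    /* hypothetical state */
   case 40: hint = "unbalanced parenthesis in expression"; break; /* hypothetical state */
   default: hint = msg;              /* fall back on the stock message */
   }
   fprintf(stderr, "line %d: %s\n", yylineno, hint);
}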

A tool called Merr is available that lets you generate this yyerror function from examples: you supply the sample syntax errors and messages, and Merr figures out which parse state integer goes with which message.

Merr also uses the yychar (current input token) to refine the
diagnostics
in the event that two of your example errors occur on the same
parse state.
See the Merr web page.

lecture #13 began here

Announcements

The TA's HW2 grades are available from the TA. The
distribution (out of 80) was
76, 74, 74, 74, 73, 72, 66, 65, 55, 52, 46, 35, 30, 30, 30, 15, 14
1/3rd of the class got an "A". The rest of you need to visit the
TA, see how
the grades were measured, see the professor, and most
important, get a lexical
analyzer working well enough to complete the later assignments
in this course.
If your grade was below 70, you probably want to get it working and resubmit it; I have asked the TA to accept resubmissions and average the grades (example: you got a 30, fixed it and resubmitted it and got a 70; your overall grade is a 50). This option is valid until the due date for the next homework.
After all of this adjustment, you are being graded relative to
your
peers, not on an absolute 90/80/... scale. Depending on your
peers'
performance, a 60% score at the end of the semester could be a
"B" for all I
know. The purpose of the late penalty is to encourage you not to
fall
further and further behind as the semester progresses, and to
encourage you
to in fact catch up if you do fall behind.
For HW3 (syntax checker), make sure your tar file
unpacks OK and that "make" just works for us out of the
box. In your paper
turnin, make sure you DO include the lex .l and yacc .y files,
and make
sure you do NOT include the .c files generated from the lex .l
and
yacc .y files (lex.yy.c, y.tab.c, whatever). Include all .h files and
your
makefile.

For HW3, test your work on as many test cases as possible.

Midterm Exam is coming up, March 16. Midterm review March 14. Three more lectures before that.

LR vs. LL vs. LR(0) vs. LR(1) vs. LALR(1)

The first char ("L") means input tokens are read from the left
(left to right). The second char ("R" or "L") means parsing
finds the rightmost, or leftmost, derivation. Relevant
if there is ambiguity in the grammar. (0) or (1) or (k) after
the main lettering indicates how many lookahead characters are
used. (0) means you only look at the parse stack, (1) means you
use the current token in deciding what to do, shift or reduce.
(k) means you look at the next k tokens before deciding what
to do at the current position.

LR Parsers

LR denotes a class of bottom up parsers that is capable of handling virtually all programming language constructs. LR is efficient; it runs in linear time with no backtracking needed. The class of languages handled by LR is a proper superset of the class of languages handled by top down "predictive parsers". LR parsing detects an error as soon as it is possible to do so. Generally, building an LR parser is too big and complicated a job to do by hand; we use tools to generate LR parsers.

The LR parsing algorithm is given below.

ip = first symbol of input
repeat {
   s = state on top of parse stack
   a = *ip
   case action[s,a] of {
      SHIFT s': { push(a); push(s'); advance ip }
      REDUCE A->beta: {
         pop 2*|beta| symbols; s' = new state on top
         push A
         push goto(s', A)
      }
      ACCEPT: return 0 /* success */
      ERROR: { error("syntax error", s, a); halt }
   }
}

Constructing SLR Parsing Tables:

Note: in Spring 2006 this material is FYI but you will not be
examined on it.

Definition: An LR(0) item of a grammar G is a production of G with a dot at some position of the RHS.
Example: The production A -> aAb gives the items:
A -> . a A b
A -> a . A b
A -> a A . b
A -> a A b .
Note: A production A -> epsilon generates only one item:
A -> .
Intuition: an item A -> alpha . beta denotes:
1. we have already seen a string derivable from alpha
2. we hope to see a string derivable from beta

Functions on Sets of Items

Closure: if I is a set of items for a grammar G, then closure(I) is the set of items constructed as follows:
1. Every item in I is in closure(I).
2. If A -> alpha . B beta is in closure(I) and B -> gamma is a production, then add B -> . gamma to closure(I).
These two rules are applied repeatedly until no new items can be added.
Intuition: If A -> alpha . B beta is in closure(I) then we hope to see a string derivable from B in the input. So if B -> gamma is a production, we should hope to see a string derivable from gamma. Hence, B -> . gamma is in closure(I).

Goto: if I is a set of items and X is a grammar symbol, then goto(I,X) is defined to be:

goto(I,X) = closure({[A -> alpha X . beta] | [A -> alpha . X beta] is in I})

Intuition: [A -> alpha . X beta] is in I means we've seen a string derivable from alpha, and we hope to see a string derivable from X beta. Now suppose we see a string derivable from X. Then we should "goto" a state where we've seen a string derivable from alpha X, and where we hope to see a string derivable from beta. The item corresponding to this is [A -> alpha X . beta].

Example: Consider the grammar

E -> E+T | T
T -> T*F | F
F -> (E) | id

Let I = {[E -> E . + T]} then:

goto(I,+) = closure({[E -> E+.T]})
          = closure({[E -> E+.T], [T -> .T*F], [T -> .F]})
          = closure({[E -> E+.T], [T -> .T*F], [T -> .F], [F -> .(E)], [F -> .id]})
          = { [E -> E+.T], [T -> .T*F], [T -> .F], [F -> .(E)], [F -> .id] }

The Sets of Items Construction

1. Given a grammar G with start symbol S, construct the augmented grammar by adding a special production S' -> S, where S' does not appear in G.

2. Algorithm for constructing the canonical collection of LR(0) items for an augmented grammar G':

begin
   C := { closure({[S' -> .S]}) };
   repeat
      for each set of items I in C:
         for each grammar symbol X:
            if goto(I,X) is not empty and goto(I,X) is not in C then
               add goto(I,X) to C;
   until no new sets of items can be added to C;
   return C;
end

Valid Items: an item A -> beta1 . beta2 is valid for a viable prefix alpha beta1 if there is a derivation:

S' =>*rm alpha A w => rm alpha beta1 beta2 w

Suppose A -> beta1 . beta2 is valid for alpha beta1, and alpha beta1 is on the parsing stack. Then:

1. if beta2 != epsilon, we should shift
2. if beta2 = epsilon, A -> beta1 is the handle, and we should reduce by this production

Note: two valid items may tell us to do different things for the same viable prefix. Some of these conflicts can be resolved using lookahead on the input string.
Constructing an SLR Parsing Table

1. Given a grammar G, construct the augmented grammar by adding the production S' -> S.
2. Construct C = {I0, I1, ..., In}, the set of sets of LR(0) items for G'.
3. State i is constructed from Ii, with parsing actions determined as follows:
   o [A -> alpha . a beta] is in Ii, where a is a terminal, and goto(Ii,a) = Ij : set action[i,a] = "shift j"
   o [A -> alpha .] is in Ii : set action[i,a] to "reduce A -> alpha" for all a in FOLLOW(A), where A != S'
   o [S' -> S .] is in Ii : set action[i,$] to "accept"
4. goto transitions are constructed as follows: for all nonterminals A, if goto(Ii, A) = Ij, then goto[i,A] = j.
5. All entries not defined by (3) & (4) are made "error".
6. If there are any multiply defined entries, the grammar is not SLR.
7. Initial state of the parser: the one constructed from the set of items containing [S' -> .S].

Example:

S -> aABe      FIRST(S)  = {a}    FOLLOW(S)  = {$}
A -> Abc       FIRST(A)  = {b}    FOLLOW(A)  = {b,d}
A -> b         FIRST(B)  = {d}    FOLLOW(B)  = {e}
B -> d         FIRST(S') = {a}    FOLLOW(S') = {$}

I0 = closure([S'->.S])
   = closure([S'->.S],[S->.aABe])
goto(I0,S) = closure([S'->S.]) = I1
goto(I0,a) = closure([S->a.ABe])
           = closure([S->a.ABe],[A->.Abc],[A->.b]) = I2
goto(I2,A) = closure([S->aA.Be],[A->A.bc])
           = closure([S->aA.Be],[A->A.bc],[B->.d]) = I3
goto(I2,b) = closure([A->b.]) = I4
goto(I3,B) = closure([S->aAB.e]) = I5
goto(I3,b) = closure([A->Ab.c]) = I6
goto(I3,d) = closure([B->d.]) = I7
goto(I5,e) = closure([S->aABe.]) = I8
goto(I6,c) = closure([A->Abc.]) = I9

lecture #14 began here

On Tree Traversals

Trees are classic data structures. Trees have nodes and edges, so
they are
a special case of graphs. Tree edges are directional, with roles
"parent"
and "child" attributed to the source and destination of the edge.
A tree has the property that every node has zero or one
parent. A node
with no parents is called a root. A node with no children is
called a leaf.
A node that is neither a root nor a leaf is an "internal
node". Trees have
a size (total # of nodes), a height (maximum count of nodes
from root to a leaf),
and an "arity" (maximum number of children in any one node).

Parse trees are k-ary, where there is a variable number of children bounded by a value k determined by the grammar. You may wish to consult your old data structures book, or look at some books from the library, to learn more about trees if you are not totally comfortable with them.

#include <stdio.h>
#include <stdlib.h>
#include <stdarg.h>

struct tree {
   short label;             /* what production rule this came from */
   short nkids;             /* how many children it really has */
   struct tree *child[1];   /* array of children, size varies 0..k */
};

struct tree *alctree(int label, int nkids, ...)
{
   int i;
   va_list ap;
   struct tree *ptr = malloc(sizeof(struct tree) +
                             (nkids-1)*sizeof(struct tree *));
   if (ptr == NULL) { fprintf(stderr, "alctree out of memory\n"); exit(1); }
   ptr->label = label;
   ptr->nkids = nkids;
   va_start(ap, nkids);
   for(i=0; i < nkids; i++)
      ptr->child[i] = va_arg(ap, struct tree *);
   va_end(ap);
   return ptr;
}
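A typical use of alctree() from inside a yacc semantic action might look like this (the production and the integer label 1000 are just for illustration):

AssignStmt : Var EQU Expr { $$ = alctree(1000, 3, $1, $2, $3); } ;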

Besides a function to allocate trees, you need to write one or more recursive functions to visit each node in the tree, either top to bottom (preorder), or bottom to top (postorder). You might do many different traversals on the tree in order to write a whole compiler: check types, generate machine-independent intermediate code, analyze the code to make it shorter, etc. You can write 4 or more different traversal functions, or you can write 1 traversal function that does different work at each node, determined by passing in a function pointer, to be called for each node.

void postorder(struct tree *t, void (*f)(struct tree *))


{
/* postorder means visit each child, then do work at the parent
*/
int i;
if (t == NULL) return;
/* visit each child */
for (i=0; i < t-> nkids; i++)
postorder(t->child[i], f);
/* do work at parent */
f(t);
}

You would then be free to write as many little helper functions as you want, for different tree traversals, for example:
void printer(struct tree *t)
{

if (t == NULL) return;
printf("%p: %d, %d children\n", t, t->label, t->nkids);
}
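For example, after a successful parse you might print the whole tree with something like the following, assuming yyparse() leaves the finished tree in a global named root (a made-up name):

extern struct tree *root;

if (yyparse() == 0)
   postorder(root, printer);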

Semantic Analysis

Semantic ("meaning") analysis refers to a phase of compilation


in which the
input program is studied in order to determine what operations
are to be
carried out. The two primary components of a classic semantic
analysis
phase are variable reference analysis and type checking. These
components
both rely on an underlying symbol table.

What we have at the start of semantic analysis is a syntax tree


that
corresponds to the source program as parsed using the context
free grammar.
Semantic information is added by annotating grammar symbols
with

semantic attributes, which are defined by semantic rules.


A semantic rule is a specification of how to calculate a semantic
attribute
that is to be added to the parse tree.
So the input is a syntax tree...and the output is the same tree,
only
"fatter" in the sense that nodes carry more information.
Another output of semantic analysis are error messages
detecting many
types of semantic errors.

Two typical examples of semantic analysis include:

variable reference analysis
   the compiler must determine, for each use of a variable, which variable declaration corresponds to that use. This depends on the semantics of the source language being translated.
type checking
   the compiler must determine, for each operation in the source code, the types of the operands and resulting value, if any.

Notations used in semantic analysis:

syntax-directed definitions
high-level (declarative) specifications of semantic rules
translation schemes
semantic rules and the order in which they get evaluated

In practice, attributes get stored in parse tree nodes, and the


semantic rules are evaluated either (a) during parsing (for easy
rules) or
(b) during one or more (sub)tree traversals.

Two Types of Attributes:

synthesized
   attributes computed from information contained within one's children. These are generally easy to compute, even on-the-fly during parsing.
inherited
   attributes computed from information obtained from one's parent or siblings. These are generally harder to compute. Compilers may be able to jump through hoops to compute some inherited attributes during parsing, but depending on the semantic rules this may not be possible in general. Compilers resort to tree traversals to move semantic information around the tree to where it will be used.

Attribute Examples

Isconst and Value

Not all expressions have constant values; the ones that do may
allow
various optimizations.
CFG            Semantic Rule
E1 : E2 + T    E1.isconst = E2.isconst && T.isconst
               if (E1.isconst)
                  E1.value = E2.value + T.value
E : T          E.isconst = T.isconst
               if (E.isconst)
                  E.value = T.value
T1 : T2 * F    T1.isconst = T2.isconst && F.isconst
               if (T1.isconst)
                  T1.value = T2.value * F.value
T : F          T.isconst = F.isconst
               if (T.isconst)
                  T.value = F.value
F : ( E )      F.isconst = E.isconst
               if (F.isconst)
                  F.value = E.value
F : ident      F.isconst = FALSE
F : intlit     F.isconst = TRUE
               F.value = intlit.ival
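Inside a bottom-up (postorder) traversal, one of these rules might be coded roughly as follows, assuming the tree nodes have been given isconst and value fields and that T_MUL_F is a made-up label for the T : T * F production:

case T_MUL_F:
   t->isconst = t->child[0]->isconst && t->child[1]->isconst;
   if (t->isconst)
      t->value = t->child[0]->value * t->child[1]->value;
   break;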

lecture #15 began here

Questions from the board and from the floor

Symbol Table Module

Symbol tables are used to resolve names within name spaces. Symbol tables are generally organized hierarchically according to the scope rules of the language. Although initially concerned with simply storing the names of the various symbols that are visible in each scope, symbol tables take on additional roles in the remaining phases of the compiler. In semantic analysis, they store type information. And for code generation, they store memory addresses and sizes of variables.

mktable(parent)
creates a new symbol table, whose scope is local to (or
inside) parent
enter(table, symbolname, type, offset)
insert a symbol into a table
lookup(table, symbolname)
lookup a symbol in a table; returns structure pointer
including type and offset. lookup operations are often
chained together progressively from most local scope on
out to global scope.
addwidth(table)
   sums the widths of all entries in the table ("widths" = #bytes, sum of widths = #bytes needed for an "activation record" or "global data section"). Worry not about this method until you implement code generation.
enterproc(table, name, newtable)
   enters the local scope of the named procedure
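One possible shape for the data structures behind this interface is sketched below; the field names, the hash table size, and the use of struct c_type (from the type checking discussion later) are all assumptions for illustration:

struct symtab {
   struct symtab *parent;            /* enclosing scope, NULL for the global table */
   struct symtabentry *bucket[64];   /* hash chains of entries */
};

struct symtabentry {
   char *name;
   struct c_type *type;              /* type information, for semantic analysis */
   int offset;                       /* address within its region, for code generation */
   struct symtab *scope;             /* set by enterproc(): a procedure's local table */
   struct symtabentry *next;         /* next entry in this hash chain */
};

struct symtab *mktable(struct symtab *parent);
void enter(struct symtab *t, char *name, struct c_type *type, int offset);
struct symtabentry *lookup(struct symtab *t, char *name);
int addwidth(struct symtab *t);
void enterproc(struct symtab *t, char *name, struct symtab *newtable);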

Variable Reference Analysis

The simplest use of a symbol table would check:

for each variable use, has it been declared? (undeclared error)
for each declaration, is it already declared? (redeclared error)

Reading Tree Leaves

In order to work with your tree, you must be able to tell, preferably trivially easily, which nodes are tree leaves and which are internal nodes, and for the leaves, how to access the lexical attributes.

Options:
1. encode in the parent what the types of children are
2. encode in each child what its own type is (better)

How do you do option #2 here?


Perhaps the best approach to all this is to unify the tokens and
parse tree
nodes with something like the following, where perhaps an
nkids value of -1
is treated as a flag that tells the reader to use
lexical information instead of pointers to children:

struct node {
int code;
/* terminal or nonterminal symbol */
int nkids;
union {
struct token { ... } leaf;
struct node *kids[9];
}u;
};

There are actually nonterminal symbols with 0 children (a nonterminal with a righthand side with 0 symbols), so you don't necessarily want to use an nkids of 0 as your flag to say that you are a leaf.
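Under the nkids == -1 convention, a leaf allocator might be sketched as follows (alcleaf is a made-up name; it needs <stdio.h> and <stdlib.h>):

struct node *alcleaf(int code, struct token tok)
{
   struct node *n = calloc(1, sizeof(struct node));
   if (n == NULL) { fprintf(stderr, "alcleaf: out of memory\n"); exit(1); }
   n->code = code;
   n->nkids = -1;          /* -1 flags a leaf: use u.leaf, not u.kids */
   n->u.leaf = tok;        /* copy the lexical attributes into the node */
   return n;
}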

Type Checking

Perhaps the primary component of semantic analysis in many traditional compilers consists of the type checker. In order to check types, one first must have a representation of those types (a type system) and then one must implement comparison and composition operators on those types using the semantic rules of the source language being compiled. Lastly, type checking will involve adding (mostly-) synthesized attributes through those parts of the language grammar that involve expressions and values.
Type Systems

Types are defined recursively according to rules defined by the source language being compiled. A type system might start with rules like:

o Base types (int, char, etc.) are types
o Named types (via typedef, etc.) are types
o Types composed using other types are types, for example:
   o array(T, indices) is a type. In some languages indices always start with 0, so array(T, size) works.
   o T1 x T2 is a type (specifying, more or less, the tuple or sequence T1 followed by T2; x is a so-called cross-product operator).
   o record((f1 x T1) x (f2 x T2) x ... x (fn x Tn)) is a type
   o in languages with pointers, pointer(T) is a type
   o (T1 x ... x Tn) -> Tn+1 is a type denoting a function mapping parameter types to a return type
o In some languages type expressions may contain variables whose values are types.

In addition, a type system includes rules for assigning these types to the various parts of the program; usually this will be performed using attributes assigned to grammar symbols.

lecture #16 began here

Midterm Exam Review

The Midterm will cover lexical analysis, finite automata, context free grammars, syntax analysis, and parsing. Sample problems:

1. Write a regular expression for numeric quantities of U.S. money that start with a dollar sign, followed by one or more digits. Require a comma between every three digits, as in $7,321,212. Also, allow but do not require a decimal point followed by two digits at the end, as in $5.99

2. Use Thompson's construction to write a nondeterministic finite automaton for the following regular expression, an abstraction of the expression used for real number literal values in C.

   (d+pd*|d*pd+)(ed+)?

3. Write a regular expression, or explain why you can't write a regular expression, for Modula-2 comments which use (* *) as their boundaries. Unlike C, Modula-2 comments may be nested, as in (* this is a (* nested *) comment *)

4. Write a context free grammar for the subset of C expressions that include identifiers and function calls with parameters. Parameters may themselves be function calls, as in f(g(x)), or h(a,b,i(j(k,l)))

5. What are the FIRST(E) and FOLLOW(T) in the grammar:

   E : E + T | T
   T : T * F | F
   F : ( E ) | ident

6. What is the epsilon-closure(move({2,4},b)) in the following NFA? That is, suppose you might be in either state 2 or 4 at the time you see a symbol b: what NFA states might you find yourself in after consuming b? (automata to be written on the board)

Q: What else is likely to appear on the midterm?

A: questions that allow you to demonstrate that you know the difference between a DFA and an NFA, questions about lex and flex and tokens and lexical attributes, questions about context free grammars: ambiguity, factoring, removing left recursion, etc.

On the mysterious TYPE_NAME

The C language typedef construct is an example where all the beautiful theory we've used up to this point breaks down. Once a typedef is introduced (which can first be recognized at the syntax level), certain identifiers should be legal type names instead of identifiers. To make things worse, they are still legal variable names: the lexical analyzer has to know whether the syntactic context needs a type name or an identifier at each point in which it runs into one of these names. This sort of feedback from syntax or semantic analysis back into lexical analysis is not un-doable, but it requires extensions added by hand to the machine generated lexical and syntax analyzer code.

typedef int foo;
foo x;                    /* a normal use of typedef... */
foo foo;                  /* try this on gcc! is it a legal global? */
void main() { foo foo; }  /* what about this ? */

370-C does not support typedef's, and without working typedef's the TYPE_NAME token simply will never occur. Typedef's are fair game for extra credit points.

Representing C (C++, Java, etc.) Types

The type system is represented using data structures in the compiler's implementation language. In the symbol table and in the parse tree attributes used in type checking, there is a need to represent and compare source language types. You might start by trying to assign a numeric code to each type, kind of like the integers used to denote each terminal symbol and each production rule of the grammar. But what about arrays? What about structs? There are an infinite number of types; any attempt to enumerate them will fail. Instead, you should create a new data type to explicitly represent type information. This might look something like the following:

struct c_type {
   int base_type;   /* 1 = int, 2 = float, ... */
   union {
      struct array {
         int size;
         struct c_type *elemtype;
      } a;
      struct c_type *p;    /* pointer: the type pointed to */
      struct struc {
         char *label;
         struct field **f;
      } s;
   } u;
};
struct field {
   char *name;
   struct c_type *elemtype;
};

Given this representation, how would you initialize a variable to represent each of the following types:

int [10][20]
struct foo { int x; char *s; }
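One possible answer, sketched as statements inside some initialization function; the extra base_type codes (3=array, 4=pointer, 5=struct, 6=char) and the new_type() helper are made up for illustration:

struct c_type *new_type(int base_type)
{
   struct c_type *t = calloc(1, sizeof(struct c_type));
   t->base_type = base_type;
   return t;
}

/* int [10][20]  ==  an array of 10 elements, each an array of 20 ints */
struct c_type *int_t = new_type(1);
struct c_type *arr20 = new_type(3);
arr20->u.a.size = 20;  arr20->u.a.elemtype = int_t;
struct c_type *arr10 = new_type(3);
arr10->u.a.size = 10;  arr10->u.a.elemtype = arr20;

/* struct foo { int x; char *s; } */
struct field *fx = calloc(1, sizeof(struct field));
fx->name = "x";  fx->elemtype = int_t;
struct field *fs = calloc(1, sizeof(struct field));
fs->name = "s";  fs->elemtype = new_type(4);        /* pointer ... */
fs->elemtype->u.p = new_type(6);                    /* ... to char */
struct c_type *foo_t = new_type(5);
foo_t->u.s.label = "foo";
foo_t->u.s.f = calloc(3, sizeof(struct field *));   /* NULL-terminated field list */
foo_t->u.s.f[0] = fx;  foo_t->u.s.f[1] = fs;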

Example Semantic Rules for Type Checking

grammar rule        semantic rule
E1 : E2 PLUS E3     E1.type = check_types(PLUS, E2.type, E3.type)

Where check_types() returns a (struct c_type *) value. One of the values it should be able to return is Error. The operator (PLUS) is included in the check_types function because behavior may depend on the operator: the result type for array subscripting works differently than the result type for the arithmetic operators, which may work differently (in some languages) than the result type for logical operators that return booleans.
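A sketch of how check_types() might start out for the arithmetic operators; the base_type codes follow the representation above, and the_int_type, the_float_type, and the_error_type are assumed globals built once at startup:

struct c_type *check_types(int op, struct c_type *lhs, struct c_type *rhs)
{
   switch (op) {
   case PLUS: case MINUS: case TIMES: case DIVIDE:
      if (lhs->base_type == 1 && rhs->base_type == 1)
         return the_int_type;                     /* int op int -> int */
      if ((lhs->base_type == 1 || lhs->base_type == 2) &&
          (rhs->base_type == 1 || rhs->base_type == 2))
         return the_float_type;                   /* mixed int/float arithmetic promotes to float */
      return the_error_type;
   /* ... other operators (subscripting, logicals, ...) get their own cases ... */
   }
   return the_error_type;
}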

Type Promotion and Type Equivalence

When is it legal to perform an assignment x = y? When x and y are identical types, sure. Many languages such as C have automatic promotion rules for scalar types such as shorts and longs. The results of type checking may include not just a type attribute; they may include a type conversion, which is best represented by inserting a new node in the tree to denote the promoted value. Example:

int x;
long y;
y = y + x;

For records/structures, some languages use name equivalence, while others use structure equivalence. Features like typedef complicate matters. If you have a new type name MY_INT that is defined to be an int, is it compatible to pass as a parameter to a function that expects regular int's? Object-oriented languages also get interesting during type checking, since subclasses usually are allowed anyplace their superclass would be allowed.
Implementing Structs

1. Storing and retrieving structs by their label -- the struct label is how structs are identified. You do not have to do typedefs and such. The labels can be keys in a separate hash table, similar to the global symbol table. You can put them in the global symbol table so long as you can tell the difference between them and variable names.

2. You have to store fieldnames and their types, from where the struct is declared. You could use a hash table for each struct, but a link list is OK as an alternative.

3. You have to use the struct information to check the validity of each dot operator like in rec.foo. To do this you'll have to lookup rec in the symbol table, where you store rec's type. rec's type must be a struct type for the dot to be legal, and that struct type should include a hash table or link list that gives the names and types of the fields -- where you can lookup the name foo to find its type.

lecture #17 began here

Run-time Environments

How does a compiler (or a linker) compute the addresses for the various instructions and references to data that appear in the program source code? To generate code for it, the compiler has to "lay out" the data as it will be used at runtime, deciding how big things are, and where they will go.

o Relationship between source code names and data objects during execution
o Procedure activations
o Memory management and layout
o Library functions

lecture #18 began here

Announcements

Affinity Research Group Workshop this Saturday, 9-3 in SH 124. Extra credit: 20 points will be added to your midterm exam grade for attending and providing sincere attention at this workshop. Lunch is also provided.

HW#5 is available

Scopes and Bindings

Variables may be declared explicitly or implicitly in some languages. Scope rules for each language determine how to go from names to declarations. Each use of a variable name must be associated with a declaration. This is generally done via a symbol table. In most compiled languages it happens at compile time (in contrast, for example, with LISP).
Environment and State

Environment maps source code names onto storage addresses (at compile time), while state maps storage addresses into values (at runtime). Environment relies on binding rules and is used in code generation; state operations are loads/stores into memory, as well as allocations and deallocations. Environment is concerned with scope rules; state is concerned with things like the lifetimes of variables.

Runtime Memory Regions

Operating systems vary in terms of how they organize program memory for runtime execution, but a typical scheme looks like this:

code
static data
stack (grows down)
heap (may grow up, from bottom of address space)

The code section may be read-only, and shared among multiple instances of a program. Dynamic loading may introduce multiple code regions, which may not be contiguous, and some of them may be shared by different programs. The static data area may consist of two sections, one for "initialized data", and one section for uninitialized data (i.e. all zero's at the beginning). Some OS'es place the heap at the very end of the address space, with a big hole so either the stack or the heap may grow arbitrarily large. Other OS'es fix the stack size and place the heap above the stack and grow it down.
Questions to ask about a language, before writing its code generator

1. May procedures be recursive? (Duh, all modern languages...)
2. What happens to locals when a procedure returns? (Lazy deallocation rare)
3. May a procedure refer to non-local, non-global names? (Pascal-style nested procedures, and object field names)
4. How are parameters passed? (Many styles possible, different declarations for each (Pascal), rules hardwired by type (C)?)
5. May procedures be passed as parameters? (Not too awful)
6. May procedures be return values? (Adds complexity for non-local names)
7. May storage be allocated dynamically? (Duh, all modern languages... but some languages do it with syntax (new), others with library (malloc))
8. Must storage be deallocated explicitly? (garbage collector?)

Activation Records

Activation records organize the stack, one record per method/function call.
return value
parameter

...
parameter
previous frame pointer (FP)
saved registers
...
FP--> saved PC
local
...
local
temporaries
SP--> ...

At any given instant, the live activation records form a chain and follow a stack discipline. Over the lifetime of the program, this information (if saved) would form a gigantic tree. If you remember prior execution up to a current point, you have a big tree whose rightmost edge is the live activation records, and the non-rightmost tree nodes are an execution history of prior calls.

"Modern" Runtime Systems

The preceding discussion has been mainly about traditional languages such as C. Object-oriented programs might be much the same, only every activation record has an associated object instance; they need one extra "register" in the activation record. In practice, modern OO runtime systems have many more differences than this, and other more exotic language features imply substantial differences in runtime systems. Here are a few examples of features found in runtimes such as the Java Virtual Machine and .Net CLR.

o Garbage collection. Automatic storage management plays a prominent role in most modern languages; it is one of the single most important features that makes programming easier. The basic problem in garbage collection: given a piece of memory, are there any pointers to it? (And if so, where exactly are all of them, please.) Approaches:
   o reference counting
   o traversal of known pointers (marking), either copying (2 heaps approach) or compacting (mark and sweep)
   o generational
   o conservative collection
o Reflection. Modern languages' values can often describe themselves. This plays a central role in Visual GUI builders and Visual IDE's, component architectures and other uses.
o Just-in-time compilation. Modern languages often have a virtual machine model...and a compiler built in to the VM that converts VM instructions to native code for frequently executed methods or code blocks.
o Security model. Modern languages may attempt to guarantee certain security properties, or prevent certain kinds of attacks.

Goal-directed programs have an activation tree each instant, due to suspended activations that may be resumed for additional results. The lifetime view is a sort of multidimensional tree, with three types of nodes.

Having Trouble Debugging?

To save yourself on the semester project in this class, you really do have to learn gdb and/or ddd as well as you can. Sometimes it can help you find your bug in seconds where you would have spent hours without it. But only if you take the time to read the manual and learn the debugger.

To work on segmentation faults: recompile all .c files with -g and run your program inside gdb to the point of the segmentation fault. Type the gdb "where" command. Print the values of variables on the line mentioned in the debugger as the point of failure. If it is inside a C library function, use the "up" command until you are back in your own code, and then print the values of all variables mentioned on that line.

There is one more tool you should know about, which is useful for certain kinds of bugs, primarily subtle memory violations. It is called electric fence. To use electric fence you add

/home/uni1/jeffery/ef/ElectricFence-2.1/libefence.a

to the line in your makefile that links your object files together to form an executable.

lecture #19 began here

Need Help with Type Checking?

o Implement the C Type Representation given in lecture #16.
o Read the Book
o What OPERATIONS (functions) do you need, in order to check whether types are correct? What parameters will they take?

Intermediate Code Generation

Goal: list of machine-independent instructions for each procedure/method in the program. Basic data layout of all variables.

Can be formulated as syntax-directed translation:

o add new attributes where necessary, e.g. for expression E we might have
   E.place - the name that holds the value of E
   E.code - the sequence of intermediate code statements evaluating E
o new helper functions, e.g.
   newtemp() - returns a new temporary variable each time it is called
   newlabel() - returns a new label each time it is called
o actions that generate intermediate code formulated as semantic rules

Production        Semantic Rules
S -> id ASN E     S.code = E.code || gen(ASN, id.place, E.place)
E -> E1 PLUS E2   E.place = newtemp();
                  E.code = E1.code || E2.code ||
                           gen(PLUS, E.place, E1.place, E2.place);
E -> E1 MUL E2    E.place = newtemp();
                  E.code = E1.code || E2.code ||
                           gen(MUL, E.place, E1.place, E2.place);
E -> MINUS E1     E.place = newtemp();
                  E.code = E1.code || gen(NEG, E.place, E1.place);
E -> LP E1 RP     E.place = E1.place;
                  E.code = E1.code;
E -> IDENT        E.place = id.place;
                  E.code = emptylist();
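The helper functions might be sketched as follows; the (region, offset) address struct and the R_LOCAL region code are assumptions that anticipate the address representation discussed later:

struct addr { int region; int offset; };   /* assumed (region, offset) address representation */

static int ntemps = 0, nlabels = 0;

struct addr newtemp(void)
{
   struct addr a;
   a.region = R_LOCAL;                     /* R_LOCAL: a made-up region code */
   a.offset = 4 * ntemps++;                /* simplification: 4-byte slots past the declared locals */
   return a;
}

int newlabel(void) { return ++nlabels; }   /* labels can simply be integers */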

Three-Address Code

Basic idea: break down source language expressions into simple pieces that:

o translate easily into real machine code
o form a linearized representation of a syntax tree
o allow us to check our own work to this point
o allow machine independent code optimizations to be performed
o increase the portability of the compiler

Instruction set:

mnemonic          C equivalent              description
ADD,SUB,MUL,DIV   x := y op z               store result of binary operation on y and z to x
NEG               x := op y                 store result of unary operation on y to x
ASN               x := y                    store y to x
ADDR              x := &y                   store address of y to x
LCONT             x := *y                   store contents pointed to by y to x
SCONT             *x := y                   store y to location pointed to by x
GOTO              goto L                    unconditional jump to L
BLESS,...         if x rop y then goto L    binary conditional jump to L
BIF               if x then goto L          unary conditional jump to L
BNIF              if !x then goto L         unary negative conditional jump to L
PARM              param x                   store x as a parameter
CALL              call p,n,x                call procedure p with n parameters, store result in x
RET               return x                  return from procedure, use x as the result
Declarations (Pseudo instructions):

These declarations list size units as "bytes"; in a uniform-size environment offsets and counts could be given in units of "slots", where a slot (4 bytes on 32-bit machines) holds anything.

global x,n1,n2   declare a global named x at offset n1 having n2 bytes of space
proc x,n1,n2     declare a procedure named x with n1 bytes of parameter space and n2 bytes of local variable space
local x,n        declare a local named x at offset n from the procedure frame
label Ln         designate that label Ln refers to the next instruction
end              declare the end of the current procedure

TAC Adaptations for Object Oriented Code

x := y field z   lookup field named z within y, store address to x
class x,n1,n2    declare a class named x with n1 bytes of class variables and n2 bytes of class method pointers
field x,n        declare a field named x at offset n in the class frame
new x            create a new instance of class name x
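One way to hold this intermediate code inside the compiler is a linked list of instruction structs; a sketch, reusing the assumed struct addr from above (all names are illustrative):

struct instr {
   int opcode;                      /* ADD, NEG, ASN, ... from the table above */
   struct addr dest, src1, src2;    /* (region, offset) addresses */
   struct instr *next;              /* intermediate code is a linked list */
};

struct instr *gen(int opcode, struct addr dest, struct addr src1, struct addr src2)
{
   struct instr *p = malloc(sizeof(struct instr));
   p->opcode = opcode;
   p->dest = dest;  p->src1 = src1;  p->src2 = src2;
   p->next = NULL;
   return p;                        /* a list of length one, ready for concat() */
}

struct instr *concat(struct instr *a, struct instr *b)
{
   struct instr *p = a;             /* naive O(n) append; storing a tail pointer is better */
   if (a == NULL) return b;
   while (p->next != NULL) p = p->next;
   p->next = b;
   return a;
}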

Variable Allocation and Access Issues

Given a variable name, how do we compute its address?

globals
   easy, symbol table lookup
locals
   easy, symbol table gives offset in (current) activation record
objects
   easy, symbol table gives offset in object; activation record has pointer to object in a standard location
locals in some enclosing block/method/procedure
   ugh. Pascal, Ada, and friends offer their own unique kind of pain. Q: does the current block support recursion? Example: for procedures the answer would be yes; for nested { { } } blocks in C the answer would be no.
   o if no recursion, just count back some number of frame pointers based on source code nesting
   o if recursion, you need an extra pointer field in the activation record to keep track of the "static link"; follow the static link back some # of times to find a name defined in an enclosing scope

Sizing up your Regions and Activation Records

Add a size field to every symbol table entry. Many types are not required for your C370 project but we might want to discuss them anyhow.

o The size of integers is 4 (for x86; varies by CPU).
o The size of reals is... ? (for x86; varies by CPU).
o The size of strings is... <= 256? You could allocate static 256 character arrays in the global area, but better to do them as a descriptor consisting of a length and a pointer.
o The size of arrays is (sizeof (struct descrip)) * the number of elements? Do we know an array size?
o Are arrays all int, or all real, or can they be mixed? (in BASIC and other dynamic languages, they can be mixed!)
o Are there arrays of strings? -- yes
o what about sizes of structs?

You do this sizing up once for each scope. The size of each scope is the sum of the sizes of symbols in its symbol table.

Run Time Type Information

Some languages would need the type information around at runtime; for example, dynamic object-oriented languages. It's almost the case that one just writes the type information, or symbol table information that includes type information, into the generated code in this case, but perhaps one wants to attach it to the actual values held at runtime.

struct descrip {
short type;
short size;
union {
char *string;
int ival;
float rval;
struct descrip *array;
/* ... for other types */
} value;
};

Compute the Offset of Each Variable

Add an address field to every symbol table entry. The address contains a region plus an offset in that region. No two variables may occupy the same memory at the same time.
Locals and Parameters are not Contiguous

For each function you need either to manage two separate regions for locals and for parameters, or else you need to track where in that region the split between locals and parameters will be.

Basic Blocks

Basic blocks are defined to be sequences of 1+ instructions in which there are no jumps into or out of the middle. In the most extreme case, every instruction is a basic block. Start from that perspective and then lump adjacent instructions together if nothing can come between them.

What are the basic blocks in the following 3-address code? ("read" is a 3-address code to read in an integer.)

read x
t1 = x > 0
if t1 == 0 goto L1
fact = 1
label L2
t2 = fact * x
fact = t2
t3 = x - 1
x = t3
t4 = x == 0

if t4 == 0 goto L2
t5 = addr const:0
param t5
; "%d\n"
param fact
call p,2
label L1
halt

Basic blocks are often used in order to talk about specific types of optimizations that rely on basic blocks. So if they are used for optimization, why did I introduce basic blocks? You can view every basic block as a hamburger; it will be a lot easier to eat if you sandwich it inside a pair of labels (first and follow)!
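For the example above, one reasonable answer is the following partition; the leaders are the first instruction, the targets of jumps (the labels), and the instructions that immediately follow jumps:

B1:   read x
      t1 = x > 0
      if t1 == 0 goto L1
B2:   fact = 1
B3:   label L2
      t2 = fact * x
      fact = t2
      t3 = x - 1
      x = t3
      t4 = x == 0
      if t4 == 0 goto L2
B4:   t5 = addr const:0
      param t5        ; "%d\n"
      param fact
      call p,2
B5:   label L1
      halt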

Intermediate Code for Control Flow

Code for control flow (if-then, switches, and loops) consists of code to test conditions, and the use of goto instructions and labels to route execution to the correct code. Each chunk of code that is executed together (no jumps into or out of it) is called a basic block. The basic blocks are nodes in a control flow graph, where goto instructions, as well as falling through from one basic block to another, are edges connecting basic blocks.

Depending on your source language's semantic rules for things like "short-circuit" evaluation for boolean operators, the operators like || and && might be similar to + and * (non-short-circuit) or they might be more like if-then code.

A general technique for implementing control flow code is to add new attributes to tree nodes to hold labels that denote the possible targets of jumps. The labels in question are sort of analogous to FIRST and FOLLOW; for any given list of instructions corresponding to a given tree node, we might want a .first attribute to hold the label for the beginning of the list, and a .follow attribute to hold the label for the next instruction that comes after the list of instructions. The .first attribute can be easily synthesized. The .follow attribute must be inherited from a sibling.

The labels have to actually be allocated and attached to instructions at appropriate nodes in the tree corresponding to grammar production rules that govern control flow. An instruction in the middle of a basic block needs neither a first nor a follow.
C code                       Attribute Manipulations
S -> if E then S1            E.true = newlabel();
                             E.false = S.follow;
                             S1.follow = S.follow;
                             S.code = E.code || gen(LABEL, E.true) || S1.code
S -> if E then S1 else S2    E.true = newlabel();
                             E.false = newlabel();
                             S1.follow = S.follow;
                             S2.follow = S.follow;
                             S.code = E.code || gen(LABEL, E.true) ||
                                      S1.code || gen(GOTO, S.follow) ||
                                      gen(LABEL, E.false) || S2.code
Exercise: OK, so what does a while loop look like?

lecture #20 began here

Announcement

Co-op positions available for fall 2006 at Los Alamos National Laboratory in the Computing, Telecommunications, and Networking Division.

LANL is seeking outstanding SOPHOMORE, JUNIOR AND NON-GRADUATING SENIOR LEVEL Computer Science majors to work in the areas of networking, desktop support, high performance computing or software engineering. Positions are available for the fall 2006 semester. MUST HAVE A GPA OF 3.0 OR HIGHER.

To request a referral go to www.nmsu.edu/pment, click on "Co-op Job Listings", Job #86, or call the co-op office at 646-4115. LANL is requiring a cover letter to also be sent; please send that via email to coop@nmsu.edu and in the subject line put attn: LANL cover letter.

Co-op Office
646-4115
More on Generating Code for Boolean Expressions

Last time we started to look at code generation for control structures such as if's and while's. Of course, before we can see the big picture on these we have to understand how to generate code for the boolean expressions that control these constructs.
Comparing Regular and Short Circuit Control Flow

Different languages have different semantics for booleans; for example, Pascal treats them as identical to arithmetic operators, while the C family of languages (and many others) specify "short-circuit" evaluation in which operands are not evaluated once the answer to the boolean result is known. Some ("kitchen-sink" design) languages have two sets of boolean operators: short-circuit and non-short-circuit. (Does anyone know a language that has both?)

Implementation techniques for these alternatives include:

1. treat boolean operators the same as arithmetic operators, evaluate each and every one into temporary variable locations.
2. add extra attributes to keep track of code locations that are targets of jumps. The attributes store link lists of those instructions that are targets to backpatch once a destination label is known. Boolean expressions' results evaluate to jump instructions and program counter values (where you get to in the code implies what the boolean expression results were).
3. one could change the machine execution model so it implicitly routes control from expression failure to the appropriate location. In order to do this one would
   o mark boundaries of code in which failure propagates
   o maintain a stack of such marked "expression frames"

Non-short Circuit Example

a<b || c<d && e<f translates into

100: if a<b goto 103
     t1 = 0
     goto 104
103: t1 = 1
104: if c<d goto 107
     t2 = 0
     goto 108
107: t2 = 1
108: if e<f goto 111
     t3 = 0
     goto 112
111: t3 = 1
112: t4 = t2 AND t3
     t5 = t1 OR t4

Short-Circuit Example

a<b || c<d && e<f translates into

     if a<b goto L1
     if c<d goto L2
     goto L3
L2:  if e<f goto L1
L3:  t = 0
     goto L4
L1:  t = 1
L4:  ...

Note: L3 might instead be the target E.false; L1 might instead be E.true; no computation of a 0 or 1 into t might be needed at all.

While Loops

So, a while loop, like an if-then, would have attributes similar to:

C code               Attribute Manipulations
S -> while E do S1   E.true = newlabel();
                     E.false = S.follow;
                     S1.follow = E.first;
                     S.code = gen(LABEL, E.first) ||
                              E.code || gen(LABEL, E.true) ||
                              S1.code ||
                              gen(GOTO, E.first)

C for-loops are trivially transformed into while loops, so they pose no new code generation issues.

lecture #21 began here

Intermediate Code Generation Examples

Consider the following small program. It would be fair game as input to your compiler project. In order to show blow-by-blow what the code generation process looks like, we need to construct the syntax tree and do the semantic analysis steps.

void main()
{
int i;
i = 0;
while (i < 20)
i = i * i + 1;
print(i);
}

This code has the following syntax tree

Intermediate Code Generation Example (cont'd)

Here is an example C program to compile:

i = 0;
if (i >= 20) goto L50;
i = i * i + 1;
goto 20;
print(i);

This program corresponds to the following syntax tree, which a successful homework #5 would build. Note that it has a height of approximately 10, and a maximum arity of approximately 4. Also: your exact tree might have more nodes, or slightly fewer; as long as the information and general shape is there, such variations are not a problem.

A syntax tree, with attributes obtained from lexical and semantic analysis, needs to be shown here. During semantic analysis, it is discovered that "print" has not been defined, so let it be:

void print(int i) { }

The code for the boolean conditional expression controlling the while loop is a list of length 1, containing the instruction t0 = i < 20, or more formally:

opcode   dest   src1   src2
LT       t0     i      20

The actual C representation of addresses dest, src1, and src2 is probably as a (region, offset) pair, so the picture of this intermediate code instruction really looks something like this:

opcode   dest        src1       src2
         local       local      const
LT       t0.offset   i.offset   20

Regions are expressed with a simple integer encoding like: global=1, local=2, const=3. Note that address values in all regions are offsets from the start of the region, except for region "const", which stores the actual value of a single integer as its offset.

opcode   dest        src1       src2
         local       local      local
MUL      t1.offset   i.offset   i.offset

lecture #22 began here


Comments on Trees and Attributes

The main problem in semantic analysis and intermediate code generation is to Move Information Around the Tree. Moving information up the tree is kind of easy and follows the pattern we used to build the tree in the first place. To move the information down the tree, needed for HW4, you write tree traversal functions. The tree traversal is NOT a "blind" traversal that does the same thing at each node. It has a switch statement on what grammar rule was used to build each node, and often does different work depending on what nonterminal and what grammar rule a given node represents.

Traversal code example

The following code sample illustrates a code generation tree traversal. Note the gigantic switch statement. In class a student asked the question of whether the link lists might grow longish, and if one is usually appending instructions on to the end, wouldn't a naive link list do a terrible O(n^2) job. To which the answer was: yes, and it would be good to use a smarter data structure, such as one which stores both the head and the tail of each list.

void codegen(nodeptr t)
{
   int i;
   if (t==NULL) return;
   /*
    * this is a post-order traversal, so visit children first
    */
   for(i=0; i < t->nkids; i++)
      codegen(t->child[i]);
   /*
    * back from children, consider what we have to do with
    * this node. The main thing we have to do, one way or
    * another, is assign t->code
    */
   switch (t->label) {
   case PLUS: {
      struct instr *g;               /* the instruction generated for this node */
      t->code = concat(t->child[0]->code, t->child[1]->code);
      g = gen(PLUS, t->address,
              t->child[0]->address, t->child[1]->address);
      t->code = concat(t->code, g);
      break;
      }
   /* ... really, we need a bazillion cases, perhaps one for each
    * production rule (in the worst case)
    */
   default:
      /* default is: concatenate our children's code */
      t->code = NULL;
      for(i=0; i < t->nkids; i++)
         t->code = concat(t->code, t->child[i]->code);
   }
}

Code generation examples

Let us build one operator at a time. You should implement your code generation the same way, simplest expressions first.

Zero operators.

if (x) S
translates into

if x != 0 goto L1
goto L2
label L1
...code for S
label L2

or if you are being fancy

if x == 0 goto L1
...code for S
label L1

I may do this without comment in later examples, to keep them short.

One relational operator.

if (a < b) S
translates into

if a >= b goto L1
...code for S
label L1

One boolean operator.

if (a < b && c > d) S


translates into

if (a < b)
if (c > d)
...code for S
which if we expand it

if a >= b goto L1
if c <= d goto L2
...code for S
label L2
label L1

by mechanical means, we may wind up with lots of labels for the same target; this is OK.
if (a < b || c > d) S
translates into

if (a < b) ...code for S
if (c > d) ...code for S

but it's unacceptable to duplicate the code for S! It might be huge! Generate labels for boolean-true-yes-we-do-this-thing, not just for boolean-false-we-skip-this-thing.

if a < b goto L1
if c > d goto L2
goto L3
label L2
label L1
...code for S
label L3

Array subscripting!

So far, we have only said, if we passed an array as a parameter we'd have to pass its address. 3-address instructions have an "implicit dereferencing semantics" which says all addresses' values are fetched / stored by default. So when you say t1 := x + y, t1 gets the values at addresses x and y, not the addresses. Once we recognize arrays are basically a pointer type, we need 3-address instructions to deal with pointers.

Now, what about arrays? Reading an array value: x = a[i]. Draw the picture. Consider that the machine uses byte-addressing, not word-addressing.

t0 := addr a
t1 := i * 4
t2 := plus t0 t1
t3 := deref t2

x := t3

What about writing an array value?
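A sketch of the corresponding store, a[i] = x, using the SCONT ("store through pointer") instruction from the three-address instruction set:

t0 := addr a
t1 := i * 4
t2 := plus t0 t1
*t2 := x          (SCONT: store x to the location t2 points at)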

Debugging Miscellany

Prior experience suggests if you are having trouble debugging, check:

makefile .h dependencies!
   if you do not list makefile dependencies for important .h files, you may get coredumps!
traversing multiple times by accident?
   at least in my version, I found it easy to accidentally re-traverse portions of the tree. This usually had a bad effect.
bad grammar?
   our sample grammar was adapted from good sources, but don't assume it's impossible that it could have a flaw or that you might have messed it up.

lecture #23 began here

Remind me to come back to HW #6 before the end of today's lecture.

Final Code

Goal: execute the program we have been translating, somehow.

Alternatives:

interpret the source code
   we could have built an interpreter instead of a compiler, in which the source code was kept in string or token form, and reparsed every execution. Early BASIC's did this, but it is Really Slow.
interpret the parse tree
   we could have written an interpreter that executes the program by walking around on the tree doing traversals of various subtrees. This is still slow, but successfully used by many "scripting languages".
interpret the 3-address code
   we could interpret the link-list or a more compact binary representation of the intermediate code
translate into VM instructions
   popular virtual machines such as JVM or .Net allow execution from an instruction set that is often higher level than hardware, may be independent of the underlying hardware, and may be oriented toward supporting the specific language features of our source language. For example, there are various BASIC virtual machines out there.
translate into "native" instructions
   "native" generally means hardware instructions.

In mainstream compilers,
final code generation takes a linear sequence of 3-address
intermediate
code instructions, and translates each 3-address instruction into
one or
more native instructions.
The big issues in code generation are (a) instruction selection,
and (b)
register allocation and assignment.

Collecting Information Necessary for Final Code Generation

o a top-down approach to learning your native target code is to study a reference work supplied by the chip manufacturer, such as the Intel 80386 Programmer's Reference Manual
o a bottom-up approach to learning your native target code is to study an existing compiler's native code. For example, running "gcc -S foo.c" will compile foo.c into a human-readable native assembler code equivalent foo.s file which you can examine. By systematically studying .s files for various toy C programs you can learn native instructions corresponding to each C construct, including ones equivalent to the various 3-address instructions.

Instruction Selection

The hardware may have many different sequences of instructions to accomplish a given task. Instruction selection must choose a particular sequence. At issue: how many registers to use, whether a special case instruction is available, and what addressing mode(s) to use. Given a choice among equivalent/alternative sequences, the decision on which sequence of instructions to use is based on estimates or measurements of which sequence executes the fastest. This is usually determined by the number of memory references incurred during execution, including the memory references for the instructions themselves. Simply picking the shortest sequence of instructions is often a good approximation of the optimal result, since fewer instructions usually translates into fewer memory references.

Register Allocation and Assignment

Accessing values in registers is much much faster than accessing main memory. Register allocation denotes the selection of which variables will go into registers. Register assignment is the determination of exactly which register to place a given variable in. The goal of these operations is generally to minimize the total number of memory accesses required by the program.

In the Old Days, there were Load-Store hardware architectures in which only one (accumulator) register was present. On such an architecture, register allocation and assignment is not needed; the compiler has few options about how it uses the accumulator register. Traditional x86 16-bit architecture was only a little better than a load-store architecture, with 4 registers instead of 1. At the other extreme, Recent History has included CPU's with 32 or more general purpose registers. On such systems, high quality compiler register allocation and assignment makes a huge difference in program execution speed. Unfortunately, optimal register allocation and assignment is NP-complete, so compilers must settle for doing a "good" job.

Discussion of Tree Traversals that perform Semantic Tests

Suppose we have a grammar rule

   AssignStmt : Var EQU Expr

We might extend the C semantic action for that rule with extra code after building our parse tree node:

   AssignStmt : Var EQU Expr { $$ = alctree(..., $1, $2, $3);
                               lvalue($1);
                               rvalue($3);
                             }

lvalue() and rvalue() are mini-tree traversals for the lefthand side and righthand side of an assignment statement. Their mission is to propagate information from the parent, namely, inherited attributes that tell nodes whether their values are being assigned to (initialized) or being read from.

void lvalue(struct tree *t)
{
   if (t->label == IDENT) {
      struct symtabentry *ste = lookup(t->u.token.name);
      ste->lvalue = 1;                       /* this identifier is assigned a value */
   }
   for (int i = 0; i < t->nkids; i++)
      lvalue(t->child[i]);
}

void rvalue(struct tree *t)
{
   if (t->label == IDENT) {
      struct symtabentry *ste = lookup(t->u.token.name);
      if (ste->lvalue == 0)
         warn("possible use before assignment");
   }
   for (int i = 0; i < t->nkids; i++)
      rvalue(t->child[i]);                   /* recurse with rvalue, not lvalue */
}

What is different about real life, as opposed to this toy example?

In real life, you should build a flow graph, and propagate these variable definition and use attributes using the flow graph instead of the syntax tree. For example, if the program starts by calling a subroutine at the bottom of the code which initializes all the variables, the flow graph will not be fooled into generating warnings, as you would be if you just started at the top of the code and checked whether, for each variable, assignments appear earlier in the source code than the uses of that variable.
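As a very rough sketch (the field names here are illustrative, not from the notes), such a flow graph might be built from basic blocks that record their successors along with per-variable definition and use facts:

/* Sketch of a control flow graph node for def/use analysis; illustrative only. */
struct instr;                          /* a 3-address instruction, defined elsewhere */

struct basicblock {
   struct instr *first, *last;         /* straight-line run of 3-address instructions */
   struct basicblock **succ;           /* successors: branch targets and fall-through */
   int nsucc;
   unsigned char *def;                 /* per-variable: assigned in this block? */
   unsigned char *use;                 /* per-variable: read before any assignment here? */
};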

lecture #24 began here

Runtime Systems

Every compiler (including yours) needs a runtime system. A runtime system is the set of library functions and possibly global variables maintained by the language on behalf of a running program. You use one all the time; in C it includes functions like printf(), plus perhaps internal compiler-generated calls to do things the processor doesn't do in hardware.

So you need a runtime system; potentially, this might be as big a job as writing the compiler, or bigger. Languages vary from assembler (no runtime system) and C (small runtime system, mostly C with some assembler) on up to Java (large runtime system, mostly Java with some C), and in even higher-level languages the compiler may evaporate and the runtime system become gigantic. The Unicon language has a relatively trivial compiler and a gigantic virtual machine and runtime system. Other scripting languages might have no compiler at all, doing everything (even lexing and parsing) in the runtime system.

For your project: whether you generate C or X86 or Java, you'll need a plan for what to do about a runtime system. And, in principle, I am not opposed to helping with this part. But the compiler and runtime system have to fit together; if I write part of the BASIC runtime system for you, or we write it together, we have to agree on things such as what the types of parameters and return values must look like.

So, what belongs in a Color BASIC runtime system? Anything not covered by a three-address instruction. Looking at cocogram.y:
o INPUT/PRINT
o READ/DATA
o CLEAR
o CLOAD/CSAVE/SKIPF
o CLS
o SET/RESET
o SOUND
o CHR$, LEFT$, MID$, RIGHT$, INKEY$
o ASC, INT, JOYSTK, LEN, PEEK, RND, VAL
o DIM
o string +, string compares

What would a runtime system function look like? It would take in and pass out BASIC values, represented as C structs. You would then link this code in to your generated C or assembler code (if you generated Java code, you would have to deal with the Java Native Interface or else write these functions in Java).

void PRINT(struct descrip *d)
{
   switch (d->type) {
   case INTEGER: printf("%d", d->value.ival); break;
   case REAL:    printf("%f", d->value.rval); break;
   case STRING:  printf("%.*s", d->size, d->value.string); break;  /* counted string */
   case ARRAY:   printf("cannot print arrays"); break;  /* can't get here */
   default:      printf("PRINT: internal error, type %d\n", d->type);
   }
}
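PRINT assumes some descriptor representation for BASIC values. Here is a guess at what struct descrip might contain, based only on the fields PRINT touches; the real definition lives in the runtime system and may differ:

/* Guessed layout of a BASIC value descriptor -- illustrative, not the libb.c original. */
struct descrip {
   int type;                   /* INTEGER, REAL, STRING, ARRAY, ... */
   int size;                   /* string length or array extent */
   union {
      int    ival;             /* INTEGER */
      double rval;             /* REAL */
      char  *string;           /* STRING: counted, not necessarily NUL-terminated */
      struct descrip *array;   /* ARRAY: vector of element descriptors (assumption) */
   } value;
};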

Now, let's look at the "whole" runtime system:

libb.c

More on Memory Management in the BASIC Runtime System

Arrays are interesting. They can be used without being declared or DIM'ed. They can only be DIM'ed once. If you use them before they are DIM'ed, they are implicitly DIM'ed to size 11 and can't be re-DIM'ed.
What do variables A, A(), A$, and A$() look like in memory? How does our runtime system make it so?
Let's take a look at DIM, in libc.c; this DIM is for arrays of numbers (a rough sketch appears after these questions). How would you handle arrays of strings?
Can you implement STRCAT for your BASIC runtime system?
What other BASIC statements, operators, or functions allocate memory? How would we avoid memory "leaks"?
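For what it's worth, here is a rough sketch of what a numeric-array DIM could look like, reusing the guessed descriptor layout above; this is illustrative, not the runtime system's actual code:

#include <stdio.h>
#include <stdlib.h>

/* Sketch only: DIM A(n) allocates indices 0..n; an array may be DIM'ed just once. */
void DIM(struct descrip *d, int n)
{
   if (d->type == ARRAY) {
      fprintf(stderr, "DIM: array already dimensioned\n");
      exit(1);
   }
   d->type = ARRAY;
   d->size = n + 1;                                  /* DIM A(10) -> 11 elements */
   d->value.array = calloc(d->size, sizeof(struct descrip));
   /* an implicit DIM (use before any DIM) would amount to calling DIM(d, 10) */
}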
Final Project comments

We have a week of classes more, plus a couple of weekends, before your final project (HW#7) is due.
o Really, really test your turnin, by "turning your .tar in to yourself": unpacking in a separate directory, and verifying that it builds and runs correctly in that separate subdirectory. Turnins that do not compile and run due to missing files, etc. may receive a 0.
o You are invited and encouraged, but not required, to make an appointment to demo your compiler with me during Finals week. The purpose is to make sure you get credit for those parts which you can show me are working (as opposed to me testing your program independently and somehow missing the parts that work).
o Student assignments which have "impossible similarity" to each other may result in a 0 for the assignment or an "F" for the course. Impossible similarity means: beyond a reasonable doubt, substantial code other than sample code fragments provided in lecture or lab notes has been shared or copied.

STRCAT

So, what does your STRCAT look like? Here's one.
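The version from lecture is not reproduced in these notes, so here is a stand-in sketch, assuming the descriptor layout guessed at earlier (treat the details as illustrative):

#include <stdlib.h>
#include <string.h>

/* Sketch only: concatenate two BASIC strings into a freshly allocated result. */
struct descrip STRCAT(struct descrip *s1, struct descrip *s2)
{
   struct descrip d;
   d.type = STRING;
   d.size = s1->size + s2->size;
   d.value.string = malloc(d.size + 1);        /* another spot where memory is allocated */
   memcpy(d.value.string, s1->value.string, s1->size);
   memcpy(d.value.string + s1->size, s2->value.string, s2->size);
   d.value.string[d.size] = '\0';
   return d;
}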


GOSUBs

Our 3-address instruction set has call and return instructions, but BASIC is less structured than regular procedural languages; you can GOSUB to any line number you want. You can't use a variable to GOSUB to line number X, but in principle every line number could be the target of a procedure call. If you use the "call" 3-address instruction to do GOSUB, your native code will have to make a clear distinction between BASIC calls and calls to runtime system (built-in) functions. Perhaps it is best to implement BASIC GOSUB by pushing a "param" (the next instruction following the GOSUB) and doing a "goto". The BASIC RETURN is then a "pop" followed by a "goto". What, we don't have a "pop" 3-address instruction? We do now... and the name of "param" should probably be "push" anyhow.

Come to think of it, we've been talking about doing a call to a built-in function such as PRINT, but that PRINT function we wrote is C code; it doesn't do a 3-address "ret" instruction, hmmm. How are we going to generate the native code for the 3-address "call" instruction? It may include an assembler call instruction, but it may also involve instructions to handle the interface between BASIC and C.
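If your final code is C, one concrete way to realize the push/goto idea above (a sketch, not a required design; the label and variable names are invented, and the computed goto is a GCC/Clang extension) is an explicit return stack:

#include <stdio.h>

/* Sketch of generated C for:
      10 GOSUB 500
      20 PRINT "BACK" : END
      500 PRINT "SUB"
      510 RETURN                 */
int main(void)
{
   void *retstack[256];                /* return points pushed by GOSUB */
   int retsp = 0;

line10:
   retstack[retsp++] = &&line20;       /* "push" the return point */
   goto line500;                       /* "goto" the GOSUB target */
line20:
   printf("BACK\n");
   return 0;                           /* END */

line500:
   printf("SUB\n");
   goto *retstack[--retsp];            /* RETURN = "pop" followed by "goto" */
}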

lecture #25 began here


Register Allocation and Assignment (cont'd)

When the number of variables in use at a given time exceeds the number of registers available (the common case), some variables may be used directly from memory, if the instruction set supports memory-based operations. When an instruction set does not support memory-based operations, all variables must be loaded into a register in order to perform arithmetic or logic using them.

Even if an instruction set does support memory-based operations, most compilers will want to load a value into a register while it is being used, and then spill it back out to main memory when the register is needed for another purpose. The task of minimizing memory accesses becomes the task of minimizing register loads and spills.
Some Code Generation Examples

Reusing a Register

Consider the statement:

   a = a+b+c+d+e+f+g+a+c+e;

Our naive three-address code generator would generate a lot of temporary variables here, when really one big number is being added. How many registers does the expression need? Some variables are referenced once, some twice. GCC generates:

   movl    b, %eax
   addl    a, %eax
   addl    c, %eax
   addl    d, %eax
   addl    e, %eax
   addl    f, %eax
   addl    g, %eax
   addl    a, %eax
   addl    c, %eax
   addl    e, %eax
   movl    %eax, a

Now consider

   a = (a+b)*(c+d)*(e+f)*(g+a)*(c+e);

How many registers are needed here?

   movl    b, %eax
   movl    a, %edx
   addl    %eax, %edx
   movl    d, %eax
   addl    c, %eax
   imull   %eax, %edx
   movl    f, %eax
   addl    e, %eax
   imull   %eax, %edx
   movl    a, %eax
   addl    g, %eax
   imull   %eax, %edx
   movl    e, %eax
   addl    c, %eax
   imull   %edx, %eax
   movl    %eax, a

And now this:

   a = ((a+b)*(c+d))+((e+f)*(g+a))+(c*e);

which compiles to

   movl    b, %eax
   movl    a, %edx
   addl    %eax, %edx
   movl    d, %eax
   addl    c, %eax
   movl    %edx, %ecx
   imull   %eax, %ecx
   movl    f, %eax
   movl    e, %edx
   addl    %eax, %edx
   movl    a, %eax
   addl    g, %eax
   imull   %edx, %eax
   leal    (%eax,%ecx), %edx
   movl    c, %eax
   imull   e, %eax
   leal    (%eax,%edx), %eax
   movl    %eax, a

Lastly (for now) consider:

   a = ((a+b)*(c+d))+(((e+f)*(g+a))/(c*e));

The division instruction adds new wrinkles. It operates on an implicit register accumulator which is twice as many bits as the number you divide by, meaning 64 bits (two registers) to divide by a 32-bit number. Note in this code that gcc would rather spill than use %ebx. %ebx is either being used implicitly or is reserved by the compiler for some (probably good) reason. %edi and %esi are similarly ignored.

   movl    b, %eax
   movl    a, %edx
   addl    %eax, %edx
   movl    d, %eax
   addl    c, %eax
   movl    %edx, %ecx
   imull   %eax, %ecx
   movl    f, %eax
   movl    e, %edx
   addl    %eax, %edx
   movl    a, %eax
   addl    g, %eax
   imull   %eax, %edx
   movl    c, %eax
   imull   e, %eax
   movl    %eax, -4(%ebp)
   movl    %edx, %eax
   cltd
   idivl   -4(%ebp)
   movl    %eax, -4(%ebp)
   movl    -4(%ebp), %edx
   leal    (%edx,%ecx), %eax
   movl    %eax, a

Code Generation for Virtual Machines

A virtual machine architecture such as the JVM changes the "final" code generation somewhat. We have seen several changes, some of which simplify final code generation and some of which complicate things.

no registers, simplified addressing
   a virtual machine may omit a register model and avoid complex addressing modes for different types of variables.
uni-size or descriptor-based values
   if all variables are "the same size", some of the details of memory management are simplified. In Java most values occupy a standard "slot" size, although some values occupy two slots. In Icon and Unicon, all values are stored using a same-size descriptor.
runtime type system
   requiring type information at runtime may complicate the code generation task, since type information must be present in generated code. For example, in Java, method invocation and field access instructions must encode class information.

Just for fun, let's compare the generated code for Java with the x86 native code we were just looking at:
iload_1
iload_2
iadd
iload_3

iload 4
iadd
imul
iload 5
iload 6
iadd
iload 7
iload_1
iadd
imul
iload_3
iload 5
imul
idiv
iadd
istore_1

lecture #26 began here

A Shallow Introduction to Code Optimization

There are major classes of optimization that can significantly speed up a compiler's generated code. Usually you speed up code by doing the work with fewer instructions and by avoiding unnecessary memory reads and writes. You can also speed up code by rewriting it with fewer gotos.
Peephole Optimization

Peephole optimizations look at the native code through a small, moving window for specific patterns that can be simplified. These are some of the easiest optimizations because they potentially don't require any analysis of other parts of the program in order to tell when they may be applied. Although some of these are stupid and you wouldn't think they'd come up, the simple code generation algorithm we presented earlier is quite stupid and does all sorts of obvious bad things that we can avoid.

redundant load or store
   sample:
      MOVE R0,a
      MOVE a,R0
   optimized as:
      MOVE R0,a

dead code
   sample:
      #define debug 0
      ...
      if (debug)
         printf("ugh");
   optimized as:
      (the if statement is removed entirely)

control flow simplification
   sample:
      if a < b goto L1
      ...
      L1: goto L2
   optimized as:
      if a < b goto L2
      ...
      L1: goto L2

algebraic simplification
   sample:
      x = x * 1;
   optimized as:
      (the statement is removed)

strength reduction
   sample:
      x = y * 16;
   optimized as:
      x = y << 4;

Constant Folding

Constant folding is performing arithmetic at compile time when the values are known. This includes simple expressions such as 2+3, but with more analysis, some variables' values may be known constants at some of their uses.

   x = 7;
   ...
   y = x+5;
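If the analysis shows that x still holds 7 at that use, the compiler can do the addition itself; the effect, sketched by hand, is:

   x = 7;
   ...
   y = 12;     /* x+5 folded at compile time */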

Common Subexpression Elimination

Code that redundantly computes the same value occurs fairly frequently, both explicitly, because programmers wrote the code that way, and implicitly, in the implementation of certain language features.

Explicit:
   (a+b)*i + (a+b)/j;

The (a+b) is a common subexpression that you should not have to compute twice.

Implicit:
   x = a[i]; a[i] = a[j]; a[j] = x;

Every array subscript requires an addition operation to compute the memory address; but do we have to compute the locations of a[i] and a[j] twice in this code?
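Written out by hand in C (the optimizer would do the equivalent on the intermediate code), the effect of eliminating those common subexpressions looks roughly like this; the function wrappers are just to make the fragments self-contained:

/* explicit case: (a+b) computed once and reused */
int cse_explicit(int a, int b, int i, int j)
{
   int t0 = a + b;
   return t0*i + t0/j;
}

/* implicit case: each element address computed once */
void cse_swap(int a[], int i, int j)
{
   int *pi = &a[i], *pj = &a[j];
   int x = *pi;
   *pi = *pj;
   *pj = x;
}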

Loop Unrolling

Gotos are expensive (do you know why?). If you know a loop will execute at least (or exactly) 3 times, it may be faster to copy the loop body those three times than to do a goto. Removing gotos simplifies code, allowing other optimizations.

   for(i=0; i<3; i++) { x += i * i; y += x * x; }

unrolls to

   x += 0 * 0;
   y += x * x;
   x += 1 * 1;
   y += x * x;
   x += 2 * 2;
   y += x * x;

which constant folding then reduces to

   y += x * x;
   x += 1;
   y += x * x;
   x += 4;
   y += x * x;

Hoisting Loop Invariants

   for (i=0; i<strlen(s); i++)
      s[i] = tolower(s[i]);

can be rewritten, since strlen(s) does not change during the loop, as

   t_0 = strlen(s);
   for (i=0; i<t_0; i++)
      s[i] = tolower(s[i]);

Interprocedural Optimization

Considering memory references across procedure call boundaries; for example, one might pass a parameter in a register if both the caller's and the callee's generated code know about it.

argument culling
   when the value of a specific parameter is a constant, a custom version of the called procedure can be generated, in which the parameter is eliminated and the constant is used directly (this may allow additional constant folding). In the example below, the call f(x,r,s,1) becomes a call to a specialized f_1(x,r,s):

Before:
   f(x,r,s,1);

   int f(int x, float y, char *z, int n)
   {
      switch (n) {
      case 1:
         do_A; break;
      case 2:
         do_B; break;
      ...
      }
   }

After:
   f_1(x,r,s);

   int f_1(int x, float y, char *z)
   {
      do_A;
   }
   int f_2(int x, float y, char *z)
   {
      do_B;
   }
   ...

Final Exam Review

The final exam is comprehensive, but with a strong emphasis on "back end" compiler issues: symbol tables, semantic analysis, and code generation.

o Review your lexical analysis, regular expressions, and finite automata.
o Review your syntax analysis, CFG's, and parsing.
o If a parser discovers a syntax error, how can it report what line number that error occurs on? If semantic analysis discovers a semantic error (or probable semantic error), how can it report what line number that error occurs on?
o What are symbol tables? What are symbol tables used for? What information is stored there?
o How does information get into a symbol table?
o How many symbol tables does a compiler need?
o What is "semantic analysis"? What does "semantic analysis" accomplish? What are its side effects?
o What are the primary activities of a compiler's semantic analyzer?
o What are memory regions, and why does a compiler care?
o What memory regions are there, and how do they affect code generation?
o What does code generation do, anyhow?
o What kinds of code generation are there?
o Why do (almost all) compilers use an "intermediate code"? What does intermediate code look like? How is it different from final code?
