
What Characterizes a Language

as mentioned earlier, a language consists of
    symbols
    rules for their use
    meanings
break up language description into two parts:
    syntax (form): rules that define the language's symbols and how
    they may be combined
    semantics (meaning): rules that give meaning to the symbols and
    their combinations
example:
syntax errors (in most languages, anyway):
a := b c
foo)a,b(

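semantic errors (each line is syntactically fine; it is an error only in context, e.g., an undeclared name or mismatched types):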
x := 5 # if x is undeclared
a := b+c
foo(a,b)
a[x] := 5

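relationship between syntax and semantics, e.g., classic FORTRAN

DO 10 I = 1. 20
X[I] = ...
10 CONTINUE

with a period typed where a comma was intended, the first line is not a loop
header: blanks are ignored, so it assigns 1.20 to a variable named DO10I.
illustrative of
    unusual lexical rules (spaces are not significant),
    poor syntax (CONTINUE can be used anywhere, or isn't even needed),
    poor rules (variables don't need to be declared)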

ECS140A

Programming Languages

02-1

there are formal ways to describe both syntax and semantics

in this course, we look at formal ways to describe syntax (although not much language theory)
we will use informal semantics, generally English descriptions


Syntax: BNF
Backus-Naur Form, first used in the Algol 60 report.
many variants since then, but all are similar and most give the power of a context-free grammar (studied in other courses)
example
<id> ::= <alpha> | <alpha> <rest>
<rest> ::= <rest> <alphanum> | <alphanum>
<alphanum> ::= <alpha> | <digit>
<alpha> ::= A | B | C | D
<digit> ::= 0 | 1 | 2
meta-symbol    meaning                         note
::=            is defined as
< >            meta-variable or nonterminal
|              or                              lower precedence than sequence of < >
A, B, ...      terminal                        appear literally

Note: BNF is a meta-language, i.e., a language about/describing a language


What language does the above grammar describe?
begin with the start symbol (i.e., the nonterminal defined by the first rule) and see
what strings are produced
Show some productions and parse trees: not a general description, but for
specific strings (sentences)
A         yes
B2D       yes
2A        no (starts with a digit)
E11112    no (E is not in <alpha> as written)

there are an infinite number of sentences in this language. if <alpha> and
<digit> are expanded to all letters and all digits, then in English: An id is
a sequence of letters and digits that starts with a letter.
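
The yes/no answers above can be checked mechanically. The following is a minimal Python sketch, not part of the original notes: the names ALPHA, DIGIT, and is_id are made up for illustration, and the terminal sets are exactly the four letters and three digits the example grammar lists.

```python
# Terminals exactly as listed in the example grammar (names are assumptions).
ALPHA = set("ABCD")
DIGIT = set("012")
ALPHANUM = ALPHA | DIGIT


def is_id(s: str) -> bool:
    """Return True iff s can be derived from <id> in the example grammar."""
    # <id> must start with an <alpha> ...
    if not s or s[0] not in ALPHA:
        return False
    # ... followed by zero or more <alphanum> symbols (the <rest> part).
    return all(ch in ALPHANUM for ch in s[1:])


for candidate in ["A", "B2D", "2A", "E11112"]:
    print(candidate, "yes" if is_id(candidate) else "no")
```

Widening ALPHA and DIGIT to all letters and digits gives the English description: a sequence of letters and digits that starts with a letter.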


Note the use of recursion in the rule for <rest>: it expresses repetition.
left recursion makes things harder to recognize (see text)
there are ways to remove left recursion (demonstrate on the grammar above),
but in general they are more complicated. so we express repetition in a simpler way:
{x}    0 or more instances of x

rewrite the above grammar as

<id> ::= <alpha> { <alphanum> }
<alphanum> ::= ...as before
<alpha> ::= ...
<digit> ::= ...
(What does <id> ::= <alpha> <alphanum> { <alphanum> } generate?)
another handy abbreviation is
[x]    0 or 1 instance of x

one final abbreviation is
(x)    means just x, but the parentheses indicate grouping,
       just like the normal use of parentheses in arithmetic expressions

Example:
<a> ::= x <b> | y <b>
can be simplified to
<a> ::= (x|y) <b>
Another example (precedence):
<a> ::= w x | y z
is not the same as
<a> ::= w (x | y) z
sometimes BNF is defined to include parentheses; in this class, it's OK for
you to use them too (unless otherwise stated).
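
To see why the two rules in the precedence example differ, one can enumerate the strings each generates. This is a small Python sketch under the assumption that w, x, y, and z are terminals; the helper names seq and alt are made up for illustration.

```python
from itertools import product


def seq(*parts):
    """Sequence: every way of picking one string from each part, in order."""
    return {"".join(choice) for choice in product(*parts)}


def alt(*parts):
    """Alternation: the union of the alternatives."""
    return set().union(*parts)


w, x, y, z = {"w"}, {"x"}, {"y"}, {"z"}

# <a> ::= w x | y z      ('|' binds more loosely than sequencing)
ungrouped = alt(seq(w, x), seq(y, z))
# <a> ::= w (x | y) z    (parentheses force the alternation first)
grouped = seq(w, alt(x, y), z)

print(sorted(ungrouped))  # ['wx', 'yz']
print(sorted(grouped))    # ['wxz', 'wyz']
```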

Parsing
parsing: the process of recognizing strings (sentences) in a language
used in compilers and other translators (e.g., interpreters)
many ways; we'll look at a simple method
Steps in Compilation
draw picture of different phases
lexical analysis (scanner)
    breaks up input into tokens
    discards whitespace and comments
    groups characters into identifiers and numbers
    although the parser could do this, a separate scanner is simpler.
    thus, the parser considers identifiers as tokens (terminals).
syntactic analysis (parser)
    takes tokens and checks whether they form a valid string
    according to the grammar, i.e., a valid program.
semantic analysis, e.g., type checking
code generation
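
The lexical analysis step can be sketched in a few lines of Python. This is only an illustration under assumed lexical rules (letters form identifiers, digits form numbers, := is the only operator); the names TOKEN_RE and tokenize are hypothetical, and comment handling is left out because the notes do not fix a comment syntax.

```python
import re

# Assumed lexical rules: letters form identifiers, digits form numbers,
# and ':=' is the only operator.
TOKEN_RE = re.compile(r"(?P<id>[A-Za-z]+)|(?P<number>[0-9]+)|(?P<assign>:=)")


def tokenize(text):
    """Return a list of (kind, lexeme) pairs; raise ValueError on bad input."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():          # whitespace is discarded, not a token
            pos += 1
            continue
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        # lastgroup names the alternative that matched: id, number, or assign.
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


print(tokenize("count := 42"))
# [('id', 'count'), ('assign', ':='), ('number', '42')]
```

The parser then works on the token kinds only, which is why identifiers and numbers can be treated as terminals.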


Simple Program Grammar Used for Parsing Example


program ::= block
block ::= {statement}
statement ::= assignment | while
assignment ::= id := expression
expression ::= id | number
while ::= while expression do block end
where the nonterminal number represents a nonempty sequence of digits
and the nonterminal id represents a nonempty sequence of letters. Note,
though, that below we treat id and number as terminals because they are
tokens returned by the scanner.
Give a few strings in this language


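Syntax: Syntax Graphs

aka syntax diagrams or railroad diagrams
direct mapping to/from BNF

BNF            example       railroad
nonterminal    x             box labeled x
terminal       y             circle labeled y
sequence       s1 s2         s1 -> s2
alternation    s1|s2|s3      parallel branches, one each through s1, s2, s3
optional       [x]           branch through x in parallel with an empty branch
repetition     {x}           loop that passes through x zero or more times

Note: optional is a special case of alternation in which one branch is ε (empty).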


Give syntax graphs for each production of example grammar and simplify
any if possible


Generating a Parser
method, for a given grammar
    determine first sets
    determine syntax graphs
    translate syntax graphs and first sets into code
first(V) = set of all terminals that can begin a string derived from V, and ε
if V can derive ε.
example using the grammar from before. The first sets (starting with
the simpler ones):
first(while) = { while }
first(expression) = { id, number }
first(assignment) = { id }
first(statement) = first(assignment) ∪ first(while)
= { id, while }
first(block) = first(statement) ∪ { ε }
= { id, while, ε }
first(program) = first(block) = { id, while, ε }
overlapping first sets for the right-hand sides of rules cause problems in
parsing using our technique: they correspond to potential ambiguities.
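
The first sets listed above can also be computed by a simple fixed-point iteration. The Python sketch below is one possible way to do it, not the method prescribed by the notes. Two things in it are assumptions: the repetition {statement} is rewritten in plain BNF as block ::= statement block | ε, and the nonterminal while is renamed while_stmt so that it does not clash with the terminal while.

```python
EPS = "ε"

# Each nonterminal maps to a list of alternatives; each alternative is a
# list of symbols. Any symbol that is not a key of GRAMMAR is a terminal.
GRAMMAR = {
    "program": [["block"]],
    "block": [["statement", "block"], []],          # [] is the ε alternative
    "statement": [["assignment"], ["while_stmt"]],
    "assignment": [["id", ":=", "expression"]],
    "expression": [["id"], ["number"]],
    "while_stmt": [["while", "expression", "do", "block", "end"]],
}


def first_sets(grammar):
    """Keep adding symbols to the first sets until nothing changes."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for alternative in alternatives:
                before = len(first[nt])
                for sym in alternative:
                    if sym not in grammar:       # terminal: it begins the string
                        first[nt].add(sym)
                        break
                    first[nt] |= first[sym] - {EPS}
                    if EPS not in first[sym]:    # sym cannot vanish: stop here
                        break
                else:
                    # Every symbol can derive ε, or the alternative is empty.
                    first[nt].add(EPS)
                changed |= len(first[nt]) != before
    return first


for nt, symbols in first_sets(GRAMMAR).items():
    print(f"first({nt}) = {sorted(symbols)}")
```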


show syntax graphs

code: assume existence of a routine called next, which sets global variable
sym to the next token in the input.

BNF            syntax graph component    code
terminal       circle x                  if sym = x then call next;
                                         else error;
nonterminal    box x                     call x;
sequence       x1 -> x2                  call x1; call x2
alternation    ...                       if sym in f_x1 then call x1;
                                         else if sym in f_x2 then call x2;
                                         else error;
repetition     ...                       while sym in f_x do call x

notes
    one procedure for each nonterminal
    f_z represents first(z)
    call z means to call the procedure representing the rule for z.


The pseudocode for parsing the given grammar is below. Note how there
is one procedure for each nonterminal in the grammar. We assume the
procedure next sets the global variable sym to the current token. We
name the first set for x f_x, and we assume the function first(f_x)
returns true iff sym is in f_x, i.e., in x's first set.
main() {
    /* read the first token. */
    next();
    /* parse the input. */
    program();
    /* do something to ensure that
     * all input was parsed.
     */
    ...
}

program() {
    block();
}

block() {
    while( first(f_statement) ) {
        statement();
    }
}

statement() {
    if( first(f_assignment) )
        assignment();
    else if( first(f_while) )
        while_proc();
    else ERROR;
}

assignment() {
    if( sym is an id )
        next();
    else ERROR;
    if( sym is a := )
        next();
    else ERROR;
    expression();
}

expression() {
    if( sym is an id )
        next();
    else if( sym is a number )
        next();
    else ERROR;
}

while_proc() {
    if( sym is a while )
        next();
    else ERROR;
    expression();
    if( sym is a do )
        next();
    else ERROR;
    block();
    if( sym is an end )
        next();
    else ERROR;
}
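
For readers who want to run it, the pseudocode above can be turned into a small Python recognizer. This is a sketch under stated assumptions, not the course's reference solution: the scanner (scan, TOKEN_RE), the ParseError exception, the expect helper, and the "eof" marker are all invented here, and the check that all input was parsed is done by expecting "eof" after the block.

```python
import re

KEYWORDS = {"while", "do", "end"}
F_ASSIGNMENT = {"id"}
F_WHILE = {"while"}
F_STATEMENT = F_ASSIGNMENT | F_WHILE

TOKEN_RE = re.compile(r"(?P<id>[A-Za-z]+)|(?P<number>[0-9]+)|(?P<op>:=)|(?P<bad>\S)")


class ParseError(Exception):
    pass


def scan(text):
    """Stand-in scanner (an assumption): return the list of token kinds."""
    kinds = []
    for m in TOKEN_RE.finditer(text):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "bad":
            raise ParseError(f"bad character {lexeme!r}")
        # Keywords and ':=' stand for themselves; other tokens keep their class.
        kinds.append(lexeme if lexeme in KEYWORDS or kind == "op" else kind)
    return kinds + ["eof"]


class Parser:
    def __init__(self, text):
        self.tokens = iter(scan(text))
        self.next()                          # read the first token

    def next(self):
        self.sym = next(self.tokens, "eof")  # built-in next() on the iterator

    def expect(self, kind):
        if self.sym != kind:
            raise ParseError(f"expected {kind}, found {self.sym}")
        self.next()

    def program(self):
        self.block()
        self.expect("eof")                   # ensure all input was parsed

    def block(self):
        while self.sym in F_STATEMENT:
            self.statement()

    def statement(self):
        if self.sym in F_ASSIGNMENT:
            self.assignment()
        elif self.sym in F_WHILE:
            self.while_proc()
        else:
            raise ParseError(f"unexpected {self.sym}")

    def assignment(self):
        self.expect("id")
        self.expect(":=")
        self.expression()

    def expression(self):
        if self.sym in ("id", "number"):
            self.next()
        else:
            raise ParseError(f"expected id or number, found {self.sym}")

    def while_proc(self):
        self.expect("while")
        self.expression()
        self.expect("do")
        self.block()
        self.expect("end")


def parses(text):
    """Return True iff text is a sentence of the example grammar."""
    try:
        Parser(text).program()
        return True
    except ParseError:
        return False


print(parses("x := 1 while x do y := x end"))  # True
print(parses("x := while"))                    # False
```

The empty input is accepted as well, which matches ε being in first(program).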

On hw2, use an integrated parser/semantic-checker/code-generator: intermix statements for semantic checks and code generation in the above code.
I.e., our project uses 1 pass; real translators typically use multiple passes
and communicate between passes via the program's parse tree.
What problem would left recursion in the grammar cause in the parser
generated using the above technique?