Beginning Python for Language Research

Beginning Python Programming for Language
Research
Stuart Robinson and Harald Baayen
April 17, 2007

Contents
1 Acknowledgements 13
2 Introduction 15
2.1 Aims and Scope . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Why Python? . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 What This Book Doesn’t Cover . . . . . . . . . . . . . . . . . . 20
2.5 What Else to Read . . . . . . . . . . . . . . . . . . . . . . . . 20
2.6 How to Use This Book . . . . . . . . . . . . . . . . . . . . . . 21
3 Programming for Language Research 25
3.1 The Uses of Programming in Language Research . . . . . . . . . 25

3.2 Becoming a Good Programmer . . . . . . . . . . . . . . . . . . . 28
3.3 Suggested Reading . . . . . . . . . . . . . . . . . . . . . . . . . 29
4 Installing and Running Python 31

4.1 Installing Python . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Running Python . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
I Python Programming Fundamentals 39

5 First Program 41
5.1 Saving Text as a Variable . . . . . . . . . . . . . . . . . . . . . . 42
5.2 Tokenization: Getting a List of Tokens . . . . . . . . . . . . . . . 43
5.3 Obtaining a Frequency Count for Unique Words . . . . . . . . . . 44
5.4 Printing the Word List . . . . . . . . . . . . . . . . . . . . . . . . 45
5.5 Putting It All Together . . . . . . . . . . . . . . . . . . . . . . . 46
6 Statements 49
6.1 Indentation and Block-Structuring . . . . . . . . . . . . . . . . . 50
6.2 Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.3 Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.4 Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
7 Data Types 59
7.1 Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
7.2 Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
7.3 The None Value . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.4 Querying and Converting Data Types . . . . . . . . . . . . . . . 63
7.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
8 Data Structures 65
8.1 Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
8.2 Dictionaries (aka, Hash Tables) . . . . . . . . . . . . . . . . . . 72
8.3 Tuples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
8.4 Nested Data Structures . . . . . . . . . . . . . . . . . . . . . . 78
8.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
9 Data Flow 81
9.1 Conditional Statements: if . . . . . . . . . . . . . . . . . . . . . 81

9.2 Looping, Iteration, Cycling etc.: for and while . . . . . . . . . 83
9.3 List Comprehension: Modifying a List In Situ . . . . . . . . . . . 86
9.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
10 Functions 89
10.1 The Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
10.2 Arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
10.3 Return Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
10.4 Functions in Action . . . . . . . . . . . . . . . . . . . . . . . . . 96
10.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
11 Errors and Exceptions 99

11.1 Handling Exceptions . . . . . . . . . . . . . . . . . . . . . . . . 99
11.2 Exception Types . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
11.3 Exceptions in Action . . . . . . . . . . . . . . . . . . . . . . . . 105
11.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
12 Input/Output 107
12.1 Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
12.2 Directories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
12.3 The Command Line . . . . . . . . . . . . . . . . . . . . . . . . . 116
12.4 IO in Action . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
12.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
13 Modules and Packages 121
13.1 Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
13.2 Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
13.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
14 Strings In Depth 129

14.1 Plain String Types . . . . . . . . . . . . . . . . . . . . . . . . . . 129
14.2 Unicode Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
14.3 String Literals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
14.4 Carving Strings at the Joints . . . . . . . . . . . . . . . . . . . . 130
14.5 Case Studies in String Manipulation . . . . . . . . . . . . . . . . 140
14.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
II Advanced Topics 151
15 Object Oriented Programming 153

15.1 What are Objects? . . . . . . . . . . . . . . . . . . . . . . . . . . 153
15.2 The Virtues of OO Programming . . . . . . . . . . . . . . . . . . 155
15.3 Major Concepts of OO Programming . . . . . . . . . . . . . . . . 159
15.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
16 Classes and Objects 181

16.1 Defining Classes . . . . . . . . . . . . . . . . . . . . . . . . . . 181
16.2 Class Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
16.3 Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
16.4 Privacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
16.5 Inheritance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
16.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
17 Regular Expressions 1 191
17.1 What is Text? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
17.2 Metacharacters . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
17.3 Quantifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
17.4 Word Boundaries . . . . . . . . . . . . . . . . . . . . . . . . . . 212
17.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
17.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
18 Regular Expressions 2 217
18.1 Using Regular Expressions: An Overview . . . . . . . . . . . . . 217
18.2 Creating Regular Expressions . . . . . . . . . . . . . . . . . . . 219
18.3 The Regular Expression Object . . . . . . . . . . . . . . . . . . 225
18.4 Extensions to Regular Expressions . . . . . . . . . . . . . . . . 230
18.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
18.6 Suggested Reading . . . . . . . . . . . . . . . . . . . . . . . . 238
19 XML 241
19.1 What is XML? . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
19.2 The Syntax of XML . . . . . . . . . . . . . . . . . . . . . . . . 242
19.3 Processing XML in Python . . . . . . . . . . . . . . . . . . . . 244
19.4 Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
19.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
19.6 Suggested Reading . . . . . . . . . . . . . . . . . . . . . . . . 250
20 Databases 251
20.1 Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
20.2 Pickling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
20.3 Relational Databases . . . . . . . . . . . . . . . . . . . . . . . . 256
20.4 Structured Query Language (SQL) . . . . . . . . . . . . . . . . . 258
20.5 Using Python with a Relational Database . . . . . . . . . . . . . 264
20.6 Case Study: Importing a Spreadsheet into a Lexical Database . . . 267
20.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
21 The World Wide Web 269

21.1 How the Web Works . . . . . . . . . . . . . . . . . . . . . . . . 269
21.2 Client Side Programming: Browsing the Web . . . . . . . . . . . 271
21.3 Server Side Programming: CGI Scripts for Web Servers . . . . . . 275
21.4 Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
21.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
Bibliography 286
Glossary 290
List of Figures
2.1 The Python Prompt . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 The Python Prompt . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.1 ??? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2 Command Prompt Window in Windows XP . . . . . . . . . . . . 33
4.3 Running the Python Interpreter from a Command Prompt Window 34
4.4 Path Error Running the Python Interpreter from a Command Prompt
Window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.5 Selecting ‘Environment Variables’ from the ‘System Properties’
Window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.6 Adding the Python Interpreter to the PATH Setting . . . . . . . . 37
5.1 ??? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
8.1 List Indices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
8.2 Negative List Indices . . . . . . . . . . . . . . . . . . . . . . . . 68
11.1 Class Hierarchy for Exception Class . . . . . . . . . . . . . . . . 102

11.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
17.1 Breakdown of Date String . . . . . . . . . . . . . . . . . . . . . 192

17.2 Regular Expression Matching . . . . . . . . . . . . . . . . . . . . 192
17.3 Using regexes1FileSearch.py to search for star in Non-Stop . . . 194
17.4 Using regexes1FileSearch.py to do a simple regular expression
search on The Forever War . . . . . . . . . . . . . . . . . . . . 195
18.1 Regular Expression Grouping . . . . . . . . . . . . . . . . . . . 231
21.1 ??? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269

21.2 Sample Web Page: (311) in Mozilla Firefox Browser . . . . . . . 270
List of Tables
2.1 Programming Languages Strong on String Processing . . . . . . . 17
5.1 Breakdown of the First Five Tokens from (3) . . . . . . . . . . . . 43
6.1 Comparison Operators . . . . . . . . . . . . . . . . . . . . . . . 55
10.1 Organization of Program in (113) . . . . . . . . . . . . . . . . . 97
14.1 String Types Compared . . . . . . . . . . . . . . . . . . . . . . 129
14.2 Comparing Words in a Minimal Pair . . . . . . . . . . . . . . . . 131
14.4 Conventions for String Interpolation . . . . . . . . . . . . . . . . 146
14.3 String Indices . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
15.1 Description of Shoebox Configuration in Figure 15.3.2 . . . . . . 160

15.2 Storing Morpheme Tiers as a List . . . . . . . . . . . . . . . . . 162
15.3 Storing Morpheme Tiers as a Dictionary . . . . . . . . . . . . . . 162
15.4 Inheritance Hierarchy From (214) . . . . . . . . . . . . . . . . . 167
15.5 Inheritance Hierarchy in (??) . . . . . . . . . . . . . . . . . . . . 170
17.1 Science Fiction Novels Used for Sample Corpus . . . . . . . . . . 194

17.2 Special Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . 205
17.3 Summary of Metacharacters . . . . . . . . . . . . . . . . . . . . 206
17.4 Quantifiers and their Bracket Notation Equivalents . . . . . . . . 211
17.5 Summary of Quantifiers . . . . . . . . . . . . . . . . . . . . . . 212
17.6 Summary of Metacharacters and Quantifiers . . . . . . . . . . . 215
18.1 Compilation Flags for Regular Expressions . . . . . . . . . . . . 221

18.2 re Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
18.3 Regular Expression Grouping in Python . . . . . . . . . . . . . . 240
18.4 Special Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . 240
20.1 Tab-Separated Values . . . . . . . . . . . . . . . . . . . . . . . . 253
20.2 Outcome of a Join of Person and Authorship . . . . . . . . 264
21.1 Newspaper Article from RSS Newsfeed for the Indonesian Newspaper
Antara . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
Chapter 1
Acknowledgements
The following people read part or all of draft versions of the book at some
point (in alphabetical order): Heriberto Avelino, Steven Bird, Brian MacWhinney,
Fermín Moscoso del Prado, Loretta O’Connor, and Janna Zimmermann.
Special thanks to: Dieter Van Uytvanck, for clarifying questions about how
arguments are passed to functions; Heriberto Avelino, for using the book for
self-study and providing detailed feedback.
Their feedback is greatly appreciated, since it has significantly improved the
quality of the final product. (Any blame for defects is of course solely
attributable to the authors.)
This textbook was used to teach an intensive Python programming course at
the Max Planck Institute for Psycholinguistics in Nijmegen, The Netherlands, and
we wish to thank the students of that class for their feedback on the materials
(not to mention the bottle of Italian wine and the copy of Iain M. Banks’s The
Algebraist that they gave the first author as a gift).
Finally, this book is about an open source programming language and was written
using open source software: Linux, Emacs, CVS, etc.
(http://www.opensource.org). Consequently, it would not be complete without an
acknowledgement of the pioneering efforts of Richard M. Stallman and Linus
Torvalds, who gave the world GNU and Linux, respectively, as well as Guido van
Rossum, Python’s benevolent dictator for life. The open source revolution
continues, and we are glad to be part of it.
Chapter 2
Introduction
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren’t special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one—and preferably only one—obvious way to do
it.
Although that way may not be obvious at first unless you’re Dutch.
Now is better than never.
Although never is often better than right now.
If the implementation is hard to explain, it’s a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea—let’s do more of those!
—The Zen of Python
2.1 Aims and Scope

This book provides an introduction to the Python programming language for
language researchers, assuming that the average reader will be a newcomer to
programming. Although the target audience is fairly specific, the introduction to
Python is reasonably general and could profitably be read by any non-programmer
wishing to learn Python from scratch. (For those with a background in
programming, the book might serve as a useful introduction to the language, but
there are probably better alternatives, such as Harms and McDonald (2000).)
Throughout the book, the emphasis is on the application of Python programming to
the types of problems that language researchers are likely to face, but the aim
is to give a comprehensive introduction that lays the foundation for the
development of general programming skills.
In the next few sections, we will tackle a few obvious questions that emerge
from the aims and scope of the book. In §2.2, an overview of the book’s
organization is provided. In §2.3, we explain why a language researcher ought to
learn Python rather than another programming language. In §2.4, we mention a few
topics that the emphasis on language problems leads us to neglect, and provide
references to other materials that could be used to supplement this book.
2.2 Organization
This book is divided into two parts:
Python basics: The first part is fairly general and teaches the fundamentals of
Python programming. As such, it could be skipped by someone who is already
familiar with Python but wishes to learn how to apply it to problems in
language research. Since it provides a fairly broad overview of Python, it is a
useful reference even for those who already know the language.
Python for Language Research: The second part is written specifically for a
language researcher (a linguist, sociolinguist, psycholinguist, etc.) who has
acquired the basics of Python programming (either by reading the first part
or from prior exposure) and wishes to develop more advanced skills. This
part provides a detailed look at a number of more advanced topics in
Python of relevance to language research (e.g., regular expressions, XML,
and databases).
2.3 Why Python?

An obvious question the reader may be asking himself is why he should learn
Python, and not some other programming language. Choosing a programming
language is not a trivial task, and there is always healthy debate among
programmers concerning the strengths and weaknesses of various programming
languages. For a classic endorsement of Python by Eric Raymond (author of The
New Hacker’s Dictionary and The Cathedral and the Bazaar), see his article
“Why Python?”, available from the web site of the Linux Journal:
http://www.linuxjournal.com/article/3882.
There is no shortage of languages to choose from: Ada, C#, C++, COBOL,
C, ColdFusion, Delphi, FORTRAN, Icon, JavaScript, Java, ML, PHP, Perl,
Prolog, Python, QBASIC, Ruby, Lisp, Smalltalk, Snobol, Tcl, Visual, etc. However,
not all programming languages are created equal, as we will see when we consider
a few desiderata for a good programming language for language research.
For language researchers, what really matters is that a language excel at string
processing, that is, the manipulation of strings (loosely speaking, computer
science lingo for chunks of text). Table 2.1 lists a few languages with good
string processing capabilities and compares them in terms of whether they are
still being actively developed by a community of developers and users and
whether or not they are object-oriented (see Chapter 15 for more explanation of
the term).
Table 2.1 Programming Languages Strong on String Processing

Language Name    Actively Developed    Object Oriented
SNOBOL           No                    No
Icon             No                    No
Perl             Yes                   Partially
Python           Yes                   Yes
Java             Yes                   Yes
Currently the only real competitors to Python in the arena of language research
are Perl and Java, since they offer many of the same string processing
capabilities as Python.
The main difference between Perl and Java is that the former is a “scripting
language” whereas the latter is a “system programming language”. It is
common to divide high-level programming languages into these two types.1
System programming languages allow arbitrary amounts of complexity, are
generally compiled (see Chapter ?? for an explanation), and are meant to operate
largely independently of other programs. Prototypical examples would be C or
Java. By contrast, scripting languages have fewer provisions for complexity, are
generally interpreted, and necessarily interact with other programs or the file
system. Prototypical scripting languages would be MS-DOS batch files or Unix
shell scripts (and of course Tcl). The dichotomy is not particularly neat and
many believe it to be highly arbitrary (referring to it as “Ousterhout’s
fallacy” or “Ousterhout’s false dichotomy”).
Sometimes the dichotomy is seen in terms of what the two kinds of languages are
good at doing: scripting languages are good for writing quick-and-dirty programs
while system programming languages are good for writing large-scale programs.
Another dimension is speed: scripting languages tend to run more slowly than
system programming languages. However, this is not a major concern in
language research, where a great deal of the programming is used for data
extraction and analysis (e.g., building a concordance) and not for doing
computations whose results are needed immediately in real time (e.g., spoken
word recognition).
1 This division is sometimes known as Ousterhout’s Dichotomy, after John
Ousterhout, the creator of the language Tcl.
Although both Perl and Python are scripting languages, there is also a good
deal that sets the two languages apart. Without seeking to disparage Perl in any
way, we would claim that Perl has a few drawbacks that recommend Python over
it for beginning programmers. The advantages of Python are the following:2
Syntax: Python employs a fairly straightforward and consistent syntax, whereas
Perl’s syntax tends to be a bit obscure (being influenced in part by legacy
scripting languages). This makes Perl somewhat off-putting for newcomers
to programming.
More structured: Perl’s “there’s more than one way to do it” philosophy can be a
bit dangerous when you are beginning to learn to program, since it imposes
no discipline. Python is more structured in this respect, and provides better
protection against developing bad programming habits.
Object orientation: Python is a language that is object-oriented to the core (see
Chapter 15 for more information), whereas Perl’s object orientation is
something of an afterthought and is not thoroughly integrated into the language
(Conway, 1999). In non-object-oriented programming, the focus is on the
sequence of instructions (or procedures) needed to get a job done. In
object-oriented programming, the focus is on the kind of task that is to be
done, and on writing the rules in such a way that what is common across
different tasks becomes clear. For instance, in order to add objects,
non-object-oriented languages use one kind of syntax for numbers (3+5 to get 8)
and another kind of syntax for strings (strcat("tea", "cup") to get “teacup”),
whereas Python uses the same syntax for both (3+5, "tea"+"cup"). In
other words, object-oriented programming is an approach to programming
that relies more on modelling the entities involved in a programming task
than on the procedures that are followed in the task. This is very useful for
various programming problems in the language sciences, and although Perl
offers object orientation, it is not central to the language’s philosophy.3
2 For more comparison of scripting languages (and an attempt at ranking them
relative to one another), try
http://merd.sourceforge.net/pixel/language-study/scripting-language/.
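The point about a single syntax for adding numbers and strings can be tried out directly. The following is a minimal sketch, written in Python 3 syntax (the listings in this book target Python 2); the Morpheme class and its form attribute are hypothetical names introduced only for illustration.

```python
# The + operator works on numbers and on strings alike.
print(3 + 5)          # addition of two integers
print("tea" + "cup")  # concatenation of two strings


class Morpheme:
    """A hypothetical class, used only to illustrate the idea."""

    def __init__(self, form):
        self.form = form

    def __add__(self, other):
        # Defining __add__ lets Morpheme objects use the same + syntax.
        return Morpheme(self.form + other.form)


word = Morpheme("walk") + Morpheme("ed")
print(word.form)
```

Because the class itself says what + means, code that combines objects does not need a separate function for each kind of object.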
Finally, Python has a large and active community of developers and users (as
does Perl, it should be noted in all fairness). There are numerous web sites
devoted to Python (e.g., http://www.python.org or ???) as well as a Usenet group
(comp.lang.python). Extensive documentation is available (on the web and
in print) and there is a growing literature on the language, with new books being
published on a regular basis, as a quick search on amazon.com (or any other
online bookseller) will reveal.
2.3.1 The Virtues of Open Source

Another selling point of Python is that it is open source software. The term open
source is used to describe any software whose internals are unrestricted (i.e.,
there is no legal restriction on access to the source code) and freely modifiable
(i.e., you can modify the source code to tune the program to your specific needs).
Open source software contrasts with proprietary software, which is not freely
visible or modifiable. In addition, open source software tends to be free of
charge, whereas proprietary software generally costs money.4
Open source software has many advantages over proprietary software. Its
development can proceed quite rapidly since it is done by enthusiastic volunteers
spread all over the world. Open source software also tends to be very
responsive to the user community, because features are driven more by the
community than by commercial considerations and many of the developers are users
of the software. Moreover, because anyone is allowed to look under the hood, bugs
tend to be found and fixed quickly. But one large advantage of open source
software is that it costs far less (often nothing), meaning that it will place
no burden on already-strained academic budgets. Finally, the fact that it is
open source means that there is no risk of its owner going out of business and
ceasing to develop or support it. For in-depth analysis of open source software
and its benefits, see Weber (2004). For more partisan views, try Raymond (2001)
or Stallman (2002).
3 This may change, however, with the long-awaited new version of the language,
Perl 6 (Allison Randal, 2004).
4 The term “free software” is also used to describe open source software, and is
still advocated by Richard M. Stallman, founder of The Free Software Foundation.
The term is problematic, however, due to the ambiguity of the word “free”, which
can either mean free in the sense of costing nothing (gratis) or free in the
sense that it is without restrictions (i.e., you can inspect and modify the
code). Open source software is not necessarily gratis. There is a great deal of
open source software being sold, sometimes at good profits (e.g., Red Hat Linux).
2.4 What This Book Doesn’t Cover

It is worthwhile to be explicit about what isn’t included in the scope of this book.
The two main limitations are that web programming is given only superficial
treatment (how to search for example sentences on the web) and graphical user
interfaces (GUIs) are not covered at all. These two topics have been passed over
because they require more background knowledge than our target audience is likely
to possess, and it would require a great deal of space to bring a reader with no
previous background to a level where any serious work could be done.
Furthermore, both topics are very much system-dependent, meaning that a great deal
of effort would have to be expended covering the various operating systems (Unix,
Macintosh, Windows, etc.) and their idiosyncrasies.
Furthermore, GUIs are often designed to make programs written by one person
easier for another to use, but this textbook aims to enable language researchers
to accomplish tasks without being limited by the GUIs of software designed for
specific purposes. In many cases, the time investment involved in building a user-
friendly GUI is not repaid, especially when you are using programs that you have
written yourself.
With these considerations in mind, we decided that it would be better to stick
to a solid core of Python programming which is largely system-independent and
cover it thoroughly rather than treat a large number of topics superficially. For
readers interested in these topics, we recommend Holden (2002) for an
introduction to web programming and Rempt (2002) for an introduction to
programming GUIs.
2.5 What Else to Read

Although this book is meant to be a self-sufficient introductory textbook, it is
sometimes easier to understand tricky programming concepts by reading multiple
explanations of them; if you don’t understand one explanation, you might under-
stand another. Some general introductions to Python that would supplement the
materials found in this book are Lutz and Ascher (1998) or Pilgrim (2004).
It is also helpful to have a Python reference within arm’s reach. We would
recommend Martelli (2003) or Beazley (2001), although we hasten to point out, for
the sake of those with limited budgets, that reference materials can be found for
free on the web at http://www.python.org, and there is a section just for
beginners at http://www.python.org/doc/Intros.html.
2.6 How to Use This Book
If you know nothing about programming and wish to learn, the best way to use
this book would be to read the first part from beginning to end, running each piece
of example code to ensure that you understand what it is meant to illustrate. Once
you have assimilated the material in the first part, you can move on to the second
part. The second part of the book is more specific, and would be useful even to
those readers who have prior familiarity with Python. It is less important to read
the second part from beginning to end, since the chapters cover various topics
that more or less stand alone. The only chapter in the second part that is really
a must-read is the first chapter, on string processing. The rest can be treated as
a sort of buffet, which the reader can pick and choose from. They are not truly
interdependent: while it is useful to understand regular expressions before reading
about XML or databases, it is by no means necessary.
To save the reader time, every piece of code given in this book is provided
as a Python script on the accompanying CD, where it can be found in a directory
called ‘code’. Throughout the book, code examples will be given in the following
format:
(1) helloWorld.py
print('Hello, World!')
Note the label at the top of the example. It has two parts, a unique identifier
number followed by the name of a file. The unique identifier number is what is
used to reference the code in the text. The filename refers to a Python script that
can be run to see the code in action. The code in (1) can therefore be found in the
file helloWorld.py. It is recommended that you run each piece of code as you
read the examples. This is quite easy to do in a Unix environment, assuming that
the right version of Python is installed. (You need to have version 2.3 or later.) To
check which version of Python you are running and then run the code in (1), run
the commands shown in Figure 2.1.
Figure 2.1 The Python Prompt
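The version check can also be made from inside Python itself. The following is a minimal sketch, not a reproduction of the commands in Figure 2.1; the names REQUIRED and version_ok are hypothetical, and the code is written so that it should run on old and new interpreters alike.

```python
import sys

# The minimum version stated in the text, as a (major, minor) pair.
REQUIRED = (2, 3)


def version_ok(version=None, required=REQUIRED):
    """Return True if the given version is at least the required one."""
    if version is None:
        # sys.version_info holds the running interpreter's version.
        version = sys.version_info
    return tuple(version[:2]) >= required


print(sys.version)
if version_ok():
    print("This interpreter is recent enough.")
else:
    print("Please install Python 2.3 or later.")
```

Comparing (major, minor) pairs rather than version strings avoids mistakes such as treating "2.10" as older than "2.3".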
On Windows or a Macintosh computer, the process is not quite as
straightforward (see §4 for installation instructions). [SAY MORE] It is
advisable to have a look at one of the numerous books dedicated to Python on
Windows, e.g., Hammond and Robinson (2000).
Since you may wish to tinker with the example code, we recommend that you
make a local copy of it, and put the originals someplace where you can recopy
them if necessary. (It’s always a good idea to make backups of code, just in case
you decide to revert to a prior version or accidentally delete what you are working
on.)
Alternatively, you can type example code straight into Python by running
Python interactively with the Python prompt, as shown in Figure 2.2.
Figure 2.2 The Python Prompt
In fact, the Python prompt is a great way to get instant gratification since the
code is evaluated on a line-by-line basis, which means that any errors will imme-
diately be caught. Catching errors immediately through the Python prompt can
be very instructive, since it helps teach the user how Python responds to different
kinds of errors—for example, typing errors, syntax errors, conceptual errors, etc.
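The three kinds of error just mentioned can be provoked deliberately. The following is a minimal sketch in Python 3 syntax (the listings in this book target Python 2); the misspelled name pritn, the list caught and the sample words are invented for illustration.

```python
caught = []

# A typing error: the name 'pritn' is not defined, so Python raises NameError.
try:
    pritn("hello")
except NameError as error:
    caught.append("NameError")
    print("NameError:", error)

# A syntax error: the closing parenthesis is missing. The faulty code is kept
# in a string and handed to compile(), so that the error can be caught here.
try:
    compile("print('hello'", "<example>", "exec")
except SyntaxError as error:
    caught.append("SyntaxError")
    print("SyntaxError:", error.msg)

# A conceptual error: the code runs without complaint but computes the wrong
# thing, so Python cannot catch it for you.
words = ["the", "cat", "sat"]
total = sum(len(word) for word in words)
wrong_average = total / 2            # bug: should divide by len(words)
right_average = total / len(words)
print(wrong_average, right_average)
```

Only the first two kinds produce an error message; the third has to be found by inspecting the results, which is one reason to try small pieces of code at the prompt.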
Chapter 3
Programming for Language Research
This book is aimed at language researchers (fieldworkers, sociolinguists,
phonologists, psycholinguists, etc.) who wish to use the Python programming
language. In the previous chapter, we discussed the virtues of Python for
language research. Here we will look at why a language researcher would want to
learn to program in the first place and what it takes to become a skilled
programmer.
3.1 The Uses of Programming in Language Research

A reasonable question that language researchers might ask is, “Why should I learn
to program?” One might argue that some knowledge of computer programming
is in and of itself desirable in the modern “information age”, as Chris Manning
explicitly does in his review of Hammond’s Programming for Linguists: Java
Technology for Language Researchers (Manning, 2005):
To be able to efficiently perform empirical linguistic work, be it
working with large corpora, your own collected language data, or
psycholinguistic experiments, it is extremely useful to be able to write your
own programs. Otherwise one’s ability to do research projects is
artificially limited by available computer tools, or the time available to
do by hand tasks what could be done much more efficiently by
computer. Established faculty can solve this problem by hiring research
assistants, but for the graduate student, this is a basic form of literacy.
But the argument in favor of learning to program can be made more concrete
by looking at a number of areas where the ability to program is invaluable. We
will discuss a few of these briefly: data munging, data analysis, simulation and
modelling, and computational linguistics.
3.1.1 Data Munging
Language researchers deal with data, and increasingly lots of it. [MENTION
GROWING SIZE OF CORPORA] That data typically takes the form of text. The
term data munging refers to the manipulation of data for sundry purposes. Many
of the boring, repetitive tasks involved in language research can be
easily automated, such as reformatting texts, finding systematic typos, and changing
orthographies in a text.1
A language researcher who has some programming skills is in a position to
vastly improve his productivity, since he will be able to hand over a good deal of
the grunt work to a computer, allowing it to do the tedious work and leaving
him to do the fun stuff (programming, data analysis, write-up, etc.).2 Python,
being a high-level scripting language, is especially suited to data munging.
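To give a flavour of what such automation can look like, here is a small sketch that changes the orthography of a line of text. The respelling rules, the function name `respell`, and the sample sentence are all hypothetical; a real project would substitute its own conventions.

```python
# Hypothetical respelling rules: each old spelling is paired with a new one.
respellings = [("ph", "f"), ("ck", "k")]

def respell(line):
    """Apply every respelling rule to a line of text, in order."""
    for old, new in respellings:
        line = line.replace(old, new)
    return line

sample = "the phonetic clock"
converted = respell(sample)
print(converted)
```

Because the rules live in a single list, adding or changing a convention means editing one line rather than hunting through the text by hand.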
By way of example, one of the authors has been compiling a dictionary of Rotokas,
a non-Austronesian language spoken on the island of Bougainville, Papua
New Guinea. The dictionary is stored in the format used by Shoebox, a program
made available by the Summer Institute of Linguistics for storing dictionaries
and glossing texts (for more information, go to
http://www.sil.org/computing/shoebox/). Although Shoebox has built-in dictionary
creation capabilities, the results were not quite what the author desired. In order to
create a dictionary formatted according to his wishes, he wrote a Python script that
took the Shoebox dictionary file, manipulated the data in it, and produced a much
more aesthetically pleasing LaTeX-formatted document. Without some programming
skills, this would not have been possible, and the resulting dictionary would
have been poorer for it.
It is always preferable to have control over one’s own data, and not to leave
every task in the hands of third-party software vendors, especially when dealing
with academic software (which does not have the same market potential and
is therefore not as well supported or documented). If the Summer Institute of
Linguistics stopped supporting Shoebox, for example, it is unclear what would
happen to it. When it comes to proprietary software, users are at the
developer’s mercy.
1 The earliest computers used by codebreakers during WWII were employed to speed up the
process of decrypting Enigma communiqués. Before computers came into the picture, the task
was done by armies of secretaries!
2 Fortunately, computers don’t mind doing our dirty work, at least not yet. Maybe one day the
kind of carping heard from Marvin the Paranoid Android in The Hitchhiker’s Guide to the Galaxy
will become a reality: “I’ve been ordered to take you down to the bridge. Here I am, brain the
size of a planet and they ask me to take you down to the bridge. Call that job satisfaction? ’Cos I
don’t.”
3.1.2 Data Analysis
One of the areas where programming is especially useful is in data analysis. Al-
though there is a great deal of software available for performing statistical anal-
ysis, these programs usually require input to be formatted in a particular fashion.
Knowledge of a scripting language enables one to massage one’s data into a for-
mat where it can be readily analyzed. Furthermore, in some cases, custom data
analysis can be more easily programmed on demand.
Finally, although there are many good programs for doing statistical analysis,
a good deal of statistical analysis can be done within Python itself. Python
has extensive libraries for number-crunching, and there are usually ways of using
Python to run other programs. For example, Python is able to run scripts that call
R, an open source statistical language which provides a wide variety of statistical
techniques as well as extensive graphing capabilities (Verzani, 2004; Baayen,
2006).
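The idea of using Python to run another program can be sketched with the standard `subprocess` module. Since R may not be installed on every machine, this sketch makes the assumption of launching a second Python interpreter as a stand-in for the external program; the command and variable names are illustrative only.

```python
import subprocess
import sys

# Assumption: a second Python interpreter stands in for an external
# program such as R. The command is given as a list of strings.
command = [sys.executable, "-c", "print(6 * 7)"]

# Start the program and capture whatever it writes to standard output.
process = subprocess.Popen(command, stdout=subprocess.PIPE)
output, _ = process.communicate()

# The captured output arrives as raw bytes, so decode and tidy it up.
result = output.decode("ascii").strip()
print(result)
```

Calling a statistics package would follow the same pattern, with the command list naming that package's executable and the script it should run.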
3.1.3 Simulation and Modelling

We mention computer simulation and computer modelling together since the di-
viding line between the two enterprises is indistinct. In an introduction to com-
puter modelling for social scientists, Gilbert and Troitzsch (2005, ???) treat sim-
ulation as a type of modelling:
Simulation is a particular type of modelling. Building a model is a

well-recognized way of understanding the world: something we do
all the time, but which science and social science has refined and
formalized. A model is a simplification—smaller, less detailed, less
complex, or all of these together—of some other structure or system.
Computer modelling is already a well-established part of other disciplines,
such as biology, where it plays a central role in a variety of subfields, e.g.,
population ecology (Banks, 1998; Bernstein, 2003). The use of modelling in linguistics
is a relatively recent development, but it is already bearing fruit, judging from
work such as De Boer (2001), which uses an agent-based computational model of
the speech situation to simulate the evolution of vowel systems.
Python is well-suited to the task of modelling. As a thoroughly object-oriented
language, it is amenable to the construction of complex, large-scale systems.
Furthermore, its ease of use makes it particularly good for the rapid
development of prototype systems. In fact, it is used by developers at Google,
as a quote from Peter Norvig (director of search quality
for Google) confirms: “Python has been an important part of Google since the
beginning, and remains so as the system grows and evolves. Today dozens of
Google engineers use Python, and we’re looking for more people with skills in
this language.”
3.1.4 Computational Linguistics

Computational linguistics is an interdisciplinary field dealing with
the statistical and logical modeling of natural language from a computational
perspective. Natural language processing (NLP) is a subfield of computational
linguistics. It studies the problems inherent in the processing and manipulation of
natural language with an eye towards making computers produce and comprehend
human languages. The growth of the internet and the WWW has fuelled
a resurgence of interest in the field, since it provides techniques of considerable
utility in the storage, classification, retrieval, and analysis of text (Farghay, 2003).
The usefulness of Python in this domain can be judged by its increasingly
central role in introductory courses in computational linguistics and its widespread
usage in industry. Python has already become the language of choice in computational
linguistics, judging from the number of universities where introductory
courses are taught in Python: Brown, Brandeis,
Penn State, the University of Edinburgh, and the University of Melbourne,
to name a few.
Although this book is not meant to be an introduction to the field (for that, try
Jurafsky and Martin (2000) or Bird et al. (2006)), it does teach basic programming
skills, which are an essential part of the field. Furthermore, the third part of the
book deals with a variety of topics that traditionally fall within the purview of the
field (e.g., WordNet, corpus linguistics). For linguists interested in learning more
about the applied side of computational linguistics, provides an introduction to
various facets of natural language processing (NLP).
3.2 Becoming a Good Programmer

There are many books on the market that purport to teach you how to program at
a ridiculously accelerated pace (e.g., Sams Teach Yourself Python in 24 Hours).
This book is not one of them. The reason for this is that learning to program in
Python within the time frame of a day, a week, or even a month is a fairly remote
possibility, especially if you have no previous programming experience. Although
a highly-motivated, intelligent student may learn the rudiments of a programming
language in reasonably short order, he is unlikely to learn to program well by
proceeding at a breakneck pace.
There is much more to programming than learning the syntax of a language.
And there is much more to programming than simply hacking together a script that
manages to do a particular task, since there are additional considerations involved
which reveal themselves over time, for example:
• learning how to write well-commented code that is readable to others,
• learning how to write robust code that will handle unexpected conditions
gracefully,
• learning how to write code that can be easily maintained as it changes over
time,
• learning how to optimize already working code so that it runs more efficiently.
Programming is a craft like any other which repays time and effort. It isn’t
as hard to program as the general public probably believes (judging from popular
culture depictions of programmers as wild-eyed geniuses), but it is a long road
that takes one from apprentice to journeyman programmer.
Here are a few caveats that beginning programmers would do well to heed:
Learn By Doing The best way to learn to program is by doing it. There’s no
substitute for the hands-on, do-it-yourself approach. You can learn a good
deal about programming by reading books, but at some point you have to
roll up your sleeves and put your nose to the grindstone. In the beginning,
programming can be frustrating, and there will be many times when you
want to beat your head against the keyboard or put your fist through the
monitor. But that’s part of the process.
Use It or Lose It If you don’t program on at least a semi-regular basis, you will
no doubt find it hard to remember nitty-gritty details about Python. The
more you program, the more entrenched the knowledge becomes and the
less likely you are to forget it in the future.
RTFM You may have come across this acronym before. It stands for the impolite
dictum “Read the Fucking Manual”. If you want to learn to program
well, you can’t be afraid of digging into documentation. There is always a
lot to know about a computer language and the best way to learn more is to
read up.
3.3 Suggested Reading

It never hurts to know a bit about the history of a field. There is no shortage
of books on the history of computers and computer programming. While some
of these are technically oriented, many were written with the general public in
mind, such as Levy (2001) or Campbell-Kelly (2004). For more about learning to
program, try Hunt and Thomas (1999) or ???.
Chapter 4
Installing and Running Python
This chapter presents the basics of installing and running Python on the three most
common operating systems: Windows, Linux/Unix, and MacOS.
4.1 Installing Python

4.1.1 Windows
The average reader of this book runs some variety of the Windows operating system.
Python is not part of the default collection of programs that come with Windows
straight out of the box. It will therefore be necessary to download and install
Python. The necessary files can be found at http://www.python.org/download/.
Installation in Windows is a fairly simple procedure, usually requiring nothing more than
running an installer (a program that automates a program’s installation). At the time
of writing, the installer is named python-2.4.3.msi. We recommend
doing the installation with administrator privileges using all of the default settings.
(This will ensure that Python can be used by everyone if your computer is shared
by multiple users.)
4.1.2 Linux/Unix
Many Unix operating systems come with Python pre-installed as part of their stan-
dard distribution. This is true of a variety of Linux distributions (Red Hat, Debian,
SuSE, etc.). To see whether Python has been installed on your Unix operating sys-
tem, you will need to open a command-line window (terminal window) and type
‘python’, as shown in Figure ??. This will instruct Python to print out its version
number.
Figure 4.1 ???
If Python is not installed, you will get an error message rather than a Python
version number. There are a few different options for installing Python.
One possibility is to install Python from source files, following the instructions
provided on the Python download page. It is easier, however, to install Python
from a prepackaged binary file. The most commonly available will be an RPM
file (*.rpm), which can be installed as root with the following command:
???
Installation of Python from an RPM will fail if your system lacks any of
the files required for the installation of Python (e.g., ???). A package management
program can be used to handle these dependency issues automatically.
With a decent package management program, it is quite easy to install Python.
Using the package manager yum, the following command should suffice:
# yum install python
Alternatively, if you are using the package manager apt, the following command
can be used:
# ???
4.1.3 MacOS
The current version of the Macintosh operating system, Mac OS X, is built on
top of the Berkeley Software Distribution (BSD) of Unix. This means that most of the
instructions for installing and running Python for Unix also carry over to Macs.
[SPELL OUT HOW THEY DIFFER]
4.2 Running Python

???
4.2.1 Windows
Unless you use an IDE (Integrated Development Environment), you will run Python
by typing commands into the “DOS window” or “Command prompt window”.
The exact procedure for opening a command prompt window varies slightly between
different versions of Windows. In Windows XP, open the Start menu and select
“Programs > Accessories > Command Prompt”. Once you have done
this, the window should look similar to the one shown in Figure 4.2.
Figure 4.2 Command Prompt Window in Windows XP
To run your Python programs, you will have to use the Python interpreter. The
interpreter reads your script, converts it into machine language, and then executes
the machine language version of your script (for more information, see §6).
The first thing that you will need to do is to ensure that the Python interpreter
has been correctly installed. You can do this by opening a command prompt
window and running the Python interpreter in it. This is done by typing the full
path to the program python.exe and hitting return. If it works, Python will be
run in interactive mode, as shown in Figure 4.3.
Figure 4.3 Running the Python Interpreter from a Command Prompt Window
You can tell Python is running from the interpreter’s command prompt (the
three right-facing angle brackets). Once in interactive mode you can type
Python code and have it executed on the spot. (To exit from
interactive mode, hold the Control key and type ‘Z’ (CONTROL-Z), then hit return.)
Once you have succeeded in running the Python interpreter using the full path
to it, you can test whether you can also run it by simply typing python in the
command window. If this works, you will see the interpreter prompt, as you did
in Figure 4.3. If it does not work, instead of getting the interpreter prompt, as in
Figure 4.3, you will get an error message, such as the one shown in Figure 4.4.
The problem is that your computer doesn’t know where to find the Python
interpreter. To fix the problem, you will have to modify a setting known simply
as PATH, which is a list of directories where Windows will look for programs. To
set PATH so that it includes the directory where the Python interpreter is installed,
you will first need to know where the Python interpreter program was installed. If
you installed Python fairly recently then the search command provided in (1) may
work. Otherwise you will be reduced to a search of your whole disk for the file
python.exe.
(1) dir C:\py*
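If you have already managed to start the interpreter by its full path, Python itself can report where it is installed. The sketch below is one way of doing so, using the standard `sys` and `os` modules; the variable names are illustrative. The comparison with PATH is a simple exact match, so on Windows, where directory names are not case-sensitive, it may report False for a directory that is in fact listed.

```python
import os
import sys

# Full path of the interpreter that is running this code.
interpreter = sys.executable

# The directory that would need to be listed in PATH.
directory = os.path.dirname(interpreter)

# PATH is a single string of directories joined by a separator
# (a semicolon on Windows, a colon on Unix).
path_entries = os.environ.get("PATH", "").split(os.pathsep)
on_path = directory in path_entries

print("Interpreter: " + interpreter)
print("Directory to add to PATH: " + directory)
print("Already on PATH: " + str(on_path))
```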
Figure 4.4 Path Error Running the Python Interpreter from a Command Prompt Window
Once you have verified the directory, you will need to add it to your computer’s
start-up routine. You should add (??) to the current setting for the PATH
environment variable. To do this, you will need the ‘System Properties’ window,
which can be obtained by right-clicking “My Computer” and selecting ‘Properties’.
From there, select ‘Environment Variables’ in the ‘Advanced’ tab of the
‘System Properties’ window, as shown in Figure 4.5.
Figure 4.5 Selecting ‘Environment Variables’ from the ‘System Properties’ Window
Once you have selected ‘Environment Variables’, you will need to add ;C:\Python24
to PATH in the ‘System variables’, as shown in Figure 4.6.
Figure 4.6 Adding the Python Interpreter to the PATH Setting
If you have sufficient privileges, you might get a choice of installing the settings
either for the current user or for the entire system. If your computer is shared by
multiple users, and you want them to be able to use Python, you should choose to
install Python for the entire system, and not only for the current user.
4.2.2 Linux/Unix
4.2.3 MacOS
Part I
Python Programming Fundamentals

Chapter 5
Your First Python Program: Building a Word List
Learning to program is a hands-on task. The best way to learn is to take a specific
task and think about how a program could be written to accomplish it. With that in
mind, let’s roll up our sleeves and look at a relatively basic task in text processing,
which is to build a word list from a corpus. This involves going through files,
isolating words, and keeping a count of how many times they occur. Let’s think
about how we would write a Python program that will automate this work.
Generally speaking, a corpus consists of multiple text files. But for simplicity
we will work with a snippet of text, taken from the Wall Street Journal (WSJ)
corpus. The WSJ corpus consists of articles from the Wall Street Journal that have
been annotated by adding a part-of-speech (POS) tag to each word (using a forward
slash as the separator between words and their tags).1 An example sentence
is provided in (2) and its annotated counterpart in (3).
(2) Zenith Data Systems Corp., a subsidiary of Zenith Electronics
    Corp., received a $534 million Navy contract for software
    and services of microcomputers over an 84-month period.
(3) Zenith/NNP Data/NNP Systems/NNPS Corp./NNP ,/, a/DT subsidiary/NN
    of/IN Zenith/NNP Electronics/NNP Corp./NNP ,/, received/VBD
    a/DT $/$ 534/CD million/CD Navy/NNP contract/NN for/IN software/NN
    and/CC services/NNS of/IN microcomputers/NNS over/IN an/DT
    84-month/JJ period/NN ./.
1 The annotation of the WSJ corpus consists of more than just part-of-speech labels, but we will
work with a simplified version of it for pedagogical purposes.
Before we delve into Python-specific details, let’s think about what is conceptually
involved in creating a word list from text annotated in this style. Each
sentence in the text needs to be broken down into tokens, which in this case consist
of word/POS pairings (e.g., Zenith/NNP).2 For now, we will ignore the part-of-speech
label and concentrate on building a frequency count for every word. This
involves associating every unique word with a counter, which is incremented every
time an instance of that word is encountered. The words can then be sorted
alphabetically and printed out with their associated counters.
In the following sections, we will break our task down into sub-tasks and look
at how each sub-task can be accomplished in Python. These sub-tasks are: storing
text as a variable (§5.1), breaking the text down into tokens (§5.2), creating a
frequency counter for each unique token (§5.3), and then printing out each token
along with its frequency count (§5.4). We’ll then put the parts together into a
single script that performs the entire job automatically.
5.1 Saving Text as a Variable

The first step is to obtain the contents of the corpus for analysis. Since most
corpora consist of multiple files, this would normally involve having the program
go through the various files that make up the corpus and process them one by one.
For the sake of simplicity, we will only analyze a few sentences from the WSJ
corpus, which will be typed directly into the program rather than read from a file,
as in (2).
[EXPLAIN HOW TO TYPE THIS IN USING PYTHON AS A CALCULATOR—
FOLLOWING EXAMPLES SHOULD BE IN CALCULATOR MODE]
(2) introFirstProgramPart1.py
text = "Zenith/NNP Data/NNP Systems/NNPS Corp./NNP ,/, a/DT \
subsidiary/NN of/IN Zenith/NNP Electronics/NNP Corp./NNP ,/, received/VBD \
a/DT $/$ 534/CD million/CD Navy/NNP contract/NN for/IN software/NN and/CC \
services/NNS of/IN microcomputers/NNS over/IN an/DT 84-month/JJ period/NN \
./. Rockwell/NNP International/NNP Corp./NNP won/VBD a/DT $/$ 130.7/CD \
million/CD Air/NNP Force/NNP contract/NN for/IN AC-130U/NN gunship/NN \
replacement/NN aircraft/NN ./."
In (2), a snippet of annotated text from the WSJ corpus is saved as a variable.
In computer programming, a variable is a bit like a storage container. Like a
storage container, a variable has a label (its name) and contents (a value), as in
Figure 5.1.
2 The only tokens that do not have a part-of-speech label are the punctuation marks, which are
treated as words for ease of processing (see §8.1 for additional discussion).
Figure 5.1 ???
A variable is given a value with the assignment operator, the equal sign (=). In
this case, the name of the variable is text and its value is the snippet of annotated
text from (3). Note the formatting of the string. It is surrounded by quotes and, for
the sake of readability, has been wrapped across multiple lines using a backslash
at the end of every wrapped line.
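The effect of the backslash can be checked on a shorter string. In the sketch below, which uses a made-up two-line snippet in the same word/TAG format, the wrapped string and the unwrapped string turn out to be identical, because a backslash at the very end of a line tells Python to carry on with the next line as though there were no line break. The variable names are illustrative.

```python
# A string wrapped across two lines with a backslash. The second line
# starts at the left margin so that no extra spaces end up in the string.
wrapped = "Zenith/NNP Data/NNP \
Systems/NNPS Corp./NNP"

# The same string written on a single line.
unwrapped = "Zenith/NNP Data/NNP Systems/NNPS Corp./NNP"

same = (wrapped == unwrapped)
print(wrapped)
print(same)
```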
5.2 Tokenization: Getting a List of Tokens
Once we have saved some annotated text into a variable, the next step is the
tokenization of the text, that is, breaking the text down into tokens. In the case
of (3), a token consists of a pairing of a word with a part-of-speech label (e.g.,
ideas/N). Table 5.1 provides a breakdown of the first five tokens in (3).

Order  Token         Word     Part of Speech
1      Zenith/NNP    Zenith   NNP   proper noun, singular
2      Data/NNP      Data     NNP   proper noun, singular
3      Systems/NNPS  Systems  NNPS  proper noun, plural
4      Corp./NNP     Corp.    NNP   proper noun, singular
5      ,/,           ,        ,     comma (punctuation)

Table 5.1 Breakdown of the First Five Tokens from (3)

A part-of-speech label simply provides information about the grammatical
class of the word. For example, the token ideas/N indicates that the word
ideas functions as a noun (N).
But how do we tokenize text formatted in the style of (3)? The key is whites-
pace. Because the tokens are separated from one another by a single space mark,
we can carve up the text by space marks to create a list of tokens. This can be
done with the function split(), as shown in (3).
tokens = text.split()    # split up text into tokens
The function split() breaks a string up wherever whitespace occurs and produces a list with
all of the strings between the spaces. In (3), the list that is produced by using split()
on text is saved into the variable tokens. (We’ll see shortly that the function
split can use other characters as separators.)
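The behaviour of split() can be tried out on a short string. The sketch below uses a made-up snippet that contains a double space, in order to show the difference between calling split() with no separator (any run of whitespace counts as one break) and calling it with an explicit separator. The variable names are illustrative.

```python
# A made-up snippet; note the two spaces before the last token.
text = "Zenith/NNP Data/NNP  Systems/NNPS"

# With no argument, any run of whitespace separates two tokens.
tokens = text.split()
print(tokens)

# With an explicit single space, the double space yields an empty string.
by_space = text.split(" ")
print(by_space)

# A different separator: the slash divides a token into word and tag.
word, tag = tokens[0].split("/")
print(word)
print(tag)
```

For whitespace-separated corpus text, the argument-free form is usually the safer choice, since stray double spaces or line breaks do not produce empty tokens.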
5.3 Obtaining a Frequency Count for Unique Words
Having obtained a list of tokens, we can now count how often each word appears
in the list. The code in (4) goes through the list of tokens saved into the variable
tokens and figures out how many times each word occurs in it. In order to track
the frequency of each word, we need some way of associating each unique word with a
count of how often it occurs. In Python, we can use something called a dictionary
(see §8.2). A dictionary is used to store a collection of unique items (keys) and
associate each one with a particular value. In this case, the key would be a word
and the associated value would be a number (more specifically, a count of how
many times a particular word appears in the text).
d = {}                       # create a dictionary                         1
for t in tokens :            # go through each token                       2
    parts = t.split("/")     # split token using / as separator            3
    w = parts[0]             # 1st item in list is word                    4
    gc = parts[1]            # 2nd item in list is gram. class             5
    if not d.has_key(w) :    # see if word is not in dict                  6
        d[w] = 0             # if not in dict, give it init. value of 0    7
    d[w] = d[w] + 1          # reset value by adding 1 to prior value      8
In (4), we use a for loop to go through each token in the list. Each token is
temporarily saved as the variable t as the for loop works its way through the list.
The indented code in lines 3 through 8 is run once for each item in the list. (Since
the text being analyzed contains 47 tokens, and the for loop is repeated for each
token in tokens, those 6 lines of code will be run 47 times, except for line 7,
which is run only the first time a given word is encountered.)
What do lines 3 through 8 do? Because each token consists of both a word
and a part-of-speech label, it is first necessary to separate the two so that the word
can be processed and the POS ignored. This is done with the function split,
using a slash (/) as a separator. The result is a list, saved as the variable parts.
The list will consist of two elements, a word and a part-of-speech label. These
are extracted from the list using an index (a number that refers to a position in the
list; see §8.1 for more details). Once the word in a token has been isolated and
saved into a variable, it can be added to the dictionary.
How is this done? The program checks whether the word is in the dictionary,
adds it if it is not, and then adds 1 to the value associated with the word. It is
necessary to check whether the word is already in the dictionary because an error
will result if the program attempts to retrieve a value for a word that is not in the
dictionary. To prevent this from happening, the program ensures that there is a
value associated with the word by giving it an initial value of 0.
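The error mentioned above can be seen in a small experiment. The sketch below first tries to increment a counter for a word that is not yet in the dictionary, which fails with a KeyError, and then counts the words using the dictionary method get(), which returns a fallback value (here 0) when the key is missing. This is a variant of the approach described in the text rather than the book's own listing, and the token list and variable names are made up for illustration.

```python
tokens = ["a/DT", "contract/NN", "for/IN", "a/DT"]

# Retrieving the value for a missing key raises a KeyError,
# which is why some kind of check is needed.
counts = {}
try:
    counts["a"] = counts["a"] + 1
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Variant of the counting loop: get() supplies 0 for unseen words.
counts = {}
for t in tokens:
    w = t.split("/")[0]               # keep the word, ignore the tag
    counts[w] = counts.get(w, 0) + 1  # add 1 to the prior count (or to 0)

print(lookup_failed)
print(counts["a"])
```

Both approaches produce the same counts; the explicit check spells the logic out step by step, while get() folds the check and the initial value into one line.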
5.4 Printing the Word List
The final step is to print out all of the words and their frequency counts in a way
that can be easily read. One very simple format is a tabular layout with space-separated
values, that is, a plain text file with data organized into columns that
are separated from one another by a space. Files formatted in this way are
easy for people to read and are also easily imported into spreadsheet programs (e.g.,
Microsoft Excel).
To create a file with a tabular word frequency count, we need the program to
get a list of keys in the dictionary and sort them. It should then be possible to go
through each key, get its associated value, and print the two of them out. The code
that accomplishes this is given in (5).
words = d.keys()             # get list of unique words in dictionary d    1
words.sort()                 # sort list of words                          2
for w in words :             # go through each word                        3
    count = d[w]             # get counter for a given word                4
    print w , str(count)     # print word and count to standard output     5
In (5), we first get a list of all of the keys (words) in the dictionary using the
function keys and save that list into the variable words. We then sort that list using
the function sort (which knows how to do standard case-sensitive alphabetic
sorting for English; we’ll see ways of customizing the sort order in §8.1.3.4). We
then use a for loop to go through each word in the list, and as the program loops
over each word in the list, it saves it into the temporary variable w.
Notice that lines 4 and 5 are indented. This is because they are associated
with the for loop. They will be run as many times as there are items in
the list being looped over. Their collective purpose is to retrieve the count for a
word and print it out. First, the occurrence count for a given word is obtained
by looking it up in the dictionary. Then the word and its frequency
count are printed out on a single line (the comma in the print statement
inserts a space between them). (Note that the frequency count is converted into a
string using the function str. The nature of, and need for, this number-to-string
conversion will be explained in §7.4.)
Each line of output will consist of two columns: the first column, a word
form; and the second column, the number of times that the word form
occurs in the text that the program was run on.
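The sorting and printing step can be tried on a tiny dictionary. The sketch below uses three made-up entries to show what case-sensitive sorting means in practice: punctuation sorts before capital letters, and capital letters sort before lowercase ones. It uses the built-in function sorted() and builds each output line by joining strings with +, so that it also runs on recent versions of Python, where keys() no longer returns a list that can be sorted in place and print is a function rather than a statement. The variable names are illustrative.

```python
# A tiny dictionary of word counts (made-up subset of the word list).
d = {"a": 3, "Zenith": 2, "$": 2}

# sorted() returns a new, sorted list of the dictionary's keys.
lines = []
for w in sorted(d):
    count = d[w]                       # get counter for the word
    lines.append(w + " " + str(count)) # join word and count with a space

for line in lines:
    print(line)
```

Leaving out the call to str() in the line that joins the pieces would produce an error, since a string and a number cannot be joined directly.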
5.5 Putting It All Together
We are now in a position to bring all of the previously discussed code together into a
single program. When run, this program will print out a list of all of the words in
our corpus excerpt. The program is provided in its entirety in (6).
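(6) introFirstProgram.py
text = "Zenith/NNP Data/NNP Systems/NNPS Corp./NNP ,/, a/DT \
subsidiary/NN of/IN Zenith/NNP Electronics/NNP Corp./NNP ,/, received/VBD \
a/DT $/$ 534/CD million/CD Navy/NNP contract/NN for/IN software/NN and/CC \
services/NNS of/IN microcomputers/NNS over/IN an/DT 84-month/JJ period/NN \
./. Rockwell/NNP International/NNP Corp./NNP won/VBD a/DT $/$ 130.7/CD \
million/CD Air/NNP Force/NNP contract/NN for/IN AC-130U/NN gunship/NN \
replacement/NN aircraft/NN ./."
tokens = text.split()        # split up text into tokens
d = {}                       # create a dictionary
for t in tokens :            # go through each token
    parts = t.split("/")     # separate token into word and gram. class
    w = parts[0]             # 1st item in list is word
    gc = parts[1]            # 2nd item in list is gram. class
    if not d.has_key(w) :    # see if word is not in dict
        d[w] = 0             # if not in dict, give it a counter
    d[w] = d[w] + 1          # increment counter
words = d.keys()             # get list of unique words
words.sort()                 # sort list of words
for w in words :             # go through each word
    count = d[w]             # get counter for a given word
    print w , str(count)     # print word and count to standard output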
When it is run, (6) will produce a word list in alphabetical order with frequency
counts, as you can test for yourself. Running the program should produce the
output shown in (7).
(7) first-program-output.csv
$ 2
, 2
. 2
130.7 1
534 1
84-month 1
AC-130U 1
Air 1
Corp. 3
Data 1
Electronics 1
Force 1
International 1
Navy 1
Rockwell 1
Systems 1
Zenith 2
a 3
aircraft 1
an 1
and 1
contract 2
for 2
gunship 1
microcomputers 1
million 2
of 2
over 1
period 1
received 1
replacement 1
services 1
software 1
subsidiary 1
won 1
Although there is a good deal about (6) that remains to be explained, the basics
should be reasonably clear, even at this early stage. In the following chapters, we
will focus on various topics in Python programming and show ways in which this
program can be elaborated. This will deepen your understanding of (6) and give
you the knowledge necessary to tackle other problems.

The choice of the WSJ corpus for illustration was motivated by the simplicity of
its annotation and its widespread use (e.g., Klein and Manning (2002, 2004)). Por-
tions of the WSJ corpus are available for free and the full corpus can be purchased
from the Linguistic Data Consortium (http://www.ldc.upenn.edu/). To
learn more about the nature of the annotation of the WSJ corpus, see Paul and
Baker (1992). For more information about linguistic annotation schemes, try
http://www.ldc.upenn.edu/annotation/.
Chapter 6
Statements
Python programs are composed of statements, which are instructions for Python
to carry out a particular operation. In fact, a Python program is basically just
a collection of statements. To understand how Python statements are converted
into instructions that are carried out by the computer, it pays to step back briefly
and distinguish between two types of computer languages: low-level languages
(sometimes known as “machine languages” or “assembly languages”) and high-level
languages (such as Python, C++, Perl, Java, etc.).
Computers can only execute programs written in low-level languages. Therefore,
programs written in a high-level language require some additional processing,
since the statements in the high-level language have to be translated into
statements in the low-level language. The disadvantage of high-level languages
is that there are costs associated with this translation process (time, memory,
etc.), but these are outweighed by the numerous advantages of high-level
languages. First, it is much easier to program in a high-level language.
(Programs written in high-level languages are easier to read, usually shorter,
and less prone to human error.) Second, high-level languages are portable,
meaning that programs written in them can run on different kinds of computers
and operating systems with little to no modification. Low-level programs can
run on only one kind of computer and have to be rewritten to run on another.
Two kinds of programs process high-level languages and convert them into
low-level languages: interpreters and compilers. An interpreter reads a
high-level program and executes it, meaning that it does what the program says.
It processes the program sequentially, alternately reading lines and performing
computations. A compiler, on the other hand, reads the program in its entirety
and converts it to a low-level language before it is run. The high-level
program is called the source code, and the translated program is called the
object code or the executable. Once a program is compiled, you can execute it
repeatedly without further translation. (It is for this reason that compiled
programs can be optimized to run more quickly and use less memory than
interpreted programs.)
Python is a high-level interpreted language. In other words, Python statements
are converted into machine instructions by an interpreter at the time the
program is run, and not in advance by a compiler.
6.1 Indentation and Block-Structuring
One aspect of Python that is unusual among programming languages is its use
of indentation to determine block structure, which is the way in which code is
organized into larger units, or blocks. Most other languages use paired
delimiters (braces, parentheses, etc.) for this purpose, as can be seen in the
Perl code in (8), which tests the value of a variable.
(8) statementsPerlEx1.py
my $i = 0;                                # set variable to zero
if ($i < 1) {                             # is value less than one?
    print "Value is less than one.\n";    # if yes, print message
    print $i;                             # print actual value of variable
}
Notice that braces are used to structure the code in (8). The if-clause is
separated from the then-clause using braces and each statement ends with a
semi-colon. (There are other conventions, such as flagging variables with a
dollar sign, that can be ignored, since they are specific to Perl.)
The Python equivalent of (8) would be (9), which performs the same opera-
tions and produces the same output.
(9) statementsEx1.py
i = 0                                   # set variable to zero
if i < 1 :                              # is value less than one?
    print "Value is less than one."     # if yes, print message
    print i                             # print value of variable
There are a few things to notice about the code in (9). First, it does not use
braces. The if-clause is separated from the then-clause by a colon. Second,
there are no semi-colons in (9) because semi-colons are optional in Python if a
line consists of only a single statement.
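To make the role of indentation concrete, here is a small sketch of our own
(the function name describe and the "Done." message are illustrative
assumptions, not part of (9)). It collects messages in a list instead of
printing them directly, so that the effect of indentation on block membership
is easy to inspect. It is written so that it runs under both Python 2 and
Python 3.

```python
def describe(i):
    """Collect the messages that would be printed for a given value of i."""
    messages = []
    if i < 1:
        # Both of these lines are indented, so both belong to the if-block
        messages.append("Value is less than one.")
        messages.append(str(i))
    # This line is back at the outer level, so it runs whatever the value of i
    messages.append("Done.")
    return messages

print(describe(0))
print(describe(5))
```

Moving a line in or out by one level of indentation changes which block it
belongs to, and therefore when it is executed.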
In essence, the default in Python is a single statement per line. That is what
the Python interpreter expects unless it is told otherwise. To have more than
one statement on a single line, semi-colons are required, as in (10), where two
statements share the last line. (It would also be possible to put semi-colons
at the end of each line in (9), but it would be pointless, since they are not
required by the interpreter and add nothing to the program in terms of
readability.) Placing several statements on one line is functionally equivalent
to placing each on a line of its own, and the choice of one over the other is a
matter of programming style.
(In general, putting each statement on a separate line makes for better
readability and is therefore preferred.)
(10)
i = 0
while (i < 10) :
    i += 1; print i
Note also that the indented block need not occur on a separate line. It is
possible to collapse the last two lines of (10), thereby obtaining (11), but we
recommend against it, since it makes the code harder to read.
(11)
i = 0
while (i < 10) : i += 1; print i
6.2 Variables
In this section, we discuss how variables work in Python. Variables lie at the heart
of any programming language, given that they are where information is stored. A
variable is something like a box, in that it is a container for information. If we
pursue this analogy, then we can think of variables as boxes that contain values
(a particular number, word, etc.). In the following sections we will look at the
basic statement types where variables are concerned: declaration, initialization,
and assignment.
6.2.1 Declaration
The term declaration refers to the process of asserting a variable’s existence
in a program. A declaration essentially announces, or declares, that a variable
exists. Hence the term. Using our variables as containers analogy, declaring a
variable amounts to telling the Python interpreter that there is a box and
giving it a label. Unlike some other programming languages, a variable cannot
be declared in Python without giving it a value. That means that it is not
possible to simply tell a program that a variable exists, and expect it to hang
around waiting for more information. Therefore, in Python, if you want to
declare a variable without giving it a meaningful value, a dummy value, known
as None, must be used as a sort of placeholder, as in (12). This is a bit like
creating an empty box. (For more information concerning the nature of the value
None, see §7.3.)
(12) statementsDeclarationEx1.py
x = None
The name of a variable is fairly free in Python, and the main restrictions are
the following: First, a variable’s name must begin with a letter or an
underscore. Second, a variable’s name may consist only of letters, numbers, and
underscores. (Given the first restriction, this means that numbers can only
occur after the first character of a variable’s name.) Third, a variable’s name
cannot conflict with a reserved keyword. Keywords are words that have a special
interpretation and are therefore restricted in use by the Python interpreter
(e.g., if, and, not). (Keywords can be easily identified in this book by their
formatting: they appear in bold face in example code.)
More concretely, the variable names in (4) are legitimate whereas the ones in
(5) are illegitimate.
(4) a. myVariable1
    b. my_variable_1
    c. mv1
(5) a. 1myVariable
    b. my variable 1
    c. mv1
    d. myVariable!
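The naming rules can also be checked mechanically. The following is a rough
sketch under our own assumptions: the helper is_legal_name and the pattern
NAME_PATTERN are hypothetical names introduced here, and the pattern is
restricted to unaccented (ASCII) letters. Note that Python itself also accepts
an underscore as the first character of a name. The keyword module and its
iskeyword() function are part of the standard library.

```python
import keyword
import re

# A letter or underscore, followed by any number of letters, digits, or underscores
NAME_PATTERN = re.compile(r'[A-Za-z_][A-Za-z0-9_]*\Z')

def is_legal_name(name):
    """Return True if name could be used as a Python variable name."""
    if keyword.iskeyword(name):          # reserved words such as 'if' are excluded
        return False
    return NAME_PATTERN.match(name) is not None

for candidate in ["myVariable1", "my_variable_1", "1myVariable", "myVariable!", "if"]:
    print(candidate + ": " + str(is_legal_name(candidate)))
```

A check of this kind is occasionally handy when variable names are generated
from data, for example from the column headers of a spreadsheet.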
Programmers tend to adhere to one of two conventions when naming variables.
One convention uses all lowercase letters with underscores to separate words in
variables with long names (e.g., line_counter) while the other uses capital
letters to separate words in variables with long names (e.g., lineCounter). As
far as Python is concerned, both conventions are legitimate and pose no
problems. The authors prefer the latter convention, since it results in shorter
names for variables.[1]
6.2.2 Assignment
Assignment refers to the act of giving a value to a variable. Continuing with
our variables as containers analogy, assignment is a bit like placing an object
within a labelled box. When this is done for the first time, the process is
referred to as initialization. There is no real distinction between
initialization and assignment in Python, and a variable can be given a new
value at any point using the equal sign (the assignment operator), as
illustrated in (13).
[1] For a discussion of these stylistic issues, which have a tendency to take
the form of dogma and ignite minor religious wars among programmers, see ?.
(13) statementsAssignmentEx1.py
i = 0                # Initialize counter to zero

# Repeatedly execute following code as long as
# value of i is less than 10
while i < 10 :
    i += 1           # Increment counter
    print i          # Print value
To initialize a variable is to give it a value for the first time. As already
noted in §6.2.1, the declaration of a variable requires its simultaneous
initialization. This has the effect of giving it a type, since the assigned
value will force the variable to take the appropriate type. Continuing with the
variables as containers analogy, it is important to recognize that not all
containers are the same size. Variables, like boxes, are not uniform. Some are
bigger than others, and some are custom-made for particular types of objects.
Just as you use one kind of box to store your tennis shoes and another kind of
box to store your computer, you would use one kind of variable to hold a number
and another kind of variable to hold a word (see Chapter 7 for more details).
Assignment is done with the equal sign (=). Therefore, in (14) a string
variable is simultaneously declared and initialized before being printed.
(14) statementsInitializationEx0.py
word = 'This string is declared and initialized.'
print word
When assigning a value to a variable, care must be taken with the proper
formatting of the value, since this can have an effect on the data type of the
assigned variable. For example, when assigning a numeric value to a variable,
the presence or absence of a decimal point decides whether the number is an
integer or a floating point number, and this has consequences for the treatment
of the numbers in any mathematical operations that occur later in a program
(see Chapter 7 for more information about data types), as illustrated in (15).
(15)
# Initialize two variables as integers
x = 5
y = 2

# Print result of dividing x by y
print x / y

# Initialize two variables as floating point numbers
x = 5.0
y = 2.0

# Print result of dividing x by y
print x / y
The result of running (15) might not be what you expect. Although we all
know that 5 divided by 2 is 2.5, it is only printed out as such in the second
instance. This is because in the first instance the two numbers being divided
were assigned without decimal points, making them integers, and when an integer
is divided by another integer, the resulting value is an integer. (The number
is not rounded up or down. Rather, the remainder is simply lopped off.)
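The behavior just described is that of the / operator in the versions of
Python this book was written for. In Python 3, / always returns a float, while
the // operator (available since Python 2.2) performs the integer-style
division in both versions. The following sketch, which is our own addition and
uses only built-in operators, therefore relies on // and float() so that it
gives the same results whichever version is installed.

```python
x = 5
y = 2

floor_quotient = x // y         # integer-style division: the fraction is discarded
remainder = x % y               # what is left over after the division
true_quotient = float(x) / y    # converting one operand to a float keeps the fraction

print(floor_quotient)   # 2
print(remainder)        # 1
print(true_quotient)    # 2.5

# With negative numbers the result is rounded down (toward negative infinity)
print(-5 // 2)          # -3
```

For the positive counts that are typical of corpus work, rounding down and
lopping off the remainder amount to the same thing; the difference only shows
up with negative numbers.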
It is possible to query a variable’s type, since the interpreter keeps track of
this information. To check a variable’s type, the function type() can be used,
as shown in (16), where three variables are created and then have their type
checked using type().
(16)
n = 0
message = 'This is a string'
pi = 3.14159

print type(n)
print type(message)
print type(pi)
6.3 Expressions
An expression is a piece of code that Python evaluates to produce a value. The
result of evaluating an expression is known as a return value (see §??). In the
simplest case, this will be one of two results: True or False. A common kind of
expression tests whether a variable has a particular value. Since this involves
comparing the variable’s value to another value and seeing whether they are
identical, this is sometimes referred to as an identity test. (In Python’s own
terminology, a test with == is an equality comparison; the word identity is
reserved for the is operator.) An identity test is an expression, since it
returns a value: either the two things compared are identical, in which case
the result is True, or they are non-identical, in which case the result is
False. You can confirm this for yourself by seeing what value is printed out by
the various expressions in (17).
(17) statementsExpressionEx1.py
s = "Word"    # string
n = 1         # integer
# True
print s == s
print s == "Word"
print n == n
print n == 1
# False
print s != s
print s != "Word"
print n != n
print n != 1
A list of the more common Python comparison operators is provided in Table 6.1.
Operator   Meaning
>          Greater Than
>=         Greater Than or Equal To
==         Equal To
<=         Less Than or Equal To
<          Less Than
Table 6.1 Comparison Operators
These operators can be used with any data type, but they are more appropriate
for some than others. For example, the meaning of the operator > is
straightforward when it comes to numbers (integers, floats, etc.), but less so
when it comes to strings. When used with strings, the operator < means “sorts
before” and the operator > means “sorts after”, as demonstrated by (18).
(18)
s = "Aari"           # String
n = 1                # Number

print s < "Zyphe"    # True
print n < 5          # True
print n > 5          # False
print s > "Zyphe"    # False
In (18), the first two expressions evaluate as True and the last two expressions
evaluate as False.
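One detail worth knowing when comparing strings is that the comparison is based
on character codes, so every uppercase letter of the basic alphabet sorts
before every lowercase one. The sketch below is our own illustration (the list
of words is invented for the purpose); it shows the effect and one common
workaround, which is to compare lowercased copies of the strings.

```python
words = ["Zyphe", "aari", "Aari", "zyphe"]

print("Aari" < "Zyphe")    # True
print("aari" < "Zyphe")    # False: lowercase letters sort after uppercase ones
print(sorted(words))       # ['Aari', 'Zyphe', 'aari', 'zyphe']

# Comparing lowercased copies gives a case-insensitive ordering
print("aari".lower() < "Zyphe".lower())    # True
case_insensitive = sorted(words, key=lambda w: w.lower())
print(case_insensitive)    # ['aari', 'Aari', 'Zyphe', 'zyphe']
```

This matters for word lists, where one usually wants "aircraft" and "Air" to
end up near each other rather than at opposite ends of the list.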
Returning to the wordcount program introduced in Chapter 5, we will show
how words occurring more than once can be extracted from a dictionary.
(19)
import sys

def getFilename() :
    try :
        return sys.argv[1]
    except IndexError :
        print "Usage: %s <FILE>" % sys.argv[0]
        sys.exit(0)

def getFileContents(filename) :
    fo = open(filename, 'r')
    fc = fo.read()
    fo.close()
    return fc

def extractWords(fileContents) :
    # Strip punctuation marks, then split the text into tokens
    fileContents = fileContents.replace(',', '')
    fileContents = fileContents.replace('!', '')
    fileContents = fileContents.replace('.', '')
    fileContents = fileContents.replace(';', '')
    fileContents = fileContents.replace(':', '')
    words = fileContents.split()    # split on any run of whitespace
    return words

def buildWordlist(words) :
    # Each token is assumed to have the form word/tag; the result maps
    # each word to a dictionary of counts per tag
    wordlist = {}
    for w in words :
        parts = w.split("/")
        word = parts[0]
        tag = parts[1]

        try :
            n = wordlist[word][tag]
            wordlist[word][tag] = n + 1
        except KeyError :
            if not wordlist.has_key(word) :
                wordlist[word] = {}
            wordlist[word][tag] = 1
    return wordlist

# printNonUniques() is called in main() but is missing from the listing;
# the version given here is an assumed reconstruction.
def printNonUniques(wordlist) :
    words = wordlist.keys()
    words.sort()
    for word in words :
        count = sum(wordlist[word].values())    # total over all tags
        if count > 1 :
            print word , count

def main() :
    filename = getFilename()
    fileContents = getFileContents(filename)
    words = extractWords(fileContents)
    wordlist = buildWordlist(words)
    printNonUniques(wordlist)

main()
6.4 Comments
Everything in a line of code following a hash mark (#) is ignored by the Python
interpreter (unless the hash mark occurs within a string), as shown in (20).
(20) statementsCommentsEx0.py
# This line will be ignored by the Python interpreter.

print "A # isn't a comment in strings."    # True of all strings

# Comments can span multiple lines, but each
# line has to be commented out. There is no
# convention in Python for multi-line comments.
When programming, it is useful to include notes concerning the code. These
notes can play various roles. Sometimes they are used simply to explain what
role the various bits of code play in a program, as in (21), where explanatory
comments have been interleaved in the code from (10). (For explanation of how a
while loop works, see §9.2.1.)
(21)
i = 0                # Initialize counter to zero

# Repeatedly execute following code as long as
# value of i is less than 10
while i < 10 :
    i += 1           # Increment counter
    print i          # Print value
Comments are also a useful way of blocking out the structure of a program
and for providing more general information about its authorship, purpose, etc.,
as illustrated in (22), where the header and function descriptions of a program
to find minimal pairs are given (without their accompanying code).
(22)
# ---------------------------------------
# AUTHOR: Stuart Robinson
# DATE: 10 April 2004
# DESCRIPTION:
# This program takes a Shoebox-formatted
# dictionary file and finds all of the
# minimal pairs contained within it.
# VERSION: 1.0
# ---------------------------------------

# ---------------------------------------
# Takes the Shoebox file and extracts
# each entry
# ---------------------------------------

...

# ---------------------------------------
# Takes the entries and compares them with
# one another to find the min pairs.
# ---------------------------------------

...
Comments are also used to “comment out” code that needs to be rendered
inactive for some reason. Often, when programmers are revising a block of code,
they will comment out a copy of it first, so that if their changes fail to
work, they can fall back on the prior version of the code. (A good text editor
will provide a way of commenting out a region of text with a single command.)
The general rule of thumb for comments is to avoid stating the obvious and
to provide comments on anything that is likely to trip up other programmers.
(Also, remember that even the author of a program is likely to forget how it
works as time drags on, so it pays to provide comments not just for others, but
also for one’s future self.) Beginning programmers have a tendency to either
provide no comments whatsoever (always a mistake) or to comment everything,
often providing comments on the workings of Python itself (which really ought
to be left up to documentation). Striking a balance between the poles of
overcommenting and undercommenting is one of the signs of a mature programmer.
Although Python has built-in documentation capabilities, coverage of them
goes beyond the intended scope of this textbook. For more information, go to
http://www.python.org/peps/pep-0257.html.
You will find comments used quite liberally throughout this book in the code
examples provided. A good deal of these comments would be overkill in everyday
working programs, but they are very useful for the purposes of exposition.
6.5 Exercises
• Examine the following pieces of Python code and determine for each whether
it is legitimate and, if not, what’s wrong with it and how it could be fixed:
– if i = 0 : print i
– in = 5
– s = 'If you can't explain it clearly, you don't understand it yourself.'
– ???
– ???
• ???
• ???
Chapter 7
Data Types
In the previous chapter, variables were introduced. As already mentioned,
variables are fundamental to programming, since they are where data is kept
while a program is running. Because different types of information are stored
in variables (text, numbers, lists of numbers, etc.), it is useful to break
down variables into different types. This is known as data typing and the
different types of variables recognized in a language are known as data types.
Python has a number of built-in data types, that is, data types that are
implicitly understood by the Python interpreter without any need for the
programmer to provide additional specification. Each has a different job to do
and behaves somewhat differently. It is important to understand how the various
data types differ from one another so that one can make use of their various
strengths. Later, in Chapter 16, we will show how it is possible in Python to
create custom data types, called classes.
7.1 Strings
Strings are used to store text and for that reason they are the data type of most
interest to language researchers. In (23) a string variable is initialized with Chom-
sky’s famous grammatical but meaningless sentence.
(23) dataTypesStringEx1.py
s = 'Colorless green ideas sleep furiously.'
Note that in (23) the string is surrounded by single quotes. These quotes are
necessary because they ensure that the words in your string aren’t interpreted
as instructions to the Python interpreter. It is also possible to use double
(or even triple) quotes, although slightly different behavior is obtained by
doing so. For the moment, the differences need not concern us and the two types
will be used interchangeably. (Later, in Chapter 14, the difference between
strings with single, double, and triple quotes will be discussed in detail.)
Strings cannot be modified once they are created, and are therefore described
as being immutable. (This holds true regardless of whether the string was
initialized with single, double, or triple quotes.) One practical consequence
of the immutability of strings in Python is that any operations performed on
strings create a modified copy of the string, rather than modifying the
pre-existing one. Consider (24).
(24)
s1 = 'Strings are immutable.'
s2 = s1.replace('immutable', 'IMMUTABLE')
print s1
print s2
In (24), a string is created and a copy of the string is made with the word
‘immutable’ replaced by ‘IMMUTABLE’. The replacement is done using the method
replace. Each data type makes available a number of different methods which
can be used to carry out commonly performed operations. For a complete list of
the methods available for a given data type, you will need to consult the
Python documentation. The methods available for strings are discussed in
considerable detail in Chapter 14. For now, the important point is that when
the method replace is called on the string in (24), the original string is
unaffected by the replacement, as you can confirm by running the code yourself.
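Immutability can also be observed directly. The sketch below is our own
addition and uses only built-in string behavior: a method such as upper()
hands back a new string, while an attempt to change a character in place is
rejected with a TypeError. The variable name changed is simply an illustrative
flag. The code is written so that it runs under both Python 2 and Python 3.

```python
s1 = 'Strings are immutable.'

# Methods such as upper() and replace() return new strings
s2 = s1.upper()
print(s1)    # Strings are immutable.
print(s2)    # STRINGS ARE IMMUTABLE.

# Trying to change a character in place raises a TypeError
try:
    s1[0] = 's'
    changed = True
except TypeError:
    changed = False
print(changed)    # False
```

In practice this means that the result of a string method has to be saved
under some name (possibly the same one, as in s1 = s1.upper()) if it is to be
used later.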
Strings can be joined together to form larger strings, a process known as
string concatenation. In a print statement, strings can be combined in two
ways: with the comma or with the plus sign. Strings separated by a comma are
printed with a space inserted between them, whereas strings joined with a plus
sign are directly concatenated with no intervening material, as shown in (25).
(Strictly speaking, only the plus sign is a concatenation operator that builds
a new string; the comma is part of the syntax of the print statement.)
(25)
print "The hatches of the ship" , "closed."
print "The ship" + "'s gravity field must be rather weak."
For more comprehensive coverage of strings, see Chapter 14, which is devoted
exclusively to the topic.
7.2 Numbers
Numbers in Python come in more than one variety. The language has a number of
built-in numeric types, but only two are likely to be of relevance to the concerns
of language researchers: integers and floats. We will discuss each in turn.
7.2.1 Integers
An integer in Python is used to hold whole numbers, both positive and negative
(e.g., -5, 0, 36, etc.), of virtually any size.[1] In (26), an integer is
created and saved as the variable n. Various mathematical operations are then
performed on it and the results printed out.
(26) dataTypesIntegersEx1.py
n = 25
print n / 5
print n + 5
print n - 10
print 2 * n
Integers do not store fractions and any mathematical operation performed with
integers will result in an integer. Therefore, if you perform a mathematical oper-
ation with an integer that results in a fraction (such as division), the fraction will
simply be eliminated. This effectively rounds the number down to the nearest
whole number, as illustrated in (27).
(27)
x = 7          # prime number
y = 2
print x / y    # NOT 3.5
In order to deal with fractions, integers must be converted into floats, a
data type that supports fractions (see §7.2.2 for details).
7.2.2 Floats
Unlike integers, a float (short for “floating point number”) is able to store
fractions. The term derives from the fact that the decimal points of these
numbers can move around, or “float”. To create a float, simply write the number
with a decimal point, to any place (tenth, hundredth, etc.), as in (28)
(cf. (27)).
(28) dataTypesFloatEx1.py
x = 7.0        # numerator is float
y = 2          # divisor is integer
print x / y    # result is float
Note that it doesn’t matter whether the numerator or the divisor is a float. If
either one (or both) is a float, the result will be a float, as can be seen in (29),
where the divisor is a float (as opposed to (28), where the numerator is a float).
Nevertheless, the result is the same.
[1] In older versions of Python, there was an upper bound on the size of
integers. If an integer grew too large, it would exceed the memory allocated to
it, resulting in an error. To avoid such errors, an integer would have to be
explicitly created as a long integer. However, since Python 2.2, whenever a
normal integer exceeds its bounds, it is automatically converted to a long
integer.
(29)
x = 7          # numerator is integer
y = 2.0        # divisor is float
print x / y    # result is float
When you print out a float, it will be printed out with its decimal point and
fractional part, even when the fraction is zero. There are numerous ways of
printing out a float without the decimal point, but probably the simplest is to
convert the float to an integer, as illustrated in (30). (Other ways of doing
this will be covered later; see §14.4.5 on string interpolation.)
(30)
n = 8.0 / 2
print n
print int(n)
Integers can also be converted into floats. One method of conversion relies on
the fact that multiplying an integer by a float results in a float: simply
multiply the integer by 1.0 before carrying out any additional operations on
it. Another way is to explicitly convert the integer to a float using the
built-in function float(). Both techniques are illustrated in (31).
(31)
x = 7
y = 2
print x / y           # 3
print x * 1.0 / y     # 3.5
print float(x) / y    # 3.5
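When converting in the other direction, it helps to know exactly what int()
does with the fractional part. The following sketch is our own addition; it
uses only the built-in functions int() and float() and the % formatting
operator, and it is written so that it behaves the same under Python 2 and
Python 3. The variable name ratio is an illustrative assumption.

```python
ratio = 7 / 2.0                 # 3.5, because one operand is a float

print(int(ratio))               # 3: int() discards the fraction
print(int(-ratio))              # -3: the fraction is discarded, not rounded down
print(float(7))                 # 7.0

# String formatting with a precision rounds for display only
print("%.1f" % ratio)           # 3.5
print("%.0f" % 2.6)             # 3
```

The formatting operator leaves the stored number untouched, which makes it the
safer choice when a value is needed again later at full precision.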
7.3 The None Value

The value None is somewhat special, since it represents the absence of data, an
empty value. In other words, it is a value that represents the lack of a
value.[2] Try not to get too bogged down in the metaphysics of this (or you may
find yourself chain-smoking, drinking black coffee, and reading Sartre’s Being
and Nothingness instead of programming). The basic idea is that the value None
gives you an alternative to the answers “Yes” and “No”, namely “I don’t know”
or “It’s unknown”.
Both None and 0 are treated as false in the evaluation of if-statements, as
shown by (32).
(32) dataTypesNoneValueEx1.py
if None : print "'None' counts as true."
else : print "'None' counts as false."

if 0 : print "'0' counts as true."
else : print "'0' counts as false."

if 1 : print "'1' counts as true."
else : print "'1' counts as false."
[2] Just in case you are looking for analogies from other programming
languages, the equivalent of None in Python is null in Java and undef in Perl.
Although None and 0 are both treated as false, they are not equal in value, as
shown by (33).
(33) dataTypesNoneValueEx2.py
if None == 0 :
    print "The value 'None' is equal to '0'."
else :
    print "The value 'None' is not equal to '0'."
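Because None, 0, and the empty string all count as false, a plain if-test
cannot tell a missing value apart from a genuine zero. A common way around
this, sketched below, is to test for None explicitly with the is operator. The
function name classify and its three labels are our own illustrative
assumptions; the code runs under both Python 2 and Python 3.

```python
def classify(value):
    """Say whether a value is missing, false-like, or true-like."""
    if value is None:           # explicit test for the None value
        return "missing"
    elif not value:             # 0, 0.0 and the empty string count as false
        return "false-like"
    else:
        return "true-like"

for v in [None, 0, "", 1, "word"]:
    print(repr(v) + " -> " + classify(v))
```

The distinction is useful in corpus work, where a count of 0 (the word was
looked for and not found) means something different from None (the word was
never looked for).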
7.4 Querying and Converting Data Types

The previous sections provided a brief introduction to Python’s built-in data types.
We observed in §2.3 that Python is an object-oriented language and that object
orientation in Python is pervasive. One consequence of this is that all variables
are really objects. Because it is sometimes useful to be able to query the type of
an object, Python offers a built-in function called type() which can be used to
obtain a variable’s type, as illustrated in (34).
(34) dataTypesTypingEx1.py
s = "one"
f = 1.0
i = 1

print type(s)
print type(f)
print type(i)
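The value returned by type() can be used in tests as well as printed. The
sketch below is our own addition, using only the built-in functions type() and
isinstance(); it compares types directly and then asks whether a value belongs
to any one of several types. It is written so that it runs under both Python 2
and Python 3.

```python
s = "one"
f = 1.0
i = 1

# type() returns the type itself, which can be compared directly
print(type(s) == str)          # True
print(type(f) == float)        # True
print(type(i) == int)          # True

# isinstance() asks the same kind of question and accepts several types at once
print(isinstance(i, (int, float)))    # True
print(isinstance(s, (int, float)))    # False
```

A test such as isinstance(value, (int, float)) is a convenient way of checking
that something is a number before doing arithmetic with it.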
Attempting to concatenate unlike data types with the plus sign gives rise to
errors, as you can see for yourself by running (35), which attempts to
concatenate the variables in (34).
(35)
s = "one"
f = 1.0
i = 1

print s + f + i    # raises a TypeError
In order to concatenate unlike data types, it is first necessary to convert
them into strings, which can be done using the built-in function str(), as
shown in (36).
(36)
s = "one"
f = 1.0
i = 1
print s + str(f) + str(i)
(In §14.4.5, we describe string interpolation, a more powerful technique for
inserting non-strings into strings.)
7.5 Exercises
• If you wanted to know how many times a particular word appeared in a
corpus, which type of number would you use? If the frequency of a word
in a corpus is defined as the number of times it occurs in the corpus divided
by the total number of words in the corpus, which type of number would be
appropriate for the word’s frequency?
• If you had a list of words and you wanted to see how many times each word
occurs in a particular corpus, what would you use as the initial word count
value?
• ???
Chapter 8
Data Structures
A data structure is basically a data type that provides the means of organising
a larger (and possibly more complicated) collection of data in such a way that
it can be stored and manipulated more effectively. In other words, it is a way
of structuring data (hence the term). Although the term may sound off-putting
to the uninitiated, the reader is no doubt familiar with at least one data
structure: the list. A list is a collection of items, stored in a fixed order.
Every programming language implements them in one way or another, although they
sometimes go by more esoteric names (e.g., vector in Java, array in Perl).
Some common data structures are: arrays, dictionaries (also known as
associative arrays or hashes), graphs, heaps, linked lists, matrices, objects,
queues, rings, stacks, trees, and vectors. (Don’t worry if you aren’t familiar
with all of these data structures, since most are unnecessary for basic
programming tasks.) In Python, all of these data structures are potentially
available. Those that are not built into the language can be implemented using
object-oriented programming techniques (see Chapters 15 and 16).
In this chapter we will discuss Python’s built-in data structures: lists
(§8.1), dictionaries (§8.2), and tuples (§8.3).
8.1 Lists
A list is a data type used to store multiple items in sequential order. A list
can contain items of any data type (strings, numbers, etc.), but we’ll stick to
strings for now. Let’s take the first sentence from Hal Clement’s novel Mission
of Gravity and store each word as an item in a list.
(6) The wind came across the bay like something living.
There are numerous ways to create a list in Python, but the simplest is to
put all of the items between square brackets, separated by commas. Since we are
creating a list of strings, we’ll have to put each string inside of quotes, as
in (37).
(37) dataStructuresListEx0.py
l = ['The', 'wind', 'came', 'across', 'the', 'bay', 'like', 'something', 'living.']
Another way of creating a list of strings from a sentence is to let Python do
the work of splitting the sentence into words. This can be done with the string
method split, as shown in (38).
(38) dataStructuresListEx0.5.py
s = "The wind came across the bay like something living."
l = s.split()
print l
When the list in (38) is printed, it will be formatted in the same manner as
(37). Notice that the last word in both (37) and (38) occurs with punctuation:
the period is treated as part of the last word because it is not separated from
it by a space. To ensure that each word occurs alone in the list, the period
can be separated from the rest of the sentence by a space, as in (39), ensuring
that it is treated as a separate word. (This is a fairly common practice in
machine-readable corpora, i.e., texts that have been formatted for automatic
processing.)
(39) dataStructuresListEx0.75.py
s = "The wind came across the bay like something living ."
l = s.split()
print l
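If the text has not already been prepared in this way, the spaces can be
inserted by the program itself. The following is a deliberately naive sketch
under our own assumptions: the function name tokenize and the particular list
of punctuation marks are illustrative choices, and the approach will wrongly
split abbreviations such as "Corp." and numbers such as "130.7". It runs under
both Python 2 and Python 3.

```python
def tokenize(sentence):
    """Split a sentence into tokens, treating common punctuation as separate tokens."""
    for mark in [".", ",", ";", ":", "!", "?"]:
        # Surround each punctuation mark with spaces so that split() isolates it
        sentence = sentence.replace(mark, " " + mark + " ")
    return sentence.split()

tokens = tokenize("The wind came across the bay like something living.")
print(tokens)
print(len(tokens))    # 10
```

Because split() with no argument treats any run of whitespace as a single
separator, the extra spaces introduced by replace() do not produce empty items
in the list.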
8.1.1 Accessing List Items

Individual items in a list can be obtained using an index (the plural is
indices). An index is a number that refers to the position of an item in the
list. List indices start at 0. This is sometimes a source of confusion for
beginners, who expect to start counting at 1 rather than 0. But in Python, the
first item in a list has an index of 0, the second item an index of 1, etc. The
indices for the list in (37) are shown in Figure 8.1.
Figure 8.1 List Indices
Index String
0 The
1 wind
2 came
3 across
4 the
5 bay
AF
6 like
7 something
8 living.
The use of indices to retrieve values from a list is illustrated in (40), where a
list of the words from (6) is created and each of its items is printed out. Note that
the last line of code, which refers to the eleventh item (i.e., index 10), produces an
error, since the list contains only 10 items.
l = "The wind came across the bay like something living .".split()
print l[0]     # First item
print l[1]     # Second item
print l[2]     # Third item
print l[3]     # Fourth item
print l[4]     # Fifth item
print l[5]     # Sixth item
print l[6]     # Seventh item
print l[7]     # Eighth item
print l[8]     # Ninth item
print l[9]     # Tenth item
print l[10]    # ERROR: No eleventh item!
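A program does not have to stop when an index is out of range. As a minimal sketch, the out-of-range lookup can be wrapped in a try/except block that catches IndexError. The helper name item_at and its default parameter are our own inventions for illustration; print is called with parentheses so that the sketch also runs under newer versions of Python.

```python
# Hypothetical helper (not part of the original examples): return the item
# at position i, or a fallback value when the index is out of range.
def item_at(items, i, default=None):
    try:
        return items[i]
    except IndexError:
        return default

words = "The wind came across the bay like something living .".split()
first = item_at(words, 0)      # index 0 exists
missing = item_at(words, 10)   # a 10-item list has no index 10
print(first)
print(missing)
```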
In (40), we see how to get the first, second, third, etc. item in a list, but how
do we get the last, second-to-last, third-to-last, etc. item in a list? One way is to
figure out its position and to subtract the appropriate amount (since counting starts
at 0, not 1). To get the last item in a list, we obtain the total number of items in
the list and subtract 1; to get the second-to-last item in a list, we subtract 2; etc.
Returning to our previous example in (40), we can print the second-to-last and last
items, as follows:
l = "The wind came across the bay like something living.".split()
print l[len(l) - 1]    # Last item
print l[len(l) - 2]    # Second-to-last
print l[len(l) - 3]    # Third from the last
There is, however, a more convenient way of retrieving the last and second-to-last
items in a list, which is simply to provide a negative index. Negative indices
count leftward from the end of a list. The last item has an index of -1, the
second-to-last item has an index of -2, etc. This is illustrated in (42), which does
the same work as (41), but more economically.
l = "The wind came across the bay like something living.".split()
print l[-1]    # Last item
print l[-2]    # Second-to-last item
print l[-3]    # Third-from-last item
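The relationship between the two ways of counting can be checked directly. The following sketch (our own illustration, using a list comprehension of the kind introduced in §9.3) confirms that a negative index -k always picks out the same item as the positive index len(l) - k.

```python
words = "The wind came across the bay like something living.".split()
n = len(words)

# For every k from 1 to n, counting k items back from the end
# reaches the same item as the positive index n - k.
pairs = [(words[-k], words[n - k]) for k in range(1, n + 1)]
same = True
for a, b in pairs:
    if a != b:
        same = False
print(same)
```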
The negative indices for all of the words in (42) are provided in Figure 8.2.
Figure 8.2 Negative List Indices
Index String
-9 The
-8 wind
-7 came
-6 across
-5 the
-4 bay
-3 like
-2 something
-1 living.
8.1.2 Slicing
Using a technique known as slicing, the indices of a list can also be used to
manipulate multiple items simultaneously. The idea behind slicing is that a range of
items can be indicated using two indices, one indicating where the range begins
and another indicating where it ends. The slice range does not include the item
referenced by the second index. In other words, the end index is exclusive. In
(43), various slices of our Mission of Gravity list are illustrated.
l = "The wind came across the bay like something living .".split()
print l[0:1]     # [The]
print l[0:2]     # [The wind]
print l[:6]      # [The wind came across the bay]
print l[2:3]     # [came]
print l[3:6]     # [across the bay]
print l[6:11]    # [like something living .]
print l[6:]      # [like something living .]
An important point about (43) is that no error results if a non-existent index is
used (cf. (40)).
Negative indices also work with slicing, as shown by (44), where different
slice ranges are provided: the last item, the second-to-last to the last item, etc.
(Again, no error results from a reference to a non-existent index.)
letters = ['alpha', 'beta', 'charlie', 'delta', 'epsilon']
print letters[-1:]    # Last
print letters[-2:]    # Second-to-last and last
print letters[-3:]    # Third-to-last, second-to-last, and last
print letters[-4:]    # Fourth-to-last to last
print letters[-5:]    # Fifth-to-last to last
print letters[-6:]    # Sixth-from-last to last
No index is provided for the right boundary of the slice range in (44). When
a slice index is left blank, it is interpreted as the far edge of the list. Therefore, if
the left slice index is omitted, the slice starts at the beginning of the list, and if the
right slice index is omitted, the slice ends at the end of the list, as shown in (45),
where the equivalent of (43) is provided using a blank left slice index rather than
a 0.
letters = ['alpha', 'beta', 'charlie', 'delta', 'epsilon']
print letters[:1]    # First item
print letters[:2]    # First and second items
print letters[:3]    # First, second, and third items
print letters[:4]    # First through fourth items
print letters[:5]    # All items, first through fifth (last)
print letters[:6]    # All items, first through non-existent sixth
You may be wondering why slice ranges are able to refer to non-existent
indices. This is because Python clips the boundaries of a slice to the ends of the
list rather than raising an error. Lists are also mutable, which means that it is
still possible to change individual items in them after they have been created, as
illustrated in (46).
l = []               # Create empty list
l.insert(0, 'a')     # Insert item at beginning of list
l.insert(1, 'b')     # Insert item at index 1
l.append('c')        # Add item to end of list
print l
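The tolerance of slices for out-of-range boundaries can be seen in a small sketch. As far as we can tell, the boundaries are simply clipped to the length of the sequence, and this holds for immutable sequences such as tuples and strings as well as for lists. The sample values below are our own.

```python
words = ['alpha', 'beta', 'gamma']
letters = ('a', 'b', 'c')
text = "abc"

# Slice boundaries past the end are clipped to the length of the sequence.
list_slice = words[1:10]
tuple_slice = letters[1:10]
string_slice = text[1:10]

# A slice that lies entirely outside the sequence is simply empty.
empty_slice = words[5:10]

print(list_slice)
print(tuple_slice)
print(string_slice)
print(empty_slice)
```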
Because lists are mutable, it is possible to use slice notation to manipulate
them, as shown in (47).
s = "The wind came across the bay like something living ."
l = s.split()

# [The wind] => [A stench]
l[0:2] = ['A', 'stench']
print l

# [something living] => [an invading army]
l[7:9] = ['an', 'invading', 'army']
print l
In (47), two noun phrases are replaced using slices. This is done in two steps:
the words The wind are first replaced by A stench and the words something living
are then replaced with an invading army.
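Slice assignment does not require the replacement to have the same length as the slice it replaces. The sketch below, with sample words of our own, shows three cases: a replacement that makes the list longer, an empty replacement that deletes items, and an empty slice that serves as an insertion point.

```python
words = "The wind came across the bay".split()

# Replace two items with three: the list grows by one.
words[0:2] = ['A', 'cold', 'wind']
after_replace = list(words)

# Assigning an empty list to a slice deletes the items in that range.
words[1:2] = []
after_delete = list(words)

# A slice with equal start and end is empty, so assigning to it inserts.
words[1:1] = ['sharp']
after_insert = list(words)

print(after_replace)
print(after_delete)
print(after_insert)
```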
8.1.3 Miscellaneous List Operations

There are a variety of operations that can be performed on lists using Python’s
built-in functions. Here we will look at a few of these.
8.1.3.1 Adding and Removing Items in a List
A list isn’t very useful unless you can add items to it and remove them, as well.
The simplest way to add an item to a list is with the function append(), as in
(48).
(48) dataStructuresListAppendEx1.py
l = [’???’, ’???’, ’???’]
l.append(’???’)
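As a rough illustration of adding and removing items, the following sketch uses sample words of our own choosing (they are not the items intended for (48)). It adds items with append() and removes them with the list functions pop() and remove().

```python
words = ['Colorless', 'green', 'ideas']   # hypothetical sample words
words.append('sleep')         # add one item to the end
words.append('furiously')

removed = words.pop()         # remove and return the last item
words.remove('green')         # remove the first item equal to 'green'

print(removed)
print(words)
```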
In (47), we see how a list (or a slice from a list) can be added to another in a
specific position. If one wishes merely to add one list to the end of another, there
are simpler options: either to use the function extend(list) or concatenate
two lists with the plus sign (+), as shown in (49).
# Create two lists
a = [1, 2]
b = [3, 4]

c = a + b       # Concatenate the two lists
print c

a.extend(b)     # Add list b to list a
print a
There is yet another way of adding the elements of one list to another. It is
also possible to create lists of lists. [MENTION THE APPEND FUNCTION]
a = [1, 2]        # Create two lists
b = [3, 4]

# Here we merge two lists of two items
c = a + b         # List c has 4 items
print c[1]        # Print second item of new list (2)

# Here we create a nested list, whose first item
# is list a and whose second item is list b
c = [a, b]        # List c has 2 items
print c[1]        # Print second item of new list (list b)
print c[0][1]     # Print second item of first list (2)
8.1.3.2 Counting Elements in a List
When dealing with lists, one often wants to know the total number of elements
found within a list. The count of elements contained within a list is generally
referred to as its length. (The term “size” is also used to describe the length of a
list, but the term is ambiguous, since it can also refer to the amount of memory
taken up by a list.) The length of a list can be easily obtained using the function
len(), as in (51).
(51) dataStructuresListLenEx1.py
s = "Colorless green ideas sleep furiously ."
words = s.split(" ")
print len(words)
It is also possible to find the number of times a specific value occurs in a list.
For example, if you had a list of word lengths in a corpus, you might want to find
the number of times words of one length occur relative to words of another length.
How often do three-letter words occur relative to four-letter words?
(52) dataStructuresListLenEx2.py
s = "Colorless green ideas sleep furiously ."
words = s.split(" ")
print len(words)
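One way to count how often a specific value occurs is the list function count(). The sketch below is our own illustration rather than the example originally intended here: it builds a list of word lengths and counts the five-letter and nine-letter words.

```python
s = "Colorless green ideas sleep furiously ."
words = s.split()

# Build a list of word lengths: [9, 5, 5, 5, 9, 1]
lengths = [len(w) for w in words]

fives = lengths.count(5)    # number of five-letter words
nines = lengths.count(9)    # number of nine-letter words
print(fives)
print(nines)
```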
8.1.3.3 Finding Minimum or Maximum Values

In addition to determining whether a list contains a particular value, it is
sometimes necessary to determine the minimum or maximum value in a list. This is
more often the case with a list full of numeric values, as illustrated in (53), where
a string is split into words. A list of word lengths is then obtained by running
len() on each word using the map() function, and the minimum and maximum
values are obtained from the list of word lengths.
(53) dataStructuresListMinMaxEx1.py
text = 'Colorless green ideas sleep furiously.'
words = text.split(' ')
lengths = map(len, words)
print "Min." , min(lengths)
print "Max." , max(lengths)
When the same functions are used with a list of strings, the minimum value
will be the item that ??? while the maximum value will be the item that ???, as
(54) dataStructuresListMinMaxEx2.py
l = ['Colorless', 'green', 'ideas', 'sleep', 'furiously', '.']
print "Min." , min(l)
print "Max." , max(l)
???
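As a hedged sketch of what happens with strings: min() and max() compare strings character by character using their character codes, so punctuation sorts before uppercase letters, and uppercase letters sort before lowercase ones. The variable names below are our own.

```python
l = ['Colorless', 'green', 'ideas', 'sleep', 'furiously', '.']

smallest = min(l)    # the period sorts before any letter
largest = max(l)     # 's' is the latest initial letter in the list

print(smallest)
print(largest)
```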
8.1.3.4 Sorting Lists
When retrieving the keys from a dictionary with the method keys(), their
ordering is unpredictable. In many cases, however, the keys need to be sorted before
being iterated over. The easiest way of accomplishing this is to retrieve the keys
as a sequence, which can then be sorted, as in (55).
(55) dataStructuresListSortingEx1.py
from string import join

d = {}
words = ['zebra', 'ape', 'monkey', 'ostrich', 'echidna']
for w in words : d[w] = len(w)

# Print keys in default order
keys = d.keys()
print "Default Order:" , join(keys, ", ")

# Print keys in sorted order
keys.sort()
print "Sorted Order: " , join(keys, ", ")
The same strategy works when sorting the values in a dictionary, as can be
seen in (56).
(56) dataStructuresListSortingEx2.py
import string

# Add each word to dictionary with its length as its value
d = {}
words = ['zebra', 'ape', 'monkey', 'ostrich', 'echidna']
for w in words : d[w] = repr(len(w))    # integer -> string

# Print values in default order
values = d.values()
print "Default Order:" , string.join(values, ", ")

# Print values in sorted order
values.sort()
print "Sorted Order: " , string.join(values, ", ")

A dictionary is a data structure used to associate a unique item with an arbitrary
amount of data. Typically, that unique item is a string of some sort. This makes
it like dictionaries in the real world, in which a word is associated with a
definition. In (57), a dictionary is created and four entries are added to it. The
keys are three-letter language codes from the Ethnologue (Grimes and Grimes, 2000)
and their associated values are the full names of the languages.
(57) dataStructuresDictionaryEx1.py
d = {}
d['ROO'] = 'Rotokas'
d['???'] = '???'
d['???'] = '???'
d['???'] = '???'
print d['ROO']
print d['???']
print d['???']
print d['???']
It is important to understand that keys are unique. Therefore, if a key-value
pair is added to the dictionary for a key that already exists, the new value will
replace the old one, as illustrated in (58), where ???.
(58) dataStructuresDictionaryEx1a.py
d = {}
d['ROO'] = 'Rotokas'
print d['ROO']
d['???'] = '???'
print d['???']
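The uniqueness of keys can be illustrated with a short sketch. The replacement value below is a hypothetical one of our own, chosen only to make the overwriting visible.

```python
d = {}
d['ROO'] = 'Rotokas'
first_value = d['ROO']

# Assigning to an existing key replaces its value; no second entry is made.
d['ROO'] = 'Rotokas language'    # hypothetical replacement value
second_value = d['ROO']
count = len(d)

print(first_value)
print(second_value)
print(count)
```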
All of the keys in a dictionary can be retrieved as a list using the function
keys() and all of the values using the function values(), as illustrated in
(59), where all of the keys and all of the values of the dictionary from (57) are
printed out.
d = {}
d['a'] = 'Chimpanzee'
d['b'] = 'Pygmy Marmoset'
d['c'] = 'Gibbon'
d['d'] = 'Howler Monkey'
print d.keys()
print d.values()
To understand a little better what a dictionary is and what purposes it can
serve, let’s pursue the analogy a bit further and use the dictionary data structure
to store part of a real dictionary, in this case Routledge’s Dutch Dictionary xxx
(xxxa). Each entry in the dictionary consists of a Dutch word followed by a note,
its part-of-speech, and its definition in English. We can take the data in Figure
?? and store the definitions in a dictionary data structure where they can be more
easily manipulated, as in (60).
import sys
d = {}                                   # Create dictionary
d['geklets'] = 'talking, gossiping'      # Add items...
d['geknabbel'] = 'nibbling, gnawing'
d['gekletter'] = 'clatter'
d['geklop'] = 'clucking'
d['geknars'] = 'gnashing, grating'
d['gekleurd'] = 'coloured'
d['geklok'] = 'clucking'
dutchWord = sys.argv[1]                  # Get command-line arg
print d[dutchWord]                       # Find value for key
The values in a dictionary can be anything. In the previous examples, the
values have been strings, but there is no reason why the values cannot be other
data structures, such as lists. In fact, using lists, we can create more realistic
dictionaries that have multiple translations for each word defined. This is illustrated
in (61), where definitions are lists of translations.
# Create dictionary with lists for items
d = {'geklets'   : ['talking', 'gossiping'],
     'geknabbel' : ['nibbling', 'gnawing'],
     'gekletter' : ['clatter'],
     'geklop'    : ['clucking'],
     'geknars'   : ['gnashing', 'grating'],
     'gekleurd'  : ['coloured'],
     'geklok'    : ['clucking']}

# Get keys and retrieve items by keys
words = d.keys()
words.sort()
for w in words :
    defList = d[w]
    # Use a name other than d, so that the dictionary is not overwritten
    for definition in defList :
        print "%s -> %s" % (w, definition)
8.2.1 Common Operations Performed on Dictionaries

In the following sections we will look at some of the operations typically
performed on a dictionary.
8.2.1.1 Checking for the Existence of a Key
When using a dictionary, it is useful to be able to check whether a key is already
present in the dictionary before attempting to retrieve it (since an attempt to re-
trieve a non-existent key will cause an error). Imagine that we wanted to be able to
obtain the frequency of a particular word in a corpus. The program in (62) reads
a snippet of text and prints out the frequency of the word a.
(62) dataStructuresDictionaryHasKeyEx1.py
text = "Zenith Data Systems Corp., a subsidiary of Zenith Electronics Corp., \
received a $534 million Navy contract for software and services of \
microcomputers over an 84-month period. Rockwell International Corp. won a \
$130.7 million Air Force contract for AC-130U gunship replacement aircraft."
words = text.split()           # Split up contents into tokens
d = {}                         # Create empty dictionary
for w in words :               # go through each token
    if not d.has_key(w) :      # see if token is not in dict
        d[w] = 0               # if so, give it a counter
    d[w] = d[w] + 1            # increase counter for token by 1
print d["a"]                   # print out freq. for "a"
The problem with this script, however, is that if the user specifies a word that
is not found in the word list, an error will result.
print d["the"]                 # print out freq. for "the"
We can solve this problem by first checking for the existence of the word as
a key in the dictionary, as in (64). If it is there, the program prints its frequency;
otherwise, it prints a message.
if d.has_key("the") :                        # check whether 'the' is in dict
    print d["the"]                           # print out freq. for 'the'
else :                                       # 'the' isn't in dict, so...
    print "The word 'the' was not found."    # print message
8.2.1.2 Another Way to Obtain a Key’s Value
The attempt to retrieve the value for a non-existent key results in an error, as we
saw in §8.2.1.1. To avoid program errors, it is necessary to check whether a key
exists before trying to obtain its value, as in (64). But another way to avoid
an error is to use the function get(key), which is safer in the sense that it will
return the value None for a non-existent key rather than raise an error, as in (65).
(65) dataStructuresDictionaryGetKeyEx1.py
print d.get("the")
In (65), ???. The function get() can also take two arguments: the key and a default
value that will be returned if the key is not found, as illustrated by (66), which
provides a variant of (65).
(66) dataStructuresDictionaryGetKeyEx2.py
print d.get("the", 0)          # return value for 'the' or 0
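The default value is also handy when building a frequency table, since it removes the need to check for a key before incrementing its counter. The following is a minimal sketch under that assumption, using a short sample text of our own.

```python
text = "the wind and the bay and the sea"   # hypothetical sample text

freq = {}
for w in text.split():
    # get() returns 0 for a word not yet seen, so no key check is needed
    freq[w] = freq.get(w, 0) + 1

the_count = freq.get("the", 0)
missing = freq.get("river", 0)   # 'river' is absent, so the default is used
print(the_count)
print(missing)
```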
8.2.2 Sorting Dictionaries

By default, when all of the keys or values of a dictionary are obtained with
keys() or values(), the items are retrieved in an arbitrary order, which is not
necessarily the order in which they were entered. But this is not always what’s
desired. Sometimes it is desirable to sort these items before retrieving them. This
is particularly true when automatically extracting word lists from a text.
8.2.2.1 Sorting by Key

By using the sort function we are able to ensure that when the items in the
dictionary are printed out, they are printed out in alphabetical order. This
example shows how to sort a dictionary and retrieve its key-value pairs as a
three-step process: first the keys are extracted as a list, then the list is sorted (as already
described in §??), and then the list is iterated over.
(67) dataStructuresDictionarySortingEx1.py
# TODO 1
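Until the example above is filled in, the three-step process can be sketched as follows. This is our own illustration with a hypothetical table of words and their lengths, not the script intended for (67).

```python
d = {'zebra': 5, 'ape': 3, 'monkey': 6}   # hypothetical word -> length table

keys = list(d.keys())     # step 1: extract the keys as a list
keys.sort()               # step 2: sort the list in place

lines = []
for k in keys:            # step 3: iterate over the sorted keys
    lines.append("%s %d" % (k, d[k]))

print("\n".join(lines))
```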
Since sorting dictionaries consists of sorting a list of keys, custom sorting of
dictionaries is simply custom sorting of lists, which was already described in §??.
8.2.2.2 Sorting by Value

It is very useful to be able to sort a dictionary by value. For example, if you
are tracking word frequency using a dictionary, you may want to be able to sort
words by their rank frequency. Since such a dictionary has words as keys and their
frequencies as values, sorting a word frequency table by value requires being able
to sort a dictionary by value.
(68) dataStructuresDictionarySortByValue.py
# Example from PEP 265 - Sorting Dictionaries By Value
# Counting occurrences of letters

d = {'a':2, 'b':23, 'c':5, 'd':17, 'e':1}

# operator.itemgetter is new in Python 2.4
# `itemgetter(index)(container)` is equivalent to `container[index]`
from operator import itemgetter

# Items sorted by key
# The new builtin `sorted()` will return a sorted copy of the input iterable.
print sorted(d.items())

# Items sorted by key, in reverse order
# The keyword argument `reverse` operates as one might expect
print sorted(d.items(), reverse=True)

# Items sorted by value
# The keyword argument `key` allows easy selection of sorting criteria
print sorted(d.items(), key=itemgetter(1))

# In-place sort still works, and also has the same new features as sorted
items = d.items()
items.sort(key = itemgetter(1), reverse=True)
print items
8.3 Tuples
In Python, there is another data structure called a tuple which resembles a list but
differs in one crucial regard. Unlike lists, tuples cannot be changed once they have
been created. In other words, they are immutable. A tuple is created much like a
list, but with parentheses rather than square brackets, as illustrated in (69).
(69) dataStructuresTupleEx1.py
t = ('The', 'wind', 'came', 'across', 'the', 'bay', 'like', 'something', 'living.')
Tuples can be manipulated in the same way as lists, using indices, slices, etc.,
as demonstrated in (70). Note that, just as a slice from a list creates another list, a
slice from a tuple creates another tuple.
t = ('The', 'wind', 'came', 'across', 'the', 'bay', 'like', 'something', 'living.')
print t[0]      # positive indices
print t[1]
print t[-1]     # negative indices
print t[-2]
print t[0:1]    # slice
The defining property of tuples, which distinguishes them from lists, is their
immutability. Any attempt to modify a tuple will result in an error, as illustrated
in (71).
t = ('The', 'wind', 'came', 'across', 'the', 'bay', 'like', 'something', 'living.')
t[1] = 'A'      # Assignment to a tuple raises an error
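The error raised by an attempted assignment to a tuple is a TypeError. As a small sketch (with variable names of our own), the error can be caught, and a mutable copy of the tuple can be made with list() when changes are needed.

```python
t = ('The', 'wind', 'came')

try:
    t[1] = 'A'               # tuples do not support item assignment
    outcome = 'modified'
except TypeError:
    outcome = 'TypeError'

as_list = list(t)            # make a mutable copy of the tuple
as_list[1] = 'A'             # the copy can be changed; t itself cannot

print(outcome)
print(as_list)
print(t)
```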
You may be wondering why it is worth having both mutable and immutable
sequences. What additional value is there in having tuples in addition to lists? The
answer is that tuples utilize less memory. They are therefore the preferred option
when a program is creating lots of lists that will not be modified. For example, if
you are reading
8.4 Nested Data Structures

It is often convenient to embed one data structure within another. We already saw
in Chapter 5 how to create a dictionary with words as keys and their frequency of
occurrence as values.
[SAY MORE]
(72) dataStructuresDictionaryOfDictionaries.py
# Hypothetical sample; the tagged text of the original example is not shown here
text = "The/DT dog/NN bit/VBD the/DT man/NN ./."

stopwords = ['$', ',', '.']        # pos labels that can be ignored
d = {}                             # dictionary of dictionaries

tokens = text.split()              # split up contents into tokens
for t in tokens :                  # go through each token
    parts = t.split("/")           # split token into list
    w = parts[0]                   # 1st part is the word
    gc = parts[1]                  # 2nd part is POS (gc = grammatical class)
    if not d.has_key(gc) :         # if POS label not in dictionary,
        d[gc] = {}                 # then add it w/ a dictionary as its value
    if not d[gc].has_key(w) :      # if word not in embedded dictionary,
        d[gc][w] = 0               # then add it w/ 0 as its value
    d[gc][w] = d[gc][w] + 1        # increase counter for word/POS by 1
gclasses = d.keys()                # get list of parts of speech
gclasses.sort()                    # sort list of parts of speech
for gc in gclasses :               # go through each part of speech
    if gc not in stopwords :       # ignore any POS in list of stopwords
        print gc                   # print POS
        wd = d[gc]                 # get embedded dictionary for POS
        words = wd.keys()          # get keys (words) in embedded dictionary
        words.sort()               # sort words
        for w in words :           # go through each word
            count = wd[w]          # get count associated with word
            print " ", w , count   # print word and its freq. count w/ indentation
8.5 Exercises
• Zipf’s law states that the frequency of any word is roughly inversely
proportional to its rank frequency, that is, its position in a frequency table that
is sorted in descending order (from most to least frequent). If you wanted
to write a program in Python that takes a text and builds a word frequency
table, what kind of data structure(s) would you use?
• A concordance is an alphabetical index of words in a book or in an author’s
works with the passages in which they occur. If you wanted to write a
program in Python to create a concordance from a text, what kinds of data
structures would you use? How would they interact?
• Tree diagrams are a commonly used device for representing the hierarchi-
cal organization of elements in a sentence. For example, consider the tree
diagram in (7).
(7)            S
         NP         VP
      The dog     V     NP
                 bit  the man
What kind of data structure could be used to store the information found in
(7)?
Chapter 9
Data Flow
The term data flow refers to the devices available for putting the logic of a pro-
gram into action. These devices are sometimes called control structures, given
that they control how and when statements are executed. The various control
structures available to Python will be discussed in turn.
9.1 Conditional Statements: if

One of the most fundamental control structures in programming is the if-then
statement—e.g., “If it’s raining, I’ll stay inside”, “If I pick the right lottery num-
bers, I will be rich”, etc. The logical structure of if can be broken down into two
parts: the antecedent (the if-clause) and the consequent (the then-clause).1
In Python, the if-clause is separated from the then-clause by a colon, as in
(73). (Note the difference between a single equal sign (=) and a double equal sign
(==): the single equal sign is used to assign variables a value while the double
equal sign is used to compare the values of variables.)
(73) dataFlowIfEx1.py
pos = "N"
if pos == "N" :
    print "The POS is N."
if not pos == "V" :
    print "The POS is not V."
if pos != "V" :
    print "The POS is not V."
1 In logic, the former is sometimes referred to as the implicans or protasis and the latter as the
implicate or apodosis (McCawley, 1993; Chapman, 2000). We’ll stick to the simpler terminology
of if-clause and then-clause.
The if-clauses of all of the statements in (73) are true, and therefore their
then-clauses are executed. It is important to be clear about what represents true and
false when evaluating an if statement. The answer is quite simple in principle, but
can sometimes be a stumbling block for beginners. The rule is that 0, None, and
empty strings and data structures (such as an empty list) evaluate as false;
everything else evaluates as true.
if 1 : print "'1' evaluates as true."
if -1 : print "'-1' evaluates as true."
if None : print "'None' evaluates as false."
if 0 : print "'0' evaluates as false."

if not 1 : print "'not 1' evaluates as false."
if not -1 : print "'not -1' evaluates as false."
if not None : print "'not None' evaluates as true."
if not 0 : print "'not 0' evaluates as true."
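The truth value that Python assigns to any value can be inspected with the built-in function bool(). The sketch below, with sample values of our own, shows that empty strings, lists, and dictionaries evaluate as false alongside 0 and None.

```python
values = [1, -1, 0, None, "", "N", [], ["N"], {}]

# bool() returns the truth value an if-statement would see
truth = [bool(v) for v in values]
print(truth)
```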
In (74), only the statements whose conditions evaluate as true print their
messages. Because Python evaluates any value as true or false for the purposes of
an if-statement, it is possible to test a variable directly, without comparing it to
anything. In (75), the variable pos holds a non-empty string, so the if-statement
will evaluate as true, and the message will be printed out.
pos = "N"
if pos : print "The POS is 'N'."
If the value of a variable n were 0, an if-statement that tests n directly would not
evaluate as true and nothing would be printed. We can ensure that something is
printed even when n is 0 by including an else-statement, as in (76). (Note that
we’ve put the consequent on a separate line from the antecedent. This isn’t
necessary, but it improves readability.)
n = 0
if n :
    print "The value of 'n' is not 0."
else :
    print "The value of 'n' is 0."
A series of if-statements can be joined together with elif (an abbreviation of

else if). In (77), a series of if-then statements checks whether the value of n is 0,
1, or 2, or none-of-the-above.
n = 0
if n == 0 :
    print "The value of n is 0."
elif n == 1 :
    print "The value of n is 1."
elif n == 2 :
    print "The value of n is 2."
else :
    print "The value of n is not 0, 1, or 2."
9.2 Looping, Iteration, Cycling etc.: for and while
The term looping refers to control structures that repeatedly execute blocks of
code. The two main control structures for looping in Python are while and for,
each of which is described below.
9.2.1 while
A while loop is a bit like an if statement, since it also consists of two parts: a
condition and a consequence. However, an if statement only executes once, whereas
a while loop, as the name suggests, will repeatedly execute the consequence as
long as the condition evaluates as true. A simple example of the use of while to
print out the characters in a string is provided in (78).
(78) dataFlowWhileEx1.py
s = "The wind came across the bay like something living."
i = 0
while i < len(s) :
    print s[i]
    i += 1
???
A while loop can be used to examine all of the values in a list, as shown in
(79), but the for loop is a more convenient means of doing the same (see §9.2.2).
numbers = [1, 2, 3]
i = 0
while i < len(numbers) :
    print numbers[i]
    i += 1
9.2.2 for
A typical task in programming is to examine all of the values in a sequence (such
as a list) in order to decide what to do with them. The for loop in Python provides
a means of iterating over any type of sequence, such as lists or tuples. Let’s return
to the example that we looked at in §8.1, where we created a list of words from the
first sentence of Hal Clement’s novel Mission of Gravity. In (81), a for loop is used
to iterate over the words in the list and print each one out.
(81) dataFlowForEx1.py
s = "The wind came across the bay like something living."
words = s.split(" ")
for w in words :
    print w
This control structure is known as a loop because the indented
block of code following the for statement is executed as many times as there are
elements in the list. In other words, there is a loop that repeats itself. Each item
in the list is temporarily stored in the variable immediately following the keyword
for.
We can combine a for loop with other control structures, such as an
if-statement, to build more complicated control structures, as illustrated in (82), where
all of the words in the list are examined and only those that consist of four letters
or fewer are printed out.
s = "The wind came across the bay like something living."
words = s.split(" ")

for w in words :
    if len(w) <= 4 :
        print w
It is sometimes useful to loop over the elements in a list using their indices,
since it is then possible to manipulate the indices for various purposes. The main
trick is to use the function range, which takes two numbers as arguments (a
range start and a range end) and walks through the range defined by the numbers.
Note that the range end is non-inclusive. In other words, the function will walk
through the numbers up to, but not including, the range end. This technique is
illustrated for the same list in (83).
s = "The wind came across the bay like something living."
words = s.split(" ")
for i in range(0, len(words)) :
    w = words[i]
    print str(i), w
To see how this technique can be useful, consider the following code, which
shows how it is possible to identify the word immediately preceding each word
in the sentence using indices. Using this technique, it is possible to extract useful
information about word co-occurrence in a corpus. (For an introduction to the
statistical analysis of word collocations (bigrams, trigrams, etc.), see Manning
and Schütze (1999).)
s = "The wind came across the bay like something living."
words = s.split(" ")
for i in range(0, len(words)) :
    w = words[i]
    prev = words[i-1]
    print str(i), w , "[" + prev + "]"
Note that the last word of the sentence (living.) is given as the word immedi-
ately preceding the first word in the sentence (The). This is because the index for
the first word in the sentence is zero, and subtracting one from it gives negative
one, which, according to the rules for list indices (see §8.1.1), will be the last item
in the list. To prevent this from happening, we can simply use an if-statement to
treat the first item in the list as a special case, as in (85).
s = "The wind came across the bay like something living."
words = s.split(" ")
for i in range(0, len(words)) :
    w = words[i]
    if i == 0 :
        prev = "--"
    else :
        prev = words[i-1]
    print str(i), w , "[" + prev + "]"
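As a rough sketch of how index-based looping leads to co-occurrence counts, the loop can start at index 1, so that the first word never needs special treatment, and each pair of adjacent words can be counted in a dictionary. The sample sentence and the variable names are our own; the pairs are stored as tuples, which can serve as dictionary keys.

```python
s = "the wind came across the bay like the wind"   # hypothetical sample
words = s.split()

bigrams = {}
for i in range(1, len(words)):      # start at 1: word 0 has no predecessor
    pair = (words[i-1], words[i])   # tuple of previous word and current word
    bigrams[pair] = bigrams.get(pair, 0) + 1

wind_count = bigrams[("the", "wind")]
print(wind_count)
```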
9.2.3 Controlling Looping with break and continue

It is often necessary while looping over the values in a sequence to suspend the
looping once a certain condition has been met or to skip over a particular value
during the looping. The keywords break and continue provide these capabil-
ities in Python.
9.2.3.1 break
The break statement breaks out of the smallest enclosing for or while loop. For
example, imagine that you have run a number of subjects through a
lexical-decision task and that you want to find the reaction time of each subject for the
first presentation of a particular word, say, room. If each subject has a separate
data file and the data is stored as tab-delimited values with one row per presented
word (with the presented word in the first column and the reaction time in the
second column), then the code in (86) should work to retrieve the desired data for
each file it is run on.
(86) dataFlowBreakEx1.py
from string import split

fn = 'tab-delimited_data.txt'            # hardcoded
f = open(fn, 'r')                        # 'r' = read-mode
lineList = f.readlines()                 # get list of lines
for l in lineList:                       # loop over lines
    columns = split(l, "\t")             # split line by tabs
    if columns[0] == 'room' :
        print columns[0], columns[1]     # print out the data
        break                            # found it, stop looping
9.2.3.2 continue
The continue statement skips to the next iteration of the loop.
For example, imagine that you are looping over the lines in a tab-delimited
file. Some of the lines in the file have a blank first column and should not be
processed. This can be easily accomplished with continue, as shown in (87).
(87) dataFlowContinueEx1.py
from string import split

fn = 'tab-delimited_data.txt'        # hardcoded
f = open(fn, 'r')                    # 'r' = read-mode
lineList = f.readlines()             # get list of lines
for l in lineList:                   # loop over lines
    columns = split(l, "\t")         # split line by tabs
    if columns[0] == '' :            # blank first column
        continue                     # continue to next loop
    print columns[0], columns[1]     # print out 1st & 2nd col.
9.3 List Comprehension: Modifying a List In Situ

We have already seen how it is possible to walk through every item in a sequence
using a for loop. Using a for loop to perform the same operation on every
item in a sequence is very common operation. For example, if you have a list
of words, you may want to systematically change their capitalization, say, by
using the capitalization normally found in titles (where each word begins with an
uppercase letter). One way of doing this is to use a for loop to go through each
word in the list, as in (88).
(88) dataFlowListComprehensionEx1.py
# s holds the sentence to be converted, as a string
words = s.split()
uwords = []
for w in words :
    uwords.append(w.title())
print " ".join(uwords)
In (88), a sentence is split into a list of words and a for loop is used to go
through each word in the list. As each word is converted to the proper capital-
ization pattern, it is stuck into another list, which is printed out as a string at the
end.
This situation is so common that Python provides a convenient device (an
idiom, if you will) for handling it, known as list comprehension. The use of
this idiom can significantly simplify the process of transforming a list on the fly
without making any permanent changes to it, as shown by (89), which is a much
more compact version of (88).
(89)
words = s.split()
titlewords = [ w.title() for w in words ]
print " ".join(titlewords)
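That the loop and the list comprehension do the same work, and that neither changes the original list, can be verified side by side. The sketch below uses Python 3 syntax and a hypothetical sentence.

```python
sentence = "the wind came across the bay"   # hypothetical example sentence
words = sentence.split()

# Version with an explicit for loop
title_loop = []
for w in words:
    title_loop.append(w.title())

# Version with a list comprehension
title_comp = [w.title() for w in words]

joined = " ".join(title_comp)
```

Both versions build a new list; words itself still holds the lowercase forms afterwards.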
Note that the code in (89) can be made even more compact, as shown in (90).
(90)
words = " ".join([w.title() for w in s.split()])

The version in (90) is more compact but less readable. There is often a trade-
off between concision and readability, and balancing the two is a fine art which
requires a judgement call. In general, we would recommend erring on the side of
readability.

9.4 Exercises
• In (84), each word in a sentence is printed out along with the word imme-
diately preceding it. What changes to the script would be required to obtain
the word immediately following each word in the sentence?
• ???
• ???
Chapter 10
Functions
10.1 The Basics

A function is a block of code that can be called by name.¹ Functions are created
in Python using the keyword def and the name of the function followed by the
code that will be executed when the function is used. The code is separated from
the function’s name by a colon and should be indented. This is illustrated with
the function apologize in (91), which does nothing more than concatenate a
few strings and print out the result. (The message printed out is a quote, an
apology from HAL, the sentient computer in Stanley Kubrick’s film 2001: A Space
Odyssey.)
(91) functionsBasicsEx1.py
def apologize() :
    name = "Dave"
    print "I'm sorry, " + name + ", I'm afraid I can't do that."
apologize()
In (91), the function is defined in lines 1 through 3 and called in line 4.
(Terminology note: using a function is known as “calling” or “invoking” it.)
There are a few potential pitfalls to bear in mind when creating a function.
First, the names of functions are subject to the same constraints as the names of
variables (see §6.2.1 for details). Second, the definition of a function must precede
its invocation. In other words, you have to define a function before you can use
it. Otherwise, the Python interpreter will raise an error, as can be seen by running
(92), where the definition and invocation are in the wrong order—i.e., the function
apologize() is called first and then defined.
¹ In other programming languages, functions go by different names (e.g., procedures
or subroutines), but the basic idea is the same.
(92) functionsBasicsEx2.py
apologize()
def apologize():
    name = "Dave"
    print "I'm sorry " + name + ", but I'm afraid I can't do that."
When (92) is run, an error message is given by the interpreter, which should
look something like this:
% python functionsBasicsEx2.py
Traceback (most recent call last):
  File "functionsBasicsEx2.py", line 1, in ?
    apologize()
NameError: name 'apologize' is not defined
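The ordering constraint can also be observed without crashing the script by catching the NameError. This is a sketch in Python 3 syntax; the list messages is a hypothetical helper used only to record what happened.

```python
messages = []

# First attempt: the function has not been defined yet
try:
    apologize()
except NameError as err:
    messages.append(str(err))

def apologize():
    return "I'm sorry, Dave, I'm afraid I can't do that."

# Second attempt: the definition now precedes the call
messages.append(apologize())
```

The first entry of messages holds the interpreter's complaint about apologize, the second the apology itself.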
Functions serve a number of purposes. First and foremost, they provide a way
of reusing code. If we want to print the apology from (91) more than
once, we can simply call the function apologize() multiple times, as in (93),
which calls the function three times.
(93)
def apologize():
    name = "Dave"
    print "I'm sorry, " + name + ", but I'm afraid I can't do that."
apologize()
apologize()
apologize()
The second purpose of functions is to organize code into bite-sized chunks. Code
that is grouped into functions is easier to read and therefore easier to update
and debug. Functions play an extremely important role as programs grow longer and
require more organization. This is true even of relatively short programs, as we
will demonstrate later in §10.4 by reorganizing the first program from
Chapter 5.
10.2 Arguments
In the previous section, we learned how to create a simple function. In (91), the
message that is printed out by the function apologize() is fixed, since the
addressee can only be “Dave”. But what if HAL wants to apologize to a member
of the crew other than Dave? In order to do this, there needs to be some way to
custom-tailor the apology to the addressee. One way to do that is to make the
variable name accessible throughout the program so that it can be set every time
before the function is called, as illustrated in (94).
(94) functionsArgumentsEx0.py
def apologize() :
    print "I'm sorry, " + name + ", but I'm afraid I can't do that."
name = 'Bill'
apologize()
The function in (94) makes use of a global variable, a variable that is defined
in one place and potentially available everywhere. The use of global variables
in functions has a number of drawbacks. Some of them are fairly obvious. For
example, the name of the variable in the function must match the name of the
variable outside of the function. Although this is easy to ensure in a short program,
it becomes more difficult once a program grows longer and more complicated.
However, there are more subtle reasons why (94) is unsatisfactory. The function
depends on a value that appears neither in its definition line nor in its call, so
nothing about apologize() tells the reader what must be set beforehand, and any
other part of the program can change name between one call and the next.
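The hidden dependency on a global variable can be made visible with a short experiment. This sketch uses Python 3 syntax, and the global is given the hypothetical name crew_member.

```python
def apologize():
    # crew_member is looked up outside the function when the call is made;
    # the call apologize() gives no hint that it must be set first.
    return "I'm sorry, " + crew_member + ", but I'm afraid I can't do that."

try:
    apologize()                  # crew_member has not been assigned yet
    early_call_failed = False
except NameError:
    early_call_failed = True

crew_member = "Bill"
to_bill = apologize()
crew_member = "Frank"            # any code can change the addressee
to_frank = apologize()
```

The same call, apologize(), fails or produces different output depending on assignments made elsewhere in the program.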
Given the drawbacks of global variables, programming languages invariably
have a mechanism that allows a value to be passed to a function as a local
variable. A variable that is passed to a function is called an argument. In
Python, every function’s definition must specify the arguments that
it takes. In (95), we define a function that takes a single argument, which is
labelled addressee. (Think of arguments as variables. The names of arguments
are arbitrary and have the same restrictions as variables; see §6.2.1.)
(95)
def apologize(addressee) :
    print "I'm sorry, " + addressee + ", but I'm afraid I can't do that."
name = "Bill"
apologize(name)
Although (95) produces the same output as (94), it is a much better solution.
Because the addressee of the apology is passed in to the function as an argument,
the function can be called using any name. It is possible to call the function
multiple times with different names, as illustrated in (96).
(96) functionsArgumentsEx1.5.py
def apologize(firstName) :
    print "I'm sorry, " + firstName + ", but I'm afraid I can't do that."
apologize("Dave")
apologize("Bill")
apologize("Frank")
In Python, arguments are passed to functions by object reference.² This means
that the argument of a function does not simply have the same value as the variable
passed in, but actually refers to the same object. This is an important point, which
is often a source of confusion for newcomers to programming, so let’s consider it
carefully.
Consider the code in (97), which defines a function called add_word().
² Not all programming languages work this way. Perl, for example, passes variables
by value, resulting in quite different behavior.
(97) functionsArgumentsScopeEx1.py
def add_word(d, w) :
    w = w.upper()
    if not d.has_key(w) :
        d[w] = 0
    d[w] = d[w] + 1

wfreq = {}
# words is assumed to be a list of word strings
for w in words :
    add_word(wfreq, w)
print len(wfreq.keys())
The function add_word() adds a word to a dictionary that serves as a word
frequency lookup table. It converts the word to uppercase and then checks whether
the uppercase version of the word is already in the dictionary. If it is not already
there, it is added with an initial count of 0. In either case, its associated value
is then incremented by one, so that a new word ends up with a count of 1. The
important point about (97) is that changes made to the
dictionary object inside of the function apply to the dictionary object outside of
the function, as you can confirm for yourself by running (97) and observing that
wfreq has 9 keys.
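The same behavior can be reproduced on a smaller scale. The sketch below is written in Python 3 syntax, so it uses the in operator where (97) uses has_key(), and the three-word list is hypothetical.

```python
def add_word(d, w):
    w = w.upper()
    if w not in d:       # Python 3 spelling of "not d.has_key(w)"
        d[w] = 0
    d[w] = d[w] + 1      # modifies the dictionary object passed in

wfreq = {}
for w in ["the", "The", "bay"]:   # hypothetical word list
    add_word(wfreq, w)
```

Although add_word() returns nothing, wfreq has been filled in, because d and wfreq refer to the same dictionary object.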
It is important to bear in mind that changes made to a variable inside of a
function only hold true outside of the function as long as the variable continues to
refer to the same object. If a variable is assigned a new value inside of a function,
then the variable will no longer refer to the object it originally did. Because the
variable outside of the function will continue to refer to the original object, it will
continue to have the same value it did before the function was called. This is an
issue for immutable objects, which cannot be modified in place. Therefore, in
(98), the sentence printed inside format_sentence() is capitalized and ends in a
period, whereas the sentence printed after the call is unchanged.
(98) functionsArgumentsScopeEx2.py
def format_sentence(s) :
    s = s.capitalize()
    s = s + "."
    print s
s = "the wind came across the bay like something living"
format_sentence(s)
print s
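Since a string cannot be changed in place, the usual way to get the formatted sentence back out of the function is to return it. This is a sketch in Python 3 syntax of that variant; the variable names original and formatted are assumptions.

```python
def format_sentence(s):
    s = s.capitalize()   # s now refers to a new string object
    s = s + "."          # and again to another new one
    return s             # hand the new string back to the caller

original = "the wind came across the bay like something living"
formatted = format_sentence(original)
```

The caller's string is untouched; only the returned value carries the changes.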
10.2.1 Multiple Arguments

A function can also take multiple arguments, in which case commas are used to
separate the various arguments from one another, as shown in (99), which defines
a function that takes two arguments, fname and lname, for the first and last
name of the apology’s addressee.
(99)
def apologize(fname, lname) :
    print "I'm sorry, " + fname + " " + lname + ", but I'm afraid I can't do that."
apologize("David", "Bowman")
A function cannot be invoked with a different number of arguments than it was
defined to take (although arguments can be optional or take a default value, as we
will see in §10.2.2); otherwise, an error occurs and Python complains, as can be
seen by running (100), where the function apologize() is called
with no arguments (even though one is expected).
(100)
def apologize(name) :
    print "I'm sorry, " + name + ", but I'm afraid I can't do that."
apologize()
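The error raised for a wrong number of arguments is a TypeError. The following sketch, in Python 3 syntax, catches it so that its type can be inspected; the variable error_name is hypothetical.

```python
def apologize(name):
    return "I'm sorry, " + name + ", but I'm afraid I can't do that."

try:
    apologize()                      # one argument expected, none given
    error_name = None
except TypeError as err:
    error_name = type(err).__name__  # record which kind of error occurred
```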
10.2.2 Default Arguments

In some cases, it is convenient to have arguments that are optional or take a default
value (i.e., a value that is assumed unless one is specifically provided). Returning
to the HAL example from the last section, it might be convenient to assume that the
addressee is ‘Dave’ unless a different name is specified. This can be done by putting
='Dave' after the argument, as in (101).
(101) functionsDefaultArgsEx1.py
def apologize(name='Dave'):
    print "I'm sorry, " + name + ", but I'm afraid I can't do that."
apologize()
The possibility of default arguments provides a way of invoking the same
function with a different number of arguments, as in (102), where the apologize()
function is called in three different ways. The first call is made without an
argument, in which case the addressee defaults to ‘Dave’. The second call passes
the default name ‘Dave’ to the function explicitly, which doesn’t make much
sense but is nevertheless possible. The third call passes a different name to the
function, which is the addressee in the message.
(102)
def apologize(name='Dave'):
    print "I'm sorry, " + name + ", but I'm afraid I can't do that."
apologize()
apologize("Dave")
apologize("David")
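The three kinds of call can be compared directly if the function returns its message instead of printing it. This is a sketch in Python 3 syntax; the result variable names are assumptions.

```python
def apologize(name="Dave"):
    return "I'm sorry, " + name + ", but I'm afraid I can't do that."

default_call = apologize()            # falls back on the default
explicit_default = apologize("Dave")  # passes the default explicitly
other_name = apologize("Frank")       # overrides the default
```

The first two calls are indistinguishable in their result, which is why passing the default explicitly is redundant.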
In some cases, you may want to have multiple arguments, in which case the
ordering of arguments is significant. For example, let’s say that we want the function
apologize() to be flexible in terms of the addressee as well as the message,
so that the function could print out something like “I’m sorry, Dave, but this
conversation can serve no further purpose.” The function would then look something
like (103). (The punctuation of the message is hard-coded in the function.
Because this is an apology, a question mark is ruled out. And since it’s a sentient
computer talking, we’ll assume that an exclamation mark would be
uncharacteristically emotional. After all, this is the same computer that locked Dave
out of the ship to die in space and then, when Dave tried to talk HAL out of killing
him, ended the conversation by saying calmly, “Dave, this conversation can serve no
purpose anymore. Goodbye.”)
(103) functionsDefaultArgsEx3.py
def apologize(name, msg):
    print "I'm sorry, " + name + ", but " + msg + "."
apologize("Dave", "this conversation serves no further purpose")
Now, if we want the argument name to have a default (e.g., ‘Dave’) but require
the argument msg to have a value, we face a problem. Default arguments
cannot be followed by non-default arguments. Therefore, the definition in (104) is
rejected by the interpreter with a SyntaxError, as can be seen by running it.
(104)
def apologize(name='Dave', msg):
    print "I'm sorry, " + name + ", but " + msg + "."
apologize("Dave", "this conversation serves no further purpose")
The solution is to ensure that the non-default arguments precede the default
arguments, as in (105).
(105)
def apologize(msg, name='Dave'):
    print "I'm sorry, " + name + ", but " + msg + "."
apologize("this conversation serves no further purpose")
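The ordering rule is enforced when the definition is compiled, before any call is made. One way to see this without breaking a script is to hand both definitions to the built-in compile() as strings. This is a sketch in Python 3 syntax; the helper compiles() and the file label "<sketch>" are hypothetical.

```python
bad_definition = "def apologize(name='Dave', msg):\n    pass\n"
good_definition = "def apologize(msg, name='Dave'):\n    pass\n"

def compiles(source):
    """Return True if the source text is syntactically valid Python."""
    try:
        compile(source, "<sketch>", "exec")
        return True
    except SyntaxError:
        return False

bad_ok = compiles(bad_definition)
good_ok = compiles(good_definition)
```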
10.2.3 Keyword Arguments

So far, we have relied upon the ordering of arguments to establish their identity.
However, Python also allows arguments to be specified using keywords. The idea
behind using keywords is that they are mnemonically more useful than relying
upon the arbitrary ordering of arguments. For example, the convention for for-
matting dates differs between countries. In America, the convention is for the
month to precede the day (e.g., 2-5-2006 is February 5th, 2006) whereas in
most other parts of the world the convention is for the day to precede the month
(e.g., 2-5-2006 is May 2nd, 2006). Now imagine a function that takes
a date and formats it in a uniform fashion, as in (106). The arguments of the
function are the month, day, and year of the to-be-formatted date.
(106) functionsKeywordArgEx1.py
def formatDate(m, d, y) :
    print "%s-%s-%s" % (y, m, d)
formatDate(10, 28, 1973)
While it is possible to pass the arguments in the usual fashion, it is also pos-
sible to specify them using keywords, in which case the ordering is irrelevant.
This is illustrated in (107), where a function is called using keyword-labelled
arguments. Although the function is called with three different orderings, the
printed-out result is the same in each case, as can be seen by running the script.
(107) functionsKeywordArgEx2.py
def formatDate(m, d, y) :
    print "%s-%s-%s" % (y, m, d)
formatDate(y=1973, d=28, m=10)
formatDate(m=10, d=28, y=1973)
formatDate(d=28, m=10, y=1973)
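Positional and keyword arguments can also be mixed in a single call, provided the positional ones come first. The sketch below, in Python 3 syntax, uses a hypothetical variant format_date() that returns the string so that the calls can be compared.

```python
def format_date(m, d, y):
    return "%s-%s-%s" % (y, m, d)

positional = format_date(10, 28, 1973)        # order matters
keywords = format_date(y=1973, d=28, m=10)    # order is irrelevant
mixed = format_date(10, y=1973, d=28)         # positional first, then keywords
```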
10.3 Return Values

Data goes into functions as arguments and comes out of them as a return value.
The return value of a function is specified with the keyword return. The use
of return terminates execution of the function. Therefore, a function can only
produce a single return value per execution. In (108), we define a new version of
the apologize function, which returns the apology from the function as a string,
rather than directly printing it. The returned string is printed out by assigning it to
a variable and then printing the variable.
(108) functionsReturnValuesEx0.py
def apologize() :
    name = "Dave"
    return "I'm sorry, " + name + ", but I'm afraid I can't do that."
s = apologize()
print s
A function that returns a value can also be used directly wherever a value of
the type it returns is expected (rather than being saved first as a variable), as
in (109).
(109)
def apologize() :
    name = "Dave"
    return "I'm sorry, " + name + ", but I'm afraid I can't do that."
print apologize()
It is possible to return from a function midway through its execution. For
example, (110) defines a function that takes two numbers as arguments. The first is
divided by the second, unless the second argument is equal to 0, in which case
the function returns immediately with the value None (instead of throwing a
ZeroDivisionError).
(110)
def divide(i, j) :
    if j == 0 :
        return None
    else :
        return i / j
print divide(5, 0)
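A caller of such a function has to check for None before using the result. This is a sketch in Python 3 syntax, where / always performs true division; the variable names are assumptions.

```python
def divide(i, j):
    if j == 0:
        return None      # early return, the division below is never reached
    return i / j

quotient = divide(6, 3)
undefined = divide(5, 0)

# The caller decides what a missing result means
if undefined is None:
    report = "no result"
else:
    report = str(undefined)
```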
Functions that return a value can also be used directly as the arguments of
other functions, as in (111).
(111)
def multiply(i, j) :
    return i * j
print multiply(multiply(3, 3), 3)
The code in (111) is equivalent to that in (112), where the value returned from
the first invocation of the function multiply() is stored as a variable and then
passed to its second invocation.
(112)
def multiply(i, j) :
    return i * j
p1 = multiply(3, 3)
p2 = multiply(p1, 3)
print p2
10.4 Functions in Action

In order to show how functions can be used to improve the organization of a
program, we will rewrite our first program from Chapter 5 using functions. Since
it is a fairly simple program, only a few functions are needed. We’ll define only
three: one to extract the contents of a file, one to create a list of tokens from the file
contents, and one to print out the results. The full program is provided in (113).
(113) functionsFirstProgram.py
# text is a global string holding the POS-tagged corpus excerpt
# (tokens of the form wordform/POS)

def get_text() :
    return text

def create_tokenlist(contents) :
    tokens = contents.split()         # Split up contents into tokens
    d = {}                            # dictionary of token counts
    for t in tokens :                 # loop over tokens
        if d.has_key(t) :             # see if it's in the dict.
            d[t] = d[t] + 1           # if so, add 1 to its counter
        else :                        # otherwise
            d[t] = 1                  # initialize it to 1
    return d

def print_results(d) :
    tokens = d.keys()                 # get list of unique tokens
    tokens.sort()                     # sort list of tokens
    for t in tokens :                 # loop over sorted tokens
        count = d[t]                  # get counter for a given token
        parts = t.split("/")          # split it up into wordform and POS
        wordform = parts[0]           # first part is wordform
        POS = parts[1]                # second part is the POS
        print wordform , POS , str(count)  # print it to standard output

text = get_text()
tl = create_tokenlist(text)
print_results(tl)
The program’s organization can be described as a “daisy chain” since the
output of each function is the input to the next. This is shown in Table 10.1, where
we see that each function gets its input from the output of the previous function.
The only exception is the first function, get_text(), which simply returns the
global variable text, which contains the corpus excerpt to be analyzed by the
program.
Input                        Output                       Function
–                            text (string)                get_text()
text (string)                token counts (dictionary)    create_tokenlist()
token counts (dictionary)    –                            print_results()

Table 10.1 Organization of Program in (113)
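The daisy chain in Table 10.1 can be written as a single nested call. The following is a sketch in Python 3 syntax under two assumptions: the corpus excerpt is replaced by a short hypothetical string, and the last function (here called format_results()) returns rows rather than printing them.

```python
def get_text():
    # Hypothetical stand-in for the corpus excerpt
    return "the/DT dog/NN saw/VBD the/DT cat/NN"

def create_tokenlist(contents):
    counts = {}
    for token in contents.split():
        counts[token] = counts.get(token, 0) + 1
    return counts

def format_results(counts):
    rows = []
    for token in sorted(counts):            # alphabetical order of tokens
        wordform, pos = token.split("/")
        rows.append((wordform, pos, counts[token]))
    return rows

# Output of each function is the input of the next
rows = format_results(create_tokenlist(get_text()))
```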
10.5 Exercises
• ???
• ???
• There are many different ways to break a program down into functions.
What are some alternative ways of organizing (113)?
Chapter 11
Errors and Exceptions
An exception is an error condition that changes the normal flow of control in a
program. That is just a rather technical way of saying that exceptions are what
happen when something goes wrong while a program is being run. Python has
very good error handling capabilities which can make it much easier to write pro-
grams that deal gracefully with the errors that arise when a program is run. In this
chapter, we will look at how exceptions work in Python and show how they can
be used to deal with error handling effectively.
11.1 Handling Exceptions

Perhaps the best way to understand the different approaches to error handling
is to concentrate on a particular task that frequently causes problems during the
execution of a program, which is file input and output (see Chapter 12). Various
things can go wrong when opening a file or writing to it: the file may not exist,
it may not be readable due to restrictive access privileges on the filesystem or
because it is corrupt, there may not be enough space on the disk to write the new
contents, etc. A well-written program must be able to handle all of these
contingencies.
To see how such contingencies would be handled in Python, let’s have a look
at a simple script that takes a filename from the command-line, opens it, and prints
out its contents, as in (114).
(114) exceptionsBackgroundEx1.py
import sys
try :
    filename = sys.argv[1]
    fo = open(filename, 'r')
    print fo.read()
    fo.close()
except :
    print "Couldn't open specified file."
In this example, we see the basics of exception handling in Python. If an error
occurs when the code inside of the try-block is run, the except-block will be run
and the error message will be printed out.
There is a problem with (114), however, which is that it does not provide a
way of teasing apart the various things that can go wrong in the try-block. The
behavior of the program is the same regardless of whether the user fails to provide
a filename or the user provides the name of a non-existent file. To tease apart
different error conditions, multiple except-blocks can be used, as illustrated in
(115), where there are two except-blocks, one for catching the error that results if
the user does not provide a filename and another for catching the error that results
if the file specified does not exist or cannot be opened because of access privileges.
(115) exceptionsBackgroundEx2.py
import sys
try :
    filename = sys.argv[1]
    fo = open(filename, 'r')
    print fo.read()
    fo.close()
except IndexError :
    print "Usage:" , sys.argv[0] , "<FILE>"
except IOError :
    print "Cannot open file '" + filename + "'."
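The two error conditions can be exercised without touching the real command line by passing the argument list to a function. This is a sketch in Python 3 syntax; the function describe_open_attempt() and the nonexistent path are hypothetical.

```python
def describe_open_attempt(argv):
    """Report which of the two error conditions, if any, occurred."""
    try:
        filename = argv[1]
        with open(filename, "r") as fo:
            fo.read()
    except IndexError:
        return "usage"           # no filename was supplied
    except IOError:
        return "cannot open"     # the file is missing or unreadable
    return "ok"

no_argument = describe_open_attempt(["script.py"])
missing_file = describe_open_attempt(
    ["script.py", "no_such_dir_for_this_sketch/no_such_file.txt"])
```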
When an exception is raised, it must somehow be handled in order to prevent it
from bringing the program in which it occurs to a grinding halt. (When speaking
of exceptions, it is common parlance to refer to an error being raised or thrown.)
We can start with the example in (116), which is a short script that takes two
strings as command-line arguments and concatenates them.
(116) exceptionsHandlingEx1.py
import string, sys
try :
    str1 = sys.argv[1]
    str2 = sys.argv[2]
    print str1 , str2
except IndexError :
    print "Usage:" , sys.argv[0] , "<STRING> <STRING>"
Before discussing how this example of exception handling works, we need to
establish some terminology. We will refer to the code that follows try and
precedes except as the try-block and we will refer to the code following except
as the except-block. The exception handling in (116) works as follows: First, the
try-block is executed (run). If no exception results from its execution, then the
except-block is skipped and execution continues with the code that follows it.
However, if an exception occurs in the try-block, the rest of the block is ignored
and, provided that the exception matches the one named after except, the
except-block is immediately executed. (More than one except-block may be
declared, since a try-block may potentially give rise to numerous different exception
types, but we will ignore this possibility for the present.) If the except-block does
not in turn raise an exception (a distinct possibility that should not be forgotten
but hopefully won’t occur), then the code following the except-block is run, and
the program continues on its merry way. If an exception occurs which does not
match the exception named in the except clause, it is passed on to outer try
statements; if no handler is found, it is an unhandled exception and execution stops
with a message as shown above.
This process is summarized as a flow diagram:
[Flow diagram: Error Handling. Exception raised in try-block? If not, ignore the
except-block. If so, does the exception raised match one of the except-clauses?
If so, execute the except-block; if not, halt the program.]
(117) exceptionsHandlingEx3.py
try :
    raise Exception('first', 'second')
except Exception, inst:
    print type(inst)     # the exception instance
    print inst.args      # arguments stored in .args
    print inst           # __str__ allows args to printed directly
    x, y = inst          # __getitem__ allows args to be unpacked directly
    print 'x =', x
    print 'y =', y
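The three paths through a try statement (no exception, a matching exception, a non-matching exception) can be traced by recording which blocks run. This is a sketch in Python 3 syntax; the function run() and the three small action functions are hypothetical.

```python
def run(action):
    """Return a list naming the blocks that were executed for action."""
    trace = []
    try:                                   # outer try statement
        try:
            trace.append("try")
            action()
            trace.append("try finished")
        except IndexError:
            trace.append("matching except")
        trace.append("after try statement")
    except Exception as err:               # receives whatever was not matched
        trace.append("outer handler: " + type(err).__name__)
    return trace

def no_error():
    pass

def index_error():
    [][0]          # raises IndexError

def key_error():
    {}["x"]        # raises KeyError, which the inner except does not name

trace_ok = run(no_error)
trace_match = run(index_error)
trace_no_match = run(key_error)
```

In the last case the rest of the inner try statement is abandoned altogether, and control passes straight to the outer handler.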
It is also possible to raise an error explicitly using the keyword raise, as
shown in (118).
(118) exceptionsRaiseErrorEx1.py
freq = { "a" : 0,
         "the" : 0,
         "I" : 0 }

def lookup(d, w) :
    if d.has_key(w) :
        return d[w]
    else :
        raise Exception, "Word not found in frequency table!"

lookup(freq, "you")
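An explicitly raised exception is caught in the same way as one raised by the interpreter. The following sketch uses Python 3 syntax, in which the message is passed as an argument to the exception class; the variable outcome is hypothetical.

```python
freq = {"a": 0, "the": 0, "I": 0}

def lookup(d, w):
    if w in d:
        return d[w]
    raise Exception("Word not found in frequency table!")

try:
    lookup(freq, "you")
    outcome = "found"
except Exception as err:
    outcome = str(err)       # the message given when the exception was raised

known = lookup(freq, "the")  # no exception for a word that is in the table
```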
11.2 Exception Types
In order to understand fully the various types of exceptions and how they relate
to one another, some acquaintance with object-oriented programming is required,
since exceptions are objects and are therefore associated with a particular class.
Without going into too much detail (see §15.3.3 and §16.5 for more information
about how inheritance works), exceptions are organized hierarchically, such that
every error type in the hierarchy is a subtype of the one above it, as shown diagra-
matically in Figure 11.1.
Exception
    SystemExit
    StandardError
        SyntaxError
        LookupError
            IndexError
            KeyError
Figure 11.1 Class Hierarchy for Exception Class
In Figure 11.1 we see that an IndexError is a type of LookupError which is in
turn a type of StandardError. For full documentation of all error types, you will
need to consult the appropriate Python language documentation.¹
In the following sections, we will look at three different types of errors: syntax
errors in §11.2.1, standard errors in §11.2.2, and custom errors in §11.2.3.
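One practical consequence of the hierarchy is that an except clause naming a class also catches all of its subtypes. This is a sketch in Python 3 syntax (in Python 3 the class StandardError no longer exists, but LookupError, IndexError, and KeyError are related as in Figure 11.1); the function safe_get() is hypothetical.

```python
def safe_get(container, key):
    """Return container[key], or the name of the lookup error raised."""
    try:
        return container[key]
    except LookupError as err:       # catches IndexError and KeyError alike
        return type(err).__name__

from_list = safe_get(["a", "b"], 5)  # index out of range
from_dict = safe_get({"a": 1}, "z")  # key not present
present = safe_get({"a": 1}, "a")
```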
11.2.1 Syntax Errors

At some point or another the reader has surely come across the dreaded
message “Syntax Error”. Probably the most common type of error made by beginning
Python programmers is the failure to indent properly. (Even experienced
programmers are apt to make this mistake, since they will no doubt be coming from
languages where indentation is irrelevant, as long as the bracketing is correct.)
This can be illustrated by the code in (119), which shows an if-then block where
the consequent (i.e., the then-clause) is on a separate line but has not been properly
indented.
¹ Available on the WWW: http://docs.python.org/lib/module-exceptions.html.
(119) exceptionsIndentationErrorEx1.py
n = 1
if n == 1 :
print n
When (119) is run, it will give rise to a SyntaxError, or more specifically, to an
IndentationError. Unlike other types of errors, it does not usually make a great
deal of sense to catch a syntax error, since it is the result of programmer error
rather than user error.
11.2.2 Standard Errors
Even when the syntax of a statement or expression is correct, it is still possible
for an exception to occur under certain circumstances. For example, in a program
that takes user input, an exception might be raised when the user provides input
of the wrong type. If a word is provided as user input when a number was expected,
an error can result, as illustrated in (120), where two command-line arguments are
converted to integers and the first is divided by the second.
(120) exceptionsRealExceptionsEx1.py
import string, sys
try :
    n1 = sys.argv[1]
    n2 = sys.argv[2]
    x = string.atoi(n1)
    y = string.atoi(n2)
    print x / y
except IndexError :
    print "Usage:" , sys.argv[0] , "<NUM> <NUM>"
except ValueError :
    print "Arguments must be numbers."
except ZeroDivisionError :
    print "Can't divide by 0."
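Each of the three handlers in (120) can be triggered in turn by feeding the same logic different argument lists. This is a sketch in Python 3 syntax, where int() takes the place of string.atoi() and // performs integer division; the function divide_strings() is hypothetical.

```python
def divide_strings(args):
    """Divide the first numeric string by the second, reporting problems."""
    try:
        x = int(args[0])
        y = int(args[1])
        return str(x // y)
    except IndexError:
        return "usage"             # too few arguments
    except ValueError:
        return "not numbers"       # an argument is not a number
    except ZeroDivisionError:
        return "divide by zero"    # the second argument is 0

result_ok = divide_strings(["10", "2"])
result_missing = divide_strings(["10"])
result_word = divide_strings(["ten", "2"])
result_zero = divide_strings(["10", "0"])
```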
A simple example of an exception is the ZeroDivisionError. (In Python, errors
are a kind of exception, hence the name.) It
arises whenever there is an attempt to divide a number by zero, as in Figure 11.2.
Figure 11.2
This error message consists of three lines:
1. The first line simply states that the error is being traced back to its source.
Since an exception in one part of a program may lead to cascading errors,
it is necessary to provide the full traceback, starting with its ultimate cause
and finishing with its immediate cause.
2. The second line states...
3. The third line provides the exception type and its name (which often amounts
to a short description).
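The parts of such a message can be inspected from within a program using the standard traceback module. This is a sketch in Python 3 syntax; the variable names are assumptions, and the exact number of lines in between varies with the Python version.

```python
import traceback

try:
    1 / 0
except ZeroDivisionError:
    report = traceback.format_exc()   # the message as one string

report_lines = report.strip().split("\n")
first_line = report_lines[0]     # announces the traceback
last_line = report_lines[-1]     # exception type followed by a description
```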
11.2.3 Custom Exceptions

It is very easy to define custom exceptions in Python; however, to understand how
custom exceptions work, the reader will need to know something about
object-oriented programming and inheritance. For that reason, this section should
probably be skipped and revisited once the reader has absorbed the material in Chapter
15 and Chapter ??, particularly §15.3.3 and §16.5 on inheritance.
The basics of custom exceptions are straightforward. Any custom class that
extends from the class Exception is a custom exception. The simplest possible
exception would be the one in (121).
(121) exceptionsCustomExceptionEx1.py
class CustomException(Exception) :
    pass
In order to obtain a custom error message, it is necessary for the custom error
class to override the __str__ method, as in (122), which creates a new type of
exception, CustomException, whose error message is simply “Error message for
CustomException”.
(122)
class CustomException(Exception) :
    def __str__(self) :
        return "Error message for CustomException"
raise CustomException()
To see how custom exceptions work, and how they can be useful when
writing your own programs, we will create a small program that parses a line from a
Shoebox file into its individual tiers. For now, we will take for granted the
ability to extract lines from a Shoebox file and simply concentrate on how a line is
processed once its contents are stored as a string.
(123)
from string import split

class MissingTierError(Exception) :
    pass

def parseLine(line) :
    tiers = {}
    requiredTiers = ['t', 'm', 'g', 'f']
    lines = line.split("\n")
    for l in lines :
        l = l.lstrip("\\")
        (marker, data) = split(l, sep=" ", maxsplit=1)
        tiers[marker] = data
    for t in requiredTiers :
        if not tiers.has_key(t) :
            s = "The tier \\%s is missing!" % t
            raise MissingTierError, s

# Nothing wrong with this line
shbx = '''\\t Waar is de fietsenstalling?
\\m waar is de fiets-en-stalling
\\g where is the bike-PL-shed
\\f Where is the bike shed?'''
parseLine(shbx)

# The tier \g is missing
shbx = '''\\t Waar is de fietsenstalling?
\\m waar is de fiets-en-stalling
\\f Where is the bike shed?'''
parseLine(shbx)
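The benefit of a custom exception shows when the caller catches it by name and leaves every other kind of error alone. This is a sketch in Python 3 syntax; parse_line() is a hypothetical variant of the function above that returns the dictionary of tiers.

```python
class MissingTierError(Exception):
    pass

def parse_line(line, required=("t", "m", "g", "f")):
    tiers = {}
    for l in line.split("\n"):
        marker, data = l.lstrip("\\").split(" ", 1)
        tiers[marker] = data
    for t in required:
        if t not in tiers:
            raise MissingTierError("The tier \\%s is missing!" % t)
    return tiers

complete = ("\\t Waar is de fietsenstalling?\n"
            "\\m waar is de fiets-en-stalling\n"
            "\\g where is the bike-PL-shed\n"
            "\\f Where is the bike shed?")
incomplete = ("\\t Waar is de fietsenstalling?\n"
              "\\m waar is de fiets-en-stalling\n"
              "\\f Where is the bike shed?")

tiers = parse_line(complete)
try:
    parse_line(incomplete)
    problem = None
except MissingTierError as err:     # only this specific condition is caught
    problem = str(err)
```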
11.3 Exceptions in Action

In Chapter ??, we saw how to rewrite the simple program that reads a
POS-annotated file and creates a word list from it. Because the first-pass version of
the program was meant to be maximally simple for expository purposes, it
included no error handling, even though there were various parts of the program
that could go wrong. Here we will add some error handling to the program to
show how it can be improved.
(124) exceptionsFirstProgram.py
import sys

def print_results(wordlist) :
    pairings = wordlist.keys()
    pairings.sort()
    for p in pairings :
        count = wordlist[p]
        parts = p.split("/")

        # Break token down into word and POS
        try :
            word = parts[0]
            tag = parts[1]
        except IndexError :
            sys.stderr.write("Token [%s] could not be parsed properly!\n" % p)
            continue    # skip tokens that lack a POS tag
        print "%s\t%s\t%i" % (word, tag, count)

def create_wordlist(fileContents) :
    words = fileContents.split(" ")
    wordlist = {}
    for w in words :
        if wordlist.has_key(w) :
            wordlist[w] = wordlist[w] + 1
        else :
            wordlist[w] = 1
    return wordlist

def get_file_contents() :
    # Get filename from commandline
    try :
        filename = sys.argv[1]
    except IndexError :
        print "Usage: %s <FILE>" % sys.argv[0]
        sys.exit(0)

    # Read file contents
    try :
        fo = open(filename, 'r')
        fc = fo.read()
        fo.close()
    except IOError :
        print "Could not open '%s'!" % filename
        sys.exit(0)
    return fc

fileContents = get_file_contents()
wordlist = create_wordlist(fileContents)
print_results(wordlist)
11.4 Exercises
• ???
• ???
• ???
Chapter 12
Input/Output
The topic of program input and output (usually referred to by the acronym IO)
concerns how Python interacts with the file system. In other words, it is the
issue of how files and directories are opened, read, written to, deleted, etc. The
way files are handled differs a fair bit across operating systems (Windows,
Macintosh, Linux, Unix, etc.), which means we will also have to cover some of the
tools available for handling differences between operating systems. In Python, the
majority of the work as far as input/output is concerned is handled by the
module os, which is documented in detail at
http://docs.python.org/lib/module-os.html.
12.1 Files
When dealing with files, there are two main issues: where files are located and
what they contain. File locations are discussed in §12.1.1 and reading and writing
file contents is discussed in §12.1.2.
12.1.1 File Paths

A file path describes where a file can be found on a computer. In Windows, for
example, a file path might be something like (8).
(8) C:\Documents and Settings\user\My Documents\dictionary.txt
(8) says that the text file dictionary.txt is in the folder My Documents,
which is in the folder user, which is in the folder Documents and Settings
on the C drive. Note that backslashes are used to separate folders in Windows file
paths. This is not true for all operating systems. Unix, for example, uses forward
slashes to separate folders (directories) from one another. A common problem with
file manipulation is that there are many differences between operating systems
(Windows, Macintosh, Unix, etc.) in this respect. Fortunately, Python provides
the tools to write portable code, that is, code that will continue to work even if it
is run on different operating systems.
Before we introduce ways of creating file paths that are portable (i.e., valid for
all operating systems), it is worth looking at a negative example which will show
the wrong way of creating a file path.
AF
(125) ioFilePathHardcoded.py
import os, sys
dir = sys.argv[1]
file = sys.argv[2]
print dir + "/" + file
In the previous example, we hard-coded the character used to separate
directories from one another in the file path. This is non-portable, since the
forward slash is not the separator that Windows uses in its file paths. The
superior alternative to hard-coding the separator is to let the Python script
figure out which operating system it is being run on and automatically select
the correct option. This can be done by building up the file path using the
function os.path.join(), which takes the various parts of a file path and joins
them into a single file path using the separator appropriate for the operating
system being used.
(126) ioFilePathJoin.py
import os, sys
dir = sys.argv[1]
file = sys.argv[2]
print os.path.join(dir, file)
12.1.2 Reading From and Writing To Files
One of the most common operations performed on files is reading them. When a
file is opened, it is necessary to specify how it should be opened, that is,
for what purpose. The command open() is used to open files. Its first argument
is the path to the to-be-opened file. The second argument of the open command
is a short string which determines how the file will be opened. This second
argument is known as a read mode. There are three mutually exclusive basic read
modes, listed in Table 12.1.2.

Read Mode   Character   Description
Read        r           Open file only for reading
Write       w           Open file for writing, preexisting contents erased
Append      a           Open file for writing, preexisting contents retained
In addition to the basic read modes, there are additional flags that control the
way in which files are handled by Python. One of the most useful of these is the
flag 'U' (short for 'Universal'), which ensures that line endings are read
consistently, despite the differing conventions across operating systems. Unix
uses the linefeed character (LF), whereas older Macintosh systems use the
carriage return character (CR), and Windows uses both (CR LF). With the 'U'
flag, all of these appear to the program as "\n". The 'U' flag can only be used
in read mode, i.e., 'rU' is legitimate, whereas 'wU' and 'aU' are not.
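The difference between the three basic read modes is easiest to see by trying them on a scratch file. The following is a minimal sketch rather than part of the original examples: the function name demo_modes and the file name modes_demo.txt are hypothetical, and the file is created in a temporary directory so that nothing of value is touched. It calls print() with a single argument so that it runs under both Python 2 and Python 3.

```python
import os
import tempfile

def demo_modes():
    """Write, append to, and read back a scratch file; return its lines."""
    scratch = tempfile.mkdtemp()
    path = os.path.join(scratch, "modes_demo.txt")  # hypothetical file name

    fo = open(path, "w")    # 'w' creates the file (or erases an existing one)
    fo.write("first line\n")
    fo.close()

    fo = open(path, "w")    # reopening with 'w' discards "first line"
    fo.write("replacement\n")
    fo.close()

    fo = open(path, "a")    # 'a' keeps what is there and adds to the end
    fo.write("appended\n")
    fo.close()

    fo = open(path, "r")    # 'r' only allows reading
    lines = fo.readlines()
    fo.close()

    os.remove(path)         # clean up the scratch file and directory
    os.rmdir(scratch)
    return lines

print(demo_modes())
```

Only the lines written after the second 'w' survive, which is why 'a' is the mode to use when existing contents must be kept.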
12.1.2.1 Read
Being able to read the contents of files is absolutely critical for many
programming tasks. Here we will look at how this is done. We will first look at
how a file can be read line by line and then at how the entire contents of a
file can be read at once.
Reading Files Line By Line One way to read a file is line by line. There are a
few different ways to do this. One way is to use the built-in function open() to
create a file object and to then call the readlines() method of the file object,
which returns a list of the lines in the file. In (127), we obtain a file path
from the command-line, open the file, save its contents as a list of lines, and
then close the file.
(127) ioReadFileLineByLineEx1.py
import sys                   # Import required module
filename = sys.argv[1]       # Get filename from command-line
fo = open(filename, 'rU')    # Open file for reading
lines = fo.readlines()       # Read lines into list
fo.close()                   # Close file
One advantage of reading a file line by line is that at any point the reading
of the file can be terminated and the file closed. This is particularly useful
when dealing with extremely long files, where reading the entire file is
unnecessary, too time-consuming, or requires more memory than is available to
the system. In (128), a file is read until a blank line is found, at which point
file reading terminates and the file is closed. Note that (128) uses
readline(), which reads one line at a time, rather than readlines(), which
would read the whole file into memory before the loop began. A blank line still
contains its line break, so it is detected by stripping whitespace first.
import sys                      # Import required module
filename = sys.argv[1]          # Get filename from command-line
fo = open(filename, 'rU')       # Open file for reading
l = fo.readline()               # Read first line
while l != '' :                 # An empty string signals end of file
    if l.strip() == '' : break  # Stop reading if line is blank
    print l                     # Print line
    l = fo.readline()           # Read next line
fo.close()                      # Close file
There is another way of reading a file line by line, which takes advantage of
the fact that a file object can be iterated over like a list, as illustrated in
(129). This is probably the best way of reading a file line by line, since it is
simple and requires fewer lines of code than the alternatives.
import sys                    # import required module
filename = sys.argv[1]        # get filename from command-line
fo = open(filename, 'r')      # open file for reading
n = 0                         # counter for line number
for l in fo :                 # loop over lines
    n += 1                    # increment counter
    print "%i:%s" % (n, l)    # print line with line number
fo.close()                    # close file
Reading Files All at Once (Slurping) In many cases it is necessary to read the
entire contents of a file before they can be processed. This is particularly the
case when patterns that span multiple lines need to be extracted from the file.
This can be done by using the method read() on a file object, as shown in (130),
which takes a filename from the command-line and slurps its contents into a
string.
(130) ioSlurpingEx1.py
import sys                  # Import required module
filename = sys.argv[1]      # Get filename from command-line
fo = open(filename, 'r')    # Open file for reading
contents = fo.read()        # Read file contents as string
fo.close()                  # Close file
Because the entire file is represented as a single string, patterns that span
multiple lines can be detected and manipulated, as shown in (131), which takes a
file with hard line breaks (that is, a file where a line break has been typed
at the end of every line) and removes the line breaks. Double line breaks, which
separate paragraphs, are reduced to single line breaks. The overall effect is to
take a file with hard line breaks and turn it into one where each paragraph is a
single line in the file. This would be much harder to accomplish if the file had
been read line by line. Note that replace() returns a new string rather than
changing the old one, so its result must be assigned back to the variable.
import sys                                            # Import required module
filename = sys.argv[1]                                # Get filename from command-line
fo = open(filename, 'r')                              # Open file for reading
contents = fo.read()                                  # Read file contents as string
contents = contents.replace("\n\n", "<Placeholder>")  # Use placeholder for double break
contents = contents.replace("\n", " ")                # Replace hard line breaks with spaces
contents = contents.replace("<Placeholder>", "\n")    # Turn placeholder into line break
fo.close()                                            # Close file
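The placeholder technique can be tried out without a file by applying it to a short string. The sketch below is illustrative only: the function name unwrap_paragraphs and the sample text are our own, and hard line breaks are replaced by a space so that the words on either side of a break are not run together. Because strings are immutable, each call to replace() returns a new string that has to be assigned.

```python
def unwrap_paragraphs(text):
    """Join hard-wrapped lines, leaving one line per paragraph."""
    marker = "<Placeholder>"
    text = text.replace("\n\n", marker)   # protect paragraph boundaries
    text = text.replace("\n", " ")        # remove hard line breaks
    return text.replace(marker, "\n")     # restore paragraph boundaries

sample = "This is a\nwrapped paragraph.\n\nSecond\nparagraph."
print(unwrap_paragraphs(sample))
```

The technique assumes that the placeholder string does not already occur in the text being processed.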
Even if the entire contents of a file have been saved as a string using read(),
it is still possible to process the contents line by line. This can be done by
splitting up the string at line breaks using the string method split(), as shown
in (132). One advantage of using this method is that line breaks are
automatically removed in the process.
import sys                        # import required module
filename = sys.argv[1]            # get filename from command-line
fo = open(filename, 'rU')         # open file for reading
contents = fo.read()              # read file contents as string
lines = contents.split("\n")      # split string by newlines
fo.close()                        # close file
for l in lines : print l          # print each line
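One detail of split("\n") is worth checking on a small string: when the text ends with a line break, the resulting list has an empty string as its last item. The string method splitlines() does not. The short sketch below uses a made-up two-line string to show the difference.

```python
contents = "first\nsecond\n"       # text ending in a line break

by_split = contents.split("\n")    # keeps an empty string after the final break
by_lines = contents.splitlines()   # drops it

print(by_split)
print(by_lines)
```

Which of the two is preferable depends on whether a trailing empty line matters to the processing that follows.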
12.1.2.2 Write
???
12.1.2.3 Append
???
12.1.3 Creating Files

Files are created using the function open(), which takes two arguments: a
file path and a read mode. Since we are creating a new file, the appropriate
read mode is 'w' (short for 'write'). (133) creates a file using a file path
supplied on the command-line.
(133) ioCreateFileEx1.py
import sys

# Get filename from command-line
filename = sys.argv[1]

# Open file for writing
fo = open(filename, 'w')
fo.close()
There are a few problems with (133), the most glaring of which is that if
there is a pre-existing file in the same location with the same name, it will be
overwritten, and its contents lost. To check for a pre-existing file before
creating a new file, we can XXX, as shown in (134).
import sys

# Get filename from command-line
filename = sys.argv[1]

# Check for pre-existing file
# if XXX :
#     print "The file '" + filename + "' already exists!"
# else :
#     fo = open(filename, 'w')    # Open file for writing
#     fo.close()                  # Close
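The check itself is left open (XXX) in the text. One common way to do it, which we assume here rather than attribute to the original example, is the standard library function os.path.exists(). In the sketch below, the function name create_if_absent and the file name dictionary.txt are hypothetical, and the file is created in a temporary directory.

```python
import os
import tempfile

def create_if_absent(path):
    """Create an empty file at path unless something already exists there.

    Returns True if the file was created and False otherwise.
    """
    if os.path.exists(path):
        return False
    fo = open(path, "w")
    fo.close()
    return True

# Try it twice on the same (hypothetical) file in a scratch directory.
target = os.path.join(tempfile.mkdtemp(), "dictionary.txt")
print(create_if_absent(target))   # the first call creates the file
print(create_if_absent(target))   # the second call leaves it alone
```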
An alternate strategy for dealing with a pre-existing file is to make a copy of
it with a different name before creating a new one, as shown in (135).
import sys

# Get filename from command-line
filename = sys.argv[1]

# Open file for writing
fo = open(filename, 'w')
fo.close()
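The copying step can be sketched with the standard library function shutil.copy(). This is an assumption on our part about how the backup might be made, not the original example. The function name create_with_backup, the .bak suffix, and the file name wordlist.txt are all hypothetical.

```python
import os
import shutil
import tempfile

def create_with_backup(path):
    """Copy any existing file at path to path + '.bak', then create path anew."""
    if os.path.exists(path):
        shutil.copy(path, path + ".bak")
    fo = open(path, "w")    # 'w' erases the original, which is now backed up
    fo.close()

# Set up a pre-existing file in a scratch directory.
scratch = tempfile.mkdtemp()
target = os.path.join(scratch, "wordlist.txt")
fo = open(target, "w")
fo.write("old contents\n")
fo.close()

create_with_backup(target)
print(sorted(os.listdir(scratch)))
```

Note that an older backup with the same name would itself be overwritten. A more careful script would check for that as well.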
12.1.4 Deleting Files
A word of caution is in order. Before we show you how to delete files using
Python, we should remind you that we are essentially handing you a loaded
weapon. Computer programs that delete files are dangerous. It is always a good
idea to back up any files or directories before you begin experimenting with
this aspect of Python's IO capabilities.
Files can be easily deleted using the remove function of the os module.
(136) ioDeleteFileEx1.py
import os, sys

# Delete file or print error message
try :
    file = sys.argv[1]    # Get a filepath from the command-line
    print "Permanently remove file '%s' [y/n]?" % (file)
    response = 'n'        # TODO: Prompt user for yes-no response
    if response.lower() == 'y':
        os.remove(file)
except IndexError :
    print "Usage: %s <FILE>" % (sys.argv[0])
except OSError, e :
    print "Error: Could not delete file '%s'!" % (file)
    print "Details: " + str(e)
Using the remove function, it is easy to automate the clean-up of directories.
(137) provides a script that will delete all of the backup files in a given
directory.
(137) ioDeleteFileEx2.py
import os, sys

# Delete all files in dir ending with .bak or ~
try :
    dir = sys.argv[1]    # Get directory to clean up
    tempFiles = XXX
    for f in tempFiles :
        os.remove(f)
except IndexError :
    print "Usage: %s <DIR>" % (sys.argv[0])
except OSError :
    print "Error: Could not delete file '%s'!" % (f)
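How the list of backup files is obtained is left open (XXX) in the listing. One possibility, offered here only as a sketch, is to filter the output of os.listdir() by file name ending. The function name find_backup_files and the file names used to try it out are hypothetical, and everything happens in a temporary directory so that nothing of value is at risk.

```python
import os
import tempfile

def find_backup_files(dir_path):
    """Return full paths of plain files in dir_path ending in .bak or ~."""
    backups = []
    for name in sorted(os.listdir(dir_path)):
        path = os.path.join(dir_path, name)
        is_backup = name.endswith(".bak") or name.endswith("~")
        if os.path.isfile(path) and is_backup:
            backups.append(path)
    return backups

# Create a scratch directory with one ordinary file and two backup files.
scratch = tempfile.mkdtemp()
for name in ("notes.txt", "notes.txt~", "corpus.bak"):
    open(os.path.join(scratch, name), "w").close()

for path in find_backup_files(scratch):
    os.remove(path)

print(os.listdir(scratch))
```

Joining each name with the directory path matters, because os.listdir() returns bare file names and os.remove() would otherwise look for them in the current working directory.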
12.1.5 File Types
Although plain text files are the bread and butter of language research, it is
important to recognize that there are other file types in existence, some of
which need to be dealt with differently. The two fundamental file types are
text files and binary files.
TODO
12.2 Directories
At a higher level of organization, files are organized into directories
(sometimes called "folders"). Python provides the means for manipulating
directories as well as files. In this section, we will cover how directories are
located, created, edited, and deleted using the os module.
12.2.1 Locating Directories

It is important to bear in mind that not all operating systems encode directory
paths in the same way. It is therefore a bad idea to build up directory paths
using a hard-coded separator (such as the slash used by Unix, e.g., /dev/null),
since the code will not be portable (i.e., it will not work if run in a Windows
environment). A better alternative is to use the function os.path.join(), as in
(138).
(138) osLocatingDirsEx1.py
import os

os.path.join()
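As a small illustration of how os.path.join() builds a directory path, consider the following sketch. The directory names data, corpora, and english are invented for the purpose, and the variable os.sep is used to show which separator was inserted.

```python
import os

# Hypothetical directory names, chosen only for illustration.
corpus_dir = os.path.join("data", "corpora", "english")
print(corpus_dir)

# os.sep holds the separator that os.path.join() inserts on this system,
# so splitting on it recovers the original parts.
print(corpus_dir.split(os.sep))
```

On Unix the first line printed is data/corpora/english, whereas on Windows the same code prints data\corpora\english.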
12.2.2 Reading Directories
Directories contain other directories (subdirectories) or files, and reading
directories consists of gaining access to these contents. In Python, directories
are read using the function listdir in the os module. To see how it works,
consider (139), a short script that takes a directory path as a command-line
argument and prints out a sorted list of the files and directories within it.
(139) ioReadDirEx1.py
import os, sys
try :
    dirPath = sys.argv[1]             # Get dir from command-line
    contents = os.listdir(dirPath)    # Get list of contents
    contents.sort()                   # Sort the contents

    # Print header
    print "-" * 30 , "-" * 30
    print "File Name".ljust(30) , "File Type".ljust(30)
    print "-" * 30 , "-" * 30

    # Print contents
    for c in contents :
        filepath = os.path.join(dirPath, c)
        if os.path.isdir(filepath) :
            print c.ljust(30) , "D"
        else :
            print c.ljust(30) , "F"
except IndexError :
    print "Usage: %s <DIR>" % (sys.argv[0])
except OSError :
    print "Error: Couldn't open or read dir '%s'!" % (dirPath)
except :
    print "Unknown Error"
12.2.3 Creating Directories

Creating directories in Python is straightforward thanks to the mkdir function
of the os module. (140) is a simple script that takes a directory path as a
command-line argument and creates the directory.
(140) ioCreateDirEx1.py
import os, sys
try :
    dir = sys.argv[1]    # Get a dir from command-line
    os.mkdir(dir)        # Create dir
except IndexError :
    print "Usage: %s <DIR>" % (sys.argv[0])
except OSError :
    print "Error: Could not create directory '%s'!" % (dir)
If the directory cannot be created (because the directory already exists,
insufficient privileges exist, etc.), an error message is printed out instead.
12.2.4 Editing Directories
Directories can be renamed using the function rename from the os module.
(Unlike many other functions that operate on directories, rename also works on
files.) (141) provides a short script that takes two arguments, an old directory
name and a new directory name, and changes the former to the latter.
(141) ioEditDirEx1.py
import os, sys
try :
    old = sys.argv[1]
    new = sys.argv[2]
    os.rename(old, new)
except IndexError :
    print "Usage: %s <OldName> <NewName>" % (sys.argv[0])
except OSError :
    print "Error: Couldn't rename dir '%s' as '%s'!" % (old, new)
12.2.5 Deleting Directories
Directories can be deleted using the rmdir function of the os module, provided
that they are empty. (142) provides a simple script that deletes a directory
after prompting the user to confirm the action.
(142) ioDeleteDirEx1.py
import os, sys

# Get a directory name from the command-line
dir = sys.argv[1]

# Delete directory or print error message
try :
    print "Are you sure you want to remove the directory '%s' [y/n]?" % (dir)
    response = 'n'    # TODO: Prompt user for yes-no response
    if response.lower() == 'y':
        os.rmdir(dir)
except OSError :
    print "Error: Could not delete directory '%s'!" % (dir)
12.2.6 Going Through Multiple Directories

Sometimes it is necessary to process more than one directory at a time. One
particularly common operation is to open a directory recursively, which means to
open that directory, all of the directories within it, and all of the
directories within the directories within it, etc. This is sometimes known as
traversing or walking a directory structure, and it can be easily accomplished
using the function os.path.walk, which takes three arguments:
1. the directory to traverse
2. the name of a function invoked for every directory found (the callback
   function)
3. an argument that is passed to every invocation of the callback function
(143) provides a simple example of how this works. It provides a script that
takes a directory as a command-line argument and calls os.path.walk on that
directory, using a custom function, called processFile, to process every
directory traversed. The function is very simple. It goes through the list of
files in the directory and joins the name of the directory with the name of the
file using os.path.join, which ensures that the code will work across platforms
(on Unix, Windows, etc.). The resulting path is then printed out, preceded by an
asterisk (the argument passed to os.path.walk in the first place).
(143) ioWalkEx1.py
import os
import sys

def processFile(arg, dir, files) :
    for f in files :
        path = os.path.join(dir, f)
        print "%s %s" % (arg, path)

try :
    dir = sys.argv[1]
    os.path.walk(dir, processFile, '*')
except IndexError :
    print "Usage: %s <DIR>" % sys.argv[0]
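Readers using a more recent interpreter should be aware that os.path.walk was removed in Python 3. The function os.walk, available since Python 2.3, covers the same ground without a callback function: it yields the path of each directory visited together with the names of its subdirectories and files. The sketch below is an alternative to the callback approach, not a transcription of it. The function name list_files and the scratch files a.txt and b.txt are hypothetical.

```python
import os
import tempfile

def list_files(top):
    """Return the paths of all files at or below top, found with os.walk()."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)

# Build a small scratch tree: top/a.txt and top/sub/b.txt.
top = tempfile.mkdtemp()
os.mkdir(os.path.join(top, "sub"))
open(os.path.join(top, "a.txt"), "w").close()
open(os.path.join(top, "sub", "b.txt"), "w").close()

for path in list_files(top):
    print("* " + path)
```

Unlike the callback in the example above, this version lists only files, not the subdirectories themselves.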
12.3 The Command Line
12.3.1 Arguments
Command-line arguments to a script are stored in a list accessible through the
module sys. The code in (144) prints out each of the arguments to the script on
a separate line. This list is never empty, since its first item is always the
name of the script being run. Therefore, if the script is run without any
arguments, a single line with the name of the script is printed out.
(144) ioCommandLineArgEx1.py
import sys

args = sys.argv
for a in args :
    print a
Note that the first item in the list is the file path as it was supplied to the
Python interpreter and therefore may be more than just the name of the script
itself. To obtain only the name of the script, it may be necessary to strip off
the path information preceding it. This can be done by XXX, as illustrated in
(145).
import sys

print "The name of this script is '" + sys.argv[0] + "'."

args = sys.argv[1:]

if args :
    i = 0
    for a in args :
        i = i + 1
        print i, a
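The text leaves open (XXX) how the path information is stripped off. One standard way, which we assume here, is the function os.path.basename(), which returns the last component of a path. In the sketch below the function name script_name and the example path are hypothetical stand-ins for a real value of sys.argv[0].

```python
import os

def script_name(argv0):
    """Strip any leading directories from a script path."""
    return os.path.basename(argv0)

# A hypothetical value of sys.argv[0], built with the local separator.
example = os.path.join("home", "user", "scripts", "wordcount.py")
print(script_name(example))
```

In a real script the call would simply be script_name(sys.argv[0]).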
Individual command-line arguments in the argv list are accessed in the normal
manner, by their index, as illustrated in (146).
import sys

firstArg = sys.argv[0]

print "The first item in sys.argv is '" + firstArg + "'."
12.3.2 Options
It is common for Python scripts to have options that change their behavior in
systematic ways. These command-line options (sometimes called flags) are like
those of UNIX commands. For example, the common UNIX command grep provides
a tool for searching files using regular expressions (see chapter 17). Its
syntax is grep <regular expression> <file(s)>, as illustrated below:
% grep 'the' file.txt
(147) ioOptionsEx1.py
import getopt, sys

def usage() :
    print "Usage: " + sys.argv[0] + " [OPTIONS] <FILE>"    # Usage info
    sys.exit(2)

def main() :
    optVerbose = False    # Variable for option value

    # -----------------------------------------------------
    # getopt() takes 3 arguments:
    # 1. argument list to be parsed
    # 2. string with short option letters
    # 3. list of long option names
    # -----------------------------------------------------
    try:
        opts, args = getopt.getopt(sys.argv[1:], "hv", ["help"])
    except getopt.GetoptError:
        usage()

    # Loop over a list of tuples, where o is an option's name
    # and a is its value
    for o, a in opts :
        if o == "-v" :                # Option -v
            optVerbose = True
        if o in ("-h", "--help") :    # Option -h or --help
            usage()
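To see what getopt.getopt() returns, it helps to call it on a hand-made argument list instead of sys.argv. The following sketch does this with the same option letters as the example above. The function name parse_options and the file name corpus.txt are hypothetical.

```python
import getopt

def parse_options(argv):
    """Parse argv (without the script name); return (verbose, other args)."""
    opts, args = getopt.getopt(argv, "hv", ["help"])
    verbose = False
    for o, a in opts:         # o is the option name, a its value (if any)
        if o == "-v":
            verbose = True
    return verbose, args

# Hypothetical command lines: "script.py -v corpus.txt" and "script.py corpus.txt"
print(parse_options(["-v", "corpus.txt"]))
print(parse_options(["corpus.txt"]))
```

Options are separated from the remaining arguments, which come back in the second item of the result. An option that was not declared, such as -x, causes getopt.GetoptError to be raised.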
12.4 IO in Action
12.4.1 Reading Corpus from File
In the previous chapters we have gradually refined our first program. In the
previous examples, the text analyzed in the program was entered directly into it
by hand, which is obviously unsatisfactory. Here we will show how to obtain text
from the contents of a file.
(148) ioFirstProgramFromFile.py
import sys 1
2
fo = open(sys.argv[1], ’rU’) 4
contents = fo.read() 5
fo.close() 6
return contents 7
8
return d 17
18
28
12.4.2 Reading Multiple Corpus Files

(149) ioFirstProgramFromDirectory.py
import sys 1
2
fo = open(sys.argv[1], ’rU’) 4
contents = fo.read() 5
fo.close() 6
return contents 7
8
return d 17
18
28
12.5 Exercises
• ???
• ???
• ???
Chapter 13
Modules and Packages
As programs get longer and more complicated, it becomes less convenient to keep
all of the code in a single file. It is useful in such cases to be able to break
up the code and store the various parts of it in separate files. This sort of
approach also improves the reusability of code, since commonly used code can be
shared between programs. Code that is stored in a file and shared between
programs is known as a module. A module is basically nothing more than a file
containing Python definitions and statements. In this chapter, we will learn the
ins-and-outs of Python modules.
13.1 Modules
13.1.1 Defining and Modules
The name of a module is its filename minus the suffix .py. Therefore, the code
for a module named LingUtils (short for "Linguistic Utilities") will be found in
a file named LingUtils.py, as in (150).
(150) LingUtils.py
def isEmpty(str) :
    if str == '' or str == ' ' :
        return 1
    else :
        return 0

def sort(str) :
    chars = []
    for s in str :
        chars.append(s)
    chars.sort()
    return chars
A script can take advantage of the functionality provided by a module by
importing it, as illustrated in (151), which uses the function isEmpty(str) from
LingUtils to test whether the lines in a given file are empty and to print them
out if they are not. Since each line read from a file still ends in a line
break, the line break is removed before the test is made.
(151) modulesEx1.py
import sys
import LingUtils

def usage() :
    print "Usage: XXX"

def main() :
    try :
        fn = sys.argv[1]
        fo = open(fn, 'r')
        for l in fo.readlines() :
            if not LingUtils.isEmpty(l.rstrip("\n")) :
                print l
        fo.close()
    except IndexError :
        usage()

main()
13.1.2 Importing Modules

There are three different ways to import the contents of a module. First, just
the module can be imported. Second, only specific variables or functions from
the module can be imported. Third, everything within the module can be directly
imported. The advantages and disadvantages of each way are discussed below.
13.1.2.1 Importing the Entire Module

The simplest way to import a module is to import it in its entirety. When an
entire module is imported, its contents are read and made available, but not
directly. Any subsequent calls to variables or functions within the module must
be qualified by the name of the module. There are only two things that can go
wrong with this type of import statement.
First, the module may not exist, as in (152), which calls on a non-existent
module.
(152) modulesImportErrorEx1.py
import LinguisticUtilities

print "This won't work!"
When (152) is run through the Python interpreter, an ImportError will be
raised, and an error message will be printed out.
Second, the module may have internal errors. If there are problems with the
contents of a module, an error will be raised, as can be seen from (153), which
imports a module that has a syntax error in it.
(153) modulesImportErrorEx2.py
import LinguisticUtilities

print "This won't work!"
13.1.2.2 Importing the Entire Namespace of a Module with Star

It is also possible to import all of the variables and functions from a module
using star notation. The main disadvantage of this approach is the possibility
of namespace conflicts. A namespace conflict arises when an imported name is the
same as a name that is already in use, so that one definition silently replaces
the other. Since the full contents of a module are typically unknown to the
user, it is dangerous to import its entire namespace, since it may conflict with
names already defined in the script.
To see all of the names in a module's namespace, the function XXX can be
called on the module, as in (154).
(154) modulesImportEx1.py
# TODO
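The function is not named in the text (XXX). On the assumption that the built-in function dir() is what is intended, the following sketch lists the names in the namespace of the standard module string and separates out those beginning with an underscore. The choice of the string module is ours and is made only because it is available everywhere.

```python
import string

# dir() returns a sorted list of the names defined in a module's namespace.
names = dir(string)
print("digits" in names)

# Names starting with an underscore are the ones a star import skips
# (when the module does not define __all__).
public = [n for n in names if not n.startswith("_")]
print(len(public) < len(names))
```

Printing the whole list (print(names)) is a quick way to check for possible conflicts before using the star notation.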
The names of some variables and functions are not automatically imported
with the star notation. Any variables or functions within a module whose names
begin with an underscore are not automatically imported. They can, however, be
imported by name, as described in §13.1.2.3.
13.1.2.3 Importing Specific Variables or Functions from a Module

There is an alternative to importing the entire namespace of a module using the
star notation. Thanks to another variant on the import statement, it is possible
to import from a module specific variables or functions by name, as in (155),
where the function isEmpty() is directly called without prepending the name of
the module (e.g., LingUtils.isEmpty()).
import sys
from LingUtils import isEmpty

def usage() :
    print ""

def main() :
    try :
        fn = sys.argv[1]
        fo = open(fn, 'r')
        for l in fo.readlines() :
            if not isEmpty(l.rstrip("\n")) :
                print l
        fo.close()
    except IndexError :
        usage()

main()
By selectively importing names, it is easier to avoid conflicts of the sort
exemplified in (156), where the function sort() is imported from the module
LingUtils and run on a string. In keeping with the definition of the function in
the module, the characters in the string are printed out in ascending order.
However, the function sort() is then redefined within the script to sort
characters in descending order and run on the same string.
from LingUtils import sort

s = 'antidisestablishmentarianism'
print sort(s)

def sort(str) :
    chars = []
    for s in str :
        chars.append(s)
    chars.sort()
    chars.reverse()
    return chars

print sort(s)
13.1.3 Finding Modules

???
The Search Path The search path determines where Python looks for the files
that define modules. It is accessible to programmers as the list sys.path in the
module sys. The simple script in (157) prints out a list of the directories
found within the Python interpreter's search path.
(157) modulesSearchPath.py
import sys

print "These directories are in the search path:"
for d in sys.path :
    print d
When looking for a module, Python searches these directories in order until a
matching file is found, at which point it is imported. If no matching file is
found, an ImportError is raised. The results of running (157) depend upon your
system configuration. The sys.path variable is initialized from the directory
containing the script being run, the value of the environment variable
PYTHONPATH, and installation-dependent defaults.
If you don't know what environment variables are or how to set them, you
will need to review this aspect of your operating system. To see the value of
this environment variable in Unix, the following command should do the trick:
echo $PYTHONPATH
For example, the directory /usr/local/modules/ can be added to the search path
in Bash using the first command below and in Tcsh using the second.
export PYTHONPATH=$PYTHONPATH:/usr/local/modules/
setenv PYTHONPATH ${PYTHONPATH}:/usr/local/modules/
Where to Put Custom Modules After you create custom modules, the question
that immediately arises is where you should keep the files containing them. There
are basically three options. The advantages and disadvantages of each will be
discussed in turn:
Keep the modules in a directory normally in Python's search path. Although
this option appears the simplest, it is actually the worst of the three. Why?
Because the default search path for Python consists of directories full of
modules from the standard library, and if you install a new version of Python,
these will be disturbed. Either your modules will be ignored by the new
installation of Python or, worse yet, they will be deleted.
Keep the modules in the same directory as the script that uses them. This op-
tion is also simple but its main drawback is that it will not work for modules
that are likely to be used by numerous scripts, which will potentially be
saved in different locations.
Keep the modules in a new directory set aside for custom modules. This option
is the most complicated of the three, since it requires creating a new direc-
tory and reconfiguring the Python interpreter to include it in its search path
(see §13.1.3). However, the additional work involved in this option is well
repaid, since it creates a stable location for custom modules to be stored and
shared across scripts, even when they exist in different locations.
13.1.4 Standard Modules

The Python code base is broken down into modules to make it more manageable.
These modules collectively constitute the Python standard library. For detailed
documentation of the Python standard library, see the appropriate version of the
Python Library Reference. The current version can be found on the WWW at
http://www.python.org/doc/current/lib/lib.html.
13.2 Packages
Large projects involve large amounts of code, and this code may need even more
structure than modules can provide. This is where packages enter the picture.
Packages in Python are basically collections of modules. They provide a means
of organizing modules that have a more complicated interrelationship with one
another, since a package can contain other packages (subpackages) as well as
modules.
To understand why it might be useful to organize modules into packages,
imagine that we are creating modules for the manipulation of Shoebox files. A
comprehensive set of tools for dealing with Shoebox would have to include a
number of things.
(158)
Shoebox/                     Top-level package
    __init__.py
    Parsers/                 Subpackage for parsers
        __init__.py
        MorphemeParser.py
        WordParser.py
        LineParser.py
        TextParser.py
        ...
    Objects/                 Subpackage for object model
        __init__.py
        Morpheme.py
        Word.py
        Line.py
        Text.py
    Utilities/               Subpackage for misc tools
        __init__.py
        Configuration.py
        Formatter.py
The overall organization of the Shoebox package in (158) is that there is a main
package, called Shoebox, which contains a number of subpackages, each of which
contains numerous modules. The Shoebox package is broken down into various
subpackages for convenience. There is an object model, which models the entities
involved in a Shoebox text, and parsers to shoehorn texts into this object
model, as well as various utilities for accomplishing odd jobs (such as
reformatting of text in particular ways). Note that each directory requires a
file __init__.py. Without this file, the directory will not be recognized as a
package by Python.
To see how the Shoebox package would be used in a script, we can turn to
(159), which demonstrates the use of the Shoebox package to parse a Shoebox
text and print out the original text on a line-by-line basis with numbering.
(159) modulesPackagesEx1.py
import sys
from Shoebox.Parsers.TextParser import TextParser
from Shoebox.Utilities.Formatter import removeWhitespace

try :
    fn = sys.argv[1]
    tp = TextParser(fn)
    t = tp.parse()
    i = 0
    for l in t.getLines() :
        i = i + 1
        original = l.getOriginalText()
        formattedOriginal = removeWhitespace(original)
        print "%i. %s" % (i, formattedOriginal)
except IndexError :
    print "Usage: %s <SHOEBOX FILE>" % (sys.argv[0])
There are a number of things to note about (159). The most important thing to
understand is that files within a package do not automatically have access to other
files in the same package. Therefore, if we look at the file TextParser.py, provided
in (??), we see that the LineParser module is explicitly imported since it is not
visible by default.
13.3 Exercises
• ???
• ???
• ???
Chapter 14
Strings In Depth
Probably the most important data type for language research is the string, since
strings are what Python uses to store text. The term string derives from its
definition in various branches of mathematics (such as formal language theory),
where a string is an ordered sequence of symbols that are drawn from a
pre-determined set. In this chapter, we will cover many important aspects of
strings and string processing in Python.
14.1 Plain String Types

There are three different ways of initializing strings in Python: with single
quotes, double quotes, or triple quotes. The behavior of these three string
types is compared in Table 14.1 according to two features: whether quotes need
to be escaped when they occur within a string and whether newlines must be
encoded as \n.

String Type      Quotes                  Newline
Single Quotes    Escape single quotes    Escaped
Double Quotes    Escape double quotes    Escaped
Triple Quotes    No escaping required    Direct
Table 14.1 String Types Compared
Single-quoted strings are best used for literal strings. If a single quote appears
within a single-quoted string, it must be preceded by a backslash, as shown in
(160). The backslash is known as an escape character in this context because it
provides an “escape” from a character’s normal interpretation. The escape character
is necessary for a single quote within a single-quoted string because otherwise
the interpreter would treat the single quote as the end of the string.
(160) stringsSingleQuoted1.py
print 'I\'ve put in five different req\'s for Minimal Protein Ration for those kids and they don\'t co
print '"Lousy mixture, barbiturates and Dexedrine. What were you trying to do to yourself?"'
Double quotes within a double-quoted string must be escaped using a backslash;
single quotes require no escaping, as shown in (161).
(161) stringsDoubleQuoted1.py
s = "John said, \"Bill said, 'Your kung-fu is weak'.\""
print s
To improve readability, it is sometimes useful to “word-wrap” a string—that

is, to split it across multiple lines. This can be done with double-quoted strings by
putting a backslash wherever the string wraps, as illustrated in (162).
(162) stringsDoubleQuoted2.py
hello = "This string \
will print \
as a single line."
print hello
For strings that contain many carriage returns or quotes, triple-quoted strings
are a good option, since neither single nor double quotes need to be escaped and
carriage returns are preserved as newlines, as illustrated in (163).
(163) stringsTripleQuoted1.py
s = '''"You could buy a cat," Barbour offered.
"Cats are cheap; look in your Sidney's catalogue."'''
print s
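The comparison in Table 14.1 can be checked directly. The following is a minimal sketch that writes the same two-line text with each of the three quoting styles and confirms that the resulting strings are identical. The example sentence and the variable names are invented for illustration, and print is called with parentheses so that the snippet also runs under Python 3 (the listings in this chapter use the Python 2 print statement).

```python
# The same two-line text written with each quoting style.
single = 'He said, "It\'s late."\nThen he left.'
double = "He said, \"It's late.\"\nThen he left."
triple = '''He said, "It's late."
Then he left.'''

# All three literals denote exactly the same string.
same = (single == double == triple)
print(same)
print(triple)
```

Only the way the literal is written differs; once the string has been created, Python does not remember which quoting style was used.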
14.2 Unicode Strings
This might be a separate chapter!!!
???
14.3 String Literals

???
14.4 Carving Strings at the Joints

Python offers a number of different ways of manipulating strings for sundry purposes.
Using string indices, it is relatively simple to extract substrings, as described
in §14.4.1. The string class also provides a number of built-in methods for
manipulating strings, which are described in §14.4.2.
14.4.1 String Indices
Python provides a very convenient way of manipulating strings, which is the use
of string indices. Each position in a string is assigned an index, as illustrated in
Table 14.2. String indexing works by counting upwards from 0 from left to right,
and downwards from -1 from right to left.
Table 14.2 String Indices

Character                    w    o    r    d
Index (from the left)        0    1    2    3
Index (from the right)      -4   -3   -2   -1
Using string indices, any character from this string can be extracted, as shown
in (164).
(164) stringsSliceNotation1.py
string = 'word'     # initialize string
print string        # word
print string[0]     # w
print string[1]     # o
print string[2]     # r
print string[3]     # d
print string[-1]    # d
print string[-2]    # r
print string[-3]    # o
print string[-4]    # w
Using string indices and slice notation (two indices separated by a colon), it
is possible to extract substrings quickly and easily. For example, (165) defines a
string with four letters (word) and then prints the following using slice notation:
the whole string; the first one, two, and three characters; and the last one, two,
and three characters.
(165) stringsSliceNotation2.py
string = 'word'      # initialize string
print string         # word
print string[0:1]    # w
print string[0:2]    # wo
print string[0:3]    # wor
print string[-1:]    # d
print string[-2:]    # rd
print string[-3:]    # ord
When we are referring to only a single character, the slice notation

Because slice notation on strings resembles similar techniques for lists (see
§XXX), it should be reiterated that strings (unlike lists–see §??) are immutable,
meaning that they cannot be modified in place. It is therefore not possible to
replace part of a string using indices, as can be seen from running (166).
(166) stringsSliceNotationEx3.py
s = 'spaceship'      # strings are immutable
s[0:5] = 'rocket'    # an attempt at replacement
When (166) is run, an error message such as the following results:

(167)
Traceback (most recent call last):
  File "../code/python/stringsSliceNotationEx3.py", line 2, in ?
    s[0:5] = 'rocket' # an attempt at replacement
TypeError: object doesn't support slice assignment
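Because a string cannot be changed in place, the usual way to get the effect of a replacement is to build a new string out of slices of the old one. The following is a minimal sketch of that idea, reusing the string from (166); the variable name t is chosen here for illustration, and print is called with parentheses so that the snippet also runs under Python 3.

```python
s = 'spaceship'

# Build a new string: the replacement text followed by everything
# in the original from index 5 onward.
t = 'rocket' + s[5:]

print(s)    # the original string is untouched
print(t)
```

The original string is left exactly as it was; the name t simply refers to a newly created string.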
14.4.2 String Methods

The string class provides a large number of methods that greatly simplify string
processing. These will be broken down into three main categories: methods for
obtaining information about the contents of a string (§14.4.2.1); methods for mod-
ifying strings (§14.4.2.2); and methods for formatting strings (§14.4.2.3).
14.4.2.1 String Methods for Querying Strings
In addition to manipulating strings and formatting them, it is often necessary to

query strings for information. In this section, we overview the various string meth-
ods that can be used for this purpose.
partition ???
len(str) The function len() is not a string method, but it nevertheless requires
explanation, since it provides the means of querying a string for a very basic
property: its length, which equals the number of characters in it.
For example, if we have a word list for a corpus, we might want to look at the
relationship between word length and word frequency. In (??), we illustrate how to
read in a file, break it into words, create a unique list of words, and print out a list
of each word’s length and frequency.
(168) stringsLengthEx1.py
??? 1
If we run this script on a large corpus of Tok Pisin texts and plot the relation-
ship between word length and frequency, we obtain the graph in ???.
???
count(sub[, start[, end]]) The string method count() finds the number of
(non-overlapping) times that a substring occurs within a string.
(169) stringsCountEx1.py
s = 'antidisestablishmentarianism'
print s.count('a')     # 4
print s.count('is')    # 3
find(sub[, start[, end]]) The method find provides a means of finding where a
particular substring occurs within a larger string. It returns the lowest index at
which the searched-for substring begins; in other words, it returns the index of the
start of the first (leftmost) occurrence. If the searched-for substring is not found,
find returns -1. The companion method rfind works in the same way, but returns
the index of the start of the last (rightmost) occurrence.
(170) stringsFindEx1.py
s = "The ships were always accompanied by an automated probe \
that followed a couple of million miles behind. \
We knew about the portal planets, little bits of flotsam \
that whirled around the collapsars; the purpose of the drone \
was to come back and tell us in the event that \
a ship had smacked into a portal planet at .999 of the speed of light."
print s.find("ship")
print s.rfind("ship")
The search domain of find can be restricted by specifying the optional sec-
ond and third arguments, as in (171).
todo 1
Therefore, (171) is equivalent to (172).

todo 1
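Restricting the search domain of find can be sketched as follows. This is an illustrative example, not the authors' listing: the sample sentence and the variable names are invented, and print is called with parentheses so that the snippet also runs under Python 3. The second argument gives the index at which the search starts, and the third the index before which it stops.

```python
s = 'the ship and the other ship'

first = s.find('ship')               # leftmost occurrence
second = s.find('ship', first + 1)   # resume the search just past the first hit
missing = s.find('ship', 0, 5)       # only positions 0 through 4 are searched

# Searching a restricted domain finds the same occurrences as searching the
# corresponding slice, but the index returned by find stays relative to the
# whole string, whereas the index from the slice is relative to the slice.
in_domain = s.find('ship', 20)
in_slice = s[20:].find('ship')

print(first)
print(second)
print(missing)
print(in_domain)
print(in_slice)
```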
index(sub[, start[, end]]) The method index provides a means of finding
where a particular substring is found within a larger string. If the substring is
found, the index of the first occurrence of the substring is returned as an integer. If
the substring is not found, a ValueError is raised, as shown in (173).
(173) stringsIndexEx1.py
# Function that prints out where a substring is within a string
def findPositionOfSubstring(s, sub) :
    try :
        pos = s.index(sub)
        print "The position of '%s' is %i in '%s'." % (sub, pos, s)
    except ValueError :
        # index raises ValueError when the substring is absent
        print "Couldn't find '%s' in '%s'!" % (sub, s)

# Apply function to string with the alphabet
alphabet = 'abcdefghijklmnopqrstuvwxyz'
findPositionOfSubstring(alphabet, 'a')
findPositionOfSubstring(alphabet, 'm')
findPositionOfSubstring(alphabet, 'z')
findPositionOfSubstring(alphabet, 'A')
Sometimes it is useful to restrict where to look for the substring within the
larger string. A start and an end point are optionally specified in the second and
third arguments to index, as shown in (174).
(174) stringsIndexEx2.py
# Function that prints out where a substring is within a string
def findPositionOfSubstring(s, sub) :
    try :
        start = 52
        end = 78
        pos = s.index(sub, start, end)
        print "Between %i and %i, " \
              "'%s' occurs at %i." % (start, end, sub, pos)
    except ValueError :
        print "Between %i and %i, " \
              "'%s' isn't found!" % (start, end, sub)

# Apply function to string with 4 repetitions of the alphabet.
# The slices for each iteration of the alphabet are as follows:
# 1st: 0 - 25
# 2nd: 26 - 51
# 3rd: 52 - 77
# 4th: 78 - 103
alphabets = "abcdefghijklmnopqrstuvwxyz" \
            "abcdefghijklmnopqrstuvwxyz" \
            "abcdefghijklmnopqrstuvwxyz" \
            "abcdefghijklmnopqrstuvwxyz"
findPositionOfSubstring(alphabets, 'a')
findPositionOfSubstring(alphabets, 'm')
findPositionOfSubstring(alphabets, 'z')
The string method rindex behaves almost identically to index, except that the
substring is searched for from the right end of the string (i.e., from the end) rather
than from the left end of the string (i.e., from the beginning).
(175) stringsRindexEx1.py
# Finds rightmost position of substring
def findRightmostPositionOfSubstring(s, sub) :
    try :
        pos = s.rindex(sub)
        print "The rightmost position of '%s' is %i." % (sub, pos)
    except ValueError :
        print "Couldn't find '%s'!" % (sub)

# Finds leftmost position of substring
def findLeftmostPositionOfSubstring(s, sub) :
    try :
        pos = s.index(sub)
        print "The leftmost position of '%s' is %i." % (sub, pos)
    except ValueError :
        print "Couldn't find '%s'!" % (sub)

# Apply functions to string with two repetitions of the alphabet
alphabets = "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz"
findLeftmostPositionOfSubstring(alphabets, 'a')
findLeftmostPositionOfSubstring(alphabets, 'm')
findLeftmostPositionOfSubstring(alphabets, 'z')

findRightmostPositionOfSubstring(alphabets, 'a')
findRightmostPositionOfSubstring(alphabets, 'm')
findRightmostPositionOfSubstring(alphabets, 'z')
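The difference between the two families of methods (find and rfind return -1 for a missing substring, whereas index and rindex raise a ValueError) can be sketched in a few lines. The helper name locate is hypothetical and introduced only for this illustration; print is called with parentheses so that the snippet also runs under Python 3.

```python
def locate(s, sub):
    """Return (leftmost, rightmost) positions of sub in s, or None if absent."""
    if s.find(sub) == -1:        # find signals absence with -1, no exception
        return None
    return (s.index(sub), s.rindex(sub))

# Two repetitions of the alphabet, as in the listing above
alphabets = 'abcdefghijklmnopqrstuvwxyz' * 2

print(locate(alphabets, 'a'))
print(locate(alphabets, 'z'))
print(locate(alphabets, 'A'))
```

Testing with find first avoids the need for a try block; which style reads better is largely a matter of taste.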
14.4.2.2 Modifying Strings
One of the most important capabilities a programming language must provide is
the ability to modify strings. Python provides a fairly large number of methods
for modifying strings. We will cover a few of the most useful ones. Note that these
methods do not directly modify the string (since strings are immutable); instead,
they return a modified copy of the string, leaving the original intact.
join(seq) Technically, this method modifies sequences, and not strings, but be-
cause it is a string method that is frequently used to convert sequences into strings,
it is included here. The method is called on a string and takes a sequence as an
argument. It joins the elements in the sequence together into a string, using the
string as a separator, as illustrated in (176).
(176) stringsJoinEx1.py
l = [ 'The', 'wind', 'came', 'across', 'the', 'bay',
      'like', 'something', 'living', '.' ]
print " ".join(l)
The join method uses the string that it is called on as a separator in the output
string. If no separator is desired, the method can be called on an empty string, as
in (177), where the morphemes of the Dutch word ??? are joined together into a
single word and printed.
(177)
l = ['a', 'b', 'c', 'd']
print "".join(l)
An error will occur if the elements within the sequence passed to join are
not strings, as in (178), where the list elements are integers.
(178)
l = [1, 2, 3, 4]
print "".join(l)    # raises a TypeError
One way to avoid this problem is by first converting each item in the list into a
string. This can be done quite easily using list comprehension (described in §9.3),
as shown in (179).
(179)
l = ['Lift-off', 'at', 14, 'hours', '40', '.']
print " ".join([str(i) for i in l])
replace(old, new [, count]) The method replace takes a string and produces
a copy in which every instance of one substring within it has been systematically
replaced by another. For example, if a word is systematically misspelled within a
text, it can be automatically fixed with this method, as shown in (180), where a
systematic misspelling of the word phoneme is corrected.
(180) stringsExReplace1.py
before = 'A foneme is the minimal phonological unit. \
The term foneme was coined by Edward Sapir, \
who believed that fonemes had psychological reality.'
after = before.replace('foneme', 'phoneme')
print after
It is possible to delete characters with this function by using an empty string

as the replacement string, as illustrated in (181).
(181) stringsExReplace2.py
before = 'anti-dis-establish-ment-arian-ism'
after = before.replace('-', '')
print after
This method also takes an optional third argument, which limits the number of
replacements that will be made; by default, there is no upper limit on the
number of replacements.
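The effect of the optional third argument can be sketched as follows. The sample string is invented for illustration, and print is called with parentheses so that the snippet also runs under Python 3.

```python
before = 'foneme, foneme, foneme'

# Without a count, every occurrence is replaced.
all_fixed = before.replace('foneme', 'phoneme')

# With a count of 1, only the first (leftmost) occurrence is replaced.
first_fixed = before.replace('foneme', 'phoneme', 1)

print(all_fixed)
print(first_fixed)
```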
split(separator [, maxsplit]) The method split takes a string and breaks it up
into a list of parts at each occurrence of a particular separator. If the separator is
not found in the string, a list containing the whole string as its only element is
returned. For example, the string dis-establish-ment-arian-ism, which
provides a morphemic breakdown of the word disestablishmentarianism, could be
split up into its constituent morphemes by running split on it using a hyphen as a
separator. The result would be a list with five morphemes, as shown in (182).
(182) stringsSplit1.py
word = 'dis-establish-ment-arian-ism'
morphemes = word.split('-')
i = 0
for m in morphemes :
    i = i + 1
    print i, m
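A few further behaviors of split are worth seeing side by side: the optional maxsplit argument, the result when the separator does not occur, and the default of splitting on runs of whitespace. The following is a minimal sketch; the variable names are chosen here for illustration, and print is called with parentheses so that the snippet also runs under Python 3.

```python
word = 'dis-establish-ment-arian-ism'

all_parts = word.split('-')        # split at every hyphen
first_cut = word.split('-', 1)     # maxsplit=1: split at the first hyphen only
no_separator = 'word'.split('-')   # separator absent: one-element list
tokens = '  two   words '.split()  # no argument: split on runs of whitespace

# Number the morphemes starting from 1
for number, morpheme in enumerate(all_parts, 1):
    print('%d %s' % (number, morpheme))

print(first_cut)
print(no_separator)
print(tokens)
```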
strip([chars]) The method strip provides the means of removing (or stripping)
any unwanted characters from the beginning and end of a string. If no
argument is supplied to the method, then whitespace characters will be stripped
from both ends of the string, that is, from the beginning and the end.
(183) stringsStripEx2.py
s1 = "   xxx"
s2 = "xxx   "
print s1
print s2
print s1.strip()
print s2.strip()
To strip specific characters from a string, the method takes a string argument
that lists all of the characters to be stripped. The argument is treated as a set of
characters rather than as a substring, so each character needs to be listed only once.
For example, the string aabbaa could be reduced to bb by removing the letter a
from its two ends, as shown in (184).
(184)
s = 'aabbaa'
s = s.strip('a')    # strip returns a new string; keep the result
print s
There are two variants of strip, which remove unwanted characters from only
one edge of the string: lstrip removes them from the left side (the beginning)
and rstrip from the right side (the end), as illustrated in (185).
(185)
before = 'aabbaa'
after = before.lstrip('a')
print after
after = before.rstrip('a')
print after
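One detail of strip and its variants is easy to overlook: the argument is read as a collection of individual characters, not as a substring to be matched as a whole. The following is a minimal sketch of that behavior; the second sample string is invented for illustration, and print is called with parentheses so that the snippet also runs under Python 3.

```python
s = 'aabbaa'

both = s.strip('a')     # remove a's from both ends
left = s.lstrip('a')    # remove a's from the beginning only
right = s.rstrip('a')   # remove a's from the end only

# With more than one character in the argument, any of them is stripped,
# in any order, until a character not in the argument is reached.
mixed = 'abcbca'.strip('ab')

print(both)
print(left)
print(right)
print(mixed)
print(s)                # the original string is unchanged
```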
14.4.2.3 Formatting Strings

Under the heading of formatting, we will discuss string functions that can change
how a string is displayed.
14.4.3 String Methods for Formatting

There are various string methods that provide ways of formatting strings for display.
(We will ignore those that manipulate case, since they are covered in §14.4.4.)
center(width) It is sometimes convenient to center strings for formatting.

(186) stringsCenterEx1.py
# Automatically create a centered header of an arbitrary width
width = 50
text = 'SECTION'
print "-" * width
print "-" + text.center(width - 2) + "-"
print "-" * width
ljust(width) The method ljust left-justifies a string within a field of the given
width, padding it on the right with spaces.
(187) stringsLjustEx1.py
# Automatically create a left-justified header of an arbitrary width
width = 50
text = 'SECTION'
print "-" * width
print "-" + text.ljust(width - 2) + "-"
print "-" * width
rjust(width) In some cases, it is also useful to right-justify strings, especially
numbers. The code in (188) provides a simple example of how this works by
printing the numbers 1 through 10 with right justification.
(188) stringsRjustEx1.py
width = 2
for i in range(1, 11) :
    s = str(i)
    print s.rjust(width)
zfill(width) The string method zfill is used to pad strings with zeros. Its
main purpose is formatting numbers, as illustrated in (189).
(189) stringsZfillEx1.py
for i in range(1, 11) :
    s = str(i)
    print s.zfill(2)
Zero-padding for numbers is useful since it ensures that they will sort properly
when sorted alphabetically (rather than numerically). To see the difference, try
sorting the raw numbers 1 through 10 and comparing the results with a sort of the
padded numbers printed out by (189).
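The comparison suggested above can be sketched as follows. The numbers are stored as strings, so sorting them compares character by character rather than by numeric value. The variable names are chosen here for illustration, and print is called with parentheses so that the snippet also runs under Python 3.

```python
# The numbers 1 through 10 as strings, unpadded and zero-padded
raw = [str(i) for i in range(1, 11)]
padded = [s.zfill(2) for s in raw]

raw_sorted = sorted(raw)        # alphabetical: '10' sorts before '2'
padded_sorted = sorted(padded)  # padding restores the numeric order

print(raw_sorted)
print(padded_sorted)
```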
14.4.4 Handling Case
Under the heading of case, we will discuss string functions that are used to evalu-
ate and/or manipulate the case of the letters constituting a string.
(190) stringsCaseEx1.py
name = 'noam chomsky'

# Capitalize -- Noam chomsky
caps = name.capitalize()
print caps

# Lowercase -- noam chomsky
lower = caps.lower()
print lower

# Uppercase -- NOAM CHOMSKY
upper = lower.upper()
print upper

# Title -- Noam Chomsky
title = upper.title()
print title

# Swap Case -- nOAM cHOMSKY
swap = title.swapcase()
print swap
In some cases, one may want to know which case a string is in.
(191) stringsCaseEx2.py
lcase = 'time travel'
if lcase.islower() :
    print "The word '%s' is all lowercase." % (lcase)

ucase = 'TARDIS'
if ucase.isupper() :
    print "The word '%s' is all uppercase." % (ucase)

title = 'Dalek'
if title.istitle() :
    print "The word '%s' is in title case." % (title)
14.4.5 String Interpolation

The term string interpolation refers to the placement of variables within strings.
Because Python is strongly typed, values of other types are not automatically
converted into strings. Therefore, an attempt to concatenate a string with a float
gives rise to an error, as can be seen by running the code in (192).
(192) stringsStringInterpolationEx1.py
pi = 3.1415                                # Set pi's value
print 'The value of pi is ' + pi + '.'     # ERROR!
There are a few different ways of solving this problem. One solution is for
non-strings to be converted into strings for the purposes of string concatenation,
as in (193).
(193)
i = 3.1415
print 'The value of pi is ' + str(i) + '.'
Although the code in (193) works fine, it isn’t particularly pretty. It’s not
easy to see what exactly will be printed out in the clutter of string concatenations.
Fortunately, there is a more elegant solution, which is string interpolation. String
interpolation is a convention whereby a placeholder in a string is replaced by a
value provided after the string. In other words, a value is interpolated into a
string (interpolate v. tr. 1. “To insert or introduce between other elements or
parts”). The results obtained in (193) can also be obtained with string interpolation,
and in a much simpler and cleaner fashion, as shown in (194).
(194)
i = 3.1415
print 'The value of pi is %f.' % i
In (194), the float is interpolated using the placeholder %f, which by default
prints six digits after the decimal point (here, 3.141500). It is possible, however,
to restrict the number of decimal places in the interpolated float by placing a
precision specification between the percent sign and the letter, as in (195).
(195)
i = 3.1415
print 'pi = %0.1f (to the 1st decimal place)' % i
print 'pi = %0.2f (to the 2nd decimal place)' % i
print 'pi = %0.3f (to the 3rd decimal place)' % i
print 'pi = %i (as an integer)' % i
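Interpolation is not limited to a single value. The following is a minimal sketch, using invented values, of interpolating several values at once and of controlling column widths, which is handy for printing word lists in aligned columns. The variable names are chosen here for illustration, and print is called with parentheses so that the snippet also runs under Python 3.

```python
word = 'kus'
freq = 12
share = 0.04567

# Several values are supplied as a tuple, in the order of the placeholders.
sentence = 'The word %s occurs %d times.' % (word, freq)

# %-10s left-justifies the string in 10 columns, %5d right-justifies the
# integer in 5 columns, %6.2f prints the float in 6 columns with 2 decimal
# places, and %% stands for a literal percent sign.
row = '%-10s %5d %6.2f%%' % (word, freq, share * 100)

print(sentence)
print(row)
```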
The various conventions available for use in string interpolation are summarized
in Table 14.3.

14.5 Case Studies in String Manipulation

Rather than discuss each string method in turn, we will try to provide real-world
examples of their use in attacking the kinds of problems that are likely to arise in
the everyday world of language research.
14.5.1 Case Study 1: Building a Concordance

In corpus analysis, many techniques require the use of a concordance, that is, a
list of words in a text along with their frequency in the text.
In this case study, we will write a concordance-maker in Python. Our input
will be a short poem by W.D. Snodgrass, entitled “April Inventory” (code/april inventory.txt).
Our concordance-maker will take this file as input and produce the concordance
shown in (196), where we find each word in the text listed in lowercase followed
by its frequency. (The output is truncated to save space: all words starting with b
through x are elided.)
(196) concordance-april-inventory.txt
all 2
...
once 1
Building a concordance like this is much simpler than it might seem. The code
is provided below in (197).
(197) stringsConcordance1.py
from string import split, lower

# ----------------------------------------------------
# 1. Get list of lines from file
# ----------------------------------------------------

fn = 'inputs/april_inventory.txt'     # hardcoded
f = open(fn, 'r')                     # Open file for reading
lineList = f.readlines()              # Get list of lines
f.close()

# ----------------------------------------------------
# 2. Create dictionary from list
# ----------------------------------------------------

concord = dict()                      # Initialize dictionary
for l in lineList:                    # Loop over line list
    # Get lowercase copy of line with newline removed
    cleanLine = lower(l).rstrip('\n')

    # Remove punctuation marks
    cleanLine = cleanLine.replace('.', '')    # Ditch .
    cleanLine = cleanLine.replace(',', '')    # Ditch ,
    cleanLine = cleanLine.replace(';', '')    # Ditch ;
    cleanLine = cleanLine.replace(':', '')    # Ditch :
    cleanLine = cleanLine.replace('!', '')    # Ditch !
    cleanLine = cleanLine.replace('"', '')    # Ditch "

    # Split line (spaces are default separator)
    words = split(cleanLine)

    for w in words:                   # Loop over words
        if not concord.has_key(w):    # Is it already there?
            concord[w] = 1            # No, add it.
        else:                         # Yes, increment freq.
            concord[w] = concord[w] + 1

# ----------------------------------------------------
# 3. Print out concordance
# ----------------------------------------------------
items = concord.items()               # Extract items
items.sort()                          # Sort them
for word, freq in items:              # Now print them
    print word , "\t" , freq          # Tab-delimited w/ \t
We accomplish the task by first reading all of the lines in the text into a list. We
then iterate over the lines. For each line, we make a lowercase copy, remove the
newline, and strip out all of the punctuation marks using replace(old,new).
Each line is then split into words (i.e., tokenized) using whitespace as the word
delimiter. (Since whitespace is the default separator for split(), there is no
need to supply it with a second argument.) The words are then put into a dictionary.
When a word does not already exist in the dictionary, it is added and its counter
is initialized to 1. When a word already exists in the dictionary, its counter is
incremented by 1. Finally, we sort the items in the dictionary (in ascending
alphabetical order, which is the default) and print each word followed by its
frequency to obtain our concordance.
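The steps just described can also be packaged as a function, which makes them easy to try out on a few lines without a file. The following is a compact sketch rather than the authors' listing: the function name concordance, the sample lines, and the use of dict.get in place of has_key are choices made here for illustration, and print is called with parentheses so that the snippet also runs under Python 3.

```python
def concordance(lines):
    """Return an alphabetically sorted list of (word, frequency) pairs."""
    counts = {}
    for line in lines:
        clean = line.lower()
        for mark in '.,;:!"':                # same punctuation marks as above
            clean = clean.replace(mark, '')
        for word in clean.split():           # whitespace is the default separator
            counts[word] = counts.get(word, 0) + 1
    return sorted(counts.items())

# Two invented lines standing in for the lines of a text file
sample = ['The cat saw the dog.\n', 'The dog ran!\n']
result = concordance(sample)
for word, freq in result:
    print('%s\t%d' % (word, freq))
```

Calling counts.get(word, 0) returns 0 for a word that has not been seen yet, so the separate test for a new word is no longer needed.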
14.5.2 Case Study 2: Creating a Dictionary from a Shoebox File
Scripting languages like Python are great for jobs that need to be done in a hurry
and probably won’t be reused. The advantage of a language like Python is that it
is also amenable to larger, more involved projects–in other words, it’s a scripting
language that scales. Therefore, we’ll show you how quick-and-dirty jobs can be
accomplished in a way that leaves open the possibility of future expansion of the
scope of the work.
For this case study, we will look at a dictionary file from Shoebox, a
program put out by SIL and used by fieldworkers for storing dictionaries and
interlinearizing texts automatically.
An excerpt from a Shoebox file belonging to one of the authors is provided
below.
(198) stringsRotokasDictionary.in
\lx asige
\rt asige
\ge make sneezing noise
\gp kus
\nt
\ps VI
\ex Asigeparoi oirato.
\xp
\xe The man is sneezing.
\lx asigea
\rt asigea
\ge a sneeze
\gp kus
\nt
\ps N.N
\ex
\xp
\xe
\lx asigo
\rt asigo
\ge speak Rotokas
\gp ???
\nt
\ps VT
\ex Akoitai asigoparevoi.
\xp
\xe Akoitai is using the Rotokas (Asigoao) language.
\lx asigoao
\rt asigoao
\ge mountain people/people of the mountains
\gp ???
\nt Name used to designate the Rotokas people before the word rotokas was (NPRN according to Firchow)
\ps N.PN?
\ex
\xp
\xe
\lx asigoao reo
\rt asigoao reo
\ge Rotokas language as designated before the term Rotokas came into use
\gp ???
\nt (NPRN according to Firchow)
\ps N.PN?
\ex
\xp
\xe
\lx asikauru
\rt asikauru
\ge rust
\gp ros
\nt
\ps V.STAT
\ex Ro-ia Toiota asikauroepa.
\xp
\xe The Toyota truck rusted away.
We can begin to tackle the problem by creating a Shoebox entry object, which
will both store the data contained in an entry and handle the parsing of the raw
Shoebox text. Although it is possible to create a separate parser to do the latter job,
we will not do so here, based on the judgement call that it would unnecessarily
complicate our object model. (Remember, these decisions are not carved in stone,
and it may sometimes be necessary to reevaluate one’s object model later and do
some refactoring.)
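One possible shape for such an entry object is sketched below. It is an illustration under stated assumptions, not the authors' implementation: the class name ShoeboxEntry, its get method, and the helper parseEntries are all hypothetical, and the sketch assumes that each field occupies one line beginning with a backslash marker and that every entry begins with an \lx field, as in the excerpt in (198). print is called with parentheses so that the snippet also runs under Python 3.

```python
class ShoeboxEntry:
    """Hypothetical container that parses and stores one Shoebox entry."""

    def __init__(self, rawLines):
        self.fields = {}
        for line in rawLines:
            line = line.strip()
            if not line.startswith('\\'):
                continue                     # skip anything that is not a field
            parts = line.split(None, 1)      # the marker, then the rest of the line
            marker = parts[0][1:]            # drop the leading backslash
            if len(parts) > 1:
                value = parts[1]
            else:
                value = ''                   # empty field, such as a bare nt marker
            self.fields[marker] = value

    def get(self, marker):
        """Return the value of a field, or an empty string if it is absent."""
        return self.fields.get(marker, '')


def parseEntries(lines):
    """Group lines into entries; each lx marker starts a new entry."""
    entries = []
    current = []
    for line in lines:
        if line.startswith('\\lx') and current:
            entries.append(ShoeboxEntry(current))
            current = []
        if line.strip():
            current.append(line)
    if current:
        entries.append(ShoeboxEntry(current))
    return entries


# A few lines taken from the excerpt in (198)
sample = [r'\lx asige', r'\ge make sneezing noise', r'\nt', r'\ps VI',
          r'\lx asigea', r'\ge a sneeze', r'\ps N.N']
entries = parseEntries(sample)
for e in entries:
    print('%s (%s): %s' % (e.get('lx'), e.get('ps'), e.get('ge')))
```

Keeping the parsing inside the entry object, as the text suggests, means that the rest of a program only ever asks an entry for its fields and never touches the raw text.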
14.5.3 Case Study 3: Parsing Words into Syllables

xxx
14.5.4 Case Study 4: Finding Minimal Pairs in a Wordlist
When studying the phonology of a language, it is useful to prepare a list of minimal
pairs, that is, pairs of words that differ from one another by a single segment
(e.g., bill and pill in English). Producing such minimal pairs from memory can be
difficult and tedious. Fortunately, a little bit of Python programming can simplify
the job by automatically extracting such minimal pairs from a word list. The only
requirement is that the word list satisfy two criteria. First, it must be phonemic
(for fairly obvious reasons). Second, each phoneme must be encoded by a single
letter (i.e., no digraphs). The reason for this is that it makes the task much easier to
accomplish, as we will see shortly. The code in (199) will take a word list, where
each word occupies a separate line, and produce a list of minimal pairs, organized
according to the contrasts exemplified.
(199) stringsMinimalPairs1.py
import sys

# -------------------------------
# Extract every word from the file
# supplied as a command-line arg
# -------------------------------
def extractWords(f) :
    # Get lines from file
    fo = open(f, 'r')
    lines = fo.readlines()
    fo.close()

    # Go through lines and get words
    words = []
    for l in lines :
        nl = l.replace("\n", "")
        words.append(nl)
    return words

# -------------------------------
# Sort the words by their length
# and put them in a dictionary
# where the key is a wordlength
# and the value is the list of
# words of that length
# 1: a, I
# 2: an, to, it, ...
# 3: the, and, but, ...
# ...
# -------------------------------
def sortWordsByLength(allWords) :
    wordLengths = {}
    for w in allWords :
        wl = len(w)
        if not wordLengths.has_key(wl) :
            wordLengths[wl] = []
        words = wordLengths[wl]
        words.append(w)
        wordLengths[wl] = words
    return wordLengths

# -------------------------------
# Object for convenient storage
# and manipulation of minpair
# -------------------------------
class MinimalPair :
    def __init__(self, s1, s2, w1, w2) :
        self.s1 = s1
        self.s2 = s2
        self.w1 = w1
        self.w2 = w2
    def getFirstWord(self) :
        return self.w1
    def getSecondWord(self) :
        return self.w2
    def setFirstWord(self, w1) :
        self.w1 = w1
    def setSecondWord(self, w2) :
        self.w2 = w2
    def getFirstSegment(self) :
        return self.s1
    def getSecondSegment(self) :
        return self.s2
    def setFirstSegment(self, s1) :
        self.s1 = s1
    def setSecondSegment(self, s2) :
        self.s2 = s2
    def getWordPair(self) :
        if self.w1 < self.w2 :
            return self.w1 + self.w2
        else :
            return self.w2 + self.w1
    def getContrast(self) :
        if self.s1 < self.s2 :
            return self.s1 + self.s2
        else :
            return self.s2 + self.s1

# -------------------------------
# Go through the words by their
# length and compare words of
# similar length to see if they
# constitute a minimal pair
# -------------------------------
def findMinPairs(words) :
    wordsByLength = sortWordsByLength(words)
    minPairs = {}
    for l in wordsByLength.keys() :
        words = wordsByLength[l]
        for w1 in words :
            for w2 in words :
                #print w1 + " / " + w2
                i = 0
                diffCount = 0
                diffChar1 = ''
                diffChar2 = ''
                while i < l :
                    #print w1[i] + " / " + w2[i]
                    if not w1[i] == w2[i] :
                        diffCount = diffCount + 1
                        diffChar1 = w1[i]
                        diffChar2 = w2[i]
                    i = i + 1
                if diffCount == 1 :
                    mp = MinimalPair(diffChar1, diffChar2, w1, w2)
                    if not minPairs.has_key(mp.getContrast()) :
                        minPairs[mp.getContrast()] = {}
                    minPairsForContrast = minPairs[mp.getContrast()]
                    minPairsForContrast[mp.getWordPair()] = mp
    return minPairs

# -------------------------------
# Print out the minimal pairs,
# grouped by the contrast that
# they exemplify
# -------------------------------
def printResults(minPairs) :
    contrast = minPairs.keys()
    contrast.sort()
    for c in contrast :
        minPairsByContrast = minPairs[c]
        print "/" + c[0] + "/ vs. /" + c[1] + "/"
        for mp in minPairsByContrast.values() :
            print mp.getFirstWord()
            print mp.getSecondWord()
            print
        print

# -------------------------------
# Run everything
# -------------------------------
def main() :
    filepath = sys.argv[1]
    words = extractWords(filepath)
    minPairs = findMinPairs(words)
    printResults(minPairs)

main()
The code in (199) first extracts all of the words from the file and puts them
in a list. It then goes through the list and finds all of the minimal pairs. This is
accomplished in a number of steps. The first step XXX.
When a phonemic alphabet is used and each phoneme is represented by a
single letter (or character), finding minimal pairs is a fairly trivial task, consisting
of the following: Every word in the word list is compared to every other word.
If two words are of different lengths, they cannot be a minimal pair, and can
therefore be ignored. If two words are of the same length, the two words are lined
up and each segment in the word is compared one by one, in sequential order. A
minimal pair is simply a pair of words that differ only by one segment. (This can
be easily determined by keeping a counter of the number of differing segments.)
Consider a pair of words like mint and lint. If we line up the two words, and
compare their letters one by one, we find that they differ by a single letter, and are
therefore a minimal pair, as shown in Table 14.4.
Table 14.4 Comparing Words in a Minimal Pair
Index     0  1  2  3
Letters   m  i  n  t
          l  i  n  t
Same?     N  Y  Y  Y
The only problem with this approach is that when you compare every word
in a list to every other, the number of comparisons required ends up being very
large, since the number of comparisons grows quadratically with the number of
words in the list (if the number of words in a word list is n, then the number of
comparisons is n²). Therefore, to speed up the program, we first divide up the list
according to word length and only compare words of the same length.
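The core of the procedure described above (group the words by length, then count the differing segments in each same-length pair) can be sketched much more compactly than in (199). The following is an illustrative sketch rather than the authors' program: the function names and the sample word list are invented here, each pair is compared only once, and print is called with parentheses so that the snippet also runs under Python 3.

```python
def isMinimalPair(w1, w2):
    """True if the two words have equal length and differ in exactly one segment."""
    if len(w1) != len(w2):
        return False
    diffCount = 0
    for c1, c2 in zip(w1, w2):       # line the words up segment by segment
        if c1 != c2:
            diffCount = diffCount + 1
    return diffCount == 1

def findMinimalPairs(words):
    """Return a sorted list of (word, word) tuples that are minimal pairs."""
    byLength = {}
    for w in words:                  # only words of equal length are compared
        byLength.setdefault(len(w), []).append(w)
    pairs = []
    for group in byLength.values():
        group = sorted(set(group))
        for i in range(len(group)):
            for j in range(i + 1, len(group)):   # each pair is visited once
                if isMinimalPair(group[i], group[j]):
                    pairs.append((group[i], group[j]))
    return sorted(pairs)

words = ['mint', 'lint', 'bill', 'pill', 'mill', 'at']
pairs = findMinimalPairs(words)
for w1, w2 in pairs:
    print('%s / %s' % (w1, w2))
```

Unlike (199), this sketch does not record which segments contrast; it only shows the grouping and counting steps.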
14.5.5 Case Study 5: Finding Anagrams in a Word List
Two words are considered anagrams of one another if the letters in one word can
be rearranged to form the other.
(200) stringsAnagrams1.py
import sys

# -------------------------------
# Extract every word from the file
# supplied as a command-line arg
# -------------------------------
def extractWords(f) :
    fo = open(f, 'r')
    lines = fo.readlines()
    fo.close()

    # Go through lines and get words
    words = []
    for l in lines :
        nl = l.replace("\n", "")
        words.append(nl)
    return words

# -------------------------------
# Sort the words by their length
# and put them in a dictionary
# where the key is a word length
# and the value is the list of
# words of that length
#   1: a, I
#   2: an, to, it, ...
#   3: the, and, but, ...
#   ...
# -------------------------------
def sortWordsByLength(allWords) :
    wordLengths = {}
    for w in allWords :
        wl = len(w)
        if wl not in wordLengths :
            wordLengths[wl] = []
        words = wordLengths[wl]
        words.append(w)
        wordLengths[wl] = words
    return wordLengths

# -------------------------------
# Take all of the chars in a
# string, alphabetize them,
# and return them as a string
# -------------------------------
def sortStringChars(s) :
    charList = []
    i = 0
    while i < len(s) :
        charList.append(s[i])
        i = i + 1
    charList.sort()
    return ''.join(charList)

# -------------------------------
# Compare the words of each length
# pairwise and record those that
# are anagrams
# -------------------------------
def findAnagrams(words) :
    anagrams = {}
    wordsByLength = sortWordsByLength(words)
    for l in wordsByLength.keys() :
        words = wordsByLength[l]
        for w1 in words :
            chars1 = sortStringChars(w1)
            for w2 in words :
                chars2 = sortStringChars(w2)
                if chars1 == chars2 and w1 != w2 :
                    if chars1 not in anagrams :
                        anagrams[chars1] = {}
                    ac = anagrams[chars1]
                    ac[w1] = ''
                    ac[w2] = ''
                    anagrams[chars1] = ac
    return anagrams

# -------------------------------
# Print out the results in a
# nicely formatted fashion
# -------------------------------
def printResults(anagrams) :
    for a in anagrams.values() :
        words = a.keys()
        print("/".join(words))

# -------------------------------
# Run everything
# -------------------------------
def main() :
    filepath = sys.argv[1]
    words = extractWords(filepath)
    anagrams = findAnagrams(words)
    printResults(anagrams)

main()
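The pairwise comparison can be avoided altogether. The following is a minimal sketch, not part of the original script: it uses each word's alphabetized letters as a dictionary key, so that anagrams end up in the same list after a single pass over the words. The function name group_anagrams and the sample word list are assumptions made for illustration, and the word list is assumed to contain no duplicates.

```python
def group_anagrams(words):
    """Group words that share the same alphabetized letters."""
    groups = {}
    for w in words:
        key = ''.join(sorted(w))            # e.g. 'listen' -> 'eilnst'
        groups.setdefault(key, []).append(w)
    # Keep only the groups that contain at least two words
    return [g for g in groups.values() if len(g) > 1]

# Hypothetical sample data, standing in for a word list read from a file
sample = ['listen', 'silent', 'enlist', 'cat', 'act', 'dog']
for group in group_anagrams(sample):
    print("/".join(group))
```

Because every word is looked at only once, the work grows roughly in proportion to the length of the list rather than with its square.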
14.5.6 Case Study 6: Finding Palindromes in a Word List

A palindrome is a word (phrase, sentence, etc.) that retains its identity regardless
of whether it is spelled forwards or backwards.
(201) stringsPalindromes1.py
import sys

# ---------------------------------
# Reverse string
# ---------------------------------
def reverse(s) :
    ns = ''
    i = len(s)
    while i > 0 :
        i = i - 1
        ns = ns + s[i]
    return ns

# ---------------------------------
# Go through file, find words,
# and check whether they are
# palindromes
# ---------------------------------
def main() :
    # Filename is command-line arg
    file = sys.argv[1]

    # Read in lines
    fo = open(file, 'r')
    lines = fo.readlines()
    fo.close()

    # Find palindromes
    palindromes = []
    for l in lines :
        w = l.replace("\n", "")
        if w == reverse(w) :
            palindromes.append(w)

    # Print out palindromes found
    for p in palindromes :
        print(p)

main()
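The reverse() function spells out the reversal step by step. As a hedged alternative sketch, Python's slice notation can produce the reversed string directly: a slice with a step of -1 walks through the string backwards. The function name is_palindrome and the candidate words are assumptions made for illustration.

```python
def is_palindrome(word):
    """Return True if word reads the same forwards and backwards."""
    return word == word[::-1]      # word[::-1] is the reversed string

# Hypothetical sample data, standing in for a word list read from a file
candidates = ['level', 'python', 'noon', 'word']
palindromes = [w for w in candidates if is_palindrome(w)]
for p in palindromes:
    print(p)
```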
14.6 Exercises
• ???
• ???
• ???
Table 14.3 Conventions for String Interpolation

Convention  Meaning                                                         Notes
d           Signed integer decimal.
i           Signed integer decimal.
o           Unsigned octal.                                                 (1)
u           Unsigned decimal.
x           Unsigned hexadecimal (lowercase).                               (2)
X           Unsigned hexadecimal (uppercase).                               (2)
e           Floating point exponential format (lowercase).
E           Floating point exponential format (uppercase).
f           Floating point decimal format.
F           Floating point decimal format.
g           Same as "e" if exponent is less than -4 or not less than
            precision, "f" otherwise.
G           Same as "E" if exponent is less than -4 or not less than
            precision, "F" otherwise.
c           Single character (accepts integer or single character string).
r           String (converts any python object using repr()).               (3)
s           String (converts any python object using str()).                (4)

(1) ???
(2) ???
(3) ???
(4) ???
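A few of these conventions in action may help. The following sketch applies the % operator to some arbitrarily chosen values (the values are assumptions made for illustration) and collects the resulting strings.

```python
# Each pair is (template, value); the values are arbitrary illustrations
examples = [
    ("%d", 42),         # signed integer decimal
    ("%o", 8),          # octal
    ("%x", 255),        # hexadecimal, lowercase
    ("%X", 255),        # hexadecimal, uppercase
    ("%e", 1234.5),     # exponential format
    ("%.2f", 3.14159),  # decimal format with a precision of 2
    ("%g", 0.0001),     # exponent is -4: decimal format is used
    ("%g", 0.00001),    # exponent is -5: exponential format is used
    ("%c", 65),         # single character from an integer
    ("%r", "vio"),      # repr() of the object
    ("%s", "vio"),      # str() of the object
]

results = [template % value for template, value in examples]
for (template, value), result in zip(examples, results):
    print(template + " -> " + result)
```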
Part II
Advanced Topics
Chapter 15
Object Oriented Programming
This chapter introduces the basic concepts of OO programming, an approach
to programming that tackles problems in terms of modelling objects and their
interactions, rather than in terms of specifying tasks and procedures. In other
words, it is programming that is oriented towards objects, hence the terminology.
Object orientation is a relatively recent development in programming, but it is a
major innovation in software development. The Python programming language
takes full advantage of it, putting object orientation at the core of its philosophy.
15.1 What are Objects?

Since OO programming is programming that works in terms of objects, the first
order of business is clarifying what exactly is meant by an object. One way of
looking at objects is to think of them as fancy data structures, which provide
convenient and well-defined ways of interacting with the data held within them. These
data structures (i.e., objects) can be used to represent linguistic constructs. For
example, if we were writing a program to syllabify words, we would want to
create objects that model constructs of linguistic theory. We would create a syllable
object that is composed of other objects, such as onsets, nuclei, and codas. These
would in turn be composed of other objects, such as phoneme objects to store
the individual segments that make up the onset, the nucleus, and the coda. The
syllabification program would then do its work by modelling the interaction of these
objects.
Objects can be broken down in terms of their properties, or, to use the jargon
of the field, their attributes. For example, a consonant phoneme could be
represented in terms of its various parameters: its place and manner of articulation,
whether it is voiced or voiceless, etc. And the behavior of the object could be
defined in such a way that various operations on it are possible, such as devoicing,
lengthening, etc.
Let’s see how this might work. Before we can create an object, we need to
introduce a bit more terminology. The template for an object is a class. We
therefore speak of instantiating a class as an object. This may sound rather abstruse,
but it’s actually fairly simple, and fortunately there’s already linguistic
terminology that can be pressed into service. The distinction between classes and objects
is basically the distinction between types and tokens. Tokens are individual
instances of a type. Similarly, objects are instances of a class.
The first step towards creating a phoneme object is therefore to write a class
that will define the phoneme object. The code in (202) defines a very simple class
that could be used to represent stop phonemes (plosives).
(202) ooObjectEx1.py
class Stop :                            # Name of the class
    def __init__(self) :                # Constructor method
        self._voicing = None            # Object variable (attribute), not yet set
    def voicing(self) :                 # Accessor method for voicing
        return self._voicing
    def devoiced(self) :                # Sets voicing to 'VOICELESS'
        self._voicing = 'VOICELESS'
    def voiced(self) :                  # Sets voicing to 'VOICED'
        self._voicing = 'VOICED'
The class definition begins with the keyword class followed by the name of
the class, in this case, Stop. It also defines four functions, or methods, to use
the terminology of OO programming. The first is __init__, which is the method that
handles the creation of an object from a class, associating
with it a single variable, _voicing. (The variable is named _voicing rather than
voicing so that it does not clash with the method of that name. Don’t sweat the details. For more
information about object construction, see §16.1.1.) The remaining three
methods relate to voicing. The first, voicing, provides the value of the voicing
variable (attribute). The other two change the attribute, setting its value to either
VOICED or VOICELESS.
We now have a class that defines an object that we can use to represent stop
phonemes. But how do we make use of it? In (203), we insert the class definition
into the beginning of a script, create a phoneme object, and invoke its methods for
illustration.
(203) ooObjectEx2.py
# Define the class
class Stop :
    def __init__(self) :
        self._voicing = None
    def voicing(self) :
        return self._voicing
    def devoiced(self) :
        self._voicing = 'VOICELESS'
    def voiced(self) :
        self._voicing = 'VOICED'

# Create an object
s = Stop()
s.devoiced()
print(s.voicing())
s.voiced()
print(s.voicing())
Note that the Stop object is saved into the variable s. The devoiced()
method is called on s, setting the voicing variable, which is then retrieved and
printed by invoking the voicing() method on s. The same is then done for the
voiced() method. This example already illustrates a few things about objects.
First, they are really quite easy to use. It’s not rocket science! Second, they have
very well-defined behavior. Notice that the interface offers no way of setting the value of
voicing to anything but VOICELESS or VOICED (but see footnote 1). If we attempt to call a
method that doesn’t exist, we get an error, as shown in (204). (We’ve omitted
the class definition to save space.)
(204)
>>> s = Stop()
>>> print s.aspiration()
File "./ex_class_error.py", line 18, in ?
print s.aspiration()
AttributeError: Stop instance has no attribute ’aspiration’
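The same behavior can be checked from within a program. The sketch below is an illustration rather than part of the original example: it repeats the Stop class so that it runs on its own, asks with the built-in hasattr() whether a method exists, and catches the AttributeError that a call to a missing method raises. The exact wording of the error message differs between Python versions, so the sketch does not rely on it.

```python
class Stop:
    def __init__(self):
        self._voicing = None
    def voicing(self):
        return self._voicing
    def devoiced(self):
        self._voicing = 'VOICELESS'
    def voiced(self):
        self._voicing = 'VOICED'

s = Stop()
has_devoiced = hasattr(s, 'devoiced')       # this method is defined
has_aspiration = hasattr(s, 'aspiration')   # this one is not

error_caught = False
try:
    s.aspiration()
except AttributeError:
    # Raised because Stop defines no aspiration() method
    error_caught = True
    print("Stop objects have no aspiration() method")
```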
Although there are still lots of technical details that need to be covered, the
basic idea of how objects are defined and used should be clear at this point. We
can now turn to a more important issue, one the reader is no doubt beginning to
wonder about: why this approach represents an advantage over traditional
programming practices. The next section takes up the virtues of OO programming.
15.2 The Virtues of OO Programming

To understand better what objects are and how they are useful, let’s look at a simple
programming task and see how it would traditionally have been tackled from a
procedural programming point of view and how the OO approach differs.
The task is straightforward: take a file with data organized into tab-delimited
columns and manipulate it for analysis and reformatting. The file is provided in
??.
1 Technically, this isn’t 100% true, since the variable that holds the voicing value is public, not private, and could
therefore be sneakily accessed by a programmer who isn’t adhering to the defined interface, but
we won’t delve into these technicalities until the next chapter. Skip ahead to §?? if you want the
full story.
xxx
(9) 1 Harald Baayen ??? 2004-???-???
2 Stuart Robinson ??? 1973-10-28
3 Fermin Moscoso del Prado ??? 1974-???-???
The data file contains four columns:
1. first name,
2. last name,
3. research group, and
4. birthdate (in the format year-month-day; more specifically, YYYY-MM-DD)
According to traditional programming approaches, this task would be broken

down into a series of procedures. The file would be opened; the lines processed
one by one; and the data extracted, analyzed, and then printed out in a reformatted
fashion. The following script accomplishes these tasks.
(205) ooEx1.py
import sys

filename = sys.argv[1]                  # Filename is command-line arg
f = open(filename)                      # Open file
lines = f.readlines()                   # Read in lines
i = 0                                   # Counter for lines
for l in lines :                        # Go through lines
    i = i + 1                           # Increment counter
    cols = l.split("\t")                # Split line by tabs
    print(cols[1] + ", " + cols[0])     # Print last name, first name
There are some aspects of this program that are awkward and could be im-
proved, even within a procedural programming approach. For starters, it would
be very difficult to change the way that the data is reformatted, since the refor-
matting happens as each line is processed. It would be desirable to have a bit
more flexibility in the reformatting, perhaps by storing all of the data in a large
data structure and then manipulating this data structure at the end. The following
script is a variant of (??) which differs in the way it handles the data in the file.
Instead of reformatting the names as they are processed, it stores the data as a list
of lists, which it manipulates at the end.
(206) ooEx2.py
import sys

filename = sys.argv[1]                  # Filename is command-line arg
f = open(filename)                      # Open file
lines = f.readlines()                   # Read in lines
data = []
for l in lines :                        # Process lines
    cols = l.split("\t")                # Split by tabs
    data.append(cols)                   # Put data in list

for d in data :                         # Extract data
    fname = d[0]                        # First name is 1st col
    lname = d[1]                        # Last name is 2nd col
    print(lname + ", " + fname)         # Print out
Note how the data is organized in (??): the data in each line is put into a list
and these lists with line data are then put into one big list. (In other words, we
have a list of lines, each of which is a list of columns.) This approach works
but it isn’t very elegant. For starters, it is somewhat inflexible: any code that uses
the data has to know which index holds which field (0 for the first name, 1 for the
last name), so a change in the column order of the file means tracking down every
such index.
Some of the inelegance of this program can be remedied using objects. The main
innovation we will introduce is the use of a Person object, which will be used to
save all of the data about the people listed in the data file. This solution isn’t fully
OO, since a good deal of it continues to be procedural (e.g., the parsing of the data
file into lines and of lines into columns), but it will suffice to introduce some basic
concepts about objects.
(207) ooEx3.py
import sys

#-----------------------------------
# Define Person class
#-----------------------------------
class Person :
    # Default values for the attributes
    _fname = ''
    _lname = ''
    def set_first_name(self, fname) :
        self._fname = fname
    def set_last_name(self, lname) :
        self._lname = lname
    def get_first_name(self) :
        return self._fname
    def get_last_name(self) :
        return self._lname
    def get_first_initial(self) :
        return self._fname[0] + "."

#-----------------------------------
# Process data file
#-----------------------------------

# Filename is command-line arg
filename = sys.argv[1]

# Open file
f = open(filename)

# Read in lines
lines = f.readlines()

# Process lines
people = []
for l in lines :
    cols = l.split("\t")
    p = Person()
    p.set_first_name(cols[0])
    p.set_last_name(cols[1])
    people.append(p)

# Extract data
for p in people :
    print(p.get_last_name() + ", " + p.get_first_initial())
Don’t worry too much about the details of the definition of the Person
class. The point is that it provides a way of storing particular data fields and
a way of setting those data fields and retrieving them. Basically, this
is done through custom-defined functions, or methods, to use OO jargon. In this
case, we have five methods: two for setting the object’s variables (setter methods)
and three for retrieving them, or a value derived from them (getter methods). The methods of an object
collectively constitute an interface.
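One payoff of an interface is that later code never has to know how the data is stored. The following sketch illustrates this by sorting people by last name using only getter methods. It is a simplified stand-in, not the script above: the constructor that takes both names is an assumption made to keep the example short, and the names come from the sample data file.

```python
class Person:
    # Assumption: names are passed to the constructor instead of setters
    def __init__(self, fname, lname):
        self._fname = fname
        self._lname = lname
    def get_first_name(self):
        return self._fname
    def get_last_name(self):
        return self._lname
    def get_first_initial(self):
        return self._fname[0] + "."

people = [
    Person("Stuart", "Robinson"),
    Person("Harald", "Baayen"),
    Person("Fermin", "Moscoso del Prado"),
]

# Sort through the interface; no column indexes are involved
people.sort(key=lambda p: p.get_last_name())
formatted = [p.get_last_name() + ", " + p.get_first_initial() for p in people]
for line in formatted:
    print(line)
```

If the internal storage of Person changed, the sorting and printing code would stay exactly as it is.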
It should be obvious that objects provide a much more flexible way of organizing
programs, since they provide a means of directly modelling conceptual entities.
But there is much more to OO programming than convenient manipulation of
data structures.
It may seem like using objects creates a great deal of unnecessary overhead.
After all, (??) consists of only XXX lines, whereas (??) has XXX (nearly twice
as many). In a simple example like this one, the benefits of OO programming
are not as obvious, but they become apparent once we begin dealing with more
complicated examples. This shouldn’t be too surprising, since OO programming
is basically a technique for managing complexity by decomposing a complex
system into a set of less complex objects, whose interactions define the behavior of
the system. It may be worth saying something at this point about the relative
advantages of OO programming over traditional procedural programming.
Some of the strengths of the former are listed below:
Abstraction A class’s representation is visible only through its methods.
Encapsulation The data contained within an object is not directly accessible and
can only be accessed through its defined interface.
Flexibility Inheritance and polymorphism enhance flexibility.
Inheritance Inheritance simplifies the sharing of code and allows changes in one
place to propagate widely in a clear and controlled way.
Modularity Localizes changes, simplifying maintenance.
Polymorphism Behavior that depends on which class is carrying it out, i.e.,
classes react differently to the same message.
Reuse Integration of methods and data promotes reuse of code.
Typing ???
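Of the terms in this list, polymorphism is perhaps the hardest to picture without code. The following minimal sketch shows two classes reacting differently to the same message, here a call to describe(). The class names Vowel and Consonant and the method describe() are hypothetical and chosen only for illustration.

```python
class Vowel:
    def __init__(self, symbol):
        self.symbol = symbol
    def describe(self):
        return self.symbol + " is a vowel"

class Consonant:
    def __init__(self, symbol):
        self.symbol = symbol
    def describe(self):
        return self.symbol + " is a consonant"

# The loop sends the same message to every object; each class
# answers in its own way
segments = [Consonant("t"), Vowel("a"), Consonant("r")]
descriptions = [seg.describe() for seg in segments]
for d in descriptions:
    print(d)
```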
In the next few sections, we will explore these concepts in some depth. Note,
however, that we will not yet delve into the details of how objects are handled in
Python, since our concern here is with larger conceptual issues and not nitty-gritty
language-specific details (which are covered in chapter ??). This means that
some aspects of the code in the upcoming examples will not be fully explained
and must be taken on faith. If as a result some aspects of the code are unclear,
there is no need to worry, since the next chapter will fill in the gaps.
15.3 Major Concepts of OO Programming

The literature on OO programming is extensive, and it is difficult to do the subject
justice in a single chapter, but we will attempt to give an overview of the basics here. We
won’t be able to cover the basics in great depth, but the material here should
lay a foundation to build upon in the future. More importantly, it should give the
reader the basic working vocabulary required for tackling more in-depth technical
treatments of these topics. For additional reading on OO programming, see the
suggested reading section at the end of the chapter.
15.3.1 Data Abstraction

XXX
15.3.2 Encapsulation
One fairly central concept behind object orientation is encapsulation, which refers
to the technique of making the internal details of an object invisible to the user, in
some sense hiding them. Encapsulation gives object orientation a great deal of its
appeal. By shielding the internal workings of an object from the user, the focus is
placed upon how you interact with an object (i.e., its interface) rather than how it
is written (i.e., its implementation). This encourages the reuse of objects, since
the user doesn’t need to understand their inner workings, but only how to interact
with them. The implementation of an object can be improved without changing
its interface, which makes maintenance of the code a lot simpler, since any code
that refers to it won’t also have to be changed.
To see how encapsulation works in a more concrete way, let’s look at a simple
example. The program Shoebox is a tool used by linguists to manage lexicons
and deploy them for textual analysis. It produces interlinearized texts like the one
excerpted in Figure 15.3.2.
What we have in Figure 15.3.2 is a single line of text, broken down into words,
which are in turn broken down into morphemes, with each morpheme described on
four levels. (A free translation of the line is also provided, but we will ignore this
information for now.)
Table 15.1 Description of Shoebox Configuration in Figure 15.3.2

Field Marker  Description
\t            The surface form of the morphemes in the word.
\u            The underlying form of the morphemes in the word.
\g            A gloss of the morpheme’s meaning.
\c            The category that the morpheme belongs to (noun, prefix, suffix, etc.).
Suppose that we want to create a program that manipulates Shoebox files of
this sort. The OO approach consists of breaking the program down into the various
conceptual entities involved and modelling their interaction. The conceptual
entities involved are rather high-level ones, like text, line, word, morpheme, phoneme
(moving from a more coarse-grained to a more fine-grained level of analysis).
Let’s keep it simple for now and think about what a Morpheme object might look
like. The main challenge here is storing the multiple tiers of information for each
morpheme in an easily accessible way. We have already seen that each morpheme
is described on four tiers. We can assign each tier a number and use a list to store
the information, as in (208), which defines a Morpheme class. (Remember, a class
is a template for an object, so this code defines Morpheme objects.)
(208) ooEncapsulationEx1.py
# -----------------------------------
# Definition of Morpheme Class (v. 1)
# -----------------------------------
class Morpheme :
    # Constructor
    def __init__(self) :
        self.tiers = ['', '', '', '']

    # Getters
    def getSurfaceForm(self) :
        return self.tiers[0]
    def getUnderlyingForm(self) :
        return self.tiers[1]
    def getGloss(self) :
        return self.tiers[2]
    def getCategory(self) :
        return self.tiers[3]

    # Setters
    def setSurfaceForm(self, sf) :
        self.tiers[0] = sf
    def setUnderlyingForm(self, uf) :
        self.tiers[1] = uf
    def setGloss(self, g) :
        self.tiers[2] = g
    def setCategory(self, c) :
        self.tiers[3] = c

Figure 15.3.2
\t  viapau  oisio      tarai-pa-vi-ei                                  vao        vo         vokioia.
\m  viapau  oisio      tarai       -pa        -vio        -ei         vao        vo         voki      -
\g  NEG     like_this  understand  -PROG      -1.PL.INCL  -PRES       DEM.3SG.N  this/here  day/time  -
\p  N.N     ???        VI          -SUFF.V.3  -SUFF.VI.4  -SUFF.VI.5  DEM        ???        N.N       -
\f  Yumi no save long dispela taim.
The basic idea is that each tier is a different index in the list contained within
the object, starting from 0. This numbering scheme is shown in table 15.2 for the
morpheme vi, the third morpheme from the Rotokas word taraipaviei from
figure 15.3.2.
Table 15.2 Storing Morpheme Tiers as a List

Tier Index Value
Surface Morphemes 0 vi
Underlying Morphemes 1 vio
Gloss 2 1.PL.INCL
Category 3 SUFF.2
Now imagine that a programmer decides that storing tiers as lists is undesirable
for some reason, perhaps because lists are not particularly mnemonic due to the
fact that the association between a tier and its position in the list (that is, its index) is
arbitrary. (If you want to expand on the morpheme object by adding a new method
and somewhere in that new method you need to reference the gloss, you would have to
remember that its index is 2.) A more mnemonic alternative is to represent the
tiers as key-value pairs in a dictionary (see §?? for more information about this
type of data structure), using the single-letter Shoebox field marker as the key for
a particular tier’s data. The same morpheme from table ?? is broken down into
key-value pairs, as shown in table 15.3.
Table 15.3 Storing Morpheme Tiers as a Dictionary

Tier Key Value
Surface Morphemes t vi
Underlying Morphemes u vio
Gloss g 1.PL.INCL
Category c SUFF.2
This change in the implementation of the object need not affect its inter-
face, since it can be hidden away (i.e., encapsulated) behind accessor methods,
as shown by the code in (209).
(209)
# -----------------------------------
# Definition of Morpheme Class (v. 2)
# -----------------------------------
class Morpheme :
    # Constructor
    def __init__(self) :
        self.tiers = {'t': '', 'u': '', 'g': '', 'c': ''}

    # Getter Methods
    def getSurfaceForm(self) :
        return self.tiers['t']
    def getUnderlyingForm(self) :
        return self.tiers['u']
    def getGloss(self) :
        return self.tiers['g']
    def getCategory(self) :
        return self.tiers['c']

    # Setter Methods
    def setSurfaceForm(self, sf) :
        self.tiers['t'] = sf
    def setUnderlyingForm(self, uf) :
        self.tiers['u'] = uf
    def setGloss(self, g) :
        self.tiers['g'] = g
    def setCategory(self, c) :
        self.tiers['c'] = c
It is important to emphasize that these changes in implementation will
have effects on the workings of any system involving the Morpheme object (since
dictionaries use more memory than lists, and this may affect performance), but
any preexisting code calling on the object will not have to be changed, since the
interface remains untouched. The code in (210) would work under either
implementation of the Morpheme object (that is, either (208) or (209)):
(210)
m = Morpheme()
m.setSurfaceForm('tarai')
m.setUnderlyingForm('tarai')
m.setGloss('understand')
m.setCategory('VI')
print(m.getSurfaceForm())
print(m.getUnderlyingForm())
print(m.getGloss())
print(m.getCategory())
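The claim that client code is unaffected can be tested directly. The sketch below is a cut-down illustration with only two tiers; the class names MorphemeList and MorphemeDict and the function gloss_line() are hypothetical. It runs the same client function against a list-based and a dictionary-based implementation.

```python
class MorphemeList:
    """Stores the tiers in a list."""
    def __init__(self):
        self._tiers = ['', '']
    def setSurfaceForm(self, sf):
        self._tiers[0] = sf
    def setGloss(self, g):
        self._tiers[1] = g
    def getSurfaceForm(self):
        return self._tiers[0]
    def getGloss(self):
        return self._tiers[1]

class MorphemeDict:
    """Stores the tiers in a dictionary keyed by field marker."""
    def __init__(self):
        self._tiers = {'t': '', 'g': ''}
    def setSurfaceForm(self, sf):
        self._tiers['t'] = sf
    def setGloss(self, g):
        self._tiers['g'] = g
    def getSurfaceForm(self):
        return self._tiers['t']
    def getGloss(self):
        return self._tiers['g']

def gloss_line(morpheme_class):
    """Client code: uses only the interface, never the storage."""
    m = morpheme_class()
    m.setSurfaceForm('tarai')
    m.setGloss('understand')
    return m.getSurfaceForm() + " '" + m.getGloss() + "'"

print(gloss_line(MorphemeList))
print(gloss_line(MorphemeDict))
```

Both calls produce the same line, even though the two classes store their data differently.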
For simplicity’s sake, let’s just say that a word consists of morphemes. Since
the morphemes are ordered in sequence, we’ll use a list to store them. We can
now define a Word class that will provide a list of morphemes and a method for
accessing them, as in (211).
(211)
class Word :
    def __init__(self) :
        self.morphemes = []
    def addMorpheme(self, morpheme) :
        self.morphemes.append(morpheme)
    def getMorpheme(self, i) :
        return self.morphemes[i]
Note that users of the class are not meant to access the list of morphemes directly. Instead, we have
defined a method that provides access to a particular morpheme. This is already
an instance of encapsulation, because it shows how we use an interface (in this
case, a very simple one consisting of only two methods) to restrict a user’s
access to an object’s internal workings.
To complete our example, we can bring the Morpheme and Word classes
together to provide an object-oriented representation of the word taraipaviei from
XXX.
(212)
w = Word()

# A new Morpheme object is created for each morpheme
m = Morpheme()
m.setSurfaceForm('tarai')
m.setUnderlyingForm('tarai')
m.setGloss('understand')
m.setCategory('VI')
w.addMorpheme(m)

m = Morpheme()
m.setSurfaceForm('pa')
m.setUnderlyingForm('pa')
m.setGloss('PROG')
m.setCategory('SUFF???')
w.addMorpheme(m)

m = Morpheme()
m.setSurfaceForm('vi')
m.setUnderlyingForm('vio')
m.setGloss('???')
m.setCategory('SUFF???')
w.addMorpheme(m)

m = Morpheme()
m.setSurfaceForm('ei')
m.setUnderlyingForm('ei')
m.setGloss('PRES')
m.setCategory('SUFF???')
w.addMorpheme(m)
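Each morpheme in a word needs its own Morpheme object. If a single object were reused and merely overwritten, the list inside the word would end up holding several references to one and the same morpheme. The sketch below shows a compact way of building the word in a loop. It uses simplified stand-ins for the classes above: the one-tier Morpheme constructor and the countMorphemes() method are assumptions made for illustration.

```python
class Morpheme:
    # Assumption: a one-tier morpheme, set through the constructor
    def __init__(self, surface):
        self._surface = surface
    def getSurfaceForm(self):
        return self._surface

class Word:
    def __init__(self):
        self._morphemes = []
    def addMorpheme(self, morpheme):
        self._morphemes.append(morpheme)
    def getMorpheme(self, i):
        return self._morphemes[i]
    def countMorphemes(self):          # hypothetical helper method
        return len(self._morphemes)

w = Word()
for surface in ['tarai', 'pa', 'vi', 'ei']:
    w.addMorpheme(Morpheme(surface))   # a new object on every pass

forms = [w.getMorpheme(i).getSurfaceForm() for i in range(w.countMorphemes())]
print("-".join(forms))
```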
15.3.3 Inheritance
Inheritance refers to the technique of organizing classes into hierarchies such
that classes lower in the hierarchy automatically share something in common with
classes higher in the hierarchy. Inheritance provides a way of writing less code,
since the sharing of code obviates the need for its duplication through
copy-and-pasting, and of reducing the likelihood of errors, since changes can be made at one
point in the inheritance hierarchy, from where they will automatically trickle
downwards. (Unlike former American president Ronald Reagan’s trickle-down
economics, trickle-down programming isn’t voodoo. It actually works!)
The basic idea is straightforward. We can illustrate the usefulness of
inheritance by imagining how we might design a bibliographic program. A
bibliographic database keeps track of references, but references are not all of the same
type. There are books, journal articles, manuscripts, etc. Some data fields are
common to all types (for example, a title), but others are unique to particular
types (for example, books have publishers, but manuscripts don’t). Without
inheritance, we would have to write a great deal of duplicate code. To illustrate, look at
the definition of three classes in (213) for three different reference types: books,
journal articles, and manuscripts. It does not make use of inheritance.
(213) ooInheritanceEx1.py
# -----------------------------------------------
# CLASS: Book
# -----------------------------------------------
class Book :
    _id = None
    _publisher = None
    _title = None
    _where_published = None
    _year = None
    def identifier(self) :
        return self._id
    def publisher(self) :
        return self._publisher
    def title(self) :
        return self._title
    def where_published(self) :
        return self._where_published
    def year(self) :
        return self._year


# -----------------------------------------------
# CLASS: JournalArticle
# -----------------------------------------------
class JournalArticle :
    _id = None
    _journal = None
    _title = None
    _year = None
    def identifier(self) :
        return self._id
    def journal(self) :
        return self._journal
    def title(self) :
        return self._title
    def year(self) :
        return self._year


# -----------------------------------------------
# CLASS: Manuscript
# -----------------------------------------------
class Manuscript :
    _id = None
    _title = None
    _year = None
    def identifier(self) :
        return self._id
    def title(self) :
        return self._title
    def year(self) :
        return self._year
There is a good deal of overlap in these three classes. For example, each class
has an accessor method for its unique identifier number, its title, and the year in
which it was written (in the case of manuscripts) or published (in all other cases).
By rewriting the same three accessor methods for each class, we introduce the
possibility of copy-and-paste errors. Furthermore, if at some point the workings of a
common method need to change, that change must be made for each class, which
is time-consuming and error-prone. Inheritance simplifies the task of factoring out
similarities, as illustrated by (214).
(214)
# -----------------------------------------------
# CLASS: Reference
# Parent of: Book, JournalArticle, Manuscript
# -----------------------------------------------
class Reference :
    def __init__(self, title, year) :
        self.id = None
        self.title = title
        self.year = year
    def get_identifier(self) :
        return self.id
    def get_title(self) :
        return self.title
    def get_year(self) :
        return self.year
    def set_identifier(self, id) :
        self.id = id
    def set_title(self, title) :
        self.title = title
    def set_year(self, year) :
        self.year = year

# -----------------------------------------------
# Child Classes
# -----------------------------------------------

# Definition of a class for books
class Book(Reference) :
    def __init__(self, title, year) :
        Reference.__init__(self, title, year)

# Definition of a class for journal articles
class JournalArticle(Reference) :
    def __init__(self, title, year) :
        Reference.__init__(self, title, year)
        self.journal = None
    def get_journal(self) :
        return self.journal

# Definition of a class for manuscripts
class Manuscript(Reference) :
    def __init__(self, title, year) :
        Reference.__init__(self, title, year)
The methods relating to identifier, title, and year have been placed within
the Reference class, which the three classes Book, JournalArticle, and
Manuscript all extend or inherit from. Using a genealogical metaphor, we
could alternatively say that the Reference class is the parent class for the child
classes Book, JournalArticle, and Manuscript.
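The parent-child relationship can be inspected from within Python. The following sketch uses a cut-down Reference class (an assumption made to keep the example short) and the built-in functions issubclass() and isinstance() to show that a Book counts as a Reference and can use the methods it inherits.

```python
class Reference:
    def __init__(self, title, year):
        self.title = title
        self.year = year
    def get_title(self):
        return self.title
    def get_year(self):
        return self.year

class Book(Reference):
    pass    # nothing new: everything is inherited from Reference

b = Book('Language and Conceptual Development', '200X')

print(issubclass(Book, Reference))   # Book is a child of Reference
print(isinstance(b, Reference))      # so a Book object is also a Reference
print(b.get_title())                 # method defined only in the parent
```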
The only downside of inheritance is that it introduces some conceptual
complexity. It becomes necessary to keep track of inheritance hierarchies, but this is
a fairly small price to pay for the benefits that inheritance provides. The
relationship between the various classes (i.e., the inheritance hierarchy) can be illustrated
diagrammatically, as in Figure 15.4.

                  Reference
            /         |         \
  JournalArticle  Manuscript    Book

Table 15.4 Inheritance Hierarchy From (214)
As Figure 15.4 shows, each class in our bibliography example has only one
parent class, but conceptually there is no reason why this must be the case. Before
continuing, we need some terminology. The term single inheritance refers to
a situation where a class has only one parent class, whereas the term multiple
inheritance is used to refer to a situation where a class has more than one parent
class. Python allows multiple inheritance, even though some OO programming
languages do not (for example, Java). The problem with multiple inheritance is
that when a class has two parent classes and each defines the same method, there
needs to be some way of determining which is operative, and this adds some
conceptual complexity, and potential confusion, to inheritance. (This is why it is
barred in some programming languages.) An example should clarify.
Recall that in the object model for our bibliography, there were two reference
types that had editors: EditedBook and BookArticle. All of the reference types
have authors, with the exception of EditedBook. It would be nice if the
common code could be pulled out and shared between the classes that need it. However, an
edited book has only editors whereas a book article has both authors and editors.
The ideal situation is to have authors and editors available if they are needed and
only if they are needed. There are a number of ways to handle this, but multiple
inheritance provides one solution, as shown by the code in (215).
(215)
#-----------------------------------
# Person class
#-----------------------------------
class Person :
    def __init__(self, fname, lname) :
        self.fname = fname
        self.lname = lname
    def get_first_name(self) :
        return self.fname
    def get_last_name(self) :
        return self.lname
    def get_first_initial(self) :
        return self.fname[0] + "."
    def name_lastfirst(self) :
        return "%s, %s" % (self.get_last_name(), self.get_first_initial())

# -----------------------------------------------
# Parent Reference Class
# -----------------------------------------------
class Reference :
    def __init__(self, year, title) :
        self.id = None
        self.title = title
        self.year = year
    def get_identifier(self) :
        return self.id
    def get_title(self) :
        return self.title
    def get_year(self) :
        return self.year
    def set_year(self, year) :
        self.year = year

# Parent class for references that have authors
class AuthoredWork :
    def __init__(self) :
        self.authors = []
    def get_authors(self) :
        return self.authors
    def add_author(self, author) :
        self.authors.append(author)
    def first_author(self) :
        return self.authors[0].get_last_name()
    def pcitation(self) :
        return "(%s, %s)" % (self.first_author(), self.get_year())

# Parent class for references that have editors
class EditedWork :
    def __init__(self) :
        self.editors = []
    def get_editors(self) :
        return self.editors
    def add_editor(self, editor) :
        self.editors.append(editor)
    def first_editor(self) :
        return self.editors[0].get_last_name()
    def pcitation(self) :
        return "(%s, %s)" % (self.first_editor(), self.get_year())

# -----------------------------------------------
# Child Classes
# -----------------------------------------------
class Book(Reference, AuthoredWork) :
    def __init__(self, year, title) :
        Reference.__init__(self, year, title)
        AuthoredWork.__init__(self)

class JournalArticle(Reference, AuthoredWork) :
    def __init__(self, year, title) :
        Reference.__init__(self, year, title)
        AuthoredWork.__init__(self)
        self.journal = None
    def get_journal(self) :
        return self.journal

class Manuscript(Reference, AuthoredWork) :
    def __init__(self, year, title) :
        Reference.__init__(self, year, title)
        AuthoredWork.__init__(self)
        self.status = None

class EditedBook(Reference, EditedWork) :
    def __init__(self, year, title) :
        Reference.__init__(self, year, title)
        EditedWork.__init__(self)

class BookArticle(Reference, AuthoredWork, EditedWork) :
    def __init__(self, year, title, book_title) :
        Reference.__init__(self, year, title)
        AuthoredWork.__init__(self)
        EditedWork.__init__(self)
        self.set_book_title(book_title)
    def get_book_title(self) :
        return self.book_title
    def set_book_title(self, book_title) :
        self.book_title = book_title


# -----------------------------------------------
# Create 3 reference objects of different types
# and print the output of their pcitation()
# methods
# -----------------------------------------------

# Create person object for author and editor
au = Person('Stephen', 'Levinson')
ed = Person('Melissa', 'Bowerman')

# JournalArticle, only has an author
ja = JournalArticle('2003', 'xxx')
ja.add_author(au)
print(ja.pcitation())

# EditedBook, only has an editor
eb = EditedBook('200X', 'Language and Conceptual Development')
eb.add_editor(ed)
print(eb.pcitation())

# BookArticle, has author and editor
ba = BookArticle('200X', 'Language and Mind: Let\'s Get the Issues Straight',
                 'Language and Conceptual Development')
ba.add_author(au)    # author is also editor
ba.add_editor(au)
ba.add_editor(ed)
print(ba.pcitation())
The code in (215) sets up the inheritance hierarchy shown in figure 15.5, which
is obviously more complicated than the one previously given in Figure 15.4.
[Figure 15.5: Inheritance hierarchy in (215). Parent classes: Reference,
EditedWork, AuthoredWork. Child classes shown: EditedBook, BookArticle, Book.]
Although the inheritance hierarchy is slightly more complicated, the main ad-
vantage is that we have a more sophisticated way of doling out variables and
methods for manipulating them. No references have authors that don’t need them
and no references have editors that don’t need them. But every reference that uses
either does so in the exact same manner thanks to the inheritance of attributes and
methods. For more details concerning how inheritance works, and in particular
for more information about how multiple inheritance is handled, see §??.
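The claim that each reference type acquires only the methods it needs can be
checked directly. What follows is a minimal sketch rather than the code of
(215) itself: the classes are trimmed-down stand-ins, the helper function
capabilities() is a hypothetical name introduced for illustration, and the
code is written so that it also runs on a current Python 3 interpreter.

```python
class Reference:
    def __init__(self, year, title):
        self.year = year
        self.title = title

class AuthoredWork:
    def add_author(self, author):
        # Create the list on first use, so no separate constructor is needed
        if not hasattr(self, 'authors'):
            self.authors = []
        self.authors.append(author)

class EditedWork:
    def add_editor(self, editor):
        if not hasattr(self, 'editors'):
            self.editors = []
        self.editors.append(editor)

class Book(Reference, AuthoredWork):
    pass

class EditedBook(Reference, EditedWork):
    pass

class BookArticle(Reference, AuthoredWork, EditedWork):
    pass

def capabilities(cls):
    """Report whether a class has acquired add_author and add_editor."""
    return (hasattr(cls, 'add_author'), hasattr(cls, 'add_editor'))

for cls in (Book, EditedBook, BookArticle):
    print("%s %s" % (cls.__name__, capabilities(cls)))
```

Only BookArticle, which lists both AuthoredWork and EditedWork among its
parents, ends up with both methods.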
So far, we have seen how inheritance gives the programmer a way to avoid
writing the same method twice by defining it once for a parent class and letting
its child classes acquire it automatically. However, there will be cases where
the behavior of the child classes needs to depart from that of the parent class. To
maintain the same interface but achieve different behavior, it is possible for a child
class to override the method of a parent class. To say that a method is overridden is
really just a fancy way of saying that it is redefined. But what is the rationale for
overriding a method? There are many, but we will illustrate with an example from
our bibliographic classes. Imagine that we want to create a full-fledged system
for tracking references (say, along the lines of EndNote). We will need to flesh
out our object model by adding an attribute for the authors and/or editors of a
reference. Most reference types have authors, some have editors, and a few have
both. (None should have neither.) Here we can see how objects interact with one
another, since authors and editors can be represented by the Person object defined
previously in (207).
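#-----------------------------------
# Person class
#-----------------------------------
class Person :
    def __init__(self) :
        self._fname = None
        self._mname = None
        self._lname = None
    def first_name(self, fname=None) :
        if fname : self._fname = fname
        return self._fname
    def middle_name(self, mname=None) :
        if mname : self._mname = mname
        return self._mname
    def last_name(self, lname=None) :
        if lname : self._lname = lname
        return self._lname
    def first_initial(self) :
        return self._fname[0] + "."
    def name_lastfirst(self) :
        return self.last_name() + ", " + self.first_initial()

# -----------------------------------------------
# Parent Reference Class
# -----------------------------------------------
class Reference :
    def __init__(self) :
        self._id = None
        self._title = None
        self._year = None
        self._authors = []
        self._editors = []
    def get_authors(self) :
        return self._authors
    def get_editors(self) :
        return self._editors
    def add_author(self, author) :
        self._authors.append(author)
    def add_editor(self, editor) :
        self._editors.append(editor)
    def get_id(self) :
        return self._id
    def get_title(self) :
        return self._title
    def set_title(self, title) :
        self._title = title
    def get_year(self) :
        return self._year
    def set_year(self, year) :
        self._year = year
    def pcitation(self) :
        # Default behavior: cite the first author
        return "(" + self.get_authors()[0].last_name() + ", " + self.get_year() + ")"

# -----------------------------------------------
# Child Classes
# -----------------------------------------------
class Book(Reference) :
    def __init__(self) :
        Reference.__init__(self)
        self._publisher = None
        self._where_published = None
    def publisher(self) :
        return self._publisher
    def where_published(self) :
        return self._where_published

class JournalArticle(Reference) :
    def __init__(self) :
        Reference.__init__(self)
        self._journal = None

class Manuscript(Reference) :
    pass

class EditedBook(Reference) :
    def pcitation(self) :
        # Overrides Reference.pcitation(): cite the first editor instead
        return "(" + self.get_editors()[0].last_name() + ", " + self.get_year() + ")"

class BookArticle(Reference) :
    pass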
Now imagine that we want to create a method that will provide a parenthetical
citation for a given reference. Let’s assume that the desired formatting is the last
name of the first author (followed by et al. if there is more than one author)
followed by the year of publication, with a comma separating the two, as illustrated
in (??). (This is more or less the standard format for linguistic journals.)
1. Here is an example of a parenthetical citation with a single author (???,
2006).
2. Here is an example of a parenthetical citation with multiple authors (??? et
al., 2006).
The code in (217) demonstrates how three reference objects can be created,
have their attributes populated, and then have their parenthetical citations printed
out for display. (Note that we are manually setting the attributes of the various
reference objects. We will see later in chapter 20 how they could be automatically
populated from a database.)
# -----------------------------------------------
# Create person objects for author and editor
# and populate their attributes manually
# -----------------------------------------------
au = Person()
au.first_name('Stuart')
au.middle_name('Payton')
au.last_name('Robinson')

ed = Person()
ed.first_name('Timothy')
ed.last_name('Shopen')

# -----------------------------------------------
# Create 3 reference objects of different types
# and print the output of their pcitation()
# methods
# -----------------------------------------------

# JournalArticle, only has an author
ja = JournalArticle()
ja.add_author(au)
ja.set_year('2003')
ja.set_title('Constituent Order in Tenejapa Tzeltal')
print(ja.pcitation())

# EditedBook, only has an editor
eb = EditedBook()
eb.add_editor(ed)
eb.set_year('1996')
eb.set_title('Language Typology and Syntactic Description')
print(eb.pcitation())

# BookArticle, has author and editor
ba = BookArticle()
ba.add_author(au)
ba.add_editor(ed)
ba.set_year('????')
ba.set_title('xxx')
print(ba.pcitation())
The pcitation() methods of the JournalArticle and BookArticle classes
provide the last name of the first author, while the pcitation() method of the
EditedBook class provides the last name of its first editor. The behavior of the
EditedBook class differs from the others because its pcitation() method has been
overridden and its behavior redefined.
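The citation format described above, including the et al. for multiple
authors, can be sketched together with the override as follows. This is an
illustrative sketch under stated assumptions, not the code of the book's
listings: the classes are pared down to the attributes needed here, the helper
format_citation() is a hypothetical name, and the code is written so that it
also runs on a current Python 3 interpreter.

```python
class Person:
    def __init__(self, fname, lname):
        self.fname = fname
        self.lname = lname

def format_citation(people, year):
    """Build '(Last, year)' or '(Last et al., year)' from a list of Persons."""
    name = people[0].lname
    if len(people) > 1:
        name += " et al."
    return "(%s, %s)" % (name, year)

class Reference:
    def __init__(self, year):
        self.year = year
        self.authors = []
        self.editors = []

    def pcitation(self):
        # Default behavior: cite the authors
        return format_citation(self.authors, self.year)

class JournalArticle(Reference):
    pass                      # inherits pcitation() unchanged

class EditedBook(Reference):
    def pcitation(self):
        # Override: an edited book is cited by its editors
        return format_citation(self.editors, self.year)

ja = JournalArticle('2003')
ja.authors.append(Person('Stuart', 'Robinson'))
ja.authors.append(Person('Harald', 'Baayen'))

eb = EditedBook('1996')
eb.editors.append(Person('Timothy', 'Shopen'))

print(ja.pcitation())
print(eb.pcitation())
```

Both objects are asked for a citation through the same method name; only the
class of the object decides which definition runs.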
15.3.4 Modularity
The concept of modularity in OO programming refers to the idea of grouping
code together into discrete units, called modules, which can be optionally included
in a program to provide well-defined functional bundles. It does not refer to the
discreteness of objects, which is so central to the entire idea behind object
orientation that it usually goes without saying, but rather to the grouping of related
entities (objects) into larger units. The partitioning of a system into modules has
a number of advantages, not the least of which is the ability to include or exclude
modules according to one’s needs.
The concept of modularity runs deep in Python: the core library of the
language is itself organized into modules, each of which provides
thematically related functionality. For example, the datetime module provides
a variety of class types that facilitate the manipulation of dates and times. The
code in (220) shows how the datetime module can be used in a script to
calculate a person’s current age given their date of birth. (Note that the year, month,
and day of the birthdate are passed to the script as arguments, e.g.,
ooModularityEx1.py 1973 10 28.)
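(220) ooModularityEx1.py
import sys
from datetime import date

# Year, month, and day of birth are passed as command-line arguments
year = int(sys.argv[1])
month = int(sys.argv[2])
day = int(sys.argv[3])
print("Birthdate: %s-%s-%s" % (year, month, day))
bday = date(year, month, day)
age = date.today() - bday     # the difference is a timedelta object
print(age)                    # prints the elapsed time, counted in days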
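The script in (220) reports the age as an elapsed number of days. If an age in
whole years is wanted, one possibility is sketched below. The function name
age_in_years and its optional today parameter are assumptions introduced for
illustration; the code is written so that it also runs on a current Python 3
interpreter.

```python
from datetime import date

def age_in_years(birthdate, today=None):
    """Return the number of completed years between birthdate and today."""
    if today is None:
        today = date.today()
    years = today.year - birthdate.year
    # Subtract one if this year's birthday has not yet been reached
    if (today.month, today.day) < (birthdate.month, birthdate.day):
        years -= 1
    return years

print(age_in_years(date(1973, 10, 28), date(2007, 4, 17)))
```

Comparing the (month, day) pairs avoids dividing a day count by 365, which
would drift because of leap years.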
To continue with our bibliography example, we can put all of the code that
defines the various bibliography classes into a single file, which we’ll call
references.py.
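(219) references.py
#-----------------------------------
# Person class
#-----------------------------------
class Person :
    def __init__(self, fname, lname) :
        self.fname = fname
        self.lname = lname
    def get_first_name(self) :
        return self.fname
    def get_last_name(self) :
        return self.lname
    def get_first_initial(self) :
        return self.fname[0] + "."
    def name_lastfirst(self) :
        return "%s, %s" % (self.get_last_name(), self.get_first_initial())

# -----------------------------------------------
# Parent Reference Class
# -----------------------------------------------
class Reference :
    def __init__(self, year, title) :
        self.id = None              # no identifier assigned yet
        self.title = title
        self.year = year
    def get_id(self) :
        return self.id
    def get_title(self) :
        return self.title
    def get_year(self) :
        return self.year
    def set_year(self, year) :
        self.year = year

class AuthoredWork :
    def __init__(self) :
        self.authors = []
    def get_authors(self) :
        return self.authors
    def add_author(self, author) :
        self.authors.append(author)
    def first_author(self) :
        return self.authors[0].get_last_name()
    def pcitation(self) :
        return "(%s, %s)" % ( self.authors[0].get_last_name(), self.get_year())

class EditedWork :
    def __init__(self) :
        self.editors = []
    def get_editors(self) :
        return self.editors
    def add_editor(self, editor) :
        self.editors.append(editor)
    def first_editor(self) :
        return self.editors[0].get_last_name()
    def pcitation(self) :
        return "(%s, %s)" % (self.editors[0].get_last_name(), self.get_year())

# -----------------------------------------------
# Child Classes
# -----------------------------------------------
class Book(Reference, AuthoredWork) :
    def __init__(self, year, title) :
        Reference.__init__(self, year, title)
        AuthoredWork.__init__(self)

class JournalArticle(Reference, AuthoredWork) :
    def __init__(self, year, title) :
        Reference.__init__(self, year, title)
        AuthoredWork.__init__(self)
        self.journal = None
    def get_journal(self) :
        return self.journal

class Manuscript(Reference, AuthoredWork) :
    def __init__(self, year, title) :
        Reference.__init__(self, year, title)
        AuthoredWork.__init__(self)
        self.status = None

class EditedBook(Reference, EditedWork) :
    def __init__(self, year, title) :
        Reference.__init__(self, year, title)
        EditedWork.__init__(self)

class BookArticle(Reference, AuthoredWork, EditedWork) :
    def __init__(self, year, title, book_title) :
        Reference.__init__(self, year, title)
        AuthoredWork.__init__(self)
        EditedWork.__init__(self)
        self.set_book_title(book_title)
    def get_book_title(self) :
        return self.book_title
    def set_book_title(self, book_title) :
        self.book_title = book_title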
Other programs can then access the code using an import statement, such as
from references import Book; the script in (221) in §15.3.5 does exactly this.
For more details about modules and how to import them, see chapter 13,
which goes into the topic in considerable detail.
15.3.5 Typing
The concept of typing refers to the way that objects belong to types. To say that
an object belongs to a type is a fairly trivial claim. Since classes define objects
and each object is created on the basis of its class definition, every object
belongs to a type, namely its class.
While some OO languages are very strictly typed, Python is more lenient. It is
unnecessary to declare the type of a method’s argument, as we have already seen
in §14, where we defined a method for the Reference class that takes a Person
object as its argument. The advantage of not strictly enforcing typing is flexibility,
but the drawback is unpredictability. For example, if a programmer misunderstood
the nature of the add_author() method and passed a string, rather than a Person
object, as an argument, the error would not be caught immediately. In fact,
the call to add_author() would succeed without complaint; the problem would
surface only later, when the pcitation() method is called, as illustrated in (221).
(221) ooTypingEx1.py
from references import Book

r = Book('2003', 'Text Processing in Python')
r.add_author('David Mertz')
print(r.pcitation())
Running the code in (221) will result in an error, as shown below:
[stuart@localhost tex]$ python code/python/ooTypingEx1.py
Traceback (most recent call last):
  File "code/python/ooTypingEx1.py", line 5, in ?
    print(r.pcitation())
  File "/home/stuart/code/python/references.py", line 48, in pcitation
    return "(%s, %s)" % ( self.authors[0].get_last_name(), self.get_year())
AttributeError: 'str' object has no attribute 'get_last_name'
This is due to the fact that a string object was appended to the authors list
when a Person object was expected. When the method pcitation() is called,
it attempts to retrieve a Person object and call various methods on it. However,
since a string object rather than a Person object is retrieved, and a string object
has no get_last_name() method, the result is an error.
If type-checking is desired, there are mechanisms available for its enforcement
in Python. Exception raising can be used to prevent the misuse of methods that
expect a particular argument type. For example, the add_author() method
could be revised to check the type of its single argument and to raise an exception
if the argument is not of the expected type, as shown in (222).
(222) ooTypingEx2.py
from references import Person

class Reference :
    def __init__(self) :
        self.authors = []
    def add_author(self, author) :
        # Accept the argument only if it is a Person object
        if isinstance(author, Person) :
            self.authors.append(author)
        else :
            raise Exception("The argument of add_author must be a Person object.")

r = Reference()
r.add_author('Stuart P. Robinson')
Now, when (222) is run, an error results as soon as the add_author()
method is invoked with an unexpected argument type, as shown in (223).
(223)
Traceback (most recent call last):
  File "ex_typing3.py", line 16, in ?
    r.add_author('Stuart P. Robinson')
  File "ex_typing3.py", line 12, in add_author
    raise Exception("The argument of add_author must be a Person object.")
Exception: The argument of add_author must be a Person object.
Although type-checking doesn’t prevent errors from occurring, it does ensure
that they are more readily identified and therefore more easily rectified. The real
strength of type-checking is that it simplifies debugging, which is often the most
difficult and frustrating part of programming.
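The benefit of failing early can be seen in a small self-contained sketch. It
departs from the listing above in one respect that should be named plainly: it
raises the built-in TypeError rather than a generic Exception, which lets
calling code catch this particular mistake separately. The classes are
simplified stand-ins, try_add() is a hypothetical helper, and the code is
written so that it also runs on a current Python 3 interpreter.

```python
class Person:
    def __init__(self, fname, lname):
        self.fname = fname
        self.lname = lname

class Reference:
    def __init__(self):
        self.authors = []

    def add_author(self, author):
        # Reject anything that is not a Person before it enters the list
        if not isinstance(author, Person):
            raise TypeError("add_author() expects a Person object, got %s"
                            % type(author).__name__)
        self.authors.append(author)

def try_add(ref, author):
    """Return None on success, or the error message if the type is wrong."""
    try:
        ref.add_author(author)
    except TypeError as err:
        return str(err)
    return None

r = Reference()
print(try_add(r, Person('David', 'Mertz')))
print(try_add(r, 'David Mertz'))
```

Because the bad value never reaches the authors list, later calls that walk
the list can rely on every element being a Person.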
15.3.6 Polymorphism
Polymorphism refers to the way that a method with a single name can take a
different number of arguments. [XXX] An example should make this concept
clearer. Returning to the example discussed in §15.1, let’s imagine how we could
change the script that takes the data file and prints out each person’s name with
the last name followed by the initial of the first name.
(224) ooPolymorphismEx1.py
import sys
from datetime import date

#-----------------------------------
# Class definition
#-----------------------------------
class Person :
    def __init__(self) :
        self._fname = None
        self._lname = None
        self._bdate = None
    def set_first_name(self, fname) :
        self._fname = fname
    def set_last_name(self, lname) :
        self._lname = lname
    def set_birthdate(self, bdate) :
        self._bdate = bdate
    def get_first_name(self) :
        return self._fname
    def get_last_name(self) :
        return self._lname
    def get_birthdate(self) :
        return self._bdate
    def age(self) :
        age = date.today() - self.get_birthdate()
        return age.days           # days is an attribute, not a method

#-----------------------------------
# Main program
#-----------------------------------
# Open file (assumed here to be named by the first command-line argument)
f = open(sys.argv[1])

# Read in lines
lines = f.readlines()
f.close()

# Process lines
people = []
for l in lines :
    # Columns are assumed here to be separated by whitespace
    cols = l.split()

    # Data columns
    fname = cols[0]
    lname = cols[1]
    bdayParts = cols[3].split('-')
    year = int(bdayParts[0])
    month = int(bdayParts[1])
    day = int(bdayParts[2])
    bday = date(year, month, day)

    p = Person()
    p.set_first_name(fname)
    p.set_last_name(lname)
    p.set_birthdate(bday)
    people.append(p)

# Extract data
for p in people :
    print(p.get_first_name() + ", " + \
          p.get_last_name()[0] + "." + \
          "\t" + str(p.age()))
The age() method assumes that age is calculated by taking the difference
between the current date and a person’s birthdate. However, at some point we
might want to be able to determine a person’s age on a particular date in the past.
For the sake of our example, let’s assume that we want to know a person’s age in
19XX (the year that the initial version of Python was publicly released).
(225) ooPolymorphismEx2.py
import sys
from datetime import date

#-----------------------------------
# Class definition
#-----------------------------------
class Person :
    def __init__(self) :
        self._bdate = None
        self._fname = None
        self._lname = None
    def setFirstName(self, fname) :
        self._fname = fname
    def setLastName(self, lname) :
        self._lname = lname
    def setBirthdate(self, bdate) :
        self._bdate = bdate
    def getFirstName(self) :
        return self._fname
    def getLastName(self) :
        return self._lname
    def getBirthdate(self) :
        return self._bdate
    def age(self, date=date.today()) :
        age = date - self.getBirthdate()
        return age.days           # days is an attribute, not a method

#-----------------------------------
# Main program
#-----------------------------------
# Open file (assumed here to be named by the first command-line argument)
f = open(sys.argv[1])

# Read in lines
lines = f.readlines()
f.close()

# Process lines
people = []
for l in lines :
    # Columns are assumed here to be separated by whitespace
    cols = l.split()

    # Data columns
    fname = cols[0]
    lname = cols[1]
    bdayParts = cols[3].split('-')
    year = int(bdayParts[0])
    month = int(bdayParts[1])
    day = int(bdayParts[2])
    bday = date(year, month, day)

    p = Person()
    p.setFirstName(fname)
    p.setLastName(lname)
    p.setBirthdate(bday)
    people.append(p)

# Extract data
for p in people :
    # Placeholder: year, month, and day must be supplied as integers
    refDate = date('1973', '??', '??')
    print("%s, %s\t%s" % (p.getFirstName(), p.getLastName(), p.age(refDate)))
When no date is passed to the age() method, the age is calculated using the
current date. However, when the age() method takes an argument, the age is
calculated using the date supplied to it.
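One detail of default arguments deserves a hedged note. Python evaluates a
default value once, when the def statement is executed, so a default of
date.today() is fixed at the moment the class is defined rather than looked up
on each call. For a short script this rarely matters, but a common alternative
is to default to None and fill in the current date inside the method. The
sketch below illustrates that alternative; the method name age_in_days and the
parameter name on_date are assumptions introduced for illustration.

```python
from datetime import date

class Person:
    def __init__(self, bdate):
        self._bdate = bdate

    def age_in_days(self, on_date=None):
        # None stands for "today" and is resolved at call time
        if on_date is None:
            on_date = date.today()
        return (on_date - self._bdate).days

p = Person(date(1973, 10, 28))
print(p.age_in_days(date(1973, 11, 28)))   # age on a supplied date
print(p.age_in_days())                     # age today
```

The method can still be called with or without a date, so the interface seen
by the caller is the same as before.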
15.4 Exercises
• Many aspects of linguistic systems exemplify some sort of inheritance
hierarchy. What are some examples? Do these show single or multiple
inheritance?
• TYPE COERCION
• ???
• ???
• ???

The literature on OO programming is quite large. For convenience, we can break
down the available literature into the following categories:
History A review of the history of OO programming goes beyond the scope of
this book, but some historical perspective can be found in ? or ? [BYTE
MAGAZINE ARTICLE?]
Basic Introduction to OO There are numerous books that provide an
introduction to the basic concepts of object-oriented programming. Most of them
cover the same ground as this chapter, though in more depth. Both Budd
(1996) and Weisfeld (2000) provide an introduction to object-oriented
programming that is not focussed on a particular programming language.
Python OO Most books about Python provide either general documentation of
the language or topic-specific coverage of more advanced techniques. The
best places to look for more information about object-oriented programming
in Python are articles from magazines and trade journals, such as Dr Dobb’s
or ???.
Advanced OO Techniques ???
Chapter 16
Classes and Objects
In the previous chapter, we discussed object-oriented programming in fairly broad
terms, without going into the technical details of object orientation in Python.
In this chapter, we will get our hands dirty with the nitty-gritty
details so that the abstract knowledge gained about object-oriented programming
can be put into effective practice.
16.1 Defining Classes
Defining classes in Python is very straightforward. To illustrate, let’s create a
class that will be used to store data about individual people. We will call the
class Person. (The convention for naming classes is to capitalize them in the same
manner as proper nouns, such as Alpha Centauri or Alfred Bester.) The code that
defines this Person class is provided in (226), followed by some code that creates
a Person object (i.e., instantiates the Person class), sets the name field with the
set_name method, and then retrieves the name field with the get_name method
in order to print it.
(226) classesDefinitionEx1.py
# ----------------------------------------------
# Class Definition
# ----------------------------------------------
class Person :                      # name of the class
    def __init__(self) :            # constructor
        self._name = None           # instance variable
    def set_name(self, name) :      # method
        self._name = name           # sets name var
    def get_name(self) :            # method
        return self._name           # returns name var

# ----------------------------------------------
# Class Usage
# ----------------------------------------------
p = Person()                        # instantiate class
p.set_name('Joe Haldeman')          # set variable w/ accessor method
print(p.get_name())                 # get variable w/ accessor method
There are a number of things to note concerning the definition of the Person
class. It has a constructor method, which Python requires to be named __init__,
that initializes a single variable, self._name. It also defines two methods, one
for setting the instance variable (a setter method) and another for retrieving it (a
getter method). The idea behind providing methods to access instance variables
is to encapsulate the data (cf. §15.3.2) so that anyone using the class interacts
with the interface defined by the author of the class. (Since they provide access to
variables, getter and setter methods are typically referred to as accessor methods.)
This class could be simplified and improved in a few ways. First, we can use
a single method for both getting and setting our instance variable, thereby making
our code more succinct, as shown in (227). (Succinctness is only a virtue if it does
not sacrifice performance or readability. In this case, we believe it does not.)
# ----------------------------------------------
# Class Definition
# ----------------------------------------------
class Person :                      # name of the class
    def __init__(self) :            # constructor
        self._name = None           # instance variable
    def name(self, name=None) :     # method
        if name :                   # check argument
            self._name = name       # set if provided
        return self._name           # return variable

# ----------------------------------------------
# Class Usage
# ----------------------------------------------
p = Person()                        # instantiate class
p.name('Bruce Lee')                 # method call
print(p.name())                     # method call
The single accessor method now returns the value of the name variable, set-
ting it beforehand if the method is provided with an argument. (We take advantage
of argument defaults, setting the value of the argument to None by default.) This
class is still less than ideal, however, since it leaves the name variable too unstruc-
tured. A name can be broken down into parts and each of these can be saved as
separate variables, as in (228).
# ----------------------------------------------
# Class Definition
# ----------------------------------------------
class Person :
    ## Constructor
    def __init__(self) :
        self._fname = None
        self._lname = None

    ## Methods
    def first_name(self, fname=None) :
        if fname :
            self._fname = fname
        return self._fname

    def last_name(self, lname=None) :    # method
        if lname :
            self._lname = lname
        return self._lname

# ----------------------------------------------
# Class Usage
# ----------------------------------------------
p = Person()
p.first_name('Bruce')
p.last_name('Lee')
print(p.first_name() + ' ' + p.last_name())
Note that we reconstruct the full name by concatenating the first name and
the last name. There is a better alternative. We can take advantage of object
orientation to format the full name in whatever manner we please by adding a
method that does the formatting automatically and intelligently, as shown in (229).
# ----------------------------------------------
# Class Definition
# ----------------------------------------------
class Person :

    ## Constructor
    def __init__(self) :
        self._fname = None
        self._lname = None

    ## Methods
    def first_name(self, fname=None) :
        if fname :
            self._fname = fname
        return self._fname

    def last_name(self, lname=None) :
        if lname :
            self._lname = lname
        return self._lname

    def full_name(self) :
        return self._fname + ' ' + self._lname

# ----------------------------------------------
# Class Usage
# ----------------------------------------------
p = Person()
p.first_name('Bruce')
p.last_name('Lee')
print(p.full_name())
By defining a method that does the formatting of the name automatically, we
save ourselves time when programming (since it’s easier to invoke the method
full_name than to concatenate two method calls) and reduce the likelihood of
inconsistency. We can ensure that names are always formatted in the exact same
manner. In addition, there is no sacrifice of flexibility for consistency, since we can
always define other methods to format the names differently. In fact, it is possible
to define a much more intelligent name-formatting method, as in (230), where
we add a method for middle names and have the full_name method handle it
intelligently (by omitting it if undefined). Note that we have also added a flag that
determines which comes first: the first name or the last name.
# ----------------------------------------------
# Here the class is defined...
# ----------------------------------------------
class Person :
    ## Constructor
    def __init__(self) :
        self.__fname = None
        self.__lname = None

    ## Methods
    def first_name(self, fname=None) :
        if fname : self.__fname = fname
        return self.__fname
    def last_name(self, lname=None) :
        if lname : self.__lname = lname
        return self.__lname
    def full_name(self) :
        return self.__fname + ' ' + self.__lname

# ----------------------------------------------
# Here the class is used...
# ----------------------------------------------
p = Person()
p.first_name('Bruce')
p.last_name('Lee')
print(p.full_name())
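The middle-name handling and the ordering flag described in the text can be
sketched as follows. This is one possible reading of that description, not the
authors' implementation: the constructor signature and the flag name
last_first are assumptions introduced for illustration, and the code is
written so that it also runs on a current Python 3 interpreter.

```python
class Person:
    def __init__(self, fname, lname, mname=None):
        self._fname = fname
        self._mname = mname       # None when no middle name is known
        self._lname = lname

    def full_name(self, last_first=False):
        # Leave the middle name out when none was supplied
        given = [n for n in (self._fname, self._mname) if n]
        if last_first:
            return self._lname + ", " + " ".join(given)
        return " ".join(given + [self._lname])

p = Person('Stuart', 'Robinson', 'Payton')
q = Person('Bruce', 'Lee')
print(p.full_name())
print(p.full_name(last_first=True))
print(q.full_name())
```

Because the flag defaults to False, existing calls to full_name() with no
argument keep their behavior.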
16.1.1 Constructors
A constructor is a method that is called automatically when an object is
created and that sets up the object’s initial state.
In Python, the special method __init__ serves as the constructor. We
saw it in operation in numerous previous examples, the most recent being (230).
16.2 Class Types

TODO
16.3 Variables
16.3.1 Instance Variables
The most common type of variable is the instance variable. Instance variables
are variables that are associated with instances of a class. Therefore, their values
potentially differ from one instance of a class to the next. For example, in (231),
we create three separate instances of a custom class called Word.
(231) classesVarInstanceEx1.py
class Word :
    def __init__(self, wordform) :
        self.wordform = wordform
    def getWordForm(self) :
        return self.wordform

for wf in ['chimp', 'orangutan', 'gorilla'] :
    w = Word(wf)
    print(w.getWordForm())
16.3.2 Class Variables

A class variable, unlike an instance variable, is part of the class, and not of a
particular instance of the class. Therefore, a class variable has only one value for
all instances of the class.
(232) classesVarClassEx1.py
class Word :
    def __init__(self, wordform) :
        self.wordform = wordform
    def getWordForm(self) :
        return self.wordform

for wf in ['chimp', 'orangutan', 'gorilla'] :
    w = Word(wf)
    print(w.getWordForm())
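To make the contrast with instance variables concrete, here is a minimal
sketch of a class variable in use. The counter named count is an assumption
introduced for illustration; it is defined in the body of the class rather
than on self, so all Word objects share the one value.

```python
class Word:
    count = 0                        # class variable: one value for the class

    def __init__(self, wordform):
        self.wordform = wordform     # instance variable: one value per object
        Word.count += 1              # update the shared counter

words = [Word(wf) for wf in ['chimp', 'orangutan', 'gorilla']]

print(Word.count)                    # accessed through the class
print(words[0].count)                # the same value, seen through an instance
```

Each object keeps its own wordform, but asking any of them for count gives the
number of Word objects created so far.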
16.4 Privacy
Privacy refers to the accessibility of attributes (variables) and methods (functions)
defined for a class. An attribute or method is public if it is accessible from outside
an object; it is private if it is not. (Privacy therefore relates to encapsulation,
one of the major concepts behind object-oriented programming discussed in the
previous chapter.) In Python, variables and methods are public unless their name
begins with a double underscore, which flags them as private.1 Private variables
cannot be directly accessed by that name from outside the class, whereas public
variables can be directly accessed using the dot notation illustrated in (233).
1 Some OO programming languages (e.g., Java) use a keyword to flag variables as
private, but Python’s convention for private variables is much simpler.
(233) classesPrivacyEx1.py
# ----------------------------------------------
# Here the class is defined...
# ----------------------------------------------
class Person :
    def __init__(self, fname, lname) :    # Constructor
        self.fname = fname                # Initialize public var
        self.lname = lname                # Initialize public var

# ----------------------------------------------
# Here the class is used...
# ----------------------------------------------
p = Person('Stuart', 'Robinson')          # Instantiate object
print(p.fname)                            # Directly access public var
print(p.lname)                            # Directly access public var
In (233), two instance variables are defined for the Person class: fname
and lname. They are assigned a value by the constructor, which takes two
arguments. There are no accessor methods for these variables, but they can
be directly accessed, as shown by the lines where they are printed out. If, however,
we change the definition of the Person class so that the names of the two variables
begin with double underscores, any attempt to directly access the Person object’s
instance variables will result in an error, as can be seen by running (234).
# ----------------------------------------------
# Here the class is defined...
# ----------------------------------------------
class Person :
    def __init__(self, fname, lname) :    # Constructor
        self.__fname = fname              # Initialize private var
        self.__lname = lname              # Initialize private var

# ----------------------------------------------
# Here the class is used...
# ----------------------------------------------
p = Person('Stuart', 'Robinson')          # Instantiate object
print(p.__fname)    # !!! Directly access private var !!!
print(p.__lname)    # !!! Directly access private var !!!
In order to gain access to a private variable, an accessor method must be used.

In (235), we show how the use of private variables with an accessor method can
help ensure the proper capitalization of names for our Person object.
# ----------------------------------------------
# Here the class is defined...
# ----------------------------------------------
class Person :
    def __init__(self, fname, lname) :       # Constructor
        self.__fname = fname                 # Initialize private var
        self.__lname = lname                 # Initialize private var
    def get_first_name(self) :               # Public accessor method
        return self.__fname.capitalize()     # Return var capitalized
    def get_last_name(self) :                # Public accessor method
        return self.__lname.capitalize()     # Return var capitalized

# ----------------------------------------------
# Here the class is used...
# ----------------------------------------------
p = Person('stuart', 'robinson')             # Instantiate object
print(p.get_first_name())                    # Print __fname
print(p.get_last_name())                     # Print __lname
Even though the Person object is instantiated in (235) with two names in
all lowercase, when the accessor method is called, the capitalize() string
method is called on the variables before they are returned, thereby ensuring that
they are printed out in the proper case.
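It may help to see how the double underscore works under the hood. Python
implements it by name mangling: inside the class body, a name such as __fname
is stored as _Person__fname. The attribute is therefore hidden under its short
name rather than strictly inaccessible. The sketch below demonstrates this;
the helper direct_access_fails() is a hypothetical name introduced for
illustration.

```python
class Person:
    def __init__(self, fname):
        self.__fname = fname                  # stored as _Person__fname

    def get_first_name(self):
        return self.__fname.capitalize()

def direct_access_fails(obj):
    """Return True if obj.__fname cannot be read from outside the class."""
    try:
        obj.__fname
    except AttributeError:
        return True
    return False

p = Person('stuart')
print(direct_access_fails(p))     # the short name is not visible out here
print(p.get_first_name())         # the accessor method works as intended
print(p._Person__fname)           # the mangled name is still reachable
```

Reaching for the mangled name defeats the purpose of the convention, so code
outside the class should go through the accessor methods.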
16.5 Inheritance
In the previous chapter, we covered the concept of inheritance in object-oriented
programming.
16.5.1 Single Inheritance

Single inheritance refers to a situation where a class inherits from one and only
one other class, as in (10).
(10)
class Car :
    def __init__(self, price) :
        self.__price = price
class Fiat(Car) : pass
Note that the parent class does not extend another class and therefore no
parentheses are required after its name.
16.5.2 Multiple Inheritance

Python places no restrictions on the number of classes from which a class can
inherit. In other words, Python permits multiple inheritance (unlike some other
object-oriented programming languages, such as Java).
The main trick in multiple inheritance is determining which attribute or method
to use when a class inherits from multiple parents. Python searches the parent
classes in left-to-right order. Therefore, if we define three parent classes with
different implementations of the same method, and three child classes that inherit
from them, as in (236), we see that it is the first parent class listed in their
inheritance list whose method gets called.
(236) classesMultipleInheritanceEx1.py
class A :
    def printout(self) :
        print('A')

class B :
    def printout(self) :
        print('B')

class C :
    def printout(self) :
        print('C')

class D(A, B, C) :
    pass

class E(B, A, C) :
    pass

class F(C, A, B) :
    pass

d = D()
d.printout()

e = E()
e.printout()

f = F()
f.printout()
However, if we create a more complicated inheritance hierarchy, like the one
in (237), the behavior becomes a bit more difficult to predict.
(237) classesMultipleInheritanceEx2.py
class A :
    def printout(self) :
        print('A')

class B(A) :
    pass

class C :
    def printout(self) :
        print('C')

class D(A, B, C) :
    pass

class E(B, A, C) :
    pass

class F(C, A, B) :
    pass

d = D()
d.printout()

e = E()
e.printout()

f = F()
f.printout()
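A hedged note for readers on a current interpreter: in Python 3 (and for
new-style classes in later Python 2 releases) the search order is computed
once per class and stored in the __mro__ attribute, and Python refuses to
create a class whose parent list is contradictory, such as one that lists A
before its own subclass B. The sketch below, written for Python 3, inspects
the order for E and tests whether the parent lists of D and F from (237) are
accepted. The helpers mro_names() and can_define() are hypothetical names
introduced for illustration.

```python
class A(object):
    def printout(self):
        return 'A'

class B(A):
    pass

class C(object):
    def printout(self):
        return 'C'

class E(B, A, C):
    pass

def mro_names(cls):
    """List the class names in the order Python searches them."""
    return [k.__name__ for k in cls.__mro__]

def can_define(name, bases):
    """Try to create a class with the given parents; report success."""
    try:
        type(name, bases, {})
    except TypeError:
        return False
    return True

print(mro_names(E))
print(E().printout())
print(can_define('D', (A, B, C)))    # A listed before its subclass B
print(can_define('F', (C, A, B)))    # likewise A before B
```

Printing __mro__ in this way is a quick means of predicting which parent's
method will be called before relying on it.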
16.6 Exercises
• ???
• ???
• ???

???
Chapter 17
An Introduction to Regular
Expressions
This chapter aims to teach the reader about regular expressions, a very powerful
tool for text processing, and how they can be exploited for language research.
Since regular expressions are reasonably standardized, it is possible to introduce
the basics of regular expressions without worrying initially about how they are
handled in Python. This chapter will do just that. It can therefore be skipped by
readers who already have a good familiarity with regular expressions from other
contexts (e.g., the Unix utility grep). In the next chapter, we will look specifically
at how regular expressions are handled by Python.
17.1 What is Text?

It should be clear by now that when we talk about text, what we really mean
are strings (see §7.1 for the quick intro or Chapter 14 for in-depth information).
Strings are used to represent everything from phonemes to sentences to para-
graphs. There is no real upward limit on the amount of linguistic information
that can be encoded by a string (other than the upward limits on string length and
system memory resources). Therefore, a string might consist of a word, as in
(11); a phrase, as in (12); a sentence, as in (13) (a quote from Ransom K. Ferm);
or something more (a paragraph, even a novel).
(11) word
(12) 28 Oct 1973
(13) Every passing hour brings the Solar System forty-three
thousand miles closer to Globular Cluster M13 in Hercules
- and still there are some misfits who insist that there
is no such thing as progress.
Regular expressions provide a means of describing a pattern, which can then
be compared to a string to see whether they match. This is why the use of regular
expressions is sometimes referred to as pattern matching. To make this a bit more
concrete, we can return to the date found in (12). If we wanted to find all of the
dates in a text, we could look for every date formatted in the same fashion as
(12). More narrowly, a date formatted like (12) consists of a specific pattern,
namely one or two digits, a space, three or more letters, another space, and four
digits, as illustrated in Figure 17.1.
String piece    Description
28              one or two digits
(space)         space
Oct             three or more letters
(space)         space
1973            four digits
Figure 17.1 Breakdown of Date String
What we need is a way of describing this pattern, some kind of formula, with
variables that stand for letters, numbers, etc. Regular expressions provide pre-
cisely that, as can be seen from (14), which provides a regular expression that
matches the pattern of (12).
(14) [0-9]{1,2} [A-Z][a-z]+ [0-9]{2,4}
The correspondence between a slightly extended version of (14), which also allows
an optional apostrophe before the year, and a few sample dates is shown in
Figure 17.2.
[0-9]{1,2}    [A-Z][a-z]+    '?[0-9]{2,4}
28            Oct            1973
1             January        '73
05            Sept           73
Figure 17.2 Regular Expression Matching
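Although Python's handling of regular expressions is left to the next chapter, the correspondence in Figure 17.2 can already be checked with a minimal sketch. It assumes Python 3 and the standard re module; the variable names and the fourth sample string (which lacks a day) are invented for illustration.

```python
import re

# Pattern from Figure 17.2: day, month name, year with an optional apostrophe.
date_pattern = re.compile(r"[0-9]{1,2} [A-Z][a-z]+ '?[0-9]{2,4}")

samples = ["28 Oct 1973", "1 January '73", "05 Sept 73", "Oct 1973"]

# fullmatch succeeds only if the whole string fits the pattern.
results = {s: bool(date_pattern.fullmatch(s)) for s in samples}
for sample, matched in results.items():
    print(sample, matched)
```

The first three strings fit the pattern; the last one does not, because it has no leading digits.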
But how does all of this work? The first step on the road to understanding regular
expressions is to get a grip on the difference between two types of characters
found in a regular expression: literal characters and metacharacters. A literal
character stands for itself in a regular expression. In other words, a literal
character is simply a character that matches itself. Therefore, the regular
expression in (15) would match the string star, even if that string is found within
a larger string (such as upstart, starship, star-shaped, or starry).
(15) star
It’s as simple as that. Metacharacters, however, are an entirely different kettle
of fish.1 They are special characters that stand for more than themselves. For
example, a dot (.) doesn’t stand for a dot. Rather, it is a wildcard that stands for
any character. It will match a single letter, number, or punctuation mark.
Metacharacters are the real workhorses of regular expressions and this tutorial
is devoted to them. Fortunately, they are few in number (and therefore pretty easy
to memorize). We turn to them now.
17.2 Metacharacters
We already established that regular expressions are strings that capture patterns in
strings. They can be compared to a string to find a match and therefore are used
in doing text searches (among other things). To illustrate how it all works, we
will begin with what seems on the surface of things to be a fairly simple search.
Suppose that we want to search an English text for every instance of the indefinite
article. No doubt the reader has done some sort of search before in a word pro-
cessing program (e.g., Microsoft Word) or a web browser (e.g., Netscape). The
problem is, the indefinite article has more than one form. It can be either a or an. A
simple text search would require two passes, once for a and a second time for an.
To do this, we will use a short search script, named regexes1FileSearch,
that takes two command-line arguments, a file and a regular expression, and finds
every match for the regular expression in the file’s contents. Don’t worry for now
about exactly how this program works, since we won’t go into the details of how
Python handles regular expressions until the next chapter. The main point is that,
given a file and a regular expression, it finds every line of the file matching the
regular expression.
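For readers who would like a rough picture anyway, the core of such a line search can be sketched in a few lines. This is not the regexes1FileSearch script itself, which is not reproduced here; it is a hypothetical stand-in, assuming Python 3 and the standard re module, with an invented function name and invented sample lines in place of a file.

```python
import re

def matching_lines(lines, pattern):
    """Return every line that contains a match for the regular expression."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# Invented sample lines standing in for the contents of a file.
sample_lines = [
    "The starship left at dawn.",
    "Nobody saw it go.",
    "It was a star-shaped craft.",
]

hits = matching_lines(sample_lines, "star")
print(hits)
```

Because an open file can be looped over line by line, the same function could be handed a file object instead of a list.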
To test our regular expressions, we will use a small corpus consisting of the
first chapter of a handful of science fiction novels. These novels are all part of
Orion’s Science Fiction Masterworks Series and are listed below in Table 17.1.2
1 The prefix meta- may sound like postmodern mumbo-jumbo, but it actually makes a good
deal of sense. Just as metacharacters are characters that operate on characters (allowing you to
search for them), a metalanguage is a language that operates on language (allowing you to talk
about it).
2 For a full listing of the titles available in the series, including some reviews, see Orion’s
official web site www.orionbooks.co.uk.
Author             Title                                   File
Aldiss, Brian      Non-Stop                                nonstop-ch1.txt
Blish, James       A Case of Conscience                    conscience-ch1.txt
Dick, Philip K.    Do Androids Dream of Electric Sheep?    sheep-ch1.txt
Haldeman, Joe      The Forever War                         forever-ch1.txt
Le Guin, Ursula    The Dispossessed                        dispossessed-ch1.txt
Le Guin, Ursula    The Lathe of Heaven                     lathe-ch1.txt
Vonnegut, Kurt     The Sirens of Titan                     sirens-ch1.txt
Table 17.1 Science Fiction Novels Used for Sample Corpus
To see how the program can be called to perform a regular expression search of
one of the texts in the corpus, have a look at the screenshot of regexes1FileSearch
being used to search for the word star in the first chapter of Brian Aldiss’ novel
Non-Stop.
Figure 17.3 Using regexes1FileSearch.py to search for star in Non-Stop
What we find is that the two searches return many lines that don’t have indefinite
articles. Why is this the case? The answer is simple. We are doing a search
for the letter a or the letters an, but we haven’t told the regular expression that
these two strings are supposed to be stand-alone words, so it finds them as parts
of other words. Every line containing the letter a is matched, which is clearly
not what we want. Remember, regular expressions match strings, not words. To
match words, we need to tell the regular expression how to identify word boundaries,
which is not entirely straightforward. For now, we will assume that a word
is anything surrounded by spaces and simply put spaces around a and an in the
regular expression. (We will see later, in §17.2.8, that there are superior alternatives
to this approach.) Repeating the search, we now get more sensible results.
However, it is tedious to run two searches. What we need is a way of collapsing
these two searches into one.
There are a couple of ways to do this with regular expressions, but let’s keep
it simple and do an either-or search. Either we’ll search for one form of the article
or the other. This can be done using the vertical bar, which separates alternatives
from one another (see §17.2.1), as in (16), where a regular expression is provided
that searches for either a or an surrounded by spaces (note the space before the
opening parenthesis and the space after the closing one).
(16)  (a|an) 
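The difference between searching for the bare letter and searching for the article between spaces can be seen in a small sketch. It assumes Python 3 and the standard re module; the sample sentence is invented.

```python
import re

line = "She saw an owl and a bat near the barn."

# The bare letter a matches inside other words as well.
bare = re.findall(r"a", line)

# Surrounding spaces restrict the match to stand-alone articles.
# findall returns the text captured by the parentheses.
spaced = re.findall(r" (a|an) ", line)

print(len(bare))   # 7
print(spaced)      # ['an', 'a']
```

The bare search finds the letter in saw, and, bat, near, and barn as well as in the two articles; the spaced search finds only the articles.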
This regular expression can be used on the first chapter from Joe Haldeman’s
Forever War, as illustrated in Figure 17.4.
Figure 17.4 Using regexes1FileSearch.py to do a simple regular expression
search on The Forever War
We’ve just constructed our first useful regular expression using a metacharacter,
namely the vertical bar. The reason the vertical bar is a metacharacter
is that its interpretation is not literal. The regular expression in (16) doesn’t look
for a space, an open parenthesis, and a followed by a vertical bar followed
by an, a close parenthesis, and another space. Rather, it says, look for either a
or an between spaces. In the following section, we will see how this works, and
learn many other conventions that allow for more sophisticated pattern matching
on strings.
17.2.1 “Or”: Disjunction Vertical Bars

The fundamental trick in our indefinite article example was introducing a way
of expressing the concept of “either-or”, so that searches can look for either one
element or another (in the indefinite article example, either a or an). The concept
of disjunction (OR) is expressed in regular expressions with the vertical bar
metacharacter (|).
(17) a|an
Although some hair-splitting types feel that “or” should be used exclusively
for binary disjunction—that is, for situations where there are only two options
(i.e., “either A or B”)—the everyday English “or” is often extended to situations
where there are a number of options (i.e., “either A or B or C”). The same holds
true for the vertical bar metacharacter, which can be used with multiple options,
as in (18), where our search for articles is expanded to include the definite article
the.
(18) a|an|the
The usefulness of disjunction should be obvious. It is quite handy when one
wishes to allow more than one string to fill a particular slot, as in (19), which
would match airship, starship, or spaceship.
(19) (air|star|space)ship
Note that parentheses have been used in the previous example. Parentheses
are important when there might otherwise be ambiguity in the interpretation of
a regular expression. In the case of (19), it is clear that what is wanted is ship
preceded by air, star, or space. Without the parentheses, the regular expression
would be interpreted differently, as can be seen from the two-way case in (20),
which is interpreted as a search for air or spaceship.
(20) air|spaceship
In other words, (20) is equivalent to (21).
(21) (air)|(spaceship)
It is possible in theory to work out how the various parts of a regular expression
will be grouped together using precedence rules but these require a more in-depth
treatment of regular expressions than is really necessary here. The important thing
to remember is that where there is potential ambiguity, you should use parentheses
to explicitly group together the various parts of your regular expression. “When
in doubt, bracket it out!”
The ability to handle disjunction gives regular expressions a great deal of flex-
ibility and power. Disjunction is also quite handy when searching for the various
forms of a given lexeme, as in (22), which could be used to match any of the words
in (23).
(22) profan(e|ity)
(23) profane
profanity
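The effect of grouping on disjunction can be checked with a minimal sketch, assuming Python 3 and the standard re module. The word list is invented; fullmatch is used so that a word counts only if the whole word fits the pattern.

```python
import re

words = ["airship", "starship", "spaceship", "air", "ship"]

# With parentheses, the alternatives fill a slot in front of ship.
grouped = [w for w in words if re.fullmatch(r"(air|star|space)ship", w)]

# Without parentheses, the alternatives are air and spaceship.
ungrouped = [w for w in words if re.fullmatch(r"air|spaceship", w)]

print(grouped)     # ['airship', 'starship', 'spaceship']
print(ungrouped)   # ['spaceship', 'air']
```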
17.2.2 Matching Characters with Character Classes

In the previous section, the use of vertical bars to express disjunction was
introduced. Although disjunction is a very powerful tool, capable of doing a great
deal of work, it is not always the most convenient way of matching single
characters. Consider for a moment what it would take to create a regular expression
that matches any single lowercase vowel letter. Using disjunction, such a regular
expression would look like (24).
(24) (a|e|i|o|u)
When dealing with a larger set of options, however, disjunction proves to be

very awkward and verbose, as can be seen in (25), which provides a regular ex-
pression that matches any lowercase letter, and not simply lowercase vowel letters.
(25) (a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z)
Thankfully, it is not necessary to always spell out classes of characters to be

matched in this fashion. There are more compact and economical alternatives to
(24) and (25) which will be introduced in the following section.
17.2.3 Character Classes
It is quite useful when performing searches to allow more than one character to
occupy a position in a regular expression. For example, imagine searching for the
word enquire in a collection of texts from all over the world. Since the spelling
of this word is variable (enquire in Commonwealth countries, but inquire in the
USA), one would want to leave open two possibilities for the first character of
this string. One way to accomplish this is by searching for either spelling
disjunctively, as in (26).
(26) (enquire|inquire)
The regular expression in (26) fails to factor out the common element in the
words enquire and inquire. Since it is only the first letter of the two words that
varies, it alone can be separated off and put into parentheses for disjunction, as in
(27).
(27) (e|i)nquire
Disjunction that is restricted to a single character is so common that there is

a ready-made way of handling it, which is to place the desired set of characters
within square brackets, as in (28), which is basically equivalent to (27).3
(28) [ei]nquire
What we’ve just used in (28) is a character class: a string of characters (of
any length) surrounded by square brackets which forms a pattern that matches any
single character in that string. Character classes are useful in many contexts. For
example, they are quite handy when dealing with limited variation in capitalization.
If one searched for the word hope, for example, one would be forced to deal with
the fact that its first letter is capitalized at the beginning of sentences, as in (29).
(29) Hope springs in the heart eternal.
Character classes provide an easy way of handling limited variation in
capitalization. To see how this works, consider the regular expression presented
earlier for finding articles, (18). It would not match articles at the beginning of
sentences, since these will begin with a capital letter. But the regular expression in
(30) will match articles at the beginning of sentences, since it explicitly provides
for the possibility of their capitalization.
3 The two are not entirely equivalent. The square brackets are not only more compact but also
more efficient. For more information about performance issues with regular expressions, see ???.
(30) ([aA]|[aA]n|[tT]he)
When run on the first chapter of The Forever War, it matches lines with sen-
tences that begin with an article, such as those in (31).
(31) a. The guy who said that was a sergeant who didn’t look five years older
than me.
b. The projector woke me up and I sat through a short tape showing the
“eight silent ways.”
c. The sergeant nodded at her and she rose to parade rest.
d. The whole company’d been dragging ever since we got back from the
two-week lunar training
It is also possible to specify a range of characters using square brackets. For

example, if you wanted to match a single lowercase letter, you could use (32). But
typing out every single lowercase letter is tedious and there is a very real risk of
overlooking a letter.
(32) [abcdefghijklmnopqrstuvwxyz]
Instead of typing out each lowercase letter, you can use a character range,
as in (33), which uses the range of lowercase letters from a through z.
(33) [a-z]
The regular expression in (33) will match any one character in the alphabet
from a through z. More than one range can be placed within a pair of brackets, as
in (34), which will match both lowercase and uppercase letters in the same range
of the alphabet.
(34) [a-zA-Z]
Therefore, (34) will match any letter, regardless of case. Note that the two
ranges are simply written one after the other, with no separator between them.
By the same token, other characters can be included, as in (35),
which would match any letter (lowercase or uppercase) or an ampersand.
(35) [a-zA-Z&]
There are also non-alphabetical character ranges, such as (36), which matches
any single digit from 1 through 5, and is equivalent to (37).
(36) [1-5]
(37) [12345]
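Character classes and ranges can be tried out with a minimal sketch, assuming Python 3 and the standard re module. The sample strings are invented, and the first pattern extends (28) with capital letters in the manner of (30).

```python
import re

text = "Enquire within; inquire at desk 3 & Co."

# One class covers both spellings and both capitalizations.
spellings = re.findall(r"[eEiI]nquire", text)

# A numeric range matches a single digit from 1 through 5.
digits = re.findall(r"[1-5]", text)

# Two ranges plus one extra character in a single class.
letters_amp = re.findall(r"[a-zA-Z&]", "R&D 42")

print(spellings)     # ['Enquire', 'inquire']
print(digits)        # ['3']
print(letters_amp)   # ['R', '&', 'D']
```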
Be warned, however, that unexpected results can arise from the use of non-
alphanumeric characters in character ranges. For example, it is doubtful that the
reader knows which characters will be matched by (38).
(38) [!-a]
(38) is obscure because the average reader knows the order of the alphabet
perfectly well (abc . . . xyz), but not the order of characters in ASCII (the American
Standard Code for Information Interchange), which is what determines the
range within a character class.4 The ASCII range specified by (38) is equivalent
to the regular expression in (39), in which the hyphen, the backslash, and the
square brackets are each preceded by a backslash so that they are read literally.
(39) [!"#$%&'()*+,\-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\\\]^_`a]
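Rather than memorizing ASCII order, the contents of a range can be listed mechanically. The following is a minimal sketch, assuming Python 3 and the standard re module; it tests each of the 128 ASCII characters against (38) and collects the ones that match.

```python
import re

# Test every ASCII character against the range in (38).
matched = "".join(
    chr(code) for code in range(128) if re.fullmatch(r"[!-a]", chr(code))
)

print(matched)
print(len(matched))   # 65
```

The range runs from the exclamation mark (code 33) through lowercase a (code 97), which gives 65 characters, including all digits, all uppercase letters, and a good deal of punctuation.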
Since including nonalphanumeric elements in ranges can make a regular ex-

pression difficult to predict, it is wise to avoid them and to include any nonal-
phanumeric characters explicitly, as in (40), which defines a character class that
includes all lowercase and uppercase letters as well as commas.
(40) [a-zA-Z,]
One possibility for including a dash in a character class is to place it at the end
of the character class, where it will be interpreted as a dash and not as an indicator
of a character range, as in (41).
(41) [a-zA-Z-]
There is another possibility for including a dash in a character class, which
solves a more general problem, namely how you can search for a metacharacter
in a text, such as a dot, a question mark, or a hyphen. The solution involves placing
a backslash before the metacharacter so that it is interpreted literally, as in (42).
(42) [a-zA-Z\-]
The use of the backslash to derive literal characters from metacharacters will
be discussed in more detail in §17.2.5.
4 Any reader who has memorized the order of ASCII is unlikely to need this tutorial, but may
need one on human social interaction!
17.2.4 The Wildcard Metacharacter: The Dot
A very useful metacharacter in regular expressions is the wildcard character, the
dot (.). In less geeky circles, the dot is called a period (North America) or a
full stop (elsewhere). A dot is a metacharacter that serves as a stand-in for a
single character. It essentially means, “take your pick for this character, it can be
anything”. For example, (43) matches any three-character string whose first and
last characters are p and n, irrespective of whether the middle character is a letter,
a number, or a punctuation mark.
(43) p.n
Therefore, the regular expression in (43) would match any of the strings in
(44), since each contains the letters p and n with a single character between them.
(44) pinion
company
happened
The regular expression (43) would not match the other words listed below in (45).
(45) planet
equipment
telephone
Why won’t (43) match any of the words in (45)? Because these words have
two characters between p and n and the dot in (43) will match only one. Obviously,
there are many cases where the variable portion of one’s search consists of more
than one character. One way of working around the built-in limitation on the dot
is to use more than one as in (46).
(46) p..n
It is somewhat cumbersome to repeat wildcard characters. It would be far
preferable to be able to specify how many characters a wildcard character can
match. This is in fact possible using regular expressions, as we will see later in
§17.3.
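The behavior of the dot on the words in (44) and (45) can be verified with a minimal sketch, assuming Python 3 and the standard re module.

```python
import re

words = ["pinion", "company", "happened", "planet", "equipment", "telephone"]

# One dot: exactly one character between p and n.
one_dot = [w for w in words if re.search(r"p.n", w)]

# Two dots: exactly two characters between p and n.
two_dots = [w for w in words if re.search(r"p..n", w)]

print(one_dot)    # ['pinion', 'company', 'happened']
print(two_dots)   # ['happened', 'planet', 'equipment', 'telephone']
```

Note that happened turns up in both lists: its pen fits the one-dot pattern, and its ppen fits the two-dot pattern, because a dot is free to match the letter p as well.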
17.2.5 Making Literals out of Metacharacters: The Backslash
At this point, the reader may be asking the following important question: How
does one search a text for characters that happen to be metacharacters, such as
dots, question marks, stars, etc.? More concretely, how would someone search
for, say, all sentences ending with two period marks (a common typo)? The regular
expression (47) won’t work.
(47) ..
The reason (47) won’t work is that, given the normal wildcard interpretation
of a dot, it would match any two characters in a row, which means it would match
any string that contains two or more characters. (In other words, it would produce
many false positives. It would match two dots, but it would also match many
two-character sequences that don’t consist of two dots.)
There is a convention for searches on characters that happen to be metacharacters:
the use of a backslash (\) before a metacharacter that one wishes to be treated
as a literal. (The backslash is also known as an escape character, since it allows
an “escape” from the normal interpretation of a metacharacter.) To search for two
dots in a row, we need to place a backslash before each dot, so that each one is
interpreted literally, as in (48).
(48) \.\.
For another example, consider what is involved in searching for the abbrevia-
tion of etcetera (etc.). The regular expression in (49) will not work, since it also
matches the undesired words in (50).
(49) etc.
(50) etcetera
wretched
stretch
etching
fetch
A much better way of searching for etc. is by converting the dot from a
metacharacter to a literal by placing a backslash in front of it, as in (51).
(51) etc\.
(51) will not match any of the above-listed words but will match etc.. In order
to create a regular expression that would match the word etcetera as well as its
abbreviation, disjunction can be added to the mix, as in (52).
(52) etc(\.|etera)
The astute reader may at this point be wondering what happens when a literal
character is escaped with a backslash. The answer is simple. Nothing. When a
backslash precedes a literal character (that is, a non-metacharacter), it simply has
no effect. Therefore, (53) and (54) are equivalent.
(53) \=
(54) =
Although the backslash in (53) is superfluous, it does no harm. This will not
always be the case. There are a handful of literal characters that change meaning
when preceded by a backslash. These will be covered in §17.2.7.
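The contrast between an escaped and an unescaped dot can be seen in a minimal sketch, assuming Python 3 and the standard re module. The sample sentence is invented; re.escape is a standard helper that adds the backslashes for you.

```python
import re

text = "He paused.. Then he fetched tea, coffee, etc. and sat."

# Escaped dots match only a literal run of two dots.
double_dots = re.findall(r"\.\.", text)

# An unescaped dot also matches the h in fetched.
wildcard = re.findall(r"etc.", text)

# An escaped dot matches only the abbreviation.
literal = re.findall(r"etc\.", text)

print(double_dots)         # ['..']
print(wildcard)            # ['etch', 'etc.']
print(literal)             # ['etc.']
print(re.escape("etc."))   # etc\.
```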
17.2.6 Negation: The Caret in Square Brackets

Related to the use of square brackets for character classes is the use of a caret
(^) to define negative character classes. In other words, by placing a caret at the
beginning of a character class, you can ensure that a regular expression will NOT
match any of the characters in it. For example, [^aeiou] matches any single
character that is not a lowercase vowel letter, be it a consonant, a digit, a space,
or a punctuation mark.
To see another use of negative character classes, imagine that you wanted to
search for the preposition for everywhere except at the end of a sentence. Since
punctuation marks (dots, commas, exclamation marks, and question marks) typically
occur at the end of sentences, one quick-and-easy way of doing this would
be to search for the word for followed by anything but a punctuation mark, as in
(55). (Remember: Dots and question marks are metacharacters so they must both
be escaped.)
(55) for[^\.,!\?]
The regular expression in (55) would not match the instances of for that are
immediately followed by a punctuation mark in the examples in (56) (taken from
various sources).
(56) a. “I mean-what am I going to go to all this trouble to get to Triton for?”
[Sirens of Titan]
b. “Titan, Triton,” said Constant “What the blast would I go there for?”
[Sirens of Titan]
c. Another Iotic word floated into Shevek’s head, one he had never had a
referent for, though he liked the sound of it: “splendour”. [The Dispossessed]
d. Examining the schedule for January 3, 1992, he saw that a businesslike
professional attitude was called for. [Do Androids Dream of Electric
Sheep?]
However, it would match all of the lines in (57) (from James Blish’s novel
A Case of Conscience), among others, since they contain the string for followed
by a letter or a space (rather than a punctuation mark).
(57) a. ‘God forbid,’ Ruiz-Sanchez said under his breath.

b. ‘Well, don’t forget that Lithia is my first extrasolar planet,’ Ruiz-Sanchez
said.
c. The infinite mutability of life forms, and the cunning inherent in each of
them....
d. But for a physicist, this place is hell.
e. In Hawaii, as Ruiz-Sanchez remembered, the tropical forest was quite
impassable to anyone not wearing heavy boots and tough trousers.
f. In a way this was lucky, for there were no birds on Lithia.
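The negative character class can be tried on shortened versions of these examples. The following is a minimal sketch, assuming Python 3 and the standard re module; the sample strings are abridged or invented.

```python
import re

# for followed by any character other than a dot, comma,
# exclamation mark, or question mark
pattern = re.compile(r"for[^\.,!\?]")

samples = [
    "God forbid",            # followed by a letter
    "But for a physicist",   # followed by a space
    "What is it for?",       # followed by a question mark
    "called for.",           # followed by a dot
]

results = [bool(pattern.search(s)) for s in samples]
print(results)   # [True, True, False, False]
```

The negative class must still match some character, so a for at the very end of a string, with nothing after it, would not match either.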
17.2.7 Special Sequences
We have already seen the escape character [\] used to treat metacharacters as
literals in order to search for dots, question marks, etc. There is another usage of
the escape character, which involves putting a backslash before a literal character
in order to create a special sequence. These special sequences either define special
characters (e.g., linefeeds or space marks) or convenient character classes (e.g.,
whitespace).
Sequence    Description                    Notes
\t          tab
\n          newline
\r          carriage return
\f          formfeed
\v          vertical tab
\d          decimal digit
\D          non-digit character
\s          whitespace character
\S          non-whitespace character
\w          alphanumeric character
\W          non-alphanumeric character
\A          beginning of string            ???
\Z          end of string                  ???
Table 17.2 Special Sequences
The special characters (tab, newline, etc.) have the same interpretation outside
of the context of regular expressions. In other words, they have the same
interpretation in regular strings (for more information about these, see §14.3).
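A few of the special sequences in Table 17.2 can be seen at work in a minimal sketch, assuming Python 3 and the standard re module. The sample string is invented and contains a space, a tab, and a newline.

```python
import re

text = "28 Oct\t1973\n"

# \d matches a digit; the plus (see §17.3.1) collects runs of them.
digit_runs = re.findall(r"\d+", text)

# \w matches an alphanumeric character.
word_runs = re.findall(r"\w+", text)

# \s matches any whitespace character: space, tab, newline, and so on.
whitespace = re.findall(r"\s", text)

print(digit_runs)   # ['28', '1973']
print(word_runs)    # ['28', 'Oct', '1973']
print(whitespace)   # [' ', '\t', '\n']
```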
17.2.8 Line Starts and Ends: Carets and Dollars

There are two metacharacters related to the starts and ends of lines: caret (^) and
dollar ($). The caret is used to represent the start of a line while the dollar is used
to represent the end of a line. If we were searching for the word fast in a text, then
(58) matches every line beginning with fast, while (59) matches every line ending
with fast.
(58) ^fast
(59) fast$
It is important to note that since no additional information has been provided,

it is possible for (58) to match (60) and for (59) to match (61), since what follows
fast isn’t specified in (58) and what precedes fast isn’t specified in (59).
(60) faster
fastest
fastener
fastidious
(61) breakfast
Belfast
In the previous section, the special sequences \A and \Z were introduced.
These two special sequences are equivalent to the caret and dollar sign
metacharacters unless specifically instructed otherwise. We won’t delve into these
subtleties here because they require more detailed knowledge of how regular
expressions are deployed in Python, which will be covered in detail in the next chapter.
For details, see §18.2.2.
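The caret and the dollar can be tried on a handful of lines in a minimal sketch, assuming Python 3 and the standard re module. Each invented line is searched as a separate string, so the start and end of the string coincide with the start and end of the line.

```python
import re

lines = ["fast cars", "breakfast", "faster still", "not so fast", "Belfast"]

# ^ anchors the match to the start of the line.
starts = [line for line in lines if re.search(r"^fast", line)]

# $ anchors the match to the end of the line.
ends = [line for line in lines if re.search(r"fast$", line)]

print(starts)   # ['fast cars', 'faster still']
print(ends)     # ['breakfast', 'not so fast', 'Belfast']
```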
17.2.9 Summary
Before expanding upon our repertoire of regular expressions, it may be worthwhile
to recap what we have learned so far. We’ve been introduced to a variety
of metacharacters: the dot, the vertical bar, square brackets (with and without
hyphens), parentheses, and the backslash (or escape character, as it is sometimes
known). These are summarized in Table 17.3.
Metacharacter    Meaning
.                matches any character
|                or
[abc]            matches one of the bracketed characters
[^abc]           matches anything but one of the bracketed characters
[a-z]            matches one of the characters in a through z
()               grouping
\                causes metacharacters to be read as literals
Table 17.3 Summary of Metacharacters
Regular expressions built up from these metacharacters are fairly powerful and
give searches considerable flexibility, but there are a number of ways in which they
are lacking. In the next section, we will expand the possibilities provided by these
metacharacters by introducing a number of quantifiers—that is, metacharacters
which allow other metacharacters to match once, twice, multiple times, or even
not at all.
17.3 Quantifiers
So far, we have used various metacharacters (such as the dot or the vertical bar) to
create regular expressions for pattern-matching. Most of these patterns have been
fairly simple. Using quantifiers, it is possible to build even more complicated
regular expressions to perform more sophisticated pattern matching. Quantifiers
provide a means of specifying how many times a regular expression should match.
For example, imagine that you are searching a corpus for every word that contains
two vowels in a row (i.e., a two-letter vowel sequence). We have learned how to
create a regular expression that will only match vowel letters, such as (62) or (63).
(62) (a|e|i|o|u)
(63) [aeiou]
One way to find two vowel letters in a row is simply to use the character class
for vowel letters twice, as in (64).
(64) [aeiou][aeiou]
But what if we wanted to find three vowel letters in a row? This could of
course be done in the fashion of (64), by using the character class three times, as
in (65).
(65) [aeiou][aeiou][aeiou]
But (65) is awkward. Furthermore, the approach that it embodies doesn’t scale
well. In other words, it isn’t practical for larger sequences. While it might work as
long as we’re only matching sequences with just a few repetitions, it is a hopeless
way of matching sequences with hundreds of repetitions. Not only does it require
too much typing, but it also is fairly unreadable.
What is needed is a way of controlling how many times part of a regular
expression should match, and this is precisely what quantifiers provide. We turn now
to the various quantifiers found in regular expressions, starting with the plus mark.
17.3.1 Match One or More Times: The Plus Mark
The plus mark [+] essentially says to match the preceding regular expression at
least once. In other words, by adding the plus mark to part of a regular expression,
it will ensure at least one match, and as many more as are possible. Returning to
the vowel sequence example, we could find every vowel letter and sequence of
vowel letters using (66).
(66) [aeiou]+
Alternatively, we could find every vowel sequence of two or more letters by
having two character classes for vowel letters and adding a plus sign to the second,
as in (67). The plus sign dictates that the second character class must match at least
once, but leaves open the possibility of even more matches. Therefore, (67) will
match 2 to n times, where n could be any number.
(67) [aeiou][aeiou]+
Quantifiers therefore solve the problem of defining regular expressions that
will match sequences containing a very high number of repetitions. (67) will
match vowel sequences consisting of just a few vowels as easily as vowel
sequences consisting of hundreds of vowels. Of course, a sequence of hundreds
of vowels is fairly unrealistic. But being able to search for long sequences can be
quite handy nonetheless. Imagine that you are searching for all words that begin
with a vowel and end with a consonant. In order to match such words, the regular
expression in (68) could be used.
(68) [aeiou][^ ]+[^aeiou ]
(68) uses spaces to identify word boundaries. It looks for a vowel followed by
one or more non-spaces followed by a non-vowel and matches a large number of
words of varying lengths, including those in (69).
(69) a. and
b. again
c. arrogant
d. insignificant
e. intercontinental
Because the final character is negatively defined (i.e., as anything but a vowel
letter or a space), (68) will also give a number of undesired matches, such as those
in (70).
(70) a. us!
b. one,”
c. upward.
This problem can easily be rectified by positively specifying the final character
class, as in (71), which is less compact but more accurate. It will still match the
words in (69) but not those in (70).
(71) [aeiou][^ ]+[bcdfghjklmnpqrstvwxyz]
So far, we have used the plus mark for repetitions of single letters, but quanti-
fiers can also be applied to larger sequences. The scope of a quantifier is either the
previous character or the previous grouping (i.e., anything inside of brackets or
parentheses). To see how a quantifier can be used to modify a larger unit, imagine
that you are searching for words that consist only of strict consonant-vowel (CV)
syllables. The regular expression in (72) can be used.
(72) ([bcdfghjklmnpqrstvwxyz][aeiou])+
In (72) the template for a CV syllable is grouped by parentheses and that

grouping is modified with the plus mark, which will ensure that the entire group-
ing matches one or more times.
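To try out a grouped quantifier like this one, the following minimal sketch uses
Python's re module (which is covered properly in the next chapter). The anchors
^ and $, the helper name is_cv_word, and the word list are our own additions for
illustration; the anchors ensure that the whole word, and not just part of it,
consists of CV syllables.

```python
import re

CONSONANTS = "bcdfghjklmnpqrstvwxyz"

# The pattern from (72), anchored so that the entire word must match
CV_WORD = re.compile(r"^([" + CONSONANTS + r"][aeiou])+$")

def is_cv_word(word):
    """Return True if word consists only of consonant-vowel syllables."""
    return CV_WORD.search(word) is not None

# Hypothetical test words
words = ["banana", "potato", "tree", "a", "data"]
cv_words = [w for w in words if is_cv_word(w)]
print(cv_words)
```

Because the plus mark applies to the whole parenthesized group, a word of any
number of CV syllables is accepted, while a word such as tree is rejected at its
second letter.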
17.3.2 Match Many Times or Not at All: The Star

The star (asterisk) [*] resembles the plus mark, except that it does not require any
match at all.5 In other words, it dictates that a regular expression will match as
many instances as possible, including none at all.
To see how it differs from the plus mark, we will return to the regular expression
in (71) that was used to search for words beginning with a vowel and ending
with a consonant. It is repeated below for convenience in (73).
(73) [aeiou][^ ]+[bcdfghjklmnpqrstvwxyz]
(73) will not match words that consist of only a vowel and a consonant since
it requires at least one character between the first and last character. To match
such words, the star quantifier can be used instead of the plus mark, as in (74).
(74) [aeiou][^ ]*[bcdfghjklmnpqrstvwxyz]
Because the non-space character between the vowel and the consonant is quan-
tified with a star, it will match all of the words that (73) does, but it will also match
words that have nothing between the first and last character, such as those in (75).
(75) a. at
b. is
c. or
d. up
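The difference between the two quantifiers can be checked with a short sketch
in Python (the re module is introduced in the next chapter). The anchors ^ and $
and the word list are assumptions added here so that each test word is matched
as a whole.

```python
import re

# (73) with the plus mark and (74) with the star, anchored to whole words
PLUS = re.compile(r"^[aeiou][^ ]+[bcdfghjklmnpqrstvwxyz]$")
STAR = re.compile(r"^[aeiou][^ ]*[bcdfghjklmnpqrstvwxyz]$")

# Hypothetical test words
words = ["at", "is", "and", "again", "tree"]
plus_matches = [w for w in words if PLUS.search(w)]
star_matches = [w for w in words if STAR.search(w)]

print(plus_matches)
print(star_matches)
```

The two-letter words are accepted only by the star version, since nothing has to
stand between the initial vowel and the final consonant.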
5 The term star comes from “Kleene star”, named after Stephen Kleene (pronounced
/klay’nee/), the American mathematician who invented regular expressions. He is
best known for founding the branch of mathematical logic known as recursion
theory. For more about his life and work, see
http://en.wikipedia.org/wiki/Stephen_Kleene.

17.3.3 Match One Time or Not at All: The Question Mark

The question mark metacharacter makes whatever it modifies optional. In other
words, it says to match once or not at all. The ability to make part of a regular
expression optional is quite useful. It is indispensable when searching for different
spelling variants of a word. For example, many words in English are spelled
differently in North America than they are elsewhere (e.g., color vs. colour). If
you were searching for this word in a mixed corpus (i.e., a corpus that contains
both varieties of written English), the question mark could be used to construct a
regular expression that would match either spelling of the word, as shown in (76).
(76) colou?r
In (76), a single letter is made optional using the question mark quantifier, but
it is also possible to make a longer string optional by placing it between
parentheses. For example, consider how one might search for words in the ship
family. The English language contains many words involving ship, such as those
listed in (77).
(77) airship, fellowship, leadership, membership, relationship, rocketship,
scholarship, ship, shipping, shipworthy, shipwrecked, spaceship
If you wanted to find out how many words are based on, or derived from,
ship, you could search a corpus using a regular expression that would match all
of these words (and any others involving ship).6 This can be done fairly easily
using the question mark quantifier to make any material preceding or following
ship optional, as in (78).
(78) ([a-zA-Z]+)?ship([a-zA-Z]+)?
Note that (78) restricts the material preceding or following ship to one or more
letters. It would therefore match only part of a hyphenated word such as
short-shipped (namely, shipped). In order to ensure that such words are matched
in full, the regular expression would have to be modified slightly, so as to allow
hyphens in the material preceding ship, as in (79).
(79) ([a-zA-Z-]+)?ship([a-zA-Z]+)?
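As a rough illustration of how (79) behaves on running text, here is a small
sketch using Python's re module (introduced in the next chapter). The sample
sentence is made up for the purpose, and finditer() is simply one convenient way
of walking through all the matches.

```python
import re

# The pattern from (79): optional material before and after "ship"
SHIP = re.compile(r"([a-zA-Z-]+)?ship([a-zA-Z]+)?")

# Made-up sample sentence
text = ("The airship crew praised her leadership after "
        "the short-shipped order left the ship.")

# group(0) is the full text of each match
family = [mo.group(0) for mo in SHIP.finditer(text)]
print(family)
```

Both optional groups may be empty, which is why the bare word ship is found
alongside its longer relatives.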
17.3.4 Getting More Specific: Minimums and Maximums

The most exact method of quantification would of course amount to stating the
lower and upper limit of desired matches. For example, the quantifier ? is really
just another way of saying that we want 0 to 1 matches. In Python, there is a device
that provides for precisely this sort of specificity in regular expressions. It consists
of a pair of comma-separated numbers in curly brackets. The first number is the
lowest number of desired matches and the second number the greatest. Using this
convention, the quantifiers we have already seen could be restated as shown in
Table 17.4.
6 MORPHOLOGICAL FAMILY SIZE

Quantifier   Bracket Notation Equivalent
?            {0,1}
+            {1,}
*            {0,}
Table 17.4 Quantifiers and their Bracket Notation Equivalents
Note that when there is no upper limit, no number is provided (although the
comma remains).
Bracket notation is invaluable when searching for a particular number of
repeated characters. For example, if one wished to find all sequences of three or
more vowel letters, the regular expression in (80) would do the job.
(80) [aeiou]{3,}
(81) provides the first five lines with different matching words from The
Dispossessed that are matched by (80).
(81) a. Like all walls it was ambiguous, two-faced.
b. People often came out from the nearby city of Abbenay in hopes of seeing
a space ship, or simply to see the wall.
c. He sat down on the shelf-like bed, still feeling light-headed and lethargic,
and watched the doctor incuriously.
d. He felt he ought to be curious ; this man was the first Urrasti he had ever
seen.
e. Contagious.
Alternatively, if one wished to find sequences of three or more consonant
letters, (82) could be used.
(82) [bcdfghjklmnpqrstvwxyz]{3,}
The first five matching lines obtained by running (82) against the first chapter
of The Dispossessed are given in (83).
(83) a. It was built of uncut rocks roughly mortared; an adult could look right
over it, and even a child could climb it.
b. Where it crossed the roadway, instead of having a gate it degenerated
into mere geometry, a line, an idea of boundary.
c. For seven generations there had been nothing in the world more impor-
tant than that wall.
d. Like all walls it was ambiguous, two-faced.
e. Looked at from one side, the wall enclosed a barren sixty-acre field
called the Port of Anarres.
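The following sketch shows how such runs can be pulled out of a single line
with Python's re module (covered in the next chapter). It assumes a positively
specified class of lowercase consonant letters for the consonant runs, and it
uses findall(), which returns the matched substrings themselves rather than the
lines containing them.

```python
import re

VOWEL_RUN = re.compile(r"[aeiou]{3,}")                      # as in (80)
CONSONANT_RUN = re.compile(r"[bcdfghjklmnpqrstvwxyz]{3,}")  # three or more consonants

line = "Like all walls it was ambiguous, two-faced."
vowel_runs = VOWEL_RUN.findall(line)
consonant_runs = CONSONANT_RUN.findall(line)

print(vowel_runs)
print(consonant_runs)
```

Changing the numbers inside the curly brackets (for example, to {2,3}) is all
that is needed to search for runs of a different length.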
17.3.5 Summary of Quantifiers
To recap, there are four ways of quantifying regular expressions. Each quantifier
behaves somewhat differently and allows for different matching possibilities. The
full list is provided in Table 17.5.

Quantifier   Meaning
?            match zero or one time
*            match zero or more times
+            match one or more times
{i,j}        match a minimum of i and a maximum of j times
Table 17.5 Summary of Quantifiers
17.4 Word Boundaries

In the various examples provided in this chapter, words were identified by placing
regular expressions between spaces. This is only a partial solution to the problem,
since it leads to the systematic oversight of words that occur at the beginning or
ends of lines. To see why this is the case, consider the regular expression used to
search for articles.
(84) (a|an|the)
Because (84) expects a space before and after the desired strings, it will fail
to match these strings when they occur at the beginning or end of lines. At the
beginning of lines, the string will have a space after it but not before it. At the end
of lines, the string will have a space before it but not after it. Although articles
occurring at the end of lines will be rare (since articles are normally followed by
other words), their occurrence at the beginning of lines is commonplace. In order
to capture articles at the beginning of lines, the regular expression in (85) could
be used.
(85) ˆ(a|an|the)
A search on the first chapter of Kurt Vonnegut’s The Sirens of Titan using
(85) produces many matches, a few of which are illustrated below in (86).
(86) a. The following is a true story from the Nightmare Ages . . .
b. The crowd had gathered because there was to be a materialization . . .
c. An ancient butler in knee breeches opened the door . . .
d. A bald man made an attempt on Constant’s life with a hot dog, . . .
In order to match words that occur at the beginning or end of a line as well as
words that occur in between, a regular expression such as (87) is required, where
line starts and ends are explicitly included as options for delimiting the boundaries
of a word.
(87) (^| )(a|an|the)($| )
There are other whitespace characters that should probably be included as
potential word boundaries, such as tabs. This can be easily rectified, as in (88),
but the inclusion of more word boundary characters makes the resulting regular
expression more difficult to read.
(88) (^| |\t)(a|an|the)($| |\t)
Fortunately, Python offers a useful convention for identifying word boundaries
in regular expressions which addresses some of the previously identified issues.
This is the special sequence \b, which matches at the edge of a word, that is, at
any position where an alphanumeric character is adjacent to whitespace, to a
non-alphanumeric character, or to the start or end of a line.7 Using this
convention, it is much easier to write a regular expression for definite and
indefinite articles, as in (89).
(89) \b(a|an|the)\b
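A minimal sketch of (89) at work in Python follows (the re module is introduced
in the next chapter). The sample line is made up. Note that the pattern is
written as a raw string, because in an ordinary Python string \b stands for the
backspace character.

```python
import re

# The pattern from (89), written as a raw string
ARTICLE = re.compile(r"\b(a|an|the)\b")

# Made-up sample line
line = "the crowd saw an ancient butler open a door, then another"
articles = ARTICLE.findall(line)
print(articles)
```

The strings an in ancient, the in then, and a in another are all skipped,
because none of them is followed by a word boundary.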
It is important to bear in mind that there are still issues concerning word
segmentation that are not solved by the existence of the special sequence \b. For
example, hyphens in raw, unstructured text raise problems, because they are used
inconsistently. This can be illustrated with a few sentences taken from Philip K.
Dick’s Do Androids Dream of Electric Sheep?. In some cases, hyphens occur
within words, as in (90).
(90) a. “My schedule for today lists a six-hour self-accusatory depression,” Iran
said.
b. Despair like that, about total reality, is self-perpetuating.”
c. Hey, for twenty-five bucks you can buy a full-grown mouse.”
7 Technically, \b is a zero-width assertion; that is, it does not cause the regular
expression engine to advance through the string but simply succeeds or fails at
the current position.

However, in other cases hyphens play a different function. Instead of
separating parts of morphologically complex words, they serve to bracket some
parts of the sentence off from others, as can be seen in (91).
(91) a. Percheron colts just don’t change hands-at catalogue value, even.
b. You know why? Because back before W.W.T. there existed literally hundreds-”’
c. I left a piece and Groucho-that’s what I called him, then-got a scratch
and in that way contracted tetanus.
It is because of such inconsistencies that punctuation marks are typically
treated as stand-alone words in machine-readable corpora. Parsing is greatly
facilitated by reformatting the sentences in (91) as follows:
(92) a. Percheron colts just don’t change hands - at catalogue value , even .
b. You know why ? Because back before W.W.T. there existed literally
hundreds -”’
c. I left a piece and Groucho - that’s what I called him , then - got a scratch
and in that way contracted tetanus .
According to the usual capitalization convention for special sequences (see

§17.2.7), the inverse of \b is \B. Just as \b makes it possible to match the edge of
a word, \B makes it possible to match within a word. It is therefore useful when
trying to match bound morphology, such as affixes or clitics. For example, if you
were searching for suffixes in a corpus that is morphologically unanalyzed (i.e.,
where there is no breakdown of words into morphemes), the regular expression in
(93) could be used.
(93) \B(ment|ship)\b
Thanks to the special sequences in (93), the searched-for strings will only
match when they occur at the end of a word. Therefore, the stand-alone word ship
will not be matched but the verb reship will.
Finding prefixes is simply the reverse of (93), with \b occurring before the
searched-for string and \B occurring after it, as in (94).
(94) \b(un|re)\B
Note that (94) will match many words that do not have prefixes but simply
begin with un and re, such as underwear or reality.
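The behavior of (93) and (94) can be checked with a short Python sketch (the
re module is covered in the next chapter). The word list is made up for
illustration.

```python
import re

SUFFIX = re.compile(r"\B(ment|ship)\b")   # (93): string at the end of a longer word
PREFIX = re.compile(r"\b(un|re)\B")       # (94): string at the start of a longer word

# Made-up test words
words = ["ship", "reship", "government", "mental", "underwear", "re"]
suffixed = [w for w in words if SUFFIX.search(w)]
prefixed = [w for w in words if PREFIX.search(w)]

print(suffixed)
print(prefixed)
```

As expected, ship and mental are not counted as suffixed words, and underwear
is (wrongly) counted as prefixed, which shows why the output of such a search
still needs to be checked by hand.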
17.5 Summary
Regular expressions are really nothing more than a set of metacharacters and a
few conventions concerning their interaction. The full set is fairly small and easily
memorized, as shown by the summary provided in Table 17.6. In the following
chapter, we will see how regular expressions can be used in Python to manipulate
text, and we will survey some Python-specific extensions to regular expressions
that make them even more powerful and flexible.

Symbol   Meaning
.        match any character
?        match zero or one time
+        match one or more times
*        match zero or more times
{i,j}    match i to j times (j defaults to infinity)
|        or
[abc]    match one of the bracketed characters
[^abc]   match anything but one of the bracketed characters
[a-z]    match one of the characters in a through z
()       grouping
\        treat following character as literal
Table 17.6 Summary of Metacharacters and Quantifiers

17.6 Exercises
1. ???
2. ???
3. ???

The best extended treatment of regular expressions is Friedl (2002), an excellent
book that goes into all of the gory details of regular expressions. Its only drawback
is that it does not focus on Python per se (although much of the discussion of
Perl carries over to Python). Another introduction to regular expressions is Watt
(2005). There is also a Python-specific treatment of regular expressions in Mertz
(2003). Finally, Good (2004) provides a cookbook of regular expression “recipes”
that offer ready-made solutions to a variety of text processing problems.

Chapter 18
Regular Expressions in Python
In the previous chapter, regular expressions were introduced in the abstract, with-
out reference to Python. In this chapter, we will cover how regular expressions
are specifically handled in Python. We will also look at some powerful Python-
specific extensions to regular expressions which can be leveraged for even better
text processing.

Regular expressions are fairly easy to use in Python. Our first example will
illustrate the simplest possible usage of regular expressions: applying a
regular expression composed entirely of literal characters to a text and finding
a match. In (238), we use a regular expression to search for the word German
in an excerpt from Don DeLillo’s book White Noise, to see whether or not it
appears. (It does.)
(238) regexes2RegexEx1.py
import re

# An excerpt from Don DeLillo's White Noise
s = "The German tongue. \
Fleshy, warped, spit-spraying, purplish and cruel. \
One eventually had to confront it."

# Match a regex ('German') against string
if re.search('German', s):
    print("Found a match!")
else:
    print("No match found!")
The code in (238) illustrates a few fundamentals about regular expressions in
Python. The first order of business is importing the required regular expression
module, re. The function search() is then called on the text with a simple
regular expression (German). As its name suggests, search() scans a string
for the first place where the regular expression matches. If a match is
found, it returns a MatchObject; otherwise, it returns None. Since a match will be
found in this example, calling search() will return a MatchObject,
the if-clause will therefore evaluate as true, and the appropriate message will be
printed out.
In (238), the regular expression is applied directly to a string to find matches,
thereby skipping a step that is often quite useful: compiling the regular
expression string into a RegularExpressionObject, as illustrated in (239).
(239) regexes2RegexEx2.py
import re

# Excerpt from The Fountains of Paradise by Arthur C. Clarke
text = """2069 June 11 GMT 06.84. Message 8964 sequence 2.
Starglider to Earth:

Starholme informed me 456 years ago that the origin of
the universe had been discovered but that I do not have the
appropriate circuits to comprehend it. You must communicate
direct for further information.

I am now switching to cruise mode and must break contact.

Goodbye."""

# Compile a RegularExpressionObject
reo = re.compile('Star*')

# Find matches
if reo.search(text):
    print("Found a match!")
else:
    print("No match found!")
Unless you are using one of the regular expression compilation flags, it is not
necessary to compile a regular expression in order to use it, although you may
want to do so for reasons of efficiency (see §18.2.2 for more details).
In summary, the use of regular expressions in Python involves the following
steps:
1. Creating a regular expression
2. Compiling the regular expression (optional)
3. Using the regular expression to find matches
4. Manipulating the matches using a MatchObject
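The four steps can be sketched in a few lines. The pattern and the sample string
below are chosen only for illustration, and group() and span() are just two of
the MatchObject methods that could be used in the last step.

```python
import re

pattern = r"\d+"               # 1. Create the regular expression (one or more digits)
reo = re.compile(pattern)      # 2. Compile it (optional)

# 3. Use the compiled object to look for a match
mo = reo.search("Message 8964 sequence 2.")

# 4. Manipulate the match through the MatchObject
first_number = None
span = None
if mo:
    first_number = mo.group(0)   # The text that was matched
    span = mo.span()             # Where it was found (start, end)

print(first_number)
print(span)
```

Only the first run of digits is reported, because search() stops at the first
match it finds.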
Each step in the process will be discussed in greater detail below. §18.2.1
shows how regular expressions are created; §18.2.2 discusses how regular
expressions are compiled; §18.3 overviews the various ways that regular
expressions can be used for pattern-matching; and §18.3.2 describes how matches
can be leveraged for text processing.
18.2 Creating Regular Expressions

18.2.1 Defining Regular Expressions
As was pointed out in the previous chapter, a regular expression is essentially a
string that describes other strings. Because regular expressions are strings, the
way strings are handled in Python affects the way regular expressions are handled,
sometimes in ways that are not entirely obvious and may lead to confusion.
The main difficulty concerns how backslashes are handled. As you may recall
from Chapter 14, a backslash is used to escape single quotes in single-quoted
strings and double quotes in double-quoted strings, and to wrap long strings across
multiple lines, as can be seen in (240), where each string has backslashes that
don’t appear in the output when it is run.
(240) regexes2DefineRegexEx0.py
import re

print('This is called a \
\'single-quoted\' string.')

print("This is called a \
\"double-quoted\" string.")

print('''This is called a \
\'''triple-quoted\''' string.''')

print("""This is also called a \
\"""triple-quoted\""" string.""")
If you want a backslash in a string, there are a few ways of getting one. As
long as the backslash doesn’t come at the end of a line, doesn’t precede a single
quote in a single-quoted string or a double quote in a double-quoted string, and
doesn’t begin an escape sequence such as \n or \t, it will behave like any other
character, as can be seen by running (241).
(241)
import re

print('Here is a \ in a single-quoted string.')
print("Here is a \ in a double-quoted string.")
print('''Here is a \ in a triple-quoted string.''')
print("""Here is a \ in another triple-quoted string.""")
But what if you want a backslash at the end of a line? What then? In that case,
the backslash must itself be escaped with a second backslash, as in (242).
(242)
import re

print('This single-quoted line ends with a backslash.\\')
print("This double-quoted line ends with a backslash.\\")
print('''This triple-quoted line ends with a backslash.\\''')
print("""This triple-quoted line ends with a backslash.\\""")
With the foregoing discussion of backslashes in mind, consider the code in
(243), which shows three different attempts to create a regular expression that
will match a snippet of LaTeX code.1
(243)
import re

# The text to search (a raw string, so the backslash is kept as is)
s = r'\documentclass[a4paper,10pt]{book}'

# Create regex as raw string
if re.search(r'\\documentclass', s):
    print("Got match with double slash in raw string.")
else:
    print("No match with double slash in raw string.")

# Create regex as plain string
if re.search('\\documentclass', s):
    print("Got match with double slash in plain string.")
else:
    print("No match with double slash in plain string.")

# Create regex as plain string
if re.search('\\\\documentclass', s):
    print("Got match with quadruple slash in plain string.")
else:
    print("No match with quadruple slash in plain string.")
Running (243) produces results that may be surprising. The first and third
regular expressions produce matches, but not the second. The reason for this is
that in order to match a backslash in a regular expression, two backslashes are
required. (Remember, a backslash normally serves as an escape character, so
in order to get a literal backslash, the backslash must be escaped.) But since
backslashes are also meaningful in ordinary strings, they must be escaped there
as well. So, in order to get one literal backslash in a string, two are required,
which means that getting two literal backslashes in a regular expression requires
a total of four backslashes in the string. If all of this seems a bit confusing,
there is no need to worry. The bottom line is simple: Always use raw strings for
regular expressions.
1 If you’re not familiar with LaTeX, it is a text-processing system often used for
the formatting of technical material, such as this book. If you want to know more
about LaTeX, a good place to start is Lamport (xxx).
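A quick way of convincing yourself that the raw string and the quadruple-slash
plain string describe the same regular expression is to compare them directly.
The variable names in this small sketch are arbitrary.

```python
import re

s = r"\documentclass[a4paper,10pt]{book}"

plain = "\\\\documentclass"   # Four backslashes typed, two stored in the string
raw = r"\\documentclass"      # Two backslashes typed, two stored in the string

same_pattern = (plain == raw)          # The two strings are identical
lengths = (len(plain), len(raw))       # ... and so are their lengths
found = re.search(raw, s) is not None  # Either one matches the LaTeX snippet

print(same_pattern)
print(found)
```

The raw string is simply the less error-prone way of writing the same thing.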
18.2.2 Compiling Regular Expressions
The second step in using regular expressions in Python is to compile a regular
expression object (http://docs.python.org/lib/re-objects.html).
When compiling a regular expression object, there are a number of options
available which systematically change the way in which the regular expression is
interpreted. Table 18.1 lists the various possibilities.
Table 18.1 Compilation Flags for Regular Expressions

Flag            Abbrev.   Meaning
re.DOTALL       S         the wildcard metacharacter (the dot [.]) matches
                          any character, even newlines
re.IGNORECASE   I         ignore case when performing matches
re.LOCALE       L         use locale setting when performing matches
re.MULTILINE    M         line start and end metacharacters (the caret [^]
                          and the dollar sign [$]) match at the start and end
                          of each line, not just at the start and end of the
                          string
re.VERBOSE      X         enable verbose REs
DOTALL Normally, newlines (\n) are not matched by a dot (the wildcard
metacharacter). In order to ensure that absolutely every character, including
newlines, is matched by wildcards, a regular expression can be compiled with the
DOTALL flag. This convention is very useful for parsing larger texts since
patterns that span multiple lines can be captured.
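A minimal sketch of the effect of DOTALL follows. The sample text and the
START and END markers are invented for the purpose of the example.

```python
import re

# Invented sample: the material of interest spans two lines
text = "START first line\nsecond line END"
regex = r"START (.*) END"

# Without the flag, the dot stops at the newline and the search fails
without_flag = re.compile(regex).search(text)

# With the flag, the dot also matches the newline
with_flag = re.compile(regex, re.DOTALL).search(text)

print(without_flag)
if with_flag:
    print(with_flag.group(1))
```

Only the second search succeeds, and its first group contains both lines,
including the newline between them.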
IGNORECASE Regular expressions are normally case-sensitive. Therefore,
the regular expression the will not match the word The. It is possible to work
around this limitation using character classes, as previously described in §17.2.3,
but it is often easier to use the IGNORECASE flag when compiling a regular
expression so that matches are made in a case-insensitive manner (i.e.,
capitalization is ignored). In (244), for example, the regular expression Books
matches the word books (lowercase) as well as BOOKS (uppercase).
(244) regexes2IgnoreCaseEx1.py
import re

s1 = 'It is good to read books.'   # Lowercase 'books'
s2 = 'BOOKS ARE GOOD TO READ.'     # Uppercase 'BOOKS'

# Compile case-insensitive regex object
reo = re.compile('Books', re.IGNORECASE)

# A match is found, despite case mismatch
# (search() is used because 'books' is not at the start of s1)
if reo.search(s1): print(s1)
if reo.search(s2): print(s2)
This compilation flag is especially useful when dealing with the poetry of e.
e. cummings, whose poetic license took a typographic turn with his disdain for
capitalization.
(245) regexes2IgnoreCaseEx2.py
s = '''e. e. cummings'''
LOCALE Locales are used by operating systems and applications for interna-
tionalization, which basically means that they customize the behavior of a com-
puter according to its setting or location. A locale usually affects such things as
character sets, date and time formats, and currency formats, all of which change
from one language and/or culture to the next. For example, the authors of this
book reside in different places. One of them is based in the United States while
the other is based in The Netherlands. Their locale affects the way dates are for-
matted. For example, the date ‘1-12-1999’ would be interpreted as January 12th,
1999 in the US but as December 1st, 1999 in the Netherlands.
By using the LOCALE flag for compiling regular expressions, it is possible
to ensure that differences between alphabets are preserved in the interpretation of
the metasequences \w, \W, \b, and \B. For example, French permits a number
of letters with diacritics, such as é or ç, but \w will not match these characters by
default. However, by using the locale for French, it is possible to make \w include
these characters.
(246) regexes2LocaleEx1.py
import re
french = 'French sentence goes here.'
reo = re.compile(r'(\w+)', re.LOCALE)
mo = reo.match(french)
CAN WE GIVE A LISTING OF LOCALES? ARE THERE SOME THAT

ARE UNIVERSALLY SUPPORTED OR DOES IT DEPEND ON THE OS?
CAN PYTHON ONLY USE LOCALES THAT ARE SUPPORTED BY THE OS?
NEED TO FOLLOW UP ON THIS AND FIGURE OUT HOW IT WORKS.
The only drawback to using the LOCALE flag is that it causes the resulting
compiled object to use the C functions for \w, which slows it down a bit, but the
performance cost is a small price to pay for this sort of language-sensitivity.
MULTILINE As previously explained in §17.2.8, the metacharacters ^ and $
normally match the beginning and end of a string. With the MULTILINE flag,
the interpretation of these two metacharacters is changed so that they match the
beginning and end of each line rather than only those of the string being matched
against.
The code in (247) illustrates the difference between a regular expression object
compiled with this flag versus one compiled without it.
(247) regexes2MultilineCompileFlagEx1.py
import re

s = '''Dissection is a virtue when
it operates on other men.'''

regex = r'^(.*)$'
reo = re.compile(regex)
mo = reo.match(s)
if mo:
    print(mo.group(1))

reo = re.compile(regex, re.MULTILINE)
mo = reo.match(s)
if mo:
    print(mo.group(1))
Without the flag, no match is found, because the dot cannot cross the newline
and $ only matches at the end of the string. With the flag, $ also matches at the
end of the first line, so the first line of the string is printed.
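The effect of the flag on the caret can be seen even more directly with
findall(), as in the following sketch. The regular expression used here (a run of
word characters at the start of a line) is our own choice for illustration.

```python
import re

s = "Dissection is a virtue when\nit operates on other men."
regex = r"^\w+"   # A word at the very start of a line

# Without the flag, ^ matches only at the start of the string
first_words_default = re.compile(regex).findall(s)

# With the flag, ^ matches at the start of every line
first_words_multiline = re.compile(regex, re.MULTILINE).findall(s)

print(first_words_default)
print(first_words_multiline)
```

The first search finds only the first word of the string, while the second
finds the first word of each line.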

To see how this can be of practical value, imagine that you are parsing a data
file with the results of a reaction time experiment, where each line has a single
reaction time measurement, and you want to analyze only those reaction times
that have triple-digit values. All such lines could be easily extracted using nothing
more than the simple regular expression in (248).
(248) regexes2MultilineCompileFlagEx2.py
import sys
import re

# regexes2MultilineCompileFlagEx2.py ../inputs/rt_data.txt

fn = sys.argv[1]                          # Obtain filename
fo = open(fn, 'r')                        # Open file
text = fo.read()                          # Slurp file
fo.close()                                # Close file
regex = r'^[0-9][0-9][0-9]$'              # Line w/ 3 digits
reo = re.compile(regex, re.MULTILINE)     # Compile regex
values = reo.findall(text)                # Find all matches
if values:                                # Is match list empty?
    for n in values:                      # No. Loop over values.
        print(n)                          # Print value
else:                                     # Yes.
    print("No three-digit values found.") # Report situation
VERBOSE As you’ve probably figured out by now, regular expressions can get
pretty complicated. And the more complicated they get, the more cryptic they
become. (Power, not readability, is their forte.) It is wise when writing difficult
regular expressions to comment them, much as you would any other code. One
way of doing this is to simply use Python’s normal commenting convention, as
shown in (249).
(249) regexes2VerboseCompileFlagEx1.py
import sys
import re

text = sys.argv[1]      # Text to test (here taken from the command line)

# Regular expression for URLs
#   (https?://)?   : Optional protocol
#   [a-zA-Z_\.]+   : Character class of valid characters

# Regular expression that matches valid URLs
regex = r'(https?://)?([a-zA-Z_\.]+)'
ro = re.compile(regex)
mo = ro.match(text)
The VERBOSE compilation flag provides another way of commenting regular
expressions by permitting comments to be interleaved in a regular expression, as
shown in (251).
(251)
import sys
import re

text = sys.argv[1]      # Text to test (here taken from the command line)

regex = r'''(           # Begin optional protocol
    https?://           # The s is optional
)?                      # End optional protocol
(                       # Begin address
    [a-zA-Z_\.]+        # Only these characters are legit
)                       # End address
'''
ro = re.compile(regex, re.VERBOSE)
mo = ro.match(text)
In a regular expression compiled with the VERBOSE flag, comments can be
added using a hash mark, as in regular Python code. Whitespace is ignored and
can therefore be used liberally to indent the various parts of a regular expression
for readability. This raises the issue of how a literal space can be included in
such a regular expression. The answer is simple: it is either escaped with a
backslash or placed inside a character class.
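Both ways of including a literal space in a verbose regular expression are
sketched below. The pattern (three words separated by single spaces) and the
sample name are chosen only for illustration.

```python
import re

NAME = re.compile(r"""
    (\w+)    # First word
    \        # A literal space, escaped with a backslash
    (\w+)    # Second word
    [ ]      # A literal space, written as a character class
    (\w+)    # Third word
    """, re.VERBOSE)

mo = NAME.match("Philip Kindred Dick")
parts = mo.groups() if mo else ()
print(parts)
```

All other whitespace in the pattern, including the indentation and the line
breaks, is ignored when the regular expression is compiled.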
18.3 The Regular Expression Object
One of the keys to mastering regular expressions in Python is familiarizing oneself
with all of the functions and classes available for the re module. A list of the most
commonly used ones is provided in Table 18.2 for ease of reference.
The use of each of the main regular expression methods will be discussed and
illustrated below.
18.3.1 Regular Expression Functions and Methods
A regular expression object has quite a few methods available for it, and
familiarizing yourself with these is a must for regular expression mastery in Python.
18.3.1.1 re Functions
There are a few functions that can be called directly from the re module. The two
most important of these are compile() and escape().
re.compile(RE[, flags]) Regular expressions can be optimized by
precompiling them. This is particularly useful when the same regular expression is
used repeatedly. To see how the performance of a regular expression that is
compiled compares to one that is not, try running (252) and looking at the output.
(252) regexes2CompileEx1.py
import re, time

# Pattern matches 1 or more digits--i.e., integers
pattern = r'\d+'

# Precompiled ----------------------------------
prog = re.compile(pattern)
t1 = time.time()
for i in range(0, 100001):
    prog.search(str(i))
t2 = time.time()
print("Precompiled: %s secs" % str(t2-t1))

# Not Precompiled ------------------------------
t1 = time.time()
for i in range(0, 100001):
    re.search(pattern, str(i))
t2 = time.time()
print("Not Precompiled: %s secs" % str(t2-t1))
(252) loops through the integers from 0 to 100,000 twice. In each loop,
every number is converted to a string and compared to a regular expression for a
match. The first loop does this using a precompiled regular expression while the
second loop does this using a regular expression that is compiled on the fly. On
average, the loop with the precompiled regular expression runs twice as fast as the
other.
re.escape(str) This is very useful if you want to match an arbitrary literal
string that may have regular expression metacharacters in it. The usefulness of
such a function can be illustrated by considering how one might write a Python
script that searches files using regular expressions. A simple version of such a
script was already provided in (??). One annoying aspect of this script is that if
you want to search for literal characters that happen to be metacharacters, they
must all be escaped. A much more convenient way of doing the same thing is
to include an option that forces all metacharacters to be treated as literals, as in
(253).
(253) regexes1FileSearchRev.py
import sys, re

try:
    args = sys.argv[1:]
    literal = (args[0] == '-l')         # Optional -l flag: literal search
    if literal:
        args = args[1:]
    pattern = args[0]
    fn = args[1]
    if literal:
        pattern = re.escape(pattern)    # Treat metacharacters as literals
    i = 0
    match = False
    ro = re.compile(pattern)
    fo = open(fn, 'r')
    for l in fo:
        i = i + 1
        mo = ro.search(l)
        if mo:
            match = True
            print("%i: %s" % (i, l))
    fo.close()

    if not match: print('Sorry. No matches found.')

except IndexError:
    # Too few command-line arguments
    print("Usage: %s [-l] <REGEX> <FILE>" % sys.argv[0])
    sys.exit(0)
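The effect of re.escape() is easiest to see on strings held in memory rather
than in a file. In the sketch below, the function find_lines() and its literal
parameter are hypothetical names introduced only for this illustration.

```python
import re

def find_lines(pattern, lines, literal=False):
    """Return (line number, line) pairs for the lines that match pattern."""
    if literal:
        pattern = re.escape(pattern)   # Neutralize all metacharacters
    ro = re.compile(pattern)
    return [(i, l) for i, l in enumerate(lines, 1) if ro.search(l)]

# Invented sample lines
lines = ["a.b", "axb", "a+b"]
as_regex = find_lines("a.b", lines)                  # The dot is a wildcard
as_literal = find_lines("a.b", lines, literal=True)  # The dot is just a dot

print(as_regex)
print(as_literal)
```

As a regular expression, a.b matches all three lines; once it has been escaped,
it matches only the line that actually contains a dot.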
18.3.1.2 Methods
The remaining methods can be called in one of two ways. Either they can be called
directly from the module re as functions (e.g., re.findall()) or they can be
called as methods of a compiled regular expression (i.e., a RegularExpression
object). When called as functions, a regular expression string is required as the
first argument of the function.
findall(RE, str) Return a list of all non-overlapping matches of RE in

string.
226
T
(254) regexes2FindallEx1.py
import re, sys 1
2
# Get filecontents for file supplied on command line 3
fc = fo.read() 6
fo.close() 7
8
# Regex matches two sequences of alphanumeric chars 9
# separated by a hyphen 10
regex = r’\w+-\w+’ 11
12
AF
# Find all matches and print them out 13
for m in re.findall(regex, fc) : 14
print m 15
If one or more groups are present in the RE, return a list of groups; this will
be a list of tuples if the RE has more than one group. Empty matches are included
in the result unless they touch the beginning of another match.
(255) regexes2FindallEx2.py
import re, sys

# Get filecontents for file supplied on command line
fn = sys.argv[1]
fo = open(fn, 'r')
fc = fo.read()
fo.close()

# Regex matches two sequences of alphanumeric chars
# separated by a hyphen
regex = r'\w+-\w+'

# Find all matches and print them out
for m in re.findall(regex, fc) :
    print m
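The effect of groups on the return value of findall() can be seen in a short sketch. The sample text is invented for illustration, and the code is written so that it also runs under Python 3.

```python
import re

text = "A well-known, old-fashioned word-list"

# No groups: a list of the matched strings
whole = re.findall(r'\w+-\w+', text)

# Two groups: a list of tuples, one element per group
parts = re.findall(r'(\w+)-(\w+)', text)

print(whole)
print(parts)
```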
match(RE, str[, flags]) The name of this method is somewhat mis-
leading. Although it may seem like this method would simply find a match for a
regular expression, its behavior is in fact more restricted than that. The method
match() determines whether a regular expression matches the beginning of a
string.
(256) regexes2SearchEx1.py
import re, sys

# Get filename from command line
file = sys.argv[1]

# Regex matches lines starting and ending w/ same char
regex = r'^(.).*\1$'

# Print out every line in file that matches regex
fo = open(file, 'r')
for l in fo.readlines() :
    if re.match(regex, l) :
        print l
fo.close()
To find a match anywhere in a string, and not just at the beginning, the method
search should be used. In many ways, the method match is superfluous, given that
calling it is roughly the equivalent of calling search with the line-start
metacharacter (^) prepended to the regular expression.
search(RE, str[, flags]) To compare a regular expression with a string
and see whether there are any matches, the method search can be used. If a match
is found, a MatchObject is returned; otherwise, None is returned.
(257) regexes2MatchEx1.py
import re, sys

# Get filename from command line
file = sys.argv[1]

# Regex matches lines starting and ending w/ same char
regex = r'^(.).*\1$'

# Print out every line in file that matches regex
fo = open(file, 'r')
for l in fo.readlines() :
    if re.match(regex, l) :
        print l
fo.close()
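The difference between match and search can be checked directly on a short string. The sample string below is invented for illustration, and the code is written so that it also runs under Python 3.

```python
import re

s = "the dog barked"

at_start = re.match(r'dog', s)     # None: 'dog' is not at the beginning
anywhere = re.search(r'dog', s)    # MatchObject: 'dog' found at offset 4
anchored = re.search(r'^dog', s)   # None: behaves like match() here
article = re.match(r'the', s)      # MatchObject: 'the' is at the beginning

print(at_start, anywhere.start(), anchored, article.group())
```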
split(RE, str[, maxsplit]) The method split is very useful, since it
provides a way of splitting a string using a regular expression as a separator instead
of a fixed string.
(258) regexes2SplitEx1.py
import re

# Numbers 1 to 10 with a variable number of spaces between them
s = '1 2  3   4 5  6 7   8 9  10'

print "Using split with single space as separator..."
for n in s.split(" ") :
    print "[%s]" % n

print "Using split with default separator..."
for n in s.split() :
    print "[%s]" % n

print "Using split with regex as separator..."
for n in re.split(r' +', s) :
    print "[%s]" % n
sub(RE, repl, str[, cnt]) To replace one string with another, the
string method replace can be used (see §14.4.2.2). However, there are many
cases where replacements need to take advantage of pattern-matching. More
intelligent string replacements can be done with regular expressions using the
method sub(), as illustrated by (259), which replaces two or more newlines
with exactly two newlines.
(259) regexes2SubEx1.py
import re, sys

fn = sys.argv[1]

# Get file contents as string
fo = open(fn, 'r')
fc = fo.read()
fo.close()

# Regex that matches 2 or more newlines
regex = r'\n{2,}'

# Perform replacement on file contents and print results
print re.sub(regex, '\n\n', fc)
The usefulness of splitting strings with regular expressions can be illustrated

by considering how interlinear text aligned at the morpheme or word level can be
parsed.
???
subn(RE, repl, str[, cnt]) The method subn behaves almost iden-
tically to sub; however, instead of returning only the string that has had substi-
tutions performed on it, it returns a tuple with the new string and the number of
substitutions performed. Using this method, it is very easy to perform a substitu-
tion and report how many times the substitution was performed, as illustrated by
(??), which does a global find-replace on a file and prints out the results as well
as the number of replacements made.
(260) regexes2SubnEx1.py
import re, sys

fn = sys.argv[1]

# Get file contents as string
fo = open(fn, 'r')
fc = fo.read()
fo.close()

# Regex that matches 2 or more newlines
regex = r'\n{2,}'

# Perform replacement on file contents and print results
newFc, numSubs = re.subn(regex, '\n\n', fc)

# Print message to stderr
msg = "Total Number of Substitutions: %i." % (numSubs)
sys.stderr.write(msg)

# Print new file contents to stdout
print newFc
18.3.2 The MatchObject
The various methods for regular expression matching produce match objects,
which can then be used to obtain further information and perform additional
manipulations.
group(), groups() ???

To learn more about grouping in Python regular expressions, see §18.4.1.
start(), end(), span() ???

18.4 Extensions to Regular Expressions
Python provides a number of extensions to regular expressions that are not
standard. Since they are Python-specific, they were not covered in the previous
chapter.
18.4.1 Grouping
In §17.2.1 we learned how to group expressions in regular expressions with paren-
theses to ensure their proper parsing. However, these parenthetical groupings can
also be utilized in Python to simplify a variety of string manipulation tasks. In
fact, Python provides fairly extensive group-handling capabilities which go far be-
yond what is available even in fairly regular expression-friendly languages (e.g.,
Perl). The various grouping conventions available in Python regular expressions
are listed in Table 18.3. (Remember, these conventions are Python-specific
and will not carry over to regular expressions in other contexts.)
We will illustrate the usage of each convention and provide some advice con-
cerning their strengths and weaknesses.
Simple Grouping: (R) We have already seen how parentheses can be used to
group regular expressions. Sometimes this grouping is necessary in order to
clarify how a regular expression is interpreted. For example, if you were searching
for the past participle or gerund form of the verb kill, you could use the regular
expression kill(ed|ing), where parentheses are used to ensure that the regular
expression matches kill followed by either ed or ing. (Without parentheses, as
killed|ing, the regular expression would be interpreted as killed or ing, which
is clearly not what we want.)
This grouping can also be used to tease apart the various elements of a regular
expression. Continuing with our kill example, we could use the code in (261) to
search for every instance of the words killed or killing.
(261) regexes2SimpleGroupingEx1.py
import sys, re
fn = sys.argv[1]                       # Get filename from command-line
fo = open(fn, 'r')
lineList = fo.readlines()              # Get list of lines from file
fo.close()
regex = '((kill)(ed|ing))'             # Define regex
reo = re.compile(regex)                # Compile regex into object
for l in lineList:                     # Go through each line...
    mo = reo.search(l)                 # Obtain match object
    if mo :                            # If there's a match...
        suffix = mo.group(3)           # Obtain third group (the suffix)
        nl = reo.sub(r'*\g<1>*', l)    # Surround match w/ stars
        print nl                       # Print starred line
17
There are a number of things to note about this script. First, the filename
comes from the command-line, so the script can be run on any file. Second, the
regular expression contains three groupings: one for the entire verb form, one
for the verb stem, and one for the verb suffix. These groupings can be referenced
by number, starting from the left with 1. (Don't get group numbering confused
with list indices, which count from 0 rather than 1.) If a match is found for
((kill)(ed|ing)), the numbered groupings would be those shown in Figure 18.1.
Figure 18.1 Regular Expression Grouping

Group    killed    killing
1        killed    killing
2        kill      kill
3        ed        ing
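The group numbering can be verified with the methods of the match object. The following is a minimal sketch using a short invented test string; it is written so that it also runs under Python 3.

```python
import re

reo = re.compile(r'((kill)(ed|ing))')
line = 'nice to be killed while killing'
mo = reo.search(line)           # finds the first match, 'killed'

whole = mo.group(0)             # the entire match
verb = mo.group(1)              # group 1: the whole verb form
stem = mo.group(2)              # group 2: the stem
suffix = mo.group(3)            # group 3: the suffix
all_groups = mo.groups()        # groups 1..3 as a tuple
position = mo.span()            # (start, end) offsets of the match
suffix_start = mo.start(3)      # start offset of group 3

print(whole, verb, stem, suffix, all_groups, position, suffix_start)
```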
To reference these groupings in the replacement string passed to sub(), the
grouping number must be surrounded by angle brackets and preceded by a backslash
and the letter g, e.g., \g<1>. The purpose of grouping the entire verb is to
surround it with asterisks for easier identification in the output. If we run
this script on an article by Peter Beaumont in The Guardian, we obtain the
following output (note the use of stars to flag the matched strings):
(95) "It is nice to be *killed* while *killing*"

His friends and family say that in the days before his
attack, which *killed* 19 and injured more than 50,
he showed no sign of what he planned to do. He watched
the World Cup on television with his friends and his
brothers and expressed the hope that Brazil, the team
he was following, would win.
That is why his family is so baffled by his decision.
At his home in the refugee camp of al-Faraa, near Nablus,
yesterday his father and brothers said they could not
believe that he had *killed* himself.
"He said he did not like the idea of civilians being
*killed*. But no one forced him to this. He chose
this route. He had a good life - a good upbringing.
He was hoping to study for his PhD."
He says he does not want to "kill for the sake of *killing*
but so that others might have life", but that it is
"nice to be *killed* while *killing*".
Non-grouping: (?:R) In the previous section, we saw how a regular expression
in parentheses forms a group and how that grouping can be used in Python
programs. However, there are times when parentheses are necessary to ensure
the proper interpretation of a regular expression, but their grouping is otherwise
undesired. We can illustrate this using the example from §18.4.1.
(262) regexes2NongroupingEx1.py
import re, sys                              # Required modules
fn = sys.argv[1]                            # File to search
fo = open(fn, 'r')                          # Get list of lines
lineList = fo.readlines()
fo.close()
regex = '(?:(kill)(ed|ing))'                # Create regex
reo = re.compile(regex)                     # Compile regex object
for l in lineList :                         # Go through each line
    mo = reo.search(l)                      # Obtain match object
    if mo :                                 # Is there a match?
        print reo.sub(r'*\g<1>-\g<2>*', l)  # Flag match w/ stars
The example in (262) is very similar to the example in (261). In both cases, we
are searching for the verb root kill followed by the suffix ed or ing. In (261) each
match was bracketed by stars so that matches could be seen more easily in the
program's output. The example in (261) has a number of groupings to illustrate
how the counting of groups works, but only two are necessary: one to group the
possible suffixes and another to group the verb stem so that each can be referenced
for text substitution. By inserting a question mark and a colon at the beginning of
the outermost grouping, it ceases to count, and we are able to refer to the verb stem
as 1 and the suffix as 2. Therefore, (262) is equivalent to (263), where only two
groupings are used. (The two programs are identical save for the line where the
regular expression is defined.)
(263) regexes2NongroupingEx2.py
fo.close()
regex = '(kill)(ed|ing)'                    # Create regex
        print reo.sub(r'*\g<1>-\g<2>*', l)  # Flag match w/ stars
Followed By: (?= R) It can be useful to search for one pattern only when it
is followed by another.
(264) regexes2FollowedByEx1.py
# TODO 1
Not Followed By: (?! R) Just as it can be useful to search for one pattern
only when it is followed by another, it can also be useful to search for one pattern
only when it is NOT followed by another.
(265) regexes2NotFollowedByEx1.py
# TODO 1
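Pending the two listings above, the behavior of both constructs can be sketched as follows. The test string is invented, and the code is written so that it also runs under Python 3. Note that in both cases the lookahead is only checked; it is not part of the matched text.

```python
import re

text = "killed killing killer kill"

# kill only when it is followed by ed or ing
followed = [m.start() for m in re.finditer(r'kill(?=ed|ing)', text)]

# kill only when it is NOT followed by ed or ing
not_followed = [m.start() for m in re.finditer(r'kill(?!ed|ing)', text)]

# The suffix is checked but not consumed: only 'kill' is returned
matched_text = re.findall(r'kill(?=ed|ing)', text)

print(followed, not_followed, matched_text)
```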
Named Grouping: (?P<name>R) We’ve already seen how groupings can be

referenced by number—for example, (261) or (262). They can also be referenced
by name, provided that they are given one.
(266) regexes2NamedGroupingEx1.py
fo.close()
regex = '(?P<verb>kill)(?P<suffix>ed|ing)'            # Create regex
        print reo.sub(r'*\g<verb>-\g<suffix>*', l)    # Flag match w/ stars
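A self-contained sketch of named groups is given below. It uses short invented test strings rather than a file, and is written so that it also runs under Python 3. The last two lines show a named group being referred to within the same regular expression with (?P=name).

```python
import re

reo = re.compile(r'(?P<verb>kill)(?P<suffix>ed|ing)')

mo = reo.search('He was killed.')
verb = mo.group('verb')          # look a group up by name
suffix = mo.group('suffix')
by_name = mo.groupdict()         # all named groups as a dictionary

# In a replacement string, named groups are written \g<name>
flagged = reo.sub(r'*\g<verb>-\g<suffix>*', 'killed while killing')

# Inside the regex itself, a named group is referred to as (?P=name)
doubled = re.search(r'(?P<syl>\w\w)(?P=syl)', 'banana')
repeated = doubled.group('syl')

print(verb, suffix, by_name, flagged, repeated)
```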
Backreferences: (?P=name) vs. \g<n> Backreferences are an extremely
handy method for matching a group within a regular expression and referring to
it within the same regular expression, much as one would a variable. We have
already seen backreferences in use in (??) and (??), but here we discuss them in a
bit more detail and provide a few more examples of their use.
One linguistic domain where backreferences are especially handy is reduplication,
which can be defined as a morphological process in which a root or stem
is repeated, either in part or as a whole. In some languages, reduplication is a
very productive process (e.g., Ilocano, where it marks plurality). We will look
at reduplication in a collection of Tzeltal texts (Stross, 1977). These texts have
been placed in a single file (love in the armpit.txt), with bibliographic
information provided on lines beginning with @, Tzeltal text provided on lines
beginning with //, and English translations on lines with no special marker. The
mark-up scheme for these texts is illustrated below:
@HUNTING DOG / TE JLEBOJOM C’I‘ NAMEJ K’INALE

//te namej k’inale ‘ay laj jleojom
In the olden days there was a hunter.
//ya laj xbajt ta lebojel sok te tz’i‘e
He would go hunting with his dog.
//te namej k’inale yak to laj xk’opoj ‘a te tz’i‘e
In the olden days yes still they talked then, the dogs.
//ja‘nax laj jich bajt laj ta lebojel sok te stz’i‘e
It was just thus he went hunting with his dog.
//te tz’i‘e malaba ya sk’an ya sle te t’ule
The dog, not he wanted to hunt rabits.
//ta poro sujtel laj tal ya sk’an te tz’i‘e
The dog just wanted to go back.
//malaba ya sk’an sle te t’ule
He didn’t want to hunt rabbits.
...
A short script in (267) provides a first-pass attempt at extracting examples of

reduplication from it.
(267) regexes2BackrefEx1.py
import sys
import re

# Deal with file provided at command-line
fn = sys.argv[1]
fo = open(fn, 'r')
lineList = fo.readlines()
fo.close()

# --------------------------------------------------------
# REGULAR EXPRESSION BREAKDOWN:
# \w      = any valid letter
# \'      = apostrophe marks ejectives and glottal stops
# [\w\']  = character class for letters and apostrophes
# {3,5}   = match 3 to 5 times
# \1      = backreference
# --------------------------------------------------------
regex = r'^//.*([\w\']{3,5})\1'       # Create regex
ro = re.compile(regex)                # Compile regex
for l in lineList :                   # Loop over line list
    if ro.search(l) :                 # Check for redup
        l = l.replace("\n", "")       # Ditch newlines
        print l                       # Print line
The script in (267) looks for every line of Tzeltal text, which should start with
//, and finds every word that contains a sequence of 3 to 5 Tzeltal letters that is
immediately repeated.
Running the program on love in the armpit.txt should produce a num-
ber of hits. The output has a few drawbacks, however, which we ought to remedy.
First, it is hard to identify where the reduplication occurs at a glance. We need
some way of flagging the words containing reduplication. Second, no information
is provided as to where these examples occur in our text. Each of these deficits is
corrected in (268).
Since most of the work in (268) happens inside of the regular expression, we
will look at it carefully.
(268) regexes2BackrefEx2.py
import sys
import re

# Deal with file from command-line
fn = sys.argv[1]
fo = open(fn, 'r')
lineList = fo.readlines()
fo.close()

# --------------------------------------------------------
# REGULAR EXPRESSION BREAKDOWN:
# \w      = any valid letter
# \'      = apostrophe marks ejectives and glottal stops
# [\w\']  = character class for letters and apostrophes
# {3,5}   = match 3 to 5 times
# \2      = backreference to the second group
# --------------------------------------------------------
regex = r'(\w*([\w\']{3,5})\2\w*)'    # Create regex
ro = re.compile(regex)                # Compile regex object
for i, l in enumerate(lineList) :     # Loop over index and line
    if ro.search(l) :                 # Check for match
        l = l.replace("\n", "")       # Get rid of newline
        newLine = ro.sub(r'*\1*', l)  # Flag reduplication
        print str(i) + ' : ' + newLine  # Print line
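The regular expression used in (268) can be tried out without the text file. In the sketch below the two input lines are invented toy strings (they are not taken from the Tzeltal texts), and the code is written so that it also runs under Python 3.

```python
import re

# Group 1 is the whole word; group 2 is the 3-5 character
# sequence that must be immediately repeated (\2)
ro = re.compile(r"(\w*([\w']{3,5})\2\w*)")

# Toy input, not real Tzeltal
lines = ["//toy line with tzantzan inside", "no repetition here"]

flagged = []
for i, l in enumerate(lines):
    if ro.search(l):
        # Surround the word containing the repetition with stars
        flagged.append("%i : %s" % (i, ro.sub(r'*\1*', l)))

print(flagged)
```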
Comment: (?#...) Regular expressions can get a bit cryptic. We already
saw in §18.2.2 that there is a compilation flag that permits the interleaving of
comments into regular expressions. But Python gives you another option: comments
can be inserted within parentheses, as in (269).
(269) regexes2CommentGroupingEx1.py
import re
regex = r'(?#This matches phone numbers)(\d+) (\d+)-(\d+)'
reo = re.compile(regex)      # Compile regex object
s = '805 528-1267'
mo = reo.search(s)           # Obtain match object
if mo :                      # Is there a match?
    print s                  # Print matching string
In the opinion of the authors, this method of commenting regular expressions
is not terribly useful, since such comments are not particularly easy to read and
there is a good alternative in the form of the verbose compilation flag.
Set Mode Flag: (?letter) We have already seen how regular expression
compilation flags allow for systematic changes to the interpretation of regular
expressions. In some cases, however, it is desirable to set such a flag from
within the regular expression itself rather than passing it to compile(). Note
that in Python's re module a flag set in this way applies to the entire regular
expression, so it is best placed at the very beginning.
(270) regexes2SetModeFlagEx1.py
import re
regex = r'(?i)'
reo = re.compile(regex)
s = '???'
mo = reo.search(s)
if mo :
    print s
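Since (270) is still a placeholder, here is a minimal sketch of the inline flag (?i), which corresponds to the IGNORECASE compilation flag. The pattern and test string are invented for illustration, and the code is written so that it also runs under Python 3.

```python
import re

s = 'Beginning PYTHON Programming'

# Without any flag, the lowercase pattern does not match
plain = re.search(r'python', s)

# Flag passed as an argument to compile()
by_argument = re.compile(r'python', re.IGNORECASE).search(s)

# The same flag set inside the regular expression itself
inline = re.compile(r'(?i)python').search(s)

print(plain, by_argument.group(), inline.group())
```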
18.4.2 Special Sequences

Python provides a number of special sequences that simplify the job of construct-
ing useful regular expressions. They uniformly begin with the regular expression
escape character, the backslash. These letters come in pairs: a lowercase and
an uppercase letter. The uppercase letter is the inversion of the lowercase one.
Therefore, since \d stands for a digit, its uppercase counterpart \D stands for a
non-digit.
Two of these special sequences are sensitive to the LOCALE compilation flag
(see §18.2.2).
18.4.3 Matching Greediness
By default, quantifiers are "greedy", meaning that they will match the longest
string possible.[2] Therefore, even though the quantifier + matches 1 to n instances
of whatever it quantifies, it will by default match the largest number possible.
In (271), for instance, the plus sign following the letter b will match the largest
possible number, which will be the entire sequence (all 10 instances of the letter).
(271) regexes2GreedinessEx1.py
import sys
import re

s = 'aaaaabbbbbbbbbbccccc'
regex = r'(b+)'
ro = re.compile(regex)
mo = ro.search(s)
if mo : print mo.group(1)
Sometimes the greedy interpretation of a quantifier is not what is needed for
the task at hand. Fortunately, there is a solution to the greed of quantifiers: placing
a question mark after a quantifier will force it to behave in a non-greedy fashion.
(If only greed in the real world could be dispensed with so easily!) In (272),
the same string is matched against the same regular expression, with one major
difference: in (272) the greediness of the quantifier is held in check by a question
mark, resulting in much different behavior. Now only a single b from the string is
matched and printed out.
(272)
import sys
import re

s = 'aaaaabbbbbbbbbbccccc'
regex = r'(b+?)'
ro = re.compile(regex)
mo = ro.search(s)
if mo : print mo.group(1)
To see how this might be useful, imagine that you wish to extract the first
sentence from every paragraph of a text. Assuming that there is one paragraph per
line in a text, (273) would do the job.
(273)
import sys
import re

fn = sys.argv[1]
fo = open(fn, 'r')
lineList = fo.readlines()
fo.close()

# --------------------------------------------------------
# REGULAR EXPRESSION BREAKDOWN:
# ^        = matches beginning of line
# .*?      = matches anything non-greedily
# [\?\.!]  = character class that matches '?', '.', or '!'
#            (metacharacter escaped with backslash)
# --------------------------------------------------------
regex = r'^(.*?[\?\.!])'
ro = re.compile(regex)
for l in lineList :
    mo = ro.search(l)
    if mo :
        print mo.group(1)

[2] GIVE QUOTE ABOUT INVETERATE ATTRIBUTION OF ANIMACY TO NATURE FROM NAGEL
The regular expression matches everything from the start of a line up to a
punctuation mark (which marks the end of a sentence). With a greedy quantifier,
this would not work, since it would match everything from the start of a line up
to the last punctuation mark, which would effectively be the entire paragraph
rather than the first sentence. You would get the first sentence, but you'd get
everything following it, as well. By forcing non-greedy matching, matching will
end as soon as the first punctuation mark is encountered, which is exactly what's
called for.
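The contrast between the two quantifiers can be seen by running both versions of the regular expression on the same paragraph. The paragraph below is invented for illustration, and the code is written so that it also runs under Python 3.

```python
import re

para = "Is it greedy? It is. Or is it!"

# Greedy: .* runs to the LAST punctuation mark
greedy = re.search(r'^(.*[\?\.!])', para).group(1)

# Non-greedy: .*? stops at the FIRST punctuation mark
lazy = re.search(r'^(.*?[\?\.!])', para).group(1)

print(greedy)
print(lazy)
```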
18.5 Exercises
1. ???
2. ???
3. ???
4. ???
5. ???

A regular expressions tutorial for Python can be found at
http://www.amk.ca/python/howto/regex/. For more detailed technical information
about regular expression syntax in Python, consult the official documentation at
http://docs.python.org/lib/re-syntax.html. There is also a Python-focussed
treatment of regular expressions in Mertz (2003).
Table 18.2 re Methods

Method                        Behavior
compile(RE[, flags])          Compiles a regular expression object.
escape(str)                   Returns the string with non-alphanumerics
                              backslashed.
findall(RE, str)              Returns a list of non-overlapping matches.
match(RE, str[, flags])       Determines if the RE matches at the beginning of
                              the string. If zero or more characters at the
                              beginning of the string match the regular
                              expression RE, returns a corresponding MatchObject
                              instance. Returns None if the string does not
                              match the RE; note that this is different from a
                              zero-length match.
search(RE, str[, flags])      Scans through the string looking for a location
                              where the regular expression RE produces a match,
                              and returns a corresponding MatchObject instance.
                              Returns None if no match is found.
split(RE, str[, maxsplit])    Splits the string by the occurrences of RE. If
                              capturing parentheses are used in RE, then the
                              text of all groups in the RE is also returned as
                              part of the resulting list. If maxsplit is
                              nonzero, at most maxsplit splits occur, and the
                              remainder of the string is returned as the final
                              element of the list.
sub(RE, repl, str[, cnt])     Returns the string obtained by replacing the
                              leftmost non-overlapping occurrences of RE in the
                              string by the replacement repl. If the RE isn't
                              found, the string is returned unchanged. repl can
                              be a string or a function; if it is a string, any
                              backslash escapes in it are processed.
subn(RE, repl, str[, cnt])    Performs the same operation as sub(), but returns
                              a tuple (new string, number of subs made).

Table 18.3 Regular Expression Grouping in Python

Sequence       Meaning
(R)            Treat regular expression R as a group
(?:R)          Do not treat R as a group
(?=R)          Match only if followed by R
(?!R)          Match only if not followed by R
(?P<name>R)    Give the group matching R the name <name>
(?P=name)      Refer to the grouping named <name>
(?#...)        Comment; the contents are ignored
(?letter)      Set the mode flag named by letter
Table 18.4 Special Sequences

Sequence       Meaning
\1, \2, \n     Match regular expression group 1, 2, n
\A             Matches only at the start of the string
\b             Empty string at a word boundary
\B             Empty string not at a word boundary
\d             Any digit (equivalent to [0-9])
\D             Any non-digit character (equivalent to [^0-9])
\s             Any whitespace character (equivalent to [ \t\n\r\f\v])
\S             Any non-whitespace character (equivalent to [^ \t\n\r\f\v])
\w             Any alphanumeric character (uses LOCALE flag)
\W             Any nonalphanumeric character (uses LOCALE flag)
\Z             Matches only at the end of the string
Chapter 19
XML
19.1 What is XML?
The eXtensible Markup Language (XML) is a format for the encoding of structured
documents. It is related to HTML, since both derive from SGML, the Standard
Generalized Markup Language (ISO 8879). Originally developed by the World Wide
Web Consortium (W3C), XML was designed to be a simplified version of SGML for
use on the web. Since its initial release in 1998, it has become increasingly
widespread. We will provide only a brief overview for those readers who have no
prior acquaintance with it, since there is extensive documentation of XML
available in print (see Ray (2003) or Harold and Means (2004)) and on-line
(http://www.w3.org/XML).
In brief, the chief advantages of XML over other schemes for storage and
exchange of data are:
Portable Since XML is simply text, it is very portable across different presently
existing operating systems.
Durable Because XML is simply text, it will remain readable in all foreseeable
future operating systems.
Flexible XML defines a syntax for the encoding of documents but not their content,
making it a very flexible general purpose tool.
Simple XML is a very simple markup language, which is easy to learn and easy
to understand.
Easily Processed Because XML is in common usage, there is an abundance of
tools available for its automated processing.
19.2 The Syntax of XML
The basics of XML syntax are fairly straightforward. (276) illustrates a very
simple XML annotation of a sentence (taken from the Penn Corpus; see Marcus et al.
(1993, 1994)), one which consists only of tags with no attributes.
(276) pennCorpusEx1.xml
<S>
  <NP>
    <DT>A</DT>
    <NN>stockbroker</NN>
  </NP>
  <VP>
    <VBZ>is</VBZ>
    <NP>
      <DT>an</DT>
      <NN>example</NN>
      <PP>
        <IN>of</IN>
        <NP>
          <DT>a</DT>
          <NN>profession</NN>
          <PP>
            <IN>in</IN>
            <NP>
              <NN>trade</NN>
              <CC>and</CC>
              <NN>finance</NN>
            </NP>
          </PP>
        </NP>
      </PP>
    </NP>
  </VP>
  <.>.</.>
</S>
The following rules account for nearly everything that you need to know about
XML syntax:[1]
[1] A tutorial for these rules can be found at http://www.w3schools.com/xml/default.asp.
All XML elements must have a closing tag. This means that for every start tag
there is a matching end tag. So, whereas the matching pair <NP></NP>
is correct XML syntax, <NP> is not. (The tags <p></p> and <br> come
from HTML (HyperText Markup Language), which is similar to
XML, but does not entirely conform to its standards.)
XML tags are case sensitive. Start and end tags must match in case. Therefore,
<word></word> would be correct whereas <Word></word> would be
incorrect (since the start tag begins with a capital letter but the end tag does
not). There is no requirement, however, that tags must be upper or lowercase.
Therefore, both <p></p> and <P></P> are correct.
All XML elements must be properly nested. Tags that occur within other tags
must be closed before their containing tags. An example of proper nesting
would be <p>Nested <b>XML</b></p>, where the tags <b></b>
are nested inside of the tags <p></p>.
All XML documents must have a root element. All elements must occur inside
of a single base tag, usually called the root element, and all tags must fall
under it.
Attribute values must always be quoted. In addition to their content, elements
can possess attributes. Consider, for example, a tag for
pronouns in English. The tag <pro></pro> is used to surround a pronoun
and attributes are used to specify various features of the pronoun, such as
person, number, case, and gender. The full set of pronouns in English might look
as follows:
(275) attributesExample1.xml
<pronouns lg="eng">
  <pro per="1" num="sg" case="nom">I</pro>
  <pro per="1" num="sg" case="acc">me</pro>
  <pro per="1" num="pl" case="nom">we</pro>
  <pro per="1" num="pl" case="acc">us</pro>
  <pro per="2" num="sg" case="nom">you</pro>
  <pro per="2" num="sg" case="acc">you</pro>
  <pro per="3" num="sg" case="nom" gender="m">he</pro>
  <pro per="3" num="sg" case="acc" gender="m">him</pro>
  <pro per="3" num="sg" case="nom" gender="f">she</pro>
  <pro per="3" num="sg" case="acc" gender="f">her</pro>
  <pro per="3" num="pl" case="nom">they</pro>
  <pro per="3" num="pl" case="acc">them</pro>
</pronouns>
Comments take the form of special tags. Comments, which will be ignored by
parsers, can be interleaved within an XML document using the tag <!-- -->,
as illustrated in (277), where comments have been added to (276) in order to
identify the subject and predicate.
(277) xmlSyntaxEx1.xml
<S>
  <!-- Subject -->
  <NP>
    <DT>A</DT>
    <NN>stockbroker</NN>
  </NP>
  <!-- Predicate -->
  <VP>
    <VBZ>is</VBZ>
    <NP>
      <DT>an</DT>
      <NN>example</NN>
      <PP>
        <IN>of</IN>
        <NP>
          <DT>a</DT>
          <NN>profession</NN>
          <PP>
            <IN>in</IN>
            <NP>
              <NN>trade</NN>
              <CC>and</CC>
              <NN>finance</NN>
            </NP>
          </PP>
        </NP>
      </PP>
    </NP>
  </VP>
  <.>.</.>
</S>

One of the chief virtues of XML is that it can easily be processed automatically,
and there are many tools in existence for doing so. In this section, we will look at
how XML can be manipulated in Python.
19.3.1 What is an XML Parser?
An XML parser provides a way of taking a chunk of XML and breaking it down
into its parts for processing. In other words, it is a tool that automates XML
parsing. For those unfamiliar with the terminology, the term parse means "to
break information down into its constituent parts". There are two standard
approaches to parsing XML, SAX and DOM. The difference between the two concerns
how they read and store the contents of an XML file for processing.
SAX stands for Simple API for XML. A SAX parser reads an XML
file by going through it serially (i.e., from start to finish). As it goes through the
XML file's parts, it calls back to event handlers that decide how to act upon a
particular part of the XML document. The advantage of SAX is that it is fast and
efficient, but its drawback is that it can be more difficult to write programs using it,
since they must always keep track of where they are in the document as it is being
read through. It is best suited to dealing with large XML files where certain types
of information are handled in a uniform fashion, regardless of where they occur
in the document.
DOM stands for Document Object Model and, unlike SAX, it works by reading
an entire XML document into memory and providing an interface to its
hierarchical structure. The hierarchical structure of the document can then be treated
like a tree, where there is a single root node and various child nodes beneath it.
Since this is a fairly natural way of viewing XML, it is often easier to write
programs in this fashion. The drawback, however, is that reading an entire XML file
can be memory-intensive and inefficient.
Here we will present two Python modules for XML processing, xml.sax, an
older module which has been part of the Python standard library since version
2.0, and elementtree, a newer API which has been part of the Python standard
library (as xml.etree.ElementTree) since version 2.5.
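The event-handler style of SAX described above can be sketched with the standard xml.sax module. In this sketch the class name TagCounter is hypothetical, the XML string is a shortened version of the sentence in (276), and the code is written so that it also runs under Python 3.

```python
import xml.sax

class TagCounter(xml.sax.ContentHandler):
    """Event handler that counts how often each start tag occurs."""
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.counts = {}

    def startElement(self, name, attrs):
        # Called by the parser each time a start tag is read
        self.counts[name] = self.counts.get(name, 0) + 1

xml_text = ("<S><NP><DT>A</DT><NN>stockbroker</NN></NP>"
            "<VP><VBZ>is</VBZ><NP><DT>an</DT><NN>example</NN></NP></VP></S>")

handler = TagCounter()
xml.sax.parseString(xml_text.encode('utf-8'), handler)
print(handler.counts)
```

The handler never sees the document as a whole; it only reacts to each start tag as the parser reaches it, which is why SAX programs must keep track of their own state.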
19.3.2 xml.sax
???
19.3.3 elementtree
???
It is also possible to navigate the implicit hierarchy of an XML document
using XPath, which is a language for finding information in an XML document.
It provides a convenient method of navigating through elements and attributes in
an XML document.
A complete introduction goes beyond the scope of this textbook. A good tutorial
can be found on the W3Schools site: http://www.w3schools.com/xpath/default.asp.
Returning to the example from the Penn Corpus, we can obtain all of the noun
phrases as a list of Element objects. It is then very simple to find the number of
noun phrases by obtaining the length of the list.
(278) xmlEtreeFindNouns.py
import elementtree.ElementTree as ET   # Python 2.5+: xml.etree.ElementTree
tree = ET.parse("pennCorpusEx1.xml")   # The file shown in (276)
nouns = tree.findall(".//NP")          # All NP elements at any depth
print len(nouns)
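The same search can be tried without a file by parsing a string. The sketch below uses a shortened version of the sentence in (276) and the standard-library name of the module, xml.etree.ElementTree; it is written so that it also runs under Python 3.

```python
import xml.etree.ElementTree as ET

xml_text = ("<S><NP><DT>A</DT><NN>stockbroker</NN></NP>"
            "<VP><VBZ>is</VBZ><NP><DT>an</DT><NN>example</NN></NP></VP></S>")

root = ET.fromstring(xml_text)

# './/NP' finds NP elements at any depth below the root
noun_phrases = root.findall(".//NP")

# 'NP' alone finds only NP elements that are direct children of the root
top_level_nps = root.findall("NP")

# The words of the first noun phrase
first_np_words = [child.text for child in noun_phrases[0]]

print(len(noun_phrases), len(top_level_nps), first_np_words)
```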
19.4 Case Studies

To show the usefulness of XML, we will provide two case studies. The first ex-
ample shows how to process a simple XML document—namely, a newsfeed in
Indonesian.
19.4.1 Parsing RSS Newsfeeds

[???] There are now RSS newsfeeds available in numerous languages. Here we
will show how to extract information from a newsfeed for the Indonesian news
service, Antara. An example newsfeed is provided in (320).
(320) antara-rss-feed-umm-abbrev.xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet href="feed.css" type="text/css"?>
<rss version="2.0">
  <channel>
    <title>ANTARA - Umum</title>
    <description>Berita Terkini dari ANTARA News</description>
    <link>http://www.antara.co.id</link>
    <lastBuildDate>Thu, 29 Jun 2006 05:15:00 +0100</lastBuildDate>

    <item>
      <title>Pemilihan PM Baru Timtim Perlu Waktu Sebulan</title>
      <link>http://www.antara.co.id/seenws/?id=36957</link>
      <description> Para pemimpin politik di Timor Timur (Timtim) mungkin memerlukan waktu sebul
      <pubDate>Thu, 29 Jun 2006 04:47:06 +0100</pubDate>
      <guid>http://www.antara.co.id/seenws/?id=36957</guid>

    </item>
    <item>
      <title>Montenegro Resmi Jadi Anggota ke-192 PBB</title>
      <description> Montenegro resmi menjadi anggota ke-192 PBB, Rabu, sebulan setelah negara it

    </item>
    ...
    <item>
      <title>TNI AL Tangkap Enam Kapal Penyelundup Ikan</title>
      <description> Kapal patroli TNI AL, KRI Layang 805, berhasil menangkap enam kapal berikut 61 ABK (a

      <pubDate>Wed, 28 Jun 2006 17:22:06 +0100</pubDate>
    </item>
  </channel>
</rss>
We will use elementtree to parse this newsfeed and extract information
from it. In (280), we present a very simple program that parses newsfeeds of this
sort and prints out the key pieces of information in them.
(280) xmlRSSNewsfeedParser.py
import sys
import elementtree.ElementTree as ET

# Read the newsfeed named on the command line
rssfile = sys.argv[1]
rss_contents = open(rssfile).read()
tree = ET.fromstring(rss_contents)
for item in tree.findall("channel/item") :
    title = item.findtext("title")
    date = item.findtext("pubDate")
    cat = item.findtext("category")
    link = item.findtext("link")
    # An item may lack a description, so fall back on an empty string
    desc = item.findtext("description", "").strip()
    guid = item.findtext("guid")
    print "--"
    print "TITLE:       [%s]" % title
    print "DATE:        [%s]" % date
    print "GUID:        [%s]" % guid
    print "LINK:        [%s]" % link
    print "CATEGORY:    [%s]" % cat
    print "DESCRIPTION: [%s]" % desc
We will return to this case study in Chapter 21, where we will see how to
write Python code that automatically pulls these RSS newsfeeds from the web.
This automates the entire process, so that the downloading of newsfeeds and
the parsing of them into a corpus can all be accomplished by a single program.
19.4.2 Managing Parallel Translated Texts with XML

When dealing with multilingual texts, it is often convenient to store the texts in
one format and then transform the text for display.
Consider the following short text, taken from an elementary school reader
developed by the Australian Government for classrooms in Papua New Guinea.
The text was originally written in English and has been translated into two
languages, Tok Pisin and Rotokas, for use in a village classroom.
The text is stored as XML. The main tag is <text>, which has two child tags,
<header> and <body>. The <header> contains information about the text, such
as its author, its translator, its editor, its title, etc. The <body> contains the actual
contents of the text, broken down into a variable number of paragraphs, each
of which contains a variable number of lines. Each line is translated into multiple
languages. Each translation occurs within a language tag (<lg>), and the name of
the language is provided as a code.
xxx
<text>
  <title>
    <lg name='roo'>Eake iava viapau kaakaukare reoreopave</lg>
    <lg name='eng'>Why dogs don’t talk</lg>
    <lg name='tp'>Watpo ol dok i no save toktok</lg>
  </title>
  <body>
    <paragraph>
      <line>
        <lg name='eng'>Gaurai lived alone in the bush.</lg>
        <lg name='tp'>Gaurai em wanpela i save stap long bus.</lg>
        <lg name='roo'>Gaurai raga oisoa topareve vegoaro.</lg>
      </line>
      <line>
        <lg name='eng'>He lived with his two dogs.</lg>
        <lg name='tp'>Em i save stap wantaim long tupela dok bilong em.</lg>
        <lg name='roo'>Vaiterei kaakau vaio rera vaitereiaro taporo oisoa t...</lg>
      </line>
      <line>
        <lg name='eng'>He was very lonely and unhappy.</lg>
        <lg name='tp'>Em wanpela i save stap and em i no save hamamas.</lg>
        <lg name='roo'>Rera raga oisoa toupareve uva viapau oisoa ruraparev...</lg>
      </line>
      <line>
        <lg name='eng'>He had no one to talk to.</lg>
        <lg name='tp'>Nogat narapela man long toktok wantaim.</lg>
        <lg name='roo'>Viapau irai oiratoa vai ra irai tapo orareoparo.</lg>
      </line>
    </paragraph>
    ...
  </body>
</text>
Storing texts in this format has at least two virtues. First, because XML is
easily parsed and validated, the parts of the text can be easily extracted and its
overall integrity can be easily verified. It is a fairly trivial matter to ensure that
for each line, there is a translation in all three languages, as shown by (??), which
is a short script that takes an XML file, validates it against the above-described
format, and prints a short error message if anything is amiss.
???
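The kind of check described above might be sketched as follows. This is not the script referred to in the text, only a minimal illustration written for Python 3 (xml.etree.ElementTree); the function name find_incomplete_lines, the set of required codes, and the abbreviated sample document are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Assumed for illustration: the language codes every line must carry.
REQUIRED_CODES = {"eng", "tp", "roo"}

def find_incomplete_lines(xml_string, required=REQUIRED_CODES):
    """Return (line_number, missing_codes) pairs for lines lacking a translation."""
    root = ET.fromstring(xml_string)
    problems = []
    for number, line in enumerate(root.findall("body/paragraph/line"), start=1):
        found = {lg.get("name") for lg in line.findall("lg")}
        missing = sorted(required - found)
        if missing:
            problems.append((number, missing))
    return problems

SAMPLE = """<text>
  <body>
    <paragraph>
      <line>
        <lg name='eng'>He had no one to talk to.</lg>
        <lg name='tp'>Nogat narapela man long toktok wantaim.</lg>
        <lg name='roo'>Viapau irai oiratoa vai ra irai tapo orareoparo.</lg>
      </line>
      <line>
        <lg name='eng'>He lived with his two dogs.</lg>
      </line>
    </paragraph>
  </body>
</text>"""

for number, missing in find_incomplete_lines(SAMPLE):
    print("ERROR: line %d lacks a translation for: %s" % (number, ", ".join(missing)))
```

Because the check works on sets of language codes, adding a fourth language to the text only requires changing REQUIRED_CODES.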
Second, because the text is stored independently of its formatting, it is possible
to create tools to extract its contents and automatically format them. This means
that errors in the text can be fixed in one place (namely, the XML version) and a
new formatted version can be produced quickly, easily, and automatically. This
is obviously superior to the alternative, which would be to maintain multiple
versions of the same text and, when errors are found in one version of the text, to fix
them in all others. (This is especially unmanageable as the number of languages
in parallel translation grows larger.) The script provided in (??) takes an XML
text and creates a bilingual text from it. The file and the two languages desired are
provided as command-line arguments.
???
Creating an English-Tok Pisin version of the text or an English-Rotokas
version of the text is simple, thanks to this script. The difference is simply which
language code is provided as the third argument on the command line:
(96) xxx
bilingual_text_formatter.py dogs.xml ENG TKP
bilingual_text_formatter.py dogs.xml ENG ROO
Note that the script has been written in such a way that it will give an error
message if one of the language codes supplied as a command-line argument is not
found in the XML file being processed:
(97) xxx
bilingual_text_formatter.py dogs.xml ENG ???
ERROR: Language code ‘???’ not found in ‘dogs.xml’! Codes found: ENG, ROO, TK
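The core of such a formatter can be sketched in a few lines. The following is a hypothetical illustration for Python 3, not the script described above: the function name bilingual_lines is invented, the language codes are the lowercase ones used inside the XML sample, and the input is a one-line excerpt held in a string.

```python
import xml.etree.ElementTree as ET

def bilingual_lines(xml_string, code1, code2):
    """Return (text1, text2) pairs for every line, or raise ValueError."""
    root = ET.fromstring(xml_string)
    available = sorted({lg.get("name") for lg in root.iter("lg")})
    for code in (code1, code2):
        if code not in available:
            raise ValueError("Language code '%s' not found! Codes found: %s"
                             % (code, ", ".join(available)))
    pairs = []
    for line in root.findall("body/paragraph/line"):
        # Map each language code to its text for this line.
        by_code = {lg.get("name"): lg.text for lg in line.findall("lg")}
        pairs.append((by_code.get(code1, ""), by_code.get(code2, "")))
    return pairs

SAMPLE = """<text><body><paragraph>
  <line>
    <lg name='eng'>Gaurai lived alone in the bush.</lg>
    <lg name='tp'>Gaurai em wanpela i save stap long bus.</lg>
    <lg name='roo'>Gaurai raga oisoa topareve vegoaro.</lg>
  </line>
</paragraph></body></text>"""

for first, second in bilingual_lines(SAMPLE, "eng", "tp"):
    print(first)
    print("    " + second)
```

Collecting the available codes before formatting anything is what makes the helpful error message possible: the unknown code is reported together with the codes that were actually found.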
19.4.3 Lexical Database

The management of a lexicon is no easy task, but XML can make the job some-
what easier.
(281) lexicalDatabaseEx1.xml
<entry>
  <root>ikauvira</root>
  <stem>ikau</stem>
  <translation>
    <language lg='eng'>quickly</language>
    <language lg='eng'>rapidly</language>
  </translation>
  <grammaticalClass>ADVERB</grammaticalClass>
  <semanticField>XXX</semanticField>
  <notes>XXX</notes>
  <dialects>
    <dialect>Central</dialect>
  </dialects>
  <examples>
    <example>
      <original>ikauvira</original>
      <morphemeGloss>run-ADV</morphemeGloss>
      <translation lg='eng'>quickly</translation>
    </example>
  </examples>
</entry>
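An entry of this kind can be turned into a bilingual dictionary entry with a few lines of ElementTree code. The following is a minimal sketch for Python 3; the <lexicon> wrapper element and the function name bilingual_dictionary are assumptions made for the example, and the entry is cut down to the fields that are used.

```python
import xml.etree.ElementTree as ET

# A cut-down entry in the style of the lexical database example;
# the <lexicon> wrapper element is an assumption made for this sketch.
SAMPLE = """<lexicon>
  <entry>
    <root>ikauvira</root>
    <translation>
      <language lg='eng'>quickly</language>
      <language lg='eng'>rapidly</language>
    </translation>
    <grammaticalClass>ADVERB</grammaticalClass>
  </entry>
</lexicon>"""

def bilingual_dictionary(xml_string, lg="eng"):
    """Map each root form to its list of translations in language lg."""
    lexicon = {}
    for entry in ET.fromstring(xml_string).findall("entry"):
        headword = entry.findtext("root")
        glosses = [t.text for t in entry.findall("translation/language")
                   if t.get("lg") == lg]
        lexicon[headword] = glosses
    return lexicon

for headword, glosses in sorted(bilingual_dictionary(SAMPLE).items()):
    print("%s: %s" % (headword, ", ".join(glosses)))
```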
It is very easy to validate XML and consequently to parse it and extract its
parts. It is relatively easy to create a bilingual dictionary from a dataset encoded
with the XML scheme illustrated in (281).
19.5 Exercises

1. ???
2. ???
3. ???
4. ???
5. ???
There is a great deal more to XML than what is covered in this chapter. This chap-
ter really only scratches the surface, and there are a good number of worthwhile
topics that were neglected, such as XXX, XXX, XXX. There is no shortage of ma-
terials on XML. For an article that overviews XML programming in Python, see
McGrath (1998a). Two book-length treatments of the topic are McGrath (2000)
and Jones and Drake (2002).
Chapter 20
Databases
The title of this chapter is to some extent a misnomer, given that it is not just an
introduction to databases, but covers the larger issue of data persistence. Data
persistence refers to the problem of ensuring that data used during one run of
a program is available to other programs or to a later run of the same program.
There are a number of different ways of achieving data persistence.
Here we cover the three main options available in Python: files, pickling, and
databases. Each will be discussed in turn.
20.1 Files
One way of achieving data persistence is through the use of files. File IO was
already covered in Chapter ??, and XML is covered in Chapter 19. Here we will
look at two file formats commonly used for the storage of semi-structured data,
namely tab-delimited values and comma-separated values.
The main thing to point out is that flat files face a number of problems as far
as data persistence is concerned. The main problems are the following:
Concurrency XXX
Consistency XXX
Data Format XXX
In §20.3 we will see that relational databases solve a number of these problems
and therefore are usually a superior alternative. (There are some situations where
relational databases are overkill and flat files will do just fine, but a database of
some sort is essential for anything large-scale.)
20.1.1 Comma-Separated Values (CSV)
The best way to handle files with comma-separated values is to use the module
csv, which has formed part of Python’s standard library since version 2.3. For full
documentation, see the appropriate entry in the Python Library Reference:
http://docs.python.org/lib/module-csv.html.
Before looking at the csv module, it pays to think briefly about why such a
module is needed. After all, the string method split can be used to split a string by
a given separator, as illustrated in (??).
(282) databasesSplitCSVEx1.py
data = """???,???,???,???"""
for line in data.split("\n") :
    for col in line.split(",") :
        print "%s" % col
The trouble with simply splitting a CSV file using commas is that commas can
occur within a data field, and such data fields will create parsing errors, as can be
seen in (??).
(283) databasesSplitCSVEx2.py
data = """???,???,???,???"""
for line in data.split("\n") :
    for col in line.split(",") :
        print "%s" % col
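The problem with embedded commas can be made concrete with a small sketch. It is written for Python 3 and uses an invented one-line record; csv.reader and io.StringIO are standard library calls, while the data are an assumption made for illustration.

```python
import csv
import io

# One record whose second field contains a comma inside quotes
# (the data are invented for illustration).
data = 'bank,"river bank, shore",noun\n'

# Naive splitting breaks the quoted field in two.
naive = data.strip().split(",")

# csv.reader understands the quoting convention.
parsed = next(csv.reader(io.StringIO(data)))

print(len(naive), naive)
print(len(parsed), parsed)
```

The naive split yields four pieces, two of which still carry stray quotation marks, whereas the reader returns the three intended columns.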
Although it is possible to write custom code to handle the parsing of CSV into
columns of data, it is much easier to use the built-in CSV capabilities of Python.
The approach is essentially the same as the one used for tab-separated values,
except that the splitting of lines into columns is done by ??? rather than a simple
split() function, as shown by the following script:
(284) dataPersistenceCSV1.py
import csv
import sys

def usage() :
    print "Usage: %s <FILEPATH>" % sys.argv[0]
    sys.exit(0)

def main() :
    try :
        filepath = sys.argv[1]
    except IndexError :
        usage()

    fo = open(filepath, 'rb')
    ln = 0
    # csv.reader splits each line into columns and copes with
    # quoted fields that contain the delimiter
    for row in csv.reader(fo) :
        ln = ln + 1
        cn = 0
        for col in row :
            cn = cn + 1
            print "[%i][%i] %s" % (ln, cn, col)
    fo.close()

main()
The advantage of this approach is that it handles both flavors of CSV formatting.
There is nothing particularly mysterious about files with tab-separated values.
For example, consider the following file with tab-separated values:
car vehicle ground
truck vehicle ground
jeep vehicle ground
tank vehicle ground
helicopter vehicle air
plane vehicle air
jet vehicle air
glider vehicle air
The basic idea is that each line has a number of values separated from one
another by tabs. These values can be thought of as columns of data and are
in fact displayed as such when files in this format are imported into spreadsheet
software (e.g., Microsoft Excel):
Table 20.1 Tab-Separated Values

Column 1      Column 2    Column 3
car           vehicle     ground
truck         vehicle     ground
jeep          vehicle     ground
tank          vehicle     ground
helicopter    vehicle     air
plane         vehicle     air
jet           vehicle     air
glider        vehicle     air
These files can be easily processed by opening a file object, extracting its
contents as a string, splitting the string into lines, and then splitting each line into
columns with the string method split, using tabs as the separator, i.e., split("\t").
(285) dataPersistenceTSV1.py
import sys

def usage() :
    print "Usage: %s <FILEPATH>" % sys.argv[0]
    sys.exit(0)

def main() :
    try :
        filepath = sys.argv[1]
    except IndexError :
        usage()

    fo = open(filepath, 'rU')
    fileContents = fo.read()
    fo.close()

    ln = 0
    for line in fileContents.split("\n") :
        ln = ln + 1
        cn = 0
        for col in line.split("\t") :
            cn = cn + 1
            print "[%i][%i] %s" % (ln, cn, col)

main()
20.2 Pickling
One of the main challenges in data persistence, especially in the context of
object-oriented programming, is how to save data that is stored within well-defined
objects. Although it is always possible to print the data out into a text file and then
later parse it back into your object model, this is not always desirable. The main reason
for this is that you will have to print out data in a format that can be parsed back
in again, and the work involved in ensuring that data can go into the object model
from a file and back out again can be onerous.
Python has a built-in method for saving objects out as files for the sake of data
persistence, which is colorfully referred to as pickling [1]. This is a convenient way
of saving data that you have parsed into an object model (see ???) when you don’t
want to go to the work of printing it out again to save into a file.
For example, let’s assume that you have two wordlists for unrelated languages
spoken in adjacent regions and you want to parse both of them and then find any
words in common between the two. The code in (??) parses these two wordlists
and builds a dictionary of the terms within them for easy comparison. There is no
point in printing out the data itself because it will only be used to find common
words and print them out. We could therefore pickle the data, as shown in (??).
(286) dataPersistencePicklingEx1.py
import pickle, sys

def reverse_dictionary(fn) :
    # Map each word of a definition to the terms it helps define
    d = {}
    fo = open(fn, 'rU')
    for l in fo :
        cols = l.rstrip("\n").split("\t")
        term = cols[0]
        definition = cols[1]
        for w in definition.split(" ") :
            if w not in d :
                d[w] = []
            d[w].append(term)
    fo.close()
    return d

def merge_dictionaries(d1, d2) :
    # Keep only the words that occur in both dictionaries
    d = {}
    for w in d1.keys() :
        if w in d2 :
            terms1 = d1[w]
            terms2 = d2[w]
            d[w] = (terms1, terms2)
    return d

def main() :
    try :
        fn1 = sys.argv[1]
        fn2 = sys.argv[2]
        fn3 = sys.argv[3]
    except IndexError :
        print "Usage: %s <DICT 1> <DICT 2> <PICKLE FILE>" % sys.argv[0]
        sys.exit(0)
    d1 = reverse_dictionary(fn1)
    d2 = reverse_dictionary(fn2)
    d = merge_dictionaries(d1, d2)
    fo = open(fn3, 'wb')
    pickle.dump(d, fo)
    fo.close()

main()
[1] Pickling is sometimes referred to as “serialization”, “marshalling”, or “flattening”, because
???.
There are many different ways of finding cognates among the words in the
pickled object. Using the same pickled object, we can employ different means
of identifying cognates, and none of them needs to worry about parsing the files
anymore, since the work is already done and the result is saved in our pickled
file.
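The round trip from object to file and back can be sketched in isolation. The following example is written for Python 3 and uses an invented dictionary in place of real wordlist data; the file name shared.pickle and the placeholder terms are assumptions made for illustration.

```python
import os
import pickle
import tempfile

# An invented stand-in for the merged dictionary: each gloss maps to the
# terms that share it in the two wordlists (placeholder terms only).
shared = {"dog": (["a1", "a2"], ["b1"]), "water": (["a3"], ["b2", "b3"])}

path = os.path.join(tempfile.mkdtemp(), "shared.pickle")

# Pickle files are binary, so open them in "wb" / "rb" mode.
with open(path, "wb") as fo:
    pickle.dump(shared, fo)

with open(path, "rb") as fo:
    restored = pickle.load(fo)

print(restored == shared)
```

The restored object is an ordinary dictionary with the same keys, tuples, and lists as the original. One point worth bearing in mind is that unpickling can execute code embedded in the file, so only pickles from trusted sources should be loaded.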
One possibility is to ???, as in (287)
(287) dataPersistenceUnpicklingEx1.py
import pickle, sys

def main() :
    try :
        fo = open(sys.argv[1], 'rb')
    except (IndexError, IOError) :
        print "Usage: %s <PICKLED FILE>" % sys.argv[0]
        sys.exit(0)
    d = pickle.load(fo)
    fo.close()

    for w in d.keys() :
        (wordlist1, wordlist2) = d[w]
        print w
        print " ".join(wordlist1)
        print " ".join(wordlist2)

main()
Another possibility is to ???, as in (288)

(288) dataPersistenceUnpicklingEx2.py
import pickle, sys

def main() :
    try :
        fo = open(sys.argv[1], 'rb')
    except (IndexError, IOError) :
        print "Usage: %s <PICKLED FILE>" % sys.argv[0]
        sys.exit(0)
    d = pickle.load(fo)
    fo.close()

    for w in d.keys() :
        (wordlist1, wordlist2) = d[w]
        print w
        # ???

main()
A few caveats concerning pickling are in order. [???]
20.3 Relational Databases

The most robust means of data storage is an RDBMS, a relational database
management system. The term relational database management system refers to the
concept of a relational database (Codd 1970). According to the Free On-line
Dictionary of Computing (http://www.foldoc.org):
A relational database allows the definition of data structures, storage

and retrieval operations and integrity constraints. In such a database
the data and relations between them are organised in tables. A table
is a collection of rows or records and each row in a table contains the
same fields. Certain fields may be designated as keys, which means
that searches for specific values of that field will use indexing to speed
them up.
There are four key elements to a relational database, often called the ACID
test on the basis of the mnemonic scheme for their recollection (XXX):
Atomicity The results of a transaction are committed in an all-or-none fashion.
(This prevents the database from entering into invalid states, e.g., crediting
one account without debiting another in a financial database.)
Consistency The database only goes from valid state to valid state. Tables may
be related by common keys. Consistency forces valid
relationships between tables. (In other words, it is impossible for a column
of one table to reference a non-existent entry in the column of another table
when both are properly designed.)
Isolation The results of transactions are invisible to other users until the
transaction is complete. (This ensures that changes occur in a block, and that
interrelated changes do not occur in a piecemeal fashion.)
Durability Committed transactions and the resulting states of the database are
fully recoverable. Durability ensures that once a transaction is
complete, the updated information will survive subsequent failures (e.g., a
crash or a power loss).
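Atomicity can be illustrated with the financial example mentioned in the list above. The sketch below is written for Python 3 and uses sqlite3, which ships with Python, as a stand-in for a full RDBMS so that it runs without a server; the account table, its CHECK constraint, and the function name transfer are assumptions made for illustration.

```python
import sqlite3

cnx = sqlite3.connect(":memory:")
cnx.execute("CREATE TABLE account (name TEXT PRIMARY KEY, "
            "balance INTEGER CHECK (balance >= 0))")
cnx.execute("INSERT INTO account VALUES ('a', 100)")
cnx.execute("INSERT INTO account VALUES ('b', 0)")
cnx.commit()

def transfer(cnx, source, target, amount):
    """Move amount between accounts; either both updates happen or neither."""
    try:
        cnx.execute("UPDATE account SET balance = balance + ? WHERE name = ?",
                    (amount, target))
        cnx.execute("UPDATE account SET balance = balance - ? WHERE name = ?",
                    (amount, source))
        cnx.commit()
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint: undo the credit as well.
        cnx.rollback()
        return False

ok = transfer(cnx, "a", "b", 500)      # more than account 'a' holds
balances = dict(cnx.execute("SELECT name, balance FROM account"))
print(ok, balances)
```

Although the credit to account 'b' is executed before the debit fails, the rollback leaves both balances exactly as they were before the transfer was attempted.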
In addition to the ACID criteria, an RDBMS has a number of other advantages
over alternative schemes for data persistence (such as flat files or XML). Some of
these advantages are listed below:
Performance RDBMSes are fast and can be fine-tuned for even greater
performance.
Interface The interface to an RDBMS is more or less uniform, to the extent that
the system conforms to the various standards set for SQL (Structured Query
Language), so the programmer need not build a new interface and the user
need not learn one. (This consideration argues more against tabular files
than XML, which is more easily manipulable.)
Scalability Relational databases continue to perform well as their size increases;
in other words, they are scalable. This is not often the case with flat files.
Robustness Relational databases are more robust than flat files in the sense that
constraints on tables can prevent the entry of nonsensical data. These
constraints can enforce the data model in various ways: by ensuring that entries
are unique with respect to particular parameters, by ensuring that the values
for columns come from a pre-determined set of options, etc.
Security Relational database management systems have a good deal of built-in
security (e.g., the ability to restrict user access privileges on a table-by-table
basis or even on a column-by-column basis).
20.4 Structured Query Language (SQL)
SQL (Structured Query Language) is a standardized language for interacting with
relational databases. It consists of a well-defined syntax that enables a user to send
queries to a database that will create, retrieve, delete, or edit data within the
database. The literature on SQL is absolutely huge, and there is no way to provide a
complete introduction to SQL in this book, but we should be able to introduce the basics
and give the reader a general idea of how it works and what it can be used for.
Readers interested in pursuing the topic in greater detail can refer to the suggested
reading section at the end of the chapter.
[SAY SOMETHING ABOUT STANDARDS]
Unfortunately, SQL, like most standards, is not universally adhered to. Different
database management systems (DBMS) depart from it to various degrees.
So, while in theory a standard SQL query will run properly on any database that
is SQL-compliant, in practice this is not always the case. For various reasons
(ranging from the understandable desire to provide extra functionality to the less
forgivable wish to lock users in to a particular DBMS once they have adopted it),
the implementation of SQL is not 100% uniform. The rule of thumb is to stick to
the basics. The more established a SQL convention is, the more likely it is to
work across different databases.
In the following sections, we will review the basics of SQL and show how it
can be used to access and manipulate structured data.
20.4.1 Creating and Removing Tables in a Database

The syntax of table creation, deletion, and modification is one of those areas where
standardization is especially weak. This has to do with various factors, not the
least of which is considerable variation between databases in their data types. For
example, Oracle 8i (a very widely used commercial database) stores text in types
such as CHAR, VARCHAR2, and CLOB, whereas MySQL (a widely used open source
database) uses CHAR, VARCHAR, and TEXT.
In the examples that follow, we will use MySQL because it is open source,
free, easy to use, and fairly simple to install. In addition, it comes pre-installed on
many Linux variants (e.g., Red Hat Linux).
The following SQL code will create three tables in a MySQL database.
(289) createTablesEx1.sql
CREATE TABLE bibliography (
    id              xxx,
    title1          xxx,
    title2          xxx,
    publisher       xxx,
    where_published xxx
);

CREATE TABLE person (
    id          xxx,
    first_name  xxx,
    middle_name xxx,
    last_name   xxx
);

CREATE TABLE authorship (
    id_person       xxx,
    id_bibliography xxx
);
The tables bibliography and person both have a column called id that
serves as a primary key. A primary key is a unique identifier that can be used
to identify a particular row within a table. The primary key of one table can be
referenced by another, where it serves as a foreign key. Using these three tables, a
row in the bibliography table can be related to a row in the person table by
means of the authorship table. (In §20.4.5, we will see how this can be done
using a method of combining table data called a join.)
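To make the roles of primary and foreign keys concrete, the three tables can be created in a throwaway database from Python. The sketch below is written for Python 3 and uses sqlite3, which ships with Python, as a stand-in for MySQL so that no server is needed; the column types and the NOT NULL and REFERENCES clauses are assumptions, since the table definitions in the text leave the types unspecified.

```python
import sqlite3

# Column types and constraints below are assumptions for this sketch;
# the listing in the text leaves them unspecified.
SCHEMA = """
CREATE TABLE bibliography (
    id              INTEGER PRIMARY KEY,
    title1          TEXT NOT NULL,
    title2          TEXT,
    publisher       TEXT,
    where_published TEXT
);
CREATE TABLE person (
    id          INTEGER PRIMARY KEY,
    first_name  TEXT NOT NULL,
    middle_name TEXT,
    last_name   TEXT NOT NULL
);
CREATE TABLE authorship (
    id_person       INTEGER NOT NULL REFERENCES person (id),
    id_bibliography INTEGER NOT NULL REFERENCES bibliography (id)
);
"""

cnx = sqlite3.connect(":memory:")
cnx.execute("PRAGMA foreign_keys = ON")   # SQLite enforces foreign keys only on request
cnx.executescript(SCHEMA)

cnx.execute("INSERT INTO person (first_name, last_name) VALUES ('Edward', 'Sapir')")
cnx.execute("INSERT INTO bibliography (title1) VALUES ('Language')")
cnx.execute("INSERT INTO authorship VALUES (1, 1)")

try:
    # Person 99 does not exist, so the foreign key constraint rejects this row.
    cnx.execute("INSERT INTO authorship VALUES (99, 1)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)
```

The authorship row that points at a non-existent person is refused, which is exactly the kind of consistency that a flat file cannot enforce.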
It’s probably not clear why we have designed our data model in this way.
Why use keys and why relate tables using them? The answer is that data that has
been normalized in this fashion is more consistent and efficient, since there is no
redundant data within tables, and more flexible. To see why, consider what an
alternative might look like, provided in (290).
(290) createTablesEx2.sql
CREATE TABLE bibliography (
    first_author    xxx,
    second_author   xxx,
    third_author    xxx,
    title1          xxx,
    title2          xxx,
    publisher       xxx,
    where_published xxx
);
In (290), there is only a single table, and the authors of each book are stored
in the same row as the book in the bibliography table. There are a number
of problems with this alternative approach. The first, and probably most obvious,
problem is that some bibliography entries will have more than three authors (e.g.,
scientific papers with a half-dozen or more authors). We could perhaps solve
the problem by adding more columns for more authors, but this causes as many
problems as it solves. First, there is no upper limit on the number of authors.
If we provide columns for thirteen authors, at some point a paper with fourteen
authors will turn up. Second, even if we did provide a ridiculously large number of
columns for authors, there would be problems. One is inefficiency. Some space is
set aside for each column of a row, even if it is blank, and there will be a very large
number of blank columns, especially if the number of columns is made ridiculously
large to accommodate papers with numerous authors. Another problem with this
approach is that we cannot put any constraints on these author columns. We
cannot force them to be non-blank, for example, since a paper with a single author
will necessarily have numerous blank columns.
All of the problems with (290) are solved by using unique identifiers for each
row in the tables person and bibliography. First, there is no longer any
problem with papers with a large number of authors, since all that is required
to associate an author with a bibliography entry is to add a new row to the table
authorship, and tables can hold a virtually unlimited number of rows. Second,
there is no problem with constraints. We can ensure that every foreign key refers to
a primary key and that all of the parts of a name are provided (by breaking a name
down into parts and making some parts mandatory, for example, the first and last
name). Finally, this scheme is far more efficient, because a name is stored only
once in the person table and everywhere else referred to by a number (which
requires much less storage than a string).
Since a database without data is not particularly useful, we have provided a
short SQL script that can be run in order to add a small dataset to the database
defined in (289). This seed script, given in (291), will add data for three authors
and three books to the database. One of the books is written by Leonard Bloomfield
xxx (xxxb) and another two by Edward Sapir xxx (xxxb,x).
(291) seedDatabaseEx1.sql
-- Add data to the table 'person'
insert into person (first_name, middle_name, last_name)
    values ('Edward', '', 'Sapir');
insert into person (first_name, middle_name, last_name)
    values ('Leonard', '', 'Bloomfield');
insert into person (first_name, middle_name, last_name)
    values ('???', '', 'Whorf');

-- Add data to the table 'bibliography'
insert into bibliography (title, xxx) values ('Language');
insert into bibliography () values ();
insert into bibliography () values ();

-- Add data to the table 'authorship'
insert into authorship (id_person, id_bibliography)
    values ((SELECT id
             FROM person
             WHERE),
            ());
insert into authorship (id_person, id_bibliography)
    values ((SELECT id
             FROM person
             WHERE),
            ());
insert into authorship (id_person, id_bibliography)
    values ((SELECT id
             FROM person
             WHERE),
            ());
Don’t worry if you don’t understand how this script works. Using this simple
database with only three tables, we will learn some SQL basics in the following
sections.
20.4.2 Adding data with INSERT

In SQL, new data can be added to a table using the keyword INSERT. Every time
INSERT is used, a new row will be added to the specified table. Therefore, (292)
will insert a single row into the person table.
(292) insertEx1.sql
INSERT INTO person (first_name, middle_name, last_name)
    values ('John', 'Peabody', 'Harrington')
The names of the columns are surrounded by parentheses, as are their values.
The sequential order of the columns and values determines which values
are assigned to which columns, so the two lists must be of equal length.
The SQL in (293) will therefore give rise to an error, as can be seen by running
it.
(293) insertEx2.sql
INSERT INTO person (first_name, last_name) values ('John');
Note that the table person also has a column called id, which is not given a
value by the insert statement in (292). This is because this column is automatically
handled by the table, thanks to its data type, which is the MySQL-specific XXX.
It is also possible to insert a value using a sequence (something like a dynamic
counter for generating IDs automatically).
(294) insertEx3.sql
INSERT INTO person (id, first_name, middle_name, last_name)
    values (XXX, 'John', 'Peabody', 'Harrington')
A value for the id column can be hard coded, as in (295), but this is not
recommended, since it may conflict with a pre-existing value. Furthermore, there
is little point, since it is automatically handled for you.
(295) insertEx4.sql
INSERT INTO person (id, first_name, middle_name, last_name)
    values (XXX, 'John', 'Peabody', 'Harrington')
Finally, the person table has a uniqueness constraint. Uniqueness constraints
guarantee that each row is unique with respect to the specified columns. In this
case, the table has a uniqueness constraint on
the columns first_name, middle_name, and last_name, ensuring that the
same name doesn’t appear in the table more than once. Any attempt to insert a
duplicate will result in an error, as the reader can confirm by trying to run the
insert statement in (292) more than once.
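Both kinds of error discussed in this section can be provoked from Python. The sketch below is written for Python 3 and uses sqlite3 as a stand-in for MySQL; the table definition, including the UNIQUE clause and the automatically assigned INTEGER PRIMARY KEY, is an assumed rendering of the constraints described in the text, and the exact error messages will differ from one database to another.

```python
import sqlite3

cnx = sqlite3.connect(":memory:")
# The UNIQUE clause is an assumed rendering of the uniqueness constraint
# described in the text.
cnx.execute("""CREATE TABLE person (
    id          INTEGER PRIMARY KEY,   -- filled in automatically by SQLite
    first_name  TEXT,
    middle_name TEXT,
    last_name   TEXT,
    UNIQUE (first_name, middle_name, last_name))""")

insert = ("INSERT INTO person (first_name, middle_name, last_name) "
          "VALUES ('John', 'Peabody', 'Harrington')")

cnx.execute(insert)
new_id = cnx.execute("SELECT id FROM person").fetchone()[0]

errors = []
try:
    cnx.execute(insert)                      # same name a second time
except sqlite3.IntegrityError as err:
    errors.append("duplicate: %s" % err)

try:
    # Two columns but only one value.
    cnx.execute("INSERT INTO person (first_name, last_name) VALUES ('John')")
except sqlite3.OperationalError as err:
    errors.append("mismatch: %s" % err)

print(new_id)
for message in errors:
    print(message)
```

The first insert succeeds and receives an id without one being supplied; the repeated insert and the statement with mismatched columns and values are both rejected.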
20.4.3 Removing data with DELETE
Now that we know how to add data to the database, it would be useful to know
how to remove it. In SQL, rows can be deleted from a table using the keyword
DELETE. The SQL command in (296) will delete all of the rows in a table. (A
word of warning: if a table has a great deal of data in it, this operation may be
time-consuming.)
(296) deleteEx1.sql
DELETE FROM person;
Typically, one wishes to delete specific rows in a table and not simply all of
them. To impose conditions on which rows will be deleted, the keyword WHERE
can be used. For example, to remove every entry in the table person that has the
last name ’Harrington’, the SQL command in (297) will do the job.
(297) deleteEx2.sql
DELETE FROM person WHERE last_name = 'Harrington'
20.4.4 Editing data with UPDATE

In SQL, preexisting rows in the database can be modified using the keyword
UPDATE.
(298) updateEx1.sql
UPDATE table_name set col1='xxx'
The update statement in (298) updates all of the rows in the table. The more typical
situation, however, is that you will want to modify only a subset of rows using the
keyword WHERE.
(299) updateEx2.sql
UPDATE table_name
SET col1='xxx',
    col2='xxx'
WHERE col3='xxx'
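The effect of a WHERE clause on UPDATE and DELETE can be checked by counting the rows each statement touches. The sketch below is written for Python 3 and uses sqlite3 as a stand-in for MySQL; the two-column person table and the sample rows are assumptions made for illustration.

```python
import sqlite3

cnx = sqlite3.connect(":memory:")
cnx.execute("CREATE TABLE person (first_name TEXT, last_name TEXT)")
cnx.executemany("INSERT INTO person VALUES (?, ?)",
                [("John", "Harrington"), ("Edward", "Sapir"),
                 ("Leonard", "Bloomfield")])

# Only rows matching the WHERE clause are touched.
updated = cnx.execute(
    "UPDATE person SET first_name = 'J. P.' WHERE last_name = 'Harrington'"
).rowcount
deleted = cnx.execute(
    "DELETE FROM person WHERE last_name = 'Sapir'").rowcount

remaining = cnx.execute(
    "SELECT first_name, last_name FROM person ORDER BY last_name").fetchall()
print(updated, deleted, remaining)
```

Each statement affects exactly one of the three rows; without the WHERE clauses, every row would have been updated or deleted.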
20.4.5 Retrieving data with SELECT

In SQL, retrieving rows from the database is done with the keyword SELECT.
The most basic SELECT command consists of a query from a single table, as in
(300), which selects all of the rows in the person table.
(300) selectEx1.sql
SELECT id,
       first_name,
       middle_name,
       last_name
FROM person;
In (300), specific columns were specified in the select statement, but it is also
possible to select all of the columns in a table using a star (*), as in (301).
(301) selectEx2.sql
SELECT * FROM person;
In general, however, it is preferable to specify explicitly the desired columns,
since the order in which columns will be retrieved using the star notation is not
always obvious, and may even change from one database system to the next.
(Remember, it is best to make one’s SQL as portable as possible, since at some
point it may be necessary to switch databases.)
To restrict the number of rows retrieved from a table, the keyword WHERE can
be used to impose conditions on which rows will be selected from the database. In
the example in (302), all of the columns in the person table are selected for any
row (or rows) where the column last_name is ‘Harrington’. (Note that whether the
string comparison is case-sensitive depends on the database and its settings. In
many systems it is, so different results would be obtained if the name provided
were, for example, ‘HARRINGTON’; MySQL’s default collations, by contrast, compare
strings case-insensitively.)
(302) selectEx3.sql
SELECT id,
       first_name,
       middle_name,
       last_name
FROM person
WHERE last_name = 'Harrington'
In some cases, it may be necessary to combine the data of two tables into a
single select. For example, to retrieve all of the authors associated with a particular
bibliographic reference (identified by its primary key), it is necessary to join the
authorship and the person tables, as in (303).
(303) selectEx4.sql
SELECT person.first_name,
       person.last_name
FROM person,
     authorship
WHERE authorship.id_person = person.id AND
      authorship.id_bibliography = 3;
The query in (303) crosses the rows in the table person with those in the
table authorship and retrieves only those where the foreign key id_person
of the authorship table corresponds to the primary key id of the person
table and the id_bibliography is the one provided (i.e., 3).
That may not be entirely clear to you, so let’s go through it again more slowly,
with the help of a diagram. A join cross all of the rows in one table with all of
the rows in another. Therefore, if there are i rows in one table and j rows in
another table, the total number of rows that result from joining the two tables is
263
20.5 Using Python with a Relational Database
T
Table 20.2 Outcome of a Join of Person and Authorship
Person Authorship
id first name last name id person id bibliography
1 Leonard Bloomfield 1 1
2 Edward Sapir 1 1
2 Edward Sapir 2 2
AF
2 Edward Sapir 2 3
3 ??? Whorf 1 1
3 ??? Whorf 2 2
3 ??? Whorf 2 3
i ∗ j. Returning to (303), if the seed data, and only the seed data is present in the
database, the result would be 9 rows, given that the table person has 3 rows and
the table authorship has 3 rows.
In this case, XXX.
(304) selectEx5.sql
SELECT id
FROM bibliography
WHERE title = 'Language'
Using a subselect, it is possible to embed one select within another. For example,
we can eliminate the need to know the id of a bibliographic reference
in (303) by using a subselect in lieu of the hardcoded value of 3, as in (305).
(305) subSelectEx1.sql
SELECT person.first_name,
       person.last_name
FROM person,
     authorship
WHERE authorship.id_person = person.id AND
      authorship.id_bibliography = -- Subselect produces value
      (SELECT id FROM bibliography WHERE title = 'Language');
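To see a subselect at work, the following sketch runs such a query against a small in-memory database. It assumes Python 3 and the standard library module sqlite3 as a stand-in for MySQL; the bibliography rows, and in particular the two placeholder titles, are invented for the illustration. Note that a subselect compared with = is expected to return a single value, so the title used must identify exactly one row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # stand-in for the MySQL database
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE bibliography (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE authorship (id_person INTEGER, id_bibliography INTEGER);
    INSERT INTO person VALUES (1, 'Leonard', 'Bloomfield');
    INSERT INTO person VALUES (2, 'Edward', 'Sapir');
    -- invented bibliography rows; only the title 'Language' occurs in the text
    INSERT INTO bibliography VALUES (1, 'Language');
    INSERT INTO bibliography VALUES (2, 'Placeholder Title A');
    INSERT INTO bibliography VALUES (3, 'Placeholder Title B');
    INSERT INTO authorship VALUES (1, 1);
    INSERT INTO authorship VALUES (2, 2);
    INSERT INTO authorship VALUES (2, 3);
""")

query = """
    SELECT person.first_name, person.last_name
    FROM person, authorship
    WHERE authorship.id_person = person.id AND
          authorship.id_bibliography =
          (SELECT id FROM bibliography WHERE title = ?)
"""
# The inner select is evaluated first and supplies the id for the outer one.
authors = conn.execute(query, ("Language",)).fetchall()
print(authors)
conn.close()
```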

20.5 Using Python with a Relational Database
In order to access a MySQL database within Python, you will need the module
MySQLdb. It is a third-party package (distributed as MySQL-python) rather than
part of the core Python distribution, so it must be installed separately.
Here we will show how you can use Python’s interface to MySQL in order to
interact with the lexical database that we created in the previous section.
20.5.1 Obtaining a Connection
Once you have imported the necessary module, MySQLdb, you will need to tell
Python to connect to a database. In order to do so, a few pieces of information
are required: the host, the username, the password, and the name of the database.
The following code does nothing more than obtain a database connection and
immediately close it. An error will be raised if the program is unable to connect
to the database. Otherwise, it will simply run and exit silently.
(306) databaseObtainConnection.py
import MySQLdb
# fill in the details of your own database between the quotes
cnx = MySQLdb.connect(host = "",
                      user = "",
                      passwd = "",
                      db = "")
cnx.close()
Note: Bear in mind that there is overhead associated with connecting to a
database, which means you don’t want to connect to the database repeatedly, since
doing so will significantly slow down your program. In addition, by using a single
connection, it is possible to ensure that no changes are made to a database if
something goes wrong during the interaction with it. This is known as a rollback, since
it rolls the database back to its prior state. The ability to roll back changes made
to a database is important, since it provides a way of maintaining the database’s
integrity. (In MySQL, rollbacks are available only for tables whose storage engine
supports transactions, such as InnoDB.)
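The idea of a rollback can be illustrated with a short sketch. It assumes Python 3 and the standard library module sqlite3 as a stand-in for MySQL; the table follows the lexemes table used in this chapter, but the NOT NULL constraint and the sample entries are invented so that one of the inserts fails.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # one connection for the whole interaction
conn.execute("CREATE TABLE lexemes (id INTEGER PRIMARY KEY, lexeme TEXT NOT NULL)")
conn.commit()

# Invented sample data; the None in the middle violates the NOT NULL constraint.
new_entries = ["sample-word-1", None, "sample-word-2"]

try:
    for entry in new_entries:
        conn.execute("INSERT INTO lexemes (lexeme) VALUES (?)", (entry,))
    conn.commit()        # make all of the inserts permanent together
    outcome = "committed"
except sqlite3.IntegrityError:
    conn.rollback()      # undo the inserts made before the failure
    outcome = "rolled back"

remaining = conn.execute("SELECT COUNT(*) FROM lexemes").fetchone()[0]
print(outcome, remaining)
conn.close()
```

Because the first insert is undone along with the failed one, the table is left exactly as it was before the loop started.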
20.5.2 Executing a Query

Once a connection to the database has been obtained, it is possible to run a SQL
statement on the database.
(307) databaseExecuteQuery.py
import MySQLdb
conn = MySQLdb.connect(host = "",
                       user = "",
                       passwd = "",
                       db = "")
cursor = conn.cursor()
cursor.execute("SELECT id, lexeme FROM lexemes")
rows = cursor.fetchall()
for r in rows :
    id = r[0]
    lex = r[1]
    print "[%s] %s" % (id, lex)
conn.close()
The SQL statement in (307) was hardcoded, but it is also possible to create
dynamic SQL statements using variables. For example, a program that retrieves
entries from a lexical database for a given lexeme would need to obtain the lexeme
from the user and then construct a SQL statement on the basis of the user’s input.
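(308) databaseExecuteQueryWithBindings1.py
import MySQLdb

# cursor is assumed to have been obtained as in (307);
# lexeme holds the word to be looked up
sql = "SELECT id, lexeme FROM lexemes WHERE lexeme = %s"
cursor.execute(sql, (lexeme,))
rows = cursor.fetchall()
for r in rows :
    id = r[0]
    lex = r[1]
    print "[%s] %s" % (id, lex)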
Notice that %s is used as a placeholder in the SQL statement. When the SQL
statement is executed with cursor.execute(), the value for this placeholder is
provided as an item within a tuple. (The trailing comma in (lexeme,) is what
makes it a tuple with a single item.) A complete program, which takes the lexeme
from the command line, is given in (309).
(309) databaseExecuteQueryWithBindings2.py
import MySQLdb, sys

try :
    lexeme = sys.argv[1]
except IndexError :
    sys.stderr.write("Usage: %s <LEXEME>\n" % sys.argv[0])
    sys.exit(1)

conn = MySQLdb.connect(host = "localhost",
                       user = "root",
                       passwd = "",
                       db = "rotokas-dictionary")
cursor = conn.cursor()
sql = "SELECT id, lexeme FROM lexemes WHERE lexeme = %s"
cursor.execute(sql, (lexeme,))
rows = cursor.fetchall()
for r in rows :
    id = r[0]
    lex = r[1]
    print "[%s] %s" % (id, lex)
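Why bother with a placeholder rather than pasting the value into the SQL string yourself? The sketch below suggests one reason. It assumes Python 3 and the standard library module sqlite3 as a stand-in for MySQL (note that sqlite3 writes its placeholder as ? where MySQLdb uses %s); the two sample entries are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lexemes (id INTEGER PRIMARY KEY, lexeme TEXT)")
conn.executemany("INSERT INTO lexemes (lexeme) VALUES (?)",
                 [("o'clock",), ("hour",)])      # invented sample entries

lexeme = "o'clock"    # user input that happens to contain a quote character

# With a placeholder, the database module takes care of quoting the value.
bound_rows = conn.execute(
    "SELECT id, lexeme FROM lexemes WHERE lexeme = ?", (lexeme,)).fetchall()

# Pasting the value into the statement by hand produces broken SQL here.
pasted_sql = "SELECT id, lexeme FROM lexemes WHERE lexeme = '%s'" % lexeme
try:
    conn.execute(pasted_sql)
    pasting_worked = True
except sqlite3.OperationalError:
    pasting_worked = False

print(bound_rows, pasting_worked)
conn.close()
```

The same quoting problem is what makes hand-built statements unsafe when the input comes from an untrusted user, so placeholders are the better habit even when the input looks harmless.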
20.5.3 Retrieving Results
Once a query has been executed, all that remains is to retrieve its results. In some
cases, there will be no results to retrieve. For example, a SQL insert statement
will either succeed or fail but will not necessarily return any results.
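To retrieve the results of a SQL query executed in Python, call the cursor’s
fetchall() method, which returns all of the rows produced by the query; each
row is a tuple with one item per selected column, as in (310).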
(310) databaseRetrieveResults.py
import MySQLdb
conn = MySQLdb.connect(host = "",
                       user = "",
                       passwd = "",
                       db = "")
cursor = conn.cursor()
cursor.execute("SELECT id, lexeme FROM lexemes")
rows = cursor.fetchall()
for r in rows :
    id = r[0]
    lex = r[1]
    print "[%s] %s" % (id, lex)
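Cursors offer more than one way of getting at results. The following sketch, which assumes Python 3 and the standard library module sqlite3 as a stand-in for MySQL, shows fetchone() and fetchall() side by side, together with rowcount for a statement that returns no rows. These names belong to the Python database API that both sqlite3 and MySQLdb follow; the sample entries are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE lexemes (id INTEGER PRIMARY KEY, lexeme TEXT)")

# An INSERT yields no result rows; rowcount reports how many rows it affected.
cursor.executemany("INSERT INTO lexemes (lexeme) VALUES (?)",
                   [("word-a",), ("word-b",), ("word-c",)])
inserted = cursor.rowcount

cursor.execute("SELECT id, lexeme FROM lexemes ORDER BY id")
first = cursor.fetchone()          # the next row as a tuple
rest = cursor.fetchall()           # all remaining rows as a list of tuples
nothing_left = cursor.fetchone()   # None once the results are exhausted

print(inserted, first, rest, nothing_left)
conn.close()
```

Fetching one row at a time is useful when a query may return more rows than you want to hold in memory at once.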
20.6 Case Study: Importing a Spreadsheet into a Lexical Database
20.7 Exercises
1. ???
2. ???
3. ???
4. ???
5. ???
20.8 Suggested Reading
To learn more about SQL, we recommend XXX. For more details concerning
database management systems and their relative strengths and weaknesses, see
XXX. For technical documentation of the Python DB-API, see XXX. In general, we
recommend the various open source database management systems, since they
are free, well-maintained, and have very active user communities. (They also
have the advantage of being customizable, although being able to do this requires
a level of technical knowledge that the average reader of this book is unlikely
to possess.) In particular, we recommend either MySQL (http://www.mysql.org)
or PostgreSQL (http://www.postgres.org), two fairly advanced open source
relational database management systems.
Chapter 21
The World Wide Web
The World Wide Web was invented by Tim Berners-Lee in 1989–1990, while he was
working at CERN. Originally designed as a tool for scientists to share
data, it has grown explosively to become a cultural phenomenon. As the web has
grown, the amount of textual material available on it has created a ready-made
corpus, freely accessible to language researchers who have the technical know-
how to tap into it. Although a full introduction to the topic of web programming
goes far beyond the scope of this book, we can still introduce a few basics that
will enable you to write simple scripts to connect to web pages and manipulate
their contents.
21.1 How the Web Works

Before you can begin writing programs that connect to web sites and interact with
them, you need to have a basic understanding of how the web works. The WWW
is organized around a client-server model. In other words, when you access a
web page, your computer (the client) sends a message (a request) to another computer
on the network (the server), asking for a particular web page. The server then sends back
a message (a response). The communication between the client and the server is
carried out using HTTP, the Hypertext Transfer Protocol. If the request was
successful, then the HTTP response will consist of the information necessary to
display the requested web page on your web browser, which is a program that is
able to display web pages (e.g., Netscape, Internet Explorer, or Firefox). This process is
depicted in the form of a diagram below in Figure 21.1.
???
Figure 21.1 ???
When you surf the web, you do so using a web browser, a program that is
able to send and receive messages in HTTP and display the information sent back
from a server. This information typically takes the form of HTML. For example,
consider the simple HTML document in (311).
(311) sampleWebPage.html
<html>
  <head>
    <title>Sample Web Page</title>
  </head>
  <body bgcolor="#FFFFFF">
    <h2>Title of Sample Web Page</h2>
    <p>The contents of the web page go here.</p>
  </body>
</html>
When rendered by a web browser, (311) will look something like Figure 21.2.
Figure 21.2 Sample Web Page: (311) in Mozilla Firefox Browser
Because HTML is simply plain text, it can be automatically processed like
any text document using Python’s text processing capabilities. In the following
sections, you will learn how to write programs that behave like web browsers and
automatically process web pages.
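The exchange of a request and a response can be watched in miniature without leaving your own computer. The sketch below assumes Python 3 and uses only standard library modules: it starts a tiny web server on the local machine, plays the role of the client, and inspects the response. The handler class and the page it serves are invented for the illustration; a real server would do a good deal more.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = (b"<html><head><title>Sample Web Page</title></head>"
        b"<body><p>The contents of the web page go here.</p></body></html>")

class SamplePageHandler(BaseHTTPRequestHandler):
    """Answers every GET request with the same small HTML page."""

    def do_GET(self):
        self.send_response(200)                      # status line: success
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()                           # blank line ends the headers
        self.wfile.write(PAGE)                       # body of the response

    def log_message(self, format, *args):
        pass                                         # keep the demonstration quiet

# The server: port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), SamplePageHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client: send a request and read the response.
client = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
client.request("GET", "/")
response = client.getresponse()
status = response.status
content_type = response.getheader("Content-Type")
body = response.read().decode("ascii")
client.close()

server.shutdown()
server.server_close()
print(status, content_type)
print(body)
```

The status code, the headers, and the body are the three parts of every HTTP response; a browser uses the first two to decide what to do with the third.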
21.2 Client Side Programming: Browsing the Web
As explained in §21.1, the web is based on a client-server model. A web browser
(the client) connects to a networked computer (the server) to display a web page.
21.2.1 The Basics

In order to write programs that will connect to web sites and automatically obtain
their contents, you will need to acquaint yourself with the module urllib, which
is part of the Python standard library (i.e., it comes with the standard distribution
of Python). The urllib module provides the building blocks for constructing a
program that will browse the web automatically. A very simple illustration of its
use is provided in (312).
(312) wwwWebBrowser.py
import sys, urllib

url = sys.argv[1]
socket = urllib.urlopen(url)
contents = socket.read()
print contents
socket.close()
This script takes a URL supplied to it as a command-line argument by the user
and prints out the URL’s contents, which will normally be raw HTML. It does
this by calling the function urlopen, which, given a URL, will return a file-like
object wrapped around a socket. A socket is one endpoint of a two-way
communication link between two programs
running on the network (in this case, Python and the web server that you are
connecting to). The method read is then called on this object to read the
contents of the desired URL.
Before testing the program yourself, make sure that you have network access
by trying to access a reliable web site (such as www.google.com or
www.yahoo.com) in your web browser. If that works, then it means you have a
working internet connection
and that you should be able to test (312) yourself by calling the script on the
same web site. Give the script the full URL, including the http:// prefix (e.g.,
http://www.google.com), since urlopen treats a name without a prefix as a
local file. If you are able
to view web pages on a browser but the script doesn’t work, the most likely reason
has to do with your network setup and something called a proxy server.
21.2.2 Dealing with Proxy Servers

A proxy server is a sort of middleman between you and the web and is typically
found on larger networks with many users. To understand how using a proxy
server works, you must first think about what happens when you connect to a web
site directly, without a proxy server. Under those circumstances, when you go to a
web page on your browser, your browser directly sends a request to a web server
and that web server responds directly to your computer. But when you use a proxy
server and your browser sends a request to a web server, the request goes to the
proxy server first, and the proxy server then forwards the request on to the desired web server.
The web server then sends a response to the proxy server, which then forwards
that response on to your computer.
(313) wwwWebBrowserProxy.py
import sys, urllib
url = sys.argv[1]
socket = urllib.urlopen(url, proxies={'http': 'http://www-proxy.mpi.nl:8080'})
contents = socket.read()
print contents
socket.close()
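The only difference from (312) is the proxies argument passed to urlopen, a
dictionary that maps the name of a protocol (here http) to the address of the
proxy server that should handle requests made with that protocol. The address
shown is only an example, so substitute that of your own proxy server. If you don’t
know whether your computer uses
a proxy server, you should ask your network administrator. If your network
administrator is unavailable, you might be able to find the settings in the preferences
of your web browser (Internet Explorer, Firefox, etc.).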
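For readers working with Python 3, where urlopen no longer accepts a proxies argument, the same effect can be had with a proxy handler. This is only a sketch of one way to set it up: the proxy address is a placeholder, the function name fetch is our own, and nothing is downloaded until fetch is actually called.

```python
import urllib.request

# Placeholder address; substitute the proxy server of your own network.
PROXY_URL = "http://proxy.example.org:8080"

# Map each protocol to the proxy that should handle it, as in the dictionary
# passed to urlopen in the Python 2 listing.
proxy_handler = urllib.request.ProxyHandler({"http": PROXY_URL})
opener = urllib.request.build_opener(proxy_handler)

def fetch(url):
    """Return the raw contents of url, requested by way of the proxy."""
    with opener.open(url, timeout=10) as response:
        return response.read()
```

A call such as fetch("http://www.google.com") would then send the request to the proxy, which forwards it to the web server.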
21.2.3 Handling HTML

Because most web documents are encoded as HTML, being able to process HTML
automatically is a crucial part of web programming. In the following sections, we
will cover a few HTML-processing basics, such as stripping the HTML from a text
(§21.2.3.1) or parsing it and manipulating it (§21.2.3.2).
21.2.3.1 Removing HTML Tags

It is often useful to strip all of the HTML tags from a web page before processing
its contents. There are a number of ways in which this can be done. One obvious
solution is to use regular expressions, as in (314).
(314) wwwStripHTMLRegex1.py
??? 1
Although (314) works in most cases, it will fail in a number of important
cases. First, ???. Second, there is no reason why a left or right angle bracket
cannot appear within a tag. We might therefore use a more permissive regular
expression, as in (315).
(315) wwwStripHTMLRegex2.py
??? 1
[MENTION PROBLEMS]
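To make the discussion concrete, here is a sketch of what the two regular expressions might look like. These are our own guesses at a simple and a more permissive pattern, written for Python 3, and not necessarily the patterns intended for (314) and (315); the sample strings are invented.

```python
import re

# A simple pattern: a tag is everything from < up to the next >.
SIMPLE_TAG = re.compile(r"<[^>]+>")

# A more permissive pattern: quoted attribute values may contain > or <.
PERMISSIVE_TAG = re.compile(r"""<(?:"[^"]*"|'[^']*'|[^'">])*>""")

plain = "<p>Hello <b>world</b></p>"
tricky = '<a title="a > b">link</a>'

plain_stripped = SIMPLE_TAG.sub("", plain)
tricky_simple = SIMPLE_TAG.sub("", tricky)           # stops at the first >
tricky_permissive = PERMISSIVE_TAG.sub("", tricky)

print(repr(plain_stripped))
print(repr(tricky_simple))
print(repr(tricky_permissive))
```

The simple pattern ends the tag at the first right angle bracket it meets, so part of the attribute value is left behind in the text. The permissive pattern copes with that case, but neither pattern knows anything about comments or about the contents of script elements.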
Given the various shortcomings of the regular expression approach, we need to
find a more reliable method
of stripping HTML tags from a document. Python’s standard library includes
sgmllib, a module which provides functionality for dealing with SGML, a
superset of HTML. Using sgmllib, it is possible to write a simple function that
will strip a string of all HTML tags, as in (316).
(316) wwwStripHTMLStripper.py
import sgmllib, sys, urllib

class Stripper(sgmllib.SGMLParser):
    def __init__(self):
        self.data = []
        sgmllib.SGMLParser.__init__(self)
    def unknown_starttag(self, tag, attrib):
        self.data.append(" ")
    def unknown_endtag(self, tag):
        self.data.append(" ")
    def handle_data(self, data):
        self.data.append(data)
    def gettext(self):
        text = "".join(self.data)
        return " ".join(text.split())  # normalize whitespace

def strip(text):
    s = Stripper()        # create Stripper object
    s.feed(text)          # feed text to be parsed
    s.close()             # close Stripper object
    return s.gettext()    # return stripped text

url = sys.argv[1]             # get url from command-line
socket = urllib.urlopen(url)  # open socket w/ URL
contents = socket.read()      # read HTML
print strip(contents)         # strip tags from HTML
socket.close()                # close socket
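Most of the work done by the function strip() is accomplished with a
custom class, Stripper, which inherits from the SGMLParser. An SGMLParser
reads the text passed to its feed method and calls one of its own methods for each
piece of markup that it encounters: unknown_starttag and unknown_endtag for
opening and closing tags that have no handler of their own, and handle_data for
the text between tags. Stripper overrides these methods so that every tag is
replaced with a space and the text is collected in a list, which gettext joins
together at the end.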
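The module sgmllib is no longer included in Python 3, but the same design carries over to the standard library module html.parser. The following sketch is an adaptation under that assumption, not a listing from the original program; the class and function names are kept the same so that the two versions can be compared, and the sample string is invented.

```python
from html.parser import HTMLParser

class Stripper(HTMLParser):
    """Collects the text of a document and replaces every tag with a space."""

    def __init__(self):
        super().__init__(convert_charrefs=True)   # turn &amp; etc. into text
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        self.pieces.append(" ")

    def handle_endtag(self, tag):
        self.pieces.append(" ")

    def handle_data(self, data):
        self.pieces.append(data)

    def gettext(self):
        text = "".join(self.pieces)
        return " ".join(text.split())             # normalize whitespace

def strip(text):
    s = Stripper()       # create Stripper object
    s.feed(text)         # feed text to be parsed
    s.close()            # tell the parser that the input is complete
    return s.gettext()   # return stripped text

sample = ("<html><body><h2>Title of Sample Web Page</h2>"
          "<p>The contents &amp; more.</p></body></html>")
stripped = strip(sample)
print(stripped)
```

As in the sgmllib version, the parser does the work of recognizing tags, so the class only has to say what should happen to each kind of material.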
21.2.3.2 Parsing HTML
???
To illustrate, let’s look at how one could connect to Wikipedia and compare the
articles in several languages for a particular topic. For our example, we will compare
the entries for George W. Bush on the Norwegian, Dutch, and English versions of the site.
(317) wwwParseWikipediaArticle.py
import urllib

protocol = "http://"

BushURL = ".wikipedia.org/wiki/George_W._Bush"

Dutch = "nl"
English = "en"
Norwegian = "no"

def get_url_contents(url) :
    print "Opening URL '%s'..." % url
    socket = urllib.urlopen(url)
    contents = socket.read()
    socket.close()
    return contents

NorwegianURL = protocol + Norwegian + BushURL
DutchURL = protocol + Dutch + BushURL
EnglishURL = protocol + English + BushURL

print "NORWEGIAN"
print get_url_contents(NorwegianURL)
print ""
print ""

print "DUTCH"
print get_url_contents(DutchURL)
print ""
print ""

print "ENGLISH"
print get_url_contents(EnglishURL)
print ""
print ""
21.2.4 Other Issues

21.2.4.1 Removing Javascript
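When you access a web page and strip all of the HTML from it, you may find that
there is more than just the content of the page remaining. The most commonly
encountered leftover is JavaScript, a scripting language whose code is embedded
in web pages (typically between <script> tags) and run by the web browser.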
???
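One way of keeping JavaScript out of the stripped text is to have the parser ignore everything between the opening and closing script tags. The sketch below assumes Python 3 and the standard library module html.parser; the class name, the decision to skip style elements as well, and the sample page are our own.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collects text, leaving out the contents of script and style elements."""

    SKIPPED = ("script", "style")

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pieces = []
        self.skip_depth = 0      # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        self.pieces.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1
        self.pieces.append(" ")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.pieces.append(data)

    def gettext(self):
        return " ".join("".join(self.pieces).split())

page = ('<p>Before</p>'
        '<script type="text/javascript">if (a < b) { alert("hi"); }</script>'
        '<p>After</p>')
parser = TextOnly()
parser.feed(page)
parser.close()
text = parser.gettext()
print(text)
```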
21.2.4.2 ???
???
21.2.5 Web Crawler

A web crawler is a program that browses the web in a methodical, automated
manner. Other commonly used names for a web crawler include web spider or
web robot. A web crawler is one type of bot, or software agent. In general, it
starts with a list of URLs to visit, called the seeds. As the crawler visits these
URLs, it identifies all the hyperlinks on the pages there and adds them to the list
of URLs to visit, called the crawl frontier. URLs from the frontier are then visited,
leading to new crawl frontiers, which are also processed.
What a web crawler does is usually called web crawling or spidering. Many
web sites use spidering as a means of providing up-to-date data. Search engines,
for example, use spidering to create a copy of all the visited pages for later
processing: the downloaded pages are indexed in order to provide fast
searches. Crawlers can also be used for automating maintenance tasks on a web
site, such as checking links or validating HTML code. Also, crawlers can be used
to gather specific types of information from web pages, such as harvesting e-mail
addresses (usually for spam).
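The cycle of seeds, hyperlinks, and crawl frontier can be sketched in a few lines. The example below assumes Python 3 and, so that it can be run without network access, crawls a made-up web of four pages held in a dictionary; the function and class names are our own. To crawl the real web, the fetch argument would be replaced by a function that downloads a page.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# A made-up, in-memory "web": each URL is mapped to the HTML found there.
PAGES = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.org/a.html": '<a href="/b.html">B</a> <a href="/c.html">C</a>',
    "http://example.org/b.html": '<a href="/">home</a>',
    "http://example.org/c.html": "no links here",
}

class LinkExtractor(HTMLParser):
    """Collects the targets of all hyperlinks on a page as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl(seeds, fetch, limit=100):
    """Visit pages breadth-first, starting from the seeds."""
    frontier = deque(seeds)      # URLs waiting to be visited
    seen = set(seeds)            # URLs already visited or queued
    visited = []
    while frontier and len(visited) < limit:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:         # page could not be obtained
            continue
        visited.append(url)
        extractor = LinkExtractor(url)
        extractor.feed(html)
        for link in extractor.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

order = crawl(["http://example.org/"], PAGES.get)
print(order)
```

Keeping track of the URLs that have already been seen is essential, since pages commonly link back to one another and the crawler would otherwise never finish. A crawler let loose on real web sites should also limit how often it makes requests.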
21.3 Server Side Programming: CGI Scripts for Web Servers
Every time you browse the web, you are connecting to a web server. A web
server can be programmed to serve up dynamic content and there are innumerable
frameworks for doing this.
Web development using a more sophisticated web development framework
(such as Zope/Plone) goes beyond the scope of this textbook, but here we can
cover a relatively simple framework for server side programming, known as CGI
(Common Gateway Interface).
21.3.1 CGI Basics

A CGI script is one that is called by a web server. Here we will show how a simple
CGI script called hello.cgi can be served up with Apache.
(319) hello.cgi
#!/usr/bin/python
# the header must be followed by a blank line, which separates it from the page
print "Content-Type: text/html\n"
print "<h1>Hello</h1>"
print "This is a greeting."
Once this script has been placed in the appropriate directory and made exe-
cutable, it should be possible to access it. When this is done, the results should
look something like Figure ??.
???
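What matters in the output of a CGI script is its layout: one or more header lines, then a blank line, then the page itself. The sketch below, written for Python 3, makes that layout explicit by writing the response to a text buffer and splitting it apart again; the function name is our own, and in a real CGI script the output would simply go to standard output.

```python
import io
import sys

def write_cgi_response(out=sys.stdout):
    """Write a minimal CGI response: header, blank line, body."""
    out.write("Content-Type: text/html\n")   # header line
    out.write("\n")                          # blank line ends the headers
    out.write("<h1>Hello</h1>\n")            # body of the page
    out.write("<p>This is a greeting.</p>\n")

# Capture the output in a buffer in order to look at its two parts.
buffer = io.StringIO()
write_cgi_response(buffer)
headers, body = buffer.getvalue().split("\n\n", 1)
print(headers)
print(body)
```

If the blank line is left out, the web server has no way of telling where the headers end, and it will typically report an error instead of showing the page.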
21.3.2 Customizing Web Pages with Parameters

???
21.3.3 Responding to User Input: Form Basics

???
21.4 Case Studies

Here we present a few case studies that involve web programming of some variety
(client or server side).
21.4.1 Reading The News in Indonesian

The growth of the World Wide Web has led to the accumulation of massive amounts of
raw, unstructured text in countless languages. Although many languages remain
underrepresented because of poor access to the necessary infrastructure, many
other languages have flourished on the web. Many newspapers can now be read
for free on the WWW. And thanks to the development of protocols and standards for
publishing news articles, it is now relatively easy to download newspaper articles
automatically. The key to this is RSS (Really Simple Syndication), a family of
XML-based formats that standardize the electronic publication of newsfeeds.
To see how RSS works, let’s have a look at an RSS newsfeed from the
Indonesian newspaper Antara, which provides a number of RSS newsfeeds, broken
down by area (e.g., ??? ‘???’ or ??? ‘???’). An example RSS file from Antara is
provided below in (320).
(320) antara-rss-feed-umm-abbrev.xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet href="feed.css" type="text/css"?>
<rss version="2.0">
<channel>
  <title>ANTARA - Umum</title>
  <description>Berita Terkini dari ANTARA News</description>
  <link>http://www.antara.co.id</link>
  <lastBuildDate>Thu, 29 Jun 2006 05:15:00 +0100</lastBuildDate>

  <item>
    <title>Pemilihan PM Baru Timtim Perlu Waktu Sebulan</title>
    <description> Para pemimpin politik di Timor Timur (Timtim) mungkin memerlukan waktu sebul
  </item>
  <item>
    <title>Montenegro Resmi Jadi Anggota ke-192 PBB</title>
    <description> Montenegro resmi menjadi anggota ke-192 PBB, Rabu, sebulan setelah negara it
  </item>
  ...
  <item>
    <title>TNI AL Tangkap Enam Kapal Penyelundup Ikan</title>
    <description> Kapal patroli TNI AL, KRI Layang 805, berhasil menangkap enam kapal berikut 61 ABK (a
    <pubDate>Wed, 28 Jun 2006 17:22:06 +0100</pubDate>
  </item>
</channel>
</rss>
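There are a few things to note about this RSS file. First, there is some
metadata provided at the beginning of the file, such as the version of XML and
the character encoding (ISO-8859-1). Second, the actual
content begins with the root tag <rss>, which specifies which version of the RSS
format is used (2.0). Third, the <channel> tag describes the newsfeed as a whole
(its title, its description, a link to the web site, and the date on which it was
last built) and contains the individual items.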
Each <item> tag in the RSS newsfeed corresponds to a newspaper article.
For example, the first item contains the information provided in Table 21.1.
Title        Pemilihan PM Baru Timtim Perlu Waktu Sebulan
Link         http://www.antara.co.id/seenws/?id=36957
Description  Para pemimpin politik di Timor Timur (Timtim) mungkin memerlukan
             waktu sebulan untuk memilih seorang perdana menteri baru, kata
             seorang anggota partai Fretilin yang berkuasa di parlemen.
Timestamp    Thu, 29 Jun 2006 04:47:06 +0100
ID           http://www.antara.co.id/seenws/?id=36957
Table 21.1 Newspaper Article from RSS Newsfeed for the Indonesian Newspaper Antara
This RSS newsfeed provides a link to the web version of the full article, from
which the newspaper article itself can be extracted. Using some Python scripting,
it is possible to automate the downloading of the RSS newsfeed and the extrac-
tion of the article from the link it refers to. In this way, a corpus of Indonesian
newspaper articles can be quickly and easily created.
The first step is to obtain the RSS newsfeed itself. In §21.2.1, we learned how to
download a web page given a URL. The same technique can be used to download
an RSS newsfeed. It really doesn’t make much difference whether it is HTML or
XML. The procedure is the same, as shown by the code in (321).
(321) wwwDownloadRSSNewsfeed.py
import urllib

url = "http://www.antara.co.id/rss/umm.xml"
socket = urllib.urlopen(url)
rss_contents = socket.read()
socket.close()
print rss_contents
The next step is to parse the RSS newsfeed and obtain the link to the full
newspaper article.
(322) wwwParseRSSNewsfeed.py
import urllib
try :
    import xml.etree.ElementTree as ET    # standard library from Python 2.5
except ImportError :
    import elementtree.ElementTree as ET  # add-on package for older versions

# download the newsfeed as in (321)
url = "http://www.antara.co.id/rss/umm.xml"
socket = urllib.urlopen(url)
rss_contents = socket.read()
socket.close()

tree = ET.fromstring(rss_contents)
print "Items Found in RSS Newsfeed:"
for item in tree.findall("channel/item") :
    title = item.findtext("title")
    date = item.findtext("pubDate")
    cat = item.findtext("category")
    link = item.findtext("link")
    desc = item.findtext("description").strip()
    guid = item.findtext("guid")
    print "--"
    print "TITLE: [%s]" % title
    print "DATE: [%s]" % date
    print "GUID: [%s]" % guid
    print "LINK: [%s]" % link
    print "CATEGORY: [%s]" % cat
    print "DESCRIPTION: [%s]" % desc
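The parsing step can be tried out without downloading anything by applying it to a small newsfeed held in a string. The sketch below assumes Python 3 and the standard library module xml.etree.ElementTree. The first item is taken from Table 21.1; the second is deliberately left incomplete, as an assumption made for the illustration, to show what happens when an element is missing.

```python
import xml.etree.ElementTree as ET

# A cut-down newsfeed in the format of the Antara example.
RSS_SAMPLE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<rss version="2.0">
<channel>
  <title>ANTARA - Umum</title>
  <item>
    <title>Pemilihan PM Baru Timtim Perlu Waktu Sebulan</title>
    <link>http://www.antara.co.id/seenws/?id=36957</link>
    <pubDate>Thu, 29 Jun 2006 04:47:06 +0100</pubDate>
    <guid>http://www.antara.co.id/seenws/?id=36957</guid>
  </item>
  <item>
    <title>Montenegro Resmi Jadi Anggota ke-192 PBB</title>
  </item>
</channel>
</rss>
"""

tree = ET.fromstring(RSS_SAMPLE)      # the root element, <rss>
items = []
for item in tree.findall("channel/item"):
    items.append({
        "title": item.findtext("title"),
        "link": item.findtext("link"),            # None if there is no <link>
        "date": item.findtext("pubDate"),
        # findtext returns None for a missing element, so guard before strip()
        "description": (item.findtext("description") or "").strip(),
    })

for entry in items:
    print(entry["title"], entry["link"])
```

Guarding against missing elements is worthwhile, since not every newsfeed supplies every element for every item.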
The final step is to grab the newspaper article from the web and extract the
newspaper article from the surrounding HTML markup.
(323) wwwGetNewsArticle.py
import re, sys, urllib

url = sys.argv[1]  # link to the article, as found in the RSS newsfeed
socket = urllib.urlopen(url)
html = socket.read()
socket.close()
regex = r'<td class=contenttxt width="100%">(.*?)</td>'
reo = re.compile(regex, re.DOTALL)
mo = reo.search(html)
if mo :
    content = mo.group(1)
    content = content.replace("<br />" , "")
    content = re.sub(r"</?b>", "", content)
    content = re.sub(r"$\*$.*", "", content)
    content = content.strip()
    print content
else :
    sys.stderr.write("Regex failed to find article content!\n")
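The extraction step can be packaged as a function and tested on a fragment of HTML before it is let loose on real pages. The sketch below assumes Python 3; the sample HTML is invented to resemble the markup that the regular expression expects, and, unlike the program above, it turns line breaks into newlines rather than deleting them.

```python
import re

# The article text is assumed to sit in a table cell marked like this.
ARTICLE_RE = re.compile(r'<td class=contenttxt width="100%">(.*?)</td>', re.DOTALL)

def extract_article(html):
    """Return the text of the article, or None if the markup is not found."""
    mo = ARTICLE_RE.search(html)
    if mo is None:
        return None
    content = mo.group(1)
    content = content.replace("<br />", "\n")   # line breaks become newlines
    content = re.sub(r"</?b>", "", content)     # drop bold tags
    return content.strip()

sample_html = ('<table><tr><td class=contenttxt width="100%">'
               'First paragraph.<br />Second <b>paragraph</b>.'
               '</td></tr></table>')
article = extract_article(sample_html)
missing = extract_article("<html><body>Nothing here</body></html>")
print(article)
```

Returning None when the pattern does not match lets the calling program decide what to do, which is useful because a web site can change its layout at any time.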
Finally, the short scripts described above can be combined into a single pro-
gram that handles the process from start to finish and puts the resulting newspaper
articles into a user-specified directory.
21.4.2 Obtaining Information from The Ethnologue
The Summer Institute of Linguistics (SIL) publishes a book called The Ethnologue,
an index of the world’s languages which provides basic
background information such as where a language is spoken, how many people
speak it, whether there is a translation of the Bible in the language, etc. As a
public service, the SIL makes the data in The Ethnologue publicly available on the
WWW at www.ethnologue.org.
Because a given language may go by multiple names, it is necessary when
building a database of languages to have a unique identifier associated with each
language. The SIL uses a three-letter code for language identification, called simply
an Ethnologue code. (For more information about these codes, go to
http://www.ethnologue.com/codes/default.asp.) It is possible to look
up a language using this code by going to a web page that lists language
information, given a specific Ethnologue code.
The URL for this page is www.ethnologue.com/show_language.asp
and it takes a single parameter, code (the Ethnologue code for the language that
you want to obtain information about). For example, if you wanted to obtain
information about Rotokas (a Papuan language spoken in Bougainville, Papua New
Guinea), you would use the URL provided in (98).
(98) www.ethnologue.com/show_language.asp?code=ROO
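A URL of this kind consists of the address of the page, a question mark, and then the parameters as name=value pairs. Rather than gluing the pieces together by hand, you can let the standard library do it, as in the following sketch for Python 3; the function name language_url is our own.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "http://www.ethnologue.com/show_language.asp"

def language_url(code):
    """Build the URL of the page for the language with the given code."""
    return BASE_URL + "?" + urlencode({"code": code})

url = language_url("ROO")

# Going the other way: pick the parameters out of a URL.
params = parse_qs(urlsplit(url).query)
print(url)
print(params)
```

The function urlencode also takes care of characters that are not allowed in a URL, which matters as soon as a parameter value contains spaces or punctuation.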
When you connect to this URL using a web browser, the results will look
something like this. The web page is generated by connecting to a remote web
server and obtaining a response. The response consists of HTML, which is rendered
by your web browser to look like this (as of The Ethnologue 15).
Using a simple Python script, it is possible to view the raw HTML that is used
to generate this web page. The following script will do the job:
???
(324) wwwEthnologueClassificationByCode.py
#!/usr/bin/python

import re, sys, urllib

def get_affiliation(classification) :
    if classification == "---" :
        return "---"
    # the first item in the classification is the top-level family
    affiliation = classification.split(",")[0].strip()
    if affiliation == "Austronesian" :
        return "AN"
    else :
        return "NAN"

def get_classification_by_code(code) :
    sys.stderr.write("<%s>\n" % code )
    proxies = {'http': 'http://www-proxy.mpi.nl:8080'}
    url = "http://www.ethnologue.com/show_language.asp?code=%s" % code
    # non-greedy group, so that the match ends at the first closing tag
    regex = r'<TD><I>Classification</I></TD>.*?<TD><A HREF=".*?">(.*?)</A></TD>'
    #print "Attempting to connect to the following URL:"
    #print " [%s]" % url
    sock = urllib.urlopen(url, proxies=proxies)
    html = sock.read()
    sock.close()
    #print html
    reo = re.compile(regex, re.DOTALL)
    mo = reo.search(html)
    if mo is None :
        return "---"  # the page does not have the expected layout
    classification = mo.group(1)
    #print "Classification: [%s]" % classification
    return classification

def main() :
    # name of the tab-delimited input file, assumed here to be given on
    # the command line
    filename = sys.argv[1]
    fo = open(filename, 'rU')
    lines = fo.read().split("\n")
    fo.close()

    print lines[0]
    for l in lines[1:] :
        if not l.strip() :
            continue  # skip blank lines, e.g. at the end of the file
        cols = l.split("\t")
        code = cols[1].lower()
        if code == "---" :
            classification = "---"
            major_affiliation = "---"
        else :
            classification = get_classification_by_code(code)
            major_affiliation = get_affiliation(classification)
        cols[2] = major_affiliation
        cols[3] = classification
        print "\t".join(cols)

main()
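The part of this program most likely to go wrong is the regular expression, since it depends on the exact layout of the web page. It is therefore worth testing it separately on a piece of HTML. The sketch below assumes Python 3; the sample HTML and the classifications in it are invented to match the pattern, and the function parse_classification is our own name for the extraction step.

```python
import re

CLASSIFICATION_RE = re.compile(
    r'<TD><I>Classification</I></TD>.*?<TD><A HREF=".*?">(.*?)</A></TD>',
    re.DOTALL)

def parse_classification(html):
    """Return the classification string, or None if it cannot be found."""
    mo = CLASSIFICATION_RE.search(html)
    return mo.group(1) if mo else None

def get_affiliation(classification):
    """Reduce a classification to AN (Austronesian) or NAN (anything else)."""
    if classification is None:
        return "---"
    family = classification.split(",")[0].strip()
    return "AN" if family == "Austronesian" else "NAN"

# Invented HTML in the layout that the regular expression expects.
SAMPLE = ('<TR><TD><I>Classification</I></TD>\n'
          '<TD><A HREF="family.asp?id=1">Austronesian, Sample Subgroup</A></TD></TR>\n'
          '<TR><TD><I>Other</I></TD><TD><A HREF="x.asp">unrelated link</A></TD></TR>')

found = parse_classification(SAMPLE)
not_found = parse_classification("<html><body>Page not available</body></html>")
print(found, get_affiliation(found))
print(not_found, get_affiliation(not_found))
```

Because the group is non-greedy, the match stops at the first closing tag after the classification rather than running on to a later link on the same page.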
21.5 Exercises
???
• ???
• In §21.4.1, we discussed a way of writing a program that automates the work
of downloading news articles from a web site and saving them as text files.
Saving the newspaper articles as plain text files is not ideal since the infor-
mation about them from the RSS newsfeed is not saved with them. How
would you write Python code to save the newspaper articles in a database
rather than as text files?

21.6 Suggested Reading
There is a great deal to learn about web programming in Python. For starters, you
will need to know a good deal more about the various protocols and standards that
underlie it: TCP/IP (?), HTTP (?), and HTML (?). [???] Once you have mastered the basics,
you will be ready for more advanced books, such as ? or ?.
Bibliography
Allison Randal, Dan Sugalski, and Leopold Tötsch. Perl 6 and Parrot Essentials.
O’Reilly Press, ???, second edition, 2004.
Harald Baayen. An introduction to R for the language sciences, 2006.
Jerry Banks, editor. Handbook of Simulation: Principles, Methodology, Advances,
Applications, and Practice. Wiley-Interscience, ???, 1998.
David Beazley. Python Essential Reference. Sams, ???, 2001.
Ruth Bernstein. Population Ecology: An Introduction to Computer Simulations.
John Wiley and Sons, ???, 2003.
Steven Bird, ??? Klein, and ??? Loper. Introduction to Computational Linguistics.
Cambridge University Press, Cambridge University, 2006.
Timothy Budd. An Introduction to Object-Oriented Programming. Addison-
Wesley, ???, 1996.
Martin Campbell-Kelly. From Airline Reservations to Sonic the Hedgehog: A
History of the Software Industry. MIT Press, Cambridge, MA, 2004.
Siobhan Chapman, editor. Philosophy for Linguists. Routledge, ???, 2000.
E. F. Codd. A relational model for large shared data banks. Communications of
ACM, 13(6):377–387, 1970.
Damian Conway. Object Oriented Perl. Manning Publications, ???, 1999.
Bart De Boer. The Origins of Vowel Systems. Oxford University Press, New York,
2001.
Ali Farghaly, editor. A Handbook for Language Engineers. CSLI, Stanford, 2003.
Jeffrey E. F. Friedl. Mastering Regular Expressions. O’Reilly, Sebastopol, CA,
2002.
Nigel Gilbert and Klaus G. Troitzsch. Simulation for Social Scientists. Open
University Press, New York, second edition, 2005.
Nathan A. Good. Regular Expression Recipes: A Problem-Solution Approach.
Apress, ???, 2004.
Barbara F. Grimes and Joseph E. Grimes. Ethnologue: Languages of the World.
Summer Institute of Linguistics, Dallas, Texas, fourteenth edition, 2000.
Mark Hammond and Andy Robinson. Python Programming on Win32. O’Reilly,
Sebastopol, CA, 2000.
Daryl Harms and Kenneth McDonald. The Quick Python Book. Manning
Publications Company, Greenwich, CT, 2000.
Elliotte Rusty Harold and W. Scott Means. Learning XML. O’Reilly, Sebastopol,
2004.
Steve Holden, editor. Python Web Programming. Pearson Education, xxx, 2002.
ISBN 0735710902.
Andrew Hunt and David Thomas. The Pragmatic Programmer: From Journeyman
to Master. Addison-Wesley, ???, 1999.
Christopher A. Jones and Fred L. Drake. Python and XML. O’Reilly, Sebastopol,
CA, 2002. ISBN 0-596-00128-2.
Daniel Jurafsky and James H. Martin. Speech and Language Processing: An
Introduction to Natural Language Processing, Computational Linguistics, and
Speech Recognition. Prentice Hall, xxx, 2000.
Dan Klein and Christopher Manning. Natural language grammar induction using
a constituent-context model. In Thomas G. Dietterich, Suzanna Becker,
and Zoubin Ghahramani, editors, Advances in Neural Information Processing
Systems 14, volume 1, pages 35–42, Cambridge, MA, 2002. MIT Press.
Dan Klein and Christopher Manning. Corpus-based induction of syntactic
structure: Models of dependency and constituency. In Proceedings of the 42nd
Annual Meeting of the Association for Computational Linguistics, ???, 2004.
???
Leslie Lamport, editor. xxx. xxx, xxx, xxx.
Steven Levy. Hackers: Heroes of the Computer Revolution. Penguin, ???, 2001.
Mark Lutz and David Ascher. Learning Python. O’Reilly, Sebastopol, CA, 1998.
Chris Manning. Review of Programming for Linguists: Java Technology for
Language Researchers. Language, 81(3):740–742, 2005.
Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural
Language Processing. MIT Press, Cambridge, MA, 1999.
Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann
Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. The Penn Treebank:
Annotating predicate argument structure. 1994.
Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a
large annotated corpus of English: the Penn Treebank. Computational
Linguistics, 19(2):313–330, 1993.
Alex Martelli. Python in a Nutshell. O’Reilly, Sebastopol, CA, 2003.
James D. McCawley, editor. Everything That Linguists Have Always Wanted
to Know About Logic but Were Ashamed to Ask. Chicago University Press,
Chicago, 1993.
S. McGrath. XML programming in Python. Dr. Dobb’s Journal of Software Tools,
23(2):82–??, 84–87, 101–104, February 1998a. ISSN 1044-789X.
Sean McGrath. Internet programming: XML programming in Python. Dr. Dobb’s
Journal of Software Tools, 23(2):82, 84–87, 101–104, February 1998b.
Sean McGrath. XML processing with Python. The Charles F. Goldfarb series
on open information management. Prentice-Hall, Englewood Cliffs, NJ
07632, USA, 2000. ISBN 0-13-021119-2. URL http://www.phptr.
com/ptrbooks/ptr 0130211192.html. Includes CD-ROM.
David Mertz. Text Processing in Python. Addison-Wesley, Reading, MA, USA,
2003.
Douglas B. Paul and Janet M. Baker. The design for the Wall Street Journal-based
CSR corpus. In HLT ’91: Proceedings of the workshop on Speech and Natural
Language, pages 357–362, Morristown, NJ, USA, 1992. Association for
Computational Linguistics. ISBN 1-55860-272-0.
Mark Pilgrim. Dive into Python. Apress, ???, 2004.
Erik T. Ray. Learning XML. O’Reilly, Sebastopol, 2003.
Eric S. Raymond. The New Hacker’s Dictionary. MIT Press, Cambridge, MA,
1996.
Eric S. Raymond. The Cathedral and the Bazaar. O’Reilly Press, Sebastopol,
2001.
Eric S. Raymond. The Art of Unix Programming. Addison-Wesley, 2003.
Boudewijn Rempt, editor. GUI Programming With Python: Using the Qt Toolkit.
Opendocs Lic, xxx, 2002. ISBN xxx.
Richard Stallman. Free Software, Free Society: Selected Essays of Richard M.
Stallman. Free Software Foundation, Boston, MA, 2002.
Brian Stross. Love in the Armpit: Tzeltal Tales of Love, Murder, and Cannibalism,
volume 23 of Museum Brief. Museum of Anthropology of Missouri, Columbia,
Missouri, 1977.
John Verzani. Using R for Introductory Statistics. Chapman and Hall/CRC, ???,
2004.
Andrew Watt. Beginning Regular Expressions. Wrox, ???, 2005.
Stephen Weber, editor. The Success of Open Source. Harvard University Press,
Cambridge, Mass., 2004. ISBN 0674012925.
Matt Weisfeld. The Object-Oriented Thought Process. Sams, ???, 2000.
Glossary
argument A variable passed to a function when it is called., 89
Assignment To give a value to a variable., 50
attributes ???, 151
Backreferences ???, 230
block structure ???, 48
callback function ???, 113
character class ???, 196
class The definition of an object., 152
class variable ???, 183
compilers ???, 47
computational linguistics ???, 28
constructor ???, 182
control structures A device that controls data flow within a pro-
gram., 79
data flow ???, 79

data model ???, 257
data munging ???, 26
data persistence ???, 249
data structure Any method of organising a collection of data to
allow it to be manipulated effectively., 63
data types ???, 57
declaration To assert the existence of a variable., 49
dictionary A data type that associates unique keys with ar-
bitrary values., 71
directories ???, 111
disjunction The technical term for ???., 194
encapsulation ???, 157

escape character ???, 127
exception ???, 97
expression ???, 52
file path ???, 105
float ???, 59
foreign key ???, 257
function A block of code that can be called by name., 87
global variable A variable that is accessible anywhere in a program
(as opposed to a local variable)., 89
HTML Hyper Text Markup Language, 265

HTTP Hyper Text Transfer Protocol: ???, 265
IDE Integrated Development Environment, 32

identity test Checking whether two objects have the same
value., 52
immutable ???, 75
immutable Any data type that cannot be directly modified.,
57
implementation ???, 157
index ???, 64
Inheritance ???, 162
initialization To declare a variable and assign it a value for the
first time., 50
instance variables ???, 182
integer ???, 58
interface ???, 156
interpreters ???, 47
IO ???, 105
join ???, 257
keyword A word that has a special meaning in Python and
therefore cannot be used as a variable name (e.g.,
if or and)., 50
list ???, 63
list comprehension A Python idiom for building a new list by applying an
expression to every item in an existing sequence., 84
literal characters A character that stands for itself in a regular
expression., 190
looping The repeated execution of a block of code., 80
metacharacters A character that has a special meaning in a regular
expression (e.g., the dot is a wildcard character)., 190
methods ???, 152
minimal pairs A pair of words that differ only by a single
phoneme., 141
modularity ???, 171
module ???, 119
multiple inheritance Inheritance that allows a class to inherit from
multiple parent classes., 165
normalized ???, 257
open source Describes software whose source code is open, i.e., it can
be freely viewed and modified., 19
override ???, 168
parse To break information down into its constituent parts., 242
pickling ???, 252
Polymorphism ???, 175
primary key ???, 257
private ???, 183
proxy server ???, 267
public ???, 183
RDBMS Relational Database Management System, 254

reduplication ???, 231
refactoring Improving a computer program by reorganising
its internal structure without altering its external
behaviour., 141
regular expressions ???, 189
relational database ???, 254
return value ???, 93
RSS Really Simple Syndication. RSS is an XML for-
mat used for web feeds., 269
seed script A program that is run to add data to a fresh
database in order to determine its initial state., 258
SGML Standard Generalized Markup Language, 268
single inheritance Inheritance that restricts a class to inheriting from
only one parent class., 165
slice notation ???, 129
slicing ???, 66
socket In network programming, one endpoint of a two-way
communication link between two programs running on the network., 266
SQL Structured Query Language. A standard language
for manipulating relational databases., 255
statements An instruction to the Python interpreter., 47
string concatenation The joining of two or more strings together to
form a longer string., 58
string indices ???, 128
string interpolation The insertion of variables into a string using in
situ placeholders., 137
subselect A select within a select., 262
tokenization ???, 41
tuple ???, 75
typing ???, 173
web browser ???, 265
wildcard ???, 190
Index
command line, 116
  arguments, 116
  options, 117
comments, 56
directories
  creating, 114
  deleting, 115
  editing, 114
  locating, 113
  reading, 113
  renaming, 114
elementtree, 245
error
  custom, 104
  standard, 103
  syntax, 102
exception
  types, 102
exceptions
  catching, 99
  handling, 99
  raising, 99
file
  types, 113
files
  creating, 111
  deleting, 112
  paths, 107
  reading, 108
    modes, 108
  slurping, 110
  writing, 108
index, 134
Kleene Star, 209
len(), 71
list
  length, 71
metacharacters
  definition, 193
  dot, 201
  escaping, 202
methods
  accessor, 182
  getter, 158, 182
  setter, 158, 182
modules, 121
  defining, 121
  importing, 122
  standard, 125
PYTHONPATH, 124
regular expressions
  caret, 203
  character classes, 198
  compilation
    DOTALL, 221
    flags, 221
    IGNORECASE, 221
    LOCALE, 222
    MULTILINE, 222
    VERBOSE, 223
  disjunction, 196
  escape character, 202
  matching
    greediness, 237
  metacharacters, 193
    backslash, 202
    caret, 205
    curly brackets, 210
    dollar, 205
    dot, 201
    square brackets, 198
    vertical bar, 196
  negation, 203
  quantifiers, 206
    plus mark, 207
    question mark, 209
    star, 209
search path, 124
slurping, 110
SQL
  keyword
    DELETE, 262
    INSERT, 261
    SELECT, 262
    UPDATE, 262
string
  indices, 131
string interpolation, 139
strings
  case, 138
  formatting, 137
  immutability, 135
  interpolation, 139
  methods, 132
    center, 137
    count, 133
    index, 133
    ljust, 138
    replace, 136
    split, 136
    zfill, 138
  querying, 132
  slices, 131
  types
    plain, 129
    unicode, 130
unicode, 130
variables
  assignment, 52
  declaration, 51
  initialization, 52
xmllib, 245
