Beginning Python Programming for Language
Research
Contents
1 Acknowledgements
2 Introduction
2.1 Aims and Scope
2.2 Organization
2.3 Why Python?
2.4 What This Book Doesn’t Cover
2.5 What Else to Read
2.6 How to Use This Book
6 Statements
6.1 Indentation and Block-Structuring
6.2 Variables
6.3 Expressions
6.4 Comments
6.5 Exercises
7 Data Types
7.1 Strings
7.2 Numbers
7.3 The None Value
7.4 Querying and Converting Data Types
7.5 Exercises
8 Data Structures
8.1 Lists
8.2 Dictionaries (aka, Hash Tables)
8.3 Tuples
8.4 Nested Data Structures
8.5 Exercises
9 Data Flow
10 Functions
10.1 The Basics
10.2 Arguments
10.3 Return Values
10.4 Functions in Action
10.5 Exercises
12 Input/Output
12.1 Files
12.2 Directories
12.3 The Command Line
12.4 IO in Action
12.5 Exercises
13.1 Modules
13.2 Packages
13.3 Exercises
17.1 What is Text?
17.2 Metacharacters
17.3 Quantifiers
17.4 Word Boundaries
17.5 Summary
17.6 Exercises
17.7 Suggested Reading
18 Regular Expressions 2
18.1 Using Regular Expressions: An Overview
18.2 Creating Regular Expressions
18.3 The Regular Expression Object
18.4 Extensions to Regular Expressions
18.5 Exercises
18.6 Suggested Reading
19 XML
19.1 What is XML?
19.2 The Syntax of XML
19.3 Processing XML in Python
19.4 Case Studies
19.5 Exercises
19.6 Suggested Reading
20 Databases
20.1 Files
20.2 Pickling
20.3 Relational Databases
20.4 Structured Query Language (SQL)
20.5 Using Python with a Relational Database
20.6 Case Study: Importing a Spreadsheet into a Lexical Database
20.7 Exercises
20.8 Suggested Reading
Bibliography
Glossary
List of Figures
2.1 The Python Prompt
2.2 The Python Prompt
4.1 ???
4.2 Command Prompt Window in Windows XP
4.3 Running the Python Interpreter from a Command Prompt Window
4.4 Path Error Running the Python Interpreter from a Command Prompt Window
4.5 Selecting ‘Environment Variables’ from the ‘System Properties’ Window
4.6 Adding the Python Interpreter to the PATH Setting
5.1 ???
8.1 List Indices
8.2 Negative List Indices
List of Tables
2.1 Programming Languages Strong on String Processing
5.1 Comparison Operators
6.1 String Indices
14.2
14.3
14.4
20.2 Outcome of a Join of Person and Authorship
21.1 Newspaper Article from RSS Newsfeed for the Indonesian Newspaper Antara
Chapter 1
Acknowledgements
The following people at some point read part or all of draft versions of the book (in
alphabetical order): Heriberto Avelino, Steven Bird, Brian MacWhinney, Fermín
Moscoso del Prado, Loretta O’Connor, and Janna Zimmermann.
Special thanks to: Dieter Van Uytvanck, for clarifying questions about how
arguments are passed to functions; Heriberto Avelino, for using the book for self-
study and providing detailed feedback.
Their feedback is greatly appreciated, since it has significantly improved the
quality of the final product. (Any blame for defects is of course solely
attributable to the authors.)
This textbook was used to teach an intensive Python programming course at
the Max Planck Institute for Psycholinguistics in Nijmegen, The Netherlands, and
we wish to thank the students of that class for their feedback on the materials (not
to mention the bottle of Italian wine and the copy of Iain M. Banks’ The Algebraist
that they gave the first author as a gift).
Finally, this book is about an open source programming language and was
written using open source software: Linux, Emacs, CVS, etc.
(http://www.opensource.org). Consequently, it would not be complete without an
acknowledgement of the pioneering efforts of Richard M. Stallman and Linus
Torvalds, who gave the world GNU and Linux, respectively, as well as Guido van
Rossum, Python’s benevolent dictator for life. The open source revolution
continues, and we are glad to be part of it.
Chapter 2
Introduction
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren’t special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you’re Dutch.
Now is better than never.
Although never is often better than right now.
If the implementation is hard to explain, it’s a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea—let’s do more of those!
—The Zen of Python
2.1 Aims and Scope
wishing to learn Python from scratch. (For those with a background in program-
ming, the book might serve as a useful introduction to the language, but there are
probably better alternatives, such as Harms and McDonald (2000).) Throughout
the book, the emphasis is on the application of Python programming to the types
of problems that language researchers are likely to face, but the aim is to provide
a comprehensive introduction that provides the foundation for the development of
general programming skills.
In the next few sections, we will tackle a few obvious questions that emerge
from the aims and scope of the book. In §2.2, an overview of the book’s
organization is provided. In §2.3, we explain why a language researcher ought to learn
Python rather than another programming language. In §2.4, we mention a few
topics that the emphasis on language problems leads us to neglect, and provide
references to other materials that could be used to supplement this book.
2.2 Organization
This book is divided into two parts:
Python Basics The first part is fairly general and teaches the fundamentals of
Python programming. As such, it could be skipped by someone who is already
familiar with Python, but who wishes to learn how to apply Python to
problems in language research. Since it provides a fairly broad overview of
Python, it is a useful reference, even for those who are already familiar with
the language.
Python for Language Research The second part is written specifically for a
language researcher (a linguist, sociolinguist, psycholinguist, etc.) who has
acquired the basics of Python programming (either by reading the first part
or from prior exposure) and wishes to develop more advanced skills. This
part provides a detailed look at a number of more advanced topics in
Python of relevance to language research (e.g., regular expressions, XML,
databases, etc.).
2.3 Why Python?
language is not a trivial task, and there is always healthy debate among
programmers concerning the strengths and weaknesses of various programming
languages. For a classic endorsement of Python by Eric Raymond (author of The
New Hacker’s Dictionary and The Cathedral and the Bazaar), see his article
“Why Python?”, available from the web site of the Linux Journal:
http://www.linuxjournal.com/article/3882.
After all, there is no shortage of possible alternatives: Ada, C#, C++, COBOL,
C, ColdFusion, Delphi, FORTRAN, Icon, JavaScript, Java, ML, PHP, Perl, Prolog,
Python, QBASIC, Ruby, Lisp, Smalltalk, Snobol, Tcl, Visual, etc. However,
not all programming languages are created equal, as we will see when we consider
a few desiderata for a good programming language for language research.
For language researchers, what really matters is that a language excel at string
processing–that is, the manipulation of strings (loosely speaking, computer sci-
ence lingo for chunks of text). Table 2.1 lists a few languages with good string
processing capabilities and compares them in terms of the degree to which they
are still being actively developed by a community of developers and users and
whether or not they are object-oriented (see Chapter 15 for more explanation of
the term).
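As a rough illustration of what string processing means in practice, the following minimal sketch splits a chunk of text into words and counts them. It is written in Python 3 syntax, which is an assumption on our part (the book itself targets Python 2.3 or later), and the sample sentence and the function name word_frequencies are made up for illustration.

```python
# A minimal sketch of string processing: counting word frequencies.
# The sample text and the function name are illustrative assumptions.
from collections import Counter


def word_frequencies(text):
    """Lower-case the text, strip punctuation from each word, and count."""
    words = []
    for token in text.lower().split():
        word = token.strip('.,;:!?"()')
        if word:
            words.append(word)
    return Counter(words)


sample = "The cat saw the dog. The dog saw the cat!"
freqs = word_frequencies(sample)

# Show the two most frequent words together with their counts.
print(freqs.most_common(2))
```

Everything here is done with ordinary string methods (lower, split, strip); no special text-processing library is needed for a task of this size.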
Currently the only real competitors to Python in the arena of language research
are Perl and Java, since they offer many of the same string processing capabilities
as Python.
The main difference between Perl and Java is that the former is a “scripting
language” whereas the latter is a “system programming language”. It is
common to divide high-level programming languages into two types: “system
programming languages” and “scripting languages”.1 System programming
languages allow arbitrary amounts of complexity, are generally compiled (see
Chapter ?? for an explanation), and are meant to operate largely independently of other
programs. Prototypical examples would be C or Java. By contrast, scripting
languages have fewer provisions for complexity, are generally interpreted, and
necessarily interact with other programs or the file system. Prototypical
scripting languages would be MS-DOS batch files or Unix shell scripts (and of course
Tcl). The dichotomy is not particularly neat and many believe it to be highly
arbitrary (referring to it as “Ousterhout’s fallacy” or “Ousterhout’s false dichotomy”).
Sometimes the dichotomy is seen in terms of what the two kinds of languages are
good at doing: scripting languages are good for writing quick-and-dirty programs
while system programming languages are good for writing large-scale programs.
Another dimension is speed: scripting languages tend to run more slowly than
system programming languages. However, this is not a major concern in
language research, where a great deal of the programming is used for data extraction
and analysis (e.g., building a concordance) and not for doing computations whose
results are needed immediately in real time (e.g., spoken word recognition).
1 This division is sometimes known as Ousterhout’s Dichotomy, after John Ousterhout, the creator
of the language Tcl.
Although both Perl and Python are scripting languages, there is also a good
deal that sets the two languages apart. Without seeking to disparage Perl in any
way, we would claim that Perl has a few drawbacks that recommend Python over
it for beginning programmers. The advantages of Python are the following:2
More structured Perl’s “there’s more than one way to do it” philosophy can be a
bit dangerous when you are beginning to learn to program, since it imposes
no discipline. Python is more structured in this respect, and provides better
protection against developing bad programming habits.
Object orientation Python is a language that is object-oriented to the core (see
Chapter 15 for more information) whereas Perl’s object orientation is something
of an afterthought, and is not thoroughly integrated into the language
(Conway, 1999). In non-object-oriented programming, the focus is on the
sequence of instructions (or procedures) needed to get a job done. In
object-oriented programming, the focus is on the kind of task that is to be done, and
on writing the rules in such a way that what is common across different tasks
becomes clear. For instance, in order to add objects, non-object-oriented
languages use one kind of syntax for numbers (3+5 to get 8) and another
kind of syntax for strings (strcat("tea", "cup") to get “teacup”),
whereas Python uses the same syntax for both (3+5, "tea"+"cup"). In
2 For more comparison of scripting languages (and an attempt at ranking them relative
to one another), try http://merd.sourceforge.net/pixel/language-study/scripting-language/.
that relies more on modelling the entities involved in a programming task
than on the procedures that are followed in the task. This is very useful for
various programming problems in the language sciences, and although Perl
offers object-orientation, it doesn’t go to the core of the language’s philos-
ophy and is in many ways merely tacked on as an afterthought.3
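The point about using the same syntax for different kinds of objects (3+5 and "tea"+"cup") can be tried out directly. The sketch below, in Python 3 syntax (an assumption; the book targets Python 2.3 or later), also shows how a user-defined object can take part in the same syntax. The Syllable class is a hypothetical example invented here, not something provided by Python or by this book.

```python
# A sketch of "same syntax for different kinds of objects".
# The Syllable class is a made-up example, not part of Python or this book.

print(3 + 5)          # adding numbers
print("tea" + "cup")  # concatenating strings with the same operator


class Syllable:
    """A toy object that defines what '+' means for itself."""

    def __init__(self, text):
        self.text = text

    def __add__(self, other):
        # Joining two syllables yields a new, longer Syllable.
        return Syllable(self.text + other.text)


word = Syllable("tea") + Syllable("cup")
print(word.text)
```

Python looks up the special method __add__ whenever it meets the + operator, which is why numbers, strings, and the toy class can all share one notation.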
Finally, Python has a large and active community of developers and users (as
does Perl, it should be noted in all fairness). There are numerous web sites devoted
to Python (e.g., http://www.python.org or ???) as well as a Usenet group
(comp.lang.python). Extensive documentation is available (on the web and
in print) and there is a growing literature on the language, with new books being
published on a regular basis, as a quick search on amazon.com (or any other
online bookseller) will reveal.
there is no risk of its owner going out of business and ceasing to develop or
support it. For in-depth analysis of open source software and its benefits, see Weber
(2004). For more partisan views, try Raymond (2001) or Stallman (2002).
2.4 What This Book Doesn’t Cover
ment (how to search for example sentences on the web) and graphical user inter-
faces (GUIs) are not covered at all. These two topics have been overlooked be-
cause they require more background knowledge than our target audience is likely
to possess and it would require a great deal of space to bring the reader with no
previous background to a level where any serious work could be done.
Furthermore, both topics are very much system-dependent, meaning that a great deal
of effort would have to be expended covering the various operating systems (Unix,
Macintosh, Windows, etc.) and their idiosyncrasies.
Furthermore, GUIs are often designed to make programs written by one person
easier for another to use, but this textbook aims to enable language researchers
to accomplish tasks without being limited by the GUIs of software designed for
specific purposes. In many cases, the time investment involved in building a user-
friendly GUI is not repaid, especially when you are using programs that you have
written yourself.
With these considerations in mind, we decided that it would be better to stick
to a solid core of Python programming which is largely system-independent and
cover it thoroughly rather than treat a large number of topics superficially. For
readers interested in these topics, we recommend Holden (2002) for an
introduction to web programming and Rempt (2002) for an introduction to programming
GUIs.
2.5 What Else to Read
beginners at http://www.python.org/doc/Intros.html.
2.6 How to Use This Book
If you know nothing about programming and wish to learn, the best way to use
this book would be to read the first part from beginning to end, running each piece
of example code to ensure that you understand what it is meant to illustrate. Once
you have assimilated the material in the first part, you can move on to the second
part. The second part of the book is more specific, and would be useful even to
those readers who have prior familiarity with Python. It is less important to read
the second part from beginning to end, since the chapters cover various topics
which more or less stand alone. The only chapter in the second part that is really
a must-read is the first chapter, on string processing. The rest can be treated as
a sort of buffet, which the reader can pick and choose from. They are not truly
interdependent: while it is useful to understand regular expressions before reading
about XML or databases, it is by no means necessary.
To save the reader time, every piece of code given in this book is provided
as a Python script on the accompanying CD, where it can be found in a directory
called ‘code’. Throughout the book, code examples will be given in the following
format:
(1) helloWorld.py
print 'Hello, World!'
Note the label at the top of the example. It has two parts, a unique identifier
number followed by the name of a file. The unique identifier number is what is
used to reference the code in the text. The filename refers to a Python script that
can be run to see the code in action. The code in (1) can therefore be found in the
file helloWorld.py. It is recommended that you run each piece of code as you
read the examples. This is quite easy to do in a Unix environment, assuming that
the right version of Python is installed. (You need to have version 2.3 or later.) To
check which version of Python you are running and then run the code in (1), run
the commands shown in Figure 2.1.
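The version requirement can also be checked from inside Python. The sketch below is only an illustration: the minimum version (2.3) is taken from the text above, while the variable names are our own. It is written so that it should run unchanged under both older and newer versions of Python.

```python
import sys

# The minimum version is taken from the text; the names are illustrative.
REQUIRED = (2, 3)

# sys.version_info starts with the major and minor version numbers.
current = tuple(sys.version_info[:2])
print("Running Python %d.%d" % current)

# Tuples compare element by element, so (2, 7) >= (2, 3) is True.
version_ok = current >= REQUIRED
if not version_ok:
    print("Please install Python %d.%d or later." % REQUIRED)
```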
Figure 2.1 The Python Prompt
On a Windows or Macintosh computer, the process is not quite as
straightforward (see §4 for installation instructions). [SAY MORE] It is advisable to have a
look at one of the numerous books dedicated to Python in Windows, e.g.,
Hammond and Robinson (2000).
Since you may wish to tinker with the example code, we recommend that you
make a local copy of it, and put the originals someplace where you can recopy
them if necessary. (It’s always a good idea to make backups of code, just in case
you decide to revert to a prior version or accidentally delete what you are working
on.)
Alternatively, you can type example code straight into Python by running
Python interactively with the Python prompt, as shown in Figure 2.2.
Figure 2.2 The Python Prompt
In fact, the Python prompt is a great way to get instant gratification since the
code is evaluated on a line-by-line basis, which means that any errors will imme-
diately be caught. Catching errors immediately through the Python prompt can
be very instructive, since it helps teach the user how Python responds to different
kinds of errors—for example, typing errors, syntax errors, conceptual errors, etc.
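To give a feel for how Python responds to different kinds of errors, the sketch below feeds three deliberately faulty lines to the interpreter and reports which kind of error each one triggers. It uses Python 3 syntax (an assumption; the book targets Python 2.3 or later), and the faulty snippets themselves are made up for illustration. At the interactive prompt you would simply type each line and read the error message.

```python
# Each snippet below is deliberately wrong; the names are made up.
snippets = [
    "print('hello'",   # syntax error: missing closing parenthesis
    "print(mesage)",   # typing error: misspelled (undefined) name
    "'tea' + 3",       # conceptual error: adding a string and a number
]

caught = []
for source in snippets:
    try:
        exec(source)
    except Exception as error:
        # Report the kind of error, much as the prompt would.
        caught.append(type(error).__name__)
        print("%-16s -> %s" % (source, type(error).__name__))
```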
Chapter 3
Programming for Language Research
But the argument in favor of learning to program can be made more concrete
by looking at a number of areas where the ability to program is invaluable. We
will discuss a few of these briefly: data munging, data analysis, simulation and
modelling, and computational linguistics.
3.1 The Uses of Programming in Language Research
Language researchers deal with data, and increasingly lots of it. [MENTION
GROWING SIZE OF CORPORA] That data typically takes the form of text. The
term data munging refers to the manipulation of data for sundry purposes. There
are numerous boring, repetitive tasks involved in language research that can be
easily automated, such as reformatting texts, finding systematic typos, changing
orthographies in a text, etc.1
A language researcher who has some programming skills is in a position to
vastly improve his productivity, since he will be able to hand over a good deal of
the grunt work to a computer, allowing it to do the boring, tedious work and leaving
him to do the fun stuff (programming, data analysis, write-up, etc.).2 Python,
being a high-level scripting language, is especially suited to data munging.
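One of the munging tasks mentioned above, changing the orthography of a text, can be sketched in a few lines. The example uses Python 3 syntax (an assumption; the book targets Python 2.3 or later). The spelling rules in MAPPING are invented purely for illustration and do not describe any real orthography.

```python
import re

# Hypothetical orthography mapping: old spelling -> new spelling.
MAPPING = {"oo": "u", "ee": "i", "c": "k"}

# Build one pattern that tries the longest old spellings first,
# so that two-letter sequences win over single letters.
pattern = re.compile("|".join(
    re.escape(key) for key in sorted(MAPPING, key=len, reverse=True)))


def convert(text):
    """Replace every old-orthography sequence with its new equivalent."""
    return pattern.sub(lambda match: MAPPING[match.group(0)], text)


print(convert("cool seed cat"))
```

Doing all replacements in a single pass avoids a common pitfall of chained replacements, where the output of one rule is accidentally rewritten by the next.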
By way of example, one of the authors has been compiling a dictionary of
Rotokas, a non-Austronesian language spoken on the island of Bougainville, Papua
New Guinea. The dictionary is stored in the format used by Shoebox, a
program made available by the Summer Institute of Linguistics for storing
dictionaries and glossing texts (for more information, go to
http://www.sil.org/computing/shoebox/). Although Shoebox has built-in dictionary creation
capabilities, the results were not quite what the author desired. In order to
create a dictionary formatted according to his wishes, he wrote a Python script that
took the Shoebox dictionary file, manipulated the data in it, and produced a much
more aesthetically pleasing LaTeX-formatted document. Without some
programming skills, this would not have been possible, and the resulting dictionary would
have been poorer for it.
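The author's actual script is not reproduced here, but the general shape of such a conversion can be sketched. The example below rests on several assumptions of ours: that the input consists of records with one backslash-marked field per line and a blank line between entries, that the markers \lx (headword), \ps (part of speech) and \ge (gloss) are used, and that a very plain LaTeX layout is wanted. The sample entries are placeholders, not real dictionary data, and the syntax is Python 3.

```python
# Assumed input: Shoebox-style records, one "\marker value" field per line,
# with a blank line between entries. The markers and the placeholder
# entries below are illustrative assumptions, not real dictionary data.
SAMPLE = r"""\lx word1
\ps n
\ge first gloss

\lx word2
\ps v
\ge second gloss
"""


def parse_records(text):
    """Turn the text into a list of {marker: value} dictionaries."""
    records = []
    for block in text.strip().split("\n\n"):
        record = {}
        for line in block.splitlines():
            marker, _, value = line.partition(" ")
            record[marker.lstrip("\\")] = value.strip()
        records.append(record)
    return records


def to_latex(record):
    """Format one entry as a line of LaTeX (hypothetical layout)."""
    return r"\textbf{%s} \textit{%s} %s\par" % (
        record["lx"], record["ps"], record["ge"])


entries = parse_records(SAMPLE)
for entry in entries:
    print(to_latex(entry))
```

Separating the parsing step from the formatting step means the same parsed records could later be written out in some other format without touching the parser.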
It is always preferable to have control over one’s own data, and not to leave
every task in the hands of third-party software vendors, especially when dealing
with academic software (which does not have the same market potential and
is therefore not as well supported or documented). If the Summer Institute of
Linguistics stopped supporting Shoebox, for example, it is unclear what would
happen to it. Users of the software are at the developer’s mercy when it comes to
proprietary software.
1 The earliest computers used by codebreakers during WWII were employed to speed up the
process of decrypting Enigma communiques. Before computers came into the picture, the task
was done by armies of secretaries!
2 Fortunately, computers don’t mind doing our dirty work, at least not yet. Maybe one day the
kind of carping heard from Marvin the Paranoid Android in The Hitchhiker’s Guide to the Galaxy
will become a reality: “I’ve been ordered to take you down to the bridge. Here I am, brain the
size of a planet and they ask me to take you down to the bridge. Call that job satisfaction? ’Cos I
don’t.”
One of the areas where programming is especially useful is in data analysis. Al-
though there is a great deal of software available for performing statistical anal-
ysis, these programs usually require input to be formatted in a particular fashion.
Knowledge of a scripting language enables one to massage one’s data into a for-
mat where it can be readily analyzed. Furthermore, in some cases, custom data
analysis can be more easily programmed on demand.
Finally, although there are many good programs for doing statistical analysis,
a good deal of statistical analysis can be done within Python. Python
has extensive libraries for number-crunching and there are usually ways of using
Python to run other programs. For example, Python is able to run scripts that call
R, an open source statistical language which provides a wide variety of
statistical techniques as well as extensive graphing capabilities (Verzani, 2004; Baayen,
2006).
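The two ideas in this section, massaging data into a layout that statistics software accepts and doing simple analysis within Python itself, can be sketched together. The example uses Python 3 and its standard csv and statistics modules (the latter did not exist in the Python version this book targets, so its use is an assumption of ours). The measurements and the column names are invented for illustration.

```python
import csv
import io
import statistics

# Made-up measurements in an awkward "speaker: value value ..." layout.
RAW = """anna: 512 498 530
ben: 601 587
"""

rows = []
for line in RAW.splitlines():
    speaker, values = line.split(":")
    for value in values.split():
        rows.append((speaker.strip(), int(value)))

# Write one observation per row, a layout most statistics programs accept.
# An in-memory buffer stands in for a real output file here.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["speaker", "rt"])
writer.writerows(rows)
print(buffer.getvalue())

# Simple summaries can be computed without leaving Python.
mean_rt = statistics.mean([value for _, value in rows])
print("mean:", mean_rt)
```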
Google engineers use Python, and we’re looking for more people with skills in
this language.”
natural language with an eye towards making computers produce and
comprehend human languages. The growth of the internet and the WWW has fuelled
a resurgence of interest in the field, since it provides techniques of considerable
utility in the storage, classification, retrieval, and analysis of text (Farghay, 2003).
The usefulness of Python in this domain can be judged by its increasingly
central role in introductory courses in computational linguistics and its widespread
usage in industry. Python has already become the language of choice in
computational linguistics, judging from the number of universities whose introductory
courses use it: Brown, Brandeis, Penn State, the University of Edinburgh, and the
University of Melbourne, to name a few.
Although this book is not meant to be an introduction to the field (for that, try
Jurafsky and Martin (2000) or Bird et al. (2006)), it does teach basic programming
skills, which are an essential part of the field. Furthermore, the third part of the
book deals with a variety of topics that traditionally fall within the purview of the
field (e.g., WordNet, corpus linguistics). For linguists interested in learning more
about the applied side of computational linguistics, ??? provides an introduction to
various facets of natural language processing (NLP).
3.2 Becoming a Good Programmer
which reveal themselves over time—for example,
• learning how to write well-commented code that is readable to others,
• learning how to write robust code that will handle unexpected conditions
gracefully,
• learning how to write code that can be easily maintained as it changes over
time,
• learning how to optimize already working code so that it runs more
efficiently.
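To make the first two points in the list above a little more concrete, here is a small sketch of commented code that handles an unexpected condition (an unreadable file) gracefully instead of crashing. It is in Python 3 syntax (an assumption; the book targets Python 2.3 or later), and the function name and the file name are hypothetical.

```python
def count_lines(path):
    """Return the number of lines in a text file, or None if unreadable."""
    try:
        with open(path, encoding="utf-8") as handle:
            return sum(1 for _ in handle)
    except (OSError, UnicodeDecodeError) as error:
        # Report the problem instead of crashing the whole program.
        print("Could not read %s: %s" % (path, error))
        return None


# "no_such_file.txt" is a hypothetical name used to trigger the error path.
result = count_lines("no_such_file.txt")
```

Returning None for the failure case is only one possible design; the caller must then remember to check for it before using the result.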
Programming is a craft like any other which repays time and effort. It isn’t
as hard to program as the general public probably believes (judging from popular
culture depictions of programmers as wild-eyed geniuses), but it is a long road
that takes one from apprentice to journeyman programmer.
Here are a few caveats that beginning programmers would do well to heed:
Learn By Doing The best way to learn to program is by doing it. There’s no
substitute for the hands-on, do-it-yourself approach. You can learn a good
deal about programming by reading books, but at some point you have to
roll up your sleeves and put your nose to the grindstone. In the beginning,
programming can be frustrating, and there will be many times when you
want to beat your head against the keyboard or put your fist through the
monitor. But that’s part of the process.
Use It or Lose It If you don’t program on at least a semi-regular basis, you will
no doubt find it hard to remember nitty-gritty details about Python. The
more you program, the more entrenched the knowledge becomes and the
less likely you are to forget it in the future.
RTFM You may have come across this acronym before. It stands for the
impolite dictum “Read the Fucking Manual”. If you want to learn to program
well, you can’t be afraid of digging into documentation. There is always a
lot to know about a computer language and the best way to learn more is to
read up.
3.3 Suggested Reading
Chapter 4
Installing and Running Python
This chapter presents the basics of installing and running Python on the three most
common operating systems: Windows, Linux/Unix, and MacOS.
4.1.2 Linux/Unix
Many Unix operating systems come with Python pre-installed as part of their stan-
dard distribution. This is true of a variety of Linux distributions (Red Hat, Debian,
SuSE, etc.). To see whether Python has been installed on your Unix operating sys-
tem, you will need to open a command-line window (terminal window) and type
‘python’, as shown in Figure 4.1. This will instruct Python to print out its version
number.
Figure 4.1 ???
If Python is not installed, you will get an error message rather than a Python
version number. There are a few different options for installing Python.
One possibility is to install Python from source files, following the instructions
provided on the Python download page. It is easier, however, to install Python
from a prepackaged binary file. The most commonly available will be an RPM
file (*.rpm), which can be installed as root with the following command:
???
Installation of Python from an RPM will fail if your system lacks any of
the files required for the installation of Python (e.g., ???). A package management
program can be used to handle these dependency issues automatically.
With a decent package management program, it is quite easy to install Python.
Using the package manager yum, the following command should suffice:
Alternatively, if you are using the package manager apt, the following
command can be used:
# ???
4.1.3 MacOS
The current version of the Macintosh operating system, Mac OS X, is built on
top of the Berkeley Software Distribution of Unix. This means that most of the
instructions for installing and running Python for Unix also carry over to Macs.
[SPELL OUT HOW THEY DIFFER]
4.2 Running Python
4.2.1 Windows
Unless you use an IDE (Integrated Development Environment), you will run Python
by typing commands into the “DOS window” or “Command prompt window”.
The exact procedure for opening a command prompt window varies slightly
between different versions of Windows. In Windows XP, open the Start menu and
select “Programs”, then “Accessories”, then “Command Prompt”. Once you have done
this, the window should look similar to the one shown in Figure 4.2.
Figure 4.2 Command Prompt Window in Windows XP
To run your Python programs, you will have to use the Python interpreter. The
interpreter reads your script, converts it into an intermediate form known as
bytecode, and then executes that bytecode (for more information, see §6).
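The two steps just described, translating the script and then executing the translated version, can be observed from within Python itself using the built-in functions compile and exec. The sketch below uses Python 3 syntax (an assumption; the book targets Python 2.3 or later), and the one-line script text is a made-up example.

```python
# A sketch of what "reading and executing a script" involves.
# The one-line script text is a made-up example.
source = "greeting = 'Hello, ' + 'World!'"

# Step 1: the interpreter translates the source text into a code object.
code = compile(source, "<example>", "exec")

# Step 2: the code object is executed; results land in the namespace.
namespace = {}
exec(code, namespace)
print(namespace["greeting"])
```

Normally both steps happen automatically when you run a script, so there is rarely any need to call compile yourself.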
The first thing that you will need to do is to ensure that the Python interpreter
has been correctly installed. You can do this by opening a command prompt
window and running the Python interpreter in it. This is done by typing the full
path to the program python.exe and hitting return. If it works, Python will be
run in interactive mode, as shown in Figure 4.3.
4.2 Running Python
Figure 4.3 Running the Python Interpreter from a Command Prompt Window
You can tell Python is running from the interpreter’s command prompt (the
three right-facing angle brackets, >>>). Once in interactive mode you can type
Python code interactively and have it executed on the spot. (To exit from
interactive mode, you can hold the Control key and type ‘Z’, i.e., CONTROL-Z, and then hit return.)
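Once the interpreter is running, you can ask it which Python you have actually started. The lines below are a minimal sketch using the standard sys module; they are written for Python 3 (an assumption, since the screenshots in this chapter show an older release), and the variable names are only illustrative.

```python
import sys

# Full path of the interpreter that is currently running.
interpreter_path = sys.executable

# Major and minor release numbers of that interpreter.
major, minor = sys.version_info[0], sys.version_info[1]

print("Interpreter:", interpreter_path)
print("Version: %d.%d" % (major, minor))
```

If the path printed is not the installation you expected, more than one copy of Python is probably installed on the machine.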
Once you have succeeded in running the Python interpreter using the full path
to it, you can test whether you can also run it by simply typing python in the
command window. If this works, you will see the interpreter prompt, as you did
in Figure 4.3. If it does not work, you will instead get an error message,
such as the one shown in Figure 4.4.
The problem is that your computer doesn’t know where to find the Python
interpreter. To fix the problem, you will have to modify a setting known simply
as PATH, which is a list of directories where Windows will look for programs. To
set PATH so that it includes the directory where the Python interpreter is installed,
you will first need to know where the Python interpreter program was installed. If
you installed Python fairly recently then the search command provided in (1) may
work. Otherwise you will be reduced to a search of your whole disk for the file
python.exe.
Figure 4.4 Path Error Running the Python Interpreter from a Command Prompt Window
Once you have verified the directory, you will need to add it to your computer’s
start-up routine. You should add (??) to the current setting for the PATH
environment variable. To do this, you will need the ‘System Properties’ window,
which can be obtained by right-clicking “My Computer” and selecting ‘Properties’.
From there, select ‘Environment Variables’ in the ‘Advanced’ tab of the
‘System Properties’ window, as shown in Figure 4.5.
Figure 4.5 Selecting ‘Environment Variables’ from the ‘System Properties’ Window
Once you have selected ‘Environment Variables’, you will need to add ;C:\Python24
to PATH in the ‘System variables’, as shown in Figure 4.6.
Figure 4.6 Adding the Python Interpreter to the PATH Setting
If you have sufficient privilege you might get a choice of installing the settings
either for the current user or for the entire system. If your computer is shared by
multiple users, and you want them to be able to use Python, you should choose to
install Python for the entire system, and not only for the current user.
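After changing PATH, it can be reassuring to check that the directory really is listed. The following is a rough sketch in Python 3 (an assumption; the book's own listings use an older Python). The function name directory_on_path and the example PATH value are hypothetical and serve only as illustration.

```python
import os

def directory_on_path(directory, path_value, separator=os.pathsep):
    """Return True if directory is one of the entries in a PATH-style string."""
    wanted = directory.rstrip("\\/").lower()
    for entry in path_value.split(separator):
        entry = entry.strip()
        if entry and entry.rstrip("\\/").lower() == wanted:
            return True
    return False

# Hypothetical PATH value in the semicolon-separated style used by Windows.
example_path = r"C:\WINDOWS\system32;C:\WINDOWS;C:\Python24"
print(directory_on_path(r"C:\Python24", example_path, separator=";"))

# The real setting on your own machine is available through os.environ.
current = os.environ.get("PATH", "")
print(directory_on_path(r"C:\Python24", current))
```

The comparison ignores case and trailing slashes, which matches how Windows treats directory names; on other systems a stricter comparison may be more appropriate.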
4.2.2 Linux/Unix
4.2.3 MacOS
Part I
Chapter 5
Your First Python Program: Building a Word List
Learning to program is a hands-on task. The best way to learn is to take a specific
task and think about how a program could be written to accomplish it. With that in
mind, let’s roll up our sleeves and look at a relatively basic task in text processing,
which is to build a word list from a corpus. This involves going through files,
isolating words, and keeping a count of how many times they occur. Let’s think
about how we would write a Python program that will automate this work.
Generally speaking, a corpus consists of multiple text files. But for simplicity
we will work with a snippet of text, taken from the Wall Street Journal (WSJ)
corpus. The WSJ corpus consists of articles from the Wall Street Journal that have
been annotated by adding a part-of-speech (POS) tag to each word (using a forward
slash as the separator between words and their tags). 1 An example sentence
is provided in (2) and its annotated counterpart in (3).
Before we delve into Python-specific details, let’s think about what is conceptually
involved in creating a word list from text annotated in this style. Each
sentence in the text needs to be broken down into tokens, which in this case consist
of word/POS pairings (e.g., Zenith/NNP).2 For now, we will ignore the part-of-speech
label and concentrate on building a frequency count for every word. This
involves associating every unique word with a counter, which is incremented every
time an instance of that word is encountered. The words can then be sorted
alphabetically and printed out with their associated counters.
In the following sections, we will break our task down into sub-tasks and look
at how each sub-task can be accomplished in Python. These sub-tasks are: storing
text as a variable (§5.1), breaking the text down into tokens (§5.2), creating a
frequency counter for each unique token (§5.3), and then printing out each token
along with its frequency count (§5.4). We’ll then put the parts together into a
single script that performs the entire job automatically.
5.1 Saving Text as a Variable
In (2), a snippet of annotated text from the WSJ corpus is saved as a variable.
In computer programming, a variable is a bit like a storage container. Like a
storage container, a variable has a label (its name) and contents (a value), as in
Figure 5.1.
2 The only tokens that do not have a part-of-speech label are the punctuation marks, which are
treated as words for ease of processing; see §8.1 for additional discussion.
Figure 5.1 ???
A variable is given a value with the assignment operator, the equal sign (=). In
this case, the name of the variable is text and its value is the snippet of annotated
text from (3). Note the formatting of the string. It is surrounded by quotes and for
the sake of readability has been wrapped across multiple lines using a backslash
at the end of every wrapped line.
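The assignment described above can be sketched as follows, using a shortened two-line version of the annotated snippet. The code is written for Python 3 (an assumption; the book's listings use an older Python), but the assignment and the backslash wrapping work the same way in both.

```python
# A shortened, two-line version of the annotated snippet, for illustration.
text = "Zenith/NNP Data/NNP Systems/NNPS Corp./NNP ,/, a/DT \
subsidiary/NN of/IN Zenith/NNP Electronics/NNP Corp./NNP"

# The backslash only continues the line; it is not stored in the string.
print(text)
print("\\" in text)
```

Because the backslash and the line break that follows it are dropped, the value of text is a single line of tokens separated by spaces.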
5.2 Tokenization: Getting a List of Tokens
Once we have saved some annotated text into a variable, the next step is the tokenization
of the text, that is, breaking the text down into tokens. In the case
of (3), a token consists of a pairing of a word with a part-of-speech label (e.g.,
ideas/N). Table 5.1 provides a breakdown of the first five tokens in (3).

Order  Token         Word     Part of Speech
1      Zenith/NNP    Zenith   NNP   proper noun, singular
2      Data/NNP      Data     NNP   proper noun, singular
3      Systems/NNPS  Systems  NNPS  proper noun, plural
4      Corp./NNP     Corp.    NNP   proper noun, singular
5      ,/,           ,        ,     comma (punctuation)

The function split() breaks a string up by spaces and produces a list with
all of the strings between spaces. In (3), the list that is produced by using split()
on text is saved into the variable tokens. (We’ll see shortly that the function
split can use other characters as separators.)
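As a small illustration of both uses of split(), the sketch below tokenizes the first five tokens of the snippet and then splits one token at the slash. It is written for Python 3 (an assumption; only the print calls differ from the book's older Python).

```python
text = "Zenith/NNP Data/NNP Systems/NNPS Corp./NNP ,/,"

# With no argument, split() breaks the string at whitespace.
tokens = text.split()
print(tokens)

# With an argument, split() uses that string as the separator instead.
parts = tokens[0].split("/")
print(parts)
```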
5.3 Obtaining a Frequency Count for Unique Words
Having obtained a list of tokens, we can now count how often each word appears
in the list. The code in (4) goes through the list of tokens saved into the variable
tokens and figures out how many times each word occurs in it. In order to track
the frequency of each word, we need some way of associating each unique word with a
count of how often it occurs. In Python, we can use something called a dictionary
(see §8.2). A dictionary is used to store a collection of unique items (keys) and
associate each one with a particular value. In this case, the key would be a word
and the associated value would be a number (more specifically, a count of how
many times a particular word appears in the text).
(4) introFirstProgramPart3.py
d = {}                      # create a dictionary                         1
for t in tokens :           # go through each token                       2
    parts = t.split("/")    # split token using / as separator            3
    w = parts[0]            # 1st item in list is word                    4
    gc = parts[1]           # 2nd item in list is gram. class             5
    if not d.has_key(w) :   # see if word is not in dict                  6
        d[w] = 0            # if not in dict, give it init. value of 0    7
    d[w] = d[w] + 1         # reset value by adding 1 to prior value      8
In (4), we use a for loop to go through each token in the list. Each token is
temporarily saved as the variable t as the for loop works its way through the list.
The indented code in lines 3 through 8 is run once for each item in the list. (Since
the text being analyzed contains 47 tokens, and the for loop is repeated for each
token in tokens, those 6 lines of code will be run 47 times.)
What do lines 3 through 8 do? Because each token consists of both a word
and a part-of-speech label, it is first necessary to separate the two so that the word
can be processed and the POS ignored. This is done with the function split,
using a slash (/) as a separator. The result is a list, saved as the variable parts.
The list will consist of two elements, a word and a part-of-speech label. These
are extracted from the list using an index (a number that refers to a position in the
list; see §8.1 for more details). Once the word in a token has been isolated and
saved into a variable, it can be added to the dictionary.
How is this done? By checking whether the word is in the dictionary,
adding it if it is not, and then adding 1 to the value associated with the word. It is
necessary to check whether the word is already in the dictionary because an error
will result if the program attempts to retrieve a value for a word that is not in the
dictionary. To prevent this from happening, the program ensures that there is a
value associated with the word by giving it an initial value of 0.
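The check-then-increment pattern just described can be tried out on a handful of tokens. The sketch below is written for Python 3 (an assumption), where the membership test is spelled with the in operator rather than has_key; the five tokens are chosen only for illustration.

```python
tokens = "a/DT contract/NN for/IN a/DT gunship/NN".split()

d = {}                       # empty dictionary: word -> count
for t in tokens:
    w = t.split("/")[0]      # keep the word, ignore the tag
    if w not in d:           # Python 3 spelling of "not d.has_key(w)"
        d[w] = 0
    d[w] = d[w] + 1

print(d)

# Looking up a word that was never added raises KeyError,
# which is why the check above is needed.
try:
    d["aircraft"]
    missing_raises = False
except KeyError:
    missing_raises = True
print(missing_raises)
```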
The final step is to print out all of the words and their frequency counts in a way
that can be easily read. One very simple format is a tabular layout with space-separated
values, that is, a plain text file with data organized into columns that
are separated from one another by a space. Files formatted in this way are
easy for people to read and also easily imported into spreadsheet programs (e.g.,
Microsoft Excel).
To create a file with a tabular word frequency count, we need the program to
get a list of keys in the dictionary and sort them. It should then be possible to go
through each key, get its associated value, and print the two of them out. The code
that accomplishes this is given in (5).
(5) introFirstProgramPart4.py
words = d.keys()            # get list of unique words in dictionary d    1
words.sort()                # sort list of words                          2
for w in words :            # go through each word                        3
    count = d[w]            # get counter for a given word                4
    print w , str(count)    # print word and count to standard output     5
In (5), we first get a list of all of the keys (words) in the dictionary using the
function keys and save that list into the variable words. We then sort that list using
the function sort (which knows how to do standard case-sensitive alphabetic
sorting for English; we’ll see ways of customizing the sort order in §8.1.3.4). We
then use a for loop to go through each word in the list, and as the program loops
over the list, it saves each word into the temporary variable w.
Notice that lines 4 and 5 are indented. This is because they are associated
with the for loop. They will be run as many times as there are items in
the list being looped over. Their collective purpose is to look up some information
about a word and print it out. First, the occurrence count for a given word is obtained
by looking it up in the dictionary. The word and its frequency count are then
printed out, with the comma in the print statement inserting a space between
them. (Note that the frequency count is converted into a
string using the function str. The nature of, and need for, this number-to-string
conversion will be explained in §7.4.)
Each line of output will consist of two columns: the first column, a word;
and the second column, the number of times that the word occurs in the text that the
program was run on.
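The sort-and-print step can be sketched on a small dictionary as follows. This version is written for Python 3 (an assumption), where keys() no longer returns a list that can be sorted in place, so the built-in sorted() is used instead; the dictionary contents are illustrative.

```python
d = {"contract": 2, "Zenith": 2, "a": 3, "$": 2}

# sorted() returns a new, sorted list of the dictionary's keys.
words = sorted(d)

lines = []
for w in words:
    lines.append(w + " " + str(d[w]))   # str() turns the count into text

print("\n".join(lines))
```

The sort is case-sensitive, so punctuation and digits come first, then uppercase letters, then lowercase letters.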
5.5 Putting It All Together
We are now in a position to bring all of the previously discussed code together into a
single program. When run, this program will print out a list of all of the words in
our corpus excerpt. The program is provided in its entirety in (6).
(6) introFirstProgram.py
text = "Zenith/NNP Data/NNP Systems/NNPS Corp./NNP ,/, a/DT \
subsidiary/NN of/IN Zenith/NNP Electronics/NNP Corp./NNP ,/, received/VBD \
a/DT $/$ 534/CD million/CD Navy/NNP contract/NN for/IN software/NN and/CC \
services/NNS of/IN microcomputers/NNS over/IN an/DT 84-month/JJ period/NN \
./. Rockwell/NNP International/NNP Corp./NNP won/VBD a/DT $/$ 130.7/CD \
million/CD Air/NNP Force/NNP contract/NN for/IN AC-130U/NN gunship/NN \
replacement/NN aircraft/NN ./."
tokens = text.split()          # split up text into tokens
d = {}                         # create a dictionary
for t in tokens :              # go through each token
    parts = t.split("/")       # separate token into word and gram. class
    w = parts[0]               # 1st item in list is word
    gc = parts[1]              # 2nd item in list is gram. class
    if not d.has_key(w) :      # see if word is not in dict
        d[w] = 0               # if not in dict, give it a counter
    d[w] = d[w] + 1            # increment counter
words = d.keys()               # get list of unique words
words.sort()                   # sort list of words
for w in words :               # go through each word
    count = d[w]               # get counter for a given word
    print w , str(count)       # print word and count to standard output
When it is run, (6) will produce a word list in alphabetical order with frequency
counts, as you can test for yourself. Running the program should produce the
output shown in (7).
(7) first-program-output.csv
$ 2
, 2
. 2
130.7 1
534 1
84-month 1
AC-130U 1
Air 1
Corp. 3
Data 1
Electronics 1
Force 1
International 1
Navy 1
Rockwell 1
Systems 1
Zenith 2
a 3
aircraft 1
an 1
and 1
contract 2
for 2
gunship 1
microcomputers 1
million 2
of 2
over 1
period 1
received 1
replacement 1
services 1
software 1
subsidiary 1
won 1
Although there is a good deal about (6) that remains to be explained, the basics
should be reasonably clear, even at this early stage. In the following chapters, we
will focus on various topics in Python programming and show ways in which this
program can be elaborated. This will deepen your understanding of (6) and give
you the knowledge necessary to tackle other problems.
5.6 Suggested Reading
Chapter 6
Statements
Python programs are composed of statements, which are instructions for Python
to carry out particular operations. In fact, a Python program is basically just
a collection of statements. To understand how Python statements are converted
into instructions that are carried out by the computer, it pays to step back briefly
and distinguish between two types of computer languages: low-level languages
(sometimes known as “machine languages” or “assembly languages”) and high-level
languages (such as Python, C++, Perl, Java, etc.).
Computers can only execute programs written in low-level languages. Therefore,
programs written in a high-level language require some additional processing
since the statements in the high-level language have to be translated into state-
further translation. (It is for this reason that compiled programs can be optimized
to run more quickly and use less memory than interpreted programs.)
Python is a high-level interpreted language. In other words, Python statements
must first be converted to machine language, and this is done when the program is run
by an interpreter, and not in advance by a compiler.
6.1 Indentation and Block-Structuring
One aspect of Python that is unusual among programming languages is its use
of indentation to determine block structure, which is the way in which code is
organized into larger units, or blocks. Most other languages use paired delimiters
(brackets, parentheses, etc.) for this purpose, as can be seen in the Perl code in
(8), which tests the value of a variable.
(8) statementsPerlEx1.py
my $i = 0;                               # set variable to zero
if ($i < 1) {                            # is value less than one?
    print "Value is less than one.\n";   # if yes, print message
    print $i;                            # print actual value of variable
}
Notice that brackets are used to structure the code in (8). The if-clause is
separated from the then-clause using brackets and each statement ends with a
semi-colon. (There are other conventions—e.g., variables are flagged with a dollar
sign—that can be ignored, since they are specific to Perl.)
The Python equivalent of (8) would be (9), which performs the same opera-
tions and produces the same output.
(9) statementsEx1.py
i = 0                                  # set variable to zero
if i < 1 :                             # is value less than one?
    print "Value is less than one."    # if yes, print message
    print i                            # print value of variable
There are a few things to notice about the code in (9). First, it does not use
braces. The if-clause is separated from the then-clause by a colon. Second, there
are no semi-colons in (9) because semi-colons are optional in Python if a line
consists of only a single statement.
In essence, the default in Python is a single statement per line. That is what the
Python interpreter expects unless it is told otherwise. To have more than one state-
ment on a single line, semi-colons are required. Therefore, (10) is a permissible
alternative to (9). (It would also be possible to put semi-colons at the end of each
line in (9), but it would be pointless, since they are not required by the interpreter
and add nothing to the program in terms of readability.) They are functionally
equivalent and the choice of one over the other is a matter of programming style.
(In general, putting each statement on a separate line makes for better readability
and is therefore preferred.)
(10) statementsEx2.py
i = 0
while (i < 10) :
    i += 1; print i
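The effect of the semi-colon can be checked directly. The short sketch below (Python 3 is assumed; the variable names are illustrative) runs the same two statements once on separate lines and once on a single line.

```python
# One statement per line (the usual style).
i = 0
i += 1
a = i

# The same two statements on a single line, separated by a semi-colon.
j = 0; j += 1
b = j

print(a, b)
```

Both versions leave the same value in the variable; the only difference is how the code looks on the page.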
Note also that the indented block need not occur on a separate line. It is
possible to collapse the last two lines of (10), thereby obtaining (11), but we
recommend against it, since it makes the code harder to read.
(11) statementsEx3.py
i = 0
while (i < 10) : i += 1; print i
6.2 Variables
In this section, we discuss how variables work in Python. Variables lie at the heart
of any programming language, given that they are where information is stored. A
variable is something like a box, in that it is a container for information. If we
pursue this analogy, then we can think of variables as boxes that contain values
(a particular number, word, etc.). In the following sections we will look at the
basic statement types where variables are concerned: declaration, initialization,
and assignment.
6.2.1 Declaration
The term declaration refers to the process of asserting a variable’s existence in a
program. A declaration essentially announces, or declares, that a variable exists.
Hence the term. Using our variables-as-containers analogy, declaring a variable
amounts to telling the Python interpreter that there is a box and giving it a label.
Unlike some other programming languages, a variable cannot be declared in
Python without giving it a value. That means that it is not possible to simply tell a
program that a variable exists, and expect it to hang around waiting for more information.
Therefore, in Python, if you want to declare a variable without giving it a
meaningful value, a dummy value, known as None, must be used as a sort of placeholder, as
in (12). This is a bit like creating an empty box. (For more information concerning
the nature of the value None, see §7.3.)
(12) statementsDeclarationEx1.py
x = None
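To see why a placeholder value is needed, compare what happens when a name is used before anything has been assigned to it. The following is a small sketch in Python 3 (an assumption); the name undeclared_name is hypothetical and deliberately never assigned.

```python
# Using a name that has never been assigned raises NameError.
try:
    print(undeclared_name)       # hypothetical name, never assigned
    was_defined = True
except NameError:
    was_defined = False

# Assigning None creates the variable without a meaningful value.
x = None
print(was_defined, x is None)
```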
The name of a variable is fairly free in Python, and the main restrictions are
the following: First, a variable’s name must begin with a letter or an underscore.
Second, a variable’s name may consist only of letters, numbers, and underscores. (Given the
first restriction, this means that numbers can only occur after the
first character of a variable’s name.) Third, a variable’s name cannot conflict with
a reserved keyword. Keywords are words that have a special interpretation and are
therefore restricted in use by the Python interpreter (e.g., if, and, not). (Keywords
can be easily identified in this book by their formatting: they appear in bold
face in example code.)
More concretely, the variable names in (4) are legitimate whereas the ones in
(5) are illegitimate.
(4) a. myVariable1
    b. my_variable_1
    c. mv1
(5) a. 1myVariable
    b. my variable 1
    c. mv1
    d. myVariable!
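Candidate names can also be tested mechanically. The sketch below assumes Python 3, whose string method isidentifier() and standard keyword module were not available in the older Python used elsewhere in this book; the helper is_usable_name is a hypothetical name introduced for illustration. Note that Python 3 also accepts many non-English letters in names.

```python
import keyword

candidates = ["myVariable1", "my_variable_1", "1myVariable",
              "my variable 1", "myVariable!", "if"]

def is_usable_name(name):
    """True if name is a syntactically valid identifier and not a keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)

results = {name: is_usable_name(name) for name in candidates}
for name in candidates:
    print(repr(name), results[name])
```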
6.2.2 Assignment
Assignment refers to the act of giving a value to a variable. Continuing with our
variables-as-containers analogy, assignment is a bit like placing an object within
a labelled box. When this is done for the first time, the process is referred to as
initialization. There is no real distinction between initialization and assignment
in Python, and a variable can be given a new value at any point using the equal sign
(the assignment operator), as illustrated in (13).
1 For a discussion of these stylistic issues, which have a tendency to take the form of dogma
and ignite minor religious wars among programmers, see ?.
(13) statementsAssignmentEx1.py
i = 0            # Initialize counter to zero

# Repeatedly execute following code as long as
# value of i is less than 10
while i < 10 :
    i += 1       # Increment counter
    print i      # Print value
To initialize a variable is to give it a value for the first time. As already noted
in §6.2.1, the declaration of a variable requires its simultaneous initialization. This
has the effect of giving it a type, since the assigned value will force the variable to
take the appropriate type. Continuing with the variables-as-containers analogy, it
is important to recognize that not all containers are the same size. Variables, like
boxes, are not uniform. Some are bigger than others, and some are custom-made
for particular types of objects. Just as you use one kind of box to store your tennis
shoes and another kind of box to store your computer, you would use one kind of
variable to hold a number and another kind of variable to hold a word (see Chapter
7 for more details). Assignment is done with the equal sign (=). Therefore, in (14)
a string variable is simultaneously declared and initialized before being printed.
(14) statementsInitializationEx0.py
word = 'This string is declared and initialized.'
print word
When assigning a value to a variable, care must be taken with the proper formatting
of the value, since this can have an effect on the data type of the assigned
variable. For example, when assigning a value to a variable that holds a numeric
value, the presence or absence of a decimal point decides whether the number is
an integer or a floating point number (see Chapter 7 for details), and this has consequences
for the treatment of the numbers in any mathematical operations that
occur later in a program, as illustrated in (15).
(15) statementsInitializationEx1.py
# Initialize two variables as integers
x = 5
y = 2

# Print result of dividing x by y
print x / y

# Initialize two variables as floating point numbers
x = 5.0
y = 2.0

# Print result of dividing x by y
print x / y
The result of running (15) might not be what you expect. Although we all
know that 5 divided by 2 is 2.5, it is only printed out as such in the second instance.
This is because in the first instance the two numbers being divided were assigned
without decimal points, making them integers, and when an integer is divided by
another integer, the resulting value is an integer. (The number is not rounded up
or down. Rather, the remainder is simply lopped off.)
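Readers working with Python 3 should be aware that the division operator behaves differently there. The sketch below assumes Python 3: the single slash always gives the true quotient, and the integer behavior described above is obtained with the double-slash operator. The variable names are illustrative.

```python
x = 5
y = 2

floor_result = x // y        # floor division: fractional part discarded
true_result = x / y          # true division in Python 3
float_result = 5.0 / 2.0     # dividing two floats

print(floor_result, true_result, float_result)
```

In other words, the first print statement in (15) corresponds to x // y in Python 3, not to x / y.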
It is possible to query a variable’s type, since the interpreter keeps track of this
information. To check a variable’s type, the function type() can be used, as
shown in (16), where three variables are created and then have their type checked
using type().
(16) statementsInitializationEx2.py
message = 'This is a string'
n = 0
pi = 3.14159
print type(n)
print type(message)
print type(pi)
6.3 Expressions
they are identical, this is sometimes referred to as an identity test. An identity
test is an expression, since it returns a value: either the two things compared are
identical, in which case the result is True, or they are non-identical, in which case
the result is False. You can confirm this for yourself by seeing what value is
printed out by the various expressions in (17).
(17) statementsExpressionEx1.py
s = "Word"           # string
n = 1                # integer
# True
print s == s
print s == "Word"
print n == n
print n == 1
# False
print s != s
print s != "Word"
print n != n
print n != 1
Operator  Meaning
>         Greater Than
>=        Greater Than or Equal To
==        Equal To
<=        Less Than or Equal To
<         Less Than
These operators can be used with any data type, but they are more appropriate
for some than others. For example, the meaning of the operator > is straightforward
when it comes to numbers (integers, floats, etc.), but less so when it comes to
strings. When used with strings, the operator > means “sorts after”, as demonstrated
by (18).
(18) statementsExpressionEx2.py
s = "Aari"           # String
n = 1                # Number
# True
# False
print n > 5          # False
print s > "Zyphe"    # False
In (18), the first two expressions evaluate as True and the last two expressions
evaluate as False.
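The behavior of the comparison operators with strings can be explored with a few more cases. The sketch below assumes Python 3 and uses illustrative variable names; it also shows that the comparison is case-sensitive, with uppercase letters sorting before lowercase ones.

```python
s = "Aari"
n = 1

# Strings are compared character by character, using character codes.
string_after = s > "Zyphe"      # does "Aari" sort after "Zyphe"?
string_before = s < "Zyphe"     # does "Aari" sort before "Zyphe"?

# Uppercase letters sort before lowercase ones.
case_check = "Zyphe" < "aari"

number_check = n > 5

print(string_after, string_before, case_check, number_check)
```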
Returning to the wordcount program introduced in Chapter 5, we will show
how words occurring more than once can be extracted from a dictionary.
(19) statementsExpressionEx3.py
import sys

def getFilename() :
    try :
        return sys.argv[1]
    except IndexError :
        print "Usage: %s <FILE>" % sys.argv[0]
        sys.exit(0)

def getFileContents(filename) :
    fo = open(filename, 'r')
    fc = fo.read()
    fo.close()
    return fc

def extractWords(fileContents) :
    fileContents = fileContents.replace(',', '')
    fileContents = fileContents.replace('!', '')
    fileContents = fileContents.replace('.', '')
    fileContents = fileContents.replace(';', '')
    fileContents = fileContents.replace(':', '')
    words = fileContents.split(" ")
    return words

def buildWordlist(words) :
    wordlist = {}
    for w in words :
        parts = w.split("/")
        word = parts[0]
        tag = parts[1]

        try :
            n = wordlist[word][tag]
            wordlist[word][tag] = n + 1
        except KeyError :
            if not wordlist.has_key(word) :
                wordlist[word] = {}
            wordlist[word][tag] = 1
    return wordlist

def main() :
    filename = getFilename()
    fileContents = getFileContents(filename)
    words = extractWords(fileContents)
    wordlist = buildWordlist(words)
    printNonUniques(wordlist)

main()
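The listing in (19) calls a function named printNonUniques whose definition is not shown. One way such a function might be written is sketched below; this is a hypothetical stand-in rather than the original code, it is written for Python 3 (an assumption), and it assumes that the word list maps each word to a dictionary of tag counts, as built by buildWordlist.

```python
def print_non_uniques(wordlist):
    """Print every word whose total count across all tags exceeds one.

    Hypothetical stand-in for printNonUniques in (19); wordlist is assumed
    to map each word to a dictionary of {tag: count}.
    """
    repeated = []
    for word in sorted(wordlist):
        total = sum(wordlist[word].values())
        if total > 1:                      # the expression doing the filtering
            repeated.append((word, total))
            print(word, total)
    return repeated

example = {"Corp.": {"NNP": 3}, "Data": {"NNP": 1}, "a": {"DT": 3}}
found = print_non_uniques(example)
```

The comparison total > 1 is an expression of the kind discussed in this section: it evaluates to True or False, and the if statement acts on the result.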
6.4 Comments
Everything in a line of code following a hash mark (#) is ignored by the Python
interpreter (unless the hash mark occurs within a string), as shown in (20).
(20) statementsCommentsEx0.py
# This line will be ignored by the Python interpreter.

print "A # isn't a comment in strings."   # True of all strings

# Comments can span multiple lines, but each
# line has to be commented out. There is no
# convention in Python for multi-line comments.
Comments are also a useful way of blocking out the structure of a program
and for providing more general information about its authorship, purpose, etc., as
illustrated in (22), where the header and function descriptions of a program to find
minimal pairs are given (without their accompanying code).
(22) statementsCommentsEx2.py
# ---------------------------------------
# AUTHOR: Stuart Robinson
# DATE: 10 April 2004
# DESCRIPTION:
#   This program takes a Shoebox-formatted
#   dictionary file and finds all of the
#   minimal pairs contained within it.
# VERSION: 1.0
# ---------------------------------------

# ---------------------------------------
# Takes the Shoebox file and extracts
# each entry
# ---------------------------------------

...

# ---------------------------------------
# Takes the entries and compares them with
# one another to find the min pairs.
# ---------------------------------------

...
Comments are also used to “comment out” code that needs to be rendered
inactive for some reason. Often when programmers are revising a block of code, they
will comment out a copy of it first, so that if their
changes fail to work, they can fall back on the prior version of the code. (A good
text editor will provide a way of commenting out a region of text with a single
command.)
The general rule of thumb for comments is to avoid stating the obvious and
to provide comments on anything that is likely to trip up other programmers.
(Also, remember that even the author of a program is likely to forget how it
works as time drags on, so it pays to provide comments not just for others, but
also for one’s future self.) Beginning programmers have a tendency to either
provide no comments whatsoever (always a mistake) or to comment everything,
often providing comments on the workings of Python itself (which really ought
to be left up to documentation). Striking a balance between the poles of overcommenting
and undercommenting is one of the signs of a mature programmer.
Although Python has built-in documentation capabilities, coverage of them
goes beyond the intended scope of this textbook. For more information, go to
http://www.python.org/peps/pep-0257.html.
You will find comments used quite liberally throughout this book in the code
examples provided. A good deal of these comments would be overkill in everyday
working programs, but they are very useful for the purposes of exposition.
6.5 Exercises
• Examine the following pieces of Python code and determine for each whether
it is legitimate and, if not, what’s wrong with it and how it could be fixed:
– if i = 0 : print i
– in = 5
– s = 'If you can't explain it clearly, you don't
understand it yourself.'
– ???
– ???
• ???
• ???
Chapter 7
Data Types
In the previous chapter, variables were introduced. As already mentioned, vari-
ables are fundamental to programming, since they are where data is kept while a
program is running. Because different types of information are stored in variables
(text, numbers, lists of numbers, etc.), it is useful to break down variables into
different types. This is known as data typing and the different types of variables
recognized in a language are known as data types. Python has a number of built-
in data types—that is, data types that are implicitly understood by the Python in-
terpreter without any need for the programmer to provide additional specification.
Each has a different job to do and behaves somewhat differently. It is important
to understand how the various data types differ from one another so that one can
make use of their various strengths. Later in Chapter 16 we will show how it is
Note that in (23) the string is surrounded by single quotes. These quotes are
necessary because they ensure that the words in your string aren’t interpreted as
instructions to the Python interpreter. It is also possible to use double (or even
triple) quotes, although slightly different behavior is obtained by doing so. For
the moment, the differences need not concern us and the two types will be used
interchangeably. (Later, in Chapter 14, the difference between strings with single,
double, and triple quotes will be discussed in detail.)
Strings cannot be modified once they are created, and are therefore described
as being immutable. (This holds true regardless of whether the string was initialized
with single, double, or triple quotes.) One practical consequence of the immutability
of strings in Python is that any operations performed on strings create a
modified copy of the string, rather than modifying the pre-existing one. Consider
(24).
(24) dataTypesStringEx2.py
s1 = 'Strings are immutable.'
s2 = s1.replace('immutable', 'IMMUTABLE')
print s1
print s2
In (24), a string is created and a copy of the string is made with the word ’im-
mutable’ replaced by ‘IMMUTABLE’. The replacement is done using the method
replace. Each data type makes available a number of different methods which
can be used to carry out commonly performed operations. For a complete list of
the methods available for a given data type, you will need to consult some docu-
mentation. The methods available for strings are discussed in considerable detail
in Chapter 14. For now, the important point is that when the method replace is
called on the string in (24), the original string is unaffected by the replacement, as
you can confirm by running the code yourself.
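The point about immutability can also be checked directly. The following is a
minimal sketch rather than one of the book’s listings: it assumes that print is
called with parentheses (so that it runs under both Python 2 and Python 3), and
the variable name modified is purely illustrative.

```python
# A string method returns a new string; the original is left untouched.
s1 = 'Strings are immutable.'
s2 = s1.replace('immutable', 'IMMUTABLE')
print(s1)
print(s2)

# Assigning to a position inside a string is not allowed and raises TypeError.
try:
    s1[0] = 's'
    modified = True
except TypeError:
    modified = False
print(modified)
```

The try/except block is only there so that the script keeps running after the
failed assignment; without it, the program would stop with an error message.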
Strings can be joined together to form larger strings, a process known as string
concatenation. When printing, Python provides two ways of joining strings: the
comma and the plus sign. Strings separated by a comma in a print statement are
printed with a space inserted between them, whereas strings concatenated with a
plus sign are joined directly with no intervening material, as shown in (25).
(25) dataTypesStringEx3.py
print "The hatches of the ship" , "closed."
print "The ship" + "'s gravity field must be rather weak."
For more comprehensive coverage of strings, see Chapter 14, which is devoted
exclusively to the topic.
7.2 Numbers
Numbers in Python come in more than one variety. The language has a number of
built-in numeric types, but only two are likely to be of relevance to the concerns
of language researchers: integers and floats. We will discuss each in turn.
7.2.1 Integers
An integer in Python is used to hold whole numbers, both positive and negative
(e.g., -5, 0, 36, etc.), of virtually any size.1 In (26), an integer is created and saved
as the variable n. Various mathematical operations are then performed on it and
the results printed out.
(26) dataTypesIntegersEx1.py
n = 25
print n / 5
print n + 5
print n - 10
print 2 * n
Integers do not store fractions and any mathematical operation performed with
integers will result in an integer. Therefore, if you perform a mathematical oper-
ation with an integer that results in a fraction (such as division), the fraction will
simply be eliminated. This effectively rounds the number down to the nearest
whole number, as illustrated in (27).
(27) dataTypesIntegersEx2.py
x = 7 # prime number
y = 2
print x / y # prints 3, NOT 3.5
In order to deal with fractions, integers must be converted into a float, which
is a data type that supports fractions (see the next section §7.2.2 for details).
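The rounding behaviour of integer division can be made explicit with the floor
division operator. The sketch below is an illustration, not one of the book’s
listings; it uses the // operator and the built-in divmod(), which give the same
results in Python 2 and Python 3, and the variable names are assumptions made for
the example.

```python
x = 7
y = 2

quotient = x // y         # floor division: the fraction is discarded
remainder = x % y         # what is left over after the division
both = divmod(x, y)       # quotient and remainder together, as a pair
negative = -7 // 2        # rounds DOWN, i.e. towards negative infinity
as_float = float(x) / y   # converting one operand first keeps the fraction

print(quotient)
print(remainder)
print(both)
print(negative)
print(as_float)
```

Note that rounding down is not the same as dropping the digits after the decimal
point: for negative numbers the result moves away from zero.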
7.2.2 Floats
Unlike integers, a float (short for “floating point number”) is able to store
fractions. The term derives from the fact that the decimal point of these numbers
can move around, or “float”. To create a float, simply add a decimal point to the
number, to any place (tenth, hundredth, etc.), as in (28) (cf. (27)).
(28) dataTypesFloatEx1.py
x = 7.0 # numerator is float
y = 2 # divisor is integer
print x / y # result is float
Note that it doesn’t matter whether the numerator or the divisor is a float. If
either one (or both) is a float, the result will be a float, as can be seen in (29),
where the divisor is a float (as opposed to (28), where the numerator is a float).
Nevertheless, the result is the same.
1 In older versions of Python, there was an upper bound on the size of integers. If
an integer grew too large, it would exceed the memory allocated to it, resulting in
an error. To avoid such errors, an integer would have to be explicitly created as a
long integer. However, since Python 2.2, whenever a normal integer exceeds its
bounds, it is automatically converted to a long integer.
7.3 The None Value
(29) dataTypesFloatEx2.py
x = 7 # numerator is integer
y = 2.0 # divisor is float
print x / y # result is float
When you print out a float, it will be printed out with its decimal point. There
are numerous ways of printing out a float without the decimal point, but probably
the simplest is to convert the float to an integer, as illustrated in (30). (Other
ways of doing this will be covered later; see §14.4.5 on string interpolation.)
(30) dataTypesFloatEx3.py
n = 8.0 / 2
print n
print int(n)
Integers can also be converted into floats. One method of conversion relies on
the fact that multiplying an integer by a float results in a float: simply multiply
the integer by 1.0 before carrying out any additional operations on it. Another
way is to explicitly convert the integer to a float using the built-in function
float(). Both techniques are illustrated in (31).
(31) dataTypesIntegersEx3.py
x = 7
y = 2
print x / y # 3
print x * 1.0 / y # 3.5
print float(x) / y # 3.5
2 Just in case you are looking for analogies from other programming languages, the
equivalent of None in Python is null in Java and undef in Perl.

if 0 : print "'0' counts as true."
else : print "'0' counts as false."

if 1 : print "'1' counts as true."
else : print "'1' counts as false."
Although None and 0 are both treated as false, they are not equal in value, as
shown by (33).
(33) dataTypesNoneValueEx2.py
if None == 0 :
    print "The value 'None' is equal to '0'."
else :
    print "The value 'None' is not equal to '0'."
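The difference between counting as false and being equal to 0 can be inspected
with the built-in function bool(). What follows is a small sketch under the
assumption that print is called with parentheses (so that it runs under Python 2
and Python 3 alike); the variable names are illustrative only.

```python
value = None

none_is_false = not bool(value)   # None counts as false in a test
zero_is_false = not bool(0)       # so does 0
equal_to_zero = (value == 0)      # yet None and 0 are different values
is_none = value is None           # the usual way of testing for None

print(none_is_false)
print(zero_is_false)
print(equal_to_zero)
print(is_none)
```

Testing with "is None" is a common choice when a program needs to tell a missing
value apart from a genuine 0, for instance a word count that happens to be zero.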
Attempting to concatenate unlike data types gives rise to errors, as you can see
for yourself by running (35), which attempts to concatenate the variables in (34).
(35) dataTypesTypingEx2.py
s = "one"
f = 1.0
i = 1

print s + f + i
In order to concatenate and print unlike data types, it is first necessary to con-
vert them into strings, which can be done using the built-in function str(), as
illustrated in (36).
(36) dataTypesTypingEx3.py
s = "one"
f = 1.0
i = 1
print s + str(f) + str(i)
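The effect of the conversion can be seen by trying the concatenation both with
and without str(). This is a hedged sketch, not the author’s listing: it assumes
concatenation with the plus sign, calls print with parentheses so that it also
runs under Python 3, and the names joined and converted are made up for the
illustration.

```python
s = "one"
f = 1.0
i = 1

# Joining a string and a float with + raises a TypeError.
try:
    joined = s + f
except TypeError:
    joined = None

# After conversion with str(), every operand is a string.
converted = s + " " + str(f) + " " + str(i)

print(joined)
print(converted)
```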
7.5 Exercises
• If you wanted to know how many times a particular word appeared in a
corpus, which type of number would you use? If the frequency of a word
in a corpus is defined as the number of times it occurs in the corpus divided
by the total number of words in the corpus, which type of number would be
appropriate for the word’s frequency?
• If you had a list of words and you wanted to see how many times each word
occurs in a particular corpus, what would you use as the initial word count
value?
• ???
Chapter 8
Data Structures
A data structure is basically a data type that provides the means of organising
a larger (and possibly more complicated) collection of data in such a way that it
can be stored and manipulated more effectively. In other words, it is a way of
structuring data (hence the term). Although the term may sound off-putting to the
uninitiated, the reader is no doubt familiar with at least one data structure: the
list. A list is a collection of items, stored in a fixed order. Every programming
language implements them in one way or another, although they sometimes go by
more esoteric names (e.g., vector in Java, array in Perl).
Some common data structures are: arrays, dictionaries (aka, associative arrays
or hashes), graphs, heaps, linked lists, matrices, objects, queues, rings, stacks,
trees, and vectors. (Don’t worry if you aren’t familiar with all of these data
structures, since most are unnecessary for basic programming tasks.) In Python, all
of these data structures are potentially available. Those that are not built into
the language can be implemented using object-oriented programming techniques (see
chapters 15 and 16).
In this chapter we will discuss Python’s built-in data structures: lists (§??),
dictionaries (§8.2), and tuples (§8.3).
8.1 Lists
A list is a data type used to store multiple items in sequential order. A list can
contain items of any data type (string, numbers, etc.), but we’ll stick to strings for
now. Let’s take the first sentence from Hal Clement’s novel, A Mission of Gravity,
and store each word as an item in a list.
(6) The wind came across the bay like something living.
There are numerous ways to create a list in Python, but the simplest is to
put all of the items between square brackets, separated by commas. Since we are
creating a list of strings, we’ll have to put each string inside of quotes, as in (37).
(37) dataStructuresListEx0.py
l = ['The','wind','came','across','the','bay','like','something','living.']
Another way of creating a list of strings from a sentence is to let Python do
the work of splitting the sentence into words. This can be done with the string
method split, as shown in (38).
(38) dataStructuresListEx0.5.py
s = "The wind came across the bay like something living."
l = s.split()
print l
When the list in (38) is printed, it will be formatted in the same manner as (37).
Notice that the last word in both (37) and (??) occurs with punctuation: the period
is treated as part of the last word because it is not separated from it by a space. To
ensure that each word occurs alone in the list, the period can be separated from
the rest of the sentence by a space, ensuring that it is treated as a separate word.
(This is a fairly common practice in machine-readable corpora, i.e., texts that
have been formatted for automatic processing.)
(39) dataStructuresListEx0.75.py
s = "The wind came across the bay like something living ."
l = s.split()
print l
Figure 8.1 List Indices
Index  String
0      The
1      wind
2      came
3      across
4      the
5      bay
6      like
7      something
8      living.
The use of indices to retrieve values from a list is illustrated in (40), where a
list of the words from (6) is created and each of its items is printed out. Note that
the last line of code, which refers to the eleventh item (i.e., index 10), produces an
error, since the list contains only 10 items.
(40) dataStructuresListEx1.py
# 'words' is used as the variable name so that the built-in name 'list' is not hidden
words = "The wind came across the bay like something living .".split()
print words[0] # First item
print words[1] # Second item
print words[2] # Third item
print words[3] # Fourth item
print words[4] # Fifth item
print words[5] # Sixth item
print words[6] # Seventh item
print words[7] # Eighth item
print words[8] # Ninth item
print words[9] # Tenth item
print words[10] # ERROR: No eleventh item!
In (40), we see how to get the first, second, third, etc. item in a list, but how
do we get the last, second-to-last, third-to-last, etc. item in a list? One way is to
figure out its position and to subtract the appropriate amount (since counting starts
at 0, not 1). To get the last item in a list, we obtain the total number of items in
the list and subtract 1; to get the second-to-last item in a list, we subtract 2; etc.
Returning to our previous example in (40), we can print the second-to-last and last
items, as follows:
(41) dataStructuresListEx3.py
l = "The wind came across the bay like something living.".split()
print l[len(l) - 1] # Last item
print l[len(l) - 2] # Second-to-last
print l[len(l) - 3] # Third from the last
There is, however, a more convenient way of retrieving the last and second-to-
last item in a list, which is simply to provide a negative index. Negative indices
count leftward from the end of a list. The last item has an index of -1, the
second-to-last item has an index of -2, etc. This is illustrated in (42), which does
the same work as (41), but more economically.
(42) dataStructuresListEx4.py
l = "The wind came across the bay like something living.".split()
print l[-1] # Last item
print l[-2] # Second-to-last item
print l[-3] # Third-from-last item
The negative indices for all of the words in (42) are provided in Figure 8.2.
Figure 8.2 Negative List Indices
Index  String
-9     The
-8     wind
-7     came
-6     across
-5     the
-4     bay
-3     like
-2     something
-1     living.
8.1.2 Slicing
Using a technique known as slicing, the indices of a list can also be used to
manipulate multiple items simultaneously. The idea behind slicing is that a range of
items can be indicated using two indices, one indicating where the range begins
and another indicating where it ends. The slice range does not include the item
referenced by the second index; in other words, the end index is exclusive. In
(43), various slices of our Mission of Gravity list are illustrated.
(43) dataStructuresListEx5.py
l = "The wind came across the bay like something living .".split()
print l[0:1] # [The]
print l[0:2] # [The wind]
print l[:6] # [The wind came across the bay]
print l[2:3] # [came]
print l[3:6] # [across the bay]
print l[6:11] # [like something living .]
print l[6:] # [like something living .]
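The exclusive end index has a handy consequence: the number of items in a slice
is simply the end index minus the start index. The sketch below illustrates this
with the same sentence; it is not one of the book’s listings, it calls print with
parentheses so that it also runs under Python 3, and the variable names are
assumptions made for the example.

```python
words = "The wind came across the bay like something living .".split()

first_two = words[0:2]      # items 0 and 1; item 2 is excluded
noun_phrase = words[4:6]    # items 4 and 5
tail = words[6:]            # from item 6 to the end of the list
clipped = words[6:100]      # an end index past the end is quietly trimmed

print(first_two)
print(noun_phrase)
print(tail)
print(len(words[3:6]))      # 6 - 3 items
```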
Negative indices also work with slicing, as shown by (44), where different
slice ranges are provided: the last item, the second-to-last to the last item, etc.
(Again, no error results from a reference to a non-existent index.)
(44) dataStructuresListEx6.py
letters = ['alpha', 'beta', 'charlie', 'delta', 'epsilon']
print letters[-1:] # Last
print letters[-2:] # Second-to-last and last
print letters[-3:] # Third-to-last, second-to-last, and last
print letters[-4:] # Fourth-to-last to last
print letters[-5:] # Fifth-to-last to last
print letters[-6:] # Sixth-from-last to last
No index is provided for the right boundary of the slice range in (44). When
a slice index is left blank, it is interpreted as the far edge of the list. Therefore, if
the left slice index is omitted, the slice starts at the beginning of the list, and if the
right slice index is omitted, the slice ends at the end of the list, as shown in (45),
where the equivalent of (43) is provided using a blank left slice index rather than
a 0.
(45) dataStructuresListEx7.py
letters = ['alpha', 'beta', 'charlie', 'delta', 'epsilon']
print letters[:1] # First item
print letters[:2] # First and second items
print letters[:3] # First, second, and third items
print letters[:4] # First through fourth items
print letters[:5] # All items, first through fifth (last)
print letters[:6] # All items, first through non-existent sixth
You may be wondering why slice ranges are able to refer to non-existent
indexes. This is because Python quietly trims a slice range to fit the list, so a
range that runs past either end simply stops at that end. Lists are also mutable,
which means that it is still possible to change individual items in them after they
have been created, as illustrated in (46).
(46) dataStructuresListEx8.py
l = [] # Create empty list
l.insert(0, 'a') # Insert item at beginning of list
l.insert(1, 'b') # Insert item at index 1 (here, the end of the list)
l.append('c') # Add item to end of list
print l
In (47), two noun phrases are replaced using slices. This is done in two steps:
the words The wind are first replaced by A stench and the words something living
are replaced with an invading army.
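A replacement of this kind can be sketched as follows. This is a hypothetical
reconstruction for illustration, not the author’s listing: the choice of slice
boundaries, the variable names, and the use of print with parentheses (so that the
code also runs under Python 3) are all assumptions.

```python
words = "The wind came across the bay like something living .".split()

# Step 1: replace items 0 and 1 ("The wind") with two new items.
words[0:2] = ['A', 'stench']

# Step 2: replace items 7 and 8 ("something living") with three new items.
# The list grows by one item, since the replacement is longer than the slice.
words[7:9] = ['an', 'invading', 'army']

sentence = ' '.join(words)
print(sentence)
```

Because a slice assignment may insert more or fewer items than it removes, the
indices of later items can shift; here the second step is unaffected only because
the first replacement has the same length as the slice it replaces.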
8.1.3.1 Adding and Removing Items in a List
A list isn’t very useful unless you can add items to it and remove them, as well.
The simplest way to add an item to a list is with the function append(), as in
(??).
(48) dataStructuresListAppendEx1.py
l = ['???', '???', '???']
l.append('???')
In (47), we see how a list (or a slice from a list) can be added to another in a
specific position. If one wishes merely to add one list to the end of another, there
are simpler options: either to use the function extend(list) or concatenate
two lists with the plus sign (+), as shown in (49).
(49) dataStructuresListEx10.py
a = [1, 2] # Create two lists
b = [3, 4]

c = a + b # Concatenate the two lists
print c

a.extend(b) # Add list b to list a
print a
There is yet another way of adding the elements of one list to another. It is
also possible to create lists of lists. [MENTION THE APPEND FUNCTION]
(50) dataStructuresListEx11.py
a = [1, 2] # Create two lists
b = [3, 4]

# Here we merge two lists of two items
c = a + b # List c has 4 items
print c[1] # Print second item of new list (2)

# Here we create a nested list, whose first item
# is list a and whose second item is list b
c = [a, b] # List c has 2 items
print c[1] # Print second item of new list (list b)
print c[0][1] # Print second item of first list (2)
When dealing with lists, one often wants to know the total number of elements
found within a list. The count of elements contained within a list is generally
referred to as its length. (The term “size” is also used to describe the length of a
list, but the term is ambiguous, since it can also refer to the amount of memory
taken up by a list.) The length of a list can be easily obtained using the function
len(), as in (51).
(51) dataStructuresListLenEx1.py
s = "Colorless green ideas sleep furiously ."
words = s.split(" ")
print len(words)
It is also possible to find the number of times a specific value occurs in a list.
For example, if you had a list of word lengths in a corpus, you might want to find
the number of times words of one length occur relative to words of another length.
How often do three-letter words occur relative to four-letter words?
(52) dataStructuresListLenEx2.py
s = "Colorless green ideas sleep furiously ."
words = s.split(" ")
print len(words)
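Counting how often a value occurs can be done with the list method count(). The
following is a minimal sketch of the word-length question raised above, not one
of the book’s listings; the variable names are illustrative, the for loop used to
build the list of lengths anticipates Chapter 9, and print is called with
parentheses so that the code also runs under Python 3.

```python
s = "Colorless green ideas sleep furiously ."
words = s.split()

# Build a list holding the length of each word.
lengths = []
for w in words:
    lengths.append(len(w))

five = lengths.count(5)    # number of five-character words
nine = lengths.count(9)    # number of nine-character words

print(lengths)
print(five)
print(nine)
```

Note that the final period counts as a word of length 1 here, because it was
separated from the sentence by a space.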
When the same functions are used with a list of strings, the minimum value
will be the item that ??? while the maximum value will be the item that ???, as
illustrated in (54).
(54) dataStructuresListMinMaxEx2.py
l = ['Colorless', 'green', 'ideas', 'sleep', 'furiously', '.']
print "Min." , min(l)
print "Max." , max(l)
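As a hedged illustration of how min() and max() treat strings: Python compares
strings character by character according to their character codes, so that
punctuation sorts before uppercase letters and uppercase before lowercase. The
sketch below is not one of the book’s listings; the variable names are
assumptions, as is the optional key argument (available in more recent versions of
Python), which is used to compare by length instead.

```python
l = ['Colorless', 'green', 'ideas', 'sleep', 'furiously', '.']

smallest = min(l)             # '.' sorts before every letter
largest = max(l)              # 'sleep' begins with the "largest" letter
longest = max(l, key=len)     # compare by length rather than by characters

print(smallest)
print(largest)
print(len(longest))
```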
???
8.2 Dictionaries (aka, Hash Tables)
When retrieving the keys from a dictionary with the method keys(), their order-
ing is unpredictable. In many cases, however, the keys need to be sorted before
being iterated over. The easiest way of accomplishing this is to retrieve the keys
as a sequence, which can then be sorted, as in (55).
(55) dataStructuresListSortingEx1.py
from string import join

d = {}
words = ['zebra', 'ape', 'monkey', 'ostrich', 'echidna']
for w in words : d[w] = len(w)

# Print keys in default order
keys = d.keys()
print "Default Order:" , join(keys, ", ")

# Print keys in sorted order
keys.sort()
print "Sorted Order: " , join(keys, ", ")
The same strategy works when sorting the values in a dictionary, as can be
seen in (56).
(56) dataStructuresListSortingEx2.py
import string

# Add each word to dictionary with its length as its value
d = {}
words = ['zebra', 'ape', 'monkey', 'ostrich', 'echidna']
for w in words : d[w] = repr(len(w)) # integer -> string

# Print values in default order
values = d.values()
print "Default Order:" , string.join(values, ", ")

# Print values in sorted order
values.sort()
print "Sorted Order: " , string.join(values, ", ")
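The same results can be obtained with the built-in function sorted(), which
returns a new sorted list and leaves the original untouched. This is a sketch
under stated assumptions rather than the author’s method: it relies on sorted()
(introduced in Python 2.4) and on the string method join(), uses print with
parentheses so that it also runs under Python 3, and stores the word lengths as
integers rather than strings.

```python
d = {}
words = ['zebra', 'ape', 'monkey', 'ostrich', 'echidna']
for w in words:
    d[w] = len(w)

sorted_keys = sorted(d)               # iterating over a dictionary yields its keys
sorted_values = sorted(d.values())    # integers are sorted numerically

print(", ".join(sorted_keys))
print(", ".join(str(v) for v in sorted_values))
```

Keeping the values as integers matters once lengths reach two digits: sorted as
strings, '10' would come before '9'.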
(57) dataStructuresDictionaryEx1.py
d = {}
d['ROO'] = 'Rotokas'
d['???'] = '???'
d['???'] = '???'
d['???'] = '???'
print d['ROO']
print d['???']
print d['???']
print d['???']
pair is added to the dictionary for a key that already exists, the new value will
replace the old one, as illustrated in (58), where ???.
(58) dataStructuresDictionaryEx1a.py
d = {}
d['ROO'] = 'Rotokas'
print d['ROO']
d['???'] = '???'
print d['???']
All of the keys in a dictionary can be retrieved as a list using the function
keys() and all of the values using the function values(), as illustrated in
(59), where all of the keys and all of the values of the dictionary from (57) are
printed out.
(59) dataStructuresDictionaryEx2.py
d = {}
d['a'] = 'Chimpanzee'
d['d'] = 'Howler Monkey'
print d.keys()
print d.values()
d['geklok'] = 'clucking'
dutchWord = sys.argv[1] # Get command-line arg
print d[dutchWord] # Find value for key
# Create dictionary with lists for items
d = {'geklets' : ['talking', 'gossiping'],
     'geknabbel' : ['nibbling', 'gnawing'],
     'gekletter' : ['clatter'],
     'geklop' : ['clucking'],
     'geknars' : ['gnashing', 'grating'],
     'gekleurd' : ['coloured'],
     'geklok' : ['clucking']}

# Get keys and retrieve items by keys
words = d.keys()
words.sort()
for w in words :
    defList = d[w]
    # The loop variable is not called 'd', so the dictionary is not overwritten
    for definition in defList :
        print "%s -> %s" % (w, definition)
The problem with this script, however, is that if the user specifies a word that
is not found in the word list, an error will result.
(63) dataStructuresDictionaryHasKeyEx2.py
text = "Zenith Data Systems Corp., a subsidiary of Zenith Electronics Corp., \
received a $534 million Navy contract for software and services of \
microcomputers over an 84-month period. Rockwell International Corp. won a \
$130.7 million Air Force contract for AC-130U gunship replacement aircraft."
words = text.split() # Split up contents into tokens
d = {} # create a dictionary
for w in words : # go through each token
    if not d.has_key(w) : # see if token is not in dict
        d[w] = 0 # if so, give it a counter
    d[w] = d[w] + 1 # increment counter
print d["the"] # print out freq. for "the"
We can solve this problem by first checking for the existence of the word as
a key in the dictionary, as in (64). If it is there, the program prints its frequency;
otherwise, it prints a message.
(64) dataStructuresDictionaryHasKeyEx3.py
text = "Zenith Data Systems Corp., a subsidiary of Zenith Electronics Corp., \
received a $534 million Navy contract for software and services of \
microcomputers over an 84-month period. Rockwell International Corp. won a \
$130.7 million Air Force contract for AC-130U gunship replacement aircraft."
words = text.split() # Split up contents into tokens
d = {} # create a dictionary
for w in words : # go through each token
    if not d.has_key(w) : # see if token is not in dict
        d[w] = 0 # if so, give it a counter
    d[w] = d[w] + 1 # increment counter
if d.has_key("the") : # check whether 'the' is in dict
    print d["the"] # print out freq. for 'the'
else : # 'the' isn't in dict, so...
    print "The word 'the' was not found." # print message
8.2.1.2 Another Way to Obtain a Key’s Value
The attempt to retrieve the value for a non-existent key results in an error, as we
saw in §8.2.1.1. To avoid program errors, it is necessary to check whether a key
exists before trying to obtain its value, as in (64). But there is another way to
avoid an error, which is to use the function get(key); it is safer in the sense that
it will return the value None for a non-existent key rather than raise an error, as
in (65).
(65) dataStructuresDictionaryGetKeyEx1.py
text = "Zenith Data Systems Corp., a subsidiary of Zenith Electronics Corp., \
received a $534 million Navy contract for software and services of \
microcomputers over an 84-month period. Rockwell International Corp. won a \
$130.7 million Air Force contract for AC-130U gunship replacement aircraft."
words = text.split() # Split up contents into tokens
d = {} # create a dictionary
for w in words : # go through each token
    if not d.has_key(w) : # see if token is not in dict
        d[w] = 0 # if so, give it a counter
    d[w] = d[w] + 1 # increment counter
print d.get("the")
In (65), ???. It does this by taking two arguments: the key and a default value
that will be returned if the key is not found, as illustrated by (66), which provides
a simplified variant of (65).
(66) dataStructuresDictionaryGetKeyEx2.py
text = "Zenith Data Systems Corp., a subsidiary of Zenith Electronics Corp., \
received a $534 million Navy contract for software and services of \
microcomputers over an 84-month period. Rockwell International Corp. won a \
$130.7 million Air Force contract for AC-130U gunship replacement aircraft."
words = text.split() # Split up contents into tokens
d = {} # create a dictionary
for w in words : # go through each token
    if not d.has_key(w) : # see if token is not in dict
        d[w] = 0 # if so, give it a counter
    d[w] = d[w] + 1 # increment counter
print d.get("the", 0) # return value for 'the' or 0
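The default value of get() can also be put to work inside the counting loop
itself, which removes the need to check for the key beforehand. The following is
a sketch of that idea, not the author’s listing: it uses a shortened version of
the sample text, the "in" operator as an alternative to has_key(), and print with
parentheses so that it also runs under Python 3; the variable names are
assumptions.

```python
text = "Zenith Data Systems Corp., a subsidiary of Zenith Electronics Corp."

d = {}
for w in text.split():
    # get() supplies 0 the first time a token is seen, so no prior check is needed
    d[w] = d.get(w, 0) + 1

zenith = d.get("Zenith", 0)    # token that occurs twice
the = d.get("the", 0)          # token that does not occur: the default is returned
missing = d.get("the")         # without a default, None is returned
has_the = "the" in d           # membership test on the dictionary's keys

print(zenith)
print(the)
print(missing)
print(has_the)
```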
(68) dataStructuresDictionarySortByValue.py
# Example from PEP 265 - Sorting Dictionaries By Value
# Counting occurrences of letters

d = {'a':2, 'b':23, 'c':5, 'd':17, 'e':1}

# operator.itemgetter is new in Python 2.4
# `itemgetter(index)(container)` is equivalent to `container[index]`
from operator import itemgetter

# Items sorted by key
# The new builtin `sorted()` will return a sorted copy of the input iterable.
print sorted(d.items())

# Items sorted by key, in reverse order
# The keyword argument `reverse` operates as one might expect
print sorted(d.items(), reverse=True)

# Items sorted by value
# The keyword argument `key` allows easy selection of sorting criteria
print sorted(d.items(), key=itemgetter(1))

# In-place sort still works, and also has the same new features as sorted
items = d.items()
items.sort(key = itemgetter(1), reverse=True)
print items
8.3 Tuples
In Python, there is another data structure called a tuple which resembles a list but
differs in one crucial regard. Unlike lists, tuples cannot be changed once they have
been created. In other words, they are immutable. A tuple is created much like a
list, but with parentheses rather than square brackets, as illustrated in (69).
(69) dataStructuresTupleEx1.py
t = ('The', 'wind', 'came', 'across', 'the', 'bay', 'like', 'something', 'living.')
Tuples can be manipulated in the same way as lists, using indices, slices, etc.,
as demonstrated in (70). Note that, just as a slice from a list creates another list, a
slice from a tuple creates another tuple.
(70) dataStructuresTupleEx2.py
t = ('The', 'wind', 'came', 'across', 'the', 'bay', 'like', 'something', 'living.')
print t[0] # positive indices
print t[1]
print t[-1] # negative indices
print t[-2]
print t[0:1] # slice
The defining property of tuples, which distinguishes them from lists, is their
immutability. Any attempt to modify a tuple will result in an error, as illustrated
in (71).
8.4 Nested Data Structures
(71) dataStructuresTupleEx3.py
t = ('The', 'wind', 'came', 'across', 'the', 'bay', 'like', 'something', 'living.')
t[1] = 'A' # Assignment to a tuple raises an error
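When the contents of a tuple do need to change, one option is to build a
modified copy by way of a list. The sketch below is an illustration under stated
assumptions, not one of the book’s listings: it uses the built-in functions list()
and tuple() for the conversion, catches the error with try/except so that the
script keeps running, and calls print with parentheses so that it also runs under
Python 3. The names changed, l and t2 are made up for the example.

```python
t = ('The', 'wind', 'came')

# Direct assignment to an item of a tuple raises a TypeError.
try:
    t[1] = 'A'
    changed = True
except TypeError:
    changed = False

# Workaround: copy the items into a list, modify it, and convert back.
l = list(t)
l[0] = 'A'
t2 = tuple(l)

print(changed)
print(t)     # the original tuple is unchanged
print(t2)
```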
You may be wondering why it is worth having both mutable and immutable
lists. What additional value is there in having tuples in addition to lists? The
answer is that tuples utilize less memory. They are therefore the preferred option
when a program is creating lots of lists that will not be modified. For example, if
you are reading
(72) dataStructuresDictionaryOfDictionaries.py
text = "Zenith/NNP Data/NNP Systems/NNPS Corp./NNP ,/, a/DT \
subsidiary/NN of/IN Zenith/NNP Electronics/NNP Corp./NNP ,/, received/VBD \
a/DT $/$ 534/CD million/CD Navy/NNP contract/NN for/IN software/NN and/CC \
services/NNS of/IN microcomputers/NNS over/IN an/DT 84-month/JJ period/NN \
./. Rockwell/NNP International/NNP Corp./NNP won/VBD a/DT $/$ 130.7/CD \
million/CD Air/NNP Force/NNP contract/NN for/IN AC-130U/NN gunship/NN \
replacement/NN aircraft/NN ./."

stopwords = ['$', ',', '.'] # POS labels that can be ignored
d = {} # create a dictionary

tokens = text.split() # split up contents into tokens
for t in tokens : # go through each token
    parts = t.split("/") # split token into list
    w = parts[0] # 1st part is the word
    gc = parts[1] # 2nd part is POS (gc = grammatical class)
    if not d.has_key(gc) : # if POS label not in dictionary,
        d[gc] = {} # then add it w/ a dictionary as its value
    if not d[gc].has_key(w) : # if word not in embedded dictionary,
        d[gc][w] = 0 # then add it w/ 0 as its value
    d[gc][w] = d[gc][w] + 1 # increase counter for word/POS by 1
gclasses = d.keys() # get list of parts of speech
gclasses.sort() # sort list of parts of speech
for gc in gclasses : # go through each part of speech
    if gc not in stopwords : # ignore any POS in list of stopwords
        print gc # print POS
        wd = d[gc] # get embedded dictionary for POS
        words = wd.keys() # get keys (words) in embedded dictionary
        words.sort() # sort words
        for w in words : # go through each word
            count = wd[w] # get count associated with word
            print " ", w , count # print word and its freq. count w/ indentation
8.5 Exercises
• Zipf’s law states that the frequency of any word is roughly inversely
proportional to its frequency rank, that is, its position in a frequency table that
is sorted in descending order (from most to least frequent). If you wanted
to write a program in Python that takes a text and builds a word frequency
table, what kind of data structure(s) would you use?
works with the passages in which they occur. If you wanted to write a
program in Python to create a concordance from a text, what kinds of data
structures would you use? How would they interact?
• Tree diagrams are a commonly used device for representing the hierarchi-
cal organization of elements in a sentence. For example, consider the tree
diagram in (7).
(7) S
NP VP
The dog V NP
What kind of data structure could be used to store the information found in
(7)?
Chapter 9
Data Flow
The term data flow refers to the devices available for putting the logic of a pro-
gram into action. These devices are sometimes called control structures, given
that they control how and when statements are executed. The various control
structures available to Python will be discussed in turn.
9.1 Conditional Statements: if
parts: the antecedent (the if-clause) and the consequent (the then-clause).1
In Python, the if-clause is separated from the then-clause by a colon, as in
(73). (Note the difference between a single equal sign (=) and a double equal sign
(==): the single equal sign is used to assign variables a value while the double
equal sign is used to compare the values of variables.)
(73) dataFlowIfEx1.py
pos = "N"
if pos == "N" :
    print "The POS is N."
if not pos == "V" :
    print "The POS is not V."
if pos != "V" :
    print "The POS is not V."
The if-clauses in all of the statements in (73) are true, and therefore their
then-clauses are executed. It is important to be clear about what represents true
and false when evaluating an if statement. The answer is quite simple in principle,
but can sometimes be a stumbling block for beginners. The rule is that the values 0
and None evaluate as false, as do empty strings and empty data structures such as
[] and {}; everything else evaluates as true.
1 In logic, the former is sometimes referred to as the implicans or protasis and the
latter as the implicate or apodosis (McCawley, 1993; Chapman, 2000). We’ll stick to
the simpler terminology of if-clause and then-clause.
(74) dataFlowIfEx2.py
if 1 : print "'1' evaluates as true."
if -1 : print "'-1' evaluates as true."
if None : print "'None' evaluates as false."
if 0 : print "'0' evaluates as false."

if not 1 : print "'not 1' evaluates as false."
if not -1 : print "'not -1' evaluates as false."
if not None : print "'not None' evaluates as true."
if not 0 : print "'not 0' evaluates as true."
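The truth value that an if statement would assign to any object can be checked
with the built-in function bool(). The sketch below is not one of the book’s
listings. It goes slightly beyond the values tested above by including empty
strings and empty containers, which Python also treats as false; the list of test
values and the variable names are assumptions made for the illustration, and
print is called with parentheses so that the code also runs under Python 3.

```python
# Values on the first line count as true; those on the second count as false.
values = [1, -1, "N", [0],
          0, None, "", [], {}, ()]

results = []
for v in values:
    results.append(bool(v))

print(results)
```

Note that [0] counts as true: it is a list containing one item, and only an empty
list counts as false.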
Since the variable n has the value of 1 in (74), the if statement will be true, and
the program will print The value of n is 0!. Python defines true as 1 and
false as 0 for the purposes of evaluating an if-statement. It is therefore possible to
simplify (74) by directly evaluating the variable by itself. Since its value is 1, the
if-statement will evaluate as true, and the same message will be printed out.
(75) dataFlowIfEx3.py
pos = "N"
if pos : print "The POS is 'N'."
If the variable being tested had a false value, such as 0, the if-statement would not evaluate as true and nothing would be printed. We can ensure that something is printed in that case by including an else-statement, as in (76), where the variable n has the value 0. (Note that we’ve put the consequent on a separate line from the antecedent. This isn’t necessary, but it improves readability.)
(76) dataFlowIfEx4.py
n = 0
if n :
    print "The value of 'n' is not 0."
else :
    print "The value of 'n' is 0."
9. Data Flow
The term looping refers to control structures that repeatedly execute blocks of code. The two main control structures for looping in Python are while and for, each of which is described below.
9.2.1 while
A while loop is a bit like an if statement, since it also consists of two parts: a condition and a consequence. However, an if statement only executes once, whereas a while loop, as the name suggests, will repeatedly execute the consequence as long as the condition evaluates as true. A simple example of the use of while to print out the characters in a string is provided in (78).
(78)
s = "wind"              # any string will do
i = 0
while i < len(s) :
    print s[i]
    i += 1
(79) dataFlowWhileEx2.py
???
A while loop can be used to examine all of the values in a list, as shown in (80), but the for loop is a more convenient means of doing the same (see §9.2.2).
(80) dataFlowWhileEx3.py
numbers = [1, 2, 3]
i = 0
while i < len(numbers) :
    print numbers[i]
    i += 1
9.2.2 for
A typical task in programming is to examine all of the values in a sequence (list) in order to decide what to do with them. The for loop in Python provides a means of iterating over any type of sequence, such as lists or tuples. Let’s return to the example that we looked at in §8.1, where we created a list of words from the first sentence of Hal Clement’s novel Mission of Gravity. In (81), a for loop is used to iterate over the words in the list and print each one out.
9.2 Looping, Iteration, Cycling etc.: for and while
(81) dataFlowForEx1.py
s = "The wind came across the bay like something living."
words = s.split(" ")
for w in words :
    print w
This control structure is known as a loop because the indented block of code following the for statement is executed as many times as there are elements in the list. In other words, there is a loop that repeats itself. Each item in the list is temporarily stored in the variable immediately following the keyword for.
We can combine a for loop with other control structures, such as an if-statement, to build more complicated control structures, as illustrated in (82), where all of the words in the list are examined and only those that consist of four letters or fewer are printed out.
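A minimal sketch of this combination, assuming the same sentence as in (81), might look as follows. The list short_words is our own addition, used only to collect the result; print is written with parentheses so that the snippet also runs under Python 3.

```python
s = "The wind came across the bay like something living."
words = s.split(" ")

short_words = []                 # hypothetical name for the collected words
for w in words:
    if len(w) <= 4:              # four letters or fewer
        short_words.append(w)
        print(w)
```

Because the if-statement is nested inside the for loop, it is indented one level, and its own consequent is indented one level further.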
It is sometimes useful to loop over the elements in a list using their indices, since it is then possible to manipulate the indices for various purposes. The main trick is to use the function range, which takes two numbers as arguments (a range start and a range end) and walks through the range defined by the numbers. Note that the range end is non-inclusive. In other words, the function will walk through the numbers up to, but not including, the range end. This technique is illustrated for the same list in (83).
(83) dataFlowForEx3.py
s = "The wind came across the bay like something living."
words = s.split(" ")
for i in range(0, len(words)) :
    w = words[i]
    print str(i), w
To see how this technique can be useful, consider the following code, which shows how it is possible to identify the word immediately preceding each word in the sentence using indices. Using this technique, it is possible to extract useful information about word co-occurrence in a corpus. (For an introduction to the statistical analysis of word collocations (bigrams, trigrams, etc.), see Manning and Schütze (1999).)
(84) dataFlowForEx4.py
s = "The wind came across the bay like something living."
words = s.split(" ")
for i in range(0, len(words)) :
    w = words[i]
    prev = words[i-1]
    print str(i), w , "[" + prev + "]"
Note that the last word of the sentence (living.) is given as the word immedi-
ately preceding the first word in the sentence (The). This is because the index for
the first word in the sentence is zero, and subtracting one from it gives negative
one, which, according to the rules for list indices (see §8.1.1), will be the last item
in the list. To prevent this from happening, we can simply use an if-statement to
treat the first item in the list as a special case, as in (85).
(85) dataFlowForEx5.py
s = "The wind came across the bay like something living."
words = s.split(" ")
for i in range(0, len(words)) :
    w = words[i]
    if i == 0 :
        prev = "--"
    else :
        prev = words[i-1]
    print str(i), w , "[" + prev + "]"
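The same index-based technique can be used to count how often pairs of adjacent words occur. The following is only a sketch: the sentence is made up so that one pair repeats, and the dictionary name bigram_counts is our own. Starting the range at 1 is an alternative way of avoiding the special case for the first word.

```python
s = "the wind came and the wind went"   # made-up sentence for illustration
words = s.split(" ")

bigram_counts = {}                       # maps (previous, current) to a count
for i in range(1, len(words)):           # start at 1: the first word has no predecessor
    pair = (words[i-1], words[i])
    bigram_counts[pair] = bigram_counts.get(pair, 0) + 1

print(bigram_counts[("the", "wind")])
```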
9.2.3.1 break
The break statement breaks out of the smallest enclosing for or while loop. For example, imagine that you have run a number of subjects through the lexical-decision task and that you want to find the reaction time of each subject for the first presentation of a particular word, say, room. If each subject has a separate data file and the data is stored as tab-delimited values with one row per presented word (with the presented word in the first column and the reaction time in the second column), then the code in (86) should work to retrieve the desired data for each file it is run on.
(86) dataFlowBreakEx1.py
from string import split

fn = 'tab-delimited_data.txt'           # hardcoded
f = open(fn, 'r')                       # 'r' = read-mode
lineList = f.readlines()                # get list of lines
for l in lineList:                      # loop over lines
    columns = split(l, "\t")            # split line by tabs
    if columns[0] == 'room' :
        print columns[0], columns[1]    # print out the data
        break                           # found it, stop looping
9.2.3.2 continue
The continue statement skips to the next iteration of the loop.
For example, imagine that you are looping over the lines in a tab-delimited
file. Some of the lines in the file have a blank first column and should not be
processed. This can be easily accomplished with continue, as shown in (87).
(87) dataFlowContinueEx1.py
from string import split

fn = 'tab-delimited_data.txt'           # hardcoded
f = open(fn, 'r')                       # 'r' = read-mode
lineList = f.readlines()                # get list of lines
for l in lineList:                      # loop over lines
    columns = split(l, "\t")            # split line by tabs
    if columns[0] == '' :               # first column is blank
        continue                        # continue to next loop
    print columns[0], columns[1]        # print out 1st & 2nd col.
9.3 List Comprehension: Modifying a List In Situ
In (88), a sentence is split into a list of words and a for loop is used to go
through each word in the list. As each word is converted to the proper capital-
ization pattern, it is stuck into another list, which is printed out as a string at the
end.
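A minimal sketch of such a loop, assuming the same sentence and variable names as in (89), is given below. The variable result is our own addition, and print is written with parentheses so that the snippet also runs under Python 3.

```python
s = "The wind came across the bay like something living ."
words = s.split()

titlewords = []                  # the second list that collects the results
for w in words:
    titlewords.append(w.title()) # capitalize the first letter of each word

result = " ".join(titlewords)
print(result)
```

The original list words is left untouched; only the new list contains the capitalized forms.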
This situation is so common that Python provides a convenient device (an idiom, if you will) for handling it, known as list comprehension. The use of this idiom can significantly simplify the process of transforming a list on the fly without making any permanent changes to it, as shown by (89), which is a much more compact version of (88).
(89) dataFlowListComprehensionEx2.py
s = "The wind came across the bay like something living ."
words = s.split()
titlewords = [ w.title() for w in words ]
print " ".join(titlewords)
Note that the code in (89) can be made even more compact, as shown in (90).
(90) dataFlowListComprehensionEx3.py
s = "The wind came across the bay like something living ."
words = " ".join([w.title() for w in s.split()])
The two-line version is more compact but less readable. There is often a trade-off between concision and readability, and balancing the two is a fine art which requires a judgement call. In general, we would recommend erring on the side of readability.
9.4 Exercises
• In (84), each word in a sentence is printed out along with the word imme-
diately preceding it. What changes to the script would be required to obtain
the word immediately following each word in the sentence?
• ???
• ???
Chapter 10
Functions
(91) functionsBasicsEx1.py
def apologize() :                                                   1
    name = "Dave"                                                   2
    print "I'm sorry, " + name + ", I'm afraid I can't do that."    3
apologize()                                                         4
In (91), the function is defined in lines 1 through 3 and called in line 4. (Terminology note: Using a function is known as “calling” or “invoking” it.)
There are a few potential pitfalls to bear in mind when creating a function.
First, the names of functions are subject to the same constraints as the names of
variables (see §6.2.1 for details). Second, the definition of a function must precede
its invocation. In other words, you have to define a function before you can use
it. Otherwise, the Python interpreter will raise an error, as can be seen by running
(92), where the definition and invocation are in the wrong order—i.e., the function
apologize() is called first and then defined.
1
In other programming languages, functions go by different names—e.g., procedures or
subroutines–but the basic idea is the same.
(92) functionsBasicsEx2.py
apologize()
def apologize():
    name = "Dave"
    print "I'm sorry " + name + ", but I'm afraid I can't do that."
When (92) is run, an error message is given by the interpreter, which should
look something like this:
% python functionsBasicsEx2.py
Traceback (most recent call last):
  File "functionsBasicsEx2.py", line 1, in ?
    apologize()
NameError: name 'apologize' is not defined
Functions serve a number of purposes. First and foremost, they provide a way of reusing code. If we want to print our apology from (91) more than once, we can simply call the function apologize() multiple times, as in (93), which calls the function three times.
(93) functionsBasicsEx3.py
def apologize():
    name = "Dave"
    print "I'm sorry, " + name + ", but I'm afraid I can't do that."
apologize()
apologize()
apologize()
10.2 Arguments
In the previous section, we learned how to create a simple function. In (91), the message that is printed out by the function apologize() is fixed, since the addressee can only be “Dave”. But what if HAL wants to apologize to a member of the crew other than Dave? In order to do this, there needs to be some way to custom-tailor the apology to the addressee. One way to do that is to make the variable name accessible throughout the program so that it can be set every time before the function is called, as illustrated in (94).
(94) functionsArgumentsEx0.py
def apologize() :
    print "I'm sorry, " + name + ", but I'm afraid I can't do that."
name = 'Bill'
apologize()
The function in (94) makes use of a global variable, a variable that is defined in one place and potentially available everywhere. The use of global variables in functions has a number of drawbacks. Some of them are fairly obvious. For example, the name of the variable in the function must match the name of the variable outside of the function. Although this is easy to ensure in a short program, it becomes more difficult once a program grows longer and more complicated.
However, there are more subtle reasons why (94) is unsatisfactory. [EXPLAIN
WHY]
Given the drawbacks of global variables, programming languages invariably have a mechanism that allows a variable to be passed to a function as a local variable. A variable that is passed to a function is called an argument. In Python, every function’s definition must specify the number of arguments that it takes. In (95), we define a function that takes a single argument, which is labelled addressee. (Think of arguments as variables. The names of arguments are arbitrary and have the same restrictions as variables; see §6.2.1.)
(95) functionsArgumentsEx1.py
def apologize(addressee) :
    print "I'm sorry, " + addressee + ", but I'm afraid I can't do that."
name = "Bill"
apologize(name)
Although (95) produces the same output as (94), it is a much better solution. Because the addressee of the apology is passed in to the function as an argument, the function can be called using any name. It is possible to call the function multiple times with different names, as illustrated in (96).
(96) functionsArgumentsEx1.5.py
def apologize(firstName) :
    print "I'm sorry, " + firstName + ", but I'm afraid I can't do that."
apologize("Dave")
apologize("Bill")
apologize("Frank")
(97) functionsArgumentsScopeEx1.py
def add_word(d, w) :
    w = w.upper()
    if not d.has_key(w) :
        d[w] = 0
    d[w] = d[w] + 1

wfreq = {}
s = "The wind came across the bay like something living ."
words = s.split(" ")
for w in words :
    add_word(wfreq, w)
print len(wfreq.keys())
The function add_word() adds a word to a dictionary that serves as a word
frequency lookup table. It converts the word to uppercase and then checks whether
the uppercase version of the word is already in the dictionary. If it is not already
there, it is added with a value of 1. If it is already there, then its associated value
is incremented by one. The important point about (97) is that changes made to the
dictionary object inside of the function apply to the dictionary object outside of
the function, as you can confirm for yourself by running (97) and observing that
wfreq has 9 keys.
It is important to bear in mind that changes made to a variable inside of a
function only hold true outside of the function as long as the variable continues to
refer to the same object. If a variable is assigned a new value inside of a function,
then the variable will no longer refer to the object it originally did. Because the
variable outside of the function will continue to refer to the original object, it will
continue to have the same value it did before the function was called. This is an
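issue for immutable objects, which cannot be modified in place. Therefore, in (98), the sentence printed inside the function is capitalized and ends with a period, whereas the sentence printed after the function call is unchanged.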
(98) functionsArgumentsScopeEx2.py
def format_sentence(s) :
    s = s.capitalize()
    s = s + "."
    print s
s = "the wind came across the bay like something living"
format_sentence(s)
print s
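The contrast between changing an object in place and assigning a new value to a variable can be seen side by side in the following sketch. The function names add_item and replace_items are hypothetical and chosen only for this illustration; print is written with parentheses so that the snippet also runs under Python 3.

```python
def add_item(items):
    items.append("new")          # modifies the list object in place

def replace_items(items):
    items = ["new"]              # rebinds the local name only

words = ["old"]
add_item(words)
after_add = list(words)          # copy of the list after the in-place change

replace_items(words)
after_replace = list(words)      # copy of the list after the reassignment

print(after_add)
print(after_replace)
```

The first call changes the list that words refers to, whereas the second call leaves it exactly as it was, because the assignment inside replace_items() only affects the function's own local variable.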
(99) functionsArgumentsEx2.py
def apologize(fname, lname) :
    print "I'm sorry, " + fname + " " + lname + ", but I'm afraid I can't do that."
apologize("David", "Bowman")
with no arguments (even though one is expected).
(100) functionsArgumentsEx3.py
def apologize(name) :
    print "I'm sorry, " + name + ", but I'm afraid I can't do that."
apologize()
The possibility of default arguments provides a way of invoking the same function with a different number of arguments, as in (102), where the apologize() function is called in three different ways. The first call is made without an argument, in which case the addressee defaults to ‘Dave’. The second call passes the default name ‘Dave’ to the function, which doesn’t make much sense but is nevertheless possible. The third call passes a different name to the function, which is then the addressee in the message.
(102) functionsDefaultArgsEx2.py
def apologize(name='Dave'):
    print "I'm sorry, " + name + ", but I'm afraid I can't do that."
apologize()
apologize("Dave")
apologize("David")
In some cases, you may want to have multiple arguments, in which case the or-
dering of arguments is significant. For example, let’s say that we want the function
apologize() to be flexible in terms of the addressee as well as the message,
so that the function could print out something like “I’m sorry, Dave, but this conversation can serve no further purpose.” The function would then look something
like (103). (The punctuation of the message is hard-coded in the function. Be-
cause this is an apology, a question mark is ruled out. And since it’s a sentient
computer talking, we’ll assume that an exclamation mark would be uncharacter-
istically emotional. After all, this is the same computer that locked Dave out of
the ship to die in space and then, when Dave tried to talk him out of killing him,
ended the conversation by saying calmly, “Dave, this conversation can serve no
purpose anymore. Goodbye.”)
(103) functionsDefaultArgsEx3.py
def apologize(name, msg):
    print "I'm sorry, " + name + ", but " + msg + "."
apologize("Dave", "this conversation serves no further purpose")
Now, if we want the argument name to have a default (e.g., ‘Dave’) but require the argument msg to have a value, we face a problem. Default arguments cannot be followed by non-default arguments. Therefore, the code in (104) will raise an error, as can be seen by running it.
(104) functionsDefaultArgsEx4.py
def apologize(name='Dave', msg):
    print "I'm sorry, " + name + ", but " + msg + "."
apologize("Dave", "this conversation serves no further purpose")
The solution is to ensure that the non-default arguments precede the default arguments, as in (105).
(105) functionsDefaultArgsEx5.py
def apologize(msg, name='Dave'):
    print "I'm sorry, " + name + ", but " + msg + "."

apologize("this conversation serves no further purpose")
(106) functionsKeywordArgEx1.py
def formatDate(m, d, y) :
    print "%s-%s-%s" % (y, m, d)
formatDate(10, 28, 1973)
While it is possible to pass the arguments in the usual fashion, it is also pos-
sible to specify them using keywords, in which case the ordering is irrelevant.
This is illustrated in (107), where a function is called using keyword-labelled
arguments. Although the function is called with three different orderings, the
printed-out result is the same in each case, as can be seen by running the script.
(107) functionsKeywordArgEx2.py
def formatDate(m, d, y) :
    print "%s-%s-%s" % (y, m, d)
formatDate(y=1973, d=28, m=10)
formatDate(m=10, d=28, y=1973)
formatDate(d=28, m=10, y=1973)
A function that returns a value can also be embedded directly into the same context as the data type that it returns (rather than saved first as a variable), as illustrated in (109).
(109) functionsReturnValuesEx1.py
def apologize() :
    name = "Dave"
    return "I'm sorry, " + name + ", but I'm afraid I can't do that."
print apologize()
It is possible to return from a function midway through its execution. For example, (110) defines a function that takes two numbers as arguments. The first is divided by the second, unless the second argument is equal to 0, in which case the function returns immediately with the value None (instead of throwing a ZeroDivisionError).
10.4 Functions in Action
(110) functionsReturnValuesEx2.py
def divide(i, j) :
    if j == 0 :
        return None
    else :
        return i / j
print divide(5, 0)
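A caller of such a function will usually want to check for the None value before using the result. The following sketch shows one way of doing so; the variable names result and message are our own, and the test uses is None rather than a plain truth test so that a legitimate result of 0 is not mistaken for a failure.

```python
def divide(i, j):
    if j == 0:
        return None              # early return instead of an error
    return i / j

result = divide(5, 0)
if result is None:               # distinguishes None from a result of 0
    message = "Cannot divide by zero."
else:
    message = "The result is " + str(result)
print(message)
```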
Functions that return a value can also be used directly as the arguments of
other functions, as in (111).
(111) functionsReturnValuesEx3.py
def multiply(i, j) :
    return i * j
print multiply(multiply(3, 3), 3)
The code in (111) is equivalent to that in (112), where the value returned from
the first invocation of the function multiply() is stored as a variable and then
passed to its second invocation.
(112) functionsReturnValuesEx4.py
def multiply(i, j) :
    return i * j
p1 = multiply(3, 3)
p2 = multiply(p1, 3)
print p2
        else :                              # otherwise
            d[t] = 1                        # initialize it to 1
    return d

def print_results(d) :
    tokens = d.keys()                       # get list of unique tokens
    tokens.sort()                           # sort list of tokens
    for t in tokens :                       # go through each token
        count = d[t]                        # get counter for a given token
        parts = t.split("/")                # split it up into wordform and POS
        wordform = parts[0]                 # first part is wordform
        POS = parts[1]                      # second part is the POS
        print wordform , POS , str(count)   # print it to standard output

text = get_text()
tl = create_tokenlist(text)
print_results(tl)
The program’s organization can be described as a “daisy chain” since the out-
put of each function is the input to the next. This is shown in Table 10.1, where
we see that each function gets its input from the output of the previous function.
The only exception is the first function, get_text(), which simply returns the
global variable text, which contains the corpus excerpt to be analyzed by the
program.
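The daisy-chain organization can be sketched in miniature as follows. This is not the program in (113): the function bodies are simplified stand-ins, the tagged text is made up, and format_results() is a hypothetical variant of print_results() that returns its lines instead of printing them.

```python
# Hypothetical stand-ins for the functions discussed above;
# the tagged text below is made up for illustration.
def get_text():
    return "the/DET wind/N came/V the/DET"

def create_tokenlist(text):
    d = {}
    for t in text.split():
        d[t] = d.get(t, 0) + 1   # count each word/POS token
    return d

def format_results(d):
    lines = []
    for t in sorted(d.keys()):
        wordform, pos = t.split("/")
        lines.append(wordform + " " + pos + " " + str(d[t]))
    return lines

# Each call consumes the value returned by the previous one.
results = format_results(create_tokenlist(get_text()))
for line in results:
    print(line)
```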
10.5 Exercises
• ???
• There are many different ways to break a program down into functions. What are some alternative ways of organizing (113)?
Chapter 11
Errors and Exceptions
An exception is an error condition that changes the normal flow of control in a
program. That is just a rather technical way of saying that exceptions are what
happen when something goes wrong while a program is being run. Python has
very good error handling capabilities which can make it much easier to write pro-
grams that deal gracefully with the errors that arise when a program is run. In this
chapter, we will look at how exceptions work in Python and show how they can
be used to deal with error handling effectively.
Perhaps the best way to understand the different approaches to error handling
is to concentrate on a particular task that frequently causes problems during the
execution of a program, which is file input and output (see chapter 12). Various
things can go wrong when opening a file or writing to it: the file may not exist,
it may not be readable due to restrictive access privileges on the filesystem or
because it is corrupt, there may not be enough memory to write the new contents,
etc. A well-written program must be able to handle all of these contingencies.
To see how such contingencies would be handled in Python, let’s have a look
at a simple script that takes a filename from the command-line, opens it, and prints
out its contents, as in (114).
(114) exceptionsBackgroundEx1.py
import sys
try :
    filename = sys.argv[1]
    fo = open(filename, 'r')
    print fo.read()
    fo.close()
except :
    print "Couldn't open specified file."
11.1 Handling Exceptions
occurs when the code inside of the try-block is run, the except block will be run
and the error message will be printed out.
There is a problem with (114), however, which is that it does not provide a way of teasing apart the various things that can go wrong in the try block. The behavior of the program is the same regardless of whether the user fails to provide a filename or the user provides the name of a non-existent file. To tease apart different error conditions, multiple except blocks can be used, as illustrated in (115), where there are two except blocks, one for catching the error that results if the user does not provide a filename and another for catching the error that results if the file specified does not exist or cannot be opened because of access privileges.
(115) exceptionsBackgroundEx2.py
import sys
try :
    filename = sys.argv[1]
    fo = open(filename, 'r')
    print fo.read()
    fo.close()
except IndexError :
    print "Usage: %s <FILE>" % sys.argv[0]
except IOError :
    print "Could not open '%s'!" % filename
types, but we will ignore this possibility for the present.) If the except clause does
not in turn raise an exception (a distinct possibility that should not be forgotten
but hopefully won’t occur), then the code following the except-clause is run, and
the program continues to run its merry way. If an exception occurs which does not
match the exception named in the except clause, it is passed on to outer try state-
ments; if no handler is found, it is an unhandled exception and execution stops
with a message as shown above.
This process is summarized as a flow diagram:
[Figure: Error Handling. Is an exception raised in the try block? If not, ignore the except block. If so, does the exception raised match one of the except clauses? If it does, execute the except block; if it does not, halt the program.]
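The way an unmatched exception is passed on to an outer try statement can be seen in the following sketch. The function lookup() and the list log are hypothetical names used only for this illustration, and the except KeyError clause plays the role of the outer handler; print is written with parentheses so that the snippet also runs under Python 3.

```python
log = []                                 # records which handler ran

def lookup(d, key):
    try:
        return d[key]                    # raises KeyError if key is absent
    except IndexError:                   # does not match a KeyError
        log.append("inner handler")

try:
    lookup({"N": 1}, "V")
except KeyError:                         # the outer try statement catches it
    log.append("outer handler")

print(log)
```

Only the outer handler runs: the inner except clause names a different exception type, so the KeyError passes straight through it.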
(117) exceptionsHandlingEx3.py
11.2 Exception Types
In order to understand fully the various types of exceptions and how they relate
to one another, some acquaintance with object-oriented programming is required,
since exceptions are objects and are therefore associated with a particular class.
Without going into too much detail (see §15.3.3 and §16.5 for more information
about how inheritance works), exceptions are organized hierarchically, such that
every error type in the hierarchy is a subtype of the one above it, as shown diagra-
matically in Figure 11.1.
Figure 11.1
Exception
    SystemExit
    StandardError
        SyntaxError
        LookupError
            IndexError
            KeyError
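One practical consequence of the hierarchy is that an except clause naming a type also catches all of its subtypes. The sketch below illustrates this with LookupError, the parent of both IndexError and KeyError; the list caught is our own device for recording what happened, and print is written with parentheses so that the snippet also runs under Python 3.

```python
caught = []                              # records which errors were handled

try:
    [1, 2, 3][10]                        # raises IndexError
except LookupError:                      # parent type catches it
    caught.append("IndexError")

try:
    {"N": 1}["V"]                        # raises KeyError
except LookupError:                      # the same clause catches this too
    caught.append("KeyError")

print(caught)
```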
(119) exceptionsIndentationErrorEx1.py
n = 1
if n == 1 :
print n
Figure 11.2
1. The first line simply states that the error is being traced back to its source.
Since an exception in one part of a program may lead to cascading errors,
it is necessary to provide the full traceback, starting with its ultimate cause
and finishing with its immediate cause.
3. The third line provides the exception type and its name (which often amounts to a short description).
15 and Chapter ??, particularly §15.3.3 and §16.5 on inheritance.
The basics of custom exceptions are straightforward. Any custom class that extends the class Exception is a custom exception. The simplest possible exception would be the one in (121).
(121) exceptionsCustomExceptionEx1.py
class CustomException(Exception) :
    pass
In order to obtain a custom error message, it is necessary for the custom error class to override the __str__ method, as in (122), which creates a new type of exception, CustomException, whose error message is simply “Error message for CustomException”.
(122) exceptionsCustomExceptionEx2.py
class CustomException(Exception) :
    def __str__(self) :
        return "Error message for CustomException"

raise CustomException()
To see how custom exceptions work, and how they can be useful when writ-
ing your own programs, we will create a small program that parses a line from a
Shoebox file into its individual tiers. For now, we will take for granted the abil-
ity to extract lines from a Shoebox file and simply concentrate on how a line is
processed once its contents are stored as a string.
(123) exceptionsCustomExceptionEx3.py
from string import split

class MissingTierError(Exception) :
    pass

def parseLine(line) :
    tiers = {}
    requiredTiers = ['t', 'm', 'g', 'f']
    lines = line.split("\n")
    for l in lines :
        l = l.lstrip("\\")
        (marker, data) = split(l, sep=" ", maxsplit=1)
        tiers[marker] = data
    for t in requiredTiers :
        if not tiers.has_key(t) :
            s = "The tier \\%s is missing!" % t
            raise MissingTierError, s

# Nothing wrong with this line
shbx = '''\\t Waar is de fietsenstalling?
\\m waar is de fiets-en-stalling
\\g where is the bike-PL-shed
\\f Where is the bike shed?'''
parseLine(shbx)

# The tier \g is missing
shbx = '''\\t Waar is de fietsenstalling?
\\m waar is de fiets-en-stalling
\\f Where is the bike shed?'''
parseLine(shbx)
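A custom exception is most useful when the calling code catches it and reacts to it. The following sketch shows this pattern in isolation. The function check_tiers() is a hypothetical, simplified stand-in for the tier check in parseLine(), and the raise and except syntax used here is the form that also runs under Python 3.

```python
class MissingTierError(Exception):
    pass

def check_tiers(tiers):
    # 'tiers' is assumed to be a dictionary keyed by tier marker.
    for t in ['t', 'm', 'g', 'f']:
        if t not in tiers:
            raise MissingTierError("The tier \\%s is missing!" % t)

try:
    check_tiers({'t': 'Waar is de fietsenstalling?',
                 'f': 'Where is the bike shed?'})
    outcome = "ok"
except MissingTierError as e:            # only this error type is handled
    outcome = str(e)

print(outcome)
```

Because the handler names MissingTierError specifically, any other kind of error raised while checking the tiers would still surface as usual.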
        filename = sys.argv[1]
    except IndexError :
        print "Usage: %s <FILE>" % sys.argv[0]
        sys.exit(0)

    # Read file contents
    try :
        fo = open(filename, 'r')
        fc = fo.read()
        fo.close()
    except IOError :
        print "Could not open '%s'!" % filename
        sys.exit(0)
    return fc

fileContents = get_file_contents()
wordlist = create_wordlist(fileContents)
print_results(wordlist)
11.4 Exercises
• ???
• ???
• ???
Chapter 12
Input/Output
The topic of program input and output (usually referred to by the acronym IO) concerns how Python interacts with the file system. In other words, it is the issue of how files and directories are opened, read, written to, deleted, etc. The way files are handled differs a fair bit across operating systems (Windows, Macintosh, Linux, Unix, etc.), which means we will also have to cover some of the tools available for handling differences between operating systems. In Python, the majority of the work as far as input/output is concerned is handled by the module os, which is documented in detail at http://docs.python.org/lib/module-os.html.
12.1 Files
When dealing with files, there are two main issues: where files are located and what they contain. File locations are discussed in §12.1.1 and reading and writing file contents are discussed in §12.1.2.
(8) says that the text file dictionary.txt is in the folder My Documents,
which is in the folder user, which is in the folder Documents and Settings
on the C drive. Note that backslashes are used to separate folders in Windows file
paths. This is not true for all operating systems. Unix, for example, uses forward
slashes to separate folders (directories) from one another. A common problem with file manipulation is that there are many differences between operating systems (Windows, Macintosh, Unix, etc.) in this respect. Fortunately, Python provides the tools to write portable code, that is, code that will continue to work even if it is run on different operating systems.
Before we introduce ways of creating file paths that are portable (i.e., valid for
all operating systems), it is worth looking at a negative example which will show
the wrong way of creating a file path.
(125) ioFilePathHardcoded.py
import os, sys
dir = sys.argv[1]
file = sys.argv[2]
print dir + "/" + file
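For contrast, a portable path can be built with the standard library function os.path.join, which inserts whatever separator the current operating system uses. The sketch below is only illustrative: the directory and file names are made up, and print is written with parentheses so that the snippet also runs under Python 3.

```python
import os

directory = "corpus"             # hypothetical directory name
filename = "dictionary.txt"      # hypothetical file name

# os.path.join inserts the separator that is correct for the current system.
path = os.path.join(directory, filename)
print(path)
```

On Unix this prints corpus/dictionary.txt, whereas on Windows the two parts are joined with a backslash.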
In addition to the basic read modes, there are additional flags that control the
way in which files are handled by Python. One of the most useful of these is the read mode ’U’ (short for ’Universal’), which ensures that line endings are read consistently, despite the differing conventions across operating systems. Unix uses the linefeed character (LF) whereas Macintosh uses the carriage return character (CR). Windows uses both (CR LF). The ’U’ flag can only be used in read mode; that is, ’rU’ is legitimate, whereas ’wU’ or ’aU’ is not.
12.1.2.1 Read
Being able to read the contents of files is absolutely critical for many programming tasks. Here we will look at how this is done. We will first look at how a file can be read line by line and then at how the entire contents of a file can be read at once.
Reading Files Line By Line One way to read a file is line by line. There are a few different ways to do this. One way is to use the built-in function open() to create a file object and to then call the readlines() method of the file object. The readlines() method returns a list of lines in the file. In (127), we obtain a filepath from the command-line, open the file, save its content as a list of lines, and then close the file.
(127) ioReadFileLineByLineEx1.py
import sys                      # Import required module
filename = sys.argv[1]          # Get filename from command-line
fo = open(filename, 'rU')       # Open file for reading
lines = fo.readlines()          # Read lines into list
fo.close()                      # Close file
One advantage of reading a file line by line is that at any point the reading of the file can be terminated and the file closed. This is particularly useful when dealing with extremely long files, where reading the entire file is not necessary, too time-consuming, or requires more memory than is available to the system. In (128), a file is read until a blank line is found, at which point file reading terminates and the file is closed. This can be a very useful technique for saving time and memory resources when dealing with very large files.
(128) ioReadFileLineByLineEx2.py
import sys                      # Import required module
filename = sys.argv[1]          # Get filename from command-line
fo = open(filename, 'rU')       # Open file for reading
while True :                    # Loop until told to stop
    l = fo.readline()           # Read the next line only
    if l.strip() == '' : break  # Stop at a blank line (or end of file)
    print l                     # Print line
fo.close()                      # Close file
There is another way of reading a file line by line which takes advantage of the fact that a file object can be iterated over like a list, as illustrated in (129). This is probably the best way of reading a file line by line, since it is simple and requires fewer lines of code than the alternatives.
(129) ioReadFileLineByLineEx3.py
import sys                      # import required module
filename = sys.argv[1]          # get filename from command-line
fo = open(filename, 'r')        # open file for reading
n = 0                           # counter for line number
for l in fo :                   # loop over lines
    n += 1                      # increment counter
    print "%i:%s" % (n, l)      # print line with line number
fo.close()                      # close file
AF
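The loop over a file object can also be sketched with a with block, which closes the file automatically, and the built-in enumerate(), which supplies the line counter. This is only a sketch in Python 3 syntax (the listings in this chapter use Python 2 print statements); the helper name number_lines and the temporary sample file are assumptions made so that the example is self-contained.

```python
import os
import tempfile

def number_lines(path):
    """Return the lines of a file, each prefixed with its line number."""
    numbered = []
    with open(path, 'r') as fo:              # file is closed when the block ends
        for n, line in enumerate(fo, 1):     # count lines starting from 1
            numbered.append("%i:%s" % (n, line.rstrip("\n")))
    return numbered

# Build a small throwaway file to read (hypothetical sample data)
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, 'w') as out:
    out.write("first line\nsecond line\n")

result = number_lines(sample_path)
os.remove(sample_path)                       # clean up the temporary file
for item in result:
    print(item)
```

Stripping the trailing newline before formatting avoids the blank line that print would otherwise add after every line of the file.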
Reading Files All at Once (Slurping) In many cases it is necessary to read the
entire contents of a file before they can be processed. This is particularly the case
when patterns that span multiple lines need to be extracted from the file. This
can be done by using the method read on a file object, as shown in (130), which
takes a filename from the command-line and slurps its contents into a string before
printing them out again.
(130) ioSlurpingEx1.py
import sys                      # Import required module
filename = sys.argv[1]          # Get filename from command-line
fo = open(filename, 'r')        # Open file for reading
contents = fo.read()            # Read file contents as string
fo.close()                      # Close file
Because the entire file is represented as a single string, patterns that span
multiple lines can be detected and manipulated, as shown in (131), which takes a file
with hard line breaks (that is, a file where a newline has been inserted at the end
of every line of text) and removes the line breaks. Double newlines, which separate
paragraphs, are reduced to single newlines. The overall effect is to take a file with
hard line breaks and turn it into one where each paragraph is a single line in the
file. This would be much harder to accomplish if the file had been read line by line.
(131) ioSlurpingEx2.py
import sys                                              # Import required module
filename = sys.argv[1]                                  # Get filename from command-line
fo = open(filename, 'r')                                # Open file for reading
contents = fo.read()                                    # Read file contents as string
contents = contents.replace("\n\n", "<Placeholder>")    # Use placeholder for double return
contents = contents.replace("\n", " ")                  # Replace hard line breaks with spaces
contents = contents.replace("<Placeholder>", "\n")      # Turn placeholder into return
fo.close()                                              # Close file
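The placeholder technique can be tried out on a string directly, without a file. The following is a minimal sketch in Python 3 syntax; the function name unwrap and the sample text are illustrative assumptions, and the sketch assumes that the placeholder string never occurs in the text itself. Because strings are immutable, the result of each replace() call has to be assigned or returned.

```python
def unwrap(text):
    """Join hard-wrapped lines so that each paragraph becomes one line."""
    marker = "<Placeholder>"                  # assumed not to occur in the text
    text = text.replace("\n\n", marker)       # protect paragraph boundaries
    text = text.replace("\n", " ")            # join wrapped lines with a space
    return text.replace(marker, "\n")         # restore one break per paragraph

sample = "The wind came\nacross the bay.\n\nIt was cold\nthat night."
unwrapped = unwrap(sample)
print(unwrapped)
```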
Even if the entire contents of a file have been saved as a string using read(),
it is still possible to process the file line by line. This can be done by splitting
up the string at line breaks using the string method split(), as shown in (132). One
advantage of using this method to read a file line by line is that line breaks are
automatically removed in the process.
(132) ioSlurpingEx3.py
import sys                          # import required module
filename = sys.argv[1]              # get filename from command-line
fo = open(filename, 'rU')           # open file for reading
contents = fo.read()                # read file contents as string
lines = contents.split("\n")        # split string by newlines
fo.close()                          # close file
for l in lines : print l            # print each line
12.1.2.2 Write
???
12.1.2.3 Append
???
There are a few problems with (133), the most glaring of which is that if
there is a pre-existing file in the same location with the same name, it will be
overwritten, and its contents lost. To check for a pre-existing file before creating
a new file, we can XXX, as shown in (134).
(134) ioCreateFileEx2.py
import sys

# Get filename from command-line
filename = sys.argv[1]

# Check for pre-existing file
# if XXX :
#     print "The file '" + filename + "' already exists!"
# else :
#     fo = open(filename, 'w')    # Open file for writing
#     fo.close()                  # Close
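One candidate for the missing check is os.path.exists() from the standard library. This is an assumption about how the placeholder might be filled, not the author's stated method. The sketch below uses Python 3 syntax; the helper name create_file and the temporary directory are hypothetical and exist only to keep the example self-contained.

```python
import os
import tempfile

def create_file(path):
    """Create an empty file unless something already exists at path."""
    if os.path.exists(path):          # refuse to overwrite existing content
        return False
    fo = open(path, 'w')              # open file for writing (creates it)
    fo.close()
    return True

workdir = tempfile.mkdtemp()          # throwaway directory for the demo
target = os.path.join(workdir, "new.txt")
first_try = create_file(target)       # nothing there yet, so the file is created
second_try = create_file(target)      # file now exists, so it is left alone
print(first_try, second_try)

os.remove(target)                     # clean up
os.rmdir(workdir)
```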
with a different name before creating a new one, as shown in (135).
(135) ioCreateFileEx3.py
import sys

# Get filename from command-line
filename = sys.argv[1]

# Open file for writing
fo = open(filename, 'w')
fo.close()
try :
12.1.4 Deleting Files
A word of caution is in order. Before we show you how to delete files using
Python, we should remind you that we are essentially handing you a loaded
weapon. Computer programs that delete files are dangerous. It is always a good
idea to back up any files or directories before you begin experimenting with this
aspect of Python’s IO capabilities.
Files can be easily deleted using the remove function of the os module.
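As a cautious illustration, the sketch below (Python 3 syntax) deletes only a temporary file that it has just created itself, so no existing data is at risk. The variable names are illustrative. os.remove() raises OSError if the file cannot be deleted, which is why the call is wrapped in try and except.

```python
import os
import tempfile

fd, doomed = tempfile.mkstemp()       # create a throwaway file
os.close(fd)
existed_before = os.path.exists(doomed)

try:
    os.remove(doomed)                 # delete the file
except OSError as e:
    print("Error: could not delete '%s': %s" % (doomed, e))

existed_after = os.path.exists(doomed)
print(existed_before, existed_after)
```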
Although plain text files are the bread and butter of language research, it is impor-
tant to recognize that there are other file types in existence, some of which need to
be dealt with differently. The two fundamental file types are text files and binary
files.
TODO
12.2 Directories
At a higher level of organization, files are organized by directories (sometimes
called “folders”). Python provides the means for manipulating directories as well
as files. In this section, we will cover how directories are located, created, edited,
and deleted using the os module.
(138) osLocatingDirsEx1.py
import os

os.path.join()
12.2.2 Reading Directories
Directories contain other directories (subdirectories) or files, and reading
directories consists of gaining access to these contents. In Python, directories are
read using the function listdir in the os module. To see how it works, consider
(139), a short script that takes a directory path as a command-line argument and
prints out a sorted list of the files and directories within it.
(139) ioReadDirEx1.py
import os, sys
try :
    dirPath = sys.argv[1]                   # Get dir from command-line
    contents = os.listdir(dirPath)          # Get list of contents
    contents.sort()                         # Sort the contents

    # Print header
    print "-" * 30 , "-" * 30
    print "File Name".ljust(30) , "File Type".ljust(30)
    print "-" * 30 , "-" * 30

    # Print contents
    for c in contents :
        filepath = os.path.join(dirPath, c)
        if os.path.isdir(filepath) :
            print c.ljust(30) , "D"
        else :
            print c.ljust(30) , "F"
except IndexError :
    print "Usage: %s" % (sys.argv[0])
except OSError :
    print "Error: Couldn't open or read dir '%s'!" % (dirPath)
except :
    print "Unknown Error"
If the directory cannot be created (because the directory already exists, insuf-
ficient privileges exist, etc.), an error message is printed out instead.
12.2.4 Editing Directories
Directories can be renamed using the function rename from the os module.
(Unlike many other functions that operate on directories, rename also works on
files.) (141) provides a short script that takes two arguments, an old directory
name and a new directory name, and changes the former to the latter.
(141) ioEditDirEx1.py
import os, sys
try :
    old = sys.argv[1]
    new = sys.argv[2]
    os.rename(old, new)
except IndexError :
    print "Usage: %s <OldName> <NewName>" % (sys.argv[0])
except OSError, e :
    print "Error: Couldn't rename dir '%s' as '%s'!" % (old, new)
    print "Details: " + str(e)
Directories can easily be deleted using the rmdir function of the os module.
(142) provides a simple script that deletes a directory after prompting the user to
confirm the action.
(142) ioDeleteDirEx1.py
import os, sys

# Get a directory name from the command-line
dirPath = sys.argv[1]

# Delete directory or print error message
try :
    prompt = "Are you sure you want to remove the directory '%s' [y/n]? " % (dirPath)
    response = raw_input(prompt)    # Prompt user for yes-no response
    if response.lower() == 'y':
        os.rmdir(dirPath)
except OSError, e :
    print "Error: Could not delete directory '%s'!" % (dirPath)
    print "Details: " + str(e)
(143) provides a simple example of how this works. It provides a script that
takes a directory as a command-line argument and calls os.path.walk on that
directory, using a custom function, called processFile, to process every
directory traversed. The function is very simple. It goes through the list of files
in the directory and joins the name of the directory with the name of the file
using os.path.join, which ensures that the code will work across platforms
(on Unix, Windows, etc.). The resulting path is then printed out, preceded by an
asterisk (the argument passed to os.path.walk in the first place).
(143) ioWalkEx1.py
import os
import sys

def processFile(arg, dir, files) :
    for f in files :
        path = os.path.join(dir, f)
        print "%s %s" % (arg, path)

try :
    dir = sys.argv[1]
    os.path.walk(dir, processFile, '*')
except IndexError :
    print "Usage: %s <DIR>" % sys.argv[0]
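The function os.path.walk exists only in Python 2; it was removed in Python 3, where the generator os.walk covers the same ground without a callback function. The sketch below (Python 3 syntax) mirrors the traversal with os.walk. The helper name list_paths and the throwaway directory tree are assumptions made for illustration.

```python
import os
import tempfile

def list_paths(top, marker='*'):
    """Return 'marker path' strings for every file and directory under top."""
    found = []
    for dirpath, dirnames, filenames in os.walk(top):
        for name in sorted(dirnames + filenames):
            found.append("%s %s" % (marker, os.path.join(dirpath, name)))
    return found

# Throwaway tree: <top>/sub/a.txt
top = tempfile.mkdtemp()
os.mkdir(os.path.join(top, "sub"))
open(os.path.join(top, "sub", "a.txt"), 'w').close()

paths = list_paths(top)
for p in paths:
    print(p)

# Clean up the throwaway tree
os.remove(os.path.join(top, "sub", "a.txt"))
os.rmdir(os.path.join(top, "sub"))
os.rmdir(top)
```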
12.3 The Command Line
12.3.1 Arguments
Command-line arguments to a script are stored in a list accessible through the
module sys. The code in (144) prints out each of the arguments to the script on a
separate line. This list is never empty, since its first item is always the name of
the script being run. Therefore, if the script is run without any arguments, a single
line with the name of the script is printed out.
(144) ioCommandLineArgEx1.py
import sys

args = sys.argv
for a in args :
    print a
Note that the filename is the full filepath supplied to the Python interpreter and
therefore may be more than just the name of the script itself. To obtain only the
name of the script, it may be necessary to strip off the path information preceding
it. This can be done by XXX, as illustrated in (145).
(145) ioCommandLineArgEx2.py
import sys

print "The name of this script is '" + sys.argv[0] + "'."

args = sys.argv[1:]

if args :
    i = 0
    for a in args :
        i = i + 1
        print i, a
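One standard-library candidate for stripping the path information is os.path.basename(). Whether this is the technique the text has in mind is an assumption. The sketch below (Python 3 syntax) applies it to a made-up path rather than to sys.argv[0], so that the result does not depend on how the script is invoked; the helper name script_name is hypothetical.

```python
import os.path

def script_name(argv0):
    """Return only the final component of a script path."""
    return os.path.basename(argv0)

# A made-up path built with the separator of the current platform
full_path = os.path.join("scripts", "tools", "count.py")
name = script_name(full_path)
print(name)
```

In a real script the call would be script_name(sys.argv[0]).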
Individual command-line arguments in the argv list are accessed in the normal
manner, by their index, as illustrated in (146).
(146) ioCommandLineArgEx3.py
import sys

firstArg = sys.argv[0]

print "The first item in sys.argv is '" + firstArg + "'."
12.3.2 Options
It is common for Python scripts to have options that change their behavior in
systematic ways. These command-line options (sometimes called flags) are like those
of UNIX commands. For example, the common UNIX command grep provides
a tool for searching files using regular expressions (see chapter 17). Its syntax is
grep <regular expression> <file(s)>, as illustrated below:
% grep 'the' file.txt
(147) ioOptionsEx1.py
import getopt, sys

def usage() :
    print "Usage: " + sys.argv[0] + " [OPTIONS] <FILE>"    # Usage info
    sys.exit(2)

def main() :
    optVerbose = False      # Variable for option value

    # -----------------------------------------------------
    # getopt() takes 3 arguments:
    # 1. argument list to be parsed
    # 2. string with short option letters
    # 3. list of long option names
    # -----------------------------------------------------
    try:
        opts, args = getopt.getopt(sys.argv[1:], "hv", ["help"])
    except getopt.GetoptError:
        usage()

    # Loop over a list of tuples, where o is an option's name
    # and a is its value
    for o, a in opts :
        if o == "-v" :                  # Option -v
            optVerbose = True
        if o in ("-h", "--help") :      # Option -h or --help
            usage()
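To see what getopt.getopt() returns without running a script from the command line, the parsing step can be applied to a hand-written argument list. The following is a minimal sketch in Python 3 syntax; the function name parse_options and the sample arguments are assumptions made for illustration.

```python
import getopt

def parse_options(argv):
    """Return (verbose, help_requested, remaining_args) for an argument list."""
    # Short options -h and -v, plus the long option --help
    opts, args = getopt.getopt(argv, "hv", ["help"])
    verbose = False
    wants_help = False
    for o, a in opts:                 # o is the option name, a its value
        if o == "-v":
            verbose = True
        elif o in ("-h", "--help"):
            wants_help = True
    return verbose, wants_help, args

parsed = parse_options(["-v", "corpus.txt"])
print(parsed)
```

Anything that is not an option, such as the filename here, ends up in the second value returned by getopt.getopt(). An unrecognized option raises getopt.GetoptError.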
12.4 IO in Action
12.4.1 Reading Corpus from File
In the previous chapters we have gradually refined our first program. In the
previous examples, the text analyzed in the program was directly entered into it by
hand. But this is obviously unsatisfactory. Here we will show how to obtain text
from the contents of a file.
(148) ioFirstProgramFromFile.py
import sys

def get_file_contents() :
    fo = open(sys.argv[1], 'rU')
    contents = fo.read()
    fo.close()
    return contents

def create_tokenlist(contents) :
    tokens = contents.split()       # Split up contents into tokens
    d = {}                          # create a dictionary
    for t in tokens :               # go through each token
        if d.has_key(t) :           # see if it's in the dict.
            d[t] = d[t] + 1         # if so, add 1 to its counter
        else :                      # otherwise
            d[t] = 1                # initialize it to 1
    return d

def print_results(d) :
    tokens = d.keys()               # get list of unique tokens
    tokens.sort()                   # sort list of tokens
    for t in tokens :               # go through each token
        count = d[t]                # get counter for a given token
        parts = t.split("/")        # split it up into wordform and POS
        wordform = parts[0]         # first part is wordform
        POS = parts[1]              # second part is the POS
        print wordform , POS , str(count)   # print it to standard output

text = get_file_contents()
tl = create_tokenlist(text)
print_results(tl)
12.5 Exercises
• ???
• ???
• ???
Chapter 13
Modules and Packages
As programs get longer and more complicated, it becomes less convenient to keep
all of the code in a single file. It is useful in such cases to be able to break up
the code and store the various parts of it in separate files. This sort of approach
also improves the reusability of code, since commonly used code can be shared
between programs. Code that is stored in a file and shared between programs is
known as a module. A module is basically nothing more than a file containing
Python definitions and statements. In this section, we will learn the ins-and-outs
of Python modules.
13.1 Modules
13.1.1 Defining Modules
The name of a module is its filename minus the suffix .py. Therefore, the code
for a module named LingUtils (short for “Linguistic Utilities”) will be found in a
file named LingUtils.py, as in (150).
(150) LingUtils.py
def isEmpty(str) :
    if str == '' or str == ' ' :
        return 1
    else :
        return 0

def sort(str) :
    chars = []
    for s in str :
        chars.append(s)
    chars.sort()
    return chars
porting it, as illustrated in (151), which uses the function isEmpty(str) from
LingUtils to test whether the lines in a given file are empty and to print them
out if they are not.
(151) modulesEx1.py
import sys
import LingUtils

def usage() :
    print "Usage: XXX"

def main() :
    try :
        fn = sys.argv[1]                    # First argument is the filename
        fo = open(fn, 'r')                  # Open file for reading
        for l in fo.readlines() :
            # Drop the trailing newline before testing for emptiness
            if not LingUtils.isEmpty(l.rstrip("\n")) :
                print l
        fo.close()
    except (IndexError, IOError) :
        usage()

main()
(153) modulesImportErrorEx2.py
import LinguisticUtilities

print "This won't work!"
namespace conflicts. [EXPLAIN WHAT A NAMESPACE CONFLICT IS] Since
the full contents of a module are typically unknown to the user, it is dangerous to
import its entire namespace, since its names may conflict with names that are
already defined.
To see all of the names in a module’s namespace, the function XXX can be
called on the module, as in (154).
(154) modulesImportEx1.py
# TODO
The names of some variables and functions are not automatically imported
with the star notation. Any variables or functions within a module whose names
begin with an underscore are not automatically imported. They can, however, be
imported by name, as described in §13.1.2.3.
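The underscore rule can be checked without writing a module file. The sketch below (Python 3 syntax) builds a stand-in module in memory; the module name demo_mod and its two attributes are hypothetical, and the sketch assumes the module defines no __all__ list, which would otherwise control what the star notation imports.

```python
import sys
import types

# Build a tiny stand-in module in memory (hypothetical name demo_mod)
demo = types.ModuleType("demo_mod")
demo.visible = 1
demo._hidden = 2
sys.modules["demo_mod"] = demo            # make it importable by name

# Run a star import inside a fresh, empty namespace
namespace = {}
exec("from demo_mod import *", namespace)
star_names = sorted(n for n in namespace if not n.startswith("__"))
print(star_names)

# An underscore name can still be imported explicitly
exec("from demo_mod import _hidden", namespace)
print(namespace["_hidden"])
```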
plified in (156), where the function sort() is imported from the module LingUtils
and run on a string. In keeping with the definition of the function in the module,
the characters in the string are printed out in ascending order. However, the func-
tion sort() is then redefined within the script to sort characters in descending
order and run on the same string.
(156) modulesImportEx3.py
from LingUtils import sort

s = 'antidisestablishmentarianism'
print sort(s)

def sort(str) :
    chars = []
    for s in str :
        chars.append(s)
    chars.sort()
    chars.reverse()
    return chars

print sort(s)
The Search Path The search path determines where Python looks for the files
that define modules. It is influenced by the environment variable PYTHONPATH
and is accessible to programmers as the list sys.path in the module sys. The simple
script in (157) prints out a list of the directories found within the Python
interpreter’s search path.
(157) modulesSearchPath.py
import sys

print "These directories are in the search path:"
for d in sys.path :
    print d
When looking for a module, these directories are searched in order until a matching
file is found, at which point it is imported. If no matching file is found, an
ImportError is raised. The results of running (157) depend upon your system
configuration. The sys.path variable is initialized from the directory containing the
script being run, the value of the environment variable PYTHONPATH, and
installation-dependent defaults.
If you don’t know what environment variables are or how to set them, you
will need to review this aspect of your operating system. To see the value of this
environment variable in Unix, the following command should do the trick:
echo $PYTHONPATH
export PYTHONPATH=$PYTHONPATH:/usr/local/modules//
set PYTHONPATH=’PYTHONPATH:/usr/local/modules/’
Where to Put Custom Modules After you create custom modules, the question
that immediately arises is where you should keep the files containing them. There
are basically three options. The advantages and disadvantages of each will be
discussed in turn:
13.2 Packages
Large projects involve large amounts of code and this code may need even more
structure than modules can provide. This is where packages enter the picture.
Packages in Python are basically collections of modules. They provide a means
of organizing modules that have a more complicated interrelationship with one
another by offering the possibility of having some modules contain others.
To understand why it might be useful to organize modules into packages,
imagine that we are creating modules for the manipulation of Shoebox files. A
comprehensive set of tools for dealing with Shoebox would have to include a
number of things.
(158)
Shoebox/                        Top-level package
    __init__.py
    Parsers/
        __init__.py
        MorphemeParser.py
        WordParser.py
        LineParser.py
        TextParser.py
        ...
    Objects/
        __init__.py
        Morpheme.py
        Word.py
        Line.py
        Text.py
    Utilities/                  Subpackage for misc tools
        __init__.py
        Configuration.py
        Formatter.py
The overall organization of the Shoebox package in (158) is that there is a main
package, called Shoebox, which contains a number of subpackages, each of which
contains numerous modules. The Shoebox package is broken down into various
subpackages for convenience. There is an object model, which models the entities
involved in a Shoebox text, and parsers to shoehorn texts into this object model, as
well as various utilities for accomplishing odd jobs (such as reformatting text in
particular ways). Note that each directory requires a file named __init__.py. Without
this file, the directory will not be recognized as a package by Python.
To see how the Shoebox package would be used in a script, we can turn to
(159), which demonstrates the use of the Shoebox package to parse a Shoebox
text and print out the original text on a line-by-line basis with numbering.
(159) modulesPackagesEx1.py
import sys
from Shoebox.Parsers.TextParser import TextParser
from Shoebox.Utilities.Formatter import removeWhitespace

try :
    fn = sys.argv[1]
    tp = TextParser(fn)
    t = tp.parse()
    i = 0
    for l in t.getLines() :
        i = i + 1
        original = l.getOriginalText()
        formattedOriginal = removeWhitespace(original)
        print "%i. %s" % (i, formattedOriginal)
except IndexError :
    print "Usage: %s <SHOEBOX FILE>" % (sys.argv[0])
There are a number of things to note about (159). The most important thing to
understand is that files within a package do not automatically have access to other
files in the same package. Therefore, if we look at the file TextParser.py, provided
in (??), we see that the LineParser module is explicitly imported since it is not
visible by default.
13.3 Exercises
• ???
• ???
• ???
Chapter 14
Strings In Depth
Probably the most important data type for language research is the string, since
strings are what Python uses to store text. The term string derives from its definition
in various branches of mathematics (such as formal language theory), where a string
is an ordered sequence of symbols that are drawn from a pre-determined set. In
this section, we will cover many important aspects of strings and string processing
in Python.
double quotes, or triple quotes. The behavior of these three string types is
compared in Table 14.1 according to three features: whether quotes need to be
escaped when they occur within a string and whether newlines must be encoded
as \n.
Single-quoted strings are best used for literal strings. If a single quote appears
within a single-quoted string, it must be preceded by a backslash, as shown in
(160). The backslash is known as an escape character in this context because it
provides an “escape” from a character’s normal interpretation. The escape character
is necessary for a single quote within a single-quoted string because otherwise
the interpreter would treat the single quote as the end of the string.
(160) stringsSingleQuoted1.py
print 'I\'ve put in five different req\'s for Minimal Protein Ration for those kids and they don\'t co
print '"Lousy mixture, barbiturates and Dexedrine. What were you trying to do to yourself?"'
print s 2
For strings that contain many line breaks or quotes, triple-quoted strings
are a good option, since neither single nor double quotes need to be escaped and
line breaks are preserved as newlines, as illustrated in (163).
(163) stringsTripleQuoted1.py
s = '''"You could buy a cat," Barbour offered.
"Cats are cheap; look in your Sidney's catalogue."'''
print s
14.2 Unicode Strings
This might be a separate chapter!!!
???
Python provides a very convenient way of manipulating strings, which is the use
of string indices. Each position in a string is assigned an index, as illustrated in
figure 14.2. The way string indexing works is by counting from 0 upwards from
left to right and from -1 downwards from right to left.
[Figure 14.2: the string 'word', with indices 0 to 3 counted from the left and -4 to -1 counted from the right]
Using string indices, any character from this string can be extracted, as shown
in (164).
(164) stringsSliceNotation1.py
string = 'word'         # initialize string
print string            # word
print string[0]         # w
print string[1]         # o
print string[2]         # r
print string[3]         # d
print string[-1]        # d
print string[-2]        # r
print string[-3]        # o
print string[-4]        # w
Using string indices and slice notation (two indices separated by a colon), it is
possible to extract substrings quickly and easily. For example, (165) defines a
string with four letters (word) and then prints the whole string followed by these
slices: the first character, the first two characters, the first three characters, the
last character, the last two characters, and the last three characters.
(165) stringsSliceNotation2.py
string = 'word'         # initialize string
print string            # word
print string[0:1]       # w
print string[0:2]       # wo
print string[0:3]       # wor
print string[-1:]       # d
print string[-2:]       # rd
print string[-3:]       # ord
(166) stringsSliceNotationEx3.py
s = 'spaceship'         # strings are immutable
s[0:5] = 'rocket'       # an attempt at replacement
TypeError: object doesn't support slice assignment
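Although a slice of a string cannot be assigned to, the intended result can be reached by building a new string from the replacement and the part of the original that should be kept. A minimal sketch in Python 3 syntax follows; the variable name new_s is illustrative.

```python
s = 'spaceship'
new_s = 'rocket' + s[5:]    # keep everything from index 5 onwards ('ship')
print(new_s)                # rocketship
print(s)                    # spaceship (the original is unchanged)
```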
partition ???
len(str) The function len() is not a string method, but it nevertheless requires
explanation, since it provides the means for querying a string for a very basic
property, namely its length, which equals the number of characters in it.
For example, if we have a word list for a corpus, we might want to look at the
relationship between word length and word frequency. In (??), we illustrate how to
read in a file, break it into words, create a unique list of words, and print out a list
of each word’s length and frequency.
(168) stringsLengthEx1.py
??? 1
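Since the listing above is still a placeholder, the following is only a sketch of the steps the paragraph describes, written in Python 3 syntax. It works on a short made-up string instead of a file so that it is self-contained; the function name length_and_frequency and the sample text are assumptions, not the author's code or data.

```python
def length_and_frequency(text):
    """Return (word, length, frequency) tuples, sorted by word."""
    counts = {}
    for word in text.split():                 # break the text into words
        counts[word] = counts.get(word, 0) + 1
    rows = []
    for word in sorted(counts):               # unique words in sorted order
        rows.append((word, len(word), counts[word]))
    return rows

# Made-up sample text standing in for the contents of a corpus file
sample = "mi go long haus na mi lukim yu long haus"
rows = length_and_frequency(sample)
for word, length, freq in rows:
    print("%s %i %i" % (word, length, freq))
```

To run this on a real corpus, the sample string would be replaced by the result of calling read() on an opened file.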
If we run this script on a large corpus of Tok Pisin texts and plot the relation-
ship between word length and frequency, we obtain the graph in ???.
???
count(sub[, start[, end]]) The string method count() finds the number of
times that a substring occurs within a string.
(169) stringsCountEx1.py
s = 'antidisestablishmentarianism'
print s.count('a')
print s.count('is')
found within a larger string. It does this by returning the lowest index in a string
where the searched-for substring is found. In other words, it returns the index of
the start of the first (leftmost) occurrence of the searched-for substring. If the
searched-for substring is not found, find returns -1.
(170) stringsFindEx1.py
s = "The ships were always accompanied by an automated probe \
that followed a couple of million miles behind. \
We knew about the portal planets, little bits of flotsam \
that whirled around the collapsars; the purpose of the drone \
was to come back and tell us in the event that \
a ship had smacked into a portal planet at .999 of the speed of light."
print s.find("ship")
print s.rfind("ship")
The search domain of find can be restricted by specifying the optional sec-
ond and third arguments, as in (171).
(171) stringsFindEx2.py
todo 1
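The listing above has not been written yet. As a stand-in, the following sketch (Python 3 syntax, using a short made-up string) shows how the optional start and end arguments narrow the search. A match counts only if the whole substring lies between start and end, where end itself is excluded.

```python
s = "ship to ship"

first = s.find("ship")              # search the whole string from the left
last = s.rfind("ship")              # search the whole string from the right
after_first = s.find("ship", 1)     # start searching at index 1
windowed = s.find("ship", 1, 11)    # second 'ship' needs index 11, so no match

print(first, last, after_first, windowed)
```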
findPositionOfSubstring(str, 'a')
findPositionOfSubstring(str, 'm')
findPositionOfSubstring(str, 'z')
findPositionOfSubstring(str, 'A')
Sometimes it is useful to restrict where to look for the substring within the
larger string. A start and an end point can optionally be specified in the second and
third arguments to index, as shown in (174).
(174) stringsIndexEx2.py
# Function that prints out where a substring is within a string
def findPositionOfSubstring(s, sub) :
    try :
        start = 52
        end = 78
        pos = s.index(sub, start, end)
        print "Between %i and %i, " \
              "'%s' occurs at %i." % (start, end, sub, pos)
    except ValueError :                 # index raises ValueError if not found
        print "Between %i and %i, " \
              "'%s' isn't found!" % (start, end, sub)

# Apply function to string with 4 repetitions of the alphabet.
# The slices for each iteration of the alphabet are as follows:
# 1st: 0 - 25
# 2nd: 26 - 51
# 3rd: 52 - 77
# 4th: 78 - 103
str = "abcdefghijklmnopqrstuvwxyz" \
      "abcdefghijklmnopqrstuvwxyz" \
      "abcdefghijklmnopqrstuvwxyz" \
      "abcdefghijklmnopqrstuvwxyz"
findPositionOfSubstring(str, 'a')
findPositionOfSubstring(str, 'm')
findPositionOfSubstring(str, 'z')
The string method rindex behaves almost identically to index, except that the
substring is searched for from the right end of the string (i.e., from the end) rather
than from the left end of the string (i.e., from the beginning).
(175) stringsRindexEx1.py
# Finds rightmost position of substring
def findRightmostPositionOfSubstring(s, sub) :
    try :
        pos = s.rindex(sub)
        print "The position of '%s' is %i in '%s'." % (sub, pos, s)
    except :
        print "Couldn't find '%s' in '%s'!" % (sub, s)

# Finds leftmost position of substring
def findLeftmostPositionOfSubstring(s, sub) :
    try :
        pos = s.index(sub)
        print "The position of '%s' is %i in '%s'." % (sub, pos, s)
    except :
        print "Couldn't find '%s' in '%s'!" % (sub, s)

# Apply function to string with the alphabet
str = "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz"
findLeftmostPositionOfSubstring(str, 'a')
findLeftmostPositionOfSubstring(str, 'm')
findLeftmostPositionOfSubstring(str, 'z')

findRightmostPositionOfSubstring(str, 'a')
findRightmostPositionOfSubstring(str, 'm')
findRightmostPositionOfSubstring(str, 'z')
One of the most important capabilities a programming language must provide is
the ability to modify strings. Python provides a fairly large number of methods
for modifying strings. We will cover a few of the most useful ones. Note that these
methods do not directly modify the string (since strings are immutable); instead,
they build a modified copy of the string and return it, leaving the original intact.
join(seq) Technically, this method modifies sequences, and not strings, but be-
cause it is a string method that is frequently used to convert sequences into strings,
it is included here. The method is called on a string and takes a sequence as an
argument. It joins the elements in the sequence together into a string, using the
string as a separator, as illustrated in (176).
(176) stringsJoinEx1.py
l = [ 'The', 'wind', 'came', 'across', 'the', 'bay',
      'like', 'something', 'living', '.' ]
print " ".join(l)
The join method uses the string that it is called on as a separator in the output
string. If no separator is desired, the method can be called on an empty string, as
in (177), where the morphemes of the Dutch word ??? are joined together into a
single word and printed.
(177) stringsJoinEx2.py
l = ['a', 'b', 'c', 'd']
print "".join(l)
An error will occur if the elements within the sequence passed to join are
not strings, as in (178), where the list elements are integers.
(178) stringsJoinEx3.py
l = [1, 2, 3, 4]
print "".join(l)
One way to avoid this problem is by first converting each item in the list into a
string. This can be done quite easily using list comprehension (described in §9.3),
as shown in (179).
(179) stringsJoinEx4.py
l = ['Lift-off', 'at', 14, 'hours', '40', '.']
print " ".join([str(i) for i in l])
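The error and its remedy can be observed side by side. The sketch below (Python 3 syntax) catches the TypeError that join raises for a non-string element and then joins the converted items; the variable names failed and joined are illustrative.

```python
l = ['Lift-off', 'at', 14, 'hours', '40', '.']

try:
    " ".join(l)                 # fails: the third element is an integer
    failed = False
except TypeError:
    failed = True

joined = " ".join(str(i) for i in l)    # convert every item to a string first
print(failed)
print(joined)
```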
replace(old, new [, count]) The method replace takes a string and produces
a copy in which every instance of one string within it has been systematically
replaced by another. For example, if a word is systematically misspelled within a
text, it can be automatically fixed with this method, as shown in (180), where a
systematic misspelling of the word phoneme is corrected.
(180) stringsExReplace1.py
before = 'A foneme is the minimal phonological unit. \
The term foneme was coined by Edward Sapir, \
who believed that fonemes had psychological reality.'
after = before.replace('foneme', 'phoneme')
print after
This method also takes an optional third argument that limits the number of
replacements made. By default, there is no upper limit, and every occurrence is
replaced.
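The effect of the optional third argument is easiest to see by comparing the two calls on the same text. A minimal sketch in Python 3 syntax follows; the sample sentence is made up for illustration.

```python
before = 'A foneme is a unit. Two fonemes can contrast. Each foneme counts.'

all_fixed = before.replace('foneme', 'phoneme')        # no limit
first_only = before.replace('foneme', 'phoneme', 1)    # at most one replacement

print(all_fixed)
print(first_only)
```

Replacements are made from left to right, so a count of 1 changes only the first occurrence.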
split(separator [,maxsplit])
The method split takes a string and breaks it up into parts at each occurrence of a
particular separator. If the separator is not found in the string, a list containing
the original string as its only element is returned. For example, the string
dis-establish-ment-arian-ism, which
provides a morphemic breakdown of the word disestablishmentarianism, could be
split up into its constituent morphemes by running split on it using a hyphen as a
separator. The result would be a list with five morphemes, as shown in (182).
(182) stringsSplit1.py
word = 'dis-establish-ment-arian-ism'
morphemes = word.split('-')
i = 0
for m in morphemes :
    i = i + 1
    print i, m
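The optional maxsplit argument and the behavior when the separator is absent can be checked in the same way. The sketch below uses Python 3 syntax; the variable names are illustrative.

```python
word = 'dis-establish-ment-arian-ism'

all_parts = word.split('-')         # split at every hyphen
two_splits = word.split('-', 2)     # split at most twice, keep the rest whole
no_match = word.split('+')          # separator absent: one-element list

print(all_parts)
print(two_splits)
print(no_match)
```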
strip([chars]) The method strip provides the means of removing (or stripping)
any unwanted characters from the beginning and end of a string. If no
argument is supplied to the method, then whitespace characters will be stripped
from both ends of the string, that is, from the beginning and the end.
(183) stringsStripEx2.py
s1 = " xxx"
s2 = "xxx "
print s1
print s2
print s1.strip()
print s2.strip()
To strip specific characters from a string, the method takes a string argument
that lists all of the characters to be stripped. The argument is treated as a set of
characters rather than as a substring, so strip('aa') and strip('a') behave
identically. For example, the string aabbaa could be reduced to bb by removing the
letter a’s from its two ends, as shown in (184).
(184) stringsStripEx1.py
s = 'aabbaa'
s = s.strip('a')        # strip returns a new string, so save the result
print s
There are two variants of strip, which remove unwanted characters only from
one edge of the string: lstrip removes unwanted characters from the left side
(the beginning) and rstrip from the right side (the end), as illustrated in (185).
(185) stringsStripEx3.py
before = 'aabbaa'
after = before.lstrip('a')
print after
after = before.rstrip('a')
print after
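The three stripping methods can be compared directly on the same string. The sketch below (Python 3 syntax) also includes a case with two different characters in the argument, which shows that the argument works as a set of characters; the variable names are illustrative.

```python
s = 'aabbaa'

both = s.strip('a')             # remove a's from both ends
left = s.lstrip('a')            # remove a's from the left end only
right = s.rstrip('a')           # remove a's from the right end only
mixed = 'abcba'.strip('ab')     # any mix of a's and b's is removed

print(both, left, right, mixed)
print(s)                        # the original string is unchanged
```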
ljust(width)
(187) stringsLjustEx1.py
# Automatically create a left-justified header of an arbitrary width
width = 50
text = 'SECTION'
print "-" * width
print "-" + text.ljust(width - 2) + "-"
print "-" * width
numbers. The code in (188) provides a simple example of how this works by
printing the numbers 1 through 10 with right justification.
(188) stringsRjustEx1.py
width = 2
for i in range(1, 11) :
    s = str(i)
    print s.rjust(width)
zfill(width) The string method zfill is used to pad strings with zeros. Its
main purpose is formatting numbers, as illustrated in (189).
(189) stringsZfillEx1.py
for i in range(1, 11) :
    s = str(i)
    print s.zfill(2)
Zero-padding for numbers is useful since it ensures that they will sort properly
when sorted alphabetically (rather than numerically). To see the difference, try
sorting the raw numbers 1 through 10 and comparing the results with a sort of the
padded numbers printed out by (189).
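The comparison suggested above can be sketched as follows. The variable names are ours, and the built-in sorted() function is used so that the original lists are left untouched.

```python
raw = [str(i) for i in range(1, 11)]
padded = [s.zfill(2) for s in raw]

# Strings are compared character by character, so '10' sorts before '2'.
print(sorted(raw))

# With zero-padding, alphabetical order and numerical order coincide.
print(sorted(padded))
```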
14.4.4 Handling Case
Under the heading of case, we will discuss string functions that are used to evalu-
ate and/or manipulate the case of the letters constituting a string.
(190) stringsCaseEx1.py
name = 'noam chomsky'

# Capitalize -- Noam chomsky
caps = name.capitalize()
print caps

# Lowercase -- noam chomsky
lower = caps.lower()
print lower

# Uppercase -- NOAM CHOMSKY
upper = lower.upper()
print upper

# Title -- Noam Chomsky
title = upper.title()
print title

# Swap Case -- nOAM cHOMSKY
swap = title.swapcase()
print swap
In some cases, one may want to know which case a string is in.
(191) stringsCaseEx2.py
lcase = 'time travel'
if lcase.islower() :
    print "The word '%s' is all lowercase." % (lcase)

ucase = 'TARDIS'
if ucase.isupper() :
    print "The word '%s' is all uppercase." % (ucase)

title = 'Dalek'
if title.istitle() :
    print "The word '%s' is in title case." % (title)
Although the code in (193) works fine, it isn’t particularly pretty. It’s not
easy to see exactly what will be printed out amid the clutter of string
concatenations. Fortunately, there is a more elegant solution: string
interpolation. String interpolation is a convention whereby a placeholder in a
string is replaced by a value provided after the string. In other words, a value
is interpolated into a string (interpolate v. tr. 1. “To insert or introduce
between other elements or parts”). The results obtained in (193) can also be
obtained with string interpolation, and in a much simpler and cleaner fashion, as
shown in (194).
14.5 Case Studies in String Manipulation
(194) stringsStringInterpolationEx3.py
i = 3.1415
print 'The value of pi is %f.' % i
In (194), the float is written with four decimal places, but the %f placeholder
prints six decimal places by default, so the interpolated value appears as
3.141500. It is possible, however, to restrict the number of decimal places in
the interpolated float by placing a precision (a period followed by a number)
after the percent sign and before the letter, as in (195).
(195) stringsStringInterpolationEx4.py
i = 3.1415
print 'pi = %0.1f (to the 1st decimal place)' % i
print 'pi = %0.2f (to the 2nd decimal place)' % i
print 'pi = %0.3f (to the 3rd decimal place)' % i
print 'pi = %i (as an integer)' % i
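Interpolation is not limited to a single value. The following sketch, with invented variable names and values, shows how several values can be supplied as a tuple and how a minimum field width can be combined with a precision. It is written so that it should run under both Python 2 and Python 3.

```python
word = 'asige'
count = 3
ratio = 0.125

# Several values are supplied as a tuple, matched to placeholders left to right.
# A literal percent sign is written as %%.
line = '%s occurs %d times (%.1f%% of tokens)' % (word, count, ratio * 100)

# A number before the period sets the minimum field width (right-justified).
cell = '[%8.3f]' % 3.14159

print(line)
print(cell)
```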
Building a concordance like this is much simpler than it might seem. The code
is provided below in (197).
(197) stringsConcordance1.py
from string import split, lower

# ----------------------------------------------------
# 1. Get list of lines from file
# ----------------------------------------------------

fn = 'inputs/april_inventory.txt'    # hardcoded
f = open(fn, 'r')                    # 'r' = read-mode
lineList = f.readlines()             # Get list of lines
f.close()                            # Done with the file

# ----------------------------------------------------
# 2. Create dictionary from list
# ----------------------------------------------------

concord = dict()                     # Initialize dictionary
for l in lineList:                   # Loop over line list
    # Get lowercase copy of line with newline removed
    cleanLine = lower(l).rstrip('\n')

    # Remove common punctuation marks
    cleanLine = cleanLine.replace('.', '')    # Ditch .
    cleanLine = cleanLine.replace(',', '')    # Ditch ,
    cleanLine = cleanLine.replace(';', '')    # Ditch ;
    cleanLine = cleanLine.replace(':', '')    # Ditch :
    cleanLine = cleanLine.replace('!', '')    # Ditch !
    cleanLine = cleanLine.replace('"', '')    # Ditch "

    # Split line (whitespace is default separator)
    words = split(cleanLine)

    for w in words:                  # Loop over words
        if not concord.has_key(w):   # Is it already there?
            concord[w] = 1           # No, add it.
        else:                        # Yes, increment freq.
            concord[w] = concord[w] + 1

# ----------------------------------------------------
# 3. Print out concordance
# ----------------------------------------------------
items = concord.items()              # Extract items
items.sort()                         # Sort them
for word, freq in items:             # Now print them
    print word , "\t" , freq         # Tab-delimited w/ \t
We accomplish the task by first reading all of the lines of the text into a list.
We then iterate over the lines. For each line, we remove the trailing newline,
convert the line to lowercase, and strip out the punctuation marks using
replace(old,new). Each line is then split into words (i.e., tokenized) using
whitespace as the word delimiter. (Since whitespace is the default separator for
split(), there is no need to supply it with a second argument.) The words are
then put into a dictionary. When a word does not already exist in the dictionary,
it is added and its counter is initialized to 1. When a word already exists in
the dictionary, its counter is incremented by 1. Finally, we sort the items in
the dictionary (in ascending alphabetical order, which is the default) and print
each word followed by its frequency to obtain our concordance.
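The same counting logic can be sketched as a function that works on any list of lines, which makes it easy to try out without an input file. This is a variant of our own rather than the listing above: the function name and sample text are invented, punctuation is trimmed only from the edges of each token (using string.punctuation), and dict.get replaces the has_key test. It should run under both Python 2 and Python 3.

```python
import string

def build_concordance(lines):
    """Return a dict mapping each lowercased word to its frequency."""
    concord = {}
    for line in lines:
        for token in line.lower().split():
            # Trim punctuation from the edges of each token only.
            word = token.strip(string.punctuation)
            if word:
                # get() returns 0 for a word that has not been seen yet.
                concord[word] = concord.get(word, 0) + 1
    return concord

# Hypothetical sample text standing in for the input file.
sample = ['The cat saw the dog.\n', 'The dog, alas, ran!\n']
concord = build_concordance(sample)
for word in sorted(concord):
    print('%s\t%d' % (word, concord[word]))
```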
File
Scripting languages like Python are great for jobs that need to be done in a hurry
and probably won’t be reused. The advantage of a language like Python is that it
is also amenable to larger, more involved projects–in other words, it’s a scripting
language that scales. Therefore, we’ll show you how quick-and-dirty jobs can be
accomplished in a way that leaves open the possibility of future expansion of the
scope of the work.
For this case study, we will look at how to process a dictionary file from
Shoebox, a program put out by SIL and used by fieldworkers for storing
dictionaries and interlinearizing texts automatically.
An excerpt from a Shoebox file belonging to one of the authors is provided
below.
(198) stringsRotokasDictionary.in
\lx asige
\rt asige
\ge make sneezing noise
\gp kus
\nt
\ps VI
\ex Asigeparoi oirato.
\xp
\xe The man is sneezing.
\lx asigea
\rt asigea
\ge a sneeze
\gp kus
\nt
\ps N.N
\ex
\xp
\xe
\lx asigo
\rt asigo
\ge speak Rotokas
\gp ???
\nt
\ps VT
\ex Akoitai asigoparevoi.
\xp
\xe Akoitai is using the Rotokas (Asigoao) language.
\lx asigoao
\rt asigoao
\ge mountain people/people of the mountains
\gp ???
\nt Name used to designate the Rotokas people before the word rotokas was (NPRN according to Firchow)
\ps N.PN?
\ex
\xp
\xe
\lx asigoao reo
\rt asigoao reo
\ge Rotokas language as designated before the term Rotokas came into use
\gp ???
\nt (NPRN according to Firchow)
\ps N.PN?
\ex
\xp
\xe
\lx asikauru
\rt asikauru
\ge rust
\gp ros
\nt
\ps V.STAT
\ex Ro-ia Toiota asikauroepa.
\xp
\xe The Toyota truck rusted away.
We can begin to tackle the problem by creating a Shoebox entry object, which
will store the data contained in an entry and also handle the parsing of the raw
Shoebox text. Although it is possible to create a separate parser to do the
latter job, we will not do so here, based on the judgement call that it would
unnecessarily complicate our object model. (Remember, these decisions are not
carved in stone, and it may later be necessary to reevaluate one’s object model
and do some refactoring.)
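To make the idea concrete, here is a minimal sketch of what such an entry object might look like. It is not the implementation developed in this book: the class name ShoeboxEntry, its methods, and the decision to store fields in a dictionary keyed by field marker are all assumptions made for illustration. The sample lines are taken from the excerpt in (198).

```python
class ShoeboxEntry:
    """Hypothetical entry object: stores fields and parses raw Shoebox text."""

    def __init__(self, raw_lines):
        self.fields = {}
        for line in raw_lines:
            self.parse_line(line)

    def parse_line(self, line):
        line = line.strip()
        if not line.startswith('\\'):
            return                       # not a field line
        parts = line[1:].split(None, 1)  # marker, then (optional) value
        if not parts:
            return                       # a lone backslash carries no marker
        marker = parts[0]
        value = ''
        if len(parts) > 1:
            value = parts[1]
        self.fields[marker] = value

    def get(self, marker):
        # Unknown or empty fields come back as an empty string.
        return self.fields.get(marker, '')

raw = ['\\lx asige', '\\rt asige', '\\ge make sneezing noise',
       '\\gp kus', '\\nt', '\\ps VI']
entry = ShoeboxEntry(raw)
print(entry.get('lx'))
print(entry.get('ge'))
```

Keeping the parsing inside the object means that code using an entry only ever calls get() and never has to know what the raw file looks like.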
import sys

# -------------------------------
# Extract every word from the file
# supplied as a command-line arg
# -------------------------------
def extractWords(f) :
    # Get lines from file
    fo = open(f, 'r')
    lines = fo.readlines()
    fo.close()

    # Go through lines and get words
    words = []
    for l in lines :
        nl = l.replace("\n", "")
        words.append(nl)
    return words

# -------------------------------
# Sort the words by their length
# and put them in a dictionary
# where the key is a wordlength
# and the value is the list of
# words of that length
# 1: a, I
# 2: an, to, it, ...
# 3: the, and, but, ...
# ...
# -------------------------------
def sortWordsByLength(allWords) :
    wordLengths = {}
    for w in allWords :
        wl = len(w)
        if not wordLengths.has_key(wl) :
            wordLengths[wl] = []
        words = wordLengths[wl]
        words.append(w)
        wordLengths[wl] = words
    return wordLengths

# -------------------------------
# Object for convenient storage
# and manipulation of minpair
# -------------------------------
class MinimalPair :
    def __init__(self, s1, s2, w1, w2) :
        self.s1 = s1
        self.s2 = s2
        self.w1 = w1
        self.w2 = w2
    def getFirstWord(self) :
        return self.w1
    def getSecondWord(self) :
        return self.w2
    def setFirstWord(self, w1) :
        self.w1 = w1
    def setSecondWord(self, w2) :
        self.w2 = w2
    def getFirstSegment(self) :
        return self.s1
    def getSecondSegment(self) :
        return self.s2
    def setFirstSegment(self, s1) :
        self.s1 = s1
    def setSecondSegment(self, s2) :
        self.s2 = s2
    def getWordPair(self) :
        if self.w1 < self.w2 :
            return self.w1 + self.w2
        else :
            return self.w2 + self.w1
    def getContrast(self) :
        if self.s1 < self.s2 :
            return self.s1 + self.s2
        else :
            return self.s2 + self.s1

# -------------------------------
# Go through the words by their
# length and compare words of
# the same length to see if they
# constitute a minimal pair
# -------------------------------
def findMinPairs(words) :
    wordsByLength = sortWordsByLength(words)
    minPairs = {}
    for l in wordsByLength.keys() :
        words = wordsByLength[l]
        for w1 in words :
            for w2 in words :
                #print w1 + " / " + w2
                i = 0
                diffCount = 0
                diffChar1 = ''
                diffChar2 = ''
                while i < l :
                    #print w1[i] + " / " + w2[i]
                    if not w1[i] == w2[i] :
                        diffCount = diffCount + 1
                        diffChar1 = w1[i]
                        diffChar2 = w2[i]
                    i = i + 1
                if diffCount == 1 :
                    mp = MinimalPair(diffChar1, diffChar2, w1, w2)
                    if not minPairs.has_key(mp.getContrast()) :
                        minPairs[mp.getContrast()] = {}
                    minPairsForContrast = minPairs[mp.getContrast()]
                    minPairsForContrast[mp.getWordPair()] = mp
    return minPairs

# -------------------------------
# Print out the minimal pairs,
# grouped by the contrast that
# distinguishes them
# -------------------------------
def printResults(minPairs) :
    contrast = minPairs.keys()
    contrast.sort()
    for c in contrast :
        minPairsByContrast = minPairs[c]
        print "/" + c[0] + "/ vs. /" + c[1] + "/"
        for mp in minPairsByContrast.values() :
            print mp.getFirstWord()
            print mp.getSecondWord()
            print
        print

# -------------------------------
# Run everything
# -------------------------------
def main() :
    filepath = sys.argv[1]
    words = extractWords(filepath)
    minPairs = findMinPairs(words)
    printResults(minPairs)

# -------------------------------

main()
The code in (199) first extracts all of the words from the file and puts them
in a list. It then goes through the list and finds all of the minimal pairs. This
is accomplished in a number of steps. The first step is to group the words by
length (sortWordsByLength), since only words of the same length are compared.
When a phonemic alphabet is used and each phoneme is represented by a
single letter (or character), finding minimal pairs is a fairly trivial task, consisting
of the following: Every word in the word list is compared to every other word.
If two words are of different lengths, they cannot be a minimal pair, and can
therefore be ignored. If two words are of the same length, the two words are lined
up and each segment in the word is compared one by one, in sequential order. A
minimal pair is simply a pair of words that differ only by one segment. (This can
be easily determined by keeping a counter of the number of differing segments.)
Consider a pair of words like mint and lint. If we line up the two words, and
compare their letters one by one, we find that they differ by a single letter, and are
therefore a minimal pair, as shown in Table 14.4.
Table 14.4 Comparing Words in a Minimal Pair
Index     0   1   2   3
Letters   m   i   n   t
          l   i   n   t
Same?     N   Y   Y   Y
The only problem with this approach is that when you compare every word
in a list to every other, the number of comparisons required ends up being very
large, since it grows quadratically with the number of words in the list (if the
number of words in a word list is n, then the number of comparisons is n^2).
Therefore, to speed up the program, we first divide up the list according to word
length and only compare words of the same length.
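Both the letter-by-letter comparison and the saving obtained by grouping words by length can be sketched briefly. The function names and the toy word list are ours, and the counts assume, as in the description above, that every word in a group is paired with every word in that group.

```python
def differing_positions(w1, w2):
    """Return the indexes at which two equal-length words differ."""
    return [i for i in range(len(w1)) if w1[i] != w2[i]]

def is_minimal_pair(w1, w2):
    # Words of different lengths are ruled out before any letters are compared.
    return len(w1) == len(w2) and len(differing_positions(w1, w2)) == 1

words = ['mint', 'lint', 'line', 'mine', 'at', 'it', 'a']   # toy word list

# Comparisons needed when every word is paired with every word: n * n.
all_pairs = len(words) ** 2

# Comparisons needed when words are first grouped by length.
by_length = {}
for w in words:
    by_length.setdefault(len(w), []).append(w)
grouped_pairs = sum(len(group) ** 2 for group in by_length.values())

print(is_minimal_pair('mint', 'lint'))
print('%d comparisons ungrouped, %d grouped' % (all_pairs, grouped_pairs))
```

Even for seven words the grouped version needs fewer than half as many comparisons, and the gap widens as the word list grows.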
Two words are considered anagrams of one another if the letters in one word can
be rearranged to form the other.
(200) stringsAnagrams1.py
import sys
from string import join

# -------------------------------
# Extract every word from the file
# supplied as a command-line arg
# -------------------------------
def extractWords(f) :
    # Get lines from file
    fo = open(f, 'r')
    lines = fo.readlines()
    fo.close()

    # Go through lines and get words
    words = []
    for l in lines :
        nl = l.replace("\n", "")
        words.append(nl)
    return words

# -------------------------------
# Sort the words by their length
# and put them in a dictionary
# where the key is a wordlength
# and the value is the list of
# words of that length
# 1: a, I
# 2: an, to, it, ...
# 3: the, and, but, ...
# ...
# -------------------------------
def sortWordsByLength(allWords) :
    wordLengths = {}
    for w in allWords :
        wl = len(w)
        if not wordLengths.has_key(wl) :
            wordLengths[wl] = []
        words = wordLengths[wl]
        words.append(w)
        wordLengths[wl] = words
    return wordLengths

# -------------------------------
# Take all of the chars in a
# string, alphabetize them,
# and return them as a string
# -------------------------------
def sortStringChars(s) :
    charList = []
    i = 0
    while i < len(s) :
        charList.append(s[i])
        i = i + 1
    charList.sort()
    return join(charList, '')

# -------------------------------
# Go through the words by their
# length and compare words of
# the same length to see if they
# are anagrams
# -------------------------------
def findAnagrams(words) :
    anagrams = {}
    wordsByLength = sortWordsByLength(words)
    for l in wordsByLength.keys() :
        words = wordsByLength[l]
        for w1 in words :
            chars1 = sortStringChars(w1)
            for w2 in words :
                chars2 = sortStringChars(w2)
                if chars1 == chars2 and w1 != w2 :
                    if not anagrams.has_key(chars1) :
                        anagrams[chars1] = {}
                    ac = anagrams[chars1]
                    ac[w1] = ''
                    ac[w2] = ''
                    anagrams[chars1] = ac
    return anagrams

# -------------------------------
# Print out the results in a
# nicely formatted fashion
# -------------------------------
def printResults(anagrams) :
    for a in anagrams.values() :
        words = a.keys()
        print join(words, "/")

# -------------------------------
# Run everything
# -------------------------------
def main() :
    filepath = sys.argv[1]
    words = extractWords(filepath)
    anagrams = findAnagrams(words)
    printResults(anagrams)

main()
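Since two words are anagrams exactly when their alphabetized letters are identical, the alphabetized string can itself serve as a dictionary key. The following sketch, which is an alternative of our own rather than the approach of (200), groups words in a single pass and so avoids comparing every word with every other word. The function names and the word list are invented for illustration.

```python
def signature(word):
    """Alphabetize the letters of a word: 'listen' -> 'eilnst'."""
    return ''.join(sorted(word))

def group_anagrams(words):
    groups = {}
    for w in words:
        # Words with the same signature land in the same set.
        groups.setdefault(signature(w), set()).add(w)
    # Keep only signatures shared by at least two different words.
    return dict((sig, ws) for sig, ws in groups.items() if len(ws) > 1)

words = ['listen', 'silent', 'enlist', 'tinsel', 'stone', 'notes', 'python']
anagrams = group_anagrams(words)
for sig in sorted(anagrams):
    print('/'.join(sorted(anagrams[sig])))
```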
import sys

# ---------------------------------
# Return a reversed copy of a string
# (function header reconstructed from
# the call to reverse() in main)
# ---------------------------------
def reverse(s) :
    ns = ''
    i = len(s)
    while i > 0 :
        i = i - 1
        ns = ns + s[i]
    #print ns
    return ns

# ---------------------------------
# Go through file, find words,
# and check whether they are
# palindromes
# ---------------------------------
def main() :
    file = sys.argv[1]

    # Get lines from file
    fo = open(file, 'r')
    lines = fo.readlines()
    fo.close()

    # Find palindromes
    palindromes = []
    for l in lines :
        w = l.replace("\n", "")
        if w == reverse(w) :
            palindromes.append(w)

    # Print out palindromes found
    for p in palindromes :
        print p

main()
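The reversal loop can also be written with a slice. In the sketch below, which is our own variant, word[::-1] steps through the string backwards, and the comparison ignores case (an assumption that goes beyond the listing above, which compares words exactly as they appear in the file).

```python
def is_palindrome(word):
    # Ignore case so that 'Hannah' counts; w[::-1] is the reversed string.
    w = word.lower()
    return w == w[::-1]

candidates = ['level', 'Hannah', 'python', 'noon', 'linguist']
palindromes = [w for w in candidates if is_palindrome(w)]
print(palindromes)
```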
14.6 Exercises
• ???
• ???
• ???
Table 14.3 Conventions for String Interpolation
Convention  Meaning
d           Signed integer decimal.
i           Signed integer decimal.
o           Unsigned octal.
u           Unsigned decimal.
x           Unsigned hexadecimal (lowercase).
X           Unsigned hexadecimal (uppercase).
e           Floating point exponential format (lowercase).
E           Floating point exponential format (uppercase).
f           Floating point decimal format.
F           Floating point decimal format.
g           Same as "e" if exponent is less than -4 or not less than precision, "f" otherwise.
G           Same as "E" if exponent is less than -4 or not less than precision, "F" otherwise.
c           Single character (accepts integer or single character string).
r           String (converts any Python object using repr()).
s           String (converts any Python object using str()).
Notes:
(1) ???
(2) ???
(3) ???
(4) ???
AF Part II
Advanced Topics
DR
151
DR
AF
T
T
Chapter 15
AF
Object Oriented Programming
15.1 What are Objects?
defined in such a way that various operations on it are possible, such as
devoicing, lengthening, etc.
Let’s see how this might work. Before we can create an object, we need to
introduce a bit more terminology. The template for an object is a class.
Therefore, we speak of instantiating a class as an object. This may sound rather
abstruse, but it’s actually fairly simple, and fortunately there’s already
linguistic terminology that can be pressed into service. The distinction between
classes and objects is basically the distinction between types and tokens. Tokens
are individual instances of a type. Similarly, objects are instances of a class.
The first step towards creating a phoneme object is therefore to write a class
that will define the phoneme object. The code in (202) defines a very simple object
that could be used to represent stop phonemes (plosives).
(202) ooObjectEx1.py
class Stop :                        # Name of the class
    def __init__(self) :            # Constructor method
        self._voicing = None        # Object variable (attribute)
    def voicing(self) :             # Accessor method for voicing
        return self._voicing
    def devoiced(self) :
        self._voicing = 'VOICELESS'
    def voiced(self) :
        self._voicing = 'VOICED'
The class definition begins with the keyword class followed by the name of
the class, in this case, Stop. It also defines four functions, or methods, to use
the terminology of OO programming. The first is __init__, a special method that
handles the creation of an object from a class, associating with it a single
variable, _voicing. (The variable is written with a leading underscore so that it
does not clash with the method named voicing. Don’t sweat the details. For more
information about object construction, see §16.1.1.) The remaining three methods
relate to voicing. The first, voicing, provides the value of the _voicing
variable (attribute). The other two change the attribute, setting its value to
either VOICED or VOICELESS.
We now have a class that defines an object that we can use to represent stop
phonemes. But how do we make use of it? In (203), we insert the class definition
into the beginning of a script, create a phoneme object, and invoke its methods for
illustration.
(203) ooObjectEx2.py
# Define the class
class Stop :
    def __init__(self) :
        self._voicing = None
    def voicing(self) :
        return self._voicing
    def devoiced(self) :
        self._voicing = 'VOICELESS'
    def voiced(self) :
        self._voicing = 'VOICED'

# Create an object
s = Stop()
s.devoiced()
print s.voicing()
s.voiced()
print s.voicing()
Note that the stop object is saved into the variable s. The devoiced()
method is called on s, setting the voicing variable, which is then retrieved and
printed by invoking the voicing() method on s. The same is then done for the
voiced() method. This example already illustrates a few things about objects.
First, they are really quite easy to use. It’s not rocket science! Second, they
have very well-defined behavior. Notice that the methods provide no way of
setting the value of voicing to anything but VOICELESS or VOICED.1 If we attempt
to call a method that doesn’t exist, we get an error, as shown in (204). (We’ve
omitted the class definition to save space.)
(204)
>>> s = Stop()
>>> print s.aspiration()
Traceback (most recent call last):
File "./ex_class_error.py", line 18, in ?
print s.aspiration()
AttributeError: Stop instance has no attribute ’aspiration’
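A program does not have to crash when a method is missing. The sketch below shows two standard ways of coping: asking in advance with the built-in hasattr() function, and attempting the call inside a try block that catches AttributeError. The Stop class is repeated so that the example is self-contained, and the variable outcome is ours. The code should run under both Python 2 and Python 3.

```python
class Stop:
    def __init__(self):
        self._voicing = None
    def voicing(self):
        return self._voicing
    def devoiced(self):
        self._voicing = 'VOICELESS'
    def voiced(self):
        self._voicing = 'VOICED'

s = Stop()
s.devoiced()

# hasattr() asks whether the object has a given method or attribute.
print(hasattr(s, 'voicing'))
print(hasattr(s, 'aspiration'))

# Alternatively, attempt the call and handle the error if it is raised.
try:
    s.aspiration()
    outcome = 'called'
except AttributeError:
    outcome = 'no such method'
print(outcome)
```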
Although there are still lots of technical details that need to be covered, the
basic idea of how objects are defined and used should be clear at this point. We
can now turn to a more important issue, which the reader is no doubt beginning to
wonder about: why this approach represents an advantage over traditional
programming practices. The next section takes up the virtues of OO programming.
15.2 The Virtues of OO Programming
xxx
(9) 1 Harald Baayen ??? 2004-???-???
2 Stuart Robinson ??? 1973-10-28
3 Fermin Moscoso del Prado ??? 1974-???-???
1. first name,
2. last name,
There are some aspects of this program that are awkward and could be im-
proved, even within a procedural programming approach. For starters, it would
be very difficult to change the way that the data is reformatted, since the refor-
matting happens as each line is processed. It would be desirable to have a bit
more flexibility in the reformatting, perhaps by storing all of the data in a large
data structure and then manipulating this data structure at the end. The following
script is a variant of (??) which differs in the way it handles the data in the file.
Instead of reformatting the names as they are processed, it stores the data as a list
of lists, which it manipulates at the end.
(206) ooEx2.py
import sys

filename = sys.argv[1]              # Filename is command-line arg
f = open(filename)                  # Open file
lines = f.readlines()               # Read in lines
data = []
for l in lines :                    # Process lines
    cols = l.split("\t")            # Split by tabs
    data.append(cols)               # Put data in list

for d in data :                     # Extract data
    fname = d[0]                    # First name is 1st col
    lname = d[1]                    # Last name is 2nd col
    print lname + ", " + fname      # Print out
Note how the data is organized in (206): the data in each line is put into a list
and these lists with line data are then put into one big list. (In other words,
we have a list of lines, each of which is a list of columns.) This approach works
but it isn’t very elegant. For starters, it is somewhat inflexible: each field is
identified only by its position in the inner list, so any code that uses the data
has to know that index 0 holds the first name and index 1 the last name, and has
to be changed if the order of the columns changes.
Some of the inelegance of this program can be solved using objects. The main
innovation we will introduce is the use of a Person object, which will be used to
save all of the data about the people listed in the data file. This solution
isn’t fully OO, since a good deal of it continues to be procedural (e.g., the
parsing of the data file into lines and of lines into columns), but it will
suffice to introduce some basic concepts about objects.
(207) ooEx3.py
import sys
from string import split

#-----------------------------------
# Define Person class
#-----------------------------------
class Person :
    def __init__(self) :
        self._fname = None
        self._lname = None
    def set_first_name(self, fname) :
        self._fname = fname
    def set_last_name(self, lname) :
        self._lname = lname
    def get_first_name(self) :
        return self._fname
    def get_last_name(self) :
        return self._lname
    def get_first_initial(self) :
        return self._fname[0] + "."

#-----------------------------------
# Process data file
#-----------------------------------

# Filename is command-line arg
filename = sys.argv[1]

# Open file
f = open(filename)

# Read in lines
lines = f.readlines()

# Process lines
people = []
for l in lines :
    cols = split(l, "\t")
    p = Person()
    p.set_first_name(cols[0])
    p.set_last_name(cols[1])
    people.append(p)

# Extract data
for p in people :
    print p.get_last_name() + ", " + p.get_first_initial()
Don’t worry too much about the details of how the Person object is defined.
The point is that it provides a way of storing particular data fields and a way
of setting and retrieving them. Basically, this is done through custom-defined
functions, or methods, to use OO jargon. In this case, we have five methods
besides the constructor: two methods for setting the object’s variables (setter
methods), two methods for retrieving them (getter methods), and one method,
get_first_initial, that derives the first initial from the first name. The
methods of an object collectively constitute an interface.
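One payoff of such an interface is that the reformatting step can be changed without touching the parsing step. The following sketch sorts the people by last name before printing them. The two sample lines are invented stand-ins for the data file (the names come from the example data earlier in this chapter), and the Person class is repeated so that the example runs on its own under both Python 2 and Python 3.

```python
class Person:
    def __init__(self):
        self._fname = None
        self._lname = None
    def set_first_name(self, fname):
        self._fname = fname
    def set_last_name(self, lname):
        self._lname = lname
    def get_first_name(self):
        return self._fname
    def get_last_name(self):
        return self._lname
    def get_first_initial(self):
        return self._fname[0] + "."

# Tab-delimited lines standing in for the data file.
lines = ["Stuart\tRobinson\n", "Harald\tBaayen\n"]

people = []
for l in lines:
    cols = l.rstrip("\n").split("\t")
    p = Person()
    p.set_first_name(cols[0])
    p.set_last_name(cols[1])
    people.append(p)

# Reformatting happens at the end and only goes through the interface.
people.sort(key=lambda p: p.get_last_name())
formatted = [p.get_last_name() + ", " + p.get_first_initial() for p in people]
print(formatted)
```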
It should be obvious that objects provide a much more flexible way of organizing
programs, since they provide a means of directly modelling conceptual entities.
But there is much more to OO programming than convenient manipulation of
data structures.
It may seem like using objects creates a great deal of unnecessary overhead.
After all, (??) consists of only XXX lines, whereas (??) has XXX (nearly twice
as many). In a simple example like this one, the benefits of OO programming
are not as obvious, but they become apparent once we begin dealing with more
complicated examples. This shouldn’t be too surprising, since OO programming
is basically a technique for managing complexity by decomposing a complex
system into a set of less complex objects, whose interactions define the behavior
of the system. It may be worth saying something at this point about the relative
advantages of OO programming over traditional procedural programming. Some of
the strengths of the former are listed below:
classes react differently to the same message
Reuse Integration of methods and data promotes reuse of code
Typing ???
In the next few sections, we will explore these concepts in some depth. Note,
however, that we will not yet delve into the details of how objects are handled in
Python, since our concern here is with larger conceptual issues and not nitty-gritty
language-specific details (which are covered in chapter ??). This means that
some aspects of the code in the upcoming examples will not be fully explained
and must be taken on faith. If as a result some aspects of the code are unclear,
there is no need to worry, since the next chapter will fill in the gaps.
15.3.2 Encapsulation
One fairly central concept behind object orientation is encapsulation, which refers
to the technique of making the internal details of an object invisible to the user, in
some sense hiding them. Encapsulation gives object orientation a great deal of its
appeal. By shielding the internal workings of an object from the user, the focus is
placed upon how you interact with an object (i.e., its interface) rather than how it
is written (i.e., its implementation). This encourages the reuse of objects, since
the user doesn’t need to understand their inner workings, but only how to interact
with them. The implementation of an object can be improved without changing
its interface, which makes maintenance of the code a lot simpler, since any code
that refers to it won’t also have to be changed.
15.3 Major Concepts of OO Programming
To see how encapsulation works in a more concrete way, let’s look at a simple
example. The program Shoebox is a tool used by linguists to manage lexicons
and deploy them for textual analysis. It produces interlinearized texts like the one
excerpted in Figure 15.3.2.
What we have in Figure 15.3.2 is a single line of text, broken down into words,
which are in turn broken down in morphemes, with each morpheme described on
four levels. (A free translation of the line is also provided, but we will ignore this
information for now.)
Table 15.1 Description of Shoebox Configuration in Figure 15.3.2
Field Marker  Description
\t            The surface form of the morphemes in the word.
\u            The underlying form of the morphemes in the word.
\g            A gloss of the morpheme’s meaning.
\c            The category that the morpheme belongs to (noun, prefix, suffix, etc.).
[Figure 15.3.2: interlinearized Rotokas text]
tarai-pa-vi-ei -pa -vio -ei vao vao vo vo vokioia. voki -
\g NEG like_this understand -PROG -1.PL.INCL -PRES DEM.3SG.N this/here day/ time -
\p N.N ??? VI -SUFF.V.3 -SUFF.VI.4 -SUFF.VI.5 DEM ??? N.N -
\f Yumi no save long dispela taim.
    def setSurfaceForm(self, sf) :
        self.tiers[0] = sf
    def setUnderlyingForm(self, uf) :
        self.tiers[1] = uf
    def setGloss(self, g) :
        self.tiers[2] = g
    def setCategory(self, c) :
        self.tiers[3] = c
The basic idea is that each tier is a different index in the list contained within
the object, starting from 0. This numbering scheme is shown in table 15.2 for the
morpheme vi, the third morpheme from the Rotokas word taraipaviei from
figure 15.3.2.
Now imagine that a programmer decides that storing tiers as lists is undesirable
for some reason, perhaps because lists are not particularly mnemonic, given that
the association between a tier and its position in the list (that is, its index)
is arbitrary. (If you want to expand on the morpheme object by adding a new method
and somewhere in that new method you need to reference the gloss, you would have
to remember that its index is 2.) A more mnemonic alternative is to represent the
tiers as key-value pairs in a dictionary (see §?? for more information about this
type of data structure), using the single-letter Shoebox field marker as the key for
a particular tier’s data. The same morpheme from table 15.2 is broken down into
key-value pairs, as shown in table 15.3.
This change in the implementation of the object need not affect its inter-
face, since it can be hidden away (i.e., encapsulated) behind accessor methods,
as shown by the code in (209).
(209) ooEncapsulationEx2.py
# -----------------------------------
# Definition of Morpheme Class (v. 2)
# -----------------------------------
class Morpheme :
    # Constructor
    def __init__(self) :
        self.tiers = dict()

    # Getter Methods (keys are the field markers of Table 15.1)
    def getSurfaceForm(self) :
        return self.tiers['t']
    def getUnderlyingForm(self) :
        return self.tiers['u']
    def getGloss(self) :
        return self.tiers['g']
    def getCategory(self) :
        return self.tiers['c']

    # Setter Methods
    def setSurfaceForm(self, sf) :
        self.tiers['t'] = sf
    def setUnderlyingForm(self, uf) :
        self.tiers['u'] = uf
    def setGloss(self, g) :
        self.tiers['g'] = g
    def setCategory(self, c) :
        self.tiers['c'] = c
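The claim that the interface is unaffected by the change of implementation can be tested directly. The sketch below defines pared-down versions of both implementations (only the gloss tier is shown) and passes each to the same client function. The class names ListMorpheme and DictMorpheme, the function describe, and the way the list is initialized are our assumptions for the sake of illustration.

```python
class ListMorpheme:
    """Tiers stored in a list, with the gloss at index 2."""
    def __init__(self):
        self.tiers = [None, None, None, None]
    def getGloss(self):
        return self.tiers[2]
    def setGloss(self, g):
        self.tiers[2] = g

class DictMorpheme:
    """Tiers stored in a dictionary keyed by field marker."""
    def __init__(self):
        self.tiers = {}
    def getGloss(self):
        return self.tiers['g']
    def setGloss(self, g):
        self.tiers['g'] = g

def describe(morpheme):
    # Client code: relies on the interface only, never on self.tiers.
    return 'gloss = %s' % morpheme.getGloss()

results = []
for cls in (ListMorpheme, DictMorpheme):
    m = cls()
    m.setGloss('PROG')
    results.append(describe(m))
print(results)
```

Because describe() never looks inside the object, it gives the same answer whichever implementation it is handed.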
For simplicity’s sake, let’s just say that a word consists of morphemes. Since
the morphemes are ordered in sequence, we’ll use a list to store them. We can
now define a Word class that will provide a list of morphemes and a method for
accessing them, as in (211).
(211) ooEncapsulationEx4.py
class Word :
    def __init__(self) :
        self.morphemes = []
    def addMorpheme(self, morpheme) :
        self.morphemes.append(morpheme)
    def getMorpheme(self, i) :
        return self.morphemes[i]
Note that users of the class are not meant to access the list of morphemes
directly. Instead, we have defined a method that provides access to a particular
morpheme. This is already an instance of encapsulation, because it shows how we
use an interface (in this case, a very simple one consisting of only two methods)
to restrict a user’s access to an object’s internal workings.
To complete our example, we can bring the Morpheme and Word classes together to
provide an object-oriented representation of the word taraipaviei from
figure 15.3.2.
(212) ooEncapsulationEx5.py
w = Word()

m = Morpheme()
m.setSurfaceForm('tarai')
m.setUnderlyingForm('tarai')
m.setGloss('understand')
m.setCategory('VI')
w.addMorpheme(m)

m = Morpheme()                  # each morpheme needs its own object
m.setSurfaceForm('pa')
m.setUnderlyingForm('pa')
m.setGloss('PROG')
m.setCategory('SUFF???')
w.addMorpheme(m)

m = Morpheme()
m.setSurfaceForm('vi')
m.setUnderlyingForm('vio')
m.setGloss('???')
m.setCategory('SUFF???')
w.addMorpheme(m)

m = Morpheme()
m.setSurfaceForm('ei')
m.setUnderlyingForm('ei')
m.setGloss('PRES')
m.setCategory('SUFF???')
w.addMorpheme(m)
15.3.3 Inheritance
Inheritance refers to the technique of organizing classes into hierarchies such
that classes lower in the hierarchy automatically share something in common with
classes higher in the hierarchy. Inheritance provides a way of writing less code,
since the sharing of code obviates the need for its duplication through
copy-and-pasting, and of reducing the likelihood of errors, since changes can be
made at one point in the inheritance hierarchy, where they will automatically
trickle downwards. (Unlike former American president Ronald Reagan’s
trickle-down economics, trickle-down programming isn’t voodoo. It actually works!)
The basic idea is straightforward. We can illustrate the usefulness of
inheritance by imagining how we might design a bibliographic program. A
bibliographic database keeps track of references, but references are not all of
the same type. There are books, journal articles, manuscripts, etc. Some data
fields are common to all types (for example, a title), but others are unique to
particular types (for example, books have publishers, but manuscripts don’t).
Without inheritance, we would have to write a great deal of duplicate code. To
illustrate, look at the definition of three classes in (213) for three different
reference types: books, journal articles, and manuscripts. It does not make use
of inheritance.
(213) ooInheritanceEx1.py
# -----------------------------------------------
# CLASS: Book
# -----------------------------------------------
class Book :
    def __init__(self) :
        self._id = None
        self._publisher = None
        self._title = None
        self._where_published = None
        self._year = None
    def identifier(self) :
        return self._id
    def publisher(self) :
        return self._publisher
    def title(self) :
        return self._title
    def where_published(self) :
        return self._where_published
    def year(self) :
        return self._year


# -----------------------------------------------
# CLASS: JournalArticle
# -----------------------------------------------
class JournalArticle :
    def __init__(self) :
        self._id = None
        self._journal = None
        self._title = None
        self._year = None
    def identifier(self) :
        return self._id
    def journal(self) :
        return self._journal
    def title(self) :
        return self._title
    def year(self) :
        return self._year


# -----------------------------------------------
# CLASS: Manuscript
# -----------------------------------------------
class Manuscript :
    def __init__(self) :
        self._id = None
        self._title = None
        self._year = None
    def identifier(self) :
        return self._id
    def title(self) :
        return self._title
    def year(self) :
        return self._year
There is a good deal of overlap in these three classes. For example, each class
has an accessor method for its unique identifier number, its title, and the year in
which it was written (in the case of manuscripts) or published (in all other cases).
By rewriting the same three accessor methods for each class, we introduce the
possibility of copy-and-paste errors. Furthermore, if at some point the workings
of a common method need to change, that change must be made for each class, which
is time-consuming and error-prone. Inheritance simplifies the task of factoring out
similarities, as illustrated by (214).
(214) ooInheritanceEx2.py
# -----------------------------------------------
# CLASS: Reference
# Parent of: Book, JournalArticle, Manuscript
# -----------------------------------------------
class Reference :
    def __init__(self, title, year) :
        self.id = None
        self.title = title
        self.year = year
    def get_identifier(self) :
        return self.id
    def get_title(self) :
        return self.title
    def get_year(self) :
        return self.year
    def set_identifier(self, id) :
        self.id = id
    def set_title(self, title) :
        self.title = title
    def set_year(self, year) :
        self.year = year

# -----------------------------------------------
# Child Classes
# -----------------------------------------------

# Definition of a class for books
class Book(Reference) :
    def __init__(self, title, year) :
        Reference.__init__(self, title, year)

# Definition of a class for journal articles
class JournalArticle(Reference) :
    def __init__(self, title, year) :
        Reference.__init__(self, title, year)
        self._journal = None
    def journal(self) :
        return self._journal

# Definition of a class for manuscripts
class Manuscript(Reference) :
    def __init__(self, title, year) :
        Reference.__init__(self, title, year)
The methods relating to identifier, title, and year have been placed within
the Reference class, which the three classes Book, JournalArticle, and
Manuscript all extend or inherit from. Using a genealogical metaphor, we
could alternatively say that the Reference class is the parent class for the child
classes Book, JournalArticle, and Manuscript.
The only downside of inheritance is that it introduces some conceptual
complexity. It becomes necessary to keep track of inheritance hierarchies, but
this is a fairly small price to pay for the benefits that inheritance provides.
The relationship between the various classes (i.e., the inheritance hierarchy)
can be illustrated diagrammatically, as in Figure 15.4.
[Figure 15.4: inheritance hierarchy with Reference as the parent class of Book,
JournalArticle, and Manuscript]
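The parent-child relationships described above can also be inspected from within Python. The following is a minimal sketch rather than one of the book's own listings: the classes are cut-down stand-ins for those in the bibliography example, the title and year values are placeholders, and the print calls are written with parentheses on the assumption that the code is run under Python 3.

```python
# Minimal stand-ins for the classes in the bibliography example
class Reference:
    def __init__(self, title, year):
        self.title = title
        self.year = year

    def get_title(self):
        return self.title


class Book(Reference):
    pass


class JournalArticle(Reference):
    pass


class Manuscript(Reference):
    pass


# __bases__ lists the direct parent(s) of a class
print(Book.__bases__)

# issubclass() tests the parent/child relationship between classes
print(issubclass(Manuscript, Reference))

# A child instance inherits get_title() from Reference
ms = Manuscript("xxx", "2003")
print(ms.get_title())
print(isinstance(ms, Reference))
```

Because Manuscript defines no methods of its own, the call to get_title() is answered entirely by the parent class.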
As Figure 15.4 shows, each class in our bibliography example has only one
parent class, but conceptually there is no reason why this must be the case. Before
continuing, we need some terminology. The term single inheritance refers to
a situation where a class has only one parent class, whereas the term multiple
inheritance is used to refer to a situation where a class has more than one parent
class. Python allows multiple inheritance, even though some OO programming
languages do not (for example, Java). The problem with multiple inheritance is
that when a class has two parent classes and each defines the same method, there
needs to be some way of determining which is operative, and this adds some
conceptual complexity, and potential confusion, to inheritance. (This is why it is
barred in some programming languages.) An example should clarify.
Recall that in the object model for our bibliography, there were two reference
types that had editors: EditedBook and BookArticle. All of the reference types
have authors, with the exception of EditedBook. It would be nice if the
common code could be pulled out and shared between the two classes. However, an
edited book has only editors whereas a book article has both authors and editors.
The ideal situation is to have authors and editors available if they are needed and
only if they are needed. There are a number of ways to handle this, but multiple
inheritance provides one solution, as shown by the code in (215).
(215) ooInheritanceEx3.py
#-----------------------------------
# Person class
#-----------------------------------
class Person :
    def __init__(self, fname, lname) :
        self.fname = fname
        self.lname = lname
    def get_first_name(self) :
        return self.fname
    def set_first_name(self, fname) :
        self.fname = fname
    def get_last_name(self) :
        return self.lname
    def set_last_name(self, lname) :
        self.lname = lname
    def get_first_initial(self) :
        return self.fname[0] + "."
    def name_lastfirst(self) :
        return "%s, %s" % (self.get_last_name(), self.get_first_initial())

# -----------------------------------------------
# Parent Reference Class
# -----------------------------------------------
class Reference :
    def __init__(self, year, title) :
        self.id = None
        self.title = title
        self.year = year
    def identifier(self) :
        return self.id
    def get_title(self) :
        return self.title
    def get_year(self) :
        return self.year
    def set_year(self, year) :
        self.year = year

class AuthoredWork :
    def __init__(self) :
        self.authors = []
    def get_authors(self) :
        return self.authors
    def add_author(self, author) :
        self.authors.append(author)
    def first_author(self) :
        return self.authors[0].get_last_name()
    def pcitation(self) :
        return "(%s, %s)" % (self.first_author(), self.get_year())

class EditedWork :
    def __init__(self) :
        self.editors = []
    def get_editors(self) :
        return self.editors
    def add_editor(self, editor) :
        self.editors.append(editor)
    def first_editor(self) :
        return self.editors[0].get_last_name()
    def pcitation(self) :
        return "(%s, %s)" % (self.first_editor(), self.get_year())

# -----------------------------------------------
# Child Classes
# -----------------------------------------------
class Book(Reference, AuthoredWork) :
    def __init__(self, year, title) :
        Reference.__init__(self, year, title)
        AuthoredWork.__init__(self)

class JournalArticle(Reference, AuthoredWork) :
    def __init__(self, year, title) :
        Reference.__init__(self, year, title)
        AuthoredWork.__init__(self)
        self.journal = None
    def get_journal(self) :
        return self.journal

class Manuscript(Reference, AuthoredWork) :
    def __init__(self, year, title) :
        Reference.__init__(self, year, title)
        AuthoredWork.__init__(self)
        self.status = None

class EditedBook(Reference, EditedWork) :
    def __init__(self, year, title) :
        Reference.__init__(self, year, title)
        EditedWork.__init__(self)
    def first_author(self) :
        return self.editors[0].get_last_name()

class BookArticle(Reference, AuthoredWork, EditedWork) :
    def __init__(self, year, title, book_title) :
        Reference.__init__(self, year, title)
        AuthoredWork.__init__(self)
        EditedWork.__init__(self)
        self.set_book_title(book_title)
    def get_book_title(self) :
        return self.book_title
    def set_book_title(self, book_title) :
        self.book_title = book_title


# -----------------------------------------------
# Create 3 reference objects of different types
# and print the output of their pcitation()
# methods
# -----------------------------------------------

# Create person objects for author and editor
au = Person('Stephen', 'Levinson')
ed = Person('Melissa', 'Bowerman')

# JournalArticle, only has an author
ja = JournalArticle('2003', 'xxx')
ja.add_author(au)
print(ja.pcitation())

# EditedBook, only has an editor
eb = EditedBook('200X', 'Language and Conceptual Development')
eb.add_editor(ed)
print(eb.pcitation())

# BookArticle, has author and editor
ba = BookArticle('200X', 'Language and Mind: Let\'s Get the Issues Straight',
                 'Language and Conceptual Development')
ba.add_author(au)      # author is also editor
ba.add_editor(au)
ba.add_editor(ed)
print(ba.pcitation())
The code in (215) sets up the inheritance hierarchy shown in Figure 15.5, which
is obviously more complicated than the one previously given in Figure 15.4.
[Figure 15.5: inheritance hierarchy with the parent classes Reference,
EditedWork, and AuthoredWork]
Although the inheritance hierarchy is slightly more complicated, the main
advantage is that we have a more sophisticated way of doling out variables and
methods for manipulating them. No reference has authors unless it needs them,
and no reference has editors unless it needs them. But every reference that uses
either does so in exactly the same manner, thanks to the inheritance of attributes
and methods. For more details concerning how inheritance works, and in particular
for more information about how multiple inheritance is handled, see §??.
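To make the question of which parent's method is operative more concrete, here is a small sketch. It is not taken from the book's listings: the two parent classes are reduced to a single method each, the ReversedBookArticle class is hypothetical, and the classes derive from object on the assumption that the method resolution order is being inspected through the `__mro__` attribute.

```python
class AuthoredWork(object):
    def pcitation(self):
        return "cited by first author"


class EditedWork(object):
    def pcitation(self):
        return "cited by first editor"


# Both parents define pcitation(); the order of the parents decides
class BookArticle(AuthoredWork, EditedWork):
    pass


# Hypothetical class: same parents, listed in the opposite order
class ReversedBookArticle(EditedWork, AuthoredWork):
    pass


print(BookArticle().pcitation())
print(ReversedBookArticle().pcitation())

# The method resolution order (MRO) lists the classes in search order
print([cls.__name__ for cls in BookArticle.__mro__])
```

Python searches the parent classes from left to right as they appear in the class statement, so simply reordering the parents changes which pcitation() a book article ends up with.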
So far, we have seen how inheritance gives the programmer a way to avoid
writing the same method twice by defining it once for a parent class and letting
its child classes acquire it automatically. However, there will be cases where
the behavior of the child classes needs to depart from that of the parent class. To
maintain the same interface but achieve different behavior, it is possible for a child
class to override the method of a parent class. To say that a method is overridden is
really just a fancy way of saying that it is redefined. But what is the rationale for
overriding a method? There are many, but we will illustrate with an example from
our bibliographic classes. Imagine that we want to create a full-fledged system
for tracking references (say, along the lines of EndNote). We will need to flesh
out our object model by adding an attribute for the authors and/or editors of a
reference. Most reference types have authors, some have editors, and a few have
both. (None should have neither.) Here we can see how objects interact with one
another, since authors and editors can be represented by the Person object defined
previously in (207).
(216) ooInheritanceEx4.py
#-----------------------------------
# Person class
#-----------------------------------
class Person :
    def __init__(self) :
        self._fname = None
        self._lname = None
    def set_first_name(self, fname) :
        self._fname = fname
    def set_last_name(self, lname) :
        self._lname = lname
    def get_first_name(self) :
        return self._fname
    def get_last_name(self) :
        return self._lname
    def get_first_initial(self) :
        return self._fname[0] + "."
    def name_lastfirst(self) :
        return self.get_last_name() + ", " + self.get_first_initial()

# -----------------------------------------------
# Parent Reference Class
# -----------------------------------------------
class Reference :
    def __init__(self) :
        self._id = None
        self._title = None
        self._year = None
        self._authors = []
        self._editors = []
    def get_authors(self) :
        return self._authors
    def get_editors(self) :
        return self._editors
    def add_author(self, author) :
        self._authors.append(author)
    def add_editor(self, editor) :
        self._editors.append(editor)
    def identifier(self) :
        return self._id
    def title(self) :
        return self._title
    def year(self) :
        return self._year
    def set_title(self, title) :
        self._title = title
    def set_year(self, year) :
        self._year = year
    def pcitation(self) :
        # Default behavior: cite by the last name of the first author
        return "(" + self.get_authors()[0].get_last_name() + ", " + str(self.year()) + ")"

# -----------------------------------------------
# Child Classes
# -----------------------------------------------
class Book(Reference) :
    def __init__(self) :
        Reference.__init__(self)
        self._publisher = None
        self._where_published = None
    def publisher(self) :
        return self._publisher
    def where_published(self) :
        return self._where_published

class JournalArticle(Reference) :
    def __init__(self) :
        Reference.__init__(self)
        self._journal = None
    def journal(self) :
        return self._journal

class Manuscript(Reference) :
    def __init__(self) :
        Reference.__init__(self)

class EditedBook(Reference) :
    def __init__(self) :
        Reference.__init__(self)
    def pcitation(self) :
        # Overrides the parent method: cite by the first editor instead
        return "(" + self.get_editors()[0].get_last_name() + ", " + str(self.year()) + ")"

class BookArticle(Reference) :
    def __init__(self) :
        Reference.__init__(self)
Now imagine that we want to create a method that will provide a parenthetical
citation for a given reference. Let’s assume that the desired formatting is the last
name of the first author (followed by et al. if there is more than one author)
followed by the year of publication, with a comma separating the two, as
illustrated in (??). (This is more or less the standard format for linguistic journals.)
1. Here is an example of a parenthetical citation with a single author (???,
2006).
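The formatting rule just described, including the et al. case, can be sketched as a small stand-alone function. This is only an illustration under stated assumptions: the function name parenthetical_citation is hypothetical, authors are passed as a plain list of last names rather than as Person objects, and the names and years are sample values.

```python
def parenthetical_citation(last_names, year):
    """Format a citation such as (Smith, 2006) or (Smith et al., 2006)."""
    if not last_names:
        raise ValueError("a citation needs at least one name")
    name = last_names[0]
    # More than one author: abbreviate the rest as "et al."
    if len(last_names) > 1:
        name = name + " et al."
    return "(%s, %s)" % (name, year)


print(parenthetical_citation(["Levinson"], "2003"))
print(parenthetical_citation(["Bowerman", "Levinson"], "1996"))
```

In the class-based design, the same logic would live inside the pcitation() method and read the names from the reference's own list of authors.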
The code in (217) demonstrates how reference objects of different types can be
created, have their attributes populated, and then have their parenthetical
citations printed out for display. (Note that we are manually setting the
attributes of the various reference objects. We will see later in chapter 20 how
they could be automatically populated from a database.)
(217) ooInheritanceEx5.py
# -----------------------------------------------
# Create person objects for author and editor
# and populate their attributes manually
# -----------------------------------------------
au = Person()
au.set_first_name('Stuart')
# au.set_middle_name('Payton')   # Person in (216) defines no middle-name setter
au.set_last_name('Robinson')

ed = Person()
ed.set_first_name('Timothy')
ed.set_last_name('Shopen')


# -----------------------------------------------
# Create 3 reference objects of different types
# and print the output of their pcitation()
# methods
# -----------------------------------------------

# JournalArticle, only has an author
ja = JournalArticle()
ja.add_author(au)
ja.set_year('2003')
ja.set_title('Constituent Order in Tenejapa Tzeltal')
print(ja.pcitation())

# EditedBook, only has an editor
eb = EditedBook()
eb.add_editor(ed)
eb.set_year('1996')
eb.set_title('Language Typology and Syntactic Description')
print(eb.pcitation())

# BookArticle, has author and editor
ba = BookArticle()
ba.add_author(au)
ba.add_editor(ed)
ba.set_year('????')
ba.set_title('xxx')
print(ba.pcitation())
The pcitation() methods of the JournalArticle and BookArticle classes provide
the last name of the first author, while the pcitation() method of the
EditedBook class provides the last name of its first editor. The behavior of the
EditedBook class differs from the others because its pcitation() method has been
overridden and its behavior redefined.
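Stripped of everything else, the overriding mechanism looks like the sketch below. It is a simplified illustration, not the book's listing: the classes are cut down to one method, and authors and editors are stored as plain last-name strings (an assumption made for brevity) rather than as Person objects.

```python
class Reference(object):
    def __init__(self, year):
        self.year = year
        self.authors = []
        self.editors = []

    def pcitation(self):
        # Default behavior: cite by the first author
        return "(%s, %s)" % (self.authors[0], self.year)


class EditedBook(Reference):
    def pcitation(self):
        # Overrides Reference.pcitation(): cite by the first editor
        return "(%s, %s)" % (self.editors[0], self.year)


class JournalArticle(Reference):
    pass  # inherits pcitation() unchanged


ja = JournalArticle("2003")
ja.authors.append("Robinson")
eb = EditedBook("1996")
eb.editors.append("Shopen")

# Same method name, different behavior depending on the class
for ref in (ja, eb):
    print(ref.pcitation())
```

The calling code never needs to know which kind of reference it is holding; it simply calls pcitation() and each object responds according to its own class.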
15.3.4 Modularity
The concept of modularity in OO programming refers to the idea of grouping
code together into discrete units, called modules, which can be optionally included
in a program to provide well-defined functional bundles. It does not refer to the
discreteness of objects, which is so central to the entire idea behind object
orientation that it usually goes without saying, but rather to the grouping of
related entities (objects) into larger units. The partitioning of a system into
modules has a number of advantages, not the least of which is the ability to
include or exclude modules according to one’s needs.
The concept of modularity runs deep in Python, as shown by the fact
that the core library of the language is organized into modules, which provide
thematically related functionality. For example, the datetime module provides
a variety of class types that facilitate the manipulation of dates and times. The
code in (220) shows how the datetime module can be used in a script to
calculate a person’s current age given their date of birth. (Note that the year,
month, and day of the birthdate are passed to the script as arguments, e.g.,
ooModularityEx1.py 1973 10 28.)
(220) ooModularityEx1.py
import sys
from datetime import date
# Command-line arguments arrive as strings
year = sys.argv[1]
month = sys.argv[2]
day = sys.argv[3]
print("Birthdate: %s-%s-%s" % (year, month, day))
# date() requires integers, so convert the arguments
bday = date(int(year), int(month), int(day))
age = date.today() - bday      # a timedelta object
print(age)
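Subtracting one date from another yields a timedelta, which is measured in days rather than years. If an age in whole years is wanted, one possible approach is sketched below. The function name age_in_years and the reference dates in 2006 are hypothetical, chosen only for illustration.

```python
from datetime import date


def age_in_years(birthdate, today=None):
    """Return the number of completed years between birthdate and today."""
    if today is None:
        today = date.today()
    years = today.year - birthdate.year
    # Subtract one if this year's birthday has not yet occurred
    if (today.month, today.day) < (birthdate.month, birthdate.day):
        years -= 1
    return years


bday = date(1973, 10, 28)

# The raw difference is a timedelta; its days attribute is an integer
delta = date(2006, 10, 27) - bday
print(delta.days)

# The day before and the day of the birthday in 2006
print(age_in_years(bday, date(2006, 10, 27)))
print(age_in_years(bday, date(2006, 10, 28)))
```

Comparing (month, day) pairs avoids dividing a day count by 365, which would drift because of leap years.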
To continue with our bibliography example, we can put all of the code that
defines the various bibliography classes into a single file, which we’ll call ???.
(219) references.py
#-----------------------------------
# Person class
#-----------------------------------
class Person :
    def __init__(self, fname, lname) :
        self.fname = fname
        self.lname = lname
    def get_first_name(self) :
        return self.fname
    def set_first_name(self, fname) :
        self.fname = fname
    def get_last_name(self) :
        return self.lname
    def set_last_name(self, lname) :
        self.lname = lname
    def get_first_initial(self) :
        return self.fname[0] + "."
    def name_lastfirst(self) :
        return "%s, %s" % (self.get_last_name(), self.get_first_initial())

# -----------------------------------------------
# Parent Reference Class
# -----------------------------------------------
class Reference :
    def __init__(self, year, title) :
        self.id = None
        self.title = title
        self.year = year
    def identifier(self) :
        return self.id
    def get_title(self) :
        return self.title
    def get_year(self) :
        return self.year
    def set_year(self, year) :
        self.year = year

class AuthoredWork :
    def __init__(self) :
        self.authors = []
    def get_authors(self) :
        return self.authors
    def add_author(self, author) :
        self.authors.append(author)
    def first_author(self) :
        return self.authors[0].get_last_name()
    def pcitation(self) :
        return "(%s, %s)" % (self.authors[0].get_last_name(), self.get_year())

class EditedWork :
    def __init__(self) :
        self.editors = []
    def get_editors(self) :
        return self.editors
    def add_editor(self, editor) :
        self.editors.append(editor)
    def first_editor(self) :
        return self.editors[0].get_last_name()
    def pcitation(self) :
        return "(%s, %s)" % (self.editors[0].get_last_name(), self.get_year())

# -----------------------------------------------
# Child Classes
# -----------------------------------------------
class Book(Reference, AuthoredWork) :
    def __init__(self, year, title) :
        Reference.__init__(self, year, title)
        AuthoredWork.__init__(self)

class JournalArticle(Reference, AuthoredWork) :
    def __init__(self, year, title) :
        Reference.__init__(self, year, title)
        AuthoredWork.__init__(self)
        self.journal = None
    def get_journal(self) :
        return self.journal

class Manuscript(Reference, AuthoredWork) :
    def __init__(self, year, title) :
        Reference.__init__(self, year, title)
        AuthoredWork.__init__(self)
        self.status = None

class EditedBook(Reference, EditedWork) :
    def __init__(self, year, title) :
        Reference.__init__(self, year, title)
        EditedWork.__init__(self)
    def first_author(self) :
        return self.editors[0].get_last_name()

class BookArticle(Reference, AuthoredWork, EditedWork) :
    def __init__(self, year, title, book_title) :
        Reference.__init__(self, year, title)
        AuthoredWork.__init__(self)
        EditedWork.__init__(self)
        self.set_book_title(book_title)
    def get_book_title(self) :
        return self.book_title
    def set_book_title(self, book_title) :
        self.book_title = book_title
Other programs can then access the code using an import statement, as
illustrated in (??).
For more details about modules and how to import them, see chapter 13,
which goes into the topic in considerable detail.
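As a self-contained sketch of how code in one file becomes available to another, the example below writes a tiny module to a temporary directory and then imports a class from it. Everything here is an assumption made for illustration: the module name mini_references is hypothetical, the Person class is a cut-down stand-in, and in ordinary use the module file would simply sit next to the importing script, making the temporary directory and the change to sys.path unnecessary.

```python
import os
import sys
import tempfile

# Write a tiny stand-in module (hypothetical name: mini_references.py)
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "mini_references.py"), "w") as f:
    f.write(
        "class Person:\n"
        "    def __init__(self, fname, lname):\n"
        "        self.fname = fname\n"
        "        self.lname = lname\n"
        "    def get_last_name(self):\n"
        "        return self.lname\n"
    )

# Python searches the directories listed in sys.path for modules
sys.path.insert(0, module_dir)

# Import one class from the module by name
from mini_references import Person

p = Person("Melissa", "Bowerman")
print(p.get_last_name())
```

The import statement names the module by its file name without the .py extension, and only the names listed after import are brought into the importing script.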
15.3.5 Typing
The concept of typing refers to the way that objects belong to types. To say that
an object belongs to a type is a fairly trivial claim. Since classes define objects
and each object is created on the basis of its class definition, every object therefore
belongs to a type, namely its class.
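The claim that every object belongs to a type can be checked directly with the built-in functions type() and isinstance(). The following sketch uses an empty, hypothetical Person class and a sample string; it is meant only to show the two functions at work.

```python
class Person(object):
    pass


p = Person()

# type() reports the class an object was created from
print(type(p) is Person)

# Built-in values are objects with types too
print(type("tarai").__name__)
print(type([]).__name__)

# isinstance() asks whether an object belongs to a given type
print(isinstance(p, Person))
print(isinstance("tarai", Person))
```

The isinstance() function is generally the more useful of the two, because it also accepts objects created from child classes of the class being tested.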
While some OO languages are very strictly typed, Python is more lenient. It is
unnecessary to declare the type of a method’s argument, as we have already seen
in §14, where we defined a method for the Reference class that takes a Person
object as its argument. The advantage of not strictly enforcing typing is flexibility,
but the drawback is unpredictability. For example, if a programmer misunderstood
the nature of the add_author() method and passed a string, rather than a Person
object, as an argument, the error would not be immediately caught. In fact,
the call to add_author() would raise no error at all; the problem would surface
only later, when the pcitation() method is called, as illustrated in (221).
(221) ooTypingEx1.py
from references import Book

r = Book('2003', 'Text Processing in Python')
r.add_author('David Mertz')
print(r.pcitation())
The problem arises because a string object was appended to the authors list
when a Person object was expected. When the method pcitation() is called,
it attempts to retrieve a Person object and call get_last_name() on it. However,
since a string rather than a Person object is retrieved, and strings have no
get_last_name() method, the result is an error.
If type-checking is desired, there are mechanisms available for its enforcement
in Python. Exception raising can be used to prevent the misuse of methods that
expect a particular argument type. For example, the add_author() method
could be revised to check the type of its single argument and to raise an exception
if the argument is not of the expected type, as shown in (??).
(222) ooTypingEx2.py
from references import Person

class Reference :
    def __init__(self) :
        self.authors = []
    def add_author(self, author) :
        # Accept Person objects only
        if isinstance(author, Person) :
            self.authors.append(author)
        else :
            raise TypeError("The argument of add_author must be a Person object.")

r = Reference()
r.add_author('Stuart P. Robinson')
Now, when (222) is run, an error results as soon as the add_author()
method is invoked with an unexpected argument type, as shown in (223).
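The effect of such a check can be observed by catching the exception. The sketch below is self-contained and makes its own assumptions: the Person and Reference classes are minimal stand-ins rather than the ones in references.py, and TypeError is used as the exception type because it is Python's built-in exception for arguments of the wrong type.

```python
class Person(object):
    def __init__(self, fname, lname):
        self.fname = fname
        self.lname = lname


class Reference(object):
    def __init__(self):
        self.authors = []

    def add_author(self, author):
        # Reject anything that is not a Person before storing it
        if not isinstance(author, Person):
            raise TypeError("The argument of add_author must be a Person object.")
        self.authors.append(author)


r = Reference()
r.add_author(Person("Stuart", "Robinson"))   # accepted

try:
    r.add_author("Stuart P. Robinson")       # rejected: a string
except TypeError as err:
    message = str(err)
    print("Error: " + message)

# Only the valid author was stored
print(len(r.authors))
```

The mistake is now reported at the point where it is made, rather than later when pcitation() tries to use the bad value.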
15.3.6 Polymorphism
Polymorphism refers to the way that a method with a single name can take a
different number of arguments. [XXX] An example should make this concept
clearer. Returning to the example discussed in §15.1, let’s imagine how we could
change the script that takes the data file and prints out each person’s name with
the last name followed by the initial of the first name.
(224) ooPolymorphismEx1.py
import sys
from datetime import date

#-----------------------------------
# Define Person class
#-----------------------------------
class Person :
    def __init__(self) :
        self._fname = None
        self._lname = None
        self._bdate = None
    def set_first_name(self, fname) :
        self._fname = fname
    def set_last_name(self, lname) :
        self._lname = lname
    def set_birthdate(self, bdate) :
        self._bdate = bdate
    def get_first_name(self) :
        return self._fname
    def get_last_name(self) :
        return self._lname
    def get_birthdate(self) :
        return self._bdate
    def age(self) :
        # Age in days, as of today
        age = date.today() - self.get_birthdate()
        return age.days

#-----------------------------------
# Process data file
#-----------------------------------

# Filename is command-line arg
filename = sys.argv[1]

# Open file
f = open(filename)

# Read in lines
lines = f.readlines()

# Process lines
people = []
for l in lines :
    cols = l.split("\t")

    # Data columns
    fname = cols[0]
    lname = cols[1]
    bdayParts = cols[3].split('-')
    # date() requires integers
    year = int(bdayParts[0])
    month = int(bdayParts[1])
    day = int(bdayParts[2])
    bday = date(year, month, day)

    p = Person()
    p.set_first_name(fname)
    p.set_last_name(lname)
    p.set_birthdate(bday)
    people.append(p)

# Extract data
for p in people :
    print(p.get_last_name() + ", " +
          p.get_first_name()[0] + "." +
          "\t" + str(p.age()))
The age() method assumes that age is calculated by taking the difference
between the current date and a person’s birthdate. However, at some point we
might want to be able to determine a person’s age on a particular date in the past.
For the sake of our example, let’s assume that we want to know a person’s age in
19XX (the year that the initial version of Python was first publicly released).
(225) ooPolymorphismEx2.py
import sys
from datetime import date

#-----------------------------------
# Define Person class
#-----------------------------------
class Person :
    def __init__(self) :
        self._bdate = None
        self._fname = None
        self._lname = None
    def setFirstName(self, fname) :
        self._fname = fname
    def setLastName(self, lname) :
        self._lname = lname
    def setBirthdate(self, bdate) :
        self._bdate = bdate
    def getFirstName(self) :
        return self._fname
    def getLastName(self) :
        return self._lname
    def getBirthdate(self) :
        return self._bdate
    def age(self, on_date=None) :
        # Default to today, evaluated at call time rather than
        # once when the method is defined
        if on_date is None :
            on_date = date.today()
        age = on_date - self.getBirthdate()
        return age.days


#-----------------------------------
# Process data file
#-----------------------------------

# Filename is command-line arg
filename = sys.argv[1]

# Open file
f = open(filename)

# Read in lines
lines = f.readlines()

# Process lines
people = []
for l in lines :
    cols = l.split("\t")

    # Data columns
    fname = cols[0]
    lname = cols[1]
    bdayParts = cols[3].split('-')
    # date() requires integers
    year = int(bdayParts[0])
    month = int(bdayParts[1])
    day = int(bdayParts[2])
    bday = date(year, month, day)

    p = Person()
    p.setFirstName(fname)
    p.setLastName(lname)
    p.setBirthdate(bday)
    people.append(p)

# Extract data
for p in people :
    # The '??' values are unfilled placeholders; date() needs an
    # integer year, month, and day
    refDate = date('1973', '??', '??')
    print("%s, %s\t%s" % (p.getFirstName(), p.getLastName(), p.age(refDate)))
When no date is passed to the age() method, the age is calculated using the
current date. However, when the age() method takes an argument, the age is
calculated using the date supplied to it.
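One detail deserves care when a default value is meant to stand for the current date. A default written directly in the method signature is evaluated only once, when the method is defined, so a long-running program would keep using a stale date. A common remedy, sketched here under the assumption of a cut-down Person class and a parameter named on_date (both hypothetical), is to default to None and look up the current date inside the method.

```python
from datetime import date


class Person(object):
    def __init__(self, birthdate):
        self._bdate = birthdate

    def age(self, on_date=None):
        """Age in days on on_date (today if no date is given)."""
        if on_date is None:
            # Looked up at call time, not at definition time
            on_date = date.today()
        return (on_date - self._bdate).days


p = Person(date(1973, 10, 28))

# One argument: age on the supplied date
print(p.age(date(1973, 11, 28)))

# No argument: age as of today
print(p.age())
```

Naming the parameter on_date rather than date also avoids hiding the date class imported from the datetime module.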
15.4 Exercises
• Many aspects of linguistic systems exemplify some sort of inheritance
hierarchy. What are some examples? Do these show single or multiple
inheritance?
• TYPE COERCION
• ???
• ???
• ???
Chapter 16
Classes and Objects
In the previous chapter, we discussed object-oriented programming in fairly broad
terms, without going into the technical details of Python object orientation. In
this chapter, we will get our hands dirty with the nitty-gritty details so that
the abstract knowledge gained about object-oriented programming can be put into
effective practice.
16.1 Defining Classes
p.set_name('Joe Haldeman')    # set variable w/ accessor method
print(p.get_name())           # get variable w/ accessor method
There are a number of things to note concerning the definition of the Person
class. It has a constructor method, which Python requires to be named __init__,
that initializes a single variable, self._name. It also defines two methods, one
for setting the instance variable (a setter method) and another for retrieving it (a
getter method). The idea behind providing methods to access instance variables
is to encapsulate the data (cf. §15.3.2) so that anyone using the class interacts
with the interface defined by the author of the class. (Since they provide access to
variables, getter and setter methods are typically referred to as accessor methods.)
This class could be simplified and improved in a few ways. First, we can use
a single method for both getting and setting our instance variable, thereby making
our code more succinct, as shown in (227). (Succinctness is only a virtue if it does
not sacrifice performance or readability. In this case, we believe it does not.)
(227) classesDefinitionEx2.py
# ----------------------------------------------
# Class Definition
# ----------------------------------------------
class Person :                          # name of the class
    def __init__(self) :                # constructor
        self._name = None               # instance variable
    def name(self, name=None) :         # method
        if name :                       # check argument
            self._name = name           # set if provided
        return self._name               # return variable

# ----------------------------------------------
# Class Usage
# ----------------------------------------------
p = Person()                # instantiate class
p.name('Bruce Lee')         # method call
print(p.name())             # method call
The single accessor method now returns the value of the name variable, set-
ting it beforehand if the method is provided with an argument. (We take advantage
of argument defaults, setting the value of the argument to None by default.) This
class is still less than ideal, however, since it leaves the name variable too unstruc-
tured. A name can be broken down into parts and each of these can be saved as
separate variables, as in (228).
(228) classesDefinitionEx3.py
# ----------------------------------------------
# Class Definition
# ----------------------------------------------
class Person :
    ## Constructor
    def __init__(self) :
        self._fname = None
        self._lname = None

    ## Methods
    def first_name(self, fname=None) :
        if fname :
            self._fname = fname
        return self._fname

    def last_name(self, lname=None) :    # method
        if lname :
            self._lname = lname
        return self._lname

# ----------------------------------------------
# Class Usage
# ----------------------------------------------
p = Person()
p.first_name('Bruce')
p.last_name('Lee')
print p.first_name() + ' ' + p.last_name()
Note that we reconstruct the full name by concatenating the first name and
the last name. There is a better alternative. We can take advantage of object
orientation to format the full name in whatever manner we please by adding a
method that does the formatting automatically and intelligently, as shown in (229).
(229) classesDefinitionEx4.py
# ----------------------------------------------
# Class Definition
# ----------------------------------------------
class Person :

    ## Constructor
    def __init__(self) :
        self._fname = None
        self._lname = None

    ## Methods
    def first_name(self, fname=None) :
        if fname :
            self._fname = fname
        return self._fname

    def last_name(self, lname=None) :
        if lname :
            self._lname = lname
        return self._lname

    def full_name(self) :
        return self._fname + ' ' + self._lname

# ----------------------------------------------
# Class Usage
# ----------------------------------------------
p = Person()
p.first_name('Bruce')
p.last_name('Lee')
print p.full_name()
With a method like full_name, we
save ourselves time when programming (since it’s easier to invoke the method
full_name than to concatenate two method calls) and reduce the likelihood of
inconsistency. We can ensure that names are always formatted in the exact same
manner. In addition, there is no sacrifice of flexibility for consistency, since we can
always define other methods to format the names differently. In fact, it is possible
to define a much more intelligent name-formatting method, as in (230), where
we add a method for middle names and have the full_name method handle it
intelligently (by omitting it if undefined). Note that we have also added a flag that
determines which comes first: the first name or the last name.
(230) classesDefinitionEx5.py
# ----------------------------------------------
# Here the class is defined...
# ----------------------------------------------
class Person :
    ## Constructor
    def __init__(self) :
        self.__fname = None
        self.__lname = None
    ## Methods
    # ...
# ----------------------------------------------
# Here the class is used...
# ----------------------------------------------
p = Person()
p.first_name('Bruce')
p.last_name('Lee')
print p.full_name()
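A name-formatting method of the kind described here could take many forms. The following is only a minimal sketch of one possibility, not the class from (230): the names middle_name, _mname, and the last_first flag are assumptions introduced for illustration, and the code is written in Python 3 syntax (print as a function).

```python
class Person:
    """Hypothetical sketch: optional middle name plus an ordering flag."""

    def __init__(self):
        self._fname = None
        self._mname = None   # assumed name for the middle-name variable
        self._lname = None

    def first_name(self, fname=None):
        if fname:
            self._fname = fname
        return self._fname

    def middle_name(self, mname=None):
        if mname:
            self._mname = mname
        return self._mname

    def last_name(self, lname=None):
        if lname:
            self._lname = lname
        return self._lname

    def full_name(self, last_first=False):
        # Keep only the given names that have actually been set,
        # so an undefined middle name is silently omitted.
        given = [n for n in (self._fname, self._mname) if n]
        if last_first:
            return self._lname + ', ' + ' '.join(given)
        return ' '.join(given + [self._lname])


p = Person()
p.first_name('Ursula')
p.last_name('Le Guin')
print(p.full_name())                  # Ursula Le Guin
p.middle_name('K.')
print(p.full_name(last_first=True))   # Le Guin, Ursula K.
```

Passing the ordering as a keyword argument keeps the default call, p.full_name(), identical to the simpler version in (229).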
16.1.1 Constructors
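A constructor is a method that is called automatically when an instance of a
class is created, typically to give the instance’s variables their initial values.
In Python, the constructor method must be named __init__. We
saw it in operation in numerous previous examples, the most recent being (230).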
16.3 Variables
16.3.1 Instance Variables
The most common type of variable is the instance variable. Instance variables
are variables that are associated with instances of a class. Therefore, their values
potentially differ from one instance of a class to the next. For example, in (231),
we create three separate instances of a custom class called Word.
(231) classesVarInstanceEx1.py
class Word :
    def __init__(self, wordform) :
        self.wordform = wordform
    def getWordForm(self) :
        return self.wordform

for wf in ['chimp', 'orangutan', 'gorilla'] :
    w = Word(wf)
    print w.getWordForm()
16.4 Privacy
Privacy refers to the accessibility of attributes (variables) and methods (functions)
defined for a class. An attribute or method is public if it is accessible from outside
an object; it is private if it is not. (Privacy therefore relates to encapsulation,
one of the major concepts behind object-oriented programming discussed in the
previous chapter.) In Python, variables and methods are public unless their name
begins with a double underscore, which flags them as private.1 Private variables
cannot be reached directly from outside the object, whereas public variables can be
accessed using the dot notation illustrated in (233).
1 Some OO programming languages (e.g., Java) use a keyword to flag variables as private, but
Python’s convention for private variables is much simpler.
(233) classesPrivacyEx1.py
# ----------------------------------------------
# Here the class is defined...
# ----------------------------------------------
class Person :
    def __init__(self, fname, lname) :    # Constructor
        self.fname = fname                # Initialize public var
        self.lname = lname                # Initialize public var

# ----------------------------------------------
# Here the class is used...
# ----------------------------------------------
p = Person('Stuart', 'Robinson')          # Instantiate object
print p.fname                             # Directly access public var
print p.lname                             # Directly access public var
In (233), two instance variables are defined for the Person class: fname
and lname. They are assigned a value by the constructor, which takes two arguments.
There are no accessor methods for these variables, but they can
be directly accessed, as shown by the lines where they are printed out. If, however,
we change the definition of the Person class so that the names of the two variables
begin with double underscores, any attempt to directly access the Person object’s
instance variables will result in an error, as can be seen by running (234).
(234) classesPrivacyEx2.py
# ----------------------------------------------
# Here the class is defined...
# ----------------------------------------------
class Person :
    def __init__(self, fname, lname) :    # Constructor
        self.__fname = fname              # Initialize private var
        self.__lname = lname              # Initialize private var

# ----------------------------------------------
# Here the class is used...
# ----------------------------------------------
p = Person('Stuart', 'Robinson')          # Instantiate object
print p.fname    # !!! Directly access private var !!!
print p.lname    # !!! Directly access private var !!!
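To see the effect of the double underscore in isolation, the following sketch catches the error instead of letting the program stop. It is written in Python 3 syntax, and the variable direct_access is introduced purely for illustration.

```python
class Person:
    def __init__(self, fname, lname):
        self.__fname = fname   # double underscore: flagged as private
        self.__lname = lname

    def get_first_name(self):
        return self.__fname    # access from inside the class works


p = Person('Stuart', 'Robinson')

try:
    p.__fname                  # direct access from outside the class
    direct_access = 'allowed'
except AttributeError:
    direct_access = 'refused'

print(direct_access)           # refused
print(p.get_first_name())      # Stuart
```

The value is still there; it is simply reachable only through the methods that the class chooses to provide.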
(235)
# ----------------------------------------------
# Here the class is defined...
# ----------------------------------------------
class Person :
    def __init__(self, fname, lname) :      # Constructor
        self.__fname = fname                # Initialize private var
        self.__lname = lname                # Initialize private var
    def get_first_name(self) :              # Public accessor method
        return self.__fname.capitalize()    # Return var capitalized
    def get_last_name(self) :               # Public accessor method
        return self.__lname.capitalize()    # Return var capitalized

# ----------------------------------------------
# Here the class is used...
# ----------------------------------------------
p = Person('stuart', 'robinson')            # Instantiate object
print p.get_first_name()                    # Print __fname
print p.get_last_name()                     # Print __lname
Even though the Person object is instantiated in (235) with two names in
all lowercase, when the accessor method is called, the capitalize() string
method is called on the variables before they are returned, thereby ensuring that
they are printed out in the proper case.
16.5 Inheritance
In the previous chapter, we covered the concept of inheritance in object-oriented
programming.
When classes inherit from several parent classes that each define a method of the same name,
as in (236), we see that it is the first parent class listed in their inheritance list
whose method gets called.
(236) classesMultipleInheritanceEx1.py
class A :
    def printout(self) :
        print 'A'

class B :
    def printout(self) :
        print 'B'

class C :
    def printout(self) :
        print 'C'

class D(A, B, C) :
    pass

class E(B, A, C) :
    pass

class F(C, A, B) :
    pass

d = D()
d.printout()

e = E()
e.printout()

f = F()
f.printout()
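The lookup order can also be inspected directly. This sketch assumes Python 3, where every class records the order in which its parents are searched in the __mro__ attribute; the methods return their letter instead of printing it so that the results can be collected on one line.

```python
class A:
    def printout(self):
        return 'A'

class B:
    def printout(self):
        return 'B'

class C:
    def printout(self):
        return 'C'

class D(A, B, C):
    pass

class E(B, A, C):
    pass

class F(C, A, B):
    pass

for cls in (D, E, F):
    # Names of the classes searched, in order, when a method is looked up.
    order = [base.__name__ for base in cls.__mro__]
    print(cls.__name__, cls().printout(), order)
```

In each case the method that runs belongs to the first parent in the list that defines it.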
16.6 Exercises
• ???
• ???
• ???
16.7 Suggested Reading
Chapter 17
An Introduction to Regular
Expressions
This chapter aims to teach the reader about regular expressions, a very powerful
tool for text processing, and how they can be exploited for language research.
Since regular expressions are reasonably standardized, it is possible to introduce
the basics of regular expressions without worrying initially about how they are
handled in Python. This chapter will do just that. It can therefore be skipped by
readers who already have a good familiarity with regular expressions from other
contexts (e.g., the Unix utility grep). In the next chapter, we will look specifically
at how regular expressions are handled by Python.
17.1 What is Text?
- and still there are some misfits who insist that there
is no such thing as progress.
namely, two digits, a space, three or more letters, another space, and four digits,
as illustrated in Figure 17.1.
What we need is a way of describing this pattern, some kind of formula, with
variables that stand for letters, numbers, etc. Regular expressions provide pre-
cisely that, as can be seen from (14), which provides a regular expression that
matches the pattern of (12).
The correspondence between the regular expression in (14) and a few sample
dates is shown in Figure 17.2.
But how does all of this work? The first step on the road to understanding regular
expressions is to get a grip on the difference between two types of characters
found in a regular expression: literal characters and metacharacters. A literal
character stands for itself in a regular expression. In other words, a literal character
is simply a character that matches itself. Therefore, the regular expression in
(15) would match the string star, even if that string is found within a larger string
(such as upstart, starship, star-shaped, or starry).
(15) star
It’s as simple as that. Metacharacters, however, are an entirely different kettle
of fish.1 They are special characters that stand for more than themselves. For
example, a dot (.) doesn’t stand for a dot. Rather, it is a wildcard that stands for
any character. It will match a single letter, number, or punctuation mark.
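The difference between a literal and a metacharacter can be tried out in a few lines. Python’s own handling of regular expressions is the subject of the next chapter, so the following is only a quick sketch in Python 3 syntax; the word list (including the extra word stir) and the pattern st.r are made up for illustration.

```python
import re

words = ['upstart', 'starship', 'star-shaped', 'starry', 'stir']

# Literal characters: 'star' matches only where s, t, a, r occur in a row.
literal_hits = [w for w in words if re.search('star', w)]
print(literal_hits)    # every word except 'stir'

# The dot is a metacharacter: 'st.r' accepts any character in third position.
wildcard_hits = [w for w in words if re.search('st.r', w)]
print(wildcard_hits)   # now 'stir' matches as well
```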
Metacharacters are the real workhorse of regular expressions and this tutorial
is devoted to them. Fortunately, they are few in number (and therefore pretty easy
to memorize). We turn to them now.
17.2 Metacharacters
We already established that regular expressions are strings that capture patterns in
strings. They can be compared to a string to find a match and therefore are used
in doing text searches (among other things). To illustrate how it all works, we
will begin with what seems on the surface of things to be a fairly simple search.
Suppose that we want to search an English text for every instance of the indefinite
article. No doubt the reader has done some sort of search before in a word pro-
cessing program (e.g., Microsoft Word) or a web browser (e.g., Netscape). The
problem is, the indefinite article has more than one form. It can be either a or an. A
simple text search would require two passes, once for a and a second time for an.
To do this, we will use a short search script, named regexes1FileSearch,
that takes two command-line arguments, a file and a regular expression, and finds
every match for the regular expression in the file’s contents. Don’t worry for now
about exactly how this program works, since we won’t go into the details of how
Python handles regular expressions until the next chapter. The main point is that,
given a file and a regular expression, it finds every line of the file matching the
regular expression.
To test our regular expressions, we will use a small corpus consisting of the
first chapter of a handful of science fiction novels. These novels are all part of
Orion’s Science Fiction Masterworks Series and are listed below in Table 17.1.2
1
The prefix meta- may sound like postmodern mumbo-jumbo, but it actually makes a good
deal of sense. Just as metacharacters are characters that operate on characters (allowing you to
search for them), a metalanguage is a language that operates on language (allowing you to talk
about it).
2
For a full listing of the titles available in the series, including some reviews, see Orion’s
official web site www.orionbooks.co.uk.
Author              Title                                   File
Aldiss, Brian       Non-Stop                                nonstop-ch1.txt
Blish, James        A Case of Conscience                    conscience-ch1.txt
Dick, Philip K.     Do Androids Dream of Electric Sheep?    sheep-ch1.txt
Haldeman, Joe       The Forever War                         forever-ch1.txt
Le Guin, Ursula     The Dispossessed                        dispossessed-ch1.txt
Le Guin, Ursula     The Lathe of Heaven                     lathe-ch1.txt
Vonnegut, Kurt      The Sirens of Titan                     sirens-ch1.txt
Table 17.1 Science Fiction Novels Used for Sample Corpus
To see how the program can be called to perform a regular expression search of
one of the texts in the corpus, have a look at the screenshot of regexes1FileSearch
being used to search for the word star in the first chapter of Brian Aldiss’ novel
Non-Stop.
Figure 17.3 Using regexes1FileSearch.py to search for star in Non-Stop
What we find is that the two searches return many lines that don’t have indef-
inite articles. Why is this the case? The answer is simple. We are doing a search
for the letter a or the letters an, but we haven’t told the regular expressions that
these two strings are supposed to be stand-alone words, so it finds them as parts
of other words. Every line containing the letter a is matched, which is clearly
not what we want. Remember, regular expressions match strings, not words. To
match words, we need to tell the regular expression how to identify word boundaries,
which is not entirely straightforward. For now, we will assume that a word
is anything surrounded by spaces and simply put spaces around a and an in the
regular expression. (We will see later, in §17.2.8, that there are superior alterna-
tives to this approach.) Repeating the search, we now get more sensible results.
However, it is tedious to run two searches. What we need is a way of collapsing
these two searches into one.
There are a couple of ways to do this with regular expressions, but let’s keep
it simple and do an either-or search. Either we’ll search for one form of the article
or the other. This can be done using the vertical bar, which separates alternatives
from one another (see §17.2.1), as in (16), where a regular expression is provided
that searches for either a or an surrounded by spaces (one space before the opening
parenthesis and one after the closing parenthesis).
(16) (a|an)
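The effect of the surrounding spaces is easy to check. The following sketch (Python 3 syntax) uses re.findall from Python’s standard re module, which is treated properly in the next chapter; the sample sentence is invented for illustration.

```python
import re

line = 'He gave an answer that was a little late.'

# Without spaces, the letter a is found inside almost every word.
bare = re.findall('a|an', line)
print(len(bare))      # far more hits than there are articles

# With a space on either side, only the stand-alone articles remain.
# (When the pattern contains parentheses, findall returns the part
# matched inside them, so the spaces are not included in the result.)
spaced = re.findall(' (a|an) ', line)
print(spaced)         # ['an', 'a']
```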
This regular expression can be used on the first chapter from Joe Haldeman’s
Forever War, as illustrated in Figure 17.4.
Figure 17.4 Using regexes1FileSearch.py to do a simple regular expression
search on The Forever War
We’ve just constructed our first useful regular expression using a metacharacter,
namely the vertical stroke. The reason the vertical stroke is a metacharacter
is that its interpretation is not literal. The regular expression in (16) doesn’t look
for a space, an open parenthesis, an a, and a vertical stroke, followed
by an, a close parenthesis, and another space. Rather, it says, look for either a
or an between spaces. In the following section, we will see how this works, and
learn many other conventions that allow for more sophisticated pattern matching
on strings.
Pattern matching needs a way
of expressing the concept of “either-or”, so that searches can look for either one
element or another (in the indefinite article example, either a or an). The concept
of disjunction (OR) is expressed in regular expressions with the vertical bar
metacharacter (|).
(17) a|an
Although some hair-splitting types feel that “or” should be used exclusively
for binary disjunction—that is, for situations where there are only two options
(i.e., “either A or B”)—the everyday English “or” is often extended to situations
where there are a number of options (i.e., “either A or B or C”). The same holds
true for the vertical bar metacharacter, which can be used with multiple options,
as in (18), where our search for articles is expanded to include the definite article
the.
(18) a|an|the
The usefulness of disjunction should be obvious. It is quite handy when one
wishes to allow more than one string to fill a particular slot, as in (19), which
would match airship, starship, or spaceship.
(19) (air|star|space)ship
Note that parentheses have been used in the previous example. Parentheses
are important when there might otherwise be ambiguity in the interpretation of
a regular expression. In the case of (19), it is clear that what is wanted is ship
preceded by air, star, or space. Without parentheses, a regular expression of this
kind is interpreted differently, as can be seen from the two-option case in (20),
which is interpreted as a search for air or spaceship. In other words, (20) is
equivalent to (21).
(20) air|spaceship
(21) (air)|(spaceship)
It is possible in theory to work out how the various parts of a regular expression
will be grouped together using precedence rules but these require a more in-depth
treatment of regular expressions than is really necessary here. The important thing
to remember is that where there is potential ambiguity, you should use parentheses
to explicitly group together the various parts of your regular expression. “When
in doubt, bracket it out!”
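The difference that grouping makes can be seen by asking which part of a word each expression actually matches. This is a small sketch in Python 3 syntax using re.search from the standard re module (covered in the next chapter); the test word airport is an invented example.

```python
import re

# Without parentheses, the bar splits the whole expression in two,
# so 'air' on its own counts as a complete match.
print(re.search('air|spaceship', 'airship').group())            # air

# With parentheses, only the first part alternates; 'ship' is always required.
print(re.search('(air|star|space)ship', 'airship').group())     # airship

# A word that contains 'air' but not 'ship':
print(bool(re.search('air|spaceship', 'airport')))              # True
print(bool(re.search('(air|star|space)ship', 'airport')))       # False
```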
The ability to handle disjunction gives regular expressions a great deal of flex-
ibility and power. Disjunction is also quite handy when searching for the various
forms of a given lexeme, as in (22), which could be used to match any of the words
in (23).
(22) profan(e|ity)
(23) profane
profanity
(24) (a|e|i|o|u)
(25) (a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z)
It is quite useful when performing searches to allow more than one character to
occupy a position in a regular expression. For example, imagine searching for the
word enquire in a collection of texts from all over the world. Since the spelling
of this word is variable (enquire in Commonwealth countries, but inquire in the
USA), one would want to leave open two possibilities for the first character of
this string. One way to accomplish this is by searching for either spelling
disjunctively, as in (26).
(26) (enquire|inquire)
The regular expression in (26) fails to factor out the common element in the
words enquire and inquire. Since it is only the first letter of the two words that
varies, it alone can be separated off and put into parentheses for disjunction, as in
(27).
(27) (e|i)nquire
(28) [ei]nquire
What we’ve just used in (28) is a character class: a string of characters (of
any length) surrounded by square brackets which forms a pattern that matches any
single character in that string. Character classes are useful in many contexts. For
example, they are quite handy when dealing with limited variation in capitalization.
If one searched for the word hope, for example, one would be forced to deal with
the fact that its first letter is capitalized at the beginning of sentences, as in (29).
(30) ([aA]|[aA]n|[tT]he)
When run on the first chapter of The Forever War, it matches lines with sen-
tences that begin with an article, such as those in (31).
(31) a. The guy who said that was a sergeant who didn’t look five years older
than me.
b. The projector woke me up and I sat through a short tape showing the
“eight silent ways.”
c. The sergeant nodded at her and she rose to parade rest.
d. The whole company’d been dragging ever since we got back from the
two-week lunar training
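Character classes are easy to experiment with. The sketch below (Python 3 syntax; re.fullmatch and re.findall come from the standard re module, discussed in the next chapter) tries [ei]nquire on three candidate words and uses a class to catch both capitalizations of hope. The non-word anquire and the sample sentence are invented for illustration.

```python
import re

pattern = re.compile('[ei]nquire')
# fullmatch succeeds only if the whole word fits the pattern.
results = {w: bool(pattern.fullmatch(w)) for w in ['enquire', 'inquire', 'anquire']}
print(results)

# One class handles the capitalized and the lowercase form together.
hope = re.findall('[hH]ope', 'Hope springs eternal; we hope so.')
print(hope)    # ['Hope', 'hope']
```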
(32) [abcdefghijklmnopqrstuvwxyz]
Instead of typing out each lowercase letter, you can use a character range,
as in (33), which uses the range of lowercase letters from a through z.
(33) [a-z]
The regular expression in (33) will match any one character in the alphabet
from a through z. More than one range can be placed within a pair of brackets, as
in (34), which will match both lowercase and uppercase letters in the same range
of the alphabet.
(34) [a-zA-Z]
Therefore, (34) will match any letter, regardless of case. Note that more than
one range can be included between square brackets and that separation of the two
is unnecessary. By the same token, other characters can be included, as in (35),
which would match any letter (lowercase or uppercase) or an ampersand.
(35) [a-zA-Z&]
There are also non-alphabetical character ranges, such as (36), which matches
any digit from 1 through 5, and is equivalent to (37).
(36) [1-5]
(37) [12345]
Be warned, however, that unexpected results can arise from the use of non-
alphanumeric characters in character ranges. For example, it is doubtful that the
reader knows which characters will be matched by (38).
(38) [!-a]
(38) is obscure because the average reader knows the order of the alphabet
perfectly well (abc . . . xyz), but not the order of characters in ASCII (the American
Standard Code for Information Interchange), which is what determines the
range within a character class.4 The ASCII range specified by (38) is equivalent
to the regular expression in (39).
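Rather than memorizing the ASCII table, one can simply ask Python which characters fall inside the range. This sketch (Python 3 syntax) relies on the built-in functions ord and chr, which convert between a character and its numeric code; the variable names are illustrative.

```python
import re

# Characters whose codes run from that of '!' (33) to that of 'a' (97).
in_range = ''.join(chr(code) for code in range(ord('!'), ord('a') + 1))
print(in_range)

# Cross-check against the regular expression itself, over all of ASCII.
matched = ''.join(ch for ch in map(chr, range(128)) if re.fullmatch('[!-a]', ch))
print(matched == in_range)    # True
```

The range turns out to include most punctuation marks, the digits, all of the uppercase letters, and exactly one lowercase letter.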
(40) [a-zA-Z,]
One possibility for including a dash in a character class is to place it at the end
of the character class, where it will be interpreted as a dash and not as an indicator
of a character range, as in (41).
(41) [a-zA-Z-]
(42) [a-zA-Z]
The use of the backslash to derive literal characters from metacharacters will
be discussed in more detail in §17.2.5.
4
Any reader who has memorized the order of ASCII is unlikely to need this tutorial, but may
need one on human social interaction!
A very useful metacharacter in regular expressions is the wildcard character, the
dot (.). In less geeky circles, the dot is called a period (North America) or a
full stop (elsewhere). A dot is a metacharacter that serves as a stand-in for a
single character. It essentially means, “take your pick for this character, it can be
anything”. For example, (43) matches any three-character string whose first and
last characters are p and n, irrespective of whether the middle character is a letter,
a number, or a punctuation mark.
(43) p.n
(44) pinion
company
happened
Therefore, the regular expression in (43) would match any of the strings in
(44), since each contains the letters p and n with a single character in between them.
The regular expression (43) would not match the other words listed below in (45).
(45) planet
equipment
telephone
Why won’t (43) match any of the words in (45)? Because these words have
two characters between p and n and the dot in (43) will match only one. Obviously,
there are many cases where the variable portion of one’s search consists of more
than one character. One way of working around the built-in limitation on the dot
is to use more than one as in (46).
(46) p..n
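Both expressions can be run against the two word lists. The sketch is in Python 3 syntax and uses re.search from the standard re module (covered in the next chapter).

```python
import re

words = ['pinion', 'company', 'happened', 'planet', 'equipment', 'telephone']

one_dot = [w for w in words if re.search('p.n', w)]
print(one_dot)     # ['pinion', 'company', 'happened']

two_dots = [w for w in words if re.search('p..n', w)]
print(two_dots)    # ['happened', 'planet', 'equipment', 'telephone']
```

Note that happened turns up in both lists: its second p is followed by one character and then n, while its first p is followed by two characters and then n.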
At this point, the reader may be asking the following important question: How
does one search a text for characters that happen to be metacharacters, such as
dots, question marks, stars, etc.? More concretely, how would someone search
for, say, all sentences ending with two periods (a common typo)? The regular
expression (47) won’t work.
(47) ..
The reason (47) won’t work is that, given the normal wildcard interpretation
of a dot, it would match any two characters in a row, which means it would match
any string that contains two or more characters. (In other words, it would produce
many false positives. It would match two dots, but it would also match many
two-character sequences that don’t consist of two dots.)
There is a convention for searches on characters that happen to be metacharacters:
the use of a backslash (\) before a metacharacter that one wishes to be treated
as a literal. (The backslash is also known as an escape character, since it allows
an “escape” from the normal interpretation of a metacharacter.) To search for two
dots in a row, we need to place a backslash before each dot, so that each one is
interpreted literally, as in (48).
(48) \.\.
For another example, consider what is involved in searching for the abbrevia-
tion of etcetera (etc.). The regular expression in (49) will not work, since it also
matches the undesired words in (50).
(49) etc.
(50) etcetera
wretched
stretch
etching
fetch
A much better way of searching for etc. is by converting the dot from a
metacharacter to a literal by placing a backslash in front of it, as in (51).
(51) etc\.
(51) will not match any of the above-listed words but will match etc.. In order
to create a regular expression that would match the word etcetera as well as its
abbreviation, disjunction can be added to the mix, as in (52).
(52) etc(\.|etera)
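The three expressions can be compared side by side. In this sketch (Python 3 syntax) the patterns containing a backslash are written as raw strings, r'...', so that Python passes the backslash through to the regular expression unchanged; that detail belongs to the next chapter and is used here only as a convenience.

```python
import re

words = ['etc.', 'etcetera', 'wretched', 'stretch', 'etching', 'fetch']

unescaped = [w for w in words if re.search('etc.', w)]
print(unescaped)    # all six words: the dot accepts any character

escaped = [w for w in words if re.search(r'etc\.', w)]
print(escaped)      # ['etc.']

both = [w for w in words if re.search(r'etc(\.|etera)', w)]
print(both)         # ['etc.', 'etcetera']
```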
The astute reader may at this point be wondering what happens when a literal
character is escaped with a backslash. The answer is simple. Nothing. When a
backslash precedes a literal character (that is, a non-metacharacter), it simply has
no effect. Therefore, (53) and (54) are equivalent.
(53) \=
(54) =
Although the backslash in (53) is superfluous, it does no harm. This will not
always be the case. There are a handful of literal characters that change meaning
when preceded by a backslash. These will be covered in §17.2.7.
(55) for[^\.,!\?]
The regular expression in (55) would match all of the examples in (57) (from
James Blish’s novel A Case of Conscience), among others.
professional attitude was called for. [Do Androids Dream of Electric
Sheep?]
However, it would specifically not match the various lines in (56) (taken from
various sources), since they contain the string for followed by a letter (rather than
a punctuation mark).
We have already seen the escape character [\] used to treat metacharacters as
literals in order to search for dots, question marks, etc. There is another usage of
the escape character, which involves putting a backslash before a literal character
in order to create a special sequence. These special sequences either define special
characters (e.g., linefeeds or space marks) or convenient character classes (e.g.,
whitespace).
\t    tab
\n    newline
\r    carriage return
\f    formfeed
\v    vertical tab
\d    decimal digit
\D    non-digit character
\s    whitespace character
\S    non-whitespace character
\w    alphanumeric character
\W    non-alphanumeric character
\A    beginning of string
\Z    end of string
The special characters (tab, newline, etc.) have the same interpretation outside
of the context of regular expressions. In other words, they have the same
interpretation in regular strings (for more information about these, see Chapter
14.3).
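A few of these special sequences in action are sketched below, in Python 3 syntax with raw strings (r'...') so that the backslashes reach the regular expression intact. The sample strings are invented; the date follows the shape described at the start of the chapter, with a three-letter month.

```python
import re

date = '12 Mar 1999'
# Two digits, whitespace, three alphanumerics, whitespace, four digits.
print(bool(re.fullmatch(r'\d\d\s\w\w\w\s\d\d\d\d', date)))   # True

print(re.findall(r'\D', '7 up'))      # [' ', 'u', 'p']
print(re.findall(r'\S', 'a\tb\n'))    # ['a', 'b']

# \A and \Z tie a match to the beginning or end of the string.
print(bool(re.search(r'\Afast', 'faster')))       # True
print(bool(re.search(r'\Afast', 'breakfast')))    # False
print(bool(re.search(r'fast\Z', 'breakfast')))    # True
```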
(58) ˆfast
(59) fast$
(60) faster
fastest
fastener
fastidious
(61) breakfast
Belfast
17.2.9 Summary
Before expanding upon our repertoire of regular expressions, it may be worthwhile
to recap what we have learned so far. We’ve been introduced to a variety
of metacharacters: the dot, the vertical bar, square brackets (with and without
hyphens), parentheses, and the backslash (or escape character, as it is sometimes
known). These are summarized in Table 17.3.
Metacharacter   Meaning
.               matches any character
|               or
[abc]           matches one of the bracketed characters
[^abc]          matches anything but one of the bracketed characters
[a-z]           matches one of the characters in a through z
()              grouping
\               causes metacharacters to be read as literals
Table 17.3 Summary of Metacharacters
Regular expressions built up from these metacharacters are fairly powerful and
give searches considerable flexibility, but there are a number of ways in which they
are lacking. In the next section, we will expand the possibilities provided by these
metacharacters by introducing a number of quantifiers—that is, metacharacters
which allow other metacharacters to match once, twice, multiple times, or even
not at all.
17.3 Quantifiers
So far, we have used various metacharacters (such as the dot or the vertical bar) to
create regular expressions for pattern-matching. Most of these patterns have been
fairly simple. Using quantifiers, it is possible to build even more complicated
patterns. Quantifiers
provide a means of specifying how many times a regular expression should match.
For example, imagine that you are searching a corpus for every word that contains
two vowels in a row (i.e., a two-letter vowel sequence). We have learned how to
create a regular expression that will only match vowel letters, such as (62) or (63).
(62) (a|e|i|o|u)
(63) [aeiou]
One way to find two vowel letters in a row is simply to use the character class
for vowel letters twice, as in (64).
(64) [aeiou][aeiou]
But what if we wanted to find three vowel letters in a row? This could of
course be done in the fashion of (64), by using the character class three times, as
in (65).
(65) [aeiou][aeiou][aeiou]
But (65) is awkward. Furthermore, the approach that it embodies doesn’t scale
well. In other words, it isn’t practical for larger sequences. While it might work as
long as we’re only matching sequences with just a few repetitions, it is a hopeless
way of matching sequences with hundreds of repetitions. Not only does it require
too much typing, but it also is fairly unreadable.
What is needed is a way of controlling how many times part of a regular expression
should match, and this is precisely what quantifiers provide. We turn now
to the various quantifiers found in regular expressions, starting with the plus mark.
17.3.1 Match One or More Times: The Plus Mark
The plus mark [+] essentially says to match the preceding regular expression at
least once. In other words, by adding the plus mark to part of a regular expression,
it will ensure at least one match, and as many more as are possible. Returning to
the vowel sequence example, we could find every vowel letter and sequence of
vowel letters using (66).
(66) [aeiou]+
Alternatively, we could find every vowel sequence of two or more letters by
having two character classes for vowel letters and adding a plus sign to the second,
as in (67). The plus sign dictates that the second character class must match at least
once, but leaves open the possibility of even more matches. Therefore, (67) will
match sequences of 2 to n vowel letters, where n could be any number.
(67) [aeiou][aeiou]+
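The two expressions can be compared on a short phrase. This is a sketch in Python 3 syntax using re.findall from the standard re module (covered in the next chapter); the phrase is invented for illustration.

```python
import re

phrase = 'beautiful sequoia'

# One or more vowel letters: single vowels count too.
print(re.findall('[aeiou]+', phrase))          # ['eau', 'i', 'u', 'e', 'uoia']

# A vowel followed by one or more vowels: only sequences of two or more.
print(re.findall('[aeiou][aeiou]+', phrase))   # ['eau', 'uoia']
```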
Quantifiers therefore solve the problem of defining regular expressions that
will match sequences containing a very high number of repetitions. (67) will
match vowel sequences consisting of just a few vowels as easily as vowel
sequences consisting of hundreds of vowels. Of course, a sequence of hundreds
of vowels is fairly unrealistic. But being able to search for long sequences can be
quite handy nonetheless. Imagine that you are searching for all words that begin
with a vowel and end with a consonant. In order to match such words, the regular
expression in (68) could be used.
(68) uses spaces to identify word boundaries. It looks for a vowel followed by
one or more non-spaces followed by a non-vowel and matches a large number of
words of varying lengths, including those in (69).
(69) a. and
b. again
c. arrogant
d. insignificant
e. intercontinental
Because the final character is negatively defined (i.e., as anything but a vowel
letter or a space), (68) will also give a number of undesired matches, such as those
in (70).
(70) a. us!
b. one,”
c. upward.
This problem can easily be rectified by positively specifying the final character
class, as in (71), which is less compact but more accurate. It will still match the
words in (69) but not those in (70).
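The contrast between a negatively and a positively defined final character can be sketched as follows. The two patterns are written from the prose description alone (a vowel, one or more non-spaces, then a final character, all between spaces), so they are assumptions and may differ in detail from the expressions the text has in mind; the names loose and strict are likewise illustrative. Python 3 syntax.

```python
import re

# Final character defined negatively: anything but a vowel letter or a space.
loose = re.compile(' [aeiou][^ ]+[^aeiou ] ')
# Final character defined positively: one of the consonant letters.
strict = re.compile(' [aeiou][^ ]+[bcdfghjklmnpqrstvwxyz] ')

candidates = ['and', 'again', 'arrogant', 'insignificant', 'us!', 'one,"', 'upward.']

# Pad each candidate with spaces, since the patterns use spaces as word boundaries.
loose_hits = [c for c in candidates if loose.search(' ' + c + ' ')]
strict_hits = [c for c in candidates if strict.search(' ' + c + ' ')]

print(loose_hits)     # all seven candidates
print(strict_hits)    # ['and', 'again', 'arrogant', 'insignificant']
```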
So far, we have used the plus mark for repetitions of single letters, but quanti-
fiers can also be applied to larger sequences. The scope of a quantifier is either the
previous character or the previous grouping (i.e., anything inside of brackets or
parentheses). To see how a quantifier can be used to modify a larger unit, imagine
that you are searching for words that consist only of strict consonant-vowel (CV)
syllables. The regular expression in (72) can be used.
(72) ([bcdfghjklmnpqrstvwxyz][aeiou])+
match at all.5 In other words, it dictates that a regular expression will match as
many instances as possible, including none at all.
To see how it differs from the plus mark, we will return to the regular expres-
sion in (??) that was used to search for words beginning with a vowel and ending
with a consonant. It is repeated below for convenience in (73).
(73) will not match words that consist of only a vowel and a consonant since
it requires at least one match between the first and last character. To match such
words, the star quantifier can be used instead of the plus mark, as in (74).
(74) [aeiou][^ ]*[bcdfghjklmnpqrstvwxyz]
Because the non-space character between the vowel and the consonant is quan-
tified with a star, it will match all of the words that (73) does, but it will also match
words that have nothing between the first and last character, such as those in (75).
(75) a. at
b. is
c. or
d. up
differently in North America than they are elsewhere—e.g., color vs. colour. If
you were searching for this word in a mixed corpus (i.e., a corpus that contains
both varieties of written English), the question mark could be used to construct a
regular expression that would match either spelling of the word, as shown in (76).
(76) colou?r
In (76), a single letter is made optional using the question mark quantifier, but
it is also possible to make a longer string optional by placing it between
parentheses. For example, consider how one might search for words in the ship family.
The English language contains many words involving ship, such as those listed in
(77).
If you wanted to find out how many words are based on, or derived from,
ship, you could search a corpus using a regular expression that would match all
of these words (and any others involving ship).6 This can be done fairly easily
using the question mark quantifier to make any material preceding or following
ship optional, as in (78).
(78) ([a-zA-Z]+)?ship([a-zA-Z]+)?
Note that (78) restricts the material preceding or following ship to one or more
letters. It would therefore fail to match words such as short-shipped. In order to
ensure that such words are matched, as well, the regular expression would have to
be modified slightly, so as to include dashes, as in (79).
(79) ([a-zA-Z-]+)?ship([a-zA-Z]+)?
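The behavior of the question mark in (76), (78), and (79) can be checked with a short sketch. It assumes Python's re module (introduced in Chapter 18); the word list and the helper matches_whole() are made up for illustration and are not part of any library.

```python
import re

# Hypothetical word list for illustration
words = ["ship", "friendship", "shipment", "short-shipped", "color", "colour"]

def matches_whole(pattern, word):
    """Return True if the pattern matches the entire word (helper for this sketch)."""
    mo = re.match(pattern, word)
    return mo is not None and mo.end() == len(word)

spelling = r"colou?r"                            # (76)
letters_only = r"([a-zA-Z]+)?ship([a-zA-Z]+)?"   # (78)
with_hyphen = r"([a-zA-Z-]+)?ship([a-zA-Z]+)?"   # (79)

both_spellings = [w for w in words if matches_whole(spelling, w)]
ship_78 = [w for w in words if matches_whole(letters_only, w)]
ship_79 = [w for w in words if matches_whole(with_hyphen, w)]

print(both_spellings)
print(ship_78)
print(ship_79)
```

Only the version with the hyphen in the first character class accepts short-shipped as a whole word.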
? {0,1}
+ {1,}
* {0,}
Note that when there is no upward limit, no number is provided (although the
comma remains).
Bracket notation is invaluable when searching for a particular number of repeated
characters. For example, if one wished to find all sequences of three or
more vowel letters, the regular expression in (80) would do the job.
(80) [aeiou]{3,}
(81) provides the first five lines with different matching words from The Dis-
possessed that are matched by (80).
(81) a. Like all walls it was ambiguous, two-faced.
b. People often came out from the nearby city of Abbenay in hopes of seeing
a space ship, or simply to see the wall.
c. He sat down on the shelf-like bed, still feeling light-headed and lethargic,
and watched the doctor incuriously.
d. He felt he ought to be curious ; this man was the first Urrasti he had ever
seen.
e. Contagious.
Alternatively, if one wished to find sequences of three or more consonant letters,
(82) could be used.
(82) [bcdfghjklmnpqrstvwxyz]{3,}
The first five matching lines obtained by running (82) against the first chapter
of The Dispossessed are given in (83).
(83) a. It was built of uncut rocks roughly mortared; an adult could look right
over it, and even a child could climb it.
b. Where it crossed the roadway, instead of having a gate it degenerated
into mere geometry, a line, an idea of boundary.
c. For seven generations there had been nothing in the world more impor-
tant than that wall.
d. Like all walls it was ambiguous, two-faced.
e. Looked at from one side, the wall enclosed a barren sixty-acre field
called the Port of Anarres.
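The bracket notation can be tried out on a single line with a sketch along the following lines. It assumes Python's re module (see Chapter 18) and uses a lightly simplified version of one of the lines quoted above; the consonant pattern lists the lowercase consonant letters positively in a character class.

```python
import re

line = "He felt he ought to be curious; this man was the first Urrasti."

# (80): three or more vowel letters in a row
vowel_runs = re.findall(r"[aeiou]{3,}", line)

# Three or more (lowercase) consonant letters in a row
consonant_runs = re.findall(r"[bcdfghjklmnpqrstvwxyz]{3,}", line)

# {1,} is the bracket-notation spelling of the plus sign
same_as_plus = re.findall(r"[aeiou]{1,}", line) == re.findall(r"[aeiou]+", line)

print(vowel_runs)
print(consonant_runs)
print(same_as_plus)
```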
To recap, there are four ways of quantifying regular expressions. Each quantifier
behaves somewhat differently and allows for different matching possibilities. The
full list is provided in Table 17.5.
Quantifier  Meaning
?           match zero or one time
*           match zero or more times
+           match one or more times
{i,j}       match a minimum of i and a maximum of j
17.4 Word Boundaries
(84) (a|an|the)
Because (84) expects a space before and after the desired strings, it will fail
to match these strings when they occur at the beginning or end of lines. At the
beginning of lines, the string will have a space after it but not before it. At the end
of lines, the string will have a space before it but not after it. Although articles
occurring at the end of lines will be rare (since articles are normally followed by
other words), their occurrence at the beginning of lines is commonplace. In order
to capture articles at the beginning of lines, the regular expression in (85) could
be used.
(85) ^(a|an|the)
A search on the first chapter of Kurt Vonnegut’s The Sirens of Titan using
(85) produces many matches, a few of which are illustrated below in (86).
In order to match words that occur at the beginning or end of a line as well as
words that occur in between, a regular expression such as (87) is required, where
line starts and ends are explicitly included as options for delimiting the boundaries
of a word.
There are other whitespace characters that should probably be included as
potential word boundaries, such as tabs. This can be easily rectified, as in (88),
but the inclusion of more word boundary characters makes the resulting regular
expression more difficult to read.
(89) \b(a|an|the)\b
It is important to bear in mind that there are still issues concerning word seg-
mentation that are not solved by the existence of the special sequence \b. For
example, hyphens in raw, unstructured text raise problems, because they are used
inconsistently. This can be illustrated with a few sentences taken from Philip K.
Dick’s Do Androids Dream of Electric Sheep?. In some cases, hyphens occur
within words, as in (90).
(90) a. ”My schedule for today lists a six-hour self-accusatory depression,” Iran
said.
b. Despair like that, about total reality, is self-perpetuating.”
c. Hey, for twenty-five bucks you can buy a full-grown mouse.”
(91) a. Percheron colts just don’t change hands-at catalogue value, even.
b. You know why? Because back before W.W.T. there existed literally hundreds-
”’
c. I left a piece and Groucho-that’s what I called him, then-got a scratch
and in that way contracted tetanus.
(92) a. Percheron colts just don’t change hands - at catalogue value , even .
b. You know why ? Because back before W.W.T. there existed literally hun-
dreds -”’
c. I left a piece and Groucho - that’s what I called him , then - got a scratch
and in that way contracted tetanus .
(93) \B(ment|ship)\b
Thanks to the special sequences in (93), the searched-for strings will only
match when they occur at the end of a word. Therefore, the stand-alone word ship
will not be matched but the verb reship will.
Finding prefixes is simply the reverse of (93), with \b occurring before the
searched-for string and \B occurring after it, as in (94).
(94) \b(un|re)\B
Note that (94) will match many words that do not have prefixes but simply
begin with un and re, such as underwear or reality.
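The three boundary patterns (89), (93), and (94) can be compared side by side in a small sketch. It assumes Python's re module (see Chapter 18), where the patterns are written as raw strings so that \b and \B reach the regular expression engine unchanged; the sample strings are invented.

```python
import re

# (89): articles as whole words
articles = re.findall(r"\b(a|an|the)\b", "the cat ate an apple near a theater")

# (93): ment or ship at the end of a word, but not as a stand-alone word
suffixes = re.findall(r"\B(ment|ship)\b", "ship reship shipment friendship")

# (94): un or re at the start of a word, whether or not it is a true prefix
prefixes = re.findall(r"\b(un|re)\B", "untie reality under red")

print(articles)
print(suffixes)
print(prefixes)
```

The last result shows the overmatching noted above: reality, under, and red all count as hits even though none of them carries a prefix.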
17.5 Summary
Regular expressions are really nothing more than a set of metacharacters and a
few conventions concerning their interaction. The full set is fairly small and easily
chapter, we will see how regular expressions can be used in Python to manip-
ulate text and overview some Python-specific extensions to regular expressions
that make them even more powerful and flexible.
Symbol   Meaning
.        match any character
?        match zero or one time
+        match one or more times
*        match zero or more times
{i,j}    match i to j times (j defaults to infinity)
|        or
[abc]    match one of the bracketed characters
[^abc]   match anything but one of the bracketed characters
[a-z]    match one of the characters in a through z
()       grouping
\        treat following character as literal
17.6 Exercises
1. ???
2. ???
3. ???
17.7 Suggested Reading
Chapter 18
Regular Expressions in Python
In the previous chapter, regular expressions were introduced in the abstract, with-
out reference to Python. In this chapter, we will cover how regular expressions
are specifically handled in Python. We will also look at some powerful Python-
specific extensions to regular expressions which can be leveraged for even better
text processing.
18.1 Using Regular Expressions: An Overview
module, re. The function search() is then called on the text with a simple
regular expression (Star*). The search() method, as its name suggests,
scans a string for the first place where the regular expression matches. If a match is
found, it returns a MatchObject; otherwise, it returns None. Since a match will be
found in this example, calling the search() method will return a MatchObject,
the if-clause will therefore evaluate as true, and the appropriate message will be
printed out.
In (238), the regular expression is directly applied to a string to find matches,
thereby skipping a step that is often quite useful, which is compiling a string with
a regular expression into a RegularExpressionObject, as illustrated in (239).
(239) regexes2RegexEx2.py
import re

# Text to search
text = 'Starglider to Earth:'

# Compile a RegularExpressionObject
reo = re.compile('Star*')

# Find matches
if reo.search(text) :
    print "Found a match!"
else :
    print "No match found!"
Unless you are using one of the regular expression compilation flags, it is not
necessary to compile a regular expression in order to use it, although you may
want to do so for reasons of efficiency (see §18.2.2 for more details).
In summary, the use of regular expressions in Python involves the following
steps:
Each step in the process will be discussed in greater detail below. §18.2.1
shows how regular expressions are created; §18.2.2 discusses how regular expressions
are compiled; §18.3 overviews the various ways that regular expressions can
be used for pattern-matching; and §18.3.2 describes how matches can be leveraged
for text processing.
string that describes other strings. Because regular expressions are strings, the
way strings are handled in Python affects the way regular expressions are handled,
sometimes in ways that are not entirely obvious and may lead to confusion.
The main difficulty concerns how backslashes are handled. As you may recall
from Chapter 14, a backslash is used to escape single quotes in single-quoted
strings and double quotes in double-quoted strings, and to wrap long strings across
multiple lines, as can be seen in (240), where each string has a backslash that
doesn’t appear in the output when it is run.
(240) regexes2DefineRegexEx0.py
import re

print 'This is called a \
\'single-quoted\' string.'

print "This is called a \
\"double-quoted\" string."

print '''This is called a \
\'''triple-quoted\''' string.'''

print """This is also called a \
\"""triple-quoted\""" string."""
If you want a backslash in a string, there are a few ways of getting one. As
long as the backslash doesn’t come at the end of a line and doesn’t precede a single
quote in a single-quoted string or a double-quote in a double-quoted string, it will
behave like any other character, as can be seen by running (241).
(241) regexes2DefineRegexEx1.py
import re

print 'Here is a \ in a single-quoted string.'
print "Here is a \ in a double-quoted string."
print '''Here is a \ in a triple-quoted string.'''
print """Here is a \ in another triple-quoted string."""
But what if you want a backslash at the end of a line? What then?
(242) regexes2DefineRegexEx2.py
import re

print 'This single-quoted line ends with a backslash.\\'
print "This double-quoted line ends with a backslash.\\"
print '''This triple-quoted line ends with a backslash.\\'''
print """This triple-quoted line ends with a backslash.\\"""
will match a snippet of LaTeX code.1
import re
s = '\documentclass[a4paper,10pt]{book}'
if re.search(r'\\documentclass', s) :
    print "Got match with double slash in raw string."
else :
    print "No match with double slash in raw string."
if re.search('\\documentclass', s) :
    print "Got match with double slash in plain string."
else :
    print "No match with double slash in plain string."
The second step in using regular expressions in Python is to compile a regular expression
object (http://docs.python.org/lib/re-objects.html).
When compiling a regular expression object, there are a number of options available
which systematically change the way in which the regular expression is interpreted.
Table 18.1 lists the various possibilities.
Flag            Abbrev.  Meaning
re.DOTALL       S        the wildcard metacharacter (the dot [.]) matches
                         any character, even newlines
re.IGNORECASE   I        ignore case when performing matches
re.LOCALE       L        use locale setting when performing matches
re.MULTILINE    M        line start and end metacharacters (the caret [^]
                         and the dollar sign [$]) match at the start and end
                         of each line, not just the start and end of the string
re.VERBOSE      X        enable verbose REs
DOTALL Normally, newlines (\n) are not matched by a dot (the wildcard
metacharacter). In order to ensure that absolutely every character, including newlines,
is matched by wildcards, a regular expression can be compiled with the
DOTALL flag. This convention is very useful for parsing larger texts since patterns
that span multiple lines can be captured.
GIVE EXAMPLE
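One way to illustrate the DOTALL flag is the sketch below. The two-line string and the tag-like pattern are invented for the purpose; the only library calls used are re.search() and the re.DOTALL flag. The print() calls take a single argument so that the code should run under both Python 2 and Python 3.

```python
import re

# Invented two-line sample: the material of interest spans a newline
text = "<p>first line\nsecond line</p>"
pattern = r"<p>(.*)</p>"

# Without DOTALL the dot stops at the newline, so nothing matches
without_flag = re.search(pattern, text)

# With DOTALL the dot also matches the newline
with_flag = re.search(pattern, text, re.DOTALL)

print(without_flag)
print(repr(with_flag.group(1)))
```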
IGNORECASE Regular expressions are normally case-sensitive. Therefore,
the regular expression the will not match the word The. It is possible to work
around this limitation using character classes, as previously described in §17.2.3,
but it is often easier to use the IGNORECASE flag when compiling a regular expression
so that matches are made in a case-insensitive manner (i.e., capitalization
is ignored). In (244), for example, the regular expression Books matches the
word books (lowercase) as well as BOOKS (uppercase).
(244) regexes2IgnoreCaseEx1.py
import re

s1 = 'It is good to read books.'   # Lowercase 'books'
s2 = 'BOOKS ARE GOOD TO READ.'     # Uppercase 'BOOKS'

# Compile case-insensitive regex object
reo = re.compile('Books', re.IGNORECASE)

# A match is found, despite case mismatch
# (search() is used because 'books' is not at the start of s1)
if reo.search(s1) : print s1
if reo.search(s2) : print s2
This compilation flag is especially useful when dealing with the poetry of e.
e. cummings, whose poetic license took a typographic turn with his disdain for
capitalization.
(245) regexes2IgnoreCaseEx2.py
s = '''e. e. cummings'''
LOCALE Locales are used by operating systems and applications for interna-
tionalization, which basically means that they customize the behavior of a com-
puter according to its setting or location. A locale usually affects such things as
character sets, date and time formats, and currency formats, all of which change
from one language and/or culture to the next. For example, the authors of this
book reside in different places. One of them is based in the United States while
the other is based in The Netherlands. Their locale affects the way dates are for-
matted. For example, the date ‘1-12-1999’ would be interpreted as January 12th,
1999 in the US but as December 1st, 1999 in the Netherlands.
By using the LOCALE flag for compiling regular expressions, it is possible
to ensure that differences between alphabets are preserved in the interpretation of
the metasequences \w, \W, \b, and \B. For example, French permits a number
of letters with diacritics, such as é or ç, but \w will not match these characters by
default. However, by using the locale for French, it is possible to make \w include
these characters.
(246) regexes2LocaleEx1.py
import re
french = 'French sentence goes here.'
reo = re.compile(r'(\w+)', re.LOCALE)
mo = reo.match(french)
the interpretation of these two metacharacters is changed so that they match the
beginning and ending of lines rather than of the string being matched against.
The code in (247) illustrates the difference between a regular expression object
compiled with this flag versus one compiled without it.
(247) regexes2MultilineCompileFlagEx1.py
import re

s = '''Dissection is a virtue when
it operates on other men.'''

regex = r'^(.*)$'
reo = re.compile(regex)
mo = reo.match(s)
if mo :
    print mo.group(1)

reo = re.compile(regex, re.MULTILINE)
mo = reo.match(s)
if mo :
    print mo.group(1)
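The effect of the MULTILINE flag is perhaps easiest to see when every match is collected. The sketch below reuses the string and regular expression from (247) but swaps in re.findall() (an illustrative substitution, not the listing's own method) so that both lines show up in the result.

```python
import re

s = "Dissection is a virtue when\nit operates on other men."
regex = r"^(.*)$"

# Without the flag, ^ and $ refer to the whole string; the dot cannot
# cross the newline, so there is no match at all
plain = re.findall(regex, s)

# With the flag, ^ and $ match at the start and end of every line
multi = re.findall(regex, s, re.MULTILINE)

print(plain)
print(multi)
```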
VERBOSE As you’ve probably figured out by now, regular expressions can get
pretty complicated. And the more complicated they get, the more cryptic they
become. (Power, not readability, is their forte.) It is wise when writing difficult
regular expressions to comment them, much as you would any other code. One
shown in (249).
(249) regexes2VerboseCompileFlagEx1.py
import re, sys

file = sys.argv[1]

# Read the text to be searched
fo = open(file, 'r')
text = fo.read()
fo.close()

# Regular expression for URLs
# (https?://)  : Protocol (http:// or https://)
# [a-zA-Z_\.]  : Character class of valid characters

# Regular expression that matches valid URLs
regex = r'(https?://)([a-zA-Z_\.])'
ro = re.compile(regex)
ro.match(text)
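With the VERBOSE flag, the comments can move inside the regular expression itself. The following is a minimal sketch rather than a full URL matcher: it assumes that the character class is meant to repeat (hence the added plus sign), and the test URL is just an example.

```python
import re

# Whitespace and #-comments inside the pattern are ignored under re.VERBOSE
url_regex = re.compile(r"""
    (https?://)      # protocol: http:// or https://
    ([a-zA-Z_.]+)    # host name: letters, underscores, and dots (assumed to repeat)
    """, re.VERBOSE)

mo = url_regex.match("http://www.python.org/doc")
if mo:
    print(mo.group(1))
    print(mo.group(2))
```

Because whitespace is ignored in this mode, a literal space in the pattern has to be escaped or placed inside a character class.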
One of the keys to mastering regular expressions in Python is familiarizing oneself
with all of the functions and classes available for the re module. A list of the most
commonly used ones is provided in Table 18.2 for ease of reference.
The use of each of the main regular expression methods will be discussed and
illustrated below.
A regular expression object has quite a few methods available for it, and familiarizing
yourself with these is a must for regular expression mastery in Python.
18.3.1.1 re Functions
There are a few functions that can be called directly from the re module. The two
most important of these are compile() and escape().
(??) loops through the integers between 1 and 100,000 twice. In each loop,
every number is converted to a string and compared to a regular expression for a
match. The first loop does this using a precompiled regular expression while the
second loop does this using a regular expression that is compiled on the fly. On
average, the loop with the precompiled regular expression runs twice as fast as the
other.
must all be escaped. A much more convenient way of doing the same thing is
to include an option that forces all metacharacters to be treated as literals, as in
(253).
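What escaping buys you can be seen in a small sketch using re.escape(). The search string and text are invented; the point is only that a string full of metacharacters behaves very differently depending on whether it is escaped first.

```python
import re

text = "Price: $5.00 (approx.)"
literal = "$5.00"            # contains two metacharacters: $ and .

# Used as-is, $ is read as "end of string", so the search fails
unescaped = re.search(literal, text)

# re.escape() backslash-escapes the metacharacters so they match literally
escaped = re.search(re.escape(literal), text)

print(unescaped)
print(escaped.group())
```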
(253) regexes1FileSearchRev.py
import sys, re
try :
    pattern = sys.argv[1]
    fn = sys.argv[2]
    i = 0
    match = False
    ro = re.compile(pattern)
    fo = open(fn, 'r')
    for l in fo :
        i = i + 1
        mo = ro.search(l)
        if mo :
            match = True
18.3.1.2 Methods
The remaining methods can be called in one of two ways. Either they can be called
directly from the module re as functions (e.g., re.findall()) or they can be
called as methods of a compiled regular expression (i.e., a RegularExpression
object). When called as functions, a regular expression string is required as the
first argument of the function.
(254) regexes2FindallEx1.py
import re, sys

# Get filecontents for file supplied on command line
file = sys.argv[1]
fo = open(file, 'r')
fc = fo.read()
fo.close()

# Regex matches two sequences of alphanumeric chars
# separated by a hyphen
regex = r'\w+-\w+'

# Find all matches and print them out
for m in re.findall(regex, fc) :
    print m
If one or more groups are present in the RE, return a list of groups; this will
be a list of tuples if the RE has more than one group. Empty matches are included
in the result unless they touch the beginning of another match.
(255) regexes2FindallEx2.py
import re, sys

# Get filecontents for file supplied on command line
file = sys.argv[1]
fo = open(file, 'r')
fc = fo.read()
fo.close()

# Regex matches two sequences of alphanumeric chars
# separated by a hyphen
regex = r'\w+-\w+'

# Find all matches and print them out
for m in re.findall(regex, fc) :
    print m
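How the return value of findall() changes with the number of groups can be sketched as follows. The sample string is pieced together from hyphenated words quoted in Chapter 17; the three patterns differ only in where parentheses are placed.

```python
import re

text = "a six-hour, self-accusatory, twenty-five"

# No groups: a list of the full matches
whole = re.findall(r"\w+-\w+", text)

# One group: a list of what that group matched
one_group = re.findall(r"(\w+)-\w+", text)

# Two groups: a list of tuples, one tuple per match
two_groups = re.findall(r"(\w+)-(\w+)", text)

print(whole)
print(one_group)
print(two_groups)
```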
match(RE, str[, flags]) The name of this method is somewhat mis-
leading. Although it may seem like this method would simply find a match for a
regular expression, its behavior is in fact more restricted than that. The method
match() determines whether a regular expression matches the beginning of a
string.
(256) regexes2SearchEx1.py
import re, sys

# Get filename from command line
file = sys.argv[1]

# Regex matches lines starting and ending w/ same char
regex = r'^(.).*\1$'

# Print out every line in file that matches regex
fo = open(file, 'r')
for l in fo.readlines() :
    if re.match(regex, l) :
        print l
fo.close()
To find a match anywhere in a string, and not just at the beginning, the method
search() should be used. In many ways, match() is superfluous, given that
calling it is the equivalent of calling search() with the line-start metacharacter
(^) prepended to the regular expression.
search(RE, str[, flags]) To compare a regular expression with a string
and see whether there are any matches, the method search() can be used. If a match
is found, a MatchObject is returned; otherwise, None is returned.
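The difference between match() and search() can be checked directly. The sketch below borrows the sample sentence from (244) and shows only the module-level functions re.match() and re.search().

```python
import re

s = "It is good to read books."

# match() only looks at the beginning of the string
at_start = re.match("books", s)

# search() scans the whole string
anywhere = re.search("books", s)

# search() with a leading ^ behaves like match() here
anchored = re.search("^books", s)

# match() succeeds when the pattern does occur at the beginning
first_word = re.match("It", s)

print(at_start)
print(anywhere.group())
print(anchored)
print(first_word.group())
```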
(257) regexes2MatchEx1.py
regex = r'^(.).*\1$'
sub(RE, repl, str[, cnt]) To replace one string with another, the
string method replace can be used (see §14.4.2.2). However, there are many
cases where replacements need to take advantage of pattern-matching. More
intelligent string replacements can be done with regular expressions using the
method sub(), as illustrated by (259), which replaces two or more newline
characters with two newline characters.
(259) regexes2SubEx1.py
import re, sys

# Get filename from command line
file = sys.argv[1]

# Get file contents as string
fo = open(file, 'r')
fc = fo.read()
fo.close()

# Regex that matches 2 or more newlines
regex = r'\n{2,}'

# Perform replacement on file contents and print results
print re.sub(regex, '\n\n', fc)
subn(RE, repl, str[, cnt]) The method subn behaves almost identically
to sub; however, instead of returning only the string that has had substitutions
performed on it, it returns a tuple with the new string and the number of
substitutions performed. Using this method, it is very easy to perform a substitution
and report how many times the substitution was performed, as illustrated by
(??), which does a global find-replace on a file and prints out the results as well
as the number of replacements made.
(260) regexes2SubnEx1.py
import re, sys

# Get filename from command line
file = sys.argv[1]

# Get file contents as string
fo = open(file, 'r')
fc = fo.read()
fo.close()

# Regex that matches 2 or more newlines
regex = r'\n{2,}'

# Perform replacement on file contents and print results
newFc, numSubs = re.subn(regex, ' ', fc)

# Print message to stderr
msg = "Total Number of Substitutions: %i." % (numSubs)
sys.stderr.write(msg)

# Print new file contents to stdout
print newFc
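The relationship between sub() and subn() can be seen without reading a file. The sketch below uses an invented in-memory string and, as an assumption for the sake of the example, collapses runs of blank lines down to a single blank line.

```python
import re

# Invented stand-in for file contents
fc = "para one\n\n\n\npara two\n\n\npara three\n"
regex = r"\n{2,}"

# sub() returns only the new string
new_text = re.sub(regex, "\n\n", fc)

# subn() returns the new string together with the number of substitutions
new_fc, num_subs = re.subn(regex, "\n\n", fc)

print(repr(new_fc))
print(num_subs)
```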
The various methods for regular expression matching produce match objects,
which can then be used to obtain further information and perform additional ma-
nipulations.
18.4.1 Grouping
In §17.2.1 we learned how to group expressions in regular expressions with parentheses
to ensure their proper parsing. However, these parenthetical groupings can
also be utilized in Python to simplify a variety of string manipulation tasks. In
fact, Python provides fairly extensive group-handling capabilities which go far beyond
what is available even in fairly regular expression-friendly languages (e.g.,
Perl). The various grouping conventions available in Python regular expressions
are listed below in Table 18.3. (Remember, these conventions are Python-specific
and will not carry over to regular expressions in other contexts.)
We will illustrate the usage of each convention and provide some advice con-
cerning their strengths and weaknesses.
Simple Grouping: (R) We have already seen how parentheses can be used to
group regular expressions. Sometimes this grouping is necessary in order to clarify
how a regular expression is interpreted. For example, if you were searching
for the past participle or gerund form of the verb kill, you could use the regular
expression kill(ed|ing), where parentheses are used to ensure that the regular
expression matches kill followed by either ed or ing. (Without parentheses, as
killed|ing, the regular expression would be interpreted as killed or ing, which
is clearly not what we want.)
This grouping can also be used to tease apart the various elements of a regular
expression. Continuing with our kill example, we could use the code in (261) to
search for every instance of the words killed or killing.
(261) regexes2SimpleGroupingEx1.py
import sys, re
fn = sys.argv[1]
fo = open(fn, 'r')
lineList = fo.readlines()
fo.close()
regex = '((kill)(ed|ing))'               # Define regex
reo = re.compile(regex)                  # Compile regex into object
for l in lineList:                       # Go through each line...
    mo = reo.search(l)                   # Obtain match object
    if mo :                              # If there's a match...
        suffix = mo.group(3)             # Obtain suffix group
        nl = reo.sub(r'*\g<1>*', l)      # Surround match w/ stars
        print nl                         # Print starred line
There are a number of things to note about this script. First, the filename
comes from the command line, so the script can be run on any file. Second, the regular
expression contains three groupings: one for the entire verb form, one for the
verb stem, and one for just the verb suffixes. These groupings can be referenced
by number, starting from the left with 1. (Don’t get group numbering confused
with list indices, which count from 0 rather than 1.) If a match is found for
((kill)(ed|ing)), the numbered groupings would be those shown in Figure 18.1.
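Group numbering can also be inspected interactively. The sketch below applies the regular expression from (261) to an invented sentence and prints each numbered group; group 0 always refers to the entire match.

```python
import re

mo = re.search(r"((kill)(ed|ing))", "He was killing time.")

# Groups are numbered by the position of their opening parenthesis
print(mo.group(0))   # the entire match
print(mo.group(1))   # outermost group: the whole verb form
print(mo.group(2))   # the verb stem
print(mo.group(3))   # the suffix
print(mo.groups())   # all numbered groups as a tuple
```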
for easier identification in the output. If we run this script on an article by Peter
Beaumont in The Guardian, we obtain the following output (note the use of stars
to flag the matched strings):
brothers and expressed the hope that Brazil, the team
he was following, would win.
That is why his family is so baffled by his decision.
At his home in the refugee camp of al-Faraa, near Nablus,
yesterday his father and brothers said they could not
believe that he had *killed* himself.
"He said he did not like the idea of civilians being
*killed*. But no one forced him to this. He chose
this route. He had a good life - a good upbringing.
He was hoping to study for his PhD."
He says he does not want to "kill for the sake of *killing*
but so that others might have life", but that it is
"nice to be *killed* while *killing*".
The example in (262) is very similar to the example in (261). In both cases, we
are searching for the verb root kill followed by the suffix ed or ing. In (261) each
match was bracketed by stars so that matches could be seen more easily in the
program’s output. The example in (261) has a number of groupings to illustrate
how the counting of groups works, but only two are necessary: one to group the
possible suffixes and another to group the entire verb so that it can be referenced
for text substitution. By inserting a question mark and a colon at the beginning of
the second grouping, it ceases to count, and we are able to refer to the verb stem
as 1 and the suffix as 2. Therefore, (262) is equivalent to (263), where only two
groupings are used. (The two programs are identical save for line 6, where the
regular expression is defined.)
(263) regexes2NongroupingEx2.py
import re, sys                               # Required modules
fn = sys.argv[1]                             # File to search
fo = open(fn, 'r')                           # Get list of lines
lineList = fo.readlines()
fo.close()
regex = '(kill)(ed|ing)'                     # Create regex
reo = re.compile(regex)                      # Compile regex object
for l in lineList:                           # Loop over line list
    mo = reo.search(l)                       # Obtain match object
    if mo :                                  # Is there a match?
        print reo.sub(r'*\g<1>-\g<2>*', l)   # Flag match w/ stars
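The effect of the question mark and colon on group counting can be sketched as follows. Both patterns are illustrative variants of the kill example (the second makes the outermost grouping non-counting), and the sample string is invented.

```python
import re

line = "killed while killing"

# Three counting groups: whole verb, stem, suffix
capturing = re.compile(r"((kill)(ed|ing))")

# (?: ... ) groups without counting, leaving stem = 1 and suffix = 2
non_capturing = re.compile(r"(?:(kill)(ed|ing))")

groups_with = capturing.search(line).groups()
groups_without = non_capturing.search(line).groups()
flagged = non_capturing.sub(r"*\g<1>-\g<2>*", line)

print(groups_with)
print(groups_without)
print(flagged)
```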
Followed By: (?= R) It can be useful to search for one pattern only when it
is followed by another.
(264) regexes2FollowedByEx1.py
# TODO 1
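Pending a full listing, a minimal sketch of the followed-by convention might look like this. It continues the kill example with an invented string: the stem is matched only where ing comes next, and the ing itself is not part of the match.

```python
import re

text = "kill killed killing skill"

# Match kill only when it is immediately followed by ing
mo = re.search(r"kill(?=ing)", text)
hits = re.findall(r"kill(?=ing)", text)

print(mo.group())
print(mo.start())
print(hits)
```

Since the lookahead consumes nothing, the match is four characters long even though seven characters had to be inspected.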
Not Followed By: (?! R) Just as it can be useful to search for one pattern
only when it is followed by another, it can also be useful to search for one pattern
only when it is NOT followed by another.
(265) regexes2NotFollowedByEx1.py
# TODO 1
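Pending a full listing, the not-followed-by convention can be sketched in the same spirit. The pattern below (an illustrative choice, with an invented sample string) finds words beginning with kill except for the two inflected forms killed and killing.

```python
import re

text = "kill killed killing killer kills"

# Words that start with kill, unless kill is followed by ed or ing
others = re.findall(r"\bkill(?!ed|ing)\w*", text)

print(others)
```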
handy method for matching a group within a regular expression and referring to
it within the same regular expression, much as one would a variable. We have
already seen backreferences in use in (??) and (??), but here we discuss them in a
bit more detail and provide a few more examples of their use.
One linguistic domain where backreferences are especially handy is reduplication,
which can be defined as a morphological process in which a root or stem
is repeated, either in part or as a whole. In some languages, reduplication is a
very productive process (e.g., Ilocano, where it marks plurality). We will look
at reduplication in a collection of Tzeltal texts (Stross, 1977). These texts have
been placed in a single file (love in the armpit.txt), with bibliographic
information provided on lines beginning with @, Tzeltal text provided on lines beginning
with //, and English translations on lines with no special marker. The
mark-up scheme for these texts is illustrated below:
# --------------------------------------------------------
# REGULAR EXPRESSION BREAKDOWN:
# \w = any valid letter
# \' = apostrophe marks ejectives and glottal stops
# [\w\'] = character class for letters and apostrophes
# {3,5} = match 3 to 5 times
# \1 = backreference
# --------------------------------------------------------
regex = r'^//.*([\w\']{3,5})\1'   # Create regex
ro = re.compile(regex)            # Compile regex
for l in lineList :               # Loop over line list
    if ro.search(l) :             # Check for redup
        l = l.replace("\n", "")   # Ditch newlines
        print l                   # Print line
The script in (267) looks for every line of Tzeltal text, which should start with
//, and finds every word that contains a sequence of 3 to 5 Tzeltal letters that is
immediately repeated.
Running the program on love in the armpit.txt should produce a num-
ber of hits. The output has a few drawbacks, however, which we ought to remedy.
First, it is hard to identify where the reduplication occurs at a glance. We need
some way of flagging the words containing reduplication. Second, no information
is provided as to where these examples occur in our text. Each of these deficits is
corrected in (268).
Since most of the work in (268) happens inside of the regular expression, we
will look at it carefully.
(268) regexes2BackrefEx2.py
    import sys
    import re

    # Deal with file from command-line
    fn = sys.argv[1]
    fo = open(fn, 'r')
    lineList = fo.readlines()
    fo.close()

    # --------------------------------------------------------
    # REGULAR EXPRESSION BREAKDOWN:
    #   \w      = any valid letter
    #   \'      = apostrophe marks ejectives and glottal stops
    #   [\w\']  = character class for letters and apostrophes
    #   {3,5}   = match 3 to 5 times
    #   \2      = backreference to the inner group
    #             (group 1 is the whole word)
    # --------------------------------------------------------
    regex = r'(\w*([\w\']{3,5})\2\w*)'      # Create regex
    ro = re.compile(regex)                  # Compile regex object
    for i, l in enumerate(lineList) :       # Loop over index and line
        if ro.search(l) :                   # Check for match
            l = l.replace("\n", "")         # Get rid of newline
            newLine = ro.sub(r'*\1*', l)    # Flag reduplication
            print str(i) + ' : ' + newLine  # Print line
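The flagging step can also be tried out without an input file. The following is a minimal sketch, written for Python 3 (where print is a function) rather than the Python 2 used in the listings. The function name flag_reduplication and the two test strings are invented for illustration and are not actual Tzeltal data.

```python
import re

# Same pattern as the flagging script: group 1 is the whole word,
# group 2 is the 3-5 character chunk that is immediately repeated.
REDUP = re.compile(r"(\w*([\w']{3,5})\2\w*)")

def flag_reduplication(line):
    """Return the line with each reduplicated word wrapped in asterisks."""
    return REDUP.sub(r"*\1*", line)

# Invented test strings (not real Tzeltal data).
samples = ["// la tsajtsaj te winik", "// ma ba ay"]
for number, line in enumerate(samples, start=1):
    if REDUP.search(line):
        print("%d : %s" % (number, flag_reduplication(line)))
```

Note that the replacement string is a raw string. Without the r prefix, Python would read \1 as a control character rather than as a reference to group 1.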
18.4 Extensions to Regular Expressions
We saw in §18.2.2 that there is a compilation flag that permits the interleaving
of comments into regular expressions. But Python gives you another option:
comments can be inserted within parentheses, as in (269).
(269) regexes2CommentGroupingEx1.py
    import re
    regex = r'(?#This matches phone numbers)(\d+) (\d+)-(\d+)'
    reo = re.compile(regex)     # Compile regex object
    s = '805 528-1267'
    mo = reo.search(s)          # Obtain match object
    if mo :                     # Is there a match?
        print s                 # Print the matching string
Set Mode Flag: (?letter) We have already seen how regular expression
compilation flags allow for systematic changes to the interpretation of regular
expressions. In some cases, however, it is desirable to change the interpretation of
part of a regular expression rather than the whole thing.
(270) regexes2SetModeFlagEx1.py
    import re
    regex = r'(?i)'             # Mode flag letters are lower-case
    reo = re.compile(regex)
    s = '???'
    mo = reo.search(s)
    if mo :
        print s
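As a hedged illustration of mode flags, the sketch below contrasts a flag that applies to the whole pattern with one that is limited to part of it. It assumes Python 3.6 or later, where the scoped form (?i:...) is available. The test words are arbitrary.

```python
import re

# (?i) at the start of a pattern switches on case-insensitive matching
# for the whole pattern, just like passing re.IGNORECASE to re.compile().
whole = re.compile(r"(?i)tzeltal")

# (?i:...) limits the flag to the enclosed part (Python 3.6 and later).
part = re.compile(r"(?i:tz)eltal")

print(bool(whole.search("TZELTAL")))   # whole pattern ignores case
print(bool(part.search("TZeltal")))    # only "tz" ignores case
print(bool(part.search("TZELTAL")))    # "eltal" is still case-sensitive
```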
By default, quantifiers are “greedy”, meaning that they will match the longest
string possible.2 Therefore, even though the quantifier + matches 1 to n instances
of whatever it quantifies, it will by default match the largest number possible.
In (271), the plus sign following the letter b will therefore match the entire
sequence (all 10 instances of the letter).
(271) regexes2GreedinessEx1.py
    import sys
    import re

    s = 'aaaaabbbbbbbbbbccccc'
    regex = r'(b+)'
    ro = re.compile(regex)
    mo = ro.search(s)
    if mo : print mo.group(1)
To see how this might be useful, imagine that you wish to extract the first
sentence from every paragraph of a text. Assuming that there is one paragraph per
line in a text, (273) would do the job.
(273) regexes2GreedinessEx3.py
    import sys
    import re

    fn = sys.argv[1]
    fo = open(fn, 'r')
    lineList = fo.readlines()
    fo.close()

    # --------------------------------------------------------
    # REGULAR EXPRESSION BREAKDOWN:
    #   ^        = matches beginning of line
    #   .*?      = matches anything non-greedily
    #   [\?\.!]  = character class that matches '?', '.', or '!'
    #              (metacharacter escaped with backslash)
    # --------------------------------------------------------
    regex = r'^(.*?[\?\.!])'
    ro = re.compile(regex)
    for l in lineList :
        mo = ro.search(l)
        if mo :
            print mo.group(1)
2
GIVE QUOTE ABOUT INVETERATE ATTRIBUTION OF ANIMACY TO NATURE
FROM NAGEL
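The difference between greedy and non-greedy matching can be checked directly on short strings. The sketch below is written for Python 3. It reuses the string from (271), and the three-sentence "paragraph" is made up for the purpose.

```python
import re

s = "aaaaabbbbbbbbbbccccc"
greedy = re.search(r"(b+)", s).group(1)    # as many b's as possible
lazy = re.search(r"(b+?)", s).group(1)     # as few b's as possible
print(len(greedy), len(lazy))

# Hypothetical one-line "paragraph" with three sentences.
paragraph = "It rained. Did they leave? They stayed!"
first_greedy = re.search(r"^(.*[?.!])", paragraph).group(1)
first_lazy = re.search(r"^(.*?[?.!])", paragraph).group(1)
print(first_greedy)   # runs to the last sentence-final mark
print(first_lazy)     # stops at the first sentence-final mark
```

Inside a character class, the characters ?, . and ! need no backslash, which is why the class is written here as [?.!].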
18.5 Exercises
1. ???
2. ???
3. ???
4. ???
5. ???
stance. Return None if the string does not match the RE; note
that this is different from a zero-length match.
search(RE, str[, flags])
    Scan through string looking for a location where the regular
    expression RE produces a match, and return a corresponding
    MatchObject instance. Return None if no match is found.
split(RE, str[, maxsplit])
    Split string by the occurrences of RE. If capturing parentheses
    are used in RE, then the text of all groups in the RE is also
    returned as part of the resulting list. If maxsplit is nonzero, at
    most maxsplit splits occur, and the remainder of the string is
    returned as the final element of the list.
sub(RE, repl, str[, cnt])
    Return the string obtained by replacing the leftmost non-
    overlapping occurrences of RE in string by the replacement
    repl. If the RE isn’t found, string is returned unchanged. repl
    can be a string or a function; if it is a string, any backslash
    escapes in it are processed.
subn(RE, repl, str[, cnt])
    Perform the same operation as sub(), but return a tuple
    (new string, number of subs made).
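The behavior of split(), sub() and subn() is easiest to see on a small string. The following sketch is written for Python 3. The sample string is arbitrary.

```python
import re

text = "one, two;three"

# split(): break the string wherever the pattern matches.
plain = re.split(r"[,;]\s*", text)

# With capturing parentheses, the separators are kept in the result.
with_separators = re.split(r"([,;])\s*", text)

# maxsplit limits the number of splits; the rest stays in one piece.
limited = re.split(r"[,;]\s*", text, maxsplit=1)

# sub() returns the new string; subn() also reports how many
# substitutions were made.
replaced = re.sub(r"[,;]\s*", " ", text)
counted = re.subn(r"[,;]\s*", " ", text)

for result in (plain, with_separators, limited, replaced, counted):
    print(result)
```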
Table 18.3 Regular Expression Grouping in Python
Sequence        Meaning
(R)             Treat regular expression R as a group
(?:R)           Do not treat R as a group
(?=R)           ???
(?!R)           ???
(?P<name>R)     Give the group the name <name>
(?P=name)       Refer to the grouping named <name>
\num            Refer to the grouping number <num>
(?#...)         Comment (the contents are ignored)
(?letter)       Set mode flag
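Several of these grouping constructs can be illustrated in a few lines. The sketch below is written for Python 3 and relies only on the standard behavior of the re module. The lookahead forms (?=R) and (?!R) are included on that basis. The test strings are invented, apart from the phone number from (269).

```python
import re

# Named groups: (?P<name>R) labels a group, (?P=name) refers back to it.
redup = re.compile(r"(?P<syl>\w{2})(?P=syl)")
mo = redup.search("a mamama b")
print(mo.group("syl"))

# (?:R) groups without capturing, so only two groups are numbered here.
phone = re.compile(r"(?:\d+) (\d+)-(\d+)")
print(phone.search("805 528-1267").groups())

# Lookahead assertions: (?=R) requires R to follow, (?!R) forbids it;
# neither consumes any text.
followed = re.findall(r"\w+(?=!)", "stop! go now!")
not_followed = re.findall(r"\b\w+\b(?!!)", "stop! go now!")
print(followed, not_followed)
```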
Chapter 19
XML
19.1 What is XML?
The eXtensible Markup Language (XML) is a format for the encoding of
structured documents. It is related to HTML, since both are simplified
derivatives of SGML, the Standard Generalized Markup Language (ISO 8879).
Originally developed by the W3C (World Wide Web Consortium), XML was
designed to be a simplified version of SGML for use on the web. Since its initial
release in 1998, it has become increasingly widespread. We will provide only a
brief overview for those readers who have no prior acquaintance with it, since there
is extensive documentation of XML available in print (see Ray (2003) or Harold
and Means (2004)) and on-line (http://www.w3.org/XML).
In brief, the chief advantages of XML over other schemes for storage and
exchange of data are:
Portable Since XML is simply text, it is very portable across different presently
existing operating systems.
Durable Because XML is simply text, it will remain readable in all foreseeable
future operating systems.
Flexible XML defines a syntax for the encoding of documents but not its content,
making it a very flexible general purpose tool.
Simple XML is a very simple markup language, which is easy to learn and easy
to understand.
Easily Processed Because XML is in common usage, there is an abundance of
tools available for its automated processing.
19.2 The Syntax of XML
The basics of XML syntax are fairly straightforward. (276) illustrates a very sim-
ple XML annotation of a sentence (taken from the Penn Corpus Marcus et al.
(1993, 1994)), one which consists only of tags with no attributes.
(276) pennCorpusEx1.xml
    <S>
      <NP>
        <DT>A</DT>
        <NN>stockbroker</NN>
      </NP>
      <VP>
        <VBZ>is</VBZ>
        <NP>
          <DT>an</DT>
          <NN>example</NN>
          <PP>
            <IN>of</IN>
            <NP>
              <DT>a</DT>
              <NN>profession</NN>
              <PP>
                <IN>in</IN>
                <NP>
                  <NN>trade</NN>
                  <CC>and</CC>
                  <NN>finance</NN>
                </NP>
              </PP>
            </NP>
          </PP>
        </NP>
      </VP>
      <.>.</.>
    </S>
The following rules account for nearly everything that you need to know about
XML syntax (a tutorial for these rules can be found at
http://www.w3schools.com/xml/default.asp):
All XML elements must have a closing tag This means that for every start tag
there is a matching end tag. So, whereas the matching pair <NP></NP>
is correct XML syntax, <NP> is not. (The tags <p></p> and <br> come
from HTML (HyperText Markup Language), which is similar to
XML, but does not entirely conform to its standards.)
XML tags are case sensitive Start and end tags must match in case. Therefore,
<word></word> would be correct whereas <Word></word> would be
incorrect (since the start tag begins with a capital letter but the end tag does
not). There is no requirement, however, that tags must be upper or lower-
case. Therefore, both <p></p> and <P></P> are correct.
All XML elements must be properly nested Tags that occur within other tags
must be closed before their containing tags. An example of proper nesting
would be <p>Nested <b>XML</b></p>, where the tags <b></b>
are nested inside of the tags <p></p>.
All XML documents must have a root element All elements must occur inside
of a single base tag, usually called the root element.
Attribute values must always be quoted In addition to their content, elements
can possess attributes, whose values must be enclosed in quotation marks.
Consider, for example, a tag for pronouns in English. The tag <pro></pro>
is used to surround a pronoun, and attributes are used to specify various
features of the pronoun, such as person, number, and gender. The full set of
pronouns in English might look as follows:
(275) attributesExample1.xml
    <pronouns lg="eng">
      <pro per="1" num="sg" case="nom">I</pro>
      <pro per="1" num="sg" case="acc">me</pro>
      <pro per="1" num="pl" case="nom">we</pro>
      <pro per="1" num="pl" case="acc">us</pro>
      <pro per="2" num="sg" case="nom">you</pro>
      <pro per="2" num="sg" case="acc">you</pro>
      <pro per="3" num="sg" case="nom" gender="m">he</pro>
      <pro per="3" num="sg" case="acc" gender="m">him</pro>
      <pro per="3" num="sg" case="nom" gender="f">she</pro>
      <pro per="3" num="sg" case="acc" gender="f">her</pro>
      <pro per="3" num="pl" case="nom">they</pro>
      <pro per="3" num="pl" case="acc">them</pro>
    </pronouns>
Comments take the form of special tags Comments, which will be ignored by
parsers, can be interleaved within an XML document using the tag <!-- -->,
as illustrated in (277), where comments have been added to (276) in order to
identify the subject and predicate.
(277) xmlSyntaxEx1.xml
    <S>
      <!-- subject -->
      <NP>
        <DT>A</DT>
        <NN>stockbroker</NN>
      </NP>
      <!-- predicate -->
      <VP>
        <VBZ>is</VBZ>
        <NP>
          <DT>an</DT>
          <NN>example</NN>
          <PP>
            <IN>of</IN>
            <NP>
              <DT>a</DT>
              <NN>profession</NN>
              <PP>
                <IN>in</IN>
                <NP>
                  <NN>trade</NN>
                  <CC>and</CC>
                  <NN>finance</NN>
                </NP>
              </PP>
            </NP>
          </PP>
        </NP>
      </VP>
      <.>.</.>
    </S>
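The syntax rules above can be checked mechanically, since an XML parser refuses any document that breaks them. The following is a minimal sketch for Python 3, using the xml.etree.ElementTree module from the standard library. The helper name is_well_formed and the short test documents are our own.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

examples = {
    "closed and nested": "<p>Nested <b>XML</b></p>",
    "missing end tag": "<NP><DT>a</DT>",
    "case mismatch": "<Word>dog</word>",
    "improper nesting": "<p>Nested <b>XML</p></b>",
    "two root elements": "<NP></NP><VP></VP>",
    "unquoted attribute": "<pro per=1>I</pro>",
    "with comment": "<S><!-- subject --><NP>dogs</NP></S>",
}
for label, text in examples.items():
    print("%-20s %s" % (label, is_well_formed(text)))
```

Only the first and last documents are accepted. Each of the others violates exactly one of the rules listed above.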
19.3 Processing XML in Python
An XML parser provides a way of taking a chunk of XML and breaking it down
into its parts for processing. In other words, it is a tool that automates XML
parsing. For those unfamiliar with the terminology, the term parse means “to
break information down into its constituent parts”. There are two standard parsers
for XML, SAX and DOM. The difference between the two concerns how they
read and store the contents of an XML file for processing.
SAX stands for Simple API for XML. It is a serial access parser, so called
because it reads an XML file by going through it serially (i.e., from start to
finish). As it goes through the XML file’s parts, it calls back to event handlers
that decide how to act upon a particular part of the XML document. The
advantage of SAX is that it is fast and efficient, but its drawback is that it can be
more difficult to write programs using it, since they must always keep track of
where they are in the document as it is being read through. It is best suited to
dealing with large XML files where certain types of information are handled in a
uniform fashion, regardless of where they occur in the document.
DOM stands for Document Object Model and, unlike SAX, it works by reading
an entire XML document into memory and providing an interface to its
hierarchical structure. The hierarchical structure of the document can then be
treated like a tree, where there is a single root node and various child nodes
beneath it. Since this is a fairly natural way of viewing XML, it is often easier to
write programs in this fashion. The drawback, however, is that reading an entire
XML file can be memory-intensive and inefficient.
Here we will present two Python modules for XML processing: xml.sax, an
older module which has been part of the Python standard library since version
2.0, and elementtree, a newer API which will become part of the Python standard
library as of version 2.5.
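To make the contrast concrete, the sketch below counts the noun phrases in a shortened version of the Penn Corpus sentence in two ways: once with an event handler in the SAX style, and once by loading the whole document as a tree. It is written for Python 3, where ElementTree is available as xml.etree.ElementTree. ElementTree is a tree-based interface rather than a DOM implementation in the strict sense. The class name TagCounter is our own.

```python
import xml.sax
import xml.etree.ElementTree as ET

XML = (b"<S><NP><DT>A</DT><NN>stockbroker</NN></NP>"
       b"<VP><VBZ>is</VBZ><NP><DT>an</DT><NN>example</NN></NP></VP></S>")

class TagCounter(xml.sax.ContentHandler):
    """Event handler that counts how often a given tag is opened."""
    def __init__(self, tag):
        xml.sax.ContentHandler.__init__(self)
        self.tag = tag
        self.count = 0

    def startElement(self, name, attrs):
        # Called by the parser each time a start tag is read.
        if name == self.tag:
            self.count += 1

# SAX: stream through the document, reacting to events as they occur.
handler = TagCounter("NP")
xml.sax.parseString(XML, handler)
print(handler.count)

# Tree-based: load the whole document, then walk its hierarchy.
root = ET.fromstring(XML)
print(len(list(root.iter("NP"))))
```

The SAX version never holds more than the current event in memory, but the handler has to carry its own state (here, the counter). The tree version can simply ask the finished tree for what it wants.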
19.3.2 xml.sax
???
19.3.3 elementtree
???
It is also possible to navigate the implicit hierarchy of an XML document
using XPath, which is a language for finding information in an XML document.
It provides a convenient method of navigating through elements and attributes in
an XML document.
A complete introduction goes beyond the scope of this textbook. A good
tutorial can be found on the W3Schools site:
http://www.w3schools.com/xpath/default.asp.
Returning to the example from the Penn Corpus, we can obtain all of the noun
phrases as a list of Element objects. It is then very simple to find the number of
noun phrases by obtaining the length of the list.
(278) xmlEtreeFindNouns.py
    import elementtree.ElementTree as ET
    tree = ET.parse("pennCorpusEx1.xml")
    nouns = tree.findall(".//NP")       # All NPs, at any depth
    print len(nouns)
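The path given to findall() determines how deep the search goes. The following sketch is written for Python 3, with the sentence shortened and embedded as a string rather than read from a file. It shows the difference between looking only at the children of the root and looking at every level.

```python
import xml.etree.ElementTree as ET

SENTENCE = """
<S>
  <NP><DT>A</DT><NN>stockbroker</NN></NP>
  <VP>
    <VBZ>is</VBZ>
    <NP><DT>an</DT><NN>example</NN></NP>
  </VP>
</S>
"""
root = ET.fromstring(SENTENCE)
direct = root.findall("NP")         # only NPs that are children of <S>
anywhere = root.findall(".//NP")    # NPs at any depth
print(len(direct), len(anywhere))
for np in anywhere:
    print([child.text for child in np])
```

Note that find() returns only the first matching element, so taking its length would count that element's children rather than the noun phrases.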
19.4 Case Studies
        <item>
          <title>TNI AL Tangkap Enam Kapal Penyelundup Ikan</title>
          <link>http://www.antara.co.id/seenws/?id=36922</link>
          <description> Kapal patroli TNI AL, KRI Layang 805, berhasil menangkap enam kapal berikut 61 ABK (a
          <pubDate>Wed, 28 Jun 2006 17:22:06 +0100</pubDate>
          <guid>http://www.antara.co.id/seenws/?id=36922</guid>
        </item>
      </channel>
    </rss>
from it. In (280), we present a very simple program that parses newsfeeds of this
sort and prints out the key parts of information in it.
(280) xmlRSSNewsfeedParser.py
    import sys
    import elementtree.ElementTree as ET

    rssfile = sys.argv[1]
    tree = ET.parse(rssfile)            # Parse the file named on the command line
    for item in tree.findall("channel/item") :
        title = item.findtext("title")
        date = item.findtext("pubDate")
        cat = item.findtext("category")
        link = item.findtext("link")
        desc = item.findtext("description").strip()
        guid = item.findtext("guid")
        print "--"
        print "TITLE:       [%s]" % title
        print "DATE:        [%s]" % date
        print "GUID:        [%s]" % guid
        print "LINK:        [%s]" % link
        print "CATEGORY:    [%s]" % cat
        print "DESCRIPTION: [%s]" % desc
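One practical wrinkle is that not every item in a feed has every element, and findtext() returns None for a missing one, which makes a following call to strip() fail. The sketch below is written for Python 3 and shows one way around this, by supplying a default value. The two-item feed and the function name read_items are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed, made up for this sketch.
FEED = """<rss version="2.0"><channel>
<item><title>First story</title><link>http://example.org/1</link>
<description> Some text. </description></item>
<item><title>Second story</title><link>http://example.org/2</link></item>
</channel></rss>"""

def read_items(rss_text):
    """Return one dictionary per <item>, with '' for missing fields."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.findall("channel/item"):
        items.append({
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            "desc": item.findtext("description", default="").strip(),
        })
    return items

for entry in read_items(FEED):
    print("TITLE: [%s]  DESCRIPTION: [%s]" % (entry["title"], entry["desc"]))
```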
We will return to this case study in Chapter 21, where we will see how to
write the Python code that automatically pulls these RSS newsfeeds
from the web. This fully automates the entire process, so that the downloading
of newsfeeds and the parsing of them into a corpus can all be accomplished by a
single program.
as its author, its translator, its editor, its title, etc. The <body> contains the actual
contents of the text, broken down into a variable number of paragraphs, each
of which contains a variable number of lines. Each line is translated into multiple
languages. Each translation occurs within a language tag, and the name of the
language is provided as a code.
xxx
    <text>
      <title>
        <lg name='roo'>Eake iava viapau kaakaukare reoreopave</lg>
        <lg name='eng'>Why dogs don't talk</lg>
        <lg name='tp'>Watpo ol dok i no save toktok</lg>
      </title>
      <body>
        <paragraph>
          <line>
            <lg name='eng'>Gaurai lived alone in the bush.</lg>
            <lg name='tp'>Gaurai em wanpela i save stap long bus.</lg>
            <lg name='roo'>Gaurai raga oisoa topareve vegoaro.</lg>
          </line>
          <line>
            <lg name='eng'>He lived with his two dogs.</lg>
            <lg name='tp'>Em i save stap wantaim long tupela dok bilong em.</lg>
            <lg name='roo'>Vaiterei kaakau vaio rera vaitereiaro taporo oisoa t
          </line>
          <line>
            <lg name='eng'>He was very lonely and unhappy.</lg>
            <lg name='tp'>Em wanpela i save stap and em i no save hamamas.</lg>
            <lg name='roo'>Rera raga oisoa toupareve uva viapau oisoa ruraparev
          </line>
          <line>
            <lg name='eng'>He had no one to talk to.</lg>
            <lg name='tp'>Nogat narapela man long toktok wantaim.</lg>
            <lg name='roo'>Viapau irai oiratoa vai ra irai tapo orareoparo.</lg>
          </line>
        </paragraph>
        ...
      </body>
    </text>
Storing texts in this format has at least two virtues. First, because XML is
easily parsed and validated, the parts of the text can be easily extracted and its
overall integrity can be easily verified. It is a fairly trivial matter to ensure that
for each line, there is a translation in all three languages, as shown by (??), which
is a short script that takes an XML file, validates it against the above-described
format, and prints a short error message if anything is amiss.
???
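As a rough indication of what such a check might involve, here is a minimal sketch for Python 3. It is not the script referred to above, and it tests only one property: that every line has a translation in each required language. The function name missing_translations is our own, and the language codes are taken from the sample text.

```python
import xml.etree.ElementTree as ET

REQUIRED = {"eng", "tp", "roo"}   # language codes used in the sample text

def missing_translations(xml_text, required=REQUIRED):
    """Return (line number, missing codes) for every incomplete <line>."""
    root = ET.fromstring(xml_text)
    problems = []
    for number, line in enumerate(root.iter("line"), start=1):
        found = {lg.get("name") for lg in line.findall("lg")}
        missing = required - found
        if missing:
            problems.append((number, sorted(missing)))
    return problems

# A one-line text that lacks its Rotokas translation.
SAMPLE = """<text><body><paragraph>
<line><lg name='eng'>He lived with his two dogs.</lg>
<lg name='tp'>Em i save stap wantaim long tupela dok bilong em.</lg></line>
</paragraph></body></text>"""

for number, missing in missing_translations(SAMPLE):
    print("ERROR: line %d lacks: %s" % (number, ", ".join(missing)))
```

A document that is not well-formed XML never reaches this check, because ET.fromstring() raises a ParseError first.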
Second, because the text is stored independently of its formatting, it is possible
to create tools to extract its contents and automatically format them. This means
that errors in the text can be fixed in one place (namely, the XML version) and a
new formatted version can be produced quickly, easily, and automatically. This
is obviously superior to the alternative, which would be to maintain multiple
versions of the same text and, when errors are found in one version of the text, to
fix them in all others. (This is especially unmanageable as the number of languages
in parallel translation grows larger.) The script provided in (??) takes an XML
text and creates a bilingual text from it. The file and the two languages desired are
provided as command-line arguments.
???
Creating an English-Tok Pisin version of the text or an English-Rotokas version
of the text is simple, thanks to this script. The difference is simply which
language code is provided as the third argument on the command-line:
(96) xxx
    bilingual_text_formatter.py dogs.xml ENG TKP
    bilingual_text_formatter.py dogs.xml ENG ROO
Note that the script has been written in such a way that it will give an error
message if one of the language codes supplied as a command-line argument is not
found in the XML file being processed:
(97) xxx
    bilingual_text_formatter.py dogs.xml ENG ???
    ERROR: Language code ‘???’ not found in ‘dogs.xml’! Codes found: ENG, ROO, TK
      <semanticField>XXX</semanticField>
      <notes>XXX</notes>
      <dialects>
        <dialect>Central</dialect>
      </dialects>
      <examples>
        <example>
          <original>ikauvira</original>
          <morphemeGloss>run-ADV</morphemeGloss>
          <translation lg='eng'>quickly</translation>
        </example>
      </examples>
    </entry>
It is very easy to validate XML and consequently to parse it and extract its
parts.
19.5 Exercises
1. ???
2. ???
3. ???
4. ???
5. ???
19.6 Suggested Reading
There is a great deal more to XML than what is covered in this chapter. This chap-
ter really only scratches the surface, and there are a good number of worthwhile
topics that were neglected, such as XXX, XXX, XXX. There is no shortage of ma-
terials on XML. For an article that overviews XML programming in Python, see
McGrath (1998a). Two book-length treatments of the topic are McGrath (2000)
and Jones and Drake (2002).
Chapter 20
Databases
The title of this chapter is to some extent a misnomer, given that it is not just an
introduction to databases, but covers the larger issue of data persistence. Data
persistence refers to the problem of ensuring that data used during one running of
a program is available to other programs or for the running of the same program at
another time. There are a number of different ways of achieving data persistence.
Here we cover the three main options available in Python: files, pickling, and
databases. Each will be discussed in turn.
20.1 Files
One way of achieving data persistence is through the use of files. File IO was
already covered in Chapter ??, and XML is covered in Chapter 19. Here we will
look at two file formats commonly used for the storage of semi-structured data,
namely tab-delimited values and comma-separated values.
The main thing to point out is that flat files face a number of problems as far
as data persistence is concerned. The main problems are the following:
Concurrency XXX
Consistency XXX
In §20.3 we will see that relational databases solve a number of these problems
and therefore are usually a superior alternative. (There are some situations where
relational databases are overkill and flat files will do just fine, but a database of
some sort is essential for anything large-scale.)
The best way to handle files with comma-separated values is to use the module
csv, which has formed part of Python’s standard library since version 2.3. For full
documentation, see the appropriate entry in the Python Library Reference:
http://docs.python.org/lib/module-csv.html.
Before looking at the csv module, it pays to think briefly about why such a
module is needed. After all, the string method split can be used to split a string by
a given separator, as illustrated in (??).
(282) databasesSplitCSVEx1.py
    data = """???,???,???,???"""
    for line in data.split("\n") :
        for col in line.split(",") :
            print "%s" % col
The trouble with simply splitting a CSV file using commas is that commas can
occur within a data field, and such data fields will create parsing errors, as can be
seen in (??).
(283) databasesSplitCSVEx2.py
    data = """???,???,???,???"""
    for line in data.split("\n") :
        for col in line.split(",") :
            print "%s" % col
Although it is possible to write custom code to handle the parsing of CSV into
columns of data, it is much easier to use the built-in CSV capabilities of Python.
The approach is essentially the same as the one used for tab-separated values,
except that the splitting of lines into columns is done by ??? rather than a simple
split() function, as shown by the following script:
(284) dataPersistenceCSV1.py
    import sys

    def usage() :
        print "Usage: %s <FILEPATH>" % sys.argv[0]
        sys.exit(0)

    def main() :
        try :
            filepath = sys.argv[1]
        except IndexError :
            usage()

        fo = open(filepath, 'rU')
        fileContents = fo.read()
        fo.close()

        ln = 0
        for line in fileContents.split("\n") :
            ln = ln + 1
            cn = 0
            for col in line.split("\t") :
                cn = cn + 1
                print "[%i][%i] %s" % (ln, cn, col)

    main()
The advantage of this approach is that it handles both flavors of CSV format-
ting.
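The difference between naive splitting and the csv module shows up as soon as a field contains the separator. The sketch below is written for Python 3. The record is invented, and the reader is fed from a string by way of io.StringIO rather than from a file.

```python
import csv
import io

# Hypothetical record whose second field contains a comma.
line = 'Smith,"Portland, Oregon",1977\n'

naive = line.rstrip("\n").split(",")
proper = next(csv.reader(io.StringIO(line)))
print(naive)     # the quoted field is broken in two
print(proper)    # the quoted field survives intact

# The same reader handles tab-separated values via the delimiter argument.
tabbed = next(csv.reader(io.StringIO("car\tvehicle\tground\n"), delimiter="\t"))
print(tabbed)
```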
There is nothing particularly mysterious about files with tab-separated values.
For example, consider the following file with tab-separated values:
car vehicle ground
truck vehicle ground
jeep vehicle ground
tank vehicle ground
helicopter vehicle air
plane vehicle air
jet vehicle air
glider vehicle air
The basic idea is that each line has a number of values separated from one
another by tabs. These values can be thought of as columns of data and are
in fact displayed as such when files in this format are imported into spreadsheet
software (e.g., Microsoft Excel).
These files can be easily processed by opening a file object, extracting its
contents as a string, splitting the string into lines, and then splitting each line into
columns with the string method split, using tabs as the separator, i.e., split("\t").
(285) dataPersistenceTSV1.py
    import sys

    def usage() :
        print "Usage: %s <FILEPATH>" % sys.argv[0]
        sys.exit(0)

    def main() :
        try :
            filepath = sys.argv[1]
        except IndexError :
            usage()

        fo = open(filepath, 'rU')
        fileContents = fo.read()
        fo.close()

        ln = 0
        for line in fileContents.split("\n") :
            ln = ln + 1
            cn = 0
            for col in line.split("\t") :
                cn = cn + 1
                print "[%i][%i] %s" % (ln, cn, col)

    main()
20.2 Pickling
One of the main challenges in data persistence, especially in the context of object-
oriented programming, is how to save data that is stored within well-defined ob-
jects. Although it is always possible to print the data out into a text and then later
parse it back in to your object model, this is not always desirable. The main reason
for this is that you will have to print out data into a format that can be parsed back
in again and the work involved in ensuring that data can go into the object model
from a file and back out again can be onerous.
Python has a built-in method for saving objects out as files for the sake of data
persistence, which is colorfully referred to as pickling.1 This is a convenient way
of saving data that you have parsed into an object model (see ???) when you don’t
want to go to the work of printing it out again to save into a file.
For example, let’s assume that you have two wordlists for unrelated languages
spoken in adjacent regions and you want to parse both of them and then find any
words in common between the two. The code in (??) parses these two wordlists
and builds a dictionary of the terms within them for easy comparison. There is no
point in printing out the data itself because it will only be used to find common
words and print them out. We could therefore pickle the data, as shown in (??).
(286) dataPersistencePicklingEx1.py
    import pickle, sys

    def reverse_dictionary(fn) :
        d = {}
        fo = open(fn, 'rU')
        for l in fo :
            cols = l.rstrip("\n").split("\t")   # Drop newline, split columns
            term = cols[0]
            definition = cols[1]
            for w in definition.split(" ") :
                if not d.has_key(w) :
                    d[w] = []
                d[w].append(term)
        fo.close()
        return d

    def merge_dictionaries(d1, d2) :
        d = {}
        for w in d1.keys() :
            if d2.has_key(w) :
                terms1 = d1[w]
                terms2 = d2[w]
                d[w] = (terms1, terms2)
        return d

    def main() :
        try :
            fn1 = sys.argv[1]
            fn2 = sys.argv[2]
            fn3 = sys.argv[3]
        except IndexError :
            print "Usage: %s <DICT 1> <DICT 2> <PICKLE FILE>" % sys.argv[0]
            sys.exit(0)
        d1 = reverse_dictionary(fn1)
        d2 = reverse_dictionary(fn2)
        d = merge_dictionaries(d1, d2)
        fo = open(fn3, 'w')
        pickle.dump(d, fo)
        fo.close()

    main()
1
Pickling is sometimes referred to as “serialization”, “marshalling”, or “flattening”, because
???.
There are many different ways of finding cognates among the words in the
pickled object. Using the same pickled object, we can employ different means
of identifying cognates, and none of them needs to worry about parsing the files
anymore, since the work is already done and the result is saved in our pickled
file.
One possibility is to ???, as in (287).
(287) dataPersistenceUnpicklingEx1.py
    import pickle, sys

    def main() :
        try :
            filename = sys.argv[1]
        except IndexError :
            print "Usage: %s <PICKLED FILE>" % sys.argv[0]
            sys.exit(0)                     # Stop if no file was given
        fo = open(filename, 'r')
        d = pickle.load(fo)
        fo.close()

        for w in d.keys() :
            (wordlist1, wordlist2) = d[w]
            print w
            print " ".join(wordlist1)
            print " ".join(wordlist2)

    main()
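The round trip from object to file and back can be verified in a few lines. The sketch below is written for Python 3, where pickle files must be opened in binary mode ('wb' and 'rb'). The dictionary contents are placeholders rather than real wordlist data, and the file is written to a temporary directory.

```python
import os
import pickle
import tempfile

# Placeholder data in the shape produced by merging two wordlists:
# gloss -> (terms from list 1, terms from list 2).
merged = {"dog": (["term1a"], ["term2a"]),
          "talk": (["term1b", "term1c"], ["term2b"])}

path = os.path.join(tempfile.mkdtemp(), "merged.pkl")

# Pickle files are binary, so open them with 'wb' and 'rb'.
with open(path, "wb") as fo:
    pickle.dump(merged, fo)

with open(path, "rb") as fo:
    restored = pickle.load(fo)

print(restored == merged)
```

The restored object is an equal copy of the original, with its nested lists and tuples intact, and no parsing code had to be written to get it back.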
20.3 Relational Databases
There are four key elements to a relational database, often called the ACID
test on the basis of the mnemonic scheme for their recollection (XXX):
(This prevents the database from entering into invalid states, e.g., crediting
one account without debiting another in a financial database.)
Consistency The database only goes from valid state to valid state. Tables may
be related (hence relational) by common keys. Consistency forces valid
relationships between tables. (In other words, it is impossible for a column
of one table to reference a non-existent entry in the column of another table
when both are properly designed.)
Isolation The results of transactions are invisible to other users until the trans-
action is complete. (This ensures that changes occur in a block, and that
interrelated changes do not occur in a piecemeal fashion.)
Durability All transactions and states of the database are fully recoverable (through
rollback/commit operations). Durability ensures that once a transaction is
complete the updated information will survive failures of any kind.
Performance RDBMSes are fast and can be fine-tuned for even greater perfor-
mance.
Interface The interface to an RDBMS is more or less uniform, to the extent that
RDBMSes conform to the various standards set for SQL (Structured Query
Language), and does not require that the programmer build a new interface or
that the user learn one. (This consideration argues more against tabular files
than XML, which is more easily manipulable.)
Robustness Relational databases are more robust than flat files in the sense that
constraints on tables can prevent the entry of non-sensical data. These con-
straints can enforce the data model in various ways: by ensuring that entries
are unique with respect to particular parameters, by ensuring that the values
for columns come from a pre-determined set of options, etc.
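Two of these properties, all-or-nothing transactions and constraints on tables, can be observed with the sqlite3 module from Python's standard library. The sketch below is written for Python 3 and uses an in-memory database. The account table and its balances are invented for the example, which follows the financial scenario mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, "
             "balance INTEGER NOT NULL CHECK (balance >= 0))")
conn.execute("INSERT INTO account VALUES ('a', 100)")
conn.execute("INSERT INTO account VALUES ('b', 0)")
conn.commit()

try:
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if it raises.
    with conn:
        conn.execute("UPDATE account SET balance = balance + 150 WHERE name = 'b'")
        conn.execute("UPDATE account SET balance = balance - 150 WHERE name = 'a'")
except sqlite3.IntegrityError:
    print("transfer rejected; nothing was changed")

balances = conn.execute("SELECT name, balance FROM account ORDER BY name").fetchall()
print(balances)
```

The debit violates the CHECK constraint, so the credit that preceded it is rolled back as well, and both balances are left as they were.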
20.4 Structured Query Language (SQL)
SQL (Structured Query Language) is a standardized language for interacting with
relational databases. It consists of a well-defined syntax that enables a user to send
queries to a database that will create, delete, or edit data within the database. The
literature on SQL is absolutely huge, and there is no way to provide a complete
introduction to SQL in this book, but we should be able to introduce the basics
and give the reader a general idea of how it works and what it can be used for.
Readers interested in pursuing the topic in greater detail can refer to the suggested
reading section at the end of the chapter.
[SAY SOMETHING ABOUT STANDARDS]
Unfortunately, SQL, like most standards, is not universally adhered to.
Different database management systems (DBMS) depart from it to various degrees.
So, while in theory a standard SQL query will run properly on any database that
is SQL-compliant, in practice this is not always the case. For various reasons
(ranging from the understandable desire to provide extra functionality to the less
forgivable wish to lock users in to a particular DBMS once they have adopted it),
the implementation of SQL is not 100% uniform. The rule of thumb is to stick to
the basics. The more established a SQL convention is, the more likely it is to
work across different databases.
In the following sections, we will review the basics of SQL and show how it
can be used to access and manipulate structured data.

    CREATE TABLE person (
        id xxx,
        first_name xxx,
        middle_name xxx,
        last_name xxx
    );

    CREATE TABLE authorship (
        id_person xxx,
        id_bibliography xxx
    );
The tables bibliography and person both have a column called id that
serves as a primary key. A primary key is a unique identifier that can be used
to identify a particular row within a table. The primary key of one table can be
referenced by another, where it serves as a foreign key. Using these three tables, a
row in the bibliography table can be related to a row in the person table by
means of the authorship table. (In §20.4.5, we will see how this can be done
using a method of combining table data called a join.)
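To give a first impression of how the three tables work together, the sketch below builds them in an in-memory SQLite database using Python's sqlite3 module and then relates a person to a book through the authorship table. It is written for Python 3. The column types, the reduced bibliography table with only id and title, and the sample rows are assumptions made for the sake of the example, since the table definitions leave them open.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column types are assumptions; the table definitions leave them unspecified.
conn.executescript("""
CREATE TABLE bibliography (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE person (id INTEGER PRIMARY KEY, first_name TEXT,
                     middle_name TEXT, last_name TEXT);
CREATE TABLE authorship (
    id_person INTEGER REFERENCES person(id),
    id_bibliography INTEGER REFERENCES bibliography(id)
);
INSERT INTO person (id, first_name, middle_name, last_name)
    VALUES (1, 'Edward', '', 'Sapir');
INSERT INTO person (id, first_name, middle_name, last_name)
    VALUES (2, 'Leonard', '', 'Bloomfield');
INSERT INTO bibliography (id, title) VALUES (1, 'Language');
INSERT INTO authorship (id_person, id_bibliography) VALUES (2, 1);
""")

# The authorship row links person 2 to bibliography entry 1.
rows = conn.execute("""
    SELECT person.last_name, bibliography.title
    FROM authorship
    JOIN person ON person.id = authorship.id_person
    JOIN bibliography ON bibliography.id = authorship.id_bibliography
""").fetchall()
print(rows)
```

Adding a further author to the same entry requires only one more row in authorship. Neither of the other two tables has to change.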
It’s probably not clear why we have designed our data model in this way.
Why use keys and why relate tables using them? The answer is that data that has
been normalized in this fashion is more consistent and efficient, since there is no
redundant data within tables, and more flexible. To see why, consider what an
alternative might look like, provided in (290).
        third_author xxx,
        title1 xxx,
        title2 xxx,
        publisher xxx,
        where_published xxx
    );
In (290), there is only a single table, and the authors of each book are stored
in the same row as the book in the bibliography table. There are a number
of problems with this alternative approach. The first, and probably most obvious,
problem is that some bibliography entries will have more than three authors (e.g.,
scientific papers with a half-dozen or more authors). We could perhaps solve
the problem by adding more columns for more authors, but this causes as many
problems as it solves. First, there is no upper limit on the number of columns.
If we provide columns for thirteen authors, at some point a paper with fourteen
authors will turn up. Second, even if we did provide a ridiculously large number of
columns for authors, there would be problems. One is inefficiency. Some space is
set aside for each column of a row, even if it is blank, and there will be a very large
large to accommodate papers with numerous authors. Another problem with this
approach is that we cannot put any constraints on these author columns. We
cannot force them to be non-blank, for example, since a paper with a single author
will necessarily have numerous blank columns.
All of the problems with (290) are solved by using unique identifiers for each
row in the tables person and bibliography. First, there is no longer any
problem with papers with a large number of authors, since all that is required
to associate an author with a bibliography entry is to add a new row to the table
authorship, and tables can hold a virtually unlimited number of rows. Second,
there is no problem with constraints. We can ensure that every foreign key refers to
a primary key and that all of the parts of a name are provided (by breaking a name
down into parts and making some parts mandatory–for example, the first and last
name). Finally, this scheme is far more efficient, because a name is stored only
once in the person table and everywhere else referred to by a number (which
requires much less memory than a string).
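The normalized design described above can be sketched in a few lines of code. This is only an illustration under stated assumptions: it is written for Python 3 and uses the standard-library sqlite3 module with an in-memory database in place of MySQL, and the column definitions, the paper title, and the author names are hypothetical placeholders rather than the table definitions used in this chapter.

```python
import sqlite3

cnx = sqlite3.connect(":memory:")          # throwaway in-memory database
cnx.execute("PRAGMA foreign_keys = ON")    # SQLite enforces foreign keys only on request
cnx.executescript("""
    CREATE TABLE person (
        id         INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name  TEXT NOT NULL
    );
    CREATE TABLE bibliography (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL
    );
    CREATE TABLE authorship (
        id_person       INTEGER NOT NULL REFERENCES person(id),
        id_bibliography INTEGER NOT NULL REFERENCES bibliography(id)
    );
""")

cnx.execute("INSERT INTO bibliography (id, title) VALUES (1, 'A Paper with Many Authors')")
# Fourteen authors need fourteen rows, not fourteen columns.
for n in range(1, 15):
    cnx.execute("INSERT INTO person (id, first_name, last_name) VALUES (?, ?, ?)",
                (n, "First%d" % n, "Last%d" % n))
    cnx.execute("INSERT INTO authorship (id_person, id_bibliography) VALUES (?, 1)", (n,))

author_count = cnx.execute(
    "SELECT COUNT(*) FROM authorship WHERE id_bibliography = 1").fetchone()[0]
print(author_count)

# A foreign key that refers to no existing person is rejected.
try:
    cnx.execute("INSERT INTO authorship (id_person, id_bibliography) VALUES (99, 1)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```

Adding a fifteenth author would mean one more row in authorship, with no change to the structure of any table.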
Since a database without data is not particularly useful, we have provided a
short SQL script that can be run in order to add a small dataset to the database
defined in (291). This seed script will add data for three authors and three books
to the database. One of the books is written by Leonard Bloomfield xxx (xxxb)
and another two by Edward Sapir xxx (xxxb,x).
(291) seedDatabaseEx1.sql
-- Add data to the table 'person'
insert into person (first_name, middle_name, last_name)
values ('Edward', '', 'Sapir');
insert into person (first_name, middle_name, last_name)
values ('Leonard', '', 'Bloomfield');
insert into person (first_name, middle_name, last_name)
values ('???', '', 'Whorf');

-- Add data to the table 'bibliography'
insert into bibliography (title, xxx) values ('Language');
insert into bibliography () values ();
insert into bibliography () values ();

-- Add data to the table 'authorship'
insert into authorship (id_person, id_bibliography)
values ((SELECT id
         FROM person
         WHERE),
        ());
insert into authorship (id_person, id_bibliography)
values ((SELECT id
         FROM person
         WHERE),
        ());
insert into authorship (id_person, id_bibliography)
values ((SELECT id
         FROM person
         WHERE),
        ());
Don’t worry if you don’t understand how this script works. Using this simple
database with only three tables, we will learn some SQL basics in the following
sections.
(292) insertEx1.sql
INSERT INTO person (first_name, middle_name, last_name)
values (’John’, ’Peabody’, ’Harrington’)
The names of the columns are surrounded by parentheses, as are their values.
The sequential order of the columns and values determines which values
are assigned to which columns, so there must be exactly as many values as
columns. The SQL in (293) will therefore give rise to an error, as can be seen by
running it.
(293) insertEx2.sql
INSERT INTO person (first_name, last_name) values (’John’);
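The mismatch between columns and values can also be observed from Python. The following is a minimal sketch, assuming Python 3 and the standard-library sqlite3 module as a stand-in for MySQL; the table definition is a hypothetical simplification, and the wording of the error message will differ from one database system to another.

```python
import sqlite3

cnx = sqlite3.connect(":memory:")
cnx.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, first_name TEXT, "
            "middle_name TEXT, last_name TEXT)")

# Three columns, three values: accepted.
cnx.execute("INSERT INTO person (first_name, middle_name, last_name) "
            "VALUES ('John', 'Peabody', 'Harrington')")

# Two columns, one value: rejected, as with (293).
try:
    cnx.execute("INSERT INTO person (first_name, last_name) VALUES ('John')")
    error = None
except sqlite3.Error as e:
    error = str(e)

rows = cnx.execute("SELECT id, first_name, last_name FROM person").fetchall()
print(rows)
print(error)
```

Only the well-formed statement adds a row, and that row receives an id even though the statement did not supply one.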
Note that the table person also has a column called id, which is not given a
value by the insert statement in (292). This is because this column is automatically
handled by the table, thanks to its data type, which is the MySQL-specific XXX.
It is also possible to insert a value using a sequence (something like a dynamic
counter for generating IDs automatically).
(294) insertEx3.sql
INSERT INTO person (id, first_name, middle_name, last_name)
values (XXX, 'John', 'Peabody', 'Harrington')
A value for the id column can be hard coded, as in (295), but this is not
recommended, since it may conflict with a pre-existing value. Furthermore, there
is little point, since it is automatically handled for you.
(295) insertEx4.sql
INSERT INTO person (id, first_name, middle_name, last_name)
values (XXX, 'John', 'Peabody', 'Harrington')
Now that we know how to add data to the database, it would be useful to know
how to remove it. In SQL, rows can be deleted from a table using the keyword
DELETE. The SQL command in (296) will delete all of the rows in a table. (A
word of warning: if a table has a great deal of data in it, this operation may be
time-consuming.)
(296) deleteEx1.sql
DELETE FROM person;
Typically, one wishes to delete specific rows in a table and not simply all of
them. To impose conditions on which rows will be deleted, the keyword WHERE
can be used. For example, to remove every entry in the table person that has the
last name 'Harrington', the SQL command in (297) will do the job.
(297) deleteEx2.sql
DELETE FROM person WHERE last_name = ’Harrington’
The update statement in (??) updates all of the rows in XXX. The more typical
situation, however, is that you will want to modify only a subset of rows using the
keyword WHERE.
(299) updateEx2.sql
UPDATE table_name
SET col1='xxx',
    col2='xxx'
WHERE col3='xxx'
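To make the effect of WHERE on an update concrete, here is a small sketch. It assumes Python 3 and the standard-library sqlite3 module rather than MySQL, and the table definition and the new value 'P.' are hypothetical choices made for illustration.

```python
import sqlite3

cnx = sqlite3.connect(":memory:")
cnx.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, first_name TEXT, "
            "middle_name TEXT, last_name TEXT)")
cnx.executemany(
    "INSERT INTO person (first_name, middle_name, last_name) VALUES (?, ?, ?)",
    [("Edward", "", "Sapir"),
     ("Leonard", "", "Bloomfield"),
     ("John", "Peabody", "Harrington")])

# Only the row whose last_name is 'Harrington' is modified.
cursor = cnx.execute(
    "UPDATE person SET middle_name = 'P.' WHERE last_name = 'Harrington'")
changed = cursor.rowcount
middle_names = cnx.execute(
    "SELECT last_name, middle_name FROM person ORDER BY id").fetchall()
print(changed)
print(middle_names)
```

Without the WHERE clause, the same statement would have set the middle name of every person in the table.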
In (300), specific columns were specified in the select statement, but it is also
possible to select all of the columns in a table using a star (*), as in (301).
(301) selectEx2.sql
SELECT * FROM person;

SELECT id,
       first_name,
       middle_name,
       last_name
FROM person
it may be necessary to switch databases.)
To restrict the number of rows retrieved from a table, the keyword WHERE can
be used to impose conditions on which rows will be selected from the database. In
the example in (302), all of the columns in the person table are selected for any
row (or rows) where the column last_name is 'Harrington'. (Note that whether the
string comparison is case-sensitive depends on the database system and on the
collation of the column. Where it is case-sensitive, different results would be
obtained if the name provided were, for example, 'HARRINGTON'.)
(302) selectEx3.sql
In some cases, it may be necessary to combine the data of two tables into a
single select. For example, to retrieve all of the authors associated with a particular
The query in (303) crosses the rows in the table author with those in the
table authorship and retrieves only those where the foreign key id author
of the authorship table corresponds to the primary key id of the author
table and the id person is the one provided (i.e., 3).
That may not be entirely clear to you, so let’s go through it again more slowly,
with the help of a diagram. A join crosses all of the rows in one table with all of
the rows in another. Therefore, if there are i rows in one table and j rows in
another table, the total number of rows that result from joining the two tables is
i ∗ j. Returning to (303), if the seed data, and only the seed data, is present in the
database, the result would be 9 rows, given that the table person has 3 rows and
the table authorship has 3 rows.
Table 20.2 Outcome of a Join of Person and Authorship
Person                          Authorship
id  first_name  last_name       id_person  id_bibliography
1   Leonard     Bloomfield      1          1
1   Leonard     Bloomfield      2          2
1   Leonard     Bloomfield      2          3
2   Edward      Sapir           1          1
2   Edward      Sapir           2          2
2   Edward      Sapir           2          3
3   ???         Whorf           1          1
3   ???         Whorf           2          2
3   ???         Whorf           2          3
In this case, XXX.
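The arithmetic of the join can be checked with a short script. This is a sketch under stated assumptions: it uses Python 3 and the standard-library sqlite3 module instead of MySQL, the table definitions are simplified, and the rows are those shown in Table 20.2.

```python
import sqlite3

cnx = sqlite3.connect(":memory:")
cnx.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE authorship (id_person INTEGER, id_bibliography INTEGER);
    INSERT INTO person VALUES (1, 'Leonard', 'Bloomfield');
    INSERT INTO person VALUES (2, 'Edward', 'Sapir');
    INSERT INTO person VALUES (3, '???', 'Whorf');
    INSERT INTO authorship VALUES (1, 1);
    INSERT INTO authorship VALUES (2, 2);
    INSERT INTO authorship VALUES (2, 3);
""")

# Without a WHERE clause, every person row is paired with every authorship row.
crossed = cnx.execute("SELECT * FROM person, authorship").fetchall()
print(len(crossed))

# The WHERE clause keeps only the pairs in which the keys correspond.
matched = cnx.execute(
    "SELECT person.last_name, authorship.id_bibliography "
    "FROM person, authorship "
    "WHERE authorship.id_person = person.id "
    "ORDER BY authorship.id_bibliography").fetchall()
print(matched)
```

Of the 3 ∗ 3 = 9 crossed rows, only three survive the condition that the foreign key match the primary key.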
(304) selectEx5.sql
SELECT id
FROM bibliography
WHERE title = 'Language'
Using a subselect, it is possible to embed one select within another. In this way, we
can eliminate the need to know the id of a bibliographic reference
in (305), using a subselect in lieu of the hardcoded value of 5.
(305) subSelectEx1.sql
SELECT author.first_name,
       author.last_name
FROM author,
     authorship
WHERE authorship.id_person = author.id AND
      authorship.id_bibliography = -- Subselect produces value
      (SELECT id FROM bibliography WHERE title = 'Language');
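A subselect of this kind can be tried out on a small scale as follows. The sketch assumes Python 3 and the standard-library sqlite3 module rather than MySQL. It names the table of authors person, as in the seed script, and the second bibliography title is a hypothetical placeholder.

```python
import sqlite3

cnx = sqlite3.connect(":memory:")
cnx.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE bibliography (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE authorship (id_person INTEGER, id_bibliography INTEGER);
    INSERT INTO person VALUES (1, 'Leonard', 'Bloomfield');
    INSERT INTO person VALUES (2, 'Edward', 'Sapir');
    INSERT INTO bibliography VALUES (1, 'Language');
    INSERT INTO bibliography VALUES (2, 'Hypothetical Second Title');
    INSERT INTO authorship VALUES (1, 1);
    INSERT INTO authorship VALUES (2, 2);
""")

# The inner select supplies the id that would otherwise be hardcoded.
authors = cnx.execute("""
    SELECT person.first_name, person.last_name
    FROM person, authorship
    WHERE authorship.id_person = person.id AND
          authorship.id_bibliography =
          (SELECT id FROM bibliography WHERE title = 'Language')
""").fetchall()
print(authors)
```

The query keeps working if the ids in bibliography change, since the title is the only value it depends on.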
Once you have imported the necessary module, MySQLdb, you will need to tell
Python to connect to a database. In order to do so, a few pieces of information
are required: the host, the username, the password, and the name of the database.
The following code does nothing more than obtain a database connection and
immediately close it. An error will be raised if the program is unable to connect
to the database. Otherwise, it will simply run and exit silently.
(306) databaseObtainConnection.py
import MySQLdb
cnx = MySQLdb.connect(host = "",
                      user = "",
                      passwd = "",
                      db = "")
cnx.close()
The SQL statement in (??) was hardcoded, but it is also possible to create
dynamic SQL statements using variables. For example, a program that retrieves
entries from a lexical database for a given lexeme would need to obtain the lexeme
from the user and then construct a SQL statement on the basis of the user’s input.
20.5 Using Python with a Relational Database
(308) databaseExecuteQueryWithBindings1.py
import MySQLdb
# conn is an open connection, obtained as in (306);
# lexeme holds the word to look up
cursor = conn.cursor()
sql = "SELECT id, lexeme FROM lexemes WHERE lexeme = %s"
cursor.execute(sql, (lexeme,))   # the trailing comma makes this a one-item tuple
rows = cursor.fetchall()
for r in rows :
    id = r[0]
    lex = r[1]
    print "[%s] %s" % (id, lex)
Notice that %s is used as a placeholder in the SQL statement. When the SQL
statement is executed by the call to cursor.execute, the value for this placeholder
is provided as an item within a tuple.
(309) databaseExecuteQueryWithBindings2.py
try :
    lexeme = sys.argv[1]
except IndexError :
    ...
conn = MySQLdb.connect(host   = "localhost",
                       user   = "root",
                       passwd = "",
                       db     = "rotokas-dictionary")
cursor = conn.cursor()
...
rows = cursor.fetchall()
for r in rows :
    id = r[0]
    lex = r[1]
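The reason for passing the value separately, rather than pasting it into the SQL string, is easiest to see with input that contains a quotation mark. The following is a sketch only. It assumes Python 3 and the standard-library sqlite3 module, whose placeholder is ? rather than the %s used by MySQLdb, and the lexemes 'alpha' and 'beta' as well as the function name lookup are hypothetical.

```python
import sqlite3

cnx = sqlite3.connect(":memory:")
cnx.execute("CREATE TABLE lexemes (id INTEGER PRIMARY KEY, lexeme TEXT)")
cnx.executemany("INSERT INTO lexemes (lexeme) VALUES (?)", [("alpha",), ("beta",)])

def lookup(cnx, lexeme):
    """Return the (id, lexeme) rows whose lexeme equals the given string."""
    sql = "SELECT id, lexeme FROM lexemes WHERE lexeme = ?"
    # The value travels separately from the SQL text, as a one-item tuple.
    return cnx.execute(sql, (lexeme,)).fetchall()

found = lookup(cnx, "beta")
# Input containing quotes is compared as plain data, not run as SQL.
tricky = lookup(cnx, "beta' OR '1'='1")
print(found)
print(tricky)
```

Had the second string been spliced directly into the statement, the condition would have matched every row in the table.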
Lexical Database
20.7 Exercises
1. ???
2. ???
3. ???
4. ???
5. ???
20.8 Suggested Reading
To learn more about SQL, we recommend XXX. For more details concerning
database management systems and their relative strengths and weaknesses, see
XXX. For technical documentation of the Python DB-API, see XXX. In general, we
recommend the various open source database management systems, since they
are free, well-maintained, and have very active user communities. (They also
have the advantage of being customizable, although being able to do this requires
a level of technical knowledge that the average reader of this book is unlikely
to possess.) In particular, we recommend MySQL (http://www.mysql.org)
and PostgreSQL (http://www.postgres.org), two fairly advanced open source
relational database management systems.
Chapter 21
The World Wide Web
The World Wide Web was invented by Tim Berners-Lee in the early 1990s during
his employment at CERN. Originally designed as a tool for scientists to share
data, it has grown explosively to become a cultural phenomenon. As the web has
grown, the amount of textual material available on it has created a ready-made
corpus, freely accessible to language researchers who have the technical know-
how to tap into it. Although a full introduction to the topic of web programming
goes far beyond the scope of this book, we can still introduce a few basics that
will enable you to write simple scripts to connect to web pages and manipulate
their contents.
???
21.1 How the Web Works
When you surf the web, you do so using a web browser, a program that is
able to send and receive messages in HTTP and display the information sent back
from a server. This information typically takes the form of HTML. For example,
consider the simple HTML document in (311).
(311) sampleWebPage.html
<html>
  <head>
    <title>Sample Web Page</title>
  </head>
  <body bgcolor="#FFFFFF">
    <h2>Title of Sample Web Page</h2>
    <p>The contents of the web page go here.</p>
  </body>
</html>
When rendered by a web browser, (311) will look something like Figure
21.2.
Figure 21.2 Sample Web Page: (311) in Mozilla Firefox Browser
As explained in §??, the web is based on a client-server model. A web browser
(the client) connects to a networked computer (the server) to display a web page.
is part of the Python standard library (i.e., it comes with the standard distribution
of Python). The urllib module provides the building blocks for constructing a
program that will browse the web automatically. A very simple illustration of its
use is provided in (312).
(312) wwwWebBrowser.py
import sys, urllib
url = sys.argv[1]              # get URL from command-line
socket = urllib.urlopen(url)   # open socket w/ URL
contents = socket.read()
print contents
socket.close()
connecting to). The method read is then called on the socket object to read the
contents of the desired URL.
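Readers working with Python 3 will find that urlopen has moved to the module urllib.request. The sketch below shows the same open-read-close sequence there. So that it can be run without network access, it starts a small web server on the local machine and fetches a page from it. The handler class, the function name fetch, and the page itself are hypothetical, and the empty ProxyHandler is there only so that a configured proxy does not interfere with a local request.

```python
import http.server
import threading
import urllib.request

PAGE = b"<html><body><p>The contents of the web page go here.</p></body></html>"

class SamplePageHandler(http.server.BaseHTTPRequestHandler):
    """Serves the same small HTML page for every GET request."""
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the example's output quiet

def fetch(url):
    """Open the URL, read everything the server sends back, and close."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    response = opener.open(url)
    try:
        return response.read()
    finally:
        response.close()

# Port 0 asks the operating system for any free port.
server = http.server.HTTPServer(("127.0.0.1", 0), SamplePageHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_address[1]
contents = fetch(url)
print(contents.decode("ascii"))
server.shutdown()
server.server_close()
```

Replacing the local URL with the address of a real web site gives the Python 3 counterpart of a simple page downloader.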
Before testing the program yourself, make sure that you have network access
by trying to access a reliable web site (such as www.google.com or
www.yahoo.com). If that works, you have a working internet connection
and should be able to test (312) yourself by calling the script on a reliable
web site, such as www.google.com or www.slashdot.org. If you are able
to view web pages on a browser but the script doesn’t work, the most likely reason
has to do with your network setup and something called a proxy server.
21.2 Client Side Programming: Browsing the Web
and that web server responds directly to your computer. But when you use a proxy
server, a request sent by your browser to a web server goes to the proxy server
first, and the proxy server then forwards the request on to the desired web server.
The web server then sends a response to the proxy server, which then forwards
that response on to your computer.
(313) wwwWebBrowserProxy.py
import sys, urllib
url = sys.argv[1]
socket = urllib.urlopen(url, proxies={'http': 'http://www-proxy.mpi.nl:8080'})
contents = socket.read()
print contents
socket.close()
[EXPLAIN HOW IT WORKS] If you don’t know whether your computer uses
a proxy server, you should ask your network administrator. If your network
administrator is unavailable, you might be able to find the settings in the preferences
of your web browser (Internet Explorer, Firefox, etc.).
[MENTION PROBLEMS]
Given the various shortcomings of (??), we need to find a more reliable method
of stripping HTML tags from a document. Python’s standard library includes
sgmllib, a module which provides functionality for dealing with SGML, the
markup standard on which HTML is based. Using sgmllib, it is possible to write a
simple function that will strip a string of all HTML tags, as in (316).
(316) wwwStripHTMLStripper.py
import sgmllib, sys, urllib

class Stripper(sgmllib.SGMLParser):
    def __init__(self):
        self.data = []
        sgmllib.SGMLParser.__init__(self)
    def unknown_starttag(self, tag, attrib):
        self.data.append(" ")
    def unknown_endtag(self, tag):
        self.data.append(" ")
    def handle_data(self, data):
        self.data.append(data)
    def gettext(self):
        text = "".join(self.data)
        return " ".join(text.split())   # normalize whitespace

def strip(text):
    s = Stripper()       # create Stripper object
    s.feed(text)         # feed text to be parsed
    s.close()            # close Stripper object
    return s.gettext()   # return stripped text

url = sys.argv[1]              # get url from command-line
socket = urllib.urlopen(url)   # open socket w/ URL
contents = socket.read()       # read HTML
print strip(contents)          # strip tags from HTML
socket.close()                 # close socket
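The module sgmllib is not available in Python 3, where the standard library offers html.parser instead. The following sketch applies the same idea as (316) with that module. It is an adaptation made under that assumption rather than a replacement for the listing above, and the sample string is taken from (311).

```python
from html.parser import HTMLParser

class Stripper(HTMLParser):
    """Collects the text of a document and replaces each tag with a space."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        self.pieces.append(" ")

    def handle_endtag(self, tag):
        self.pieces.append(" ")

    def handle_data(self, data):
        self.pieces.append(data)

    def gettext(self):
        text = "".join(self.pieces)
        return " ".join(text.split())   # normalize whitespace

def strip(text):
    s = Stripper()       # create Stripper object
    s.feed(text)         # feed text to be parsed
    s.close()            # flush any buffered text
    return s.gettext()   # return stripped text

sample = ("<html><body><h2>Title of Sample Web Page</h2>"
          "<p>The contents of the web page go here.</p></body></html>")
stripped = strip(sample)
print(stripped)
```

As in (316), a space is inserted for every tag so that words on either side of a tag are not run together.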
    print "Opening URL '%s'..." % url
    socket = urllib.urlopen(url)
    contents = socket.read()
    socket.close()
    return contents

NorwegianURL = protocol + Norwegian + BushURL
DutchURL = protocol + Dutch + BushURL
EnglishURL = protocol + English + BushURL

print "NORWEGIAN"
print get_url_contents(NorwegianURL)
print ""
print ""

print "DUTCH"
print get_url_contents(DutchURL)
print ""
print ""

print "ENGLISH"
print get_url_contents(EnglishURL)
print ""
print ""
for example, use spidering to create a copy of all the visited pages for later
processing by a search engine, which will index the downloaded pages to provide fast
searches. Crawlers can also be used for automating maintenance tasks on a Web
site, such as checking links or validating HTML code. Also, crawlers can be used
to gather specific types of information from Web pages, such as harvesting e-mail
addresses (usually for spam).
Every time you browse the web, you are connecting to a web server. A web
server can be programmed to serve up dynamic content and there are innumerable
frameworks for doing this.
Web development using a more sophisticated web development framework
(such as Zope/Plone) goes beyond the scope of this textbook, but here we can
cover a relatively simple framework for server side programming, known as CGI
(Common Gateway Interface).
Once this script has been placed in the appropriate directory and made exe-
cutable, it should be possible to access it. When this is done, the results should
look something like Figure ??.
???
21.4 Case Studies
(319) hello.cgi
#!/usr/bin/python
print "Content-Type: text/html\n\n"
print "<h1>Hello</h1>"
print "This is a greeting."
(client or server side).
      <pubDate>Thu, 29 Jun 2006 04:36:45 +0100</pubDate>

      <guid>http://www.antara.co.id/seenws/?id=36956</guid>
    </item>
    ...
    <item>
      <title>TNI AL Tangkap Enam Kapal Penyelundup Ikan</title>
      <link>http://www.antara.co.id/seenws/?id=36922</link>
      <description> Kapal patroli TNI AL, KRI Layang 805, berhasil menangkap enam kapal berikut 61 ABK (a

      <pubDate>Wed, 28 Jun 2006 17:22:06 +0100</pubDate>
      <guid>http://www.antara.co.id/seenws/?id=36922</guid>
    </item>
  </channel>
</rss>
There are a couple of things to note about this RSS file. First, there is some
metadata provided at the beginning of the file, such as ???. Second, the actual
content begins with the root tag <rss>, which specifies which version of RSS
format is used (2.0). Third, [SAY SOMETHING ABOUT CHANNEL].
Each <item> tag in the RSS newsfeed corresponds to a newspaper article.
For example, the first item contains the information provided in Table 21.1.
This RSS newsfeed provides a link to the web version of the full article, from
which the newspaper article itself can be extracted. Using some Python scripting,
it is possible to automate the downloading of the RSS newsfeed and the extrac-
tion of the article from the link it refers to. In this way, a corpus of Indonesian
newspaper articles can be quickly and easily created.
The first step is to obtain the RSS newsfeed itself. In §??, we learned how to
download a web page given a URL. The same technique can be used to download
an RSS newsfeed. It really doesn’t make much difference whether it is HTML or
XML. The procedure is the same, as shown by the code in (321).
(321) wwwDownloadRSSNewsfeed.py
import urllib

url = "http://www.antara.co.id/rss/umm.xml"
socket = urllib.urlopen(url)
rss_contents = socket.read()
socket.close()
print rss_contents
The next step is to parse the RSS newsfeed and obtain the link to the full
newspaper article.
(322) wwwParseRSSNewsfeed.py
import sys
import xml.etree.ElementTree as ET   # ElementTree as shipped with Python 2.5 and later

# socket is the open connection to the newsfeed, obtained as in (321)
rss_contents = socket.read()
tree = ET.fromstring(rss_contents)
print "Items Found in RSS Newsfeed:"
for item in tree.findall("channel/item") :
    title = item.findtext("title")
    date = item.findtext("pubDate")
    cat = item.findtext("category")
    link = item.findtext("link")
    desc = item.findtext("description").strip()
    guid = item.findtext("guid")
    print "--"
    print "TITLE: [%s]" % title
    print "DATE: [%s]" % date
    print "GUID: [%s]" % guid
    print "LINK: [%s]" % link
    print "CATEGORY: [%s]" % cat
    print "DESCRIPTION: [%s]" % desc
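The parsing step can be tried out without a network connection by feeding the parser a string. The sketch below assumes Python 3 and the standard-library module xml.etree.ElementTree. The sample feed is cut down to a single item taken from the newsfeed shown above, and the dictionary keys are arbitrary choices made for illustration.

```python
import xml.etree.ElementTree as ET

rss_contents = """<rss version="2.0">
  <channel>
    <item>
      <title>TNI AL Tangkap Enam Kapal Penyelundup Ikan</title>
      <link>http://www.antara.co.id/seenws/?id=36922</link>
      <pubDate>Wed, 28 Jun 2006 17:22:06 +0100</pubDate>
      <guid>http://www.antara.co.id/seenws/?id=36922</guid>
    </item>
  </channel>
</rss>"""

tree = ET.fromstring(rss_contents)   # the root element is <rss>
items = []
for item in tree.findall("channel/item"):
    items.append({
        "title": item.findtext("title"),
        "date": item.findtext("pubDate"),
        "link": item.findtext("link"),
        # findtext returns None for a missing element unless a default is given
        "category": item.findtext("category", default=""),
    })

for entry in items:
    print("TITLE: [%s]" % entry["title"])
    print("LINK:  [%s]" % entry["link"])
```

Supplying a default is a small safeguard: an item that lacks a tag then yields an empty string instead of None, on which a method such as strip cannot be called.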
The final step is to grab the newspaper article from the web and extract the
newspaper article from the surrounding HTML markup.
(323) wwwGetNewsArticle.py
import re, sys, urllib

url = sys.argv[1]                 # link to the full article, e.g. as found by (322)
socket = urllib.urlopen(url)
html = socket.read()
socket.close()
regex = r'<td class=contenttxt width="100%">(.*?)</td>'
reo = re.compile(regex, re.DOTALL)
mo = reo.search(html)
if mo :
    content = mo.group(1)
    content = content.replace("<br />" , "")
    content = re.sub(r"</?b>", "", content)
    content = re.sub(r"\(\*\).*", "", content)
    content = content.strip()
    print content
else :
    sys.stderr.write("Regex failed to find article content!\n")
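The extraction step can be tested on a string before it is pointed at a live web page. The following sketch assumes Python 3 and wraps the regular expressions of (323) in a function. The function name extract_article and the sample HTML are hypothetical, and the markup of the real news site may well have changed since this was written.

```python
import re

# A made-up page fragment that imitates the table cell targeted by (323).
html = """<table><tr>
<td class=contenttxt width="100%"><b>Sample dateline</b> - First paragraph.<br />
Second paragraph. (*) footer text</td>
</tr></table>"""

pattern = re.compile(r'<td class=contenttxt width="100%">(.*?)</td>', re.DOTALL)

def extract_article(html):
    """Return the cleaned article text, or None if the cell is not found."""
    mo = pattern.search(html)
    if mo is None:
        return None
    content = mo.group(1)
    content = content.replace("<br />", "")       # drop line-break tags
    content = re.sub(r"</?b>", "", content)       # drop bold tags
    content = re.sub(r"\(\*\).*", "", content)    # drop the trailing (*) note
    return content.strip()

article = extract_article(html)
print(article)
print(extract_article("<p>no article cell here</p>"))
```

Returning None when the pattern fails lets the calling code decide whether to skip the article or report an error.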
Finally, the short scripts described above can be combined into a single pro-
gram that handles the process from start to finish and puts the resulting newspaper
articles into a user-specified directory.
The Summer Institute of Linguistics (SIL) publishes a book called The Ethnologue,
an index of the world’s languages which provides basic
background information such as where a language is spoken, how many people
speak it, whether there is a translation of the Bible in the language, etc. As a public
service, the SIL makes the data in The Ethnologue publicly available on the
WWW at www.ethnologue.org.
Because a given language may go by multiple names, it is necessary when
building a database of languages to have a unique identifier associated with each
language. The SIL uses a three-letter code for language identification, called simply
an Ethnologue code. (For more information about these codes, go to
http://www.ethnologue.com/codes/default.asp.) It is possible to look
up a language using this code by going to a web page that lists language
information, given a specific Ethnologue code.
When you connect to this URL using a web browser, the results will look
something like this. The web page is generated by connecting to a remote web
server and obtaining a response. The response consists of HTML which is rendered
by your web browser to look like this (as of The Ethnologue 15).
Using a simple Python script, it is possible to view the raw HTML that is used
to generate this web page. The following script will do the job:
???
(324) wwwEthnologueClassificationByCode.py
#!/usr/bin/python

import re, sys, urllib

def get_affiliation(classification) :
    affiliation = classification.split(",")[0].strip()
    if affiliation == "Austronesian" :
        return "AN"
    else :
        return "NAN"

def get_classification_by_code(code) :
    sys.stderr.write("<%s>\n" % code )
    proxies = {'http': 'http://www-proxy.mpi.nl:8080'}
    url = "http://www.ethnologue.com/show_language.asp?code=%s" % code
    regex = r'<TD><I>Classification</I></TD>.*?<TD><A HREF=".*?">(.*)</A></TD>'
    #print "Attempting to connect to the following URL:"
    #print " [%s]" % url
    sock = urllib.urlopen(url, proxies=proxies)
    html = sock.read()
    #print html
    reo = re.compile(regex, re.DOTALL)
    mo = reo.search(html)
    classification = mo.group(1)
    #print "Classification: [%s]" % classification
    sock.close()
    return classification

def main() :
    filename = sys.argv[1]

    fo = open(filename, 'rU')
    lines = fo.read().split("\n")
    fo.close()

    print lines[0]
    for l in lines[1:] :
        cols = l.split("\t")
        code = cols[1].lower()
        if code == "---" :
            classification = "---"
            major_affiliation = "---"
        else :
            classification = get_classification_by_code(code)
            major_affiliation = get_affiliation(classification)
        cols[2] = major_affiliation
        cols[3] = classification
        print "\t".join(cols)

main()
21.5 Exercises
???
• ???
• In §21.4.1, we discussed a way of writing a program that automates the work
of downloading news articles from a web site and saving them as text files.
Saving the newspaper articles as plain text files is not ideal since the infor-
mation about them from the RSS newsfeed is not saved with them. How
would you write Python code to save the newspaper articles in a database
rather than as text files?
21.6 Suggested Reading
Bibliography
Allison Randal, Dan Sugalski, and Leopold Tötsch. Perl 6 and Parrot Essentials.
O’Reilly Press, ???, second edition, 2004.
Steven Bird, ??? Klein, and ??? Loper. Introduction to Computational Linguistics.
Cambridge University Press, Cambridge University, 2006.
Timothy Budd. An Introduction to Object-Oriented Programming. Addison-
Wesley, ???, 1996.
Bart De Boer. The Origins of Vowel Systems. Oxford University Press, New York,
2001.
Ali Farghay, editor. A Handbook for Language Engineers. CSLI, Stanford, 2003.
2002.
Nigel Gilbert and Klaus G. Troitzsch. Simulation for Social Scientists. Open
University Press, New York, second edition, 2005.
Summer Institute of Linguistics, Dallas, Texas, fourteenth edition, 2000.
Daryl Harms and Kenneth McDonald. The Quick Python Book. Manning Publi-
cations Company, Greenwich, CT, 2000.
Elliotte Rusty Harold and W. Scott Means. Learning XML. O’Reilly, Sebastopol,
2004.
Steve Holden, editor. Python Web Programming. Pearson Education, xxx, 2002.
ISBN 0735710902.
Andrew Hunt and David Thomas. The Pragmatic Programmer: From Journey-
man to Master. Addison-Wesley, ???, 1999.
Christopher A. Jones and Fred L. Drake. Python and XML. O’Reilly & Associates,
Inc., 103a Morris Street, Sebastopol, CA 95472, USA, Tel: +1 707 829 0515,
and 90 Sherman Street, Cambridge, MA 02140, USA, Tel: +1 617 354 5800,
2002. ISBN 0-596-00128-2.
Dan Klein and Christopher Manning. Natural language grammar induction us-
ing a constituent-context model. In Thomas G. Dietterich, Suzanna Becker,
and Zoubin Ghahramani, editors, Advances in Neural Information Processing
Systems 14, volume 1, pages 35–42, Cambridge, MA, 2002. MIT Press.
Steven Levy. Hackers: Heroes of the Computer Revolution. Penguin, ???, 2001.
Mark Lutz and David Ascher. Learning Python. O’Reilly & Associates, Inc.,
103a Morris Street, Sebastopol, CA 95472, USA, Tel: +1 707 829 0515, and
90 Sherman Street, Cambridge, MA 02140, USA, Tel: +1 617 354 5800, 1998.
Chris Manning. Review of programming for linguists: Java technology for lan-
guage researchers. Language, 81(3):740–742, 2005.
Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural
Language Processing. MIT Press, Cambridge, MA, 1999.
Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann
Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. The penn treebank:
Annotating predicate argument structure. 1994.
Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a
large annotated corpus of english: the penn treebank. Computational Linguis-
tics, 19(2):313–330, 1993.
Alex Martelli. Python in a Nutshell. O’Reilly, Sebastopol, CA, 2003.
James D. McCawley, editor. Everything That Linguists Have Always Wanted
to Know About Logic but Were Ashamed to Ask. Chicago University Press,
Chicago, 1993.
S. McGrath. XML programming in Python. Dr. Dobb’s Journal of Software Tools,
23(2):82–??, 84–87, 101–104, February 1998a. ISSN 1044-789X.
Sean McGrath. Internet programming: XML programming in Python. Dr. Dobb’s
Journal of Software Tools, 23(2):82, 84–87, 101–104, February 1998b.
Sean McGrath. XML processing with Python. The Charles F. Goldfarb se-
ries on open information management. Prentice-Hall, Englewood Cliffs, NJ
07632, USA, 2000. ISBN 0-13-021119-2. URL http://www.phptr.
com/ptrbooks/ptr 0130211192.html. Includes CD-ROM.
David Mertz. Text Processing in Python. Addison-Wesley, Reading, MA, USA,
2003.
Douglas B. Paul and Janet M. Baker. The design for the wall street journal-based
csr corpus. In HLT ’91: Proceedings of the workshop on Speech and Natu-
ral Language, pages 357–362, Morristown, NJ, USA, 1992. Association for
Computational Linguistics. ISBN 1-55860-272-0.
Erik T. Ray. Learning XML. O’Reilly, Sebastopol, 2003.
Eric S. Raymond. The New Hacker’s Dictionary. MIT Press, Cambridge, MA,
1996.
Eric S. Raymond. The Cathedral and the Bazaar. O’Reilly Press, Sebastopol,
2001.
Eric S. Raymond. The Art of Unix Programming. Addison-Wesley, 2003.
Boudewijn Rempt, editor. GUI Programming With Python: Using the Qt Toolkit.
Opendocs Lic, xxx, 2002. ISBN xxx.
Brian Stross. Love in the Armpit: Tzeltal Tales of Love, Murder, and Cannibalism,
volume 23 of Museum Brief. Museum of Anthropology of Missouri, Columbia,
Missouri, 1977.
John Verzani. Using R for Introductory Statistics. Chapman and Hall/CRC, ???,
2004.
Stephen Weber, editor. The Success of Open Source. Harvard University Press,
Cambridge, Mass., 2004. ISBN 0674012925.
Matt Weisfeld. The Object-Oriented Thought Process. Sams, ???, 2000.
Glossary
argument A variable passed to a function when it is called., 89
Assignment To give a value to a variable., 50
attributes ???, 151
Backreferences ???, 230
block structure ???, 48
callback function ???, 113
character class ???, 196
class The definition of an object., 152
class variable ???, 183
compilers ???, 47
computational linguistics ???, 28
constructor ???, 182
control structures A device that controls data flow within a pro-
gram., 79
directories ???, 111
disjunction The technical term for ???., 194
file path ???, 105
float ???, 59
foreign key ???, 257
function A block of code that can be called by name., 87
keyword A word that has a special meaning in Python and
therefore cannot be used as a variable name (e.g.,
if or and)., 50
list ???, 63
list comprehension A Python idiom for in situ modification of every
item in a list., 84
literal characters A character that stands for itself in a regular
expression., 190
looping The repeated execution of a block of code., 80
refactoring Improving a computer program by reorganising
its internal structure without altering its external
behaviour., 141
regular expressions ???, 189
relational database ???, 254
return value ???, 93
RSS Really Simple Syndication. RSS is an XML for-
mat used for web feeds., 269
seed script A program that is run to add data to a fresh
database in order to determine its initial state.,
258
SGML Standard Generalized Markup Language, 268
single inheritance Inheritance that restricts a class to inheriting from
only one parent class., 165
slice notation ???, 129
slicing ???, 66
socket In network programming, one endpoint of a two-way
communication link between two programs
running on the network., 266
SQL Structured Query Language. A standard language
for manipulating relational databases, 255
statements An instruction to the Python interpreter., 47
string concatenation The joining of two or more strings together to
form a longer string., 58
string indices ???, 128
string interpolation The insertion of variables into a string using in
situ placeholders., 137
subselect A select within a select., 262
tokenization ???, 41
tuple ???, 75
typing ???, 173
Index
LOCALE, 222 ljust, 138
MULTILINE, 222 replace, 136
VERBOSE, 223 rjust, 138
disjunction, 196 split, 136
escape character, 202 zfill, 138
matching querying, 132
greediness, 237 slices, 131
metacharacters, 193 types
backslash, 202
caret, 205
curly brackets, 210
dollar, 205
dot, 201
square brackets, 198
vertical bar, 196
negation, 203
quantifiers, 206
plus mark, 207
question mark, 209
star, 209
unicode, 130
variables
assignment, 52
declaration, 51
initialization, 52
xmllib, 245
SQL
keyword
DELETE, 262
INSERT, 261
SELECT, 262
UPDATE, 262
string
indices, 131
string interpolation, 139
strings
case, 138
formatting, 137
immutability, 135
interpolation, 139
methods, 132
center, 137
count, 133