
REVERSE

ENGINEERING

Edited by

LINDA WILLS
PHILIP NEWCOMB

KLUWER ACADEMIC PUBLISHERS


REVERSE ENGINEERING

edited by

Linda Wills
Georgia Institute of Technology

Philip Newcomb
The Software Revolution, Inc.

A Special Issue of
AUTOMATED SOFTWARE ENGINEERING
An International Journal
Volume 3, Nos. 1/2 (1996)

KLUWER ACADEMIC PUBLISHERS


Boston / Dordrecht / London
AUTOMATED
SOFTWARE
ENGINEERING
An International Journal
Volume 3, Nos. 1/2, June 1996
Special Issue: Reverse Engineering
Guest Editors: Linda Wills and Philip Newcomb

Preface Lewis Johnson 5

Introduction Linda Wills 7

Database Reverse Engineering: From Requirements to CARE Tools


J.-L. Hainaut, V. Englebert, J. Henrard, J.-M. Hick and D. Roland 9

Understanding Interleaved Code


Spencer Rugaber, Kurt Stirewalt and Linda M. Wills 47

Pattern Matching for Clone and Concept Detection


K.A. Kontogiannis, R. De Mori, E. Merlo, M. Galler and M. Bernstein 77

Extracting Architectural Features from Source Code


David R. Harris, Alexander S. Yeh and Howard B. Reubenstein 109

Strongest Postcondition Semantics and the Formal Basis for Reverse


Engineering Gerald C. Gannod and Betty H.C. Cheng 139

Recent Trends and Open Issues in Reverse Engineering


Linda M. Wills and James H. Cross II 165

Desert Island Column John Dobson 173


Distributors for North America:
Kluwer Academic Publishers
101 Philip Drive
Assinippi Park
Norwell, Massachusetts 02061 USA

Distributors for all other countries:


Kluwer Academic Publishers Group
Distribution Centre
Post Office Box 322
3300 AH Dordrecht, THE NETHERLANDS

Library of Congress Cataloging-in-Publication Data

A C.I.P. Catalogue record for this book is available


from the Library of Congress.

Copyright © 1996 by Kluwer Academic Publishers

All rights reserved. No part of this publication may be reproduced, stored in a


retrieval system or transmitted in any form or by any means, mechanical, photo-
copying, recording, or otherwise, without the prior written permission of the
publisher, Kluwer Academic Publishers, 101 Philip Drive, Assinippi Park, Norwell,
Massachusetts 02061

Printed on acid-free paper.

Printed in the United States of America


Automated Software Engineering 3, 5 (1996)
© 1996 Kluwer Academic Publishers. Manufactured in The Netherlands.

Preface

This issue of Automated Software Engineering is devoted primarily to the topic of reverse
engineering. This is a timely topic: many organizations must devote increasing resources
to the maintenance of outdated, so-called "legacy" systems. As these systems grow older,
and changing demands are made on them, they constitute an increasing risk of catastrophic
failure. For example, it is anticipated that on January 1, 2000, there will be an avalanche
of computer errors from systems that were not designed to handle dates larger than 1999.
In software engineering the term "legacy" has a negative connotation, meaning old and
decrepit. As Leon Osterweil has observed, the challenge of research in reverse engineering
and software understanding is to give the term the positive connotation that it deserves.
Legacy systems ought to be viewed as a valuable resource, capturing algorithms and busi-
ness rules that can be reused in future software systems. They are often an important cultural
heritage for an organization, embodying the organization's collective knowledge and ex-
pertise. But in order to unlock and preserve the value of legacy systems, we need tools that
can help extract useful information, and renovate the codes so that they can continue to be
maintained. Thus automated software engineering plays a critical role in this endeavor.
Last year's Working Conference on Reverse Engineering (WCRE) attracted a number of
excellent papers. Philip Newcomb and Linda Wills, the program co-chairs of the conference,
and I decided that many of these could be readily adapted into journal articles, and so we
decided that a special issue should be devoted to reverse engineering. By the time we were
done, there were more papers than could be easily accommodated in a single issue, and
so we decided to publish the papers as a double issue, along with a Desert Island Column
that was due for publication. Even so, we were not able to include all of the papers that
we hoped to publish at this time, and expect to include some additional reverse engineering
papers in future issues.
I would like to express my sincere thanks to Drs. Newcomb and Wills for organizing
this special issue. Their tireless efforts were essential to making this project a success.
A note of clarification is in order regarding the review process for this issue. When
Philip and Linda polled the WCRE program committee to determine which papers they
thought deserved consideration for this issue, they found that their own papers were among
the papers receiving highest marks. This was a gratifying outcome, but also a cause for
concern, as it might appear to the readership that they had a conflict of interest. After
reviewing the papers myself, I concurred with the WCRE program committee; these papers
constituted an important contribution to the topic of reverse engineering, and should not be
overlooked. In order to eliminate the conflict of interest, it was decided that these papers
would be handled through the regular Automated Software Engineering submission process,
and be published when they reach completion. One of these papers, by Rugaber, Stirewalt,
and Wills, is now ready for publication, and I am pleased to recommend it for inclusion in
this special issue. We look to publish additional papers in forthcoming issues.

W.L. Johnson
Automated Software Engineering 3, 7-8 (1996)
(c) 1996 Kluwer Academic Publishers. Manufactured in The Netherlands.

Introduction to the Special Double Issue


on Reverse Engineering
LINDA M. WILLS
Georgia Institute of Technology

A central activity in software engineering is comprehending existing software artifacts.


Whether the task is to maintain, test, migrate, or upgrade a legacy system or reuse software
components in the development of new systems, the software engineer must be able to
recover information about existing software. Relevant information includes: What are its
components and how do they interact and compose? What is their functionality? How
are certain requirements met? What design decisions were made in the construction of the
software? How do features of the software relate to concepts in the application domain?
Reverse engineering involves examining and analyzing software systems to help answer
questions like these. Research in this field focuses on developing tools for assisting and
automating portions of this process and representations for capturing and managing the
information extracted.
Researchers actively working on these problems in academia and industry met at the
Working Conference on Reverse Engineering (WCRE), held in Toronto, Ontario, in July
1995. This issue of Automated Software Engineering features extended versions of select
papers presented at the Working Conference. They are representative of key technological
trends in the field.
As with any complex problem, being able to provide a well-defined characterization
of the problem's scope and underlying issues is a crucial step toward solving it. The
Hainaut et al. and Rugaber et al. papers both do this for problems that have thus far
been ill-defined and attacked only in limited ways. Hainaut et al. deal with the problem
of recovering logical and conceptual data models from database applications. Rugaber et
al. characterize the difficult problem of unraveling code that consists of several interleaved
strands of computation. Both papers draw together work on several related, but seemingly
independent problems, providing a framework for solving them in a unified way.
While Rugaber et al. deal with the problem of interleaving, which often arises due to
structure-sharing optimizations, Kontogiannis et al. focus on the complementary problem
of code duplication. This occurs as programs evolve and code segments are reused by simply
duplicating them where they are needed, rather than factoring out the common structure
into a single, generalized function. Kontogiannis et al. describe a collection of new pattern
matching techniques for detecting pairs of code "clones" as well as for recognizing abstract
programming concepts.
The recognition of meaningful patterns in software is a widely-used technique in reverse
engineering. Currently, there is a trend toward flexible, interactive recognition paradigms,
which give the user explicit control, for example, in selecting the type of recognizers to use
and the degree of dissimilarity to tolerate in partial matches. This trend can be seen in the
Kontogiannis et al. and Harris et al. papers.
Harris et al. focus on recognition of high-level, architectural features in code, using
a library of individual recognizers. This work not only attacks the important problem of
architectural recovery, it also contributes to more generic recognition issues, such as li-
brary organization and analyst-controlled retrieval, interoperability between recognizers,
recognition process optimization, and recognition coverage metrics.
Another trend in reverse engineering is toward increased use of formal methods. A repre-
sentative paper by Gannod and Cheng describes a formal approach to extracting specifica-
tions from imperative programs. They advocate the use of strongest postcondition semantics
as a formal model that is more appropriate for reverse engineering than the more famil-
iar weakest precondition semantics. The use of formal methods introduces more rigor and
clarity into the reverse engineering process, making the techniques more easily automated
and validated.
The papers featured here together provide a richly detailed perspective on the state of the
field of reverse engineering. A more general overview of the trends and challenges of the
field is provided in the summary article by Wills and Cross.
The papers in this issue are extensively revised and expanded versions of papers that
originally appeared in the proceedings of the Working Conference on Reverse Engineering.
We would like to thank the authors and reviewers of these papers, as well as the reviewers of
the original WCRE papers, for their diligent efforts in creating high-quality presentations
of this research.
Finally, we would like to acknowledge the general chair of WCRE, Elliot Chikofsky,
whose vision and creativity have provided a forum for researchers to share ideas and work
together in a friendly, productive environment.
Automated Software Engineering 3, 9-45 (1996)
(c) 1996 Kluwer Academic Publishers. Manufactured in The Netherlands.

Database Reverse Engineering: From Requirements


to CARE Tools*
J.-L. HAINAUT jlh@info.fundp.ac.be
V. ENGLEBERT, J. HENRARD, J.-M. HICK AND D. ROLAND
Institut d'Informatique, University of Namur, rue Grandgagnage 21, B-5000 Namur

Abstract. This paper analyzes the requirements that CASE tools should meet for effective database reverse
engineering (DBRE), and proposes a general architecture for data-centered applications reverse engineering CASE
environments. First, the paper describes a generic DBMS-independent DBRE methodology, then it analyzes the
main characteristics of DBRE activities in order to collect a set of desirable requirements. Finally, it describes
DB-MAIN, an operational CASE tool developed according to these requirements. The main features of this tool
that are described in this paper are its unique generic specification model, its repository, its transformation toolkit,
its user interface, the text processors, the assistants, the methodological control and its functional extensibility.
Finally, the paper describes five real-world projects in which the methodology and the CASE tool were applied.

Keywords: reverse engineering, database engineering, program understanding, methodology, CASE tools

1. Introduction

1.1. The problem and its context

Reverse engineering a piece of software consists, among others, in recovering or recon-


structing its functional and technical specifications, starting mainly from the source text of
the programs (IEEE, 1990; Hall, 1992; Wills et al., 1995). Recovering these specifications
is generally intended to redocument, convert, restructure, maintain or extend old applica-
tions. It is also required when developing a Data Administration function that has to know
and record the description of all the information resources of the company.
The problem is particularly complex with old and ill-designed applications. In this case,
not only can no decent documentation (if any) be relied on, but the lack of systematic method-
ologies for designing and maintaining them has led to tricky and obscure code. Therefore,
reverse engineering has long been recognized as a complex, painful and prone-to-failure

*This is a heavily revised and extended version of "Requirements for Information System Reverse Engineering
Support" by J.-L. Hainaut, V. Englebert, J. Henrard, J.-M. Hick, D. Roland, which first appeared in the Proceedings
of the Second Working Conference on Reverse Engineering, IEEE Computer Society Press, pp. 136-145, July
1995. This paper presents some results of the DB-MAIN project. This project is partially supported by the Region
Wallonne, the European Union, and by a consortium comprising ACEC-OSI (Be), ARIANE-II (Be), Banque UCL
(Lux), BBL (Be), Centre de recherche public H. Tudor (Lux), CGER (Be), Cockerill-Sambre (Be), CONCIS
(Fr), D'Ieteren (Be), DIGITAL, EDF (Fr), EPFL (CH), Groupe S (Be), IBM, OBLOG Software (Port), ORIGIN
(Be), Ville de Namur (Be), Winterthur (Be), 3 Suisses (Be). The DB-Process subproject is supported by the
Communauté Française de Belgique.

activity, so much so that it is simply not undertaken most of the time, leaving huge amounts
of invaluable knowledge buried in the programs, and therefore definitively lost.
In information systems, or data-oriented applications, i.e., in applications whose central
component is a database (or a set of permanent files), the complexity can be broken down
by considering that the files or databases can be reverse engineered (almost) independently
of the procedural parts.
This proposition to split the problem in this way can be supported by the following
arguments:
— the semantic distance between the so-called conceptual specifications and the physical
implementation is most often narrower for data than for procedural parts;
— the permanent data structures are generally the most stable part of applications;
— even in very old applications, the semantic structures that underlie the file structures are
mainly procedure-independent (though their physical structures are highly procedure-
dependent);
— reverse engineering the procedural part of an application is much easier when the
semantic structure of the data has been elicited.
Therefore, concentrating on reverse engineering the data components of the application
first can be much more efficient than trying to cope with the whole application.
The database community considers that there exist two outstanding levels of description
of a database or of a consistent collection of files, materialized into two documents, namely
its conceptual schema and its logical schema. The first one is an abstract, technology-
independent, description of the data, expressed in terms close to the application domain.
Conceptual schemas are expressed in some semantics-representation formalisms such as
the ERA, NIAM or OMT models. The logical schema describes these data translated into
the data model of a specific data manager, such as a commercial DBMS. A logical schema
comprises tables, columns, keys, record types, segment types and the like.
The primary aim of database reverse engineering (DBRE) is to recover possible logical
and conceptual schemas for an existing database.

1.2. Two introductory examples

The real scope of database reverse engineering has sometimes been misunderstood, and pre-
sented as merely redrawing the data structures of a database into some DBMS-independent
formalism. Many early scientific proposals, and most current CASE tools are limited to the
translation process illustrated in figure 1. In such situations, some elementary translation
rules suffice to produce a tentative conceptual schema.
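To make the notion of an elementary translation rule concrete, the following sketch (in Python, purely
illustrative: the schema representation and the rule are our own simplification, not DB-MAIN's repository
model) turns each declared foreign key of a parsed relational schema into a one-to-many relationship type,
which is essentially the step pictured in figure 1.

    from dataclasses import dataclass, field

    # Illustrative "elementary translation rule": each declared foreign key
    # becomes a one-to-many relationship type between the referencing and the
    # referenced entity types. The schema representation is a deliberate
    # simplification, not the DB-MAIN repository model.

    @dataclass
    class Table:
        name: str
        columns: list                                       # column names
        primary_key: list                                   # subset of columns
        foreign_keys: dict = field(default_factory=dict)    # {fk column: referenced table}

    @dataclass
    class Relationship:
        name: str
        one_side: str      # referenced entity type (the 0-N role)
        many_side: str     # referencing entity type (the 1-1 role)

    def untranslate_foreign_keys(tables):
        """Derive a tentative conceptual view: entity types plus relationship types."""
        entity_types = {t.name: [c for c in t.columns if c not in t.foreign_keys]
                        for t in tables}
        relationships = [Relationship(f"{t.name}_{ref}", one_side=ref, many_side=t.name)
                         for t in tables for ref in t.foreign_keys.values()]
        return entity_types, relationships

    if __name__ == "__main__":
        customer = Table("CUSTOMER", ["CNUM", "CNAME", "CADDRESS"], ["CNUM"])
        order = Table("ORDER", ["ONUM", "CNUM", "ODATE"], ["ONUM"],
                      foreign_keys={"CNUM": "CUSTOMER"})
        print(untranslate_foreign_keys([customer, order]))

Applied to the CUSTOMER/ORDER schema of figure 1, it yields a single one-to-many relationship type between
CUSTOMER and ORDER, with the foreign key column removed from the ORDER entity type.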
Unfortunately, most situations are actually far more complex. In figure 2, we describe
a very small COBOL fragment from which we intend to extract the semantics underlying
the files CF008 and PFOS. By merely analyzing the record structure declarations, as most
DBRE CASE tools do at the present time, only schema (a) in figure 2 can be extracted. It
obviously brings little information about the meaning of the data.
However, by analyzing the procedural code, the user-program dialogs, and, if needed,
the file contents, a more expressive schema can be obtained. For instance, schema (b) can
be considered as a refinement of schema (a) resulting from the following reasonings:
[Figure 1, left: the SQL schema]

    create table CUSTOMER (
      CNUM numeric(6) not null,
      CNAME char(24) not null,
      CADDRESS char(48) not null,
      primary key (CNUM))

    create table ORDER (
      ONUM char(8) not null,
      CNUM numeric(6) not null,
      ODATE date,
      primary key (ONUM),
      foreign key (CNUM) references CUSTOMER)

[Figure 1, right: the corresponding conceptual schema: entity type CUSTOMER (CNUM, CNAME, CADDRESS;
id: CNUM) linked to entity type ORDER (ONUM, ODATE; id: ONUM) by the one-to-many relationship type
"passes" (0-N on the CUSTOMER side, 1-1 on the ORDER side).]

Figure 1. An idealistic view of database reverse engineering.

[Figure 2, top: the COBOL fragment, reconstructed from the scanned figure]

    Environment division.
     Input-output section.
      File control.
        select CF008 assign to DSK02:P12
          organization is indexed
          record key is K1 of REC-CF008-1.
        select PFOS assign to DSK02:P27
          organization is indexed
          record key is K1 of REC-PFOS-1.

    Data division.
     File section.
     fd CF008.
     01 REC-CF008-1.
        02 K1 pic 9(6).
        02 filler pic X(125).
     fd PFOS.
     01 REC-PFOS-1.
        02 K1.
           03 K11 pic X(9).
           03 filler pic 9(6).
        02 PDATA pic X(180).
        02 PRDATA redefines PDATA.
           03 filler pic 9(4)V99.

     Working storage section.
     01 IN-COMPANY.
        02 CPY-ID pic 9(6).
        02 C-DATA pic X(125).
     01 CPY-DATA.
        02 CNAME pic X(25).
        02 CADDRESS pic X(100).
     01 PKEY.
        02 K11 pic X(9).
        02 K12 pic X(6).

    Procedure division.
        move zeroes to PKEY.
        display "Enter ID :" with no advancing.
        accept K11 of PKEY.
        move PKEY to K1 of REC-PFOS-1.
        read PFOS key K1 on invalid key
           display "Invalid Product ID".
        display PDATA with no advancing.
        read PFOS.
        perform until K11 of K1 > K11 of PKEY
           display "Production : " with no advancing
           display PRDATA with no advancing
           display " tons by " with no advancing
           move K1 of REC-PFOS-1 to PKEY
           move K12 of PKEY to K1 of REC-CF008-1
           read CF008 into IN-COMPANY
              not invalid key
                 move C-DATA to CPY-DATA
                 display CNAME of CPY-DATA
           end-read
           read next PFOS
        end-perform.

[Figure 2, bottom: the three schemas. (a) The record types as declared: REC-CF008-1 (K1, filler; id: K1) and
REC-PFOS-1 (K1 with parts K11 and filler, PDATA[0-1], PRDATA[0-1]; id: K1). (b) The refined logical schema:
record types COMPANY (CPY-ID, CNAME, CADDRESS; id: CPY-ID), PRODUCT (PRO-ID, PNAME, CATEGORY;
id: PRO-ID) and PRODUCTION (P-ID made of PRO-ID and CPY-ID, VOLUME; id: P-ID), where P-ID.PRO-ID and
P-ID.CPY-ID are foreign keys to PRODUCT and COMPANY. (c) The conceptual schema: entity types COMPANY
and PRODUCT linked by a many-to-many PRODUCTION relationship type carrying a VOLUME attribute.]

Figure 2. A more realistic view of database reverse engineering. Merely analyzing the data structure declaration
statements yields a poor result (a), while further inspection of the procedural code makes it possible to recover a
much more explicit schema (b), which can be expressed as a conceptual schema (c).

— the gross structure of the program suggests that there are two kinds of REC-PFOS-1
records, arranged into ordered sequences, each comprising one type-1 record (whose
PDATA field is processed before the loop), followed by an arbitrary sequence of type-2
records (whose PRDATA field is processed in the body of the loop); all the records of
such a sequence share the same first 9 characters of the key;
— the processing of type-1 records shows that the K11 part of key K1 is an identifier,
the rest of the key acting as pure padding; the user dialog suggests that type-1 records
describe products; this record type is called PRODUCT, and its key PRO-ID; visual
inspection of the contents of the PFOS file could confirm this hypothesis;
— examining the screen contents when running the program shows that PDATA is made of
a product name followed by a product category; this interpretation is given by a typical
user of the program; this field can then be considered as the concatenation of a PNAME
field and a CATEGORY field;
— the body of the loop processes the sequence of type-2 records depending on the cur-
rent PRODUCT record; they all share the PRO-ID value of their parent PRODUCT
record, so that this 9-digit subfield can be considered as a foreign key to the PRODUCT
record;
— the processing of a type-2 record consists in displaying one line made up of constants
and field values; the linguistic structure of this line suggests that it informs us about
some Production of the current product; the PRDATA value is expressed in tons (most
probably a volume), and seems to be produced by some kind of agents described in the
file CF008; hence the names PRODUCTION for type-2 record type and VOLUME for
the PRDATA field;
— the agent of a production is obtained by using the second part of the key of the PRO-
DUCTION record; therefore, this second part can be considered as a foreign key to the
REC-CF008-1 records;
— the name of the field in which the agent record is stored suggests that the latter is a
company; hence the name COMPANY for this record type, and CPY-ID for its access
key;
— the C-DATA field of the COMPANY record type should match the structure of the
CPY-DATA variable, which in turn is decomposed into CNAME and CADDRESS.

Refining the initial schema (a) by these reasonings results in schema (b), and interpreting
these technical data structures into a semantic information model (here some variant of the
Entity-relationship model) leads to schema (c).
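Several of the reasonings above rest on hypotheses that can be cross-checked against the file contents, e.g.,
that K11 identifies PRODUCT records, or that the second part of a PRODUCTION key references a COMPANY.
The sketch below (Python, for illustration only) shows one way such checks might be coded; it assumes that
the PFOS and CF008 records have been exported as fixed-length strings laid out as in figure 2, and that
type-1 (PRODUCT) records can be recognized by a zero-filled second key part, which is an assumption of ours.

    # Hypothesis checks over exported file contents (illustrative assumptions:
    # each record is a raw fixed-length string laid out as in figure 2, and
    # type-1 records carry zeroes in the last six positions of their key).

    def is_product(rec_pfos: str) -> bool:
        return rec_pfos[9:15] == "000000"

    def product_id_is_identifier(pfos_records) -> bool:
        """Hypothesis: K11 (the first 9 characters) identifies type-1 records."""
        ids = [r[:9] for r in pfos_records if is_product(r)]
        return len(ids) == len(set(ids))

    def company_part_is_foreign_key(pfos_records, cf008_records) -> bool:
        """Hypothesis: the second key part of type-2 records references CF008."""
        company_ids = {r[:6] for r in cf008_records}
        return all(r[9:15] in company_ids
                   for r in pfos_records if not is_product(r))

A single counter-example returned by either check is enough to reject the corresponding hypothesis, while
the absence of counter-examples merely accumulates evidence for it.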
Despite its small size, this exercise emphasizes some common difficulties of database
reverse engineering. In particular, it shows that the declarative statements that define file
and record structures can prove a poor source of information. The analyst must often rely
on the inspection of other aspects of the application, such as the procedural code, the user-
program interaction, the program behaviour, the file contents. This example also illustrates
the weaknesses of most data managers which, together with some common programming
practices that tend to hide important structures, force the programmer to express essential
data properties through procedural code. Finally, domain knowledge proves essential to
discover and to validate some components of the resulting schema.

1.3. State of the art

Though reverse engineering data structures is still a complex task, it appears that the current
state of the art provides us with concepts and techniques powerful enough to make this
enterprise more realistic.
The literature proposes systematic approaches for database schema recovery:

— for standard files: (Casanova and Amaral de Sa, 1983; Nilsson, 1985; Davis and Arora,
1985; Sabanis and Stevenson, 1992)
— for IMS databases: (Navathe and Awong, 1988; Winans and Davis, 1990; Batini et al.,
1992; Fong and Ho, 1994)
— for CODASYL databases: (Batini et al., 1992; Fong and Ho, 1994; Edwards and Munro,
1995)
— for relational databases: (Casanova and Amaral de Sa, 1984; Navathe and Awong, 1988;
Davis and Arora, 1988; Johannesson and Kalman, 1990; Markowitz and Makowsky,
1990; Springsteel and Kou, 1990; Fonkam and Gray, 1992; Batini et al., 1992; Premer-
lani and Blaha, 1993; Chiang et al., 1994; Shoval and Shreiber, 1993; Petit et al., 1994;
Andersson, 1994; Signore et al., 1994; Vermeer and Apers, 1995).

Many of these studies, however, appear to be limited in scope, and are generally based on
assumptions about the quality and completeness of the source data structures to be reverse
engineered that cannot be relied on in many practical situations. For instance, they often
suppose that

— all the conceptual specifications have been translated into data structures and constraints
(at least until 1993); in particular, constraints that have been procedurally expressed
are ignored;
— the translation is rather straightforward (no tricky representations); for instance, a rela-
tional schema is often supposed to be in 4NF; Premerlani and Blaha (1993) and Blaha
and Premerlani (1995) are among the only proposals that cope with some non trivial
representations, or idiosyncrasies, observed in real world applications; let us mention
some of them: nullable primary key attributes, almost unique primary keys, denormal-
ized structures, degradation of inheritance, mismatched referential integrity domains,
overloaded attributes, contradictory data;
— the schema has not been deeply restructured for performance objectives or for any other
requirements; for instance, record fragmentation or merging for disc space or access
time minimization will remain undetected and will be propagated to the conceptual
schema;
— a complete and up-to-date DDL schema of the data is available;
— names have been chosen rationally (e.g., a foreign key and the referenced primary key
have the same name), so that they can be used as reliable definition of the objects
they denote.

In many proposals, it appears that the only databases that can be processed are those
that have been obtained by a rigorous database design method. This condition cannot be
assumed for most large operational databases, particularly for the oldest ones. Moreover,
these proposals are most often dedicated to one data model and do not attempt to elaborate
techniques and reasonings common to several models, leaving the question of a general
DBRE approach still unanswered.
Since 1992, some authors have recognized that the procedural part of the application
programs is an essential source of information on data structures (Joris et al., 1992; Hainaut
et al., 1993a; Petit et al., 1994; Andersson, 1994; Signore et al., 1994).
Like any complex process, DBRE cannot be successful without the support of adequate
tools called CARE tools. An increasing number of commercial products (claim to) offer
DBRE functionalities. Though they ignore many of the most difficult aspects of the problem,
those tools provide their users with invaluable help to carry out DBRE more effectively
(Rock-Evans, 1990).
In Hainaut et al. (1993a), we proposed the theoretical baselines for a generic, DBMS-
independent, DBRE methodology. These baselines have been developed and extended
in Hainaut et al. (1993b) and Hainaut et al. (1994). The current paper translates these
principles into practical requirements DBRE CARE tools should meet, and presents the main
aspects and components of a CASE tool dedicated to database applications engineering,
and more specifically to database reverse engineering.

1.4. About this paper

The paper is organized as follows. Section 2 is a synthesis of the main problems which
occur in practical DBRE, and of a generic DBMS-independent DBRE methodology. Sec-
tion 3 discusses some important requirements which should be satisfied by future DBRE
CARE tools. Section 4 briefly presents a prototype DBRE CASE tool which is intended
to address these requirements. The following sections describe in further detail some of
the original principles and components of this CASE tool: the specification model and the
repository (Section 5), the transformation toolkit (Section 6), the user interface (Section
7), the text analyzers and name processor (Section 8), the assistants (Section 9), functional
extensibility (Section 10) and methodological control (Section 11). Section 12 evaluates
to what extent the tool meets the requirements while Section 13 describes some real world
applications of the methodology and of the tool.
The reader is assumed to have some basic knowledge of data management and design. Re-
cent references Elmasri and Navathe (1994) and Date (1994) are suggested for data manage-
ment, while Batini et al. (1992) and Teorey (1994) are recommended for database design.

2. A generic methodology for database reverse engineering (DBRE)

The problems that arise when recovering the documentation of the data naturally fall in two
categories that are addressed by the two major processes in DBRE, namely data structure
extraction and data structure conceptualization (Joris et al., 1992; Sabanis and Stevenson,
1992; Hainaut et al., 1993a). By naturally, we mean that these problems relate to the
recovery of two different schemas, and that they require quite different concepts, reasonings
and tools. In addition, each of these processes grossly appears as the reverse of a standard
database design process (resp. physical and logical design (Teorey, 1994; Batini et al.,
1992)). We will describe briefly these processes and the problems they try to solve. Let
us mention however, that partitioning the problems in this way is not proposed by many
authors, who prefer proceeding in one step only. In addition, other important processes are
ignored in this discussion for simplicity (see Joris et al. (1992) for instance).
This methodology is generic in two ways. First, its architecture and its processes are
largely DMS-independent. Secondly, it specifies what problems have to be solved, and
in which way, rather than the order in which the actions must be carried out. Its general
architecture, as developed in Hainaut et al. (1993a), is outlined in figure 3.

[Figure 3 diagram, read bottom-up: DMS-compliant optimized schema — SCHEMA PREPARATION —
SCHEMA UNTRANSLATION — conceptual-logical-physical schema — SCHEMA DE-OPTIMIZATION —
raw conceptual schema — CONCEPTUAL NORMALIZATION — normalized conceptual schema.]

Figure 3. Main components of the generic DBRE methodology. Quite naturally, this reverse methodology is to
be read from right to left, and bottom-up!

2.1. Data structure extraction

This phase consists in recovering the complete DMS schema, including all the implicit
and explicit structures and constraints. True database systems generally supply, in some
readable and processable form, a description of this schema (data dictionary contents, DDL
texts, etc.). Though essential information may be missing from this schema, the latter is
a rich starting point that can be refined through further analysis of the other components
of the application (views, subschemas, screen and report layouts, procedures, fragments of
documentation, database content, program execution, etc.).
The problem is much more complex for standard files, for which no computerized de-
scription of their structure exists in most cases. The analysis of each source program
provides a partial view of the file and record structures only. For most real-world (i.e., non
academic) applications, this analysis must go well beyond the mere detection of the record
structures declared in the programs. In particular, three problems are encountered, that
derive from frequent design practices, namely structure hiding, non declarative structures
and lost specifications. Unfortunately, these practices are also common in (true) databases,
i.e., those controlled by DBMS, as illustrated by Premerlani and Blaha (1993) and Blaha
and Premerlani (1995) for relational databases.
Structure hiding applies to a source data structure or constraint S1, which could be imple-
mented in the DMS. It consists in declaring it as another data structure S2 that is more general
and less expressive than S1, but that satisfies other requirements such as field reusability,
genericity, program conciseness, simplicity or efficiency. Let us mention some examples:
a compound/multivalued field in a record type is declared as a single-valued atomic field, a
sequence of contiguous fields are merged into a single anonymous field (e.g., as an unnamed
COBOL field), a one-to-many relationship type is implemented as a many-to-many link, a
referential constraint is not explicitly declared as a foreign key, but is procedurally checked,
a relationship type is represented by a foreign key (e.g., in IMS and CODASYL databases).
Non declarative structures are structures or constraints which cannot be declared in
the target DMS, and therefore are represented and checked by other means, such as pro-
cedural sections of the application. Most often, the checking sections are not centralized,
but are distributed and duplicated (frequently in different versions), throughout the appli-
cation programs. Referential integrity in standard files and one-to-one relationship types in
CODASYL databases are some examples.
Lost specifications are constructs of the conceptual schema that have not been imple-
mented in the DMS data structures nor in the application programs. This does not mean
that the data themselves do not satisfy the lost constraint, but the trace of its enforcement
cannot be found in the declared data structures nor in the application programs. Let us
mention popular examples: uniqueness constraints on sequential files and secondary keys
in IMS and CODASYL databases.
Recovering hidden, non-declarative and lost specifications is a complex problem, for
which no deterministic methods exist so far. A careful analysis of the procedural state-
ments of the programs, of the dataflow through local variables and files, of the file contents,
of program inputs and outputs, of the user interfaces, of the organizational rules, can accu-
mulate evidence for these specifications. Most often, that evidence must be consolidated
by the domain knowledge.
Until very recently, these problems have not triggered much interest in the literature.
The first proposals address the recovery of integrity constraints (mainly referential and
inclusion) in relational databases through the analysis of SQL queries (Petit et al., 1994;
Andersson, 1994; Signore et al., 1994). In our generic methodology, the main processes of
DATA STRUCTURE EXTRACTION are the following:

• DMS-DDL text ANALYSIS. This rather straightforward process consists in analyzing


the data structures declaration statements (in the specific DDL) included in the schema
scripts and application programs. It produces a first-cut logical schema.
• PROGRAM ANALYSIS. This process is much more complex. It consists in analyzing
the other parts of the application programs, a.o., the procedural sections, in order to detect
evidence of additional data structures and integrity constraints. The first-cut schema can
therefore be refined following the detection of hidden, non declarative structures.
• DATA ANALYSIS. This refinement process examines the contents of the files and
databases in order (1) to detect data structures and properties (e.g., to find the unique
fields or the functional dependencies in a file), and (2) to test hypotheses (e.g., "could
this field be a foreign key to this file?"). Hidden, non declarative and lost structures can
be found in this way.
• SCHEMA INTEGRATION. When more than one information source has been pro-
cessed, the analyst is provided with several, generally different, extracted (and possibly
refined) schemas. Let us mention some common situations: base tables and views
(RDBMS), DBD and PSB (IMS), schema and subschemas (CODASYL), file structures
from all the application programs (standard files), etc. The final logical schema must
include the specifications of all these partial views, through a schema integration process.

The end product of this phase is the complete logical schema. This schema is expressed
according to the specific model of the DMS, and still includes possible optimized constructs,
hence its name: the DMS-compliant optimized schema, or DMS schema for short.
The current DBRE CARE tools offer only limited DMS-DDL text ANALYSIS function-
alities. The analyst is left without help as far as PROGRAM ANALYSIS, DATA ANALYSIS
and SCHEMA INTEGRATION processes are concerned. The DB-MAIN tool is intended
to address all these processes and to improve the support that analysts are entitled to expect
from CARE tools.
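As a deliberately simplified illustration of the DMS-DDL text ANALYSIS process described above, the sketch
below (Python; not DB-MAIN's extractor) derives a first-cut schema from COBOL data declarations; a real
extractor must of course also handle REDEFINES, OCCURS, copybooks and the many other features discussed
in this section.

    import re

    # Minimal first-cut extraction of record types and fields from COBOL data
    # declarations (illustrative only: REDEFINES, OCCURS, copybooks and most
    # of the real syntax are ignored).

    LINE = re.compile(r"^\s*(\d{2})\s+([\w-]+)(?:\s+pic\s+([^.]+))?\s*\.", re.IGNORECASE)

    def extract_first_cut_schema(cobol_text: str):
        schema = {}          # {record type: [(field, picture), ...]}
        current = None
        for line in cobol_text.splitlines():
            m = LINE.match(line)
            if not m:
                continue
            level, name, picture = int(m.group(1)), m.group(2), m.group(3)
            if level == 1:
                current = name
                schema[current] = []
            elif current is not None:
                schema[current].append((name, (picture or "").strip() or None))
        return schema

    if __name__ == "__main__":
        sample = """
        01 REC-CF008-1.
           02 K1 pic 9(6).
           02 FILLER pic X(125).
        """
        print(extract_first_cut_schema(sample))
        # {'REC-CF008-1': [('K1', '9(6)'), ('FILLER', 'X(125)')]}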

2.2. Data structure conceptualization

This second phase addresses the conceptual interpretation of the DMS schema. It con-
sists for instance in detecting and transforming or discarding non-conceptual structures,
redundancies, technical optimization and DMS-dependent constructs. It consists of two
sub-processes, namely Basic conceptualization and Conceptual normalization. The reader
will find in Hainaut et al. (1993b) a more detailed description of these processes, which
rely heavily on schema restructuring techniques (or schema transformations).

• BASIC CONCEPTUALIZATION. The main objective of this process is to extract all


the relevant semantic concepts underlying the logical schema. Two different problems,
requiring different reasonings and methods, have to be solved: schema untranslation
and schema de-optimization. However, before tackling these problems, we often have to
prepare the schema by cleaning it.

SCHEMA PREPARATION. The schema still includes some constructs, such as files
and access keys, which may have been useful in the Data Structure Extraction phase,
but which can now be discarded. In addition, translating names to make them more
meaningful (e.g., substitute the file name for the record name), and restructuring some
parts of the schema can prove useful before trying to interpret them.
SCHEMA UNTRANSLATION. The logical schema is the technical translation of con-
ceptual constructs. Through this process, the analyst identifies the traces of such transla-
tions, and replaces them by their original conceptual construct. Though each data model
can be assigned its own set of translating (and therefore of untranslating) rules, two facts
are worth mentioning. First, the data models can share an important subset of translating
rules (e.g., COBOL files and SQL structures). Secondly, translation rules considered as
specific to a data model are often used in other data models (e.g., foreign keys in IMS
and CODASYL databases). Hence the importance of generic approaches and tools.
SCHEMA DE-OPTIMIZATION. The logical schema is searched for traces of constructs
designed for optimization purposes. Three main families of optimization techniques
should be considered: denormalization, structural redundancy and restructuring (Hainaut
et al., 1993b).
• CONCEPTUAL NORMALIZATION. This process restructures the basic conceptual
schema in order to give it the desired qualities one expects from any final conceptual
schema, e.g., expressiveness, simplicity, minimality, readability, genericity, extensibility.
For instance, some entity types are replaced by relationship types or by attributes, is-a
relations are made explicit, names are standardized, etc. This process is borrowed from
standard DB design methodologies (Batini et al., 1992; Teorey, 1994; Rauh and Stickel,
1995).

All the proposals published so far address this phase, most often for specific DMS,
and for rather simple schemas (e.g., with no implementation tricks). They generally pro-
pose elementary rules and heuristics for the SCHEMA UNTRANSLATION process and to
some extent for CONCEPTUAL NORMALIZATION, but not for the more complex DE-
OPTIMIZATION phase. The DB-MAIN CARE tool has been designed to address all these
processes in a flexible way.

2.3. Summary of the limits of the state of the art in CARE tools

The methodological framework developed in Sections 2.1 and 2.2 can be specialized ac-
cording to a specific DMS and according to specific development standards. For instance
Hainaut et al. (1993b) suggests specialized versions of the CONCEPTUALIZATION phase
for SQL, COBOL, IMS, CODASYL and TOTAL/IMAGE databases.
It is interesting to use this framework as a reference process model against which existing
methodologies can be compared, in particular, those which underlie the current CARE tools
(figure 4). The conclusions can be summarized as follows:

• DATA STRUCTURE EXTRACTION. Current CARE tools are limited to parsing


DMS-DDL schemas only (DMS-DDL text ANALYSIS). All the other sources are
ignored, and must be processed manually. For instance, these tools are unable to collect
the multiple views of a COBOL application, and to integrate them to produce the global
COBOL schema. A user of a popular CARE tool tells us "how he spent several weeks,
cutting and pasting hundreds of sections of programs, to build an artificial COBOL pro-
gram in which all the files and records were fully described. Only then was the tool able
to extract the file data structures".
• DATA STRUCTURE CONCEPTUALIZATION. Current CARE tools focus mainly
on untranslation (SCHEMA UNTRANSLATION) and offer some restructuring facilities
(CONCEPTUAL NORMALIZATION), though these processes often are merged. Once
again, some strong naming conventions must often be satisfied for the tools to help.
For instance, a foreign key and the referenced primary key must have the same names.
All performance-oriented constructs, as well as most non standard database structures
(see Premerlani and Blaha (1993) and Blaha and Premerlani (1995) for instance) are
completely beyond the scope of these tools.

[Figure 4 diagram: DMS-DDL schema — DMS-DDL text ANALYSIS — DMS-compliant optimized schema —
SCHEMA UNTRANSLATION — conceptual-logical schema — CONCEPTUAL NORMALIZATION —
normalized conceptual schema.]

Figure 4. Simplified DBRE methodology proposed by most current CARE tools.

3. Requirements for a DBRE CARE tool

This section states some of the most important requirements an ideal DBRE support envi-
ronment (or CARE tool) should meet. These requirements are induced by the analysis of the
specific characteristics of the DBRE processes. They also derive from experience in reverse
engineering the files and databases of a dozen actual applications, often by hand or with
very basic text editing tools.

Flexibility

Observation. The very nature of the RE activities differs from that of more standard engi-
neering activities. Reverse engineering a software component, and particularly a database.
is basically an exploratory and often unstructured activity. Some important aspects of


higher level specifications are discovered (sometimes by chance), and not deterministically
inferred from the operational ones.
Requirements. The tool must allow the user to follow flexible working patterns, including
unstructured ones. It should be methodology-neutral, unlike forward engineering tools. In
addition, it must be highly interactive.

Extensibility

Observation. RE appears as a learning process; each RE project often is a new problem


of its own, requiring specific reasonings and techniques.
Requirements. Specific functions should be easy to develop, even for one-shot use.

Source multiplicity

Observation. RE requires a great variety of information sources: data structure, data (from
files, databases, spreadsheets, etc.), program text, program execution, program output,
screen layout, CASE repository and Data dictionary contents, paper or computer-based
documentation, interview, workflow and dataflow analysis, domain knowledge, etc.
Requirements. The tool must include browsing and querying interfaces with these sources.
Customizable functions for automatic and assisted specification extraction should be avail-
able for each of them.

Text analysis

Observation. More particularly, database RE requires browsing through huge amounts


of text, searching them for specific patterns (e.g., programming cliches (Selfridge et al.,
1993)), following static execution paths and dataflows, extracting program slices (Weiser,
1984).
Requirements. The CARE tool must provide sophisticated text analysis processors. The
latter should be language independent, easy to customize and to program, and tightly coupled
with the specification processing functions.
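By way of illustration, a text analysis processor of the kind required here might search source code for
a cliche such as a MOVE into a record key shortly followed by a READ of the corresponding file, a pattern
that often betrays a procedurally checked referential constraint. The sketch below uses plain regular
expressions; the pattern language and analyzers actually provided by DB-MAIN are presented in Section 8.

    import re

    # Illustrative cliche search: a MOVE of some field into a record key,
    # shortly followed by a READ of the corresponding file, is reported as a
    # hint of an implicit foreign key. Plain regular expressions only; this is
    # not DB-MAIN's pattern engine.

    CLICHE = re.compile(
        r"MOVE\s+(?P<source>[\w-]+)(?:\s+OF\s+[\w-]+)?"
        r"\s+TO\s+(?P<key>[\w-]+)\s+OF\s+(?P<record>[\w-]+)"
        r"(?:.|\n){0,200}?"
        r"READ\s+(?P<file>[\w-]+)",
        re.IGNORECASE)

    def find_implicit_foreign_key_hints(program_text: str):
        return [{"candidate field": m.group("source"),
                 "record key": m.group("key"),
                 "record type": m.group("record"),
                 "file read": m.group("file")}
                for m in CLICHE.finditer(program_text)]

    if __name__ == "__main__":
        fragment = "move K12 of PKEY to K1 of REC-CF008-1\nread CF008 into IN-COMPANY"
        print(find_implicit_foreign_key_hints(fragment))

Run on the fragment of figure 2, it reports K12 as a candidate reference to the file CF008, which is
precisely the hint the analyst exploited in Section 1.2.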

Name processing

Observation. Object names in the operational code are an important knowledge source.
Frustratingly enough, these names often happen to be meaningless (e.g., REC-001-R08,
1-087), or at least less informative than expected (e.g., INV-QTY, QOH, C-DATA), due
to the use of strict naming conventions. Many applications are multilingual, so that data
names may be expressed in several languages. In addition, multi-programmer development
often induces inconsistent naming conventions.

Requirements. The tool must include sophisticated name analysis and processing func-
tions.
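For illustration only, such a function might split physical names on the usual separators, drop numeric
noise and expand known abbreviations; the abbreviation table below is invented.

    import re

    # Illustrative name cleaning: split on common separators, expand known
    # abbreviations (a made-up table) and propose a more readable candidate
    # name for the analyst to confirm or override.

    ABBREVIATIONS = {"CPY": "COMPANY", "CUST": "CUSTOMER", "QTY": "QUANTITY",
                     "NBR": "NUMBER", "ADDR": "ADDRESS"}

    def clean_name(physical_name: str) -> str:
        parts = re.split(r"[-_.]", physical_name.upper())
        parts = [p for p in parts if p and not p.isdigit()]     # drop numeric suffixes
        parts = [ABBREVIATIONS.get(p, p) for p in parts]
        return "-".join(parts)

    if __name__ == "__main__":
        for n in ("CPY-ID", "CUST_ADDR", "REC-001-R08"):
            print(n, "->", clean_name(n))
        # CPY-ID -> COMPANY-ID
        # CUST_ADDR -> CUSTOMER-ADDRESS
        # REC-001-R08 -> REC-R08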

Links with other CASE processes

Observation. RE is seldom an independent activity. For instance, (1) forward engineering


projects frequently include reverse engineering of some existing components, (2) reverse
engineering shares important processes with forward engineering (e.g., conceptual normal-
ization), (3) reverse engineering is a major activity in broader processes such as migration,
reengineering and data administration.
Requirements. A CARE tool must offer a large set of functions, including those which
pertain to forward engineering.

Openness

Observation. There is (and probably will be) no available tool that can satisfy all corporate
needs in application engineering. In addition, companies usually already make use of one
or, most often, several CASE tools, software development environments, DBMS, 4GL or
DDS.
Requirements. A CARE tool must communicate easily with the other development tools
(e.g., via integration hooks, communications with a common repository or by exchanging
specifications through a common format).

Flexible specification model

Observation. As in any CAD activity, RE applies on incomplete and inconsistent spec-


ifications. However, one of its characteristics makes it intrinsically different ft-om design
processes: at any time, the current specifications may include components from different
abstraction levels. For instance, a schema in process can include record types (physical
objects) as well as entity types (conceptual objects).
Requirements. The specification model must be wide-spectrum, and provide artifacts
for components of different abstraction levels.

Genericity

Observation. Tricks and implementation techniques specific to some data models have
been found to be used in other data models as well (e.g., foreign keys are frequent in IMS
and CODASYL databases). Therefore, many RE reasonings and techniques are common
to the different data models used by current applications.
Requirements. The specification model and the basic techniques offered by the tool must
be DMS-independent, and therefore highly generic.

Multiplicity of views

Observation. The specifications, whatever their abstraction level (e.g., physical, logical
or conceptual), are most often huge and complex, and need to be examined and browsed
through in several ways, according to the nature of the information one tries to obtain.
Requirements. The CARE tool must provide several ways of viewing both source texts
and abstract structures (e.g., schemas). Multiple textual and graphical views, summary and
fine-grained presentations must be available.

Rich transformation toolset

Observation. Actual database schemas may include constructs intended to represent con-
ceptual structures and constraints in non standard ways, and to satisfy non functional re-
quirements (performance, distribution, modularity, access control, etc.). These constructs
are obtained through schema restructuration techniques.
Requirements. The CARE tool must provide a rich set of schema transformation tech-
niques. In particular, this set must include operators which can undo the transformations
commonly used in practical database designs.

Traceability

Observation. A DBRE project includes at least three sets of documents: the operational
descriptions (e.g., DDL texts, source program texts), the logical schema (DMS-compliant)
and the conceptual schema (DMS-independent). The forward and backward mappings
between these specifications must be precisely recorded. The forward mapping specifies
how each conceptual (or logical) construct has been implemented into the operational (or
logical) specifications, while the backward mapping indicates of which conceptual (or
logical) construct each operational (or logical) construct is an implementation.
Requirements. The repository of the CARE tool must record all the links between the
schemas at the different levels of abstraction. More generally, the tool must ensure the
traceability of the RE processes.

4. The DB-MAIN CASE tool

The DB-MAIN database engineering environment is the result of an R&D project initiated in


1993 by the DB research unit of the Institute of Informatics, University of Namur. This tool
is dedicated to database applications engineering, and its scope encompasses, but is much
broader than, reverse engineering alone. In particular, its ultimate objective is to assist
developers in database design (including full control of logical and physical processes),
database reverse engineering, database application reengineering, maintenance, migration
and evolution. Further detail on the whole approach can be found in Hainaut et al. (1994).

As far as DBRE support is concerned, the DB-MAIN CASE tool has been designed to
address as much as possible the requirements developed in the previous section.
As a wide-scope CASE tool, DB-MAIN includes the usual functions needed in database
analysis and design, e.g., entry, browsing, management, validation and transformation of
specifications, as well as code and report generation. However, the rest of this paper, namely
Sections 5 to 11, will concentrate only on the main aspects and components of the tool which
are directly related to DBRE activities. In Section 5 we describe the way schemas and other
specifications are represented in the repository. The tool is based on a general purpose
transformational approach which is described in Section 6. Viewing the specifications from
different angles and in different formats is discussed in Section 7. In Section 8, various
tools dedicated to text and name processing and analysis are described. Section 9 presents
some expert modules, called assistants, which help the analyst in complex processing and
analysis tasks. DB-MAIN is an extensible tool which allows its users to build new functions
through the Voyager-2 tool development language (Section 10). Finally, Section 11 outlines
some aspects of the tool dedicated to methodological customization, control and guidance.
In Section 12, we will reexamine the requirements described in Section 3, and evaluate
to what extent the DB-MAIN tool meets them.

5. The DB-MAIN specification model and repository

The repository collects and maintains all the information related to a project. We will limit
the presentation to the data aspects only. Though they have strong links with the data struc-
tures in DBRE, the specification of the other aspects of the applications, e.g., processing,
will be ignored in this paper. The repository comprises three classes of information:

— a structured collection of schemas and texts used and produced in the project,
— the specification of the methodology followed to conduct the project,
— the history (or trace) of the project.

We will ignore the two latter classes, which are related to methodological control and
which will be evoked briefly in Section 11.
A schema is a description of the data structures to be processed, while a text is any textual
material generated or analyzed during the project (e.g., a program or an SQL script). A
project usually comprises many (i.e., dozens to hundreds of) schemas. The schemas of a
project are linked through specific relationships; they pertain to the methodological control
aspects of the DB-MAIN approach, and will be ignored in this section.
A schema is made up of specification constructs which can be classified into the usual
three abstraction levels. The DB-MAIN specification model includes the following concepts
(Hainaut et al., 1992a).
The conceptual constructs are intended to describe abstract, machine-independent,
semantic structures. They include the notions of entity types (with/without attributes;
with/without identifiers), of super/subtype hierarchies (single/multiple inheritance, total
and disjoint properties), and of relationship types (binary/N-ary; cyclic/acyclic), whose
roles are characterized by min-max cardinalities and optional names; a role can be defined
on one or several entity-types. Attributes can be associated with entity and relationship
types; they can be single-valued or multivalued, atomic or compound. Identifiers (or keys),
made up of attributes and/or roles, can be associated with an entity type, a relationship type
and a multivalued attribute. Various constraints can be defined on these objects: inclusion,
exclusion, coexistence, at-least-one, etc.

[Figure 5 schema: entity types PRODUCT (PROD-ID, NAME, U-PRICE; id: PROD-ID), CUSTOMER (CUST-ID,
NAME, ADDRESS; id: CUST-ID) and ACCOUNT (ACC-NBR, AMOUNT; id: ACC-NBR), with ACCOUNT linked
to CUSTOMER by the relationship type "of"; record type ORDER (ORD-ID, DATE, ORIGIN, DETAIL[1-20]
with components PRO and QTY; id: ORD-ID), where ORIGIN is a single-valued foreign key and DETAIL[*].PRO
a multivalued foreign key; access keys on ORD-ID and ORIGIN; file DSK:MGT-03.]

Figure 5. A typical data structure schema during reverse engineering. This schema includes conceptualized
objects (PRODUCT, CUSTOMER, ACCOUNT, of), logical objects (record type ORDER, with single-valued
and multivalued foreign keys) and physical objects (access keys ORDER.ORD-ID and ORDER.ORIGIN; file
DSK:MGT-03).
The logical constructs are used to describe schemas compliant with DMS models, such
as relational, CODASYL or IMS schemas. They comprise, among others, the concepts
of record types (or table, segment, etc.), fields (or columns), referential constraints, and
redundancy.
The physical constructs describe implementation aspects of the data which are related
to such criteria as the performance of the database. They make it possible to specify files,
access keys (index, calc key, etc.), physical data types, bag and list multivalued attributes,
and other implementation details.
In database engineering, as discussed in Section 2, a schema describes a fragment of the
data structures at a given level of abstraction. In reverse engineering, an in progress schema
may even include constructs at different levels of abstraction. Figure 5 presents a schema
which includes conceptual, logical and physical constructs. Ultimately, this schema will be
completely conceptualized through the interpretation of the logical and physical objects.
Besides these concepts, the repository includes some generic objects which can be cus-
tomized according to specific needs. In addition, annotations can be associated with each
object. These annotations can include semi-formal properties, made of the property name
and its value, which can be interpreted by Voyager-2 functions (see Section 10). These
features provide dynamic extensibility of the repository. For instance, new concepts such
as organizational units, servers, or geographic sites can be represented by specializing the
generic objects, while statistics about entity populations, the gender and plural of the object
names can be represented by semi-formal attributes. The contents of the repository can
be expressed as a pure text file through the ISL language, which provides import-export
facilities between DB-MAIN and its environment.
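To give a feel for these generic objects and semi-formal properties, here is a small hypothetical sketch
(our own simplification, written in Python; the actual DB-MAIN repository and its ISL text form are richer)
of a repository object carrying annotations and (name, value) properties:

    from dataclasses import dataclass, field
    from typing import Any, Dict, List

    # Simplified sketch of a repository object with free annotations and
    # semi-formal (name, value) properties; not the actual DB-MAIN repository.

    @dataclass
    class RepositoryObject:
        kind: str                                         # e.g. "entity type", "record type"
        name: str
        annotations: List[str] = field(default_factory=list)
        properties: Dict[str, Any] = field(default_factory=dict)

        def set_property(self, prop_name: str, value: Any) -> None:
            # Semi-formal properties can later be interpreted by user-written functions.
            self.properties[prop_name] = value

    if __name__ == "__main__":
        company = RepositoryObject("entity type", "COMPANY")
        company.annotations.append("recovered from file CF008")
        company.set_property("population", 12500)         # statistics about the population
        company.set_property("plural", "COMPANIES")
        print(company)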

6. The transformation toolkit

The desirability of the transformational approach to software engineering is now widely


recognized. According to Fickas (1985) for instance, the process of developing a program
[can be] formalized as a set of transformations. This approach has been put forward in
database engineering by an increasing number of authors since several years, either in
research papers, or in text books and, more recently, in several CASE tools (Hainaut et al.,
1992; Rosenthal and Reiner, 1994). Quite naturally, schema transformations have found
their way into DBRE as well (Hainaut et al., 1993a, 1993b). The transformational approach
is the cornerstone of the DB-MAIN approach (Hainaut, 1981, 1991, 1993b, 1994) and CASE
tool (Hainaut et al., 1992, 1994; Joris et al., 1992). A formal presentation of this concept
can be found in Hainaut (1991, 1995, 1996).
Roughly speaking, a schema transformation consists in deriving a target schema S' from
a source schema S by some kind of local or global modification. Adding an attribute to an
entity type, deleting a relationship type, and replacing a relationship type by an equivalent
entity type, are three examples of schema transformations. Producing a database schema
from another schema can be carried out through selected transformations. For instance,
normalizing a schema, optimizing a schema, producing an SQL database or COBOL files,
or reverse engineering standard files and CODASYL databases can be described mostly as
sequences of schema transformations. Some authors propose schema transformations for
selected design activities (Navathe, 1980; Kobayashi, 1986; Kozaczynsky, 1987; Rosenthal
and Reiner, 1988; Batini et al., 1992; Rauh and Stickel, 1995; Halpin and Proper, 1995).
Moreover, some authors claim that the whole database design process, together with other
related activities, can be described as a chain of schema transformations (Batini et al., 1993;
Hainaut et al., 1993b; Rosenthal and Reiner, 1994). Schema transformations are essential
to define formally forward and backward mappings between schemas, and particularly
between conceptual structures and DMS constructs (i.e., traceability).
A special class of transformations is of particular importance, namely the semantics-
preserving transformations, also called reversible since each of them is associated with
another semantics-preserving transformation called its inverse. Such a transformation en-
sures that the source and target schemas have the same semantic descriptive power. In other
words any situation of the application domain that can be modelled by an instance of one
schema can be described by an instance of the other. If we can produce a relational schema
from a conceptual schema by applying reversible transformations only, then both schemas
will be equivalent by construction, and no semantics will be lost in the translation process.
Conversely, if the interpretation of a relational schema, i.e., its conceptualization (Section
2.2), can be performed by using reversible transformations, the resulting conceptual schema
will be semantically equivalent to the source schema. An in-depth discussion of the concept
of specification preservation can be found in Hainaut (1995, 1996).
To illustrate this concept, we will outline informally three of the most popular trans-
formation techniques, called mutations (type changing) used in database design. As a

Figure 6. Transforming a relationship type into an entity type, and conversely.


Figure 7. Relationship-type R is represented by foreign key Bl, and conversely.


Figure 8. Transformation of an attribute into an entity type: (a) by explicit representation of its instances,
(b) by explicit representation of its distinct values.

consequence, their inverse will be used in reverse engineering. To simplify the presenta-
tion, each transformation and its inverse are described in one figure, in which the direct
transformation is read from left to right, and its inverse from right to left.
Figure 6 shows graphically how a relationship type can be replaced by an equivalent
entity type, and conversely. The technique can be extended to N-ary relationship types.
Another widely used transformation replaces a binary relationship type by a foreign key
(figure 7), which can be either multivalued (J > 1) or single-valued (J = 1).
The third standard technique transforms an attribute into an entity type. It comes in two
flavours, namely instance representation (figure 8a), in which each instance of attribute A2
in each A entity is represented by an EA2 entity, and value representation (figure 8b), in
which each distinct value of A2, whatever the number of its instances, is represented by one
EA2 entity.
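To give a flavour of what a mutation and its inverse look like operationally, here is a minimal Python sketch over a toy dictionary-based schema (the representation and function names are assumptions of this sketch, not DB-MAIN's); applying the inverse after the direct transformation restores the source schema, which is the sense in which semantics is preserved:

    def rel_type_to_entity_type(schema, rel_name):
        """Mutation of figure 6: replace relationship type R between A and B by an
        entity type R linked to A and B through two new relationship types."""
        rel = schema["rel_types"].pop(rel_name)
        schema["entity_types"][rel_name] = {}
        for role, ent in rel["roles"].items():
            schema["rel_types"][rel_name + role] = {
                "roles": {"1": rel_name, "2": ent}, "from_mutation": rel_name}

    def entity_type_to_rel_type(schema, ent_name):
        """Inverse mutation: fold the entity type back into a relationship type."""
        links = {n: r for n, r in schema["rel_types"].items()
                 if r.get("from_mutation") == ent_name}
        roles = {n[len(ent_name):]: r["roles"]["2"] for n, r in links.items()}
        for n in links:
            del schema["rel_types"][n]
        del schema["entity_types"][ent_name]
        schema["rel_types"][ent_name] = {"roles": roles}

    schema = {"entity_types": {"A": {}, "B": {}},
              "rel_types": {"R": {"roles": {"A": "A", "B": "B"}}}}
    rel_type_to_entity_type(schema, "R")   # direct: used in database design
    entity_type_to_rel_type(schema, "R")   # inverse: used in reverse engineering
    assert schema["rel_types"]["R"]["roles"] == {"A": "A", "B": "B"}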
DB-MAIN proposes a three-level transformation toolset that can be used freely, according
to the skill of the user and the complexity of the problem to be solved. These tools

Figure 9. The dialog box of the Split/Merge transformation through which the analyst can either extract some
components from the master entity type (left), or merge two entity types, or migrate components from an entity
type to another.

are neutral and generic, in that they can be used in any database engineering process.
As far as DBRE is concerned, they are mainly used in Data Structure Conceptualization
(Section 2.2).

• Elementary transformations. The selected transformation is applied to the selected
object:

apply transformation T to current object O

With these tools, the user keeps full control of the schema transformation. Indeed, similar
situations can often be solved by different transformations; e.g., a multivalued attribute
can be transformed in a dozen ways. Figure 9 illustrates the dialog box for the Split/Merge
of an entity type. The current version of DB-MAIN proposes a toolset of about 25 ele-
mentary transformations.
• Global transformations. A selected elementary transformation is applied to all the
objects of a schema which satisfy a specified precondition:

apply transformation T to the objects that satisfy condition P

DB-MAIN offers some predefined global transformations, such as: replace all one-to-
many relationship types by foreign keys or replace all multivalued attributes by entity
types. However, the analyst can define his/her own toolset through the Transformation As-
sistant described in Section 9 (a small sketch of the global and model-driven modes is given after this list).

• Model-driven transformations. All the constructs of a schema that violate a given
model M are transformed in such a way that the resulting schema complies with M:
apply the transformation plan which makes the
current schema satisfy model M
Such an operator is defined by a transformation plan, which is a sort of algorithm compris-
ing global transformations, which is proved (or assumed) to make any schema comply
with M. A model-driven transformation implements formal techniques or heuristics
applicable in such major engineering processes as normalization, model translation or
untranslation, and conceptualization. Here too, DB-MAIN offers a dozen predefined
model-based transformations such as relational, CODASYL, and COBOL translation,
untranslation from these models and conceptual normalization. The analyst can define
his/her own transformation plans, either through the scripting facilities of the Transforma-
tion Assistant, or, for more complex problems, through the development of Voyager-2
functions (Section 10).
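As announced above, the following Python sketch illustrates the global and model-driven modes on a toy schema; the schema representation, the predicate and the transformation names are assumptions of the sketch and do not reflect DB-MAIN's actual repository or API.

    def global_transformation(schema, precondition, transformation):
        """Apply an elementary transformation T to all objects that satisfy P."""
        for name in [n for n, obj in schema.items() if precondition(obj)]:
            transformation(schema, name)

    def model_driven_transformation(schema, plan):
        """A transformation plan: a sequence of (precondition, transformation)
        pairs assumed to make the schema comply with a target model M."""
        for precondition, transformation in plan:
            global_transformation(schema, precondition, transformation)

    # Example plan step: replace every multivalued attribute by an entity type.
    def has_multivalued_attribute(obj):
        return any(a["max_card"] > 1 for a in obj.get("attributes", []))

    def extract_multivalued_attributes(schema, name):
        for a in [a for a in schema[name]["attributes"] if a["max_card"] > 1]:
            schema[name]["attributes"].remove(a)
            schema[name + "_" + a["name"]] = {"attributes": [dict(a, max_card=1)]}

    schema = {"ORDER": {"attributes": [{"name": "DETAIL", "max_card": 20},
                                       {"name": "ORD-ID", "max_card": 1}]}}
    model_driven_transformation(
        schema, [(has_multivalued_attribute, extract_multivalued_attributes)])
    assert "ORDER_DETAIL" in schema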

A more detailed discussion of these three transformation modes can be found in Hainaut
et al. (1992) and Hainaut (1995).

7. The user interface

User interaction relies on a fairly standard GUI. However, DB-MAIN offers some original
options which deserve to be discussed.
Browsing through, processing, and analyzing large schemas require an adequate pre-
sentation of the specifications. It quickly appears that more than one way of viewing them
is necessary. For instance, a graphical representation of a schema allows an easy detection
of certain structural patterns, but is useless when analyzing name structures to detect sim-
ilarities as is common in the DATA STRUCTURE EXTRACTION process (Section 2.1).
DB-MAIN currently offers six ways of presenting a schema. Four of these views use a
hypertext technique: compact (sorted list of entity type, relationship type and file names),
standard (same + attributes, roles and constraints), extended (same + domains, annotations,
ET-RT cross-reference) and sorted (sorted list of all the object names). Two views are
graphical: full and compact (no attributes and no constraints). All of them are illustrated
in figure 10.
Switching from one view to another is immediate, and any object selected in a view
is still current when another view is chosen. Any relevant operator can be applied to an
object, whatever the view through which it is presented. In addition, the text-based views
make it possible to navigate from entity types to relationship types and conversely through
hypertext links.

8. Text analysis and processing

Analyzing and processing various kinds of texts are basic activities in two specific processes,
namely DMS-DDL text ANALYSIS and PROGRAM ANALYSIS.

Figure 10. Six different views of the same schema.

The first process is rather simple, and can be carried out by automatic extractors which
analyze the data structure declaration statements of programs and build the corresponding
abstract objects in the repository. DB-MAIN currently offers built-in standard parsers for
COBOL, SQL, CODASYL, IMS, and RPG, but other parsers can be developed in Voyager-2
(Section 10).
To address the requirements of the second process, through which the preliminary speci-
fications are refined from evidence found in programs or in other textual sources, DB-MAIN
includes a collection of program analysis tools comprising, at the present time, an interac-
tive pattern-matching engine, a dataflow diagram inspector and a program slicer. The main
objective of these tools is to contribute to program understanding as far as data manipulation
is concerned.
The pattern-matching function allows searching text files for definite patterns or cliches
expressed in PDL, a Pattern Definition Language. As an illustration, we will describe one
of the most popular heuristics to detect an implicit foreign key in a relational schema. It
consists in searching the application programs for some forms of SQL queries which evoke
the presence of an undeclared foreign key (Signore et al., 1994; Andersson, 1994; Petit et al.,
1994). The principle is simple: most multi-table queries use primary/foreign key joins. For
instance, considering that column CNUM has been recognized as a candidate key of table
CUSTOMER, the following query suggests that column CUST in table ORDER may be a
foreign key to CUSTOMER:

select CUSTOMER.CNUM, CNAME, DATE
from ORDER, CUSTOMER
where ORDER.CUST = CUSTOMER.CNUM

The SQL generic patterns

    T1 ::= table-name
    T2 ::= table-name
    C1 ::= column-name
    C2 ::= column-name
    join-qualif ::=
        begin-SQL
        select select-list
        from ! {@T1 ! @T2 | @T2 ! @T1}
        where ! @T1"."@C1 _ "=" _ @T2"."@C2 !
        end-SQL

The COBOL/DB2 specific patterns

    AN-name ::= [a-zA-Z][-a-zA-Z0-9]*
    table-name ::= AN-name
    column-name ::= AN-name
    _ ::= ({"/n"|"/t"|" "})+
    - ::= ({"/n"|"/t"|" "})*
    begin-SQL ::= {"exec"|"EXEC"}_{"sql"|"SQL"}
    end-SQL ::= _{"end"|"END"}{"-exec"|"-EXEC"}_"."
    select ::= {"select"|"SELECT"}
    from ::= {"from"|"FROM"}
    where ::= {"where"|"WHERE"}
    select-list ::= any-but(from)
    ! ::= any-but({where|end-SQL})

Figure 11. Generic and specific patterns for foreign key detection in SQL queries. In the specific patterns, "-"
designates any non-empty separator, "_" any separator, and "AN-name" any alphanumeric string beginning with
a letter. The "any-but(E)" function identifies any string not including expression E. Symbols "+", "*", "/t", "/n",
"|" and "a-z" have their usual grep or BNF meaning.

More generally, any SQL expression that looks like:

select ...
from ... T1, ... T2 ...
where ... T1.C1 = T2.C2 ...

may suggest that C1 is a foreign key to table T2 or C2 a foreign key to T1. Of course, this
evidence would be even stronger if we could prove that C2 (resp. C1) is a key of its table.
This is just what figure 11 translates more formally in PDL.
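As an illustration of this heuristic (not of PDL itself, whose engine is built into DB-MAIN), the following Python sketch scans embedded SQL for equi-join predicates and reports foreign key hypotheses; the regular expressions are deliberately simplified assumptions and would miss many real query forms.

    import re

    # Simplified stand-in for the patterns of figure 11: find "T1.C1 = T2.C2"
    # join qualifiers inside EXEC SQL ... END-EXEC blocks of a COBOL source.
    SQL_BLOCK = re.compile(r"EXEC\s+SQL(.*?)END-EXEC", re.IGNORECASE | re.DOTALL)
    JOIN_PRED = re.compile(r"(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)", re.IGNORECASE)

    def candidate_foreign_keys(cobol_source, known_keys):
        """Yield (fk_table, fk_column, pk_table, pk_column) hypotheses.
        known_keys maps a table name to its declared candidate key column."""
        for block in SQL_BLOCK.findall(cobol_source):
            for t1, c1, t2, c2 in JOIN_PRED.findall(block):
                if known_keys.get(t2.upper()) == c2.upper():
                    yield (t1.upper(), c1.upper(), t2.upper(), c2.upper())
                if known_keys.get(t1.upper()) == c1.upper():
                    yield (t2.upper(), c2.upper(), t1.upper(), c1.upper())

    source = """
        EXEC SQL
          SELECT CUSTOMER.CNUM, CNAME, DATE
          FROM ORDER, CUSTOMER
          WHERE ORDER.CUST = CUSTOMER.CNUM
        END-EXEC.
    """
    print(list(candidate_foreign_keys(source, {"CUSTOMER": "CNUM"})))
    # -> [('ORDER', 'CUST', 'CUSTOMER', 'CNUM')]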
This example illustrates two essential features of PDL and of its engine.

1. A set of patterns can be split into two parts (stored in different files). When a generic
pattern file is opened, the unresolved patterns are to be found in the specified specific
pattern file. In this example, the generic patterns define the skeleton of an SQL query,
which is valid for any RDBMS and any host language, while the specific patterns com-
plement this skeleton by defining the COBOL/DB2 API conventions. Replacing the
latter will allow processing, e.g., C/ORACLE programs.
2. A pattern can include variables, the name of which is prefixed with @. When such a
pattern is instantiated in a text, the variables are given a value which can be used, e.g.,
to automatically update the repository.

The pattern engine can analyze external source files, as well as textual descriptions
stored in the repository (where, for instance, the extractors store the statements they do not
understand, such as comments, SQL trigger and check). These texts can be searched
for visual inspection only, but pattern instantiation can also trigger DB-MAIN actions.
For instance, if a procedure such as that presented in figure 11 (creation of a referential
constraint between column C2 and table T1) is associated with this pattern, this procedure
can be executed automatically (under the analyst's control) for each instantiation of the
pattern. In this way, the analyst can build a powerful custom tool which detects foreign
keys in queries and which adds them to the schema automatically.
The dataflow inspector builds a graph whose nodes are the variables of the program to
be analyzed, and the edges are relationships between these variables. These relationships
are defined by selected PDL patterns. For instance, the following COBOL rules can be
used to build a graph in which two nodes are linked if their corresponding variables appear
simultaneously in a simple assignment statement, in a redefinition declaration, in an indirect
write statement or in comparisons:

var_1 ::= cob_var;
var_2 ::= cob_var;
move ::= "MOVE" - @var_1 - "TO" - @var_2;
redefines ::= @var_1 - "REDEFINES" - @var_2;
write ::= "WRITE" - @var_1 - "FROM" - @var_2;
if ::= "IF" - @var_1 - rel_op - @var_2;
if_not ::= "IF" - @var_1 - "NOT" - rel_op - @var_2;

This tool can be used to solve structure hiding problems such as the decomposition of
anonymous fields and procedurally controlled foreign keys, as illustrated in figure 2.
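A rough Python equivalent of such a dataflow inspector, with hypothetical simplified regular expressions standing in for the PDL rules above, could look as follows:

    import re
    from collections import defaultdict

    # Each rule links the two variables appearing together in a MOVE, REDEFINES,
    # WRITE ... FROM or IF statement (simplified stand-ins for the PDL rules).
    RULES = [
        re.compile(r"MOVE\s+([\w-]+)\s+TO\s+([\w-]+)", re.IGNORECASE),
        re.compile(r"([\w-]+)\s+REDEFINES\s+([\w-]+)", re.IGNORECASE),
        re.compile(r"WRITE\s+([\w-]+)\s+FROM\s+([\w-]+)", re.IGNORECASE),
        re.compile(r"IF\s+([\w-]+)\s*(?:NOT\s+)?[=<>]+\s*([\w-]+)", re.IGNORECASE),
    ]

    def dataflow_graph(cobol_source):
        """Nodes are program variables; an edge links two variables that appear
        together in one of the statement forms above."""
        graph = defaultdict(set)
        for line in cobol_source.splitlines():
            for rule in RULES:
                for a, b in rule.findall(line):
                    graph[a].add(b)
                    graph[b].add(a)
        return graph

    graph = dataflow_graph("""
        MOVE CUST-CODE TO ORD-CUST.
        WRITE ORD-RECORD FROM ORD-BUFFER.
    """)
    # CUST-CODE and ORD-CUST now belong to the same connected component, a hint
    # that ORD-CUST may carry customer identifiers (e.g., a hidden foreign key).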
The first experiments have quickly taught us that pattern-matching and dataflow inspec-
tion work fine for small programs and for locally concentrated patterns, but can prove
difficult to use for large programs. For instance, a pattern made of a dozen statements
can span several thousand lines of code. With this problem in mind, we have developed
a variant of a program slicer (Weiser, 1984), which, given a program P, generates a new
program P' defined as follows. Let us consider a point S in P (generally a statement) and
an object O (generally a variable) of P. The program slice of P for O at S is the smallest
subset P' of P whose execution will give O the same state at S as would the execution of P
in the same environment. Generally P' is a very small fragment of P, and can be inspected
much more efficiently and reliably, both visually and with the help of the analysis tools
described above, than its source program P. One application in which this program slicer
has proved particularly valuable is the analysis of the statements contributing to the state
of a record when it is written in its file.
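The idea can be illustrated with a tiny Python sketch restricted to straight-line assignments (real slicers, including the DB-MAIN one, must of course handle control flow and COBOL semantics); the program representation below is an assumption of the sketch.

    def backward_slice(program, variable):
        """program: list of (target, used_variables) assignment statements.
        Return the smallest subsequence whose execution gives `variable`
        the same final state as the full program (straight-line code only)."""
        relevant = {variable}
        slice_indices = []
        for i in range(len(program) - 1, -1, -1):     # walk backwards from the end
            target, used = program[i]
            if target in relevant:
                slice_indices.append(i)
                relevant.discard(target)
                relevant.update(used)                 # the inputs become relevant
        return [program[i] for i in reversed(slice_indices)]

    # ORD-AMOUNT depends on QTY and PRICE but not on CUST-NAME.
    program = [
        ("CUST-NAME",  ["W-NAME"]),
        ("W-TOTAL",    ["QTY", "PRICE"]),
        ("ORD-AMOUNT", ["W-TOTAL"]),
    ]
    print(backward_slice(program, "ORD-AMOUNT"))
    # -> [('W-TOTAL', ['QTY', 'PRICE']), ('ORD-AMOUNT', ['W-TOTAL'])]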
DB-MAIN also includes a name processor which can transform selected names in a
schema, or in selected objects of a schema, according to substitution patterns. Here are
some examples of such patterns:

"^C-" -> "CUST-" replaces all prefixes "C-" with the prefix "CUST-";
" DATE" -> " TIME" replaces each substring " DATE", whatever its position,
with the substring "TIME";
" ^CODE$" -> "REFERENCE" replaces all the names "CODE" with the new name
"REFERENCE".

In addition, it proposes case transformations: lower-to-upper, upper-to-lower, capitalize
and remove accents. These parameters can be saved as a name processing script, and
reused later.
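A minimal Python sketch of such a substitution-based name processor follows; the pattern syntax mimics the examples above but is not DB-MAIN's own.

    import re

    # Each substitution pattern is a (regex, replacement) pair applied in turn.
    NAME_PATTERNS = [
        (r"^C-", "CUST-"),         # prefix substitution
        (r"DATE", "TIME"),         # substring substitution, any position
        (r"^CODE$", "REFERENCE"),  # whole-name substitution
    ]

    def process_names(names, patterns=NAME_PATTERNS, upper=True):
        result = []
        for name in names:
            if upper:                       # case transformation (lower-to-upper)
                name = name.upper()
            for pattern, replacement in patterns:
                name = re.sub(pattern, replacement, name)
            result.append(name)
        return result

    print(process_names(["C-Code", "Birth-Date", "code"]))
    # -> ['CUST-CODE', 'BIRTH-TIME', 'REFERENCE']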

Figure 12. Control panel of the Transformation assistant. The left-side area is the problem solver, which presents
a catalog of problems (1st column) and suggested solutions (2nd column). The right-side area is the script manager.
The worksheet shows a simplified script for conceptualizing relational databases.

9. The assistants

An assistant is a higher-level solver dedicated to coping with a special kind of problem, or
performing specific activities efficiently. It gives access to the basic toolboxes of DB-MAIN,
but in a controlled and intelligent way.
The current version of DB-MAIN includes three general purpose assistants which can sup-
port, among others, the DBRE activities, namely the Transformation assistant, the Schema
Analysis assistant and the Text Analysis assistant. These processors offer a collection of
built-in functions that can be enriched by user-defined functions developed in Voyager-2
(Section 10).
The Transformation Assistant (figure 12) allows applying one or several transformations
to selected objects. Each operation appears as a problem/solution couple, in which the
problem is defined by a pre-condition (e.g., the objects are the many-to-many relationship
types of the current schema), and the solution is an action resulting in eliminating the
problem (e.g., transform them into entity types). Several dozen problem/solution items are
proposed. The analyst can select one of them, and execute it automatically or in a controlled
way. Alternatively, (s)he can build a script comprising a list of operations, execute it,
save and load it.
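Operationally, such a script can be pictured as an ordered list of problem/solution couples, each made of a precondition and an action; the following Python sketch (the names and the confirm() hook are assumptions, not DB-MAIN API) shows the automatic versus controlled execution of a script:

    def run_script(schema, script, controlled=False, confirm=lambda msg: True):
        """script: ordered list of (precondition, action) couples.
        In controlled mode, each application is submitted to the analyst."""
        log = []
        for precondition, action in script:
            for name in [n for n, obj in schema.items() if precondition(obj)]:
                if controlled and not confirm(f"apply {action.__name__} to {name}?"):
                    continue
                action(schema, name)
                log.append((action.__name__, name))
        return log   # the trace can be saved, documented or replayed later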
Predefined scripts are available to transform any schema according to popular models
(e.g., Bachman model, binary model, relational, CODASYL, standard files), or to perform
standard engineering processes (e.g., conceptualization of relational and COBOL schemas,
normalization). Customized operations can be added via Voyager-2 functions (Section
10). Figure 12 shows the control panel of this tool. A second generation of the Trans-
formation assistant is under development. It provides a more flexible approach to build
complex transformation plans thanks to a catalog of more than 200 preconditions, a library
of about 50 actions and more powerful scripting control structures including loops and
if-then-else patterns.
The Schema Analysis assistant is dedicated to the structural analysis of schemas. It
uses the concept of submodel, defined as a restriction of the generic specification model
described in Section 5 (Hainaut et al., 1992). This restriction is expressed by a boolean
expression of elementary predicates stating which specification patterns are valid, and which
ones are forbidden. An elementary predicate can specify situations such as the following:
"entity types must have from 1 to 100 attributes", "relationship types have from 2 to 2 roles",
"entity type names are less than 18-character long", "names do not include spaces", "no
name belongs to a given list of reserved words", "entity types have from 0 to 1 supertype",
"the schema is hierarchical", "there are no access keys". A submodel appears as a script
which can be saved and loaded. Predefined submodels are available: Normalized ER, Binary
ER, NIAM, Functional ER, Bachman, Relational, CODASYL, etc. Customized predicates
can be added via Voyager-2 functions (Section 10). The Schema Analysis assistant offers
two functions, namely Check and Search. Checking a schema consists in detecting all the
constructs which violate the selected submodel, while the Search function detects all the
constructs which comply with the selected submodel.
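A submodel and the Check/Search functions can be pictured with the following Python sketch; the two predicates paraphrase examples quoted above, and the object representation is a toy one, not the DB-MAIN repository.

    def name_at_most_18_chars(obj):
        return len(obj["name"]) <= 18

    def name_without_spaces(obj):
        return " " not in obj["name"]

    SUBMODEL = [name_at_most_18_chars, name_without_spaces]  # conjunction of predicates

    def check(objects, submodel):
        """Objects violating at least one predicate of the submodel."""
        return [o for o in objects if not all(p(o) for p in submodel)]

    def search(objects, submodel):
        """Objects complying with every predicate of the submodel."""
        return [o for o in objects if all(p(o) for p in submodel)]

    objects = [{"name": "CUSTOMER"}, {"name": "ORDER DETAIL LINE HISTORY"}]
    assert check(objects, SUBMODEL) == [{"name": "ORDER DETAIL LINE HISTORY"}]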
The Text Analysis assistant presents in an integrated package all the tools dedicated to text
analysis. In addition it manages the active links between the source texts and the abstract
objects in the repository.

10. Functional extensibility

DB-MAIN provides a set of built-in standard functions that should be sufficient to satisfy
most basic needs in database engineering. However, no CASE tool can meet the require-
ments of all users in any possible situation, and specialized operators may be needed to deal
with unforeseen or marginal situations. There are two important domains in which users re-
quire customized extensions, namely additional internal functions and interfaces with other
tools. For instance, analyzing and generating texts in any language and according to any
dialect, or importing and exchanging specifications with any CASE tool or Data Dictionary
Systems are practically impossible, even with highly parametric import/export processors.
To cope with such problems, DB-MAIN provides the Voyager-2 tool development environ-
ment allowing analysts to build their own functions, whatever their complexity. Voyager-2
offers a powerful language in which specific processors can be developed and integrated
into DB-MAIN. Basically, Voyager-2 is a procedural language which proposes primitives to
access and modify the repository through predicative or navigational queries, and to invoke
all the basic functions of DB-MAIN. It provides a powerful list manager as well as functions
to parse and generate complex text files. A user's tool developed in Voyager-2 is a program
comprising possibly recursive procedures and functions. Once compiled, it can be invoked
by DB-MAIN just like any basic function.
Figure 13 presents a small but powerful Voyager-2 function which validates and creates
a referential constraint with the arguments extracted from a COBOL/SQL program by the
pattern defined in figure 11. When such a pattern instantiates, the pattern-matching engine
passes the values of the four variables T1, T2, C1 and C2 to the MakeForeignKey function.

function integer MakeForeignKey(string : T1,T2,C1,C2);

explain(*
  title="Create a foreign key from an SQL join";
  help="if C1 is a unique key of table T1 and if C2 is a column of T2, and if C1 and
        C2 are compatible, then define C2 as a foreign key of T2 to T1, and return true,
        else return false"
*);
/* define the variables; any repository object type can be a domain */
schema : S;
entity_type : E;
attribute : A,IK,FK;
list : ID-LIST,FK-LIST;

S := GetCurrentSchema();   /* S is the current schema */

/* ID-LIST = list of the attributes A such that: (1) A belongs to an entity type E which is in schema S and whose name is T1,
   (2) the name of A is C1, (3) A is an identifier of E (the ID property of A is true) */
ID-LIST := attribute[A]{of:entity_type[E]{in:[S] and E.NAME = T1}
                        and A.NAME = C1
                        and A.ID = true};
/* FK-LIST = list of the attributes A such that: (1) A belongs to an entity type E which is in S and whose name is T2,
   (2) the name of A is C2 */
FK-LIST := attribute[A]{of:entity_type[E]{in:[S] and E.NAME = T2}
                        and A.NAME = C2};
/* if both lists are not empty, then
   if the attributes are compatible, then define the attribute in FK-LIST as a foreign key referencing the attribute in ID-LIST */
if not(empty(ID-LIST) or empty(FK-LIST)) then
  {IK := GetFirst(ID-LIST); FK := GetFirst(FK-LIST);
   if IK.TYPE = FK.TYPE and IK.LENGTH = FK.LENGTH then
     {connect(reference,IK,FK); return true;}
   else {return false;};}
else {return false;};

Figure 13. A (strongly simplified) excerpt of the repository and a Voyager-2 function which uses it. The repository
expresses the fact that schemas have entity types, which in turn have attributes. Some attributes can be identifiers
(boolean ID) or can reference (foreign key) another attribute (candidate key). The input arguments of the procedure
are four names T1, T2, C1, C2 such as those resulting from an instantiation of the pattern of figure 11. The function
first evaluates the possibility of attribute (i.e., column) C2 of entity type (i.e., table) T2 being a foreign key to
entity type T1 with identifier (candidate key) C1. If the evaluation is positive, the referential constraint is created.
The explain section illustrates the self-documenting facility of Voyager-2 programs; it defines the answers the
compiled version of this function will provide when queried by the DB-MAIN tool.
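For readers not familiar with Voyager-2, the same check can be paraphrased in Python over a toy in-memory schema; the attribute dictionaries and field names below are assumptions of this sketch, not the DB-MAIN repository structure.

    def make_foreign_key(schema, t1, t2, c1, c2):
        """If c1 is an identifier of entity type t1, and c2 a column of t2 with a
        compatible type and length, record c2 as a foreign key referencing t1.c1."""
        id_attr = next((a for a in schema.get(t1, [])
                        if a["name"] == c1 and a.get("id")), None)
        fk_attr = next((a for a in schema.get(t2, []) if a["name"] == c2), None)
        if not id_attr or not fk_attr:
            return False
        if (id_attr["type"], id_attr["length"]) != (fk_attr["type"], fk_attr["length"]):
            return False
        fk_attr["references"] = (t1, c1)      # create the referential constraint
        return True

    schema = {
        "CUSTOMER": [{"name": "CNUM", "type": "char", "length": 12, "id": True}],
        "ORDER":    [{"name": "CUST", "type": "char", "length": 12}],
    }
    assert make_foreign_key(schema, "CUSTOMER", "ORDER", "CNUM", "CUST")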

11. Methodological control and design recovery

Though this paper presents it as a CARE tool only, the DB-MAIN environment has a wider
scope, i.e., data-centered applications engineering. In particular, it is intended to address the com-
plex and critical problem of application evolution. In this context, understanding how the
engineering processes have been carried out when legacy systems were developed, and
guiding today's analysts in conducting application development, maintenance and reengi-
neering, are major functions that should be offered by the tool. This research domain,
known as design (or software) process modeling, is still under full development, and few
results have been made available to practitioners so far. The reverse engineering process is
strongly coupled with these aspects in three ways.
First, reverse engineering is an engineering activity of its own (Section 2), and therefore
is subject to rules, techniques and methods, in the same way as forward engineering; it
therefore deserves to be supported by methodological control functions of the CARE tool.
Secondly, DBRE is a complex process, based on trial-and-error behaviours. Exploring
several solutions, comparing them, deriving new solutions from earlier dead-end ones,
are common practices. Recording the history of a RE project, analyzing it, completing it
with new processes, and replaying some of its parts, are typical design process modeling
objectives.
Thirdly, while the primary aim of reverse engineering is (in short) to recover technical and
functional specifications from the operational code of an existing application, a secondary
objective is progressively emerging, namely to recover the design of the application, i.e.,
the way the application has (or could have) been developed. This design includes not
only the specifications, but also the reasonings, the transformations, the hypotheses and the
decisions the development process consists of.
Briefly stated, DB-MAIN proposes a design process model comprising concepts such as
design product, design process, process strategy, decision, hypothesis and rationale. This
model derives from proposals such as those of Potts and Bruns (1988) and Rolland (1993),
extended to all database engineering activities. This model describes quite adequately not
only standard design methodologies, such as the Conceptual-Logical-Physical approaches
(Teorey, 1994; Batini et al., 1992) but also any kind of heuristic design behaviour, includ-
ing those that occur in reverse engineering. We will shortly describe the elements of this
design process model.

Product and product instance. A product instance is any outstanding specification object
that can be identified in the course of a specific design. A conceptual schema, an SQL DDL
text, a COBOL program, an entity type, a table, a collection of user's views, an evaluation
report, can all be considered product instances. Similar product instances are classified into
products, such as Normalized conceptual schema, DMS-compliant optimized
schema or DMS-DDL schema (see figure 3).

Process and process instance. A process instance is any logical unit of activity which
transforms a product instance into another product instance. Normalizing schema SI into
schema S2 is a process instance. Similar process instances are classified into processes,
such as CONCEPTUAL NORMALIZATION in figure 3.

Process strategy. The strategy of a process is the specification of how its goal can be
achieved, i.e., how each instance of the process must be carried out. A strategy may be
deterministic, in which case it reduces to an algorithm (and can often be implemented as a
primitive), or it may be non-deterministic, in which case the exact way in which each of its
instances will be carried out is up to the designer. The strategy of a design process is defined
by a script that specifies, among others, what lower-level processes must/can be triggered,
in what order, and under what conditions. The control structures in a script include action
selection (at most one, one only, at least one, all in any order, all in this order, at least one
any number of times, etc.), alternate actions, iteration, parallel actions, weak condition
(should be satisfied), strong condition (must be satisfied), etc.

Decision, hypothesis and rationale. In many cases, the analyst/developer will carry out
an instance of a process with some hypothesis in mind. This hypothesis is an essential
characteristic of this process instance since it implies the way in which its strategy will be
performed. When the engineer needs to try another hypothesis, (s)he can perform another
instance of the same process, generating a new instance of the same product. After a while
(s)he is facing a collection of instances of this product, from which (s)he wants to choose the
best one (according to the requirements that have to be satisfied). A justification of the deci-
sion must be provided. Hypothesis and decision justification comprise the design rationale.

History. The history of a process instance is the recorded trace of the way in which its
strategy has been carried out, together with the product instances involved and the rationale
that has been formulated. Since a project is an instance of the highest level process, its
history collects all the design activities, all the product instances and all the rationales that
have appeared, and will appear, in the life of the project. The history of a product instance P
(also called its design) is the set of all the process instances, product instances and rationales
which contributed to P. For instance, the design of a database collects all the information
needed to describe and explain how the database came to be what it is.
A specific methodology is described in MDL, the DB-MAIN Methodology Description
Language. The description includes the specification of the products and of the processes
the methodology is made up of, as well as of the relationships between them. A product
is of a certain type, described as a specialization of a generic specification object from the
DB-MAIN model (Section 5), and more precisely as a submodel generated by the Schema
analysis assistant (Section 9). For instance, a product called Raw-conceptual-schema
(figure 3) can be declared as a BINARY-ER-SCHEMA. The latter is a product type that can
be defined by a SCHEMA satisfying the following predicate, stating that relationship types
are binary, and have no attributes, and that the attributes are atomic and single-valued:
(all rel-types have from 2 to 2 roles)
and (all rel-types have from 0 to 0 attributes)
and (all attributes have from 0 to 0 components)
and (all attributes have a max cardinality from 1 to 1);

A process is defined mainly by the input product type(s), the internal product type, the
output product type(s) and by its strategy.
The DB-MAIN CASE tool is controlled by a methodology engine which is able to in-
terpret such a method description once it has been stored in the repository by the MDL
compiler. In this way, the tool is customized according to this specific methodology. When
developing an application, the analyst carries out process instances according to chosen
hypotheses, and builds product instances. (S)he makes decisions which (s)he can justify.
All the product instances, process instances, hypotheses, decisions and justifications, re-
lated to the engineering of an application make up the trace, or history of this application
development. This history is also recorded in the repository. It can be examined, replayed,
synthesized, and processed (e.g., for design recovery).
One of the most promising applications of histories is database design recovery. Con-
structing a possible design history for an existing, generally undocumented database is a
complex problem which we propose to tackle in the following way. Reverse engineering the
database generates a DBRE history. This history can be cleaned by removing unnecessary
actions. Reversing each of the actions of this history, then reversing their order, yields a
tentative, unstructured, design history. By normalizing the latter, and by structuring it ac-
cording to a reference methodology, we can obtain a possible design history of the database.
Replaying this history against the recovered conceptual schema should produce a physical
schema which is equivalent to the current database.
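A deliberately trivial Python sketch of the reversal step follows (cleaning, normalization and restructuring against a reference methodology are ignored here, and the action names are hypothetical):

    def tentative_design_history(dbre_history, inverse_of, is_relevant=lambda a: True):
        """dbre_history: ordered list of reverse engineering actions.
        inverse_of: maps each action to its inverse transformation.
        Returns a possible (unstructured) forward design history."""
        cleaned = [a for a in dbre_history if is_relevant(a)]   # drop unnecessary actions
        return [inverse_of[a] for a in reversed(cleaned)]       # invert, then reverse order

    inverse_of = {
        "entity_type_to_rel_type(R)": "rel_type_to_entity_type(R)",
        "attribute_to_entity_type(A2)": "entity_type_to_attribute(EA2)",
    }
    dbre = ["attribute_to_entity_type(A2)", "entity_type_to_rel_type(R)"]
    print(tentative_design_history(dbre, inverse_of))
    # -> ['rel_type_to_entity_type(R)', 'entity_type_to_attribute(EA2)']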
A more comprehensive description of how these problems are addressed in the DB-MAIN
approach and CASE tool can be found in Hainaut et al. (1994), while the design recovery
approach is described in Hainaut et al. (1996).

12. DBRE requirements and the DB-MAIN CASE tool

We will examine the requirements described in Section 3 to evaluate how the DB-MAIN
CASE tool can help satisfy them.

Flexibility. Instead of being constrained by rigid methodological frameworks, the analyst
is provided with a collection of neutral toolsets that can be used to process any schema
whatever its level of abstraction and its degree of completion. In particular, backtracking and
multi-hypothesis exploration are easily performed. However, by customizing the method
engine, the analyst can build a specialized CASE tool that enforces strict methodologies,
such as that which has been described in Section 2.

Extensibility. Through the Voyager-2 language, the analyst can quickly develop specific
functions; in addition, the assistants and the name and text analysis processors allow the
analyst to develop customized scripts.

Sources multiplicity. The most common information sources have a text format, and can
be queried and analyzed through the text analysis assistant. Other sources can be processed
through specific Voyager-2 functions. For example, data analysis is most often performed
by small ad hoc queries or application programs, which validate specific hypotheses about,
e.g., a possible identifier or foreign key. Such queries and programs can be generated by
Voyager-2 programs that implement heuristics about the discovery of such concepts. In
addition, external information processors and analyzers can easily introduce specifications
through the text-based import-export ISL language. For example, a simple SQL program can
extract SQL specifications from DBMS data dictionaries, and generate their ISL expression,
which can then be imported into the repository.
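For instance, a generator of such data-analysis queries could be sketched as follows (illustrative Python, not a Voyager-2 program; a zero count only supports the foreign key hypothesis for the current data):

    def fk_validation_query(fk_table, fk_column, pk_table, pk_column):
        """SQL counting the rows whose fk_column value matches no pk_column value;
        a zero count supports the foreign key hypothesis (on the current data only)."""
        return (
            f"select count(*) from {fk_table}\n"
            f"where {fk_column} is not null\n"
            f"  and {fk_column} not in (select {pk_column} from {pk_table});"
        )

    print(fk_validation_query("ORDER", "CUST", "CUSTOMER", "CNUM"))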

Text analysis. The DB-MAIN tool offers both general purpose and specific text analyzers
and processors. If needed, other processors can be developed in Voyager-2. Finally, external
analyzers and text processors can be used provided they can generate ISL specifications
which can then be imported in DB-MAIN to update the repository.

Name processing. Besides the name processor, specific Voyager-2 functions can be de-
veloped to cope with more specific name patterns or heuristics. Finally, the compact and
sorted views can be used as powerful browsing tools to examine name patterns or to detect
similarities.

Links with other CASE processes. DB-MAIN is not dedicated to DBRE only; therefore
it includes in a seamless way supporting functions for the other DB engineering processes,
such as forward engineering. Being neutral, many functions are common to all the engi-
neering processes.

Openness. DB-MAIN supports exchanges with other CASE tools in two ways. First,
Voyager-2 programs can be developed (1) to generate specifications in the input language
of the other tools, and (2) to load into the repository the specifications produced by these
tools. Secondly, ISL specifications can be used as a neutral intermediate language to
communicate with other processors.

Flexible specification model. The DB-MAIN repository can accommodate specifications
of any abstraction level, based on various paradigms; if required, DB-MAIN can
be fairly tolerant to incomplete and inconsistent specifications and can represent schemas
which include objects of different levels and of different paradigms (see figure 5); at the end
of a complex process the analyst can request, through the Schema Analysis assistant, a precise
analysis of the schema to sort out all the structural flaws.

Genericity. Both the repository schema and the functions of the tool are independent
of the DMS and of the programming languages used in the application to be analyzed.
They can be used to model and to process specifications initially expressed in various
technologies. DB-MAIN includes several ways to specialize the generic features in order
to make them compliant with a specific context, such as processing PL/1-IMS, COBOL-
VSAM or C-ORACLE applications.

Multiplicity of views. The tool proposes a rich palette of presentation layouts both in
graphical and textual formats. In the next version, the analyst will be allowed to define
customized views.

Rich transformation toolset. DB-MAIN proposes a transformational toolset of more than
25 basic functions; in addition, other, possibly more complex, transformations can be built
by the analyst through specific scripts, or through Voyager-2 functions.

Traceability. DB-MAIN explicitly records a history, which includes the successive states
of the specifications as well as all the engineering activities performed by the analyst and
by the tool itself. Viewing these activities as specification transformations has proved an
elegant way to formalize the links between the specifications states. In particular, these
links can be processed to explain how a conceptual object has been implemented (forward
mapping), and how a technical object has been interpreted (reverse mapping).

13. Implementation and applications of DB-MAIN

We have developed DB-MAIN in C++ for MS-Windows machines. The repository has
been implemented as an object oriented database. For performance reasons, we have built a
specific OO database manager which provides very short access and update times, and whose
disc and core memory requirements are kept very low. For instance, a fully documented
40,000-object project can be developed on an 8-MB machine.
The first version of DB-MAIN was released in September 1995. It includes the basic
processors and functions required to design, implement and reverse engineer large size
databases according to various DMS. Version 1 supports many of the features that have
been described in this paper. Its repository can accommodate data structure specifications at
any abstraction level (Section 5). It provides a 25-transformation toolkit (Section 6), four
textual and two graphical views (Section 7), parsers for SQL, COBOL, CODASYL, IMS
and RPG programs, the PDL pattern-matching engine, the dataflow graph inspector, the
name processor (Section 8), the Transformation, Schema Analysis and Text Analysis assis-
tants (Section 9), the Voyager-2 virtual machine and compiler (Section 10), a simple history
generator and its replay processor (Section 11). Among the other functions of Version 1,
let us mention code generators for various DMS. Its estimated development cost was about 20 person-years.
The DB-MAIN tool has been used to carry out several government and industrial projects.
Let us describe five of them briefly.

• Design of a government agricultural accounting system. The initial information was
found in the notebooks in which the farmers record the day-to-day basic data. These
found in the notebooks in which the farmers record the day-to-day basic data. These
documents were manually encoded as giant entity types with more than 1850 attributes
and up to 9 decomposition levels. Through conceptualization techniques, these structures
were transformed into pure conceptual schemas of about 90 entity types each. Despite
the unusual context for DBRE, we have followed the general methodology described
in Section 2:
Data structure extraction. Manual encoding; refinement through direct contacts with
selected accounting officers;
Data structure conceptualization.
— Untranslation. The multivalued and compound attributes have been transformed
into entity types; the entity types with identical semantics have been merged; serial
attributes, i.e., attributes with similar names and identical types, have been replaced
with multivalued attributes;
— De-optimization. The farmer is requested to enter the same data at different places;
these redundancies have been detected and removed; the calculated data have been
removed as well;

— Normalization. The schema included several implicit IS-A hierarchies, which have
been expressed explicitly;

The cost for encoding, conceptualizing and integrating three notebooks was about 1 per-
son/month. This rather unusual application of reverse engineering techniques was a very
interesting experience because it proved that data structure engineering is a global domain
which is difficult (and sterile) to partition into independent processes (design, reverse).
It also proved that there is a strong need for highly generic CASE tools.
• Migrating a hybrid file/SQL social security system into a pure SQL database. Due to a
strict disciplined design, the programs were based on rather neat file structures, and used
systematic cliches for integrity constraints management. This fairly standard two-month
project comprised an interesting work on name patterns to discover foreign keys. In
addition, the file structures included complex identifying schemes which were difficult
to represent in the DB-MAIN repository, and which required manual processing.
• Redocumenting the ORACLE repository of an existing OO CASE tool. Starting from
various SQL scripts, partial schemas were extracted, then integrated. The conceptual-
ization process was fairly easy due to systematic naming conventions for candidate and
foreign keys. In addition, it was performed by a developer having a deep knowledge of
the database. The process was completed in two days.
• Redocumentating a medium size ORACLE hospital database. The database included
about 200 tables and 2,700 columns. The largest table had 75 columns. The analyst
quickly detected a dozen major tables with which one hundred views were associated.
It appeared that these views defined, in a systematic way, a 5-level subtypes hierarchy.
Entering the description of these subtypes by hand would have required an estimated
one week. We chose to build a customized function in PDL and Voyager-2 as follows.
A pattern was developed to detect and analyze the create view statements based on
the main tables. Each instantiation of this pattern triggered a Voyager-2 function which
defined a subtype with the extracted attributes. Then, the function scanned these IS-A
relations, detected the common attributes, and cleaned the supertype, removing inherited
attributes, and leaving the common ones only. This tool was developed in 2 days, and
its execution took 1 minute. However, a less expert Voyager-2 programmer could have
spent more time, so that these figures cannot be generalized reliably. The total reverse
engineering process cost 2 weeks.
• Reverse engineering of an RPG database. The application was made of 31 flat files
comprising 550 fields (2 to 100 fields per file), and 24 programs totalling 30,000 LOC.
The reverse engineering process resulted in a conceptual schema comprising 90 entity
types, including 60 subtypes, and 74 relationship types. In the programs, data validation
was concentrated in well-defined sections. In addition, the programs exhibited complex access
patterns. Obviously, the procedural code was a rich source of hidden structures and
constraints. Due to the good quality of this code, the program analysis tools were of little
help, except to quickly locate some statements. In particular, pattern detection could be
done visually, and program slicing yielded too large program chunks. Only the dataflow
inspector was found useful, though in some programs, this graph was too large, due to
the presence of working variables common to several independent program sections. At
that time, no RPG parser was available, so that a Voyager-2 RPG extractor was developed
in about one week. The final conceptual schema was obtained in 3 weeks. The source
file structures were found rather complex. Indeed, some non-trivial patterns were largely
used, such as overlapping foreign keys, conditional foreign and primary keys, overloaded
fields, redundancies (Blaha and Premerlani, 1995). Surprisingly, the result was estimated
unnecessarily complex as well, due to the deep type/subtype hierarchy. This hierarchy was
reduced until it seemed more tractable. This problem triggered an interesting discussion
about the limit of this inheritance mechanism. It appeared that the precision vs readability
trade-off may lead to unnormalized conceptual schemas, a conclusion which was often
formulated against object class hierarchies in OO databases, or in OO applications.

14. Conclusions

Considering the requirements outlined in Section 3, few (if any) commercial CASE/CARE
tools offer the functions necessary to carry out DBRE of large and complex applications in
a really effective way. In particular, two important weaknesses should be pointed out. Both
derive from the oversimplistic hypotheses about the way the application was developed.
First, extracting the data structures from the operational code is most often limited to
the analysis of the data structure declaration statements. No help is provided for further
analyzing, e.g., the procedural sections of the programs, in which essential additional
information can be found. Secondly, the logical schema is considered as a straightforward
conversion of the conceptual schema, according to simple translating rules such as those
found in most textbooks and CASE tools. Consequently, the conceptualization phase
uses simple rules as well. Most actual database structures appear more sophisticated,
however, resulting from the application of non standard translation rules and including
sophisticated performance oriented constructs. Current CARE tools are completely blind
to such structures, which they carefully transmit into the conceptual schema, producing,
e.g., optimized IMS conceptual schemas, instead of pure conceptual schemas.
The DB-MAIN CASE tool presented in this paper includes several CARE components
which try to meet the requirements described in Section 3. The first version has been used
successfully in several real size projects. These experiments have also put forward several
technical and methodological problems, which we describe briefly.

• Functional limits of the tool. Though DB-MAIN Version 1 already offers a reasonable
set of integrity constraints, a more powerful model was often needed to better describe
physical data structures or to express semantic structures. Some useful schema trans-
formations were lacking, and the scripting facilities of the assistants were found very
interesting, but not powerful enough in some situations. As expected, several users asked
for "full program reverse engineering".
• Problem and tool complexity. Reverse engineering is a software engineering domain
based on specific, and still unstable, concepts and techniques, and in which much remains
to learn. Not surprisingly, true CARE tools are complex, and DB-MAIN is no exception
when used at its full potential. Mastering some of its functions requires intensive training
which can be justified for complex projects only. In addition, writing and testing specific
PDL pattern libraries and Voyager-2 functions can cost several weeks.

• Performance. While some components of DB-MAIN proved very efficient when pro-
cessing large projects with multiple sources, some others slowed down as the size of the
specifications grew. That was the case when the pattern-matching engine parsed large
texts for a dozen patterns, and for the dataflow graph constructor which uses the former.
However, no dramatic improvement can be expected, due to the intrinsic complexity of
pattern-matching algorithms for standard machine architectures.
• Viewing the specifications. When a source text has been parsed, DB-MAIN builds a first-
cut logical schema. Though the tool proposes automatic graphical layouts, positioning
the extracted objects in a natural way is up to the analyst. This task was often considered
painful, even on a large screen, for schemas comprising many objects and connections.
In the same realm, several users found that the graphical representations were not as
attractive as expected for very large schemas, and that the textual views often proved
more powerful and less cumbersome.

The second version, which is under development, will address several of the observed
weaknesses of Version 1, and will include a richer specification model and extended toolsets.
We will mainly mention some important extensions: a view derivation mechanism, which
will solve the problem of mastering large schemas, a view integration processor to build
a global schema from extracted partial views, the first version of the MDL compiler, of
the methodology engine, and of the history manager, and an extended program slicer. The
repository will be extended to the representation of additional integrity constraints, and of
other system components such as programs. A more powerful version of the Voyager-2 lan-
guage and a more sophisticated Transformation assistant (evoked in Section 9) are planned
for Version 2 as well. We also plan to experiment the concept of design recovery for actual
applications.

Acknowledgments

The detailed comments by the anonymous reviewers have been most useful to improve the
readability and the consistency of this paper, and to make it as informative as possible. We
would also like to thank Linda Wills for her friendly encouragements.

Notes

1. A table is in 4NF iff all the non-trivial multivalued dependencies are functional. The BCNF (Boyce-Codd
normal form) is weaker but has a more handy definition: a table is in BCNF iff each functional determinant is
a key.
2. A CASE tool offering a rich toolset for reverse engineering is often called a CARE (Computer-Aided Reverse
Engineering) tool.
3. A Data Management System (DMS) is either a File Management System (FMS) or a Database Management
System (DBMS).
4. Though some practices (e.g., disciplined use of COPY or INCLUDE meta-statements to include common data
structure descriptions in programs), and some tools (such as data dictionaries) may simulate such centralized
schemas.
5. There is no miracle here: for instance, the data are imported, or organizational and behavioural rules make
them satisfy these constraints.

6. But methodology-aware if design recovery is intended. This aspect has been developed in Hainaut et al. (1994),
and will be evoked in Section 11.
7. For instance, Belgium commonly uses three legal languages, namely Dutch, French and German. As a
consequence, English is often used as a de facto common language.
8. The part of the DB-MAIN project in charge of this aspect is the DB-Process sub-project, fully supported by
the Communauté Française de Belgique.
9. In order to develop contacts and collaboration, an Education version (complete but limited to small applications)
and its documentation have been made available. This free version can be obtained by contacting the first author
at jlh@info.fundp.ac.be.

References

Andersson, M. 1994. Extracting an entity relationship schema from a relational database through reverse engineering. In Proc. of the 13th Int. Conf. on ER Approach, Manchester: Springer-Verlag.
Batini, C., Ceri, S., and Navathe, S.B. 1992. Conceptual Database Design. Benjamin-Cummings.
Batini, C., Di Battista, G., and Santucci, G. 1993. Structuring primitives for a dictionary of entity relationship data schemas. IEEE TSE, 19(4).
Blaha, M.R. and Premerlani, W.J. 1995. Observed idiosyncracies of relational database designs. In Proc. of the 2nd IEEE Working Conf. on Reverse Engineering, Toronto: IEEE Computer Society Press.
Bolois, G. and Robillard, P. 1994. Transformations in reengineering techniques. In Proc. of the 4th Reengineering Forum Reengineering in Practice, Victoria, Canada.
Casanova, M. and Amarel de Sa, J. 1983. Designing entity relationship schemas for conventional information systems. In Proc. of ERA, pp. 265-278.
Casanova, M.A. and Amaral de Sa 1984. Mapping uninterpreted schemes into entity-relationship diagrams: Two applications to conceptual schema design. In IBM J. Res. & Develop., 28(1).
Chiang, R.H., Barron, T.M., and Storey, V.C. 1994. Reverse engineering of relational databases: Extraction of an EER model from a relational database. Journ. of Data and Knowledge Engineering, 12(2):107-142.
Date, C.J. 1994. An Introduction to Database Systems. Vol. 1, Addison-Wesley.
Davis, K.H. and Arora, A.K. 1985. A methodology for translating a conventional file system into an entity-relationship model. In Proc. of ERA, IEEE/North-Holland.
Davis, K.H. and Arora, A.K. 1988. Converting a relational database model to an entity relationship model. In Proc. of ERA: A Bridge to the User, North-Holland.
Edwards, H.M. and Munro, M. 1995. Deriving a logical model for a system using recast method. In Proc. of the 2nd IEEE WC on Reverse Engineering. Toronto: IEEE Computer Society Press.
Fikas, S.F. 1985. Automating the transformational development of software. IEEE TSE, SE-11:1268-1277.
Fong, J. and Ho, M. 1994. Knowledge-based approach for abstracting hierarchical and network schema semantics. In Proc. of the 12th Int. Conf. on ER Approach, Arlington-Dallas: Springer-Verlag.
Fonkam, M.M. and Gray, W.A. 1992. An approach to eliciting the semantics of relational databases. In Proc. of 4th Int. Conf. on Advanced Information Systems Engineering (CAiSE'92), pp. 463-480, LNCS, Springer-Verlag.
Elmasri, R. and Navathe, S. 1994. Fundamentals of Database Systems. Benjamin-Cummings.
Hainaut, J.-L. 1981. Theoretical and practical tools for data base design. In Proc. Intern. VLDB Conf., ACM/IEEE.
Hainaut, J.-L. 1991. Entity-generating schema transformation for entity-relationship models. In Proc. of the 10th ERA, San Mateo (CA), North-Holland.
Hainaut, J.-L., Cadelli, M., Decuyper, B., and Marchand, O. 1992. Database CASE tool architecture: Principles for flexible design strategies. In Proc. of the 4th Int. Conf. on Advanced Information System Engineering (CAiSE-92), Manchester: Springer-Verlag, LNCS.
Hainaut, J.-L., Chandelon, M., Tonneau, C., and Joris, M. 1993a. Contribution to a theory of database reverse engineering. In Proc. of the IEEE Working Conf. on Reverse Engineering, Baltimore: IEEE Computer Society Press.
Hainaut, J.-L., Chandelon, M., Tonneau, C., and Joris, M. 1993b. Transformational techniques for database reverse engineering. In Proc. of the 12th Int. Conf. on ER Approach, Arlington-Dallas: E/R Institute and Springer-Verlag, LNCS.

Hainaut, J.-L., Englebert, V., Henrard, J., Hick, J.-M., and Roland, D. 1994. Evolution of database applications: The DB-MAIN approach. In Proc. of the 13th Int. Conf. on ER Approach, Manchester: Springer-Verlag.
Hainaut, J.-L. 1995. Transformation-based database engineering. Tutorial notes, VLDB'95, Zürich, Switzerland (available at jlh@info.fundp.ac.be).
Hainaut, J.-L. 1996. Specification preservation in schema transformations: Application to semantics and statistics. Elsevier: Data & Knowledge Engineering (to appear).
Hainaut, J.-L., Roland, D., Hick, J.-M., Henrard, J., and Englebert, V. 1996. Database design recovery. In Proc. of CAiSE'96, Springer-Verlag.
Halpin, T.A. and Proper, H.A. 1995. Database schema transformation and optimization. In Proc. of the 14th Int. Conf. on ER/OO Modelling (ERA), Springer-Verlag.
Hall, P.A.V. (Ed.) 1992. Software Reuse and Reverse Engineering in Practice. Chapman & Hall.
IEEE, 1990. Special issue on Reverse Engineering, IEEE Software, January, 1990.
Johannesson, P. and Kalman, K. 1990. A Method for translating relational schemas into conceptual schemas. In
Proc. of the 8th ERA, Toronto, North-Holland.
Joris, M., Van Hoe, R., Hainaut, J.-L., Chandelon, M., Tonneau, C, and Bodart F. et al. 1992. PHENIX: Methods
and tools for database reverse engineering. In Proc. 5th Int Conf. on Software Engineering and Applications,
Toulouse, December 1992, EC2 Publish.
Kobayashi, I. 1986. Losslessness and semantic correctness of database schema transformation: Another look of
schema equivalence. Information Systems, 11(1):41-59.
Kozaczynski, W. and Lilien, L. 1987. An extended entity-relationship (E2R) database specification and its automatic
verification and transformation. In Proc. of ERA Conf.
Markowitz, K.M. and Makowsky, J.A. 1990. Identifying extended entity-relationship object structures in relational
schemas. IEEE Trans. on Software Engineering, 16(8).
Navathe, S.B. 1980. Schema analysis for database restructuring. ACM TODS, 5(2).
Navathe, S.B. and Awong, A. 1988. Abstracting relational and hierarchical data with a semantic data model. In
Proc. of ERA: A Bridge to the User, North-Holland.
Nilsson, E.G. 1985. The translation of COBOL data structure to an entity-rel-type conceptual schema. In Proc. of
ERA, IEEE/North-Holland.
Petit, J.-M., Kouloumdjian, J., Boulicaut, J.-F., and Toumani, F. 1994. Using queries to improve database reverse
engineering. In Proc. of the 13th Int. Conf. on ER Approach, Manchester: Springer-Verlag.
Premerlani, W.J. and Blaha, M.R. 1993. An approach for reverse engineering of relational databases. In Proc. of
the IEEE Working Conf. on Reverse Engineering, IEEE Computer Society Press.
Potts, C. and Bruns, G. 1988. Recording the reasons for design decisions. In Proc. of ICSE, IEEE Computer
Society Press.
Rauh, O. and Stickel, E. 1995. Standard transformations for the normalization of ER schemata. In Proc. of the
CAiSE'95 Conf., Jyväskylä, Finland, LNCS, Springer-Verlag.
Rock-Evans, R. 1990. Reverse engineering: Markets, methods and tools, OVUM report.
Rosenthal, A. and Reiner, D. 1988. Theoretically sound transformations for practical database design. In Proc. of
ERA Conf.
Rosenthal, A. and Reiner, D. 1994. Tools and transformations—Rigorous and otherwise—for practical database
design. ACM TODS, 19(2).
Rolland, C. 1993. Modeling the requirements engineering process. In Proc. of the 3rd European-Japanese Seminar
in Information Modeling and Knowledge Bases, Budapest (preprints).
Sabanis, N. and Stevenson, N. 1992. Tools and techniques for data remodelling Cobol applications. In Proc. of the
5th Int. Conf. on Software Engineering and Applications, Toulouse, 7-11 December, pp. 517-529, EC2
Publish.
Selfridge, P.G., Waters, R.C., and Chikofsky, E.J. 1993. Challenges to the field of reverse engineering. In Proc. of
the 1st WC on Reverse Engineering, pp. 144-150, IEEE Computer Society Press.
Shoval, P. and Shreiber, N. 1993. Database reverse engineering: From relational to the binary relationship model.
Data and Knowledge Engineering, 10(10).
Signore, O., Loffredo, M., Gregori, M., and Cima, M. 1994. Reconstruction of E-R schema from database
applications: A cognitive approach. In Proc. of the 13th Int. Conf. on ER Approach, Manchester: Springer-
Verlag.
Springsteel, F.N. and Kou, C. 1990. Reverse data engineering of E-R designed relational schemas. In Proc. of
Databases, Parallel Architectures and their Applications.
Teorey, T.J. 1994. Database Modeling and Design: The Fundamental Principles, Morgan Kaufmann.
Vermeer, M. and Apers, P. 1995. Reverse engineering of relational databases. In Proc. of the 14th Int. Conf. on
ER/OO Modelling (ERA).
Weiser, M. 1984. Program slicing. IEEE TSE, 10:352-357.
Wills, L., Newcomb, P., and Chikofsky, E. (Eds.) 1995. Proc. of the 2nd IEEE Working Conf. on Reverse Engi-
neering. Toronto: IEEE Computer Society Press.
Winans, J. and Davis, K.H. 1990. Software reverse engineering from a currently existing IMS database to an
entity-relationship model. In Proc. of ERA: the Core of Conceptual Modelling, pp. 345-360, North-Holland.
Automated Software Engineering, 3, 47-76 (1996)
© 1996 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Understanding Interleaved Code


SPENCER RUGABER, KURT STIREWALT {spencer,kurt}@cc.gatech.edu
College of Computing, Georgia Institute of Technology, Atlanta, GA

LINDA M. WILLS linda.wills@ee.gatech.edu


School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA

Abstract. Complex programs often contain multiple, interwoven strands of computation, each responsible for
accomplishing a distinct goal. The individual strands responsible for each goal are typically delocalized and
overlap rather than being composed in a simple linear sequence. We refer to these code fragments as being
interleaved. Interleaving may be intentional (for example, in optimizing a program, a programmer might use
some intermediate result for several purposes) or it may creep into a program unintentionally, due to patches,
quick fixes, or other hasty maintenance practices. To understand this phenomenon, we have looked at a variety
of instances of interleaving in actual programs and have distilled characteristic features. This paper presents our
characterization of interleaving and the implications it has for tools that detect certain classes of interleaving and
extract the individual strands of computation. Our exploration of interleaving has been done in the context of a case
study of a corpus of production mathematical software, written in Fortran from the Jet Propulsion Laboratory. This
paper also describes our experiences in developing tools to detect specific classes of interleaving in this software,
driven by the need to enhance a formal description of this software library's components. The description, in turn,
aids in the automated component-based synthesis of software using the library.

With every leaf a miracle.


— Walt Whitman.

Keywords: software understanding, interleaving, domain models, specification extraction, analysis tools.

1. Introduction

Imagine being handed a software system you have never seen before. Perhaps you need to
track down a bug, rewrite the software in another language or extend it in some way. We
know that software maintenance tasks such as these consume the majority of software costs
(Boehm, 1981), and we know that reading and understanding the code requires more effort
than actually making the changes (Fjeldstad and Hamlen, 1979). But we do not know what
makes understanding the code itself so difficult.
Letovsky has observed that programmers engaged in software understanding activities
typically ask "how" questions and "why" questions (Letovsky, 1988). The former require
an in-depth knowledge of the programming language and the ways in which programmers
express their software designs. This includes knowledge of common algorithms and data
structures and even concerns style issues, such as indentation and use of comments. Nev-
ertheless, the answers to "how" questions can be derived from the program text. "Why"
questions are more troublesome. Answering them requires not only comprehending the
program text but relating it to the program's purpose: solving some sort of problem. And
the problem being solved may not be explicitly stated in the program text; nor is the rationale
the programmer had for choosing the particular solution usually visible.
This paper is concerned with a specific difficulty that arises when trying to answer "why"
questions about computer programs. In particular, it is concerned with the phenomenon
of interleaving in which one section of a program accomplishes several purposes, and
disentangling the code responsible for each purpose is difficult. Unraveling interleaved code
involves discovering the purpose of each strand of computation, as well as understanding
why the programmer decided to interleave the strands. To demonstrate this problem, we
examine an example program in a step-by-step fashion, trying to answer the questions "why
is this program the way it is?" and "what makes it difficult to understand?"

1.1. NPEDLN

The Fortran program, called NPEDLN, is part of the SPICELIB library obtained from the Jet
Propulsion Laboratory and intended to help space scientists analyze data returned from
space missions. The acronym NPEDLN stands for Nearest Point on Ellipsoid to Line. The
ellipsoid is specified by the lengths of its three semi-axes (A, B, and C), which are oriented
with the x, y, and z coordinate axes. The line is specified by a point (LINEPT) and a
direction vector (LINEDR). The nearest point is contained in a variable called PNEAR. The
full program consists of 565 lines; an abridged version can be found in the Appendix with
a brief description of subroutines it calls and variables it uses. The executable statements,
with comments and declarations removed, are shown in Figure 1.
The lines of code in NPEDLN that actually compute the nearest point are somewhat hard to
locate. One reason for this has to do with error checking. It turns out that SPICELIB includes an
elaborate mechanism for reporting and recovering from errors, and roughly half of the code
in NPEDLN is used for this purpose. We have indicated those lines by shading in Figure 2.
The important point to note is that although it is natural to program in a way that intersperses
error checks with computational code, it is not necessary to do so. In principle, an entirely
separate routine could be constructed to make the checks and NPEDLN called only when all
the checks are passed. Although this approach would require redundant computation and
potentially more total lines of code, the resultant computations in NPEDLN would be shorter
and easier to follow.
In some sense, the error handling code and the rest of the routine realize independent
plans. We use the term plan to denote a description or representation of a computational
structure that the designers have proposed as a way of achieving some purpose or goal
in a program. This definition is distilled from definitions in (Letovsky and Soloway, 1986,
Rich and Waters, 1990, Selfridge et al., 1993). Note that a plan is not necessarily stereotyp-
ical or used repeatedly; it may be novel or idiosyncratic. Following (Rich and Waters, 1990,
Selfridge et al., 1993) , we reserve the term cliche for a plan that represents a standard,
stereotypical form, which can be detected by recognition techniques, such as (Hartman,
1991, Letovsky, 1988, Kozaczynski and Ning, 1994, Quilici, 1994, Rich and Wills, 1990,
Wills, 1992). Plans can occur at any level of abstraction from architectural overviews to
code. By extracting the error checking plan from NPEDLN, we get the much smaller and,
presumably, more understandable program shown in Figure 3.

Figure 1. NPEDLN minus comments and declarations.


Figure 2. Code with error handling highlighted.

The structure of an understanding process begins to emerge: detect a plan, such as error
checking, in the code and extract it, leaving a smaller and more coherent residue for further
analysis; document the extracted plan independently; and note the ways in which it interacts
with the rest of the code.
We can apply this approach further to NPEDLN's residual code in Figure 3. NPEDLN has a
primary goal of computing the nearest point on an ellipsoid to a specified line. It also has
a related goal of ensuring that the computations involved have stable numerical behavior;
that is, that the computations are accurate in the presence of a wide range of numerical
inputs. A standard trick in numerical programming for achieving stability is to scale the
data involved in a computation, perform the computation, and then unscale the results.

Figure 3. The residual code without the error handling plan.

The code responsible for doing this in NPEDLN is scattered throughout the program's text. It is
highlighted in the excerpt shown in Figure 4.
The delocalized nature of this "scale-unscale" plan makes it difficult to gather together
all the pieces involved for consistent maintenance. It also gets in the way of understanding
the rest of the code, since it provides distractions that must be filtered out. Letovsky and
Soloway's cognitive study (Letovsky and Soloway, 1986) shows the deleterious effects of
delocalization on comprehension and maintenance.
When we extract the scale-unscale code from NPEDLN, we are left with the smaller code
segment shown in Figure 5 that more directly expresses the program's purpose: computing
the nearest point.
There is one further complication, however. It turns out that NPEDLN not only computes
the nearest point from a line to an ellipsoid, it also computes the shortest distance between
the line and the ellipsoid. This additional output (DIST) is convenient to construct because it
can make use of intermediate results obtained while computing the primary output (PNEAR).
This is illustrated in Figure 6. (The computation of DIST using VDIST is actually the last
computation performed by the subroutine NPELPT, which NPEDLN calls; we have pulled
this computation out of NPELPT for clarity of presentation.)
Note that an alternative way to structure SPICELIB would be to have separate routines
for computing the nearest point and the distance. The two routines would each be more
coherent, but the common intermediate computations would have to be repeated, both in
the code and at runtime.
The "pure" nearest point computation is shown in Figure 7. It is now much easier to see
the primary computational purpose of this code.

Figure 4. Code with scale-unscale plan highlighted.

Figure 5. The residual code without the scale-unscale plan.


Figure 6. Code with distance plan highlighted.

Figure 7. The residual code without the distance plan.


The production version of NPEDLN contains several interleaved plans. Intermediate For-
tran computations are shared by the nearest point and distance plans. A delocalized scaling
plan is used to improve numerical stability, and an independent error handling plan is used
to deal with unacceptable input. Knowledge of the existence of the several plans, how they
are related, and why they were interleaved is required for a deep understanding of NPEDLN.

1.2. Contributions

In this paper, we present a characterization of interleaving, incorporating three aspects
that make interleaved code difficult to understand: independence, delocalization, and re-
source sharing. We have distilled this characterization from an empirical examination of
existing software, primarily SPICELIB. Secondary sources of existing software which we
also examined are a Cobol database report writing system from the US Army and a pro-
gram for finding the roots of functions, presented and analyzed in (Basili and Mills, 1982)
and (Rugaber et al., 1990). We relate our characterization of interleaving to existing con-
cepts in the literature, such as delocalized plans (Letovsky and Soloway, 1986), coupling
(Yourdon and Constantine, 1979), and redistribution of intermediate results (Hall, 1990,
Hall, 1991).
We then describe the context in which we are exploring and applying these ideas. Our
driving program comprehension problem is to elaborate and validate existing partial spec-
ifications of the JPL library routines to facilitate the automation of specification-driven
generation of programs using these routines. We have developed analysis tools, based on
the Software Refinery, to detect interleaving. We describe the analyses that we have for-
mulated to detect specific classes of interleaving that are particularly useful in elaborating
specifications. We then discuss open issues concerning requirements on software and plan
representations that detection imposes, the role of application knowledge in addressing the
interleaving problem, scaling up the scope of interleaving, and the feasibility of building
tools to assist interleaving detection and extraction. We conclude with a description of how
related research in cliche recognition as well as non-recognition techniques can play a role
in addressing the interleaving problem.

2. Interleaving

Programmers solve problems by breaking them into pieces. Pieces are programming lan-
guage implementations of plans, and it is common for multiple plans to occur in a single
code segment. We use the term interleaving to denote this merging (Rugaber et al., 1995).

Interleaving expresses the merging of two or more distinct plans within some con-
tiguous textual area of a program. Interleaving can be characterized by the delocal-
ization of the code for the individual plans involved, the sharing of some resource,
and the implementation of multiple, independent plans in the program's overall
purpose.

Interleaving may arise for several reasons. It may be intentionally introduced to improve
program efficiency. For example, it may be more efficient to compute two related values in
one place than to do so separately. Intentional interleaving may also be performed to deal
with non-functional requirements, such as numerical stability, that impose global constraints
which are satisfied by diffuse computational structures. Interleaving may also creep into
a program unintentionally, as a result of inadequate software maintenance, such as adding
a feature locally to an existing routine rather than undertaking a thorough redesign. Or
interleaving may arise as a natural by-product of expressing separate but related plans in
a linear, textual medium. For example, accessors and constructors for manipulating data
structures are typically interleaved throughout programs written in traditional programming
languages due to their procedural, rather than object-oriented structure. Interleaving cannot
always be avoided (e.g., due to limitations of the available programming language) and may
be desirable (e.g., for economy and avoiding duplication which can lead to inconsistent
maintenance). Regardless of why interleaving is introduced, it complicates understanding
a program. This makes it difficult to perform tasks such as extracting reusable components,
localizing the effects of maintenance changes, and migrating to object-oriented languages.
There are several reasons interleaving is a source of difficulties. The first has to do with
delocalization. Because two or more design purposes are implemented in a single segment
of code, the individual code fragments responsible for each purpose are more spread out
than they would be if they were segregated in their own code segments. Another reason in-
terleaving presents a problem is that when it is the result of poorly thought out maintenance
activities such as "patches" and "quick fixes", the original, highly coherent structure of the
system may degrade. Finally, the rationale behind the decision to intentionally introduce
interleaving is often not explicitly recorded in the program. For example, although inter-
leaving is often introduced for purposes of optimization, expressing intricate optimizations
in a clean and well-documented fashion is not typically done. For all of these reasons, our
ability to comprehend code containing interleaved fragments is compromised.
Our goal is not to completely eliminate interleaving from programs, since that is not
always desirable or possible to do at the level of source text. Rather, it is to find ways of
detecting interleaving and representing the interleaved plans at a level of abstraction that
makes the individual plans and their interrelationships clear.
We now examine each of the characteristics of interleaving (delocalization, sharing, and
independence) in more detail.

2.1. Delocalization

Delocalization is one of the key characteristics of interleaving: one or more parts of a
plan are spatially separated from other parts by code from other plans with which they are
interleaved.

      SUBROUTINE NPEDLN ( A, B, C, LINEPT, LINEDR,
     .                    PNEAR, DIST )

      CALL UNORM ( LINEDR, UDIR, MAG )
      ... [error checks]
      SCALE = MAX ( DABS(A), DABS(B), DABS(C) )
      SCLA = A / SCALE
      SCLB = B / SCALE
      SCLC = C / SCALE
      ... [error checks]
      SCLPT(1) = LINEPT(1) / SCALE
      SCLPT(2) = LINEPT(2) / SCALE
      SCLPT(3) = LINEPT(3) / SCALE
      CALL VMINUS ( UDIR, OPPDIR )
      CALL SURFPT ( SCLPT, UDIR, SCLA, SCLB,
     .              SCLC, PT(1,1), FOUND(1) )
      CALL SURFPT ( SCLPT, OPPDIR, SCLA, SCLB,
     .              SCLC, PT(1,2), FOUND(2) )
      ... [checking for intersection of the
          line with the ellipsoid]
      IF ( FOUND(I) ) THEN
         DIST = 0.0D0
         CALL VSCL ( SCALE, PNEAR, PNEAR )
         RETURN
      END IF
      ... [handling the non-intercept case]
      CALL VSCL ( SCALE, PNEAR, PNEAR )
      DIST = SCALE * DIST

      RETURN
      END

Figure 8. Portions of the NPEDLN Fortran program. Shaded regions highlight the lines of code responsible for
scaling and unscaling.

The "scale-unscale" pattern found in NPEDLN is a simple example of a more general de-
localized plan that we refer to as a reformulation wrapper, which is frequently interleaved
with computations in SPICELIB. Reformulation wrappers transform one problem into an-
other that is simpler to solve and then transfer the solution back to the original situation.
Other examples of reformulation wrappers in SPICELIB are reducing a three-dimensional
geometry problem to a two-dimensional one and mapping an ellipsoid to the unit sphere to
make it easier to solve intersection problems.
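To make the shape of such a wrapper concrete, the following sketch (written in Python purely for
illustration; the function names and the trivial inner computation are ours, not SPICELIB's) scales
its inputs, invokes the wrapped computation, and then unscales the results:

    def with_scaling(axes, point, wrapped):
        # Reformulation wrapper: map the problem into a better-conditioned
        # space by dividing by a scale factor, run the wrapped computation
        # there, then map the answer back by multiplying.
        scale = max(abs(v) for v in axes)                 # shared scale factor
        scaled_axes = tuple(v / scale for v in axes)
        scaled_point = tuple(v / scale for v in point)
        near, dist = wrapped(scaled_axes, scaled_point)   # the enclosed plan
        return tuple(v * scale for v in near), dist * scale

    # A trivial stand-in for the enclosed computation, just to run the wrapper:
    print(with_scaling((2.0, 4.0, 8.0), (16.0, 0.0, 0.0),
                       lambda axes, pt: (pt, 1.0)))
    # ((16.0, 0.0, 0.0), 8.0)

The scale factor is computed once and reused both in scaling the inputs and in unscaling the outputs,
which is precisely what makes the plan delocalized when the wrapper is flattened into the surrounding
code rather than packaged as above.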
Delocalization may occur for a variety of reasons. One is that there may be an inherently
non-local relationship between the components of the plan, as is the case with reformula-
tion wrappers, which makes the spatial separation necessary. Another reason is that the
intermediate results of part of a plan may be shared with another plan, causing the plans to
overlap and their steps to be shuffled together; the steps of one plan separate those of the
other. For example, in Figure 8, part of the unscale plan (computing the scaling factor) is
separated from the rest of the plan (multiplying by the scaling factor) in all unscalings of
the results (DIST and PNEAR). This allows the scaling factor to be computed once and the
result reused in all scalings of the inputs A, B, and C and in unscaling the results.
Realizing that a reformulation wrapper or some other delocalized plan is interleaved
with a particular computation can help prevent comprehension failures during maintenance
(Letovsky and Soloway, 1986). It can also help detect when the delocalized plan is incom-
plete, as it was in an earlier version of our example subroutine whose modification history
includes the following correction:
C- SPICELIB Version 1.2.0, 25-NOV-1992 (NJB)
C Bug fix: in the intercept case, PNEAR is now
C properly re-scaled prior to output. Formerly,
C it was returned without having been re-scaled.

2.2. Resource Sharing

The sharing of some resource is characteristic of interleaving. When interleaving is intro-
duced into a program, there is normally some implicit relationship between the interleaved
plans, motivating the designer to choose to interleave them. An example of this within
NPEDLN is shown in Figure 9. The shaded portions of the code shown are shared between
the two computations for PNEAR and DIST. In this case, the common resources shared by
the interleaved plans are intermediate data computations. The implementations for com-
puting the nearest point and the shortest distance overlap in that a single structural element
contributes to multiple goals.
The sharing of the results of some subcomputation in the implementation of two distinct
higher level operations is termed redistribution of intermediate results by Hall (Hall, 1990,
Hall, 1991). More specifically, redistribution is a class of function sharing optimizations
which are implemented simply by tapping into the dataflow from some value producer and
feeding it to an additional target consumer, introducing fanout into the dataflow. Redistri-
bution covers a wide range of common types of function sharing optimizations, including
common subexpression elimination and generalized loop fusion. Hall developed an automated
technique for redistributing results for use in optimizing code generated from general-purpose
reusable software components.

      SUBROUTINE NPEDLN ( A, B, C, LINEPT, LINEDR,
     .                    PNEAR, DIST )

      ... [first 100 lines of NPEDLN]

      CALL NPELPT ( PRJPT, PRJEL, PRJNPT )
      DIST = VDIST ( PRJNPT, PRJPT )
      CALL VPRJPI ( PRJNPT, PRJPL, CANDPL, PNEAR,
     .              IFOUND )
      IF ( .NOT. IFOUND ) THEN
         ... [error handling]
      END IF
      CALL VSCL ( SCALE, PNEAR, PNEAR )
      DIST = SCALE * DIST
      CALL CHKOUT ( 'NPEDLN' )
      RETURN
      END

Figure 9. Portions of NPEDLN, highlighting two overlapping computations.

Redistribution of results is a form of interleaving in which the resources shared are data values.
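As a small, self-contained illustration (ours, not Hall's system or SPICELIB code), redistribution
simply introduces fanout into the dataflow: an intermediate value computed for one consumer is also
routed to a second consumer and, here, returned as a secondary output:

    import math

    def roots_and_discriminant(a, b, c):
        # The discriminant is computed once; its value fans out to both
        # roots and is also returned as a secondary output, rather than
        # being recomputed by each consumer.
        disc = b * b - 4 * a * c      # shared intermediate producer
        sq = math.sqrt(disc)          # also shared (assumes disc >= 0)
        root1 = (-b + sq) / (2 * a)   # consumer 1
        root2 = (-b - sq) / (2 * a)   # consumer 2
        return root1, root2, disc     # secondary output tapped from the dataflow

    print(roots_and_discriminant(1.0, -3.0, 2.0))   # (2.0, 1.0, 1.0)

Returning the discriminant is cheap only because it already exists as an intermediate result, which is
the same economy that motivates NPEDLN to return DIST alongside PNEAR.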
The commonality between interleaved plans might be in the form of other shared resources
besides data values, for example control structures, lexical module structures, and names.
Often when interleaving is unintentional, the resource shared is code space: the code
statements of two plans are interleaved because they must be expressed in linear text.
Typically, intentional interleaving involves sharing higher level resources.
Control coupling. Control conditions may be redistributed just as data values are. The
use of control flags allows control conditions to be determined once but used to affect
execution at more than one location in the program. In NPEDLN, for example, SURFPT is
called to compute the intersection of the line with the ellipsoid. This routine returns a
control flag, FOUND, indicating whether or not the intersection exists. This flag is then used
outside of SURFPT to control whether the intercept or non-intercept case is to be handled, as
is shown in Figure 10.
The use of control flags is a special form of control coupling: "any connection between
two modules that communicates elements of control" (Yourdon and Constantine, 1979),
typically in the form of function codes, flags, or switches (Myers, 1975). This sharing of
control information between two modules increases the complexity of the code, complicat-
ing comprehension and maintenance.
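The pattern can be paraphrased in a few lines of Python (an illustration of ours, not the SPICELIB
code): a producer computes a flag along with its result, and the caller, not the producer, branches
on it:

    def find_first_negative(values):
        # Returns (index, found): the flag carries a control condition
        # computed here but acted on by the caller.
        for i, v in enumerate(values):
            if v < 0:
                return i, True
        return -1, False

    index, found = find_first_negative([3, 5, -2, 7])
    if found:                                   # caller-side branch on the flag
        print("handling the 'found' case at position", index)
    else:
        print("handling the 'not found' case")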
Content coupling. Another form of resource sharing occurs when the lexical structure of
a module is shared among several related functional components. For example, the entire
contents of a module may be lexically included in another. This sometimes occurs when
a programmer wants to take advantage of a powerful intraprocedural optimizer limited to
improving the code in a single routine. Another example occurs when a programmer uses
ENTRY statements to partially overlap the contents of several routines so that they may share
access to some state variables.

      CALL SURFPT ( SCLPT, UDIR, SCLA, SCLB,
     .              SCLC, PT(1,1), FOUND(1) )
      CALL SURFPT ( SCLPT, OPPDIR, SCLA, SCLB,
     .              SCLC, PT(1,2), FOUND(2) )
      DO 50001
     .   I = 1, 2
         IF ( FOUND(I) ) THEN
            ... [handling the intercept case]
            RETURN
         END IF
50001 CONTINUE
C     Getting here means the line doesn't
C     intersect the ellipsoid.
      ... [handling the non-intercept case]
      RETURN
      END

Figure 10. Fragment of subroutine showing control coupling.

This is sometimes done in a language, such as Fortran, that
does not contain an encapsulation mechanism like packages or objects.
These two practices are examples of a phenomenon called content coupling (Yourdon
and Constantine, 1979), in which "some or all of the contents of one module are included
in the contents of another" and which often manifests itself in the form of a multiple-
entry module. Content coupling makes it difficult to independently modify or maintain the
individual functions.
Name Sharing. A simple form of sharing is the use of the same variable name for two
different purposes. This can lead to incorrect assumptions about the relationship between
subcomputations within a program.
In general, the difficulty that resource sharing introduces is that it causes ambiguity in
interpreting the purpose of program pieces. This can lead to incorrect assumptions about
what effect changes will have, since the maintainer might be focusing on only one of the
actual uses of the resource (variable, value, control flag, data structure slot, etc.).

2.3. Independence

While interleaving is introduced to take advantage of commonalities, it is also true that


the interleaved plans each have a distinct purpose. Because understanding relates program
goals to program code, having two goals realized in one section of code can be confusing.
There are several ways for dealing with this problem. One way would be to make two
copies of the code segment, each responsible for one of the goals, and both duplicating
any common code. In the NPEDLN example, a separate routine could be provided that is
responsible for computing DIST. Although this may make understanding each of the routines
somewhat simpler, there are costs due to the extra code and the implicit need, often forgotten,
to update both versions of the common code whenever it needs to be fixed. A variant of
this approach is to place the common code in a separate routine, replacing it in each of
the two copies with a call to the new routine. This factoring approach works well when
the common code is contiguous, but quickly becomes unworkable if the common code is
interrupted by the plan specific code.
The bottom line is that this style of intentional interleaving confronts the programmer with
a tradeoff between efficiency and maintainability/understandability. Ironically, making the
efficiency choice may hinder efforts to make the code more efficient and reusable in the
long run, such as parallelizing or "objectifying" the code (converting it to an object-oriented
style).

3. Case Study

In order to better understand interleaving, we have undertaken a case study of production
library software. The library, called SPICELIB, consists of approximately 600 mathematical
programs, written in Fortran by programmers at the Jet Propulsion Laboratory, for analyzing
data sent back from space missions. The software performs calculations related to solar
system geometry, such as coordinate frame conversions, intersections of rays, ellipses,
planes, and ellipsoids, and light-time calculations. NPEDLN comes from this library.
We were introduced to SPICELIB by researchers at NASA Ames, who have devel-
oped a component-based software synthesis system called Amphion (Lowry et al., 1994,
Lowry et al., 1994, Stickel et al., 1994). Amphion automatically constructs programs that
compose routines drawn from SPICELIB. It does this by making use of a domain theory that
includes formal specifications of the library routines, connecting them to the abstract con-
cepts of solar system geometry. The domain theory is encoded in a structured representation,
expressed as axioms in first-order logic with equality. A space scientist using Amphion can
schematically specify the geometry of a problem through a graphical user interface, and
Amphion automatically generates Fortran programs to call SPICELIB routines to solve the
described problem. Amphion is able to do this by proving a theorem about the solvability
of the problem and, as a side effect, generating the appropriate calls. This is shown in the
bottom half of Figure 11. Amphion has been installed at JPL and used by space scientists to
successfully generate over one hundred programs to solve solar system kinematics problems.
The programs consist of dozens of subroutine calls and are typically synthesized in under
three minutes of CPU time using a Sun Sparc 2 (Lowry et al., 1994, Lowry et al., 1994).
Amphion's success depends on how accurate, consistent, and complete its domain theory
is. An essential program understanding task is to validate the domain theory by checking
it against the SPICELIB routines and extending it when incompletenesses are found. To do
this, we need to be able to pull apart interleaved strands. For example, one incompleteness
in Amphion's domain theory is that it does not fully cover the functionality of the routines
in SPICELIB. Some routines compute more than one result. For example, NPEDLN computes
the nearest point on an ellipsoid to a line as well as the shortest distance between that
point and the ellipsoid. However, the domain theory does not describe both of these
values. In the case of NPEDLN, only the nearest point computation is modelled, not the
shortest distance. In these routines, it is often the case that the code responsible for the
secondary functionalities is interleaved with the code for the primary function covered by
Amphion's domain theory.

Figure 11. Applying interleaving detection to component-based reuse.

Uncovering the secondary functionality requires unraveling and
understanding two interleaved computations.
Another way in which Amphion's current domain theory is incomplete is that it does
not express preconditions on the use of the library routines; for example, that a line given
as input to a routine must not be the zero vector or that an ellipsoid's semi-axes must
be large enough to be scalable. It is difficult to detect the code responsible for checking
these preconditions because it is usually tightly interleaved with the code for the primary
computation in order to take advantage of intermediate results computed for the primary
computation.
In collaboration with NASA Ames researchers, we explored ways in which Amphion's
domain theory is incomplete, and we built program comprehension techniques to extend
it. As the top half of Figure 11 shows, we developed mechanisms for detecting particular
classes of interleaving, with the aim of extending the incomplete domain theory. In the
process, we also performed analyses to gather empirical information about how much of
SPICELIB is covered by the domain theory.
We have built interleaving detection mechanisms and empirical analyzers using a commer-
cial tool called the Software Refinery (Reasoning Systems Inc.). This is a comprehensive
tool suite including language-specific parsers and browsers for Fortran, C, Ada, and Cobol,
language extension mechanisms for building analyzers for new languages, and a user inter-
face construction tool for displaying the results of analyses. It maintains an object-oriented
repository for holding the results of its analyses, such as abstract syntax trees and symbol
tables. It provides a powerful wide-spectrum language, called Refine (Smith et al., 1985),
which supports pattern matching and querying the repository. Using the Software Refinery
allows us to leverage a commercially available tool as well as to evaluate the strengths and
limitations of its approach to program analysis, which we discuss in Section 4.4.

3.1. Domain Theory Elaboration in Synthesis and Analysis

Our motivations for validating and extending a partial domain theory of existing software
come both from the synthesis and from the analysis perspectives. The primary motivations
for doing this from the synthesis perspective are to make component retrieval more accurate
in support of reuse, to assist in updating and growing the domain theory as new software
components are added, and to improve the software synthesized.
From the software analysis perspective, the refinement and elaboration of the domain
theory, based on what is discovered in the code, is a primary activity, driving the generation of
hypotheses and informing future analyses. The process of understanding software involves
two parallel knowledge acquisition activities (Brooks, 1983, Ornburn and Rugaber, 1992,
Soloway and Ehrlich, 1984):

1. using domain knowledge to understand the code: knowledge about the application sets
up expectations about how abstract concepts are typically manifested in concrete code
implementations;
2. using knowledge of the code to understand the domain: what is discovered in the code
is used to build up a description of various aspects of the application and to help answer
questions about why certain code structures exist and what is their purpose with respect
to the application.

We are studying interleaving in the context of performing these activities, given SPICELIB
and an incomplete theory of its application domain. We are targeting our detection of
interleaving toward elaborating the existing domain theory. We are also looking for ways
in which the current knowledge in the domain theory can guide detection and ultimately
comprehension.

3.2. Extracting Preconditions

Using the Software Refinery, we automated a number of program analyses, one of which
is the detection of subroutine parameter precondition checks. A precondition is a Boolean
guard controlling execution of a routine. Preconditions normally occur early in the code of
a routine before a significant commitment (in terms of execution time and state changes that
must be reversed) is made to execute the routine. Because precondition checks are often
interspersed with the computation of intermediate results, they tend to delocalize the plans
that perform the primary computational work. Moreover precondition computations are
usually part of a larger plan that detects exceptional, possibly erroneous conditions in the
state of a running program, and then takes alternative action when these conditions arise,
such as returning with an error code, signaling, or invoking error handlers. In some instances
the majority of the lines of code in a routine are there to deal with the preconditions and
resulting exception handling rather than to actually implement the base plan of the routine.

We found many examples of precondition checks on input parameters in our empirical
analysis of the SPICELIB. One such check occurs in the subroutine SURFPT and is shown in
Figure 12.

C$Procedure SURFPT ( Surface point on an ellipsoid )

      SUBROUTINE SURFPT ( POSITN, U, A, B, C, POINT, FOUND )
      DOUBLE PRECISION U ( 3 )
      ...declarations...
C     Check the input vector to see if it's the zero vector. If it is,
C     signal an error and return.
C
      IF ( ( U(1) .EQ. 0.0D0 ) .AND.
     .     ( U(2) .EQ. 0.0D0 ) .AND.
     .     ( U(3) .EQ. 0.0D0 ) ) THEN
         CALL SETMSG ( 'SURFPT: The input vector is the zero vector.' )
         CALL SIGERR ( 'SPICE(ZEROVECTOR)' )
         CALL CHKOUT ( 'SURFPT' )
         RETURN
      END IF

Figure 12. A fragment of the subroutine SURFPT in SPICELIB. This fragment shows a precondition check which
invokes an exception if all of the elements of the U array are 0.

SURFPT finds the intersection (POINT) of a ray (represented by a point POSITN
and a direction vector U) with an ellipsoid (represented as three semi-axis lengths A, B, and
C), if such an intersection exists (indicated by FOUND). One of the preconditions checked by
SURFPT is that the direction vector U is not the zero vector.
Parameter precondition checks make explicit the assumptions a subroutine places on its
inputs. The process of understanding a subroutine can be facilitated by detecting its precon-
dition checks and using the information they encode to elaborate a high-level specification
of the subroutine. We have created a tool that detects parameter precondition checks and
extracts the preconditions into a documentation form suitable for expression as a partial
specification. The specifications can then be compared against the Amphion domain model.
Precondition checks are particularly difficult to understand when they are sprinkled
throughout the code of a subroutine as opposed to being concentrated at the beginning.
However, we discovered that, though interleaved, these checks could be heuristically iden-
tified in SPICELIB by searching for IF statements whose predicates are unmodified input
parameters (or simple dataflow dependents of them) and whose bodies invoke exception
handlers. The logical negation of each of the predicates forms a conjunct in the precondition
of the subroutine. The analysis that decides whether or not IF statements test only unmod-
ified input parameters is specific to the Fortran language; but the analysis that decides if a
code fragment is an exception plan depends on the fact that exceptions are dealt with in a
stylized and stereotypical manner in SPICELIB. The implication is that the Fortran specific
portion is not likely to need changing when we apply the tool to a new Fortran application;
whereas the SPICELIB specific portion will certainly need to change. With this in mind, we
chose a tool architecture that allows flexibility in keeping these different types of pattern
knowledge separate and independently adaptable.
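To suggest how little machinery the heuristic itself needs, the following Python sketch applies it to
a toy statement representation. It is our reconstruction of the idea, not the Refine implementation;
the statement classes and the rendering of predicates as strings are invented for the example:

    from dataclasses import dataclass
    from typing import List, Set, Union

    @dataclass
    class Assign:            # X = <expression reading the names in 'reads'>
        target: str
        reads: Set[str]

    @dataclass
    class Call:              # CALL name ( ... )
        name: str

    @dataclass
    class If:                # IF <test> THEN <body>
        test: str            # predicate kept as text for reporting
        reads: Set[str]      # names the predicate mentions
        body: List["Stmt"]

    Stmt = Union[Assign, Call, If]

    def is_exception_body(body):
        # SPICELIB-style exception handling: the body registers the error
        # with SIGERR and returns.  RETURN is modelled here as a Call.
        names = [s.name for s in body if isinstance(s, Call)]
        return "SIGERR" in names and "RETURN" in names

    def extract_preconditions(params, stmts):
        # Negate every guard that tests only still-unmodified input
        # parameters and whose body is an exception handler.
        unmodified = set(params)
        preconditions = []
        for s in stmts:
            if isinstance(s, If) and s.reads <= unmodified \
                    and is_exception_body(s.body):
                preconditions.append("NOT (" + s.test + ")")
            if isinstance(s, Assign):
                unmodified.discard(s.target)   # no longer a pure input
        return preconditions

    # A toy rendering of the zero-vector check in SURFPT:
    surfpt = [If("U(1) = 0 AND U(2) = 0 AND U(3) = 0", {"U"},
                 [Call("SETMSG"), Call("SIGERR"), Call("CHKOUT"), Call("RETURN")])]
    print(extract_preconditions({"POSITN", "U", "A", "B", "C"}, surfpt))
    # ['NOT (U(1) = 0 AND U(2) = 0 AND U(3) = 0)']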

Detecting Exception Handlers. In general, we need application specific knowledge about
usage patterns in order to discover exception handlers. For example, the developers of
SPICELIB followed a strict discipline of exception propagation by registering an exception
upon detection using a subroutine SIGERR and then exiting the executing subroutine using
a RETURN statement. Hence, a call to SIGERR together with a RETURN indicates a cliche for
handling an exception in SPICELIB. In some other application, the form of this cliche will be
different. It is, therefore, necessary to design the recognition component of our architecture
around this need to specialize the tool with knowledge about the system being analyzed.
The Software Refinery provides excellent support for this design principle through the
use of the rule construct and a tree-walker that applies these rules to an abstract syntax
tree (AST). Rules declaratively specify state changes by listing the conditions before and
after the change without specifying how the change is implemented. This is useful for in-
cluding SPICELIB specific pattern knowledge because it allows the independent, declarative
expression of the different facets of the pattern.
We recognize application specific exception handlers using two rules that search the AST
for a call to SIGERR followed by a RETURN statement. These rules and the Refine code that
applies them are presented in detail in (Rugaber et al., 1995).
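Stripped of the Refine rule syntax, the pattern the two rules look for can be restated as an ordered
scan over a statement sequence (again a Python paraphrase of ours, over an invented tuple encoding
rather than an abstract syntax tree):

    def matches_exception_cliche(stmts):
        # The cliche is a CALL to SIGERR followed, later in the same
        # sequence, by a RETURN; the ordering matters, not mere presence.
        saw_sigerr = False
        for s in stmts:
            if s[0] == "call" and s[1] == "SIGERR":
                saw_sigerr = True
            elif s[0] == "return" and saw_sigerr:
                return True
        return False

    body = [("call", "SETMSG"), ("call", "SIGERR"),
            ("call", "CHKOUT"), ("return",)]
    print(matches_exception_cliche(body))       # True
    print(matches_exception_cliche(body[:2]))   # False: no RETURN follows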
Detecting Guards. Discovering guards, which are I F statements that depend only upon
input parameters, involves keeping track of whether or not these parameters have been
modified. If they have been modified before the check, then the check probably is not a
precondition check on inputs. In Fortran, a variable X can be modified by:

1. appearing on the left hand side of an assignment statement,

2. being passed into a subroutine which then modifies the formal parameter bound to X
by the call,

3. being implicitly passed into another subroutine in a COMMON block and modified in this
other subroutine, or

4. being explicitly aliased by an EQUIVALENCE statement to another variable which is then
   modified.

Currently our analysis does not detect modification through COMMON or EQUIVALENCE because
none of the code in SPICELIB uses these features with formal parameters. We track modi-
fications to input parameters by using an approximate dataflow algorithm that propagates a
set of unmodified variables through the sequence of statements in the subroutine. At each
statement, if a variable X in the set could be modified by the execution of the statement,
then X is removed from the set. After the propagation, we can easily check whether or not
an IF statement is a guard.
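The propagation itself can be sketched as a single pass over the statements (our simplification in
Python; the summary of which argument positions a called routine writes is assumed to come from a
separate analysis, and modification through COMMON or EQUIVALENCE is not modelled, matching the
limitation just described):

    def propagate_unmodified(params, stmts, modifies):
        # Approximate dataflow: record the set of still-unmodified input
        # parameters before each statement, removing a variable as soon
        # as it may be written.
        unmodified = set(params)
        before = []
        for s in stmts:
            before.append(set(unmodified))
            if s[0] == "assign":                     # ("assign", target)
                unmodified.discard(s[1])
            elif s[0] == "call":                     # ("call", name, [actuals])
                for pos in modifies.get(s[1], ()):
                    if pos < len(s[2]):
                        unmodified.discard(s[2][pos])
        return before, unmodified

    # Hypothetical fragment: NORM writes its 2nd and 3rd actuals; DIR is
    # later overwritten by an assignment and so stops being a pure input.
    before, final = propagate_unmodified(
        {"A", "B", "DIR"},
        [("call", "NORM", ["DIR", "UDIR", "MAG"]), ("assign", "DIR")],
        {"NORM": (1, 2)})
    print([sorted(s) for s in before])   # [['A', 'B', 'DIR'], ['A', 'B', 'DIR']]
    print(sorted(final))                 # ['A', 'B']

At each IF statement, the guard test then reduces to asking whether the variables read by the predicate
are a subset of the set recorded before that statement.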
Results. The result of this analysis is a table of preconditions associated with each subrou-
tine. Since we are targeting partial specification elaboration for Amphion, we chose to make
the tool output the preconditions in LaTeX form. Figure 13 gives examples of preconditions
extracted for a few SPICELIB subroutines. Our tool generated the LaTeX source included in
Figure 13 without change.

RECGEO   ¬(F > 1) ∧ ¬(RE < 0.0D0)
REMSUB   ¬((LEFT > RIGHT) ∨ (RIGHT < 1) ∨ (LEFT < 1) ∨ (RIGHT > LEN(IN)) ∨ (LEFT > LEN(IN)))
SURFPT   ¬((U(1) = 0.0D0) ∧ (U(2) = 0.0D0) ∧ (U(3) = 0.0D0))
XPOSBL   ¬((MOD(NCOL, BSIZE) ≠ 0) ∨ (MOD(NROW, BSIZE) ≠ 0)) ∧ ¬(NCOL < 1) ∧ ¬(NROW < 1) ∧ ¬(BSIZE < 1)

Figure 13. Preconditions extracted for some of the subroutines in SPICELIB.

Taken literally, the precondition for SURFPT, for example, states that one of the first three
elements of the U array parameter must be non-zero. In terms of solar system geometry,
U is seen as a vector, so the more abstract precondition can be stated as "U is not the zero
vector." Extracting the precondition into the literal representation is the first step to being
able to express the precondition in the more abstract form.
The other preconditions listed in Figure 13, stated in their abstract form, are the following.
The subroutine RECGEO converts the rectangular coordinates of a point RECTAN to geodetic
coordinates, with respect to a given reference spheroid whose equatorial radius is RE, using
a flattening coefficient F. Its precondition is that the radius is greater than 0 and the flattening
coefficient is less than 1. The subroutine REMSUB removes the substring (LEFT:RIGHT) from
a character string IN. It requires that the positions of the first character LEFT and the last
character RIGHT to be removed are in the range 1 to the length of the string and that the
position of the first character is less than the position of the last. Finally, the subroutine
XPOSBL transposes the square blocks within a matrix BMAT. Its preconditions are that the
block size BSIZE must evenly divide both the number of rows NROW in BMAT and the number
of columns NCOL and that the block size, number of rows, and number of columns are all at
least 1.

3.3. Finding Interleaving Candidates

There are several other analyses that we have investigated using heuristic techniques for
finding interleaving candidates.

3.3.1. Routines with Multiple Outputs

One heuristic for finding instances of interleaving is to determine which subroutines compute
more than one output. When this occurs, the subroutine is returning either the results of
multiple distinct computations or a result whose type cannot be directly expressed in the
Fortran type system (e.g., as a data aggregate). In the former case, the subroutine is realized
as the interleaving of multiple distinct plans, as is the case with NPEDLN's computation of
both the nearest point and the shortest distance.
In the latter case, the subroutine may be implementing only a single plan, but a maintainer's
conceptual categorization of the subroutine is still obscured by the appearance of some
number of seemingly distinct outputs. A good example of this case occurs in the SPICELIB
subroutine SURFPT, which conceptually returns the intersection of a vector with the surface
of an ellipsoid. However, it is possible to give SURFPT a vector and an ellipsoid that
do not intersect. In such a situation the output parameter POINT will be undefined, but the
Fortran type system cannot express the type: DOUBLE PRECISION ∨ Undefined. The original
programmer was forced to simulate a variable of this type using two variables, POINT and
FOUND, adopting the convention that when FOUND is false, the return value is Undefined, and
when FOUND is true, the return value is POINT.
Clearly subroutines with multiple outputs complicate program understanding. We built a
tool that determines the multiple output subroutines in a library by analyzing the direction
of dataflow in parameters of functions and subroutines. A parameter's direction is either:
in if the parameter is only read in the subroutine, out if the parameter is only written in the
subroutine, or in-out if the parameter is both read and written in the subroutine. Multiple
output subroutines will have more than one parameter with direction out or in-out.
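A schematic version of the classification (ours; the real analysis works over the Software Refinery's
structure-chart annotations rather than explicit read/write sets) looks like this:

    def parameter_directions(params, reads, writes):
        # Classify each formal parameter from the sets of parameters the
        # routine reads and writes.
        directions = {}
        for p in params:
            if p in writes and p in reads:
                directions[p] = "in-out"
            elif p in writes:
                directions[p] = "out"
            else:
                directions[p] = "in"
        return directions

    def has_multiple_outputs(directions):
        # Interleaving candidate: more than one parameter carries data out.
        return sum(d in ("out", "in-out") for d in directions.values()) > 1

    # NPEDLN writes both PNEAR and DIST, so it is flagged as a candidate.
    dirs = parameter_directions(
        ["A", "B", "C", "LINEPT", "LINEDR", "PNEAR", "DIST"],
        reads={"A", "B", "C", "LINEPT", "LINEDR"},
        writes={"PNEAR", "DIST"})
    print(has_multiple_outputs(dirs))   # True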
Our tool bases its analysis on the structure chart (call graph) objects that the Software
Refinery creates. The nodes of these structure charts are annotated with parameter direction
information. The resulting analysis showed that 25 percent of the subroutines in SPICELIB
had multiple output parameters. We were thus able to focus our work on these routines
first, as they are likely to involve interleaving.
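The direction classification can be sketched directly from per-parameter read and write sets. The Python fragment below is a hypothetical reconstruction for illustration only; the actual tool works from the Software Refinery's annotated structure charts, and the data here are invented.

    from typing import Dict, List, Set

    def parameter_directions(reads: Set[str], writes: Set[str],
                             params: List[str]) -> Dict[str, str]:
        # Classify each formal parameter as 'in', 'out', or 'in-out' from the
        # sets of parameters read and written inside the subroutine.
        dirs = {}
        for p in params:
            if p in reads and p in writes:
                dirs[p] = "in-out"
            elif p in writes:
                dirs[p] = "out"
            else:
                dirs[p] = "in"
        return dirs

    def has_multiple_outputs(dirs: Dict[str, str]) -> bool:
        # Flag routines with more than one parameter carrying results out.
        return sum(1 for d in dirs.values() if d in ("out", "in-out")) > 1

    # Example: NPEDLN returns both PNEAR and DIST, so it is flagged.
    npedln = parameter_directions(
        reads={"A", "B", "C", "LINEPT", "LINEDR"},
        writes={"PNEAR", "DIST"},
        params=["A", "B", "C", "LINEPT", "LINEDR", "PNEAR", "DIST"])
    assert has_multiple_outputs(npedln)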
In addition, we performed an empirical analysis to determine, for those routines covered
by the Amphion domain model (35 percent of the library), which ones have multiple output
parameters, some of which are not covered by the domain model. We refer to outputs
that are not mapped to anything in the domain model as dead end dataflows (similar to an
interprocedural version of dead code (Aho et al., 1986)). Since the programs that Amphion
creates can never make use of these return values, they have not been associated with any
meaning in the domain theory. For example, NPEDLN's distance output (DIST) is a dead end
dataflow as far as the domain theory is concerned. Dead end dataflows imply interleaving
in the subroutine and/or an incompleteness in the domain theory. Our analysis revealed that
of the subroutines covered by the domain theory, 30 percent have some output parameters
that are dead end dataflows. These are good focal points for detecting interleaved plans that
might be relevant to extending the domain theory.
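A dead end dataflow can be computed as a set difference between a routine's output parameters and the outputs that the domain model maps to some concept. A rough sketch with invented data (only NPEDLN's situation is taken from the text):

    from typing import Dict, Set

    def dead_end_outputs(output_params: Dict[str, Set[str]],
                         domain_mapped: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
        # For each routine, the output parameters never mapped to anything in
        # the domain model are its dead end dataflows.
        result = {}
        for routine, outs in output_params.items():
            unmapped = outs - domain_mapped.get(routine, set())
            if unmapped:
                result[routine] = unmapped
        return result

    # The domain theory uses NPEDLN's nearest point but not its distance,
    # so DIST shows up as a dead end dataflow.
    print(dead_end_outputs({"NPEDLN": {"PNEAR", "DIST"}},
                           {"NPEDLN": {"PNEAR"}}))    # {'NPEDLN': {'DIST'}}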

3.3.2. Control Coupling

Another heuristic for detecting potential interleaving finds candidate routines that may be
involved in control coupling. Control coupling is often implemented by using a subroutine
formal parameter as a control flag. So, we focus on calls to library routines that supply a
constant as a parameter to other routines, as opposed to a variable. The constant parameter
may be a flag that is being used to choose among a set of possible computations to perform.
The heuristic strategy we use for detecting control coupling first computes a set of candidate
routines that are invoked with a constant parameter at every call-site in the library or in code
generated from the Amphion domain theory. Each member of this set is then analyzed
to see if the formal parameter associated with the constant actual parameter is used to
conditionally execute disjoint sections of code. Our analysis shows that 19 percent of the
routines in SPICELIB are of this form.
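One plausible reading of the first step of this strategy is sketched below; the call-site representation and the data are invented and stand in for what the real analysis extracts from the parse tree.

    from typing import List, Optional, Sequence, Set

    # A call site records, for each argument position, the constant passed,
    # or None when the actual is a variable or an expression.
    CallSite = Sequence[Optional[str]]

    def constant_only_formals(call_sites: List[CallSite]) -> Set[int]:
        # Argument positions that receive a constant actual at every recorded
        # call site; these formals are candidates for control flags.
        if not call_sites:
            return set()
        positions = set(range(len(call_sites[0])))
        for site in call_sites:
            positions &= {i for i, actual in enumerate(site) if actual is not None}
        return positions

    # The second argument is always a constant, so it is reported as a
    # candidate control flag; the flagged routine is then checked to see
    # whether that formal selects among disjoint sections of code.
    sites = [(None, "'TRUE'", None), (None, "'FALSE'", None)]
    assert constant_only_formals(sites) == {1}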

3.3.3. Reformulation Wrappers

A third heuristic for locating interleaving is to ask: Which pairs of routines co-occur? Two
routines co-occur if they are always called by the same routines, they are executed under the
same conditions, and there is a flow of computed data from one to the other. We would like
to detect co-occurrence pairs because they are likely to form reformulation wrappers. Of
course, in general we would like to consider any code fragments as potential pairs, not just
library routines. Once co-occurrence pairs are detected, they must be further checked to see
whether they are inverses of each other. For example, in the "scale-unscale" reformulation
wrapper, the operations that divide and multiply by the scaling factor co-occur and invert
the effects of each other; the inputs are scaled (divided) and the results of the wrapped
computation are later unscaled (multiplied). Through empirical investigation of SPICELIB,
we have discovered co-occurrence pairs that form reformulation wrappers and are building
tools to perform this analysis automatically.
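The first of these conditions, being invoked by exactly the same callers, already gives a cheap filter for co-occurrence candidates. A rough sketch, assuming a precomputed caller map; the routine names are invented:

    from itertools import combinations
    from typing import Dict, List, Set, Tuple

    def co_occurrence_candidates(callers: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
        # Pairs of routines called by exactly the same set of callers.  The
        # full heuristic also checks that they execute under the same
        # conditions and that computed data flows from one to the other.
        return [(f, g) for f, g in combinations(sorted(callers), 2)
                if callers[f] == callers[g]]

    calls = {"SCALE_INPUTS":   {"NPEDLN", "NPELPT"},
             "UNSCALE_RESULT": {"NPEDLN", "NPELPT"},
             "UNORM":          {"NPEDLN"}}
    print(co_occurrence_candidates(calls))   # [('SCALE_INPUTS', 'UNSCALE_RESULT')]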

4. Open Issues and Future Work

We are convinced that interleaving seriously complicates understanding computer programs.
But recognizing a problem is different from knowing how to fix it. Questions arise as to
what form of representation is appropriate to hold the extracted information, how knowledge
of the application domain can be used to detect plans, the extent to which the concept of
interleaving scales up, and how powerful tools need to be to detect and extract interleaved
components.

4.1. Representation

Our strategy for building program analysis tools is to formulate a program representation
whose structural properties correspond to interesting program properties. A programming
style tool, for example, uses a control flow graph that explicitly represents transfer of execu-
tion flow in programs. Irreducible control flow graphs signify the use of unstructured GO TO
statements. The style tool uses this structural property to report violations of structured
programming style. Since we want to build tools for interleaving detection we have to for-
mulate a representation that captures the properties of interleaving. We do this by first listing
structural properties that correspond to each of the three characteristics of interleaving and
then searching for a representation that has these structural properties.
The key characteristics of interleaving are delocalization, resource sharing, and indepen-
dence. In sequential languages like Fortran, delocalization often cannot be avoided when
two or more plans share data. The components of the plans have to be serialized with
respect to the dataflow constraints. This typically means that components of plans cluster
around the computation of the data being shared as opposed to clustering around other
components of the same plan. This total ordering is necessary due to the lack of support
for concurrency in most high level programming languages. It follows then that in order
to express a delocalized plan, a representation must impose a partial rather than a total
execution ordering on the components of plans.
The partial execution ordering requirement suggests that some form of graphical represen-
tation is appropriate. Graph representations naturally express a partial execution ordering
via implicit concurrency and explicit transfer of control and data. Since there are a number
of such representations to choose from, we narrow the possibilities by noting that:

1. independent plans must be localized as much as possible, with no explicit ordering
among them;

2. sharing must be detectable (shared resources should explicitly flow from one plan to
another); similarly, if two plans p1 and p2 both share a resource provided by a plan p3, then
p1 and p2 should appear in the graph as siblings with a common ancestor p3;

3. the representation must support multiple views of the program as the interaction of plans
at various levels of abstraction, since interleaving may occur at any level of abstraction.
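One way to picture a representation satisfying these criteria is a graph in which the only ordering constraints are the explicit data flows, so components of independent plans remain unordered. The toy sketch below illustrates the idea only; it is not the Plan Calculus.

    from dataclasses import dataclass, field
    from typing import Set, Tuple

    @dataclass
    class PlanGraph:
        # Plan components are nodes; explicit dataflow edges impose only a
        # partial execution order, so independent plans stay unordered.
        nodes: Set[str] = field(default_factory=set)
        dataflow: Set[Tuple[str, str]] = field(default_factory=set)  # producer, consumer

        def add_flow(self, producer: str, consumer: str) -> None:
            self.nodes |= {producer, consumer}
            self.dataflow.add((producer, consumer))

        def unordered(self, a: str, b: str) -> bool:
            # Two components are independent if neither reaches the other.
            return not self._reaches(a, b) and not self._reaches(b, a)

        def _reaches(self, src: str, dst: str) -> bool:
            seen, stack = set(), [src]
            while stack:
                n = stack.pop()
                if n == dst:
                    return True
                if n not in seen:
                    seen.add(n)
                    stack.extend(c for p, c in self.dataflow if p == n)
            return False

    # Two plans sharing the resource produced by 'scale' appear as siblings
    # with a common ancestor, yet remain mutually unordered.
    g = PlanGraph()
    g.add_flow("scale", "nearest-point-plan")
    g.add_flow("scale", "distance-plan")
    assert g.unordered("nearest-point-plan", "distance-plan")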

An existing formalism that meets these criteria is Rich's Plan Calculus (Rich, 1981,
Rich, 1981, Rich and Waters, 1990). A plan in the Plan Calculus is encoded as a graphi-
cal depiction of the plan's structural parts and the constraints (e.g., data and control flow
connections) between them. This diagrammatic notation is complemented with an axiom-
atized description of the plan that defines its formal semantics. This allows us to develop
correctness preserving transformations to extract interleaved plans. The Plan Calculus also
provides a mechanism, called overlays, for representing correspondences and relationships
between pairs of plans (e.g., implementation and optimization relationships). This enables
the viewing of plans at multiple levels of abstraction. Overlays also support a general notion
of plan composition which takes into account resource sharing at all levels of abstraction
by allowing overlapping points of view.

4.2. Exploiting Application Knowledge

Most of the current technology available to help understand programs addresses implemen-
tation questions; that is, it is driven by the syntactic structure of programs written in some
programming language. But the tasks that require the understanding - perfective, adaptive,
and corrective maintenance - are driven by the problem the program is solving; that is,
its application domain. For example, if a maintenance task requires extending NPEDLN to
handle symmetric situations where more than one "nearest point" to a line exist, then the
programmer needs to figure out what to do about the distance calculation also computed by
NPEDLN. Why was DIST computed inside of the routine instead of separately? Was it only for
efficiency reasons, or might the nearest point and the distance be considered a pair of results
by its callers? In the former case, a single DIST return value is still appropriate; in the latter,
a pair of identical values is indicated. To answer questions like these, programmers need
to know which plans pieces of code are implementing. And this sort of plan knowledge
derives from understanding the application area, not the program.
Another example from NPEDLN concerns reformulation wrappers. These plans are inher-
ently delocalized. In fact, they only make sense as plans at all when considered in the
context of the application: stable computations of solar system geometry. Without this
understanding, the best hope is to recognize that the code has uniformly applied a function
and its inverse in two places, without knowing why this was done and how the computations
are connected.
The underlying issue is that any scheme for code understanding based solely on a top-down
or a bottom-up approach is inherently limited. As illustrated by the examples, a bottom-up
approach cannot hope to relate delocalized segments or disentangle interleavings without
being able to relate to the application goals. And a top-down approach cannot hope to find
where a plan is implemented without being able to understand how plan implementations
are related syntactically and via dataflows. The implication is that a coordinated strategy
is indicated, where plans generate expectations that guide program analysis and program
analysis generates related segments that need explanation.

4.3. Scaling the Concept of Interleaving

We can characterize the ways interleaving manifests itself in source code along two spec-
trums. These form a possible design space of solutions to the interleaving problem and can
help relate existing techniques that might be applicable. One spectrum is the scope of the
interleaving, which can range from intraprocedural to interprocedural to object (clusters
of procedures and data) to architectural. The other spectrum is the structural mechanism
providing the interleaving, which may be naming, control, data, or protocol. Protocols are
global constraints, such as maintaining stack discipline or synchronization mechanisms for
cooperating processes. For example, the use of control flags is a control-based mechanism
for interleaving with interprocedural scope. The common iteration construct involved in
loop fusion is another control-based mechanism, but this interleaving has intraprocedural
scope. Reformulation wrappers use a protocol mechanism, usually at the intraprocedural
level, but they can have interprocedural scope. Multiple-inheritance is an example of a

data-centered interleaving mechanism with object scope. Interleaving at the scope of ob-
jects and architectures or involving global protocol mechanisms is not yet well understood.
Consequently, few mechanisms for detection and extraction currently exist in these areas.

4.4. Tool Support

We used the Software Refinery from Reasoning Systems in our analyses. This comprehen-
sive toolkit provides a set of language-specific browsers and analyzers, a parser generator,
a user interface builder, and an object-oriented repository for holding the results of anal-
yses. We made particular use of two other features of the toolkit. The first is called the
Workbench, and it provided pre-existing analyses for traditional graphs and reports such as
structure charts, dataflow diagrams, and cross reference lists. The results of the analyses
can be accessed from the repository using small, Refine language programs such as those
described in (Rugaber et al., 1995). The other feature we used was the Refine compiler,
which translates Refine programs into compiled Lisp.
The approach taken by the Refine language and tool suite has many advantages for
attacking problems like ours. The language itself combines features of imperative, object-
oriented, functional, and rule-based programming, thus providing flexibility and generality.
Of particular value to us are its rule-based constructs. Before-and-after condition patterns
define the properties of constructs without indicating how to find them. We had merely to
add a simple tree-walking routine to apply the rules to the abstract syntax tree. In addi-
tion to the rule-based features, Refine provides abstract data structures, such as sets, maps,
and sequences, which manage their own memory requirements, thereby reducing program-
mer work. The object-oriented repository further reduces programmer responsibility by
providing persistence and memory management.
We also take full advantage of Reasoning Systems' existing Fortran language model and
its structure chart analysis. These allowed us a running start on our analysis and provided a
robust handling of Fortran constructs that are not typically available from non-commercial
research tools.
We can see several ways in which the Refine approach can be extended. In particular, the
availability of other analyses, such as control flow graphs for Fortran and general dataflow
analysis, would prove useful. Robust dataflow analysis is particularly important to the
precision of precondition extraction.

5. Related Work

Techniques for detecting interleaving and disentangling interleaved plans are likely to build
on existing program comprehension and maintenance techniques.

5.1. The Role of Recognition

When what is interleaved is familiar (i.e., stereotypical, frequently used plans), cliche
recognition (e.g., (Hartman, 1991, Johnson, 1986, Kozaczynski and Ning, 1994, Letovsky,
1988, Quilici, 1994, Rich and Wills, 1990, Wills, 1992)) is a useful detection mechanism.
In fact, most recognition systems deal explicitly with the recognition of cliches that are
interleaved in specific ways with unrecognizable code or other cliches. One of the key
features of GRASPR (Wills, 1992), for instance, is its ability to deal with delocalization and
redistribution-type function sharing optimizations.
KBEmacs (Rich and Waters, 1990, Waters, 1979) uses a simple, special-purpose recogni-
tion strategy to segment loops within programs. This is based on detecting coarse patterns
of data and control flow at the procedural level that are indicative of common ways of con-
structing, augmenting, and interleaving iterative computations. For example, KBEmacs looks
for minimal sections of a loop body that have data flow feeding back only to themselves.
This decomposition enables a powerful form of abstraction, called temporal abstraction,
which views iterative computations as compositions of operations on sequences of values.
The recognition and temporal abstraction of iteration cliches is similarly used in GRASPR to
enable it to deal with generalized loop fusion forms of interleaving. Loop fusion is viewed
as redistribution of sequences of values and treated as any other redistribution optimization
(Wills, 1992).
Most existing cliche recognition systems tend to deal with interleaving involving data and
control mechanisms. Domain-based clustering, as explored by DM-TAG in the DESIRE system
(Biggerstaff et al., 1994), focuses on naming mechanisms, by keying in on the patterns of
linguistic idioms used in the program, which suggest the manifestations of domain concepts.
Mechanisms for dealing with specific types of interleaving have been explicitly built into
existing recognition systems. In the future, we envision recognition architectures that detect
not only familiar computational patterns, but also recognize familiar types of transforma-
tions or design decisions that went into constructing the program. Many existing cliche
recognition systems implicitly detect and undo certain types of interleaving design deci-
sions. However, this process is usually done with special-purpose procedural mechanisms
that are difficult to extend and that are viewed as having supporting roles to the cliche
recognition process, rather than as being an orthogonal form of recognition.

5.2. Disentangling Unfamiliar Plans

When what is interleaved is unfamiliar (i.e., novel, idiosyncratic, not repeatedly used plans),
other, non-recognition-based methods of delineation are needed. For example, slicing
(Weiser, 1981, Ning et al., 1994) is a widely-used technique for localizing functional com-
ponents by tracing through data dependencies within the procedural scope. Cluster analysis
(Biggerstaff et al., 1994, Hutchens and Basili, 1985, Schwanke, 1991, Schwanke, 1989) is
used to group related sections of code, based on the detection of shared uses of global
data, control paths, and names. However, clustering techniques can only provide limited
assistance by roughly delineating possible locations of functionally cohesive components.
Another technique, called "potpourri module detection" (Calliss and Cornelius, 1990), de-

tects modules that provide more than one independent service by looking for multiple proper
subgraphs in an entity-to-entity interconnection graph. These graphs show dependencies
among global entities within a single module. Presumably, the independent services reflect
separate plans in the code.
Research into automating data encapsulation has recently provided mechanisms for
hypothesizing possible locations of data plans at the object scope. For example, Bow-
didge and Griswold (Bowdidge and Griswold, 1994) use an extended data flow graph rep-
resentation, called a star diagram, to help programmers see all the uses of a particu-
lar data structure and to detect frequently occurring computations that are candidates
for abstract functions. Techniques have also been developed within the RE² project
(Canfora et al., 1993, Cimitile et al., 1994) for identifying candidate abstract data types and
their associated modules, based on the call graph and dominance relations. Further research
is required to develop techniques for extracting objects from pieces of data that have not
already been aggregated in programmer-defined data structures. For example, detecting
multiple pieces of data that are always used together might suggest candidates for data
aggregation (as, for example, in NPEDLN, where the input parameters A, B, and C are used as
a tuple representing an ellipsoid, and the outputs PNEAR and DIST represent a pair of results
related by interleaved, highly overlapping plans).

6. Conclusion

Interleaving is a commonly occurring phenomenon in the code that we have examined.
Although a particular instance may be the result of an intentional decision on the part
of a programmer trying to improve the efficiency of a program, it can nevertheless make
understanding the program more difficult for subsequent maintainers. In our studies we
have observed that interleaving typically involves the implementation of several independent
plans in one code segment, often so that a program resource could be shared among the
plans. The interleaving can, in turn, lead to each of the separate plan implementations being
spread out or delocalized throughout the segment.
To investigate the phenomenon of interleaving, we have studied a substantial collection
of production software, SPICELIB from the Jet Propulsion Laboratory. SPICELIB needs
to be clearly understood in order to support automated program generation as part of the
Amphion project, and we were able to add to the understanding by performing a variety
of interleaving-based analyses. The results of these studies reinforce our feelings that
interleaving is a useful concept when understanding is important, and that many instances
of interleaving can be detected by relatively straightforward tools.

Acknowledgments

Support for this research has been provided by ARPA (contract number NAG 2-890). We
are grateful to JPL's NAIF group for enabling our study of their SPICELIB software. We also
benefited from insightful discussions with Michael Lowry at NASA Ames Research Center
concerning this study and interesting future directions.

Appendix: NPEDLN with Some of Its Documentation

C$ Nearest point on ellipsoid to line.

      SUBROUTINE NPEDLN ( A, B, C, LINEPT, LINEDR, PNEAR, DIST )
      INTEGER            UBEL
      PARAMETER        ( UBEL = 9 )
      INTEGER            UBPL
      PARAMETER        ( UBPL = 4 )
      DOUBLE PRECISION   A
      DOUBLE PRECISION   B
      DOUBLE PRECISION   C
      DOUBLE PRECISION   LINEPT ( 3 )
      DOUBLE PRECISION   LINEDR ( 3 )
      DOUBLE PRECISION   PNEAR  ( 3 )
      DOUBLE PRECISION   DIST
      LOGICAL            RETURN
      DOUBLE PRECISION   CANDPL ( UBPL )
      DOUBLE PRECISION   CAND   ( UBEL )
      DOUBLE PRECISION   OPPDIR ( 3 )
      DOUBLE PRECISION   PRJPL  ( UBPL )
      DOUBLE PRECISION   MAG
      DOUBLE PRECISION   NORMAL ( 3 )
      DOUBLE PRECISION   PRJEL  ( UBEL )
      DOUBLE PRECISION   PRJPT  ( 3 )
      DOUBLE PRECISION   PRJNPT ( 3 )
      DOUBLE PRECISION   PT     ( 3, 2 )
      DOUBLE PRECISION   SCALE
      DOUBLE PRECISION   SCLA
      DOUBLE PRECISION   SCLB
      DOUBLE PRECISION   SCLC
      DOUBLE PRECISION   SCLPT  ( 3 )
      DOUBLE PRECISION   UDIR   ( 3 )
      INTEGER            I
      LOGICAL            FOUND  ( 2 )
      LOGICAL            IFOUND
      LOGICAL            XFOUND

      IF ( RETURN () ) THEN
         RETURN
      ELSE
         CALL CHKIN ( 'NPEDLN' )
      END IF

      CALL UNORM ( LINEDR, UDIR, MAG )
      IF ( MAG .EQ. 0 ) THEN
         CALL SETMSG ( 'Direction is zero vector.' )
         CALL SIGERR ( 'SPICE(ZEROVECTOR)' )
         CALL CHKOUT ( 'NPEDLN' )
         RETURN
      ELSE IF (      ( A .LE. 0.D0 )
     .          .OR. ( B .LE. 0.D0 )
     .          .OR. ( C .LE. 0.D0 )  ) THEN
         CALL SETMSG ( 'Semi-axes: A=#,B=#,C=#.' )
         CALL ERRDP  ( '#', A )
         CALL ERRDP  ( '#', B )
         CALL ERRDP  ( '#', C )
         CALL SIGERR ( 'SPICE(INVALIDAXISLENGTH)' )
         CALL CHKOUT ( 'NPEDLN' )
         RETURN
      END IF

C     Scale the semi-axes lengths for better numerical behavior.
C     If squaring any of the scaled lengths causes it to underflow
C     to zero, signal an error.  Otherwise scale the point on the
C     input line too.
      SCALE = MAX ( DABS(A), DABS(B), DABS(C) )
      SCLA  = A / SCALE
      SCLB  = B / SCALE
      SCLC  = C / SCALE

      IF (      ( SCLA**2 .LE. 0.D0 )
     .     .OR. ( SCLB**2 .LE. 0.D0 )
     .     .OR. ( SCLC**2 .LE. 0.D0 )  ) THEN
         CALL SETMSG ( 'Axis too small: A=#,B=#,C=#.' )
         CALL ERRDP  ( '#', A )
         CALL ERRDP  ( '#', B )
         CALL ERRDP  ( '#', C )
         CALL SIGERR ( 'SPICE(DEGENERATECASE)' )
         CALL CHKOUT ( 'NPEDLN' )
         RETURN
      END IF

      SCLPT(1) = LINEPT(1) / SCALE
      SCLPT(2) = LINEPT(2) / SCALE
      SCLPT(3) = LINEPT(3) / SCALE

C     Hand off the intersection case to SURFPT.  SURFPT determines
C     whether rays intersect a body, so we treat the line as a pair
C     of rays.
      CALL VMINUS ( UDIR, OPPDIR )
      CALL SURFPT ( SCLPT, UDIR,   SCLA, SCLB,
     .              SCLC, PT(1,1), FOUND(1) )
      CALL SURFPT ( SCLPT, OPPDIR, SCLA, SCLB,
     .              SCLC, PT(1,2), FOUND(2) )

      DO 50001
     .   I = 1, 2
         IF ( FOUND(I) ) THEN
            DIST = 0.0D0
            CALL VEQU   ( PT(1,I), PNEAR )
            CALL VSCL   ( SCALE, PNEAR, PNEAR )
            CALL CHKOUT ( 'NPEDLN' )
            RETURN
         END IF
50001 CONTINUE

C     Getting here means the line doesn't intersect the ellipsoid.
C     Find the candidate ellipse CAND.  NORMAL is a normal vector to
C     the plane containing the candidate ellipse.  Mathematically the
C     ellipse must exist; it's the intersection of an ellipsoid
C     centered at the origin and a plane containing the origin.  Only
C     numerical problems can prevent the intersection from being found.
      NORMAL(1) = UDIR(1) / SCLA**2
      NORMAL(2) = UDIR(2) / SCLB**2
      NORMAL(3) = UDIR(3) / SCLC**2
      CALL NVC2PL ( NORMAL, 0.D0, CANDPL )
      CALL INEDPL ( SCLA, SCLB, SCLC, CANDPL, CAND, XFOUND )
      IF ( .NOT. XFOUND ) THEN
         CALL SETMSG ( 'Candidate ellipse not found.' )
         CALL SIGERR ( 'SPICE(DEGENERATECASE)' )
         CALL CHKOUT ( 'NPEDLN' )
         RETURN
      END IF

C     Project the candidate ellipse onto a plane orthogonal to the
C     line.  We'll call the plane PRJPL and the projected ellipse
C     PRJEL.
      CALL NVC2PL ( UDIR, 0.D0, PRJPL )
      CALL PJELPL ( CAND, PRJPL, PRJEL )

C     Find the point on the line lying in the projection plane, and
C     then find the near point PRJNPT on the projected ellipse.  Here
C     PRJPT is the point on the line lying in the projection plane.
C     The distance between PRJPT and PRJNPT is DIST.
      CALL VPRJP  ( SCLPT, PRJPL, PRJPT )
      CALL NPELPT ( PRJPT, PRJEL, PRJNPT )
      DIST = VDIST ( PRJNPT, PRJPT )

C     Find the near point PNEAR on the ellipsoid by taking the inverse
C     orthogonal projection of PRJNPT; this is the point on the
C     candidate ellipse that projects to PRJNPT.  The output DIST was
C     computed in step 3 and needs only to be re-scaled.  The inverse
C     projection of PNEAR ought to exist, but may not be calculable
C     due to numerical problems (this can only happen when the
C     ellipsoid is extremely flat or needle-shaped).
      CALL VPRJPI ( PRJNPT, PRJPL, CANDPL, PNEAR, IFOUND )
      IF ( .NOT. IFOUND ) THEN
         CALL SETMSG ( 'Inverse projection not found.' )
         CALL SIGERR ( 'SPICE(DEGENERATECASE)' )
         CALL CHKOUT ( 'NPEDLN' )
         RETURN
      END IF

C     Undo the scaling.
      CALL VSCL ( SCALE, PNEAR, PNEAR )
      DIST = SCALE * DIST

      CALL CHKOUT ( 'NPEDLN' )
      RETURN
      END

C     Descriptions of subroutines called by NPEDLN:
C
C     CHKIN    Module Check In (error handling).
C     UNORM    Normalize double precision 3-vector.
C     SETMSG   Set Long Error Message.
C     SIGERR   Signal Error Condition.
C     CHKOUT   Module Check Out (error handling).
C     ERRDP    Insert DP Number into Error Message Text.
C     VMINUS   Negate a double precision 3-D vector.
C     SURFPT   Find intersection of vector w/ ellipsoid.
C     VEQU     Make one DP 3-D vector equal to another.
C     VSCL     Vector scaling, 3 dimensions.
C     NVC2PL   Make plane from normal and constant.
C     INEDPL   Intersection of ellipsoid and plane.
C     PJELPL   Project ellipse onto plane, orthogonally.
C     VPRJP    Project a vector onto plane orthogonally.
C     NPELPT   Find nearest point on ellipse to point.
C     VPRJPI   Vector projection onto plane, inverted.
C
C     Descriptions of NPEDLN's variables:
C
C     A        Length of semi-axis in the x direction.
C     B        Length of semi-axis in the y direction.
C     C        Length of semi-axis in the z direction.
C     LINEPT   Point on input line.
C     LINEDR   Direction vector of input line.
C     PNEAR    Nearest point on ellipsoid to line.
C     DIST     Distance of ellipsoid from line.
C     UBEL     Upper bound of array containing ellipse.
C     UBPL     Upper bound of array containing plane.
C     PT       Intersection point of line & ellipsoid.
C     CAND     Candidate ellipse.
C     CANDPL   Plane containing candidate ellipse.
C     NORMAL   Normal to the candidate plane CANDPL.
C     UDIR     Unitized line direction vector.
C     MAG      Magnitude of line direction vector.
C     OPPDIR   Vector in direction opposite to UDIR.
C     PRJPL    Projection plane, which the candidate
C              ellipse is projected onto to yield PRJEL.
C     PRJEL    Projection of the candidate ellipse
C              CAND onto the projection plane PRJPL.
C     PRJPT    Projection of line point.
C     PRJNPT   Nearest point on projected ellipse to
C              projection of line point.
C     SCALE    Scaling factor.

References

Aho, A., R. Sethi, and J. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, Reading, MA, 1986.
Basili, V.R. and H.D. Mills. Understanding and documenting programs. IEEE Transactions on Software Engineering, 8(3):270-283, May 1982.
Biggerstaff, T., B. Mitbander, and D. Webster. Program understanding and the concept assignment problem. Communications of the ACM, 37(5):72-83, May 1994.
Boehm, Barry. Software Engineering Economics. Prentice Hall, 1981.
Bowdidge, R. and W. Griswold. Automated support for encapsulating abstract data types. In Proc. 2nd ACM SIGSOFT Symposium on Foundations of Software Engineering, pages 97-110, New Orleans, Dec. 1994.
Brooks, R. Towards a theory of the comprehension of computer programs. International Journal of Man-Machine Studies, 18:543-554, 1983.
Calliss, F. and B. Cornelius. Potpourri module detection. In IEEE Conference on Software Maintenance - 1990, pages 46-51, San Diego, CA, November 1990. IEEE Computer Society Press.
Canfora, G., A. Cimitile, and M. Munro. A reverse engineering method for identifying reusable abstract data types. In Proc. of the First Working Conference on Reverse Engineering, pages 73-82, Baltimore, Maryland, May 1993. IEEE Computer Society Press.
Cimitile, A., M. Tortorella, and M. Munro. Program comprehension through the identification of abstract data types. In Proc. 3rd Workshop on Program Comprehension, pages 12-19, Washington, D.C., November 1994. IEEE Computer Society Press.
Fjeldstad, R.K. and W.T. Hamlen. Application program maintenance study: Report to our respondents. In GUIDE 48, April 1979. Also appears in (Parikh and Zvegintozov, 1983).
Hall, R. Program improvement by automatic redistribution of intermediate results. Technical Report 1251, MIT Artificial Intelligence Lab., February 1990. PhD thesis.
Hall, R. Program improvement by automatic redistribution of intermediate results: An overview. In M. Lowry and R. McCartney, editors, Automating Software Design. AAAI Press, Menlo Park, CA, 1991.
Hartman, J. Automatic control understanding for natural programs. Technical Report AI91-161, University of Texas at Austin, 1991. PhD thesis.
Hutchens, D. and V. Basili. System structure analysis: Clustering with data bindings. IEEE Transactions on Software Engineering, 11(8), August 1985.
Reasoning Systems Incorporated. Software Refinery Toolkit. Palo Alto, CA.
Johnson, W.L. Intention-Based Diagnosis of Novice Programming Errors. Morgan Kaufmann Publishers, Inc., Los Altos, CA, 1986.
Kozaczynski, W. and J.Q. Ning. Automated program understanding by concept recognition. Automated Software Engineering, 1(1):61-78, March 1994.
Letovsky, S. Plan analysis of programs. Research Report 662, Yale University, December 1988. PhD thesis.
Letovsky, S. and E. Soloway. Delocalized plans and program comprehension. IEEE Software, 3(3), 1986.
Lowry, M., A. Philpot, T. Pressburger, and I. Underwood. Amphion: Automatic programming for subroutine libraries. In Proc. 9th Knowledge-Based Software Engineering Conference, pages 2-11, Monterey, CA, 1994.
Lowry, M., A. Philpot, T. Pressburger, and I. Underwood. A formal approach to domain-oriented software design environments. In Proc. 9th Knowledge-Based Software Engineering Conference, pages 48-57, Monterey, CA, 1994.
Myers, G. Reliable Software through Composite Design. Petrocelli Charter, 1975.
Ning, J.Q., A. Engberts, and W. Kozaczynski. Automated support for legacy code understanding. Communications of the ACM, 37(5):50-57, May 1994.
Ornburn, S. and S. Rugaber. Reverse engineering: Resolving conflicts between expected and actual software designs. In IEEE Conference on Software Maintenance - 1992, pages 32-40, Orlando, Florida, November 1992.
Parikh, G. and N. Zvegintozov, editors. Tutorial on Software Maintenance. IEEE Computer Society, 1983. Order No. EM453.
Quilici, A. A memory-based approach to recognizing programming plans. Communications of the ACM, 37(5):84-93, May 1994.
Rich, C. A formal representation for plans in the Programmer's Apprentice. In Proc. 7th International Joint Conference on Artificial Intelligence, pages 1044-1052, Vancouver, British Columbia, Canada, August 1981.
Rich, C. Inspection methods in programming. Technical Report 604, MIT Artificial Intelligence Lab., June 1981. PhD thesis.
Rich, C. and R.C. Waters. The Programmer's Apprentice. Addison-Wesley, Reading, MA and ACM Press, Baltimore, MD, 1990.
Rich, C. and L.M. Wills. Recognizing a program's design: A graph-parsing approach. IEEE Software, 7(1):82-89, January 1990.
Rugaber, S., S. Ornburn, and R. LeBlanc. Recognizing design decisions in programs. IEEE Software, 7(1):46-54, January 1990.
Rugaber, S., K. Stirewalt, and L. Wills. Detecting interleaving. In IEEE Conference on Software Maintenance - 1995, pages 265-274, Nice, France, September 1995. IEEE Computer Society Press.
Rugaber, S., K. Stirewalt, and L. Wills. The interleaving problem in program understanding. In Proc. of the Second Working Conference on Reverse Engineering, pages 166-175, Toronto, Ontario, July 1995. IEEE Computer Society Press.
Schwanke, R. An intelligent tool for re-engineering software modularity. In IEEE Conference on Software Maintenance - 1991, pages 83-92, 1991.
Schwanke, R., R. Altucher, and M. Platoff. Discovering, visualizing, and controlling software structure. In Proc. 5th Int. Workshop on Software Specification and Design, pages 147-150, Pittsburgh, PA, 1989.
Selfridge, P., R. Waters, and E. Chikofsky. Challenges to the field of reverse engineering - A position paper. In Proc. of the First Working Conference on Reverse Engineering, pages 144-150, Baltimore, Maryland, May 1993. IEEE Computer Society Press.
Smith, D., G. Kotik, and S. Westfold. Research on knowledge-based software environments at Kestrel Institute. IEEE Transactions on Software Engineering, November 1985.
Soloway, E. and K. Ehrlich. Empirical studies of programming knowledge. IEEE Transactions on Software Engineering, 10(5):595-609, September 1984. Reprinted in C. Rich and R.C. Waters, editors, Readings in Artificial Intelligence and Software Engineering, Morgan Kaufmann, 1986.
Stickel, M., R. Waldinger, M. Lowry, T. Pressburger, I. Underwood, and A. Bundy. Deductive composition of astronomical software from subroutine libraries. In Proc. 12th International Conference on Automated Deduction, pages 341-355, Nancy, France, 1994.
Waters, R.C. A method for analyzing loop programs. IEEE Transactions on Software Engineering, 5(3):237-247, May 1979.
Weiser, Mark. Program slicing. In 5th International Conference on Software Engineering, pages 439-449, San Diego, CA, March 1981.
Wills, L. Automated program recognition by graph parsing. Technical Report 1358, MIT Artificial Intelligence Lab., July 1992. PhD thesis.
Yourdon, E. and L. Constantine. Structured Design: Fundamentals of a Discipline of Computer Program and Systems Design. Prentice-Hall, 1979.
Automated Software Engineering, 3, 77-108 (1996)
© 1996 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Pattern Matching for Clone and Concept Detection *


K. A. KONTOGIANNIS, R. DE MORI, E. MERLO, M. GALLER, M. BERNSTEIN

kostas@cs.mcgill.ca
McGill University School of Computer Science 3480 University St., Room 318, Montreal, Canada H3A 2A7

Abstract.
A legacy system is an operational, large-scale software system that is maintained beyond its first generation
of programmers. It typically represents a massive economic investment and is critical to the mission of the
organization it serves. As such systems age, they become increasingly complex and brittle, and hence harder to
maintain. They also become even more critical to the survival of their organization because the business rules
encoded within the system are seldom documented elsewhere.
Our research is concerned with developing a suite of tools to aid the maintainers of legacy systems in recovering
the knowledge embodied within the system. The activities, known collectively as "program understanding", are
essential preludes for several key processes, including maintenance and design recovery for reengineering.
In this paper we present three pattern-matching techniques: source code metrics, a dynamic programming
algorithm for finding the best alignment between two code fragments, and a statistical matching algorithm between
abstract code descriptions represented in an abstract language and actual source code. The methods are applied to
detect instances of code cloning in several moderately-sized production systems including tcsh, bash, and CLIPS.
The programmer's skill and experience are essential elements of our approach. Selection of particular tools and
analysis methods depends on the needs of the particular task to be accomplished. Integration of the tools provides
opportunities for synergy, allowing the programmer to select the most appropriate tool for a given task.

Keywords: reverse engineering, pattern matching, program understanding, software metrics, dynamic program-
ming

1. Introduction

Large-scale production software systems are expensive to build and, over their useful life-
times, are even more expensive to maintain. Successful large-scale systems are often called
"legacy systems" because (a) they tend to have been in service for many years, (b) the
original developers, in the normal course of events, move on to other projects, leaving the
system to be maintained by successive generations of maintenance programmers, and (c)
the systems themselves represent enormous corporate assets that cannot be easily replaced.
Legacy systems are intrinsically difficult to maintain because of their sheer bulk and
because of the loss of historical information: design documentation is seldom maintained
as the system evolves. In many cases, the source code becomes the sole repository for
evolving corporate business rules.

* This work is in part supported by IBM Canada Ltd., Institute for Robotics and Intelligent Systems, a Canadian
Network of Centers of Excellence and, the Natural Sciences and Engineering Research Council of Canada.
Based on "Pattern Matching for Design Concept Localization" by K.A.Kontogiannis, R.DeMori, M.Bernstein,
M.Galler, E.Merlo, whichfirstappeared in Proceedings of the Second Working Conference on Reverse Enginering,
pp.96-103, July, 1995, © IEEE, 1995

During system maintenance, it is often necessary to move from low, implementation-
oriented levels of abstraction back to the design and even the requirements levels. The
process is generally known as "reverse engineering".^ In (Chikofsky, 1990) there are def-
initions for a variety of subtasks, including "reengineering", "restructuring", and "redocu-
mentation".
In particular, it has been estimated that 50 to 90 percent of the maintenance programmer's
effort is devoted to simply understanding relationships within the program. The average
Fortune 100 company maintains 35 million lines of source code (MLOC) with a growth rate
of 10 percent per year just in enhancements, updates, and normal maintenance. Facilitating
the program understanding process can yield significant economic savings.
We believe that maintaining a large legacy software system is an inherently human activity
that requires knowledge, experience, taste, judgement and creativity. For the foreseeable
future, no single tool or technique will replace the maintenance programmer nor even
satisfy all of the programmer's needs. Evolving real-world systems requires pragmatism
and flexibility.
Our approach is to provide a suite of complementary tools from which the programmer
can select the most appropriate one for the specific task at hand. An integration framework
enables exploitation of synergy by allowing communication among the tools.
Our research is part of a larger joint project with researchers from IBM Centre for Ad-
vanced Studies, University of Toronto, and University of Victoria (Buss et al., 1994)
Over the past three years, the team has been developing a toolset, called RevEngE (Reverse
Engineering Environment), based on an open architecture for integrating heterogeneous
tools. The toolset is integrated through a common repository specifically designed to
support program understanding (Mylopoulos, 1990). Individual tools in the kit include
Ariadne (Konto, 1994), ART (Johnson, 1993), and Rigi (Tilley, 1994). ART (Analysis of
Redundancy in Text) is a prototype textual redundancy analysis system. Ariadne is a
set of pattern matching and design recovery programs implemented using a commercial
tool called The Software Refinery^. Currently we are working on another version of the
Ariadne environment implemented in C++. Rigi is a programmable environment for pro-
gram visualization. The tools communicate through a flexible object server and single
global schema implemented using the Telos information modeling language and repository
(Mylopoulos, 1990).
In this paper we describe two types of pattern-matching techniques and discuss why
pattern matching is an essential tool for program understanding. The first type is based on
numerical comparison of selected metric values that characterize and classify source code
fragments.
The second type is based on Dynamic Programming techniques that allow for statement-
level comparison of feature vectors that characterize source code program statements. Con-
sequently, we apply these techniques to address two types of relevant program understanding
problems.
The first one is a comparison between two different program segments to see if one is
a clone of the other, that is if the two segments are implementations of the same algo-
rithm. The problem is in theory undecidable, but in practice it is very useful to provide
software maintainers with a tool that detects similarities between code segments. Similar

segments are proposed to the software engineer who will make the final decision about their
modification or other use.
The second problem is the recognition of program segments that implement a given
progranmiing concept. We address this problem by defining a concept description language
called ACL and by applying statement-level comparison between feature vectors of the
language and feature vectors of source code program statements.

1.1. The Code Cloning Problem

Source code cloning occurs when a developer reuses existing code in a new context by
making a copy that is altered to provide new functionality. The practice is widespread
among developers and occurs for several reasons: making a modified copy may be simpler
than trying to exploit commonality by writing a more general, parameterized function;
scheduling pressures may not allow the time required to generalize the code; and efficiency
constraints may not admit the extra overhead (real or perceived) of a generalized routine.
In the long run, code cloning can be a costly practice. Firstly, it results in a program that is
larger than necessary, increasing the complexity that must be managed by the maintenance
programmer and increasing the size of the executable program, requiring larger computers.
Secondly, when a modification is required (for example, due to bug fixes, enhancements,
or changes in business rules), the change must be propagated to all instances of the clone.
Thirdly, often-cloned functionality is a prime candidate for repackaging and generaliza-
tion for a repository of reusable components which can yield tremendous leverage during
development of new applications.
This paper introduces new techniques for detecting instances of source code cloning.
Program features based on software metrics are proposed. These features apply to basic
program segments like individual statements, begin-end blocks and functions. Distances
between program segments can be computed based on feature differences. This paper
proposes two methods for addressing the code cloning detection problem.
The first is based on direct comparison of metric values that classify a given code fragment.
The granularity for selecting and comparing code fragments is at the level of begin-end
blocks. This method returns clusters of begin-end blocks that may be products of cut-
and-paste operations.
The second is based on a new Dynamic Programming (DP) technique that is used to
calculate the best alignment between two code fragments in terms of deletions, insertions
and, substitutions. The granularity for selecting code fragments for comparison is again
at the level of begin-end blocks. Once two begin-end blocks have been selected, they
are compared at the statement level. This method returns clusters of begin-end blocks
that may be products of cut-and-paste operations. The DP approach provides in general,
more accurate results (i.e. less false positives) than the one based on direct comparison of
metric values at the begin-end block level. The reason is that comparison occurs at the
statement level and informal information is taken into account (i.e. variable names, literal
strings and numbers).

1.2. The Concept Recognition Problem

Programming concepts are described by a concept language. A concept to be recognized
is a phrase of the concept language. Concept descriptions and source code are parsed. The
concept recognition problem becomes the problem of establishing correspondences, as in
machine translation, between a parse tree of the concept description language and the parse
tree of the code.
A new formalism is proposed to see the problem as a stochastic syntax-directed translation.
Translation rules are pairs of rewriting rules and have an associated probability that can be
set initially to uniform values for all the possible alternatives.
Matching of concept representations and source code representations involves alignment
that is again performed using a dynamic programming algorithm that compares feature
vectors of concept descriptions, and source code.
The proposed concept description language models insertions as wildcard characters
(AbstractStatement* and AbstractStatement+) and does not allow any deletions from
the pattern. The comparison and selection granularity is at the statement level. Comparison
of a concept description language statement with a source code statement is achieved by
comparing feature vectors (i.e. metrics, variables used, variables defined and keywords).
Given a concept description M = A1; A2; ... Am, a code fragment V = S1; S2; ... Sk is
selected for comparison if: a) the first concept description statement A1 matches with S1,
and b) the sequence of statements S2; ... Sk belongs to the innermost begin-end block
containing S1.
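A paraphrase of this selection rule, as an illustrative sketch only: the statement representation is invented, and a trivial keyword test stands in for the feature-vector comparison described later in the paper.

    from typing import Callable, List, Optional, Sequence

    def select_candidate(concept: Sequence[str],
                         block: Sequence[str],
                         matches: Callable[[str, str], bool]) -> Optional[List[str]]:
        # A fragment S1..Sk is a candidate for concept description A1..Am when
        # A1 matches S1; the remaining statements S2..Sk are simply the rest
        # of the innermost begin-end block containing S1.
        if not concept or not block:
            return None
        if not matches(concept[0], block[0]):
            return None
        return list(block)

    # Toy matcher: a concept statement matches a code statement when they
    # share a leading keyword.
    toy = lambda a, s: a.split()[0] in s
    print(select_candidate(["while ...", "AbstractStatement*"],
                           ["while (i < n) {", "x = x + a[i];", "i++;", "}"],
                           toy))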
The use of a statistical formalism allows a score (a probability) to be assigned to every
match that is attempted. Incomplete or imperfect matching is also possible leaving to the
software engineer the final decision on the similar candidates proposed by the matcher.
A way of dynamically updating matching probabilities as new data are observed is also
suggested in this paper. Concept-to-code matching is under testing and optimization. It
has been implemented using the REFINE environment and supports plan localization in C
programs.

1.3. Related Work

A number of research teams have developed tools and techniques for localizing specific
code patterns.
The UNIX operating system provides numerous tools based on regular expressions both
for matching and code replacement. Widely-used tools include grep, awk, ed and vi.
These tools are very efficient in localizing patterns but do not provide any way for partial
and hierarchical matching. Moreover, they do not provide any similarity measure between
the pattern and the input string.
Other tools have been developed to browse source code and query software repositories
based on structure, permanent relations between code fragments, keywords, and control or
dataflow relationships. Such tools include CIA, Microscope, Rigi, SCAN, and REFINE.
These tools are efficient on representing and storing in local repositories relationships
between program components. Moreover, they provide effective mechanisms for querying

and updating their local repositories. However, they do not provide any other mechanism
to localize code fragments except the stored relations. Moreover no partial matching and
no similarity measures between a query and a source code entity can be calculated.
Code duplication systems use a variety of methods to localize a code fragment given a
model or a pattern. One category of such tools uses structure graphs to identify the "fingerprint"
of a program (Jankowitz, 1988). Other tools use metrics to detect code patterns (McCabe,
1990), (Halstead, 1977), common dataflow (Horwitz, 1990), approximate fingerprints from
program text files (Johnson, 1993), text comparison enhanced with heuristics for approxi-
mate and partial matching (Baker, 1995), and text comparison tools such as Unix diff.
The closest tool to the approach discussed in this paper, is SCRUPLE (Paul, 1994).
The major improvement of the solution proposed here is a) the possibility of performing
partial matching with feature vectors, providing similarity measures between a pattern and
a matched code fragment, and b) the ability to perform hierarchical recognition. In this
approach, explicit concepts such as iterative-statement can be used, allowing for
multiple matches with a While, a For, or a Do statement in the code. Moreover, recognized
patterns can be classified, and stored so that they can be used inside other more complex
composite patterns. An expansion process is used for unwrapping the composite pattern
into its components.

2. Code to Code Matching

In this section we discuss pattern-matching algorithms applied to the problem of clone
detection. Determining whether two arbitrary program functions have identical behavior
is known to be undecidable in the general case. Our approach to clone detection exploits
the observation that clone instances, by their nature, should have a high degree of structural
similarity. We look for identifiable characteristics or features that can be used as a signature
to categorize arbitrary pieces of code.
The work presented here uses feature vectors to establish similarity measures. Features
examined include metric values and specific data- and control-flow properties. The analysis
framework uses two approaches:
1. direct comparison of metric values between begin-end blocks, and
2. dynamic programming techniques for comparing begin-end blocks at a statement-
by-statement basis.
Metric-value similarity analysis is based on the assumption that two code fragments C1
and C2 have metric values M(C1) and M(C2) for some source code metric M. If the two
fragments are similar under the set of features measured by M, then the values of M(C1)
and M(C2) should be proximate.
Program features relevant for clone detection focus on data and control flow program
properties. Modifications of five widely used metrics (Adamov, 1987), (Buss et al., 1994),
whose components exhibit low correlation (based on the Spearman-Pearson corre-
lation test), were selected for our analyses:
1. The number of functions called (fanout);

2. The ratio of input/output variables to the fanout;


3. McCabe cyclomatic complexity;
4. Modified Albrecht's function point metric;
5. Modified Henry-Kafura's information flow quality metric.
Detailed descriptions and references for metrics will be given later on in this section.
Similarity of two code fragments is measured using the resulting 5-dimensional vector.
Two methods of comparing metric values were used. The first, naive approach, is to make
O(n²) pairwise comparisons between code fragments, evaluating the Euclidean distance
of each pair. A second, more sophisticated analytical approach was to form clusters by
comparing values on one or more axes in the metric space.
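As a rough illustration of the naive approach, the sketch below compares 5-dimensional metric vectors with a Euclidean distance and a user-supplied threshold; the block names and values are invented.

    from itertools import combinations
    from math import dist                      # Euclidean distance (Python 3.8+)
    from typing import Dict, List, Sequence, Tuple

    def clone_candidates(blocks: Dict[str, Sequence[float]],
                         threshold: float) -> List[Tuple[str, str, float]]:
        # Naive O(n^2) detection: report pairs of begin-end blocks whose
        # 5-dimensional metric vectors lie within 'threshold' of each other.
        pairs = []
        for b1, b2 in combinations(sorted(blocks), 2):
            d = dist(blocks[b1], blocks[b2])
            if d <= threshold:
                pairs.append((b1, b2, d))
        return pairs

    # Toy vectors: (S_COMPLEXITY, D_COMPLEXITY, MCCABE, ALBRECHT, KAFURA).
    vectors = {"blk_17": (4.0, 0.5, 3.0, 22.0, 36.0),
               "blk_42": (4.0, 0.5, 3.0, 21.0, 36.0),
               "blk_90": (9.0, 2.0, 7.0, 55.0, 144.0)}
    print(clone_candidates(vectors, threshold=2.0))   # blk_17 paired with blk_42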
The selection of the blocks to be compared is based on the proximity of their metric value
similarity in a selected metric axis. Specifically, when the source code is parsed an Abstract
Syntax Tree (AST) Tc is created, five different metrics are calculated compositionally for
every statement, block, function, and file of the program and are stored as annotations in
the corresponding nodes of the AST. Once metrics have been calculated and annotations
have been added, a reference table is created that contains source code entities sorted by
their corresponding metric values. This table is used for selecting the source code entities
to be matched based on their metric proximity. The comparison granularity is at the level
of a begin-end block of length more than n lines long, where n is a parameter provided
by the user.
In addition to the direct metric comparison techniques, we use dynamic programming
techniques to calculate the best alignment between two code fragments based on insertion,
deletion and comparison operations. Rather than working directly with textual representa-
tions, source code statements, as opposed to begin-end blocks, are abstracted into feature
sets that classify the given statement. The features per statement used in the Dynamic
Programming approach are:
• Uses of variables, definitions of variables, numerical literals, and strings;
• Uses and definitions of data types;
• The five metrics as discussed previously.
Dynamic programming (DP) techniques detect the best alignment between two code
fragments based on insertion, deletion and comparison operations. Two statements match
if they define and use the same variables, strings, and numerical literals. Variations in these
features provide a dissimilarity value used to calculate a global dissimilarity measure of
more complex and composite constructs such as begin-end blocks and functions. The
comparison function used to calculate dissimilarity measures is discussed in detail in Section
2.3. Heuristics have been incorporated in the matching process to facilitate variations that
may have occurred in cut and paste operations. In particular, the following heuristics are
currently considered:

• Adjustments between variable names by considering lexicographical distances;



• Filtering out short and trivial variable names such as i and j which are typically used
for temporary storage of intermediate values, and as loop index values. In the current
implementation, only variable names of more than three characters long are considered.
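The alignment itself is a conventional edit-distance computation over statement feature vectors. The sketch below is a simplified stand-in for the comparison function of Section 2.3: the feature encoding and the unit insertion and deletion costs are assumptions made for illustration.

    from typing import Sequence, Set

    Features = Set[str]        # e.g. names used/defined, literals, keywords

    def statement_dissimilarity(a: Features, b: Features) -> float:
        # Substitution cost: the fraction of features the two statements
        # do not share.
        union = a | b
        return 0.0 if not union else len(a ^ b) / len(union)

    def block_dissimilarity(x: Sequence[Features], y: Sequence[Features],
                            ins: float = 1.0, dele: float = 1.0) -> float:
        # Dynamic-programming alignment of two begin-end blocks, statement by
        # statement, in terms of insertions, deletions and substitutions.
        n, m = len(x), len(y)
        d = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            d[i][0] = d[i - 1][0] + dele
        for j in range(1, m + 1):
            d[0][j] = d[0][j - 1] + ins
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d[i][j] = min(d[i - 1][j] + dele,
                              d[i][j - 1] + ins,
                              d[i - 1][j - 1]
                              + statement_dissimilarity(x[i - 1], y[j - 1]))
        return d[n][m]

    # Two near-identical blocks that differ only in a renamed variable
    # receive a low global dissimilarity.
    blk1 = [{"def:total", "use:price"}, {"def:total", "use:total", "use:tax"}]
    blk2 = [{"def:sum",   "use:price"}, {"def:sum",   "use:sum",   "use:tax"}]
    print(block_dissimilarity(blk1, blk2))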

Dynamic programming is a more accurate method than the direct metric comparison
based analysis (Fig. 2) because the comparison of the feature vector is performed at
the statement level. Code fragments are selected for Dynamic Programming comparison
by preselecting potential clone candidates using the direct metric comparison analysis.
Within this framework only the begin-end blocks that have a dissimilarity measure less
than a given threshold are considered for DP comparison. This preselection reduces the
comparison space for the more computationally expensive DP match.
The following sections further discuss these approaches and present experimental results
from analyzing medium scale (< 100 kLOC) software systems.

2.1. Program Representation and the Development of the Ariadne Environment

The foundation of the Ariadne system is a program representation scheme that allows for
the calculation of the feature vectors for every statement, block or function of the source
code. We use an object-oriented annotated abstract syntax tree (AST). Nodes of the AST
are represented as objects in a LISP-based development environment^.
Creating the annotated AST is a three-step process. First, a grammar and object (domain)
model must be written for the programming language of the subject system. The tool
vendor has parsers available for such common languages as C and COBOL. Parsers for
other languages may be easily constructed or obtained through the user community. The
domain model defines object-oriented hierarchies for the AST nodes in which, for example,
an If-Statement and a While-Statement are defined to be subclasses of the Statement class.
The second step is to use the parser on the subject system to construct the AST repre-
sentation of the source code. Some tree annotations, such as linkage information and the
call graph are created automatically by the parser. Once the AST is created, further steps
operate in an essentially language-independent fashion.
The final step is to add additional annotations into the tree for information on data types,
dataflow (dataflow graphs), the results of external analysis, and links to informal informa-
tion. Such information is typically obtained using dataflow analysis algorithms similar to
the ones used within compilers.
For example, consider the following code fragment from an IBM-proprietary PL/1-like
language. The corresponding AST representation for the if statement is shown in Fig. 1.
The tree is annotated with the fan-out attribute which has been determined during an analysis
phase following the initial parse.

MAIN: PROCEDURE(OPTION);
DCL OPTION FIXED(31);
IF (OPTION>0) THEN
CALL SHOW_MENU(OPTION);
ELSE

CALL SHOW_ERROR("Invalid option number");


END MAIN;

[Figure 1: the annotated AST for the IF statement, with nodes for the condition (OPTION > 0), the then-branch call to SHOW_MENU(OPTION), and the else-branch call to SHOW_ERROR("Invalid option..."); the call nodes carry integer-valued fanout attributes.]
Figure 1. The AST for an IF Statement With Fanout Attributes.

2.2. Metrics Based Similarity Analysis

Metrics based similarity analysis uses five source-code metrics that are sensitive to several
different control and data flow program features. Metric values are computed for each
statement, block, and function. Empirical analysis ^ (Buss et al., 1994) shows the metrics
components have low correlation, so each metric adds useful information.
The features examined for metric computation include:
• Global and local variables defined or used;

• Functions called;
• Files accessed;
• I/O operations (read, write operations);
• Defined/used parameters passed by reference and by value;
• Control flow graph.
Partial matching may occur because the metrics are not sensitive to variable names, source
code white space, and minor modifications such as replacement of while with for loops
and insertion of statements that do not alter the basic data and control flow of the original
code structure.
A description of the five modified metrics used is given below; more detailed descriptions
can be found in (Adamov, 1987), (Fenton, 1991), and (Moller, 1993).
Let s be a code fragment. Note that these metrics are computed compositionally from
statements, to begin-end blocks, functions, and files.
1. S_COMPLEXITY(s) = FAN_OUT(s)^2
where
• FAN_OUT(s) is the number of individual function calls made within s.

2. D_COMPLEXITY(s) = GLOBALS(s) / (FAN_OUT(s) + 1)
where
• GLOBALS(s) is the number of individual declarations of global variables used or updated
within s. A global variable is a variable which is not declared in the code fragment s.

3. MCCABE(s) = e - n + 2
where
• e is the number of edges in the control flow graph
• n is the number of nodes in the graph.
Alternatively, the McCabe metric can be calculated using
• MCCABE(s) = 1 + d, where d is the number of control decision predicates in s.

4. ALBRECHT(s) = p1 * VARS_USED_AND_SET(s) +
                 p2 * GLOBAL_VARS_SET(s) +
                 p3 * USER_INPUT(s) +
                 p4 * FILE_INPUT(s)
where,
• VARS_USED_AND_SET(s) is the number of data elements set and used in the statement s,
• GLOBAL_VARS_SET(s) is the number of global data elements set in the statement s,
• USER_INPUT(s) is the number of read operations in statement s,
• FILE_INPUT(s) is the number of files accessed for reading in s.
The factors p1, ..., p4 are weight factors; possible values for these factors are given in
(Adamov, 1987). In the current implementation the values chosen are p1 = 5, p2 = 4, p3 = 4
and p4 = 7. The selection of (non-zero) values for the pi's does not affect the matching process.

5. KAFURA(s) = (KAFURA_IN(s) * KAFURA_OUT(s))^2
where,
• KAFURA_IN(s) is the sum of local and global incoming dataflow to the code fragment s.
• KAFURA_OUT(s) is the sum of local and global outgoing dataflow from the code fragment s.
Once the five metrics M1 to M5 are computed for every statement, block and function
node, the pattern matching process is fast and efficient. It is simply the comparison of
numeric values.
We have experimented with two techniques for calculating similar code fragments in a
software system.
The first one is based on pairwise Euclidean distance comparison of all begin-end
blocks that are more than n lines long, where n is a parameter given by the user.
In a large software system, however, there are many begin-end blocks, and an exhaustive
pairwise comparison is not feasible because of time and space limitations. Instead, we limit
the pairwise comparison to those begin-end blocks whose metric values differ by less than
a given threshold di on a selected metric axis Mi. In this way every block is compared only
with its close metric neighbors.
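This pairwise comparison can be pictured with a small sketch in C. The MetricVector type, the field names and the threshold parameters below are assumptions made for the illustration; the Ariadne implementation itself is built on a LISP-based AST environment.

#include <math.h>

#define NUM_METRICS 5

/* Hypothetical container for the five metric values (M1..M5)
   computed for one begin-end block. */
typedef struct {
    int    block_id;
    double m[NUM_METRICS];  /* S_COMPLEXITY, D_COMPLEXITY, MCCABE, ALBRECHT, KAFURA */
} MetricVector;

/* Euclidean distance between the metric vectors of two blocks. */
static double metric_distance(const MetricVector *a, const MetricVector *b)
{
    double sum = 0.0;
    for (int k = 0; k < NUM_METRICS; k++) {
        double diff = a->m[k] - b->m[k];
        sum += diff * diff;
    }
    return sqrt(sum);
}

/* Pairwise comparison restricted to "close metric neighbors": two blocks are
   compared only if they differ by less than d_i on the selected metric axis;
   pairs whose full Euclidean distance is below clone_threshold are reported
   as potential clones. */
static void report_clone_candidates(const MetricVector *blocks, int n,
                                    int axis, double d_i, double clone_threshold,
                                    void (*report)(int id1, int id2, double dist))
{
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++) {
            if (fabs(blocks[i].m[axis] - blocks[j].m[axis]) >= d_i)
                continue;  /* not close neighbors on the selected axis */
            double d = metric_distance(&blocks[i], &blocks[j]);
            if (d < clone_threshold)
                report(blocks[i].block_id, blocks[j].block_id, d);
        }
}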
The second technique is more efficient and uses clustering per metric axis. The
technique starts by creating clusters of potential clones for every metric axis Mi (i = 1 ..
5). Once the clusters for every axis are created, intersections of clusters in different
axes are calculated, forming intermediate results. For example, every cluster in the axis Mi
contains potential clones under the criteria implied by this metric. Consequently, every
cluster that has been calculated by intersecting clusters in Mi and Mj contains potential
clones under the criteria implied by both metrics. The process ends when all metric axes
have been considered. The user may specify at the beginning the order of comparison and
the clustering thresholds for every metric axis. The clone detection algorithm that uses
clustering can be summarized as follows (a sketch of the single-axis clustering step appears
after the algorithm):

1. Select all source code begin-end blocks B from the AST that are more than n lines
long. The parameter n can be changed by the user.

2. For every metric axis Mi (i = 1 .. 5) create clusters Ci,j that contain begin-end blocks
with distance less than a given threshold di that is selected by the user. Each cluster
then contains potential code clone fragments under the metric criterion Mi. Set the
current axis Mcurr = Mi, where i = 1. Mark Mi as used.

3. For every cluster Ccurr,m in the current metric axis Mcurr, intersect with all clusters
Cj,k in one of the unused metric axes Mj, j ∈ {1 .. 5}. The clusters in the resulting
set contain potential code clone fragments under the criteria Mcurr and Mj, and form
a composite metric axis Mcurr⊗j. Mark Mj as used and set the current axis Mcurr
= Mcurr⊗j.

4. If all metric axes have been considered then stop; else go to Step 3.
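A minimal sketch of Step 2 for one metric axis is given below, reusing the hypothetical MetricVector type from the previous sketch. Blocks are sorted along the axis and a new cluster is started whenever the gap to the previous value reaches the threshold d_i; this is one simple way to form clusters of close metric neighbors, not necessarily the exact clustering procedure used in Ariadne. Intersecting two axes then amounts to grouping blocks that share the same pair of cluster identifiers.

#include <stdlib.h>

/* Globals used by the qsort comparator (kept simple for the sketch). */
static int cmp_axis;
static const MetricVector *cmp_blocks;

static int by_axis_value(const void *pa, const void *pb)
{
    double a = cmp_blocks[*(const int *)pa].m[cmp_axis];
    double b = cmp_blocks[*(const int *)pb].m[cmp_axis];
    return (a > b) - (a < b);
}

/* On return, cluster_id[b] holds the cluster index of block b along `axis`.
   Assumes n >= 1. */
static void cluster_axis(const MetricVector *blocks, int n,
                         int axis, double d_i, int *cluster_id)
{
    int *order = malloc(n * sizeof *order);
    for (int i = 0; i < n; i++)
        order[i] = i;

    cmp_axis = axis;
    cmp_blocks = blocks;
    qsort(order, n, sizeof *order, by_axis_value);

    int current = 0;
    cluster_id[order[0]] = 0;
    for (int i = 1; i < n; i++) {
        /* Start a new cluster when the gap along this axis reaches d_i. */
        if (blocks[order[i]].m[axis] - blocks[order[i - 1]].m[axis] >= d_i)
            current++;
        cluster_id[order[i]] = current;
    }
    free(order);
}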

The pattern matching engine uses either the computed Euclidean distance or clustering
in one or more metric dimensions combined, as a similarity measure between program
constructs.
As a refinement, the user may restrict the search to code fragments having minimum size
or complexity.
The metric-based clone detection analysis has been applied to several medium-sized
production C programs.
In tcsh, a 45 kLOC Unix shell program, our analysis has discovered 39 clusters or groups
of similar functions of average size 3 functions per cluster, resulting in a total of 17.7 percent
of potential system duplication at the function level.
In bash, a 40 kLOC Unix shell program, the analysis has discovered 25 clusters, of
average size 5.84 functions per cluster, resulting in a total of 23 percent of potential code
duplication at the function level.
In CLIPS, a 34 kLOC expert system shell, we detected 35 clusters of similar functions of
average size 4.28 functions per cluster, resulting in a total of 20 percent of potential system
duplication at the function level.
Manual inspection of the above results, combined with a more detailed Dynamic Program-
ming re-calculation of distances, gave some statistical data regarding false positives. These
results are given in Table 1. Different programs give different distributions of false alarms,
but generally the closer the distance is to 0.0 the more accurate the result is.
The following section discusses in detail the other code-to-code matching technique we
developed, which is based on Dynamic Programming.

2.3. Dynamic Programming Based Similarity Analysis

The Dynamic Programming pattern matcher is used (Kontogiannis, 1994), (Kontogiannis, 1995)
to find the best alignment between two code fragments. The distance between the two code
fragments is given as a summation of comparison values as well as of insertion and deletion
costs corresponding to insertions and deletions that have to be applied in order to achieve
the best alignment between these two code fragments.
A program feature vector is used for the comparison of two statements. The features are
stored as attribute values in a frame-based structure representing expressions and statements
in the AST. The cumulative similarity measure D between two code fragments P and M is
calculated using the function

D : FeatureVector x FeatureVector -> Real

where:

D(E(1,p,P), E(1,j,M)) = Min {
    Δ(p, j-1, P, M) + D(E(1,p,P), E(1,j-1,M)),
    I(p-1, j, P, M) + D(E(1,p-1,P), E(1,j,M)),                    (1)
    C(p-1, j-1, P, M) + D(E(1,p-1,P), E(1,j-1,M))
}
and,
• M is the model code fragment

• P is the input code fragment to be compared with the model M

• E(i, j, Q) is a program feature vector from position i to position j in code fragment Q

• D(Vx, Vy) is the distance between two feature vectors Vx and Vy

• Δ(i, j, P, M) is the cost of deleting the jth statement of M at position i of the fragment P

• I(i, j, P, M) is the cost of inserting the ith statement of P at position j of the model M,
and

• C(i, j, P, M) is the cost of comparing the ith statement of the code fragment P with
the jth statement of the model M. The comparison cost is calculated by comparing the
corresponding feature vectors. Currently, we compare ratios of variables set and used per
statement, data types used or set, and comparisons based on metric values.

Note that insertion and deletion costs are used by the Dynamic Programming algorithm
to calculate the best fit between two code fragments. An intuitive interpretation of the best
fit using insertions and deletions is "if we insert statement i of the input at position j of the
model then the model and the input have the smallest feature vector difference."
The quality and the accuracy of the comparison cost is based on the program features se-
lected and the formula used to compare these features. For simplicity, in the implementation
we have attached constant real values as insertion and deletion costs.
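A minimal sketch of the recurrence in equation (1) is shown below. The Statement type and the comparison-cost callback stand in for the frame-based feature vectors and the feature comparison described above, and the constant insertion/deletion costs are passed in by the caller; this is an illustration under those assumptions, not the Ariadne code.

#include <stdlib.h>

typedef struct Statement Statement;   /* opaque per-statement feature vector */

/* Comparison cost C between one input statement and one model statement. */
typedef double (*CompareCost)(const Statement *input_stmt,
                              const Statement *model_stmt);

static double min3(double a, double b, double c)
{
    double m = a < b ? a : b;
    return m < c ? m : c;
}

/* Cumulative dissimilarity D between an input fragment P of p statements and
   a model fragment M of m statements (equation (1)), with constant insertion
   and deletion costs. */
static double dp_distance(const Statement *P, int p,
                          const Statement *M, int m,
                          double insert_cost, double delete_cost,
                          CompareCost cost)
{
    /* D[i][j] = distance between the first i statements of P and the
       first j statements of M. */
    double **D = malloc((p + 1) * sizeof *D);
    for (int i = 0; i <= p; i++)
        D[i] = malloc((m + 1) * sizeof **D);

    D[0][0] = 0.0;
    for (int i = 1; i <= p; i++) D[i][0] = D[i - 1][0] + insert_cost;
    for (int j = 1; j <= m; j++) D[0][j] = D[0][j - 1] + delete_cost;

    for (int i = 1; i <= p; i++)
        for (int j = 1; j <= m; j++)
            D[i][j] = min3(delete_cost + D[i][j - 1],   /* delete jth model stmt */
                           insert_cost + D[i - 1][j],   /* insert ith input stmt */
                           cost(&P[i - 1], &M[j - 1]) + D[i - 1][j - 1]);

    double result = D[p][m];
    for (int i = 0; i <= p; i++)
        free(D[i]);
    free(D);
    return result;
}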
Table 1 summarizes statistical data regarding false alarms when Dynamic Programming
comparison was applied to functions that under direct metric comparison have given distance
0.0. The column labeled Distance Range gives the value range of distances between
functions using the Dynamic Programming approach. The column labeled False Alarms
contains the percentage of functions that are not clones but have been identified as such.
The column labeled Partial Clones contains the percentage of functions which correspond

Table 1. False alarms for the Clips program

Distance Range False Alarms Partial Clones Positive Clones


0.0 0.0% 10.0% 90.0%
0.01 - 0.99 6.0% 16.0 % 78.0%
1.0-1.49 8.0% 3.0% 89.0%
1.5-1.99 30.0% 37.0 % 33.0%
2.0 - 2.99 36.0% 32.0 % 32.0%
3.0 - 3.99 56.0% 13.0 % 31.0%
4.0 - 5.99 82.0% 10.0 % 8.0%
6.0 -15.0 100.0% 0.0% 0.0%

only in parts to cut and paste operations. Finally, the column labeled as Positive Clones
contains the percentage of functions clearly identified as cut and paste operations.
The matching process between two code fragments M and P is discussed with an example
later in this section and is illustrated in Fig. 3.
The comparison cost function C(i, j, M, P) is the key factor in producing the final
distance result when DP-based matching is used. There are many program features that can
be considered to characterize a code fragment (indentation, keywords, metrics, uses and
definitions of variables). In our experiments with this approach we used the following
three categories of features:

1. definitions and uses of variables as well as literal values within a statement:

(A) Feature1 : Statement -> String denotes the set of variables used within a
statement,
(B) Feature2 : Statement -> String denotes the set of variables defined within a
statement,
(C) Feature3 : Statement -> String denotes the set of literal values (i.e., numbers,
strings) within a statement (e.g., in a printf statement).

2. definitions and uses of data types:

(A) Feature1 : Statement -> String denotes the set of data type names used within
a statement,
(B) Feature2 : Statement -> String denotes the set of data type names defined
within a statement.

The comparison cost of the ith statement in the input P and the jth statement of the
model M for the first two categories is calculated as:

C(Pi, Mj) = (1/v) * Σ_{m=1}^{v} card(InputFeature_m(Pi) ∩ ModelFeature_m(Mj)) / card(InputFeature_m(Pi) ∪ ModelFeature_m(Mj))    (2)

where v is the size of the feature vector, or in other words how many features are used.

3. five metric values which are calculated compositionally from the statement level to
function and file level:

The comparison cost of the ith statement in the input P and the jth statement of the
model M when the five metrics are used is calculated as:

C(Pi, Mj) = sqrt( Σ_{k=1}^{5} (M_k(Pi) - M_k(Mj))^2 )    (3)

Within this framework new metrics and features can be used to make the comparison
process more sensitive and accurate. Both comparison costs are sketched below.
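The sketch below illustrates equations (2) and (3) in C. The feature-set representation (NULL-terminated arrays of strings, assumed free of duplicates) and the function names are assumptions made for the example; note that (2) yields an overlap ratio per feature while (3) yields a Euclidean distance over the five metric values.

#include <math.h>
#include <string.h>

/* Size of the intersection over the size of the union for two
   NULL-terminated sets of identifier strings (no duplicate entries).
   Two empty sets are treated as identical (ratio 1.0). */
static double set_overlap(const char **a, const char **b)
{
    int na = 0, nb = 0, common = 0;
    for (int i = 0; a[i]; i++) na++;
    for (int j = 0; b[j]; j++) nb++;
    for (int i = 0; a[i]; i++)
        for (int j = 0; b[j]; j++)
            if (strcmp(a[i], b[j]) == 0) { common++; break; }
    int unions = na + nb - common;
    return unions == 0 ? 1.0 : (double)common / unions;
}

/* Equation (2): average set overlap over the v feature sets of the input
   statement and the model statement (e.g., variables used, variables
   defined, literal values). */
static double feature_set_cost(const char ***input_features,
                               const char ***model_features, int v)
{
    double sum = 0.0;
    for (int n = 0; n < v; n++)
        sum += set_overlap(input_features[n], model_features[n]);
    return sum / v;
}

/* Equation (3): Euclidean distance over the five per-statement metrics. */
static double metric_feature_cost(const double input_metrics[5],
                                  const double model_metrics[5])
{
    double sum = 0.0;
    for (int k = 0; k < 5; k++) {
        double diff = input_metrics[k] - model_metrics[k];
        sum += diff * diff;
    }
    return sqrt(sum);
}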
The following points on insertion and deletion costs need to be discussed.

• The insertion and deletion costs reflect the tolerance of the user towards partial matching
(i.e. how much noise in terms of insertions and deletions is allowed before the matcher
fails). Higher insertion and deletion costs indicate smaller tolerance, especially if cut-
off thresholds are used (i.e. terminate matching if a certain threshold is exceeded),
while smaller values indicate higher tolerance.

• The values for insertion and deletion should be higher than the threshold value by which
two statements can be considered "similar", otherwise an insertion or a deletion could
be chosen instead of a match.

• A lower insertion cost than the corresponding deletion cost indicates the preference of
the user to accept a code fragment P that is written by inserting new statements into the
model M. The opposite holds when the deletion cost is lower than the corresponding
insertion cost: a lower deletion cost indicates the preference of the user to accept a
code fragment P that is written by deleting statements from the model M. Insertion
and deletion costs are constant values throughout the comparison process and can be
set empirically.

When different comparison criteria are used, different distances are obtained. In Fig. 2
(Clips), distances calculated using Dynamic Programming are shown for 138 pairs of func-
tions (x-axis) that have already been identified as clones (i.e., zero distance) using the
direct per-function metric comparison. The dashed line shows distance results when def-
initions and uses of variables are used as features in the dynamic programming approach,
while the solid line shows the distance results obtained when the five metrics are used as
features.

Distances between Function pairs (Clips) Distances between Function Pairs (Bash)

- Distances on definitions and uses of variables


- Distances on definitions and uses of variables
_ Distances on data and controlflowmeasurements.
_ Distances on data and control flow measurements

3h

C
40 60 80 100 120 140 0
Function Pairs

Figure 2. Distances between function pairs of possible function clones using DP-based matching.

Note that in the Dynamic Programming based approach the metrics are used at the statement
level, instead of at the begin-end block level as when direct metric comparison is performed.
As an example consider the following statements M and P:

ptr = head;
while(ptr != NULL && !found)
{
    if (ptr->item == searchItem)
        found = 1;
    else
        ptr = ptr->next;
}

while(ptr != NULL && !found)
{
    if (ptr->item == searchItem)
    {
        printf("ELEMENT FOUND %s\n", searchItem);
        found = 1;
    }
    else
        ptr = ptr->next;
}

Figure 3. The matching process between two code fragments. Insertions are represented as horizontal lines,
deletions as vertical lines, and matches as diagonal lines.

The Dynamic Programming matching based on definitions and uses of variables is illus-
trated in Fig. 3.
In the first grid the two code fragments are initially considered. At position (0, 0) of
the first grid a deletion is considered, as it gives the best cumulative distance to this point
(assuming there will be a match at position (0, 1)). The comparison of the two composite
while statements in the first grid at position (0, 1) initiates a nested match (second grid).
In the second grid the comparison of the composite if-then-else statements at position
(1, 1) initiates a new nested match. In the third grid, the comparison of the composite then-
parts of the if-then-else statements initiates the final fourth nested match. Finally,
in the fourth grid at position (0, 0), an insertion has been detected, as it gives the best
cumulative distance to this point (assuming a potential match at (1, 0)).

When a nested match process finishes it passes its result back to the position from which
it was originally invoked and the matching continues from this point on.

3. Concept To Code Matching

The concept assignment (Biggerstaff, 1994) problem consists of assigning concepts de-
scribed in a concept language to program fragments. Concept assignment can also be seen
as a matching problem.
In our approach, concepts are represented as abstract-descriptions using a concept lan-
guage called ACL. The intuitive idea is that a concept description may match with a number
of different implementations. The probability that such a description matches with a code
fragment is used to calculate a similarity measure between the description and the implemen-
tation. An abstract-description is parsed and a corresponding AST Ta is created. Similarly,
source code is represented as an annotated AST Tc. Both Ta and Tc are transformed into
a sequence of abstract and source code statements respectively, using transformation rules.
We use REFINE to build and transform both ASTs. The reason for this transformation is
to reduce the complexity of the matching algorithm, as Ta and Tc may have very complex
and mutually different structures. In this approach, feature vectors of statements are
matched instead of Abstract Syntax Trees. Moreover, the implementation of the Dynamic
Programming algorithm is cleaner and faster once structural details of the ASTs have been
abstracted and represented as sequences of entities.
The problems associated with matching concepts to code include:

• The choice of the conceptual language,

• The measure of similarity,

• The selection of a fragment in the code to be compared with the conceptual represen-
tation.

These problems are addressed in the following sections.

3.1. Language for Abstract Representation

A number of research teams have investigated and addressed the problem of code and plan
localization. Current successful approaches include the use of graph grammars (Wills,
1992), (Rich, 1990), query pattern languages (Paul, 1994), (Muller, 1992), (Church, 1993),
(Biggerstaff, 1994), sets of constraints between components to be retrieved (Ning, 1994),
and summary relations between modules and data (Canfora, 1992).
In our approach a stochastic pattern matcher that allows for partial and approximate
matching is used. A concept language specifies in an abstract way sequences of design
concepts.
The concept language contains:

• Abstract expressions E that correspond to source code expressions. The correspondence
between an abstract expression and the source code expressions that it may generate is
given in Table 3.

• Abstract feature descriptions F that contain the feature vector data used for matching
purposes. Currently the features that characterize an abstract statement and an abstract
expression are:

1. Uses of variables: variables that are used in a statement or expression
2. Definitions of variables: variables that are defined in a statement or expression
3. Keywords: strings, numbers, characters that may be used in the text of a code statement
4. Metrics: a vector of five different complexity, data and control flow metrics.

• Typed Variables X
Typed variables are used as placeholders for feature vector values, when no actual
values for the feature vector can be provided. An example is when we are looking
for a Traversal of a list plan but we do not know the name of the pointer variable that
exists in the code. A typed variable can generate (match) any actual variable in the
source code, provided that they belong to the same data type category. For example, a
List type abstract variable can be matched with an Array or a Linked List node source
code pointer variable.
Currently the following abstract types are used :

1. Numeral: Representing int and float types


2. Character : Representing char types
3. List: Representing array types
4. Structure : Representing struct types
5. Named : matching the actual data type name in the source code

• Operators O
Operators are used to compose abstract statements in sequences. Currently the following
operators have been defined in the language but only sequencing is implemented for
the matching process :

1. Sequencing (;): To indicate one statement follows another


2. Choice (⊕): To indicate choice (one or the other abstract statement will be used
in the matching process)
3. Interleaving (||): To indicate that two statements can be interleaved during the
matching process

Table 2. Generation (Allowable Matching) of source code statements from ACL statements

    ACL Statement                       Generated Code Statement

    Abstract Iterative Statement        While Statement, For Statement, Do Statement
    Abstract While Statement            While Statement
    Abstract For Statement              For Statement
    Abstract Do Statement               Do Statement
    Abstract Conditional Statement      If Statement, Switch Statement
    Abstract If Statement               If Statement
    Abstract Switch Statement           Switch Statement
    Abstract Return Statement           Return Statement
    Abstract GoTo Statement             GoTo Statement
    Abstract Continue Statement         Continue Statement
    Abstract Break Statement            Break Statement
    Abstract Labeled Statement          Labeled Statement
    Abstract Statement*                 Zero or more sequential source code statements
    Abstract Statement+                 One or more sequential source code statements

Table 3. Generation (Allowable Matching) of source code expressions from ACL expressions

    ACL Expression              Generated Code Expression

    Abstract Function Call      Function Call
    Abstract Equality           Equality (==)
    Abstract Inequality         Inequality (!=)
    Abstract Logical And        Logical And (&&)
    Abstract Logical Or         Logical Or (||)
    Abstract Logical Not        Logical Not (!)

• Macros M
Macros are used to facilitate hierarchical plan recognition (Hartman, 1992), (Chikofsky,
1990). Macros are entities that refer to plans that are included at parse time.
For example if a plan has been identified and is stored in the plan base, then special
preprocessor statements can be used to include this plan to compose more complex
patterns. Included plans are incorporated in the current pattern's AST at parse time. In
this way they are similar to inline functions in C++.
Special macro definition statements in the Abstract Language are used to include the
necessary macros.
Currently there are two types of macro related statements

1. include definitions: These are special statements in ACL that specify the name of
the plan to be included and the file in which it is defined.
As an example consider the statement
include plan1.acl traversal-linked-list
that imports the plan traversal-linked-list defined in file plan1.acl.
2. inline uses : These are statements that direct the parser to inline the particular plan
and include its AST in the original pattern's AST. As an example consider the
inlining
plan : traversal-linked-list
that is used to include an instance of the traversal-linked-list plan at a particular
point of the pattern. In a pattern more than one occurrence of an included plan may
appear.

A typical example of a design concept in our concept language is given below. This
pattern expresses an iterative statement (e.g., a while, for, or do loop) that has in its condition
an inequality expression that uses variable ?x that is a pointer to the abstract type list (e.g.,
array, linked list), and the conditional expression contains the keyword "NULL". The body of
the Iterative-statement contains a sequence of one or more statements (+-Statement)
that uses at least variable ?y (which matches the variable obj in the code below) and
contains the keyword member, and an Assignment-Statement that uses at least variable
?x, defines variable ?x (which in this example matches the variable field), and contains the
keyword next.

{
  Iterative-statement(Inequality-Expression
      abstract-description
          uses : [ ?x : *list ],
          keywords : [ "NULL" ])
  {
    +-Statement
        abstract-description
            uses : [ ?y : string, .. ]
            keywords : [ "member" ];
    Assignment-Statement
        abstract-description
            uses : [ ?x, .. ],
            defines : [ ?x ],
            keywords : [ "next" ]
  }
}
A code fragment that matches the pattern is:

{
    while (field != NULL)
    {
        if (!strcmp(obj, origObj) ||
            (!strcmp(field->AvalueType, "member") &&
             notInOrig))
            if (strcmp(field->Avalue, "method") != 0)
                INSERT_THE_FACT(o->ATTLIST[num].Aname, origObj,
                                field->Avalue);
        field = field->nextValue;
    }
}

3.2. Concept-to-Code Distance Calculation

In this section we discuss the mechanism that is used to match an abstract pattern given in
ACL with source code.

In general the matching process contains the following steps:

1. Source code (S1; ...; Sk) is parsed and an AST Tc is created.

2. The ACL pattern (A1; ...; An) is parsed and an AST Ta is created.

3. A transformation program generates from Ta a Markov Model called the Abstract Pattern
Model (APM).

4. A Static Model called the SCM provides the legal entities of the source language. The
underlying finite-state automaton for the mapping between an APM state and an SCM
state basically implements Tables 2 and 3.

5. Candidate source code sequences are selected.

6. The Viterbi algorithm (Viterbi, 1967) is used to find the best fit between the Dynamic
Model and a code sequence selected from the candidate list.

A Markov model is a source of symbols characterized by states and transitions. A
model can be in a state with a certain probability. From a state, a transition to another
state can be taken with a given probability. A transition is associated with the generation
(recognition) of a symbol with a specific probability. The intuitive idea of using Markov
models to drive the matching process is that an abstract pattern given in ACL may have many
possible alternative ways to generate (match) a code fragment. A Markov model provides
an appropriate mechanism to represent these alternative options and label the transitions
with corresponding generation probabilities. Moreover, the Viterbi algorithm provides an
efficient way to find the path that maximizes the overall generation (matching) probability
among all the possible alternatives.
The selection of a code fragment to be matched with an abstract description is based on
the following criteria: a) the first source code statement S1 matches with the first pattern
statement A1 and, b) S2; S3; ...; Sk belong to the innermost block containing S1.
The process starts by selecting all program blocks that match the criteria above. Once a
candidate list of code fragments has been chosen the actual pattern matching takes place
between the chosen statement and the outgoing transitions from the current active APM's
state. If the type of the abstract statement the transition points to and the source code
statement are compatible (compatibility is computed by examining the Static Model) then
feature comparison takes place. This feature comparison is based on Dynamic Programming
as described in section 2.3. A similarity measure is established by this comparison between
the features of the abstract statement and the features of the source code statement. If
composite statements are to be compared, an expansion function "flattens" the structure by
decomposing the statement into a sequence of its components. For example, an if statement
will be decomposed as a sequence of an expression (for its condition), its then part and
its else part. Composite statements generate nested matching sessions as in the DP-based
code-to-code matching.

3.3. ACL Markov Model Generation

Let Tc be the AST of the code fragment and Ta be the AST of the abstract representation.
A measure of similarity between Tc and Ta is the following probability

Pr(Tc | Ta) = Pr(rc1, ..., rci, ..., rcN | ra1, ..., ran, ..., raL)    (4)

where,

(rc1, ..., rci, ..., rcN)    (5)

is the sequence of the grammar rules used for generating Tc and

(ra1, ..., ran, ..., raL)    (6)

is the sequence of rules used for generating Ta. The probability in (4) cannot be computed
in practice, because of complexity issues related to possible variations in Ta generating Tc.
An approximation of (4) is thus introduced.
Let S1, ..., Sk be a sequence of program statements. During the parsing that generates Ta,
a sequence of abstract descriptions is produced. Each of these descriptions is considered
as a Markov source whose transitions are labeled by symbols Aj which in turn generate
(match) source code.
The sequence of abstract descriptions Aj forms a pattern A in the Abstract Code Language
(ACL) and is used to build dynamically a Markov model called the Abstract Pattern Model
(APM), an example of which is given in Fig. 4.
The Abstract Pattern Model is generated as an ACL pattern is parsed. Nodes in the APM
represent Abstract ACL Statements and arcs represent transitions that determine what is
expected to be matched from the source code via a link to a static, permanently available
Markov model called the Source Code Model (SCM).
The Source Code Model is an alternative way to represent the syntax of a language entity
and the correspondence of Abstract Statements in ACL with source code statements.
For example, a transition in the APM labeled as (pointing to) an Abstract While State-
ment is linked with the while node of the static model. In its turn, a while node in the
SCM describes in terms of states and transitions the syntax of a legal while statement in C.
The best alignment between a sequence of statements S = S1; S2; ...; Sk and a pattern
A = A1; A2; ...; An is computed by the Viterbi (Viterbi, 1967) dynamic programming
algorithm using the SCM and a feature vector comparison function for evaluating the
following type of probabilities:

Pr(S1, S2, ..., Si | A_f(i))    (7)

where f(i) indicates which abstract description is allowed to be considered at step i. This
is determined by examining the reachable APM transitions at the ith step. For the matching
to succeed the constraint Pr(S1 | A1) = 1.0 must be satisfied and A_f(k) must correspond to a
final APM state.
This corresponds to approximating (4) as follows (Brown, 1992):

Pr(Tc | Ta) ≈ Pr(S1; ...; Sk | A1; ...; An) =
    Π_{i=1}^{k} max( Pr(S1; S2; ...; S_{i-1} | A1; A2; ...; A_{f(i)-1}) ) · Pr(S_i | A_{f(i)})    (8)
This is similar to the code-to-code matching. The difference is that instead of matching
source code features, we allow matching abstract description features with source code
features. The dynamic model (APM) guarantees that only the allowable sequences of
comparisons are considered at every step.
The way to calculate similarities between individual abstract statements and code frag-
ments is given in terms of probabilities of the form Pr(Si | Aj), the probability of abstract
statement Aj generating statement Si.
The probability p = Pr(Si | Aj) = Pscm(Si | Aj) * Pcomp(Si | Aj) is interpreted as "the
probability that code statement Si can be generated by abstract statement Aj". The mag-
nitude of the logarithm of the probability p is then taken to be the distance between Si and
Aj.
The value of p is computed by multiplying the probability associated with the correspond-
ing state for Aj in the SCM with the result of comparing the feature vectors of Si and Aj. The
feature vector comparison function is discussed in the following subsection.
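The recursion in (8) can be sketched for the restricted case of a simple left-to-right APM in which each abstract statement may match one or more consecutive code statements (as in the pattern A1; A2*; A3* of Fig. 4 below). The emission callback stands in for Pr(Si | Aj) = Pscm(Si | Aj) * Pcomp(Si | Aj); the fixed-size table and the state layout are assumptions made for the sketch, not the general APM machinery.

#define MAX_STMTS 128   /* assumed upper bound on fragment and pattern length */

/* Pr(S_i | A_j): probability that abstract statement j generates code
   statement i (supplied by the caller). */
typedef double (*Emission)(int code_stmt, int abstract_stmt);

/* Best generation probability for k code statements against a left-to-right
   pattern of n abstract statements with self-loops.
   Assumes 1 <= k, n <= MAX_STMTS. */
static double viterbi_match(int k, int n, Emission pr)
{
    /* best[i][j] = best probability of generating code statements 0..i with
       a path that currently sits in abstract statement j.
       (static to keep the sketch simple; not reentrant) */
    static double best[MAX_STMTS][MAX_STMTS];

    /* Delineation criterion: the first code statement must be generated by
       the first pattern statement. */
    best[0][0] = pr(0, 0);
    for (int j = 1; j < n; j++)
        best[0][j] = 0.0;

    for (int i = 1; i < k; i++)
        for (int j = 0; j < n; j++) {
            double stay    = best[i - 1][j];                      /* A_j matches again */
            double advance = (j > 0) ? best[i - 1][j - 1] : 0.0;  /* move to next A_j  */
            double prev    = stay > advance ? stay : advance;
            best[i][j] = prev * pr(i, j);
        }

    /* The alignment must end in the final pattern statement. */
    return best[k - 1][n - 1];
}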
As an example consider the APM of Fig. 4, generated by the pattern A1; A2*; A3*, where
Aj is one of the legal statements in ACL. Then the following probabilities are computed
for a selected candidate code fragment S1, S2, S3:

Figure 4. A dynamic model for the pattern A1; A2*; A3*.

Pr(S1 | A1) = 1.0    (delineation criterion)    (9)

Pr(S1, S2 | A2) = Pr(S1 | A1) · Pr(S2 | A2)    (10)

Pr(S1, S2 | A3) = Pr(S1 | A1) · Pr(S2 | A3)    (11)

Pr(S1, S2, S3 | A3) = Max{ Pr(S1, S2 | A2) · Pr(S3 | A3),
                           Pr(S1, S2 | A3) · Pr(S3 | A3) }    (12)

Pr(S1, S2, S3 | A2) = Pr(S1, S2 | A2) · Pr(S3 | A2)    (13)

Note that when the first two program statements S1, S2 have already been matched
(equations 12 and 13), two transitions have been consumed and the reachable active states
currently are A2 or A3.
Moreover, at every step the probabilities of the previous steps are stored and there is no
need to reevaluate them. For example, Pr(S1, S2 | A2) is computed in terms of Pr(S1 | A1),
which is available from the previous step.
With each transition we can associate a list of probabilities based on the type of expression
likely to be found in the code for the plan that we consider.
For example, in the Traversal of a linked list plan the while loop condition,
which is an expression, most probably generates an inequality of the form (list-node-ptr
!= NULL) which contains an identifier reference and the keyword NULL.
An example of a static model for the pattern-expression is given in Fig. 5. Here
we assume for simplicity that only four C expressions can be generated by a Pattern-
Expression.
The initial probabilities in the static model are provided by the user who either may
give a uniform distribution in all outgoing transitions from a given state or provide some
subjectively estimated values. These values may come from the knowledge that a given plan
is implemented in a specific way. In the above mentioned example of the Traversal of
a linked list plan, the Iterative-Statement pattern is usually implemented with
a while loop. In such a scenario the iterative abstract statement can be considered
to generate a while statement with higher probability than a for statement. Similarly,
the expression in the while loop is more likely to be an inequality (Fig. 5). The preferred
probabilities can be specified by the user while he or she is formulating the query using the
ACL primitives. Once the system is used and results are evaluated these probabilities can
be adjusted to improve the performance.
Probabilities can be dynamically adapted to a specific software system using a cache
memory method originally proposed (for a different application) in (Kuhn, 1990).
A cache is used to maintain the counts for most frequently recurring statement patterns in
the code being examined. Static probabilities can be weighted with dynamically estimated
ones as follows :

Pscm(Si | Aj) = λ · Pcache(Si | Aj) + (1 − λ) · Pstatic(Si | Aj)    (14)

In this formula Pcache(Si | Aj) represents the frequency with which Aj generates Si in the code
examined at run time, while Pstatic(Si | Aj) represents the a-priori probability of Aj gen-
erating Si given in the static model. λ is a weighting factor. The choice of the weighting
factor λ indicates the user's preference on what weight he or she wants to give to the feature
vector comparison. Higher λ values indicate a stronger preference to depend on feature
vector comparison. Lower λ values indicate preference to match on the type of statement
and not on the feature vector.
The value of λ can be computed by deleted interpolation as suggested in (Kuhn, 1990).
It can also be empirically set to be proportional to the amount of data stored in the cache.


Figure 5. The static model for the expression-pattern. Different transition probability values may be set by the
user for different plans. For example, the traversal of linked-list plan may have a higher probability attached to the
is-an-inequality transition, as the programmer expects a pattern of the form (field != NULL).

As proposed in (Kuhn, 1990), different cache memories can be introduced, one for each
Aj. Specific values of λ can also be used for each cache.
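Equation (14) can be sketched directly in C. The count tables below, which record how often each abstract statement has generated each kind of code statement in the code examined so far, are illustrative assumptions rather than the actual cache structure.

#define NUM_CODE_TYPES 32   /* assumed number of distinct code statement types */

/* Interpolated generation probability Pscm(Si | Aj) from equation (14). */
static double p_scm(int code_type_i, int abstract_j, double lambda,
                    const int cache_count[][NUM_CODE_TYPES], /* per-Aj emission counts */
                    const int cache_total[],                 /* total count per Aj     */
                    const double p_static[][NUM_CODE_TYPES]) /* a-priori probabilities */
{
    double p_cache = cache_total[abstract_j] > 0
        ? (double)cache_count[abstract_j][code_type_i] / cache_total[abstract_j]
        : 0.0;
    return lambda * p_cache + (1.0 - lambda) * p_static[abstract_j][code_type_i];
}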

3.4. Feature Vector Comparison

In this section we discuss the mechanism used for calculating the similarity between two
feature vectors. Note that Si's and Aj's feature vectors are represented as annotations in
the corresponding ASTs.
The feature vector comparison of Si and Aj returns a value p = Pr(Si | Aj).
The features used for comparing two entities (source and abstract) are:

1. Variables defined V : Source-Entity -> {String}

2. Variables used U : Source-Entity -> {String}

3. Keywords K : Source-Entity -> {String}

4. Metrics
• Fan out M1 : Source-Entity -> Number
• D-Complexity M2 : Source-Entity -> Number
• McCabe M3 : Source-Entity -> Number
• Albrecht M4 : Source-Entity -> Number
• Kafura M5 : Source-Entity -> Number

These features are AST annotations and are implemented as mappings from an AST node
to a set of AST nodes, set of Strings or set of Numbers.
Let Si be a source code statement or expression in program C and Aj an abstract statement
or expression in pattern A. Let the feature vector associated with Si be Vi and the feature
vector associated with Aj be Vj. Within this framework we experimented with the following
similarity measure, considered in the computation as a probability:

Pcomp(Si | Aj) = (1/v) * Σ_{n=1}^{v} card(AbstractFeature_j,n ∩ CodeFeature_i,n) / card(AbstractFeature_j,n ∪ CodeFeature_i,n)

where v is the size of the feature vector, or in other words how many features are used,
CodeFeature_i,n is the nth feature of source statement Si and AbstractFeature_j,n is the
nth feature of the ACL statement Aj.
As in the code-to-code dynamic programming matching, lexicographical distances be-
tween variable names (e.g., next, nextValue) and numerical distances between metrics are
used when exact matching is not the objective. Within this context two strings are considered
similar if their lexicographical distance is less than a selected threshold, and the comparison
of an abstract entity with a code entity is valid if their corresponding metric values differ by
less than a given threshold.
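One way to realize such a lexicographical-distance test is sketched below. The specific string metric used by Ariadne is not spelled out above, so the Levenshtein edit distance and the size limit are assumptions made for the example.

#include <string.h>

#define MAX_NAME 64   /* assumed upper bound on identifier length */

/* Levenshtein (edit) distance between two identifier names.
   Assumes both names are at most MAX_NAME characters long. */
static int edit_distance(const char *a, const char *b)
{
    int la = (int)strlen(a), lb = (int)strlen(b);
    int d[MAX_NAME + 1][MAX_NAME + 1];

    for (int i = 0; i <= la; i++) d[i][0] = i;
    for (int j = 0; j <= lb; j++) d[0][j] = j;

    for (int i = 1; i <= la; i++)
        for (int j = 1; j <= lb; j++) {
            int sub = d[i - 1][j - 1] + (a[i - 1] != b[j - 1]);
            int del = d[i - 1][j] + 1;
            int ins = d[i][j - 1] + 1;
            int m = sub < del ? sub : del;
            d[i][j] = m < ins ? m : ins;
        }
    return d[la][lb];
}

/* Two names are considered similar when their edit distance is below a
   user-selected threshold, e.g. names_similar("next", "nextValue", 6). */
static int names_similar(const char *a, const char *b, int threshold)
{
    return edit_distance(a, b) < threshold;
}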
These themes show that ACL is viewed more as a vehicle to which new features and new
requirements can be added and then considered in the matching process. For example, a new
feature may be a link or invocation to another pattern matcher (e.g., SCRUPLE), so that the
abstract pattern in ACL succeeds in matching a source code entity if the additional pattern
matcher succeeds and the rest of the feature vectors match.

4. System Architecture

The concept-to-code pattern matcher of the Ariadne system is composed of four modules.
The first module consists of an abstract code language (ACL) and its corresponding parser.
Such a parser builds, at run time, an AST for the ACL pattern provided by the user. The
ACL AST is built using Refine and its corresponding domain model maps to entities of the
C language domain model. For example, an Abstract-Iterative-Statement corresponds to
an Iterative-Statement in the C domain model.

A Static explicit mapping between the ACL's domain model and C's domain model is
given by the SCM (Source Code Model), Ariadne's second module. SCM consists of states
and transitions. States represent Abstract Statements and are nodes of the ACL's AST.
Incoming transitions represent the nodes of the C language AST that can be matched by
this Abstract Statement. Transitions have initially attached probability values which follow
a uniform distribution. A subpart of the SCM is illustrated in Fig. 5 where it is assumed
for simplicity that an Abstract Pattern Expression can be matched by a C inequality,
equality, identifier reference, or function call.
The third module builds the Abstract Pattern Model at run time for every pattern provided
by the user. APM consists of states and transitions. States represent nodes of the ACL's
AST. Transitions model the structure of the pattern given, and provide the pattern statements
to be considered for the next matching step. This model directly reflects the structure of
the pattern provided by the user. Formally, the APM is an automaton <Q, Σ, δ, q0, F> (a
possible data structure is sketched after the definitions) where

• Q is the set of states, taken from the domain of ACL's AST nodes

• Σ is the input alphabet, which consists of nodes of the C language AST

• δ is a transition function implementing statement expansion (in the case of composite
abstract or C statements) and the matching process

• q0 is the initial state. The set of outgoing transitions must match the first statement in
the code segment considered.

• F is a set of final states. The matching process stops when one of the final states has
been reached and no more statements from the source code can be matched.
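One possible, purely illustrative data layout for such an automaton is sketched below; the field names, fixed sizes, and use of integer indices in place of AST nodes are assumptions made for the sketch, since the actual APM is built over the ACL and C ASTs inside the Software Refinery.

#define MAX_APM_STATES       64
#define MAX_APM_TRANSITIONS  256

typedef struct {
    int    from;          /* source APM state (an ACL AST node)              */
    int    to;            /* target APM state                                */
    int    scm_node;      /* SCM node listing the C constructs this          */
                          /* transition is allowed to match (Tables 2 and 3) */
    double probability;   /* generation probability attached to the arc      */
} ApmTransition;

typedef struct {
    int           num_states;                        /* |Q|                    */
    int           initial_state;                     /* q0                     */
    int           is_final[MAX_APM_STATES];          /* membership flags for F */
    int           num_transitions;                   /* delta, with the input  */
    ApmTransition transitions[MAX_APM_TRANSITIONS];  /* alphabet Sigma given   */
                                                     /* by C AST node kinds    */
} ApmModel;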

Finally, the fourth module is the matching engine. The algorithm starts by selecting
candidate code fragments P = S1; S2; ...; Sk, given a model M = A1; A2; ...; An.
The Viterbi algorithm is used to evaluate the best path from the start to the final state of
the APM.
An example of a match between two simple expressions (a function call and an Abstract-
Expression) is given below:

INSERT_THE_FACT(o->ATTLIST[num].Aname,origObj,
field->Avalue);

is matched with the abstract pattern

Expression(abstract-description
uses : ["ATTLIST", "Aname", "Avalue"]
Keywords : ["INSERT", "FACT"] )

In this scenario both abstract and code statements are simple and do not need expansion.
Expression and INSERT_THE_FACT(...) are type compatible statements because an
expression can generate a function call (Fig. 5) so the matching can proceed. The next step
is to compare features and lexicographical distances between variable names in the abstract
and source statement. The final value is obtained by multiplying the value obtained from
the feature vector comparison and the probability that an Expression generates a Function
Call. As the pattern statement does not specify what type of expression is to be matched, the
static model (SCM) provides an estimate. In the SCM given in Fig. 5 the likelihood that
the Expression generates a function call is 0.25. The user may provide such a value if
a plan favours a particular type instead of another. For example, in the Traversal of a
linked list plan the loop statement is most likely to be a while loop. Once a final value
is set, a record <abstract-pattern, matched-code, distance-value> is created and
is associated with the relevant transition of the APM. The process ends when a final state
of the APM has been reached and no more statements match the pattern.
With this approach the matching process does not fail when imperfect matching between
the pattern and the code occurs. Instead, partial and inexact matching can be computed.
This is very important as the programmer may not know how to specify in detail the code
fragment that is sought.
To reduce complexity when variables in the pattern statement occur, Ariadne maintains a
global binding table and it checks if the given pattern variable is bound to one of the legal
values from previous instantiations. These legal values are provided by the binding table
and are initialized every time a new pattern is tried and a new APM is created.

5. Conclusion

Pattern matching plays an important role for plan recognition and design recovery. In this
paper we have presented a number of pattern matching techniques that are used for code-
to-code and concept-to-code matching. The main objective of this research was to devise
methods and algorithms that are time efficient, allow for partial and inexact matching, and
tolerate a measure of dissimilarity between two code fragments. For code representation
schemes the program's Abstract Syntax Tree was used because it maintains all necessary
information without creating subjective views of the source code (control- or data-flow-biased
views).
Code-to-code matching is used for clone detection and for computing similarity distances
between two code fragments. It is based on a) a dynamic programming pattern matcher that
computes the best alignment between two code fragments and b) metric values obtained
for every expression, statement, and block of the AST. Metrics are calculated by taking
into account a number of control and data program properties. The dynamic programming
pattern matcher produces more accurate results but the metrics approach is cheaper and can
be used to limit the search space when code fragments are selected for comparison using
the dynamic programming approach.
We have experimented with different code features for comparing code statements and
are able to detect clones in large software systems (> 300 kLOC). Moreover, clone detection
is used to identify "conceptually" related operations in the source code. The performance
is limited by the fact that we are using a LISP environment (frequent garbage collection calls)
and the fact that metrics have to be calculated first. When the algorithm using metric values
for comparing program code fragments was rewritten in C it performed very well. For
30 kLOC of the CLIPS system, and for selecting candidate clones from approximately
500,000 pairs of functions, the C version of the clone detection system ran in less than
10 seconds on a Sparc 10, as opposed to a Lisp implementation that took 1.5 minutes to
complete. The corresponding DP-based algorithm implemented in Lisp took 3.9 minutes
to complete.
Currently the system is used for system clustering, redocumentation and program un-
derstanding. Clone detection analysis reveals clusters of functions with similar behaviour,
thus suggesting a possible system decomposition. This analysis is combined with other
data flow analysis tools (Kontogiannis, 1994) to obtain a multiple system decomposition view. For
the visualization and clustering aspect the Rigi tool developed at the University of Victoria
is used. Integration between the Ariadne tool and the Rigi tool is achieved via the global
software repository developed at the University of Toronto.
The false alarms using only the metric comparison were on average, for the three systems,
39% of the total matches reported. When the DP approach was used, this ratio dropped to
approximately 10% on average (when zero distance is reported). Even if the noise represents
a significant percentage of the result, it can be filtered in almost all cases by adding new
metrics (i.e., line numbers, Halstead's metric, statement count). The significant gain, though,
in this approach is that we can limit the search space to a few hundred (or fewer than a
hundred, when DP is considered) code fragment pairs from a pool of half a million
possible pairs that could have been considered in total. Moreover, the method is fully
automatic, does not require any knowledge of the system, and is computationally acceptable,
O(n * m) for DP, where m is the size of the model and n the size of the input.
Concept-to-code matching uses an abstract language (ACL) to represent code operations
at an abstract level. Markov models and the Viterbi algorithm are used to compute similarity
measures between an abstract statement and a code statement in terms of the probability
that an abstract statement generates the particular code statement.
The ACL can be viewed not only as a regular expression-like language but also as a vehicle
to gather query features and an engine to perform matching between two artifacts. New
features, or invocations and results from other pattern matching tools, can be added to the
features of the language as requirements for the matching process. A problem we foresee
arises when binding variables exist in the pattern. If the pattern is vague then complexity
issues slow down the matching process. The way we have currently overcome this problem
is for every new binding to check only if it is a legal one in a set of possible ones instead of
forcing different alternatives when the matching occurs.
Our current research efforts are focusing on the development of a generic pattern matcher
which given a set of features, an abstract pattern language, and an input code fragment can
provide a similarity measure between an abstract pattern and the input stream.
Such a pattern matcher can be used a) for retrieving plans and other algorithmic struc-
tures from a variety of large software systems (aiding software maintenance and program
understanding), b) for querying digital databases that may contain partial descriptions of data,
and c) for recognizing concepts and other formalisms in plain or structured text (e.g., HTML).

Another area of research is the use of metrics for finding a measure of the changes
introduced from one version to another in an evolving software system. Moreover, we
investigate the use of the cloning detection technique to identify similar operations on
specific data types so that generic classes and corresponding member functions can be
created when migrating a procedural system to an object oriented system.

Notes

1. In this paper, "reverse engineering" and related terms refer to legitimate maintenance activities based on source-
language programs. The terms do not refer to illegal or unethical activities such as the reverse compilation of
object code to produce a competing product.
2. "The Software Refinery" and REFINE are trademarks of Reasoning Systems, Inc.
3. We are using a commercial tool called REFINE (a trademark of Reasoning Systems Corp.).
4. The Spearman-Pearson rank correlation test was used.

References

Adamov, R. "Literature review on software metrics", Zurich: Institut für Informatik der Universität Zürich, 1987.
Baker S. B, "On Finding Duplication and Near-Duplication in Large Software Systems" In Proceedings of the
Working Conference on Reverse Engineering 1995, Toronto ON. July 1995
Biggerstaff, T, Mitbander, B., Webster, D., "Program Understanding and the Concept Assignment Problem",
Communications of the ACM, May 1994, Vol. 37, No.5, pp. 73-83.
Brown, P., et al., "Class-Based n-gram Models of Natural Language", Journal of Computational Linguistics, Vol.
18, No. 4, December 1992, pp. 467-479.
Buss, E., et. al. "Investigating Reverse Engineering Technologies for the CAS Program Understanding Project",
IBM Systems Journal, Vol. 33, No. 3,1994, pp. 477-500.
Canfora, G., Cimitile, A., Carlini, U., "A Logic-Based Approach to Reverse Engineering Tools Production",
Transactions of Software Engineering, Vol. 18, No. 12, December 1992, pp. 1053-1063.
Chikofsky, E.J. and Cross, J.H. II, "Reverse Engineering and Design Recovery: A Taxonomy," IEEE Software,
Jan. 1990, pp. 13 -17.
Church, K., Helfman, J., "Dotplot: a program for exploring self-similarity in millions of lines of text and code",
J. Computational and Graphical Statistics, 2, 2, June 1993, pp. 153-174.
C-Language Integrated Production System User's Manual NASA Software Technology Division, Johnson Space
Center, Houston, TX.
Fenton, E. "Software metrics: a rigorous approach". Chapman and Hall, 1991.
Halstead, M., H., "Elements of Software Science", New York: Elsevier North-Holland, 1977.
Hartman, J., "Technical Introduction to the First Workshop on Artificial Intelligence and Automated Program
Understanding", First Workshop on AI and Automated Program Understanding, AAAI-92, San Jose, CA.
Horwitz, S., "Identifying the semantic and textual differences between two versions of a program", In Proc. ACM
SIGPLAN Conference on Programming Language Design and Implementation, June 1990, pp. 234-245.
Jankowitz, H.T., "Detecting plagiarism in student PASCAL programs", Computer Journal, 31, 1, 1988, pp. 1-8.
Johnson, H., "Identifying Redundancy in Source Code Using Fingerprints", In Proceedings of CASCON '93, IBM
Centre for Advanced Studies, October 24-28, Toronto, Vol. 1, pp. 171-183.
Kuhn, R., DeMori, R., "A Cache-Based Natural Language Model for Speech Recognition", IEEE Transactions
on Pattern Analysis and Machine Intelligence, Vol. 12, No.6, June 1990, pp. 570-583.
Kontogiannis, K., DeMori, R., Bernstein, M., Merlo, E., "Localization of Design Concepts in Legacy Systems",
In Proceedings of International Conference on Software Maintenance 1994, September 1994, Victoria, BC.
Canada, pp. 414-423.

Kontogiannis, K., DeMori, R., Bernstein, M., Galler, M., Merlo, E., "Pattern matching for Design Concept
Localization", In Proceedings of the Second Working Conference on Reverse Engineering, July 1995, Toronto,
ON. Canada, pp. 96-103.
"McCabe T., J. "Reverse Engineering, reusability, redundancy : the connection", American Programmer 3, 10,
October 1990, pp. 8-13.
Moller, K., "Software metrics: a practitioner's guide to improved product development", 1993.
Muller, H., Corrie, B., Tilley, S., "Spatial and Visual Representations of Software Structures", Tech. Rep.
TR-74.086, IBM Canada Ltd., April 1992.
Mylopoulos, J., "Telos : A Language for Representing Knowledge About Information Systems," University of
Toronto, Dept. of Computer Science Technical Report KRR-TR-89-1, August 1990, Toronto.
Ning, J., Engberts, A., Kozaczynski, W., "Automated Support for Legacy Code Understanding", Communications
of the ACM, May 1994, Vol. 37, No. 5, pp. 50-57.
Paul, S., Prakash, A., "A Framework for Source Code Search Using Program Patterns", IEEE Transactions on
Software Engineering, June 1994, Vol. 20, No.6, pp. 463-475.
Rich, C. and Wills, L.M., "Recognizing a Program's Design: A Graph-Parsing Approach," IEEE Software, Jan
1990, pp. 82 - 89.
Tilley, S., Muller, H., Whitney, M., Wong, K., "Domain-retargetable Reverse Engineering II: Personalized User
Interfaces", In CSM '94: Proceedings of the 1994 Conference on Software Maintenance, September 1994, pp.
336-342.
Viterbi, A.J, "Error Bounds for Convolutional Codes and an Asymptotic Optimum Decoding Algorithm", IEEE
Trans. Information Theory, 13(2) 1967.
Wills, L.M.,"Automated Program Recognition by Graph Parsing", MIT Technical Report, AI Lab No. 1358,1992
Automated Software Engineering, 3, 109-138 (1996)
© 1996 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Extracting Architectural Features from Source Code*
DAVID R. HARRIS, ALEXANDER S. YEH drh@mitre.org
The MITRE Corporation, 202 Burlington Road, Bedford, MA 01730, USA

HOWARD B. REUBENSTEIN * hbr@mitretek.org


Mitretek Systems, 25 Burlington Mall Road, Burlington, MA 01803, USA

Abstract. Recovery of higher level design information and the ability to create dynamic software documen-
tation is crucial to supporting a number of program understanding activities. Software maintainers look for
standard software architectural structures (e.g., interfaces, interprocess communication, layers, objects) that the
code developers had employed. Our goals center on supporting software maintenance/evolution activities through
architectural recovery tools that are based on reverse engineering technology. Our tools start with existing source
code and extract architecture-level descriptions linked to the source code fragments that implement architectural
features. Recognizers (individual source code query modules used to analyze the target program) are used to
locate architectural features in the source code. We also report on representation and organization issues for the
set of recognizers that are central to our approach.

Keywords: Reverse engineering, software architecture, software documentation

1. Introduction

We have implemented an architecture recovery framework on top of a source code exam-


ination mechanism. The framework provides for the recognition of architectural features
in program source code by use of a library of recognizers. Architectural features are the
constituent parts of architectural styles (Perry and Wolf, 1992), (Shaw, 1991) which in turn
define organizational principles that guide a programmer in developing source code. Ex-
amples of architectural styles include pipe and filter data processing, layering, abstract data
type, and blackboard control processing.
Recognizers are queries that analysts or applications can run against source code to
identify portions of the code with certain static properties. Moreover, recognizer authors
and software analysts can associate recognition results with architectural features so that the
code identified by a recognizer corresponds to an instance of the associated architectural

This is a revised and extended version based on two previous papers: 1. "Reverse Engineering to the Ar-
chitectural Level" by Harris, Reubenstein and Yeh, which appeared in the Proceedings of the 17th International
Conference on Software Engineering, April 1995, © 1995 ACM. 2. "Recognizers for Extracting Architectural
Features from Source Code" by Harris, Reubenstein and Yeh, which appeared in the Proceedings of the 2nd
Working Conference on Reverse Engineering, July 1995, © 1995 IEEE. The work reported in this paper was
sponsored by the MITRE Corporation's internal research program and was performed while all the authors were
at the MITRE Corp. This paper was written while H. Reubenstein was at GTE Laboratories. H. Reubenstein's
current address is listed above.

feature. Within our implementation, we have developed an extensive set of recognizers
targeted for architecture recovery applications. The implementation provides for analyst
control over parameterization and retrieval of recognizers from a library.
Using these recognizers, we have recovered constituent features of architectural styles in
our laboratory experiments (Harris, Reubenstein, Yeh: ICSE, 1995). In addition, we have
used the recognizers in a stand-alone mode as part of a number of source code quality
assessment exercises. These technology transfer exercises have been extremely useful for
identifying meaningful architectural features.
Our motivation for building our recovery framework stems from our efforts to understand
legacy software systems. While it is clear that every piece of software conforms to some
design, it is often the case that existing documentation provides little clue to that design.
Recovery of higher level design information and the ability to create as-built software
documentation is crucial to supporting a number of program understanding activities. By
stressing as-built, we emphasize how a program is actually structured versus the structure
that designers sketch out in idealized documentation.
The problem with conventional paper documentation is that it quickly becomes out of date
and it often is not adequate for supporting the wide range of tasks that a software maintainer
or developer might wish to perform, e.g., general maintenance, operating system port,
language port, feature addition, program upgrade, or program consolidation. For example,
while a system block diagram portrays an idealized software architecture description, it
typically does not even hint at the source level building blocks required to construct the
system.
As a starting point, commercially available reverse engineering tools (Olsem and Sitte-
nauer, 1993) provide a set of limited views of the source under analysis. While these views
are an improvement over detailed paper designs in that they provide accurate information
derived directly from the source code, they still only present static abstractions that focus
on code level constructs rather than architectural features.
We argue that it is practical and effective to automatically (sometimes semi-automatically)
recognize architectural features embedded in legacy systems. Our framework goes beyond
basic tools by integrating reverse engineering technology and architectural style represen-
tations. Using the framework, analysts can recover multiple as-built views - descriptions
of the architectural structures that actually exist in the code. Concretely, the representation
of architectural styles provides knowledge of software design beyond that defined by the
syntax of a particular language and enables us to respond to questions such as the following:

• When are specific architectural features actually present?


• What percent of the code is used to achieve an architectural feature?
• Where does any particular code fragment fall in an overall architecture?

The paper describes our overall architecture recovery framework including a description
of our recognition library. We begin in Section 2 by describing the overall framework.
Next, in Section 3, we address the gap between idealized architectural descriptions and
source code and how we bridge this gap with architectural feature recognizers. In Section
4, we describe the underlying analysis tools of the framework. In Section 5, we describe

the aspects of the recognition library that support analyst access and recognizer authoring.
In section 6, we describe our experience in using our recovery techniques on a moderately
sized (30,000 lines of code) system. In addition, we provide a very preliminary notion of
code coverage metrics that researchers can use for quantifying recovery results. Related
work and conclusions appear in Sections 7 and 8 respectively.

2. Architecture Recovery - Framework and Process

Our recovery framework (see Figure 1) spans three levels of software representation:

• a program parsing capability (implemented using Software Refinery (Reasoning Systems, 1990)) with
  accompanying code level organization views, i.e., abstract syntax trees and a "bird's eye" file overview
• an architectural representation that supports both idealized and as-built architectural
representations with a supporting library of architectural styles and constituent archi-
tectural features
• a source code recognition engine and a supporting library of recognizers

Figure 1 shows how these three levels interact. The idealized architecture contains the
initial intentions of the system designers. Developers encode these intentions in the source
code. Within our framework, the legacy source code is parsed into an internal abstract
syntax tree representation. We run recognizers over this representation to discover archi-
tectural features - the components/connectors associated with architectural styles (selecting
a particular style selects a set of constituent features to search for). The set of architectural
features discovered in a program form its as-built architecture containing views with respect
to many architectural styles. Finally, note that the as-built architecture we have recovered
is both less than and more than the original idealized architecture. The as-built is less than
the idealized because it may miss some of the designer's original intentions and because it
may not be complete. The as-built is also more than the idealized because it is up-to-date
and because we now have on-line linkage between architecture features and their imple-
mentation in the code. We do not have a definition of a complete architecture for a system.
The notion of code coverage described later in the paper provides a simple metric to use
in determining when a full understanding of the system has been obtained.
The framework supports architectural recovery in both a bottom-up and top-down fashion.
In bottom-up recovery, analysts use the bird's eye view to display the overall file structure
and file components of the system. The features we display (see Figure 2) include file
type (diamond shapes for source files with entry point functions; rectangles for other source
files), name, pathname of directory, number of top level forms, and file size (indicated by the
size of the diamond or rectangle). Since file structure is a very weak form of architectural
organization, only shallow analysis is possible; however, the bird's eye view is a place
where our implementation can register results of progress toward recognition of various
styles.
In top-down recovery, analysts use architectural styles to guide a mixed-initiative recovery
process. From our point of view, an architectural style places an expectation on what

Figure 1. Architectural recovery framework. (The figure relates the three levels: the idealized architecture is implemented by the source code; the source code parses into the program abstract syntax tree, which provides clues for recognizing architectural features; and the architectural features combine, using architectural styles, to form views of the as-built architecture.)

recovery tools will find in the software system. That is, the style establishes a set of
architectural feature types which define the component/connector types to be found in the
software. Recognizers are used to find the component/connector features. Once the features
are discovered, the set of mappings from feature types to their realization in the source code
forms the as-built architecture of the system.
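
To make this recovery loop concrete, the following Python sketch (Python standing in for the REFINE/RRL implementation; names such as recover_style and STYLE_FEATURES are illustrative, not part of the tool) shows how selecting a style could select a set of constituent feature types, each with a default recognizer, and how the recognition results could be bundled into an as-built view:

    from dataclasses import dataclass, field
    from typing import Callable, Dict, List

    # A recovered view maps each feature type of a style to its result tuples.
    @dataclass
    class AsBuiltView:
        style: str
        features: Dict[str, List[tuple]] = field(default_factory=dict)

    # Hypothetical registry: style name -> {feature type: default recognizer}.
    STYLE_FEATURES: Dict[str, Dict[str, Callable]] = {}

    def register(style: str, feature_type: str):
        def wrap(recognizer: Callable):
            STYLE_FEATURES.setdefault(style, {})[feature_type] = recognizer
            return recognizer
        return wrap

    def recover_style(program, style: str) -> AsBuiltView:
        """Run every default recognizer for the chosen style and bundle the results."""
        view = AsBuiltView(style=style)
        for feature_type, recognizer in STYLE_FEATURES.get(style, {}).items():
            view.features[feature_type] = list(recognizer(program))
        return view

    # Toy example: a task spawning style with one trivial recognizer over a toy "program".
    @register("task-spawning", "spawns")
    def find_executable_links(program):
        for call in program:
            if call.get("callee") in {"system", "execlp"}:
                yield (call["callee"], call["caller_task"], call["spawned_task"])

    toy_program = [{"callee": "system", "caller_task": "RUN_SNOOPY", "spawned_task": "SNOOPY"}]
    print(recover_style(toy_program, "task-spawning").features)
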

2.1. Architectural Styles

The research community has provided detailed examples (Garlan and Shaw, 1993, Shaw,
1989, Shaw, 1991, Perry and Wolf, 1992, Hofmeister, Nord, Soni, 1995) of architec-
tural styles, and we have codified many of these in an architecture modeling language.
Our architecture modeling language uses entity/relation taxonomies to capture the com-
ponent/connector style aspects that are prevalent in the literature (Abowd, Allen, Garlan,
1993, Perry and Wolf, 1992, Tracz, 1994). Entities include clusters, layers, processing el-
ements, repositories, objects, and tasks. Some recognizers discover source code instances
of entities where developers have implemented major components - "large" segments of
source code (e.g., a layer may be implemented as a set of procedures). Relations such
as contains, initiates, spawns, and is-connected-to each describe how entities are linked.
Component participation in a relation follows from the existence of a connector - a specific
code fragment (e.g., special operating system invocation) or the infrastructure that pro-
cesses these fragments. This infrastructure may or may not be part of the body of software
under analysis. For example, it may be found in a shared library or it may be part of the
implementation language itself.
As an illustration, Figure 3 details the task entity and the spawns relation associated with a
task spawning style. In a task spawning architectural style, tasks (i.e., executable processing

Figure 2. Bird's Eye Overview



elements) are linked when one task initiates a second task. Task spawning is a style that is
recognized by the presence of its connectors (i.e., the task invocations). Its components are
tasks, repositories, and task-functions. Its connectors are spawns (invocations from tasks to
tasks), spawned-by (the inverse of spawns), uses (relating tasks to any tasks with direct in-
terprocess communications and to any repositories used for interprocess communications),
and conducts (relating tasks to functional descriptions of the work performed).
Tasks are a kind of processing element that programmers might implement by files (more
generally, by call trees). A default recognizer named executables will extract a collection
of tasks. Spawns relates tasks to tasks (i.e., parent and child tasks respectively). Spawns
might be implemented by objects of type system-call (e.g., in Unix/C, programmers can
use a system, execl, execv, execlp, or execvp call to start a new process via a shell
command). Analysts can use the default recognizer, find-executable-links, to retrieve
instances of task spawning.

defentity TASK
:specialization-of processing-element
:possible-implementation file
:recognized-by executables

defrel SPAWNS
:specialization-of initiates
:possible-implementation system-call
:recognized-by find-executable-links
:domain task
:range task

Figure 3. Elements in an architecture modeling language
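
The declarations of Figure 3 could be mirrored in a conventional programming language. The following Python sketch is a hypothetical rendering only (the field names follow the figure, but the classes are stand-ins, not the authors' modeling language):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class EntityType:
        name: str
        specialization_of: Optional[str]
        possible_implementation: str
        recognized_by: str              # name of the default recognizer

    @dataclass
    class RelationType:
        name: str
        specialization_of: Optional[str]
        possible_implementation: str
        recognized_by: str
        domain: str                     # entity type at the source end
        range: str                      # entity type at the target end

    TASK = EntityType("TASK", "processing-element", "file", "executables")
    SPAWNS = RelationType("SPAWNS", "initiates", "system-call",
                          "find-executable-links", domain="task", range="task")
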

Many of the styles we work with have been elaborated by others (e.g., pipe and filter,
object-oriented, abstract data type, implicit invocation, layered, repository). In addition
we have worked with a few styles that have special descriptive power for the type of
programs we have studied. These include application programming interface (API) use,
the task spawning associated with real time systems, and a service invocation style. Space
limitations do not permit a full description of all styles here. However, we offer two more
examples to help the reader understand the scope of our activities.
Layered: In a layered architecture the components (layers) form a partitioning of a subset,
possibly the entire system, of the program's procedures and data structures. As mentioned
in (Garlan and Shaw, 1993), layering is a hierarchical style: the connectors are the specific
references that occur in components in an upper layer and reference components that are
defined in a lower layer. One way to think of a layering is that each layer provides a service
to the layer(s) above it. A layering can either be opaque: components in one layer cannot
reference components more than one layer away, or transparent: components in one layer
can reference components more than one layer away.

Data Abstractions and Objects: Two related ways to partially organize a system are to
identify its abstract data types and its groups of interacting objects (Abelson and Sussman,
1984, Garlan and Shaw, 1993). A data abstraction is one or more related data representations
whose internal structure is hidden to all but a small group of procedures, i.e., the procedures
that implement that data abstraction. An object is an entity which has some persistent state
(only directly accessible to that entity) and a behavior that is governed by that state and
by the inputs the object receives. These two organization methods are often used together.
Often, the instances of an abstract data type are objects, or conversely, objects are instances
of classes that are described as types of abstract data.

3. Recognizers

Recognizers map parts of a program to features found in architectural styles. The recogniz-
ers traverse some or all of a parsed program representation (abstract syntax tree, or AST)
to extract code fragments (pieces of concrete syntax) that implement some architectural
feature. Examples of these code fragments include a string that names a data file or a call
to a function with special effects. The fragments found by recognizers are components
and connectors that implement architectural style features. A component recognizer re-
turns a set of code-fragments in which each code-fragment is a component. A connector
recognizer returns a set of ordered triples - code-fragment, enclosing structure, and some
meaningful influence such as a referenced file, executable object, or service. In each triple,
the code-fragment is a connector, and the other two elements are the two components being
connected by that connector.
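
The two result shapes can be summarized with a small sketch. The Python below is illustrative only; the real recognizers return AST objects rather than the strings used here:

    from dataclasses import dataclass
    from typing import List, Set, Tuple

    CodeFragment = str

    @dataclass
    class ComponentResult:
        fragments: Set[CodeFragment]     # each fragment is a component

    @dataclass
    class ConnectorResult:
        # (connector fragment, enclosing structure, referenced file/executable/service);
        # the last two elements are the two components being connected.
        triples: List[Tuple[CodeFragment, CodeFragment, CodeFragment]]

    example = ConnectorResult(triples=[("system(cmd);", "RUN_SNOOPY", "SNOOPY")])
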

3.1. A Sample Recognizer

The appendix contains a partial listing of the recognizers we use. Here, we examine parts
of one of these in detail.
Table 1 shows the results computed by a task spawning recognizer (named Find-Executable-
Links) applied to a network management program. For each task to task connector, the
ordered triple contains the special function call that is the connector, the task which makes
the spawn (one end of the connector), and the task that is spawned (invoked) by the call
(the other end). This recognizer has a static view of a task: a task is the call tree subset of
the source code that might be run when the program enters the task.
The action part of a recognizer is written in our RRL (REFINE-based recognition lan-
guage). The main difference between RRL and REFINE is the presence in RRL of iteration
operators that make it easy for RRL authors to express iterations over pieces of a code frag-
ment. The RRL code itself may call functions written either in RRL or REFINE. Figure 4
shows the action part of the previously mentioned task spawning recognizer.
This recognizer examines an AST that analysts generate using a REFINE language work-
bench, such as REFINE/C (Reasoning Systems, 1992). The recognizer calls the function
invocations-of-type, which finds and returns a set of all the calls in the program to
functions that may spawn a task. For each such call, the recognizer calls process-invoked

Table 1. The results of task spawning recognition

Function Call      Spawning Task    Spawned Task
system(...         RUN_SNOOPY       SNOOPY
system(...         SNOOPY           EXNFS
system(...         SNOOPY           EX69
system(...         SNOOPY           EX25
system(...         SNOOPY           EX21
system(...         SNOOPY           SCANP
execlp(...         MAIN             RUN_SNOOPY

let (results = {})
  (for-every call in invocations-of-type('system-calls) do
    let (target = process-invoked(call))
      if ~(target = undefined) then
        let (root = go-to-top-from-root(call))
          results <- prepend(results, [call, root, target]));
results

Figure 4. The action part of the task spawning recognizer Find-Executable-Links

to determine if a task is indeed spawned, and if so, get the target task being spawned. If
process-invoked finds a target task, the recognizer then calls go-to-top-from-root,
which finds the root of the task which made the call and then returns the entire call tree
(the task) starting from that root. The target task is also in the form of the entire call tree
starting from the target task's root function. These triples of function calls, spawning tasks
and target tasks are saved in results and then returned by the recognizer.
Figure 5 shows what this task spawning recognizer examines when it encounters the
special function call system(cmd), which is embedded in the task Run_snoopy and is
used by Run_snoopy to spawn the task Snoopy. The command "system(cmd);" is a
connector. Starting from that connector, the recognizer finds and connects Run_snoopy's
call tree to Snoopy's call tree. The figure also shows processing details that are described
in Section 4, where we highlight our underlying analysis capabilities.

3.2. Rationale for Level of Recovery

In addition to architectural features actually found in the source code, we would like to
recover the idealized architecture - a description of the intended structure for a program.
Unfortunately, these idealized descriptions cannot be directly recognized from source code.
The structural information at the code level differs from idealized descriptions in two im-
portant ways. First, while a program's design may commit to certain architectural features
(e.g., pipes or application programming interfaces), actual programs implement these fea-

Figure 5. The task spawning recognizer examines task Run_snoopy spawning task Snoopy via the connector "system(cmd);". (The figure shows the special pattern in file snoopy.c, the map from the makefile used to find the task root, the call tree of the spawning task Run_snoopy, and the call tree of the spawned task Snoopy.)

tures with source code constructs (e.g., procedure parameter passing, use of Unix pipes,
export lists) - a one-to-many conceptual shift from the idealized to the concrete. Second,
there are differences due to architectural mixing/matching and architectural violations. Rea-
sons for such violations are varied. Some are due to a developer's failure to either honor
or understand the entailments of one or more architectural features. This erosion from the
ideal usually increases over the life cycle of the program due to the expanding number of
developers and maintainers who touch the code. Other violations are due to the inability
of an existing or required environment (e.g., language, host platform, development tools,
or commercial enabling software) to adequately support the idealized view and may occur
with the earliest engineering decisions.
If we just target idealized architectures directly and do not search for architectural features
as they are actually built in the source code, we risk missing important structures because
the ideal does not exist in the code. To overcome this difficulty we use a partial recognition
approach that does not require finding full compliance to an idealized architecture nor does
it bog down in a detailed analysis of all of the source code. As described at the start of
Section 3, we aim our recognizers at extracting code fragments that implement specific
architectural features. Together, a collection of these code fragments forms a view on the
program's as-built architecture, but generating such an aggregation is not the responsibility
of the individual recognizers. Note that this restriction relaxes expectations that we will
find fully formed instantiations of architectural styles in existing programs. Rather our
recognizers will find partial instantiations of style concepts and are tolerant of architectural
pathologies such as missing components and missing connectors.

The recognizers are designed to recognize typical and possible patterns of architectural
feature implementations. The recognizers are not fool-proof. A programmer can always
find an obscure way to implement an architectural feature which the recognizers will not
detect and a programmer may write code that accidentally aligns with an architectural feature.
However, the recognizers written so far capture the more common patterns and have worked
well on the examples we have seen. As we encounter more examples, we will modify and
expand the recognizers as needed.
The more advanced recognizers in the set (listed in the appendix)
capture task spawnings and service invocations via slice evaluation and searching for special
programming patterns. Section 4 highlights this analysis. In addition, (Holtzblatt, Piazza,
Reubenstein, Roberts, 1994) describes our related work on CMS2 code. In most of the other
cases, the features are not difficult to recognize. Among other things, the recognizers cover
a wide spectrum of components and connectors that C/Unix programmers typically use for
implementing architectural features.

4. Analysis Tools for Supporting Recognition

The recognizers make use of commercially available reverse engineering technology, but
there are several important analysis capabilities that we have added. The capabilities them-
selves are special functions that recognizer authors can include in a recognizer's definition.
The most prominent capabilities find potential values of variables at a given line of source
code, analyze special patterns, manage clusters (i.e., collections of code fragments), and
encode language-specific ways of accomplishing abstract tasks.

4.1. Values of Variables

Several recognizers use inter-procedural dataflow. We implement this analysis by first com-
puting a program slice (Gallagher and Lyle, 1991), (Weiser, 1984) that handles parameter
passing and local variables. From the slice, we compute a slice evaluation to retrieve the
potential variable values at given points in the source code. This approach is used for
finding users of communication channels, data files that a procedure accesses or modifies,
and references to executable programs. Figure 6 shows two code fragments that illustrate
the requirements for the slice evaluator. Starting with the first argument to the system call
or the fourth argument to the execlp call, the slice evaluator finds the use of C's sprintf to
assign the cmd variable with a command line string. The string contains a pathname to an
executable image.
Our "slice evaluator" algorithm makes several assumptions to avoid intractable compu-
tation. Most notably it ignores control flow and finds the potential values of argument
assignments but not the conditions under which different choices will be made. For exam-
ple, if variable x is bound to 3 and 5 respectively in the "then" and "else" parts of an "if"
statement, the slice evaluator identifies 3 and 5 as possible values, but does not evaluate the
conditional part of the "if" statement.
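
A minimal sketch of this slice-evaluation idea, assuming the slice has already been flattened into a list of assignments and that control flow is ignored (both assumptions follow the description above; the function and data names are hypothetical):

    from typing import Dict, List, Optional, Set

    Assignment = Dict[str, str]   # {"var": <name>, "value": <literal or variable name>}

    def possible_values(var: str, assignments: List[Assignment],
                        seen: Optional[Set[str]] = None) -> Set[str]:
        # Control flow is ignored: every reaching assignment contributes a value.
        seen = seen if seen is not None else set()
        if var in seen:                        # guard against cyclic assignments
            return set()
        seen.add(var)
        values: Set[str] = set()
        vars_in_slice = {a["var"] for a in assignments}
        for a in assignments:
            if a["var"] != var:
                continue
            if a["value"] in vars_in_slice:    # right-hand side is itself a variable
                values |= possible_values(a["value"], assignments, seen)
            else:
                values.add(a["value"])
        return values

    # Both branches of an "if" contribute values; the condition itself is not evaluated.
    toy_slice = [{"var": "x", "value": "3"}, {"var": "x", "value": "5"},
                 {"var": "cmd", "value": "x"}]
    assert possible_values("cmd", toy_slice) == {"3", "5"}
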

4.2. Special Patterns

Slicing provides only part of the story for the examples in Figure 6. Programmers use
stereotypical code patterns to implement frequently occurring computations. Some of
these patterns can be easily recognized in abstract syntax trees. For example, the code in
Figure 6 shows two standard ways of invoking an executable (and potentially invoking a
task). To uncover this architectural feature, we need to exploit knowledge of two patterns.
The first pattern identifies the position - first argument for system calls, last but for the
null string for execlp - of the key command string that contains the name of the executable.
The second pattern describes potential ways programmers can encode pathnames in the
command strings. In the first example, the function sprintf binds the variable cmd to
the string "%s/snoopy" where the %s is replaced by the name of the directory stored
in the variable b i n _ d i r . In the second, the movement to the appropriate directory ("cd
%s / b i n ; " ) is separated from the actual spawning of "snoopy". We designed our approach
to catch such dominate patterns and to ferret out the names of files and executable images
(possibly tasks) within string arguments.

1. sprintf(cmd, "%s/snoopy", bin_dir);


if ( debug == 0)
status = system (cmd);

2. sprintf(cmd,"cd %s/bin; ./snoopy",


top_dir);
if (forkO == 0) {
e x e c l p ( " / b i n / s h " , " s h " , " - c " , cmd, ( c h a r *)0);}

Figure 6. Two approaches for invoking an executable image
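
The two kinds of pattern knowledge described above can be sketched as follows. This Python fragment is illustrative only; KEY_POSITION and executable_name are hypothetical names, and the string handling is a simplification of the actual pattern analysis:

    import re
    from typing import List, Optional

    # Pattern 1: which argument of the spawning call carries the command string.
    KEY_POSITION = {"system": 0,      # first argument
                    "execlp": -2}     # last argument before the terminating NULL

    def command_argument(callee: str, args: List[str]) -> str:
        return args[KEY_POSITION[callee]]

    # Pattern 2: how an executable pathname can be encoded inside the command string.
    def executable_name(command: str) -> Optional[str]:
        last = command.split(";")[-1].strip()      # "cd %s/bin; ./snoopy" -> "./snoopy"
        words = last.split()
        if not words:
            return None
        word = re.sub(r"%\w", "", words[0])        # drop sprintf placeholders such as %s
        return word.rstrip("/").split("/")[-1] or None

    assert executable_name("%s/snoopy") == "snoopy"
    assert executable_name("cd %s/bin; ./snoopy") == "snoopy"
    assert command_argument("execlp", ["/bin/sh", "sh", "-c", "cmd", "NULL"]) == "cmd"
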

Other examples of patterns for C/Unix systems include the use of socket calls with con-
nect or bind calls for creating client-server architectures, and the declaration of read/write
modes in fopen calls. While our approach has been somewhat catch-as-catch-can, we
have found that identifying only a few of these patterns goes a long way toward recovering
architectural features across many architectural styles.

4.3. Clustering

Clusters are groupings of features of the program - a set of files, a set of procedures, or
other informal structures of a program. Some recognizers need to bundle up collections of
objects that may be de-localized in the code. Clustering facilities follow some algorithm
for gathering elements from the abstract syntax tree. They create clusters (or match new

collections to an old cluster), and, in some cases, conduct an analysis that assigns properties
to pairs of clusters based on relationships among constituent parts of the clusters.
For example, our OBject and Abstract Data type (OBAD) recovery sub-tool (Harris,
Reubenstein, Yeh: Recovery, 1995) builds clusters whose constituents are collections of
procedures, data structures, or global variables. OBAD is an interactive approach to the
recovery of implicit abstract data types (ADTs) and object instances from C source code.
This approach includes automatic recognition and semi-automatic techniques that handle
potential recognition pitfalls.
OBAD assumes that an ADT is implemented as one or a few data structure types whose
internal fields are only referenced by the procedures that are part of the ADT. The basic
version of OBAD finds candidate ADTs by examining a graph where the procedures and
structure types are the nodes of the graph, and the references by the procedures to the internal
fields of the structures are the edges. The set of connected components in this graph form the
set of candidate ADTs. OBAD has automatic and semi-automatic enhancements to handle
pitfalls by modifying what is put into the above graphs. Currently, OBAD constructs the
graph from the abstract syntax tree. In the future, OBAD will use graphs made from the
results returned by more primitive recognizers.
Also, recognizers can use clusters as input and proceed to detect relationships among
clusters. For example, a computation of pairwise cluster level dominance looks at the
procedures within two clusters. If cluster A contains a reference to an entry point defined
in cluster B, while cluster B does not reference cluster A, we say that A is dominant over
B. This notion of generalizing properties held by individual elements of groups occurs in
several of our recognizers.
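
A rough sketch of these two clustering ideas, candidate ADTs as connected components of a procedure/structure reference graph and pairwise cluster dominance, is given below (illustrative Python; OBAD itself works over the abstract syntax tree rather than the toy edge lists used here):

    from collections import defaultdict
    from typing import Dict, Iterable, List, Set, Tuple

    # Nodes are procedures and struct types; an edge means a procedure references an
    # internal field of a struct. Each connected component is a candidate ADT cluster.
    def connected_components(edges: Iterable[Tuple[str, str]]) -> List[Set[str]]:
        graph: Dict[str, Set[str]] = defaultdict(set)
        for a, b in edges:
            graph[a].add(b)
            graph[b].add(a)
        seen: Set[str] = set()
        components = []
        for node in graph:
            if node in seen:
                continue
            stack, comp = [node], set()
            while stack:
                n = stack.pop()
                if n in comp:
                    continue
                comp.add(n)
                stack.extend(graph[n] - comp)
            seen |= comp
            components.append(comp)
        return components

    # Pairwise dominance: A dominates B if some procedure in A calls an entry point
    # defined in B while no procedure in B calls back into A.
    def dominates(a: Set[str], b: Set[str], calls: Set[Tuple[str, str]]) -> bool:
        a_to_b = any((p, q) in calls for p in a for q in b)
        b_to_a = any((p, q) in calls for p in b for q in a)
        return a_to_b and not b_to_a

    refs = [("push", "struct stack"), ("pop", "struct stack"), ("lookup", "struct table")]
    comps = connected_components(refs)           # two candidate ADTs: stack ops, table ops
    calls = {("lookup", "push")}                 # table code calls into stack code
    table = next(c for c in comps if "lookup" in c)
    stack = next(c for c in comps if "push" in c)
    assert dominates(table, stack, calls)
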

4.4. Language/Operating-System Models

A design goal has been to write recognizers that are LOL-independent - independent of
specific patterns due to the source code Language, the Operating system, and any Legacy
system features. Our hope is that we will be able to reuse most recognizers across ASTs
associated with different LOL combinations. While we have not explored this goal exten-
sively, we have had some success with recognizers that work for both FORTRAN (under
the MPX operating system) and C (under Unix). Our approach to this is two-fold. First,
we write recognizers using special accessors and analysis functions that have distinct im-
plementations for each LOL. That is, the special access functions need to be re-written
for each LOL, but the recognizer's logic is reusable across languages. Second, we isolate
LOL-specific function (e.g., operating system calls) names in separately loadable libraries
of call specifications. Each call specification describes the language, operating system, and
sometimes even target system approach for coding LOL-neutral behaviors such as system
calls, time and date calls, communication channel creators, data accessing, data transmis-
sion, input/output calls, APIs for commercial products, and network calls. For example,
Figure 7 is the C/Unix model for system-calls (i.e., calls that run operating system line
commands or spawn a task) while Figure 8 shows an analogous FORTRAN/MPX model.

These specifications are also a convenient place for describing attributes of special pat-
terns. In these examples, the key-positions field indicates the argument position of the
variable that holds the name of the executable invoked.

defcalls SYSTEM-CALLS
:call-desc "System Calls"
:call-type system-call
:call-ref-names "system", "execve",
"exec1", "execV",
"execlp", "execvp", "execle"
:key-positions first, next-last, next-last, next-last,
next-last, next-last, next-last

Figure 7. A C/Unix Call Specification

defcalls SYSTEM-CALLS
:call-desc "System Calls"
:call-type system-call
:call-ref-names "m::rsum", "m::sspnd"
:key-positions first, first

Figure 8. A FORTRAN/MPX Call Specification
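
The call specifications of Figures 7 and 8 could be held in a simple table keyed by the language/operating-system pair, so that the same LOL-neutral behavior maps to different function names and key argument positions per combination. The sketch below is a hypothetical rendering (CALL_SPECS and key_position are not part of the implementation):

    from dataclasses import dataclass
    from typing import Dict, List, Tuple

    @dataclass
    class CallSpec:
        call_type: str
        names_and_key_positions: List[Tuple[str, str]]   # (function name, key position)

    CALL_SPECS: Dict[Tuple[str, str], CallSpec] = {
        ("C", "Unix"): CallSpec("system-call", [
            ("system", "first"), ("execve", "next-last"), ("execl", "next-last"),
            ("execv", "next-last"), ("execlp", "next-last"),
            ("execvp", "next-last"), ("execle", "next-last")]),
        ("FORTRAN", "MPX"): CallSpec("system-call", [
            ("m::rsum", "first"), ("m::sspnd", "first")]),
    }

    def key_position(lol: Tuple[str, str], function_name: str) -> str:
        # Which argument position holds the name of the executable invoked.
        return dict(CALL_SPECS[lol].names_and_key_positions)[function_name]

    assert key_position(("C", "Unix"), "system") == "first"
    assert key_position(("FORTRAN", "MPX"), "m::rsum") == "first"
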

4.5. An Example - Putting it all together

We return to the find-executable-links recognizer described in Section 3.1. When faced
with either code fragment of Figure 6, this recognizer will collect the appropriate triple. We
explain this activity in terms of the above analysis capabilities. The functions go-to-top-
from-root and invocations-of-type perform their job by traversing the program
AST. invocations-of-type accesses the call-specification to tell it which functions in
the examined program can implement some architectural style feature. For example, in the
Unix operating system, the system-call specification names the functions that can spawn a
task (i.e., system or members of the execlp family of functions). The function process-
invoked uses slice evaluation to find the value(s) of the arguments to the function calls
returned by invocations-of-type. process-invoked then uses special patterns to
determine the name of the executable image within the command string. In addition,
process-invoked consults a map to tell it which source code file has the root for which
task. The map is currently hand generated from examining system makefiles. In the file
with the root, process-invoked finds the task's root function (in the C language, this is

the function named main) and then traverses the program AST to collect the call tree into
a cluster starting at that root function. Figure 5 shows how these various actions are put
together for the sample recognition described in Section 3.1.
The database of language and operating system specific functions, the program slicing
(and slice evaluation), and the special patterns described in this section are all areas where
our architecture recovery tool adds value beyond that of commercially available software
reverse engineering tools.
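
The following end-to-end sketch strings these capabilities together for the find-executable-links example. Everything in it is stubbed and illustrative: the call-site list stands in for invocations-of-type plus slice evaluation, ROOT_MAP stands in for the makefile-derived map, and enclosing_task stands in for go-to-top-from-root:

    from typing import List, Tuple

    CALL_SITES = [            # (callee, enclosing procedure, recovered command string)
        ("system", "run_snoopy_main", "%s/snoopy"),
        ("execlp", "main", "cd %s/bin; ./run_snoopy"),
    ]
    ROOT_MAP = {"snoopy": "SNOOPY", "run_snoopy": "RUN_SNOOPY"}   # executable -> task

    def executable_name(command: str) -> str:
        word = command.split(";")[-1].strip().split()[0]
        return word.replace("%s", "").strip("/").split("/")[-1]

    def enclosing_task(procedure: str) -> str:
        # Stand-in for go-to-top-from-root: map a procedure to its task's root.
        return "RUN_SNOOPY" if procedure.startswith("run_snoopy") else "MAIN"

    def find_executable_links() -> List[Tuple[str, str, str]]:
        triples = []
        for callee, proc, command in CALL_SITES:
            target = ROOT_MAP.get(executable_name(command))
            if target is not None:
                triples.append((callee, enclosing_task(proc), target))
        return triples

    print(find_executable_links())
    # -> [('system', 'RUN_SNOOPY', 'SNOOPY'), ('execlp', 'MAIN', 'RUN_SNOOPY')]
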

5. Recognizers in Practice

As we developed a set of recognizers, it quickly became clear to us that we needed to
pay attention to organization and indexing issues. Even at a preliminary stage addressing
only a few architecture styles and a single implementation language, we found we could not
easily manage appropriate recognizers without some form of indexing. Since we intend that
software maintenance and analysis organizations treat recognizers as software assets that can
be used interactively or as part of analysis applications, we have augmented the recognizer
representations with retrieval and parameterization features. These features provide support
so that families of recognizers can be managed as a software library. As part of this effort, we
identified reusable building blocks that enable us to quickly construct new recognizers and
manage the size of the library itself. This led us to codify software knowledge in canonical
forms that can be uniformly accessed by the recognizers. In addition, we discovered that
architectural commitments map to actual programs at multiple granularity levels and this
imposed some interesting requirements on the types of recognizers we created.
In this section, we describe several of the features of our framework that facilitate recogni-
tion authoring and recognizer use. In particular, we describe a retrieval by effect mechanism
and several recognizer composition issues.

5.1. Recognizer Authoring

Recognizer authors (indeed all plan/recognition library designers) face standard software
development trade-off issues that impact the size of the library, the understandability of the
individual library members, and the difficulty of composing new library members from old.
While our REFINE-based recognition language (RRL) does not support input variables,
it does have a mechanism for parameterization. These parameters have helped us keep
the recognition library size small. The parameters we currently use are focus, program,
reference, and functions-of-interest. The parameters provide quite a bit of flexibility for
the recognizer author who can populate the library with the most appropriate member of a
family of related recognizers. As an illustration, when functions-of-interest is bound to the
set of names "system", "execve", "execl", "execv", "execlp", "execvp", and "execle" and
reference is bound to "system-calls", the three fragments in Figure 9 yield an equivalent
enumeration (over the same sets of objects in a legacy program).
The first fragment maximizes programming flexibility, but does require analyst tailoring
(i.e., building the appropriate functions-of-interest list from scratch or setting it to some

1. let (function-names =
FUNCTIONS-OF-INTEREST)
(for-every item in program such-that
function-call(item) and
name(item) in function-names do

2. (for-every item in invocations-of-type(reference) do

3. (for-every item in invocations-of-type('system-calls) do

Figure 9. A family of recognizer fragments

pre-defined list of special calls). In addition, more of the processing is explicitly stated,
perhaps making the fragment more difficult to understand (i.e., lacking abstractions). In
contrast, the third special purpose recognizer does not require any external parameter set-
tings, but would co-exist in a library with many close cousins. The second fragment is
a compromise. In general, our set of parameters allows recognizer authors to modulate
abstraction versus understandability issues to produce a collection that best suits the needs
of their specific user community.
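
The effect of these parameters can be sketched with a single generic recognizer body specialized by binding functions-of-interest directly or indirectly through a named reference (illustrative Python; invocations_of_interest and CALL_SPEC_NAMES are hypothetical names):

    from typing import Dict, Iterable, List, Optional, Set

    CALL_SPEC_NAMES: Dict[str, Set[str]] = {
        "system-calls": {"system", "execve", "execl", "execv", "execlp", "execvp", "execle"},
    }

    def invocations_of_interest(program: Iterable[dict],
                                functions_of_interest: Optional[Set[str]] = None,
                                reference: Optional[str] = None) -> List[dict]:
        # Either parameter binding yields the same enumeration over the program.
        names = functions_of_interest or CALL_SPEC_NAMES[reference]
        return [item for item in program
                if item.get("kind") == "function-call" and item.get("name") in names]

    program = [{"kind": "function-call", "name": "system"},
               {"kind": "function-call", "name": "printf"}]
    # The fragments of Figure 9 correspond to different ways of binding these parameters:
    assert (invocations_of_interest(program,
                                    functions_of_interest=CALL_SPEC_NAMES["system-calls"])
            == invocations_of_interest(program, reference="system-calls"))
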

5.2. Operation and Control

Analysts use recognizers in two ways. First, recognizers can be stand-alone analysis meth-
ods for answering a specific question about the source code. For example, an analyst might
ask for the locations where the source code invokes the sendmail service. Second, within
our architecture recovery implementation, recognizers are semi-automatically bundled to-
gether to produce a composite view. For example, Section 6 below shows a system's as-built
architecture with respect to the task-spawning style. This view was constructed using the set
of default recognizers associated with the entities and relations of the task-spawning style.
Three recognizers were employed. The find-executable-links recognizer found instances
of the spawns relation (encoded in the system or execlp calls of the program), a second
recognizer found instances of file/shared-memory interprocess communication (through
fopen and open calls), and a third looked for separate executables (identified by "main"
procedures) that may not have been found by the other recognizers. Within our recovery
framework, analysts can override the defaults by making selections from the recognition
library. Thus, either in stand-alone or as-built architecture recovery modes, recovery is an
interactive process and we need facilities that will help analysts make informed selections
from the library.

5.2.1. Recognizer Retrieval

Since the library is large (60 or more entries), we have provided two indexing schemes that
help the analyst find an appropriate recognizer. The first scheme simply uses the text strings
in a description attribute associated with each recognizer. The analyst enters a text string
and the implementation returns a list of all recognizers whose description contains text that
matches the string. The analyst can review the list of returned descriptions and select the
recognizer that looks most promising.

The second scheme allows an analyst to see and select from all the recognizers that
would return some type of information. While analysts may not remember the name of
a recognizer, they will probably know the type of information (e.g., file, function-call,
procedure) that they are looking for. To support this retrieval, we have attached effect
descriptions to each recognizer. Since the result of running a recognizer may be that some
part of the source code is annotated with markers, we think of the "effects" of running
a recognizer on the AST. For example, the task-spawning recognizer in Figure 4 finds
function calls and files (associated with tasks). The format for these effect descriptions is
"[<category> <type>]" where <category> is either "know" or "check" and <type> is
some entry in the type hierarchy. Such tuples indicate that the recognizer will "know" about
fragments of the stated type or "check" whether fragments are of the stated type.

Figure 10 is the type taxonomy our implementation uses. Uppercase entries are top
entries of taxonomies based on the language model (e.g., C, FORTRAN) along with our
specializations (e.g., specializations of function call) and clustering extensions. The depth
of indentation indicates the depth in a subtree.

When analysts select a type from this list, the system shows them a list of all the recognizers
that find items of that type. Figure 11 is an example that shows the restricted menu of
recognizers that achieve [know function-call]. In the event that the analyst does not find
a relevant recognizer in the list, the system helps by offering to expand the search to find
recognizers that know generalizations of the current type. For example, a request [know
special-call] would be extended to the request [know function-call] and then to the request [know
expression], climbing into the upper domain model for the legacy system's language.
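
A sketch of this retrieval-by-effect scheme, including the climb through the type taxonomy, might look as follows (illustrative Python; the PARENT and LIBRARY tables are toy data, not the actual library):

    from typing import Dict, List, Optional, Tuple

    PARENT: Dict[str, Optional[str]] = {          # child type -> parent type (Figure 10)
        "system-call": "special-call",
        "special-call": "function-call",
        "function-call": "expression",
        "expression": None,
    }
    LIBRARY: Dict[str, List[Tuple[str, str]]] = { # recognizer name -> effect tuples
        "find-executable-links": [("know", "function-call"), ("know", "source-file")],
        "find-service-invocations": [("know", "special-call")],
    }

    def retrieve(category: str, type_: str) -> List[str]:
        """Return recognizers matching the effect, widening the type until one is found."""
        current: Optional[str] = type_
        while current is not None:
            hits = [name for name, effects in LIBRARY.items()
                    if (category, current) in effects]
            if hits:
                return hits
            current = PARENT.get(current)         # e.g., system-call -> special-call -> ...
        return []

    assert retrieve("know", "system-call") == ["find-service-invocations"]
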

Once a recognizer is selected, the system prompts the analyst for parameters that the
recognizer requires. Analysts can set the reference parameter to the result of a previous
recognition thus providing a mechanism for cascading several recognizers together to re-
trieve a complex pattern. In addition, there is an explicit backtracking scheme encoded
for the recognizers. If a recognizer requires other recognizers to have been run (i.e., to
populate some information on the AST) its representation indicates that the second recog-
nizer is a pre-condition. The analyst can review the result and select some subset of the
returned results for subsequent analysis. Reasons for only selecting a subset could range
from abstracting away details (for understanding or analysis) to removing irrelevant details
that cannot be detected syntactically (e.g., a module is only used for testing).

CLUSTER
Network-exchange
RPC-exchange
Port-exchange
Pipe-exchange
Unix-pipe
Code-fragment
Connector-fragment
Module
Service
Non-source-file
Shell-script
Input-file
Output-file
Source-file
Executable-object
FUNCTION-CALL
Special-call
Network-call
System-call
I/O-call
Non-POSIX-compliant-call
FUNCTION-DEF
STRUCT-TYPE

Figure 10. A taxonomy of recognition types



FUNCTION-CALL-ARTIFACT
NETWORK-CALL : implementations of client process
NETWORK-CALL : implementations of server process
SERVICE : LINKS between the program and any network services or remote procedures
NETWORK-CALL : LINKS between procedures and some service
NETWORK-CALL : LINKS between procedures and network services
PROCESS-INVOCATION : LINKS between procedures and shell commands
SPECIAL-CALL : Connection family used in a network exchange
SPECIAL-CALL : Connection type used in a network exchange
PROCESS-INVOCATION : Spawning LINKS between executable modules
PROCESS-INVOCATION : Invocations that activate executables
FUNCTION-DEF : LINKS between local and remote procedures
SPECIAL-CALL : Function calls identified directly or by dereferenced function name
SPECIAL-CALL : Invocations of members of a family of functions

Figure 11. Recognizers with effect [know function-call]

5.2.2. Recognizer Results

From among several possible representations our recognition results are either sets of objects
from the AST or sets of tuples of objects from the AST. This choice has been motivated by
the multiple purposes we envision for recognizer use. As we have mentioned, recognition
results may stand by themselves in answering a question, they may be joined with other
results to form a composite picture (i.e., this is how style recognition is accomplished), or
they may be used as inputs to other recognizers in a more detailed analysis of the code.
Standard output results are needed to support interoperability among recognizers and to
provide a uniform API to applications. This notion needs to be balanced with the need to
allow analysts to flexibly compose solutions to a wide variety of questions involving multiple
aggregation modes. For example, many architectural features (e.g., tasks, functional units)
require an analysis of a calling hierarchy. Given a set of procedures - perhaps a functionally
cohesive unit - several aggregations are possible. We might be interested in identifying a
set of common callers of these procedures, the entire calling hierarchy, a calling hierarchy
that is mutually exclusive with some other set of procedures (i.e., a distinct functional unit),
or a set of root nodes (i.e., candidates for task entry points). All of these are meaningful
for identifying architectural components. Thus our library contains recognizers that return
various aggregations within the calling hierarchy. The danger is that if we have too many
different output forms, we will drastically limit our ability to compose recognition results.
Our solution deals with this problem in two ways. First, we output results in a manner
that reduces the need for repeating computationally expensive analyses in subsequent rec-

ognizers of a cascaded chain. Second, we standardize output levels so that results can be
compared and bundled together easily.
Avoiding Redundant Computations: One approach to recognition would be to assume
that each recognizer always returns a single object and that adjoining architectural structures
can be found piecemeal by following the AST (or using some of the analysis tools described
above). We have found this approach to be unsatisfactory because many of the recognizers
collect objects in the context of some useful larger structure. Rather, it is useful to return
a structure (i.e. the ordered triples described above) that contains contextual information.
For example, a slice evaluation coupled with the use of program patterns (e.g., the slice
associated with the code in Figure 6) can be a relatively expensive computation. Once the
recognizer completes this examination it caches the result as the third element of a triple
(as in Table 1) to avoid re-computations. This format has enabled us to support extensive
architecture recovery without excessive duplication of computations.
Standard Contexts: Each recognizer has only a local view; it cannot know how some
other recognizer will use its results. The critical concern is to identify some standard
contexts so that other parts of an analysis process can rely on a uniform type of response. If
we do not have some standardization, the enclosing structure part of a recognition could be a
procedure, a file, a directory, a task, or something else. This would require each recognizer
to carry out a normalization step prior to using the results of another recognizer.
For the current framework, we selected the procedure level as a standard context. That is
to say, unless there is reason to report some other structures, triples will be of the form <
object, procedure, procedure >. Our justification for this is that, if necessary, coarser-grained
structure (e.g., file, directory) can be easily re-derived from the AST, while procedures offer
an architecture level result that embodies the results of expensive lower-level analyses such
as slice evaluation.
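
The standard-context convention and the caching of expensive intermediate results can be sketched as follows (illustrative Python; the helper names and the lookup table are hypothetical):

    from functools import lru_cache
    from typing import List, Tuple

    @lru_cache(maxsize=None)
    def spawned_root_procedure(connector_fragment: str) -> str:
        # Stand-in for slice evaluation + special-pattern matching + root-map lookup;
        # caching means cascaded recognizers do not repeat the expensive analysis.
        return {"system(cmd); /*run_snoopy.c*/": "snoopy_main"}.get(connector_fragment,
                                                                    "unknown")

    def to_standard_triples(raw: List[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
        # raw pairs: (connector code fragment, enclosing procedure). Coarser contexts
        # (file, directory, task) can be re-derived from the AST when needed.
        return [(frag, proc, spawned_root_procedure(frag)) for frag, proc in raw]

    print(to_standard_triples([("system(cmd); /*run_snoopy.c*/", "run_snoopy_main")]))
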

5.3. Recognizer Representation

We can summarize the above issues by displaying the internal representation we use for
each recognizer. In our implementation, each recognizer is an object with a set of attributes
that the implementation uses for composition and retrieval. The attributes are as follows:

• Name: a unique identifier

• Description: a textual description of what the recognizer finds (used in indexing)

• Effects: effects indicate the types of source code fragments that are found (also used in
indexing)

• Pre-conditions: other recognizers that must be run before this recognizer can run

• Environment: the set of parameters that analysts must set before invoking the recognizer

• Recognition method: the action part of the recognizer; written in RRL (as illustrated
in Section 3.1 above)

In summary, recognizer authors build the RRL descriptions using the RRL language con-
structs and special analysis functions. They set pre-conditions and environment attributes
to link the recognizer into the library. At this time they may add the new recognizer's name
to default recognizer lists for the style-level entities/relations.
Subsequently, during an investigation, an analyst retrieves the recognizer either by se-
lecting an entity/relation with a default, by recognizer name, by indicating a text fragment
of the description, or by indicating the effect desired. The implementation recursively
runs recognizers in the pre-condition attribute, asks the analyst to set any of the required
parameters, and interprets the RRL code in the recognizer's method.
If the analyst employed the recognizer in architecture recovery, the results are added
to the as-built architecture with respect to some style. We provide additional support
via specialization hierarchies among the architectural entities and relations. Upon finding
that few examples of an architectural feature are recognized, the analyst has the option of
expanding a search by following generalization and specialization links and searching for
architecturally related information. This capability complements the recognizer indexing
scheme based on code level relationships.
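
Putting the attribute list and the retrieval/execution behavior together, a recognizer object and a small library driver might be sketched as below (illustrative Python only; the RRL method bodies are replaced by toy lambdas):

    from dataclasses import dataclass, field
    from typing import Callable, Dict, List, Tuple

    @dataclass
    class Recognizer:
        name: str
        description: str
        effects: List[Tuple[str, str]]                        # e.g., ("know", "function-call")
        pre_conditions: List[str] = field(default_factory=list)
        environment: List[str] = field(default_factory=list)  # parameters analysts must set
        method: Callable[..., list] = lambda **_: []

    class RecognizerLibrary:
        def __init__(self):
            self.recognizers: Dict[str, Recognizer] = {}
            self.results: Dict[str, list] = {}                # results annotate the "AST"

        def add(self, r: Recognizer):
            self.recognizers[r.name] = r

        def run(self, name: str, **params) -> list:
            r = self.recognizers[name]
            for pre in r.pre_conditions:                      # recursively run pre-conditions
                if pre not in self.results:
                    self.run(pre, **params)
            missing = [p for p in r.environment if p not in params]
            if missing:
                raise ValueError(f"{name} needs parameters: {missing}")
            self.results[name] = r.method(results=self.results, **params)
            return self.results[name]

    lib = RecognizerLibrary()
    lib.add(Recognizer("executables", "find task roots", [("know", "source-file")],
                       method=lambda **_: ["main"]))
    lib.add(Recognizer("find-executable-links", "spawning links between tasks",
                       [("know", "function-call")], pre_conditions=["executables"],
                       environment=["reference"],
                       method=lambda results, reference, **_: [("system", root, "SNOOPY")
                                                               for root in results["executables"]]))
    print(lib.run("find-executable-links", reference="system-calls"))
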

6. Experience

During the past year, we employed our architecture recovery tools on six moderate sized
programs. Our most successful example was XSN, a MITRE-developed network-based
program for Unix system management. The program contains approximately 30,000 source
lines of code.
This program contains several common C/Unix building blocks and has the potential
for matching aspects of multiple styles. It is built on top of the X window system and
hence contains multiple invocations of the X application program interface. It consists of
executable files for multiple tasks developed individually by different groups over time.
These executables are linked in an executive task that uses operating system calls to spawn
specific tasks in accordance with switches set when the user initiates the program. Each
task is a test routine consisting of a stimulus, a listener, and analysis procedures. Calls
using socket constructs provide communications between host platforms on the network to
implement a client/server architecture.
Periodically, we were able to present our analysis to the original code developers and
receive their feedback and suggestions on identifying additional architectural features in
the code.
Our first recovery effort involved looking for the task spawning structure. XSN contains
several tasks and specific operating system calls that are used to connect these modules.
Figure 12 is a screen image of the graphical view of task spawning recovered from XSN. This
view also contains elements of what we call the file-based repository style - the connections
between the tasks and the data files that they access or modify. The rectangular boxes
represent a static view of a task: the source code that may be run when entering that task.
The diamonds represent data files. The data files' names (and indication of their existence)
are recovered from the source code. The oval is an unknown module. The arrows indicate
connections of either one task spawning another task or the data flow between data files and

tasks. Figure 12 is actually a view of a thinned-out XSN: several tasks and data files have
been removed to reduce diagram clutter. In the view's legend, "query" is another term for
"recognizer".
We next looked for layering structure. We attempted an approach that bundled up cycles
within the procedure calling hierarchy but otherwise used the procedure calling hierarchy
in its entirety. This approach led to little reduction over the basic structure chart report.
We felt that additional clustering was possible using either deeper dominance analysis or
domain knowledge, but we did not pursue these approaches. We did build some preliminary
capabilities based on advertised API's for commercial subsystems or program layers. These
capabilities found portions of the code identified as users of some API. One predominant
example, particularly informative for XSN, was the code that accesses the underlying X
window system. We have not yet been able to implement a method that would combine
such bottom-up recognition with more globally-based layering recovery methods.
XSN acts as a client (sometimes a server) in its interactions with network services such
as sendmail or ftp. A service-invocation recognizer shown in Figure 13 recovered elements
of this style successfully. Over time we made several enhancements to the recognizer to
improve its explanation power. First, we refined its ability to identify the source of an
interaction. The notion we settled on was to identify the procedures that set port numbers
(i.e., indicate the service to be contacted) rather than the procedure containing the service
invocation call. Second, we enhanced the recognizer so that it would recognize a certain
pattern of complex, but stereotypical client/server interaction. In this pattern, we see the
client setting up a second communication channel in which it now acts as the server. It
was necessary to recognize this pattern in order to identify the correct external program
associated with the second channel.
At this point, we inspected the code to see if there were any obvious gaps in system
coverage by the as-built architecture we had found. We discovered that there were several
large blocks of code that did not participate in any of the styles. By examining the code,
it was clear that the developers had implemented several abstract data types - tables and lists.
Thus, we set about building and applying OBAD to the XSN system. These table and list
abstractions were recognized interactively by our OBAD sub-tool (see Section 4.3).
We developed over sixty recognizers for this analysis. Thirteen were used for client/server
recovery, seven for task spawning, nine were used for some form of layering, four for repos-
itory, seven for code level features, two for ADT recovery, and one for implicit invocation.
Thirteen of the recognizers were utilities producing intermediate results that could be used
for recovering features of multiple styles. The library also contains seven recognizers that
make some simplifying assumptions in order to approximate the results of more computa-
tional intensive recognizers. These recognizers proved to be particularly useful in situations
where it was not possible to obtain a complete program slice (Section 4.1).
Since the above profile of recognizers is based on recognition adequacy with respect to
only a few systems, the numbers should be taken in context. What is important is that they
indicate the need for serious recognition library management of the form we have described
in this paper.
We feel that we have gone a long way toward recognizing standard C/Unix idioms for
encoding architectural features. We are still at a stage where each new system we analyze


Figure 12. Task spawning view of (a thinned-out version of) XSN



let (result = [])
  for-every call in invocations-of-type('service-invocations)
    for-each port in where-ports-set(call)
      let (target = service-at(second(port)))
      let (proc = enclosing-procedure(first(port)))
        (result <- prepend(result, [call, target, proc])),
result

Figure 13. A service recognizer uses the invocations-of-type construct.

requires some modifications to our implementation, but the number of required modifi-
cations is decreasing. In one case, we encoded a new architecture style called "context"
(showing the relationship between system processes and the connections to external files
and devices) as a means to best describe a new system's software architecture. We were
able to recognize all features of this style by just authoring one new recognizer and reusing
several others. More frequently, we have found that the set of recognizers is adequate but
we need to refine existing recognizers to account for subtleties that we had not seen before.
Table 2 summarizes the amount of code in XSN covered when viewed with respect to the
various styles. The first row gives the percentage of the lines of code used in the connectors
for that style. The second row gives the percentage of the procedures covered by that style.
A procedure is covered if it is included in some component in that style.

Table 2. Code coverage measures for XSN

Style:               ADT     API     C/S     Repository    Task Spawning
% connector LOC:     0       0       0.3     2.2           0.7
% of procedures:     39.3    13.9    3.3     13.1          2.5

Combining all the styles whose statistics are given results in a total connector coverage
of about 3% of the lines of code and over 47% of the procedures. Procedure coverage total
is less than the sum of its constituents in the above table because the same procedure may
be covered by multiple styles.
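
The coverage measures themselves are straightforward to compute. The sketch below (illustrative Python with made-up names and numbers) shows both the per-style percentages and why the combined procedure coverage is less than the column sum:

    from typing import Dict, Set

    def connector_loc_coverage(connector_lines: Set[int], total_loc: int) -> float:
        # Percentage of source lines used in a style's connectors.
        return 100.0 * len(connector_lines) / total_loc

    def procedure_coverage(covered: Dict[str, Set[str]],
                           all_procedures: Set[str]) -> Dict[str, float]:
        # A procedure is covered by a style if it is included in some component of that style.
        per_style = {style: 100.0 * len(procs) / len(all_procedures)
                     for style, procs in covered.items()}
        # The combined total unions procedures first, so it is less than the column sum
        # whenever the same procedure is covered by more than one style.
        union = set().union(*covered.values())
        per_style["combined"] = 100.0 * len(union) / len(all_procedures)
        return per_style

    procs = {f"p{i}" for i in range(100)}
    covered = {"ADT": {f"p{i}" for i in range(40)},
               "Repository": {f"p{i}" for i in range(30, 45)}}
    print(procedure_coverage(covered, procs))   # combined < 40 + 15 because of overlap
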
We offer these statistics as elementary examples of architectural recovery metrics. This
endeavor is important both to determine the effectiveness of the style representations (e.g.,
what is the value-added of authoring a new style) and to provide an indicator for analysts
of how well they have done in understanding the system under analysis.
The measures we provide are potentially subject to some misinterpretation. It is difficult
to determine how strongly a system exhibits a style and how predictive that style is of
the entire body of code. As an extreme example, one could fit an entire system into one
layer. This style mapping is perfectly legal and covers the whole system, but provides no
abstraction of the system and no detailed explanation of the components.

In spite of these limits, there are experimental and programmatic advantages for defin-
ing code coverage metrics. The maintenance community can benefit from discussion on
establishing reasonable measures of progress toward understanding large systems.

7. Related Work

We can contrast our work with related work in recovery of high-level software design,
top-down and bottom-up approaches to software understanding, and interactive reverse
engineering.

7.1. Recovery of high-level design

Program structure has been analyzed independently of any pre-conceived architectural
styles to reveal program organization, as discussed in (Biggerstaff, 1989), (Biggerstaff, Mit-
bander, Webster, 1994), (Schwanke, 1991), and (Richardson and Wilde, 1993). General in-
quiry into the structure of software can be supported by software information systems
such as LaSSIE (Devanbu, Ballard, Brachman, Selfridge, 1991). LaSSIE represents pro-
grams from a relational view that misses some of the architectural grist we find deeply
embedded in abstract syntax trees. However, their automatic classification capabilities
are more powerful than our inferencing capabilities. In contrast to our work, DESIRE
(Biggerstaff, Mitbander, Webster, 1994) relies on externally supplied cues regarding pro-
gram structure, modularization heuristics, manual assistance, and informal information.
Informal information and heuristics can also be used to reorganize and automatically re-
fine recovered software designs/modularizations. Schwanke (Schwanke, 1991) describes a
clustering approach based on similarity measurements. This notion matches well with some of
the informal clustering that we perform, although that work is not used to find components
of any particular architectural style.
Canfora et al. (Canfora, De Lucia, DiLucca, Fasolino, 1994) recover architectural modules
by aggregating program units into modules via a concept of node dominance on a
directed graph. This work addresses a functional architectural style that we have not con-
sidered, but again there are similarities to the clustering we perform within our OBAD
subsystem.

7.2. Top-down approaches

Our recognizers are intended for use in explorations of architectural hypotheses - a form of
top-down hypothesis-driven recognition coupled with bottom-up recognition rules. Quilici
(Quilici, 1993) also explores a mixed top-down, bottom-up recognition approach using
traditional plan definitions along with specialization links and plan entailments.
It is useful to compare our work to activities in the tutoring/bug detection community.
Our context-independent approach is similar to MENO (Soloway, 1983). In the tutoring
domain, context-independent approaches suffer because they cannot deal with the higher-
level plans in the program. PROUST (Johnson and Soloway, 1985) remedies some of this
via a combination of bottom-up recognition and top-down analysis, i.e., looking up typical
patterns that implement a programmer's intentions. In contrast, in our approach commit-
ments to use a particular architectural style are made at the top level, so the mapping
between intentions and code is more direct.

7.3. Bottom-up approaches

The reverse engineering and program understanding community has generally approached
software understanding problems with a bottom-up approach in which a program is matched
against a set of pre-defined plans/cliches from a library. This work is not motivated by the
architectural organizational principles essential for the construction of large programs. Current
work on program concept recognition is exemplified by (Kozaczynski, Ning, Sarver, 1992),
(Engberts, Kozaczynski, Ning, 1991), (Dekker and Ververs, 1994) which continues the cliche-
based tradition of (Rich and Wills, 1990). This work is based on a precise data and con-
trol flow match which indicates that the recognized source component is precisely the
same as the library template. Our partial recognition approach does not require algorithmic
equivalence between a plan and the source being matched; rather, matches are based
on events (Harandi and Ning, 1990) in the source code. That is, the existence
of patterns of these events is sufficient to establish a match. Our style of source-code
event-based recognition rules is also exemplified in (Kozaczynski, Ning, Sarver, 1992)
and (Engberts, Kozaczynski, Ning, 1991), which demonstrate a combination of precise control
and data flow relation recognition and more abstract code event recognition.

7.4. Interactive reverse engineering

Wills (Wills, 1993) points out the need for flexible, adaptable control structures in reverse
engineering. Her work attacks the important problem of building interactive support that
cuts across multiple types of software analysis. In contrast, our work emphasizes authoring
and application of multiple analysis approaches applicable for uncovering architectural
features in the face of specific source code nuances and configurations.
Paul and Prakash (Paul and Prakash: patterns, 1994) (Paul and Prakash: queries, 1994)
investigate source code search using program patterns. This work uses a query language
for specifying high level patterns on the source code. Some of these patterns correspond
to specific recognition rules in our approach. Our approach focuses more on analyst use
of a pre-defined set of parameterizable recognizers each written in a procedural language.
That is, we restrict analyst access to a set of predefined recognizers, but allow recognizer
authors the greater flexibility of a procedural language.

8. Evaluation and Conclusions

We have implemented an architecture recovery framework that merges reverse engineering
and architectural style representation. This is an important first step toward the long-range
goal of providing custom, dynamic documentation for a variety of software analysis
tasks. The framework provides for analyst control over parameterization and retrieval of
recognition library elements. We have described methods for recognizer execution and
offered some recognizer authoring guidance for identifying recognizers that will interact
well with other recognizers in the library.
The recognizers make use of commercially available reverse engineering technology, but
there are several important analysis capabilities that we have added. In addition, one of
our major contributions has been to determine the architectural patterns to recognize and
to express these patterns with respect to the underlying analysis capabilities.
Our current recognition capabilities have been motivated by thinking about a C/Unix
environment, which has its own unique programming idioms. While we phrase our rec-
ognizers at a general, language- and operating-system-independent level (e.g., task spawning or
service invocation), there are some biases within the recognition library itself, and we would
like to extend our approaches to cover idioms of other high-level languages and operating
systems. Primarily, there is a dependence of a set of functions on specifics of the legacy
language or operating system. In addition, many of the features that are recognized through
low-level patterns in C/Unix implementations (e.g., a call to the system function that spawns
a task, a struct type) will appear explicitly in other languages/operating systems as special
constructs (e.g., tasks, class definitions).
There are four broad areas in which we intend to extend our work:

• Additional automation
We would like to expand our ability to index into the growing library of recognizers
and would like to develop additional capabilities for bridging the gap from source code
to style descriptions. The ultimate job of recognizers is to map the entities/relations
(i.e., objects in the domain of system design such as pipes or layers) to recognizable
syntactic features of programs (i.e., objects in the implementation domain). Clearly,
we are working with a moving target. New programming languages, COTS products,
and standard patterns present the reverse engineering community with the challenge
of recovering the abstractions from source code. We are hopeful that many of the
mechanisms we have put in place will enable us to rapidly turn out new recognizers
that can deal with new abstractions.
An enhancement that we intend to consider is the automatic generation of effect descrip-
tions from information encoded in explicit output lists of recognizers. This scheme is
similar to the transformation indexing scheme of Aries (Johnson, Feather, Harris, 1992).

• Combining Styles
We intend to investigate combining architectural styles. The as-built architectural views
each provide only a partial view onto the structure of a program, and such partial views
can overlap in fundamental ways (e.g., a repository view emphasizing data elements
has much in common with an interprocess-communication view emphasizing data
transmissions through shared memory or data files on disk). In addition, style combi-
nations can be used to uncover hybrid implementations where individual components
with respect to one style are implemented in terms of a second style.

• COTS modeling
Systems that we wish to analyze do not always come with the entire body of source
code, e.g., they may make use of COTS (commercial off-the-shelf) packages that are
simply accessed through an API. For example, from the analysis point of view, the
Unix operating system is a COTS package. We have developed representations for
COTS components that allow us to capture the interface and basic control and dataflow
dependencies of the components. This modeling needs to be extended to represent
architectural invariants required by the package.

• Requirements modeling
The distinction between functional and non-functional requirements suggests two broad
thrusts for reverse engineering to the requirements level. For functional requirements
we want to answer the important software maintenance question: "Where is X imple-
mented?". For example, a user may want to ask where message decoding is imple-
mented. Message and decoding are concepts at the user requirements level. Answering
such questions will require building functional models of systems. These models will
contain parts and constraints that we can use to map function to structure. For non-
functional requirements, we need to first recognize structural components that imple-
ment the non-functional requirements. For example, fault-tolerance requirements will to
some degree be observable as exception handling in the code. We believe our frame-
work is well suited for extensions in this direction. As a second step, we need to identify
measures of compliance (e.g., high "coverage" by abstract data types means high data
modifiability). Preliminary work in this area appears in (Chung, Nixon, Yu, 1995) and
(Kazman, Bass, Abowd, Clements, 1995).

While we are continuing to refine our representations to provide more automated assis-
tance both for recognizer authors and for analysts such as software maintainers, the current
implementation is in usable form and provides many insights for long range development
of architectural recognition libraries.

Acknowledgments

We would like to thank The MITRE Corporation for sponsoring this work under its internal
research program. We also thank MITRE colleagues Melissa Chase, Susan Roberts, and
Richard Piazza. Their work on related MITRE efforts and enabling technology has been
of great benefit for the research we have reported on above. Finally, we acknowledge the
many insightful suggestions of the anonymous reviewers.

Appendix
The Recognizer Library

Our recognition library contains approximately sixty recognizers directed toward discovery
of the architectural components and relations of nine styles.
The following partial list of recognizers shows the variety of the elements of our recogni-
tion library and is organized by analysis method. Italicized words in the descriptions (e.g.,
focus, reference) highlight a parameter that analysts must set before running the recognizer;
an illustrative sketch of one such parameterizable recognizer appears after the list.

1. Program structure - found directly on abstract syntax trees (ASTs)


• Find-Structure-With-Attribute: structures that have reference as an attribute name
• Find-Loops: find all loops
• Hill-Climbing: instances of hill-climbing algorithms

2. Typed function calls - use special call specifications


• Find-Interesting-Invocations: invocations of functions-of-interest
• Find-lnvocations-Of-Executables: invocations that activate other executables
• Find-UI-Demons: registrations of user-interface demons
3. Forward references - procedures that use a variable set by a special call
• Envelope-Of-A-Conduit: procedures that use the communication endpoint created by focus
4. Clusters of objects
• Decomposables: decomposable objects of an architecture
• Top-Clusters: top clusters of the current architecture
5. Clusters derived from dependency analysis
• Find-Upper-Layers: clusters that are layered above the focus cluster
• Find-Global-Var-Based-Clusters: find clusters based on common global vari-
able reference
6. Structures referenced in special invocations
• Find-Executable-Links: links between spawned tasks and the tasks that spawned them
• Task-Invocation: task invoked (spawned) by a special function call
• File-Access-Links: links between procedures and the files that they access
• File-IPC: files touched by more than one process
• Service-Thru-Port: all relations to a reference port
• Find-Port-Connections: links (relations) between program layers and local or network services
7. Clusters derived from calling hierarchy
• Find-Upper-Functional-Entry-Points: high level functional entry points
• Find-Mid-Level-Functional-Entry-Points: mid-level functional entry points
• Find-Common-Callers: common callers of a set of functions
• Who-Calls-It: procedures that call focus
8. Procedures within some context - using containment within clusters
• Find-Functions-Of-Cluster: Functions of a cluster
• Find-Exported-Functions: Exported functions of focus cluster
• Find-Localized-Function-Calls: procedure invocations within the focus proce-
dure
• Has-Non-Local-Referents: non-local procedures that call definitions located in
focus
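As a minimal sketch of what "parameterizable" means here, the fragment below mimics the Who-Calls-It recognizer over a hypothetical call-graph representation; the graph contents and the Python interface are our invention for illustration and are not the library's actual implementation:

    # Hypothetical call graph: caller -> set of callees.
    call_graph = {
        "main": {"init", "dispatch"},
        "dispatch": {"handle_request", "log"},
        "handle_request": {"log"},
    }

    def who_calls_it(focus):
        """Analogue of Who-Calls-It: the analyst supplies the focus
        parameter; the recognizer returns the procedures that call it."""
        return {caller for caller, callees in call_graph.items()
                if focus in callees}

    print(who_calls_it("log"))   # e.g. {'dispatch', 'handle_request'}

The sketch only illustrates the division of labor: the analyst sets the parameter, while the recognizer author encapsulates the underlying traversal.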

References
H. Abelson and G. Sussman. Structure and Interpretation of Computer Programs. The MIT Press, 1984.
G. Abowd, R. Allen, and D. Garlan. Using style to understand descriptions of software architecture. ACM Software
Engineering Notes, 18(5), 1993. Also in Proc. of the 1st ACM SIGSOFT Symposium on the Foundations of
Software Engineering, 1993.
T. Biggerstaff. Design recovery for maintenance and reuse. IEEE Computer, July 1989.
T. Biggerstaff, B. Mitbander, and D. Webster. Program understanding and the concept assignment problem.
Communications of the ACM, 37(5), May 1994.
G. Canfora, A. De Lucia, G. DiLucca, and A. Fasolino. Recovering the architectural design for software
comprehension. In IEEE 3rd Workshop on Program Comprehension, pages 30-38. IEEE Computer Society
Press, November 1994.
L. Chung, B. Nixon, and E. Yu. Using non-functional requirements to systematically select among alternatives
in architectural design. In First International Workshop on Architectures for Software Systems, April 1995.
R. Dekker and F. Ververs. Abstract data structure recognition. The Ninth Knowledge-Based Software Engineering
Conference, 1994.
P. Devanbu, B. Ballard, R. Brachman, and P. Selfridge. Automating Software Design, chapter LaSSIE: A
Knowledge-Based Software Information System. AAAI/MIT Press, 1991.
A. Engberts, W. Kozaczynski, and J. Ning. Concept recognition-based program transformation. In 1991 IEEE
Conference on Software Maintenance, 1991.
K. Gallagher and J. Lyle. Using program slicing in software maintenance. IEEE Transactions on Software
Engineering, 17(8), 1991.
D. Garlan and M. Shaw. An introduction to software architecture. Tutorial at 15th International Conference on
Software Engineering, 1993.
M. Harandi and J. Ning. Knowledge-based program analysis. IEEE Software, 7(1), 1990.
D. Harris, H. Reubenstein, and A. Yeh. Recognizers for extracting architectural features from source code. In
Second Working Conference on Reverse Engineering, July 1995.
D. Harris, H. Reubenstein, and A. Yeh. Recovering abstract data types and object instances from a conventional
procedural language. In Second Working Conference on Reverse Engineering, July 1995.

D. Harris, H. Reubenstein, and A. Yeh. Reverse engineering to the architectural level. In ICSE-I7 Proceedings,
April 1995.
C. Hofmeister, R. Nord, and D. Soni. Architectural descriptions of software systems. In First International
Workshop on Architectures for Software Systems, April 1995.
L. Holtzblatt, R. Piazza, H. Reubenstein, and S. Roberts. Using design knowledge to extract real-time task models.
In Proceedings of the 4th Systems Reengineering Technology Workshop, 1994.
W. L. Johnson and E. Soloway. Proust: Knowledge-based program understanding. IEEE Transactions on
Software Engineering, 11(3), March 1985.
W.L. Johnson, M. Feather, and D. Harris. Representation and presentation of requirements knowledge. IEEE
Transactions on Software Engineering, 18(10), October 1992.
R. Kazman, L. Bass, G. Abowd, and P. Clements. An architectural analysis case study: Internet information
systems. In First International Workshop on Architectures for Software Systems, April 1995.
W. Kozaczynski, J. Ning, and T. Sarver. Program concept recognition. In 7th Annual Knowledge-Based Software
Engineering Conference, 1992.
E. Mettala and M. Graham. The domain specific software architecture program. Technical Report CMU/SEI-92-
SR-9, SEI, 1992.
M. Olsem and C. Sittenauer. Reengineering technology report. Technical report, Software Technology Support
Center, 1993.
S. Paul and A. Prakash. A framework for source code search using program patterns. IEEE Transactions on
Software Engineering, 20(6), June 1994.
S. Paul and A. Prakash. Supporting queries on source code: A formal framework. International Journal of
Software Engineering and Knowledge Engineering, September 1994.
D. Perry and A. Wolf. Foundations for the study of software architecture. ACM Software Engineering Notes,
17(4), 1992.
A. Quilici. A hybrid approach to recognizing program plans. In Proceedings of the Working Conference on
Reverse Engineering, 1993.
Reasoning Systems, Inc., Palo Alto, CA. REFINE User's Guide, 1990. For REFINE Version 3.0.
Reasoning Systems. Refine/C User's Guide, March 1992.
C. Rich and L. Wills. Recognizing a program's design: A graph parsing approach. IEEE Software, 7(1), 1990.
R. Richardson and N. Wilde. Applying extensible dependency analysis: A case study of a heterogeneous system.
Technical Report SERC-TR-62-F, SERC, 1993.
R. Schwanke. An intelligent tool for re-engineering software modularity. In 13th International Conference on
Software Engineering, 1991.
M. Shaw. Larger scale systems require higher-level abstractions. In Proceedings of the 5th International Workshop
on Software Specification and Design, 1989.
M. Shaw. Heterogeneous design idioms for software architecture. In Proceedings of the 6th International
Workshop on Software Specification and Design, 1991.
E. Soloway. MENO-II: An intelligent program tutor. Computer-based Instruction, 10, 1983.
W. Tracz. Domain-specific software architecture (DSSA) frequently asked questions (FAQ). ACM Software
Engineering Notes, 19(2), 1994.
M. Weiser. Program slicing. IEEE Transactions on Software Engineering, 10(4), July 1984.
L. Wills. Flexible control for program recognition. In Working Conference on Reverse Engineering, May 1993.
Automated Software Engineering, 3, 139-164 (1996)
© 1996 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Strongest Postcondition Semantics as the Formal
Basis for Reverse Engineering*
GERALD C. GANNOD** AND BETTY H.C. CHENG† {gannod,chengb}@cps.msu.edu
Department of Computer Science
Michigan State University
East Lansing, Michigan 48824-1027

Abstract. Reverse engineering of program code is the process of constructing a higher level abstraction of
an implementation in order to facilitate the understanding of a system that may be in a "legacy" or "geriatric"
state. Changing architectures and improvements in programming methods, including formal methods in software
development and object-oriented programming, have prompted a need to reverse engineer and re-engineer program
code. This paper describes the application of the strongest postcondition predicate transformer (sp) as the formal
basis for the reverse engineering of imperative program code.

Keywords: formal methods, formal specification, reverse engineering, software maintenance

1. Introduction

The demand for software correctness becomes more evident when accidents, sometimes
fatal, are caused by software errors. For example, it was recently reported that the software of
a medical diagnostic system was the major source of a number of potentially fatal doses of
radiation (Leveson and Turner, 1993). Other problems caused by software failure
have been well documented, and with the change in laws concerning liability (Flor, 1991),
the need to reduce the number of problems due to software increases.
Software maintenance has long been a problem faced by software professionals; the
average age of software is between 10 and 15 years (Osborne and Chikofsky, 1990).
With the development of new architectures and improvements in programming methods
and languages, including formal methods in software development and object-oriented
programming, there is a strong motivation to reverse engineer and re-engineer existing
program code in order to preserve functionality, while exploiting the latest technology.
Formal methods in software development provide many benefits in the forward engineer-
ing aspect of software development (Wing, 1990). One of the advantages of using formal
methods in software development is that the formal notations are precise, verifiable, and
facilitate automated processing (Cheng, 1994). Reverse Engineering is the process of con-
structing high level representations from lower level instantiations of an existing system.
One method for introducing formal methods, and therefore taking advantage of the benefits

* This work is supported in part by the National Science Foundation grants CCR-9407318, CCR-9209873, and
CDA-9312389.
** This author is supported in part by a NASA Graduate Student Researchers Program Fellowship.
† Please address all correspondence to this author.

of formal methods, is through the reverse engineering of existing program code into formal
specifications (Gannod and Cheng, 1994, Lano and Breuer, 1989, Ward et al., 1989).
This paper describes an approach to reverse engineering based on the formal semantics
of the strongest postcondition predicate transformer sp (Dijkstra and Scholten, 1990), and
the partial correctness model of program semantics introduced by Hoare (Hoare, 1969).
Previously, we investigated the use of the weakest precondition predicate transformer
wp as the underlying formal model for constructing formal specifications from program
code (Cheng and Gannod, 1991, Gannod and Cheng, 1994). The difference between the
two approaches is in the ability to directly apply a predicate transformer to a program (i.e.,
sp) versus using a predicate transformer as a guideline for constructing formal specifications
(i.e., wp).
The remainder of this paper is organized as follows. Section 2 provides background
material for software maintenance and formal methods. The formal approach to reverse
engineering based on sp is described in Sections 3 and 4, where Section 3 discusses the sp
semantics for assignment, alternation, and sequence, and Section 4 gives the sp semantics for
iterative and procedural constructs. An example applying the reverse engineering technique
is given in Section 5. Related work is discussed in Section 6. Finally, Section 7 draws
conclusions and suggests future investigations.

2. Background

This section provides background information for software maintenance and formal meth-
ods for software development. Included in this discussion is the formal model of program
semantics used throughout the paper.

2.1. Software Maintenance

One of the most difficult aspects of re-engineering is the recognition of the functionality of
existing programs. This step in re-engineering is known as reverse engineering. Identifying
design decisions, intended use, and domain specific details are often significant obstacles
to successfully re-engineering a system.
Several terms are frequently used in the discussion of re-engineering (Chikofsky and
Cross, 1990). Forward Engineering is the process of developing a system by mov-
ing from high level abstract specifications to detailed, implementation-specific manifes-
tations (Chikofsky and Cross, 1990). The explicit use of the word "forward" is used
to contrast the process with Reverse Engineering, the process of analyzing a system
in order to identify system components, component relationships, and intended behav-
ior (Chikofsky and Cross, 1990). Restructuring is the process of creating a logically equiv-
alent system at the same level of abstraction (Chikofsky and Cross, 1990). This process
does not require semantic understanding of the system and is best characterized by the task
of transforming unstructured code into structured code. Re-Engineering is the examination
and alteration of a system to reconstitute it in a new form, which potentially involves changes
at the requirements, design, and implementation levels (Chikofsky and Cross, 1990).

Byrne described the re-engineering process using a graphical model similar to the one
shown in Figure 1 (Byrne, 1992, Byrne and Gustafson, 1992). The process model appears
in the form of two sectioned triangles, where each section in the triangles represents a
different level of abstraction. The higher levels in the model are concepts and requirements.
The lower levels include designs and implementations. The relative size of each of the
sections is intended to represent the amount of information known about a system at a given
level of abstraction. Entry into this re-engineering process model begins with system A,
where Abstraction (or reverse engineering) is performed to an appropriate level of detail.
The next step is Alteration, where the system is constituted into a new form at a different
level of abstraction. Finally, Refinement of the new form into an implementation can be
performed to create system B.

[Figure 1: two sectioned triangles, one for System A and one for System B, whose sections correspond to levels of abstraction. Abstraction ("reverse engineering") moves up the System A triangle, Alteration maps the result to a new form, and Refinement ("forward engineering") moves down to System B.]

Figure 1. Reverse Engineering Process Model

This paper describes an approach to reverse engineering that is applicable to the imple-
mentation and design levels. In Figure 1, the context for this paper is represented by the
dashed arrow. That is, we address the construction of formal low-level or "as-built" de-
sign specifications. The motivation for operating in such an implementation-bound level
of abstraction is that it provides a means of traceability between the program source code
and the formal specifications constructed using the techniques described in this paper. This
traceability is necessary in order to facilitate technology transfer of formal methods. That is,
currently existing development teams must be able to understand the relationship between
the source code and the specifications.

2.2. Formal Methods

Although the waterfall development life-cycle provides a structured process for developing
software, the design methodologies that support the life-cycle (i.e., Structured Analysis and

Design (Yourdon and Constantine, 1978)) make use of informal techniques, thus increasing
the potential for introducing ambiguity, inconsistency, and incompleteness in designs and
implementations. In contrast, formal methods used in software development are rigorous
techniques for specifying, developing, and verifying computer software (Wing, 1990). A
formal method consists of a well-defined specification language with a set of well-defined
inference rules that can be used to reason about a specification (Wing, 1990). A benefit of
formal methods is that their notations are well-defined and thus, are amenable to automated
processing (Cheng, 1994).

2.2.1. Program Semantics

The notation Q {S} R (Hoare, 1969) is used to represent a partial correctness model of
execution, where, given that a logical condition Q holds, if the execution of program S
terminates, then logical condition R will hold. A rearrangement of the braces to produce
{Q} S {R}, in contrast, represents a total correctness model of execution. That is, if
condition Q holds, then S is guaranteed to terminate with condition R true.
A precondition describes the initial state of a program, and a postcondition describes the
final state. Given a statement S and a postcondition R, the weakest precondition wp(S, R)
describes the set of all states in which the statement S can begin execution and terminate
with postcondition R true, and the weakest liberal precondition wlp(S, R) is the set of all
states in which the statement S can begin execution and establish R as true if S terminates.
In this respect, wp(S, R) establishes the total correctness of S, and wlp(S, R) establishes
the partial correctness of S. The wp and wlp are called predicate transformers because they
take predicate R and, using the properties listed in Table 1, produce a new predicate.

Table 1. Properties of the wp and wlp predicate transformers

    wp(S, A) = wp(S, true) ∧ wlp(S, A)
    wp(S, A) ⇒ ¬wlp(S, ¬A)
    wp(S, false) = false
    wp(S, A ∧ B) = wp(S, A) ∧ wp(S, B)
    wp(S, A ∨ B) ⇐ wp(S, A) ∨ wp(S, B)
    wp(S, A ⇒ B) ⇒ (wp(S, A) ⇒ wp(S, B))

The context for our investigations is that we are reverse engineering systems that have
desirable properties or functionality that should be preserved or extended. Therefore, the
partial correctness model is sufficient for these purposes.

2.2.2. Strongest Postcondition

Consider the predicate ¬wlp(S, ¬R), which is the set of all states in which there exists
an execution of S that terminates with R true. That is, we wish to describe the set of
states in which satisfaction of R is possible (Dijkstra and Scholten, 1990). The predicate
¬wlp(S, ¬R) is contrasted with wlp(S, R), which is the set of states in which the computation
of S either fails to terminate or terminates with R true.
An analogous characterization can be made in terms of the computation state space
that describes initial conditions using the strongest postcondition sp(S, Q) predicate trans-
former (Dijkstra and Scholten, 1990), which is the set of all states in which there exists a
computation of S that begins with Q true. That is, given that Q holds, execution of S results
in sp(S, Q) true, if S terminates. As such, sp(S, Q) assumes partial correctness. Finally, we
make the following observation about sp(S, Q) and wlp(S, R) and the relationship between
the two predicate transformers, given the Hoare triple Q {S} R (Dijkstra and Scholten, 1990):

    Q ⇒ wlp(S, R)
    sp(S, Q) ⇒ R

The importance of this relationship is two-fold. First, it provides a formal basis for trans-
lating programming statements into formal specifications. Second, the symmetry of sp and
wlp provides a method for verifying the correctness of a reverse engineering process that
utilizes the properties of wlp and sp in tandem.

2.2.3. sp vs. wp

Given a Hoare triple Q {S} R, we note that wp is a backward rule, in that a derivation of a
specification begins with R and produces a predicate wp(S, R). The predicate transformer
wp assumes a total correctness model of computation, meaning that given S and R, if the
computation of S begins in state wp(S, R), the program S will halt with condition R true.
We contrast this model with the sp model, a forward derivation rule. That is, given a
precondition Q and a program S, sp derives a predicate sp(S, Q). The predicate transformer
sp assumes a partial correctness model of computation, meaning that if a program starts in
state Q, then the execution of S will place the program in state sp(S, Q) if S terminates.
Figure 2 gives a pictorial depiction of the differences between sp and wp, where the input to
the predicate transformer produces the corresponding predicate. Figure 2(a) gives the case
where the input to the predicate transformer is "S" and "R", and the output of the predicate
transformer (given by the box appropriately named "wp") is "wp(S,R)". The sp case
(Figure 2(b)) is similar, where the input to the predicate transformer is "S" and "Q", and
the output of the transformer is "sp(S,Q)".
The use of these predicate transformers for reverse engineering has different implica-
tions. Using wp implies that a postcondition R is known. However, with respect to reverse
engineering, determining R is the objective; therefore wp can only be used as a guideline
for performing reverse engineering. The use of sp assumes that a precondition Q is known
and that a postcondition will be derived through the direct application of sp. As such, sp is
more applicable to reverse engineering.

[Figure 2: black-box view of the two predicate transformers. In (a), the inputs S and R enter the wp box and the output is wp(S,R); in (b), the inputs S and Q enter the sp box and the output is sp(S,Q).]

Figure 2. Black box representation and differences between wp and sp: (a) wp; (b) sp

3. Primitive Constructs

This section describes the derivation of formal specifications from the primitive program-
ming constructs of assignment, alternation, and sequence. The Dijkstra guarded command
language (Dijkstra, 1976) is used to represent each primitive construct, but the techniques are
applicable to the general class of imperative languages. For each primitive, we first describe
the semantics of the predicate transformers wlp and sp as they apply to each primitive and
then, for reverse engineering purposes, describe specification derivation in terms of Hoare
triples. Notationally, throughout the remainder of this paper, the notation {Q} S {R} will
be used to indicate a partial correctness interpretation.

3.1. Assignment

An assignment statement has the form x := e, where x is a variable and e is an expression.
The wlp of an assignment statement is expressed as wlp(x := e, R) = R^x_e, which represents
the postcondition R with every free occurrence of x replaced by the expression e. This type
of replacement is termed a textual substitution of x by e in expression R. If x corresponds
to a vector ȳ of variables and e represents a vector Ē of expressions, then the wlp of the
assignment is of the form R^ȳ_Ē, where each y_i is replaced by E_i, respectively, in expression
R. The sp of an assignment statement is expressed as follows (Dijkstra and Scholten, 1990):

    sp(x := e, Q) = (∃v :: Q^x_v ∧ x = e^x_v),                              (1)

where Q is the precondition, v is the quantified variable, and '::' indicates that the range of
the quantified variable v is not relevant in the current context.
We conjecture that the removal of the quantification for the initial values of a variable
is valid if the precondition Q has a conjunct that specifies the textual substitution. That
is, performing the textual substitution Q^x_v in Expression (1) is a redundant operation if,
initially, Q has a conjunct of the form x = v. Refer to Appendix A, where this case is
described in more depth. Given the imposition of initial (or previous) values on variables,
the Hoare triple formulation for assignment statements is as follows:

    { Q }                               /* precondition */
    x := e;
    { (x_{j+1} = e^x_{x_j}) ∧ Q }       /* postcondition */

where x_j represents the initial value of the variable x, x_{j+1} is the subsequent value of x,
and Q is the precondition. Subscripts are added to variables to convey historical information
for a given variable.
Consider a program that consists of a series of assignments to a variable x: "x := a;
x := b; x := c; x := d; x := e; x := f; x := g; x := h;". Despite its simplicity, the

    {x = X}              {x0 = X}        {x0 = X}
    x := a;              x := a;         x := a;
    {x = a ∧ X = X}      {x1 = a}        {x1 = a ∧ x0 = X}
    x := b;              x := b;         x := b;
    {x = b ∧ a = a}      {x2 = b}        {x2 = b ∧ x1 = a ∧ ...}
    x := c;              x := c;         x := c;
    {x = c ∧ b = b}      {x3 = c}        {x3 = c ∧ x2 = b ∧ ...}
    x := d;              x := d;         x := d;
    {x = d ∧ c = c}      {x4 = d}        {x4 = d ∧ x3 = c ∧ ...}
    x := e;              x := e;         x := e;
    {x = e ∧ d = d}      {x5 = e}        {x5 = e ∧ x4 = d ∧ ...}
    x := f;              x := f;         x := f;
    {x = f ∧ e = e}      {x6 = f}        {x6 = f ∧ x5 = e ∧ ...}
    x := g;              x := g;         x := g;
    {x = g ∧ f = f}      {x7 = g}        {x7 = g ∧ x6 = f ∧ ...}
    x := h;              x := h;         x := h;
    {x = h ∧ g = g}      {x8 = h}        {x8 = h ∧ x7 = g ∧ ...}

    (a) Code with strict  (b) Code with      (c) Code with historical
        sp application        historical         subscripts and
                              subscripts         propagation

Figure 3. Different approaches to specifying the history of a variable

example is useful in illustrating the different ways that the effects of an assignment statement
on a variable can be specified. For instance, Figure 3(a) depicts the specification of the
program by strict application of the strongest postcondition.
Another possible way to specify the program is through the use of historical subscripts
for a variable. A historical subscript is an integer used to denote the i-th textual
assignment to a variable, where a textual assignment is an occurrence of an assignment
statement in the program source (versus the number of times the statement is executed).
An example of the use of historical subscripts is given in Figure 3(b). However, when
using historical subscripts, special care must be taken to maintain the consistency of the
specification with respect to the semantics of other programming constructs. That is, using
the technique shown in Figure 3(b) is not sufficient. The precondition of a given statement
must be propagated to the postcondition, as shown in Figure 3(c). The main motivation for
using histories is to remove the need to apply textual substitution to a complex precondition
and to provide historical context to complex disjunctive and conjunctive expressions. The
disadvantage of using such a technique is that the propagation of the precondition can be
visually complex. Note that we have not changed the semantics of the strongest postcondition;
rather, in the application of the strongest postcondition, extra information is appended
that provides a historical context for all variables of a program during some "snapshot"
or state of a program.
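A minimal sketch of the historical-subscript bookkeeping, assuming a toy representation in which a straight-line program is a list of (variable, expression) pairs and a predicate is a list of conjunct strings; the function and representation are ours for illustration and are not AUTOSPEC's:

    # Toy sp for straight-line assignments with historical subscripts.
    # Caveat: the textual substitution below is a naive string replace and
    # assumes variable names do not occur inside other identifiers.
    def sp_assignments(assignments, precondition):
        version = {}                       # variable -> current subscript
        conjuncts = list(precondition)     # propagate the precondition
        for var, expr in assignments:
            for v, i in version.items():   # rewrite rhs with current subscripts
                expr = expr.replace(v, f"{v}{i}")
            version[var] = version.get(var, 0) + 1
            conjuncts.append(f"{var}{version[var]} = {expr}")
        return conjuncts

    print(sp_assignments([("x", "a"), ("x", "b")], ["x0 = X"]))
    # ['x0 = X', 'x1 = a', 'x2 = b'], matching the first steps of Figure 3(c)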

3.2. Alternation

An alternation statement using the Dijkstra guarded command language (Dijkstra, 1976) is
expressed as

    if
        B1 → S1;
        ...
    []  Bn → Sn;
    fi;

where Bi → Si is a guarded command such that Si is only executed if the logical expression
(guard) Bi is true. The wlp for alternation statements is given by (Dijkstra and Scholten,
1990):

    wlp(IF, R) = (∀i : Bi : wlp(Si, R)),

where IF represents the alternation statement. The equation states that the necessary
condition to satisfy R, if the alternation statement terminates, is that given Bi is true, the
wlp for each guarded statement Si with respect to R holds. The sp for alternation has the
form (Dijkstra and Scholten, 1990)

    sp(IF, Q) = (∃i :: sp(Si, Bi ∧ Q)).                                     (2)

The existential expression can be expanded into the following form

    sp(IF, Q) = (sp(S1, B1 ∧ Q) ∨ ... ∨ sp(Sn, Bn ∧ Q)).                    (3)

Expression (3) illustrates the disjunctive nature of alternation statements, where each disjunct
describes the postcondition in terms of both the precondition Q and the guard and guarded
command pairs, given by Bi and Si, respectively. This characterization follows the intuition
that a statement Si is only executed if Bi is true. The translation of alternation statements to
specifications is based on the similarity between the semantics of Expression (3) and the execution
behaviour of alternation statements. Using the Hoare triple notation, a specification is
constructed as follows:

    { Q }
    if
        B1 → S1;
        ...
    []  Bn → Sn;
    fi;
    { sp(S1, B1 ∧ Q) ∨ ... ∨ sp(Sn, Bn ∧ Q) }
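As a small concrete instance (ours, not drawn from the paper), consider an absolute-value alternation. Expression (3) yields the disjunctive postcondition (x ≥ 0 ∧ y = x) ∨ (x < 0 ∧ y = -x) for the precondition true, which can be confirmed by enumerating a bounded state space:

    # sp for:  if x >= 0 -> y := x  []  x < 0 -> y := -x  fi,  with Q: true.
    STATES = [(x, y) for x in range(-3, 4) for y in range(-3, 4)]

    def run_if(x, y):
        return (x, x) if x >= 0 else (x, -x)    # the two guarded commands

    reachable = {run_if(x, y) for (x, y) in STATES}        # states after IF
    predicted = {(x, y) for (x, y) in STATES
                 if (x >= 0 and y == x) or (x < 0 and y == -x)}
    assert reachable == predicted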

3.3. Sequence

For a given sequence of statements S1; ...; Sn, it follows that the postcondition for some
statement Si is the precondition for the subsequent statement Si+1. The wlp and sp for
sequences follow accordingly. The wlp for sequences is defined as follows (Dijkstra and
Scholten, 1990):

    wlp(S1; S2, R) = wlp(S1, wlp(S2, R)).

Likewise, the sp (Dijkstra and Scholten, 1990) is

    sp(S1; S2, Q) = sp(S2, sp(S1, Q)).                                      (4)

In the case of wlp, the set of states for which the sequence S1; S2 can execute with R true
(if the sequence terminates) is equivalent to the wlp of S1 with respect to the set of states
defined by wlp(S2, R). For sp, the derived postcondition for the sequence S1; S2 with
respect to the precondition Q is equivalent to the derived postcondition for S2 with respect
to a precondition given by sp(S1, Q). The Hoare triple formulation and construction process
is as follows:

    { Q }
    S1;
    { sp(S1, Q) }
    S2;
    { sp(S2, sp(S1, Q)) }.
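Expression (4) can be exercised the same way (again, our illustration): for S1: x := x + 1, S2: y := 2*x, and Q: x = 1, composing the two statements should yield the postcondition x = 2 ∧ y = 4.

    # sp(S1; S2, Q) = sp(S2, sp(S1, Q)), checked by enumeration.
    STATES = [(x, y) for x in range(-4, 9) for y in range(-4, 9)]
    s1 = lambda x, y: (x + 1, y)      # S1: x := x + 1
    s2 = lambda x, y: (x, 2 * x)      # S2: y := 2*x
    Q = lambda x, y: x == 1

    after_seq = {s2(*s1(x, y)) for (x, y) in STATES if Q(x, y)}
    # The precondition x = 1 is consumed by the substitution, leaving x = 2, y = 4.
    assert after_seq == {(2, 4)}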

4. Iterative and Procedural Constructs

The programming constructs of assignment, alternation, and sequence can be combined to
produce straight-line programs (programs without iteration or recursion). The introduction
of iteration and recursion into programs enables more compactness and abstraction in pro-
gram development. However, constructing formal specifications of iterative and recursive
programs can be problematic, even for the human specifier. This section discusses the
formal specification of iteration and procedural abstractions without recursion. We deviate
from our previous convention of providing the formalisms for wlp and sp for each construct
and use an operational definition of how specifications are constructed. This approach is

necessary because the formalisms for the wlp and sp for iteration are defined in terms of
recursive functions (Dijkstra and Scholten, 1990, Gries, 1981) that are, in general, difficult
to practically apply.

4.1. Iteration

Iteration allows for the repetitive application of a statement. Iteration, using the Dijkstra
language, has the form

    do
        B1 → S1;
        ...
    od;

In more general terms, the iteration statement may contain any number of guarded com-
mands of the form Bi → Si, such that the loop is executed as long as any guard Bi is true.
A simplified form of repetition is given by "do B → S od".
In the context of iteration, a bound function determines the upper bound on the number
of iterations still to be performed on the loop. An invariant is a predicate that is true
before and after each iteration of a loop. The problem of constructing formal specifications
of iteration statements is difficult because the bound functions and the invariants must be
determined. However, for a partial correctness model of execution, concerns of boundedness
and termination fall outside of the interpretation, and thus can be relaxed.
Using the abbreviated form of repetition "do B → S od", the semantics for iteration
in terms of the weakest liberal precondition predicate transformer wlp is given by the
following (Dijkstra and Scholten, 1990):

    wlp(DO, R) = (∀i : 0 ≤ i : wlp(IF^i, B ∨ R)),                           (5)

where the notation "IF^i" is used to indicate the execution of "if B → S fi" i times.
Operationally, Expression (5) states that the weakest condition that must hold in order for
the execution of an iteration statement to result with R true, provided that the iteration
statement terminates, is equivalent to a conjunctive expression where each conjunct is an
expression describing the semantics of executing the loop i times, where i ≥ 0.
The strongest postcondition semantics for repetition has a similar but notably distinct
formulation (Dijkstra and Scholten, 1990):

    sp(DO, Q) = ¬B ∧ (∃i : 0 ≤ i : sp(IF^i, Q)).                            (6)

Expression (6) states that the strongest condition that holds after executing an iterative
statement, given that condition Q holds, is equivalent to the conjunction of the negated loop
guard (¬B) and a disjunctive expression describing the effects of iterating the loop i
times, where i ≥ 0.
Although the semantics for repetition in terms of strongest postcondition and weakest
liberal precondition are less complex than those of the weakest precondition (Dijkstra and
Scholten, 1990), the recurrent nature of the closed forms makes the application of such
semantics difficult. For instance, consider the counter program "do i < n → i :=
i + 1 od". The application of the sp semantics for repetition leads to the following
specification:

    sp(do i < n → i := i + 1 od, Q) = (i ≥ n) ∧ (∃j : 0 ≤ j : sp(IF^j, Q)).

The closed form for iteration suggests that the loop be unrolled j times. If j is set to
n - start, where start is the initial value of variable i, then the unrolled version of the loop
would have the following form:

 1.  i := start;
 2.  if
 3.      i < n → i := i + 1;
 4.  fi
 5.  if
 6.      i < n → i := i + 1;
 7.  fi
 8.  ...
 9.  if
10.      i < n → i := i + 1;
11.  fi

Application of the rule for alternation (Expression (2)) yields the sequence of annotated
code shown in Figure 4, where the goal is to derive

    sp(do i < n → i := i + 1 od, (start < n) ∧ (i = start)).

In the construction of specifications of iteration statements, knowledge must be introduced
by a human specifier. For instance, in line 19 of Figure 4 the inductive assertion that
"i = start + (n - start - 1)" is made. This assertion is based on a specifier providing the
information that (n - start - 1) additions have been performed if the loop were unrolled
at least (n - start - 1) times. As such, by using loop unrolling and induction, the derived
specification for the code sequence is

    ((n - 1 < n) ∧ (i = n)).

For this simple example, we find that the solution is non-trivial when applying the formal
definition of sp(DO, Q). As such, the specification process must rely on a user-guided
strategy for constructing a specification. A strategy for obtaining a specification of a
repetition statement is given in Figure 5.

 1.  { (i = I) ∧ (start < n) }
 2.  i := start;
 3.  { (i = start) ∧ (start < n) }
 4.  if i < n → i := i + 1 fi
 5.  { sp(i := i + 1, (i < n) ∧ (i = start) ∧ (start < n))
 6.    ∨
 7.    ((i ≥ n) ∧ (i = start) ∧ (start < n))
 8.    =
 9.    ((i = start + 1) ∧ (start < n)) }
10.  if i < n → i := i + 1 fi
11.  { sp(i := i + 1, (i < n) ∧ (i = start + 1) ∧ (start < n))
12.    ∨
13.    ((i ≥ n) ∧ (i = start + 1) ∧ (start < n))
14.    =
15.    ((i = start + 2) ∧ (start + 1 < n))
16.    ∨
17.    ((i ≥ n) ∧ (i = start + 1) ∧ (start < n)) }
18.  ...
19.  { ((i = start + (n - start - 1)) ∧ (start + (n - start - 1) - 1 < n))
20.    ∨
21.    ((i ≥ n) ∧ (i = start + (n - start - 2)) ∧ (start + (n - start - 2) - 1 < n))
22.    =
23.    ((i = n - 1) ∧ (n - 2 < n)) }
24.  if i < n → i := i + 1 fi
25.  { sp(i := i + 1, (i < n) ∧ (i = n - 1) ∧ (n - 2 < n))
26.    ∨
27.    ((i ≥ n) ∧ (i = n - 1) ∧ (n - 2 < n))
28.    =
29.    (i = n) }

Figure 4. Annotated Source Code for Unrolled Loop



1. The following criteria are the main characteristics to be identified during the specifica-
   tion of the repetition statement:

   • invariant (P): an expression describing the conditions prior to entry and upon exit
     of the iterative structure.

   • guards (B): Boolean expressions that restrict entry into the loop. Execution of
     each guarded command Bi → Si terminates with P true, so that P is an invariant
     of the loop:

         {P ∧ Bi} Si {P},   for 1 ≤ i ≤ n.

     When none of the guards is true and the invariant is true, then the postcondition of
     the loop should be satisfied (P ∧ ¬BB ⇒ R, where BB = B1 ∨ ... ∨ Bn and R
     is the postcondition).

2. Begin by introducing the assertion "Q ∧ BB" as the precondition to the body of the
   loop.

3. Query the user for modifications to the assertion made in step 2. This guided interaction
   allows the user to provide generalizations about arbitrary iterations of the loop. In order
   to verify that the modifications made by a user are valid, wlp can be applied to the
   assertion.

4. Apply the strongest postcondition to the loop body Si using the precondition given by
   step 3.

5. Using the specification obtained from step 4 as a guideline, query the user for a loop
   invariant. Although this step is non-trivial, techniques exist that aid in the construction
   of loop invariants (Katz and Manna, 1976, Gries, 1981).

6. Using the relationship stated above (P ∧ ¬BB ⇒ R), construct the specification of
   the loop from the negation of the loop guard and the loop invariant.

Figure 5. Strategy for constructing a specification for an iteration statement
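The strategy leans on a user-supplied invariant (step 5). As a small illustration (ours), a candidate invariant for the counter loop "do i < n → i := i + 1 od" can be checked mechanically on small instances before it is trusted: it must hold on entry, be preserved by the guarded body, and, together with the negated guard, imply the expected postcondition i = n.

    # Candidate invariant for  {start <= n, i = start}  do i < n -> i := i + 1 od.
    P = lambda i, n, start: start <= i <= n      # proposed loop invariant
    B = lambda i, n: i < n                       # loop guard
    R = lambda i, n: i == n                      # expected postcondition

    for n in range(0, 6):
        for start in range(0, n + 1):
            i = start
            assert P(i, n, start)                # P holds on entry
            while B(i, n):
                assert P(i, n, start) and B(i, n)
                i = i + 1
                assert P(i, n, start)            # P preserved by the body
            assert R(i, n)                       # P and not B  imply  R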

4.2. Procedural Abstractions

This section describes the construction of formal specifications from code containing the
use of non-recursive procedural abstractions. A procedure declaration can be represented
using the following notation

    proc p (value x̄; value-result ȳ; result z̄);
        {P} (body) {Q}

where x̄, ȳ, and z̄ represent the value, value-result, and result parameters for the procedure,
respectively. A parameter of type value means that the parameter is used only for input
to the procedure. Likewise, a parameter of type result indicates that the parameter is used
only for output from the procedure. Parameters that are known as value-result indicate that
the parameters can be used for both input and output to the procedure. The notation (body)
represents one or more statements making up the "procedure", while {P} and {Q} are
the precondition and postcondition, respectively. The signature of a procedure appears as
    proc p: (input_type)* → (output_type)*                                  (7)

where the Kleene star (*) indicates zero or more repetitions of the preceding unit, input_type
denotes the one or more names of input parameters to the procedure p, and output_type
denotes the one or more names of output parameters of procedure p. A specification of a
procedure can be constructed to be of the form

    {P: U}
    proc p : E0 → E1
    (body)
    {Q: sp((body), U) ∧ U}

where E0 is one or more input parameter types with attribute value or value-result, and E1 is
one or more output parameter types with attribute value-result or result. The postcondition
for the body of the procedure, sp((body), U), is constructed using the previously defined
guidelines for assignment, alternation, sequence, and iteration as applied to the statements
of the procedure body.
Gries defines a theorem for specifying the effects of a procedure call (Gries, 1981) using
a total correctness model of execution. Given a procedure declaration of the above form,
the following condition holds (Gries, 1981):

    {PRT: P^{x̄,ȳ}_{ā,b̄} ∧ (∀ū,v̄ :: Q^{ȳ,z̄}_{ū,v̄} ⇒ R^{b̄,c̄}_{ū,v̄})} p(ā, b̄, c̄) {R}        (8)

for a procedure call p(ā, b̄, c̄), where ā, b̄, and c̄ represent the actual parameters of type value,
value-result, and result, respectively. Local variables of procedure p used to compute
value-result and result parameters are represented using ū and v̄, respectively. Informally,
the condition states that PRT must hold before the execution of procedure p in order to
satisfy R. In addition, PRT states that the precondition for procedure p must hold for the
parameters passed to the procedure and that the postcondition for procedure p implies R for
each value-result and result parameter. The formulation of Equation (8) in terms of a partial
correctness model of execution is identical, assuming that the procedure is straight-line,
non-recursive, and terminates. Using this theorem for the procedure call, an abstraction
of the effects of a procedure call can be derived using a specification of the procedure
declaration. That is, the construction of a formal specification from a procedure call can be
performed by inlining a procedure call and using the strongest postcondition for assignment.

    begin
        {PR}
        p(ā, b̄, c̄)
        {R}
    end

    begin
        declare x̄, ȳ, z̄, ū, v̄;
        {PR}
        x̄, ȳ := ā, b̄;
        {P}
        (body)
        {Q}
        ȳ, z̄ := ū, v̄;
        {QR}
        b̄, c̄ := ȳ, z̄;
        {R}
    end

Figure 6. Removal of procedure call p(ā, b̄, c̄) abstraction

A procedure call p(ā, b̄, c̄) can be represented by the program block (Gries, 1981) found in
Figure 6, where (body) comprises the statements of the procedure declaration for p, {PR}
is the precondition for the call to procedure p, {P} is the specification of the program after
the formal parameters have been replaced by actual parameters, {Q} is the specification
of the program after the procedure has been executed, {QR} is the specification of the
program after formal parameters have been assigned the values of local variables, and
{R} is the specification of the program after the actual parameters to the procedure call have
been "returned". By representing a procedure call in this manner, parameter binding can be
achieved through multiple assignment statements, and a postcondition R can be established
by using the sp for assignment. Removal of a procedural abstraction enables the extension
of the notion of straight-line programs to include non-recursive straight-line procedures.
Making the appropriate sp substitutions, we can annotate the code sequence from Figure 6
to appear as follows:

{PR } _ _
x,y : = a,b;
{ P: {3a,/3:: PR1% A X = a^'^ A y = b1%) }
{body)
{e}__
y,z := u,v;
{G/?.-(37,C::e^|Ay = u | j A z = v ^ | ) }
b,c : = YJZ";

{ R: (3^,^ :: fi/^^f Ab = y5'5 A c = z^f) }

where a, ^, 7, C» ^» and ip are the initial values of x, y (before execution of the procedure
body), y (after execution of the procedure body), ^, b, and c, respectively. Recall that in
Section 3.1, we described how the existential operators and the textual substitution could
be removed from the calculation of the sp. Applying that technique to assignments and
recognizing that formal and actual result parameters have no initial values, and that local
variables are used to compute the values of the value-result parameters, the above sequence
can be simplified using the semantics of 577 for assignments to obtain the following annotated
code sequence:

{PR } _ _
x,y : = a,b; ^
{P.-P/?Ax = a A y = b }
{body)
{G}__
y,z : = u,v; _ _
{ei?.-QAy = u ^ A z = v ^ }
b,c : = y,'z ;^
{/?.-G/?Ab = y A c = ^ }

where Q is derived using sp{{body),P).
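The inlining scheme can be exercised on a concrete toy procedure (our illustration, not AUTOSPEC output). The function below mirrors the three binding steps of Figure 6 for a procedure with one value parameter x, one value-result parameter y, and one result parameter z, whose body computes locals u = y + x and v = x:

    def inline_call(a, b):
        x, y = a, b            # bind value and value-result actuals to formals
        u, v = y + x, x        # (body): locals computed from the formals
        y, z = u, v            # copy locals back to the formals
        b, c = y, z            # return value-result and result actuals
        # R now describes b and c in terms of the original actuals.
        return b, c

    assert inline_call(2, 3) == (5, 2)    # b = b0 + a0,  c = a0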

5. Example

The following example demonstrates the use of four major programming constructs de-
scribed in this paper (assignment, alternation, sequence, and procedure call) along with
the application of the translation rules for abstracting formal specifications from code.
The program, shown in Figure 7, has four procedures, including three different imple-
mentations of "swap". AUTOSPEC (Cheng and Gannod, 1991, Gannod and Cheng, 1993,
Gannod and Cheng, 1994) is a tool that we have developed to support the derivational ap-
proach to the reverse engineering of formal specifications from program code.

program MaxMin ( input, output );

var
    a, b, c, Largest, Smallest : real;

procedure FindMaxMin( NumOne, NumTwo : real; var Max, Min : real );
begin
    if NumOne > NumTwo then
        begin
            Max := NumOne;
            Min := NumTwo;
        end
    else
        begin
            Max := NumTwo;
            Min := NumOne;
        end
end;

procedure swapa( var X : integer; var Y : integer );
begin
    Y := Y + X;
    X := Y - X;
    Y := Y - X
end;

procedure swapb( var X : integer; var Y : integer );
var
    temp : integer;
begin
    temp := X;
    X := Y;
    Y := temp
end;

procedure funnyswap( X : integer; Y : integer );
var
    temp : integer;
begin
    temp := X;
    X := Y;
    Y := temp
end;

begin
    a := 5;
    b := 10;
    swapa(a,b);
    swapb(a,b);
    funnyswap(a,b);
    FindMaxMin(a,b,Largest,Smallest);
    c := Largest;
end.

Figure 7. Example Pascal program

Figures 8, 9, and 10 depict the output of AUTOSPEC when applied to the program code
given in Figure 7, where the notation id{scope}instance is used to indicate a variable
id with scope defined by the referencing environment for scope. The instance identifier

program MaxMin ( input, output );

var
    a, b, c, Largest, Smallest : real;

procedure FindMaxMin( NumOne, NumTwo : real; var Max, Min : real );
begin
    if (NumOne > NumTwo) then
        begin
            Max := NumOne;
            (* Max{2}1 = NumOne0 & U *)
            Min := NumTwo;
            (* Min{2}1 = NumTwo0 & U *)
        end
        I: (* (Max{2}1 = NumOne0 & Min{2}1 = NumTwo0) & U *)
    else
        begin
            Max := NumTwo;
            (* Max{2}1 = NumTwo0 & U *)
            Min := NumOne;
            (* Min{2}1 = NumOne0 & U *)
        end
        J: (* (Max{2}1 = NumTwo0 & Min{2}1 = NumOne0) & U *)
    K: (* (((NumOne0 > NumTwo0) &
            (Max{0}1 = NumOne0 & Min{0}1 = NumTwo0)) |
           (not (NumOne0 > NumTwo0) &
            (Max{0}1 = NumTwo0 & Min{0}1 = NumOne0))) & U *)
end
L: (* (((NumOne0 > NumTwo0) &
        (Max{0}1 = NumOne0 & Min{0}1 = NumTwo0)) |
       (not (NumOne0 > NumTwo0) &
        (Max{0}1 = NumTwo0 & Min{0}1 = NumOne0))) & U *)

Figure 8. Output created by applying AUTOSPEC to example

is used to provide an ordering of the assignments to a variable. The scope identifier has
two purposes. When scope is an integer, it indicates the level of nesting within the current
program or procedure. When scope is an identifier, it provides information about variables
specified in a different context. For instance, if a call to some arbitrary procedure called foo
is invoked, then specifications for variables local to foo are labeled with an integer scope.
Upon return, the specification of the calling procedure will have references to variables
local to foo. Although the variables being referenced are outside the scope of the calling
procedure, a specification of the input and output parameters for foo can provide valuable
information, such as the logic used to obtain the specification for the output variables of
foo. As such, in the specification for the variables local to foo but outside the scope of
the calling procedure, we use the scope label foo. Therefore, if we have a variable q local
to foo, it might appear in a specification outside its local context as q{foo}4, where "4"
indicates the fourth instance of variable q in the context of foo.

In addition to the notations for variables, we use the notation '|' to denote a logical-or, '&' to denote a logical-and, and the symbols '(* *)' to delimit comments (i.e., specifications). In Figure 8, the code for the procedure FindMaxMin contains an alternation statement, where lines I, J, K, and L specify the guarded commands of the alternation statement (I and J), the effect of the alternation statement (K), and the effect of the entire procedure (L), respectively.
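The disjunctive form of the K and L annotations reflects the usual strongest postcondition rule for alternation (stated here as a sketch; AUTOSPEC's exact formulation may differ in detail):

$$ sp(\textbf{if}\ B\ \textbf{then}\ S_1\ \textbf{else}\ S_2,\ Q) \;=\; sp(S_1,\ B \wedge Q)\ \vee\ sp(S_2,\ \neg B \wedge Q). $$

Instantiating $B$ with NumOne0 > NumTwo0 and taking the branch specifications from lines I and J yields exactly the disjunction recorded at K.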
Of particular interest are the specifications for the swap procedures given in Figure 9, named swapa and swapb. The variables X and Y are specified using the notation described above. As such, the first assignment to Y is written using Y{0}1, where Y is the variable, '{0}' describes the level of nesting (here, it is zero), and '1' is the historical subscript, the '1' indicating the first instance of Y after the initial value. The final comment for swapa (line M), which gives the specification for the entire procedure, reads as:

(* (Y{0}2 = X0 & X{0}1 = Y0 & Y{0}1 = Y0 + X0) & U *)

where Y{0}2 = X0 is the specification of the final value of Y, and X{0}1 = Y0 is the specification of the final value of X. In this case, the intermediate value of Y, denoted Y{0}1, with value Y0 + X0, is not considered in the final value of Y.
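To see how these annotations arise, the assignment rule with historical subscripts can be applied to swapa step by step (a sketch in our own condensed notation, writing $Y_i$ for Y{0}i and $Y_0$, $X_0$ for the initial values; the invariant precondition U is carried along unchanged):

$$
\begin{aligned}
sp(\texttt{Y := Y + X},\ U) &= (Y_1 = Y_0 + X_0) \wedge U\\
sp(\texttt{X := Y - X},\ \cdot\,) &= (X_1 = (Y_0 + X_0) - X_0) \wedge (Y_1 = Y_0 + X_0) \wedge U\\
sp(\texttt{Y := Y - X},\ \cdot\,) &= (Y_2 = (Y_0 + X_0) - ((Y_0 + X_0) - X_0)) \wedge (X_1 = (Y_0 + X_0) - X_0) \wedge (Y_1 = Y_0 + X_0) \wedge U
\end{aligned}
$$

Simplifying the arithmetic gives $Y_2 = X_0$, $X_1 = Y_0$, and $Y_1 = Y_0 + X_0$, which is the specification reported at the end of swapa in Figure 9.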
Procedure swapb uses a temporary variable algorithm for swap. Line N is the specification after the execution of the last line and reads as:

(* (Y{0}1 = X0 & X{0}1 = Y0 & temp{0}1 = X0) & U *)

where Y{0}1 = X0 is the specification of the final value of Y, and X{0}1 = Y0 is the specification of the final value of X.
Although each implementation of the swap operation is different, the code in each procedure effectively produces the same results, a property appropriately captured by the respective specifications for swapa and swapb with respect to the final values of the variables X and Y.
In addition, Figure 10 shows the formal specification of the funnyswap procedure. The semantics for the funnyswap procedure are similar to those of swapb. However, the parameter passing scheme used in this procedure is pass by value.
The specification of the main begin-end block of the program MaxMin is given in Figure 10. There are eight lines of interest, labeled I, J, K, L, M, N, O, and P, respectively. Lines I and J specify the effects of assignment statements. The specification at line K demonstrates the use of identifier scope labels, where in this case we see the specification of variables X and Y from the context of swapa. Line L is another example of the same idea, where the specification of variables from the context of swapb (X and Y) is given.
In the main program, no variables local to the scope of the call to funnyswap are affected by funnyswap due to the pass-by-value nature of funnyswap, and thus the specification shows no change in variable values, which is shown by line M of Figure 10. The effect of the call to procedure FindMaxMin provides another example of the specification of a procedure call (line N). Finally, line P is the specification of the entire program, with every precondition propagated to the final postcondition as described in Section 3.1. Here, of interest are the final values of the variables that are local to the program MaxMin (i.e., a, b, and c). Thus, according to the rules for historical subscripts, the instances a{0}3, b{0}3, and c{0}1

procedure swapa( var X:integer; var Y:integer );

begin
  Y := (Y + X);
  (* (Y{0}1 = (Y0 + X0)) & U *)
  X := (Y - X);
  (* (X{0}1 = ((Y0 + X0) - X0)) & U *)
  Y := (Y - X);
  (* (Y{0}2 = ((Y0 + X0) - ((Y0 + X0) - X0))) & U *)
end
(* (Y{0}2 = X0 & X{0}1 = Y0 & Y{0}1 = Y0 + X0) & U *)

procedure swapb( var X:integer; var Y:integer );

var
  temp : integer;

begin
  temp := X;
  (* (temp{0}1 = X0) & U *)
  X := Y;
  (* (X{0}1 = Y0) & U *)
  Y := temp;
  (* (Y{0}1 = X0) & U *)
end
(* (Y{0}1 = X0 & X{0}1 = Y0 & temp{0}1 = X0) & U *)

procedure funnyswap( X:integer; Y:integer );

var
  temp : integer;

begin
  temp := X;
  (* (temp{0}1 = X0) & U *)
  X := Y;
  (* (X{0}1 = Y0) & U *)
  Y := temp;
  (* (Y{0}1 = X0) & U *)
end
(* (Y{0}1 = X0 & X{0}1 = Y0 & temp{0}1 = X0) & U *)

Figure 9. Output created by applying AUTOSPEC to example (cont.)

are of interest. In addition, by propagating the preconditions for each statement, the logic
that was used to obtain the values for the variables of interest can be analyzed.

6. Related Work

Previously, formal approaches to reverse engineering have used the semantics of the weak-
est precondition predicate transformer wp as the underlying formalism of their technique.
The Maintainer's Assistant uses a knowledge-based transformational approach to con-
struct formal specifications from program code via the use of a Wide-Spectrum Language (WSL) (Ward et al., 1989). A WSL is a language that uses both specification and imperative language constructs. A knowledge base manages the correctness-preserving

(* Main Program for MaxMin *)

begin
  a := 5;
  (* a{0}1 = 5 & U *)

  b := 10;
  (* b{0}1 = 10 & U *)

  swapa(a,b)
  (* (b{0}2 = 5 &
      (a{0}2 = 10 &
       (Y{swapa}2 = 5 &
        (X{swapa}1 = 10 & Y{swapa}1 = 15)))) & U *)
  swapb(a,b)
  (* (b{0}3 = 10 &
      (a{0}3 = 5 &
       (Y{swapb}1 = 10 &
        (X{swapb}1 = 5 & temp{swapb}1 = 10)))) & U *)
  funnyswap(a,b)
  (* (Y{funnyswap}1 = 5 & X{funnyswap}1 = 10 &
      temp{funnyswap}1 = 5) & U *)
  FindMaxMin(a,b,Largest,Smallest)
  (* (Smallest{0}1 = Min{FindMaxMin}1 &
      Largest{0}1 = Max{FindMaxMin}1 &
      (((5 > 10) &
        (Max{FindMaxMin}1 = 5 &
         Min{FindMaxMin}1 = 10)) |
       (not (5 > 10) &
        (Max{FindMaxMin}1 = 10 &
         Min{FindMaxMin}1 = 5)))) & U *)
  c := Largest;
  (* c{0}1 = Max{FindMaxMin}1 & U *)

  (* ((c{0}1 = Max{FindMaxMin}1) &
      (Smallest{0}1 = Min{FindMaxMin}1 &
       Largest{0}1 = Max{FindMaxMin}1 &
       (((5 > 10) &
         (Max{FindMaxMin}1 = 5 &
          Min{FindMaxMin}1 = 10)) |
        (not (5 > 10) &
         (Max{FindMaxMin}1 = 10 &
          Min{FindMaxMin}1 = 5))))) &
      (Y{funnyswap}1 = 5 & X{funnyswap}1 = 10 &
       temp{funnyswap}1 = 5) &
      (b{0}3 = 10 &
       a{0}3 = 5 &
       (Y{swapb}1 = 10 & X{swapb}1 = 5 &
        temp{swapb}1 = 10)) &
      (b{0}2 = 5 &
       a{0}2 = 10 &
       (Y{swapa}2 = 5 & X{swapa}1 = 10 &
        Y{swapa}1 = 15)) &
      (b{0}1 = 10 & a{0}1 = 5) & U *)

Figure 10. Output created by applying AUTOSPEC to example (cont.)

transformations of concrete, implementation constructs in a WSL to abstract specification constructs in the same WSL.

REDO (Lano and Breuer, 1989) (Restructuring, Maintenance, Validation and Documentation of Software Systems) is an Esprit II project whose objective is to improve applications by making them more maintainable through the use of reverse engineering techniques. The approach used to reverse engineer COBOL involves the development of general guidelines for the process of deriving objects and specifications from program code, as well as providing a framework for formally reasoning about objects (Haughton and Lano, 1991).
In each of these approaches, the applied formalisms are based on the semantics of the weakest precondition predicate transformer wp. Some differences in applying wp and sp are that wp is a backward rule for program semantics and assumes a total correctness model of execution. However, the total correctness interpretation has no forward rule (i.e., no strongest total postcondition stp (Dijkstra and Scholten, 1990)). By using a partial correctness model of execution, both a forward rule (sp) and a backward rule (wlp) can be used to verify and refine formal specifications generated by program understanding and reverse engineering tasks. The main difference between the two approaches is the ability to directly apply the strongest postcondition predicate transformer to code to construct formal specifications, versus using the weakest precondition predicate transformer as a guideline for constructing formal specifications.
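To make the backward/forward contrast concrete, the standard rules for assignment and sequencing can be written as follows (a sketch in the usual Dijkstra/Gries style, where $Q^{x}_{v}$ denotes textual substitution of $v$ for $x$ in $Q$):

$$
\begin{aligned}
wp(x := e,\ R) &= R^{x}_{e}, &\qquad sp(x := e,\ Q) &= (\exists v :: Q^{x}_{v} \wedge x = e^{x}_{v}),\\
wp(S_1; S_2,\ R) &= wp(S_1,\ wp(S_2,\ R)), &\qquad sp(S_1; S_2,\ Q) &= sp(S_2,\ sp(S_1,\ Q)).
\end{aligned}
$$

The wp rules work backward from a desired result R, whereas the sp rules work forward from a known precondition Q, which is what allows sp to be applied directly to existing code.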

7. Conclusions and Future Investigations

Formal methods provide many benefits in the development of software. Automating the process of abstracting formal specifications from program code is a sought-after goal but, unfortunately, is not yet completely realizable. However, by providing tools that support the reverse engineering of software, much can be learned about the functionality of a system.
The level of abstraction of the specifications constructed using the techniques described in this paper is the "as-built" level; that is, the specifications contain implementation-specific information. For straight-line programs (programs without iteration or recursion), the techniques described herein can be applied in order to obtain a formal specification from program code. As such, automated techniques for verifying the correctness of straight-line programs can be facilitated.
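For example, under a partial correctness interpretation, checking a straight-line program $S$ against a specification with precondition $Q$ and postcondition $R$ reduces to a single implication (a standard consequence of the definition of sp, stated here as a sketch):

$$ \{Q\}\ S\ \{R\}\quad\text{if and only if}\quad sp(S,\ Q) \Rightarrow R, $$

so a specification abstracted from code could, in principle, be checked against a higher-level postcondition by discharging this implication with a theorem prover.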
Since our technique to reverse engineering is based on the use of strongest postcondition
for deriving formal specifications from program code, the application of the technique to
other programming languages can be achieved by defining the formal semantics of a pro-
gramming language using strongest postcondition, and then applying those semantics to the
programming constructs of a program. Our current investigations into the use of strongest
postcondition for reverse engineering focus on three areas. First, we are extending our
method to encompass all major facets of imperative programming constructs, including
iteration and recursion. To this end, we are in the process of defining the formal seman-
tics of the ANSI C programming language using strongest postcondition and are applying
our techniques to a NASA mission control application for unmanned spacecraft. Second,
methods for constructing higher level abstractions from lower level abstractions are be-
ing investigated. Finally, a rigorous technique for re-engineering specifications from the
imperative programming paradigm to the object-oriented programming paradigm is being
developed (Gannod and Cheng, 1993). Directly related to this work is the potential for

applying the results to facilitate software reuse, where automated reasoning is applied to
the specifications of existing components to determine reusability (Jeng and Cheng, 1992).

Acknowledgments

The authors greatly appreciate the comments and suggestions from the anonymous refer-
ees. Also, the authors wish to thank Linda Wills for her efforts in organizing this special
issue. Finally, the authors would like to thank the participants of the IEEE 1995 Working
Conference on Reverse Engineering for the feedback and comments on an earlier version
of this paper.
This is a revised and extended version of "Strongest Postcondition Semantics as the
Formal Basis for Reverse Engineering" by G.C. Gannod and B.H.C. Cheng, which first
appeared in the Proceedings of the Second Working Conference on Reverse Engineering,
IEEE Computer Society Press, pp. 188-197, July 1995.

Appendix A
Motivations for Notation and Removal of Quantification

Section 3.1 states a conjecture that the removal of the quantification for the initial values of
a variable is valid if the precondition Q has a conjunct that specifies the textual substitution.
This Appendix discusses this conjecture. Recall that

$sp(x := e,\ Q) = (\exists v :: Q^{x}_{v} \wedge x = e^{x}_{v})$.    (A.1)

There are two goals that must be satisfied in order to use the definition of strongest post-
condition for assignment. They are:

1. Elimination of the existential quantifier

2. Development and use of a traceable notation.

Eliminating the Quantifier. First, we address the elimination of the existential quantifier. Consider the RHS of definition (A.1). Let y be a variable such that

$(Q^{x}_{y} \wedge x = e^{x}_{y}) \Rightarrow (\exists v :: Q^{x}_{v} \wedge x = e^{x}_{v})$.    (A.2)

Define $sp_{\rho}(x := e,\ Q)$ (pronounced "s-p-rho") as the strongest postcondition for assignment with the quantifier removed. That is,

$sp_{\rho}(x := e,\ Q) = (Q^{x}_{y} \wedge x = e^{x}_{y})$ for some $y$.    (A.3)

Given the definition of $sp_{\rho}$, it follows that

$sp_{\rho}(x := e,\ Q) \Rightarrow sp(x := e,\ Q)$.    (A.4)



As such, the specification of the assignment statement can be made simpler if y from equation (A.3) can either be identified explicitly or named implicitly. The choice of y must be made carefully. For instance, consider the following. Let $Q := P \wedge (x = z)$ such that P contains no free occurrences of x. Choosing an arbitrary a for y in (A.3) leads to the following derivation:

$sp_{\rho}(x := e,\ Q)$
  = {$Q := P \wedge (x = z)$}
    $(P \wedge (x = z))^{x}_{a} \wedge (x = e^{x}_{a})$
  = {textual substitution}
    $P^{x}_{a} \wedge (x = z)^{x}_{a} \wedge (x = e^{x}_{a})$
  = {P has no free occurrences of x; textual substitution}
    $P \wedge (a = z) \wedge (x = e^{x}_{a})$
  = {$a = z$}
    $P \wedge (a = z) \wedge (x = (e^{x}_{a})^{a}_{z})$
  = {textual substitution}
    $P \wedge (a = z) \wedge (x = e^{x}_{z})$.

At first glance, this choice of y would seem to satisfy the first goal, namely removal of the quantification. However, this is not the case. Suppose P were replaced with $P' \wedge (a \neq z)$. The derivation would lead to

$sp_{\rho}(x := e,\ Q) = P' \wedge (a \neq z) \wedge (a = z) \wedge (x = e^{x}_{a})$.

This is unacceptable because it leads to a contradiction, meaning that the specification of a program describes impossible behaviour. Ideally, it is desired that the specification of the assignment statement satisfy two requirements. It must:

1. Describe the behaviour of the assignment of the variable x, and

2. Adjust the precondition Q so that the free occurrences of x are replaced with the value of x before the assignment is encountered.

It can be proven that, through successive assignments to a variable x, the specification $sp_{\rho}$ will have only one conjunct of the form $(x = \beta)$, where $\beta$ is an expression. Informally, we note that each successive application of $sp_{\rho}$ uses a textual substitution that eliminates free references to x in the precondition and introduces a conjunct of the form $(x = \beta)$.

The convention used by the approach described in this paper is to choose for y the expression $\beta$. If no $\beta$ can be identified, use a place holder $\gamma$ such that the precondition Q has no occurrence of $\gamma$. As an example, let y in equation (A.3) be z, and $Q := P \wedge (x = z)$. Then

$sp_{\rho}(x := e,\ Q) = P \wedge (z = z) \wedge (x = e^{x}_{z})$.

Notice that the last conjunct in each of the derivations is $(x = e^{x}_{y})$ and that, since P contains no free occurrences of x, P is an invariant.

Notation. Define $sp_{\rho\iota}$ (pronounced "s-p-rho-iota") as the strongest postcondition for assignment with the quantifier removed and indices. Formally, $sp_{\rho\iota}$ has the form

$sp_{\rho\iota}(x := e,\ Q) = (Q^{x}_{y} \wedge x_{k} = e^{x}_{y})$ for some $y$.    (A.5)

Again, an appropriate y must be chosen. Let $Q := P \wedge (x_{i} = y)$, where P has no occurrence of x other than i subscripted x's of the form $(x_{j} = e_{j})$, $0 < j < i$. Based on the previous discussion, choose y to be the RHS of the relation $(x_{i} = y)$. As such, the definition of $sp_{\rho\iota}$ can be modified to appear as

$sp_{\rho\iota}(x := e,\ Q) = ((P \wedge (x_{i} = y))^{x}_{y} \wedge x_{i+1} = e^{x}_{y})$ for some $y$.    (A.6)

Consider the following example where subscripts are used to show the effects of two consecutive assignments to the variable x. Let $Q := P \wedge (x_{i} = a)$, and let the assignment statement be x := e. Application of $sp_{\rho\iota}$ yields

$sp_{\rho\iota}(x := e,\ Q) = (P \wedge (x_{i} = a))^{x}_{a} \wedge (x_{i+1} = e)^{x}_{a}$
  = {textual substitution}
    $P^{x}_{a} \wedge (x_{i} = a)^{x}_{a} \wedge (x_{i+1} = e)^{x}_{a}$
  = {textual substitution}
    $P \wedge (x_{i} = a) \wedge (x_{i+1} = e^{x}_{a})$

A subsequent application of $sp_{\rho\iota}$ on the statement x := f, subject to $Q' := Q \wedge (x_{i+1} = e^{x}_{a})$, has the following derivation:

$sp_{\rho\iota}(x := f,\ Q') = (P \wedge (x_{i} = a) \wedge (x_{i+1} = e^{x}_{a}))^{x}_{e^{x}_{a}} \wedge x_{i+2} = f^{x}_{e^{x}_{a}}$
  = {textual substitution}
    $P^{x}_{e^{x}_{a}} \wedge (x_{i} = a)^{x}_{e^{x}_{a}} \wedge (x_{i+1} = e^{x}_{a})^{x}_{e^{x}_{a}} \wedge x_{i+2} = f^{x}_{e^{x}_{a}}$
  = {P has no free x; textual substitution}
    $P \wedge (x_{i} = a) \wedge (x_{i+1} = e^{x}_{a}) \wedge x_{i+2} = f^{x}_{e^{x}_{a}}$
  = {definition of Q}
    $Q \wedge (x_{i+1} = e^{x}_{a}) \wedge x_{i+2} = f^{x}_{e^{x}_{a}}$
  = {definition of Q'}
    $Q' \wedge x_{i+2} = f^{x}_{e^{x}_{a}}$
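As a concrete illustration (our own small instance, not one of the paper's figures), consider the first two assignments of the main program of Figure 7 with an empty precondition U:

$$
\begin{aligned}
sp_{\rho\iota}(\texttt{a := 5},\ U) &= U \wedge (a_{1} = 5),\\
sp_{\rho\iota}(\texttt{b := 10},\ U \wedge (a_{1} = 5)) &= U \wedge (a_{1} = 5) \wedge (b_{1} = 10),
\end{aligned}
$$

which is the information recorded by the first two annotations of Figure 10; a subsequent a := a + 1 would add the conjunct $(a_{2} = 5 + 1)$ while leaving $(a_{1} = 5)$ in place, so the history of a remains traceable.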

Therefore, it is observed that by using historical subscripts, the construction of the speci-
fication of the assignment statements involves the propagation of the precondition Q as an
invariant conjuncted with the specification of the effects of setting a variable to a dependent
value. This convention makes the evaluation of a specification annotation traceable by
avoiding the elimination of descriptions of variables and their values at certain steps in the
program. This is especially helpful in the case where choice statements (alternation and
iteration) create alternative values for specific variable instances.

References

Byrne, Eric J. A Conceptual Foundation for Software Re-engineering. In Proceedings for the Conference on
Software Maintenance, pages 226-235. IEEE, 1992.
Byrne, Eric J. and Gustafson, David A. A Software Re-engineering Process Model. In COMPSAC. IEEE, 1992.
Cheng, Betty H. C. Applying formal methods in automated software development. Journal of Computer and
Software Engineering, 2(2): 137-164, 1994.
Cheng, Betty H.C., and Gannod, Gerald C. Abstraction of Formal Specifications from Program Code. In
Proceedings for the IEEE 3rd International Conference on Tools for Artificial Intelligence, pages 125-128.
IEEE, 1991.
Chikofsky, Elliot J. and Cross, James H. Reverse Engineering and Design Recovery: A Taxonomy. IEEE
Software, 7(1): 13-17, January 1990.
Dijkstra, Edsger W. A Discipline of Programming. Prentice Hall, 1976.
Dijkstra, Edsger W. and Scholten, Carel S. Predicate Calculus and Program Semantics. Springer-Verlag, 1990.
Flor, Victoria Slid. Ruling's Dicta Causes Uproar. The National Law Journal, July 1991.
Gannod, Gerald C. and Cheng, Betty H.C. A Two Phase Approach to Reverse Engineering Using Formal Methods.
Lecture Notes in Computer Science: Formal Methods in Programming and Their Applications, 735:335-348,
July 1993.
Gannod, Gerald C. and Cheng, Betty H.C. Facilitating the Maintenance of Safety-Critical Systems Using Formal Methods. The International Journal of Software Engineering and Knowledge Engineering, 4(2):183-204, 1994.
Gries, David. The Science of Programming. Springer-Verlag, 1981.
Haughton, H.P., and Lano, Kevin. Objects Revisited. In Proceedings for the Conference on Software Maintenance,
pages 152-161. IEEE, 1991.
Hoare, C. A. R. An axiomatic basis for computer programming. Communications of the ACM, 12(10):576-580,
October 1969.
Jeng, Jun-jang and Cheng, Betty H. C. Using Automated Reasoning to Determine Software Reuse. International
Journal of Software Engineering and Knowledge Engineering, 2(4):523-546, December 1992.
Katz, Shmuel and Manna, Zohar. Logical Analysis of Programs. Communications of the ACM, 19(4): 188-206,
April 1976.
Lano, Kevin and Breuer, Peter T. From Programs to Z Specifications. In John E. Nicholls, editor, Z User Workshop,
pages 46-70. Springer-Verlag, 1989.
Leveson, Nancy G. and Turner, Clark S. An Investigation of the Therac-25 Accidents. IEEE Computer, pages 18-41, July 1993.
Osborne, Wilma M. and Chikofsky, Elliot J. Fitting pieces to the maintenance puzzle. IEEE Software, 7(1): 11-12,
January 1990.
Ward, M., Calliss, F.W., and Munro, M. The Maintainer's Assistant. In Proceedings for the Conference on
Software Maintenance. IEEE, 1989.
Wing, Jeannette M. A Specifier's Introduction to Formal Methods. IEEE Computer, 23(9):8-24, September
1990.
Yourdon, E. and Constantine, L. Structured Analysis and Design: Fundamentals Discipline of Computer Programs
and System Design. Yourdon Press, 1978.
Automated Software Engineering, 3, 165-172 (1996)
© 1996 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Recent Trends and Open Issues in Reverse


Engineering
LINDA M. WILLS linda.wills@ee.gatech.edu
School of Electrical and Computer Engineering
Georgia Institute of Technology, Atlanta, GA 30332-0250

JAMES H. CROSS II cross@eng.auburn.edu


Auburn University, Computer Science and Engineering
107 Dunstan Hall, Auburn University, AL 36849

Abstract. This paper discusses recent trends in the field of reverse engineering, particularly those highlighted
at the Second Working Conference on Reverse Engineering, held in July 1995. The trends observed include
increased orientation toward tasks, grounding in complex real-world applications, guidance from empirical study,
analysis of non-code sources, and increased formalization. The paper also summarizes open research issues and
provides pointers to future events and sources of information in this area.

1. Introduction

Researchers in reverse engineering use a variety of metaphors to describe the role their
work plays in software development and evolution. They are detectives, piecing together
clues incrementally discovered about a system's design and what "crimes" were committed
in its evolution. They are rescuers, salvaging huge software investments, left stranded by
shifting hardware platforms and operating systems. Some practice radiology, finding ways
of viewing internal structures, obscured by and entangled with other parts of the software
"organism": objects in procedural programs, logical data models in relational databases,
and data and control flow "circulatory and nervous systems." Others are software arche-
ologists (Chikofsky, 1995), reconstructing models of structures buried in the accumulated
deposits of software patches and fixes; inspectors, measuring compliance with design, cod-
ing, and documentation standards; foreign language interpreters, translating software in one
language to another; and treasure hunters and miners, searching for gems to extract, polish,
and save in a reuse library.
Although working from diverse points of view, reverse engineering researchers have a
common goal of recovering information from existing software systems. Conceptual com-
plexity is the software engineer's worst enemy. It directly affects costs and ultimately the
reliability of the delivered system. Comprehension of existing systems is the underlying goal
of reverse engineering technology. By examining and analyzing the system, the reverse engi-
neering process generates multiple views of the system that highlight its salient features and
delineate its components and the relationships between them (Chikofsky and Cross, 1990).
Recovering this information makes possible a wide array of critical software engineering
activities, including those mentioned above. The prospect of being able to provide tools
and methodologies to assist and automate portions of the reverse engineering process is

an appealing one. Reverse engineering is an area of tremendous economic importance to


the software industry not only in saving valuable existing assets, but also in facilitating the
development of new software.
From the many different metaphors used to describe the diverse roles that reverse engi-
neering plays, it is apparent that supporting and semi-automating the process is a complex,
multifarious problem. There are many different types of information to extract and many
different task situations, with varying availability and accuracy of information about the
software. A variety of approaches and skills is required to attack this problem.
To help achieve coherence and facilitate communication in this rapidly growing field,
researchers and practitioners have been meeting at the Working Conference on Reverse
Engineering, the first of which was held in May 1993 (Waters and Chikofsky, 1993). The
Working Conference provides a forum for researchers to discuss as a group current research
directions and challenges to the field. The adjective "working" in the title emphasizes the
conference's format of interspersing significant periods of discussion with paper presenta-
tions. The Second Working Conference on Reverse Engineering (Wills et al., 1995) was
held in July, 1995, organized by general chair Elliot Chikofsky of Northeastern University
and the DMR Group, and by program co-chairs Philip Newcomb of the Software Revolution
and Linda Wills of Georgia Institute of Technology.
This article uses highlights and observations from the Second Working Conference on
Reverse Engineering to present a recent snapshot of where we are with respect to our overall
goals, what new trends are apparent in the field, and where we are heading. It also points
out areas where hopefully more research attention will be drawn in the future. Finally,
it provides pointers to future conferences and workshops in this area and places to find
additional information.

2. Increased Task-Orientation

The diverse set of metaphors listed above indicates the variety of tasks in which reverse
engineering plays a significant role. Different tasks place different demands on the reverse
engineering process. The issue in reverse engineering is not only how to extract information
from an existing system, but which information should be extracted and in what form should
it be made accessible? Researchers are recognizing the need to tailor reverse engineering
tools toward recovering information relevant to the task at hand. Mechanisms for focused,
goal-driven inquiries about a software system are actively being developed.
Dynamic Documentation. A topic of considerable interest is automatically generating
accessible, dynamic documentation from legacy systems. Lewis Johnson coined the phrase
"explanation on demand" for this type of documentation technology (Johnson, 1995). The
strategy is to concentrate on generating only documentation that addresses specific tasks,
rather than generating all possible documentation whether it is needed or not.
Two important open issues are: what formalisms are appropriate for documentation, and
how well do existing formalisms match the particular tasks maintainers have to perform?
These issues are relevant to documentation at all levels of abstraction. For example, a
similar issue arises in program understanding: what kinds of formal design representations

should be used as a target for program understanding systems? How can multiple models
of design abstractions be extracted, viewed, and integrated?
Varying the Depth of Analysis. Depending on the task, different levels of analysis
power are required. For example, recent advances have been made in using analysis
techniques to detect duplicate fragments of code in large software systems (Baker, 1995,
Kontogiannis et al., 1995). This is useful in identifying candidates for reuse and in prevent-
ing inconsistent maintenance of conceptually related code. If a user were interested only
in detecting instances of "cut-and-paste" reuse, it would be sufficient to find similarities
based on matching syntactic features (e.g., constant and function names, variable usage,
and keywords), without actually understanding the redundant pieces. The depth of analysis
must be increased, however, if more complex, semantic similarities are to be detected, for
example, for the task of identifying families of reusable components that all embody the
same mathematical equations or business rules.
Interactive Tools. Related to the issue of providing flexibility in task-oriented tools is
the degree of automation and interaction the tools have with people (programmers, main-
tainers, and domain experts (Quilici and Chin, 1995)). How is the focusing done? Who
is controlling the depth of analysis and level of effort? The reverse engineering process
is characterized by a search for knowledge about a design artifact with limited sources of
information available. The person and the tool each bring different types of interpretive
skills and information sources to the discovery process. A person can often see global
patterns in data or subtle connections to informal domain concepts that would be difficult
for tools based on current technology to uncover. Successful collaboration will depend on
finding ways to leverage the respective abilities of the collaborators. The division of labor
will be influenced by the task and environmental situation.

3. Attacking Industrial-Strength Problems

The types of problems that are driving reverse engineering research come from real-world
systems and applications. Early work tended to focus on simplified versions of reverse en-
gineering problems, often using data that did not always scale up to more realistic problems
(Selfridge et al., 1993). This helped in initial explorations of techniques that have since
matured.
At the Working Conference, several researchers reported on the application of reverse
engineering techniques to practical industrial problems with results of significant economic
importance. The software and legacy systems to which their techniques are being applied
are quite complex, large, and diverse. Examples include a public key encryption program,
industrial invoicing systems, the X window system, and software for analyzing data sent
back from space missions. Consequently, the types of information being extracted from
existing software spans a wide range, including specifications, business rules, objects, and
more recently, architectural features. A good example of a large scale application was
provided by Philip Newcomb (Newcomb,1995) who presented a tool, called the Legacy
System Cataloging Facility. This tool supports modeling, analyzing, and transforming
legacy systems on an enterprise scale by providing a mechanism for efficiently storing and
managing huge models of information systems at Boeing Computer Services.

Current applications are pushing the limits of existing techniques in terms of scalability
and feasibility. Exploring these issues and developing new techniques in the context of
real-world systems and problems is critical.

4. More Empirical Studies

One of the prerequisites in addressing real-world, economically significant reverse engi-


neering problems is understanding what the problems are and establishing requirements on
what it would take to solve them. Researchers are recognizing the necessity of conducting
studies that examine what practitioners are doing currently, what is needed to support them,
and how well (or poorly) the existing technology is meeting their needs.
The results of one such full-scale case study were presented at the Working Conference
by Piernicola Fiore (Fiore et al., 1995). The study focused on a reverse engineering project
at a software factory (Basica S.p.A in Italy) to reverse engineer banking software. Based
on an analysis of productivity, the study identified the need for adaptable automated tools.
Results indicated that cost is not necessarily related to number of lines of code, and that
both the data and the program need distinct econometric models.
In addition to this formal, empirical investigation, some informal studies were reported
at the Working Conference. During a panel discussion, Lewis Johnson described his work
on dynamic, accessible documentation, which was driven by studies of inquiry episodes
gathered from newsgroups. This helped to determine what types of questions software users
and maintainers typically ask. (Blaha and Premerlani, 1995) reported on idiosyncracies they
observed in relational database designs, many of which are in commercial software products!
Empirical data is useful not only in driving and guiding reverse engineering technology
development, but also in estimating the effort involved in reverse engineering a given system.
This can influence a software engineer's decisions about whether to reengineer a system or
opt for continued maintenance or a complete redesign (Newcomb,1995). While the value
of case studies is widely recognized, relatively few have been conducted thus far.
Closely related to this problem is the critical need for publicly available data sets that em-
body representative reverse engineering problems (e.g., a legacy database system including
all its associated documentation (Selfi"idge et al., 1993)). Adopting these as standard test
data sets would enable researchers to quantitatively compare results and set clear milestones
for measuring progress in the field.
Unfortunately, it is difficult to find data sets that can be agreed upon as being representative
of those found in common reverse engineering situations. They must not be proprietary and
they must be made easily accessible. Papers describing case studies and available data sets
would significantly contribute to advancing the research in this field (Selfridge et al., 1993)
and are actively sought by the Working Conference.

5. Looking Beyond Code for Sources of Information

In trying to understand aspects of a software system, a reverse engineer uses all the sources of
information available. In the past, most reverse engineering research focused on supporting

the recovery of information solely from the source code. Recently, the value of non-
code system documents as rich sources of information has been recognized. Documents
associated with the source code often contain information that is difficult to capture in
the source code itself, such as design rationale, connections to "human-oriented" concepts
(Biggerstaff et al., 1994), or the history of evolutionary steps that went into creating the
software.
For example, at the Working Conference, analysis techniques were presented that auto-
matically derived test cases from reference manuals and structured requirements (Lutsky,
1995), business rules and a domain lexicon from structured analysis specifications (Leite
and Cerqueira, 1995), and formal semantics from dataflow diagrams (Butler et al., 1995).
A crucial open issue in this area of exploration is what happens when one source of
information is inaccurate or inconsistent with another source of information, particularly
the code. Who is the final arbiter? Often it is valuable simply to detect such inconsistencies,
as is the case in generating test cases.

6. Increased Formalization

When a field is just beginning to form, it is common for researchers to try many different
informal techniques and experimental methodologies to get a handle on the complex prob-
lems they face. As the field matures, researchers start to formalize their methods and the
underlying theory. The field of reverse engineering is starting to see this type of growth.
A fruitful interplay is emerging between prototyping and experimenting with new tech-
niques which are sketched out informally, and the process of formalization which tries to
provide an underlying theoretical basis for these informal techniques. This helps make the
methods more precise and less prone to ambiguous results. Formal methods contribute to the
validation of reverse engineering technology and to a clearer understanding of fundamental
reverse engineering problems.
While formal methods, with their well-defined notations, also have a tremendous potential
for facilitating automation, the current state-of-the-art focuses on small programs. This
raises issues of practicality, feasibility and scalability. A promising strategy is to explore how
formal methods can be used in conjunction with other approaches, for example, coupling
pattern matching with symbolic execution.
Although the formal notations lend themselves to machine manipulation, they tend to
introduce a communication barrier between the reverse engineer who is not familiar with
formal methods and the machine. Making reverse engineering tools based on formal meth-
ods accessible to practicing engineers will require the support of interfaces to the formal
notations, including graphical notations and domain-oriented representations, such as those
being explored in applying formal methods to component-based reuse (Lowry et al., 1994).

7. Challenges for the Future

Other issues not specifically addressed by papers presented at the Working Conference
include:

• How do we validate and test reverse engineering technology?


• How do we measure its potential impact? How can we support the critical task of
assessment that should precede any reverse engineering activity? This includes deter-
mining how amenable an artifact is to reverse engineering, what outcome is expected,
the estimated cost of the reverse engineering project and the anticipated cost of not reverse engineering. Most reverse engineering research assumes that reverse engineering will be performed and thus overlooks this critical assessment task, which needs tools and methodologies to support it.
• What can we do now to prevent the software systems we are currently creating from
becoming the incomprehensible legacy systems of tomorrow? For example, what new
problems does object-oriented code present? What types of programming language
features, documentation, or design techniques are helpful for later comprehension and
evolution of the software?
• A goal of reverse engineering research is to raise the conceptual level at which software
tools interact and communicate with software engineers, domain experts, and end users.
This raises issues concerning how to most effectively acquire, refine, and use knowledge
of the application domain. How can it be used to organize and present information
extracted in terms the tool user can readily comprehend? What new presentation and
visualization techniques are useful? How can domain knowledge be captured from non-
code sources? What new techniques are needed to reverse engineer programs written in
non-traditional, domain-oriented "languages," such as spreadsheets, database queries,
granmiar-based specifications, and hardware description languages?
• A clearer articulation of the reverse engineering process is needed. What is the life-
cycle of a reverse engineering activity and how does it relate to the forward engineering
life-cycle? Can one codify the best practices of reverse engineers, and thereby improve
the effectiveness of reverse engineering generally?
• What is management's role in the success of reverse engineering technology? From
the perspective of management, reverse engineering is often seen as a temporary set of
activities, focused on short-term transition. As such, management is reluctant to invest
heavily in reverse engineering research, education, and application. In reality, reverse
engineering can be used in forward engineering as well as maintenance to better control
conceptual complexity across the life-cycle of evolving software.

8. Conclusion and Future Events

This article has highlighted the key trends in the field of reverse engineering that we observed at the Second Working Conference. More details about the WCRE presentations and discussions are given in (Cross et al., 1995). The 1993 and 1995 WCRE proceedings are
available from IEEE Computer Society Press.
Even more important than the trends and ideas discussed is the energy and enthusiasm
shared by the research community. Even though the problems being attacked are complex,

they are intensely interesting and highly relevant to many software-related activities. One of
the hallmarks of the Working Conference is that Elliot Chikofsky manages to come up with
amusing reverse engineering puzzles that allow attendees to revel in the reverse engineering
process. For example, at the First Working Conference, he challenged attendees to reverse
engineer jokes given only their punch-lines. This year, he created a "reverse taxonomy"
of tongue-in-cheek definitions that needed to be reverse engineered into computing-related
words.^
The next Working Conference is planned for November 8-10, 1996 in Monterey, CA.
It will be held in conjunction with the 1996 International Conference on Software Main-
tenance (ICSM). Further information on the upcoming Working Conference can be found at
http://www.ee.gatech.edu/conferences/WCRE or by sending mail to wcre@computer.org.
Other future events related to reverse engineering include:

• the Workshop on Program Comprehension, which was held in conjunction with the
International Conference on Software Engineering in March, 1996 in Berlin, Germany;

• the International Workshop on Computer-Aided Software Engineering (CASE), which


is being planned for London, England, in the Summer of 1997; and

• the Reengineering Forum, a commercially-oriented meeting, which complements the


Working Conference and is being held June 27-28,1996 in St. Louis, MO.

Acknowledgments

This article is based, in part, on notes taken by rapporteurs at the Second Working Con-
ference on Reverse Engineering: Gerardo Canfora, David Eichmann, Jean-Luc Hainaut,
Lewis Johnson, Julio Cesar Leite, Ettore Merlo, Michael Olsem, Alex Quilici, Howard
Reubenstein, Spencer Rugaber, and Mark Wilson. We also appreciate comments from
Lewis Johnson which contributed to our list of challenges.

Notes

1. Some examples of Elliot's reverse taxonomy: (A) a suggestion made to a computer; (B) the answer when asked "what is that bag the Blue Jays batter runs to after hitting the ball?" (C) an instrument used for entering errors into a system. Answers: (A) command; (B) database; (C) keyboard.

References

Baker, B. On finding duplication and near-duplication in large software systems. In (Wills et al., 1995), pages 86-95.
Biggerstaff, T., B. Mitbander, and D. Webster. Program understanding and the concept assignment problem. Communications of the ACM, 37(5):72-83, May 1994.
Blaha, M. and W. Premerlani. Observed idiosyncracies of relational database designs. In (Wills et al., 1995), pages 116-125.

Butler, G., P. Grogono, R. Shinghal, and I. Tjandra. Retrieving information from data flow diagrams. In (Wills et al., 1995), pages 22-29.
Chikofsky, E. Message from the general chair. In (Wills et al., 1995) (contains a particularly vivid analogy to archeology), page ix.
Chikofsky, E. and J. Cross. Reverse engineering and design recovery: A taxonomy. IEEE Software, pages 13-17, January 1990.
Cross, J., A. Quilici, L. Wills, P. Newcomb, and E. Chikofsky. Second working conference on reverse engineering summary report. ACM SIGSOFT Software Engineering Notes, 20(5):23-26, December 1995.
Fiore, P., F. Lanubile, and G. Visaggio. Analyzing empirical data from a reverse engineering project. In (Wills et al., 1995), pages 106-114.
Johnson, W. L. Interactive explanation of software systems. In Proc. 10th Knowledge-Based Software Engineering Conference, pages 155-164, Boston, MA, 1995. IEEE Computer Society Press.
Kontogiannis, K., R. DeMori, M. Bernstein, M. Galler, and E. Merlo. Pattern matching for design concept localization. In (Wills et al., 1995), pages 96-103.
Leite, J. and P. Cerqueira. Recovering business rules from structured analysis specifications. In (Wills et al., 1995), pages 13-21.
Lowry, M., A. Philpot, T. Pressburger, and I. Underwood. A formal approach to domain-oriented software design environments. In Proc. 9th Knowledge-Based Software Engineering Conference, pages 48-57, Monterey, CA, 1994.
Lutsky, P. Automating testing by reverse engineering of software documentation. In (Wills et al., 1995), pages 8-12.
Newcomb, P. Legacy system cataloging facility. In (Wills et al., 1995), pages 52-60, July 1995.
Quilici, A. and D. Chin. Decode: A cooperative environment for reverse-engineering legacy software. In (Wills et al., 1995), pages 156-165.
Selfridge, P., R. Waters, and E. Chikofsky. Challenges to the field of reverse engineering - A position paper. In
Proc. of the First Working Conference on Reverse Engineering, pages 144-150, Baltimore, MD, May 1993.
IEEE Computer Society Press.
Waters, R. and E. Chikofsky, editors. Proc. of the First Working Conference on Reverse Engineering, Baltimore,
MD, May 1993. IEEE Computer Society Press.
Wills, L., P. Newcomb, and E. Chikofsky, editors. Proc. of the Second Working Conference on Reverse Engi-
neering, Toronto, Ontario, July 1995. IEEE Computer Society Press.
Automated Software Engineering 3, 173-178 (1996)
© 1996 Kluwer Academic Publishers. Manufactured in The Netherlands.

Desert Island Column


JOHN DOBSON john.dobson@newcastle.ac.uk
Centre for Software Reliability, Bedson Building, University of Newcastle, Newcastle NEl 7RU, U.K.

When I started preparing for this article, I looked along my bookshelves to see what books
I had on software engineering. There were none. It is not that software engineering has not
been part of my life, but that I have not read anything on it as a subject that I wished to keep
in order to read again. There were books on software, and books on engineering, and books
on many a subject of interest to software engineers such as architecture and language. In
fact these categories provided more than enough for me to wish to take to the desert island,
so making the selection provided an enjoyable evening. I also chose to limit my quota to
six (or maybe the editor did, I forget).
Since there was an element of choice involved, I made for myself some criteria: it had to
be a book that I had read and enjoyed reading, it had to have (or have had) some significance
for me in my career, either in terms of telling me how to do something or increasing my
understanding, it had to be relevant to the kind of intellectual exercise we engage in when
we are engineering software, and it had to be well-written. Of these, the last was the most
important. There is a pleasure to be gained from reading a well-written book simply because
it is written well. That doesn't necessarily mean easy to read; it means that there is a just and
appropriate balance between what the writer has brought to the book and what the reader
needs to bring in order to get the most out of it. All of my chosen books are well-written.
All are worth reading for the illumination they shed on software engineering from another
source, and I hope you will read them for that reason.
First, a book on engineering: To Engineer is Human, by Petroski (1985). Actually it is not
so much about engineering (understood as meaning civil engineering) as about the history
of civil engineering. Perhaps that is why I have no books on software engineering: the disci-
pline is not yet old enough to have a decent history, and so there is not much of interest to say
about it. What is interesting about Petroski's book, though, is the way it can be used as a base
text for a future book on the history of software engineering, for it shows how the civil engi-
neering discipline (particularly the building of bridges) has developed through disaster. The
major bridge disasters of civil engineering history—Tay Bridge, Tacoma Narrows—have
their analogues in our famous disasters—Therac, the London Ambulance Service. The im-
portance of disasters lies, of course, in what is learnt from them; and this means that they
have to be well documented. The examples of software disasters that I gave have been doc-
umented, the London Ambulance Service particularly, but these are in the minority. There
must be many undocumented disasters in software engineering, from which as a result noth-
ing has been learnt. This is yet another example of the main trouble with software being
its invisibility, which is why engineering it is so hard. It is probably not possible, at least
in the western world, to have a major disaster in civil engineering which can be completely
concealed; Petroski's book shows just how this has helped the development of the discipline.

What makes Petroski's book so pleasant to read is the stress he places on engineering as a
human activity and on the forces that drive engineers. Engineering is something that is born
in irritation with something that is not as good as it could have been, a matter of making
bad design better. But of course this is only part of the story. There is the issue of what the
artifact is trying to achieve to consider. Engineering design lies in the details, the "minutely
organized Particulars" as Blake calls them^. But what about the general principles, the grand
scheme of things in which the particulars have a place, the "generalizing Demonstrations
of the Rational Power"? In a word, the architecture—and of course the architect (the
"Scoundrel, Hypocrite and Flatterer" who appeals to the "General Good"?).
It seems that the software engineer's favourite architect is Christopher Alexander. A
number of colleagues have been influenced by that remarkable book A Pattern Language
(Alexander et al., 1977), which is the architects' version of a library of reusable object
classes. But for all its influence over software architects (its influence over real architects
is, I think, much less noticeable), it is not the one I have chosen to take with me. Alexander's
vision of the architectural language has come out of his vision of the architectural process,
which he describes in an earlier book. The Timeless Way of Building (Alexander, 1979).
He sees the creation of pattern languages as being an expression of the actions of ordinary
people who shape buildings for themselves instead of having the architect do it for them.
The role of the architect is that of a facilitator, helping people to decide for themselves what
it is they want. This is a process which Alexander believes has to be rediscovered, since
the languages have broken down, are no longer shared, because the architects and planners
have taken them for themselves.
There is much talk these days of empowerment. I am not sure what it means, though
I am sure that a lot of people who use it do not know what it means either. When it is
not being used merely as a fashionable management slogan, empowerment seems to be a
recognition of the embodiment in an artifact of the Tao, the quality without a name. As
applied to architecture, this quality has nothing to do with the architecture of the building
or with the processes it supports and which stem from it. The architecture and architectural
process should serve to release a more basic understanding which is native to us. We find
that we already know how to make the building live, but that the power has been frozen in
us. Architectural empowerment is the unfreezing of this ability.
The Timeless Way of Building is an exploration of this Zen-like way of doing architecture.
Indeed the book could have been called Zen and the Art of Architecture, but fortunately it
was not. A cynical friend of mine commented, after he had read the book, "It is good to have
thought like that"—the implication being that people who have been through that stage are
more mature in their thinking than those who have not or who are still in it. I can see what he
means but I think he is being unfair. I do not think we have really given this way of building
systems a fair try. Christopher Alexander has, of course, and the results are described in
two of his other books, The Production of Houses (Alexander et al., 1985) and The Oregon
Experiment (Alexander et al., 1975). Reading between the lines of these two books does
seem to indicate that the process was perhaps not as successful as it might have been and I
think there is probably scope for an architectural process engineer to see what could be done
to improve the process design. Some experiments in designing computer systems that way
have been performed. One good example is described in Pelle Ehn's book Work-Oriented

Design of Computer Artifacts (Ehn, 1988), which has clearly been influenced by Alexan-
der's view of the architectural process. It also shares Alexander's irritating tendency to
give the uneasy impression that the project was not quite as successful as claimed. But
nevertheless I think these books of Alexander's should be required reading, particularly for
those who like to acknowledge the influence of A Pattern Language. Perhaps The Timeless
Way of Building and The Production of Houses will come to have the same influence on the
new breed of requirements engineers as A Pattern Language has had on software engineers.
That would be a good next stage of development for requirements engineering to go through.
If there is something about the architectural process that somehow embodies the human
spirit, then there is something about the architectural product that embodies the human in-
tellect. It sometimes seems as if computers have taken over almost every aspect of human
intellectual endeavour, from flying aeroplanes to painting pictures. Where is it all going to
end—indeed will it ever end? Is there anything that they can't do?
Well of course there is, and their limitations are provocatively explored in Hubert Dreyfus'
famous book What Computers Can't Do (Dreyfus, 1979), which is my third selection.
For those who have yet to read this book, it is an enquiry into the basic philosophical
presuppositions of the artificial intelligence domain. It raises some searching questions
about the nature and use of intelligence in our society. It is also a reaction against some of
the more exaggerated claims of proponents of artificial intelligence, claims which, however
they may deserve respect for their usefulness and authority, have not been found agreeable
to experience (as Gibbon remarked about the early Christian belief in the nearness of the
end of the world).
Now it is too easy, and perhaps a bit unfair, to tease the AI community with some of the
sillier sayings of their founders. Part of the promotion of any new discipline must involve
a certain amount of overselling (look at that great engineer Brunei, for example). I do not
wish to engage in that debate again here, but it is worth remembering that some famous
names in software engineering have, on occasion, said things which perhaps they now wish
they had not said. It would be very easy to write a book which does for software engineering
what What Computers Can't Do did for artificial intelligence: raise a few deep issues, upset
a lot of people, remind us all that when we cease to think about something we start to
say stupid things and make unwarranted claims. It might be harder to do it with Dreyfus'
panache, rhetoric, and philosophic understanding. I do find with What Computers Can't
Do, though, that the rhetoric gets in the way a bit. A bit more dialectic would not come
amiss. But the book is splendid reading.
Looking again at the first three books I have chosen, I note that all of them deal with
the human and not the technical side of software capabilities, design and architecture. One
of the great developments in software engineering came when it was realised and accepted
that the creation of software was a branch of mathematics, with mathematical notions of
logic and proof. The notion of proof is a particularly interesting one when it is applied
to software, since it is remarkable how shallow and uninteresting the theorems and proofs
about the behaviour of programs usually are. Where are the new concepts that make for
great advances in mathematical proofs?
The best book I know that explores the nature of proof is Imre Lakatos' Proofs and
Refutations (Lakatos, 1976) (subtitled The Logic of Mathematical Discovery—making the

point that proofs and refutations lead to discoveries, all very Hegelian). This surely is a
deathless work which so cleverly explores the nature of proof, the role of counterexamples
in producing new proofs by redefining concepts, and the role of formalism in convincing
a mathematician. In a way, it describes the history of mathematical proof in the way
that To Engineer is Human describes the history of engineering (build it; oh dear, it's
fallen down; build it again, but better this time). What makes Proofs and Refutations
so memorable is its cleverness, its intellectual fun, its wit. But the theorem discussed is
just an ordinary invariant theorem (Euler's formula relating vertices, edges and faces of a polyhedron: V - E + F = 2), and its proof is hardly a deep one, either. But Lakatos makes
all sorts of deep discussion come out of this simple example: the role of formalism in the
advancement of understanding, the relationship between the certainty of a formal proof and
the meaning of the denotational terms in the proof, the process of concept formation. To
the extent that software engineering is a branch of mathematics, the discussion of the nature
of mathematics (and there is no better discussion anywhere) is of relevance to software
engineers.
Mathematics is not, of course, the only discipline of relevance to software engineering.
Since computer systems have to take their place in the world of people, they have to respect
that social world. I have lots of books on that topic on my bookshelf, and the one that
currently I like the best is Computers in Context by Dahlbom and Mathiassen (1993), but
it is not the one that I would choose to take to my desert island. Instead, I would prefer
to be accompanied by Women, Fire and Dangerous Things by Lakoff (1987). The subtitle
of this book is What Categories Reveal about the Mind. The title comes from the fact
that in the Dyirbal language of Australia, the words for women, fire, and dangerous things
are all placed in one category, but not because women are considered fiery or dangerous.
Any object-oriented software engineer should, of course, be intensely interested in how
people do categorise things and what the attributes are that are common to each category
(since this will form the basis of the object model and schema). I find very little in my
books on object-oriented requirements and design that tells me how to do this, except that
many books tell me it is not easy and requires a lot of understanding of the subject domain,
something which I know already but which lacks the concreteness of practical guidance. What
Lakoff's book does is to tell you what the basis of linguistic categorisation actually is. (But
I'm not going to tell you; my aim is to get you to read this book as well.) With George
Lakoff telling you about the linguistic basis for object classification and Christopher
Alexander telling you about how to go about finding out what a person or organisation's
object classification is, you are beginning to get enough knowledge to design a computer
system for them.
However, you should be aware that the Lakoff book contains fundamental criticisms of
the objectivist stance, which believes that meaning is a matter of truth and reference (i.e.,
that it concerns the relationship between symbols and things in the world) and that there is
a single correct way of understanding what is and what is not true. There is some debate
about the objectivist stance and its relation to software (see the recent book Information
Systems Development and Data Modelling by Hirschheim, Klein and Lyytinen (1995) for a
fair discussion), but most software engineers seem reluctant to countenance any alternative
view. Perhaps this is because the task of empowering people to construct their own reality,
which is what all my chosen books so far are about, is seen as a task not fit, too subversive,
for any decently engineered software to engage in. (Or maybe it is just too hard.)
My final choice goes against my self-denying ordinance not to make fun of the artificial
intelligentsia. It is the funniest novel about computers ever written, and one of the great
classics of comedy literature: The Tin Men by Frayn (1965). For those who appreciate such
things, it also contains (in its last chapter) the best and most humorous use of self-reference
ever published, though you have to read the whole book to get the most enjoyment out of
it. For a book which was written more than thirty years ago, it still seems very pointed,
hardly dated at all. I know of some institutions that claim as a matter of pride to have been
the original for the fictitious William Morris Institute of Automation Research (a stroke of
inspiration there!). They still could be; the technology may have been updated but the same
individual types are still there, and the same meretricious research perhaps—constructing
machines to invent the news in the newspapers, to write bonkbusters, to do good and say
their prayers, to play all the world's sport and watch it—while the management gets on
with more stimulating and demanding tasks, such as organising the official visit which the
Queen is making to the Institute to open the new wing.
So there it is. I have tried to select a representative picture of engineering design,
of the architecture of software artifacts, of the limitations and powers of mathematical
formalisation of software, of the language software embodies and of the institutions in
which software research is carried out. Together they say something about my view, not
so much of the technical detail of software engineering, but of the historical, architectural,
intellectual and linguistic context in which it takes place. So although none of these books
is about software engineering, all are relevant since they show that what is true of our
discipline is true of other disciplines also, and therefore we can learn from them and use
their paradigms as our own.
There are many other books from other disciplines of relevance to computing that I
am particularly sorry to leave behind, Wassily Kandinsky's book Point and Line to Plane
(Kandinsky, 1979) (which attempts to codify the rules of artistic composition) perhaps the
most. Now for my next trip to a desert island, I would like to take, in addition to the
Kandinsky, [that's enough books, Ed.].

Note

1. Jerusalem, Part III, plate 55.

References

Alexander, C. 1979. The Timeless Way of Building. New York: Oxford University Press.
Alexander, C., Ishikawa, S., and Silverstein, M. 1977. A Pattern Language. New York: Oxford University Press.
Alexander, C., Martinez, J., and Comer, D. 1985. The Production of Houses. New York: Oxford University Press.
Alexander, C., Silverstein, M., Angel, S., Ishikawa, S., and Abrams, D. 1975. The Oregon Experiment. New York:
Oxford University Press.
Dahlbom, B. and Mathiassen, L. 1993. Computers in Context. Cambridge, MA and Oxford, UK: NCC Blackwell.
Dreyfus, H.L. 1979. What Computers Can't Do (revised edition). New York: Harper & Row.
Ehn, P. 1988. Work-Oriented Design of Computer Artifacts. Stockholm: Arbetslivscentrum (ISBN 91-86158-45-7).
Frayn, M. 1965. The Tin Men. London: Collins (republished by Penguin Books, 1995).
Hirschheim, R., Klein, H.K., and Lyytinen, K. 1995. Information Systems Development and Data Modelling.
Cambridge University Press.
Kandinsky, W. 1979. Point and Line to Plane. Trans. H. Dearstyne and H. Rebay (Eds.). New York: Dover
(originally published 1926, in German).
Lakatos, I. 1976. Proofs and Refutations. J. Worrall and E. Zahar (Eds.), Cambridge University Press.
Lakoff, G. 1987. Women, Fire, and Dangerous Things: What Categories Reveal about the Mind. University of
Chicago Press.
Petroski, H. 1985. To Engineer is Human. New York: St. Martin's Press.
Automated Software Engineering
An International Journal

Instructions for Authors


Authors are encouraged to submit high quality, original work that has neither appeared in,
nor is under consideration by, other journals.

PROCESS FOR SUBMISSION


1. Authors should submit five hard copies of their final manuscript to:
Mrs. Judith A. Kemp
AUTOMATED SOFTWARE ENGINEERING
Editorial Office
Kluwer Academic Publishers Tel.: 617-871-6300
101 Philip Drive FAX: 617-871-6528
Norwell, MA 02061 E-mail: jkemp@wkap.com
2. Authors are strongly encouraged to use Kluwer's LaTeX journal style file. Please see
ELECTRONIC SUBMISSION Section below.
3. Enclose with each manuscript, on a separate page, from three to five key words.
4. Enclose originals for the illustrations, in the style described below, for one copy of the
manuscript. Photocopies of the figures may accompany the remaining copies of the
manuscript. Alternatively, original illustrations may be submitted after the paper has
been accepted.
5. Enclose a separate page giving the preferred address of the contact author for correspon-
dence and return of proofs. Please include a telephone number, fax number and email
address, if available.
6. If possible, send an electronic mail message to <jkkluwer@world.std.com> at the time
your manuscript is submitted, including the title, the names of the authors, and an
abstract. This will help the journal expedite the refereeing of the manuscript.
7. The refereeing is done by anonymous reviewers.

STYLE FOR MANUSCRIPT


1. Typeset, double or 1½ space; use one side of sheet only (laser printed, typewritten, and
good quality duplication acceptable).
2. Use an informative title for the paper and include an abstract of 100 to 250 words at the
head of the manuscript. The abstracts should be a carefully worded description of the
problem addressed, the key ideas introduced, and the results. Abstracts will be printed
with the article.
3. Provide a separate double-space sheet listing all footnotes, beginning with "Affiliation
of author" and continuing with numbered references. Acknowledgment of financial
support may be given if appropriate.
4. References should appear in a separate bibliography at the end of the paper in alphabetical
order with items referred to in the text by author and date of publication in parentheses,
e.g., (Marr, 1982). References should be complete, in the following style:
Style for papers: Authors, last names followed by first initials, year of publication,
title, volume, inclusive page numbers.
Style for books: Authors, year of publication, title, publisher and location, chapter
and page numbers (if desired).
Examples as follows:
(Book) Marr, D. 1982. Vision, a Computational Investigation into the Human Repre-
sentation & Processing of Visual Information. San Francisco: Freeman.
(Journal) Rosenfeld, A. and Thurston, M. 1971. Edge and curve detection for visual
scene analysis. IEEE Trans. Comput., C-20:562-569.
(Conference Proceedings) Witkin, A. 1983. Scale-space filtering. Proc. Int. Joint
Conf. Artif. Intell., Karlsruhe, West Germany, pp. 1019-1021.
(Lab. memo) Yuille, A.L. and Poggio, T. 1983. Scaling theorems for zero crossings.
M.I.T. Artif. Intell. Lab., Massachusetts Inst. Technol., Cambridge, MA, A.I. Memo
722.
5. Type or mark mathematical copy exactly as it should appear in print. Journal style
for letter symbols is as follows: variables, italic type (indicated by underline); constants,
roman text type; matrices and vectors, boldface type (indicated by wavy underline). In
word-processor manuscripts, use appropriate typeface. It will be assumed that letters
in displayed equations are to be set in italic type unless you mark them otherwise. All
letter symbols in text discussion must be marked if they should be italic or boldface.
Indicate best breaks for equations in case they will not fit on one line.
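
By way of illustration (the following minimal LaTeX sketch is mine and is not part of the journal's instructions), the conventions above could be realised in a LaTeX source roughly as follows:

    % Minimal sketch, not from the journal's instructions: variables italic,
    % constants roman, matrices and vectors boldface, as described above.
    \documentclass{article}
    \begin{document}
    The system is $\mathbf{A}\mathbf{x} = \mathrm{c}\,\mathbf{b}$, where $\mathbf{A}$ is a
    matrix and $\mathbf{x}$, $\mathbf{b}$ are vectors (boldface), $\mathrm{c}$ is a constant
    (roman type), and a variable such as $t$ keeps the italic type that math mode
    applies by default.
    \end{document}

In a typewritten manuscript the same distinctions would instead be marked with straight underlines (italic) and wavy underlines (boldface), as described above.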

ELECTRONIC SUBMISSION PROCEDURE


Upon acceptance for publication, the preferred format of submission is the Kluwer LaTeX
journal style file. The style file may be accessed through a gopher site by means of the
following commands:
Internet: gopher gopher.wkap.nl or (IP number 192.87.90.1)
WWW URL: gopher://gopher.wkap.nl
- Submitting and Author Instructions
- Submitting to a Journal
- Choose Journal Discipline
- Choose Journal Listing
- Submitting Camera Ready

Authors are encouraged to read the "About this menu" file.

If you do not have access to gopher or have questions, please send e-mail to:
srumsey@wkap.com
The Kluwer LaTeX journal style file is the preferred format, and we urge all authors to use
this style for existing and future papers; however, we will accept other common formats
(e.g., WordPerfect or Microsoft Word) as well as ASCII (text only) files. Also, we accept
FrameMaker documents as "text only" files. Note, it is also helpful to supply both the
source and ASCII files of a paper. Please submit PostScript files for figures as well as
separate, original figures in camera-ready form. A PostScript figure file should be named
after its figure number, e.g., fig1.eps or circle1.eps.

ELECTRONIC DELIVERY

IMPORTANT - Hard copy of the ACCEPTED paper (along with separate, original figures
in camera-ready form) should still be mailed to the appropriate Kluwer department. The
hard copy must match the electronic version, and any changes made to the hard copy must
be incorporated into the electronic version.
Via electronic mail
1. Please e-mail ACCEPTED, FINAL paper to

KAPfiles@wkap.com

2. Recommended formats for sending files via e-mail:


a. Binary files - uuencode or binhex
b. Compressing files - compress, pkzip, gunzip
c. Collecting files - tar
3. The e-mail message should include the author's last name, the name of the journal to
which the paper has been accepted, and the type of file (e.g., LaTeX or ASCII).

Via disk
1. Label a 3.5 inch floppy disk with the operating system and word processing program
(e.g., DOS/WordPerfect 5.0) along with the authors' names, manuscript title, and name
of journal to which the paper has been accepted.
2. Mail disk to
Kluwer Academic Publishers
Desktop Department
101 Philip Drive
Assinippi Park
Norwell, MA 02061

Any questions about the above procedures please send e-mail to:

srumsey@wkap.com
STYLE FOR ILLUSTRATIONS
1. Originals for illustrations should be sharp, noise-free, and of good contrast. We regret
that we cannot provide drafting or art service.
2. Line drawings should be in laser printer output or in India ink on paper, or board. Use 8½
by 11-inch (22 × 29 cm) size sheets if possible, to simplify handling of the manuscript.
3. Each figure should be mentioned in the text and numbered consecutively using Arabic
numerals. In one of your copies, which you should clearly distinguish, specify the desired
location of each figure in the text but place the original figure itself on a separate page.
In the remainder of copies, which will be read by the reviewers, include the illustration
itself at the relevant place in the text.
4. Number each table consecutively using Arabic numerals. Please label any material that
can be typeset as a table, reserving the term "figure" for material that has been drawn.
Specify the desired location of each table in the text, but place the table itself on a
separate page following the text. Type a brief title above each table.
5. All lettering should be large enough to permit legible reduction.
6. Photographs should be glossy prints, of good contrast and gradation, and any reasonable
size.
7. Number each original on the back.
8. Provide a separate sheet listing all figure captions, in proper style for the typesetter, e.g.,
"Fig. 3. Examples of the fault coverage of random vectors in (a) combinational and (b)
sequential circuits."

PROOFING
Page proofs for articles to be included in a journal issue will be sent to the contact author
for proofing, unless otherwise informed. The proofread copy should be received back by
the Publisher within 72 hours.

COPYRIGHT
It is the policy of Kluwer Academic Publishers to own the copyright of all contributions it
publishes. To comply with the U.S. Copyright Law, authors are required to sign a copyright
transfer form before publication. This form returns to authors and their employers full
rights to reuse their material for their own purposes. Authors must submit a signed copy of
this form with their manuscript.

REPRINTS
Each group of authors will be entitled to 50 free reprints of their paper.
REVERSE ENGINEERING brings together in one place important
contributions and up-to-date research results in this important
area.
REVERSE ENGINEERING serves as an excellent reference,
providing insight into some of the most important research issues
in the field.

ISBN 0-7923-9756-8
