Efficient Program Tracing: James R. Larus, University of Wisconsin-Madison

A program trace lists the addresses of instructions executed and data referenced when a program runs. Detailed program traces support many simulations used in computer science and engineering, for example, in the design of processor instruction sets and memory systems, the study of storage reclamation and virtual memory page-replacement algorithms, and the analysis of input to parallelizing compilers. In the past, collecting detailed program traces was extremely costly, and the few existing large trace files were carefully preserved and passed like ancient manuscripts among researchers. New tracing techniques based on compiler-style analysis now make it possible for anyone to trace a program simply and economically.

This article reviews the historical difficulties and technical approaches to capturing detailed program traces, then focuses on the new techniques and results of efficient program tracing.

One preliminary note: Program tracing should not be confused with program profiling. The latter measures or samples the execution frequency of program statements; it ignores data references, recording only an aggregate count of the number of times each statement executes. Tracing, by contrast, obtains a complete listing of instruction or data-reference addresses.
New techniques reduce the high costs of recording program trace data and storing trace files, making it easy to obtain detailed listings of how a program executes.

Background
The difficulties in obtaining a complete program trace stem from the high cost of recording every instruction and data address as the application program executes and from the large size of the resulting trace files. Straightforward but inefficient tracing systems examine every instruction as a program executes, but this approach slows the program to a crawl. The tracing overhead can be reduced by modifying either the computer hardware or the application software to record addresses. Although these modifications can significantly improve tracing performance, the overhead nevertheless remains too high for many applications, for example, those that involve real-time interaction with the outside world. Furthermore, neither of these approaches deals with the equally important issue of trace file size. A 10-million-instruction-per-second (MIPS) processor produces up to 70 megabytes of trace per second of execution. At this rate, relatively short executions quickly fill available disks. One approach to reducing trace length first computes the difference between adjacent addresses (to increase their regularity)
0018-9162/93/0500-0052$03.00 © 1993 IEEE
COMPUTER
and then compresses the resulting file with standard compressing utilities. This process reduces trace size by a factor of 10, but the resulting traces still consume 7 megabytes per CPU second. Many tracing systems avoid storage problems entirely by sending a trace directly to a consumer process and never writing it to a file. However, this approach, called on-line tracing, makes it difficult to share traces or to obtain repeatable results from programs that exhibit indeterminate behavior or that interact with the outside world. These problems are illustrated in the following discussions of several well-known tracing systems. These systems represent the major approaches to program tracing. The sidebar presents a taxonomy for classifying tracing systems, and Figure 1 presents a high-level schematic of a generic system.

Figure 1. A tracing system captures the sequence of instructions and data references that occur during the execution of an application program. Another program, the trace consumer, uses the trace to simulate the program behavior under different hardware or software policies.

... collection process, decoding the instruction to determine its memory references, and returning to the traced process slows a program by more than 1,000 times. The system processes 2,350 instructions per second when tracing only instruction, not data, memory references.

Tracer is another early tracing system. It captures both instructions and data references for the VAX family of
Tracing taxonomy
Program-tracing systems, although built for different applications and computers and with different techniques, share many common features. The following taxonomy describes five dimensions for comparing these systems.

Number of processors. Most systems trace programs running on a uniprocessor. Some systems, however, can trace a parallel program running on a multiprocessor.1 Parallelism complicates tracing by increasing the volume of data that must be recorded, by introducing uncertainty into the ordering of instruction and memory references between processes, and by allowing programs with indeterminacies that are affected by tracing.

Number of processes. A uniprocessor may run several programs concurrently (multitasking). Some systems can trace all processes running on a computer, sometimes even including the operating system. The value of a multitasking trace depends on its intended application. Simulations of hardware features shared by multiple tasks, such as cache and physical memory, need multitasking traces to model the implications of sharing. Other simulations do not require multiple processes, for example, simulations of resources such as an instruction set. Tracing a multitasking system is easier than tracing a multiprocessor, since a uniprocessor executes only one program at a time and the transitions between processes are relatively infrequent and well defined.

Intrusiveness. Intrusive tracing perturbs a program's execution in some manner, usually by slowing it down. Deterministic programs that do not interact with their environment are unaffected by this perturbation, as long as the tracing overhead does not prevent the program from completing its execution. However, large tracing overhead affects the behavior of other programs. For example, it can change the execution sequence of instructions in programs with indeterminate behavior or it can adversely affect a program's interactions with the outside world. Imagine, for example, the difficulty of using an editor or window system if it suddenly ran 10 times slower. Totally nonintrusive tracing requires modifying a processor's hardware so that it collects addresses without affecting program execution. This approach to tracing is difficult or impossible for many computers, and it is intrusive in another sense, since it requires a computer to be physically modified.

Trace information. The information collected by a tracing system depends on the requirements of a trace consumer and on the technique used to collect the data. The simplest systems list only instruction and data references. Others include semantic information; for example, they identify an instruction as the first one executed in a loop body or as part of a source-language statement.

On line/off line. In off-line tracing systems, information is written to a file as a traced program executes. This file is later read by trace-consuming applications. In on-line tracing systems, the traced program and the trace consumer run concurrently and the information is sent between them. The primary advantage of this approach is that it eliminates the need to save large trace files. However, it can greatly increase the effect of tracing on the traced program. Trace-consuming applications typically read traces 10 (or more) times slower than the rate at which a trace can be produced. Under these conditions, the traced process's overhead does not greatly affect the trace consumer. On the other hand, the consumer greatly slows the traced program.

Reference
1. C.B. Stunkel, B. Janssens, and W.K. Fuchs, Address Tracing for Parallel Machines, Computer, Vol. 24, No. 1, Jan. 1991, pp. 31-45.
May 1993
machines running Unix.5 Tracer steps through each instruction in a program by running the program under the control of a small tracing routine. This code examines each instruction immediately before it executes and records its address and data references. Tracer then restores the program's state and executes state-modifying instructions directly on the hardware. On a VAX-11/780, Tracer could run 350 instructions per second, for a slowdown factor of 1,500.

Hardware modification. The ATUM tracing system modified the microcode on a VAX to record all instruction and data references in a reserved portion of memory.6 Unlike the software systems, ATUM could trace all processes and the operating system running on a computer. The ATUM approach, although a huge improvement over previous techniques, still had substantial overhead. It slowed a program by a factor of 10, not including the time to write trace data to a file for off-line processing. (Many papers on tracing systems do not report the cost of writing a trace to disk. In my experience with highly compressed traces, this cost is generally as large as the cost of collecting the trace, so the partial overhead figures need to be doubled, at a minimum, to reflect the actual effect of tracing on a program.)

Although attractive in many respects, the ATUM approach is impossible to replicate on systems that lack modifiable microcode. For example, today's RISC processors do not contain microcode and the microcode in other single-chip microprocessors is inaccessible.

Software modification. Trapeds is an on-line tracing system that uses software modification to trace programs running on an iPSC/2 distributed-memory multicomputer. Trapeds modifies the assembly-language output of a compiler by inserting a call to a monitoring routine at the beginning of each basic block (that is, each straight-line sequence of machine instructions that execute sequentially, without jumps into or out of the sequence). The monitoring routine uses information previously collected by Trapeds about the instructions in the block to dynamically compute their memory references. The addressing information is passed directly to a trace consumer. The static information collected by Trapeds includes each instruction's type, its memory accessing modes, and any constants used in the address calculations. Software modification is the compiled equivalent of the interpretive process used in Tracer. Although Trapeds' modifications greatly increase program size, the on-line collection slowed program execution by only a factor of 30.

Borg et al. developed a tracing system for an experimental RISC computer (Titan) that uses software modification to trace all processes, including the operating system running on a uniprocessor.8 This tracing system uses a special program linker to modify a program's object code by adding tracing code before each load and store instruction. In addition, this system adds code at the beginning of each basic block to record the block's memory location and size, from which the addresses of its instructions can be derived. All traced processes, including a specially modified kernel, write their information into a single buffer, which accurately reflects the interleaving of multiple tasks. This system is used for off-line tracing by writing the buffer to a file, and for on-line tracing by running an analysis process when the buffer fills. On-line tracing without analysis slows a traced program by 8-12 times.

MPTrace is an off-line tracing system for a Sequent shared-memory multiprocessor.9 MPTrace uses simple program analysis to reduce tracing overhead by recording a program's control flow only upon entry to superblocks, which are single-entry, multiple-exit regions of instructions. MPTrace also recognizes special types of memory accesses, such as references through a frame pointer to local variables. Because a frame pointer typically is set upon subroutine entry and does not change, its value can be recorded and the saved value used to compute the address of each variable reference in the routine. MPTrace records a separate address trace for each process running on a Sequent. It does not attempt to interleave or order the streams. MPTrace slows the traced program by 2-3 times, not including the cost of writing trace files.

Abstract execution and optimal control tracing reduce file size by a factor up to 250 times.
Efficient tracing systems. I have developed two program-tracing systems that reduce both the tracing time overhead and the trace file size to levels that permit effective tracing of long-running or interactive programs. These systems use a variety of compiler-based techniques. The first system, called AE, demonstrated a technique called abstract execution,1 which greatly reduces the cost of tracing. The second system, qpt, uses abstract execution along with another technique, optimal control tracing, to further reduce the cost of tracing. These efficient tracing techniques reduce the time overhead to a fraction of a program's execution cost and the file size by a factor up to 250 times. (The next section describes these techniques in detail.)

Both systems perform on- or off-line, single-process tracing. AE is based on the Gnu C compiler, so it is limited to tracing C programs compiled by that compiler. In contrast, qpt inserts tracing code in an executable file and consequently works for programs written in many languages and compiled by many compilers. In addition to recording memory accesses, both systems annotate traces with a large amount of semantic information that identifies loops, function calls, and memory allocation. Their tracing overhead, including writing the highly compressed trace files to disk, is 1-12 times a program's untraced execution cost.
Techniques for efficient tracing systems

Abstract execution and optimal program tracing divide program tracing into two steps, as shown in Figure 2. In the first step, a compiler-style analysis of a program helps the tracing system reduce the quantity of information that must be collected during the program's execution. This analysis identifies a small subset of a trace that suffices to reproduce the full trace. Only events in this subset, called the trace record, must be recorded while the program runs. In the second step, a process called trace regeneration produces the full program trace from the trace record. The process regenerates a trace on demand by using an automatically produced regeneration routine that is linked into a trace-consuming application. There is never a need to produce or store the full program trace.

Abstract execution operates by breaking tracing into two distinct tasks: control-flow tracing and data-reference tracing.
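As a rough illustration of this two-step split for control flow, the following Python sketch records a block identifier only when control reaches the target of a conditional branch, then regenerates the full block sequence from that record. The control-flow graph, the block numbering, and the function names are hypothetical and chosen only for illustration; they are not taken from AE or qpt.

```python
# Hypothetical control-flow graph: block id -> list of successor block ids.
# Blocks 1 and 2 end in a conditional branch, so each has two successors.
CFG = {
    0: [1],        # entry: initialize the loop variable
    1: [2, 6],     # loop test: enter the body or leave the loop
    2: [3, 4],     # if/else test
    3: [5],        # "then" arm
    4: [5],        # "else" arm
    5: [1],        # increment, then back to the loop test
    6: [],         # exit
}
ENTRY = 0


def record_trace(full_trace):
    """Step 1: keep only blocks entered from a conditional branch.

    A real tracing system does this with code inserted into the traced
    program; filtering a full trace here just keeps the sketch short.
    """
    record = []
    for prev, block in zip(full_trace, full_trace[1:]):
        if len(CFG[prev]) > 1:
            record.append(block)
    return record


def regenerate(record):
    """Step 2: rebuild the full block sequence from the trace record."""
    witnesses = iter(record)
    block = ENTRY
    trace = [block]
    while CFG[block]:
        successors = CFG[block]
        if len(successors) == 1:
            block = successors[0]          # only one way to go
        else:
            block = next(witnesses)        # recorded branch target
            if block not in successors:
                raise ValueError("trace record does not match the graph")
        trace.append(block)
    return trace


# Two loop iterations: the first takes the "then" arm, the second the "else" arm.
full = [0, 1, 2, 3, 5, 1, 2, 4, 5, 1, 6]
short = record_trace(full)
```

In this sketch, calling `regenerate(short)` recovers all 11 executed blocks from the 5 recorded identifiers.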
[Figure 2 diagram: application, program tracing system, trace regeneration routine, trace consumer]
Figure 2. Tracing systems based on abstract execution break tracing into two steps. In the first, the traced program writes a trace record, which is a minimal amount of information about a program's execution. In the second step, a trace regeneration routine uses this record to reproduce a full program trace.

Control-flow tracing. A control-flow trace records the sequence of instructions executed when a program runs. For efficiency, tracing systems usually record basic-block, not instruction, execution. This works because execution of the first instruction in a basic block causes all other instructions in the block to execute. (An exception is when a block contains a call on code that terminates the program, such as the exit system call, or executes a nonlocal goto, such as Unix's setjmp/longjmp. Because these cases can be detected by static program analysis, they are easily handled in the manner discussed below. By contrast, abnormal program termination can only be handled with the assistance of the operating system.)

Control flow is conveniently represented by a common compiler data structure called a control-flow graph, which is a general graph whose nodes represent basic blocks and whose edges represent possible control flow between blocks. For example, Figure 3 contains a small piece of code and its control-flow graph.

Figure 3. A control-flow graph consists of basic blocks (rectangular boxes) that contain a straight-line sequence of instructions, and edges that represent possible control flow between the blocks. A block with multiple outgoing edges ends in a conditional branch. Each block has an identifier, represented by the circled numbers.

    for i := 0 to 10 do
      if (i > 5) then
        a := a + i;
      else
        b := b + i;

There are many ways to record a program's control flow. The simplest is to assign each block a unique identifier and to append the identifier to the trace record every time the block executes. This approach requires prepending a small amount of code (5-10 RISC-style instructions) to every basic block. Unfortunately, many nonnumeric programs have very small basic blocks (2-5 instructions are common), so the tracing code greatly increases both a program's size and execution cost. Numeric programs, which have larger blocks containing hundreds of instructions, are affected less by tracing code.

A better way to capture control flow is to record only transitions between blocks where a program chooses among alternative paths. For example, in Figure 3, control leaving block 4 or 5 always goes to block 6. If a tracing system records that control reached block 4 or 5, it need not record the execution of block 6. The tracing system AE operates this way. It writes out an identifier in each block that is the target of a conditional branch.

After tracing, a trace regeneration routine uses this sequence of identifiers, in conjunction with a program's control-flow graph, to reproduce the program's entire control-flow trace. The regeneration routine starts at the program's entry node and traces through its control-flow graph. At each point where control divides, the regeneration routine reads the next block identifier from the trace record and uses it to determine which arc to follow in the control-flow graph.

Several types of instructions complicate this process. Subroutine calls and returns transfer control between procedures. Control-flow graphs are typically constructed on a procedure-by-procedure basis, so a call requires the trace-regeneration routine to consult another routine's control-flow graph. If the call is an ordinary one, in which the invoked routine's address is part of the call instruction, the trace-regeneration program can easily find the appropriate control-flow graph. On the other hand, if the call is indirect, in which the callee's address is a data value, the address of the invoked routine must be traced by whatever mechanism is used for tracing data references. In addition, a call may invoke a nonlocal goto, for example, C's setjmp/longjmp. A tracing system must record additional information at calls on these routines so that the trace-regeneration program can find the target address at a nonlocal goto and transfer control to the appropriate routine's control-flow graph.

Although recording the outcome of conditional branches works very well, another technique allows a tracing system to record still less information about conditional branches without affecting its ability to reproduce the instruction trace. Ball and Larus developed this technique, called optimal control tracing, which reduces the points at which control-flow information is collected to a minimum. Figure 3 illustrates the basic idea of this technique. Every time the program iterates through the loop, either block 4 or block 5 executes. These blocks' identifiers serve as an indication of another loop iteration, thereby eliminating the need to record the outcome of block 2. When the trace-regeneration routine reaches block 2, it looks at the next block identifier in the trace record. If the identifier is 4 or 5, control passes to block 3. If the regeneration routine sees another identifier, it quits the loop.

More generally, Ball and Larus show how to insert less tracing code by constructing a spanning tree of a control-flow graph and tracing only the edges that are not in the tree. A spanning tree is a tree that connects all nodes in a graph. Each edge in the tree is either an
Optimal control tracing records a minimum amount of information about conditional branches.
Figure 4. The spanning tree for the control-flow graph from Figure 3 includes all heavy edges. The dashed edges are not in the tree and must be traced.
edge from the graph or the reverse of a graph edge. For example, Figure 4 shows a spanning tree for the control-flow graph in Figure 3. The edges from blocks 4 and 5 to block 6 are not in the spanning tree and consequently must be traced.

A graph typically contains many spanning trees. If a tracing system assigns execution frequencies to edges, it can further reduce the number of events that are actually recorded as a program runs. The execution frequencies come either from a heuristic that examines the control-flow graph or from measurements of an earlier program execution. In the former case, Ball and Larus demonstrate that a simple heuristic (specifically, one that assigns equal weight to both outcomes of a conditional branch and assumes that each loop executes 10 times) performs very well. After assigning likely execution frequencies to edges, a tracing system computes a maximum spanning tree. This is a spanning tree whose edges have the largest weighted sum of any spanning tree of a graph. In this application, the maximum spanning tree includes the edges most likely to execute. Consequently, the traced edges, which are not part of the spanning tree, are the edges least likely to execute. Although this technique does not reduce the amount of tracing code, it places the code at points where it is unlikely to execute and increase the cost of tracing.

Note that this technique requires tracing edges, not blocks, in a control-flow graph. As a practical matter, tracing edges is only slightly more difficult than tracing blocks, but it reduces the number of events that must be recorded during a program's execution by a factor of 1.2-3.0 times over the approach used in AE.

As an aside, if the space required to store a trace is the primary concern, a tracing system can do better than optimal control tracing by recording the outcome of every conditional branch with a single bit. Since AE uses a byte or halfword to record a block identifier, using a single bit reduces the space by a factor of 8 or more, which is a factor of 3-7 times smaller than the sequence of edge identifiers (bytes) produced by optimal control tracing. Although the code to manipulate bits, instead of bytes, is slightly slower, the reduction in the amount of information that must be written to the disk frequently more than compensates, so that the traced program actually runs faster. Unfortunately, bit tracing is difficult to combine with data-reference tracing and requires a traced program to produce two separate trace files.
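The choice of traced edges can be sketched as a maximum-spanning-tree computation. The Python sketch below uses Kruskal's algorithm with a union-find structure; the article does not say which spanning-tree algorithm Ball and Larus use, so that choice is an assumption. The graph (loosely modeled on the loop in Figure 3, with a hypothetical exit block 7) and the edge weights (derived from the heuristic of 10 loop iterations and equally likely branch outcomes) are assumptions as well, and the sketch omits other details of the published technique.

```python
def traced_edges(num_blocks, weighted_edges):
    """Return the edges left out of a maximum spanning tree.

    weighted_edges is a list of (source, target, weight) tuples. Direction
    is ignored when building the tree, since a tree edge may be a graph
    edge or its reverse.
    """
    parent = list(range(num_blocks + 1))   # blocks are numbered from 1

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    traced = []
    # Consider the most frequently executed edges first. The sort is stable,
    # so edges of equal weight keep their listing order.
    for src, dst, _weight in sorted(weighted_edges, key=lambda e: -e[2]):
        root_src, root_dst = find(src), find(dst)
        if root_src == root_dst:
            traced.append((src, dst))      # would close a cycle: trace it
        else:
            parent[root_src] = root_dst    # add the edge to the tree
    return traced


# Assumed weights: the loop runs 10 times, each branch outcome equally likely.
EDGES = [
    (1, 2, 1),    # entry -> loop test
    (2, 3, 10),   # loop test -> if test
    (2, 7, 1),    # loop test -> exit
    (3, 4, 5),    # if test -> "then" arm
    (3, 5, 5),    # if test -> "else" arm
    (4, 6, 5),    # "then" arm -> increment
    (5, 6, 5),    # "else" arm -> increment
    (6, 2, 10),   # increment -> loop test
]

result = traced_edges(7, EDGES)
```

With these assumed weights and this listing order, the two edges into block 6 are the ones left out of the tree, which agrees with the traced edges described for Figure 4. A different tie-breaking order among the weight-5 edges could select other edges.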
Data-reference tracing. Memory references to data arise either from explicit load or store instructions or from implicit references to memory-resident operands. In either case, abstract execution reduces the cost of determining the address used in a memory reference by identifying the set of instructions that compute the address and reexecuting these instructions while regenerating the trace. The set of instructions that compute a memory address is called its address slice. Figure 5 shows the address slice for a simple example. The address used in the array access is the sum of the address of array a and the array index variable i, scaled by the size of an array element. The shaded instructions in the figure compose the address slice for the array access.

A tracing system computes an address slice from a program's reaching definitions. Reaching-definition analysis is a standard compiler analysis that finds all assignments (definitions) whose value reaches a given use of a variable. The tracing system finds a load or store instruction's address slice by working back from the use of the address in the instruction to all definitions that compute the address. It then finds the other definitions that provide operands to these definitions. This process is repeated to find the other instructions that contribute to the address calculation. For example, in Figure 5, all three shaded instructions directly participate in the address calculation and are incorporated into the address slice.

A tracing system can divide the instructions in an address slice into three mutually exclusive categories:
• Easy instructions. These instructions compute a constant value that does not depend on any other value in the program. For example, in Figure 5, the instruction i := 0 is an easy instruction.
• Hard instructions. These instructions compute a value that depends on other values computed by a program. If it knew these values, a program-tracing system could recalculate a hard instruction's result. For example, in Figure 5, the instruction i := i + 1 is a hard instruction because its result depends on the previous value of i.
• Impossible instructions. These instructions compute a value that depends upon a program's state in a manner that is too complex to recalculate. Three types of instructions are impossible: load instructions, which bring a value from memory; function calls, which return the result of an arbitrary computation; and function entries, which receive the result from unknown computations as arguments. In Figure 5, the function entry defines the address of array a and is an impossible instruction.
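The backward walk over definitions and the three-way classification can be sketched as follows. This Python sketch assumes the reaching definitions have already been computed and are supplied as a table that maps each definition to the definitions feeding its operands. The table format, the definition names, and the `kind` labels are all hypothetical.

```python
# Hypothetical definitions for the function in Figure 5. Each entry gives the
# kind of instruction and the definitions that reach its operands.
DEFS = {
    "entry_a":  {"kind": "entry", "uses": []},                      # a arrives as an argument
    "i_init":   {"kind": "const", "uses": []},                      # i := 0
    "i_next":   {"kind": "arith", "uses": ["i_init", "i_next"]},    # i := i + 1
    "sum_init": {"kind": "const", "uses": []},                      # sum := 0
    "sum_next": {"kind": "arith", "uses": ["sum_init", "sum_next"]},  # sum := sum + a[i]
}

# Loads, calls, and function entries cannot be recalculated.
IMPOSSIBLE_KINDS = {"load", "call", "entry"}


def address_slice(address_defs):
    """Collect every definition that contributes to an address."""
    found, worklist = set(), list(address_defs)
    while worklist:
        definition = worklist.pop()
        if definition not in found:
            found.add(definition)
            worklist.extend(DEFS[definition]["uses"])
    return found


def classify(definition):
    """Label a definition in a slice as easy, hard, or impossible."""
    info = DEFS[definition]
    if info["kind"] in IMPOSSIBLE_KINDS:
        return "impossible"
    if info["kind"] == "const" and not info["uses"]:
        return "easy"
    return "hard"


# The address of a[i] is computed from the base of a and the index i.
slice_for_a_i = address_slice(["entry_a", "i_init", "i_next"])
labels = {d: classify(d) for d in sorted(slice_for_a_i)}
```

The definitions of sum never enter the slice because they produce data values rather than addresses, so a tracing system following this scheme would not need to reexecute them.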
    function sum (a: array)
    begin
      var i, sum : int;
      sum := 0;
      for i := 0 to length (a) do
        sum := sum + a[i];
      return (sum);
    end
Figure 5. The shaded instructions are the address slice for the memory reference in the array access. The function entry supplies the address of array a. The other instructions compute the value of the array index i.
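Regenerating the data addresses for the loop in Figure 5 might look like the sketch below, written in Python for brevity. It assumes a hypothetical trace-record layout: the first entry is the recorded value of the function-entry (impossible) instruction, that is, the address of a, followed by one control-flow entry per loop test ("cont" to run the body again, "exit" to leave). The 4-byte element size is also an assumption.

```python
def regenerate_addresses(trace_record, element_size=4):
    """Rebuild the addresses touched by a[i] from a minimal trace record."""
    record = iter(trace_record)

    base = next(record)        # impossible instruction: value read from the record
    i = 0                      # easy instruction: i := 0 is simply reexecuted
    addresses = []

    for outcome in record:     # control-flow entries decide how often the loop runs
        if outcome == "exit":
            break
        addresses.append(base + i * element_size)   # the memory reference a[i]
        i = i + 1              # hard instruction: recomputed from the previous i
    return addresses


# Hypothetical record for three loop iterations over an array at address 0x1000.
record = [0x1000, "cont", "cont", "cont", "exit"]
addrs = regenerate_addresses(record)
```

Only one address is stored in this record; the remaining addresses are recomputed from it, and the instructions that add up the array values are never reexecuted.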
These categories' names reflect the difficulty of reexecuting the instructions to determine their contribution to an address slice's computation. The trace-regeneration program simultaneously steps through the control-flow graph to regenerate the instruction trace and reexecutes these instructions to produce the address trace. Each time the regeneration procedure encounters an easy or hard instruction in an address slice, it reexecutes the instruction and modifies a copy of the program state. When the procedure encounters an impossible instruction, it consults the trace record to find the instruction's value, which must have been recorded during the program's execution. At a load or store instruction, the regeneration routine determines the referenced address by examining the program state and reproducing the instruction's address calculation.

This technique only requires a tracing system to record results from impossible instructions. Many address slices do not contain an impossible instruction. Other slices, like the one in Figure 5, execute the impossible instruction once, but execute the memory reference many times. Both factors greatly reduce the information that must be collected to trace data references.

How well does it work?

The measurements in this section demonstrate that efficient program tracing works for a wide variety of programs. The measurements were collected with the qpt tracing system described in the sidebar. Note that the performance of efficient tracing depends on regularity in a program's control-flow and memory-access patterns. Programs with few conditional branches that access array locations sequentially perform best. Nonnumeric programs, which have many conditional branches and nonlinear data structures such as trees and graphs, require recording more information.

Table 1 briefly describes the programs used to evaluate qpt. Most of the programs are nonnumeric, with irregular control-flow and memory-access patterns. The two exceptions are sgefa and dcgc, which are array-manipulating numeric programs. All measurements were collected on a DECstation 3100 (a 14-MIPS computer that contains a Mips R2000 processor) with 24 megabytes of main memory and a local disk.

Table 2 shows the programs' execution times with and without tracing. Program Time is the execution time (in seconds) for the untraced version of a program. Traced Time is the traced program's execution time. Time is split into user (u) and system (s) time. Figure 6 illustrates the cost of untraced execution and the cost of tracing, and breaks the latter into two components: the direct overhead of collecting trace information and the I/O cost to write this information to a disk file. The total (direct + I/O) overhead of
Table 1. Sample applications used to evaluate qpt.

Program Name   Application                                  Lines of Code
costscale      Min cost network flow                                1,580
dcgc           Preconditioned conjugate gradient package            1,166
eqntott*       Translate equation to truth table                    3,140
espresso*      Minimize Boolean functions                          14,838
gcc            Gnu C compiler                                     102,007
pdp            Connectionism simulator                              4,800
polyominoes    Poly dominoes game                                     555
sgefa          Gaussian elimination                                 1,220
xlisp*         Lisp interpreter                                     7,741

*From the SPEC benchmark suite
Table 2. Overhead of program tracing. The first column lists the application program's execution time (in seconds), broken down into user (u) and system (s) time. The second column lists the traced application's execution time.
The qpt tracing system

The qpt program-tracing system is a second-generation system developed by the author. The qpt system greatly improves the efficiency and ease of use of its predecessor, AE. It uses both abstract execution1 and optimal control tracing2 to trace all code in a program's executable (a.out) file. Figure A illustrates its tracing process: qpt adds tracing code to a program's executable file and produces the traced application (foo.qpt) and a trace-regeneration program (foo.sma.c). The latter program is linked with an application program (consume.c) that uses the trace information.

To illustrate this process, consider the small C program in Figure B and the Mips R2000 assembly code for its routine sub in Figure C. The routine contains three basic blocks, the first and last of which contain a single instruction. The sw instruction is the only memory reference; it writes to the address contained in register r2. The shaded instructions are the address slice, which also includes the first argument (x) to the routine sub, which is contained in register r4.

In addition to inserting control-flow and address-tracing code into a program's executable file, qpt writes out a trace-regeneration program that reads the trace record written by a traced program and regenerates a full address trace from it. Figure D shows the portion of the regeneration program corresponding to routine sub. In this fragment, the function calls in capital letters, such as BLOCK_ENTRY, pass trace information to the trace-consuming application. The call on F_ENTRY reports the start of sub's execution. B_ENTRY reports a basic block's execution. INSTS reports the execution of a sequence of instructions, along with its cost in cycles and the integer and floating-point registers that they read and modify. WMEM reports a write to memory. LOOP_START, LOOP_CONT, and LOOP_END report on the loop in the routine. READ_WITNESS reads the next witness into the variable CURRENT_WITNESS. The other statements in the regeneration program reexecute the address slice from the original program. The portion of the program that actually produces and consumes data values is irrelevant to tracing and does not need to be reexecuted.

The regeneration program is a C program that is compiled and linked to the trace-consuming application. Each time the traced program executes, it writes a trace file (foo.Trace) that the regeneration program reads and expands into a full address trace.

A strong advantage of using abstract execution for program tracing is the ease of integrating semantic information into a trace. In qpt, the regeneration program tells a trace consumer which instructions correspond to function entry and exit; basic-block entry and exit; loop entry, exit, and iteration; memory allocation and deallocation; program line numbers; and calls on setjmp and longjmp. In addition, qpt annotates instruction references with an estimate of their execution cost and the set of registers that they read and write. Other information inferred by a compiler can easily be incorporated and passed along to the trace consumer. Annotations of this type add nothing to the tracing costs and only marginally increase the cost of trace regeneration.
Program Name    Program Time      Traced Time
                (user + system in seconds)
costscale       6.3u + 0.9s       34.7u + 44.4s
dcgc            1.6u + 0.1s        1.9u +  0.5s
eqntott         0.1u + 0.0s        0.3u +  0.4s
espresso        1.9u + 0.0s       10.8u + 12.3s
sc              2.2u + 0.8s       11su  + 11.1s
pdp             1.9u + 0.1s        8.1u + 10.0s
polyominoes     2.2u + 0.1s       11.2u + 13.6s
sgefa           3.1u + 0.0s        4.7u +  1.5s
xlisp           1.2u + 0.1s        7.7u +  8.3s
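The slowdown factors quoted in the text can be recomputed from the rows of the table above. The following is a minimal sketch, assuming that the factor is the traced time (user plus system) divided by the untraced program time (user plus system); the function name slowdown is hypothetical.

```c
/*
 * Slowdown factor of a traced run (assumed definition):
 * traced (user + system) time divided by untraced (user + system) time.
 * All arguments are in seconds.
 */
double slowdown(double prog_user, double prog_sys,
                double traced_user, double traced_sys)
{
    return (traced_user + traced_sys) / (prog_user + prog_sys);
}
```

Under this assumption, the dcgc row (1.6u + 0.1s untraced, 1.9u + 0.5s traced) gives about 1.4, and the xlisp row (1.2u + 0.1s untraced, 7.7u + 8.3s traced) gives about 12.3, which matches the range reported in the text.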
tracing ranged from 1.4 to 12.3 times the cost of executing an untraced program. A good portion of this cost (24-57 percent) is the time required to write trace files to disk. This cost reflects the design of the Unix file system and the speed of the local disk. Discounting this cost for a moment, the direct overhead of tracing ranged from 0.2 to 5.0 times the cost of executing an untraced program. (This cost could be lowered by reducing the number of instructions in the code sequences used by qpt. The current sequences are simple and general,
Figure 6. Analysis of tracing overhead. The black portion of the bar represents the application's normalized untraced execution time. The white portion represents the normalized direct overhead of tracing. The gray portion is the cost of writing the trace data to disk.
COMPUTER
Figure A. Overview of qpt system.

    int a[100][100];

    main ()
    {
        int i;

        for (i = 0; i < 100; i = i + 1)
            sub (a[i], i);
    }

    sub (x, i)
    int x[];
    int i;
    {
        int j;

        for (j = 0; j < 100; j = j + 1)
            x[j] = i + j;
    }

Figure B. Example C program. The routine sub sets each element of array x.

    Block 0   0x4001f4:  move   r6, r0              ; j <- 0
    Block 1   0x4001f8:  sll    r2, r6, 2           ; tmp1 <- j*4
              0x4001fc:  addu   r3, r5, r6          ; tmp2 <- i+j
              0x400200:  addiu  r6, r6, 1           ; j <- j+1
              0x400204:  addu   r2, r2, r4          ; tmp1 <- &x+j*4
              0x400208:  slti   r1, r6, 100
              0x40020c:  bne    r1, r0, 0x4001f8    ; loop if j < 100
              0x400210:  sw     r3, 0(r2)           ; *tmp1 <- tmp2
    Block 2   0x400214:  jr     r31                 ; return

Figure C. Compiled code for the routine sub from Figure B. Register r0 contains the constant 0, register r4 holds the first argument, and register r5 holds the second.

    void qpt_sub()
    {
        static char *sfn = "test.c";

        PC = 0x4001f4; F_ENTRY(PC, "sub", sfn);
        R4 = qpt_get_word();            /* recorded value of r4 (x) at entry */

    L0: /* Block 0 */
        PC = 0x4001f4; B_ENTRY(...);
        INSTS(1,1,0x40,0x0,0x0,0x0);
        R6 = R0;                        /* j <- 0 */
        LOOP_START(1);

    L1: /* Block 1 */
        PC = 0x4001f8; B_ENTRY(...);
        INSTS(1,1,0x4,0x40,0x0,0x0);
        R2 = R6 << 2;                   /* tmp1 <- j*4 */
        INSTS(2,2,0x48,0x60,0x0,0x0);
        R6 = R6 + 1;                    /* j <- j+1 */
        INSTS(1,1,0x4,0x14,0x0,0x0);
        R2 = R2 + R4;                   /* tmp1 <- &x+j*4 */
        INSTS(3,3,0x2,0x44,0x0,0x0);
        WMEM(R2+0);
    L1s: switch (CURRENT_WITNESS) {
        case 0:  LOOP_CONT(1); READ_WITNESS(0); goto L1;
        case 1:  LOOP_END(1);  READ_WITNESS(1); goto L2;
        case -1: READ_WITNESS(); goto L1s;
        default: qpt_cjump_err(PC);
        }

    L2: /* Block 2 */
        PC = 0x400214; B_ENTRY(...);
        INSTS(...);
        INSTS(1,1,0x1,0x0,0x0,0x0);
        F_EXIT(0x4001f4, "sub", sfn);
    }

Figure D. Slightly edited version of the regeneration code for sub.

References

1. J.R. Larus, Abstract Execution: A Technique for Efficiently Tracing Programs, Software Practice & Experience, Vol. 20, No. 12, Dec. 1990, pp. 1,241-1,258.
2. T. Ball and J.R. Larus, Optimally Profiling and Tracing Programs, Conf. Record 19th Ann. ACM Symp. Principles of Programming Languages, ACM Press, New York, 1992, pp. 59-70.
May 1993
Table 3. Size of trace files. The second column contains the number of instruction references during a program execution. The third column contains the number of data memory references. The fourth column lists the size of each trace file. The fifth column lists the rate at which the trace was produced. All entries are x 1,000.

Program Name    Number of       Number of       Trace Size    Trace Rate
                Instructions    Memory Refs.    (bytes)       (bytes/CPU sec.)
costscale       78,325          24,690          40,632        6,449
dcgc            11,470           6,530             358          223
eqntott          1,180             467             451        4,512
espresso        28,919           6,865          11,469        6,036
sc              22,756           7,281           8,067        3,666
pdp             25,395           7,621           9,163        4,822
polyominoes     36,058           6,678          13,630        6,195
sgefa           15,489           3,497           1,348          434
xlisp           14,883           5,494           7,498        6,248
The regeneration programs produced between 213,000 and 530,000 addresses per second of CPU time. Even in the worst case, the addresses are regenerated much faster than most applications consume them. For example, cache-memory simulators generally process tens of thousands of addresses per second. The additional time to regenerate addresses is insignificant for these programs.

Compiler-based program tracing not only alleviates the major problems of previous tracing techniques (excessive time and space overhead), but also opens new uses for traces by facilitating the incorporation of additional semantic information. For example, AE and qpt have been used to produce traces to simulate programs' parallel execution. This application requires the semantic annotations that delimit function and loop boundaries as well as the address trace.
Table 4. File compression. The first column lists the ratio of the full address trace size (5 bytes per memory reference) to the recorded file size. The second column has the corresponding ratio for the compressed trace file.
Program Name    Size Trace/Size File
                Uncompressed    Compressed
costscale        12.7              55.7
dcgc            251.3           2,105.0
eqntott          18.3              36.7
espresso         15.6              67.3
sc               18.6              52.9
pdp              18.0             309.4
polyominoes      15.7             116.0
sgefa            70.4             690.9
xlisp            13.6              45.5
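The uncompressed ratios in Table 4 can be checked against the counts in Table 3. The sketch below assumes that the full address trace holds one 5-byte entry for every instruction reference and every data memory reference; that reading is an assumption, chosen because it reproduces the published ratios. The function name full_trace_ratio is hypothetical.

```c
/*
 * Ratio of full address trace size to recorded trace file size.
 * Arguments are in thousands, as in Table 3.
 * Assumption: the full trace has one 5-byte entry per instruction
 * reference and per data memory reference.
 */
double full_trace_ratio(double instr_k, double data_refs_k, double file_size_k)
{
    const double bytes_per_ref = 5.0;
    return bytes_per_ref * (instr_k + data_refs_k) / file_size_k;
}
```

For costscale (78,325 instructions, 24,690 data references, 40,632 bytes of trace, all x 1,000) this gives about 12.7, and for eqntott (1,180, 467, and 451) about 18.3, in agreement with the uncompressed column.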
Acknowledgments
Chris Fraser suggested using single bits for control tracing; Guhan Viswanathan implemented the idea and demonstrated its viability. Guri Sohi and Tony Laundrie provided the code for a profiler that helped in the development of qpt.
This work was supported in part by the National Science Foundation under grants CCR-8958530 and CCR-9101035 and by the Wisconsin Alumni Research Foundation.
but require up to twice as many instructions as more optimized sequences used on other computers such as the Sparc.)

The large variation in overhead is due to differences in the programs' memory-referencing behavior. Abstract execution easily describes the access patterns in programs, such as sgefa or dcgc, that examine memory in a regular and predictable pattern. By contrast, programs such as pdp and espresso, which follow pointers or access memory in a data-dependent pattern (for example, during sorting), need to record more addresses in the trace file. This increases both the overhead of collecting the information and the amount of data that must be written to disk.

Table 3 shows the size of trace files. The second and third columns list the number of instructions and data memory references during each program execution. The next column contains the size of the trace file written during each program execution. Column five shows the rate at which each file was produced. This rate varied between 420,000 and 2,660,000 bytes of trace per CPU second and depended on the success of abstract execution at reducing the tracing overhead.

Table 4 illustrates qpt's ability to compress trace data. The table lists the ratio of a full address trace's size (at 5 bytes per memory reference) to the uncompressed and compressed trace file sizes. The trace files produced by qpt are 13-250 times smaller than the full address traces. Moreover, when the trace files are compressed by the Unix compress utility, the files range from 37 to 2,105 times smaller. With an achievable compression factor of 100 times, a 10-MIPS computer produces only 700 kilobytes of data per second, so a 500-megabyte disk holds almost 1,000 seconds of trace.
References
1. T. Ball and J.R. Larus, Optimally Profiling and Tracing Programs, Conf. Record of the 19th Ann. ACM Symp. Principles of Programming Languages, ACM Press, New York, 1992, pp. 59-70.
2. S.L. Graham, P.B. Kessler, and M.K. McKusick, An Execution Profiler for Modular Programs, Software Practice & Experience, Vol. 13, 1983, pp. 671-685.
3. A.D. Samples, Mache: No-Loss Trace Compaction, Proc. 1989 ACM SIGMetrics Conf. Measuring and Modeling of Computer Systems, ACM Press, New York, 1989, pp. 89-97.
4. C.A. Wiecek, A Case Study of VAX-11 Instruction Set Usage for Compiler Execution, Proc. Symp. Architectural Support for Programming Languages and Operating Systems, ACM Press, New York, 1982, pp. 177-184.
5. R.R. Henry, VAX Address and Instruction Traces, Tech. Report, Univ. of California, Berkeley, 1983.
6. A. Agarwal, R.L. Sites, and M. Horwitz, ATUM: A New Technique for Capturing Address Traces Using Microcode, in Proc. 13th Ann. Int'l Symp. Computer Architecture, IEEE CS Press, Los Alamitos, Calif., Order No. 719, 1986, pp. 119-127.
7. C.B. Stunkel, B. Janssens, and W.K. Fuchs, Address Tracing for Parallel Machines, Computer, Vol. 24, No. 1, Jan. 1991, pp. 31-45.
8. A. Borg, R.E. Kessler, and D.W. Wall, Generation and Analysis of Very Long Address Traces, Proc. 17th Ann. Int'l Symp. Computer Architecture, IEEE CS Press, Los Alamitos, Calif., Order No. 2047, 1990, pp. 270-281.
9. S.J. Eggers et al., Techniques for Efficient In-Line Tracing on a Shared-Memory Multiprocessor, Proc. 1990 ACM SIGMetrics Conf. Measuring and Modeling of Computer Systems, ACM Press, New York, 1990, pp. 37-47.
10. J.R. Larus, Abstract Execution: A Technique for Efficiently Tracing Programs, Software Practice & Experience, Vol. 20, No. 12, Dec. 1990, pp. 1,241-1,258.
11. A.V. Aho, R. Sethi, and J.D. Ullman, Compilers: Principles, Techniques, and Tools, Addison-Wesley, Reading, Mass., 1985.
12. J.R. Larus, Estimating the Potential Parallelism in Programs, in Proc. Third Workshop on Languages and Compilers for Parallel Computing, Nicolau et al., eds., MIT Press, Cambridge, Mass., pp. 331-349.
James R. Larus is an assistant professor in the Computer Sciences Department at the University of Wisconsin-Madison. His research interests include parallel programming, programming languages, and compilers. He received his AB in applied mathematics from Harvard University in 1980, and his MS and PhD in computer science from the University of California, Berkeley, in 1982 and 1989, respectively. Readers may contact the author at the Department of Computer Science, University of Wisconsin-Madison, 1210 West Dayton St., Madison, WI 53706, e-mail larus@cs.wisc.edu.