
The Intel386 CPU Family - Architecture & Performance Analysis

Tejpal Chadha Partha Srinivasan


Intel Corporation,
2625 Walsh Avenue, Santa Clara, CA. 95120

Abstract

The goal of this paper is to present a synopsis of the various performance metrics related to the i386 CPU architecture. In most of the published reports the one metric that grabs the most attention is 'CPI' (Cycles Per Instruction). To evaluate the performance of a given architecture this is but one data point in the overall performance spectrum. The complex i386 CPU architecture has a number of factors which determine the overall performance of a job. We have developed a rich set of tools to analyze the 386 architecture and its performance. Using these tools and methodologies, we show how an average number like CPI can vary over the run time of the job. All the results presented are for the Intel486 DX microprocessor, executing the four integer benchmarks in the SPEC 1.0 benchmark suite.

1. Introduction

There are a number of modes that an i386 architecture microprocessor can execute in. DOS applications run in 16-bit Real mode. Microsoft's Windows* 3.0 operates in protected mode and mixes 16/32 bit code. The goal of this paper is to focus on the flat 32 bit model (using the protected 32 bit mode) of programming the i386 architecture, implemented by such processors as the Intel486 DX, Intel486 SX, Intel386 DX, Intel386 SX and Intel386 SL. This results in a flat 32 bit address space that most Unix implementations such as USL's UNIX SVR4 utilize. An excellent source of information on the programming models of the i386 architecture can be found in the book by Crawford and Gelsinger [1].

There are several performance metrics that characterize the performance of a given architecture. The metric is a mean of some kind, geometric, arithmetic or harmonic, for the whole job. The SPECmarks [4, 5] are one example of this. The SPECmark is a useful metric to evaluate the overall performance of an Engineering Workstation, however it lacks the depth and details needed to understand why the performance is good or bad (see appendix A1). Cycles per Instruction (CPI), averaged for the run time of a job, are used in comparing different processors across different architectures. CPI has become a very 'popular' metric for cross-architecture comparisons. We wish to point out some dangers associated with such comparisons when dealing with the i386 architecture. In the case of the so called RISC processors, where most instructions take the same amount of execution time, the main parameters impacting CPI are: (i) cache performance and memory system performance, (ii) register dependencies. Two architectures in widespread use, SPARC and MIPS, have similar looking instruction sets. This may justify comparing them using their CPI numbers alone.

In the case of the i386 architecture, the CPI equation is more complicated. In addition to memory subsystem and register dependencies, each instruction could take several clocks. Examples are complex instructions such as 'enter' and 'leave', which provide procedure entry and exit capability, and repeated string operations. The amount of work done by a single instruction varies more than in the case of processors with simple instruction sets. Thus the only way to truly compare architectures is to take the total time taken to execute a benchmark as opposed to CPI (which is one reason SPEC was formed).

For a complex architecture like the i386 architecture, there are a number of factors that determine the overall performance of a given job. The instruction length and the memory references per instruction determine the memory bandwidth. The cache hit rate and the memory latency contribute to the CPI. The CPI is dynamic and will change based on the instruction mix being executed. A computer architect would like to understand why and how the CPI varies for the run time of a job.

We have developed a rich set of tools to analyze the i386 architecture features and their impact on performance. We can also profile the various performance metrics across different sections of a job, thus giving a dynamic profile of any performance indicator. The basic technique used is trace-driven simulation. The instruction traces for the various SPEC jobs were collected and then run through detailed architecture and performance simulators that model the Intel486 DX microprocessor.
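The argument for comparing total run time rather than CPI can be illustrated with a small sketch. It uses the standard relation run time = instruction count x CPI / clock rate; the two machines and all of their numbers are hypothetical and are not measurements from this paper.

```python
# Sketch: why CPI alone cannot rank processors across architectures.
# All numbers below are hypothetical and chosen only for illustration.

def execution_time_s(instructions: float, cpi: float, clock_hz: float) -> float:
    """Run time in seconds = instruction count x cycles per instruction / clock rate."""
    return instructions * cpi / clock_hz

# Machine A: fewer, more complex instructions, hence a higher CPI.
time_a = execution_time_s(instructions=1.0e9, cpi=2.0, clock_hz=50e6)
# Machine B: more, simpler instructions, hence a lower CPI.
time_b = execution_time_s(instructions=1.5e9, cpi=1.5, clock_hz=50e6)

print(f"A: CPI 2.0, {time_a:.1f} s")
print(f"B: CPI 1.5, {time_b:.1f} s")
```

In this made-up case machine B has the lower CPI yet finishes later, because it executes more instructions to do the same work.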

0-8186-2655-0/92 $3.00 © 1992 IEEE


In this paper we analyze the performance of the four integer benchmarks in the SPEC 1.2b benchmark suite. Section 2 contains profiles of the 4 integer benchmarks and compares them with those obtained for SPARC and MIPS processors. Section 3 presents the conclusions. It must be noted at this point that the different i386 compilers in use today show markedly varied performance. All the metrics presented in this paper were obtained using the Metaware High C compiler version 2.31 for the Sun Microsystems 386i workstation.

2. i386 CPU Instruction Profile based on Integer SPEC Benchmarks

Instruction set characteristics of the i386 architecture, like the average instruction lengths, memory references per instruction and the use of certain types of instruction prefixes, influence the overall performance. We present some of these statistics followed by the CPI profile for the various hot spots in individual benchmarks. As we go along, the impact of the various metrics like the instruction length and the memory references per instruction on the overall performance will be described.

2.1 Instruction Lengths & Bandwidth

The average instruction length for an architecture, when combined with the total number of instructions executed, provides a lower bound on the total instruction bandwidth required. This bandwidth affects several important figures of merit such as the Instruction Cache hit-rate and the memory bandwidth consumed by instruction fetch traffic. In general a lower instruction bandwidth results in a higher Instruction Cache hit-rate and a greater availability of the memory bus for data traffic. We have calculated the average instruction length for the 4 Integer SPEC benchmarks and show the results in Fig. 1. We have shown two results for Espresso (executed with two different input files) separately to show the similarity in the lengths. The average instruction length for the integer SPEC benchmarks is 2.85 bytes. Most of the so called RISC processors, in comparison, have a fixed instruction length of 4 bytes.

The smaller instruction size allows more instructions to be contained in a cache of given size. This improves the instruction cache hit-rate for the i386 family of processors over that of a fixed length instruction set, when all other cache parameters are the same.

Fig. 1. Average Instruction Lengths (vertical axis: average instruction length in bytes)

Once we know the average instruction length (Fig. 1), the study is extended to compare the instruction bandwidth across different architectures. Table 1 compares the total number of instructions executed by the i386 CPU, MIPS, SPARC and RS6000 to complete the integer benchmarks. The results for MIPS and SPARC are from the paper by Cmelik & Kong [1] whereas the RS6000 results are from the paper by Stephens [2]. The instruction counts in Table 1 exclude library routines. This was done in order to do a fair comparison, as some reports included the percentage of instructions from library calls [1], whereas [2] did not.

Program     i386      SPARC     MIPS      RS6000
gcc1.35     0.95885   0.93635   1.02195   N/A
espresso    1.95898   2.49123   2.74394   2.180
li          4.70600   4.65832   4.65832   5.779
eqntott     1.13182   1.26867   1.26867   0.979

Table 1. Dynamic Instruction counts excluding libraries in Billions of Instructions.

It can be seen that the i386 CPU family enjoys an advantage in executing fewer instructions (7%-25%). This can be attributed to the use of complex instructions and complex memory addressing modes, and is influenced greatly by the compiler used. It is also seen that the MIPS architecture has consistently more instructions than the SPARC architecture, though this disparity may be reduced by more recent MIPS compilers, such as those used in the latest SPEC benchmarking [8].
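The lower bound on instruction bandwidth described in this section (instruction count multiplied by average instruction length) can be sketched from the Table 1 counts. One simplifying assumption is made here and is not from the paper: the suite-wide 2.85-byte i386 average is applied to every program, because the per-program lengths plotted in Fig. 1 are not available as numbers. The names `counts`, `avg_len` and `normalized_volume` are illustrative.

```python
# Dynamic instruction counts from Table 1, in billions (libraries excluded).
counts = {
    "gcc1.35":  {"i386": 0.95885, "SPARC": 0.93635, "MIPS": 1.02195},
    "espresso": {"i386": 1.95898, "SPARC": 2.49123, "MIPS": 2.74394},
    "li":       {"i386": 4.70600, "SPARC": 4.65832, "MIPS": 4.65832},
    "eqntott":  {"i386": 1.13182, "SPARC": 1.26867, "MIPS": 1.26867},
}

# Assumption: the 2.85-byte suite average is used for every i386 program;
# SPARC and MIPS use a fixed 4-byte instruction.
avg_len = {"i386": 2.85, "SPARC": 4.0, "MIPS": 4.0}

def normalized_volume(program: str) -> dict:
    """Instruction bytes fetched per architecture, relative to i386 = 1.0."""
    row = counts[program]
    base = row["i386"] * avg_len["i386"]
    return {arch: row[arch] * avg_len[arch] / base for arch in row}

for program in counts:
    ratios = normalized_volume(program)
    print(program, {arch: round(r, 2) for arch, r in ratios.items()})
```

Because both factors favor the i386 architecture for most of these programs, the gap in instruction bytes is wider than the gap in instruction counts alone.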

Fig. 2. Normalized Instruction Volume (Total number of instructions × Av. instruction length), plotted as the normalized ratio of total instruction bytes consumed by i386, SPARC and MIPS for gcc1.35, espresso, li and eqntott.

Fig. 2 shows the total instruction volume in bytes consumed in the execution of a program for the i386 architecture and the SPARC or MIPS architectures. This metric shows more disparity between the i386 architecture and the other architectures than the instruction counts do, because the average i386 instruction length is only 71% of that for the other CPU architectures. The total instruction volume is arrived at by multiplying the average instruction length by the total number of instructions executed. The number is normalized with respect to the i386 architecture.

2.2 Memory references per Instruction

The i386 architecture allows memory accesses by several instructions, and thus the number of memory reads/writes can't be simply determined by looking at the number of load/store instructions as in the case of processors with Load/Store architectures. The breakdown of the average memory references per instruction shows the memory bandwidth requirements of each program. Excessive memory traffic also causes a degradation of the prefetch hit-rate, since the Intel486 CPU has a unified first level cache.

Fig. 3. Occurrence of Memory Operations

Fig. 3 shows the breakdown into reads and writes per instruction and prefetches per instruction. This metric helps in determining the memory bandwidth required in a system. It must be borne in mind that this does not include memory traffic created by TLB miss processing or I/O permission protection checks or other memory traffic created indirectly by the processor. The read/write metrics are a function of the compiler, the application and the instruction set. The number of prefetches is highly dependent on the implementation of the CPU and to a lesser extent on the application.

Fig. 4 shows the relative occurrence of memory accesses per instruction in other architectures. The metric presented is the total number of memory bytes accessed for read and write, normalized w.r.t. the total accessed (read+write) for the i386 architecture. The data for the SPARC and MIPS architectures is obtained from [2]. It is seen that the i386 architecture has greater data memory traffic than the SPARC or MIPS architectures. This tends to balance out the lower demand the i386 architecture imposes on instruction memory traffic (see Fig. 2). Similar data is shown by Hennessy and Patterson [9], comparing the VAX architecture and a reference Load/Store architecture.

Fig. 4. Occurrence of Memory Operations in i386 architecture and other architectures (normalized total memory traffic w.r.t. i386 total memory traffic)

2.3 Profile for the Integer SPEC benchmarks

The SPEC benchmarks were designed such that performance of a machine on these benchmarks would be indicative of its performance on other jobs. We show how the CPI varies within each benchmark depending on the instruction mix being executed. This also shows how good compilation and instruction selection can improve performance. Wide swings in CPI are most readily correlated with variations in the cache hit-rates in the unified 8K cache. However in the case of Eqntott, which has better than average cache hit-rates, instruction selection is the key factor in determining CPI.
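The link between cache hit-rate and CPI can be made concrete with a common first-order model: each memory reference that misses the cache adds a fixed stall. This is only a textbook-style approximation and is not the detailed simulator used in this paper. The base CPI, references per instruction and miss penalty below are hypothetical, and effects such as write-buffer stalls and prefetch contention are ignored.

```python
def estimated_cpi(base_cpi: float, refs_per_instr: float,
                  hit_rate: float, miss_penalty_cycles: float) -> float:
    """First-order CPI estimate: base CPI plus average miss stall per instruction."""
    miss_rate = 1.0 - hit_rate
    return base_cpi + refs_per_instr * miss_rate * miss_penalty_cycles

# Hypothetical parameters, chosen only to show the trend.
BASE_CPI = 1.5
REFS_PER_INSTR = 0.5
MISS_PENALTY = 4

for hit_rate in (0.98, 0.90, 0.75):
    cpi = estimated_cpi(BASE_CPI, REFS_PER_INSTR, hit_rate, MISS_PENALTY)
    print(f"hit-rate {hit_rate:.0%}: estimated CPI {cpi:.2f}")
```

Under this model a drop in hit-rate raises CPI in proportion to the memory references per instruction, which is one reason programs with heavy data traffic are more sensitive to cache behavior.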
The execution trace of the entire program is divided into sets representing 100K instructions each. Each set is called a trace sample. Similar trace samples are grouped together and represented by one member of the group. The CPI for the Intel486 DX CPU is plotted for representative trace samples from a hundred groups with the most members. The simulation assumed the processor had access to a zero wait state, infinite size second level cache. Each sample is a set of 100K or 200K instructions. The x-axis is the trace sample #. The purpose of the following figures is to show the variance of CPI for the complete program. Also shown is the variation of the Data hit-rate in the unified cache for the Intel486. The simulator allows us to separately compute the code and data hit-rates. We have also shown examples of how certain anomalies in the CPI can be traced back to their causes. It is not our intention to deliver a blow by blow analysis of the entire profile.

GCC:

Fig. 5 shows the CPI during the execution of one sample in Gcc increasing dramatically to 3.7. Fig. 6 shows the Data Hit-rate in the unified 8K cache during the same set of samples. It can be seen from the circled area of Fig. 5 that the CPI rises during the execution of sample 40, which at the same time experiences a dramatic drop in data hit-rates, as seen in the circled area in Fig. 6. Examination of the detailed simulation logs for that sample reveals the following facts: (i) the Write Hit-rate during that sample is around 37%, (ii) writes outnumber reads by a ratio of 2 to 1, and (iii) 40% of all memory accesses are writes. Since the Intel486 DX CPU executes in Write-through mode, the presence of several back to back writes will overflow the write buffers and cause the processor to stall. The average CPI for Gcc is 2.27.

Fig. 5. Profile of CPI varying over several samples

Fig. 6. Profile of Data Hit-rates over the samples shown in Fig. 5

From Fig. 7, Xlisp is seen to be a steady benchmark whose basic character does not change over the duration of execution. The high Cache hit rate (98%) helps keep the average CPI to 1.96 for the Intel486 CPU. The data hit-rate is shown in Fig. 8.

Fig. 7. CPI profile of Xlisp

Xlisp has the best data locality of the 4 benchmarks. Thus even though its data bandwidth requirements (see Fig. 3) are quite high, we get good cache performance.
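The per-sample CPI profiles in this section can be sketched as follows: split a trace into fixed-size samples and compute cycles per instruction for each. This is a minimal illustration, not the authors' tooling. It assumes the trace is available as a list of per-instruction clock counts, uses a sample size of 4 instead of 100K, and omits the grouping of similar samples. The function name `cpi_profile` and the toy trace are hypothetical.

```python
from typing import Iterable, List

def cpi_profile(clocks_per_instruction: Iterable[int], sample_size: int) -> List[float]:
    """Split a trace into fixed-size samples and return the CPI of each sample."""
    profile = []
    cycles = 0
    count = 0
    for clocks in clocks_per_instruction:
        cycles += clocks
        count += 1
        if count == sample_size:
            profile.append(cycles / count)
            cycles = 0
            count = 0
    if count:  # trailing partial sample, if the trace length is not a multiple
        profile.append(cycles / count)
    return profile

# Toy trace of hypothetical clock counts; the third sample contains slow instructions.
trace = [1, 1, 2, 1,   1, 1, 1, 1,   5, 9, 1, 3,   1, 2, 1, 1]
profile = cpi_profile(trace, sample_size=4)
average = sum(trace) / len(trace)

print("per-sample CPI:", profile)
print("whole-run CPI:", average)
```

The whole-run average hides the one sample whose CPI is several times higher than the rest, which is the kind of swing the profiles in Figs. 5 and 7 are meant to expose.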

Espresso/Bca

The circled area in Fig. 9 shows an increased CPI (2.45) caused by a low cache hit-rate: the data hit-rate is 73%. The cache hit-rate for reads in that sample is even lower, only 67.5%, which explains the higher CPI. The average CPI is 1.91, with a low of 1.4 and a high of 2.5. This is a case where the average CPI is quite misleading, as it varies from sample to sample.

Fig. 9. CPI Profile for Espresso / Bca

Fig. 10. Data Hit-rate profile for Espresso/Bca

Eqntott

The peak shown circled in Fig. 11 shows a significant rise in instantaneous CPI during that phase of the benchmark. Further examination of the simulation logs for that sample does not show any degraded cache performance, as can be seen in Fig. 12. However from the instruction mix encountered in the execution of that sample, the following fact emerges: 40% of the execution time is spent in instructions whose CPI is over 4.0. This incidence of complex instructions such as ret (return from procedure), mul (integer multiply), div (integer divide), leave (high level procedure exit) and scas (string byte compare) causes the CPI to rise. In comparison, the SPARC has no mul/div instructions and coincidentally has a greater path length. This shows the inappropriateness of using CPI as a comparison metric across architectures.

Fig. 12. Profile of Data Hit-rates for Eqntott.

The average CPI is 1.95 and is quite steady for the run time of the job. Eqntott has low memory traffic as seen from Fig. 3. The low data access demand helps free up more of the cache bandwidth for prefetches, thereby improving the code hit-rate. The significant degradation in CPI is caused by the inappropriate instruction selection.

3. CONCLUSIONS

Performance evaluation of the i386 architecture is more complex than for some architectures. The overall performance of a given benchmark depends on a number of performance metrics, like the average instruction length and the number of memory references per instruction. The analysis shown here is being used to improve the performance of future Intel processors and their compilers. The number of memory references per instruction can be lowered by using optimized compilers, thus improving the overall performance. The CPI for the four integer SPEC benchmarks varies from a low of 1.4 to a high of 3.7 for the Intel486 DX microprocessor. This indicates the danger of comparing average CPI, which may have fluctuated over a wide range during the run time of the job. Also, even within the i386 architecture, CPI does not correlate well to performance across various applications. The CPI variations are being studied to improve the performance of the overall job.
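The Eqntott observation, that a small number of slow instructions can account for a large share of execution time, can be sketched by weighting an instruction mix by clocks. The mix below is hypothetical: the counts and clock values are illustrative and are not Intel486 datasheet timings or data from the simulation logs.

```python
def slow_time_share(mix: dict, threshold_clocks: int) -> float:
    """Fraction of all cycles spent in instruction classes slower than the threshold."""
    total = sum(count * clocks for count, clocks in mix.values())
    slow = sum(count * clocks for count, clocks in mix.values()
               if clocks > threshold_clocks)
    return slow / total

# Hypothetical mix: mnemonic -> (dynamic count, clocks per instruction).
mix = {
    "mov": (600, 1),
    "add": (250, 1),
    "jcc": (100, 3),
    "ret": (30, 5),
    "mul": (15, 20),
    "div": (5, 40),
}

share = slow_time_share(mix, threshold_clocks=4)
cpi = sum(n * c for n, c in mix.values()) / sum(n for n, _ in mix.values())

print(f"sample CPI: {cpi:.2f}")
print(f"time in instructions over 4 clocks: {share:.0%}")
```

In this made-up mix only 5% of the instructions exceed 4 clocks, yet they consume over a third of the cycles, so instruction selection by the compiler can move CPI without any change in cache behavior.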

4. REFERENCES

Crawford, J. and Gelsinger, P., "Programming the 80386", Sybex, San Francisco, CA, 1987.

Cmelik, Robert F., Kong, Shing I., et al., "An Analysis of MIPS and SPARC Instruction Set Utilization on the SPEC Benchmarks", Proceedings of ASPLOS-IV, Santa Clara, California, 1991, pp. 290-302.

Stephens, Chriss, Cogswell, Bryce, et al., "Instruction Level Profiling and Evaluation of the IBM RS/6000", Proceedings of the 18th Annual International Symposium on Computer Architecture, May 1991, pp. 180-189.

Pixie - an object file based basic block profiler for MIPS workstations.

SPEC Newsletter, September 1991.

SPEC Newsletter, Winter 1991.

50MHz Intel486DX Microprocessor Performance Brief, Intel Corporation, Order No.: 241120-001.

SPEC Newsletter, September 1991.

Hennessy, J. L., and Patterson, D. A., "Computer Architecture: A Quantitative Approach", Morgan Kaufmann Publishers, Inc., San Mateo, CA, 1990, pp. 123.

APPENDIX

A1. Comparison of SPECint metric for several systems

The measured SPECint™ metric results for systems using these processors are presented in Fig. A1 below. The measured results are for a system which uses an Intel486 DX CPU running at 50 MHz, with a 256K second level cache. It must be noted that the different CPUs are running at different clock frequencies and belong to different generations in design. The RS6000 is the latest architecture and is superscalar, whereas the others are not.

Fig. A1. Comparison of SPECint for several systems

The results for the IBM RS6000 and SparcStation 2 are from the SPEC newsletter dated September 1991 [4]. The MIPS RC3360 result is from the Winter 1991 publication of the SPEC newsletter. The Intel486 DX microprocessor result is from the 50MHz Intel486DX Microprocessor performance brief published by Intel Corporation [6].
