
Master's Thesis in
Computer Engineering
Academic Year 2006-2007

Out-of-Order Retirement of Instructions in

Superscalar and Multithreaded Processors

Author:
Rafael Ubal Tena
Advisors:
Julio Sahuquillo Borrás
Pedro López Rodríguez

Contents

Abstract  iii

1 Introduction  1
  1.1 Out-of-Order Retirement in Monothreaded Processors  1
  1.2 Out-of-Order Retirement with Support for Multiple Threads  3

2 The Validation Buffer Microarchitecture  5
  2.1 Register Reclamation  6
  2.2 Recovery Mechanism  8
  2.3 Working Example  10
  2.4 Uniprocessor Memory Model  11

3 Multithreaded Validation Buffer (VB-MT) Microarchitecture  12
  3.1 Multithreading support  13
  3.2 Resource sharing  13
  3.3 Adapting SMT Techniques to VB-MT  16
    3.3.1 Instruction Count (ICOUNT)  16
    3.3.2 Predictive Data Gating (PDG)  16
    3.3.3 Dynamically Controlled Resource Allocation (DCRA)  17

4 The Simulation Framework Multi2Sim  18
  4.1 Basic simulator description  18
    4.1.1 Program Loading  18
    4.1.2 Simulation Model  19
  4.2 Support for Multithreaded and Multicore Architectures  20
    4.2.1 Functional simulation: parallel workloads support  21
    4.2.2 Detailed simulation: Multithreading support  22
    4.2.3 Detailed simulation: Multicore support  23

5 Experimental Results  25
  5.1 Evaluation of the VB Microarchitecture  25
    5.1.1 Exploring the Potential of the VB Microarchitecture  26
    5.1.2 Exploring the Behavior in a Modern Microprocessor  28
  5.2 Impact on Performance of Supporting Precise Floating-Point Exceptions  32
  5.3 Evaluation of the VB-MT Microarchitecture  33
    5.3.1 Multithreading Granularity  34
    5.3.2 Multithreading Scalability: Performance vs. Complexity  37
    5.3.3 Fetch Policies  38
    5.3.4 Resources occupancy  40

6 Related Work  42

7 Conclusions  44
  7.1 Contributions and Future Work  45
  7.2 Publications Related with this Work  45

Abstract
Current superscalar processors commit instructions in program order by using
a reorder buffer (ROB). The ROB provides support for speculation, precise exceptions, and register reclamation. However, committing instructions in program
order may lead to significant performance degradation if a long latency operation
blocks the ROB head.
Several proposals have been published to deal with this problem. Most of them
retire instructions speculatively. However, as speculation may fail, checkpoints are required to roll back the processor to a precise state, which demands both extra hardware to manage checkpoints and the enlargement of other major processor structures; this, in turn, might impact the processor cycle.
This work focuses on out-of-order commit in a nonspeculative way, thus avoiding checkpoints. To this end, we replace the ROB with a structure called Validation Buffer (VB). This structure keeps dispatched instructions until they are nonspeculative or mispeculated, which allows an early retirement. By doing so, the
performance bottleneck is largely alleviated. An aggressive register reclamation
mechanism targeted to this microarchitecture is also devised. As experimental results show, the VB structure is much more efficient than a typical ROB since, with only 32 entries, it achieves a performance close to that of an in-order commit microprocessor using a 256-entry ROB.
The present work also makes an exhaustive analysis of out-of-order retirement
of instructions on multithreaded processors. Superscalar processors exploit instruction level parallelism by issuing multiple instructions per cycle. However,
issue width is usually wasted because of instruction dependencies. On the other
hand, multithreaded processors reduce this waste by providing support to the concurrent execution of instructions from multiple threads, thus exploiting both instruction and thread level parallelism. Additionally, out-of-order commit (OOC)
processors overlap the execution of long latency instructions with potentially many subsequent ones, so these microarchitectures also help to reduce the issue waste.
We analyze the impact on performance of unifying both multithreading and OOC techniques, by combining the three main paradigms of multithreading, namely fine grain (FGMT), coarse grain (CGMT), and simultaneous (SMT), with the
Validation Buffer microarchitecture, which retires instructions out of order. From
the experimental results, we conclude that: (i) an OOC-SMT processor achieves
the same performance as a conventional SMT with half the amount of hardware
threads; (ii) an OOC-FGMT processor outperforms a conventional SMT processor, which requires a more complex issue logic; and (iii) the use of OOC allows optimized fetch policies for SMT processors (e.g., DCRA) to do their job better, almost completely removing the issue width waste.


Chapter 1
Introduction
1.1 Out-of-Order Retirement in Monothreaded
Processors
Current high-performance microprocessors execute instructions out-of-order to
exploit instruction level parallelism (ILP). To support speculative execution, precise exceptions, and register reclamation, a reorder buffer (ROB) structure
is used [1]. After being decoded, instructions are inserted in program order in the
ROB, where they are kept while being executed and until retired in the commit
stage. The key to support speculation and precise exceptions is that instructions
leave the ROB also in program order, that is, when they are the oldest ones in
the pipeline. Consequently, if a branch is mispredicted or an instruction raises
an exception there is a guarantee that, when the offending instruction reaches the
commit stage, all the previous instructions have already been retired and none of the subsequent ones has. Therefore, to recover from that situation, all the processor has to do is abort the subsequent instructions.
This behavior is conservative. For instance, when the ROB head is blocked by
a long latency instruction (e.g., a load that misses in the L2), subsequent instructions cannot release their ROB entries. This happens even if these instructions are independent of the long latency one and have already completed. In such a case, since the ROB has a finite size, as long as instruction decoding continues, the ROB may become full, thus stalling the processor for a significant number of cycles. Register reclamation is also handled in a conservative way because physical
registers are mapped for longer than their useful lifetime. In summary, both the
advantages and the shortcomings of the ROB come from the fact that instructions
are committed in program order.
A naive solution to address this problem is to enlarge the ROB size to accommodate more instructions in flight. However, as ROB-based microarchitectures serialize the release of some critical resources at the commit stage (e.g., physical registers or store queue entries), these resources should also be enlarged. This
resizing increases the cost in terms of area and power, and it might also impact the
processor cycle [2].
To overcome this drawback, some solutions that commit instructions out of
order have been published. These proposals can be classified into two approaches
depending on whether instructions are speculatively retired or not. Some proposals falling into the first approach, like [3], allow the retirement of the instruction
obstructing the ROB head by providing a speculative value. Others, like [4] or [5],
replace the normal ROB with alternative structures to speculatively retire instructions out of order. As speculation may fail, these proposals need to provide a
mechanism to recover the processor to the correct state. To this end, the architectural state of the machine is checkpointed. Again, this implies the enlargement
of some major microprocessor structures, for instance, the register file [5] or the
load/store queue [4], because completed instructions cannot free some critical resources until their associated checkpoint is released.
Regarding the nonspeculative approach, Bell and Lipasti [6] propose to scan
a few entries of the ROB, as many as the commit width, and those instructions
satisfying certain conditions are allowed to be retired. None of these conditions
imposes an instruction to be the oldest one in the pipeline to be retired. Hence,
instructions can be retired out of program order. However, in this scenario, the
ROB head may become fragmented after the commit stage, and thus the ROB must
be collapsed for the next cycle. Collapsing a large structure is costly in time and
could adversely impact the microprocessor cycle. As a consequence, this proposal
is unsuitable for large ROB sizes, which is the current trend. Moreover, the small
number of instructions scanned at the commit stage significantly constrains the
potential that this proposal could achieve.
In this work we propose the Validation Buffer (VB) microarchitecture, which
is also based on the nonspeculative approach. This microarchitecture uses a FIFO-like table structure analogous to the ROB. The aim of this structure is to provide
support for speculative execution, exceptions, and register reclamation. While in
the VB, instructions are speculatively executed. Once all the previous branches
and supported exceptions are resolved, the execution mode of the instructions
changes either to nonspeculative or mispeculated. At that point, instructions are
allowed to leave the VB. Therefore, instructions leave the VB in program order
but, unlike for the ROB, they do not remain in the VB until retirement. Instead,
they remain in the VB only until their execution mode is resolved, either as nonspeculative or as mispeculated. Consequently, instructions leave
the VB at different stages of their execution: completed, issued, or just decoded
and not issued. For instance, instructions following a long latency memory reference instruction could leave the VB as soon as its memory address is success2

fully calculated and no page fault has been risen. This work discusses how the
VB microarchitecture works, focusing on how it deals with register reclamation,
speculative execution, and exceptions.
The first main contribution (Chapter 2) of this work is the proposal of an aggressive out-of-order retirement microarchitecture without checkpointing. This
microarchitecture decouples instruction tracking for execution purposes and for
resource reclamation purposes. Our proposal outperforms the existing checkpoint-free proposal, since it achieves more than twice its performance using smaller VB/ROB sizes. On the other hand, register reclamation cannot be handled as done in current microprocessors [7, 8, 9] because no ROB is
used. Therefore, we devise an aggressive register reclamation method targeted
to this architecture. Experimental results show that the VB microarchitecture increases the ILP while requiring less complexity in some major critical resources
like the register file and the load/store queue.

1.2 Out-of-Order Retirement with Support for


Multiple Threads
Superscalar processors effectively exploit instruction level parallelism of a single
thread, by issuing multiple instructions in an out-of-order fashion in the same cycle. Nevertheless, issue ports are usually wasted because of instruction dependencies (i.e., the limited available parallelism), thus adversely impacting performance.
Two kinds of waste are distinguishable [10]: vertical waste, when no instruction is
issued, and horizontal waste, when some instruction is issued without completely
filling the issue width.
Resource utilization can be improved by providing support to the execution
of multiple threads, that is, by exploiting both instruction and thread level parallelism. There are three main multithreading models implemented in current
processors: fine grain (FGMT), coarse grain (CGMT), and simultaneous multithreading (SMT). All of them reduce vertical waste, but only SMT reduces horizontal waste [10] by issuing instructions from multiple threads in the same cycle, achieving the best performance gains. Nevertheless, this is done at the expense of adding
complexity to the issue logic, which is a critical point in current microprocessors.
Multithreaded architectures represent an important segment in the industry. For
instance, the Alpha 21464, the Intel Pentium 4 [9], the IBM Power 5 [11], the
Sun Niagara [12], and the Intel Montecito [13] are commercial microprocessors
included in this group.
On the other hand, and as introduced above, out-of-order commit (OOC) processors differ from in-order commit (IOC) architectures in the sense that they do

not force an instruction to be the oldest one in the pipeline in order to be retired.
In this way, long latency operations (e.g., a L2 miss) do not block the ROB when
they reach the ROB head; instead, long memory latencies are overlapped with
the retirement of subsequent instructions which do not depend on the memory
operation. Thus, these architectures mainly attack the vertical waste, although
horizontal waste is indirectly also improved. In addition, we will show that OOC
processors can make better use of resources and are more performance-cost effective than IOC processors. Quite recently, out-of-order retirement has been investigated on superscalar processors and chip multiprocessors (CMPs), but, to the best
of our knowledge, no research has focused on multithreaded processors.
As second main contribution of this work (Chapter 3), we analyze the impact of retiring instructions in an out-of-order fashion in the three main models of
multithreading: FGMT, CGMT, and SMT. To this end, we selected our own proposal, the Validation Buffer microarchitecture, as the OOC base architecture,
and extended it to support the execution of multiple threads. Experimental results
provide three main conclusions:
First, a VB-based SMT processor requires in most cases half the number of hardware threads of an ROB-based SMT processor to achieve similar performance. In other words, performance can be maintained in VB-based SMT when reducing the number of hardware threads, thus saving the hardware resources needed to track their status.
Second, a VB-based FGMT processor outperforms an ROB-based SMT
processor. In this case, performance can be sustained while simplifying the
issue logic, which can translate into shorter issue delays or lower power
consumption of instruction schedulers.
Third, existing fetch policies for SMT processors provide complementary
advantages to the out-of-order retirement benefits. A high-performance
SMT design could implement both techniques if area, power consumption
and hardware constraints allow it.
The rest of this work is structured as follows. Chapter 2 gives a detailed view
of the Validation Buffer microarchitecture. Chapter 3 deals with out-of-order retirement in multithreaded processors, and explains a set of existing instruction
fetch policies that will be used for evaluation purposes. Chapter 4 describes the
simulation framework used to model all proposed techniques. Chapter 5 performs
an exhaustive evaluation of both VB and VB-MT architectures, and Chapters 6
and 7 provide citations to related works and some concluding remarks, respectively.

Chapter 2
The Validation Buffer
Microarchitecture
The commit stage is typically the last one of the microarchitecture pipeline. At
this stage, a completed instruction updates the architectural machine state, frees
the used resources and exits the ROB. The mechanism proposed in this work allows instructions to be retired early, as soon as it is known that they are nonspeculative. Notice that these instructions may not be completed. Once they are completed, they will update the machine state and free the used resources. Therefore,
instructions will exit the pipeline in an out-of-order fashion.
The necessary conditions to allow an instruction to be committed out-of-order
are [6]: i) the instruction is completed; ii) WAR hazards are solved (i.e., a write
to a particular register cannot be permitted to commit before all prior reads of
that architected register have completed); iii) previous branches are successfully
predicted; iv) none of the previous instructions is going to raise an exception,
and v) the instruction is not involved in memory replay traps. The first condition
is straightforwardly met by any proposal at the writeback stage. The last three
conditions are handled by the Validation Buffer (VB) structure, which replaces
the ROB and contains the instructions whose conditions are not known yet. The
second condition is fulfilled by the devised register reclamation method (see Section 2.1).
The VB deals with the speculation-related conditions (iii, iv and v) by decomposing code into fragments or epochs. The epoch boundaries are defined by instructions that may initiate a speculative execution, referred to as epoch initiators (e.g., branches or potentially exception-raising instructions). Only those
instructions whose previous epoch initiators have completed and confirmed their
prediction are allowed to modify the machine state. We refer to these instructions
as validated instructions.
Instructions reserve an entry in the VB when they are dispatched; that is, they enter the VB in program order. Epoch initiator instructions are marked as such
in the VB. When an epoch initiator detects a mispeculation, all the following instructions are cancelled. When an instruction reaches the VB head, if it is an
epoch initiator and it has not completed execution yet, it waits. When it completes, it leaves the VB and updates the machine state. Non-epoch-initiator
instructions that reach the VB head can leave it regardless of their execution state.
That is, they can be either dispatched, issued, or completed. However, only non-cancelled (i.e., validated) instructions will update the machine state. On the other hand, cancelled instructions are drained to free the resources they occupy
other hand, cancelled instructions are drained to free the resources they occupy
(see Section 2.2).
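To make this behaviour concrete, the following C sketch summarizes the per-cycle retirement logic at the VB head described above. It is only an illustration; the entry fields and the helper functions (vb_head, vb_dequeue, mark_validated, drain) are hypothetical names, not part of any concrete implementation:

    /* Hypothetical sketch of the per-cycle retirement logic at the VB head. */
    typedef struct {
        int epoch_initiator;  /* marked as such at dispatch                */
        int completed;        /* execution finished                        */
        int cancelled;        /* its epoch was resolved as mispeculated    */
    } vb_entry_t;

    extern vb_entry_t *vb_head(void);            /* oldest VB entry, or NULL */
    extern void vb_dequeue(void);
    extern void mark_validated(vb_entry_t *e);
    extern void drain(vb_entry_t *e);

    void vb_retire_cycle(void)
    {
        vb_entry_t *e;
        while ((e = vb_head()) != NULL) {
            /* An epoch initiator waits at the head until it completes,
             * since it decides the fate of its own epoch. */
            if (e->epoch_initiator && !e->completed)
                break;
            if (e->cancelled)
                drain(e);            /* free its resources (Section 2.2)   */
            else
                mark_validated(e);   /* it will update the machine state
                                        once completed                     */
            vb_dequeue();            /* instructions leave in program order */
        }
    }

Note that a non-initiator entry leaves the head regardless of its execution state; if not yet completed, it stays in the pipeline (outside the VB) until completion.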
Notice that when an instruction leaves the VB, if it is already completed, it is
not consuming execution resources in the pipeline. Thus, it is analogous to a normal retirement when using the ROB. Otherwise, unlike the ROB, the instruction
is retired from the VB but it remains in the pipeline until it is completed.
The proposed microarchitecture can support a wide range of epoch initiators. At a minimum, epoch initiators covering the three speculation-related conditions are supported. Therefore, branches and memory reference instructions (i.e., the address calculation part) act as epoch initiators. In other words, branch speculation,
memory replay traps (see Section 2.4) and exceptions related with address calculation (e.g., page faults, invalid addresses) are supported by design.
It is possible to include more instructions in the set of epoch initiators. For
instance, in order to support precise floating-point exceptions, floating-point instructions should be included in this set. As instructions are able to leave the VB
only when their epoch initiators validate their epoch, a high percentage of epoch
initiators might reduce the performance benefits of the VB. User-definable flags can be used to enable or disable support for precise exceptions. If the corresponding
flag is enabled, the instruction that may generate a given type of exception will
force a new epoch when it is decoded. In fact, a program could dynamically enable or disable these flags during its execution. For instance, a flag can be enabled when the compiler suspects that an arithmetic exception may be raised.

2.1 Register Reclamation


Typically, modern microprocessors free a physical register when the instruction
that renames the corresponding logical register commits [14]. Then, the physical
register index is placed in the list that contains the free physical registers available
for new producers.
Waiting until the commit stage to free a physical register is easy to implement, but it is a conservative approach; a sufficient condition is that all consumers have read the corresponding value. Therefore, this method does not use registers efficiently, as they can be mapped for longer than their useful lifetime. In addition, this method requires keeping track of the oldest instruction in the pipeline. As this instruction may have already left the VB, this method is unsuitable for our proposal.

Figure 2.1: VB microarchitecture block diagram.
For these reasons, we devised a register reclamation strategy based on the counter method [15, 14], targeted to the VB microarchitecture. The hardware components used in this scheme, as shown in Figure 2.1, are:
Frontend Register Alias Table (RATfront). This table maintains the current mapping for each logical register and is accessed in the rename stage. The table is indexed by the source logical register to obtain the corresponding physical register identifier. Additionally, each time a new physical register is mapped to a destination logical register, the RATfront is updated.
Retirement Register Alias Table (RATret). The RATret table is updated as instructions exit the VB. This table contains a precise state of the register map, as only validated instructions leaving the VB are allowed to update it.
Register Status Table (RST). The RST is indexed by a physical register number and contains three fields, labelled pending_readers, valid_remapping and completed. The pending_readers field contains the number of decoded instructions that consume the corresponding physical register but have not read it yet. This value is incremented as consumers enter the decode stage, and decremented when they are issued to the execution units. The second and third fields each consist of a single bit. The valid_remapping bit is set when the associated logical register has been definitively remapped to a new physical register, that is, when the instruction that remapped the logical register has left the VB as validated. Finally, the completed bit indicates that the instruction producing the register's value has completed execution, writing the result to the physical register.

Table 2.1: Actions depending on pipeline events

- Event: An instruction I enters the rename stage and has physical register (p.r.) p as a source operand.
  Action: RST[p].pending_readers++
- Event: I enters the rename stage and reclaims a p.r. to map an output logical register l.
  Action: Find a free p.r. p' and set RST[p'] = {0, 0, 0}, RATfront[l] = p'.
- Event: I is issued and reads p.r. p.
  Action: RST[p].pending_readers--
- Event: I finishes execution, writing the result over p.r. p.
  Action: RST[p].completed = 1
- Event: I exits the VB as validated; l is the logical destination of I, p.r. p is the current mapping of l, and p.r. p' was the previous mapping of l.
  Action: RST[p'].valid_remapping = 1, RATret[l] = p

With this representation, a free physical register p can be easily identified when the corresponding entry in the RST contains the triplet {0,1,1}. A 0 in pending_readers guarantees that no instruction already in the pipeline will read the contents of p. Next, a 1 in valid_remapping implies that no new instruction will enter the pipeline (i.e., the rename stage) and read the contents of p, because p has been unmapped by a valid instruction. Finally, a 1 in the completed field denotes that no instruction in the pipeline is going to overwrite p. These conditions ensure that a specific physical register can be safely reallocated for a subsequent renaming. On the other hand, a triplet {0,0,1} denotes a busy register, with no pending readers, not unmapped by a valid instruction, and holding the result of a completed (and valid) operation.
Table 2.1 shows different situations which illustrate the dynamic operation of the proposed register reclamation strategy in non-speculative mode, as well as the updating mechanism of the RST, RATfront and RATret tables.
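The events of Table 2.1 and the freeing condition can be condensed in a few lines of C. This is a minimal sketch under illustrative assumptions (the sizes NUM_PREGS and NUM_LREGS and the helper find_free_preg are not taken from the text):

    /* Illustrative encoding of the RST and the events of Table 2.1. */
    #define NUM_PREGS 256   /* assumed physical register file size */
    #define NUM_LREGS 32    /* assumed logical register count      */

    typedef struct {
        int pending_readers;  /* decoded consumers that have not read yet */
        int valid_remapping;  /* 1: unmapped by a validated instruction   */
        int completed;        /* 1: the producer has written the register */
    } rst_entry_t;

    rst_entry_t RST[NUM_PREGS];
    int RAT_front[NUM_LREGS], RAT_ret[NUM_LREGS];

    extern int find_free_preg(void);  /* returns some p with RST[p] == {0,1,1} */

    void on_rename_source(int p)  { RST[p].pending_readers++; }
    void on_issue_read(int p)     { RST[p].pending_readers--; }
    void on_writeback(int p)      { RST[p].completed = 1; }

    void on_rename_dest(int l)    /* allocate a new mapping for l */
    {
        int p = find_free_preg();
        RST[p] = (rst_entry_t){0, 0, 0};
        RAT_front[l] = p;
    }

    void on_vb_exit_validated(int l, int p, int p_prev)
    {
        RST[p_prev].valid_remapping = 1;  /* previous mapping may be freed */
        RAT_ret[l] = p;
    }

    int is_free(int p)            /* the {0,1,1} triplet */
    {
        return RST[p].pending_readers == 0 &&
               RST[p].valid_remapping && RST[p].completed;
    }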

2.2 Recovery Mechanism


The recovery mechanism always involves restoring both the RATfront and the RST tables.
Register Alias Table Recovery. Current microprocessors employ different
methods to restore the renaming information when a mispeculation or exception occurs. The method presented in this work uses the two renaming tables, RATfront and RATret, explained above, similarly to the Pentium 4 [9].

RATret contains a delayed copy of a validated RATfront. That is, it matches the RATfront table at the time the exiting (as valid) instruction was renamed. So, a simple method to implement the recovery mechanism (restoring the mapping to a precise state) is to wait until the offending instruction reaches the VB head, and then copy RATret into RATfront. Alternative implementations can be found in [4].
Register Status Table Recovery. The recovery mechanism must also undo
the modifications performed by the cancelled instructions in any of the three fields
of the RST .
Concerning the valid_remapping field, we describe two possible techniques to restore its values. The first technique squashes from the VB those entries corresponding to instructions younger than the offending instruction when the latter reaches the VB head. At that point, the RATret contains the physical register identifiers that are used to restore the correct mapping. The remaining physical registers must be freed. To this end, all valid_remapping entries are initially set to 1 (a necessary condition for freeing). Then, the RATret is scanned looking for physical registers whose valid_remapping entry must be reset.
The second technique relies on the following observation: only the physical registers that were allocated (i.e., mapped to a logical register) by instructions younger than the offending one must be freed. Therefore, instead of squashing the VB contents when the offending instruction reaches the VB head, the instructions in the VB are drained as they are cancelled. These instructions must set to 1 the valid_remapping entry of their current mapping. Notice that in this case, the valid_remapping flag is used to free the registers allocated by the current mapping, instead of the previous mapping as in normal operation. While the cancelled instructions are being drained, new instructions can enter the rename stage, provided that the RATfront has already been recovered. Therefore, the VB draining can be overlapped with subsequent processor operations.
Regarding the pending_readers field, it cannot simply be reset, as there can already be valid pending readers in the issue queue. Thus, each pending_readers entry must be decremented by the number of cancelled pending readers for the corresponding physical register. To this end, the issue logic must be able to detect those instructions younger than the offending instruction, that is, the cancelled pending readers. This can be implemented by using a bitmap mask in the issue queue to identify which instructions are younger than a given branch [16]. The cancelled instructions must be drained from the issue queue to correctly handle (i.e., decrement) their pending_readers entries. Notice that this logic can also be used to handle the completed field, by enabling a cancelled instruction to set the entry of its destination physical register. Alternatively, it is also possible to simply let the cancelled instructions execute to correctly handle the pending_readers and completed fields.
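The first recovery technique can be sketched as follows, reusing the names of the previous sketch. The per-reader fix-up is only outlined in a comment, since it depends on the issue queue implementation:

    /* Illustrative recovery, triggered when the offending instruction
     * reaches the VB head. */
    void recover(void)
    {
        /* 1. Restore a precise mapping from the retirement RAT. */
        for (int l = 0; l < NUM_LREGS; l++)
            RAT_front[l] = RAT_ret[l];

        /* 2. Restore valid_remapping: set all entries (necessary
         *    condition to be freed), then clear the registers that the
         *    precise mapping still uses. */
        for (int p = 0; p < NUM_PREGS; p++)
            RST[p].valid_remapping = 1;
        for (int l = 0; l < NUM_LREGS; l++)
            RST[RAT_ret[l]].valid_remapping = 0;

        /* 3. pending_readers cannot simply be reset: for every cancelled
         *    consumer of a physical register p, identified in the issue
         *    queue with the bitmap mask of [16] and drained, execute
         *    RST[p].pending_readers--. */
    }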

Figure 2.2: Instruction status and epochs.

2.3 Working Example


To illustrate different scenarios that could require triggering the recovery mechanism as well as register reclamation handling, we use the example shown in Figure 2.2. This example shows a validation buffer with 12 instructions belonging to
3 different epochs. Instructions can be in one of the following states: dispatched
but not issued, issued but not yet completed, and completed. Unlike in normal ROBs, no control information about these states is stored in the VB. Instead, the only information required is whether the epoch is validated or cancelled.
In the example, assume that the three epochs have just been resolved, epochs
0 and 1 as validated and epoch 2 as cancelled. Thus, only instructions belonging
to epochs 0 and 1 should be allowed to update the machine state.
Firstly, instructions belonging to epoch 0 leave the VB. As this epoch has been validated, each instruction will update the RATret and set the valid_remapping bit of the physical register previously mapped to its destination logical register. Since these instructions are completed, the VB is the last machine resource they
consume.
Then, instructions of epoch 1 leave the VB: two completed, one issued, and the other one dispatched but not yet issued. The RATret table and the valid_remapping bit are handled as above, regardless of the instruction status. However, the non-completed instructions will remain in the pipeline until they complete.
Finally, instructions belonging to epoch 2 leave the VB as cancelled. These instructions must not update the RATret, but they must set the valid_remapping bit of the physical register currently mapped to their destination logical register. In addition, the processor must obtain a correct RATfront, re-execute the epoch initiator (if needed), and fetch the correct instructions. To this end, the machine waits until the epoch initiator instruction that triggered the cancellation reaches the VB head. Then, the processor recovers to a correct state by copying the RATret to the RATfront and resumes execution from the correct path. The RST state is recovered as explained in Section 2.2.

2.4 Uniprocessor Memory Model


To correctly follow the uniprocessor memory model, it must be ensured that load instructions get the data produced by the newest previous store matching their memory address. A key component to improve performance in such a model is the load/store queue (LSQ).
In the VB microarchitecture, as done in some current microprocessors, memory reference instructions are internally split by the hardware into two instructions
when they are decoded and dispatched: the memory address calculation, which is
considered as an epoch initiator, and the memory operation itself. The former reserves a VB entry when it is dispatched while the latter reserves an entry in the
LSQ. A memory reference instruction frees its corresponding queue entry provided that it has been validated and completed. Additionally, a store instruction
must be the oldest memory instruction in the pipeline.
Load bypassing is the main technique applied to the LSQ to improve processor performance. This technique permits loads to execute early by overtaking previous stores in their access to the cache. Load bypassing can be speculatively performed by allowing loads to bypass previous stores in the LSQ even if some store addresses are still unresolved.
As speculation may fail, processors must provide some mechanism to detect
and recover from load mispeculation. For instance, loads issued speculatively can
be placed in a special buffer called the finished load buffer [17]. The entry of
this buffer is released when the load commits. On the other hand, when a store
commits, a search is performed in this buffer looking for aliasing loads (note that
all loads in the buffer are younger than the store). If aliasing is detected, when
the mispeculated load commits, both the load and subsequent instructions must
be re-executed.
A finished load buffer can be quite straightforwardly implemented in the VB,
with no additional complexity. In this case, a load instruction will release its entry
in the finished load buffer when it leaves the VB. When a store leaves the VB,
the address of all previous memory instructions and its own address have been resolved. Thus, it can already search the mentioned buffer looking for aliasing loads
speculatively issued. As in a ROB-based implementation, all loads in the buffer
are younger than the store. On a hit, the recovery mechanism should be triggered
as soon as the mispeculated load exits the VB. Notice that this implementation allows mispeculation to be detected early.
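A sketch of the check performed when a store leaves the VB follows; the finished load buffer interface (flb_count, flb_entry, mark_for_replay) is a hypothetical naming, not taken from the text:

    #include <stdint.h>

    typedef struct { uint32_t addr; /* ... */ } flb_entry_t;

    extern int flb_count(void);
    extern flb_entry_t *flb_entry(int i);
    extern void mark_for_replay(flb_entry_t *ld);

    /* When a store leaves the VB, every load still in the finished load
     * buffer is younger than the store, so a simple address match
     * detects mispeculated bypasses. */
    void on_store_exit_vb(uint32_t store_addr)
    {
        for (int i = 0; i < flb_count(); i++) {
            flb_entry_t *ld = flb_entry(i);
            if (ld->addr == store_addr)
                mark_for_replay(ld);  /* recovery triggers when the
                                         mispeculated load exits the VB */
        }
    }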
Finally, store-load forwarding and load replay traps are also supported by the
VB by using the same hardware available in current microprocessors.


Chapter 3
Multithreaded Validation Buffer
(VB-MT) Microarchitecture
Superscalar processors effectively exploit instruction level parallelism of a single
thread. To this end, multiple instructions can be issued in an out-of-order fashion
in the same cycle. Nevertheless, issue ports are usually wasted because of instruction dependencies (i.e., the available parallelism), thus, adversely impacting the
performance. Two kinds of waste are distinguishable [10]: vertical waste, when
no instruction is issued, and horizontal waste, when some instruction is issued
without completely filling the issue width.
Resource utilization can be improved by providing support to the execution
of multiple threads, that is, by exploiting both instruction and thread level parallelism. There are three main multithreading models implemented in current
processors: fine grain (FGMT), coarse grain (CGMT), and simultaneous multithreading (SMT). All of them reduce vertical waste, but only SMT reduces horizontal waste [10] by issuing instructions from multiple threads in the same cycle, achieving the best performance gains. Nevertheless, this is done at the expense of adding
complexity to the issue logic, which is a critical point in current microprocessors.
Multithreaded architectures represent an important segment in the industry. For
instance, the Alpha 21464, the Intel Pentium 4 [9], the IBM Power 5 [11], the
Sun Niagara [12], and the Intel Montecito [13] are commercial microprocessors
included in this group.
On the other hand, out-of-order commit prevents long latency operations (e.g., an L2 cache miss) from blocking the ROB when they reach its head; instead, long
memory latencies are overlapped with the retirement of subsequent instructions
which do not depend on the memory operation. Thus, these architectures mainly
attack the vertical waste, although horizontal waste is indirectly also improved.
In this chapter, we analyze in depth the impact of retiring instructions in an out-of-order fashion in the three main models of multithreading (FGMT, CGMT, and SMT), using the Validation Buffer microarchitecture as the base out-of-order commit approach.

3.1 Multithreading support


As in any multithreading model, the Multithreaded Validation Buffer (VB-MT)
microarchitecture gives the illusion of having various logical processors, that is,
various simultaneously active software contexts, one per hardware thread. Although most hardware structures can be shared or private among threads [6], the
logical state, defined by a virtual memory image and a logical register file, must
be independently maintained per thread.
Concerning the virtual memory image, it makes sense for all threads to have
a common physical address space. The MMU (Memory Management Unit), and
the Translation Lookaside Buffers (TLBs) in particular, must be adapted to be
indexed by pairs {thread, virtual address}, returning disjoint physical addresses
to each thread.
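As an illustration of this indexing, the following sketch keys a direct-mapped TLB with the pair {thread, virtual page}; sizes and names are assumptions, not a description of a real MMU:

    #include <stdint.h>

    #define TLB_SETS   64
    #define PAGE_BITS  12

    typedef struct {
        int      valid;
        int      thread;  /* hardware thread owning the translation         */
        uint32_t vpage;   /* virtual page number                            */
        uint32_t ppage;   /* physical page number (disjoint across threads) */
    } tlb_entry_t;

    static tlb_entry_t tlb[TLB_SETS];

    int tlb_lookup(int thread, uint32_t vaddr, uint32_t *paddr)
    {
        uint32_t vpage = vaddr >> PAGE_BITS;
        /* The thread id takes part in both the index and the tag match. */
        tlb_entry_t *e = &tlb[(vpage ^ (uint32_t) thread) % TLB_SETS];
        if (e->valid && e->thread == thread && e->vpage == vpage) {
            *paddr = (e->ppage << PAGE_BITS) | (vaddr & ((1u << PAGE_BITS) - 1));
            return 1;  /* hit */
        }
        return 0;      /* miss: the MMU walks the per-thread page table */
    }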
In the case of the register file, the associated physical structures can be either
shared or private per thread, but the subsets of physical registers bound to specific
threads must be disjoint from each other. This implies a renaming mechanism
that allocates a new physical register from a pair {thread, logical register}, which
is straightforwardly translated into a private RAT (Register Alias Table) per
thread.
Regarding the register renaming strategy, its implementation on a multithreaded, out-of-order commit architecture is analogous to the single-threaded, out-of-order commit one. The main difference is the replication of the two renaming tables (RATfront and RATret) for each thread. The former is looked up and updated when an instruction is at the rename stage, while the latter is modified when an instruction leaves the VB, providing a delayed copy of RATfront. Depending on the thread to which a specific instruction belongs, it looks up or updates its associated RAT table.

3.2 Resource sharing


A multithreaded design offers the view of having multiple logical processors in
a single chip. A straightforward way of implementing such a system consists of
replicating all hardware structures per thread, while maintaining a common functional resource pool. In the opposite approach, hardware structures can be shared
among threads, using policies that dynamically allocate resource slots. Between
these limits, there is a gradient of solutions to be explored in order to find the optimal sharing strategy of resources.

Figure 3.1: VB-MT architecture diagram with all storage resources shared among threads.


Processor resources can be classified as storage resources (ROB, IQ, LSQ...)
and bandwidth resources (fetch stage, issue logic...). As pointed out in [18], the
sharing of storage resources can result in a performance degradation if no appropriate resource allocation policy is used. The reason is that over-allocations
to stalled threads can cause starvation to active ones, and wrong allocation decisions can affect several future cycles. In contrast, unwise allocation decisions of
bandwidth resources can be immediately compensated in the next cycle.
On the other hand, the high demand variability of bandwidth resources (in
contrast to storage resources) causes a shared design to be the most performance-efficient solution. The issue stage is the main bandwidth resource affected by
this fact, which has caused SMT (Simultaneous Multithreading) to stand out from
other multithreading designs. Related experimental results are shown in Chapter
5.
In this section, we evaluate the effect of sharing the main storage resources, so we can compare it with the results of previous works and check whether they can be extrapolated to VB-MT. Figure 3.1 shows a general block diagram of the VB-MT architecture, where all resources with a variable sharing strategy are drawn in gray. Additionally, Figure 3.2 shows the impact of varying the sharing strategy of some of them. In all cases, an SMT processor with a round-robin fetch policy is implemented, with all resources private except the one labelled on the X-axis. The IPCs of the ROB/VB architectures with all resources set as private are tagged as ROB/VB Baselines.
Shared resources are often more costly than private ones, as the allocation/deallocation of resource entries may get more complex. Moreover, shared resources need more read/write ports in order to enable parallel access by various threads, which may increase both their area and latency. The advantage of
shared resources, as one would expect, lies in the fact that one single thread can
compete for all its entries.

Figure 3.2: Impact of resource sharing for ROB/VB architectures. (Bar chart of IPC, between 1.7 and 1.9, when sharing the IFQ, IQ, physical registers or cache, against the ROB and VB baselines with all resources private.)

Nevertheless, Figure 3.2 reveals that the conclusions of [18] continue to be valid in a VB-MT architecture: there is a performance loss when sharing any storage resource without an optimized allocation policy. The reason is that slow or stalled threads can monopolize shared resources, reserving their entries for a long time and preventing other threads from using them fruitfully. The only component that does not show performance degradation when shared is the ROB/VB (not shown in Figure 3.2; see below).
The negative effects that stalled threads have over shared components can be
solved by applying some resource allocation policy that detects such situations
(e.g., DCRA [19], evaluated later). Consequently, we will use a baseline processor with all resources configured as private among threads, except in those experiments that implement DCRA, where all resources will be shared among threads
using this policy.
The ROB/VB is among those resources that can be shared in an MT design. In this case, we can cite two possible mechanisms to handle the ROB entries that are assigned to threads. One approach consists of assigning disjoint ROB portions to threads [20], whose size can vary depending on the threads' demand; each portion is treated as an independent ROB. As an opposite alternative,
instructions can be inserted in a FIFO manner into the shared ROB, as if all of
them belonged to the same thread, and hence retired in the same order.
Both approaches can exploit the advantages of a shared storage resource (also applying some resource allocation policy if necessary), but each one suffers from certain drawbacks. In the former case, ROB portions cannot grow arbitrarily to fill the whole ROB size; a portion size is constrained by the position of contiguous portions, and ROB portions can only be shifted when the associated head and tail pointers are properly aligned. In the latter case, instructions from different threads are intermingled across the ROB, so non-completed instructions from one thread may prevent completed instructions from another from exiting the ROB. Moreover, the recovery process should cancel instructions selectively, forcing interleaved gaps to remain in the ROB until they are retired. A deeper study of the effects of the ROB/VB sharing strategies is planned as future work, so we assume private ROBs/VBs for all experiments in this work.

3.3 Adapting SMT Techniques to VB-MT


In this section, we discuss the adaptation of previously proposed techniques on
SMT processors to the VB architecture. These techniques try to exploit the SMT
potential by maintaining functional units utilization as high as possible. The techniques studied in this section are Instruction Count (ICOUNT), Predictive Data
Gating (PDG) and Dynamically Controlled Resource Allocation (DCRA). Their
impact on the VB architecture, compared with the ROB, will be evaluated in
Chapter 5.

3.3.1 Instruction Count (ICOUNT)


ICOUNT is a fetch policy proposed by Tullsen et al. [21], which assigns fetch priority to those threads with fewer instructions in the decode, rename and issue stages. The designation ICOUNT.nt.ni means that at most nt threads can be handled at
the fetch stage in the same cycle, taking at most ni instructions from each one.
In the experiments shown in Chapter 5 we used an ICOUNT.2.8 configuration,
which provides the best results for the ROB-SMT architecture.
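A minimal sketch of an ICOUNT.2.8 selection loop is shown below; the per-thread counters and the helper functions are illustrative assumptions:

    #define NUM_THREADS 4   /* assumed number of hardware threads */

    /* Instructions currently in decode, rename and the IQ, per thread. */
    int icount[NUM_THREADS];

    extern int thread_can_fetch(int t);
    extern void fetch_up_to(int t, int max_instrs);

    void fetch_stage_icount_2_8(void)
    {
        int chosen[NUM_THREADS] = {0};
        for (int slot = 0; slot < 2; slot++) {        /* nt = 2 */
            int best = -1;
            for (int t = 0; t < NUM_THREADS; t++)
                if (!chosen[t] && thread_can_fetch(t) &&
                    (best < 0 || icount[t] < icount[best]))
                    best = t;                         /* lowest count wins */
            if (best < 0)
                break;
            fetch_up_to(best, 8);                     /* ni = 8 */
            chosen[best] = 1;
        }
    }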

3.3.2 Predictive Data Gating (PDG)


PDG was proposed by El-Moursy et al. [22], and can also be classified as a fetch policy. This technique aims to prevent non-ready instructions from staying long in the issue queue (IQ) due to dependences on long-latency instructions, long data dependence chains, or contention for functional units or the cache. To detect long-latency events early, a load miss predictor is placed in the processor front-end, and a per-thread counter tracks the estimated number of pending cache misses. The values of these counters are used to stall a thread's fetch when a threshold is exceeded.
For our experiments, a load miss predictor of 4K entries with 2-bit saturating counters is assumed, as suggested in [22]. The predictor is indexed by the PC of the load instruction, and the prediction is given by the most significant bit of the saturating counter. Whenever a load misses the data cache, the corresponding counter is reset, while it is incremented when the associated load hits the cache.
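The predictor just described fits in a few lines of C; this sketch is built only from the parameters given above (the index function is an assumption):

    #include <stdint.h>

    #define PRED_ENTRIES 4096          /* 4K entries           */
    static uint8_t ctr[PRED_ENTRIES];  /* 2-bit counters, 0..3 */

    static unsigned pred_index(uint32_t pc) { return (pc >> 2) % PRED_ENTRIES; }

    int predict_load_hit(uint32_t pc)  /* MSB of the counter   */
    {
        return (ctr[pred_index(pc)] >> 1) & 1;
    }

    void train_load(uint32_t pc, int hit)
    {
        unsigned i = pred_index(pc);
        if (!hit)
            ctr[i] = 0;                /* reset on a cache miss         */
        else if (ctr[i] < 3)
            ctr[i]++;                  /* saturating increment on a hit */
    }

A predicted miss increments the thread's pending-miss counter at fetch; when the counter exceeds the threshold, fetch for that thread is stalled.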

3.3.3 Dynamically Controlled Resource Allocation (DCRA)


DCRA was proposed by Cazorla et al. [19] and can be classified as a resource allocation policy. Like ICOUNT and PDG, it tries to increase functional unit utilization by preventing stalled threads from occupying resources that could be used by other threads with instructions ready to execute. DCRA does not only control the
fetch bandwidth granted to each thread, but also the allocation of shared processor resources. However, nomenclature has been relaxed in previous works [22],
considering this approach as an instruction fetch policy as well.
DCRA classifies threads according to two criteria. A thread is slow or fast depending on whether or not it has pending L1 cache misses. At the same time, a thread can be active
or inactive for a given resource R depending on whether it has recently demanded
the resource. A thread is considered active for resource R during a predefined
number Y of cycles after the cycle it demanded an entry of R (e.g. Y = 256).
The authors propose a mathematical formula to limit the number of entries of a
shared resource that a thread can allocate depending on its previous classification.
The shared resources can be the instruction queue, load-store queue, physical register file, etc., and the additional hardware consists of resource occupancy counters and a lookup table to implement the computation of resource entry limits using the aforementioned formula.
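The classification part of DCRA can be sketched as follows. Only the slow/fast and active/inactive criteria come from the description above; the entry limit shown is a deliberately simplified placeholder, not the actual formula of [19]:

    #define NUM_RESOURCES 4   /* e.g., IQ, LSQ, registers... (assumed) */
    #define Y 256             /* activity window, in cycles            */

    typedef struct {
        int  pending_l1_misses;           /* > 0 means the thread is slow */
        long last_demand[NUM_RESOURCES];  /* cycle of the last allocation */
    } dcra_thread_t;

    int is_slow(dcra_thread_t *t)
    {
        return t->pending_l1_misses > 0;
    }

    int is_active(dcra_thread_t *t, int r, long now)
    {
        return now - t->last_demand[r] <= Y;
    }

    /* Placeholder: share the resource among its active threads, giving
     * slow threads some extra quota. NOT the formula of [19]. */
    int entry_limit(int total_entries, int active_threads, int slow_bonus)
    {
        if (active_threads == 0)
            return total_entries;
        return total_entries / active_threads + slow_bonus;
    }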


Chapter 4
The Simulation Framework
Multi2Sim
This chapter describes Multi2Sim [23], the simulation framework that has been
used to model the architectural designs proposed in this work. Multi2Sim integrates a model of processor cores, memory hierarchy and interconnection network in a tool that enables their evaluation. The simulator has been extended to
model the VB microarchitecture, both for monothreaded and multithreaded environments, including all instruction fetch policies cited in Chapter 3.

4.1 Basic simulator description


Multi2Sim has been developed integrating some significant characteristics of popular simulators, such as separate functional and timing simulation, SMT and multiprocessor support, and cache coherence. Multi2Sim is an application-only tool intended to simulate final MIPS32 executable files. With a MIPS32 cross-compiler (or a MIPS32 machine), users can compile their own program sources and test them under Multi2Sim. This section deals with the process of starting and
running an application in a cross-platform environment, and describes briefly the
three implemented simulation techniques (functional, detailed and event-driven
simulation).

4.1.1 Program Loading


Program loading is the process in which an executable file is mapped into different
virtual memory regions of a new software context, and its register file and stack are
initialized to start execution. In a real machine, the operating system is in charge of these actions, but an application-only tool should manage program loading during
its initialization.
Executable File Loading. The executable files output by gcc follow the ELF
(Executable and Linkable Format) specification. An ELF file is made up of a
header and a set of sections. Some Linux distributions include the library libbfd,
which provides types and functions to list the sections of an ELF file and track
their main attributes (starting address, size, flags and content). When the flags of
an ELF section indicate that it is loadable, its contents are copied into memory starting at the corresponding address.
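For illustration, a condensed version of this loading loop using libbfd is given below. The exact libbfd API changes across binutils versions (e.g., bfd_section_size takes one or two arguments depending on the version), so this sketch follows the classic interface and omits error handling; mem_write stands for the simulator's memory module and is an assumption:

    #include <bfd.h>
    #include <stdlib.h>

    extern void mem_write(void *mem, bfd_vma addr,
                          const void *buf, bfd_size_type size);

    static void load_section(bfd *abfd, asection *sect, void *mem)
    {
        if (sect->flags & SEC_LOAD) {              /* loadable section */
            bfd_size_type size = bfd_section_size(abfd, sect);
            void *buf = malloc(size);
            bfd_get_section_contents(abfd, sect, buf, 0, size);
            mem_write(mem, sect->vma, buf, size);  /* copy at start address */
            free(buf);
        }
    }

    void elf_load(const char *path, void *mem)
    {
        bfd *abfd = bfd_openr(path, NULL);         /* NULL: default target */
        bfd_check_format(abfd, bfd_object);
        bfd_map_over_sections(abfd, load_section, mem);
        bfd_close(abfd);
    }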
Program Stack. The next step of the program loading process is to initialize
the process stack. The aim of the program stack is to store function local variables
and parameters. During the program execution, the stack pointer ($sp register) is
managed by the program code itself. However, when the program starts, it expects
some data in it, namely the program arguments and environment variables, which
must be placed by the program loader.
Register File. The last step is the register file initialization. This includes the
$sp register, which has been progressively updated during the stack initialization,
and the PC and NPC registers. The initial value of the PC register is specified in
the ELF header of the executable file as the program entry point. The NPC register
is not explicitly defined in the MIPS32 architecture, but it is used internally by the
simulator to handle the branch delay slot.

4.1.2 Simulation Model


Multi2Sim uses three different simulation models, embodied in different modules: a functional simulation engine, a detailed simulator and an event-driven module; the latter two perform the timing simulation. To describe them, the term context
will be used hereafter to denote a software entity, defined by the status of a virtual
memory image and a logical register file. In contrast, the term thread will refer
to a processor hardware entity comprising a physical register file, a set of physical memory pages, a set of entries in the pipeline queues, etc. The three main
simulation techniques are described next.
Functional Simulation, also called the simulator kernel. It is built as an autonomous library and provides an interface to the rest of the simulator. This engine is unaware of hardware threads, and provides functions to create/destroy software contexts, perform program loading, enumerate existing contexts, consult their status, execute machine instructions and handle speculative execution. The supported machine instructions follow the MIPS32 specification [24, 25]. This choice was basically motivated by the fixed instruction size and formats, which enable simple instruction decoding.
An important feature of the simulation kernel, inherited from SimpleScalar [26], is the checkpointing capability of the implemented memory module and register file, intended for an external module that needs to implement speculative execution. In this sense, when a wrong execution path starts, both the register file and memory status are saved, and they are reloaded when the misprediction is detected.
Detailed Simulation. The Multi2Sim detailed simulator uses the functional
engine to perform a timing-first [27] simulation: in each cycle, a sequence of
calls to the kernel updates the state of existing contexts. The detailed simulator
analyzes the nature of the recently executed machine instructions and accounts for the operation latencies incurred in hardware structures.
The main simulated hardware consists of pipeline structures (stage resources,
instruction queue, load-store queue, reorder buffer...), branch predictor (modelling
a combined bimodal-gshare predictor), cache memories (with variable size, associativity and replacement policy), memory management unit, and segmented
functional units of configurable latency.
Event-Driven Simulation. In a scheme where functional and detailed simulation are independent, the implementation of the machine instructions' behaviour can be centralized in a single file (functional simulation), increasing the simulator's modularity. In this sense, function calls that activate hardware components (detailed simulation) have an interface that returns the latency required to complete their access.
Nevertheless, this latency is not a deterministic value in some situations, so it cannot be calculated when the function call is performed. Instead, it must be simulated cycle by cycle. This is the case of interconnects and caches, where an access can result in a message transfer whose delay cannot be computed a priori, justifying the need for an independent event-driven simulation engine.
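A minimal event-driven engine of this kind can be reduced to a sorted queue of (cycle, handler) pairs; the sketch below is purely illustrative and unrelated to the actual Multi2Sim code:

    #include <stdlib.h>

    typedef void (*handler_t)(void *data);

    typedef struct event {
        long when;               /* absolute cycle of execution */
        handler_t handler;
        void *data;
        struct event *next;
    } event_t;

    static event_t *queue;       /* singly linked, sorted by 'when' */
    static long cycle;

    /* Called by a component that knows an access will take 'delay'
     * more cycles (e.g., one hop of a message transfer). */
    void esim_schedule(long delay, handler_t h, void *data)
    {
        event_t *e = malloc(sizeof *e), **p = &queue;
        e->when = cycle + delay;
        e->handler = h;
        e->data = data;
        while (*p && (*p)->when <= e->when)
            p = &(*p)->next;     /* keep the queue ordered */
        e->next = *p;
        *p = e;
    }

    /* Called once per simulated cycle. */
    void esim_process_cycle(void)
    {
        while (queue && queue->when == cycle) {
            event_t *e = queue;
            queue = e->next;
            e->handler(e->data);
            free(e);
        }
        cycle++;
    }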

4.2 Support for Multithreaded and Multicore Architectures


This section describes the basic simulator features that provide support for multithreaded and multicore processor modelling. They can be classified into two main
groups: those that affect the functional simulation engine (enabling the execution
of parallel workloads) and those which involve the detailed simulation module
(enabling pipelines with various hardware threads on the one hand, and systems
with several cores on the other).


4.2.1 Functional simulation: parallel workloads support


The functional engine has been extended to support parallel workloads execution.
In this context, parallel workloads can be seen as tasks that dynamically create
child processes at runtime, carrying out communication and synchronization operations. The supported parallel programming model is the one specified by the
widely used POSIX Threads library (pthread) shared memory model [28].
In a multithreaded environment, some studies suggest using a set of sequential
workloads [10]. The reason is that multiple resources are shared among hardware
threads, and processor throughput can be evaluated more accurately when no contention appears due to communication between contexts. In contrast, multicore
processor pipelines are fully replicated, and an important contention point is the
interconnection network. The execution of multiple sequential workloads exhibits
only some interconnect activity in the form of L2-L1 cache transfers, but no coherence actions can occur between processes having disjoint memory maps. Thus, in
order to evaluate multicore processors, it makes sense to support and run parallel
workloads with shared memory locations, whose distributed access can stress the
interconnection network.
Actual parallel workloads require special hardware support (machine instructions), as well as low-level software support (system calls) that enables thread spawning, synchronization and termination. Each of these issues is described below, together with a brief description of POSIX threads management:
Instruction set support. When the processor hardware supports concurrent
threads execution, the parallel programming requirement that directly affects its
architecture is the existence of critical sections, which cannot be executed simultaneously by more than one thread. CMPs or multithreaded processors must stall
the activity of a hardware thread when it tries to enter a critical section occupied
by another thread.
The MIPS32 approach implements the mutual exclusion mechanism by means of two machine instructions (LL and SC), defining the concept of an RMW (read-modify-write) sequence [25]. An RMW sequence is a set of instructions, enclosed by an LL-SC pair, that runs atomically on a multiprocessor system. The cited machine instructions do not enforce an RMW sequence, but the output value of SC informs of the RMW's success or failure.
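As an example, an atomic increment can be built on an RMW sequence as follows (MIPS32 GCC inline assembly; the operand constraints follow common GCC usage and are an assumption):

    /* SC writes back 1 on success and 0 on failure, in which case the
     * whole RMW sequence is retried. */
    static inline void atomic_inc(volatile int *addr)
    {
        int tmp;
        __asm__ __volatile__(
            "1: ll    %0, %1    \n"   /* load linked       */
            "   addiu %0, %0, 1 \n"   /* modify            */
            "   sc    %0, %1    \n"   /* store conditional */
            "   beqz  %0, 1b    \n"   /* retry on failure  */
            "   nop             \n"   /* branch delay slot */
            : "=&r" (tmp), "+m" (*addr));
    }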
Operating system support. Tracing the execution of a parallel workload,
the operating system support required by pthread consists of system calls i)
to spawn/destroy a thread (clone, exit group), ii) to wait for child threads
(waitpid), iii) to communicate and synchronize threads with system pipes
(pipe, read, write, poll) and iv) to wake up suspended threads using system
signals (sigaction, sigprocmask, sigsuspend, kill).
Figure 4.1: Examples of pipeline organizations

POSIX Threads parallelism management. Applications programmed with pthread can be simulated without changes using Multi2Sim. This library introduces user code which handles parallelism by means of the described subset of
machine instructions and system calls. However, the fact of having thread management code mingled with application code must be taken into account, as it
constitutes a certain overhead which could affect final results. Further details on
this consideration can be found in [23].
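A minimal pthread workload of the kind that runs unmodified under Multi2Sim is shown below; it exercises thread creation (clone), mutual exclusion (an RMW-based mutex) and termination/waiting:

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int counter;

    static void *worker(void *arg)
    {
        pthread_mutex_lock(&lock);     /* critical section */
        counter++;
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);  /* clone underneath   */
        pthread_join(t, NULL);                   /* wait for the child */
        printf("counter = %d\n", counter);
        return 0;
    }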

4.2.2 Detailed simulation: Multithreading support


Multi2Sim supports a set of parameters that specify how stages are organized in a
multithreaded design. Stages can be shared among threads or private per thread [6]
(except the execute stage, which is shared by definition of multithreading). Moreover,
when a stage is shared, there must be an algorithm which schedules a thread every
cycle on the stage. The modelled pipeline is divided into five stages, described
below.
The fetch stage takes instructions from the L1 instruction cache and places them into an IFQ (instruction fetch queue). The decode/rename stage takes instructions from an IFQ, decodes them, renames their registers, and assigns them a slot in the ROB (reorder buffer) and the IQ (instruction queue). Then, the issue stage consumes instructions from the IQ and sends them to the corresponding functional unit. During the execution stage, the functional units operate and write their results back into the register file. Finally, the commit stage retires instructions from the ROB in program order. This architecture is analogous to the one modelled by the SimpleScalar tool set [26], but uses a ROB, an IQ, and a physical register file instead of the RUU (register update unit).
Figure 4.1 illustrates two possible pipeline organizations. In a) all stages are shared among threads, while in b) all stages (except execute) are replicated as many times as there are supported hardware threads. Multi2Sim allows the evaluation of different stage sharing strategies, as well as of different algorithms that schedule stage

Table 4.1: Combination of parameters for different multithread configurations

                    FGMT                 CGMT                 SMT
  fetch kind        timeslice            switchonevent        timeslice/multiple
  fetch priority    -                    -                    equal/icount
  decode kind       shared/timeslice     shared/timeslice     shared/timeslice
                    or replicated        or replicated        or replicated
  issue kind        timeslice            timeslice            shared
  retire kind       timeslice            timeslice            timeslice/replicated



resources in each cycle. Depending on the stage sharing and thread selection policies, a multithreaded processor can be classified as fine-grain (FGMT), coarse-grain (CGMT), or simultaneous multithreaded (SMT).
An FGMT processor switches threads on a fixed schedule, typically every processor cycle. In contrast, a CGMT processor is characterized by thread switches induced by long latency operations or the expiration of a thread quantum. Finally, an SMT processor enhances the previous ones with a more aggressive instruction issue policy, which is able to issue instructions from different threads in a single cycle. The simulator parameters that specify the sharing strategy of pipeline stages among threads, and thus the kind of multithreading, are summarized in Table 4.1. Again, [23] gives a detailed description of all the possible values these parameters may take.
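To make this parameter space concrete, the following C sketch gathers the per-stage parameters of Table 4.1 in a single structure; the type, field, and value names mirror the table, not Multi2Sim's actual option syntax:

    /* Per-stage sharing parameters (names are illustrative). */
    typedef enum {
        KIND_SHARED,        /* one copy of the stage, arbitrated among threads   */
        KIND_TIMESLICE,     /* shared, serving one thread per cycle, round-robin */
        KIND_REPLICATED,    /* one private copy of the stage per thread          */
        KIND_SWITCHONEVENT, /* switch thread on long latency events (CGMT)       */
        KIND_MULTIPLE       /* several threads served in the same cycle (SMT)    */
    } stage_kind_t;

    typedef struct {
        stage_kind_t fetch_kind;
        stage_kind_t decode_kind;
        stage_kind_t issue_kind;    /* timeslice for FGMT, shared for SMT */
        stage_kind_t retire_kind;
        enum { PRIO_EQUAL, PRIO_ICOUNT } fetch_priority; /* meaningful for SMT */
    } mt_config_t;

    /* Example: an SMT-like configuration with ICOUNT fetch priority. */
    static const mt_config_t smt_config = {
        .fetch_kind     = KIND_MULTIPLE,
        .decode_kind    = KIND_SHARED,
        .issue_kind     = KIND_SHARED,
        .retire_kind    = KIND_TIMESLICE,
        .fetch_priority = PRIO_ICOUNT,
    };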

4.2.3 Detailed simulation: Multicore support


A multicore simulation environment is basically achieved by replicating the data structures that represent a single processor core. The region of shared resources in


a multicore processor starts with the memory hierarchy. When caches are shared among cores, some contention can arise when they are accessed simultaneously. In contrast, when they are private per core, a coherence protocol (e.g., MOESI [29]) is implemented to guarantee memory consistency. In its current version, Multi2Sim implements a split-transaction bus as the interconnection network, extensible to any other on-chip network topology.
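For reference, the per-line bookkeeping that such a protocol maintains can be summarized with the MOESI states of [29]; the following enumeration is a sketch, not Multi2Sim's actual implementation:

    /* MOESI cache-line states (illustrative names). */
    typedef enum {
        MOESI_MODIFIED,  /* only valid copy in the system, dirty                */
        MOESI_OWNED,     /* dirty and responsible for the data; sharers allowed */
        MOESI_EXCLUSIVE, /* only copy, clean: writable without bus traffic      */
        MOESI_SHARED,    /* clean, possibly replicated in other private caches  */
        MOESI_INVALID    /* line not present or invalidated                     */
    } moesi_state_t;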
The number of interconnects and their location vary depending on the sharing strategy of the data and instruction caches. Figure 4.2 shows three possible schemes of sharing the L1 and L2 caches (t = private per thread, c = private per core, s = shared), and the resulting interconnects for a dual-core dual-thread processor.

Figure 4.2: Evaluated cache distribution designs


Chapter 5
Experimental Results
In this chapter, a detailed evaluation of the Validation Buffer microarchitecture is presented, split into two main sections. In the first section, the VB architecture is evaluated in monothreaded environments, comparing it with a baseline processor and other existing out-of-order retirement proposals. In the second section, an exhaustive study is performed on the VB-MT architecture, using a different baseline processor and investigating the evolution of several performance metrics in various multithreaded scenarios.

5.1 Evaluation of the VB Microarchitecture


This section presents the simulation environment and the benchmarks used to evaluate the performance of the VB microarchitecture. For comparison purposes, two ROB-based proposals without checkpointing have been modelled: one retiring instructions in program order (hereafter, the IOC processor) and one applying the out-of-order commit technique proposed in [6] (from now on, the Scan processor). As the VB microarchitecture cancels instructions as soon as they leave the VB, to perform a fair comparison, the recovery mechanism of the ROB-based architectures is triggered at the WB stage.
The analyzed architectures have been modelled on top of a modified version of the SimpleScalar toolset [26] with separate ROB, instruction queue, and register file structures. The pipeline has also been enlarged with separate decode, rename, and dispatch stages. Both load speculation and store-load replay have been modelled in all the evaluated approaches. Table 5.1 summarizes the architectural parameters used throughout the experiments. Performance has been analyzed using a moderate value of memory latency because processor frequency is not growing at the same rate as in the past. However, as can be deduced from the results, if longer memory latencies were considered, the performance gains provided by the VB

Table 5.1: Machine parameters.

Microprocessor core
  Issue policy                            Out of order
  Branch predictor type                   Hybrid gShare/bimodal: gShare has a 16-bit
                                          global history plus 64K 2-bit counters;
                                          bimodal has 2K 2-bit counters; the choice
                                          predictor has 1K 2-bit counters
  Branch predictor penalty                10 cycles
  Fetch, issue, commit bandwidth          4 instructions/cycle
  # of integer ALUs, multiplier/dividers  4/1
  # of FP ALUs, FP multiplier/dividers    2/1

Memory hierarchy
  Memory ports available (to CPU)         2
  L1 data cache                           32KB, 4-way, 64-byte line
  L1 data cache hit latency               3 cycles
  L2 data cache                           512KB, 8-way, 64-byte line
  L2 data cache hit latency               18 cycles
  Memory access latency                   200 cycles

would be higher, as the ROB would be blocked for longer.


Experiments were run using the SPEC2000 benchmark suite [30]. Both integer (SpecInt) and floating-point (SpecFP) benchmarks have been evaluated using the ref input sets, and statistics were gathered using single simulation points [31]. The experimental study pursues two main goals: to evaluate the potential performance benefits of the proposal, and to explore its complexity requirements in a modern processor.

5.1.1 Exploring the Potential of the VB Microarchitecture


To explore how the VB size impacts performance, the remaining major processor structures (i.e., instruction queue, register file, and load/store queue) have been assumed to be unbounded. Figure 5.1 shows the average IPC (i.e., harmonic mean) for the SpecInt and SpecFP benchmarks when varying the ROB/VB size. As observed, the performance improvements provided by the VB microarchitecture are much higher in the floating-point benchmarks than in the integer ones. This is because the ROB size is not the main performance bottleneck in integer applications, one of the reasons being the high percentage of mispredicted branches. Concerning the floating-point benchmarks, the highest IPC difference appears with the smallest VB/ROB size (i.e., 32 entries). These differences get smaller as the VB/ROB size increases. Nevertheless, a large 1024-entry ROB is required for

Figure 5.1: IPC for Spec2000 benchmarks and unlimited resources (a: SpecInt benchmarks, b: SpecFP benchmarks).

the IOC and Scan processors to match the IPC achieved by the VB proposal.
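As a side note, the harmonic mean used to average the IPC can be computed as in the sketch below; unlike an arithmetic mean, it weights each benchmark by its execution time for a fixed instruction count:

    #include <stddef.h>

    /* Harmonic mean of per-benchmark IPC values:
     * n / (1/ipc[0] + 1/ipc[1] + ... + 1/ipc[n-1]). */
    static double harmonic_mean_ipc(const double *ipc, size_t n)
    {
        double cpi_sum = 0.0;              /* sum of cycles per instruction */
        for (size_t i = 0; i < n; i++)
            cpi_sum += 1.0 / ipc[i];
        return (double)n / cpi_sum;
    }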
Figure 5.2 presents the IPC achieved by each benchmark for a 32-entry ROB/VB. Results present minor differences across the integer benchmarks; thus, hereafter, the performance analysis will focus on floating-point workloads. Loads and floating-point instructions are the main sources of the IPC differences, as these instructions can potentially block the ROB for a long time. To provide insight into this fact, Table 5.2 shows the percentage of these instructions, the L1 miss rate, and the percentage of time that the retirement of instructions from the ROB/VB is blocked. Results demonstrate that the VB microarchitecture effectively reduces the blocking time, thus improving performance. This can be appreciated by observing that those applications showing large blocked-time differences also show large IPC differences (see Figure 5.2). Of course, those applications with a low percentage of floating-point instructions and a low cache miss rate benefit only slightly, or not at all, from our proposal.
For example, one of the highest differences and speedups is obtained by the swim workload, which has a high percentage (i.e., 43%) of floating-point instructions as well as a relatively high miss rate (9%). In this case, a 32-entry ROB is blocked 84% of the time, while a 32-entry VB is blocked only half of the time. On the other hand, applications having a small percentage of floating-point instructions in the executed interval and small miss rates (i.e., mesa, equake, and fma3d)

Figure 5.2: IPC for SpecInt and SpecFP benchmarks assuming a 32-entry
ROB/VB.

are more moderately affected.

5.1.2 Exploring the Behavior in a Modern Microprocessor


This section explores the behavior of the VB microarchitecture while dimensioning the major microprocessor structures to closely resemble the ones implemented in the Intel Pentium 4: a 32-entry instruction queue, a 64-entry load/store queue, 128 physical registers, and a 128-entry ROB (IOC and Scan models). The VB size has been varied from 8 to 128 entries.
Figure 5.3 shows the IPC. Results show that the VB microarchitecture is much more efficient, since with only 16 entries it achieves, on average, a higher IPC than the other architectures. On the other hand, the use of VBs larger than 32 or 64 entries provides minor performance benefits.
To explore the complexity requirements of the four major microprocessor structures in the VB microarchitecture, namely the ROB or VB (Figure 5.4), the instruction queue (Figure 5.5), the register file (Figure 5.6), and the load/store queue (Figure 5.7), we measured their occupancy in number of entries. In general, their occupancy is lower than in the compared architectures regardless of the VB/ROB size. The only exception is the instruction queue. This result was expected, as the decode stage is blocked for longer in the other architectures; therefore, more instructions are dispatched to the instruction queues in the VB microarchitecture. In other words, the validation buffer structure is under less pressure than the ROB, because part of this pressure moves to the instruction queue.

Table 5.2: VB improved performance reasons.

                  Instructions (%)     L1 miss       Blocked time (%)
  Workload        f.point    load      rate (%)      ROB 32    VB 32
  168.wupwise        30        23         1             75        35
  171.swim           43        27         9             93        41
  172.mgrid          59        32         3             84        47
  173.applu          52        30         5             90        49
  177.mesa           13        27         1             59        47
  178.galgel         27        40         6             88        44
  179.art            20        27        34             95        60
  183.equake         35        41        11             96        89
  188.ammp           35        27         5             86        74
  189.lucas          66        13        10             91        29
  191.fma3d          35        30         1             73        60
  200.sixtrack       64        19         0             72        39
  301.apsi           22        24         1             47        44

Figure 5.3: IPC for SpecFP benchmarks in a modern microprocessor.


Figure 5.4: ROB and VB occupancy.

Figure 5.5: Instruction queue occupancy.


Figure 5.6: Register file occupancy.

Figure 5.7: Load/Store queue occupancy.


Regarding the VB occupancy, the differences are really high, as the VB occupancy is, on average, less than one third of the ROB occupancy. Notice that the highest IPC benefits appear in those applications whose VB requirements are smaller than the instruction queue ones (e.g., swim or mgrid; see Figures 5.4 and 5.5). Therefore, in these cases, the VB microarchitecture effectively alleviates the retirement of instructions from the pipeline, allowing more instructions to be decoded and increasing ILP. On the contrary, when the VB requirements are larger than those of the instruction queue, as happens when using a ROB (e.g., mesa or equake), the benefits are smaller, since the ROB is not acting as the main performance bottleneck. Results also show the effectiveness of the proposed register reclamation mechanism (see Figure 5.6). Finally, the LSQ occupancy is lower in the VB microarchitecture (see Figure 5.7). This is because in ROB-based machines an LSQ entry cannot be released until all previous instructions have been committed. In contrast, in the VB microarchitecture an LSQ entry only needs to wait until all the previous instructions have been validated, which is a weaker condition.
As the proposed architecture implements both out-of-order retirement and an aggressive register reclamation method, one might think that the performance benefits come from both sides. To isolate which part comes from the VB and which from the register reclamation method, we ran simulations assuming an unbounded register file. Figure 5.8 shows the results. As observed, the register mechanism by itself only slightly affects the overall performance; thus, one can conclude that almost all the benefits come from the fact that instructions are retired out of order. In contrast, IOC and Scan improve their performance with an unbounded amount of physical registers, but even in these cases, the performance of a 16-entry VB is still better.

5.2 Impact on Performance of Supporting Precise Floating-Point Exceptions
Any floating-point exception can be supported by including the corresponding instructions in the set of epoch initiators. If all floating-point instructions were included, since some of these instructions may take tens of cycles to complete, performance might be significantly impacted. To deal with this, such instructions could resolve their epoch early in the pipeline. In the following experiment, we assume that all the floating-point instructions are epoch initiators and that their epoch is resolved when they complete, which is over-conservative. For example, some floating-point exceptions can be easily detected by comparing the exponents (e.g., overflow) or by checking one of the operands (e.g., division by zero).
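The sketch below illustrates these two early checks for IEEE-754 double-precision operands; the helper names are ours and only indicate how an epoch could be resolved before the floating-point unit completes (finite, normalized operands are assumed):

    #include <stdint.h>
    #include <string.h>

    static int biased_exponent(double x)   /* 11-bit biased exponent field */
    {
        uint64_t bits;
        memcpy(&bits, &x, sizeof bits);
        return (int)((bits >> 52) & 0x7FF);
    }

    /* Division by zero: detectable by checking a single operand. */
    static int div_by_zero(double divisor)
    {
        return divisor == 0.0;
    }

    /* Multiplication overflow: if the sum of the unbiased exponents already
     * exceeds the largest representable exponent (1023), the product is
     * guaranteed to overflow, whatever the significands are. */
    static int mul_certainly_overflows(double a, double b)
    {
        return (biased_exponent(a) - 1023) + (biased_exponent(b) - 1023) >= 1024;
    }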


Figure 5.8: IPC in a modern microprocessor assuming an unbounded amount of physical registers.
Figure 5.9 shows the results. As expected, supporting all floating-point exceptions hurts the performance achieved by the VB microarchitecture. However, even in this case, a 32-entry VB achieves performance close to that of the IOC processor with a 128-entry ROB. Moreover, a 64-entry VB achieves performance close to that of the Scan and IOC models while these use a ROB twice as large.

Figure 5.9: IPC when supporting floating-point exceptions.
From these results, we can conclude that, although supporting precise floating-point exceptions is encouraged by the IEEE-754 floating-point standard, the cost of enabling all the exceptions defined by the standard strongly impacts performance. As a matter of fact, most architectures allow disabling floating-point exceptions by software. In some architectures (e.g., Alpha) these exceptions are imprecise by default [7], and the help of the compiler is required to detect which instruction raised the exception [32].

5.3 Evaluation of the VB-MT Microarchitecture


The results shown in this section can be divided into two main groups. On the one hand, we explore the benefits of the VB microarchitecture implementing different multithreading paradigms (Sections 5.3.1 and 5.3.2). On the other hand, we evaluate previous techniques that enhance performance on multithreaded processors (e.g., PDG or DCRA) and how they work in our proposal (Sections 5.3.3 and 5.3.4).
For the performance evaluation of multithreaded architectures, we assumed the baseline processor configuration shown in Table 5.3. The simulation tool employed in this case is Multi2Sim [33], a publicly available simulator of multithreaded



processors, whose model of the VB architecture has been compared and validated against the results shown in Section 5.1. Multi2Sim supports, among others, the modelling of multicore-multithreaded processors, sharing strategies for processor resources, instruction fetch policies, and thread priority assignment.
For those experiments where a specific design has been evaluated with groups of benchmarks, we follow the criteria discussed in [20] to build these mixes using the SPEC2000 suite, as shown in Table 5.4. Benchmarks are classified as ILP (high instruction level parallelism) or MEM (memory-intensive), and three groups are created to evaluate possible combinations of two and four threads. In addition, two more groups (referred to as INT and FP) are analyzed, which include only integer or only floating-point benchmarks, respectively. These two groups are included due to the contrasting behaviour of integer and floating-point applications on the VB architecture.

5.3.1 Multithreading Granularity


As a first experiment, we have evaluated the behaviour of the VB microarchitecture implementing multithreading in its different forms (CGMT, FGMT, SMT). We explored different sharing models of the processor stages and various thread selection policies at each stage. The FGMT design is modelled with a timeslice issue policy, while the SMT issue stage takes instructions from different threads in the same cycle (shared issue). Both FGMT and SMT are modelled with a timeslice, or round-robin, fetch stage (more complex fetch policies on VB-MT are evaluated in Section 5.3.3). Finally, the CGMT design establishes a thread quantum of 1000 cycles, and the thread switch penalty is equal to the number of cycles needed to drain the processor pipeline at a given time [6].

Table 5.3: Baseline processor parameters

  Parameter                   Configuration
  Machine width               8-wide fetch, 8-wide issue, 8-wide commit
  Storage resources size      32-entry private IQs, 24-entry private LSQs,
                              32-entry private ROBs
  Functional units and        8 Int Add (2/1), 2 Int Mult (1/1), 2 Int Div (20/19),
  latency (total/issue)       4 Ld/St (2/1), 8 FP Add (4/2), 2 FP Mult (8/1),
                              2 FP Div (40/20)
  Phys registers              128-entry private files
  L1 I-cache                  32KB, 2-way, 64-byte line, per-thread private,
                              1-cycle hit time
  L1 D-cache                  32KB, 2-way, 64-byte line, per-thread private,
                              2-cycle hit time
  L2 unified cache            1MB, 8-way, 64-byte line, shared among threads,
                              10-cycle hit time
  BTB                         1024-entry, 2-way
  Branch predictor            McFarling, private per thread, 4K-entry gShare,
                              4K-entry bimodal
  Memory latency              200 cycles
  D-TLB, I-TLB                16K, 4-way, shared among threads

Table 5.4: Benchmark combinations

  Classification    Mix Name    Benchmarks
  ILP               Mix 0       wupwise, eon
                    Mix 1       apsi, eon, fma3d, gcc
  ILP+MEM           Mix 2       art, gzip
                    Mix 3       art, gzip, wupwise, twolf
  MEM               Mix 4       applu, ammp
                    Mix 5       applu, ammp, art, mcf
  INT               Mix 6       gcc, gzip
  FP                Mix 7       wupwise, mgrid



Figure 5.10: Performance for different multithread designs in the ROB/VB architectures for different benchmark mixes.

The results corresponding to this experiment are shown in Figure 5.10. The last group of bars represents the average values for each design. Comparing the behaviour of the VB-MT architecture in its different variants with the ROB-SMT processor, we can observe that VB-CGMT is about 5% slower than ROB-SMT, while VB-FGMT and VB-SMT outperform ROB-SMT by about 16.4% and 19.7%, respectively.
Mixes 2 and 6 show a flat behaviour both when substituting the ROB with the VB and when improving the multithreading paradigm. This fact corroborates the results shown in Section 5.1.1, where it was shown that specific benchmarks benefit neither from the VB nor from enlarging the ROB. This situation is caused by a lack of instruction level parallelism, aggravated by a high L1 miss rate as well as by a high branch misprediction rate. Thread level parallelism is also affected by this fact, preventing SMT from outperforming CGMT or FGMT.
An interesting observation is the average performance improvement of VB-FGMT, which is a simple multithreading design, over ROB-SMT, which introduces more complex hardware in the issue stage to schedule instructions from different threads in the same cycle. The reason is that the benefit obtained by filling empty issue slots with instructions from various threads in ROB-SMT is compensated in VB-FGMT by the extra performance gained from the efficient management of the VB structure, which prevents the pipeline from stalling so often.


[Four graphs: gcc, mgrid, wupwise, and art; throughput (IPC) vs. number of threads]

Figure 5.11: Scalability of different multithread designs for ROB/VB architectures with a single replicated benchmark.

5.3.2 Multithreading Scalability: Performance vs. Complexity


In this section, we investigate the impact on performance of the number of hardware threads (n), ranging from 1 to 8. This factor limits to n the number of active software contexts whose state can be stored simultaneously in a multithreaded processor. In each experiment, a specific benchmark is launched with as many instances as there are architected hardware threads.
Figure 5.11 shows the results of this experiment for four different benchmarks, on the ROB and VB architectures implementing FGMT, CGMT, and SMT. The top left graph contains the results of benchmark gcc, which shows behaviour characteristic of integer applications. The out-of-order retirement of instructions obtains scarce performance benefits for integer workloads, as already observed in [34][35], so the rest of this section will focus on results for floating-point benchmarks.
The other three graphs correspond to the floating-point benchmarks mgrid, wupwise, and art, and show strongly contrasting results. Looking at the top right graph (mgrid), one can observe important effects when n is increased. On one hand, CGMT provides neither gain nor loss of performance up to 3 threads. In this case, the benefits of multithreading come from avoiding software context switches, which are not necessary in a system with n logical processors executing n software contexts. However, a further increase of n has negative effects on the global IPC, shown more clearly in the case of VB-CGMT.
The FGMT and SMT curves belonging to the ROB architecture (in all graphs) show well-known effects of multithreading. While a fine-grain design reaches better performance with ascending values of n up to a maximum of 4, an SMT design is capable of exploiting the thread level parallelism in a more scalable manner.
The FGMT and SMT curves belonging to the VB architecture show a similar evolution when n ≤ 4, where SMT immediately outperforms FGMT. Nevertheless, when n > 4, they differ from the ROB curves in the sense that the SMT scalability is no longer noticeable. The reason is that, for floating-point benchmarks, the VB reduces the probability of the processor pipeline stalling due to a lack of space in the ROB, which supposes a strong bottleneck alleviation, so we get an early and sharp increase of instruction sources as n is increased. Since the functional unit utilization with 4 threads is already high in VB-SMT, an indiscriminate choice of instructions to enter the pipeline only results in a worse instruction scheduling, and no performance improvement is achieved. In this case, sophisticated resource allocation policies such as DCRA would be necessary to maintain SMT scalability.
Finally, it is important to compare the VB-FGMT and ROB-SMT curves (mgrid and wupwise). As the results of Section 5.3.1 already showed, VB-FGMT provides, on average, better IPC for the evaluated benchmark mixes, which try to be representative of current multithreaded processors [36] ranging from 2 to 4 threads. The simpler VB-FGMT implementation can still be used, reaching higher performance up to approximately 4 threads. With a higher number of threads, the ROB-SMT scalability prevails, and a VB-SMT design is needed to keep the advantages of the VB architecture.

5.3.3 Fetch Policies


This section studies the influence on the VB microarchitecture of several instruction fetch policies (RR, ICOUNT, PDG, and DCRA), which were explained in Chapter 3.
The modelled processor follows the parameters of Table 5.3, except the one


Figure 5.12: Evaluation of fetch policies for the ROB and VB architectures

implementing DCRA, which shares the instruction fetch queue, instruction queue, and load-store queue among hardware threads. The reason is that DCRA does not only assign different and variable fetch slots to threads, but also obtains benefits by dynamically assigning different numbers of entries of the shared resources to threads.
On the one hand, Figure 5.12 shows the pronounced advantage of a sophisticated instruction fetch policy in SMT processors. As the average values suggest, any fetch policy other than RR provides larger benefits than the replacement of a ROB by a VB. On the other hand, Figure 5.12 illustrates that the advantages of fetch policies also apply to the VB architecture. If we compare the advanced fetch policies (the three right bars of the Average group) against the naive VB-RR policy, we obtain on average 28.4%, 32.9%, and 40.2% benefits for VB-ICOUNT, VB-PDG, and VB-DCRA, respectively. Comparing these three policies against the ROB-DCRA policy (the best ROB-based policy), the performance speedup reaches 12%, 15.6%, and 21.9%, respectively.
Although some improved fetch policy is needed in the VB microarchitecture in order to outperform a ROB-DCRA architecture, there is no need to implement the most effective one (VB-DCRA), which might require more complex hardware. Instead, the instruction counters added by ICOUNT are sufficient to make the VB-based approach rise above ROB-DCRA. However, we can also see that the ability to retire instructions out of order, combined with optimized fetch policies, allows the greatest performance improvement, as both techniques contribute with their orthogonal potential.



Figure 5.13: Filled issue slots for different SMT architectures and fetch policies.

5.3.4 Resources occupancy


In this section, we measure the average occupancy of both bandwidth and storage resources for the previously studied fetch policies over ROB-SMT and VB-SMT. As is well known, SMT processors aim to reduce the waste of issue slots, putting higher stress on the functional units and thus achieving a higher throughput in the execution stage. In this section, we measure the impact of retiring instructions out of order on this performance metric.
Figure 5.13 represents the issue bandwidth utilization as the percentage of execution cycles in which a specific number of issue slots is filled. Results clearly show how the VB architecture reduces the horizontal waste, since the plotted regions corresponding to fewer than 7/8 issue slots are significantly smaller for VB-SMT. Vertical waste is also diminished for VB-SMT, which can be observed in the solid black regions, slightly smaller for the VB bars. These results also corroborate the relationship between filled issue slots and MT processor performance, when comparing Figures 5.12 and 5.13. The highest occupancy is achieved with VB-DCRA, which fills 7 or 8 issue slots in almost 50% of the execution time.
It is important to compare the bars corresponding to the ROB-DCRA and VB-ICOUNT designs in Figures 5.12 and 5.13. Figure 5.12 points to ROB-DCRA as the most performance-effective ROB design, while Figure 5.13 shows that it exploits the issue bandwidth most efficiently. On the other hand, it is shown that a simple ICOUNT policy is enough to make VB-ICOUNT outperform ROB-DCRA (a naive RR fetch policy is not enough). This fact designates VB-ICOUNT as a

[Two graphs: a) instruction queue occupancy and b) load-store queue occupancy; probability vs. fraction of queue entries, for ROB-DCRA and VB-ICOUNT]

Figure 5.14: Storage resources occupancy for ROB-DCRA and VB-ICOUNT.

cost-effective solution, reaching higher performance than the most complex fetch policy for ROB-SMT while implementing a simple fetch policy in VB-SMT.
In addition to the issue slots, we have investigated the occupancy of storage resources, focusing on the most efficient ROB design (ROB-DCRA) and the VB design with the simplest efficient fetch policy (VB-ICOUNT). Figure 5.14 shows the occupancy of the instruction queue (IQ) and the load-store queue (LSQ) for these designs. The curves in the graphs are to be interpreted as the probability (Y axis) of a resource having an occupancy equal to or greater than a specific fraction of its entries (X axis).
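Such curves can be obtained from a per-cycle occupancy trace as in the following sketch; the sampling scheme and names are ours, not the simulator's:

    #include <stddef.h>

    /* Probability that a queue holds at least `fraction` of its `capacity`,
     * estimated over a trace of per-cycle occupancy samples. */
    static double prob_occupancy_at_least(const unsigned *occupancy, size_t cycles,
                                          unsigned capacity, double fraction)
    {
        unsigned threshold = (unsigned)(fraction * (double)capacity);
        size_t hits = 0;
        for (size_t c = 0; c < cycles; c++)
            if (occupancy[c] >= threshold)
                hits++;
        return (double)hits / (double)cycles;
    }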
As one can observe, VB-ICOUNT causes a lower occupancy both in the IQ and in the LSQ. In the case of the IQ (Figure 5.14a), one can appreciate that only 50% of the IQ entries are being used in ROB-DCRA, meaning that the IQ is over-dimensioned. The unused fraction of the IQ grows to almost 70% in the case of VB-ICOUNT. Something similar happens with the LSQ (Figure 5.14b), which could be implemented 20% smaller without practically affecting performance. As a consequence, VB-ICOUNT does not only outperform ROB-DCRA with a simpler fetch policy, but also permits a reduction of the main storage resources' size while maintaining the performance gains.


Chapter 6
Related Work
Long latency operations constrain the output rate of the ROB, and thus microprocessor performance. Recent microprocessor mechanisms have been proposed to deal with this problem [3, 5, 4]. In essence, these proposals permit retiring instructions in a speculative mode when a long latency operation blocks the ROB head. These solutions introduce specific hardware to checkpoint the architectural state at specific times and guarantee correct execution. When a misprediction occurs, the processor rolls back to the checkpoint, discarding all subsequent computation. Some of these proposals have been extended for use in multiprocessor systems [37, 38].
In [3], Kirman et al. propose the checkpointed early load retirement mechanism, which follows this general approach. To unclog the ROB when a long-latency load instruction blocks the ROB head, a predicted value is provided to the dependent instructions so that they can continue. When the value of the load is fetched from memory, it is compared against the predicted one. On a misprediction, the processor must roll back to the checkpoint.
In [5], Cristal et al. propose to replace the ROB structure with a mechanism to perform checkpoints at specific instructions. This mechanism uses a CAM structure for register mapping purposes, which is also in charge of freeing physical registers. Stores must wait in the commit stage, not modifying the machine state until the closest previous checkpoint has committed. In addition, instructions taking a long time to issue (e.g., those dependent on a load) are moved from the instruction queue to a secondary buffer, thus freeing resources that can be used by other instructions. These instructions must be re-inserted into the instruction queue when the instruction they depend on has completed (e.g., the load data has already been fetched). This problem has also been tackled by Akkary et al. in [4].
In [39], Martinez et al. propose an in-order retirement mechanism that identifies irreversible instructions in order to free resources early. Unlike the VB microarchitecture, this proposal retires instructions in order. This proposal, as well as the works discussed above, needs checkpointing to roll back the processor to a correct state.
In [6], a checkpoint-free approach is presented. However, this proposal still uses a ROB, and scans the n oldest entries of the ROB to select the instructions to be retired. This fact constrains the proposal, making it unsuitable for large ROB sizes. In addition, resources are handled as in a typical ROB-based processor, without any focus on improving resource usage.
Finally, the performance degradation caused by ROB blocking could also be alleviated by enlarging the major microprocessor structures or by managing them efficiently [40, 41, 42].


Chapter 7
Conclusions
As a first contribution of this work, we have proposed the VB microarchitecture, which aims at retiring instructions out of order while still providing support for speculation and precise exception handling. Unlike most previous proposals, the out-of-order commit mechanism proposed in this work does not require hardware to perform checkpointing, because out-of-order instruction retirement is correct by design.
Performance has been compared against two ROB-based proposals, one retiring instructions in order and the other out of order. Results are encouraging, since with only a 32-entry validation buffer, and assuming the remaining major processor structures unbounded, our proposal achieves performance similar to that of the other evaluated architectures with a 256-entry ROB. Moreover, when sizing the major microprocessor structures close to the ones implemented in a modern processor, an 8-entry VB microarchitecture outperforms the compared architectures with a 128-entry ROB.
Concerning major processor resource requirements, results show that, besides achieving better performance, the resource usage does not increase in the VB microarchitecture. For instance, the register file and load/store queue usages are reduced. This feature makes our proposal an interesting alternative for power-aware implementations. It is also shown that the validation buffer has a lower occupancy than the instruction queue. Therefore, in the VB microarchitecture the hardware dealing with instruction retirement is, in general, no longer the main microprocessor structure constraining performance.
The second main contribution of this work is the combination of the out-of-order retirement VB microarchitecture with different models of multithreading. This has led to the observation that both techniques contribute orthogonally to increasing processor performance. We also explored the behaviour of different thread selection policies at the fetch stage (i.e., fetch policies) on the resulting multithreaded VB architecture.

Our simulations of multithreaded environments are performed using a set of benchmark mixes, typically used for evaluation purposes on multithreading, and their results provide three main conclusions: (i) a fine-grain multithreaded VB-based processor outperforms, on average, a simultaneous multithreaded ROB-based processor; (ii) a simultaneous multithreaded VB-based processor reaches its maximum performance with about half the number of hardware threads of a simultaneous multithreaded ROB-based processor; (iii) the benefits of fetch policies (such as DCRA) are orthogonal to the ones provided by the VB. These contributions justify the viability and cost-effectiveness of an out-of-order commit, multithreaded processor microarchitecture.

7.1 Contributions and Future Work


This work is the first part of a PhD thesis whose main topic is research on out-of-order commit architectures. A new microarchitecture is under development which does not use any FIFO queue at all (such as the Reorder Buffer or the Validation Buffer) to maintain the program order of instructions in the pipeline, preventing the processor from stalling due to a lack of space in this structure.
The evaluation of all proposals is planned to be performed both on uniprocessor and on multiprocessor environments (such as multithreaded and multicore processors). In this sense, issues related to parallel architectures, such as synchronization and memory consistency, will be studied in the new proposals. The simulation framework used to evaluate the novel architectures is Multi2Sim [33], which will be extended and documented, making the new releases publicly available.

7.2 Publications Related with this Work


The following papers related to this work were submitted and accepted for publication in different conferences. They are enumerated next:

R. Ubal, S. Petit, J. Sahuquillo, P. Lopez and J. Duato, A First Approach to Non-Speculative Out-of-Order Instructions Retirement, XVIII Jornadas de Paralelismo, Zaragoza (Spain), 2007.

R. Ubal, J. Sahuquillo, S. Petit, P. Lopez and J. Duato, The Validation Buffer Microarchitecture for Multithreaded Processors, ACACES Summer School, L'Aquila (Italy), July 2007.

R. Ubal, J. Sahuquillo, S. Petit, P. Lopez and J. Duato, VB-MT: Design Issues and Performance of the Validation Buffer Microarchitecture for Multithreaded Processors, The 16th International Conference on Parallel Architectures and Compilation Techniques, Brasov (Romania), September 2007.

R. Ubal, J. Sahuquillo, S. Petit and P. Lopez, A Simulation Framework to Evaluate Multicore-Multithreaded Processors, 19th International Symposium on Computer Architecture and High Performance Computing, Gramado (Brazil), October 2007.

R. Ubal, J. Sahuquillo, S. Petit, P. Lopez and J. Duato, The Impact of Out-of-Order Commit in Coarse-Grain, Fine-Grain and Simultaneous Multithreaded Architectures, to appear in the 22nd IEEE International Parallel and Distributed Processing Symposium, Miami (Florida, USA), April 2008.

An additional paper has been submitted and is currently under review:

S. Petit, J. Sahuquillo, P. Lopez, R. Ubal and J. Duato, A Complexity-Effective Out-of-Order Retirement Microarchitecture, IEEE Transactions on Computers.


Bibliography
[1] J.E. Smith and A.R. Pleszkun. Implementation of precise interrupts in pipelined processors. In Proceedings of the 12th Annual International Symposium on Computer Architecture, pages 36-44, June 1985.
[2] S. Palacharla, N.P. Jouppi, and J.E. Smith. Complexity-effective superscalar processors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, June 1997.
[3] N. Kirman, M. Kirman, M. Chaudhuri, and J. Martínez. Checkpointed early load retirement. In Proceedings of the International Symposium on High-Performance Computer Architecture, February 2005.
[4] H. Akkary, R. Rajwar, and S. T. Srinivasan. Checkpoint processing and recovery: Towards scalable large instruction window processors. In Proceedings of the 36th International Symposium on Microarchitecture, December
2003.
[5] A. Cristal, D. Ortega, J. Llosa, and M. Valero. Out-of-order commit processors. In Proceedings of the International Symposium on High-Performance Computer Architecture, February 2004.
[6] G.B. Bell and M.H. Lipasti. Deconstructing commit. In Proceedings of the International Symposium on Performance Analysis of Systems and Software, March 2004.
[7] R. E. Kessler. The Alpha 21264 microprocessor. IEEE Micro, 19(2):24-36, March 1999.
[8] J. M. Tendler, S. Dodson, S. Fields, H. Le, and B. Sinharoy. Power4 system
microarchitecture, technical white paper. IBM Server Group, October 2001.
[9] G. Hinton, D. Sager, and M. Upton et al. The microarchitecture of the Pentium 4 processor. Intel Technology Journal. Q1, 2001.


[10] D.M. Tullsen and S.J. Eggers and H.M. Levy. Simultaneous Multithreading:
Maximizing On-Chip Parallelism. 22nd Annual International Symposium
on Computer Architecture, June 1995.
[11] R. Kalla and B. Sinharoy and J.M. Tendler. IBM Power5 Chip: a Dual-Core
Multithreaded Processor. IEEE Micro, March-April 2004.
[12] P. Kongetira and K. Aingaran and K. Olukotun. Niagara: a 32-way Multithreaded Sparc Processor. IEEE Micro, March-April 2005.
[13] C. McNairy and R. Bhatia. Montecito: a Dual-Core, Dual-Thread Itanium
Processor. IEEE Micro, March-April 2005.
[14] J.E. Smith and G. Sohi. The microarchitecture of superscalar processors. Proceedings of the IEEE, 83(2), December 1995.

[15] M. Moudgill, K. Pingali, and S. Vassiliadis. Register renaming and dynamic speculation: an alternative approach. In Proceedings of the 26th International Symposium on Microarchitecture, pages 202-213, December 1993.

[16] K.C. Yeager. The MIPS R10000 superscalar microprocessor. IEEE Micro, pages 28-40, April 1996.
[17] J.P. Shen and M.H. Lipasti. Modern Processor Design. McGraw-Hill, 2005.
[18] S. E. Raasch and S. K. Reinhardt. The Impact of Resource Partitioning on
SMT Processors. 12th International Conference on Parallel Architectures
and Compilation Techniques, 2003.
[19] F. J. Cazorla and A. Ramirez and M. Valero and E. Fernandez. Dynamically Controlled Resource Allocation in SMT Processors. In Proceedings of
the 37th annual IEEE/ACM International Symposium on Microarchitecture,
pages 171182, 2004.
[20] J. Sharkey and D. Balkan and D. Ponomarev. Adaptive Reorder Buffers for
SMT Processors. In Proceedings of the 15th International Conference on
Parallel Architectures and Compilation Techniques, pages 244253, 2006.
[21] D.M. Tullsen and S.J. Eggers and J.S. Emer and H.M. Levy and J.L. Lo and
R.L. Stamm. Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor. 23rd Annual International
Symposium on Computer Architecture, May 1996.


[22] A. El-Moursy and D.H. Albonesi. Front-End Policies for Improved Issue
Efficiency in SMT Processors. Proceedings of the 9th International Conference on High Performance Computer Architecture, Feb 2003.
[23] R. Ubal. Multi2Sim. www.gap.upv.es/raurte/tools/multi2sim.html.
[24] MIPS Technologies, Inc. MIPS32TM Architecture For Programmers, volume
I: Introduction to the MIPS32TM Architecture. 2001.
[25] MIPS Technologies, Inc. MIPS32TM Architecture For Programmers, volume
II: The MIPS32TM Instruction Set. 2001.
[26] D.C. Burger and T.M. Austin. The simplescalar tool set, version 2.0. Computer Architecture News, 25(3), 1997.
[27] M. R. Marty, B. Beckmann, L. Yen, A. R. Alameldeen, M. Xu, and K. Moore. GEMS: Multifacet's General Execution-driven Multiprocessor Simulator. International Symposium on Computer Architecture, 2006.
[28] D. R. Butenhof. Programming with POSIX® Threads. Addison Wesley Professional, 1997.

[29] P. Sweazey and A.J. Smith. A Class of Compatible Cache Consistency Protocols and Their Support by the IEEE Futurebus. 13th International Symposium on Computer Architecture, pages 414-423, June 1986.
[30] Standard Performance Evaluation Corporation. http://www.spec.org/cpu2000/.

[31] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically characterizing large scale program behavior. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-X), October 2002.

[32] Free Software Foundation. GCC online documentation, 2006. [Online]. Available: http://www.gnu.org/software/gcc/onlinedocs/.
[33] R. Ubal, J. Sahuquillo, S. Petit, and P. Lopez. Multi2Sim: A Simulation
Framework to Evaluate Multicore-Multithreaded Processors. 19th International Symposium on Computer Architecture and High Performance Computing, October 2007.


[34] M. Pericas and A. Cristal and R. Gonzalez and D.A. Jimenez and M. Valero.
A Decoupled KILO-Instruction Processor. 11th International Conference
on High Performance Computer Architecture, February 2006.
[35] G. Bell and M. Lipasti. Deconstructing commit. International Symposium
on Performance Analysis of Systems and Software, March 2004.
[36] S. Choi and D. Yeung. Learning-Based SMT Processor Resource Distribution via Hill-Climbing. 33rd International Symposium on Computer Architecture, June 2006.
[37] M. Kirman, N. Kirman, and J.F. Martínez. Cherry-MP: Correctly integrating checkpointed early resource recycling in chip multiprocessors. In Proceedings of the International Symposium on Microarchitecture, November 2005.

[38] E. Vallejo, M. Galluzzi, A. Cristal, F. Vallejo, R. Beivide, P. Stenstrom, J. E. Smith, and M. Valero. Implementing kilo-instruction multiprocessors. In IEEE Conference on Pervasive Services, invited lecture, July 2005.

[39] J.F. Martínez, J. Renau, M.C. Huang, M. Prvulovic, and J. Torrellas. Cherry: checkpointed early resource recycling in out-of-order processors. In Proceedings of the 35th International Symposium on Microarchitecture, November 2002.
[40] S. E. Raasch, N. L. Binkert, and S. K. Reinhardt. A scalable instruction
queue design using dependence chains. In Proceedings of the 29th Annual
International Symposium on Computer Architecture, May 2002.
[41] R. Balasubramonian, S. Dwarkadas, and D.H. Albonesi. Reducing the complexity of the register file in dynamic superscalar processors. In Proceedings
of the 34th Int. Symp. on Microarchitecture, December 2001.
[42] I. Park, C.L. Ooi, and T.N. Vijaykumar. Reducing design complexity of the
load/store queue. In Proceedings of the 36th International Symposium on
Microarchitecture, December 2003.

