
CS 433 Midterm Exam Mar 7, 2012 Professor Sarita Adve Time: 2 Hours

Please clearly print your name and NetID and circle the appropriate category in the space provided below.

Name:
NetID:
Category: Undergraduate / Graduate

Instructions

1. You may only use class handouts from this semester's offering, the course text (Computer Architecture: A Quantitative Approach, 5th Edition, by Hennessy and Patterson), your own homework submissions for this course, papers indicated as reference material in class, and notes written or typed by yourself. You may also use homework solutions and sample midterms provided on the course website. No other materials are allowed, including other books, notes prepared by others, or materials from previous offerings of this course (except as noted here) or from other universities.

2. You can use electronic equipment for an e-book, but only if networking is switched off. Additionally, no application other than the e-book should be open.

3. Calculators are allowed. You may not use any other electronic devices.

4. Please do not turn in loose scrap paper. Limit your answers to the space provided if possible. If this is not possible, please write on the back of the same sheet. You may use the back of each sheet for scratch work.

5. In all cases, show your work. No credit will be given for numeric answers if there is no
indication of how the answer was derived. Partial credit will be given even if your final solution is incorrect, provided you show the intermediate steps in reaching the final solution.

6. If you believe a problem is incorrectly or incompletely specified, make a reasonable assumption and solve the problem. The assumption should not result in a trivial solution. In all cases, clearly state any assumptions that you make in your answers.

7. This exam has 6 problems and 12 pages (including this one and the last page for scratch work). All students should solve problems 1 through 5. Only graduate students should solve problem 6. Please budget your time appropriately. Good luck!

Problem   Maximum Points              Received Points
1         3
2         15
3         18
4         12
5         6
6         6 (only for graduates)
Total     54 for undergraduates
          60 for graduates

Problem 1 [3 points]

Assume that we make an enhancement to a computer that reduces the execution time of part of a program's execution by a factor of 10. Assume the enhancement is used 50% of the total execution time, measured as a percentage of the execution time when the enhancement is in use.

Part (A) [1.5 points] What is the overall speedup we have obtained from the enhancement?

Part (B) [1.5 points] What percentage of the original execution time has been converted to enhanced mode?
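The relationship between the two time fractions in this problem can be checked with a short Python sketch. It is not part of the exam: the function name `amdahl_from_enhanced_time` and its parameters are hypothetical, and the model assumes the 50% figure is a fraction of the new (enhanced) execution time, as the problem states.

```python
def amdahl_from_enhanced_time(frac_enhanced_new, factor):
    """Return (overall speedup, fraction of ORIGINAL time that was enhanced).

    frac_enhanced_new: fraction of the NEW execution time spent in enhanced mode.
    factor: how many times faster the enhanced part runs.
    """
    new_time = 1.0  # normalize the new execution time to 1
    # Undo the enhancement: the enhanced part used to take `factor` times longer.
    old_enhanced = frac_enhanced_new * new_time * factor
    old_unenhanced = (1.0 - frac_enhanced_new) * new_time
    old_time = old_unenhanced + old_enhanced
    return old_time / new_time, old_enhanced / old_time


speedup, frac_original = amdahl_from_enhanced_time(0.5, 10)
print(speedup, round(frac_original, 2))  # 5.5 0.91
```

Normalizing the new time to 1 keeps the two questions separate: the speedup is a ratio of total times, while the converted fraction is measured against the original time only.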

Solution:

a. Let old be the original execution time and new be the execution time with the enhancement.
old = 0.5 * new + 0.5 * 10 * new = 5.5 * new
speedup = old / new = 5.5

b. Let x be the fraction of the original execution time converted to enhanced mode. The unenhanced part of the original execution is equal in time to the enhanced part after it is sped up by a factor of 10, therefore:
(1 - x) = x / 10
10 - 10x = x
10 = 11x
x = 10/11 = 0.91 (approximately)

Grading: 1.5 points for each of the two equations. No deduction for calculation mistakes.

Problem 2 [15 Points]

This problem concerns Tomasulo's algorithm (with reservation stations) with the reorder buffer as discussed in detail in the lecture notes, with the following changes/additions/clarifications.

Assume only the following functional units: one integer unit that uses 1 cycle for EX, one non-pipelined floating point adder that uses 3 cycles in EX, and one non-pipelined floating point divide unit that uses 10 cycles in EX. The integer unit is used for addition, address calculations for loads and stores, and to resolve branch conditions.

The processor can issue and commit at most one instruction per cycle. The CDB supports only one broadcast per cycle. Assume you have an unlimited number of reservation stations and reorder buffer entries. An instruction waiting for data on the CDB can move to EX in the cycle after the CDB broadcast.

An instruction waiting to write to the CDB holds its execution unit until it gets the CDB; i.e., it prevents other instructions needing the same functional unit from beginning execution. Assume that integer instructions also follow Tomasulo's algorithm (analogous to the floating point instructions), so they can be issued out of order, and the result from the integer functional unit is also broadcast on the CDB and forwarded to dependent instructions through the CDB.

Assume both branches below are not taken in the iteration considered.

Complete the blank entries in the following table using the above specifications. For each instruction, fill in the cycle numbers in each pipeline stage (CM stands for commit) and indicate where its source operands are read from (use RF for register file, ROB for reorder buffer, and CDB for common data bus). The entries for the first instruction and for the issue stage are filled in for you. Entries that are solid are not meaningful for the corresponding instruction and entries with dots should be ignored.

Instruction (Iteration 1)     IS   EX     WB   CM   Operand1  Operand1 Source  Operand2  Operand2 Source
Loop:    L.D   F0,0(R1)       1    2      3    4    R1        RF               -         -
         ADD.D F2,F0,F2       2    4-6    7    8    F0        CDB              F2        RF
         L.D   F4,0(R3)       3    4      5    9    R3        RF               -         -
         BEZ   F4,Skipped     4    6      -    10   F4        CDB              -         -
         DIV.D F4,F2,F4       5    8-17   18   19   F2        CDB              F4        CDB
Skipped: SUBI  R1,R1,#4       6    7      8    20   R1        RF               -         -
         SUBI  R3,R3,#4       7    8      9    21   R3        RF               -         -
         Some FP instr        8    ..     -    ..   ..        ..               ..        ..
         Some FP instr        9    ..     -    ..   ..        ..               ..        ..
         S.D   F4,4(R3)       10   ..     -    ..   F4        CDB              R3        ROB
         Many FP instr        ..   ..     -    ..   ..        ..               ..        ..
         BNEZ  R3,Loop        23   ..     -    ..   R1        RF               -         -

A dash (-) marks an entry that is not meaningful for the instruction or is not given; ".." marks an entry to be ignored.

Grading: 0.5 point for each entry. Add 0.5 points if all entries are correct. Cascading errors will not be penalized additionally as long as the relevant dependencies are still observed.

Problem 3: Loop Unrolling (18 Points)

In this problem, we will use a single-issue, in-order MIPS pipeline similar to those studied in class, but with the following specification.

There is 1 integer functional unit, taking 1 cycle to perform integer addition (including effective address calculation for loads/stores), subtraction, and logic operations.
There is 1 FP/integer multiplier, taking 8 cycles to perform multiplication. It is pipelined.
There is 1 FP adder, taking 3 cycles to perform FP additions and subtractions. It is pipelined.
There is 1 FP/integer divider, taking 24 cycles. It is NOT pipelined.
There is full forwarding and bypassing, including forwarding from the end of an FU to the MEM stage for stores.
Loads and stores complete in one cycle. That is, they spend one cycle in the MEM stage after the effective address calculation.
There are as many registers, both FP and integer, as you need.
Branches are resolved in ID and there is one branch delay slot.
If multiple instructions finish their EX stages in the same cycle, then we will assume they can all proceed to the MEM stage together. Similarly, if multiple instructions finish their MEM stages in the same cycle, then we will assume they can all proceed to the WB stage together. In other words, for the purpose of this problem, you are to ignore structural hazards on the MEM and WB stages.

This problem explores the ability of the compiler to schedule code as efficiently as possible for such a pipeline.

Loop: L.D    F4, 0(R1)
      MUL.D  F8, F4, F0
      L.D    F6, 0(R2)
      ADD.D  F10, F6, F2
      ADD.D  F12, F8, F10
      S.D    F12, 0(R3)
      DADDUI R1, R1, #8
      DADDUI R2, R2, #8
      DADDUI R3, R3, #8
      DSUB   R5, R4, R1
      BNEZ   R5, Loop

Part A. [6 points] Consider the role of the compiler in scheduling the code. Rewrite this loop, but let every row take a cycle. If an instruction can't be issued on a given cycle (because the current instruction has a dependency that will not be resolved in time), write STALL instead, and move on to the next cycle (row) to see if it can be issued then. Explain all stalls, but don't reorder instructions. How many cycles elapse before the second iteration begins? Show your work (i.e., write instructions in order as they would be executed and write STALL in between instructions for each cycle where there would be a stall). Remember there is 1 branch delay slot.

Solution:

Loop: L.D    F4, 0(R1)
      stall                  RAW F4
      MUL.D  F8, F4, F0
      L.D    F6, 0(R2)
      stall                  RAW F6
      ADD.D  F10, F6, F2
      stall                  RAW F8, F10
      stall                  RAW F8, F10
      stall                  RAW F8
      stall                  RAW F8
      ADD.D  F12, F8, F10
      stall                  RAW F12
      S.D    F12, 0(R3)
      DADDUI R1, R1, #8
      DADDUI R2, R2, #8
      DADDUI R3, R3, #8
      DSUB   R5, R4, R1
      stall                  RAW R5, since the branch is resolved in ID
      BNEZ   R5, Loop
      NOP                    stall for branch delay

20 cycles elapse before the next iteration begins.
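The stall count above can be reproduced with a small issue-time model in Python. This is a sketch under stated assumptions, not the course's method: the `schedule` function, the `LATENCY` table, and the tuple encoding of instructions are all hypothetical. The model tracks only RAW hazards, treats a store's data as needed one cycle after issue (forwarding to MEM), treats a branch operand as needed one cycle before issue (resolution in ID), and ignores structural hazards.

```python
# Cycles from issue until a dependent instruction can use the result in EX.
# A load's value comes out of MEM, one cycle after its EX (assumed model).
LATENCY = {"L.D": 2, "MUL.D": 8, "ADD.D": 3, "DADDUI": 1, "DSUB": 1}


def schedule(program):
    """Return the issue cycle of each (op, dest, sources) tuple, in order."""
    ready = {}   # register -> first cycle a dependent EX stage may start
    cycle = 0
    issue_cycles = []
    for op, dest, sources in program:
        earliest = cycle + 1  # single issue, in order
        for position, reg in enumerate(sources):
            need = ready.get(reg, 0)
            if op == "S.D" and position == 0:
                need -= 1     # store data is forwarded to MEM, one stage later
            elif op == "BNEZ":
                need += 1     # branch reads its operand in ID, one stage earlier
            earliest = max(earliest, need)
        cycle = earliest
        issue_cycles.append(cycle)
        if dest is not None:
            ready[dest] = cycle + LATENCY[op]
    return issue_cycles


# The unscheduled loop from Part A, with the delay-slot NOP as the last row.
PART_A = [
    ("L.D", "F4", ["R1"]),
    ("MUL.D", "F8", ["F4", "F0"]),
    ("L.D", "F6", ["R2"]),
    ("ADD.D", "F10", ["F6", "F2"]),
    ("ADD.D", "F12", ["F8", "F10"]),
    ("S.D", None, ["F12", "R3"]),   # first source is the stored data
    ("DADDUI", "R1", ["R1"]),
    ("DADDUI", "R2", ["R2"]),
    ("DADDUI", "R3", ["R3"]),
    ("DSUB", "R5", ["R4", "R1"]),
    ("BNEZ", None, ["R5"]),
    ("NOP", None, []),
]

print(schedule(PART_A)[-1])  # 20
```

Gaps between consecutive issue cycles correspond to the stall rows in the listing; the same function can be applied to a reordered instruction list to count the stalls that remain after rescheduling.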

Grading: 1 point for each sequence of stalls. Partial credit for indicating that there is a stall between a pair of instructions, but with an incorrect number of cycles. Points are deducted for each unnecessary sequence of stalls.

Part B. [6 points] Now reschedule the loop to compute the same results as quickly as possible. You can change immediate values and memory offsets and reorder instructions, but don't change anything else. Show any stalls that remain. How many cycles elapse before the second iteration begins? Show your work.

Solution:

Loop: L.D    F4, 0(R1)
      L.D    F6, 0(R2)
      MUL.D  F8, F4, F0
      ADD.D  F10, F6, F2
      DADDUI R1, R1, #8
      DADDUI R2, R2, #8
      DADDUI R3, R3, #8
      DSUB   R5, R4, R1
      stall                  RAW F8
      stall                  RAW F8
      ADD.D  F12, F8, F10
      BNEZ   R5, Loop
      S.D    F12, -8(R3)

13 cycles elapse before the second iteration begins.

Grading: Full points for any correct sequence with the minimum number of stalls. Partial credit only if the sequence does the same computation and reduces some stalls. Points are deducted for each error (e.g., incorrect index) and for each stall in excess of 2.

Part C. [6 points] Now unroll the loop the minimum number of times needed to eliminate all stalls (with rescheduling). Show the unrolled and rescheduled loop. You can, and should, remove redundant instructions. How many original iterations of the loop are in an iteration of your new unrolled loop? How many cycles elapse before the next iteration of the unrolled loop begins? Don't worry about start-up or clean-up code outside the unrolled loop. Show your work.

Solution: Note that in the solution below, the registers used could be different and there is some flexibility in scheduling the instructions.

Loop: L.D    F4, 0(R1)
      L.D    F6, 0(R2)
      MUL.D  F8, F4, F0
      L.D    F14, 8(R1)
      L.D    F16, 8(R2)
      MUL.D  F18, F14, F0
      ADD.D  F10, F6, F2
      ADD.D  F20, F16, F2
      DADDUI R1, R1, #16
      DADDUI R2, R2, #16
      DADDUI R3, R3, #16
      ADD.D  F12, F8, F10
      DSUB   R5, R4, R1
      ADD.D  F22, F18, F20
      S.D    F12, -16(R3)
      BNEZ   R5, Loop
      S.D    F22, -8(R3)

There are two original iterations in an iteration of the new loop. 17 cycles elapse before the next iteration of the new loop begins.

Grading: 1 point for the correct iteration count. Points are deducted for every error or stall cycle. Give partial credit (3 points) if the student uses three iterations instead of two and the solution is correct with three iterations.

Problem 4 [12 points]

Consider a loop that is entered several times in a program. Each time it is entered, the loop performs 8 iterations. Each iteration executes four branches with the following outcomes (branch 1 occurs before branch 2 which occurs before branch 3 which occurs before branch 4 in each iteration):

Iteration   1  2  3  4  5  6  7  8
Branch 1    N  T  T  N  N  T  T  N
Branch 2    T  T  T  T  T  N  N  N
Branch 3    T  N  N  T  T  T  T  N
Branch 4    T  T  T  T  T  T  T  N

When Branch 4 is not taken at iteration 8, the program leaves the loop. Assume any branches between executions of the loop do not affect local histories or prediction entries of any of the above branches, and the global branch prediction history is all Not Taken every time the loop begins. Assume the predictor tables have infinite storage and the loop occurs enough times that the initial state of the predictors at the beginning of the program does not matter.

For each of the three branches below, describe the predictor with the best misprediction rate, explain why that predictor works well for this branch, and give the state of that predictor at the end of the 1,000th invocation of the loop. When giving the state for a history-based predictor, indicate which history a given prediction corresponds to. Consider only local and global correlating predictors, saturating counters, and static predictions. History may not be longer than 2 branches, and counters may not be larger than 2 bits. For full credit, you should give the simplest predictor that achieves the same misprediction rate, where counter-based predictors are considered simpler than history-based ones and global history is simpler than local history.

(A) Branch 1:

Solution: A local (2,1) predictor will predict every branch correctly. This works well because the decision of branch 1 is highly correlated with its own two previous outcomes. The predictor state is T/T/N/N, assuming the first entry is for history NN, the second is for history NT, the third for history TN, and the fourth for history TT.

(B) Branch 2:

Solution: A 1-bit counter makes 2 incorrect predictions per invocation of the loop (8 iterations). The branch has long runs of the same decision, and a 1-bit counter costs a minimal number of mispredictions when switching between taken and not taken. The final predictor state at the end of an invocation is 0 (not taken).

(C) Branch 3:

Solution: A global (2,1) predictor will predict every branch correctly. This works well because the decision is highly correlated with the decisions of the earlier branches in the same iteration. The predictor state is N/T/T/N (with the same correspondence to history bits as above).

Grading: For each case, 2 points for the correct predictor and justification and 2 points for the correct state. If the correct state is given for a wrong predictor, 2 points will be given.
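The three predictors named in the solutions can be simulated directly as a check. The Python sketch below is illustrative only: the `simulate` function and its data layout are hypothetical, history strings are written oldest outcome first, unseen table entries are assumed to start at N, and the global history is reset to NN at each loop entry as the problem specifies.

```python
# Per-iteration outcomes of each branch for one invocation (iterations 1..8).
B1 = "NTTNNTTN"
B2 = "TTTTTNNN"
B3 = "TNNTTTTN"
B4 = "TTTTTTTN"


def simulate(invocations=1000):
    local_b1 = {}     # branch 1: local 2-bit history -> 1-bit prediction
    hist_b1 = "NN"    # branch 1's own last two outcomes, oldest first
    counter_b2 = "N"  # branch 2: single 1-bit counter
    global_b3 = {}    # branch 3: global 2-bit history -> 1-bit prediction
    misses = {}
    for _ in range(invocations):
        ghist = "NN"  # global history is all Not Taken at loop entry
        misses = {"b1": 0, "b2": 0, "b3": 0}  # count the latest invocation only
        for i in range(8):
            # Branch 1: local (2,1) predictor.
            if local_b1.get(hist_b1, "N") != B1[i]:
                misses["b1"] += 1
            local_b1[hist_b1] = B1[i]
            hist_b1 = hist_b1[1] + B1[i]
            ghist = ghist[1] + B1[i]
            # Branch 2: 1-bit counter (predict the previous outcome).
            if counter_b2 != B2[i]:
                misses["b2"] += 1
            counter_b2 = B2[i]
            ghist = ghist[1] + B2[i]
            # Branch 3: global (2,1) predictor; history is (branch 1, branch 2).
            if global_b3.get(ghist, "N") != B3[i]:
                misses["b3"] += 1
            global_b3[ghist] = B3[i]
            ghist = ghist[1] + B3[i]
            # Branch 4 only shifts the global history in this sketch.
            ghist = ghist[1] + B4[i]
    return misses, local_b1, counter_b2, global_b3


misses, local_b1, counter_b2, global_b3 = simulate()
order = ("NN", "NT", "TN", "TT")
print(misses)                                  # mispredictions in the last invocation
print("/".join(local_b1[h] for h in order))    # T/T/N/N
print(counter_b2)                              # N
print("/".join(global_b3[h] for h in order))   # N/T/T/N
```

Branch 1 repeats with period 4 across invocations, which is why two bits of its own history suffice, while branch 3 is the exclusive-or of branches 1 and 2 in the same iteration, which is exactly what the two most recent global outcomes capture.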

Problem 5 [6 Points]

Suppose we have a deeply pipelined processor, for which we implement a branch-target buffer for conditional branches only. Assume that the misprediction penalty is always four cycles and the buffer miss penalty is always three cycles. Assume a 90% hit rate, 90% accuracy, and 15% branch frequency. How much faster is the processor with the branch-target buffer versus a processor that has a fixed two-cycle branch penalty? Assume the base cycles per instruction (CPI) without branch stalls is one.

Solution: We are given the base CPI without branch stalls. From this we can compute the stall cycles per instruction without the BTB (Stalls1) and with the BTB (Stalls2), and from those CPI1 (with no BTB) and CPI2 (with BTB). The resulting speedup is given as follows, where CPI represents the base CPI.

Speedup = CPI1 / CPI2 = (CPI + Stalls1) / (CPI + Stalls2)
Stalls1 = 15% * 2 = 0.30

Stalls2 is calculated from the following table.

BTB Result   BTB Prediction   Frequency per Instruction      Penalty (cycles)
Miss         -                15% * 10% = 1.5%               3
Hit          Correct          15% * 90% * 90% = 12.15%       0
Hit          Incorrect        15% * 90% * 10% = 1.35%        4

Stalls2 = (1.5% * 3) + (12.15% * 0) + (1.35% * 4) = 0.099
Speedup = (1.0 + 0.30) / (1.0 + 0.099) = 1.18, or about 1.2

Grading scheme: 1 point for the correct penalty in each of the three cases: BTB miss, BTB hit (correct prediction), and BTB hit (incorrect prediction). Give 1 point for the stalls using no branch prediction and 2 points for correct use of the speedup formula.
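The stall arithmetic can be reproduced with a few lines of Python. This is a sketch of the formula above rather than anything from the exam; the function name `btb_speedup` and its parameter names are hypothetical, with defaults taken from the problem statement. Evaluated without intermediate rounding, Stalls2 comes to 0.099 and the speedup still rounds to 1.2.

```python
def btb_speedup(branch_freq=0.15, hit_rate=0.90, accuracy=0.90,
                miss_penalty=3, mispredict_penalty=4,
                fixed_penalty=2, base_cpi=1.0):
    """Speedup of a BTB design over a fixed-penalty design (CPI ratio)."""
    # Without the BTB, every branch pays the fixed penalty.
    stalls_fixed = branch_freq * fixed_penalty
    # With the BTB, only misses and mispredicted hits cost cycles.
    stalls_miss = branch_freq * (1 - hit_rate) * miss_penalty
    stalls_wrong = branch_freq * hit_rate * (1 - accuracy) * mispredict_penalty
    stalls_btb = stalls_miss + stalls_wrong
    return (base_cpi + stalls_fixed) / (base_cpi + stalls_btb)


print(round(btb_speedup(), 2))  # 1.18
```

Keeping the penalties and rates as parameters makes it easy to see how sensitive the result is, for example to a lower hit rate or a longer misprediction penalty.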

Problem 6: GRADUATE PROBLEM (all graduate students should solve this problem) [6 points]

You are a member of a team designing an out-of-order processor with dynamic scheduling and speculative execution. Your initial design was just reviewed by the circuit implementation team, and it turns out that you have some spare transistor budget! (A rare occurrence in practice.) Your processor currently has a small 2-bit saturating counter-based branch predictor which performs moderately well. It has 8 Integer Functional Units and 4 Floating Point Units (FPUs), 256 KB of on-chip caches, 4 reservation stations for the Integer Units, and 2 reservation stations for FPUs. The Reorder Buffer has 8 entries. The processor has a 25-stage pipeline.

The applications you care about have a small code size and work on small data sets in the range of 64 KB. These applications spend most of their time in loops whose iterations are independent of each other, but typically have only a limited amount of ILP within a single iteration (within the current processor implementation).

You can use the extra transistors in (possibly several of) the following ways:

1. Improve the branch predictor accuracy.
2. Add more reservation stations to your Tomasulo's Algorithm-based Dynamic Scheduler.
3. Add more FPUs and Integer Units.
4. Add more Reorder Buffer entries.

Some of these may be desirable additions, while others may not be too beneficial given the current configuration. There is a meeting coming up to discuss the proposed additions. Which of the above four additions should you support and which ones should you oppose (you can support/oppose multiple of these)? You need to justify your choices to receive credit.

Solution:

1. Improve the branch predictor accuracy: This is a desirable addition. The problem states that the branch predictor performs only moderately well and the processor has a long pipeline, making branch mispredictions expensive. Thus, improving branch prediction accuracy is quite likely to increase performance.

2. Add more reservation stations to your Tomasulo's Algorithm-based Dynamic Scheduler: This is a desirable addition. More reservation stations would mean a larger window within which the processor can search for ready instructions to execute, so it can discover more parallelism and keep execution units busy. This would lead to better performance, especially since our application needs to discover parallelism across loop iterations.

3. Add more FPUs and Integer Units: This doesn't seem to be a good addition. The current machine already has enough FUs and we should try to improve other aspects of the processor. Adding more FUs won't help if the processor is unable to discover enough parallelism in the instruction stream to keep them busy.

4. Add more Reorder Buffer entries: This is a desirable addition. The current configuration has very few ROB entries. A large ROB helps to mask the effects of long-latency instructions and helps search for parallelism within a larger window (this goes together with (2)).

Grading: 1.5 points for correctly analyzing each part. 6 points total. Give partial credit (0.5 points) if the student gives a valid reason for why an improvement can be avoided.
