
CS433g: Computer System Organization Fall 2005

Practice Set 5
Memory Hierarchy

Please refer to the newsgroup message for instructions on obtaining EXTRA CREDIT for this homework.

Problem 1
Consider a system with 4-way set associative cache of 256 KB. The cache line size is 8 words
(32 bits per word). The smallest addressable unit is a byte, and memory addresses are 64 bits
long.
a. Show the division of the bits in a memory address and how they are used to access the cache.
Solution:
We are given that the block size is 8 words (32 bytes). Therefore, the number of bits required to
specify the block offset is log2(32) = 5 bits. The number of sets is 256 KB / (32 * 4) = 2048 sets.
Therefore, the index field requires 11 bits. The remaining 64 - 11 - 5 = 48 bits are used for
the tag field.
b. Draw a diagram showing the organization of the cache and, using your answer from part (a),
indicate how physical addresses are related to cache locations.
Solution:
The diagram would look similar to Figures 5.4 and/or 5.5 from H&P. We know that any physical
address with the same index bits will map to the same set in the cache. The tag is used to
distinguish between these physical locations.
c. What memory addresses can map to set 289 of the cache?
Solution:
Memory locations with index bits 00100100001 will map to set 289.
d. What percentage of the cache memory is used for tag bits?
Solution:
For each cache line (block), we have 1 tag entry. The size of the cache line is 32 * 8 = 256 bits.
Therefore, the percentage of cache memory used for tag bits is 48 / (48 + 256) = 15.8%.
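
As a quick check on the arithmetic above, the following C sketch (not part of the original solution; parameter names are illustrative) derives the offset, index, and tag widths and the tag overhead from the Problem 1 parameters.

#include <stdio.h>
#include <math.h>

int main(void) {
    /* Problem 1 parameters */
    const long cache_bytes = 256 * 1024;   /* 256 KB cache */
    const int  assoc       = 4;            /* 4-way set associative */
    const int  block_bytes = 32;           /* 8 words x 4 bytes */
    const int  addr_bits   = 64;           /* 64-bit memory addresses */

    int  offset_bits = (int)log2((double)block_bytes);        /* log2(32)   = 5  */
    long sets        = cache_bytes / (block_bytes * assoc);   /* 2048 sets       */
    int  index_bits  = (int)log2((double)sets);               /* log2(2048) = 11 */
    int  tag_bits    = addr_bits - index_bits - offset_bits;  /* 64-11-5    = 48 */

    /* Part d: fraction of cache storage spent on tags */
    double line_data_bits = block_bytes * 8.0;                /* 256 bits per line */
    double tag_overhead   = (double)tag_bits / (tag_bits + line_data_bits);

    printf("offset = %d bits, index = %d bits, tag = %d bits, sets = %ld\n",
           offset_bits, index_bits, tag_bits, sets);
    printf("tag overhead = %.1f%%\n", 100.0 * tag_overhead);  /* ~15.8% */
    return 0;
}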

Problem 2
You are building a computer system around a processor with in-order execution that runs at 1
GHz and has a CPI of 1, excluding memory accesses. The only instructions that read or write data
from/to memory are loads (20% of all instructions) and stores (5% of all instructions).

The memory system for this computer has a split L1 cache. Both the I-cache and the D-cache are
direct mapped and hold 32 KB each. The I-cache has a 2% miss rate and 64 byte blocks, and the
D-cache is a write-through, no-write-allocate cache with a 5% miss rate and 64 byte blocks. The
hit time for both the I-cache and the D-cache is 1 ns. The L1 cache has a write buffer. 95% of
writes to L1 find a free entry in the write buffer immediately. The other 5% of the writes have to
wait until an entry frees up in the write buffer (assume that such writes arrive just as the write
buffer initiates a request to L2 to free up its entry and the entry is not freed up until the L2 is done
with the request). The processor is stalled on a write until a free write buffer entry is available.
The L2 cache is a unified write-back, write-allocate cache with a total size of 512 KB and a block
size of 64-bytes. The hit time of the L2 cache is 15ns. Note that this is also the time taken to write
a word to the L2 cache. The local hit rate of the L2 cache is 80%. Also, 50% of all L2 cache
blocks replaced are dirty. The 64-bit wide main memory has an access latency of 20ns (including
the time for the request to reach from the L2 cache to the main memory), after which any number
of bus words may be transferred at the rate of one bus word (64-bit) per bus cycle on the 64-bit
wide 100 MHz main memory bus. Assume inclusion between the L1 and L2 caches and assume
there is no write-back buffer at the L2 cache. Assume a write-back takes the same amount of time
as an L2 read miss of the same size.
While calculating any time values (such as hit time, miss penalty, AMAT), please use ns
(nanoseconds) as the unit of time. For miss rates below, give the local miss rate for that cache.
By miss penaltyL2, we mean the time from the miss request issued by the L2 cache up to the time
the data comes back to the L2 cache from main memory.

Part A
Computing the AMAT (average memory access time) for instruction accesses.
i. Give the values of the following terms for instruction accesses.
hit timeL1, miss rateL1, hit timeL2, miss rateL2
hit timeL1 = 1 processor cycle = 1 ns
miss rateL1= 0.02
hit timeL2 = 15 ns
miss rateL2 = 1 - 0.8 = 0.2
ii. Give the formula for calculating miss penaltyL2, and compute the value of miss
penaltyL2.
miss penaltyL2 = memory access latency + time to transfer one L2 cache block
Transfer rate of memory bus = 64 bits / bus cycle = 64 bits / 10 ns = 8 bytes / 10 ns = 0.8
bytes / ns
Time to transfer one L2 cache block = 64 bytes / (0.8 bytes/ns) = 80 ns.
So, miss penaltyL2 = 20 + 80 = 100 ns
However, 50% of all replaced blocks are dirty and so they need to be written back to main
memory. This takes another 100 ns.
Therefore, miss penaltyL2 = 100 + 0.5 x 100 = 150 ns

iii. Give the formula for calculating the AMAT for this system using the five terms whose
values you computed above and any other values you need.
AMAT = hit timeL1 + miss rateL1 x (hit timeL2 + miss rateL2 x miss penaltyL2)
iv. Plug in the values into the AMAT formula above, and compute a numerical value for
AMAT for instruction accesses.
AMAT = 1 + 0.02 x (15 + 0.2 x 150) = 1.9 ns
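
The miss penalty and instruction AMAT can be reproduced with a short C sketch like the one below (illustrative only; it simply plugs the Part A values into the formulas above).

#include <stdio.h>

int main(void) {
    /* Part A parameters (all times in ns) */
    double hit_time_L1  = 1.0;     /* L1 hit time                  */
    double miss_rate_L1 = 0.02;    /* I-cache miss rate            */
    double hit_time_L2  = 15.0;    /* L2 hit time                  */
    double miss_rate_L2 = 0.20;    /* L2 local miss rate = 1 - 0.8 */

    double mem_latency  = 20.0;               /* main memory access latency  */
    double bus_rate     = 8.0 / 10.0;         /* 8 bytes per 10 ns bus cycle */
    double block_xfer   = 64.0 / bus_rate;    /* 64-byte block -> 80 ns      */
    double clean_penalty = mem_latency + block_xfer;               /* 100 ns */
    double miss_penalty_L2 = clean_penalty + 0.5 * clean_penalty;  /* + dirty write-back */

    double amat_instr = hit_time_L1 +
                        miss_rate_L1 * (hit_time_L2 + miss_rate_L2 * miss_penalty_L2);

    printf("miss penalty L2 = %.0f ns\n", miss_penalty_L2);    /* 150 ns  */
    printf("AMAT (instructions) = %.2f ns\n", amat_instr);     /* 1.90 ns */
    return 0;
}
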
Part B
Computing the AMAT for data reads.
i. Give the value of miss rateL1 for data reads.
miss rateL1 = 0.05

ii. Calculate the value of the AMAT for data reads using the above value, and other values
you need.
AMAT = hit timeL1 + miss rateL1 x (hit timeL2 + miss rateL2 x miss penaltyL2)
AMAT = 1 + 0.05 x (15 + 0.2 x 150)
= 3.25 ns
Part C
Computing the AMAT for data writes.
i. Give the value of miss penaltyL2 for data writes.
miss penaltyL2 = miss penaltyL2 for the data read case
So, miss penaltyL2 = 150 ns
(Assuming that after the block is read into the L2 cache from main memory, no further time is
spent writing to it; in other words, the time to write to it is included in the 150 ns value. This
value of 150 ns is used in the solutions for all subsequent parts. A value of 151 ns is also equally
acceptable, assuming that one additional cycle (1 ns) is spent writing to the block once it has
arrived in the L2 cache.)
ii. Give the value of write timeL2Buff for a write buffer entry being written to the L2 cache.
As the L2 cache hit rate is 80%, only 20% of the write buffer writes will miss in the L2
cache and will thus incur the miss penaltyL2.
So, write timeL2Buff = hit timeL2 + 0.2 x miss penaltyL2
= 15 + 0.2 x 150
= 45 ns
iii. Calculate the value of the AMAT for data writes using the above two values, and any
other values that you need. Only include the time that the processor will be stalled. Hint:
There are two cases to be considered here depending upon whether the write buffer is full
or not.
There are two cases to consider here. In 95% of the cases the write buffer will have empty
space, so the processor will only need to wait 1 cycle. In the remaining 5% of the cases,
the write buffer will be full, and the processor will have to wait for the additional time
taken for a buffer entry to be written to the L2 cache, which is write timeL2Buff.

AMAT = hit timeL1 + 0.05 x write timeL2Buff
= 1 + 0.05 x 45
= 3.25 ns

Part D
Compute the overall CPI, including memory accesses (instructions plus data). Assume that
there is no overlap between the latencies of instruction and data accesses.
The CPI excluding memory accesses = 1
We are given that 20% of the instructions are data reads (loads), and 5% are data writes
(stores).
Also, note that 100% of the instructions require an instruction fetch.
Since one clock cycle on this system is 1 ns, we can use the AMAT values directly.
So, CPI including memory accesses
= 1 + (AMAT for instructions - 1) + 0.2 x (AMAT for data reads - 1) + 0.05 x (AMAT for data writes - 1)
= 1 + (1.9 - 1) + 0.2 x (3.25 - 1) + 0.05 x (3.25 - 1)
= 2.46
= 2.46
Note: We are subtracting 1 cycle (1ns) from all of the AMAT times (instruction, data
read and data write), because in the pipeline 1 cycle of memory access is already
accounted for in the CPI of 1.
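
A similar sketch (again illustrative) combines the three AMAT values into the overall CPI, subtracting the one cycle already counted in the base CPI:

#include <stdio.h>

int main(void) {
    /* AMAT values from Parts A-C, in ns (= cycles, since 1 cycle = 1 ns) */
    double amat_instr = 1.90;
    double amat_read  = 3.25;
    double amat_write = 3.25;

    double base_cpi   = 1.0;
    double load_frac  = 0.20;
    double store_frac = 0.05;

    /* Subtract 1 from each AMAT because one memory-access cycle is
     * already included in the base CPI of 1. */
    double cpi = base_cpi
               + (amat_instr - 1.0)                 /* every instruction fetches */
               + load_frac  * (amat_read  - 1.0)
               + store_frac * (amat_write - 1.0);

    printf("overall CPI = %.2f\n", cpi);            /* 2.46 */
    return 0;
}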

Problem 3
Way prediction allows an associative cache to provide the hit time of a direct-mapped cache. The
MIPS R10000 processor uses way prediction to achieve a different goal: reduce the cost of the
chip package. The R10000 hardware includes an on-chip L1 cache, on-chip L2 tag comparison
circuitry, and an on-chip L2 way prediction table. L2 tag information is brought on chip to detect
an L2 hit or miss. The way prediction table contains 8K 1-bit entries, each corresponding to two
L2 cache blocks. L2 cache storage is built external to the processor package, is 2-way associative,
and may have one of several block sizes.
a. How can way prediction reduce the number of pins needed on the R10000 package to
read L2 tags and data, and what is the impact on performance compared to a package
with a full complement of pins to interface to the L2 cache?
Solution:
When way prediction is not used, the chip would need to access L2 tags for both associative
ways. Ideally, this would be done in parallel; thus, the R10000 and L2 chips would need
enough pins to bring both tags onto the processor for comparison. With way prediction, we
need only bring the tag for the way that was predicted; in the less likely case where the
predicted way is incorrect, we could load the other tag with minimal penalty.

b. What is the performance drawback of just using the same smaller number of pins but not
including way prediction?

Solution:
To use the smaller number of pins without way prediction, we would check the tags for the
two ways one after the other. Now, when we have a hit, on average half the time we will get
the correct tag first, and half the time we will get the correct tag second. With the way
prediction, we were getting the correct tag a high fraction of the time, so the average L2
access time will be higher.

c. Assume that the R10000 uses most-recently used way prediction. What are reasonable
design choices for the cache state update(s) to make when the desired data is in the
predicted way, the desired data is in the non-predicted way, and the desired data is not in
the L2 cache? Please fill in your answers in the following table.
Solution:
Cache Access Case | Way prediction entry | Cache state change: tag and valid bits | Cache state change: cache data
Desired data is in the predicted way | No change | No change | No change
Desired data is in the non-predicted way | Flip the way prediction bit | No change | No change
Desired data is not in the L2 cache | Set way prediction bit to point to new location of the data | Set tag and valid bits | Bring data from memory

d. For a 1024 KB L2 cache with 64-byte blocks and 8-way set associativity, how many way
prediction table entries are needed?
Solution:
The number of blocks in the L2 cache = 1024KB / 64B = 16K
The number of sets in the L2 cache = 16K / 8 = 2K
Thus, the number of way prediction table entries needed = 2K
e. For an 8 MB L2 cache with 128-byte blocks and 2-way set associativity, how many way
prediction table entries are needed?
Solution:
The number of blocks in the L2 cache = 8MB / 128B = 64K
The number of sets in the L2 cache = 64K / 2 = 32K
Thus, the number of way prediction table entries needed = 32K
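
The entry counts for parts d) and e) follow directly from the block and set counts; the small C sketch below (helper name is illustrative) captures that arithmetic.

#include <stdio.h>

/* One way-prediction entry is needed per L2 set (each entry picks a way
 * within its set), so entries = (cache size / block size) / associativity. */
static long wp_entries(long cache_bytes, long block_bytes, long assoc) {
    long blocks = cache_bytes / block_bytes;
    return blocks / assoc;
}

int main(void) {
    printf("part d: %ldK entries\n",
           wp_entries(1024L * 1024, 64, 8) / 1024);     /* 2K  */
    printf("part e: %ldK entries\n",
           wp_entries(8L * 1024 * 1024, 128, 2) / 1024); /* 32K */
    return 0;
}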

f. What is the difference in the way that the R10000 with only 8K way prediction table
entries will support the cache in part d) versus the cache in part e)? Hint: Think about the
similarity between a way prediction table and a branch prediction table.
Solution:
Since the R10000 way prediction table has 8K entries, it can easily support the cache in part
d). However, this table is too small to accommodate all of the 32K entries required by the
cache in part e). One idea is to make each prediction entry in part e) correspond to four
different sets. However, this introduces the possibility of interference, just like we have seen
previously with branch history tables.

Problem 4
Consider the following piece of code:
register int i, j;                      /* i, j are in the processor registers */
register float sum1, sum2, a[64][64], b[64][64];
for ( i = 0; i < 64; i++ )              /* 1 */
{
    for ( j = 0; j < 64; j++ ){         /* 2 */
        sum1 += a[i][j];                /* 3 */
    }
    for ( j = 0; j < 32; j++ ){         /* 4 */
        sum2 += b[i][2*j];              /* 5 */
    }
}
Assume the following:
There is a perfect instruction cache; i.e., do not worry about the time for any instruction
accesses.
Both int and float are of size 4 bytes.
Assume that only the accesses to the array locations a[i][j] and b[i][2*j] generate loads to the
data cache. The rest of the variables are all allocated in registers.
Assume a fully associative, LRU data cache with 32 lines, where each line has 16 bytes.
Initially, the data cache is empty.
The arrays a and b are stored in row major form.
To keep things simple, we will assume that statements in the above code are executed
sequentially. The time to execute lines (1), (2), and (4) is 4 cycles for each invocation. Lines
(3) and (5) take 10 cycles to execute and an additional 40 cycles to wait for the data if there is
a data cache miss.
There is a data prefetch instruction with the format prefetch(array[index]). This prefetches the
entire block containing the word array[index] into the data cache. It takes 1 cycle for the
processor to execute this instruction and send it to the data cache. The processor can then go
ahead and execute subsequent instructions. If the prefetched data is not in the cache, it takes
40 cycles for the data to get loaded into the cache.
Assume that the arrays a and b both start at cache line boundaries.

a. How many cycles does the above code fragment take to execute if we do NOT use
prefetching? Also calculate the average number of cycles per outer-loop iteration.
Solution:
Number of cycles taken by line 1 = 64 x 4 = 256
Number of cycles taken by line 2 = 64 x 64 x 4 = 16384

Number of cycles taken by line 3 = 64 x 64 x (10 + 40/4) = 81920
Note that for line 3 every fourth cache access will be a miss, and that's where the 40/4 comes
from.
Number of cycles taken by line 4 = 64 x 32 x 4 = 8192
Number of cycles taken by line 5 = 64 x 32 x (10 + 40/2) = 61440
Note that for line 5 every second cache access will be a miss, and that's where the 40/2 comes
from.
Total number of cycles taken by the entire code fragment
= 256 + 16384 + 81920 + 8192 + 61440
= 168192 cycles
The average # of cycles per outer loop iteration = 168192 / 64 = 2628
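
As a cross-check, the following C sketch (illustrative) accumulates the per-line cycle costs used above and reproduces the totals.

#include <stdio.h>

int main(void) {
    const int outer = 64, inner_a = 64, inner_b = 32;
    const int loop_cost = 4, add_cost = 10, miss_cost = 40;

    long line1 = outer * loop_cost;                                    /* 256   */
    long line2 = (long)outer * inner_a * loop_cost;                    /* 16384 */
    /* a[i][j]: 16-byte lines hold 4 floats, so 1 miss per 4 accesses */
    long line3 = (long)outer * inner_a * (add_cost + miss_cost / 4);   /* 81920 */
    long line4 = (long)outer * inner_b * loop_cost;                    /* 8192  */
    /* b[i][2*j]: stride of 2 floats, so 1 miss per 2 accesses */
    long line5 = (long)outer * inner_b * (add_cost + miss_cost / 2);   /* 61440 */

    long total = line1 + line2 + line3 + line4 + line5;
    printf("total = %ld cycles, per outer iteration = %ld\n",
           total, total / outer);                   /* 168192 and 2628 */
    return 0;
}
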
b. Consider inserting prefetch instructions for the two inner loops for the arrays a and b
respectively. Explain why we may need to unroll the loops to insert prefetches? What is the
minimum number of times you would need to unroll for each of the two loops for this
purpose?
Solution:
Since the cache line is 16 bytes long and the size of a float is 4 bytes, a cache block holds 4 floats.
Thus, one prefetch instruction will bring in four elements of the array. So we only need to do a
prefetch every 4 iterations of the first inner loop operating on array a. To achieve this, unroll the
loop 4 times. For the second inner loop operating on array b, we need to do a prefetch every 2
iterations, so unroll this loop 2 times.
c. Unroll the inner loops for the number of times identified in part b, and insert the minimum
number of software prefetches to minimize execution time. The technique to insert prefetches
is analogous to software pipelining. You do not need to worry about startup and cleanup code
and do not introduce any new loops.
Solution:
for (i=0; i<64; i++) {                  /* (1) */
    for(j=0; j<64; j+=4) {              /* (2) */
        prefetch(a[i][j+4]);            /* (3-0) */
        sum1 += a[i][j];                /* (3-1) */
        sum1 += a[i][j+1];              /* (3-2) */
        sum1 += a[i][j+2];              /* (3-3) */
        sum1 += a[i][j+3];              /* (3-4) */
    }
    for(j=0; j<32; j+=2) {              /* (4) */
        prefetch(b[i][2*(j+4)]);        /* (5-0) */
        sum2 += b[i][2*j];              /* (5-1) */
        sum2 += b[i][2*(j+1)];          /* (5-2) */
    }
}
d. How many cycles does each outer loop iteration of the code in part (c) take to execute?
Calculate the average speedup over the code without prefetching. Assume prefetches are not
present in the startup code. Extra time needed by prefetches executing beyond the end of the
loop execution time should not be counted.
Solution:

Number of cycles taken by line 1 = 4
Number of cycles taken by line 2 = 16 x 4 = 64
Number of cycles taken by line 3-0 = 16 x 1 = 16
Number of cycles taken by line 3-1 = 15 x 10 + 1 x (10 + 40) = 200
Note that for line 3-1 the first iteration of the inner loop will always cause a miss because the
prefetch is getting the data required by the (j+1)st iteration, i.e., the 2nd iteration.
Number of cycles taken by line 3-2 = 16 x 10 = 160
Number of cycles taken by line 3-3 = 16 x 10 = 160
Number of cycles taken by line 3-4 = 16 x 10 = 160
Number of cycles taken by line 4 = 16 x 4 = 64
Number of cycles taken by line 5-0 = 16 x 1 = 16
Number of cycles taken by line 5-1 = 14 x 10 + 2 x (10 + 40) = 240
Note that for line 5-1 the first two iterations of the inner loop will always cause a miss because
the prefetch is getting the data required by the (j+2)nd iteration, i.e., the 3rd iteration.
Number of cycles taken by line 5-2 = 16 x (10) = 160
Total number of cycles taken by the entire code fragment
= 4 + 64 + 16 + 200 + 160 + 160 + 160 + 64 + 16 + 240 + 160
= 1244 cycles
The speedup over the code without prefetching = 168192 / (1244 x 64) = 2.11
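
The per-iteration count and the speedup can likewise be checked with a short sketch (illustrative, using the per-line costs listed above).

#include <stdio.h>

int main(void) {
    /* Cycles for one outer-loop iteration of the prefetching version */
    long cycles = 4                             /* line 1 */
                + 16 * 4                        /* line 2 */
                + 16 * 1                        /* line 3-0: prefetch issue */
                + 15 * 10 + 1 * (10 + 40)       /* line 3-1: first access misses */
                + 16 * 10                       /* line 3-2 */
                + 16 * 10                       /* line 3-3 */
                + 16 * 10                       /* line 3-4 */
                + 16 * 4                        /* line 4 */
                + 16 * 1                        /* line 5-0: prefetch issue */
                + 14 * 10 + 2 * (10 + 40)       /* line 5-1: first two accesses miss */
                + 16 * 10;                      /* line 5-2 */

    printf("cycles per outer iteration = %ld\n", cycles);     /* 1244 */
    printf("speedup = %.2f\n", 168192.0 / (cycles * 64.0));   /* 2.11 */
    return 0;
}
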
e. In part (c) above it is possible that loop unrolling reduces performance by increasing code
size. Is there another technique that can be used to achieve the same objective as loop
unrolling in this example, but with fewer instructions? Explain this technique and illustrate its
use for the code in part (c).
Solution:
We could use an if statement to eliminate loop unrolling.
for (i=0; i<64; i++) {                  /* (1) */
    for(j=0; j<64; j++) {               /* (2) */
        if(j%4 == 0)                    /* (3-0) */
            prefetch(a[i][j+4]);        /* (3-1) */
        sum1 += a[i][j];                /* (3-2) */
    }
    for(j=0; j<32; j++) {               /* (4) */
        if(j%2 == 0)                    /* (5-0) */
            prefetch(b[i][2*(j+2)]);    /* (5-1) */
        sum2 += b[i][2*j];              /* (5-2) */
    }
}

Problem 5
Consider a system with the following processor components and policies:
A direct-mapped L1 data cache of size 4KB and block size of 16 bytes, indexed and tagged using physical addresses, and using a write-allocate, write-back policy
A fully-associative data TLB with 4 entries and an LRU replacement policy
Physical addresses of 32 bits, and virtual addresses of 40 bits
Byte addressable memory
Page size of 1MB

Part A
Which bits of the virtual address are used to obtain a virtual to physical translation from
the TLB? Explain exactly how these bits are used to make the translation, assuming there
is a TLB hit.
Solution
The virtual address is 40 bits long. Because the virtual page size is 1MB = 2^20 bytes,
and memory is byte addressable, the virtual page offset is 20 bits. Thus, the first 40 - 20 = 20
bits are used for address translation at the TLB. Since the TLB is fully associative,
all of these bits are used for the tag; i.e., there are no index bits.
When a virtual address is presented for translation, the hardware first checks to see if the
20 bit tag is present in the TLB by comparing it to all other entries simultaneously. If a
valid match is found (i.e., a TLB hit) and no protection violation occurs, the page frame
number is read directly from the TLB.
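
For concreteness, the following C sketch (not from the original solution; structure and names are illustrative) shows how the 20-bit virtual page number would be extracted and compared against all TLB entries in parallel.

#include <stdio.h>
#include <stdint.h>

#define TLB_ENTRIES 4
#define PAGE_OFFSET_BITS 20        /* 1 MB pages */

/* Simplified fully-associative TLB entry (valid bit, VPN tag, frame number) */
struct tlb_entry { int valid; uint32_t vpn; uint32_t pfn; };

/* Returns 1 on a hit and fills *pfn; 0 on a miss. Every entry is compared,
 * which models the parallel tag comparison done in hardware. */
int tlb_lookup(const struct tlb_entry tlb[TLB_ENTRIES], uint64_t vaddr, uint32_t *pfn) {
    uint32_t vpn = (uint32_t)(vaddr >> PAGE_OFFSET_BITS);   /* upper 20 of 40 bits */
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) { *pfn = tlb[i].pfn; return 1; }
    }
    return 0;
}

int main(void) {
    struct tlb_entry tlb[TLB_ENTRIES] = { {1, 0xFFFFF, 0xCFC}, {0}, {0}, {0} };
    uint32_t pfn;
    if (tlb_lookup(tlb, 0xFFFFFABAC1ULL, &pfn))
        printf("hit, frame = %X\n", (unsigned)pfn);   /* hit, frame = CFC */
    return 0;
}
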
Part B
Which bits of the virtual or physical address are used as the tag, index, and block offset
bits for accessing the L1 data cache? Explicitly specify which of these bits can be used
directly from the virtual address without any translation.
Solution
Since the cache is physically indexed and physically tagged, all of the bits from accessing
the cache must come from the physical address. However, since the lowest 20 bits of the
virtual address form the page offset and are therefore not translated, these 20 bits can be
used directly from the virtual address. The remaining 12 bits (of the total of 32 bits in the
physical address) must be used after translation.
Since the block size is 16 bytes = 2^4 bytes, and memory is byte addressable, the lowest
4 bits are used as block offset.
Since the cache is direct mapped, the number of sets is 4KB/16 bytes = 2^8. Therefore, 8
bits are needed for the index.
The remaining 32-8-4 = 20 bits are needed for the tag.
Tag (20 bits) | Index (8 bits) | Offset (4 bits)

As mentioned above, the index and offset bits can be used before translation while the tag
bits must await the translation for the 12 uppermost bits.

Part C
The following lists part of the page table entries corresponding to a few virtual addresses
(using hexadecimal notation). Protection bits of 01 imply read-only access and 11 implies
read/write access. Dirty bit of 0 implies the page is not dirty. Assume the valid bits of all
the following entries are set to 1.

   Virtual page number   Physical page number   Protection bits   Dirty bit
1  FFFFF                 CFC                    11                0
2  FFFFE                 CAC                    11                0
3  FFFFD                 CFC                    11                0
4  FFFFC                 CBA                    11                0
5  FFFFB                 CAA                    11                0
6  FFFFA                 CCA                    01                0

The following table lists a stream of eight data loads and stores to virtual addresses by the
processor (all addresses are in hexadecimal). Complete the rest of the entries in the table
corresponding to these loads and stores using the above information and your solutions to
parts A and B. For the data TLB hit, data cache hit, and protection violation columns,
specify yes or no. Assume initially the data TLB and data cache are both empty.

   Processor load/store to   Corresponding      Part of the physical    Data TLB   Data cache   Protection   Dirty
   virtual address           physical address   address used to index   hit?       hit?         violation?   bit
                                                the data cache
1  Store FFFFF ABAC1
2  Store FFFFC ECAB1
3  Load FFFFF BAAE3
4  Load FFFFB CEBC3
5  Store FFFFE AAFA1
6  Store FFFFC AABC9
7  Load FFFFD BAAE2
8  Store FFFFA ABAC4

Solution
   Processor load/store to   Corresponding      Part of the physical    Data TLB   Data cache   Protection   Dirty
   virtual address           physical address   address used to index   hit?       hit?         violation?   bit
                                                the data cache
1  Store FFFFF ABAC1         CFCABAC1           AC                      No         No           No           1
2  Store FFFFC ECAB1         CBAECAB1           AB                      No         No           No           1
3  Load FFFFF BAAE3          CFCBAAE3           AE                      Yes        No           No           0
4  Load FFFFB CEBC3          CAACEBC3           BC                      No         No           No           0
5  Store FFFFE AAFA1         CACAAFA1           FA                      No         No           No           1
6  Store FFFFC AABC9         CBAAABC9           BC                      Yes        No           No           1
7  Load FFFFD BAAE2          CFCBAAE2           AE                      No         Yes          No           0
8  Store FFFFA ABAC4         CCAABAC4           AC                      No         No           Yes          0
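
The physical address and cache index columns can be reproduced mechanically; the sketch below (illustrative, with the Problem 5 page table hard-coded) splices the translated frame number onto the 20-bit page offset and extracts index bits 11-4.

#include <stdio.h>
#include <stdint.h>

/* Translate VPN -> PPN using the Problem 5 page table, then splice the
 * 20-bit page offset back on to form the 32-bit physical address. */
static uint32_t translate(uint64_t vaddr) {
    static const struct { uint32_t vpn, ppn; } pt[] = {
        {0xFFFFF, 0xCFC}, {0xFFFFE, 0xCAC}, {0xFFFFD, 0xCFC},
        {0xFFFFC, 0xCBA}, {0xFFFFB, 0xCAA}, {0xFFFFA, 0xCCA},
    };
    uint32_t vpn = (uint32_t)(vaddr >> 20);
    for (unsigned i = 0; i < sizeof pt / sizeof pt[0]; i++)
        if (pt[i].vpn == vpn)
            return (pt[i].ppn << 20) | (uint32_t)(vaddr & 0xFFFFF);
    return 0;  /* not mapped */
}

int main(void) {
    uint64_t accesses[] = { 0xFFFFFABAC1ULL, 0xFFFFCECAB1ULL, 0xFFFFFBAAE3ULL,
                            0xFFFFBCEBC3ULL, 0xFFFFEAAFA1ULL, 0xFFFFCAABC9ULL,
                            0xFFFFDBAAE2ULL, 0xFFFFAABAC4ULL };
    for (int i = 0; i < 8; i++) {
        uint32_t pa = translate(accesses[i]);
        /* Cache index = bits 11..4 of the physical address */
        printf("%d: PA = %08X, index = %02X\n",
               i + 1, (unsigned)pa, (unsigned)((pa >> 4) & 0xFF));
    }
    return 0;
}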

Problem 6
Consider a 4-way set-associative L1 data cache with a total of 64 KB of data. Assume
the cache is write-back and with a block size of 16 bytes. Further, the cache is virtually-indexed
and physically-tagged, meaning that the index field of the address comes from the
virtual address generated by the CPU, and the tag field comes from the physical address
that the virtual address is translated into. The data TLB in the system has 128 entries and
is 2-way set associative. The physical and virtual addresses are 32 bits and 50 bits,
respectively. Both the cache and the memory are byte addressable. The physical
memory page size is 4 kilobytes.
Part A
Show the bits from the physical and virtual addresses that are used as the block offset,
index, and tag bits for the cache. Similarly, show the bits from the two addresses that are
used as the page offset, index, and tag bits for the TLB.
Solution
For the TLB, we only use the virtual address. Each page has 4 KB or 2^12 bytes, and we
need 12 bits to be able to uniquely address each of these bytes. So the page offset is the
least significant (rightmost) 12 bits of the address. The TLB has 128 entries divided into
64 or 2^6 sets, and we need 6 bits to be able to uniquely address each of these 64 sets. So
the next 6 bits represent the index. As stated, the virtual address is 50 bits long, so the
remaining 32 most significant bits make up the tag field of the address. This is illustrated
below.

Tag (32 bits): bits 49-18 | Index (6 bits): bits 17-12 | Page Offset (12 bits): bits 11-0
For the data cache, a cache block is 16 bytes or 2^4 bytes, so we need 4 bits to be able to
uniquely address each of these 16 bytes. So the block offset is the least significant (rightmost)
4 bits of the address. The number of blocks in the cache is 64 KB / 16 or 2^12
blocks. Since the cache is 4-way set associative, there are 2^10 sets in the cache, so we
need 10 bits to be able to uniquely identify each of these sets. So the next 10 bits
represent the index. As stated, the physical address is 32 bits long, so the remaining 18
most significant bits make up the tag field of the address. This is illustrated below.


Since the cache is physically tagged, the tag bits come from the physical address, using bits 14
to 31; the index and block offset bits come from the virtual address.

Tag (18 bits): bits 31-14, physical | Index (10 bits): bits 13-4, virtual | Block Offset (4 bits): bits 3-0, virtual
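
A quick C sketch (illustrative) of the field-width arithmetic for both the TLB and the cache in this problem:

#include <stdio.h>
#include <math.h>

int main(void) {
    /* TLB: 128 entries, 2-way set associative, 4 KB pages, 50-bit virtual address */
    int page_offset = (int)log2(4096.0);                 /* 12 */
    int tlb_index   = (int)log2(128.0 / 2.0);            /* 6  */
    int tlb_tag     = 50 - tlb_index - page_offset;      /* 32 */

    /* Cache: 64 KB, 4-way, 16-byte blocks, 32-bit physical address for the tag */
    int block_off   = (int)log2(16.0);                   /* 4  */
    int cache_index = (int)log2(64.0 * 1024 / 16 / 4);   /* 10 */
    int cache_tag   = 32 - cache_index - block_off;      /* 18 */

    printf("TLB:   tag=%d index=%d page offset=%d\n", tlb_tag, tlb_index, page_offset);
    printf("Cache: tag=%d index=%d block offset=%d\n", cache_tag, cache_index, block_off);
    return 0;
}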

Part B
Can the cache described for this problem have synonyms? If no, explain in detail why
not. If yes, first explain in detail why and then explain how and why you could eliminate
this problem by changing the cache configuration, but not changing the total data storage
in the cache? Would you expect this cache to perform worse or better than the original
design? Please provide a justification for your answer.
Solution
Yes, the original cache can have synonyms because 2 bits of the virtual page-frame
number are used in the index. These bits could result in the synonym problem since
different values for these bits may map to the same physical address. To eliminate this
problem, the cache needs to use only virtual bits that correspond directly to physical bits
(the page offset) for the index. This can be done by reducing the number of index bits
and increasing the number of tag bits. Since we need to reduce the number of index bits
by two, we make the cache 4-way x 2^2 = 16-way set associative. A cache lookup will
then use bits 12-31 as tag, bits 4-11 as index, and bits 0-3 as block offset. Since bits 0-11
are the page offset, we effectively use only physical address bits for indexing. Since the
cache is physically tagged, the cache lookup uses only physical address bits and there can
be no synonyms.
Cache lookup with the new configuration:
Tag (20 bits): bits 31-12, from the translated page-frame address | Index (8 bits): bits 11-4, from the page offset | Block Offset (4 bits): bits 3-0, from the page offset

For whether the new cache will perform better, reasonable answers will be accepted. The
preferred response is that the new cache has the advantage that the cache can be indexed
completely in parallel with the TLB access since the page offset is already aligned to the
boundary of the cache's tag and index fields. Therefore, the page offset can be used in its
un-translated form to index the cache. The disadvantage of this approach is that a 16-way
associative cache is difficult to build and will require many tag comparisons to operate in
parallel. Therefore, whether the new cache is faster depends on whether the advantage
outweighs the disadvantage.

Problem 7
A graduate student comes to you with the following graph. The student is performing
experiments by varying the amount of data accessed by a certain benchmark. The only
thing the student tells you of the experiments is that their system uses virtual memory, a
data TLB, only one level of data cache, and the data TLB maps a much smaller amount
of data than can be contained in the data cache. You may assume that there are no
conflict misses in the caches and TLB. Further assume that instructions always fit in the
instruction TLB and an L1 instruction cache.

[Figure: execution time versus data size (KB). The curve is divided into seven numbered
regions (1 through 7), with data-size points a, b, and c marked on the x-axis where the sharp
increases in execution time occur.]

Part A
Give an explanation for the shape of the curve in each of the regions numbered 1 through
7.
Solution
1: Execution time slowly increases (performance decreases) due to increasing data size
but remains at a roughly similar level.
2: At this point, the TLB overflows and execution time sharply increases to handle the
increased TLB misses.
3: Execution time again slowly increases due to increasing data size and plateaus at a
higher level than before due to overhead from TLB misses.
4: At this point, the data cache overflows, causing a high frequency of cache misses and
execution time again sharply increases.
5: Execution time again slowly increases due to increasing data size and plateaus at a
high level due to overhead from retrieving data directly from main memory due to cache
misses.
6: Execution time again sharply increases due to physical memory filling up and
thrashing occurring between disk and physical memory.

7: Execution time is very high due to overhead from TLB misses, cache misses and
virtual memory thrashing. It is slowly increasing due to increasing data size.

Part B
From the graph, can you make a reasonable guess at any of the following system
properties? If so, what are they? If not, why not? Explain your answers. (Note: your
answers can be in terms of a, b, and c).
(i) Number of TLB entries
(ii) Page size
(iii) Physical memory size
(iv) Virtual memory size
(v) Cache size
Solution
There is no reasonable guess for page size and virtual memory size. There is also no
reasonable guess for the number of TLB entries since it depends on the page size.
It is acceptable if you guess that the cache size is b KB and the physical memory size is c
KB, since these are the points at which the execution time shows significant
degradations. However, these quantities are actually only upper bounds, since the actual
size of these structures depends on the temporal and spatial reuse in the access stream.
(The actual size depends on a property known as the working set of the application.)
Problem 8
Consider a memory hierarchy with the following parameters. Main memory is interleaved
on a word basis with four banks and a new bank access can be started every cycle. It
takes 8 processor clock cycles to send an address from the cache to main memory; 50
cycles for memory to access a block; and an additional 25 cycles to send a word of data
back from memory to the cache. The memory bus width is 1 word. There is a single level
of data cache with a miss rate of 2% and a block size of 4 words. Assume 25% of all
instructions are data loads and stores. Assume a perfect instruction cache; i.e., there are
no instruction cache misses. If all data loads and stores hit in the cache, the CPI for the
processor is 1.5.
Part A
Suppose the above memory hierarchy is used with a simple in-order processor and the
cache blocks on all loads and stores until they complete. Compute the miss penalty and
resulting CPI for such a system.
Solution
Miss penalty = 8 + 50 + 25*4 = 158 cycles.
CPI = 1.5 + (0.25* 0.02 * 158) = 2.29

Part B
Suppose we now replace the processor with an out-of-order processor and the cache with
a non-blocking cache that can have multiple load and store misses outstanding. Such a
configuration can overlap some part of the miss penalty, resulting in a lower effective
penalty as seen by the processor. Assume that this configuration effectively reduces the
miss penalty (as seen by the processor) by 20%. What is the CPI of this new system and
what is the speedup over the system in Part A?
Solution
Effective miss penalty = 0.80 * 158 = 126 cycles.
CPI = 1.5 + (0.25 * .02 * 126) = 2.13
Speedup over the system in part A is 2.29/2.13 = 1.08.
Part C
Start with the system in Part A for this part. Suppose now we double the bus width and
the width of each memory bank. That is, it now takes 50 cycles for memory to access the
block as before, and the additional 25 cycles now send a double word of data back from
memory to the cache. What is the miss penalty now? What is the CPI? Is this system
faster or slower than that in Part B?
Solution
Miss penalty = 8 + 50 + 25*2 = 108 cycles.
CPI = 1.5 + (0.25 * .02 * 108) = 2.04.
This system is slightly faster than that in part B.
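
The three CPI calculations can be checked with a short C sketch (illustrative) that applies the miss-penalty formula for each configuration; the Part B penalty is rounded to 126 cycles as in the solution above.

#include <stdio.h>

/* CPI = base CPI + (fraction of loads/stores) x miss rate x miss penalty */
static double cpi(double miss_penalty) {
    return 1.5 + 0.25 * 0.02 * miss_penalty;
}

int main(void) {
    double penalty_a = 8 + 50 + 25.0 * 4;   /* 158 cycles: 4 one-word transfers       */
    double penalty_b = 126.0;               /* 0.80 x 158, rounded as in Part B       */
    double penalty_c = 8 + 50 + 25.0 * 2;   /* 108 cycles: 2 double-word transfers    */

    printf("Part A: penalty=%.0f CPI=%.2f\n", penalty_a, cpi(penalty_a));  /* 2.29 */
    printf("Part B: penalty=%.0f CPI=%.2f\n", penalty_b, cpi(penalty_b));  /* 2.13 */
    printf("Part C: penalty=%.0f CPI=%.2f\n", penalty_c, cpi(penalty_c));  /* 2.04 */
    printf("Speedup of B over A: %.2f\n", cpi(penalty_a) / cpi(penalty_b));
    return 0;
}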
