15-213
Cache Memories (class14.ppt)
Topics: generic cache memory organization, direct mapped caches, set associative caches, impact of caches on performance
Cache memories are small, fast SRAM-based memories managed automatically in hardware.
CPU looks first for data in L1, then in L2, then in main memory. Typical bus structure:
[Figure: typical bus structure. The CPU chip contains the register file, ALU, L1 cache, and bus interface; a cache bus connects to the off-chip L2 cache; the system bus, I/O bridge, and memory bus connect to main memory.]
15-213, F02
The tiny, very fast CPU register file has room for four 4-byte words.
The small fast L1 cache has room for two 4-word blocks.
The transfer unit between the cache and main memory is a 4-word block (16 bytes).
The big slow main memory has room for many 4-word blocks.
[Figure: main memory shown as a sequence of 4-word blocks, including block 10 (a b c d), block 21 (p q r s), and block 30 (w x y z).]
General organization of a cache memory: S sets, with E lines per set; each line consists of a valid bit, a tag, and a block of B bytes (bytes 0 .. B-1).
[Figure: sets 0 through S-1, each containing E lines of the form <valid, tag, byte 0, byte 1, ..., byte B-1>.]
Addressing Caches

An m-bit address A is partitioned into <tag> (the t high-order bits, m-1 down to s+b), <set index> (the next s bits), and <block offset> (the low b bits).
The word at address A is in the cache if the tag bits in one of the <valid> lines in set <set index> match <tag>.
The word contents begin at offset <block offset> bytes from the beginning of the block.
[Figure: address fields <t bits | s bits | b bits>; the set index selects one set, and the tag is compared against the tags of that set's lines.]
Direct-Mapped Cache

Simplest kind of cache, characterized by exactly one line per set (E = 1).
[Figure: sets 0, 1, ..., S-1, each holding a single <valid, tag, cache block> line.]
Accessing direct-mapped caches, step 1 - set selection: use the set index bits of the address to determine the set of interest.
[Figure: the <t bits | s bits | b bits> address fields, with the s set-index bits selecting one <valid, tag, cache block> line.]
Accessing direct-mapped caches, steps 2 and 3. Line matching: find a valid line in the selected set with a matching tag. Word selection: then extract the word.
(1) The valid bit must be set (=1?).
(2) The tag bits in the cache line must match the tag bits in the address (=?).
(3) If (1) and (2) hold, it is a cache hit, and the block offset selects the starting byte.
[Figure: example with address tag 0110, set index i, and block offset 100 selecting word w2 from an 8-byte block holding w0 w1 w2 w3.]
[Figure: direct-mapped cache simulation, showing valid bits and resident blocks such as M[12-13] as a sequence of references is processed.]
Why use the middle bits for the set index?
- High-order bit indexing: adjacent memory lines would map to the same cache entry, making poor use of spatial locality.
- Middle-order bit indexing: consecutive memory lines map to different cache lines, so the cache can hold a C-byte region of the address space at one time.
Set Associative Caches

Characterized by more than one line per set, e.g. E = 2 lines per set (a 2-way set associative cache).
[Figure: sets 0, 1, ..., S-1, each holding two <valid, tag, cache block> lines.]
Accessing set associative caches, step 1 - set selection: identical to direct-mapped caches; the set index bits select one set.
[Figure: the set index bits of the address select set 1; its two <valid, tag, cache block> lines become the candidates for line matching.]
Accessing set associative caches, steps 2 and 3 - line matching and word selection: must compare the tag in each valid line in the selected set.
(1) The valid bit must be set (=1?).
(2) The tag bits in one of the cache lines must match the tag bits in the address (=?).
(3) If (1) and (2) hold, it is a cache hit, and the block offset selects the starting byte.
[Figure: a selected set whose two lines hold tags 1001 and 0110; the address tag matches 0110, and block offset 100 selects the word from the block w0 w1 w2 w3.]
Multi-Level Caches

Options: separate data and instruction caches, or a unified cache.

[Figure: Processor (Regs, L1 d-cache, L1 i-cache) -> Unified L2 Cache -> Memory -> disk. Figures given include: register file about 200 B at 3 ns; L1 8-64 KB at 3 ns; disk 30 GB at 8 ms, $0.05/MB.]
Cache Performance Metrics

Hit Time: time to deliver a line in the cache to the processor (includes time to determine whether the line is in the cache). Typical numbers: 1 clock cycle for L1, 3-8 clock cycles for L2.

Miss Penalty: additional time required because of a miss.
int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;

    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
Memory mountain
Measured read throughput as a function of spatial and temporal locality. Compact way to characterize memory system performance.
/* mountain.c: generate the memory mountain */
int main()
{
    int size;        /* working set size (in bytes) */
    int stride;      /* stride (in array elements)  */
    double Mhz;      /* clock frequency             */

    init_data(data, MAXELEMS);  /* initialize each element in data to 1 */
    Mhz = mhz(0);               /* estimate the clock frequency */
    for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
        for (stride = 1; stride <= MAXSTRIDE; stride++)
            printf("%.1f\t", run(size, stride, Mhz));
        printf("\n");
    }
    exit(0);
}
[Figure: the memory mountain, measured on a 550 MHz Pentium III Xeon with 16 KB on-chip L1 d-cache, 16 KB on-chip L1 i-cache, and 512 KB off-chip unified L2 cache. Read throughput (0-1000 MB/s) is plotted against stride (s1-s15, in words) and working set size (2k-8m bytes); ridges correspond to L1, L2, and main memory.]
[Figure: slice through the memory mountain showing read throughput (0-800 MB/s) as a function of working set size, from 1k to 8m bytes.]
Matrix Multiplication Example

Major cache effects to study: block size (exploit spatial locality) and blocking.

Description: multiply N x N matrices with an O(N^3) triple loop.

/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;                  /* variable sum held in register */
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Miss Rate Analysis for Matrix Multiply

Assume:
- Line size = 32B (big enough for four 64-bit words)
- Matrix dimension (N) is very large: approximate 1/N as 0.0

Analysis method: look at the access pattern of the inner loop.
C arrays are allocated in row-major order, so stepping along a row accesses successive elements. If the block size B > 4 bytes, this exploits spatial locality: compulsory miss rate = 4 bytes / B.
Matrix multiplication (ijk) - inner loop access pattern: A row (i,*) row-wise, B column (*,j) column-wise, C element (i,j) fixed.
Misses per inner loop iteration: A 0.25, B 1.00, C 0.00.
Matrix multiplication (jik) - same inner loop access pattern as ijk: A row-wise, B column-wise, C fixed.
Misses per inner loop iteration: A 0.25, B 1.00, C 0.00.
Matrix multiplication (kij) - inner loop access pattern: A element (i,k) fixed, B row (k,*) row-wise, C row (i,*) row-wise.
Misses per inner loop iteration: A 0.00, B 0.25, C 0.25.
Matrix multiplication (ikj) - same inner loop access pattern as kij: A (i,k) fixed, B (k,*) row-wise, C (i,*) row-wise.
Misses per inner loop iteration: A 0.00, B 0.25, C 0.25.
Matrix multiplication (jki) - inner loop access pattern: A column (*,k) column-wise, B element (k,j) fixed, C column (*,j) column-wise.
Misses per inner loop iteration: A 1.00, B 0.00, C 1.00.
Matrix multiplication (kji) - same inner loop access pattern as jki: A column-wise, B fixed, C column-wise.
Misses per inner loop iteration: A 1.00, B 0.00, C 1.00.
Summary of matrix multiplication:

ijk (and jik): misses/iter = 1.25

for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

kij (and ikj): misses/iter = 0.5

for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

jki (and kji): misses/iter = 2.0

for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}
[Figure: matrix multiply performance, cycles/iteration (0-50) versus array size n (25-400) for the six loop orderings.]
Blocked matrix multiplication: "block" in this context does not mean cache block. Instead, it means a sub-block within the matrix. Example: N = 8, sub-block size = 4:

    | A11 A12 |   | B11 B12 |   | C11 C12 |
    | A21 A22 | X | B21 B22 | = | C21 C22 |

Key idea: sub-blocks (i.e., Axy) can be treated just like scalars:

    C11 = A11*B11 + A12*B21    C12 = A11*B12 + A12*B22
    C21 = A21*B11 + A22*B21    C22 = A21*B12 + A22*B22
Blocked matrix multiply (bijk): the innermost loop pair multiplies a 1 X bsize sliver of A by a bsize X bsize block of B and accumulates the result into a 1 X bsize sliver of C. The loop over i steps through n row slivers of A and C, using the same block of B.

for (i=0; i<n; i++) {
    for (j=jj; j < min(jj+bsize,n); j++) {
        sum = 0.0;
        for (k=kk; k < min(kk+bsize,n); k++)    /* innermost loop pair */
            sum += a[i][k] * b[k][j];
        c[i][j] += sum;
    }
}

Each element of a C sliver is updated bsize times; each block of B is reused n times in succession.
[Figure: blocked matmul performance, cycles/iteration (0-60) versus array size n (25-400) for kji, jki, kij, ikj, jik, ijk, bijk (bsize = 25), and bikj (bsize = 25).]
Concluding Observations
Programmer can optimize for cache performance