Architecture

1.
Startup overhead (in cycles) for each of the structural units:
    Load/store unit startup overhead : 15 cycles
    Multiplier unit startup overhead : 8 cycles
    Adder unit startup overhead      : 5 cycles
    MVL = 64
a) Arithmetic intensity:
   Each loop iteration performs 6 floating-point operations (4 multiplies, 1 subtract, 1 add).
   No. of floats read = 4, No. of floats written = 2.
   Arithmetic intensity = 6 FLOPs / 6 floats accessed = 1

b) VMIPS assembly code using strip mining:

Strip-mined code:

    low = 0;
    n = 300;
    MVL = 64;
    VL = (n % MVL);                            /* find odd-size piece using modulo op % */
    for (j = 0; j <= (n/MVL); j=j+1) {         /* outer loop */
        for (i = low; i < (low+VL); i=i+1) {   /* runs for length VL */
            c_re[i] = a_re[i] * b_re[i] - a_im[i] * b_im[i];
            c_im[i] = a_re[i] * b_im[i] + a_im[i] * b_re[i];
        }
        low = low + VL;                        /* start of next vector */
        VL = MVL;                              /* reset the length to maximum vector length */
    }
VMIPS code (MVL = 64):

          LD       VLR, 44            # Limit to first 44 elements
          LD       R1, 0              # Initialize address offset
    LOOP: LV       V1, a_r + R1       # load A_real
          LV       V3, b_r + R1       # load B_real
          MULVV.D  V5, V1, V3         # a_re[i] * b_re[i]
          LV       V2, a_i + R1       # load A_im
          LV       V4, b_i + R1       # load B_im
          MULVV.D  V6, V2, V4         # a_im[i] * b_im[i]
          SUBVV.D  V7, V5, V6         # a_re[i] * b_re[i] - a_im[i] * b_im[i]
          SV       V7, c_r + R1       # Store C_real
          MULVV.D  V5, V1, V4         # a_re[i] * b_im[i]
          MULVV.D  V6, V2, V3         # a_im[i] * b_re[i]
          ADDVV.D  V7, V5, V6         # a_re[i] * b_im[i] + a_im[i] * b_re[i]
          SV       V7, c_i + R1       # Store C_imaginary
          BNE      R1, 0, ELSE        # Branches to ELSE if not the first iteration
          ADD      R1, R1, #176       # Increment address by 44 * 4 for the first iteration
          LD       VLR, 64            # After first iteration set vector length to 64
          JUMP     LOOP
    ELSE: ADD      R1, R1, #256       # Increment address by 64 * 4
          BLT      R1, 1200, LOOP     # Branch to LOOP if the address offset is less than 300 * 4
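The strip-mined loop in (b) can be checked on an ordinary host machine. The following is a minimal C sketch, assuming float arrays and a hypothetical function name `complex_mul_strip_mined` (not part of the original solution). It returns the number of non-empty strips, so the split of n = 300 into one strip of 44 followed by four strips of 64 can be confirmed.

```c
#include <assert.h>

/* Strip-mined complex multiply (illustrative helper, name is an assumption).
 * Processes n elements in strips of at most mvl elements, handling the
 * odd-size strip (n % mvl) first, as in the strip-mined loop in (b).
 * Returns the number of non-empty strips executed. */
int complex_mul_strip_mined(const float *a_re, const float *a_im,
                            const float *b_re, const float *b_im,
                            float *c_re, float *c_im, int n, int mvl)
{
    assert(n >= 0 && mvl > 0);
    int low = 0;
    int vl = n % mvl;                 /* odd-size piece comes first */
    int strips = 0;

    for (int j = 0; j <= n / mvl; j++) {
        for (int i = low; i < low + vl; i++) {   /* one strip of length vl */
            c_re[i] = a_re[i] * b_re[i] - a_im[i] * b_im[i];
            c_im[i] = a_re[i] * b_im[i] + a_im[i] * b_re[i];
        }
        if (vl > 0)
            strips++;
        low += vl;                    /* start of next strip */
        vl = mvl;                     /* all later strips are full length */
    }
    return strips;
}
```

Calling it with n = 300 and mvl = 64 covers every element exactly once; when n is a multiple of mvl the first strip is empty and is not counted.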
c) No. of chimes required:

    Chime  Clock cycles      Instructions
    1      15                LV V1, a_r + R1
    2      15 + 1 = 16       LV V3, b_r + R1 | MULVV.D V5, V1, V3
    3      15                LV V2, a_i + R1
    4      15 + 1 + 1 = 17   LV V4, b_i + R1 | MULVV.D V6, V2, V4 | SUBVV.D V7, V5, V6
    5      15                SV V7, c_r + R1 | MULVV.D V5, V1, V4
    6      15                MULVV.D V6, V2, V3 | ADDVV.D V7, V5, V6 | SV V7, c_i + R1

   Total required chimes = 6

d) Total number of clock cycles required = sum of the clock cycles required for each chime
   = 15 + 16 + 15 + 17 + 15 + 15 = 93 cycles per complex result value

e) 3 memory banks:

    Chime  Clock cycles          Instructions
    1      15 + 1 = 16           LV V1, a_r + R1 | LV V3, b_r + R1 | MULVV.D V5, V1, V3 | LV V2, a_i + R1
    2      15 + 1 + 1 + 1 = 18   LV V4, b_i + R1 | MULVV.D V6, V2, V4 | SUBVV.D V7, V5, V6 | SV V7, c_r + R1
    3      15                    MULVV.D V5, V1, V4
    4      15                    MULVV.D V6, V2, V3 | ADDVV.D V7, V5, V6 | SV V7, c_i + R1

   Total required chimes = 4
   Total number of clock cycles required = sum of the clock cycles required for each chime
   = 16 + 18 + 15 + 15 = 64 cycles per complex result value
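
The totals in (d) and (e) are plain sums of the per-chime cycle counts. As a quick arithmetic check, here is a small C sketch; the function name `total_cycles` and the two array names are assumptions made for illustration, and the values are copied from the chime listings for parts (c) and (e).

```c
#include <assert.h>

/* Sum the per-chime cycle counts for one pass over the loop body.
 * (Illustrative helper; the name is not from the original solution.) */
int total_cycles(const int *chime_cycles, int num_chimes)
{
    assert(num_chimes >= 0);
    int total = 0;
    for (int i = 0; i < num_chimes; i++)
        total += chime_cycles[i];
    return total;
}

/* Per-chime cycle counts as listed in parts (c) and (e). */
static const int single_pipe_chimes[6] = {15, 16, 15, 17, 15, 15};
static const int three_pipe_chimes[4]  = {16, 18, 15, 15};
```

Passing `single_pipe_chimes` with a count of 6 gives the single-pipeline total, and `three_pipe_chimes` with a count of 4 gives the three-bank total.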
2. a) Recurrence doubling: the function given below returns the sum of the elements dot[low] through dot[high].
   Total number of additions (for a 64-element vector) = 32 + 16 + 8 + 4 + 2 + 1 = 63

    int rec_add(int dot[], int low, int high) {
        if (low >= high)
            return dot[low];
        else {
            int mid = low + (high - low) / 2;
            return rec_add(dot, low, mid) + rec_add(dot, mid + 1, high);
        }
    }

b) VMIPS code to reduce a 64-element vector into 8 partial sums. For this, we repeatedly split the vector into two vectors of half the length. First the 64-element vector is split into two vectors of length 32, which are added and the sum stored back in V1. In the next step this 32-element result is split into two vectors of length 16, which are added and the sum stored back in V1. We continue this way until we reach the required vector length of 8.

    LD      VLR, 32
    ADDV.D  V1, V1(0), V1(32)
    LD      VLR, 16
    ADDV.D  V1, V1(0), V1(16)
    LD      VLR, 8
    ADDV.D  V1, V1(0), V1(8)

   Vector V1 now contains the 8 partial sums in its first 8 locations, i.e. V1(0) through V1(7).

c)
    unsigned int tid = threadIdx.x;
    for (unsigned int s = (blockDim.x/2); s >= 1; s = s/2) {
        if (tid < s) {
            sdata[tid] += sdata[tid+s];
        }
        __syncthreads();
    }

   Since consecutive threads access consecutive elements, bank conflicts are avoided. After each iteration the number of active threads is halved, but the active threads are always consecutive, so they stay packed together in warps (a single warp once no more than one warp's worth of threads remains active).
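
The halving scheme used in (b) and (c) can be traced on a host machine. The following is a minimal C sketch, assuming an int array and a hypothetical function name `fold_reduce` (not from the original solution): each pass of the outer loop stands in for one vector add of length `half`, or for one iteration of the CUDA loop with `s == half`.

```c
#include <assert.h>

/* Repeatedly fold the upper half of v[0..len) onto the lower half until
 * only `target` partial sums remain in v[0..target).
 * Returns the number of scalar additions performed.
 * Assumption: len and target are powers of two with len >= target >= 1. */
int fold_reduce(int *v, int len, int target)
{
    assert(target >= 1 && len >= target);
    assert((len & (len - 1)) == 0 && (target & (target - 1)) == 0);

    int adds = 0;
    for (int half = len / 2; half >= target; half /= 2) {
        for (int i = 0; i < half; i++) {   /* one vector add of length half */
            v[i] += v[i + half];
            adds++;
        }
    }
    return adds;
}
```

With len = 64 and target = 8 the passes have lengths 32, 16 and 8, leaving the partial sums in the first 8 elements; with target = 1 the passes continue down to length 1 and the count of additions matches the 63 from part (a).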
3. a) No. of SIMD processors = 10. Each processor has 8 lanes and completes 32 operations every 4 cycles.
   In every 4 cycles, the system can execute 32 * 10 = 320 operations.
   Every cycle: 320 / 4 = 80 operations.
   Because of memory latencies, the issue rate is taken to be 0.85. Therefore, every cycle we can execute 80 * 0.85 = 68 operations.
   Due to branches, only 80% of threads are active, which implies that every cycle we execute 68 * 0.8 = 54.4 operations.
   Among these, only 70% are floating-point operations, which gives 0.7 * 54.4 = 38.08 floating-point operations per clock cycle.
   Therefore, at 1.5 GHz we get: 38.08 * 1.5 GHz = 57.12 GFLOPS
b) 1. As above, the only variable that changes is the number of operations per cycle. With 16 lanes, each processor completes 32 operations every 2 cycles (16 per cycle), giving 160 operations overall in the system per clock cycle. All else remaining constant, we get a speedup of 160/80 = 2x.
   2. By increasing the number of SIMD processors, we are essentially increasing the number of operations executed per cycle. Increase in number of operations = 15/10 = 1.5 times the original number. Hence the speedup is 1.5x.
   3. Increasing the instruction issue rate to 0.95: this gives a proportional increase in operations per cycle, and hence a speedup of 0.95/0.85 ≈ 1.12x.
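
The throughput calculation in 3(a) and the three speedups in 3(b) all follow from one product. The sketch below is a hedged illustration in C: the struct `gpu_config`, its field names, and the function `gflops` are assumptions introduced here, and the model simply treats each lane as completing one operation per cycle (8 lanes corresponds to 32 operations every 4 cycles).

```c
#include <assert.h>

/* Hypothetical parameter bundle; names are illustrative only. */
typedef struct {
    int    simd_processors;   /* number of SIMD processors            */
    int    lanes;             /* lanes per SIMD processor             */
    double issue_rate;        /* average SIMD instruction issue rate  */
    double active_fraction;   /* fraction of threads active           */
    double fp_fraction;       /* fraction of floating-point ops       */
    double clock_ghz;         /* clock frequency in GHz               */
} gpu_config;

/* Sustained throughput in GFLOPS under the simple model above. */
double gflops(gpu_config g)
{
    assert(g.simd_processors > 0 && g.lanes > 0);
    double ops_per_cycle = (double)g.simd_processors * g.lanes;
    return ops_per_cycle * g.issue_rate * g.active_fraction
         * g.fp_fraction * g.clock_ghz;
}
```

Starting from the baseline {10, 8, 0.85, 0.80, 0.70, 1.5}, changing one field at a time (lanes to 16, processors to 15, issue rate to 0.95) and dividing by the baseline result gives the speedup for each option in 3(b).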
