Please include the TIME or TA NAME of the DISCUSSION section that you attend as
well as your NAME and STUDENT ID. Homeworks and labs will be handed back
in discussion sections.
1) Do problems 6.30 and 6.31 from P&H
2) Assume we have a processor with the following pipeline latencies:
| Instruction producing result | Instruction using result | Latency (clock cycles) |
| Floating point ALU op. | Floating point ALU op. | 3 |
| Floating point ALU op. | Store double | 2 |
| Load double | Floating point ALU op. | 1 |
| Load double | Store double | 0 |
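One way to read this table is sketched below in Python. It assumes the latency is the number of intervening cycles needed between a producer and a dependent consumer to avoid a stall, on a single-issue, in-order pipeline. The names `LATENCY`, `count_stalls`, and `toy` are hypothetical, and any producer/consumer pair not listed in the table (integer ops, for example) is assumed to have latency 0. The three-instruction `toy` sequence is only an illustration, not the loop from this problem.

```python
# Latency table from the problem: (producer kind, consumer kind) -> cycles.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
}

def count_stalls(program):
    """Count stall cycles for a straight-line instruction sequence.

    program: list of (kind, dest_register_or_None, [source registers]).
    Returns (total_stalls, issue_cycle_of_each_instruction).
    Assumption: pairs missing from LATENCY need no intervening cycles.
    """
    last_writer = {}   # register -> (issue cycle, kind) of its latest producer
    issue_cycles = []
    stalls = 0
    cycle = 0          # issue cycle of the previous instruction
    for kind, dest, sources in program:
        earliest = cycle + 1           # next slot if nothing delays us
        ready = earliest
        for src in sources:
            if src in last_writer:
                p_cycle, p_kind = last_writer[src]
                lat = LATENCY.get((p_kind, kind), 0)
                # Need `lat` intervening cycles after the producer issues.
                ready = max(ready, p_cycle + lat + 1)
        stalls += ready - earliest
        cycle = ready
        issue_cycles.append(cycle)
        if dest is not None:
            last_writer[dest] = (cycle, kind)
    return stalls, issue_cycles

# Toy sequence: load, dependent FP multiply, dependent store.
toy = [
    ("load",   "f2", ["r1"]),
    ("fp_alu", "f4", ["f2", "f0"]),
    ("store",  None, ["f4", "r2"]),
]
print(count_stalls(toy))  # -> (3, [1, 3, 6])
```

Here the multiply waits one cycle for the load and the store waits two cycles for the multiply, giving three stalls in total.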
Now assume we have the following code:
daxpy:
    ld    $f2, 0($r1)
    multd $f4, $f2, $f0
    ld    $f6, 0($r2)
    addd  $f6, $f4, $f6
    sd    0($r2), $f6
    addiu $r1, $r1, 8
    addiu $r2, $r2, 8
    addiu $r3, $r3, -1
    beq   $r3, $0, daxpy
(a) Assume one branch delay slot. How many stalls per iteration do we have? How many cycles per iteration?
(b) Rewrite the code by rearranging the instructions in this loop to minimize stalls. How many stalls do we have per iteration now? How many cycles per iteration?
(c) Now unroll the loop as many times as necessary to schedule it without stalls. How many times must you unroll the loop? How many cycles per iteration?
(d) Now software pipeline the loop. Omit startup and ending code. Suppose the latency of our instructions goes up now. What is the maximum latency we can have between two floating point ALU operations without stalling the software pipelined version of the code?
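The transformations named in (c) and (d) can be illustrated at a high level, separately from the cycle counts the questions ask for. The sketch below is in Python rather than the assembly of this problem. It assumes the loop computes y[i] = a*x[i] + y[i] (the usual DAXPY operation, with `a` held in `$f0`). The function names are hypothetical, and the unroll factor of 2 is an arbitrary illustration, not the answer to (c). Unlike part (d), the software-pipelined version keeps its startup and ending code so that it runs on its own.

```python
def daxpy(a, x, y):
    """Reference loop: one multiply, one add, one store per element."""
    y = list(y)
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y

def daxpy_unrolled2(a, x, y):
    """Unrolled by 2 (arbitrary factor): two independent element updates
    per trip, so there is more independent work to interleave."""
    y = list(y)
    n = len(x)
    i = 0
    while i + 1 < n:
        t0 = a * x[i]
        t1 = a * x[i + 1]
        y[i] = t0 + y[i]
        y[i + 1] = t1 + y[i + 1]
        i += 2
    if i < n:  # leftover element when n is odd
        y[i] = a * x[i] + y[i]
    return y

def daxpy_swp(a, x, y):
    """Software-pipelined: each steady-state trip stores element i,
    adds for element i+1, and multiplies for element i+2."""
    y = list(y)
    n = len(x)
    if n < 3:
        return daxpy(a, x, y)  # too short to fill the pipeline
    # Startup code: fill the pipeline.
    prod = a * x[0]            # multiply for element 0
    total = prod + y[0]        # add for element 0
    prod = a * x[1]            # multiply for element 1
    for i in range(n - 2):     # steady state
        y[i] = total               # store for element i
        total = prod + y[i + 1]    # add for element i+1
        prod = a * x[i + 2]        # multiply for element i+2
    # Ending code: drain the pipeline.
    y[n - 2] = total
    y[n - 1] = prod + y[n - 1]
    return y
```

In the steady-state loop of `daxpy_swp`, the store, add, and multiply in one trip belong to three different elements, so none of them depends on a result produced earlier in the same trip.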
3) Suppose we are running the same daxpy loop as in problem 2, but this time we are running it on a processor using Tomasulo’s algorithm. Let us assume the following execution times:
| Functional unit type | Cycles | # of FUs | # of reservation stations |
| Integer | 1 | 1 | 5 |
| FP adder | 4 | 1 | 3 |
| FP multiplier | 15 | 1 | 2 |
Complete the following table for the first three iterations of the loop. The first two instructions have been completed for you.
| Instruction | Issue | Execute | Memory written | CDB |
| ld $f0, 0($r1) | 1 | 2 | 3 | 4 |
| addd $f4, $f0, $f2 | 1 | 5 | | 8 |