Computer Architecture: A Quantitative Approach (Solutions for the 5th Edition)

Solutions to the case studies and exercises of the 5th edition of Computer Architecture: A Quantitative Approach by Hennessy and Patterson.

Copyright © 2012 Elsevier, Inc. All rights reserved.

  • Chapter 1 Solutions
  • Chapter 2 Solutions
  • Chapter 3 Solutions
  • Chapter 4 Solutions
  • Chapter 5 Solutions
  • Chapter 6 Solutions
  • Appendix A Solutions
  • Appendix B Solutions
  • Appendix C Solutions

Chapter 1 Solutions

Solutions to Case Studies and Exercises

Case Study 2: Power Consumption in Computer Systems

1.4 a. 0.80x = 66 + 2 × 2.3 + 7.9; x = 99

    b. 0.6 × 4 W + 0.4 × 7.9 W = 5.56 W

    c. Solve the following four equations:
       seek7200 = 0.75 × seek5400
       seek7200 + idle7200 = 100
       seek5400 + idle5400 = 100
       seek7200 × 7.9 + idle7200 × 4 = seek5400 × 7 + idle5400 × 2.9
       idle7200 = 29.8%

1.5 a. 14 KW / (66 W + 2.3 W + 7.9 W) = 183 servers

    b. 14 KW / (66 W + 2.3 W + 2 × 7.9 W) = 166 servers

    c. 200 W × 11 = 2200 W; 2200/(76.2) = 28 racks. Only 1 cooling door is required.
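The rack-level arithmetic in 1.5 can be re-checked mechanically. The short C program below is not part of the solutions manual; it assumes the 66 W, 2.3 W, and 7.9 W figures are the per-server processor, DRAM, and disk power draws from the exercise's component table, and simply divides the 14 KW rack budget by the per-server total.

    #include <stdio.h>

    int main(void) {
        /* Per-server component power from the exercise (assumed mapping). */
        double cpu = 66.0, dram = 2.3, disk = 7.9;   /* watts */
        double rack_budget = 14000.0;                /* 14 KW per rack */

        double one_disk  = cpu + dram + disk;        /* 1.5a */
        double two_disks = cpu + dram + 2.0 * disk;  /* 1.5b */

        printf("servers per rack, one disk:  %d\n", (int)(rack_budget / one_disk));   /* 183 */
        printf("servers per rack, two disks: %d\n", (int)(rack_budget / two_disks));  /* 166 */
        return 0;
    }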

1.6 a. The IBM x346 could take less space, which would save money in real estate. The racks might be better laid out. It could also be much cheaper. In addition, if we were running applications that did not match the characteristics of these benchmarks, the IBM x346 might be faster. Finally, there are no reliability numbers shown. Although we do not know that the IBM x346 is better in any of these areas, we do not know it is worse, either.

1.7 a. (1 – 0.8) + 0.8/2 = 0.2 + 0.4 = 0.6

    b. Power_new/Power_old = ((V × 0.60)^2 × (F × 0.60)) / (V^2 × F) = 0.6^3 = 0.216

    c. (1 – x) + x/2 = 0.75; x = 50%

    d. Power_new/Power_old = ((V × 0.75)^2 × (F × 0.60)) / (V^2 × F) = 0.75^2 × 0.6 = 0.3375
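The voltage/frequency scalings in 1.7b and 1.7d both come from the dynamic power relation P ∝ C × V^2 × f, so the new-to-old power ratio is just (voltage scale)^2 × (frequency scale). A minimal, illustrative helper (not from the manual) makes that explicit; the same relation is used again in 1.10c.

    #include <stdio.h>

    /* Dynamic power scales as C * V^2 * f, so scaling V and f multiplies
     * power by v_scale^2 * f_scale. */
    static double power_ratio(double v_scale, double f_scale) {
        return v_scale * v_scale * f_scale;
    }

    int main(void) {
        printf("V and F scaled to 60%%: %.3f\n", power_ratio(0.60, 0.60));  /* 0.216  (1.7b) */
        printf("V to 75%%, F to 60%%:   %.4f\n", power_ratio(0.75, 0.60));  /* 0.3375 (1.7d) */
        return 0;
    }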

Exercises

1.8 a. (1.35)^10 = approximately 20

    b. 3200 × (1.4)^12 = approximately 181,420

    c. 3200 × (1.01)^12 = approximately 3605

    d. Power density, which is the power consumed over the increasingly small area, has created too much heat for heat sinks to dissipate. This has limited the activity of the transistors on the chip. Instead of increasing the clock rate, manufacturers are placing multiple cores on the chip.



e. Anything in the 15–25% range would be a reasonable conclusion based on the decline in the rate over history. As the sudden stop in clock rate growth shows, though, even the declines do not always follow predictions.

1.9 a. 50%

    b. Energy = ½ × load × V^2. Changing the frequency does not affect energy, only power. So the new energy is ½ × load × (½ × V)^2, reducing it to about ¼ the old energy.

1.10 a. 60%

     b. 0.4 + 0.6 × 0.3 = 0.58, which reduces the energy to 58% of the original energy.

     c. newPower/oldPower = (½ × Capacitance × (Voltage × 0.8)^2 × (Frequency × 0.6)) / (½ × Capacitance × Voltage^2 × Frequency) = 0.8^2 × 0.6 = 0.384 of the original power.

     d. 0.4 + 0.3 × 0.2 = 0.46, which reduces the energy to 46% of the original energy.

1.11 a. 10^9/100 = 10^7

     b. 10^7/(10^7 + 24) ≈ 1

     c. [need solution]

1.12 a. 35/10,000 × 3333 = 11.67 days

     b. There are several correct answers. One would be that, with the current system, one computer fails approximately every 5 minutes. 5 minutes is unlikely to be enough time to isolate the computer, swap it out, and get the computer back online again. 10 minutes, however, is much more likely. In any case, it would greatly extend the amount of time before 1/3 of the computers have failed at once. Because the cost of downtime is so huge, being able to extend this is very valuable.

     c. $90,000 = (x + x + x + 2x)/4; $360,000 = 5x; x = $72,000; 4th quarter = 2x = $144,000/hr

1.13 a. Itanium, because it has a lower overall execution time.

     b. Opteron: 0.6 × 0.92 + 0.2 × 1.03 + 0.2 × 0.65 = 0.888

     c. 1/0.888 = 1.126

1.14 a. See Figure S.1.

     b. 2 = 1/((1 – x) + x/10); x = 5/9 = 0.56 or 56%

     c. 0.056/0.5 = 0.11 or 11%

     d. Maximum speedup = 1/(1/10) = 10. 5 = 1/((1 – x) + x/10); x = 8/9 = 0.89 or 89%
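Several of these parts (1.7a, 1.14b, and 1.14d) are just Amdahl's Law solved for either the speedup or the enhanced fraction. The sketch below re-derives the fractions; it is illustrative only, and the function names are not from the text.

    #include <stdio.h>

    /* Amdahl's Law: speedup = 1 / ((1 - f) + f/s), where f is the fraction
     * of time that can use the enhancement and s is its speedup. */
    static double amdahl_speedup(double f, double s) {
        return 1.0 / ((1.0 - f) + f / s);
    }

    /* Solved for f given a target overall speedup:
     * f = (1 - 1/target) / (1 - 1/s). */
    static double amdahl_fraction(double target, double s) {
        return (1.0 - 1.0 / target) / (1.0 - 1.0 / s);
    }

    int main(void) {
        printf("1.14b: f for 2x with s = 10: %.3f\n", amdahl_fraction(2.0, 10.0));     /* 0.556 */
        printf("1.14d: f for 5x with s = 10: %.3f\n", amdahl_fraction(5.0, 10.0));     /* 0.889 */
        printf("check: speedup at f = 8/9:   %.2f\n", amdahl_speedup(8.0 / 9.0, 10.0)); /* 5.00 */
        return 0;
    }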

Chapter 2 Solutions

Case Study 1: Optimizing Cache Performance via Advanced Techniques

2.1 a. Each element is 8B. Since a 64B cacheline has 8 elements, and each column access will result in fetching a new line for the non-ideal matrix, we need a minimum of 8 × 8 (64 elements) for each matrix. Hence, the minimum cache size is 128 × 8B = 1KB.

    b. The blocked version only has to fetch each input and output element once. The unblocked version will have one cache miss for every 64B/8B = 8 row elements. Each column requires 64B × 256 of storage, or 16KB. Thus, column elements will be replaced in the cache before they can be used again. Hence the unblocked version will have 9 misses (1 row and 8 columns) for every 2 in the blocked version.

    c. for (i = 0; i < 256; i = i + B) {
           for (j = 0; j < 256; j = j + B) {
               for (m = 0; m < B; m++) {
                   for (n = 0; n < B; n++) {
                       output[j+n][i+m] = input[i+m][j+n];
                   }
               }
           }
       }

    d. 2-way set associative. In a direct-mapped cache the blocks could be allocated so that they map to overlapping regions in the cache.

    e. You should be able to determine the level-1 cache size by varying the block size. The ratio of the blocked and unblocked program speeds for arrays that do not fit in the cache in comparison to blocks that do is a function of the cache block size, whether the machine has out-of-order issue, and the bandwidth provided by the level-2 cache. You may have discrepancies if your machine has a write-through level-1 cache and the write buffer becomes a limiter of performance.

2.2 Since the unblocked version is too large to fit in the cache, processing eight 8B elements requires fetching one 64B row cache block and 8 column cache blocks. Since each iteration requires 2 cycles without misses, prefetches can be initiated every 2 cycles, and the number of prefetches per iteration is more than one, the memory system will be completely saturated with prefetches. Because the latency of a prefetch is 16 cycles, and one will start every 2 cycles, 16/2 = 8 will be outstanding at a time.

2.3 Open hands-on exercise, no fixed solution.
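For comparison with the blocked loop in 2.1c, the unblocked transpose that parts (a) and (b) reason about looks like the sketch below. This is not taken from the manual; it just shows the reference loop whose column-major stores touch a new 64B line on every iteration.

    /* Unblocked transpose of the same 256x256 matrices of 8B elements.
     * input is walked row-major (one miss per 8 elements); output is walked
     * column-major, so each store lands in a different 64B cache line. */
    void transpose_unblocked(double input[256][256], double output[256][256]) {
        for (int i = 0; i < 256; i++) {
            for (int j = 0; j < 256; j++) {
                output[j][i] = input[i][j];
            }
        }
    }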


Case Study 2: Putting it all Together: Highly Parallel Memory Systems

2.4 a. The second-level cache is 1MB and has a 128B block size.

b. The miss penalty of the second-level cache is approximately 105ns.

c. The second-level cache is 8-way set associative.

d. The main memory is 512MB.

e. Walking through pages with a 16B stride takes 946ns per reference. With 250 such references per page, this works out to approximately 0.24ms per page.

2.5 a. Hint: This is visible in the graph above as a slight increase in L2 miss service time for large data sets, and is 4KB for the graph above.

    b. Hint: Take independent strides by the page size and look for increases in latency not attributable to cache sizes. This may be hard to discern if the amount of memory mapped by the TLB is almost the same as the size of a cache level.

    c. Hint: This is visible in the graph above as a slight increase in L2 miss service time for large data sets, and is 15ns in the graph above.

    d. Hint: Take independent strides that are multiples of the page size to see if the TLB is fully-associative or set-associative. This may be hard to discern if the amount of memory mapped by the TLB is almost the same as the size of a cache level.
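The kind of measurement these hints describe can be approximated with a small strided-access probe. The C sketch below is illustrative rather than the harness the case study actually uses: it assumes POSIX clock_gettime, and a simple strided walk like this one is prefetchable, so a serious probe would use a randomized pointer chase instead.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Average nanoseconds per access for a strided walk over a buffer.
     * Sweep the buffer size to expose cache capacities; sweep the stride
     * by the page size to expose TLB reach, as the hints above suggest. */
    static double probe_ns(volatile char *buf, size_t size, size_t stride, long iters) {
        struct timespec t0, t1;
        volatile char sink = 0;
        size_t idx = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++) {
            sink += buf[idx];
            idx = (idx + stride) % size;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        (void)sink;
        return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / iters;
    }

    int main(void) {
        for (size_t size = 16 * 1024; size <= 64u * 1024 * 1024; size *= 2) {
            char *buf = malloc(size);
            if (!buf) return 1;
            for (size_t i = 0; i < size; i++) buf[i] = (char)i;
            printf("%8zu KB: %6.2f ns/ref\n", size / 1024,
                   probe_ns(buf, size, 64, 2000000));
            free(buf);
        }
        return 0;
    }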

2.6 a. Hint: Look at the speed of programs that easily fit in the top-level cache as a function of the number of threads. b. Hint: Compare the performance of independent references as a function of their placement in memory.

2.7 Open hands-on exercise, no fixed solution.

Exercises

2.8 a. The access time of the direct-mapped cache is 0.86ns, while the 2-way and 4-way are 1.12ns and 1.37ns respectively. This makes the relative access times 1.12/.86 = 1.30 or 30% more for the 2-way and 1.37/0.86 = 1.59 or 59% more for the 4-way. b. The access time of the 16KB cache is 1.27ns, while the 32KB and 64KB are 1.35ns and 1.37ns respectively. This makes the relative access times 1.35/ 1.27 = 1.06 or 6% larger for the 32KB and 1.37/1.27 = 1.078 or 8% larger for the 64KB. c. Avg. access time = hit% × hit time + miss% × miss penalty, miss% = misses per instruction/references per instruction = 2.2% (DM), 1.2% (2-way), 0.33% (4-way), .09% (8-way). Direct mapped access time = .86ns @ .5ns cycle time = 2 cycles 2-way set associative = 1.12ns @ .5ns cycle time = 3 cycles
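The average-access-time formula in 2.8c is easy to tabulate once a miss penalty is chosen. The helper below follows the formula exactly as stated in the solution; the 20-cycle miss penalty is only a placeholder, since the exercise's actual value is not visible in this excerpt.

    #include <stdio.h>

    /* Avg. access time = hit% x hit time + miss% x miss penalty (2.8c). */
    static double avg_access_cycles(double miss_rate, double hit_cycles,
                                    double penalty_cycles) {
        return (1.0 - miss_rate) * hit_cycles + miss_rate * penalty_cycles;
    }

    int main(void) {
        double penalty = 20.0;  /* placeholder miss penalty, in cycles */
        printf("direct mapped: %.2f cycles\n", avg_access_cycles(0.022, 2.0, penalty));
        printf("2-way:         %.2f cycles\n", avg_access_cycles(0.012, 3.0, penalty));
        return 0;
    }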


2.12 a. 16B, to match the level 2 data cache write path.

b. Assume merging write buffer entries are 16B wide. Since each store can write 8B, a merging write buffer entry would fill up in 2 cycles. The level-2 cache will take 4 cycles to write each entry. A non-merging write buffer would take 4 cycles to write the 8B result of each store. This means the merging write buffer would be 2 times faster.

c. With blocking caches, the presence of misses effectively freezes progress made by the machine, so whether there are misses or not doesn't change the required number of write buffer entries. With non-blocking caches, writes can be processed from the write buffer during misses, which may mean fewer entries are needed.

2.13 a. A 2GB DRAM with parity or ECC effectively has 9 bit bytes, and would require 18 1Gb DRAMs. To create 72 output bits, each one would have to output 72/18 = 4 bits. b. A burst length of 4 reads out 32B. c. The DDR-667 DIMM bandwidth is 667 × 8 = 5336 MB/s. The DDR-533 DIMM bandwidth is 533 × 8 = 4264 MB/s.

2.14 a. This is similar to the scenario given in the figure, but tRCD and CL are both 5. In addition, we are fetching two times the data in the figure. Thus it requires 5 + 5 + 4 × 2 = 18 cycles of a 333MHz clock, or 18 × (1/333MHz) = 54.0ns.

     b. The read to an open bank requires 5 + 4 = 9 cycles of a 333MHz clock, or 27.0ns. In the case of a bank activate, this is 14 cycles, or 42.0ns. Including 20ns for miss processing on chip, this makes the two 42 + 20 = 62ns and 27.0 + 20 = 47ns. Including time on chip, the bank activate takes 62/47 = 1.32, or 32% longer.
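The bandwidth and latency figures in 2.13c and 2.14a are straightforward to re-derive. The snippet below just repeats that arithmetic (8 bytes per transfer, a 333 MHz command clock, tRCD = CL = 5, and two 4-clock bursts); it is an illustrative check, not code from the manual.

    #include <stdio.h>

    int main(void) {
        /* 2.13c: DIMM bandwidth = transfer rate (MT/s) x 8 bytes per transfer. */
        printf("DDR2-667: %d MB/s\n", 667 * 8);   /* 5336 MB/s */
        printf("DDR2-533: %d MB/s\n", 533 * 8);   /* 4264 MB/s */

        /* 2.14a: closed-bank read with tRCD = CL = 5 and two 4-clock bursts
         * (twice the data of the figure's example). */
        double clk_ns = 1000.0 / 333.0;           /* ~3 ns per command clock */
        int cycles = 5 + 5 + 4 * 2;
        printf("closed-bank read: %d cycles = %.0f ns\n", cycles, cycles * clk_ns); /* 54 ns */
        return 0;
    }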

2.15 The costs of the two systems are 2 × $130 + $800 = $1060 with the DDR2-667 DIMM and 2 × $100 + $800 = $1000 with the DDR2-533 DIMM. The latency to service a level-2 miss is 14 × (1/333MHz) = 42ns 80% of the time and 9 × (1/333MHz) = 27ns 20% of the time with the DDR2-667 DIMM. It is 12 × (1/266MHz) = 45ns (80% of the time) and 8 × (1/266MHz) = 30ns (20% of the time) with the DDR2-533 DIMM. The CPI added by the level-2 misses in the case of DDR2-667 is 0.00333 × 42 × .8 + 0.00333 × 27 × .2 = 0.130, giving a total of 1.5 + 0.130 = 1.63. Meanwhile the CPI added by the level-2 misses for DDR2-533 is 0.00333 × 45 × .8 + 0.00333 × 30 × .2 = 0.140, giving a total of 1.5 + 0.140 = 1.64. Thus the drop is only 1.64/1.63 = 1.006, or 0.6%, while the cost is $1060/$1000 = 1.06 or 6.0% greater. The cost/performance of the DDR2-667 system is 1.63 × 1060 = 1728 while the cost/performance of the DDR2-533 system is 1.64 × 1000 = 1640, so the DDR2-533 system is a better value.

2.16 The cores will be executing 8cores × 3GHz/2.0CPI = 12 billion instructions per second. This will generate 12 × 0.00667 = 80 million level-2 misses per second. With the burst length of 8, this would be 80 × 32B = 2560MB/sec. If the memory


bandwidth is sometimes 2X this, it would be 5120MB/sec. From Figure 2.14, this is just barely within the bandwidth provided by DDR2-667 DIMMs, so just one memory channel would suffice.

2.17 a. The system built from 1Gb DRAMs will have twice as many banks as the system built from 2Gb DRAMs. Thus the 1Gb-based system should provide higher performance since it can have more banks simultaneously open.

     b. The power required to drive the output lines is the same in both cases, but the system built with the x4 DRAMs would require activating banks on 18 DRAMs, versus only 9 DRAMs for the x8 parts. The page size activated on each x4 and x8 part is the same, and takes roughly the same activation energy. Thus since there are fewer DRAMs being activated in the x8 design option, it would have lower power.

2.18 a. With policy 1,
        Precharge delay Trp = 5 × (1/333 MHz) = 15ns
        Activation delay Trcd = 5 × (1/333 MHz) = 15ns
        Column select delay Tcas = 4 × (1/333 MHz) = 12ns

        Access time when there is a row buffer hit:
        Th = (r/100) × (Tcas + Tddr)

        Access time when there is a miss:
        Tm = ((100 – r)/100) × (Trp + Trcd + Tcas + Tddr)

        With policy 2,
        Access time = Trcd + Tcas + Tddr

        If A is the total number of accesses, the tip-off point will occur when the net access time with policy 1 is equal to the total access time with policy 2, i.e.,

        (r/100) × (Tcas + Tddr) × A + ((100 – r)/100) × (Trp + Trcd + Tcas + Tddr) × A = (Trcd + Tcas + Tddr) × A
        ⇒ r = 100 × Trp/(Trp + Trcd)
        r = 100 × (15)/(15 + 15) = 50%

        If r is less than 50%, then we have to proactively close a page to get the best performance, else we can keep the page open.

     b. The key benefit of closing a page is to hide the precharge delay Trp from the critical path. If the accesses are back to back, then this is not possible. This new constraint will not impact policy 1.
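As a quick check on the break-even condition derived in 2.18a, a few lines of C reproduce the 50% figure. The timing values are the ones computed above; the program itself is illustrative and not part of the manual. Note that Tcas and Tddr cancel out of the break-even condition, so only Trp and Trcd matter.

    #include <stdio.h>

    int main(void) {
        double trp = 15.0, trcd = 15.0;                 /* ns, from 2.18a */
        double r = 100.0 * trp / (trp + trcd);          /* break-even hit rate */
        printf("keep the page open when the row-buffer hit rate exceeds %.0f%%\n", r);
        return 0;
    }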


d. The null call and null I/O call have the largest slowdown. These have no real work to outweigh the virtualization overhead of changing protection levels, so they have the largest slowdowns.

2.22 The virtual machine running on top of another virtual machine would have to emulate privilege levels as if it were running on a host without VT-x technology.

2.23 a. As of the date of the Computer paper, AMD-V adds more support for virtualizing virtual memory, so it could provide higher performance for memory-intensive applications with large memory footprints.

     b. Both provide support for interrupt virtualization, but AMD's IOMMU also adds capabilities that allow secure virtual machine guest operating system access to selected devices.

2.24 Open hands-on exercise, no fixed solution.

2.25 a. These results are from experiments on a 3.3GHz Intel® Xeon® Processor X5680 with Nehalem architecture (Westmere at 32nm). The number of misses per 1K instructions of the L1 Dcache increases significantly by more than 300X when the input data size goes from 8KB to 64KB, and stays relatively constant around 300/1K instructions for all the larger data sets. Similar behavior, with different flattening points, is observed on the L2 and L3 caches.

     b. The IPC decreases by 60%, 20%, and 66% when input data size goes from 8KB to 128KB, from 128KB to 4MB, and from 4MB to 32MB, respectively. This shows the importance of all caches. Among all three levels, the L1 and L3 caches are more important. This is because the L2 cache in the Intel® Xeon® Processor X5680 is relatively small and slow, with capacity being 256KB and latency being around 11 cycles.

     c. For a recent Intel i7 processor (3.3GHz Intel® Xeon® Processor X5680), when the data set size is increased from 8KB to 128KB, the number of L1 Dcache misses per 1K instructions increases by around 300, and the number of L2 cache misses per 1K instructions remains negligible. With an 11 cycle miss penalty, this means that without prefetching or latency tolerance from out-of-order issue we would expect there to be an extra 3300 cycles per 1K instructions due to L1 misses, which means an increase of 3.3 cycles per instruction on average. The measured CPI with the 8KB input data size is 1.37. Without any latency tolerance mechanisms we would expect the CPI of the 128KB case to be 1.37 + 3.3 = 4.67. However, the measured CPI of the 128KB case is 3.44. This means that memory latency hiding techniques such as OOO execution, prefetching, and non-blocking caches improve the performance by more than 26%.
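The CPI estimate in 2.25c is a short chain of arithmetic that can be re-checked in a few lines. The numbers below (300 extra misses per 1K instructions, an 11-cycle penalty, and the measured CPIs of 1.37 and 3.44) come from the text; the program is only an illustrative re-derivation.

    #include <stdio.h>

    int main(void) {
        double extra_misses_per_1k = 300.0;  /* added L1 Dcache misses per 1K instructions */
        double miss_penalty = 11.0;          /* cycles (L2 hit latency) */
        double base_cpi = 1.37;              /* measured CPI, 8KB input */
        double measured_cpi = 3.44;          /* measured CPI, 128KB input */

        double expected_cpi = base_cpi + extra_misses_per_1k * miss_penalty / 1000.0; /* 4.67 */
        printf("expected CPI with no latency hiding: %.2f\n", expected_cpi);
        printf("improvement from latency hiding: %.0f%%\n",
               100.0 * (expected_cpi - measured_cpi) / expected_cpi);   /* about 26 percent */
        return 0;
    }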

Chapter 3 Solutions

Case Study 1: Exploring the Impact of Microarchitectural Techniques

3.1 The baseline performance (in cycles, per loop iteration) of the code sequence in Figure 3.48, if no new instruction's execution could be initiated until the previous instruction's execution had completed, is 40. See Figure S.2. Each instruction requires one clock cycle of execution (a clock cycle in which that instruction, and only that instruction, is occupying the execution units; since every instruction must execute, the loop will take at least that many clock cycles). To that base number, we add the extra latency cycles. Don't forget the branch shadow cycle.

3.2 How many cycles would the loop body in the code sequence in Figure 3.48 require if the pipeline detected true data dependencies and only stalled on those, rather than blindly stalling everything just because one functional unit is busy? The answer is 25, as shown in Figure S.3. Remember, the point of the extra latency cycles is to allow an instruction to complete whatever actions it needs, in order to produce its correct output. Until that output is ready, no dependent instructions can be executed. So the first LD must stall the next instruction for three clock cycles. The MULTD produces a result for its successor, and therefore must stall 4 more clocks, and so on.

Figure S.2 Baseline performance (in cycles, per loop iteration) of the code sequence in Figure 3.48.

Loop: LD    F2,0(Rx)     1 + 4
      DIVD  F8,F2,F0     1 + 12
      MULTD F2,F6,F2     1 + 5
      LD    F4,0(Ry)     1 + 4
      ADDD  F4,F0,F4     1 + 1
      ADDD  F10,F8,F2    1 + 1
      ADDI  Rx,Rx,#8     1
      ADDI  Ry,Ry,#8     1
      SD    F4,0(Ry)     1 + 1
      SUB   R20,R4,Rx    1
      BNZ   R20,Loop     1 + 1

cycles per loop iter 40
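The 40-cycle figure follows directly from summing one execute cycle plus the extra latency of each instruction in Figure S.2. The sketch below, which is not from the manual, just performs that sum.

    #include <stdio.h>

    int main(void) {
        /* Extra latency cycles for LD, DIVD, MULTD, LD, ADDD, ADDD,
         * ADDI, ADDI, SD, SUB, BNZ (in program order, per Figure S.2). */
        int extra[] = {4, 12, 5, 4, 1, 1, 0, 0, 1, 0, 1};
        int total = 0;
        for (int i = 0; i < (int)(sizeof extra / sizeof extra[0]); i++)
            total += 1 + extra[i];
        printf("cycles per loop iteration: %d\n", total);   /* 40 */
        return 0;
    }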


that LD could conceivably have been executed concurrently with the DIVD and the MULTD. Since this problem posited a two-execution-pipe machine, the LD executes in the cycle following the DIVD/MULTD. The loop overhead instructions at the loop’s bottom also exhibit some potential for concurrency because they do not depend on any long-latency instructions.

3.4 Possible answers:

  1. If an interrupt occurs between N and N + 1, then N + 1 must not have been allowed to write its results to any permanent architectural state. Alternatively, it might be permissible to delay the interrupt until N + 1 completes.
  2. If N and N + 1 happen to target the same register or architectural state (say, memory), then allowing N to overwrite what N + 1 wrote would be wrong.
  3. N might be a long floating-point op that eventually traps. N + 1 cannot be allowed to change arch state in case N is to be retried.

      Execution pipe 0               Execution pipe 1
Loop: LD    F2,0(Rx)            ;    <nop>
      <stall, LD latency>       ;    <nop>            (4 cycles)
      DIVD  F8,F2,F0            ;    MULTD F2,F6,F2
      LD    F4,0(Ry)            ;    <nop>
      <stall, LD latency>       ;    <nop>            (4 cycles)
      ADDD  F4,F0,F4            ;    <nop>
      <stall, DIVD latency>     ;    <nop>            (6 cycles)
      ADDD  F10,F8,F2           ;    ADDI  Rx,Rx,#8
      ADDI  Ry,Ry,#8            ;    SD    F4,0(Ry)
      SUB   R20,R4,Rx           ;    BNZ   R20,Loop
      <branch shadow>           ;    <nop>            (1 cycle)

cycles per loop iter 22

Figure S.4 Number of cycles required per loop.


Long-latency ops are at highest risk of being passed by a subsequent op. The DIVD instruction will complete long after the LD F4,0(Ry), for example.

3.5 Figure S.5 demonstrates one possible way to reorder the instructions to improve the performance of the code in Figure 3.48. The number of cycles that this reordered code takes is 20.

3.6 a. Fraction of all cycles, counting both pipes, wasted in the reordered code shown in Figure S.5: 11 ops out of 2 × 20 opportunities. 1 – 11/40 = 1 – 0.275 = 0.725.

    b. Results of hand-unrolling two iterations of the loop, as shown in Figure S.6: 22 cycles for the two iterations, or 11 cycles per iteration.

    c. Speedup = exec time w/o enhancement / exec time with enhancement
               = 20 / (22/2) = 1.82

      Execution pipe 0               Execution pipe 1
Loop: LD    F2,0(Rx)            ;    LD    F4,0(Ry)
      <stall, LD latency>       ;    <nop>            (4 cycles)
      DIVD  F8,F2,F0            ;    ADDD  F4,F0,F4
      MULTD F2,F6,F2            ;    <nop>
      <nop>                     ;    SD    F4,0(Ry)
      ADDI  Rx,Rx,#8            ;    <nop>
      ADDI  Ry,Ry,#8            ;    <nop>
      <stall, DIVD latency>     ;    <nop>            (7 cycles)
      <nop>                     ;    SUB   R20,R4,Rx
      ADDD  F10,F8,F2           ;    BNZ   R20,Loop
      <branch shadow>           ;    <nop>            (1 cycle)

#ops: 11
#nops: (20 × 2) – 11 = 29
cycles per loop iter 20

Figure S.5 Number of cycles taken by reordered code.



3.8 See Figure S.8. The rename table has arbitrary values at clock cycle N – 1. Look at the next two instructions (I0 and I1): I0 targets the F1 register, and I1 will write the F4 register. This means that in clock cycle N, the rename table will have had its entries 1 and 4 overwritten with the next available Temp register designators. I0 gets renamed first, so it gets the first T reg (9). I1 then gets renamed to T10. In clock cycle N + 1, instructions I2 and I3 come along; I2 will overwrite F6, and I3 will write F0. This means the rename table's entry 6 gets 11 (the next available T reg), and rename table entry 0 is written to the T reg after that (12). In principle, you don't have to allocate T regs sequentially, but it's much easier in hardware if you do.
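The renaming steps described above can be mirrored in a few lines of C. This is a minimal sketch with an artificially simple sequential free list, not the book's mechanism; the register numbers match the narrative (F1 and F4 renamed in cycle N, F6 and F0 in the next cycle).

    #include <stdio.h>

    #define NUM_ARCH_REGS 64

    static int rename_table[NUM_ARCH_REGS];  /* architectural reg -> current T reg */
    static int next_t = 9;                   /* next available T reg, as in the figure */

    /* Remap a destination register to the next free T reg and return it. */
    static int rename_dest(int arch_reg) {
        rename_table[arch_reg] = next_t;
        return next_t++;
    }

    int main(void) {
        for (int i = 0; i < NUM_ARCH_REGS; i++) rename_table[i] = i;
        /* Cycle N: I0 writes F1, I1 writes F4. */
        int t1 = rename_dest(1), t4 = rename_dest(4);
        printf("F1 -> T%d, F4 -> T%d\n", t1, t4);    /* T9, T10 */
        /* Cycle N + 1: I2 writes F6, I3 writes F0. */
        int t6 = rename_dest(6), t0 = rename_dest(0);
        printf("F6 -> T%d, F0 -> T%d\n", t6, t0);    /* T11, T12 */
        return 0;
    }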

3.9 See Figure S.9.

Figure S.8 Cycle-by-cycle state of the rename table for every instruction of the code in Figure 3.51.

ADD R1, R1, R1;   5 +  5 -> 10
ADD R1, R1, R1;  10 + 10 -> 20
ADD R1, R1, R1;  20 + 20 -> 40

Figure S.9 Value of R1 when the sequence has been executed.

[Figure S.8 body: the rename table contents at clock cycles N – 1, N, and N + 1 for the four instructions of the code in Figure 3.51 (I0: SUBD writing F1, I1: ADDD writing F4, I2: MULTD writing F6, I3: DIVD writing F0). In cycle N, entries 1 and 4 have been remapped to T9 and T10 (I0 and I1); in cycle N + 1, entries 6 and 0 have been remapped to T11 and T12 (I2 and I3), and the next-available T register pointer advances accordingly.]


3.10 An example of an event that, in the presence of self-draining pipelines, could disrupt the pipelining and yield wrong results is shown in Figure S.10.

3.11 See Figure S.11. The convention is that an instruction does not enter the execution phase until all of its operands are ready. So the first instruction, LW R3,0(R0), marches through its first three stages (F, D, E), but the M stage that comes next requires the usual cycle plus two more for latency. Until the data from a LD is available at the execution unit, any subsequent instructions (especially that ADDI R1, R1, #1, which depends on the 2nd LW) cannot enter the E stage, and must therefore stall at the D stage.

Figure S.10 Example of an event that yields wrong results. What could go wrong with this? If an interrupt is taken between clock cycles 1 and 4, then the results of the LW at cycle 2 will end up in R1, instead of the LW at cycle 1. Bank stalls and ECC stalls will cause the same effect—pipes will drain, and the last writer wins, a classic WAW hazard. All other “intermediate” results are lost.

[Figure S.10 body: cycle-by-cycle occupancy of the alu, ld/st, and br pipelines over clock cycles 1 through 7. Two copies each of LW R4,0(R0), LW R5,8(R1), SW R9,8(R8), and SW R7,0(R6) are in flight in the ld/st pipe, alongside ADDI/SUB operations in the alu pipe and a BNZ in the br pipe, illustrating how a drain can let the later writer finish first.]

The loop used in 3.11, with the cycles lost per iteration:

Loop: LW   R3,0(R0)
      LW   R1,0(R3)
      ADDI R1,R1,#1
      SUB  R4,R3,R
      SW   R1,0(R3)
      BNZ  R4,Loop
      LW   R3,0(R0)

(3.11 a) 4 cycles lost to branch overhead
(3.11 b) 2 cycles lost with static predictor
(3.11 c) No cycles lost with correct dynamic prediction

Figure S.11 Phases of each instruction per clock cycle for one iteration of the loop.

[Figure S.11 body: F/D/E/M/W pipeline phases for each instruction of the loop across clock cycles 1 through 19; the LW instructions spend extra cycles in M, and dependent instructions stall in D until the loaded data is available, as described above.]