
  • Introduction
  • Chapter 1 Solutions
  • Chapter 2 Solutions
  • Chapter 3 Solutions
  • Chapter 4 Solutions
  • Chapter 5 Solutions
  • Chapter 6 Solutions
  • Chapter 7 Solutions
  • Chapter 8 Solutions
  • Appendix A Solutions


Solutions to All Exercises for Instructors

Captain Kirk: You ought to sell an instruction and maintenance manual with this thing.

Cyrano Jones: If I did, what would happen to man's search for knowledge?

Star Trek, "The Trouble with Tribbles" (Dec. 29, 1967)

Chapter 1 Solutions

c. Assume that the relative disk volume scales linearly with disk capacity. Then

Projected value = 1990 value × (30 GB / 100 MB) × (1 - 0.3)^years

where years is the number of years forward from 1990. Then,

Mass_2002 = 1000 g × (30 GB / 100 MB) × (1 - 0.3)^12 = 4152 g

Height_2002 = Volume_2002 / Drive bay area = 29.7 cm

where Volume_2002 is the 1990 volume scaled by the same (30 GB / 100 MB) × (1 - 0.3)^12 factor.

d. Actual component cost of the $1000 PC is

$1000 × 46.6% = $466

Cost of components other than the hard disk is

$466 × 91% = $424

e. Cost of the hard disk is

$466 × 9% = $42

Assume disk density did improve 60% per year from 1990 through 1996 and at 100% per year since 1997. Then by 2001 an improvement of only 30% per year would have led to a higher hard disk cost of

$42 × (1 + 60%)^6 × (1 + 100%)^5 / (1 + 30%)^11 = $1258

Adding this to the cost of the other components and scaling component cost up to list price gives

PC cost = ($424 + $1258) / 46.6% = $3609

At this higher price desktop digital video editing would be much less widely accessible.

1.2 Let PV stand for percent vectorization divided by 100.

a. Plot

Net speedup = 1 / (1 - PV + PV/10)

for 0 ≤ PV ≤ 1.

b. From the equation in (a), if Net speedup = 2 then the percent vectorization is PV = 5/9, or 56%.

c. With PV = 5/9 from part (b),

Time in vector mode = (PV/10) / (1 - PV + PV/10) = 0.11

or 11%.

d. From the equation in (a), if Net speedup = 10/2 = 5, then PV = 8/9, or 89%.

e. The increased percent vectorization needed to match a hardware speedup of 10 × 2 = 20 applied to the original 70% vectorization is found by solving

1 / (1 - PV + PV/10) = 1 / (1 - 0.7 + 0.7/20)

Solving shows that the vectorization must increase to 74%, not a large increase. Improving the compiler to increase vectorization another 4% may be easier and cheaper than improving the hardware by a factor of 2.
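A minimal Python sketch (not part of the original solution) that evaluates the net-speedup formula from part (a) and inverts it to find the required vectorization; the function names and the closed-form inverse are illustrative additions.

    # Sketch for exercise 1.2: net speedup with a vector unit that is s times
    # faster, and the vectorization fraction needed for a target net speedup.
    def net_speedup(pv, s=10.0):
        # Formula from part (a): 1 / ((1 - PV) + PV/s)
        return 1.0 / ((1.0 - pv) + pv / s)

    def pv_for_speedup(target, s=10.0):
        # Solve 1 / ((1 - PV) + PV/s) = target for PV
        return (1.0 - 1.0 / target) / (1.0 - 1.0 / s)

    print(pv_for_speedup(2))                        # ~0.56, as in part (b)
    print(pv_for_speedup(5))                        # ~0.89 = 8/9, as in part (d)
    print(pv_for_speedup(net_speedup(0.70, s=20)))  # ~0.74, as in part (e)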

1.3 This question further explores the effects of Amdahl's Law, but the data given in the question is in a form that cannot be directly applied to the general speedup formula.

a. Because the information given does not allow direct application of Amdahl's Law, we start from the definition of speedup:

Speedup_overall = Time_unenhanced / Time_enhanced

The unenhanced time is the sum of the time that does not benefit from the 10 times faster speedup plus the time that does benefit, but before its reduction by the factor of 10. Thus,

Time_unenhanced = 50% Time_enhanced + 10 × 50% Time_enhanced = 5.5 Time_enhanced

Substituting into the equation for Speedup yields

Speedup_overall = 5.5 Time_enhanced / Time_enhanced = 5.5

b. Using Amdahl's Law, the given value of 10 for the enhancement factor, and the value for Speedup_overall from part (a), we have

5.5 = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / 10)

Solving shows that the enhancement can be applied 91% of the original time.

1.4 a.

Speedup = Number of floating-point instructions_DFT / Number of floating-point instructions_FFT = n^2 / (n log2 n)

Thus,

n        8    16    32    64     128    256    512    1024
Speedup  2.7  4.0   6.4   10.7   18.2   32.0   56.9   102.4

Also,

lim (n → ∞) Speedup = lim (n → ∞) n^2 / (n log2 n) = lim (n → ∞) n / log2 n = ∞
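The table values follow directly from the n^2 / (n log2 n) ratio; a short Python sketch (illustrative, not part of the original solution) reproduces them:

    import math

    # Sketch for exercise 1.4(a): speedup of the FFT (n log2 n operations)
    # over the DFT (n^2 operations) for the problem sizes in the table.
    for n in [8, 16, 32, 64, 128, 256, 512, 1024]:
        speedup = n ** 2 / (n * math.log2(n))
        print(f"n = {n:4d}   speedup = {speedup:6.1f}")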


Relatively, there are 21 times more instructions executed by the embedded processor.

b.

MIPS_RISC = CC_RISC / (CPI_RISC × 10^6) = CC / (10 × 10^6)

MIPS_emb = CC_emb / (CPI_emb × 10^6) = CC / (6 × 10^6)

The MIPS rating of the embedded processor will be a factor of 10/6 = 1.67 times higher than the rating of the RISC version.

c. The RISC processor performs the non-FP instructions plus 195,578 FP instructions. The embedded processor performs the same number of non-FP instructions as the RISC processor, but performs some larger number of instructions than 195,578 to compute the FP results using non-FP instructions only. The number of non-FP instructions is

Number of non-FP instructions = IC_RISC - 195,578 = 0.108 CC - 195,578

Thus,

Number of instructions for FP_emb = IC_emb - Number of non-FP instructions = 2.27 CC - (0.108 CC - 195,578) = 2.16 CC + 195,578

Finally,

Average number of instructions for FP in software_emb = Number of instructions for FP_emb / Number of FP instructions = (2.16 CC + 195,578) / 195,578

1.8 Care in using consistent units and in expressing dies/wafer and good dies/wafer as integer values is important for this exercise.

a. The number of good dies must be an integer and is less than or equal to the number of dies per wafer, which must also be an integer. The result presented here assumes that the integer dies per wafer is modified by wafer and die yield to obtain the integer number of good dies.

Microprocessor     Dies/wafer   Good dies/wafer
Alpha 21264C          231            128
Power3-II             157             71
Itanium                79             20
MIPS R14000           122             46
UltraSPARC III        118             44
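A Python sketch of the dies-per-wafer, die-yield, and cost-per-good-die calculation behind these tables. The model is the standard one used in the text; the wafer diameter, wafer cost, and defect density in the example call are illustrative placeholders, not the exercise data.

    import math

    # Sketch for exercise 1.8: die cost model.  Dies/wafer and die yield
    # follow the textbook formulas; example inputs are placeholders.
    def dies_per_wafer(wafer_diam_cm, die_area_cm2):
        return math.floor(math.pi * (wafer_diam_cm / 2) ** 2 / die_area_cm2
                          - math.pi * wafer_diam_cm / math.sqrt(2 * die_area_cm2))

    def die_yield(defects_per_cm2, die_area_cm2, alpha=4.0, wafer_yield=1.0):
        return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)

    def cost_per_good_die(wafer_cost, wafer_diam_cm, die_area_cm2,
                          defects_per_cm2, alpha=4.0):
        good = math.floor(dies_per_wafer(wafer_diam_cm, die_area_cm2)
                          * die_yield(defects_per_cm2, die_area_cm2, alpha))
        return wafer_cost / good

    # Example: a 3.0 cm^2 die on a 30 cm wafer costing $5000 (placeholder values).
    print(cost_per_good_die(5000.0, 30.0, 3.0, defects_per_cm2=0.6))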

b. The cost per good die is

Microprocessor     $/good die
Alpha 21264C          $36.
Power3-II             $56.
Itanium               $245.
MIPS R14000           $80.
UltraSPARC III        $118.

c. The cost per good, tested, and packaged part is

Microprocessor     $/good, tested, packaged die
Alpha 21264C          $64.
Power3-II             $78.
Itanium               $268.
MIPS R14000           $108.
UltraSPARC III        $152.

d. The largest processor die is the Itanium at 300 mm^2. Defect density has a substantial effect on cost, pointing out the value of carefully managing the wafer manufacturing process to maximize the number of defect-free dies. The table below restates die cost assuming the baseline defect density from parts (a)–(c) and then for the lower and higher densities for this part.

Itanium                   $/good, tested, packaged die
defect density = 0.5         $268.
defect density = 0.3         $171.
defect density = 1.0         $635.

e. For the Alpha 21264C, tested, packaged die costs for an assumed defect density of 0.8 per cm^2 and variation in parameter α from α = 4 to α = 6 are $77.53 and $78.59, respectively.

1.9 a. Various answers are possible. Assume a wafer cost of $5000 and α = 4 in all cases. For a defect density of 0.6/cm^2 and die area ranging from 0.5 to 4 cm^2, die cost ranges from $4.93 to $118.56. Fitting a polynomial curve to the (die area, die cost) pairs shows that a quadratic model has an acceptable norm-of-the-residuals value of 0.669. Fitting to a third-degree polynomial yields a very small cubic term coefficient and a better norm of the residuals of 0.017, but the quadratic fit is good and the polynomial simpler, so that would be the preferred choice.
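A Python sketch of the curve fit described in 1.9(a), illustrative only: it generates (die area, die cost) pairs from the same cost model, assuming a 30 cm wafer diameter, and fits quadratic and cubic polynomials; numpy.polyfit's residual output plays the role of the norm of the residuals discussed above.

    import math
    import numpy as np

    # Sketch for exercise 1.9(a): fit polynomials to (die area, die cost).
    # Wafer cost $5000, alpha = 4, and defect density 0.6/cm^2 come from the
    # solution text; the 30 cm wafer diameter is an assumed placeholder.
    def die_cost(area_cm2, wafer_cost=5000.0, wafer_diam=30.0,
                 defects_per_cm2=0.6, alpha=4.0):
        dies = (math.pi * (wafer_diam / 2) ** 2 / area_cm2
                - math.pi * wafer_diam / math.sqrt(2 * area_cm2))
        dyield = (1 + defects_per_cm2 * area_cm2 / alpha) ** (-alpha)
        return wafer_cost / (dies * dyield)

    areas = np.linspace(0.5, 4.0, 8)          # cm^2
    costs = [die_cost(a) for a in areas]

    for degree in (2, 3):
        coeffs, residuals, *_ = np.polyfit(areas, costs, degree, full=True)
        print(degree, coeffs, np.sqrt(residuals))   # sqrt of SSE ~ residual norm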


Now

AM - GM = (a + b)/2 - √(ab) = (a - 2√(ab) + b)/2 = (√a - √b)^2 / 2 ≥ 0

because the quotient of a nonnegative real number by a positive real number is nonnegative. Thus, AM ≥ GM.

Now assume that AM = GM. Then,

(a + b)/2 = √(ab)

Algebraic manipulation yields √a - √b = 0, which for positive integers implies a = b. So AM = GM when a = b.

1.12 For positive integers r and s

Arithmetic mean = AM = (r + s)/2

and

Harmonic mean = HM = 2 / (1/r + 1/s)

Now

AM - HM = (r + s)/2 - 2/(1/r + 1/s) = (r - s)^2 / (2(r + s)) ≥ 0

because the quotient of a nonnegative real number by a positive real number is nonnegative. Thus, AM ≥ HM.

Now assume that AM = HM. Then,

(r + s)/2 = 2 / (1/r + 1/s)

Algebraic manipulation yields (r - s)^2 = 0, which for positive integers implies r = s. So AM = HM when r = s.

1.13 a. Let the data value sets be

A = {10^7, 10^7, 10^7, 10^7, 10^7, 10^7, 10^7, 10^7, 10^7, 1}

and

B = {1, 1, 1, 1, 1, 1, 1, 1, 1, 10^7}

Arithmetic mean (A) = 9 × 10^6

Median (A) = 10 × 10^6

Arithmetic mean (B) = 1 × 10^6

Median (B) = 1


Set A mean and median are within 10% in value, but set B mean and median are far apart. A large outlying value seriously distorts the arithmetic mean, while a small outlying value has a lesser effect.

b. Harmonic mean (A) = 10.0

Harmonic mean (B) = 1.1

In this case the set B harmonic mean is very close to the median, but the set A harmonic mean is much smaller than the set A median. The harmonic mean is more affected by a small outlying value than a large one.

c. Which is closest depends on the nature of the outlying data point. Neither mean produces a statistic that is representative of the data values under all circumstances.

d. Let the new data sets be

C = {10^7, 10^7, 10^7, 10^7, 10^7, 10^7, 10^7, 10^7, 10^7, 5 × 10^6}

and

D = {1, 1, 1, 1, 1, 1, 1, 1, 1, 2}

Then

Arithmetic mean (C) = 9.5 × 10^6

Harmonic mean (C) = 9.1 × 10^6

Median (C) = 10 × 10^6

and

Arithmetic mean (D) = 1.1

Harmonic mean (D) = 1.05

Median (D) = 1

In both cases, the means and medians are close. Summarizing a set of data values that has less disparity among the values by stating a statistic, such as mean or median, is intrinsically more meaningful.
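A Python sketch (illustrative, not part of the original solution) that computes the statistics for sets A and B above with the standard library, showing the outlier effects described in parts (a) and (b):

    import statistics

    # Sketch for exercise 1.13: outlier effect on mean, harmonic mean, median.
    A = [10**7] * 9 + [1]          # nine large values, one small outlier
    B = [1] * 9 + [10**7]          # nine small values, one large outlier

    for name, data in (("A", A), ("B", B)):
        print(name,
              statistics.mean(data),           # distorted by a large outlier (B)
              statistics.harmonic_mean(data),  # distorted by a small outlier (A)
              statistics.median(data))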

1.14 a. For a set of n programs, each taking Time_i on one machine, the equal-time weightings on that machine are

w_i = 1 / (Time_i × Σ_{j=1}^{n} (1 / Time_j))

Applying this formula to the Reference Time data for the 14 benchmarks yields the weights shown in Figure S.3.
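A Python sketch (with placeholder run times rather than the SPEC reference data) of the equal-time weighting above and the weighted arithmetic mean it produces on another machine:

    # Sketch for exercise 1.14(a): equal-time weights
    # w_i = 1 / (Time_i * sum_j (1 / Time_j)), so every benchmark contributes
    # the same weighted time on the reference machine.
    def equal_time_weights(ref_times):
        inv_sum = sum(1.0 / t for t in ref_times)
        return [1.0 / (t * inv_sum) for t in ref_times]

    def weighted_mean_time(weights, times):
        return sum(w * t for w, t in zip(weights, times))

    ref_times   = [1300.0, 2100.0, 1800.0]   # placeholder reference run times (s)
    other_times = [290.0, 220.0, 610.0]      # same programs on another machine
    w = equal_time_weights(ref_times)
    print(w)
    print(weighted_mean_time(w, ref_times), weighted_mean_time(w, other_times))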

Weighted time for each benchmark (seconds)

Benchmark        Reference computer   Compaq   IBM    Intel
168.wupwise           134.4             29.3    43.8   34.
171.swim              134.4             12.5    59.2   33.
172.mgrid             134.4             25.6    47.3   54.
173.applu             134.4             34.8    43.2   55.
177.mesa              134.4             26.8    49.2   25.
178.galgel            134.4             30.2    35.4   45.
179.art               134.4             10.9    14.5   35.
183.equake            134.4             61.1    25.4   57.
187.facerec           134.4             19.8    62.5   45.
188.ammp              134.4             33.2    49.4   47.
189.lucas             134.4             21.0    51.5   43.
191.fma3d             134.4             28.5    44.1   47.
200.sixtrack          134.4             49.2    65.5   79.
301.apsi              134.4             30.2    46.0   38.

Weighted arithmetic mean (seconds)        1881     413     637    643
SPECfp_base (geometric mean as percent)    100     500     313    304

Figure S.4 Weighted runtimes. The table entries for each benchmark show the time in seconds for that benchmark on a given computer. The summation of benchmark times gives the weighted arithmetic mean execution time of the benchmark suite. Note that with equal weighting of the benchmarks the three computers studied are ranked Compaq, IBM, Intel from fastest (lowest time) to slowest, which is the same ranking seen in the SPECfp_base_2000 numbers, where the highest corresponds to fastest.

1.15 a. The first condition is that the measured time be accurate, precise, and exclusively for the program of interest. Execution time is measured, typically, using a clock that ignores what the computer is running. This might be a clock on the wall or a free-running timer chip in the computer with an output that can be read using a system call. If the computer can work on computational tasks other than the program of interest during the measurement interval, then it is important to remove this other time from the run duration of the program of interest. If we cannot account for this other time, then the performance result derived from the measurement will be inaccurate and may be of little meaning.

If the program completes execution in an interval that is short compared to the resolution of the timer, then the run time may be over- or understated enough, due to rounding, to affect our understanding. This is a problem of insufficient measurement precision, also known as having too few significant digits in a measurement. When a more precise timer (for example, microseconds instead of milliseconds) is not available, the traditional solution is to change the benchmark program input to yield a longer run time so that the available timer resolution is then sufficiently precise. The goal is for the run time to become long enough to require the desired number of significant digits to express, so that rounding will have an insignificant effect.
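A small Python sketch (not from the original solution) of the precision point above: a run that is short relative to the timer's resolution is better measured by repeating the work and dividing.

    import time

    # Sketch for exercise 1.15(a): a single short run may be comparable to the
    # timer resolution; repeating the work lengthens the measured interval so
    # rounding in the timer matters less.
    def workload():
        return sum(i * i for i in range(10_000))   # illustrative short task

    t0 = time.perf_counter()
    workload()
    single_run = time.perf_counter() - t0          # few significant digits

    reps = 1_000
    t0 = time.perf_counter()
    for _ in range(reps):
        workload()
    per_run = (time.perf_counter() - t0) / reps    # longer interval, more precise

    print(single_run, per_run)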

The condition that has to do with the program itself is this: what if the program does not terminate, or does not terminate within the patience of the measurer? How long then is the execution time? How should run time be defined?

b. Throughput is a consistent and reliable measure of performance if a consistent, meaningful unit of work can be defined. Consider a web server that sends a single, fixed page in response to requests. Each request then triggers a computational task, transferring identical web page description language to each new requesting computer, that is essentially identical each time. Throughput in terms of pages served per unit time would then be inversely proportional to the time to perform what is essentially a fixed benchmark task: serving this page. This is the same concept involved in measuring the time to run a fixed SPEC benchmark with its given code and given input data set. So throughput of fixed tasks is directly comparable to running fixed benchmarks.

When the task performed changes each time, for example very different pages served for each new request, then the use of throughput becomes more difficult. If an aggregate of tasks with consistent character exists, then throughput measured over a time interval that encompasses the collection of tasks may be sufficiently consistent and reliable. It may be difficult to identify such a task collection or to restrict the processing performed to just that collection.

c. With overlapped work, single-transaction time will understate the amount of work, measured in units of number of transactions completed, that the computer can perform per unit time. Throughput will not understate performance in this way.

1.16 a. Amdahl's Law can be generalized to handle multiple enhancements. If only one enhancement can be used at a time during program execution, then

Speedup = 1 / ((1 - Σ_i FE_i) + Σ_i (FE_i / SE_i))

where FE_i is the fraction of time that enhancement i can be used and SE_i is the speedup of enhancement i. For a single enhancement the equation reduces to the familiar form of Amdahl's Law.

With three enhancements we have

Speedup = 1 / ((1 - (FE_1 + FE_2 + FE_3)) + FE_1/SE_1 + FE_2/SE_2 + FE_3/SE_3)

Thus, if only one enhancement can be implemented, enhancement 3 offers much greater speedup.

Speedup_12 = 1 / ((1 - (0.15 + 0.15)) + 0.15/30 + 0.15/20) = 1.40

Speedup_13 = 1 / ((1 - (0.15 + 0.7)) + 0.15/30 + 0.7/10) = 4.44

Speedup_23 = 1 / ((1 - (0.15 + 0.7)) + 0.15/20 + 0.7/10) = 4.40

Thus, if only a pair of enhancements can be implemented, enhancements 1 and 3 offer the greatest speedup.

Selecting the fastest enhancement(s) may not yield the highest speedup. As Amdahl's Law states, an enhancement contributes to speedup only for the fraction of time that it can be used.
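A Python sketch (not from the original solution) of the generalized formula, evaluated for the enhancement pairs above; it assumes the fractions 0.15, 0.15, 0.70 and speedups 30, 20, 10 read off those calculations.

    # Sketch for exercise 1.16: Amdahl's Law with several enhancements,
    # only one usable at a time.  FE/SE values are those appearing in the
    # pair calculations above.
    def speedup(fractions, speedups):
        unenhanced = 1.0 - sum(fractions)
        enhanced = sum(f / s for f, s in zip(fractions, speedups))
        return 1.0 / (unenhanced + enhanced)

    FE = {1: 0.15, 2: 0.15, 3: 0.70}
    SE = {1: 30.0, 2: 20.0, 3: 10.0}

    for pair in ((1, 2), (1, 3), (2, 3)):
        s = speedup([FE[i] for i in pair], [SE[i] for i in pair])
        print(pair, round(s, 2))        # (1, 3) gives the largest speedup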

1.17 a. The MIPS ratings of the two configurations give

MIPS_proc = 120 × 10^6 = (I + F × Y) / W

MIPS_proc/co = 80 × 10^6 = (I + F) / B

b.

I = 120 × 10^6 × W - F × Y = (120 × 10^6)(4) - (8 × 10^6)(50) = 80 × 10^6 instructions

c.

B = (I + F) / (80 × 10^6) = (80 × 10^6 + 8 × 10^6) / (80 × 10^6) = 1.1 sec

d.

MFLOPS_proc/co = F / (B - Time for integer instructions) = F / (B - I / MIPS_proc/co) = (8 × 10^6) / (1.1 - (80 × 10^6) / (80 × 10^6)) = 80 MFLOPS

e. The time for the processor alone is W = 4 sec. The time for the processor/co-processor configuration is B = 1.1 sec. While its MIPS rating is lower, the faster execution time belongs to the processor/co-processor combination. Your colleague's evaluation is correct.
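A Python sketch that reruns the arithmetic of parts (a) through (d) with the values appearing above (W = 4 s, ratings of 120 and 80 MIPS, F = 8 × 10^6 FP operations, Y = 50 integer instructions per emulated FP operation); the variable names are illustrative.

    # Sketch for exercise 1.17: integer instruction count, co-processor run
    # time, and MFLOPS, using the values that appear in the solution above.
    MIPS_proc    = 120e6      # instructions/second without the co-processor
    MIPS_proc_co = 80e6       # instructions/second with the co-processor
    W, F, Y = 4.0, 8e6, 50    # run time (s), FP operations, int instrs per FP op

    I_count = MIPS_proc * W - F * Y            # 80e6 integer instructions
    B = (I_count + F) / MIPS_proc_co           # 1.1 s with the co-processor
    time_int = I_count / MIPS_proc_co          # time spent on integer instructions
    mflops = F / ((B - time_int) * 1e6)        # 80 MFLOPS

    print(I_count, B, mflops)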

1.18 a.

MFLOPS_native = Number of floating-point operations / (Execution time in seconds × 10^6)

with 287 × 10^6 floating-point operations counted for the benchmark. Because one of the two measured values (time) is reported with only three significant digits, the answer should be stated to three significant digits of precision.

b. There are four 171.swim operations that are not explicitly given normalized values: load, store, copy, and convert. Let's think through what normalized values to use for these instructions.

First, convert comprises only 0.006% of the FP operations. Thus, convert would have to correspond to about 1000 normalized FP operations to have any effect on MFLOPS reported with three significant digits. It seems unlikely that convert would be this much more time-consuming than exponentiation or a trig function. Any less and there is no effect. So let's apply an important principle (keep models simple) and model convert as one normalized FP operation.

Next, copy replicates a value, making it available at a second location. This same behavior can be produced by adding zero to a value and saving the result in a new location. So, reasonably, copy should have the same normalized FP count as add.

Finally, load and store interact with computer memory. They can be quick to the extent that the memory responds quickly to an access request, unlike divide, square root, exponentiation, and sin, which are computed using a series of approximation steps to reach an answer. Because load and store are very common, Amdahl's Law suggests making them fast. So assume a normalized FP value of 1 for load and store. Note that any increase would significantly affect the result.

With the above normalized FP operations model, we have

MFLOPS_normalized = Normalized number of floating-point operations / (Execution time in seconds × 10^6)
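A Python sketch (mine, not from the solution) of computing normalized MFLOPS from an operation mix. The weight of 1 for load, store, copy, convert, add, subtract, multiply, and compare follows the model argued above; the weights of 4 for divide/sqrt and 8 for exponentiation/sin are the textbook's usual normalized values, assumed here rather than taken from this exercise. The operation counts and execution time are placeholders, not the 171.swim data.

    # Sketch for exercise 1.18(b): normalized MFLOPS from an operation mix.
    # Weights: simple FP ops, load, store, copy, convert = 1; divide, sqrt = 4;
    # exp, sin = 8 (4 and 8 assumed from the usual normalized-operation table).
    NORMALIZED_WEIGHT = {"add": 1, "subtract": 1, "multiply": 1, "compare": 1,
                         "load": 1, "store": 1, "copy": 1, "convert": 1,
                         "divide": 4, "sqrt": 4, "exp": 8, "sin": 8}

    def mflops_normalized(op_counts, exec_time_s):
        normalized_ops = sum(NORMALIZED_WEIGHT[op] * n
                             for op, n in op_counts.items())
        return normalized_ops / (exec_time_s * 1e6)

    ops = {"add": 120e6, "multiply": 100e6, "load": 40e6, "divide": 20e6}  # placeholder mix
    print(mflops_normalized(ops, exec_time_s=100.0))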

1.19 No solution provided.

1.20 No solution provided.

1.21 a. No solution provided.

b. The steps of the word-processing workload and their nature are as follows.

1. Load the word-processing program and the document file. [Disk and memory system.]


Figure S.5 Graph showing relative performance of three processors (normalized to Pentium). The speedup and speedup per watt are quite different.

1.24 Figure 1.10 shows the addition of three costs to that of the circuitry components, which determines system list price. System power and volume increases that are the unavoidable consequence of a CPU component power consumption increase can be identified in an analogous way. First, consider the effect of CPU power. An additional watt of CPU power consumption requires an additional watt of power supply capacity. Because a power supply is not 100% efficient, the power input to the supply circuitry must increase by more than 1 watt, and the waste energy of conversion, appearing as heat in the components of the supply, will increase. This input power increase of greater than 1 watt is modeled much as the direct costs increase shown in Figure 1.10.

At some level of power delivery to the system, the power supply components will become hotter than their rated maximum operating temperatures if only convection cooling is available. While several active cooling technologies are available, the least expensive is forced air. This requires addition of a power supply fan and the power to run it, with the typical small, rotating fan using about 1 watt of power. This additional power requirement can be modeled analogously to the gross margin of Figure 1.10.

Finally, with increasing CPU power consumption the chip will eventually become too hot without a substantial heat sink and, perhaps, a dedicated fan to assure high airflow for the CPU. (The fan may be designed to run only when the CPU temperature is particularly high, as is the case for many laptop computers.) This final "power tax" of a CPU fan for the more power-hungry CPU is modeled as the average discount component of list price (see Figure 1.10).

System volume is affected by the need to house all system components and, for air cooling, to provide adequate paths for airflow. The volume model for CPU power consumption increases follows the system cost (Figure 1.10) and system power consumption models. The starting point is the basic system electronics volume (motherboard) and the basic power supply.

To provide an additional watt to the CPU requires a higher capacity power supply, and, all other things being equal, higher capacity supplies will use physically larger components, increasing volume. If power supply capacity increases sufficiently, then a cooling fan must be added to the supply along with (possibly) internal airflow paths to provide cooling. This increases power supply volume. Finally, a hot CPU chip may require a dedicated large heat sink and fan, further increasing system volume. A CPU heat sink can easily occupy 100 or more times the volume of the packaged CPU chip alone.

The bottom line is that each additional watt of CPU power consumption has a definite impact on system size, due to larger and more numerous components, and on system noise, due to new and/or increased flows for forced air cooling. Generally these effects are viewed negatively in the marketplace. Volume is an important characteristic, as evidenced by the rapid replacement of CRT monitors with LCD displays in business environments, where personnel workstations (cubicles) can be reduced in area, saving office rent. Reduced computer noise is generally viewed favorably by users.

1.25 If the collection of performance values is viewed as a set of orthogonal vectors in hyperspace, then those vectors define a right hyperprism with one vertex at the origin. The geometric mean of those values is the length of the side of a hypercube having the same volume as the hyperprism.

The advantage of using total execution time is ease of computing the summary metric. The advantage of using the weighted arithmetic mean of the execution times, with weights relative to a comparison platform, is that this takes into account the relative importance of the individual benchmarks in making up the aggregate workload. The advantage of using the geometric mean of the speed ratios is that any machine from a collection is equally valid as the normative platform, and the result appears to allow for simple scaling to predict performance of new programs on the collection of machines.

The disadvantage of using total execution time is that the actual workload may not be well modeled. The disadvantage of using the weighted arithmetic mean is that the results are affected by machine design details and program input size. The disadvantage of using the geometric mean is that it does not track execution time, our gold standard for performance measurement.
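A Python sketch (with made-up run times) of the claim that, for the geometric mean of speed ratios, any machine in the collection can serve as the normative platform: changing the reference rescales every mean by the same factor, so the ranking is unchanged.

    from math import prod

    # Sketch for exercise 1.25: the geometric mean of speed ratios is
    # insensitive to the choice of reference machine.  Run times are
    # illustrative placeholders.
    times = {"M1": [10.0, 40.0, 90.0],
             "M2": [20.0, 20.0, 60.0],
             "M3": [30.0, 10.0, 30.0]}

    def geometric_mean(xs):
        return prod(xs) ** (1.0 / len(xs))

    def gm_speedups(reference):
        ref = times[reference]
        return {m: geometric_mean([r / t for r, t in zip(ref, ts)])
                for m, ts in times.items()}

    print(gm_speedups("M1"))
    print(gm_speedups("M3"))   # different values, same ordering of machines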

1.26 Whether performance changes, and what might be concluded, will depend on the specific hardware and software platform chosen for the experiment.

What can be concluded about SPEC2000 is the following. Suppose that there are SPEC2000 results for a platform of interest to you. Your computational needs likely depend on programs, your own and/or those from a vendor, that are not compiled as aggressively as the benchmarks. Thus, the performance you enjoy from this computer is likely to be less than that reported in the SPEC2000 results. Further, the performance ranking of a variety of computers on your workload