- Introduction
- Chapter 1 Solutions
- Chapter 2 Solutions
- Chapter 3 Solutions
- Chapter 4 Solutions
- Chapter 5 Solutions
- Chapter 6 Solutions
- Chapter 7 Solutions
- Chapter 8 Solutions
- Appendix A Solutions
Solutions to All Exercises for Instructors
Captain Kirk: "You ought to sell an instruction and maintenance manual with this thing."

Cyrano Jones: "If I did, what would happen to man's search for knowledge?"

Star Trek, "The Trouble with Tribbles" (Dec. 29, 1967)
Chapter 1 Solutions
c. Assume that the relative disk volume scales linearly with disk capacity. Then

Projected value = 1990 value × (30 GB / 100 MB) × (1 − 0.3)^years

where years is the number of years forward from 1990. Then,

Mass_2002 = 1000 g × (30 GB / 100 MB) × (1 − 0.3)^12 = 4152 g

Height_2002 = Volume_2002 / Drive bay area = (1000 cm^3 × (30 GB / 100 MB) × (1 − 0.3)^12) / Drive bay area = 29.7 cm

d. Actual component cost of the $1000 PC is $1000 × 46.6% = $466. Cost of the components other than the hard disk is $466 × 91% = $424.

e. Cost of the hard disk is $466 × 9% = $42. Assume disk density did improve 60% per year from 1990 through 1996 and at 100% per year since 1997. Then by 2001 an improvement of only 30% per year would have led to a higher hard disk cost of

$42 × (1 + 60%)^6 × (1 + 100%)^5 / (1 + 30%)^11 = $1258

Adding this to the cost of the other components and scaling component cost up to list price gives

PC cost = ($424 + $1258) / 46.6% = $3609

At this higher price desktop digital video editing would be much less widely accessible.

1.2 Let PV stand for percent vectorization divided by 100.

a. Plot

Net speedup = 1 / ((1 − PV) + PV/10), for 0 ≤ PV ≤ 1

b. From the equation in (a), if Net speedup = 2 then the percent vectorization is PV = 5/9, or 56%.

c. Time in vector mode = (PV/10) / ((1 − PV) + PV/10); for PV = 5/9 this is 1/9, or 11%.

d. From the equation in (a), if Net speedup = 10/2 = 5, then PV = 8/9, or 89%.

e. The increased percent vectorization needed to match a hardware speedup of 10 × 2 = 20 applied to the original 70% vectorization is found by solving

1 / ((1 − PV) + PV/10) = 1 / ((1 − 0.7) + 0.7/20)
Solving shows that the vectorization must increase to 74%, not a large
increase. Improving the compiler to increase vectorization another 4% may
be easier and cheaper than improving the hardware by a factor of 2.
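These results are easy to check numerically. The sketch below is a minimal Python check, assuming only the exercise's 10× vector-unit speedup; it reproduces the 56%, 89%, and 74% figures.

```python
# Minimal check of the Exercise 1.2 results (assumes the 10x vector-unit speedup).
def net_speedup(pv, vector_speedup=10.0):
    """Overall speedup when a fraction pv of the work runs vector_speedup times faster."""
    return 1.0 / ((1.0 - pv) + pv / vector_speedup)

def pv_for_speedup(target, vector_speedup=10.0):
    """Invert net_speedup: vectorized fraction needed to reach a target overall speedup."""
    return (1.0 - 1.0 / target) / (1.0 - 1.0 / vector_speedup)

print(pv_for_speedup(2))                     # ~0.556 -> 56% vectorization for a speedup of 2
print(pv_for_speedup(5))                     # ~0.889 -> 89% for half the maximum speedup of 10
print(net_speedup(0.7, 20))                  # ~2.99, a 20x vector unit at 70% vectorization
print(pv_for_speedup(net_speedup(0.7, 20)))  # ~0.739 -> 74% vectorization matches it
```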
1.3 This question further explores the effects of Amdahl’s Law, but the data given in
the question is in a form that cannot be directly applied to the general speedup
formula.
a. Because the information given does not allow direct application of Amdahl's Law, we start from the definition of speedup:

Speedup_overall = Time_unenhanced / Time_enhanced

The unenhanced time is the sum of the time that does not benefit from the 10 times faster speedup plus the time that does benefit, but before its reduction by the factor of 10. Thus,

Time_unenhanced = 50% × Time_enhanced + 10 × 50% × Time_enhanced = 5.5 × Time_enhanced

Substituting into the equation for Speedup yields

Speedup_overall = 5.5 × Time_enhanced / Time_enhanced = 5.5

b. Using Amdahl's Law, the given value of 10 for the enhancement factor, and the value for Speedup_overall from part (a), we have

5.5 = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / 10)

Solving shows that the enhancement can be applied 91% of the original time.

1.4 a.

Speedup = Number of floating-point instructions_DFT / Number of floating-point instructions_FFT = n^2 / (n log2 n) = n / log2 n

Thus,

| n | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|---|---|---|---|
| Speedup | 2.7 | 4.0 | 6.4 | 10.7 | 18.2 | 32.0 | 56.9 | 102.4 |

Also,

lim (n → ∞) Speedup = lim (n → ∞) n / log2 n = ∞
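Both calculations can be reproduced with a short script. The sketch below assumes only the values stated above.

```python
import math

# Exercise 1.3: half of the *enhanced* time benefits from a 10x enhancement.
time_enhanced = 1.0
time_unenhanced = 0.5 * time_enhanced + 10 * 0.5 * time_enhanced
print(time_unenhanced / time_enhanced)      # 5.5 overall speedup

# Exercise 1.3(b): fraction of the original time the enhancement applies to,
# from Amdahl's Law: 5.5 = 1 / ((1 - f) + f/10).
f = (1 - 1 / 5.5) / (1 - 1 / 10)
print(round(f, 2))                          # 0.91 -> about 91%

# Exercise 1.4(a): DFT vs. FFT operation counts, speedup = n^2 / (n log2 n).
for n in [8, 16, 32, 64, 128, 256, 512, 1024]:
    print(n, round(n**2 / (n * math.log2(n)), 1))
```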
Relatively, there are 21 times more instructions executed by the embedded
processor.
b. The MIPS ratings are

MIPS_RISC = CC_RISC / (CPI_RISC × 10^6) = CC / (10 × 10^6)

MIPS_emb = CC_emb / (CPI_emb × 10^6) = CC / (6 × 10^6)

The MIPS rating of the embedded processor will be a factor of 10/6 = 1.67 times higher than the rating of the RISC version.

c. The RISC processor performs the non-FP instructions plus 195,578 FP instructions. The embedded processor performs the same number of non-FP instructions as the RISC processor, but performs some larger number of instructions than 195,578 to compute the FP results using non-FP instructions only. The number of non-FP instructions is

Number of non-FP instructions = IC_RISC − 195,578 = 0.108 CC − 195,578

Thus,

Number of instructions for FP_emb = IC_emb − Number of non-FP instructions = 2.27 CC − (0.108 CC − 195,578) = 2.16 CC + 195,578

Finally,

Average number of instructions for FP in software_emb = Number of instructions for FP_emb / Number of FP instructions = (2.16 CC + 195,578) / 195,578
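This bookkeeping can be checked numerically. In the sketch below the cycle count CC is given an arbitrary placeholder value, since the solution leaves it symbolic, so only the printed ratios are meaningful.

```python
# Instruction-count bookkeeping for the RISC vs. software-FP embedded comparison.
CC = 1e9                        # assumed cycle count (hypothetical placeholder)
FP_INSTRUCTIONS = 195_578       # FP instructions executed by the RISC version

ic_risc = 0.108 * CC            # instruction count of the RISC version (from the solution)
ic_emb = 2.27 * CC              # instruction count of the embedded (software-FP) version
print(round(ic_emb / ic_risc))  # ~21: relative instruction count

cpi_risc, cpi_emb = 10, 6
print(round(cpi_risc / cpi_emb, 2))   # ~1.67: ratio of MIPS ratings, embedded over RISC

non_fp = ic_risc - FP_INSTRUCTIONS          # non-FP instructions, the same on both versions
fp_in_software = ic_emb - non_fp            # instructions spent emulating FP operations
print(fp_in_software / FP_INSTRUCTIONS)     # average instructions per emulated FP operation
```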
1.8 Care in using consistent units and in expressing dies/wafer and good dies/wafer as integer values is important for this exercise.

a. The number of good dies must be an integer and is less than or equal to the number of dies per wafer, which must also be an integer. The result presented here assumes that the integer dies per wafer is modified by wafer and die yield to obtain the integer number of good dies.

| Microprocessor | Dies/wafer | Good dies/wafer |
|---|---|---|
| Alpha 21264C | 231 | 128 |
| Power3-II | 157 | 71 |
| Itanium | 79 | 20 |
| MIPS R14000 | 122 | 46 |
| UltraSPARC III | 118 | 44 |
b. The cost per good die is

| Microprocessor | $/good die |
|---|---|
| Alpha 21264C | $36 |
| Power3-II | $56 |
| Itanium | $245 |
| MIPS R14000 | $80 |
| UltraSPARC III | $118 |

c. The cost per good, tested, and packaged part is

| Microprocessor | $/good, tested, packaged die |
|---|---|
| Alpha 21264C | $64 |
| Power3-II | $78 |
| Itanium | $268 |
| MIPS R14000 | $108 |
| UltraSPARC III | $152 |

d. The largest processor die is the Itanium at 300 mm^2. Defect density has a substantial effect on cost, pointing out the value of carefully managing the wafer manufacturing process to maximize the number of defect-free dies. The table below restates die cost assuming the baseline defect density from parts (a)–(c) and then the lower and higher densities considered in this part.

| Itanium | $/good, tested, packaged die |
|---|---|
| defect density = 0.5 per cm^2 | $268 |
| defect density = 0.3 per cm^2 | $171 |
| defect density = 1.0 per cm^2 | $635 |

e. For the Alpha 21264C, tested, packaged die costs for an assumed defect density of 0.8 per cm^2 and variation in parameter α from α = 4 to α = 6 are $77.53 and $78.59, respectively.

1.9 a. Various answers are possible. Assume a wafer cost of $5000 and α = 4 in all cases. For a defect density of 0.6 per cm^2 and die area ranging from 0.5 to 4 cm^2, die cost ranges from $4.93 to $118.56. Fitting a polynomial curve to the (die area, die cost) pairs shows that a quadratic model has an acceptable norm-of-the-residuals value of 0.669. Fitting to a third-degree polynomial yields a very small cubic term coefficient and a better norm of the residuals of 0.017, but the quadratic fit is good and the polynomial is simpler, so that would be the preferred choice.
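For reference, the sketch below implements the dies-per-wafer and die-yield model these exercises rely on. The wafer cost, wafer diameter, defect density, and die areas are assumed example values rather than the exercises' data, so the printed costs will not match the figures quoted above exactly.

```python
import math

def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    # Wafer area divided by die area, minus the partial dies lost around the edge.
    return int(math.pi * (wafer_diameter_cm / 2) ** 2 / die_area_cm2
               - math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2))

def die_yield(die_area_cm2, defects_per_cm2, alpha=4.0, wafer_yield=1.0):
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)

def cost_per_good_die(wafer_cost, wafer_diameter_cm, die_area_cm2,
                      defects_per_cm2, alpha=4.0):
    dies = dies_per_wafer(wafer_diameter_cm, die_area_cm2)
    good_dies = int(dies * die_yield(die_area_cm2, defects_per_cm2, alpha))
    return wafer_cost / good_dies

# Exercise 1.9(a)-style sweep: die cost grows faster than linearly with die area.
for area in [0.5, 1.0, 2.0, 3.0, 4.0]:
    print(area, round(cost_per_good_die(5000.0, 30.0, area, 0.6), 2))
```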
Now

AM − GM = (a + b)/2 − √(ab) = (a − 2√(ab) + b)/2 = (√a − √b)^2 / 2 ≥ 0

because the quotient of a nonnegative real number and a positive real number is nonnegative. Thus, AM ≥ GM.

Now assume that AM = GM. Then,

(a + b)/2 = √(ab)

Algebraic manipulation yields a − b = 0, which for positive integers implies a = b. So AM = GM when a = b.

1.12 For positive integers r and s,

Arithmetic mean = AM = (r + s)/2

and

Harmonic mean = HM = 2 / (1/r + 1/s) = 2rs / (r + s)

Now

AM − HM = (r + s)/2 − 2rs/(r + s) = ((r + s)^2 − 4rs) / (2(r + s)) = (r − s)^2 / (2(r + s)) ≥ 0

because the quotient of a nonnegative real number and a positive real number is nonnegative. Thus, AM ≥ HM.

Now assume that AM = HM. Then,

(r + s)/2 = 2rs / (r + s)

Algebraic manipulation yields (r − s)^2 = 0, which for positive integers implies r = s. So AM = HM when r = s.

1.13 a. Let the data value sets be

A = {10^7, 10^7, 10^7, 10^7, 10^7, 10^7, 10^7, 10^7, 10^7, 1}

and

B = {1, 1, 1, 1, 1, 1, 1, 1, 1, 10^7}

Arithmetic mean (A) = 9 × 10^6

Median (A) = 10 × 10^6

Arithmetic mean (B) = 1 × 10^6

Median (B) = 1
Set A mean and median are within 10% in value, but set B mean and median
are far apart. A large outlying value seriously distorts the arithmetic mean,
while a small outlying value has a lesser effect.
b. Harmonic mean (A) = 10.0

Harmonic mean (B) = 1.1
In this case the set B harmonic mean is very close to the median, but set A
harmonic mean is much smaller than the set A median. The harmonic mean is
more affected by a small outlying value than a large one.
c. Which is closest depends on the nature of the outlying data point. Neither
mean produces a statistic that is representative of the data values under all cir-
cumstances.
d. Let the new data sets be

C = {1, 1, 1, 1, 1, 1, 1, 1, 1, 2}

and

D = {10^7, 10^7, 10^7, 10^7, 10^7, 10^7, 10^7, 10^7, 10^7, 5 × 10^6}

Then

Arithmetic mean (C) = 1.1

Harmonic mean (C) = 1.05

Median (C) = 1

and

Arithmetic mean (D) = 9.5 × 10^6

Harmonic mean (D) = 9.1 × 10^6

Median (D) = 10 × 10^6
In both cases, the means and medians are close. Summarizing a set of data
values that has less disparity among the values by stating a statistic, such as
mean or median, is intrinsically more meaningful.
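These comparisons are easy to reproduce. The sketch below uses Python's statistics module on the four sets defined above.

```python
import statistics

# The four data sets from Exercise 1.13.
A = [10**7] * 9 + [1]
B = [1] * 9 + [10**7]
C = [1] * 9 + [2]
D = [10**7] * 9 + [5 * 10**6]

for name, data in [("A", A), ("B", B), ("C", C), ("D", D)]:
    print(name,
          statistics.mean(data),                     # arithmetic mean
          round(statistics.harmonic_mean(data), 2),  # harmonic mean
          statistics.median(data))                   # median
```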
1.14 a. For a set of n programs each taking Time_i on one machine, the equal-time weightings on that machine are

w_i = 1 / (Time_i × Σ_{j=1..n} (1/Time_j))

Applying this formula to the Reference Time data for the 14 benchmarks yields the weights shown in Figure S.3.

| Benchmark | Reference computer | Compaq | IBM | Intel |
|---|---|---|---|---|
| 168.wupwise | 134.4 | 29.3 | 43.8 | 34 |
| 171.swim | 134.4 | 12.5 | 59.2 | 33 |
| 172.mgrid | 134.4 | 25.6 | 47.3 | 54 |
| 173.applu | 134.4 | 34.8 | 43.2 | 55 |
| 177.mesa | 134.4 | 26.8 | 49.2 | 25 |
| 178.galgel | 134.4 | 30.2 | 35.4 | 45 |
| 179.art | 134.4 | 10.9 | 14.5 | 35 |
| 183.equake | 134.4 | 61.1 | 25.4 | 57 |
| 187.facerec | 134.4 | 19.8 | 62.5 | 45 |
| 188.ammp | 134.4 | 33.2 | 49.4 | 47 |
| 189.lucas | 134.4 | 21.0 | 51.5 | 43 |
| 191.fma3d | 134.4 | 28.5 | 44.1 | 47 |
| 200.sixtrack | 134.4 | 49.2 | 65.5 | 79 |
| 301.apsi | 134.4 | 30.2 | 46.0 | 38 |
| Weighted arithmetic mean (seconds) | 1881 | 413 | 637 | 643 |
| SPECfp_base (geometric mean as percent) | 100 | 500 | 313 | 304 |

Figure S.4 Weighted runtimes. The table entries show the weighted time in seconds for each benchmark on a given computer. The summation of benchmark times gives the weighted arithmetic mean execution time of the benchmark suite. Note that with equal weighting of the benchmarks the three computers studied are ranked Compaq, IBM, Intel from fastest (lowest time) to slowest, which is the same ranking seen in the SPECfp_base_2000 numbers, where the highest corresponds to fastest.
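A small sketch of the equal-time weighting follows: weights are proportional to the reciprocal of each benchmark's time on the chosen (reference) machine, so every benchmark contributes the same weighted time on that machine. The three reference times below are hypothetical, not the SPEC data.

```python
def equal_time_weights(times):
    """Weights w_i = (1/T_i) / sum_j (1/T_j); they sum to 1."""
    inv_total = sum(1.0 / t for t in times)
    return [(1.0 / t) / inv_total for t in times]

ref_times = [1600.0, 3100.0, 1800.0]              # hypothetical reference times
weights = equal_time_weights(ref_times)
print(weights)                                     # weights sum to 1
print([w * t for w, t in zip(weights, ref_times)]) # identical weighted times on the reference
```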
1.15 a. The first condition is that the measured time be accurate, precise, and exclu-
sively for the program of interest. Execution time is measured, typically,
using a clock that ignores what the computer is running. This might be a
clock on the wall or a free-running timer chip in the computer with an output
that can be read using a system call. If the computer can work on computa-
tional tasks other than the program of interest during the measurement inter-
val, then it is important to remove this other time from the run duration of the
program of interest. If we cannot account for this other time, then the perfor-
mance result derived from the measurement will be inaccurate, and may be of
little meaning.
If the program completes execution in an interval that is short compared to
the resolution of the timer, then the run time may be over- or under-stated
enough due to rounding to affect our understanding. This is a problem of
insufficient measurement precision, also known as having too few significant
digits in a measurement. When a more precise timer (for example, microseconds instead of milliseconds) is not available, the traditional solution is to
change the benchmark program input to yield a longer run time so that the
available timer resolution is then sufficiently precise. The goal is for the run
time to become long enough to require the desired number of significant dig-
its to express so that rounding will have an insignificant effect.
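The timer-resolution point can be illustrated with a simple measurement loop. In the sketch below the workload function is just a stand-in; the work is repeated until the measured interval is long relative to the clock's resolution, and the per-run time is then derived from the total.

```python
import time

def workload():
    # Stand-in for the program of interest.
    return sum(i * i for i in range(100_000))

reps = 1
while True:
    start = time.perf_counter()
    for _ in range(reps):
        workload()
    elapsed = time.perf_counter() - start
    if elapsed > 0.1:        # long enough for ~3 significant digits on most clocks
        break
    reps *= 2                # otherwise lengthen the measurement and try again

print(f"{elapsed / reps:.6f} seconds per run over {reps} repetitions")
```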
The condition that has to do with the program itself is this: what if the program does not terminate, or does not terminate within the patience of the measurer? How long then is the execution time? How should run time be defined?
b. Throughput is a consistent and reliable measure of performance if a consis-
tent, meaningful unit of work can be defined. Consider a web server that
sends a single, fixed page in response to requests. Each request then triggers a
computational task, transferring identical web page description language to each
new requesting computer, that is essentially identical each time. Throughput
in terms of pages served per unit time would then be inversely proportional to
the time to perform what is essentially a fixed benchmark task: serving this
page. This is the same concept involved in measuring the time to run a fixed
SPEC benchmark with its given code and given input data set. So throughput
of fixed tasks is directly comparable to running fixed benchmarks.
When the task performed changes each time, for example very different
pages served for each new request, then the use of throughput becomes more
difficult. If an aggregate of tasks with consistent character exists, then
throughput measured over a time interval that encompasses the collection of
tasks may be sufficiently consistent and reliable. It may be difficult to identify
such a task collection or to restrict the processing performed to just that col-
lection.
c. With overlapped work, single transaction time will understate the amount of
work, measured in units of number of transactions completed, that the com-
puter can perform per unit time. Throughput will not understate performance
in this way.
1.16 a. Amdahl’s Law can be generalized to handle multiple enhancements. If only
one enhancement can be used at a time during program execution, then

Speedup = 1 / ((1 − Σ_i FE_i) + Σ_i (FE_i / SE_i))

where FE_i is the fraction of time that enhancement i can be used and SE_i is the speedup of enhancement i. For a single enhancement the equation reduces to the familiar form of Amdahl's Law.

With three enhancements we have

Speedup = 1 / ((1 − (FE_1 + FE_2 + FE_3)) + FE_1/SE_1 + FE_2/SE_2 + FE_3/SE_3)
Thus, if only one enhancement can be implemented, enhancement 3 offers
much greater speedup.
Speedup_12 = 1 / ((1 − (0.15 + 0.15)) + 0.15/30 + 0.15/20)

Speedup_13 = 1 / ((1 − (0.15 + 0.7)) + 0.15/30 + 0.7/SE_3)

Speedup_23 = 1 / ((1 − (0.15 + 0.7)) + 0.15/20 + 0.7/SE_3)

Thus, if only a pair of enhancements can be implemented, enhancements 1 and 3 offer the greatest speedup.
Selecting the fastest enhancement(s) may not yield the highest speedup. As
Amdahl’s Law states, an enhancement contributes to speedup only for the
fraction of time that it can be used.
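The generalized formula is easy to evaluate programmatically. In the sketch below the usable fractions 0.15, 0.15, and 0.70 and the speedups of 30 and 20 for enhancements 1 and 2 come from the equations above; the speedup assumed for enhancement 3 is a placeholder, since its value is not restated here.

```python
# Generalized Amdahl's Law: at most one enhancement applies at any moment.
def overall_speedup(FE, SE):
    unenhanced = 1.0 - sum(FE)
    return 1.0 / (unenhanced + sum(f / s for f, s in zip(FE, SE)))

FE = [0.15, 0.15, 0.70]        # usable fractions of the original time
SE = [30.0, 20.0, 15.0]        # SE[2] is an assumed placeholder value

print(overall_speedup([FE[0], FE[1]], [SE[0], SE[1]]))  # enhancements 1 and 2
print(overall_speedup([FE[0], FE[2]], [SE[0], SE[2]]))  # enhancements 1 and 3
print(overall_speedup([FE[1], FE[2]], [SE[1], SE[2]]))  # enhancements 2 and 3
```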
1.17 a. Let I be the number of integer instructions in the benchmark, F the number of floating-point operations, Y the number of integer instructions needed to emulate one floating-point operation in software, W the run time without the co-processor, and B the run time with it. The two MIPS ratings are then

MIPS_proc = 120 × 10^6 = (I + Y × F) / W

MIPS_proc/co = 80 × 10^6 = (I + F) / B

b. Solving the first equation for the number of integer instructions,

I = 120 × 10^6 × W − F × Y = (120 × 10^6)(4) − (8 × 10^6)(50) = 80 × 10^6 instructions

c. Solving the second equation for the run time with the co-processor,

B = (I + F) / (80 × 10^6) = (80 × 10^6 + 8 × 10^6) / (80 × 10^6) = 1.1 sec

d. The integer instructions take I / MIPS_proc/co = (80 × 10^6) / (80 × 10^6) = 1 sec of the 1.1 sec run time, leaving 0.1 sec for the floating-point work. Thus

MFLOPS_proc/co = F / ((B − I / MIPS_proc/co) × 10^6) = 8 × 10^6 / (0.1 × 10^6) = 80 MFLOPS

e. The time for the processor alone is W = 4 sec. The time for the processor/co-processor configuration is B = 1.1 sec. While its MIPS rating is lower, the faster execution time belongs to the processor/co-processor combination. Your colleague's evaluation is correct.
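The arithmetic for parts (a) through (d) can be checked with a few lines; the sketch below uses only the values recovered above.

```python
# Exercise 1.17 check: W = 4 s at 120 MIPS alone, 80 MIPS with the co-processor,
# F = 8 million FP operations, Y = 50 integer instructions per emulated FP operation.
W, F, Y = 4.0, 8e6, 50
rate_proc, rate_proc_co = 120e6, 80e6                 # instructions per second

integer_instructions = rate_proc * W - F * Y          # 80 million integer instructions
B = (integer_instructions + F) / rate_proc_co         # 1.1 seconds with the co-processor
integer_time = integer_instructions / rate_proc_co    # 1.0 of those seconds is integer work
mflops_proc_co = F / ((B - integer_time) * 1e6)       # 80 MFLOPS

print(integer_instructions, B, mflops_proc_co)
```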
1.18 a.

MFLOPS_native = Number of floating-point operations / (Execution time in seconds × 10^6)

Because one of the two measured values (time) is reported with only three significant digits, the answer should be stated to three significant digits of precision.
b. There are four 171.swim operations that are not explicitly given normalized
values: load, store, copy, and convert. Let’s think through what normalized
values to use for these instructions.
First, convert comprises only 0.006% of the FP operations. Thus, convert
would have to correspond to about 1000 normalized FP operations to have
any effect on MFLOPS reported with three significant digits. It seems
unlikely that convert would be this much more time-consuming than expo-
nentiation or a trig function. Any less and there is no effect. So let’s apply an
important principle—keep models simple—and model convert as one nor-
malized FP operation.
Next, copy replicates a value, making it available at a second location. This
same behavior can be produced by adding zero to a value and saving the
result in a new location. So, reasonably, copy should have the same normal-
ized FP count as add.
Finally, load and store interact with computer memory. They can be quick to
the extent that the memory responds quickly to an access request, unlike
divide, square root, exponentiation, and sin, which are computed using a
series of approximation steps to reach an answer. Because load and store are
very common, Amdahl’s Law suggests making them fast. So assume a nor-
malized FP value of 1 for load and store. Note that any increase would signif-
icantly affect the result.
With the above normalized FP operations model, we have

MFLOPS_normalized = Normalized number of floating-point operations / (Execution time in seconds × 10^6)
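A sketch of how such a normalization is applied in practice follows. The 1/4/8 weights and the operation mix below are illustrative assumptions, not the 171.swim data or the exercise's normalization table.

```python
# Normalized-MFLOPS sketch: weight each operation class, then divide by time.
NORMALIZED_OPS = {
    "add": 1, "subtract": 1, "multiply": 1, "compare": 1,
    "load": 1, "store": 1, "copy": 1, "convert": 1,   # per the model argued above
    "divide": 4, "sqrt": 4,
    "exp": 8, "sin": 8,
}

def mflops(op_count, time_seconds):
    return op_count / (time_seconds * 1e6)

def normalized_mflops(op_counts, time_seconds):
    weighted = sum(NORMALIZED_OPS[op] * n for op, n in op_counts.items())
    return mflops(weighted, time_seconds)

# Hypothetical operation counts and run time.
counts = {"add": 50e6, "multiply": 40e6, "load": 60e6, "store": 30e6, "divide": 5e6}
print(mflops(sum(counts.values()), 10.0))   # native MFLOPS: raw operation count
print(normalized_mflops(counts, 10.0))      # normalized MFLOPS: weighted count
```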
1.19 No solution provided.
1.20 No solution provided.
1.21 a. No solution provided.
b. The steps of the word-processing workload and their nature are as follows.
1. Load the word-processing program and the document file. [Disk and
memory system.]
Figure S.5 (figure not reproduced): graph showing relative performance of three processors (normalized to the Pentium); the speedup and speedup per watt are quite different.

1.24 Figure 1.10 shows the addition of three costs to that of the circuitry components,
which determines system list price. System power and volume increases that are
the unavoidable consequence of a CPU component power consumption increase
can be identified in an analogous way. First, consider the effect of CPU power.
An additional watt of CPU power consumption requires an additional watt of
power supply capacity. Because a power supply is not 100% efficient, the power
input to the supply circuitry must increase by more than 1 watt and the waste
energy of conversion, appearing as heat in the components of the supply, will
increase. This input power increase of greater than 1 watt is modeled much as the
direct costs increase shown in Figure 1.10.
At some level of power delivery to the system, the power supply components will
become hotter than their rated maximum operating temperatures if only convec-
tion cooling is available. While several active cooling technologies are available,
the least expensive is forced air. This requires addition of a power supply fan and
the power to run it, with the typical small, rotating fan using about 1 watt of
power. This additional power requirement can be modeled analogously to the
gross margin of Figure 1.10.
Finally, with increasing CPU power consumption the chip will eventually
become too hot without a substantial heat sink and, perhaps, a dedicated fan to
assure high airflow for the CPU. (The fan may be designed to run only when the
CPU temperature is particularly high, as is the case for many laptop computers).
This final “power tax” of a CPU fan for the more power-hungry CPU is modeled
as the average discount component of list price (see Figure 1.10).
System volume is affected by the need to house all system components and, for
air cooling, to provide adequate paths for airflow. The volume model for CPU
power consumption increases follows the system cost (Figure 1.10) and system
power consumption models. The starting point is the basic system electronics
volume (motherboard) and the basic power supply.
To provide an additional watt to the CPU requires a higher capacity power sup-
ply, and all other things being equal, higher capacity supplies will use physically
larger components, increasing volume. If power supply capacity increases suffi-
ciently, then a cooling fan must be added to the supply along with (possibly)
internal airflow paths to provide cooling. This increases power supply volume.
Finally, a hot CPU chip may require a dedicated large heat sink and fan, further
increasing system volume. A CPU heat sink can easily occupy 100 or more times
the volume of the packaged CPU chip alone.
The bottom line is that each additional watt of CPU power consumption has a
definite impact on system size due to larger and more numerous components and
on system noise due to new and/or increased flows for forced air cooling. Gener-
ally these effects are viewed negatively in the marketplace. Volume is an impor-
tant characteristic, as evidenced by the rapid replacement of CRT monitors with
LCD displays in business environments where personnel workstations (cubicles)
can be reduced in area, saving office rent. Reduced computer noise is generally
viewed favorably by users.
1.25 If the collection of performance values is viewed as a set of orthogonal vectors
in hyperspace, then those vectors define a right hyperprism with one vertex at the
origin. The geometric mean of those values is the length of the side of a hyper-
cube having the same volume as the hyperprism.
The advantage of using total execution time is ease of computing the summary
metric. The advantage of using weighted arithmetic mean of the execution times
with weights relative to a comparison platform is that this takes into account the
relative importance of the individual benchmarks in making up the aggregate
workload. The advantage of using the geometric mean of the speed ratios is that
any machine from a collection is equally valid as the normative platform and the
result appears to allow for simple scaling to predict performance of new pro-
grams on the collection of machines.
The disadvantage of using total execution time is that the actual workload may
not be well modeled. The disadvantage of using weighted arithmetic mean is that
the results are affected by machine design details and program input size. The
disadvantage of using geometric mean is that it does not track execution time,
our gold standard for performance measurement.
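The claim that any machine can serve as the normative platform is easy to demonstrate: the geometric mean of speed ratios yields the same ranking regardless of which machine supplies the reference times. The sketch below uses hypothetical times for three machines and three programs.

```python
import math

# Hypothetical execution times (seconds) for three programs on three machines.
times = {"M1": [10.0, 100.0, 50.0],
         "M2": [20.0, 40.0, 60.0],
         "M3": [15.0, 80.0, 30.0]}

def geomean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

for ref in times:
    # Geometric mean of per-program speed ratios relative to the chosen reference.
    ratios = {m: geomean([r / t for r, t in zip(times[ref], times[m])])
              for m in times}
    order = sorted(ratios, key=ratios.get, reverse=True)
    print(ref, order)      # the ordering is identical for every choice of reference
```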
1.26 Whether performance changes and what might be concluded will depend on the
specific hardware and software platform chosen for the experiment.
What can be concluded about SPEC2000 is the following. Suppose that there are
SPEC2000 results for a platform of interest to you. Your computational needs
likely depend on programs, your own and/or those from a vendor, that are not
compiled as aggressively as the benchmarks. Thus, the performance you enjoy
from this computer is likely to be less than that reported in the SPEC2000 results.
Further, the performance ranking of a variety of computers on your workload