Solution for CS 2506 Spring 2020 MIPS01 ------------------------------------------------------------------------------- 1. For a certain processor, with a 2GHz clock rate, assume the following average CPI for each category of instruction: Category average CPI ---------------------- arithmetic 1 load/store 12 branch 5 Executed on a single processor, a certain program requires performing 2.56E9 arithmetic instructions, 1.28E9 load/store instructions, and 2.56E8 branch instructions. The program described above is modified for parallel execution on multiple cores; each core has the same clock rate as the original processor. The changes to the program alter the number of instructions that must be performed, per core, as follows. For arithmetic and load/store instructions, the number of instructions executed per core will be divided by 0.7 * p, where p is the number of cores used. However, the number of branch instructions per core will not change, relative to the original program. a) [14 points] Fill in the following table (round each answer to the nearest hundredth): cores #arith #load/store #branch #cycles exec time speedup per core per core per core per core ---------------------------------------------------------------------- 1 2.56E9 1.28E9 2.56E8 19.2E9 9.60s 1.00 ---------------------------------------------------------------------- 4 0.80E9 0.40E9 2.56E8 6.88E9 3.44s 2.79 ---------------------------------------------------------------------- 8 0.40E9 0.20E9 2.56E8 4.08E9 2.04s 4.71 ---------------------------------------------------------------------- Calculations for 1 core: #cycles = 2.56E9 * 1 + 1.28E9 * 12 + .256E9 * 5 = 19.2E9 exec time = 19.2E9 cycles / 2.0E9 cycles/second = 9.6 seconds Calculations for 4 cores: #arith = 2.56E9 / 0.8*4 = 0.8E9 per core #load/store = 1.28E9 / 0.8*4 = 0.4E9 per core #branch = .256E9 per core #cycles = 0.8E9 * 1 + 0.4E9 * 12 + 0.256E9 * 5 = 6.88E9 per core exec time = 6.88E9 cycles / 2.0E9 cycles/second = 3.44 seconds speedup = 9.6 / 3.44 = 2.79 Calculations for 8 cores: #arith = 2.56E9 / 0.8*8 = 0.4E9 #load/store = 1.28E9 / 0.8*4 = 0.2E9 #branch = .256E9 #cycles = 0.4E9 * 1 + 0.2E9 * 12 + 0.256E9 * 5 = 4.08E9 per core exec time = 4.08E9 cycles / 2.0E9 cycles/second = 2.04 seconds speedup = 9.6 / 2.04 = 4.71 b) [10 points] What must the CPI for load/store instructions be reduced to in order for the performance, running the program on a single core, to equal the performance of the program running on 4 of the original cores? Justify your conclusion precisely. Answer: We need: 3.44s = (2.56E9 * 1 + 1.28E9 * X + .256E9 * 5) / 2.0E9 This yields X = 3.04E9 / 1.28 = 2.375 cycles. 2. A company currently produces three different processors, all executing the same set of instructions: - P1 has a 2.8 GHz clock rate and an advertised average CPI of 1.2 - P2 has a 3.2 GHz clock rate and an advertised average CPI of 1.6 - P3 has a 3.8 GHz clock rate and an advertised average CPI of 1.8 a) [12 points] Using IPS (instructions per second) as your only criterion, and accepting the information given above, which processor offers the best performance? Justify your conclusion precisely. Answer: Calculate the IPS for each processor: IPS_P1 = (2.8E9 cycles/second)/(1.2 cycles/instruction) = 2.333E9 IPS_P2 = (3.2E9 cycles/second)/(1.6 cycles/instruction) = 2.000E9 IPS_P3 = (3.8E9 cycles/second)/(1.8 cycles/instruction) = 2.111E9 So, by this measure, P1 offers the best performance. b) [12 points] It takes 42 seconds (of CPU time) to execute a certain benchmark on P1. How many machine instructions are executed when that benchmark is run on P1? Justify your conclusion precisely. Answer: # instructions = (2.333E9 instructions/second) * (42 seconds) = 98E9 instructions (IPS for P1 is actually 2 1/3) c) [12 points] The company would like to reduce the execution time of that benchmark on P1 by 25%, but the redesign (of P1) they've come up with would entail increasing the average CPI by 10%. What clock rate must they apply in order to achieve their goal? Justify your conclusion precisely. Answer: We need an execution time of 0.75 * 42 = 31.5 seconds. Now, execution time equals #instructions / (#instructions/second). With the change, the CPI will be 1.1 * 1.2 = 1.32 cycles/instruction. Therefore, the IPS is now X / 1.32, where X is the new clock rate. So, we need 31.5 = (98E9 instructions) / (X / 1.32), which yields clock rate = 98E9 * 1.32 / 31.5 = 4.11E9 3. Consider a computer running a program that requires 300 seconds, with 60 seconds spent executing floating point instructions, 120 seconds executed load/store instructions, 90 seconds spent executing branch instructions, and the rest spent executing other instructions. a) [10 points] Suppose the execution time of floating point instructions is reduced by 40%. Use Amdahl's Law to determine what speedup would be achieved for that program. Round your answer to the nearest one-thousandth. Answer: TimeAfter = TimeUnaffected + TimeAffected/Speedup = 240 + 60 / (5/3) = 240 + 36 = 276s Speedup = TimeBefore / TimeAfter = 300 / 276 = 1.087 b) [10 points] Can the total time can be reduced by 25% by reducing only the time for branch instructions? If so, explain how. If not, explain why. Use Amdahl's Law to justify your answer. Answer: From Amdahl's Law, letting X be the improvement factor for branches: 225 = 210 + 90 / X which yields X = 90 / 15 = 6. So, yes, we can achieve a 25% improvement if we can make branch instructions 6 times faster. c) [10 points] Suppose we have technologies to improve the performance of these three types of instructions (floating point instructions, memory instructions, and branch instructions) by a factor of 2. Suppose, however, we have to choose only one type to improve for some reasons (e.g., labor, space/power constraints, etc.). Which one should we choose? Answer: Make the common case fast... we get the greatest reduction in runtime by reducing the time cost of the operation(s) that take the most time. In this case, that would be load/store instructions. If in doubt, here are the options: floating point: time = 240 + 60 / 2 = 270s load/store: time = 180 + 120 / 2 = 240s branches: time = 210 + 90 / 2 = 255s other: time = 270 + 30 / 2 = 285s d) [10 points] Suppose that you have performed the optimization chosen in part c). Now, suppose you can do the same thing once more (i.e., you pick any of the three types and improve its performance by a factor of 2), would you choose the same type of instruction to optimize? Use Amdahl's Law to justify your answer. Answer: After the improvement above the times spent on the various instructions would be: floating point: 60 load/store: 60 branches: 90 other: 30 Again, we want to speed up the operation(s) on which the most time is spent. So, now we want to speed up branch instructions. floating point: time = 180 + 60 / 2 = 210s load/store: time = 180 + 60 / 2 = 210s branches: time = 150 + 90 / 2 = 195s other: time = 210 + 30 / 2 = 225s