Solution for CS 2506 Spring 2020 MIPS01
-------------------------------------------------------------------------------

1. For a certain processor, with a 2GHz clock rate, assume the following average
   CPI for each category of instruction:

   Category   average CPI
   ----------------------
   arithmetic           1
   load/store          12
   branch               5

   Executed on a single processor, a certain program requires performing 2.56E9
   arithmetic instructions, 1.28E9 load/store instructions, and 2.56E8 branch
   instructions.

   The program described above is modified for parallel execution on multiple
   cores; each core has the same clock rate as the original processor.  The
   changes to the program alter the number of instructions that must be
   performed, per core, as follows.  For arithmetic and load/store instructions,
   the number of instructions executed per core will be divided by 0.7 * p,
   where p is the number of cores used.  However, the number of branch
   instructions per core will not change, relative to the original program.

   a) [14 points] Fill in the following table (round each answer to the nearest
      hundredth):

       cores   #arith   #load/store   #branch   #cycles   exec time   speedup
               per core    per core    per core  per core
       ----------------------------------------------------------------------
           1   2.56E9        1.28E9    2.56E8    19.2E9       9.60s      1.00
       ----------------------------------------------------------------------
           4   0.80E9        0.40E9    2.56E8    6.88E9       3.44s      2.79
       ----------------------------------------------------------------------
           8   0.40E9        0.20E9    2.56E8    4.08E9       2.04s      4.71
       ----------------------------------------------------------------------

      Calculations for 1 core:
        #cycles   = 2.56E9 * 1 + 1.28E9 * 12 + .256E9 * 5 = 19.2E9
        exec time = 19.2E9 cycles / 2.0E9 cycles/second
                  =  9.6 seconds

      Calculations for 4 cores:
        #arith      = 2.56E9 / 0.8*4 = 0.8E9 per core
        #load/store = 1.28E9 / 0.8*4 = 0.4E9 per core
        #branch     = .256E9 per core
        #cycles     = 0.8E9 * 1 + 0.4E9 * 12 + 0.256E9 * 5 = 6.88E9 per core
        exec time   = 6.88E9 cycles / 2.0E9 cycles/second = 3.44 seconds
        speedup     = 9.6 / 3.44 = 2.79

      Calculations for 8 cores:
        #arith      = 2.56E9 / 0.8*8 = 0.4E9
        #load/store = 1.28E9 / 0.8*4 = 0.2E9
        #branch     = .256E9
        #cycles     = 0.4E9 * 1 + 0.2E9 * 12 + 0.256E9 * 5 = 4.08E9 per core
        exec time   = 4.08E9 cycles / 2.0E9 cycles/second = 2.04 seconds
        speedup     = 9.6 / 2.04 = 4.71

   b) [10 points] What must the CPI for load/store instructions be reduced to in
      order for the performance, running the program on a single core, to equal
      the performance of the program running on 4 of the original cores?
      Justify your conclusion precisely.

      Answer:
      We need:

         3.44s = (2.56E9 * 1 + 1.28E9 * X + .256E9 * 5) / 2.0E9

      This yields X = 3.04E9 / 1.28 = 2.375 cycles.


2. A company currently produces three different processors, all executing the
   same set of instructions: 

     - P1 has a 2.8 GHz clock rate and an advertised average CPI of 1.2 
     - P2 has a 3.2 GHz clock rate and an advertised average CPI of 1.6 
     - P3 has a 3.8 GHz clock rate and an advertised average CPI of 1.8

   a) [12 points] Using IPS (instructions per second) as your only criterion,
      and accepting the information given above, which processor offers the best
      performance? Justify your conclusion precisely. 

      Answer:
      Calculate the IPS for each processor:

        IPS_P1 = (2.8E9 cycles/second)/(1.2 cycles/instruction) = 2.333E9
        IPS_P2 = (3.2E9 cycles/second)/(1.6 cycles/instruction) = 2.000E9
        IPS_P3 = (3.8E9 cycles/second)/(1.8 cycles/instruction) = 2.111E9

      So, by this measure, P1 offers the best performance.


   b) [12 points] It takes 42 seconds (of CPU time) to execute a certain
      benchmark on P1. How many machine instructions are executed when that
      benchmark is run on P1? Justify your conclusion precisely. 

      Answer:
        # instructions = (2.333E9 instructions/second) * (42 seconds)
                       = 98E9 instructions

      (IPS for P1 is actually 2 1/3)


   c) [12 points] The company would like to reduce the execution time of that
      benchmark on P1 by 25%, but the redesign (of P1) they've come up with
      would entail increasing the average CPI by 10%. What clock rate must they
      apply in order to achieve their goal? Justify your conclusion precisely.

      Answer:
      We need an execution time of 0.75 * 42 = 31.5 seconds.

      Now, execution time equals #instructions / (#instructions/second).

      With the change, the CPI will be 1.1 * 1.2 = 1.32 cycles/instruction.

      Therefore, the IPS is now X / 1.32, where X is the new clock rate.

      So, we need 31.5 = (98E9 instructions) / (X / 1.32), which yields

         clock rate = 98E9 * 1.32 / 31.5 = 4.11E9

3. Consider a computer running a program that requires 300 seconds, with 60
   seconds spent executing floating point instructions, 120 seconds executed
   load/store instructions, 90 seconds spent executing branch instructions, and
   the rest spent executing other instructions.

  a) [10 points] Suppose the execution time of floating point instructions is
     reduced by 40%. Use Amdahl's Law to determine what speedup would be
     achieved for that program. Round your answer to the nearest one-thousandth.

     Answer:
     TimeAfter = TimeUnaffected + TimeAffected/Speedup
               = 240 + 60 / (5/3)
               = 240 + 36
               = 276s

     Speedup = TimeBefore / TimeAfter = 300 / 276 = 1.087


  b) [10 points] Can the total time can be reduced by 25% by reducing only the
     time for branch instructions? If so, explain how.  If not, explain why. Use
     Amdahl's Law to justify your answer.

     Answer:
     From Amdahl's Law, letting X be the improvement factor for branches:

         225 = 210 + 90 / X

     which yields X = 90 / 15 = 6.  So, yes, we can achieve a 25% improvement
     if we can make branch instructions 6 times faster.


  c) [10 points] Suppose we have technologies to improve the performance of
     these three types of instructions (floating point instructions, memory
     instructions, and branch instructions) by a factor of 2. Suppose, however,
     we have to choose only one type to improve for some reasons (e.g., labor,
     space/power constraints, etc.). Which one should we choose? 

     Answer:
     Make the common case fast... we get the greatest reduction in runtime by
     reducing the time cost of the operation(s) that take the most time.  In
     this case, that would be load/store instructions.

     If in doubt, here are the options:

        floating point:  time = 240 +  60 / 2 = 270s
        load/store:      time = 180 + 120 / 2 = 240s
        branches:        time = 210 +  90 / 2 = 255s
        other:           time = 270 +  30 / 2 = 285s


  d) [10 points] Suppose that you have performed the optimization chosen in part
     c).  Now, suppose you can do the same thing once more (i.e., you pick any
     of the three types and improve its performance by a factor of 2), would you
     choose the same type of instruction to optimize?  Use Amdahl's Law to
     justify your answer.

     Answer:
     After the improvement above the times spent on the various instructions
     would be:

        floating point:  60
        load/store:      60
        branches:        90
        other:           30

     Again, we want to speed up the operation(s) on which the most time is spent.
     So, now we want to speed up branch instructions.

        floating point:  time = 180 + 60 / 2  = 210s
        load/store:      time = 180 + 60 / 2  = 210s
        branches:        time = 150 + 90 / 2  = 195s
        other:           time = 210 + 30 / 2  = 225s