Solution for CS 2506 Spring 2020 MIPS04
-------------------------------------------------------------------------------
1. Suppose that, in addition to registers in the CPU, a system includes a
   unified L1 cache and DRAM, and that the storage levels have the average
   access times shown below:

      L1 cache    1 ns
      DRAM       50 ns

   The system has a 2GHz clock rate. Also, the system executes machine
   instructions using a 5-stage pipeline, and the average CPI is 6.5,
   assuming that there are no memory accesses. (This would take into account
   the effect of stalls for forwarding and branch decisions, and include the
   fetch if it's from the L1 cache.)

   For the following parts of this question, round time values to the
   nearest nanosecond (ns), and round CPI values to the nearest tenth
   (e.g., 7.2). Show your computations.

   Aside: It's important to note that if the clock rate is 2GHz, then each
   clock cycle is 0.5 ns, so there are two cycles per ns. Therefore, the
   access times above can also be expressed as:

      L1 cache    1 ns     2 cycles
      DRAM       50 ns   100 cycles

a) [4 points] For this part, assume that the hit rate for the L1 cache is
   (an unlikely) 100%. What is the average memory access time AMAT (in ns)?

   Answer:

   Under the given assumption, each memory access ends with an L1 hit, so
   the average memory access time (indeed, the time for every memory access)
   is 1.0 ns.

b) [4 points] Again assuming an L1 cache hit rate of 100%, what would be the
   average CPI if 20% of the executed instructions require a memory access?

   Answer:

   CPI = 6.5 + 0.20 * 2   for L1 data accesses
       = 6.9 cycles

c) [4 points] Now, suppose the L1 cache hit rate is 97%, and that every L1
   miss is satisfied from DRAM. What is the average memory access time AMAT
   now (in ns)?

   Answer:

   AMAT = 1           for the L1 access
        + 0.03 * 50   for DRAM accesses on L1 misses
        = 2.5 ns

d) [8 points] Under the same assumptions as part c), and assuming that 20%
   of the executed instructions require a memory access, what would be the
   average CPI?
   Answer:

   CPI = 6.5
       + 0.03 * 100         for L1 fetch misses and DRAM fetches
       + 0.2 * 2            for L1 data accesses
       + 0.2 * 0.03 * 100   for L1 data misses and DRAM data accesses
       = 6.5 + 3.0 + 0.4 + 0.6
       = 10.5 cycles

e) [8 points] Repeat part d), but now assume that 40% of the executed
   instructions require a memory access.

   Answer:

   CPI = 6.5
       + 0.03 * 100         for L1 fetch misses and DRAM fetches
       + 0.4 * 2            for L1 data accesses
       + 0.4 * 0.03 * 100   for L1 data misses and DRAM data accesses
       = 6.5 + 3.0 + 0.8 + 1.2
       = 11.5 cycles

2. Suppose that, in addition to registers in the CPU, a system includes a
   unified L1 cache, a unified L2 cache, and DRAM, and that the storage
   levels have the average access times shown below:

      L1 cache    1 ns
      L2 cache   10 ns
      DRAM       50 ns

   The system has a 2GHz clock rate. Also, the system executes machine
   instructions using a 5-stage pipeline, and the average CPI is 6.5,
   assuming that there are no memory accesses. (This would take into account
   the effect of stalls for forwarding and branch decisions, and include the
   fetch if it's from the L1 cache.)

   For the following parts of this question, round time values to the
   nearest nanosecond (ns), and round CPI values to the nearest tenth
   (e.g., 7.2). Show your computations.

   Aside: It's important to note that if the clock rate is 2GHz, then each
   clock cycle is 0.5 ns, so there are two cycles per ns. Therefore, the
   access times above can also be expressed as:

      L1 cache    1 ns     2 cycles
      L2 cache   10 ns    20 cycles
      DRAM       50 ns   100 cycles

a) [8 points] Now, suppose the L1 cache hit rate is 97%, and the L2 cache
   hit rate is 99%. What is the average memory access time AMAT now (in ns)?

   Answer:

   AMAT = 1                  for the L1 access
        + 0.03 * 10          for L1 misses and L2 accesses
        + 0.03 * 0.01 * 50   for L1 misses, L2 misses, and DRAM accesses
        = 1.315 ns

   So, 1.3 ns.

b) [8 points] Under the same assumptions as part a), and assuming that 40%
   of the executed instructions require a memory access (besides being
   fetched), what would be the average CPI?
   Answer:

   CPI = 6.5                       base CPI
       + 0.03 * 20                 for L1 fetch misses and L2 fetch hits
       + 0.03 * 0.01 * 100         for L1 and L2 fetch misses, and DRAM fetches
       + 0.4 * 2                   for L1 data accesses
       + 0.4 * 0.03 * 20           for L1 data misses and L2 data hits
       + 0.4 * 0.03 * 0.01 * 100   for L1 and L2 data misses, and DRAM data accesses
       = 6.5 + 0.6 + 0.03 + 0.8 + 0.24 + 0.012
       = 8.182

   So, 8.2 cycles.

3. A system uses 24-bit DRAM addresses and an 8-way set associative cache
   with 128 sets, with a block size of 32 bytes. The cache sets are, of
   course, numbered 0 to 127.

a) [4 points] The capacity of a cache is the amount of user data the cache
   can store (so this excludes valid bits, tags, etc.). What is the capacity
   of this cache? Give your answer in bytes and in KB (kilobytes).

   Answer:

   There are 8 * 128 cache lines, each storing 32 bytes. So, the capacity is
   32768 bytes, which equals 32KB.

b) [4 points] How many tag comparator circuits would this cache need?

   Answer:

   A multiway cache needs as many comparators as there are cache lines in a
   set: 8.

   For parts c) through g), assume we are loading the block of data from
   DRAM that corresponds to the address shown below:

      0 1 0 1 1 0 0 0 1 1 0 1 1 1 0 0 1 0 0 1 1 0 1 1

c) [4 points] Which cache set will the block of data be stored in? Give your
   answer in binary, and in base-10.

   Answer:

   The cache blocks contain 32 bytes, so the low 5 bits of the address are
   used to specify the offset of the relevant byte within a block. There are
   128 sets, so we need 7 bits to specify the relevant set number. So, for
   the address above, the breakdown looks like this:

      0 1 0 1 1 0 0 0 1 1 0 1 1 1 0 0 1 0 0 1 1 0 1 1
      ^                     ^ ^           ^ ^       ^
      +-------- tag --------+ +-- set # --+ +offset-+

   So, the set number is:   binary   110 0100
                            base-10  100

d) [4 points] What value will be stored in the tag field of the line
   containing this block? Give your answer in binary, and in hex.
   Answer:

   Using the analysis shown in part c), the tag is:   binary  0101 1000 1101
                                                      hex     58D

   For each of the following parts, suppose that the block corresponding to
   the address above has been loaded into the cache, and that the cache does
   not contain any other data blocks. For parts e) through g), determine
   whether an access to each of the addresses below would result in a cache
   hit or a cache miss. Justify your answers.

e) [2 points]

      0 1 0 1 1 0 0 0 1 1 0 1 1 1 0 0 1 0 0 0 0 0 0 0
      ^                     ^ ^           ^ ^       ^
      +-------- tag --------+ +-- set # --+ +offset-+

   This address maps to the same set (100) as the first data block, and has
   the same tag field, so this is a cache hit.

f) [2 points]

      0 1 0 1 1 0 0 0 1 1 0 1 1 1 1 0 1 1 0 1 1 0 1 1
      ^                     ^ ^           ^ ^       ^
      +-------- tag --------+ +-- set # --+ +offset-+

   This address maps to a different set (118) than the first data block, so
   this is a cache miss.

g) [2 points]

      0 1 0 0 1 1 1 0 1 1 0 1 1 1 0 0 1 0 0 1 1 0 1 1
      ^                     ^ ^           ^ ^       ^
      +-------- tag --------+ +-- set # --+ +offset-+

   This address maps to the same set (100) as the first data block, but has
   a different tag field, so this is a cache miss.

4. Suppose the C function shown below is used to perform searches on an
   array storing 2^20 64-bit integer values:

      int32_t countMatches(const int64_t* list, int32_t nElements,
                           int64_t target) {
         int32_t nMatches = 0;
         for (int32_t idx = 0; idx < nElements; idx++) {
            if ( target == list[idx] ) {
               nMatches++;
            }
         }
         return nMatches;
      }

   Aside: Let's note something up front... this performs a linear traversal
   of the array, and each array element is examined exactly once. So,
   there's no temporal locality to be exploited.

a) [4 points] Suppose this code is executed on hardware that includes a
   single-level direct-mapped cache, where each cache line uses a block size
   of 8 bytes, and the total cache capacity is 64KB. Assume that each local
   automatic variable in the function is stored in a register.
   What will be the cache hit rate for accesses to the elements in the
   array? Justify your answer.

   Answer:

   Each array element occupies 8 bytes, which exactly matches the block size
   in the cache. So, each access will involve a cache miss, and the cache
   hit rate is exactly 0%.

b) [4 points] Now suppose a single-level direct-mapped cache is used, but
   the block size is 128 bytes and the total cache capacity is still 64KB.
   Will the cache hit rate for accesses to elements in the array change? If
   so, what will the hit rate be? Explain why. If not, explain why the hit
   rate will not change.

   Answer:

   Now, each cache miss causes a block of 128 bytes to be loaded from DRAM
   into a cache line. Each of those blocks contains 16 8-byte values. The
   algorithm performs a stride-1 linear traversal, so after a block is
   loaded into the cache, as a result of a miss, the algorithm will proceed
   to examine the remaining 15 values in that block. In other words, each
   cache miss will be followed by 15 cache hits; so the cache hit rate is
   15/16, or 93.75%.

   (The moral is that the larger block size has allowed the cache to take
   advantage of the spatial locality in the algorithm's access pattern.)

5. [10 points] Suppose the C function shown below is used to perform
   searches on a sorted array storing 2^20 64-bit integer values:

      int32_t findMatches(const int64_t* list, int32_t nElements,
                          int64_t target) {
         int32_t loIdx = 0;
         int32_t hiIdx = nElements;
         while ( loIdx < hiIdx ) {
            int32_t midIdx = (loIdx + hiIdx) / 2;
            if ( target < list[midIdx] ) {
               hiIdx = midIdx - 1;
            }
            else if ( target > list[midIdx] ) {
               loIdx = midIdx + 1;
            }
            else {
               return midIdx;
            }
         }
         return -1;
      }

   Suppose that the system uses a single-level direct-mapped cache with a
   total capacity of 64KB. Will the size of the cache blocks have any effect
   on the cache hit rate for array elements? Explain why or why not. (Be
   careful, there is a subtle issue here.)

   Answer:

   This function uses a binary search algorithm; therefore, there is no hope
   for temporal locality.
   But there is also (almost) no hope for spatial locality, either. The
   algorithm examines a sequence of values from the array, but those values
   are (mostly) widely separated in memory. For example, let's suppose that
   list[0] is at address 0, just for argument. Then the first element
   examined will be (approximately) at address 4,194,304, and the second
   element examined will be (approximately) at address 2,097,152 or at
   address 6,291,456. Gaps that large will NOT fall within a single block of
   any reasonable (or unreasonable) cache design...

   ... so it doesn't seem the cache block size makes a difference...

   ... but... as the algorithm runs, it does converge on smaller and smaller
   sections of the array. Eventually, the distance between the current
   element and the next element will be small enough that both fall within a
   single cache block. How soon that happens does depend, in part, on the
   size of the cache block. So, there's a small advantage to be had if the
   cache block size is large. A very small advantage.

6. Consider two caches. Each cache contains 16 cache lines, each cache uses
   32-byte cache blocks, and each cache starts out with all lines marked
   invalid. The only difference is that one cache is two-way set associative
   and the other is direct-mapped. Assume DRAM uses 16-bit addresses.

   Aside: For the direct-mapped cache, a 16-bit address will break down this
   way; T indicates a tag bit, S indicates a set bit, B indicates a block
   offset bit:

      TTTT TTTS SSSB BBBB

   For the two-way set associative cache, a 16-bit address will break down
   this way:

      TTTT TTTT SSSB BBBB

a) [8 points] Find a shortest possible reference stream of addresses where
   the two-way set associative cache would get at least one hit and the
   direct-mapped cache would get no hits. Provide the addresses of the
   reference stream in binary and in hex.
   Answer:

   We need an access sequence that achieves the following behavior:

                 direct-mapped            two-way associative
      Access1    miss                     miss
      Access2    miss, evict              miss
      Access3    miss due to eviction     hit

   So, we need three accesses. The second access must map to the same set as
   the first in the direct-mapped cache, causing an eviction. So, the S bits
   must map Access1 and Access2 to the same set, and the tag bits must not
   be the same. The third access must be to the same block as the first
   access. So, the S bits and T bits must match for Access1 and Access3.

   There are many such patterns; here is one:

                   2-way                           direct
      1: 0x0000    00000000 000 00000  miss        0000000 0000 00000  miss
                   (set 0, line 0)                 (set 0)
      2: 0x0200    00000010 000 00000  miss        0000001 0000 00000  miss
                   (set 0, line 1)                 (set 0, evicts #1)
      3: 0x0001    00000000 000 00001  hit         0000000 0000 00001  miss
                   (set 0, line 0)                 (set 0)

b) [8 points] Find a shortest possible reference stream of addresses where
   the direct-mapped cache would get a hit and the two-way set associative
   cache would get no hits. Provide the addresses of the reference stream in
   binary and in hex.

   Answer:

   We need a sequence of addresses that all map to a single set in the
   two-way cache, so that evictions cause every access there to miss, while
   they map to two different sets in the direct-mapped cache, so that one
   block survives long enough to yield a hit.

   There are many such patterns; here is one (assuming LRU replacement in
   the two-way cache):

                   2-way                           direct
      1: 0x0001    00000000 000 00001  miss        0000000 0000 00001  miss
                   (set 0, line 0)                 (set 0)
      2: 0x0102    00000001 000 00010  miss        0000000 1000 00010  miss
                   (set 0, line 1)                 (set 8)
      3: 0x0303    00000011 000 00011  miss        0000001 1000 00011  miss
                   (set 0, line 0, evicts #1)      (set 8, evicts #2)
      4: 0x0004    00000000 000 00100  miss        0000000 0000 00100  hit
                   (set 0, line 1, evicts #2)      (set 0)