Solution for CS 2506 Spring 2020 MIPS04
-------------------------------------------------------------------------------
1. Suppose that, in addition to registers in the CPU, a system includes a
   unified L1 cache and DRAM, and that the storage levels have the average
   access times shown below:

      L1 cache    1 ns
      DRAM       50 ns

   The system has a 2GHz clock rate. Also, the system executes machine
   instructions using a 5-stage pipeline, and the average CPI is 6.5,
   assuming that there are no memory accesses. (This would take into account
   the effect of stalls for forwarding and branch decisions, and include the
   fetch if it's from the L1 cache.)

   For the following parts of this question, round time values to the
   nearest nanosecond (ns), and round CPI values to the nearest tenth
   (e.g., 7.2). Show your computations.

   Aside: It's important to note that if the clock rate is 2GHz, then each
   clock cycle is 0.5 ns, so there are two cycles per ns. Therefore, the
   access times above can also be expressed as:

      L1 cache    1 ns     2 cycles
      DRAM       50 ns   100 cycles

a) [4 points] For this part, assume that the hit rate for the L1 cache is
   (an unlikely) 100%. What is the average memory access time AMAT (in ns)?

   Answer:

   Under the given assumption, each memory access ends with an L1 hit, so
   the average memory access time (indeed, the time for every memory access)
   is 1.0 ns.

b) [4 points] Again assuming an L1 cache hit rate of 100%, what would be the
   average CPI if 20% of the executed instructions require a memory access?

   Answer:

   CPI = 6.5 + 0.20 * 2   for L1 data accesses
       = 6.9 cycles

c) [4 points] Now, suppose the L1 cache hit rate is 97%, and that every L1
   miss is satisfied from DRAM. What is the average memory access time AMAT
   now (in ns)?

   Answer:

   AMAT = 1           for the L1 access
        + 0.03 * 50   for DRAM accesses on L1 misses
        = 2.5 ns

d) [8 points] Under the same assumptions as part c), and assuming that 20%
   of the executed instructions require a memory access, what would be the
   average CPI?
   Answer:

   CPI = 6.5
       + 0.03 * 100         for L1 fetch misses and DRAM fetches
       + 0.2 * 2            for L1 data accesses
       + 0.2 * 0.03 * 100   for L1 data misses and DRAM data accesses
       = 6.5 + 3.0 + 0.4 + 0.6
       = 10.5 cycles

e) [8 points] Repeat part d), but now assume that 40% of the executed
   instructions require a memory access.

   Answer:

   CPI = 6.5
       + 0.03 * 100         for L1 fetch misses and DRAM fetches
       + 0.4 * 2            for L1 data accesses
       + 0.4 * 0.03 * 100   for L1 data misses and DRAM data accesses
       = 6.5 + 3.0 + 0.8 + 1.2
       = 11.5 cycles

2. Suppose that, in addition to registers in the CPU, a system includes a
   unified L1 cache, a unified L2 cache, and DRAM, and that the storage
   levels have the average access times shown below:

      L1 cache    1 ns
      L2 cache   10 ns
      DRAM       50 ns

   The system has a 2GHz clock rate. Also, the system executes machine
   instructions using a 5-stage pipeline, and the average CPI is 6.5,
   assuming that there are no memory accesses. (This would take into account
   the effect of stalls for forwarding and branch decisions, and include the
   fetch if it's from the L1 cache.)

   For the following parts of this question, round time values to the
   nearest nanosecond (ns), and round CPI values to the nearest tenth
   (e.g., 7.2). Show your computations.

   Aside: It's important to note that if the clock rate is 2GHz, then each
   clock cycle is 0.5 ns, so there are two cycles per ns. Therefore, the
   access times above can also be expressed as:

      L1 cache    1 ns     2 cycles
      L2 cache   10 ns    20 cycles
      DRAM       50 ns   100 cycles

a) [8 points] Now, suppose the L1 cache hit rate is 97%, and the L2 cache
   hit rate is 99%. What is the average memory access time AMAT now (in ns)?

   Answer:

   AMAT = 1                  for the L1 access
        + 0.03 * 10          for L1 misses and L2 accesses
        + 0.03 * 0.01 * 50   for L1 misses, L2 misses, and DRAM accesses
        = 1.315 ns

   So, 1.3 ns.

b) [8 points] Under the same assumptions as part a), and assuming that 40%
   of the executed instructions require a memory access (besides being
   fetched), what would be the average CPI?
   Answer:

   CPI = 6.5                       base CPI
       + 0.03 * 20                 for L1 fetch misses and L2 fetch hits
       + 0.03 * 0.01 * 100         for L1 and L2 fetch misses, and DRAM fetches
       + 0.4 * 2                   for L1 data accesses
       + 0.4 * 0.03 * 20           for L1 data misses and L2 data hits
       + 0.4 * 0.03 * 0.01 * 100   for L1 and L2 data misses, and DRAM data accesses
       = 6.5 + 0.6 + 0.03 + 0.8 + 0.24 + 0.012
       = 8.182

   So, 8.2 cycles.

3. A system uses 24-bit DRAM addresses and an 8-way set associative cache
   with 128 sets, with a block size of 32 bytes. The cache sets are, of
   course, numbered 0 to 127.

a) [4 points] The capacity of a cache is the amount of user data the cache
   can store (so this excludes valid bits, tags, etc.). What is the capacity
   of this cache? Give your answer in bytes and in KB (kilobytes).

   Answer:

   There are 8 * 128 cache lines, each storing 32 bytes. So, the capacity is
   32768 bytes, which equals 32KB.

b) [4 points] How many tag comparator circuits would this cache need?

   Answer:

   A multiway cache needs as many comparators as there are cache lines in a
   set: 8.

   For parts c) through g), assume we are loading the block of data from
   DRAM that corresponds to the address shown below:

      0 1 0 1 1 0 0 0 1 1 0 1 1 1 0 0 1 0 0 1 1 0 1 1

c) [4 points] Which cache set will the block of data be stored in? Give your
   answer in binary, and in base-10.

   Answer:

   The cache blocks contain 32 bytes, so the low 5 bits of the address are
   used to specify the offset of the relevant byte within a block. There are
   128 sets, so we need 7 bits to specify the relevant set number. So, for
   the address above, the breakdown looks like this:

      0 1 0 1 1 0 0 0 1 1 0 1 1 1 0 0 1 0 0 1 1 0 1 1
      ^                     ^ ^           ^ ^       ^
      +-------- tag --------+ +-- set # --+ +offset-+

   So, the set number is:   binary   110 0100
                            base-10  100

d) [4 points] What value will be stored in the tag field of the line
   containing this block? Give your answer in binary, and in hex.
   Answer:

   Using the analysis shown in part c), the tag is:   binary  0101 1000 1101
                                                      hex     58D

   For each of the following parts, suppose that the block corresponding to
   the address above has been loaded into the cache, and that the cache does
   not contain any other data blocks. For parts e) through g), determine
   whether an access to each of the addresses below would result in a cache
   hit or a cache miss. Justify your answers.

e) [2 points]

      0 1 0 1 1 0 0 0 1 1 0 1 1 1 0 0 1 0 0 0 0 0 0 0
      ^                     ^ ^           ^ ^       ^
      +-------- tag --------+ +-- set # --+ +offset-+

   This address maps to the same set (100) as the first data block, and has
   the same tag field, so this is a cache hit.

f) [2 points]

      0 1 0 1 1 0 0 0 1 1 0 1 1 1 1 0 1 1 0 1 1 0 1 1
      ^                     ^ ^           ^ ^       ^
      +-------- tag --------+ +-- set # --+ +offset-+

   This address maps to a different set (118) than the first data block, so
   this is a cache miss.

g) [2 points]

      0 1 0 0 1 1 1 0 1 1 0 1 1 1 0 0 1 0 0 1 1 0 1 1
      ^                     ^ ^           ^ ^       ^
      +-------- tag --------+ +-- set # --+ +offset-+

   This address maps to the same set (100) as the first data block, but has
   a different tag field, so this is a cache miss.

4. Suppose the C function shown below is used to perform searches on an
   array storing 2^20 64-bit integer values:

      int32_t countMatches(const int64_t* list, int32_t nElements,
                           int64_t target) {
         int32_t nMatches = 0;
         for (int32_t idx = 0; idx < nElements; idx++) {
            if ( target == list[idx] ) {
               nMatches++;
            }
         }
         return nMatches;
      }

   Aside: Let's note something up front... this performs a linear traversal
   of the array, and each array element is examined exactly once. So,
   there's no temporal locality to be exploited.

a) [4 points] Suppose this code is executed on hardware that includes a
   single-level direct-mapped cache, where each cache line uses a block size
   of 8 bytes, and the total cache capacity is 64KB. Assume that each local
   automatic variable in the function is stored in a register.
   What will be the cache hit rate for accesses to the elements in the
   array? Justify your answer.

   Answer:

   Each array element occupies 8 bytes, which exactly matches the block size
   in the cache. So, each access will involve a cache miss, and the cache
   hit rate is exactly 0%.

b) [4 points] Now suppose a single-level direct-mapped cache is used, but
   the block size is 128 bytes and the total cache capacity is still 64KB.
   Will the cache hit rate for accesses to elements in the array change? If
   so, what will the hit rate be? Explain why. If not, explain why the hit
   rate will not change.

   Answer:

   Now, each cache miss causes a block of 128 bytes to be loaded from DRAM
   into a cache line. Each of those blocks contains 16 8-byte values. The
   algorithm performs a stride-1 linear traversal, so after a block is
   loaded into the cache, as a result of a miss, the algorithm will proceed
   to examine the remaining 15 values in that block. In other words, each
   cache miss will be followed by 15 cache hits; so the cache hit rate is
   15/16, or 93.75%.

   (The moral is that the larger block size has allowed the cache to take
   advantage of the spatial locality in the algorithm's access pattern.)

5. [10 points] Suppose the C function shown below is used to perform
   searches on a sorted array storing 2^20 64-bit integer values:

      int32_t findMatches(const int64_t* list, int32_t nElements,
                          int64_t target) {
         int32_t loIdx = 0;
         int32_t hiIdx = nElements;
         while ( loIdx < hiIdx ) {
            int32_t midIdx = (loIdx + hiIdx) / 2;
            if ( target < list[midIdx] ) {
               hiIdx = midIdx - 1;
            }
            else if ( target > list[midIdx] ) {
               loIdx = midIdx + 1;
            }
            else {
               return midIdx;
            }
         }
         return -1;
      }

   Suppose that the system uses a single-level direct-mapped cache with a
   total capacity of 64KB. Will the size of the cache blocks have any effect
   on the cache hit rate for array elements? Explain why or why not. (Be
   careful, there is a subtle issue here.)

   Answer:

   This function uses a binary search algorithm; therefore, there is no hope
   for temporal locality.
   But there is also (almost) no hope for spatial locality, either. The
   algorithm examines a sequence of values from the array, but those values
   are (mostly) widely separated in memory. For example, let's suppose that
   list[0] is at address 0, just for argument. Then the first element
   examined will be (approximately) at address 4,194,304, and the second
   element examined will be (approximately) at address 2,097,152 or at
   address 6,291,456. Gaps that large will NOT fall within a single block of
   any reasonable (or unreasonable) cache design...

   ... so it doesn't seem the cache block size makes a difference...

   ... but... as the algorithm runs, it does converge on smaller and smaller
   sections of the array. Eventually, the distance between the current
   element and the next element will be small enough that both fall within a
   single cache block. How soon that happens does depend, in part, on the
   size of the cache block. So, there's a small advantage to be had if the
   cache block size is large. A very small advantage.

6. Consider two caches. Each cache contains 16 cache lines, each cache uses
   32-byte cache blocks, and each cache starts out with all lines marked
   invalid. The only difference is that one cache is two-way set associative
   and the other is direct-mapped. Assume DRAM uses 16-bit addresses.

   Aside: For the direct-mapped cache, a 16-bit address will break down this
   way; T indicates a tag bit, S indicates a set bit, B indicates a block
   offset bit:

      TTTT TTTS SSSB BBBB

   For the two-way set associative cache, a 16-bit address will break down
   this way:

      TTTT TTTT SSSB BBBB

a) [8 points] Find a shortest possible reference stream of addresses where
   the two-way set associative cache would get at least one hit and the
   direct-mapped cache would get no hits. Provide the addresses of the
   reference stream in binary and in hex.
   Answer:

   We need an access sequence that achieves the following behavior:

                 direct-mapped            two-way associative
      Access1    miss                     miss
      Access2    miss, evict              miss
      Access3    miss due to eviction     hit

   So, we need three accesses. The second access must map to the same set as
   the first in the direct-mapped cache, causing an eviction. So, the S bits
   must map Access1 and Access2 to the same set, and the tag bits must not
   be the same. The third access must be to the same block as the first
   access. So, the S bits and T bits must match for Access1 and Access3.

   There are many such patterns; here is one:

                   2-way                           direct
      1: 0x0000    00000000 000 00000  miss        0000000 0000 00000  miss
                   (set 0, line 0)                 (set 0)
      2: 0x0200    00000010 000 00000  miss        0000001 0000 00000  miss
                   (set 0, line 1)                 (set 0, evicts #1)
      3: 0x0001    00000000 000 00001  hit         0000000 0000 00001  miss
                   (set 0, line 0)                 (set 0)

b) [8 points] Find a shortest possible reference stream of addresses where
   the direct-mapped cache would get a hit and the two-way set associative
   cache would get no hits. Provide the addresses of the reference stream in
   binary and in hex.

   Answer:

   We need a sequence of addresses that all map to a single set in the
   two-way cache, so that evictions cause every access there to miss, while
   they map to two different sets in the direct-mapped cache, so that one
   block survives long enough to yield a hit.

   There are many such patterns; here is one (assuming LRU replacement in
   the two-way cache):

                   2-way                           direct
      1: 0x0001    00000000 000 00001  miss        0000000 0000 00001  miss
                   (set 0, line 0)                 (set 0)
      2: 0x0102    00000001 000 00010  miss        0000000 1000 00010  miss
                   (set 0, line 1)                 (set 8)
      3: 0x0303    00000011 000 00011  miss        0000001 1000 00011  miss
                   (set 0, line 0, evicts #1)      (set 8, evicts #2)
      4: 0x0004    00000000 000 00100  miss        0000000 0000 00100  hit
                   (set 0, line 1, evicts #2)      (set 0)