Write a version of your matrix-vector multiplication code using a checkerboard (blokced) decomposition. Compare the performance, speedup and efficiency of this version against the version of the code you implemented in the first part of the assignment. Use only two problem sizes, one equal to 80% of the memory available in one uniprocessor machine (512 MB), which you will use to measure performance on 1, 2, 3 and 4 uniprocessor machines, and one equal to 80% of the memory available in one dual-processor machine (1024 MB), which you will use to measure performance on 1, 2, 3, and 4 multiprocessor machines using 2,4, 6, and 8 processors respectively. Don't reuse the results from the first assignment, since you need to ignore I/O time in this assignment.
Consider again the problem of matrix-vector multiplication when the matrix is sparse. A sparse matrix has a high percentage of zero elements. Write a version of your matrix-vector multiplication which balances better the workload with a sparse matrix. Furthermore, store the matrix so that the space occupied by the matrix is exactly as much as needed by the non-zero elements and re-write the algorithm to accommodate the new matrix representation. You are free to devise your own load balancing scheme, static or dynamic, as well as your own matrix representation. Creativity is rewarded in this part of the assignment. Measure execution times, speedup and efficiency for the same machine configurations you used in the first part of the assignment. Use a sparse matrix with 30% non-zero elements, scattered randomly in the matrix. Use a matrix that would occupy 80% of the available memory if stored in dense form with the zeros. You need to prove that your sparse representation needs only about 24% of RAM needed by the dense representation.
Write a shared-memory implementation of the checkerboard version of the sparse matrix-vector multiplication using POSIX threads. We will teach POSIX threads in class and, if needed, in a few lab sessions. Measure execution time, speedup and efficiency on a single dual-processor machine using 1 and 2 processors. Compare execution time, speedup and efficiency against the MPI version of the code, executed on the same dual-processor machine using 1 and 2 processors. Discuss extensively your results and, if you see differences between the MPI and POSIX threads version of the code, explain the differences.