Chapter 1 (continued)
I. Introduction, motivation, terminology
- What's the argument for parallelism (as opposed to, for example, faster
and faster single CPUs)? By
Moore's Law we believe processors will keep getting faster. Isn't that
enough? For many problems it is, but ...
- Perhaps I don't want to wait for the faster CPU. It will only
be available next year. I want speedup now.
- Perhaps I want more than the factor of 2 or 4 speedup that I
can hope to get from faster CPUs in the next year or two. With
parallelism, I might be able to get speedups of 10 or 100 or even
more.
- Perhaps the clock speed isn't really the bottleneck. Perhaps
the bottleneck is really the memory bus. Maybe if I had multiple
memories and multiple memory buses I could speed up my application.
- Perhaps my problem is very naturally expressed (and solved) in
parallel. For example, physics simulations involve a system (nature)
that is inherently parallel. You could argue that forcing
that computation to be done serially is actually quite awkward.
- The "speed-of-light" argument. There are real
physical limits to how fast a single microprocessor can
be. We seem to be within sight of those limits
for superscalar pipelined microprocessors (although people
have been saying this for a decade). Multithreaded/multicore
processors were introduced to overcome some of these limitations.
- The "commodity microprocessor" argument. Big
problems can be solved more cheaply by throwing lots of
inexpensive processors at them, rather than a small number
of really expensive processors. And one of the main reasons
the processors are cheap is that they have a huge market
base (unlike "special" high-performance supercomputer
processors). In fact, the fastest supercomputer in
existence at the time of writing (IBM BG/L) uses "low-key" microprocessors
from the embedded computing domain. It opts for reasonable
performance per CPU at low power, and uses 65536+ CPUs
to achieve a peak performance of 100+ Tflops.
- And finally, since commodity microprocessors are getting faster and faster,
we can have both parallelism and faster processors.
- Note too that there are often plenty of good reasons not
to use parallelism. For example,
it may be just too hard---technically or in
human time---to exploit parallelism for a given problem.
Some problems are just inherently sequential.
- Enter parallel computer architecture: What are the dominant models?
- Commodity clusters and custom parallel machines (SMPs). On the side of clusters
we have scalability and cost to maintain. On the side of custom SMPs we
have a "simple" programming model, portability, reliability, and applicability.
However, the cost per processor scales much better in clusters than in
SMPs (in SMPs, the network and the hardware support for shared memory
become bottlenecks and major cost factors). Clusters can easily scale
to 100+ processors at reasonable cost. SMPs can rarely scale that far,
and their cost explodes beyond 8 processors.
- What components affect cost? Processors? Networking? On-board logic
to support parallelism?
II. A first pass through everything
- Computer architecture
- The main question is the relationship between the CPUs and the
memory/memories. At the highest level there are two broad categories: shared memory and distributed memory.
- Two ways to view parallel architectures: CPU-network-centric (a collection
of CPUs with some communication medium, organized in a specific topology
that occasionally enables optimization of communication) and memory-centric
(an extension of a conventional memory hierarchy with replicated
components). Parallel programming models take different views.
For example, data-parallel models have a memory-centric view of parallelism
and an implied owner-computes rule for CPUs. Computation is organized
around data distributed in memories. MPI has a more CPU-network-centric
view of parallelism, with naming and virtual topologies of processors.
Memories in MPI are "represented" by processors.
- Typical characteristics of a shared memory architecture.
- modest number of processors, e.g., 8, 16, 32
- one shared bus, one large memory, one address-space
- major issues: bus contention, cache-coherency
- note: NUMA shared memory machines are more scalable than
UMA machines, but with NUMA some memory is "farther away" (slower to
access) than other memory.
- Typical characteristics of a distributed memory architecture.
- scalable interconnection network
- some memory physically local, some remote
- message-passing programming model
- major issues: network latency and bandwidth, message-passing
overhead, data decomposition.
- Layered architectures are also very common. They offer the best of both worlds.
Example: a cluster of SMPs, with
a distributed memory architecture at layer 1 (multiple SMPs,
each with its own memory, connected via a switching network) and
a shared memory architecture at layer 2 (each SMP provides standard
cache-coherent shared memory).
- Decoupling of abstractions from hardware:
- You can implement a virtual shared-memory architecture
on top of a distributed memory architecture (e.g. to
simplify the development of parallel applications or
to port shared-memory applications easily).
- You can implement a distributed memory communication layer
on top of a shared memory architecture (e.g. to
port the many codes previously written for distributed memory, or
to take advantage of faster communication through
shared memory in SMP nodes).
- You can combine shared-memory and distributed-memory
abstractions in a parallel program, almost naturally.
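The decoupling described above can be illustrated with a minimal Python sketch (an illustration only, not part of the course material). It layers a message-passing interface on top of shared memory: each worker is a thread, and a "send" is simply a put into a per-worker queue that lives in the shared address space. The work is divided with a block distribution and an owner-computes rule. All names here (block_bounds, parallel_sum_of_squares, send, recv) are hypothetical, and the use of threads and queue.Queue is an assumption made for brevity; a real code would more likely use MPI.

```python
import queue
import threading


def block_bounds(n, nworkers, rank):
    """Return the [lo, hi) index range owned by `rank` under a block distribution."""
    base, extra = divmod(n, nworkers)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi


def parallel_sum_of_squares(data, nworkers=4):
    """Sum x*x over `data` using message passing layered on shared memory."""
    # One mailbox per worker. The "network" is just shared memory.
    mailboxes = [queue.Queue() for _ in range(nworkers)]
    result = []

    def send(dest, msg):
        # Hypothetical send: deliver msg to the mailbox of worker `dest`.
        mailboxes[dest].put(msg)

    def recv(rank):
        # Hypothetical recv: block until a message arrives for worker `rank`.
        return mailboxes[rank].get()

    def worker(rank):
        lo, hi = block_bounds(len(data), nworkers, rank)
        # Owner-computes: each worker touches only the block it owns.
        partial = sum(x * x for x in data[lo:hi])
        if rank != 0:
            send(0, partial)
        else:
            # Worker 0 reduces the partial results sent by the others.
            total = partial
            for _ in range(nworkers - 1):
                total += recv(0)
            result.append(total)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(nworkers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]
```

In CPython the threads do not execute the arithmetic truly in parallel because of the global interpreter lock, so the sketch shows the structure of the abstraction rather than an actual speedup.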
CS 4234,
Dimitris Nikolopoulos,
latest update: