Power Up Your Code - A Guide to Parallel Architectures & Programming!
Hey, code commanders and architecture aficionados! We've journeyed through the 'why' and 'how' of parallel algorithms, but now let's get down to the nuts and bolts: the machines that execute these algorithms and the methods we use to instruct them. Welcome to the electrifying world of Parallel Architectures and Programming!
The relentless quest for computational power has led us beyond the limits of single processors into an era where parallelism is king. Understanding the diverse architectures that enable concurrent execution and mastering the programming paradigms to exploit them is crucial for tackling today's grand computational challenges – from scientific discovery to AI and Big Data.
So, buckle up as we explore the hardware that thinks in parallel and the programming models that speak its language!
Part 1: Introduction – The Parallel Imperative
Why do we even need to bother with parallelism? Let's set the stage.
The Need for Parallelism:
- Performance Limitations of Single Processors: While individual processors have become incredibly fast, there are physical limits (heat, power consumption, speed of light) to how much faster they can get.
- Solving Grand Challenge Problems: Many scientific and engineering problems (e.g., climate modeling, drug discovery, cosmological simulations) require computational power far beyond what a single CPU can offer.
- Big Data Processing: Analyzing massive datasets demands distributed and parallel processing capabilities.
- Real-time Requirements: Applications like AI, real-time image processing, and financial trading need rapid computations.
Forms of Parallelism (Flynn's Taxonomy): A classic classification:
- SISD (Single Instruction, Single Data): Traditional sequential computer (e.g., a uniprocessor PC). One instruction stream operates on one data stream.
- SIMD (Single Instruction, Multiple Data): A single instruction is executed simultaneously on multiple data streams by different processing elements (PEs). Think of array processors or GPUs.
- MISD (Multiple Instruction, Single Data): Multiple instructions operate on a single data stream. This is a rare architecture, sometimes seen in fault-tolerant systems or specialized systolic arrays.
- MIMD (Multiple Instruction, Multiple Data): Multiple processors execute different instruction streams on different data streams simultaneously. This is the most general and common form of parallel computer today (e.g., multi-core systems, clusters).
Moore's Law and Multi-cores: For decades, Moore's Law predicted the doubling of transistors on a chip roughly every two years. While this led to faster single-core performance, physical limitations (power wall, heat) shifted the focus. Instead of making single cores drastically faster, manufacturers started putting multiple cores on a single chip (multi-core processors). This made parallel programming a necessity even for mainstream software.
Fundamentals of Parallel Computers:
- Consist of multiple Processing Elements (PEs).
- An interconnection network allows PEs to communicate and exchange data.
- Memory systems can be shared (all PEs access a common memory) or distributed (each PE has its own private memory).
Communication Architecture: Defines how PEs exchange information.
- Shared Memory Architecture: PEs communicate implicitly by reading and writing to shared memory locations. Requires mechanisms for synchronization and cache coherence.
- Message Passing Architecture: PEs have private memory and communicate explicitly by sending and receiving messages over the interconnection network.
- Data Parallel Architecture: PEs perform the same operation on different parts of a dataset simultaneously (characteristic of SIMD).
- Dataflow Architecture: Operations execute as soon as their operands are available; computation is driven by data availability. (More of a conceptual model, but it influenced some parallel language designs.)
- Systolic Architecture: Data flows rhythmically through a network of PEs, each performing a simple operation. Well-suited for specific, regular computations like matrix multiplication or signal processing.
Performance Issues: Latency (delay in communication), bandwidth (data transfer rate), synchronization overhead, load imbalance, scalability – all affect how well a parallel program performs.
Part 2: Large Cache Design – Keeping Data Close in Parallel Systems
In shared-memory parallel systems, caches are crucial for performance, but they introduce the cache coherence problem.
Shared vs. Private Caches:
- Private Caches: Each processor has its own local cache. Fast access for local data, but leads to the cache coherence problem if multiple processors cache the same shared data.
- Shared Caches: A single cache is shared by multiple processors (e.g., an L3 cache in a multi-core chip). Simplifies coherence for that cache level but can become a bottleneck. Often, systems use a hierarchy of private (L1, L2) and shared (L3) caches.
Centralized vs. Distributed Shared Caches:
- Centralized: A single shared cache unit.
- Distributed: The logically shared cache is physically distributed among processors or memory controllers, often seen in NUMA systems.
Snooping-based Cache Coherence Protocol:
- Used in bus-based multiprocessor systems where all processors can "snoop" (monitor) the bus.
- When a processor modifies a cached block, it broadcasts this information on the bus. Other caches snooping the bus can then invalidate or update their copies of that block.
- Protocols: MSI, MESI, MOESI (each letter represents a cache line state: Modified, Exclusive, Shared, Invalid, Owned).
- Scalability: Limited by bus bandwidth, typically suitable for a small to moderate number of processors.
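To make the snooping idea concrete, here is a minimal C sketch of how one cache line's MESI state might evolve under local accesses and snooped bus events. It is a deliberate simplification (for example, a read miss with no other sharer would normally land in Exclusive), and the helper names are my own, not any real controller's interface.

```c
/* Simplified MESI state machine for a single cache line (illustrative only). */
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

/* Transition on a local CPU access; a real controller would also issue bus requests. */
mesi_t mesi_on_local(mesi_t s, int is_write) {
    if (is_write) return MODIFIED;      /* write hit/miss: gain ownership (BusRdX if needed) */
    if (s == INVALID) return SHARED;    /* read miss: fetch; assume another copy may exist */
    return s;                           /* read hit: state unchanged */
}

/* Transition when another processor's request is snooped on the bus. */
mesi_t mesi_on_snoop(mesi_t s, int other_is_write) {
    if (other_is_write) return INVALID;                 /* remote write: invalidate our copy */
    if (s == MODIFIED || s == EXCLUSIVE) return SHARED; /* remote read: supply data, demote */
    return s;
}

int main(void) {
    mesi_t line = INVALID;
    line = mesi_on_local(line, 0);   /* local read miss  -> SHARED   */
    line = mesi_on_local(line, 1);   /* local write      -> MODIFIED */
    line = mesi_on_snoop(line, 0);   /* remote read      -> SHARED   */
    line = mesi_on_snoop(line, 1);   /* remote write     -> INVALID  */
    printf("final state: %d\n", line);
    return 0;
}
```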
Directory-based Cache Coherence Protocol:
- Used in larger, non-bus-based systems (e.g., NUMA systems with many processors).
- A directory maintains information about which processors are sharing or have an exclusive copy of each memory block.
- When a processor requests a block or attempts to write to it, the directory is consulted to find other copies and send invalidation or update messages only to the necessary caches.
- Scalability: More scalable than snooping as it avoids broadcast traffic.
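As a rough illustration of the directory idea, here is a hedged C sketch of a full-map directory entry with a sharer bit-vector. Real directory organizations (limited-pointer, sparse, in-cache) differ considerably; the struct and handler below are assumptions for illustration only.

```c
/* Illustrative full-map directory entry for one memory block (assumes at most 64 nodes). */
#include <stdint.h>
#include <stdio.h>

typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

typedef struct {
    dir_state_t state;
    uint64_t    sharers;   /* bit i set => node i holds a copy */
    int         owner;     /* valid only in DIR_EXCLUSIVE */
} dir_entry_t;

/* On a write request from 'node', return the set of caches that need an invalidation. */
uint64_t handle_write_request(dir_entry_t *e, int node) {
    uint64_t to_invalidate = e->sharers & ~(1ULL << node); /* everyone except the writer */
    e->state   = DIR_EXCLUSIVE;
    e->owner   = node;
    e->sharers = 1ULL << node;
    return to_invalidate;   /* point-to-point invalidations, no broadcast needed */
}

int main(void) {
    dir_entry_t e = { DIR_SHARED, 0x0B, -1 };        /* nodes 0, 1, 3 currently share the block */
    uint64_t inv = handle_write_request(&e, 1);      /* node 1 wants to write */
    printf("invalidate mask: 0x%llx\n", (unsigned long long)inv);  /* 0x9: nodes 0 and 3 */
    return 0;
}
```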
Uniform Cache Access (UCA) vs. Non-Uniform Cache Access (NUCA):
- UCA: Access times to the shared cache are (roughly) the same from all processors.
- NUCA: Access times to the shared cache depend on the processor's proximity to the cache bank (analogous to NUMA for main memory).
- D-NUCA (Dynamic NUCA): Cache architectures that dynamically place and migrate data among banks to optimize access latency, often treating the distributed L2/L3 banks as a single logically shared cache.
- S-NUCA (Static NUCA): Access times are non-uniform, but each address maps to a fixed bank determined by physical layout.
Cache Inclusion/Exclusion Policies:
- Inclusion: Data present in an upper-level cache (e.g., L1) must also be present in a lower-level cache (e.g., L2). Simplifies coherence.
- Exclusion: Data is in L1 or L2, but not both (at least for private caches). Can increase effective cache capacity.
- Non-Inclusion: No strict inclusion/exclusion relationship.
Difference between Transaction and Transactional Memory:
- Transaction (Database sense): An ACID (Atomicity, Consistency, Isolation, Durability) unit of work against a database.
- Transactional Memory (TM): A concurrency control mechanism for shared memory programming that allows programmers to define blocks of code (transactions) that execute atomically and in isolation, similar to database transactions. The system (hardware, software, or hybrid) manages conflicts.
- STM (Software Transactional Memory): Implemented entirely in software.
- HTM (Hardware Transactional Memory): Implemented with hardware support, often offering better performance.
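A tiny sketch of what transactional memory looks like to the programmer, assuming GCC's software transactional memory extension (compile with gcc -fgnu-tm); hardware TM exposes a different, lower-level interface, and this is only one possible STM flavor.

```c
/* Illustrative transactional transfer, assuming GCC's TM extension (gcc -fgnu-tm). */
#include <stdio.h>

static long account_a = 100, account_b = 0;

void transfer(long amount) {
    /* The block executes atomically and in isolation; the TM runtime
       detects conflicting transactions and re-executes them as needed. */
    __transaction_atomic {
        account_a -= amount;
        account_b += amount;
    }
}

int main(void) {
    transfer(25);
    printf("a=%ld b=%ld\n", account_a, account_b);   /* a=75 b=25 */
    return 0;
}
```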
Part 3: Graphics Processing Units (GPUs) – Parallelism Unleashed!
Originally designed for graphics, GPUs have evolved into massively parallel processors capable of tackling general-purpose computing tasks.
- GPUs as Parallel Computers: Modern GPUs are highly parallel, many-core processors with thousands of simple cores designed for high throughput on data-parallel computations. They operate in a SIMT (Single Instruction, Multiple Threads) fashion, which is a variation of SIMD.
- Architecture of a Modern GPU (e.g., NVIDIA, AMD):
- A collection of Streaming Multiprocessors (SMs) or Compute Units (CUs).
- Each SM/CU contains numerous simpler cores (CUDA cores, Stream Processors).
- Hierarchical memory:
- Fast on-chip shared memory/scratchpad per SM/CU.
- L1/L2 caches.
- Large, high-bandwidth global device memory (GDDR RAM).
- Hardware schedulers manage thousands of concurrent threads.
- Evolution of Graphics Pipelines: From fixed-function pipelines (for specific graphics tasks like vertex processing, pixel shading) to programmable pipelines where developers could write custom shader programs, paving the way for GPGPU.
- GPGPUs (General-Purpose computing on GPUs): Using GPUs for non-graphics applications that are data-parallel and computationally intensive (e.g., scientific simulations, machine learning, financial modeling).
- Scalable GPUs: Modern GPUs are incredibly scalable within a single chip (more SMs, more cores). Multiple GPUs can also be used in a system for even greater parallelism.
- Architectural Characteristics of Future Systems: Trends include tighter CPU-GPU integration, larger and faster memory, more sophisticated on-chip networks, and specialized hardware for AI/ML (e.g., tensor cores).
- Implication of Technology and Architecture for Users (Programmers): Requires thinking in a highly parallel way, managing data movement between CPU (host) and GPU (device) memory, understanding the GPU memory hierarchy, and writing code (kernels) that can be executed by thousands of threads.
- Example: Vector Addition on GPU: A simple task like adding two vectors can be massively parallelized: each element can be computed by a separate GPU thread (see the kernel sketch at the end of this section).
- Applications of GPU Computing:
- Deep learning (training and inference).
- Scientific simulations (physics, chemistry, biology).
- Image and video processing.
- Cryptography and cryptanalysis.
- Financial modeling and risk analysis.
- Medical imaging.
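Here is the vector-addition kernel promised above: a minimal CUDA C sketch in which each GPU thread computes one output element. The problem size, block size, and use of unified memory are illustrative choices, and error checking is omitted for brevity.

```cuda
// Minimal CUDA vector addition: one GPU thread per output element.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // guard against running past n
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);   // unified memory keeps the example short
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;      // enough blocks to cover all n elements
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);                    // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```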
Part 4: Introduction to Parallel Programming – Strategies, Patterns & Performance
Writing parallel programs effectively requires new strategies and ways of thinking.
Strategies: Decomposition (breaking the problem into tasks), Assignment (assigning tasks to processes or threads), Orchestration (coordinating tasks, communication, and synchronization), Mapping (binding processes to physical PEs).
Mechanism: How parallelism is expressed and managed (e.g., threads, processes, message passing, directives).
Performance Theory: Understanding Amdahl's Law (limits of speedup due to sequential portions), Gustafson's Law (scaled speedup), and metrics like speedup, efficiency, and scalability.
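For reference, here are the two laws in their usual form, plus an illustrative calculation (the numbers are made up for the example):

```latex
% Amdahl's Law: p = parallelizable fraction of the work, N = number of processors
S_{\mathrm{Amdahl}}(N) = \frac{1}{(1-p) + p/N}

% Gustafson's Law (scaled speedup), s = 1 - p = serial fraction of the scaled workload
S_{\mathrm{Gustafson}}(N) = N - s\,(N-1)

% Illustrative numbers: p = 0.95, N = 100
S_{\mathrm{Amdahl}} = \frac{1}{0.05 + 0.95/100} \approx 16.8,
\qquad
S_{\mathrm{Gustafson}} = 100 - 0.05 \times 99 = 95.05
```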
Parallel Programming Patterns: Reusable solutions to common parallel programming problems. (The list is extensive, so I'll keep the descriptions brief.)
- Nesting Pattern: Applying parallel patterns hierarchically.
- Parallel Control Pattern: Managing the flow of parallel execution.
- Fork-Join: A master thread forks multiple worker threads to perform tasks in parallel, then waits for them to join (complete) before proceeding.
- Pipeline: A series of stages through which data flows; different stages operate on different data items concurrently, often with a dedicated processor per stage.
- Parallel Data Management: How data is structured and accessed in parallel.
- Map (Scaled Vector, Mandelbrot examples): Applying a function independently to all elements of a collection.
Scaling a vector (multiplying every element by a constant) and the Mandelbrot set calculation (each pixel computed independently) are classic examples.
- Collective Operations:
- Reduce: Combining elements of a collection into a single result using an associative operator (e.g., sum, max).
- Fusing Map and Reduce: Performing a map operation and then immediately reducing the results, often for efficiency (e.g., the map-reduce paradigm); see the dot-product sketch after this list.
- Scan (Prefix Sum): Computing prefix sums (or products, etc.) over a collection.
- Fusing Map and Scan: Applying a map then a scan.
- Data Reorganization:
- Gather: Collecting distributed data items into a single location.
- Scatter: Distributing data items from a single source to multiple destinations.
- Pack: Compressing data by removing unused elements.
- Stencil and Recurrence:
- Stencil: Each element in a grid is updated based on the values of its neighbors. Common in image processing and PDE solvers.
- Recurrence: Computations where each term depends on previous terms (e.g., Fibonacci sequence). Parallelizing these can be tricky but is possible for certain types.
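As referenced above, a dot product is the textbook instance of fusing map and reduce: the element-wise multiply (map) feeds straight into the running sum (reduce) without materializing an intermediate array. Below is a minimal sequential C sketch of the fused form; any of the models in Part 5 could parallelize the loop.

```c
/* Fused map (x[i]*y[i]) and reduce (+) in a single pass: a dot product. */
#include <stdio.h>

double dot(const double *x, const double *y, int n) {
    double sum = 0.0;                 /* reduction accumulator */
    for (int i = 0; i < n; ++i)
        sum += x[i] * y[i];           /* map result is consumed immediately */
    return sum;
}

int main(void) {
    double x[] = {1, 2, 3, 4}, y[] = {4, 3, 2, 1};
    printf("dot = %f\n", dot(x, y, 4));   /* 4 + 6 + 6 + 4 = 20 */
    return 0;
}
```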
Part 5: Parallel Programming Languages & Models – Speaking to the Processors
Several models and libraries help programmers write parallel code.
Distributed Memory Programming with MPI (Message Passing Interface):
- MPI: A standardized API for writing message-passing programs. The dominant model for high-performance distributed memory systems (clusters, supercomputers).
- Key Concepts: Processes (each with its own memory), communicators (groups of processes), point-to-point communication (MPI_Send, MPI_Recv), and collective communication.
- Example: Trapezoidal Rule in MPI: To approximate an integral, the interval is divided among the MPI processes. Each process computes the sum of the trapezoids in its sub-interval locally; a collective operation such as MPI_Reduce then sums the partial results into the global integral (see the sketch at the end of this MPI list).
- I/O Handling: Parallel I/O libraries (like MPI-IO) are needed for efficient reading/writing of large datasets by multiple processes.
- MPI Derived Data Types: Allow users to define custom, non-contiguous data layouts (e.g., columns of a matrix) so they can be sent/received in a single MPI call, improving efficiency and simplifying code.
- Collective Communication: Operations involving all processes in a communicator (e.g., MPI_Bcast, MPI_Reduce, MPI_Scatter, MPI_Gather, MPI_Alltoall). MPI provides optimized implementations.
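A minimal sketch of the trapezoidal-rule example described above, assuming an illustrative integrand f(x) = x^2 over [0, 1] and a process count that divides the number of trapezoids evenly; compile with mpicc and launch with mpirun.

```c
/* Trapezoidal rule with MPI: each process integrates its sub-interval,
 * then MPI_Reduce sums the partial results on rank 0. */
#include <mpi.h>
#include <stdio.h>

static double f(double x) { return x * x; }          /* example integrand */

static double local_trap(double a, double b, int n, double h) {
    double sum = (f(a) + f(b)) / 2.0;                /* composite trapezoid rule */
    for (int i = 1; i < n; ++i) sum += f(a + i * h);
    return sum * h;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double a = 0.0, b = 1.0;                         /* integrate f over [a, b] */
    int n = 1 << 20;                                 /* total trapezoids (assumed divisible) */
    double h = (b - a) / n;
    int local_n = n / size;                          /* trapezoids per process */
    double local_a = a + rank * local_n * h;
    double local_b = local_a + local_n * h;

    double local_int = local_trap(local_a, local_b, local_n, h);
    double total = 0.0;
    MPI_Reduce(&local_int, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("integral ~= %.6f\n", total);   /* expect ~1/3 */
    MPI_Finalize();
    return 0;
}
```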
Shared Memory Programming with Pthreads (POSIX Threads):
- Pthreads: A POSIX standard API for creating and managing threads within a single process that shares memory. Low-level, gives fine-grained control.
- Key Concepts: Thread creation (pthread_create), joining/termination (pthread_join), mutexes (for mutual exclusion), and condition variables.
- Condition Variables: Used together with mutexes to manage synchronization scenarios where threads must wait for a condition to become true before proceeding (see the sketch after this list).
- Read-Write Locks: Allow multiple threads to read shared data concurrently but require exclusive access for writing. More efficient than a simple mutex if reads are much more frequent than writes.
- Cache Handling: Programmers need to be aware of cache effects (false sharing, true sharing) and try to optimize data layout and access patterns for better cache performance.
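A small Pthreads sketch of the mutex-plus-condition-variable idiom mentioned above: a worker thread waits until the main thread publishes a value. The variable names are illustrative; compile with -pthread.

```c
/* Pthreads sketch: a worker waits on a condition variable until the
 * main thread publishes a value under the mutex. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
static int data = 0;
static int data_ready = 0;

static void *worker(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!data_ready)                      /* loop guards against spurious wakeups */
        pthread_cond_wait(&ready, &lock);    /* atomically releases the lock while waiting */
    printf("worker got %d\n", data);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);

    pthread_mutex_lock(&lock);
    data = 42;                               /* produce the value...                 */
    data_ready = 1;                          /* ...and mark it ready under the mutex */
    pthread_cond_signal(&ready);             /* wake the waiting worker              */
    pthread_mutex_unlock(&lock);

    pthread_join(t, NULL);                   /* wait for the worker to finish */
    return 0;
}
```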
Shared Memory Programming with OpenMP (Open Multi-Processing):
- OpenMP: An API that supports multi-platform shared-memory parallel programming in C, C++, and Fortran. It uses compiler directives, library routines, and environment variables. Higher-level than Pthreads.
- Key Concepts: Primarily directive-based. The programmer annotates sequential code with pragmas/directives to indicate parallel regions, loop parallelization, tasking, etc.
- Parallel for Directives (#pragma omp parallel for): Easily parallelizes loops by distributing iterations among a team of threads.
- Scheduling Loops: Specifies how loop iterations are assigned to threads (e.g., static, dynamic, guided, runtime).
- Thread Safety: Ensuring that shared data is accessed correctly by multiple threads, often requiring synchronization constructs like critical, atomic, or locks.
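A minimal OpenMP sketch tying the directives above together: a parallel for loop with a static schedule and a reduction clause to combine per-thread partial sums safely. Compile with -fopenmp; the loop body is just an illustrative computation.

```c
/* OpenMP sketch: parallelize a loop and combine partial sums with a reduction. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    const int n = 1000000;
    double sum = 0.0;

    /* Iterations are split among the team; schedule(static) gives each thread
       a contiguous chunk; reduction(+:sum) avoids a data race on sum. */
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (int i = 1; i <= n; ++i)
        sum += 1.0 / i;                     /* partial harmonic sum */

    printf("threads available: %d, sum = %f\n", omp_get_max_threads(), sum);
    return 0;
}
```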
CUDA (Compute Unified Device Architecture):
- CUDA: A parallel computing platform and programming model created by NVIDIA for general-purpose processing on their GPUs (GPGPU).
- Parallel Programming in CUDA C/C++: An extension of C/C++ that allows developers to write kernels (functions) that are executed by many GPU threads in parallel.
- Thread Management: CUDA organizes threads into a hierarchy:
- Grid: Contains multiple blocks.
- Block: Contains multiple threads. Threads within a block can cooperate using shared memory and synchronize using __syncthreads().
- Thread: The basic unit of execution.
- Memory Model:
- Global Memory: Large, off-chip device memory accessible by all threads.
- Shared Memory: Fast, on-chip memory shared by threads within a block.
- Constant Memory: Read-only memory cached on-chip, good for data broadcast to all threads.
- Texture Memory: Optimized for 2D spatial locality.
- Local Memory: Per-thread storage; the compiler uses registers when it can and spills to off-chip local memory when it cannot.
- Constant Memory and Events: Constant memory is useful for read-only data shared by many threads. Events are used for synchronization and timing operations between host (CPU) and device (GPU) or within the device.
- Graphics Interoperability: CUDA can interoperate with graphics APIs like OpenGL and DirectX, allowing GPU-computed data to be directly used for rendering.
- Atomics: Atomic operations (e.g., atomicAdd, atomicCAS) ensure that read-modify-write operations on shared or global memory are performed indivisibly by threads.
- Streams: Sequences of CUDA operations (kernel launches, memory copies) that can execute concurrently on the GPU, allowing for overlapping computation and data transfer.
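To tie several of these CUDA concepts together, here is a hedged sketch of a parallel sum that uses per-block shared memory, __syncthreads() barriers, and a single atomicAdd per block; the fixed block size of 256, the use of unified memory, and the omission of error checking are illustrative simplifications.

```cuda
// CUDA sketch: block-level sum in shared memory, then one atomicAdd per block.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void sumKernel(const float *in, float *out, int n) {
    __shared__ float partial[256];                     // one slot per thread in the block
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;
    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                                   // all loads done before reducing

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();                               // wait for this reduction step
    }
    if (tid == 0) atomicAdd(out, partial[0]);          // one global update per block
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    *out = 0.0f;

    sumKernel<<<(n + 255) / 256, 256>>>(in, out, n);   // block size matches the shared array
    cudaDeviceSynchronize();
    printf("sum = %f (expected %d)\n", *out, n);       // expect 1048576
    cudaFree(in); cudaFree(out);
    return 0;
}
```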
Conclusion: Building Bridges to Computational Speed!
Understanding parallel architectures and the art of parallel programming is no longer a niche skill – it's fundamental to unlocking the next wave of computational breakthroughs. We've covered a fascinating spectrum:
- The foundational classifications of parallel computers (SISD to MIMD) and the implications of Moore's Law shifting to multi-cores.
- The intricacies of cache design and coherence in shared-memory systems, including transactional memory concepts.
- The rise of GPUs as massively parallel computing engines for GPGPU tasks.
- The core strategies and patterns that guide parallel program design.
- Powerful programming models and languages like MPI for distributed memory, Pthreads and OpenMP for shared memory, and CUDA for harnessing GPU power.
Each architectural choice and programming paradigm comes with its own set of trade-offs regarding performance, scalability, programmability, and cost. The journey into parallel computing is about choosing the right tools and techniques to make multiple processors "sing in harmony," tackling problems that were once unimaginably complex.