Algorithm Architects - Designing & Analyzing Parallel Algorithms!

 — #Parallel Computing #CS Basics

Hey, algorithm designers and parallel processing enthusiasts! We know that the demand for computational power is ever-increasing, pushing the limits of single-processor speeds. The answer? Parallelism! But simply throwing more processors at a problem isn't enough. We need to fundamentally rethink how we design algorithms to make them thrive in a parallel environment.

This is where the Design and Analysis of Parallel Algorithms comes in. It's the art and science of creating procedures that can be executed concurrently on multiple processing units, and understanding their performance characteristics. We'll explore different models of parallel computers, fundamental techniques for parallelizing tasks, and look at how common computational problems are tackled in a parallel fashion.

Ready to think in parallel? Let's begin!


Part 1: Understanding Parallel Architectures – The Playground for Algorithms

Before designing parallel algorithms, we need to understand the types of parallel computers they'll run on.

  • Introduction to Parallel Computers: These systems employ multiple processors to execute parts of a program simultaneously. The architecture significantly influences algorithm design.
  • SIMD (Single Instruction, Multiple Data): A class of parallel computers where a single control unit dispatches the same instruction to multiple processing elements (PEs), each of which operates on a different data item. Think of a drill sergeant yelling one command, and the whole platoon executes it on their own piece of equipment.
    • EREW (Exclusive Read, Exclusive Write) SIMD: No two processors may read from or write to the same shared memory location at the same time. This is the most restrictive PRAM (Parallel Random Access Machine) model.
    • CREW (Concurrent Read, Exclusive Write) SIMD: Multiple processors can read the same memory location concurrently, but writes to a location must be exclusive. A more powerful PRAM model than EREW.
    • SM-SIMD (Shared Memory SIMD): Refers to SIMD machines where PEs communicate through a shared memory. The PRAM models (EREW, CREW, CRCW) are theoretical SM-SIMD models.
  • Tree and Mesh Interconnection Computers: Specific network topologies for connecting processors in SIMD or MIMD machines.
    • Tree: Processors are connected in a tree structure. Good for algorithms involving divide-and-conquer or broadcast/reduction operations.
    • Mesh: Processors are arranged in a grid (2D or 3D). Each processor is connected to its nearest neighbors. Suited for problems with local data dependencies (e.g., image processing, solving PDEs).
  • Classifying MIMD (Multiple Instruction, Multiple Data) Algorithms: In MIMD systems, each processor can execute different instructions on different data independently.
    • Shared Memory MIMD (Multiprocessors): Processors communicate via shared memory (e.g., multi-core CPUs). Algorithms often involve threads, locks, and semaphores.
    • Distributed Memory MIMD (Multicomputers): Processors have their own private memory and communicate by explicit message passing over a network (e.g., clusters, supercomputers using MPI).
  • Hypercube SIMD Model: A specific SIMD architecture where $N = 2^k$ processors are interconnected like the vertices of a $k$-dimensional hypercube: two processors are adjacent exactly when their binary addresses differ in one bit, so each processor has $k$ neighbors. Offers good connectivity and low diameter (a quick sketch of the neighbor structure follows this list).
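
The hypercube's neighbor structure is easy to see in code. Below is a tiny Python sketch (the function name and interface are ours, purely for illustration): the neighbors of processor p are obtained by flipping each of its $k$ address bits.

```python
def hypercube_neighbors(p: int, k: int) -> list[int]:
    """Return the k neighbors of processor p in a k-dimensional hypercube.

    Processor addresses run from 0 to 2**k - 1; two processors are adjacent
    exactly when their binary addresses differ in a single bit, so flipping
    each of the k bits of p yields its k neighbors.
    """
    return [p ^ (1 << d) for d in range(k)]

# Example: in a 3-dimensional hypercube (8 processors),
# processor 5 (binary 101) is adjacent to 4 (100), 7 (111), and 1 (001).
print(hypercube_neighbors(5, 3))  # [4, 7, 1]
```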

Part 2: Parallel Sorting & Selection – Bringing Order to Chaos, Faster!

Sorting and selection are fundamental operations. Let's see how they go parallel.

  • Sequential Algorithm (Context): Algorithms like QuickSort, MergeSort, and HeapSort typically run in $O(n \log n)$ time on a single processor. Selection (finding the k-th smallest element) can be done in $O(n)$ time (e.g., Median-of-Medians).
  • Algorithm for Parallel Selection: Finding the k-th smallest element can be parallelized. One approach involves using a parallel version of QuickSelect: processors collectively choose a pivot, partition data, and then recursively solve in parallel on the relevant sub-problem.
  • Sorting on a Linear Array: A simple interconnection where processors are arranged in a line.
    • Odd-Even Transposition Sort: A comparison-exchange sort. In alternating phases, odd-indexed PEs compare-exchange with their right neighbors, then even-indexed PEs compare-exchange with theirs. Takes $O(n)$ phases for $n$ elements on $n$ PEs (see the sketch after this list).
  • Broadcasting a Datum: Sending a single piece of data from one processor to all other processors. On a linear array this takes $O(n)$ time; on a mesh or hypercube it is faster (e.g., $O(\sqrt{n})$ on an $n$-processor mesh, $O(\log n)$ on an $n$-processor hypercube).
  • Computing All Sums (Prefix Sums / Scan): Given $x_0, \dots, x_{n-1}$, compute $s_k = x_0 + \dots + x_k$ for all $k$. Can be done efficiently in parallel (e.g., in $O(\log n)$ time on $n/\log n$ EREW PRAM processors); a sketch follows this list.
  • Sorting on a Mesh: For an $N \times N$ mesh of $P = N^2$ processors, sorting $P$ elements can be done with algorithms like ShearSort (roughly $O(N \log N)$ time) or optimal variants of odd-even merge that achieve $O(N)$, i.e. $O(\sqrt{P})$, time.
  • Sorting on EREW SIMD Computer:
    • Enumeration Sort: For each element, count how many other elements are smaller than it; this count determines its rank (final position). With $n^2$ processors on an EREW PRAM, the counting reduces to parallel sums, so $n$ elements can be sorted in $O(\log n)$ time; with only $n$ processors it is slower but still parallelizable.
    • Parallel Quick Sort: Can be adapted. One common EREW approach has a master processor selecting pivots and distributing sub-problems; other variants use a more distributed pivot selection.
    • Hyperquicksort: A QuickSort variant for hypercube architectures. The hypercube is recursively divided, data is partitioned and exchanged along dimensions, and local sorts are performed. For $n$ elements on $P$ processors, the expected time is roughly $O\big(\tfrac{n}{P}\log\tfrac{n}{P} + \tfrac{n}{P}\log P + \log^2 P\big)$ (local sorting, data exchange, and pivot broadcasts, respectively).
  • Sorting on Other Networks: Algorithms are tailored to the specific network topology (e.g., sorting on a tree, shuffle-exchange network). Bitonic sort is another famous parallel sorting network algorithm.
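
Below is a minimal Python sketch of the odd-even transposition sort from the list above, simulating the linear array sequentially. On a real array of $n$ PEs every compare-exchange within a phase happens simultaneously; the simulation just shows why that is safe: the pairs touched in one phase are disjoint.

```python
def odd_even_transposition_sort(a):
    """Simulate odd-even transposition sort on a linear array of n PEs.

    There are n phases; each phase performs compare-exchanges on disjoint
    neighbor pairs, so all exchanges of one phase could run in parallel.
    """
    a = list(a)
    n = len(a)
    for phase in range(n):
        start = 0 if phase % 2 == 0 else 1       # alternate between the two pairings
        for i in range(start, n - 1, 2):          # disjoint pairs -> parallelizable
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]   # compare-exchange with right neighbor
    return a

print(odd_even_transposition_sort([5, 1, 4, 2, 8, 0, 2]))  # [0, 1, 2, 2, 4, 5, 8]
```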

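The all-sums (prefix-sum) computation can be sketched the same way. The version below is the simple $n$-processor doubling scan rather than the work-optimal $n/\log n$-processor EREW algorithm mentioned above; it only illustrates that each of the $O(\log n)$ rounds consists of independent additions.

```python
def parallel_prefix_sums(x):
    """Inclusive prefix sums via the logarithmic-depth doubling scan.

    After the round with step d, position k holds the sum of the (up to) 2*d
    most recent inputs ending at k. All additions within a round read only
    the previous round's values, so they are independent of one another and
    a PRAM with n processors finishes in O(log n) parallel steps.
    """
    s = list(x)
    n = len(s)
    d = 1
    while d < n:
        # One parallel round: every position k >= d adds in the value d places to its left.
        s = [s[k] + s[k - d] if k >= d else s[k] for k in range(n)]
        d *= 2
    return s

print(parallel_prefix_sums([3, 1, 4, 1, 5, 9, 2, 6]))
# [3, 4, 8, 9, 14, 23, 25, 31]
```
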
Part 3: Matrix Mastery in Parallel – Crunching Numbers on Grids and Cubes

Matrix operations are central to many scientific and engineering applications.

  • Mesh Transpose: Transposing an $N \times N$ matrix distributed on an $N \times N$ mesh of processors. Element $A[i,j]$ stored at $P_{i,j}$ needs to move to $P_{j,i}$. This involves routing data along rows and then columns (or vice versa). Can be done in $O(N)$ time.
  • Shuffle Transpose: Matrix transposition on architectures that support shuffle interconnections.
  • EREW Transpose: Transposing a matrix on an EREW PRAM. With $N^2$ processors for an $N \times N$ matrix, the processor assigned to element $(i,j)$ simply copies $A[i,j]$ to position $(j,i)$; since every source and every destination location is distinct, no concurrent reads or writes are needed, and the transpose takes $O(1)$ time even on a strict EREW model.
  • Mesh Multiplication ($C = A \times B$): For $N \times N$ matrices on an $N \times N$ mesh.
    • Cannon's Algorithm: A classic algorithm. It involves an initial alignment (skewing) of matrices A and B, followed by $N$ steps of circularly shifting the rows of A and the columns of B while performing local multiply-accumulate operations. Takes $O(N)$ time (see the sketch after this list).
  • Cube Multiplication: Matrix multiplication on a hypercube architecture. Data is distributed, and elements are broadcast/rotated along dimensions to allow processors to compute their share of the product matrix.
  • Matrix by Vector Multiplication ($y = Ax$):
    • If matrix $A$ is distributed row-wise, each processor $P_i$ computes one entry of the output vector, $y_i = \sum_j A_{ij} x_j$. This requires broadcasting the vector $x$ (or the needed parts of it); see the sketch after this list.
    • If $A$ is distributed column-wise, each $P_j$ computes the partial vector $A_{:,j}\,x_j$, and these partial vectors are then summed up (a reduction).
  • Tree Multiplication: Performing matrix-vector or matrix-matrix multiplication on a tree-connected architecture. Data flows up and down the tree for broadcasts and reductions.
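
To make Cannon's algorithm less abstract, here is a small NumPy simulation of the element-per-processor case on an $N \times N$ mesh. np.roll stands in for the circular shifts the mesh would perform in hardware; this is a sketch of the data movement, not a distributed implementation.

```python
import numpy as np

def cannon_matmul(A, B):
    """Simulate Cannon's algorithm for C = A @ B on an N x N mesh,
    with one matrix element per (virtual) processor."""
    N = A.shape[0]
    A = A.astype(float).copy()
    B = B.astype(float).copy()
    C = np.zeros((N, N))

    # Initial alignment (skew): shift row i of A left by i positions,
    # and column j of B up by j positions.
    for i in range(N):
        A[i, :] = np.roll(A[i, :], -i)
    for j in range(N):
        B[:, j] = np.roll(B[:, j], -j)

    # N compute-and-shift steps: each processor multiplies the pair it holds,
    # then rows of A shift left by 1 and columns of B shift up by 1.
    for _ in range(N):
        C += A * B                    # local multiply-accumulate at every processor
        A = np.roll(A, -1, axis=1)    # circular shift of A's rows
        B = np.roll(B, -1, axis=0)    # circular shift of B's columns
    return C

rng = np.random.default_rng(0)
A, B = rng.random((4, 4)), rng.random((4, 4))
assert np.allclose(cannon_matmul(A, B), A @ B)
```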

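Likewise, a minimal shared-memory sketch of the row-wise decomposition of $y = Ax$: each worker process receives a contiguous block of rows plus a full copy of $x$ (the "broadcast"), and the partial results are concatenated. The worker count, block splitting, and function names are illustrative choices, and for small matrices the process overhead will outweigh any benefit; the point is only the decomposition pattern.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def _rows_times_x(args):
    """Worker: multiply one block of rows of A by the broadcast vector x."""
    A_block, x = args
    return A_block @ x

def parallel_matvec(A, x, workers=4):
    """Row-wise parallel y = A x: split the rows of A across workers,
    send each worker a full copy of x, and concatenate the partial results."""
    blocks = np.array_split(A, workers, axis=0)          # row-wise distribution
    with ProcessPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(_rows_times_x, [(blk, x) for blk in blocks])
    return np.concatenate(list(parts))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A, x = rng.random((1000, 1000)), rng.random(1000)
    assert np.allclose(parallel_matvec(A, x), A @ x)
```
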
Part 4: Numerical Problem Solving in Parallel – From Equations to Eigenvalues

Many complex numerical problems benefit greatly from parallel computation.

  • Linear Equations (Solving $Ax = b$):
    • SIMD Algorithm: For dense systems, parallel versions of Gaussian elimination or iterative methods (like Jacobi or Gauss-Seidel relaxation) can be implemented on SIMD architectures (e.g., mesh, hypercube); a Jacobi sketch follows this list. Operations like row/column broadcasts and reductions are key. For sparse systems, different parallel approaches based on graph partitioning of the matrix are used.
  • Roots of Nonlinear Equations ($f(x) = 0$):
    • MIMD Algorithm: Methods like bisection, Newton-Raphson, or secant method can be parallelized. For instance, if searching for multiple roots, different processors can explore different initial guesses or intervals concurrently on an MIMD machine. For a single root, parallelism might involve parallelizing function evaluations if they are expensive.
  • Partial Differential Equations (PDEs):
    • Often solved using finite difference, finite element, or finite volume methods, which discretize the domain into a grid.
    • Parallelism is natural: each processor handles a subgrid and exchanges boundary data with neighbors. This maps well to mesh-connected architectures.
  • Computing Eigenvalues:
    • Important in many scientific and engineering problems. Algorithms like the QR algorithm, Jacobi method, or divide-and-conquer methods can be parallelized. Parallelism often involves parallelizing matrix operations (multiplications, factorizations) within these algorithms.
  • Monte Carlo Methods: Use repeated random sampling to obtain numerical results.
    • Inherently parallel: each sample (or batch of samples) can be computed independently by different processors.
    • Parallel Random Number Generators (PRNGs): Crucial for Monte Carlo. We need PRNGs that provide multiple independent streams of random numbers across processors while preserving statistical quality and reproducibility; see the Monte Carlo sketch after this list.
    • Random Number Distributions: Generating random numbers from various distributions (uniform, normal, exponential, etc.) in parallel for simulations.
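
As a sketch of the relaxation idea mentioned under Linear Equations above, here is a Jacobi iteration in NumPy. Every component update in a sweep depends only on the previous iterate, so all updates of a sweep could run in parallel; the iteration limit and tolerance are arbitrary illustrative choices.

```python
import numpy as np

def jacobi(A, b, iters=200, tol=1e-10):
    """Solve A x = b by Jacobi relaxation:
        x_new[i] = (b[i] - sum_{j != i} A[i, j] * x[j]) / A[i, i]
    Each component of x_new depends only on the previous x, so all n
    component updates of a sweep are independent (and parallelizable).
    """
    D = np.diag(A)                   # diagonal entries of A
    R = A - np.diagflat(D)           # off-diagonal part of A
    x = np.zeros_like(b, dtype=float)
    for _ in range(iters):
        x_new = (b - R @ x) / D      # one fully parallel sweep
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Diagonally dominant system, for which Jacobi is guaranteed to converge.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 5.0, 2.0],
              [0.0, 2.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])
assert np.allclose(A @ jacobi(A, b), b, atol=1e-8)
```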

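And a minimal Monte Carlo sketch of the independent-streams idea for parallel random numbers: NumPy's SeedSequence.spawn hands each worker a statistically independent, reproducible substream. The pi-estimation task, worker count, and sample counts are just illustrative.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def pi_hits(args):
    """Worker: draw n points from its own independent RNG stream and count
    how many land inside the unit quarter-circle."""
    seed_seq, n = args
    rng = np.random.default_rng(seed_seq)
    x, y = rng.random(n), rng.random(n)
    return int(np.count_nonzero(x * x + y * y <= 1.0))

def parallel_pi(total_samples=4_000_000, workers=4, seed=42):
    """Embarrassingly parallel Monte Carlo estimate of pi: each worker gets a
    reproducible, independent substream spawned from one master SeedSequence."""
    n = total_samples // workers
    streams = np.random.SeedSequence(seed).spawn(workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        hits = sum(pool.map(pi_hits, [(s, n) for s in streams]))
    return 4.0 * hits / (n * workers)

if __name__ == "__main__":
    print(parallel_pi())   # ~3.141..., reproducible for a fixed seed and worker count
```
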
Part 5: Navigating Networks – Parallel Graph Algorithms

Graphs model relationships in countless domains. Parallelizing graph algorithms is essential for large-scale network analysis.

  • Definitions: A quick recap – graphs consist of vertices (nodes) and edges (connections). We deal with directed/undirected, weighted/unweighted graphs.
  • Graph Coloring: Assigning colors to vertices such that no two adjacent vertices share the same color, using a minimum number of colors. Parallel algorithms often involve speculative coloring and resolving conflicts, or distributing vertex processing.
  • Computing the Connectivity Matrix (Transitive Closure): Finding if there's a path between all pairs of vertices. Parallel versions of algorithms like Floyd-Warshall, or repeated matrix multiplication (computing $A^*$, where $A$ is the adjacency matrix), can be used. On a SIMD machine with $n^2$ processors for $n$ vertices, this can often be done efficiently.
  • Finding Connected Components: Partitioning vertices into sets where all vertices within a set are reachable from each other. Parallel approaches might involve pointer jumping on a tree representation or iterative merging of components discovered by different processors.
  • Traversal (BFS, DFS):
    • Parallel BFS: Can be done level by level: all vertices on the current frontier explore their neighbors in parallel (see the sketch after this list).
    • Parallel DFS: More challenging due to its sequential nature. Strategies include parallel exploration from multiple start nodes or dynamic work stealing.
  • Minimal Alpha-Beta Tree (Game Tree Search): Alpha-beta pruning is a search algorithm to decrease the number of nodes evaluated in a game tree. Parallelizing it involves distributing parts of the tree search to different processors, but care must be taken as pruning decisions in one branch can affect others.
  • Minimum Cost Spanning Tree (MCST):
    • Parallel versions of Prim's or Kruskal's algorithm.
    • Borůvka's algorithm is inherently parallel: each vertex (or component) selects its cheapest outgoing edge, these edges are added to merge components, and the process repeats until one component remains (a sketch follows this list).
  • Addition Tree / Multiplication Tree (Parallel Prefix / Reduction): These aren't distinct graph problems but rather computational patterns often visualized as trees, used for operations like computing all sums (prefix sums, as discussed earlier) or reducing values (e.g., finding the sum or max of $n$ numbers in $O(\log n)$ time on $n/\log n$ processors using a tree-like communication pattern).
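
To illustrate the level-synchronous BFS mentioned above, here is a frontier-based sketch in Python. The loop over the current frontier is written sequentially, but every frontier vertex could expand its neighbors independently; that per-level independence is where the parallelism lives.

```python
def frontier_bfs(adj, source):
    """Level-synchronous BFS. `adj` maps each vertex to a list of neighbors.

    Each iteration of the while-loop is one parallel round: all vertices on
    the current frontier can expand their neighbors simultaneously. Returns
    the BFS level (distance in edges) of every vertex reachable from source.
    """
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = set()
        for u in frontier:          # these expansions are independent of each other
            for v in adj[u]:
                if v not in level:
                    next_frontier.add(v)
        for v in next_frontier:
            level[v] = depth
        frontier = list(next_frontier)
    return level

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
print(frontier_bfs(adj, 0))   # {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```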

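And a compact sketch of Borůvka's algorithm, whose rounds are naturally parallel: in each round every component picks its cheapest outgoing edge independently. Union-find is used here only to track components, and the merging is done sequentially; a genuinely parallel version would also parallelize the edge scan and the merges (with consistent tie-breaking on equal weights).

```python
def boruvka_mst(n, edges):
    """Borůvka's MST algorithm for a connected graph on vertices 0..n-1.

    `edges` is a list of (weight, u, v) triples. In every round each current
    component selects its cheapest outgoing edge -- these selections are
    independent and hence parallelizable -- and the selected edges merge
    components; the rounds repeat until a single component remains.
    """
    parent = list(range(n))

    def find(x):                       # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst, components = [], n
    while components > 1:
        cheapest = {}                  # component root -> its best outgoing edge
        for w, u, v in edges:          # each component's scan is independent
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            for r in (ru, rv):
                if r not in cheapest or w < cheapest[r][0]:
                    cheapest[r] = (w, u, v)
        for w, u, v in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:               # merge the two components this edge connects
                parent[ru] = rv
                mst.append((u, v, w))
                components -= 1
    return mst

edges = [(1, 0, 1), (4, 0, 2), (3, 1, 2), (2, 1, 3), (5, 2, 3)]
print(boruvka_mst(4, edges))   # [(0, 1, 1), (1, 2, 3), (1, 3, 2)] -- total weight 6
```
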
Conclusion: Thinking in Parallel, Computing in Concert!

The design and analysis of parallel algorithms open up a new dimension in problem-solving, allowing us to tackle computations that would be impossible for sequential machines. We've journeyed through:

  • The diverse architectures of parallel computers (SIMD, MIMD, specific interconnections) that form the stage for these algorithms.
  • How fundamental tasks like sorting, selection, and matrix operations are re-imagined for parallel execution on various platforms.
  • The application of parallel techniques to solve complex numerical problems and conduct Monte Carlo simulations.
  • Strategies for parallelizing essential graph algorithms to analyze large-scale networks.

The core challenge always lies in effectively decomposing the problem, mapping tasks to processors, minimizing communication and synchronization overheads, and achieving good load balance. While complex, mastering parallel algorithm design is crucial for pushing the frontiers of science, engineering, and data analysis in our increasingly data-rich world!