01 - Multiprocessing
Section 1.1: HPC landscape overview: where different techniques fit
Key Concept
High-Performance Computing (HPC) includes various strategies to accelerate computation by leveraging hardware parallelism. Choosing the right approach depends on the nature of the problem, the scale of data, and available infrastructure.
Topics
- Multicore CPUs: Ideal for general-purpose, moderately parallel tasks using shared memory.
- GPUs: Well-suited for highly parallel workloads like matrix operations or simulations.
- Clusters (Distributed Computing): Used for large-scale problems that require many machines, often with message passing.
- Specialized Hardware (FPGAs/ASICs): Offer peak performance for specific applications but require more development effort.
- In-session Exercise (5–10 min): For each of the above techniques, name one problem or domain where it shines (e.g., GPUs for image processing, MPI clusters for weather simulation).
- Common Pitfall: Overestimating GPU benefit without accounting for data transfer and memory constraints.
- Best Practice: Benchmark different approaches early to identify the most scalable solution for your task.
Section 1.2: Introduction to the `multiprocessing` module
Key Concept
The `multiprocessing` module allows you to run code in parallel using multiple processor cores, enabling faster execution of computationally intensive tasks. It achieves this by creating and managing separate processes, each with its own memory space.
Topics
- Process Creation: Creating new processes to execute tasks concurrently.
- Process Communication: Mechanisms for processes to exchange data (e.g., queues, pipes).
- Process Pools: Simplified way to manage a group of worker processes.
- Global Interpreter Lock (GIL): Understanding why the GIL limits true thread parallelism in Python, and how separate processes sidestep it (see the sketch below).
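A minimal sketch of process creation and joining; the `compute` function and its workload are made up for illustration:

```python
from multiprocessing import Process

def compute(label):
    # Runs in a separate process with its own memory space.
    total = sum(i * i for i in range(5_000_000))
    print(f"{label}: {total}")

if __name__ == "__main__":  # guard required where processes are spawned (e.g., Windows/macOS)
    workers = [Process(target=compute, args=(f"worker-{i}",)) for i in range(4)]
    for p in workers:
        p.start()  # launch each process
    for p in workers:
        p.join()   # wait for all of them to finish
```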
Exercise
- Briefly consider a scenario where you could benefit from parallel processing. What tasks would be suitable?
Common Pitfalls
- Data Sharing: Remember that processes have separate memory spaces; data needs to be explicitly shared.
- Overhead: Creating and managing processes has overhead; it's not always faster for small tasks.
Best Practices
- Minimize Data Transfer: Reduce the amount of data passed between processes.
- Use Process Pools: Leverage process pools for efficient task distribution.
Section 1.3: Using `Process`, `Queue`, `Pipe`, and `Pool`
Key Concept
These `multiprocessing` primitives provide tools for concurrent execution, allowing you to distribute workloads across the cores of a single machine. They enable parallel processing and communication between processes.
Topics
- `Process`: Create and manage independent processes for parallel tasks.
- `Queue`: Facilitate inter-process communication by providing a buffer for data exchange.
- `Pipe`: Establish a communication channel between two processes (bidirectional by default; pass `duplex=False` for a one-way pipe).
- `Pool`: Manage a collection of worker processes, simplifying task distribution and result collection.
- In-session Exercise: Consider a scenario where you need to perform a computationally intensive task on multiple datasets. How would you leverage `Process` and `Queue` to parallelize the processing? (5 min; one possible shape is sketched below)
- Common Pitfalls: Incorrectly managing shared resources (e.g., race conditions) when using `Process`.
- Best Practices: Use `Pool` for managing a large number of worker processes to avoid the overhead of individual process creation.
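One possible shape for the exercise scenario, pairing `Process` with `Queue`; the `analyze` worker and dataset names here are hypothetical stand-ins:

```python
from multiprocessing import Process, Queue

def analyze(dataset, results):
    # Hypothetical worker: do some CPU-heavy work, then push
    # (dataset, result) onto the shared queue.
    results.put((dataset, sum(i * i for i in range(5_000_000))))

if __name__ == "__main__":
    datasets = ["set_a", "set_b", "set_c"]
    results = Queue()
    procs = [Process(target=analyze, args=(d, results)) for d in datasets]
    for p in procs:
        p.start()
    # Drain the queue before joining; Queue.get() blocks until data arrives.
    collected = [results.get() for _ in datasets]
    for p in procs:
        p.join()
    print(collected)
```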
Section 1.4: Introduction to `concurrent.futures` for cleaner parallel execution
Key Concept
`concurrent.futures` provides a high-level interface for asynchronously executing callables, simplifying parallel and concurrent programming in Python. It abstracts away much of the complexity of managing threads or processes.
Topics
- Executor: The core object that manages the execution of tasks. Choose `ThreadPoolExecutor` (for I/O-bound tasks) or `ProcessPoolExecutor` (for CPU-bound tasks).
- submit(): A method to submit a callable to the executor, returning a `Future` object.
- Future: Represents the result of an asynchronous computation. Use `Future.result()` to retrieve the result (blocking until it's available).
- Asynchronous Execution: Tasks run concurrently, allowing for faster overall execution time (see the sketch below).
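A minimal sketch of the submit/result flow with `ProcessPoolExecutor`; `heavy_calc` is a made-up stand-in for a CPU-bound callable:

```python
from concurrent.futures import ProcessPoolExecutor

def heavy_calc(n):
    # Stand-in for a CPU-bound computation.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with ProcessPoolExecutor() as executor:
        future = executor.submit(heavy_calc, 10_000_000)  # returns a Future immediately
        # ... other work (e.g., reading a file) could happen here ...
        print(future.result())  # blocks until the computation finishes
```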
Exercise
- (5 min): Consider a simple function that performs a computationally intensive calculation. How could you use `concurrent.futures` to run this calculation in parallel with another task (e.g., reading data from a file)? (No code required, just conceptualize the flow.)
Pitfalls
- Resource Contention: Be mindful of shared resources (e.g., files, databases) when using multiple processes or threads.
- Overhead: Parallelization isn't always faster; the overhead of managing concurrency can sometimes outweigh the benefits, especially for very short tasks.
Best Practices
- Choose the Right Executor: Select `ThreadPoolExecutor` for I/O-bound tasks and `ProcessPoolExecutor` for CPU-bound tasks.
- Handle Exceptions: Wrap your callable in a `try...except` block to gracefully handle exceptions that might occur during execution; note that `Future.result()` re-raises any exception raised inside the callable.
Section 1.5: Examples of CPU-bound tasks with performance profiling
Key Concept
CPU-bound tasks are limited by the processing power of the central processor. Performance profiling helps identify these tasks and pinpoint bottlenecks.
Topics
- Image/Video Processing: Operations like filtering, encoding, and decoding require significant computational power.
- Scientific Simulations: Numerical calculations in areas like fluid dynamics, molecular dynamics, and climate modeling are inherently CPU-intensive.
- Data Analysis & Machine Learning: Training models, performing complex statistical analyses, and data transformations often demand substantial CPU resources.
- Cryptography: Encryption/decryption algorithms, especially those using complex mathematical operations, are CPU-bound.
- In-session Exercise: Identify potential CPU-bound tasks within a given code snippet (provided on a slide).
- Common Pitfall: Assuming increased clock speed always translates to improved performance.
- Best Practice: Use profiling tools to measure actual execution time before optimizing (a minimal `cProfile` sketch follows).
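A minimal profiling sketch using `cProfile`; `blur_pixels` is a toy stand-in for a CPU-bound image filter:

```python
import cProfile

def blur_pixels(width, height):
    # Toy stand-in for a CPU-bound image operation.
    return [((x * y) ** 0.5) % 255 for x in range(width) for y in range(height)]

# Sort by cumulative time to see where the program actually spends its cycles.
cProfile.run("blur_pixels(500, 500)", sort="cumtime")
```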
Section 1.6: Comparison with sequential execution
Key Concept
Sequential execution processes instructions one after another, which can be straightforward but is often inefficient for complex tasks. Parallel execution offers a potential speedup by dividing work across multiple processing units.
Topics
- Execution Order: Instructions are processed in a linear sequence.
- Resource Utilization: A single processor or core is typically utilized.
- Scalability: Limited by the speed of the single processor.
- Complexity: Simple to understand and implement for basic tasks.
- In-session Exercise: Consider a simple calculation involving many independent operations. How could parallel execution potentially improve the time required? (5 min)
- Common Pitfalls: Assuming all tasks are equally amenable to parallelization.
- Best Practices: Identify independent tasks before attempting parallelization (see the timing sketch below).
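A small timing sketch contrasting a sequential loop with `Pool.map`; the `work` function and input sizes are illustrative, and actual speedups depend on core count and per-task cost:

```python
import time
from multiprocessing import Pool

def work(n):
    # An independent, CPU-bound operation.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    inputs = [2_000_000] * 8

    start = time.perf_counter()
    seq = [work(n) for n in inputs]   # one after another, on a single core
    print(f"sequential: {time.perf_counter() - start:.2f}s")

    start = time.perf_counter()
    with Pool() as pool:
        par = pool.map(work, inputs)  # distributed across available cores
    print(f"parallel:   {time.perf_counter() - start:.2f}s")
```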
Section 1.7: Synchronization and locking considerations
Key Concept
Ensuring that multiple processes or threads access shared resources in a controlled manner is crucial to prevent data corruption and unexpected behavior. Synchronization mechanisms and locking are fundamental tools for achieving this.
Topics
- Race Conditions: Occur when the outcome of a program depends on the unpredictable order of execution of multiple threads/processes.
- Mutual Exclusion: Ensuring that only one thread/process can access a critical section of code at a time.
- Locking Mechanisms: Tools like mutexes, semaphores, and read-write locks provide different levels of access control.
- Deadlock: A situation where two or more processes are blocked indefinitely, waiting for each other to release resources.
- In-session Exercise: Consider a scenario where two threads increment a shared counter. What potential problem arises if no synchronization is used? (5 min)
- Common Pitfalls: Forgetting to release locks after use.
- Best Practices: Use the simplest synchronization mechanism that meets the requirements (a locked-counter sketch follows).
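A minimal sketch of protecting a shared counter with a `Lock` (the process and iteration counts are arbitrary):

```python
from multiprocessing import Process, Value, Lock

def increment(counter, lock, times):
    for _ in range(times):
        with lock:              # mutual exclusion around the critical section
            counter.value += 1  # read-modify-write is not atomic on its own

if __name__ == "__main__":
    counter = Value("i", 0)     # shared integer
    lock = Lock()
    procs = [Process(target=increment, args=(counter, lock, 100_000))
             for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value)        # reliably 200000 with the lock; unpredictable without it
```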
Section 1.8: Profiling CPU-bound vs I/O-bound tasks (setup for Day 2)
Key Concept
Understanding whether a task is primarily limited by CPU processing or data input/output is crucial for optimizing performance. This distinction dictates the most effective optimization strategies.
Topics
- CPU-bound: Task spends most of its time performing calculations.
- I/O-bound: Task spends most of its time waiting for data from disk, network, or other sources.
- Profiling Tools: Tools like `perf`, `cProfile`, and system monitoring utilities help identify bottlenecks.
- Metrics: Key metrics include CPU utilization, disk I/O operations per second (IOPS), and network bandwidth.
- In-session Exercise: A program reads a large file, performs calculations on the data, and writes the results to another file. Which type of bottleneck is most likely to occur? (5 min)
- Common Pitfalls: Assuming the most obvious bottleneck is always the primary one.
- Best Practices: Start with profiling to verify your assumptions before applying optimizations (a quick wall-vs-CPU-time sketch follows).
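One quick way to classify a task before reaching for a full profiler is to compare wall-clock time against CPU time; the `classify` helper below is a made-up convenience, not a standard API:

```python
import time

def classify(task, *args):
    # Roughly equal wall and CPU time suggests CPU-bound;
    # a large gap suggests time spent waiting on I/O.
    wall, cpu = time.perf_counter(), time.process_time()
    task(*args)
    print(f"wall: {time.perf_counter() - wall:.2f}s, "
          f"cpu: {time.process_time() - cpu:.2f}s")

classify(lambda: sum(i * i for i in range(10_000_000)))  # CPU-bound
classify(time.sleep, 1)                                  # stands in for I/O waiting
```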
Exercise: Parallel prime number calculation
Objective: Implement a simple parallel prime number calculation using Python's `multiprocessing` module.
Instructions:
- You are given a script `primes.py` that takes an integer `n` as input and calculates all prime numbers up to `n`.
- Modify the script to use the `multiprocessing` module to divide the range of numbers (2 to `n`) into chunks and have each process calculate the prime numbers within its chunk.
- Combine the results from all processes to produce the final list of prime numbers.
Expected Learning Outcome: You will understand how to use `multiprocessing` to parallelize a simple task and combine the results from multiple processes.
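One possible skeleton for this exercise; the chunking scheme and helper names are suggestions, not the required solution:

```python
from multiprocessing import Pool

def primes_in_range(bounds):
    lo, hi = bounds
    # Naive trial division; the point here is the parallel structure.
    return [k for k in range(max(lo, 2), hi)
            if all(k % d for d in range(2, int(k ** 0.5) + 1))]

if __name__ == "__main__":
    n, workers = 100_000, 4
    step = n // workers + 1
    chunks = [(i, min(i + step, n + 1)) for i in range(2, n + 1, step)]
    with Pool(workers) as pool:
        # Each worker handles one chunk; map returns results in chunk order.
        primes = [p for chunk in pool.map(primes_in_range, chunks) for p in chunk]
    print(len(primes))
```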
Exercise: Signal filtering comparison (sequential vs parallel)
Objective: To understand the performance difference between sequential and parallel signal filtering using Python's `multiprocessing` module.
Instructions:
- You are given a script `signal_filter.py` that simulates a simple signal filtering process. This script takes a list of numbers (representing a signal) and applies a simple moving average filter.
- Modify the script to perform the filtering both sequentially and using multiprocessing. The signal should be divided into chunks for parallel processing.
- Compare the execution time of the sequential and parallel versions for a signal of length 10,000. You can use the `time` module for timing.
Expected Learning Outcome: You should be able to compare the performance of sequential and parallel processing for a simple task and understand the basic principles of parallelization.
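A possible starting sketch for the comparison; the toy signal, chunk size, and filter window are illustrative, and note the chunk-boundary caveat in the comments:

```python
import time
from multiprocessing import Pool

def moving_average(chunk, window=5):
    # Simple moving average over one chunk of the signal.
    return [sum(chunk[i:i + window]) / window
            for i in range(len(chunk) - window + 1)]

if __name__ == "__main__":
    signal = [float(i % 100) for i in range(10_000)]

    start = time.time()
    seq = moving_average(signal)
    print(f"sequential: {time.time() - start:.4f}s")

    # Caveat: naive splitting drops filter output at chunk boundaries; a full
    # solution would overlap adjacent chunks by window - 1 samples.
    chunks = [signal[i:i + 2_500] for i in range(0, len(signal), 2_500)]
    start = time.time()
    with Pool() as pool:
        par = pool.map(moving_average, chunks)
    print(f"parallel:   {time.time() - start:.4f}s")
```

For a signal this short, process startup overhead may well make the parallel version slower; observing that is itself a useful outcome for the comparison.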