03 - MPI with `mpi4py`
Section 3.1: Introduction to MPI and `mpi4py`
Key Concept
MPI (Message Passing Interface) is a standard for parallel computing that enables processes to communicate and synchronize with each other. `mpi4py` is a Python binding to the MPI standard, allowing Python code to leverage parallel processing capabilities.
Topics
- Distributed Memory: Understand that each process has its own memory space.
- Communication: Learn about sending and receiving data between processes.
- Collective Operations: Familiarize yourself with common operations like `broadcast`, `scatter`, and `gather`.
- Process Management: Understand how processes are launched and managed within an MPI application.
- In-session Exercise: Consider how you would divide a large dataset among multiple processes. What communication strategy would you use?
- Common Pitfall: Incorrectly handling data distribution can lead to data races or incomplete results.
- Best Practice: Use `MPI.COMM_WORLD.Get_rank()` to identify the process rank so each process can tailor its computation (see the sketch below).
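As a minimal, illustrative sketch (the filename `hello_mpi.py` is just an example), the following shows the rank/size pattern referenced above; launch it with `mpiexec -n 4 python hello_mpi.py`:

```python
# hello_mpi.py -- minimal mpi4py example (illustrative filename)
from mpi4py import MPI

comm = MPI.COMM_WORLD      # communicator containing every launched process
rank = comm.Get_rank()     # this process's ID (0 .. size-1)
size = comm.Get_size()     # total number of processes

print(f"Hello from rank {rank} of {size}")
```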
Section 3.2: Basic MPI Workflow
Key Concept
This section outlines the fundamental steps involved in writing a simple parallel program using MPI.
Topics
- Initialization: Explain the need to initialize the MPI environment.
- Communication Setup: Describe how to define communication patterns between processes.
- Task Decomposition: Discuss dividing the overall problem into smaller tasks for parallel execution.
- Finalization: Explain the importance of properly terminating the MPI environment.
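A minimal sketch of this workflow with `mpi4py` is shown below; note that `mpi4py` initializes MPI when the module is imported and finalizes it automatically at interpreter exit, so no explicit `Init`/`Finalize` calls appear. The problem size and splitting scheme are arbitrary choices for illustration.

```python
from mpi4py import MPI      # importing mpi4py initializes the MPI environment

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Task decomposition: each process sums a contiguous slice of the range
total = 1_000_000
chunk = total // size
start = rank * chunk
stop = total if rank == size - 1 else start + chunk
local_sum = sum(range(start, stop))

# Communication: combine the partial sums on rank 0
global_sum = comm.reduce(local_sum, op=MPI.SUM, root=0)
if rank == 0:
    print("total:", global_sum)
# Finalization happens automatically when the interpreter exits
```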
Section 3.3: Data Structures in Parallel
Key Concept
Choosing appropriate data structures is crucial for efficient parallel programming.
Topics
- Global vs. Local Data: Differentiate between data accessible to all processes and data specific to a single process.
- Data Partitioning: Explore techniques for dividing data among processes (e.g., domain decomposition).
- Synchronization Primitives: Introduce concepts like barriers and locks for coordinating parallel operations.
- Data Alignment: Briefly mention the importance of data alignment for performance.
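A possible illustration of data partitioning plus a barrier, assuming NumPy is available (the array contents and chunking scheme are arbitrary):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Global data lives only on rank 0; each rank works on its own local chunk
if rank == 0:
    global_data = np.arange(20.0)
    chunks = np.array_split(global_data, size)   # simple 1-D domain decomposition
else:
    chunks = None

local_data = comm.scatter(chunks, root=0)        # every rank receives one chunk
print(f"rank {rank} owns {local_data}")

comm.Barrier()                                   # all ranks synchronize here before the next phase
```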
Section 3.4: Common MPI Patterns
Key Concept
Several common patterns arise in parallel programming, offering reusable solutions for frequently encountered problems.
Topics
- Data Parallelism: Explain how to apply the same operation to different parts of the data.
- Task Parallelism: Describe how to execute different tasks concurrently.
- Pipeline Parallelism: Introduce the concept of breaking down a complex task into stages executed in a pipeline.
- Reduce Operation: Highlight the use of `comm.reduce` (or the buffer-based `comm.Reduce` in mpi4py) for aggregating data across processes.
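The sketch below combines the data-parallel and reduce patterns using the buffer-based `comm.Reduce` on NumPy arrays; the per-rank data is a made-up stand-in for real results:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Data parallelism: every rank applies the same operation to its own block
local_block = np.full(4, rank, dtype="d")        # stand-in for per-rank data
local_partial = local_block.sum()

# Reduce pattern: aggregate the partial results on rank 0 (buffer-based API)
sendbuf = np.array([local_partial], dtype="d")
recvbuf = np.empty(1, dtype="d") if rank == 0 else None
comm.Reduce(sendbuf, recvbuf, op=MPI.SUM, root=0)

if rank == 0:
    print("aggregated result:", recvbuf[0])
```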
Section 3.5: `mpiexec` and distributed execution
Key Concept
`mpiexec` is the command-line tool used to launch parallel processes across multiple nodes in a cluster. It allows you to distribute your workload and leverage the combined computational power of your system.
Topics
- Process Launch: `mpiexec` starts multiple instances of a program.
- Resource Specification: You specify the number of processes and the host(s) to run them on.
- Environment Variables: `mpiexec` sets implementation-specific environment variables used to bootstrap inter-process communication (e.g., Open MPI sets `OMPI_COMM_WORLD_SIZE` and `OMPI_COMM_WORLD_RANK`; MPICH/Hydra sets `PMI_SIZE` and `PMI_RANK`).
- Basic Command Structure: `mpiexec -n <num_processes> <program> [args...]`
- In-session Exercise: (5 min) What environment variables does your MPI implementation's `mpiexec` set, and what is their purpose?
- Common Pitfalls: Forgetting to specify the number of processes (`-n`) leaves you with an implementation-specific default process count, which is rarely what you intend.
- Best Practices: Always verify the number of processes and host configuration before launching a job.
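As a rough illustration of the launcher environment, the sketch below prints any of the Open MPI or MPICH launcher variables it can see; the filename and the specific variable names checked are assumptions about your MPI implementation. Launch with `mpiexec -n 4 python launch_info.py`:

```python
# launch_info.py -- inspect launcher-provided environment variables (illustrative)
import os
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Variable names differ by implementation; these are the Open MPI and
# MPICH/Hydra names, used here purely as examples
candidates = ["OMPI_COMM_WORLD_RANK", "OMPI_COMM_WORLD_SIZE", "PMI_RANK", "PMI_SIZE"]
visible = {name: os.environ[name] for name in candidates if name in os.environ}

print(f"rank {rank}: launcher variables seen: {visible}")
```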
Section 3.6: MPI Communication Fundamentals
Key Concept
MPI communication enables processes to exchange data and coordinate their actions during parallel execution. It's the core mechanism for distributed computation.
Topics
- Message Passing: Processes send and receive data via messages.
- Communication Patterns: Common patterns include point-to-point (sending to a specific process) and collective operations (e.g., broadcast, gather).
- Data Distribution: Data can be distributed across processes using techniques like domain decomposition.
- MPI Functions: MPI provides a set of functions for communication (e.g., `MPI_Send` and `MPI_Recv`, exposed in mpi4py as `comm.send`/`comm.recv` and the buffer-based `comm.Send`/`comm.Recv`).
- In-session Exercise: (5 min) Describe the difference between point-to-point and collective communication.
- Common Pitfalls: Incorrectly specifying message sizes or buffer sizes can lead to communication errors.
- Best Practices: Use appropriate data types for communication to minimize data transfer overhead.
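A minimal point-to-point sketch in `mpi4py` (run with at least two processes, e.g. `mpiexec -n 2 python send_recv.py`; the filename, tag, and payload are arbitrary):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Point-to-point: rank 0 sends a Python object, rank 1 receives it
if rank == 0:
    payload = {"step": 1, "values": [3.14, 2.71]}
    comm.send(payload, dest=1, tag=11)
elif rank == 1:
    payload = comm.recv(source=0, tag=11)
    print(f"rank 1 received {payload}")
```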
Section 3.7: Common MPI Operations
Key Concept
MPI provides a set of fundamental operations for exchanging data and coordinating processes, forming the building blocks of parallel algorithms.
Topics
- Send/Receive: Fundamental for data exchange between processes.
- Broadcast: Copies data from one process to all other processes.
- Gather/Scatter: Collects data from multiple processes into one, or distributes data from one process to multiple processes.
- Barrier: Synchronizes processes to ensure they all reach a certain point in the computation before proceeding.
- In-session Exercise: (5 min) Explain when you would use a `Barrier` operation.
- Common Pitfalls: Incorrectly using `Gather` or `Scatter` can lead to data inconsistencies.
- Best Practices: Understand the data layout and process ranks when using collective operations.
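A small sketch tying these operations together (the configuration values are invented for illustration):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Broadcast: rank 0 shares a configuration dict with every process
config = {"n_samples": 1000} if rank == 0 else None
config = comm.bcast(config, root=0)

# Gather: each rank produces a local result and rank 0 collects them all
local_result = rank * config["n_samples"]
all_results = comm.gather(local_result, root=0)

# Barrier: wait until every rank has finished this phase before moving on
comm.Barrier()
if rank == 0:
    print("gathered:", all_results)
```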
Section 3.8: Process communication: `send`, `recv`, `broadcast`, `scatter`, `gather`
Key Concept
This section covers fundamental mechanisms for inter-process communication (IPC), enabling processes to exchange data and coordinate their actions. These primitives provide building blocks for distributed computation and parallel processing.
Topics
- `send` and `recv`: Basic point-to-point communication; one process sends data to another, and the receiver retrieves it.
- `broadcast`: Sends the same data from one process to all participating processes.
- `scatter`: Distributes data from one process to multiple recipients, with each recipient receiving a portion.
- `gather`: Collects data from multiple processes into one process.
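The sketch below exercises `scatter` and `gather` in a master/worker style; the work items and the squaring step are placeholders for real per-process computation:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# The root prepares one work item per rank, then scatters them
work = [list(range(r * 3, r * 3 + 3)) for r in range(size)] if rank == 0 else None
my_items = comm.scatter(work, root=0)

# Every rank processes its own portion independently
processed = [x * x for x in my_items]

# The root gathers the partial results back into a single list
combined = comm.gather(processed, root=0)
if rank == 0:
    print("combined results:", combined)
```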
Exercise
- (5 min) Consider a scenario where you need to calculate the average of a large dataset distributed across multiple processes. Which communication primitives would be most suitable for gathering the partial results?
- (5 min) Imagine you have a master process that needs to distribute a large array of data to several worker processes. Which communication primitive would be most appropriate?
Pitfalls
- Deadlock: Incorrectly ordering `send` and `recv` operations can lead to processes blocking indefinitely.
- Data Races: Concurrent access to shared data without proper synchronization can result in unpredictable behavior.
- Buffer Overflow: Sending more data than the receiver's buffer can hold can cause crashes or data corruption.
Best Practices
- Synchronization: Use appropriate synchronization mechanisms (e.g., barriers in MPI, or mutexes and semaphores for shared-memory IPC) to protect shared data.
- Data Validation: Validate data before sending and after receiving to prevent errors.
- Error Handling: Implement robust error handling to manage communication failures.
Section 3.9: Modern alternatives: `joblib` and `Ray` for distributed computing
Key Concept
`joblib` and `Ray` provide simpler and more scalable solutions for parallelizing Python code than traditional approaches like `multiprocessing`. `Ray` offers a more comprehensive framework for distributed applications.
Topics
- `joblib`: Simple parallelization for CPU-bound tasks, particularly useful for large datasets.
- `Ray`: A distributed execution framework for building scalable applications, including parallel computing, machine learning, and reinforcement learning.
- Scalability: Both tools offer different levels of scalability, with `Ray` designed for larger, more complex deployments.
- Ease of Use: Both aim to simplify parallelization, reducing boilerplate code compared to `multiprocessing`.
- In-session Exercise: (5-10 min) Consider how you might parallelize a data preprocessing step (e.g., feature scaling) using either `joblib` or `Ray`, as sketched below. What are the key considerations for choosing between them?
- Common Pitfalls: Over-parallelization can lead to overhead exceeding the benefits of parallelism.
- Best Practices: Profile your code to identify bottlenecks before parallelizing.
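As a sketch of the exercise above, the block below parallelizes a toy scaling step first with `joblib`, then with `Ray`; it assumes both packages are installed, and the function names and mean/std values are invented for illustration:

```python
from joblib import Parallel, delayed
import ray

def scale(x, mean=4.5, std=3.0):
    """Toy stand-in for a per-item preprocessing step (e.g., feature scaling)."""
    return (x - mean) / std

data = list(range(10))

# joblib: parallelize a loop over local CPU cores with minimal boilerplate
scaled_joblib = Parallel(n_jobs=4)(delayed(scale)(x) for x in data)
print("joblib:", scaled_joblib)

# Ray: the same computation expressed as remote tasks; on a cluster you would
# point ray.init() at the head node instead of starting a local runtime
ray.init()

@ray.remote
def scale_remote(x):
    return scale(x)

futures = [scale_remote.remote(x) for x in data]
print("ray:", ray.get(futures))   # blocks until all remote tasks finish
```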
Section 3.10: Comparison: MPI vs modern distributed frameworks
Key Concept
Modern distributed frameworks offer higher-level abstractions and, for many workloads, competitive performance compared with traditional MPI, but they require a shift in thinking about application design.
Topics
- Abstraction Level: MPI operates at the message level; modern frameworks offer object-oriented or data-centric abstractions.
- Scalability: Modern frameworks often scale more efficiently to large clusters and heterogeneous environments.
- Programming Model: MPI requires explicit communication management; modern frameworks often handle this implicitly.
- Ease of Use: Modern frameworks generally have simpler APIs and better integration with common data formats.
- In-session Exercise: Briefly brainstorm how you would refactor a simple MPI program to use a data-centric approach.
- Common Pitfalls: Over-optimizing communication details in a modern framework can negate its benefits.
- Best Practices: Prioritize data locality and minimize data movement across nodes.
Exercise: Distributed FFT computation
Objective: Implement a basic distributed FFT computation using MPI to divide a large array into smaller chunks and compute the FFT of each chunk.
Instructions:
- You are given a Python script `fft_mpi.py` that uses MPI to distribute the computation of the Fast Fourier Transform (FFT) of a large array. The script divides the input array into chunks and assigns each chunk to a different MPI process.
- Modify the `fft_mpi.py` script to print the index of the chunk being processed and the size of the chunk for each process.
- Run the modified script with 4 MPI processes.
- Observe the output and verify that the array is indeed divided into chunks and each process is working on a portion of the data.
Expected Learning Outcome: You will understand how to divide a computational task into smaller parts and distribute it across multiple processes using MPI, and how to track the progress of each process.
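Since the course script itself is not reproduced here, the following is only a hypothetical sketch of what `fft_mpi.py` could look like with the requested print added; the signal length and chunking scheme are assumptions:

```python
# Hypothetical sketch of fft_mpi.py; the actual course script may differ.
# Run with:  mpiexec -n 4 python fft_mpi.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Rank 0 builds the full array and splits it into one chunk per process
if rank == 0:
    signal = np.random.rand(4096)
    chunks = np.array_split(signal, size)
else:
    chunks = None

chunk = comm.scatter(chunks, root=0)

# Step 2 of the exercise: report the chunk index (the rank here) and its size
print(f"process {rank}: chunk index {rank}, chunk size {len(chunk)}")

local_fft = np.fft.fft(chunk)               # FFT of this process's chunk only
gathered = comm.gather(local_fft, root=0)   # collect the per-chunk spectra on rank 0
if rank == 0:
    print("collected", len(gathered), "chunk spectra")
```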
Exercise: Fallback: `joblib` parallel processing if MPI setup fails
Objective: To understand how to use `joblib` as a fallback for parallel processing when MPI is unavailable.
Instructions:
- You are given a Python script `calculate_sum.py` that calculates the sum of a large list of numbers. The script uses MPI for parallelization.
- First, attempt to run `calculate_sum.py` using MPI.
- Then, modify the script to include a fallback mechanism using `joblib` parallel processing. This fallback should be activated if MPI fails to initialize.
- Run the modified script and observe the output.
Expected Learning Outcome: You will understand how to implement a fallback strategy for parallel processing, allowing your code to run on systems without MPI.
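One possible shape for the fallback logic is sketched below; it is not the course's `calculate_sum.py`, and the chunking scheme, worker count, and fallback trigger are all assumptions:

```python
# Hypothetical fallback sketch; the actual calculate_sum.py may differ.
numbers = list(range(1_000_000))

def partial_sum(chunk):
    return sum(chunk)

try:
    # Preferred path: MPI parallelism
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    if size < 2:
        raise RuntimeError("single MPI process detected, using fallback")
    chunk = numbers[rank::size]                       # round-robin split across ranks
    total = comm.reduce(partial_sum(chunk), op=MPI.SUM, root=0)
    if rank == 0:
        print("MPI total:", total)
except Exception as err:
    # Fallback path: joblib on the local machine
    from joblib import Parallel, delayed
    n_jobs = 4
    chunks = [numbers[i::n_jobs] for i in range(n_jobs)]
    partials = Parallel(n_jobs=n_jobs)(delayed(partial_sum)(c) for c in chunks)
    print(f"joblib fallback ({err}); total:", sum(partials))
```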
Exercise: Monte Carlo simulation across processes
Objective: Estimate the probability of a random point falling within a circle using a Monte Carlo simulation distributed with MPI.
Instructions:
- You are given a Python script `monte_carlo.py` that performs a Monte Carlo simulation. This script currently runs the simulation on a single process.
- Modify the script to distribute the simulation workload across multiple processes using MPI.
- Run the modified script with `mpiexec -n <number_of_processes> python monte_carlo.py`.
- Compare the results (number of points inside the circle) with the single-process simulation.
Expected Learning Outcome: You will understand how to distribute a computationally intensive task across multiple processes using MPI and how to compare the results to a single-process execution.
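A hypothetical sketch of the distributed version (the sample count and seeding scheme are arbitrary, and the real `monte_carlo.py` may be structured differently):

```python
# Hypothetical sketch of an MPI-parallel monte_carlo.py.
# Run with:  mpiexec -n 4 python monte_carlo.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

total_points = 1_000_000
local_points = total_points // size             # each process draws its own share

rng = np.random.default_rng(seed=rank)          # independent random stream per rank
x = rng.random(local_points)
y = rng.random(local_points)
local_inside = int(np.count_nonzero(x * x + y * y <= 1.0))

# Combine the per-process counts on rank 0
inside = comm.reduce(local_inside, op=MPI.SUM, root=0)
if rank == 0:
    sampled = size * local_points
    print(f"points inside circle: {inside} of {sampled}")
    print(f"probability estimate: {inside / sampled:.5f}")
```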