05 - Parallel Libraries
Section 5.1: Quick overview of parallel libraries
Key Concept
Parallel libraries provide tools to distribute computational tasks across multiple processor cores or machines, significantly reducing execution time for computationally intensive applications. They enable faster results by leveraging the power of parallel processing.
Topics
- Task Parallelism: Dividing a large problem into smaller, independent tasks that can be executed concurrently.
- Data Parallelism: Applying the same operation to different parts of a dataset simultaneously.
- Shared Memory vs. Distributed Memory: Understanding the different architectures for parallel execution.
- Common Parallel Library Choices: Briefly mention libraries such as OpenMP, MPI, and threading libraries (e.g., pthreads, Python's threading module).
In-Session Exercise (5-10 min)
- Briefly brainstorm scenarios where parallel processing would be beneficial in your field. (No detailed steps; a minimal data-parallel sketch follows at the end of this section.)
Common Pitfalls
- Data Races: Occur when multiple threads access and modify the same data concurrently without proper synchronization.
- Overhead: Parallelization introduces overhead (communication, synchronization) that can sometimes outweigh the benefits for small problems.
Best Practices
- Profile Before Parallelizing: Identify performance bottlenecks before attempting parallelization.
- Minimize Communication: Reduce data transfer between processors to improve efficiency.
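To make the data-parallelism idea above concrete, here is a minimal sketch using only the standard library; the workload (summing squares) and the chunk size are arbitrary choices for illustration.

```python
# Minimal data-parallelism sketch with the standard library.
# The workload (sum of squares) and the chunk size are illustrative.
from multiprocessing import Pool

def sum_of_squares(chunk):
    # The same operation is applied independently to each chunk (data parallelism).
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunk_size = 100_000
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    with Pool() as pool:  # one worker process per CPU core by default
        partial_sums = pool.map(sum_of_squares, chunks)

    print(sum(partial_sums))
```

Splitting the data into chunks amortizes the per-task overhead mentioned under Common Pitfalls; for a workload this small, the overhead may still dominate.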
Section 5.2: Numba with @jit(parallel=True)
Key Concept
Numba can significantly speed up Python code by compiling it to machine code. The @jit(parallel=True) decorator enables parallel execution of the compiled code, leveraging multiple CPU cores for faster processing.
Topics
- Parallelization: Distribute workload across multiple CPU cores.
- @jit(parallel=True): Decorator to enable parallel execution.
- Data Dependencies: Parallelization is most effective with independent operations.
- Performance Gains: Expect speedups, but not always linear with core count.
- In-session Exercise: Identify a simple loop in a provided Python function that could benefit from parallelization (see the sketch below).
- Common Pitfalls: Data races and incorrect parallelization due to dependencies.
- Best Practices: Profile code before parallelizing to identify bottlenecks and ensure gains.
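A minimal sketch of the idea, assuming Numba is installed; the array and the workload are illustrative.

```python
# Sketch: parallel sum of squares with Numba's @jit(parallel=True) and prange.
import numpy as np
from numba import jit, prange

@jit(nopython=True, parallel=True)
def sum_of_squares(arr):
    total = 0.0
    # prange marks the iterations as independent so Numba can split them across cores;
    # the accumulation into `total` is treated as a parallel reduction.
    for i in prange(arr.size):
        total += arr[i] * arr[i]
    return total

x = np.arange(1_000_000, dtype=np.float64)
print(sum_of_squares(x))  # the first call also pays the compilation cost
```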
Section 5.3: Dask for large structures and automatic parallelism
Key Concept
Dask is a flexible parallel computing library that excels at handling datasets too large to fit into memory by breaking them into smaller chunks and executing operations on those chunks in parallel. It simplifies parallelization without requiring extensive code changes.
Topics
- Lazy evaluation: Operations are not executed immediately but are scheduled for execution when their results are needed.
- Data structures: Dask provides specialized data structures like Dask Array, Dask DataFrame, and Dask Bag for parallel computation.
- Parallel execution: Dask automatically distributes computations across multiple cores or machines.
- Task scheduling: Dask manages the scheduling and execution of tasks in a distributed manner.
- In-session exercise: Consider how you might parallelize a simple array operation (e.g., element-wise squaring) using Dask if you had a very large array (see the sketch after this list).
- Common Pitfall: Overhead from task scheduling can sometimes outweigh the benefits of parallelism for very small datasets.
- Best Practice: Identify the most computationally intensive parts of your workflow to focus your parallelization efforts.
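A minimal sketch of lazy evaluation and of the element-wise squaring from the in-session exercise, assuming Dask is installed (pip install "dask[array]"); the array and chunk sizes are arbitrary.

```python
# Sketch: lazy, chunked element-wise squaring with a Dask array.
import dask.array as da

x = da.arange(10_000_000, chunks=1_000_000)  # 10 chunks; nothing is computed yet
y = x ** 2                                   # builds a task graph; still lazy

print(y)                  # prints a lazy dask array description, not the values
print(y.sum().compute())  # .compute() triggers parallel execution over the chunks
```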
Exercise: Dask for large structures and automatic parallelism
Objective: Use Dask to calculate the sum of a large array of numbers in parallel.
Instructions:
- Create a Dask array from a large list of numbers (e.g., 1 million numbers).
- Calculate the sum of the Dask array using dask.array.sum().
- Compare the execution time of the Dask approach with a standard Python loop.
Expected Learning Outcome: You will understand how to use Dask to parallelize computations on large datasets and observe the performance benefits.
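A possible sketch of this exercise, assuming Dask and NumPy are installed; the array size and chunk size are illustrative, and for a computation this cheap the scheduling overhead may outweigh the gains.

```python
# Sketch: sum of 1 million numbers with a plain loop vs a Dask array.
import time
import numpy as np
import dask.array as da

numbers = list(range(1_000_000))

start = time.perf_counter()
loop_total = 0
for n in numbers:               # baseline: standard Python loop
    loop_total += n
loop_time = time.perf_counter() - start

x = da.from_array(np.array(numbers), chunks=100_000)
start = time.perf_counter()
dask_total = x.sum().compute()  # dask.array sum, executed chunk by chunk in parallel
dask_time = time.perf_counter() - start

print(loop_total, int(dask_total))
print(f"loop: {loop_time:.3f} s, dask: {dask_time:.3f} s")
```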
Section 5.4: Numba and NumPy Arrays
Key Concept
Numba provides optimized support for NumPy arrays, enabling efficient numerical computations.
Topics
- NumPy Integration: Seamlessly works with NumPy arrays.
- Array Operations: Optimized for common array operations (e.g., addition, multiplication).
- Data Types: Numba infers data types, but explicit type annotations can improve performance (see the sketch below).
- Vectorization: Leverage vectorized operations for significant speedups.
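A minimal sketch of Numba operating on a NumPy array, with an optional explicit type signature; the function and the types are illustrative.

```python
# Sketch: Numba-compiled loop over a NumPy array with an explicit type signature.
import numpy as np
from numba import njit

@njit("float64[:](float64[:])")  # explicit types; Numba would otherwise infer them on first call
def scale_and_shift(a):
    out = np.empty_like(a)
    for i in range(a.size):
        out[i] = 2.0 * a[i] + 1.0
    return out

x = np.linspace(0.0, 1.0, 1_000_000)
print(scale_and_shift(x)[:5])
```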
Exercise: Numba with @jit(parallel=True)
Objective: To understand how to parallelize a simple numerical computation using Numba.
Instructions:
- You are given a Python script parallel_example.py that calculates the sum of squares of a list of numbers. The script currently uses a standard Python loop.
- Modify the script to use Numba with the @jit(parallel=True) decorator to parallelize the calculation.
- Run the script and compare the execution time with and without parallelization.
Expected Learning Outcome: You should be able to apply Numba's @jit(parallel=True) decorator to parallelize a simple function and observe the performance improvement.
- In-session Exercise: Convert a Python list to a NumPy array and perform a simple operation using Numba.
- Common Pitfalls: Incorrect data type inference leading to unexpected behavior.
- Best Practices: Use explicit type annotations for performance and clarity.
Section 5.5: JAX for scientific computing and automatic differentiation
Key Concept
JAX is a high-performance numerical computation library developed by Google, designed for automatic differentiation and XLA compilation. It excels at accelerating scientific workloads.
Topics
- Composable Function Transformations: JAX allows you to easily transform functions (e.g., differentiation, vectorization, parallelization) by composing operations.
- Automatic Differentiation (AD): JAX automatically computes gradients of functions, crucial for optimization and machine learning.
- XLA Compilation: JAX leverages XLA (Accelerated Linear Algebra) for optimized execution on CPUs, GPUs, and TPUs.
- Vectorization & Parallelization: JAX provides tools for efficiently applying operations to arrays and distributing computations across multiple devices.
Exercise: JAX for scientific computing and automatic differentiation
Objective: To get familiar with the basics of JAX and its automatic differentiation capabilities.
Instructions:
- Install JAX and NumPy using pip: pip install jax jaxlib numpy.
- Write a simple Python function that calculates the square of a number.
- Use jax.grad to compute the gradient of the function with respect to its input. Print the gradient.
Expected Learning Outcome: You will understand how to use jax.grad to automatically compute the derivative of a simple function.
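A minimal sketch of this exercise, assuming JAX is installed:

```python
# Sketch: automatic differentiation of f(x) = x**2 with jax.grad.
import jax

def square(x):
    return x ** 2

dsquare = jax.grad(square)  # returns a new function that computes df/dx

x = 3.0
print(square(x))    # 9.0
print(dsquare(x))   # 6.0, since d(x**2)/dx = 2x
```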
Pitfalls
- Immutability: JAX arrays are immutable. Avoid modifying arrays in place; use functional updates (e.g., x.at[i].set(value)) instead.
- Explicit Control: JAX does not apply transformations automatically; you must request jit, grad, vmap, etc. explicitly.
Best Practices
- Functional Programming: Embrace a functional programming style for cleaner and more predictable code.
- jax.jit for Performance: Use jax.jit to compile functions for significant speedups, especially for repeated computations (see the sketch below).
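A minimal sketch of the jax.jit best practice; the function and array size are illustrative.

```python
# Sketch: compiling a function once with jax.jit and reusing it.
import jax
import jax.numpy as jnp

def norm_of_transform(x):
    return jnp.sqrt(jnp.sum((2.0 * x + 1.0) ** 2))

fast_norm = jax.jit(norm_of_transform)  # traced and XLA-compiled on first call

x = jnp.arange(1_000_000, dtype=jnp.float32)
print(fast_norm(x))  # first call: compile + run
print(fast_norm(x))  # later calls reuse the compiled code and are much faster
```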
Section 5.6: PyTorch and GPU usage (torch.cuda)
Key Concept
Leveraging GPUs with PyTorch significantly accelerates training and inference by utilizing parallel processing capabilities. The torch.cuda module provides the tools to manage and utilize GPU resources.
Topics
- GPU Availability Check: Determine if a GPU is available and accessible.
- Data and Model Placement: Move tensors (data) and models to the GPU using .to('cuda').
- torch.cuda.device: Context manager for selecting the active GPU device.
- Data Transfer: Understand the process of moving data between CPU and GPU memory.
Exercise: PyTorch and GPU usage (torch.cuda)
Objective: To understand how to utilize a GPU with PyTorch for faster computations.
Instructions:
- You are given a simple PyTorch script that performs matrix multiplication.
- Modify the script to move the tensors and computations to the GPU using a device object (e.g., torch.device('cuda')) and tensor.to(device).
- Run the modified script and observe the execution time. (You can use timeit or a similar tool.)
Expected Learning Outcome: You will be able to move tensors and computations to the GPU using torch.cuda and observe the performance difference.
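A possible sketch of this exercise, assuming PyTorch is installed; the matrix sizes are illustrative, and the script falls back to the CPU if no CUDA GPU is available.

```python
# Sketch: matrix multiplication on CPU vs GPU with PyTorch (sizes are illustrative).
import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.rand(4096, 4096)
b = torch.rand(4096, 4096)

start = time.perf_counter()
c_cpu = a @ b
print(f"CPU: {time.perf_counter() - start:.3f} s")

a_dev = a.to(device)          # move the data to the GPU (no-op on CPU)
b_dev = b.to(device)
start = time.perf_counter()
c_dev = a_dev @ b_dev
if device.type == "cuda":
    torch.cuda.synchronize()  # GPU kernels run asynchronously; wait before timing
print(f"{device.type.upper()}: {time.perf_counter() - start:.3f} s")
```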
Pitfalls
- Data Type Mismatch: Ensure data types are compatible with GPU computations (e.g., avoid using float64 if float32 is sufficient).
- Memory Overflow: Be mindful of GPU memory limitations and avoid excessively large models or batches.
Best Practices
- Batch Size Optimization: Experiment with batch sizes to maximize GPU utilization without exceeding memory limits.
- Data Preprocessing on GPU: Consider performing data preprocessing steps on the GPU to reduce data transfer overhead.
Section 5.7: Performance profiling tools: time, timeit, line_profiler, py-spy
Key Concept
These tools help identify performance bottlenecks in Python code by measuring execution time and pinpointing slow sections. They offer different levels of granularity, from overall execution time to line-by-line analysis.
Topics
- time: Measures overall execution time of a code snippet. Simple and quick for basic timing.
- timeit: Measures execution time of small code snippets multiple times to provide more reliable results. Ideal for comparing different implementations.
- line_profiler: Profiles execution time of individual lines within a function. Provides detailed insights into performance hotspots.
- py-spy: Samples the Python process's call stack without modifying the code. Useful for profiling long-running processes and production environments.
- In-session exercise: Compare the execution time of a simple loop using time and timeit (see the sketch below).
- Common Pitfalls: Ensure the code being profiled is representative of the real-world workload. External factors (e.g., I/O) can skew results.
- Best Practices: Profile representative workloads and focus on the most time-consuming parts of the code.
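A minimal sketch of the in-session exercise, timing the same loop with time and timeit. line_profiler and py-spy are usually run from the command line instead (e.g., kernprof -l -v script.py and py-spy record -o profile.svg -- python script.py).

```python
# Sketch: timing a simple loop with time.perf_counter vs the timeit module.
import time
import timeit

def simple_loop():
    total = 0
    for i in range(100_000):
        total += i * i
    return total

# time: a single, coarse measurement around one call
start = time.perf_counter()
simple_loop()
print(f"time.perf_counter: {time.perf_counter() - start:.4f} s (single run)")

# timeit: runs the callable many times and reports the total, averaging out noise
print(f"timeit: {timeit.timeit(simple_loop, number=100):.4f} s for 100 runs")
```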
Section 5.8: Numerical precision considerations in parallel computing
Key Concept
Parallel computing can exacerbate numerical precision issues due to the accumulation of rounding errors and the potential for different processors to operate with slightly different precision.
Topics
- Rounding Error Accumulation: Individual calculations introduce small errors that can compound across many operations.
- Floating-Point Representation: Understand the limitations of floating-point numbers (e.g., IEEE 754) and their inherent precision.
- Parallelism and Precision: Different processors may use slightly different precision levels, leading to variations in results.
- Order of Operations: The order in which calculations are performed can influence the magnitude of rounding errors.
- In-Session Exercise: Consider a simple calculation involving many additions. How might the order of operations affect the final result? (5 min; see the sketch at the end of this section)
- Common Pitfalls: Assuming that parallelization automatically improves accuracy.
- Best Practices: Use higher-precision data types (e.g., float64 instead of float32) when accuracy is critical.
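A minimal sketch for the in-session exercise: adding the same numbers in a different order gives a visibly different float32 result.

```python
# Sketch: the order of floating-point additions changes the result (float32).
import numpy as np

big = np.float32(1e8)
smalls = np.full(1000, 1.0, dtype=np.float32)

# Big value first: each 1.0 is smaller than the spacing between float32 values near 1e8,
# so every addition is rounded away.
total_big_first = big
for s in smalls:
    total_big_first = np.float32(total_big_first + s)

# Small values first: they accumulate to 1000.0, which survives the final addition.
total_small_first = np.float32(smalls.sum() + big)

print(total_big_first)    # 100000000.0
print(total_small_first)  # 100001000.0
```

Parallel reductions can change the summation order between runs or between processor counts, which is why parallel code may produce slightly different results.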
Section 5.9: Choosing the right tool for different problem types
Key Concept
Selecting the appropriate software or computational method is crucial for efficient problem-solving. The choice depends on the problem's nature, desired accuracy, and available resources.
Topics
- Differential Equations: Consider SciPy solvers such as scipy.integrate.odeint or the newer scipy.integrate.solve_ivp (see the sketch after this list).
- Linear Algebra: NumPy provides a comprehensive suite of linear algebra functions (e.g., matrix operations, eigenvalue decomposition).
- Numerical Integration: Explore SciPy's integrate module for various integration techniques (e.g., trapezoidal rule, Simpson's rule).
- Optimization: Utilize SciPy's optimize module for finding minima/maxima of functions, constrained optimization, etc.
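As a small illustration of matching tools to problems, the sketch below solves a simple ODE with scipy.integrate.solve_ivp and minimizes a simple function with scipy.optimize; the decay rate and the test function are arbitrary.

```python
# Sketch: SciPy tools for two of the problem types above (parameters are arbitrary).
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import minimize_scalar

# Differential equation: exponential decay dy/dt = -0.5 * y, y(0) = 1
sol = solve_ivp(lambda t, y: -0.5 * y, t_span=(0.0, 10.0), y0=[1.0],
                t_eval=np.linspace(0.0, 10.0, 5))
print(sol.t, sol.y[0])

# Optimization: minimum of a simple quadratic
res = minimize_scalar(lambda x: (x - 2.0) ** 2)
print(res.x)  # close to 2.0
```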
Exercise
- Briefly describe a problem you've encountered where you could have used a different tool. What tool would have been more appropriate and why? (5 min)
Pitfalls
- Over-engineering: Don't choose a complex tool if a simpler one suffices.
- Ignoring data characteristics: The choice of tool should align with the data's properties (e.g., size, distribution).
Best Practices
- Start simple: Begin with basic tools and gradually increase complexity as needed.
- Understand the tool's limitations: Be aware of the assumptions and potential inaccuracies of each tool.
Section 5.10: Comparative benchmark: CPU vs GPU vs distributed
Key Concept
Different computing architectures (CPU, GPU, and distributed systems) offer distinct advantages and disadvantages for scientific workloads. Choosing the right architecture depends on the specific workload characteristics.
Topics
- CPU: General-purpose, excels at sequential tasks and complex control flow.
- GPU: Massively parallel, optimized for data-parallel operations (e.g., matrix calculations).
- Distributed: Leverages multiple machines for scalability and fault tolerance.
- Workload Suitability: Consider data volume, computational intensity, and parallelism potential.
- In-session Exercise: Briefly brainstorm (2 min) which workload types would benefit most from each architecture (CPU, GPU, Distributed).
- Common Pitfalls: Assuming GPU is always faster; neglecting data transfer overhead in distributed systems.
- Best Practices: Profile your code to identify bottlenecks before choosing an architecture.
Exercise: Complete signal processing pipeline using multiple approaches
Objective: Implement a simple signal processing pipeline to demonstrate parallel processing techniques.
Instructions:
- You are given a Python script signal_processing.py that reads a sample audio file (e.g., a short WAV file). The script performs the following steps:
1. Reads the audio data.
2. Applies a simple filtering operation (e.g., a moving average).
3. Calculates the signal's energy.
- Modify the script to implement the same pipeline using two different parallel processing approaches:
1. Multiprocessing: Use the multiprocessing library to parallelize the filtering and energy calculation steps.
2. Threading: Use the threading library to parallelize the filtering and energy calculation steps.
- Compare the execution time of the original script and the parallelized versions.
Expected Learning Outcome: You should understand the basic concepts of multiprocessing and threading in Python and how they can be used to improve the performance of a simple computational pipeline. You should also be able to measure the execution time of different approaches.
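One possible sketch of the two parallel variants, shown here via concurrent.futures, which builds on the multiprocessing and threading modules; the stand-in signal, the chunking, and the window size are illustrative rather than taken from signal_processing.py. Threading often shows little benefit for CPU-bound code because of the GIL (NumPy releases it only for some operations), which is part of what the timing comparison is meant to reveal.

```python
# Sketch: per-chunk moving-average filter + energy, run with processes vs threads.
# `samples`, the chunking, and the window size are illustrative stand-ins.
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import numpy as np

def process_chunk(chunk, window=5):
    kernel = np.ones(window) / window
    filtered = np.convolve(chunk, kernel, mode="same")  # simple moving average
    return float(np.sum(filtered ** 2))                 # chunk energy

def total_energy(executor_cls, chunks):
    with executor_cls() as ex:
        return sum(ex.map(process_chunk, chunks))

if __name__ == "__main__":
    samples = np.random.randn(1_000_000)   # stand-in for the audio data
    chunks = np.array_split(samples, 8)

    print(total_energy(ProcessPoolExecutor, chunks))  # multiprocessing-based
    print(total_energy(ThreadPoolExecutor, chunks))   # threading-based
```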
Exercise: Performance comparison: multiprocessing vs Numba vs GPU
Objective: To compare the performance of a simple numerical computation using multiprocessing, Numba, and (if available) a GPU.
Instructions:
- You are given a Python script computation.py that calculates the sum of squares of numbers from 1 to 1 million. This script is currently implemented using a standard Python loop.
- Modify computation.py to implement the same calculation using:
- Multiprocessing (using multiprocessing.Pool)
- Numba (using the @njit decorator)
- (If you have access to a GPU, and the necessary libraries installed, add a GPU implementation using a library like CuPy or TensorFlow/PyTorch).
- Run the computation.py script with each of the three implementations and measure the execution time using the timeit module.
- Compare the execution times and discuss the relative performance of each approach.
Expected Learning Outcome: You will understand the basic differences in how multiprocessing, Numba, and GPUs can be used to accelerate numerical computations, and how to measure performance differences.
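One way to structure computation.py, assuming Numba is installed and treating the GPU variant (sketched here with CuPy) as optional; the chunking and function names are illustrative.

```python
# Sketch: sum of squares of 1..1_000_000 with multiprocessing, Numba, and (optionally) CuPy.
import timeit
from multiprocessing import Pool
from numba import njit

N = 1_000_000

def sum_squares_chunk(bounds):
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

def multiprocessing_version():
    step = 250_000
    bounds = [(i, min(i + step, N + 1)) for i in range(1, N + 1, step)]
    with Pool() as pool:
        return sum(pool.map(sum_squares_chunk, bounds))

@njit
def numba_version():
    total = 0
    for i in range(1, N + 1):
        total += i * i
    return total

def gpu_version():
    import cupy as cp                       # optional: requires CuPy and a CUDA GPU
    x = cp.arange(1, N + 1, dtype=cp.int64)
    return int(cp.sum(x * x))

if __name__ == "__main__":
    print(timeit.timeit(multiprocessing_version, number=3))
    print(timeit.timeit(numba_version, number=3))  # first call includes compilation time
```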
Exercise: Real-world optimization problem
Objective: Calculate the sum of squares of numbers in a list using both a standard Python loop and the multiprocessing library.
Instructions:
- You are given a Python list of numbers: numbers = list(range(1, 101)).
- Implement a function sum_of_squares_sequential(numbers) that calculates the sum of squares using a standard Python loop.
- Implement a function sum_of_squares_parallel(numbers) that calculates the sum of squares using the multiprocessing library to parallelize the calculation.
- Compare the execution time of both functions using the timeit module.
Expected Learning Outcome: You will understand how to use the multiprocessing library to parallelize a simple task and observe the potential performance benefits.