05 - Parallel Libraries
Section 5.1: Quick overview of parallel libraries
Key Concept
Parallel libraries provide tools to distribute computational tasks across multiple processor cores or machines, significantly reducing execution time for computationally intensive applications. They enable faster results by leveraging the power of parallel processing.
Topics
- Task Parallelism: Dividing a large problem into smaller, independent tasks that can be executed concurrently.
- Data Parallelism: Applying the same operation to different parts of a dataset simultaneously.
- Shared Memory vs. Distributed Memory: Understanding the different architectures for parallel execution.
- Common Parallel Library Choices: Briefly mention libraries such as OpenMP, MPI, and threading libraries (e.g., pthreads, Python's threading module).
In-Session Exercise (5-10 min)
- Briefly brainstorm scenarios where parallel processing would be beneficial in your field. (No detailed steps; a minimal data-parallel sketch follows at the end of this section.)
Common Pitfalls
- Data Races: Occur when multiple threads access and modify the same data concurrently without proper synchronization.
- Overhead: Parallelization introduces overhead (communication, synchronization) that can sometimes outweigh the benefits for small problems.
Best Practices
- Profile Before Parallelizing: Identify performance bottlenecks before attempting parallelization.
- Minimize Communication: Reduce data transfer between processors to improve efficiency.
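To make the data-parallelism idea above concrete, here is a minimal sketch using only the standard library; the workload (summing squares) and the chunk size are arbitrary choices for illustration.

```python
# Minimal data-parallelism sketch with the standard library.
# The workload (sum of squares) and the chunk size are illustrative.
from multiprocessing import Pool

def sum_of_squares(chunk):
    # The same operation is applied independently to each chunk (data parallelism).
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunk_size = 100_000
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    with Pool() as pool:  # one worker process per CPU core by default
        partial_sums = pool.map(sum_of_squares, chunks)

    print(sum(partial_sums))
```

Splitting the data into chunks amortizes the per-task overhead mentioned under Common Pitfalls; for a workload this small, the overhead may still dominate.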
Section 5.2: Numba with @jit(parallel=True)
Key Concept
Numba can significantly speed up Python code by compiling it to machine code. The @jit(parallel=True) decorator enables parallel execution of the compiled code, leveraging multiple CPU cores for faster processing.
Topics
- Parallelization: Distribute workload across multiple CPU cores.
- @jit(parallel=True): Decorator to enable parallel execution.
- Data Dependencies: Parallelization is most effective with independent operations.
- Performance Gains: Expect speedups, but not always linear with core count.
- In-session Exercise: Identify a simple loop in a provided Python function that could benefit from parallelization (see the sketch below).
- Common Pitfalls: Data races and incorrect parallelization due to dependencies.
- Best Practices: Profile code before parallelizing to identify bottlenecks and ensure gains.
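A minimal sketch of the idea, assuming Numba is installed; the array and the workload are illustrative.

```python
# Sketch: parallel sum of squares with Numba's @jit(parallel=True) and prange.
import numpy as np
from numba import jit, prange

@jit(nopython=True, parallel=True)
def sum_of_squares(arr):
    total = 0.0
    # prange marks the iterations as independent so Numba can split them across cores;
    # the accumulation into `total` is treated as a parallel reduction.
    for i in prange(arr.size):
        total += arr[i] * arr[i]
    return total

x = np.arange(1_000_000, dtype=np.float64)
print(sum_of_squares(x))  # the first call also pays the compilation cost
```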
Section 5.3: Dask for large structures and automatic parallelism
Key Concept
Dask is a flexible parallel computing library that excels at handling datasets too large to fit into memory by breaking them into smaller chunks and executing operations on those chunks in parallel. It simplifies parallelization without requiring extensive code changes.
Topics
- Lazy evaluation: Operations are not executed immediately but are scheduled for execution when their results are needed.
- Data structures: Dask provides specialized data structures like Dask Array, Dask DataFrame, and Dask Bag for parallel computation.
- Parallel execution: Dask automatically distributes computations across multiple cores or machines.
- Task scheduling: Dask manages the scheduling and execution of tasks in a distributed manner.
- In-session exercise: Consider how you might parallelize a simple array operation (e.g., element-wise squaring) using Dask if you had a very large array (see the sketch after this list).
- Common Pitfall: Overhead from task scheduling can sometimes outweigh the benefits of parallelism for very small datasets.
- Best Practice: Identify the most computationally intensive parts of your workflow to focus your parallelization efforts.
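A minimal sketch of lazy evaluation and of the element-wise squaring from the in-session exercise, assuming Dask is installed (pip install "dask[array]"); the array and chunk sizes are arbitrary.

```python
# Sketch: lazy, chunked element-wise squaring with a Dask array.
import dask.array as da

x = da.arange(10_000_000, chunks=1_000_000)  # 10 chunks; nothing is computed yet
y = x ** 2                                   # builds a task graph; still lazy

print(y)                  # prints a lazy dask array description, not the values
print(y.sum().compute())  # .compute() triggers parallel execution over the chunks
```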
Exercise: Dask for large structures and automatic parallelism
Objective: Use Dask to calculate the sum of a large array of numbers in parallel.
Instructions:
- Create a Dask array from a large list of numbers (e.g., 1 million numbers).
- Calculate the sum of the Dask array using dask.array.sum().
- Compare the execution time of the Dask approach with a standard Python loop.
Expected Learning Outcome: You will understand how to use Dask to parallelize computations on large datasets and observe the performance benefits.
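A possible sketch of this exercise, assuming Dask and NumPy are installed; the array size and chunk size are illustrative, and for a computation this cheap the scheduling overhead may outweigh the gains.

```python
# Sketch: sum of 1 million numbers with a plain loop vs a Dask array.
import time
import numpy as np
import dask.array as da

numbers = list(range(1_000_000))

start = time.perf_counter()
loop_total = 0
for n in numbers:               # baseline: standard Python loop
    loop_total += n
loop_time = time.perf_counter() - start

x = da.from_array(np.array(numbers), chunks=100_000)
start = time.perf_counter()
dask_total = x.sum().compute()  # dask.array sum, executed chunk by chunk in parallel
dask_time = time.perf_counter() - start

print(loop_total, int(dask_total))
print(f"loop: {loop_time:.3f} s, dask: {dask_time:.3f} s")
```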
Section 5.4: Numba and NumPy Arrays
Key Concept
Numba provides optimized support for NumPy arrays, enabling efficient numerical computations.
Topics
- NumPy Integration: Seamlessly works with NumPy arrays.
- Array Operations: Optimized for common array operations (e.g., addition, multiplication).
- Data Types: Numba infers data types, but explicit type annotations can improve performance (see the sketch below).
- Vectorization: Leverage vectorized operations for significant speedups.
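A minimal sketch of Numba operating on a NumPy array, with an optional explicit type signature; the function and the types are illustrative.

```python
# Sketch: Numba-compiled loop over a NumPy array with an explicit type signature.
import numpy as np
from numba import njit

@njit("float64[:](float64[:])")  # explicit types; Numba would otherwise infer them on first call
def scale_and_shift(a):
    out = np.empty_like(a)
    for i in range(a.size):
        out[i] = 2.0 * a[i] + 1.0
    return out

x = np.linspace(0.0, 1.0, 1_000_000)
print(scale_and_shift(x)[:5])
```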
Exercise: Numba with @jit(parallel=True)
Objective: To understand how to parallelize a simple numerical computation using Numba.
Instructions:
- You are given a Python script parallel_example.py that calculates the sum of squares of a list of numbers. The script currently uses a standard Python loop.
- Modify the script to use Numba with the @jit(parallel=True) decorator to parallelize the calculation.
- Run the script and compare the execution time with and without parallelization.
Expected Learning Outcome: You should be able to apply Numba's @jit(parallel=True) decorator to parallelize a simple function and observe the performance improvement.
- In-session Exercise: Convert a Python list to a NumPy array and perform a simple operation using Numba.
- Common Pitfalls: Incorrect data type inference leading to unexpected behavior.
- Best Practices: Use explicit type annotations for performance and clarity.
Section 5.5: JAX for scientific computing and automatic differentiation
Key Concept
JAX is a high-performance numerical computation library developed by Google, designed for automatic differentiation and XLA compilation. It excels at accelerating scientific workloads.
Topics
- Composable Function Transformations: JAX allows you to easily transform functions (e.g., differentiation, vectorization, parallelization) by composing operations.
- Automatic Differentiation (AD): JAX automatically computes gradients of functions, crucial for optimization and machine learning.
- XLA Compilation: JAX leverages XLA (Accelerated Linear Algebra) for optimized execution on CPUs, GPUs, and TPUs.
- Vectorization & Parallelization: JAX provides tools for efficiently applying operations to arrays and distributing computations across multiple devices.
Exercise: JAX for scientific computing and automatic differentiation
Objective: To get familiar with the basics of JAX and its automatic differentiation capabilities.
Instructions:
- Install JAX and NumPy using pip: pip install jax jaxlib numpy.
- Write a simple Python function that calculates the square of a number.
- Use jax.grad to compute the gradient of the function with respect to its input. Print the gradient.
Expected Learning Outcome: You will understand how to use jax.grad to automatically compute the derivative of a simple function.
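A minimal sketch of this exercise, assuming JAX is installed:

```python
# Sketch: automatic differentiation of f(x) = x**2 with jax.grad.
import jax

def square(x):
    return x ** 2

dsquare = jax.grad(square)  # returns a new function that computes df/dx

x = 3.0
print(square(x))    # 9.0
print(dsquare(x))   # 6.0, since d(x**2)/dx = 2x
```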
Pitfalls
- Immutability: JAX arrays are immutable. Avoid modifying arrays in place; use functional updates (e.g., x.at[i].set(value)) instead.
- Explicit Control: JAX does not apply transformations automatically; you must request jit, grad, vmap, etc. explicitly.
Best Practices
- Functional Programming: Embrace a functional programming style for cleaner and more predictable code.
- jax.jit for Performance: Use jax.jit to compile functions for significant speedups, especially for repeated computations (see the sketch below).
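A minimal sketch of the jax.jit best practice; the function and array size are illustrative.

```python
# Sketch: compiling a function once with jax.jit and reusing it.
import jax
import jax.numpy as jnp

def norm_of_transform(x):
    return jnp.sqrt(jnp.sum((2.0 * x + 1.0) ** 2))

fast_norm = jax.jit(norm_of_transform)  # traced and XLA-compiled on first call

x = jnp.arange(1_000_000, dtype=jnp.float32)
print(fast_norm(x))  # first call: compile + run
print(fast_norm(x))  # later calls reuse the compiled code and are much faster
```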
Section 5.6: PyTorch and GPU usage (torch.cuda)
Key Concept
Leveraging GPUs with PyTorch significantly accelerates training and inference by utilizing parallel processing capabilities. The torch.cuda module provides the tools to manage and utilize GPU resources.
Topics
- GPU Availability Check: Determine if a GPU is available and accessible.
- Data and Model Placement: Move tensors (data) and models to the GPU using .to('cuda').
- torch.cuda.device: Context manager for selecting the active GPU device.
- Data Transfer: Understand the process of moving data between CPU and GPU memory.
Exercise: PyTorch and GPU usage (torch.cuda)
Objective: To understand how to utilize a GPU with PyTorch for faster computations.
Instructions:
- You are given a simple PyTorch script that performs matrix multiplication.
- Modify the script to move the tensors and computations to the GPU using a device object (e.g., torch.device('cuda')) and tensor.to(device).
- Run the modified script and observe the execution time. (You can use timeit or a similar tool.)
Expected Learning Outcome: You will be able to move tensors and computations to the GPU using torch.cuda and observe the performance difference.
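A possible sketch of this exercise, assuming PyTorch is installed; the matrix sizes are illustrative, and the script falls back to the CPU if no CUDA GPU is available.

```python
# Sketch: matrix multiplication on CPU vs GPU with PyTorch (sizes are illustrative).
import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.rand(4096, 4096)
b = torch.rand(4096, 4096)

start = time.perf_counter()
c_cpu = a @ b
print(f"CPU: {time.perf_counter() - start:.3f} s")

a_dev = a.to(device)          # move the data to the GPU (no-op on CPU)
b_dev = b.to(device)
start = time.perf_counter()
c_dev = a_dev @ b_dev
if device.type == "cuda":
    torch.cuda.synchronize()  # GPU kernels run asynchronously; wait before timing
print(f"{device.type.upper()}: {time.perf_counter() - start:.3f} s")
```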
Pitfalls
- Data Type Mismatch: Ensure data types are compatible with GPU computations (e.g., avoid using float64 if float32 is sufficient).
- Memory Overflow: Be mindful of GPU memory limitations and avoid excessively large models or batches.
Best Practices
- Batch Size Optimization: Experiment with batch sizes to maximize GPU utilization without exceeding memory limits.
- Data Preprocessing on GPU: Consider performing data preprocessing steps on the GPU to reduce data transfer overhead.
Section 5.7: Performance profiling tools: time, timeit, line_profiler, py-spy
Key Concept
These tools help identify performance bottlenecks in Python code by measuring execution time and pinpointing slow sections. They offer different levels of granularity, from overall execution time to line-by-line analysis.
Topics
- time: Measures overall execution time of a code snippet. Simple and quick for basic timing.
- timeit: Measures execution time of small code snippets multiple times to provide more reliable results. Ideal for comparing different implementations.
- line_profiler: Profiles execution time of individual lines within a function. Provides detailed insights into performance hotspots.
- py-spy: Samples the Python process's call stack without modifying the code. Useful for profiling long-running processes and production environments.
- In-session exercise: Compare the execution time of a simple loop using time and timeit (see the sketch below).
- Common Pitfalls: Ensure the code being profiled is representative of the real-world workload. External factors (e.g., I/O) can skew results.
- Best Practices: Profile representative workloads and focus on the most time-consuming parts of the code.
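A minimal sketch of the in-session exercise, timing the same loop with time and timeit. line_profiler and py-spy are usually run from the command line instead (e.g., kernprof -l -v script.py and py-spy record -o profile.svg -- python script.py).

```python
# Sketch: timing a simple loop with time.perf_counter vs the timeit module.
import time
import timeit

def simple_loop():
    total = 0
    for i in range(100_000):
        total += i * i
    return total

# time: a single, coarse measurement around one call
start = time.perf_counter()
simple_loop()
print(f"time.perf_counter: {time.perf_counter() - start:.4f} s (single run)")

# timeit: runs the callable many times and reports the total, averaging out noise
print(f"timeit: {timeit.timeit(simple_loop, number=100):.4f} s for 100 runs")
```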
Section 5.8: Numerical precision considerations in parallel computing
Key Concept
Parallel computing can exacerbate numerical precision issues due to the accumulation of rounding errors and the potential for different processors to operate with slightly different precision.
Topics
- Rounding Error Accumulation: Individual calculations introduce small errors that can compound across many operations.
- Floating-Point Representation: Understand the limitations of floating-point numbers (e.g., IEEE 754) and their inherent precision.
- Parallelism and Precision: Different processors may use slightly different precision levels, leading to variations in results.
- Order of Operations: The order in which calculations are performed can influence the magnitude of rounding errors.
- In-Session Exercise: Consider a simple calculation involving many additions. How might the order of operations affect the final result? (5 min; see the sketch at the end of this section)
- Common Pitfalls: Assuming that parallelization automatically improves accuracy.
- Best Practices: Use higher-precision data types (e.g., float64 instead of float32) when accuracy is critical.
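A minimal sketch for the in-session exercise: adding the same numbers in a different order gives a visibly different float32 result.

```python
# Sketch: the order of floating-point additions changes the result (float32).
import numpy as np

big = np.float32(1e8)
smalls = np.full(1000, 1.0, dtype=np.float32)

# Big value first: each 1.0 is smaller than the spacing between float32 values near 1e8,
# so every addition is rounded away.
total_big_first = big
for s in smalls:
    total_big_first = np.float32(total_big_first + s)

# Small values first: they accumulate to 1000.0, which survives the final addition.
total_small_first = np.float32(smalls.sum() + big)

print(total_big_first)    # 100000000.0
print(total_small_first)  # 100001000.0
```

Parallel reductions can change the summation order between runs or between processor counts, which is why parallel code may produce slightly different results.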
Section 5.9: Choosing the right tool for different problem types
Key Concept
Selecting the appropriate software or computational method is crucial for efficient problem-solving. The choice depends on the problem's nature, desired accuracy, and available resources.
Topics
- Differential Equations: Consider SciPy solvers such as scipy.integrate.odeint or the newer scipy.integrate.solve_ivp (see the sketch after this list).
- Linear Algebra: NumPy provides a comprehensive suite of linear algebra functions (e.g., matrix operations, eigenvalue decomposition).
- Numerical Integration: Explore SciPy's integrate module for various integration techniques (e.g., trapezoidal rule, Simpson's rule).
- Optimization: Utilize SciPy's optimize module for finding minima/maxima of functions, constrained optimization, etc.
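As a small illustration of matching tools to problems, the sketch below solves a simple ODE with scipy.integrate.solve_ivp and minimizes a simple function with scipy.optimize; the decay rate and the test function are arbitrary.

```python
# Sketch: SciPy tools for two of the problem types above (parameters are arbitrary).
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import minimize_scalar

# Differential equation: exponential decay dy/dt = -0.5 * y, y(0) = 1
sol = solve_ivp(lambda t, y: -0.5 * y, t_span=(0.0, 10.0), y0=[1.0],
                t_eval=np.linspace(0.0, 10.0, 5))
print(sol.t, sol.y[0])

# Optimization: minimum of a simple quadratic
res = minimize_scalar(lambda x: (x - 2.0) ** 2)
print(res.x)  # close to 2.0
```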
Exercise
- Briefly describe a problem you've encountered where you could have used a different tool. What tool would have been more appropriate and why? (5 min)
Pitfalls
- Over-engineering: Don't choose a complex tool if a simpler one suffices.
- Ignoring data characteristics: The choice of tool should align with the data's properties (e.g., size, distribution).
Best Practices
- Start simple: Begin with basic tools and gradually increase complexity as needed.
- Understand the tool's limitations: Be aware of the assumptions and potential inaccuracies of each tool.
Section 5.10: Comparative benchmark: CPU vs GPU vs distributed
Key Concept
Different computing architectures (CPU, GPU, and distributed systems) offer distinct advantages and disadvantages for scientific workloads. Choosing the right architecture depends on the specific workload characteristics.
Topics
- CPU: General-purpose, excels at sequential tasks and complex control flow.
- GPU: Massively parallel, optimized for data-parallel operations (e.g., matrix calculations).
- Distributed: Leverages multiple machines for scalability and fault tolerance.
- Workload Suitability: Consider data volume, computational intensity, and parallelism potential.
- In-session Exercise: Briefly brainstorm (2 min) which workload types would benefit most from each architecture (CPU, GPU, Distributed).
- Common Pitfalls: Assuming GPU is always faster; neglecting data transfer overhead in distributed systems.
- Best Practices: Profile your code to identify bottlenecks before choosing an architecture.
Exercise: Complete signal processing pipeline using multiple approaches
Objective: Implement a simple signal processing pipeline to demonstrate parallel processing techniques.
Instructions:
- You are given a Python script signal_processing.py that reads a sample audio file (e.g., a short WAV file). The script performs the following steps:
1. Reads the audio data.
2. Applies a simple filtering operation (e.g., a moving average).
3. Calculates the signal's energy.
- Modify the script to implement the same pipeline using two different parallel processing approaches:
1. Multiprocessing: Use the multiprocessing library to parallelize the filtering and energy calculation steps.
2. Threading: Use the threading library to parallelize the filtering and energy calculation steps.
- Compare the execution time of the original script and the parallelized versions.
Expected Learning Outcome: You should understand the basic concepts of multiprocessing and threading in Python and how they can be used to improve the performance of a simple computational pipeline. You should also be able to measure the execution time of different approaches.
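One possible sketch of the two parallel variants, shown here via concurrent.futures, which builds on the multiprocessing and threading modules; the stand-in signal, the chunking, and the window size are illustrative rather than taken from signal_processing.py. Threading often shows little benefit for CPU-bound code because of the GIL (NumPy releases it only for some operations), which is part of what the timing comparison is meant to reveal.

```python
# Sketch: per-chunk moving-average filter + energy, run with processes vs threads.
# `samples`, the chunking, and the window size are illustrative stand-ins.
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import numpy as np

def process_chunk(chunk, window=5):
    kernel = np.ones(window) / window
    filtered = np.convolve(chunk, kernel, mode="same")  # simple moving average
    return float(np.sum(filtered ** 2))                 # chunk energy

def total_energy(executor_cls, chunks):
    with executor_cls() as ex:
        return sum(ex.map(process_chunk, chunks))

if __name__ == "__main__":
    samples = np.random.randn(1_000_000)   # stand-in for the audio data
    chunks = np.array_split(samples, 8)

    print(total_energy(ProcessPoolExecutor, chunks))  # multiprocessing-based
    print(total_energy(ThreadPoolExecutor, chunks))   # threading-based
```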
Exercise: Performance comparison: multiprocessing vs Numba vs GPU
Objective: To compare the performance of a simple numerical computation using multiprocessing, Numba, and (if available) a GPU.
Instructions:
- You are given a Python script computation.py that calculates the sum of squares of numbers from 1 to 1 million. This script is currently implemented using a standard Python loop.
- Modify computation.py to implement the same calculation using:
- Multiprocessing (using multiprocessing.Pool)
- Numba (using the @njit decorator)
- (If you have access to a GPU, and the necessary libraries installed, add a GPU implementation using a library like CuPy or TensorFlow/PyTorch).
- Run the computation.py script with each of the three implementations and measure the execution time using the timeit module.
- Compare the execution times and discuss the relative performance of each approach.
Expected Learning Outcome: You will understand the basic differences in how multiprocessing, Numba, and GPUs can be used to accelerate numerical computations, and how to measure performance differences.
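One way to structure computation.py, assuming Numba is installed and treating the GPU variant (sketched here with CuPy) as optional; the chunking and function names are illustrative.

```python
# Sketch: sum of squares of 1..1_000_000 with multiprocessing, Numba, and (optionally) CuPy.
import timeit
from multiprocessing import Pool
from numba import njit

N = 1_000_000

def sum_squares_chunk(bounds):
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

def multiprocessing_version():
    step = 250_000
    bounds = [(i, min(i + step, N + 1)) for i in range(1, N + 1, step)]
    with Pool() as pool:
        return sum(pool.map(sum_squares_chunk, bounds))

@njit
def numba_version():
    total = 0
    for i in range(1, N + 1):
        total += i * i
    return total

def gpu_version():
    import cupy as cp                       # optional: requires CuPy and a CUDA GPU
    x = cp.arange(1, N + 1, dtype=cp.int64)
    return int(cp.sum(x * x))

if __name__ == "__main__":
    print(timeit.timeit(multiprocessing_version, number=3))
    print(timeit.timeit(numba_version, number=3))  # first call includes compilation time
```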
Exercise: Real-world optimization problem
Objective: Calculate the sum of squares of numbers in a list using both a standard Python loop and the multiprocessing library.
Instructions:
- You are given a Python list of numbers: numbers = list(range(1, 101)).
- Implement a function sum_of_squares_sequential(numbers) that calculates the sum of squares using a standard Python loop.
- Implement a function sum_of_squares_parallel(numbers) that calculates the sum of squares using the multiprocessing library to parallelize the calculation.
- Compare the execution time of both functions using the timeit module.
Expected Learning Outcome: You will understand how to use the multiprocessing library to parallelize a simple task and observe the potential performance benefits.