04 - GPU with PyCUDA/Numba

July 2025

Section 4.1: Introduction to CUDA: concept of kernels and grids

Key Concept

CUDA enables parallel computation on NVIDIA GPUs by executing functions called kernels across many lightweight threads. Threads are grouped into blocks, and the blocks launched for a kernel form a grid; the grid and block dimensions chosen at launch determine how the work is spread across the GPU.

Topics

  • Kernels: Functions executed on the GPU, run once per thread.
  • Grids: The full set of thread blocks launched for a single kernel invocation.
  • Blocks: Subdivisions of a grid; the threads within a block can cooperate through shared memory and synchronization.
  • Threads: Individual execution units; each thread runs the kernel body on its own portion of the data.

  • In-session Exercise: Imagine you need to perform the same operation on a large array of data. How would you conceptually divide the work among the GPU's threads, blocks, and grid? (Think about the overall structure.)

  • Common Pitfalls: Incorrectly specifying grid dimensions can lead to underutilization of GPU resources.

  • Best Practices: Aim for a sufficient number of threads per block to hide latency, but avoid excessive thread counts that can overwhelm the GPU.
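
To make the terminology concrete, here is a minimal sketch using numba.cuda (the kernel name, array size, and block size are illustrative choices; it assumes Numba and a CUDA-capable GPU are available):

    # One thread per array element: threads are grouped into blocks, blocks into a grid.
    import numpy as np
    from numba import cuda

    @cuda.jit
    def scale(arr, factor):
        i = cuda.grid(1)          # global index = blockIdx.x * blockDim.x + threadIdx.x
        if i < arr.size:          # guard: the grid may hold more threads than elements
            arr[i] *= factor

    data = np.arange(1_000_000, dtype=np.float32)
    d_data = cuda.to_device(data)

    threads_per_block = 256
    blocks_per_grid = (data.size + threads_per_block - 1) // threads_per_block
    scale[blocks_per_grid, threads_per_block](d_data, 2.0)   # launch: a grid of blocks of threads

    result = d_data.copy_to_host()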

Section 4.2: Memory Hierarchy in CUDA

Key Concept

CUDA GPUs have a complex memory hierarchy, with different types of memory offering varying speeds and scopes. Understanding this hierarchy is crucial for optimizing performance.

Topics

  • Global Memory: Largest memory space, accessible by all threads.
  • Shared Memory: Fast, on-chip memory shared by threads within a block.
  • Registers: Fastest memory, private to each thread.
  • Constant Memory: Read-only memory for data that doesn't change during execution.

  • In-session Exercise: Consider a scenario where you need to frequently access the same data within a block of threads. Which memory type would be most efficient?

  • Common Pitfalls: Frequent access to global memory can be a performance bottleneck.

  • Best Practices: Utilize shared memory to reduce global memory accesses.
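
The sketch below (illustrative only; the block size and reduction pattern are arbitrary choices, and it assumes Numba and a CUDA GPU) shows three levels of the hierarchy in action: global memory for the input and output arrays, shared memory as per-block scratch space, and local scalar variables that typically live in registers:

    import numpy as np
    from numba import cuda, float32

    TPB = 128  # threads per block (illustrative, must be a power of two here)

    @cuda.jit
    def block_sum(x, partial):
        tile = cuda.shared.array(shape=TPB, dtype=float32)  # shared memory: visible to one block
        i = cuda.grid(1)
        t = cuda.threadIdx.x                                 # local variable, register-resident

        tile[t] = x[i] if i < x.size else 0.0                # one read from global memory
        cuda.syncthreads()                                   # wait until the whole block has loaded

        # naive block-wide reduction carried out entirely in shared memory
        stride = TPB // 2
        while stride > 0:
            if t < stride:
                tile[t] += tile[t + stride]
            cuda.syncthreads()
            stride //= 2

        if t == 0:
            partial[cuda.blockIdx.x] = tile[0]               # one write back to global memory per block

    x = np.ones(1 << 20, dtype=np.float32)
    blocks = (x.size + TPB - 1) // TPB
    partial = cuda.device_array(blocks, dtype=np.float32)
    block_sum[blocks, TPB](cuda.to_device(x), partial)
    print(partial.copy_to_host().sum())                      # equals x.sum()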

Section 4.3: CUDA Runtime API Overview

Key Concept

The CUDA Runtime API provides functions for managing CUDA devices, allocating memory, and launching kernels. It's the interface between your host application and the GPU.

Topics

  • Device Selection: Choosing which GPU to use.
  • Memory Allocation: Allocating memory on the device.
  • Kernel Launch: Initiating the execution of a kernel on the GPU.
  • Synchronization: Ensuring proper order of operations between host and device.

  • In-session Exercise: What are the key steps involved in running a CUDA program? (Think about the sequence of actions.)

  • Common Pitfalls: Memory leaks can occur if memory allocated on the device is not properly freed.

  • Best Practices: Always explicitly free allocated memory when it's no longer needed.
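
Although the Runtime API itself is a C API (cudaSetDevice, cudaMalloc, cudaMemcpy, cudaDeviceSynchronize, ...), the same steps surface through the Python wrappers used in this course. A hedged sketch with numba.cuda, assuming at least one CUDA device is present:

    import numpy as np
    from numba import cuda

    cuda.select_device(0)                      # device selection (cudaSetDevice)
    print(cuda.get_current_device().name)      # which GPU did we get?

    @cuda.jit
    def add_one(x):
        i = cuda.grid(1)
        if i < x.size:
            x[i] += 1.0

    host = np.zeros(1024, dtype=np.float32)
    dev = cuda.to_device(host)                 # allocation + host-to-device copy (cudaMalloc/cudaMemcpy)

    add_one[4, 256](dev)                       # kernel launch: 4 blocks of 256 threads
    cuda.synchronize()                         # wait for the GPU to finish (cudaDeviceSynchronize)

    host = dev.copy_to_host()                  # device-to-host copy
    del dev                                    # device memory is freed once the last reference is dropped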

Section 4.4: PyCUDA vs Numba CUDA: similarities and differences

Key Concept

Both PyCUDA and Numba CUDA enable GPU acceleration of Python code, but they differ significantly in approach. PyCUDA is a thin wrapper around the CUDA driver API in which kernels are written as CUDA C source strings, giving low-level, explicit control; Numba CUDA compiles decorated Python functions to GPU code just-in-time (JIT), trading some control for a much higher level of abstraction.

Topics

  • Programming Model: PyCUDA requires writing explicit CUDA C kernels; Numba CUDA compiles Python functions marked with the @cuda.jit decorator.
  • Level of Abstraction and Control: PyCUDA exposes fine-grained control over GPU resources and memory management; Numba CUDA hides most of this behind a simpler interface.
  • Performance: PyCUDA can reach higher performance with careful manual optimization; Numba CUDA usually delivers good performance with minimal code changes.
  • Ease of Use: Numba CUDA is generally the easier entry point for straightforward GPU acceleration tasks.

  • In-session Exercise: Consider a simple matrix multiplication or a computationally intensive loop. Which library would you start with, and what factors (control, development time, expected performance) would influence your decision?

  • Common Pitfalls: Expecting Numba CUDA to automatically optimize arbitrary Python code for the GPU; overlooking the cost of data transfers between CPU and GPU.
  • Best Practices: Start with Numba CUDA for simpler tasks, consider PyCUDA for complex, performance-critical applications, and profile with both libraries to determine the best fit for your application.
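
The contrast is easiest to see with the same vector addition written both ways. This is an illustrative sketch (kernel and variable names are arbitrary) assuming pycuda, numba, and a CUDA GPU are available:

    import numpy as np

    a = np.random.rand(1024).astype(np.float32)
    b = np.random.rand(1024).astype(np.float32)

    # --- PyCUDA: the kernel is written as CUDA C source ---
    import pycuda.autoinit                     # creates a context on the default device
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void vadd(float *out, const float *a, const float *b, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = a[i] + b[i];
    }
    """)
    vadd = mod.get_function("vadd")
    out_pycuda = np.empty_like(a)
    vadd(drv.Out(out_pycuda), drv.In(a), drv.In(b), np.int32(a.size),
         block=(256, 1, 1), grid=(4, 1))

    # --- Numba CUDA: the kernel is a decorated Python function ---
    from numba import cuda

    @cuda.jit
    def vadd_numba(out, x, y):
        i = cuda.grid(1)
        if i < out.size:
            out[i] = x[i] + y[i]

    d_out = cuda.device_array_like(a)
    vadd_numba[4, 256](d_out, cuda.to_device(a), cuda.to_device(b))
    out_numba = d_out.copy_to_host()

    assert np.allclose(out_pycuda, out_numba)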

Section 4.5: First kernel with numba.cuda

Key Concept

This section demonstrates how to write and execute a simple CUDA kernel using numba.cuda to accelerate computations on a GPU. It introduces the basic structure of a CUDA kernel and how to integrate it into a Numba-accelerated function.

Topics

  • Defining a CUDA kernel using the @cuda.jit decorator.
  • Allocating GPU arrays with cuda.device_array and moving data with cuda.to_device / copy_to_host.
  • Basic CUDA operations (e.g., addition, multiplication).
  • Launching the kernel with a specified grid and block size.

  • In-session exercise: Modify the kernel to perform element-wise multiplication of two arrays.

  • Common Pitfall: Incorrectly specifying the grid and block size can lead to poor performance or errors.
  • Best Practice: Ensure data transfers between the CPU and GPU are minimized to maximize GPU utilization.
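
Putting the pieces together, a hedged end-to-end example (assuming Numba and a CUDA GPU; the saxpy operation and the sizes are illustrative, and exercise both addition and multiplication):

    import numpy as np
    from numba import cuda

    @cuda.jit
    def saxpy(out, a, x, y):
        i = cuda.grid(1)
        if i < out.size:
            out[i] = a * x[i] + y[i]

    n = 100_000
    x = np.random.rand(n).astype(np.float32)
    y = np.random.rand(n).astype(np.float32)

    d_x = cuda.to_device(x)                          # copy inputs to the GPU
    d_y = cuda.to_device(y)
    d_out = cuda.device_array(n, dtype=np.float32)   # uninitialized output on the GPU

    threads = 256
    blocks = (n + threads - 1) // threads
    saxpy[blocks, threads](d_out, 2.0, d_x, d_y)     # launch with explicit grid and block sizes

    out = d_out.copy_to_host()
    assert np.allclose(out, 2.0 * x + y)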

Section 4.6: CuPy for NumPy-like GPU operations

Key Concept

CuPy provides a NumPy-compatible interface for GPU computing, allowing you to leverage the power of NVIDIA GPUs to accelerate numerical computations. It mirrors NumPy's API, making it relatively easy to transition existing code to run on GPUs.

Topics

  • GPU Acceleration: Execute NumPy-like operations on NVIDIA GPUs.
  • Data Transfer: Efficiently move data between CPU and GPU memory.
  • API Compatibility: Utilizes a NumPy-like API for familiar syntax.
  • Broadcasting: Supports NumPy's broadcasting rules for GPU operations.

  • In-session Exercise: Convert a simple NumPy array operation (e.g., element-wise multiplication) to its CuPy equivalent.

  • Common Pitfalls: Data transfer bottlenecks can limit performance; ensure data is on the GPU before computation.
  • Best Practices: Minimize data transfers between CPU and GPU for optimal performance.
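
A hedged sketch of the NumPy-to-CuPy transition (it assumes CuPy is installed and matches the locally available CUDA or ROCm toolkit; the array sizes are arbitrary):

    import numpy as np
    import cupy as cp

    a_cpu = np.random.rand(1000, 1000).astype(np.float32)
    b_cpu = np.random.rand(1000, 1000).astype(np.float32)

    a_gpu = cp.asarray(a_cpu)          # host -> device transfer
    b_gpu = cp.asarray(b_cpu)

    c_gpu = a_gpu @ b_gpu + 1.0        # same syntax as NumPy, but runs on the GPU
    row_means = c_gpu.mean(axis=1)     # reductions and broadcasting follow NumPy rules

    c_cpu = cp.asnumpy(c_gpu)          # device -> host transfer, only when results are needed on the CPU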

Section 4.7: TensorFlow for Deep Learning

Key Concept

TensorFlow is a powerful open-source library for numerical computation and large-scale machine learning, particularly well-suited for building and training deep neural networks.

Topics

  • Tensors: Fundamental data structure representing multi-dimensional arrays.
  • Computational Graph: Defines the flow of operations for model training and inference.
  • Keras API: High-level API for simplified model building and training.
  • GPU Acceleration: Leverages GPUs for accelerated training and inference.

  • In-session Exercise: Define a simple TensorFlow tensor and perform basic operations (e.g., addition, multiplication).

  • Common Pitfalls: Vanishing/exploding gradients during training; requires careful initialization and regularization.
  • Best Practices: Use the Keras API for rapid prototyping and model development.
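
A hedged sketch of basic tensor operations and a tiny Keras model (assumes TensorFlow 2.x; the layer sizes and dummy data are arbitrary):

    import tensorflow as tf

    # Tensors and basic operations
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[10.0, 20.0], [30.0, 40.0]])
    print(a + b)            # element-wise addition
    print(tf.matmul(a, b))  # matrix multiplication

    # A tiny Keras model; TensorFlow places operations on a visible GPU automatically
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(16,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

    x = tf.random.normal((256, 16))
    y = tf.random.normal((256, 1))
    model.fit(x, y, epochs=2, batch_size=32, verbose=0)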

Section 4.8: PyTorch for Deep Learning

Key Concept

PyTorch is another popular open-source machine learning framework, known for its dynamic computational graph and Pythonic interface, offering flexibility and ease of debugging.

Topics

  • Tensors: Similar to TensorFlow, tensors are the core data structure.
  • Dynamic Computation Graph: Graph is defined on-the-fly, allowing for flexible model architectures.
  • Autograd: Automatic differentiation engine for gradient calculation.
  • GPU Acceleration: Seamlessly utilizes GPUs for accelerated training and inference.

  • In-session Exercise: Create a simple PyTorch tensor and perform basic operations.

  • Common Pitfalls: Understanding the dynamic graph can be initially challenging; requires careful management of operations.
  • Best Practices: Leverage the torch.nn module for building neural network layers.
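
A hedged sketch of tensors, autograd, and moving work to the GPU (it assumes a CUDA build of PyTorch and falls back to the CPU otherwise; shapes are arbitrary):

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"

    a = torch.rand(3, 3, device=device)
    b = torch.rand(3, 3, device=device)
    c = a @ b + 1.0                      # runs on the GPU when one is available

    # Autograd: gradients are recorded as operations execute (dynamic graph)
    x = torch.randn(5, requires_grad=True)
    loss = (x ** 2).sum()
    loss.backward()
    print(x.grad)                        # d(loss)/dx = 2x

    # A torch.nn building block, moved to the same device as the data
    layer = torch.nn.Linear(16, 4).to(device)
    out = layer(torch.randn(8, 16, device=device))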

Section 4.9: ROCm/HIP alternatives for AMD GPUs

Key Concept

ROCm/HIP is AMD's primary GPU computing platform, but alternative approaches exist for leveraging AMD GPU acceleration, offering flexibility and potentially broader hardware compatibility.

Topics

  • SYCL: A modern, C++-based programming model for heterogeneous computing, promoting portability across hardware vendors.
  • OpenCL: A widely adopted standard for parallel programming with broad hardware support, but often lower performance on AMD GPUs than ROCm/HIP.
  • CUDA (with translation): Reusing existing CUDA code by translating it to AMD GPU APIs (e.g., via HIP's porting tools).
  • DirectCompute: Microsoft's API for GPU computing, primarily used in Windows environments.

  • In-session Exercise: Compare SYCL and OpenCL for a specific workload (e.g., image processing), focusing on portability and expected performance, and list the pros and cons of each.

  • Common Pitfalls: Performance may be lower than with ROCm/HIP, and compatibility can be a concern with the alternative routes.
  • Best Practices: Evaluate the workload's requirements and prioritize ROCm/HIP when possible for optimal performance and support.
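
As one illustration of the OpenCL route from Python, here is a hedged PyOpenCL sketch of a vector addition; it can target AMD GPUs (and most other devices) through their OpenCL drivers, and assumes pyopencl and a working OpenCL runtime are installed:

    import numpy as np
    import pyopencl as cl

    a = np.random.rand(1 << 16).astype(np.float32)
    b = np.random.rand(1 << 16).astype(np.float32)

    ctx = cl.create_some_context()            # picks an available OpenCL device
    queue = cl.CommandQueue(ctx)

    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
    out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

    program = cl.Program(ctx, """
    __kernel void vadd(__global const float *a, __global const float *b, __global float *out) {
        int i = get_global_id(0);
        out[i] = a[i] + b[i];
    }
    """).build()

    program.vadd(queue, a.shape, None, a_buf, b_buf, out_buf)

    out = np.empty_like(a)
    cl.enqueue_copy(queue, out, out_buf)      # device -> host copy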

Section 4.10: Memory management: host-device transfers, memory coalescing

Key Concept

Efficiently moving data between host (CPU) and device (GPU/accelerator) memory is crucial for performance. Equally important is how threads access device memory once the data is there: when the threads of a warp touch contiguous, aligned addresses, the hardware coalesces those accesses into a small number of memory transactions.

Topics

  • Host-device transfer modes: Synchronous (blocking) vs. asynchronous (non-blocking) copies.
  • Memory coalescing: Threads in a warp accessing contiguous, aligned addresses so the hardware can serve them with fewer memory transactions.
  • Pinned/page-locked memory: Host memory that cannot be paged out, which speeds up copies and is required for truly asynchronous transfers.
  • Transfer APIs: The available functions (e.g., cudaMemcpy and cudaMemcpyAsync in the Runtime API; cuda.to_device and copy_to_host in numba.cuda).

Exercise

  • Consider a scenario where you need to transfer 1000 one-byte elements from host to device. What is the potential performance impact of transferring each element individually versus batching them into a single larger transfer?

Common Pitfalls

  • Data alignment: Misaligned data can significantly reduce transfer speeds.
  • Overhead of asynchronous transfers: Asynchronous transfers introduce complexity and potential for race conditions if not managed carefully.

Best Practices

  • Transfer in large blocks: Prefer a few large transfers over many small ones, so the fixed per-transfer overhead is amortized.
  • Use pinned memory: When possible, use pinned (page-locked) host memory to speed up copies and enable asynchronous transfers.
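
The transfer-related topics above can be sketched with numba.cuda as follows (an illustrative sketch assuming Numba and a CUDA GPU; the array size and kernel are arbitrary, and coalescing itself is a property of the access pattern inside the kernel rather than of the copies):

    import numpy as np
    from numba import cuda

    n = 1 << 22
    src = cuda.pinned_array(n, dtype=np.float32)   # page-locked host buffer
    src[:] = np.random.rand(n)

    stream = cuda.stream()

    # Asynchronous host-to-device copy: returns immediately, the work is queued on the stream
    d_arr = cuda.to_device(src, stream=stream)

    @cuda.jit
    def double(x):
        i = cuda.grid(1)                            # adjacent threads touch adjacent elements: coalesced
        if i < x.size:
            x[i] *= 2.0

    double[(n + 255) // 256, 256, stream](d_arr)    # launch on the same stream

    out = cuda.pinned_array(n, dtype=np.float32)
    d_arr.copy_to_host(out, stream=stream)          # asynchronous device-to-host copy
    stream.synchronize()                            # block until all queued work has finished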

Section 4.11: Benchmark: operations on CPU vs GPU

Key Concept

GPUs are highly optimized for parallel computation, offering substantial speedups over CPUs for workloads involving large datasets and repetitive, independent operations. CPUs remain more versatile for general-purpose, sequential tasks with complex control flow.

Topics

  • Parallelism: GPUs have massively parallel architectures with thousands of cores executing many operations simultaneously.
  • Data Transfer: Moving data between CPU and GPU is often the main performance bottleneck.
  • Memory Hierarchy: GPUs provide a distinct memory hierarchy (e.g., shared memory) optimized for data reuse.
  • Computational Graphs: Deep-learning frameworks express GPU work as computational graphs that can be optimized before execution.
  • Workload Characteristics: The GPU advantage is most pronounced for highly parallelizable, compute-intensive workloads; CPUs are generally better for sequential tasks and complex control flow.

  • In-session Exercise: Consider a simple matrix multiplication, or filtering (e.g., blurring) a large image dataset. How would you structure the calculation to best leverage a GPU? (Think about how the data is partitioned across threads.)

Common Pitfalls

  • Overhead: GPU acceleration does not automatically improve performance; data transfer and launch overhead can negate gains, and small datasets may not justify them.
  • Synchronization: GPU work is asynchronous; timing or consuming results without synchronizing with the device gives misleading benchmarks or incorrect results.
  • Memory Bandwidth: Insufficient memory bandwidth can limit GPU performance even for parallel workloads.

Best Practices

  • Profiling: Profile your code to identify bottlenecks before attempting GPU acceleration, and always time properly synchronized runs.
  • Data Locality and Alignment: Minimize CPU-GPU transfers, keep data on the device across successive operations, and ensure data is properly aligned.
  • Established Frameworks: Leverage established GPU programming frameworks and libraries (CUDA, OpenCL, and the Python layers built on them) rather than reinventing them.
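
A hedged micro-benchmark sketch using NumPy and CuPy (it assumes CuPy is installed; the matrix size is arbitrary and the measured ratio depends entirely on the hardware):

    import time
    import numpy as np
    import cupy as cp

    n = 2048
    a_cpu = np.random.rand(n, n).astype(np.float32)
    b_cpu = np.random.rand(n, n).astype(np.float32)

    t0 = time.perf_counter()
    c_cpu = a_cpu @ b_cpu
    cpu_time = time.perf_counter() - t0

    a_gpu = cp.asarray(a_cpu)              # transfer cost is deliberately excluded from the timing
    b_gpu = cp.asarray(b_cpu)
    _ = a_gpu @ b_gpu                      # warm-up run: excludes one-off initialization costs
    cp.cuda.Stream.null.synchronize()

    t0 = time.perf_counter()
    c_gpu = a_gpu @ b_gpu
    cp.cuda.Stream.null.synchronize()      # GPU work is asynchronous: synchronize before stopping the clock
    gpu_time = time.perf_counter() - t0

    print(f"CPU: {cpu_time:.3f} s   GPU: {gpu_time:.3f} s")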


Exercise: 2D convolution on GPU

Objective: Implement a simple 2D convolution operation on a GPU using either CUDA or ROCm.

Instructions:

  • You are given a Python script that defines a small image and a kernel function for 2D convolution.
  • Modify the script to execute the convolution kernel on a GPU using either CUDA or ROCm (choose one based on your hardware availability).
  • Ensure you correctly transfer the input image data to the GPU and retrieve the output.

Expected Learning Outcome: You will understand the basic steps involved in offloading computation to a GPU and performing a simple image processing task.
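
If you want a starting point, the following illustrative skeleton (not the provided script; the names, sizes, and box-blur kernel are made up here) shows one way to structure a naive 2D convolution in numba.cuda; a CuPy- or HIP-based solution would follow the same steps:

    # Starter skeleton only: naive "valid" 2D convolution, one GPU thread per output pixel.
    import numpy as np
    from numba import cuda

    @cuda.jit
    def conv2d(image, kern, out):
        x, y = cuda.grid(2)
        kh, kw = kern.shape
        if x < out.shape[0] and y < out.shape[1]:
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += image[x + i, y + j] * kern[i, j]
            out[x, y] = acc

    image = np.random.rand(512, 512).astype(np.float32)
    kern = np.ones((3, 3), dtype=np.float32) / 9.0            # simple box blur
    out = cuda.device_array((510, 510), dtype=np.float32)     # "valid" output size

    threads = (16, 16)
    blocks = ((out.shape[0] + 15) // 16, (out.shape[1] + 15) // 16)
    conv2d[blocks, threads](cuda.to_device(image), cuda.to_device(kern), out)
    result = out.copy_to_host()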


Exercise: Matrix multiplication with memory optimization

Objective: Implement a basic matrix multiplication function using either CUDA or ROCm, focusing on optimizing memory access patterns.

Instructions:

  • You are given a Python script that defines two matrices, A and B, as NumPy arrays.
  • Modify the script to implement matrix multiplication using either CUDA or ROCm, aiming to improve performance by minimizing data transfers between the CPU and GPU.
  • Compare the execution time of the original NumPy matrix multiplication with your GPU-accelerated version for matrices of size 100x100.

Expected Learning Outcome: You will understand the importance of memory optimization in GPU programming and how to leverage CUDA or ROCm to achieve faster matrix multiplication.
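
As one possible starting point, here is an illustrative numba.cuda sketch (not the provided script; the naive kernel and timing code are placeholders, and shared-memory tiling is left as the optimization step):

    # Starter skeleton only: naive matrix multiply with one-off host-to-device transfers.
    import time
    import numpy as np
    from numba import cuda

    @cuda.jit
    def matmul(A, B, C):
        row, col = cuda.grid(2)
        if row < C.shape[0] and col < C.shape[1]:
            acc = 0.0
            for k in range(A.shape[1]):
                acc += A[row, k] * B[k, col]
            C[row, col] = acc

    n = 100
    A = np.random.rand(n, n).astype(np.float32)
    B = np.random.rand(n, n).astype(np.float32)

    t0 = time.perf_counter()
    C_ref = A @ B
    print("NumPy:", time.perf_counter() - t0)

    d_A, d_B = cuda.to_device(A), cuda.to_device(B)   # transfer the inputs once
    d_C = cuda.device_array((n, n), dtype=np.float32)

    threads = (16, 16)
    blocks = ((n + 15) // 16, (n + 15) // 16)
    t0 = time.perf_counter()                          # note: the first call includes JIT compilation;
    matmul[blocks, threads](d_A, d_B, d_C)            # time a second call for a fairer comparison
    cuda.synchronize()
    print("GPU  :", time.perf_counter() - t0)

    assert np.allclose(d_C.copy_to_host(), C_ref, atol=1e-3)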


Exercise: Signal processing kernels

Objective: Implement a simple 1D convolution kernel using either CUDA or ROCm.

Instructions:

  • You are given a Python script that defines a 1D convolution kernel function. This function takes an input signal and a kernel as arguments and returns the convolved signal.
  • Modify the script to utilize either CUDA or ROCm to accelerate the convolution operation. You'll need to adapt the kernel function to work with the GPU.
  • Run the modified script and observe the performance difference between the CPU-based and GPU-based convolution.

Expected Learning Outcome: You will understand how to adapt a simple signal processing kernel for GPU execution using either CUDA or ROCm, and observe the potential performance benefits of GPU acceleration.
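
One possible shape for a solution (an illustrative numba.cuda sketch, not the provided script; the smoothing kernel, signal size, and "valid" mode are arbitrary choices):

    # Starter skeleton only: naive "valid" 1D convolution, one GPU thread per output sample.
    import numpy as np
    from numba import cuda

    @cuda.jit
    def conv1d(signal, kern, out):
        i = cuda.grid(1)
        if i < out.size:
            acc = 0.0
            for j in range(kern.size):
                acc += signal[i + j] * kern[j]
            out[i] = acc

    signal = np.random.rand(1 << 20).astype(np.float32)
    kern = np.array([0.25, 0.5, 0.25], dtype=np.float32)          # small smoothing kernel
    out = cuda.device_array(signal.size - kern.size + 1, dtype=np.float32)

    threads = 256
    blocks = (out.size + threads - 1) // threads
    conv1d[blocks, threads](cuda.to_device(signal), cuda.to_device(kern), out)
    result = out.copy_to_host()

    # CPU reference for correctness checking and timing comparison
    assert np.allclose(result, np.convolve(signal, kern[::-1], mode="valid"), atol=1e-5)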

