Graphics Processing Unit (GPU)
Overview
Time: 30 min
Learn what a GPU is and how it differs from a CPU.
Learn the basic architecture of a GPU.
Learn how one vendor differs from another.
GPU Overview
A Graphics Processing Unit (GPU) is a specialized processor, originally built to accelerate graphics rendering and now widely used to accelerate general computation. Unlike a Central Processing Unit (CPU), which is optimized for general-purpose tasks, a GPU is optimized for parallel processing, making it particularly effective for work that can be split into many pieces and executed at the same time.
Important
Concurrency is the ability to have multiple tasks in progress over the same period, for example by interleaving them on a single core. Parallelism is the simultaneous execution of multiple tasks on separate hardware units. Not all concurrent tasks are parallel, but all parallel tasks are concurrent.
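The distinction can be made concrete with a small simulation. The sketch below is plain Python, not GPU code, and the function name `run_time_sliced` and the task names are hypothetical. It models a number of workers that each advance one task by one time slice per tick. With one worker the tasks are concurrent (both are in progress, interleaved) but not parallel; with two workers they are also parallel.

```python
from collections import deque


def run_time_sliced(tasks, n_workers):
    """Simulate round-robin execution of tasks on n_workers.

    tasks maps a task name to the number of time slices it needs.
    Each tick, up to n_workers tasks advance by one slice.
    Returns (number of ticks, list of task names run in each tick).
    """
    queue = deque(tasks.items())
    schedule = []
    while queue:
        # Pick as many waiting tasks as there are free workers.
        running = [queue.popleft() for _ in range(min(n_workers, len(queue)))]
        schedule.append([name for name, _ in running])
        # Unfinished tasks go to the back of the queue.
        for name, remaining in running:
            if remaining > 1:
                queue.append((name, remaining - 1))
    return len(schedule), schedule


work = {"A": 2, "B": 2}
print(run_time_sliced(work, n_workers=1))  # (4, [['A'], ['B'], ['A'], ['B']])
print(run_time_sliced(work, n_workers=2))  # (2, [['A', 'B'], ['A', 'B']])
```

In both runs the two tasks overlap in time, but only the two-worker run executes them in the same tick, which is why it finishes in half as many ticks.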
A GPU is not a replacement for a CPU, but rather a complementary processor that can significantly speed up certain types of computations, especially those involving large datasets or complex mathematical operations.
CPU vs GPU
A CPU is designed for general-purpose computing and is optimized for tasks that require high single-threaded performance. It has a small number of powerful cores (typically 4 to 16 on a desktop processor, more on server processors) that can each handle a few threads, so it is tuned for low latency per thread rather than for massive parallelism. In contrast, a GPU has thousands of smaller, simpler cores that can run many threads simultaneously, making it ideal for tasks that can be parallelized.
GPU Architecture
Explanation
Gadi only has NVIDIA GPUs, so in this material "GPU" means an NVIDIA GPU. Nevertheless, most concepts discussed here apply across vendors.
GPUs are composed of multiple Streaming Multiprocessors (SMs), an on-chip L2 cache, and high-bandwidth DRAM.
Modern NVIDIA GPUs use a hierarchical memory and compute architecture designed for high throughput and parallelism. This document explains the roles of:
Streaming Multiprocessors (SMs)
CUDA cores
L1 and L2 caches
These components work together to accelerate parallel tasks such as scientific computing, deep learning, and graphics rendering.
Streaming Multiprocessors (SMs)
| Feature | Description |
|---|---|
| Definition | A Streaming Multiprocessor is the basic unit of computation in an NVIDIA GPU. |
| Contains | Multiple CUDA cores, L1 cache, warp schedulers, registers, and shared memory |
| Responsibility | Executes multiple warps (groups of 32 threads) in parallel |
| Independent Execution | Each SM can execute multiple thread blocks concurrently (if resources allow) |
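Because an SM schedules threads in warps of 32, the size of a thread block determines how many warps it occupies and whether the last warp is only partly filled. The following is a minimal arithmetic sketch in plain Python (not GPU code); the helper names `warps_per_block` and `idle_lanes` are hypothetical and chosen only for illustration.

```python
import math

WARP_SIZE = 32  # threads per warp, as stated above


def warps_per_block(threads_per_block):
    """Warps needed to hold one thread block; a partial warp still counts as one."""
    return math.ceil(threads_per_block / WARP_SIZE)


def idle_lanes(threads_per_block):
    """Thread slots left unused in the last, partially filled warp."""
    return warps_per_block(threads_per_block) * WARP_SIZE - threads_per_block


for size in (32, 100, 256):
    print(size, warps_per_block(size), idle_lanes(size))
# 32 1 0
# 100 4 28
# 256 8 0
```

A block of 100 threads occupies four warps but leaves 28 slots idle in the last one, which is one reason block sizes are usually chosen as multiples of 32.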
CUDA Cores
| Feature | Description |
|---|---|
| Definition | The smallest execution units inside an SM |
| Function | Execute arithmetic and logic instructions for individual threads |
| Number per SM | Varies by architecture (e.g., 64 in Volta, up to 128 or more in newer GPUs) |
| Highly Parallel | Thousands of CUDA cores on a GPU enable massive parallel execution |
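The "thousands of CUDA cores" figure is simply the number of SMs multiplied by the number of CUDA cores per SM. The sketch below works this out in plain Python. The SM counts and per-SM core counts are not from this lesson; they are the commonly quoted figures for these data-centre products and should be checked against NVIDIA's data sheets for the exact model in use.

```python
# Assumed figures: (number of SMs, FP32 CUDA cores per SM).
# Verify against the vendor data sheet for a specific card.
GPUS = {
    "V100": (80, 64),
    "A100": (108, 64),
    "H100 SXM": (132, 128),
}


def total_cuda_cores(sm_count, cores_per_sm):
    """Total FP32 CUDA cores on the GPU."""
    return sm_count * cores_per_sm


for name, (sm_count, cores_per_sm) in GPUS.items():
    print(f"{name}: {sm_count} SMs x {cores_per_sm} = "
          f"{total_cuda_cores(sm_count, cores_per_sm)} CUDA cores")
# V100: 80 SMs x 64 = 5120 CUDA cores
# A100: 108 SMs x 64 = 6912 CUDA cores
# H100 SXM: 132 SMs x 128 = 16896 CUDA cores
```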
While CUDA cores are often thought of as “identical tiny processors,” in practice, not all CUDA cores are the same in capability or function, especially within an SM (Streaming Multiprocessor).
CUDA cores are primarily designed to handle:
Integer operations
32-bit floating-point (FP32) arithmetic
However, other types of operations are handled by separate, dedicated units inside the SM rather than by the standard CUDA cores:
64-bit floating point (FP64)
Tensor operations (for AI)
Special function math (e.g., sin, sqrt)
For example:
Multiply two FP32 arrays → goes to CUDA cores
Apply sin() → goes to SFU (Special Function Unit)
Multiply matrices → goes to Tensor cores
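The routing described above can be summarised as a lookup from operation type to execution unit. This is only a conceptual sketch in plain Python: the hardware and compiler make this decision, not user code, and the operation keys and the `unit_for` helper are hypothetical names. The "FP64 unit" label is an assumption, since the text above does not name the unit that handles 64-bit arithmetic, and matrix multiplications reach Tensor cores only for supported data types.

```python
# Conceptual mapping from operation type to the unit that executes it.
EXECUTION_UNIT = {
    "int": "CUDA core",
    "fp32": "CUDA core",
    "fp64": "FP64 unit",      # assumed label for the dedicated FP64 hardware
    "sin": "SFU",             # Special Function Unit
    "sqrt": "SFU",
    "matmul": "Tensor core",  # for data types the Tensor cores support
}


def unit_for(operation):
    """Return the execution unit for an operation, or 'unknown' if not listed."""
    return EXECUTION_UNIT.get(operation, "unknown")


for op in ("fp32", "sin", "matmul"):
    print(f"{op} -> {unit_for(op)}")
# fp32 -> CUDA core
# sin -> SFU
# matmul -> Tensor core
```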
L1 Cache (Level 1)
| Feature | Description |
|---|---|
| Scope | Private to each Streaming Multiprocessor (SM) |
| Latency | Very low |
| Size | 192 - 256 KB per SM depending on the GPU architecture |
| Usage | Caches data frequently accessed by the threads running on the SM; shares its on-chip storage with shared memory |
| Functionality | Reduces traffic to global memory and accelerates spatially local data access |
Note: In recent architectures like Ampere and Hopper, L1 cache is unified with shared memory, allowing configurable allocation between them.
L2 Cache (Level 2)
| Feature | Description |
|---|---|
| Scope | Shared across all SMs |
| Latency | Higher than L1, but much lower than global memory (DRAM) |
| Size | 40 - 50 MB depending on the GPU architecture |
| Usage | Caches global memory accesses for all SMs |
| Coherency | Coherent across all SMs; changes made in one SM are visible to others |
How They Work Together
Threads are scheduled and executed on CUDA cores inside an SM.
Each SM has fast L1 cache and access to configurable shared memory, which helps speed up local computation.
Data not found in L1 is fetched from the L2 cache, which acts as a buffer before accessing global memory (DRAM).
This hierarchy helps reduce latency, increase memory throughput, and enable high-performance parallel execution.
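The effect of the hierarchy on latency can be estimated with a simple weighted average: a request is served by L1 if it hits there, otherwise by L2, otherwise by DRAM. The sketch below is plain Python (not GPU code). The cycle counts are hypothetical placeholders chosen only to respect the ordering L1 < L2 < DRAM described above; they are not measured values for any particular GPU.

```python
def average_access_latency(l1_hit_rate, l2_hit_rate,
                           l1_cycles=30, l2_cycles=200, dram_cycles=500):
    """Expected latency of one memory access, in cycles.

    l1_hit_rate: fraction of accesses served by L1.
    l2_hit_rate: fraction of L1 misses served by L2.
    The default cycle counts are illustrative assumptions only.
    """
    l1_miss_rate = 1.0 - l1_hit_rate
    l2_miss_rate = 1.0 - l2_hit_rate
    after_l1_miss = l2_hit_rate * l2_cycles + l2_miss_rate * dram_cycles
    return l1_hit_rate * l1_cycles + l1_miss_rate * after_l1_miss


# Poor locality: every access goes all the way to DRAM.
print(average_access_latency(0.0, 0.0))
# Half of the accesses hit L1, and half of the misses hit L2.
print(average_access_latency(0.5, 0.5))
```

Even with these made-up numbers the trend is clear: the more accesses the caches absorb, the closer the average latency gets to the L1 figure.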
Key Points
A GPU is a specialized processor designed for parallel processing.
It has thousands of smaller cores optimized for concurrent tasks.
The architecture includes Streaming Multiprocessors (SMs), CUDA cores, and multiple levels of cache.
GPUs are not a replacement for CPUs but complement them for specific tasks.