The CuTe DSL (Domain-Specific Language) is a Python framework for writing high-performance CUDA kernels using CuTe’s layout algebra and tensor abstractions. It provides a Pythonic interface to CUTLASS’s CuTe library, enabling kernel development with automatic compilation to optimized PTX/SASS.
Overview
CuTe DSL allows you to write CUDA kernels in Python that are compiled to efficient GPU code, with full access to:
- Layout Algebra - Express complex memory access patterns
- Tensor Abstractions - High-level tensor operations
- Hardware Features - Tensor Cores, TMA, async copy, barriers
- JIT Compilation - Automatic optimization and compilation
CuTe DSL kernels achieve performance comparable to hand-written CUDA C++ while offering Python’s development productivity.
Installation
- CUDA 12.0+ (CUDA 13 for latest features)
- Python 3.8+
- GPU: Ampere (SM80+), Hopper (SM90), or Blackwell
Core Concepts
Decorators
@cute.kernel
Defines a CUDA kernel that runs on the GPU:
- Type annotations specify tensor types
- Access thread/block indices with cute.arch
- Use CuTe operations inside the kernel
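The thread and block indices exposed through cute.arch follow the standard CUDA execution model. The sketch below is plain Python, not CuTe DSL code: it only simulates on the host how a 1D kernel usually combines a block index, a block size, and a thread index into one global element index, and why a bounds check is needed when the problem size is not a multiple of the block size. The helper names global_index and simulate_launch are hypothetical.

```python
def global_index(block_idx: int, block_dim: int, thread_idx: int) -> int:
    """Flat element index handled by one thread in a 1D launch."""
    return block_idx * block_dim + thread_idx


def simulate_launch(n: int, block_dim: int):
    """Simulate a 1D grid on the host and count how often each element is visited."""
    grid_dim = (n + block_dim - 1) // block_dim  # ceil-divide so every element is covered
    visits = [0] * n
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            i = global_index(block_idx, block_dim, thread_idx)
            if i < n:  # guard: the last block may overhang the data
                visits[i] += 1
    return grid_dim, visits


grid_dim, visits = simulate_launch(n=1000, block_dim=256)
print(grid_dim)                     # 4
print(all(v == 1 for v in visits))  # True
```

With 1000 elements and 256 threads per block, the fourth block has 24 threads that fall outside the data, which is exactly what the guard filters out.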
@cute.jit
Marks a host function for JIT compilation:
Layouts
Layouts define the mapping between logical coordinates and memory addresses.
Creating Layouts
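As a mental model for that coordinate-to-address mapping, a layout can be thought of as a (shape, stride) pair, the same two arguments that cute.make_layout(shape, stride) takes. The following is a plain-Python sketch of the arithmetic only, with a hypothetical helper layout_offset; it does not call the CuTe DSL.

```python
from itertools import product


def layout_offset(coord, stride):
    """Map a logical coordinate to a linear offset: sum of coord[i] * stride[i]."""
    return sum(c * s for c, s in zip(coord, stride))


shape = (4, 8)             # 4 rows, 8 columns
col_major_stride = (1, 4)  # incrementing the row index moves 1 element
row_major_stride = (8, 1)  # incrementing the column index moves 1 element

print(layout_offset((2, 3), col_major_stride))  # 14
print(layout_offset((2, 3), row_major_stride))  # 19

# Both layouts cover the same 32 offsets exactly once; only the order differs.
coords = list(product(range(shape[0]), range(shape[1])))
col_offsets = sorted(layout_offset(c, col_major_stride) for c in coords)
row_offsets = sorted(layout_offset(c, row_major_stride) for c in coords)
assert col_offsets == row_offsets == list(range(32))
```

The same logical coordinate (2, 3) lands at a different address under each stride, which is the whole point of separating the logical shape from the memory order.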
Composed Layouts
Create hierarchical layouts for tiling.
Thread-Value (TV) Layouts
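A thread-value layout answers one question: which element of a tile does value v of thread t own? The sketch below hand-writes such a mapping in plain Python for an assumed arrangement of 2 x 4 threads, each owning a 1 x 4 strip of values. It is a model of the idea, not the output of cute.make_layout_tv, and the names THR_SHAPE, VAL_SHAPE, and tv_to_coord are hypothetical.

```python
from itertools import product

THR_SHAPE = (2, 4)  # threads arranged as 2 rows x 4 columns (8 threads)
VAL_SHAPE = (1, 4)  # each thread owns a 1 x 4 strip of values
TILE_SHAPE = (THR_SHAPE[0] * VAL_SHAPE[0], THR_SHAPE[1] * VAL_SHAPE[1])  # (2, 16)


def tv_to_coord(thread_id: int, value_id: int):
    """Map (thread, value) to the (row, col) tile coordinate that pair owns."""
    thr_row, thr_col = divmod(thread_id, THR_SHAPE[1])  # row-major thread order
    val_row, val_col = divmod(value_id, VAL_SHAPE[1])
    return (thr_row * VAL_SHAPE[0] + val_row, thr_col * VAL_SHAPE[1] + val_col)


print(tv_to_coord(5, 2))  # (1, 6)

num_threads = THR_SHAPE[0] * THR_SHAPE[1]
num_values = VAL_SHAPE[0] * VAL_SHAPE[1]
owned = [tv_to_coord(t, v) for t, v in product(range(num_threads), range(num_values))]
# Every tile coordinate is owned by exactly one (thread, value) pair.
assert sorted(owned) == sorted(product(range(TILE_SHAPE[0]), range(TILE_SHAPE[1])))
```

Giving each thread a contiguous strip of four values is a common choice because it lets one thread move its values with a single wide access.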
Map threads to data elements.
Tensors
Tensors combine data pointers with layouts.
Tensor Operations
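To make "a tensor is a pointer plus a layout" concrete, the sketch below models a tensor in plain Python as a buffer, a base offset, and a (shape, stride) layout, and shows that taking a tile only changes the base offset and the shape. TensorView and its tile method are hypothetical stand-ins written for illustration; they are loosely analogous in spirit to tiling functions such as cute.local_tile, not a reproduction of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorView:
    """Hypothetical stand-in for a tensor: a buffer, a base offset, and a layout."""
    data: list
    base: int
    shape: tuple
    stride: tuple

    def __getitem__(self, coord):
        if any(not 0 <= c < extent for c, extent in zip(coord, self.shape)):
            raise IndexError(f"coordinate {coord} outside shape {self.shape}")
        return self.data[self.base + sum(c * s for c, s in zip(coord, self.stride))]

    def tile(self, tile_shape, tile_coord):
        """View of one tile: same buffer and strides, shifted base, smaller shape."""
        shift = sum(tc * ts * s for tc, ts, s in zip(tile_coord, tile_shape, self.stride))
        return TensorView(self.data, self.base + shift, tile_shape, self.stride)


buffer = list(range(32))
t = TensorView(buffer, base=0, shape=(4, 8), stride=(8, 1))  # row-major 4 x 8
block = t.tile(tile_shape=(2, 4), tile_coord=(1, 1))          # rows 2-3, columns 4-7

print(t[(2, 3)])      # 19
print(block[(0, 0)])  # 20, same element as t[(2, 4)]
print(block[(1, 3)])  # 31, same element as t[(3, 7)]
```

No data is copied when the tile is taken, which is why tiling a tensor per block or per thread is cheap.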
Copy Operations
Copy Atoms
Define how data is copied.
Tiled Copy
Distribute copy operations across threads.
Fragments
Register memory storage.
MMA (Matrix Multiply-Accumulate)
Utilize Tensor Cores for matrix multiplication.
Shared Memory
Allocate and manage shared memory.
Synchronization
Control Flow
Compile-Time Ranges
Dynamic Control Flow
Complete Example: Elementwise Operation
Advanced Features
Multi-Stage Pipelines
Overlap memory transfers with computation.
TMA (Tensor Memory Accelerator)
Hopper+ feature for efficient bulk transfers.
PyTorch Integration
Seamless integration with PyTorch.
JAX Integration
Debugging
Print Statements
Compile with Debug Info
Use NCU (NVIDIA Nsight Compute)
Performance Tips
Memory Coalescing
Ensure contiguous threads access contiguous memory:
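The effect can be estimated with a simplified model: count how many distinct memory segments one warp touches for a given access stride. The sketch below is plain Python and assumes 32 threads per warp, 4-byte elements, an aligned base address, and 128-byte segments; the real transaction granularity depends on the architecture, and segments_touched is a hypothetical helper.

```python
WARP_SIZE = 32
SEGMENT_BYTES = 128  # assumed segment size for this simplified model
ELEM_BYTES = 4       # float32


def segments_touched(stride_elems: int) -> int:
    """Distinct segments one warp touches when thread t reads element t * stride_elems."""
    addresses = [t * stride_elems * ELEM_BYTES for t in range(WARP_SIZE)]
    return len({addr // SEGMENT_BYTES for addr in addresses})


print(segments_touched(1))   # 1: contiguous threads read contiguous elements
print(segments_touched(32))  # 32: every thread lands in its own segment
```

Under these assumptions a stride of 32 elements moves 32 times as many segments for the same amount of useful data.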
Bank Conflicts
Add padding to shared memory layouts:
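Why padding helps can be checked with a small bank model. The plain-Python sketch below assumes 32 shared-memory banks that are each one 4-byte word wide, and a warp in which thread t reads the first element of row t of a tile (a column access). The helper max_bank_conflict is hypothetical and only counts how many threads hit the busiest bank.

```python
from collections import Counter

NUM_BANKS = 32  # assumed: 32 banks, each one 4-byte word wide
WARP_SIZE = 32


def max_bank_conflict(row_stride_words: int) -> int:
    """Threads hitting the busiest bank when thread t reads word t * row_stride_words."""
    banks = Counter((t * row_stride_words) % NUM_BANKS for t in range(WARP_SIZE))
    return max(banks.values())


print(max_bank_conflict(32))  # 32: every thread maps to bank 0
print(max_bank_conflict(33))  # 1: one padding word per row spreads the banks
```

In layout terms the padded case corresponds to a row stride of 33 instead of 32 for a 32-column tile, at the cost of one unused word per row.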
Vectorization
Use 128-bit loads/stores when possible:
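Whether a 128-bit access is possible depends on the element width, on how many elements are contiguous, and on alignment. The following plain-Python sketch captures that rule of thumb with a hypothetical helper max_vector_elems; it is a simplified model, not how the compiler or cute.autovec_copy actually decides.

```python
def max_vector_elems(elem_bits: int, contiguous_elems: int, byte_offset: int,
                     max_bits: int = 128) -> int:
    """Largest power-of-two element count per access, capped at max_bits wide."""
    elems = max_bits // elem_bits
    while elems > 1:
        vec_bytes = elems * elem_bits // 8
        # The vector must evenly divide the contiguous run and start aligned.
        if contiguous_elems % elems == 0 and byte_offset % vec_bytes == 0:
            return elems
        elems //= 2
    return 1


print(max_vector_elems(32, 1024, 0))  # 4: four float32 values per 128-bit access
print(max_vector_elems(16, 1024, 0))  # 8: eight float16 values per 128-bit access
print(max_vector_elems(32, 1024, 8))  # 2: only 8-byte aligned, so 64-bit accesses
print(max_vector_elems(32, 6, 0))     # 2: 6 contiguous elements do not divide by 4
```

The last two cases show the usual reasons a kernel falls back to narrower accesses: a misaligned base pointer or a contiguous extent that is not a multiple of the vector width.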
Occupancy
Balance threads, registers, and shared memory:
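The trade-off can be estimated by finding which resource runs out first on one SM. The plain-Python sketch below is a simplified model: the per-SM limits in SMLimits are illustrative defaults, not values queried from hardware, allocation granularity is ignored, and all names are hypothetical. Use a profiler or the vendor occupancy tools for real numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    """Per-SM resource limits. Defaults are illustrative; check your GPU's specifications."""
    max_threads: int = 2048
    max_blocks: int = 32
    registers: int = 65536
    shared_mem_bytes: int = 164 * 1024


def blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block, sm=SMLimits()):
    """Resident blocks per SM: the tightest of the thread, register, and smem limits."""
    limits = [
        sm.max_blocks,
        sm.max_threads // threads_per_block,
        sm.registers // (threads_per_block * regs_per_thread),
    ]
    if smem_per_block > 0:
        limits.append(sm.shared_mem_bytes // smem_per_block)
    return min(limits)


def occupancy(threads_per_block, regs_per_thread, smem_per_block, sm=SMLimits()):
    """Fraction of the SM's thread slots that are occupied."""
    blocks = blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block, sm)
    return blocks * threads_per_block / sm.max_threads


print(blocks_per_sm(256, 64, 48 * 1024))  # 3: shared memory is the limiter
print(occupancy(256, 64, 48 * 1024))      # 0.375
print(occupancy(256, 32, 0))              # 1.0
```

In the first case threads alone would allow 8 blocks and registers 4, but 48 KiB of shared memory per block allows only 3, so reducing shared memory use would raise occupancy before anything else does.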
API Summary
Decorators
- @cute.kernel - Define GPU kernel
- @cute.jit - Mark function for JIT compilation
Layout Functions
- cute.make_layout(shape, stride) - Create layout
- cute.make_ordered_layout(shape, order) - Create ordered layout
- cute.make_layout_tv(thr_layout, val_layout) - Create TV layout
Tensor Functions
- cute.make_tensor(ptr, layout) - Create tensor
- cute.make_fragment_like(tensor) - Create fragment
- cute.make_rmem_tensor(shape, dtype) - Allocate registers
- cute.make_identity_tensor(shape) - Create coordinate tensor
- cute.zipped_divide(tensor, tiler) - Tile tensor
- cute.local_tile(tensor, tiler, coord, proj) - Local tile
- cute.size(tensor, mode) - Get size
Copy Functions
- cute.make_copy_atom(op, dtype, ...) - Create copy atom
- cute.make_tiled_copy_tv(atom, thr, val) - Create tiled copy
- cute.copy(atom, src, dst, pred) - Perform copy
- cute.autovec_copy(src, dst) - Auto-vectorized copy
MMA Functions
- cute.nvgpu.MmaUniversalOp(dtype) - Create MMA op
- cute.make_tiled_mma(op, layout, ...) - Create tiled MMA
- cute.gemm(mma, c, a, b, c) - Matrix multiply
Arch Functions
- cute.arch.thread_idx() - Thread index
- cute.arch.block_idx() - Block index
- cute.arch.block_dim() - Block dimensions
- cute.arch.grid_dim() - Grid dimensions
- cute.arch.syncthreads() - Block barrier
- cute.arch.cp_async_commit_group() - Commit async
- cute.arch.cp_async_wait_group(n) - Wait async
Compilation
- cute.compile(fn, *args, **opts) - Compile function
- cute.compile[options](fn, *args) - Compile with options
Next Steps
- Examples - Explore complete kernel examples
- Quickstart - Quick introduction to CuTe DSL
- PyTorch Integration - Build PyTorch extensions