Introduction
CuTe (CUDA Templates) is a collection of C++ CUDA template abstractions for defining and operating on hierarchically multidimensional layouts of threads and data. It provides a composable, structured approach to writing high-performance CUDA kernels.
CuTe is the foundation for CUTLASS 3.x kernels and provides a more modern, composable alternative to the CUTLASS 2.x hierarchy.
Core Concepts
CuTe is built around three fundamental abstractions:
Layout: describes the mapping from logical coordinates to linear indices.
Tensor: combines data (a pointer) with a Layout to provide structured access.
Algorithms: operations on Tensors (copy, fill, GEMM, etc.).
Layouts
Layouts define the relationship between multi-dimensional coordinates and linear memory addresses.
Layout Definition
include/cute/layout.hpp:98
template <class Shape, class Stride = LayoutLeft::Apply<Shape>>
struct Layout {
  CUTE_HOST_DEVICE constexpr
  Layout(Shape const& shape = {}, Stride const& stride = {});

  template <int... I>
  CUTE_HOST_DEVICE constexpr
  decltype(auto) shape() const;

  template <int... I>
  CUTE_HOST_DEVICE constexpr
  decltype(auto) stride() const;
};
Creating Layouts
Basic Layouts
using namespace cute;

// 1D layout: 8 elements
auto layout_1d = make_layout(8);

// 2D layout: 4×8 column-major
auto layout_2d = make_layout(
    make_shape(4, 8),   // Shape: (4, 8)
    make_stride(1, 4)   // Stride: (1, 4) - column-major
);

// 2D layout: 4×8 row-major
auto layout_2d_row = make_layout(
    make_shape(4, 8),
    make_stride(8, 1)   // Stride: (8, 1) - row-major
);

// Hierarchical layout: same mapping as (4,8):(1,4), with each mode split as ((2,2),(4,2))
auto layout_hier = make_layout(
    make_shape(make_shape(2, 2), make_shape(4, 2)),
    make_stride(make_stride(1, 2), make_stride(4, 16))
);
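To make the shape and stride arithmetic concrete, here is a minimal standalone sketch in plain C++ (no CuTe headers). `Layout2D`, `hierarchical_index`, and `hierarchical_matches_col_major` are hypothetical names introduced for illustration only. They model the index computation by hand and are not CuTe's implementation.

```cpp
#include <array>

// A flat 2D layout: index = i * stride[0] + j * stride[1].
struct Layout2D {
    std::array<int, 2> shape;
    std::array<int, 2> stride;

    constexpr int operator()(int i, int j) const {
        return i * stride[0] + j * stride[1];
    }
};

constexpr Layout2D col_major{{4, 8}, {1, 4}};  // (4,8):(1,4)
constexpr Layout2D row_major{{4, 8}, {8, 1}};  // (4,8):(8,1)

// Hand-written model of ((2,2),(4,2)):((1,2),(4,16)).
// The row coordinate i in [0,4) splits into (i0, i1) over shape (2,2);
// the column coordinate j in [0,8) splits into (j0, j1) over shape (4,2).
constexpr int hierarchical_index(int i, int j) {
    int i0 = i % 2, i1 = i / 2;
    int j0 = j % 4, j1 = j / 4;
    return i0 * 1 + i1 * 2 + j0 * 4 + j1 * 16;
}

// True if the hierarchical layout agrees with the flat column-major
// layout at every coordinate of the 4x8 domain.
constexpr bool hierarchical_matches_col_major() {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 8; ++j)
            if (hierarchical_index(i, j) != col_major(i, j)) return false;
    return true;
}
```

In this model, coordinate (2, 3) maps to index 2·1 + 3·4 = 14 in the column-major layout and to 2·8 + 3·1 = 19 in the row-major one.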
Layout Operations
make_layout(shape, stride)
Creates a Layout from a shape and stride
size(layout)
Returns the total number of elements in the layout
rank(layout)
Returns the number of top-level modes (dimensions) in the layout
depth(layout)
Returns the hierarchical nesting depth of the layout
coalesce(layout)
Simplifies the layout by merging adjacent modes with compatible strides
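The merging rule behind coalescing can be sketched on a flat list of (shape, stride) modes. This is a simplified model in plain C++, assuming only flat, non-negative strides. `Mode`, `layout_size`, and `coalesce_flat` are hypothetical names, and the real CuTe routine handles more cases (nested and static modes, for example).

```cpp
#include <utility>
#include <vector>

using Mode = std::pair<int, int>;  // (shape, stride) of one flat mode

// Total number of elements: the product of all mode shapes.
int layout_size(const std::vector<Mode>& modes) {
    int n = 1;
    for (const Mode& m : modes) n *= m.first;
    return n;
}

// Merge adjacent modes when the next stride continues the previous mode,
// i.e. next.stride == prev.shape * prev.stride. Size-1 modes are dropped
// because they contribute nothing to the mapping.
std::vector<Mode> coalesce_flat(const std::vector<Mode>& modes) {
    std::vector<Mode> out;
    for (const Mode& m : modes) {
        if (m.first == 1) continue;
        if (!out.empty() && m.second == out.back().first * out.back().second) {
            out.back().first *= m.first;  // extend the previous mode
        } else {
            out.push_back(m);
        }
    }
    if (out.empty()) out.push_back(Mode{1, 0});  // single-element layout
    return out;
}
```

Under this rule the flattened hierarchical layout (2,2,4,2):(1,2,4,16) collapses to the single mode 32:1, while the row-major (4,8):(8,1) keeps both modes because stride 1 does not equal 4·8.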
Tensors
Tensors combine a pointer with a Layout to provide structured multi-dimensional views of data.
Tensor Definition
include/cute/tensor_impl.hpp
template <class Engine, class Layout>
struct Tensor {
  using iterator     = typename Engine::iterator;
  using value_type   = typename Engine::value_type;
  using element_type = typename Engine::element_type;
  using reference    = typename Engine::reference;
};
Creating Tensors
Global Memory Tensors
using namespace cute;

// Create tensor from global memory pointer
float* gmem_ptr = /* ... */;
auto gmem_tensor = make_tensor(
    make_gmem_ptr(gmem_ptr),
    make_layout(make_shape(M, N), make_stride(1, M))  // Column-major
);

// Indexing: gmem_tensor(i, j) accesses element at row i, column j
float value = gmem_tensor(2, 3);
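The idea of a tensor as a pointer plus a layout can be modeled without CuTe. The sketch below is a plain C++ stand-in: `View2D` and `make_col_major_view` are hypothetical names, and the view is non-owning, so the underlying buffer must outlive it.

```cpp
#include <cassert>

// Hypothetical stand-in for a rank-2 tensor: a non-owning pointer plus strides.
template <class T>
struct View2D {
    T*  data;
    int rows, cols;
    int row_stride, col_stride;

    // Element access: apply the layout to (i, j), then dereference.
    T& operator()(int i, int j) const {
        assert(i >= 0 && i < rows && j >= 0 && j < cols);
        return data[i * row_stride + j * col_stride];
    }
};

// Column-major view over an existing buffer of M*N elements: strides (1, M).
template <class T>
View2D<T> make_col_major_view(T* data, int M, int N) {
    return View2D<T>{data, M, N, 1, M};
}
```

Writing through the view modifies the original buffer, which mirrors how a tensor over global memory refers to the data rather than copying it.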
Tensor Operations
make_tensor(pointer, layout)
Creates a Tensor from a pointer and layout
local_tile(tensor, tile_shape, coord)
Extracts a tile from a tensor at the given coordinate
local_partition(tensor, thread_layout, thread_id)
Partitions a tensor across threads according to a thread layout
Creates a new tensor with the same shape and layout (useful for accumulators)
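As a rough mental model of thread partitioning, assume a column-major thread layout of shape (Tm, Tn). Each thread then owns the elements of the tile that sit at its own coordinate, repeated with a step of (Tm, Tn). The plain C++ sketch below uses hypothetical names (`ThreadCoord`, `thread_coord`, `owned_element`) and covers only this simple 2D case.

```cpp
struct ThreadCoord { int row; int col; };

// Coordinate of thread tid in a column-major (Tm, Tn) thread layout.
constexpr ThreadCoord thread_coord(int tid, int Tm) {
    return ThreadCoord{tid % Tm, tid / Tm};
}

// Tile coordinate of the (i, j)-th element owned by thread tid:
// the thread's own coordinate, stepped by the thread-layout shape.
constexpr ThreadCoord owned_element(int tid, int i, int j, int Tm, int Tn) {
    ThreadCoord t = thread_coord(tid, Tm);
    return ThreadCoord{t.row + i * Tm, t.col + j * Tn};
}
```

For example, with a 16×8 thread layout over a 128×8 tile, thread 17 sits at (1, 1) and its element (2, 0) is tile coordinate (33, 1).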
Practical Example: GEMM Kernel
From examples/cute/tutorial/sgemm_1.cu:
Tensor Setup
template <class ProblemShape, class CtaTiler,
          class TA, class AStride,
          class TB, class BStride,
          class TC, class CStride,
          class Alpha, class Beta>
__global__ void gemm_device(
    ProblemShape shape_MNK, CtaTiler cta_tiler,
    TA const* A, AStride dA,
    TB const* B, BStride dB,
    TC*       C, CStride dC,
    Alpha alpha, Beta beta) {
  using namespace cute;

  // Create full tensor views
  Tensor mA = make_tensor(make_gmem_ptr(A),
                          select<0, 2>(shape_MNK), dA);  // (M,K)
  Tensor mB = make_tensor(make_gmem_ptr(B),
                          select<1, 2>(shape_MNK), dB);  // (N,K)
  Tensor mC = make_tensor(make_gmem_ptr(C),
                          select<0, 1>(shape_MNK), dC);  // (M,N)

  // Get this CTA's tile; the underscore keeps the whole k mode
  auto cta_coord = make_coord(blockIdx.x, blockIdx.y, _);               // (m,n,k)
  Tensor gA = local_tile(mA, cta_tiler, cta_coord, Step<_1, X, _1>{});  // (BLK_M,BLK_K,k)
  Tensor gB = local_tile(mB, cta_tiler, cta_coord, Step<X, _1, _1>{});  // (BLK_N,BLK_K,k)
  Tensor gC = local_tile(mC, cta_tiler, cta_coord, Step<_1, _1, X>{});  // (BLK_M,BLK_N)
}
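The tiling step above can be read as plain coordinate arithmetic. The sketch below models which element of A a CTA touches for a given tile-local coordinate. It is an illustration in standalone C++ with hypothetical names (`GlobalCoord`, `tile_to_global_A`, `linear_offset`), assuming tile extents BLK_M and BLK_K that evenly divide the problem.

```cpp
struct GlobalCoord { int row; int col; };

// Global (row, col) in A of element (m, k) inside the k_tile-th K-tile
// of the row block owned by the CTA with index block_m.
constexpr GlobalCoord tile_to_global_A(int block_m, int k_tile,
                                       int m, int k,
                                       int BLK_M, int BLK_K) {
    return GlobalCoord{block_m * BLK_M + m, k_tile * BLK_K + k};
}

// Linear offset of a global coordinate for strides (stride_m, stride_k).
constexpr long linear_offset(GlobalCoord c, long stride_m, long stride_k) {
    return c.row * stride_m + c.col * stride_k;
}
```

With the illustrative values BLK_M = 128 and BLK_K = 8, block 2, K-tile 3, and local coordinate (5, 1), the global coordinate is (261, 25). For a column-major A with M = 1024 that is offset 261 + 25·1024 = 25861.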
Algorithms
CuTe provides high-level algorithms for common operations.
Copy Operations
include/cute/algorithm/copy.hpp
// Simple copy
copy(src_tensor, dst_tensor);
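// Async copy (Ampere+): pass a cp.async Copy_Atom, then commit and wait
copy(Copy_Atom<SM80_CP_ASYNC_CACHEALWAYS<uint128_t>, float>{}, src_tensor, dst_tensor);
cp_async_fence();
cp_async_wait<0>();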
// Copy with predication
copy_if(predicate, src_tensor, dst_tensor);
// Cooperative copy across thread group
cooperative_copy<NumThreads>(tid, src_tensor, dst_tensor);
Fill and Clear
// Fill with value
fill(tensor, 3.14f);
// Clear to zero
clear(tensor);
GEMM Algorithm
include/cute/algorithm/gemm.hpp
// Register-level GEMM, accumulating in place: C += A * B
gemm(A_tensor, B_tensor, C_tensor);
// Cooperative GEMM across thread group
cooperative_gemm<NumThreads>(tid, A_tensor, B_tensor, C_tensor);
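For reference, the accumulate-in-place semantics can be written as a scalar loop. This sketch assumes the (M,K) by (N,K) operand convention used in the GEMM kernel on this page and column-major storage for all three matrices. `reference_gemm` is a hypothetical helper, not part of CuTe.

```cpp
#include <vector>

// Reference semantics: C(m,n) += sum over k of A(m,k) * B(n,k).
// A is (M,K), B is (N,K), C is (M,N); all stored column-major.
void reference_gemm(int M, int N, int K,
                    const std::vector<float>& A,
                    const std::vector<float>& B,
                    std::vector<float>& C) {
    for (int n = 0; n < N; ++n)
        for (int m = 0; m < M; ++m)
            for (int k = 0; k < K; ++k)
                C[m + n * M] += A[m + k * M] * B[n + k * N];
}
```

Because the result is accumulated, the caller decides whether C starts at zero or holds a previous partial sum.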
AXPBY (Linear Combination)
// Y = alpha * X + beta * Y
axpby(alpha, X_tensor, beta, Y_tensor);
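Elementwise, the linear combination amounts to the loop below. `reference_axpby` is a hypothetical scalar reference over `std::vector`, written only to pin down the semantics. It assumes both vectors have the same length.

```cpp
#include <cstddef>
#include <vector>

// y[i] = alpha * x[i] + beta * y[i] for every element.
void reference_axpby(float alpha, const std::vector<float>& x,
                     float beta, std::vector<float>& y) {
    for (std::size_t i = 0; i < y.size(); ++i)
        y[i] = alpha * x[i] + beta * y[i];
}
```

This is the shape of a typical GEMM epilogue, where X holds the accumulators and Y holds the existing contents of C.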
Special Layouts and Patterns
Swizzling
Swizzled layouts reduce shared memory bank conflicts:
using namespace cute;

// Create swizzled layout for shared memory
auto smem_layout = composition(
    Swizzle<3, 0, 3>{},  // XOR swizzle pattern
    make_layout(make_shape(128, 32), make_stride(1, 128))
);
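The XOR pattern can be illustrated on raw offsets. The sketch below is a plain C++ model under one stated assumption: that the three parameters mean (number of bits, base bit, shift), and that a positive shift XORs a group of higher bits into lower ones. `swizzle` and `swizzle_3_0_3` are hypothetical helpers, and negative shifts are not covered.

```cpp
#include <cstdint>

// XOR the BBits-wide bit field starting at bit (MBase + SShift)
// into the field starting at bit MBase.
template <int BBits, int MBase, int SShift>
constexpr std::uint32_t swizzle(std::uint32_t offset) {
    static_assert(BBits >= 0 && MBase >= 0 && SShift >= BBits,
                  "this sketch only covers positive shifts with SShift >= BBits");
    constexpr std::uint32_t bit_mask = (1u << BBits) - 1u;
    constexpr std::uint32_t src_mask = bit_mask << (MBase + SShift);
    return offset ^ ((offset & src_mask) >> SShift);
}

// Convenience wrapper for the <3, 0, 3> pattern used above.
constexpr std::uint32_t swizzle_3_0_3(std::uint32_t offset) {
    return swizzle<3, 0, 3>(offset);
}
```

In this model offsets 0 to 7 are unchanged, offset 8 becomes 9, and offset 63 becomes 56. Each group of 64 offsets is permuted within itself, so the mapping stays one-to-one while consecutive rows land in different banks.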
Blocked Layouts
// Create blocked/tiled layout
auto blocked = blocked_product(
    make_layout(make_shape(4, 8)),  // Block: the 4×8 layout being replicated
    make_layout(make_shape(2, 2))   // Tiler: blocks arranged in a 2×2 grid
);
// Result has shape ((4,2),(8,2)): an 8×16 layout built from 2×2 blocks of 4×8
Type System
Compile-Time Integers
using namespace cute;

// Compile-time integer
Int<4> static_four{};

// Arithmetic at compile time
auto result = Int<4>{} * Int<8>{};  // Int<32>

// Mix static and dynamic extents: a plain int is a dynamic dimension
int n = /* ... */;
auto shape = make_shape(Int<4>{}, n, Int<8>{});
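The mechanism behind compile-time integers is ordinary C++: the value lives in the type, so arithmetic on two such objects can return a new type. Below is a minimal sketch with a hypothetical `StaticInt`, a stand-in for illustration rather than CuTe's actual definition.

```cpp
#include <type_traits>

// The value is carried by the type, not by any runtime storage.
template <int N>
struct StaticInt {
    static constexpr int value = N;
};

// Static * static stays static: the product is encoded in the result type.
template <int A, int B>
constexpr StaticInt<A * B> operator*(StaticInt<A>, StaticInt<B>) {
    return {};
}

// Static * dynamic decays to a runtime int.
template <int A>
constexpr int operator*(StaticInt<A>, int b) {
    return A * b;
}

static_assert(std::is_same<decltype(StaticInt<4>{} * StaticInt<8>{}),
                           StaticInt<32>>::value,
              "the product of two static integers is itself static");
```

Mixing a static and a dynamic operand yields a runtime value, which is why a single dynamic extent in a shape makes the total size a runtime quantity.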
Tuples
using namespace cute;

// Create tuple
auto t = make_tuple(1, 2, 3);

// Access elements
auto first  = get<0>(t);  // 1
auto second = get<1>(t);  // 2

// Hierarchical tuples
auto nested = make_tuple(
    make_tuple(1, 2),
    make_tuple(3, 4)
);
Atom Types
Atoms describe hardware-specific instruction patterns.
Copy Atoms
#include <cute/atom/copy_atom.hpp>

// cp.async - 128-bit asynchronous copy from global to shared memory (Ampere+)
using GmemLoadAtom = Copy_Atom<SM80_CP_ASYNC_CACHEALWAYS<uint128_t>, float>;

// LDSM (ldmatrix) - shared memory matrix load (Turing+)
using SmemLoadAtom = Copy_Atom<SM75_U32x4_LDSM_N, half_t>;
MMA Atoms
#include <cute/atom/mma_atom.hpp>

// Tensor Core MMA atom (Ampere)
using MmaAtom = MMA_Atom<SM80_16x8x16_F16F16F16F16_TN>;

// Tensor Core MMA atom (Volta)
using VoltaMma = MMA_Atom<SM70_8x8x4_F32F16F16F32_TN>;
Debugging and Visualization
Print Utilities
using namespace cute;

// Print layout
auto layout = make_layout(make_shape(4, 8), make_stride(1, 4));
print(layout);
// Output: (4,8):(1,4)

// Print tensor
Tensor tensor = make_tensor(ptr, layout);
print_tensor(tensor);

// Print in LaTeX format for documentation
print_latex(layout);
Compile-Time Assertions
// These checks require a fully static layout, e.g. make_layout(make_shape(Int<4>{}, Int<8>{}))
CUTE_STATIC_ASSERT_V(size(layout) == Int<32>{});
CUTE_STATIC_ASSERT_V(rank(layout) == Int<2>{});
CUTE_STATIC_ASSERT(is_static<decltype(layout)>::value);
Key Functions Reference
Shape and Stride Utilities
make_shape(args...)
Creates a Shape tuple from arguments
make_stride(args...)
Creates a Stride tuple from arguments
make_coord(args...)
Creates a coordinate tuple for indexing
size<I>(x)
Returns the size of mode I (or total size if I omitted)
shape<I>(x)
Returns the shape of mode I (or full shape if I omitted)
stride<I>(x)
Returns the stride of mode I (or full stride if I omitted)
Composition and Manipulation
composition(layout_a, layout_b)
Composes two layouts as functions: layout_a ∘ layout_b, so the result maps a coordinate c to layout_a(layout_b(c))
complement(layout, shape)
Returns the complementary layout within a given shape
logical_divide(layout, tile)
Divides a layout into tiles
zipped_divide(layout, tile)
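Divides a layout into tiles and gathers the tile modes into the first mode and the remaining modes into the second, giving ((tile),(rest)); useful for thread partitioning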
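Layout composition is function composition: the result of composition(A, B) maps a coordinate c to A(B(c)). The sketch below shows this for the simplest case of two single-mode layouts in plain C++. `Layout1D` and `compose_1d` are hypothetical names, and the shortcut of multiplying strides is valid only when every value produced by B is a valid coordinate of A.

```cpp
// A single-mode layout shape:stride, mapping c in [0, shape) to c * stride.
struct Layout1D {
    int shape;
    int stride;

    constexpr int operator()(int c) const { return c * stride; }
};

// Composition of two single-mode layouts, R(c) = a(b(c)).
// Assumes (b.shape - 1) * b.stride < a.shape, so b's image fits in a's domain.
constexpr Layout1D compose_1d(Layout1D a, Layout1D b) {
    return Layout1D{b.shape, a.stride * b.stride};
}

// True if the composed layout agrees with a(b(c)) on b's whole domain.
constexpr bool composition_matches(Layout1D a, Layout1D b) {
    Layout1D r = compose_1d(a, b);
    for (int c = 0; c < b.shape; ++c)
        if (r(c) != a(b(c))) return false;
    return true;
}
```

For A = 8:2 and B = 4:2 the result is 4:4. B picks every second coordinate of A, and A then scales each of those by 2.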
Best Practices
Use Static Shapes When Possible: static shapes (Int<N>) enable compile-time optimizations and better code generation.
Leverage Layout Composition: build complex layouts by composing simpler ones; this is more maintainable and often more efficient.
Partition Before Loop: partition tensors across threads outside loops to avoid recomputation.
Use Typed Tensors: let CuTe's type system catch shape mismatches at compile time.
Advanced Topics
TMA (Tensor Memory Accelerator)
TMA is the Hopper architecture's hardware unit for asynchronous bulk tensor copies between global and shared memory:
#include <cute/atom/copy_atom.hpp>
// TMA descriptor-based copy
using TmaLoadAtom = Copy_Atom<SM90_TMA_LOAD, float>;
// Use in kernel with TMA descriptor
Warp-Specialized Kernels
Different warps perform different roles:
int warp_id = threadIdx.x / 32;
int lane_id = threadIdx.x % 32;
if (warp_id == 0) {
  // Producer warp: load data
} else {
  // Consumer warp: compute
}
See Also
CuTe Tutorials: examples in examples/cute/tutorial/ demonstrate progressive complexity.
GEMM API: high-level GEMM API built on CuTe (CUTLASS 3.x).
Architecture Guide: learn about architecture-specific features.
Performance Guide: optimize CuTe-based kernels.