
Tensor Core Programming

Tensor Cores are specialized hardware units in modern NVIDIA GPUs that dramatically accelerate matrix multiply-accumulate (MMA) operations. This page explains how CUTLASS leverages Tensor Cores for peak performance.

What are Tensor Cores?

Tensor Cores are dedicated matrix processing units that can perform multiple multiply-accumulate operations in a single instruction. They are the key to achieving peak throughput for deep learning and HPC workloads.
Key Characteristics:
  • Perform matrix operations on small tiles (e.g., 16×8×16)
  • Operate at warp granularity (32 threads collaborate)
  • Deliver 8-16× higher throughput than CUDA cores for matrix math
  • Support various data types: FP64, FP32, TF32, FP16, BF16, FP8, INT8, INT4

Architecture Evolution

Volta (SM70) - First Generation

  • Shapes: 8×8×4 (M×N×K, via mma.sync); 16×16×16 tiles via wmma
  • Data Types: FP16 input, FP16/FP32 accumulation
  • Instructions: wmma API

Turing (SM75) - Second Generation

  • Shapes: 16×8×8, 8×8×4
  • Data Types: FP16, INT8, INT4, INT1
  • Instructions: Enhanced wmma and mma.sync

Ampere (SM80) - Third Generation

  • New Shapes: 16×8×8, 16×8×16
  • New Types: BF16, TF32 (19-bit format)
  • Instructions: mma.sync.aligned
  • Features: Async copy, structured sparsity (2:4)

Hopper (SM90) - Fourth Generation

  • New Shapes: 64×64×16, 64×128×16, 64×192×16
  • New Types: FP8 (E4M3, E5M2)
  • Instructions: wgmma (warpgroup MMA)
  • Features: TMA (Tensor Memory Accelerator), Thread Block Clusters
  • Throughput: Up to 2000 TFLOPS (FP8)

Blackwell (SM100) - Fifth Generation

  • Enhanced Shapes: Larger warpgroup operations
  • New Types: FP4, MXFP formats
  • Features: Enhanced TMA, distributed shared memory
  • Throughput: Up to 4000 TFLOPS (FP4)

Architecture   | FP16 TFLOPS | FP32 (TF32) TFLOPS | FP8 TFLOPS
Volta V100     | 125         | -                  | -
Turing T4      | 65          | -                  | -
Ampere A100    | 312         | 156                | -
Hopper H100    | 989         | 495                | 1979
Blackwell B200 | 2500        | 1250               | 5000

MMA Instruction Format

Tensor Core operations are exposed through MMA (Matrix Multiply-Accumulate) instructions. Here’s an example from Ampere (SM80):
template <>
struct Mma<
  gemm::GemmShape<16, 8, 8>,  // Output: 16×8, K=8
  32,                          // Threads per instruction
  bfloat16_t,                  // A element type
  layout::RowMajor,            // A layout
  bfloat16_t,                  // B element type  
  layout::ColumnMajor,         // B layout
  float,                       // C/D element type
  layout::RowMajor,            // C/D layout
  OpMultiplyAdd>               // Operation
{
  using FragmentA = Array<bfloat16_t, 4>;  // A fragment per thread
  using FragmentB = Array<bfloat16_t, 2>;  // B fragment per thread
  using FragmentC = Array<float, 4>;       // C/D fragment per thread
  
  CUTLASS_HOST_DEVICE
  void operator()(
    FragmentC &d,           // Output
    FragmentA const &a,     // Input A
    FragmentB const &b,     // Input B  
    FragmentC const &c      // Input C (accumulator)
  ) const {
    uint32_t const *A = reinterpret_cast<uint32_t const *>(&a);
    uint32_t const *B = reinterpret_cast<uint32_t const *>(&b);
    float const *C = reinterpret_cast<float const *>(&c);
    float *D = reinterpret_cast<float *>(&d);

    asm volatile(
      "mma.sync.aligned.m16n8k8.row.col.f32.bf16.bf16.f32 "
      "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n"
      : "=f"(D[0]), "=f"(D[1]), "=f"(D[2]), "=f"(D[3])
      : "r"(A[0]), "r"(A[1]), "r"(B[0]),
        "f"(C[0]), "f"(C[1]), "f"(C[2]), "f"(C[3])
    );
  }
};
Reference: include/cutlass/arch/mma_sm80.h:76
The instruction name encodes:
  • m16n8k8: Matrix dimensions (16×8 output, K=8)
  • row.col: A is row-major, B is column-major
  • f32.bf16.bf16.f32: D (output) type, A type, B type, C (accumulator input) type
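For illustration, this naming scheme can be reproduced with a small string-building helper (hypothetical host-side C++; not part of CUTLASS or the PTX toolchain):

```cpp
#include <string>

// Build a PTX mma.sync instruction name from its components
// (illustrative helper; the real names are fixed strings in PTX).
std::string mma_name(int m, int n, int k,
                     const std::string &layout_a,  // "row" or "col" for A
                     const std::string &layout_b,  // "row" or "col" for B
                     const std::string &type_d,    // output (D) type
                     const std::string &type_a,
                     const std::string &type_b,
                     const std::string &type_c) {  // accumulator input (C) type
  return "mma.sync.aligned.m" + std::to_string(m) +
         "n" + std::to_string(n) + "k" + std::to_string(k) +
         "." + layout_a + "." + layout_b +
         "." + type_d + "." + type_a + "." + type_b + "." + type_c;
}
```

For the BF16 example above, `mma_name(16, 8, 8, "row", "col", "f32", "bf16", "bf16", "f32")` yields exactly the instruction string used in the inline asm.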

Fragment Layout

Tensor Core operands are distributed across threads in a warp. Understanding fragment layouts is crucial:

Thread-to-Fragment Mapping (16×8×8 example)

Matrix A (16×8):
- Shape per thread: 4 elements (FragmentA)
- Distribution: Each thread holds elements from multiple rows
- Thread 0: [a0, a8, a16, a24]
- Thread 1: [a1, a9, a17, a25]  
- ...

Matrix B (8×8):
- Shape per thread: 2 elements (FragmentB)
- Distribution: Each thread holds elements from multiple columns
- Thread 0: [b0, b8]
- Thread 1: [b1, b9]
- ...

Matrix C/D (16×8):
- Shape per thread: 4 elements (FragmentC)  
- Distribution: Output elements mapped to threads
CUTLASS handles fragment distribution automatically when you use the provided templates. You rarely need to compute layouts manually!
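Those fragment sizes follow directly from dividing each operand tile evenly across the 32 threads of a warp; a quick sanity check in plain C++:

```cpp
// Elements per thread for an M×N×K MMA split evenly across one warp:
// A is M×K, B is K×N, C/D is M×N.
constexpr int kWarpSize = 32;

constexpr int frag_a(int m, int k) { return m * k / kWarpSize; }
constexpr int frag_b(int k, int n) { return k * n / kWarpSize; }
constexpr int frag_c(int m, int n) { return m * n / kWarpSize; }

// For the 16×8×8 instruction this gives FragmentA = 4, FragmentB = 2,
// FragmentC = 4, matching the Array sizes in the Mma specialization above.
```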

Warpgroup MMA (Hopper SM90+)

Hopper introduced warpgroup-scoped MMA instructions that enable larger, more efficient operations:
// Warpgroup MMA operates on 4 warps (128 threads) simultaneously
// Shape: 64×64×16 per instruction

template <class... Args>
CUTE_HOST_DEVICE
void wgmma_m64n64k16(
  Args const&... args
) {
  // Schematic only; a real implementation enumerates every register operand
  asm volatile(
    "wgmma.mma_async.sync.aligned.m64n64k16.f32.f16.f16 "
    "{%0, %1, ..., %31}, %32, %33, p, 1, 1, 0;\n"
    : ... // 32 output registers per thread (64×64 f32 accumulators / 128 threads)
    : ... // A descriptor, B descriptor (operands in shared memory)
  );
}

Key Differences

Feature    | Warp MMA (SM80) | Warpgroup MMA (SM90)
Threads    | 32 (1 warp)     | 128 (4 warps)
Max Shape  | 16×8×16         | 64×256×16
Memory     | Shared memory   | TMA descriptors
Scheduling | Software        | Hardware pipelining
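To see why the larger warpgroup shapes matter, note that one MMA instruction performs 2·M·N·K floating-point operations (one multiply plus one add per MAC), so the per-thread share grows markedly from warp MMA to warpgroup MMA. Illustrative arithmetic in plain C++, using the 64×192×16 shape listed earlier:

```cpp
// FLOPs per MMA instruction: 2 * M * N * K
constexpr long flops(int m, int n, int k) { return 2L * m * n * k; }

// Warp MMA (SM80): 16×8×16 across 32 threads
constexpr long warp_flops_per_thread = flops(16, 8, 16) / 32;      // 128

// Warpgroup MMA (SM90): 64×192×16 across 128 threads
constexpr long wgmma_flops_per_thread = flops(64, 192, 16) / 128;  // 3072
```

A 24× increase in work per instruction per thread, which is what lets the hardware pipeline amortize scheduling overhead.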

Using Tensor Cores in CUTLASS

CUTLASS provides high-level abstractions for Tensor Core programming:

Method 1: Device-Level GEMM
#include "cutlass/gemm/device/gemm.h"

using Gemm = cutlass::gemm::device::Gemm<
  cutlass::half_t,              // Element A
  cutlass::layout::RowMajor,    // Layout A
  cutlass::half_t,              // Element B
  cutlass::layout::ColumnMajor, // Layout B
  cutlass::half_t,              // Element C
  cutlass::layout::RowMajor,    // Layout C
  float,                        // Element Accumulator
  cutlass::arch::OpClassTensorOp, // Use Tensor Cores
  cutlass::arch::Sm80             // Target architecture
>;

Method 2: CuTe MMA Atoms

CuTe provides composable MMA atoms for fine-grained control:
#include "cute/atom/mma_atom.hpp"

// Define an MMA atom for the SM80 16x8x8 BF16 instruction
using MmaAtom = MMA_Atom<SM80_16x8x8_F32BF16BF16F32_TN>;

// Tile the atom across the warp
using TiledMma = TiledMMA<
  MmaAtom,
  Layout<Shape<_2, _4, _1>>,  // Repeat: 2×4 atoms
  Layout<Shape<_1, _2, _1>>   // Value layout
>;

// Perform MMA (tCsA/tCsB/tCsC are shared-memory operand tiles;
// mma is a thread slice of the tiled MMA)
auto tCrA = partition_fragment_A(mma, tCsA);
auto tCrB = partition_fragment_B(mma, tCsB);  
auto tCrC = partition_fragment_C(mma, tCsC);

gemm(mma, tCrA, tCrB, tCrC);  // Executes Tensor Core instructions

Data Type Support

Different Tensor Core generations support different types:

FP16 (Half Precision)

// Most widely supported
using ElementA = cutlass::half_t;
using ElementB = cutlass::half_t;  
using ElementC = cutlass::half_t;
using ElementAccumulator = float;  // Higher precision accumulation

BF16 (Brain Float16)

// SM80+, better dynamic range than FP16
using ElementA = cutlass::bfloat16_t;
using ElementB = cutlass::bfloat16_t;
using ElementAccumulator = float;

TF32 (TensorFloat32)

// SM80+, FP32 input with 19-bit precision
using ElementA = cutlass::tfloat32_t;
using ElementB = cutlass::tfloat32_t;
using ElementAccumulator = float;

FP8 (8-bit Float)

// SM89+, highest throughput
using ElementA = cutlass::float_e4m3_t;  // E4M3 format
using ElementB = cutlass::float_e5m2_t;  // E5M2 format  
using ElementAccumulator = float;

Block-Scaled Types (SM100+)

// FP4, MXFP4, MXFP6, MXFP8 with per-block scaling
using ElementA = cutlass::nvfp4_t;
using ElementB = cutlass::nvfp4_t;
using ElementAccumulator = float;
// Requires separate scale tensors
Higher Throughput (Lower Precision):
  • FP4: 4× FP16 throughput
  • FP8: 2× FP16 throughput
  • INT8: 2× FP16 throughput
Higher Precision (Lower Throughput):
  • FP64: 1/16× FP16 throughput
  • FP32: 1/2× FP16 throughput (via TF32)
  • FP16/BF16: Baseline throughput
Choose based on your accuracy requirements!
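As a rule of thumb, the ratios above can be tabulated as multipliers over the FP16 baseline (hypothetical helper reflecting the ratios in this section; actual ratios vary by GPU SKU):

```cpp
#include <string>

// Relative Tensor Core throughput vs. the FP16/BF16 baseline,
// per the rule-of-thumb ratios listed above (varies by GPU).
double relative_throughput(const std::string &dtype) {
  if (dtype == "fp4")  return 4.0;
  if (dtype == "fp8" || dtype == "int8") return 2.0;
  if (dtype == "fp16" || dtype == "bf16") return 1.0;
  if (dtype == "tf32") return 0.5;        // FP32 inputs via TF32
  if (dtype == "fp64") return 1.0 / 16.0;
  return 0.0;  // unknown type
}
```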

Structured Sparsity (SM80+)

Ampere introduced 2:4 structured sparsity support:
// For every 4 elements, exactly 2 are non-zero
// Provides 2× effective throughput

using Gemm = cutlass::gemm::device::SparseGemm<
  cutlass::half_t,
  cutlass::layout::RowMajor,
  cutlass::half_t,  
  cutlass::layout::ColumnMajor,
  cutlass::half_t,
  cutlass::layout::RowMajor,
  float,
  cutlass::arch::OpClassTensorOp,
  cutlass::arch::Sm80,
  // Additional sparse configuration...
>;
Matrix A is compressed, and a metadata tensor indicates which elements are non-zero.
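The 2:4 constraint is easy to validate on the host before compressing. A minimal check in plain C++ (illustrative; CUTLASS's actual compression path also produces the metadata tensor):

```cpp
#include <vector>
#include <cstddef>

// Check the 2:4 structured-sparsity constraint: in every aligned group
// of 4 elements, at most 2 are non-zero (zeros can pad a sparser group).
bool is_2_4_sparse(const std::vector<float> &row) {
  if (row.size() % 4 != 0) return false;  // must be groupable into 4s
  for (std::size_t g = 0; g < row.size(); g += 4) {
    int nonzeros = 0;
    for (std::size_t i = g; i < g + 4; ++i)
      if (row[i] != 0.0f) ++nonzeros;
    if (nonzeros > 2) return false;
  }
  return true;
}
```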

Performance Optimization

1. Tile Size Selection

Choose tile sizes that are multiples of Tensor Core operation sizes:
// Good: Multiples of Tensor Core shape (16×8)
using ThreadblockShape = cutlass::gemm::GemmShape<128, 128, 32>;
using WarpShape = cutlass::gemm::GemmShape<64, 64, 32>;

// Bad: not aligned to Tensor Core boundaries (shown for contrast only)
// using ThreadblockShape = cutlass::gemm::GemmShape<100, 100, 20>;  // Inefficient!
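This alignment rule can be enforced at compile time. A minimal sketch, assuming an SM80-style 16×8×16 instruction shape:

```cpp
// Verify a tile shape is an integer multiple of the MMA instruction shape
// (16×8×16 here, as on SM80); misaligned tiles waste Tensor Core lanes.
constexpr bool tile_aligned(int tile_m, int tile_n, int tile_k,
                            int mma_m = 16, int mma_n = 8, int mma_k = 16) {
  return tile_m % mma_m == 0 && tile_n % mma_n == 0 && tile_k % mma_k == 0;
}

static_assert(tile_aligned(128, 128, 32), "good threadblock tile");
static_assert(!tile_aligned(100, 100, 20), "misaligned tile rejected");
```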

2. Maximizing Occupancy

// Use enough warps per threadblock for high occupancy
using ThreadblockShape = cutlass::gemm::GemmShape<128, 128, 32>;
using WarpShape = cutlass::gemm::GemmShape<32, 64, 32>;
// Warps per threadblock: (128/32) × (128/64) = 4×2 = 8 warps
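The warp count in the comment above is just the product of the per-dimension tile ratios; in plain C++:

```cpp
// Warps per threadblock = (threadblock M / warp M) * (threadblock N / warp N)
constexpr int warps_per_block(int tb_m, int tb_n, int w_m, int w_n) {
  return (tb_m / w_m) * (tb_n / w_n);
}

// 128/32 × 128/64 = 4 × 2 = 8 warps
```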

3. Double Buffering

// Overlap data loading with computation
constexpr int kStages = 2;  // Double buffering
// Deeper software pipelines (SM80+) use more stages, e.g.:
// constexpr int kStages = 4;

4. Async Copy (SM80+)

// Use cp.async to overlap global→shared memory copies
using GlobalToSharedCopyA = 
  Copy_Atom<SM80_CP_ASYNC_CACHEGLOBAL<cute::uint128_t>, ElementA>;
Performance Checklist:
  • ✓ Tile sizes are multiples of Tensor Core shapes
  • ✓ High occupancy (8+ warps per SM)
  • ✓ Multi-stage pipeline (2-4 stages)
  • ✓ Vectorized memory accesses (128-bit when possible)
  • ✓ Async copy for SM80+ targets
  • ✓ TMA for SM90+ targets

Common Pitfalls

1. Misaligned Data

// Bad: an arbitrary element offset can break 16-byte alignment
cutlass::half_t *A_sub = A_base + 3;  // 6-byte offset; misaligned

// Good: cudaMalloc returns well-aligned pointers; keep sub-tensor
// offsets a multiple of 8 half_t elements (16 bytes)
cutlass::half_t *A = nullptr;
cudaMalloc(&A, M * K * sizeof(cutlass::half_t));
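On the host, a cheap runtime check can catch misalignment before launching a kernel (plain C++, illustrative):

```cpp
#include <cstdint>
#include <cstddef>

// True if ptr is aligned to `alignment` bytes (16 bytes is what the
// 128-bit vectorized loads in Tensor Core kernels typically require).
bool is_aligned(const void *ptr, std::size_t alignment) {
  return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}
```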

2. Wrong Layout

// MMA instruction expects specific layouts
// Check that your tensor layouts match the MMA atom requirements

// For SM80_16x8x8_F32FP16FP16F32_TN:
// - A: Row-major (T = transposed in MMA terminology)  
// - B: Column-major (N = non-transposed)

3. Incorrect Fragment Distribution

// Let CUTLASS handle fragment distribution
// Don't try to manually shuffle data between threads!

// Good: Use CUTLASS abstractions
auto tCrA = partition_fragment_A(mma, tCsA);

// Bad: Manual shuffling (error-prone)
// for (int t = 0; t < 32; ++t) { /* complex logic */ }

Debugging Tensor Core Kernels

// Print the tiled MMA's shapes and layouts
#include "cute/util/print.hpp"

if (thread0()) {
  TiledMma tiled_mma;
  print("TiledMMA: ");
  print(tiled_mma);  // prints atom shape and thread/value layouts
  print("\n");
}

Real-World Example

Complete Tensor Core GEMM snippet:
// SM80 FP16 Tensor Core GEMM (CuTe), schematic
using MmaOp = SM80_16x8x16_F32F16F16F32_TN;
using TiledMma = TiledMMA<
  MMA_Atom<MmaOp>,
  Layout<Shape<_2,_2,_1>>  // 2×2 warp arrangement
>;

TiledMma tiled_mma;
auto thr_mma = tiled_mma.get_thread_slice(threadIdx.x);

// Partition global-memory tiles and matching register fragments
auto tCgA = thr_mma.partition_A(gA);   // (MMA, MMA_M, MMA_K, k_tiles)
auto tCgB = thr_mma.partition_B(gB);   // (MMA, MMA_N, MMA_K, k_tiles)
auto tCgC = thr_mma.partition_C(gC);   // (MMA, MMA_M, MMA_N)
auto tCrA = thr_mma.partition_fragment_A(gA(_, _, 0));
auto tCrB = thr_mma.partition_fragment_B(gB(_, _, 0));
auto tCrC = thr_mma.partition_fragment_C(gC);

clear(tCrC);  // zero the accumulators

// Main loop over K tiles
for (int k_tile = 0; k_tile < size<3>(tCgA); ++k_tile) {
  copy(tCgA(_, _, _, k_tile), tCrA);   // load A fragments to registers
  copy(tCgB(_, _, _, k_tile), tCrB);   // load B fragments to registers
  gemm(tiled_mma, tCrA, tCrB, tCrC);   // Tensor Core MMA
}

copy(tCrC, tCgC);  // store accumulators to global memory

Next Steps

  • Memory Layouts: Optimize data layouts for Tensor Cores
  • CuTe Library: Use CuTe abstractions for MMA operations
  • Examples: Explore Tensor Core code examples
  • GEMM Operations: Build complete GEMM kernels