Batched GEMM Example

This example demonstrates how to use CUTLASS to compute batched GEMM operations in two different ways:

- Strided batched GEMM: Matrices separated by a fixed stride in memory
- Array GEMM: Arbitrary pointers to each matrix in the batch

Overview

Batched GEMM operations compute multiple independent matrix multiplications:

C[i] = alpha * (A[i] x B[i]) + beta * C[i]  for i = 0 to batch_count-1
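To make the formula concrete, here is a minimal CPU sketch of the same computation. It is not the CUTLASS kernel. It assumes column-major storage with leading dimensions lda, ldb, and ldc, and a fixed stride between consecutive matrices; the function name `reference_batched_gemm` and its parameter order are illustrative, not part of the CUTLASS API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative CPU reference (hypothetical helper, not a CUTLASS function).
// Computes C[i] = alpha * (A[i] x B[i]) + beta * C[i] for each batch index i,
// assuming column-major matrices: element (row, col) lives at row + col * ld.
void reference_batched_gemm(int m, int n, int k, float alpha,
                            const float* A, int lda, std::size_t batch_stride_A,
                            const float* B, int ldb, std::size_t batch_stride_B,
                            float beta,
                            float* C, int ldc, std::size_t batch_stride_C,
                            int batch_count) {
  assert(lda >= m && ldb >= k && ldc >= m);
  for (int b = 0; b < batch_count; ++b) {
    // Locate the b-th matrix of each operand via its batch stride.
    const float* a_mat = A + b * batch_stride_A;
    const float* b_mat = B + b * batch_stride_B;
    float* c_mat = C + b * batch_stride_C;
    for (int col = 0; col < n; ++col) {
      for (int row = 0; row < m; ++row) {
        float acc = 0.0f;
        for (int p = 0; p < k; ++p) {
          acc += a_mat[row + p * lda] * b_mat[p + col * ldb];
        }
        float& out = c_mat[row + col * ldc];
        out = alpha * acc + beta * out;
      }
    }
  }
}
```

Each batch item is independent of the others, which is why a single kernel launch can process the whole batch.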

This pattern is common in neural network training, computer graphics, and scientific computing.

Key Concepts

- Strided batched GEMM: Efficient when matrices are laid out with uniform spacing
- Array GEMM: Flexible approach for arbitrary memory layouts
- Batch stride: Distance in memory between consecutive matrices
- Performance optimization: Amortize kernel launch overhead across multiple operations

Memory Layout

Consider a batch of 2 matrices with dimensions M=6, N=3, K=2:

Matrix C Layout (M=6, N=3, batch=2)

-----------------------------------------------------------
| (0,0,0) | (0,0,1) | (0,0,2) | (1,0,0) | (1,0,1) | (1,0,2) |
-----------------------------------------------------------
| (0,1,0) | (0,1,1) | (0,1,2) | (1,1,0) | (1,1,1) | (1,1,2) |
-----------------------------------------------------------
|    ...  |   ...   |   ...   |   ...   |   ...   |   ...   |
-----------------------------------------------------------
            batch 0          |           batch 1

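Each element is labeled (batch_idx, row_idx, column_idx). The matrices are stored in column-major order with the batches side by side, so the stride between the first elements of consecutive batches is batch_stride_C = ldc * N.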
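As a worked example of the stride formula, the sketch below maps an element label to its linear offset in memory. It assumes column-major storage and, for the numbers in the usage note, a leading dimension ldc equal to M = 6; the helper name `offset_C` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: linear offset of element (batch_idx, row_idx, column_idx)
// of C, assuming column-major storage with leading dimension ldc and n columns
// per matrix, with consecutive batches separated by batch_stride_C = ldc * n.
std::size_t offset_C(int batch_idx, int row_idx, int column_idx, int ldc, int n) {
  assert(batch_idx >= 0 && row_idx >= 0 && column_idx >= 0);
  assert(row_idx < ldc && column_idx < n);
  const std::size_t batch_stride_C = static_cast<std::size_t>(ldc) * n;
  return batch_idx * batch_stride_C +
         static_cast<std::size_t>(column_idx) * ldc +
         row_idx;
}
```

With ldc = 6 and N = 3, the batch stride is 18 elements, so `offset_C(1, 0, 0, 6, 3)` is 18 and `offset_C(0, 1, 2, 6, 3)` is 13.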

Implementation

Building and Running

Build the example

cd /path/to/cutlass
mkdir build && cd build
cmake .. -DCUTLASS_NVCC_ARCHS='75;80;86'
make 05_batched_gemm

Run the example

./examples/05_batched_gemm/05_batched_gemm

Expected output:

Running strided batched gemm
Passed.
Running array gemm
Passed.

Source Code Location

The complete source code for this example is available at:

examples/05_batched_gemm/batched_gemm.cu

What This Example Demonstrates

- Two batching modes: Both strided and array-based batched GEMM
- Flexible memory layouts: How to handle both regular and irregular memory patterns
- Pointer management: Setting up device pointer arrays for array GEMM
- Correctness verification: Reference implementation for validating results
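The pointer setup for array GEMM can be sketched on the host as follows. This is a simplified illustration, assuming the matrices happen to sit at a uniform stride from a base pointer; the helper `make_pointer_array` is hypothetical. In a CUDA program the resulting array would still have to be copied to device memory (for example with cudaMemcpy) before a kernel could read it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: build one pointer per batch item from a base pointer
// and a uniform batch stride (measured in elements, not bytes).
template <typename T>
std::vector<T*> make_pointer_array(T* base, std::size_t batch_stride, int batch_count) {
  assert(batch_count >= 0);
  std::vector<T*> ptrs(static_cast<std::size_t>(batch_count));
  for (int i = 0; i < batch_count; ++i) {
    ptrs[static_cast<std::size_t>(i)] = base + i * batch_stride;
  }
  return ptrs;
}
```

Because every entry is an independent pointer, the same array could just as well be filled from separate allocations or in a different order.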

Performance Considerations

Strided batched GEMM is typically faster when matrices are uniformly spaced because:

- Simpler addressing logic
- Better memory access patterns
- Less pointer indirection

Array GEMM provides flexibility when:

- Matrices are scattered in memory
- Each batch item comes from different allocations
- You need arbitrary ordering of operations
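The extra indirection and the flexibility it buys can both be seen in a CPU sketch of the array variant. This is an illustration under assumptions, not the CUTLASS implementation: it assumes column-major matrices, and the name `reference_array_gemm` is hypothetical. Each operand is reached through a per-batch pointer instead of a base pointer plus stride.

```cpp
#include <cassert>
#include <vector>

// Illustrative CPU reference for array GEMM (hypothetical helper).
// A, B, and C are arrays of pointers, one entry per batch item, so the
// matrices may live in unrelated allocations and in any order.
void reference_array_gemm(int m, int n, int k, float alpha,
                          const float* const* A, int lda,
                          const float* const* B, int ldb,
                          float beta,
                          float* const* C, int ldc,
                          int batch_count) {
  assert(lda >= m && ldb >= k && ldc >= m);
  for (int b = 0; b < batch_count; ++b) {
    // One pointer load per operand per batch item: the added indirection.
    const float* a_mat = A[b];
    const float* b_mat = B[b];
    float* c_mat = C[b];
    for (int col = 0; col < n; ++col) {
      for (int row = 0; row < m; ++row) {
        float acc = 0.0f;
        for (int p = 0; p < k; ++p) {
          acc += a_mat[row + p * lda] * b_mat[p + col * ldb];
        }
        float& out = c_mat[row + col * ldc];
        out = alpha * acc + beta * out;
      }
    }
  }
}
```

Nothing in the loop requires the pointers to be evenly spaced, so the same B matrix can be reused for several batch items or the batch can be processed in a permuted order.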

Key Takeaways

- Use GemmBatched for strided batched operations with uniform spacing
- Use GemmArray for arbitrary pointer arrays with irregular layouts
- Batch operations amortize kernel launch overhead across multiple GEMMs
- Both approaches share the same underlying optimizations for individual matrix multiplications

Next Steps

- Learn about Basic GEMM for single matrix multiplication
- Explore Fused Operations to combine GEMM with activation functions
- Check out Convolution for batched convolution operations
