XTC - Overview

Achieving peak performance on AI operators is hard

  • Matrix multiplication, convolution, activations...
  • Balance computation and data movement
  • Keep hardware units continuously utilized
  • Minimize stalls and idle time

The Automation vs. Manual Tuning Dilemma

| Approach            | Pros               | Cons                                  |
|---------------------|--------------------|---------------------------------------|
| Compiler heuristics | High productivity  | Often fail to reach peak performance  |
| Hand-tuned kernels  | Highest performance| Poor portability, high dev effort     |

Goal: Expose optimization decisions through controllable and portable interfaces

Scheduling Languages: The Promise

Allow experts to script optimization transformations

  • Tiling, fusion, vectorization, parallelization...
  • Reduce reliance on opaque compiler heuristics
  • Can be driven by humans or autotuners

Examples: TVM/TE, Halide, MLIR Transform dialect
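
For concreteness, a minimal TVM Tensor Expression sketch of this style of scripted scheduling (illustrative only, not XTC code):

import tvm
from tvm import te

# Algorithm: a plain matrix multiplication, written once.
N = 1024
A = te.placeholder((N, N), name="A")
B = te.placeholder((N, N), name="B")
k = te.reduce_axis((0, N), name="k")
C = te.compute((N, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

# Schedule: the optimization decisions are scripted separately from the algorithm.
s = te.create_schedule(C.op)
i, j = C.op.axis
io, ii = s[C].split(i, factor=32)   # tiling
jo, ji = s[C].split(j, factor=32)
s[C].reorder(io, jo, ii, ji)        # interchange for locality
s[C].vectorize(ji)                  # map the innermost loop to SIMD
s[C].parallel(io)                   # distribute the outer loop across threads

print(tvm.lower(s, [A, B, C], simple_mode=True))  # inspect the transformed loop nest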

The Problem: Fragmentation

Each scheduling language is locked to its ecosystem

  • TVM Tensor Expressions → TVM only
  • Halide → Halide framework only
  • MLIR Transform dialect → MLIR ecosystem only

What's Missing?

There is currently no unified, user-facing API flexible enough to decouple scheduling specification from code generation.

  • TVM and MLIR are hard to compare on an equal footing
  • Difficult to share scheduling strategies across backends
  • No common measurement infrastructure

XTC - Proposal

A research platform that decouples:

  1. Scheduling -- Common API across compilers
  2. Code generation -- Multiple backends (TVM, MLIR, ...)
  3. Measurement -- Cross-platform hardware counters


XTC Architecture

Entry points (blue):

  • High-level scheduling language
  • Unified scheduling API
  • Design space exploration

Backends:

  • TVM, MLIR, extensible...

Measurement (green):

  • x86, ARM (experimental: Apple Silicon, NVIDIA GPUs)
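
As a rough illustration of what counter-based measurement looks like on Linux, a generic `perf stat` wrapper (a sketch for intuition, not XTC's actual measurement API):

import subprocess

def measure(cmd, events=("cycles", "instructions", "cache-misses")):
    """Run a command under `perf stat` and return selected hardware counters."""
    # `-x ,` makes perf emit CSV records on stderr: value,unit,event,...
    proc = subprocess.run(
        ["perf", "stat", "-x", ",", "-e", ",".join(events), *cmd],
        capture_output=True, text=True, check=True,
    )
    counters = {}
    for line in proc.stderr.splitlines():
        fields = line.split(",")
        if len(fields) >= 3 and fields[0].strip().isdigit():
            counters[fields[2]] = int(fields[0])
    return counters

# Example: measure(["./matmul_bench"])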

A Taste of XTC

sch.dims = ['I','J','K']                                        # name the loop dimensions
sch.split(root="mm0", dim="J", segments={"J[0]":0,"J[1]":256})  # divide J into contiguous regions
sch.strip_mine(root="J[0]", dim="K", tiles={"K1": 4})           # block K by 4 inside J[0]
sch.strip_mine(root="J[0]", dim="J", tiles={"J1": 16})          # block J by 16 inside J[0]
sch.unroll(root="J[0]", unrolls={"J1": 16, "K1": 4})            # unroll the inner blocks
sch.vectorize(root="J[0]", axes=["J1"])                         # map J1 to SIMD lanes

Same schedule → TVM or MLIR backend
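
A hypothetical driver loop showing the intent; `compile` and `measure` below are assumed names for illustration, not XTC's documented entry points:

# Hypothetical sketch: reuse the schedule `sch` built above unchanged and
# switch only the code-generation backend (function names are assumptions).
for backend in ("tvm", "mlir"):
    module = xtc.compile(sch, backend=backend)   # assumed compile entry point
    stats = xtc.measure(module)                  # assumed counter-based measurement
    print(backend, stats)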

Scheduling Primitives

| Primitive                 | Purpose                                        |
|---------------------------|------------------------------------------------|
| strip_mine                | Partition iteration domain into blocks         |
| interchange               | Reorder loops for locality/vectorization       |
| split                     | Divide loop into contiguous regions            |
| unroll                    | Expose instruction-level parallelism           |
| vectorize                 | Map to SIMD resources                          |
| parallelize               | Distribute across threads/cores                |
| pack_at/buffer_at         | Improve spatial locality                       |
| fuse_producer/consumer_at | Combine producer/consumer operations           |
| distribute                | Define and distribute memories among processors|
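
A hypothetical composition of two more primitives, modeled on the keyword style of the earlier example; the argument names are assumptions, not documented XTC signatures:

# Hypothetical sketch only (argument names are assumptions):
sch.interchange(root="J[0]", order=["K1", "J1"])   # reorder blocked loops for locality
sch.parallelize(root="mm0", dim="I")               # distribute the outer I loop across cores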

Higher-Level: Declarative Scheduling

sch.descript({
    'I': [],
    'J[0:256]': {
        'K': [],
        'K#4': ['unroll'],
        'J#16': ['vectorize']
    },
    'J[256:258]': {
        'K': []
    }
})

Describe the target loop nest rather than the transformation sequence
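
One way to read this declarative form is as shorthand for the imperative sequence in "A Taste of XTC"; the mapping below is an interpretation, not taken verbatim from the source:

# 'J[0:256]' and 'J[256:258]'  ->  sch.split(...): contiguous regions of J
# 'K#4'  : ['unroll']          ->  sch.strip_mine(tiles={"K1": 4}) + sch.unroll(...)
# 'J#16' : ['vectorize']       ->  sch.strip_mine(tiles={"J1": 16}) + sch.vectorize(...)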

Why XTC for Research?

  1. Fair comparison of scheduling strategies across backends
  2. Reproducible measurements with HW counter access
  3. Identify backend limitations (e.g., MLIR vectorization issues)
  4. Evaluate performance models against real hardware
  5. Rapid prototyping of new scheduling languages

Key Results

  • Matches hand-tuned C with vector intrinsics
  • High performance correlation between the TVM and MLIR backends
  • Identified an mlir-opt vectorization limitation on generic convolutions
  • Integrated into an existing AI framework (15-30× speedup over unoptimized generic C++)

References

Paper: https://arxiv.org/abs/2512.16512

Sources: https://github.com/xtc-tools/xtc