AlphaSparseTensor: Discovering Faster Sparse Matrix Multiplication Algorithms on GPUs for LLM Inference

Abstract

As Large Language Models (LLMs) continue to scale, existing pruning techniques face three deployment bottlenecks: (1) limited hardware support for unstructured sparsity, (2) kernel-level mismatch with LLM sparsity patterns, and (3) layer-wise sparsity heterogeneity. We present AlphaSparseTensor, an automated SpMM optimization framework that co-designs algorithmic discovery and hardware execution. Building on AlphaTensor's paradigm, our solution introduces dynamic programming-based block minimization and sparsity-aware workflow generation through (1) adaptive zero-block detection and (2) hierarchical tiling for variable sparsity distributions. The system further optimizes GPU execution via memory-computation pipelining and data layout transformations. Evaluations show consistent improvements across multiple benchmarks: a 1.91× speedup over cuSPARSE on Sparse Transformers, and a 4.05× average acceleration over cuBLAS for 70%-pruned LLaMA models. End-to-end inference tests on LLaMA (7B/13B/65B) show system-level speedups of 8.4×, 2.1×, 1.3×, and 1.2× over cuBLAS, cuSPARSE, PyTorch, and Sputnik, respectively. We open-source the discovered algorithms at: https://github.com/DavidMiao1127/AlphaSparseTensor.
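To make the zero-block-skipping idea concrete, below is a minimal NumPy sketch, not the paper's GPU kernel: it partitions a pruned weight matrix into tiles, detects tiles that are entirely zero, and multiplies only the surviving blocks. The function name, tile size, and test setup are illustrative assumptions, not part of the released implementation.

```python
# Minimal sketch (NumPy, illustrative only) of adaptive zero-block detection
# for SpMM: skip all-zero tiles of a pruned weight matrix W when computing W @ X.
import numpy as np

def zero_block_spmm(W, X, block=32):
    """Multiply W @ X while skipping all-zero `block x block` tiles of W.

    W : (M, K) pruned weight matrix (unstructured zeros allowed)
    X : (K, N) dense activation matrix
    """
    M, K = W.shape
    _, N = X.shape
    Y = np.zeros((M, N), dtype=W.dtype)
    for i in range(0, M, block):
        for k in range(0, K, block):
            tile = W[i:i + block, k:k + block]
            if not tile.any():  # zero-block detection: skip empty tiles entirely
                continue
            Y[i:i + block] += tile @ X[k:k + block]
    return Y

# Toy check at a 70% pruning ratio, the regime used in the LLaMA experiments.
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)
W[rng.random(W.shape) < 0.7] = 0.0
X = rng.standard_normal((256, 64)).astype(np.float32)
assert np.allclose(zero_block_spmm(W, X), W @ X, atol=1e-4)
```

At 70% unstructured sparsity few tiles are fully zero at this block size, which is why the paper pairs zero-block detection with hierarchical tiling that adapts block granularity to the layer's sparsity distribution.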

Publication
In the International European Conference on Parallel and Distributed Computing (Euro-Par 2025)