AlphaSparseTensor: Discovering Faster Sparse Matrix Multiplication Algorithms on GPUs for LLM Inference

Abstract

As Large Language Models (LLMs) continue to scale, existing pruning techniques face three deployment bottlenecks: (1) limited hardware support for unstructured sparsity, (2) kernel-level mismatch with LLM sparsity patterns, and (3) layer-wise sparsity heterogeneity. We present AlphaSparseTensor, an automated SpMM optimization framework that co-designs algorithmic discovery and hardware execution. Building on AlphaTensor's paradigm, our solution introduces dynamic programming-based block minimization and sparsity-aware workflow generation through (1) adaptive zero-block detection and (2) hierarchical tiling for variable sparsity distributions. The system further optimizes GPU execution via memory-computation pipelining and data layout transformations. Evaluations show consistent improvements across multiple benchmarks: a 1.91× speedup over cuSPARSE on Sparse Transformers, and a 4.05× average acceleration over cuBLAS for 70%-pruned LLaMA models. End-to-end inference tests on LLaMA (7B/13B/65B) show system-level speedups of 8.4×, 2.1×, 1.3×, and 1.2× over cuBLAS, cuSPARSE, PyTorch, and Sputnik, respectively. We open-source the discovered algorithms at: https://github.com/DavidMiao1127/AlphaSparseTensor.
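To make the zero-block-skipping idea concrete, below is a minimal NumPy sketch, not the paper's GPU kernel: it partitions a pruned weight matrix into tiles, detects tiles that are entirely zero, and multiplies only the surviving blocks. The function name, tile size, and test setup are illustrative assumptions, not part of the released implementation.

```python
# Minimal sketch (NumPy, illustrative only) of adaptive zero-block detection
# for SpMM: skip all-zero tiles of a pruned weight matrix W when computing W @ X.
import numpy as np

def zero_block_spmm(W, X, block=32):
    """Multiply W @ X while skipping all-zero `block x block` tiles of W.

    W : (M, K) pruned weight matrix (unstructured zeros allowed)
    X : (K, N) dense activation matrix
    """
    M, K = W.shape
    _, N = X.shape
    Y = np.zeros((M, N), dtype=W.dtype)
    for i in range(0, M, block):
        for k in range(0, K, block):
            tile = W[i:i + block, k:k + block]
            if not tile.any():  # zero-block detection: skip empty tiles entirely
                continue
            Y[i:i + block] += tile @ X[k:k + block]
    return Y

# Toy check at a 70% pruning ratio, the regime used in the LLaMA experiments.
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)
W[rng.random(W.shape) < 0.7] = 0.0
X = rng.standard_normal((256, 64)).astype(np.float32)
assert np.allclose(zero_block_spmm(W, X), W @ X, atol=1e-4)
```

At 70% unstructured sparsity few tiles are fully zero at this block size, which is why the paper pairs zero-block detection with hierarchical tiling that adapts block granularity to the layer's sparsity distribution.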

Publication
In the International European Conference on Parallel and Distributed Computing (Euro-Par 2025)