DTrans: A Dataflow-Transformation FPGA Accelerator with Nonlinear-Operator Fusion for Generative Models

Abstract

FPGA-based accelerators have emerged as an effective solution for GPT inference, given their inherent flexibility and capacity for domain-specific customization. However, efficient GPT inference is hindered by two main obstacles: the unequal compute-to-memory-access ratios of the prefilling and generation stages, and the growing hardware resource requirements of nonlinear operations as text lengths and embedding dimensions increase. To address these challenges, we introduce DTrans, an FPGA accelerator designed specifically for GPT that leverages nonlinear-operator fusion and a customized pipeline to efficiently handle long-input and long-generation tasks. We introduce a sequence-length-decoupled nonlinear-operator design that enables in-place execution while preserving computational accuracy. Additionally, we employ a customized pipeline that uses a two-level alternating input mapping for long-input tasks to alleviate the overhead of residual computation and on-chip buffers. For long-token-generation tasks, the pipeline overlaps the latency of operations such as Softmax and layer normalization with matrix operations. We also propose a novel two-stage dataflow transformation strategy that adopts different reuse strategies for the prefilling and generation stages, matching their distinct compute and memory-access characteristics. Our comparative analyses show that DTrans outperforms a V100 GPU in throughput and energy efficiency by 11.99× and 11.7×, respectively. Compared with state-of-the-art GPT inference accelerators, DTrans achieves more than 5.64× and 5.22× improvements in these metrics.
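The intuition behind stage-specific reuse can be sketched, very loosely, as two different loop orderings: in prefilling, many tokens are processed per weight fetch, so each weight can be held and reused (compute-bound), whereas in generation a single token's activation multiplies weights that are each streamed only once (memory-bound). The minimal C++ sketch below is purely illustrative; the function names, tile choices, and loop orders are assumptions and are not DTrans's actual dataflow.

```cpp
// Illustrative sketch only: contrasts weight reuse in the prefilling stage
// (matrix-matrix, many tokens per weight fetch) with the generation stage
// (matrix-vector, each weight used once per token). Names and loop orders
// are hypothetical, not the accelerator's real implementation.
#include <vector>
#include <cstddef>

using Mat = std::vector<std::vector<float>>;

// Prefilling: S tokens are processed together, so a weight element loaded
// once is reused S times -> compute-bound, favors a weight-reuse dataflow.
Mat prefill_matmul(const Mat& x /* S x D */, const Mat& w /* D x Dout */) {
    const size_t S = x.size(), D = w.size(), Dout = w[0].size();
    Mat y(S, std::vector<float>(Dout, 0.0f));
    for (size_t j = 0; j < Dout; ++j)
        for (size_t k = 0; k < D; ++k) {
            const float wkj = w[k][j];      // fetched once ...
            for (size_t s = 0; s < S; ++s)  // ... reused across all S tokens
                y[s][j] += x[s][k] * wkj;
        }
    return y;
}

// Generation: one token at a time, so every weight is used exactly once ->
// memory-bound; only the single activation vector is worth keeping on-chip.
std::vector<float> decode_matvec(const std::vector<float>& x /* D */, const Mat& w) {
    const size_t D = w.size(), Dout = w[0].size();
    std::vector<float> y(Dout, 0.0f);
    for (size_t k = 0; k < D; ++k) {        // stream weights row by row
        const float xk = x[k];              // activation element reused Dout times
        for (size_t j = 0; j < Dout; ++j)
            y[j] += xk * w[k][j];
    }
    return y;
}
```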

Publication
34th International Conference on Field-Programmable Logic and Applications (FPL 2024)