Multipurpose Deep-Learning Accelerator for Arbitrary Quantization With Reduction of Storage, Logic, and Latency Waste
- Authors
- Moon, Seunghyun; Mun, Han-Gyeol; Son, Hyunwoo; Sim, Jae-Yoon
- Issue Date
- Jan-2024
- Publisher
- Institute of Electrical and Electronics Engineers
- Keywords
- Arbitrary quantization (AQ); bit-serial processing; computer architecture; decoding; deep neural network (DNN) accelerator; hardware; lookup table (LUT); precision scalability; quantization (signal); run-length compression (RLC); table lookup; task analysis
- Citation
- IEEE Journal of Solid-State Circuits, v.59, no.1, pp. 1-14
- Pages
- 14
- Indexed
- SCIE; SCOPUS
- Journal Title
- IEEE Journal of Solid-State Circuits
- Volume
- 59
- Number
- 1
- Start Page
- 1
- End Page
- 14
- URI
- https://scholarworks.gnu.ac.kr/handle/sw.gnu/68356
- DOI
- 10.1109/JSSC.2023.3312615
- ISSN
- 0018-9200; 1558-173X
- Abstract
- Various pruning and quantization heuristics have been proposed to compress recent deep-learning models. However, the rapid development of new optimization techniques makes it difficult for domain-specific accelerators to efficiently process models with irregularly stored parameters or nonlinear quantization. This article presents a scalable-precision deep-learning accelerator that supports multiply-and-accumulate (MAC) operations on two arbitrarily quantized data sequences. The proposed accelerator includes three main features. First, a lookup table (LUT)-based runtime reconfiguration minimizes logic overhead when processing arbitrarily quantized data of up to 8-bit precision. Second, bit-serial execution that skips unnecessary computations enables multiplication of operands with unequal precision while minimizing logic and latency waste. Third, two distinct data formats, raw and run-length compressed, are supported by a zero-eliminator (ZE) and a runtime-density detector (RDD) compatible with both formats, improving storage and performance. For a precision range of 1-8 bit and a fixed sparsity of 30%, the accelerator, implemented in 28-nm low-power (LP) CMOS, shows a peak performance of 0.87-5.55 TOPS and a power efficiency of 15.1-95.9 TOPS/W. The accelerator supports processing with arbitrary quantization (AQ) while achieving state-of-the-art (SOTA) power efficiency.
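The LUT-based decoding idea from the abstract can be illustrated with a minimal Python sketch: each quantization code simply indexes a table of reconstruction values, so supporting a new (possibly nonlinear) quantizer only requires reloading the table at runtime. The function names and the 2-bit nonuniform example below are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of LUT-based dequantization for arbitrary (nonlinear)
# quantization: an n-bit code indexes a table of reconstruction values,
# so the datapath is "reconfigured" by reloading the LUT.
# Illustrative names; not the paper's hardware.

def make_lut(levels):
    """levels: one reconstruction value per quantization code."""
    return list(levels)

def dequantize(codes, lut):
    # Decode arbitrarily quantized codes by table lookup.
    return [lut[c] for c in codes]

# Example: a 2-bit quantizer with nonuniform (logarithmic-like) levels.
lut_2bit = make_lut([0.0, 0.25, 0.5, 1.0])
print(dequantize([3, 1, 0, 2], lut_2bit))  # [1.0, 0.25, 0.0, 0.5]
```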
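The bit-serial MAC can be sketched in the same spirit: one operand is consumed one bit per cycle, cycles whose bit is 0 contribute nothing and are skipped, and the loop length follows the operand's actual bit-width, which is what permits unequal-precision operands. This is a behavioral model under assumed names, not the accelerator's datapath.

```python
# Hedged sketch of zero-skipping bit-serial multiplication.
# Operand `w` is processed one bit per "cycle"; zero bits cost nothing.

def bit_serial_mul(x, w, w_bits):
    acc = 0
    for i in range(w_bits):      # one cycle per weight bit
        if (w >> i) & 1:         # skip cycles for zero bits
            acc += x << i        # shift-and-add partial product
    return acc

assert bit_serial_mul(13, 5, 3) == 13 * 5  # 5 = 0b101, only 2 active cycles
```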
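On the storage side, a toy zero-eliminating run-length compression conveys the intent: each nonzero value is stored together with the length of the zero run preceding it, shrinking sparse data. The encoding format here is an assumption for illustration; the paper's exact RLC format and the ZE/RDD hardware are not reproduced.

```python
# Illustrative zero-eliminating run-length compression (RLC).
# Format is assumed: (zero-run length, nonzero value) pairs plus a
# trailing-zero count. Not the paper's exact encoding.

def rlc_encode(values):
    out, zeros = [], 0
    for v in values:
        if v == 0:
            zeros += 1
        else:
            out.append((zeros, v))  # zeros preceding this nonzero value
            zeros = 0
    return out, zeros               # trailing zeros kept separately

def rlc_decode(pairs, trailing):
    out = []
    for zeros, v in pairs:
        out.extend([0] * zeros)
        out.append(v)
    return out + [0] * trailing

data = [0, 0, 7, 0, 3, 0, 0]
assert rlc_decode(*rlc_encode(data)) == data
```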
- Files in This Item
- There are no files associated with this item.
- Appears in Collections
- College of Engineering > Department of Electronic Engineering > Journal Articles

Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.