Detailed Information

Cited 0 times in Web of Science; cited 14 times in Scopus

A 127.8TOPS/W Arbitrarily Quantized 1-to-8b Scalable-Precision Accelerator for General-Purpose Deep Learning with Reduction of Storage, Logic and Latency Waste

Full metadata record
DC Field: Value
dc.contributor.author: Moon, S.
dc.contributor.author: Mun, H.-G.
dc.contributor.author: Son, H.
dc.contributor.author: Sim, J.-Y.
dc.date.accessioned: 2023-04-25T04:40:15Z
dc.date.available: 2023-04-25T04:40:15Z
dc.date.issued: 2023-02
dc.identifier.issn: 0193-6530
dc.identifier.uri: https://scholarworks.gnu.ac.kr/handle/sw.gnu/59274
dc.description.abstract: Research on deep learning accelerators has focused on inference tasks, improving performance by maximally exploiting sparsity and quantization. Unlike CNN-only networks, however, recent state-of-the-art (SOTA) models consist of multiple blocks of various layers with different layer-by-layer characteristics in sparsity and required precision. This trend presents challenges in building a general accelerator architecture that maximizes the benefits of sparsity and quantization while supporting efficient processing for models ranging from traditional CNNs to the new models to come.

First, there are multiple considerations, including the bottleneck in data bandwidth as well as the trade-off between sparsity and required precision. The required precision is likely to increase as the sparsity increases, which underpins the need for flexibility in setting the quantization with a layer-by-layer configuration. In addition, storing data in a unified format can also prevent maximum utilization of hardware resources. Since recent models have large variations in sparsity [11], a major portion of data movement might be spent sending zeros, causing a severe waste of data bandwidth. We propose a sparsity-aware accelerator that adaptively changes the data format by detecting the sparsity of the given task: data is stored in raw format when the sparse rate is low and in compressed format (run-length coding, RLC) when the sparse rate is high.

Second, there is a correlation between the effective precision and the quantization policy. Arbitrary quantization has demonstrated a higher quality of result (QoR) than linear quantization (denoted as INT). There have been two representative approaches to nonlinear quantization: 1) arbitrary basis (AB), where quantized values are given by linear combinations of n independent bases, and 2) arbitrary quantization (AQ), which has arbitrary 2^n quantized values. Although these quantization schemes achieve good accuracy, there has been no hardware implementation for efficient processing of AQ. Conventional INT multiplication increases in complexity by 4x as both input precisions double; on the other hand, if AQ with a scalable precision of up to 8b were implemented using a look-up-table (LUT) approach, the hardware complexity would explode. To resolve this problem, we propose a hierarchical decoding architecture for AQ with a scalable precision of up to 8b.

Finally, the required precisions for inputs and weights are not the same [4], [10]. Good QoR is realized by assigning more bits to inputs and fewer bits to weights. Previous accelerators handle inputs and weights with a fixed and equal precision, leading to wasted computational energy. This work employs dynamic-precision bit-serial multiplication for the weights to minimize energy waste.

Putting these together, we propose a 1-to-8b scalable-precision general-purpose deep learning accelerator that supports multiply-and-accumulate (MAC) operations with input and weight vectors quantized by AQ and AB, respectively. The accelerator includes three main features: 1) a zero-elimination scheme that works with two data formats, raw and RLC, to save storage cost and improve effective bandwidth, 2) extended-precision AQ computing hardware without exploding logic complexity, and 3) bit-serial AB processing without unnecessary computations. © 2023 IEEE.
dc.format.extent: 3
dc.language: English
dc.language.iso: ENG
dc.publisher: Institute of Electrical and Electronics Engineers Inc.
dc.title: A 127.8TOPS/W Arbitrarily Quantized 1-to-8b Scalable-Precision Accelerator for General-Purpose Deep Learning with Reduction of Storage, Logic and Latency Waste
dc.type: Article
dc.identifier.doi: 10.1109/ISSCC42615.2023.10067615
dc.identifier.scopusid: 2-s2.0-85151722875
dc.identifier.bibliographicCitation: Digest of Technical Papers - IEEE International Solid-State Circuits Conference, v.2023-February, pp. 330-332
dc.citation.title: Digest of Technical Papers - IEEE International Solid-State Circuits Conference
dc.citation.volume: 2023-February
dc.citation.startPage: 330
dc.citation.endPage: 332
dc.type.docType: Conference Paper
dc.description.isOpenAccess: N
dc.description.journalRegisteredClass: scopus
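
The adaptive raw/RLC storage described in the abstract can be illustrated with a minimal software sketch. The 50% switch point, the (zero_run, value) token layout, and the function names below are illustrative assumptions for exposition, not the storage format of the reported hardware.

# Minimal sketch of sparsity-aware format selection (assumed 50% threshold
# and (zero_run, value) token layout; the real accelerator does this in hardware).
from typing import List, Tuple

SPARSITY_THRESHOLD = 0.5  # assumed switch point between raw and RLC storage

def rlc_encode(values: List[int]) -> List[Tuple[int, int]]:
    """Run-length code a vector as (number_of_preceding_zeros, nonzero_value) pairs."""
    pairs, zero_run = [], 0
    for v in values:
        if v == 0:
            zero_run += 1
        else:
            pairs.append((zero_run, v))
            zero_run = 0
    if zero_run:
        pairs.append((zero_run, 0))  # trailing zeros, marked with a zero value
    return pairs

def rlc_decode(pairs: List[Tuple[int, int]]) -> List[int]:
    """Expand the (zero_run, value) pairs back into the original vector."""
    out: List[int] = []
    for zero_run, v in pairs:
        out.extend([0] * zero_run)
        if v != 0:
            out.append(v)
    return out

def pack(values: List[int]):
    """Store raw when mostly dense, RLC when mostly zeros, to avoid wasting bandwidth."""
    sparsity = values.count(0) / len(values)
    if sparsity >= SPARSITY_THRESHOLD:
        return "rlc", rlc_encode(values)
    return "raw", list(values)

if __name__ == "__main__":
    dense  = [3, 1, 4, 1, 5, 9, 2, 6]   # low sparsity  -> kept raw
    sparse = [0, 0, 7, 0, 0, 0, 2, 0]   # high sparsity -> compressed with RLC
    for vec in (dense, sparse):
        fmt, payload = pack(vec)
        print(fmt, payload)
        assert fmt == "raw" or rlc_decode(payload) == vec

On the sparse vector above, the RLC payload holds three tokens instead of eight raw values, while the dense vector keeps the raw path and avoids per-element run-length overhead, mirroring the bandwidth argument in the abstract.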
Files in This Item
There are no files associated with this item.
Appears in Collections
College of Engineering > Department of Electronic Engineering > Journal Articles


Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.

Related Researcher

Son, Hyun Woo
College of IT Engineering (School of Electronic Engineering)
