Dynamic Multi-Scale Quantization: A Quantization Technique for Efficient Large Language Model Compression
Abstract
Quantization techniques are crucial for reducing the computational and memory requirements of large machine learning models, particularly large language models (LLMs). Existing quantization methods often trade off accuracy against efficiency and offer limited adaptability to diverse workloads and hardware environments. This paper introduces Dynamic Multi-Scale Quantization (DMSQ), a framework that combines adaptive precision scaling, per-layer calibration, and workload-aware optimization to address these challenges. We present the mathematical foundations of DMSQ, detail its implementation, and demonstrate its effectiveness through experimental evaluation. DMSQ achieves significant compression while maintaining high accuracy, making it suitable for deployment on resource-constrained devices.
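
To make the abstract's notions of per-layer calibration and adaptive precision concrete, the following is a minimal sketch of a generic per-layer quantizer that picks a bit-width per layer from a reconstruction-error budget. The function names, candidate bit-widths, and MSE-based selection rule are illustrative assumptions, not the DMSQ algorithm itself.

```python
# Illustrative sketch only: generic per-layer symmetric quantization with
# adaptive bit-width selection. This is NOT the authors' DMSQ method; the
# candidate bit-widths and MSE budget below are assumptions for illustration.
import numpy as np

def quantize_symmetric(weights: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric uniform quantization of a weight tensor to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 127 for 8-bit
    scale = np.max(np.abs(weights)) / qmax     # per-tensor scale (calibration)
    if scale == 0:
        return weights.copy()
    q = np.clip(np.round(weights / scale), -qmax, qmax)
    return q * scale                           # dequantized values for evaluation

def choose_bit_width(weights: np.ndarray,
                     candidates=(4, 6, 8),
                     mse_budget: float = 1e-4) -> int:
    """Pick the smallest candidate bit-width whose reconstruction MSE fits the budget."""
    for bits in sorted(candidates):
        mse = np.mean((weights - quantize_symmetric(weights, bits)) ** 2)
        if mse <= mse_budget:
            return bits
    return max(candidates)

# Example: assign a bit-width to each layer of a toy model independently.
rng = np.random.default_rng(0)
layers = {"attn.q_proj": rng.normal(0, 0.02, (512, 512)),
          "mlp.up_proj": rng.normal(0, 0.10, (512, 2048))}
for name, w in layers.items():
    bits = choose_bit_width(w)
    err = np.mean((w - quantize_symmetric(w, bits)) ** 2)
    print(f"{name}: {bits}-bit, mse={err:.2e}")
```

In this sketch, layers with wider weight distributions tend to need more bits to stay within the error budget, which is the intuition behind assigning precision per layer rather than uniformly.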
License
Copyright (c) 2025 Ayush Bodade, Bhavesh Chaudhari, Akshat Biniwale, Sudhir Dhage

This work is licensed under a Creative Commons Attribution 4.0 International License.