KV Cache Quantization
Overview
Key-Value (KV) cache quantization reduces the memory consumed by the KV cache during large language model (LLM) inference and can improve runtime performance, especially at long sequence lengths and large batch sizes. It uses the same quantization method as IQ Weight quantization.
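To make the idea concrete, here is a minimal, generic sketch of quantizing a vector of cached values to a small number of bits and restoring it. It assumes simple min/max (asymmetric) scaling per vector; this is an illustration only and is not taken from the IQ Weight quantization implementation, whose exact scheme may differ. The function names `quantize` and `dequantize` are hypothetical.

```python
def quantize(values, num_bits):
    """Map floats to integer codes in [0, 2**num_bits - 1] via min/max scaling.

    Generic illustration only; not the engine's actual quantization scheme.
    """
    qmax = (1 << num_bits) - 1
    lo, hi = min(values), max(values)
    # Guard against a constant vector, where hi == lo.
    scale = (hi - lo) / qmax if hi > lo else 1.0
    codes = [min(qmax, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate floats from integer codes."""
    return [c * scale + lo for c in codes]


values = [-1.0, -0.25, 0.0, 0.5, 1.0]

codes8, scale8, lo8 = quantize(values, num_bits=8)
restored8 = dequantize(codes8, scale8, lo8)

codes4, scale4, lo4 = quantize(values, num_bits=4)
restored4 = dequantize(codes4, scale4, lo4)

# Worst-case round-trip error for each bit width.
err8 = max(abs(a - b) for a, b in zip(values, restored8))
err4 = max(abs(a - b) for a, b in zip(values, restored4))
```

The round-trip error is bounded by half a quantization step, so the 4-bit variant trades a larger error for a smaller memory footprint than the 8-bit variant.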
Config and Usage
The KV cache quantization feature is controlled by the kv_cache_mode function in the Engine Runtime Config:
kv_cache_mode(cache_mode: AsCacheMode): Sets the cache mode for the key-value cache. The AsCacheMode enum provides three options: AsCacheDefault, AsCacheQuantI8, and AsCacheQuantU4.
AsCacheDefault: keeps the KV cache in the same data type as model inference, which usually means a BF16/FP16 KV cache.
AsCacheQuantI8: quantizes the KV cache to int8, which halves the KV cache memory footprint (compared to BF16).
AsCacheQuantU4: quantizes the KV cache to uint4, which reduces the KV cache memory footprint to 1/4 (compared to BF16).
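The savings above can be checked with a back-of-the-envelope calculation. The sketch below estimates KV cache size from a model shape; the helper `kv_cache_bytes` and the example shape (32 layers, 32 KV heads, head dimension 128, 4096 tokens, batch size 1) are hypothetical and chosen only for illustration. It counts the stored values alone and ignores any extra storage for quantization parameters such as scales, so real savings may be slightly smaller.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bits_per_value):
    """Estimate KV cache size in bytes, ignoring quantization metadata."""
    # Each layer stores two tensors (keys and values) per token.
    num_values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return num_values * bits_per_value // 8


# Hypothetical model shape, for illustration only.
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096, batch_size=1)

bf16_bytes = kv_cache_bytes(**shape, bits_per_value=16)  # AsCacheDefault with BF16
int8_bytes = kv_cache_bytes(**shape, bits_per_value=8)   # AsCacheQuantI8
uint4_bytes = kv_cache_bytes(**shape, bits_per_value=4)  # AsCacheQuantU4

for name, size in [("bf16", bf16_bytes), ("int8", int8_bytes), ("uint4", uint4_bytes)]:
    print(f"{name}: {size / 1024**3:.2f} GiB")
```

Because the size scales linearly with sequence length and batch size, the absolute savings grow with longer contexts and larger batches.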
Example
You can enable this feature by modifying one line in the Quick Start Guide for Python API:
# Insert this line in the runtime config builder section.
runtime_cfg_builder.kv_cache_mode(AsCacheMode.AsCacheQuantI8) # for int8
# runtime_cfg_builder.kv_cache_mode(AsCacheMode.AsCacheQuantU4) # for uint4