=====================
Prefix Caching
=====================

What is Prefix Caching
**********************

Prefix caching keeps KV caches in GPU or CPU memory beyond the lifetime of a single request to reduce redundant computation. When a new prompt shares a prefix with a previous one, the engine reuses the cached KV caches for that prefix instead of recomputing them, which improves performance.

Enable Prefix Caching
*********************

Runtime Configuration
---------------------

- ``prefill_cache(enable=True)``: Enables or disables the prefix cache. Default: ``True``.
- ``prefix_cache_ttl(ttl: int)``: Time to live of the prefix cache, in seconds. Default: ``300``.

Environment Variable
--------------------

- ``CPU_CACHE_RATIO``

  - Description: DashInfer reserves ``CPU_CACHE_RATIO * 100%`` of the currently remaining CPU memory for KV cache storage. When ``CPU_CACHE_RATIO=0``, no CPU memory is used to store the KV cache.
  - Data type: float
  - Default value: ``0.0``
  - Range: ``[0.0, 1.0]``

Performance
***********

Run ``benchmark_throughput.py`` in ``examples/benchmark`` with the following command:

.. code-block:: shell

   model=qwen/Qwen2-7B-Instruct && \
   python3 benchmark_throughput.py --model_path=${model} --modelscope \
   --engine_max_batch=1 --engine_max_length=4003 --device_ids=0 \
   --test_qps=250 --test_random_input --test_sample_size=20 --test_max_output=3 \
   --engine_enable_prefix_cache --prefix_cache_rate_list 0.99,0.9,0.6,0.3

On an NVIDIA A100 GPU, we get the following results:

.. csv-table::
   :header-rows: 1

   Batch_size,Request_num,In_tokens,Out_tokens,Avg_context_time(s),Avg_generate_time(s),Prefix_Cache(hit rate)
   1,20,4000,3,0.030,0.040,96.0%
   1,20,4000,3,0.044,0.040,89.6%
   1,20,4000,3,0.121,0.040,57.6%
   1,20,4000,3,0.185,0.040,28.8%
   1,20,4000,3,0.254,0.040,0.0%
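
To make the table easier to interpret, the average context times can be converted into speedups relative to the 0.0% hit-rate row. The following is a minimal sketch that uses only the numbers reported above; the names ``context_time_s`` and ``context_speedup`` are illustrative and are not part of DashInfer.

.. code-block:: python

   # Avg_context_time(s) from the table above, keyed by prefix cache hit rate (%).
   context_time_s = {
       96.0: 0.030,
       89.6: 0.044,
       57.6: 0.121,
       28.8: 0.185,
       0.0: 0.254,
   }


   def context_speedup(times):
       """Return the context-time speedup of each row versus the 0.0% hit-rate row."""
       baseline = times[0.0]
       return {rate: baseline / t for rate, t in times.items()}


   speedups = context_speedup(context_time_s)
   for rate, speedup in sorted(speedups.items(), reverse=True):
       print(f"hit rate {rate:5.1f}%: context phase {speedup:.2f}x faster")

With these measurements, the context phase is roughly 8.5x faster at a 96.0% hit rate than at 0.0%, while the average generation time stays at 0.040 s in every row.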