===================== Engine Runtime Config ===================== The runtime configuration allows you to set various options for the model inference, such as the maximum batch size, maximum sequence length, and cache modes. You can use the ``AsModelRuntimeConfigBuilder`` class to create and configure the runtime settings. 1. Use model loader's helper funtion to create a runtime config; it will fill all necessary fileds, and you can modify based on this builder. 2. Directly use builder to create; this will require you to fill all necesary fileld like ``model_name`` and paths. 3. You can use a prefilled python dict, and use builder's ``from_dict`` to update or create a builder. Model Configuration ------------------- - ``model_name(model_name: str)``: Sets the name of the model. - ``model_dir(model_dir, file_name_prefix)``: Sets the model file path and weights file path based on the provided directory and file name prefix. - ``model_file_path(model_file_name, weight_file_path)``: This will set the model's graph and model's weight in sepreated way, not recommended. Compute Unit ------------ - ``compute_unit(target_device: TargetDevice, device_id_array=None, compute_thread_in_device: int = 0)``: Setup the runtime compute unit. The `target_device` parameter can be set to `CUDA`, `CPU`, or `CPU_NUMA`. - For CUDA, you can specify the GPU device IDs in `device_id_array`. - For CPU, `compute_thread_in_device` specifies the number of compute threads to use during inference (0 for auto-detection). - For CPU_NUMA, `device_id_array` specifies the NUMA node IDs, and `compute_thread_in_device` specifies the compute threads inside each NUMA node. Compute Unit Examples ^^^^^^^^^^^^^^^^^^^^^ Some examples as follows: CUDA .... 1. CUDA: Single Card : .. code-blocK:: python runtime_builder(safe_model_name, TargetDevice.CUDA, [0], max_batch=64) 2. CUDA: 2 Cards: .. code-blocK:: python runtime_builder(safe_model_name, TargetDevice.CUDA, [0, 1], max_batch=64) 3. CUDA: 2 Cards with specifiy IDs (2nd card and 4th card): .. code-blocK:: python runtime_builder(safe_model_name, TargetDevice.CUDA, [1, 3], max_batch=64) CPU ... 1. CPU with Single NUMA Automatically choose compute thread number. .. code-blocK:: python runtime_builder(safe_model_name, TargetDevice.CPU, [0], max_batch=64) Manually set compute thread; usually number should be equal or less than phyiscal core number. .. code-blocK:: python runtime_builder(safe_model_name, TargetDevice.CPU, [0], max_batch=64).compute_unit(TargetDevice.CPU, compute_thread_in_device=32) Sequence Length and Batch Size ------------------------------ - ``max_length(length: int)``: Sets the maximum sequence length for the engine. - ``max_batch(batch: int)``: Sets the maximum batch size for the engine. - ``max_prefill_length(length: int)``: Sets the maximum prefill length that will be processed in one context inference; if input length is greater than this length, it will be process in multiple context inference steps. Prefix Caching Configuration -------------------------- See :doc:`Prefix Caching <../llm/prefix_caching>`. KV Cache Quantization Configuration ----------------------------------- ``kv_cache_mode(cache_mode: AsCacheMode)``: Sets the cache mode for the key-value cache. The `AsCacheMode` enum provides three options: `AsCacheDefault`, `AsCacheQuantI8`, and `AsCacheQuantU4`. - `AsCacheDefault`: will keep the same data type as model infernece, usually it means a BF16/FP16 stored KV-Cache. - `AsCacheQuantI8`: will quantize kv-cache into int8 type, this will reduce kv-cache memory footprint in half (compared to bf16). - `AsCacheQuantU4`: will quantize kv-cache into uint4 type, this will reduce kv-cache memory footprint in 1/4 (compared to bf16). This config does not depend on weight quantizaion, and it can be switched on/off independently. Utility Functions ----------------- - ``from_dict(rfield)``: Sets the runtime configuration from a dictionary. - ``build()``: Builds and returns the `AsModelConfig` object. Usage Example ------------- Here's an example of how to configure and use the runtime settings: .. code-block:: python runtime_cfg_builder = model_loader.create_reference_runtime_config_builder(safe_model_name, TargetDevice.CUDA, device_list, max_batch=1) # Change the maximum sequence length runtime_cfg_builder.max_length(set_engine_max_length) runtime_cfg_builder.prefill_cache(set_prefill_cache) # Enable int8 or int4 key-value cache quantization if cache_quant_mode != "16": if cache_quant_mode == "8": runtime_cfg_builder.kv_cache_mode(AsCacheMode.AsCacheQuantI8) elif cache_quant_mode == "4": runtime_cfg_builder.kv_cache_mode(AsCacheMode.AsCacheQuantU4) runtime_cfg = runtime_cfg_builder.build() # Install the model into the engine engine.install_model(runtime_cfg) In this example, we first create a ``AsModelRuntimeConfigBuilder`` instance using the ``create_reference_runtime_config_builder`` method from the ``model_loader``. We then set the desired maximum sequence length, enable or disable the prefix cache, and configure the key-value cache quantization mode (int8 or int4) if needed.