Environment Variable Usage
--------------------------


This section describes the definition of the DashInfer environment variables and their function.

Memory Mangament
================

.. list-table:: Environment Var: Memory
   :widths: 10 15 5 5 25
   :header-rows: 1

   * - EnvVar Name
     - Describe
     - Type
     - Default
     - Options

   * - ``BFC_ALLOCATOR``
     - Use BFC Allocator or raw cudaMalloc API

       for management of CUDA Device memory.

     - bool
     - ``ON``
     - ON - Enable BFC Allocator

       OFF - Disable BFC Allocator

   * - ``BFC_MEM_RATIO``
     - The max ratio of device memory that will be managemented by BFC Allocator.
     - float
     - ``0.9``
     - float value between (0.0,1.0]

   * - ``BFC_LEFTOVER_MB``
     - The amount of GPU memory that cannot be used by the BFC allocator,

       typically the sum of GPU memory occupied by PyTorch, CUDA driver,

       default context, etc.
     - int
     - ``350``
     - Formula for the actual GPU memory allocated

       by the BFC Allocator on each GPU:

       ``(Total Physical Memory - BFC_LEFTOVER_MB) * BFC_MEM_RATIO``

   * - ``CPU_CACHE_RATIO``
     - DashInfer will set CPU_CACHE_RATIO * 100% of the current remaining CPU memory 

       for kv cache storage, and when CPU_CACHE_RATIO=0, no CPU memory is used to 

       store kv cache.
     - float
     - ``0.0``
     - float value between [0.0, 1.0]

Logging
=======


.. list-table:: Environment Var: Logging
   :widths: 10 15 5 5 25
   :header-rows: 1

   * - EnvVar Name
     - Describe
     - Type
     - Default
     - Options

   * - ``ALLSPARK_TIME_LOG``
     - Whether logging the generation and context step detailed time in different phase.
     - int
     - ``0``
     - ``0`` - not print;  ``1`` - print log.

   * - ``ALLSPARK_DUMP_OUTPUT_TOKEN``
     - Whether print output token in log.
     - int
     - ``0``
     - ``0`` - not print;  ``1`` - print log.

   * - ``HIE_LOG_SATAUS_INTERVAL``
     - The threshold control for printing statistical log when text generation.
     - int
     - ``5``
     - In second, should be greater than 0.


Engine Behavior
===============

.. list-table:: Environment Var: Engine Behavior
   :widths: 10 15 5 5 25
   :header-rows: 1

   * - EnvVar Name
     - Describe
     - Type
     - Default
     - Options

   * - ``ALLSPARK_USE_TORCH_SAMPLE``
     - Use the same sampler as vllm and PyTorch.

       The generation speed may decrease by 5%-10%.
     - int
     - ``1``
     - ``0`` - use torch sampler

       ``1`` - use DashInfer native sampler,

       which provides the same distribution, 
       
       but not exactly the same value.

   * - ``AS_FLASH_THRESH``
     - Threshold for enable Flash Attention do context attention calculation.

       Flash Attention will be used if context length is greater than this threshold.
     - int
     - ``1024``
     - int value between (0, int64_max).

   * - ``ALLSPARK_DISABLE_WARMUP``
     - Disable warm up step when model is start up.
     - int
     - ``0``
     - ``1``: disable warm up

       ``0``: not disable warm up