Offline Inference with Python API

We have presented a quick start example of LLM inference with Python API in Quick Start Guide for Python API.

Concepts

In this example, there is some explanation about how the inference work and the main components. In this documentation, we will have a deep dive into APIs and how to enable different features.

Main Components

There is three key components, Engine, Model, and Request:

The Model will be installed into Engine and the engine can be controlled by engine API such as Start and Stop, After engine start, it can receive input requests (includes input ids and generate config).
Request: After request starts, API will provide a handle and a queue, and the request can be controlled by the engine API and handle like Start / Stop / Release.
Output: The output id and status of this request can be get by queue object returned by start_request_{text|ids} API,
After this request’s status changes to GenerationFnished or GenerateInterrupted, or stopped by stop_request, the resources will be released by engine, except for the queue.
After release request, all resources related to this request will be freed.

Main Steps

First of all, the inference pipeline has the following main steps:

Model Loading: For Hugging Face models, use this loader: dashinfer.allspark.model_loader.HuggingFaceModel. This loader will parse the parameters of the Hugging Face model and create corresponding parameters for conversion.
Model Serialization: This process converts the model into DashInfer format, offering two modes. The one is a transparent in-memory conversion that does not generate intermediate files; however, the drawback is that it uses twice the memory. The other converts the model to a local file, which can be loaded later using the DashInfer loader (WIP). The Weight Quantization function happens during this step.
Engine Installation and Model Startup: Once the model is loaded, it is installed in the corresponding engine, which then starts the installed model. It will allocate related resources like GPU/CPU memory. It will also warm up the engine. In this step, the engine will generate some fake requests to ensure that a request with maximum length can be run successfully. After warm-up, there are no significant device memory allocations during subsequent request processing, which enhances serving stability.
Request Initiation and Output Reception: This step primarily focuses on asynchronously initiating requests. After a request is created, the engine processes it in a continuous batching manner. The corresponding output from the request is obtained through the request’s output queue, which also allows asynchronous monitoring of the request’s current status.

Note

For NVIDIA GPUs, the device memory will be managed by a BFC Allocator. It will allocate the configured total device memory on Engine Installation and Model Startup step. The allocation ratio can be controlled by the BFC_MEM_RATIO environment variable. For example, BFC_MEM_RATIO=0.8 means allocating 80% of all GPU memory on this device (total memory, not free memory). More memory usually means a larger KV-cache pool, which is better for throughput. The default value is 0.9 in this release.

Explaination with Examples

Simple Python API Code Snippet

generated_ids = []

engine = allspark.Engine()

model_loader = allspark.HuggingFaceModel(model_local_path, safe_model_name)

(model_loader.load_model()
    .read_model_config()
    .serialize_to_memory(engine)
    .free_model())

# create engine runtime config.
runtime_cfg_builder = model_loader.create_reference_runtime_config_builder(safe_model_name, TargetDevice.CUDA, device_list, max_batch=8)
runtime_cfg = runtime_cfg_builder.build()

# install and start.
engine.install_model(runtime_cfg)
engine.start_model(safe_model_name)

# generation request.
gen_cfg = model_loader.create_reference_generation_config_builder(runtime_cfg)

status, handle, queue = engine.start_request_text(safe_model_name, model_loader, "who are you?", gen_cfg)


# sync request will block request until request is finished.
engine.sync_request(safe_model_name, handle)

# sync output
generated_elem = queue.Get()
generated_ids += generated_elem.ids_from_generate

# free request and model
engine.release_request(safe_model_name, handle)
engine.stop_model(safe_model_name)

Engine Creation

Engine object can be created with allspark.Engine(). It will create an Engine instance for later invocation.

from dashinfer import allspark
engine = allspark.Engine()

I20240906 13:28:04.592298 130842 as_engine.cpp:281] AllSpark Init with Version: 3.1.0/(GitSha1:93a8bb12)

After successful initialization, the engine will print its version information. During this step, the engine’s required resources will be allocated.

Model Loading and Serialization

Before loading model, you need to download the model from model hub (like huggingface or modelscope), or specify your local model path, which is needed in this step.

You need to choose a model loader for each type of model. There are two types of supported models:

huggingface format model, use dashinfer.allspark.model_loader.HuggingFaceModel
dashinfer format model, use dashinfer.allspark.model_loader.DashInferModel, which is the converted DashInfer format model files.

Since DashInfer requires customized weight format (Dashinfer Model), the HuggingFaceModel will require serialize step for the format convert. There are two ways to do such conversion: file or memory. If the serialization target is file, it will save the new file into a local directory; if the serialization target is memory, it will serialize the model to a temp file under /tmp, and the file will be deleted after process finishes or .free_model() function is called.

Note

some operation system environment will not mount /tmp as memory file system (tmpfs in linux), which may cause “No Space in File System” error.

In model loading step, you will get many information like model context length, default generation config, tokenizer etc.

Engine Runtime Config

The import data structure in this part is a RuntimeConfig (aka AsModelConfig). It include which device (CUDA or CPU), how many devices (device_list), and the maximum batch size supported, the maximum token length (input + output) will support, and the model’s information such as file path and /or the identifier.

AsModelRuntimeConfigBuilder is the helper class to create RuntimeConfig

Runtime Config Builder

The AsModelRuntimeConfigBuilder class is used to configure the runtime settings for a model in the DashInfer engine. It provides a user-friendly Python API for setting various runtime parameters, such as the model path, computation device, maximum batch size, maximum sequence length, and caching modes.

Runtime Configuration

The runtime configuration includes the following settings:

Model Path: The path to the model file and weights file. User only specify this model localtion when using serialized model, HuggingFace Loader will fill this value automiticly.
Compute Unit: The target device for computation, which can be CUDA, CPU, or CPU_NUMA.
Thread Number: The number of threads to use for computation on CPU or CPU_NUMA devices.
Maximum Batch Size: The maximum batch size for inference.
Maximum Sequence Length: The maximum sequence length for input data.
Cache Mode: The KV cache mode, defualt mode is 16-bit floating point (bfloat16/float16), and can be configured as int8 or uint4 mode.
Prefill Cache: The prompt prefix cache, which can reduce duplicated prefill computation time, and is ON by default.

Usage

Here’s an example of how to use the AsModelRuntimeConfigBuilder:

from dashinfer.allspark import *
from dashinfer.allspark.engine import *

# Create a new builder instance
builder = AsModelRuntimeConfigBuilder()

# Set the model name
builder.model_name("my_model")

# Set the model directory and file name prefix
# User can specify this folder to start running the serialized model.
builder.model_dir("/path/to/model", "model_prefix")

# Set the compute unit to use CUDA device 0
builder.compute_unit(TargetDevice.CUDA, [0])

# Set the maximum batch size and sequence length
builder.max_batch(32)
builder.max_length(2048)

# Set the cache mode to quantize key-value pairs
builder.kv_cache_mode(AsCacheMode.AsCacheQuantU4)

# Build the runtime configuration
runtime_config = builder.build()

# Use the runtime configuration for inference or other operations

The AsModelRuntimeConfigBuilder class provides a fluent interface, allowing you to chain method calls together. It also includes several convenient methods for setting the model path and compute unit from different input formats.

For more detailed information on the available methods and their usage, please refer to the docstrings within the class definition.

Engine Control

After setup of runtime_config, user can call engine.install_model() function to install or register model into engine with a model_name. User can control the model in this engine by this model_name, and the model_name should be unique in this engine.

Model’s running state in engine includes following states:

Initial: the state after model installed.
Running: the state after calling engine.start_model
Stop: the state after calling engine.stop_model; the model can not receive request, and model executing thread will stop.
Released: the state after calling engine.release_model; all resources will be released.

Most time we only deal with the engine in ‘Running’ state. Engine can deal with user’s request in this state.

Text Request and Generation Request Config

This section mainly describe how to start a text LLM request, and how the generation config should be configured.

1. Generate Config Setup.

GenerationConfig

GenerationConfig is used to set various control parameters for text generation. It can be built and configured using the ASGenerationConfigBuilder class. Here are some of the main configuration options:

Sampling Settings

do_sample (bool): Whether to enable sampling in generation. Currently, sampling must be enabled.
temperature (float): Temperature for sampling, controlling the randomness in generation.
top_k (int): Top-K sampling parameter, limiting the selection of the next token.
top_p (float): Top-P sampling parameter for nucleus sampling.

Output Control

max_length (int): Maximum total length of generated text, including both prefill and generation parts.
min_length (int): Minimum length of the generated text. Set to 0 will disable this constraint.
early_stopping (bool): If True, generation stops when the EOS token is encountered.
stop_words_ids (List[List[int]]): A list of word IDs that signal the generation should stop.
eos_token_id (int): ID of the EOS (end of sequence) token, to be specified based on your model.
no_repeat_ngram_size (int): Size of n-grams that should not repeat in the generated text.

Generation Quality Control

repetition_penalty (float): Penalty applied to repeated words.
length_penalty (float): Penalty based on the length of the generated sequence.
presence_penalty (float): Penalty for the presence of certain words in the output.
suppress_repetition_in_generation (bool): If True, uses presence_penalty to suppress word repetition.

Other Settings

seed (int64_t): Seed for random number generation to ensure reproducibility.
logprobs (bool): If True, returns log probabilities of generated tokens. Not supported by some models.
top_logprobs (int): Specifies the number of tokens with log probabilities to return if logprobs is True.
lora_name (str): Name of the LoRA adaptation, if applicable.
mm_info (MultiMediaInfo): Multimedia information, specific to certain use cases.
response_format (dict): Dict of arguments for guided decoding.

Using `ASGenerationConfigBuilder`

You can use the ASGenerationConfigBuilder class to build and configure the GenerationConfig. For example:

builder = ASGenerationConfigBuilder()
config = (builder.do_sample()
        .max_length(512)
        .temperature(0.7)
        .top_k(50)
        .build())

The ASGenerationConfigBuilder provides a fluent interface, allowing you to chain method calls to set the desired configuration. It also supports initialization from a Hugging Face GenerationConfig instance.

2. Send Request

Here is the documentation for sending requests in English:

2. Send Request

The DashInfer engine provides two main functions for initiating a text generation request: start_request_ids and start_request_text. These functions allow you to provide input data in the form of token IDs or text strings, respectively, along with the desired generation configuration.

`start_request_ids`

def start_request_ids(self,
                      model_name: str,
                      model: LLM,
                      input_ids: Tensor,
                      generate_config_builder: ASGenerationConfigBuilder):
    """
    Start a generation request with a model and tensor inputs along with a structured generation configuration.

    Args:
        model_name (str): The name of the model installed for text generation tasks.
        model (LLM): The language model instance.
        input_ids (Tensor): Tensor containing the input token IDs for generation.
        generate_config_builder: An instance of the ASGenerationConfigBuilder class.
    Returns:
        tuple: A tuple consisting of:
            - AsStatus: The status of the request as returned by the engine.
            - object: A request handle to track and manage this specific request.
            - ResultQueue: A queue from which to retrieve the results and status updates of the generation process.
    """

This function accepts input token IDs as a PyTorch Tensor or a Python list, along with the configured ASGenerationConfigBuilder instance. It returns a tuple containing the request status, a handle for the request, and a result queue for retrieving the generated output and monitoring the request status.

`start_request_text`

def start_request_text(self,
                       model_name: str,
                       model: LLM,
                       input_str_or_array,
                       generate_config_builder: ASGenerationConfigBuilder):
    """
    Start Request by model and with text input.

    Args:
        model_name (str): The installed model name.
        model (LLM): The language model instance.
        input_str_or_array (str, List[str]): The input text or an array of input texts.
        generate_config_builder: An instance of the ASGenerationConfigBuilder class.

    Returns:
        tuple: A tuple consisting of:
            - AsStatus: The status of the request as returned by the engine.
            - object: A request handle to track and manage this specific request.
            - ResultQueue: A queue from which to retrieve the results and status updates of the generation process.
    """

This function accepts the input text or an array of input texts, along with the configured ASGenerationConfigBuilder instance. It tokenizes the input using the model’s tokenizer and then initiates the generation request with the input token IDs and the specified generation configuration. The function returns a tuple containing the request status, a handle for the request, and a result queue for retrieving the generated output and monitoring the request status.

Both functions return a request handle and a result queue, which can be used to monitor the request status and retrieve the generated output. The ASGenerationConfigBuilder class is used to configure the generation parameters, such as the maximum length, sampling settings, and output control options.

3. Stop and Release Request

These functions are used to manage and control the lifecycle of generation requests. stop_request allows you to stop a running request, release_request releases the resources associated with a request, and sync_request waits for an asynchronous request to complete before returning.

`stop_request`

def stop_request(self, model_name: str, request_handle) -> AsStatus:
    """
    Stops a request.

    Args:
        model_name (str): Model name.
        request_handle: Handle for the request.

    Returns:
        AsStatus: Status of the operation.
    """

The stop_request function stops a previously initiated request. It takes the following arguments:

model_name (str): The name of the model associated with the request.
request_handle: The handle for the request to be stopped.

It returns an AsStatus object indicating the status of the operation.

`release_request`

def release_request(self, model_name: str, request_handle) -> AsStatus:
    """
    Releases a request's resources.

    Args:
        model_name (str): Model name.
        request_handle: Handle for the request.

    Returns:
        AsStatus: Status of the operation.
    """

The release_request function releases the resources associated with a request. It takes the following arguments:

model_name (str): The name of the model associated with the request.
request_handle: The handle for the request whose resources need to be released.

It returns an AsStatus object indicating the status of the operation.

4. Sync Request

`sync_request`

def sync_request(self, model_name: str, request_handle) -> AsStatus:
    """
    Waits for the completion of an asynchronous request.

    Args:
        model_name (str): Model name.
        request_handle: Handle for the request.

    Returns:
        AsStatus: Status of the operation.
    """

The sync_request function waits for the completion of an asynchronous request, this API is optional, the model inference is start asynchronous in engine when start_request is called. This API mainly for simulate the sync request for user’s use case.

It takes the following arguments:

model_name (str): The name of the model associated with the request.
request_handle: The handle for the asynchronous request.

It returns an AsStatus object indicating the status of the operation.

Output and Status

ResultQueue

The ResultQueue class is designed to generate status and retrieve results from the DashInfer engine.

class ResultQueue

The ResultQueue class provides methods to retrieve the generation status, the current generated length, and request statistics. Additionally, it offers three methods for fetching generated tokens from the queue:

Get() blocks until new tokens are generated.
GetWithTimeout(timeout_ms) blocks until new tokens are generated or the specified timeout is reached.
GetNoWait() returns immediately with the generated tokens or None if the queue is empty.

These methods return the generated tokens as Python objects, or None if the queue is empty or the timeout is reached.

The GenerateRequestStatus enum represents the current status of the generation process. The possible values are:

GenerateRequestStatus.Init: Init status when queue is create.

GenerateRequestStatus.ContextFinished: Status when context (prefill) has been compleled.

GenerateRequestStatus.Generating: Status when request generation is in progress.

GenerateRequestStatus.GenerateInterrupted: Status when engine has no resource to finish this request’s generation, usually meaning no device memory available.

GenerateRequestStatus.GenerateFinished: Status when generation is finished, normally meaning EOS token generated, or generated length exceeds engine_max_length.

GenerateStatus()

Get the generation status. This API will not block.

This method returns the current status of the generation process as an instance of the GenerateRequestStatus enum. The possible values represent different states of the generation.

Returns:: The current generation status.
Return type:: GenerateRequestStatus

GeneratedLength()

Get the current generated length, which is the accumulated number of generated tokens.

Returns:: The current generated length.
Return type:: int

RequestStatInfo()

Get the key-value dictionary of all statistics for this request.

Returns:: A dictionary containing the request statistics.
Return type:: dict

Get()

Fetches new token(s) from the queue. This method will block until new tokens are generated.

Returns:: The generated tokens, or None if the queue is empty.
Return type:: GeneratedElements

GetWithTimeout(timeout_ms)

Fetches new token(s) from the queue. This method will block until new tokens are generated or the specified timeout (in milliseconds) is reached.

Parameters:: timeout_ms (int) – The timeout value in milliseconds.
Returns:: The generated tokens, or None if the queue is empty or the timeout is reached.
Return type:: GeneratedElements or None

GetNoWait()

Fetches new token(s) from the queue without blocking. This method returns None if no new tokens are available.

Returns:: The generated tokens, or None if the queue is empty.
Return type:: GeneratedElements or None

Here’s the documentation for the GeneratedElements class. This class provides access to the generated tokens, their log probabilities, and other related information produced during the text generation process.

class GeneratedElements

Generated Token class, contains token(s) and related information. It may contain multiple tokens generated since the last call to Get methods.

ids_from_generate

Token(s) from this generation.

Type:: list

log_probs_list

A probability list for each token, including the top_logprobs tokens and their probabilities when generated.

Dimension: [num_token][top_logprobs], where each token has a pair [token_id, prob].

Type:: list

token_logprobs_list

Stores the probability value for each selected token.

Type:: list

tensors_from_model_inference

Tensor outputs from model inference.

Type:: list

prefix_cache_len

Cached prefix token length.

Type:: int

prefix_len_gpu

GPU cached prefix token length.

Type:: int

prefix_len_cpu

CPU cached prefix token length.

Type:: int

Here is an example of how to use the Get() and GenerateStatus():

generated_ids = []
status = queue.GenerateStatus()

## in following 3 status, it means tokens are generating
while (status == GenerateRequestStatus.Init
        or status == GenerateRequestStatus.Generating
        or status == GenerateRequestStatus.ContextFinished):
    elements = queue.Get()
    if elements is not None:
        print(f"new token: {elements.ids_from_generate}")
        generated_ids += elements.ids_from_generate
    status = queue.GenerateStatus()
    if status == GenerateRequestStatus.GenerateFinished:
        break
        # This means generated is finished.
    if status == GenerateRequestStatus.GenerateInterrupted:
        break
        # This means the GPU has no available resources; the request has been halted by the engine.
        # The client should collect the tokens generated so far and initiate a new request later.

Offline Inference with Python API

Concepts

Main Components

Main Steps

Explaination with Examples

Simple Python API Code Snippet

Engine Creation

Model Loading and Serialization

Engine Runtime Config

Runtime Config Builder

Runtime Configuration

Usage

Engine Control

Text Request and Generation Request Config

1. Generate Config Setup.

GenerationConfig

Sampling Settings

Output Control

Generation Quality Control

Other Settings

Using ASGenerationConfigBuilder

2. Send Request

2. Send Request

start_request_ids

start_request_text

3. Stop and Release Request

stop_request

release_request

4. Sync Request

sync_request

Output and Status

ResultQueue

Using `ASGenerationConfigBuilder`

`start_request_ids`

`start_request_text`

`stop_request`

`release_request`

`sync_request`