FAQ

How to enable serialization of models in memory?

During the model loading stage, use the serialize_to_memory interface for loading.

(model_loader.load_model()
 .read_model_config()
 .serialize_to_memory(engine, enable_quant=init_quant, weight_only_quant=weight_only_quant)
 .free_model())

Note that after installing the model, you can call the model_loader.free_memory_serialize_file() interface to release memory, which will also be released when the process exits.

How to initiate a request with converted IDs (non-text)?

The example shows how to initiate a text request, but in many scenarios, users need to initiate requests directly with converted IDs. In such cases, you can use the following interface to initiate the request:

# Just an example of how to tokenize your text
tokenizer = model_loader.init_tokenizer().get_tokenizer()
encode_ids = tokenizer.encode(input_str)
# Send an ID request
status, handle, queue = engine.start_request_ids(safe_model_name, model_loader, encode_ids, gen_cfg)

How to set up multi-GPU?

When generating the runtime config, set multi-GPU with runtime_cfg_builder = model_loader.create_reference_runtime_config_builder(safe_model_name, TargetDevice.CUDA, device_list, max_batch=8).

How to switch the precision of model execution, e.g., float16 and bfloat16?

In HuggingFaceModel, use user_set_data_type to set the model type, and the engine will convert the model’s weights to the corresponding data type.

For example:

allspark.HuggingFaceModel(model_model_path, safe_model_name, in_memory_serialize=in_memory,
                          user_set_data_type="bfloat16")

How to configure RuntimeConfig using a dictionary?

AsModelRuntimeConfigBuilder can be imported using the from_dict() function. The format can be referenced from the DIConfig YAML format, and you can also use the corresponding dictionary format for configuration.

Here’s an example:

input_dict = {
    'model_name': 'test_model',
    'compute_unit': {
        'device_type': 'cuda',
        'device_ids': [0, 1],
        'compute_thread_in_device': 2
    },
    'engine_max_length': 100,
    'engine_max_batch': 32
}
# Create a Builder instance and call from_dict
builder = AsModelRuntimeConfigBuilder()
builder.from_dict(input_dict)

For the complete configuration, please refer to the configuration file in DIConfig.

How to asynchronously retrieve output (without calling sync_request)?

See the code in Handling Output section of Quick Start Guide for Python API. Asynchronous output retrieval is the recommended way to achieve high throughput.