FAQ === How to enable serialization of models in memory? ------------------------------------------------ During the model loading stage, use the ``serialize_to_memory`` interface for loading. .. code-block:: python (model_loader.load_model() .read_model_config() .serialize_to_memory(engine, enable_quant=init_quant, weight_only_quant=weight_only_quant) .free_model()) Note that after installing the model, you can call the ``model_loader.free_memory_serialize_file()`` interface to release memory, which will also be released when the process exits. How to initiate a request with converted IDs (non-text)? -------------------------------------------------------- The example shows how to initiate a text request, but in many scenarios, users need to initiate requests directly with converted IDs. In such cases, you can use the following interface to initiate the request: .. code-block:: python # Just an example of how to tokenize your text tokenizer = model_loader.init_tokenizer().get_tokenizer() encode_ids = tokenizer.encode(input_str) # Send an ID request status, handle, queue = engine.start_request_ids(safe_model_name, model_loader, encode_ids, gen_cfg) How to set up multi-GPU? ------------------------ When generating the runtime config, set multi-GPU with ``runtime_cfg_builder = model_loader.create_reference_runtime_config_builder(safe_model_name, TargetDevice.CUDA, device_list, max_batch=8)``. How to switch the precision of model execution, e.g., float16 and bfloat16? ----------------------------------------------------------------------------------- In ``HuggingFaceModel``, use ``user_set_data_type`` to set the model type, and the engine will convert the model's weights to the corresponding data type. For example: .. code-block:: python allspark.HuggingFaceModel(model_model_path, safe_model_name, in_memory_serialize=in_memory, user_set_data_type="bfloat16") How to configure RuntimeConfig using a dictionary? -------------------------------------------------- ``AsModelRuntimeConfigBuilder`` can be imported using the ``from_dict()`` function. The format can be referenced from the DIConfig YAML format, and you can also use the corresponding dictionary format for configuration. Here's an example: .. code-block:: python input_dict = { 'model_name': 'test_model', 'compute_unit': { 'device_type': 'cuda', 'device_ids': [0, 1], 'compute_thread_in_device': 2 }, 'engine_max_length': 100, 'engine_max_batch': 32 } # Create a Builder instance and call from_dict builder = AsModelRuntimeConfigBuilder() builder.from_dict(input_dict) For the complete configuration, please refer to the configuration file in DIConfig. How to asynchronously retrieve output (without calling `sync_request`)? ----------------------------------------------------------------------- See the code in `Handling Output` section of :doc:`get_started/quick_start_api_py_en`. Asynchronous output retrieval is the recommended way to achieve high throughput.