===============
LoRA Support
===============

Before reading this document, please first read LoRA Adapters for the basic concepts and process.

Overview
--------------

The AllSpark Engine can work with multiple LoRAs based on the models listed in the `python/pyhie/allspark/model/` directory.
When you want to perform inference with LoRA for the first time, complete the four steps described in the following sections: prepare the LoRA adapters, enable LoRA when converting the base model, set the LoRA limits, and run inference with a LoRA.

Prepare LoRA Adapters
------------------------

Before conversion, ensure that the LoRA adapter files are in PyTorch format. If they are in SafeTensors format, you can convert them using the `safetensors` module:

.. code-block:: python

    import os
    import torch
    from safetensors import safe_open

    raw_lora_dir = '/dir/to/my/loras/'
    safetensor_model_path = os.path.join(raw_lora_dir, 'lora-1/adapter_model.safetensors')
    tensors = {}
    with safe_open(safetensor_model_path, framework="pt", device='cpu') as f:
        for k in f.keys():
            tensors[k] = f.get_tensor(k)
    torch.save(tensors, os.path.join(raw_lora_dir, 'lora-1/adapter_model.bin'))

After conversion, the directory structure should look like this:

.. code-block:: bash

    /dir/to/my/loras/
    |__ lora-1/
        |__ adapter_config.json
        |__ adapter_model.safetensors (raw)
        |__ adapter_model.bin (converted to pth format)
    |__ lora-2/
        |__ adapter_config.json
        |__ adapter_model.safetensors (raw)
        |__ adapter_model.bin (converted to pth format)

Enable LoRA
--------------

To enable LoRA support, convert the base model into AllSpark format and pass a JSON-style `lora_cfg` argument with the following fields:

1. `input_base_dir`: A relative or absolute path to the parent directory of the LoRA adapters.
2. `lora_names`: A list of LoRA names, each being a directory name under `input_base_dir`.
3. `lora_only`: An optional boolean flag. If `False`, both the base model and the LoRA adapters are converted; if `True`, only the LoRA adapters are converted.

.. code-block:: python

    output_dir = '/path/to/output/dir/'  # output directory name
    engine = allspark.Engine()
    model_loader = allspark.HuggingFaceModel(...)
    model_loader.load_model().read_model_config().serialize_to_path(
        engine,
        output_dir,
        lora_cfg={
            'input_base_dir': '/dir/to/my/loras/',  # parent directory of all the LoRA adapter directories
            'lora_names': ['lora-1', 'lora-2'],     # LoRA adapters to convert; each name is also the adapter's directory name
            'lora_only': False                      # False: convert the base model and the adapters; True: convert the adapters only
        }
    )

After calling `serialize_to_path()`, four files are generated in `output_dir`:

- `qwen7b.asgraph`: Base model graph with LoRA support
- `qwen7b.asparam`: Weight data of the base model
- `lora-{1,2}.aslora`: Converted LoRA adapters (one file per adapter)

Setup LoRA Limits
-------------------

The LoRA limits should be set appropriately before inference. You can change the default limits as follows:

.. code-block:: python

    runtime_cfg_builder = model_loader.create_reference_runtime_config_builder(...)
    runtime_cfg_builder.lora_max_num(20).lora_max_rank(64)

Now both the maximum number of LoRA adapters and their maximum rank are set.

Infer With LoRA
-----------------

Finally, pass the `lora_name` argument into `GenerationConfig` and use it to perform generation tasks. The full example in the next section loads each adapter with `engine.load_lora()` before issuing requests.

.. code-block:: python

    gen_cfg = model_loader.create_reference_generation_config_builder(runtime_cfg)
    gen_cfg.update({"lora_name": 'lora-2'})
    status, handle, queue = engine.start_request_text(converted_model_name, model_loader, input_str, gen_cfg)

Example
--------------

The full example of how to use LoRA is as follows:

.. code-block:: python

    import os
    import torch
    import modelscope
    from modelscope.utils.constant import DEFAULT_MODEL_REVISION

    from dashinfer import allspark
    from dashinfer.allspark.engine import TargetDevice
    from dashinfer.allspark.prompt_utils import PromptTemplate
    from dashinfer.allspark._allspark import AsStatus, GenerateRequestStatus, AsCacheMode
    from safetensors import safe_open

    def check_transformers_version():
        import transformers
        from packaging import version
        required_version = "4.37.0"
        current_version = transformers.__version__
        # Compare parsed versions; a plain string comparison would rank "4.9.0" above "4.37.0".
        if version.parse(current_version) < version.parse(required_version):
            raise Exception(
                f"Transformers version {current_version} is lower than required version {required_version}. "
                f"Please upgrade transformers to version {required_version}."
            )

    def convert_safetensor_to_pytorch(raw_lora_dir):
        # Read every tensor from the SafeTensors file and save them as a PyTorch .bin file.
        model_path = os.path.join(raw_lora_dir, 'adapter_model.safetensors')
        tensors = {}
        with safe_open(model_path, framework="pt", device='cpu') as f:
            for k in f.keys():
                tensors[k] = f.get_tensor(k)
        torch.save(tensors, os.path.join(raw_lora_dir, 'adapter_model.bin'))

    if __name__ == '__main__':
        check_transformers_version()
        # To use in-memory serialization, change this flag to True.
        in_memory = False
        init_quant = False
        weight_only_quant = True
        device_list = [0, 1]
        fetch_output_mode = "async"  # or "sync"
        modelscope_name = "qwen/Qwen2-7B-Instruct"
        ms_version = DEFAULT_MODEL_REVISION
        output_model_dir = "../../model_output"
        model_local_path = modelscope.snapshot_download(modelscope_name, ms_version)
        safe_model_name = str(modelscope_name).replace("/", "_")

        model_loader = allspark.HuggingFaceModel(model_local_path, safe_model_name,
                                                 user_set_data_type="bfloat16",
                                                 in_memory_serialize=in_memory,
                                                 trust_remote_code=True)
        engine = allspark.Engine()

        # lora-1 and lora-2 are adapter directories, each containing
        # adapter_config.json and adapter_model.bin (pth format).
        lora_base_dir = '/dir/to/my/loras/'

        # If the adapters are in .safetensors format, run the following conversion first.
        convert_safetensor_to_pytorch(os.path.join(lora_base_dir, 'lora-1'))
        convert_safetensor_to_pytorch(os.path.join(lora_base_dir, 'lora-2'))

        if in_memory:
            (model_loader.load_model()
             .read_model_config()
             .serialize_to_memory(engine, enable_quant=init_quant, weight_only_quant=weight_only_quant,
                                  lora_cfg={'input_base_dir': lora_base_dir,
                                            'lora_names': ['lora-1', 'lora-2'],
                                            'lora_only': False})
             .export_model_diconfig(os.path.join(output_model_dir, "diconfig.yaml"))
             .free_model())
        else:
            (model_loader.load_model()
             .read_model_config()
             .serialize_to_path(engine, output_model_dir, enable_quant=init_quant,
                                weight_only_quant=weight_only_quant,
                                lora_cfg={'input_base_dir': lora_base_dir,
                                          'lora_names': ['lora-1', 'lora-2'],
                                          'lora_only': False},
                                skip_if_exists=True)
             .free_model())

        runtime_cfg_builder = model_loader.create_reference_runtime_config_builder(
            safe_model_name, TargetDevice.CUDA, device_list, max_batch=8)
        # For example, reduce the engine max length and set the LoRA limits.
        runtime_cfg_builder.max_length(256).lora_max_num(25).lora_max_rank(64)

        # For example, enable an int8 or int4 KV cache instead of the fp16 KV cache:
        # runtime_cfg_builder.kv_cache_mode(AsCacheMode.AsCacheQuantI8)
        # or u4:
        # runtime_cfg_builder.kv_cache_mode(AsCacheMode.AsCacheQuantU4)
        runtime_cfg = runtime_cfg_builder.build()

        # Install the model into the engine.
        engine.install_model(runtime_cfg)

        if in_memory:
            model_loader.free_memory_serialize_file()

        # Start the engine.
        engine.start_model(safe_model_name)

        # Load the LoRA adapters.
        ret = engine.load_lora(safe_model_name, 'lora-1')
        assert ret == AsStatus.ALLSPARK_SUCCESS
        ret = engine.load_lora(safe_model_name, 'lora-2')
        assert ret == AsStatus.ALLSPARK_SUCCESS

        # Start model inference with LoRA.
        input_list = ["你是谁?", "How to protect our planet and build a green future?"]
        for i in range(len(input_list)):
            input_str = input_list[i]
            input_str = PromptTemplate.apply_chatml_template(input_str)
            # Create a reference generation config.
            gen_cfg = model_loader.create_reference_generation_config_builder(runtime_cfg)
            # Adjust the generation config as needed, for example top_k = 1.
            gen_cfg.update({"top_k": 1})
            gen_cfg.update({"repetition_penalty": 1.1})
            # Alternate between 'lora-1' and 'lora-2' across requests.
            gen_cfg.update({"lora_name": 'lora-%d' % (i % 2 + 1)})
            # gen_cfg.update({"eos_token_id": 151645})
            status, handle, queue = engine.start_request_text(safe_model_name, model_loader, input_str, gen_cfg)

            generated_ids = []
            if fetch_output_mode == "sync":
                # sync_request() blocks until the request finishes, like a synchronous interface.
                # Without this call the request runs asynchronously, and results can be fetched
                # by polling the queue until its status becomes GenerateFinished.
                engine.sync_request(safe_model_name, handle)

                # After sync, fetch all generated ids with queue.Get(). This is a blocking call
                # that returns when there are new tokens or generation has finished.
                generated_elem = queue.Get()
                # After Get(), the engine frees the returned tokens, so each call yields only new tokens.
                generated_ids += generated_elem.ids_from_generate
            else:
                status = queue.GenerateStatus()

                # In the following three statuses, tokens are still being generated.
                while (status == GenerateRequestStatus.Init
                       or status == GenerateRequestStatus.Generating
                       or status == GenerateRequestStatus.ContextFinished):
                    print(f"2 request: status: {queue.GenerateStatus()}")
                    elements = queue.Get()
                    if elements is not None:
                        print(f"new token: {elements.ids_from_generate}")
                        generated_ids += elements.ids_from_generate
                    status = queue.GenerateStatus()
                    if status == GenerateRequestStatus.GenerateFinished:
                        # Generation has finished.
                        break
                    if status == GenerateRequestStatus.GenerateInterrupted:
                        # The GPU has no available resources; the request has been halted by the engine.
                        # The client should collect the tokens generated so far and initiate a new request later.
                        break

            # De-tokenize the ids to text.
            output_text = model_loader.init_tokenizer().get_tokenizer().decode(generated_ids)
            print("---" * 20)
            print(f"test case: {modelscope_name} input:\n{input_str} \n output:\n{output_text}\n")
            print(f"input token:\n {model_loader.init_tokenizer().get_tokenizer().encode(input_str)}")
            print(f"output token:\n {generated_ids}")
            engine.release_request(safe_model_name, handle)

        engine.stop_model(safe_model_name)
        engine.release_model(safe_model_name)
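
When choosing values for `lora_max_num` and `lora_max_rank`, it can help to inspect the adapters before building the runtime config. The following is a minimal sketch, not part of the AllSpark API: `summarize_lora_adapters` is a hypothetical helper, and it assumes each `adapter_config.json` follows the Hugging Face PEFT layout, in which the `r` field holds the LoRA rank. It uses only the Python standard library.

```python
import json
from pathlib import Path


def summarize_lora_adapters(lora_base_dir, lora_names):
    """Return (adapter_count, max_rank) for the given adapter directories.

    Hypothetical helper. Assumes every adapter directory contains a
    PEFT-style adapter_config.json whose "r" field is the LoRA rank.
    """
    ranks = {}
    for name in lora_names:
        config_path = Path(lora_base_dir) / name / "adapter_config.json"
        with open(config_path, "r", encoding="utf-8") as f:
            config = json.load(f)
        if "r" not in config:
            raise KeyError(f"{config_path} has no 'r' (rank) field")
        ranks[name] = int(config["r"])
    # default=0 keeps the call well defined when no adapters are listed.
    return len(ranks), max(ranks.values(), default=0)


# Possible usage with the builder from this document (not executed here):
#   num, rank = summarize_lora_adapters('/dir/to/my/loras/', ['lora-1', 'lora-2'])
#   runtime_cfg_builder.lora_max_num(num).lora_max_rank(rank)
```

The returned values are lower bounds for the limits; you may want a larger `lora_max_num` if more adapters will be loaded later.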