====================================== Welcome to DashInfer documentation ====================================== DashInfer is a highly optimized LLM inference engine with the following core features: - **Lightweight Architecture**: DashInfer requires minimal third-party dependencies and uses static linking for almost all dependency libraries. By providing C++ and Python interfaces, DashInfer can be easily integrated into your existing system. - **High Precision**: DashInfer has been rigorously tested to ensure accuracy, and is able to provide inference whose accuracy is consistent with PyTorch and other GPU engines (e.g., vLLM). - **High Performance**: DashInfer employs optimized kernels to provide high-performance LLM serving, as well as lots of standard LLM inference techniques, including: - **Continuous Batching**: DashInfer allows for the immediate insertion of new requests and supports streaming outputs. - **Paged Attention**: Using our self-developed paged attention technique (which we call *SpanAttention*), we can achieve efficient acceleration of attention operator, combined with int8 and uint4 KV cache quantization, based on highly efficient GEMM and GEMV implementations. - **Prefix Cache**: DashInfer supports highly efficient Prefix Cache for prompts, which accelerates standard LLMs and MultiModal LMs (MMLMs) like Qwen-VL, using both GPU and CPU. - **Quantization Support**: Using DashInfer's *InstantQuant* (IQ), weight-only quantization acceleration can be achieved without fine-tuning, improving deployment efficiency. Accuracy evaluation shows that IQ has almost no impact on model accuracy, for detail, see :doc:`quant/weight_activate_quant`. - **Asynchronous Interface**: Request-based asynchronous interfaces offer individual control over generation parameters and request status of each request. - Supported Models: - **Mainstream Open-Source LLMs**: DashInfer supports mainstream open-source LLMs including Qwen, LLaMA, ChatGLM, etc., and supports loading models in the Huggingface format. - **MultiModal LMs**: DashInfer supports MultiModal Language Models (MMLMs) including Qwen-VL, Qwen-AL, and Qwen2-VL. - **OpenAI API Server**: DashInfer can easily serve with fastChat to achieve OpenAI-compatible API server. - **Multi-Programming-Language API**: Both C++ and Python interfaces are provided. It is possible to extend C++ interface to Java, Rust and other programming languages, via standard cross-language interfaces. ============= Release Note ============= .. include:: release_note.rst ================== Table of Contents ================== .. _get_started: .. toctree:: :maxdepth: 1 :caption: Getting Started get_started/install_en.md get_started/quick_start_api_py_en.md get_started/quick_start_api_server_en.md get_started/env_var_options_en .. _supported_models: .. toctree:: :maxdepth: 1 :caption: Models supported_models_en .. _llm_deployment: .. toctree:: :maxdepth: 1 :caption: LLM Deployment llm/llm_offline_inference_en llm/runtime_config llm/guided_decoding llm/prefix_caching llm/lora_support .. _vlm_deployment: .. toctree:: :maxdepth: 1 :caption: MultiModal LM (MMLM) Deployment vlm/vlm_offline_inference_en .. _developer_guide: .. toctree:: :maxdepth: 2 :caption: Developer Guide devel/source_code_build_en.rst .. _quant_support: .. toctree:: :maxdepth: 2 :caption: Quantization quant/weight_activate_quant quant/kv_cache_quant .. _sub_proj: .. toctree:: :maxdepth: 2 :caption: Subprojects sub_proj/intro sub_proj/hiednn.md sub_proj/spanattn.md .. The following sections are not ready yet .. Benchmark .. =========== .. profile/profile_latency_throughput .. profile/profile_op .. eval/evaluation_llm .. Advanced Guide .. ============== .. adv/content_length_extention .. adv/json_mode_output .. API Reference .. ============= .. api/py_api_ref .. _faq: .. toctree:: :maxdepth: 1 :caption: FAQ faq_en