| Roadmap | Support Matrix | Docs | Recipes | Examples | Prebuilt Containers | Design Proposals | Blogs
High-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.
Large language models exceed single-GPU capacity. Tensor parallelism spreads layers across GPUs but creates coordination challenges. Dynamo closes this orchestration gap.
Dynamo is inference engine agnostic (supports TRT-LLM, vLLM, SGLang) and provides:
- Disaggregated Prefill & Decode – Maximizes GPU throughput and lets you trade off latency against throughput
- Dynamic GPU Scheduling – Optimizes performance based on fluctuating demand
- LLM-Aware Request Routing – Eliminates unnecessary KV cache re-computation
- Accelerated Data Transfer – Reduces inference response time using NIXL
- KV Cache Offloading – Leverages multiple memory hierarchies for higher throughput
Built in Rust for performance and Python for extensibility, Dynamo is fully open-source with an OSS-first development approach.
| Feature | vLLM | SGLang | TensorRT-LLM |
|---|---|---|---|
| Disaggregated Serving | ✅ | ✅ | ✅ |
| KV-Aware Routing | ✅ | ✅ | ✅ |
| SLA-Based Planner | ✅ | ✅ | ✅ |
| KVBM | ✅ | 🚧 | ✅ |
| Multimodal | ✅ | ✅ | ✅ |
| Tool Calling | ✅ | ✅ | ✅ |
Full Feature Matrix → Detailed compatibility including LoRA, Request Migration, Speculative Decoding, and feature interactions.
- [12/05] Moonshot AI's Kimi K2 achieves 10x inference speedup with Dynamo on GB200
- [12/02] Mistral AI runs Mistral Large 3 with 10x faster inference using Dynamo
- [12/01] InfoQ: NVIDIA Dynamo simplifies Kubernetes deployment for LLM inference
- [11/20] Dell integrates PowerScale with Dynamo's NIXL for 19x faster TTFT
- [11/20] WEKA partners with NVIDIA on KV cache storage for Dynamo
- [11/13] Dynamo Office Hours Playlist
- [10/16] How Baseten achieved 2x faster inference with NVIDIA Dynamo
| Path | Use Case | Time | Requirements |
|---|---|---|---|
| Local Quick Start | Test on a single machine | ~5 min | 1 GPU, Ubuntu 24.04 |
| Kubernetes Deployment | Production multi-node clusters | ~30 min | K8s cluster with GPUs |
Want to help shape the future of distributed LLM inference? We welcome contributors at all levels—from doc fixes to new features.
- Contributing Guide – How to get started
- Report a Bug – Found an issue?
- Feature Request – Have an idea?
The following examples require a few system-level packages. We recommend Ubuntu 24.04 with an x86_64 CPU. See docs/reference/support-matrix.md.
The Dynamo team recommends the uv Python package manager, though any Python package manager will work. Install uv:
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
Backend engines require Python development headers for JIT compilation. Install them with:
```bash
sudo apt install python3-dev
```
We publish Python wheels specialized for each of our supported engines: vllm, sglang, and trtllm. The examples that follow use SGLang; continue reading for other engines.
```bash
uv venv venv
source venv/bin/activate
uv pip install pip

# Choose one
uv pip install "ai-dynamo[sglang]"  # replace with [vllm], [trtllm], etc.
```
Before trying out Dynamo, you can verify your system configuration and dependencies:
```bash
python3 deploy/sanity_check.py
```
This is a quick check for system resources, development tools, LLM frameworks, and Dynamo components.
Dynamo provides a simple way to spin up a local set of inference components, including:
- OpenAI-Compatible Frontend – A high-performance, OpenAI-compatible HTTP API server written in Rust.
- Basic and KV-Aware Router – Routes and load-balances traffic across a set of workers.
- Workers – A set of pre-configured LLM serving engines.
```bash
# Start an OpenAI compatible HTTP server with prompt templating, tokenization, and routing.
# For local dev: --store-kv file avoids etcd (workers and frontend must share a disk)
python3 -m dynamo.frontend --http-port 8000 --store-kv file

# Start the SGLang engine. You can run several of these for the same or different models.
# The frontend will discover them automatically.
python3 -m dynamo.sglang --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B --store-kv file
```

Note: vLLM workers publish KV cache events by default, which requires NATS. For dependency-free local development with vLLM, add --kv-events-config '{"enable_kv_cache_events": false}'. This keeps local prefix caching enabled while disabling event publishing. See Service Discovery and Messaging for details.
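As a sketch of that setup, a dependency-free vLLM worker might be launched as shown below; the --model flag name is an assumption, so check python3 -m dynamo.vllm --help for the exact option names.

```bash
# Sketch: local-dev vLLM worker without etcd/NATS (verify flag names with --help).
# --store-kv file matches the frontend's file-based discovery; the kv-events flag
# disables NATS-backed event publishing while keeping local prefix caching enabled.
python3 -m dynamo.vllm \
  --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
  --store-kv file \
  --kv-events-config '{"enable_kv_cache_events": false}'
```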
```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
  "messages": [
    {
      "role": "user",
      "content": "Hello, how are you?"
    }
  ],
  "stream": false,
  "max_tokens": 300
}' | jq
```
Rerun with curl -N and change stream in the request to true to get the responses as soon as the engine issues them.
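For reference, that streaming variant looks like this:

```bash
# Same request with streaming enabled; -N disables curl's output buffering.
curl -N localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
  "messages": [{"role": "user", "content": "Hello, how are you?"}],
  "stream": true,
  "max_tokens": 300
}'
```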
- Scale up: Deploy on Kubernetes with Recipes
- Add features: Enable KV-aware routing, disaggregated serving
- Benchmark: Use AIPerf to measure performance
- Try other engines: vLLM, SGLang, TensorRT-LLM
For production deployments on Kubernetes clusters with multiple GPUs.
- Kubernetes cluster with GPU nodes
- Dynamo Platform installed
- HuggingFace token for model downloads (see the sketch below)
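Recipes typically expect the HuggingFace token to be available as a Kubernetes secret. A minimal sketch, assuming a secret named hf-token-secret with key HF_TOKEN; match these names to whatever your recipe or Dynamo Platform install expects:

```bash
# Sketch: store your HuggingFace token in a Kubernetes secret.
# Secret name, key, and namespace are assumptions; align them with your deployment.
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=<your-hf-token> \
  -n <your-namespace>
```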
Pre-built deployment configurations for common models and topologies:
| Model | Framework | Mode | GPUs | Recipe |
|---|---|---|---|---|
| Llama-3.1-70B | vLLM | Aggregated | 4x H100 | View |
| DeepSeek-R1 | SGLang | Disaggregated | 8x H200 | View |
| Qwen3-32B | TensorRT-LLM | Disaggregated | 8x GPU | View |
See recipes/README.md for the full list and deployment instructions.
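Deploying a recipe typically amounts to applying its manifest to the cluster; the path below is purely illustrative, since the actual file names and namespaces are listed in recipes/README.md:

```bash
# Illustrative only: apply a recipe manifest to your cluster.
# The path and namespace are placeholders; see recipes/README.md for real locations.
kubectl apply -f recipes/<model>/<framework>/deploy.yaml -n <your-namespace>
```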
Dynamo is inference engine agnostic. Install the wheel for your chosen engine and run with python3 -m dynamo.<engine> --help.
| Engine | Install | Docs | Best For |
|---|---|---|---|
| vLLM | uv pip install "ai-dynamo[vllm]" | Guide | Broadest feature coverage |
| SGLang | uv pip install "ai-dynamo[sglang]" | Guide | High-throughput serving |
| TensorRT-LLM | pip install --pre --extra-index-url https://pypi.nvidia.com "ai-dynamo[trtllm]" | Guide | Maximum performance |
Note: TensorRT-LLM requires pip (not uv) due to URL-based dependencies. See the TRT-LLM guide for container setup and prerequisites.
Use CUDA_VISIBLE_DEVICES to specify which GPUs to use. Engine-specific options (context length, multi-GPU, etc.) are documented in each backend guide.
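For example, to pin a worker to particular GPUs (the GPU indices here are illustrative):

```bash
# Restrict this worker to GPUs 0 and 1; all other engine options stay the same.
CUDA_VISIBLE_DEVICES=0,1 python3 -m dynamo.sglang --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
```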
Dynamo uses TCP for inter-component communication. External services are optional for most deployments:
| Deployment | etcd | NATS | Notes |
|---|---|---|---|
| Kubernetes | ❌ Not required | ❌ Not required | K8s-native discovery; TCP request plane |
| Local Development | ❌ Not required | ❌ Not required | Pass --store-kv file; vLLM also needs --kv-events-config '{"enable_kv_cache_events": false}' |
| KV-Aware Routing | — | ✅ Required | KV cache events (enabled by default with prefix caching) are published over NATS |
For local development without external dependencies, pass --store-kv file (avoids etcd) to both the frontend and workers. vLLM users should also pass --kv-events-config '{"enable_kv_cache_events": false}' to disable KV event publishing (avoids NATS) while keeping local prefix caching enabled; SGLang and TRT-LLM don't require this flag.
For distributed non-Kubernetes deployments or KV-aware routing, you need etcd (service discovery) and/or NATS (messaging). To quickly set up both:
```bash
docker compose -f deploy/docker-compose.yml up -d
```
Dynamo provides comprehensive benchmarking tools:
- Benchmarking Guide – Compare deployment topologies using AIPerf
- SLA-Driven Deployments – Optimize deployments to meet SLA requirements
The OpenAI-compatible frontend exposes an OpenAPI 3 spec at /openapi.json. To generate without running the server:
```bash
cargo run -p dynamo-llm --bin generate-frontend-openapi
```
This writes the spec to docs/frontends/openapi.json.
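With a frontend already running (as in the quick start on port 8000), you can also fetch the spec directly:

```bash
# Download the OpenAPI 3 spec from a running frontend.
curl -s localhost:8000/openapi.json | jq .
```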
For contributors who want to build Dynamo from source rather than installing from PyPI.
Ubuntu:
```bash
sudo apt install -y build-essential libhwloc-dev libudev-dev pkg-config libclang-dev protobuf-compiler python3-dev cmake
```
macOS:
```bash
# If brew is not installed on your system, install it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install cmake protobuf

# Check that Metal is accessible
xcrun -sdk macosx metal
```
If Metal is accessible, you should see an error like metal: error: no input files, which confirms it is installed correctly.
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
```
If you don't already have uv installed, follow the uv installation guide. Once uv is installed, create a virtual environment and activate it.
- Install uv:

  ```bash
  curl -LsSf https://astral.sh/uv/install.sh | sh
  ```

- Create and activate a virtual environment, then install pip and maturin:

  ```bash
  uv venv dynamo
  source dynamo/bin/activate
  uv pip install pip maturin
  ```
Maturin is the Rust<->Python bindings build tool.
```bash
cd lib/bindings/python
maturin develop --uv
```
The GPU Memory Service is a Python package with a C++ extension. It requires only Python development headers and a C++ compiler (g++).
```bash
cd $PROJECT_ROOT
uv pip install -e lib/gpu_memory_service
```
Then install the ai-dynamo Python package itself:
```bash
cd $PROJECT_ROOT
uv pip install -e .
```
You should now be able to run python3 -m dynamo.frontend.
For local development, pass --store-kv file to avoid external dependencies (see Service Discovery and Messaging section).
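For example, using the same flags as the quick start:

```bash
# Run the source-built frontend without external dependencies.
python3 -m dynamo.frontend --http-port 8000 --store-kv file
```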
Set the environment variable DYN_LOG to adjust the logging level; for example, export DYN_LOG=debug. It has the same syntax as RUST_LOG.
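For instance (the per-target name below is an illustrative assumption, not a documented module name):

```bash
# Global debug logging
export DYN_LOG=debug

# RUST_LOG-style per-target filtering; the target name is illustrative
export DYN_LOG=info,dynamo_runtime=trace
```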
If you use VS Code or Cursor, we have a .devcontainer folder built on Microsoft's Dev Containers extension. See the devcontainer README for setup instructions.


