vLLM reference knowledge
A machine-readable knowledge base your AI SRE agent can query as a tool — the operational facts about vLLM that don't live reliably in a model's training data, modeled from first-party sources and the practitioner long tail.
Reference sources: 37
Concepts modeled: 59
Connections: 126
Reference sources (37)
The primary material behind the knowledge: official docs, releases, pull requests, issues, and research. Every item links to its real source.
Official documentation (15)
First-party vLLM docs pages, modeled and linked to source.
- Core vLLM Production Stack integration docs
- latest optimization docs
- Production Stack KEDA autoscaling docs
- Production Stack KV cache aware routing docs
- Production Stack overview
- Production Stack prefix aware routing docs
- Production Stack router command configuration docs
- v0.4.2 performance docs
- vLLM benchmark API docs
- vLLM engine arguments docs
- vLLM optimization and tuning docs
- vLLM production metrics docs
- vLLM Production Stack benchmarking page
- vLLM serve benchmark docs
- vLLM throughput benchmark docs
Pull requests (7)
Merged changes that altered behavior, defaults, or compatibility.
- [Bugfix] Fix condition to clear persistent topk so that it can be captured regardless
- [Bugfix] Fix persistent_topk cooperative deadlock at TopK=1024
- [Model Runner V2] Fix rejection sampling acceptance rate gap vs MRV1
- Deprecate Transformers v4
- Temporary disable persistent topk
- Temporary disable persistent topk for Hopper
- Transformers v5 baseline support
Issues & roadmap (6)
Reported defects and roadmap signals from the upstream tracker.
Release notes (4)
Version releases with the changes and fixes that ship in them.
Operational guides (2)
Deployment and tuning guides for running vLLM in production.
Research papers (1)
Primary research behind vLLM's memory and throughput claims.
Benchmark reports (1)
Published performance runs with their methodology captured.
Coverage probes (1)
Searches run to map what the corpus does and does not yet cover.
- Existing Schema coverage probe for vLLM KV cache OOM
Conceptual knowledge (59)
The operational understanding modeled on top of those sources: failure modes, metrics, parameters, benchmarks, architecture, and mitigations, cross-linked into a graph.
Failure modes & risks (13)
Known defects, coverage gaps, and operational hazards to watch for.
- docs guidance versus dynamic default logic
- gpu_memory_utilization ineffective on sliced GPU stacks
- long-context prefill OOM below advertised max model length
- max_num_batched_tokens must exceed max_model_len when chunked prefill is disabled
- Missing first-party KV cache capacity calculator
- Missing routing policy tradeoff matrix
- Missing workload-class to recommended-config matrix
- Model Runner V2 rejection-sampling acceptance-rate gap versus MRV1
- MTP=1 hang on DeepSeek V4 when persistent_topk path is active
- Production Stack benchmark platform not yet published
- ROCm DSV4-Flash dense KV cache pool materialization
- warmup prefill kernel memory regression
- WSL2 CUDA overhead allocator mismatch
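One entry above, the max_num_batched_tokens versus max_model_len constraint, can be checked before a deployment is rolled out. The following is a minimal sketch, not vLLM's own validation code: the function name check_batched_token_budget is hypothetical, and treating an equal value as acceptable is an assumption (the entry title says only "must exceed").

```python
def check_batched_token_budget(max_model_len, max_num_batched_tokens,
                               enable_chunked_prefill):
    """Return a list of problems; an empty list means the combination passes.

    Hypothetical pre-flight helper, not part of vLLM.
    """
    problems = []
    # Without chunked prefill, a whole prompt must fit in one scheduling step,
    # so the per-step token budget cannot be smaller than the longest prompt.
    # Assumption: equality is treated as acceptable here.
    if not enable_chunked_prefill and max_num_batched_tokens < max_model_len:
        problems.append(
            f"max_num_batched_tokens ({max_num_batched_tokens}) is smaller than "
            f"max_model_len ({max_model_len}) while chunked prefill is disabled"
        )
    return problems


if __name__ == "__main__":
    # Illustrative values only.
    print(check_batched_token_budget(8192, 2048, enable_chunked_prefill=False))
    print(check_batched_token_budget(8192, 2048, enable_chunked_prefill=True))
```

With chunked prefill enabled the budget may be smaller than the context length, because long prompts are split across steps.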
Architecture & components (13)
Engine subsystems, stack components, and how serving traffic flows.
- KEDA autoscaling
- KEDA autoscaling on vLLM waiting requests
- KV cache manager
- persistent_topk path in DSA sparse-attention indexer
- prefix aware routing
- Production Stack Helm chart
- Production Stack router
- Production Stack router CLI
- ROCm AITER MLA sparse attention path
- route by KV cache hit rate
- route by shared prompt prefix
- upstream vLLM engine
- warmup prefill kernels path
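To make the "KEDA autoscaling on vLLM waiting requests" entry concrete, the sketch below applies the standard Horizontal Pod Autoscaler average-value arithmetic to a queue depth. It is an illustration under assumptions, not the Production Stack's configuration: the function name, the per-replica target, and the replica bounds are all hypothetical.

```python
import math


def desired_replicas(total_waiting, target_per_replica,
                     min_replicas=1, max_replicas=10):
    """Estimate a replica count from the total number of waiting requests.

    Uses ceil(total / target), the average-value form of the HPA formula,
    then clamps to the configured bounds. All parameters are illustrative.
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    raw = math.ceil(total_waiting / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 45 queued requests with a target of 10 waiting requests per replica.
    print(desired_replicas(45, 10))
```

A real autoscaler also applies stabilization windows and cooldowns, which this sketch leaves out.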
Parameters & defaults (10)
Tunable settings, their defaults, safe ranges, and default drift.
- >8192 throughput guidance
- 2048 smaller-value ITL tuning example
- 512 chunked-prefill default in v0.4.2 docs
- chunked prefill decode-priority scheduling
- enable_chunked_prefill
- gpu_memory_utilization
- kv_cache_dtype
- max_num_batched_tokens
- max_num_batched_tokens default history
- max_num_seqs
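The parameters listed above map onto vllm serve command-line flags. The sketch below assembles such a command line from explicit values so that nothing depends on version-specific defaults. The helper name build_serve_args and the model name are hypothetical, and the numbers in the usage example are placeholders, not recommendations.

```python
def build_serve_args(model, *, gpu_memory_utilization, max_num_seqs,
                     max_num_batched_tokens, kv_cache_dtype,
                     enable_chunked_prefill):
    """Build an argv list for `vllm serve` with every value passed explicitly.

    Hypothetical helper for illustration; it only formats arguments and
    does not start a server.
    """
    args = [
        "vllm", "serve", model,
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-num-seqs", str(max_num_seqs),
        "--max-num-batched-tokens", str(max_num_batched_tokens),
        "--kv-cache-dtype", kv_cache_dtype,
    ]
    if enable_chunked_prefill:
        args.append("--enable-chunked-prefill")
    # When False, the flag is simply omitted, which leaves the engine's own
    # default in place; that default can differ between vLLM versions.
    return args


if __name__ == "__main__":
    # Placeholder model name and values, chosen only to show the shape.
    argv = build_serve_args(
        "example-org/example-model",
        gpu_memory_utilization=0.9,
        max_num_seqs=128,
        max_num_batched_tokens=8192,
        kv_cache_dtype="auto",
        enable_chunked_prefill=True,
    )
    print(" ".join(argv))
```

Pinning each value explicitly is one way to guard against the default drift that the "max_num_batched_tokens default history" entry tracks.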
Benchmarks & workloads (10)
Benchmark methods, claims, and the workload classes they apply to.
- decode-heavy benchmark workload
- high-concurrency traffic spike
- long-context prefill
- offline inference throughput benchmark
- online serving throughput benchmark
- prefill-heavy benchmark workload
- ShareGPT benchmark workload
- single-batch latency benchmark
- vLLM 0.6.0 performance-update experiment context
- vLLM 0.6.0 throughput and TPOT improvement claim
Metrics & signals (7)
The numbers to watch and what healthy versus unhealthy looks like.
- KV block lifecycle metrics
- KV cache usage percentage
- output token throughput
- request throughput
- time per output token
- time to first token
- vllm:num_requests_waiting
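Signals such as vllm:num_requests_waiting are exposed in the Prometheus text format, so an agent can read them without a client library. The sketch below is a minimal parser under stated assumptions: the sample text is made up for illustration, lines carry no trailing timestamp, and the function name read_gauge is hypothetical.

```python
# Made-up sample in Prometheus text exposition format (illustrative only).
SAMPLE = """\
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="example-model"} 7.0
vllm:num_requests_running{model_name="example-model"} 3.0
"""


def read_gauge(exposition_text, metric_name):
    """Sum every sample of metric_name across label sets.

    Returns None when the metric is absent. Assumes each sample line ends
    with the value (no timestamp column).
    """
    total = 0.0
    found = False
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]  # drop the label set, if any
        if name == metric_name:
            total += float(value)
            found = True
    return total if found else None


if __name__ == "__main__":
    print(read_gauge(SAMPLE, "vllm:num_requests_waiting"))
```

Returning None for a missing metric keeps "metric absent" distinct from "queue is empty", which matters when the scrape target is down.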
Hardware & compatibility (5)
GPU profiles, dependency constraints, and supported-model limits.
- A100 and H100 benchmark hardware
- DeepSeek V4
- Hugging Face Transformers dependency constraint
- Kubernetes cluster with GPU support
- Prometheus observability stack
Mitigations & remedies (1)
Actions that relieve a known failure mode once you hit it.
- enforce eager execution