vLLM reference knowledge

A machine-readable knowledge base that your AI SRE agent can query as a tool. It captures the operational facts about vLLM that are not reliably present in a model's training data, modeled from first-party sources and the practitioner long tail.

Reference sources: 37

Concepts modeled: 59

Connections: 126

Reference sources

The primary material behind the knowledge: official docs, releases, pull requests, issues, and research. Every item links to its real source.

Official documentation (15)

First-party vLLM docs pages, modeled and linked to source.

Pull requests (7)

Merged changes that altered behavior, defaults, or compatibility.

Issues & roadmap (6)

Reported defects and roadmap signals from the upstream tracker.

Release notes (4)

Version releases with the changes and fixes that ship in them.

Operational guides (2)

Deployment and tuning guides for running vLLM in production.

Research papers (1)

Primary research behind vLLM's memory and throughput claims.

Efficient Memory Management for Large Language Model Serving with PagedAttention

Benchmark reports (1)

Published performance runs with their methodology captured.

vLLM v0.6.0 performance update

Coverage probes (1)

Searches run to map what the corpus does and does not yet cover.

Existing Schema coverage probe for vLLM KV cache OOM

Conceptual knowledge

The operational understanding modeled on top of those sources: failure modes, metrics, parameters, benchmarks, architecture, and mitigations, cross-linked into a graph.

Failure modes & risks (13)

Known defects, coverage gaps, and operational hazards to watch for.

docs guidance versus dynamic default logic
gpu_memory_utilization ineffective on sliced GPU stacks
long-context prefill OOM below advertised max model length
max_num_batched_tokens must exceed max_model_len when chunked prefill is disabled
Missing first-party KV cache capacity calculator
Missing routing policy tradeoff matrix
Missing workload-class to recommended-config matrix
Model Runner V2 rejection-sampling acceptance-rate gap versus MRV1
MTP=1 hang on DeepSeek V4 when persistent_topk path is active
Production Stack benchmark platform not yet published
ROCm DSV4-Flash dense KV cache pool materialization
warmup prefill kernel memory regression
WSL2 CUDA overhead allocator mismatch
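
The entry above on max_num_batched_tokens and max_model_len can be expressed as a pre-flight check. The following is a minimal sketch, not vLLM code: SchedulerSettings and check_batched_tokens are hypothetical names, and the sketch assumes that a batch budget equal to max_model_len is acceptable (only a smaller value is flagged), which may differ from the exact rule in a given vLLM release.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SchedulerSettings:
    """Hypothetical container for three engine settings (not a vLLM class)."""
    max_model_len: int
    max_num_batched_tokens: int
    enable_chunked_prefill: bool


def check_batched_tokens(settings: SchedulerSettings) -> List[str]:
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    # Without chunked prefill, a whole prompt is scheduled in one step, so the
    # per-step token budget has to cover the longest admissible sequence.
    # Assumption: equality is treated as acceptable in this sketch.
    if (not settings.enable_chunked_prefill
            and settings.max_num_batched_tokens < settings.max_model_len):
        problems.append(
            "max_num_batched_tokens "
            f"({settings.max_num_batched_tokens}) is smaller than "
            f"max_model_len ({settings.max_model_len}) "
            "while chunked prefill is disabled"
        )
    return problems


if __name__ == "__main__":
    for issue in check_batched_tokens(SchedulerSettings(8192, 2048, False)):
        print("config problem:", issue)
```

Running a check like this before launch surfaces the mismatch as a readable message instead of a startup failure or silently rejected long prompts.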

Architecture & components (13)

Engine subsystems, stack components, and how serving traffic flows.

KEDA autoscaling
KEDA autoscaling on vLLM waiting requests
KV cache manager
persistent_topk path in DSA sparse-attention indexer
prefix aware routing
Production Stack Helm chart
Production Stack router
Production Stack router CLI
ROCm AITER MLA sparse attention path
route by KV cache hit rate
route by shared prompt prefix
upstream vLLM engine
warmup prefill kernels path
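
To make "KEDA autoscaling on vLLM waiting requests" concrete, here is a simplified sketch of the replica arithmetic behind an average-value target: the summed queue depth divided by the target per pod, rounded up and clamped. The function name, parameters, and default bounds are hypothetical, and the real KEDA and Kubernetes autoscaler behavior (tolerance bands, stabilization windows, cooldowns) is deliberately left out.

```python
import math
from typing import List


def desired_replicas(waiting_per_pod: List[float],
                     target_waiting_per_pod: float,
                     min_replicas: int = 1,
                     max_replicas: int = 8) -> int:
    """Estimate a replica count from per-pod waiting-request gauges.

    waiting_per_pod holds one sample of a queue-depth gauge (for example
    vllm:num_requests_waiting) per running pod. All names are illustrative.
    """
    if target_waiting_per_pod <= 0:
        raise ValueError("target_waiting_per_pod must be positive")
    total_waiting = sum(waiting_per_pod)
    # Average-value style target: enough replicas so that the queue,
    # spread evenly, stays at or below the target per pod.
    raw = math.ceil(total_waiting / target_waiting_per_pod)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    print(desired_replicas([12, 9, 15], target_waiting_per_pod=5.0))
```

The clamp matters in practice: a traffic spike can produce a raw value far above what the GPU pool can supply, so the upper bound should reflect real capacity.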

Parameters & defaults (10)

Tunable settings, their defaults, safe ranges, and default drift.

>8192 throughput guidance
2048 smaller-value ITL tuning example
512 chunked-prefill default in v0.4.2 docs
chunked prefill decode-priority scheduling
enable_chunked_prefill
gpu_memory_utilization
kv_cache_dtype
max_num_batched_tokens
max_num_batched_tokens default history
max_num_seqs
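
How gpu_memory_utilization and kv_cache_dtype interact with KV cache capacity can be approximated with back-of-envelope arithmetic. The sketch below is an estimate under stated assumptions, not vLLM's own accounting: it assumes standard attention with one key and one value vector per layer per token, and it treats weight size and other overhead as caller-supplied inputs. Both function names are hypothetical.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache needed for one token across all layers."""
    # Factor 2: one key vector and one value vector per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def estimate_kv_token_capacity(gpu_total_bytes: int,
                               gpu_memory_utilization: float,
                               weights_bytes: int,
                               other_overhead_bytes: int,
                               per_token_bytes: int) -> int:
    """Rough number of tokens that fit in the KV cache budget.

    Assumption: the budget is the utilization fraction of total GPU memory
    minus model weights and a caller-estimated overhead (activations, etc.).
    """
    budget = (int(gpu_total_bytes * gpu_memory_utilization)
              - weights_bytes - other_overhead_bytes)
    return max(0, budget // per_token_bytes)


if __name__ == "__main__":
    GIB = 2 ** 30
    # Illustrative model shape: 32 layers, 8 KV heads, head dim 128.
    fp16 = kv_bytes_per_token(32, 8, 128, dtype_bytes=2)
    fp8 = kv_bytes_per_token(32, 8, 128, dtype_bytes=1)
    print("bytes per token (16-bit):", fp16)
    print("bytes per token (8-bit):", fp8)
    print("tokens:", estimate_kv_token_capacity(80 * GIB, 0.5, 16 * GIB,
                                                4 * GIB, fp16))
```

Halving the element size roughly doubles the token capacity for the same budget, which is why an 8-bit KV cache dtype is a common lever when long-context requests run out of memory.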

Benchmarks & workloads (10)

Benchmark methods, claims, and the workload classes they apply to.

decode-heavy benchmark workload
high-concurrency traffic spike
long-context prefill
offline inference throughput benchmark
online serving throughput benchmark
prefill-heavy benchmark workload
ShareGPT benchmark workload
single-batch latency benchmark
vLLM 0.6.0 performance-update experiment context
vLLM 0.6.0 throughput and TPOT improvement claim

Metrics & signals (7)

The numbers to watch and what healthy versus unhealthy looks like.

KV block lifecycle metrics
KV cache usage percentage
output token throughput
request throughput
time per output token
time to first token
vllm:num_requests_waiting
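
The latency and throughput metrics listed above can be derived from per-token arrival timestamps for a single request. This is a minimal sketch using a common convention (time per output token measured over the tokens after the first); the function name and dictionary keys are hypothetical, and a benchmark harness may define the quantities slightly differently.

```python
from typing import Dict, List


def latency_metrics(request_start: float,
                    token_times: List[float]) -> Dict[str, float]:
    """Compute TTFT, TPOT, and output token throughput for one request.

    request_start and token_times are timestamps in seconds on the same clock;
    token_times[i] is when output token i arrived.
    """
    if not token_times:
        raise ValueError("need at least one output token")
    total = token_times[-1] - request_start
    if total <= 0:
        raise ValueError("last token must arrive after the request start")
    n = len(token_times)
    # Time to first token: includes queueing and prefill.
    ttft = token_times[0] - request_start
    # Time per output token: decode time averaged over tokens after the first.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return {
        "ttft_s": ttft,
        "tpot_s": tpot,
        "output_tokens_per_s": n / total,
    }


if __name__ == "__main__":
    print(latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5]))
```

Separating the first token from the rest keeps prefill cost out of the decode average, so a long prompt shows up in TTFT rather than inflating TPOT.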

Hardware & compatibility (5)

GPU profiles, dependency constraints, and supported-model limits.

A100 and H100 benchmark hardware
DeepSeek V4
Hugging Face Transformers dependency constraint
Kubernetes cluster with GPU support
Prometheus observability stack

Mitigations & remedies (1)

Actions that relieve a known failure mode once you hit it.

enforce eager execution
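
As an illustration of applying this mitigation, the sketch below assembles a `vllm serve` command line with the `--enforce-eager` flag, which runs the model in eager mode instead of capturing CUDA graphs. The helper name and the model identifier are hypothetical placeholders, and this page does not state which failure mode the mitigation relieves, so treat the example as a starting point rather than a recommended configuration.

```python
import shlex
from typing import List, Optional


def build_serve_command(model: str,
                        enforce_eager: bool = True,
                        gpu_memory_utilization: Optional[float] = None
                        ) -> List[str]:
    """Build an argument list for launching the vLLM server (illustrative)."""
    cmd = ["vllm", "serve", model]
    if enforce_eager:
        # Eager mode skips CUDA graph capture, trading some speed for the
        # memory and startup work that graph capture would otherwise need.
        cmd.append("--enforce-eager")
    if gpu_memory_utilization is not None:
        cmd += ["--gpu-memory-utilization", str(gpu_memory_utilization)]
    return cmd


if __name__ == "__main__":
    # "my-org/my-model" is a placeholder, not a real model identifier.
    print(shlex.join(build_serve_command("my-org/my-model",
                                         gpu_memory_utilization=0.85)))
```

Building the command as a list keeps each argument separate, so it can be passed to a process launcher without shell quoting problems.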