# Model Recipes

## Preconfigured Recipes
> **Note:** These configs are for aggregated (in-flight batched) serving, where prefill and decode run on the same GPU. For disaggregated serving setups, see the model-specific deployment guides below.
> **Note:** Traffic patterns: the ISL (Input Sequence Length) and OSL (Output Sequence Length) values in each configuration are the maximum supported values for that config. Requests exceeding these limits may result in errors.

To handle requests with input sequences longer than the configured ISL, add the following to your config file:

```yaml
enable_chunked_prefill: true
```

This enables chunked prefill, which processes long input sequences in chunks rather than requiring them to fit within a single prefill operation. Note that enabling chunked prefill does not guarantee optimal performance; these configs are tuned for the specified ISL/OSL.
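As a sketch, the option can be placed in the extra LLM API options file that `trtllm-serve` accepts; the filename and launch command below are illustrative assumptions, only `enable_chunked_prefill` itself comes from this page:

```yaml
# extra-llm-api-config.yml (hypothetical filename)
# Passed to the server at launch; allows inputs longer than the tuned ISL
# to be prefilled in chunks instead of a single prefill pass.
enable_chunked_prefill: true
```

A launch might then look like `trtllm-serve <model> --extra_llm_api_options extra-llm-api-config.yml`, with the model name and any other flags taken from the relevant deployment guide.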
## Model-Specific Deployment Guides
The deployment guides below provide more detailed instructions for serving specific models with TensorRT LLM.
- Deployment Guide for Nemotron v3 Super on TensorRT LLM - Blackwell & Hopper Hardware
- Deployment Guide for DeepSeek R1 on TensorRT LLM - Blackwell & Hopper Hardware
- Deployment Guide for Llama3.3 70B on TensorRT LLM - Blackwell & Hopper Hardware
- Deployment Guide for Llama4 Scout 17B on TensorRT LLM - Blackwell & Hopper Hardware
- Deployment Guide for GPT-OSS on TensorRT-LLM - Blackwell Hardware
- Deployment Guide for Qwen3 on TensorRT LLM - Blackwell & Hopper Hardware
- Deployment Guide for Qwen3 Next on TensorRT LLM - Blackwell & Hopper Hardware
- Deployment Guide for Kimi K2 Thinking on TensorRT LLM - Blackwell
- Deployment Guide for GLM-5 on TensorRT LLM - Blackwell Hardware