Model Recipes#

Preconfigured Recipes#

Note

These configs are for aggregated (in-flight batched) serving where prefill and decode run on the same GPU. For disaggregated serving setups, see the model-specific deployment guides below.

Note

Traffic Patterns: The ISL (Input Sequence Length) and OSL (Output Sequence Length) values in each configuration represent the maximum supported values for that config. Requests exceeding these limits may result in errors.

To handle requests with input sequences longer than the configured ISL, add the following to your config file:

```yaml
enable_chunked_prefill: true
```

This enables chunked prefill, which processes long input sequences in chunks rather than requiring them to fit within a single prefill operation. Note that enabling chunked prefill does not guarantee optimal performance; these configs are tuned for the specified ISL/OSL values.
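Conceptually, chunked prefill walks the prompt in fixed-size pieces so each prefill step stays within a per-step token budget. The toy sketch below illustrates only that splitting behavior; the function name and chunk size are illustrative and are not TensorRT LLM APIs:

```python
def chunked_prefill(tokens, chunk_size=512):
    """Yield the prompt in fixed-size chunks so each prefill step
    stays within a per-step token budget (toy illustration only)."""
    for start in range(0, len(tokens), chunk_size):
        yield tokens[start:start + chunk_size]

# A 1300-token prompt is split into chunks of 512, 512, and 276 tokens;
# concatenating the chunks reproduces the original sequence.
prompt = list(range(1300))
chunks = list(chunked_prefill(prompt, chunk_size=512))
```

The engine still processes every prompt token; chunking only bounds how many are handled per step, which is why throughput for very long inputs can differ from the tuned ISL/OSL operating point.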

Model-Specific Deployment Guides#

The deployment guides below provide more detailed instructions for serving specific models with TensorRT LLM.