Getting a model to work in a notebook is one thing. Getting it to serve real traffic reliably, cheaply, and fast is a different problem entirely. The gap between the two is where most of the engineering actually lives.
This post is a survey of the tools available to close that gap. I’ll cover the full stack, from compilation and optimization to request handling, and try to be clear about where each tool fits and when you should reach for it. The scope is intentionally broad: I’ll cover general PyTorch models (classifiers, encoders, rerankers) and then go deeper on autoregressive LLMs, which have a distinct set of requirements and a rich dedicated ecosystem.
The mental model I find most useful is a three-layer stack:
[Compilation & Optimization] ← how the model graph is transformed before serving
↓
[Execution Engine] ← what runs the forward pass at request time
↓
[Serving Layer] ← what handles HTTP/gRPC, batching, scheduling, routing
Most tools live in one layer. Some collapse all three into a single opinionated stack. Knowing which layer a tool operates on is the fastest way to understand what it does and what it doesn’t do.
Part 1: General-Purpose Model Serving
This section covers tools that work for any PyTorch model: text classifiers, embedding models, cross-encoders, vision models, and so on. If you’re serving a BERT-style encoder or a fine-tuned classification head, this is where to start.
Compilation & Optimization
Before you think about how to serve a model, it’s worth thinking about what you’re serving. The raw PyTorch model you trained is rarely the most efficient artifact you can deploy.
TorchScript
TorchScript is PyTorch’s built-in mechanism for serializing and compiling models into a representation that can run independently of Python. There are two modes: torch.jit.trace, which records a forward pass and captures the graph statically, and torch.jit.script, which actually compiles the Python code into TorchScript IR.
The main reason to use TorchScript is portability: you can load and run a .pt model in C++ with no Python dependency, which matters a lot for latency-sensitive or edge deployments.
A few gotchas worth knowing:
torch.jit.traceassumes fixed input shapes. If your model has dynamic control flow (e.g., adaptive padding based on batch length), it will silently produce wrong results rather than erroring.- Loading weights from a compiled model is not straightforward. The serialization format changed significantly in PyTorch 2.0. Check the migration notes before relying on checkpoint compatibility.
- For transformers specifically, make sure your input shapes are consistent across training and inference. Trying to do batch-level truncation or padding after tracing is a common source of
TracerWarningissues.
When to use it: C++ deployment, edge inference, or when you need a self-contained model artifact with no Python runtime dependency.
ONNX and ONNX Runtime
Open Neural Network Exchange (ONNX) is a format that lets you export a model trained in PyTorch and run it with a different execution backend. ONNX Runtime (ORT) is Microsoft’s high-performance execution engine for ONNX graphs, and it’s genuinely fast, often faster than native PyTorch for inference on encoder-style models.
The workflow is: export from PyTorch with torch.onnx.export, optionally run ONNX Runtime’s transformer-specific optimizer, then serve with ORT.
ORT has a dedicated transformers optimizer that fuses attention patterns, eliminates redundant ops, and enables mixed-precision execution. For BERT-style models on CPU, the speedups are significant.
When to use it: Cross-platform deployments, CPU inference where performance matters, or when you want to target hardware backends that ONNX Runtime supports (CUDA, DirectML, OpenVINO, CoreML).
TensorRT
TensorRT is NVIDIA’s inference optimizer and runtime. It takes a model (usually from ONNX) and compiles it into a highly optimized engine for a specific GPU target, performing layer fusion, precision calibration (FP16/INT8), and memory layout optimization.
The resulting .trt engine is fast, often meaningfully faster than ORT on GPU for the right workloads. The tradeoff is that the engine is tied to a specific GPU architecture, a specific batch size range, and a specific input shape. Changing any of those means recompiling.
# Convert an ONNX model to a TRT engine
CUDA_VISIBLE_DEVICES=0 trtexec --onnx=model.onnx --saveEngine=model.trt --fp16
The easiest way to get a TensorRT environment is via NVIDIA’s Docker image:
docker run --gpus all -it --rm -v local_dir:container_dir \
nvcr.io/nvidia/tensorrt:24.08-py3
# Install Python bindings inside the container
/opt/tensorrt/python/python_setup.sh
When to use it: Maximum-throughput inference on NVIDIA GPUs for encoder-style models or CNNs, where you control the hardware and can afford the compilation overhead.
Serving Frameworks
Once you have an optimized model artifact, you need something to handle the request lifecycle: receiving HTTP/gRPC calls, batching inputs, routing to the model, and returning results.
TorchServe
TorchServe is AWS and PyTorch’s official model serving framework. It’s designed specifically for PyTorch and handles a lot of operational concerns out of the box: model versioning and rollback, multi-model serving, logging, metrics, and health checks.
The core abstraction is a handler, a Python class with pre-processing, inference, and post-processing hooks that you implement. TorchServe manages the lifecycle around it.
It has solid support for large model serving via integration with Hugging Face accelerate, which matters if you’re trying to serve something that doesn’t fit on a single GPU.
When to use it: Production deployments of PyTorch models where you need operational maturity (versioning, rollback, metrics) and don’t want to build those things yourself.
Ray Serve
Ray Serve is part of the broader Ray distributed computing ecosystem. Its main differentiator is that it’s built for scaling: you can deploy multiple models, compose them into pipelines, and scale individual components independently across a cluster.
It integrates naturally with the rest of Ray (for data loading, training, hyperparameter tuning), which makes it a good choice if you’re already using Ray in your ML infrastructure.
When to use it: Multi-model pipelines, ensemble deployments, or when you need to scale serving horizontally across a cluster and are already in the Ray ecosystem.
FastAPI / Flask
For prototyping or simple deployments, a plain Python web framework is often the right call. FastAPI is the modern choice: async-native, automatic OpenAPI docs, fast enough for most workloads.
FastAPI shines when your serving logic is custom. For example, a reranker that takes variable-length candidate lists, or an embedding model that needs specific batching logic you want full control over.
from fastapi import FastAPI
from pydantic import BaseModel
app = FastAPI()
class InferenceRequest(BaseModel):
text: str
@app.post("/predict")
async def predict(request: InferenceRequest):
# your model inference here
return {"result": model(request.text)}
When to use it: Prototypes, internal tools, or deployments where you need full control over the request handling logic and operational complexity / volume is low.
Part 2: LLM-Specific Serving
Autoregressive LLMs have fundamentally different serving requirements from encoder models. Generation is sequential and memory-bound, not compute-bound. The bottleneck is usually KV cache memory, not FLOPS. Batching across requests requires continuous scheduling, not just padding to a fixed shape. This is a sufficiently different problem that a dedicated ecosystem has emerged around it.
The tools below are designed specifically for this regime. In most cases, they collapse all three layers of the stack (optimization, execution, and serving) into a single system.
vLLM
vLLM is the current default for open-source LLM serving. It was built at UC Berkeley and introduced PagedAttention, which manages the KV cache using virtual memory paging analogous to OS memory management. This eliminates KV cache fragmentation and makes it possible to serve many concurrent requests efficiently.
Beyond PagedAttention, vLLM implements continuous batching (requests are added and removed from the running batch dynamically, rather than waiting for a full batch to complete), chunked prefill (prefill and decode phases are interleaved to reduce time-to-first-token variance), and speculative decoding. It supports tensor parallelism for multi-GPU serving and has first-class Hugging Face model support.
The development pace is fast. Major models are usually supported within days of release, and the community is large and active.
pip install vllm
# Serve a model
vllm serve Qwen/Qwen2.5-7B-Instruct --tensor-parallel-size 2
vLLM exposes an OpenAI-compatible API by default, which means dropping it in as a backend behind existing OpenAI client code requires minimal changes.
When to use it: This should be your starting point for any open-source LLM deployment. The only reasons to look elsewhere are if you need maximum throughput on NVIDIA hardware (TensorRT-LLM) or if you’re on Hugging Face infrastructure (TGI).
Text Generation Inference (TGI)
TGI is Hugging Face’s production LLM serving stack. It predates vLLM and introduced continuous batching independently. The execution is backed by custom Rust and CUDA kernels and it supports Flash Attention, Paged Attention, and speculative decoding.
TGI’s main advantage is its integration with the Hugging Face ecosystem. If your model is on the Hub and your infrastructure is Hugging Face Inference Endpoints, TGI is the natural choice. It also has particularly strong support for quantized models (GPTQ, AWQ, bitsandbytes).
Functionally, vLLM and TGI are now close peers. The right choice between them is usually determined by ecosystem fit rather than raw performance.
As of December 2025, TGI is in maintenance mode. Hugging Face credits it with pioneering the approach of building optimized inference engines on top of transformers model architectures, an approach now adopted by vLLM, SGLang, llama.cpp, and MLX. Going forward, Hugging Face actively contributes to and recommends these downstream engines instead.
TGI remains a valid choice for existing deployments, but new projects should default to vLLM or SGLang.
When to use it: Hugging Face infrastructure, or when quantization support and Hub integration are primary concerns.
SGLang
SGLang came out of research at Stanford and UC Berkeley and has grown into one of the two dominant open-source inference engines alongside vLLM. Its core technical contribution is RadixAttention: a KV cache management scheme that automatically discovers shared prefixes across requests by organizing the cache as a radix tree, without any manual configuration. This makes it particularly well-suited for workloads with heavy prefix reuse : multi-turn chat, agent loops hammering the same system prompt, and RAG pipelines with repeated instruction templates.
Beyond RadixAttention, SGLang ships a zero-overhead batch scheduler that overlaps CPU scheduling with GPU execution, keeping the GPU continuously engaged, and a compressed finite-state machine backend for constrained/structured generation. It supports the same attention backends as vLLM (FlashAttention, FlashInfer, FlashMLA) and runs across NVIDIA, AMD MI300, Intel, and Google TPU.
SGLang is now the other engine Hugging Face explicitly recommends alongside vLLM following TGI’s maintenance-mode announcement. The practical choice between SGLang and vLLM usually comes down to workload shape: if a significant share of your traffic shares a common prefix, SGLang’s architecture works strongly in its favor.
TensorRT-LLM
TensorRT-LLM is NVIDIA’s dedicated framework for LLM inference on their hardware. It’s distinct from vanilla TensorRT: it’s a full Python library that provides optimized implementations of popular LLM architectures, compiles them into TensorRT engines, and manages multi-GPU execution.
The throughput ceiling is higher than vLLM on the same NVIDIA hardware. NVIDIA ships hand-tuned kernels for every attention variant, and the compilation step allows optimizations that JIT-compiled systems can’t do. On H100 and B200 hardware at high batch sizes, TensorRT-LLM is typically the fastest option.
The tradeoff is operational complexity. The compilation step takes time, is tied to specific GPU architectures, and new model support pace lags behind the vLLM or SGLang. Debugging is harder.
TensorRT-LLM can sit underneath Triton Inference Server for production deployments, which is the canonical NVIDIA-native serving stack.
When to use it: Maximum throughput on NVIDIA GPUs, large-scale production deployments where you’re optimizing for tokens/second/dollar, or when you’re already in NVIDIA’s ecosystem (DGX, NGC).
Triton Inference Server
Triton is NVIDIA’s general-purpose model inference server. It’s often misunderstood as a peer to vLLM or TGI, but it’s better understood as a serving layer that sits above execution backends. It can serve TensorRT engines, ONNX Runtime models, PyTorch models, and TensorFlow models behind a unified gRPC/HTTP interface.
In the LLM context, the canonical setup is TensorRT-LLM as the execution backend with Triton as the serving frontend. Triton handles concurrent request management, batching configuration, and metrics; TensorRT-LLM handles the actual forward pass.
Triton is also the right choice when you need to serve heterogeneous model pipelines (a TensorRT encoder alongside a C/python preprocessing, for instance) behind a single endpoint (see Ensemble pipelines).
When to use it: NVIDIA-native production stacks (especially with TensorRT-LLM), or heterogeneous pipeline deployments where a unified serving interface matters.
AWS Neuron
AWS Neuron is the SDK for running models on AWS Trainium and Inferentia chips, Amazon’s custom ML accelerators. If you’re serving on AWS infrastructure and want to avoid GPU costs, Neuron is the path.
The workflow follows the same compilation pattern as TensorRT: export from PyTorch to an optimized Neuron binary targeting the specific chip. BERT-style models on Inferentia2 can be very cost-effective for high-throughput encoder workloads.
LLM support has improved significantly but still lags behind what’s available on GPU. For encoder-only models on AWS, it’s worth benchmarking.
How to Choose
The decision mostly comes down to two questions: what kind of model, and how much operational complexity can you absorb.
For general PyTorch models (classifiers, encoders, rerankers):
- Start with FastAPI for anything internal or prototypal.
- Move to TorchServe or Ray Serve when you need model versioning, multi-model serving, or horizontal scaling.
- Apply ONNX Runtime or TensorRT as an optimization layer underneath whichever server you pick, if latency is a hard requirement.
For autoregressive LLMs:
- Start with vLLM. It’s the safest default: good throughput, active development, broad model support, OpenAI-compatible API.
- Consider SGLang as it is as simple as vLLM with sometimes better performance for some specific type of workloads.
- Use TensorRT-LLM + Triton if you’re optimizing for maximum throughput on NVIDIA hardware and can absorb the operational overhead.
- Use AWS Neuron if you’re cost-optimizing on Inferentia and the model support covers your architecture.
One heuristic that holds up well: don’t optimize prematurely. vLLM on a single H100 will comfortably serve most workloads. Reach for TensorRT-LLM when you have evidence that vLLM is the bottleneck, not before.