How to Deploy Transformers to Production

Optimization techniques for deploying PyTorch models in a production setting and achieving low latency.

Large Language Model
NLP
Published

September 21, 2024

Serving a model in production is a challenge that requires a set of optimizations to be applied to the model in order to guarantee low latency.

For compiling and serving PyTorch models in a production environment, you have several tools and libraries at your disposal that can help streamline the process and enhance the performance of your models. Here’s a breakdown of some popular options:

flowchart LR
  A[Train] --> B(Convert)
  B --> C{Serve}
  C --> D[Use]
  C --> E[Monitor]

Servers

TorchServe

Flask or FastAPI

  • Description: For simpler use cases or prototypes, Python-based web frameworks like Flask or FastAPI can be used to create APIs that serve PyTorch models (a minimal example follows this list).
  • Key Features:
    • FastAPI offers asynchronous request handling and automatic interactive API documentation.
    • Flask is widely used, simple to deploy, and integrates easily with other Python libraries.
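
To make this concrete, below is a minimal FastAPI sketch serving a Hugging Face text classifier. The checkpoint name, request schema, and route are illustrative assumptions, not a prescribed setup.

# minimal_fastapi_server.py -- illustrative sketch, not a hardened production server
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Hypothetical checkpoint; swap in your own fine-tuned model.
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

class Query(BaseModel):
    text: str

@app.post("/predict")
def predict(query: Query):
    # The pipeline returns a list of {"label": ..., "score": ...} dicts.
    return classifier(query.text)[0]

Run it with uvicorn minimal_fastapi_server:app --port 8000 and POST JSON such as {"text": "great product"} to /predict; FastAPI also exposes interactive documentation at /docs.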

Ray Serve

  • Description: Part of the Ray framework, Ray Serve is designed for scalable and easy-to-use model serving (see the sketch after this list).
  • Key Features:
    • Supports deploying multiple models and can scale across a cluster.
    • Integrates seamlessly with other Ray components for distributed computing tasks.
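
A rough sketch of the same classifier behind Ray Serve (assuming Ray Serve 2.x; the class name and checkpoint are illustrative):

# ray_serve_app.py -- illustrative sketch for Ray Serve 2.x
from ray import serve
from starlette.requests import Request
from transformers import pipeline

@serve.deployment(num_replicas=2)  # scale out by raising num_replicas
class Classifier:
    def __init__(self):
        # Hypothetical checkpoint; replace with your own model.
        self.pipe = pipeline(
            "text-classification",
            model="distilbert-base-uncased-finetuned-sst-2-english",
        )

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        return self.pipe(payload["text"])[0]

app = Classifier.bind()

Start it locally with serve run ray_serve_app:app, which exposes the deployment over HTTP on port 8000 by default.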

AWS Neuron

Triton Inference Server

These tools offer different benefits depending on your specific requirements, such as the level of traffic, the computational resources available, and the complexity of your deployment scenario. For instance, if you are working primarily in a cloud environment with heavy workloads, TorchServe or ONNX Runtime might be more suitable. For edge deployments or when you need low-latency responses, TensorRT could be a better option. Meanwhile, FastAPI and Flask are great for quickly getting a model server running or for less demanding applications.

Runtimes

TorchScript

  • Description: TorchScript is a way to create serializable and optimizable models from PyTorch code. It makes your model architecture portable and executable outside the Python environment (a tracing example follows this list).
  • Key Features:
    • Can be used with both JIT and AOT compilation methods.
    • Allows integration with C++ for high-performance applications without Python overhead.
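
A minimal tracing sketch is shown below; the checkpoint name is an assumption, and torchscript=True makes the Hugging Face model return tuples, which the tracer expects.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, torchscript=True)
model.eval()

# Trace the forward pass with a representative input.
example = tokenizer("an example query", return_tensors="pt")
traced = torch.jit.trace(model, (example["input_ids"], example["attention_mask"]))
traced.save("query_classifier_traced.pt")

# The archive can be reloaded in Python, or in C++ via torch::jit::load.
loaded = torch.jit.load("query_classifier_traced.pt")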

Interesting Resources about TorchScript

ONNX and ONNX Runtime
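
The usual workflow is to export the PyTorch graph with torch.onnx.export and then run it with an onnxruntime InferenceSession. A minimal sketch, assuming the same hypothetical classifier as above and reusing the query_classifier.onnx file name from the TensorRT section below:

import torch
import onnxruntime as ort
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
# return_dict=False makes the model return plain tuples, which the exporter handles well.
model = AutoModelForSequenceClassification.from_pretrained(name, return_dict=False)
model.eval()

example = tokenizer("an example query", return_tensors="pt")
torch.onnx.export(
    model,
    (example["input_ids"], example["attention_mask"]),
    "query_classifier.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
    },
)

# Run the exported graph with ONNX Runtime.
session = ort.InferenceSession("query_classifier.onnx", providers=["CPUExecutionProvider"])
logits = session.run(
    ["logits"],
    {
        "input_ids": example["input_ids"].numpy(),
        "attention_mask": example["attention_mask"].numpy(),
    },
)[0]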

TensorRT

How to Install TensorRT

docker run --gpus all -it --rm -v local_dir:container_dir \
  nvcr.io/nvidia/tensorrt:24.08-py3

If you want to restrict the container to specific GPUs, replace --gpus all with --gpus '"device=3,4,5,6,7"'.

Optionally, you can install the Python dependencies:

/opt/tensorrt/python/python_setup.sh

Convert from ONNX to TRT Engine

The most straightforward way to convert your computation graph into a TRT engine is to start from a model saved in the ONNX format.

CUDA_VISIBLE_DEVICES=7 trtexec --onnx=query_classifier.onnx --saveEngine=query_classifier_engine.TRT
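
Once the engine is built, a quick way to check it from Python is to deserialize it and list its I/O tensors. This is a minimal sketch, assuming the tensorrt Python package installed by the setup script above (the I/O-tensor API requires TensorRT >= 8.5, which the container ships):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(TRT_LOGGER)

# Deserialize the engine produced by trtexec and create an execution context.
with open("query_classifier_engine.TRT", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()

# List the engine's input/output tensors as a quick sanity check.
print([engine.get_tensor_name(i) for i in range(engine.num_io_tensors)])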

vLLM

vLLM is a fast and memory-efficient library for running large language models (LLMs) on CPUs and GPUs. It provides a Python interface for building and serving LLM servers with low latency and high throughput, and it is a great alternative to TensorRT with a less steep learning curve. vLLM benefits from very active development and strong community support, the documentation is clear and easy to follow, and the latest models are usually supported because the library is updated regularly.
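
As a quick illustration, offline batched generation takes only a few lines (the model name here is an assumption; any causal LM supported by vLLM works):

from vllm import LLM, SamplingParams

# Hypothetical model; replace with the checkpoint you want to serve.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.3")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain why low latency matters in production."], params)
print(outputs[0].outputs[0].text)

For online serving, vLLM also ships an OpenAI-compatible HTTP server (python -m vllm.entrypoints.openai.api_server --model <checkpoint>), which is typically the entry point used in production.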

Interesting Resources about vLLM