How to deploy Transformers to Production

Optimization techniques for deploying PyTorch models in a production setting while achieving low latency.

Serving a model in production is a challenge that requires a set of optimizations to be performed on the model in order to guarantee low latency.

For compiling and serving PyTorch models in a production environment, you have several tools and libraries at your disposal that can help streamline the process and enhance the performance of your models. Here’s a breakdown of some popular options:

Train → Convert → Serve → Use / Monitor

Servers

TorchServe

  • Description: Developed jointly by AWS and the PyTorch team, TorchServe is designed specifically for serving PyTorch models. It supports model versioning, rollback, logging, metrics, and multi-model serving.
  • Key Features:
    • Model snapshotting for safe deployment.
    • Native support for TorchScript models.
    • Built-in pre-processing, inference, and post-processing hooks.
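
As a quick illustration of those hooks, here is a minimal sketch of a custom TorchServe handler. It follows the BaseHandler contract; the assumption that each request carries a JSON list of floats the model accepts is purely illustrative.

# Minimal custom TorchServe handler sketch.
# Assumption: each request body is a JSON list of floats that the model accepts.
import torch
from ts.torch_handler.base_handler import BaseHandler

class MyHandler(BaseHandler):
    def preprocess(self, data):
        # TorchServe passes a batch of requests; each row exposes its payload
        # under "data" or "body".
        instances = [row.get("data") or row.get("body") for row in data]
        return torch.tensor(instances, dtype=torch.float32)

    def inference(self, inputs):
        # self.model and self.device are set up by BaseHandler.initialize()
        # from the .mar archive.
        with torch.no_grad():
            return self.model(inputs.to(self.device))

    def postprocess(self, outputs):
        # Return one JSON-serializable result per request in the batch.
        return outputs.argmax(dim=-1).tolist()

The handler file is referenced when packaging the model with torch-model-archiver and is then served through the torchserve CLI.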

Interesting Resources about TorchServe


Flask or FastAPI

  • Description: For simpler use cases or prototypes, Python-based web frameworks like Flask or FastAPI can be used to create APIs that serve PyTorch models.
  • Key Features:
    • FastAPI offers asynchronous request handling and automatic interactive API documentation.
    • Flask is widely used, simple to deploy, and integrates easily with other Python libraries.
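
As a starting point, here is a minimal FastAPI sketch that serves a model over HTTP. The model path (model.pt, assumed to be a TorchScript artifact), the request schema, and the /predict route are illustrative assumptions; a Flask version would follow the same structure.

# Minimal FastAPI serving sketch (save as main.py).
# Assumption: a TorchScript model has been saved to model.pt.
import torch
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = torch.jit.load("model.pt")
model.eval()

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(request: PredictRequest):
    inputs = torch.tensor(request.features).unsqueeze(0)
    with torch.no_grad():
        outputs = model(inputs)
    return {"prediction": outputs.argmax(dim=-1).item()}

# Run with: uvicorn main:app --host 0.0.0.0 --port 8000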

Ray Serve

  • Description: Part of the Ray framework, Ray Serve is designed for scalable and easy-to-use model serving.
  • Key Features:
    • Supports deploying multiple models and can scale across a cluster.
    • Integrates seamlessly with other Ray components for distributed computing tasks.
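
A minimal Ray Serve sketch (using the Ray Serve 2.x deployment API) might look like the following; the model path, replica count, and request format are assumptions for illustration.

# Ray Serve deployment sketch (Ray Serve 2.x API).
# Assumption: a TorchScript model saved as model.pt and a JSON body
# of the form {"features": [...]}.
import torch
from ray import serve
from starlette.requests import Request

@serve.deployment(num_replicas=2)
class Classifier:
    def __init__(self):
        self.model = torch.jit.load("model.pt")
        self.model.eval()

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        inputs = torch.tensor(payload["features"]).unsqueeze(0)
        with torch.no_grad():
            outputs = self.model(inputs)
        return {"prediction": outputs.argmax(dim=-1).item()}

app = Classifier.bind()
# Deploy locally with serve.run(app), or `serve run module:app` from the CLI;
# the same deployment can scale out across a Ray cluster.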

AWS Neuron

  • Description: AWS's SDK for compiling and running models on its purpose-built Inferentia and Trainium accelerators, with PyTorch integration through the torch-neuronx package.

Triton Inference Server

  • Description: NVIDIA's open-source inference server. It can serve models from several backends (TensorRT, ONNX Runtime, PyTorch/TorchScript, and more) and adds features such as dynamic batching and concurrent model execution.

These tools offer different benefits depending on your specific requirements, such as the level of traffic, the computational resources available, and the complexity of your deployment scenario. For instance, if you are working primarily in a cloud environment with heavy workloads, TorchServe or ONNX Runtime might be more suitable. For edge deployments or when you need low-latency responses, TensorRT could be a better option. Meanwhile, FastAPI and Flask are great for quickly getting a model server running or for less demanding applications.


Runtimes

TorchScript

  • Description: TorchScript is a way to create serializable and optimizable models from PyTorch code. It makes your model architecture portable and executable outside the Python environment.
  • Key Features:
    • Can be used with both JIT and AOT compilation methods.
    • Allows integration with C++ for high-performance applications without Python overhead.
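
Here is a short sketch of tracing a model to TorchScript and reloading the saved artifact; the ResNet-18 model from torchvision is just a convenient stand-in for your own model.

# TorchScript sketch: trace a model and reload the serialized artifact.
# Assumption: torchvision is installed and resnet18 stands in for your model.
import torch
import torchvision

model = torchvision.models.resnet18(weights=None).eval()
example_input = torch.randn(1, 3, 224, 224)

# Tracing records the operations executed on the example input.
traced = torch.jit.trace(model, example_input)
traced.save("resnet18_traced.pt")

# The saved file can be loaded from Python or C++ without the original class.
reloaded = torch.jit.load("resnet18_traced.pt")
with torch.no_grad():
    print(reloaded(example_input).shape)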

Interesting Resources about TorchScript


ONNX and ONNX Runtime

  • Description: Open Neural Network Exchange (ONNX) is a format that allows models trained in one framework (like PyTorch) to be exported and served using another.
  • Key Features:
    • Cross-platform and cross-framework compatibility.
    • ONNX Runtime provides high-performance computation of ONNX models across different platforms and hardware.
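
The typical workflow is to export the PyTorch model to ONNX and then run it with ONNX Runtime. The sketch below uses a toy linear layer and illustrative input/output names.

# Export a PyTorch model to ONNX and run it with ONNX Runtime.
# Assumption: onnx and onnxruntime are installed; the Linear layer is a stand-in.
import torch
import onnxruntime as ort

model = torch.nn.Linear(16, 4).eval()
dummy_input = torch.randn(1, 16)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}},  # allow a variable batch size
)

# CPU execution provider here; GPU providers can be listed instead.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
outputs = session.run(None, {"input": dummy_input.numpy()})
print(outputs[0].shape)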

Interesting Resources about ONNX


TensorRT

  • Description: TensorRT is NVIDIA's SDK for high-performance deep learning inference. Although it is often associated with TensorFlow and other frameworks, it can also be used with PyTorch models, typically by exporting them to ONNX first.
  • Key Features:
    • Optimizes neural network models to produce faster inference times, particularly on NVIDIA GPUs.
    • Reduces latency and memory usage during inference.

Interesting Resources about TensorRT

How to Install TensorRT

docker run --gpus all -it --rm -v local_dir:container_dir \
nvcr.io/nvidia/tensorrt:24.08-py3

If you want to restrict the container to specific GPUs, replace --gpus all with --gpus '"device=3,4,5,6,7"'.

Optionally, you can install the Python dependencies inside the container:

/opt/tensorrt/python/python_setup.sh

Convert from ONNX to TRT Engine

The most straightforward way to convert your computation graph into a TensorRT engine is to start from a model saved in the ONNX format.

CUDA_VISIBLE_DEVICES=7 trtexec --onnx=query_classifier.onnx --saveEngine=query_classifier_engine.TRT
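
Once the engine is built, it can be deserialized from Python with the TensorRT runtime. This is only a sketch of the loading step; allocating and binding input/output buffers (for example with cuda-python or pycuda) depends on your TensorRT version and is omitted here.

# Sketch: load the engine produced by the trtexec command above.
# Assumption: the tensorrt Python package matching the engine version is installed.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

with open("query_classifier_engine.TRT", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# The execution context is what you invoke at inference time, after binding
# device buffers for the engine's inputs and outputs.
context = engine.create_execution_context()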

vLLM

vLLM is a fast and memory-efficient library for running large language models (LLMs) on GPUs and CPUs. It provides a Python interface for building and serving LLM servers with low latency and high throughput, and it is a great alternative to TensorRT with a less steep learning curve. vLLM benefits from very active development and strong community support; the documentation is clear and easy to follow, and the latest models are usually supported shortly after release.
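
For offline batch inference, the Python API is just a few lines, as in the sketch below; the model identifier is an illustrative assumption, and recent versions also ship an OpenAI-compatible HTTP server (vllm serve <model-id>).

# vLLM offline inference sketch.
# Assumption: the model id is illustrative; any supported Hugging Face model works.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Summarize what a TensorRT engine is."], params)
for output in outputs:
    print(output.outputs[0].text)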

Interesting Resources about vLLM