How to Deploy Transformers to Production

Optimization techniques for deploying PyTorch models in a production setting and achieving low latency.

Large Language Model
NLP
Published

September 21, 2024

Serving a model in production is a challenge that requires a set of optimizations to be applied to the model in order to guarantee low latency.

For compiling and serving PyTorch models in a production environment, you have several tools and libraries at your disposal that can help streamline the process and enhance the performance of your models. Here’s a breakdown of some popular options:

flowchart LR
  A[Train] --> B(Convert)
  B --> C{Serve}
  C --> D[Use]
  C --> E[Monitor]

Servers

TorchServe

Flask or FastAPI

  • Description: For simpler use cases or prototypes, Python-based web frameworks like Flask or FastAPI can be used to create APIs that serve PyTorch models (a minimal example follows this list).
  • Key Features:
    • FastAPI offers asynchronous request handling and automatic interactive API documentation.
    • Flask is widely used, simple to deploy, and integrates easily with other Python libraries.
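
To make this concrete, below is a minimal FastAPI sketch serving a Hugging Face text classifier. The checkpoint name, request schema, and route are illustrative assumptions, not a prescribed setup.

# minimal_fastapi_server.py -- illustrative sketch, not a hardened production server
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Hypothetical checkpoint; swap in your own fine-tuned model.
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

class Query(BaseModel):
    text: str

@app.post("/predict")
def predict(query: Query):
    # The pipeline returns a list of {"label": ..., "score": ...} dicts.
    return classifier(query.text)[0]

Run it with uvicorn minimal_fastapi_server:app --port 8000 and POST JSON such as {"text": "great product"} to /predict; FastAPI also exposes interactive documentation at /docs.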

Ray Serve

  • Description: Part of the Ray framework, Ray Serve is designed for scalable and easy-to-use model serving (see the sketch after this list).
  • Key Features:
    • Supports deploying multiple models and can scale across a cluster.
    • Integrates seamlessly with other Ray components for distributed computing tasks.
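
A rough sketch of the same classifier behind Ray Serve (assuming Ray Serve 2.x; the class name and checkpoint are illustrative):

# ray_serve_app.py -- illustrative sketch for Ray Serve 2.x
from ray import serve
from starlette.requests import Request
from transformers import pipeline

@serve.deployment(num_replicas=2)  # scale out by raising num_replicas
class Classifier:
    def __init__(self):
        # Hypothetical checkpoint; replace with your own model.
        self.pipe = pipeline(
            "text-classification",
            model="distilbert-base-uncased-finetuned-sst-2-english",
        )

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        return self.pipe(payload["text"])[0]

app = Classifier.bind()

Start it locally with serve run ray_serve_app:app, which exposes the deployment over HTTP on port 8000 by default.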

AWS Neuron

Triton Inference Server

These tools offer different benefits depending on your specific requirements, such as the level of traffic, the computational resources available, and the complexity of your deployment scenario. For instance, if you are working primarily in a cloud environment with heavy workloads, TorchServe or ONNX Runtime might be more suitable. For edge deployments or when you need low-latency responses, TensorRT could be a better option. Meanwhile, FastAPI and Flask are great for quickly getting a model server running or for less demanding applications.

Runtimes

TorchScript

  • Description: TorchScript is a way to create serializable and optimizable models from PyTorch code. It makes your model architecture portable and executable outside the Python environment (a tracing example follows this list).
  • Key Features:
    • Can be used with both JIT and AOT compilation methods.
    • Allows integration with C++ for high-performance applications without Python overhead.
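
A minimal tracing sketch is shown below; the checkpoint name is an assumption, and torchscript=True makes the Hugging Face model return tuples, which the tracer expects.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, torchscript=True)
model.eval()

# Trace the forward pass with a representative input.
example = tokenizer("an example query", return_tensors="pt")
traced = torch.jit.trace(model, (example["input_ids"], example["attention_mask"]))
traced.save("query_classifier_traced.pt")

# The archive can be reloaded in Python, or in C++ via torch::jit::load.
loaded = torch.jit.load("query_classifier_traced.pt")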

Interesting Resources about TorchScript

ONNX and ONNX Runtime
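
The usual workflow is to export the PyTorch graph with torch.onnx.export and then run it with an onnxruntime InferenceSession. A minimal sketch, assuming the same hypothetical classifier as above and reusing the query_classifier.onnx file name from the TensorRT section below:

import torch
import onnxruntime as ort
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
# return_dict=False makes the model return plain tuples, which the exporter handles well.
model = AutoModelForSequenceClassification.from_pretrained(name, return_dict=False)
model.eval()

example = tokenizer("an example query", return_tensors="pt")
torch.onnx.export(
    model,
    (example["input_ids"], example["attention_mask"]),
    "query_classifier.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
    },
)

# Run the exported graph with ONNX Runtime.
session = ort.InferenceSession("query_classifier.onnx", providers=["CPUExecutionProvider"])
logits = session.run(
    ["logits"],
    {
        "input_ids": example["input_ids"].numpy(),
        "attention_mask": example["attention_mask"].numpy(),
    },
)[0]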

TensorRT

How to Install TensorRT

docker run --gpus all -it --rm -v local_dir:container_dir \
  nvcr.io/nvidia/tensorrt:24.08-py3

If you want to restrict the container to specific GPUs, replace --gpus all with --gpus '"device=3,4,5,6,7"'.

Optionally, you can install the Python dependencies:

/opt/tensorrt/python/python_setup.sh

Convert from ONNX to TRT Engine

The most straightforward way to convert your computation graph into a TRT engine is to start from a model saved in the ONNX format.

CUDA_VISIBLE_DEVICES=7 trtexec --onnx=query_classifier.onnx --saveEngine=query_classifier_engine.TRT
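
Once the engine is built, a quick way to check it from Python is to deserialize it and list its I/O tensors. This is a minimal sketch, assuming the tensorrt Python package installed by the setup script above (the I/O-tensor API requires TensorRT >= 8.5, which the container ships):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(TRT_LOGGER)

# Deserialize the engine produced by trtexec and create an execution context.
with open("query_classifier_engine.TRT", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()

# List the engine's input/output tensors as a quick sanity check.
print([engine.get_tensor_name(i) for i in range(engine.num_io_tensors)])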

vLLM

vLLM is a fast and memory-efficient library for running large language models (LLMs) on CPUs and GPUs. It provides a Python interface for building and serving LLM servers with low latency and high throughput, and it is a great alternative to TensorRT with a less steep learning curve. vLLM benefits from very active development and strong community support, the documentation is clear and easy to follow, and the latest models are usually supported because the library is updated regularly.
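
As a quick illustration, offline batched generation takes only a few lines (the model name here is an assumption; any causal LM supported by vLLM works):

from vllm import LLM, SamplingParams

# Hypothetical model; replace with the checkpoint you want to serve.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.3")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain why low latency matters in production."], params)
print(outputs[0].outputs[0].text)

For online serving, vLLM also ships an OpenAI-compatible HTTP server (python -m vllm.entrypoints.openai.api_server --model <checkpoint>), which is typically the entry point used in production.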

Interesting Resources about vLLM