From Research to Production I: Efficient Model Deployment with Triton Inference Server

Kerem Yildirir
Published in Make It New · 8 min read · Oct 26, 2023


In this article, we will demonstrate a simple scenario: we have a classification model and want to deploy it to production with NVIDIA Triton Inference Server. This is the first installment of a series in which we start with a local prototype and gradually build up an image classification solution, going from a simple script to a FastAPI application running in the cloud.

With the recent developments in the field of artificial intelligence, many new use cases are emerging. Developing the model is a big challenge in itself, but many more challenges appear when we want to turn that model into a product. The workflow of a machine learning application typically consists of three stages:

  • Exploration and data processing, which focuses on retrieving and processing high-quality data, and is essential for a successful model.
  • Modeling, where the model is developed, trained, and validated.
  • Deployment, when the model has passed the necessary checks and benchmarks and is ready to be served for production.

The machine learning workflow (Source: Become a Machine Learning Engineer Nanodegree, Udacity)

There are many tools for serving models, such as TorchServe from PyTorch and TensorFlow Serving from TensorFlow, while BentoML offers a complete suite for training and deploying ML applications.

NVIDIA Triton Inference Server, or Triton, is an open-source software that addresses the deployment stage. It does NOT offer any functionality regarding the first two stages, but it stands out with its comprehensive documentation, flexibility, and hardware optimization options.

Triton offers support for most machine learning (ML) frameworks as well as custom C++ and Python backends. This reduces the need for multiple inference servers for different frameworks, allowing you to simplify your machine learning infrastructure. It works seamlessly with NVIDIA GPUs, allows simultaneous model execution on multiple GPUs to increase hardware utilization, and is used for model serving on Azure, AWS, GCP, and many other platforms.

Triton Inference Server workflow

By default, Triton does not offer any autoscaling, although it can be achieved by configuring Kubernetes or by using NVIDIA's native tool, the Triton Management Service (TMS).

TMS enables the automated deployment of multiple Triton Inference Server instances in Kubernetes. It provides resource-efficient model orchestration on GPUs and CPUs, simplifying the scaling and management of AI models in production environments.

Getting Started

The outline of this tutorial is as follows:

  • We will first pull the pre-trained ResNet50 model and save it as a TensorFlow SavedModel object. This simulates training your own model and saving its weights.
  • After saving the model weights, we will create a directory called the model repository for the Triton server, put the weights there, and run the server using Docker.
  • Once the server is up and running, we will write a simple client script that acts as our deployed application, communicates with the server, and sends inference requests.

For simplicity, we will go through the workflow on a local machine without a GPU in this first part and use cloud GPUs in the following parts of the series. If you wish to run this on a GPU, simply add the --gpus all flag to the docker command below.

Creating a model

With the following script, we create the model artifacts and save them locally. The artifacts could also be hosted in any cloud storage, such as Azure Blob Storage or Amazon S3 buckets.

import tensorflow as tf
from tensorflow.keras.applications import ResNet50

# Step 1: Load the pretrained model
model = ResNet50(weights='imagenet')

# Step 2: Save the model in SavedModel format
saved_model_path = 'saved_model_directory'
tf.saved_model.save(model, saved_model_path)
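
The tensor names that the server will later report in the model metadata come from the SavedModel's serving signature. If you want to double-check them before writing the model configuration, a small sketch like the following (assuming the saved_model_directory created above) reloads the model and prints the signature:

import tensorflow as tf

# Reload the SavedModel and inspect its serving signature; the tensor names
# printed here are the ones Triton reports in the model metadata later on.
loaded = tf.saved_model.load("saved_model_directory")
serving_fn = loaded.signatures["serving_default"]
print(serving_fn.structured_input_signature)  # input TensorSpecs (e.g. "input_1")
print(serving_fn.structured_outputs)          # output TensorSpecs (e.g. "predictions")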

Deploying the model with Triton

In order to deploy a model with Triton, we need to create a model repository and place the saved weights in it using the required directory structure. In the general case, it looks like the following:

<model-repository-path>/
  <model-name>/
    [config.pbtxt]
    [<output-labels-file> ...]
    <version>/
      <model-definition-directory>
    <version>/
      <model-definition-directory>
    ...
  <model-name>/
    [config.pbtxt]
    [<output-labels-file> ...]
    <version>/
      <model-definition-directory>
    <version>/
      <model-definition-directory>
    ...
  ...

and like this for our case:

model_repository
└── resnet50
    ├── 1
    │   └── model.savedmodel
    │       ├── assets
    │       ├── fingerprint.pb
    │       ├── saved_model.pb
    │       └── variables
    │           ├── variables.data-00000-of-00001
    │           └── variables.index
    └── config.pbtxt

Each model needs a name, a model configuration file named config.pbtxt, at least one version folder (named as an integer), and, inside each version folder, the model definition corresponding to the saved weights. The model definition should be named after the type of the model, model.savedmodel in this case.

In order to achieve this, we create the structure, copy the model files under saved_model_directory to model_repository/resnet50/1/model.savedmodel/, and create a model_repository/resnet50/config.pbtxt with the following content:

name: "resnet50" 
platform: "tensorflow_savedmodel"

name and platform are required parameters that Triton uses to identify and address the model. No further arguments are needed to get started: Triton auto-completes the model information from the model definition and uses default values for the remaining configuration options.
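
The repository can be assembled by hand, but the same steps can also be scripted. Here is a minimal Python sketch, assuming the saved_model_directory produced by the earlier script:

import shutil
from pathlib import Path

# Create model_repository/resnet50/1/ and copy the SavedModel into it
version_dir = Path("model_repository") / "resnet50" / "1"
version_dir.mkdir(parents=True, exist_ok=True)
shutil.copytree(
    "saved_model_directory", version_dir / "model.savedmodel", dirs_exist_ok=True
)

# Write the minimal config.pbtxt next to the version folder
(version_dir.parent / "config.pbtxt").write_text(
    'name: "resnet50"\nplatform: "tensorflow_savedmodel"\n'
)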

The complete, auto-filled model configuration can be queried from the Triton Inference Server and looks like the following:

{
  "name": "resnet50",
  "platform": "tensorflow_savedmodel",
  "backend": "tensorflow",
  "version_policy": {
    "latest": {
      "num_versions": 1
    }
  },
  "max_batch_size": 4,
  "input": [
    {
      "name": "input_1",
      "data_type": "TYPE_FP32",
      "format": "FORMAT_NONE",
      "dims": [
        224,
        224,
        3
      ],
      "is_shape_tensor": false,
      "allow_ragged_batch": false,
      "optional": false
    }
  ],
  "output": [
    {
      "name": "predictions",
      "data_type": "TYPE_FP32",
      "dims": [
        1000
      ],
      "label_filename": "",
      "is_shape_tensor": false
    }
  ],
  "batch_input": [],
  "batch_output": [],
  "optimization": {
    "priority": "PRIORITY_DEFAULT",
    "input_pinned_memory": {
      "enable": true
    },
    "output_pinned_memory": {
      "enable": true
    },
    "gather_kernel_buffer_threshold": 0,
    "eager_batching": false
  },
  "dynamic_batching": {
    "preferred_batch_size": [
      4
    ],
    "max_queue_delay_microseconds": 0,
    "preserve_ordering": false,
    "priority_levels": 0,
    "default_priority_level": 0,
    "priority_queue_policy": {}
  },
  "instance_group": [
    {
      "name": "resnet50",
      "kind": "KIND_CPU",
      "count": 2,
      "gpus": [],
      "secondary_devices": [],
      "profile": [],
      "passive": false,
      "host_policy": ""
    }
  ],
  "default_model_filename": "model.savedmodel",
  "cc_model_filenames": {},
  "metric_tags": {},
  "parameters": {},
  "model_warmup": []
}
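
Once the server described in the next step is running, this generated configuration can be retrieved with the Python client as well. A small sketch, assuming the default port mapping used below:

import json

import tritonclient.http as http_client

# Query the full, auto-completed model configuration from a running server
client = http_client.InferenceServerClient("localhost:8000")
config = client.get_model_config("resnet50")
print(json.dumps(config, indent=2))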

After defining the model repository, we need to start the Triton Inference Server with the model repository passed as an argument. The easiest way of doing this is with Docker. Simply run the following command with the desired version:

docker run --rm -p8000:8000 -p8001:8001 -p8002:8002 -v/path/to/model_repository:/models nvcr.io/nvidia/tritonserver:23.08-py3 tritonserver --model-repository=/models

If everything works as expected, you should see logs in the following format:

I0926 11:37:40.511677 1 server.cc:674] 
+----------+---------+--------+
| Model | Version | Status |
+----------+---------+--------+
| resnet50 | 1 | READY |
+----------+---------+--------+
I0926 11:37:40.511787 1 metrics.cc:703] Collecting CPU metrics
I0926 11:37:40.511906 1 tritonserver.cc:2435]
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.37.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /model_repository/ |
| model_control_mode | MODE_NONE |
| strict_model_config | 0 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0926 11:37:40.514442 1 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I0926 11:37:40.514614 1 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I0926 11:37:40.561568 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
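
Before sending inference requests, you can also verify that both the server and the model are ready. A quick check with the Python HTTP client, again assuming the default ports from the docker command above:

import tritonclient.http as http_client

client = http_client.InferenceServerClient("localhost:8000")
print(client.is_server_ready())           # True once the server is up
print(client.is_model_ready("resnet50"))  # True once the model has loaded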

Triggering the model from the client

Now that the server is running, applications can communicate with it through a client. Triton accepts both gRPC and HTTP protocols.

Our client will use the gRPC protocol to communicate with the server and send an inference request. After receiving the response, we use the output name obtained from the model metadata to access the output tensor returned by the server.

An example client.py can be seen below. The script performs the following actions:

  • Acquire an image from a URL
  • Create a gRPC client for communicating with the server and receive the model metadata information using the model name we want to access
  • Create the required input and output objects for the request to be processed.
  • Preprocess the image, feed it into the created input object, and then trigger inference.
  • After inference, use the output_names received from the metadata to retrieve the model output from the response under that name.

#!/usr/bin/env python3
import ast
import urllib.request
from io import BytesIO

import numpy as np
import requests
import tensorflow as tf
import tritonclient.grpc as grpc_client
from PIL import Image

# Get imagenet class names as dictionary
r = urllib.request.urlopen(
    "https://gist.githubusercontent.com/yrevar/942d3a0ac09ec9e5eb3a/raw/238f720ff059c1f82f368259d1ca4ffa5dd8f9f5/imagenet1000_clsidx_to_labels.txt"
).read()
class2name = ast.literal_eval(r.decode())


def get_input_output_names_shapes(
    client: grpc_client.InferenceServerClient, model_name: str
):
    model_metadata = client.get_model_metadata(model_name)
    input_names = [input_tensor.name for input_tensor in model_metadata.inputs]
    output_names = [output_tensor.name for output_tensor in model_metadata.outputs]
    model_config = client.get_model_config(model_name)
    input_shapes = [input_tensor.dims for input_tensor in model_config.config.input]
    return input_names, output_names, input_shapes


def main():
    # URL to retrieve an image from the internet
    url = "https://i1.wp.com/robinbarefield.com/wp-content/uploads/2015/03/DSC_1763.jpg"
    response = requests.get(url)
    img = Image.open(BytesIO(response.content))  # read the image with PIL
    model_name = "resnet50"  # Set the model name
    with grpc_client.InferenceServerClient(
        "localhost:8001"
    ) as client:  # connect to the server
        # Get the names and shapes of the input and output tensors
        input_names, output_names, input_shapes = get_input_output_names_shapes(
            client, model_name
        )
        # Resize the image to the expected input shape of the model
        width, height = input_shapes[0][0], input_shapes[0][1]
        img = img.resize((width, height))
        # Convert the downloaded image to RGB
        input_data = np.array(img.convert("RGB")).astype(np.float32)
        # Preprocess the image for the ResNet50 model
        input_data = tf.keras.applications.resnet50.preprocess_input(input_data)
        # Add the dimension for the batch size
        input_data = np.expand_dims(input_data, 0)
        # Create and fill the actual input tensor for the model
        input_tensor = grpc_client.InferInput(input_names[0], input_data.shape, "FP32")
        input_tensor.set_data_from_numpy(input_data)
        # Run inference
        inputs = [input_tensor]
        outputs = [
            grpc_client.InferRequestedOutput(output_name)
            for output_name in output_names
        ]
        result = client.infer(model_name=model_name, inputs=inputs, outputs=outputs)
        # Print out the predicted class name
        print(class2name[np.argmax(result.as_numpy(output_names[0]))])


if __name__ == "__main__":
    main()

When we run the above script with the image from the URL above (a photo of a brown bear), we get the following output:

❯ python client.py 
brown bear, bruin, Ursus arctos

Next steps

We have now deployed a ResNet50 model inside a Triton Inference Server and performed inference using our Python client! The source code of the above examples can be found here. The next chapters in this tutorial series will cover more advanced options of the Triton Inference Server: we will use the Triton Model Analyzer to optimize the model configuration for our requirements, move the preprocessing computation into Triton, and develop a mini web application that runs in the cloud using the client. Stay tuned!

Originally published at keremyildirir.com
