Stack Abuse: Loading a Pretrained TensorFlow Model into TensorFlow Serving

Post content below copied from Planet Python


You are part of a project that will use deep learning to try to identify what is in images - such as cars, ducks, mountains, sky, trees, etc.

In this project, two things are important. The first is that the deep learning model trains quickly and runs efficiently, because it will be deployed to a device that doesn't have much computational power.

Your team has decided to use EfficientNets, specifically, the V2 family, as they're robust, train fast, and have strong accuracy.

The second is that the model needs to be accessible through a link, so predictions can be made over the web.

Regarding the second point, you want to make the model accessible to other people, so that they can send their data through a REST API request and get the model's predictions as a response. To do that, you need to deploy, or serve, the model. This can be done in a myriad of ways, though we'll be taking a look at TensorFlow Serving (TF Serving).

So far so good: the two project requirements are covered by EfficientNetV2 and TF Serving! In this guide, we'll start with a pre-trained general image classifier and deploy it to TensorFlow Serving, using Docker.

TensorFlow Serving

TensorFlow Serving is, well, a serving system for machine learning models. It's specifically designed for production environments, and helps bridge the gap between data scientists and production-oriented software engineers.

You can install it from source, which allows for customization for specific use cases, or run it in a Docker container, which makes it easy to use on any operating system. Since we don't need any customization and we're preparing the model for production, Docker containers are a great choice for us.

Docker adds a layer between what is being deployed or served and the operating system, which makes the setup more portable, easier to scale, and easier to adapt if the project changes in the future or expands to other operating systems.

This is the general context of the project. Now, let's start installing the necessary model libraries, setting up the container and learning how to serve the model!

Importing Tensorflow

If you haven't already, let's install TensorFlow. On a Conda-based environment, you can run conda install:

$ conda install tensorflow

Otherwise, pip makes it simple:

$ pip install tensorflow

Note: You can also run the installation from a Jupyter Notebook by placing an exclamation mark before the command, such as: !conda install tensorflow.

Along with TensorFlow, let's import NumPy:

import tensorflow as tf
import numpy as np

Preprocessing an Image with TensorFlow and Keras

We'll be serving an existing, pre-trained model for general image classification, trained on ImageNet, which will allow us to focus on the serving process.

Let's take an example image to classify, such as this image of swans in a lake from Pexels (royalty free). We will use this image to understand if the model will recognize the swans, a lake, or if it will get close to that and recognize animals and nature.

Once downloaded, let's define a path to the image to make it simple to load:

img_path = 'tf_serving/pexels-artūras-kokorevas-10547480.jpg'

When feeding images into a model - we want to make sure that we follow the expected preprocessing steps. These typically include resizing and rescaling to the expected input (lest the weights can't be used), but sometimes also include normalization. All Keras models come with a preprocess_input() function that preprocesses the input for that trained model.

Note: EfficientNetV2's preprocess_input() function just performs pass, since no preprocessing is required. However, the models do expect the inputs to be in a range of [0..255], encoded as floats. The model itself includes a Rescaling layer that'll scale them down to [-1, 1]. If you already have a [-1, 1] input, set the include_preprocessing flag to False when loading the EfficientNet models.
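To make the two input ranges concrete, here is a minimal NumPy sketch of one common way to map [0, 255] pixel values to [-1, 1]. The to_minus_one_one() name and the 127.5 constant are illustrative assumptions; the exact constants used by the rescaling inside the Keras models may differ between variants.

```python
import numpy as np

def to_minus_one_one(pixels):
    """Map pixel values from [0, 255] to [-1, 1] (illustrative only)."""
    pixels = np.asarray(pixels, dtype=np.float32)
    # 0 -> -1, 127.5 -> 0, 255 -> 1
    return pixels / 127.5 - 1.0

scaled = to_minus_one_one([0, 127.5, 255])
print(scaled)
```

If your pipeline already applies a step like this, that is the case where include_preprocessing=False applies.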

The EfficientNetV2 family comes in several flavors - B0, B1, B2, B3, S, M and L. B0..B3 are for comparison with the V1 of the family, which spanned B0..B7, and the models were made by adjusting the width and depth coefficients, making the models wider and deeper. S, M and L come from the V2 paper, which have a different configuration of input and output filters across the building blocks.

You can think of them as trading accuracy for speed, where B0 is the lightest of them, while L is the largest.

Advice: If you wish to learn more about state of the art neural network architectures, join the "Convolutional Neural Networks - Beyond Basic Architectures" course!

Depending on your training and inference hardware, you can find a sweet spot of accuracy and speed.

The Pexels swans image originally has a resolution of 5078 by 3627 pixels, and we can easily change both dimensions to 224. Typically, resizing is done during training, so the reading and resizing operations need to be efficient. For creating optimized pipelines, tf.io.read_file() is usually combined with tf.image operations:

size = (224, 224)
# Read file as bytes
img = tf.io.read_file(img_path)
# Decode the bytes into an image tensor (the file is a JPEG)
img = tf.image.decode_jpeg(img, channels=3)
# Batch the input (required for vectorized operations)
img = tf.expand_dims(img, 0)
# Perform vectorized operation to resize the images
img = tf.image.resize(img, size=size)

While it may seem verbose for reading a file - getting used to this syntax will play a significant role in your data and training pipelines.

Let's take a look at the image:

import matplotlib.pyplot as plt
# Squeeze the image from a batch to a single image for Matplotlib
# and cast to UINT8
plt.imshow(tf.cast(tf.squeeze(img), dtype=tf.uint8))
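If the batching and squeezing steps feel abstract, the shape changes can be tried out with NumPy alone. This is only a sketch: the tiny all-zeros array is a hypothetical stand-in for a decoded photo, and np.expand_dims() and np.squeeze() are used here as NumPy counterparts of the TensorFlow calls above.

```python
import numpy as np

# A hypothetical 4x6 RGB image filled with zeros stands in for a decoded photo
image = np.zeros((4, 6, 3), dtype=np.uint8)

batched = np.expand_dims(image, 0)  # (4, 6, 3) -> (1, 4, 6, 3)
restored = np.squeeze(batched, 0)   # (1, 4, 6, 3) -> (4, 6, 3)

print(batched.shape, restored.shape)
```

The leading dimension is the batch size, which is why a single image still needs it before being passed to the model.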

Creating the Model with Tensorflow

Let's instantiate EfficientNetV2B0:

model = tf.keras.applications.EfficientNetV2B0()

The parameters default to the "ImageNet setup" - i.e. 'imagenet' weights are loaded in, there are 1000 output classes, and the input image size is 224 (historically most common input size). You may, of course, specify these arguments yourself, or change them to adapt to a different input pipeline:

model = tf.keras.applications.EfficientNetV2B0(weights='imagenet', 
                                               input_shape=(224, 224, 3))

If you want to take a look at all the layers that the model has, you can see a list of them when executing the model's summary() method:

model.summary()
The network has ~7.14M trainable parameters:

Total params: 7,200,312
Trainable params: 7,139,704
Non-trainable params: 60,608

Since we won't retrain the network in this guide, and the image is ready, we can go ahead and make predictions!

Making Predictions

To make predictions, we can use the predict() method and store the results in a preds variable:

x = img.numpy()  # the resized image batch from earlier, as a NumPy array
preds = model.predict(x)

Alternatively, you can simply pass the image to the model instead:

preds = model(x)

The preds here are a tensor of shape (batch_size, class_probabilities). Since we have a single image, the output tensor is of shape (1, 1000), with one probability score for each of the 1000 ImageNet classes:

# <tf.Tensor: shape=(1, 1000), dtype=float32, numpy=
# array([[5.84674963e-05, 7.39379029e-05, 7.74277505e-05, 1.28119747e-04,
#        1.26851926e-04, 6.71151938e-05, 3.55448428e-05, 2.84188209e-05,
#        6.39903592e-05, 2.44139865e-05, 4.29903994e-05, 5.27093216e-05,
#        ...

You can get the highest probability class through the argmax() function, which returns the index of the highest value in this tensor:

tf.argmax(preds, axis=1)
# <tf.Tensor: shape=(1,), dtype=int64, numpy=array([295], dtype=int64)>

We're performing argmax() on axis=1, since we're performing it on the second axis ('column') of the tensor (we're performing argmax() across the 1000 class probabilities). So, what's the class under the index of 295? Typically, you'll have a list or dictionary of indices to classnames, either loaded in memory or in a file.
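The effect of the axis argument is easier to see on a tiny example. The sketch below uses NumPy and made-up scores for two images and three classes; np.argmax() follows the same axis convention as tf.argmax().

```python
import numpy as np

# Two hypothetical images, three classes each (rows = images, columns = classes)
scores = np.array([[0.1, 0.7, 0.2],
                   [0.6, 0.3, 0.1]])

per_image = np.argmax(scores, axis=1)  # best class for each image
per_class = np.argmax(scores, axis=0)  # best image for each class (not what we want)

print(per_image, per_class)
```

With axis=1 we get one class index per image, which is the result we are after when classifying a batch.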

Since ImageNet has many classes, and is a common dataset/class-set to work with, TensorFlow exposes a decode_predictions() function alongside every model. When we pass the predictions into it, it parses the label map and returns the top-5 most probable predictions, each with its class ID, human-readable label and probability:

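decode_predictions = tf.keras.applications.efficientnet_v2.decode_predictions

preds = preds.numpy()  # needed when preds is a tensor, i.e. when it came from model(x)
print(decode_predictions(preds))
# [[('n09332890', 'lakeside', 0.2955897),
#   ('n09421951', 'sandbar', 0.24374594),
#   ('n01855672', 'goose', 0.10379495),
#   ('n02894605', 'breakwater', 0.031712674),
#   ('n09428293', 'seashore', 0.031055905)]]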

In the output above, we can see that the network considers a lakeside the most probable content of the image, with around a 30% chance, followed by a sandbar, a goose, a breakwater and a seashore. Not outstanding, but a good enough first try. Here, we need to take into consideration that the swans image is not an easy one to classify: its tones are close to each other, and there is no clear boundary between where the landscape ends and the frozen lake begins. This is especially hard to identify at smaller resolutions.
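To see what a top-k lookup of this kind boils down to, here is a rough NumPy sketch. It is not the Keras implementation: the top_k_labels() helper, the four-entry label list and the rounded probabilities are toy values made up for illustration.

```python
import numpy as np

def top_k_labels(probabilities, labels, k=5):
    """Return the k most probable (label, probability) pairs, best first."""
    probabilities = np.asarray(probabilities)
    # argsort is ascending, so reverse it and keep the first k indices
    order = np.argsort(probabilities)[::-1][:k]
    return [(labels[i], float(probabilities[i])) for i in order]

# Toy label map and scores (hypothetical, not the real ImageNet index)
labels = ["goose", "lakeside", "breakwater", "sandbar"]
probs = [0.10, 0.30, 0.03, 0.24]

result = top_k_labels(probs, labels, k=2)
print(result)
```

The real label map has 1000 entries, but the idea is the same: sort the scores, keep the best few, and look up their names.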

Saving the Model

The creation and prediction steps simulate the iterative development cycle of a model. Let's save the "current version" of the model for deployment.

To organize that information, let's create a folder with the name of the neural net - for instance, effv2b0:

$ mkdir effv2b0

Now, with the folder to keep track of the versions created, we need a way to differentiate between versions by naming each saved model in a unique way. A common approach is to use the time the model was saved, in seconds (or the full calendar date and seconds). This number can be obtained with the time() function in Python's time module.

As we have done before, we can import the time library, then obtain the current time in seconds:

import time

current_time = int(time.time()) # int() truncates the float output, removing its decimal places

We have generated a name for the version, so let's define a path to save it inside the effv2b0 folder, using a Python f-string to join the folder name with the number:

path = f"effv2b0/{current_time}"
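Naming versions by timestamp also helps at serving time, since TF Serving, by default, serves the highest-numbered version folder it finds under the model directory. A minimal sketch of picking that folder by hand, using only the standard library, could look like this. The latest_version() helper and the "notes" folder are hypothetical, made up for illustration:

```python
import tempfile
from pathlib import Path

def latest_version(model_root):
    """Return the highest numeric sub-folder name under model_root, or None."""
    versions = [
        int(p.name)
        for p in Path(model_root).iterdir()
        if p.is_dir() and p.name.isdigit()  # ignore non-numeric folders
    ]
    return max(versions) if versions else None

# Build a throwaway folder layout to try the helper on
with tempfile.TemporaryDirectory() as root:
    for name in ("1670550215", "1673311761", "notes"):
        (Path(root) / name).mkdir()
    newest = latest_version(root)

print(newest)
```

Because the names are compared as integers, a later save always counts as a newer version.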

Finally, we can save the model using the save() method and passing the path as argument:

model.save(path)

The final folder structure with saved model files should look like this:

# how the folder should look like
├── effv2b0
│ ├── 1673311761
│ │ ├── assets 
│ │ ├── saved_model.pb 
│ │ └── variables 

Notice that the save() method outputs an assets folder, a saved_model.pb file, and a variables folder. The assets folder contains files used by the TensorFlow graph, the .pb (protobuf) file stores the model architecture and training configuration, and the variables folder, the model weights. This is everything TensorFlow needs to run a trained model.
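Before pointing a server at a version folder, it can be handy to check that the expected pieces are there. The following is a small sketch under that assumption; looks_like_saved_model() is a hypothetical helper, and it only checks for the saved_model.pb file and the variables folder described above.

```python
import tempfile
from pathlib import Path

def looks_like_saved_model(version_dir):
    """Check for the saved_model.pb file and the variables folder."""
    version_dir = Path(version_dir)
    has_graph = (version_dir / "saved_model.pb").is_file()
    has_weights = (version_dir / "variables").is_dir()
    return has_graph and has_weights

# Try it on a throwaway folder that mimics the layout above
with tempfile.TemporaryDirectory() as root:
    version_dir = Path(root) / "effv2b0" / "1673311761"
    (version_dir / "variables").mkdir(parents=True)
    before = looks_like_saved_model(version_dir)  # saved_model.pb still missing
    (version_dir / "saved_model.pb").touch()
    after = looks_like_saved_model(version_dir)

print(before, after)
```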

We have now covered the main steps of preparing an image, creating a neural network model, making predictions and saving the model. We can now see how this model will be served.

Serving the Model with Tensorflow Serving and Docker

With a model version selected - we can set up a Docker image to house our model and TF Serving, and deploy it.

Advice: If you'd like to learn more about Docker, read our "Docker: A High Level Introduction".

Installing Docker

The first step in the process is installing Docker. On Docker's website, you can download the latest version for your operating system.

After the download and the installation, we can test it to see if it is running. You can do this on a command line, just typing in the instruction, or inside a Jupyter Notebook, in the same way we have shown previously, by inserting an exclamation mark ! before the command.

To start Docker on macOS, type in and execute:

$ open --background -a Docker

After a few seconds, you should see the Docker application window opening.

Once Docker has started, you can then test it with:

$ docker run hello-world

This results in:

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:

For more examples and ideas, visit:

This output means we are ready to go. Docker is up and running!

Pulling a TF Serving Image

The next step is to have TF Serving inside Docker. In other words, we pull a TensorFlow Serving image from Docker Hub, which downloads TF Serving and makes it available to run locally.

To pull the TF Serving image, execute:

$ docker pull tensorflow/serving:latest-gpu

Note: If you're using Mac's M1 chip, to pull the image, use:

$ docker pull emacski/tensorflow-serving:latest-linux_arm64

After pulling the image, if you take a look at the Docker Desktop app, in the Images tab, there should be a new tensorflow/serving image.

Notice that there is also a hello-world image from the initial Docker test.

Serving the Model

Up to now, we have a Docker image with TF Serving inside of it, so we can finally run it. To run the container, we will craft the following instruction:

$ docker run --rm -p <port_number>:<port_number> \
        --name <container_name> \
        -v "<local_path_to_net_folder>:<internal_tfserving_path_/models/+net_folder>" \
        -e MODEL_NAME=<same_as_net_folder> \
        <docker_image_name>

In the above instruction, there are 5 Docker flags, --rm, -p, --name, -v, -e. This is what each one means:

  • --rm: same as remove, it tells Docker to clean up the container after it exits;
  • -p: short for publish, it maps a port on the host to a port inside the container (host_port:container_port), so the served model can be reached from outside the container;
  • --name: specifies the name of the container;
  • -v: short for volume, it mounts the first path (on the host) at the second path (inside the container), with the two paths separated by a colon. In our example, this makes the contents of our folder available inside TF Serving's /models/ folder without copying them, and changes made on the host are visible in the container;
  • -e: same as env or environment variables, in our example it defines a MODEL_NAME variable that will exist inside the container.

Also, in the above command, the text inside < > is to be substituted, in order, by the ports through which the model will be available, the name of the container, the local path to the network folder followed by the corresponding path in TF Serving (which is inside a /models/ folder), the model name, and the name of the Docker image. Below is an example:

$ docker run --rm -p 8501:8501 \
        --name tfserving_effv2 \
        -v "/Users/csamp/Documents/stack_ab/effv2b0:/models/effv2b0" \
        -e MODEL_NAME=effv2b0 \
        tensorflow/serving:latest-gpu

Note: if you are using Mac's M1 chip, the only difference in the command is in the last line, which will have the name of the emacski/tensorflow-serving:latest-linux_arm64 image:

$ docker run --rm -p 8501:8501 \
        --name tfserving_effv2 \
        -v "/Users/csamp/Documents/stack_ab/effv2b0:/models/effv2b0" \
        -e MODEL_NAME=effv2b0 \
        emacski/tensorflow-serving:latest-linux_arm64

If you also end up using another image for your system, you only need to change the last line.
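If you prefer to assemble the command from Python, for example to launch it from a script, the pieces can be put together as a list of arguments. This is only a sketch: build_serving_command() is a hypothetical helper, and the path, container name and image name are the example values used in this guide.

```python
import shlex

def build_serving_command(host_model_dir, model_name, image,
                          port=8501, container_name="tfserving"):
    """Assemble the docker run arguments for serving one model folder."""
    return [
        "docker", "run", "--rm",
        "-p", f"{port}:{port}",                         # host port : container port
        "--name", container_name,
        "-v", f"{host_model_dir}:/models/{model_name}",  # host path : container path
        "-e", f"MODEL_NAME={model_name}",
        image,                                           # the image name always comes last
    ]

command = build_serving_command(
    "/Users/csamp/Documents/stack_ab/effv2b0",
    "effv2b0",
    "tensorflow/serving:latest-gpu",
    container_name="tfserving_effv2",
)
print(shlex.join(command))
```

Keeping the arguments in a list means the same value can later be handed to subprocess.run() without worrying about shell quoting.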

After executing the command, you will see a long output ending in "Entering the event loop ...":

2023-01-17 10:01:33.219123: I external/tf_serving/tensorflow_serving/model_servers/] Building single TensorFlow model file config:  model_name: effv2b0 model_base_path: /models/effv2b0
2023-01-17 10:01:33.220437: I external/tf_serving/tensorflow_serving/model_servers/] Adding/updating models.
2023-01-17 10:01:33.220455: I external/tf_serving/tensorflow_serving/model_servers/]  (Re-)adding model: effv2b0
2023-01-17 10:01:33.330517: I external/tf_serving/tensorflow_serving/core/] Successfully reserved resources to load servable {name: effv2b0 version: 1670550215}
2023-01-17 10:01:33.330545: I external/tf_serving/tensorflow_serving/core/] Approving load for servable version {name: effv2b0 version: 1670550215}
2023-01-17 10:01:33.330554: I external/tf_serving/tensorflow_serving/core/] Loading servable version {name: effv2b0 version: 1670550215}
2023-01-17 10:01:33.331164: I external/org_tensorflow/tensorflow/cc/saved_model/] Reading SavedModel from: /models/effv2b0/1670550215
2023-01-17 10:01:33.465487: I external/org_tensorflow/tensorflow/cc/saved_model/] Reading meta graph with tags { serve }
2023-01-17 10:01:33.465524: I external/org_tensorflow/tensorflow/cc/saved_model/] Reading SavedModel debug info (if present) from: /models/effv2b0/1670550215
2023-01-17 10:01:33.468611: I external/org_tensorflow/tensorflow/core/common_runtime/] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
2023-01-17 10:01:33.763910: I external/org_tensorflow/tensorflow/cc/saved_model/] Restoring SavedModel bundle.
2023-01-17 10:01:33.781220: W external/org_tensorflow/tensorflow/core/platform/profile_utils/] Failed to get CPU frequency: -1
2023-01-17 10:01:34.390394: I external/org_tensorflow/tensorflow/cc/saved_model/] Running initialization op on SavedModel bundle at path: /models/effv2b0/1670550215
2023-01-17 10:01:34.516968: I external/org_tensorflow/tensorflow/cc/saved_model/] SavedModel load for tags { serve }; Status: success: OK. Took 1185801 microseconds.
2023-01-17 10:01:34.536880: I external/tf_serving/tensorflow_serving/servables/tensorflow/] No warmup data file found at /models/effv2b0/1670550215/assets.extra/tf_serving_warmup_requests
2023-01-17 10:01:34.539248: I external/tf_serving/tensorflow_serving/core/] Successfully loaded servable version {name: effv2b0 version: 1670550215}
2023-01-17 10:01:34.540738: I external/tf_serving/tensorflow_serving/model_servers/] Finished adding/updating models
2023-01-17 10:01:34.540785: I external/tf_serving/tensorflow_serving/model_servers/] Using InsecureServerCredentials
2023-01-17 10:01:34.540794: I external/tf_serving/tensorflow_serving/model_servers/] Profiler service is enabled
2023-01-17 10:01:34.542004: I external/tf_serving/tensorflow_serving/model_servers/] Running gRPC ModelServer at ...
[warn] getaddrinfo: address family for nodename not supported
2023-01-17 10:01:34.543973: I external/tf_serving/tensorflow_serving/model_servers/] Exporting HTTP/REST API at:localhost:8501 ...
[ : 245] NET_LOG: Entering the event loop ...

This means that the TF model is being served!

You can also look in Docker Desktop: in the Containers tab, you will see a line with the container name we specified with the --name flag (in this case, tfserving_effv2), followed by the image link, the status as running, and the ports.

Note: if you want to run everything inside a Jupyter Notebook, in this step, you can interrupt the kernel after executing the serving command and reading the "Entering the event loop ..." message. This will only stop the cell, but Docker will continue running and you can proceed to execute your next cell.
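Besides reading the logs, TF Serving's REST API also reports the state of a served model when you send a GET request to http://localhost:8501/v1/models/effv2b0. The sketch below only parses such a response. The available_versions() helper is hypothetical, and the sample payload is hand-written to mirror the shape of a status response, so treat its contents as an assumption rather than real server output.

```python
import json

def available_versions(status_text):
    """Return the versions whose state is AVAILABLE in a model status response."""
    status = json.loads(status_text)
    return [
        entry["version"]
        for entry in status.get("model_version_status", [])
        if entry.get("state") == "AVAILABLE"
    ]

# Hand-written sample (assumed shape), not captured from a running server
sample = json.dumps({
    "model_version_status": [
        {"version": "1670550215", "state": "AVAILABLE",
         "status": {"error_code": "OK", "error_message": ""}},
        {"version": "1670000000", "state": "END",
         "status": {"error_code": "OK", "error_message": ""}},
    ]
})

print(available_versions(sample))
```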

Sending Requests and Getting a Response from the Model

Our model is already accessible through TF Serving on port 8501. To use it over the web, we send data in a request to the served model and receive the predictions as a response, usually over HTTP. To send requests and read responses, we will import Python's requests library.

Typically, when sending messages over HTTP, we send JSON-formatted messages, as they're both lightweight and very human-readable, and conform to the most widely used language on the web - JavaScript. Since we'll also be sending a JSON payload, we'll import Python's json library:

import json
import requests

After importing the libraries, we need to define the location we want to access - same address of where our model is being served - called an endpoint:

endpoint = 'http://localhost:8501/v1/models/effv2b0:predict'

We're serving the model on our local machine, hence the localhost, though the same steps are taken for a remote virtual machine as well. The v1 in the path is the version of TF Serving's REST API, not of the model, and :predict selects the predict method of the effv2b0 model. Since the URL doesn't name a model version, TF Serving answers with the latest one it has loaded.
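The endpoint can also be built from its parts, which makes it easier to switch hosts or models later. In TF Serving's REST API, a specific model version can be addressed by adding /versions/<number> before :predict. The predict_url() helper below is a hypothetical convenience function, sketched with the example values from this guide:

```python
def predict_url(model_name, host="localhost", port=8501, version=None):
    """Build a TF Serving REST predict URL, optionally pinned to a model version."""
    url = f"http://{host}:{port}/v1/models/{model_name}"
    if version is not None:
        url += f"/versions/{version}"
    return url + ":predict"

default_url = predict_url("effv2b0")
pinned_url = predict_url("effv2b0", version=1670550215)
print(default_url)
print(pinned_url)
```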

Let's set the header's content-type for the HTTP request:

header = {"content-type": "application/json"} 

The last thing we need to do is to send the data for the model to predict, which is our preprocessed swan image. We will convert it into a JSON string with the json.dumps() method:

batch_json = json.dumps({"instances": x.tolist()}) # tolist() transforms x array into a list of arrays so json can understand it

Tensorflow will be expecting a json with the instances key, so it is mandatory to name the field instances.
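To get a feel for what the payload looks like, here is a sketch with a tiny, made-up batch instead of the real image. The nested list stands in for the result of calling tolist() on a (1, 2, 2, 3) array; the pixel values are arbitrary.

```python
import json

# Hypothetical batch with one 2x2 RGB image: (batch, height, width, channels)
tiny_batch = [[[[0.0, 0.0, 0.0], [255.0, 255.0, 255.0]],
               [[10.0, 20.0, 30.0], [40.0, 50.0, 60.0]]]]

payload = json.dumps({"instances": tiny_batch})
decoded = json.loads(payload)

print(payload)
```

Each element of the instances list is one input to the model, so a batch of several images is simply a longer list.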

So far, we have an endpoint, a header and a JSON string with one image. It is time to tie it all together in a web request. To do this, we will use the requests.post() method, which receives a URL, data and headers, and returns a response:

json_res = requests.post(endpoint, data=batch_json, headers=header)

After receiving the response, we can access its body as text with json_res.text and parse it with json.loads(). The parsed response is a dictionary:

server_preds = json.loads(json_res.text)
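When a request is malformed, TF Serving's REST API answers with a JSON body that has an error key instead of predictions, so it can be worth checking for it before using the result. The following is a minimal sketch: extract_predictions() is a hypothetical helper, and both sample bodies are hand-written for illustration.

```python
import json

def extract_predictions(response_text):
    """Parse a predict response, raising if the server reported an error."""
    body = json.loads(response_text)
    if "error" in body:
        raise RuntimeError(f"TF Serving returned an error: {body['error']}")
    return body["predictions"]

# Hand-written sample bodies, not captured from a running server
ok_body = json.dumps({"predictions": [[0.1, 0.9]]})
bad_body = json.dumps({"error": "example error message"})

predictions = extract_predictions(ok_body)
print(predictions)
```

With the requests library, the text to pass in would be json_res.text.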

We can then pass this server predictions dictionary to the same decode_predictions() method we have used previously. There are only two adaptations to be made - the first is to access the predictions key inside the dict, and then to transform the predictions list into an array:

print('Predicted:', decode_predictions(np.array(server_preds['predictions'])))

This results in:

Predicted: [[
('n09332890', 'lakeside', 0.295589358), 
('n09421951', 'sandbar', 0.243745327), 
('n01855672', 'goose', 0.10379523), 
('n02894605', 'breakwater', 0.0317126848), 
('n09428293', 'seashore', 0.0310558397)]]

Here, we have the same predictions we made in our machine now being served and accessed through the web. Mission accomplished!

The final code to access the served model is the following:

import json
import requests

endpoint = 'http://localhost:8501/v1/models/effv2b0:predict'
header = {"content-type": "application/json"} 
batch_json = json.dumps({"instances": x.tolist()})

json_res = requests.post(endpoint, data=batch_json, headers=header)
server_preds = json.loads(json_res.text)
print('Predicted:', decode_predictions(np.array(server_preds['predictions'])))


In this guide, we have learned what a TensorFlow pre-trained model is, how to use it and in which context to use it. We have also learned about serving this model with Docker, and why using Docker would be a good idea according to our objectives.

Besides following all the steps of image transformation, model creation, prediction, model saving, model serving and web requesting, we have also seen how little effort is involved in using another model in this structure.

March 03, 2023 at 07:04PM