The Best Tools for Machine Learning Model Serving
https://www.ximilar.com/blog/the-best-tools-for-machine-learning-model-serving/ | Wed, 25 Oct 2023

An overview and analysis of serving systems and deployment methods for Machine Learning and AI models.

As the prevalence of AI in various industries increases, so does the need to optimize machine learning model serving. As a machine learning engineer, I’ve seen that training models is just one part of the ML journey. Choosing the right deployment strategy and serving system is equally important, and equally challenging.

In this article, we’ll delve into the importance of selecting the right tools for machine learning model serving, and talk about their pros and cons. We’ll explore various deployment options, serving systems like TensorFlow Serving, TorchServe, Triton, Ray Serve, and MLflow, and also the deployment of specific models such as large language models (LLMs). I’ll also provide some thoughts and recommendations for navigating this ever-evolving landscape.

Machine Learning Models Serving Then and Now

When I first began my journey in the world of machine learning, the landscape was constantly shifting. The frameworks being actively developed and used at the time included Caffe, Theano, TensorFlow (Google) and PyTorch (Meta), all vying for their place in the world of AI. As time has passed, the competition has become more and more lopsided, with TensorFlow and PyTorch leading the way. While TensorFlow has remained the more popular choice for production-ready models, PyTorch has been steadily gaining popularity, particularly within research circles, for its faster, more intuitive prototyping.

While there are hundreds of libraries available for training and optimizing models, the most popular frameworks such as TensorFlow, PyTorch and Scikit-Learn are all based on the Python programming language. Python is often chosen for its simplicity and the vast number of libraries for data manipulation. However, it is not the fastest language and can present problems with parallel processing, threading and the GIL. Additionally, specialized libraries such as spaCy and PyG are available for specific tasks, such as Natural Language Processing (NLP) and graph analysis, respectively. The focus was, and still partially is, on the optimization of models and architectures. On the other hand, the large-scale adoption of AI means there are more and more problems with serving machine learning models in production.

Nowadays, even more complex models such as large language models (LLMs – GPT/LLaMA/Bard) and multi-modal models are in fashion, which puts greater pressure on optimal model deployment, the infrastructure environment and storage capacity. Making machine learning model serving and deployment effective and cheap is a big problem. Even companies like Microsoft or NVIDIA are actively working on solutions that will cut its costs. So let’s look into some of the best options that we as developers currently have.

The Machine Learning and DevOps Challenges

Being a Machine Learning Engineer, I can say that training a model is just a small part of the whole lifecycle. Data preparation, the deployment process and running the model smoothly for numerous customers are a daily challenge and a major part of the job.

Deployment Strategies

In addition to allocating GPU/CPU resources and managing inference speed, a company deploying ML models must also consider the deployment strategy for the trained model. You could be deploying the ML model as an API, running it in a container, or using a serverless platform. Each of these options comes with its own set of benefits and drawbacks, so carefully considering the best approach is essential. When we have a trained model, there are several options for how to use it:

  • Deploy it as an API endpoint, sending data in the request and getting results immediately in response. This approach is suitable for faster models that are able to process the data in just a few seconds.
  • Deploy it as an API endpoint, but return just a promise or an asynchronous response from the model. This is great for computationally intensive models that can take minutes or hours of processing. For example, generative models and upscaling models are slow and require this approach.
  • Use a system that is able to serve it for you.
  • Use the model locally on your data.
  • Deploy models on smartphones or IoT devices with a feed from local sensors.

Other Challenges

The complexity of machine learning projects grows with variables such as:

  • The number of models – It is common practice to use multiple models. For example, at this moment, there are tens of thousands of different ML models on the Ximilar platform.
  • Model versions – You can train each of your models on different training data (part of the dataset) and mark it as a different version. Model versioning is great if you want to A/B test your ML model, tune your model performance, and for continuous model training.
  • Format of models – You can potentially train and save your ML models in various formats. For instance, .h5, which is a Keras/TensorFlow format, .pt (PyTorch), or .onnx for ONNX Runtime. Usually, each framework supports only specific formats.
  • The number of frameworks – Served ML models could be trained with different frameworks and their versions.
  • The number of nodes (servers) – Models can be hosted on one or multiple servers, and the serving system should be able to intelligently load-balance requests across servers so that none of them is throttled.
  • Models storage/registry – You need to store the ML models in some database or storage, such as AWS S3 or local storage.
  • Speed/performance – The loading time of models from storage can be critical and can cause slow inference per sample.
  • Ease of use – Calling the model via REST API or gRPC requests, single or batch inference.
  • Hardware specification – ML models can be deployed on Edge devices or PCs with various architectures.
  • GPUs vs CPUs and libraries – Some models must be used only on CPUs and some require a GPU card.

Our Approach to the Machine Learning Model Serving

Several systems were developed to tackle these problems. Serving and deploying machine learning models has come a long way since we founded Ximilar in 2016. Back then, no system was capable of effectively serving hundreds of neural networks for inference.

So, we decided to build our own system for machine learning model serving, and today it forms the backbone of our machine-learning platform. As the use of AI becomes more widespread in companies, newer systems such as TensorFlow Serving emerge quickly to meet the increasing demand.

Which Framework Is The Best?

The Battle of Machine Learning Frameworks

Nowadays, each big tech company has its own solution for machine learning model serving and training. To name a few: TorchServe and AITemplate by Meta (Facebook) for PyTorch, TensorFlow Serving by Google, ONNX Runtime by Microsoft, Triton by NVIDIA, Multi-Model Server by Amazon, and many others like BentoML or Ray.

There are also tens of formats that you can save your ML model in; TensorFlow alone is able to save into .h5, .pb, SavedModel or .tflite formats, each of them serving a different purpose. For example, TensorFlow Lite is great for smartphones. It also loads very fast, so the availability of the model is great. However, it supports only a limited set of operations, and more modern architectures may not be convertible to it.
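
For illustration, here is a minimal sketch of converting a Keras model to TensorFlow Lite with the built-in converter (a stock MobileNetV2 is used as a stand-in for your own model):

import tensorflow as tf

# Any trained Keras model will do; a stock MobileNetV2 is used here for illustration.
model = tf.keras.applications.MobileNetV2(input_shape=(224, 224, 3))

# Convert the model to the TensorFlow Lite flat-buffer format.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

with open("mobilenet_v2.tflite", "wb") as f:
    f.write(tflite_model)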

Machine learning model serving: each big tech company has their own solution for training and serving machine learning models.

You can also try to convert models from PyTorch or TensorFlow to the TensorRT and OpenVINO formats. The conversion usually works for basic and widely used architectures. TensorRT is great if you are deploying ML models on a Jetson Nano or Xavier. You can achieve a performance boost on Intel servers via OpenVINO conversion or the Neural Magic library.

The ONNX Format

One notable thing is the ONNX format. ONNX is not a library for training your machine learning models; it is an open format for storing them. After training a model, for example in TensorFlow, you can convert it to ONNX. You are then able to run the converted model via ONNX Runtime on almost any platform, programming language and CPU architecture, and with your preferred hardware acceleration. Serving a model sometimes requires specific versions of libraries, which is why ONNX can solve a lot of problems.
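
As a quick illustration, here is a minimal sketch of exporting a PyTorch model to ONNX and running it with ONNX Runtime (the model and file names are just examples):

import numpy as np
import torch
import torchvision
import onnxruntime as ort

# Export a PyTorch model to the ONNX format.
model = torchvision.models.mobilenet_v2().eval()
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "mobilenet_v2.onnx",
                  input_names=["input"], output_names=["output"])

# Run the exported model with ONNX Runtime, independently of PyTorch.
session = ort.InferenceSession("mobilenet_v2.onnx")
image = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(["output"], {"input": image})
print(outputs[0].shape)  # (1, 1000) class scores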

Exploration is Key

There are a lot of options for ML model training, saving, conversion and deployment. Every library has its pros and cons; some of them are easy to use for training and development, while others are specialized for specific platforms or specific fields (computer vision, recommender systems or NLP).

I would recommend investing some time in exploring the frameworks and systems before deciding which one to lock into. The competition is rough in this field, and every company tries to be as innovative as possible to keep up with the others. Even the Chinese company Baidu has developed its own solution, called PaddlePaddle. At the end of the article, I will give some recommendations on which frameworks and serving systems you should use and when.

The Best Machine Learning Serving Tools

OK, let’s say that you trained your own model or downloaded one that has already been trained. Now you would like to deploy a machine-learning model in production. Here are a few options that you can try.

If you don’t know how to train a machine learning model, you can start with this tutorial by PyTorch.

Deploy ML Models With API

If you have one or a few models, you can build your own system for ML model serving. With Python and libraries such as Flask or Django, there is a straightforward way to develop a simple REST API. When the web service starts, it loads the model in the background, and then every incoming request calls the model on the incoming data.
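
A minimal sketch of such a service could look like this (the model path and JSON fields are placeholders; error handling is omitted):

import numpy as np
from flask import Flask, request, jsonify
from tensorflow import keras

app = Flask(__name__)

# Load the model once, when the web service starts.
model = keras.models.load_model("/path/to/my_model.h5")

@app.route("/predict", methods=["POST"])
def predict():
    # Expect JSON like {"instances": [[...], [...]]} with preprocessed inputs.
    instances = np.array(request.get_json()["instances"], dtype=np.float32)
    predictions = model.predict(instances)
    return jsonify({"predictions": predictions.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)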

It can get problematic if you want to work effectively with GPU cards and handle parallel requests. I would recommend packaging the system into Docker and then running it in Kubernetes.

With Kubernetes, Docker and smart load balancing such as HAProxy, such a system can potentially scale to bigger volumes. Java and Go are also good languages for deploying ML models.

Here is a simple tutorial with a scikit-learn model served as a REST API with Flask.

Now let’s have a look at the open-source serving systems that you can use out of the box, usually with a small piece of code or no code at all.

TensorFlow Serving

GitHub | Docs

TensorFlow Serving is a modern serving system for TensorFlow ML models. It’s a part of TensorFlow Extended developed by Google. The recommended way of using the system is via Docker.

Simply run docker pull tensorflow/serving (or tensorflow/serving:latest-gpu if you need GPU support) to get the image, then run it via Docker:

docker run -p 8501:8501 \
  --mount type=bind,source=/path/to/my_model/,target=/models/my_model \
  -e MODEL_NAME=my_model -t tensorflow/serving

Now that the system is serving your model, you can query it with gRPC or REST calls. For more information, read the documentation. TensorFlow Serving works best with the SavedModel format. The model should define its signature_def_map, which specifies the inputs and outputs of the model. If you would like to dive deeper into the system, my recommendation is the videos by the team itself.
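
For illustration, a REST prediction request against the container above is a single POST call; the input shape of course depends on your model's signature:

import json
import numpy as np
import requests

# One dummy input matching the model's expected shape, e.g. a 224x224 RGB image.
instances = np.random.rand(1, 224, 224, 3).tolist()

response = requests.post(
    "http://localhost:8501/v1/models/my_model:predict",
    data=json.dumps({"instances": instances}),
    headers={"Content-Type": "application/json"},
)
print(response.json()["predictions"])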

In my opinion, TensorFlow Serving is great with simple models and just a few versions. The documentation, however, could be simpler. With advanced architectures, you will need to define custom operations, which is a big disadvantage if you have a lot of models with more modern operations.

TorchServe

GitHub | Docs

TorchServe is a more modern system than TensorFlow Serving. The documentation is clean, and it supports basically everything that TF Serving does; however, this one is for PyTorch models. Before serving a PyTorch model via TorchServe, you need to convert it to a .mar package. Basically, the .mar package contains the model name, version, architecture and the actual weights of the model. Installation and running are also possible via Docker, and it is very similar to TensorFlow Serving.

I personally like the management of the models: you are able to simply register new models by sending API requests, list models and query statistics. I find TorchServe very simple to use. Both REST API and gRPC are available. If you are working with pure PyTorch models, then TorchServe is the recommended way.
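
For illustration, here is a hedged sketch of registering a .mar package and querying it with plain HTTP calls to the default management (8081) and inference (8080) APIs; the model and file names are placeholders:

import requests

# Register a new model from the model store via the management API.
requests.post(
    "http://localhost:8081/models",
    params={"url": "my_model.mar", "initial_workers": 1},
)

# List registered models and their status.
print(requests.get("http://localhost:8081/models").json())

# Run inference via the inference API by posting the raw input data.
with open("example.jpg", "rb") as f:
    response = requests.post("http://localhost:8080/predictions/my_model", data=f)
print(response.json())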

Triton

GitHub | Docs

Both of the serving systems mentioned above are tightly bound to the frameworks of the models they are able to serve. That is probably why Triton has a big advantage over them, since it can serve both TensorFlow and PyTorch models. It is also able to serve the OpenVINO, ONNX and TensorRT formats! That means it supports all the major formats in the machine learning field. Even though NVIDIA developed it, it doesn't require a GPU card and can also run on CPUs.

To run Triton, simply pull it from the Docker repository via docker pull nvcr.io/nvidia/tritonserver. The Triton server loads models from a specific directory called the model repository. Each model is defined by a configuration, and in this configuration there is a platform setting that defines the model format, for example “tensorflow_graphdef” or “onnxruntime_onnx“. This way, Triton knows how to run each specific model.

The documentation is not super easy to read (mostly GitHub README files) because the project is in very active development. Otherwise, working with the models is similar to the other serving systems, meaning calling models via gRPC or REST.
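
As an illustration, here is a hedged sketch of calling a model with the official tritonclient Python package; the tensor names and shapes below are placeholders and must match the model's configuration:

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Input/output tensor names must match the model's config.pbtxt.
infer_input = httpclient.InferInput("input_1", [1, 224, 224, 3], "FP32")
infer_input.set_data_from_numpy(np.random.rand(1, 224, 224, 3).astype(np.float32))
requested_output = httpclient.InferRequestedOutput("predictions")

result = client.infer(model_name="my_model", inputs=[infer_input], outputs=[requested_output])
print(result.as_numpy("predictions"))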

Ray Serve

GitHub | Docs

Ray is a general-purpose system for scaling machine learning workloads. It primarily focuses on model serving and providing the primitives for you to build your own ML platform on top.

Ray Serve offers a more Pythonic way of creating your own serving system. It is framework-agnostic, and anything that can be run via Python can also be run with Ray. Basically, it looks as simple as Flask. You define a simple Python class for your model and decorate it with a route prefix handler. Then you just call it via a REST API request.

import requests
from starlette.requests import Request
from typing import Dict

from ray import serve

# 1: Define a Ray Serve deployment.
@serve.deployment(route_prefix="/")
class MyModelDeployment:
    def __init__(self, msg: str):
        # Initialize model state: could be very large neural net weights.
        self._msg = msg

    def __call__(self, request: Request) -> Dict:
        return {"result": self._msg}

# 2: Deploy the model.
serve.run(MyModelDeployment.bind(msg="Hello world!"))

# 3: Query the deployment and print the result.
print(requests.get("http://localhost:8000/").json())

If you want to have more control over the system, Ray is a great option. There is a Ray Clusters library which is able to deploy the system on your own Kubernetes Cluster, AWS or GCP with the ability to configure the autoscaling option.

MLflow

MLflow is an open-source platform for the whole ML lifecycle. From training to evaluation, deployment, tracking, model monitoring and central model registry.

MLflow offers a robust API and several language bindings for the whole management of the machine learning model’s lifecycle. There is also a UI for tracking your trained models. MLflow is really a mature package with a whole bundle of components that your team can use.
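
A minimal sketch of the typical workflow, assuming a scikit-learn model for illustration (MLflow has analogous flavors for TensorFlow, PyTorch and others):

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier().fit(X, y)

# Log the trained model (plus any params, metrics or artifacts) to the tracking store.
with mlflow.start_run() as run:
    mlflow.sklearn.log_model(model, "model")

# Later, load the model back by its URI and use it for inference.
loaded = mlflow.pyfunc.load_model(f"runs:/{run.info.run_id}/model")
print(loaded.predict(X[:3]))

The same logged model can also be exposed as a local REST endpoint with the mlflow models serve command.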

Other Useful Tools for Machine Learning Model Serving

  • Multi-Model Server is a system similar to the previous ones. Developed by the Amazon AWS team, it is able to run models trained with MXNet or converted via ONNX.
  • BentoML is a project very similar to MLflow. There are many different tools that data scientists can use for the training and deployment processes. The UI looks a bit more modern. BentoML is also able to automatically generate Docker images for your models.
  • KServe is a simple system for managing and scaling models on your Kubernetes cluster. It solves deployment and autoscaling, and provides a standardized inference protocol across ML frameworks.

Cloud Options of AWS, GCP and Azure

Of course, every big tech player provides cloud platforms to host and serve your machine learning models. Let’s have a quick look at a few examples.

Microsoft is a big supporter of ONNX, so with Azure Machine Learning services, you are able to deploy your models to the cloud via Python or Azure CLI. The process requires an entry script in Python with two methods: init for initialization of your model and run for inference. You can find the entire workflow in Azure development documentation.

The Google Cloud Platform (GCP) has good support for TensorFlow, as it is their native framework. However, Docker deployment is available, so other frameworks can be used too. There are multiple ways to achieve the deployment. The classic way is to use the AI Platform prediction tool or Google Cloud Run. There is also a serverless HTTP endpoint/function, which serves a model stored in a Google Cloud Storage bucket. You define your function in Python with the prediction method and the loading of the model.
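
A hedged sketch of such a serverless function could look like this; the bucket, file and field names are made up for illustration and the real setup depends on your project:

import json
import tensorflow as tf
from google.cloud import storage

_model = None  # loaded lazily, so the download happens once per function instance

def _get_model():
    global _model
    if _model is None:
        bucket = storage.Client().bucket("my-models-bucket")
        bucket.blob("my_model.h5").download_to_filename("/tmp/my_model.h5")
        _model = tf.keras.models.load_model("/tmp/my_model.h5")
    return _model

def predict(request):
    # HTTP Cloud Function entry point: expects JSON like {"instances": [[...], ...]}.
    instances = request.get_json()["instances"]
    predictions = _get_model().predict(instances).tolist()
    return json.dumps({"predictions": predictions})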

Amazon Web Services (AWS) also contains multiple options for the ML deployment process and serving. The specialized system for machine learning is Amazon Sagemaker.

All the big platforms allow you to create your own virtual server instances, create your Kubernetes clusters, and use any of the systems/frameworks mentioned earlier. Nevertheless, you need to be very careful, because it can get really pricey. There are also smaller players on the market, such as Banana, Seldon and Comet ML, for training, serving & deployment. I personally don't have experience with them, but they are becoming more popular.

Large Language (LLMs) and Multi-Modal Models in Production

With the introduction of GPT by OpenAI, a new class of AI models was introduced – large language models (LLMs). These models are extremely big, trained on massive datasets and deployed on infrastructure that requires a whole datacenter to run. “Smaller” models – usually open-source versions – are being released, but they also require a lot of computational resources and modern servers to run smoothly.

Recently, several serving systems for these models were developed:

  • OpenLLM by BentoML is a nice system that supports almost all open-source models like Llama2. You can just pick one of the models and run the following commands to start with the serving and query the results:

openllm start opt
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'Explain to me the difference between "further" and "farther"'
  • The vLLM project is a Python library that can help you with the deployment of an LLM as an API server. What is great is that it provides an OpenAI-compatible server, so you can easily switch from the paid OpenAI service to an open-source variant without modifying the code on the client side (see the sketch after this list). This project is being developed at UC Berkeley, and it integrates new techniques for fast inference of LLMs.

  • SkyPilot is a great option if you want to run LLMs on cloud providers such as AWS, Google Cloud or Azure. Because running these models is costly, SkyPilot is able to automatically pick the cheapest provider and launch the model as an endpoint.
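
For illustration, once a vLLM OpenAI-compatible server is running (for example on localhost:8000), querying it looks just like calling the OpenAI API, only with a different base URL; the model name below is a placeholder for whatever model the server was started with:

import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Llama-2-7b-hf",  # placeholder: the model the server serves
        "prompt": "Explain to me the difference between 'further' and 'farther'",
        "max_tokens": 128,
    },
)
print(response.json()["choices"][0]["text"])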

Ximilar AI Platform

Free Login | Docs

Last but not least, you can use our codeless machine-learning platform. Instead of writing a lot of code and training and deploying an ML model by yourself, you can try it in the Ximilar App. Training image classification and object detection models can be done both in the browser App and via the API. There is every tool you would need in the ML model development stage, such as training data/image management, labelling tools, evaluation of your models on testing and training datasets, performance metrics, explanation of models on specific images, and so on.

Ximilar’s computer vision platform enables you to develop AI-powered systems for image recognition, visual quality control, and more without knowledge of coding or machine learning. You can combine them as you wish and upgrade any of them anytime.

Once your model is trained, it is deployed as a REST API endpoint. It can be connected into a workflow of multiple machine learning models working together with conditions like if-else statements. The major benefit is that you just connect your system to the API and query the results. All the training and serving problems are solved by us. In the end, you will save a lot of costs, because you don't need to own or rent your own infrastructure or serving systems, or employ a specialized machine learning engineering team.

We built a Ximilar Platform so that businesses from e-commerce, healthcare, manufacturing, real estate and other areas could simply develop their own AI models without coding and with a reasonable budget. For example, on the following screen, you can see our task management for the trading cards collector community.

We and our customers use our platform for the training of machine learning models. Together with our own system for machine learning model serving, it is an all-in-one solution for ML model deployment.

The great thing is that everything is manageable via REST API requests with JSON responses. Here is a simple curl command to query all models in production:

curl --request GET \
  --url https://api.ximilar.com/recognition/v2/task/ \
  --header 'Content-Type: application/json' \
  --header 'authorization: Token APITOKEN'
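
The same call in Python takes only a couple of lines, using the endpoint and token placeholder from the curl example above:

import requests

headers = {
    "Content-Type": "application/json",
    "Authorization": "Token APITOKEN",  # replace with your API token
}
response = requests.get("https://api.ximilar.com/recognition/v2/task/", headers=headers)
print(response.json())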

Deployment of ML Models is Science

There are a lot of systems that try to make deployment and serving easy. The topic of deployment & serving is broad, with many choices of hardware infrastructure, DevOps, programming languages, system development, costs, storage and scaling, so it is not easy to pick one. If you would like to dig deeper, there is plenty of further reading on each of these topics.

My Final Tips & Recommendations

Pick a good framework to start with

Having done machine learning for more than 10 years, my advice is to start by picking a good framework for model development. In my opinion, the best choice right now is PyTorch. It is easy to use and supports a lot of state-of-the-art architectures.

I used to be a fan of TensorFlow for a long time, but over time its developers have not been able to integrate modern approaches. Backward compatibility is also often broken, and the code quality is getting worse, which leads to more and more bugs in the framework.

Save your models in different formats

Second, save your models in different formats. I would recommend using ONNX and OpenVINO here. You never know when you will need them. This has happened to me a few times: we needed to upgrade the server and systems (our production environment), but the new versions of the libraries stopped supporting the specific format of the model, so we had to switch to a different one.

Pick a serving system suitable to your needs

If you are a small company, then Ray Serve is a good option. Bigger companies, on the other hand, have complex requirements for development and robust infrastructure. In that case, I would recommend picking a more complex system like MLflow. If you would like to serve the models in the cloud, then look at Multi-Model Server. The choice is really based on the use case. If you don't want to bother with all of this, then try our Ximilar Platform, which provides model optimization, model validation, data storage and model deployment as an API.

I will keep this article updated, and if some promising new serving system appears, I will be more than happy to mention it here. After all, machine learning is about constant progress, and that is one of the things I like about it the most.

Flows – The Game Changer for Next-Generation AI Systems
https://www.ximilar.com/blog/flows-the-game-changer-for-next-generation-ai-systems/ | Wed, 01 Sep 2021

Flows is a service for combining machine learning models for image recognition, object detection, and other AI services into an API.

We have spent thousands of man-hours on this challenging subject. Gallons of coffee later, we introduced a service that might change how you work with data in Machine Learning & AI. We named this solution Flows. It enables simple and intuitive chaining and combining of machine learning models. This simple idea speeds up the workflow of setting up complex computer vision systems and brings unprecedented scalability to machine learning solutions.

We are here to offer a lot more than just training models, as common AI companies do. Our purpose is not to develop AGI (artificial general intelligence), which is going to take over the world, but easy-to-use AI solutions that can revolutionize many areas of both business and daily life. So, let's dive into the possibilities of flows in this 2021 update of one of our most-viewed articles.

Flows: Visual AI Setup Cannot Get Much Easier

In general, on our platform, you can break your machine learning problem down into smaller, separate parts (recognition, detection, and other machine learning models called tasks) and then easily chain & combine these tasks with Flows to achieve the full complexity and hierarchical classification of a visual AI solution.

A typical simple use case is conditional image processing. For instance, the first recognition task filters out non-valid images, then the next one decides a category of the image and, according to the result, other tasks recognize specific features for a given category.

Hierarchical classification with the Ximilar Flows service is easy. Flows can help you build a powerful computer vision system.
Simple use of a combination of machine learning models in a flow

Flows allow your team to review and change datasets of all complexity levels fast and without any trouble. It doesn’t matter whether your model uses three simple categories (e.g. cats, dogs, and guinea pigs) or works with an enormous and complex hierarchy with exceptions, special conditions, and interdependencies.

It also enables you to review the whole dataset structure, analyze it, and, if necessary, change its logic thanks to its modularity. With a few clicks, you can add new labels or models (tasks), change their chaining, change the names of the output fields, etc. Neat? More than that!

Think of Flows as Zapier or IFTTT in AI. With flows, you simply connect machine learning models, and review the structure anytime you need.

Define a Flow With a Few Clicks

Let’s assume we are building a real estate website, and we want to automatically recognize different features that we can see in the photos. Different kinds of apartments and houses have various recognizable features. Here is how we can define this workflow using recognition flows (we trained each model with a custom image recognition service):

An example of a real estate classifier made of machine learning models combined in a flow

The image recognition models are chained in a “main” flow called the branch selector. The branch selector saves the result in the same way as a recognition task node and also chooses an action based on the result of this task. First, we let the top category task recognize the type of estate (Apartment vs. Outdoor house). If it is an apartment, we can see that two subsequent tasks are “Apartment features” and “Room type”.

A flow can also call other flows, so-called nested flows, and delegate part of the work to them. If the image is an outdoor house, we continue processing by another nested flow called “Outdoor house”. In this flow, we can see another branching according to the task that recognizes “House type”. Different tasks are called for individual categories (Bungalow, Cottage, etc.):

An example use of nested flows – the main flow calls other nested flows to process images based on their category

Flow Elements We Used

So far, we have used three elements:

  • A recognition task, which simply calls a given task and saves the result into an output field with a specified name. No other logic is involved.
  • A branch selector, on the other hand, saves the result in the same way as a recognition task node, but then chooses an action based on the result of this task.
  • A nested flow, another flow of tasks that the “main” flow (branch selector) calls.

Implicitly, there is also a List element present in some branches. We do not need to create it, because as soon as we add two or more elements to a single branch, a list is generated in the background. All nodes in a list are normally executed in parallel, but you can also set sequential execution. In that case, a reordering button will appear.

Branch Selector – Advanced Settings

The branch selector is a powerful element, and it's worthwhile to explore what it can do. Let's go through the most important options. In a single branch, by default, only the action (tag or category) with the highest relevance will be performed, provided the relevance (the probability outputted by the model) is above 50 %. But we can change this in the advanced settings. We can specify the threshold value and also enable the parallel execution of multiple branches!

The advanced settings of a branch selector, enabling you to skip a task of a flow

You can specify the format of the results. Flat JSON means that the results from all branches will be saved on the same level as any previous outcomes, and if the same output name appears in multiple branches, the values can overwrite each other, because parallel execution guarantees neither the order nor which result wins. You can prevent this by selecting nested JSON, which will save the results from each branch under a separate key, based on the branch name (that is, the tag/category name).

If some data (an output_field) is already present in the incoming request, we can skip the branch selector processing. You can define this in If Output Field Exists. This way, we can save credits and also improve the precision of the system. I will show you how useful this behaviour can be in the next paragraphs. To learn about the advanced options of training, check this article.

An Example: Fashion Detection With Tags

We have just created a flow to tag simple and basic pictures. That is cool, but can we really use it in real-life applications? Probably not. The reason is that in most pictures, there is usually more than one clothing item. So how are we going to automate the tagging of more complex pictures? The answer is simple: we can integrate object detection into flows and combine it with recognition & tagging models!

Example of Fashion Tagging combined with Object Detection in the Ximilar App

The flow structure then exactly mirrors the rich product taxonomy. Each image goes through a taxonomy tree in order to get the proper tags. This is our “top classifier” – a flow that assigns one of our seven top categories to a fashion product image, which determines how the image will be further classified. For instance, if it is a “Clothing” product, the image continues to the “Clothing tagging” flow.

A “top classifier” – a flow that can tell one of our seven top categories of a fashion product image.

Similar to categorization or tagging, there are two basic nodes for object detection: the Detection Task for simple execution of a given task and Object Selector, which enables the processing of the detected objects.

Object Selector will call the object detection task. The detected objects will be extracted out of the image and passed further to any of the available nodes. Yes, any of them! Even another Object Selector, if, for example, you need to first detect people and then detect clothes on each person separately.

Object Selector – Advanced Settings

The Object Selector's behavior can be customized in similar ways as the Branch Selector. In addition to the Probability Threshold, there is also an Area Threshold. By default, all objects are processed. By setting this threshold, objects that do not take up at least a given percentage of the image are simply ignored. This can be changed to a single object by probability or area in Select. As I mentioned, we extract the object before further processing. We can extend it a bit to include some context using Expand Bounding Box by…

Setting a threshold for the space that an object should occupy in order to be detected

A Typical Flows Application: Fashion Tagging

We have been playing with the fashion subject since the inception of Ximilar. It is the most challenging and also the most promising one. We have created all kinds of tools and helpers for the fashion industry, namely Fashion Tagging, specialized Fashion Search, or Annotate. We are proud to have a very precise automatic fashion tagging service with a rich fashion taxonomy.

And, of course, Fashion Tagging is internally powered by Flows. It is a huge project with several dozen features to recognize, about a hundred recognition tasks, and hundreds of labels, all chained into several interconnected flows. For example, this is what our AI says about a simple dress now – and you can try it on your own picture in the public demo.

Example of fashion attributes assigned to a dress by Ximilar Fashion Tagging flow.
Example of fashion attributes assigned to a dress by Ximilar Fashion Tagging flow

Include Pre-trained Services In Your Flow

The last group of nodes at your disposal are Ximilar services. We are working hard on an ever-growing number of ready-to-use services, which can be called through our API and integrated into your project. It is natural for our users to combine more AI services, and flows make it easier than ever. At this moment, you can call several ready-to-use recognition services.

But more will come in the future, for example, Remove Background.

Increasing Possibilities of Flows

As our app and list of services grow, so do the flows. There are two features we are currently looking forward to. We are already building custom similarity models for our customers; as soon as they are ready, they will be available for combining in flows. And there is one more item very high on our list: predicting numeric values, i.e. regression, in machine learning terms. Stay tuned for more exciting news!

Create Your Flow – It’s Free

Before Flows, setting up an AI vision process was a tedious task for a skilled developer. Now everyone can set up, manage and alter the steps on their own, in a comprehensive, visual way: optimizing the process quickly, getting a faster response, losing less time and expense, and delivering higher quality to customers.

And what's the best part? Flows are available to the users of Ximilar's free plan, so you can try them right away. Sign up or log in to the Ximilar App and open the Flows service from the Dashboard. If you want to learn the basics first, check out our video tutorials. Then you can connect tasks and labels defined in your own Image Recognition.

Training of machine learning models is free with Ximilar, you are only paying for API calls for recognition. Read more about API calls or API credit packs. We strongly believe you will love Flows as much as we enjoyed bringing them to life. And if you feel like there is a feature missing, or if you prefer a custom-made solution, feel free to contact us!

OpenVINO: Start Optimizing Your TensorFlow 2 Models for Intel CPUs with Docker
https://www.ximilar.com/blog/openvino-start-optimizing-your-tensorflow-2-models-for-intel-cpus-with-docker/ | Tue, 02 Feb 2021

A tutorial on optimizing your image recognition models with OpenVINO technology, making your system faster on Intel CPUs.

In the previous article, we mentioned how OpenVINO improved the performance of our machine learning models on our Intel Xeon CPUs. Now, we would like to help the machine learning practitioners who want to start using this toolkit as fast as possible and test it on their own models.

You can find extensive documentation on the official homepage, there is a GitHub page, some courses on Coursera, and many other resources. But how do you start as fast as possible, without having to study all of the materials? One possible way can be found in the following paragraphs.

No Need to Install, Use Docker

OpenVINO has a couple of dependencies which need to be present on your computer. Additionally, to install some of them, you need to have root/admin rights. This might not be desirable. Using Docker represents a much cleaner way, especially when there is an image prepared for you on Docker Hub.

If you are not familiar with Docker, it might seem like a complicated piece of software, but we highly recommend you try it and learn the basics. It is not hard and it is worth the effort. Docker is an important part of today's software development world. You will find installation instructions here.

Running the Docker Image

Containers are stateless. That means that the next time you start your container, all the changes made will be gone. (Yes, this is a feature.) If you want to persist some files, just prepare a directory on your filesystem, and we will bind it as a shared volume to the running container (to the /home/openvino directory).

We will run our container in interactive mode (-it) so that you can easily execute commands inside it and it will be terminated after you close the command prompt (--rm). With the following command, the image will be downloaded (pulled), if it was not already done, and it will be run as a container.

docker run -it --rm -v __YOUR_DIRECTORY__:/home/openvino openvino/ubuntu18_dev:latest

To be able to use all the tools, the OpenVINO environment needs to be initialized. For some reason, this is not done automatically. (At least not for a normal user; if you start Docker as root, -u 0, the setup script is run.)

source /opt/intel/openvino/bin/setupvars.sh

A confirmation is then printed out.

[setupvars.sh] OpenVINO environment initialized

TensorFlow 2 Dependencies

TensorFlow 2 is not present inside the container by default. We could very easily create our own image based on the original one with TensorFlow 2 installed. This is the best way in production. With this being said, we will show you another way: using the original container and installing the missing packages into a virtual environment in the shared directory (volume). This way, we can create as many such environments as we want, or easily modify this environment. In addition, we will still be able to try older TensorFlow 1 models. We prefer this approach during initial development.

The following code needs to be executed only once, after you first start your container.

mkdir ~/env
python3 -m venv ~/env/tensorflow2 --system-site-packages
source ~/env/tensorflow2/bin/activate
pip3 install --upgrade pip
pip3 install -r /opt/intel/openvino/deployment_tools/model_optimizer/requirements_tf2.txt

When you close your container and open it again, this is the only part you will need to repeat.

source ~/env/tensorflow2/bin/activate

Converting TensorFlow SavedModel

Let's say you have a trained model in the SavedModel format. For the sake of this tutorial, we can take a pretrained MobileNet. Execute python3 and run the following couple of lines to download and save the model.

import tensorflow as tf
model = tf.keras.applications.MobileNetV2(input_shape=(224,224,3))
model.save("/home/openvino/models/tf2")

Conversion is a matter of one command. However, there are a few important parameters we need to use, which are described below. The complete list can be found in the documentation. Exit the Python interpreter and run the following command in bash.

/opt/intel/openvino/deployment_tools/model_optimizer/mo_tf.py --saved_model_dir ~/models/tf2 --output_dir ~/models/converted --batch 1 --reverse_input_channels --mean_values [127.5,127.5,127.5] --scale_values [127.5,127.5,127.5]
  • --batch 1 sets the batch size to 1. OpenVINO cannot work with undefined input dimensions. In our case, the input shape for MobileNetV2 is (-1, 224, 224, 3).
  • --reverse_input_channels tells the inference engine to convert the image from its BGR format to the RGB used by MobileNetV2.
  • --mean_values [127.5,127.5,127.5] --scale_values [127.5,127.5,127.5] finally performs the necessary preprocessing so that the image values are between -1 and 1. In TensorFlow, we would call preprocess_input.

Be careful: if you use pretrained models, a different preprocessing and channel order may be required. If you feed a neural network input preprocessed in the wrong way, you will of course get the wrong result.

You don't need to include the preprocessing inside the converted model. The other option is to preprocess every image inside your own code before passing it to the converted model. However, we will use an OpenVINO inference tool that expects correctly preprocessed input.

At this point, we also need to mention that you might get slightly different values from the SavedModel in TensorFlow and the converted OpenVINO model. But from my experience with classification models, the difference is comparable to what you get when using different methods to downscale your image to the proper input size.

Run Inference on Converted Model

First, we will get a testing picture which belongs to one of the 1000 ImageNet classes. We chose zebra, class index 340. (For TensorFlow, Google keeps the class indexes here.)

Photo by Frida Bredesen on Unsplash

Let's download it to our home directory. We saved a small version of the image on our server, so you can get it from there.

 curl -o ~/zebra.jpg -s https://images.ximilar.com/tutorials/openvino/zebra.jpg

There is a script you can use for testing the prediction with no need to write any code.

python3 /opt/intel/openvino/deployment_tools/inference_engine/samples/python/classification_sample_async/classification_sample_async.py -i ~/zebra.jpg -m ~/models/converted/saved_model.xml -d CPU

We get some info lines and top 10 results at the end. Since the numbers are pretty clear, we will show only the first three.

classid probability
------- -----------
  340    0.9556126
  396    0.0032325
  721    0.0008250

Cool, our model is really sure about what is on the picture!

Using OpenVINO Inference in Python

That was easy, right? But you probably need to run the inference inside your own Python code. You can take a look inside the script above. It is pretty straightforward, but for the sake of completeness, we will copy some of the code here. We will also add code for running inference on the original model, so that you can compare the two easily. If you please, run python3 and we can start.

We need one basic import from OpenVINO inference engine. Also, OpenCV and NumPy are needed for opening and preprocessing the image. If you prefer, TensorFlow could be used here as well of course. But since it is not needed for running the inference at all, we will not use it.

import cv2
import numpy as np
from openvino.inference_engine import IECore

As for the preprocessing, part of it is already included inside the converted model (scaling, mean subtraction, reversing the input channels), but that is not all. We need to make sure the image has the proper size (224 pixels on both sides) and that the dimensions are in the correct order – batch, channel, width, height.

img = cv2.imread("/home/openvino/zebra.jpg")
img = cv2.resize(img, (224, 224))
img = np.expand_dims(img, 0)
img_openvino = img.transpose((0,3, 1, 2))

Now, we can try a simple OpenVINO prediction. We will use one synchronous request.

ie = IECore()
net = ie.read_network(model="/home/openvino/models/converted/saved_model.xml", weights="/home/openvino/models/converted/saved_model.bin")
input_name = next(iter(net.input_info))
output_name = next(iter(net.outputs))
net.batch_size = 1
# number of request can be specified by parameter num_requests, default 1
exec_net = ie.load_network(network=net, device_name="CPU")
# we have one request only, see num_requests above
request = exec_net.requests[0]
# infer() waits for the result
# for asynchronous processing check async_infer() and wait()
request.infer({input_name: img_openvino})
# read the result
prediction_openvino_blob = request.output_blobs[output_name]
prediction_openvino = prediction_openvino_blob.buffer

Our result, prediction_openvino, is an ordinary NumPy array with the shape (batch dimension, number of classes) = (1, 1000). To print the top 3 values as before, we can use following couple of lines.

ids = np.argsort(prediction_openvino)[0][::-1][:3]
probabilities = np.sort(prediction_openvino)[0][::-1][:3]
for id, prob in zip(ids, probabilities):
    print(f"{id}\t{prob}")

We get exactly the same results as before. Our code works!

Comparing Results with Original TensorFlow Model

Now, let's do the same with the TensorFlow model. Do not forget to preprocess the image first. The prepared function preprocess_input can be used for that.

import tensorflow as tf
img_tf =  tf.keras.applications.mobilenet_v2.preprocess_input(img)
model = tf.keras.models.load_model("/home/openvino/models/tf2")
prediction_tf = model.predict(img_tf)

The results are almost the same; the difference is so small that we can ignore it. The top result from this prediction has a probability of 0.95860416, compared to the 0.9556126 we had before. The order of the other predictions might be slightly different, because the values are so tiny.
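
If you want to check the agreement programmatically, a simple elementwise comparison with a tolerance does the job (both arrays have the shape (1, 1000)):

import numpy as np

# True if every class score differs by less than the tolerance.
print(np.allclose(prediction_openvino, prediction_tf, atol=1e-2))
# The largest absolute difference between the two predictions.
print(np.max(np.abs(prediction_openvino - prediction_tf)))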

By the way, there is a built-in function decode_predictions, which will give you not only the top results, but also the class names and codes instead of just the ids. Here are the top 3 TensorFlow predictions:

from tensorflow.keras.applications.mobilenet_v2 import decode_predictions
top3 = decode_predictions(prediction_tf, top=3)

Here is the result:

[[('n02391049', 'zebra', 0.95860416), ('n02643566', 'lionfish', 0.0032956717), ('n01968897', 'chambered_nautilus', 0.0008273276)]]

Benchmarking

We should mention that there is also a tool for benchmarking OpenVINO models called Benchmark Python Tool. It offers synchronous (latency-oriented) and asynchronous (throughput-oriented) measuring modes. Unfortunately, it does not work for other models (like TensorFlow) and cannot be used for direct comparison.

How OpenVINO Helped Us

Enough of the code; at the end of the article we will add some numbers. At Ximilar, we often use recognition models with a resolution of 224 or 512 pixels, in batches of 1 or 10. We use the TensorFlow Lite format as often as possible, because it is very fast to load (see a comparison here). Because of the fast loading, it is not necessary to keep the model in the cache all the time. To make running TF Lite models faster, we enhance their performance with XNNPACK.

Below you can see a chart with results for MobileNet V2. For the batches, we show prediction time in seconds per single image. Tests were done on our production workers.

Summary

In this article, we briefly introduced some of the basic functionality of OpenVINO. Of course, there is much more to try. We hope that this article has motivated you to try it yourself and maybe continue to explore all the possibilities through more advanced resources.

Behind The Scene: Ximilar and Intel AI Builders Program
https://www.ximilar.com/blog/behind-the-scene-ximilar-and-intel-ai-builders-program/ | Thu, 17 Dec 2020

Becoming a partner in the Intel AI Builders program helps us with scaling our SaaS platform for image recognition and visual search.

When Ximilar was founded, many crucial decisions needed to be made. One of them was where to run our software. After careful consideration, our first servers were bought and subsequently, all our services have run on our own hardware.

This, of course, brought a great number of challenges, but also many opportunities. We would like to share with you some of our latest accomplishments. Mainly, we are proud to announce that we have become a part of the Intel AI Builders Program. Having direct access to technical resources and support from Intel helps us use our hardware in the best way possible.

Running the entire MLaaS (Machine learning as a service) platform on our cloud infrastructure as efficiently as possible and delivering high-quality results is essential for us.

Making The Most of It

Knowing our servers enables us to optimize the models we use to be very efficient. It is not hard to get incredible amounts of computational power today, but we still believe it is worth using it to its fullest potential. We love saving energy and resources while providing our services at reasonable prices.

Machine learning computations are performed both by GPUs (graphics processing units) and CPUs (central processing units). During training, the CPU prepares the images and the GPU optimizes the artificial neural network. For prediction, both of them can be used, but generally we use a GPU only when the model is huge or we need the results very fast. In other cases, the CPU is sufficient. Recently, we focused on optimizing prediction on our Intel Xeon CPUs to make it run faster.

We use the TensorFlow library for our machine learning models. It was developed by Google and released as open source in 2015. Google also came up with their own hardware for machine learning – Tensor Processing Units, TPUs in short. Nevertheless, they are still interested in other hardware platforms as well, and they provide XNNPACK, an optimized library of floating-point neural network inference operators.

OpenVINO Toolkit

However, we achieved the best results by using the OpenVINO toolkit from Intel. It consists of two basic parts: a model optimizer, which is run after a model is trained, and an inference engine for predictions. We are able to speed up some of our tasks by a factor of 5. The predictions run especially efficiently on larger images and larger batches.

So far, we are using OpenVINO for our image recognition service and various recognition tasks in different fashion services. For example, fashion tagging or visual search runs many models on a single image and every improvement in efficiency is even more noticeable. But we are not forgetting the rest of our portfolio, and we are working hard to extend it to all the different services we provide.

Last but not least, we would like to thank Intel for their support, and we are looking forward to our cooperation in the next years!

Do you want to know more? Read a blog on the Intel AI website.
