Engineering - Ximilar: Visual AI for Business

The Best Tools for Machine Learning Model Serving

Michal Lukáč — Wed, 25 Oct 2023 09:26:42 +0000

As the prevalence of AI in various industries increases, so does the need to optimize the machine learning model serving. As a machine learning engineer, I’ve seen that training models is just one part of the ML journey. Equally important as the other challenges is the careful selection of deployment strategies and serving systems.

In this article, we’ll delve into the importance of selecting the right tools for machine learning model serving, and talk about their pros and cons. We’ll explore various deployment options, serving systems like TensorFlow Serving, TorchServe, Triton, Ray Serve, and MLflow, and also the deployment of specific models such as large language models (LLMs). I’ll also provide some thoughts and recommendations for navigating this ever-evolving landscape.

Machine Learning Models Serving Then and Now

When I first began my journey in the world of machine learning, the landscape was constantly shifting. The frameworks being actively developed and used at the time included Caffee, Theano, TensorFlow (Google) and PyTorch (Meta), all vying for their place in the world of AI. As time has passed, the competition has become more and more lopsided, with TensorFlow and PyTorch leading the way. While TensorFlow has remained the more popular choice for production-ready models, PyTorch has been steadily gaining in popularity, particularly within research circles, for its faster, more intuitive prototyping capabilities.

While there are hundreds of libraries available to train and optimize models, the most popular frameworks such as TensorFlow, PyTorch and Scikit-Learn are all based on Python programming language. Python is often chosen due to its simplicity and the vast amount of libraries for data manipulation. However, it is not the fastest language and can present problems with parallel processing, threads and GIL. Additionally, specialized libraries such as spaCy and PyG are available for specific tasks, such as Natural Language Processing (NLP) and Graph Analysis, respectively. The focus was and still partially is on the optimization of models and architectures. On the other hand, there are more and more problems in machine learning models serving in production because of the large-scale adoption of AI.

Nowadays, even more complex models like large language models (LLM, GPT/LAMMA/BARD) and multi-modal models are in fashion which creates a bigger pressure on optimal model deployment, infrastructure environment and storage capacity. Making machine learning model serving and deployment effective and cheap is a big problem. Even companies like Microsoft or NVIDIA are actively working on solutions that will cut the costs of it. So let’s look into some of the best options that we as developers currently have.

The Machine Learning and DevOps Challenges

Being a Machine Learning Engineer, I can say that training a model is just a small part of the whole lifecycle. Data preparation, deployment process and running the model smoothly for numerous customers is a daily challenge and a major part of the job.

Deployment Strategies

In addition to having to allocate GPU/CPU resources and manage inference speed, the company deploying ML models must also consider the deployment strategy for the trained model. You could be deploying the ML model as an API, running it in a container, or using a serverless platform. Each of these options comes with its own set of benefits and drawbacks, so carefully considering the best approach is essential. When we have a trained model, there are several options on how to use it:

Deploy it as an API endpoint, sending data in the request and getting results immediately in response. This approach is suitable for faster models that are able to process the data in just a few seconds.
Deploy it as an API endpoint, but return just a promise or asynchronous response from the model. This is great for computational-intensive models that can take minutes or hours of processing. For example, generative models and upscaling models are slow and require this approach.
Use a system that is able to serve it for you.
Use the model locally on your data.
Deploy models on Smartphones or IoT devices with feed from local sensors.

Other Challenges

The complexity of machine learning projects grows with variables such as:

The number of models – It is common practice to use multiple models. For example, at this moment, there are tens of thousands of different ML models on the Ximilar platform.
Model versions – You can train each of your models on different training data (part of the dataset) and mark it as a different version. Model versioning is great if you want to A/B test your ML model, tune your model performance, and for continuous model training.
Format of models – You can potentially train and save your ML models in various formats. For instance, .h5 which is a Keras/TensorFlow format or .pt (PyTorch) or .onnx for ONNX Runtime. Usually, each framework supports only specific formats.
The number of frameworks – Served ML models could be trained with different frameworks and their versions.
The number of the nodes (servers) – Models can be hosted on one or multiple servers and the serving system should be able to intelligently load balance the requests on servers so that none of them is throttled.
Models storage/registry – You need to store the ML models in some database or storage, such as AWS S3 or local storage
Speed/performance – The loading time of models from the storage can be critical and can cause a slow inference per sample.
Easy to use – Calling model via Rest API or gRPC requests, single-or-batch inference.
Hardware specification – ML models can be deployed on Edge devices or PCs with various architectures.
GPUs vs CPUs and libraries – Some models must be used only on CPUs and some require a GPU card.

Our Approach to the Machine Learning Model Serving

Several systems were developed to tackle these problems. Serving and deploying machine learning models has come a long way since we founded Ximilar in 2016. Back then, no system was capable of effectively serving hundreds of neural networks for inference.

So, we decided to build our own system for machine learning model serving, and today it forms the backbone of our machine-learning platform. As the use of AI becomes more widespread in companies, newer systems such as TensorFlow Serving emerge quickly to meet the increasing demand.

Which Framework Is The Best?

The Battle of Machine Learning Frameworks

Nowadays, each big tech company has their own solution for machine learning model serving and training. To name a few, PyTorch (TorchServe) and AITemplate by META (Facebook), TensorFlow (TFServing) by Google, ONNX runtime by Microsoft, Triton by NVIDIA, Multi-Model-Server by Amazon and many others like BentoML or Ray.

There are also tens of formats that you can save your ML model in, just TensorFlow alone is able to save into .h5, .pb, saved_model or .tflite formats, each of them serving a different purpose. For example, TensorFlow Lite is great for smartphones. It also loads very fast, so the availability of the model is great. However, it supports only limited operations and more modern architectures cannot be converted with it.

Machine learning model serving: each big tech company has their own solution for training and serving machine learning models.

You can also try to convert models from PyTorch or TensorFlow to TensorRT and OpenVino formats. The conversion usually works with basic and most-used architectures. The TensorRT is great if you are deploying ML models on Jetson Nano or Xavier. You can achieve a boost in performance on Intel servers via OpenVino conversion or the Neural Magic library.

The ONNX Format

One notable thing is the ONNX format. The ONNX is not a library for training your machine learning models, ONNX is an open format for storing machine learning models. After the model training, for example, in TensorFlow, you can convert it to ONNX format. You are able to run this converted model via ONNX runtime on almost any platform, programming language, CPU architecture and with preferred hardware acceleration. Sometimes serving a model requires a specific version of libraries, which is why you can solve a lot of problems via ONNX.

Exploration is Key

There are a lot of options for ML model training, saving, conversion and deployment. Every library has its pros and cons, some of them are easy to use for training and development. Others, on the other hand, are specialized for specific platforms or for specific fields (computer vision, recommender systems or NLP).

I would recommend you invest some time in exploring all the frameworks and systems, before deciding which framework you would like to lock in. The competition is rough in this field and every company tries to be as innovative as possible to keep up with the others. Even a Chinese company Baidu developed their own solution called PaddlePaddle. At the end of the article, I will give some recommendations on which frameworks and serving systems you should use and when.

The Best Machine Learning Serving Tools

OK, let’s say that you trained your own model or downloaded one that has already been trained. Now you would like to deploy a machine-learning model in production. Here are a few options that you can try.

If you don’t know how to train a machine learning model, you can start with this tutorial by PyTorch.

Deploy ML Models With API

If you have one or a few models, you can build your own system for ML model serving. With Python and libraries such as Flask or Django, there is a straightforward way to develop a simple REST API. When the web service starts, it loads the model in the background and then every incoming request will call the model on the incoming data.

It could get problematic if you want to effectively work with GPU cards, and handle parallel requests. I would recommend packing the system to Docker and then running it in Kubernetes.

With Kubernetes, Docker and smart load-balancing as HAProxy such a system can potentially scale to bigger volumes. Java or Go languages are also good languages to deploy ML models.

Here is a simple tutorial with a sci-kit-learn model as REST API with Flask.

Now let’s have a look at the open-source serving systems that you can use out of the box, usually with a small piece of code or no code at all.

TensorFlow Serving

GitHub | Docs

TensorFlow Serving is a modern serving system for TensorFlow ML models. It’s a part of TensorFlow Extended developed by Google. The recommended way of using the system is via Docker.

Simply run the Docker pull TensorFlow/serving (optionally TensorFlow/serving:latest-gpu if you need GPU support) command. Just run the image via Docker:

docker run -p 8501:8501 
  --mount type=bind,source=/path/to/my_model/,target=/models/my_model 
  -e MODEL_NAME=my_model -t tensorflow/serving

Now that the system is serving your model, you can query with gRPC or REST calls. For more information, read the documentation. TensorFlow Serving works best with the SavedModel format. The model should define its signature_def_map which will define the inputs and outputs of the model. If you would like to dive into the system then my recommendation is videos by the team itself.

In my opinion, TensorFlow serving is great with simple models and just a few versions. The documentation, however, could be simpler. With advanced architectures, you will need to define the custom operations, which is a big disadvantage if you have a lot of models with more modern operations.

TorchServe

GitHub | Docs

TorchServe is a more modern system than TensorFlow Serving. The documentation is clean and supports basically everything that TF Serving does, however, this one is for PyTorch models. Before serving a PyTorch model via TorchServe, you need to convert them to .mar packages. Basically, the .mar package tells the model name, version, architecture and actual weights of the model. Installation and running are also possible via Docker, and it is very similar to TensorFlow Serving.

I personally like the management of the models, you are able to simply register new models by sending API requests, list models and query statistics. I find the TorchServe very simple to use. Both REST API and gRPC are available. If you are working with pure PyTorch models then the TorchServe is recommended way.

Triton

GitHub | Docs

Both of the serving systems mentioned above are tightly bound to the frameworks of the models they are able to serve. That is probably why Triton has a big advantage over them since it can serve both TensorFlow and PyTorch models. It is also able to serve OpenVINO, ONNX and TensorRT formats! That means it supports all the major formats in the machine learning field. Even though NVIDIA developed it, it doesn’t require a GPU card and can run also on CPUs.

To run Triton, simply pull it from the docker repository via the Docker pull nvcr.io/nvidia/tritonserver command. The triton servers are able to load models from a specific directory called model_repository. Each model is defined with configuration, in this configuration, there is a platform setting that defines a model format. For example, “tensorflow_graphdef” or “onnxruntime_onnx“. In this way, Triton knows how to run specific models.

The documentation is not super-easy to read (mostly GitHub README files) because it is in very active development. Otherwise, working with the models is similar to other serving systems, meaning calling models via gRPC or REST.

Ray Serve

GitHub | Docs

Ray is a general-purpose system for scaling machine learning workloads. It primarily focuses on model serving and providing the primitives for you to build your own ML platform on top.

Ray Serve offers a more Pythonic way of creating your own serving system. It is framework-agnostic and anything that can be run via Python can be run also with Ray. Basically, it looks as simple as Flask. You define the simple Python class for your model and decorate it with a route prefix handler. Then you just call the REST API request.

import requests
from starlette.requests import Request
from typing import Dict

from ray import serve

# 1: Define a Ray Serve deployment.
@serve.deployment(route_prefix="/")
class MyModelDeployment:
    def __init__(self, msg: str):
        # Initialize model state: could be very large neural net weights.
        self._msg = msg

    def __call__(self, request: Request) -> Dict:
        return {"result": self._msg}

# 2: Deploy the model.
serve.run(MyModelDeployment.bind(msg="Hello world!"))

# 3: Query the deployment and print the result.
print(requests.get("http://localhost:8000/").json())

If you want to have more control over the system, Ray is a great option. There is a Ray Clusters library which is able to deploy the system on your own Kubernetes Cluster, AWS or GCP with the ability to configure the autoscaling option.

MLflow

MLflow is an open-source platform for the whole ML lifecycle. From training to evaluation, deployment, tracking, model monitoring and central model registry.

MLflow offers a robust API and several language bindings for the whole management of the machine learning model’s lifecycle. There is also a UI for tracking your trained models. MLflow is really a mature package with a whole bundle of components that your team can use.

Other Useful Tools for Machine Learning Model Serving

Multi-Model-Server is a similar system to the previous ones. Developed by the Amazon AWS team, the system is able to run models trained with MXNet or converted via ONNX.
BentoML is a project very similar to MLflow. There are many different tools that data scientists can use for training and deployment processes. The UI looks a bit more modern. BentoML is also able to automatically generate Docker images for your models.
KServe is a simple system for managing and scaling models on your Kubernetes. It solves the deployment, and autoscaling and provides standardized inference protocol across ML frameworks.

Cloud Options of AWS, GCP and Azure

Of course, every big tech player provides cloud platforms to host and serve your machine learning models. Let’s have a quick look at a few examples.

Microsoft is a big supporter of ONNX, so with Azure Machine Learning services, you are able to deploy your models to the cloud via Python or Azure CLI. The process requires an entry script in Python with two methods: init for initialization of your model and run for inference. You can find the entire workflow in Azure development documentation.

The Google Cloud Platform (GCP) has good support for TensorFlow as it is their native framework. However, Docker deployment is available, so other frameworks can be used too. There are multiple ways to achieve the deployment. The classic way will be using the AI Platform prediction tool or Google Cloud Run. There is also a serverless HTTP endpoint/function, which serves your model stored in the Google Cloud Storage bucket. You define your function in Python with the prediction method and loading of the model.

Amazon Web Services (AWS) also contains multiple options for the ML deployment process and serving. The specialized system for machine learning is Amazon Sagemaker.

All the big platforms allow you to create your own virtual server instances. Create your Kubernetes clusters and use any of the systems/frameworks mentioned earlier. Nevertheless, you need to be very careful because it could get really pricey. There are also smaller players on the market such as Banana, Seldon and Comet ML for training, serving & deployment. I personally don’t have experience with them but they are becoming more popular.

Large Language (LLMs) and Multi-Modal Models in Production

With the introduction of GPT by OpenAI a new class of AI models was introduced – the large language models (LLMs). These models are extremely big, trained on massive datasets and deployed on an infrastructure that requires a whole datacenter to run. “Smaller” – usually open source version – models are released but they also require a lot of computational resources and modern servers to run smoothly.

Recently, several serving systems for these models were developed:

OpenLLM by BentoML is a nice system that supports almost all open-source models like Llama2. You can just pick one of the models and run the following commands to start with the serving and query the results:

openllm start opt
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'Explain to me the difference between "further" and "farther"'

vLLM project is a Python library that can help you with the deployment of LLM as an API Server. What is great is that it supports OpenAI-Compatible Server, so you can switch from OpenAI paid service easily to open source variant without modifying the code on the client. This project is being developed at UC Berkeley and it is integrating new techniques for fast inferencing of LLMs.
SkyPilot – is a great option if you want to run the LLMs on cloud providers such as AWS, Google Cloud or Azure. Because running these models is costly, SkyPilot is able to pick the cheapest provider automatically and launch it as an endpoint.

Ximilar AI Platform

Free Login | Docs

Last but not least, you can use our codeless machine-learning platform. Instead of writing a lot of code, training and deploying an ML model by yourself, you can try it in the Ximilar App. Training image classification and object detection can be done both in the browser App or via API. There is every tool that you would need in the ML model development stage, such as training data/image management, labelling tools, evaluation of your models on testing and training datasets, performance metrics, explanation of models on specific images, and so on.

Ximilar’s computer vision platform enables you to develop AI-powered systems for image recognition, visual quality control, and more without knowledge of coding or machine learning. You can combine them as you wish and upgrade any of them anytime.

Once your model is trained, it is deployed as a REST API endpoint. It can be connected to a workflow of more machine learning models working together with conditions like if-else statements. The major benefit is you just connect your system to the API and query the results. All the training and serving problems are solved by us. In the end, you will save a lot of costs because you don’t need to own or rent your infrastructure, serving systems or specialized software engineering team on machine learning.

We built a Ximilar Platform so that businesses from e-commerce, healthcare, manufacturing, real estate and other areas could simply develop their own AI models without coding and with a reasonable budget. For example, on the following screen, you can see our task management for the trading cards collector community.

We and our customers use our platform for the training of machine learning models. Together with our own system for machine learning model serving is it an all-in-one solution for ML model deployment.

The great thing is that everything is manageable via REST API requests with JSON responses. Here is a simple curl command to query all models in production:

curl --request GET 
  --url https://api.ximilar.com/recognition/v2/task/ 
  --header 'Content-Type: application/json' 
  --header 'authorization: Token APITOKEN'

Deployment of ML Models is Science

There are a lot of systems that try to make deployment and serving easy. The topic of deployment & serving is broad, with many choices for hardware infrastructure, DevOps, programming languages, system development, costs, storage, and scaling. So it is not easy to pick one. If you would like to dig deeper, I would suggest the following content for further reading:

For the performance test of serving systems, I recommend a post from Biano that includes testing scripts.
A nice overview of all the deployment systems is also in a video lecture on the Full Stack Deep Learning course.

My Final Tips & Recommendations

Pick a good framework to start with

Doing machine learning for more than 10 years, my advice is to start by picking a good framework for model development. In my opinion, the best choice right now is PyTorch. Using it is easy and it supports a lot of state-of-the-art architectures.

I used to be a fan of TensorFlow for a long time, but over time, its developers were not able to integrate modern approaches. Also, the backward compatibility is often disrupted and the quality of code is getting worse which leads to more and more bugs in the framework.

Save your models in different formats

Second, save your models in different formats. I would also recommend using ONNX and OpenVino here. You never know when you will need it. This happened to me a few times. We needed to upgrade the server and systems (our production environment), but the new versions of libraries stopped supporting the specific format of the model, so we had to switch to a different one.

Pick a serving system suitable to your needs

If you are a small company, then Ray Serve is a good option. Bigger companies, on the other hand, have complex requirements for development and robust infrastructure. In this case, I would recommend picking more complex systems like MLFlow. If you would like to serve the models on the cloud, then look at a multi-model server. The choice is really based on the use case. If you don’t want to bother with all of this then try our Ximilar Platform, which is a solution model optimization, model validation, data storage and model deployment as API.

I will keep this article updated and if there is some new perspective serving system I will be more than happy to mention it here. After all, machine learning is about constant progress, and that is one of the things I like about it the most.

The post The Best Tools for Machine Learning Model Serving appeared first on Ximilar: Visual AI for Business.

How to Build a Good Visual Search Engine?

Michal Lukáč — Mon, 09 Jan 2023 14:08:28 +0000

Visual search is one of the most-demanded computer vision solutions. Our team in Ximilar have been actively developing the best general multimedia visual search engine for retailers, startups, as well as bigger companies, who need to process a lot of images, video content, or 3D models.

However, a universal visual search solution is not the only thing that customers around the world will require in the future. Especially smaller companies and startups now more often look for custom or customizable visual search solutions for their sites & apps, built in a short time and for a reasonable price. What does creating a visual search engine actually look like? And can a visual search engine be built by anyone?

This article should provide a bit deeper insight into the technology behind visual search engines. I will describe the basic components of a visual search engine, analyze approaches to machine learning models and their training datasets, and share some ideas, training tips, and techniques that we use when creating visual search solutions. Those who do not wish to build a visual search from scratch can skip right to Building a Visual Search Engine on a Machine Learning Platform.

What Exactly Does a Visual Search Engine Mean?

The technology of visual search in general analyses the overall visual appearance of the image or a selected object in an image (typically a product), observing numerous features such as colours and their transitions, edges, patterns, or details. It is powered by AI trained specifically to understand the concept of similarity the way you perceive it.

In a narrow sense, the visual search usually refers to a process, in which a user uploads a photo, which is used as an image search query by a visual search engine. This engine in turn provides the user with either identical or similar items. You can find this technology under terms such as reverse image search, search by image, or simply photo & image search.

However, reverse image search is not the only use of visual search. The technology has numerous applications. It can search for near-duplicates, match duplicates, or recommend more or less similar images. All of these visual search tools can be used together in an all-in-one visual search engine, which helps internet users find, compare, match, and discover visual content.

And if you combine these visual search tools with other computer vision solutions, such as object detection, image recognition, or tagging services, you get a quite complex automated image-processing system. It will be able to identify images and objects in them and apply both keywords & image search queries to provide as relevant search results as possible.

Different computer vision systems can be combined on Ximilar platform via Flows. If you would like to know more, here’s an article about how Flows work.

Typical Visual Search Engines:
Google Lens & Pinterest Lens

Big visual search industry players such as Shutterstock, eBay, Pinterest (Pinterest Lens) or Google Images (Google Lens & Google Images) already implemented visual search engines, as well as other advanced, yet hidden algorithms to satisfy the increasing needs of online shoppers and searchers. It is predicted, that a majority of big companies will implement some form of soft AI in their everyday processes in the next few years.

The Algorithm for Training
Visual Similarity

The Components of a Visual Search Tool

Multimedia search engines are very powerful systems consisting of multiple parts. The first key component is storage (database). It wouldn’t be exactly economical to store the full sample (e.g., .jpg image or .mp4 video) in a database. That is why we do not store any visual data for visual search. Instead, we store just a representation of the image, called a visual hash.

The visual hash (also visual descriptor or embedding) is basically a vector, representing the data extracted from your image by the visual search. Each visual hash should be a unique combination of numbers to represent a single sample (image). These vectors also have some mathematical properties, meaning you can compare them, e.g., with cosine, hamming, or Euclidean distance.

So the basic principle of visual search is: the more similar the images are, the more similar will their vector representations be. Visual search engines such as Google Lens are able to compare incredible volumes of images (i.e., their visual hashes) to find the best match in a hundred milliseconds via smart indexing.

How to Create a Visual Hash?

The visual hashes can be extracted from images by standard algorithms such as PHASH. However, the era of big data gives us a much stronger model for vector representation – a neural network. A simple overview of the image search system built with a neural network can look like this:

Extracting visual vectors with the neural network and searching with them in a similarity collection.

This neural network was trained on images from a website selling cosmetics. Here, it extracted the embeddings (vectors), and they were stored in a database. Then, when a customer uploads an image to the visual search engine on the website, the neural network will extract the embedding vector from this image as well, and use it to find the most similar samples.

Of course, you could also store other metadata in the database, and do advanced filtering or add keyword search to the visual search.

Types of Neural Networks

There are several basic architectures of neural networks that are widely used for vector representations. You can encode almost anything with a neural network. The most common for images is a convolutional neural network (CNN).

There are also special architectures to encode words and text. Lately, so-called transformer neural networks are starting to be more popular for computer vision as well as for natural language processing (NLP). Transformers use a lot of new techniques developed in the last few years, such as an attention mechanism. The attention mechanism, as the name suggests, is able to focus only on the “interesting” parts of the image & ignore the unnecessary details.

Training the Similarity Model

There are multiple methods to train models (neural networks) for image search. First, we should know that training of machine learning models is based on your data and loss function (also called objective or optimization function).

Optimization Functions

The loss function usually computes the error between the output of the model and the ground truth (labels) of the data. This feature is used for adjusting the weights of a model. The model can be interpreted as a function and its weights as parameters of this function. Therefore, if the value of the loss function is big, you should adjust the weights of the model.

How it Works

The model is trained iteratively, taking subsamples of the dataset (batches of images) and going over the entire dataset multiple times. We call one such pass of the dataset an epoch. During one batch analysis, the model needs to compute the loss function value and adjust weights according to it. The algorithm for adjusting the weights of the model is called backpropagation. Training is usually finished when the loss function is not improving (minimizing) anymore.

We can divide the methods (based on loss function) depending on the data we have. Imagine that we have a dataset of images, and we know the class (category) of each image. Our optimization function (loss function) can use these classes to compute the error and modify the model.

The advantage of this approach is its simple implementation. It’s practically only a few lines in any modern framework like TensorFlow or PyTorch. However, it has also a big disadvantage: the class-level optimization functions don’t scale well with the number of classes. We could potentially have thousands of classes (e.g., there are thousands of fashion products and each product represents a class). The computation of such a function with thousands of classes/arguments can be slow. There could also be a problem with fitting everything on the GPU card.

Loss Function: A Few Tips

If you work with a lot of labels, I would recommend using a pair-based loss function instead of a class-based one. The pair-based function usually takes two or more samples from the same class (i.e., the same group or category). A model based on a pair-based loss function doesn’t need to output prediction for so many unique classes. Instead, it can process just a subsample of classes (groups) in each step. It doesn’t know exactly whether the image belongs to class 1 or 9999. But it knows that the two images are from the same class.

Images can be labelled manually or by a custom image recognition model. Read more about image recognition systems.

The Distance Between Vectors

The picture below shows the data in the so-called vector space before and after model optimization (training). In the vector space, each image (sample) is represented by its embedding (vector). Our vectors have two dimensions, x and y, so we can visualize them. The objective of model optimization is to learn the vector representation of images. The loss function is forcing the model to predict similar vectors for samples within the same class (group).

By similar vectors, I mean that the Euclidean distance between the two vectors is small. The larger the distance, the more different these images are. After the optimization, the model assigns a new vector to each sample. Ideally, the model should maximize the distance between images with different classes and minimize the distance between images of the same class.

Optimization for visual search should maximize the distance of items between different categories and minimize the distance within the category.

Sometimes we don’t know anything about our data in advance, meaning we do not have any metadata. In such cases, we need to use unsupervised or self-supervised learning, about which I will talk later in this article. Big tech companies do a lot of work with unsupervised learning. Special models are being developed for searching in databases. In research papers, this field is often called deep metric learning.

Supervised & Unsupervised Machine Learning Methods

1) Supervised Learning

As I mentioned, if we know the classes of images, the easiest way to train a neural network for vectors is to optimize it for the classification problem. This is a classic image recognition problem. The loss function is usually cross-entropy loss. In this way, the model is learning to predict predefined classes from input images. For example, to say whether the image contains a dog, a cat or a bird. We can get the vectors by removing the last classification layer of the model and getting the vectors from some intermediate layer of the network.

When it comes to the pair-based loss function, one of the oldest techniques for metric learning is the Siamese network (contrastive learning). The name contains “Siamese” because there are two identical models of the same weights. In the Siamese network, we need to have pairs of images, which we label based on whether they are or aren’t equal (i.e., from the same class or not). Pairs in the batch that are equal are labelled with 1 and unequal pairs with 0.

In the following image, we can see different batch construction methods that depend on our model: Siamese (contrastive) network, Triplet, or N-pair, which I will explain below.

Each deep learning architecture requires different batch construction methods. For example, Siamese and N-pair require tuples. However, in N-pair, the tuples must be unique.

Triplet Neural Network and Online/Offline Mining

In the Triplet method, we construct triplets of items, two of which (anchor and positive) belong to the same category and the third one (negative) to a different category. This can be harder than you might think because picking the “right” samples in the batch is critical. If you pick items that are too easy or too difficult, the network will converge (adjust weights) very slowly or not at all. The triplet loss function contains an important constant called margin. Margin defines what should be the minimum distance between positive and negative samples.

Picking the right samples in deep metric learning is called mining. We can find optimal triplets via either offline or online mining. The difference is, that during offline mining, you are finding the triplets at the beginning of each epoch.

Online & Offline Mining

The disadvantage of offline mining is that computing embeddings for each sample is not very computationally efficient. During the epoch, the model can change rapidly, so embeddings are becoming obsolete. That’s why online mining of triplets is more popular. In online mining, each batch of triplets is created before fitting the model. For more information about mining and batch strategies for triplet training, I would recommend this post.

We can visualize the Triplet model training in the following way. The model is copied three times, but it has the same shared weights. Each model takes one image from the triplet (anchor, positive, negative) and outputs the embedding vector. Then, the triplet loss is computed and weights are adjusted with backpropagation. After the training is done, the model weights are frozen and the output of the embeddings is used in the similarity engine. Because the three models have shared weights (the same), we take only one model that is used for predicting embedding vectors on images.

Triplet network that takes a batch of anchor, positive and negative images.

N-pair Models

The more modern approach is the N-pair model. The advantage of this model is that you don’t mine negative samples, as it is with a triplet network. The batch consists of just positive samples. The negative samples are mitigated through the matrix construction, where all non-diagonal items are negative samples.

You still need to do online mining. For example, you can select a batch with a maximum value of the loss function, or pick pairs that are distant in metric space.

The N-pair model requires a unique pair of items. In the triplet and Siamese model, your batch can contain multiple triplets/pairs from the same class (group).

In our experience, the N-pair model is much easier to fit, and the results are also better than with the triplet or Siamese model. You still need to do a lot of experiments and know how to tune other hyperparameters such as learning rate, batch size, or model architecture. However, you don’t need to work with the margin value in the loss function, as it is in triplet or Siamese. The small drawback is that during batch creation, we need to have always only two items per class/product.

Proxy-Based Methods

In the proxy-based methods (Proxy-Anchor, Proxy-NCA, Soft Triple) the model is trying to learn class representatives (proxies) from samples. Imagine that instead of having 10,000 classes of fashion products, we will have just 20 class representatives. The first representative will be used for shoes, the second for dresses, the third for shirts, the fourth for pants and so on.

A big advantage is that we don’t need to work with so many classes and the problems coming with it. The idea is to learn class representatives and instead of slow mining “the right samples” we can use the learned representatives in computing the loss function. This leads to much faster training & convergence of the model. This approach, as always, has some cons and questions like how many representatives should we use, and so on.

MultiSimilarity Loss

Finally, it is worth mentioning MultiSimilarity Loss, introduced in this paper. MultiSimilarity Loss is suitable in cases when you have more than two items per class (images per product). The authors of the paper are using 5 samples per class in a batch. MultiSimilarity can bring closer items within the same class and push the negative samples far away by effectively weighting informative pairs. It works with three types of similarities:

Self-Similarity (the distance between the negative sample and anchor)
Positive-Similarity (the relationship between positive pairs)
Negative-Similarity (the relationship between negative pairs)

Finally, it is also worth noting, that in fact, you don’t need to use only one loss function, but you can combine multiple loss functions. For example, you can use the Triplet Loss function with CrossEntropy and MultiSimilarity or N-pair together with Angular Loss. This should often lead to better results than the standalone loss function.

2) Unsupervised Learning

AutoEncoder

Unsupervised learning is helpful when we have a completely unlabelled dataset, meaning we don’t know the classes of our images. These methods are very interesting because the annotation of data can be very expensive and time-consuming. The most simplistic unsupervised learning can simply use some form of AutoEncoder.

AutoEncoder is a neural network consisting of two parts: an encoder, which encodes the image to the smaller representation (embedding vector), and a decoder, which is trying to reconstruct the original image from the embedding vector.

After the whole model is trained, and the decoder is able to reconstruct the images from smaller vectors, the decoder part is discarded and only the encoder part is used in similarity search engines.

Simple AutoEncoder neural network for learning embeddings via reconstruction of the image.

There are many other solutions for unsupervised learning. For example, we can train AutoEncoder architecture to colourize images. In this technique, the input image has no colour and the decoding part of the network tries to output a colourful image.

Image Inpainting

Another technique is Image Inpainting, where we remove part of the image and the model will learn to inpaint them back. Interesting way to propose a model that is solving jigsaw puzzles or correct ordering of frames of a video.

Then there are more advanced unsupervised models like SimCLR, MoCo, PIRL, SimSiam or GAN architectures. All these models try to internally represent images so their outputs (vectors) can be used in visual search systems. The explanation of these models is beyond this article.

Tips for Training Deep Metric Models

Here are some useful tips for training deep metric learning models:

Batch size plays an important role in deep metric learning. Some methods such as N-pair should have bigger batch sizes. Bigger batch sizes generally lead to better results, however, they also require more memory on the GPU card.
If your dataset has a bigger variation and a lot of classes, use a bigger batch size for Multi-similarity loss.
The most important part of metric learning is your data. It’s a pity that most research, as well as articles, focus only on models and methods. If you have a large collection with a lot of products, it is important to have a lot of samples per product. If you have fewer classes, try to use some unsupervised method or cross-entropy loss and do heavy augmentations. In the next section, we will look at data in more depth.
Try to start with a pre-trained model and tune the learning rate.
When using Siamese or Triplet training, try to play with the margin term, all the modern frameworks will allow you to change it (make it harder) during the training.
Don’t forget to normalize the output of the embedding if the loss function requires it. Because we are comparing vectors, they should be normalized in a way that the norm of the vectors is always 1. This way, we are able to compute Euclidean or cosine distances.
Use advanced methods such as MultiSimilarity with big batch size. If you use Siamese, Triplet, or N-pair, mining of negatives or positives is essential. Start with easier samples at the beginning and increase the challenging samples every epoch.

Neural Text Search on Images with CLIP

Up to right now, we were talking purely about images and searching images with image queries. However, a common use case is to search the collection of images with text input, like we are doing with Google or Bing search. This is also called Text-to-Image problem, because we need to transform text representation to the same representation as images (same vector space). Luckily, researchers from OpenAI develop a simple yet powerful architecture called CLIP (Contrastive Language Image Pre-training). The concept is simple, instead of training on pair of images (SIAMESE, NPAIR) we are training two models (one for image and one for text) on pairs of images and texts.

The architecture of CLIP model by OpenAI. Image Source Github

You can train a CLIP model on a dataset and then use it on your images (or videos) collection. You are able to find similar images/products or try to search your database with a text query. If you would like to use a CLIP-like model on your data, we can help you with the development and integration of the search system. Just contact us at care@ximilar.com, and we can create a search system for your data.

Try search demo

The Training Data
for Visual Search Engines

99 % of the deep learning models have a very expensive requirement: data. Data should not contain any errors such as wrong labels, and we should have a lot of them. However, obtaining enough samples can be a problematic and time-consuming process. That is why techniques such as transfer learning or image augmentation are widely used to enrich the datasets.

How Does Image Augmentation Help With Training Datasets?

Image augmentation is a technique allowing you to multiply training images and therefore expand your dataset. When preparing your dataset, proper image augmentation is crucial. Each specific category of data requires unique augmentation settings for the visual search engine to work properly. Let’s say you want to build a fashion visual search engine based strictly on patterns and not the colours of items. Then you should probably employ heavy colour distortion and channel-swapping augmentation (randomly swapping red, green, or blue channels of an image).

On the other hand, when building an image search engine for a shop with coins, you can rotate the images and flip them to left-right and upside-down. But what to do if the classic augmentations are not enough? We have a few more options.

Removing or Replacing Background

Most of the models that are used for image search require pairs of different images of the same object. Typically, when training product image search, we use an official product photo from a retail site and another picture from a smartphone, such as a real-life photo or a screenshot. This way, we get a pair-based model that understands the similarity of a product in pictures with different backgrounds, lights, or colours.

The difference between a product photo and a real-life image made with a smartphone, both of which are important to use when training computer vision models.

All such photos of the same product belong to an entity which we call a Similarity Group. This way, we can build an interactive tool for your website or app, which enables users to upload a real-life picture (sample) and find the product they are interested in.

Background Removal Solution

Sometimes, obtaining multiple images of the same group can be impossible. We found a way to tackle this issue by developing a background removal model that can distinguish the dominant foreground object from its background and detect its pixel-accurate position.

Once we know the exact location of the object, we can generate new photos of products with different backgrounds, making the training of the model more effective with just a few images.

The background removal can also be used to narrow the area of augmentation only to the dominant item, ignoring the background of the image. There are a lot of ways to get the original product in different styles, including changing saturation, exposure, highlights and shadows, or changing the colours entirely.

Generating more variants can make your model very robust.

Building such an augmentation pipeline with background/foreground augmentation can take hundreds of hours and a lot of GPU resources. That is why we deployed our Background Removal solution as a ready-to-use image tool.

You can use the Background Removal as a stand-alone service for your image collections, or as a tool for training data augmentation. It is available in public demo, App, and via API.

GAN-Based Methods for Generating New Training Data

One of the modern approaches is to use a Generative Adversarial Network (GAN). GANs are incredibly powerful in generating whole new images from some specific domain. You can simply create a model for generating new kinds of insects or making birds with different textures.

Creating new insect images automatically to train an image recognition system? How cool is that? There are endless possibilities with GAN models for basically any image type. [Source]

The greatest advantage of GAN is you will easily get a lot of new variants, which will make your model very robust. GANs are starting to be widely used in more tasks such as simulations, and I think the gathering of data will cost much less in the near future because of them. In Ximilar, we used GAN to create a GAN Image Upscaler, which adds new relevant pixels to images to increase their resolution and quality.

When creating a visual search system on our platform, our team picks the most suitable neural network architecture, loss functions, and image augmentation settings through the analysis of your visual data and goals. All of these are critical for the optimization of a model and the final accuracy of the system. Some architectures are more suitable for specific problems like OCR systems, fashion recommenders or quality control. The same goes with image augmentation, choosing the wrong settings can destroy the optimization. We have experience with selecting the best tools to solve specific problems.

Annotation System for Building Image Search Datasets

As we can see, a good dataset definitely is one of the key elements for training deep learning models. Obtaining such a collection can be quite expensive and time-consuming. With some of our customers, we build a system that continually gathers the images needed in the training datasets (for instance, through a smartphone app). This feature continually & automatically improves the precision of the deployed search engines.

How does it work? When the new images are uploaded to Ximilar Platform (through Custom Similarity service) either via App or API, our annotators can check them and use them to enhance the training dataset in Annotate, our interface dedicated to image annotation & management of datasets for computer vision systems.

Annotate effectively works with the similarity groups by grouping all images of the same item. The annotator can add the image to a group with the relevant Stock Keeping Unit (SKU), label it as either a product picture or a real-life photo, add some tags, or mark objects in the picture. They can also mark images that should be used for the evaluation and not used in the training process. In this way, you can have two separate datasets, one for training and one for evaluation.

We are quite proud of all the capabilities of Annotate, such as quality control, team cooperation, or API connection. There are not many web-based data annotation apps where you can effectively build datasets for visual search, object detection, as well as image recognition, and which are connected to a whole visual AI platform based on computer vision.

Image annotation tool for building visual search and image similarity models.

How to Improve Visual Search Engine Results?

We already assessed that the optimization algorithm and the training dataset are key elements in training your similarity model. And that having multiple images per product then significantly increases the quality of the trained similarity model. The model (CNN or other modern architecture) for similarity is used for embedding (vector) extraction, which determines the quality of image search.

Over the years that we’ve been training visual search engines for various customers around the world, we were also able to identify several potential weak spots. Their fixing really helped with the performance of searches as well as the relevance of the search results. Let’s take a look at what can improve your visual search engine:

Include Tags

Adding relevant keywords for every image can improve the search results dramatically. We recommend using some basic words that are not synonymous with each other. The wrong keywords for one item are for instance “sky, skyline, cloud, cloudy, building, skyscraper, tall building, a city”, while the good alternative keywords would be “sky, cloud, skyscraper, city”.

Our engine can internally use these tags and improve the search results. You can let an image recognition system label the images instead of adding the keywords manually.

Include Filtering Categories

You can store the main categories of images in their metadata. For instance, in real estate, you can distinguish photos that were taken inside or outside. Based on this, the searchers can filter the search results and improve the quality of the searches. This can also be easily done by an image recognition task.

Include Dominant Colours

Colour analysis is very important, especially when working for a fashion or home decor shop. We built a tool conveniently called Dominant Colors, with several extraction options. The system can extract the main colours of a product while ignoring its background. Searchers can use the colours for advanced filtering.

Use Object Detection & Segmentation

Object detection can help you focus the view of both the search engine and its user on the product, by merely cutting the detected object from the image. You can also apply background removal to search & showcase the products the way you want. For training object detection and other custom image recognition models, you can use our App & Annotate.

Use Optical Character Recognition (OCR)

In some domains, you can have products with text. For instance, wine bottles or skincare products with the name of the item and other text labels that can be read by artificial intelligence, stored as metadata and used for keyword search on your site.

Our visual search engine allows us to combine several features for multimedia search with advanced filtering.

Improve Image Resolution

If the uploaded images from the mobile phones have low resolution, you can use the image upscaler to increase the resolution of the image, screenshot, or video. This way, you will get as much as possible even from user-generated content with potentially lower quality.

Combine Multiple Approaches

Fusion – Combining multiple features like model embeddings, tags, dominant colours, and text increases your chances to build a solid visual search engine. Our system is able to use these different modalities and return the best items accordingly. For example, extracting dominant colours is really helpful in Fashion Search, our service combining object detection, fashion tagging & visual search.

Search Engine and Vector Databases

Once you trained your model (neural network), you can extract and store the embeddings for your multimedia items somewhere. There are a lot of image search engine implementations that are able to work with vectors (embedding representation) that you can use. For example, Annoy from Spotify or FAISS from Facebook developers.

These solutions are open-source (i.e. you don’t have to deal with usage rights) and you can use them for simple solutions. However, they also have a few disadvantages:

After the initial build of the search engine database, you cannot perform any update, insert or delete operations. Once you store the data, you can only perform search queries.
You are unable to use a combination of multiple features, such as tags, colours, or metadata.
There’s no support for advanced filtering for more precise results.
You need to have an IT background and coding skills to implement and use them. And in the end, the system must be deployed on some server, which brings additional challenges.
It is difficult to extend them for advanced use cases, you will need to learn a complex codebase of the project and adjust it accordingly.

Building a Visual Search Engine on a Machine Learning Platform

The creation of a great visual search engine is not an easy task. The mentioned challenges and disadvantages of building complex visual search engines with high performance are the reasons why a lot of companies hesitate to dedicate their time and funds to building them from scratch. That is where AI platforms like Ximilar come into play.

Custom Similarity Service

Ximilar provides a computer vision platform, where a fast similarity engine is available as a service. Anyone can connect via API and fill their custom collection with data and query at the same time. This streamlines the tedious workflow a lot, enabling people to have custom visual search engines fast and, more importantly, without coding. Our image search engines can handle other data types like videos, music, or 3D models. If you want more privacy for your data, the system can also be deployed on your hardware infrastructure.

In all industries, it is important to know what we need from our model and optimize it towards the defined goal. We developed our visual search services with this in mind. You can simply define your data and problem and what should be the primary goal for this similarity. This is done via similarity groups, where you put the items that should be matched together.

Examples of Visual Search Solutions for Business

One of the typical industries that use visual search extensively is fashion. Here, you can look at similarities in multiple ways. For instance, one can simply want to find footwear with a colour, pattern, texture, or shape similar to the product in a screenshot. We built several visual search engines for fashion e-shops and especially price comparators, which combined search by photo and recommendations of alternative similar products.

Based on a long experience with visual search solutions, we deployed several ready-to-use services for visual search: Visual Product Search, a complex visual search service for e-commerce including technologies such as search by photo, similar product recommendations, or image matching, and Fashion Search created specifically for the fashion segment.

Another nice use case is also the story of how we built a Pokémon Trading Card search engine. It is no surprise that computer vision has been recently widely applied in the world of collectibles. Trading card games, sports cards or stamps and visual AI are a perfect match. Based on our customers’ demand, we also created several AI solutions specifically for collectibles.

The Workflow of Building
a Visual Search Engine

If you are looking to build a custom search engine for your users, we can develop a solution for you, using our service Custom Image Similarity. This is the typical workflow of our team when working on a customized search service:

Setup, Research & Plan – Initial calls, the definition of the project, NDA, and agreement on expected delivery time.
Data – If you don’t provide any data, we will gather it for you. Gathering and curating datasets is the most important part of developing machine learning models. Having a well-balanced dataset without any bias to any class leads to great performance in production.
First prototype – Our machine learning team will start working on the model and collection. You will be able to see the first results within a month. You can test it and evaluate it by yourself via our clickable front end.
Development – Once you are satisfied with the results, we will gather more data and do more experiments with the models. This is an iterative way of improving the model.
Evaluation & Deployment – If the system performs well and meets the criteria set up in the first calls (mostly some evaluation on the test dataset and speed performance), we work on the deployment. We will show you how to connect and work with the API for visual similarity (insert, delete, search endpoints).

If you are interested in knowing more about how the cooperation with Ximilar works in general, read our How it works and contact us anytime.

We are also able to do a lot of additional steps, such as:

Managing and gathering more training data continually after the deployment to gradually increase the performance of visual similarity (the usage rights for user-generated content are up to you; keep in mind that we don’t store any physical images).
Building a customized model or multiple models that can be integrated into the search engine.
Creating & maintaining your visual search collection, with automatic synchronization to always keep up to date with your current stock.
Scaling the service to hundreds of requests per second.

Visual Search is Not Only
For the Big Companies

I presented the basic techniques and architectures for training visual similarity models, but of course, there are much more advanced models and the research of this field continues with mile steps.

Search engines are practically everywhere. It all started with AltaVista in 1995 and Google in 1998. Now it’s more common to get information directly from Siri or Alexa. Searching for things with visual information is just another step, and we are glad that we can give our clients tools to maximise their potential. Ximilar has a lot of technical experience with advanced search technology for multimedia data, and we work hard to make it accessible to everyone, including small and medium companies.

If you are considering implementing visual search into your system:

Schedule a call with us and we will discuss your goals. We will set up a process for getting the training data that are necessary to train your machine learning model for search engines.
In the following weeks, our machine learning team will train a custom model and a testable search collection for you.
After meeting all the requirements from the POC, we will deploy the system to production, and you can connect to it via Rest API.

How do custom projects work?

The post How to Build a Good Visual Search Engine? appeared first on Ximilar: Visual AI for Business.

How to Convert a Video Into a Streaming Format?

Michal Lukáč — Tue, 23 Aug 2022 12:03:00 +0000

In the last few months, we have been actively developing a lot of new AI solutions for videos. Automated video processing is a growing field of AI, with many interesting applications. It however brought quite a few new challenges (huge amount of data, processing time, precision, and so on) that didn’t need to be taken into consideration when building classic image-processing systems, including converting the standard video into streaming format. This article might prove useful to those who encountered similar challenges.

The Automated Video Processing

According to this research by Deloitte, it is typical for younger generations to build a dynamic portfolio of media and entertainment options. Consumers across generations have been spending more time watching online TV (Nielsen) and browsing the internet using social media and video-on-demand services on a daily basis.

There is no doubt that automated video processing is going to become as normal as image processing by AI, revolutionizing not only platforms such as YouTube, TikTok, Instagram, or Twitch – but the way we work with and perceive video content.

One of the projects that we are co-developing called Get Moments required a lot of working with FFmpeg, Python, OpenCV and Machine Learning (mostly TensorFlow). One of the challenges we encountered was converting a standard movie format into a streaming format, so there are quite a few tips I can now share with you.

What is the Streaming Format, and What is it Good For?

The need for a streaming format came with the rise and popularity of YouTube. Different users around the world have different internet connection speeds, and they can watch different parts of a video with different quality. That is possible because the video is delivered to them in a streaming format, without the need to load it fully.

Converting a video into a streaming format means you create multiple copies of this video with different qualities, all of which are chunked into short segments.

Instead of downloading and playing a video file in classic MP4 container format with H.264 (video codec), only the parts of the video that are currently watched, are loaded and streamed in the quality corresponding with the user’s internet connection quality. That is possible because when converting a standard video file into a streaming format, you create multiple copies of this video with different qualities, all of which are chunked into short segments.

HLS or DASH Streaming Format – Comparison

The full power of streaming video formats comes with CDNs (content delivery networks) that are able to deliver content over the internet very fast. There are several video streaming formats, but the currently most used are HLS and DASH. Both protocols run over HTTP, use TCP, are supported via HTML5 video player, and both chunk videos into segments with intervals of 2–10 seconds.

HLS (HTTP Live Streaming) is a live-streaming protocol with adaptive bitrate. Because it was developed by Apple, there is support for all Apple devices. HLS is using H.264 for video compression, with AAC and MP3 for an audio stream.

DASH (MPEG-DASH) is more open and standardized. It is widely used, for example on YouTube in HTM5 player. Unlike HLS, the DASH video can be encoded with different codes (codec-agnostic) for both video and audio streams.

I personally prefer the HTTP Live Streaming format for several reasons. The m3u8 index/header file looks much nicer, there is better support on Apple devices, and the conversion to HLS is much easier than to DASH. Nevertheless, not every video player is supporting the HLS or DASH format, so be careful what you have on your website or mobile app.

How much space do the HLS and DASH formats take up?

Let’s convert a sample video file with:

Length 60 seconds
Resolution 1080p
Size 25 MB
Encoded with H.264 codec
No audio track

I converted this video to both HLS and DASH formats in 360p, 720p and 1080p resolutions. You can select your own resolution via encoding with FFmpeg.

When I converted the video to DASH with only two resolutions (360p and 1080p), the size was 32 MB. And when I added the third resolution (720p), I got to a similar size as with HLS. In both cases, the total size of the three files with different qualities together was around 55 MB, so a bit over double the size of the original file. Of course, the size can also change depending on the used codecs.

What is the data structure of HLS and DASH?

The folder with HLS format contains video encoded to 360p, 720p and 1080p. You can see the .ts files representing the chunks of 10-second intervals. Because we have a 60-second video, it contains 6 chunks – 6 .ts files.

In the case of DASH format, each video chunk has 5 seconds, so the video with DASH folder contains 12 chunks with a .m4s suffix.

DASH video with chunks

HLS video with chunks

DASH and HLS streaming structure generated via FFmpeg.

You can also see index.m3u8, which is our index file. It is linked to the video player on the website where we are streaming. It is a simple text file containing information on which resolution and bandwidth these videos have. The content looks like this:

#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=375000,RESOLUTION=640x360
360_video.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2000000,RESOLUTION=1280x720
720_video.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=3500000,RESOLUTION=1920x1080
1080_video.m3u8

The file 360_video.m3u8 defines the length of the chunk .ts files, and it looks like this:

#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:11
#EXT-X-MEDIA-SEQUENCE:0
#EXTINF:10.135122,
360_video0.ts
#EXTINF:10.760756,
360_video1.ts
#EXTINF:10.135122,
360_video2.ts
#EXTINF:9.634622,
360_video3.ts
#EXTINF:9.884878,
360_video4.ts
#EXTINF:9.468422,
360_video5.ts
#EXT-X-ENDLIST

The video converted to DASH format also has a manifest/index file with XML Structure.

How to convert .mp4, .mkv or .mov videos to HLS?

For converting the video to HLS streaming video format with three qualities (1920p, 720p and 360p) you can call the FFmpeg directly:

mkdir hls
ffmpeg -i minute.mp4 -profile:v baseline -level 3.0 -s 640x360  -start_number 0 -hls_time 10 -hls_list_size 0 -f hls hls/360_video.m3u8
ffmpeg -i minute.mp4 -profile:v baseline -level 3.0 -s 1280x720  -start_number 0 -hls_time 10 -hls_list_size 0 -f hls hls/720_video.m3u8
ffmpeg -i minute.mp4 -profile:v baseline -level 3.0 -s 1920x1080  -start_number 0 -hls_time 10 -hls_list_size 0 -f hls hls/1080_video.m3u8

You can select the preferred resolutions via -s arg. For example, you can additionally create a video in 480p resolution if needed. With -hls_time, you can specify the length of chunks. After the conversion is done, we can manually or programmatically create an index.m3u8 file, which is used as a link in your web player.

You can also call the conversion of MP4 to HLS via Python with a subprocess module:

def call_ffmpeg(cmd):
    with subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE) as process:
        process.communicate()
    return True

Read the FFMPEG documentation for more information about the optimal setting for your HLS video. There are a lot of parameters to tune, for example, you can set max rate, bufsize, average bit rate and much more. Here are a few links to start:

How to convert videos to DASH?

Similar to HLS, we can convert the video to DASH in two resolutions (360p and 1080p) with this command:

ffmpeg -re -i minute.mp4 -map 0 -map 0 -c:a libfdk_aac -c:v libx264 \
-b:v:0 300k -b:v:1 3000k -s:v:0 640x360 -s:v:1 1920x1080 -profile:v:1 baseline \
-profile:v:0 main -bf 1 -keyint_min 120 -g 120 -sc_threshold 0 \
-b_strategy 0 -ar:a:1 22050 -use_timeline 1 -use_template 1 \
-adaptation_sets "id=0,streams=v id=1,streams=a" \
-f dash dash/out.mpd

The video conversion uses the H.264 codec via the -c:v libx264 argument. The resolution is set via -s:v argument. Whenever you are playing a DASH video, your entry point is this .mpd file, which will be generated during the conversion.

How can I stream videos on my website?

You can for example upload your converted videos to any storage like Amazon S3, Wasabi or DigitalOcean, and put the CDN (Cloudflare, CDN77 or Bunny) in front of your storage. For a web player, you could use for example the Bradmax player.

Did you find this guide useful? Check our other guides on custom visual search, image recognition & object detection systems, or various applications of visual AI.

Is there an API for conversion to streaming formats?

We’ve been working with video conversion and processing by artificial intelligence for a while and gained a lot of experience with FFmpeg on videos. If you would like to try our API for video conversion into streaming formats, cutting videos, concatenating, and video trimming, contact us at tech@ximilar.com. In case you have any other questions or ideas, contact us through the contact form. We’re here to help!

The post How to Convert a Video Into a Streaming Format? appeared first on Ximilar: Visual AI for Business.

Explainable AI: What is My Image Recognition Model Looking At?

Zuzana Raidová — Tue, 07 Dec 2021 14:16:20 +0000

There are many challenges in machine learning, and developing a good model is one of them. Even though neural networks are very powerful, they have a great weakness. Their complexity makes it hard to understand how they reach their decisions. This might be a problem when you want to move from development to production, and it might eventually cause your whole project to fail. But how can you measure the success of a machine learning model? The answer is not easy. In our opinion, the model must excel in a production environment and should work reliably in both common and uncommon situations.

However, even when the results in production are good, there are areas, where we can’t simply accept black box decisions without being sure, how the AI made them. These areas are typically medicine and biotech or any other field where there is no place for errors. We need to make sure that both output and the way our model reached its decision make sense – we need explainable AI. For these reasons, we introduced a new feature to our Image Recognition service called Explain.

Training Image Recognition

Image Recognition is a Visual AI service enabling you to train custom models to recognize images or objects in them. In Ximilar App, you can use Categorization & Tagging and Object Detection, which can be combined with Flows. For example, the first task will detect all the human models in the image and the categorization & tagging tasks will categorize and tag their clothes and accessories.

Image recognition is a very powerful technology, bringing automation to many industries. It requires well-trained models, and, in the case of object detection, precise data annotation. If you are not familiar with using image recognition on our platform, please try to set up your own classifier first.

These resources should be helpful in the beginning:

Check the Custom Image Recognition page
Read The Basic Rules for Image Recognition Models Training
Read Best Practices in Image Recognition Training
Watch our YouTube tutorial on how to set up the image recognition task
Read how to combine and chain your models with Flows

From model-centric to data-centric with explainable AI

Explaining which areas are important for the leaf disease recognition model when predicting a label called “canker”.

When you want a model which performs great in a production setting and has high accuracy, you need to focus on your training data first. Consistency of labelling, cleaning datasets from unnecessary samples/labels, and adding feature-rich samples that are missing is much more important than the newest architecture of the neural network. Andrew Ng, an entrepreneur and professor at Stanford, is also promoting this approach to building machine learning models.

The Explain feature in our App tells you:

which parts of images (features and pixels) are important for predicting specific labels
for which images the model will probably predict the wrong results
which samples should be added to your training dataset to improve performance

Simple Example: T-shirt or Not?

Let’s look at this simple example of how explainable AI can be useful. Let’s say we have a task containing two categories – t-shirts and shoes. For a start, we have 20 images in each category. It is definitely not enough for production, but it is enough if you want to experiment and learn.

This neural network has two labels: shoes and t-shirt.

After playing with the advanced options and short training, the result seems really promising:

Using Explain on a Training Image

But did the model actually learn what we wanted? To check, what the neural network find important when categorizing our images, we will apply two different methods with the tool Explain:

Grad-CAM (first published in 2016) – this method is very fast, but the results are not very precise
Blur Integrated Gradients (published in 2020) smoothed with SmoothGrad – this method provides much more details, but at the cost of computational time

Grad-Cam result of Explain feature. As you can see, the model is looking mostly at the head/face.

Blur-Integrated Gradients results, the most important features are head/face, similar to what grad-cam is telling us.

In this case, both methods clearly demonstrate the problem of our model. The focus is not on the t-shirt itself, but on the head of the person wearing it. In the end, it was easier for the learning algorithm and the neural network to distinguish between the two categories using this feature instead of focusing on the t-shirt. If we look at the training data for label t-shirt, we can see that all pictures include a person with a visible face.

Data for T-shirt label for the image recognition task. This small dataset contains only photos with visible faces, which can be a problem.

Explainability After Adding New Data

The solution might be adding more varied training data and introducing images without a person. Generally, it’s a good approach to start with a small dataset and over time increase it to a bigger one. Adding visually broad images helps model with overfitting on wrong features. So we added more photos to the label and trained the model again. Let’s see what the results look like with our new version of the model:

After retraining the model on new data, we can see the improvement for what features the neural network looking for.

The Grad-CAM result on the left is not very convincing in this case. The image on the right shows the result of Blur Integrated Gradients. Here you can see, how the focus moved from the head to the t-shirt. It seems like the head still plays some part, but there is much less focus on it.

Both methods for explainable AI have their drawbacks, and sometimes we have to try more pictures to get a better understanding of model behaviour. We also need to mention one important point. Due to the way the algorithm works, it tends to prefer edges, which is clearly visible in the examples.

Summary

The Explainability and Interpretability of Neural Networks is a big research topic, and we are looking forward to adopting and integrating more techniques into our SaaS AI solution. AI Explainability that we showed you is only one tool amongst many towards data-centric AI.

If you have any troubles, do not hesitate to contact us. The machine learning specialists of Ximilar have vast experience with different kinds of problems, and are always happy to help you with yours.

The post Explainable AI: What is My Image Recognition Model Looking At? appeared first on Ximilar: Visual AI for Business.

How to deploy object detection on Nvidia Jetson Nano

Michal Lukáč — Mon, 18 Oct 2021 12:13:16 +0000

At the beginning of summer, we received a request for a custom project for a camera system in a factory located in Africa. The project was about detecting, counting, and visual quality control of the items on the conveyor belts in a factory with the help of visual AI. So we developed a complex system with neural networks on a small computer called Jetson Nano. If you are curious about how we did it, this article is for you. And if you need help with building similar solutions for your factory, our team and tools are here for you.

What is NVIDIA Jetson Nano?

There were two reasons why using our API was not an option. First, the factory has unstable internet connectivity. Also, the entire solution needs to run in real time. So we chose to experiment with embedded hardware that can be deployed in such an environment, and we are very glad that we found Nvidia Jetson Nano.

[Source]

Jetson Nano is an amazing small computer (embedded or edge device) built for AI. It allows you to do machine learning in a very efficient way with low-power consumption (about 5 watts). It can be a part of IoT (Internet of Things) systems, running on Ubuntu & Linux, and is suitable for simple robotics or computer vision projects in factories. However, if you know that you will need to detect, recognize and track tens of different labels, choose the higher version of Jetson embedded hardware, such as Xavier. It is a much faster device than Nano and can solve more complex problems.

What is Jetson Nano good for?

Jetson is great if:

You need a real-time analysis
Your problem can be solved with one or two simple models
You need a budget solution & be cost-effective when running the system
You want to connect it to a static camera – for example, monitoring an assembly line
The system cannot be connected to the internet – for example, because your factory is in a remote place or for security reasons

The biggest challenges in Africa & South Africa remain connectivity and accessibility. AI systems that can run in house and offline can have great potential in such environments.
Deloitte: Industry 4.0 – Is Africa ready for digital transformation?

Object Detection with Jetson Nano

If you need real-time object detection processing, use the Yolo-V4-Tiny model proposed in this repository AlexeyAB/darknet. And other more powerful architectures are available as well. Here is a table of what FPS you can expect when using Yolo-V4-Tiny on Jetson:

Architecture	mAP @ 0.5	FPS
yolov4-tiny-288	0.344	36.6
yolov4-tiny-416	0.387	25.5
yolov4-288	0.591	7.93

Source: Github

After the model’s training is completed, the next step is the conversion of the weights to the TensorRT runtime. TensorRT runtimes make a substantial difference in speed performance on Jetson Nano. So train the model with AlexeyAB/darknet and then convert it with tensorrt_demos repository. The conversion has multiple steps because you first convert darknet Yolo weights to ONNX and then convert to TensorRT.

There is always a trade-off between accuracy and speed. If you do not require a fast model, we also have a good experience with Centernet. Centernet can achieve a really nice mAP with precise boxes. If you run models with TensorFlow or PyTorch backends, then the speed is slower than Yolo models in our experience. Luckily, we can train both architectures and export them in a suitable format for Nvidia Jetson Nano.

Image Recognition on Jetson Nano

For any image categorization problem, I would recommend using simple architecture as MobileNetV2. You can select for example the depth multiplier for mobilenet of 0.35 and image resolution 128×128 pixels. In this way, you can achieve great performance both in speed and precision.

We recommend using TFLITE backend when deploying the recognition model on Jetson Nano. So train the model with the TensorFlow framework and then convert it to TFLITE. You can train recognition models with our platform without any coding for free. Just visit Ximilar App, where you can develop powerful image recognition models and download them for offline usage on Jetson Nano.

A simple Object Detection camera system with the counting of products can be deployed offline in your factory with Jetson Nano.

Recommended camera and utilities

Jetson Nano is simple but powerful hardware. However, it is not as powerful as your laptop or desktop computer. That’s why analyzing 4k images on Jetson will be very slow. I would recommend using max 1080p camera resolution. We used a camera by Raspberry PI, which works very well on Jetson and installation is easy!

I should mention that with Jetson Nano, you can come across some temperature issues. Jetson is normally shipped with a passive cooling system. However, if this small piece of hardware should be in the factory, and run stable for 24 hours, we recommend using an active cooling system like this one. Don’t forget to run the next command so your fan on Jetson starts working:

sudo jetson_clocks --fan

Installation steps & tips for development

When working with Jetson Nano, I recommend following guidelines by Nvidia, for example here is how to install the latest TensorFlow version. There is a great tool called jtop, which visualizes hardware stats as GPU frequency, temperature, memory size, and much more:

jtop tool can help you monitor statistics on Nvidia Jetson Nano.

Remember, the Jetson has shared memory with GPU. You can easily run out of 4 GB when running the model and some programs alongside. If you want to save more than 0.5 GB of memory on Jetson, then run the Ubuntu on LXDE desktop environment/interface. The LXDE is more lightweight than the default Ubuntu environment. To increase memory, you can also create a swap file. But be aware that if your project requires a lot of memory, it can eventually destroy your microSD card. More great tips and hacks can be found on JetsonHacks page.

For improvement of the speed of Jetson, you can also try these two commands, which will set the maximum power input and frequency:

sudo nvpmodel -m0
sudo jetson_clocks

When using the latest image for Jetson, be sure that you are working with the right OpenCV versions of the library. For example, some older tracking algorithms like MOSSE or KCF from OpenCV require a specific version. For some tracking solutions, I recommend looking on PyImageSearch website.

Developing on Jetson Nano

The experience of programming challenging projects, exploring new gadgets, and helping our customers is something that deeply satisfies us. We are looking forward to trying other hardware for machine learning such as Coral from Google, Raspberry Pi, or Intel Movidius for Industry 4.0 projects.

Most of the time, we are developing a machine learning API for large e-commerce sites. We are really glad that our platform can also help us build machine learning models on devices running in distant parts of the world with no internet connectivity. I think that there are many more opportunities for similar projects in the future.

The post How to deploy object detection on Nvidia Jetson Nano appeared first on Ximilar: Visual AI for Business.

Flows – The Game Changer for Next-Generation AI Systems

Zuzana Raidová — Wed, 01 Sep 2021 15:25:28 +0000

We have spent thousands of man-hours on this challenging subject. Gallons of coffee later, we introduced a service that might change how you work with data in Machine Learning & AI. We named this solution Flows. It enables simple and intuitive chaining and combining of machine learning models. This simple idea speeds up the workflow of setting up complex computer vision systems and brings unseen scalability to machine learning solutions.

We are here to offer a lot more than just training models, as common AI companies do. Our purpose is not to develop AGI (artificial general intelligence), which is going to take over the world, but easy-to-use AI solutions, that can revolutionize many areas of both business and daily life. So, let’s dive into the possibilities of flows in this 2021 update of one of our most-viewed articles.

Flows: Visual AI Setup Cannot Get Much Easier

In general, at our platform, you can break your machine learning problem down into smaller, separate parts (recognition, detection, and other machine learning models called tasks) and then easily chain & combine these tasks with Flows to achieve the full complexity and hierarchical classification of a visual AI solution.

A typical simple use case is conditional image processing. For instance, the first recognition task filters out non-valid images, then the next one decides a category of the image and, according to the result, other tasks recognize specific features for a given category.

Simple use of machine learning models combination in a flow

Flows allow your team to review and change datasets of all complexity levels fast and without any trouble. It doesn’t matter whether your model uses three simple categories (e.g. cats, dogs, and guinea pigs) or works with an enormous and complex hierarchy with exceptions, special conditions, and interdependencies.

It also enables you to review the whole dataset structure, analyze, and, if necessary, change its logic due to modularity. With a few clicks, you can add new labels or models (tasks), change their chaining, change the names of the output fields, etc. Neat? More than that!

Think of Flows as Zapier or IFTTT in AI. With flows, you simply connect machine learning models, and review the structure anytime you need.

Define a Flow With a Few Clicks

Let’s assume we are building a real estate website, and we want to automatically recognize different features that we can see in the photos. Different kinds of apartments and houses have various recognizable features. Here is how we can define this workflow using recognition flows (we trained each model with a custom image recognition service):

An example of real estate classifier made of machine learning models combined in a flow

The image recognition models are chained in a “main” flow called the branch selector. The branch selector saves the result in the same way as a recognition task node and also chooses an action based on the result of this task. First, we let the top category task recognize the type of estate (Apartment vs. Outdoor house). If it is an apartment, we can see that two subsequent tasks are “Apartment features” and “Room type”.

A flow can also call other flows, so-called nested flows, and delegate part of the work to them. If the image is an outdoor house, we continue processing by another nested flow called “Outdoor house”. In this flow, we can see another branching according to the task that recognizes “House type”. Different tasks are called for individual categories (Bungalow, Cottage, etc.):

An example use of nested flows – the main flow calls other nested flows to process images based on their category

Flow Elements We Used

So far, we have used three elements:

A recognition task, that simply calls a given task and saves the result into an output field with a specified name. No other logic is involved.
A branch selector, on the other hand, saves the result in the same way as a recognition task node, but then it chooses an action based on the result of this task.
Nested flow, another flow of tasks, that the “main” flow (branch selector) called.

Implicitly, there is also a List element present in some branches. We do not need to create it, because as soon as we add two or more elements to a single branch, a list generates in the background. All nodes in a list are normally executed in parallel, but you can also set sequential execution. In this case, the reordering button will appear.

Branch Selector – Advanced Settings

The branch selector is a powerful element. It’s worthwhile to explore what it can do. Let’s go through the most important options. In a single branch, by default, only actions (tag or category) with the highest relevance will be performed, provided the relevance (the probability outputted by the model) is above 50 %. But we can change this in advanced settings. We can specify the threshold value and also enable parallel execution of multiple branches!

The advanced settings of a branch selector, enabling to skip a task of a flow

You can specify the format of the results. Flat JSON means that results from all branches will be saved on the same level as any previous outcomes. And if there are two same output names in multiple branches, they can be overwritten. The parallel execution guarantees neither order nor results. You can prevent this from happening by selecting nested JSON, which will save the results from each branch under a separate key, based on the branch name (that is the tag/category name).

If some data (output_field) are present in the incoming request, we can skip calling the branch selector processing. You can define this in If Output Field Exists. This way we can save credits and also improve the precision of the system. I will show you how useful this behaviour can be in the next paragraphs. To learn about the advanced options of training, check this article.

An Example: Fashion Detection With Tags

We have just created a flow to tag simple and basic pictures. That is cool. But can we really use it in real-life applications? Probably not. The reason is, in most pictures, there is usually more than one clothing item. So how are we going to automate the tagging of more complex pictures? The answer is simple: we can integrate object detection into flows and combine it with recognition & tagging models!

Example of Fashion Tagging combined with Object Detection in Ximilar App

The flow structure then exactly mirrors the rich product taxonomy. Each image goes through a taxonomy tree in order to get proper tags. This is our “top classifier” – a flow that can tell one of our seven top categories of a fashion product image, which will determine how the image will be further classified. For instance, if it is a “Clothing” product, the image continues to “Clothing tagging” flow.

A “top classifier” – a flow that can tell one of our seven top categories of a fashion product image.

Similar to categorization or tagging, there are two basic nodes for object detection: the Detection Task for simple execution of a given task and Object Selector, which enables the processing of the detected objects.

Object Selector will call the object detection task. The detected objects will be extracted out of the image and passed further to any of the available nodes. Yes, any of them! Even another Object Selector, if, for example, you need to first detect people and then detect clothes on each person separately.

Object Selector – Advanced Settings

Object Selector behavior can be customized in similar ways as a Branch Selector. In addition to the Probability Threshold, there is also an Area Threshold. By default, all objects are processed. By setting this threshold, the objects that do not take at least a given percentage of an image are simply ignored. This can be changed to a single object by probability or area in Select. As I mentioned, we extract the object before further processing. We can extend it a bit to include some context using Expand Bounding Box by…

Setting a threshold for a space that an object should occupy in order to be detected

A Typical Flows Application: Fashion Tagging

We have been playing with the fashion subject since the inception of Ximilar. It is the most challenging and also the most promising one. We have created all kinds of tools and helpers for the fashion industry, namely Fashion Tagging, specialized Fashion Search, or Annotate. We are proud to have a very precise automatic fashion tagging service with a rich fashion taxonomy.

And, of course, Fashion Tagging is internally powered by Flows. It is a huge project with several dozens of features to recognize, about a hundred recognition tasks, and hundreds of labels all chained into several interconnected flows. For example, this is what our AI says about a simple dress now – and you can try it on your picture in the public demo.

Example of fashion attributes assigned to a dress by Ximilar Fashion Tagging flow

Try the Fashion Tagging demo

Include Pre-trained Services In Your Flow

The last group of nodes at your disposal are Ximilar services. We are working hard and on an ever-growing number of ready-to-use services which can be called through our API and integrated into your project. It is natural for our users to combine more AI services, and flows make it easier than ever. At this moment, you can call these ready-to-use recognition services:

Fashion Tagging (demo, details)
Home Decor Tagging (demo, details)
Dominant Colors (demo, details)

But more will come in the future, for example, Remove Background.

Increasing Possibilities of Flows

As our app and list of services grow, so do the flows. There are two features we are currently looking forward to. We are already building custom similarity models for our customers. As soon as they are ready, they will be available for combining in flows. And there is one more item very high on our list, which is predicting numeric values. Regression, in machine learning terms. Stay tuned for more exciting news!

Create Your Flow – It’s Free

Before Flows, setting up the AI Vision process was a tedious task for a skilled developer. Now everyone can set up, manage and alter steps on their own. In a comprehensive, visual way. Being able to optimize the process quickly, getting a faster response, losing less time and expenses, and delivering higher quality to customers.

And what’s the best part? Flows are available to the users of Ximilar’s free plan, so you can try them right away. Register or sign up to the Ximilar App and enter Flows service at the Dashboard. If you want to learn the basics first, check out our video tutorials. Then you can connect tasks and labels defined in your own Image Recognition.

Training of machine learning models is free with Ximilar, you are only paying for API calls for recognition. Read more about API calls or API credit packs. We strongly believe you will love Flows as much as we enjoyed bringing them to life. And if you feel like there is a feature missing, or if you prefer a custom-made solution, feel free to contact us!

The post Flows – The Game Changer for Next-Generation AI Systems appeared first on Ximilar: Visual AI for Business.

Image Similarity as a Service For Your Web

Michal Lukáč — Tue, 27 Jul 2021 16:43:13 +0000

With the service Image Similarity added to the Ximilar App, you can build your own visual similarity engine powered by artificial intelligence in just a few clicks, with several lines of code. Similarity search enables companies to improve the user experience significantly and increase revenue with smarter management of their visual data.

The technology behind image similarity is robust, reliable & fast. Built on state-of-the-art (SOTA) AI models and vector databases, you can search millions of images/products in milliseconds. It is used by big e-commerce players as well as small startups for showing visual alternatives or finding products with pictures. Some of our customers have hundreds of millions of images in their collections and do more than 100 million requests per month. Let’s dive into building a superfast similarity search service for your web.

What is Image Similarity?

Image Similarity, or image similarity search, is a visual AI service comparing, grouping, and recommending visually similar images. For example, a typical use case is a product recommendation of similar items in e-shops. It can also be used for reverse image search, where the query is an external image and the results are images from the collection. This approach gives way more accurate results than searching by tags, labels and other attributes.

Ximilar is using state-of-the-art deep learning models for all visual search services. We build our own indexing & searching technology that can run both as a service or on your hardware if needed. The collections can be focused either on product photos, fashion, image matching, or generic photos (stock images).

Go to the public demo

Features of the Image Similarity Service

Here are several features of the Image Similarity service that we think are crucial:

Simple access through the Ximilar App (creating a collection on click) and connection to REST API
The scalable search service can handle collections with hundreds of millions of similar items (images, videos, etc.) and hundreds of requests per second with both CRUD operations and searching
The ultra-fast and reliable engine that is mostly deployed in large e-commerce platforms – the query for finding the most visually similar product is low latency (in milliseconds)
The service is customizable – the platform enables you to train your own model for visual similarity search
Advanced filtering that supports JSON meta-data – if you need to restrict the result to a specific field
Grouping based on similarity – our search technology can group photos of the same product as one item
Security and privacy of your data – only meta-data and the visual representation of the images are stored, therefore your images are not stored anywhere
The service is affordable and cost-effective both for startups and enterprises, offering free plan for tests as well as discounts with your growth over time
We can deploy it on your hardware, independently of our infrastructure, and also offline – custom similarity model and deployment appropriate to your needs
Our search engine and machine learning models improve constantly – maintaining much higher quality than any other open-source project & we are able to build custom search engines with trained models

Applications Using Visual Similarity

According to this research by Deloitte, merchandising with artificial intelligence is more and more relevant, and recommendation engines play a vital part in it. Here are a few use cases for visual similarity engines:

E-shops that use product similarity to help customers to browse and find related products (e.g. in fashion & luxury items, home decor & furniture, art, wall art, prints & posters, collectible trading cards, comics, trademarks, etc.)
Stock photo databases suggesting similar content – getting visual alternatives of photos, designs, product images, and videos
Finding the exact products – apps like Vivino for finding wine or any kind of product are easy to develop for us
Visual similarity duplicate finder (also image matching or deduplication), to know which images are already in your database, or which product photos you can merge together
Reverse image search – finding a product or an image with a picture online
Finding similar real estate based for example on interior design, furniture, garden, etc.
Comparing two images for similarity – for example patterns or designs

Showing similar wall art with a jungle pattern. [Source]

Recommending products to your customers has several advantages. Firstly, it creates a better user experience and helps your customers find the right products faster. Secondly, it instantly makes the purchase rate on your web higher. This means a win on both sides – satisfied customers and higher revenue for you. Read more about customer experience and product recommendations in our blog post on fashion search.

Step by step: Building Real-Estate Image Search

Creating the Collection

So let’s take a look at how to easily build your own similarity search engine with the Ximilar platform. The first step is to log in to the Ximilar App. If you don’t have an account, then sign up – it’s free and takes just a minute. After that, on the Dashboard, click on the Visual Search tile and then the Image Similarity service. Then go to the Collections in the left menu and click on Create New Collection. It will show a pop-up with different collection types from which you need to select one.

The collection is a space where you upload your images. With this collection, you are performing queries for search. You can choose from Generic Photo Collection, Product Photo Collection, Dominant Colors Similarity, and Image Matching. Clicking on one of the cards will create a collection for your account.

Pick one collection type suitable for your data to create your similarity application.

Each of these collection types is suitable for different types of images:

Use Generic Photos if you work with stock photos
Pick a Product Photos collection if you are an e-commerce company
Select Image Matching to find duplicates in your images
For the fashion sector, we recommend using a specialized service called Fashion Search
Custom Similarity is suitable if you are working with another type of data (e.g. videos or 3D models). To do this, please schedule a call with us, and we will develop your own model tuned for your data. For instance, we built a photo search system for the Magic the Gathering Trading Cards for one of our customers.

For this example of real estate, I will use a Generic Photo Collection. The advantage of Generic Photo Collection is that it also supports searching images via text input/query. We usually develop custom similarity models for real estate, when the customers need specific and more accurate results. However, for this simple use case, the generic real estate model will be enough.

Schedule a call

Format of Image Similarity Dataset

First, we need to prepare a text file with JSON records. Each record represents an image that we want to store/insert into our collection. The key field is "_url" with the image URL. The advantage of the _url is that you can directly see and inspect the results via app.ximilar.com.

You can also optionally send records with base64 data, this is great if your data are stored locally on your computer. Don’t worry, we are not storing the whole images (data or base64) in the collection database, just URLs with all other metadata present in the records.

The JSON records look like this:
{"_id": "1_1", "_url": "_URL_IMAGE_PATH_", "estate_id": "1", "category": "indoor", "subcategory": "kitchen", "tags": []} {"_id": "1_2", "_url": "_URL_IMAGE_PATH_", "estate_id": "1", "category": "indoor", "subcategory": "kitchen", "tags": []} ...

If you don’t have image URLs, you can use either "_file" or "_base64" fields for the image data (locally stored "_file" data are automatically converted by the Python client to base64). The image similarity engine is indexing every record of the collection by extracting a representation from the image by a neural network model. However, we are not storing the images in our engine. So, only records that contain "_url" will be visualized in the Ximilar App.

You must store unique identifiers of each image in the "_id" field to identify your images in the collection. The value of this field must be a string. The API endpoint for searching is returning this _id values, that is how you get the results for visual search. You can also store additional fields for every JSON record, and then you can use these fields for filtering, grouping, and tuning the similarity function (see below).

Filling the Collection With Your Data

The next step requires a few lines of code. We are going to insert the prepared images into our collection using our python-client library. You can install the library using pip or directly from GitLab. The usage of the client is very straightforward and basically, you can just use the script tools/collections/insert_json_records.py:

python insert_json_records.py --type generic --auth_token __YOUR_TOKEN__ --collection_id __COLLECTION_ID__ --path /path/to/the/file.json

You will find the collection ID and the Authorization token on the “collection page” in the Ximilar App. This script will run for a few minutes, depending on the size of your image dataset.

Result: Finding Visually Similar Pictures

That was pretty easy, right? Now, if you go to the collections page, you will see something like this:

You can see your image similarity collection in the Ximilar App

All images from the JSON file were indexed, and now you can inspect the collection in the Ximilar App. Select the Similarity Search in the left menu of the Image Similarity service and test how the similarity works. You can specify the query image either by upload, by URL, or your IDs, or by choosing one of the randomly selected images from the collection.

Even though we have indexed just several hundred images, you can see that the similarity engine works pretty well. The first image is the query image and the next images are the k-nearest to the query image:

Showing most visually similar real estate to the first image.

Rest API Connection for Image Search

The next step might be to integrate the service into your application via API. You can either directly use the REST API for searching visually similar images or, if you are using Python, we recommend our Python SDK client like this:

# pip install ximilar-client
from ximilar.client import SimilarityPhotosClient
client = SimilarityPhotosClient("_API_TOKEN_", "_COLLECTION_ID_")
# search k nearest items
client.search({"_id": "1"}, k = 3)
# search by external image
client.search({"_url": "_URL_PATH_"})

Advanced Features for Photo Similarity

The search for visually similar images can be combined with filtering on metadata. This metadata can be stored in the JSON, as in our example with the "category" and "subcategory" fields. In the API, the filtering is specified using a MongoDB-like syntax – see the documentation.

For example, let’s say that we want to search for images similar to the image with ID=1_1 that are indoor photos made in a kitchen. We assume that this meta-information is stored in the “category” and “subcategory” fields of every JSON record. The query will look like this:

client.search({"_id": "1_1"}, filter={"category": "Indoor", "subcategory": "Kitchen"})

If we know that we will often filter on some fields, we can specify them in the “Fields to index” option of the collection to make the query processing more efficient.

You can specify which field from JS records will define your SKU identifier.

Often, your data contains several photos of one “object” – a product or, in our example, real estate. Our service can group the search results not by individual photos but by product IDs. You can set this in the advanced options of the collection by specifying the name of the real estate in the Product ID field, and the magic will happen.

Enhancing Image Similarity Engine with Tags

The image similarity is based purely on the visual content of the image. However, you can use your tags (labels, keywords) to enhance the similarity search. In the example, we assume that the data already contains categories, subcategories, and tags. In order to enhance the visual similarity search with tags, you can fill the “tags” field for every record with your tags, and also use method /v2/visualTagsKNN. After that, your search results will be based on a combination of visual similarity and keywords.

If you don’t have categories and tags, you can create your own photo tagger through our Image Recognition service, and enrich your image data automatically before indexing. The possibilities of image recognition models and their combinations are endless, resulting in highly customizable solutions. Read our guide on how to build your own Image Recognition API.

With the Ximilar Image Recognition service, you can create custom tagging models for your images.

You can build several models:

One classifier for categorizing indoor/outdoor/floor plan photos
One classifier for getting room type (Bedroom, Kitchen, Living room, etc.)
One tagger for outdoor tags like (Pool, Garden, Garage, House view, etc.)

To Sum Up

The real estate photo similarity search is only one use case of visual similarity from many (fashion, e-commerce, art, stock photos, healthcare…). We hope that you will enjoy working with this service, and we are looking forward to seeing your projects based on it. Thanks to our developers Libor and Ludovit, you can use this service through the frontend app.

Visual Similarity service by Ximilar is unique in terms of search quality, speed performance, and all the possibilities of the API. Our engineers are constantly upgrading the quality of the search, so you don’t have to. We are able to build custom solutions suitable for your data. With multiple collections, you can even A/B test the performance on your websites. This can run in our cloud as SaaS or in your warehouse! If you have more questions about pricing, and technical details, or you would like to run the similarity search engine on your own machines, then contact us.

Try our public demos

The post Image Similarity as a Service For Your Web appeared first on Ximilar: Visual AI for Business.

How to Build Your Own Image Recognition API?

Víťa Válka — Fri, 16 Jul 2021 10:38:27 +0000

Image recognition systems are still young, but they become more available every day. Usually, custom image recognition APIs are used for better filtering and recommendations of products in e-shops, sorting stock photos, classification of errors, or pathological findings. Ximilar, same as Apple Vision SDK or Google Tensorflow, make the training of custom recognition models easy and affordable. However, not many people and companies have been using this technology to its full potential so far.

For example, recently, I had a conversation with a client who said that Google Vision didn’t work for him, and it returned non-relevant tags. The problem was not the API but the approach to it. He employed a few students to do the labelling job and create an image classifier. However, the results were not good at all. After showing him our approach, sharing some tips and simple rules, he got better classification results almost immediately. This post should serve as a comprehensive guide for those, who build their own image classifiers and want to get the most out of it.

How to Begin

Image recognition is based on the techniques of machine learning and computer vision. It is able to categorize and tag images with tags describing the attributes recognized in them. You can read everything about the service and its possibilities here.

To train your own Image Recognition models and create a system accessible through API, you will first need to upload a set of training images and create your image recognition tasks (models). Then you will use the training set to train the models to categorize the images.

If you need your images to be tagged, you should upload or create a set of tags and train tagging tasks. As the last step, you can combine these tasks into a Flow, and modify or replace any of them anytime due to its modular structure. You can then gradually improve your accuracy based on testing, evaluation metrics and feedback from your customers. Let’s have a look at the basic rules you should follow to reach the best results.

The Basic Rules for Image Recognition Models Training

Each image recognition task contains at least two labels (classes, categories) – e.g., cats and dogs. A classic image recognition model (task) assigns one label to each image – so the image is either a cat or dog. In general, the more classes you have, the more data you will need to teach the neural network to predict labels.

Binary classification for cats and dogs. Source: Kelly Lacy (Pexels), Pixabay

The training images should represent the real data that will be analyzed in a production setting. For example, if you aim to build a medical diagnostic tool helping radiologists identify the slightest changes in the lung tissue, you need to assemble a database of x-ray images with proven pathological findings. For the first training of your task, we recommend sticking to these simple rules:

Start with binary classification (two labels) – use 50–100 images/label
Use about 20 labels for basic and 100 labels for more complex solutions
For well-defined labels use 200+ images/label
For hard to recognize labels add 100+ images/label
Pattern recognition – for structures, x-ray images, etc. use 50–100 images/label

Always keep in mind, that training one task with hundreds of labels on small datasets almost never works. You need at least 20 labels and 100+ images per label to start with to achieve solid results. Start with the recommended counts, and then add more if needed.

You can create your image recognition model via app.ximilar.com without coding.

The Difference Between Testing & Production

The users of Ximilar App can train tasks with a minimum of 20 images per label. Our platform automatically divides your input data into two datasets – training & test set, usually in a ratio of 80:20. The training set is used to optimize the parameters of the classifier. During the training, the training images are augmented in several ways to extend the set.

The test data (about 20 %) are then used to validate and measure accuracy by simulating how the model will perform in production. You can see the accuracy results on the Task dashboard in Ximilar App. You can also create an independent test dataset and evaluate it. This is a great way to get accurate results on a dataset that was not seen by the model in the training before you actually deploy it.

Remember, the lower limit of 20 images per label usually leads to weak results and low accuracy. While it might be enough for your testing, it won’t be enough for production. This is also called overfitting. Most of the time the accuracy in Ximilar is pretty high, easily over 80 % for small datasets. However, it is common in machine learning to use more images for more stable and reliable results in production. Some tasks need hundreds or thousands of images per label for the good performance of your production model. Read more about the advanced options for training.

The Best Practices in Image Recognition Training

Start With Fewer Categories

I usually advise first-time users to start with up to 10 categories. For example, when building an app for people to recognize shoes, you would start with 10 shoe types (running, trekking, sneakers, indoor sport, boots, mules, loafers …). It is easier to train a model with 10 labels, each with 100 training images of a shoe type, than with 30 types. You can let users upload new shoe images. This way, you can get an amazing training dataset of real images in one month and then gradually update your model.

Use Multiple Recognition Tasks With Fewer Categories

The simpler classifiers can be incredibly helpful. Actually, we can end up with more than 30 types of shoes in one model. However, as we said, it is harder to train such a model. Instead, we can create a system with better performance if we create one model for classifying footwear into main types – Sport, Casual, Elegant, etc. And then for each of the main types, we create another classifier. So for Sport, there will be a model that classifies sports shoes to Running shoes, Sneakers, Indoor shoes, Trekking shoes, Soccer shoes, etc.

Use Binary Classifiers for Important Classes

Imagine you are building a tagging model for real estate websites, and you have a small training dataset. You can first separate your images into estate types. For example, start with a binary classifier that separates images to groups “Apartment” and “Outdoor house”. Then you can train more models specifically for room types (kitchen, bedroom, living room, …), apartment features, room quality, etc. These models will be used only if the image is labelled as “Apartment”.

Ximilar Flows allow you to connect multiple custom image recognition models to API.

You can connect all these tasks via the Flows system with a few clicks. This way, you can chain multiple image recognition models in one API endpoint and build a powerful visual AI. Typical use cases for Flows are in the e-commerce and healthcare fields. Systems for fashion product tagging can also contain thousands of labels. It’s hard to train just one model with thousands of labels that will have good accuracy. But, if you divide your data into multiple models, you will achieve better results in a shorter time! For labelling work, you can use our image Annotation system if needed.

Choose Your Training Images Wisely

Machine learning performs better if the distribution of training and evaluated pictures is even. It means that your training pictures should be very visually similar to the pictures your model will analyze in a production setting. So if your model will be used in CCTV setting, then your training data must come from CCTV cameras. Otherwise, you are likely to build a model that has great performance on training data, but it completely fails when used in production.

The same applies to Real Estate and other fields. If the system analyzes images of real estate that were not made only by professional photographers, then you need to include photos from smartphones, with bad lighting, blurry images, etc.

Typical home decor and real estate images used for image recognition. Your model should be able to recognize both professional and non-professional images. Source: Pexels.

Improving the Accuracy of the System

When clicking on the training button on the task page, the new model is created and put in the training queue. If you upload more data or change labels, you can train a new model. You can have multiple versions of them and deploy to the API only specific version that works best for you. Down on the task page, you can find a table with all your trained models (only the last 5 are stored). For each trained model, we store several metrics that are useful when deciding which model to pick for production.

Multiple versions models of your task in Ximilar Platform. Click on activate and this version will be deployed as API.

Inspect the Results and Errors

Click on the zoom icon in the list of trained models to inspect the results. You can see the basic metrics: Accuracy, Recall, and Precision. Precision tells you what is the probability that the model is right if it predicts a specific label. Recall tells you how likely is the prediction correct. If we have high recall but lower precision for the label “Apartment” from our real estate example, then the model is probably predicting on every image that it is “Apartment” (even on the images that should be “Outdoor house”). The solution is probably simple – just add more pictures that represent “Outdoor house”.

The Confusion matrix shows you which labels are easily confused by the trained model. These labels probably contain similar images, and it is therefore hard for the model to distinguish between them. Another useful component is Failed Images (misclassified) that show you the model’s mistake on your data. With Failed images, you can also see labelling mistakes in your data and fix them immediately. All of these features will help you build a more reliable model with good performance.

Inspecting the results of your trained models can show you potential problems in your data.

Reliability of the Image Recognition Results

Every client is looking for reliability and robustness. Stay simple if you aim to reach high accuracy. Build models with just a few labels if you can. For more complex tagging systems use Flows. Building an image classifier with a limited number of training images needs an iterative approach. Here are a few tips on how to achieve high accuracy and reliable results:

Break your large task into simple decisions (yes or no) or basic categories (red, blue and green)
Make fewer categories & connect them logically
Use general models for general categories
Make sure your training data represent the real data your model will analyze in production
Each label should have a similar amount of images, so the data will be balanced
Merge very close classes (visually similar), then create another task only for them, and connect it via Flows
Use both human and UI feedback to improve the quality of your dataset – inspect evaluation metrics like Accuracy, Precision, Recall, Confusion Matrix, and Failed Images
Always collect new images to extend your dataset

Summary for Training Image Recognition Models

Building an image classifier requires a proper task definition and continuous improvements of your training dataset. If the size of the dataset is challenging, start simple and gradually iterate towards your goal. To make the basic setup easier, we created a few step-by-step video tutorials. Learn how to deploy your models for offline use here, check the other guides, or our API documentation. You can also see for yourself how our pre-trained models perform in the public demo.

We believe that with the Ximilar platform, you are able to create highly complex, customizable, and scalable solutions tailored to the needs of your business – check the use cases for quality control, visual search engines or fashion. The basic features in our app are free, so anyone can try it. Training of image recognition models is also free with Ximilar platform. You are simply paying only for calling the model for prediction. We are always here to discuss your custom projects and all the challenges in person or on a call. If you have any questions, feel free to contact us.

Try our public demos

The post How to Build Your Own Image Recognition API? appeared first on Ximilar: Visual AI for Business.

OpenVINO: Start Optimizing Your TensorFlow 2 Models for Intel CPUs with Docker

Libor Vanek — Tue, 02 Feb 2021 06:59:01 +0000

In the previous article, we mentioned how OpenVINO improved the performance of our machine learning models on our Intel Xeon CPUs. Now, we would like to help the machine learning practitioners who want to start using this toolkit as fast as possible and test it on their own models.

You can find extensive documentation on the official homepage, there is the GitHub page, some courses on Coursera and many other resources. But how to start as fast as possible without a need to study all of the materials? One of the possible ways can be found in the following paragraphs.

No Need to Install, Use Docker

OpenVINO has a couple of dependencies which need to be present on your computer. Additionally, to install some of them, you need to have root/admin rights. This might not be desirable. Using Docker represents much cleaner way. Especially when there is an image prepared for you on Docker Hub.

If you are not familiar with Docker, it might seem like a complicated piece of software, but we highly recommend you to try it and learn the basics. It is not hard and worth the effort. Docker is an important part of today’s SW development world. You will find installation instructions here.

Running the Docker Image

Containers are stateless. That means that next time you start your container, all the changes made will be gone. (Yes, this is a feature.) If you want to persist some files, just prepare a directory on your filesystem and we will bind it as a shared volume to a running container. (To /home/openvino directory).

We will run our container in interactive mode (-it) so that you can easily execute commands inside it and it will be terminated after you close the command prompt (--rm). With the following command, the image will be downloaded (pulled), if it was not already done, and it will be run as a container.

docker run -it --rm -v __YOUR_DIRECTORY__:/home/openvino openvino/ubuntu18_dev:latest

To be able to use all the tools, OpenVINO environment needs to be initialized. For some reason, this is not done automatically. (At least not for a normal user, if you start docker as a root, -u 0, setup script is run.)

source /opt/intel/openvino/bin/setupvars.sh

Conformation is then printed out.

[setupvars.sh] OpenVINO environment initialized

TensorFlow 2 Dependencies

TensorFlow 2 is not present inside the container by default. We could very easily create our own image based on the original one with TensorFlow 2 installed. This is the best way in production. With this being said, we will show you another way using the original container and installing the missing packages to an virtual environment into the shared directory (volume). This way, we can create as many such environments as we want. Or easily modify this environment. In addition, we will be still able to try the older TensorFlow 1 models. We prefer this approach during initial development.

Following code needs to be executed only once, after you first start your container.

mkdir ~/env
python3 -m venv ~/env/tensorflow2 --system-site-packages
source ~/env/tensorflow2/bin/activate
pip3 install --upgrade pip
pip3 install -r /opt/intel/openvino/deployment_tools/model_optimizer/requirements_tf2.txt

When you close your container and open it again, this is the only part you will need to repeat.

source ~/env/tensorflow2/bin/activate

Converting TensorFlow SavedModel

Let’s say you have a trained model in SavedFormat. For the sake of this tutorial, we can take a pretrained MobileNet. Execute python3 and run following couple of lines to download and save the model.

import tensorflow as tf
model = tf.keras.applications.MobileNetV2(input_shape=(224,224,3))
model.save("/home/openvino/models/tf2")

Conversion is a matter of one command. However, there are few important parameters we need to use and which will be described below. Complete list can be found in the documentation. Exit python interpreter and run following command in a bash.

/opt/intel/openvino/deployment_tools/model_optimizer/mo_tf.py --saved_model_dir ~/models/tf2 --output_dir ~/models/converted --batch 1 --reverse_input_channels --mean_values [127.5,127.5,127.5] --scale_values [127.5,127.5,127.5]

--batch 1 sets batch size to 1. OpenVINO cannot work with undefined input dimensions. I out case, the input shape for MobileNetV2 of is (-1, 224, 224, 3).
--reverse_input_channels tells inference engine to convert image from it’s BGR format to RGB used by MobileNetV2.
--mean_values [127.5,127.5,127.5] --scale_values [127.5,127.5,127.5] finally performs the necessary preprocessing so that image values are between -1 and 1. In TensorFlow, we would call preprocess_input.

Be careful, if you use pretrained models, different preprocessing and channel order can be used. If you try to use a neural network with input preprocessed in wrong way, you will of course get the wrong result.

You don’t need to include the preprocessing inside the converted model. The other option is to preprocess every image inside your own code before passing it to converted model. However, we use some OpenVINO inference tool which expects correct input.

At this point, we also need to mentioned that you might get slightly different values from SavedModel in TensorFlow and converted OpenVINO model. But from my experience on classification models, the difference is quite similar as when you use different ways to downscale your image to a proper input size.

Run Inference on Converted Model

First, we will get a testing picture which belongs to one of the 1000 ImageNet classes. We chose zebra, class index 340. (For TensorFlow, Google keeps the class indexes here.)

Photo by Frida Bredesen on Unsplash

Let’s download it to our home directory We saved the small version of the image on our server so you can get it from there.

 curl -o ~/zebra.jpg -s https://images.ximilar.com/tutorials/openvino/zebra.jpg

There is a script you can use for testing the prediction with no need to write any code.

python3 /opt/intel/openvino/deployment_tools/inference_engine/samples/python/classification_sample_async/classification_sample_async.py -i ~/zebra.jpg -m ~/models/converted/saved_model.xml -d CPU

We get some info lines and top 10 results at the end. Since the numbers are pretty clear, we will show only the first three.

classid probability
------- -----------
  340    0.9556126
  396    0.0032325
  721    0.0008250

Cool, our model is really sure about what is on the picture!

Using OpenVINO Inference in Python

That was easy, right? But you probably need to run the inference inside your own Python code. You can take a look inside the script. It is pretty straightforward, but for the sake of completeness, we will copy some of the code here. We will also add a code for running an inference on the original model so that you can compare it easily. If you please, run python3 and we can start.

We need one basic import from OpenVINO inference engine. Also, OpenCV and NumPy are needed for opening and preprocessing the image. If you prefer, TensorFlow could be used here as well of course. But since it is not needed for running the inference at all, we will not use it.

import cv2
import numpy as np
from openvino.inference_engine import IECore

As for the preprocessing, part of it is already present inside the converted model (scaling, changing mean, inverting input channels width and height), but that is not all. We need to make sure the image has a proper size (224 pixels both sides) and the dimensions are correct – batch, channel, width, height.

img = cv2.imread("/home/openvino/zebra.jpg")
img = cv2.resize(img, (224, 224))
img = np.expand_dims(img, 0)
img_openvino = img.transpose((0,3, 1, 2))

Now, we can try a simple OpenVINO prediction. We will use one synchronous request.

ie = IECore()
net = ie.read_network(model="/home/openvino/models/converted/saved_model.xml", weights="/home/openvino/models/converted/saved_model.bin")
input_name = next(iter(net.input_info))
output_name = next(iter(net.outputs))
net.batch_size = 1
# number of request can be specified by parameter num_requests, default 1
exec_net = ie.load_network(network=net, device_name="CPU")
# we have one request only, see num_requests above
request = exec_net.requests[0]
# infer() waits for the result
# for asynchronous processing check async_infer() and wait()
request.infer({input_name: img_openvino})
# read the result
prediction_openvino_blob = request.output_blobs[output_name]
prediction_openvino = prediction_openvino_blob.buffer

Our result, prediction_openvino, is an ordinary NumPy array with the shape (batch dimension, number of classes) = (1, 1000). To print the top 3 values as before, we can use following couple of lines.

ids = np.argsort(prediction_openvino)[0][::-1][:3]
probabilities = np.sort(prediction_openvino)[0][::-1][:3]
for id, prob in zip(ids, probabilities):
    print(f"{id}\t{prob}")

We get exactly the same results as before. Our code works!

Comparing Results with Original TensorFlow Model

Now, let’s do the same with TensorFlow model. Do not forget to preprocess the image first. Prepared function preprocess_input can be used for that.

import tensorflow as tf
img_tf =  tf.keras.applications.mobilenet_v2.preprocess_input(img)
model = tf.keras.models.load_model("/home/openvino/models/tf2")
prediction_tf = model.predict(img_tf)

The results are almost the same, the difference is so small that we can ignore it. The top result from this prediction has probability 0.95860416, compared to 0.9556126 we had before. The order of the other predictions might be slightly different, because the values are so tiny.

By the way, there is a build-in function decode_predictions, which will not only give you the top results, but also class names and codes instead of just ids. Top 3 TensorFlow predictions.

from tensorflow.keras.applications.mobilenet_v2 import decode_predictions
top3 = decode_predictions(prediction_tf, top=3)

Here is the result:

[[('n02391049', 'zebra', 0.95860416), ('n02643566', 'lionfish', 0.0032956717), ('n01968897', 'chambered_nautilus', 0.0008273276)]]

Benchmarking

We should mention that there is also a tool for benchmarking OpenVINO models called Benchmark Python Tool. It offers synchronous (latency-oriented) and asynchronous (throughput-oriented) measuring modes. Unfortunately, it does not work for other models (like TensorFlow) and cannot be used for direct comparison.

How OpenVINO Helped Us

Enough of code, at the end of the article we will add some numbers. In Ximilar, we use often recognition models with resolution of 224 or 512 pixels. In a batch of 1 or 10. We used TensorFlow Lite format as often as possible, because it is very fast to load. (See a comparison here.) Because of a fast loading, it is not necessary to have the model in the cache all the time. To make running TF Lite models faster, we enhance the performance with XNNPACK.

Below you can see a chart with results for MobileNet V2. For the batches, we show prediction time in seconds per single image. Tests were done on our production workers.

Summary

In this article, we briefly introduced some of the basic functionality of OpenVINO. Of course, there is much more to try. We hope that this article has motivated you to try it yourself and maybe continue to explore all the possibilities through more advanced resources.

The post OpenVINO: Start Optimizing Your TensorFlow 2 Models for Intel CPUs with Docker appeared first on Ximilar: Visual AI for Business.

The Best Resources on Artificial Intelligence and Machine Learning

Michal Lukáč — Wed, 08 Jul 2020 06:37:24 +0000

Over the years, we machine learning engineers at Ximilar have gathered a lot of interesting ML/AI material from which we draw. I have chosen the best ones from podcasts to online courses that I recommend to listen to, read, or check out. Some of them are basic and introductory, others more advanced. However, all of them are high-quality ones made by the best people in the field and they are worth checking. If you are interested in the current progress of AI or you are just curious about what will be in the future then you are on the right page. AI will change all possible fields, whether it is physics, law, healthcare, cryptocurrencies, or retail and one should be prepared for what is to come…

Podcasts

If there is one medium that has become popular in recent years, it must be podcasts. Everyone is doing it right now – there are podcasts about sex, politics, tech, healthcare, brains, bicycles… and AI is not missing. But one of them stands out. It is a podcast by Lex Fridman. This MIT alumni is doing an incredible job by interviewing the top people from the field, famous people included (like Garry Kasparov or Elon Musk). Some episodes are more about science, physics, the mind, startups, and the future of humanity. The ideas presented in the podcast are just mind-blowing. The talks are deep but clever and it will take you some time to get through them.

The Turing test is a recursive test. The Turing test is a test on us. It is a test of whether people are intelligent enough to understand themselves.

Lex Fridman and Garry Kasparov [Youtube]
Lex Fridman and Sam Altman (CEO of OpenAI) [Youtube]
Lex Fridman and Eliezer Yudkowsky [Youtube]
Lex Fridman and Max Tegmark [Youtube]
Lex Fridman and Ilya Sutskever [Youtube]
and many more

Another great podcast is Brain Inspired by Paul Middlebrooks with interesting guests. It shows and discusses topics from Neuroscience and AI and how these fields are connected.

Books

Life 3.0 by Max Tegmark – How will AI change healthcare, jobs, justice, or war? Max Tegmark is a professor at MIT who has written this provocative and engaging book about the future. He tries to answer a lot of questions like What is intelligence? Can a machine have a consciousness? Can we control AI? … This is a great introduction even for non-technical people.

Human Compatible by Stuart Russel is an important book that asks questions on how to coexist with intelligent AI in future.

AI Superpowers: China, Silicon Valley, and the New World Order by Kai-Fu Lee – A book about the incredible progress in AI in China.

Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron – Do you know how to code and would you like to start with some experiments? This book is not only about one of the most popular programming frameworks (TensorFlow) but also about modern techniques in machine learning and neural networks. You will code your first image recognition model and learn how to pre-process and analyze text.

Deep Learning for Coders with Fastai and PyTorch by Jeremy Howard and Sylvain Gugger – Another great book for coders. Code examples are in the PyTorch framework. Jeremy Howard is a famous researcher and developer in the AI community. His Fastai project helps millions of people to get into deep learning.

There are many more interesting books oriented for software developers like Deep Learning with Python by François Chollet. Looking for more hardcore books with math equations? Then try Deep Learning by MIT Press. If you are interested in classic approaches, then many university students will remember preparing for exams with Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig or Bishop’s Pattern Recognition and Machine Learning. (These two are a bit advanced and many topics are for a master’s or even PhD level.)

Magazines

MIT Technology Review is a great magazine with the latest news and trends in technology and future innovations. The magazine covers also other interesting topics as biotechnology, blockchain, space, climate change, and more. There is a print or digital access option for you.

People to Follow

There are a lot of famous Scientists & Engineers & Entrepreneurs to follow. For example, often mentioned Jeremy Howard (fast.ai), Andrej Karpathy (Tesla AI), Yann LeCun (Facebook AI), Rachel Thomas (fast.ai, data ethics), Francois Chollet (Google), Fei-Fei Li (Stanford), Anima Anandkumar (Nvidia AI), Demis Hassabis (DeepMind), Geoffrey Hinton (Google), Eliezer Yudkowsky (AI Alignment), Ilya Sutskever (OpenAI) and more…

Lectures & Online Courses

So you’ve read some books and articles and now you want to start digging a little deeper? Or do you want to become a Machine Learning Specialist? Then start with some online courses. Of course, you will need to learn a little bit about math before and get some basic programming skills. Online courses are a great option if you can’t study at university or you want to get knowledge at your own pace. Here are some of the courses that can serve you as the starting point:

Elements of AI – was created by the University of Helsinki for a broader audience, it’s not very technical and can be a great introduction for beginners
Machine Learning course from Andrew Ng – this one is a classic and most popular one for a number of reasons, it’s great introductory material.
To learn more math, we can recommend Mathematics for Machine Learning.
Deep Learning specialization is more about modern approaches to neural networks.
There are a lot of great specializations on Udacity by top companies and engineers from various fields like Healthcare or Automotive.
CS231n, CS224N and CS224W are great Stanford courses for computer vision, natural language processing (NLP) and graphs, including video lectures, slides, and materials. It’s FREE!
6.034 and 6.S191 – lectures for AI and Deep Learning by MIT on YouTube.
Practical Deep Learning for Coders by fast.ai – Jeremy Howard is doing a great job here by explaining concepts, and ideas and showing the code in Jupiter notebooks.
PyImageSearch – offers great introductory tutorials in the computer vision field.
Full Stack Deep Learning – great course for the whole cycle of developing machine learning systems

Research Blogs

You know how to code and you even know how to build your CNN? Or are you just simply interested in what is the future of the field and how companies are using AI? Check out some of the latest trends and SOTA approaches from the top research groups in the world. There are several giants like Apple, Facebook and Google pushing the AI boundaries:

Facebook AI Research – most of the research from the Facebook team is done in Recommender systems, NLP, and Computer Vision.
Google AI Blog – Google is probably the most dominant player in AI, check out, for example, their weather prediction system.
Microsoft Research – Microsoft has one of the oldest research groups of all companies. It is investing heavily in AI, Computer Vision and Augmented Reality (AR)
Google Deepmind blog – using AI to solve difficult problems from healthcare solutions to playing StarCraft 2.
Open AI Blog – how to solve Rubik’s cube by robotic hand or would you like to generate music in one click?
Baidu Research – research blog by one of the largest internet companies in China.
Malong – research by a company focused on AI for the retail industry (Malong provides in-store product recognition & loss prevention AI to Walmart and other major retailers)
NVIDIA Blog and AI research – the biggest GPU creator is doing research in many fields (from accelerating research speed in healthcare to improving the gaming experience).
Distill – beautiful and interactive visualizations and explanations of the topics from deep learning. People behind this project are from Open AI, Tesla, Google…

There are also a lot of AI research labs located at top universities such as MIT, Stanford, or Berkeley.

Great Articles in the AI Field

We are always looking for high-quality content which is why some of the following articles can be a bit longer. AI is a complex field which is disrupting the way we live and do business:

The New Business of AI article by Andreessen Horowitz.
The AI Revolution: The road to superintelligence article by Tim Urban.
The Global AI Index – which country is the most innovative and which country is investing the most resources? Right now, the USA is still dominating but China is catching up rapidly.
AI and Efficiency – algorithmic progress has yielded more gains than classical hardware efficiency.
Reflecting on a year of making machine learning actually useful – iterating over datasets is much more important than the latest model architectures.
15 Tech Experts Share Potential Impacts Of AI On Society
State of AI: State of AI reports by year
50 most promising AI companies in America – promising AI startups in the USA
Google “We Have No Moat, And Neither Does OpenAI” – why opensource outcompete Google and Microsoft in future

Newsletters

Data Science Weekly and Deep Learning Weekly – as the names suggest this is every week news from data science and machine learning.
The Algorithm – a newsletter released by MIT.
The Batch – a newsletter by deeplearning.ai.
Alignment – a newsletter by Rohin Shah, it is also released in Chinese.
SeminAnalysis – a newsletter on substack about Cloud, AI and Hardware.

Trends & Problems

AI Alignment & Safety problem – Have future super-intelligent AGI systems the same values as humanity? This is one of the toughest and most important challenges. With the AI race started by OpenAI/Microsoft and Google via Large language models (LLM) & multi-modal models, we have less time to solve AI safety.
Ethics, Transparency & Safety & Regulations – Should countries ban the usage of face recognition technology? [source][source] Is it ethical to scrape the data from the internet to build your face search startup? [source] What is an unethical use of AI? [source] What about autonomous weapons for defensive purposes? Are social media polarizing people with their clever algorithms optimized for more clicks/likes/…? [source]
Jobs replacement – Will AI replace all manufacturing and basic jobs? Or will the knowledge workers be first? Will the research in AI create even more job opportunities? What is going to happen in countries that are heavily dependent on manual work labour? [source] Will companies that are using robots/clever algorithms pay AI Tax one day? With Large Language Models (LLM) models many content and copywriters are losing their jobs. The same is happening with graphics designers with generative AI like Midjourney. Github Copilot and similar tools will one day probably replace programmers. Being a programmer myself I’m not sure if I’m happy about it and in ten years maybe I will need to switch to another job profession.
Interpretability & Explainability & Racial bias – Why did the deep learning model predict X and not Y? What has the neural network actually learned? How can we fool the model with adversarial attacks to make it predict wrong? Can models discriminate because of your race? it is a big issue not only in Face recognition, Insurance, and Healthcare. [source]
Generative models – GANs, and generative models like stable diffusion, are an incredible technology that brings a lot of challenges. Have you heard about Deep Fakes videos? One day, a large percentage of internet content will be created by generative AI. The Deep Fakes will be unrecognizable from human content. This could create new problems in politics, business, security or our personal lives. Will there be some proof by human protocol then?
Big and Small models, IoT and Environment impact – Bigger models can lead to incredible results in NLP [source][source]. However, only a few top companies as Microsoft, Google or Amazon have the resources to train them. On the other hand, there is also more research to make models lighter and faster with binarization or pruning techniques. Small models do not require computers with GPUs. Not every part of the world is connected to fast internet and AI on edge devices are becoming more popular.

Biggest Breakthroughs in AI

A lot of things happened during the last few years, here are some research articles that pushed the boundaries of AI by a large margin ordered by date (since 2010):

ImageNet Classification with Deep Convolutional Neural Networks by Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton – developed the first big CNN architecture trained on GPUs that was able to improve image classification problems on Imagenet (2012)
Efficient Estimation of Word Representations in Vector Space by Tomas Mikolov and Google team – a simple architecture for encoding words as vectors/embeddings (2013)
Playing Atari with Deep Reinforcement Learning by Volodymyr Mnih et al. (2013)
Generative Adversarial Networks – a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models by Ian Goodfellow (2014)
Attention Is All You Need by Ashish Vaswani – improving attention mechanism and proposing a Transformer architecture (2017)
Language Models are Few-Shot Learners – paper about GPT 3 version by OpenAI team (2020)

Here is the hall of fame in complex Artificial Intelligence projects:

AlphaGo by Google/DeepMind for beating the best human players in the GO game
AlphaFold by Google/DeepMind for solving the protein structure prediction (2020)
LLM / GPT-3 and ChatGPT by OpenAI for advanced language model that can do a LOT of things with texts and language
DALL-E 2 and stable diffusion models by OpenAI and Midjourney for advances in image generation (2022)

That is all for now. There are other great resource lists like the one from DeepMind, from which I got inspired. The list is divided by the level of the target audience – introductory, intermediate, and advanced. We will try to keep this post updated and if we find a gem, it will definitely appear here. There is much more material from which you can learn, but now it’s up to you to start your own machine-learning journey. We will try to keep this article updated with the latest news in Artificial Intelligence.

The post The Best Resources on Artificial Intelligence and Machine Learning appeared first on Ximilar: Visual AI for Business.

Engineering - Ximilar: Visual AI for Business

The Best Tools for Machine Learning Model Serving

Machine Learning Models Serving Then and Now

The Machine Learning and DevOps Challenges

Deployment Strategies

Other Challenges

Our Approach to the Machine Learning Model Serving

Which Framework Is The Best?

The Battle of Machine Learning Frameworks

The ONNX Format

Exploration is Key

The Best Machine Learning Serving Tools

Deploy ML Models With API

TensorFlow Serving

TorchServe

Triton

Ray Serve

MLflow

Other Useful Tools for Machine Learning Model Serving

Cloud Options of AWS, GCP and Azure

Large Language (LLMs) and Multi-Modal Models in Production

Ximilar AI Platform

Deployment of ML Models is Science

My Final Tips & Recommendations

Pick a good framework to start with

Save your models in different formats

Pick a serving system suitable to your needs

How to Build a Good Visual Search Engine?

What Exactly Does a Visual Search Engine Mean?

Typical Visual Search Engines:Google Lens & Pinterest Lens

The Algorithm for TrainingVisual Similarity

The Components of a Visual Search Tool

How to Create a Visual Hash?

Types of Neural Networks

Training the Similarity Model

Optimization Functions

How it Works

Loss Function: A Few Tips

The Distance Between Vectors

Supervised & Unsupervised Machine Learning Methods

1) Supervised Learning

Triplet Neural Network and Online/Offline Mining

Online & Offline Mining

N-pair Models

Proxy-Based Methods

MultiSimilarity Loss

2) Unsupervised Learning

AutoEncoder

Image Inpainting

Tips for Training Deep Metric Models

Neural Text Search on Images with CLIP

The Training Datafor Visual Search Engines

How Does Image Augmentation Help With Training Datasets?

Removing or Replacing Background

Background Removal Solution

GAN-Based Methods for Generating New Training Data

Annotation System for Building Image Search Datasets

How to Improve Visual Search Engine Results?

Include Tags

Include Filtering Categories

Include Dominant Colours

Use Object Detection & Segmentation

Use Optical Character Recognition (OCR)

Improve Image Resolution

Combine Multiple Approaches

Search Engine and Vector Databases

Building a Visual Search Engine on a Machine Learning Platform

Custom Similarity Service

Examples of Visual Search Solutions for Business

The Workflow of Buildinga Visual Search Engine

Visual Search is Not OnlyFor the Big Companies

How to Convert a Video Into a Streaming Format?

The Automated Video Processing

What is the Streaming Format, and What is it Good For?

HLS or DASH Streaming Format – Comparison

How much space do the HLS and DASH formats take up?

What is the data structure of HLS and DASH?

How to convert .mp4, .mkv or .mov videos to HLS?

How to convert videos to DASH?

How can I stream videos on my website?

Typical Visual Search Engines:
Google Lens & Pinterest Lens

The Algorithm for Training
Visual Similarity

The Training Data
for Visual Search Engines

The Workflow of Building
a Visual Search Engine

Visual Search is Not Only
For the Big Companies