Machine Learning Model - Ximilar: Visual AI for Business
https://www3.ximilar.com/blog/tag/machine-learning-model/

The Best Tools for Machine Learning Model Serving
https://www.ximilar.com/blog/the-best-tools-for-machine-learning-model-serving/
Wed, 25 Oct 2023

An overview and analysis of serving systems and deployment methods for Machine Learning and AI models.

As the prevalence of AI in various industries increases, so does the need to optimize machine learning model serving. As a machine learning engineer, I’ve seen that training models is just one part of the ML journey. The careful selection of deployment strategies and serving systems is just as important.

In this article, we’ll delve into the importance of selecting the right tools for machine learning model serving, and talk about their pros and cons. We’ll explore various deployment options, serving systems like TensorFlow Serving, TorchServe, Triton, Ray Serve, and MLflow, and also the deployment of specific models such as large language models (LLMs). I’ll also provide some thoughts and recommendations for navigating this ever-evolving landscape.

Machine Learning Model Serving Then and Now

When I first began my journey in the world of machine learning, the landscape was constantly shifting. The frameworks being actively developed and used at the time included Caffe, Theano, TensorFlow (Google) and PyTorch (Meta), all vying for their place in the world of AI. As time has passed, the competition has become more and more lopsided, with TensorFlow and PyTorch leading the way. While TensorFlow has remained the more popular choice for production-ready models, PyTorch has been steadily gaining in popularity, particularly within research circles, for its faster, more intuitive prototyping capabilities.

While there are hundreds of libraries available to train and optimize models, the most popular frameworks such as TensorFlow, PyTorch and scikit-learn are all based on the Python programming language. Python is often chosen due to its simplicity and the vast number of libraries for data manipulation. However, it is not the fastest language, and its Global Interpreter Lock (GIL) can cause problems with threads and parallel processing. Additionally, specialized libraries such as spaCy and PyG are available for specific tasks, such as Natural Language Processing (NLP) and graph analysis, respectively. The focus was, and partially still is, on the optimization of models and architectures. On the other hand, the large-scale adoption of AI brings more and more problems with serving machine learning models in production.

Nowadays, even more complex models like large language models (LLMs such as GPT, LLaMA and Bard) and multi-modal models are in fashion, which puts more pressure on optimal model deployment, infrastructure environment and storage capacity. Making machine learning model serving and deployment effective and cheap is a big problem. Even companies like Microsoft or NVIDIA are actively working on solutions that will cut its costs. So let’s look into some of the best options that we as developers currently have.

The Machine Learning and DevOps Challenges

Being a Machine Learning Engineer, I can say that training a model is just a small part of the whole lifecycle. Data preparation, the deployment process and running the model smoothly for numerous customers are daily challenges and a major part of the job.

Deployment Strategies

In addition to having to allocate GPU/CPU resources and manage inference speed, the company deploying ML models must also consider the deployment strategy for the trained model. You could be deploying the ML model as an API, running it in a container, or using a serverless platform. Each of these options comes with its own set of benefits and drawbacks, so carefully considering the best approach is essential. When we have a trained model, there are several options on how to use it:

  • Deploy it as an API endpoint, sending data in the request and getting results immediately in response. This approach is suitable for faster models that are able to process the data in just a few seconds.
  • Deploy it as an API endpoint, but return just a promise or asynchronous response from the model. This is great for computationally intensive models that can take minutes or hours of processing. For example, generative models and upscaling models are slow and require this approach.
  • Use a system that is able to serve it for you.
  • Use the model locally on your data.
  • Deploy models on smartphones or IoT devices with a feed from local sensors.
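
To make the difference between the first two options concrete, here is a minimal sketch that uses only the Python standard library. The slow_model function and the AsyncInferenceService class are made up for illustration. A real service would expose submit and status through HTTP endpoints and would most likely keep the jobs in a persistent queue rather than in memory.

```python
import time
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict


def slow_model(text: str) -> str:
    """Stand-in for a slow model such as an upscaler (hypothetical)."""
    time.sleep(0.1)
    return text.upper()


class AsyncInferenceService:
    """Accepts a job, returns an id at once, and serves the result later."""

    def __init__(self, workers: int = 2) -> None:
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._jobs: Dict[str, Future] = {}

    def submit(self, text: str) -> str:
        # What a "POST /jobs" handler would do: enqueue and return a job id.
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(slow_model, text)
        return job_id

    def status(self, job_id: str) -> dict:
        # What a "GET /jobs/<id>" handler would do: report progress or result.
        future = self._jobs[job_id]
        if not future.done():
            return {"status": "pending"}
        return {"status": "done", "result": future.result()}
```

With the synchronous option the client would simply wait for slow_model to return. Here the client gets a job id immediately and polls status until it reports "done".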

Other Challenges

The complexity of machine learning projects grows with variables such as:

  • The number of models – It is common practice to use multiple models. For example, at this moment, there are tens of thousands of different ML models on the Ximilar platform.
  • Model versions – You can train each of your models on different training data (part of the dataset) and mark it as a different version. Model versioning is great for A/B testing your ML model, tuning model performance, and continuous model training.
  • Format of models – You can potentially train and save your ML models in various formats. For instance, .h5 which is a Keras/TensorFlow format or .pt (PyTorch) or .onnx for ONNX Runtime. Usually, each framework supports only specific formats.
  • The number of frameworks – Served ML models could be trained with different frameworks and their versions.
  • The number of nodes (servers) – Models can be hosted on one or multiple servers, and the serving system should be able to intelligently load balance the requests across servers so that none of them is overloaded.
  • Models storage/registry – You need to store the ML models in some database or storage, such as AWS S3 or local storage.
  • Speed/performance – The loading time of models from the storage can be critical and can cause a slow inference per sample.
  • Ease of use – Calling models via REST API or gRPC requests, with single or batch inference.
  • Hardware specification – ML models can be deployed on Edge devices or PCs with various architectures.
  • GPUs vs CPUs and libraries – Some models can run only on CPUs and some require a GPU card.
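
Two of these variables, model versions and A/B testing, can be illustrated with a small in-memory registry. This is only a sketch under my own assumptions: the ModelRegistry class, its method names and the percentage-based traffic split are hypothetical and do not describe how any particular serving system (including ours) works internally.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Model = Callable[[float], float]  # hypothetical model signature


class ModelRegistry:
    """In-memory registry: (name, version) -> model, plus a traffic split."""

    def __init__(self) -> None:
        self._models: Dict[Tuple[str, int], Model] = {}
        self._traffic: Dict[str, List[Tuple[int, int]]] = {}

    def register(self, name: str, version: int, model: Model) -> None:
        self._models[(name, version)] = model

    def set_traffic(self, name: str, split: Dict[int, int]) -> None:
        # split maps version -> percentage of requests, e.g. {1: 90, 2: 10}.
        if sum(split.values()) != 100:
            raise ValueError("traffic percentages must sum to 100")
        missing = [v for v in split if (name, v) not in self._models]
        if missing:
            raise KeyError(f"unregistered versions: {missing}")
        self._traffic[name] = sorted(split.items())

    def pick_version(self, name: str, client_id: str) -> int:
        # Hash the client id into a bucket 0..99 so that the same client
        # is always routed to the same version.
        digest = hashlib.sha256(client_id.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        upper = 0
        for version, percent in self._traffic[name]:
            upper += percent
            if bucket < upper:
                return version
        raise RuntimeError("no version selected")  # unreachable for a valid split

    def predict(self, name: str, client_id: str, x: float) -> float:
        return self._models[(name, self.pick_version(name, client_id))](x)
```

Hashing the client id instead of drawing a random number keeps each client on one version for the whole experiment, which makes the A/B results easier to compare.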

Our Approach to Machine Learning Model Serving

Several systems were developed to tackle these problems. Serving and deploying machine learning models has come a long way since we founded Ximilar in 2016. Back then, no system was capable of effectively serving hundreds of neural networks for inference.

So, we decided to build our own system for machine learning model serving, and today it forms the backbone of our machine-learning platform. As the use of AI becomes more widespread in companies, newer systems such as TensorFlow Serving emerge quickly to meet the increasing demand.

Which Framework Is The Best?

The Battle of Machine Learning Frameworks

Nowadays, each big tech company has its own solution for machine learning model serving and training. To name a few: PyTorch (TorchServe) and AITemplate by Meta (Facebook), TensorFlow (TF Serving) by Google, ONNX Runtime by Microsoft, Triton by NVIDIA, Multi-Model-Server by Amazon, and many others like BentoML or Ray.

There are also dozens of formats that you can save your ML model in. TensorFlow alone is able to save into .h5, .pb, SavedModel or .tflite formats, each of them serving a different purpose. For example, TensorFlow Lite is great for smartphones. It also loads very fast, so the availability of the model is great. However, it supports only a limited set of operations, and more modern architectures often cannot be converted to it.

Machine learning model serving: each big tech company has their own solution for training and serving machine learning models.

You can also try to convert models from PyTorch or TensorFlow to TensorRT and OpenVINO formats. The conversion usually works with basic and most-used architectures. TensorRT is great if you are deploying ML models on NVIDIA Jetson Nano or Xavier. You can achieve a boost in performance on Intel servers via OpenVINO conversion or the Neural Magic library.

The ONNX Format

One notable thing is the ONNX format. ONNX is not a library for training your machine learning models; it is an open format for storing them. After training a model, for example in TensorFlow, you can convert it to the ONNX format. You can then run the converted model via ONNX Runtime on almost any platform, programming language and CPU architecture, with your preferred hardware acceleration. Serving a model sometimes requires specific versions of libraries, and ONNX can solve a lot of these dependency problems.

Exploration is Key

There are a lot of options for ML model training, saving, conversion and deployment. Every library has its pros and cons. Some of them are easy to use for training and development. Others are specialized for specific platforms or for specific fields (computer vision, recommender systems or NLP).

I would recommend you invest some time in exploring all the frameworks and systems before deciding which framework you would like to lock in. The competition is tough in this field and every company tries to be as innovative as possible to keep up with the others. Even the Chinese company Baidu has developed its own solution called PaddlePaddle. At the end of the article, I will give some recommendations on which frameworks and serving systems you should use and when.

The Best Machine Learning Serving Tools

OK, let’s say that you trained your own model or downloaded one that has already been trained. Now you would like to deploy a machine-learning model in production. Here are a few options that you can try.

If you don’t know how to train a machine learning model, you can start with this tutorial by PyTorch.

Deploy ML Models With API

If you have one or a few models, you can build your own system for ML model serving. With Python and libraries such as Flask or Django, there is a straightforward way to develop a simple REST API. When the web service starts, it loads the model in the background and then every incoming request will call the model on the incoming data.

It could get problematic if you want to work effectively with GPU cards and handle parallel requests. I would recommend packaging the system with Docker and then running it in Kubernetes.

With Kubernetes, Docker and a smart load balancer such as HAProxy, such a system can potentially scale to bigger volumes. Java and Go are also good languages for deploying ML models.

Here is a simple tutorial with a scikit-learn model as REST API with Flask.
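
The idea of loading the model once at startup and calling it on every incoming request can be sketched even without a web framework. The example below uses only the Python standard library instead of Flask or Django, and the model is a made-up function that averages its inputs. The /predict path and the "features" key are my own assumptions, not part of any existing API.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import List


def load_model():
    """Stand-in for loading real weights once at startup (hypothetical model)."""
    def predict(features: List[float]) -> float:
        return sum(features) / len(features)
    return predict


MODEL = load_model()  # loaded once, reused by every request


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        if self.path != "/predict":
            self.send_error(404, "unknown endpoint")
            return
        # Read the JSON body, run the model, and send the result back as JSON.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": MODEL(payload["features"])}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args) -> None:
        pass  # keep the console quiet in this sketch


def make_server(port: int = 0) -> HTTPServer:
    """Port 0 lets the OS choose a free port; call serve_forever() to start."""
    return HTTPServer(("127.0.0.1", port), PredictHandler)
```

Calling make_server(8000).serve_forever() starts the service. It handles one request at a time, which is exactly the limitation that makes GPU usage and parallel requests harder to get right in a home-made system.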

Now let’s have a look at the open-source serving systems that you can use out of the box, usually with a small piece of code or no code at all.

TensorFlow Serving

GitHub | Docs

TensorFlow Serving is a modern serving system for TensorFlow ML models. It’s a part of TensorFlow Extended developed by Google. The recommended way of using the system is via Docker.

Simply run the docker pull tensorflow/serving command (optionally tensorflow/serving:latest-gpu if you need GPU support). Then run the image via Docker:

docker run -p 8501:8501 \
  --mount type=bind,source=/path/to/my_model/,target=/models/my_model \
  -e MODEL_NAME=my_model -t tensorflow/serving

Now that the system is serving your model, you can query it with gRPC or REST calls. For more information, read the documentation. TensorFlow Serving works best with the SavedModel format. The model should define its signature_def_map, which specifies the inputs and outputs of the model. If you would like to dive into the system, I recommend the videos made by the team itself.
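
As a rough sketch of the REST variant, the following snippet builds a request for the predict endpoint of the container started above. It assumes the default REST port 8501 and the model name my_model from the docker command. The shape of the instances is hypothetical and depends on the inputs your model's signature defines.

```python
import json
import urllib.request
from typing import Any, List


def build_predict_request(instances: List[Any], model_name: str = "my_model",
                          host: str = "localhost", port: int = 8501) -> urllib.request.Request:
    # TensorFlow Serving exposes REST prediction at /v1/models/<name>:predict.
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def predict(instances: List[Any]) -> Any:
    # Requires a running TensorFlow Serving container; not called in this sketch.
    with urllib.request.urlopen(build_predict_request(instances), timeout=10) as resp:
        return json.loads(resp.read())["predictions"]
```

For example, predict([[1.0, 2.0, 5.0]]) would send one instance with three values and return the list found under the "predictions" key of the response.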

In my opinion, TensorFlow Serving is great with simple models and just a few versions. The documentation, however, could be simpler. With advanced architectures, you will need to define the custom operations, which is a big disadvantage if you have a lot of models with more modern operations.

TorchServe

GitHub | Docs

TorchServe is a more modern system than TensorFlow Serving. The documentation is clean, and the system supports basically everything that TF Serving does, only for PyTorch models. Before serving a PyTorch model via TorchServe, you need to convert it to a .mar package. Basically, the .mar package contains the model name, version, architecture and actual weights of the model. Installation and running are also possible via Docker, and it is very similar to TensorFlow Serving.

I personally like the management of the models. You can simply register new models by sending API requests, list models and query statistics. I find TorchServe very simple to use. Both REST API and gRPC are available. If you are working with pure PyTorch models, then TorchServe is the recommended way.
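
To give an idea of what these management requests look like, here is a small sketch that only builds them. It assumes TorchServe's default ports (8081 for the management API, 8080 for inference), and the archive name my_model.mar is hypothetical. Newer TorchServe releases may also require an authorization token, which this sketch leaves out.

```python
import urllib.parse
import urllib.request

MANAGEMENT = "http://localhost:8081"  # assumed default management address
INFERENCE = "http://localhost:8080"   # assumed default inference address


def register_model_request(mar_url: str, initial_workers: int = 1) -> urllib.request.Request:
    # POST /models?url=<archive> registers a new model from a .mar package.
    query = urllib.parse.urlencode({"url": mar_url, "initial_workers": initial_workers})
    return urllib.request.Request(f"{MANAGEMENT}/models?{query}", method="POST")


def list_models_request() -> urllib.request.Request:
    # GET /models lists the registered models.
    return urllib.request.Request(f"{MANAGEMENT}/models", method="GET")


def predict_request(model_name: str, payload: bytes) -> urllib.request.Request:
    # POST /predictions/<name> runs inference on the raw request body.
    return urllib.request.Request(
        f"{INFERENCE}/predictions/{model_name}", data=payload, method="POST"
    )
```

Each of these objects can be passed to urllib.request.urlopen once a TorchServe instance is running.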

Triton

GitHub | Docs

Both of the serving systems mentioned above are tightly bound to the frameworks of the models they are able to serve. That is probably why Triton has a big advantage over them, since it can serve both TensorFlow and PyTorch models. It is also able to serve OpenVINO, ONNX and TensorRT formats! That means it supports all the major formats in the machine learning field. Even though NVIDIA developed it, it doesn’t require a GPU card and can also run on CPUs.

To run Triton, simply pull it from the Docker registry via the docker pull nvcr.io/nvidia/tritonserver command. The Triton server is able to load models from a specific directory called model_repository. Each model is defined with a configuration, which contains a platform setting that defines the model format, for example, “tensorflow_graphdef” or “onnxruntime_onnx”. In this way, Triton knows how to run specific models.
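
The layout of such a repository can be sketched with a short script that creates the folders and a config.pbtxt file for one model. The tensor names, data types and shapes below are placeholders that I made up for an image classifier. In practice they must match your exported model, and the model file itself (for example model.onnx) still has to be copied into the version folder.

```python
from pathlib import Path


def render_config(name: str, platform: str) -> str:
    # Input/output names and dims are hypothetical; adjust them to your model.
    return "\n".join([
        f'name: "{name}"',
        f'platform: "{platform}"',
        "max_batch_size: 8",
        "input [",
        '  { name: "input" data_type: TYPE_FP32 dims: [ 3, 224, 224 ] }',
        "]",
        "output [",
        '  { name: "output" data_type: TYPE_FP32 dims: [ 1000 ] }',
        "]",
    ]) + "\n"


def create_model_entry(repository: Path, name: str, platform: str, version: int = 1) -> Path:
    """Create <repository>/<name>/<version>/ and <repository>/<name>/config.pbtxt."""
    model_dir = repository / name
    (model_dir / str(version)).mkdir(parents=True, exist_ok=True)
    config_path = model_dir / "config.pbtxt"
    config_path.write_text(render_config(name, platform))
    return config_path
```

Calling create_model_entry(Path("model_repository"), "my_model", "onnxruntime_onnx") produces a folder structure that the server can be pointed to.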

The documentation is not super-easy to read (mostly GitHub README files) because it is in very active development. Otherwise, working with the models is similar to other serving systems, meaning calling models via gRPC or REST.

Ray Serve

GitHub | Docs

Ray is a general-purpose system for scaling machine learning workloads. Ray Serve, its serving library, focuses on model serving and provides the primitives for you to build your own ML platform on top.

Ray Serve offers a more Pythonic way of creating your own serving system. It is framework-agnostic, and anything that can be run via Python can also be run with Ray. Basically, it looks as simple as Flask. You define a simple Python class for your model and decorate it with a route prefix handler. Then you just send a request to the REST API.

import requests
from starlette.requests import Request
from typing import Dict

from ray import serve

# 1: Define a Ray Serve deployment.
@serve.deployment(route_prefix="/")
class MyModelDeployment:
    def __init__(self, msg: str):
        # Initialize model state: could be very large neural net weights.
        self._msg = msg

    def __call__(self, request: Request) -> Dict:
        return {"result": self._msg}

# 2: Deploy the model.
serve.run(MyModelDeployment.bind(msg="Hello world!"))

# 3: Query the deployment and print the result.
print(requests.get("http://localhost:8000/").json())

If you want to have more control over the system, Ray is a great option. There is a Ray Clusters library which is able to deploy the system on your own Kubernetes Cluster, AWS or GCP with the ability to configure the autoscaling option.

MLflow

MLflow is an open-source platform for the whole ML lifecycle, from training to evaluation, deployment, tracking, model monitoring and a central model registry.

MLflow offers a robust API and several language bindings for the whole management of the machine learning model’s lifecycle. There is also a UI for tracking your trained models. MLflow is really a mature package with a whole bundle of components that your team can use.

Other Useful Tools for Machine Learning Model Serving

  • Multi-Model-Server is a similar system to the previous ones. Developed by the Amazon AWS team, the system is able to run models trained with MXNet or converted via ONNX.
  • BentoML is a project very similar to MLflow. There are many different tools that data scientists can use for training and deployment processes. The UI looks a bit more modern. BentoML is also able to automatically generate Docker images for your models.
  • KServe is a simple system for managing and scaling models on your Kubernetes cluster. It handles deployment and autoscaling, and provides a standardized inference protocol across ML frameworks.

Cloud Options of AWS, GCP and Azure

Of course, every big tech player provides cloud platforms to host and serve your machine learning models. Let’s have a quick look at a few examples.

Microsoft is a big supporter of ONNX, so with Azure Machine Learning services, you are able to deploy your models to the cloud via Python or Azure CLI. The process requires an entry script in Python with two methods: init for initialization of your model and run for inference. You can find the entire workflow in Azure development documentation.

The Google Cloud Platform (GCP) has good support for TensorFlow as it is their native framework. However, Docker deployment is available, so other frameworks can be used too. There are multiple ways to achieve the deployment. The classic way will be using the AI Platform prediction tool or Google Cloud Run. There is also a serverless HTTP endpoint/function, which serves your model stored in the Google Cloud Storage bucket. You define your function in Python with the prediction method and loading of the model.

Amazon Web Services (AWS) also contains multiple options for the ML deployment process and serving. The specialized system for machine learning is Amazon SageMaker.

All the big platforms allow you to create your own virtual server instances, set up your Kubernetes clusters and use any of the systems/frameworks mentioned earlier. Nevertheless, you need to be very careful because it could get really pricey. There are also smaller players on the market such as Banana, Seldon and Comet ML for training, serving & deployment. I personally don’t have experience with them, but they are becoming more popular.

Large Language Models (LLMs) and Multi-Modal Models in Production

With the introduction of GPT by OpenAI, a new class of AI models was introduced – the large language models (LLMs). These models are extremely big, trained on massive datasets and deployed on an infrastructure that requires a whole datacenter to run. “Smaller” models, usually open-source versions, are released as well, but they also require a lot of computational resources and modern servers to run smoothly.

Recently, several serving systems for these models were developed:

  • OpenLLM by BentoML is a nice system that supports almost all open-source models like Llama2. You can just pick one of the models and run the following commands to start with the serving and query the results:

openllm start opt
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'Explain to me the difference between "further" and "farther"'
  • vLLM is a Python library that can help you with the deployment of an LLM as an API server. What is great is that it provides an OpenAI-compatible server, so you can easily switch from the paid OpenAI service to an open-source variant without rewriting the client code. This project is being developed at UC Berkeley and integrates new techniques for fast LLM inference.

  • SkyPilot is a great option if you want to run LLMs on cloud providers such as AWS, Google Cloud or Azure. Because running these models is costly, SkyPilot is able to pick the cheapest provider automatically and launch the model as an endpoint.
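
The OpenAI compatibility mentioned for vLLM means that the same client code can target either service by changing only the base URL and the model name. The sketch below builds such a request with the standard library. The local address http://localhost:8000/v1, the placeholder API key and the model names are assumptions for illustration, so check them against your own deployment.

```python
import json
import urllib.request


def completion_request(base_url: str, model: str, prompt: str,
                       api_key: str = "EMPTY", max_tokens: int = 64) -> urllib.request.Request:
    # Both services are assumed to accept the same JSON body at <base_url>/completions.
    body = json.dumps({"model": model, "prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/completions",
        data=body,
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {api_key}"},
        method="POST",
    )
```

A request for a self-hosted server could then be created with completion_request("http://localhost:8000/v1", "facebook/opt-125m", "Hello"), while the hosted variant would use the provider's base URL, model name and a real API key.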

Ximilar AI Platform

Free Login | Docs

Last but not least, you can use our codeless machine-learning platform. Instead of writing a lot of code, training and deploying an ML model by yourself, you can try it in the Ximilar App. Training image classification and object detection can be done both in the browser App or via API. There is every tool that you would need in the ML model development stage, such as training data/image management, labelling tools, evaluation of your models on testing and training datasets, performance metrics, explanation of models on specific images, and so on.

Ximilar’s computer vision platform enables you to develop AI-powered systems for image recognition, visual quality control, and more without knowledge of coding or machine learning. You can combine them as you wish and upgrade any of them anytime.

Once your model is trained, it is deployed as a REST API endpoint. It can be connected to a workflow of more machine learning models working together with conditions like if-else statements. The major benefit is that you just connect your system to the API and query the results. All the training and serving problems are solved by us. In the end, you will save a lot of costs because you don’t need to own or rent your own infrastructure, serving systems or a specialized machine learning engineering team.

We built a Ximilar Platform so that businesses from e-commerce, healthcare, manufacturing, real estate and other areas could simply develop their own AI models without coding and with a reasonable budget. For example, on the following screen, you can see our task management for the trading cards collector community.

We and our customers use our platform for the training of machine learning models. Together with our own system for machine learning model serving, it is an all-in-one solution for ML model deployment.

The great thing is that everything is manageable via REST API requests with JSON responses. Here is a simple curl command to query all models in production:

curl --request GET \
  --url https://api.ximilar.com/recognition/v2/task/ \
  --header 'Content-Type: application/json' \
  --header 'authorization: Token APITOKEN'
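
The same request can be sent from Python. The following sketch mirrors the curl command with the standard library only. The function names are my own, and APITOKEN stands for your real token, just like in the command above.

```python
import json
import urllib.request

API_URL = "https://api.ximilar.com/recognition/v2/task/"


def list_tasks_request(api_token: str) -> urllib.request.Request:
    # Same URL and headers as the curl command.
    return urllib.request.Request(
        API_URL,
        headers={"Content-Type": "application/json", "Authorization": f"Token {api_token}"},
        method="GET",
    )


def list_tasks(api_token: str) -> dict:
    # Sends the request and decodes the JSON response; needs a valid token.
    with urllib.request.urlopen(list_tasks_request(api_token), timeout=10) as resp:
        return json.loads(resp.read())
```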

Deployment of ML Models is Science

There are a lot of systems that try to make deployment and serving easy. The topic of deployment & serving is broad, with many choices for hardware infrastructure, DevOps, programming languages, system development, costs, storage, and scaling. So it is not easy to pick one. If you would like to dig deeper, I would suggest the following content for further reading:

My Final Tips & Recommendations

Pick a good framework to start with

Having done machine learning for more than 10 years, I advise you to start by picking a good framework for model development. In my opinion, the best choice right now is PyTorch. Using it is easy and it supports a lot of state-of-the-art architectures.

I was a fan of TensorFlow for a long time, but over time, its developers were not able to integrate modern approaches. Also, backward compatibility is often broken and the quality of the code is getting worse, which leads to more and more bugs in the framework.

Save your models in different formats

Second, save your models in different formats. I would also recommend using ONNX and OpenVINO here. You never know when you will need them. This happened to me a few times. We needed to upgrade the server and systems (our production environment), but the new versions of libraries stopped supporting the specific format of the model, so we had to switch to a different one.

Pick a serving system suitable to your needs

If you are a small company, then Ray Serve is a good option. Bigger companies, on the other hand, have complex requirements for development and robust infrastructure. In this case, I would recommend picking more complex systems like MLflow. If you would like to serve the models on the cloud, then look at Multi-Model-Server. The choice is really based on the use case. If you don’t want to bother with all of this, then try our Ximilar Platform, which is a solution for model optimization, model validation, data storage and model deployment as API.

I will keep this article updated, and if a promising new serving system appears, I will be more than happy to mention it here. After all, machine learning is about constant progress, and that is one of the things I like about it the most.

When OCR Meets ChatGPT AI in One API
https://www.ximilar.com/blog/when-ocr-meets-chatgpt-ai-in-one-api/
Wed, 14 Jun 2023

Introducing the fusion of optical character recognition (OCR) and conversational AI (ChatGPT) as an online REST API service.

Imagine a world where machines not only have the ability to read text but also comprehend its meaning, just as effortlessly as we humans do. Over the past two years, we have witnessed extraordinary advancements in these areas, driven by two remarkable technologies: optical character recognition (OCR) and ChatGPT (generative pre-trained transformer). The combined potential of these technologies is enormous and offers assistance in numerous fields.

That is why we at Ximilar have recently developed an OCR system, integrated it with ChatGPT and made it available via API. It is one of the first publicly available services combining OCR software and the GPT model, supporting several alphabets and languages. In this article, I will provide an overview of what OCR and ChatGPT are, how they work, and – more importantly – how anyone can benefit from their combination.

What is Optical Character Recognition (OCR)?

OCR (Optical Character Recognition) is a technology that can quickly scan documents or images and extract text data from them. OCR engines are powered by artificial intelligence & machine learning. They use object detection, pattern recognition and feature extraction.

OCR software can read not only printed but also handwritten text in an image or a document and provide you with the extracted text in a file format of your choosing.

How Does Optical Character Recognition Work?

When an OCR engine is provided with an image, it first detects the position of the text. Then, it uses an AI model for reading individual characters to find out what the text in the scanned document says (text recognition).

This way, OCR tools can provide accurate information from virtually any kind of image file or document type. To name a few examples: PDF files containing camera images, scanned documents (e.g., legal documents), old printed documents such as historical newspapers, or even license plates.

A few examples of OCR: transcribing books to electronic form, reading invoices, passports, IDs and landmarks.

Most OCR tools are optimized for specific languages and alphabets. We can tune these tools in many ways, for example, to automate the reading of invoices, receipts, or contracts. They can also specialize in handwritten or printed paper documents.

The basic outputs from OCR tools are usually the extracted texts and their locations in the image. The data extracted with these tools can then serve various purposes, depending on your needs, from uploading the extracted text to simple Word documents to converting the recognized text to speech for visually impaired users.
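
To illustrate what "texts and their locations" can look like in practice, here is a small sketch. The TextBox structure and the grouping rule are my own simplification and not the output format of any particular OCR engine: boxes whose top edges lie close to each other are treated as one line and then read from left to right.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextBox:
    text: str
    x: int       # left edge in pixels
    y: int       # top edge in pixels
    width: int
    height: int


def to_reading_order(boxes: List[TextBox], line_tolerance: int = 10) -> str:
    """Group boxes into lines by their top edge, then read each line left to right."""
    lines: List[List[TextBox]] = []
    for box in sorted(boxes, key=lambda b: b.y):
        # A box joins the current line if it starts at roughly the same height.
        if lines and abs(lines[-1][0].y - box.y) <= line_tolerance:
            lines[-1].append(box)
        else:
            lines.append([box])
    return "\n".join(
        " ".join(b.text for b in sorted(line, key=lambda b: b.x)) for line in lines
    )
```

Real documents with columns, tables or rotated text need a proper layout analysis, but even this simple ordering is often enough to turn detected boxes into a plain text that other tools can process.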

OCR programs can also do a layout analysis for transforming text into a table. They can also integrate natural language processing (NLP) for further text analysis and named entity recognition (NER), for example, identifying numbers, famous people or locations in the text, like ‘Albert Einstein’ or ‘Eiffel Tower’.

Technologies Related to OCR

You can also meet the term optical word recognition (OWR). This technology is not as widely used as the optical character recognition software. It involves the recognition and extraction of individual words or groups of words from an image.

There is also optical mark recognition (OMR). This technology can detect and interpret marks made on paper or other media. It can work together with OCR technology, for instance, to process and grade tests or surveys.

And last but not least, there is intelligent character recognition (ICR). It is a specific OCR optimised for the extraction of handwritten text from an image. All these advanced methods share some underlying principles.

What are GPT and ChatGPT?

Generative pre-trained transformer (GPT) is an AI text model that is able to generate textual outputs based on input (prompt). GPT models are large language models (LLMs) powered by deep learning and relying on neural networks. They are incredibly powerful tools and can do content creation (e.g., writing paragraphs of blog posts), proofreading and error fixing, explaining concepts & ideas, and much more.

The Impact of ChatGPT

ChatGPT, introduced by OpenAI, is an extension of the GPT model, which is further optimized for conversations. It has had a great impact on how we search, work with and process data.

GPT models are trained on huge amounts of textual data, so they know more about many topics than an average human being. In my case, ChatGPT definitely has better English writing & grammar skills than me. Here’s an example of ChatGPT explaining quantum computing:

ChatGPT model explaining quantum computing. [source: OpenAI]

It is no overstatement to say that the introduction of ChatGPT revolutionized data processing, analysis, search, and retrieval.

How Can OCR & GPT Be Combined For Smart Text Extraction

The combination of OCR with GPT models enables us to use this technology to its full potential. GPT can understand, analyze and edit textual inputs. That is why it is ideal for post-processing of the raw text data extracted from images with OCR technology. You can give the text to the GPT and ask simple questions such as “What are the items on the invoice and what is the invoice price?” and get an answer with the exact structure you need.
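
A minimal sketch of this post-processing step could look as follows. The prompt wording, the field names and both helper functions are hypothetical. They only show the idea of asking for a fixed JSON structure and validating the reply, not how our API builds its prompts. Sending the prompt to a GPT model is left out on purpose.

```python
import json
from typing import Dict, List


def build_extraction_prompt(ocr_text: str, fields: List[str]) -> str:
    """Combine the raw OCR text with an instruction to return fixed JSON keys."""
    wanted = ", ".join(fields)
    return (
        "The following text was extracted from a document by OCR.\n"
        f"Return a JSON object with exactly these keys: {wanted}.\n"
        "Use null for anything that is missing.\n\n"
        f"OCR text:\n{ocr_text}"
    )


def parse_reply(reply: str, fields: List[str]) -> Dict[str, object]:
    """Keep only the requested keys; missing keys become None."""
    data = json.loads(reply)
    return {field: data.get(field) for field in fields}
```

For an invoice, fields could be ["items", "total_price"], and parse_reply would then turn the model's answer into a dictionary that an accounting system can consume.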

This was a very hard problem just a year ago, and a lot of companies were trying to build intelligent document-reading systems, investing millions of dollars in them. The large language models are really game changers and major time savers. It is great that they can be combined with other tools such as OCR and integrated into visual AI systems.

It can help us with many things, including extracting essential information from images and putting it into text documents or JSON. In the future, it can revolutionize search engines and streamline automated text translation or entire workflows of document processing and archiving.

Examples of OCR Software & ChatGPT Working Together

So, now that we can combine computer vision and advanced natural language processing, let’s take a look at how we can use this technology to our advantage.

Reading, Processing and Mining Invoices From PDFs

One of the typical uses of OCR software is reading the data from invoices, receipts, or contracts in image-only PDFs (or other documents). Imagine that some of the invoices and receipts your accounting department accepts are physical printed documents. You could scan the document, and instead of opening it in Adobe Acrobat and doing manual data entry (which is still a standard procedure in many accounting departments today), you would let the automated OCR system handle the rest.

Scanned documents can be automatically sent to the API from both computers and mobile phones. The visual AI needs only a few hundred milliseconds to process an image. Then you will get textual data with the desired structure in JSON or another format. You can easily integrate such technology into accounting systems and internal infrastructures to streamline invoice processing, payments or SKU numbers monitoring.

Receipt analysis via Ximilar OCR and OpenAI ChatGPT.

Trading Card Identifying & Reading Powered by AI

In recent years, the collector community for trading cards has grown significantly. This has been accompanied by the emergence of specialized collector websites, comparison platforms, and community forums. And with the increasing number of both cards and their collectors, there has been a parallel demand for automating the recognition and cataloguing of collectibles from images.

Ximilar has been developing AI-powered solutions for some of the biggest collector websites on the market. And adding an OCR system was an ideal solution for data extraction from both cards and their graded slabs.

Automatic Recognition of Collectibles

Ximilar built an AI system for the detection, recognition and grading of collectibles. Check it out!

We developed an OCR system that extracts all text characters from both the card and its slab in the image. Then GPT processes these texts and provides structured information, for instance the name of the player, the card, its grade and the name of the grading company, or labels from PSA.

Extracting text from the trading card via OCR and then using GPT prompt to get relevant information.

Needless to say, we are pretty big fans of collectible cards ourselves. So we’ve been enjoying working on AI not only for sports cards but also for trading card games. We recently developed several solutions tuned specifically for the most popular trading card games such as Pokémon, Magic: The Gathering or Yu-Gi-Oh! and have been adding new features and games constantly. Do you like the idea of trading card recognition automation? See how it works in our public demo.

How Can I Use the OCR & GPT API On My Images or PDFs?

Our OCR software is publicly available via an online REST API. This is how you can use it:

  1. Log into Ximilar App

    • Get your free API TOKEN to connect to API – Once you sign up to Ximilar App, you will get a free API token, which allows your authentication. The API documentation is here to help you with the basic setup. You can connect it with any programming language and any platform like iOS or Android. We provide a simple Python SDK for calling the API.

    • You can also try the service directly in the App under Computer Vision Platform.

  2. For simple text extraction from your image, call the endpoint read.

    https://api.ximilar.com/ocr/v2/read
  3. For text extraction from an image and its post-processing with GPT, use the endpoint read_gpt. To get the results in the desired structure, you will need to specify the prompt query along with your input images in the API request, and the system will return the results immediately.

    https://api.ximilar.com/ocr/v2/read_gpt
  4. The output is JSON with an ‘_ocr’ field. This dictionary contains texts, a list of detected words and sentences, each with the polygon that encloses it in the image, the recognized text and its probability. The full_text field contains all strings concatenated together. The API also returns the language name (“lang_name”) and language code (“lang”; ISO 639-1). Here is an example:

    {
      "_url": "__URL_PATH_TO_IMAGE__",
      "_ocr": {
        "texts": [
          {
            "polygon": [[53.0,76.0],[116.0,76.0],[116.0,94.0],[53.0,94.0]],
            "text": "MICKEY MANTLE",
            "prob": 0.9978849291801453
          },
          ...
        ],
        "full_text": "MICKEY MANTLE 1st Base Yankees",
        "lang_name": "english",
        "lang_code": "en"
      }
    }

    Our OCR engine supports several alphabets (Latin, Chinese, Korean, Japanese and Cyrillic) and languages (English, German, Chinese, …).
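The steps above can be sketched in a few lines of Python using only the standard library. Please treat this as an illustration under stated assumptions rather than the official client: the `records` payload shape and the `Authorization: Token …` header are assumptions that should be checked against the API documentation, and `build_request` and `confident_words` are hypothetical helper names. The prompt field for read_gpt is not named in this article, so it is left out.

```python
import json
import urllib.request

# Endpoint taken from the article.
OCR_READ_URL = "https://api.ximilar.com/ocr/v2/read"


def build_request(image_url, token, endpoint=OCR_READ_URL):
    """Prepare (but do not send) a POST request for one image.

    ASSUMPTION: the payload shape {"records": [{"_url": ...}]} and the
    "Authorization: Token <token>" header are not stated in the article.
    """
    payload = {"records": [{"_url": image_url}]}
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Token {token}",
        },
        method="POST",
    )


def confident_words(record, min_prob=0.9):
    """Return the detected texts whose probability is at least min_prob.

    Works on one record shaped like the JSON example above.
    """
    ocr = record.get("_ocr", {})
    return [t["text"] for t in ocr.get("texts", []) if t.get("prob", 0.0) >= min_prob]
```

To actually send the request, you could pass the object returned by `build_request` to `urllib.request.urlopen` and decode the JSON body; the sketch stops short of that so it can be run without a token.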

Integrate the Combination of OCR and ChatGPT In Your System

All our solutions, including the combination of OCR & GPT, are available via API. Therefore, they can be easily integrated into your system, website, app, or infrastructure.

Here are some examples of up-to-date solutions that can easily be built on our platform and automate your workflows:

  • Detection, recognition & text extraction system – You can let the users of your website or app upload images of collectibles and get relevant information about them immediately. Once they take an image of the item, our system detects its position (and can mark it with a bounding box). Then, it recognizes its features (e.g., name of the card, collectible coin or comic book), extracts texts with OCR and you will get text data for your website (e.g., in a table format).

  • Card grade reading system – If your users upload images of graded cards or other collectibles, our system can detect everything including the grades and labels on the slabs in a matter of milliseconds.

  • Comic book recognition & search engine – You can extract all texts from each image of a comic book and automatically match it to your database for cataloguing.

  • Giving your collection or database of collectibles order – Imagine you have a website featuring a rich collection of collectible items, getting images from various sources and comparing their prices. The metadata can be quite inconsistent amongst source websites, or be absent in the case of user-generated content. AI can recognize, match, find and extract information from images based purely on computer vision and independent of any kind of metadata.

Let’s Build Your Solution

If you would like to learn more about how you can automate the workflows in your company, I recommend browsing our page All Solutions, where we briefly explained each solution. You can also check out pages such as Visual AI for Collectibles, or contact us right away to discuss your unique use case. If you’d like to learn more about how we work on customer projects step by step, go to How it Works.

Ximilar’s computer vision platform enables you to develop AI-powered systems for image recognition, visual quality control, and more without knowledge of coding or machine learning. You can combine them as you wish and upgrade any of them anytime.

Don’t forget to visit the free public demo to see how the basic services work. Your custom solution can be assembled from many individual services. This modular structure enables us to upgrade or change any piece anytime, while you save your money and time.

]]>
Predict Values From Images With Image Regression https://www.ximilar.com/blog/predict-values-from-images-with-image-regression/ Wed, 22 Mar 2023 15:03:45 +0000 https://www.ximilar.com/?p=12666 With image regression, you can assess the quality of samples, grade collectible items or rate & rank real estate photos.

The post Predict Values From Images With Image Regression appeared first on Ximilar: Visual AI for Business.

]]>
We are excited to introduce the latest addition to Ximilar’s Computer Vision Platform. Our platform is a great tool for building image classification systems, and now it also includes image regression models. They enable you to extract values from images accurately and efficiently, and to save on labor costs.

Let’s take a look at what image regression is and how it works, including examples of the most common applications. More importantly, I will tell you how you can train your own regression system on a no-code computer vision platform. As more and more customers seek to extract information from pictures, this new feature is sure to provide Ximilar’s customers with the tools they need to stay ahead of the curve in today’s highly competitive AI-driven market.

What is the Difference Between Image Categorization and Regression?

Image recognition models are ideal for the recognition of images or objects in them, their categorization and tagging (labelling). Let’s say you want to recognize different types of car tyres or their patterns. In this case, categorization and tagging models would be suitable for assigning discrete features to images. However, if you want to predict any continuous value from a certain range, such as the level of tyre wear, image regression is the preferred approach.

Image regression is an advanced machine-learning technique that can predict continuous values within a specific range. Whenever you need to rate or evaluate a collection of images, an image regression system can be incredibly useful.

For instance, you can define a range of values, such as 0 to 5, where 0 is the worst and 5 is the best, and train an image regression task to predict the appropriate rating for given products. Such predictive systems are ideal for assigning values to several specific features within images. In this case, the system would provide you with highly accurate insights into the wear and tear of a particular tyre.

Predicting the level of tyre wear from an image is a use case for an image regression task, while a categorization task can recognize the pattern of the tyre.

How to Train Image Regression With a Computer Vision Platform?

Simply log in to Ximilar App and go to Categorization & Tagging. Upload your training pictures and under Tasks, click on Create a new task and create a Regression task.

Creating an image regression task in Ximilar App.

You can train regression tasks and test them via the same front end or with API. You can develop an AI prediction task for your photos with just a few clicks, without any coding or any knowledge of machine learning.

This way, you can create an automatic grading system able to analyze an image and provide a numerical output in the defined range.

Use the Same Training Data For All Your Image Classification Tasks

Both image recognition and image regression methods fall under the image classification techniques. That is why the whole process of working with regression is very similar to categorization & tagging models.

Working with image regression model on Ximilar computer vision platform.

Both technologies can work with the same datasets (training images), and inputs of various image sizes and types. In both cases, you can simply upload your data set to the platform, and after creating a task, label the pictures with appropriate continuous values, and then click on the Train button.

Apart from a machine learning platform, we offer a number of AI solutions that are field-tested and ready to use. Check out our public demos to see them in action.

If you would like to build your first image classification system on a no-code machine learning platform, I recommend checking out the article How to Build Your Own Image Recognition API. We defined the basic terms in the article How to Train Custom Image Classifier in 5 Minutes. We also made a basic video tutorial:

Tutorial: train your own image recognition model with Ximilar platform.

Neural Network: The Technology Behind Predicting Range Values on Images

The simplest technique for predicting float values is linear regression. This can be further extended to polynomial regression. These two statistical techniques work well on tabular input data. However, when it comes to predicting numbers from images, a more advanced approach is required. That’s where neural networks come in. Mathematically speaking, a neural network “f” can be trained to predict a value “y” from a picture “x”, or “y = f(x)”.
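To make the tabular case concrete, here is a minimal sketch of ordinary least squares for a single input feature. The data (tread depth in millimetres mapped to a 0 to 5 rating) is made up purely for illustration, and `fit_line` is a hypothetical helper name. With images, the hand-picked feature is replaced by whatever the neural network learns from the pixels.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with one input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel out.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


# Hypothetical tabular data: tread depth (mm) and a wear rating from 0 to 5.
depth_mm = [8.0, 6.0, 4.0, 2.0]
rating = [5.0, 3.5, 2.0, 0.5]

a, b = fit_line(depth_mm, rating)
predicted = a * 5.0 + b  # predicted rating for a tyre with 5 mm of tread
```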

Neural networks can be thought of as approximations of functions that we aim to identify through optimization on training data. The most commonly used NNs for image-based predictions are convolutional neural networks (CNNs), vision transformers (ViTs), or a combination of both. These powerful tools analyze pictures pixel by pixel, and learn relevant features and patterns that are essential for solving the problem at hand.

CNNs are particularly effective in picture analysis tasks. They are able to detect features at different spatial scales and orientations. Meanwhile, ViTs have been gaining popularity due to their ability to learn visual features without being constrained by spatial invariance. When used together, these techniques can provide a comprehensive approach to image-based predictions. We can use them to extract the most relevant information from images.

What Are the Most Common Applications of Value Regression From Images?

Estimating Age From Photos

Probably the most widely known use case of image regression is age prediction. You can come across it on social media platforms and in mobile apps, such as Facebook, Instagram, Snapchat, or FaceApp. They apply deep learning algorithms to predict a user’s age based on their facial features and other details.

While image recognition provides information on the object or person in the image, the regression system tells us a specific value – in this case, the person's age.

Needless to say, these plugins are not always correct and can sometimes produce biased results. Despite this limitation, various image regression models are gaining popularity on various social sites and in apps.

Ximilar already provides a face-detection solution. Models such as age prediction can be easily trained and deployed on our platform and integrated into your system.

Value Prediction and Rating of Real Estate Photos

Pictures play an essential part on real estate sites. When people are looking for a new home or investment, they are navigating through the feed mainly by visual features. With image regression, you are able to predict the state, quality, price, and overall rating of real estate from photos. This can help with both searching and evaluating real estate.

Predicting rating, and price (regression) for household images with image regression.

Custom recognition models are also great for the recognition & categorization of the features present in real estate photos. For example, you can determine whether a room is furnished, what type of room it is, and categorize the windows and floors based on their design.

Additionally, a regression can determine the quality or state of floors or walls, as well as rank the overall visual aesthetics of households. You can store all of this information in your database. Your users can then use such data to search for real estate that meets specific criteria.

Image classification systems such as image recognition and value regression are ideal for real estate ranking. Your visitors can search the database with the extracted data.

Determining the Degree of Wear and Tear With AI

Visual AI is increasingly being used to estimate the condition of products in photos. While recognition systems can detect individual tears and surface defects, regression systems can estimate the overall degree of wear and tear of things.

A good example of an industry that has seen significant adoption of such technology is the insurance industry. For example, startups like Lemonade Inc. or Root use AI when paying out insurance claims.

With custom image recognition and regression methods, it is now possible to automate the process of insurance claims. For instance, a visual AI system can indicate the seriousness of damage to cars after accidents or assess the wear and tear of various parts such as suspension, tires, or gearboxes. The same goes with other types of insurance, including households, appliances, or even collectible & antique items.

Our platform is commonly utilized to develop recognition and detection systems for visual quality control & defect detection. Read more in the article Visual AI Takes Quality Control to a New Level.

Automatic Grading of Antique & Collectible Items Such as Sports Cards

Apart from car insurance and damage inspection, recognition and regression are great for all types of grading and sorting systems, for instance on price comparators and marketplaces of collectible and antique items. Deep learning is ideal for the automatic visual grading of collector items such as comic books and trading cards.

By leveraging visual AI technology, companies can streamline their processes, reduce manual labor significantly, cut costs, and enhance the accuracy and reliability of their assessments, leading to greater customer satisfaction.

Food Quality Estimation With AI

Biotech, Med Tech, and Industry 4.0 also have a lot of applications for regression models. For example, they can estimate the approximate level of fruit & vegetable ripeness or freshness from a simple camera image.

The grading of vegetables by an image regression model.

For instance, this Japanese farmer is using deep learning for cucumber quality checks. Looking for quality control or estimation of size and other parameters of olives, fruits, or meat? You can easily create a system tailored to these use cases without coding on the Ximilar platform.

Build Custom Evaluation & Grading Systems With Ximilar

Ximilar provides a no-code visual AI platform accessible via App & API. You can log in and train your own visual AI without the need to know how to code or have expertise in deep learning techniques. It will take you just a few minutes to build a powerful AI model. Don’t hesitate to test it for free and let us know what you think!

Our developers and annotators are also able to build custom recognition and regression systems from scratch. We can help you with the training of the custom task and then with the deployment in production. Both custom and ready-to-use solutions can be used via API or even deployed offline.

]]>
How to Build a Good Visual Search Engine? https://www.ximilar.com/blog/how-to-build-a-good-visual-search-engine/ Mon, 09 Jan 2023 14:08:28 +0000 https://www.ximilar.com/?p=12001 Let's take a closer look at the technology behind visual search and the key components of visual search engines.

The post How to Build a Good Visual Search Engine? appeared first on Ximilar: Visual AI for Business.

]]>
Visual search is one of the most in-demand computer vision solutions. Our team at Ximilar has been actively developing the best general multimedia visual search engine for retailers, startups, as well as bigger companies, who need to process a lot of images, video content, or 3D models.

However, a universal visual search solution is not the only thing that customers around the world will require in the future. Especially smaller companies and startups now more often look for custom or customizable visual search solutions for their sites & apps, built in a short time and for a reasonable price. What does creating a visual search engine actually look like? And can a visual search engine be built by anyone?

This article should provide a bit deeper insight into the technology behind visual search engines. I will describe the basic components of a visual search engine, analyze approaches to machine learning models and their training datasets, and share some ideas, training tips, and techniques that we use when creating visual search solutions. Those who do not wish to build a visual search from scratch can skip right to Building a Visual Search Engine on a Machine Learning Platform.

What Exactly Does a Visual Search Engine Mean?

The technology of visual search in general analyses the overall visual appearance of the image or a selected object in an image (typically a product), observing numerous features such as colours and their transitions, edges, patterns, or details. It is powered by AI trained specifically to understand the concept of similarity the way you perceive it.

In a narrow sense, visual search usually refers to a process in which a user uploads a photo, which is used as an image search query by a visual search engine. This engine in turn provides the user with either identical or similar items. You can find this technology under terms such as reverse image search, search by image, or simply photo & image search.

However, reverse image search is not the only use of visual search. The technology has numerous applications. It can search for near-duplicates, match duplicates, or recommend more or less similar images. All of these visual search tools can be used together in an all-in-one visual search engine, which helps internet users find, compare, match, and discover visual content.

And if you combine these visual search tools with other computer vision solutions, such as object detection, image recognition, or tagging services, you get a quite complex automated image-processing system. It will be able to identify images and objects in them and apply both keywords & image search queries to provide as relevant search results as possible.

Different computer vision systems can be combined on Ximilar platform via Flows. If you would like to know more, here’s an article about how Flows work.

Typical Visual Search Engines: Google Lens & Pinterest Lens

Big visual search industry players such as Shutterstock, eBay, Pinterest (Pinterest Lens) or Google (Google Lens & Google Images) have already implemented visual search engines, as well as other advanced, yet hidden algorithms, to satisfy the increasing needs of online shoppers and searchers. It is predicted that a majority of big companies will implement some form of soft AI in their everyday processes in the next few years.

The Algorithm for Training Visual Similarity

The Components of a Visual Search Tool

Multimedia search engines are very powerful systems consisting of multiple parts. The first key component is storage (database). It wouldn’t be exactly economical to store the full sample (e.g., .jpg image or .mp4 video) in a database. That is why we do not store any visual data for visual search. Instead, we store just a representation of the image, called a visual hash.

The visual hash (also visual descriptor or embedding) is basically a vector, representing the data extracted from your image by the visual search. Each visual hash should be a unique combination of numbers to represent a single sample (image). These vectors also have some mathematical properties, meaning you can compare them, e.g., with cosine, Hamming, or Euclidean distance.

So the basic principle of visual search is: the more similar the images are, the more similar their vector representations will be. Visual search engines such as Google Lens are able to compare incredible volumes of images (i.e., their visual hashes) to find the best match in a hundred milliseconds via smart indexing.
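The three distances mentioned above are short enough to write out by hand. The following is a minimal sketch in plain Python; the function names are my own, and in practice you would probably use a numerical library. Euclidean and cosine distance suit real-valued embeddings, while Hamming distance is typically used for binary hashes.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """1 - cosine similarity: 0 for the same direction, 1 for orthogonal vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def hamming_distance(u, v):
    """Number of positions at which two equally long hashes differ."""
    return sum(a != b for a, b in zip(u, v))
```

Note that cosine distance ignores vector length, so `[1, 2]` and `[2, 4]` count as identical, whereas Euclidean distance does not.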

How to Create a Visual Hash?

The visual hashes can be extracted from images by standard algorithms such as PHASH. However, the era of big data gives us a much stronger model for vector representation – a neural network. A simple overview of the image search system built with a neural network can look like this:

Extracting visual vectors with the neural network and searching with them in a similarity collection.

This neural network was trained on images from a website selling cosmetics. Here, it extracted the embeddings (vectors), and they were stored in a database. Then, when a customer uploads an image to the visual search engine on the website, the neural network will extract the embedding vector from this image as well, and use it to find the most similar samples.

Of course, you could also store other metadata in the database, and do advanced filtering or add keyword search to the visual search.
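Here is a minimal sketch of that lookup, assuming the embeddings were already extracted and stored next to some metadata. The item IDs, the `category` field and the three-dimensional vectors are invented for illustration, and the search is a plain linear scan, not the smart indexing a production engine would use.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def search(index, query_vector, top_k=3, category=None):
    """Rank stored items by similarity to the query, optionally filtered by metadata."""
    candidates = [
        item for item in index
        if category is None or item["category"] == category
    ]
    ranked = sorted(
        candidates,
        key=lambda item: cosine_similarity(item["vector"], query_vector),
        reverse=True,  # most similar first
    )
    return [item["id"] for item in ranked[:top_k]]


# Hypothetical similarity collection: one embedding per product image.
index = [
    {"id": "lipstick-01", "category": "lips", "vector": [0.9, 0.1, 0.0]},
    {"id": "lipstick-02", "category": "lips", "vector": [0.8, 0.2, 0.1]},
    {"id": "mascara-01", "category": "eyes", "vector": [0.1, 0.9, 0.2]},
]

# Embedding the network would extract from the customer's uploaded photo.
query = [1.0, 0.0, 0.0]
best_matches = search(index, query, top_k=2)
```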

Types of Neural Networks

There are several basic architectures of neural networks that are widely used for vector representations. You can encode almost anything with a neural network. The most common for images is a convolutional neural network (CNN).

There are also special architectures to encode words and text. Lately, so-called transformer neural networks are starting to be more popular for computer vision as well as for natural language processing (NLP). Transformers use a lot of new techniques developed in the last few years, such as an attention mechanism. The attention mechanism, as the name suggests, is able to focus only on the “interesting” parts of the image & ignore the unnecessary details.

Training the Similarity Model

There are multiple methods to train models (neural networks) for image search. First, we should know that training of machine learning models is based on your data and loss function (also called objective or optimization function).

Optimization Functions

The loss function usually computes the error between the output of the model and the ground truth (labels) of the data. This value is used for adjusting the weights of a model. The model can be interpreted as a function and its weights as parameters of this function. Therefore, if the value of the loss function is big, you should adjust the weights of the model.

How it Works

The model is trained iteratively, taking subsamples of the dataset (batches of images) and going over the entire dataset multiple times. We call one such pass over the dataset an epoch. For each batch, the model computes the loss function value and adjusts its weights accordingly. The gradients that drive these weight updates are computed by an algorithm called backpropagation. Training is usually finished when the loss function no longer decreases.
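The loop described above can be sketched on a toy problem. This is a deliberately tiny stand-in, not how an image model is trained: the "model" has a single weight `w`, the gradient is derived by hand instead of by backpropagation, and the data, learning rate and tolerance are arbitrary assumptions. The structure of epochs, batches and a stopping rule is the same, though.

```python
def train(xs, ys, batch_size=2, learning_rate=0.05, max_epochs=200, tolerance=1e-9):
    """Mini-batch gradient descent for the one-weight model y = w * x."""
    w = 0.0
    previous_loss = float("inf")
    history = []  # loss on the whole dataset after each epoch
    for _ in range(max_epochs):
        # One epoch = one pass over the dataset, batch by batch.
        for start in range(0, len(xs), batch_size):
            batch_x = xs[start:start + batch_size]
            batch_y = ys[start:start + batch_size]
            # Gradient of the mean squared error with respect to w on this batch.
            grad = sum(2 * (w * x - y) * x for x, y in zip(batch_x, batch_y)) / len(batch_x)
            w -= learning_rate * grad
        loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
        history.append(loss)
        # Stop once the loss no longer improves by a meaningful amount.
        if previous_loss - loss < tolerance:
            break
        previous_loss = loss
    return w, history


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2 * x
w, history = train(xs, ys)
```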

We can divide the methods (based on loss function) depending on the data we have. Imagine that we have a dataset of images, and we know the class (category) of each image. Our optimization function (loss function) can use these classes to compute the error and modify the model.

The advantage of this approach is its simple implementation. It’s practically only a few lines in any modern framework like TensorFlow or PyTorch. However, it also has a big disadvantage: the class-level optimization functions don’t scale well with the number of classes. We could potentially have thousands of classes (e.g., there are thousands of fashion products and each product represents a class). The computation of such a function with thousands of classes/arguments can be slow. There could also be a problem with fitting everything on the GPU card.

Loss Function: A Few Tips

If you work with a lot of labels, I would recommend using a pair-based loss function instead of a class-based one. The pair-based function usually takes two or more samples from the same class (i.e., the same group or category). A model based on a pair-based loss function doesn’t need to output prediction for so many unique classes. Instead, it can process just a subsample of classes (groups) in each step. It doesn’t know exactly whether the image belongs to class 1 or 9999. But it knows that the two images are from the same class.

Images can be labelled manually or by a custom image recognition model. Read more about image recognition systems.

The Distance Between Vectors

The picture below shows the data in the so-called vector space before and after model optimization (training). In the vector space, each image (sample) is represented by its embedding (vector). Our vectors have two dimensions, x and y, so we can visualize them. The objective of model optimization is to learn the vector representation of images. The loss function is forcing the model to predict similar vectors for samples within the same class (group).

By similar vectors, I mean that the Euclidean distance between the two vectors is small. The larger the distance, the more different these images are. After the optimization, the model assigns a new vector to each sample. Ideally, the model should maximize the distance between images with different classes and minimize the distance between images of the same class.

Optimization for visual search should maximize the distance of items between different categories and minimize the distance within the category.

Sometimes we don’t know anything about our data in advance, meaning we do not have any metadata. In such cases, we need to use unsupervised or self-supervised learning, about which I will talk later in this article. Big tech companies do a lot of work with unsupervised learning. Special models are being developed for searching in databases. In research papers, this field is often called deep metric learning.

Supervised & Unsupervised Machine Learning Methods

1) Supervised Learning

As I mentioned, if we know the classes of images, the easiest way to train a neural network for vectors is to optimize it for the classification problem. This is a classic image recognition problem. The loss function is usually cross-entropy loss. In this way, the model is learning to predict predefined classes from input images. For example, to say whether the image contains a dog, a cat or a bird. We can get the vectors by removing the last classification layer of the model and getting the vectors from some intermediate layer of the network.
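For reference, the cross-entropy loss for a single image can be written out in a few lines. This is a plain-Python sketch with made-up scores (logits) for three classes; deep learning frameworks ship their own, numerically hardened versions.

```python
import math


def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    highest = max(logits)
    # Subtracting the maximum keeps exp() from overflowing; the result is unchanged.
    exps = [math.exp(z - highest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, true_class):
    """Negative log-probability that the model assigns to the correct class."""
    return -math.log(softmax(logits)[true_class])


# Hypothetical scores for the classes [dog, cat, bird]; the image shows a dog.
confident_and_right = cross_entropy([5.0, 0.0, 0.0], true_class=0)
confident_and_wrong = cross_entropy([0.0, 5.0, 0.0], true_class=0)
```

The loss is small when the model puts most of the probability on the correct class and large when it is confidently wrong, which is exactly the signal the weight updates need.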

When it comes to the pair-based loss function, one of the oldest techniques for metric learning is the Siamese network (contrastive learning). The name contains “Siamese” because there are two identical models with the same weights. In the Siamese network, we need to have pairs of images, which we label based on whether they are or aren’t equal (i.e., from the same class or not). Pairs in the batch that are equal are labelled with 1 and unequal pairs with 0.
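One common form of the contrastive loss for a single pair looks as follows. It uses the labelling from the paragraph above (1 for equal pairs, 0 for unequal ones); the margin value and the two-dimensional embeddings are assumptions chosen for illustration, and some formulations add constant factors such as 1/2.

```python
import math


def contrastive_loss(u, v, same_class, margin=1.0):
    """Contrastive loss for one pair of embeddings.

    same_class == 1: penalize any distance between the two embeddings.
    same_class == 0: penalize the pair only if it is closer than the margin.
    """
    distance = math.dist(u, v)  # Euclidean distance (Python 3.8+)
    if same_class == 1:
        return distance ** 2
    return max(0.0, margin - distance) ** 2
```

An unequal pair that is already farther apart than the margin contributes zero loss, so the model stops pushing apart images that are already well separated.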

In the following image, we can see different batch construction methods that depend on our model: Siamese (contrastive) network, Triplet, or N-pair, which I will explain below.

Each deep learning architecture requires different batch construction methods. For example, Siamese and N-pair require tuples. However, in N-pair, the tuples must be unique.

Triplet Neural Network and Online/Offline Mining

In the Triplet method, we construct triplets of items, two of which (anchor and positive) belong to the same category and the third one (negative) to a different category. This can be harder than you might think because picking the “right” samples in the batch is critical. If you pick items that are too easy or too difficult, the network will converge (adjust weights) very slowly or not at all. The triplet loss function contains an important constant called margin. The margin defines by how much, at minimum, the anchor-negative distance should exceed the anchor-positive distance.
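Written out for a single triplet, the loss is short. This sketch uses plain Euclidean distance and an arbitrary margin of 0.2; some formulations use squared distances instead, so treat the exact form as one common variant.

```python
import math


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is at least `margin` farther from the anchor than the positive."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


easy = triplet_loss([0, 0], [0, 1], [0, 5])  # negative is already far away
hard = triplet_loss([0, 0], [0, 3], [0, 1])  # negative is closer than the positive
```

The "easy" triplet yields zero loss and therefore no weight update, which is why the choice of triplets matters so much.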

Picking the right samples in deep metric learning is called mining. We can find optimal triplets via either offline or online mining. The difference is that during offline mining, you find the triplets at the beginning of each epoch.

Online & Offline Mining

The disadvantage of offline mining is that computing embeddings for every sample is computationally expensive, and because the model can change rapidly during the epoch, these embeddings quickly become obsolete. That’s why online mining of triplets is more popular. In online mining, the triplets are selected for each batch right before the model is fitted on that batch. For more information about mining and batch strategies for triplet training, I would recommend this post.
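As an illustration of online mining, here is a sketch of one common strategy, often called batch-hard mining. It is my choice for this example, not the only option: for every anchor in the batch, it picks the farthest positive and the closest negative. The function name and the toy batch are assumptions.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def batch_hard_triplets(embeddings, labels):
    """Return (anchor, hardest positive, hardest negative) index triplets for one batch."""
    triplets = []
    for a, (emb_a, label_a) in enumerate(zip(embeddings, labels)):
        positives = [i for i, l in enumerate(labels) if l == label_a and i != a]
        negatives = [i for i, l in enumerate(labels) if l != label_a]
        if not positives or not negatives:
            continue  # this anchor cannot form a triplet in this batch
        hardest_pos = max(positives, key=lambda i: euclidean(emb_a, embeddings[i]))
        hardest_neg = min(negatives, key=lambda i: euclidean(emb_a, embeddings[i]))
        triplets.append((a, hardest_pos, hardest_neg))
    return triplets

# Toy batch: embeddings as produced by the current state of the model.
batch_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.2, 0.0], [5.0, 0.0]]
batch_labels = ["shoe", "shoe", "dress", "dress"]
print(batch_hard_triplets(batch_embeddings, batch_labels))
```

Because the embeddings come from the current batch, they always reflect the latest weights, which is exactly what offline mining cannot guarantee.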

We can visualize the Triplet model training in the following way. The model is copied three times, but it has the same shared weights. Each model takes one image from the triplet (anchor, positive, negative) and outputs the embedding vector. Then, the triplet loss is computed and weights are adjusted with backpropagation. After the training is done, the model weights are frozen and the output of the embeddings is used in the similarity engine. Because the three models have shared weights (the same), we take only one model that is used for predicting embedding vectors on images.

Triplet network that takes a batch of anchor, positive and negative images.

N-pair Models

The more modern approach is the N-pair model. The advantage of this model is that you don’t have to mine negative samples explicitly, as you do with a triplet network. The batch consists of just positive pairs. The negative samples are obtained implicitly through the similarity matrix construction, where all off-diagonal items serve as negative samples.
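The matrix construction can be sketched in a few lines of plain Python. This is a simplified version of the N-pair loss written as a softmax cross-entropy over the rows of the similarity matrix; the function names and toy vectors are assumptions, and a real implementation would use a deep learning framework.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def n_pair_loss(anchors, positives):
    """anchors[i] and positives[i] share a class; every class appears once in the batch."""
    n = len(anchors)
    total = 0.0
    for i in range(n):
        # Row i of the similarity matrix: the diagonal entry is the positive,
        # every off-diagonal entry acts as a negative.
        sims = [dot(anchors[i], positives[j]) for j in range(n)]
        top = max(sims)
        log_sum_exp = top + math.log(sum(math.exp(s - top) for s in sims))
        total += log_sum_exp - sims[i]   # softmax cross-entropy with target i
    return total / n

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = n_pair_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])   # positives match their anchors
shuffled = n_pair_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])  # positives look like the other class
print(aligned, shuffled)
```

The loss is lower when each anchor is most similar to its own positive, and no margin constant appears anywhere.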

You still need to do online mining. For example, you can select a batch with a maximum value of the loss function, or pick pairs that are distant in metric space.

The N-pair model requires a unique pair of items. In the triplet and Siamese model, your batch can contain multiple triplets/pairs from the same class (group).

In our experience, the N-pair model is much easier to fit, and the results are also better than with the triplet or Siamese model. You still need to do a lot of experiments and know how to tune other hyperparameters such as learning rate, batch size, or model architecture. However, you don’t need to work with the margin value in the loss function, as you do in triplet or Siamese. The small drawback is that during batch creation, we always need exactly two items per class/product.

Proxy-Based Methods

In the proxy-based methods (Proxy-Anchor, Proxy-NCA, SoftTriple), the model tries to learn class representatives (proxies) from samples. Imagine that instead of having 10,000 classes of fashion products, we will have just 20 class representatives. The first representative will be used for shoes, the second for dresses, the third for shirts, the fourth for pants and so on.

A big advantage is that we don’t need to work with so many classes and the problems that come with them. The idea is to learn class representatives and, instead of slowly mining “the right samples”, use the learned representatives when computing the loss function. This leads to much faster training and convergence of the model. As always, this approach has some drawbacks and open questions, such as how many representatives to use.

MultiSimilarity Loss

Finally, it is worth mentioning MultiSimilarity Loss, introduced in this paper. MultiSimilarity Loss is suitable in cases when you have more than two items per class (images per product). The authors of the paper are using 5 samples per class in a batch. MultiSimilarity can bring closer items within the same class and push the negative samples far away by effectively weighting informative pairs. It works with three types of similarities:

  • Self-Similarity (the distance between the negative sample and anchor)
  • Positive-Similarity (the relationship between positive pairs)
  • Negative-Similarity (the relationship between negative pairs)

Finally, it is also worth noting that you don’t need to use only one loss function; you can combine several. For example, you can use the Triplet Loss function with CrossEntropy and MultiSimilarity, or N-pair together with Angular Loss. This often leads to better results than a standalone loss function.

2) Unsupervised Learning

AutoEncoder

Unsupervised learning is helpful when we have a completely unlabelled dataset, meaning we don’t know the classes of our images. These methods are very interesting because the annotation of data can be very expensive and time-consuming. The most simplistic unsupervised learning can simply use some form of AutoEncoder.

AutoEncoder is a neural network consisting of two parts: an encoder, which encodes the image to the smaller representation (embedding vector), and a decoder, which is trying to reconstruct the original image from the embedding vector.

After the whole model is trained, and the decoder is able to reconstruct the images from smaller vectors, the decoder part is discarded and only the encoder part is used in similarity search engines.

Simple AutoEncoder neural network for learning embeddings via reconstruction of the image.

There are many other solutions for unsupervised learning. For example, we can train AutoEncoder architecture to colourize images. In this technique, the input image has no colour and the decoding part of the network tries to output a colourful image.

Image Inpainting

Another technique is Image Inpainting, where we remove part of the image and the model learns to inpaint it back. Other interesting approaches train a model to solve jigsaw puzzles or to put the frames of a video in the correct order.

Then there are more advanced unsupervised models like SimCLR, MoCo, PIRL, SimSiam or GAN architectures. All these models try to internally represent images so their outputs (vectors) can be used in visual search systems. The explanation of these models is beyond this article.

Tips for Training Deep Metric Models

Here are some useful tips for training deep metric learning models:

  • Batch size plays an important role in deep metric learning. Some methods, such as N-pair, should have bigger batch sizes. Bigger batch sizes generally lead to better results; however, they also require more GPU memory.
  • If your dataset has a bigger variation and a lot of classes, use a bigger batch size for Multi-similarity loss.
  • The most important part of metric learning is your data. It’s a pity that most research, as well as articles, focus only on models and methods. If you have a large collection with a lot of products, it is important to have a lot of samples per product. If you have fewer classes, try to use some unsupervised method or cross-entropy loss and do heavy augmentations. In the next section, we will look at data in more depth.
  • Try to start with a pre-trained model and tune the learning rate.
  • When using Siamese or Triplet training, try to play with the margin term. All the modern frameworks allow you to change it (make it harder) during the training.
  • Don’t forget to normalize the output of the embedding if the loss function requires it. Because we are comparing vectors, they should be normalized in a way that the norm of the vectors is always 1. This way, we are able to compute Euclidean or cosine distances.
  • Use advanced methods such as MultiSimilarity with big batch size. If you use Siamese, Triplet, or N-pair, mining of negatives or positives is essential. Start with easier samples at the beginning and increase the challenging samples every epoch.
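The normalization tip deserves a small worked example. Once vectors are scaled to unit length, the squared Euclidean distance and the cosine similarity carry the same information, because for unit vectors the squared distance equals 2 − 2·cos. The sketch below checks this in plain Python; the function names and the toy vectors are assumptions.

```python
import math

def l2_normalize(v):
    """Scale a vector so that its Euclidean norm is 1."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

a = l2_normalize([3.0, 4.0])
b = l2_normalize([1.0, 0.0])

cosine = dot(a, b)   # for unit vectors, the dot product is the cosine similarity
squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))

# Both numbers rank neighbours identically: squared_distance == 2 - 2 * cosine
print(cosine, squared_distance, 2 - 2 * cosine)
```

In practice, this means a search engine can use whichever of the two measures is cheaper to compute without changing the order of the results.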

Neural Text Search on Images with CLIP

Until now, we have been talking purely about images and searching images with image queries. However, a common use case is to search a collection of images with text input, as we do with Google or Bing search. This is also called the Text-to-Image problem, because we need to transform the text representation into the same representation as the images (the same vector space). Luckily, researchers from OpenAI developed a simple yet powerful architecture called CLIP (Contrastive Language Image Pre-training). The concept is simple: instead of training on pairs of images (Siamese, N-pair), we train two models (one for images and one for text) on pairs of images and texts.

The architecture of CLIP model by OpenAI. Image Source Github

You can train a CLIP model on a dataset and then use it on your images (or videos) collection. You are able to find similar images/products or try to search your database with a text query. If you would like to use a CLIP-like model on your data, we can help you with the development and integration of the search system. Just contact us at care@ximilar.com, and we can create a search system for your data.

The Training Data
for Visual Search Engines

99 % of deep learning models have a very expensive requirement: data. The data should not contain any errors such as wrong labels, and we should have a lot of it. However, obtaining enough samples can be a problematic and time-consuming process. That is why techniques such as transfer learning or image augmentation are widely used to enrich the datasets.

How Does Image Augmentation Help With Training Datasets?

Image augmentation is a technique allowing you to multiply training images and therefore expand your dataset. When preparing your dataset, proper image augmentation is crucial. Each specific category of data requires unique augmentation settings for the visual search engine to work properly. Let’s say you want to build a fashion visual search engine based strictly on patterns and not the colours of items. Then you should probably employ heavy colour distortion and channel-swapping augmentation (randomly swapping red, green, or blue channels of an image).
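Channel swapping is simple enough to sketch without any imaging library. In the example below, an image is assumed to be a list of rows of (r, g, b) tuples; a real pipeline would typically operate on arrays from an image library, and the function name is hypothetical.

```python
import random

def swap_channels(image, rng=random):
    """Return a copy of the image with its colour channels randomly permuted.

    image: list of rows, each row a list of (r, g, b) tuples.
    The same permutation is applied to every pixel, so patterns stay intact
    while the colours change.
    """
    order = [0, 1, 2]
    rng.shuffle(order)
    return [[tuple(pixel[c] for c in order) for pixel in row] for row in image]

# A tiny 1x2 "image" with made-up pixel values.
image = [[(200, 30, 30), (10, 120, 240)]]
augmented = swap_channels(image, random.Random(42))
print(augmented)
```

Since only the colours change, a model trained on such copies is pushed to rely on the pattern and shape rather than on the colour of the item.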

On the other hand, when building an image search engine for a shop with coins, you can rotate the images and flip them left-right and upside-down. But what to do if the classic augmentations are not enough? We have a few more options.

Removing or Replacing Background

Most of the models that are used for image search require pairs of different images of the same object. Typically, when training product image search, we use an official product photo from a retail site and another picture from a smartphone, such as a real-life photo or a screenshot. This way, we get a pair-based model that understands the similarity of a product in pictures with different backgrounds, lights, or colours.

The difference between a product photo and a real-life image made with a smartphone, both of which are important to use when training computer vision models.

All such photos of the same product belong to an entity which we call a Similarity Group. This way, we can build an interactive tool for your website or app, which enables users to upload a real-life picture (sample) and find the product they are interested in.

Background Removal Solution

Sometimes, obtaining multiple images of the same group can be impossible. We found a way to tackle this issue by developing a background removal model that can distinguish the dominant foreground object from its background and detect its pixel-accurate position.

Once we know the exact location of the object, we can generate new photos of products with different backgrounds, making the training of the model more effective with just a few images.
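The idea of generating new variants from a foreground mask can be sketched as a simple compositing step. This is only an illustration of the principle and not our production pipeline: the image, mask, and background are assumed to be nested lists of the same size, and the function name is hypothetical.

```python
def replace_background(image, mask, background):
    """Keep foreground pixels (mask value 1) and take all other pixels from `background`.

    image, background: lists of rows of pixels; mask: lists of rows of 0/1 values.
    All three are assumed to have the same dimensions.
    """
    return [
        [fg if m else bg for fg, m, bg in zip(image_row, mask_row, background_row)]
        for image_row, mask_row, background_row in zip(image, mask, background)
    ]

product = (180, 20, 20)      # made-up colour of the product
studio = (255, 255, 255)     # original white studio background
street = (90, 90, 90)        # new background to paste behind the product

image = [[product, studio], [product, product]]
mask = [[1, 0], [1, 1]]
new_background = [[street, street], [street, street]]

print(replace_background(image, mask, new_background))
```

Running the same product through many different backgrounds yields several training images from a single original photo.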

The background removal can also be used to narrow the area of augmentation only to the dominant item, ignoring the background of the image. There are a lot of ways to get the original product in different styles, including changing saturation, exposure, highlights and shadows, or changing the colours entirely.

Generating more variants can make your model very robust.

Building such an augmentation pipeline with background/foreground augmentation can take hundreds of hours and a lot of GPU resources. That is why we deployed our Background Removal solution as a ready-to-use image tool.

You can use the Background Removal as a stand-alone service for your image collections, or as a tool for training data augmentation. It is available in public demo, App, and via API.

GAN-Based Methods for Generating New Training Data

One of the modern approaches is to use a Generative Adversarial Network (GAN). GANs are incredibly powerful in generating whole new images from some specific domain. You can simply create a model for generating new kinds of insects or making birds with different textures.

Creating new insect images automatically to train an image recognition system? How cool is that? There are endless possibilities with GAN models for basically any image type. [Source]

The greatest advantage of GAN is you will easily get a lot of new variants, which will make your model very robust. GANs are starting to be widely used in more tasks such as simulations, and I think the gathering of data will cost much less in the near future because of them. In Ximilar, we used GAN to create a GAN Image Upscaler, which adds new relevant pixels to images to increase their resolution and quality.

When creating a visual search system on our platform, our team picks the most suitable neural network architecture, loss functions, and image augmentation settings through the analysis of your visual data and goals. All of these are critical for the optimization of a model and the final accuracy of the system. Some architectures are more suitable for specific problems like OCR systems, fashion recommenders or quality control. The same goes for image augmentation: choosing the wrong settings can destroy the optimization. We have experience with selecting the best tools to solve specific problems.

Annotation System for Building Image Search Datasets

As we can see, a good dataset definitely is one of the key elements for training deep learning models. Obtaining such a collection can be quite expensive and time-consuming. With some of our customers, we build a system that continually gathers the images needed in the training datasets (for instance, through a smartphone app). This feature continually & automatically improves the precision of the deployed search engines.

How does it work? When the new images are uploaded to Ximilar Platform (through Custom Similarity service) either via App or API, our annotators can check them and use them to enhance the training dataset in Annotate, our interface dedicated to image annotation & management of datasets for computer vision systems.

Annotate effectively works with the similarity groups by grouping all images of the same item. The annotator can add the image to a group with the relevant Stock Keeping Unit (SKU), label it as either a product picture or a real-life photo, add some tags, or mark objects in the picture. They can also mark images that should be used for the evaluation and not used in the training process. In this way, you can have two separate datasets, one for training and one for evaluation.

We are quite proud of all the capabilities of Annotate, such as quality control, team cooperation, or API connection. There are not many web-based data annotation apps where you can effectively build datasets for visual search, object detection, as well as image recognition, and which are connected to a whole visual AI platform based on computer vision.

A sneak peek into Annotate – image annotation tool for building visual search and image similarity models.

How to Improve Visual Search Engine Results?

We have already established that the optimization algorithm and the training dataset are key elements in training your similarity model, and that having multiple images per product significantly increases the quality of the trained model. The similarity model (a CNN or another modern architecture) is used for embedding (vector) extraction, which determines the quality of image search.

Over the years that we’ve been training visual search engines for various customers around the world, we were also able to identify several potential weak spots. Fixing them really helped with the performance of searches as well as the relevance of the search results. Let’s take a look at what can improve your visual search engine:

Include Tags

Adding relevant keywords for every image can improve the search results dramatically. We recommend using some basic words that are not synonymous with each other. The wrong keywords for one item are for instance “sky, skyline, cloud, cloudy, building, skyscraper, tall building, a city”, while the good alternative keywords would be “sky, cloud, skyscraper, city”.

Our engine can internally use these tags and improve the search results. You can let an image recognition system label the images instead of adding the keywords manually.

Include Filtering Categories

You can store the main categories of images in their metadata. For instance, in real estate, you can distinguish photos that were taken inside or outside. Based on this, the searchers can filter the search results and improve the quality of the searches. This can also be easily done by an image recognition task.

Include Dominant Colours

Colour analysis is very important, especially when working for a fashion or home decor shop. We built a tool conveniently called Dominant Colors, with several extraction options. The system can extract the main colours of a product while ignoring its background. Searchers can use the colours for advanced filtering.

Use Object Detection & Segmentation

Object detection can help you focus the view of both the search engine and its user on the product, by merely cutting the detected object from the image. You can also apply background removal to search & showcase the products the way you want. For training object detection and other custom image recognition models, you can use our App and Annotate.

Use Optical Character Recognition (OCR)

In some domains, you can have products with text. For instance, wine bottles or skincare products with the name of the item and other text labels that can be read by artificial intelligence, stored as metadata and used for keyword search on your site.

Our visual search engine allows us to combine several features for multimedia search with advanced filtering.

Improve Image Resolution

If the uploaded images from the mobile phones have low resolution, you can use the image upscaler to increase the resolution of the image, screenshot, or video. This way, you will get as much as possible even from user-generated content with potentially lower quality.

Combine Multiple Approaches

Combining multiple features like model embeddings, tags, dominant colours, and text increases your chances of building a solid visual search engine. Our system is able to use these different modalities and return the best items accordingly. For example, extracting dominant colours is really helpful in Fashion Search, our service combining object detection, fashion tagging, and visual search.
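One simple way to picture such a combination is a weighted sum of per-modality scores. The sketch below is purely illustrative and does not describe how our engine works internally: the weights, the field names, and the scoring functions are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

def tag_overlap(tags_a, tags_b):
    """Jaccard overlap of two tag lists, between 0 and 1."""
    a, b = set(tags_a), set(tags_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def colour_closeness(c1, c2):
    """1.0 for identical RGB colours, 0.0 for black versus white."""
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(c1, c2)))
    return 1.0 - distance / math.sqrt(3 * 255 ** 2)

def fused_score(query, item, weights=(0.6, 0.25, 0.15)):
    """Weighted sum of embedding, tag, and colour similarity (hypothetical weights)."""
    w_emb, w_tag, w_col = weights
    return (w_emb * cosine(query["embedding"], item["embedding"])
            + w_tag * tag_overlap(query["tags"], item["tags"])
            + w_col * colour_closeness(query["colour"], item["colour"]))

query = {"embedding": [1.0, 0.0], "tags": ["dress", "floral"], "colour": (200, 30, 30)}
red_dress = {"embedding": [0.9, 0.1], "tags": ["dress", "floral", "summer"], "colour": (210, 40, 35)}
blue_shirt = {"embedding": [0.9, 0.1], "tags": ["shirt"], "colour": (20, 20, 200)}

print(fused_score(query, red_dress), fused_score(query, blue_shirt))
```

Both items have the same embedding here, so only the tags and the dominant colour decide that the red dress ranks above the blue shirt.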

Search Engine and Vector Databases

Once you have trained your model (neural network), you can extract and store the embeddings for your multimedia items somewhere. There are a lot of image search engine implementations that are able to work with vectors (embedding representation) that you can use. For example, Annoy from Spotify or FAISS from Facebook developers.
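To show what such an engine does at its core, here is a brute-force nearest neighbour search in plain Python. Libraries such as Annoy or FAISS build index structures to answer the same question much faster on large collections; this sketch only illustrates the principle. The collection, its ids, and the vectors are made up, and the embeddings are assumed to be normalized to unit length.

```python
import heapq

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def search(query, collection, k=3):
    """Return the ids of the k items most similar to the query.

    collection: dict mapping item id -> unit-length embedding, so the dot
    product equals the cosine similarity.
    """
    return heapq.nlargest(k, collection, key=lambda item_id: dot(query, collection[item_id]))

collection = {
    "sneaker": [1.0, 0.0],
    "boot": [0.8, 0.6],
    "dress": [0.0, 1.0],
}
print(search([1.0, 0.0], collection, k=2))
```

This version compares the query with every stored vector, which is fine for thousands of items but becomes too slow for millions, where approximate indexes pay off.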

These solutions are open-source (i.e. you don’t have to deal with usage rights) and you can use them for simple solutions. However, they also have a few disadvantages:

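  • After the initial build of the search engine database, update, insert, and delete operations are either not possible (an Annoy index is read-only once built) or limited to certain index types (FAISS). In the simplest case, once you store the data, you can only perform search queries.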
  • You are unable to use a combination of multiple features, such as tags, colours, or metadata.
  • There’s no support for advanced filtering for more precise results.
  • You need to have an IT background and coding skills to implement and use them. And in the end, the system must be deployed on some server, which brings additional challenges.
  • It is difficult to extend them for advanced use cases; you will need to learn the complex codebase of the project and adjust it accordingly.

Building a Visual Search Engine on a Machine Learning Platform

The creation of a great visual search engine is not an easy task. The mentioned challenges and disadvantages of building complex visual search engines with high performance are the reasons why a lot of companies hesitate to dedicate their time and funds to building them from scratch. That is where AI platforms like Ximilar come into play.

Custom Similarity Service

Ximilar provides a computer vision platform, where a fast similarity engine is available as a service. Anyone can connect via API and fill their custom collection with data and query at the same time. This streamlines the tedious workflow a lot, enabling people to have custom visual search engines fast and, more importantly, without coding. Our image search engines can handle other data types like videos, music, or 3D models. If you want more privacy for your data, the system can also be deployed on your hardware infrastructure.

In all industries, it is important to know what we need from our model and optimize it towards the defined goal. We developed our visual search services with this in mind. You can simply define your data and problem and what should be the primary goal for this similarity. This is done via similarity groups, where you put the items that should be matched together.

Examples of Visual Search Solutions for Business

One of the typical industries that use visual search extensively is fashion. Here, you can look at similarities in multiple ways. For instance, one can simply want to find footwear with a colour, pattern, texture, or shape similar to the product in a screenshot. We built several visual search engines for fashion e-shops and especially price comparators, which combined search by photo and recommendations of alternative similar products.

Based on a long experience with visual search solutions, we deployed several ready-to-use services for visual search: Visual Product Search, a complex visual search service for e-commerce including technologies such as search by photo, similar product recommendations, or image matching, and Fashion Search created specifically for the fashion segment.

Another nice use case is also the story of how we built a Pokémon Trading Card search engine. It is no surprise that computer vision has been recently widely applied in the world of collectibles. Trading card games, sports cards or stamps and visual AI are a perfect match. Based on our customers’ demand, we also created several AI solutions specifically for collectibles.

The Workflow of Building a Visual Search Engine

If you are looking to build a custom search engine for your users, we can develop a solution for you, using our service Custom Image Similarity. This is the typical workflow of our team when working on a customized search service:

  1. Setup, Research & Plan – Initial calls, the definition of the project, NDA, and agreement on expected delivery time.

  2. Data – If you don’t provide any data, we will gather it for you. Gathering and curating datasets is the most important part of developing machine learning models. Having a well-balanced dataset without any bias to any class leads to great performance in production.

  3. First prototype – Our machine learning team will start working on the model and collection. You will be able to see the first results within a month. You can test it and evaluate it by yourself via our clickable front end.

  4. Development – Once you are satisfied with the results, we will gather more data and do more experiments with the models. This is an iterative way of improving the model.

  5. Evaluation & Deployment – If the system performs well and meets the criteria set up in the first calls (mostly some evaluation on the test dataset and speed performance), we work on the deployment. We will show you how to connect and work with the API for visual similarity (insert, delete, search endpoints).

If you are interested in knowing more about how the cooperation with Ximilar works in general, read our How it works and contact us anytime.

We are also able to do a lot of additional steps, such as:

  • Managing and gathering more training data continually after the deployment to gradually increase the performance of visual similarity (the usage rights for user-generated content are up to you; keep in mind that we don’t store any physical images).
  • Building a customized model or multiple models that can be integrated into the search engine.
  • Creating & maintaining your visual search collection, with automatic synchronization to always keep up to date with your current stock.
  • Scaling the service to hundreds of requests per second.

Visual Search is Not Only For the Big Companies

I presented the basic techniques and architectures for training visual similarity models, but of course, there are many more advanced models, and research in this field is advancing by leaps and bounds.

Search engines are practically everywhere. It all started with AltaVista in 1995 and Google in 1998. Now it’s more common to get information directly from Siri or Alexa. Searching for things with visual information is just another step, and we are glad that we can give our clients tools to maximise their potential. Ximilar has a lot of technical experience with advanced search technology for multimedia data, and we work hard to make it accessible to everyone, including small and medium companies.

If you are considering implementing visual search into your system:

  1. Schedule a call with us and we will discuss your goals. We will set up a process for getting the training data that are necessary to train your machine learning model for search engines.

  2. In the following weeks, our machine learning team will train a custom model and a testable search collection for you.

  3. After meeting all the requirements from the POC, we will deploy the system to production, and you can connect to it via REST API.

The post How to Build a Good Visual Search Engine? appeared first on Ximilar: Visual AI for Business.

]]>
Flows – The Game Changer for Next-Generation AI Systems https://www.ximilar.com/blog/flows-the-game-changer-for-next-generation-ai-systems/ Wed, 01 Sep 2021 15:25:28 +0000 https://www.ximilar.com/?p=5213 Flows is a service for combining machine learning models for image recognition, object detection and other AI services into API.

The post Flows – The Game Changer for Next-Generation AI Systems appeared first on Ximilar: Visual AI for Business.

]]>
We have spent thousands of man-hours on this challenging subject. Gallons of coffee later, we introduced a service that might change how you work with data in Machine Learning & AI. We named this solution Flows. It enables simple and intuitive chaining and combining of machine learning models. This simple idea speeds up the workflow of setting up complex computer vision systems and brings unseen scalability to machine learning solutions.

We are here to offer a lot more than just training models, as common AI companies do. Our purpose is not to develop AGI (artificial general intelligence), which is going to take over the world, but easy-to-use AI solutions that can revolutionize many areas of both business and daily life. So, let’s dive into the possibilities of flows in this 2021 update of one of our most-viewed articles.

Flows: Visual AI Setup Cannot Get Much Easier

In general, on our platform, you can break your machine learning problem down into smaller, separate parts (recognition, detection, and other machine learning models called tasks) and then easily chain & combine these tasks with Flows to achieve the full complexity and hierarchical classification of a visual AI solution.

A typical simple use case is conditional image processing. For instance, the first recognition task filters out non-valid images, then the next one decides a category of the image and, according to the result, other tasks recognize specific features for a given category.

Simple use of machine learning models combination in a flow

Flows allow your team to review and change datasets of all complexity levels fast and without any trouble. It doesn’t matter whether your model uses three simple categories (e.g. cats, dogs, and guinea pigs) or works with an enormous and complex hierarchy with exceptions, special conditions, and interdependencies.

It also enables you to review the whole dataset structure, analyze, and, if necessary, change its logic due to modularity. With a few clicks, you can add new labels or models (tasks), change their chaining, change the names of the output fields, etc. Neat? More than that!

Think of Flows as Zapier or IFTTT in AI. With flows, you simply connect machine learning models, and review the structure anytime you need.

Define a Flow With a Few Clicks

Let’s assume we are building a real estate website, and we want to automatically recognize different features that we can see in the photos. Different kinds of apartments and houses have various recognizable features. Here is how we can define this workflow using recognition flows (we trained each model with a custom image recognition service):

An example of real estate classifier made of machine learning models combined in a flow

The image recognition models are chained in a “main” flow called the branch selector. The branch selector saves the result in the same way as a recognition task node and also chooses an action based on the result of this task. First, we let the top category task recognize the type of estate (Apartment vs. Outdoor house). If it is an apartment, we can see that two subsequent tasks are “Apartment features” and “Room type”.

A flow can also call other flows, so-called nested flows, and delegate part of the work to them. If the image is an outdoor house, we continue processing by another nested flow called “Outdoor house”. In this flow, we can see another branching according to the task that recognizes “House type”. Different tasks are called for individual categories (Bungalow, Cottage, etc.):

An example use of nested flows – the main flow calls other nested flows to process images based on their category
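The logic of the real estate flow can be mimicked in a few lines of Python. This is a conceptual sketch only and not the Flows API: flows are defined in the app, the dictionary layout is my invention, and the models are replaced by stubs that read a pre-filled answer. The “Cottage features” task and all the stub keys are hypothetical.

```python
def run_node(node, image, result):
    """Run one flow node and, for a branch selector, the matching branch."""
    label = node["task"](image)              # call the (stubbed) model
    result[node["output_field"]] = label     # save the result like a recognition task
    # A plain recognition task has no "branches", so nothing else happens.
    for child in node.get("branches", {}).get(label, []):
        run_node(child, image, result)       # a child can be a task or a nested flow
    return result

def stub(key):
    """Stand-in for a trained model: returns the answer stored in the fake image."""
    return lambda image: image[key]

outdoor_house_flow = {                       # nested flow
    "task": stub("house_type"), "output_field": "House type",
    "branches": {
        "Cottage": [{"task": stub("cottage_features"), "output_field": "Cottage features"}],
    },
}
main_flow = {                                # the "main" branch selector
    "task": stub("top_category"), "output_field": "Top category",
    "branches": {
        "Apartment": [
            {"task": stub("apartment_features"), "output_field": "Apartment features"},
            {"task": stub("room_type"), "output_field": "Room type"},
        ],
        "Outdoor house": [outdoor_house_flow],
    },
}

photo = {"top_category": "Outdoor house", "house_type": "Cottage", "cottage_features": "garden"}
print(run_node(main_flow, photo, {}))
```

Only the tasks on the selected path are called, which is what makes conditional processing cheaper than running every model on every image.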

Flow Elements We Used

So far, we have used three elements:

  • A recognition task, which simply calls a given task and saves the result into an output field with a specified name. No other logic is involved.
  • A branch selector, which saves the result in the same way as a recognition task node, but then chooses an action based on the result of this task.
  • A nested flow, which is another flow of tasks called by the “main” flow (branch selector).

Implicitly, there is also a List element present in some branches. We do not need to create it, because as soon as we add two or more elements to a single branch, a list is generated in the background. All nodes in a list are normally executed in parallel, but you can also set sequential execution. In that case, a reordering button will appear.
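To make the relationship between these elements more concrete, here is a minimal Python sketch. It is not how Ximilar implements flows internally: the class names, the fake models, and the feature labels (Balcony, Kitchen, Garden) are all hypothetical, the list runs its nodes one after another, and the branch selector simply follows the highest-scoring label.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union

# Hypothetical stand-in for a trained model: maps an image to {label: probability}.
Model = Callable[[str], Dict[str, float]]


@dataclass
class RecognitionTask:
    """Calls a task and saves the result under a named output field."""
    model: Model
    output_field: str

    def run(self, image: str, result: dict) -> None:
        result[self.output_field] = self.model(image)


@dataclass
class FlowList:
    """Several nodes in one branch (run one after another in this sketch)."""
    nodes: List["Node"]

    def run(self, image: str, result: dict) -> None:
        for node in self.nodes:
            node.run(image, result)


@dataclass
class BranchSelector:
    """Saves the result like a recognition task, then follows the winning branch."""
    model: Model
    output_field: str
    branches: Dict[str, "Node"] = field(default_factory=dict)

    def run(self, image: str, result: dict) -> None:
        scores = self.model(image)
        result[self.output_field] = scores
        best_label = max(scores, key=scores.get)
        if best_label in self.branches:
            # The branch may be a task, a list, or a whole nested flow.
            self.branches[best_label].run(image, result)


Node = Union[RecognitionTask, FlowList, BranchSelector]

# Fake models with fixed answers, for illustration only.
outdoor_house = BranchSelector(
    model=lambda image: {"Bungalow": 0.8, "Cottage": 0.2},
    output_field="House type",
    branches={"Bungalow": RecognitionTask(lambda image: {"Garden": 0.7}, "Bungalow features")},
)
main_flow = BranchSelector(
    model=lambda image: {"Apartment": 0.1, "Outdoor house": 0.9},
    output_field="Top category",
    branches={
        "Apartment": FlowList([
            RecognitionTask(lambda image: {"Balcony": 0.6}, "Apartment features"),
            RecognitionTask(lambda image: {"Kitchen": 0.9}, "Room type"),
        ]),
        "Outdoor house": outdoor_house,  # a nested flow used as a branch
    },
)

result: dict = {}
main_flow.run("house.jpg", result)
print(list(result))
```

Because a nested flow is just another node, the main flow does not need to know what happens inside it. That is the property that keeps large taxonomies manageable.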

Branch Selector – Advanced Settings

The branch selector is a powerful element, and it is worthwhile to explore what it can do. Let’s go through the most important options. By default, only the action (tag or category) with the highest relevance is performed, provided its relevance (the probability output by the model) is above 50%. But we can change this in the advanced settings: we can specify the threshold value and also enable parallel execution of multiple branches!

The advanced settings of a branch selector, which make it possible to skip a task in a flow
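The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `select_branches` and its parameters are hypothetical and do not come from the Ximilar API, and "above the threshold" is read as a strict comparison.

```python
from typing import Dict, List


def select_branches(
    scores: Dict[str, float],
    threshold: float = 0.5,   # default relevance threshold of 50%
    multiple: bool = False,   # True mimics enabling multiple branches
) -> List[str]:
    """Return the labels whose branches would be executed."""
    passing = sorted(
        (label for label, probability in scores.items() if probability > threshold),
        key=lambda label: scores[label],
        reverse=True,
    )
    # By default only the single most relevant label is followed.
    return passing if multiple else passing[:1]


print(select_branches({"Apartment": 0.3, "Outdoor house": 0.6}))
print(select_branches({"Elegant": 0.8, "Casual": 0.6, "Sport": 0.1}, multiple=True))
print(select_branches({"Apartment": 0.4, "Outdoor house": 0.45}))
```

The last call returns an empty list: when no label clears the threshold, no branch is followed at all.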

You can specify the format of the results. Flat JSON means that results from all branches are saved on the same level as any previous outcomes. If the same output name appears in multiple branches, one result can overwrite another, and because parallel execution guarantees no particular order, you cannot tell which result will remain. You can prevent this by selecting nested JSON, which saves the results from each branch under a separate key based on the branch name (that is, the tag/category name).
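The difference between the two formats is easiest to see on a small example. The sketch below is an assumption-laden illustration, not the actual Ximilar response schema: the helper names and the field names (`Category`, `Style`) are hypothetical.

```python
from typing import Dict


def merge_flat(previous: dict, branch_results: Dict[str, dict]) -> dict:
    """Flat JSON: every branch writes to the same level as earlier results."""
    merged = dict(previous)
    for outputs in branch_results.values():
        merged.update(outputs)  # a later branch silently overwrites an earlier one
    return merged


def merge_nested(previous: dict, branch_results: Dict[str, dict]) -> dict:
    """Nested JSON: each branch gets its own key, named after the branch."""
    merged = dict(previous)
    for branch_name, outputs in branch_results.items():
        merged[branch_name] = dict(outputs)
    return merged


previous = {"Category": "Clothing"}
branches = {
    "Dresses": {"Style": "Elegant"},
    "Jackets": {"Style": "Casual"},  # same output name as in the Dresses branch
}

print(merge_flat(previous, branches))
print(merge_nested(previous, branches))
```

In the flat result only one `Style` value survives, and with parallel execution which one it is depends on timing. The nested result keeps both.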

If some data (output_field) are already present in the incoming request, we can skip the branch selector processing. You can define this in If Output Field Exists. This way we can save credits and also improve the precision of the system. I will show you how useful this behaviour can be in the next paragraphs. To learn about the advanced options of training, check this article.
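As a rough sketch of the idea behind If Output Field Exists, the hypothetical helper below calls a task only when its output field is missing from the request. The function and field names are illustrative assumptions, not part of the Ximilar API.

```python
from typing import Callable


def run_if_missing(request: dict, output_field: str, task: Callable[[dict], object]) -> dict:
    """Call the task only when the request does not already carry its output."""
    result = dict(request)
    if output_field not in result:
        result[output_field] = task(result)
    return result


calls = []


def top_category(record: dict) -> str:
    # Fake task that records each call, so we can see when it is skipped.
    calls.append(record["image"])
    return "Clothing"


first = run_if_missing({"image": "dress.jpg"}, "Top category", top_category)
second = run_if_missing(
    {"image": "boot.jpg", "Top category": "Footwear"}, "Top category", top_category
)
print(first["Top category"], second["Top category"], len(calls))
```

The second request already knows its top category, so the task is never called for it. That skipped call is where the saved credits come from.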

An Example: Fashion Detection With Tags

We have just created a flow to tag simple and basic pictures. That is cool. But can we really use it in real-life applications? Probably not. The reason is, in most pictures, there is usually more than one clothing item. So how are we going to automate the tagging of more complex pictures? The answer is simple: we can integrate object detection into flows and combine it with recognition & tagging models!

Example of Fashion Tagging combined with Object Detection in Ximilar App

The flow structure then exactly mirrors the rich product taxonomy. Each image goes through a taxonomy tree in order to get proper tags. This is our “top classifier” – a flow that can tell one of our seven top categories of a fashion product image, which will determine how the image will be further classified. For instance, if it is a “Clothing” product, the image continues to “Clothing tagging” flow.

A “top classifier” – a flow that can tell one of our seven top categories of a fashion product image.

Similar to categorization or tagging, there are two basic nodes for object detection: the Detection Task for simple execution of a given task and Object Selector, which enables the processing of the detected objects.

Object Selector will call the object detection task. The detected objects will be extracted out of the image and passed further to any of the available nodes. Yes, any of them! Even another Object Selector, if, for example, you need to first detect people and then detect clothes on each person separately.

Object Selector – Advanced Settings

Object Selector behavior can be customized in similar ways as a Branch Selector. In addition to the Probability Threshold, there is also an Area Threshold: objects that do not take up at least a given percentage of the image are simply ignored. By default, all objects are processed. This can be changed to a single object, chosen by probability or area, in Select. As I mentioned, we extract the object before further processing. We can extend it a bit to include some context using Expand Bounding Box by…

Setting a threshold for a space that an object should occupy in order to be detected
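The two thresholds and the bounding box expansion can be sketched as follows. Everything here is an assumption made for illustration: the `Detection` class, the pixel-based box format, the inclusive comparisons, and the way the expansion ratio is split evenly between both sides of the box.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class Detection:
    label: str
    probability: float
    box: Box


def area_fraction(box: Box, image_width: int, image_height: int) -> float:
    """Share of the image covered by the box, between 0 and 1."""
    x_min, y_min, x_max, y_max = box
    return ((x_max - x_min) * (y_max - y_min)) / (image_width * image_height)


def filter_detections(
    detections: List[Detection],
    image_width: int,
    image_height: int,
    probability_threshold: float,
    area_threshold: float,
) -> List[Detection]:
    """Keep only objects that are confident enough and large enough."""
    return [
        d for d in detections
        if d.probability >= probability_threshold
        and area_fraction(d.box, image_width, image_height) >= area_threshold
    ]


def expand_box(box: Box, ratio: float, image_width: int, image_height: int) -> Box:
    """Grow the box by `ratio` of its size to add context, clipped to the image."""
    x_min, y_min, x_max, y_max = box
    dx = (x_max - x_min) * ratio / 2
    dy = (y_max - y_min) * ratio / 2
    return (
        max(0, x_min - dx),
        max(0, y_min - dy),
        min(image_width, x_max + dx),
        min(image_height, y_max + dy),
    )


detections = [
    Detection("Dress", 0.9, (0, 0, 500, 500)),    # 25% of the image
    Detection("Watch", 0.8, (0, 0, 100, 100)),    # 1% of the image
    Detection("Bag", 0.3, (200, 200, 800, 800)),  # large but not confident
]
kept = filter_detections(detections, 1000, 1000, probability_threshold=0.5, area_threshold=0.05)
print([d.label for d in kept])
print(expand_box((100, 100, 200, 200), 0.5, 1000, 1000))
```

Only the dress passes both thresholds. The watch is confident but too small, and the bag is large but below the probability threshold.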

A Typical Flows Application: Fashion Tagging

We have been playing with the fashion subject since the inception of Ximilar. It is the most challenging and also the most promising one. We have created all kinds of tools and helpers for the fashion industry, namely Fashion Tagging, specialized Fashion Search, or Annotate. We are proud to have a very precise automatic fashion tagging service with a rich fashion taxonomy.

And, of course, Fashion Tagging is internally powered by Flows. It is a huge project with several dozen features to recognize, about a hundred recognition tasks, and hundreds of labels, all chained into several interconnected flows. For example, this is what our AI says about a simple dress now – and you can try it on your picture in the public demo.

Example of fashion attributes assigned to a dress by Ximilar Fashion Tagging flow

Include Pre-trained Services In Your Flow

The last group of nodes at your disposal are Ximilar services. We are working hard on an ever-growing number of ready-to-use services, which can be called through our API and integrated into your project. It is natural for our users to combine multiple AI services, and flows make it easier than ever. At this moment, you can call these ready-to-use recognition services:

But more will come in the future, for example, Remove Background.

Increasing Possibilities of Flows

As our app and list of services grow, so do the flows. There are two features we are currently looking forward to. We are already building custom similarity models for our customers. As soon as they are ready, they will be available for combining in flows. And there is one more item very high on our list, which is predicting numeric values. Regression, in machine learning terms. Stay tuned for more exciting news!

Create Your Flow – It’s Free

Before Flows, setting up the AI Vision process was a tedious task for a skilled developer. Now everyone can set up, manage, and alter the steps on their own, in a comprehensible, visual way. That means optimizing the process quickly, getting faster responses, spending less time and money, and delivering higher quality to customers.

And what’s the best part? Flows are available to the users of Ximilar’s free plan, so you can try them right away. Register or log in to the Ximilar App and open the Flows service on the Dashboard. If you want to learn the basics first, check out our video tutorials. Then you can connect tasks and labels defined in your own Image Recognition.

Training of machine learning models is free with Ximilar; you only pay for API calls for recognition. Read more about API calls or API credit packs. We strongly believe you will love Flows as much as we enjoyed bringing them to life. And if you feel like there is a feature missing, or if you prefer a custom-made solution, feel free to contact us!


]]>
Image Annotation Tool for Teams https://www.ximilar.com/blog/image-annotation-tool-for-teams/ Thu, 06 May 2021 11:55:57 +0000 https://www.ximilar.com/?p=4115 Annotate is an advanced image annotation tool supporting complex taxonomies and teamwork on computer vision projects.

The post Image Annotation Tool for Teams appeared first on Ximilar: Visual AI for Business.

]]>
Through the years, we have worked with many annotation tools. The problem is that most desktop annotation apps are offline and intended for single-person use, not for team cooperation. The web-based apps, on the other hand, mostly focus on data management with photo annotation, and not on the whole ecosystem with API and inference systems. In this article, I review what a good image annotation tool should do and explain the basic features of our own tool – Annotate.

Every big machine learning project requires the active cooperation of multiple team members – engineers, researchers, annotators, product managers, or owners. For example, supervised deep learning for object detection, as well as segmentation, outperforms unsupervised solutions. However, it requires a lot of data with correct annotations. Annotation of images is one of the most time-consuming parts of every deep learning project. Therefore, picking the right annotation tool is critical. When your team is growing and your projects become more complex over time, you may encounter new challenges, such as:

  • Adding labels to the taxonomy would require re-checking a lot of your work
  • Increasing the performance of your models would require more data
  • You will need to monitor the progress of your projects

Building solid annotation software for computer vision is not an easy task. And yes, it requires a lot of failures and taking many wrong turns before finding the best solution. So let’s look at what the basic features of an advanced data annotation tool should be.

What Should an Advanced Image Annotation Tool Do?

Many customers are using our cloud platform Ximilar App in very specific areas, such as Fashion, Healthcare, Security, or Industry 4.0. The environment of a proper AI helper or tool should be complex enough to cover requirements like:

  • Features for team collaboration – you need to assign tasks, and then check the quality and consistency of data
  • Great user experience for dataset curation – everything should be as simple as possible, but no simpler
  • Fast production of high-quality datasets for your machine-learning models
  • Work with complex taxonomies & many models chained with Flows
  • Fast development and prototyping of new features
  • Connection to REST API with Python SDK & querying annotated data

With these needs in mind, we created our own image annotation tool. We use it in our internal projects and provide it to our customers as well. Our technologies for machine learning accelerate the entire pipeline of building good datasets. Whether you are a freelancer tagging pictures or a team managing product collections in e-commerce, Annotate can help.

Our Visual AI tools enable you to work with your own custom taxonomy of objects, such as fashion apparel or things captured by the camera. You can read the basics on the categories & tags and machine learning model training, watch the tutorials, or check our demo and see for yourself how it works.

The Annotate

Annotate is an advanced image annotation tool, which enables you to annotate images precisely and fast. It works as an end-to-end platform for visual data management. You can query the same images, change labels, create objects, draw bounding boxes and even polygons here.

It is a web-based online annotation tool that works fully in the cloud. Since it is connected to the same back-end & database as Ximilar App, all changes you make in Annotate show up in your workspace in the App, and vice versa. You can create labels, tasks & models, or upload images through the App, and use them in Annotate.

Ximilar Application and Annotate are connected to the same backend (api.ximilar.com) and the same database.

Annotate extends the functionalities of the Ximilar App. The App is great for training, creating entities, uploading data, and batch management of images (bulk actions for labelling and filtering). Annotate, on the other hand, was created for the detail-oriented management of images. The default single-zoomed image view brings advantages, such as:

  • Identifying separate objects, drawing polygons and adding metadata to a single image
  • Suggestions based on AI image recognition help you choose from very complex taxonomies
  • The annotators focus on one image at a time to minimize the risk of mistakes

Interested in getting to know Annotate better? Let’s have a look at its basic functions.

Deep Focus on a Single Image

If you enter the Images (left menu), you can open any image in the single image view. To the right of the image, you can see all the items located in it. This is where most of the labelling is done. There is also a toolbar for drawing objects and polygons, labelling images, and inspecting metadata.

In addition, you can zoom in/out and drag the image. This is especially helpful when working with smaller objects or big-resolution images. For example, teams annotating medical microscope samples or satellite pictures can benefit from this robust tool.

The main view of the image in our Fashion Tagging workspace

Create Multiple Workspaces

Some of you already know this from other SaaS platforms. The idea is to divide your data into several independent storages. Imagine your company is working on multiple projects at the same time and each of them requires you to label your data with an image annotation tool. Your company account can have many workspaces, each for one project.

Here is our active workspace for Fashion Tagging

Within the workspaces, you don’t mix your images, labels, and tasks. For example, one workspace contains only images for fruit recognition projects (apples, oranges, and bananas) and another contains data on animals (cats and dogs).

Your team members can get access to different workspaces. Also, everyone can switch between the workspaces in the App as well as in Annotate (top right, next to the user icon). Did you know that the workspaces are also accessible via API? Check out our documentation and learn how to connect to the API.

Train Precise AI Models with Verification

Building good computer vision models requires a lot of data, high-quality annotations, and a team of people who understand the process of building such a dataset. In short, to create high-quality models, you need to understand your data and have a perfectly annotated dataset. In the words of the Director of AI at Tesla, Andrej Karpathy:

“Labeling is a job for highly trained professionals.” (Andrej Karpathy, Head of AI at Tesla)

Annotate helps you build high-quality AI training datasets by verification. Every image can be verified by different users in the workspace. You can increase the precision by training your models only on verified images.

A list of users who verified the image with the exact dates

Verifying your data is necessary for the creation of good deep-learning models. To verify the image, simply click the Verify button, or Verify & Next if you are working on a job. You will be able to see who verified any particular image and when.

Create and Track Image Annotating Jobs

When you need to process the newly uploaded images, you can assign them to a Job and a team of people can process them one by one in a job queue. You can also set up exactly how many times each image should be seen by the people processing this queue.

Moreover, you can specify which photo recognition model or flow of models should be displayed when doing the job. For example, here is the view of the jobs that we are using in one of our tagging services.

Two jobs are waiting to be completed by annotators; you can start working by hitting the play button on the right

When working on a job, every time an annotator hits the Verify & Next button, they are redirected to a new image within the job. You can track the progress of each job in Jobs. Once the image annotation job is complete, the progress bar turns green, and you can proceed to the next steps: retraining the models, uploading new images, or creating another job.
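The job queue logic can be pictured with a small Python sketch. It only illustrates the idea and is not Annotate's implementation: the data structure, the function names, and the rule that an annotator never gets the same image twice are all assumptions.

```python
from typing import Dict, Optional, Set


def next_image(annotator: str, seen_by: Dict[str, Set[str]], required_views: int) -> Optional[str]:
    """Pick the next image that still needs views and that this annotator has not seen."""
    for image_id, annotators in seen_by.items():
        if len(annotators) < required_views and annotator not in annotators:
            return image_id
    return None  # nothing left for this annotator


def job_progress(seen_by: Dict[str, Set[str]], required_views: int) -> float:
    """Fraction of images that were seen the required number of times."""
    if not seen_by:
        return 1.0
    done = sum(1 for annotators in seen_by.values() if len(annotators) >= required_views)
    return done / len(seen_by)


# Hypothetical job: each image should be seen by two people.
queue = {"img-1": {"anna", "ben"}, "img-2": {"anna"}, "img-3": set()}

print(next_image("anna", queue, required_views=2))
print(next_image("ben", queue, required_views=2))
print(round(job_progress(queue, required_views=2), 2))
```

In this picture, the job is complete when `job_progress` reaches 1.0, which corresponds to the progress bar turning green.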

Draw Objects and Polygons

Sometimes, recognizing the most probable category or tags for an image is not enough. That is why Annotate provides a possibility to identify the location of specific things by drawing objects and polygons. The great thing is that you are not paying any credits for drawing objects or labelling. This makes Annotate one of the most cost-effective online apps for image annotation.

Simply click and drag the rectangle with the rectangle tool on canvas to create the detection object.

What exactly do you pay for when annotating data? API credits are counted only for data uploads, with volume-based discounts. This makes Annotate an affordable, yet powerful tool for data annotation. If you want to know more, read our newest article on API Credit Packs, or check our Pricing Plans or Documentation.

Annotate With Complex Taxonomies Elegantly

The greatest advantage of Annotate is working with very complex taxonomies and attribute hierarchies. That is why it is usually used by companies in E-commerce, Fashion, Real Estate, Healthcare, and other areas with rich databases. For example, our Fashion tagging service contains more than 600 labels that belong to more than 100 custom image recognition models. The taxonomy tree for some of the biotech projects can be even broader.

Navigating through the taxonomy of labels is very elegant in Annotate – via Flows. Once your Flow is defined (our team can help you with it), you simply add labels to the images. The branches expand automatically when you add labels. In other words, you always see only essential labels for your images.

Simply navigate through your taxonomy tree, expanding branches when clicking on specific labels.

For example, in this image is a fashion object “Clothing”, to which we need to assign more labels. Adding the Clothing/Dresses label will expand the tags that are in the Length Dresses and Style Dresses tasks. If you select the label Elegant from Style Dresses, only features & attributes you need will be suggested for annotation.

Automate Repetitive Tasks With AI

Annotate was initially designed to speed up the work when building computer vision solutions. When annotating data, manual drawing & clicking is a time-consuming process. That is why we created the AI helper tools to automate the entire annotating process in just a few clicks. Here are a few things that you can do to speed up the entire annotation pipeline:

  • Use the API to upload your previously annotated data to train or re-train your machine learning models and use them to annotate or label more data via API
  • Create bounding boxes and polygons for object detection & instance object segmentation with one click
  • Create jobs, share the data, and distribute the tasks to your team members
Predicting bounding boxes with one click automates the entire process of annotation.

Image Annotation Tool for Advanced Visual AI Training

As the main focus of Ximilar is AI for sorting, comparing, and searching multimedia, we integrate the annotation of images into the building of AI search models. This is something that we miss in all other data annotation applications. For the building of such models, you need to group multiple items (images or objects, typically product pictures) into the Similarity Groups. Annotate helps us create datasets for building strong image similarity search models.

Grouping the same or similar images with the Image Annotation Tool. You can tell which item is a smartphone photo or which photos should be located on an e-commerce platform.

Annotate is Always Growing

Annotate was originally developed as our internal image annotation software, and we have already delivered a lot of successful solutions to our clients with it. It is a unique product that any team can benefit from to improve their computer vision models unbelievably fast.

We plan to introduce more data formats like videos, satellite imagery (sentinel maps), 3D models, and more in the future to level up the Visual AI in fields such as visual quality control or AI-assisted healthcare. We are also constantly working on adding new features and improving the overall experience of Ximilar services.

Annotate is available for all users with Business & Professional pricing plans. Would you like to discuss your custom solution or ask anything? Let’s talk! Or read how the cooperation with us works first.


]]>
How to Deploy Models to Mobile & IoT For Offline Use https://www.ximilar.com/blog/how-to-deploy-models-to-mobile-iot-for-offline-use/ Wed, 27 May 2020 09:09:49 +0000 https://www.ximilar.com/?p=1597 Tutorial for deploying Image Recognition models trained with TensorFlow to your smartphone and edge devices.

The post How to Deploy Models to Mobile & IoT For Offline Use appeared first on Ximilar: Visual AI for Business.

]]>
Did you know that the number of IoT devices is crossing 38 billion in 2020? That is a big number. Roughly half of those are connected to the Internet. That is quite a large load for internet infrastructure, even for 5G networks. And still, some countries in the world have not yet adopted 4G. So internet connectivity can be slow in many cases.

Earlier, in a separate blog post, we mentioned that one day you will be able to download your trained models offline. The time is now. We worked for several months with the newest TensorFlow 2+ (KUDOS to the TF team!), rewriting our internal system from scratch, so your trained models can finally be deployed offline.

Tadaaa! That makes Ximilar one of the first machine learning platforms that allow their users to train a custom image recognition model with just a few clicks and download it for offline usage!

The feature is active only in custom pricing plans. If you would like to download and use your models offline, please let us know at sales@ximilar.com, where we are ready to discuss potential options with you.

Let’s get started!

Let’s have a look at how to use your trained model directly on your server, mobile phone, IoT device, or any other edge device. The downloaded model can be run on iOS devices, Android phones, Coral, NVIDIA Jetson, Raspberry Pi, and many others. This makes sense, especially in case your device is offline – if it’s connected to the internet, you can query our API to get results from your latest trained model.

Why offline usage?

Privacy, network connectivity, security, and high latency are common concerns that all customers have. Online use can also become a bottleneck when adopting machine learning on a very large scale or in factories for visual quality control. Here are some scenarios in which to consider offline models:

  • You don’t want your data to leave your private network.
  • Your device cannot be connected to the Internet or the connectivity is slow.
  • You don’t need to request our API from your mobile for every image you take.
  • You don’t want to be dependent on our infrastructure (but, BTW, we have almost 100% uptime).
  • You need to do numerous queries (tens of millions) per day and want to run your models on your GPU cards.

Right now, both recognition models and detection models are ready for offline usage!

Before continuing with this article, you should already know how to create your Recognition models.

Download

After creating and training your Task, go to the Task page. Once you have permission to download the model, scroll down to the list of trained models and you should see the download icons. Choose the version of the model you are satisfied with and use the icon to download a ZIP file.

This ZIP archive contains several files. The actual model file is located in the tflite folder, and it is in the TFLITE format, which can be easily deployed on any edge device. Another essential file is labels.txt, which contains the names of your task labels. The order of the names is important as it corresponds with the order of model outputs, so don’t mix them up. The default input size of the model is 224×224 pixels. There is also a saved_model folder, which is used when deploying on a server or PC with a GPU.
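As a rough illustration of how these files fit together, here is a minimal Python sketch that loads a TFLITE model with the standard `tf.lite.Interpreter` and maps its outputs to the names in labels.txt. The helper names are hypothetical, the exact model file name inside the tflite folder is not assumed (you pass the path yourself), and the sketch assumes a single input of 224×224×3 and a single output vector with one score per label.

```python
from pathlib import Path
from typing import List, Sequence, Tuple


def load_labels(labels_path: str) -> List[str]:
    """Read labels.txt, keeping the line order (it matches the model outputs)."""
    lines = Path(labels_path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def top_labels(
    scores: Sequence[float], labels: Sequence[str], k: int = 3
) -> List[Tuple[str, float]]:
    """Pair each score with its label by position and return the k best."""
    if len(scores) != len(labels):
        raise ValueError("model outputs and labels.txt must have the same length")
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def classify(model_path: str, labels_path: str, rgb_image) -> List[Tuple[str, float]]:
    """Run one image through the model; rgb_image is a 224x224x3 array."""
    # Imported here so the helpers above work without TensorFlow installed.
    import numpy as np
    import tensorflow as tf

    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    input_info = interpreter.get_input_details()[0]
    output_info = interpreter.get_output_details()[0]

    # Add the batch dimension and match the dtype the model expects.
    batch = np.expand_dims(np.asarray(rgb_image, dtype=input_info["dtype"]), axis=0)
    interpreter.set_tensor(input_info["index"], batch)
    interpreter.invoke()
    scores = interpreter.get_tensor(output_info["index"])[0]
    return top_labels(scores.tolist(), load_labels(labels_path))


# The label order decides which score belongs to which name.
print(top_labels([0.1, 0.7, 0.2], ["animal", "cat", "dog"], k=2))
```

On small devices, the separate `tflite_runtime` package provides an `Interpreter` class with the same methods, so the body of `classify` should need only a different import.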

Deploy on Android

This Android project contains an example application by the TensorFlow team which shows how to deploy the model on an Android device. We forked it and adjusted it to work with our models. Be aware that the model already normalizes the input image by itself, so you should not normalize the RGB image from the camera in any way.
Here you can download a simple Animal/Cat/Dog tagging model to test. First, copy the model file together with labels.txt to the assets folder of the Android project. Connect your mobile via USB cable to your computer, build the project in Android Studio and run it. Be aware that you should have developer mode with USB debugging enabled on your Android device (you can enable it in Settings). The application should appear on your Android device. Select the MobileNet-Float model, and you are ready for the magic to happen!

That’s it!

Remember, this is just sample code showing how to load the model and use it with your mobile camera. You can adjust or use the code in any way you need.

Deploy on iOS

With iOS, you can use either Objective-C or Swift. See an example application for iOS, which is implemented in Swift. If you are a developer, I recommend taking inspiration from this file on GitHub, which loads and calls the model. The official quick start guide for iOS from the TensorFlow team is on tensorflow.org.

Workstation/PC/Server

If you want to deploy the recognition model on your server, then start with the Ximilar-com/models repository. The folder scripts/recognition contains everything for the successful deployment of the model on your computer. You need to install TensorFlow version 2.2 or newer. If your workstation has an NVIDIA GPU, you can run the model on it. The GPU needs to have at least 4 GB of memory and CUDA Compute Capability 3.7+. Running inference on the GPU speeds up prediction several times. You can play with the batch size of your samples, which we recommend when using a GPU.
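Experimenting with the batch size mostly comes down to how you split your samples before passing them to the model. The helper below is a generic sketch with a hypothetical name; it does not come from the Ximilar-com/models repository, and the actual model call is left out.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def make_batches(samples: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most `batch_size` samples."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(samples), batch_size):
        yield list(samples[start:start + batch_size])


# Hypothetical file names; each batch would be loaded and sent to the model at once.
image_paths = ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"]
for batch in make_batches(image_paths, batch_size=2):
    print(batch)
```

Larger batches usually make better use of the GPU until memory runs out, so it is worth trying a few sizes on your own hardware.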

Deployment to Raspberry Pi is done through the Python library. See the classification Raspberry Pi project or the guide for tflite.

Edge and Embedded Devices

There is also the option to deploy on Coral, NVIDIA Jetson, or directly to a Web browser. Personally, we have had great experience on small projects with Jetson Nano. The MobileNetV2 architecture converted to TensorFlow Lite models works great. If you need to do object detection, tracking and counting, then we recommend using YOLO architectures converted to TensorRT. YOLO can run on Jetson Nano in real-time settings and is fantastic for factories, assembly lines and conveyor belts with a small number of product types. You can easily buy and set up a camera on Jetson. Luckily, we are able to develop such models for you and your projects.

Update 2021/2022: We developed an object and image recognition system for NVIDIA Jetson Nano for conveyor belts and factories. Read more in our blog post on how to create a visual AI system for Jetson.

Summary

Now you have another reason to use the Ximilar platform. Of course, by using offline models, you cannot use Ximilar Flows, which connect your tasks into a complex computer vision system. Otherwise, you can do whatever you want with your model.

To learn more about the TFLITE format, see the tflite guide by the TensorFlow team. Big thanks to them!

If you would like to download your model for offline usage, then contact us at sales@ximilar.com and our sales team will discuss a suitable pricing model for you.


]]>