How to Build a Good Visual Search Engine?
https://www.ximilar.com/blog/how-to-build-a-good-visual-search-engine/
Ximilar: Visual AI for Business, Mon, 09 Jan 2023

Let's take a closer look at the technology behind visual search and the key components of visual search engines.

Visual search is one of the most in-demand computer vision solutions. Our team at Ximilar has been actively developing the best general multimedia visual search engine for retailers, startups, and bigger companies that need to process a lot of images, video content, or 3D models.

However, a universal visual search solution is not the only thing that customers around the world will require in the future. Smaller companies and startups in particular increasingly look for custom or customizable visual search solutions for their sites & apps, built in a short time and for a reasonable price. What does creating a visual search engine actually look like? And can a visual search engine be built by anyone?

This article should provide a bit deeper insight into the technology behind visual search engines. I will describe the basic components of a visual search engine, analyze approaches to machine learning models and their training datasets, and share some ideas, training tips, and techniques that we use when creating visual search solutions. Those who do not wish to build a visual search from scratch can skip right to Building a Visual Search Engine on a Machine Learning Platform.

What Exactly Does a Visual Search Engine Mean?

The technology of visual search in general analyses the overall visual appearance of the image or a selected object in an image (typically a product), observing numerous features such as colours and their transitions, edges, patterns, or details. It is powered by AI trained specifically to understand the concept of similarity the way you perceive it.

In a narrow sense, visual search usually refers to a process in which a user uploads a photo, which a visual search engine then uses as an image search query. The engine in turn provides the user with either identical or similar items. You can find this technology under terms such as reverse image search, search by image, or simply photo & image search.

However, reverse image search is not the only use of visual search. The technology has numerous applications. It can search for near-duplicates, match duplicates, or recommend more or less similar images. All of these visual search tools can be used together in an all-in-one visual search engine, which helps internet users find, compare, match, and discover visual content.

And if you combine these visual search tools with other computer vision solutions, such as object detection, image recognition, or tagging services, you get a quite complex automated image-processing system. It will be able to identify images and objects in them and apply both keywords & image search queries to provide as relevant search results as possible.

Different computer vision systems can be combined on the Ximilar platform via Flows. If you would like to know more, here’s an article about how Flows work.

Typical Visual Search Engines: Google Lens & Pinterest Lens

Big visual search industry players such as Shutterstock, eBay, Pinterest (Pinterest Lens), or Google (Google Lens & Google Images) have already implemented visual search engines, as well as other advanced yet hidden algorithms, to satisfy the increasing needs of online shoppers and searchers. It is predicted that a majority of big companies will implement some form of soft AI in their everyday processes in the next few years.

The Algorithm for Training Visual Similarity

The Components of a Visual Search Tool

Multimedia search engines are very powerful systems consisting of multiple parts. The first key component is storage (database). It wouldn’t be exactly economical to store the full sample (e.g., .jpg image or .mp4 video) in a database. That is why we do not store any visual data for visual search. Instead, we store just a representation of the image, called a visual hash.

The visual hash (also visual descriptor or embedding) is basically a vector representing the data extracted from your image by the visual search. Each visual hash should be a unique combination of numbers that represents a single sample (image). These vectors also have useful mathematical properties, meaning you can compare them, e.g., with cosine, Hamming, or Euclidean distance.

So the basic principle of visual search is: the more similar the images are, the more similar their vector representations will be. Visual search engines such as Google Lens are able to compare incredible volumes of images (i.e., their visual hashes) and find the best match within about a hundred milliseconds thanks to smart indexing.
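To make the comparison of visual hashes concrete, here is a minimal sketch in plain Python of the three distances mentioned above. The example vectors and their names (shoe_1, shoe_2, dress) are made up for illustration; real embeddings usually have hundreds of dimensions, and Hamming distance only makes sense for binary hashes.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two real-valued vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def hamming(a, b):
    """Number of positions in which two binary hashes differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical 3-dimensional embeddings of three product photos.
shoe_1 = [0.9, 0.1, 0.3]
shoe_2 = [0.8, 0.2, 0.3]
dress = [0.1, 0.9, 0.7]

print(euclidean(shoe_1, shoe_2), euclidean(shoe_1, dress))
print(cosine_distance(shoe_1, shoe_2), cosine_distance(shoe_1, dress))
print(hamming([1, 0, 1, 1], [1, 1, 1, 0]))
```

With these toy values, the two shoes end up closer to each other than to the dress under both the Euclidean and the cosine distance.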

How to Create a Visual Hash?

The visual hashes can be extracted from images by standard algorithms such as pHash (perceptual hash). However, the era of big data gives us a much stronger model for vector representation: a neural network. A simple overview of the image search system built with a neural network can look like this:

Extracting visual vectors with the neural network and searching with them in a similarity collection.

This neural network was trained on images from a website selling cosmetics. Here, it extracted the embeddings (vectors), and they were stored in a database. Then, when a customer uploads an image to the visual search engine on the website, the neural network will extract the embedding vector from this image as well, and use it to find the most similar samples.

Of course, you could also store other metadata in the database, and do advanced filtering or add keyword search to the visual search.
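The flow described above (store embeddings, embed the query, return the nearest items, optionally filter by metadata) can be sketched with a brute-force search in plain Python. Everything here is hypothetical: the item ids, categories, and 3-dimensional embeddings stand in for the output of a trained network, and a production system would use an index instead of sorting the whole collection.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical collection: only embeddings and metadata are stored, not the images.
collection = [
    {"id": "lipstick-01", "category": "lips", "embedding": [0.9, 0.1, 0.0]},
    {"id": "lipstick-02", "category": "lips", "embedding": [0.8, 0.2, 0.1]},
    {"id": "mascara-01", "category": "eyes", "embedding": [0.1, 0.9, 0.2]},
    {"id": "cream-01", "category": "skin", "embedding": [0.2, 0.1, 0.9]},
]

def search(items, query_embedding, k=2, category=None):
    """Return the ids of the k most similar items, optionally filtered by category."""
    candidates = [item for item in items
                  if category is None or item["category"] == category]
    ranked = sorted(
        candidates,
        key=lambda item: cosine_similarity(query_embedding, item["embedding"]),
        reverse=True,
    )
    return [item["id"] for item in ranked[:k]]

query = [0.9, 0.1, 0.05]  # stand-in for the embedding of an uploaded photo
print(search(collection, query))
print(search(collection, query, k=1, category="eyes"))
```

The metadata filter is applied before ranking, so the similarity is only computed for items that can actually be returned.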

Types of Neural Networks

There are several basic architectures of neural networks that are widely used for vector representations. You can encode almost anything with a neural network. The most common for images is a convolutional neural network (CNN).

There are also special architectures to encode words and text. Lately, so-called transformer neural networks are starting to be more popular for computer vision as well as for natural language processing (NLP). Transformers use a lot of new techniques developed in the last few years, such as an attention mechanism. The attention mechanism, as the name suggests, is able to focus only on the “interesting” parts of the image & ignore the unnecessary details.

Training the Similarity Model

There are multiple methods to train models (neural networks) for image search. First, we should know that training of machine learning models is based on your data and loss function (also called objective or optimization function).

Optimization Functions

The loss function usually computes the error between the output of the model and the ground truth (labels) of the data. This error is used for adjusting the weights of the model. The model can be interpreted as a function and its weights as parameters of this function. Therefore, if the value of the loss function is big, the weights of the model need to be adjusted.

How it Works

The model is trained iteratively, taking subsamples of the dataset (batches of images) and going over the entire dataset multiple times. We call one such pass over the dataset an epoch. For each batch, the model computes the value of the loss function and adjusts its weights accordingly. The algorithm that computes how much each weight contributed to the error is called backpropagation; an optimizer such as gradient descent then uses these gradients to update the weights. Training is usually finished when the loss function stops improving (decreasing).
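The loop of epochs, batches, loss, and weight updates can be shown on a deliberately tiny problem. The sketch below fits a one-input linear model instead of a neural network, so the gradients are written by hand rather than obtained through backpropagation; the dataset, learning rate, and batch size are arbitrary illustration values.

```python
import random

random.seed(0)
# Toy dataset: inputs x with ground-truth labels y = 3x + 1.
data = [(x / 10, 3 * (x / 10) + 1) for x in range(-10, 11)]

def mse(samples, w, b):
    """Loss: mean squared error between the model output and the ground truth."""
    return sum((w * x + b - y) ** 2 for x, y in samples) / len(samples)

w, b = 0.0, 0.0            # model weights (parameters of the function)
learning_rate = 0.1
batch_size = 7
initial_loss = mse(data, w, b)

for epoch in range(200):                       # one epoch = one pass over the dataset
    random.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Gradients of the batch loss with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
        grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
        w -= learning_rate * grad_w            # move the weights against the gradient
        b -= learning_rate * grad_b

final_loss = mse(data, w, b)
print(round(w, 3), round(b, 3), final_loss < initial_loss)
```

In a real framework, the two gradient lines are replaced by automatic differentiation, but the structure of the loop stays the same.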

We can divide the methods (based on loss function) depending on the data we have. Imagine that we have a dataset of images, and we know the class (category) of each image. Our optimization function (loss function) can use these classes to compute the error and modify the model.

The advantage of this approach is its simple implementation. It’s practically only a few lines in any modern framework like TensorFlow or PyTorch. However, it also has a big disadvantage: the class-level optimization functions don’t scale well with the number of classes. We could potentially have thousands of classes (e.g., there are thousands of fashion products and each product represents a class). The computation of such a function with thousands of classes/arguments can be slow. There could also be a problem with fitting everything into GPU memory.

Loss Function: A Few Tips

If you work with a lot of labels, I would recommend using a pair-based loss function instead of a class-based one. The pair-based function usually takes two or more samples from the same class (i.e., the same group or category). A model based on a pair-based loss function doesn’t need to output a prediction for so many unique classes. Instead, it can process just a subsample of classes (groups) in each step. It doesn’t know exactly whether the image belongs to class 1 or 9999, but it knows that the two images are from the same class.

Images can be labelled manually or by a custom image recognition model. Read more about image recognition systems.

The Distance Between Vectors

The picture below shows the data in the so-called vector space before and after model optimization (training). In the vector space, each image (sample) is represented by its embedding (vector). Our vectors have two dimensions, x and y, so we can visualize them. The objective of model optimization is to learn the vector representation of images. The loss function is forcing the model to predict similar vectors for samples within the same class (group).

By similar vectors, I mean that the Euclidean distance between the two vectors is small. The larger the distance, the more different these images are. After the optimization, the model assigns a new vector to each sample. Ideally, the model should maximize the distance between images with different classes and minimize the distance between images of the same class.

Optimization for visual search should maximize the distance of items between different categories and minimize the distance within the category.

Sometimes we don’t know anything about our data in advance, meaning we do not have any metadata. In such cases, we need to use unsupervised or self-supervised learning, about which I will talk later in this article. Big tech companies do a lot of work with unsupervised learning. Special models are being developed for searching in databases. In research papers, this field is often called deep metric learning.

Supervised & Unsupervised Machine Learning Methods

1) Supervised Learning

As I mentioned, if we know the classes of images, the easiest way to train a neural network for vectors is to optimize it for the classification problem. This is a classic image recognition problem. The loss function is usually cross-entropy loss. In this way, the model is learning to predict predefined classes from input images. For example, to say whether the image contains a dog, a cat or a bird. We can get the vectors by removing the last classification layer of the model and getting the vectors from some intermediate layer of the network.
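For readers who have not seen it written out, here is a minimal sketch of the softmax cross-entropy loss for a single image in plain Python. The class names follow the dog/cat/bird example above; the logit values are invented stand-ins for the output of the last classification layer.

```python
import math

def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    top = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, true_class):
    """Negative log-probability that the model assigns to the correct class."""
    return -math.log(softmax(logits)[true_class])

classes = ["dog", "cat", "bird"]
confident_right = [4.0, 0.5, 0.1]   # hypothetical logits, highest score on "dog"
confident_wrong = [0.1, 4.0, 0.5]   # hypothetical logits, highest score on "cat"

print(cross_entropy(confident_right, classes.index("dog")))
print(cross_entropy(confident_wrong, classes.index("dog")))
```

The loss is small when the model puts most of the probability on the correct class and grows quickly when it is confidently wrong.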

When it comes to the pair-based loss function, one of the oldest techniques for metric learning is the Siamese network (contrastive learning). The name contains “Siamese” because there are two identical models sharing the same weights. In the Siamese network, we need to have pairs of images, which we label based on whether they are or aren’t equal (i.e., from the same class or not). Pairs in the batch that are equal are labelled with 1 and unequal pairs with 0.
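One common formulation of the contrastive loss for such labelled pairs is sketched below in plain Python. The margin value and the 2-dimensional embeddings are assumptions chosen for illustration, and some implementations add a factor of one half or use squared distances differently.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(emb_a, emb_b, label, margin=1.0):
    """label is 1 for a pair from the same class and 0 for a pair from different classes."""
    distance = euclidean(emb_a, emb_b)
    if label == 1:
        return distance ** 2                     # pull equal pairs together
    return max(0.0, margin - distance) ** 2      # push unequal pairs at least `margin` apart

print(contrastive_loss([0, 0], [0.1, 0], label=1))   # equal pair, already close
print(contrastive_loss([0, 0], [2.0, 0], label=0))   # unequal pair, already far apart
print(contrastive_loss([0, 0], [0.1, 0], label=0))   # unequal pair that is too close
```

Only the third pair produces a large loss, because it is the only one that violates what the label asks for.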

In the following image, we can see different batch construction methods that depend on our model: Siamese (contrastive) network, Triplet, or N-pair, which I will explain below.

Each deep learning architecture requires different batch construction methods. For example, Siamese and N-pair require tuples. However, in N-pair, the tuples must be unique.

Triplet Neural Network and Online/Offline Mining

In the Triplet method, we construct triplets of items, two of which (anchor and positive) belong to the same category and the third one (negative) to a different category. This can be harder than you might think because picking the “right” samples in the batch is critical. If you pick items that are too easy or too difficult, the network will converge (adjust weights) very slowly or not at all. The triplet loss function contains an important constant called margin. The margin defines the minimum amount by which the anchor-negative distance should exceed the anchor-positive distance.
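The role of the margin is easiest to see in code. This is a minimal sketch of the standard triplet loss with Euclidean distance in plain Python; the margin of 0.5 and the example embeddings are illustration values only.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Zero once the negative is at least `margin` farther from the anchor than the positive."""
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

anchor, positive = [0.0, 0.0], [0.1, 0.0]
easy_negative = [2.0, 0.0]    # far away: contributes nothing to training
hard_negative = [0.3, 0.0]    # inside the margin: produces a useful gradient

print(triplet_loss(anchor, positive, easy_negative))
print(triplet_loss(anchor, positive, hard_negative))
```

The easy triplet has zero loss, which is why a batch full of easy triplets teaches the network very little.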

Picking the right samples in deep metric learning is called mining. We can find optimal triplets via either offline or online mining. The difference is that in offline mining, you find the triplets at the beginning of each epoch, while in online mining, you pick them within each batch during training.

Online & Offline Mining

The disadvantage of offline mining is that computing embeddings for every sample at the start of each epoch is not very computationally efficient. During the epoch, the model can change rapidly, so these embeddings quickly become obsolete. That’s why online mining of triplets is more popular. In online mining, the triplets for each batch are selected just before the model is fitted on that batch, using embeddings from the current model. For more information about mining and batch strategies for triplet training, I would recommend this post.
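One popular online strategy is usually called batch-hard mining: for every anchor in the batch, take the farthest sample of the same class as the positive and the closest sample of a different class as the negative. The sketch below is a plain-Python illustration of that idea, not the procedure from the linked post; the embeddings and labels are invented.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def batch_hard_triplets(embeddings, labels):
    """Return (anchor, positive, negative) index triplets mined inside one batch."""
    triplets = []
    for a, (emb_a, label_a) in enumerate(zip(embeddings, labels)):
        positives = [i for i, lab in enumerate(labels) if lab == label_a and i != a]
        negatives = [i for i, lab in enumerate(labels) if lab != label_a]
        if not positives or not negatives:
            continue  # this anchor cannot form a triplet in this batch
        hardest_pos = max(positives, key=lambda i: euclidean(emb_a, embeddings[i]))
        hardest_neg = min(negatives, key=lambda i: euclidean(emb_a, embeddings[i]))
        triplets.append((a, hardest_pos, hardest_neg))
    return triplets

# Embeddings as the current model would produce them for one batch (made-up values).
batch_embeddings = [[0.0, 0.0], [0.2, 0.0], [1.0, 0.0], [0.5, 0.0], [3.0, 0.0]]
batch_labels = ["shoe", "shoe", "shoe", "dress", "dress"]

print(batch_hard_triplets(batch_embeddings, batch_labels))
```

Because the embeddings come from the current state of the model, the mined triplets never become stale, which is the main advantage over offline mining.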

We can visualize the Triplet model training in the following way. The model is copied three times, and all three copies share the same weights. Each copy takes one image from the triplet (anchor, positive, negative) and outputs the embedding vector. Then, the triplet loss is computed and weights are adjusted with backpropagation. After the training is done, the model weights are frozen and the output embeddings are used in the similarity engine. Because the three copies share the same weights, we keep only one of them for predicting embedding vectors on images.

Triplet network that takes a batch of anchor, positive and negative images.

N-pair Models

The more modern approach is the N-pair model. The advantage of this model is that you don’t mine negative samples, as you do with a triplet network. The batch consists of just positive pairs. The negative samples are obtained through the matrix construction, where all non-diagonal items are negative samples.

You still need to do online mining. For example, you can select a batch with a maximum value of the loss function, or pick pairs that are distant in metric space.
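The matrix construction can be sketched as follows. Each row of the similarity matrix compares one anchor with every positive in the batch; the diagonal entry is the true positive and the rest of the row serves as negatives. This is a plain-Python illustration of the N-pair loss under the assumption that similarity is a dot product, and it leaves out the regularization term that implementations often add.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def n_pair_loss(anchors, positives):
    """anchors[i] and positives[i] come from class i; every class appears exactly once."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        row = [dot(anchor, positive) for positive in positives]  # row i of the matrix
        # row[i] is the positive pair; all other entries in the row act as negatives.
        total += math.log(sum(math.exp(s - row[i]) for s in row))
    return total / len(anchors)

anchors = [[1.0, 0.0], [0.0, 1.0]]
good_positives = [[0.9, 0.1], [0.1, 0.9]]   # each positive is close to its own anchor
bad_positives = [[0.1, 0.9], [0.9, 0.1]]    # each positive is close to the other anchor

print(n_pair_loss(anchors, good_positives))
print(n_pair_loss(anchors, bad_positives))
```

The requirement that every class appears only once in the batch is what makes the off-diagonal entries safe to use as negatives.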

The N-pair model requires a unique pair of items. In the triplet and Siamese model, your batch can contain multiple triplets/pairs from the same class (group).

In our experience, the N-pair model is much easier to fit, and the results are also better than with the triplet or Siamese model. You still need to do a lot of experiments and know how to tune other hyperparameters such as learning rate, batch size, or model architecture. However, you don’t need to work with the margin value in the loss function, as you do in triplet or Siamese. A small drawback is that during batch creation, we must always have exactly two items per class/product.

Proxy-Based Methods

In the proxy-based methods (Proxy-Anchor, Proxy-NCA, Soft Triple) the model is trying to learn class representatives (proxies) from samples. Imagine that instead of having 10,000 classes of fashion products, we will have just 20 class representatives. The first representative will be used for shoes, the second for dresses, the third for shirts, the fourth for pants and so on.

A big advantage is that we don’t need to work with so many classes and the problems that come with them. The idea is to learn class representatives and, instead of slowly mining “the right samples”, use the learned representatives when computing the loss function. This leads to much faster training & convergence of the model. This approach, as always, has some cons and open questions, such as how many representatives we should use.
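The following plain-Python sketch shows the general shape of a proxy-based loss: a softmax over negative squared distances between a sample and each proxy. It is written in the spirit of Proxy-NCA but simplified (the positive proxy is kept in the denominator, and the proxies are fixed here, whereas in training they are learnable parameters). The class names and vectors are invented.

```python
import math

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def proxy_loss(embedding, label, proxies):
    """Small when the embedding is closest to the proxy of its own class."""
    logits = {cls: -squared_distance(embedding, proxy) for cls, proxy in proxies.items()}
    log_total = math.log(sum(math.exp(value) for value in logits.values()))
    return log_total - logits[label]

# One proxy per class representative; in real training these vectors are learned.
proxies = {"shoes": [1.0, 0.0], "dresses": [0.0, 1.0], "shirts": [-1.0, 0.0]}
sample = [0.9, 0.1]   # hypothetical embedding of a shoe photo

print(proxy_loss(sample, "shoes", proxies))
print(proxy_loss(sample, "dresses", proxies))
```

Each sample is compared only with the handful of proxies, never with other samples, so no mining is needed.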

MultiSimilarity Loss

Finally, it is worth mentioning MultiSimilarity Loss, introduced in this paper. MultiSimilarity Loss is suitable in cases when you have more than two items per class (images per product). The authors of the paper are using 5 samples per class in a batch. MultiSimilarity can bring items within the same class closer together and push the negative samples far away by effectively weighting informative pairs. It works with three types of similarities:

  • Self-Similarity (the distance between the negative sample and anchor)
  • Positive-Similarity (the relationship between positive pairs)
  • Negative-Similarity (the relationship between negative pairs)
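As a rough illustration, the weighting part of the MultiSimilarity loss can be written in a few lines of plain Python. The sketch assumes cosine similarities are already computed, skips the pair-mining step that precedes the weighting in the paper, and uses illustrative values for the hyperparameters alpha, beta, and lam; the helper make_sim exists only to build toy similarity matrices.

```python
import math

def multi_similarity_loss(sim, labels, alpha=2.0, beta=50.0, lam=0.5):
    """sim[i][k] is the cosine similarity between samples i and k of one batch."""
    n = len(labels)
    total = 0.0
    for i in range(n):
        pos = [sim[i][k] for k in range(n) if k != i and labels[k] == labels[i]]
        neg = [sim[i][k] for k in range(n) if labels[k] != labels[i]]
        # Positive pairs with low similarity and negative pairs with high
        # similarity get the largest weight in the sums below.
        pos_term = math.log(1 + sum(math.exp(-alpha * (s - lam)) for s in pos)) / alpha
        neg_term = math.log(1 + sum(math.exp(beta * (s - lam)) for s in neg)) / beta
        total += pos_term + neg_term
    return total / n

def make_sim(labels, same, different):
    """Toy similarity matrix with one value for same-class and one for other pairs."""
    return [[1.0 if i == k else (same if a == b else different)
             for k, b in enumerate(labels)] for i, a in enumerate(labels)]

labels = [0, 0, 1, 1]
well_separated = make_sim(labels, same=0.9, different=0.1)
poorly_separated = make_sim(labels, same=0.4, different=0.8)

print(multi_similarity_loss(well_separated, labels))
print(multi_similarity_loss(poorly_separated, labels))
```

The exponentials act as soft weights, which is how the loss focuses on informative pairs without a hard margin.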

It is also worth noting that you don’t need to use only one loss function; you can combine multiple loss functions. For example, you can use the Triplet Loss function with CrossEntropy and MultiSimilarity, or N-pair together with Angular Loss. This often leads to better results than a standalone loss function.

2) Unsupervised Learning

AutoEncoder

Unsupervised learning is helpful when we have a completely unlabelled dataset, meaning we don’t know the classes of our images. These methods are very interesting because the annotation of data can be very expensive and time-consuming. The simplest form of unsupervised learning uses some kind of AutoEncoder.

An AutoEncoder is a neural network consisting of two parts: an encoder, which encodes the image into a smaller representation (embedding vector), and a decoder, which tries to reconstruct the original image from the embedding vector.

After the whole model is trained, and the decoder is able to reconstruct the images from smaller vectors, the decoder part is discarded and only the encoder part is used in similarity search engines.
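The encoder/decoder idea can be demonstrated on a toy problem without any framework. In the sketch below, the "images" are 2-D points lying on a line, the encoder compresses each point into a single number, and the decoder reconstructs the point from it. Both parts are linear, the gradients are derived by hand, and all values (initial weights, learning rate, number of steps) are arbitrary choices for illustration.

```python
# Toy "images": 2-D points on the line y = 2x, so one number can describe each of them.
data = [(t, 2 * t) for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]

def encode(x, enc):
    """Encoder: 2-D input -> 1-D embedding."""
    return enc[0] * x[0] + enc[1] * x[1]

def decode(z, dec):
    """Decoder: 1-D embedding -> 2-D reconstruction."""
    return (dec[0] * z, dec[1] * z)

def reconstruction_loss(samples, enc, dec):
    total = 0.0
    for x in samples:
        x_hat = decode(encode(x, enc), dec)
        total += (x_hat[0] - x[0]) ** 2 + (x_hat[1] - x[1]) ** 2
    return total / len(samples)

enc, dec = [0.5, 0.5], [0.5, 0.5]
learning_rate = 0.1
initial_loss = reconstruction_loss(data, enc, dec)

for step in range(500):
    grad_enc, grad_dec = [0.0, 0.0], [0.0, 0.0]
    for x in data:
        z = encode(x, enc)
        x_hat = decode(z, dec)
        err = (x_hat[0] - x[0], x_hat[1] - x[1])
        grad_z = 2 * (err[0] * dec[0] + err[1] * dec[1])   # chain rule through the decoder
        for j in range(2):
            grad_dec[j] += 2 * err[j] * z / len(data)
            grad_enc[j] += grad_z * x[j] / len(data)
    enc = [enc[j] - learning_rate * grad_enc[j] for j in range(2)]
    dec = [dec[j] - learning_rate * grad_dec[j] for j in range(2)]

final_loss = reconstruction_loss(data, enc, dec)
print(initial_loss, final_loss)
print([round(encode(x, enc), 3) for x in data])   # embeddings from the encoder alone
```

After training, only encode is needed to produce embeddings, which mirrors discarding the decoder in a real similarity search system.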

Simple AutoEncoder neural network for learning embeddings via reconstruction of the image.

There are many other solutions for unsupervised learning. For example, we can train AutoEncoder architecture to colourize images. In this technique, the input image has no colour and the decoding part of the network tries to output a colourful image.

Image Inpainting

Another technique is Image Inpainting, where we remove part of the image and the model learns to inpaint it back. Another interesting approach is to train a model to solve jigsaw puzzles or to put the frames of a video in the correct order.

Then there are more advanced unsupervised models like SimCLR, MoCo, PIRL, SimSiam or GAN architectures. All these models try to internally represent images so their outputs (vectors) can be used in visual search systems. The explanation of these models is beyond the scope of this article.

Tips for Training Deep Metric Models

Here are some useful tips for training deep metric learning models:

  • Batch size plays an important role in deep metric learning. Some methods such as N-pair should have bigger batch sizes. Bigger batch sizes generally lead to better results; however, they also require more memory on the GPU card.
  • If your dataset has a bigger variation and a lot of classes, use a bigger batch size for MultiSimilarity Loss.
  • The most important part of metric learning is your data. It’s a pity that most research, as well as articles, focus only on models and methods. If you have a large collection with a lot of products, it is important to have a lot of samples per product. If you have fewer classes, try to use some unsupervised method or cross-entropy loss and do heavy augmentations. In the next section, we will look at data in more depth.
  • Try to start with a pre-trained model and tune the learning rate.
  • When using Siamese or Triplet training, try to play with the margin term. All the modern frameworks will allow you to change it (make it harder) during the training.
  • Don’t forget to normalize the output of the embedding if the loss function requires it. Because we are comparing vectors, they should be normalized in a way that the norm of the vectors is always 1. This way, we are able to compute Euclidean or cosine distances.
  • Use advanced methods such as MultiSimilarity with big batch size. If you use Siamese, Triplet, or N-pair, mining of negatives or positives is essential. Start with easier samples at the beginning and increase the challenging samples every epoch.
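The normalization tip from the list above is worth a short illustration. Once embeddings are scaled to unit length, the squared Euclidean distance and the cosine similarity carry the same information, because for unit vectors the squared distance equals 2 minus twice the cosine similarity. The vectors below are arbitrary examples.

```python
import math

def l2_normalize(vector):
    """Scale a vector so that its Euclidean norm is 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

a = l2_normalize([3.0, 4.0])
b = l2_normalize([1.0, 0.0])

cosine_similarity = sum(x * y for x, y in zip(a, b))
squared_euclidean = sum((x - y) ** 2 for x, y in zip(a, b))

print(a, math.sqrt(sum(x * x for x in a)))
print(squared_euclidean, 2 - 2 * cosine_similarity)
```

A practical consequence is that ranking by Euclidean distance and ranking by cosine similarity give the same order on normalized embeddings.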

Neural Text Search on Images with CLIP

Up to now, we were talking purely about images and searching images with image queries. However, a common use case is to search a collection of images with text input, as we do with Google or Bing search. This is also called the Text-to-Image problem, because we need to transform the text representation into the same representation as images (the same vector space). Luckily, researchers from OpenAI developed a simple yet powerful architecture called CLIP (Contrastive Language-Image Pre-training). The concept is simple: instead of training on pairs of images (Siamese, N-pair), we train two models (one for images and one for text) on pairs of images and texts.

The architecture of the CLIP model by OpenAI. Image source: GitHub
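The training objective behind this idea can be sketched as a symmetric cross-entropy over the image-text similarity matrix. The plain-Python code below is a simplified illustration, not OpenAI's implementation: the embeddings are hand-written unit vectors standing in for the outputs of the two encoders, and the temperature is fixed instead of learned.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy(scores, target):
    """Softmax cross-entropy for one list of scores, computed in a stable way."""
    top = max(scores)
    log_total = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_total - scores[target]

def clip_style_loss(image_embs, text_embs, temperature=0.1):
    """image_embs[i] and text_embs[i] describe the same item; vectors have unit length."""
    n = len(image_embs)
    logits = [[dot(img, txt) / temperature for txt in text_embs] for img in image_embs]
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_image = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return (image_to_text + text_to_image) / 2

images = [[1.0, 0.0], [0.0, 1.0]]            # hypothetical image-encoder outputs
matching_texts = [[1.0, 0.0], [0.0, 1.0]]    # captions aligned with their images
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]    # captions pointing at the wrong images

print(clip_style_loss(images, matching_texts))
print(clip_style_loss(images, shuffled_texts))
```

Once both encoders map into the same vector space, a text query can be embedded and compared with stored image embeddings exactly like an image query.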

You can train a CLIP model on a dataset and then use it on your images (or videos) collection. You are able to find similar images/products or try to search your database with a text query. If you would like to use a CLIP-like model on your data, we can help you with the development and integration of the search system. Just contact us at care@ximilar.com, and we can create a search system for your data.

The Training Data for Visual Search Engines

Nearly all deep learning models have a very expensive requirement: data. The data should not contain any errors such as wrong labels, and we should have a lot of it. However, obtaining enough samples can be a problematic and time-consuming process. That is why techniques such as transfer learning or image augmentation are widely used to enrich the datasets.

How Does Image Augmentation Help With Training Datasets?

Image augmentation is a technique allowing you to multiply training images and therefore expand your dataset. When preparing your dataset, proper image augmentation is crucial. Each specific category of data requires unique augmentation settings for the visual search engine to work properly. Let’s say you want to build a fashion visual search engine based strictly on patterns and not the colours of items. Then you should probably employ heavy colour distortion and channel-swapping augmentation (randomly swapping red, green, or blue channels of an image).

On the other hand, when building an image search engine for a shop with coins, you can rotate the images and flip them left-right and upside-down. But what to do if the classic augmentations are not enough? We have a few more options.
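The classic augmentations mentioned above are simple array operations. The sketch below implements flips, a 90-degree rotation, and random channel swapping in plain Python on a tiny image stored as rows of (red, green, blue) tuples; in practice you would use an image library, and the function names here are my own.

```python
import random

def flip_left_right(image):
    return [list(reversed(row)) for row in image]

def flip_upside_down(image):
    return [list(row) for row in reversed(image)]

def rotate_90_clockwise(image):
    return [list(row) for row in zip(*reversed(image))]

def swap_channels(image, rng):
    """Randomly permute the red, green and blue channels of every pixel."""
    order = [0, 1, 2]
    rng.shuffle(order)
    return [[tuple(pixel[c] for c in order) for pixel in row] for row in image]

# A 2x2 toy image: each pixel is an (R, G, B) tuple.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (10, 20, 30)],
]

rng = random.Random(0)
augmented = [
    flip_left_right(image),
    flip_upside_down(image),
    rotate_90_clockwise(image),
    swap_channels(image, rng),
]
print(len(augmented), "new training variants from one image")
```

Which of these are safe depends on the domain: rotations suit coins, while channel swapping only makes sense when colour should not influence similarity.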

Removing or Replacing Background

Most of the models that are used for image search require pairs of different images of the same object. Typically, when training product image search, we use an official product photo from a retail site and another picture from a smartphone, such as a real-life photo or a screenshot. This way, we get a pair-based model that understands the similarity of a product in pictures with different backgrounds, lights, or colours.

The difference between a product photo and a real-life image made with a smartphone, both of which are important to use when training computer vision models.

All such photos of the same product belong to an entity which we call a Similarity Group. This way, we can build an interactive tool for your website or app, which enables users to upload a real-life picture (sample) and find the product they are interested in.

Background Removal Solution

Sometimes, obtaining multiple images of the same group can be impossible. We found a way to tackle this issue by developing a background removal model that can distinguish the dominant foreground object from its background and detect its pixel-accurate position.

Once we know the exact location of the object, we can generate new photos of products with different backgrounds, making the training of the model more effective with just a few images.

The background removal can also be used to narrow the area of augmentation only to the dominant item, ignoring the background of the image. There are a lot of ways to get the original product in different styles, including changing saturation, exposure, highlights and shadows, or changing the colours entirely.
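Once a foreground mask is available, generating new training variants is a per-pixel choice between the product and a new background. The sketch below shows that compositing step in plain Python. The mask is assumed to come from some background removal model (it is hard-coded here), and the "pixels" are short strings so the result is easy to read.

```python
def replace_background(image, mask, background):
    """Keep pixels where mask is 1 (the product) and take the rest from `background`."""
    return [
        [fg if keep else bg for fg, keep, bg in zip(image_row, mask_row, background_row)]
        for image_row, mask_row, background_row in zip(image, mask, background)
    ]

# "P" marks product pixels, "w" the white studio backdrop of the original photo.
product_photo = [["w", "P", "w"],
                 ["w", "P", "w"]]
mask = [[0, 1, 0],
        [0, 1, 0]]          # hypothetical output of a background removal model

backgrounds = {
    "kitchen": [["k", "k", "k"], ["k", "k", "k"]],
    "street": [["s", "s", "s"], ["s", "s", "s"]],
}

variants = {name: replace_background(product_photo, mask, bg)
            for name, bg in backgrounds.items()}
for name, variant in variants.items():
    print(name, variant)
```

The same mask can also restrict colour or exposure augmentations to the product pixels only.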

Generating more variants can make your model very robust.

Building such an augmentation pipeline with background/foreground augmentation can take hundreds of hours and a lot of GPU resources. That is why we deployed our Background Removal solution as a ready-to-use image tool.

You can use the Background Removal as a stand-alone service for your image collections, or as a tool for training data augmentation. It is available in public demo, App, and via API.

GAN-Based Methods for Generating New Training Data

One of the modern approaches is to use a Generative Adversarial Network (GAN). GANs are incredibly powerful in generating whole new images from some specific domain. You can simply create a model for generating new kinds of insects or making birds with different textures.

Creating new insect images automatically to train an image recognition system? How cool is that? There are endless possibilities with GAN models for basically any image type. [Source]

The greatest advantage of a GAN is that you will easily get a lot of new variants, which will make your model very robust. GANs are starting to be widely used in more tasks such as simulations, and I think the gathering of data will cost much less in the near future because of them. At Ximilar, we used a GAN to create the GAN Image Upscaler, which adds new relevant pixels to images to increase their resolution and quality.

When creating a visual search system on our platform, our team picks the most suitable neural network architecture, loss functions, and image augmentation settings through the analysis of your visual data and goals. All of these are critical for the optimization of a model and the final accuracy of the system. Some architectures are more suitable for specific problems like OCR systems, fashion recommenders or quality control. The same goes for image augmentation: choosing the wrong settings can destroy the optimization. We have experience with selecting the best tools to solve specific problems.

Annotation System for Building Image Search Datasets

As we can see, a good dataset definitely is one of the key elements for training deep learning models. Obtaining such a collection can be quite expensive and time-consuming. With some of our customers, we build a system that continually gathers the images needed in the training datasets (for instance, through a smartphone app). This feature continually & automatically improves the precision of the deployed search engines.

How does it work? When new images are uploaded to the Ximilar Platform (through the Custom Similarity service) either via App or API, our annotators can check them and use them to enhance the training dataset in Annotate, our interface dedicated to image annotation & management of datasets for computer vision systems.

Annotate effectively works with the similarity groups by grouping all images of the same item. The annotator can add the image to a group with the relevant Stock Keeping Unit (SKU), label it as either a product picture or a real-life photo, add some tags, or mark objects in the picture. They can also mark images that should be used for the evaluation and not used in the training process. In this way, you can have two separate datasets, one for training and one for evaluation.

We are quite proud of all the capabilities of Annotate, such as quality control, team cooperation, or API connection. There are not many web-based data annotation apps where you can effectively build datasets for visual search, object detection, as well as image recognition, and which are connected to a whole visual AI platform based on computer vision.

A sneak peek into Annotate – image annotation tool for building visual search and image similarity models.

How to Improve Visual Search Engine Results?

We have already established that the optimization algorithm and the training dataset are key elements in training your similarity model, and that having multiple images per product significantly increases the quality of the trained similarity model. The model (CNN or other modern architecture) for similarity is used for embedding (vector) extraction, which determines the quality of image search.

Over the years that we’ve been training visual search engines for various customers around the world, we were also able to identify several potential weak spots. Fixing them really helped with the performance of searches as well as the relevance of the search results. Let’s take a look at what can improve your visual search engine:

Include Tags

Adding relevant keywords for every image can improve the search results dramatically. We recommend using some basic words that are not synonymous with each other. A poor set of keywords for one item would be, for instance, “sky, skyline, cloud, cloudy, building, skyscraper, tall building, a city”, while a good alternative would be “sky, cloud, skyscraper, city”.

Our engine can internally use these tags and improve the search results. You can let an image recognition system label the images instead of adding the keywords manually.

Include Filtering Categories

You can store the main categories of images in their metadata. For instance, in real estate, you can distinguish photos that were taken inside or outside. Based on this, the searchers can filter the search results and improve the quality of the searches. This can also be easily done by an image recognition task.

Include Dominant Colours

Colour analysis is very important, especially when working for a fashion or home decor shop. We built a tool conveniently called Dominant Colors, with several extraction options. The system can extract the main colours of a product while ignoring its background. Searchers can use the colours for advanced filtering.

Use Object Detection & Segmentation

Object detection can help you focus both the search engine and its user on the product, simply by cropping the detected object out of the image. You can also apply background removal to search & showcase the products the way you want. For training object detection and other custom image recognition models, you can use our app Annotate.

Use Optical Character Recognition (OCR)

In some domains, you can have products with text. For instance, wine bottles or skincare products carry the name of the item and other text labels, which can be read by artificial intelligence, stored as metadata and used for keyword search on your site.

Our visual search engine allows us to combine several features for multimedia search with advanced filtering.

Improve Image Resolution

If the images uploaded from mobile phones have low resolution, you can use an image upscaler to increase the resolution of the image, screenshot, or video. This way, you will get as much as possible even from user-generated content with potentially lower quality.

Combine Multiple Approaches

Combining multiple features like model embeddings, tags, dominant colours, and text increases your chances of building a solid visual search engine. Our system is able to use these different modalities and return the best items accordingly. For example, extracting dominant colours is really helpful in Fashion Search, our service combining object detection, fashion tagging, and visual search.
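One simple way to picture such a combination is a weighted sum of per-modality scores. The sketch below mixes the cosine similarity of embeddings with the overlap of tags. It is only an illustration under assumed names and weights (fused_score, w_visual, w_tags, and the toy records are all made up here), not a description of how our engine fuses modalities.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def tag_overlap(tags_a, tags_b):
    """Jaccard overlap of two tag lists (0.0 when both are empty)."""
    a, b = set(tags_a), set(tags_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def fused_score(query, item, w_visual=0.7, w_tags=0.3):
    """Weighted sum of visual and tag similarity; the weights are illustrative."""
    visual = cosine_similarity(query["embedding"], item["embedding"])
    tags = tag_overlap(query["tags"], item["tags"])
    return w_visual * visual + w_tags * tags

query = {"embedding": [1.0, 0.0], "tags": ["dress", "red"]}
items = [
    {"_id": "a", "embedding": [0.9, 0.1], "tags": ["dress", "blue"]},
    {"_id": "b", "embedding": [0.8, 0.2], "tags": ["dress", "red"]},
]
ranked = sorted(items, key=lambda it: fused_score(query, it), reverse=True)
print([it["_id"] for it in ranked])
```

In this toy data, item "a" is slightly closer visually, but item "b" shares both tags with the query, so the fused ranking puts "b" first.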

Search Engine and Vector Databases

Once you have trained your model (neural network), you can extract the embeddings for your multimedia items and store them somewhere. There are a lot of search engine implementations able to work with vectors (embedding representations) that you can use, for example, Annoy from Spotify or FAISS from Facebook.

These solutions are open-source (i.e. you don’t have to deal with usage rights) and you can use them for simple solutions. However, they also have a few disadvantages:

  • With some of these libraries (Annoy, for example), you cannot perform any update, insert or delete operations after the initial build of the index. Once you store the data, you can only perform search queries.
  • You are unable to use a combination of multiple features, such as tags, colours, or metadata.
  • There’s no support for advanced filtering for more precise results.
  • You need to have an IT background and coding skills to implement and use them. And in the end, the system must be deployed on some server, which brings additional challenges.
  • It is difficult to extend them for advanced use cases; you would need to learn the project’s complex codebase and adjust it accordingly.
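For intuition about what a vector search does at its core, the sketch below is a brute-force, in-memory index in plain Python. It is not how Annoy or FAISS are implemented (they rely on approximate indexes to scale to millions of vectors); the TinyVectorIndex class and its method names are made up for this example.

```python
import math

class TinyVectorIndex:
    """Brute-force cosine search over embeddings kept in a dict."""

    def __init__(self):
        self._vectors = {}

    def insert(self, item_id, vector):
        # Store a unit-length copy so search reduces to a dot product.
        norm = math.sqrt(sum(x * x for x in vector))
        if norm == 0:
            raise ValueError("zero vector cannot be normalized")
        self._vectors[item_id] = [x / norm for x in vector]

    def delete(self, item_id):
        self._vectors.pop(item_id, None)

    def search(self, vector, k=3):
        """Return up to k (item_id, similarity) pairs, most similar first."""
        norm = math.sqrt(sum(x * x for x in vector))
        query = [x / norm for x in vector]
        scored = [
            (item_id, sum(q * v for q, v in zip(query, stored)))
            for item_id, stored in self._vectors.items()
        ]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

index = TinyVectorIndex()
index.insert("sofa_1", [0.9, 0.1, 0.0])
index.insert("sofa_2", [0.8, 0.2, 0.1])
index.insert("lamp_1", [0.0, 0.1, 0.9])
print(index.search([1.0, 0.0, 0.0], k=2))
index.delete("sofa_1")
print(index.search([1.0, 0.0, 0.0], k=1))
```

Every query here compares against every stored vector, which is fine for a few thousand items but far too slow for large collections. Filtering, combining modalities, and deployment would all have to be built on top.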

Building a Visual Search Engine on a Machine Learning Platform

The creation of a great visual search engine is not an easy task. The mentioned challenges and disadvantages of building complex visual search engines with high performance are the reasons why a lot of companies hesitate to dedicate their time and funds to building them from scratch. That is where AI platforms like Ximilar come into play.

Custom Similarity Service

Ximilar provides a computer vision platform, where a fast similarity engine is available as a service. Anyone can connect via API and fill their custom collection with data and query at the same time. This streamlines the tedious workflow a lot, enabling people to have custom visual search engines fast and, more importantly, without coding. Our image search engines can handle other data types like videos, music, or 3D models. If you want more privacy for your data, the system can also be deployed on your hardware infrastructure.

In all industries, it is important to know what we need from our model and optimize it towards the defined goal. We developed our visual search services with this in mind. You can simply define your data and problem and what should be the primary goal for this similarity. This is done via similarity groups, where you put the items that should be matched together.

Examples of Visual Search Solutions for Business

One of the typical industries that use visual search extensively is fashion. Here, you can look at similarities in multiple ways. For instance, one can simply want to find footwear with a colour, pattern, texture, or shape similar to the product in a screenshot. We built several visual search engines for fashion e-shops and especially price comparators, which combined search by photo and recommendations of alternative similar products.

Based on a long experience with visual search solutions, we deployed several ready-to-use services for visual search: Visual Product Search, a complex visual search service for e-commerce including technologies such as search by photo, similar product recommendations, or image matching, and Fashion Search created specifically for the fashion segment.

Another nice use case is also the story of how we built a Pokémon Trading Card search engine. It is no surprise that computer vision has been recently widely applied in the world of collectibles. Trading card games, sports cards or stamps and visual AI are a perfect match. Based on our customers’ demand, we also created several AI solutions specifically for collectibles.

The Workflow of Building a Visual Search Engine

If you are looking to build a custom search engine for your users, we can develop a solution for you, using our service Custom Image Similarity. This is the typical workflow of our team when working on a customized search service:

  1. Setup, Research & Plan – Initial calls, the definition of the project, NDA, and agreement on expected delivery time.

  2. Data – If you don’t provide any data, we will gather it for you. Gathering and curating datasets is the most important part of developing machine learning models. Having a well-balanced dataset without any bias to any class leads to great performance in production.

  3. First prototype – Our machine learning team will start working on the model and collection. You will be able to see the first results within a month. You can test it and evaluate it by yourself via our clickable front end.

  4. Development – Once you are satisfied with the results, we will gather more data and do more experiments with the models. This is an iterative way of improving the model.

  5. Evaluation & Deployment – If the system performs well and meets the criteria set up in the first calls (mostly some evaluation on the test dataset and speed performance), we work on the deployment. We will show you how to connect and work with the API for visual similarity (insert, delete, search endpoints).

If you are interested in knowing more about how the cooperation with Ximilar works in general, read our How it works and contact us anytime.

We are also able to do a lot of additional steps, such as:

  • Managing and gathering more training data continually after the deployment to gradually increase the performance of visual similarity (the usage rights for user-generated content are up to you; keep in mind that we don’t store any physical images).
  • Building a customized model or multiple models that can be integrated into the search engine.
  • Creating & maintaining your visual search collection, with automatic synchronization to always keep up to date with your current stock.
  • Scaling the service to hundreds of requests per second.

Visual Search is Not Only For the Big Companies

I presented the basic techniques and architectures for training visual similarity models, but of course, there are many more advanced models, and research in this field is advancing by leaps and bounds.

Search engines are practically everywhere. It all started with AltaVista in 1995 and Google in 1998. Now it’s more common to get information directly from Siri or Alexa. Searching for things with visual information is just another step, and we are glad that we can give our clients tools to maximise their potential. Ximilar has a lot of technical experience with advanced search technology for multimedia data, and we work hard to make it accessible to everyone, including small and medium companies.

If you are considering implementing visual search into your system:

  1. Schedule a call with us and we will discuss your goals. We will set up a process for getting the training data that are necessary to train your machine learning model for search engines.

  2. In the following weeks, our machine learning team will train a custom model and a testable search collection for you.

  3. After meeting all the requirements of the PoC, we will deploy the system to production, and you can connect to it via REST API.

Pokémon TCG Search Engine: Use AI to Catch Them All https://www.ximilar.com/blog/pokemon-card-image-search-engine/ Tue, 11 Oct 2022 12:20:00 +0000 https://www.ximilar.com/?p=4551 With a new custom image similarity service, we are able to build an image search engine for collectible cards trading.

The post Pokémon TCG Search Engine: Use AI to Catch Them All appeared first on Ximilar: Visual AI for Business.

]]>
Have you played any trading card games? As an elementary school student, I remember spending hundreds of hours playing Lord of the Rings TCG with my friend. Back then, LOTR was in the cinemas, and the game was simply fantastic, with beautiful pictures from movies. I still remember my deck, played with a combination of Ents/Gondor and Nazguls.

Other people in our office spent their youth playing Magic The Gathering (with beautiful artworks), or collecting sports cards with their favorite athletes. In my country, basketball cards and ice hockey cards were really popular. Cards are still loved, played, collected, and traded by geeks, collectors, and sports fans across the world! Their market is growing, and so is the need for automation of image processing on websites and apps for collectors. Right now, cards can be seen even as a great investment.

Where can you use visual AI for cards?

Trading card games (トレーディングカード) can consist of tens of thousands of cards. In principle, building a basic image classifier based solely on image recognition leads to low precision and is simply not enough for more complicated problems.

However, we are able to build a complex similarity system that can recognize, categorize, and find similar cards by a picture. Once trained properly, it can deal with enormous databases of images it never encountered before. With this system, you can find all the information, such as the year of release, card title, exact value, card set, or whether it already is in someone’s collection, with just a smartphone image of the card.

Tip: Check out our Computer Vision Platform to learn about how basic image recognition systems work. If you are not sure how to develop your card search system, just contact us and we will help you.

Collectibles are a big business and some cards are really expensive nowadays. Who knows, maybe you have the card of Charizard or Kobe Bryant hidden in your old box in the attic. We can develop a system for you that can automatically analyze the bulk of trading cards sent from your customers or integrate it into your mobile/smartphone app.

Automatic Recognition of Collectibles

Ximilar built an AI system for the detection, recognition and grading of collectibles. Check it out!

What can visual search do for the trading cards market?

In the last year, we have been building a universal system able to train visual models with numerous applications in image search engines. We already offer visual search services for photo search, but they are optimized mostly for general and fashion images. This system can be tuned to trading cards, coins, furniture & home decor, arts, and real estate; the use cases are practically endless.

In recent decades, we have all witnessed the growth of the TCG community. However, technologies based on artificial intelligence have rarely been used in this market. Even though the first system for scanning trading cards was released by eBay, it was not made available to small shops as an API. And since trading card games and visual AI are a perfect match, we are going to change that with a card image search.

Tip: Check out Visual Product Search to learn more about visual search applications.

Which TCG cards could visual AI help with?

An image search engine is a great approach when the number of classes for the image classification is high (above 1,000). With TCGs, each card represents a unique class. A convolutional neural network (CNN) trained as a classifier can have poor results when working with a larger number of classes.

Pokémon TCG contains more than 10,000 cards (classes), Magic the Gathering (MTG) over 50,000, and the same goes for basketball or any other sports cards. So basically, we can build a visual search system for both:

  • Trading card games (Magic the Gathering, Lord of the Rings, Pokémon, Yu-Gi-Oh!, One Piece, Warhammer, and so on)
  • Collectible sports cards (like Ice Hockey, Football, Soccer, Baseball, Basketball, UFC, and more)
Pokémon, Magic The Gathering, LOTR, Ice Hockey, and Basketball cards.
Yes, we are big fans of all these things 🙂

Visual search/recognition technology is starting to be used on eBay when listing trading and sports cards for sale. However, this is only available in the eBay app on smartphones. The app has a built-in scanning tool for cards and can find the average price with additional info.

Our service for card image search can be integrated into your website or application. And you can simply connect via API through a smartphone, computer, or sorting machine to find exact cards by photo, saving a lot of time and improving the user experience!

We have recently been training an AI (neural network) model for Pokémon, Yu-Gi-Oh! and Magic The Gathering trading cards. Why these? Pokémon is the most played TCG in the world; the game has simple rules and an enormous fan base. MTG and Yu-Gi-Oh! are also very popular. Some cards are really expensive, but more importantly, they are traded heavily!

With this model, we built a reverse search for finding the exact Pokémon, MTG and Yu-Gi-Oh! cards, which achieved 94%+ accuracy (i.e. exact image match). And this is still a prototype in beta version that can be improved to almost 100%. This search system can return the edition of the card, its language, name, year of release and much more.
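For readers who want to measure exact-match accuracy on their own test set, here is a minimal sketch of a top-k accuracy computation for a retrieval system. The function name, the data layout, and the sample IDs are assumptions made for this example, and the numbers are invented; they are not our evaluation data.

```python
def top_k_accuracy(results, ground_truth, k=1):
    """Share of queries whose correct card ID appears in the first k results.

    results: dict mapping query ID -> ranked list of returned card IDs.
    ground_truth: dict mapping query ID -> the correct card ID.
    """
    if not ground_truth:
        return 0.0
    hits = sum(
        1 for query_id, correct in ground_truth.items()
        if correct in results.get(query_id, [])[:k]
    )
    return hits / len(ground_truth)

# Made-up search results for four query photos.
results = {
    "photo_1": ["card_17", "card_3"],
    "photo_2": ["card_8", "card_5"],
    "photo_3": ["card_2", "card_9"],
    "photo_4": ["card_4", "card_1"],
}
ground_truth = {"photo_1": "card_17", "photo_2": "card_5",
                "photo_3": "card_2", "photo_4": "card_4"}
print(top_k_accuracy(results, ground_truth, k=1))  # 3 of 4 correct
print(top_k_accuracy(results, ground_truth, k=2))  # all 4 within the top 2
```

Top-1 accuracy corresponds to the exact image match mentioned above; looking at larger k shows how often the right card is at least among the first few candidates.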

If you would like to try the system on these three trading card games, then the endpoint for card identification (/v2/tcg_id) from the Collectibles Recognition service is the right choice for you. If you need to tune it on your image collections or have any other games or cards (sports) then just contact us and we can build a similar service for you.
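A request to this endpoint could be prepared roughly as follows. Only the endpoint path /v2/tcg_id comes from the text above; the base URL, the Token authorization header, and the "records" payload shape are assumptions in this sketch and should be checked against the current API documentation. The helper build_tcg_id_request is a made-up name, and it only builds the request without sending it.

```python
import json
import urllib.request

# Assumed values: the base URL, header format, and payload shape should be
# checked against the current API documentation before use.
API_BASE = "https://api.ximilar.com/collectibles"

def build_tcg_id_request(api_token, image_urls):
    """Build (but do not send) a POST request for the /v2/tcg_id endpoint."""
    payload = {"records": [{"_url": url} for url in image_urls]}
    return urllib.request.Request(
        url=API_BASE + "/v2/tcg_id",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Token " + api_token,
            "Content-Type": "application/json",
        },
        method="POST",
    )

request = build_tcg_id_request("__API_TOKEN__", ["https://example.com/card.jpg"])
print(request.get_method(), request.full_url)
# To actually call the API: urllib.request.urlopen(request)
```

Keeping the request construction separate from sending makes it easy to inspect the payload before spending any API credits.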

Automatic grading and inspection of cards with AI

A lot of companies grade sports & trading cards manually. Our visual AI can be trained to detect corner types, scratches, surface wear, light staining, creases, focus, and borders. The image recognition models are able to identify marks, miscuts, lopsided centering, print defects and other special attributes.

For example, PSA is a company that has developed its own grading standards for automatic card grading (MINT). With our platform and team, you can automate the entire grading workflow with just one photo. We provide several solutions for computing card grades and card condition.

PSA graded baseball card. Automatic grading is possible with machine learning.

With the new custom similarity service, we are able to create a custom solution for trading card image search in a matter of weeks. The process for developing it is quite simple:

  1. We will schedule a call and talk about your goals. We will agree on how we will obtain the training data that are necessary to train your custom machine-learning model for the search engine.
  2. Our machine-learning specialists will assemble a testable image search collection and train a custom machine-learning model for you in a matter of weeks.
  3. After meeting all the requirements of the PoC, we will deploy the system to production, and you can connect to it via REST API.

Image Recognition of Collectibles

Machine learning models bring endless possibilities not only to pop culture geeks and collectors, but to all fields and industries. From personalized recommendations in custom fashion search engines to automatic detection of slight differences in surface materials, the visual AI gets better and smarter every day, making routine tasks a matter of milliseconds. That is one of the reasons why it is an unlimited resource of curiosity, challenges, and joy for us, including being geeks – professionally :).

Ximilar is currently releasing a ready-to-use computer vision service able to recognize collectibles such as TCG cards, coins, banknotes or postage stamps, detect their features and categorize them. Let us know if you’d like to implement it on your website!

If you are interested in a customized AI solution for collector items, write us an email and we will get back to you as soon as possible. If you would like to identify cards with our Collectibles Recognition service, just sign up via app.ximilar.com.

Image Similarity as a Service For Your Web https://www.ximilar.com/blog/superfast-image-similarity-for-your-website/ Tue, 27 Jul 2021 16:43:13 +0000 https://www.ximilar.com/?p=1044 A step-by-step guide for using image similarity as a service. Find similar items with accurate & fast API for Image Search.

The post Image Similarity as a Service For Your Web appeared first on Ximilar: Visual AI for Business.

]]>
With the service Image Similarity added to the Ximilar App, you can build your own visual similarity engine powered by artificial intelligence in just a few clicks, with several lines of code. Similarity search enables companies to improve the user experience significantly and increase revenue with smarter management of their visual data.

The technology behind image similarity is robust, reliable & fast. It is built on state-of-the-art (SOTA) AI models and vector databases, so you can search millions of images/products in milliseconds. It is used by big e-commerce players as well as small startups for showing visual alternatives or finding products with pictures. Some of our customers have hundreds of millions of images in their collections and make more than 100 million requests per month. Let’s dive into building a superfast similarity search service for your website.


What is Image Similarity?

Image Similarity, or image similarity search, is a visual AI service comparing, grouping, and recommending visually similar images. For example, a typical use case is a product recommendation of similar items in e-shops. It can also be used for reverse image search, where the query is an external image and the results are images from the collection. This approach gives way more accurate results than searching by tags, labels and other attributes.

Ximilar is using state-of-the-art deep learning models for all visual search services. We build our own indexing & searching technology that can run both as a service or on your hardware if needed. The collections can be focused either on product photos, fashion, image matching, or generic photos (stock images).

Features of the Image Similarity Service

Here are several features of the Image Similarity service that we think are crucial:

  • Simple access through the Ximilar App (creating a collection on click) and connection to REST API
  • The scalable search service can handle collections with hundreds of millions of similar items (images, videos, etc.) and hundreds of requests per second with both CRUD operations and searching
  • The ultra-fast and reliable engine that is mostly deployed in large e-commerce platforms – the query for finding the most visually similar product is low latency (in milliseconds)
  • The service is customizable – the platform enables you to train your own model for visual similarity search
  • Advanced filtering that supports JSON meta-data – if you need to restrict the result to a specific field
  • Grouping based on similarity – our search technology can group photos of the same product as one item
  • Security and privacy of your data – only meta-data and the visual representation of the images are stored, therefore your images are not stored anywhere
  • The service is affordable and cost-effective both for startups and enterprises, offering a free plan for tests as well as discounts as you grow over time
  • We can deploy it on your hardware, independently of our infrastructure, and also offline – custom similarity model and deployment appropriate to your needs
  • Our search engine and machine learning models improve constantly – maintaining much higher quality than any other open-source project & we are able to build custom search engines with trained models

Applications Using Visual Similarity

According to this research by Deloitte, merchandising with artificial intelligence is more and more relevant, and recommendation engines play a vital part in it. Here are a few use cases for visual similarity engines:

  • E-shops that use product similarity to help customers to browse and find related products (e.g. in fashion & luxury items, home decor & furniture, art, wall art, prints & posters, collectible trading cards, comics, trademarks, etc.)
  • Stock photo databases suggesting similar content – getting visual alternatives of photos, designs, product images, and videos
  • Finding the exact products – apps like Vivino for finding wine or any kind of product are easy to develop for us
  • Visual similarity duplicate finder (also image matching or deduplication), to know which images are already in your database, or which product photos you can merge together
  • Reverse image search – finding a product or an image with a picture online
  • Finding similar real estate based for example on interior design, furniture, garden, etc.
  • Comparing two images for similarity – for example patterns or designs
Showing similar wall art with a jungle pattern. [Source]

Recommending products to your customers has several advantages. Firstly, it creates a better user experience and helps your customers find the right products faster. Secondly, it can raise the purchase rate on your website. This means a win on both sides – satisfied customers and higher revenue for you. Read more about customer experience and product recommendations in our blog post on fashion search.

Creating the Collection

So let’s take a look at how to easily build your own similarity search engine with the Ximilar platform. The first step is to log in to the Ximilar App. If you don’t have an account, then sign up – it’s free and takes just a minute. After that, on the Dashboard, click on the Visual Search tile and then the Image Similarity service. Then go to the Collections in the left menu and click on Create New Collection. It will show a pop-up with different collection types from which you need to select one.

The collection is a space where you upload your images. With this collection, you are performing queries for search. You can choose from Generic Photo Collection, Product Photo Collection, Dominant Colors Similarity, and Image Matching. Clicking on one of the cards will create a collection for your account.

Pick one collection type suitable for your data to create your similarity application.

Each of these collection types is suitable for different types of images:

  • Use Generic Photos if you work with stock photos
  • Pick a Product Photos collection if you are an e-commerce company
  • Select Image Matching to find duplicates in your images
  • For the fashion sector, we recommend using a specialized service called Fashion Search
  • Custom Similarity is suitable if you are working with another type of data (e.g. videos or 3D models). To do this, please schedule a call with us, and we will develop your own model tuned for your data. For instance, we built a photo search system for the Magic the Gathering Trading Cards for one of our customers.

For this example of real estate, I will use a Generic Photo Collection. The advantage of Generic Photo Collection is that it also supports searching images via text input/query. We usually develop custom similarity models for real estate, when the customers need specific and more accurate results. However, for this simple use case, the generic real estate model will be enough.

Format of Image Similarity Dataset

Example of real estate image that is inserted into the similarity search collection.

First, we need to prepare a text file with JSON records. Each record represents an image that we want to store/insert into our collection. The key field is "_url" with the image URL. The advantage of the _url is that you can directly see and inspect the results via app.ximilar.com.

You can also optionally send records with base64 data; this is useful if your data are stored locally on your computer. Don’t worry, we are not storing the whole images (data or base64) in the collection database, just URLs with all other metadata present in the records.

The JSON records look like this:

{"_id": "1_1", "_url": "_URL_IMAGE_PATH_", "estate_id": "1", "category": "indoor", "subcategory": "kitchen", "tags": []}
{"_id": "1_2", "_url": "_URL_IMAGE_PATH_", "estate_id": "1", "category": "indoor", "subcategory": "kitchen", "tags": []}
...

If you don’t have image URLs, you can use either "_file" or "_base64" fields for the image data (locally stored "_file" data are automatically converted by the Python client to base64). The image similarity engine is indexing every record of the collection by extracting a representation from the image by a neural network model. However, we are not storing the images in our engine. So, only records that contain "_url" will be visualized in the Ximilar App.

You must store a unique identifier for each image in the "_id" field to identify your images in the collection. The value of this field must be a string. The API endpoint for searching returns these _id values; that is how you get the results for visual search. You can also store additional fields in every JSON record and then use these fields for filtering, grouping, and tuning the similarity function (see below).
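If your image list lives in a database or a spreadsheet, a short script can produce the records file. The sketch below assumes the file holds one JSON record per line, as in the example above. The helper names build_records and write_records, the example.com URLs, and the output path are all made up for illustration.

```python
import json

def build_records(images):
    """Turn (image_id, url, metadata) tuples into collection records."""
    records, seen_ids = [], set()
    for image_id, url, metadata in images:
        record_id = str(image_id)  # "_id" must be a string
        if record_id in seen_ids:
            raise ValueError(f"duplicate _id: {record_id}")
        seen_ids.add(record_id)
        records.append({"_id": record_id, "_url": url, **metadata})
    return records

def write_records(records, path):
    """Write one JSON record per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")

images = [
    ("1_1", "https://example.com/estate1/kitchen1.jpg",
     {"estate_id": "1", "category": "indoor", "subcategory": "kitchen", "tags": []}),
    ("1_2", "https://example.com/estate1/kitchen2.jpg",
     {"estate_id": "1", "category": "indoor", "subcategory": "kitchen", "tags": []}),
]
records = build_records(images)
print(json.dumps(records[0]))
```

Calling write_records(records, "records.json") would then save the file. Checking for duplicate IDs before the upload is cheaper than discovering overwritten records afterwards.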

Filling the Collection With Your Data

The next step requires a few lines of code. We are going to insert the prepared images into our collection using our python-client library. You can install the library using pip or directly from GitLab. The usage of the client is very straightforward and basically, you can just use the script tools/collections/insert_json_records.py:

python insert_json_records.py --type generic --auth_token __YOUR_TOKEN__ --collection_id __COLLECTION_ID__ --path /path/to/the/file.json

You will find the collection ID and the Authorization token on the “collection page” in the Ximilar App. This script will run for a few minutes, depending on the size of your image dataset.

Result: Finding Visually Similar Pictures

That was pretty easy, right? Now, if you go to the collections page, you will see something like this:

You can see your image similarity collection in the Ximilar App

All images from the JSON file were indexed, and now you can inspect the collection in the Ximilar App. Select the Similarity Search in the left menu of the Image Similarity service and test how the similarity works. You can specify the query image either by upload, by URL, or your IDs, or by choosing one of the randomly selected images from the collection.

Even though we have indexed just several hundred images, you can see that the similarity engine works pretty well. The first image is the query image and the next images are the k-nearest to the query image:

Showing most visually similar real estate to the first image.

The next step might be to integrate the service into your application via API. You can either directly use the REST API for searching visually similar images or, if you are using Python, we recommend our Python SDK client like this:

# pip install ximilar-client
from ximilar.client import SimilarityPhotosClient

client = SimilarityPhotosClient("_API_TOKEN_", "_COLLECTION_ID_")

# search the k nearest items to a record that is already in the collection
client.search({"_id": "1_1"}, k=3)

# search by an external image (not stored in the collection)
client.search({"_url": "_URL_PATH_"})

Advanced Features for Photo Similarity

The search for visually similar images can be combined with filtering on metadata. This metadata can be stored in the JSON, as in our example with the "category" and "subcategory" fields. In the API, the filtering is specified using a MongoDB-like syntax – see the documentation.

For example, let’s say that we want to search for images similar to the image with ID=1_1 that are indoor photos made in a kitchen. We assume that this meta-information is stored in the “category” and “subcategory” fields of every JSON record. The query will look like this:

client.search({"_id": "1_1"}, filter={"category": "indoor", "subcategory": "kitchen"})

If we know that we will often filter on some fields, we can specify them in the “Fields to index” option of the collection to make the query processing more efficient.
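To make the filter semantics tangible, the sketch below evaluates a tiny subset of MongoDB-style conditions (plain equality and "$in") against records in memory. It only illustrates how such filters behave, assuming standard MongoDB semantics; the matches helper is a made-up name and has nothing to do with how the engine evaluates filters internally.

```python
def matches(record, query_filter):
    """Check a record against a small subset of MongoDB-style conditions.

    Supports plain equality and the "$in" operator only.
    """
    for field, condition in query_filter.items():
        value = record.get(field)
        if isinstance(condition, dict):
            for operator, operand in condition.items():
                if operator != "$in":
                    raise NotImplementedError(operator)
                if value not in operand:
                    return False
        elif value != condition:
            return False
    return True

records = [
    {"_id": "1_1", "category": "indoor", "subcategory": "kitchen"},
    {"_id": "1_3", "category": "indoor", "subcategory": "bedroom"},
    {"_id": "2_1", "category": "outdoor", "subcategory": "garden"},
]
kitchen_only = {"category": "indoor", "subcategory": "kitchen"}
rooms = {"subcategory": {"$in": ["kitchen", "bedroom"]}}
print([r["_id"] for r in records if matches(r, kitchen_only)])
print([r["_id"] for r in records if matches(r, rooms)])
```

Note that equality in this sketch is case-sensitive, so the filter values need to be written exactly as they are stored in the records.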

You can specify which field from the JSON records will define your SKU identifier.

Often, your data contains several photos of one “object” – a product or, in our example, real estate. Our service can group the search results not by individual photos but by product IDs. You can set this in the advanced options of the collection by entering the name of the field that identifies the real estate (estate_id in our example) in the Product ID field, and the results will be grouped automatically.
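The effect of grouping can be pictured with a few lines of Python: from a ranked list of photo hits, keep only the best photo for each product. This is a sketch of the idea on the client side, with a hypothetical group_by_product helper and invented "score" values; the service does the grouping for you once the Product ID field is set.

```python
def group_by_product(hits, product_field="estate_id"):
    """Keep the best-scoring photo per product, preserving rank order.

    hits: list of dicts sorted from most to least similar.
    """
    best = {}
    for hit in hits:
        # Because hits are sorted, the first photo seen per product is the best.
        best.setdefault(hit[product_field], hit)
    return list(best.values())

hits = [
    {"_id": "1_1", "estate_id": "1", "score": 0.97},
    {"_id": "1_2", "estate_id": "1", "score": 0.95},
    {"_id": "7_4", "estate_id": "7", "score": 0.91},
    {"_id": "3_2", "estate_id": "3", "score": 0.88},
    {"_id": "7_1", "estate_id": "7", "score": 0.80},
]
print([hit["_id"] for hit in group_by_product(hits)])
```

Five photo hits collapse into three results, one per real estate, so the user sees three different properties instead of several photos of the same one.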

Enhancing Image Similarity Engine with Tags

The image similarity is based purely on the visual content of the image. However, you can use your tags (labels, keywords) to enhance the similarity search. In the example, we assume that the data already contains categories, subcategories, and tags. In order to enhance the visual similarity search with tags, you can fill the “tags” field of every record with your tags and use the /v2/visualTagsKNN method for searching. After that, your search results will be based on a combination of visual similarity and keywords.

If you don’t have categories and tags, you can create your own photo tagger through our Image Recognition service, and enrich your image data automatically before indexing. The possibilities of image recognition models and their combinations are endless, resulting in highly customizable solutions. Read our guide on how to build your own Image Recognition API.

With the Ximilar Image Recognition service, you can create custom tagging models for your images.

You can build several models:

  • One classifier for categorizing indoor/outdoor/floor plan photos
  • One classifier for getting room type (Bedroom, Kitchen, Living room, etc.)
  • One tagger for outdoor tags like (Pool, Garden, Garage, House view, etc.)

To Sum Up

The real estate photo similarity search is only one of many use cases of visual similarity (fashion, e-commerce, art, stock photos, healthcare…). We hope that you will enjoy working with this service, and we are looking forward to seeing your projects based on it. Thanks to our developers Libor and Ludovit, you can use this service through the frontend app.

The Visual Similarity service by Ximilar is unique in terms of search quality, speed, and all the possibilities of the API. Our engineers are constantly upgrading the quality of the search, so you don’t have to. We are able to build custom solutions suitable for your data. With multiple collections, you can even A/B test the performance on your websites. This can run in our cloud as SaaS or in your warehouse! If you have more questions about pricing or technical details, or you would like to run the similarity search engine on your own machines, contact us.

The post Image Similarity as a Service For Your Web appeared first on Ximilar: Visual AI for Business.

]]>
Image Annotation Tool for Teams https://www.ximilar.com/blog/image-annotation-tool-for-teams/ Thu, 06 May 2021 11:55:57 +0000 https://www.ximilar.com/?p=4115 Annotate is an advanced image annotation tool supporting complex taxonomies and teamwork on computer vision projects.

The post Image Annotation Tool for Teams appeared first on Ximilar: Visual AI for Business.

]]>
Over the years, we have worked with many annotation tools. The problem is that most desktop annotation apps are offline and intended for single-person use, not for team cooperation. The web-based apps, on the other hand, mostly focus on data management with photo annotation, and not on the whole ecosystem with API and inference systems. In this article, I review what a good image annotation tool should do and explain the basic features of our own tool, Annotate.

Every big machine learning project requires the active cooperation of multiple team members: engineers, researchers, annotators, product managers, or owners. For example, supervised deep learning for object detection, as well as segmentation, outperforms unsupervised solutions. However, it requires a lot of data with correct annotations. Annotation of images is one of the most time-consuming parts of every deep learning project. Therefore, picking the right annotation tool is critical. As your team grows and your projects become more complex over time, you may encounter new challenges, such as:

  • Adding labels to the taxonomy would require re-checking a lot of your work
  • Increasing the performance of your models would require more data
  • You will need to monitor the progress of your projects

Building solid annotation software for computer vision is not an easy task. And yes, it involves a lot of failures and many wrong turns before finding the best solution. So let’s look at what the basic features of an advanced data annotation tool should be.

What Should an Advanced Image Annotation Tool Do?

Many customers are using our cloud platform Ximilar App in very specific areas, such as Fashion, Healthcare, Security, or Industry 4.0. The environment of a proper AI helper or tool should be complex enough to cover requirements like:

  • Features for team collaboration – you need to assign tasks, and then check the quality and consistency of data
  • Great user experience for dataset curation – everything should be as simple as possible, but no simpler
  • Fast production of high-quality datasets for your machine-learning models
  • Work with complex taxonomies & many models chained with Flows
  • Fast development and prototyping of new features
  • Connection to REST API with Python SDK & querying annotated data

With these needs in mind, we created our own image annotation tool. We use it in our internal projects and provide it to our customers as well. Our technologies for machine learning accelerate the entire pipeline of building good datasets. Whether you are a freelancer tagging pictures or a team managing product collections in e-commerce, Annotate can help.

Our Visual AI tools enable you to work with your own custom taxonomy of objects, such as fashion apparel or things captured by the camera. You can read the basics on the categories & tags and machine learning model training, watch the tutorials, or check our demo and see for yourself how it works.

The Annotate

Annotate is an advanced image annotation tool, which enables you to annotate images precisely and fast. It works as an end-to-end platform for visual data management. You can query the same images, change labels, create objects, draw bounding boxes and even polygons here.

It is a web-based online annotation tool that works fully in the cloud. Since it is connected to the same back-end & database as Ximilar App, all changes you make in Annotate show up in your workspace in the App, and vice versa. You can create labels, tasks & models, or upload images through the App, and use them in Annotate.

Ximilar Application and Annotate are connected to the same backend (api.ximilar.com) and the same database.

Annotate extends the functionalities of the Ximilar App. The App is great for training, creating entities, uploading data, and batch management of images (bulk actions for labelling and filtering). Annotate, on the other hand, was created for the detail-oriented management of images. The default single-zoomed image view brings advantages, such as:

  • Identifying separate objects, drawing polygons and adding metadata to a single image
  • Suggestions based on AI image recognition help you choose from very complex taxonomies
  • The annotators focus on one image at a time to minimize the risk of mistakes

Interested in getting to know Annotate better? Let’s have a look at its basic functions.

Deep Focus on a Single Image

If you enter the Images (left menu), you can open any image in the single image view. To the right of the image, you can see all the items located in it. This is where most of the labelling is done. There is also a toolbar for drawing objects and polygons, labelling images, and inspecting metadata.

In addition, you can zoom in/out and drag the image. This is especially helpful when working with smaller objects or big-resolution images. For example, teams annotating medical microscope samples or satellite pictures can benefit from this robust tool.

The main view of the image in our Fashion Tagging workspace

Create Multiple Workspaces

Some of you already know this from other SaaS platforms. The idea is to divide your data into several independent storages. Imagine your company is working on multiple projects at the same time and each of them requires you to label your data with an image annotation tool. Your company account can have many workspaces, each for one project.

Here is our active workspace for Fashion Tagging

Within the workspaces, you don’t mix your images, labels, and tasks. For example, one workspace contains only images for fruit recognition projects (apples, oranges, and bananas) and another contains data on animals (cats and dogs).

Your team members can get access to different workspaces. Also, everyone can switch between the workspaces in the App as well as in Annotate (top right, next to the user icon). Did you know that the workspaces are also accessible via API? Check out our documentation and learn how to connect to the API.

Train Precise AI Models with Verification

Building good computer vision models requires a lot of data, high-quality annotations, and a team of people who understand the process of building such a dataset. In short, to create high-quality models, you need to understand your data and have a perfectly annotated dataset. In the words of the Director of AI at Tesla, Andrej Karpathy:

“Labeling is a job for highly trained professionals.” Andrej Karpathy (Director of AI at Tesla)

Annotate helps you build high-quality AI training datasets by verification. Every image can be verified by different users in the workspace. You can increase the precision by training your models only on verified images.

A list of users who verified the image with the exact dates

Verifying your data is a necessary step in the creation of good deep-learning models. To verify an image, simply click the Verify button, or Verify & Next if you are working on a job. You will be able to see who verified any particular image and when.
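Selecting only verified images for training can be pictured with a short sketch. The data layout (a `verifications` list with `user` and `date`), the function name, and the sample values are assumptions for illustration, not the format returned by the API.

```python
from typing import List


def verified_images(images: List[dict], min_verifications: int = 1) -> List[str]:
    """Return IDs of images verified by at least `min_verifications` distinct users."""
    selected = []
    for image in images:
        # Count each user once, even if they verified the image repeatedly.
        users = {v["user"] for v in image.get("verifications", [])}
        if len(users) >= min_verifications:
            selected.append(image["_id"])
    return selected


# Hypothetical workspace content.
workspace = [
    {"_id": "img-1", "verifications": [{"user": "anna", "date": "2021-05-01"},
                                       {"user": "ben", "date": "2021-05-02"}]},
    {"_id": "img-2", "verifications": [{"user": "anna", "date": "2021-05-01"}]},
    {"_id": "img-3"},
]
print(verified_images(workspace))
print(verified_images(workspace, min_verifications=2))
```

Raising the minimum number of verifying users trades dataset size for higher label reliability.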

Create and Track Image Annotating Jobs

When you need to process the newly uploaded images, you can assign them to a Job and a team of people can process them one by one in a job queue. You can also set up exactly how many times each image should be seen by the people processing this queue.

Moreover, you can specify which photo recognition model or flow of models should be displayed when doing the job. For example, here is the view of the jobs that we are using in one of our tagging services.

Two jobs are waiting to be completed by annotators. You can start working by hitting the play button on the right.

When working on a job, every time an annotator hits the Verify & Next button, they are taken to the next image in the job. You can track the progress of each job in the Jobs list. Once the image annotation job is complete, the progress bar turns green, and you can proceed to the next steps: retraining the models, uploading new images, or creating another job.
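The job queue logic described in this section can be sketched as a small class. This is a simplified illustration, not how Annotate is implemented: the class and method names are hypothetical, and the sketch does not track which annotator has already seen an image.

```python
from typing import List, Optional


class AnnotationJob:
    """A queue of images where each image must be verified a fixed number of times."""

    def __init__(self, image_ids: List[str], views_per_image: int = 1):
        self.views_per_image = views_per_image
        self.views = {image_id: 0 for image_id in image_ids}

    def next_image(self) -> Optional[str]:
        # Return the first image that still needs a verification, if any.
        for image_id, count in self.views.items():
            if count < self.views_per_image:
                return image_id
        return None

    def verify_and_next(self, image_id: str) -> Optional[str]:
        # Record one verification and move on to the next pending image.
        self.views[image_id] += 1
        return self.next_image()

    def progress(self) -> float:
        # Fraction of required verifications that are already done.
        total = len(self.views) * self.views_per_image
        if total == 0:
            return 1.0
        done = sum(min(c, self.views_per_image) for c in self.views.values())
        return done / total


job = AnnotationJob(["a.jpg", "b.jpg"], views_per_image=2)
job.verify_and_next("a.jpg")
print(job.progress())
```

When `progress()` reaches 1.0, the job is complete, which corresponds to the green progress bar.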

Draw Objects and Polygons

Sometimes, recognizing the most probable category or tags for an image is not enough. That is why Annotate lets you identify the location of specific things by drawing objects and polygons. The great thing is that you are not paying any credits for drawing objects or labelling. This makes Annotate one of the most cost-effective online apps for image annotation.

Click and drag with the rectangle tool on the canvas to create a detection object.

What exactly do you pay for when annotating data? API credits are counted only for data uploads, with volume-based discounts. This makes Annotate an affordable, yet powerful tool for data annotation. If you want to know more, read our newest article on API Credit Packs, or check our Pricing Plans or Documentation.

Annotate With Complex Taxonomies Elegantly

The greatest advantage of Annotate is working with very complex taxonomies and attribute hierarchies. That is why it is usually used by companies in E-commerce, Fashion, Real Estate, Healthcare, and other areas with rich databases. For example, our Fashion tagging service contains more than 600 labels that belong to more than 100 custom image recognition models. The taxonomy tree for some of the biotech projects can be even broader.

Navigating through the taxonomy of labels is very elegant in Annotate – via Flows. Once your Flow is defined (our team can help you with it), you simply add labels to the images. The branches expand automatically when you add labels. In other words, you always see only essential labels for your images.

Simply navigate through your taxonomy tree, expanding branches when clicking on specific labels.

For example, this image contains a fashion object “Clothing”, to which we need to assign more labels. Adding the Clothing/Dresses label will expand the tags that are in the Length Dresses and Style Dresses tasks. If you select the label Elegant from Style Dresses, only the features & attributes you need will be suggested for annotation.
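The branch expansion can be modelled as a mapping from labels to the tasks they unlock. The sketch below is only an illustration of the idea, not the Flows implementation. The “Category” root task and all label names except Clothing/Dresses and Elegant are hypothetical.

```python
from typing import Dict, List, Sequence

# Labels offered by each task (mostly hypothetical values).
TASK_LABELS: Dict[str, List[str]] = {
    "Category": ["Clothing/Dresses", "Clothing/Skirts"],
    "Length Dresses": ["Mini", "Midi", "Maxi"],
    "Style Dresses": ["Elegant", "Casual"],
}

# Selecting a label unlocks further tasks, like a branch of the taxonomy tree.
UNLOCKS: Dict[str, List[str]] = {
    "Clothing/Dresses": ["Length Dresses", "Style Dresses"],
}


def visible_tasks(selected_labels: Sequence[str],
                  root_tasks: Sequence[str] = ("Category",)) -> List[str]:
    """Return the tasks an annotator should see for the labels chosen so far."""
    tasks = list(root_tasks)
    for label in selected_labels:
        for task in UNLOCKS.get(label, []):
            if task not in tasks:
                tasks.append(task)
    return tasks


print(visible_tasks([]))
print(visible_tasks(["Clothing/Dresses"]))
```

Before any label is chosen, only the root task is shown. The dress-specific tasks appear once Clothing/Dresses is added.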

Automate Repetitive Tasks With AI

Annotate was initially designed to speed up the work when building computer vision solutions. When annotating data, manual drawing & clicking is a time-consuming process. That is why we created the AI helper tools to automate the entire annotating process in just a few clicks. Here are a few things that you can do to speed up the entire annotation pipeline:

  • Use the API to upload your previously annotated data to train or re-train your machine learning models and use them to annotate or label more data via API
  • Create bounding boxes and polygons for object detection & instance object segmentation with one click
  • Create jobs, share the data, and distribute the tasks to your team members

Predicting bounding boxes with one click automates the entire process of annotation.
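One-click prediction can be thought of as turning model outputs into unverified suggestions that an annotator then confirms or corrects. The following sketch assumes a hypothetical prediction format (`label`, `prob`, `box` as `[xmin, ymin, xmax, ymax]`) and a hypothetical probability threshold.

```python
from typing import List


def suggest_boxes(predictions: List[dict], min_prob: float = 0.6) -> List[dict]:
    """Turn confident model predictions into unverified annotation suggestions."""
    return [
        # Suggestions start as unverified; a human still has to confirm them.
        {"label": p["label"], "box": p["box"], "verified": False}
        for p in predictions
        if p["prob"] >= min_prob
    ]


# Hypothetical output of an object detection model for one image.
predictions = [
    {"label": "Dress", "prob": 0.93, "box": [40, 30, 210, 380]},
    {"label": "Bag", "prob": 0.35, "box": [220, 200, 280, 270]},
]
suggestions = suggest_boxes(predictions)
print(suggestions)
```

Low-confidence predictions are dropped, so annotators spend their time confirming likely objects instead of deleting noise.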

Image Annotation Tool for Advanced Visual AI Training

As the main focus of Ximilar is AI for sorting, comparing, and searching multimedia, we integrate the annotation of images into the building of AI search models. This is something that we miss in all other data annotation applications. To build such models, you need to group multiple items (images or objects, typically product pictures) into Similarity Groups. Annotate helps us create datasets for building strong image similarity search models.

Grouping the same or similar images with the Image Annotation Tool. You can tell which item is a smartphone photo or which photos should be located on an e-commerce platform.
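One common way to use such groups when training a similarity model is to treat images from the same group as matching pairs. The text does not describe how Ximilar trains its models, so the sketch below only illustrates that general idea with hypothetical group names and file names.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def positive_pairs(groups: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """List all pairs of images that belong to the same similarity group."""
    pairs = []
    for members in groups.values():
        # Every two images in one group show the same or a similar item.
        pairs.extend(combinations(sorted(members), 2))
    return pairs


# Hypothetical similarity groups: three photos of one sofa, one photo of a lamp.
groups = {
    "sofa-1": ["a.jpg", "b.jpg", "c.jpg"],
    "lamp-4": ["d.jpg"],
}
pairs = positive_pairs(groups)
print(pairs)
```

A group with a single image yields no pairs, which is why groups with several photos of each item are the useful ones for training.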

Annotate is Always Growing

Annotate was originally developed as our internal image annotation software, and we have already delivered a lot of successful solutions to our clients with it. It is a unique product that any team can benefit from to improve their computer vision models unbelievably fast.

We plan to introduce more data formats like videos, satellite imagery (sentinel maps), 3D models, and more in the future to level up the Visual AI in fields such as visual quality control or AI-assisted healthcare. We are also constantly working on adding new features and improving the overall experience of Ximilar services.

Annotate is available for all users with Business & Professional pricing plans. Would you like to discuss your custom solution or ask anything? Let’s talk! Or read how the cooperation with us works first.


]]>