Predict Values From Images With Image Regression

With image regression, you can assess the quality of samples, grade collectible items or rate & rank real estate photos.

We are excited to introduce the latest addition to Ximilar’s Computer Vision Platform. Our platform is a great tool for building image classification systems, and now it also includes image regression models. They enable you to extract values from images with accuracy and efficiency and save your labor costs.

Let’s take a look at what image regression is and how it works, including examples of the most common applications. More importantly, I will tell you how you can train your own regression system on a no-code computer vision platform. As more and more customers seek to extract information from pictures, this new feature is sure to provide Ximilar’s customers with the tools they need to stay ahead of the curve in today’s highly competitive AI-driven market.

What is the Difference Between Image Categorization and Regression?

Image recognition models are ideal for the recognition of images or objects in them, their categorization and tagging (labelling). Let’s say you want to recognize different types of car tyres or their patterns. In this case, categorization and tagging models would be suitable for assigning discrete features to images. However, if you want to predict any continuous value from a certain range, such as the level of tyre wear, image regression is the preferred approach.

Image regression is an advanced machine-learning technique that can predict continuous values within a specific range. Whenever you need to rate or evaluate a collection of images, an image regression system can be incredibly useful.

For instance, you can define a range of values, such as 0 to 5, where 0 is the worst and 5 is the best, and train an image regression task to predict the appropriate rating for given products. Such predictive systems are ideal for assigning values to several specific features within images. In this case, the system would provide you with highly accurate insights into the wear and tear of a particular tyre.

Predicting the level of tyre wear from an image is a use case for an image regression task, while a categorization task can recognize the pattern of the tyre.

How to Train Image Regression With a Computer Vision Platform?

Simply log in to Ximilar App and go to Categorization & Tagging. Upload your training pictures and under Tasks, click on Create a new task and create a Regression task.

Creating an image regression task in Ximilar App.

You can train regression tasks and test them via the same front end or with API. You can develop an AI prediction task for your photos with just a few clicks, without any coding or any knowledge of machine learning.

This way, you can create an automatic grading system able to analyze an image and provide a numerical output in the defined range.
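If you prefer to work programmatically, a trained task is also reachable over the REST API. Below is a minimal sketch in Python; note that the exact endpoint, payload shape, and response field names are assumptions for illustration, so check the current Ximilar API documentation before using it.

```python
import requests

# Hypothetical call to a trained Ximilar regression task. The endpoint,
# payload shape, and the "value" response field are assumptions -- consult
# the official API docs for the real contract.
API_URL = "https://api.ximilar.com/recognition/v2/classify"  # assumed endpoint
headers = {"Authorization": "Token YOUR_API_TOKEN"}
payload = {
    "task_id": "YOUR_REGRESSION_TASK_ID",
    "records": [{"_url": "https://example.com/tyre-photo.jpg"}],
}

response = requests.post(API_URL, headers=headers, json=payload)
result = response.json()
# A regression task should return one continuous value per record,
# e.g. a wear rating within the 0-5 range defined at training time.
print(result["records"][0].get("value"))
```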

Use the Same Training Data For All Your Image Classification Tasks

Both image recognition and image regression methods fall under the image classification techniques. That is why the whole process of working with regression is very similar to categorization & tagging models.

Working with an image regression model on the Ximilar computer vision platform.

Both technologies can work with the same datasets (training images), and inputs of various image sizes and types. In both cases, you can simply upload your data set to the platform, and after creating a task, label the pictures with appropriate continuous values, and then click on the Train button.

Apart from a machine learning platform, we offer a number of AI solutions that are field-tested and ready to use. Check out our public demos to see them in action.

If you would like to build your first image classification system on a no-code machine learning platform, I recommend checking out the article How to Build Your Own Image Recognition API. We defined the basic terms in the article How to Train Custom Image Classifier in 5 Minutes. We also made a basic video tutorial:

Tutorial: train your own image recognition model with Ximilar platform.

Neural Network: The Technology Behind Predicting Range Values on Images

The simplest technique for predicting float values is linear regression, which can be further extended to polynomial regression. These two statistical techniques work well on tabular input data. However, when it comes to predicting numbers from images, a more advanced approach is required. That’s where neural networks come in. Mathematically speaking, a neural network “f” can be trained to predict value “y” on picture “x”, or “y = f(x)”.
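To make the idea concrete, here is a minimal sketch (not Ximilar’s actual implementation) of a CNN regression model in PyTorch: a standard backbone with a single-value output head, trained with mean squared error.

```python
import torch
import torch.nn as nn
from torchvision import models

# A minimal sketch of "y = f(x)": a CNN backbone with a one-value
# regression head, optimized with mean squared error (MSE).
model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, 1)  # single continuous output

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

images = torch.randn(8, 3, 224, 224)   # a dummy batch of images
targets = torch.rand(8, 1) * 5.0       # e.g. ratings in a 0-5 range

optimizer.zero_grad()
predictions = model(images)
loss = criterion(predictions, targets)
loss.backward()
optimizer.step()
```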

Neural networks can be thought of as approximations of functions that we aim to identify through optimization on training data. The most commonly used NNs for image-based predictions are Convolutional Neural Networks (CNNs), vision transformers (ViTs), or a combination of both. These powerful tools analyze pictures pixel by pixel, and learn relevant features and patterns that are essential for solving the problem at hand.

CNNs are particularly effective in picture analysis tasks. They are able to detect features at different spatial scales and orientations. Meanwhile, ViTs have been gaining popularity due to their ability to learn visual features without being constrained by spatial invariance. When used together, these techniques can provide a comprehensive approach to image-based predictions. We can use them to extract the most relevant information from images.

What Are the Most Common Applications of Value Regression From Images?

Estimating Age From Photos

Probably the most widely known use case of image regression among the public is age prediction. You can come across it on social media platforms and mobile apps such as Facebook, Instagram, Snapchat, or FaceApp, which apply deep learning algorithms to predict a user’s age based on their facial features and other details.

While image recognition provides information on the object or person in the image, the regression system tells us a specific value – in this case, the person’s age.

Needless to say, these plugins are not always correct and can sometimes produce biased results. Despite this limitation, image regression models are gaining popularity on social sites and in apps.

Ximilar already provides a face-detection solution. Models such as age prediction can be easily trained and deployed on our platform and integrated into your system.

Value Prediction and Rating of Real Estate Photos

Pictures play an essential part on real estate sites. When people are looking for a new home or investment, they are navigating through the feed mainly by visual features. With image regression, you are able to predict the state, quality, price, and overall rating of real estate from photos. This can help with both searching and evaluating real estate.

Predicting the rating and price of household images with image regression.

Custom recognition models are also great for the recognition & categorization of the features present in real estate photos. For example, you can determine whether a room is furnished, what type of room it is, and categorize the windows and floors based on their design.

Additionally, a regression can determine the quality or state of floors or walls, as well as rank the overall visual aesthetics of households. You can store all of this information in your database. Your users can then use such data to search for real estate that meets specific criteria.

Image classification systems such as image recognition and value regression are ideal for real estate ranking. Your visitors can search the database with the extracted data.

Determining the Degree of Wear and Tear With AI

Visual AI is increasingly being used to estimate the condition of products in photos. While recognition systems can detect individual tears and surface defects, regression systems can estimate the overall degree of wear and tear of things.

A good example of an industry that has seen significant adoption of such technology is insurance. For example, startups like Lemonade or Root use AI when processing insurance claims.

With custom image recognition and regression methods, it is now possible to automate the process of insurance claims. For instance, a visual AI system can indicate the seriousness of damage to cars after accidents or assess the wear and tear of various parts such as suspension, tires, or gearboxes. The same goes with other types of insurance, including households, appliances, or even collectible & antique items.

Our platform is commonly utilized to develop recognition and detection systems for visual quality control & defect detection. Read more in the article Visual AI Takes Quality Control to a New Level.

Automatic Grading of Antique & Collectible Items Such as Sports Cards

Apart from car insurance and damage inspection, recognition and regression are great for all types of grading and sorting systems, for instance on price comparators and marketplaces of collectible and antique items. Deep learning is ideal for the automatic visual grading of collector items such as comic books and trading cards.

By leveraging visual AI technology, companies can streamline their processes, reduce manual labor significantly, cut costs, and enhance the accuracy and reliability of their assessments, leading to greater customer satisfaction.

Automatic Recognition of Collectibles

Ximilar built an AI system for the detection, recognition and grading of collectibles. Check it out!

Food Quality Estimation With AI

Biotech, Med Tech, and Industry 4.0 also have a lot of applications for regression models. For example, they can estimate the approximate level of fruit & vegetable ripeness or freshness from a simple camera image.

The grading of vegetables by an image regression model.

For instance, this Japanese farmer is using deep learning for cucumber quality checks. Looking for quality control or estimation of size and other parameters of olives, fruits, or meat? You can easily create a system tailored to these use cases without coding on the Ximilar platform.

Build Custom Evaluation & Grading Systems With Ximilar

Ximilar provides a no-code visual AI platform accessible via App & API. You can log in and train your own visual AI without the need to know how to code or have expertise in deep learning techniques. It will take you just a few minutes to build a powerful AI model. Don’t hesitate to test it for free and let us know what you think!

Our developers and annotators are also able to build custom recognition and regression systems from scratch. We can help you with the training of the custom task and then with the deployment in production. Both custom and ready-to-use solutions can be used via API or even deployed offline.

How to Build a Good Visual Search Engine?

Let's take a closer look at the technology behind visual search and the key components of visual search engines.

Visual search is one of the most-demanded computer vision solutions. Our team at Ximilar has been actively developing the best general multimedia visual search engine for retailers, startups, and bigger companies that need to process a lot of images, video content, or 3D models.

However, a universal visual search solution is not the only thing that customers around the world will require in the future. Especially smaller companies and startups now more often look for custom or customizable visual search solutions for their sites & apps, built in a short time and for a reasonable price. What does creating a visual search engine actually look like? And can a visual search engine be built by anyone?

This article should provide a bit deeper insight into the technology behind visual search engines. I will describe the basic components of a visual search engine, analyze approaches to machine learning models and their training datasets, and share some ideas, training tips, and techniques that we use when creating visual search solutions. Those who do not wish to build a visual search from scratch can skip right to Building a Visual Search Engine on a Machine Learning Platform.

What Exactly Does a Visual Search Engine Mean?

The technology of visual search in general analyses the overall visual appearance of the image or a selected object in an image (typically a product), observing numerous features such as colours and their transitions, edges, patterns, or details. It is powered by AI trained specifically to understand the concept of similarity the way you perceive it.

In a narrow sense, the visual search usually refers to a process, in which a user uploads a photo, which is used as an image search query by a visual search engine. This engine in turn provides the user with either identical or similar items. You can find this technology under terms such as reverse image search, search by image, or simply photo & image search.

However, reverse image search is not the only use of visual search. The technology has numerous applications. It can search for near-duplicates, match duplicates, or recommend more or less similar images. All of these visual search tools can be used together in an all-in-one visual search engine, which helps internet users find, compare, match, and discover visual content.

And if you combine these visual search tools with other computer vision solutions, such as object detection, image recognition, or tagging services, you get a quite complex automated image-processing system. It will be able to identify images and objects in them and apply both keywords & image search queries to provide as relevant search results as possible.

Different computer vision systems can be combined on Ximilar platform via Flows. If you would like to know more, here’s an article about how Flows work.

Typical Visual Search Engines:
Google Lens & Pinterest Lens

Big visual search industry players such as Shutterstock, eBay, Pinterest (Pinterest Lens) or Google (Google Lens & Google Images) have already implemented visual search engines, as well as other advanced, yet hidden algorithms to satisfy the increasing needs of online shoppers and searchers. It is predicted that a majority of big companies will implement some form of soft AI in their everyday processes in the next few years.

The Algorithm for Training Visual Similarity

The Components of a Visual Search Tool

Multimedia search engines are very powerful systems consisting of multiple parts. The first key component is storage (database). It wouldn’t be exactly economical to store the full sample (e.g., .jpg image or .mp4 video) in a database. That is why we do not store any visual data for visual search. Instead, we store just a representation of the image, called a visual hash.

The visual hash (also visual descriptor or embedding) is basically a vector, representing the data extracted from your image by the visual search. Each visual hash should be a unique combination of numbers to represent a single sample (image). These vectors also have some mathematical properties, meaning you can compare them, e.g., with cosine, hamming, or Euclidean distance.
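For illustration, here is a tiny sketch of comparing two such vectors (real embeddings typically have hundreds or thousands of dimensions, not four):

```python
import numpy as np

# Two visual hashes (embeddings); similar images should yield vectors
# that are close under these metrics.
a = np.array([0.12, 0.93, 0.41, 0.05])
b = np.array([0.10, 0.88, 0.45, 0.07])

euclidean = np.linalg.norm(a - b)
cosine_similarity = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
cosine_distance = 1.0 - cosine_similarity

print(f"Euclidean: {euclidean:.4f}, cosine distance: {cosine_distance:.4f}")
```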

So the basic principle of visual search is: the more similar the images are, the more similar will their vector representations be. Visual search engines such as Google Lens are able to compare incredible volumes of images (i.e., their visual hashes) to find the best match in a hundred milliseconds via smart indexing.

How to Create a Visual Hash?

The visual hashes can be extracted from images by standard algorithms such as PHASH. However, the era of big data gives us a much stronger model for vector representation – a neural network. A simple overview of the image search system built with a neural network can look like this:

Extracting visual vectors with the neural network and searching with them in a similarity collection.

This neural network was trained on images from a website selling cosmetics. Here, it extracted the embeddings (vectors), and they were stored in a database. Then, when a customer uploads an image to the visual search engine on the website, the neural network will extract the embedding vector from this image as well, and use it to find the most similar samples.

Of course, you could also store other metadata in the database, and do advanced filtering or add keyword search to the visual search.

Types of Neural Networks

There are several basic architectures of neural networks that are widely used for vector representations. You can encode almost anything with a neural network. The most common for images is a convolutional neural network (CNN).

There are also special architectures to encode words and text. Lately, so-called transformer neural networks are starting to be more popular for computer vision as well as for natural language processing (NLP). Transformers use a lot of new techniques developed in the last few years, such as an attention mechanism. The attention mechanism, as the name suggests, is able to focus only on the “interesting” parts of the image & ignore the unnecessary details.

Training the Similarity Model

There are multiple methods to train models (neural networks) for image search. First, we should know that training of machine learning models is based on your data and loss function (also called objective or optimization function).

Optimization Functions

The loss function usually computes the error between the output of the model and the ground truth (labels) of the data. This error is then used for adjusting the weights of the model. The model can be interpreted as a function and its weights as parameters of this function. Therefore, if the value of the loss function is big, you should adjust the weights of the model.

How it Works

The model is trained iteratively, taking subsamples of the dataset (batches of images) and going over the entire dataset multiple times. We call one such pass of the dataset an epoch. During one batch analysis, the model needs to compute the loss function value and adjust weights according to it. The algorithm for adjusting the weights of the model is called backpropagation. Training is usually finished when the loss function is not improving (minimizing) anymore.
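In code, the loop described above boils down to a few lines. Here is a minimal self-contained PyTorch sketch, with dummy data standing in for a real dataset:

```python
import torch
import torch.nn as nn

# A toy model and dummy batches standing in for a real DataLoader.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

batches = [(torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,)))
           for _ in range(5)]

for epoch in range(3):              # one epoch = one pass over the dataset
    for images, labels in batches:  # one batch per iteration
        optimizer.zero_grad()
        outputs = model(images)
        loss = loss_fn(outputs, labels)  # error vs. the ground truth
        loss.backward()                  # backpropagation
        optimizer.step()                 # adjust the weights
```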

We can divide the methods (based on loss function) depending on the data we have. Imagine that we have a dataset of images, and we know the class (category) of each image. Our optimization function (loss function) can use these classes to compute the error and modify the model.

The advantage of this approach is its simple implementation. It’s practically only a few lines in any modern framework like TensorFlow or PyTorch. However, it also has a big disadvantage: the class-level optimization functions don’t scale well with the number of classes. We could potentially have thousands of classes (e.g., there are thousands of fashion products and each product represents a class). The computation of such a function with thousands of classes/arguments can be slow. There could also be a problem with fitting everything on the GPU card.

Loss Function: A Few Tips

If you work with a lot of labels, I would recommend using a pair-based loss function instead of a class-based one. The pair-based function usually takes two or more samples from the same class (i.e., the same group or category). A model based on a pair-based loss function doesn’t need to output prediction for so many unique classes. Instead, it can process just a subsample of classes (groups) in each step. It doesn’t know exactly whether the image belongs to class 1 or 9999. But it knows that the two images are from the same class.

Images can be labelled manually or by a custom image recognition model. Read more about image recognition systems.

The Distance Between Vectors

The picture below shows the data in the so-called vector space before and after model optimization (training). In the vector space, each image (sample) is represented by its embedding (vector). Our vectors have two dimensions, x and y, so we can visualize them. The objective of model optimization is to learn the vector representation of images. The loss function is forcing the model to predict similar vectors for samples within the same class (group).

By similar vectors, I mean that the Euclidean distance between the two vectors is small. The larger the distance, the more different these images are. After the optimization, the model assigns a new vector to each sample. Ideally, the model should maximize the distance between images with different classes and minimize the distance between images of the same class.

Optimization for visual search should maximize the distance of items between different categories and minimize the distance within the category.

Sometimes we don’t know anything about our data in advance, meaning we do not have any metadata. In such cases, we need to use unsupervised or self-supervised learning, about which I will talk later in this article. Big tech companies do a lot of work with unsupervised learning. Special models are being developed for searching in databases. In research papers, this field is often called deep metric learning.

Supervised & Unsupervised Machine Learning Methods

1) Supervised Learning

As I mentioned, if we know the classes of images, the easiest way to train a neural network for vectors is to optimize it for the classification problem. This is a classic image recognition problem. The loss function is usually cross-entropy loss. In this way, the model is learning to predict predefined classes from input images. For example, to say whether the image contains a dog, a cat or a bird. We can get the vectors by removing the last classification layer of the model and getting the vectors from some intermediate layer of the network.
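Getting the vectors out of a trained classifier is then a one-liner. Here is a sketch of stripping the final classification layer from a torchvision ResNet (your own trained model would take its place):

```python
import torch
import torch.nn as nn
from torchvision import models

# Stand-in for a classifier trained with cross-entropy.
classifier = models.resnet50(weights=None)

# Drop the last (classification) layer so the network outputs embeddings.
embedder = nn.Sequential(*list(classifier.children())[:-1])

image = torch.randn(1, 3, 224, 224)
embedding = embedder(image).flatten(1)
print(embedding.shape)  # torch.Size([1, 2048])
```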

When it comes to the pair-based loss function, one of the oldest techniques for metric learning is the Siamese network (contrastive learning). The name contains “Siamese” because there are two identical models sharing the same weights. In the Siamese network, we need to have pairs of images, which we label based on whether they are or aren’t equal (i.e., from the same class or not). Pairs in the batch that are equal are labelled with 1 and unequal pairs with 0.
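A sketch of the classic contrastive (Siamese) loss just described, with dummy embeddings standing in for the outputs of the two weight-sharing branches:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, label, margin=1.0):
    # Pairs labelled 1 (equal) are pulled together; pairs labelled 0
    # (unequal) are pushed at least `margin` apart.
    distance = F.pairwise_distance(emb_a, emb_b)
    positive_term = label * distance.pow(2)
    negative_term = (1 - label) * F.relu(margin - distance).pow(2)
    return (positive_term + negative_term).mean()

emb_a, emb_b = torch.randn(8, 128), torch.randn(8, 128)
labels = torch.randint(0, 2, (8,)).float()  # 1 = same class, 0 = different
print(contrastive_loss(emb_a, emb_b, labels))
```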

In the following image, we can see different batch construction methods that depend on our model: Siamese (contrastive) network, Triplet, or N-pair, which I will explain below.

Each deep learning architecture requires different batch construction methods. For example, Siamese and N-pair require tuples. However, in N-pair, the tuples must be unique.

Triplet Neural Network and Online/Offline Mining

In the Triplet method, we construct triplets of items, two of which (anchor and positive) belong to the same category and the third one (negative) to a different category. This can be harder than you might think because picking the “right” samples in the batch is critical. If you pick items that are too easy or too difficult, the network will converge (adjust weights) very slowly or not at all. The triplet loss function contains an important constant called margin. Margin defines what should be the minimum distance between positive and negative samples.
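PyTorch ships this loss out of the box; a minimal sketch with dummy embeddings:

```python
import torch
import torch.nn as nn

# The anchor should be closer to the positive than to the negative
# by at least `margin`.
triplet_loss = nn.TripletMarginLoss(margin=0.2)

anchor = torch.randn(8, 128)    # embeddings of anchor images
positive = torch.randn(8, 128)  # same class as the anchor
negative = torch.randn(8, 128)  # different class

print(triplet_loss(anchor, positive, negative))
```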

Picking the right samples in deep metric learning is called mining. We can find optimal triplets via either offline or online mining. The difference is that during offline mining, you find the triplets at the beginning of each epoch.

Online & Offline Mining

The disadvantage of offline mining is that computing embeddings for each sample is not very computationally efficient. During the epoch, the model can change rapidly, so embeddings become obsolete. That’s why online mining of triplets is more popular. In online mining, the triplets are constructed on the fly from each batch just before fitting the model. For more information about mining and batch strategies for triplet training, I would recommend this post.

We can visualize the Triplet model training in the following way. The model is copied three times, but all copies share the same weights. Each model takes one image from the triplet (anchor, positive, negative) and outputs the embedding vector. Then, the triplet loss is computed and weights are adjusted with backpropagation. After the training is done, the model weights are frozen and the output embeddings are used in the similarity engine. Because the three models share the same weights, we keep only one of them for predicting embedding vectors on images.

Triplet network that takes a batch of anchor, positive and negative images.

N-pair Models

The more modern approach is the N-pair model. The advantage of this model is that you don’t mine negative samples, as it is with a triplet network. The batch consists of just positive samples. The negative samples are mitigated through the matrix construction, where all non-diagonal items are negative samples.

You still need to do online mining. For example, you can select a batch with a maximum value of the loss function, or pick pairs that are distant in metric space.

The N-pair model requires a unique pair of items. In the triplet and Siamese model, your batch can contain multiple triplets/pairs from the same class (group).

In our experience, the N-pair model is much easier to fit, and the results are also better than with the triplet or Siamese model. You still need to do a lot of experiments and know how to tune other hyperparameters such as learning rate, batch size, or model architecture. However, you don’t need to work with the margin value in the loss function, as you do in triplet or Siamese training. The small drawback is that during batch creation, we always need exactly two items per class/product.
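A sketch of the N-pair idea: with one unique (anchor, positive) pair per class in the batch, the similarity matrix between anchors and positives has the true pairs on its diagonal, and everything off-diagonal acts as a negative:

```python
import torch
import torch.nn.functional as F

def n_pair_loss(anchors, positives):
    # Rows: anchors; columns: positives. Diagonal entries are the
    # matching pairs; all off-diagonal entries serve as negatives.
    logits = anchors @ positives.t()
    targets = torch.arange(anchors.size(0))
    return F.cross_entropy(logits, targets)

anchors, positives = torch.randn(16, 128), torch.randn(16, 128)
print(n_pair_loss(anchors, positives))
```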

Proxy-Based Methods

In the proxy-based methods (Proxy-Anchor, Proxy-NCA, Soft Triple) the model is trying to learn class representatives (proxies) from samples. Imagine that instead of having 10,000 classes of fashion products, we will have just 20 class representatives. The first representative will be used for shoes, the second for dresses, the third for shirts, the fourth for pants and so on.

A big advantage is that we don’t need to work with so many classes and the problems coming with it. The idea is to learn class representatives and instead of slow mining “the right samples” we can use the learned representatives in computing the loss function. This leads to much faster training & convergence of the model. This approach, as always, has some cons and questions like how many representatives should we use, and so on.

MultiSimilarity Loss

Finally, it is worth mentioning MultiSimilarity Loss, introduced in this paper. MultiSimilarity Loss is suitable in cases when you have more than two items per class (images per product). The authors of the paper use 5 samples per class in a batch. MultiSimilarity can bring items within the same class closer together and push the negative samples far away by effectively weighting informative pairs. It works with three types of similarities (a short usage sketch follows the list):

  • Self-Similarity (the distance between the negative sample and anchor)
  • Positive-Similarity (the relationship between positive pairs)
  • Negative-Similarity (the relationship between negative pairs)
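You don’t have to implement this loss yourself. Here is a minimal sketch, assuming the open-source pytorch-metric-learning library, which provides a ready-made implementation:

```python
import torch
from pytorch_metric_learning import losses

# alpha, beta and base follow the MultiSimilarity paper's notation.
loss_func = losses.MultiSimilarityLoss(alpha=2, beta=50, base=0.5)

embeddings = torch.randn(25, 128)              # e.g. 5 classes x 5 samples
labels = torch.arange(5).repeat_interleave(5)  # 5 samples per class
print(loss_func(embeddings, labels))
```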

Finally, it is also worth noting that you don’t need to use only one loss function; you can combine multiple loss functions. For example, you can use the Triplet Loss function with CrossEntropy and MultiSimilarity, or N-pair together with Angular Loss. This often leads to better results than a standalone loss function.

2) Unsupervised Learning

AutoEncoder

Unsupervised learning is helpful when we have a completely unlabelled dataset, meaning we don’t know the classes of our images. These methods are very interesting because the annotation of data can be very expensive and time-consuming. The simplest unsupervised approach uses some form of AutoEncoder.

AutoEncoder is a neural network consisting of two parts: an encoder, which encodes the image to the smaller representation (embedding vector), and a decoder, which is trying to reconstruct the original image from the embedding vector.

After the whole model is trained, and the decoder is able to reconstruct the images from smaller vectors, the decoder part is discarded and only the encoder part is used in similarity search engines.
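Here is a minimal fully-connected AutoEncoder sketch (a convolutional encoder/decoder would be the more common choice for real images):

```python
import torch
import torch.nn as nn

# The encoder compresses the image into a 64-dimensional embedding;
# the decoder reconstructs the image from that embedding.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU(),
                        nn.Linear(256, 64))
decoder = nn.Sequential(nn.Linear(64, 256), nn.ReLU(),
                        nn.Linear(256, 28 * 28), nn.Sigmoid())

images = torch.rand(8, 1, 28, 28)
embeddings = encoder(images)
reconstructions = decoder(embeddings).view(8, 1, 28, 28)

loss = nn.functional.mse_loss(reconstructions, images)  # reconstruction error
loss.backward()
```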

Simple AutoEncoder neural network for learning embeddings via reconstruction of the image.

There are many other solutions for unsupervised learning. For example, we can train AutoEncoder architecture to colourize images. In this technique, the input image has no colour and the decoding part of the network tries to output a colourful image.

Image Inpainting

Another technique is image inpainting, where we remove a part of the image and the model learns to fill it back in. Other interesting approaches train a model to solve jigsaw puzzles or to restore the correct ordering of the frames of a video.

Then there are more advanced unsupervised models like SimCLR, MoCo, PIRL, SimSiam or GAN architectures. All these models try to internally represent images so their outputs (vectors) can be used in visual search systems. The explanation of these models is beyond the scope of this article.

Tips for Training Deep Metric Models

Here are some useful tips for training deep metric learning models:

  • Batch size plays an important role in deep metric learning. Some methods such as N-pair should have bigger batch sizes. Bigger batch sizes generally lead to better results, however, they also require more memory on the GPU card.
  • If your dataset has a bigger variation and a lot of classes, use a bigger batch size for Multi-similarity loss.
  • The most important part of metric learning is your data. It’s a pity that most research, as well as articles, focus only on models and methods. If you have a large collection with a lot of products, it is important to have a lot of samples per product. If you have fewer classes, try to use some unsupervised method or cross-entropy loss and do heavy augmentations. In the next section, we will look at data in more depth.
  • Try to start with a pre-trained model and tune the learning rate.
  • When using Siamese or Triplet training, try to play with the margin term; all modern frameworks will allow you to change it (make it harder) during the training.
  • Don’t forget to normalize the output of the embedding if the loss function requires it. Because we are comparing vectors, they should be normalized so that the norm of the vectors is always 1 (see the short sketch after this list). This way, we are able to compute Euclidean or cosine distances.
  • Use advanced methods such as MultiSimilarity with big batch size. If you use Siamese, Triplet, or N-pair, mining of negatives or positives is essential. Start with easier samples at the beginning and increase the challenging samples every epoch.
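As promised in the normalization tip above, here is a short sketch of L2-normalizing embeddings; for unit vectors, cosine similarity reduces to a plain dot product:

```python
import torch
import torch.nn.functional as F

embeddings = torch.randn(8, 128)
normalized = F.normalize(embeddings, p=2, dim=1)  # every vector has norm 1

print(normalized.norm(dim=1))         # all ones
cosine = normalized @ normalized.t()  # pairwise cosine similarities
```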

Neural Text Search on Images with CLIP

Up to now, we have been talking purely about images and searching images with image queries. However, a common use case is to search a collection of images with text input, like we do with Google or Bing search. This is also called the Text-to-Image problem, because we need to transform the text representation into the same vector space as the images. Luckily, researchers from OpenAI developed a simple yet powerful architecture called CLIP (Contrastive Language Image Pre-training). The concept is simple: instead of training on pairs of images (as in Siamese or N-pair training), we train two models (one for images and one for text) on pairs of images and texts.

The architecture of the CLIP model by OpenAI. Image source: GitHub
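A minimal sketch of text-to-image search with a pre-trained CLIP model, assuming OpenAI’s open-source clip package (pip-installable from their GitHub repository):

```python
import torch
import clip  # OpenAI's CLIP package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("product.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(["a red dress", "a leather boot"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)

# Normalize, then compare the image with each text query via dot product.
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
print(image_features @ text_features.t())
```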

You can train a CLIP model on a dataset and then use it on your images (or videos) collection. You are able to find similar images/products or try to search your database with a text query. If you would like to use a CLIP-like model on your data, we can help you with the development and integration of the search system. Just contact us at care@ximilar.com, and we can create a search system for your data.

The Training Data for Visual Search Engines

99 % of the deep learning models have a very expensive requirement: data. Data should not contain any errors such as wrong labels, and we should have a lot of them. However, obtaining enough samples can be a problematic and time-consuming process. That is why techniques such as transfer learning or image augmentation are widely used to enrich the datasets.

How Does Image Augmentation Help With Training Datasets?

Image augmentation is a technique allowing you to multiply training images and therefore expand your dataset. When preparing your dataset, proper image augmentation is crucial. Each specific category of data requires unique augmentation settings for the visual search engine to work properly. Let’s say you want to build a fashion visual search engine based strictly on patterns and not the colours of items. Then you should probably employ heavy colour distortion and channel-swapping augmentation (randomly swapping red, green, or blue channels of an image).

On the other hand, when building an image search engine for a shop with coins, you can rotate the images and flip them left-right and upside-down. Both pipelines are sketched below.
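A minimal sketch of these two classic pipelines, assuming the torchvision transforms API:

```python
import torch
from torchvision import transforms

# Pattern-focused fashion search: heavy colour distortion plus random
# swapping of the R, G, B channels.
swap_channels = transforms.Lambda(lambda img: img[torch.randperm(3)])

fashion_augment = transforms.Compose([
    transforms.ToTensor(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4,
                           saturation=0.6, hue=0.3),
    swap_channels,
])

# Coins: orientation should not matter, so rotate and flip freely.
coin_augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(degrees=180),
    transforms.ToTensor(),
])
```

But what to do if the classic augmentations are not enough? We have a few more options.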

Removing or Replacing Background

Most of the models that are used for image search require pairs of different images of the same object. Typically, when training product image search, we use an official product photo from a retail site and another picture from a smartphone, such as a real-life photo or a screenshot. This way, we get a pair-based model that understands the similarity of a product in pictures with different backgrounds, lights, or colours.

The difference between a product photo and a real-life image made with a smartphone, both of which are important to use when training computer vision models.

All such photos of the same product belong to an entity which we call a Similarity Group. This way, we can build an interactive tool for your website or app, which enables users to upload a real-life picture (sample) and find the product they are interested in.

Background Removal Solution

Sometimes, obtaining multiple images of the same group can be impossible. We found a way to tackle this issue by developing a background removal model that can distinguish the dominant foreground object from its background and detect its pixel-accurate position.

Once we know the exact location of the object, we can generate new photos of products with different backgrounds, making the training of the model more effective with just a few images.

The background removal can also be used to narrow the area of augmentation only to the dominant item, ignoring the background of the image. There are a lot of ways to get the original product in different styles, including changing saturation, exposure, highlights and shadows, or changing the colours entirely.

Generating more variants can make your model very robust.

Building such an augmentation pipeline with background/foreground augmentation can take hundreds of hours and a lot of GPU resources. That is why we deployed our Background Removal solution as a ready-to-use image tool.

You can use the Background Removal as a stand-alone service for your image collections, or as a tool for training data augmentation. It is available in public demo, App, and via API.

GAN-Based Methods for Generating New Training Data

One of the modern approaches is to use a Generative Adversarial Network (GAN). GANs are incredibly powerful in generating whole new images from some specific domain. You can simply create a model for generating new kinds of insects or making birds with different textures.

Creating new insect images automatically to train an image recognition system? How cool is that? There are endless possibilities with GAN models for basically any image type. [Source]

The greatest advantage of GAN is you will easily get a lot of new variants, which will make your model very robust. GANs are starting to be widely used in more tasks such as simulations, and I think the gathering of data will cost much less in the near future because of them. In Ximilar, we used GAN to create a GAN Image Upscaler, which adds new relevant pixels to images to increase their resolution and quality.

When creating a visual search system on our platform, our team picks the most suitable neural network architecture, loss functions, and image augmentation settings through the analysis of your visual data and goals. All of these are critical for the optimization of a model and the final accuracy of the system. Some architectures are more suitable for specific problems like OCR systems, fashion recommenders or quality control. The same goes with image augmentation, choosing the wrong settings can destroy the optimization. We have experience with selecting the best tools to solve specific problems.

Annotation System for Building Image Search Datasets

As we can see, a good dataset definitely is one of the key elements for training deep learning models. Obtaining such a collection can be quite expensive and time-consuming. With some of our customers, we build a system that continually gathers the images needed in the training datasets (for instance, through a smartphone app). This feature continually & automatically improves the precision of the deployed search engines.

How does it work? When the new images are uploaded to Ximilar Platform (through Custom Similarity service) either via App or API, our annotators can check them and use them to enhance the training dataset in Annotate, our interface dedicated to image annotation & management of datasets for computer vision systems.

Annotate effectively works with the similarity groups by grouping all images of the same item. The annotator can add the image to a group with the relevant Stock Keeping Unit (SKU), label it as either a product picture or a real-life photo, add some tags, or mark objects in the picture. They can also mark images that should be used for the evaluation and not used in the training process. In this way, you can have two separate datasets, one for training and one for evaluation.

We are quite proud of all the capabilities of Annotate, such as quality control, team cooperation, or API connection. There are not many web-based data annotation apps where you can effectively build datasets for visual search, object detection, as well as image recognition, and which are connected to a whole visual AI platform based on computer vision.

A sneak peek into Annotate – an image annotation tool for building visual search and image similarity models.

How to Improve Visual Search Engine Results?

We have already established that the optimization algorithm and the training dataset are the key elements in training your similarity model, and that having multiple images per product significantly increases the quality of the trained similarity model. The model (a CNN or another modern architecture) for similarity is used for embedding (vector) extraction, which determines the quality of the image search.

Over the years that we’ve been training visual search engines for various customers around the world, we were also able to identify several potential weak spots. Fixing them really helped with the performance of searches as well as the relevance of the search results. Let’s take a look at what can improve your visual search engine:

Include Tags

Adding relevant keywords for every image can improve the search results dramatically. We recommend using some basic words that are not synonymous with each other. The wrong keywords for one item are for instance “sky, skyline, cloud, cloudy, building, skyscraper, tall building, a city”, while the good alternative keywords would be “sky, cloud, skyscraper, city”.

Our engine can internally use these tags and improve the search results. You can let an image recognition system label the images instead of adding the keywords manually.

Include Filtering Categories

You can store the main categories of images in their metadata. For instance, in real estate, you can distinguish photos that were taken inside or outside. Based on this, the searchers can filter the search results and improve the quality of the searches. This can also be easily done by an image recognition task.

Include Dominant Colours

Colour analysis is very important, especially when working for a fashion or home decor shop. We built a tool conveniently called Dominant Colors, with several extraction options. The system can extract the main colours of a product while ignoring its background. Searchers can use the colours for advanced filtering.

Use Object Detection & Segmentation

Object detection can help you focus the view of both the search engine and its user on the product, by merely cutting the detected object from the image. You can also apply background removal to search & showcase the products the way you want. For training object detection and other custom image recognition models, you can use our App and Annotate.

Use Optical Character Recognition (OCR)

In some domains, you can have products with text. For instance, wine bottles or skincare products with the name of the item and other text labels that can be read by artificial intelligence, stored as metadata and used for keyword search on your site.

Our visual search engine allows us to combine several features for multimedia search with advanced filtering.

Improve Image Resolution

If the uploaded images from the mobile phones have low resolution, you can use the image upscaler to increase the resolution of the image, screenshot, or video. This way, you will get as much as possible even from user-generated content with potentially lower quality.

Combine Multiple Approaches

Combining multiple features like model embeddings, tags, dominant colours, and text increases your chances of building a solid visual search engine. Our system is able to use these different modalities and return the best items accordingly. For example, extracting dominant colours is really helpful in Fashion Search, our service combining object detection, fashion tagging, and visual search.

Search Engine and Vector Databases

Once you have trained your model (neural network), you can extract and store the embeddings for your multimedia items somewhere. There are a lot of image search engine implementations that are able to work with vectors (embedding representations) that you can use. For example, Annoy from Spotify or FAISS from Facebook (a minimal usage sketch follows the list below).

These tools are open-source (i.e., you don’t have to deal with usage rights) and you can use them for simple projects. However, they also have a few disadvantages:

  • After the initial build of the search engine database, you cannot perform any update, insert or delete operations. Once you store the data, you can only perform search queries.
  • You are unable to use a combination of multiple features, such as tags, colours, or metadata.
  • There’s no support for advanced filtering for more precise results.
  • You need to have an IT background and coding skills to implement and use them. And in the end, the system must be deployed on some server, which brings additional challenges.
  • It is difficult to extend them for advanced use cases, you will need to learn a complex codebase of the project and adjust it accordingly.
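For context, here is a minimal sketch of the FAISS workflow mentioned above: build an index of embeddings, then run a k-nearest-neighbour query.

```python
import numpy as np
import faiss  # Facebook's open-source similarity search library

dimension = 128
index = faiss.IndexFlatL2(dimension)  # exact (brute-force) L2 search

database_vectors = np.random.rand(10_000, dimension).astype("float32")
index.add(database_vectors)           # fill the collection

query = np.random.rand(1, dimension).astype("float32")
distances, ids = index.search(query, 5)  # the 5 most similar items
print(ids, distances)
```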

Building a Visual Search Engine on a Machine Learning Platform

The creation of a great visual search engine is not an easy task. The mentioned challenges and disadvantages of building complex visual search engines with high performance are the reasons why a lot of companies hesitate to dedicate their time and funds to building them from scratch. That is where AI platforms like Ximilar come into play.

Custom Similarity Service

Ximilar provides a computer vision platform, where a fast similarity engine is available as a service. Anyone can connect via API and fill their custom collection with data and query at the same time. This streamlines the tedious workflow a lot, enabling people to have custom visual search engines fast and, more importantly, without coding. Our image search engines can handle other data types like videos, music, or 3D models. If you want more privacy for your data, the system can also be deployed on your hardware infrastructure.

In all industries, it is important to know what we need from our model and optimize it towards the defined goal. We developed our visual search services with this in mind. You can simply define your data and problem and what should be the primary goal for this similarity. This is done via similarity groups, where you put the items that should be matched together.

Examples of Visual Search Solutions for Business

One of the typical industries that use visual search extensively is fashion. Here, you can look at similarities in multiple ways. For instance, one can simply want to find footwear with a colour, pattern, texture, or shape similar to the product in a screenshot. We built several visual search engines for fashion e-shops and especially price comparators, which combined search by photo and recommendations of alternative similar products.

Based on a long experience with visual search solutions, we deployed several ready-to-use services for visual search: Visual Product Search, a complex visual search service for e-commerce including technologies such as search by photo, similar product recommendations, or image matching, and Fashion Search created specifically for the fashion segment.

Another nice use case is also the story of how we built a Pokémon Trading Card search engine. It is no surprise that computer vision has been recently widely applied in the world of collectibles. Trading card games, sports cards or stamps and visual AI are a perfect match. Based on our customers’ demand, we also created several AI solutions specifically for collectibles.

The Workflow of Building a Visual Search Engine

If you are looking to build a custom search engine for your users, we can develop a solution for you, using our service Custom Image Similarity. This is the typical workflow of our team when working on a customized search service:

  1. Setup, Research & Plan – Initial calls, the definition of the project, NDA, and agreement on expected delivery time.

  2. Data – If you don’t provide any data, we will gather it for you. Gathering and curating datasets is the most important part of developing machine learning models. Having a well-balanced dataset without any bias to any class leads to great performance in production.

  3. First prototype – Our machine learning team will start working on the model and collection. You will be able to see the first results within a month. You can test it and evaluate it by yourself via our clickable front end.

  4. Development – Once you are satisfied with the results, we will gather more data and do more experiments with the models. This is an iterative way of improving the model.

  5. Evaluation & Deployment – If the system performs well and meets the criteria set up in the first calls (mostly some evaluation on the test dataset and speed performance), we work on the deployment. We will show you how to connect and work with the API for visual similarity (insert, delete, search endpoints).

If you are interested in knowing more about how the cooperation with Ximilar works in general, read our How it works and contact us anytime.

We are also able to do a lot of additional steps, such as:

  • Managing and gathering more training data continually after the deployment to gradually increase the performance of visual similarity (the usage rights for user-generated content are up to you; keep in mind that we don’t store any physical images).
  • Building a customized model or multiple models that can be integrated into the search engine.
  • Creating & maintaining your visual search collection, with automatic synchronization to always keep up to date with your current stock.
  • Scaling the service to hundreds of requests per second.

Visual Search is Not Only For the Big Companies

I presented the basic techniques and architectures for training visual similarity models, but of course, there are many more advanced models, and the research in this field is advancing rapidly.

Search engines are practically everywhere. It all started with AltaVista in 1995 and Google in 1998. Now it’s more common to get information directly from Siri or Alexa. Searching for things with visual information is just another step, and we are glad that we can give our clients tools to maximise their potential. Ximilar has a lot of technical experience with advanced search technology for multimedia data, and we work hard to make it accessible to everyone, including small and medium companies.

If you are considering implementing visual search into your system:

  1. Schedule a call with us and we will discuss your goals. We will set up a process for getting the training data that are necessary to train your machine learning model for search engines.

  2. In the following weeks, our machine learning team will train a custom model and a testable search collection for you.

  3. After meeting all the requirements from the POC, we will deploy the system to production, and you can connect to it via Rest API.

Flows – The Game Changer for Next-Generation AI Systems

Flows is a service for combining machine learning models for image recognition, object detection, and other AI services into a single API.

We have spent thousands of man-hours on this challenging subject. Gallons of coffee later, we introduced a service that might change how you work with data in Machine Learning & AI. We named this solution Flows. It enables simple and intuitive chaining and combining of machine learning models. This simple idea speeds up the workflow of setting up complex computer vision systems and brings unseen scalability to machine learning solutions.

We are here to offer a lot more than just training models, as common AI companies do. Our purpose is not to develop AGI (artificial general intelligence), which is going to take over the world, but easy-to-use AI solutions that can revolutionize many areas of both business and daily life. So, let’s dive into the possibilities of flows in this 2021 update of one of our most-viewed articles.

Flows: Visual AI Setup Cannot Get Much Easier

In general, on our platform, you can break your machine learning problem down into smaller, separate parts (recognition, detection, and other machine learning models called tasks) and then easily chain & combine these tasks with Flows to achieve the full complexity and hierarchical classification of a visual AI solution.

A typical simple use case is conditional image processing. For instance, the first recognition task filters out non-valid images, then the next one decides a category of the image and, according to the result, other tasks recognize specific features for a given category.

A simple combination of machine learning models in a flow

Flows allow your team to review and change datasets of all complexity levels fast and without any trouble. It doesn’t matter whether your model uses three simple categories (e.g. cats, dogs, and guinea pigs) or works with an enormous and complex hierarchy with exceptions, special conditions, and interdependencies.

It also enables you to review the whole dataset structure, analyze, and, if necessary, change its logic due to modularity. With a few clicks, you can add new labels or models (tasks), change their chaining, change the names of the output fields, etc. Neat? More than that!

Think of Flows as Zapier or IFTTT in AI. With flows, you simply connect machine learning models, and review the structure anytime you need.

Define a Flow With a Few Clicks

Let’s assume we are building a real estate website, and we want to automatically recognize different features that we can see in the photos. Different kinds of apartments and houses have various recognizable features. Here is how we can define this workflow using recognition flows (we trained each model with a custom image recognition service):

An example of a real estate classifier made of machine learning models combined in a flow

The image recognition models are chained in a “main” flow called the branch selector. The branch selector saves the result in the same way as a recognition task node and also chooses an action based on the result of this task. First, we let the top category task recognize the type of estate (Apartment vs. Outdoor house). If it is an apartment, we can see that two subsequent tasks are “Apartment features” and “Room type”.

A flow can also call other flows, so-called nested flows, and delegate part of the work to them. If the image is an outdoor house, we continue processing by another nested flow called “Outdoor house”. In this flow, we can see another branching according to the task that recognizes “House type”. Different tasks are called for individual categories (Bungalow, Cottage, etc.):

An example use of nested flows – the main flow calls other nested flows to process images based on their category
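Once a flow like this is defined, a single API call processes an image through the whole tree. The snippet below is a sketch only – the endpoint path and payload keys are assumptions based on the general shape of Ximilar's REST API, so check the current API documentation for the exact names:

```python
import requests

API_TOKEN = "your-api-token"   # issued in the Ximilar App
FLOW_ID = "your-flow-id"       # ID of the real estate flow

response = requests.post(
    "https://api.ximilar.com/flows/v2/process",  # assumed endpoint path
    headers={"Authorization": f"Token {API_TOKEN}"},
    json={
        "flow": FLOW_ID,
        "records": [{"_url": "https://example.com/listing-photo.jpg"}],
    },
)
response.raise_for_status()
# Each record comes back enriched with the output fields saved by the nodes.
print(response.json()["records"][0])
```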

Flow Elements We Used

So far, we have used three elements:

  • A recognition task, which simply calls a given task and saves the result into an output field with a specified name. No other logic is involved.
  • A branch selector, on the other hand, saves the result in the same way as a recognition task node, but then it chooses an action based on the result of this task.
  • A nested flow – another flow of tasks, called by the “main” flow (via a branch selector).

Implicitly, there is also a List element present in some branches. We do not need to create it, because as soon as we add two or more elements to a single branch, a list is generated in the background. All nodes in a list are normally executed in parallel, but you can also set sequential execution, in which case a reordering button will appear.

Branch Selector – Advanced Settings

The branch selector is a powerful element. It's worthwhile to explore what it can do, so let's go through the most important options. In a single branch, by default, only the action (tag or category) with the highest relevance will be performed, provided the relevance (the probability outputted by the model) is above 50 %. But we can change this in the advanced settings: we can specify the threshold value and also enable parallel execution of multiple branches!

The advanced settings of a branch selector, enabling you to skip a task of a flow

You can specify the format of the results. Flat JSON means that results from all branches will be saved on the same level as any previous outcomes. If two branches use the same output name, one result can overwrite the other – parallel execution guarantees neither the order nor which result wins. You can prevent this by selecting nested JSON, which saves the results from each branch under a separate key, based on the branch name (that is, the tag/category name).
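For illustration, the two result formats might look as follows (the field names here are made up for the example):

```python
# Flat JSON: outputs from all branches share one level, so two branches
# writing the same output name can overwrite each other.
flat_result = {
    "Top category": "Apartment",
    "Room type": "Living room",
    "Apartment features": ["Balcony", "Fireplace"],
}

# Nested JSON: each branch saves its results under a key named after the
# branch (the tag/category name), so parallel branches cannot collide.
nested_result = {
    "Top category": "Apartment",
    "Apartment": {
        "Room type": "Living room",
        "Apartment features": ["Balcony", "Fireplace"],
    },
}
```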

If some data (an output_field) are already present in the incoming request, we can skip the branch selector processing altogether. You can define this in If Output Field Exists. This way, we can save credits and also improve the precision of the system, as illustrated below. To learn about the advanced options of training, check this article.
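For example, a request that already carries a pre-computed output field might look like this (an illustrative payload – the field names depend on your flow):

```python
# If "Top category" is configured under If Output Field Exists, the branch
# selector skips its own task call (saving a credit) and branches directly
# on the value supplied in the request.
request_body = {
    "records": [
        {
            "_url": "https://example.com/listing-photo.jpg",
            "Top category": "Outdoor house",  # pre-computed output field
        }
    ]
}
```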

An Example: Fashion Detection With Tags

We have just created a flow to tag simple and basic pictures. That is cool. But can we really use it in real-life applications? Probably not. The reason is that most pictures usually contain more than one clothing item. So how are we going to automate the tagging of more complex pictures? The answer is simple: we can integrate object detection into flows and combine it with recognition & tagging models!

Example of Fashion Tagging combined with Object Detection in Ximilar App

The flow structure then exactly mirrors the rich product taxonomy. Each image goes through a taxonomy tree in order to get the proper tags. This is our “top classifier” – a flow that assigns one of our seven top categories to a fashion product image, which determines how the image will be further classified. For instance, if it is a “Clothing” product, the image continues to the “Clothing tagging” flow.

A “top classifier” – a flow that can tell one of our seven top categories of a fashion product image.

Similar to categorization or tagging, there are two basic nodes for object detection: the Detection Task for simple execution of a given task and Object Selector, which enables the processing of the detected objects.

Object Selector will call the object detection task. The detected objects will be extracted out of the image and passed further to any of the available nodes. Yes, any of them! Even another Object Selector, if, for example, you need to first detect people and then detect clothes on each person separately.
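Conceptually, chaining two Object Selectors does something like the following sketch, where detect() is a stand-in for any trained detection task (in reality, the flow runs all of this server-side for you):

```python
from PIL import Image

def detect(task_name: str, image: Image.Image) -> list[dict]:
    """Illustrative placeholder returning detections in the form
    [{"label": str, "bbox": (left, top, right, bottom)}, ...]."""
    ...

def tag_clothes_per_person(image: Image.Image) -> list[list[dict]]:
    results = []
    for person in detect("person-detection", image):
        crop = image.crop(person["bbox"])  # extract the detected object
        results.append(detect("clothes-detection", crop))  # nested selector
    return results
```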

Object Selector – Advanced Settings

Object Selector behavior can be customized in similar ways as a Branch Selector. In addition to the Probability Threshold, there is also an Area Threshold. By default, all objects are processed; by setting this threshold, objects that do not take up at least a given percentage of the image are simply ignored. Using the Select option, this can be restricted to a single object, chosen by the highest probability or the largest area. As I mentioned, we extract the object before further processing. We can extend the crop a bit to include some context using Expand Bounding Box by…

Setting the threshold for the area an object should occupy in order to be processed

A Typical Flows Application: Fashion Tagging

We have been playing with the fashion subject since the inception of Ximilar. It is the most challenging and also the most promising one. We have created all kinds of tools and helpers for the fashion industry, namely Fashion Tagging, specialized Fashion Search, or Annotate. We are proud to have a very precise automatic fashion tagging service with a rich fashion taxonomy.

And, of course, Fashion Tagging is internally powered by Flows. It is a huge project with several dozen features to recognize, about a hundred recognition tasks, and hundreds of labels, all chained into several interconnected flows. For example, this is what our AI says about a simple dress now – and you can try it on your own picture in the public demo.

Example of fashion attributes assigned to a dress by the Ximilar Fashion Tagging flow

Include Pre-trained Services In Your Flow

The last group of nodes at your disposal are Ximilar services. We are working hard on an ever-growing number of ready-to-use services, which can be called through our API and integrated into your project. It is natural for our users to combine multiple AI services, and flows make it easier than ever. At this moment, you can call our ready-to-use recognition services, such as Fashion Tagging, directly from a flow.

More will come in the future, for example, Remove Background.

Increasing Possibilities of Flows

As our app and list of services grow, so do the flows. There are two features we are currently looking forward to. We are already building custom similarity models for our customers. As soon as they are ready, they will be available for combining in flows. And there is one more item very high on our list, which is predicting numeric values. Regression, in machine learning terms. Stay tuned for more exciting news!

Create Your Flow – It’s Free

Before Flows, setting up an AI Vision process was a tedious task for a skilled developer. Now everyone can set up, manage, and alter the steps on their own, in a comprehensible, visual way. You can optimize the process quickly, get faster responses, lose less time and money, and deliver higher quality to your customers.

And what’s the best part? Flows are available to the users of Ximilar’s free plan, so you can try them right away. Register or sign up to the Ximilar App and enter Flows service at the Dashboard. If you want to learn the basics first, check out our video tutorials. Then you can connect tasks and labels defined in your own Image Recognition.

Training of machine learning models is free with Ximilar; you only pay for the API calls for recognition. Read more about API calls or API credit packs. We strongly believe you will love Flows as much as we enjoyed bringing them to life. And if you feel like there is a feature missing, or if you prefer a custom-made solution, feel free to contact us!

Visual AI Takes Quality Control to a New Level https://www.ximilar.com/blog/visual-ai-takes-quality-control-to-a-new-level/ Wed, 24 Feb 2021 16:08:27 +0000 https://www.ximilar.com/?p=2424 Comprehensive guide for automated visual industrial quality control with AI and Machine Learning. From image recognition to anomaly detection.

Have you heard about The Big Hack? The Big Hack story was about a tiny probe (a small chip) inserted on computer motherboards by Chinese manufacturing companies. Attackers could then infiltrate any server workstation containing these motherboards, many of which were installed in large US-based companies and government agencies. The thing is, the probes were so small, and the motherboards so complex, that they were almost impossible to spot with the human eye. You can take this post as a guide to help you navigate the latest trends of AI in industry, with a primary focus on AI-based visual inspection systems.

AI Adoption by Companies Worldwide

Let’s start with some interesting stats and news. The expansion of AI and Machine Learning is becoming common across numerous industries. According to this report by Stanford University, AI adoption is increasing globally. More than 50 % of respondents said their companies were using AI, and the adoption growth was greatest in the Asia-Pacific region. Some people refer to the automation of factory processes, including digitalization and the use of AI, as the Fourth Industrial Revolution (and so-called Industry 4.0).

Photo by AI Index 2019 Report
AI adoption by industry and function [Source]

The data show that the Automotive industry is the largest adopter of AI in manufacturing, making heavy use of machine learning, computer vision, and robotics.
Other industries, such as Pharma or Infrastructure, are using computer vision in their production lines as well. Financial services, on the other hand, use AI mostly in operations, marketing & sales (with a focus on Natural Language Processing – NLP).

AI technologies per industry [Source]

The MIT Technology Review cited leading artificial intelligence expert Andrew Ng, who has been helping tech giants like Google implement AI solutions, saying that factories are AI's next frontier. For example, while it would be difficult to inspect parts of electronic devices with our eyes, a cheap camera of the latest Android phone or iPhone can provide high-resolution images that can be connected to any industrial system.

Adopting AI brings major advantages, but also potential risks that need to be mitigated. It is no surprise that companies are mainly concerned about the cybersecurity of such systems. Imagine you could lose a billion dollars if your factory stopped working (like Honda in this case). Other obstacles are potential errors in machine learning models. There are techniques for discovering such errors, such as the explainability of AI systems. For now, the explainability of AI is a concern for only 19 % of companies, so there is room for improvement. Getting insights from the algorithms can improve both the processes and the quality of the products. Besides security, there are also political & ethical questions (e.g., job replacement or privacy) that companies are worried about.

This survey by McKinsey & Company brings interesting insights into Germany’s industrial sector. It demonstrates the potential of AI for German companies in eight use cases, one of which is automated quality testing. The expected benefit is a 50% productivity increase due to AI-based automation. Needless to say, Germany is a bit ahead with the AI implementation strategy – there are already several plans made by German institutions to create standardised AI systems that will have better interoperability, certain security standards, quality criteria, and test procedures.

Highly developed economies like Germany, with a high GDP per capita and challenges such as a quickly ageing population, will increasingly need to rely on automation based on AI to achieve GDP targets.

McKinsey & Company

Another study by PwC predicts that the total expected economic impact of AI in the period until 2030 will be about $15.7 trillion. The greatest economic gains from AI are expected in China (26% higher GDP in 2030) and North America.

What is Visual Quality Control?

The human visual system is naturally very selective in what it perceives, focusing on one thing at a time and not actually seeing the whole image (direct vs. peripheral view). Cameras, on the other hand, see all the details, at the highest resolution possible. Stories like The Big Hack therefore show us the importance of visual control, not only to ensure quality but also safety. That is why several companies and universities decided to develop optical inspection systems employing machine learning methods able to detect the tiniest differences from a reference board.

Motherboards by Super Micro [Source: Scott Gelber]

In general, visual quality control is a method or process to inspect equipment or structures to discover defects, damages, missing parts, or other irregularities in production or manufacturing. It is an important method of confirming the quality and safety of manufactured products. Optical inspection systems are mostly used for visual quality control in factories and assembly lines, where the control would be hard or ineffective with human workers.

What Are the Main Benefits of Automatic Visual Inspection?

Here are some of the essential aspects and reasons, why automatic visual inspection brings a major advantage to businesses:

  • The human eye is imprecise – Even though our visual system is a magnificent thing, it needs a lot of “optimization” to be effective, making it prone to optical illusions. The focused view can miss many details, and our visible spectrum is limited (380–750 nm), so we are unable to capture NIR wavelengths (source). Cameras and computer systems, on the other hand, can be calibrated for different conditions, which makes them more suitable for highly precise analyses.
  • Manual checking – Manually checking items one by one is a time-consuming process. Smart automation allows more items to be processed and checked, faster. It also reduces the number of defective items that are released to customers.
  • The complexity – Some assembly lines can produce thousands of various products of different shapes, colours, and materials. For humans, it can be very difficult to keep track of all possible variations.
  • Quality – Providing better and higher quality products by reducing defective items and getting insights into the critical parts of the assembly line.
  • Risk of damage – Machine vision can reduce the risk of item damage and contamination by a person.
  • Workplace safety – Making the work environment safer by inspecting it for potentially dangerous actions (e.g. detection of protection wearables as safety helmets in construction sites), inspection in radioactive or biohazard environments, detection of fire, covid face masks, and many more.
  • Saving costs – Labour can be pretty expensive in the Western world.
    For example, the average quality control inspector salary in the US is about 40k USD. Companies consider numerous options for saving costs, such as moving factories to other countries, streamlining operations, or replacing workers with robots. And as I said before, this goes hand in hand with some political & ethical questions. I think the most reasonable solution in the long term is the cooperation of workers with robotic systems. This will make the process more robust, reliable, and effective.
  • Costs of AI systems – Sooner or later, modern technology and automation will be common in all companies (Startups as well as enterprise companies). The adoption of automatic solutions based on AI will make the transition more affordable.

Where is Visual Quality Control Used?

Let’s take a look at some of the fields where the AI visual control helps:

  • Cosmetics – Inspection of beauty products for defects and contaminations, colour & shape checks, controlling glass or plastic tubes for cleanliness and rejecting scratched pieces.
  • Pharma & Medical – Visual inspection for pharmaceuticals: rejecting defective and unfilled capsules or tablets, checking the filling level of bottles, the integrity of items, or surface imperfections of medical devices. High-resolution recognition of materials.
  • Food Industry and Agriculture – Food and beverage inspection for freshness. Label print/barcode/QR code control of presence or position.

A great example of industrial IoT is this story about a Japanese cucumber farmer who developed a monitoring system for quality check with deep learning and TensorFlow.

  • Automotive – Examination of forged metallic parts, plastic parts, cracks, stains or scratches in the paint coating, and other surface and material imperfections. Monitoring quality of automotive parts (tires, car seats, panels, gears) over time. Engine monitoring and predictive autonomous maintenance.
  • Aerospace – Checking for the presence and quality of critical components and material, spotting the defective parts, discarding them, and therefore making the products more reliable.
  • Transportation – Rail surface defects control (example), aircraft maintenance check, or baggage screening in airports – all of them require some kind of visual inspection.
  • Retail/Consumer Goods & Fashion – Checking assembly line items made of plastics, polymers, wood, and textile, and packaging. Visual quality control can be deployed for the manufacturing process of the goods. Sorting imprecise products.
  • Energy, Mining & Heavy Industries – Detecting cracks and damage in wind blades or solar panels, visual control in nuclear power plants, and many more.

It’s interesting to see that more and more companies choose collaborative platforms such as Kaggle to solve specific problems. In 2019, the contest by Russian company Severstal on Kaggle led to tens of solutions for the steel defect detection problem.

Image of flat steel defects from the Severstal competition. [Source: Kaggle]
  • Other, e.g. safety checks – checking whether people are present in specific zones of the factory or whether they wear helmets, or stopping a robotic arm if a worker is located nearby.

The Technology Behind AI Quality Control

There are several different approaches and technologies that can be used for visual inspection on production lines. The most common ones nowadays use some kind of neural network model.

Neural Networks – Deep Learning

Neural Networks (NN) are computational models that accept input data and output relevant information. To make a neural network useful (i.e., to find the weights of the connections between neurons and layers), we need to feed the network some initial training data.

The advantage of using neural networks is their power to internally represent the training data, which leads to the best performance in computer vision compared to other machine learning models. However, they bring challenges, such as computational demands, overfitting, and others.

[Un|Semi|Self] Supervised Learning

If a machine-learning algorithm (NN) requires ground truth labels, i.e. annotations, then we are talking about supervised learning. If not, then it is an unsupervised method, or something in between – a semi- or self-supervised method. However, building an annotated dataset is much more expensive than simply obtaining unlabelled data. The good news is that the latest research in neural networks tackles these problems with unsupervised learning.

On the left is the original item without any defects; on the right, a slightly damaged one. If we know the labels (OK/DEFECT), we can train a supervised machine-learning algorithm. [Source: Kaggle]
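To make this concrete, here is a minimal supervised baseline in TensorFlow/Keras, assuming images are sorted into data/ok/ and data/defect/ folders. This is an illustrative sketch, not the architecture of any production system:

```python
import tensorflow as tf

# Load the labelled images; the folder names provide the OK/DEFECT labels.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data", validation_split=0.2, subset="training", seed=42,
    image_size=(224, 224), batch_size=32, label_mode="binary")
val_ds = tf.keras.utils.image_dataset_from_directory(
    "data", validation_split=0.2, subset="validation", seed=42,
    image_size=(224, 224), batch_size=32, label_mode="binary")

# A small convolutional network for binary OK/DEFECT classification.
model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=10)
```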

Here is the list of common services and techniques for visual inspection:

  • Image Recognition – Simple neural network that can be trained for categorization or error detection on products from images. The most common architectures are based on convolution (CNN).
  • Object Detection – Model able to predict the exact position (bounding box) of specific parts. Suitable for defect localization and counting.
  • Segmentation – More complex than object detection, image segmentation gives you pixel-level predictions.
  • Image Regression – Regress/get a single decimal value from the image. For example, getting the level of wear out of the item.
  • Anomaly Detection – Shows which image contains an anomaly and why. Mostly done with GANs or Grad-CAM.
  • OCR – Optical Character Recognition is used for detecting and reading text in images.
  • Image matching – Matching the picture of the product to the reference image and displaying the difference.
  • Other – There are also other solutions that do not require data at all, most of the time using some simple, yet powerful computer vision technique.

If you would like to dive a bit deeper into the process of building a model, you can check my posts on Medium, such as How to detect defects on images.

Typical Types and Sources of Data for Visual Inspection

Common Data Sources

Thermal imaging example [Source: Quality Magazine]

RGB images – The most common data type and the easiest to get. A simple 1080p camera that you can connect to a Raspberry Pi costs about $25.

Thermography – Thermal quality control via infrared cameras, mostly used to detect flaws under the surface that are not visible to simple RGB cameras, as well as for gas imaging, fire prevention, and monitoring the behaviour of electronics under different conditions. If you want to know more, I recommend reading the articles in Quality Magazine.

3D scanning, Lasers, X-ray, and CT scans – Creating 3D models from special depth scanners gives you a better insight into material composition, surface, shape, and depth.

Microscopy – Due to the rapid development and miniaturization of technologies, sometimes we need a more detailed and precise view. Microscopes can be used in an industrial setting to ensure the best quality and safety of products. Microscopy is used for visual inspection in many fields, including material sciences and industry (stress fractures), nanotechnology (nanomaterial structure), or biology & medicine. There are many microscopy methods to choose from, such as stereomicroscopy, electron microscopy, opto-digital or purely digital microscopes, and others.

Common Inspection Errors

  • scratches
  • patches
  • knots, shakes, checks, and splits in the wood
  • crazing
  • pitted surface
  • missing parts
  • label/print damage
  • corrosion
  • coating nonuniformity

Surface crazing and cracking on brake discs [source], crazing in polymer-grafted nanoparticle film [source], and wood shakes [source].

Examples of Datasets for Visual Inspection

  • Severstal Kaggle Dataset – A competition for the detection of defects on flat sheet steel.
  • MVTec AD – 5000 high-resolution annotated images of 15 items (divided into defective and defect-free categories).
  • Casting Dataset – Casting is a manufacturing process in which a liquid material is usually poured into a form/mould. About 7 thousand images of submersible pump defects.
  • Kolektor Surface-Defect Dataset – Dataset of microscopic fractures or cracks in electrical accumulators.
  • PCB Dataset – Annotated images of printed circuit boards.

AI Quality Control Use Cases

We talked about a wide range of applications for visual control with AI and machine learning. Here are three use cases for industrial image recognition that we worked on in 2020. All these cases required automatic optical inspection (AOI) and partial customization when building the model, working with different types of data and deployment (cloud/on-premise instance/smartphone). We are glad that during the COVID-19 pandemic, our technologies helped customers keep their factories open.

Our typical workflow for a customized solution is the following:

  1. Setup, Research & Plan: If we don’t know how to solve the problem from the initial call, our Machine Learning team does the research and finds the optimal solution for you.
  2. Gathering Data: We sit with your team and discuss what kind of data samples we need. If you can’t acquire and annotate data yourself, our team of annotators will work on obtaining a training dataset.
  3. First prototype: Within 2–4 weeks we prepare the first prototype or proof of concept. The proof of concept is a lightweight solution for your problem. You can test it and evaluate it by yourself.
  4. Development: Once you are satisfied with the prototype results, our team can focus on the development of the full solution. We work mostly in an iterative way, improving the model and obtaining more data if needed.
  5. Evaluation & Deployment: If the system performs well and meets the criteria set in the first calls (mostly evaluation on the test dataset and speed performance), we work on the deployment. It can run in our cloud, on-premise, or on embedded hardware in the factory – it's up to you. We can even provide the source code so your team can edit it in the future.

Use case: Image recognition & OCR for wood products

One of our customers contacted us with a request to build a system for the categorization and quality control of wooden products. With the Ximilar Platform, we were able to easily develop and deploy a camera system over the assembly line that sorts the products into bins. The system identifies defective prints on the products with optical character recognition (OCR) technology, and the surface control of the wood texture is handled by a separate model.

Printed text on wood [Source: Ximilar]

The technology is connected to a simple smartphone/tablet camera in the factory and can handle tens of products per second. This way, our customer was able to reduce rework and manual inspections, which led to savings of thousands of USD per year. This system was built with the Ximilar Flows service.

Use case: Spectrogram analysis from car engines

Another project we successfully deployed was the detection of malfunctioning engines. We did it by transforming the sound input from the car into an image spectrogram. After that, we trained a deep neural network that recognises problematic car engines and can tell you the specific problem of the engine.

The good news is that this system can also detect anomalies in an unsupervised way (with no need for data labelling) using GAN technology.

Spectrogram from Engine [Source: Ximilar]
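The exact production pipeline is not public, but the audio-to-image step can be sketched with standard libraries – converting an engine recording into a mel spectrogram that an ordinary image model can then classify (file names are illustrative):

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load the engine recording and compute a mel spectrogram.
y, sr = librosa.load("engine_recording.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)  # convert power to decibels

# Render the spectrogram as an image for the recognition model.
fig, ax = plt.subplots()
librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel", ax=ax)
fig.savefig("engine_spectrogram.png")
```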

Use case: Wind turbine blade damage from drone footage

[Source: Pexels]

According to Bloomberg, there is no simple way to recycle a wind turbine, so it is crucial to prolong the lifespan of wind power plants. They can be hit by lightning or affected by extreme weather and other natural forces.

That’s why we developed for our customers a system checking the rotor blade integrity and damages working with drone video footage. The videos are uploaded to the system, and inspection is done with an object detection model identifying potential problems. There are thousands of videos analyzed in one batch, so we built a workstation (with NVidia RTX GPU cards) able to handle such a load.

Ximilar Advantages in Visual AI Quality Control

  • An end-to-end and easy-to-use platform for Computer Vision and Machine Learning, with enterprise-ready features.
  • Processing hundreds of images per second on an average computer.
  • Train your model in the cloud and use it offline in your factory without an internet connection (see the sketch after this list). Thanks to TensorFlow, you can run the model on any computer, edge device, GPU card, or embedded hardware (a Raspberry Pi or NVIDIA Jetson connected to a camera). We also provide CPU models optimized for Intel devices through OpenVINO technology.
  • Easily gather more data and teach models on new defects within a day.
  • Evaluation on an independent dataset, and model versioning.
  • A customized yet affordable solution providing the best outcome with pixel-accurate recognition.
  • Advanced image management and annotation platform suitable for creating intelligent vision systems.
  • Image augmentation settings that can be tuned for your problem.
  • Fast machine learning models that can be connected to your industrial camera or smartphone for industrial image processing robust to lighting conditions, object motion, or vibrations.
  • Great team of experts, available to communicate and help.
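As an illustration of the offline deployment mentioned above, a trained model exported in TensorFlow's SavedModel format can be run in the factory with a few lines of Python (the file names and input size here are assumptions – match them to your exported model):

```python
import numpy as np
import tensorflow as tf

# Load the exported model from disk; no internet connection is needed.
model = tf.keras.models.load_model("exported_model")

# Read and preprocess one image from the production line.
img = tf.io.decode_jpeg(tf.io.read_file("part_001.jpg"), channels=3)
img = tf.image.resize(img, (224, 224)) / 255.0

# Run inference on a batch of one image.
probs = model.predict(np.expand_dims(img.numpy(), 0))[0]
print("predicted probabilities:", probs)
```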

To sum up, it is clear that artificial intelligence and machine learning are becoming common in the majority of industries working with automation, digital data, and quality or safety control. Machine learning definitely has a lot to offer to the factories with both manual and robotic assembly lines, or even fully automated production, but also to various specialized fields, such as material sciences, pharmaceutical, and medical industry.

Are you interested in creating your own visual control system?

Evaluation on an Independent Dataset for Image Recognition https://www.ximilar.com/blog/evaluation-on-an-independent-dataset-for-image-recognition/ Fri, 14 Aug 2020 10:34:13 +0000 https://www.ximilar.com/?p=1871 Monitor the performance of your custom models on test dataset with Ximilar platform.

Today, I would like to present the latest feature, which many of you were missing in our custom image recognition service. We haven't seen this feature in any other public platform that allows you to train your own classification models.

Many machine-learning specialists and data scientists are used to evaluating models on a test dataset, which they select manually and which is not seen by the training process. Why? So they see the results on a dataset that is not biased towards the training data.

We think that this is a critical step for the reliability and transparency of our solution.

“The more we come to rely on technology, the more important it becomes that it’s robust and trustworthy, doing what we want it to do”.

Max Tegmark, scientist and author of international bestseller Life 3.0

Training, Validation and Testing Datasets at the Ximilar platform

Your Categorization or Tagging Task in the image recognition service contains labels. Every label must contain at least 20 images before Ximilar allows you to train the task.

When you click on the Train button, the new model/training is added to a queue. Our training process takes the next non-trained model (neural network) from the queue and starts the optimization.

The process randomly divides your data/images (those not labelled with a test flag) into a training dataset (80 %) and a validation dataset (20 %).

In the first phase, the model is optimized on the training dataset and evaluated on the validation dataset. The result of this evaluation is stored, and its Accuracy number is what you see when you open your task or the detail of the model.

In the last training phase, the process takes all your data (the training and validation datasets) and optimizes the model one more time. Then we compute the misclassified images and store them. You can see them at the bottom of the model page.

Now, you can mark certain images with a test flag. Before your optimized model is stored in the database (after the training phases), it is evaluated on this test dataset (the images with the test flag). This is a very useful feature if you are looking for better insights into your model. In short:

  • Creating a test dataset for your Task is optional.
  • The test dataset is stable; it contains only the images that you mark with the “test” flag. The validation dataset mentioned above, on the contrary, is picked randomly. That's why the results (accuracy, precision, recall) on the test dataset are better for monitoring differences between model versions of your Task (see the sketch after this list).
  • Your task is never optimized on the test dataset.
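As a small illustration, the metrics reported in the app (accuracy, precision, recall) can be reproduced on your own labelled test images, for example with scikit-learn:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Ground-truth labels of your test images vs. the model's predictions.
y_true = ["ok", "ok", "defect", "defect", "ok"]
y_pred = ["ok", "defect", "defect", "defect", "ok"]

print("accuracy:", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, pos_label="defect"))
print("recall:", recall_score(y_true, y_pred, pos_label="defect"))
```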

How to set up your test dataset?

If you want to add this test flag to some of your images, then select them and use the MODIFY button. Select the checkbox “Set as test images” and that’s it.

Naturally, these test images must have the correct labels (categories or tags). If the labels of your task also contain these test images, every newly trained model will include the independent test evaluation. You can display these results by selecting the “Test” check button in the upper-right corner of the model detail.

To sum it up

So, if you have some series of test images, then you will be able to see two results in the detail of your model:

1. Validation Set selected randomly from your training images (this is shown as default)

2. Test Set (your selected images)

We recommend adding at least 30 images per Label/Tag to the test set and increasing the numbers (to hundreds or thousands) over time.

We hope you will love this feature whether you are working on your healthcare, retail, or industry project. If you want to know more about insights from the model page, then read our older blog post. We are looking forward to bringing more explainability features (Explainable AI) in the future, so stay tuned.

The Best Resources on Artificial Intelligence and Machine Learning https://www.ximilar.com/blog/the-best-resources-on-artificial-intelligence-and-machine-learning/ Wed, 08 Jul 2020 06:37:24 +0000 https://www.ximilar.com/?p=1740 List of great books, podcasts, magazines, lectures, blogs & papers from the field of AI, Data Science, and Machine Learning.

Over the years, we machine learning engineers at Ximilar have gathered a lot of interesting ML/AI material from which we draw. I have chosen the best ones, from podcasts to online courses, that I recommend listening to, reading, or checking out. Some of them are basic and introductory, others more advanced. However, all of them are high-quality resources made by the best people in the field, and they are worth checking out. If you are interested in the current progress of AI, or you are just curious about what the future will bring, then you are on the right page. AI will change all possible fields – whether it is physics, law, healthcare, cryptocurrencies, or retail – and one should be prepared for what is to come…

Podcasts

If there is one medium that has become popular in recent years, it must be podcasts. Everyone is doing it right now – there are podcasts about sex, politics, tech, healthcare, brains, bicycles… and AI is not missing either. But one of them stands out: the podcast by Lex Fridman. This MIT alumnus is doing an incredible job interviewing the top people from the field, famous people included (like Garry Kasparov or Elon Musk). Some episodes are more about science, physics, the mind, startups, and the future of humanity. The ideas presented in the podcast are just mind-blowing. The talks are deep but clever, and it will take you some time to get through them.

The Turing test is a recursive test. The Turing test is a test on us. It is a test of whether people are intelligent enough to understand themselves.

  • Lex Fridman and Garry Kasparov [Youtube]
  • Lex Fridman and Sam Altman (CEO of OpenAI) [Youtube]
  • Lex Fridman and Eliezer Yudkowsky [Youtube]
  • Lex Fridman and Max Tegmark [Youtube]
  • Lex Fridman and Ilya Sutskever [Youtube]
  • and many more

Another great podcast with interesting guests is Brain Inspired by Paul Middlebrooks. It discusses topics from neuroscience and AI and how these two fields are connected.

Books

Life 3.0 by Max Tegmark – How will AI change healthcare, jobs, justice, or war? Max Tegmark is a professor at MIT who has written this provocative and engaging book about the future. He tries to answer a lot of questions, like: What is intelligence? Can a machine have consciousness? Can we control AI? … This is a great introduction even for non-technical people.

Human Compatible by Stuart Russell is an important book that asks how we can coexist with intelligent AI in the future.

AI Superpowers: China, Silicon Valley, and the New World Order by Kai-Fu Lee – A book about the incredible progress in AI in China.

Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron – Do you know how to code and would you like to start with some experiments? This book is not only about one of the most popular programming frameworks (TensorFlow) but also about modern techniques in machine learning and neural networks. You will code your first image recognition model and learn how to pre-process and analyze text.

Deep Learning for Coders with Fastai and PyTorch by Jeremy Howard and Sylvain Gugger – Another great book for coders. Code examples are in the PyTorch framework. Jeremy Howard is a famous researcher and developer in the AI community. His Fastai project helps millions of people to get into deep learning.

There are many more interesting books oriented for software developers like Deep Learning with Python by François Chollet. Looking for more hardcore books with math equations? Then try Deep Learning by MIT Press. If you are interested in classic approaches, then many university students will remember preparing for exams with Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig or Bishop’s Pattern Recognition and Machine Learning. (These two are a bit advanced and many topics are for a master’s or even PhD level.)

Magazines

MIT Technology Review is a great magazine with the latest news and trends in technology and future innovations. The magazine also covers other interesting topics, such as biotechnology, blockchain, space, climate change, and more. Both print and digital access options are available.

Popular Videos & Channels

People to Follow

There are a lot of famous Scientists & Engineers & Entrepreneurs to follow. For example, often mentioned Jeremy Howard (fast.ai), Andrej Karpathy (Tesla AI), Yann LeCun (Facebook AI), Rachel Thomas (fast.ai, data ethics), Francois Chollet (Google), Fei-Fei Li (Stanford), Anima Anandkumar (Nvidia AI), Demis Hassabis (DeepMind), Geoffrey Hinton (Google), Eliezer Yudkowsky (AI Alignment), Ilya Sutskever (OpenAI) and more…

Lectures & Online Courses

So you’ve read some books and articles and now you want to start digging a little deeper? Or do you want to become a Machine Learning Specialist? Then start with some online courses. Of course, you will need to learn a little bit about math before and get some basic programming skills. Online courses are a great option if you can’t study at university or you want to get knowledge at your own pace. Here are some of the courses that can serve you as the starting point:

  • Elements of AI – created by the University of Helsinki for a broader audience; it's not very technical and can be a great introduction for beginners
  • Machine Learning course from Andrew Ng – this one is a classic and the most popular one for a number of reasons; it's great introductory material.
  • To learn more math, we can recommend Mathematics for Machine Learning.
  • Deep Learning specialization is more about modern approaches to neural networks.
  • There are a lot of great specializations on Udacity by top companies and engineers from various fields like Healthcare or Automotive.
  • CS231n, CS224N, and CS224W are great Stanford courses for computer vision, natural language processing (NLP), and graphs, including video lectures, slides, and materials. It's FREE!
  • 6.034 and 6.S191 – lectures for AI and Deep Learning by MIT on YouTube.
  • Practical Deep Learning for Coders by fast.ai – Jeremy Howard is doing a great job here by explaining concepts and ideas and showing the code in Jupyter notebooks.
  • PyImageSearch – offers great introductory tutorials in the computer vision field.
  • Full Stack Deep Learning – great course for the whole cycle of developing machine learning systems

Research Blogs

You know how to code and you even know how to build your CNN? Or are you just simply interested in what is the future of the field and how companies are using AI? Check out some of the latest trends and SOTA approaches from the top research groups in the world. There are several giants like Apple, Facebook and Google pushing the AI boundaries:

  • Facebook AI Research – most of the research from the Facebook team is done in Recommender systems, NLP, and Computer Vision.
  • Google AI Blog – Google is probably the most dominant player in AI, check out, for example, their weather prediction system.
  • Microsoft Research – Microsoft has one of the oldest research groups of all companies. It is investing heavily in AI, Computer Vision, and Augmented Reality (AR).
  • Google Deepmind blog – using AI to solve difficult problems from healthcare solutions to playing StarCraft 2.
  • Open AI Blog – how to solve Rubik’s cube by robotic hand or would you like to generate music in one click?
  • Baidu Research – research blog by one of the largest internet companies in China.
  • Malong – research by a company focused on AI for the retail industry (Malong provides in-store product recognition & loss prevention AI to Walmart and other major retailers)
  • NVIDIA Blog and AI research – the biggest GPU creator is doing research in many fields  (from accelerating research speed in healthcare to improving the gaming experience).
  • Distill – beautiful and interactive visualizations and explanations of the topics from deep learning. People behind this project are from Open AI, Tesla, Google…

There are also a lot of AI research labs located at top universities such as MIT, Stanford, or Berkeley.

Great Articles in the AI Field

We are always looking for high-quality content which is why some of the following articles can be a bit longer. AI is a complex field which is disrupting the way we live and do business:

Newsletters

Trends & Problems

  • AI Alignment & Safety problem – Will future super-intelligent AGI systems have the same values as humanity? This is one of the toughest and most important challenges. With the AI race started by OpenAI/Microsoft and Google via large language models (LLMs) & multi-modal models, we have less time to solve AI safety.
  • Ethics, Transparency & Safety & Regulations – Should countries ban the usage of face recognition technology? [source][source] Is it ethical to scrape the data from the internet to build your face search startup? [source] What is an unethical use of AI? [source] What about autonomous weapons for defensive purposes? Are social media polarizing people with their clever algorithms optimized for more clicks/likes/…? [source]
  • Jobs replacement – Will AI replace all manufacturing and basic jobs? Or will the knowledge workers be first? Will the research in AI create even more job opportunities? What is going to happen in countries that are heavily dependent on manual labour? [source] Will companies that use robots/clever algorithms pay an AI tax one day? With large language models (LLMs), many content writers and copywriters are losing their jobs. The same is happening to graphic designers with generative AI like Midjourney. GitHub Copilot and similar tools will one day probably replace programmers. Being a programmer myself, I'm not sure I'm happy about that, and in ten years maybe I will need to switch to another profession.
  • Interpretability & Explainability & Racial bias – Why did the deep learning model predict X and not Y? What has the neural network actually learned? How can we fool the model with adversarial attacks to make it predict wrong? Can models discriminate because of your race? It is a big issue, and not only in face recognition, insurance, and healthcare. [source]
  • Generative models – GANs and generative models like Stable Diffusion are an incredible technology that brings a lot of challenges. Have you heard about deepfake videos? One day, a large percentage of internet content will be created by generative AI, and deepfakes will be indistinguishable from human content. This could create new problems in politics, business, security, or our personal lives. Will there be some proof-of-human protocol then?
  • Big and Small models, IoT and Environment impact – Bigger models can lead to incredible results in NLP [source][source]. However, only a few top companies such as Microsoft, Google, or Amazon have the resources to train them. On the other hand, there is also more research into making models lighter and faster with binarization or pruning techniques. Small models do not require computers with GPUs. Not every part of the world is connected to fast internet, and AI on edge devices is becoming more popular.

Biggest Breakthroughs in AI

A lot of things have happened during the last few years. Here is the hall of fame of complex Artificial Intelligence projects that pushed the boundaries of AI by a large margin:

  • AlphaGo by Google/DeepMind for beating the best human players in the GO game
  • AlphaFold by Google/DeepMind for solving the protein structure prediction (2020)
  • LLM / GPT-3 and ChatGPT by OpenAI for an advanced language model that can do a LOT of things with text and language
  • DALL-E 2 by OpenAI, Stable Diffusion, and Midjourney for advances in image generation (2022)

That is all for now. There are other great resource lists, like the one from DeepMind, which inspired me; it is divided by the level of the target audience – introductory, intermediate, and advanced. We will try to keep this post updated, and if we find a gem, it will definitely appear here. There is much more material to learn from, but now it's up to you to start your own machine-learning journey.

How Does Machine Learning Work? Like a Brain! https://www.ximilar.com/blog/how-does-machine-learning-work-like-a-brain/ Sun, 23 Jul 2017 07:00:23 +0000 https://www.ximilar.com/?p=764 Machine learning model is an algorithm used to build visual AI solutions. This article explains the basics using analogy to human brain.

Human Analogy to Describe Machine Learning in Image Classification

I could point to dozens of articles about machine learning and convolutional neural networks. Every article describes different details, and sometimes too many details are mentioned, so I decided to write my own article using the parallel between machine learning and the human brain. I will not touch any mathematics or deep learning details. The goal is to stay simple and help people experimenting with Ximilar to meet their goals.

Introduction

Machine learning provides computers with the ability to learn without being explicitly programmed.

For images: We want something that can look at a set of images and remember the patterns. When we show a new image to our smart “model”, it will “guess” what is in the image. That's how people learn!

I mentioned two important words:

  • Model — what we call a machine learning algorithm. It is not hand-coded anymore (if green, then grass). It is a structure that can learn and generalize (small, rounded, green is an apple).
  • Guess — we are not in a binary world. We have moved into the probability domain. We receive the likelihood that an image is an apple.

Deeper But Still Simple

A model is like a child’s brain. You show it an apple to a kid and say, “This is an apple”. Repeating it 20 times, a connection in its brain is established and it can now recognize apples. What is important at the beginning, it can not differentiate small details. The small ball in your hand is an apple because it follows the same pattern (small, rounded, green).

The set of images shown to the kid is called the training dataset.

The model is like the brain in that it can recognize only the categories present in its image dataset. It is made of layers and connections, which makes it similar to our brain structure. Different parts of the network learn different abstract patterns.

Supervised learning means we have to say “This is an apple” and attach this information to the visual data. We are adding a label to each image.

A simple deep learning network

Evaluation – Model Accuracy

In human terms, this is like exam time. At school, we learn a lot of information and general concepts. To understand how much we actually know, the teacher prepares a set of questions we have not seen in the study books. Then our brain is evaluated, and we find out that 9 out of 10 questions were answered right.

The teacher's questions are what we call the testing dataset. It is usually split off from the training dataset before training (20 % of the provided pictures in our case).

Accuracy is the percentage of images we answer right. What is important: we do not care how sure the model is about its answer. We only care about the final answers.

Limits of Computers

Why don’t we have computers with human-level skills yet? Because the brain is the most powerful computer. It has amazing processing power, huge memory and some magical sauce we don’t even understand.

Our computer models are limited in memory and computational power. We are fine with storage memory but short on the superfast memory accessible by processors. Computational power is limited by heat, technology, price, etc.

Bigger models can hold more information but take longer to train. This makes the AI development in 2017 focus on:

  • making the models smaller,
  • less computationally intensive,
  • able to learn more information.

Connection to Custom Image Recognition

This technology is what drives our custom image classification API. People can build an image recognizer in a few minutes, without deep knowledge of the field. Sometimes clients ask me if we can recognize 10,000 categories with one training image of each. Imagine the kid's brain learning this – it is nearly impossible. The idea is that the more categories you want your child to know, the more images it has to see. It takes ages for our brains to develop and understand the world. Just as a child starts with basic objects, a model should start with basic categories.

What a child is confident about early on is a simple distinction like good/bad. Likewise, teaching models to differentiate good from bad is very accurate and does not need many images.

Summary

I tried to simplify machine learning to visual tasks only and compare it with something we all know. At Ximilar, we often think of the human brain while experimenting with new models and processing pipelines. I will be happy to hear some feedback from you.

This article was originally published by David Rajnoch.

Custom vs. General Vision AI Services https://www.ximilar.com/blog/custom-vs-general-vision-ai-services/ Thu, 15 Jun 2017 06:00:49 +0000 https://www.ximilar.com/?p=749 Learn the differences between general and custom image recognition platforms and discover which is best for your specific visual needs.

Understand the Difference in Image Recognition Platforms

AI is on fire, and so are services delivering different forms of artificial intelligence. In this post, I would like to focus on the visual segment and compare two different approaches – general vision and custom vision.

What is the difference, and what works better for different visual tasks?

General Vision Platforms

Here are some examples of general vision platforms:

  • Google Cloud Vision API
  • Amazon Rekognition
  • Microsoft Computer Vision
  • Clarifai
  • IBM Visual Recognition Watson

These platforms are built for understanding everyday objects (dogs, aeroplanes, faces, tables…). They mostly provide photo tagging, which means recognizing as many objects and abstract concepts in the image as possible.

The main goal of general platforms is a human-level understanding of images. They are reaching for more than object understanding. The idea is to understand the abstract interactions between objects, moods, and contexts. In video, they are trying to understand the action, its impact, and its continuity in time. These are very complex tasks and need a lot of labelled data to learn.

General vision providers obtain millions of images from different sources. Users upload images to Google Photos and OneDrive, and Google indexes every image on the web for Google Images.

How do users then benefit from these never-ending data sources?

Machine learning requires a lot of data for training. In this case, users don't have to provide any data. General models have learned thousands of everyday objects, facial emotions, landmarks, car types… We can start using this treasure right away, without the pain of gathering relevant training data.

This is an amazing benefit, and general models learn more and more categories. We can set up many cool apps by implementing general models because our apps often look at everyday objects.

Another benefit of general solutions is the other functionalities they provide out of the box: generating a thumbnail, reading text, or finding a celebrity's name. All with no training data and for a reasonable price.

How to choose?

The most important factor we are trying to maximise is the accuracy on our specific task. Each provider has a different number of training images and a different deep learning architecture, and provides different tasks. These are company secrets we will never see.

Generally, in AI, we want to find all the providers who offer the functionality we need and test them out to find the best-performing solution. For complex tasks, it is common to mix a few providers with the best results.

Who is this for?

General vision best suits applications that need to recognise everyday objects: robots reading human faces, e-shop image captioning for better SEO performance, or helping blind people understand new environments. These are great examples of general vision. When reaching for a vision solution, the first question should be: Is this something I could find online? If yes, then it is worth trying general models.

Services like Google Vision provide the power of millions of images to everyone.

But what happens when we step out of the everyday space? What if we have scientific data available only to a few universities? Here comes custom vision.


Custom Vision Solutions

One example is Ximilar solutions. These custom vision solutions are continually evolving to meet the dynamic needs of the rapidly growing sectors such as e-commerce, manufacturing, healthcare, and more. Most of these solutions are ready for deployment via API with just a click, requiring no knowledge of coding or machine learning techniques. They are highly customizable and can be easily combined in a modular fashion to suit various applications. I call this a win.

And some more:

  • Microsoft custom vision AI
  • Clarifai custom image recognition
  • Imagga computer vision

In custom computer vision, users create their own rules to sort images.

Rather than asking for the type of a flower, you may want to know whether the sun is shining on the flower. Sometimes you want to be alerted when your security camera spots humans, while your neighbour on a mower tractor is all right.

You could make sure all the product thumbnails you display show the unboxed products on a white background. Someone else wants to make sure that the product at the end of the line is not damaged. This is something that off-the-shelf solutions are not built for.

About a year ago, there was only one option: hire an AI team to deliver an expensive on-premise solution. Custom vision services open up whole new possibilities in visual AI. Compared to general platforms, there is an infinite number of tasks we can solve by defining custom objects. We can also detect different states of one object or environment. Custom vision is a little machine learning lab where everyone can test their idea. As a result, we can automate boring human tasks and save some time.

The goal of custom vision is not general image understanding, but a 100% accurate understanding of a specific task. This is very close to the market, but it comes with one disadvantage: users have to gather their own training data. This can be painful and time-consuming, but it is a competitive advantage too.

At this moment, all the custom services offer image classification tasks. This means sorting the images into classes while looking at the whole image. The task can be as simple as deciding “ok” or “broken” or it can consist of many classes (e.g. several terrain appearances).

Custom vision can also come in handy when we need high accuracy on a smaller set of categories. We don't always need to recognise thousands of categories; often, we would like to find the 10 that are interesting for us. Custom vision can then supply better accuracy even for general tasks.

The key question for the user is the number of images they need for training. This is very hard to estimate in general, but it can be as few as 20 images per class. Read more about custom datasets in this post. The smaller the visual difference, the more images we need.

How to choose?

We are looking for a solution that is easy to use, provides a simple user interface, and delivers the best accuracy for our task. We should test all the available solutions before we start using one. Before testing an idea, I would also recommend taking some time to discuss the project with the support team.

Who is this for?

Custom vision suits applications that need to recognise very specific images or object states. It also fits images that are not available online or are not mass-produced by web users. It can solve many scientific, industrial, medical, and laboratory tasks. General models are often made for the needs of online business, while custom vision can help in a variety of industries: agriculture, production lines, security, and many others.

Custom vision opens new possibilities in visual automation. All made simple for users with no technical background.

Summary

Machine learning is a technology that saves people a lot of time. Vision is one of the human abilities that is now possible to automate. There is no universal approach for vision tasks, so we have to decide what type of task we are facing.

General vision is here to organize and structure images that are available on the web. It is very simple to use and needs no training data.

Custom vision makes sense in images that are very specific to the task or not available online. It needs some effort to gather training data, but it provides very accurate results for vision tasks.
