Tutorials - Ximilar: Visual AI for Business https://www3.ximilar.com/blog/tag/tutorials/

We Introduce Plan Overview & Advanced Plan Setup https://www.ximilar.com/blog/new-plan-overview-and-setup/ Tue, 24 Sep 2024 13:58:05 +0000 https://www.ximilar.com/?p=18240 Explore new features in our Ximilar App: streamlined Plan overview & Setup, Credit calculator, and API Credit pack pages.

We’re excited to introduce new updates to Ximilar App! As a machine learning platform for training and deploying computer vision models, it also lets you manage subscriptions, monitor API credit usage, and purchase credit packs.

These updates aim to improve your experience and streamline plan setup and credit consumption optimization. Here’s a quick rundown of what’s new.

Plan Setup: Simplified Subscription Management

We’ve revamped the subscription page with new features and better functionality. The Plan Setup page now allows you to choose between Free, Business, or Professional plans, customize your monthly credit supply using a slider, and access our new API Credit Consumption Calculator—a handy tool to help you make informed decisions.

Plan setup in Ximilar App.

The entire checkout process has been streamlined as well, allowing you to adjust your payment method directly before completing your purchase.

Manage Your Payment Methods and Currencies

You can change the default currency for plan setup and payments in the Settings. To update your payment method, simply access the Stripe Portal from your Plan Overview under “More Actions.” If you prefer a different payment method or have any additional questions, feel free to reach out to us!

Credit Calculator: Estimate & Optimise Your Credit Consumption

One of the most exciting additions to the app is the new Credit Calculator, now available directly within the platform. While this tool was previously featured on our Pricing page, it’s now integrated into the app as well, allowing you to not only estimate your credit needs but also preset your subscription plan directly from the calculator.

Once you’ve adjusted your credits based on projected usage, you can proceed straight to checkout, making the entire process of optimizing and purchasing credits smoother and more efficient.

Credit consumption calculator in Ximilar App.

Plan Overview: A Complete View of Your Plans and Credits

The Plan Overview page gives you a comprehensive view of your active subscription, any past plans, and your pre-paid credit packs. Previously, credit information was limited to your dashboard, but now you have detailed insight into your credit usage and plan history.

Plan overview in Ximilar App.

In the Plan Overview, you can view all your current active subscription plans. If you upgrade or downgrade, multiple plans may temporarily appear, as credits from your previous plan remain available until the end of the billing period.

Reports: Detailed Insights into Credit Usage

Our new Reports page enables you to gain deeper insights into your API credit usage. It provides two types of reports: credit consumption by AI solution (e.g., Card Grading) and by individual operation within a solution (e.g., “grade one card” within the Card Grading solution).

Reports in Ximilar App give you detailed insight into your API credit consumption.

Credit Packs: Flexibility to Buy Extra Credits Anytime

API Credit packs act as a safety net for unexpected system loads. They are now available on a dedicated page, where you can purchase additional API credit packs as needed. You can also compare pricing against higher subscription plans and choose the most cost-effective option. Both your active and used credit packs will be displayed on the Plan Overview page.

API Credit packs page in Ximilar App.

Invoices: All Your Purchases in One Place

This updated page neatly lists all your invoices, including both subscription payments and one-time credit pack purchases, ensuring that all your financial information is in one place.

Invoices in Ximilar App.

Greater Control & Flexibility For the Users

These updates are designed to provide you with greater control, transparency, and flexibility as you build and deploy visual AI solutions. All of these features are now accessible in your sidebar. Check them out, and feel free to reach out with any questions!

How to Identify Sports Cards With AI https://www.ximilar.com/blog/how-to-identify-sports-cards-with-ai/ Mon, 12 Feb 2024 11:47:38 +0000 https://www.ximilar.com/?p=15155 Introducing sports card recognition API for card collector shops, apps, and websites.

We have huge news for the collectors and collectibles marketplaces. Today, we are releasing an AI-powered system able to identify sports cards. It was a massive amount of work for our team, and we believe that our sports card identification API can benefit a lot of local shops, small and large businesses, as well as individual developers who aim to build card recognition apps.

Sports Cards Collecting on The Rise

Collecting sports cards, including hockey cards, has long been a popular hobby. I collected hockey cards during my childhood as a big fan of the sport. Today, card collecting has evolved into an investment, and many new collectors enter the community solely to buy and sell cards on various marketplaces.

Some traditional baseball rookie cards can have significant value, for example, the estimated price of a vintage Mickey Mantle PSA 10 1952 Topps rookie baseball card is $15 million – $30 million.

Our Existing Solutions for Card Collector Sites & Apps

Last year, we already released several services focused on trading cards:

  • First, we released a Trading Card Game Identifier API. It can identify trading card games (TCGs) such as Pokémon, Magic: The Gathering (MTG), Yu-Gi-Oh!, and more. We believe that this system is amongst the fastest, most precise and accurate in the world.

  • Second, we built a Card Grading and fast Card Conditioning API for both sports and trading card games. This service can instantly evaluate the corners, edges, and surface of a card and check its centering in a scan, screenshot, or photo in a matter of seconds. Each of these features is graded independently, resulting in an overall grade. The outputs can be either numeric values or condition labels (eBay or TCGPlayer naming). You can test it here.

  • We have also been building custom visual search engines for private collections of trading cards and other collectibles. With this feature, people can visit marketplaces or use their apps to upload card images, and effortlessly search for identical or similar items in their database with a click. Visual search is a standard AI-powered function in major price comparators. If a particular game is not on our list, or if you wish to search within your own collection, list, or portfolio of other collectibles (e.g., coins, stamps, or comic books), we can also create it for you – let us know.

We have been gradually establishing a track record of successful projects in the collectibles field. From the feedback of our customers, we hear that our services are much more precise than the competition. So a couple of months ago, we started building a sports card scanning system as well. It allows users to send the scan to the API, and get back precise identification of the card.

Our API is open to all developers, just sign up to Ximilar App, and you can start building your own great product on top of it!

Test it Now in Live Demo

This solution is already available for testing in our public demo. Try it for free now!

Ximilar AI analyses the sports cards and provides detailed information about them, including links to marketplaces.

The Main Features of Sports Cards

There are several factors determining the value of the card:

  • Rarity & Scarcity: Cards with limited production runs or those featuring star players are often worth more.

  • Condition: Like any collectible item, the condition of a sports card is crucial. Cards in mint or near-mint condition are generally worth more than those with wear and tear.

  • Grade & Grading services: Graded cards (from PSA or Beckett) typically have higher prices in the market.

  • The fame of the player: Names of legends like Michael Jordan or Shohei Ohtani instantly add value to the trading cards in your collection.

  • Autographs, memorabilia, and other features that add to the card’s rarity.

Each card manufacturer must have legal rights and licensing agreements with sports leagues, teams, or athletes. Right now, there are several main producers:

  • Panini – This Italian company is the largest player in the market in terms of licensing agreements and number of releases.

  • Topps – Topps is an American company with a long history. They now release baseball, basketball, and MMA cards.

  • Upper Deck – Upper Deck is a company with an exclusive license for hockey cards from the NHL.

  • Futera – Futera focuses mostly on soccer cards.

Example of Upper Deck, Futera, Panini Prizm and Topps Chrome cards.

Dozens of other card manufacturers were acquired by these few players. They add their brands or names as special sets in their releases. For example, the Fleer company was acquired by Upper Deck in 2005 and Donruss was bought by Panini.

Identifying Sports Cards With Artificial Intelligence

When it comes to sports cards, it’s crucial to recognize that the identification challenge is more complex than that of Pokémon or Magic The Gathering cards. While these games present challenges such as identical trading card artworks in multiple sets or different language variants, sports cards pose distinct difficulties in recognition and identification, such as:

  • Amount of data/cards – The companies add a lot of new cards to their portfolios each year. As of today, the total figure runs into the tens of millions of cards.

  • Parallels, variations, and colours – The card can have multiple variants with different colours, borders, various foil effects, patterns, or even materials. More can be read in a great article by getcardbase.com. Look at the following example of the NBA’s LeBron James card, and some of its variants.

LeBron James 2021 Donruss Optic #41 card in several variations of different parallels and colors.
  • Special cards: Short Print (SP) and Super Short Print (SSP) cards are intentionally produced in smaller quantities than the rest of the particular set. The most common special cards are Rookie cards (RC), which feature a player in their rookie season and therefore hold sentimental and historical value.

  • Serial numbered cards: Trading cards that have a unique serial number printed directly on the card itself.

  • Authentic signature/autograph: These are usually official signature cards, signed by players. To examine the authenticity of the signature, and thus ensure the card’s value, reputable trading card companies may employ card authentication processes.

  • Memorabilia: In the context of trading cards, memorabilia cards are special cards that feature a piece of an athlete’s equipment, such as a patch from a uniform, shoe, or bat. Sports memorabilia are typically more valuable because of their rarity. These cards are also called relic cards.

As you can see, it’s not easy to identify the card and its price and to keep track of all its different variants.

Example: Panini Prizm Football Cards

Take for example the 2022 Panini Prizm Football Cards and the parallel cards. Gold Prizms (10 cards) are worth much more than the Orange Prizms (with 250 cards) because of their scarcity. Upon the release of a card set, the accompanying checklist, presented as a population table, is typically made available. This provides detailed information about the count for each variation.

2022 Panini Prizm Football Cards examples. (Source: beckett.com)

Next, for Panini Prizm, there are more than 20 parallel foil patterns like Speckle, Hyper, Diamond, Fast Break/Disco/No Huddle, Flash, Mozaic, Mojo, Pulsar, Shimmer, etc. with all possible combinations of colours such as green, blue, pink, purple, gold, and so on.

These combinations matter because some of them are rarer than others. The foil patterns also go by different names between companies: Topps has a chrome Speckle pattern which is almost identical to the Panini Prizm Sparkle pattern.

Lastly, no database contains a picture of every card in the world. This makes visual search extremely hard for cards that have no picture on the internet.

If you feel lost in all the variations and parallel cards, you are not alone.

Luckily, we developed (and are actively improving) an AI service that tackles the mentioned problems of sports card identification. This service is available as an open REST API, so anyone can connect to it and integrate their system with ours. Results are returned in seconds, and it is one of the fastest services available on the market.

How to Identify Sports Cards Via API?

In general, you can connect to the REST API with any programming language, such as Python or JavaScript. Our developer documentation will serve as a guide with many helpful instructions and tips.

To access our API, sign in to Ximilar App to get your unique API authentication token. You will find the administration of your services under Collectibles Recognition. Here is an example REST request via curl:

$ curl https://api.ximilar.com/collectibles/v2/sport_id -H "Content-Type: application/json" -H "Authorization: Token __API_TOKEN__" -d '{
    "records": [
        { "_url": "__PATH_TO_IMAGE_URL__"}
    ], "slab_id": false
}'
The example response when you identify sports cards with Ximilar API.

The API response will be as follows:

  • When the system successfully identifies the card, it returns the full identification. You will get a list of features such as the name of the player/person, the name of the set, the card number, company, team, and attributes like foil, autograph, colour, and more. It can also generate URL links for eBay searches, so you can check card values or purchase cards directly.
  • If we are not sure about the identification (or we don’t have a specific card in our system), the system will return empty search results. In such a case, feel free to ask for support.
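
The same request can also be made from code. Below is a minimal Python sketch using the requests library; it mirrors the curl example above, and the loop over a "records" list in the response is an assumption based on the request format – consult the API documentation for the exact response schema.

import requests

API_TOKEN = "__API_TOKEN__"  # your Ximilar API token
ENDPOINT = "https://api.ximilar.com/collectibles/v2/sport_id"

payload = {
    "records": [{"_url": "__PATH_TO_IMAGE_URL__"}],
    "slab_id": False,
}

response = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Token {API_TOKEN}"},
    json=payload,
    timeout=30,
)
response.raise_for_status()

# Each record should carry the identification results (player, set, card number, ...)
for record in response.json().get("records", []):
    print(record)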

How Does AI Sports Card Identification Work?

Our identification system uses advanced machine learning models with smart algorithms for post-processing. The system is a complex flow of models that incorporates visual search. We trained the system on a large amount of data, curated by our own annotation team.

First, we identify the location of the card in your photo. Second, we run multiple AI analyses of the card, for example to identify whether it has an autograph. The third step is to find the card in our collection with visual search (reverse image search). Lastly, we use AI to rerank the results to make them as precise as possible.

What Sports Cards Can Ximilar Identify?

Our sports card database contains a few million cards. Of course, this is just a small subset of all collectible cards ever produced. Right now, we focus on six main domains: baseball, football, basketball, hockey, soccer, and MMA cards. The list expands based on demand, and we continually add more data and improve the system.

We try to track and include new releases every month. If you see that we are missing some cards and you have the collection, let us know. We can agree on adding them to training data and giving you a discount on API requests. Since we want to build the most accurate system for card identification in the world, we are always looking for ways to gather more cards and improve the software’s accuracy.

Who Will Benefit From AI-Powered Sports Cards Identifier?

Access to our REST API can improve your position in the market especially if:

  • You own e-commerce sites/marketplaces that buy & sell cards – If you have your own shop, site or market for people who collect cards, this solution can boost your traffic and sales.

  • You are planning to design and publish your own collector app and need an all-in-one API for the recognition and grading of cards.

  • You want to manage, organize and add data to your own card collection.

Is My Data Safe?

Yes. First of all, we don’t save the analysed images. We don’t even have the storage capacity to store every analysed image, photo, scan, and screenshot you add to your collection. Once our system processes an image, it removes it from memory. Also, GDPR applies to all photos that enter our system. Read more in our FAQs.

How Fast is the System, Can I Connect it to a Scanner?

The system can identify one card scan in one second. You can connect it to any card scanner available on the market. The scanner outputs the card images into folders, and a script can then send them to the API for identification (see the sketch below).
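
For illustration, a minimal folder-based script could look like the sketch below. It assumes the scanner saves JPEG files into a local folder and that the endpoint accepts base64-encoded images via a "_base64" record field; please verify the supported input formats in the API documentation.

import base64
import pathlib
import requests

API_TOKEN = "__API_TOKEN__"
ENDPOINT = "https://api.ximilar.com/collectibles/v2/sport_id"
SCAN_DIR = pathlib.Path("scans")  # the folder your scanner writes into

for image_path in sorted(SCAN_DIR.glob("*.jpg")):
    # Encode the scan and send it to the identification endpoint
    encoded = base64.b64encode(image_path.read_bytes()).decode("ascii")
    payload = {"records": [{"_base64": encoded}]}
    response = requests.post(
        ENDPOINT,
        headers={"Authorization": f"Token {API_TOKEN}"},
        json=payload,
        timeout=30,
    )
    response.raise_for_status()
    print(image_path.name, response.json())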

Sports Cards Recognition Apps You Can Build With Our API

Here are a few ideas for apps that you can build with our Sport Card Identifier and REST API:

  • Automatic card scanning system – create a simple script that will be connected to our API and your scanners like Fujitsu fi-8170. The system will be able to document your cards with incredible speed. Several of our customers are already organizing their collections of TCGs (like Magic The Gathering or Pokémon) and adding new cards on the go.

  • Price checking app or portfolio analysis – create your phone app alternative to Ludex or CollX. Start documenting the cards by taking pictures and grading your trading card collection. Our system can provide card IDs, pre-grade cards, and search them in an online marketplace. Easily connect with other collectors, purchase & sell the cards. Test our system’s ability to provide URLs to marketplaces here.

  • Analysing eBay submissions – would you like to know what your card is worth and how many are currently available on the market? How much the card sold for in the past? How its price has developed over time? Or what the card population is? With our technology, you can build a system that can analyse it.

AI for Trading Cards and Collectors

So this is our latest narrow AI service for the collector community. It is quite easy to integrate it into any system. You can use it for automatic documentation of your collection or simply to list your cards on online markets.

For more information, contact us via chat or the contact page, and we can schedule a call with you to talk about the technical and business details. If you want to go straight to implementation, take a look at our developer API documentation and don’t hesitate to ask for guidance anytime.

Right now, we are also working on comics identification (comic books, magazines, and manga). If you would like to hear more, just contact us via email or chat.

Build Your Own Trading Card Game Identifier With Our API https://www.ximilar.com/blog/build-your-own-trading-card-game-identifier-with-our-api/ Thu, 27 Jul 2023 15:56:16 +0000 https://www.ximilar.com/?p=14016 Provide your community of collectors with AI-powered trading card game identifier. Connect via API and automate your image processing.

In one of my previous blog posts, I wrote about how we built a visual search engine for trading cards such as Pokémon TCG, Magic The Gathering or sports cards. This customized visual search engine is very precise, but it’s suitable mainly for collector shops & websites that already have their own collections of photos or cards that can be matched to the pictures uploaded by players.

However, as the world of trading card games expands, collectors increasingly require a versatile trading card game identifier. This tool should swiftly recognize various collectible cards, irrespective of your private collection or database. We accepted this challenge and built a Trading Card Game Identifier (Card ID). In this article, I will describe how it works, and how you can use it for your own App or website. We will also take a brief look at other additions to our Collectibles Recognition: OCR & grading system for both TCGs and sports.

Trading Card Games Identifier

What is the Card Identifier?

Card Identifier is an AI-powered tool by Ximilar able to recognize trading game cards in any image format and provide you with their attributes, such as the name, exact set, series, codes, number, or year of release. It also provides attributes such as information on whether the card is holo (foil-treated) or what alphabet or language it uses.

This solution is an extension of our core service AI Recognition of Collectibles (which does the basic image recognition of all collectible items) and expands its functionalities by detailed identification of specific trading games.

Card ID works independently of keywords and metadata. As a matter of fact, you can use it to generate keywords. You can save the output in JSON or use it for searching and filtering items on your webpage.

The attributes of cards, such as their name, date of release and set, are also typically used to find the trading card’s average price on marketplaces such as TCGPlayer or eBay. That is why for some cards, we can provide links to these sites right away.

There are several use cases for our card identifier; here are a few of them:

  • you can connect a card scanner such as the Fujitsu fi-8170 and create a system for documenting & digitalising your card inventory, saving thousands of hours with AI analysis
  • you can build a smartphone app that identifies a card from a photo and gets its average price on eBay or TCGPlayer
  • you can create your own marketplace website for card reselling & listing. Our technology will help with the identification of incoming submissions.

Because our solutions are powered by computer vision, you can upload photos of as many cards as you want, with or without sleeves, under different lighting and conditions.

Which Games Can the Trading Card Game Identifier Recognize?

Pokémon TCG

The Pokémon Trading Card Game is one of the most popular trading games. Fans of all generations and nationalities have been playing Pokémon TCG ever since its release in 1998. Our identifier recognizes Pokémon cards in both English & Japanese and provides their attributes.

Pokémon TCG (source: dicebreaker.com, rights: The Pokémon Company International, Inc.)

Magic The Gathering

MTG is a highly popular game. As of 2023, over 100 MTG sets have been released, with their numbers continually rising, making it increasingly challenging to keep pace with all the sets and new cards. Our identifier provides all basic information about the Magic The Gathering card in an uploaded photo, and we keep adding new attributes.

Magic The Gathering TCG, The Lord of the Rings set with amazing artwork. (source: wargamer.com, rights: Hasbro)

Yu-Gi-Oh!

Yu-Gi-Oh! is an iconic trading card game based on an anime series. Since its 1999 release, Yu-Gi-Oh! has garnered a dedicated community of players and collectors. It was recognized as a top-selling TCG by Guinness World Records in 2009, with over 22 billion cards sold worldwide, so the demand for an AI model to assist with card identification is understandable.

Yu-Gi-Oh! Trading Card Game is a perfect candidate for AI recognition with its 22 billion cards sold. (source: konami.com, rights: Konami)

From MetaZoo to Lorcana

TCGs such as MetaZoo TCG, Flesh and Blood TCG, One Piece Card Game, or Lorcana TCG are all smaller or more recent games, but they are starting to be more and more popular both in English-speaking and Asian countries.

Lorcana Trading Card Game. (source: mousetcg.com, rights: Disney & Ravensburger)

Independent of card type, this endpoint will also provide information such as:

  • Side – front or back of the card.

  • Alphabet – such as Latin, Japanese, Korean, Chinese, and more.

  • Holo/Foil – whether the card has a holo effect (aluminium foil).

  • Autograph – this particular feature is more common for baseball and other sports cards.

All this information is necessary to value trading cards properly. For instance, a Japanese card can have a different value than an English one, and a holo card can have a higher value than a regular one.

How Does Identifying Trading Card Games via API Work?

Connect to API

Once you register in Ximilar App, you will automatically get your own unique API token. You will need at least the Business pricing plan. Then you can access and use our solutions both via App & API:

  • In the App, Card ID is a part of the Collectibles Recognition service. So if you upload your images there, the trading cards in them will be automatically recognized and identified.

  • The REST API endpoint is simple to use and easy to integrate into your mobile app, website or card-sorting machines. If you’re new to deploying solutions via API, the API documentation is here to help you with the basic setup. You can also find a lot of helpful information in our Help Center.

  • For a lot of cards, we are able to provide links to TCGPlayer or Cardmarket, so you will know the price of the analysed cards immediately.

To access the Card Identifier by Ximilar, use the endpoint /v2/tcg_id:

https://api.ximilar.com/collectibles/v2/tcg_id
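
For illustration, here is a minimal Python sketch of calling this endpoint. The request format (a "records" list with "_url" items) mirrors the sport card example from the previous article; the exact response keys are not documented here, so treat the printout as a way to inspect them and consult the API documentation for the full schema.

import requests

response = requests.post(
    "https://api.ximilar.com/collectibles/v2/tcg_id",
    headers={"Authorization": "Token __API_TOKEN__"},
    json={"records": [{"_url": "__PATH_TO_CARD_IMAGE_URL__"}]},
    timeout=30,
)
response.raise_for_status()

# Inspect the attributes (name, set, card number, ...) returned for each record
for record in response.json().get("records", []):
    print(record)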

We are always here to answer your questions through the contact form or live chat and can also do the setup for you.

Implement Trading Card Game Identifier in Your App

Imagine you are building an app or a site catering to Yu-Gi-Oh! fans and collectors. When a visitor uploads a picture of a new card, our AI Recognition of Collectibles instantly detects the card’s position and confirms it as a trading card. Thanks to its object detection & image recognition capabilities, users can upload pictures containing multiple cards.

Recognition of Yu-Gi-Oh! playing card with Ximilar Trading Card Game Identifier.

Subsequently, the Card ID provides the card’s attributes: Name, Full Name, Set, Set Code, Card Number, Rarity, and Year. This happens independently of your portfolio (collection) or database.

The identification of the record is fast (usually takes a second to process) and the results are provided in JSON. This way, the user can be provided with structured data on their trading card in a matter of seconds.

The identification works for almost all popular TCGs. And the good news is that our AI for card recognition is so powerful that we can extend it to other games. Let us know if you are missing any games.

New Solutions For Sports Cards

Sports Card Text Analysis With OCR & GPT

Because there are millions of sports cards, and it’s very hard to gather data for them, we have recently released another solution for text extraction from sports cards. The system is accessible via the following endpoints:

https://api.ximilar.com/collectibles/v2/card_ocr_id
https://api.ximilar.com/collectibles/v2/sport_id

The first endpoint reads all the text in the photo of a card via Optical Character Recognition (OCR) and then provides information on the athlete via a large language model (LLM) – GPT. This model is still in the works; however, it can already help you with the automation and labelling of cards. If you have your own collection of sports cards, we can build you a precise, fast, and affordable AI system for sports card identification.

The second endpoint uses a limited sports card database for identification. You can play with both of them and choose the solution that works for you. If you have your own database of sports cards, we can build a similar system on your data alone.

You can read more about this solution in the article When OCR Meets ChatGPT AI in One API.

Read Graded Slab Labels With AI

Sports card grading is gaining popularity not only in the USA but also in Europe and Asia, as collectors recognize the value of their cards. Having rare foiled cards evaluated by esteemed companies like PSA or Beckett may be a good investment.

Online trading has become a prevalent trend, with eBay leading the pack as the go-to marketplace for collectibles. However, searching for the best deal among thousands of results for a specific query, like a “Michael Jordan Graded Card” can be incredibly time-consuming and challenging.

Reading the Graded Slab Label and getting the certificate number with the grade from the picture.

Our endpoint slab_id reads the graded slabs and helps to automate the identification of promising cards:

https://api.ximilar.com/collectibles/v2/slab_id

It will read the slab and return attributes such as grade, name, grade company and certification number. You can use it to automatically find and filter items with certain grades or conditions (8/9/10, near mint, gem mint, and so on).
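
A minimal Python sketch of calling the slab reader is below, assuming the same record-based request format as the other collectibles endpoints; the exact names of the returned fields (grade, company, certificate number) may differ, so check the API documentation.

import requests

response = requests.post(
    "https://api.ximilar.com/collectibles/v2/slab_id",
    headers={"Authorization": "Token __API_TOKEN__"},
    json={"records": [{"_url": "__PATH_TO_GRADED_CARD_IMAGE_URL__"}]},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # inspect the returned grade, grading company and certificate number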

Pre-Grading of Sports Cards With AI

We also provide an alternative to the slab reader in case the uploaded card doesn’t have a grade yet. It is an AI-powered grader for websites that evaluate & sell sports cards. The system can grade whole cards as well as individual parts like corners, edges, or centering. It is accessible via endpoint grade (precise) or condition (lightweight and fast):

https://api.ximilar.com/card-grader/v2/grade
https://api.ximilar.com/card-grader/v2/condition
AI grading for sports cards by Ximilar.

Because identifying grades from a single picture cannot fully replace a professional grader, this endpoint serves mainly as a pre-grading solution. As I write this article, it is currently in beta testing. Nonetheless, it has already proven effective in specific scenarios, particularly with high-resolution pictures of sports cards without sleeves or slabs. This feature was highly requested by many of our customers. So we made it accessible to both Business and Professional plan users.

Solving this challenge is no simple task, and it is a long-term project for us. We are working hard both on gathering training data and improving the model architecture. It also serves as a research project, as we encounter a lot of new and far-from-standard problems. I will write more about this service, technology, and development in a future blog post. So stay tuned!

Automation in Collectibles Industry Makes Sense

Here are a few reasons why I think the trading card industry is growing rapidly, and will use AI-powered automation more in the future:

Get a Solution Tailored to Your Business

All the services mentioned in this article are easy to combine with each other and with the rest of our solutions. One of the most popular solutions in the field of collectibles is a visual search and similar item recommendation. If you are aiming to have your own visual search engine, I suggest reading Pokémon TCG Search Engine: Use AI to Catch Them All and then contacting us.

The collector community’s feedback and thoughts serve as our primary motivation to develop tailor-made solutions for this amazing field. Contact us anytime and we can discuss your goals.

When OCR Meets ChatGPT AI in One API https://www.ximilar.com/blog/when-ocr-meets-chatgpt-ai-in-one-api/ Wed, 14 Jun 2023 09:38:27 +0000 https://www.ximilar.com/?p=13781 Introducing the fusion of optical character recognition (OCR) and conversational AI (ChatGPT) as an online REST API service.

Imagine a world where machines not only have the ability to read text but also comprehend its meaning, just as effortlessly as we humans do. Over the past two years, we have witnessed extraordinary advancements in these areas, driven by two remarkable technologies: optical character recognition (OCR) and ChatGPT (generative pre-trained transformer). The combined potential of these technologies is enormous and offers assistance in numerous fields.

That is why we at Ximilar have recently developed an OCR system, integrated it with ChatGPT, and made it available via API. It is one of the first publicly available services combining OCR software and the GPT model, supporting several alphabets and languages. In this article, I will provide an overview of what OCR and ChatGPT are, how they work, and – more importantly – how anyone can benefit from their combination.

What is Optical Character Recognition (OCR)?

OCR (Optical Character Recognition) is a technology that can quickly scan documents or images and extract text data from them. OCR engines are powered by artificial intelligence & machine learning. They use object detection, pattern recognition and feature extraction.

OCR software can read not only printed but also handwritten text in an image or a document and provide you with the extracted text information in a file format of your choosing.

How Does Optical Character Recognition Work?

When an OCR engine is provided with an image, it first detects the position of the text. Then, it uses an AI model for reading individual characters to find out what the text in the scanned document says (text recognition).

This way, OCR tools can provide accurate information from virtually any kind of image file or document type. To name a few examples: PDF files containing camera images, scanned documents (e.g., legal documents), old printed documents such as historical newspapers, or even license plates.

A few examples of OCR: transcribing books to electronic form, reading invoices, passports, IDs, and landmarks.

Most OCR tools are optimized for specific languages and alphabets. We can tune these tools in many ways. For example, to automate the reading of invoices, receipts, or contracts. They can also specialize in handwritten or printed paper documents.

The basic outputs from OCR tools are usually the extracted texts and their locations in the image. The data extracted with these tools can then serve various purposes, depending on your needs. From uploading the extracted text to simple Word documents to turning the recognized text to speech format for visually impaired users.

OCR programs can also do a layout analysis for transforming text into a table. Or they can integrate natural language processing (NLP) for further text analysis and extraction of named entities (NER). For example, identifying numbers, famous people or locations in the text, like ‘Albert Einstein’ or ‘Eiffel Tower’.

Technologies Related to OCR

You may also come across the term optical word recognition (OWR). This technology is not as widely used as optical character recognition software. It involves the recognition and extraction of individual words or groups of words from an image.

There is also optical mark recognition (OMR). This technology can detect and interpret marks made on paper or other media. It can work together with OCR technology, for instance, to process and grade tests or surveys.

And last but not least, there is intelligent character recognition (ICR). It is a specific OCR optimised for the extraction of handwritten text from an image. All these advanced methods share some underlying principles.

What are GPT and ChatGPT?

A generative pre-trained transformer (GPT) is an AI text model that is able to generate textual outputs based on an input (prompt). GPT models are large language models (LLMs) powered by deep learning and relying on neural networks. They are incredibly powerful tools and can do content creation (e.g., writing paragraphs of blog posts), proofreading and error fixing, explaining concepts & ideas, and much more.

The Impact of ChatGPT

ChatGPT, introduced by OpenAI in partnership with Microsoft, is an extension of the GPT model, further optimized for conversations. It has had a great impact on how we search, work with, and process data.

GPT models are trained on huge amounts of textual data, so they have better knowledge than an average human being about many topics. In my case, ChatGPT definitely has better English writing & grammar skills than I do. Here’s an example of ChatGPT explaining quantum computing:

ChatGPT model explaining quantum computing. [source: OpenAI]

It is no overstatement to say that the introduction of ChatGPT revolutionized data processing, analysis, search, and retrieval.

How Can OCR & GPT Be Combined For Smart Text Extraction

The combination of OCR with GPT models enables us to use this technology to its full potential. GPT can understand, analyze and edit textual inputs. That is why it is ideal for post-processing of the raw text data extracted from images with OCR technology. You can give the text to the GPT and ask simple questions such as “What are the items on the invoice and what is the invoice price?” and get an answer with the exact structure you need.

This was a very hard problem just a year ago, and a lot of companies were trying to build intelligent document-reading systems, investing millions of dollars in them. The large language models are really game changers and major time savers. It is great that they can be combined with other tools such as OCR and integrated into visual AI systems.

It can help us with many things, including extracting essential information from images and putting it into text documents or JSON. And in the future, it can revolutionize search engines and streamline automated text translation or entire workflows of document processing and archiving.

Examples of OCR Software & ChatGPT Working Together

So, now that we can combine computer vision and advanced natural language processing, let’s take a look at how we can use this technology to our advantage.

Reading, Processing and Mining Invoices From PDFs

One of the typical examples of OCR software is reading data from invoices, receipts, or contracts in image-only PDFs (or other documents). Imagine that some of the invoices and receipts your accounting department accepts are physical printed documents. You could scan the document, and instead of opening it in Adobe Acrobat and doing manual data entry (which is still a standard procedure in many accounting departments today), you would let the automated OCR system handle the rest.

Scanned documents can be automatically sent to the API from both computers and mobile phones. The visual AI needs only a few hundred milliseconds to process an image. Then you will get textual data with the desired structure in JSON or another format. You can easily integrate such technology into accounting systems and internal infrastructures to streamline invoice processing, payments or SKU numbers monitoring.

Receipt analysis via Ximilar OCR and OpenAI ChatGPT.

Trading Card Identifying & Reading Powered by AI

In recent years, the collector community for trading cards has grown significantly. This has been accompanied by the emergence of specialized collector websites, comparison platforms, and community forums. And with the increasing number of both cards and their collectors, there has been a parallel demand for automating the recognition and cataloguing of collectibles from images.

Ximilar has been developing AI-powered solutions for some of the biggest collector websites on the market. And adding an OCR system was an ideal solution for data extraction from both cards and their graded slabs.

Automatic Recognition of Collectibles

Ximilar built an AI system for the detection, recognition and grading of collectibles. Check it out!

We developed an OCR system that extracts all text characters from both the card and its slab in the image. Then GPT processes these texts and provides structured information. For instance, the name of the player, the card, its grade and name of grading company, or labels from PSA.

Extracting text from the trading card via OCR and then using GPT prompt to get relevant information.

Needless to say, we are pretty big fans of collectible cards ourselves. So we’ve been enjoying working on AI not only for sports cards but also for trading card games. We recently developed several solutions tuned specifically for the most popular trading card games such as Pokémon, Magic: The Gathering, or Yu-Gi-Oh!, and have been adding new features and games constantly. Do you like the idea of trading card recognition automation? See how it works in our public demo.

How Can I Use the OCR & GPT API On My Images or PDFs?

Our OCR software is publicly available via an online REST API. This is how you can use it:

  1. Log into Ximilar App

    • Get your free API TOKEN to connect to API – Once you sign up to Ximilar App, you will get a free API token, which allows your authentication. The API documentation is here to help you with the basic setup. You can connect it with any programming language and any platform like iOS or Android. We provide a simple Python SDK for calling the API.

    • You can also try the service directly in the App under Computer Vision Platform.

  2. For simple text extraction from your image, call the endpoint read.

    https://api.ximilar.com/ocr/v2/read
  3. For text extraction from an image and its post-processing with GPT, use the endpoint read_gpt. To get the results in the desired structure, you will need to specify the prompt query along with your input images in the API request, and the system will return the results immediately (see the Python sketch after this list).

    https://api.ximilar.com/ocr/v2/read_gpt
  4. The output is JSON with an '_ocr' field. Its "texts" list contains the detected words and sentences, each with a polygon that encapsulates it in the image. The "full_text" field contains all strings concatenated together. The API also returns the language name ("lang_name") and language code ("lang_code"; ISO 639-1). Here is an example:

    {
        "_url": "__URL_PATH_TO_IMAGE__",
        "_ocr": {
            "texts": [
                {
                    "polygon": [[53.0,76.0],[116.0,76.0],[116.0,94.0],[53.0,94.0]],
                    "text": "MICKEY MANTLE",
                    "prob": 0.9978849291801453
                },
                ...
            ],
            "full_text": "MICKEY MANTLE 1st Base Yankees",
            "lang_name": "english",
            "lang_code": "en"
        }
    }

    Our OCR engine supports several alphabets (Latin, Chinese, Korean, Japanese and Cyrillic) and languages (English, German, Chinese, …).
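
For illustration, a minimal Python sketch of calling the read_gpt endpoint is below. The name of the "prompt" field is an assumption based on the description above ("specify the prompt query along with your input images"), and the response parsing relies on the "_ocr" structure shown in the example; check the API documentation for the exact request and response schema.

import requests

payload = {
    "records": [{"_url": "__URL_PATH_TO_IMAGE__"}],
    # Hypothetical field name for the GPT post-processing query
    "prompt": "What are the items on the receipt and what is the total price? Answer as JSON.",
}

response = requests.post(
    "https://api.ximilar.com/ocr/v2/read_gpt",
    headers={"Authorization": "Token __API_TOKEN__"},
    json=payload,
    timeout=60,
)
response.raise_for_status()

for record in response.json().get("records", []):
    print(record.get("_ocr", {}).get("full_text"))  # all extracted text concatenated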

Integrate the Combination of OCR and ChatGPT In Your System

All our solutions, including the combination of OCR & GPT, are available via API. Therefore, they can be easily integrated into your system, website, app, or infrastructure.

Here are some examples of up-to-date solutions that can easily be built on our platform and automate your workflows:

  • Detection, recognition & text extraction system – You can let the users of your website or app upload images of collectibles and get relevant information about them immediately. Once they take an image of the item, our system detects its position (and can mark it with a bounding box). Then, it recognizes its features (e.g., the name of the card, collectible coin, or comic book) and extracts texts with OCR, and you will get text data for your website (e.g., in a table format).

  • Card grade reading system – If your users upload images of graded cards or other collectibles, our system can detect everything including the grades and labels on the slabs in a matter of milliseconds.

  • Comic book recognition & search engine – You can extract all texts from each image of a comic book and automatically match it to your database for cataloguing.

  • Giving your collection or database of collectibles order – Imagine you have a website featuring a rich collection of collectible items, getting images from various sources and comparing their prices. The metadata can be quite inconsistent amongst source websites, or be absent in the case of user-generated content. AI can recognize, match, find and extract information from images based purely on computer vision and independent of any kind of metadata.

Let’s Build Your Solution

If you would like to learn more about how you can automate the workflows in your company, I recommend browsing our page All Solutions, where we briefly explained each solution. You can also check out pages such as Visual AI for Collectibles, or contact us right away to discuss your unique use case. If you’d like to learn more about how we work on customer projects step by step, go to How it Works.

Ximilar’s computer vision platform enables you to develop AI-powered systems for image recognition, visual quality control, and more without knowledge of coding or machine learning. You can combine them as you wish and upgrade any of them anytime.

Don’t forget to visit the free public demo to see how the basic services work. Your custom solution can be assembled from many individual services. This modular structure enables us to upgrade or change any piece anytime, while you save your money and time.

Predict Values From Images With Image Regression https://www.ximilar.com/blog/predict-values-from-images-with-image-regression/ Wed, 22 Mar 2023 15:03:45 +0000 https://www.ximilar.com/?p=12666 With image regression, you can assess the quality of samples, grade collectible items or rate & rank real estate photos.

We are excited to introduce the latest addition to Ximilar’s Computer Vision Platform. Our platform is a great tool for building image classification systems, and now it also includes image regression models. They enable you to extract values from images with accuracy and efficiency and save your labor costs.

Let’s take a look at what image regression is and how it works, including examples of the most common applications. More importantly, I will tell you how you can train your own regression system on a no-code computer vision platform. As more and more customers seek to extract information from pictures, this new feature is sure to provide Ximilar’s customers with the tools they need to stay ahead of the curve in today’s highly competitive AI-driven market.

What is the Difference Between Image Categorization and Regression?

Image recognition models are ideal for the recognition of images or objects in them, their categorization and tagging (labelling). Let’s say you want to recognize different types of car tyres or their patterns. In this case, categorization and tagging models would be suitable for assigning discrete features to images. However, if you want to predict any continuous value from a certain range, such as the level of tyre wear, image regression is the preferred approach.

Image regression is an advanced machine-learning technique that can predict continuous values within a specific range. Whenever you need to rate or evaluate a collection of images, an image regression system can be incredibly useful.

For instance, you can define a range of values, such as 0 to 5, where 0 is the worst and 5 is the best, and train an image regression task to predict the appropriate rating for given products. Such predictive systems are ideal for assigning values to several specific features within images. In this case, the system would provide you with highly accurate insights into the wear and tear of a particular tyre.

Predicting the level of tyre wear from an image is a use case for an image regression task, while a categorization task can recognize the pattern of the tyre.

How to Train Image Regression With a Computer Vision Platform?

Simply log in to Ximilar App and go to Categorization & Tagging. Upload your training pictures and under Tasks, click on Create a new task and create a Regression task.

Creating an image regression task in Ximilar App.

You can train regression tasks and test them via the same front end or with API. You can develop an AI prediction task for your photos with just a few clicks, without any coding or any knowledge of machine learning.

This way, you can create an automatic grading system able to analyze an image and provide a numerical output in the defined range.

Use the Same Training Data For All Your Image Classification Tasks

Both image recognition and image regression methods fall under the image classification techniques. That is why the whole process of working with regression is very similar to categorization & tagging models.

Working with image regression model on Ximilar computer vision platform.

Both technologies can work with the same datasets (training images), and inputs of various image sizes and types. In both cases, you can simply upload your data set to the platform, and after creating a task, label the pictures with appropriate continuous values, and then click on the Train button.

Apart from a machine learning platform, we offer a number of AI solutions that are field-tested and ready to use. Check out our public demos to see them in action.

If you would like to build your first image classification system on a no-code machine learning platform, I recommend checking out the article How to Build Your Own Image Recognition API. We defined the basic terms in the article How to Train Custom Image Classifier in 5 Minutes. We also made a basic video tutorial:

Tutorial: train your own image recognition model with Ximilar platform.

Neural Network: The Technology Behind Predicting Range Values on Images

The simplest technique for predicting float values is linear regression. This can be further extended to polynomial regression. These two statistical techniques work great on tabular input data. However, when it comes to predicting numbers from images, a more advanced approach is required. That’s where neural networks come in. Mathematically speaking, a neural network “f” can be trained to predict a value “y” from a picture “x”, i.e., “y = f(x)”.

Neural networks can be thought of as approximations of functions that we aim to identify through the optimization on training data. The most commonly used NNs for image-based predictions are Convolutional Neural Networks (CNNs), visual transformers (VisT), or a combination of both. These powerful tools analyze pictures pixel by pixel, and learn relevant features and patterns that are essential for solving the problem at hand.

CNNs are particularly effective in picture analysis tasks. They are able to detect features at different spatial scales and orientations. Meanwhile, VisTs have been gaining popularity due to their ability to learn visual features without being constrained by spatial invariance. When used together, these techniques can provide a comprehensive approach to image-based predictions. We can use them to extract the most relevant information from images.
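
To make the idea concrete, here is a minimal PyTorch sketch of an image regression model: a CNN backbone whose classification head is replaced by a single continuous output, trained with a mean-squared-error loss. This is only an illustration of the underlying technique, not how the Ximilar platform is used – there, you train regression tasks without writing any code.

import torch
import torch.nn as nn
from torchvision import models

class ImageRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)              # CNN feature extractor
        backbone.fc = nn.Linear(backbone.fc.in_features, 1)   # one continuous output value
        self.backbone = backbone

    def forward(self, x):
        return self.backbone(x).squeeze(1)

model = ImageRegressor()
criterion = nn.MSELoss()                                      # regression objective
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One illustrative training step on a dummy batch of 224x224 RGB images
images = torch.randn(8, 3, 224, 224)
targets = torch.rand(8) * 5                                   # e.g. ratings in the range 0-5
loss = criterion(model(images), targets)
loss.backward()
optimizer.step()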

What Are the Most Common Applications of Value Regression From Images?

Estimating Age From Photos

Probably the most widely known use case of image regression by the public is age prediction. You can come across them on social media platforms and mobile apps, such as Facebook, Instagram, Snapchat, or Face App. They apply deep learning algorithms to predict a user’s age based on their facial features and other details.

While image recognition provides information on the object or person in the image, the regression system tells us a specific value – in this case, the person’s age.

Needless to say, these plugins are not always correct and can sometimes produce biased results. Despite this limitation, various image regression models are gaining popularity on various social sites and in apps.

Ximilar already provides a face-detection solution. Models such as age prediction can be easily trained and deployed on our platform and integrated into your system.

Value Prediction and Rating of Real Estate Photos

Pictures play an essential part on real estate sites. When people are looking for a new home or investment, they are navigating through the feed mainly by visual features. With image regression, you are able to predict the state, quality, price, and overall rating of real estate from photos. This can help with both searching and evaluating real estate.

Predicting the rating and price of household images with image regression.

Custom recognition models are also great for the recognition & categorization of the features present in real estate photos. For example, you can determine whether a room is furnished, what type of room it is, and categorize the windows and floors based on their design.

Additionally, a regression can determine the quality or state of floors or walls, as well as rank the overall visual aesthetics of households. You can store all of this information in your database. Your users can then use such data to search for real estate that meets specific criteria.

Image classification systems such as image recognition and value regression are ideal for real estate ranking. Your visitors can search the database with the extracted data.

Determining the Degree of Wear and Tear With AI

Visual AI is increasingly being used to estimate the condition of products in photos. While recognition systems can detect individual tears and surface defects, regression systems can estimate the overall degree of wear and tear of things.

A good example of an industry that has seen significant adoption of such technology is the insurance industry. For example, startups like Lemonade Inc. or Root use AI when paying out insurance claims.

With custom image recognition and regression methods, it is now possible to automate the process of insurance claims. For instance, a visual AI system can indicate the seriousness of damage to cars after accidents or assess the wear and tear of various parts such as suspension, tires, or gearboxes. The same goes with other types of insurance, including households, appliances, or even collectible & antique items.

Our platform is commonly utilized to develop recognition and detection systems for visual quality control & defect detection. Read more in the article Visual AI Takes Quality Control to a New Level.

Automatic Grading of Antique & Collectible Items Such as Sports Cards

Apart from car insurance and damage inspection, recognition and regression are great for all types of grading and sorting systems, for instance on price comparators and marketplaces of collectible and antique items. Deep learning is ideal for the automatic visual grading of collector items such as comic books and trading cards.

By leveraging visual AI technology, companies can streamline their processes, reduce manual labor significantly, cut costs, and enhance the accuracy and reliability of their assessments, leading to greater customer satisfaction.

Automatic Recognition of Collectibles

Ximilar built an AI system for the detection, recognition and grading of collectibles. Check it out!

Food Quality Estimation With AI

Biotech, Med Tech, and Industry 4.0 also have a lot of applications for regression models. For example, they can estimate the approximate level of fruit & vegetable ripeness or freshness from a simple camera image.

The grading of vegetables by an image regression model.

For instance, this Japanese farmer is using deep learning for cucumber quality checks. Looking for quality control or estimation of size and other parameters of olives, fruits, or meat? You can easily create a system tailored to these use cases without coding on the Ximilar platform.

Build Custom Evaluation & Grading Systems With Ximilar

Ximilar provides a no-code visual AI platform accessible via App & API. You can log in and train your own visual AI without the need to know how to code or have expertise in deep learning techniques. It will take you just a few minutes to build a powerful AI model. Don’t hesitate to test it for free and let us know what you think!

Our developers and annotators are also able to build custom recognition and regression systems from scratch. We can help you with the training of the custom task and then with the deployment in production. Both custom and ready-to-use solutions can be used via API or even deployed offline.

The post Predict Values From Images With Image Regression appeared first on Ximilar: Visual AI for Business.

]]>
How to Build a Good Visual Search Engine? https://www.ximilar.com/blog/how-to-build-a-good-visual-search-engine/ Mon, 09 Jan 2023 14:08:28 +0000 https://www.ximilar.com/?p=12001 Let's take a closer look at the technology behind visual search and the key components of visual search engines.

The post How to Build a Good Visual Search Engine? appeared first on Ximilar: Visual AI for Business.

]]>
Visual search is one of the most-demanded computer vision solutions. Our team at Ximilar has been actively developing the best general multimedia visual search engine for retailers, startups, and bigger companies that need to process a lot of images, video content, or 3D models.

However, a universal visual search solution is not the only thing that customers around the world will require in the future. Especially smaller companies and startups now more often look for custom or customizable visual search solutions for their sites & apps, built in a short time and for a reasonable price. What does creating a visual search engine actually look like? And can a visual search engine be built by anyone?

This article should provide a bit deeper insight into the technology behind visual search engines. I will describe the basic components of a visual search engine, analyze approaches to machine learning models and their training datasets, and share some ideas, training tips, and techniques that we use when creating visual search solutions. Those who do not wish to build a visual search from scratch can skip right to Building a Visual Search Engine on a Machine Learning Platform.

What Exactly Does a Visual Search Engine Mean?

The technology of visual search in general analyses the overall visual appearance of the image or a selected object in an image (typically a product), observing numerous features such as colours and their transitions, edges, patterns, or details. It is powered by AI trained specifically to understand the concept of similarity the way you perceive it.

In a narrow sense, visual search usually refers to a process in which a user uploads a photo that is used as an image search query by a visual search engine. This engine in turn provides the user with either identical or similar items. You can find this technology under terms such as reverse image search, search by image, or simply photo & image search.

However, reverse image search is not the only use of visual search. The technology has numerous applications. It can search for near-duplicates, match duplicates, or recommend more or less similar images. All of these visual search tools can be used together in an all-in-one visual search engine, which helps internet users find, compare, match, and discover visual content.

And if you combine these visual search tools with other computer vision solutions, such as object detection, image recognition, or tagging services, you get a quite complex automated image-processing system. It will be able to identify images and objects in them and apply both keywords & image search queries to provide as relevant search results as possible.

Different computer vision systems can be combined on Ximilar platform via Flows. If you would like to know more, here’s an article about how Flows work.

Typical Visual Search Engines:
Google Lens & Pinterest Lens

Big visual search industry players such as Shutterstock, eBay, Pinterest (Pinterest Lens) or Google Images (Google Lens & Google Images) have already implemented visual search engines, as well as other advanced, yet hidden algorithms to satisfy the increasing needs of online shoppers and searchers. It is predicted that a majority of big companies will implement some form of soft AI in their everyday processes in the next few years.

The Algorithm for Training
Visual Similarity

The Components of a Visual Search Tool

Multimedia search engines are very powerful systems consisting of multiple parts. The first key component is storage (database). It wouldn’t be exactly economical to store the full sample (e.g., .jpg image or .mp4 video) in a database. That is why we do not store any visual data for visual search. Instead, we store just a representation of the image, called a visual hash.

The visual hash (also visual descriptor or embedding) is basically a vector, representing the data extracted from your image by the visual search. Each visual hash should be a unique combination of numbers to represent a single sample (image). These vectors also have some mathematical properties, meaning you can compare them, e.g., with cosine, hamming, or Euclidean distance.

So the basic principle of visual search is: the more similar the images are, the more similar will their vector representations be. Visual search engines such as Google Lens are able to compare incredible volumes of images (i.e., their visual hashes) to find the best match in a hundred milliseconds via smart indexing.
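
To give a concrete idea of how such a comparison works, here is a minimal sketch with NumPy (the 4-dimensional vectors are made up for illustration; real embeddings typically have hundreds of dimensions):

import numpy as np

# Two toy visual hashes (embeddings) of two images.
a = np.array([0.12, 0.80, 0.33, 0.05])
b = np.array([0.10, 0.75, 0.40, 0.07])

euclidean = np.linalg.norm(a - b)                                    # smaller = more similar
cosine = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # cosine distance

print(euclidean, cosine)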

How to Create a Visual Hash?

The visual hashes can be extracted from images by standard algorithms such as PHASH. However, the era of big data gives us a much stronger model for vector representation – a neural network. A simple overview of the image search system built with a neural network can look like this:

Extracting visual vectors with the neural network and searching with them in a similarity collection.

This neural network was trained on images from a website selling cosmetics. Here, it extracted the embeddings (vectors), and they were stored in a database. Then, when a customer uploads an image to the visual search engine on the website, the neural network will extract the embedding vector from this image as well, and use it to find the most similar samples.

Of course, you could also store other metadata in the database, and do advanced filtering or add keyword search to the visual search.

Types of Neural Networks

There are several basic architectures of neural networks that are widely used for vector representations. You can encode almost anything with a neural network. The most common for images is a convolutional neural network (CNN).

There are also special architectures to encode words and text. Lately, so-called transformer neural networks are starting to be more popular for computer vision as well as for natural language processing (NLP). Transformers use a lot of new techniques developed in the last few years, such as an attention mechanism. The attention mechanism, as the name suggests, is able to focus only on the “interesting” parts of the image & ignore the unnecessary details.

Training the Similarity Model

There are multiple methods to train models (neural networks) for image search. First, we should know that training of machine learning models is based on your data and loss function (also called objective or optimization function).

Optimization Functions

The loss function usually computes the error between the output of the model and the ground truth (labels) of the data. This value is then used for adjusting the weights of the model. The model can be interpreted as a function and its weights as parameters of this function. Therefore, if the value of the loss function is big, you should adjust the weights of the model.

How it Works

The model is trained iteratively, taking subsamples of the dataset (batches of images) and going over the entire dataset multiple times. We call one such pass of the dataset an epoch. During one batch analysis, the model needs to compute the loss function value and adjust weights according to it. The algorithm for adjusting the weights of the model is called backpropagation. Training is usually finished when the loss function is not improving (minimizing) anymore.
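
For readers who prefer code, this is roughly what such a training loop looks like in PyTorch, a minimal sketch with a toy model and fake data, not a production setup:

import torch
import torch.nn as nn

# Toy setup so the loop runs end to end; a real dataset and model would replace these.
model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 10))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
images = torch.randn(64, 3, 32, 32)        # one batch of fake images
labels = torch.randint(0, 10, (64,))       # fake ground-truth classes

for epoch in range(3):                     # several passes (epochs) over the tiny "dataset"
    optimizer.zero_grad()
    outputs = model(images)                # forward pass
    loss = loss_fn(outputs, labels)        # error between prediction and ground truth
    loss.backward()                        # backpropagation
    optimizer.step()                       # adjust the weights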

We can divide the methods (based on loss function) depending on the data we have. Imagine that we have a dataset of images, and we know the class (category) of each image. Our optimization function (loss function) can use these classes to compute the error and modify the model.

The advantage of this approach is its simple implementation. It's practically only a few lines in any modern framework like TensorFlow or PyTorch. However, it also has a big disadvantage: class-level optimization functions don't scale well with the number of classes. We could potentially have thousands of classes (e.g., there are thousands of fashion products and each product represents a class). The computation of such a function with thousands of classes/arguments can be slow. There could also be a problem with fitting everything on the GPU card.

Loss Function: A Few Tips

If you work with a lot of labels, I would recommend using a pair-based loss function instead of a class-based one. The pair-based function usually takes two or more samples from the same class (i.e., the same group or category). A model based on a pair-based loss function doesn’t need to output prediction for so many unique classes. Instead, it can process just a subsample of classes (groups) in each step. It doesn’t know exactly whether the image belongs to class 1 or 9999. But it knows that the two images are from the same class.

Images can be labelled manually or by a custom image recognition model. Read more about image recognition systems.

The Distance Between Vectors

The picture below shows the data in the so-called vector space before and after model optimization (training). In the vector space, each image (sample) is represented by its embedding (vector). Our vectors have two dimensions, x and y, so we can visualize them. The objective of model optimization is to learn the vector representation of images. The loss function is forcing the model to predict similar vectors for samples within the same class (group).

By similar vectors, I mean that the Euclidean distance between the two vectors is small. The larger the distance, the more different these images are. After the optimization, the model assigns a new vector to each sample. Ideally, the model should maximize the distance between images with different classes and minimize the distance between images of the same class.

Optimization for visual search should maximize the distance of items between different categories and minimize the distance within the category.

Sometimes we don’t know anything about our data in advance, meaning we do not have any metadata. In such cases, we need to use unsupervised or self-supervised learning, about which I will talk later in this article. Big tech companies do a lot of work with unsupervised learning. Special models are being developed for searching in databases. In research papers, this field is often called deep metric learning.

Supervised & Unsupervised Machine Learning Methods

1) Supervised Learning

As I mentioned, if we know the classes of images, the easiest way to train a neural network for vectors is to optimize it for the classification problem. This is a classic image recognition problem. The loss function is usually cross-entropy loss. In this way, the model is learning to predict predefined classes from input images. For example, to say whether the image contains a dog, a cat or a bird. We can get the vectors by removing the last classification layer of the model and getting the vectors from some intermediate layer of the network.
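
As an illustration, this is one way to turn a classification network into an embedding extractor in PyTorch (a sketch assuming a recent torchvision version; the choice of ResNet-18 and of the cut-off layer is ours):

import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-trained ResNet-18 and drop its final classification layer,
# so the network outputs a 512-dimensional embedding instead of class scores.
backbone = models.resnet18(weights="IMAGENET1K_V1")
embedder = nn.Sequential(*list(backbone.children())[:-1])
embedder.eval()

with torch.no_grad():
    image = torch.randn(1, 3, 224, 224)    # a placeholder image tensor
    vector = embedder(image).flatten(1)    # shape: (1, 512)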

When it comes to the pair-based loss function, one of the oldest techniques for metric learning is the Siamese network (contrastive learning). The name contains “Siamese” because there are two identical models with the same weights. In the Siamese network, we need to have pairs of images, which we label based on whether they are or aren’t equal (i.e., from the same class or not). Pairs in the batch that are equal are labelled with 1 and unequal pairs with 0.

In the following image, we can see different batch construction methods that depend on our model: Siamese (contrastive) network, Triplet, or N-pair, which I will explain below.

Each deep learning architecture requires different batch construction methods. For example, Siamese and N-pair require tuples. However, in N-pair, the tuples must be unique.

Triplet Neural Network and Online/Offline Mining

In the Triplet method, we construct triplets of items, two of which (anchor and positive) belong to the same category and the third one (negative) to a different category. This can be harder than you might think because picking the “right” samples in the batch is critical. If you pick items that are too easy or too difficult, the network will converge (adjust weights) very slowly or not at all. The triplet loss function contains an important constant called margin. Margin defines what should be the minimum distance between positive and negative samples.
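
As a concrete example, PyTorch ships a ready-made triplet margin loss; here is a minimal sketch with random tensors standing in for the embeddings of anchor, positive, and negative images (the margin value is illustrative):

import torch
import torch.nn as nn

# The negative must end up at least `margin` farther from the anchor than the positive.
triplet_loss = nn.TripletMarginLoss(margin=0.2)

anchor   = torch.randn(32, 128)    # embeddings of 32 anchor images
positive = torch.randn(32, 128)    # embeddings from the same classes as the anchors
negative = torch.randn(32, 128)    # embeddings from different classes
loss = triplet_loss(anchor, positive, negative)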

Picking the right samples in deep metric learning is called mining. We can find optimal triplets via either offline or online mining. The difference is that during offline mining, you find the triplets at the beginning of each epoch.

Online & Offline Mining

The disadvantage of offline mining is that computing embeddings for each sample is not very computationally efficient. During the epoch, the model can change rapidly, so embeddings are becoming obsolete. That’s why online mining of triplets is more popular. In online mining, each batch of triplets is created before fitting the model. For more information about mining and batch strategies for triplet training, I would recommend this post.

We can visualize the Triplet model training in the following way. The model is copied three times, but it has the same shared weights. Each model takes one image from the triplet (anchor, positive, negative) and outputs the embedding vector. Then, the triplet loss is computed and weights are adjusted with backpropagation. After the training is done, the model weights are frozen and the output of the embeddings is used in the similarity engine. Because the three models have shared weights (the same), we take only one model that is used for predicting embedding vectors on images.

Triplet network that takes a batch of anchor, positive and negative images.

N-pair Models

A more modern approach is the N-pair model. The advantage of this model is that you don't mine negative samples, as you do with a triplet network. The batch consists of just positive pairs. The negative samples are handled through the matrix construction, where all non-diagonal items act as negative samples.

You still need to do online mining. For example, you can select a batch with a maximum value of the loss function, or pick pairs that are distant in metric space.

The N-pair model requires a unique pair of items. In the triplet and Siamese model, your batch can contain multiple triplets/pairs from the same class (group).

In our experience, the N-pair model is much easier to fit, and the results are also better than with the triplet or Siamese model. You still need to do a lot of experiments and know how to tune other hyperparameters such as learning rate, batch size, or model architecture. However, you don't need to work with the margin value in the loss function, as you do in triplet or Siamese training. The small drawback is that during batch creation, you always need exactly two items per class/product.

Proxy-Based Methods

In the proxy-based methods (Proxy-Anchor, Proxy-NCA, Soft Triple) the model is trying to learn class representatives (proxies) from samples. Imagine that instead of having 10,000 classes of fashion products, we will have just 20 class representatives. The first representative will be used for shoes, the second for dresses, the third for shirts, the fourth for pants and so on.

A big advantage is that we don’t need to work with so many classes and the problems coming with it. The idea is to learn class representatives and instead of slow mining “the right samples” we can use the learned representatives in computing the loss function. This leads to much faster training & convergence of the model. This approach, as always, has some cons and questions like how many representatives should we use, and so on.

MultiSimilarity Loss

Finally, it is worth mentioning MultiSimilarity Loss, introduced in this paper. MultiSimilarity Loss is suitable in cases when you have more than two items per class (images per product). The authors of the paper are using 5 samples per class in a batch. MultiSimilarity can bring closer items within the same class and push the negative samples far away by effectively weighting informative pairs. It works with three types of similarities:

  • Self-Similarity (the distance between the negative sample and anchor)
  • Positive-Similarity (the relationship between positive pairs)
  • Negative-Similarity (the relationship between negative pairs)
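
If you want to try it out, the pytorch-metric-learning library provides an implementation; here is a minimal sketch (assuming the library is installed, with its default hyperparameters):

import torch
from pytorch_metric_learning import losses

# MultiSimilarity loss needs only embeddings and their class labels;
# informative pairs are weighted internally.
loss_fn = losses.MultiSimilarityLoss(alpha=2, beta=50, base=0.5)

embeddings = torch.randn(40, 128)               # e.g. 8 classes x 5 samples per class
labels = torch.arange(8).repeat_interleave(5)   # class label for each embedding
loss = loss_fn(embeddings, labels)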

Finally, it is also worth noting that you don't need to use only one loss function; you can combine multiple loss functions. For example, you can use the Triplet loss function with CrossEntropy and MultiSimilarity, or N-pair together with Angular loss. This often leads to better results than a standalone loss function.

2) Unsupervised Learning

AutoEncoder

Unsupervised learning is helpful when we have a completely unlabelled dataset, meaning we don’t know the classes of our images. These methods are very interesting because the annotation of data can be very expensive and time-consuming. The most simplistic unsupervised learning can simply use some form of AutoEncoder.

AutoEncoder is a neural network consisting of two parts: an encoder, which encodes the image to the smaller representation (embedding vector), and a decoder, which is trying to reconstruct the original image from the embedding vector.

After the whole model is trained, and the decoder is able to reconstruct the images from smaller vectors, the decoder part is discarded and only the encoder part is used in similarity search engines.

Simple AutoEncoder neural network for learning embeddings via reconstruction of the image.
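
A minimal fully-connected AutoEncoder in PyTorch might look like this (a sketch; a real system would use convolutional encoders and decoders, and the layer sizes here are illustrative):

import torch.nn as nn

# The encoder compresses a flattened 64x64 RGB image into a 128-dimensional embedding,
# and the decoder tries to reconstruct the original pixels from that embedding.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64 * 3, 512), nn.ReLU(), nn.Linear(512, 128))
decoder = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 64 * 64 * 3))

autoencoder = nn.Sequential(encoder, decoder)
reconstruction_loss = nn.MSELoss()
# After training, only `encoder` is kept and its 128-dim outputs are indexed for similarity search.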

There are many other solutions for unsupervised learning. For example, we can train AutoEncoder architecture to colourize images. In this technique, the input image has no colour and the decoding part of the network tries to output a colourful image.

Image Inpainting

Another technique is Image Inpainting, where we remove parts of the image and the model learns to paint them back in. Other interesting approaches train a model to solve jigsaw puzzles or to find the correct ordering of frames in a video.

Then there are more advanced unsupervised models like SimCLR, MoCo, PIRL, SimSiam or GAN architectures. All these models try to internally represent images so their outputs (vectors) can be used in visual search systems. The explanation of these models is beyond the scope of this article.

Tips for Training Deep Metric Models

Here are some useful tips for training deep metric learning models:

  • Batch size plays an important role in deep metric learning. Some methods such as N-pair should have bigger batch sizes. Bigger batch sizes generally lead to better results, however, they also require more memory on the GPU card.
  • If your dataset has a bigger variation and a lot of classes, use a bigger batch size for Multi-similarity loss.
  • The most important part of metric learning is your data. It’s a pity that most research, as well as articles, focus only on models and methods. If you have a large collection with a lot of products, it is important to have a lot of samples per product. If you have fewer classes, try to use some unsupervised method or cross-entropy loss and do heavy augmentations. In the next section, we will look at data in more depth.
  • Try to start with a pre-trained model and tune the learning rate.
  • When using Siamese or Triplet training, try to play with the margin term, all the modern frameworks will allow you to change it (make it harder) during the training.
  • Don’t forget to normalize the output of the embedding if the loss function requires it. Because we are comparing vectors, they should be normalized in a way that the norm of the vectors is always 1. This way, we are able to compute Euclidean or cosine distances.
  • Use advanced methods such as MultiSimilarity with big batch size. If you use Siamese, Triplet, or N-pair, mining of negatives or positives is essential. Start with easier samples at the beginning and increase the challenging samples every epoch.

Neural Text Search on Images with CLIP

Until now, we have been talking purely about images and searching with image queries. However, a common use case is to search a collection of images with text input, like we do with Google or Bing search. This is also called the text-to-image problem, because we need to transform the text representation into the same vector space as the images. Luckily, researchers from OpenAI developed a simple yet powerful architecture called CLIP (Contrastive Language-Image Pre-training). The concept is simple: instead of training on pairs of images (as in Siamese or N-pair models), we train two models (one for images and one for text) on pairs of images and texts.

The architecture of CLIP model by OpenAI. Image Source Github

You can train a CLIP model on a dataset and then use it on your images (or videos) collection. You are able to find similar images/products or try to search your database with a text query. If you would like to use a CLIP-like model on your data, we can help you with the development and integration of the search system. Just contact us at care@ximilar.com, and we can create a search system for your data.
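
For illustration, searching with a pre-trained CLIP model can be as short as this (a sketch assuming the open-source clip package from OpenAI's GitHub repository is installed; the image file and the text queries are made up):

import torch
import clip
from PIL import Image

# Load a pre-trained CLIP model; both encoders map into the same vector space.
model, preprocess = clip.load("ViT-B/32")

image = preprocess(Image.open("dress.jpg")).unsqueeze(0)
text = clip.tokenize(["a red summer dress", "a leather boot"])

with torch.no_grad():
    image_vector = model.encode_image(image)    # embedding of the photo
    text_vectors = model.encode_text(text)      # embeddings of the text queries
    # cosine similarity between the image and each text query
    similarities = torch.cosine_similarity(image_vector, text_vectors)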

The Training Data
for Visual Search Engines

99 % of the deep learning models have a very expensive requirement: data. Data should not contain any errors such as wrong labels, and we should have a lot of them. However, obtaining enough samples can be a problematic and time-consuming process. That is why techniques such as transfer learning or image augmentation are widely used to enrich the datasets.

How Does Image Augmentation Help With Training Datasets?

Image augmentation is a technique allowing you to multiply training images and therefore expand your dataset. When preparing your dataset, proper image augmentation is crucial. Each specific category of data requires unique augmentation settings for the visual search engine to work properly. Let’s say you want to build a fashion visual search engine based strictly on patterns and not the colours of items. Then you should probably employ heavy colour distortion and channel-swapping augmentation (randomly swapping red, green, or blue channels of an image).

On the other hand, when building an image search engine for a shop with coins, you can rotate the images and flip them left to right and upside down (see the sketch below). But what to do if the classic augmentations are not enough? We have a few more options.
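
Here is a sketch of those two augmentation settings using torchvision transforms (the parameter values are illustrative assumptions, not a recipe):

from torchvision import transforms

# Pattern-focused fashion search: heavy colour distortion so the model ignores colours.
pattern_augmentation = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.8, hue=0.4),
    transforms.RandomGrayscale(p=0.2),
])

# Coin search: geometry can change freely, colours should stay untouched.
coin_augmentation = transforms.Compose([
    transforms.RandomRotation(degrees=180),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
])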

Removing or Replacing Background

Most of the models that are used for image search require pairs of different images of the same object. Typically, when training product image search, we use an official product photo from a retail site and another picture from a smartphone, such as a real-life photo or a screenshot. This way, we get a pair-based model that understands the similarity of a product in pictures with different backgrounds, lights, or colours.

The difference between a product photo and a real-life image made with a smartphone, both of which are important to use when training computer vision models.

All such photos of the same product belong to an entity which we call a Similarity Group. This way, we can build an interactive tool for your website or app, which enables users to upload a real-life picture (sample) and find the product they are interested in.

Background Removal Solution

Sometimes, obtaining multiple images of the same group can be impossible. We found a way to tackle this issue by developing a background removal model that can distinguish the dominant foreground object from its background and detect its pixel-accurate position.

Once we know the exact location of the object, we can generate new photos of products with different backgrounds, making the training of the model more effective with just a few images.

The background removal can also be used to narrow the area of augmentation only to the dominant item, ignoring the background of the image. There are a lot of ways to get the original product in different styles, including changing saturation, exposure, highlights and shadows, or changing the colours entirely.

Generating more variants can make your model very robust.

Building such an augmentation pipeline with background/foreground augmentation can take hundreds of hours and a lot of GPU resources. That is why we deployed our Background Removal solution as a ready-to-use image tool.

You can use the Background Removal as a stand-alone service for your image collections, or as a tool for training data augmentation. It is available in public demo, App, and via API.

GAN-Based Methods for Generating New Training Data

One of the modern approaches is to use a Generative Adversarial Network (GAN). GANs are incredibly powerful in generating whole new images from some specific domain. You can simply create a model for generating new kinds of insects or making birds with different textures.

Creating new insect images automatically to train an image recognition system? How cool is that? There are endless possibilities with GAN models for basically any image type. [Source]

The greatest advantage of GAN is you will easily get a lot of new variants, which will make your model very robust. GANs are starting to be widely used in more tasks such as simulations, and I think the gathering of data will cost much less in the near future because of them. In Ximilar, we used GAN to create a GAN Image Upscaler, which adds new relevant pixels to images to increase their resolution and quality.

When creating a visual search system on our platform, our team picks the most suitable neural network architecture, loss functions, and image augmentation settings through the analysis of your visual data and goals. All of these are critical for the optimization of a model and the final accuracy of the system. Some architectures are more suitable for specific problems like OCR systems, fashion recommenders, or quality control. The same goes for image augmentation: choosing the wrong settings can ruin the optimization. We have experience with selecting the best tools to solve specific problems.

Annotation System for Building Image Search Datasets

As we can see, a good dataset definitely is one of the key elements for training deep learning models. Obtaining such a collection can be quite expensive and time-consuming. With some of our customers, we build a system that continually gathers the images needed in the training datasets (for instance, through a smartphone app). This feature continually & automatically improves the precision of the deployed search engines.

How does it work? When the new images are uploaded to Ximilar Platform (through Custom Similarity service) either via App or API, our annotators can check them and use them to enhance the training dataset in Annotate, our interface dedicated to image annotation & management of datasets for computer vision systems.

Annotate effectively works with the similarity groups by grouping all images of the same item. The annotator can add the image to a group with the relevant Stock Keeping Unit (SKU), label it as either a product picture or a real-life photo, add some tags, or mark objects in the picture. They can also mark images that should be used for the evaluation and not used in the training process. In this way, you can have two separate datasets, one for training and one for evaluation.

We are quite proud of all the capabilities of Annotate, such as quality control, team cooperation, or API connection. There are not many web-based data annotation apps where you can effectively build datasets for visual search, object detection, as well as image recognition, and which are connected to a whole visual AI platform based on computer vision.

A sneak peek into Annotate – image annotation tool for building visual search and image similarity models.

How to Improve Visual Search Engine Results?

We have already established that the optimization algorithm and the training dataset are key elements in training your similarity model, and that having multiple images per product significantly increases the quality of the trained similarity model. The similarity model (a CNN or another modern architecture) is used for embedding (vector) extraction, which determines the quality of the image search.

Over the years that we’ve been training visual search engines for various customers around the world, we were also able to identify several potential weak spots. Their fixing really helped with the performance of searches as well as the relevance of the search results. Let’s take a look at what can improve your visual search engine:

Include Tags

Adding relevant keywords to every image can improve the search results dramatically. We recommend using some basic words that are not synonymous with each other. Wrong keywords for one item would be, for instance, “sky, skyline, cloud, cloudy, building, skyscraper, tall building, a city”, while good alternative keywords would be “sky, cloud, skyscraper, city”.

Our engine can internally use these tags and improve the search results. You can let an image recognition system label the images instead of adding the keywords manually.

Include Filtering Categories

You can store the main categories of images in their metadata. For instance, in real estate, you can distinguish photos that were taken inside or outside. Based on this, the searchers can filter the search results and improve the quality of the searches. This can also be easily done by an image recognition task.

Include Dominant Colours

Colour analysis is very important, especially when working for a fashion or home decor shop. We built a tool conveniently called Dominant Colors, with several extraction options. The system can extract the main colours of a product while ignoring its background. Searchers can use the colours for advanced filtering.

Use Object Detection & Segmentation

Object detection can help you focus the view of both the search engine and its user on the product by merely cutting the detected object from the image. You can also apply background removal to search & showcase the products the way you want. For training object detection and other custom image recognition models, you can use Annotate in our App.

Use Optical Character Recognition (OCR)

In some domains, you can have products with text. For instance, wine bottles or skincare products with the name of the item and other text labels that can be read by artificial intelligence, stored as metadata and used for keyword search on your site.

Our visual search engine allows us to combine several features for multimedia search with advanced filtering.

Improve Image Resolution

If the uploaded images from the mobile phones have low resolution, you can use the image upscaler to increase the resolution of the image, screenshot, or video. This way, you will get as much as possible even from user-generated content with potentially lower quality.

Combine Multiple Approaches

Combining multiple features like model embeddings, tags, dominant colours, and text increases your chances of building a solid visual search engine. Our system is able to use these different modalities and return the best items accordingly. For example, extracting dominant colours is really helpful in Fashion Search, our service combining object detection, fashion tagging, and visual search.

Search Engine and Vector Databases

Once you have trained your model (neural network), you can extract and store the embeddings for your multimedia items. There are a lot of image search engine implementations able to work with vectors (embedding representations) that you can use, for example, Annoy from Spotify or FAISS from Facebook developers.
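
To show how little code a basic setup needs, here is a minimal FAISS sketch (assuming the faiss package is installed; the vectors are random placeholders for your extracted embeddings):

import numpy as np
import faiss

d = 512                                                    # dimension of the embedding vectors
vectors = np.random.random((10000, d)).astype("float32")   # embeddings of your collection

index = faiss.IndexFlatL2(d)        # exact search with Euclidean distance
index.add(vectors)                  # build the collection

query = np.random.random((1, d)).astype("float32")         # embedding of the query image
distances, ids = index.search(query, 5)                    # 5 nearest neighbours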

These solutions are open-source (i.e. you don’t have to deal with usage rights) and you can use them for simple solutions. However, they also have a few disadvantages:

  • After the initial build of the search engine database, you cannot perform any update, insert or delete operations. Once you store the data, you can only perform search queries.
  • You are unable to use a combination of multiple features, such as tags, colours, or metadata.
  • There’s no support for advanced filtering for more precise results.
  • You need to have an IT background and coding skills to implement and use them. And in the end, the system must be deployed on some server, which brings additional challenges.
  • It is difficult to extend them for advanced use cases, you will need to learn a complex codebase of the project and adjust it accordingly.

Building a Visual Search Engine on a Machine Learning Platform

The creation of a great visual search engine is not an easy task. The mentioned challenges and disadvantages of building complex visual search engines with high performance are the reasons why a lot of companies hesitate to dedicate their time and funds to building them from scratch. That is where AI platforms like Ximilar come into play.

Custom Similarity Service

Ximilar provides a computer vision platform, where a fast similarity engine is available as a service. Anyone can connect via API and fill their custom collection with data and query at the same time. This streamlines the tedious workflow a lot, enabling people to have custom visual search engines fast and, more importantly, without coding. Our image search engines can handle other data types like videos, music, or 3D models. If you want more privacy for your data, the system can also be deployed on your hardware infrastructure.

In all industries, it is important to know what we need from our model and optimize it towards the defined goal. We developed our visual search services with this in mind. You can simply define your data and problem and what should be the primary goal for this similarity. This is done via similarity groups, where you put the items that should be matched together.

Examples of Visual Search Solutions for Business

One of the typical industries that use visual search extensively is fashion. Here, you can look at similarities in multiple ways. For instance, one can simply want to find footwear with a colour, pattern, texture, or shape similar to the product in a screenshot. We built several visual search engines for fashion e-shops and especially price comparators, which combined search by photo and recommendations of alternative similar products.

Based on a long experience with visual search solutions, we deployed several ready-to-use services for visual search: Visual Product Search, a complex visual search service for e-commerce including technologies such as search by photo, similar product recommendations, or image matching, and Fashion Search created specifically for the fashion segment.

Another nice use case is also the story of how we built a Pokémon Trading Card search engine. It is no surprise that computer vision has been recently widely applied in the world of collectibles. Trading card games, sports cards or stamps and visual AI are a perfect match. Based on our customers’ demand, we also created several AI solutions specifically for collectibles.

The Workflow of Building
a Visual Search Engine

If you are looking to build a custom search engine for your users, we can develop a solution for you, using our service Custom Image Similarity. This is the typical workflow of our team when working on a customized search service:

  1. Setup, Research & Plan – Initial calls, the definition of the project, NDA, and agreement on expected delivery time.

  2. Data – If you don’t provide any data, we will gather it for you. Gathering and curating datasets is the most important part of developing machine learning models. Having a well-balanced dataset without any bias to any class leads to great performance in production.

  3. First prototype – Our machine learning team will start working on the model and collection. You will be able to see the first results within a month. You can test it and evaluate it by yourself via our clickable front end.

  4. Development – Once you are satisfied with the results, we will gather more data and do more experiments with the models. This is an iterative way of improving the model.

  5. Evaluation & Deployment – If the system performs well and meets the criteria set up in the first calls (mostly some evaluation on the test dataset and speed performance), we work on the deployment. We will show you how to connect and work with the API for visual similarity (insert, delete, search endpoints).

If you are interested in knowing more about how the cooperation with Ximilar works in general, read our How it works and contact us anytime.

We are also able to do a lot of additional steps, such as:

  • Managing and gathering more training data continually after the deployment to gradually increase the performance of visual similarity (the usage rights for user-generated content are up to you; keep in mind that we don’t store any physical images).
  • Building a customized model or multiple models that can be integrated into the search engine.
  • Creating & maintaining your visual search collection, with automatic synchronization to always keep up to date with your current stock.
  • Scaling the service to hundreds of requests per second.

Visual Search is Not Only
For the Big Companies

I presented the basic techniques and architectures for training visual similarity models, but of course, there are many more advanced models, and research in this field is advancing in leaps and bounds.

Search engines are practically everywhere. It all started with AltaVista in 1995 and Google in 1998. Now it’s more common to get information directly from Siri or Alexa. Searching for things with visual information is just another step, and we are glad that we can give our clients tools to maximise their potential. Ximilar has a lot of technical experience with advanced search technology for multimedia data, and we work hard to make it accessible to everyone, including small and medium companies.

If you are considering implementing visual search into your system:

  1. Schedule a call with us and we will discuss your goals. We will set up a process for getting the training data that are necessary to train your machine learning model for search engines.

  2. In the following weeks, our machine learning team will train a custom model and a testable search collection for you.

  3. After meeting all the requirements from the POC, we will deploy the system to production, and you can connect to it via Rest API.

The post How to Build a Good Visual Search Engine? appeared first on Ximilar: Visual AI for Business.

]]>
How to Convert a Video Into a Streaming Format? https://www.ximilar.com/blog/how-to-convert-a-video-into-a-streaming-format/ Tue, 23 Aug 2022 12:03:00 +0000 https://www.ximilar.com/?p=10120 A comprehensive tutorial for converting a .mp4 .mkv or .mov videos to the streaming formats (HLS or DASH) with Python and FFmpeg.

The post How to Convert a Video Into a Streaming Format? appeared first on Ximilar: Visual AI for Business.

]]>
In the last few months, we have been actively developing a lot of new AI solutions for videos. Automated video processing is a growing field of AI with many interesting applications. However, it brought quite a few new challenges (huge amounts of data, processing time, precision, and so on) that didn’t need to be taken into consideration when building classic image-processing systems. One of them was converting a standard video into a streaming format. This article might prove useful to those who have encountered similar challenges.

The Automated Video Processing

According to this research by Deloitte, it is typical for younger generations to build a dynamic portfolio of media and entertainment options. Consumers across generations have been spending more time watching online TV (Nielsen) and browsing the internet using social media and video-on-demand services on a daily basis.

There is no doubt that automated video processing is going to become as normal as image processing by AI, revolutionizing not only platforms such as YouTube, TikTok, Instagram, or Twitch – but the way we work with and perceive video content.

One of the projects that we are co-developing, called Get Moments, required a lot of work with FFmpeg, Python, OpenCV and machine learning (mostly TensorFlow). One of the challenges we encountered was converting a standard movie format into a streaming format, so there are quite a few tips I can now share with you.

What is the Streaming Format, and What is it Good For?

The need for a streaming format came with the rise and popularity of YouTube. Different users around the world have different internet connection speeds, and they can watch different parts of a video with different quality. That is possible because the video is delivered to them in a streaming format, without the need to load it fully.

Converting a video into a streaming format means you create multiple copies of this video with different qualities, all of which are chunked into short segments.

Instead of downloading and playing a video file in classic MP4 container format with H.264 (video codec), only the parts of the video that are currently watched, are loaded and streamed in the quality corresponding with the user’s internet connection quality. That is possible because when converting a standard video file into a streaming format, you create multiple copies of this video with different qualities, all of which are chunked into short segments.

HLS or DASH Streaming Format – Comparison

The full power of streaming video formats comes with CDNs (content delivery networks) that are able to deliver content over the internet very fast. There are several video streaming formats, but the currently most used are HLS and DASH. Both protocols run over HTTP, use TCP, are supported via HTML5 video player, and both chunk videos into segments with intervals of 2–10 seconds.

HLS (HTTP Live Streaming) is a live-streaming protocol with adaptive bitrate. Because it was developed by Apple, there is support for all Apple devices. HLS is using H.264 for video compression, with AAC and MP3 for an audio stream.

DASH (MPEG-DASH) is more open and standardized. It is widely used, for example on YouTube in the HTML5 player. Unlike HLS, DASH video can be encoded with different codecs (it is codec-agnostic) for both video and audio streams.

I personally prefer the HTTP Live Streaming format for several reasons. The m3u8 index/header file looks much nicer, there is better support on Apple devices, and the conversion to HLS is much easier than to DASH. Nevertheless, not every video player supports the HLS or DASH format, so be careful about which one you use on your website or mobile app.

How much space do the HLS and DASH formats take up?

Let’s convert a sample video file with:

  • Length 60 seconds
  • Resolution 1080p
  • Size 25 MB
  • Encoded with H.264 codec
  • No audio track

I converted this video to both HLS and DASH formats in 360p, 720p and 1080p resolutions. You can select your own resolution via encoding with FFmpeg.

When I converted the video to DASH with only two resolutions (360p and 1080p), the size was 32 MB. And when I added the third resolution (720p), I got to a similar size as with HLS. In both cases, the total size of the three files with different qualities together was around 55 MB, so a bit over double the size of the original file. Of course, the size can also change depending on the used codecs.

What is the data structure of HLS and DASH?

The folder with HLS format contains video encoded to 360p, 720p and 1080p. You can see the .ts files representing the chunks of 10-second intervals. Because we have a 60-second video, it contains 6 chunks – 6 .ts files.

In the case of the DASH format, each video chunk is 5 seconds long, so the DASH folder contains 12 chunks with a .m4s suffix.

You can also see index.m3u8, which is our index file. It is linked to the video player on the website where we are streaming. It is a simple text file containing information on which resolution and bandwidth these videos have. The content looks like this:

#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=375000,RESOLUTION=640x360
360_video.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2000000,RESOLUTION=1280x720
720_video.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=3500000,RESOLUTION=1920x1080
1080_video.m3u8

The file 360_video.m3u8 defines the length of the chunk .ts files, and it looks like this:

#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:11
#EXT-X-MEDIA-SEQUENCE:0
#EXTINF:10.135122,
360_video0.ts
#EXTINF:10.760756,
360_video1.ts
#EXTINF:10.135122,
360_video2.ts
#EXTINF:9.634622,
360_video3.ts
#EXTINF:9.884878,
360_video4.ts
#EXTINF:9.468422,
360_video5.ts
#EXT-X-ENDLIST

The video converted to DASH format also has a manifest/index file with XML Structure.

How to convert .mp4, .mkv or .mov videos to HLS?

For converting the video to the HLS streaming format with three qualities (1080p, 720p and 360p), you can call FFmpeg directly:

mkdir hls
ffmpeg -i minute.mp4 -profile:v baseline -level 3.0 -s 640x360  -start_number 0 -hls_time 10 -hls_list_size 0 -f hls hls/360_video.m3u8
ffmpeg -i minute.mp4 -profile:v baseline -level 3.0 -s 1280x720  -start_number 0 -hls_time 10 -hls_list_size 0 -f hls hls/720_video.m3u8
ffmpeg -i minute.mp4 -profile:v baseline -level 3.0 -s 1920x1080  -start_number 0 -hls_time 10 -hls_list_size 0 -f hls hls/1080_video.m3u8

You can select the preferred resolutions via -s arg. For example, you can additionally create a video in 480p resolution if needed. With -hls_time, you can specify the length of chunks. After the conversion is done, we can manually or programmatically create an index.m3u8 file, which is used as a link in your web player.
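
The master playlist can be generated with a few lines of Python; here is a small sketch that writes the index.m3u8 shown earlier (the bandwidth values are illustrative and should match your encoding settings):

# Write the master playlist (index.m3u8) pointing to the per-resolution playlists.
variants = [
    (375000, "640x360", "360_video.m3u8"),
    (2000000, "1280x720", "720_video.m3u8"),
    (3500000, "1920x1080", "1080_video.m3u8"),
]

with open("hls/index.m3u8", "w") as f:
    f.write("#EXTM3U\n")
    for bandwidth, resolution, playlist in variants:
        f.write(f"#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},RESOLUTION={resolution}\n")
        f.write(f"{playlist}\n")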

You can also call the conversion of MP4 to HLS via Python with a subprocess module:

import subprocess

def call_ffmpeg(cmd):
    # Run the FFmpeg command in a shell and wait until the conversion finishes.
    with subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE) as process:
        process.communicate()
    return True
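
For example, the first HLS command shown above can be launched from Python like this (paths taken from the earlier example):

cmd = (
    "ffmpeg -i minute.mp4 -profile:v baseline -level 3.0 -s 640x360 "
    "-start_number 0 -hls_time 10 -hls_list_size 0 -f hls hls/360_video.m3u8"
)
call_ffmpeg(cmd)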

Read the FFmpeg documentation for more information about the optimal settings for your HLS video. There are a lot of parameters to tune; for example, you can set the max rate, bufsize, average bit rate, and much more.

How to convert videos to DASH?

Similar to HLS, we can convert the video to DASH in two resolutions (360p and 1080p) with this command:

ffmpeg -re -i minute.mp4 -map 0 -map 0 -c:a libfdk_aac -c:v libx264 \
-b:v:0 300k -b:v:1 3000k -s:v:0 640x360 -s:v:1 1920x1080 -profile:v:1 baseline \
-profile:v:0 main -bf 1 -keyint_min 120 -g 120 -sc_threshold 0 \
-b_strategy 0 -ar:a:1 22050 -use_timeline 1 -use_template 1 \
-adaptation_sets "id=0,streams=v id=1,streams=a" \
-f dash dash/out.mpd

The video conversion uses the H.264 codec via the -c:v libx264 argument. The resolution is set via -s:v argument. Whenever you are playing a DASH video, your entry point is this .mpd file, which will be generated during the conversion.

How can I stream videos on my website?

You can for example upload your converted videos to any storage like Amazon S3, Wasabi or DigitalOcean, and put the CDN (Cloudflare, CDN77 or Bunny) in front of your storage. For a web player, you could use for example the Bradmax player.

Did you find this guide useful? Check our other guides on custom visual search, image recognition and object detection systems, or various applications of visual AI.

Is there an API for conversion to streaming formats?

We’ve been working with video conversion and processing by artificial intelligence for a while and gained a lot of experience with FFmpeg on videos. If you would like to try our API for video conversion into streaming formats, cutting videos, concatenating, and video trimming, contact us at tech@ximilar.com. In case you have any other questions or ideas, contact us through the contact form. We’re here to help!

The post How to Convert a Video Into a Streaming Format? appeared first on Ximilar: Visual AI for Business.

]]>
Ximilar Introduces a Brand New App https://www.ximilar.com/blog/ximilar-introduces-new-app/ Mon, 06 Dec 2021 11:06:53 +0000 https://www.ximilar.com/?p=6077 Ximilar introduces a new user interface for training custom image recognition, object detection and similarity search.

The post Ximilar Introduces a Brand New App appeared first on Ximilar: Visual AI for Business.

]]>
An update is never late, nor is it early. It arrives precisely when we mean it to. After tuning up the back end for four years, the time has come to level up the front end of our App as well. We tested multiple ways, got valuable feedback from our users, and now we’re happy to introduce a new interface. It is more user-friendly, there are richer options, and the orientation in the growing number of our services is easier.

All Important Things at Hand

Ximilar provides a platform for visual AI, where anyone can create, train and deploy custom-made visual AI solutions based on the techniques of machine learning and computer vision. The platform is accessible via API and a web-based App, where users from all around the world work with both ready-to-use and custom solutions. They implement them into their own apps, quality control or monitoring systems in factories, healthcare tools and so on.

We created the new interface to adapt to the ever-increasing number of services we provide. It now makes better use of both the dashboard and sidebar, showcases useful articles and guides, and provides more support. So, let’s take a look at the major new features!

Service Categories & News

We grouped our services based on how they work with data and the degree of possible customization. After you log into the application, you will see the cards of four service groups with short descriptions on the dashboard. Below them, you can see the newest articles from our Blog, where we publish a lot of useful tips on how to create and implement custom visual AI solutions.

The service groups are following:

  1. Ready-to-use Image Recognition includes all the services that you can use straight away, without the need for additional training, custom tags and labels. In principle, these services analyze your data (i.e., your image collection) and provide you with information based on image recognition, object detection, analysis of colors & styles etc. Here you will find Fashion Tagging, Home Decor Tagging, Photo Tagging and Dominant Colors.
  2. Custom Image Recognition allows you to train custom Categorization & Tagging and Object Detection models. Flows, which enable you to combine the models, are also in this category. To prepare the training data for object detection seamlessly and quickly, you can use our own tool Annotate.
  3. Visual Search encompasses all services able to identify, analyze and compare visually similar content. Image Similarity can find, compare and recommend visually similar images or products. You can also use Image Matching to identify duplicates or near-duplicates in your collection, or create a fully custom visual search. Fashion Search is a complex service based on visual search and fashion tagging for apparel image collections.
  4. Image Tools are online tools based on computer vision and machine learning that will, when provided with an image, modify it. You can then either use the result or implement these image tools in your Flows. Here you will find Remove Background and Image Upscaler.

Do you want to learn more about AI and machine learning? Check the list of The Best Resources on Artificial Intelligence and Machine Learning.

Discover Services

Within the service groups, you can now browse all our services, including the ones that are not in your pricing scheme. Every service dashboard features a service overview and links to documentation, useful guides, case studies & video tutorials.

Do you want to know what you pay for when using our App? Check our article on API credit packs or the documentation.

Guides & Help at Hand

The sidebar underwent some major changes. It now displays all service groups and services. At the bottom, you will find the Guides & Help section with all necessary links to the beginner App Overview tutorial, Guides, Documentation & Contacts in case you need help.

How to make the most of a computer vision solution? Our guides are packed with useful tips & tricks, as well as first-hand experience of our machine learning specialists.

Customize the Sidebar With Favorites

Since each use case is highly specific, our users usually use a small group of services or only one service at a time. That is why you can now pin your most-used services as Favorites.

When you first log into the new front end, all of your previously used services will be marked as Favorites. You can then choose which of them will stay on top.

What’s next?

This front-end update is just a first step out of many we’ve been working on. We focus on adding some major features to the platform, such as explainability, as well as custom image regression models. The Ximilar platform provides one of the most advanced Visual AI tools with API on the market, and you can test them for free. Nevertheless, the key to the improvement of our services and App are your opinions and user experience. Let us know what you think!

The post Ximilar Introduces a Brand New App appeared first on Ximilar: Visual AI for Business.

]]>
How to Build Your Own Image Recognition API? https://www.ximilar.com/blog/how-to-build-your-own-image-recognition-api/ Fri, 16 Jul 2021 10:38:27 +0000 https://www.ximilar.com/?p=5078 Tips and tricks for developing and improving your custom image recognition models and deploying them as API with the Ximilar platform.

The post How to Build Your Own Image Recognition API? appeared first on Ximilar: Visual AI for Business.

]]>
Image recognition systems are still young, but they are becoming more accessible every day. Custom image recognition APIs are typically used for better filtering and recommendation of products in e-shops, sorting stock photos, and classifying errors or pathological findings. Ximilar, like Apple's Vision SDK or Google's TensorFlow, makes training custom recognition models easy and affordable. However, not many people and companies have been using this technology to its full potential so far.

For example, I recently had a conversation with a client who said that Google Vision didn't work for him and returned non-relevant tags. The problem was not the API but the approach to it. He had employed a few students to do the labelling job and create an image classifier, but the results were not good at all. After I showed him our approach and shared a few tips and simple rules, he got better classification results almost immediately. This post should serve as a comprehensive guide for those who build their own image classifiers and want to get the most out of them.


How to Begin

Image recognition is based on machine learning and computer vision techniques. It can categorize images and tag them with labels describing the attributes recognized in them. You can read everything about the service and its possibilities here.

To train your own Image Recognition models and create a system accessible through API, you will first need to upload a set of training images and create your image recognition tasks (models). Then you will use the training set to train the models to categorize the images.

If you need your images to be tagged, you should upload or create a set of tags and train tagging tasks. As the last step, you can combine these tasks into a Flow, and modify or replace any of them anytime due to its modular structure. You can then gradually improve your accuracy based on testing, evaluation metrics and feedback from your customers. Let’s have a look at the basic rules you should follow to reach the best results.
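Once a task is trained, you call it through a simple REST request. The snippet below is a minimal sketch of such a call in Python; the exact endpoint, authentication header, and payload fields shown here are illustrative assumptions, so always follow the current API documentation and use the task ID and API token from your own account.

```python
import requests

# Hypothetical values – replace with the task ID and API token from your own Ximilar account.
API_TOKEN = "YOUR_API_TOKEN"
TASK_ID = "YOUR_TASK_ID"

# Endpoint and payload shape are illustrative; consult the official API documentation.
response = requests.post(
    "https://api.ximilar.com/recognition/v2/classify",
    headers={"Authorization": f"Token {API_TOKEN}"},
    json={
        "task_id": TASK_ID,
        "records": [{"_url": "https://example.com/images/shoe.jpg"}],
    },
    timeout=30,
)
response.raise_for_status()

# Each record comes back with the predicted label and its probability.
for record in response.json().get("records", []):
    best = record.get("best_label", {})
    print(best.get("name"), best.get("prob"))
```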

The Basic Rules for Image Recognition Models Training

Each image recognition task contains at least two labels (classes, categories) – e.g., cats and dogs. A classic image recognition model (task) assigns one label to each image – so the image is either a cat or dog. In general, the more classes you have, the more data you will need to teach the neural network to predict labels.

Binary image classification for cats and dogs.
Binary classification for cats and dogs. Source: Kelly Lacy (Pexels), Pixabay

The training images should represent the real data that will be analyzed in a production setting. For example, if you aim to build a medical diagnostic tool helping radiologists identify the slightest changes in the lung tissue, you need to assemble a database of x-ray images with proven pathological findings. For the first training of your task, we recommend sticking to these simple rules:

  • Start with binary classification (two labels) – use 50–100 images/label
  • Use about 20 labels for a basic solution and up to 100 labels for more complex ones
  • For well-defined labels, use 200+ images/label
  • For hard-to-recognize labels, add 100+ images/label
  • Pattern recognition – for structures, x-ray images, etc., use 50–100 images/label

Always keep in mind that training one task with hundreds of labels on a small dataset almost never works. You need at least 20 images per label to start, and 100+ images per label to achieve solid results. Start with the recommended counts, and then add more if needed – a quick check of your dataset, like the one sketched below, can save you a wasted training run.
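The check itself is trivial to automate. This is a minimal sketch assuming your images live locally in one folder per label (a hypothetical dataset/ layout); the thresholds simply mirror the rules of thumb above.

```python
from pathlib import Path

# Assumes a local layout with one sub-folder per label, e.g. dataset/cats, dataset/dogs.
DATASET_DIR = Path("dataset")
RECOMMENDED_MIN = 100   # solid starting point per label
HARD_MIN = 20           # absolute minimum per label

for label_dir in sorted(p for p in DATASET_DIR.iterdir() if p.is_dir()):
    count = sum(1 for f in label_dir.iterdir() if f.suffix.lower() in {".jpg", ".jpeg", ".png"})
    if count < HARD_MIN:
        status = "too few – collect more images"
    elif count < RECOMMENDED_MIN:
        status = "usable, but add more for production"
    else:
        status = "ok"
    print(f"{label_dir.name:20s} {count:5d} images -> {status}")
```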

Create Category, Tag or Task via Ximilar App platform.
You can create your image recognition model via app.ximilar.com without coding.

The Difference Between Testing & Production

The users of Ximilar App can train tasks with a minimum of 20 images per label. Our platform automatically divides your input data into two datasets – training & test set, usually in a ratio of 80:20. The training set is used to optimize the parameters of the classifier. During the training, the training images are augmented in several ways to extend the set.

The test data (about 20 %) are then used to validate the model and measure accuracy by simulating how it will perform in production. You can see the accuracy results on the Task dashboard in Ximilar App. You can also create an independent test dataset and evaluate the model on it. This is a great way to get an accurate estimate of performance on data the model has never seen, before you actually deploy it.

Remember, the lower limit of 20 images per label usually leads to weak results and low accuracy. While it might be enough for your testing, it won't be enough for production: with so few images, the model tends to memorize the training set instead of learning general patterns, a problem known as overfitting. Most of the time, the accuracy reported in Ximilar is pretty high, easily over 80 % even for small datasets. However, it is common in machine learning to use more images for more stable and reliable results in production. Some tasks need hundreds or thousands of images per label for good performance of your production model. Read more about the advanced options for training.
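The platform performs the training/test split for you, but if you want a feel for what an 80:20 split looks like, here is a minimal sketch using scikit-learn on hypothetical file names. The stratify argument keeps the label proportions the same in both sets, which is what you want for a fair accuracy estimate.

```python
from sklearn.model_selection import train_test_split

# Hypothetical lists of image paths and their labels.
image_paths = [f"img_{i:03d}.jpg" for i in range(100)]
labels = ["cat" if i % 2 == 0 else "dog" for i in range(100)]

# An 80:20 split, stratified so each label keeps roughly the same proportion in both sets.
train_paths, test_paths, train_labels, test_labels = train_test_split(
    image_paths, labels, test_size=0.2, stratify=labels, random_state=42
)

print(len(train_paths), "training images,", len(test_paths), "test images")
```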

The Best Practices in Image Recognition Training

Start With Fewer Categories

I usually advise first-time users to start with up to 10 categories. For example, when building an app for people to recognize shoes, you would start with 10 shoe types (running, trekking, sneakers, indoor sport, boots, mules, loafers …). It is easier to train a model with 10 labels, each with 100 training images of a shoe type, than with 30 types. You can let users upload new shoe images. This way, you can get an amazing training dataset of real images in one month and then gradually update your model.

Use Multiple Recognition Tasks With Fewer Categories

The simpler classifiers can be incredibly helpful. We could, of course, end up with more than 30 types of shoes in one model, but as we said, such a model is harder to train. Instead, we can build a better-performing system by first training one model that classifies footwear into main types – Sport, Casual, Elegant, etc. – and then training a separate classifier for each main type. For Sport, for example, there will be a model that classifies sports shoes into Running shoes, Sneakers, Indoor shoes, Trekking shoes, Soccer shoes, etc.

Use Binary Classifiers for Important Classes

Imagine you are building a tagging model for real estate websites, and you have a small training dataset. You can first separate your images by estate type. For example, start with a binary classifier that splits images into the groups "Apartment" and "Outdoor house". Then you can train more models specifically for room types (kitchen, bedroom, living room, …), apartment features, room quality, etc. These models will be used only if the image is labelled as "Apartment".

Ximilar Flows allow you to connect multiple custom image recognition models to work as one.
Ximilar Flows allow you to connect multiple custom image recognition models to API.

You can connect all these tasks via the Flows system with a few clicks. This way, you can chain multiple image recognition models behind one API endpoint and build a powerful visual AI – the routing logic is sketched below. Typical use cases for Flows are in the e-commerce and healthcare fields. Systems for fashion product tagging can contain thousands of labels, and it is hard to train a single model with that many labels and still get good accuracy. But if you divide your data into multiple models, you will achieve better results in a shorter time! For the labelling work, you can use our image Annotation system if needed.
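On the platform, this chaining is configured in the Flow editor rather than in code. Purely to illustrate the routing logic, here is a conceptual sketch in Python; classify_with_task is a hypothetical helper standing in for a call to one trained recognition task, not an actual Ximilar function.

```python
# Conceptual sketch of Flow-like routing for the real estate example.

def classify_with_task(task_name: str, image_url: str) -> str:
    """Hypothetical placeholder for calling one trained recognition task."""
    raise NotImplementedError

def tag_real_estate_image(image_url: str) -> dict:
    tags = {}

    # Step 1: coarse binary decision.
    estate_type = classify_with_task("apartment-vs-outdoor", image_url)
    tags["estate_type"] = estate_type

    # Step 2: run the specialized models only when they apply.
    if estate_type == "Apartment":
        tags["room_type"] = classify_with_task("room-type", image_url)
        tags["room_quality"] = classify_with_task("room-quality", image_url)

    return tags
```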

Choose Your Training Images Wisely

Machine learning models perform better when the distribution of training images matches the distribution of images they will analyze. In other words, your training pictures should be visually very similar to the pictures your model will see in a production setting. If your model will be used in a CCTV setting, your training data must come from CCTV cameras. Otherwise, you are likely to build a model that performs great on the training data but fails completely in production.

The same applies to real estate and other fields. If the system will analyze real estate images that are not taken exclusively by professional photographers, you need to include smartphone photos, pictures with bad lighting, blurry images, etc.

Typical home decor and real estate images used for image recognition. Your model should be able to recognize both professional and non-professional images. Source: Pexels.

Improving the Accuracy of the System

When you click the training button on the task page, a new model is created and placed in the training queue. If you upload more data or change the labels, you can train a new model. You can keep multiple versions and deploy to the API only the specific version that works best for you. Further down the task page, you will find a table with all your trained models (only the last 5 are stored). For each trained model, we store several metrics that are useful when deciding which model to pick for production.

Multiple model versions of your image recognition task in the Ximilar Platform. Click on Activate and this version will be deployed to the API.
Multiple model versions of your task in the Ximilar Platform. Click on Activate and this version will be deployed to the API.

Inspect the Results and Errors

Click on the zoom icon in the list of trained models to inspect the results. You can see the basic metrics: Accuracy, Recall, and Precision. Precision tells you how often the model is right when it predicts a specific label. Recall tells you how many of the images that actually belong to a label the model manages to find. If we have high recall but lower precision for the label "Apartment" from our real estate example, the model is probably predicting "Apartment" on almost every image (even on images that should be "Outdoor house"). The solution is often simple – just add more pictures that represent "Outdoor house".
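To make these two metrics concrete, here is a small sketch using scikit-learn on a hypothetical evaluation set for the real estate example. It reproduces the situation described above: every actual apartment is found (perfect recall), but some outdoor houses are mislabelled as apartments, which drags precision down.

```python
from sklearn.metrics import precision_score, recall_score

# Tiny hypothetical evaluation set for the "Apartment" vs "Outdoor house" example.
y_true = ["Apartment", "Apartment", "Outdoor house", "Outdoor house", "Outdoor house", "Apartment"]
y_pred = ["Apartment", "Apartment", "Apartment",     "Outdoor house", "Apartment",     "Apartment"]

# Precision: of all images predicted "Apartment", how many really are apartments?
precision = precision_score(y_true, y_pred, pos_label="Apartment")
# Recall: of all images that really are apartments, how many did the model find?
recall = recall_score(y_true, y_pred, pos_label="Apartment")

print(f"Precision: {precision:.2f}")  # 3 correct out of 5 predicted -> 0.60
print(f"Recall:    {recall:.2f}")     # 3 found out of 3 actual      -> 1.00
```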

The Confusion matrix shows you which labels are easily confused by the trained model. These labels probably contain similar images, and it is therefore hard for the model to distinguish between them. Another useful component is Failed Images (misclassified images), which shows you the model's mistakes on your data. With Failed Images, you can also spot labelling mistakes in your data and fix them immediately. All of these features will help you build a more reliable model with good performance.

Inspecting results of your trained image recognition models can show you potential problems in your data.
Inspecting the results of your trained models can show you potential problems in your data.

Reliability of the Image Recognition Results

Every client is looking for reliability and robustness. Stay simple if you aim to reach high accuracy. Build models with just a few labels if you can. For more complex tagging systems use Flows. Building an image classifier with a limited number of training images needs an iterative approach. Here are a few tips on how to achieve high accuracy and reliable results:

  • Break your large task into simple decisions (yes or no) or basic categories (red, blue and green)
  • Make fewer categories & connect them logically
  • Use general models for general categories
  • Make sure your training data represent the real data your model will analyze in production
  • Each label should have a similar number of images, so the data stay balanced
  • Merge very close classes (visually similar), then create another task only for them, and connect it via Flows
  • Use both human and UI feedback to improve the quality of your dataset – inspect evaluation metrics like Accuracy, Precision, Recall, Confusion Matrix, and Failed Images
  • Always collect new images to extend your dataset

Summary for Training Image Recognition Models

Building an image classifier requires a proper task definition and continuous improvement of your training dataset. If collecting a large dataset is challenging, start simple and gradually iterate towards your goal. To make the basic setup easier, we created a few step-by-step video tutorials. Learn how to deploy your models for offline use here, check the other guides, or our API documentation. You can also see for yourself how our pre-trained models perform in the public demo.

We believe that with the Ximilar platform, you are able to create highly complex, customizable, and scalable solutions tailored to the needs of your business – check the use cases for quality control, visual search engines or fashion. The basic features of our app are free, so anyone can try them. Training image recognition models is also free on the Ximilar platform; you only pay for calling the model for predictions. We are always happy to discuss your custom projects and their challenges in person or on a call. If you have any questions, feel free to contact us.

The post How to Build Your Own Image Recognition API? appeared first on Ximilar: Visual AI for Business.

]]>
Train Your Own Machine Learning Models With Video Tutorials https://www.ximilar.com/blog/video-tutorials-for-ximilar-platform/ Tue, 17 Nov 2020 15:18:22 +0000 https://www.ximilar.com/?p=2093 Video tutorials for building custom image recognition models, using fashion tagging and search services with Ximilar App platform.

The post Train Your Own Machine Learning Models With Video Tutorials appeared first on Ximilar: Visual AI for Business.

]]>
The Ximilar platform was built for both experts and newcomers to machine learning. Every month, our dev & AI team introduces new features and innovations. As the services become more complex, maintaining a great user experience is a challenging task, but one we love. For example, Ximilar Flows lets you build a very complex visual perception system from smaller building blocks, but it can be tricky for newcomers.

We have spent part of this summer working on video tutorials that explain individual services and features in a more approachable way. So whether you want to build an object detection system or search your e-commerce photo collection by visual similarity, simply watch our video tutorials. They can help you build faster and more accurate AI models and show you some hidden features you might not have noticed.

The videos are also accessible from the Ximilar App through the video icons on the overview pages of individual services. Right now, instructional videos are available for:

  • Image Recognition – two parts covering the initial setup of the recognition task, training the model, inspecting the results, and advanced image filtering
  • Flows – build a complex computer vision system with a combination of multiple image recognition models
  • Object Detection – creating an object detection model, using bounding boxes, inspecting the results
  • Similarity Search – create your first collection, fill it with data and test the results of the image recommendation

This work was supported by JIC (South Moravia Innovation Center).

The post Train Your Own Machine Learning Models With Video Tutorials appeared first on Ximilar: Visual AI for Business.

]]>