Image recognition systems can bring significant improvements to monitoring manufacturing processes and identifying product quality issues.
Images captured on the production line are usually highly standardized, because they are taken under constant conditions. It is also very common that many images annotated by human experts are available.
When enough annotated and highly standardized images are available, the image recognition system is much easier to set up. Under these circumstances a camera module can be considered “just” another sensor, with a very wide (information-intensive) data stream. This data stream is in itself highly complex and has to be condensed by an ML or AI model before it can be used to trigger any actions.
Image Recognition with Traditional Machine Learning
When the images are highly standardized, it can be sufficient to use a traditional ML model to condense the data stream down to easier-to-interpret signals (a minimal sketch follows the list below).
Standardization can mean the following:
- The number of observed items is kept constant (for example at one).
- The position and orientation of the items are fixed.
- The camera type and configuration are unified across all cameras.
- The lighting is kept constant by using artificial lights.
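Under these conditions, even a compact traditional model working directly on the pixel values can be enough. Here is a minimal sketch; the random stand-in data, the 64x64 image size, and the choice of scikit-learn with logistic regression are our illustrative assumptions, not requirements:

```python
# Minimal sketch: a traditional ML classifier on highly standardized images.
# Assumes fixed-size grayscale images and per-image OK/defect labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X: one flattened pixel vector per image, y: 0 = OK, 1 = defect.
# Standardization (fixed position, lighting, camera) is what makes it
# reasonable to treat each pixel as a stable input signal.
rng = np.random.default_rng(0)
X = rng.random((200, 64 * 64))   # stand-in for 200 standardized 64x64 images
y = rng.integers(0, 2, 200)      # stand-in labels from human annotation

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```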
Even under such standardized conditions, the relationship between the individual pixels and the outcome (the signal to act upon) remains highly complex.
The model (whether ML or AI) uses the pixels, interprets them as signals, and learns (during model training) the relation of these signals to a target outcome (which is known for the training examples). A “traditional” artificial neural network, for example, takes (slightly simplified) all signals at its inputs, aggregates them, and transforms them in a non-linear fashion to generate an output signal corresponding to the observed targets.
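As a minimal sketch of this simplified description (the weights are random stand-ins and the layer sizes arbitrary; a trained network would have fitted them to the targets):

```python
# A one-hidden-layer network: aggregate all input signals, transform them
# non-linearly, and produce one output signal.
import numpy as np

rng = np.random.default_rng(0)
pixels = rng.random(64 * 64)          # all pixel signals of one image
W1, b1 = rng.normal(size=(32, 64 * 64)), np.zeros(32)
W2, b2 = rng.normal(size=(1, 32)), np.zeros(1)

hidden = np.tanh(W1 @ pixels + b1)    # aggregate and transform non-linearly
output = W2 @ hidden + b2             # output signal matched to the target
print(output)
```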
In simple scenarios, this relation can be rather easy to understand: when baking a cake, for example, its readiness can be estimated from the sum of all brown color hues. Not a simple conclusion, but not a highly complex one either.
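The cake example could look roughly like this; the RGB thresholds and the readiness cutoff are illustrative assumptions, not calibrated values:

```python
# Estimate "readiness" from the share of brownish pixels in the image.
import numpy as np

def brown_fraction(image_rgb: np.ndarray) -> float:
    """Fraction of pixels that look brown (darkish, with red > green > blue)."""
    r = image_rgb[..., 0].astype(float)
    g = image_rgb[..., 1].astype(float)
    b = image_rgb[..., 2].astype(float)
    brown = (r > g) & (g > b) & (r > 60) & (r < 180)
    return float(brown.mean())

# Stand-in image: a 100x100 RGB array; in practice this comes from the camera.
image = np.full((100, 100, 3), (150, 100, 50), dtype=np.uint8)  # brownish
print("cake is ready" if brown_fraction(image) > 0.5 else "keep baking")
```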
Complex scenarios
Traditional artificial neural networks and other ML model types can only get you so far. In more complex scenarios they become unreliable or require enormous amounts of training data.
What is meant by “complex scenarios”? There are of course many possible variations, but let’s look at an illustrative example:
Simple: an item in a fixed position and with precise orientation
Complex: an item lies just where it fell (for example on a conveyor belt)
The first case can still be handled by a traditional ML model; the second only if we expand the training data and train with drastically more examples, essentially multiplying the amount by the number of possible positions and rotation angles. That can quickly become infeasible. (Modifying the images to “straighten out” the item beforehand is possible, but in itself highly complex as well.)
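To make that multiplication concrete, here is a minimal sketch; the rotation and shift step sizes are illustrative assumptions:

```python
# Covering each training image in 36 rotations and 25 shifts multiplies
# the dataset by 900, and finer steps grow the factor further.
import numpy as np
from scipy.ndimage import rotate, shift

def augmented_variants(image: np.ndarray):
    """Yield rotated and shifted copies of one standardized image."""
    for angle in range(0, 360, 10):            # 36 rotation steps
        rotated = rotate(image, angle, reshape=False, mode="nearest")
        for dy in range(-20, 21, 10):          # 5 vertical offsets
            for dx in range(-20, 21, 10):      # 5 horizontal offsets
                yield shift(rotated, (dy, dx), mode="nearest")

image = np.zeros((64, 64))        # stand-in for one grayscale training image
print(sum(1 for _ in augmented_variants(image)))  # 36 * 5 * 5 = 900 copies
```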
In this second case, AI models are the way to go. They come much closer to the human ability to recognize and process images. For humans, interpreting images with varying orientation is usually trivial: a cat from the left or a cat from the right is (superstition aside) equally recognized as a cat.
Modern AI-based approaches
Such AI models for image recognition process the image data through a number of different, specifically designed layers, the most characteristic being convolution layers; hence these networks are called convolutional neural networks (CNNs). The individual layers are not mostly similar in structure (as in traditional ML models), but are instead built for specific tasks (such as edge detection, or aggregation of the detected features). As the network consists of very many layers, it must be trained with specialized (Deep Learning) training algorithms.
This design makes it possible to process images in a way that comes much closer to actual understanding, similar to how the human brain processes visual input.
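As a minimal sketch of such an architecture (PyTorch is our choice of framework here, and the layer sizes and the two output classes are illustrative assumptions):

```python
# A small CNN: convolution layers learn local patterns such as edges,
# pooling layers aggregate them, and a final dense layer maps the
# aggregated features to the output signal.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # learns edge-like filters
    nn.ReLU(),
    nn.MaxPool2d(2),                             # aggregates detected features
    nn.Conv2d(8, 16, kernel_size=3, padding=1),  # combines edges into shapes
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 2),                  # e.g. two classes: OK / defect
)

x = torch.randn(1, 1, 64, 64)   # stand-in for one grayscale camera image
print(model(x).shape)           # torch.Size([1, 2])
```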
Off-the-Shelf models
Even with modern, optimized algorithms, training such models (deep, complex neural networks) is highly resource-intensive (time, processing power, energy) and requires large numbers of training examples.
Fortunately, it is possible to take a “shortcut”. In the community you can find pre-trained, generalized models that can be adapted to your own use cases. This is a very smart and efficient approach, and interestingly very close to principles of learning known for millennia: learning something by connecting it to existing knowledge is much easier than starting from scratch. As newborns we take years to learn to see and understand our surroundings, but as adults a new “impression” may take us just a moment to process and internalize.
This is how we can use off-the-shelf models: we get a generalized model built to process visual impressions (based on previous training with many examples), with all the general concepts of image processing already in place. We just need to add our domain- or problem-specific images.
This usually means cutting off the outer layers of the network, replacing them with layers specific to our scenario, and then training only these new layers while leaving the inner, pre-trained part of the network intact. This approach makes it possible to reach good results with only a limited number of our own examples and modest processing power.
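A minimal sketch of this approach, assuming a pre-trained ResNet-18 from torchvision as the generalized model (our example choice; any comparable pre-trained network works the same way, and the two output classes are again an assumption):

```python
# Transfer learning: keep the pre-trained inner layers, swap the outer one.
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the inner, pre-trained part: its general image-processing
# "knowledge" stays intact and is not changed during our training.
for param in model.parameters():
    param.requires_grad = False

# Replace the outermost layer with one specific to our scenario (here:
# two classes, e.g. OK / defect); only this new layer will be trained.
model.fc = nn.Linear(model.fc.in_features, 2)
```

Training then only updates the parameters of the new layer, which typically converges with far fewer examples and much less compute than training the whole network from scratch.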
Did we get your attention? Are there image recognition scenarios your organization could benefit from? Please get in touch!