Article

One-Shot Learning from Prototype Stock Keeping Unit Images

by Aleksandra Kowalczyk 1,† and Grzegorz Sarwas 1,2,*,†
1 Faculty of Electrical Engineering, Warsaw University of Technology, Pl. Politechniki 1, 00-661 Warsaw, Poland
2 Omniaz Sp. z o.o., ul. Narutowicza 40/1, 90-135 Łódź, Poland
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Information 2024, 15(9), 526; https://doi.org/10.3390/info15090526
Submission received: 10 July 2024 / Revised: 20 August 2024 / Accepted: 21 August 2024 / Published: 28 August 2024
(This article belongs to the Special Issue Information Processing in Multimedia Applications)

Abstract:
This paper highlights the importance of one-shot learning from prototype Stock Keeping Unit (SKU) images for efficient product recognition in retail and inventory management. Traditional methods require large supervised datasets to train deep neural networks, which can be costly and impractical. One-shot learning techniques mitigate this issue by enabling classification from a single prototype image per product class, thus reducing data annotation efforts. We introduce the Variational Prototyping Encoder (VPE), a novel deep neural network for one-shot classification. Utilizing a support set of prototype SKU images, VPE learns to classify query images by capturing image similarity and prototypical concepts. Unlike metric learning-based approaches, VPE pre-learns image translation from real-world object images to prototype images as a meta-task, facilitating efficient one-shot classification with minimal supervision. Our research demonstrates that VPE effectively reduces the need for extensive datasets by utilizing a single image per class while accurately classifying query images into their respective categories, thus providing a practical solution for product classification tasks.

1. Introduction

The analysis of products on store shelves has been a research focus for decades [1,2,3,4,5,6,7,8].
Recognizing individual products at the SKU level can aid in analyzing product shelf share and planogram compliance while considering limited promotions or seasonal facings. It can also be used to personalize promotions or recommendations for individual customers. Additionally, such systems, which recognize products on store shelves based on the appearance of their packaging, can be designed to assist individuals with disabilities. For instance, they can offer specific conveniences for visually impaired people.
The research presented in this article focuses on recognizing product SKUs based on their packaging facing, not barcodes. An SKU is a unique code used internally by retailers and e-commerce sellers to identify each product and its variations, often including details such as style, size, color, collection, or packaging-facing version. Barcodes, in contrast, carry no detailed product information and are used externally across the retail supply chain to identify the manufacturer and product number. Changing a product’s packaging alters the SKU but not the barcode, which is crucial for tracking the effectiveness of promotional campaigns, e.g., a line of drinks released for a sports event with athletes’ images on the cans. One barcode can correspond to multiple SKUs, and one SKU can apply to products with different barcodes if the packaging is the same but the manufacturing location differs.
A significant challenge in this area is the diversity of products, the vast number of classes, frequent packaging changes, and seasonal rotations, all of which demand flexible and scalable solutions. Typically, automated SKU recognition involves two stages: detecting the products on the shelf and then recognizing them. While state-of-the-art detectors detect products effectively [9], the recognition problem remains challenging [10]. Using a neural network for classification requires preparing the network to recognize every possible class. Each class needs some representation in the form of images, and the required quantity depends on many factors. When the number of classes is large, however, the size of the dataset grows substantially, often leading to huge datasets. This problem is particularly acute in the food industry due to the continuous rotation of products. Consequently, classification typically relies on a two-step approach. The first step, based on a convolutional neural network, generates a multidimensional feature vector that should be unique for each class. In the second step, the generated vector is compared with reference feature vectors. However, creating a reference set that accounts for different views of each product, varying lighting conditions, noise, or color temperature is nearly impossible. Thus, SKU recognition should be approached as a one-shot or few-shot learning problem, where the goal is to infer the class of a detected product from one or a few prototype images. Given that every product launched on the market initially has a digital design of its label/facing or an e-commerce model used in online store listings, a recognition process based on such a single prototype would be groundbreaking.
To address these challenges, this paper explores the potential of one-shot learning, which relies on a single prototype image per class and is particularly suited to environments where data scarcity is the norm. We introduce and evaluate the Variational Prototyping Encoder (VPE) architecture for classifying store-shelf products. The VPE architecture effectively handles domain discrepancies and data imbalances by utilizing pairs of prototype and real images [11]. This approach facilitates learning a latent feature space in which the Variational Autoencoder (VAE) ensures that features of actual products are tightly clustered around the prototype features.
The main contributions of this paper can be summarized as follows:
  • A novel modification of the VPE algorithm was made, involving the incorporation of prototypes as a signal at the encoder input;
  • The modified VPE was adapted for product recognition on retail shelves;
  • The impact of data diversity and quality was analyzed, focusing on key aspects such as augmentation techniques, background uniformity, and optimal prototype selection;
  • A comprehensive optimization of parameters and techniques for the VPE was conducted. This included methods for stopping network training, distance metrics in the latent space, network architecture, and various implemented loss functions.

2. Related Works

One-shot learning stands out as a pivotal technique where a model is designed to acquire knowledge from a single example, contrasting sharply with traditional deep learning approaches that rely on extensive datasets. Pioneering efforts in this field, such as the work of Li et al., utilize a Bayesian strategy to harness latent and generic prior information, demonstrating that such learned priors can adapt effectively to various small-data problems, thereby alleviating issues of data imbalance and showing promising generalizability [12]. Furthering these concepts, Lake et al. explored the generative processes using hierarchical Bayesian models, which proved capable of extending to new tasks with minimal data input [13].
Recent strategies in one-shot learning have focused on embedding learning and meta-learning. Works by researchers [14,15] have advanced the field of metric learning by transforming task-related information into a metric space where classification occurs through the comparison of similarity scores. In contrast, approaches by [16,17] aim to imbue models with the ability to adapt to new tasks, aligning with meta-learning methodologies.
Chen et al. [18] have extended prototype learning to one-shot image segmentation by incorporating multi-class label information during episodic training to generate more nuanced feature representations for each category. Prototypical Networks [19] introduced an approach where classification in few-shot scenarios is facilitated by computing distances to class-centered prototypes, representing a simpler yet effective bias beneficial in limited-data conditions.
When addressing the challenges of retail shelf product recognition, Wang’s proposal of an enhanced Siamese neural network in one-shot learning is particularly noteworthy [20]. This approach introduces a spatial channel dual attention mechanism aimed at refining the network architecture, significantly enhancing the network’s ability to focus on and interpret subtle product details.
On the generative modeling front, the VAE, introduced by Kingma and Welling, is a generative model comprising encoder and decoder networks [21]. The VAE encodes input data into a latent space and decodes it back to the original domain, facilitating tasks like image reconstruction and generation. The VPE, presented by Kim et al. as a derivative of the VAE, specializes in the one-shot classification of graphic symbols, enabling categorization with a single prototype image per class [11].
Recent research explores extensions like Variational Multi-Prototype Encoder (VaMPE) [22] or Semi-Supervised Variational Prototyping Encoder (SS-VPE) [23]. VaMPE utilizes multiple prototypes per class to enhance model performance without the need for additional sub-labeling. SS-VPE employs generative unsupervised learning to optimize prototypes in latent space, applies a Student’s-t mixture model for robust outlier management, and advances the VAE for enhanced few-shot semi-supervised learning performance. It is also worth mentioning the introduction of VPE++, which inherently reduces hubness and incorporates contrastive and multi-task losses to increase the discriminative ability of few-shot learning models [24].
The evolving landscape of one-shot learning, prototype methods, and VAE-based approaches underscores the continuous efforts to address challenges in learning from limited data and improve the efficiency and effectiveness of machine learning models. These advancements hold promise for applications across various domains, including image recognition.
This paper focuses on employing one-shot learning techniques utilizing prototype SKU images. One-shot learning trains a model to recognize patterns or objects from a single example, making it particularly suited to scenarios with limited data. Here, prototypes, representative examples of product categories, are utilized alongside unique SKU identifiers to develop a model capable of discerning various products from single instances.

3. Method

This section describes the VPE proposed in [11], adapted to the SKU recognition case.

3.1. Variational Prototyping Encoder

It is assumed that the paired dataset $X = \{(x^{(i)}, t^{(i)})\}_{i=1}^{N}$ consists of real image samples $x$ and their corresponding label prototypes $t$. Each class has only one prototype $t$, which acts as the label. The data generation process is similar to that of the VAE [21], except that the generative target is not the data $x$ but the prototype $t$: a latent code $z^{(i)}$ is generated from the prior distribution $p_\theta(z)$, and the prototype $t^{(i)}$ is then generated from the conditional distribution $p_\theta(t \mid z)$. This process is hidden, so the parameter $\theta$ and the latent variables $z^{(i)}$ are unknown. They are therefore approximated using variational Bayesian inference, with the parameters estimated by maximizing the marginal likelihood. The log-marginal likelihood of an individual prototype, $\log p_\theta(t^{(i)})$, can be lower-bounded by
$$
\begin{aligned}
\log p_\theta(t) &= \log \int_z p_\theta(t, z)\,dz
= \log \int_z p_\theta(t, z)\,\frac{q_\phi(z \mid x)}{q_\phi(z \mid x)}\,dz
= \log \mathbb{E}_{q_\phi(z \mid x)}\!\left[\frac{p_\theta(t, z)}{q_\phi(z \mid x)}\right] \\
&\geq \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(t, z) - \log q_\phi(z \mid x)\right]
\qquad \text{(by Jensen's inequality)} \\
&= \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(t \mid z)\right]
- D_{\mathrm{KL}}\!\left[\,q_\phi(z \mid x)\,\|\,p_\theta(z)\,\right],
\end{aligned}
\tag{1}
$$
where $D_{\mathrm{KL}}[\,\cdot\,\|\,\cdot\,]$ is the Kullback–Leibler divergence, and $q_\phi(z \mid x)$ is introduced to approximate the intractable true posterior. The distributions $q_\phi(z \mid x)$ and $p_\theta(t \mid z)$ are defined as the probabilistic encoder and decoder, respectively. By maximizing the variational lower bound in Equation (1), one can determine the model parameters $\phi$ and $\theta$ of the encoder and decoder.
The described method translates real image inputs into corresponding prototypical images that remain invariant despite real-world perturbations such as background clutter, geometric variations, and photometric alterations. In essence, the VPE exhibits parallels with the denoising autoencoder, functioning as a normalization mechanism for real-world perturbations. Consequently, the VPE can produce latent embeddings $z$ that are invariant, or at least robust, to such perturbations.
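To make the bound in Equation (1) concrete, the following is a minimal PyTorch-style sketch of the resulting training objective (reconstruction term plus KL divergence), assuming an encoder that returns the mean and log-variance of $q_\phi(z \mid x)$ and a decoder that maps a reparameterized $z$ to a prototype reconstruction in [0, 1]; all function and variable names are illustrative rather than taken from the authors' implementation.

```python
import torch
import torch.nn.functional as F

def vpe_loss(encoder, decoder, x, t):
    """Negative variational lower bound for the VPE.

    x: batch of real images; t: corresponding prototype images scaled to [0, 1].
    encoder(x) is assumed to return (mu, logvar) of q_phi(z | x);
    decoder(z) is assumed to return a sigmoid-activated prototype reconstruction.
    """
    mu, logvar = encoder(x)

    # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)

    t_hat = decoder(z)

    # Reconstruction term, -E_q[log p_theta(t | z)], as binary cross-entropy
    recon = F.binary_cross_entropy(t_hat, t, reduction="sum")

    # Closed-form KL divergence D_KL[q_phi(z | x) || N(0, I)]
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    return recon + kld
```

The closed-form KL term assumes a standard normal prior $p_\theta(z)$, as in the VAE [21]; this is the same BCE + KLD objective discussed later in Section 3.4.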

3.2. Training and Testing Phases

Two primary phases can be delineated in the VPE: the training and testing stages. During the training phase (Figure 1), the encoder transforms input images from the real domain into a latent distribution denoted as $q(z \mid x)$. Furthermore, in this research, prototypes are included alongside real training images as inputs to the encoder, which significantly improves the results; the prototype thus becomes a potent signal. Subsequently, the decoder reconstructs the encoded distribution into a prototype corresponding to the input image. In the testing phase (Figure 2), the trained encoder serves as a feature extractor: both test images and prototypes from the database are encoded into the latent space, and nearest neighbor classification is then performed to categorize the test images. Additionally, during the training phase, the model’s performance is evaluated on a validation set, allowing its effectiveness to be assessed throughout the training process.
It is essential to note that the testing is conducted on previously unseen images from classes already encountered, as well as entirely new classes for the encoder.
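As an illustration of the testing phase, the sketch below (a simplification under the same assumptions as the loss sketch above) encodes query images and the prototype database with the trained encoder and assigns each query the class of its nearest prototype in the latent space; cosine similarity is shown here, and Euclidean distance could be substituted directly.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def classify_queries(encoder, query_images, prototype_images, prototype_labels):
    """Nearest-prototype classification in the VPE latent space.

    encoder(images) is assumed to return (mu, logvar); the mean mu is used as
    the embedding at test time. prototype_labels is a 1-D tensor of class ids
    aligned with prototype_images.
    """
    q_mu, _ = encoder(query_images)        # (Q, D) query embeddings
    p_mu, _ = encoder(prototype_images)    # (C, D) one embedding per class

    # Cosine similarity between every query and every prototype
    sim = F.normalize(q_mu, dim=1) @ F.normalize(p_mu, dim=1).t()   # (Q, C)

    nearest = sim.argmax(dim=1)            # index of the most similar prototype
    return prototype_labels[nearest]
```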

3.3. Network Architecture

The encoder was built with three convolution layers followed by a fully connected stage producing the mean and variance predictions. A stride of 2 was used in each convolution layer, downsizing the feature map by a factor of 2. The first convolution used a 7 × 7 kernel, followed by a 4 × 4 kernel and then another 4 × 4 kernel. Batch normalization and leaky ReLU were applied after every convolution layer. The final stage consisted of a fully connected layer converting the feature map into a latent variable of predefined size.
The decoder’s layers were arranged inversely to the encoder’s, with a fully connected layer followed by three convolution layers. Before each convolution, upsampling by a factor of 2 was performed to recover the feature size to the original input dimensions. All convolution kernels in the decoder were set to 3 × 3 . Batch normalization and leaky ReLU were applied after the first and second convolution operations, and a sigmoid activation was used after the final convolution operation. The detailed VPE network architecture is shown in Appendix A.
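The following PyTorch sketch mirrors the description above. The filter counts [100, 150, 250] and the latent size of 600 follow the settings reported in Section 4.2, while the padding values, the leaky ReLU slope, and the 64 × 64 input size are assumptions made only so that the shapes line up; it is a sketch, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class VPEEncoder(nn.Module):
    """Encoder sketch: three stride-2 convolutions (7x7, 4x4, 4x4), each
    followed by batch normalization and leaky ReLU, then fully connected
    layers predicting the mean and log-variance of the latent distribution."""

    def __init__(self, in_ch=3, latent_dim=600, img_size=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, 100, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(100), nn.LeakyReLU(0.2),
            nn.Conv2d(100, 150, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(150), nn.LeakyReLU(0.2),
            nn.Conv2d(150, 250, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(250), nn.LeakyReLU(0.2),
        )
        feat = 250 * (img_size // 8) ** 2   # three stride-2 layers halve H and W three times
        self.fc_mu = nn.Linear(feat, latent_dim)
        self.fc_logvar = nn.Linear(feat, latent_dim)

    def forward(self, x):
        h = self.features(x).flatten(1)
        return self.fc_mu(h), self.fc_logvar(h)

class VPEDecoder(nn.Module):
    """Decoder sketch: a fully connected layer followed by three 3x3
    convolutions, each preceded by x2 upsampling; sigmoid on the output."""

    def __init__(self, out_ch=3, latent_dim=600, img_size=64):
        super().__init__()
        self.base = img_size // 8
        self.fc = nn.Linear(latent_dim, 250 * self.base ** 2)
        self.deconv = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(250, 150, 3, padding=1),
            nn.BatchNorm2d(150), nn.LeakyReLU(0.2),
            nn.Upsample(scale_factor=2), nn.Conv2d(150, 100, 3, padding=1),
            nn.BatchNorm2d(100), nn.LeakyReLU(0.2),
            nn.Upsample(scale_factor=2), nn.Conv2d(100, out_ch, 3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, z):
        h = self.fc(z).view(-1, 250, self.base, self.base)
        return self.deconv(h)
```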

3.4. Loss Functions

Various loss functions for the VPE were implemented and tested; a short code sketch of the reconstruction-quality measures follows the list. The following were included:
  • Sum of two components: Binary Cross Entropy (BCE) and Kullback–Leibler Divergence (KLD) [11]:
    $$\mathrm{BCE}(x, \hat{x}) = -\sum_{i=1}^{K} \left[ x_i \log(\hat{x}_i) + (1 - x_i)\log(1 - \hat{x}_i) \right],$$
    where $x$ is the original image, $\hat{x}$ is the reconstructed image, and $K$ is the total number of elements in the reconstruction tensor, i.e., the product of the number of photos in the given batch, the number of RGB channels, and the photo dimensions.
    $$\mathrm{KLD}(\mu, \sigma) = -\frac{1}{2}\sum_{i=1}^{L} \left( 1 + \log(\sigma_i^2) - \mu_i^2 - \sigma_i^2 \right),$$
    where $\mu$ is the mean vector of the latent variables in the VPE, $\sigma$ is the standard deviation vector, and $L$ is the total number of elements in a tensor whose first dimension is the number of photos in the given batch and whose second dimension is the declared size of the latent space.
    The total loss function is the sum of these two components:
    $$\mathrm{Loss} = \mathrm{BCE} + \mathrm{KLD}.$$
  • Root mean square error (RMSE) measures the difference in pixel values between a band of the original image $x$ and the corresponding band of the reconstructed image $\hat{x}$; $M$ and $N$ are the image height and width, respectively. This error is determined using the following formula [25]:
    $$\mathrm{RMSE}(x, \hat{x}) = \sqrt{\frac{1}{M \times N}\sum_{i=1}^{M}\sum_{j=1}^{N}\left(x_{ij} - \hat{x}_{ij}\right)^2}.$$
    The desired value of this error is zero.
  • Relative average spectral error (RASE) is computed from the RMSE value using the following equation [25]:
    $$\mathrm{RASE}(x, \hat{x}) = \frac{100}{\mu(x)}\sqrt{\frac{1}{B}\sum_{k=1}^{B}\mathrm{RMSE}^2(x_k, \hat{x}_k)},$$
    where $\mu(x)$ is the mean value across all $B$ spectral bands of the original image $x$, and $\mathrm{RMSE}(x_k, \hat{x}_k)$ is the root mean square error of the $k$-th band between the original $x_k$ and the reconstructed $\hat{x}_k$ images. The desired value of this parameter is zero.
  • Relative dimensionless global error in synthesis (ERGAS) is a global quality factor. This error is affected by variations in the average pixel value of the image and the dynamically changing range. It can be expressed as [25]:
    $$\mathrm{ERGAS}(x, \hat{x}) = 100\,\frac{h}{l}\sqrt{\frac{1}{B}\sum_{k=1}^{B}\left(\frac{\mathrm{RMSE}(x_k, \hat{x}_k)}{\mu(x_k)}\right)^2},$$
    where $h/l$ is the ratio of the number of pixels in the original image to the number of pixels in the reconstructed image, $B$ is the total number of spectral bands, $\mathrm{RMSE}(x_k, \hat{x}_k)$ is the root mean square error of the $k$-th band between the original $x_k$ and the reconstructed $\hat{x}_k$ images, and $\mu(x_k)$ is the mean value of the $k$-th band of the original image $x$. The optimal value for this error is close to zero.
  • Pearson’s correlation coefficient (PCC) shows the spectral correlation between two images. Its value for the reconstructed image $\hat{x}$ and the original image $x$ is calculated as follows [25]:
    $$\mathrm{PCC}(x, \hat{x}) = \frac{\sum_{i=1}^{M}\sum_{j=1}^{N}\left(x_{ij} - \mu(x)\right)\left(\hat{x}_{ij} - \mu(\hat{x})\right)}{\sqrt{\sum_{i=1}^{M}\sum_{j=1}^{N}\left(x_{ij} - \mu(x)\right)^2}\sqrt{\sum_{i=1}^{M}\sum_{j=1}^{N}\left(\hat{x}_{ij} - \mu(\hat{x})\right)^2}},$$
    where $\mu(x)$ and $\mu(\hat{x})$ denote the average values of the images $x$ and $\hat{x}$, while $M$ and $N$ denote the image dimensions. The desired value of this coefficient is one.

4. Experiments

4.1. Dataset Overview

The research consisted of two phases. In the first phase, a dataset was created by extracting products from store shelf photos, specifically non-alcoholic beverages in both canned and bottled variants. The images were taken in stores belonging to one of the retail chains with the largest number of stores in Poland, where canned products are prevalent due to the specific nature of the store format. To enhance the robustness and variability of the dataset, photos were captured under various conditions, including different angles, lighting, and distances, ensuring a comprehensive representation of real-world scenarios. Each product was then extracted and categorized, creating a structured dataset tailored to the evaluation of our one-shot learning model. The dataset is partitioned into three subsets: a training set, a validation set, and a test set. The training set consists of samples from 38 distinct classes, while the validation set encompasses 11 classes. The test set consists of 15 classes, further divided into two categories: ‘seen’ and ‘unseen’ classes. The ‘seen’ subset comprises 6 classes the model was exposed to during training, although the specific photos in this subset were not seen by the model. Conversely, the ‘unseen’ subset contains 9 classes that are entirely novel to the model and were not encountered during the training phase.
In the second phase, a dataset representing products from store shelves was created using frames extracted from video recordings. This introduced the additional challenge of recognizing products from lower-quality images. The dataset includes all products available in popular franchise stores, categorized by SKU, which accounts for factors such as product size. The research focused on beverages (1070 classes), dairy products (270 classes), and snacks (156 classes), which constitute the majority of the store’s inventory. The dataset was also divided into training, validation, and test sets in a 70:15:15 ratio.
This dataset structure facilitates rigorous evaluation of the model’s performance across various scenarios, including its ability to generalize to unseen classes, thus providing insights into its robustness and efficacy in real-world applications.

4.2. Implementation Overview

Various optimizers and parameter values were rigorously tested during the research to determine the most effective settings for training the networks. Ultimately, the ADAM optimizer was selected for its robust performance, with a learning rate tuned to $10^{-4}$, beta values of (0.9, 0.999), an epsilon value of $10^{-8}$, and a mini-batch size of 154. The effects of different image resolutions were also investigated, leading to an adaptation of the input dimension of the initial fully connected layer, which adjusts dynamically based on the input size. The tests demonstrated that larger convolution filter counts and latent variable sizes yield better results; however, these choices had to be balanced against memory and resource consumption. Therefore, the number of filters in the successive convolutional layers was set to [100, 150, 250] and the latent variable size to 600.
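A possible training loop using the reported settings is sketched below; the encoder, decoder, dataset, and loss objects are assumed to exist (for example, the vpe_loss and module sketches from Section 3), and the function signature is illustrative.

```python
import torch
from torch.utils.data import DataLoader

def train_vpe(encoder, decoder, train_dataset, vpe_loss, epochs=1, device="cpu"):
    """One possible training loop with the reported settings: ADAM,
    lr = 1e-4, betas = (0.9, 0.999), eps = 1e-8, mini-batch size 154."""
    params = list(encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-4, betas=(0.9, 0.999), eps=1e-8)
    loader = DataLoader(train_dataset, batch_size=154, shuffle=True)
    for _ in range(epochs):
        for x, t in loader:                      # real image x, prototype t
            x, t = x.to(device), t.to(device)
            optimizer.zero_grad()
            loss = vpe_loss(encoder, decoder, x, t)
            loss.backward()
            optimizer.step()
```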

4.3. Metrics

  • Recall, defined as the ratio of instances correctly assigned to the correct class ($TP$) to the sum of instances correctly assigned to the correct class ($TP$) and instances incorrectly assigned to a class other than the correct one ($FN$), represents the model’s ability to correctly identify instances of a given class, as follows [26]:
    $$\mathrm{Recall} = \frac{TP}{TP + FN}.$$
  • Precision, being the ratio of instances correctly assigned to the correct class ($TP$) to the sum of instances correctly assigned to that class ($TP$) and instances incorrectly assigned to it ($FP$), measures the model’s ability to correctly classify instances as positive, as follows [26]:
    $$\mathrm{Precision} = \frac{TP}{TP + FP}.$$
  • F1-score is a measure of accuracy that takes both precision and recall into account; it is the harmonic mean of precision and recall and ranges from 0 to 1, as follows [26] (a small computation sketch follows this list):
    $$\mathrm{F1\text{-}score} = \frac{2 \times (\mathrm{Precision} \times \mathrm{Recall})}{\mathrm{Precision} + \mathrm{Recall}} = \frac{2 \times TP}{2 \times TP + FP + FN}.$$
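For completeness, the following sketch computes these three metrics per class directly from the TP, FP, and FN counts defined above; the array-based interface is an assumption.

```python
import numpy as np

def per_class_metrics(y_true, y_pred, classes):
    """Recall, precision, and F1-score per class from TP, FP, and FN counts,
    mirroring the formulas above; y_true and y_pred are arrays of class labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    results = {}
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        results[c] = {"recall": recall, "precision": precision, "f1": f1}
    return results
```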

4.4. Results

Based on the conducted experiments, it was decided to modify the algorithm by adding prototypes to each training set, as a significant increase in recall for unseen classes, from 0.686 to 0.922, was observed. Precision increased from 0.741 to 0.903, and the F1-score from 0.712 to 0.912.
A comparison of recall metrics obtained through different evaluation methods and distance measures was conducted and presented in Table 1. Two distance measures, Euclidean and Cosine, are evaluated. For the Euclidean distance measure, when the recall is calculated after a specified number of epochs, the results show that for all instances, the recall is 0.888, while for the training set, it is 0.894, and for the test set, it is 0.883. In terms of top-nn recall, for second nearest neighbors (2-nn), it achieves 0.972, and for third nearest neighbors (3-nn), it is 0.986. However, when recall is triggered by validation, the overall recall decreases to 0.769, with 0.939 for the training set and a significant drop to 0.623 for the test set. The top-nn recall also declines to 0.825 for 2-nn and 0.839 for 3-nn. Conversely, for the Cosine distance measure, recall values are consistently higher. When evaluated after a specified number of epochs, the overall recall is 0.916, with 0.909 for the training set and 0.922 for the test set. The top-nn recall is notably high, reaching 0.986 for 2-nn and 0.993 for 3-nn. Similarly, when validation triggers recall, the overall recall remains relatively high at 0.888, with 0.955 for the training set and 0.831 for the test set. The top-nn recall maintains its high values at 0.986 or 0.993 for both 2-nn and 3-nn.
For the VPE algorithm applied to images of 48 × 48 pixels, see Table 2. The model mostly achieves high recall, ranging from 0.939 to 0.955 for seen classes and from 0.896 to 0.948 for unseen classes. However, its performance on seen classes decreases to 0.576 when rotation is applied as an augmentation technique. Adding a spatial transformer improves recall for both seen and unseen classes (Table 2); the spatial transformer modules are positioned before the first and third convolution layers to enhance spatial invariance.
For the same algorithm applied to images of 64 × 64 pixels, see Table 2. The model maintains relatively high recall for seen classes, ranging from 0.909 to 0.970. Performance on unseen classes varies, with augmentation techniques generally improving results, except when rotation is applied. The augmentations include resizing the longer side of the image to a specified size while preserving the aspect ratio and adding zero padding around the image; this produces a uniformly centered image with the desired dimensions, as sketched below.
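A minimal sketch of this resize-and-pad step, assuming PIL images and black padding, is given below; the target size is illustrative.

```python
from PIL import Image

def resize_and_pad(img: Image.Image, size: int = 64) -> Image.Image:
    """Scale the longer side to `size` while preserving the aspect ratio,
    then paste the result onto a black square canvas so it stays centered."""
    w, h = img.size
    scale = size / max(w, h)
    new_w, new_h = max(1, round(w * scale)), max(1, round(h * scale))
    resized = img.resize((new_w, new_h))
    canvas = Image.new("RGB", (size, size), (0, 0, 0))
    canvas.paste(resized, ((size - new_w) // 2, (size - new_h) // 2))
    return canvas
```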
In this study, various loss functions were tested to evaluate their performance in the context of the VPE problem (Table 3). The selected loss functions were the sum of BCE and KLD, RASE, RMSE, ERGAS, and PCC. Ultimately, no significant differences were observed among them, and each demonstrated the ability to achieve high performance. For this reason, it was decided to retain the sum of the two components BCE and KLD. KLD regularizes the latent space by encouraging the distribution of $z$ to align with the prior and, by mapping similar inputs to nearby locations in the latent space, prevents collapse. Any type of reconstruction loss can be used; BCE loss was employed with real-valued targets in the range [0, 1].
A comparison of results for various prototypes within a single class was conducted. Images depicting the product rotated at different angles were tested in order to define and ultimately select the prototype best suited to real-world conditions (Figure 3).
The backgrounds of all prototypes were standardized to black to match the backgrounds present in the images of the extracted products. This was accomplished with the help of the Segment Anything Model (SAM), a cutting-edge image-segmentation model that allows for promptable segmentation, delivering unmatched versatility for image analysis tasks [27].
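A hedged sketch of this background standardization step is shown below, assuming the publicly released segment-anything package and a downloaded ViT-B checkpoint; the paper does not specify how SAM was prompted, so a box prompt covering the whole product crop is an assumption.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def black_out_background(image_rgb: np.ndarray,
                         checkpoint: str = "sam_vit_b_01ec64.pth") -> np.ndarray:
    """Set every pixel outside the highest-scoring SAM mask to black.
    image_rgb: HxWx3 uint8 array of a single product crop."""
    sam = sam_model_registry["vit_b"](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    h, w = image_rgb.shape[:2]
    predictor.set_image(image_rgb)
    # Prompt SAM with a box covering the whole crop (an assumption)
    masks, scores, _ = predictor.predict(
        box=np.array([0, 0, w - 1, h - 1]),
        multimask_output=True,
    )
    mask = masks[np.argmax(scores)]          # keep the highest-scoring mask
    out = image_rgb.copy()
    out[~mask] = 0
    return out
```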
In our study, the model demonstrates satisfactory performance, as indicated by the metrics of recall, precision, and F1-score (Table 4), and while the model exhibits better performance for seen classes, the differences are marginal.
Figure 4 shows clearly defined clusters of objects for each class from the test set (whose prototypes are shown in Figure 5), indicating that the points corresponding to a specific class lie close to each other.
In the second phase of this study, a dataset was used that was derived from frames extracted from video recordings. The performance of the algorithm was compared across products from different categories, demonstrating that certain categories, such as snacks, are easier to recognize than others, such as dairy products (Table 5).
With a larger dataset, especially for frames from recordings with poorer lighting conditions, the t-SNE visualization of features for dairy products clearly shows that distinct clusters are primarily formed by classes the model has already encountered during the training phase (Figure 6).
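Visualizations such as Figures 4 and 6 can be produced with a standard t-SNE projection of the encoder embeddings; the sketch below uses scikit-learn and matplotlib, with the perplexity and color map chosen arbitrarily.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_latent_tsne(embeddings, labels):
    """2-D t-SNE projection of encoder embeddings (N, D), colored by class."""
    coords = TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=0).fit_transform(embeddings)
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab20", s=8)
    plt.title("t-SNE of VPE latent features")
    plt.show()
```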
Reproducing prototype images is an auxiliary task during network training, and it is important to understand how this image translation behaves on unknown data. To illustrate this, the trained encoder was used to map both real prototypes and test data into the latent feature space, and the decoder was then used to generate prototype reconstructions from it. Sample reconstructions of prototypes from the test classes were compared with the real prototypes and selected real images (Figure 7).

5. Conclusions

This study has successfully implemented a VPE tailored to the problem of recognizing retail shelf items from a limited dataset based on product graphics prototypes, achieving satisfactory accuracy. The strategic addition of prototypes to each training set notably improved the recognition rate of unseen classes, indicating a substantial improvement in the algorithm’s ability to identify new classes without prior exposure.
These results suggest that the Cosine distance measure consistently outperforms the Euclidean measure across both evaluation methods, yielding higher recall values across all tested scenarios. Further comparisons showed that appropriately chosen and applied image sizes and augmentation techniques positively affect the algorithm’s performance. Random rotation and horizontal flipping were the only transformations that did not yield the anticipated outcomes for the analyzed dataset. This could be attributed to the products’ specific nature, such as beverages and dairy items. These items are typically placed on shelves in stores in a specific position and are rarely tilted or turned, as store staff ensure their proper arrangement. Consequently, the augmentation technique involving random rotation of images might have introduced unrealistic representations of these products.
Adding spatial transformers enhances the algorithm’s performance because it increases the network’s robustness to geometric deformations, allowing objects to be recognized regardless of their orientation or position. As a result, the network can more accurately identify important features of beverages, such as labels and bottle shapes, while ignoring less relevant background elements.
No significant differences were observed among the tested loss functions, but all proved effective in optimizing the model’s performance, confirming their usefulness for complex problems involving VAEs. Uniform testing conditions for prototypes, such as consistent backgrounds and the selection of suitable prototypes, contributed to creating a cohesive assessment environment that yielded satisfactory effects.
The results revealed variations in recognition indicators across various product categories. Dairy products proved to be the most challenging compared to drinks and snacks. This can be attributed to the relatively smaller size of dairy products and significant identifying features on the front of the packaging and on the lids. Moreover, the labels of dairy products often utilize muted, similar colors.
The model can effortlessly distinguish visually similar products of the same brands with similar packaging, differing only in aspects such as flavor, as demonstrated for the first and second pairs in Figure 8. However, the model is unable to differentiate between dairy products of the same type in different sizes based only on the prototype. It is worth mentioning that humans would also struggle to make this distinction based on images alone.
The model demonstrates high efficiency in reconstructing prototypes for classes seen during the training process. It performs almost flawlessly even when the images are blurred, have low resolution, are shadowed, or are only partially visible. Although generating prototypes for classes unseen during training is not as precise, it still reflects the key features of these classes from the input images. The model accurately handles high-level features such as the dominant color or shape of the packaging, and while detailed elements may not be precisely replicated, the locations of colors and shapes are approximately consistent with the actual products. The model exhibits particularly good abilities in reconstructing prototypes for classes that were not directly seen during the training phase but were learned through other variants of the product. The model can detect subtle differences and accurately reproduce features characteristic of a new flavor variant, not just those it already knows. As a result, even new and previously unknown variants can be represented with satisfactory accuracy. The VPE implicitly learns how to neutralize real-world disturbances in the input image and, to some extent, captures high-level prototype concepts for classes unseen during the training phase.
In conclusion, it is worth mentioning the limitations encountered during the research. The lack of availability of product collections from store shelves divided into product SKUs with their prototype meant that a large part of the design work consisted of obtaining and developing such datasets. This is a very time-consuming and laborious process, which undoubtedly limits the possibility of efficiently and extensively testing the solution in various scenarios.
Several promising avenues for further research and development can be identified to enhance the current approach. One such direction is the application of diffusion models to the studied problem. Diffusion models, also known as diffusion probabilistic models or score-based generative models, represent a class of latent variable generative models that have recently gained significant attention in the machine learning community [28]. These models have been shown to outperform traditional methods, such as VAEs, in generating more accurate and robust latent spaces. By integrating diffusion models, a more precise representation of the underlying data structure may be achieved, leading to improved performance in the studied task. This could unlock new, innovative solutions and provide a deeper understanding of the complexities involved.
Additionally, the exploration of reranking methods presents another potential enhancement. Specifically, refining SKU recognition through a reranking technique that optimizes results based on the top-5 nearest neighbors could be highly effective. This reranking process could leverage extracted global features to reassess and reorder initial predictions, or it could generate new features by applying local feature detectors and descriptors such as SIFT [29] or SURF [30]. By doing so, the accuracy of SKU recognition could be significantly improved, especially in challenging scenarios.
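As a rough illustration of such a reranking step, the sketch below re-orders the top-5 candidate prototypes by the number of SIFT matches that survive Lowe's ratio test, using OpenCV; the 0.75 ratio threshold and the match-count score are assumptions, not a method evaluated in this paper.

```python
import cv2

sift = cv2.SIFT_create()
matcher = cv2.BFMatcher(cv2.NORM_L2)

def rerank_top5(query_gray, candidate_prototypes_gray):
    """Re-order top-5 candidate prototypes by the number of SIFT matches to the
    query crop that pass Lowe's ratio test; inputs are grayscale uint8 images."""
    _, q_desc = sift.detectAndCompute(query_gray, None)
    scores = []
    for proto in candidate_prototypes_gray:
        _, p_desc = sift.detectAndCompute(proto, None)
        if q_desc is None or p_desc is None:
            scores.append(0)
            continue
        good = 0
        for pair in matcher.knnMatch(q_desc, p_desc, k=2):
            if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
                good += 1
        scores.append(good)
    # Candidate indices sorted from the strongest to the weakest match
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```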
A further challenge in SKU recognition arises when dealing with similar product facings for SKUs that differ only in size. Two products with identical packaging but different volumes can be difficult to distinguish. To address this issue, one could explore the analysis of the weight-to-height ratio of detected products as a distinguishing feature. Alternatively, a model could be trained to estimate the size of a product based on the gap space between shelves, providing additional context for accurate SKU identification. These approaches could mitigate the ambiguity in recognizing products with similar appearances, ultimately leading to more reliable SKU classification.

Author Contributions

Conceptualization, G.S.; methodology, A.K.; software, A.K.; validation, G.S.; formal analysis, G.S.; investigation, A.K.; resources, G.S.; data curation, G.S.; writing—original draft preparation, A.K.; writing—review and editing, G.S.; visualization, A.K.; supervision, G.S.; project administration, G.S.; funding acquisition, G.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was co-funded by the National Center for Research and Development under Subtask 1.1.1 of the Smart Growth Operational Program 2014–2020, co-financed from public funds of the Regional Development Fund No. 2014/2020 under grant no. POIR.01.01.01-00-2326/20-00.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to the commercial use of the data. It is planned to make the collection publicly available in the future, with a request to cite this paper if it is used for research purposes.

Acknowledgments

A big thanks to the Omniaz mapping team and all those who ensured the quality of the data provided for the experiments.

Conflicts of Interest

Author Grzegorz Sarwas is currently employed at the Warsaw University of Technology and the company Omniaz Sp. z o.o. The remaining author (Aleksandra Kowalczyk) declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A. Network Architecture

Figure A1. The detailed VPE network architecture with division into encoder and decoder blocks.

References

  1. Merler, M.; Galleguillos, C.; Belongie, S. Recognizing Groceries in situ Using in vitro Training Data. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; pp. 1–8. [Google Scholar] [CrossRef]
  2. George, M.; Mircic, D.; Sörös, G.; Floerkemeier, C.; Mattern, F. Fine-Grained Product Class Recognition for Assisted Shopping. In Proceedings of the 2015 IEEE International Conference on Computer Vision Workshop (ICCVW), Santiago, Chile, 7–13 December 2015; pp. 546–554. [Google Scholar] [CrossRef]
  3. Melek, C.G.; Sonmez, E.B.; Albayrak, S. A survey of product recognition in shelf images. In Proceedings of the 2017 International Conference on Computer Science and Engineering (UBMK), Antalya, Turkey, 5–8 October 2017; pp. 145–150. [Google Scholar] [CrossRef]
  4. Tonioni, A.; Serra, E.; Di Stefano, L. A deep learning pipeline for product recognition on store shelves. In Proceedings of the 2018 IEEE International Conference on Image Processing, Applications and Systems (IPAS), Sophia Antipolis, France, 12–14 December 2018; pp. 25–31. [Google Scholar] [CrossRef]
  5. Geng, W.; Han, F.; Lin, J.; Zhu, L.; Bai, J.; Wang, S.; He, L.; Xiao, Q.; Lai, Z. Fine-Grained Grocery Product Recognition by One-Shot Learning. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; MM’18. pp. 1706–1714. [Google Scholar] [CrossRef]
  6. Leo, M.; Carcagnì, P.; Distante, C. A Systematic Investigation on end-to-end Deep Recognition of Grocery Products in the Wild. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 7234–7241. [Google Scholar] [CrossRef]
  7. Chen, S.; Liu, D.; Pu, Y.; Zhong, Y. Advances in deep learning-based image recognition of product packaging. Image Vis. Comput. 2022, 128, 104571. [Google Scholar] [CrossRef]
  8. Selvam, P.; Faheem, M.; Dakshinamurthi, V.; Nevgi, A.; Bhuvaneswari, R.; Deepak, K.; Abraham Sundar, J. Batch Normalization Free Rigorous Feature Flow Neural Network for Grocery Product Recognition. IEEE Access 2024, 12, 68364–68381. [Google Scholar] [CrossRef]
  9. Goldman, E.; Herzig, R.; Eisenschtat, A.; Goldberger, J.; Hassner, T. Precise Detection in Densely Packed Scenes. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5222–5231. [Google Scholar] [CrossRef]
  10. Melek, C.G.; Battini Sönmez, E.; Varlı, S. Datasets and methods of product recognition on grocery shelf images using computer vision and machine learning approaches: An exhaustive literature review. Eng. Appl. Artif. Intell. 2024, 133, 108452. [Google Scholar] [CrossRef]
  11. Kim, J.; Oh, T.H.; Lee, S.; Pan, F.; Kweon, I.S. Variational Prototyping-Encoder: One-Shot Learning With Prototypical Images. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9454–9462. [Google Scholar] [CrossRef]
  12. Fe-Fei, L.; Fergus; Perona. A Bayesian approach to unsupervised one-shot learning of object categories. In Proceedings of the Ninth IEEE International Conference on Computer Vision, Nice, France, 13–16 October 2003; Volume 2, pp. 1134–1141. [Google Scholar] [CrossRef]
  13. Lake, B.M.; Salakhutdinov, R.; Tenenbaum, J.B. Human-level concept learning through probabilistic program induction. Science 2015, 350, 1332–1338. [Google Scholar] [CrossRef] [PubMed]
  14. Vinyals, O.; Blundell, C.; Lillicrap, T.; Kavukcuoglu, K.; Wierstra, D. Matching Networks for One Shot Learning. In Proceedings of the Advances in Neural Information Processing Systems; Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Nice, France, 2016; Volume 29. [Google Scholar]
  15. Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P.H.; Hospedales, T.M. Learning to Compare: Relation Network for Few-Shot Learning. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1199–1208. [Google Scholar] [CrossRef]
  16. Zhenguo, L.; Fengwei, Z.; Fei, C.; Hang, L. Meta-SGD: Learning to Learn Quickly for Few Shot Learning. arXiv 2017, arXiv:1707.09835. [Google Scholar]
  17. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 1126–1135. [Google Scholar]
  18. Chen, T.; Xie, G.S.; Yao, Y.; Wang, Q.; Shen, F.; Tang, Z.; Zhang, J. Semantically Meaningful Class Prototype Learning for One-Shot Image Segmentation. IEEE Trans. Multimed. 2022, 24, 968–980. [Google Scholar] [CrossRef]
  19. Snell, J.; Swersky, K.; Zemel, R. Prototypical Networks for Few-shot Learning. In Proceedings of the Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Nice, France, 2017; Volume 30. [Google Scholar]
  20. Wang, C.; Huang, C.; Zhu, X.; Zhao, L. One-Shot Retail Product Identification Based on Improved Siamese Neural Networks. Circuits, Syst. Signal Process. 2022, 41, 1–15. [Google Scholar] [CrossRef]
  21. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, 14–16 April 2014; Conference Track Proceedings. Available online: https://arxiv.org/abs/1312.6114v11 (accessed on 1 July 2024).
  22. Kang, J.S.; Ahn, S.C. Variational Multi-Prototype Encoder for Object Recognition Using Multiple Prototype Images. IEEE Access 2022, 10, 19586–19598. [Google Scholar] [CrossRef]
  23. Liu, Y.; Shi, D. SS-VPE: Semi-Supervised Variational Prototyping Encoder With Student’s-t Mixture Model. IEEE Trans. Instrum. Meas. 2023, 72, 1–9. [Google Scholar] [CrossRef]
  24. Xiao, C.; Madapana, N.; Wachs, J. One-Shot Image Recognition Using Prototypical Encoders with Reduced Hubness. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; pp. 2251–2260. [Google Scholar] [CrossRef]
  25. Panchal, S. Implementation and Comparative Quantitative Assessment of Different Multispectral Image Pansharpening Approaches. Signal Image Process. Int. J. 2015, 6, 35. [Google Scholar] [CrossRef]
  26. Bansal, A.; Singhrova, A. Performance Analysis of Supervised Machine Learning Algorithms for Diabetes and Breast Cancer Dataset. In Proceedings of the 2021 International Conference on Artificial Intelligence and Smart Systems (ICAIS), Coimbatore, India, 25–27 March 2021; pp. 137–143. [Google Scholar] [CrossRef]
  27. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–3 October 2023; pp. 4015–4026. [Google Scholar]
  28. Hu, R.; Hu, W.; Li, J. Saliency Driven Nonlinear Diffusion Filtering for Object Recognition. In Proceedings of the 2013 2nd IAPR Asian Conference on Pattern Recognition, Naha, Japan, 5–8 November 2013; pp. 381–385. [Google Scholar] [CrossRef]
  29. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  30. Bay, H.; Ess, A.; Tuytelaars, T.; Van Gool, L. Speeded-Up Robust Features (SURF). Comput. Vis. Image Underst. 2008, 110, 346–359. [Google Scholar] [CrossRef]
Figure 1. Illustration of the training phase of the VPE.
Figure 2. Illustration of the testing phase of the VPE.
Figure 3. Examples of different prototypes for one product obtained by rotating the can, highlighting different features of the product.
Figure 4. T-SNE visualization of features for the beverages product test dataset. Black crosses mark the prototypes of each class from the test set. Additionally, red dots mark the prototypes belonging to the classes defined as seen, meaning those with which the model was familiarized during the training process.
Figure 5. Test class prototypes from phase one of the beverages.
Figure 6. T-SNE visualization of features for the dairy product test dataset. Clusters were marked for selected classes whose prototypes were seen in the training phase.
Figure 7. The obtained prototype reconstructions for the test set of dairy products compared with real prototypes and real photos, divided into classes whose prototypes were seen in the learning process and classes not seen in the model training phase.
Figure 8. Examples of challenges in recognizing similar products. The first pair represents the challenge of recognizing different flavor variants of a given product, and the second and third pairs illustrate the challenge of recognizing different sizes of the same product.
Table 1. Comparison of recall metrics under different evaluation methods and distance measures after a certain number of epochs is reached or validation accuracy is achieved.

Distance  | Method                                        | Recall (All) | Recall (Train) | Recall (Test) | Top-nn (2-nn) | Top-nn (3-nn)
Euclidean | Reach defined number of epochs                | 0.888        | 0.894          | 0.883         | 0.972         | 0.986
Euclidean | Trigger after validation accuracy is achieved | 0.769        | 0.939          | 0.623         | 0.825         | 0.839
Cosine    | Reach defined number of epochs                | 0.916        | 0.909          | 0.922         | 0.986         | 0.993
Cosine    | Trigger after validation accuracy is achieved | 0.888        | 0.955          | 0.831         | 0.986         | 0.986
Table 2. One-shot classification recall for different image sizes and algorithm versions, including different combinations of spatial transformer (stn), augmentation (aug), and separately treated rotations.

Image Size | Algorithm’s Version | Recall (Classes Seen) | Recall (Classes Unseen)
48 × 48    | VPE                 | 0.939                 | 0.713
48 × 48    | VPE + aug           | 0.939                 | 0.896
48 × 48    | VPE + aug + rotate  | 0.576                 | 0.818
48 × 48    | VPE + stn           | 0.939                 | 0.948
48 × 48    | VPE + aug + stn     | 0.955                 | 0.896
64 × 64    | VPE                 | 0.924                 | 0.740
64 × 64    | VPE + aug           | 0.970                 | 0.909
64 × 64    | VPE + aug + rotate  | 0.712                 | 0.909
64 × 64    | VPE + stn           | 0.939                 | 0.935
64 × 64    | VPE + aug + stn     | 0.909                 | 0.922
Table 3. One-shot classification recall for different loss functions.

Loss Function | Recall (Classes Seen) | Recall (Classes Unseen)
BCE + KLD     | 0.970                 | 0.949
RMSE          | 0.970                 | 0.949
ERGAS         | 0.939                 | 0.970
PCC           | 0.955                 | 0.949
RASE          | 0.924                 | 0.929
Table 4. Summary of classification metrics such as recall, precision, and F1-score for seen and unseen classes. The model was familiarized with the classes referred to as seen during the training process, although the images in this subset were never seen by the model. In turn, the unseen classes are those that are completely new to the model and were not used in the training phase.

Classes                                                         | Recall | Precision | F1-Score
Seen
Class 1, Black Energy, ultra mango, can, orange                 | 1.000  | 1.000     | 1.000
Class 2, Coca-cola, bottle                                      | 1.000  | 1.000     | 1.000
Class 8, Easy boost, zero sugar, cherry, can, pink              | 1.000  | 0.917     | 0.957
Class 9, Easy boost, blueberry and lime, can, purple            | 1.000  | 1.000     | 1.000
Class 10, Level up Classic Energy Drink, can, blue              | 1.000  | 1.000     | 1.000
Class 11, Dzik, tropic, can, green                              | 1.000  | 1.000     | 1.000
Unseen
Class 0, Black Energy, zero sugar, paradise, can, light-blue    | 1.000  | 1.000     | 1.000
Class 3, Tiger Pure, passion fruit-lemon, can, light-yellow     | 0.909  | 0.909     | 0.909
Class 4, Tiger Hyper Splash, exotic, can, pink                  | 0.909  | 0.909     | 0.909
Class 5, Black Energy, ultra mojito, can, green                 | 1.000  | 1.000     | 1.000
Class 6, Red Bull Purple Edition, sugarfree, açai, can, purple  | 0.909  | 1.000     | 0.952
Class 7, Lipton Ice Tea, lemon, bottle                          | 1.000  | 1.000     | 1.000
Class 12, Oshee Isotonic Drink, multifruit, narrow bottle, blue | 1.000  | 1.000     | 1.000
Class 13, Oshee Vitamin Water, lemon-orange, bottle, blue       | 1.000  | 1.000     | 1.000
Table 5. One-shot classification recall for various categories of food products in the second phase of research: beverages, dairy, and snacks.

Category  | Recall (Classes Seen) | Recall (Classes Unseen)
Beverages | 0.939                 | 0.725
Dairy     | 0.924                 | 0.613
Snacks    | 0.954                 | 0.754


