In this section, to demonstrate the feasibility of our model on small data arising from real-world limitations, we construct a dataset with 4 classes to classify, where each class equally contains 18 products with only 8 images per product. The ways that products can be placed in a compact space are limited, so even if the amount of data were increased, the diversity of its features would not increase. The collection is briefly introduced in the following subsection. To demonstrate that our model is not affected by changes in products, we then use a k-means clustering-based method to split the collected dataset appropriately into a training dataset and a validation dataset with no product overlap.
3.1. Image Collection
Our data acquisition platform is based on a webcam connected to a computer and fixed at a distance of about 20 centimeters in front of the product shelf. All properties of the camera are fixed, and images are captured directly as gray-scale images rather than color images. Notably, so that changes of the product do not affect the overall image, the auto-focus and auto-exposure functions are disabled; instead, a constant focus and a constant exposure are manually tuned such that the end of the spiral rack is clearly visible. In actual operation, for example, in the auto-exposure mode, the brightness of the product affects the brightness of the overall picture, causing the spiral to sometimes be clear and sometimes buried in shadows. Only the front of each product is captured, in consideration of the consumers' perspective. The products are translated, slightly rotated, and flipped left and right in the confined space of the spirals to make the eight images as distinguishable as possible.
Eventually, there are a total of 144 original images of the same size in each class, covering 18 different products (including a situation where no product is placed). There was no specific selection of certain products; we simply used as many products as we had on hand. Although only snacks in rectangular packaging are collected, the collection will not solely benefit models optimized for snacks, because grocery products such as tissues and face masks are conceivably no different in their rectangular packaging.
3.2. Data Processing and Proposed Splitting Method
The motivation for splitting data in an appropriate way is that biased data may cause the features learned by the model to be biased, that is, unable to cope with general situations. Imagine a situation where a model trained only on person A's handwritten digits performs poorly in recognizing person B's handwritten digits, or more broadly anyone's handwritten digits, especially when person A's handwriting has strong personal characteristics. In such training situations, it is reasonable to believe that no matter how much effort is put into the model, it is difficult to achieve the expected performance, that is, applicability in a broad sense. Therefore, it is essential to split the dataset appropriately.
Originally captured images of the left and right spiral racks are center-cropped to a size of $W \times H$ such that the spiral edge remains approximately a fixed fraction of the image width or height away from the image edge, and the images of the left spiral racks are flipped horizontally to double the amount of data.

A common method is to split the whole dataset randomly into a training dataset and a validation dataset according to certain proportions. With sufficient data, a random split gives the training dataset and the validation dataset similar features. However, this may not be applicable to small data, where the extreme situation may occur that the randomly split training data cannot represent the whole dataset. Therefore, in the case of small data, it is necessary to ensure both that the training dataset is sufficient to represent the whole dataset and that the validation dataset is sufficient to validate the trained model on rich features.

We assume that the changes in the features of the processed images come mainly from changes in the image intensity and in the pattern complexity of the product itself, which can be roughly characterized by the mean $\mu_n$ and standard deviation $\sigma_n$ of the image, respectively. The $\mu_n$ and $\sigma_n$ of the $n$th product are calculated from the average over all images belonging to the $n$th product, as denoted in Equations (1) and (2):

$$\mu_n = \frac{1}{N} \sum_{m=1}^{N} \frac{1}{WH} \sum_{i=1}^{W} \sum_{j=1}^{H} I_m(i, j), \quad (1)$$

$$\sigma_n = \frac{1}{N} \sum_{m=1}^{N} \sqrt{\frac{1}{WH} \sum_{i=1}^{W} \sum_{j=1}^{H} \big(I_m(i, j) - \bar{I}_m\big)^2}, \quad (2)$$

where $W$ and $H$ denote the width and height of the processed image, respectively, $I_m(i, j)$ denotes the pixel intensity at location $(i, j)$ of the $m$th image, $\bar{I}_m$ denotes the mean intensity of the $m$th image, and $N$ is the number of images for the $n$th product. The two features are then normalized to zero mean and unit variance, respectively, for convenience in k-means clustering.
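As a concrete illustration, the following minimal sketch computes the two per-product features and normalizes them, assuming the processed images are available as grayscale NumPy arrays grouped by product; all variable names are illustrative rather than taken from our implementation.

```python
import numpy as np

def product_features(images):
    """Equations (1) and (2): per-image mean and standard deviation of
    pixel intensity, each averaged over the N images of one product."""
    mu_n = float(np.mean([img.mean() for img in images]))
    sigma_n = float(np.mean([img.std() for img in images]))
    return mu_n, sigma_n

def normalize(features):
    """Normalize each feature column to zero mean and unit variance."""
    features = np.asarray(features, dtype=float)
    return (features - features.mean(axis=0)) / features.std(axis=0)

# images_by_product: list of 18 lists of grayscale arrays (one per product)
# features = [product_features(imgs) for imgs in images_by_product]
# x = normalize(features)  # the feature vectors x_n used for clustering
```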
Let $x_n = (\hat{\mu}_n, \hat{\sigma}_n)$ denote the normalized feature vector of the $n$th product. Given the set of features $\{x_1, x_2, \ldots, x_{18}\}$, we aim to partition the 18 (number of product types) features into $k$ clusters $C = \{C_1, C_2, \ldots, C_k\}$ so as to minimize the within-cluster sum of squares (WCSS) of each point to its cluster centroid $c_i$, where Equation (3) defines the objective:

$$\underset{C}{\arg\min} \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - c_i \rVert^2, \quad (3)$$

where $c_i$ is the mean vector of cluster $C_i$. One can refer to Lloyd's algorithm [27] to obtain a certain local optimum for this problem in a simple and computation-friendly way.
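The clustering step can be reproduced, for instance, with the scikit-learn implementation of Lloyd's algorithm; a brief sketch, assuming the feature array `x` from the previous sketch, follows.

```python
from sklearn.cluster import KMeans

# `x` is the (18, 2) array of normalized feature vectors from the sketch above.
k = 3  # illustrative; the choice of k is discussed below
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(x)
labels = km.labels_              # cluster assignment of each product
wcss = km.inertia_               # within-cluster sum of squares, Equation (3)
centroids = km.cluster_centers_  # the mean vectors c_i of the clusters
```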
The appropriate $k$ can be selected either by the elbow method, which graphs the relationship between the number of clusters $k$ and the WCSS and picks the elbow of the curve as the optimal number of clusters [28], or by the silhouette method, where the peak of the curve of $k$ versus the average silhouette value indicates the optimal number of clusters [29]. For a point $x_i$ in cluster $C_I$, the silhouette value is defined as Equation (4):

$$s(x_i) = \frac{b(x_i) - a(x_i)}{\max\{a(x_i),\, b(x_i)\}}, \quad (4)$$

where $|C_I|$ is the number of points in cluster $C_I$, $a(x_i)$ measures the similarity of $x_i$ to its own cluster $C_I$ by the average distance of $x_i$ from the rest of the points in the cluster, as denoted in Equation (5), and $b(x_i)$ measures the dissimilarity of $x_i$ from the points in the nearest cluster, as denoted in Equation (6):

$$a(x_i) = \frac{1}{|C_I| - 1} \sum_{x_j \in C_I,\, j \neq i} d(x_i, x_j), \quad (5)$$

$$b(x_i) = \min_{J \neq I} \frac{1}{|C_J|} \sum_{x_j \in C_J} d(x_i, x_j), \quad (6)$$

where $d(\cdot, \cdot)$ denotes the Euclidean distance.
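The following sketch illustrates both selection criteria, again assuming the feature array `x` from above: it records the WCSS for the elbow method and picks the $k$ with the peak average silhouette value.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

ks = range(2, 10)
wcss, sil = [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(x)
    wcss.append(km.inertia_)                     # input to the elbow method
    sil.append(silhouette_score(x, km.labels_))  # average silhouette value

best_k = ks[int(np.argmax(sil))]  # peak of the k-versus-silhouette curve
```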
We then split the clustered data into two parts, the training dataset $\mathcal{D}_{\mathrm{tr}}$ and the validation dataset $\mathcal{D}_{\mathrm{val}}$, by selecting from each cluster based on the silhouette values within it. The feature $x_n$, as mentioned above, is used to refer to the $n$th product, so selecting this feature amounts to selecting all data of that product. The silhouette value is a measure of the similarity of a point to its own cluster compared to other clusters [30], where a high value indicates a good match to its own cluster and a poor match to neighboring clusters. Let $\tilde{C} = \{\tilde{C}_1, \tilde{C}_2, \ldots, \tilde{C}_k\}$ be the sorted set of clusters, where each cluster is sorted in descending order of the silhouette values within itself. Here, $l$ is the local sorted index within $\tilde{C}_c$, and $f_c$ maps the local sorted index $l$ to its global product index $n$, denoted as $n = f_c(l)$. For the training dataset $\mathcal{D}_{\mathrm{tr}}$, we take a pair of points with the maximum and the minimum silhouette at the same time, and take the next pair after stepping an interval towards the average silhouette. The rest is treated as the validation dataset. In this way, the training dataset and the validation dataset are ensured to share a close average silhouette. Our splitting method for the training dataset $\mathcal{D}_{\mathrm{tr}}^{c}$ and validation dataset $\mathcal{D}_{\mathrm{val}}^{c}$ in cluster $\tilde{C}_c$ is explicitly described in Equations (7) and (8), respectively:

$$\mathcal{D}_{\mathrm{tr}}^{c} = \big\{\, x_{f_c(1 + i\Delta)},\ x_{f_c(|\tilde{C}_c| - i\Delta)} \;\big|\; i = 0, 1, \ldots;\ 1 + i\Delta < |\tilde{C}_c| - i\Delta \,\big\}, \quad (7)$$

$$\mathcal{D}_{\mathrm{val}}^{c} = \tilde{C}_c \setminus \mathcal{D}_{\mathrm{tr}}^{c}, \quad (8)$$

where $\Delta$ denotes the stepping interval. In our method, it always holds that $|\mathcal{D}_{\mathrm{tr}}| > |\mathcal{D}_{\mathrm{val}}|$.
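Since Equations (7) and (8) formalize the pairing rule, the sketch below gives one plausible reading of it in code: within each cluster, the products are sorted by descending silhouette, extreme (maximum, minimum) pairs are assigned to training, and the stepping interval `step` is an assumed parameter rather than a value fixed by the text.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

def paired_split(x, labels, step=2):
    """Within each cluster, sort products by descending silhouette and take
    (max, min) pairs for training, stepping inwards towards the average
    silhouette; the skipped products form the validation dataset."""
    sil = silhouette_samples(x, labels)  # per-product silhouette values
    train, val = [], []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        order = idx[np.argsort(-sil[idx])]  # descending silhouette order
        picked = set()
        lo, hi = 0, len(order) - 1
        while lo < hi:
            picked.update((order[lo], order[hi]))  # extreme pair to training
            lo, hi = lo + step, hi - step
        train.extend(i for i in order if i in picked)
        val.extend(i for i in order if i not in picked)
    return train, val
```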
3.3. Data Splitting Method for Comparison
In this section, a data splitting method contrary to our proposal is described; it is used for comparison in Section 5. The premise of our proposal is that the training data should contain the features of the entire dataset as much as possible and be uniformly scattered, so its opposite is data biased towards a single feature.
Therefore, the selection of the training dataset $\mathcal{D}_{\mathrm{tr}}$ gives priority to the data of a single cluster $C_c$, and when the number of data in $C_c$ is not sufficient (as in our proposal, the number is 10), the data closest to cluster $C_c$ are additionally selected. If the number of data in cluster $C_c$ is greater than or equal to 10, the 10 data with the highest similarities within cluster $C_c$ are selected. The similarity within a cluster can be measured by Equation (5), and Equation (9) can be used to indicate the similarity of a point $x_i$ outside the cluster to cluster $C_c$, where a smaller value means more similar:

$$\delta(x_i, C_c) = \frac{1}{|C_c|} \sum_{x_j \in C_c} d(x_i, x_j). \quad (9)$$

Here, we only discuss the situation that $|C_c| < 10$, since this is our actual situation.
Let $S_c$ denote the set of products sorted in ascending order of their similarity values to cluster $C_c$, and let the bijection between the local sorted index $l$ in $S_c$ and its global product index $n$ be indicated by $n = g_c(l)$. The training data biased towards cluster $C_c$ and its validation data can be described as Equations (10) and (11), respectively:

$$\mathcal{D}_{\mathrm{tr}} = \{\, x_{g_c(l)} \mid 1 \le l \le 10 \,\}, \quad (10)$$

$$\mathcal{D}_{\mathrm{val}} = \{\, x_{g_c(l)} \mid 11 \le l \le 18 \,\}. \quad (11)$$