1. Introduction
Inventory refers to the goods stored in a warehouse; more specifically, it includes raw materials, semi-finished products, finished products, spare parts, and so on. It is the basis of business activities, but more inventory does not mean more profit. Excess inventory occupies a lot of storage space and increases storage costs. It also ties up working capital, disrupts the normal arrangement of production plans, and creates greater sales pressure. Conversely, insufficient inventory leaves consumer demand unmet and costs the enterprise market share. One objective of inventory management is to keep inventory at a reasonable level, for example by submitting replenishment orders promptly when stock runs low. Counting the amount of inventory accurately is therefore very important.
The current inventory-counting methods can be divided into three categories including the manual method, the internet of things (IoT) method, and the vision-based method.
Traditionally, industries hire workers to perform the counting. This is a simple way to realize inventory counting: it requires no technical infrastructure and makes few demands on workers' skills or abilities. However, manual counting also has many disadvantages. In a large-scale warehouse it is heavy work; it takes a long time, is inefficient, and incurs high labor costs. Moreover, Iman et al. [1] indicate that counting inaccuracy is a significant problem for industries, especially retail and manufacturing, and that it is mainly caused by human error. To avoid the shortcomings of manual methods, new technologies have been applied to inventory counting.
With the development of wireless communication technologies and radio frequency identification (RFID), IoT was used in inventory-counting tasks. Tejesh et al. [
2] built a warehouse-management system using IoT that recorded inventory quantities and their positions in the warehouse. They first attached an RFID tag with a unique identification number to each product. They then used an RFID reader to emit a short-range radio signal to initialize the tag. Products in different stockrooms can be distinguished because each stockroom uses its own RFID reader. Once initialized, a tag can be scanned by the RFID reader, and the product information it carries is transmitted to a central server via a transmitting node and a receiving node. Finally, all product information is shown on a web page through a web server. However, workers are still needed to operate the RFID reader. Zhang et al. [
3] used a robot to replace the manual labor of operating an RFID scanner. The robot is equipped with an ultrasound sensor, a laser scanner, and an RGB camera to build a map of the environment; a navigation path is then generated to guide the robot. With RFID readers installed, the robot can automatically scan and count inventory. However, the RFID method also has defects: counting and localization rely entirely on the RFID tags. In some settings the goods or products are small, cheap, and densely packed, yet the IoT method requires every item to carry an RFID tag. This is costly, and some goods or products are not suitable for tagging.
The vision-based method first captures images of the inventory and then uses an image-processing algorithm to locate and classify the objects, thereby achieving inventory counting. According to the image-processing algorithm used, it can be further divided into traditional methods and deep learning methods.
The Nishchal K. Verma group has done extensive research in this field. In their recent work [
4], they propose an inventory-counting method using KAZE features. The algorithm takes two input images: the object to be detected and the scene image. KAZE features are extracted from both images, and the features in the scene image that correspond to those in the object image are identified. Next, by applying the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering algorithm to these matched features, the bounding box of the object is obtained and the counting task is completed. The method performs better than the Fuzzy Histogram, the SURF (Speeded Up Robust Features)-based strategy using an SVM (Support Vector Machine), and the SURF-based strategy using DBSCAN clustering, and it is robust to illumination, scaling, rotation, and viewpoint changes. However, the test images contain at most four objects, so detection performance on dense, tiny objects remains untested. Furthermore, the algorithm can detect only one kind of object at a time, which is inefficient for multi-class object detection. In [
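The clustering step of this pipeline can be illustrated with a short sketch: matched feature locations are grouped with DBSCAN, and each cluster yields one bounding box while isolated matches are discarded as noise. The DBSCAN implementation below is a minimal, naive version written for self-containment (not the authors' code), and the point coordinates, `eps`, and `min_pts` values are illustrative.

```python
import numpy as np

def dbscan(points, eps, min_pts):
    """Naive DBSCAN over 2-D points; returns a cluster label per point (-1 = noise)."""
    n = len(points)
    labels = np.full(n, -1)
    # Pairwise Euclidean distances between all matched-feature locations.
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = list(np.flatnonzero(d[i] <= eps))
        if len(neighbors) < min_pts:
            continue  # noise for now; may still be claimed by a cluster later
        labels[i] = cluster
        queue = neighbors
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
            if not visited[j]:
                visited[j] = True
                nbrs = np.flatnonzero(d[j] <= eps)
                if len(nbrs) >= min_pts:
                    queue.extend(nbrs)
        cluster += 1
    return labels

def boxes_from_clusters(points, labels):
    """One axis-aligned bounding box per cluster (noise points are ignored)."""
    boxes = []
    for c in sorted(set(labels) - {-1}):
        pts = points[labels == c]
        x0, y0 = pts.min(axis=0)
        x1, y1 = pts.max(axis=0)
        boxes.append((x0, y0, x1, y1))
    return boxes

# Two tight groups of matched-feature locations, plus one stray match.
pts = np.array([[10, 10], [12, 11], [11, 13],
                [80, 80], [82, 79], [81, 82],
                [200, 5]], dtype=float)
labels = dbscan(pts, eps=5.0, min_pts=2)
boxes = boxes_from_clusters(pts, labels)
print(boxes)  # two boxes; the stray match is dropped as noise
```

In practice one would use a library implementation of DBSCAN; the sketch only shows how clustered matches translate into per-object boxes.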
5], they combine traditional methods with neural networks for inventory detection and counting. The input image is first blurred with a Gaussian filter to reduce noise and then processed by a Sobel edge detector to extract edges. Connected component analysis (CCA) is then used to find connected regions and their centroids, each centroid generates a fixed-size bounding box, and adjacent bounding boxes are merged to obtain the final boxes. Finally, these are passed through a single-layer convolutional neural network (CNN) and a softmax layer to obtain their classes. This method performs better than Fuzzy Histogram, Cross Correlation, and SURF-SVM. However, it is not end-to-end and is overly complex. More importantly, the results depend heavily on the Sobel and CCA algorithms, whose performance on complex scenes and small targets remains to be tested.
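The localization stages of this pipeline (blur, Sobel edges, CCA, fixed-size boxes around centroids) can be sketched as follows. This is a minimal reconstruction, not the authors' implementation; the box size, blur sigma, edge threshold, and the synthetic test image are all illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def detect_boxes(image, box_size=8, edge_thresh=0.2):
    """Blur -> Sobel edge magnitude -> connected components -> one
    fixed-size box per component centroid. Thresholds are illustrative."""
    blurred = ndimage.gaussian_filter(image.astype(float), sigma=1.0)
    gx = ndimage.sobel(blurred, axis=1)        # horizontal gradient
    gy = ndimage.sobel(blurred, axis=0)        # vertical gradient
    edges = np.hypot(gx, gy)
    mask = edges > edge_thresh * edges.max()   # keep strong edges only
    labeled, n = ndimage.label(mask)           # connected component analysis
    centroids = ndimage.center_of_mass(mask, labeled, range(1, n + 1))
    half = box_size // 2
    return [(cx - half, cy - half, cx + half, cy + half) for cy, cx in centroids]

# Synthetic image with two bright squares on a dark background.
img = np.zeros((64, 64))
img[10:20, 10:20] = 1.0
img[40:50, 40:50] = 1.0
boxes = detect_boxes(img)
print(len(boxes))
```

Each square's edge ring forms one connected component, so the sketch yields one box per object; the merging of adjacent boxes and the CNN classifier from [5] are omitted.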
In the meantime, the deep learning method shows great ability in object detection tasks. Lots of neural network models such as Faster-RCNN (Faster region-based convolutional network) [
6], YOLO (You Only Look Once) [
7], SSD (Single Shot MultiBox Detector) [
8], Retina-Net [
9], and YOLOv3 [
10] all achieved outstanding results in object-detection tasks and have been widely used in production and life. Wang et al. [
11] proposed an SSD-based method called AP-SSD to detect traffic scenes including pedestrians, motor vehicles, and non-motor vehicles, reaching 91.83% mAP on the KITTI dataset and 86.36% mAP on a Web Dataset. Shi et al. [
12] applied YOLOv3 to detecting students in class. They used Bayesian optimization and depthwise separable convolution to improve the YOLOv3 algorithm, reaching 92.2% mAP on the Microsoft COCO dataset, 1.2% higher than the original YOLOv3. Lawal [
13] used YOLOv3 to detect tomatoes and improved it with a label-what-you-see approach, dense architecture incorporation, spatial pyramid pooling, and the Mish activation function. It reaches up to 99.5% AP with a 52 ms detection time.
In our task, there are four classes of cups, all of which are small objects. To achieve real-time cup inventory counting, we tried different methods and found that the YOLOv3 deep learning algorithm performed well. To further improve its performance, we made several improvements. First, we set a detection area to ignore cups in irrelevant regions of the input image. Then, we eliminated the two smaller feature maps in YOLOv3 to reduce the network parameters and accelerate inference, and found that this also improves mAP by 1%. Finally, we ran the k-means clustering algorithm on our own dataset to replace YOLOv3's initial anchors with new ones, which improved mAP from 96.65% to 96.82%. The proposed method reduces the parameters of the network and achieves high accuracy for real-time small-object inventory counting in complex scenes. It has the potential to be used in other inventory-counting tasks, and the acceleration method can also be applied to similar object-detection scenarios.
4. Discussion
We made several improvements to the YOLOv3 network and applied it to cup inventory detection and counting. The detection mAP reaches 96.82% at 54.88 FPS, which shows that our method can count inventory accurately in real time, freeing workers from this task.
The first improvement is setting a detection area. Sometimes we only care about goods or products in a specific region, and with this method the rest can be ignored. At first, we trained another YOLOv3 network to find the shelf and obtain the coordinates of the detection area. However, as mentioned before, the shelf and the camera are both fixed in our task, so the coordinates do not change under normal circumstances. We therefore set the coordinates as user-defined parameters instead of detecting the shelf, saving time and computing resources. Even in special cases where the camera or shelf is moved, the detection area can be adjusted conveniently at any time according to actual needs. User-defined parameters have another advantage: when the shelf is changed, or when we are interested in only one layer of the shelf, the coordinates can be changed easily without retraining a network. The approach is not suited to highly dynamic scenes, however, where a network would have to be trained to find the detection area, which is a considerable undertaking. This method can be used in other object-detection tasks as well.
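The filtering described above can be sketched in a few lines: detections whose centers fall outside the user-defined rectangle are simply discarded. The coordinates and the center-based criterion are illustrative assumptions, not the paper's exact parameters.

```python
def in_detection_area(box, area):
    """True if the detection's center lies inside the user-defined area.
    Boxes and the area are (x0, y0, x1, y1) in pixel coordinates."""
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    return area[0] <= cx <= area[2] and area[1] <= cy <= area[3]

# Hypothetical user-defined shelf region within the camera frame.
DETECTION_AREA = (100, 50, 540, 430)

detections = [(120, 60, 160, 100),   # on the shelf
              (10, 10, 40, 40),      # outside -> ignored
              (500, 380, 530, 420)]  # on the shelf
kept = [b for b in detections if in_detection_area(b, DETECTION_AREA)]
print(len(kept))  # 2
```

Because the region is a plain parameter, moving the camera or switching to a single shelf layer only requires editing one tuple, with no retraining.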
The second improvement is eliminating the feature maps, and the structures behind them, that make no contribution to the final detection results. With this change, the detection speed increased from 48.15 to 54.88 FPS, while the mAP increased from 95.65% to 96.65%. This is because each feature map contributes differently when detecting targets of different scales; in our task, the objects are all small cups, so the two smaller feature maps were eliminated. This method can be used in other tiny-object detection tasks, and probably in tasks where the objects are approximately the same size, to accelerate detection.
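A rough sketch of why this shrinks the network's output: for a standard 416×416 input, YOLOv3 predicts at three grid scales with 3 anchors per cell. The 52×52 grid handles small objects, while the spatially smaller 26×26 and 13×13 grids handle medium and large objects. The numbers below follow the standard YOLOv3 configuration and are illustrative arithmetic, not measurements from our network.

```python
# Raw predictions per YOLOv3 scale for a 416x416 input, 3 anchors per cell.
scales = {52: "small objects", 26: "medium objects", 13: "large objects"}
for g, role in scales.items():
    print(f"{g}x{g} grid -> {g * g * 3} predictions ({role})")

# Keeping only the 52x52 head (all cups are small) removes the 26x26 and
# 13x13 predictions along with the layers that feed them.
kept = 52 * 52 * 3
total = sum(g * g * 3 for g in scales)
print(kept, "of", total, "raw predictions remain")  # 8112 of 10647
```

Dropping the two coarser heads removes their convolutional branches as well, which is where the parameter and speed savings come from.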
The last improvement is refining the anchor sizes. After running a clustering algorithm on our own dataset and resetting the anchors, the mAP increased from 96.65% to 96.82%, which is 0.76% higher than the 96.06% achieved by YOLOv4. Since the default anchor sizes of YOLOv3 were obtained on the COCO dataset, this method applies to any task that has its own dataset.
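The anchor refinement can be sketched as k-means over the labelled boxes' widths and heights, using 1 − IoU as the distance, as is customary for YOLO anchor selection. The box sizes, k, and seed below are illustrative; they are not our dataset's actual statistics.

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between (w, h) pairs, assuming boxes and anchors share a corner."""
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(wh, k, iters=100, seed=0):
    """k-means on box widths/heights with 1 - IoU as the distance."""
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(wh, anchors), axis=1)  # nearest = highest IoU
        new = np.array([wh[assign == j].mean(axis=0) if np.any(assign == j)
                        else anchors[j] for j in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]  # sort by area

# Hypothetical labelled-box sizes (pixels) from a small-object dataset.
wh = np.array([[14, 20], [15, 22], [13, 19],
               [30, 28], [32, 30], [29, 27]], dtype=float)
anchors = kmeans_anchors(wh, k=2)
print(anchors)
```

The resulting anchors replace the COCO-derived defaults in the network configuration; with only the 52×52 head kept, only the anchors assigned to that scale are needed.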
The performance of YOLOv4 is better than that of YOLOv3. However, the YOLOv4 model is more complex and has more parameters. Considering the limited computing resources of industrial computers, and that YOLOv3 was the more mature model at the time, we chose YOLOv3 as the basis of our research and achieved satisfactory results. After deploying our method on the industrial computer, we found that it may also be applicable to YOLOv4. In future work, we can apply our improvements to YOLOv4 and examine its performance.
We deployed our algorithm on the industrial computer to process the images collected by the camera in real time and display the results on a web page. It achieves real-time inventory counting with only a 1.61% average error rate. Furthermore, the data can be stored and analyzed to obtain more in-depth information on inventory changes, so that corresponding decisions can be made to ensure the normal operation of firms. This lays the foundation for automatic inventory management.
There is only one shelf in our task, and the counting could easily be completed by a worker. However, compared to manual counting, our method discovers inventory shortages in a more timely fashion. Moreover, it has the potential to be used in large-scale warehouses, where it could greatly improve inventory-counting efficiency and reduce labor costs.