1. Introduction
The demand for caretakers grows as the elderly population rises, which may put society in an unsustainable situation. This trend increases the need for automated assistance, and social robots will be especially useful for carrying out daily tasks [1]. Everyday tasks such as manipulating clothing remain a challenge for robots. Key components of clothing manipulation include detection and classification.
Classifying objects by looking at them is a trivial task for humans but challenging for computers. To perform the task on a par with humans, computers must be robust to varying light conditions, different object sizes and colors, occlusions, etc. Despite this difficulty, a large body of literature proposes various methods that attempt to classify objects with ever greater accuracy. As a result, object classification is becoming a basic ability of any intelligent agent and has wide-reaching applications in robotics.
Early attempts at solving the classification problem involved meticulously defining and extracting features from image datasets so that these features represented most of the data with high confidence. The features were designed to capture interesting information in images, such as edges, circles, lines, or a combination of these, and were ideally invariant to translation, scale, and varying light intensities. HOG [2], SURF [3], SIFT [4], and FAST [5] are a few of them. Once these features were extracted, classifiers such as Support Vector Machine [6], Naive Bayes [7], Decision Trees [8], K-Nearest Neighbors [9] or Linear Discriminant Analysis [10] were employed to decide the membership of an unseen image. These methods, however, were time-consuming, and defining features that capture a wide range of information was difficult. Consequently, recent works have shifted to learning these features with Convolutional Neural Networks (CNNs) instead of hand-crafting them.
CNNs, an extension of Artificial Neural Networks (ANNs [11]), are a class of supervised learning methods trained on large amounts of data. These networks learn several convolution filters, which become capable of detecting interesting features in the dataset, by minimizing a loss function. This approach has been extremely successful in solving object classification problems and has opened up new avenues for a range of applications, of which clothing classification is one.
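To make the idea of a convolution filter concrete, the following minimal sketch slides a small kernel over a toy image and produces a feature map. It is an illustration only: the function name `conv2d_valid`, the toy image, and the hand-set kernel are assumptions introduced here, whereas in a CNN the kernel values would be learned by minimizing the loss.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the image patch under the kernel.
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        feature_map.append(row)
    return feature_map


# Toy 4x4 image (hypothetical): dark left half (0), bright right half (1).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-set vertical-edge kernel; a CNN would learn such weights from data.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

response = conv2d_valid(image, vertical_edge)
for row in response:
    print(row)
```

The feature map responds only in the column where the dark-to-bright transition occurs, which is the sense in which a filter "detects" a feature.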
There have been various attempts at sorting fashion images into different categories using CNNs. These attempts differ from one another in the network’s structure, such as the number and size of convolutional layers, the addition of residual blocks [12], or the addition of Long Short-Term Memory (LSTM) units [13]. Building on the CNN models improved over the years, we investigated a new network structure that uses a multi-convolutional-layer system for fashion image classification. The state of the art indicates that CNNs boost the accuracy of clothing image classification; therefore, we decided to study in depth a specific case of CNNs, networks with multiple convolutional layers (MCNNs), since these models have fewer parameters than a traditional CNN (e.g., AlexNet: 58,302,196 parameters; ResNet18: 11,175,370 parameters), which makes the networks more efficient [14].
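Parameter counts such as those quoted above follow from simple per-layer formulas. The sketch below counts trainable parameters for a convolutional and a fully connected layer. The layer shapes are hypothetical examples chosen for illustration; they are not the MCNN, AlexNet, or ResNet18 configurations.

```python
def conv2d_params(in_channels, out_channels, kernel_size, bias=True):
    """Trainable parameters of a square-kernel 2D convolution layer."""
    weights = out_channels * in_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)


def linear_params(in_features, out_features, bias=True):
    """Trainable parameters of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


# Hypothetical layer shapes, for illustration only.
small_conv = conv2d_params(in_channels=1, out_channels=32, kernel_size=3)
wide_conv = conv2d_params(in_channels=32, out_channels=64, kernel_size=3)
big_fc = linear_params(in_features=4096, out_features=4096)

print(small_conv, wide_conv, big_fc)
```

Under these assumed shapes, a single large fully connected layer holds far more parameters than several convolutional layers combined, which is why stacking convolutional layers can keep a model comparatively small.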
The main contributions of this work are the following:
We proposed new MCNN models to increase the classification performance on the Fashion-MNIST dataset. Moreover, hyperparameter search and data augmentation techniques are applied to improve the generalization of the models.
We created the Fashion-Product dataset and a customized dataset of our own to confirm the network’s performance.
We compared our models’ performance with the state of the art and with different model structures from the literature trained on the Fashion-MNIST dataset. In addition, we compared the performance of state-of-the-art models and MCNN15 on the Fashion-Product dataset and the customized dataset.
The paper is organized as follows: Section 2 presents the related works. Section 3 describes the methodology related to fashion image classification. Section 4 describes the experimental set-up, and Section 5 contains the results. Section 6 discusses the experiments. Finally, Section 7 concludes the paper.
2. Related Works
Many datasets are used for clothing image classification, such as Fashion-MNIST, DeepFashion-C, the AG dataset, and the IndoFashion dataset. The DeepFashion-C dataset was used in [15]. In this work, the authors suggested a framework for retrieving fashion products based on images that draws inspiration from biology and resembles the two-stream visual processing system the human brain is thought to have. Their attentional heterogeneous bilinear network (AHBN) includes a fully convolutional branch and a deep CNN branch, which extract landmark localization data and fine-grained appearance attribute information, respectively. A compact bilinear pooling layer models the interaction of the two streams, and a combined channel-wise attention mechanism then focuses on significant channels in the resulting heterogeneous features. DeepFashion-C was also used in [16]. This fashion model integrates two attention pipelines, landmark-driven and spatial-channel attention, to improve apparel classification. With these attention pipelines, the model was able to represent the multiscale contextual information of landmarks, which enhances classification accuracy by locating the most crucial features in an input image. In [17], the authors introduced a semi-supervised multi-task learning strategy for attribute prediction and apparel category classification. They adopted a teacher–student (T-S) pair model that uses weighted loss minimization while exchanging knowledge between teacher and student to strengthen semi-supervised fashion clothing analysis. By learning from labeled and unlabeled samples simultaneously, they aimed to improve the feature representation and avoid further training for unlabeled examples. The authors evaluated the proposed approach on the large-scale DeepFashion-C dataset and a combined unlabeled dataset obtained from six publicly available datasets. The AG dataset was used instead in [18], a real-world study aimed at automatically recognizing and classifying logos, stripes, colors, and other features of clothing solely from final rendering images of the products. IndoFashion, a dataset of over 106,000 images in 15 categories for the fine-grained classification of Indian ethnic clothes, was introduced in [19].
Apparel classification with the Fashion-MNIST dataset has been addressed in several works, and the accuracy obtained is, in some cases, more than 90%. CNNs are widely used for clothes classification with this dataset, and two examples were presented in [20,21]. In the first work, the authors implemented a CNN by retraining the final layer of a GoogLeNet to classify apparel. In the second one, the Fashion-MNIST dataset was tested on two networks created by the authors (CNN-Softmax and CNN-SVM). In [22], the authors proposed three different CNN architectures and used batch normalization and residual skip connections to ease and accelerate the learning process. Other CNN architectures were applied in [23]. The authors presented four different CNN models for training on the Fashion-MNIST dataset and compared their method with the state of the art. Additionally, in [24], the authors proposed applying Hierarchical Convolutional Neural Networks (H-CNN) to apparel classification. This study was the first attempt to apply hierarchical classification of apparel using a CNN. The authors implemented an H-CNN using VGGNet on the Fashion-MNIST dataset, obtaining a lower loss and an improved accuracy compared to the literature.
In [25], other networks such as ResNet, a Wide Residual Network (WRN), and a PyramidNet were used on the Fashion-MNIST dataset for image classification tasks. The authors improved the performance by increasing the network width and number of channels. In addition, data enhancement and learning strategies had a greater impact on the model performance. In [26], the authors used a VGG-11 network to classify the same dataset and modified it by introducing a multi-nonlinearity layer. This layer enabled the learning of more complex patterns at a relatively small cost. In addition, a batch normalization layer was added after each pooling layer, which makes it easier to train effective models by standardizing the input data so that the distribution of each feature is similar. In [27], the authors used a batch normalization strategy with a novel shallow convolutional neural network (SCNNB) to improve image classification accuracy. Moreover, in [28], the authors stated that with normalization, some tuning, and reduction of overfitting, they obtained an accuracy of 91.78% with a VGG-like architecture. They also tested a CNN on the Fashion-MNIST dataset and attained a similar test accuracy of 90.77%. The authors pointed out that VGG produces better results but at the cost of longer training time and higher computational load.
Concerning apparel classification using a Support Vector Machine (SVM) and the Fashion-MNIST dataset, in [29], the authors proposed a fashion article classification system using the HOG feature space and a multiclass SVM classifier, obtaining an accuracy of 86.53%. Additionally, in [30], the authors compared the performance of different models (SVM, K-Nearest Neighbors, Random Forest, Decision Tree and a CNN) on the Fashion-MNIST and CIFAR-10 datasets. Their work also examined different feature extraction techniques to improve the model’s performance. The results showed that using an autoencoder was better than Principal Component Analysis (PCA) for boosting the performance of the model; in particular, an autoencoder combined with an SVM surpassed the performance of a CNN model.
A different type of network, the LSTM, was used in [31] to boost the model’s accuracy in the clothes classification task using the Fashion-MNIST dataset. In [32], an LSTM architecture was also used for image classification on the Fashion-MNIST dataset. The authors used cross-validation to detect and prevent overfitting, while fine-tuning helped to improve the model’s structure, increase the score and reduce training time. Moreover, Heuristic Pattern Reduction methods reduced the training time and, in some cases, also increased the score, while network pruning was one of the most significant challenges in the experiment.
Hyperparameter optimization and regularization techniques, such as image augmentation and dropout, are used to improve the accuracy of networks in classification tasks using the Fashion-MNIST dataset. In [33], these techniques were used with four-layer ConvNets, which attained an accuracy of 93.99%. The literature thus offers many methods for training on the Fashion-MNIST dataset that demonstrate strong performance during testing; despite this, there is still room for improvement in accuracy on this kind of dataset.
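As a rough illustration of the two regularization techniques named above, the sketch below implements a horizontal flip (one common form of image augmentation) and inverted dropout on plain Python lists. The specific augmentation, the dropout rate, and the function names `hflip` and `dropout` are assumptions made for this example; the cited works may use different transformations and settings.

```python
import random


def hflip(image):
    """Mirror an image (list of rows) left to right."""
    return [row[::-1] for row in image]


def dropout(activations, p, rng):
    """Inverted dropout: zero each value with probability p and rescale the rest by 1/(1-p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep = 1.0 - p
    return [a / keep if rng.random() >= p else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

image = [[1, 2, 3], [4, 5, 6]]  # hypothetical 2x3 image
flipped = hflip(image)

# Assumed dropout rate of 0.5, applied only at training time.
dropped = dropout([1.0, 2.0, 3.0, 4.0], p=0.5, rng=rng)

print(flipped)
print(dropped)
```

Rescaling the surviving activations keeps their expected value unchanged, so no extra scaling is needed when dropout is switched off at test time.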
6. Discussion
This section discusses the image classification of the Fashion-MNIST, the Fashion-Product dataset, and our dataset.
Concerning the first dataset, we obtained a higher accuracy (94.09%) than the state of the art. This could be related to the number of convolutional layers in our model (MCNN15). The results section shows that adding layers improves the model’s accuracy up to 15 layers. Beyond this number, the accuracy decreases (MCNN18), which may be caused by a lower quality of feature extraction in the MCNN18 model. In other words, Fashion-MNIST, a dataset of small images, could lose feature information after a certain number of convolutional layers and pooling operations [47], as in the case of MCNN18. Moreover, MCNN18 and bigger models need more training time without improving performance.
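The loss of spatial information on small images can be illustrated with the standard output-size formula for pooling. The sketch below tracks the side length of a 28×28 Fashion-MNIST image through successive 2×2 pooling operations with stride 2. The number of pooling stages and the window size are assumptions for illustration and do not describe the actual MCNN15 or MCNN18 layouts.

```python
def pooled_sizes(size, num_pools, window=2, stride=2):
    """Return the feature-map side length before and after each pooling stage."""
    sizes = [size]
    for _ in range(num_pools):
        # Standard output size for pooling without padding.
        size = (size - window) // stride + 1
        sizes.append(size)
    return sizes


# Fashion-MNIST images are 28x28; four pooling stages is an assumed example.
sizes = pooled_sizes(28, num_pools=4)
print(sizes)
```

Under these assumptions the feature map collapses to a single value per channel after four pooling stages, leaving no spatial structure for deeper layers to exploit.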
In Table 4, the numbers in red represent the accuracy obtained using PyTorch. We noticed that the accuracy was much higher when the original authors trained their models using TensorFlow 2. Therefore, we replicated the models and trained them again using PyTorch. As a result, the accuracy dropped to 90.64%, compared to 98.8%, for the CNN LeNet-5 model [23] and to 90.25%, compared to 99.1%, for the CNNs model [23].
In Figure 3b, a regularization term is applied to keep the loss stable and increase the generalization performance. Without it, the accuracy increases by one or two percent, but the model overfits. In the confusion matrix, most fashion items are satisfactorily classified. In particular, sandals, trousers, bags and sneakers are the classes with the highest success rates, since they have distinctive shapes. However, our architecture confused shirts with T-shirts/Tops, which lowers our model’s performance.
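The per-class success rates discussed above can be read directly off a confusion matrix. The following sketch builds a confusion matrix and computes per-class recall from true and predicted labels. The labels are made-up toy data that merely mimic the shirt versus T-shirt/top confusion; they are not the paper’s results, and the helper names are hypothetical.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows index the true class, columns index the predicted class."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


# Toy labels (not the paper's data): half of the shirts are mistaken for T-shirts/tops.
classes = ["T-shirt/top", "Shirt", "Sandal"]
y_true = ["Shirt", "Shirt", "Shirt", "Shirt",
          "T-shirt/top", "T-shirt/top", "Sandal", "Sandal"]
y_pred = ["Shirt", "Shirt", "T-shirt/top", "T-shirt/top",
          "T-shirt/top", "T-shirt/top", "Sandal", "Sandal"]

matrix = confusion_matrix(y_true, y_pred, classes)
recall = per_class_recall(matrix)

for name, value in zip(classes, recall):
    print(f"{name}: {value:.2f}")
```

Off-diagonal entries in the shirt row show where the misclassified shirts went, which is how a confusion between two visually similar classes becomes visible.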
Table 5 and Table 6 show that our model could classify unseen fashion items. However, Table 5 shows that our model could not classify the shirt group and had a low success rate in the pullover, sneaker and bag classes. Furthermore, in Table 6, the model could not classify the coat, sandal, and sneaker classes despite being trained with data augmentation. This suggests that the model had not seen examples similar to these test images during training.
Concerning the network complexity in terms of the number of parameters, we investigated the trade-off between accuracy and network complexity in Table 7. As more convolutional layers are added, the number of parameters increases dramatically. A deeper network model has more parameters and, according to deep learning theory, would be expected to perform better. However, more parameters do not guarantee better performance. For example, AlexNet requires the most parameters in Table 7 (because of its fully connected layers), but its performance is not the highest. In addition, although MCNN15 has fewer layers than MCNN18, MCNN15’s accuracy is higher than that of MCNN18. A possible explanation is that vanishing/exploding gradients occur when the network model becomes too deep (ResNet solved this problem with skip connections). In the future, we plan to investigate deeper network models with our structure to confirm this hypothesis. Apart from this, different network structures with few parameters (similar to MCNN15), such as MobileNet, could show sufficient performance.
One interesting point is that we expected our model to achieve the highest score in each category, but as described in Table 5 and Table 6, other models recognized some categories better than our model. In addition, ViT shows worse performance on the Fashion-MNIST dataset but good performance on the customized dataset. For this reason, future work could investigate different network models based on the results of the state-of-the-art models.
A major challenge of supervised learning is that, if the model is not given many good (similar) examples during training, it might fail to classify an item even though it belongs to a known category. Our model still has this limitation of supervised learning, and GAN [48] or self-supervised learning [49] could be considered for overcoming it.
7. Conclusions
Classifying household object images is an open issue in the object-based manipulation field. In this paper, we proposed a new model (MCNN15) with the highest accuracy (94.09%) compared to the literature on image classification using the Fashion-MNIST dataset. We also evaluated our model on two other datasets (the Fashion-Product dataset and a household object dataset), although we achieved accuracies of only 60% and 40%, respectively. Our proposed model does not dramatically increase the performance, but MCNN, which is not a new network structure, is still a promising network model that might generalize and improve the performance on the Fashion-MNIST dataset. Moreover, different hyperparameter optimization methods (different number of fully connected layers, batch size, stride, with or without dropout, etc.) could improve the model’s performance.
In the future, we would like to improve the accuracy of our model (MCNN15) on the Fashion-Product and household object datasets by changing the types of layers and some parameters inside the layers. Additionally, we would like to implement a new model for fashion image classification that combines some existing techniques [48,49] to boost the accuracy of the architecture.