Article

Number Recognition Through Color Distortion Using Convolutional Neural Networks

1 Virginia Tech National Security Institute, Blacksburg, VA 24060, USA
2 Bradley Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg, VA 24061, USA
* Authors to whom correspondence should be addressed.
Computers 2025, 14(2), 34; https://doi.org/10.3390/computers14020034
Submission received: 19 December 2024 / Revised: 13 January 2025 / Accepted: 15 January 2025 / Published: 22 January 2025
(This article belongs to the Special Issue Advanced Image Processing and Computer Vision)

Abstract

Machine learning applied to image-based number recognition has made significant strides in recent years. The recent use of Large Language Models (LLMs) in natural language search and text generation has improved performance for general images, yet performance limitations still exist for data subsets related to color blindness. In this paper, we replicated the training of six distinct neural networks (MNIST, LeNet5, VGG16, AlexNet, and two AlexNet modifications) using deep learning techniques with the MNIST dataset and the Ishihara-Like MNIST dataset. While many prior works have dealt with MNIST, the Ishihara adaptation addresses red-green combinations of color blindness, allowing for further research in color distortion. Through this research, we applied pre-processing to accentuate the effects of red-green and monochrome color blindness and tuned the hyperparameters of the existing architectures, ultimately achieving better overall performance than that currently published in known works.

1. Introduction

While there has been extensive research on character recognition in the field of Machine Learning (ML) over the past 30 years, most of it has centered on creating and analyzing less-than-ideal datasets representing handwritten numbers and letters [1]. This application has many uses, such as improving the performance of vision-based systems, transcribing and understanding old texts, and using the combination of steganography and cryptography to hide information within images [2]. A sizable portion of early optical character recognition (OCR) research centered on using the Modified National Institute of Standards and Technology (MNIST) dataset in conjunction with convolutional neural networks (CNNs) [3,4]. Examples of this work include expanding the MNIST dataset to letters in 2017 [5] and creating a dataset around standard clothing items (Fashion MNIST) [6].
In this paper, we present the idea, procedure, and results of an ML-based evaluation of red-green color blindness distortions similar to [7], intended to create and train a neural network model that can see past the variations in human writing and detect the visual character in a nonideal, color-distorted environment. Essentially, instead of simply trying to detect characters that are heavily distorted by their writing style, this research seeks to augment prior work by evaluating a distorted character in an environment that further dilutes the information. To do this, the Ishihara-Like MNIST [8] dataset was used. This dataset comprises the characters from the MNIST dataset, but they are placed inside an Ishihara circle. Ishihara circles [9], more commonly known as color blindness circles, are used to detect which category of color deficiency a person may possess. In this dataset, as opposed to MNIST, the characters are no longer cohesive in nature, as the entire circle is composed of circles of varying size and color. In comparison, a standard Ishihara circle has a near perfect character in the center. Examples of a standard Ishihara circle and an MNIST Ishihara circle for the numbers 6 and 8 are shown in Figure 1 and Figure 2.

1.1. Prevalence

Color blindness, or rather color vision deficiency, affects nearly 8% of men and 0.5% of women, for a total of roughly 4% of the population [11]. This deficiency is caused by the absence of one or more of the three types of cone cells (a type of photoreceptor cell) in the retina of the eyes. These cells are responsible for our color vision as well as our color sensitivity. The human eye contains approximately 6 million cones, of which 60% are red-sensing, 30% are green-sensing, and 10% are blue-sensing [12]. This deficiency can be caused by genetic disorders (most common), injury to the eyes, or cancer and tumors that affect the optic nerve [13]. Additionally, color blindness can be caused by medications, the deterioration of the eyes from aging, and diseases such as Alzheimer’s or Parkinson’s [14]. There is no known cure for color blindness, but mitigation techniques exist in the form of special glasses and contact lenses or visual aids [13].
While there are seven official diagnoses of color deficiency, the most common is red-green [11]. Red-green color deficiency encapsulates four of the seven diagnoses: deuteranomaly, protanomaly, protanopia, and deuteranopia. Deuteranomaly is the most common and causes shades of green to appear more red, while protanomaly causes shades of red to appear more green. Protanopia is the absence of red cones, while deuteranopia is the absence of green cones. The next two deficiencies are blue-yellow: tritanomaly and tritanopia. Tritanomaly makes it difficult to distinguish between blue and green and also between yellow and red; this is due to malfunctioning blue cones. Tritanopia, on the other hand, makes the patient unable to distinguish between blue and green, purple and red, and yellow and pink, and all colors appear less bright. This deficiency is caused by the lack of blue cones. The last type of color deficiency is known as monochromacy, or achromatopsia, which is the lack of color cones entirely and causes all color to appear in grayscale [15]. Figure 3 attempts to highlight the distinction between the various types of color deficiency. While this image shows the comparison of the scale of colors, it fails to show exactly how the world appears to those with a given deficiency. Figure 4 shows the seven different deficiencies when considering a colorful image of fruit.

1.2. Motivation

This may seem like an arbitrary topic to model research after, but most ML models for color correction are based on physiological models of how people with color blindness perceive the world [20]. Most research in this field tends to focus on correcting the images for those with the deficiency [21,22,23,24]. Therefore, we think it is important to explore training a neural network model on images with color distortion, with the intent not of modifying the image but of modifying the architecture of the model to bypass, or see through, the distortion. In doing so, we may work towards a better understanding of how the brain (or CNN) learns when presented with distorted data inputs.
The real-world application of this research is to create a model that processes an image with heavy distortion in near real time, such that the machine can read the image but a human cannot. This is achieved by training on data that has heavy artificial color distortion in the particular targeted color scheme. A significant amount of research has been conducted on using color theory and color segmentation to help computer vision (CV) algorithms detect characters. In [25,26], this research was used to correct the distortion that dirt and fading have on traffic signs. Additionally, more research has been performed with color transformation to improve traffic sign detection, as seen in [27]. The goal of this paper is to apply the theory behind this research to the color gradients in these images such that they become indistinguishable to humans but clear to a machine. Additional work could also be done in the realm of ML if models understand the information presented, as proposed in [28], because fragmented color filters could be used to hide more information.

1.3. Background

The basis for this research started with replicating [29], in which Solonko attempted to modify traditional Ishihara circles to make them look more like MNIST. The goal was to train a model on MNIST and then test it with his custom Ishihara circles, evaluating the character in the center of the circle. However, to achieve high validation results, the images underwent heavy modification. This included median blurring, k-means clustering, erosion, thresholding, and morphology [29]. All of these pre-processing techniques were used to isolate the character inside the circle, essentially eliminating the distortion from the background. By the end, only a white skeletonized version of the image on a black background was left. An example of his process on the images is shown in Figure 5. It should be noted that before the image was fed into pre-processing, some processing had already been performed, as the character in the foreground was separated from its background in terms of color. To broaden the reach of this research, we sought to limit the modifications to the images.
In 1998, the MNIST dataset was presented by LeCun et al. [3]. It comprised 60,000 training images and 10,000 testing images of handwritten digits from 0 to 9. These digits were handwritten by 500 different writers (divided into two sets) and then shuffled together. The first set was from high school students and the second set was from Census Bureau employees. These handwritten digits were scanned into digital form, normalized to 20 × 20 pixels, converted to grayscale, and then padded to increase their size to 28 × 28 [3,4,30]. Today, the original dataset is used mainly as a baseline for training OCR and CV models, similar to an ML-based “hello world” program [4]. Outside of research and the construction of ML models, the original MNIST dataset is used in various business sectors, such as banks for reading checks, postal services for reading addresses and zip codes, and document management for sorting handwritten documents [31]. As previously stated, the MNIST dataset has since been expanded to many other areas. The focus here is to encapsulate more domains into an MNIST-like form so that research can be performed on those areas as well. Expansions within the realm of language include making datasets similar to the original comprised of English letters instead of digits (EMNIST) [5], Kuzushiji (cursive Japanese) [32], and even ancient Sumerian characters [33]. Outside of language, Fashion-MNIST, for example, seeks to help train neural networks in recognizing various clothing pieces such as shirts, blouses, dresses, and shoes. This dataset is intended to be the modern replacement for MNIST [34]. Using these more detailed datasets could help with the problem of overfitting in Deep Neural Networks (DNNs), as shown in [35], and help with the recognition of everyday objects, as seen in [6]. Figure 6, Figure 7 and Figure 8 show the original MNIST with the indicated expansions. The images were created using the datasets from Keras.
However, as it pertains to this research, the Ishihara-Like MNIST dataset was not created for the purpose of character recognition. It was created and used for the exploration of explainable AI, the notion that humans should be able to trust the evaluation of a computer-based system for its validity. For this, an assessment framework was made for a human-centric evaluation. To do this, Ishihara-Like MNIST circles were created and tested on color blind individuals, wherein they would need an explanation to determine if their interpretation of the images was correct. Therefore, they would have to rely on the validity of a machine in that assessment [36]. With the current limitations of ML, this was a perfect use for MNIST, where a non-biased question could be asked that most individuals could not answer correctly. This provided a uniform distribution of samples and allowed for a control group of those who were not color blind.
To create these Ishihara circles, the following process was used. First, the original MNIST images were loaded and the character was separated from its background. To do this, the image was binarized and a monochrome reduction was applied. Once the digit was extracted, the inner and outer outlines of the character were placed on a blank background. While not explicitly stated in the documentation, the images were resized at some point as the end result was a 128 × 128 image. Using a Monte Carlo simulation, a circle was then generated with varying circles inside it, and the extracted MNIST frame was placed in the circle. Edge detection was used to correct the circles inside the digit and the background to ensure all circles were fully formed. Then, coloring was applied to the background and character according to the plate [36]. Figure 9 depicts this process.
In 1917, Dr. Shinobu Ishihara from the University of Tokyo created and introduced the color blindness test [9]. This test consisted of a series of “plates” or images, usually 14, 24, or 38 at a time. The plates contained closely packed circles that varied in size and color to hide a number. The patient being tested for color deficiency was given these plates and asked to identify the hidden figures. The number of correctly identified plates out of the total indicated the severity of the deficiency. To distinguish the different types of color blindness, the numbered plates held different meanings [37]. The Ishihara-Like MNIST dataset comprises 8 of those plates (plates 2–9) and one additional plate containing random colors. Each folder of this dataset contains 10,000 training images and 2000 testing images. While the breakdown of each plate is not explicitly shared, it is stated that the generation of these plates “reasonable [sic] reproduces the themes of the original Ishihara plates” [36]. By this statement, it is assumed the same nomenclature and color scheme were followed. The only discrepancy stated is that plate 2 was the normal plate instead of plate 1. In Figure 10, each image shows what it should depict and what red-green color deficient individuals see, in the form of (actual answer, color deficient answer). It should be noted that only red-green color blindness is covered in these images, as the other deficiencies are represented in plates greater than 9. Additionally, when researching this topic, there are some deviations in the listing of plates, wherein plates are out of order or changed; therefore, not every test had the images in the exact same order. Finally, plates were created that only color blind individuals can see. Figure 11 shows an example of one of these plates.
While many other fields have used Ishihara circles in their research, no other research or articles are known to use this particular dataset without expert modification. Other published work, as seen in [40], uses the standard Ishihara circles in the training and evaluation of models. However, it should be noted that the images used in [40] were also heavily modified to achieve a high validation accuracy, and the characters inside the circles were not handwritten variations. Similar types of research are also seen in [41], where a model is trained on character images taken at unusual angles or with difficult font styles, and in [42], where a model is trained on images of old documents or texts that have degraded over time. The goal of these two articles is to extract the character from the image despite the distortion.
For this research, we used LeNet [43], VGG16 [44], AlexNet [45], and two modifications of the AlexNet architecture to evaluate the Ishihara circles. Since the Ishihara-Like MNIST dataset was created using the original MNIST dataset, the standard model used to train MNIST was also used. This sets a baseline to see if the original model used to train on the MNIST characters could be used to evaluate the circles. The reasoning behind the other previously mentioned models is that they were all significant improvements over the original MNIST model and showed greater accuracy in OCR training [46,47]. While more advanced models such as YOLO [48] could have been used, the goal was to optimize a small architecture that would reduce the amount of time required to train the model. In testing, various permutations of the data were used. This included training and testing the models on the color Ishihara-Like MNIST circles, training and testing on the grayscale Ishihara-Like MNIST circles, and cross-testing the two sets.
An example of a grayscale Ishihara-Like MNIST circle and its color counterpart is shown in Figure 12. The color image in this figure was generated using matplotlib, wherein the colors are not exactly as they should appear due to color mapping. The character inside the circle is a “2”.

1.4. Outline

Section 2 lays out the experimental design for this research. It begins in Section 2.1 by stating the tools that were used to perform this research. Then, in Section 2.2, the process of how each of the datasets is loaded, processed, and ingested into the neural network models is provided. In Section 2.3, the models used to evaluate the datasets, why they were selected, and the modifications made are shown. In Section 2.4, the metrics by which these models were judged for their effectiveness with the datasets are given. Then, in Section 2.5, the complete list of test cases performed is described, and the section ends with our initial assumptions on how each test case would perform. Section 3 quantifies the results of our research. This includes comprehensive tables showing the output of each test with each tested metric, along with our analysis of how each test performed. Additionally, confusion matrices are shown for the models that were trained and tested with the entire Ishihara-Like MNIST dataset. Section 4 summarizes the results of our research and concludes with our findings.

2. Experimental Design

In the following section, the methodology used for testing this research is described. This includes the tools that were used for the creation of the Python script, the process by which it was tested, the model selection, and how the evaluation was performed. Furthermore, the section lists the hardware that was used to perform the aforementioned tests.

2.1. Tools Used

This research used Python as the programming language, Keras and TensorFlow for the ML aspects, and OpenCV for the image processing. These selections were made mainly for compatibility with Solonko’s prior work. Using common tools allowed for easy modification of the architecture for further testing.

2.2. Testing Process

To perform adequate testing, a program was constructed in Python with two parent classes: the Data Loader and the Model Wrapper. These two wrappers served as the starting point for selecting between the two different datasets and the four distinct models. The Data Loader was broken into two pieces: one piece dealt with loading and processing the Ishihara-Like MNIST set, and the other loaded and processed the MNIST dataset. While the two parts operated in the same manner, they were separated due to the differences in form between the two datasets. The Model Wrapper then takes the specified images and labels and performs the training, testing, and evaluation.
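A minimal sketch of this two-class structure is given below. Only the Data Loader and Model Wrapper names come from the description above; the subclass, method, and parameter names are illustrative assumptions rather than the exact implementation.

class DataLoader:
    """Parent class; subclasses load either MNIST or the Ishihara-Like MNIST set."""
    def load(self):
        raise NotImplementedError

class IshiharaLoader(DataLoader):
    def __init__(self, plates, grayscale=False):
        self.plates = plates          # which of the nine plate folders to load
        self.grayscale = grayscale    # optionally convert the BGR images to grayscale
    def load(self):
        # returns (train_images, train_labels), (test_images, test_labels)
        ...

class ModelWrapper:
    """Parent class; builds one of the Keras models and runs training and evaluation."""
    def __init__(self, model_name):
        self.model_name = model_name  # e.g., "mnist", "lenet5", "vgg16", "alexnet"
        self.model = None
    def train(self, x, y, epochs=50, validation_split=0.2):
        ...
    def evaluate(self, x, y):
        ...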
MNIST was pulled directly from Keras through an API call in the form of an array. While not necessary in testing, the ability to resize the MNIST dataset so that it matched the size of the Ishihara set was implemented. Additionally, a function was implemented using OpenCV to apply a color filter to the MNIST dataset to determine if adding color had any effect on the training of the model. The possible color masks applied to the MNIST dataset were as follows: viridis, magma, plasma, inferno, cividis, mako, rocket, and turbo. An example of a modified MNIST image along with its original is displayed in Figure 13.
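The following is an illustrative sketch, not the exact code used, of how MNIST can be pulled from Keras, resized to the 112 × 112 size used to match the Ishihara images (see Section 3), and color-filtered with an OpenCV colormap. OpenCV provides viridis, magma, plasma, inferno, cividis, and turbo; the mako and rocket palettes listed above come from seaborn and would need to be applied separately.

import cv2
import numpy as np
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()      # 28 x 28 uint8 arrays

def colorize(images, size=(112, 112), colormap=cv2.COLORMAP_VIRIDIS):
    out = []
    for img in images:
        img = cv2.resize(img, size, interpolation=cv2.INTER_LINEAR)
        out.append(cv2.applyColorMap(img, colormap))           # grayscale -> 3-channel BGR
    return np.stack(out)

x_train_color = colorize(x_train[:1000])                       # subset shown for brevity
print(x_train_color.shape)                                     # (1000, 112, 112, 3)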
The Ishihara-Like MNIST set was downloaded from Kaggle [49] and was provided as a folder containing 9 sub-folders (or plates), each containing a training and a testing set in the form of Printer Command Language (.pcl) files. Each of these sub-folders contained 10 k training images and 2 k testing images. Because the Ishihara set is stored in files, these had to be loaded, converted to tensors, and added to an array. To limit the amount of memory required to run the program, the option to load a specific number of images or plates was implemented. Additionally, to test the images in grayscale, a color modification was performed using OpenCV. It should be noted that the original color scheme of these images was BGR, not the standard RGB. Other than the conversion of the images from color to grayscale, no other preprocessing techniques were applied.
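The snippet below is a hedged sketch of the only preprocessing applied to the Ishihara-Like MNIST images: converting from the dataset’s native BGR channel order to grayscale with OpenCV. The file-loading step is omitted because it depends on the on-disk format; the dummy batch is only there to show the expected shapes.

import cv2
import numpy as np

def to_grayscale(bgr_images):
    """Convert a batch of BGR images of shape (N, H, W, 3) to single-channel grayscale."""
    gray = [cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) for img in bgr_images]
    return np.expand_dims(np.stack(gray), axis=-1)             # shape (N, H, W, 1)

# Example on a dummy batch of two BGR images:
dummy = np.random.randint(0, 256, size=(2, 112, 112, 3), dtype=np.uint8)
print(to_grayscale(dummy).shape)                               # (2, 112, 112, 1)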
Once the images were loaded and processed, the model was built and training began. Using a parent structure for model selection allowed each of the separate models to share the built-in functions from Keras without re-implementing code. When training a model, the number of epochs, the training accuracy threshold, the validation accuracy threshold, and the validation split could be set. By default, for each of the datasets, a standard 80%/20% split was used for the separation of the training and validation sets.
For each of the runs, the number of epochs defaulted to 50; however, training could stop early if the training and validation accuracy thresholds were met. As a general notion, anywhere between 50 and 200 epochs is used for medium-sized datasets, wherein medium is defined as any set between 10 GB and 1 TB [50]. For our research purposes, this value was based on the number of epochs needed to stabilize training on the grayscale Ishihara set in preliminary runs. For training/testing of MNIST, 99% was used for both the training and validation accuracy thresholds. Likewise, for color Ishihara and grayscale Ishihara, 99% was used for both thresholds. Figure 14 shows the dataflow and program operation.
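A minimal sketch of these training settings, assuming Keras, is shown below: a fixed epoch budget, an 80%/20% train/validation split, and a callback that stops training once both accuracy thresholds are reached. The callback is illustrative, not the exact implementation used.

import tensorflow as tf

class AccuracyThreshold(tf.keras.callbacks.Callback):
    """Stop training once both training and validation accuracy reach their thresholds."""
    def __init__(self, train_acc=0.99, val_acc=0.99):
        super().__init__()
        self.train_acc = train_acc
        self.val_acc = val_acc
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        if logs.get("accuracy", 0.0) >= self.train_acc and logs.get("val_accuracy", 0.0) >= self.val_acc:
            self.model.stop_training = True

# model.fit(x_train, y_train,
#           epochs=50,                        # default epoch budget
#           validation_split=0.2,             # 80%/20% training/validation split
#           callbacks=[AccuracyThreshold()])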

2.3. Model Training and Selection

As previously stated, each of the following models was used due to its significance in OCR; specifically, each model was picked due to its significance in training on the MNIST dataset. Given that the Ishihara-Like MNIST dataset was created from the original MNIST dataset, a standard model architecture used to train MNIST was used as the baseline for this research. This model, as shown in [51], is a simple architecture that comprises two Conv2D layers and two Dense layers. In particular, the ReLU activation function was used for its ability to speed up gradient computation and to introduce non-linearity into the network [52]. While this model does not have a formal name, we will refer to it as the MNIST model for the remainder of this paper. For this model and the sequential models listed, the Adam optimizer was used because it is a leading adaptive stochastic gradient descent optimizer. For the loss, Sparse Categorical Cross Entropy was used because it works well for models predicting multiple classes.
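A hedged sketch of this baseline MNIST model is shown below: two Conv2D layers and two Dense layers with ReLU activations, compiled with Adam and Sparse Categorical Cross Entropy, plus the Batch Normalization layers added after each Conv2D layer as described later in this section. The filter counts and pooling layers are illustrative assumptions.

from tensorflow.keras import layers, models

def build_mnist_model(input_shape=(112, 112, 3), num_classes=10):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.BatchNormalization(),           # added after each Conv2D layer
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model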
For the second model, LeNet5 was used. LeNet is a convolutional neural network that was introduced by Yann LeCun and his colleagues at Bell Labs in 1998 and “is considered the classic model that laid the foundation for deep learning” [43,53]. It was proposed for recognizing handwritten images. Since the architecture is small, it is also easy and fast to run. Given that this research uses the MNIST dataset and a derivative of MNIST, this model was a consistent choice for showing progress with a more efficient architecture. The difference between this model and models used prior to its creation, aside from the number of parameters, was the change from the sigmoid to the tanh activation function. This change allowed for higher gradient values when training neural networks [54].
While the first two models are relatively small, the third model dwarfs them both and is much more heavily involved. VGG16 is a CNN that was introduced and developed by K. Simonyan and A. Zisserman of the University of Oxford in 2014. It gained recognition because it achieved an accuracy of 92.7% on ImageNet, which was unmatched at the time. Additionally, this model started a shift in newer architectures by showing that a model could learn with convolutional kernels reduced to (3, 3), as opposed to the (11, 11) kernels used at the time [44,55]. Like the previous models, VGG16 is used in image classification, image recognition, and object detection tasks. The potential downside to using this model is its size: VGG16 is composed of 13 convolutional layers, 5 max-pooling layers, and 3 fully connected layers. It should be noted that this is the model that was used to train on the Ishihara-Like MNIST dataset in prior work. From their documentation, it appears that [36] used the standard model without any modifications except for the addition of a Batch Normalization layer after each Conv2D layer.
AlexNet was the fourth choice for a standard OCR model. AlexNet was introduced by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton at the University of Toronto in 2012. It was developed as a faster method for image recognition and classification tasks than previous models. The purpose of this model was to rectify previous issues in Deep Learning by combating overfitting with Dropout, changing the activation function from tanh to ReLU to mitigate vanishing gradients, and allowing overlapping pooling of the layers [45,56].
On top of the four distinct models, we decided to branch off and modify the architecture of AlexNet to be more efficient in terms of its size. In initial research and training, we found that AlexNet seemed to perform the best with the fewest epochs and the least time required to run on the Ishihara dataset. This allowed us to leverage a much smaller model than VGG16. In the fifth model, we reduced the number of AlexNet’s filters by a factor of 4, reduced the kernel size from (11, 11) to (5, 5), and reduced the dropout from 0.5 to 0.1. By reducing the number of filters, the model size also decreases by this factor. With a smaller kernel size, the model would hopefully be able to generalize features better. Finally, by decreasing the dropout, fewer neurons are dropped out during training. In the remainder of this paper, this model will be known as Custom 1. In the sixth model, we reduced the filters by a factor of 8 but kept all of the other parameters the same. This model will be referred to as Custom 2. The goal of these two models was to run faster than either VGG16 or the original AlexNet while minimizing the performance loss of using a smaller model. Table 1 shows a summary of each of the models chosen for this research. To allow for replication of our process, the compiler and architectures of the two custom models in Python are shown in Figure 15 and Figure 16. For the other models (MNIST, LeNet5, and VGG16), the standard architectures with the addition of a Batch Normalization layer after each Conv2D layer were used.
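The sketch below illustrates the Custom 1 variant under the stated changes: every AlexNet filter count divided by 4, the first kernel shrunk from (11, 11) to (5, 5), and dropout lowered from 0.5 to 0.1; dividing by 8 instead yields Custom 2. The exact layer layouts are those in Figure 15 and Figure 16; this version assumes the standard AlexNet ordering, and the strides and dense-layer widths shown here are illustrative.

from tensorflow.keras import layers, models

def build_custom_alexnet(reduction=4, input_shape=(112, 112, 3), num_classes=10):
    f = lambda n: max(n // reduction, 1)                 # scale AlexNet's filter counts
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(f(96), (5, 5), strides=2, activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D((3, 3), strides=2),
        layers.Conv2D(f(256), (5, 5), padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D((3, 3), strides=2),
        layers.Conv2D(f(384), (3, 3), padding="same", activation="relu"),
        layers.Conv2D(f(384), (3, 3), padding="same", activation="relu"),
        layers.Conv2D(f(256), (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((3, 3), strides=2),
        layers.Flatten(),
        layers.Dense(f(4096), activation="relu"),
        layers.Dropout(0.1),                             # reduced from 0.5
        layers.Dense(f(4096), activation="relu"),
        layers.Dropout(0.1),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

custom_1 = build_custom_alexnet(reduction=4)
custom_2 = build_custom_alexnet(reduction=8)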

2.4. Metrics of Success

To compare each of the models in their evaluation of the datasets, a quantitative basis was established. For this basis, we used multiple metrics to determine which models performed the best in each test case. For each of the models and test cases run, the following metrics were recorded for evaluation (a short sketch of how these metrics can be computed follows the list):
  • Performance: the overall accuracy percentage the model achieved when predicting new images. This is the value that matters the most. The goal is to have the model evaluate with a high accuracy on images it has never seen before. This will be the metric we compare to previous research.
  • Precision: the percentage of correctly predicted positives out of all instances by the models.
  • Recall (TPR): the percentage of actual positives that are correctly identified by the models.
  • Training Time (in seconds): the amount of time it took for the model to train. As with the number of epochs, the goal is to train the model as quickly as possible.
  • Evaluation Time (in seconds): the amount of time it took for the model to evaluate a new image or batch of images. In the real world, this is the value that matters the most when incorporating the model into an OCR sensor.
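A hedged sketch of how these metrics can be computed from a trained Keras model is given below; the macro averaging and the timing bookkeeping are assumptions rather than the exact procedure used.

import time
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

def evaluate_model(model, x_test, y_test):
    start = time.time()
    y_pred = np.argmax(model.predict(x_test, verbose=0), axis=1)
    eval_time = time.time() - start                      # evaluation time for the whole batch
    return {
        "performance": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, average="macro"),
        "recall": recall_score(y_test, y_pred, average="macro"),
        "evaluation_time_s": eval_time,
        "per_image_time_s": eval_time / len(x_test),
    }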
It should be noted that [36] achieved a 99% performance accuracy using VGG16 on the color Ishihara-Like MNIST. It was not stated what percentage of the dataset was used to train this model or how long it took to train. However, in our research, we will be training with the color and grayscale versions of the dataset. When training and testing these models, an NVIDIA A100 80GB PCIe GPU was used.
Before testing, our assumptions were that the MNIST dataset would perform well on all of the models. Given that each of these models has been trained on MNIST previously, we would expect nothing less than 98–99% evaluation accuracy. With the Ishihara-Like MNIST dataset, we expected that the accuracy could be significantly lower with the smaller models (MNIST and LeNet) due to the complexity of the dataset, but on par with MNIST for the larger models (VGG16 and AlexNet). However, the grayscale version of the Ishihara set was expected to perform significantly worse than the colored version due to the reduction of training information. For the cross-testing of the datasets, it was presumed that the models would perform on par with random guessing, as the models are trained and tested with two different datasets. However, our hope was that they would perform slightly better than chance due to the incorporation of MNIST in the Ishihara-Like MNIST dataset.

2.5. Test Cases

With each of the models above, we sought to try a few variations in testing. While the end goal of this research was to perform better on the Ishihara-Like MNIST dataset than previous research, we thought it would be enlightening to compare the variations and analyze the output. By expanding the testing into two categories (color and grayscale) and by cross-testing the two datasets, we hoped to understand what features were being learned from these datasets. Additionally, if the datasets performed the same with grayscale as they did with color, this would show that the expansion of information from one channel to three had little or no effect on the ability of the neural networks to learn and extract features from these datasets. The test cases performed on each of the above models are shown in Figure 17.
In the first two test cases, the goal is to see how MNIST trains on each architecture, which provides initial expectations of how well the Ishihara-Like MNIST will perform. However, MNIST is a much simpler dataset to work with and has cohesive characters. While the colored version of MNIST is not needed, we want to observe whether there are any changes in the training output of MNIST when it is expanded from one channel to three. The next two test cases (Ishihara) are the crux of this research, wherein we try to exceed the performance values of previous research. For test cases 5–8, we seek to determine if the model has learned any feature extractions or feature spaces that will allow it to cross-test on a completely different dataset. Even though the Ishihara-Like MNIST dataset was created using MNIST, the dataset is different enough that it is unlikely for the models to perform extremely well.

3. Results

In the following section, the results from each of the tests are listed. In each table, the performance accuracy, the precision of the model, the recall of the model, the time it took for the model to train (in seconds), and the time it took for evaluation (in seconds) are listed as (accuracy, epochs, training time, evaluation time). To note, the training time is the total amount of time it took to train the model, and the evaluation time is the amount of time it took to predict the full set of testing images. Therefore, to calculate the time required to evaluate one image, divide the listed evaluation time by the total number of testing images. Additionally, only the Ishihara tests have associated confusion matrices, since they are the most pertinent to this research.
To set a baseline for the entire evaluation, the original MNIST dataset was trained and tested against each of the models. As shown in Table 2, it performed reasonably well as expected. We expected higher performance on the first two models (MNIST and LeNet5), but this decline could have been due to the resizing of the original dataset from (28, 28, 1) to (112, 112, 1). The image resizing was performed so that the MNIST dataset was the same size as the Ishihara-Like MNIST dataset. As shown in Table 3, injecting color in the MNIST images seemed to have little effect on the performance of the models. To reiterate, the performance was as expected, but slightly under the target value of 99% in half of the models.
In each of the models, the performance accuracy was within 1% or 2% of the others. While these results are what was expected, the LeNet5 model performed the worst on both versions of the dataset. We believe this is due to the tanh activation function being used instead of the modern ReLU function. The anomaly in these two tests is the performance of Custom 1 and Custom 2 as compared to the MNIST and LeNet5 models. Custom 1 and 2 were able to perform better than either of the other two models with fewer parameters. In the case of Custom 2, the architecture is roughly eight times smaller than the MNIST model but was able to outperform it. We believe this is due to the addition of more convolutional layers in conjunction with a larger kernel size.
Table 4 and Table 5 show the results for training and evaluating on the Ishihara-Like MNIST dataset with each of the models. Table 4 uses the grayscale version of the images and Table 5 uses the color version. These tests were broken into several parts. First, each plate was tested individually on each of the models, wherein each plate contained 10 k training images and 2 k testing images. It should be noted that we believe this is not an adequate number of images for the training phase of a model; however, this was the maximum number of images per plate available from [8]. For the last test, all of the plates were combined into one array, such that the array contained 90 k training images and 18 k testing images. In the cases where the models resulted in a 10% performance accuracy, this means that the model performed on par with chance, as there were ten classes in this dataset.
In the early phases of testing with this dataset, the first three models (MNIST, LeNet5, and VGG16) performed very poorly in both grayscale and in color. It was only when testing with all 90 k images together that better results were obtained. This prompted further investigation, given that VGG16 was used in the original training of this dataset. After analyzing the difference between these three models and AlexNet, we concluded that this was due to a lack of data normalization. This is the reason a Batch Normalization layer was inserted after each Conv2D layer in each model. This theory was further confirmed after reviewing [36] and finding that they inserted this layer in their training with VGG16. After the insertion of the layer, performance drastically increased.
As shown in Table 4, there is quite a variation in results between all of the models with the grayscale version of the images. We believe the decrease in performance is due to the small number of training images and the reduction of information with one channel, as opposed to three channels with color images. While VGG16 was the largest model with the most parameters, it was not able to consistently perform better than the rest of the models on the individual plates. AlexNet was able to consistently perform on average around 70–80%, whereas VGG16’s performance dipped quite heavily on plates four through eight. When examining the reduction in size of AlexNet, it appears that this had a sizable effect on the evaluation of the grayscale images, as the performance accuracy was consistently lower on average with Custom 1 and Custom 2. Even with this reduction in performance, the results were still consistent and stable across all of the tests, as was the case with the original AlexNet model.
When all of the plates were combined, all of the models were able to perform significantly better. In the case of VGG16, this resulted in a substantially higher performance than every other model. We presume this is due to the increased size of the model, which allowed for the extraction of features that would not be picked up by the smaller models. While we argue that a larger model is not the best pick for every dataset, it did perform quite well given the lack of information in these images. Even with a reduced architecture, AlexNet and the two custom architectures were able to achieve roughly 90% accuracy when using all of the grayscale images. On the other end of the performance scale, LeNet5 performed very poorly in almost every test in comparison to the other models, similar to how it did in Tests 1 and 2. Given that the LeNet model is very similar to the MNIST model, we attribute this failure to the tanh activation function. The results from the MNIST model fell between those of LeNet5 and AlexNet, therefore yielding no meaningful conclusions. The one anomaly in this test was the result from VGG16 on the rand plate. While the other models were able to train on this plate on par with the other plates, VGG16 was not able to train at all. It is unclear why this occurred. For this test, it should be noted that these images are very hard for a human to read; therefore, it is promising that an ML model is able to correctly evaluate them. Figure 18 shows the performance accuracy for each plate for each of the models with the grayscale images. As shown in this figure, AlexNet has an overall higher evaluation than the other models while also maintaining a relatively flat curve, indicating more stable training. The “rand” or random plate was removed from this graphic due to its low results.
The confusion matrices from the training on the grayscale images are shown in Figure 19. In each picture, the classes are listed from left to right and top to bottom. When reading these pictures, the rows represent the ground truth labels while the columns represent the predicted labels. For example, in the first confusion matrix with the MNIST model, the model incorrectly predicted the image as the value “0” four times when the correct label was actually “9” (this value is in the lower left-hand corner of the plot). In an ideal case, the matrices would show a solid diagonal line (representing 100% performance accuracy).
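For reference, a small sketch of how confusion matrices like those in Figure 19 can be produced (rows as ground-truth labels, columns as predicted labels) is shown below; the plotting details are illustrative and not the exact figure-generation code used.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

def plot_confusion(model, x_test, y_test, class_names=tuple("0123456789")):
    y_pred = np.argmax(model.predict(x_test, verbose=0), axis=1)
    cm = confusion_matrix(y_test, y_pred, normalize="true")    # row-normalized proportions
    ConfusionMatrixDisplay(cm, display_labels=class_names).plot(cmap="Blues")
    plt.show()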
Examining the matrices from the grayscale testing shows results that are consistent with errors in the evaluation of MNIST or standard OCR characters when a model’s performance is sub-par. As shown, a few of the classes were misidentified as another class due to their similarity. LeNet5 shows this the most clearly. In this case, it incorrectly predicted a “9” as a “4” 5.61% of the time, a “5” as an “8” 6.11% of the time, and a “7” as a “9” 9.94% of the time. It should also be noted that it predicted the value “1” as an “8” 13.39% of the time, which is an abnormal mis-classification. The next highest incorrect value is a “3” being predicted as an “8”, which is consistent with common mis-predictions of handwritten numbers. Given that the Ishihara-Like MNIST dataset was created using the original MNIST and MNIST was created using handwriting from many different people, it is not uncommon for there to be many variations of each class, potentially causing errors in training.
As shown in Table 5, the models performed significantly better on the color version of the dataset. Given the expansion of data from one channel into three channels with color, we anticipated that the training would fare significantly better. Even with the individual plates, the models were able to extract enough features to distinguish the ten distinct classes with only 10 k training images. In the case of using all of the images together, each of the models performed in the upper decile except for LeNet5. Again, we attribute the poor performance of LeNet5 to its activation function. When analyzing the training and evaluation times, AlexNet was able to perform within 5% of VGG16 in almost all cases but was able to train in half the required time and evaluate slightly faster. A notable difference in this test is that combining all of the plates into one set performed only slightly better than the plates individually. Figure 20 shows the performance accuracy for each plate on each of the models, and Figure 21 shows the confusion matrices for this test. When examining the matrices for the color version of the Ishihara images, the results were on par with what was shown in the table above. In each model except for LeNet5, the matrices resulted in at least a 93% accuracy for each of the ten classes, as expected.
In both grayscale and color, the “rand” or random plate had an abnormal effect on the ability of the models to train. This plate had drastically lower performance on each of the models than the rest of the plates. On each of the other plates, with all of the models, the results were in the upper decile with color, but this plate caused as low as 10% evaluation accuracy in the case of VGG16. While the difference in evaluation with the grayscale images is not noticeable outside of VGG16, it becomes quite noticeable with the color images. To deduce the reasoning behind this, we took a close look at the images. Figure 22 shows two examples of images from this plate. After reviewing the images, it became abundantly clear why the models had poor performance. The image on the left is a 7 and the image on the right is an 8; however, without the labels, we would not have been able to decipher the contents of these two circles. Realizing this, it warranted a test case wherein each of the models was trained on all of the color plates except the Random plate. Table 6 shows the results of this training. As shown, without the inclusion of the Random colors plate in the combination of all the other plates, the models were able to produce higher performance accuracy. In this run, we were able to match the VGG16 results of prior research. Therefore, we conclude that the inclusion of this plate had an overall negative effect on the training of these models. The confusion matrices for this test are provided in Figure 23.
To further examine the reasoning behind the decline in performance with this plate, the confusion matrices from both the grayscale and color trainings of this plate are provided in Figure 24 and Figure 25. In these matrices, only 200 images from each of the 10 distinct classes were evaluated, due to the limited number of images in the rand plate. As shown, each of the models struggled to identify the correct class for each of the ten classes in all of the matrices. This was likely the cause of the low performance in the test with all plates combined, for both grayscale and color. It should be noted that it is still unclear why VGG16 was unable to train at all on this plate; during the training process, the training accuracy and validation accuracy did not rise above 10%.
In Table 7, Table 8, Table 9 and Table 10, the results from cross-training the two sets are shown. In these test cases, each of the models was trained on the first dataset and then evaluated on the second dataset. Our initial hope was that training on MNIST and evaluating on Ishihara-Like MNIST would perform slightly better, but the results are not surprising given the difference between the datasets. As shown, all of the models performed within the realm of chance; in some cases, the models performed worse than chance. To note, the MNIST and LeNet5 models were able to perform better than chance when trained on the grayscale version of the Ishihara dataset and evaluated with MNIST. It is unclear why these models were able to perform 20–25% better than their counterparts; however, with only a 20% increase over chance in a set of ten classes, we believe this does not warrant further research. In the cases where the models were trained on MNIST and evaluated on the Ishihara-Like MNIST, we believe the performance would have been significantly better if the images had undergone some form of preprocessing. Such is the case with Solonko’s work, wherein the images underwent heavy modifications before they were evaluated by the neural network, resulting in high performance. However, there seems to be no correlation between these two datasets when training a neural network, even though the latter was created using the former.

4. Future Works

While only a small portion of the population is affected by color blindness, this research could be used to understand how ML models interpret images with color distortion. Additionally, the random plate and the grayscale version of these images show that an ML model can learn to extract features from images that are not human readable. We believe there is still much work that could be completed with this dataset.

4.1. Improvements

Keras and TensorFlow are very good tools for ML; however, if this project were to be expanded, PyTorch would be used. This is due to its ability to create more complex models, which would allow for a more systematic approach to adding more models to test. Additionally, in terms of performance and stability, using PyTorch would allow more diagnostic tools to be implemented into the program to help determine why a particular model was suffering in terms of accuracy and what improvements could be made. Following this idea, more models like ResNet and Inception could be added to the list of models above to analyze their performance in comparison with the other models.
As stated earlier, the Ishihara-Like MNIST dataset only reasonably recreated the red-green plates from the color blindness test. This only included using 8 of the 39 plates available. In future applications, we would like to follow the process of creating these circles for the other forms of red-green color blindness and create plates for blue-yellow color blindness. Given that each folder of this dataset only contained 10 k training images and 2 k testing images, more images would allow for an easier training process. Additionally, this generation of new images would include plates that are only able to be seen by individuals with these deficiencies. Finally, this type of research could be used to determine how easily readable a picture is for someone who is color blind or used to build a program that could create an Ishihara circle from any image (not just MNIST).

4.2. Extensions

While our paper focused on the training of models with color distortion, future applications of this research could use federated and split learning based-methods as seen in [57]. This would allow for the analysis of images that have possible corruption wherein the privacy or security of the dataset is a concern.

5. Conclusions

In this paper, we presented our research and findings on correctly identifying numerical characters in images with color distortion. While much work has been performed on this topic previously, we sought to use, expand, and improve upon the overall performance of correctly identifying the Ishihara-Like MNIST dataset. While prior work has only analyzed performance on the color (original) versions of the images, we sought to train models on the grayscale version as well. In our research, we did not perform any preprocessing on the images other than converting color to grayscale. For this analysis, we performed our tests with a standard model architecture used to train MNIST, LeNet5, VGG16, AlexNet, and two small custom modifications of AlexNet. For each model, we trained the neural network on each plate of images individually, followed by a combination of all of the images in one set.
In our findings, we concluded that all of the aforementioned models had lower than expected performance accuracy when trained on the grayscale version of the Ishihara-Like MNIST images (averaging around 60%). However, VGG16 was still able to perform better overall than the rest of the presented models. We believe this low performance with the grayscale images is due to the lack of information found in a single-channel image compared to the three channels of a color image; therefore, a larger model is able to extract more features from the images. Even though VGG16’s performance was good overall, the results were not consistent across every plate, whereas AlexNet provided stable results. To examine this in the future, multiple runs of each model on each of the individual grayscale plates would be required.
With the color version of the images, each of the models except LeNet5 performed very well, achieving results on average above 90%. This occurred with the individual plates and the test case where all of the plates were combined. During the evaluation of the colored Ishihara images, we discovered that the rand plate (which is a randomly colored Ishihara circle) had significantly lower performance than the other colored plates. After analyzing the images, it was found that these images are very similar in their appearance to the grayscale versions, thus resulting in a lower performance. In the case of the grayscale images, the images are hard for humans to distinguish. With the rand plate, we found them to be illegible.
We concluded that models for this dataset could be trained faster and perform nearly as well using an architecture that is at least 100 times smaller than previously researched. While it was not explicitly stated in former research how VGG16 was trained, what percentage of the dataset was used, or what steps were taken to pre-process the data, we believe that our results demonstrate an overall improvement due to the reduction in model size and the corresponding reduction in evaluation time. Using a smaller model on a dataset of this size may result in a small performance loss but allows the user to run the model significantly faster. As shown in the performance graphs above, the reduction in size of AlexNet caused a slight drop in performance in each test case but caused the models to become more stable in their results. The curve of performance accuracy was smoother with Custom 1 and Custom 2 than with the original AlexNet. With more hyper-tuning of the model parameters, we believe that the performance of models the size of AlexNet and smaller could be increased to match that of VGG16 while also maintaining the benefits of using a smaller model. Additionally, we believe that by understanding how the distortion of color affects images such as these, ML models could be improved to extract features more easily from everyday images.

Author Contributions

Conceptualization, C.H. and J.N.; methodology, C.H. and J.N.; software, C.H., J.N. and J.D.; validation, J.D. and A.J.M.; investigation, C.H.; writing—original draft preparation, C.H.; writing—review and editing, A.J.M.; supervision, A.J.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The code repository for this research is available at https://github.com/vtnsi/ishihara-mnist.git (accessed on 12 December 2024).

Acknowledgments

Thank you to Lisha Henshaw for her support in proofreading the draft manuscript. Thank you to Chelsy Ables for creating Figure 14 and the cover figure to this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Memon, J.; Sami, M.; Khan, R.A.; Uddin, M. Handwritten Optical Character Recognition (OCR): A Comprehensive Systematic Literature Review (SLR). IEEE Access 2020, 8, 142642–142668. [Google Scholar] [CrossRef]
  2. Tseng, Y.C.; Pan, H.K. Secure and invisible data hiding in 2-color images. In Proceedings of IEEE INFOCOM 2001, Conference on Computer Communications, Twentieth Annual Joint Conference of the IEEE Computer and Communications Society (Cat. No.01CH37213), Anchorage, AK, USA, 22–26 April 2001; Volume 2, pp. 887–896. [Google Scholar] [CrossRef]
  3. LeCun, Y.; Cortes, C.; Burges, C. MNIST Handwritten Digit Database. ATT Labs [Online]. Available online: http://yann.lecun.com/exdb/mnist (accessed on 12 December 2024).
  4. Baldominos, A.; Saez, Y.; Isasi, P. A Survey of Handwritten Character Recognition with MNIST and EMNIST. Appl. Sci. 2019, 9, 3169. [Google Scholar] [CrossRef]
  5. Cohen, G.; Afshar, S.; Tapson, J.; van Schaik, A. EMNIST: Extending MNIST to handwritten letters. In Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; pp. 2921–2926. [Google Scholar] [CrossRef]
  6. Nocentini, O.; Kim, J.; Bashir, M.Z.; Cavallo, F. Image Classification Using Multiple Convolutional Neural Networks on the Fashion-MNIST Dataset. Sensors 2022, 22, 9544. [Google Scholar] [CrossRef] [PubMed]
  7. Gerónimo, D.; Serrat, J.; López, A.M.; Baldrich, R. Traffic Sign Recognition for Computer Vision Project-Based Learning. IEEE Trans. Educ. 2013, 56, 364–371. [Google Scholar] [CrossRef]
  8. Shaker, A.; Saralajew, S.; Gashteovski, K.; Faust, I.; Xu, Z.; Kotnis, B.; Ben-Rim, W.; Lawrence, C. Ishihara Like MNIST. 2022. Available online: https://www.kaggle.com/datasets/ammarshaker/ishihara-mnist (accessed on 12 December 2024).
  9. Ishihara, S. Tests for colour-blindness, 1951.
  10. Picryl. Available online: https://picryl.com/media/eight-ishihara-charts-for-testing-colour-blindness-europe-wellcome-l0059155-cf3385 (accessed on 12 December 2024).
  11. We Are Colorblind. 2019. Available online: https://wearecolorblind.com/articles/a-quick-introduction-to-color-blindness/ (accessed on 12 December 2024).
  12. American Academy of Ophthalmology. 2018. Available online: https://www.aao.org/eye-health/anatomy/cones#:~:text=There%20are%20three%20types%20of,%2Dsensing%20cones%20(10%20percent) (accessed on 12 December 2024).
  13. National Eye Institute. Color Blindness. Available online: https://www.nei.nih.gov/learn-about-eye-health/eye-conditions-and-diseases/color-blindness (accessed on 12 December 2024).
  14. Mayo Clinic. Color Blindness. Available online: https://www.mayoclinic.org/diseases-conditions/poor-color-vision/symptoms-causes/syc-20354988 (accessed on 12 December 2024).
  15. National Eye Institute. Types of Color Vision Deficiency. Available online: https://www.nei.nih.gov/learn-about-eye-health/eye-conditions-and-diseases/color-blindness/types-color-vision-deficiency (accessed on 12 December 2024).
  16. GavinAdmin. Available online: https://doctorofeye.com/colour-blindness/ (accessed on 12 December 2024).
  17. MedlinePlus. Available online: https://medlineplus.gov/genetics/condition/achromatopsia/#frequency (accessed on 12 December 2024).
  18. PickPik. Available online: https://www.pickpik.com/fruit-mixed-color-food-assorted-variety-62464 (accessed on 12 December 2024).
  19. Pilestone Inc. Color Blind Vision Simulator. Available online: https://pilestone.com/pages/color-blindness-simulator-1 (accessed on 12 December 2024).
  20. Petrovic, G.; Fujita, H. Deep Correct: Deep Learning Color Correction for Color Blindness; IOS Press: Amsterdam, The Netherlands, 2017. [Google Scholar] [CrossRef]
  21. Lin, H.Y.; Chen, L.Q.; Wang, M.L. Improving Discrimination in Color Vision Deficiency by Image Re-Coloring. Sensors 2019, 19, 2250. [Google Scholar] [CrossRef] [PubMed]
  22. Jefferson, L.; Harvey, R. Accommodating color blind computer users. In Proceedings of the 8th International ACM SIGACCESS Conference on Computers and Accessibility, Portland, OR, USA, 23–25 October 2006; pp. 40–47. [Google Scholar] [CrossRef]
  23. Jefferson, L.; Harvey, R. An interface to support color blind computer users. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, San Jose, CA, USA, 28 April–3 May 2007; pp. 1535–1538. [Google Scholar] [CrossRef]
  24. Tsekouras, G.E.; Rigos, A.; Chatzistamatis, S.; Tsimikas, J.; Kotis, K.; Caridakis, G.; Anagnostopoulos, C.N. A Novel Approach to Image Recoloring for Color Vision Deficiency. Sensors 2021, 21, 2740. [Google Scholar] [CrossRef] [PubMed]
  25. de la Escalera, A.; Moreno, L.; Salichs, M.; Armingol, J. Road traffic sign detection and classification. IEEE Trans. Ind. Electron. 1997, 44, 848–859. [Google Scholar] [CrossRef]
  26. Bahlmann, C.; Zhu, Y.; Ramesh, V.; Pellkofer, M.; Koehler, T. A system for traffic sign detection, tracking, and recognition using color, shape, and motion information. In Proceedings of the IEEE Intelligent Vehicles Symposium, Las Vegas, NV, USA, 6–8 June 2005; pp. 255–260. [Google Scholar] [CrossRef]
  27. Creusen, I.; Hazelhoff, L.; de With, P. Color transformation for improved traffic sign detection. In Proceedings of the 2012 19th IEEE International Conference on Image Processing, Orlando, FL, USA, 30 September–3 October 2012; pp. 461–464. [Google Scholar] [CrossRef]
  28. Xie, Z.; Lyu, R. Whether pattern memory can be truly realized in deep neural network? Research Square 2024. [Google Scholar] [CrossRef]
  29. Solonko, M. Reading Color Blindness Charts: Deep Learning and Computer Vision. Available online: https://towardsdatascience.com/reading-color-blindness-charts-deep-learning-and-computer-vision-a8c824dd71cd (accessed on 12 December 2024).
  30. Bottou, L.; Cortes, C.; Denker, J.; Drucker, H.; Guyon, I.; Jackel, L.; LeCun, Y.; Muller, U.; Sackinger, E.; Simard, P.; et al. Comparison of classifier methods: A case study in handwritten digit recognition. In Proceedings of the 12th IAPR International Conference on Pattern Recognition, Vol. 3—Conference C: Signal Processing (Cat. No.94CH3440-5), Jerusalem, Israel, 9–13 October 1994; Volume 2, pp. 77–82. [Google Scholar] [CrossRef]
  31. GeeksforGeeks. MNIST Dataset: Practical Applications Using Keras and PyTorch. 2024. Available online: https://www.geeksforgeeks.org/mnist-dataset/ (accessed on 12 December 2024).
  32. Clanuwat, T.; Bober-Irizar, M.; Kitamoto, A.; Lamb, A.; Yamamoto, K.; Ha, D. Deep Learning for Classical Japanese Literature. arXiv 2018, arXiv:1812.01718. [Google Scholar] [CrossRef]
  33. Al-Noori, A.H.; Talib, M.; Harbi S., J. The Classification of Ancient Sumerian Characters using Convolutional Neural Network. In Proceedings of the 1st International Conference on Computing and Emerging Sciences, Lahore, Pakistan, 26–27 May 2023; SciTePress: Setúbal, Portugal, 2020. [Google Scholar] [CrossRef]
  34. Zalando Research; Crawford Company. Fashion Mnist. 2017. Available online: https://www.kaggle.com/datasets/zalando-research/fashionmnist (accessed on 12 December 2024).
  35. Xhaferra, E.; Cina, E.; Toti, L. Classification of Standard FASHION MNIST Dataset Using Deep Learning Based CNN Algorithms. In Proceedings of the 2022 International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), Ankara, Turkey, 20–22 October 2022; pp. 494–498. [Google Scholar] [CrossRef]
  36. Rim, W.B.; Shaker, A.; Xu, Z.; Gashteovski, K.; Kotnis, B.; Lawrence, C.; Quittek, J.; Saralajew, S. A Human-Centric Assessment of the Usefulness of Attribution Methods in Computer Vision. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Vilnius, Lithuania, 9–13 September 2024. [Google Scholar] [CrossRef]
  37. Potjewyd, G. The Color Code. 2022. Available online: https://theophthalmologist.com/business-profession/the-color-code (accessed on 12 December 2024).
  38. Wellcome Collection. Available online: https://wellcomecollection.org/search/works (accessed on 12 December 2024).
  39. Ishihara, S. Ishihara Instructions. Available online: https://web.stanford.edu/group/vista/wikiupload/0/0a/Ishihara.14.Plate.Instructions.pdf (accessed on 12 December 2024).
  40. Dhawale, K.; Vohra, A.S.; Jain, P.; Kumar, T. A Framework to Identify Color Blindness Charts Using Image Processing and CNN. In Communication, Networks and Computing: Second International Conference (CNC 2020), Gwalior, India, 29–31 December 2020; Revised Selected Papers 2; Springer: Singapore, 2021; pp. 100–109. [Google Scholar] [CrossRef]
  41. Ye, Q.; Doermann, D. Text Detection and Recognition in Imagery: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1480–1500. [Google Scholar] [CrossRef]
  42. Imran, F.; Hossain, D.M.A.; Mamun, M.A. Identification and Recognition of Printed Distorted Characters Using Proposed DCR Method. In Proceedings of the 2020 IEEE Region 10 Symposium (TENSYMP), Dhaka, Bangladesh, 5–7 June 2020; pp. 1478–1481. [Google Scholar] [CrossRef]
  43. Paravisionlab.co.in. LeNet-5: A Simple Yet Powerful CNN for Image Classification. Available online: https://paravisionlab.co.in/lenet-5-architecture/ (accessed on 12 December 2024).
  44. Boesch, G. Very Deep Convolutional Networks (VGG) Essential Guide. Available online: https://viso.ai/deep-learning/vgg-very-deep-convolutional-networks/ (accessed on 12 December 2024).
  45. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  46. Wang, Y.; Li, F.; Sun, H.; Li, W.; Zhong, C.; Wu, X.; Wang, H.; Wang, P. Improvement of MNIST Image Recognition Based on CNN. IOP Conf. Ser. Earth Environ. Sci. 2020, 428, 012097. [Google Scholar] [CrossRef]
  47. Cheng, S.; Shang, G.; Zhang, L. Handwritten digit recognition based on improved VGG16 network. In Proceedings of the Tenth International Conference on Graphics and Image Processing (ICGIP 2018), Chengdu, China, 12–14 December 2018; Volume 11069, pp. 954–962. [Google Scholar] [CrossRef]
  48. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
  49. Kaggle. Available online: https://www.kaggle.com/ (accessed on 12 December 2024).
  50. GeeksforGeeks. How to Choose Batch Size and Number of Epochs When Fitting a Model? 2024. Available online: https://www.geeksforgeeks.org/how-to-choose-batch-size-and-number-of-epochs-when-fitting-a-model/ (accessed on 12 December 2024).
  51. Kaggle, A.J. Available online: https://www.kaggle.com/code/amyjang/tensorflow-mnist-cnn-tutorial/ (accessed on 12 December 2024).
  52. Thakur, A. ReLU vs. Sigmoid Function in Deep Neural Networks. 2020. Available online: https://wandb.ai/ayush-thakur/dl-question-bank/reports/ReLU-vs-Sigmoid-Function-in-Deep-Neural-Networks--VmlldzoyMDk0MzI#:~:text=The%20model%20trained%20with%20ReLU,better%20when%20trained%20with%20ReLU (accessed on 12 December 2024).
  53. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  54. Kumar, S. Comparison of Sigmoid, Tanh and Relu Activation Functions. 2023. Available online: https://www.aitude.com/comparison-of-sigmoid-tanh-and-relu-activation-functions/ (accessed on 12 December 2024).
  55. Melanie. Unveiling the Secrets of the VGG Model: A Deep Dive with Daniel. Available online: https://datascientest.com/en/unveiling-the-secrets-of-the-vgg-model-a-deep-dive-with-daniel#:~:text=A%20little%20history,Recognition%20Challenge)%20competition%20in%202014 (accessed on 12 December 2024).
  56. Wei, J. AlexNet: The Architecture That Challenged CNNs. Available online: https://towardsdatascience.com/alexnet-the-architecture-that-challenged-cnns-e406d5297951 (accessed on 12 December 2024).
  57. Taheri, R.; Arabikhan, F.; Gegov, A.; Akbari, N. Robust Aggregation Function in Federated Learning. In International Conference on Information and Knowledge Systems; Springer: Cham, Switzerland, 2023; pp. 168–175. [Google Scholar]
Figure 1. Standard Ishihara Circle [10].
Figure 2. MNIST Ishihara Circle [8].
Figure 3. Color Blindness Spectrum [11,15,16,17].
Figure 4. A normal color image followed by the same image modified to emulate the seven different types of color blindness [18,19].
Figure 5. The modifications applied to the image during pre-processing in Solonko's work [29].
Figure 6. Original MNIST.
Figure 7. Fashion MNIST.
Figure 8. EMNIST.
Figure 9. The process to convert an MNIST character to an Ishihara-Like MNIST circle [36].
Figure 10. Sample Ishihara plates from the Ishihara test. The first number inside the parentheses is the correct number, followed by what an individual with red-green color blindness would see [38,39].
Figure 11. A plate that only red-green color-blind individuals can decipher. The value inside the circle is 73 [38].
Figure 12. Grayscale (left) and Color MNIST Ishihara Circle (right).
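The grayscale plates in Figure 12 are derived directly from the color plates, so the comparison can be reproduced with a single-channel conversion. The following is a minimal sketch using Pillow; the file name is a placeholder, and this is only one way to perform the conversion, not necessarily the exact transform used here.

```python
# Minimal sketch (placeholder file name): convert a color Ishihara-Like
# MNIST plate to the grayscale form compared in Figure 12.
from PIL import Image

color_plate = Image.open("ishihara_sample.png").convert("RGB")
gray_plate = color_plate.convert("L")  # single-channel luminance image
gray_plate.save("ishihara_sample_gray.png")
```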
Figure 13. Original (left) and Colorized (right) version of an MNIST digit using the “inferno” color mask.
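A colorization like the one in Figure 13 can be approximated by passing a normalized grayscale digit through a Matplotlib colormap. The sketch below assumes TensorFlow/Keras and Matplotlib are available and is an illustration of the idea, not necessarily the exact mask used for the figure.

```python
# Hedged sketch: apply the "inferno" colormap to a grayscale MNIST digit to
# obtain an RGB image in the spirit of the colorized digit in Figure 13.
import numpy as np
from matplotlib import cm
from tensorflow.keras.datasets import mnist

(x_train, _), _ = mnist.load_data()
digit = x_train[0].astype(np.float32) / 255.0   # 28x28 values scaled to [0, 1]
rgba = cm.inferno(digit)                        # 28x28x4 RGBA floats
rgb = (rgba[..., :3] * 255).astype(np.uint8)    # drop the alpha channel
print(rgb.shape)                                # (28, 28, 3)
```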
Figure 14. Program Description.
Figure 15. Custom 1 model in Python.
Figure 16. Custom 2 model in Python.
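Figures 15 and 16 present the Python definitions of the two custom models only as figures, so the sketch below is a hedged illustration of the general shape of such a model: a Keras CNN with five Conv2D and three Dense layers, in line with the layer counts in Table 1. The input shape, filter counts, and dense widths are illustrative assumptions, not the exact Custom 1 or Custom 2 configuration.

```python
# Hedged sketch of a compact AlexNet-style CNN in Keras: five Conv2D and three
# Dense layers, matching the Conv2D/Dense counts reported for the custom models
# in Table 1. The 128x128x3 input and all filter/unit counts are assumptions.
from tensorflow.keras import layers, models

def build_custom_cnn(input_shape=(128, 128, 3), num_classes=10):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 5, strides=2, activation="relu", padding="same"),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(2),
        layers.Conv2D(96, 3, activation="relu", padding="same"),
        layers.Conv2D(96, 3, activation="relu", padding="same"),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_custom_cnn()
model.summary()  # prints the layer list and trainable-parameter count
```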
Figure 17. Test cases performed on the datasets.
Figure 18. The performance accuracy for each of the models with the grayscale Ishihara plates. The rand plate was not used in this graphic.
Figure 19. Confusion matrices for the Grayscale Ishihara-Like MNIST dataset showing the correct evaluation of the 10 distinct classes. Each class contained 1800 testing images in these matrices.
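Confusion matrices like those in Figure 19 can be produced from the test labels and model predictions with scikit-learn. The sketch below uses simulated labels (1800 per class, roughly 90% correct) purely to show the mechanics; it is not the evaluation pipeline used for the figures.

```python
# Hedged sketch: build and plot a 10-class confusion matrix like Figure 19.
# The labels and predictions here are simulated stand-ins, not real model output.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

rng = np.random.default_rng(0)
y_true = np.repeat(np.arange(10), 1800)            # 1800 test images per class
noise = rng.integers(0, 10, size=y_true.size)      # random incorrect labels
y_pred = np.where(rng.random(y_true.size) < 0.9, y_true, noise)

cm = confusion_matrix(y_true, y_pred, labels=np.arange(10))
ConfusionMatrixDisplay(cm, display_labels=np.arange(10)).plot(cmap="Blues")
plt.show()
```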
Figure 20. The performance accuracy for each of the models with each of the colored Ishihara plates. The rand plate was not used in this graphic.
Figure 21. Confusion matrices for the Color Ishihara-Like MNIST dataset showing the correct evaluation of the 10 distinct classes. Each class contained 1800 testing images in these matrices.
Figure 22. Sample images from the “random” Ishihara plate. The image on the left shows a 7 and the image on the right shows an 8.
Figure 23. Confusion matrices for the Color Ishihara-Like MNIST dataset without the Random plate showing the correct evaluation of the 10 distinct classes. Each class contained 1800 testing images in these matrices.
Figure 24. Confusion matrices for the Grayscale Ishihara-Like MNIST Random plate showing the correct evaluation of the 10 distinct classes. Each class contained 200 testing images in these matrices.
Figure 25. Confusion matrices for the Color Ishihara-Like MNIST Random plate showing the correct evaluation of the 10 distinct classes. Each class contained 200 testing images in these matrices.
Table 1. A comparison of each of the models by number of parameters and model size.
Model Name | Number of Trainable Parameters | Number of Layers | Conv2D Layers | Dense Layers
MNIST | 2,416,330 | 8 | 3 | 2
LeNet5 | 1,214,006 | 8 | 2 | 3
VGG16 | 50,415,434 | 22 | 13 | 3
AlexNet | 23,357,514 | 19 | 5 | 3
Custom 1 | 1,469,466 | 19 | 5 | 3
Custom 2 | 371,154 | 19 | 5 | 3
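The statistics reported in Table 1 can be tallied programmatically for any Keras model. The helper below is a sketch; the two-layer toy model is purely illustrative and is not one of the six evaluated architectures.

```python
# Hedged sketch: tally trainable parameters and Conv2D/Dense layer counts,
# the quantities reported in Table 1, for an arbitrary Keras model.
import numpy as np
from tensorflow.keras import layers, models

def summarize_architecture(model):
    return {
        "trainable_parameters": int(sum(np.prod(list(w.shape))
                                        for w in model.trainable_weights)),
        "layers": len(model.layers),
        "conv2d_layers": sum(isinstance(l, layers.Conv2D) for l in model.layers),
        "dense_layers": sum(isinstance(l, layers.Dense) for l in model.layers),
    }

# Illustrative toy model; substitute any of the six trained architectures.
toy = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(8, 3, activation="relu"),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
print(summarize_architecture(toy))
```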
Table 2. For the above test, the original MNIST dataset was used to train and evaluate each model.
Test 1—Original MNIST (Accuracy, Precision, Recall, Training Time, Evaluation Time)—60 k Training Images, 10 k Testing Images
Model | Results
MNIST | (98.37%, 98.38%, 98.36%, 835 s, 6.33 s)
LeNet5 | (98.21%, 98.21%, 98.20%, 820 s, 7.45 s)
VGG16 | (99.14%, 99.13%, 99.15%, 493 s, 8.28 s)
AlexNet | (99.22%, 99.23%, 99.22%, 221 s, 6.39 s)
Custom 1 | (99.12%, 99.11%, 99.12%, 199 s, 6.69 s)
Custom 2 | (99.21%, 99.21%, 99.20%, 333 s, 6.12 s)
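Each results tuple in Tables 2-10 combines accuracy, precision, recall, and wall-clock training and evaluation times. The sketch below shows one way to produce such a tuple on the original MNIST data; the small stand-in model, epoch count, and macro averaging are illustrative assumptions rather than the exact training setup behind the tables.

```python
# Hedged sketch: produce an (accuracy, precision, recall, training time,
# evaluation time) tuple like those in Tables 2-10. The stand-in model,
# epoch count, and macro averaging are illustrative assumptions.
import time
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train[..., np.newaxis].astype("float32") / 255.0
x_test = x_test[..., np.newaxis].astype("float32") / 255.0

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

start = time.time()
model.fit(x_train, y_train, epochs=5, batch_size=128, verbose=0)
train_time = time.time() - start

start = time.time()
y_pred = np.argmax(model.predict(x_test, verbose=0), axis=1)
eval_time = time.time() - start

print((accuracy_score(y_test, y_pred),
       precision_score(y_test, y_pred, average="macro"),
       recall_score(y_test, y_pred, average="macro"),
       round(train_time, 2), round(eval_time, 2)))
```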
Table 3. In this test, the colored version of the MNIST dataset was used to train and evaluate each model.
Test 2—Color MNIST (Accuracy, Precision, Recall, Training Time, Evaluation Time)—60 k Training Images, 10 k Testing Images
Model | Results
MNIST | (98.68%, 98.68%, 98.66%, 957 s, 6.12 s)
LeNet5 | (97.90%, 97.91%, 97.87%, 938 s, 8.23 s)
VGG16 | (99.10%, 99.10%, 99.09%, 566 s, 8.30 s)
AlexNet | (99.24%, 99.26%, 99.23%, 276 s, 7.51 s)
Custom 1 | (98.98%, 98.99%, 98.96%, 189 s, 10.00 s)
Custom 2 | (99.16%, 99.17%, 99.16%, 247 s, 8.66 s)
Table 4. Training/Testing results of the grayscale Ishihara-Like MNIST images on each of the models with each of the plates followed by a run with all of the plates combined.
Test 3: Grayscale Ishihara (Accuracy, Precision, Recall, Training Time, Evaluation Time)—10 k Training Images, 2 k Testing Images Per Plate
Plate | MNIST | LeNet5 | VGG16
2 | (43.00%, 78.18%, 43.00%, 113 s, 0.74 s) | (33.85%, 53.03%, 33.85%, 152 s, 0.84 s) | (84.85%, 88.25%, 84.85%, 391 s, 1.44 s)
3 | (65.35%, 78.47%, 65.35%, 149 s, 0.77 s) | (42.00%, 64.12%, 42.00%, 153 s, 1.07 s) | (93.65%, 94.15%, 93.65%, 388 s, 1.07 s)
4 | (50.90%, 72.67%, 50.90%, 152 s, 0.92 s) | (40.75%, 49.52%, 40.75%, 147 s, 0.90 s) | (75.90%, 83.55%, 75.90%, 383 s, 0.85 s)
5 | (41.50%, 73.07%, 41.50%, 148 s, 0.88 s) | (25.95%, 57.72%, 25.95%, 151 s, 0.91 s) | (78.35%, 85.23%, 78.35%, 383 s, 1.03 s)
6 | (45.20%, 73.90%, 45.20%, 152 s, 0.90 s) | (31.40%, 61.03%, 31.40%, 129 s, 0.87 s) | (43.95%, 66.19%, 43.95%, 383 s, 1.11 s)
7 | (42.50%, 62.24%, 42.50%, 151 s, 0.72 s) | (32.45%, 61.74%, 32.45%, 151 s, 0.76 s) | (75.50%, 82.82%, 75.50%, 362 s, 0.95 s)
8 | (82.65%, 82.88%, 82.65%, 152 s, 0.80 s) | (50.55%, 64.38%, 50.55%, 151 s, 0.59 s) | (68.30%, 81.07%, 68.30%, 359 s, 0.90 s)
9 | (82.40%, 82.92%, 82.40%, 151 s, 0.79 s) | (41.20%, 62.24%, 41.20%, 149 s, 0.73 s) | (91.60%, 91.93%, 91.60%, 384 s, 0.89 s)
rand | (80.30%, 80.54%, 80.30%, 152 s, 0.72 s) | (60.45%, 60.83%, 60.45%, 150 s, 0.71 s) | (10.00%, 1.00%, 10.00%, 382 s, 0.95 s)
all | (91.16%, 91.35%, 91.16%, 1238 s, 6.08 s) | (78.43%, 78.55%, 78.43%, 1214 s, 5.84 s) | (98.53%, 98.53%, 98.53%, 732 s, 7.49 s)
Plate | AlexNet | Custom 1 | Custom 2
2 | (57.70%, 77.69%, 57.70%, 177 s, 0.91 s) | (81.00%, 85.56%, 81.00%, 166 s, 0.87 s) | (55.35%, 76.27%, 55.35%, 173 s, 0.91 s)
3 | (85.75%, 87.47%, 85.75%, 177 s, 0.80 s) | (76.85%, 81.76%, 76.85%, 167 s, 0.93 s) | (69.60%, 75.39%, 69.60%, 165 s, 0.82 s)
4 | (87.05%, 89.04%, 87.05%, 174 s, 0.94 s) | (69.70%, 80.16%, 69.70%, 172 s, 0.85 s) | (61.10%, 76.24%, 61.10%, 167 s, 0.91 s)
5 | (82.45%, 88.81%, 82.45%, 179 s, 0.89 s) | (72.50%, 81.78%, 72.50%, 172 s, 0.89 s) | (69.20%, 76.53%, 69.20%, 170 s, 0.88 s)
6 | (83.55%, 87.31%, 83.55%, 177 s, 0.86 s) | (67.10%, 79.72%, 67.10%, 175 s, 0.90 s) | (74.60%, 77.64%, 74.60%, 172 s, 0.95 s)
7 | (78.75%, 83.65%, 78.75%, 175 s, 0.70 s) | (68.50%, 81.56%, 68.50%, 173 s, 0.74 s) | (67.80%, 74.95%, 67.80%, 172 s, 0.75 s)
8 | (85.00%, 86.97%, 85.00%, 174 s, 0.75 s) | (68.85%, 77.39%, 68.85%, 172 s, 0.77 s) | (66.65%, 74.87%, 66.65%, 170 s, 0.75 s)
9 | (67.85%, 81.03%, 67.85%, 177 s, 0.70 s) | (76.65%, 81.07%, 76.65%, 173 s, 0.69 s) | (75.35%, 76.99%, 75.35%, 169 s, 0.68 s)
rand | (78.25%, 81.50%, 78.25%, 175 s, 0.73 s) | (72.50%, 75.84%, 72.50%, 169 s, 0.73 s) | (68.30%, 74.57%, 68.30%, 169 s, 0.79 s)
all | (89.84%, 90.9%, 89.84%, 1413 s, 5.81 s) | (90.29%, 90.83%, 90.29%, 1391 s, 6.10 s) | (88.06%, 88.67%, 88.06%, 1395 s, 5.94 s)
Table 5. Training/Testing results of the color Ishihara-Like MNIST images on each of the models with each of the plates followed by a run with all of the plates combined.
Test 4: Color Ishihara (Accuracy, Precision, Recall, Training Time, Evaluation Time)—10 k Training Images, 2 k Testing Images Per Plate
Plate | MNIST | LeNet5 | VGG16
2 | (94.25%, 94.28%, 94.25%, 162 s, 1.44 s) | (92.45%, 92.44%, 92.45%, 172 s, 1.42 s) | (98.25%, 98.27%, 98.25%, 413 s, 1.60 s)
3 | (94.30%, 94.31%, 94.30%, 172 s, 1.45 s) | (93.35%, 93.38%, 93.35%, 166 s, 1.44 s) | (97.75%, 97.83%, 97.75%, 406 s, 1.35 s)
4 | (95.90%, 95.91%, 95.90%, 171 s, 1.42 s) | (93.45%, 93.44%, 93.45%, 166 s, 1.38 s) | (97.50%, 97.58%, 97.50%, 401 s, 1.60 s)
5 | (94.60%, 94.63%, 94.60%, 174 s, 1.43 s) | (92.25%, 92.32%, 92.25%, 171 s, 1.44 s) | (97.95%, 97.99%, 97.95%, 403 s, 1.77 s)
6 | (95.40%, 95.42%, 95.40%, 171 s, 1.41 s) | (92.80%, 92.86%, 92.80%, 171 s, 1.47 s) | (97.45%, 97.50%, 97.45%, 392 s, 2.79 s)
7 | (96.45%, 96.45%, 96.45%, 171 s, 1.55 s) | (93.00%, 93.01%, 93.00%, 169 s, 0.98 s) | (99.00%, 99.00%, 99.00%, 402 s, 0.92 s)
8 | (94.75%, 94.76%, 94.75%, 171 s, 1.00 s) | (92.15%, 92.19%, 92.15%, 171 s, 1.03 s) | (99.10%, 99.11%, 99.10%, 398 s, 1.17 s)
9 | (95.15%, 95.16%, 95.15%, 170 s, 0.94 s) | (92.80%, 92.83%, 92.80%, 168 s, 0.89 s) | (98.55%, 98.56%, 98.55%, 401 s, 1.06 s)
rand | (78.95%, 79.08%, 78.95%, 171 s, 1.19 s) | (44.20%, 43.83%, 44.20%, 168 s, 1.21 s) | (10.00%, 1.00%, 10.00%, 404 s, 0.92 s)
all | (92.30%, 92.40%, 92.30%, 1328 s, 10.74 s) | (82.26%, 83.68%, 82.26%, 1319 s, 8.85 s) | (98.31%, 98.32%, 98.31%, 572 s, 11.73 s)
Plate | AlexNet | Custom 1 | Custom 2
2 | (94.10%, 95.10%, 94.10%, 194 s, 1.41 s) | (95.75%, 95.95%, 95.75%, 191 s, 1.44 s) | (95.95%, 96.01%, 95.95%, 190 s, 1.36 s)
3 | (96.30%, 96.43%, 96.30%, 196 s, 1.50 s) | (96.55%, 96.62%, 96.55%, 182 s, 1.51 s) | (94.55%, 94.84%, 94.55%, 190 s, 1.43 s)
4 | (89.95%, 91.19%, 89.95%, 190 s, 1.42 s) | (96.85%, 96.92%, 96.85%, 190 s, 1.46 s) | (96.40%, 96.45%, 96.40%, 188 s, 1.41 s)
5 | (95.25%, 95.43%, 95.25%, 195 s, 1.44 s) | (97.25%, 97.27%, 97.25%, 186 s, 1.46 s) | (96.20%, 96.26%, 96.20%, 190 s, 1.45 s)
6 | (96.85%, 96.89%, 96.85%, 196 s, 1.45 s) | (96.65%, 96.84%, 96.65%, 190 s, 1.48 s) | (97.10%, 97.13%, 97.10%, 182 s, 1.47 s)
7 | (94.40%, 95.04%, 94.40%, 193 s, 0.86 s) | (97.40%, 97.43%, 97.40%, 191 s, 0.94 s) | (96.55%, 96.69%, 96.55%, 188 s, 0.89 s)
8 | (93.70%, 94.49%, 93.70%, 195 s, 0.97 s) | (95.75%, 96.05%, 95.75%, 192 s, 1.07 s) | (96.20%, 96.33%, 96.20%, 192 s, 0.91 s)
9 | (95.70%, 95.92%, 95.70%, 199 s, 0.85 s) | (92.50%, 93.81%, 92.50%, 191 s, 0.94 s) | (96.95%, 97.00%, 96.95%, 192 s, 1.46 s)
rand | (83.70%, 85.24%, 83.70%, 196 s, 1.02 s) | (76.50%, 78.74%, 76.50%, 194 s, 0.91 s) | (76.20%, 77.45%, 76.20%, 194 s, 0.94 s)
all | (96.77%, 96.78%, 96.77%, 1512 s, 9.07 s) | (94.38%, 94.54%, 94.38%, 1484 s, 10.36 s) | (90.81%, 90.92%, 90.81%, 1472 s, 9.48 s)
Table 6. This test trained each of the models with all of the plates combined except for the Random plate.
Ishihara Color—Test Case Without the Random Plate (Accuracy, Precision, Recall, Training Time, Evaluation Time)
Model | Results
MNIST | (97.19%, 97.20%, 97.19%, 1395 s, 9.87 s)
LeNet5 | (96.11%, 96.11%, 96.11%, 1382 s, 10.48 s)
VGG16 | (98.88%, 98.89%, 98.88%, 640 s, 11.54 s)
AlexNet | (98.55%, 98.56%, 98.55%, 493 s, 10.00 s)
Custom 1 | (98.45%, 98.46%, 98.45%, 852 s, 11.49 s)
Custom 2 | (98.09%, 98.11%, 98.09%, 1537 s, 9.28 s)
Table 7. Training each of the models with the original version of the MNIST dataset and then testing on the grayscale version of the Ishihara-Like MNIST dataset.
Test 5—Original MNIST with Gray Ishihara (Accuracy, Precision, Recall, Training Time, Evaluation Time)—60 k Training Images, 10 k Testing Images
Model | Results
MNIST | (10.00%, 1.00%, 10.00%, 805 s, 4.77 s)
LeNet | (10.01%, 2.67%, 10.01%, 821 s, 5.55 s)
VGG16 | (12.48%, 12.51%, 12.48%, 536 s, 7.83 s)
AlexNet | (10.02%, 4.06%, 10.02%, 182 s, 5.35 s)
Custom 1 | (9.95%, 4.56%, 9.95%, 158 s, 5.95 s)
Custom 2 | (10.23%, 4.13%, 10.23%, 267 s, 6.26 s)
Table 8. Training each of the models with a colored version of the MNIST dataset and then testing on the original version of the Ishihara-Like MNIST dataset.
Test 7—Color MNIST with Color Ishihara (Accuracy, Precision, Recall, Training Time, Evaluation Time)—60 k Training Images, 10 k Testing Images
Model | Results
MNIST | (9.93%, 5.64%, 9.93%, 950 s, 8.65 s)
LeNet | (10.49%, 2.22%, 10.49%, 934 s, 9.62 s)
VGG16 | (10.00%, 19.12%, 10.00%, 523 s, 9.49 s)
AlexNet | (10.06%, 5.05%, 10.06%, 209 s, 8.27 s)
Custom 1 | (10.02%, 2.62%, 10.02%, 249 s, 10.36 s)
Custom 2 | (10.06%, 7.17%, 10.06%, 246 s, 10.57 s)
Table 9. Training each of the models with the grayscale version of the Ishihara-Like MNIST dataset and then testing on the original version of the MNIST dataset.
Test 6—Grayscale Ishihara with Original MNIST (Accuracy, Precision, Recall, Training Time, Evaluation Time)—60 k Training Images, 10 k Testing Images
Model | Results
MNIST | (32.13%, 31.14%, 32.94%, 785 s, 5.84 s)
LeNet | (35.81%, 51.21%, 35.46%, 819 s, 5.70 s)
VGG16 | (9.84%, 11.98%, 10.02%, 1104 s, 6.56 s)
AlexNet | (10.28%, 1.03%, 10.00%, 944 s, 5.34 s)
Custom 1 | (9.68%, 1.59%, 9.94%, 959 s, 6.08 s)
Custom 2 | (10.32%, 1.03%, 10.00%, 912 s, 6.63 s)
Table 10. Training each of the models with the original version of the Ishihara-Like MNIST dataset and then testing on the colored version of the MNIST dataset.
Test 8—Color Ishihara with Color MNIST (Accuracy, Precision, Recall, Training Time, Evaluation Time)—60 k Training Images, 10 k Testing Images
Model | Results
MNIST | (11.20%, 7.21%, 10.95%, 938 s, 7.04 s)
LeNet | (25.55%, 55.60%, 24.57%, 901 s, 7.15 s)
VGG16 | (11.48%, 13.30%, 10.13%, 1126 s, 7.80 s)
AlexNet | (9.74%, 0.97%, 10.00%, 1058 s, 8.19 s)
Custom 1 | (9.05%, 1.48%, 9.22%, 1029 s, 7.77 s)
Custom 2 | (12.29%, 6.49%, 12.05%, 1026 s, 8.60 s)