2. Related Work
To select a target of interest according to the user’s intention, approaches to locating the target of the user’s gaze can be categorized into gaze-based and visual saliency-based methods. The former category includes eye blinks, dwell time, on- and off-screen buttons, keystrokes, eyebrow raises, and speech. Blinking can be used to select letters of the alphabet in eye typing, where gaze direction is used to point at the letters [
9]. Blinking for commands and normal blinking are usually difficult to discriminate [
10]; thus, eyes that are closed for a lengthy period may be required for such discrimination, which can affect performance and the user’s convenience. The dwell time-based method seems to be a better and more natural approach compared to the blinking-based method [
11]. However, for this approach, the gaze tracking system requires knowledge of the position and duration of the user’s gaze on the selected object. Past research [
12,
13,
14] has considered methods based on the dwell time of the user’s gaze position. Further, Huckauf et al. proposed accomplishing target selection with antisaccades instead of dwell time or blink-based selection [
15]. Ware et al. proposed selecting the object of interest by fixation and subsequent saccade towards on/off buttons [
16].
In past research [
11,
12,
13,
14,
15,
16,
17,
18,
19,
20,
21,
22,
23], one modality has been used to point to or to select the object of interest. However, these methods typically suffer from problems whereby the object is selected every time the user looks at it, intentionally or unintentionally. This problem was first pointed out as the “Midas touch problem” [
17]. It needs to be avoided, and the object a user intentionally looks at should be discriminated from points gazed at unintentionally. Selecting the object of interest by eye blink is hampered by the difficulty of discriminating intentional from unintentional blinks. Dwell time-based selection encounters a similar problem: if the dwell time is very short, it suffers from the Midas touch problem described above; if it is too long, it degrades performance and tires the user [
18]. A possible solution is to use graphical on/off screen buttons, but these too can be problematic, as they can interfere with the user’s intention and draw the gaze away from the object of interest. Some past work [
19,
20] has proposed the use of manual inputs, such as keystrokes, combined with gaze control for pointing at and selecting objects of interest. Surakka et al. introduced the idea of selecting the object of interest by frowning [
21]. A method based on eye blinks or eyebrow raises was proposed by Grauman et al., where these gestures were used to point at and select the object and convey the relevant command [
22]. Tuisku et al. used gazing and smiling to this end, where gazing was used to point to the object and smiling as a selection tool [
23]. Although these techniques performed well for their intended purposes, they offer limited accuracy and speed when selecting targets according to the user’s intention.
To solve the problems of single modality-based methods, multiple modality-based methods need to be explored. The authors of [
24] proposed a multiple modality-based method using pupil accommodation and dwell time. However, because it did not consider information from the monitor image (textural or high-frequency information in the area at which the user gazes), there is room for improvement in detecting the user’s gaze for target selection. Furthermore, the accurate measurement of pupil accommodation and dwell time depends on the correct detection of pupil size and of the pupil and corneal glint centers; incorrect detection therefore degrades system performance.
Researchers have also shown considerable interest in visual attention models that can be implemented to classify intentional and unintentional gazes [
25]. Saliency-based models have been used to measure how likely a location is to attract the observer’s attention, and a saliency map of candidate gaze points can be obtained from a visual image. Such attentional selection is a low-cost preprocessing step through which visual systems (biological or artificial) extract the most noticeable information from a given scene [
26]. Accurate saliency prediction has a number of applications, such as salient object detection, automatic vehicle guidance, scene understanding, and robot navigation, particularly fast region of interest (ROI) selection in complex visual scenes [
27]. A large number of saliency-based methods for detecting intentional gazes have been proposed [
28,
29,
30,
31,
32,
33,
34,
35,
36]. Most saliency models are biologically inspired and based on bottom-up image cues. Many theoretical models are based on the feature integration theory proposed by Treisman and Gelade [
37], who analyzed the visual features that are combined to direct human attention in conjunction search tasks. A feed-forward model combining these features with the concept of a saliency map was proposed by Koch and Ullman [
38], which was implemented and verified by Itti et al. [
39]. Different saliency models are based on low-, middle-, and high-level features. All models have different assumptions and methodologies focusing on different aspects of human visual behavior. These models can mainly be categorized into bottom-up, top-down, and learning-based approaches [
27].
In the bottom-up approach [
39,
40,
41,
42,
43,
44,
45], saliency models use biologically plausible low-level features based on the computational principles proposed by Itti et al. [
39]. On the basis of image features, i.e., color, intensity, and orientation, they derived bottom-up visual saliency from center-surround differences across multiple scales. Bruce and Tsotsos [
40] proposed the attention based on information maximization (AIM) saliency model, but it is weak in detecting local dissimilarities (i.e., local vs. global saliency). A graph-based visual saliency (GBVS) model was proposed by Harel et al. based on the assumption that image patches that are different from surrounding patches are salient [
41]. Like Itti et al.’s method, GBVS also fails to detect global visual saliency. Liu et al. [
42] identified ROI by conditional random fields (CRF) using three types of features: multi-scale contrast, center-surround histogram, and color spatial distribution. Kong et al. proposed the idea of a multi-scale integration strategy to combine various low-level saliency features [
43]. Fang et al. proposed compressed domain saliency detection using color and motion characteristics [
44]. Jiang et al. assumed that the area of interest of visual attention has a shape prior and detected salient regions using contour energy computation [
45]. Low-level feature-based methods perform well in general but do not consider semantic information, such as faces, humans, animals, objects, and text.
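As a concrete illustration of the center-surround principle behind such bottom-up models, the following minimal sketch computes a crude saliency map from the intensity channel only; the pyramid levels, surround offsets, and min-max normalization are illustrative assumptions, and the full model of Itti et al. additionally uses color and orientation channels with a different normalization operator:

```python
import cv2
import numpy as np

def intensity_saliency(image_bgr):
    """Crude center-surround saliency on the intensity channel only.
    A Gaussian pyramid is built; "center" levels c are compared with
    "surround" levels c + delta, and the normalized absolute differences
    are accumulated at a common resolution."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0

    pyramid = [gray]
    for _ in range(8):                       # levels 0..8
        pyramid.append(cv2.pyrDown(pyramid[-1]))

    h, w = pyramid[4].shape                  # accumulate at level 4
    saliency = np.zeros((h, w), np.float32)
    for c in (2, 3, 4):                      # center levels (assumed)
        for delta in (3, 4):                 # surround offsets (assumed)
            center = cv2.resize(pyramid[c], (w, h))
            surround = cv2.resize(pyramid[c + delta], (w, h))
            fmap = np.abs(center - surround)                       # center-surround difference
            fmap = (fmap - fmap.min()) / (fmap.max() - fmap.min() + 1e-8)
            saliency += fmap

    saliency /= saliency.max() + 1e-8
    return cv2.resize(saliency, (gray.shape[1], gray.shape[0]))    # back to image size
```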
To solve this problem, the top-down approach has been researched [
46,
47,
48,
49]. Several studies have adopted this approach and attained performance improvement by adding high-level features. Cerf et al. [
46] added a high-level factor, face detection, to Itti’s model and showed that it enhances performance. Judd et al. [
47] proposed a saliency detection model by learning the best weights for all combined features using support vector machines (SVMs). Chang et al. [
48] subsequently proposed an object-based saliency model, relying on the claim that observers tend to look at the centers of objects. Hence, top-down models have highlighted the importance of high-level and semantic features, e.g., faces, animals, the image-centric bias, and the object-centric bias. A major problem with this approach is that it is training dependent and often fails to detect salient features for which its models were not trained. To address the limitations of the bottom-up and top-down approaches, learning-based methods have been researched [
50,
51,
52,
53,
54,
55,
56,
57]. An AdaBoost-based model for feature selection was proposed by Borji [
52]. It models complex input data by combining a series of base classifiers. Reingal et al. observed that there were different statistics for fixated patches compared to random patches [
53]. Borji et al. [
56] later combined saliency models using two simple combination rules (multiplication and sum) with different normalization schemes (identity, exponential, and logarithmic). Evolutionary optimization algorithms have been used by some researchers to find optimal sets of combination weights [
57].
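For illustration, a minimal sketch of combining saliency maps by sum or multiplication after simple normalization, in the spirit of [56], is given below; the exact normalization functions and weighting used in that work may differ:

```python
import numpy as np

def normalize(sal_map, scheme="identity"):
    """Rescale to [0, 1], then optionally apply an exponential or
    logarithmic transform before combination."""
    m = (sal_map - sal_map.min()) / (sal_map.max() - sal_map.min() + 1e-9)
    if scheme == "exponential":
        return np.exp(m) - 1.0
    if scheme == "logarithmic":
        return np.log1p(m)
    return m                                   # identity

def combine(saliency_maps, rule="sum", scheme="identity"):
    """Combine several saliency maps of equal size by summation or
    element-wise multiplication of their normalized versions."""
    normed = [normalize(m, scheme) for m in saliency_maps]
    if rule == "multiplication":
        combined = np.ones_like(normed[0])
        for m in normed:
            combined = combined * m
        return combined
    return np.sum(normed, axis=0)
```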
All saliency-based methods provide only information about regions where people tend to gaze with higher probability, rather than the accurate position of an intentional gaze. Therefore, to overcome the limitations of past work on both gaze-based and visual saliency-based methods, we propose a fuzzy system-based target selection method for near-infrared (NIR) camera-based gaze trackers that fuses the gaze-based method with the bottom-up visual saliency-based method. Our method combines multi-modal inputs, i.e., pupil accommodation measured by template matching, short dwell time, and Gabor filtering-based texture information of visual saliency, using a fuzzy system. Our research is novel compared to past research in the following four ways; the first, second, and third points represent major novelties, whereas the fourth is a minor one:
- First, a new and improved method based on the Chan–Vese algorithm is proposed for detecting the pupil center and boundary as well as the glint center.
- Second, we use three features, i.e., the change in pupil size measured by template matching (to capture pupil accommodation), the change in gaze position during a short dwell time, and Gabor filtering-based texture information of the monitor image at the gaze target. A fuzzy system then takes these three features as inputs, and the decision concerning the user’s target selection is made through defuzzification (an illustrative sketch follows this list).
- Third, an optimal input membership function for the fuzzy system can be obtained based on the maximum entropy criterion.
- Fourth, through comparative experiments with an on-screen keyboard against a previous dwell time-based method, the performance and usability of our method were verified in a real gaze-tracking environment.
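As referenced in the second point above, the following is a minimal, illustrative sketch of how three normalized feature scores could be combined by a Mamdani-style fuzzy system with the MIN rule and center-of-gravity (COG) defuzzification; the membership functions and rule table here are simplified placeholders, not the maximum entropy-based functions and rules derived in Section 3:

```python
import numpy as np

# Illustrative membership functions on [0, 1]; the actual functions in our
# system are derived from the maximum entropy criterion (Section 3).
def low(x):
    return np.clip((0.6 - x) / 0.6, 0.0, 1.0)

def high(x):
    return np.clip((x - 0.4) / 0.6, 0.0, 1.0)

def fuzzy_selection_score(accommodation, dwell_stability, texture):
    """Combine three feature scores (each pre-normalized to [0, 1]) using
    MIN implication, MAX aggregation, and COG defuzzification. The rule
    table is a simplified placeholder: only the all-HIGH combination maps
    to a HIGH output; every other combination maps to a LOW output."""
    y = np.linspace(0.0, 1.0, 201)                     # discretized output domain
    aggregated = np.zeros_like(y)
    inputs = (accommodation, dwell_stability, texture)

    for combo in np.ndindex(2, 2, 2):                  # 0 = LOW, 1 = HIGH per input
        degrees = [(low, high)[c](v) for c, v in zip(combo, inputs)]
        strength = min(degrees)                        # MIN rule (firing strength)
        out_mf = high(y) if all(c == 1 for c in combo) else low(y)
        aggregated = np.maximum(aggregated, np.minimum(strength, out_mf))

    return float(np.sum(y * aggregated) / (np.sum(aggregated) + 1e-9))  # COG

# Example: strong accommodation, stable gaze, and a textured region give a high score.
print(fuzzy_selection_score(0.9, 0.8, 0.7))
```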
In
Table 1, we summarize the comparison of the proposed method with existing methods.
The remainder of this paper is organized as follows: in
Section 3, our proposed system and methodology are introduced. The experimental setup is explained and the results are presented in
Section 4.
Section 5 contains our conclusions and discussion of some ideas for future work.
4. Experimental Results
The performance of the proposed method for detecting user gaze for target selection was measured through experiments performed with 15 participants, where each participant attempted two trials for each of three target objects, i.e., a teddy bear, a bird, and a butterfly, displayed at nine positions on a 19-inch monitor, as shown in
Figure 14. That is, three experiments were performed, one for each object (teddy bear, bird, and butterfly) at the nine positions on the screen. For each of the three experiments, we collected 270 data items (15 participants × 2 trials × 9 gaze positions) of gazing for target selection, i.e., true positive (TP) data, and the same number of non-gazing data, i.e., true negative (TN) data. Most participants were graduate students, and some were faculty or staff members of our university’s department. They were selected considering the variation of eye characteristics with age, gender, and nationality, and all participated voluntarily. Of the 15 participants, five wore glasses, four wore contact lenses, and the remaining six wore neither. Their ages ranged from the 20s to the 40s (mean age 29.3 years); nine were male and six were female. Participants of different nationalities were involved: one Mongolian, one Tanzanian, two Pakistanis, four Vietnamese, and seven Koreans. Before the experiments, we gave sufficient explanations of the experiments to all participants and obtained written informed consent from each.
In the first experiment, we compared the accuracy of our method in detecting the boundaries of the pupil and the glint, as well as the center of each, with that of a previous method [
24]. As shown in
Figure 15b,c, the boundary and center of the pupil detected by our method were closer to the ground truth than those calculated by the previous method. Moreover, as shown in
Figure 15d,e, the detected boundary and center of the glint according to our method were closer to the ground truth than those of the previous method. In this experiment, the boundary and center of the ground truth were manually chosen. Further, for all images, we measured detection errors as the Euclidean distance between the ground-truth center and the centers detected by our method and the previous method [
24]. As shown in
Table 4, our method yielded a higher accuracy (lower error) in detecting the centers of the pupil and the glint than the previous method.
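For reference, the detection error reported in Table 4 can be computed as the mean Euclidean distance between the detected and ground-truth centers; the following minimal sketch (with hypothetical coordinate values) illustrates the calculation:

```python
import numpy as np

def mean_center_error(detected, ground_truth):
    """Mean Euclidean distance (in pixels) between detected centers and
    manually marked ground-truth centers; both arrays have shape (N, 2)."""
    detected = np.asarray(detected, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    return float(np.mean(np.linalg.norm(detected - ground_truth, axis=1)))

# Hypothetical coordinates (x, y), not the measurements of Table 4.
print(mean_center_error([[101.2, 55.0], [98.7, 60.3]],
                        [[100.0, 54.0], [99.0, 61.0]]))
```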
In the second experiment, the accuracy of detecting the user’s gaze for target selection on TP and TN data was compared across various defuzzification methods in terms of the equal error rate (EER). Two types of errors, type I and type II, were considered: incorrectly classifying TP data as TN data was defined as a type I error, and incorrectly classifying TN data as TP data was defined as a type II error. As explained in the previous section, our system determines that the user’s gaze is intended for selection (TP) if the output score of the fuzzy system is higher than the threshold; otherwise, it determines that it is not (TN). Therefore, type I and II errors depend on the threshold: if the threshold is increased, the type I error increases and the type II error decreases, whereas with a smaller threshold, the type I error decreases and the type II error increases. The EER is calculated by averaging the two errors at the threshold where they are most similar.
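The following minimal sketch illustrates how the two error types, the ROC points, and the EER can be computed from the fuzzy output scores, assuming the scores are normalized to [0, 1]; the threshold grid is an arbitrary choice for illustration:

```python
import numpy as np

def type1_type2_errors(tp_scores, tn_scores, threshold):
    """Type I error: TP data classified as TN (score <= threshold).
    Type II error: TN data classified as TP (score > threshold).
    Both are returned as percentages."""
    tp_scores, tn_scores = np.asarray(tp_scores), np.asarray(tn_scores)
    type1 = 100.0 * np.mean(tp_scores <= threshold)
    type2 = 100.0 * np.mean(tn_scores > threshold)
    return type1, type2

def roc_and_eer(tp_scores, tn_scores, thresholds=np.linspace(0.0, 1.0, 1001)):
    """ROC points are (100 - type II error, type I error); the EER is the
    average of the two errors at the threshold where they are closest."""
    errors = [type1_type2_errors(tp_scores, tn_scores, t) for t in thresholds]
    roc_points = [(100.0 - t2, t1) for t1, t2 in errors]
    t1, t2 = min(errors, key=lambda e: abs(e[0] - e[1]))
    return roc_points, (t1 + t2) / 2.0
```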
Experiment 1 (Bear)
In the first experiment, a teddy bear was used as target object, as shown in
Figure 14b. The classification results of TP and TN data according to the five defuzzification methods that used the MIN and MAX rules are listed in
Table 5 and
Table 6, respectively. As indicated in these tables, the smallest EER (approximately 0.19%) of classification was obtained by the COG with the MIN rule.
Figure 16 and
Figure 17 show the receiver operating characteristic (ROC) curves for the classification results of TP and TN data according to the various defuzzification methods using the MIN and MAX rules, respectively. The ROC curve plots the type I error (%) against (100 − type II error) (%).
When both type I and II errors were small, the accuracy of the corresponding method was regarded as high. Therefore, the closer the ROC curve was to (0, 100) (a type I error of 0% and a (100 − type II error) of 100%) on the graph, the higher its accuracy was regarded to be. As shown in these figures, the accuracy of classification by the COG with the MIN rule was higher than that obtained by the other defuzzification methods.
Experiment 2 (Bird)
In the second experiment, a bird was used as target object as shown in
Figure 14b. The classification results of TP and TN data according to the five defuzzification methods that used the MIN and MAX rules are listed in
Table 7 and
Table 8, respectively. As indicated in these tables, the smallest EER (approximately 0%) of classification was obtained by the COG with the MIN rule.
Figure 18 and
Figure 19 show the ROC curves for the classification results of TP and TN data according to the various defuzzification methods using MIN or MAX rules. As is shown, the accuracy of classification by the COG with the MIN rule was higher than for other defuzzification methods.
Experiment 3 (Butterfly)
In the third experiment, a butterfly was used as target object as shown in
Figure 14b. The classification results of TP and TN data according to the five defuzzification methods that used the MIN and MAX rules are listed in
Table 9 and
Table 10, respectively. As indicated in these tables, the smallest EER (approximately 0.19%) of classification was obtained by the COG with the MIN rule.
Figure 20 and
Figure 21 show the ROC curves for the classification results of TP and TN data according to the various defuzzification methods using the MIN or MAX rules. As is shown, the accuracy of classification by the COG with the MIN rule was higher than by other defuzzification methods.
The above results were verified by comparing the proposed method using three features (change of pupil size w.r.t. time measured by template matching, change in gaze position within a short dwell time, and the texture information of the monitor image at the gaze target) with the previous method [
24], which uses two features (change of pupil size w.r.t. time measured by peakedness, and change in gaze position within a short dwell time). In addition, we compared the accuracy of the proposed method with that obtained using three features (change of pupil size w.r.t. time measured by peakedness, change in gaze position within a short dwell time, and the texture information of the monitor image at the gaze target). For convenience, we call this last method “Method A”. The only difference between the proposed method and “Method A” is that the change in pupil size w.r.t. time is measured by template matching in the proposed method but by peakedness in “Method A”.
In all cases, the ROC curves of the highest accuracy among the various defuzzification methods with the MIN or MAX rules are shown. The accuracy of the proposed method was always higher than that of “Method A” and the previous method [
24] in all cases of the three experiments, i.e., bear, bird, and butterfly, as shown in
Figure 22.
In the next experiment, we compared the accuracy of the proposed method, “Method A”, and the previous method [
24] when noise was included in the input data. All features of the proposed and the previous methods depend on the accurate detection of pupil size and gaze position, which are in turn affected by the performance of the gaze tracking system. Thus, we added Gaussian random noise to the detected pupil size and gaze position.
As shown in
Figure 23, noise had a stronger effect on the previous method [
24] and “Method A” than on the proposed method in all three experiments. A notable decrease in accuracy (i.e., an increase in EER) was observed for the former two methods when noise caused incorrect detection of pupil size and gaze position.
In the next experiments, we compared the usability of our system with the conventional dwell time-based selection method [
13,
14]. We asked the 15 participants to rate the convenience of and their interest in performing the target selection task with the proposed method and with a dwell time-based method using a questionnaire (convenience: 5 very convenient, 4 convenient, 3 normal, 2 inconvenient, 1 very inconvenient; interest: 5 very interesting, 4 interesting, 3 normal, 2 uninteresting, 1 very uninteresting). To keep our results unaffected by participant learning and physiological state, such as fatigue, we provided a rest period of 10 min to each participant between experiments. Based on [
13,
14], dwell time for target selection was set at 500 ms. That is, when the change in the user’s gaze position (our feature 2) was lower than the threshold and this state was maintained for longer than 500 ms, target selection was activated.
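For clarity, a minimal sketch of this dwell time-based baseline selection logic is given below; the movement threshold and the sample format are illustrative assumptions, not the exact values of [13,14]:

```python
import math

def dwell_time_selector(gaze_samples, move_threshold_px=40, dwell_ms=500):
    """Baseline dwell-time selection: a selection is triggered when consecutive
    gaze samples stay within move_threshold_px of the dwell start point for at
    least dwell_ms. gaze_samples is a list of (t_ms, x, y) tuples."""
    selections = []
    start_t, start_x, start_y = gaze_samples[0]
    for t, x, y in gaze_samples[1:]:
        if math.hypot(x - start_x, y - start_y) > move_threshold_px:
            start_t, start_x, start_y = t, x, y        # gaze moved: restart dwell timer
        elif t - start_t >= dwell_ms:
            selections.append((t, start_x, start_y))   # dwell completed: select target
            start_t, start_x, start_y = t, x, y        # restart after selection
    return selections
```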
The average scores are shown in
Figure 24, which indicates that our method scored higher than the conventional dwell time-based method in terms of both convenience and interest.
We also performed a
t-test [
80] to verify that user convenience with the proposed method was statistically significantly higher than that with the conventional dwell time-based method. The
t-test was performed on two independent samples of data: user convenience with our system (
µ = 3.8,
σ = 0.5) and with the conventional dwell time-based method (
µ = 2.8,
σ = 0.7). The calculated
p-value was approximately 3.4 × 10
−4, which was smaller than the significance level of 0.01 (99% confidence level). Hence, the null hypothesis for the
t-test, i.e., that there is no difference between the two independent samples, was rejected. Therefore, we can conclude that there was a significant difference, at the 99% confidence level, in user convenience between our proposed method and the dwell time-based method.
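Such a two-sample t-test can be reproduced as in the following sketch; the per-participant scores shown are hypothetical placeholders, not our measured data:

```python
from scipy import stats

# Hypothetical per-participant convenience ratings (1-5), not our measured data.
proposed = [4, 4, 3, 4, 4, 3, 4, 4, 5, 4, 3, 4, 4, 4, 3]
dwell    = [3, 2, 3, 3, 2, 3, 2, 3, 4, 2, 3, 3, 3, 2, 3]

t_stat, p_value = stats.ttest_ind(proposed, dwell)   # two independent samples
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
# Reject the null hypothesis of equal means if p < 0.01 (99% confidence level).
```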
A t-test analysis of user interest was also performed and yielded similar results, i.e., user interest in our proposed system (µ = 4.2, σ = 0.5) was higher than in the dwell time-based system (µ = 2.6, σ = 0.9). The calculated p-value was 6.74 × 10−6, i.e., smaller than the significance level of 0.01 (99% confidence level). Therefore, we concluded that there was a significant difference in user interest between our system and the dwell time-based system. The average scores for convenience and interest were higher for the proposed method because it is more natural than a conventional dwell time-based method.
In order to analyze the effect size of the difference between the two groups, we performed a Cohen’s
d analysis [
81]. This measure classifies the effect as small if it is in the range 0.2–0.3, medium if it is around 0.5, and large if it is greater than or equal to 0.8. We calculated the value of Cohen’s
d for convenience and interest. For user convenience, it was 1.49, which is in the large-effect category and indicates a substantial difference between the two groups. For user interest, the calculated Cohen’s
d was approximately 2.14, which was also in the large-effect category. Hence, we concluded that the differences in user convenience and user interest between our proposed method and the dwell time-based method correspond to large effects.
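A minimal sketch of the Cohen’s d calculation with a pooled standard deviation is given below; applied to the raw (unrounded) per-participant ratings rather than the rounded statistics above, it should reproduce the reported values:

```python
import numpy as np

def cohens_d(sample_a, sample_b):
    """Cohen's d using the pooled sample standard deviation (ddof = 1)."""
    a, b = np.asarray(sample_a, float), np.asarray(sample_b, float)
    pooled_var = (((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                  / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Hypothetical ratings for two equally sized groups (not our measured data).
print(cohens_d([4, 4, 3, 4, 5, 4, 3], [3, 2, 3, 3, 2, 3, 2]))
```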
To confirm the practicality of our method, we performed additional experiments using an on-screen keyboard based on our method, where each person typed a word through our system on an on-screen keyboard displayed on a monitor, as shown in
Figure 25. The same 15 subjects participated in these experiments, and a monitor with a 19-inch screen and a resolution of 1680 × 1050 pixels was used. Twenty sample words were selected based on frequency of use, as shown in
Table 11 [
82]. As shown in
Table 11, the upper-left words “the” and “and” are more frequently used than the lower-right words “all” and “would”. When the gaze position detected by our system fell on a specific button and our method determined that the gaze was intended for selection, the corresponding character was selected and displayed, as shown in
Figure 25.
We performed a
t-test to verify that our method is statistically significantly better than the conventional dwell time-based method [
13,
14], as shown in
Figure 26 and
Figure 27. To keep our results unaffected by participant learning and physiological state, such as fatigue, we gave a 10-min rest to each participant between experiments. Based on [
13,
14], dwell time for target selection was set at 500 ms. In all cases, our gaze detection method was used for fair comparison, and selection was performed using our method or the dwell time-based method. We conducted our statistical analysis by using four performance criteria, i.e., accuracy, execution time, interest, and convenience.
As shown in
Figure 26a, the
t-test was performed using two samples independent of each other: the user’s typing accuracy using our system (
µ = 89.7,
σ = 7.4) and the conventional dwell time-based method (
µ = 67.5,
σ = 6.5). The calculated
p-value was approximately 1.72 × 10
−9, which was smaller than the significance level of 0.01 (99% confidence level). Hence, the null hypothesis for the
t-test, i.e., that there is no difference between the two independent samples, was rejected. Therefore, we concluded that there was a significant difference, at the 99% confidence level, in accuracy between the proposed method and the dwell time-based method.
Similarly, a
t-test analysis based on average execution time for typing one character was performed, as shown in
Figure 26b. This test compared the average execution time for typing one character with our system (
µ = 0.67,
σ = 0.013) and the dwell time-based system (
µ = 2.6,
σ = 0.17). The calculated
p-value was 2.36 × 10
−16, i.e., smaller than the significance level of 0.01 (99% confidence level). Therefore, we concluded that there was a significant difference in the execution time for typing one character on a virtual keyboard between our system and the dwell time-based system. Although the dwell time for target selection was set at 500 ms in the dwell time-based method, it was often the case that the participant did not wait that long and moved his or her gaze position, which increased the average execution time for typing one character, as shown in
Figure 26b.
A
t-test analysis based on execution time for typing one word was also conducted as shown in
Figure 26c. We compared our proposed system (
µ = 2.9,
σ = 0.42) with the dwell time-based system (
µ = 8.7,
σ = 0.87). The calculated
p-value was 5.7 × 10
−16, i.e., smaller than the significance level of 0.01 (99% confidence level). Therefore, we concluded that there was a significant difference in the execution time for typing one word on a virtual keyboard between our system and the dwell time-based system.
As shown in
Figure 27, the
t-test analysis for user interest compared our proposed system (
µ = 4.0,
σ = 0.4) with the dwell time-based system (
µ = 2.4,
σ = 0.9). The calculated
p-value was 6.1 × 10
−7, i.e., smaller than the significance level of 0.01 (99% confidence level). Therefore, we concluded that there was a significant difference in user interest between our system and the dwell time-based system.
Similarly, we performed a
t-test with respect to user convenience. We noted that with our system (
µ = 3.9,
σ = 0.5) and the conventional dwell time-based method (
µ = 2.9,
σ = 0.8), the calculated
p-value was approximately 3.4 × 10
−4, which was smaller than the significance level of 0.01 (99% confidence level). Hence, the null hypothesis for the
t-test, i.e., that there is no difference between the two independent samples, was rejected. Therefore, we concluded that there was a significant difference, at the 99% confidence level, in user convenience between our proposed method and the dwell time-based method. As shown in
Figure 27, the average scores for convenience and interest were higher for the proposed method because it is more natural than a conventional dwell time-based method.
Similarly, we calculated the value of Cohen’s
d for convenience, interest, average execution time for typing, and accuracy. For user convenience and interest, the values were 1.49 and 2.34, respectively, both in the large-effect category, indicating a substantial difference between the two groups. For the average execution times for one character and one word and for accuracy, we calculated the value of Cohen’s
d as approximately 15.92, 8.51, and 3.19, respectively, which also lay in the large-effect category. Hence, from the
p-value and Cohen’s
d, we concluded that user convenience, interest, average execution time, and accuracy were significantly different for the proposed and the dwell time-based methods [
13,
14].
In our experiments, the screen resolution was 1680 × 1050 pixels on a 19-inch monitor, and the z-distance between the monitor and the user’s eyes ranged from 60 to 70 cm. Considering the accuracy (about ±1°) of gaze detection in our system, the minimum distance between two objects on the monitor screen should be about 2.44 cm (70 cm × tan(2°)), which corresponds to approximately 82 pixels.
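This spacing can be checked with a short worked computation; the pixel equivalent additionally depends on the physical width of the panel, which is why only the metric value is derived here:

```python
import math

# Worked check of the minimum object spacing: with a gaze accuracy of about
# +/-1 degree and a viewing distance of 70 cm, two gaze targets should be
# separated by at least z * tan(2 * 1 degree).
z_cm = 70.0
spacing_cm = z_cm * math.tan(math.radians(2.0))
print(f"{spacing_cm:.2f} cm")  # ~2.44 cm; the pixel equivalent depends on the
                               # physical width of the 1680 x 1050 panel.
```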