Article

Cognitive Diagnosis Method via Q-Matrix-Embedded Neural Networks

1
School of Information Science and Technology, Northeast Normal University, 2555 Jingyue, Changchun 130117, China
2
School of Educational Science, Shaanxi University of Technology, 1 Dongyi, Hanzhong 723001, China
3
Shenzhen Experimental School Guangming Department, Shenzhen Experimental Education Group, 768 Niushan Road, Shenzhen 518107, China
4
Teachers College, Shihezi University/Bingtuan Education Institute, 221 North 4th Road, Shihezi 832003, China
*
Authors to whom correspondence should be addressed.
Appl. Sci. 2024, 14(22), 10380; https://doi.org/10.3390/app142210380
Submission received: 27 September 2024 / Revised: 1 November 2024 / Accepted: 5 November 2024 / Published: 12 November 2024

Abstract
Cognitive diagnosis is an essential component of intelligent education and aims to diagnose students’ skill or knowledge mastery based on their responses. Recently, with the development of artificial intelligence, some researchers have applied neural network methods to cognitive diagnosis. Although these methods achieved some success, they lacked a principled basis for designing network structures and offered no unified design method. We propose a neural network method for cognitive diagnosis based on Q-matrix constraints, introducing the Q-matrix from traditional cognitive diagnosis to enhance the reliability and interpretability of the network structure. Specifically, our method is highly consistent with the generalized deterministic inputs, noisy “and” gate (GDINA) model, and the network structure reflects both the direct contribution of skills to answering questions correctly and the indirect contribution of interactions between skills. Finally, extensive experiments on both simulated and real datasets demonstrated that our method achieved high accuracy and reliability, with particularly notable performance on low-quality datasets. As the number of questions and skills increased, our approach exhibited greater robustness than the classical methods, highlighting its potential for broad applicability in cognitive diagnosis tasks.

1. Introduction

Cognitive diagnostic assessment (CDA) aims to diagnose students’ latent traits based on their response data, and it is widely applied in psychological assessment and personalized learning diagnosis [1]. Cognitive diagnostic assessments enable a fine-grained understanding of students’ mastery of specific knowledge concepts and can also be applied to diagnose issues such as examinee mental health [2,3], as illustrated in Figure 1. This figure presents an example of a cognitive diagnostic system, which typically comprises three components: a module for item design and student response logging, a cognitive diagnostic module, and a diagnostic report visualization module. As shown, the cognitive diagnostic model forms the core of the system, with the accuracy of diagnosis directly impacting the system’s usability and reliability [4]. Over the years, this area has attracted substantial research efforts, leading to the development of diverse cognitive diagnostic models. The existing classical cognitive diagnosis methods can be divided into parametric and non-parametric methods. The most representative non-parametric method is the non-parametric classification model based on vector similarity calculation (e.g., NPC and GNPC [5,6]). However, the most studied methods are the parametric methods based on statistical theory, such as the saturated models (e.g., LCDM and GDINA [7]) and the reduced models (e.g., DINA, RRUM, ACDM, and DINMix [8,9]).
In recent years, with the rapid development of artificial intelligence technology, some researchers have tried to use artificial neural networks (ANNs) for cognitive diagnosis [10]. Cui et al. synthesized ideal responses to train a multilayer perceptron (MLP) and evaluated its performance in the DINA framework [11]. Wen et al. combined an artificial neural network (ANN) and a hidden Markov model (HMM) to monitor students’ cognitive skill development [12]. Their method can effectively track students’ cognitive skill development, but the accuracy is affected by item quality, item quantity, and the skills examined. F. Wang et al. proposed a neural cognitive diagnosis framework called NeuralCD for student cognitive diagnosis in intelligent education systems [13]. The framework uses neural networks to simulate the nonlinear interaction between students and exercises, and it has a certain interpretability. D. Chen et al. studied the impact of the complexity of the skills or knowledge concepts being examined on the accuracy of artificial neural networks in cognitive diagnosis [14]. Their results showed that the higher the complexity of the knowledge–skill structure, the lower the classification accuracy. K. Xue et al. integrated ANN with DINA and DINO models to achieve a semi-supervised neural network cognitive diagnosis model [15]. Their results showed that when the discrimination of items decreased, the diagnostic accuracy of their method decreased, but their model had good robustness to noise.
In summary, numerous researchers have applied neural networks to cognitive diagnosis in recent years. However, it can be observed that almost none of these studies addressed how to determine a network structure for cognitive diagnosis, such as the depth of the network and the number of neurons in each layer, for different testing scenarios. To address this issue, we propose a Q-matrix-constrained neural network design method for cognitive diagnosis. Specifically, our method is highly consistent with the GDINA model [7], and the network structure reflects the direct contribution of skill or knowledge concepts to the correct answering of items, as well as the enabling or inhibiting effect of interactions between skills on the correct answering of items. With our method, users only need to provide the Q-matrix and interaction matrix to automatically complete the network construction, greatly reducing the complexity of building neural network models. This feature facilitates use by educators and researchers in non-computational fields, such as education and psychology, thereby enhancing the method’s usability and broad applicability. The structure of this paper is as follows: first, we introduce the relevant concepts and theories for the Q-matrix and GDINA model; next, we explain the computational logic and algorithmic construction of the proposed cognitive diagnostic model; subsequently, we evaluate the method’s performance using simulated and empirical data; finally, we discuss the results and offer future research directions.

2. Materials and Methods

2.1. Related Concepts and Symbol Definition

2.1.1. Q-Matrix

In cognitive diagnostic models (CDMs), a Q-matrix corresponding to the test items is frequently utilized to infer a student’s mastery of skills [16]. This procedure is similar to how a teacher observes students’ responses and then analyzes their skill mastery according to the concepts or skills tested by the items. The purpose of this work was therefore to develop a neural network approach to cognitive diagnosis based on Q-matrix constraints. As shown in Equation (1), the Q-matrix specifies how the $K$ skills or attributes to be tested are distributed over the $J$ test items [17].
$$\mathbf{Q} = \{ q_{jk} \}, \quad j = 1, 2, \ldots, J, \; k = 1, 2, \ldots, K \tag{1}$$
where $q_{jk} \in \{0, 1\}$ denotes whether the $k$th skill is needed for item $j$; $K$ denotes the number of skills, and $J$ denotes the number of items. The values $q_{jk} = 1$ and $q_{jk} = 0$ denote the two scenarios in which skill $k$ is and is not examined in item $j$, respectively. Each row of the Q-matrix is called a q-vector, e.g., $\mathbf{q}_j = (q_{j1}, q_{j2}, \ldots, q_{jK})$, which represents the skills examined in item $j$. When an item examines two or more skills, there may exist interactive relationships between those skills, reflected by mutual promotion or inhibition among them.
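As a concrete illustration of these definitions, the following sketch builds a small hypothetical Q-matrix in NumPy and reads off the q-vector of one item (the matrix values and variable names are illustrative, not taken from the paper’s experiments):

```python
import numpy as np

# A hypothetical Q-matrix with J = 4 items and K = 3 skills;
# Q[j, k] = 1 means item j+1 examines skill k+1 (0-based indexing).
Q = np.array([
    [1, 0, 0],  # item 1 examines skill 1 only
    [0, 1, 1],  # item 2 examines skills 2 and 3
    [1, 1, 0],  # item 3 examines skills 1 and 2
    [1, 1, 1],  # item 4 examines all three skills
])

# The q-vector of item 2 is the second row of Q.
q_2 = Q[1]
skills_in_item_2 = np.flatnonzero(q_2) + 1  # 1-based skill indices
```

Here `skills_in_item_2` recovers the skills examined by item 2 (skills 2 and 3), mirroring how a q-vector is read.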

2.1.2. Interaction Q-Matrix

According to the polynomial expansion theorem [18], there are $2^K$ types of interactive relationships for an item that examines $K$ skills. Subtracting the zero-order and first-order terms yields the remaining $K^{\ast}$ interactive relationships, where $K^{\ast}$ is calculated using the formula shown in Equation (2).
$$K^{\ast} = \sum_{k=2}^{K} \binom{K}{k} = 2^K - \sum_{k=0}^{1} \binom{K}{k} = 2^K - (1 + K) \tag{2}$$
where $\binom{K}{k}$ denotes the number of combinations of $k$ skills chosen from the $K$ skills. In a given Q-matrix comprising $J$ items and $K$ skills, the upper limit for the number of possible interactions between skills is $K^{\ast}$, while in an actual instructional assessment the number of interactions that occur, denoted $K'$, is significantly smaller, i.e., $K' \ll K^{\ast}$. For example, in the case of five skills, the maximum number of potential interactions is $2^5 - (1 + 5) = 26$, while fewer than ten may actually exist. Hence, a skill interaction matrix is established to depict the skill interactions in the real assessment, defined as shown in Equation (3).
$$\mathbf{Q}^{\ast} = \{ q^{\ast}_{k'k} \}, \quad k' = 1, 2, \ldots, K', \; k = 1, 2, \ldots, K \tag{3}$$
where $q^{\ast}_{k'k} \in \{0, 1\}$, and each row $\mathbf{q}^{\ast}_{k'}$ of the skill interaction matrix $\mathbf{Q}^{\ast}$ represents a skill interaction that actually exists. For example, $\mathbf{q}^{\ast} = (1, 0, 1, 0)$ indicates that there is an interdependence or mutual influence between skill 1 and skill 3. $K'$ denotes the number of skill interactions actually present in the instructional assessment. In light of this, the interactions among the skills associated with each item in a given Q-matrix $\mathbf{Q}$ can be characterized by an interactive Q-matrix $\mathbf{Q}'$, defined as shown in Equation (4).
$$\mathbf{Q}' = \{ q'_{jk'} \}, \quad j = 1, 2, \ldots, J, \; k' = 1, 2, \ldots, K' \tag{4}$$
where each column of the interactive Q-matrix $\mathbf{Q}'$ corresponds to a specific skill interaction type, aligned with a row of the skill interaction matrix $\mathbf{Q}^{\ast}$. Each row $\mathbf{q}'_j$ denotes the skill interactions present in item $j$. $q'_{jk'} \in \{0, 1\}$; $q'_{jk'} = 1$ indicates the presence of the $k'$th skill interaction type in item $j$, and $q'_{jk'} = 0$ signifies its absence. The computation of $q'_{jk'}$ follows the formula in Equation (5).
$$q'_{jk'} = \prod_{k=1}^{K} q_{jk}^{\,q^{\ast}_{k'k}}, \quad j = 1, 2, \ldots, J, \; k' = 1, 2, \ldots, K' \tag{5}$$
It is important to note that both the Q-matrix $\mathbf{Q}$ and the skill interaction matrix $\mathbf{Q}^{\ast}$ are predefined by the experts who construct the items.
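The construction of the interactive Q-matrix from a Q-matrix and a skill interaction matrix can be sketched as follows. This is a minimal NumPy illustration of the computation in Equation (5) (the k′-th interaction is present in item j exactly when the item examines every skill involved in that interaction); the matrices and the function name are illustrative:

```python
import numpy as np

# Hypothetical example: J = 3 items, K = 3 skills.
Q = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
])

# Skill interaction matrix Q*: one row per interaction that actually
# occurs -- here the pairs (skill 1, skill 2) and (skill 1, skill 3),
# so K' = 2 of the 2**3 - (1 + 3) = 4 possible interactions.
Q_star = np.array([
    [1, 1, 0],
    [1, 0, 1],
])

def interactive_q_matrix(Q, Q_star):
    """Equation (5): q'_{jk'} = prod_k q_{jk} ** q*_{k'k}, i.e. 1 iff
    item j examines every skill taking part in interaction k'."""
    # Broadcast to shape (J, K', K), exponentiate, then multiply over k.
    return np.prod(Q[:, None, :] ** Q_star[None, :, :], axis=2)

Q_prime = interactive_q_matrix(Q, Q_star)
# Q_prime == [[1, 0], [0, 1], [1, 1]]: item 1 carries only the (1,2)
# interaction, item 2 only (1,3), and item 3 both.
```

Note that `0 ** 0 == 1` in NumPy integer arithmetic, so skills not involved in an interaction ($q^{\ast}_{k'k} = 0$) contribute a neutral factor of 1 to the product, as Equation (5) requires.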

2.1.3. Skill Mastery Patterns and Observed Response Patterns

For a given Q-matrix, such as $\mathbf{Q}$ in Equation (1), there are $2^K$ candidate skill mastery patterns, usually named attribute mastery patterns (AMPs; [19]) and referred to as skill mastery patterns (SMPs) in this paper. These are defined as shown in Equation (6).
$$\mathbf{A} = \{ \alpha_{lk} \}, \quad l = 1, 2, \ldots, L, \; k = 1, 2, \ldots, K \tag{6}$$
where $\alpha_{lk} \in \{0, 1\}$ and $L = 2^K$; $\alpha_{lk} = 1$ indicates that skill $k$ has been mastered by the student, whereas $\alpha_{lk} = 0$ implies it has not. Each row $\boldsymbol{\alpha}_l$ of the matrix $\mathbf{A}$, e.g., $\boldsymbol{\alpha}_l = (\alpha_{l1}, \alpha_{l2}, \ldots, \alpha_{lK})$, represents one skill mastery pattern. A student’s responses to all the items in a specific test are commonly referred to as the observed response pattern (ORP; [19]), e.g., $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{iJ})$, $i = 1, 2, \ldots, N$, where $\mathbf{x}_i$ denotes the answers of student $i$ to items 1 to $J$. $x_{ij} \in \{0, 1\}$; $x_{ij} = 1$ indicates that student $i$ answered item $j$ correctly, and $x_{ij} = 0$ otherwise. $N$ and $J$ represent the numbers of examinees and items, respectively. The observed response patterns of all students can be expressed as a matrix $\mathbf{X}$, as shown in Equation (7).
$$\mathbf{X} = \{ x_{ij} \}, \quad i = 1, 2, \ldots, N, \; j = 1, 2, \ldots, J \tag{7}$$
The task of cognitive diagnosis is to assess students’ mastery of each skill based on their response patterns, that is, to predict skill mastery patterns (SMPs) from observed response patterns (ORPs).
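The objects defined in this subsection can be enumerated in a few lines of Python; this is a toy sketch with illustrative values, not data from the paper:

```python
import itertools
import numpy as np

K = 3  # number of skills in this toy example

# All L = 2**K candidate skill mastery patterns (the rows of A).
A = np.array(list(itertools.product([0, 1], repeat=K)))
# e.g. the row (1, 0, 1) means skills 1 and 3 are mastered, skill 2 is not.

# A toy observed response pattern for one student over J = 4 items:
# x_ij = 1 means the student answered item j correctly.
x_i = np.array([1, 0, 1, 1])
```

The diagnosis task then amounts to mapping response patterns like `x_i` to one of the `2**K` rows of `A`.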

2.2. Q-Matrix Constraint-Based Neural Network

Deep feedforward neural networks, also known as multilayer perceptrons (MLPs; [20]), were first proposed by Rosenblatt in 1957. However, it was not until 1986 that the backpropagation (BP) algorithm was introduced by Rumelhart [21], which solved the weight adjustment problem for multilayer neural networks and greatly advanced research on neural networks. As discussed in the introduction, some researchers have attempted to use artificial neural networks for cognitive diagnosis. In this article, we designed a deep neural network model for cognitive diagnosis, named the Dual Q-Net cognitive diagnosis model, with the Q-matrix $\mathbf{Q}$ and the interactive Q-matrix $\mathbf{Q}'$ as constraints. The network architecture is shown in Figure 2.
Before introducing Dual Q-Net, let us first review the GDINA model [7], whose core computational method is shown in Equation (8).
$$P(\boldsymbol{\alpha}_{lj}) = \delta_{j0} + \sum_{k=1}^{K_j} \delta_{jk} \alpha_{lk} + \sum_{k'=k+1}^{K_j} \sum_{k=1}^{K_j - 1} \delta_{jkk'} \alpha_{lk} \alpha_{lk'} + \cdots + \delta_{j12\cdots K_j} \prod_{k=1}^{K_j} \alpha_{lk} \tag{8}$$
where $K_j$ denotes the number of attributes actually assessed by item $j$, $\delta_{j0}$ serves as the intercept for item $j$, $\delta_{jk}$ represents the main effect due to $\alpha_k$, and $\delta_{j12\cdots K_j}$ indicates the interaction effect due to $\alpha_1, \ldots, \alpha_{K_j}$. According to the GDINA calculation, the probability of correctly answering item $j$ for a given skill mastery pattern $\boldsymbol{\alpha}_l$ consists of three parts: the first term in Equation (8) (i.e., $\delta_{j0}$), which can be interpreted as the guessing rate or intercept; the second term (i.e., $\sum_{k=1}^{K_j} \delta_{jk} \alpha_{lk}$), which represents the direct contribution of skills to the correct answer; and the residual terms, which represent the contribution of skill interactions to the correct answer. Inspired by the GDINA model, the core computation process of Dual Q-Net is likewise designed to consist of three parts: in Figure 2, the orange line represents the main effect, and the green line represents the interaction effect. The combined contribution of the orange and green lines to the mastery status of skills is represented by the central purple line. In addition, we introduced a Q-matrix inside the network structure to constrain the number of neurons in each layer and the connections between layers. In Figure 2, the orange line consists of an input layer and a hidden layer. The dimension of the input layer is $J$, and the dimension of the hidden layer is $K$. Each neuron in the input layer represents an item, while each neuron in the hidden layer represents a skill. A connection between the input layer and the hidden layer exists only when item $j$ examines skill $k$, which is implemented through the Q-matrix of Equation (1). It is essential to emphasize that this is not a strict rule: if full connectivity is desired, it can be achieved by setting every element of the Q-matrix to one. The calculation of $\mathbf{M}$ is shown in Equation (9).
$$\mathbf{M} = \mathrm{ReLU}\!\left( \mathbf{X} \left( \mathbf{Q} \odot \mathbf{W}_m \right) + \mathbf{b}_m \right) \tag{9}$$
where $\mathbf{M} \in \mathbb{R}^{N \times K}$ denotes the output of the hidden layer in the orange line, $\mathbf{X} \in \mathbb{R}^{N \times J}$ denotes the response data of the students, $\mathbf{Q} \in \mathbb{R}^{J \times K}$ is the constraint matrix for this computational flow, derived from Equation (1), and $\mathbf{W}_m \in \mathbb{R}^{J \times K}$ and $\mathbf{b}_m \in \mathbb{R}^{K}$ denote the weight and bias, respectively. $\odot$ is the Hadamard product, i.e., the element-wise multiplication of two vectors or matrices. As the orange line represents the main effect, the $\mathrm{ReLU}(\cdot)$ function is applied as the activation function.
The green line calculates the contribution of the interactions between skills to the correct answers and consists of an input layer and a hidden layer. Similarly to the orange line, the input layer has dimension $J$, and the hidden layer has dimension $K'$. The green computational line is constrained by the embedded interactive Q-matrix $\mathbf{Q}'$ (Equation (4)) and is computed as shown in Equation (10).
$$\mathbf{I} = \tanh\!\left( \mathbf{X} \left( \mathbf{Q}' \odot \mathbf{W}_i \right) + \mathbf{b}_i \right) \tag{10}$$
where $\mathbf{I} \in \mathbb{R}^{N \times K'}$ denotes the output of the hidden layer in the green line, while $\mathbf{W}_i$ and $\mathbf{b}_i$ denote the weights and biases for this line, respectively. The $\tanh(\cdot)$ function was chosen as the activation function for this line because the interaction effects between skills can either facilitate or inhibit correct responses.
Finally, in the purple calculation flow, a fully connected layer is employed to consolidate the outcomes of the orange line, which signifies the main effect, and the green line, which signifies the interactive effect between skills, to determine the student’s mastery of each skill. The calculation process is shown in Equation (11).
$$\hat{\mathbf{A}} = \sigma\!\left( \left( \mathbf{M} \oplus \mathbf{I} \right) \mathbf{W}_c + \mathbf{b}_c \right) \tag{11}$$
where the symbol $\oplus$ denotes the serial concatenation of two vectors or matrices, $\mathbf{W}_c$ and $\mathbf{b}_c$ denote the weights and biases of this layer, and $\sigma(\cdot)$ denotes the activation function. $\hat{\mathbf{A}}$ denotes the probability of skill mastery predicted by Dual Q-Net. A threshold can be used to determine whether a student has mastered a skill; for example, $\hat{\alpha}_{ik} > 0.5$ indicates that student $i$ has mastered skill $k$.
This completes the forward computation process of the Dual Q-Net network. The loss function of the model is the mean squared error, as shown in Equation (12), and the model parameters are updated by the backpropagation algorithm.
$$L(\theta) = \frac{1}{2N} \left( \hat{\mathbf{A}} - \mathbf{A} \right)^{T} \left( \hat{\mathbf{A}} - \mathbf{A} \right) \tag{12}$$
where $\theta$ denotes the parameters of the model, including the weights and biases of each layer in the network, $\mathbf{A}$ contains the students’ true skill mastery patterns, and $N$ is the number of students participating in the test.
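The forward computation of Equations (9)–(12) can be sketched in plain NumPy as follows. This is a minimal illustration, not the authors’ implementation: the dimensions, random masks, and parameter initializations are all illustrative, and the Q-matrices constrain the first-layer weights via the Hadamard product exactly as in the equations above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: N students, J items, K skills, K' interactions.
N, J, K, K_prime = 8, 4, 3, 2
Q = rng.integers(0, 2, size=(J, K))                # Q-matrix mask
Q_prime = rng.integers(0, 2, size=(J, K_prime))    # interactive Q-matrix mask
X = rng.integers(0, 2, size=(N, J)).astype(float)  # response data

# Illustrative parameter initialization.
W_m, b_m = rng.normal(0, 0.1, (J, K)), np.zeros(K)
W_i, b_i = rng.normal(0, 0.1, (J, K_prime)), np.zeros(K_prime)
W_c, b_c = rng.normal(0, 0.1, (K + K_prime, K)), np.zeros(K)

# Equation (9): main effects (orange line), weights masked by Q.
M = np.maximum(X @ (Q * W_m) + b_m, 0.0)
# Equation (10): interaction effects (green line), masked by Q'.
I = np.tanh(X @ (Q_prime * W_i) + b_i)
# Equation (11): concatenate and map to mastery probabilities (purple line).
A_hat = 1.0 / (1.0 + np.exp(-(np.hstack([M, I]) @ W_c + b_c)))

# Equation (12): mean squared error against true patterns A.
A_true = rng.integers(0, 2, size=(N, K)).astype(float)
loss = float(np.sum((A_hat - A_true) ** 2) / (2 * N))
```

In a full implementation, an automatic differentiation framework would apply the same masks on every update so that the zeroed connections stay zero during backpropagation.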
To provide a clear understanding of the implementation details and training process of the Dual Q-Net algorithm, we present its pseudocode in Algorithm 1. This algorithm details the construction of the Dual Q-Net model, including the calculation of the interactive Q-matrix from the original Q-matrix, and describes both the forward computation and training processes. Notably, it specifies how the network parameters are constrained by the Q-matrix and the interactive Q-matrix. Additionally, Equation (13) presents the calculation of the network parameter count of the Dual Q-Net model.
$$N_{\mathrm{params}} = (K + K')(K + 2) + \sum_{j=1}^{J} \sum_{k=1}^{K} q_{jk} + \sum_{j=1}^{J} \sum_{k'=1}^{K'} q'_{jk'} \tag{13}$$
where $N_{\mathrm{params}}$ denotes the total parameter count of the Dual Q-Net model, with $q_{jk}$, $q'_{jk'}$, $J$, $K$, and $K'$ retaining the same definitions as provided in Equations (1)–(4).
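Equation (13) is straightforward to evaluate directly from the two constraint matrices; the following sketch (with illustrative toy matrices) shows the arithmetic:

```python
import numpy as np

def dual_qnet_param_count(Q, Q_prime):
    """Equation (13): (K + K')(K + 2) unmasked parameters plus one
    masked weight per nonzero entry of Q and of Q'."""
    K = Q.shape[1]
    K_prime = Q_prime.shape[1]
    return (K + K_prime) * (K + 2) + int(Q.sum()) + int(Q_prime.sum())

# Toy example: J = 4 items, K = 3 skills, K' = 2 interactions.
Q = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0], [1, 1, 1]])
Q_prime = np.array([[0, 0], [1, 0], [0, 0], [1, 1]])
n = dual_qnet_param_count(Q, Q_prime)
# (3 + 2) * (3 + 2) + 8 + 3 = 36 parameters
```

Because the masked first layers only keep weights where the Q-matrices are nonzero, sparser Q-matrices yield smaller models.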
Algorithm 1: Algorithm description for our proposed Dual Q-Net

3. Experiments and Results

In this work, simulated and real datasets were used to evaluate the performance of our proposed cognitive diagnostic method. Attribute-wise agreement rate (AAR) and pattern-wise agreement rate (PAR) metrics were used as evaluation criteria [5]. This section is organized as follows: First, we present the model evaluation metrics. Next, we describe the experimental datasets, which included simulated and real data. Finally, we analyze the experimental results from multiple groups to verify the effectiveness of our proposed models.

3.1. Agreement Evaluation Metrics

In this section, we introduce two evaluation metrics, AAR and PAR. Suppose that the model predicts the skill mastery pattern $\hat{\boldsymbol{\alpha}}_i = (\hat{\alpha}_{i1}, \hat{\alpha}_{i2}, \ldots, \hat{\alpha}_{iK})$ for student $i$ and the true skill mastery pattern for student $i$ is $\boldsymbol{\alpha}_i = (\alpha_{i1}, \alpha_{i2}, \ldots, \alpha_{iK})$; then, AAR and PAR are defined as shown in Equations (14) and (16), respectively.
$$AAR = \frac{1}{N \times K} \sum_{i=1}^{N} \sum_{k=1}^{K} \rho\!\left( \hat{\alpha}_{ik}, \alpha_{ik} \right) \tag{14}$$
where AAR represents the agreement rate between the model-predicted student mastery of each skill and the actual student mastery of each skill; N denotes the number of students; and K denotes the number of skills assessed. The function ρ ( · ) is the Kronecker delta, which is defined as shown in Equation (15).
$$\rho(a, b) = \begin{cases} 1, & \text{if } a = b \\ 0, & \text{if } a \neq b \end{cases} \tag{15}$$
The definition of PAR is shown in Equation (16).
$$PAR = \frac{1}{N} \sum_{i=1}^{N} \prod_{k=1}^{K} \rho\!\left( \hat{\alpha}_{ik}, \alpha_{ik} \right) \tag{16}$$
where PAR represents the overall consistency between the model-predicted student skill mastery patterns and the actual student skill mastery patterns. In other words, the PAR value is 1 only when the model-predicted mastery status of each skill perfectly aligns with the actual mastery status observed in students.
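Both metrics are simple to compute from predicted and true mastery matrices; the following sketch (with illustrative toy data) mirrors Equations (14) and (16):

```python
import numpy as np

def aar(pred, true):
    """Attribute-wise agreement rate (Equation (14)): the share of
    (student, skill) cells where predicted and true mastery agree."""
    return float(np.mean(pred == true))

def par(pred, true):
    """Pattern-wise agreement rate (Equation (16)): the share of
    students whose entire predicted pattern matches the true one."""
    return float(np.mean(np.all(pred == true, axis=1)))

# Toy predictions for N = 3 students and K = 3 skills.
pred = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1]])
true = np.array([[1, 0, 1], [1, 0, 0], [0, 0, 1]])
# One cell disagrees, so aar = 8/9, and one student's whole pattern
# is wrong, so par = 2/3.
```

Note that PAR is the stricter metric: a single mis-diagnosed skill removes the whole student from the numerator, so PAR is always less than or equal to AAR.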

3.2. Simulation Studies

3.2.1. Simulation Datasets

To evaluate the performance of our proposed cognitive diagnostic approach, we generated artificial simulation data for model testing by manipulating five factors, following previous studies [6,22,23]: the number of candidates, the number of skills examined, the number of test items, the quality of the items, and the model used for generating the simulation data. In each dataset, the number of candidates $N$ was set to 100, 200, 300, and 500, while the number of skills was set to three and five. The quality of an item was measured using its guessing parameter $P(0)$ and slip parameter $1 - P(1)$. When $P(0), 1 - P(1) \sim U(0, 0.15)$, the item is deemed to be of high quality, and when $P(0), 1 - P(1) \sim U(0.15, 0.3)$, it is deemed to be of low quality, where $U$ denotes a uniform distribution. We designed two Q-matrices for generating the simulation data. The generating rules were as follows: (a) the Q-matrix contained items examining each single attribute; (b) the remaining items were selected at random from all $2^K - 1$ possible items to satisfy the predetermined test length. As shown in Equation (17), $\mathbf{Q}_1$ consists of 10 items and 3 skills.
$$\mathbf{Q}_1 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix} \tag{17}$$
Likewise, as shown in Equation (18), $\mathbf{Q}_2$ consists of 31 items and 5 skills; it is displayed transposed for compactness, so each column of $\mathbf{Q}_2^{T}$ is the q-vector of one item.
$$\mathbf{Q}_2^{T} = \begin{pmatrix} 1&0&1&0&1&0&1&0&1&0&1&0&1&0&1&0&1&0&1&0&1&0&1&0&1&0&1&0&1&0&1 \\ 0&1&1&0&0&1&1&0&0&1&1&0&0&1&1&0&0&1&1&0&0&1&1&0&0&1&1&0&0&1&1 \\ 0&0&0&1&1&1&1&0&0&0&0&1&1&1&1&0&0&0&0&1&1&1&1&0&0&0&0&1&1&1&1 \\ 0&0&0&0&0&0&0&1&1&1&1&1&1&1&1&0&0&0&0&0&0&0&0&1&1&1&1&1&1&1&1 \\ 0&0&0&0&0&0&0&0&0&0&0&0&0&0&0&1&1&1&1&1&1&1&1&1&1&1&1&1&1&1&1 \end{pmatrix} \tag{18}$$
The models employed for generating the simulation data included DINA and GDINA. Specifically, DINA was utilized to generate simulated data based on the Q-matrix Q 1 , while GDINA was used to generate simulated data based on the Q-matrix Q 2 .
To make the simulated data closer to real-world conditions, we followed the research of Chiu et al. [6] and Wang et al. [23], using a multivariate normal threshold model to simulate the correlations among attributes. Specifically, we assumed that student $i$’s latent continuous scores $\boldsymbol{\theta}_i = (\theta_{i1}, \theta_{i2}, \ldots, \theta_{iK})$ for the $K$ skills followed a multivariate normal distribution, i.e., $\boldsymbol{\theta}_i \sim MVN(\mathbf{0}_K, \Sigma)$, where $\mathbf{0}_K = (0, 0, \ldots, 0)$ is a $K \times 1$ zero vector, and $\Sigma$ is as shown in Equation (19).
$$\Sigma = \begin{pmatrix} 1 & 0.5 & \cdots & 0.5 \\ 0.5 & 1 & \cdots & 0.5 \\ \vdots & \vdots & \ddots & \vdots \\ 0.5 & 0.5 & \cdots & 1 \end{pmatrix} \tag{19}$$
where the off-diagonal elements of $\Sigma$ are set to 0.5, denoting a medium correlation between attributes. $\boldsymbol{\alpha}_i = (\alpha_{i1}, \alpha_{i2}, \ldots, \alpha_{iK})$ is the skill mastery pattern of the simulated student, where $\alpha_{ik}$ is calculated as shown in Equation (20).
$$\alpha_{ik} = \begin{cases} 1, & \text{if } \theta_{ik} \geq \Phi^{-1}\!\left( \dfrac{k}{K+1} \right) \\ 0, & \text{otherwise} \end{cases} \tag{20}$$
where $\Phi^{-1}(\cdot)$ is the inverse of the cumulative distribution function of the standard normal distribution.
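The threshold-model sampling described in Equations (19) and (20) can be sketched as follows; the function name and default values are illustrative, and the thresholds use the stdlib `NormalDist` for $\Phi^{-1}$:

```python
import numpy as np
from statistics import NormalDist

def simulate_smps(N, K, rho=0.5, seed=0):
    """Draw N skill mastery patterns via the multivariate normal
    threshold model: theta_i ~ MVN(0, Sigma) with off-diagonal
    correlation rho (Equation (19)), and alpha_ik = 1 iff
    theta_ik >= Phi^{-1}(k / (K + 1)) (Equation (20))."""
    rng = np.random.default_rng(seed)
    Sigma = np.full((K, K), rho)
    np.fill_diagonal(Sigma, 1.0)
    theta = rng.multivariate_normal(np.zeros(K), Sigma, size=N)
    thresholds = np.array([NormalDist().inv_cdf(k / (K + 1))
                           for k in range(1, K + 1)])
    return (theta >= thresholds).astype(int)

alpha = simulate_smps(500, 3)  # 500 simulated students, 3 skills
# Skill 1's threshold is Phi^{-1}(1/4) < 0, so roughly 3/4 of the
# simulated students master it; later skills are progressively rarer.
```

The increasing thresholds $\Phi^{-1}(k/(K+1))$ make higher-indexed skills harder to master, while the shared correlation $\rho$ makes a student’s mastery indicators positively dependent.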

3.2.2. Results

This section addresses two pivotal aspects of validating model performance. First, we compare the performance of our proposed Dual Q-Net with the MLP-based cognitive diagnostic method of [11] and the ANN-based method of [24] during training on simulated data; the experimental results indicated that Dual Q-Net showed relatively superior performance across the various scenarios. Subsequently, we conducted a comprehensive evaluation of our proposed method by comparing the performance of the neural-network-based cognitive diagnostic models with classical cognitive diagnostic methods on simulated datasets of different scales.
  • Result I: training process results of different neural network models
Before analyzing the experimental results, some common information needs to be stated. The MLP contains one input layer with dimensions $N \times J$ and one output layer with dimensions $J \times K$, so the total number of parameters in the MLP network is $K(J + 1)$. The ANN consists of an input layer with dimensions $N \times J$, a hidden layer with dimensions $K \times K$, and an output layer with dimensions $N \times K$, resulting in a total parameter count of $K(J + K + 2)$. The symbols $N$, $J$, and $K$ have the same meanings as above. The structure of Dual Q-Net is illustrated in Figure 2.
In Figure 3 and Figure 4, the training processes of the three methods, MLP, ANN, and Dual Q-Net, are depicted, where the green curve represents Dual Q-Net, the yellow curve represents ANN, and the blue curve represents MLP. Each figure consists of six subplots arranged in two rows and three columns; the first, second, and third columns represent the convergence of the loss, AAR, and PAR during model training, respectively. The first and second rows depict the convergence of model training on simulated datasets generated by the same Q-matrix and parametric cognitive diagnostic model with low (i.e., $P(0), 1 - P(1) \sim U(0, 0.15)$) and high (i.e., $P(0), 1 - P(1) \sim U(0.15, 0.30)$) levels of guessing and slip, respectively.
Figure 3 shows the trends in the loss, AAR, and PAR values with the number of training iterations for the three models, MLP, ANN, and Dual Q-Net, on the simulated data, which had a sample size of 500 and were generated by the DINA model with the Q-matrix $\mathbf{Q}_1$. Notably, in the Dual Q-Net network, the skill interaction matrix $\mathbf{Q}^{\ast}$ was designed as a $4 \times 3$ randomized binary matrix with non-repeating rows, signifying the presence of four mutual influence interactions among the three skills examined in $\mathbf{Q}_1$. The sample size of the test set was 20% of that of the training set. The learning rate, batch size, and number of epochs for the trained models were 0.01, 64, and 100 for the high-quality dataset, and 0.005, 128, and 200 for the low-quality dataset.
In Figure 3, we analyze how well the models performed overall across the test datasets. The overall performance of Dual Q-Net surpassed that of MLP and ANN in all cases. Firstly, Dual Q-Net achieved a lower loss than MLP and ANN on both high- and low-quality datasets, as shown in the sub-plot in Figure 3. Moreover, Dual Q-Net’s loss on the high-quality datasets converged faster and had smaller values than that of MLP and ANN. Furthermore, the same trend was observed with the low-quality datasets, where the loss performance of Dual Q-Net was more prominent than on the high-quality datasets. Secondly, in terms of model accuracy, Dual Q-Net was ahead of MLP and ANN for both AAR and PAR, as shown in sub-plots b, e, c, and f in Figure 3. In addition, it can also be observed that this advantage was particularly prominent on the low-quality datasets.
Figure 4 serves the same purpose as Figure 3. The difference is that the simulated dataset was generated from the GDINA model with Q-matrix $\mathbf{Q}_2$. Notably, $\mathbf{Q}_2$ assessed five skills, giving 26 possible skill interaction patterns according to Equation (2). In the experiment in Figure 4, we set only 10 random skill interaction patterns; that is, the interaction matrix $\mathbf{Q}^{\ast}$ was a binary matrix with dimensions $10 \times 5$ and no duplicate rows. The learning rate, batch size, and number of epochs for the trained models were 0.01, 128, and 100 for the high-quality dataset, and 0.005, 64, and 200 for the low-quality dataset.
The findings presented in Figure 4 consistently align with those depicted in Figure 3. Dual Q-Net exhibited a pronounced advantage over MLP and ANN across loss, AAR, and PAR, particularly in the case of the low-quality data. The collective findings from Figure 3 and Figure 4 demonstrate that our method maintained a reliable and robust performance when the number of questions and skills increased. This contrasts with the classical methods (e.g., DINA and GDINA; [25]), indicating that our approach can, to some extent, address this limitation inherent in classical methods.
  • Result II: Result comparison of different methods on simulated datasets
The objective of the previous analysis was to discuss the changing trends in the accuracy of the MLP, ANN, and Dual Q-Net models during the training process. The main purpose of this section is to compare the performance of our proposed method with the classical methods on the various datasets, to better evaluate its effectiveness. The section is structured as follows: first, we present the related experimental parameter settings and datasets, followed by the parameterized optimal baseline model for each dataset; then, before analyzing some typical experimental results, we review the results of each model on the different datasets. The simulated experimental datasets shown in Table 1 were created so that a clearer experimental analysis could be performed. With reference to existing studies and after several experiments [26], it was found that DINA was better at generating simulated data when the number of examined attributes was small, whereas GDINA was better when more attributes were examined. For example, the first dataset, HSD1, is the simulation data generated by the DINA model, where the Q-matrix contains 10 items and 3 attributes (i.e., $\mathbf{Q}_1 \in \mathbb{R}^{10 \times 3}$), and the guessing and slip parameters are at a low level (i.e., $g, s \sim U[0, 0.15]$).
To evaluate the model performance, we produced simulated datasets for each condition and sampled four sub-datasets with sample sizes N of 100, 200, 300, and 500. To make the evaluation more comprehensive and unbiased, we selected the most representative classical methods as reference benchmarks. These included parametric cognitive diagnostic models such as DINA and GDINA, and nonparametric models like NPC and GNPC. The scores of each method for AAR and PAR are presented in Table 2 and Table 3. In the case of the three neural-network-based methods, MLP, ANN, and Dual Q-Net, the hyperparameters were consistent with those outlined in the previous section. The model parameters for all three were initialized by pre-training with a mixture of the four simulated datasets, followed by fine-tuning tailored to datasets of different sample sizes. The performance of these fine-tuned and traditional models was tested on the test dataset, with the average results from dozens of repetitions detailed in Table 2 and Table 3. In the following, we detail the performance of each model using the AAR metric across the various simulated datasets.
On the whole, firstly, our method, Dual Q-Net, consistently outperformed all other models under the various conditions. Secondly, across all datasets, the neural-network-based methods achieved higher scores for AAR compared with the classical parametric (e.g., DINA and GDINA) and non-parametric (e.g., NPC and GNPC) methods. It is noteworthy that this superiority was particularly pronounced for datasets with lower quality and more attributes. For instance, on all sub-datasets of LSD2, the scores of the classical methods for AAR were consistently below 0.9, with some even falling below 0.8, whereas the neural-network-based methods consistently surpassed the 0.9 threshold. In terms of the impact of the number of attributes on AAR performance, when the number of attributes was small (e.g., the HSD1 and LSD1 datasets), the differences between the classical models were minimal, with the DINA model having a slight edge over the other three classical models. Among the nonparametric methods, NPC slightly outperformed GNPC. In the realm of the neural-network-based methods, Dual Q-Net and the ANN were similar and ahead of the MLP. As the number of attributes increased (e.g., the HSD2 and LSD2 datasets), it became evident that GDINA and GNPC performed relatively well among the classical methods. In contrast, DINA and NPC did not perform as strongly, both scoring below 0.8 for AAR. Among the neural-network-based methods, our method, Dual Q-Net, further extended its lead over ANN and MLP. To investigate the impact of the quality (i.e., the guess and slip rates) of the datasets on the performance of the models for AAR, we compared the AAR scores of the models on high-quality datasets (e.g., HSD1 and HSD2) and low-quality datasets (e.g., LSD1 and LSD2). All methods were affected by the guessing and slip parameters, but the neural-network-based methods were more robust than the classical methods. 
When evaluating the effect of the sample size on model performance, we observed that the classical models were sensitive to the sample size: their performance generally improved as the sample size increased. GDINA and GNPC followed this trend closely, whereas DINA and NPC followed it only on the HSD1 and LSD1 datasets. This was because the HSD2 and LSD2 datasets were generated from the GDINA model, which caused DINA and NPC to deviate from the trend on these two simulated datasets. The neural-network-based methods, in contrast, did not consistently adhere to this trend: in our work, their parameters were initialized by pre-training and then fine-tuned on sub-datasets of different sample sizes, so their performance was influenced more by sample imbalance in the test set than by the sample size itself. Hence, the neural-network-based methods in this paper did not fully follow this rule.
Previously, we analyzed in detail how each model performed for AAR on each simulated dataset. Next, we shift our focus to their performance for PAR. Overall, our method, Dual Q-Net, outperformed the other models in PAR score on every simulated dataset, which is consistent with its performance for AAR. The difference was that the neural-network-based methods held a much larger advantage over the classical methods for PAR than for AAR under the same conditions. In particular, as the number of attributes examined increased and the data quality decreased (i.e., the guessing and slip parameters increased), the advantage of the neural-network-based methods became more prominent. For the classical models, on datasets with fewer attributes (e.g., the HSD1 and LSD1 datasets), the DINA model had a slight advantage over the other three, though the differences were not significant. However, on datasets with more attributes (e.g., the HSD2 and LSD2 datasets), GDINA and GNPC were significantly ahead of DINA and NPC. For the neural-network-based methods, the overall performance was more satisfactory: their PAR scores exceeded 0.7 on all datasets. However, their performance was somewhat affected by the quality of the data, while remaining insensitive to the sample size.

3.3. Real Data Illustration

3.3.1. Real Datasets

To better validate the effectiveness of the cognitive diagnosis method proposed in this paper, we also conducted experiments on real data in addition to artificially simulated data. For the real datasets, we chose the widely used fraction subtraction dataset (FRAC; [27]) and the elementary probability test theory assessment (EPTT; [28]).
The original fraction subtraction dataset contains 536 student responses. Its Q-matrix in this study, consisting of 20 items and 8 skills, was the same as that given in Table 8 of [29]. The skill labels are as follows: ($a_1$) convert a whole number to a fraction; ($a_2$) separate a whole number from a fraction; ($a_3$) simplify before subtracting; ($a_4$) find a common denominator; ($a_5$) borrow from the whole number part; ($a_6$) perform column borrowing to subtract the second numerator from the first; ($a_7$) subtract numerators; and ($a_8$) reduce answers to the simplest form. The elementary probability test theory assessment dataset consists of 12 probability items answered by 504 students. It examines four skills: (cp) the probability of the complement of an event; (id) two independent events; (pb) the probability of an event; and (un) the union of two disjoint events. The Q-matrix employed in this study was the one used by Chen et al., as presented in Table 7 of their article [14]. For both real datasets, the student response data and the Q-matrices can be obtained from the R package edmdata 1.2.0 [30]. Since there is no ground truth for the EPTT and FRAC datasets, we selected the predictions of the best-fitting parametric model as a benchmark, using the model-fit criteria AIC and BIC [31]. On the EPTT dataset, the GDINA model yielded lower AIC and BIC scores, while on the FRAC dataset, the DINA model did. Therefore, the benchmark for the EPTT dataset was based on the results of the GDINA model, and that for the FRAC dataset on the results of the DINA model.
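Benchmark selection by information criteria can be sketched as follows. The log-likelihoods and parameter counts below are hypothetical placeholders rather than the fitted values for EPTT or FRAC, and the helper names are ours; both criteria are "smaller is better".

```python
import math

def aic(log_lik, n_params):
    # AIC = 2p - 2 ln L
    return 2 * n_params - 2 * log_lik

def bic(log_lik, n_params, n_obs):
    # BIC = p ln n - 2 ln L; penalizes parameters more heavily for large n
    return n_params * math.log(n_obs) - 2 * log_lik

def pick_benchmark(fits, n_obs):
    """fits: {model_name: (log_lik, n_params)}. Returns the model that
    minimizes both criteria, falling back to BIC if they disagree."""
    by_aic = min(fits, key=lambda m: aic(*fits[m]))
    by_bic = min(fits, key=lambda m: bic(fits[m][0], fits[m][1], n_obs))
    return by_aic if by_aic == by_bic else by_bic

# hypothetical fits for a 504-examinee dataset (not the paper's values)
fits = {"DINA": (-3100.0, 40), "GDINA": (-2900.0, 96)}
```

Here the richer GDINA fit improves the likelihood by enough to offset its extra parameters, so both criteria would select it as the benchmark.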
The network architecture of our method, Dual Q-Net, relies on both a Q-matrix and a skill-interaction Q-matrix. In the experiments with real datasets, we constructed the corresponding skill-interaction Q-matrices for the EPTT and FRAC datasets based on suggestions from mathematical experts, as shown in Equations (21) and (22).
$$
Q_e =
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 & 1 \\
1 & 1 & 1 & 0 & 1 \\
0 & 0 & 1 & 1 & 1
\end{pmatrix}
\tag{21}
$$
$Q_e$ denotes the skill-interaction matrix of the EPTT dataset, which contains five skill-interaction patterns. For example, the first column, $q_{e1} = (1, 0, 1, 0)^\top$, indicates a mutual influence relationship between skill 1 (i.e., cp) and skill 3 (i.e., pb).
$$
Q_s =
\begin{pmatrix}
1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & 1 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 1 & 0 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0
\end{pmatrix}
\tag{22}
$$
As with $Q_e$, $Q_s$ denotes the skill-interaction matrix of the FRAC dataset, which contains 17 skill-interaction patterns. For example, the first column, $q_{s1} = (1, 1, 0, 0, 0, 0, 0, 0)^\top$, indicates a mutual influence relationship between skill 1 (i.e., $a_1$) and skill 2 (i.e., $a_2$). In this case, the neural network parameters were initialized using simulated data generated from the benchmark model. The learning rate, batch size, and number of epochs of the trained models were 0.004, 64, and 100, respectively, on the EPTT dataset, and 0.001, 64, and 100 on the FRAC dataset. To mimic the simulation study design and emphasize the benefit of using our method, subsets were created from each real dataset using a stratified sampling strategy with sample sizes N of 100, 200, and 300.
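Reading a skill-interaction matrix is mechanical: each column flags the skills involved in one interaction pattern. The sketch below (variable and function names are ours) transcribes $Q_e$ from Equation (21) and recovers, for any column, the set of interacting EPTT skills:

```python
# Skill-interaction matrix Q_e for EPTT, transcribed from Equation (21):
# rows index the skills (cp, id, pb, un), columns the interaction patterns.
Q_e = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 1, 1],
    [1, 1, 1, 0, 1],
    [0, 0, 1, 1, 1],
]
skills = ["cp", "id", "pb", "un"]

def interacting_skills(q_matrix, col, names):
    """Return the names of the skills marked 1 in one interaction column."""
    return [names[r] for r in range(len(q_matrix)) if q_matrix[r][col] == 1]
```

Column 1 recovers exactly the cp–pb pairing described above, and the last column corresponds to an interaction among id, pb, and un.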

3.3.2. Results

Table 4 displays the scores of each method for the AAR and PAR performance metrics on each test set, as derived from the real datasets and their corresponding model parameters. It is vital to emphasize that errors in classification are inevitable in practice [6]. Therefore, for the PAR metric, we not only considered the scenario where all attributes were accurately estimated (i.e., PAR($K$)) but also the case where a single attribute error is allowed (i.e., PAR($K-1$); [23]), to provide a more general assessment of the methods.
The results in Table 4 show that the performance patterns of each method on the real datasets were generally consistent with those on the simulated datasets. In a detailed examination, neural-network-based methods outperformed the traditional approaches overall. In particular, our method, Dual Q-Net, emerged as the top performer across all datasets, both in terms of AAR and PAR metrics.
In terms of AAR, among the traditional models, the parametric models outperformed the non-parametric ones. Specifically, on the EPTT datasets, the GDINA model performed best, followed by the DINA model, while the NPC and GNPC models performed worst, with no significant difference between them. On the FRAC datasets, the DINA model led the other three traditional models, followed by GDINA, NPC, and GNPC. On both the EPTT and FRAC datasets, the classical methods exhibited an upward trend in AAR scores as the sample size increased. In contrast, the neural-network-based methods demonstrated performance patterns similar to those on the simulated datasets, and their performance remained robust across varying sample sizes. The number of attributes affected the performance of all methods: performance decreased as the number of attributes increased. Among them, the neural-network-based methods were less affected by the number of attributes and showed better robustness. It should be emphasized that only the DINA model maintained a higher AAR score on the FRAC datasets than on the EPTT datasets. This was because the benchmark for the FRAC datasets was based on the predictions of the DINA model on the whole FRAC dataset.
In terms of PAR, the trends in the individual models' scores were generally consistent with those for AAR. For PAR($K$), the models overall scored higher on the EPTT dataset. For PAR($K-1$), the classical models maintained higher scores on the EPTT dataset, whereas the neural-network-based methods did not strictly follow this pattern. On the same dataset, models usually scored lower for PAR($K$) than for AAR. The models' scores for PAR($K-1$) did not follow this pattern strictly, which suggests that classification errors are inevitable in real-world practice and that appropriately relaxing the assessment conditions may be a more pragmatic method of evaluation. In addition, the classical methods' scores for PAR($K-1$) did not strictly increase with the sample size.

4. Discussion

The accuracy of models in classifying student skill mastery is crucial in CDAs and directly determines the reliability of the model. In the past, researchers have proposed many parametric and nonparametric CDMs, drawing on the fields of statistics and vector distance. Despite these efforts, a universally robust method adaptable to various conditions remains elusive. In recent years, the rapid advancements in artificial intelligence (AI), particularly in fields like natural language processing (NLP), have highlighted the potential of ANNs [32]. Although ANNs are a powerful machine learning method, their application in CDAs has not received substantial attention, even though some researchers have begun exploring this avenue. In response to this gap, we proposed Dual Q-Net, a novel neural-network-based cognitive diagnostic method. This method, inspired by the GDINA model, leverages Q-matrix and interactive Q-matrix constraints and is rooted in deep cognitive diagnosis principles. The previous sections detailed the methodology and evaluated its performance extensively. In the following paragraphs, we delve into various aspects, including the design principles and interpretability of Dual Q-Net.
First, we discuss the architecture and interpretability of the Dual Q-Net model. The design of our method was inspired by the GDINA model, in which the contribution to correctly identifying skill patterns consists of three parts: guesses and slips, direct effects of skills, and interaction effects between skills [7]. Similarly, our proposed method, Dual Q-Net, also contains three parts: the bias parameters representing guess and slip in each layer of the network, the orange computation flow in Figure 2 representing the main effects, and the green computation flow representing the interaction effects among skills. In addition, we referred to the DINA model to design a very simple and easy-to-understand interactive Q-matrix [33], and applied it together with the Q-matrix to constrain our network structure. This is the most prominent difference between our method and the methods of [11,24]: their network structures were fixed, making it difficult to adapt their methods to different data. In contrast, our design allows the network structure of Dual Q-Net to be generated dynamically, adapting to different datasets based on different Q-matrices and interactive Q-matrices. Moreover, our method employs the interactive Q-matrix to constrain the weights of the hidden layer, which helps our method avoid overfitting. The findings of numerous previous studies and the results of the traditional methods in this study consistently indicate that classification accuracy decreases as the attribute structure becomes more complex, that higher slip and guessing values are associated with lower classification accuracy, and that a larger number of training samples correlates with higher classification accuracy [25]. These design features make our method more robust to the number of attributes and allow it to perform well on data with high guess and slip rates, which is why our method does not strictly follow the above patterns on either the simulated or the real datasets.
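The Q-matrix constraint on the network weights can be pictured as element-wise masking: an item's main-effect weights connect only to the skills its Q-matrix row flags, and an interaction feature fires only when all skills in its interaction column are mastered. The numpy sketch below is our own simplified reading of this idea (a single layer, a DINA-style conjunctive rule for interactions, and no masking of the interaction weights), not the authors' released implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def interaction_features(alpha, Q_int):
    """alpha: (N, K) mastery patterns; Q_int: (K, C) skill-interaction
    matrix. A pattern fires only when ALL skills flagged in its column
    are mastered (a DINA-style conjunctive rule, assumed here)."""
    alpha, Q_int = np.asarray(alpha), np.asarray(Q_int)
    return np.array([[float(np.all(a[Q_int[:, c] == 1]))
                      for c in range(Q_int.shape[1])]
                     for a in alpha])

def q_masked_forward(alpha, Q, Q_int, W_main, W_int, b):
    """One Q-constrained layer. Q (J items x K skills) zeroes every
    main-effect weight linking an item to a skill it does not measure;
    interaction features contribute through W_int ((C, J)). Returns the
    (N, J) matrix of correct-response probabilities."""
    alpha, Q = np.asarray(alpha, float), np.asarray(Q, float)
    main = alpha @ (W_main * Q.T)      # weight w_kj survives only if q_jk = 1
    inter = interaction_features(alpha, Q_int) @ W_int
    return sigmoid(main + inter + b)   # bias plays the guess/slip role

# toy check: 2 items, 2 skills, one interaction pattern (skills 1 and 2)
Q = np.array([[1, 0], [1, 1]])
Q_int = np.array([[1], [1]])
alpha = np.array([[1.0, 1.0], [1.0, 0.0]])
P = q_masked_forward(alpha, Q, Q_int,
                     W_main=np.ones((2, 2)), W_int=np.ones((1, 2)), b=-1.0)
```

In the toy check, the fully proficient examinee receives a higher predicted success probability on item 2 than the examinee lacking skill 2, and the interaction feature contributes only for the former.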
Secondly, in terms of model training, Cui et al. trained their networks using ideal response patterns and ideal mastery patterns [11], and Xue et al. trained their network parameters with a semi-supervised co-training method [15]. For a given Q-matrix, the ideal data are very limited and do not cover the real responses of students [34]. As a result, neural network models trained on ideal data suffer from poor generalization and perform worse than traditional CDMs on a test set. Although the semi-supervised method can, to a certain extent, utilize student response data without expert labels, it is sensitive to the initial labels and to label noise, which limits further improvement of model performance. In this work, we trained the neural network differently, optimizing the network parameters through a combined strategy of pre-training and fine-tuning [35]. This enabled our method to maintain robust performance even on datasets with small sample sizes ($N \leq 100$). Evidence from both the simulated and real datasets supports this assertion.
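The pre-train-then-fine-tune strategy can be illustrated with a deliberately small stand-in model: a logistic regression trained by gradient descent, where "fine-tuning" simply means continuing gradient descent from the pre-trained weights on a small target sample. All data below are synthetic and the hyperparameters are arbitrary; this is a sketch of the training regime, not the Dual Q-Net training code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, w=None, lr=0.1, epochs=200):
    """Gradient descent on binary cross-entropy. Passing `w` continues
    from existing weights, which is all that fine-tuning means here."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    if w is None:
        w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = sigmoid(X @ w)
        w -= lr * X.T @ (p - y) / len(y)   # mean BCE gradient
    return w

rng = np.random.default_rng(0)
true_w = np.array([1.5, -2.0, 0.5])

# pre-training on a large pooled "simulated" set
X_pre = rng.normal(size=(1000, 3))
y_pre = (X_pre @ true_w > 0).astype(float)
w0 = train(X_pre, y_pre)

# fine-tuning on a small target sample (N = 100), starting from w0
X_ft = rng.normal(size=(100, 3))
y_ft = (X_ft @ true_w > 0).astype(float)
w = train(X_ft, y_ft, w=w0.copy(), lr=0.01, epochs=50)
acc = float(((sigmoid(X_ft @ w) > 0.5) == (y_ft > 0.5)).mean())
```

In the paper's setting, the pooled simulated datasets play the role of `X_pre` and the sub-dataset of a given sample size plays the role of `X_ft`; because the pre-trained weights already capture the shared structure, the small fine-tuning set suffices.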
Finally, while our method demonstrated commendable performance across multiple datasets, it is not without limitations. Firstly, our experimental data predominantly relied on non-compensatory modeling assumptions. Hence, the structure and reliability of our networks when employing compensatory modeling assumptions warrant further exploration. Secondly, in this work, we focused only on dichotomous attributes and responses. The feasibility of seamlessly adapting our network design philosophy to polytomous attributes and responses remains to be validated in future research. Lastly, although our method performed well in experiments, it has not yet been applied in practical educational assessments. Therefore, the generalizability and usability of our method in practical scenarios still need to be tested.

5. Conclusions

CDAs accurately evaluate students' skill mastery based on their response data, which is crucial for personalized learning and teaching. In this paper, we proposed Dual Q-Net, a novel neural network cognitive diagnosis method based on Q-matrix constraints and inspired by the GDINA model and the deep principles of cognitive diagnosis. We comprehensively evaluated our method through numerous experiments on both simulated and real data. The results demonstrated that our method offers significant improvements over traditional methods: it not only effectively captures the interactions between skills, but is also more robust to the number of skills, to the guessing and slip parameters of the items, and to the sample size. Our findings further indicate the substantial potential of neural networks in cognitive diagnosis applications. Moving forward, we aim to integrate our method into sequence models to tackle longitudinal cognitive diagnostic challenges [36]. Subsequently, these methods will be applied to teaching practice and continuously refined.

Author Contributions

Conceptualization, J.T., W.Z. and X.G.; methodology, J.T. and N.C.; software, J.T.; validation, J.T., X.G. and N.C.; formal analysis, J.T. and W.Z.; investigation, J.T. and N.C.; resources, W.Z. and F.L.; data curation, J.T., X.G. and N.C.; writing—original draft preparation, J.T., X.G. and N.C.; writing—review and editing, J.T., X.G., N.C. and W.Z.; visualization, J.T., Q.G. and X.X.; supervision, W.Z. and F.L.; project administration, W.Z. and H.D.; funding acquisition, F.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Social Science Foundation of China grant number BCA200083.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets from fraction subtraction and the elementary probability test theory assessment were analyzed in this study. These data can be found at https://cran.r-project.org/src/contrib/Archive/edmdata/, accessed on 25 July 2021. The manual datasets and code are available at https://github.com/jhong-tao/Dual-Q-Net, accessed on 1 June 2024.

Acknowledgments

We would like to thank all the team members of flexCDMs http://www.psychometrics-studio.cn/ (accessed on 8 July 2017) for the great efforts they have made in constructing the cognitive diagnosis analysis service prior to this study. Special thanks to Dongbo Tu from Jiangxi Normal University, who gave us guidelines and suggestions on the experimental design by email during the study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Templin, J.L.; Henson, R.A. Measurement of psychological disorders using cognitive diagnosis models. Psychol. Methods 2006, 11, 287–305. [Google Scholar] [CrossRef] [PubMed]
  2. Maas, L.; Brinkhuis, M.J.S.; Kester, L.; Wijngaards-de Meij, L. Cognitive Diagnostic Assessment in University Statistics Education: Valid and Reliable Skill Measurement for Actionable Feedback Using Learning Dashboards. Appl. Sci. 2022, 12, 4809. [Google Scholar] [CrossRef]
  3. Song, L.; He, M.; Shang, X.; Yang, C.; Liu, J.; Yu, M.; Lu, Y. A deep cross-modal neural cognitive diagnosis framework for modeling student performance. Expert Syst. Appl. 2023, 230, 120675. [Google Scholar] [CrossRef]
  4. Jiang, B.; Li, X.; Yang, S.; Kong, Y.; Cheng, W.; Hao, C.; Lin, Q. Data-Driven Personalized Learning Path Planning Based on Cognitive Diagnostic Assessments in MOOCs. Appl. Sci. 2022, 12, 3982. [Google Scholar] [CrossRef]
  5. Chiu, C.Y.; Douglas, J. A Nonparametric Approach to Cognitive Diagnosis by Proximity to Ideal Response Patterns. J. Classif. 2013, 30, 225–250. [Google Scholar] [CrossRef]
  6. Chiu, C.Y.; Sun, Y.; Bian, Y. Cognitive Diagnosis for Small Educational Programs: The General Nonparametric Classification Method. Psychometrika 2017, 83, 355–375. [Google Scholar] [CrossRef]
  7. de la Torre, J. The Generalized DINA Model Framework. Psychometrika 2011, 76, 179–199. [Google Scholar] [CrossRef]
  8. Ma, W.; de la Torre, J. GDINA: An R Package for Cognitive Diagnosis Modeling. J. Stat. Softw. 2020, 93, 1–26. [Google Scholar] [CrossRef]
  9. Zhan, P. Deterministic Input, Noisy Mixed Modeling for Identifying Coexisting Condensation Rules in Cognitive Diagnostic Assessments. J. Intell. 2023, 11, 55. [Google Scholar] [CrossRef]
  10. Liu, Q. Towards a New Generation of Cognitive Diagnosis. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI-21), Montreal, QC, Canada, 19–27 August 2021. [Google Scholar] [CrossRef]
  11. Cui, Y.; Gierl, M.; Guo, Q. Statistical classification for cognitive diagnostic assessment: An artificial neural network approach. Educ. Psychol. 2015, 36, 1065–1082. [Google Scholar] [CrossRef]
  12. Wen, H.; Liu, Y.; Zhao, N. Longitudinal Cognitive Diagnostic Assessment Based on the HMM/ANN Model. Front. Psychol. 2020, 11, 2145. [Google Scholar] [CrossRef] [PubMed]
  13. Wang, F.; Liu, Q.; Chen, E.; Huang, Z.; Chen, Y.; Yin, Y.; Huang, Z.; Wang, S. Neural Cognitive Diagnosis for Intelligent Education Systems. Proc. AAAI Conf. Artif. Intell. 2020, 34, 6153–6161. [Google Scholar] [CrossRef]
  14. Chen, Y.; Liu, Y.; Culpepper, S.A.; Chen, Y. Inferring the Number of Attributes for the Exploratory DINA Model. Psychometrika 2021, 86, 30–64. [Google Scholar] [CrossRef] [PubMed]
  15. Xue, K.; Bradshaw, L.P. A Semi-supervised Learning-Based Diagnostic Classification Method Using Artificial Neural Networks. Front. Psychol. 2021, 11, 618336. [Google Scholar] [CrossRef] [PubMed]
  16. Ding, S.L.; Zhu, Y.F.; Lin, H.J.; Cai, Y. Modification of Tatsuoka’s Q Matrix Theory: Modif. Tatsuoka’s Q Matrix Theory. Acta Psychol. Sin. 2009, 41, 175–181. [Google Scholar] [CrossRef]
  17. Culpepper, S.A. Estimating the Cognitive Diagnosis Q Matrix with Expert Knowledge: Application to the Fraction-Subtraction Dataset. Psychometrika 2018, 84, 333–357. [Google Scholar] [CrossRef]
  18. Biggs, N. The roots of combinatorics. Hist. Math. 1979, 6, 109–136. [Google Scholar] [CrossRef]
  19. Xin, T.; Zhang, J. Local Equating of Cognitively Diagnostic Modeled Observed Scores. Appl. Psychol. Meas. 2014, 39, 44–61. [Google Scholar] [CrossRef]
  20. Rosenblatt, F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychol. Rev. 1958, 65, 386–408. [Google Scholar] [CrossRef]
  21. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning internal representation by error propagation. Parallel Distrib. Process. 1986, 1, 318–362. [Google Scholar]
  22. Ma, W.; Guo, W. Cognitive diagnosis models for multiple strategies. Br. J. Math. Stat. Psychol. 2019, 72, 370–392. [Google Scholar] [CrossRef] [PubMed]
  23. Wang, D.; Ma, W.; Cai, Y.; Tu, D. A general nonparametric classification method for multiple strategies in cognitive diagnostic assessment. Behav. Res. Methods 2023, 56, 723–735. [Google Scholar] [CrossRef]
  24. Chen, D.; Yan, C. Classification of Attribute Mastery Patterns Using Deep Learning. Open J. Model. Simul. 2021, 09, 198–210. [Google Scholar] [CrossRef]
  25. Sen, S.; Cohen, A.S. Sample Size Requirements for Applying Diagnostic Classification Models. Front. Psychol. 2021, 11, 621251. [Google Scholar] [CrossRef] [PubMed]
  26. Kreitchmann, R.S.; de la Torre, J.; Sorrel, M.A.; Nájera, P.; Abad, F.J. Improving reliability estimation in cognitive diagnosis modeling. Behav. Res. Methods 2022, 55, 3446–3460. [Google Scholar] [CrossRef] [PubMed]
  27. Tatsuoka, C. Data Analytic Methods for Latent Partially Ordered Classification Models. J. R. Stat. Soc. Ser. C Appl. Stat. 2002, 51, 337–350. [Google Scholar] [CrossRef]
  28. Heller, J.; Wickelmaier, F. Minimum Discrepancy Estimation in Probabilistic Knowledge Structures. Electron. Notes Discret. Math. 2013, 42, 49–56. [Google Scholar] [CrossRef]
  29. de la Torre, J.; Douglas, J.A. Higher-order latent trait models for cognitive diagnosis. Psychometrika 2004, 69, 333–353. [Google Scholar] [CrossRef]
  30. Balamuta, J.J.; Culpepper, S.A.; Douglas, J.A. edmdata: Data Sets for Psychometric Modeling. R Package Version 1.2.0. 2021. Available online: https://mirrors.pku.edu.cn/CRAN/web/packages/edmdata/index.html (accessed on 25 July 2021).
  31. Philipp, M.; Strobl, C.; de la Torre, J.; Zeileis, A. On the Estimation of Standard Errors in Cognitive Diagnosis Models. J. Educ. Behav. Stat. 2017, 43, 88–115. [Google Scholar] [CrossRef]
  32. Min, B.; Ross, H.; Sulem, E.; Veyseh, A.P.B.; Nguyen, T.H.; Sainz, O.; Agirre, E.; Heintz, I.; Roth, D. Recent Advances in Natural Language Processing via Large Pre-trained Language Models: A Survey. ACM Comput. Surv. 2023, 56, 1–40. [Google Scholar] [CrossRef]
  33. de la Torre, J. DINA Model and Parameter Estimation: A Didactic. J. Educ. Behav. Stat. 2009, 34, 115–130. [Google Scholar] [CrossRef]
  34. Xiong, J.; Luo, F.; Ding, S.; Duan, H. A Cognitive Diagnosis Method Based on Mahalanobis Distance. In Quantitative Psychology; Springer International Publishing: Chem, Switzerland, 2018; pp. 319–333. [Google Scholar] [CrossRef]
  35. Tajbakhsh, N.; Shin, J.Y.; Gurudu, S.R.; Hurst, R.T.; Kendall, C.B.; Gotway, M.B.; Liang, J. Convolutional Neural Networks for Medical Image Analysis: Full Training or Fine Tuning? IEEE Trans. Med Imaging 2016, 35, 1299–1312. [Google Scholar] [CrossRef] [PubMed]
  36. Zhan, P. Refined Learning Tracking with a Longitudinal Probabilistic Diagnostic Model. Educ. Meas. Issues Pract. 2020, 40, 44–58. [Google Scholar] [CrossRef]
Figure 1. Example of a cognitive diagnostic process, used to infer and predict students’ level of mastery of knowledge concepts through cognitive diagnostic assessments. (a) Collection of student response data: The Q-matrix is transformed into test items, with student responses recorded. Circles in different colors represent distinct knowledge concepts or skills, and items denote test questions composed of these concepts. Solid connecting lines indicate the constituent relationship between items and the respective knowledge concepts or skills. (b) Visualization report of cognitive diagnostic results. Additionally, CDM refers to a specifically designated cognitive diagnosis model.
Figure 2. The neural network architecture of the Dual Q-Net cognitive diagnostic model. Note, the orange line (i.e., the orange computational flow with a vertical striped background) indicates that the main effect is constrained by the Q matrix. The green line (i.e., the green computational flow with a horizontal striped background) indicates that the secondary effect between skills is constrained by the interactive Q matrix. The purple line (i.e., the purple computational flow with a diagonal striped background) represents the combination of the main effect with the secondary effect. The blue neuron (i.e., the blue computational flow with a plain, unstriped background) represents the loss function used to calculate the loss value. The cyan X -blocks and the green A -blocks represent the student response data and skill mastery patterns, respectively. Both are considered external data, with the former serving as input data for the model and the latter as supervision data.
Figure 3. Training dynamics: loss, AAR, and PAR values of the neural networks on simulated data generated by the DINA model with Q-matrix $Q_1$. Subplots (a–c) show the model training process on the high-quality dataset (i.e., $P(0), 1-P(1) \sim U(0, 0.15)$), and subplots (d–f) show the model training process on the low-quality dataset (i.e., $P(0), 1-P(1) \sim U(0.15, 0.30)$).
Figure 4. Training dynamics: loss, AAR, and PAR values of the neural networks on simulated data generated by the GDINA model with Q-matrix $Q_2$. Subplots (a–c) show the model training process on the high-quality dataset (i.e., $P(0), 1-P(1) \sim U(0, 0.15)$), and subplots (d–f) show the model training process on the low-quality dataset (i.e., $P(0), 1-P(1) \sim U(0.15, 0.30)$).
Table 1. Details of the simulated dataset.
| Data Type | Data Name | g, s Levels | Q-Matrices | Sim Models |
|---|---|---|---|---|
| sim data | HSD1 | low | $Q_1 \in \mathbb{R}^{10 \times 3}$ | DINA |
| sim data | LSD1 | high | $Q_1 \in \mathbb{R}^{10 \times 3}$ | DINA |
| sim data | HSD2 | low | $Q_2 \in \mathbb{R}^{31 \times 5}$ | GDINA |
| sim data | LSD2 | high | $Q_2 \in \mathbb{R}^{31 \times 5}$ | GDINA |

Note: Guessing and slip parameters $g, s \sim U(0, 0.15)$ are labeled "low", indicating that the simulated data were defined as high-quality; $g, s \sim U(0.15, 0.30)$ is the opposite. The Q-matrices $Q_1$ and $Q_2$ were derived from Equations (17) and (18).
Table 2. AAR classification agreement rate between the different methods on the simulated dataset.
| Data | N | DINA | GDINA | NPC | GNPC | MLP | ANN | Dual Q-Net |
|---|---|---|---|---|---|---|---|---|
| HSD1 | 100 | 0.990 | 0.986 | 0.987 | 0.987 | 0.992 | 0.994 | 0.995 |
| | 200 | 0.993 | 0.990 | 0.989 | 0.989 | 0.993 | 0.995 | 0.995 |
| | 300 | 0.990 | 0.989 | 0.988 | 0.988 | 0.991 | 0.994 | 0.994 |
| | 500 | 0.993 | 0.992 | 0.987 | 0.988 | 0.991 | 0.994 | 0.995 |
| LSD1 | 100 | 0.848 | 0.813 | 0.858 | 0.846 | 0.891 | 0.896 | 0.903 |
| | 200 | 0.870 | 0.832 | 0.859 | 0.850 | 0.880 | 0.884 | 0.897 |
| | 300 | 0.874 | 0.845 | 0.858 | 0.852 | 0.877 | 0.880 | 0.889 |
| | 500 | 0.893 | 0.859 | 0.860 | 0.855 | 0.878 | 0.881 | 0.890 |
| HSD2 | 100 | 0.784 | 0.982 | 0.752 | 0.958 | 0.998 | 0.997 | 0.997 |
| | 200 | 0.765 | 0.987 | 0.756 | 0.971 | 0.995 | 0.997 | 0.997 |
| | 300 | 0.728 | 0.989 | 0.752 | 0.979 | 0.993 | 0.995 | 0.996 |
| | 500 | 0.751 | 0.991 | 0.755 | 0.984 | 0.994 | 0.995 | 0.997 |
| LSD2 | 100 | 0.734 | 0.821 | 0.757 | 0.836 | 0.948 | 0.950 | 0.970 |
| | 200 | 0.707 | 0.840 | 0.750 | 0.847 | 0.942 | 0.951 | 0.967 |
| | 300 | 0.719 | 0.848 | 0.753 | 0.845 | 0.922 | 0.938 | 0.958 |
| | 500 | 0.697 | 0.868 | 0.752 | 0.847 | 0.912 | 0.926 | 0.944 |
Note, N denotes the sample size of the simulated dataset.
Table 3. PAR classification agreement rate between different methods on the simulated dataset.
| Data | N | DINA | GDINA | NPC | GNPC | MLP | ANN | Dual Q-Net |
|---|---|---|---|---|---|---|---|---|
| HSD1 | 100 | 0.975 | 0.965 | 0.971 | 0.971 | 0.977 | 0.983 | 0.984 |
| | 200 | 0.978 | 0.974 | 0.974 | 0.974 | 0.981 | 0.985 | 0.986 |
| | 300 | 0.974 | 0.972 | 0.972 | 0.972 | 0.972 | 0.981 | 0.983 |
| | 500 | 0.977 | 0.976 | 0.974 | 0.976 | 0.976 | 0.982 | 0.984 |
| LSD1 | 100 | 0.620 | 0.545 | 0.669 | 0.627 | 0.709 | 0.720 | 0.741 |
| | 200 | 0.680 | 0.597 | 0.679 | 0.640 | 0.690 | 0.700 | 0.729 |
| | 300 | 0.682 | 0.615 | 0.674 | 0.643 | 0.681 | 0.689 | 0.707 |
| | 500 | 0.724 | 0.644 | 0.678 | 0.648 | 0.680 | 0.685 | 0.710 |
| HSD2 | 100 | 0.271 | 0.914 | 0.240 | 0.808 | 0.990 | 0.987 | 0.985 |
| | 200 | 0.224 | 0.935 | 0.241 | 0.870 | 0.611 | 0.983 | 0.985 |
| | 300 | 0.193 | 0.946 | 0.236 | 0.899 | 0.965 | 0.974 | 0.982 |
| | 500 | 0.206 | 0.958 | 0.239 | 0.958 | 0.968 | 0.976 | 0.982 |
| LSD2 | 100 | 0.212 | 0.362 | 0.234 | 0.401 | 0.760 | 0.766 | 0.856 |
| | 200 | 0.146 | 0.421 | 0.216 | 0.434 | 0.736 | 0.775 | 0.847 |
| | 300 | 0.164 | 0.456 | 0.224 | 0.425 | 0.664 | 0.716 | 0.802 |
| | 500 | 0.138 | 0.499 | 0.222 | 0.423 | 0.624 | 0.677 | 0.746 |
Note: N denotes the sample size of the simulated dataset.
Table 4. Performance statistics of different methods on various real datasets.
| Data | N   | Metric        | DINA  | GDINA | NPC   | GNPC  | MLP   | ANN   | Dual Q-Net |
|------|-----|---------------|-------|-------|-------|-------|-------|-------|------------|
| EPTT | 100 | AAR           | 0.953 | 0.960 | 0.940 | 0.938 | 0.977 | 0.978 | 0.987 |
| EPTT | 100 | PAR(K_e)      | 0.840 | 0.862 | 0.809 | 0.773 | 0.914 | 0.915 | 0.953 |
| EPTT | 100 | PAR(K_e − 1)  | 0.972 | 0.981 | 0.957 | 0.981 | 0.996 | 0.995 | 0.996 |
| EPTT | 200 | AAR           | 0.961 | 0.972 | 0.944 | 0.946 | 0.968 | 0.974 | 0.983 |
| EPTT | 200 | PAR(K_e)      | 0.861 | 0.903 | 0.818 | 0.800 | 0.881 | 0.903 | 0.937 |
| EPTT | 200 | PAR(K_e − 1)  | 0.985 | 0.986 | 0.961 | 0.987 | 0.991 | 0.993 | 0.994 |
| EPTT | 300 | AAR           | 0.962 | 0.983 | 0.940 | 0.942 | 0.969 | 0.973 | 0.985 |
| EPTT | 300 | PAR(K_e)      | 0.865 | 0.937 | 0.808 | 0.781 | 0.887 | 0.899 | 0.943 |
| EPTT | 300 | PAR(K_e − 1)  | 0.982 | 0.994 | 0.956 | 0.986 | 0.991 | 0.992 | 0.997 |
| FRAC | 100 | AAR           | 0.935 | 0.882 | 0.850 | 0.857 | 0.956 | 0.955 | 0.979 |
| FRAC | 100 | PAR(K_f)      | 0.606 | 0.416 | 0.388 | 0.351 | 0.699 | 0.687 | 0.837 |
| FRAC | 100 | PAR(K_f − 1)  | 0.895 | 0.725 | 0.686 | 0.663 | 0.951 | 0.958 | 0.993 |
| FRAC | 200 | AAR           | 0.956 | 0.886 | 0.850 | 0.829 | 0.962 | 0.963 | 0.973 |
| FRAC | 200 | PAR(K_f)      | 0.733 | 0.430 | 0.397 | 0.300 | 0.739 | 0.741 | 0.806 |
| FRAC | 200 | PAR(K_f − 1)  | 0.932 | 0.753 | 0.688 | 0.601 | 0.965 | 0.967 | 0.977 |
| FRAC | 300 | AAR           | 0.969 | 0.873 | 0.845 | 0.831 | 0.969 | 0.970 | 0.976 |
| FRAC | 300 | PAR(K_f)      | 0.787 | 0.363 | 0.392 | 0.317 | 0.779 | 0.786 | 0.826 |
| FRAC | 300 | PAR(K_f − 1)  | 0.967 | 0.700 | 0.684 | 0.594 | 0.978 | 0.980 | 0.981 |
Note: N denotes the sample size of the subdataset. The symbols K_e and K_f denote the number of skills examined in the datasets EPTT and FRAC, respectively, where K_e = 4 and K_f = 8. PAR(k) represents the agreement rate for two skill vectors of length K that have k identical elements.
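To make the two metrics concrete, the sketch below computes AAR (the proportion of individual skill entries on which the estimated and true profiles agree) and PAR(k) over binary skill-profile matrices. This is our illustration, not code from the paper: the function names are ours, and we read PAR(k) as counting profile pairs that agree on at least k of the K skills, so that PAR(K) is the exact-match rate.

```python
import numpy as np

def aar(est, true):
    """Attribute agreement rate: fraction of matching skill entries
    across all examinees and all K skills."""
    est, true = np.asarray(est), np.asarray(true)
    return float((est == true).mean())

def par(est, true, k=None):
    """Pattern agreement rate PAR(k): fraction of examinees whose
    estimated profile matches the true profile on at least k of the
    K skills (assumed reading; k defaults to K, i.e., exact match)."""
    est, true = np.asarray(est), np.asarray(true)
    K = true.shape[1]
    if k is None:
        k = K
    matches_per_examinee = (est == true).sum(axis=1)
    return float((matches_per_examinee >= k).mean())

# Toy example with K = 4 skills and two examinees: the first profile
# disagrees on one skill, the second matches exactly.
true = [[1, 0, 1, 0], [1, 1, 1, 1]]
est  = [[1, 0, 1, 1], [1, 1, 1, 1]]
print(aar(est, true))        # 7 of 8 entries agree -> 0.875
print(par(est, true))        # PAR(K): 1 of 2 exact matches -> 0.5
print(par(est, true, k=3))   # PAR(K - 1): both agree on >= 3 -> 1.0
```

Under this reading, PAR(K_e − 1) ≥ PAR(K_e) always holds, which is consistent with the pattern in Table 4.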