Article

Deep Knowledge Tracing Integrating Temporal Causal Inference and PINN

Faming Lu, Yingran Li and Yunxia Bao
1 College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China
2 College of Mathematics and System Science, Shandong University of Science and Technology, Qingdao 266590, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(3), 1504; https://doi.org/10.3390/app15031504
Submission received: 20 December 2024 / Revised: 26 January 2025 / Accepted: 30 January 2025 / Published: 1 February 2025

Abstract

Knowledge tracing predicts students' future performance based on their historical performance data, which is significant for learning resource recommendation, learning path prediction, and other tasks. Students' knowledge mastery, learning ability, and question difficulty all influence the performance of knowledge tracing. This paper proposes a deep knowledge tracing model that integrates temporal causal inference and the PINN (Physics-Informed Neural Network) model. The model first uses a temporal causality model to explore the causal relationships between students' knowledge points, which are then combined with a deep learning-based knowledge tracing model for prediction. Next, it treats the logistic model as a 'physical model', adds a corresponding loss term, accounts for the confounding factors caused by students' answer preferences, and adjusts students' learning ability through backdoor adjustment to obtain more accurate predictions. On the public education datasets ASSISTment2012 and ASSISTchall, the predictive performance of the TLPKT_PINN model is superior to several classical models and LPKT. From the experimental results, we conclude that considering the mastery of causally related knowledge points and adjusting the loss term can improve the accuracy of predicting student performance.

1. Introduction

With the increasing emergence of online learning platforms, students can study courses through the internet. AI-powered educational technologies (AI-EdTech) are increasingly being used to automate and scaffold learning activities [1]. Smart learning systems offer new ways of acquiring knowledge and have been expanding in popularity and influence over recent decades [2]. More and more traditional classrooms are integrating with online education, leading to a continuous accumulation of data from online learning activities. These data can be used in scenarios such as student knowledge state identification, learning performance prediction, and exercise recommendation. Students can solve problems autonomously based on their interests or teachers' requirements and check their mastery levels to engage in targeted learning; teachers can likewise monitor students' mastery levels to optimize their teaching methods. Knowledge tracing is a key research area in educational data mining and has attracted considerable attention from academics [3]. It is the task of tracking students' knowledge states based on their learning activities, and approaches are mainly divided into Bayesian-based and deep learning-based knowledge tracing.
The classic Bayesian Knowledge Tracing (BKT) model [4] estimates students' mastery of knowledge points through a probabilistic model, using a Hidden Markov Model (HMM) [5] to treat students' knowledge states as hidden variables; BKT predicts the next knowledge state from a state transition matrix. Its advantages include dynamically updating a student's knowledge state to deal with uncertainty and providing personalized learning recommendations. However, BKT also has drawbacks, such as sensitivity to model parameters, which creates the need for large amounts of historical data to model accurately, and high computational complexity on large-scale datasets. Moreover, the model assumes by default that students never forget what they have learned, which does not hold in practice. With the development of deep learning, knowledge tracing models have begun to integrate deep learning techniques. The Deep Knowledge Tracing (DKT) model [6] uses recurrent neural networks (RNNs) [7] to capture students' learning processes, predicting students' mastery of knowledge points from their sequences of responses. The advantages of DKT include its ability to handle complex time series data and capture long-range dependencies, thereby providing personalized learning recommendations. However, deep learning-based knowledge tracing models may lack interpretability, and DKT models one skill at a time, ignoring the relationships between different skills.
In recent years, the Learning Process-consistent Knowledge Tracing (LPKT) model [8] has been used to track knowledge by simulating the evolving paths of knowledge mastery during the student learning process. This model can capture and utilize the dynamic changes in students' learning processes, providing more precise personalized learning support. However, it only considers the relationship between knowledge states and exercises, without accounting for the causal relationships between knowledge points. Real-life teaching tells us that such causal relationships exist: for example, addition is a cause of multiplication, and multiplication is a result of addition. When a student is learning a certain knowledge point, their mastery of its causally related knowledge points affects their mastery of that point. Therefore, when predicting students' grades, we need to consider the causal relationships between knowledge points, which the LPKT model does not.
In our previous work, we used a causal inference algorithm to mine causal relationships between levels of knowledge mastery and established rules to identify the strength of causal effects, but we did not consider the time factor. A temporal causal model accounts for time, captures the dynamics of causal relationships as they evolve, describes long-term and feedback effects, and adapts to complex temporal systems, thereby yielding more accurate causal relationships between knowledge points. Additionally, predictions may contradict the principle that "if a student performs well, their corresponding learning ability is high; if a student has a high mastery level of knowledge points, they may prefer relatively difficult exercises". Building on the LPKT model, we propose a deep learning knowledge tracing framework, the TLPKT_PINN model, based on the temporal causality model, and we optimize the loss function by incorporating a physical loss. We conducted experiments on two real datasets, and the results indicate that our framework tracks the evolution of knowledge states more accurately. The innovations of our work are as follows:
(1)
We use a temporal causal model to explore the relationships between knowledge points; the resulting causal relations are combined with students' overall mastery levels to derive the final knowledge state influenced by causally related knowledge points.
(2)
By implementing backdoor adjustment, we obtain students' learning abilities and exercise difficulty levels while effectively removing the confounding factors related to students' answering preferences, which improves the estimation of both learning abilities and exercise difficulties.
(3)
We add a physical loss term and improve the accuracy of student performance prediction by increasing the penalty for violating common-sense patterns of the student learning process.

2. Related Work

2.1. Knowledge Tracing

Knowledge tracing is a well-known problem in AI for education, consisting of monitoring how the knowledge state of students changes during the learning process and accurately predicting their performance in future exercises [9].
Knowledge tracing can be traced back to the late 20th century, when Corbett and Anderson (1995) [4] proposed BKT. BKT uses an HMM to treat students' knowledge states as hidden variables and predicts the next state based on a state transition matrix. The BKT model relies on four parameters: the probability that a student has mastered the knowledge point before answering related questions; the probability that the student transitions from not mastering to mastering the knowledge point on the next attempt; the probability that the student has not mastered the knowledge point but answers correctly (the guessing probability); and the probability that the student has mastered the knowledge point but answers incorrectly (the slipping probability). With these four parameters, the model characterizes the student's response to the next question using conditional probabilities. BKT has since seen many extensions. Käser et al. [10] proposed the Dynamic Bayesian Knowledge Tracing (DBKT) model, which uses Dynamic Bayesian Networks (DBNs) to capture the dynamic changes in students' knowledge states during learning, leading to more accurate student modeling and personalized education. De Baker et al. [11] introduced the Three-Learning-State BKT (TLS-BKT) model, which divides the learning process into three states through an evaluation function. It refines the binary node states of BKT, replacing the original "not mastered/mastered" states with "not mastered/learning/mastered", enhancing the model's flexibility and robustness.
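To make these four parameters concrete, the following minimal Python sketch applies the standard BKT posterior update after a single observed response; the parameter values are illustrative, not those of any fitted model.

```python
# Minimal sketch of the standard BKT posterior update (illustrative parameters).
def bkt_update(p_mastery, correct, p_learn=0.2, p_guess=0.2, p_slip=0.1):
    """Update P(mastered) after observing one response to a related question."""
    if correct:
        # Bayes rule: P(mastered | correct answer)
        num = p_mastery * (1 - p_slip)
        den = p_mastery * (1 - p_slip) + (1 - p_mastery) * p_guess
    else:
        # Bayes rule: P(mastered | incorrect answer)
        num = p_mastery * p_slip
        den = p_mastery * p_slip + (1 - p_mastery) * (1 - p_guess)
    posterior = num / den
    # Transition step: the student may move from not-mastered to mastered.
    return posterior + (1 - posterior) * p_learn

p = 0.3  # prior probability of mastery before the first attempt
for obs in [1, 1, 0, 1]:  # observed correctness sequence
    p = bkt_update(p, obs)
    print(round(p, 3))
```

The predicted probability of a correct next answer would then be p*(1 - p_slip) + (1 - p)*p_guess, which is how BKT turns the hidden state into a response prediction.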
In recent years, an increasing number of researchers have introduced deep learning into knowledge tracing to enhance its expressiveness and performance. A typical deep learning model is DKT, proposed by Piech et al., which builds on RNNs with long short-term memory (LSTM); such networks can learn sequences of observations, making them well suited for time series applications [12]. DKT was the first model to apply deep learning to knowledge tracing, allowing it to capture more complex student knowledge states without explicit manual encoding of the knowledge domain. It uses an RNN hidden state to generate knowledge state vectors representing students' knowledge and outputs predicted student responses through a sigmoid linear layer. The LSTM mechanism, with its inherent "gate" design, can effectively extract the features and correlations of multiple time sequences [13]. Owing to the interpretability issues of its inputs and outputs, deep learning knowledge tracing has seen numerous improved models. For example, the Dynamic Key–Value Memory Network (DKVMN) [14] incorporates an attention mechanism, allowing it to utilize the relationships between underlying concepts and directly output students' mastery levels for each concept. Graph-based Knowledge Tracing (GKT) [15] is a knowledge tracing method based on graph neural networks, which captures the complex relationships between knowledge by constructing knowledge graphs and leveraging the representation learning capabilities of graph neural networks to model students' learning processes. The Context-aware Attentive Knowledge Tracing (AKT) [16] model constructs context-aware representations of questions and answers, uses a monotonic attention mechanism to summarize students' past performance over multiple time scales, and employs the Rasch model to capture individual differences among questions covering the same concept. At present, deep learning technology has achieved state-of-the-art results in processing Euclidean data [17]. Lyu et al. [18] proposed the DKT-STDRL model, which uses convolutional neural networks to extract spatial features from students' learning sequences and LSTM to process temporal features.

2.2. Temporal Causal Inference

In real life, most events do not occur at regular intervals; they occur at discrete, irregular times. A point process is a powerful modeling tool for event sequences, consisting of a sequence [19] of binary events occurring in continuous time. Point processes have been successfully applied in various fields such as social networks, finance, equipment maintenance, and electronic health records. The characteristic feature of point process models is their intensity function. The occurrence times of the event types $e \in \{1, \dots, E\}$ are unevenly distributed. A multivariate point process containing $E$ types of events can be represented using counting processes $\{N_e\}_{e=1}^{E}$, where $N_e = \{N_e(t) \mid t \in [0, T]\}$. The intensity function shown in Equation (1) is defined for type $e$ as the expected instantaneous rate of occurrence of type-$e$ events given the history. Each intensity function captures the instantaneous occurrence rate of one class of events conditioned on historical events.

\lambda_e(t) = \frac{\mathbb{E}[\, dN_e(t) \mid \mathcal{H}_t \,]}{dt} \quad (1)

where $\mathcal{H}_t = \{(t_i, e_i) \mid t_i < t,\ e_i \in \{1, \dots, E\}\}$ represents all events of any type that occurred before time $t$.
Granger causality [20] emphasizes the temporal ordering of events and was originally used to study the dependency structure of multivariate time series; it has also been extended to multi-type event sequences [21]. In short, to test causality between A and B, let $\Omega_n$ be the set of all information up to time $n$ ($n = 1, 2, \dots$), including information other than A and B, and let $B_n$ be all information about B up to time $n$. Both $B_n$ and $\Omega_n$ are multivariate random variables, with $B_n \subseteq \Omega_n$, so $\Omega_n \setminus B_n$ represents all information up to time $n$ excluding B. Assume the following: (1) the present and the past can influence the future, but the future cannot influence the past; (2) $\Omega_n$ contains no redundant information, meaning that if a variable $Z_n$ is functionally related to one or more other variables, it is removed from $\Omega_n$. If $P(A_{n+1} \mid \Omega_n) \neq P(A_{n+1} \mid \Omega_n \setminus B_n)$, then variable B is considered a cause of variable A, indicating that $B_n$ contains unique information that influences the occurrence of $A_{n+1}$. Intuitively, for event sequence data, if including historical events of one type improves the prediction of future events of another type, we say that one type of event has Granger causality with respect to the other type.
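As an illustration of the basic idea (not the event-sequence extension used later in this paper), the following sketch runs a classical Granger causality test on two synthetic series with statsmodels; the series and lag order are made up for the example.

```python
# Hedged sketch: classical Granger causality test on synthetic series.
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
b = rng.normal(size=300)                                     # series B
a = 0.8 * np.roll(b, 1) + rng.normal(scale=0.3, size=300)    # A depends on past B

# Column order matters: the test asks whether the 2nd column
# Granger-causes the 1st column.
data = np.column_stack([a, b])
results = grangercausalitytests(data, maxlag=2)
# A small p-value for the F-test rejects "B does not Granger-cause A".
```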
The Hawkes process [22] assumes that past events independently and additively influence the occurrence of future events through a collection of pairwise kernel functions. It is a counting process used to model self-exciting behavior in a series of events occurring over time: each event increases the likelihood of the next event, and this excitation gradually decays over time.
The core concept of the Hawkes process is the conditional intensity function, which represents the expected rate of events near a given point in time, conditioned on all past events. Its mathematical expression is as follows:

\lambda(t) = \mu + \sum_{t_i < t} g(t - t_i) \quad (2)

where $\lambda(t)$ is the conditional intensity function, $\mu$ is the background intensity, $g(t - t_i)$ is the triggering kernel, and $t_i$ is the time of the $i$-th event. The triggering kernel can take various forms; one common choice is an exponentially decaying function, indicating that the excitation from a past event decreases exponentially as the time interval increases.
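The following short sketch evaluates the conditional intensity of Equation (2) with an exponential triggering kernel; the kernel form and the parameter values are illustrative assumptions.

```python
# Sketch: Hawkes conditional intensity with an exponential kernel
# g(dt) = alpha * beta * exp(-beta * dt).
import numpy as np

def hawkes_intensity(t, event_times, mu=0.5, alpha=0.8, beta=1.2):
    """lambda(t) = mu + sum of decaying excitations from past events."""
    past = np.asarray([ti for ti in event_times if ti < t])
    return mu + np.sum(alpha * beta * np.exp(-beta * (t - past)))

events = [1.0, 1.4, 3.2]
print(hawkes_intensity(4.0, events))  # intensity shortly after a burst of events
```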
An event sequence records the occurrences of a specific type of event, and point processes can be used to describe such sequences. Given a set of point processes as input, where each point process represents one event sequence, the output is a causal graph over the processes: each node represents a point process, and each directed edge captures the directed influence of one point process on another.

2.3. PINN Model

In the field of computational science, surrogate models play an increasingly important role, especially when handling complex physical processes and large-scale data analysis. Neural networks have become popular tools and are widely applied to simulate complex systems. However, traditional neural network models often fall short of strictly adhering to physical laws. Recent research [23] has combined high-fidelity and low-fidelity data, using data fusion techniques to reduce the impact of noise and improve solution accuracy. This method effectively integrates data of varying quality when dealing with complex real-world systems, thereby enhancing prediction performance.
M. Raissi et al. [24] combined Gaussian process models with linear differential equations. This method can automatically infer the parameters and structure of differential equations from data while handling uncertainty, thereby improving predictive performance. H. Owhadi [25] employed Gaussian process [26] regression to design function representations for given linear operators, accurately inferring solutions and providing uncertainty estimates for several prototype problems in mathematical physics. The Physics-Informed Neural Network (PINN) model [27] is a machine learning model that combines deep learning with physical knowledge. Unlike traditional data-driven neural networks, PINN models use physical laws to guide training, thereby enhancing generalization, which is particularly valuable when data are scarce or noisy. A PINN is typically a deep neural network whose distinguishing feature is the inclusion of physics-informed terms in the loss function that encode the physical laws being followed. In traditional machine learning, the learning process is primarily data-driven and models rely heavily on large amounts of high-quality data; in practice, data are often scarce or noisy, and purely data-driven models then struggle to achieve accurate, reliable predictions. Taking a one-dimensional damped harmonic oscillator as an example, the PINN works as follows: first, define the physical problem and the governing physical law, such as the oscillator's equation of motion. Given time t and position x, the network's outputs yield the velocity v and acceleration a. The loss function consists of data error terms and physics-informed error terms: the data terms measure the difference between predictions and observations, while the physics terms ensure the predictions comply with the governing law. By optimizing this loss, the network is trained to produce predictions that adhere to the physical laws.
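The following minimal PyTorch sketch illustrates this working principle in the common formulation where the network maps time t to position x(t) and the velocity and acceleration are obtained by automatic differentiation; the network size and the physical constants are illustrative assumptions.

```python
# Hedged sketch of a PINN for the damped harmonic oscillator
# m*x'' + c*x' + k*x = 0; the physics residual comes from autograd.
import torch

net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)
m, c, k = 1.0, 0.4, 4.0  # illustrative mass, damping, stiffness

def physics_loss(t):
    t = t.requires_grad_(True)
    x = net(t)                                                    # position
    v = torch.autograd.grad(x, t, torch.ones_like(x), create_graph=True)[0]
    a = torch.autograd.grad(v, t, torch.ones_like(v), create_graph=True)[0]
    residual = m * a + c * v + k * x   # deviation from the equation of motion
    return (residual ** 2).mean()

# Total loss = data mismatch on observed points + physics residual on
# collocation points; both are minimized jointly in a full training loop.
t_col = torch.linspace(0, 5, 100).unsqueeze(1)
loss = physics_loss(t_col)
loss.backward()
```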

3. TLPKT_PINN Model

3.1. Overall Framework of the Model

In the learning process, a student's learning activities consist of a series of exercises and the corresponding answers. Consider a set of students $S = \{s_1, \dots, s_I\}$, a set of exercises $E = \{e_1, \dots, e_J\}$, and a set of knowledge concepts $K = \{k_1, \dots, k_M\}$. At time point $t$, a student spends answer time $at_t$ on exercise $e_t$, which is derived from knowledge concept $k$, and produces a response $a \in \{0, 1\}$, where $a = 1$ indicates a correct answer and $a = 0$ an incorrect one. For a student, we therefore define the learning process as follows:

x = \{(e_1, at_1, a_1, t_1),\ it_1,\ (e_2, at_2, a_2, t_2),\ it_2,\ \dots,\ (e_n, at_n, a_n, t_n),\ it_n\}

where $(e_t, at_t, a_t, t)$ is a basic learning unit in the student's learning process: $e_t$ is the exercise, $at_t$ is the time the student spent answering $e_t$, $a_t$ is the binary correctness label (1 for correct, 0 for incorrect), and $it_t$ is the interval time between consecutive learning units.
Since each exercise is associated with specific knowledge points, we use a binary Q-matrix to represent the relationship between exercises and knowledge points: if exercise $e_j$ requires knowledge point $k_m$, then $Q_{jm} = 1$; otherwise, $Q_{jm} = 0$.
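For concreteness, a toy Q-matrix with three exercises and three knowledge points might look as follows (the entries are invented for illustration):

```python
# Toy Q-matrix: rows are exercises, columns are knowledge points;
# Q[j, m] = 1 iff exercise e_j requires knowledge point k_m.
import numpy as np

Q = np.array([
    [1, 0, 0],   # e_1 requires k_1
    [1, 1, 0],   # e_2 requires k_1 and k_2
    [0, 0, 1],   # e_3 requires k_3
])
print(Q[1])  # knowledge points needed for exercise e_2
```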
Our deep learning knowledge tracing model is shown in Figure 1; it includes three modules: a learning module, a forgetting module, and a prediction module. The specific implementation is explained in the following sections.
The innovations of this model are as follows:
  • Temporal causal knowledge point relationship mining: taking student IDs, problem-solving times, and the correctness of knowledge point responses as input, the intensity functions between knowledge points are obtained. The intensity functions are then combined with Granger causality to obtain the temporal causal relationships between knowledge points. These causal relationships are combined with the students' knowledge point mastery levels output by the knowledge tracing model to obtain the final mastery levels, which are used for predicting student performance;
  • Calculating students' learning abilities and exercise difficulties: students with different learning abilities have different answering preferences. Students with stronger learning abilities tend to prefer relatively difficult questions, which may result in a lower accuracy rate; we should not underestimate their learning ability because of this. Similarly, students with weaker learning abilities may have a low accuracy rate even on simple exercises, and we should not therefore overestimate the difficulty of those exercises. To assess students' learning abilities and exercise difficulties more accurately, we constructed a prior causal model for learning ability and exercise difficulty. Initially, we assumed that all students had the same learning ability. When identifying a student's ability, we treated exercise difficulty as the confounding factor, and when identifying exercise difficulty, we treated student ability as the confounding factor; we employed backdoor adjustment to eliminate the influence of these confounders;
  • Using a physical loss function to adjust the loss term and optimize prediction: the logistic growth model can describe the dynamic process of students' knowledge mastery. To make the neural network learn rules that conform to the differential equation of the logistic model, the logistic model is used as a physical model to construct a physical loss function as a constraint. Students' academic performance is related to their abilities and to the difficulty of the exercises they attempt: a high predicted score suggests high learning ability and a relatively difficult selected question, while a low predicted score suggests the opposite. We use the student abilities obtained through backdoor adjustment, and the penalty is increased whenever the above common-sense pattern is violated.

3.2. Causal Relationship Mining of Knowledge Points

To mine temporal causal relationships between knowledge points, the intensity function [28] is modeled by two recurrent neural networks: one RNN captures the relationships between events over time, and the other, based on the accompanying time series, updates the intensity functions. On this basis, an attention mechanism is introduced. Each student's interaction record is a sequence $(z_i, t_i)$, meaning that knowledge point $z_i$ is practiced at time $t_i$. Correct and incorrect responses to a knowledge point are separated and re-encoded as distinct event types, so that causal relationships can be derived between the correct and incorrect occurrences of knowledge points.
The intensity function can be represented as $\lambda(t) = \phi_\theta(t; h_{t_i})$, where $h_{t_i}$ is the hidden state capturing the influence of previous events, defined recursively as $h_{t_i} = h_v(t_i; h_{t_{i-1}})$; $\theta$ and $v$ denote the network parameters. The influence of previously learned knowledge naturally decays over time, so a temporal decay component is added to the original RNN so that past states gradually diminish.
Endogenous variables can be expressed as follows:
h_i^e = \phi_{eh}\big(W_{eh}(z_i, t_i) + B_{eh}\, h_{i-1}^e\, \gamma(t_i - t_{i-1}) + b_{eh}\big) \quad (3)

Here, $\phi_{eh}$ is the activation function, $\gamma(t)$ is the decay function, $(z_i, t_i)$ denotes the embedding vector of knowledge point $z_i$ and its time features, $W_{eh}(z_i, t_i)$ captures the influence of the current event, and $B_{eh} h_{i-1}^e \gamma(t_i - t_{i-1})$ captures the influence of historical events.
The temporal RNN event-sequence encoder first transforms the event sequence into hidden states. An attention mechanism over the temporal process then focuses the intensity function of event type $z$ on the historical events with the greatest influence, guided by a type-specific parameter vector $u_z$. The attention distribution is given in Equations (4) and (5):

e_{z_i z} = \tanh(h_i^e \cdot u_z) \quad (4)

a_{z_i z} = \frac{\exp(e_{z_i z})}{\sum_i \exp(e_{z_i z})} \quad (5)

We define an infectivity matrix, analogous to that of a conventional point process, to reflect the dependency between events: $A_{z_i z} = \langle a_{z_i z} \rangle$. This yields the event dependency representation $s_z$ for event type $z$:

s_z(t) = \sum_i a_{z_i z}\, h_i^e\, \gamma(t - t_i) \quad (6)

\lambda_z(t) = f\big(w_e\, s_z(t)\big) \quad (7)
To properly integrate the dense feature vectors sampled at different timestamps, we employ a synchronous RNN, passing hidden states to subsequent layers to compute the exogenous intensity. The synchronous RNN yields the hidden state $h_t^x$ of the time series $x$:

h_t^x = \phi_{tx}\big(W_{tx}\, x_t + B_{tx}\, h_{t-1}^x + b_{xh}\big) \quad (8)

Finally, the endogenous and exogenous intensities are jointly modeled through a collaborative layer over both representations, so the full intensity function can be expressed as follows:

\lambda_z(t) = f\big(w_e\, s_z(t) + w_{zx}\, h_t^x\big) \quad (9)

where the first term $w_e s_z(t)$ represents the endogenous component of the event dynamics and the second term $w_{zx} h_t^x$ represents the exogenous intensity.
To integrate the intensity functions with the Granger causality test, we first compute the intensity function of each knowledge point's occurrence sequence to quantify changes in its occurrence rate. Then, introducing these intensity functions as independent variables within the Granger causality framework, we evaluate their predictive power for other knowledge points: if the appearance of one knowledge point increases the probability of another knowledge point appearing, there is a causal relationship between the two. In this way, we obtain a causal relationship matrix between knowledge points.
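A simplified sketch of this last step, under our own assumption that the causal matrix can be obtained by thresholding the learned infectivity matrix, could look as follows (the matrix values and the threshold are illustrative):

```python
# Simplified sketch (an assumption, not the exact procedure above):
# derive a binary causal matrix between knowledge points by thresholding
# the learned infectivity matrix A, where A[i, j] measures how strongly
# events of knowledge point i excite events of knowledge point j.
import numpy as np

A = np.array([
    [0.05, 0.62, 0.10],
    [0.04, 0.03, 0.71],
    [0.02, 0.08, 0.06],
])
threshold = 0.5  # illustrative cutoff for a significant Granger effect
causal = (A > threshold).astype(int)
print(causal)  # edges k1 -> k2 and k2 -> k3, e.g., addition precedes multiplication
```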

3.3. LPKT Model Integrating Temporal Causality

3.3.1. Learning Embedding and Knowledge Point Embedding

We use an embedding matrix $E \in \mathbb{R}^{J \times d_e}$ to represent the exercise set, where $J$ is the number of exercises and $d_e$ is the embedding dimension. Each exercise $e_t$ in the learning unit $x_t$ is represented as a vector $e_t \in \mathbb{R}^{d_e}$. To obtain the learning embedding $l_t \in \mathbb{R}^{d_k}$ of the basic learning unit, the exercise embedding $e_t$, the answer-time embedding $at_t$, and the answer embedding $a_t$ are concatenated, and a multi-layer perceptron (MLP) deeply fuses them, as shown in Formula (10):

l_t = W_1^{\top} [\, e_t \oplus at_t \oplus a_t \,] + b_1 \quad (10)

where $\oplus$ denotes the concatenation operation, $W_1 \in \mathbb{R}^{(d_e + d_k + d_a) \times d_k}$ is the weight matrix, and $b_1 \in \mathbb{R}^{d_k}$ is the bias term.
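A minimal PyTorch sketch of this fusion step, using the dimensions from the experimental setup in Section 4 and random tensors in place of learned embeddings, is given below:

```python
# Sketch of Equation (10): concatenate exercise, answer-time, and answer
# embeddings and fuse them with a linear layer (d_e = d_k = 128, d_a = 50
# follow the paper's experimental setup; the tensors are random stand-ins).
import torch

d_e, d_k, d_a = 128, 128, 50
fuse = torch.nn.Linear(d_e + d_k + d_a, d_k)  # plays the role of W_1 and b_1

e_t = torch.randn(1, d_e)    # exercise embedding
at_t = torch.randn(1, d_k)   # answer-time embedding
a_t = torch.randn(1, d_a)    # answer embedding
l_t = fuse(torch.cat([e_t, at_t, a_t], dim=-1))  # learning embedding of size d_k
print(l_t.shape)  # torch.Size([1, 128])
```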
The purpose of knowledge embedding is to store and update students' knowledge states during learning. In the LPKT model, the knowledge embedding is initialized as a matrix $h \in \mathbb{R}^{M \times d_k}$, where $M$ is the number of knowledge concepts and each row of $h$ represents the mastery level of the corresponding concept. At every learning interaction, the LPKT model writes the learning gain for each knowledge concept into the knowledge embedding while also incorporating the forgetting effect.

3.3.2. The Learning Module

Learning gain represents the difference in a student's performance between two time points, i.e., across two consecutive learning interactions. The previous learning embedding $l_{t-1}$ and the current learning embedding $l_t$ are concatenated as the basic input for modeling the learning gain. Two main factors influence a student's learning gain: the interval between learning sessions and the knowledge states of causally related prior knowledge points. On the one hand, the interval between two learning units is critical: a shorter interval generally indicates a compact, continuous learning process. On the other hand, when a student engages with a particular knowledge point, the mastery states of causally related knowledge points also affect their mastery of that point. Both factors are therefore modeled to capture the evolution of the learning gain.
We concatenate the interval time with the basic input elements between two consecutive learning embeddings. For the student's prior knowledge state, in order to focus on the knowledge relevant to the current exercise's concepts, we first combine the current knowledge concept vector $q_{e_t}$ with $h_{t-1}$ to obtain the focused knowledge state $\tilde{h}'_{t-1}$:

\tilde{h}'_{t-1} = q_{e_t} \cdot h_{t-1} \quad (11)
A student's mastery of a specific knowledge point depends not only on its current state but also on the states of the knowledge points causally related to it. Let Mastery(Y) denote the mastery level, in the student's knowledge state $\tilde{h}'_{t-1}$, of a knowledge point Y causally related to knowledge point X. The causally adjusted mastery of X sums, over X's neighbors in the causal graph (where $\mathrm{ch}(X)$ denotes its children, $\mathrm{pa}(X)$ its parents, and $W_X(Y)$ the causal strength of the edge $X \to Y$), the edge weights times the neighbors' mastery levels, normalized by the number of adjacent (causally related) knowledge points $|\mathrm{Adj}(X)|$:

\mathrm{Mastery}(X) = \frac{\sum_{Y \in \mathrm{ch}(X)} W_X(Y)\,\mathrm{Mastery}(Y) + \sum_{Y \in \mathrm{pa}(X)} W_Y(X)\,\mathrm{Mastery}(Y)}{|\mathrm{Adj}(X)|} \quad (12)
Finally, the values Mastery(X) for all knowledge points are integrated to obtain the final knowledge state vector $\tilde{h}_{t-1}$.
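The following NumPy sketch shows one reading of Equation (12), aggregating over causal parents and children and normalizing by the number of adjacent knowledge points; the causal weight matrix and mastery values are illustrative.

```python
# Sketch of one reading of Equation (12); W and mastery are illustrative.
import numpy as np

W = np.array([            # W[x, y] = causal strength of edge x -> y
    [0.0, 0.8, 0.0],
    [0.0, 0.0, 0.6],
    [0.0, 0.0, 0.0],
])
mastery = np.array([0.9, 0.5, 0.2])  # current mastery of k1, k2, k3

def causal_mastery(x):
    children = np.nonzero(W[x])[0]     # knowledge points x points to
    parents = np.nonzero(W[:, x])[0]   # knowledge points pointing to x
    adj = len(children) + len(parents)
    if adj == 0:
        return mastery[x]              # isolated point: keep its own mastery
    total = ((W[x, children] * mastery[children]).sum()
             + (W[parents, x] * mastery[parents]).sum())
    return total / adj

print([round(causal_mastery(x), 3) for x in range(3)])  # [0.4, 0.42, 0.3]
```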
The student's learning gain $lg_t$ is modeled as follows:

lg_t = \tanh\big(W_2^{\top} [\, l_{t-1} \oplus it_t \oplus l_t \oplus \tilde{h}_{t-1} \,] + b_2\big) \quad (13)

where $W_2 \in \mathbb{R}^{4d_k \times d_k}$ is the weight matrix and $b_2 \in \mathbb{R}^{d_k}$ is the bias term.
After obtaining the student's knowledge state and learning gain, a learning gate $\Gamma_t^l$ is used to control the student's ability to absorb knowledge, defined as follows:

\Gamma_t^l = \sigma\big(W_3^{\top} [\, l_{t-1} \oplus it_t \oplus l_t \oplus \tilde{h}_{t-1} \,] + b_3\big) \quad (14)

where $W_3 \in \mathbb{R}^{4d_k \times d_k}$ is the weight matrix, $b_3 \in \mathbb{R}^{d_k}$ is the bias term, and $\sigma$ is the nonlinear sigmoid activation function.
Then, $\Gamma_t^l$ is multiplied by $lg_t$ to obtain the actual learning gain of the student's $t$-th learning interaction. Similarly, to focus on the learning gain of the knowledge concepts relevant to exercise $e_t$, we multiply $LG_t$ by $q_{e_t}$ to obtain the relevant learning gain $\widetilde{LG}_t$:

LG_t = \Gamma_t^l \cdot (lg_t + 1)/2 \quad (15)

\widetilde{LG}_t = q_{e_t} \cdot LG_t \quad (16)

where the learning gain $LG_t$ is always positive, its range being mapped from (−1, 1) to (0, 1).

3.3.3. Forgetting Module

Learning enhances a student's knowledge state, but according to the forgetting curve theory, the amount of knowledge a student retains decays exponentially over time. To model this complex forgetting effect, we use a forgetting gate $\Gamma_t^f$ to simulate the forgetting process. The gate is modeled on three factors: (1) the student's previous knowledge state $h_{t-1}$, (2) the learning gain $LG_t$, and (3) the interval time $it_t$. The forgetting gate is therefore defined as follows:

\Gamma_t^f = \sigma\big(W_4^{\top} [\, h_{t-1} \oplus LG_t \oplus it_t \,] + b_4\big) \quad (17)

where $W_4 \in \mathbb{R}^{3d_k \times d_k}$ is the weight matrix, $b_4 \in \mathbb{R}^{d_k}$ is the bias term, and $\sigma$ is the nonlinear sigmoid activation function.
The student's knowledge state after the $t$-th learning interaction is then updated as shown in Formula (18):

h_t = \widetilde{LG}_t + \Gamma_t^f \cdot h_{t-1} \quad (18)
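Putting the learning and forgetting modules together, a compact PyTorch sketch of one update step (Equations (13)-(18)) follows; for brevity, every quantity, including the interval-time embedding and q_{e_t}, is treated here as a d_k-dimensional vector, which simplifies the original formulation.

```python
# Compact sketch of one learning/forgetting step (Eqs. (13)-(18)).
import torch

d_k = 128
W2 = torch.nn.Linear(4 * d_k, d_k)  # learning gain layer
W3 = torch.nn.Linear(4 * d_k, d_k)  # learning gate layer
W4 = torch.nn.Linear(3 * d_k, d_k)  # forgetting gate layer

def step(l_prev, l_t, it_t, h_prev, q_et, h_causal):
    # h_causal is the causally adjusted knowledge state h_tilde_{t-1}
    x = torch.cat([l_prev, it_t, l_t, h_causal], dim=-1)
    lg = torch.tanh(W2(x))                               # Eq. (13)
    gate_l = torch.sigmoid(W3(x))                        # Eq. (14)
    LG = gate_l * (lg + 1) / 2                           # Eq. (15): map to (0, 1)
    LG_rel = q_et * LG                                   # Eq. (16): related concepts
    gate_f = torch.sigmoid(W4(torch.cat([h_prev, LG, it_t], dim=-1)))  # Eq. (17)
    return LG_rel + gate_f * h_prev                      # Eq. (18): new state

h_t = step(*[torch.randn(1, d_k) for _ in range(6)])
print(h_t.shape)  # torch.Size([1, 128])
```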

3.3.4. Prediction Module and Objective Function

When given an exercise, students apply the knowledge they have mastered of the corresponding knowledge concepts to solve it. The student's relevant knowledge state is therefore used to infer their performance on the next exercise. The prediction formula is as follows:

y_{t+1} = \sigma\big(W_5^{\top} [\, e_{t+1} \oplus \tilde{h}_t \,] + b_5\big) \quad (19)

where $W_5 \in \mathbb{R}^{2d_k \times d_k}$ is the weight matrix and $b_5 \in \mathbb{R}^{d_k}$ is the bias term.
The cross-entropy loss between the predicted values $y_t$ and the actual answers $a_t$ is chosen as the objective function:

L(\theta) = -\sum_{t=1}^{T} \big(a_t \log y_t + (1 - a_t)\log(1 - y_t)\big) + \lambda_\theta \lVert \theta \rVert^2 \quad (20)

where $\theta$ denotes all model parameters and $\lambda_\theta$ is the regularization hyperparameter. The Adam optimizer is used to minimize the objective function in mini-batches.

3.4. Loss Function Optimization

3.4.1. Student’s Ability and Exercise Difficulty Based on Backdoor Adjustment

Students with different learning abilities have distinct preferences in answering questions. For example, students with stronger learning abilities tend to tackle more challenging exercises, which may result in a relatively lower accuracy rate; conversely, students with weaker learning abilities tend to choose easier problems, leading to a higher accuracy rate. We should therefore neither overestimate nor underestimate a student's learning ability. From the perspective of causal inference, "the accuracy of a student's answers" is a major "cause" for identifying learning ability, while "exercise difficulty" acts as a "confounding factor" in the causal relationship between the two. We thus derive the structural causal model shown in Figure 2.
To accurately identify students' learning abilities, we use backdoor adjustment to intervene. We define learning ability (LA) and exercise difficulty (PD) levels, assuming that each level has the same prior probability; that is,

P(PD = d) = \frac{1}{D}, \qquad P(LA = l) = \frac{1}{L} \quad (21)

where $D$ and $L$ are the numbers of exercise difficulty levels and learning ability levels, respectively. The post-intervention learning ability is obtained as shown in Formula (22):

P(LA \mid do(CS)) = \sum_{d=1}^{D} P(LA \mid CS, PD = d)\, P(PD = d) \quad (22)

where $CS$ denotes the correctness of the student's answers.
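A small numerical sketch of this adjustment, with D = 3 difficulty levels and an invented conditional probability table, is shown below:

```python
# Sketch of Equation (22): average the ability estimate over difficulty
# levels with the uniform prior P(PD = d) = 1/D. The conditional table
# below is an illustrative assumption, not learned values.
import numpy as np

D = 3  # number of difficulty levels
# P(LA = high | CS = correct, PD = d) for d = easy, medium, hard
p_la_given_cs_pd = np.array([0.30, 0.55, 0.85])
p_pd = np.full(D, 1.0 / D)  # uniform prior over difficulty levels

# P(LA = high | do(CS = correct)): confounding through difficulty is blocked.
p_la_do_cs = np.sum(p_la_given_cs_pd * p_pd)
print(round(p_la_do_cs, 3))  # 0.567
```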

3.4.2. The Logistic Model

The logistic model is based on the logistic function. In knowledge tracing, its basic idea is that the probability of correctly answering an exercise can be represented by a mathematical function of student and KC (knowledge concept) parameters. Using student performance as the dependent variable, general parameters are learned from historical data to model students and predict their answers; in this setting, students' binary answers (correct/incorrect) follow a Bernoulli distribution. Logistic models were first proposed for knowledge tracing tasks at the beginning of the 21st century.
We use the sigmoid function to describe the probability of a student mastering a certain knowledge point at a certain moment. This nonlinear function is typically written as:

P(y = 1 \mid x) = \frac{1}{1 + \exp(-f_\theta(x))} \quad (23)

We consider its dynamic change over time and convert it into a differential equation. Assuming that a student's mastery of a skill changes over time, a "learning progress" dynamic model can represent this change:

\frac{dP(t)}{dt} = r\, P(t)\,\big(1 - P(t)\big), \qquad P(t) = \frac{1}{1 + \exp(-f_\theta(x(t)))} \quad (24)

where $P(t)$ is the probability that the student has mastered the skill at time $t$, $r$ is the learning rate, indicating how quickly the student acquires the skill, and $f_\theta(x(t))$ is the model output representing the student's mastery of the knowledge points at time $t$. This differential equation is the standard logistic growth model and can describe the dynamic process of students' knowledge acquisition.
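The following sketch integrates the logistic growth equation of Equation (24) with a simple Euler scheme to show the characteristic S-shaped mastery curve; the learning rate r and the initial mastery are illustrative.

```python
# Sketch: Euler integration of dP/dt = r * P * (1 - P) from Equation (24).
import numpy as np

r, dt, steps = 0.8, 0.1, 100
P = np.empty(steps)
P[0] = 0.05  # illustrative initial mastery probability
for t in range(1, steps):
    P[t] = P[t - 1] + dt * r * P[t - 1] * (1 - P[t - 1])
# Mastery rises quickly at first, then saturates near 1 (the asymptotic effect).
print(P[::20].round(3))
```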

3.4.3. Optimize the Loss Function

For the neural network to learn rules consistent with the differential equation of the logistic model, we impose constraints by introducing a corresponding loss term. The loss function then considers not only the network's prediction error but also whether the network output $P(t)$ satisfies the differential equation of the logistic growth model.
We must ensure that the derivative of the network output $P(t)$ with respect to time $t$ conforms to the logistic equation; this derivative is computed through automatic differentiation. Letting $P(t)$ be the network's predicted probability that the student has mastered the skill at time $t$, we construct the loss function as follows:

L_{pinn} = \int_{t_0}^{T} \left\lVert \frac{dP(t)}{dt} - r\,P(t)\,\big(1 - P(t)\big) \right\rVert^2 dt \quad (25)

where $\frac{dP(t)}{dt}$ is the derivative of the neural network output with respect to time $t$. This loss term ensures that the predicted probability $P(t)$ follows the dynamics of the logistic growth model.
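In practice, the integral in Equation (25) can be approximated by a mean over collocation times. A hedged PyTorch sketch, with a small stand-in network mapping time to mastery probability, is as follows:

```python
# Sketch of Equation (25): the integral becomes a mean over collocation
# times, and dP/dt comes from automatic differentiation.
import torch

net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 1), torch.nn.Sigmoid())
r = 0.8  # learning rate of the logistic model (illustrative)

def pinn_loss(t):
    t = t.requires_grad_(True)
    P = net(t)
    dPdt = torch.autograd.grad(P, t, torch.ones_like(P), create_graph=True)[0]
    residual = dPdt - r * P * (1 - P)   # deviation from the logistic ODE
    return (residual ** 2).mean()

t_col = torch.linspace(0, 1, 64).unsqueeze(1)  # collocation times
print(pinn_loss(t_col))
```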
In line with common sense, if a student's predicted academic performance is high, their learning ability should be correspondingly high, and if the predicted performance is low, their learning ability should be correspondingly low. Based on this, we construct a loss term linking academic performance and learning ability:

L_{ability} = \sum_i \big(\hat{y}_i - f(C_i)\big)^2 \quad (26)

where $f(C_i)$ is the student ability obtained through backdoor adjustment and $\hat{y}_i$ is the predicted academic performance mapped to the interval (0, 1).
The final loss function is expressed as shown in Formula (27):

L = L_{cls} + \lambda L_{pinn} + L_{ability} \quad (27)

where $\lambda$ is a hyperparameter that balances the physical constraint against the traditional losses. By optimizing this comprehensive loss function, the neural network not only makes predictions based on students' historical data but also follows the differential equation of the logistic model, thus better simulating the dynamic process of students' knowledge mastery.
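A minimal sketch of assembling this total objective, with an illustrative lambda and placeholder inputs, might read:

```python
# Sketch of Equation (27): classification loss plus the weighted physics
# term and the ability-consistency term; lambda is illustrative.
import torch

def total_loss(y_pred, a_true, pinn_term, y_scaled, ability):
    # Binary cross-entropy between predictions and answers (Eq. (20), no reg.)
    cls = torch.nn.functional.binary_cross_entropy(y_pred, a_true)
    ability_term = ((y_scaled - ability) ** 2).sum()   # Eq. (26)
    lam = 0.1  # hyperparameter balancing the physics constraint
    return cls + lam * pinn_term + ability_term
```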

4. Experiment

4.1. Datasets

We use two diverse real-world datasets to evaluate the effectiveness of the model in different learning scenarios. Table 1 shows the statistics of both datasets, which we introduce as follows:
  • ASSIST2012: This dataset was collected from the ASSISTments educational platform, which provides high school math problems. It contains data from the 2012–2013 academic year; students complete similar exercises to master the problem sets. We filtered out records without knowledge concepts and students who completed fewer than 20 questions.
  • ASSISTchall: This dataset was collected from ASSISTments in 2017 and was used in a data mining competition. The data come from a longitudinal study that tracked middle school students' use of the ASSISTments blended learning platform from 2004 to 2007. In this dataset, students' learning sequences are much longer than in ASSIST2012.

4.2. Results and Discussion

We compared TLPKT_PINN with several previous methods. For a fair comparison, all methods were tuned to their best performance.
  • BKT: uses a hidden Markov model and Bayesian inference to evaluate and predict the dynamic changes in students' mastery of knowledge points during the learning process.
  • DKT: based on a Recurrent Neural Network (RNN), dynamically tracks students' mastery of knowledge points; it analyzes students' interaction data, learns their knowledge states, and predicts their performance on future tasks.
  • DKVMN: defines a static matrix to store latent knowledge concepts and a dynamic matrix whose knowledge states are updated over time through read and write operations, using a memory network to obtain interpretable student knowledge states.
  • AKT: uses two self-attention encoders to learn context-aware representations of exercises and answers; its knowledge evolution model, called the knowledge retriever, uses attention mechanisms to retrieve past knowledge relevant to the current exercise.
  • LPKT: models the learning gain during the learning process by capturing the difference between two consecutive learning units. The variation in learning gain is measured by the student's relevant knowledge state and the interval time; a learning gate captures the student's ability to absorb knowledge, and a forgetting gate models the decay of knowledge over time.
This article conducted experimental comparisons with these classic knowledge tracing models and their latest variants, using the Area Under the ROC Curve (AUC) and the Root Mean Square Error (RMSE) to analyze the performance of the compared models. For both datasets, we performed standard 5-fold cross-validation on all models: in each fold, 80% of the students formed the training set and the remaining 20% were used as the test set. All parameters were randomly initialized from a uniform distribution. We tuned all hyperparameters on the training set and evaluated the test set using the model that performed best on the validation set. We added a dropout layer with a dropout rate of 0.2 to prevent overfitting. In our implementation, $d_k$ and $d_e$ are set to 128 and $d_a$ is set to 50. The small positive value $\gamma$ used to enhance the Q-matrix was set to 0.03. The experimental results are shown in Table 2.
It can be seen that, as an improvement of the LPKT model, the TLPKT_PINN method performs better on both datasets than deep learning methods such as DKT and DKVMN, with an increase of at least 5% in AUC on both datasets. In addition, TLPKT_PINN raises the AUC over LPKT itself by at least 2.6% on both datasets. This improvement is due to its integration of the temporal causal relationships between knowledge points into student performance prediction.

4.3. Ablation Experiment

In order to investigate the impact of each module of the TLPKT_PINN model on final performance prediction, we designed several ablation experiments to validate our model. The specific setups are as follows:
  • LPKT_PINN: removes the temporal causality module from the model;
  • TLPKT_PINN_ability: removes the module that adjusts learning ability through backdoor adjustment and drops the learning ability term from the loss function;
  • TLPKT: removes the physical loss module from the model;
  • TLPKT_1: removes the additional loss terms from the model.
From the results given in Table 3, it is clear that the temporal causality module, the physical loss module, and the ability-adjusted loss term all affect the experimental results. Comparing the performance of LPKT_PINN, TLPKT_PINN_ability, TLPKT, and TLPKT_1 shows that both the temporal causality module and the loss modules improve model performance, indicating the effectiveness of each module of the TLPKT_PINN model. In addition, the LPKT_PINN results show that better predictions can be achieved by considering the temporal causality between knowledge points. Notably, the knowledge point causal module brings a more significant improvement on the ASSIST2012 dataset, which may be due to the relatively small number of knowledge points in this dataset, so that the learned causal network of knowledge points is not overly complex. Finally, the comparison between TLPKT and TLPKT_1 verifies that optimizing the loss term improves prediction performance.

4.4. Updating the Mastery Level of Students’ Knowledge Points

Figure 3 shows a student's learning sequence: the student completed the exercises in temporal order, and the mastery levels of the student's knowledge points were obtained by combining the neural network with temporal causality.
Temporal causal mining shows that the three knowledge points involved are causally related. We can see that when the mastery of one knowledge point changes, the mastery of its causally related knowledge points changes accordingly.

4.5. Performance Analysis of PINN Model

A key feature of the logistic model is its ability to capture the "asymptotic" saturation effect of the learning process: students progress quickly in the early stages of learning, but as mastery approaches saturation, progress gradually slows. As shown in Figure 4, students' knowledge levels early in the learning process are low and largely follow the trend of the logistic model, while the neural network's predictions deviate somewhat more in this phase; later on, once trained, the neural network predicts better. We therefore use the logistic function as the physical model for the first 15% of time steps, constraining the network's learning-progress output through the logistic function to simulate the process by which a student moves from not understanding a knowledge point to fully mastering it. As shown in Figure 5, adding the PINN component slightly reduces the mean square error, indicating that it improves the accuracy of predicting students' grades.

5. Conclusions

In this paper, we propose a deep learning knowledge tracing model that integrates temporal causality and the Physics-Informed Neural Network (PINN) model. Since the probability of a student's mastery of a knowledge point is affected by the mastery of its causally related knowledge points, the model mines the causal relationships between knowledge points with a temporal causality model and combines the temporal causal matrix with the student's learning, yielding a final mastery estimate that reflects the influence of related knowledge points. Experiments show that in the first part of the learning process the logistic model fits students' actual answering behavior better, whereas after sufficient training the neural network fits better. We therefore treat the logistic model as a physical model within the first 15% of time steps and use it to constrain the neural network's predictions, which improves the accuracy of student performance prediction. An additional loss term increases the penalty for violating the principle that "if a student's performance is high, their learning ability should be high; if it is low, their learning ability should be low", further improving performance prediction. It can be concluded from the experiments that the TLPKT_PINN model achieves better predictions on both datasets and depicts the changes in students' learning processes more reasonably.
In future work, we will further explore how to integrate the temporal causal matrix with the student learning process more efficiently, thereby reducing time complexity, and we will investigate how the knowledge tracing model can automatically learn the specific weights of the Q-matrix to represent the relationships between exercises and knowledge concepts more accurately.

Author Contributions

Methodology, F.L., Y.L. and Y.B.; Writing—original draft, Y.L.; Writing—review & editing, F.L. and Y.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the National Science and Technology Major Project (2022ZD0119501), the NSFC (52374221), Sci. & Tech. Development Fund of Shandong Province of China (ZR2022MF288, ZR2023MF097), the Taishan Scholar Program of Shandong Province (ts20190936), and the Science and Technology Program Special Project of Qingdao West Coast New Area (202209).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The ASSISTment2012 and ASSISTchall datasets are from the ASSISTments platform. The organized datasets are available at the following link: https://base.ustc.edu.cn/data/ASSISTment/ (accessed on 20 November 2024).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI-EdTech: AI-powered educational technologies
AKT: Context-aware Attentive Knowledge Tracing
BKT: Bayesian Knowledge Tracing
DBKT: Dynamic Bayesian Knowledge Tracing
DBNs: Dynamic Bayesian Networks
DKT: Deep Knowledge Tracing
DKVMN: Dynamic Key–Value Memory Network
GKT: Graph-based Knowledge Tracing
HMM: Hidden Markov Model
LPKT: Learning Process-consistent Knowledge Tracing
LSTM: Long short-term memory
PINN: Physics-Informed Neural Network
RNNs: Recurrent Neural Networks
TLS-BKT: Three-Learning-State BKT

References

  1. Darvishi, A.; Khosravi, H.; Sadiq, S.; Gašević, D.; Siemens, G. Impact of AI assistance on student agency. Comput. Educ. 2024, 210, 104967. [Google Scholar] [CrossRef]
  2. Chen, H.; Yin, C.; Li, R.; Rong, W.; Xiong, Z.; David, B. Enhanced learning resource recommendation based on online learning style model. Tsinghua Sci. Technol. 2019, 25, 348–356. [Google Scholar] [CrossRef]
  3. Shen, S.; Liu, Q.; Huang, Z.; Zheng, Y.; Yin, M.; Wang, M.; Chen, E. A survey of knowledge tracing. arXiv 2021, arXiv:2105.15106. [Google Scholar]
  4. Corbett, A.T.; Anderson, J.R. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Model. User-Adapt. Interact. 1995, 4, 253–278. [Google Scholar] [CrossRef]
  5. Eddy, S.R. Hidden markov models. Curr. Opin. Struct. Biol. 1996, 6, 361–365. [Google Scholar] [CrossRef]
  6. Piech, C.; Bassen, J.; Huang, J.; Ganguli, S.; Sahami, M.; Guibas, L.J.; Sohl-Dickstein, J. Deep Knowledge Tracing. Adv. Neural Inform. Process. Syst. 2015, 28. [Google Scholar]
  7. Jordan, M.I. Serial order: A parallel distributed processing approach. In Advances in Psychology; North-Holland: Amsterdam, The Netherlands, 1997; Volume 121, pp. 471–495. [Google Scholar]
  8. Shen, S.; Liu, Q.; Chen, E.; Huang, Z.; Huang, W.; Yin, Y.; Su, Y.; Wang, S. Learning process-consistent knowledge tracing. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Virtual, 14–18 August 2021; pp. 1452–1460. [Google Scholar]
  9. Zanellati, A.; Di Mitri, D.; Gabbrielli, M.; Levrini, O. Hybrid models for knowledge tracing: A systematic literature review. IEEE Trans. Learn. Technol. 2024, 17, 1021–1036. [Google Scholar] [CrossRef]
  10. Kaser, T.; Klingler, S.; Schwing, A.G.; Gross, M. Dynamic Bayesian networks for student modeling. IEEE Trans. Learn. Technol. 2017, 10, 450–462. [Google Scholar] [CrossRef]
  11. De Baker, R.S.J.; Corbett, A.T.; Aleven, V. More accurate student modeling through contextual estimation of slip and guess probabilities in Bayesian knowledge tracing. In Proceedings of the 9th International Conference on Intelligent Tutoring Systems (LNCS 5091), Montreal, QC, Canada, 23–27 June 2008; Springer: Berlin/Heidelberg, Germany, 2008; pp. 406–415. [Google Scholar]
  12. Zeng, G.; Zhuang, J.; Huang, H.; Tian, M.; Gao, Y.; Liu, Y.; Yu, X. Use of Deep Learning for Continuous Prediction of Mortality for All Admissions in Intensive Care Units. Tsinghua Sci. Technol. 2023, 28, 639–648. [Google Scholar] [CrossRef]
  13. Yang, X.; Esquivel, J.A. Time-aware LSTM neural networks for dynamic personalized recommendation on business intelligence. Tsinghua Sci. Technol. 2023, 29, 185–196. [Google Scholar] [CrossRef]
  14. Zhang, J.; Shi, X.; King, I.; Yeung, D.Y. Dynamic key-value memory networks for knowledge tracing. In Proceedings of the 26th International Conference on World Wide Web, Perth, Australia, 3–7 April 2017; pp. 765–774. [Google Scholar]
  15. Nakagawa, H.; Iwasawa, Y.; Matsuo, Y. Graph-based knowledge tracing: Modeling student proficiency using graph neural network. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, Thessaloniki, Greece, 14–17 October 2019; pp. 156–163. [Google Scholar]
  16. Ghosh, A.; Heffernan, N.; Lan, A.S. Context-aware attentive knowledge tracing. In Proceedings of the 26th ACM SIGKDD International Conference On Knowledge Discovery & DATA Mining, Virtual, 6–10 July 2020; pp. 2330–2339. [Google Scholar]
  17. Eljialy, A.E.M.; Uddin, M.Y.; Ahmad, S. Novel framework for an intrusion detection system using multiple feature selection methods based on deep learning. Tsinghua Sci. Technol. 2024, 29, 948–958. [Google Scholar] [CrossRef]
  18. Jiang, Z.; Ning, Z.; Miao, H.; Wang, L. STDNet: A Spatio-Temporal Decomposition Neural Network for Multivariate Time Series Forecasting. Tsinghua Sci. Technol. 2024, 29, 1232–1247. [Google Scholar] [CrossRef]
  19. Daley, D.J.; Vere-Jones, D. An Introduction to the Theory of Point Processes: Volume I: Elementary Theory and Methods; Springer: New York, NY, USA, 2003. [Google Scholar]
  20. Granger, C.W.J. Investigating causal relations by econometric models and cross-spectral methods. Econom. J. Econom. Soc. 1969, 37, 424–438. [Google Scholar] [CrossRef]
  21. Didelez, V. Graphical models for marked point processes based on local independence. J. R. Stat. Soc. Ser. B Stat. Methodol. 2008, 70, 245–264. [Google Scholar] [CrossRef]
  22. Eichler, M.; Dahlhaus, R.; Dueck, J. Graphical modeling for multivariate Hawkes processes with nonparametric link functions. J. Time Ser. Anal. 2017, 38, 225–242. [Google Scholar] [CrossRef]
  23. Raissi, M.; Perdikaris, P.; Karniadakis, G.E. Inferring solutions of differential equations using noisy multi-fidelity data. J. Comput. Phys. 2017, 335, 736–746. [Google Scholar] [CrossRef]
  24. Raissi, M.; Perdikaris, P.; Karniadakis, G.E. Machine learning of linear differential equations using Gaussian processes. J. Comput. Phys. 2017, 348, 683–693. [Google Scholar] [CrossRef]
  25. Owhadi, H. Bayesian numerical homogenization. Multiscale Model. Simul. 2015, 13, 812–828. [Google Scholar] [CrossRef]
  26. Williams, C.K.I.; Rasmussen, C.E. Gaussian Processes for Machine Learning; MIT Press: Cambridge, MA, USA, 2006. [Google Scholar]
  27. Raissi, M.; Perdikaris, P.; Karniadakis, G.E. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 2019, 378, 686–707. [Google Scholar] [CrossRef]
  28. Xiao, S.; Yan, J.; Farajtabar, M.; Song, L.; Yang, X.; Zha, H. Learning time series associated event sequences with recurrent point process networks. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3124–3136. [Google Scholar] [CrossRef] [PubMed]
Figure 1. TLPKT_PINN modeling framework.
Figure 2. Structural causal model with the confounder "difficulty of the exercise".
Figure 3. Student answer sequence and the mastery levels of the student's knowledge points. Here, e1–e9 are exercises, and the colors represent the knowledge points they contain; the specific knowledge points and their corresponding colors are shown in the upper right corner. The check or cross below each exercise indicates whether the student answered it correctly or incorrectly. For easier identification, different shades of blue mark mastery levels of at least 0.5, between 0.3 and 0.5, and below 0.3.
Figure 4. The time variation values of the three models (student's actual answering situation, neural network prediction, and the logistic model).
Figure 5. The mean square error of the three models (neural network prediction, the logistic model, and the neural network incorporating the PINN model).
Table 1. Information on the datasets.

                               ASSIST2012    ASSISTchall
Number of students             28,914        1600
Number of knowledge concepts   265           102
Number of problems             532,090       3142
Table 2. Experimental results.

             ASSIST2012        ASSISTchall
             AUC     RMSE      AUC     RMSE
BKT          0.622   0.511     0.638   0.513
DKT          0.701   0.432     0.721   0.447
DKVMN        0.685   0.437     0.710   0.450
AKT          0.769   0.414     0.766   0.431
LPKT         0.778   0.407     0.772   0.415
TLPKT_PINN   0.828   0.375     0.798   0.382
Table 3. Results of ablation experiments.

                     ASSIST2012        ASSISTchall
                     AUC     RMSE      AUC     RMSE
LPKT_PINN            0.786   0.410     0.788   0.405
TLPKT_PINN_ability   0.803   0.384     0.789   0.392
TLPKT                0.801   0.392     0.784   0.398
TLPKT_1              0.792   0.405     0.781   0.399
