Article

BPSKT: Knowledge Tracing with Bidirectional Encoder Representation Model Pre-Training and Sparse Attention

College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China
*
Author to whom correspondence should be addressed.
Electronics 2025, 14(3), 458; https://doi.org/10.3390/electronics14030458
Submission received: 14 December 2024 / Revised: 21 January 2025 / Accepted: 21 January 2025 / Published: 23 January 2025

Abstract

Knowledge tracing (KT) is a core task in intelligent education systems. It is designed to track the changes in students’ knowledge states during practice and to predict the correctness of their answers in the next round of exercises. Current knowledge tracing models generally focus on short-sequence problems and still have significant limitations when handling long-sequence problems. Additionally, there is considerable room for improvement in the experimental performance of existing models on sparse datasets. This paper proposes BPSKT, a knowledge tracing model that combines BERT pre-training and sparse attention. It extracts long-sequence node features through a two-layer GCN, pre-trains on these features with a sparse-attention BERT, and then fine-tunes the pre-trained model to adapt it to downstream tasks. Finally, a series of validation experiments progressively verifies the logical consistency and structural effectiveness of BPSKT.

1. Introduction

In recent years, with the popularization of the Internet and smartphones, the concept of smart education has gradually become well known, and with the continuous advancement of knowledge tracing [1], smart education has developed rapidly. Knowledge tracing is a technology that tracks students’ learning progress through their online interactions with teaching materials; it observes, represents, and quantifies students’ knowledge states in a scientific and reasonable way [2] and further predicts their future learning ability. Through the application of knowledge tracing technology, students can better understand their own learning trajectories and thereby further improve their efficiency. Meanwhile, academics have paid close attention to how this technology can be employed to anticipate potential learning patterns from students’ learning trajectories [3].
Although knowledge tracing technology continues to produce good results, it still has limitations. First, most existing knowledge tracing techniques focus more on short sequences than on long-sequence problems. As shown in Figure 1, student S practices algebra-based computational problems as well as geometric proof problems separately online. e1, e2, and e3 represent inequalities, square roots, and algebraic solutions, where the knowledge concepts of e1 and e2 are included in e3 as part of algebraic problem solving. These knowledge concepts are usually easy to master and rely little on prior practice information. In this case, predicting the student’s performance on e3 from their results on e1 and e2 can already achieve good performance, relying only on the results of the previous two exercises. e4 represents a plane geometry proof problem, which differs from computational problems in that such exercises generally require the student to complete a large number of related practice problems and a full answer process to master the content. When the student’s performance on e4 shows that only a fraction of the answers are correct, the prediction of their performance on the next exercise, e5, will be influenced by both the insufficient prior practice information and the current status of e4. Existing knowledge tracing models are not effective at predicting the student’s learning ability under these circumstances. In addition, current knowledge tracing models primarily target dense datasets and overlook sparse datasets. In some public knowledge tracing datasets, there may be many exercises but only a few participating students; that is, a large number of exercises are offered to a small number of students, which results in sparse data. On such datasets, existing models are prone to overfitting. Therefore, improving the model’s generalization ability on sparse datasets is of great significance.
To better address the issues of long sequences and sparse datasets, this paper proposes a knowledge tracing model based on BERT pre-training and sparse attention, named BPSKT. The model starts from a sequence of student–problem interactions and examines changes in students’ prior outcomes and learning rates. Specifically, it first identifies and retains long sequences as the original embedding set, and then extracts the knowledge state features of the sequence through a two-layer GCN structure. Next, the extracted features are input into the BERT pre-training model with sparse attention to generate the pre-training results. Finally, the pre-training results are decoded by a sequence decoder and passed to the prediction network, which outputs the predicted results. The main contributions of this paper are summarized below:
(1)
The paper proposes a new knowledge tracing model, BPSKT, which focuses on the characteristics of response pairs in long-sequence problems and employs a suitable neural network for feature extraction, yielding a better solution to the long-sequence problem;
(2)
The self-attention in BERT is replaced with sparse attention; after fine-tuning and decoding, the pre-trained representations can be extended to other downstream tasks and achieve good results on datasets from other domains;
(3)
Extensive validation experiments were conducted on multiple knowledge tracing datasets, and various metrics were used to analyze the experimental data, thereby demonstrating the logical rationality and structural effectiveness of the model.

2. Related Work

This section reviews the past and present development of KT techniques and attention mechanisms; the theoretical foundations and model architecture of this paper are built and implemented on top of these studies.

2.1. Development of Knowledge Tracing

The concept of knowledge tracing was first introduced by Anderson et al. [4] in a technical report in 1986 and published in the 1990s in the academic journal Artificial Intelligence. Early knowledge tracing models followed a Bayesian theoretical approach and relied on probability and applied statistics to estimate student learning ability and performance by simplifying problem complexity. The BKT model proposed by Corbett et al. [5] is implemented on this basis. It uses Hidden Markov Models (HMMs) to treat the student’s knowledge state as a hidden variable, allowing the knowledge state to be updated and posterior computation to be performed under a binary model. However, this binary representation of knowledge leaves the model unable to deal with more complex knowledge networks, and in terms of modeling human memory, such models do not directly take into account the influence of human memory on the learning process.
With the development of machine learning and artificial intelligence, the field of knowledge tracing has likewise moved into a brand-new phase. Piech et al. [6] first introduced a deep learning framework into knowledge tracing to construct the DKT model, which utilizes a recurrent neural network (RNN) as a hidden unit to generate knowledge state vectors representing students’ knowledge states, and obtained better results than BKT. However, since the DKT model represents students’ mastery of knowledge concepts (KCs) with hidden states, it cannot output students’ mastery of each KC in detail [7]. Zhang et al. [8] then proposed the DKVMN model by combining the advantages of BKT and DKT, allowing multiple hidden vectors to be read and written separately and utilizing dynamic matrices to assess students’ learning status. Nakagawa et al. [9] first applied graph neural networks (GNNs) to knowledge tracing by proposing the GKT model, which constructs a graph of KC relationships and abstracts the original knowledge tracing task as a node problem in a GNN, thereby improving the accuracy of model predictions. However, GKT only uses a single KC as input, ignoring the influence between multiple KCs and exercises [10]. Building upon this, Song et al. [11] proposed the Bi-Graph Contrastive Learning based Knowledge Tracing model (Bi-CLKT). This model constructs a dual-graph structure to simultaneously handle the relationships between knowledge concepts (KCs) and the interactions between student behaviors, and it demonstrated superior experimental performance compared to models such as BKT, DKVMN, and SAKT. Su et al. [12] introduced the attention mechanism for the first time and proposed an exercise-enhanced recurrent neural network (EERNNA) framework based on the attention mechanism. The model obtains the whole practice process of students by extracting features of the exercise text and tracing the students’ state vectors in combination with the hidden features, so the influence of different practice steps on the student is captured and the model gains better interpretability. Nevertheless, this type of model has a flaw: the attention it uses can only focus on short sequences [13,14]. In addition, Asselman et al. [15] summarized the limitations of existing extensions of Performance Factors Analysis (PFA) and proposed a scalable XGBoost model. This model, a gradient boosting-based ensemble learning algorithm, improves the accuracy of student performance prediction by combining multiple weak learners.

2.2. Evolution of Attention Mechanisms

The attention mechanism is implemented through the use of masks. Specifically, the mask identifies key features in the data through an additional layer of weights, which are learned during training so that the deep neural network focuses on the regions it needs to attend to. In 2014, Mnih et al. [16] used the attention mechanism for the first time on an RNN model for image classification. Since then, with its powerful performance, the attention mechanism has been widely used in various fields. Subsequently, Bahdanau et al. [17] used an attention-like mechanism to simultaneously translate and align in a machine translation task, applying attention mechanisms to the NLP domain for the first time. Inspired by work in machine translation and object detection, Sammani et al. [18] introduced a soft-attention model that can automatically learn to describe the content of an image using standard backpropagation techniques; such models are trained either deterministically or stochastically by maximizing a variational lower bound. Wojna et al. [19] then proposed another neural network model based on CNNs, RNNs, and a hard attention mechanism. This method achieves extremely high accuracy on specific datasets, greatly outperforming the previous state of the art, while being simpler and more versatile than previous approaches. However, for existing knowledge tracing models, neither the soft attention mechanism nor the hard attention mechanism alone is very effective.
In 2017, Vaswani et al. [20] proposed a powerful model called the Transformer, a new, simple network architecture based entirely on the self-attention mechanism that completely eliminates recurrence and convolution. Experiments on two machine translation tasks show that the model has a qualitative advantage, offers higher parallelizability, and requires significantly less training time, demonstrating excellent performance. Subsequently, the Bidirectional Encoder Representation model (BERT) [21] was proposed, built on a network architecture consisting of multiple stacked Transformer encoders. It introduces a Masked Language Model (MLM) module and a Next Sentence Prediction (NSP) module, which allow it to pre-train deep bidirectional representations and achieve state-of-the-art performance on a large number of sentence-level and token-level tasks, outperforming many systems with task-specific architectures. Nevertheless, this self-attention has the disadvantage that it tends to focus the model on information close to the current time step, so it cannot be applied directly to tasks that require attending to long sequences.
The BPSKT model proposed in this paper is built on these existing models and aims to solve the long-sequence and sparse-dataset problems that current models find difficult to handle.

3. Method

In this section, a detailed description and explanation of the BPSKT architecture and workflow are provided. Section 3.1 presents the problem setting of knowledge tracing. Section 3.2 shows the overall framework and workflow of the BPSKT model. The remaining subsections will break down key modules in detail and explain their functions.

3.1. Knowledge Tracing Problem Setting

Each student’s individual performance record in online education consists of two parts: the individual problems at each discrete time step, and the corresponding answers, or response pairs, at those time steps. When a student practices from the start until time step $t$, the exercises, knowledge concepts, and student responses over this period can be combined into a tuple $(q_t^s, c_t^s, i_t^s)$, where $q_t^s \in \mathbb{N}^*$ denotes the problem practiced by student $s$ at time step $t$; $c_t^s \in \mathbb{N}^*$ denotes the knowledge concept it contains; and $i_t^s \in [0, 1]$ denotes the student’s response to the question, where 0 means the question was answered incorrectly, 1 means it was answered correctly, and intermediate values denote partially correct answers. Because training on long sequences of student behavior can lead to over-parameterization, where too many parameters are required to fit the data, we introduce the problem attribute $q_t^s$ of the student’s exercises. This allows the model to capture the student’s responses to different types of questions more accurately, avoiding overfitting to excessive detail in student behavior and further preventing this issue. To better characterize the long-sequence problem, the responses retained for a student from time step 1 to $t$ are all long response sequences of length greater than 200; on this basis, we omit the superscript $s$ and discuss the future performance of a single student. The interactions from time step 1 to $t-1$ thus form the set $(q_1, c_1, i_1), \ldots, (q_{t-1}, c_{t-1}, i_{t-1})$, and the goal is to predict the student’s response $i_t$ to question $q_t$ on concept $c_t$ at the current time step $t$.
Inspired by the work of Ghosh et al. [22], we use a real-valued vector $x_t \in \mathbb{R}^d$ to denote the raw embedding of each question–response pair. Here, $x_t$ denotes the knowledge acquired by the student through answering the question; for long sequences, separate embeddings are used for each of the three response types, correct (1), incorrect (0), and partially correct (default 0.5), and $d$ denotes the dimension of these embeddings. With the number of questions denoted by $n$, there are then a total of $3n$ question–response embedding vectors. Since the concept index $c_t$ was introduced earlier to prevent over-parameterization, we further stipulate that all problems involving the same concept are treated as one problem; in this case, $q_t = c_t$ and $n$ equals the number of concepts.
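To make this embedding scheme concrete, the following minimal PyTorch sketch builds the $3n$ question–response embedding table; the class name, the bucketing of responses into three types, and the index formula are illustrative assumptions, not the authors’ released code.

```python
import torch
import torch.nn as nn

class QuestionResponseEmbedding(nn.Module):
    """Sketch of the 3n question-response embedding described above.

    Assumptions (not from the paper's code): question ids lie in [0, n), and
    responses take the values 0.0 (incorrect), 0.5 (partially correct), or
    1.0 (correct), which are bucketed into three discrete response types.
    """

    def __init__(self, n_questions: int, d_model: int = 256):
        super().__init__()
        # one embedding row per (question, response-type) pair -> 3n rows in total
        self.embed = nn.Embedding(3 * n_questions, d_model)

    def forward(self, q: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        # q: (batch, seq_len) long tensor of question ids
        # r: (batch, seq_len) float tensor of responses in {0, 0.5, 1}
        r_type = torch.round(r * 2).long()   # 0 -> 0, 0.5 -> 1, 1 -> 2
        idx = q * 3 + r_type                 # unique index in [0, 3n)
        return self.embed(idx)               # x_t vectors, shape (batch, seq_len, d_model)

# usage: two students, 200 time steps each
emb = QuestionResponseEmbedding(n_questions=100, d_model=256)
q = torch.randint(0, 100, (2, 200))
r = torch.tensor([0.0, 0.5, 1.0])[torch.randint(0, 3, (2, 200))]
x = emb(q, r)                                # (2, 200, 256)
```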

3.2. BPSKT Methodology

The overall framework of the BPSKT methodology is shown in Figure 2 and consists of four main components: a two-layer graph convolutional network, a BERT pre-trained model in which self-attention is replaced by sparse attention, a sequence decoder, and a feed-forward response prediction network.
One of the main innovations of the model is the modification of the attention mechanism in the BERT pre-trained model. Sparse attention copes better with long sequences and sparse data than the Transformer’s multi-head attention [23] and shows better results on such datasets. Afterwards, the pre-trained results are fine-tuned and then passed into the sequence decoder. This module uses the attention mechanism and the forgetting feature curve to retrieve the response features of the sequence from the history up to the last time step $t-1$; finally, the prediction model uses the retrieved response features to predict the learner’s response state for the current problem. The complete workflow of the BPSKT model is shown in Algorithm 1, and the key modules of the algorithm are described in detail in the following four subsections.
Algorithm 1 BPSKT Complete Algorithm
Input: LRS (long response sequence)
Output: PSAV (predicted student ability values)
 1: for S = 1, …, n do    // S iterates over the set of students
 2:   if SL > 200 then    // check whether the sequence length (SL) is greater than 200
 3:     LRS ← label(SL)    // label the sequence as a long response sequence (LRS)
 4:     RES ← retain(LRS)    // retain long response sequences as the raw embedding set (RES)
 5:   end if
 6:   TGCN ← input(RES)    // feed the raw embedding set into the two-layer graph convolutional network (TGCN)
 7:   ES ← generate(TGCN)    // generate the eigenvalue (feature) set (ES) from the two-layer GCN
 8:   BERT ← input(ES)    // feed the feature set into the bidirectional encoder representations from transformers (BERT)
 9:   PR ← generate(BERT)    // generate the pre-training result (PR) from BERT
10:   SD ← input(PR)    // pass the pre-training result into the sequence decoder (SD)
11:   PN ← decode(SD)    // pass the decoded data from the sequence decoder into the prediction network (PN)
12:   for T = 1, …, t − 1 do    // T iterates over the time steps of the student’s practice
13:     PSAV ← generate(PN.add(SD))    // the decoder output is added to the prediction network, generating the predicted student ability values (PSAV) for the time step
14:   end for
15: end for
16: return PSAV

3.3. Graph Convolutional Neural Networks (GCN)

A Graph Neural Network (GNN) is a collective term for neural network models applied to graphs. Depending on the propagation scheme, GNNs can be categorized into structures such as Graph Convolutional Networks (GCNs), Graph Attention Networks (GATs), and Graph LSTMs. In the GCN used here, the question features are treated as a complete network system, and each response sequence to a question is considered a node feature in a GCN subgraph. Nodes answering the same question are connected through the network structure, meaning they are linked by edge vectors that carry similarity information, with similarity represented quantitatively by numbers. The higher the similarity between two nodes, the more similar the two responses to a question are. The GCN extracts node similarity features several times to obtain the answer closest to the original answer.
There are two main reasons for using a GCN as the feature-extraction network: (1) compared with the traditional RNN models used in KT, a GCN captures long-term dependencies better and is less prone to vanishing and exploding gradients [24]; (2) compared with a generic GNN, the convolutional layers not only allow for more accurate feature extraction but also prevent overfitting more effectively. According to the research of Yang et al. [25], the extraction effect is best when the GCN is set to two layers.
For a graph $G = (V, E)$, $V$ is the set of nodes and $E$ is the set of edges; each node $i$ has a feature vector, and the node features can be represented by the matrix $X \in \mathbb{R}^{N \times D}$. Here, $N$ denotes the number of nodes and $D$ denotes the number of features per node, i.e., the dimension of the feature vector. Given a random unweighted undirected feature graph, as shown in the left panel of Figure 3, it is then possible to obtain its degree matrix $M_D$, adjacency matrix $M_A$, and Laplacian matrix $M_L$. The adjacency matrix $M_A$ is 1 only between two nodes connected by an edge and 0 everywhere else; the degree matrix $M_D$ has non-zero values only on the diagonal, equal to the degree of the corresponding node; and the Laplacian matrix is $M_L = M_D - M_A$.
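As a concrete illustration, the short NumPy sketch below builds $M_A$, $M_D$, and $M_L$ for a small hand-made undirected graph; the edge list is an arbitrary example, not the graph of Figure 3.

```python
import numpy as np

# A small undirected, unweighted graph given as an edge list (arbitrary example).
n_nodes = 4
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

M_A = np.zeros((n_nodes, n_nodes))      # adjacency matrix: 1 where an edge exists
for i, j in edges:
    M_A[i, j] = M_A[j, i] = 1.0         # undirected graph: symmetric entries

M_D = np.diag(M_A.sum(axis=1))          # degree matrix: node degrees on the diagonal
M_L = M_D - M_A                         # Laplacian matrix: M_L = M_D - M_A

print(M_A, M_D, M_L, sep="\n\n")
```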
Any graph convolutional layer can be written as a nonlinear function:
$$H^{(l+1)} = f\left(H^{(l)}, M_A\right)$$
where $H^{(l)}$ and $H^{(l+1)}$ represent the inputs of layers $l$ and $l+1$, respectively, and $H^{(0)} = X \in \mathbb{R}^{N \times D}$ is the input to the first layer. In this paper, the function $f$ is implemented with the following formula:
$$H^{(l+1)} = \sigma\left( M_D^{-\frac{1}{2}} \hat{M}_A M_D^{-\frac{1}{2}} H^{(l)} W^{(l)} \right)$$
where $\sigma(\cdot)$ represents the nonlinear activation function; in this paper, the Softmax activation function was chosen to enhance the interpretability of the model. $\hat{M}_A$ stands for $M_A + M_I$, where $M_I$ is the identity matrix, so that the adjacency matrix also attends to and extracts features from each node itself. $W^{(l)}$ represents the weight matrix of layer $l$. In this formulation, the Laplacian matrix $M_L$ has the following symmetric normalized variant:
$$M_L^{\mathrm{sym}} = M_D^{-\frac{1}{2}} M_L M_D^{-\frac{1}{2}} = M_D^{-\frac{1}{2}} \left( M_D - M_A \right) M_D^{-\frac{1}{2}}$$
Based on the above formula, the following two issues can be addressed:
(1)
Introducing the self-degree term (adding self-loops) solves the self-connection problem;
(2)
The adjacency matrix is normalized by left- and right-multiplying it by the inverse square root of the degree matrix.
For each node pair $(i, j)$, the elements of the matrix are given by the following equation (for an undirected, unweighted graph):
$$M^{\mathrm{sym}}_{L\,i,j} = \begin{cases} 1, & \text{if } i = j \text{ and } \deg(v_i) \neq 0 \\ -\dfrac{1}{\sqrt{\deg(v_i)\,\deg(v_j)}}, & \text{if } i \neq j \text{ and } v_i \text{ is adjacent to } v_j \\ 0, & \text{otherwise} \end{cases}$$
where $\deg(v_i)$ and $\deg(v_j)$ are the degrees of nodes $i$ and $j$, respectively, that is, the values of the degree matrix at nodes $i$ and $j$. Finally, we obtain the weight of each node on the feature graph; the higher the weight, the closer the problem concepts are.
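The propagation rule above can be made concrete with a minimal two-layer GCN sketch in PyTorch. This is an illustrative implementation under simplifying assumptions (a dense adjacency matrix and a ReLU in the first layer, neither of which the paper specifies), not the authors’ released code.

```python
import torch
import torch.nn as nn

class TwoLayerGCN(nn.Module):
    """Two GCN layers with symmetric normalization, as in the formula above."""

    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hidden_dim, bias=False)   # W^(0)
        self.w2 = nn.Linear(hidden_dim, out_dim, bias=False)  # W^(1)

    @staticmethod
    def normalize(adj: torch.Tensor) -> torch.Tensor:
        # \hat{M}_A = M_A + M_I, then M_D^{-1/2} \hat{M}_A M_D^{-1/2}
        a_hat = adj + torch.eye(adj.size(0))
        d_inv_sqrt = torch.diag(a_hat.sum(dim=1).pow(-0.5))
        return d_inv_sqrt @ a_hat @ d_inv_sqrt

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        a_norm = self.normalize(adj)
        h = torch.relu(a_norm @ self.w1(x))                    # layer 1 (ReLU assumed)
        return torch.softmax(a_norm @ self.w2(h), dim=-1)      # layer 2 with Softmax, as stated in the paper

# usage: 4 nodes with 8-dimensional node features
adj = torch.tensor([[0., 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]])
x = torch.randn(4, 8)
features = TwoLayerGCN(8, 16, 16)(x, adj)                      # node features passed on to pre-training
```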

3.4. Sparse Attention Mechanism

The Query–Key–Value (QKV) attention mechanism is one of the most important modules of the Transformer. Given the packed matrix representations of queries $Q \in \mathbb{R}^{N \times D_k}$, keys $K \in \mathbb{R}^{M \times D_k}$, and values $V \in \mathbb{R}^{M \times D_v}$, the scaled dot-product attention is computed as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{T}}{\sqrt{D_k}} \right) V = A V$$
where $N$ and $M$ denote the numbers of queries and keys, $D_k$ and $D_v$ denote the dimensions of keys and values, and $A = \mathrm{softmax}\left( \frac{Q K^{T}}{\sqrt{D_k}} \right)$ is often called the attention matrix.
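For reference, the dense formula above translates directly into a few lines of PyTorch; this is a plain, unbatched sketch with arbitrary example dimensions.

```python
import torch

def scaled_dot_product_attention(q, k, v):
    """Dense scaled dot-product attention: softmax(QK^T / sqrt(D_k)) V.

    q: (N, D_k), k: (M, D_k), v: (M, D_v); returns (N, D_v).
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (N, M) un-normalized scores
    attn = torch.softmax(scores, dim=-1)            # attention matrix A
    return attn @ v

q, k, v = torch.randn(5, 64), torch.randn(7, 64), torch.randn(7, 32)
out = scaled_dot_product_attention(q, k, v)         # shape (5, 32)
```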
In general, long-sequence temporal modeling problems require considering not only local temporal information and hierarchical temporal information but also burst timestamp information [13,26,27]. Conventional self-attention mechanisms (e.g., the Transformer’s multi-head attention) are difficult to adapt directly to this setting and may cause mismatches between the queries and keys of the encoder and decoder, which ultimately affects the prediction results. Therefore, this paper introduces a sparse attention mechanism that limits the number of query–key pairs by combining them with a structural bias; this reduces the computational complexity and achieves better results on sparse data as well as in long-sequence modeling. In this approach, the un-normalized attention scores are computed as follows:
$$\hat{A}_{i,j} = \begin{cases} q_i k_j^{T}, & \text{if token } i \text{ attends to token } j \\ -\infty, & \text{if token } i \text{ does not attend to token } j \end{cases}$$
where $\hat{A}$ is the un-normalized attention matrix. The internal structure of the BPSKT encoder is shown in Figure 4.
To construct sparse graphs efficiently, one approach is to recast the problem as Maximum Inner Product Search (MIPS): for each query, the keys with the largest dot products can be found without computing all dot-product terms. Inspired by the work of Roy et al. [28], k-means clustering is used to cluster the queries $\{q_i\}_{i=1}^{T}$ and keys $\{k_i\}_{i=1}^{T}$ onto the same set of centroid vectors $\{\mu_i\}_{i=1}^{K}$. Each query then attends only to the keys belonging to the same cluster. During training, each cluster centroid is updated using the exponential moving average of the vectors assigned to it, divided by an exponential moving average of the cluster counts:
$$\tilde{\mu} \leftarrow \lambda \tilde{\mu} + (1 - \lambda) \left( \sum_{i:\,\mu(q_i) = \mu} q_i + \sum_{j:\,\mu(k_j) = \mu} k_j \right)$$
$$c_\mu \leftarrow \lambda c_\mu + (1 - \lambda)\,|\mu|$$
$$\mu \leftarrow \frac{\tilde{\mu}}{c_\mu}$$
where $|\mu|$ denotes the number of vectors currently assigned to cluster $\mu$ and $\lambda \in (0, 1)$ is a hyperparameter. Let $P_i$ denote the set of key indices attended to by the $i$-th query; $P_i$ is defined as follows:
$$P_i = \left\{ j : \mu(q_i) = \mu(k_j) \right\}$$
Combining this sparse attention with GCN feature extraction and BERT pre-training allows the model to adapt more efficiently to various downstream tasks; by fine-tuning the pre-trained model, performance can be improved for the specific task at hand.
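The cluster-restricted attention pattern described above can be sketched as follows. The sketch assumes the centroids are already available (in practice they would be maintained with the EMA updates given earlier) and uses an explicit per-query loop for clarity; it illustrates the routing idea rather than reproducing the paper’s implementation.

```python
import torch

def routing_sparse_attention(q, k, v, centroids):
    """Each query attends only to keys assigned to the same centroid.

    q, k: (T, D_k); v: (T, D_v); centroids: (K, D_k). Returns (T, D_v).
    """
    d_k = q.size(-1)
    q_cluster = (q @ centroids.t()).argmax(dim=-1)   # mu(q_i): nearest centroid by inner product
    k_cluster = (k @ centroids.t()).argmax(dim=-1)   # mu(k_j)
    out = torch.zeros(q.size(0), v.size(-1))
    for i in range(q.size(0)):
        mask = k_cluster == q_cluster[i]             # P_i = { j : mu(q_i) == mu(k_j) }
        if not mask.any():
            continue                                  # empty cluster: leave the output row at zero
        scores = (q[i] @ k[mask].t()) / d_k ** 0.5
        out[i] = torch.softmax(scores, dim=-1) @ v[mask]
    return out

T, K_clusters = 16, 4
q, k, v = torch.randn(T, 32), torch.randn(T, 32), torch.randn(T, 32)
centroids = torch.randn(K_clusters, 32)
out = routing_sparse_attention(q, k, v, centroids)    # shape (16, 32)
```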

3.5. BPSKT Pre-training and Fine-tuning

The BERT pre-training process was optimized to some extent in the work of Ye et al. [29]. The first step is the masked language model (“masked LM”), in which a certain proportion of input tokens is randomly masked to train a deep bidirectional representation, and the masked tokens are then predicted. The final hidden vector of each masked token is fed into an output Softmax over the vocabulary. The second step is Next Sentence Prediction (NSP), a binarized next-sentence prediction task whose training pairs can be easily generated from any monolingual corpus.
The fine-tuning process utilizes transfer learning techniques [30]. Specifically, the BERT encoder is used as the upper network layer in BPSKT to encode the response sequences; the resulting context encoding, a deep representation of the sequences, is then passed to a downstream sequence decoder to generate the knowledge states. In this case, each input sequence is not composed of two segments but is a complete sequence from the dataset, enclosed between the [CLS] and [SEP] tags.
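To illustrate how masked-LM-style pre-training inputs could be built from a response sequence, the sketch below masks positions at a 15% rate (the standard BERT setting; the paper does not state its rate) and uses a hypothetical [MASK] token id; BERT’s 80/10/10 replacement rule is omitted for brevity.

```python
import torch

def mask_for_pretraining(token_ids: torch.Tensor, mask_id: int, mask_prob: float = 0.15):
    """Sketch of masked-LM input construction for an encoded response sequence.

    Assumes each (question, response) pair has already been mapped to a token id;
    `mask_id` is a hypothetical [MASK] id. Returns the corrupted inputs and the
    labels, with -100 marking positions that do not contribute to the loss.
    """
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape) < mask_prob
    labels[~mask] = -100                  # only masked positions are predicted
    corrupted = token_ids.clone()
    corrupted[mask] = mask_id             # replace masked positions with [MASK]
    return corrupted, labels

seq = torch.randint(0, 300, (1, 200))     # one long response sequence of 200 tokens
inputs, labels = mask_for_pretraining(seq, mask_id=300)
```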

3.6. Response Prediction Network

The final component of the BPSKT method predicts the learner’s response to the current question. Figure 5 shows the internal structure of the decoder [31,32]. A review of the existing literature shows that, in general, students’ mastery of newly acquired knowledge is higher than their mastery of knowledge learned a long time ago. It is therefore important to consider the effect of forgetting, which is done by adding a multiplicative exponential decay term to the attention score:
$$A_{t,\tau} = \frac{\exp\left( -\theta \cdot d(t, \tau) \right) \cdot q_t^{T} k_\tau}{\sqrt{D_k}}$$
where $\theta > 0$ is a learnable decay rate parameter and $d(t, \tau)$ is a temporal distance measure between time steps $t$ and $\tau$.
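A minimal sketch of such decayed attention scores is given below, assuming $d(t, \tau) = |t - \tau|$ and a causal mask; both are plausible but unstated assumptions, and the exact form used by the authors may differ.

```python
import torch

def decayed_attention_scores(q, k, theta: float = 0.1):
    """Attention scores with a multiplicative exponential decay (forgetting) term.

    q, k: (T, D_k); theta > 0 is the decay rate. Returns a (T, T) attention matrix.
    """
    T, d_k = q.shape
    t = torch.arange(T).float()
    dist = (t[:, None] - t[None, :]).abs()              # d(t, tau): temporal distance
    decay = torch.exp(-theta * dist)                     # exponential forgetting factor
    scores = decay * (q @ k.t()) / d_k ** 0.5            # decayed, scaled dot products
    scores = scores.masked_fill(t[None, :] > t[:, None], float("-inf"))  # causal mask
    return torch.softmax(scores, dim=-1)

attn = decayed_attention_scores(torch.randn(10, 64), torch.randn(10, 64))  # (10, 10)
```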
Specifically, the inputs to the prediction model are the sequence decoder’s retrieval of knowledge from the long sequence and the embedding vector $x_t$ of the current problem response. This input is passed through a fully connected network and then a nonlinear activation function to generate the predicted probability $\hat{r}_t$ that the student will answer the current question correctly (taking the forgetting feature into account). The model is trained by minimizing the binary cross-entropy loss over all learner responses:
$$\varsigma = -\sum_{i} \sum_{t} \left[ r_t^{i} \log \hat{r}_t^{i} + \left( 1 - r_t^{i} \right) \log \left( 1 - \hat{r}_t^{i} \right) \right]$$
The context encoding produced by the sequence decoder contains the latent feature information extracted from the original sequence. Therefore, it is able to track changes in the knowledge state more closely, leading to more accurate predictions.
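A compact sketch of such a prediction head and its binary cross-entropy loss is shown below; the layer sizes, the concatenation of the decoder state with the current question–response embedding, and the ReLU are illustrative assumptions rather than the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class ResponsePredictor(nn.Module):
    """Fully connected prediction head producing the probability r_hat of a correct answer."""

    def __init__(self, d_model: int = 256, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * d_model, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, decoder_state: torch.Tensor, x_t: torch.Tensor) -> torch.Tensor:
        # decoder_state, x_t: (batch, d_model) -> predicted probability, shape (batch,)
        return torch.sigmoid(self.net(torch.cat([decoder_state, x_t], dim=-1))).squeeze(-1)

pred = ResponsePredictor()
r_hat = pred(torch.randn(24, 256), torch.randn(24, 256))      # batch of 24 learners
targets = torch.randint(0, 2, (24,)).float()
loss = nn.functional.binary_cross_entropy(r_hat, targets)     # the loss in the equation above
```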

4. Experiment

The focus of this section is to obtain experimental data through a series of validation experiments to verify the rationale and effectiveness of BPSKT. In Section 4.1, five classic datasets in the field of knowledge tracing, six baseline models, and detailed experimental parameter settings are introduced. Section 4.2 then carries out comparative experiments between BPSKT and the six baseline models, ablation experiments to validate the structural effectiveness by removing key components, and visualization experiments that enhance the model’s interpretability.

4.1. Comparative Experiments on the Dataset

4.1.1. Datasets

In this paper, five typical datasets are used, each from data collected in the real world. The datasets are described below:
The ASSISTments2009 dataset was collected from the ASSISTments education platform during the 2009–2010 school year. As a classic dataset for knowledge tracing, it is now frequently used in the field of smart education as a benchmark for comparing models. It is a dense dataset with a relatively high average data density.
The ASSISTments2015 dataset was collected in 2015 and contains the largest number of students of any dataset used here. It is characterized by a lower average number of exercises per student and fewer average student records. After ASSISTments2009, it is the most widely used ASSISTmentsData dataset among researchers and yields good modeling results.
The ASSISTments2017 dataset was made publicly available for the Longitudinal Data Mining Competition and is the most recent ASSISTments dataset, containing more records of student interactions as well as sparser data.
The Statics2011 dataset comes from OLI’s college-level Engineering Statics course, and its exercise labels are numerical knowledge components derived from textual descriptions; it simultaneously has a small number of students and a large number of knowledge concepts.
EdNet is a hierarchical dataset in which each subset contains a different type of student activity. Its main distinguishing feature is that some questions are organized into bundles (sets of questions that must be completed in full). It also includes a range of learner-interaction features collected by the AI tutoring service and has the lowest data density of the five datasets. Because of the large number of students, only a portion of the data was used for the experimental study.
Inspired by the work of Ghosh et al. [22], this paper adopts the ASSISTments2009, ASSISTments2015, ASSISTments2017, and Statics2011 datasets. The EdNet dataset follows the work of Choi et al. [33]. The details of the five datasets are shown in Table 1. The required sequence length was ensured for each exercise record in the five datasets.
Each dataset covers as many validly recorded sequences as possible, and a large number of duplicate recorded sequences were removed to make the experimental results more convincing.

4.1.2. Comparison of Models

This paper compares BPSKT with six baseline KT methods: DKT, DKVMN, CKT, Bi-CLKT, AKT, and the more recently proposed Self-Attentive Knowledge Tracing (SAKT) method. The attention mechanism used in SAKT can be regarded as a special case of AKT without the context-aware representation of questions and answers and without the monotonic attention mechanism. The performance of all knowledge tracing methods was evaluated by predicting students’ future responses to questions, using the area under the curve (AUC) and accuracy (ACC) as evaluation metrics. A brief description of the six KT methods follows:
DKT [6] is the earliest breakthrough and achievement in the fusion of knowledge tracing and deep learning, as well as the earliest KT method to utilize RNN models for deep learning. In most of the current classical KT methods, the DKT is usually chosen as a baseline to compare with other models.
DKVMN [8] belongs to the class of Dynamic Key-Value Memory Networks and is obtained by improving on the MANN model. It utilizes static and dynamic external matrices to read and write the students’ knowledge status, respectively, and grasp the students’ learning level.
CKT [34] belongs to the category of Convolutional Networks and uses 3D ConvNets techniques to reinforce short-term features of students’ recent learning states and exercises, often used in conjunction with feature extraction engineering.
Bi-CLKT [11] constructs a dual-graph structure based on Graph Neural Networks (GNN), which can simultaneously handle the relationships between knowledge concepts (KCs) and the interactions between student behaviors.
AKT [22] uses monotonic attention to model the relationship between current practice and past interactions. It also uses Rasch models and forgetting-curve features, interpretable psychometric components, to obtain a series of student learning states from past to present based on modeling exercises and the raw embeddings of knowledge concepts.
SAKT [23] models students’ interaction histories with a self-attention mechanism, thereby reducing the influence of irrelevant practice on the target exercise and making the prediction results more convincing.
In the experiments, this paper re-implemented all of the above comparison methods so as to ensure consistency in data processing and network structure. In addition, for other factors, such as the embedding layer, the hidden layer of the attentional mechanism, and the FFN layer, the dimensionality levels remain consistent with the BPSKT model.

4.1.3. Parameter Setting

For evaluation and comparison purposes, in this research, we performed a standard five-fold cross-validation on all models and datasets [35]. For each fold, 20% of the learners are used as a test set, 20% as a validation set and 60% as a training set.
The experiments in this study were implemented in PyTorch 2.1 with Python 3.9. The processor was an Intel i7, the system had 8 GB of RAM, and the graphics card was an NVIDIA 3090Ti. After repeated parameter adjustments and comparison of experimental results, the best results were obtained with a learning rate of 0.0001 and a dropout rate of 0.1, which also helped reduce overfitting.
The Adam optimizer was used to train all models with a batch size of 24 to ensure that an entire batch fits into the memory of the machine on which it runs (with an NVIDIA 3090Ti-12G GPU). For most datasets and most algorithms, the maximum number of training epochs was set to 300, since a single epoch is short. In addition, after repeatedly adjusting the dimension settings, the best results were achieved with N = 8 encoder layers and d = 256 for the dimensions of the embedding and hidden layers, respectively.
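For concreteness, the reported configuration (Adam, learning rate 0.0001, dropout 0.1, batch size 24, up to 300 epochs) corresponds to a training loop of the following shape. The model here is a stand-in single layer used only to keep the sketch runnable; the real BPSKT components would be plugged in instead.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in model: a single sigmoid layer in place of the full BPSKT pipeline
# (two-layer GCN + sparse-attention BERT + sequence decoder + prediction network).
model = nn.Sequential(nn.Linear(256, 1), nn.Sigmoid())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # reported learning rate
criterion = nn.BCELoss()

# Dummy tensors standing in for encoded response sequences and binary responses.
features = torch.randn(240, 256)
labels = torch.randint(0, 2, (240, 1)).float()
loader = DataLoader(TensorDataset(features, labels), batch_size=24, shuffle=True)

for epoch in range(300):                                     # maximum number of training epochs
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```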

4.2. Experimental Details and Results

4.2.1. Comparative Experiments

Table 2 lists the performance of all KT methods in predicting future learner responses across all datasets. For a fair comparison, the AUC and ACC values were averaged over five-fold cross-validation and used as the final reference results. BPSKT achieved better results than the competing models on four of the five datasets. The average improvement over the second-best modeling approach on the ASSISTments2009 and ASSISTments2017 datasets is 1.42% and 4.17%, respectively, and the latter is the best result of any knowledge tracing model run on ASSISTments2017. In addition, BPSKT improved its AUC on the Statics2011 and EdNet datasets by 1.62% and 2.49%, respectively, compared to the second-best model. As noted earlier, both the ASSISTments2017 and EdNet datasets are sparse, which shows that BPSKT achieves better results on sparse datasets. Conversely, because ASSISTments2015 has both a large amount of data and a high number of student–problem interaction records, the results on this dense dataset are slightly worse than those of the SAKT model. Finally, to demonstrate the efficiency improvement brought by the sparse attention mechanism, this paper compares the BPSKT model with the other six models, all trained for 100 epochs on the ASSISTments2017 dataset with other conditions held constant. Time efficiency and memory usage were recorded during training, and the final results are shown in Table 3. Under the same conditions, the second most efficient model, AKT, took 9 h 27 min to train for 100 epochs, with a prediction time of 6 h 12 min and an average memory usage of 5.2/7.9 GB (66%), whereas BPSKT required 7 h 52 min for training and 5 h 44 min for prediction, with an average memory usage of 4.5/7.9 GB (57%). This clearly demonstrates that the sparse attention mechanism improves the model’s running efficiency.
In addition, comparing across methods, the improvement of DKVMN over DKT is statistically insignificant, and DKVMN performs even worse than DKT on the ASSISTments datasets. Overall, the attention-based deep methods tend to outperform RNN-based methods on most datasets. The AKT model, which incorporates Rasch embeddings and a monotonic attention mechanism, shows significant improvement compared to SAKT.

4.2.2. Experimental Ablation Studies

To demonstrate the effectiveness of several key components of BPSKT, namely (1) long sequence effectiveness, (2) the effectiveness of GCN in extracting features, (3) the advantage of sparse attention, and (4) the role of sequence decoder with forgetting, this paper conducts ablation experiments to investigate the impact of these components on the overall performance of BPSKT.
The first experiment verifies BPSKT on response pairs of mixed sequence lengths, confirming its performance advantage over its variant BPSKT-MS in this setting. BPSKT-MS was trained by adding a portion (about 10%) of short-sequence response pairs to the original long-sequence response pairs, with all other conditions kept constant.
The second experiment compares the original model with BPSKT-NG, where the latter removes the two-layer GCN used for extracting dataset features and replaces it with a traditional RNN. BPSKT-NG thus denotes the BPSKT method without GCN.
The third experiment was designed to explore the effect of the attention mechanism on the test datasets and hence on the performance differences. This experiment tested BPSKT-NS, the BPSKT model without sparse attention, which instead uses BERT’s traditional self-attention. The benefit of sparse attention is demonstrated by comparing the two on the datasets used here, especially those with sparse data.
The last experiment considers the actual role played by the sequence decoder in the decoding process, specifically the forgetting feature curve it contains. This experiment compares the original model with its variant BPSKT-NAC, whose decoder does not carry a forgetting curve, to confirm the importance of the forgetting feature in long-sequence problems.
Table 4 shows the performance (AUC) results of the four ablation experiments of BPSKT under different variants. BPSKT outperforms all its variants, BPSKT-MS, BPSKT-NG, BPSKT-NS, and BPSKT-NAC, on all data sets. This shows that a series of modifications such as filtering the dataset, extracting the sequence features of the dataset using GCN, and replacing the attentional mechanism effectively improved the prediction performance of the model for the KT task.
Specifically, when only long response sequences are used as the dataset content, BPSKT shows a significant improvement in AUC over BPSKT-MS on the five datasets, with increases of 1.74%, 1.89%, 1.73%, 1.07%, and 1.53%, respectively. The reason may be that the model cannot efficiently identify the distribution when dealing with mixed long and short sequences, and treating all sequences uniformly by default leads to poor results. This also verifies that the model’s modules are better suited to processing purely long-sequence data than mixed sequences. In addition, BPSKT-NG outperforms BPSKT-NS on all datasets except ASSISTments2015, but both are worse than BPSKT. This result first proves the effectiveness of the two modules, the GCN and the sparse attention mechanism. On the one hand, without the GCN, sequence features cannot be extracted for pre-training, which degrades the results. On the other hand, after removing sparse attention or replacing it with other attention, the model cannot perform cross-step feature extraction, generating additional noise as well as computational time cost; this effect is particularly visible on sparse datasets such as ASSISTments2017. The results also demonstrate that the sparse attention mechanism contributes more to improving model performance than the GCN feature extraction, and that it is the single component whose removal most affects the experimental results of the original model. The line graph in Figure 6 highlights the effect of the attention module on the model structure: especially on the last three, sparser, datasets, the effect of sparse attention is more significant, directly accounting for differences of 2.82%, 2.86%, and 3.20%, respectively, which proves the effectiveness of the model.
Finally, comparing BPSKT-NAC, the effect is again somewhat lower than that of the BPSKT model, but by a smaller margin. This suggests that including forgetting curves in the long-sequence problem does improve prediction accuracy to some extent, but it is not the central solution to the problem of long-sequence prediction accuracy.

4.2.3. Visualization of Experimental Studies

To validate the effectiveness of BPSKT in tracking learners’ learning processes, the final part of this experiment presents a specific case study, predicting answer accuracy by recording the changes in a student’s knowledge state. The student experiment data was extracted from the ASSISTments 2017 dataset, and relevant topics containing three main knowledge points were selected. These knowledge points are: “13: Interpreting Linear Equations”, “81: Graph Interpretation”, and “99: Geometry”.
Figure 7 shows the visualization of the experiment. When the predicted value $p$ is less than or equal to 0.2, the student’s response pairs are all 0, indicating that this student did not answer any part of such questions correctly and suggesting that the student did not acquire this knowledge at all. When $p$ is greater than or equal to 0.8, all of the student’s response pairs are 1, indicating that the student answered the question completely correctly and mastered the knowledge point completely. When $p$ is greater than 0.2 and less than 0.8, the student’s response pairs are 0.5, indicating that the student’s answers to this type of question are partially correct and the knowledge point is partially mastered. Based on this pattern, it is possible to better predict the probability that different students answer different long sequences of problems accurately.

5. Conclusions and Future Work

This paper proposes a sparse attention knowledge tracing model based on BERT pre-training to improve prediction performance on long-sequence KT tasks. The BPSKT architecture contains two important components, GCN feature extraction and the sparse attention mechanism, which are used to mine response-pair features from a learner’s learning history and to track changes in their knowledge over time. First, the long sequence is retained as the original embedding and input into a two-layer GCN for feature extraction. The extracted feature information is then fed into the sparse attention BERT model for pre-training. Next, fine-tuning is performed to satisfy downstream prediction tasks, where the decoder weights the encoder output (considering the forgetting-related matrix operation) and obtains the knowledge state at each time step through transfer learning. Finally, this paper demonstrates the superiority of the BPSKT method on sparse datasets and long-sequence problems through comparison experiments, verifies the structural effectiveness of BPSKT through ablation experiments, and confirms the accuracy and interpretability of the BPSKT model’s predictions through visualization experiments.
However, there are still some limitations to this work. The ablation experiments considered one such factor, the mixed sequence. Typically, when a student practices online, it is not common for them to train only on long-sequence problems; they usually practice both short- and long-sequence problems. Compared to training only on long-sequence problems, the BPSKT model performs worse when training on mixed sequences, indicating that this “noise” reduces the model’s performance. Therefore, future work will consider the “balancing” problem to make the model more suitable for mixed sequences. By combining the model with a module suited to handling short sequences and adding specific filtering conditions to train long and short sequences separately, the experimental performance on mixed-sequence datasets can be improved, thereby further enhancing the interpretability and reliability of the modeling process.

Author Contributions

Conceptualization and methodology, W.Z.; investigation, data curation and writing—original draft, Z.X.; writing—review and editing, supervision and project administration, L.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Shandong Province Key R&D Program (Soft Science Project) 2023RKY02009.

Institutional Review Board Statement

Not applicable; the study did not require ethical approval.

Informed Consent Statement

Not applicable; this study did not involve humans.

Data Availability Statement

The public dataset used in this article is sourced from https://sites.google.com/site/assistmentsdata/home, https://sites.google.com/view/assistmentsdatamining/data-mining-competition-2017, https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=507, https://github.com/riiid/ednet, accessed on 20 September 2023. The experimental data of some classical model dataset experiments are derived from https://d.wanfangdata.com.cn/periodical/jsjkxyts202208005, accessed on 20 September 2023.

Acknowledgments

The authors thank Liqing Qiu and her team for providing the server enablement funding needed to conduct this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Nwana, H.S. Intelligent tutoring systems: An overview. Artif. Intell. Rev. 1990, 4, 251–277. [Google Scholar] [CrossRef]
  2. Abdelrahman, G.; Wang, Q.; Nunes, B. Knowledge tracing: A survey. ACM Comput. Surv. 2023, 55, 1–37. [Google Scholar] [CrossRef]
  3. Song, X.; Li, J.; Cai, T.; Yang, S.; Yang, T.; Liu, C. A survey on deep learning based knowledge tracing. Knowl.-Based Syst. 2022, 258, 110036. [Google Scholar] [CrossRef]
  4. Anderson, J.R. Cognitive Modelling and Intelligent Tutoring; Psychology Press: Hillsdale, NJ, USA, 1986. [Google Scholar]
  5. Corbett, A.T.; Anderson, J.R. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Model. User-Adapt. Interact. 1994, 4, 253–278. [Google Scholar] [CrossRef]
  6. Piech, C.; Bassen, J.; Huang, J.; Ganguli, S.; Sahami, M.; Guibas, L.J.; Sohl-Dickstein, J. Deep knowledge tracing. Adv. Neural Inf. Process. Syst. 2015, 28, 505–513. [Google Scholar]
  7. Wang, L.; Sy, A.; Liu, L.; Piech, C. Deep knowledge tracing on programming exercises. In Proceedings of the Fourth (2017) ACM Conference on Learning@ Scale, Cambridge, MA, USA, 20–21 April 2017; pp. 201–204. [Google Scholar]
  8. Zhang, J.; Shi, X.; King, I.; Yeung, D.Y. Dynamic key-value memory networks for knowledge tracing. In Proceedings of the 26th International Conference on World Wide Web, Perth, Australia, 3–7 May 2017; pp. 765–774. [Google Scholar]
  9. Nakagawa, H.; Iwasawa, Y.; Matsuo, Y. Graph-based knowledge tracing: Modeling student proficiency using graph neural network. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, Thessaloniki, Greece, 14–17 October 2019; pp. 156–163. [Google Scholar]
  10. Tan, W.; Jin, Y.; Liu, M.; Zhang, H. Bidkt: Deep knowledge tracing with bert. In Proceedings of the International Conference on Ad Hoc Networks, Virtual, 6–7 December 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 260–278. [Google Scholar]
  11. Song, X.; Li, J.; Lei, Q.; Zhao, W.; Chen, Y.; Mian, A. Bi-CLKT: Bi-graph contrastive learning based knowledge tracing. Knowl.-Based Syst. 2022, 241, 108274. [Google Scholar] [CrossRef]
  12. Su, Y.; Liu, Q.; Liu, Q.; Huang, Z.; Yin, Y.; Chen, E.; Ding, C.; Wei, S.; Hu, G. Exercise-enhanced sequential modeling for student performance prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  13. Child, R.; Gray, S.; Radford, A.; Sutskever, I. Generating long sequences with sparse transformers. arXiv 2019, arXiv:1904.10509. [Google Scholar]
  14. Tong, H.; Wang, Z.; Zhou, Y.; Tong, S.; Han, W.; Liu, Q. Hgkt: Introducing hierarchical exercise graph for knowledge tracing. arXiv 2020, arXiv:2006.16915. [Google Scholar]
  15. Asselman, A.; Khaldi, M.; Aammou, S. Enhancing the prediction of student performance based on the machine learning XGBoost algorithm. Interact. Learn. Environ. 2021, 31, 3360–3379. [Google Scholar] [CrossRef]
  16. Mnih, V.; Heess, N.; Graves, A. Recurrent models of visual attention. In Proceedings of the Advances in Neural Information Processing Systems 27 (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014. [Google Scholar]
  17. Bahdanau, D. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  18. Sammani, F.; Melas-Kyriazi, L. Show, edit and tell: A framework for editing image captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4808–4816. [Google Scholar]
  19. Wojna, Z.; Gorban, A.N.; Lee, D.S.; Murphy, K.; Yu, Q.; Li, Y.; Ibarz, J. Attention-based extraction of structured information from street view imagery. In Proceedings of the IEEE 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; Volume 1, pp. 844–850. [Google Scholar]
  20. Vaswani, A. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  21. Devlin, J. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  22. Ghosh, A.; Heffernan, N.; Lan, A.S. Context-aware attentive knowledge tracing. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual, 6–10 July 2020; pp. 2330–2339. [Google Scholar]
  23. Pandey, S.; Karypis, G. A self-attentive model for knowledge tracing. arXiv 2019, arXiv:1907.06837. [Google Scholar]
  24. Luo, Y.; Xiao, B.; Jiang, H.; Ma, J. Heterogeneous graph based knowledge tracing. In Proceedings of the IEEE 2022 11th International Conference on Educational and Information Technology (ICEIT), Chengdu, China, 6–8 January 2022; pp. 226–231. [Google Scholar]
  25. Yang, Y.; Shen, J.; Qu, Y.; Liu, Y.; Wang, K.; Zhu, Y.; Zhang, W.; Yu, Y. GIKT: A graph-based interaction model for knowledge tracing. In Proceedings of the Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2020, Ghent, Belgium, 14–18 September 2020; Proceedings, Part I. Springer: Berlin/Heidelberg, Germany, 2021; pp. 299–315. [Google Scholar]
  26. Wei, L.; Li, B.; Li, Y.; Zhu, Y. Time interval aware self-attention approach for knowledge tracing. Comput. Electr. Eng. 2022, 102, 108179. [Google Scholar] [CrossRef]
  27. Graves, A. Adaptive computation time for recurrent neural networks. arXiv 2016, arXiv:1603.08983. [Google Scholar]
  28. Roy, A.; Saffar, M.; Vaswani, A.; Grangier, D. Efficient content-based sparse attention with routing transformers. Trans. Assoc. Comput. Linguist. 2021, 9, 53–68. [Google Scholar] [CrossRef]
  29. Ye, Z.; Guo, Q.; Gan, Q.; Qiu, X.; Zhang, Z. Bp-transformer: Modelling long-range context via binary partitioning. arXiv 2019, arXiv:1911.04070. [Google Scholar]
  30. Kitaev, N.; Kaiser, Ł.; Levskaya, A. Reformer: The efficient transformer. arXiv 2020, arXiv:2001.04451. [Google Scholar]
  31. Lei Ba, J.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
  32. Xin, J.; Tang, R.; Lee, J.; Yu, Y.; Lin, J. DeeBERT: Dynamic early exiting for accelerating BERT inference. arXiv 2020, arXiv:2004.12993. [Google Scholar]
  33. Choi, Y.; Lee, Y.; Shin, D.; Cho, J.; Park, S.; Lee, S.; Baek, J.; Bae, C.; Kim, B.; Heo, J. Ednet: A large-scale hierarchical dataset in education. In Proceedings of the Artificial Intelligence in Education: 21st International Conference, AIED 2020, Ifrane, Morocco, 6–10 July 2020; Proceedings, Part II 21. Springer: Berlin/Heidelberg, Germany, 2020; pp. 69–73. [Google Scholar]
  34. Shen, S.; Liu, Q.; Chen, E.; Wu, H.; Huang, Z.; Zhao, W.; Su, Y.; Ma, H.; Wang, S. Convolutional knowledge tracing: Modeling individualization in student learning process. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 25–30 July 2020; pp. 1857–1860. [Google Scholar]
  35. Guo, X.; Huang, Z.; Gao, J.; Shang, M.; Shu, M.; Sun, J. Enhancing knowledge tracing via adversarial training. In Proceedings of the 29th ACM International Conference on Multimedia, Chengdu, China, 20–29 October 2021; pp. 367–375. [Google Scholar]
Figure 1. Results of student S practice for different sequence problems.
Figure 2. General architecture of BPSKT. The feature information in the original embeddings is extracted using a two-layer GCN network. The extracted features are then pre-trained through the BERT model with sparse attention. Finally, the pre-trained data is fine-tuned and decoded before being passed to the predictive model. The figure is simplified, and some details of the decoder’s internal nodes are omitted for ease of understanding.
Figure 3. Unweighted feature graph and its adjacency matrix, degree matrix, and Laplacian matrix representation.
Figure 4. Network structure of BERT. The inner part of the black box represents the internal composition of an encoder, where the previous multi-head attention used to compute Q, K, and V is replaced by a sparse attention mechanism.
Figure 5. Internal structure of a sequence decoder with the same sparse attention mechanism and an additional forgetting function to better track the interaction states of sparse datasets.
Figure 6. Line-plot comparison of the BPSKT-NS variant with the BPSKT model.
Figure 7. Visualization of a case of change in a student’s knowledge state, with records extracted from the ASSISTments2017 dataset. The probability values in the squares indicate the predictions obtained through the different methods.
Table 1. Details of the Five Datasets Used in this Paper.
| Dataset | Questions | Students | Concepts | Interactions | Public |
|---|---|---|---|---|---|
| ASSISTments2009 | 26,688 | 4217 | 123 | 325,600 | Yes |
| ASSISTments2015 | 100 | 19,917 | - | 683,566 | Yes |
| ASSISTments2017 | 1680 | 3155 | 411 | 870,866 | Yes |
| Statics2011 | 803 | 351 | 362 | 350,192 | Yes |
| EdNet | 11,658 | 10,000 | 290 | 687,265 | Yes |
Table 2. Results of the comparison experiments with the six knowledge-tracing models, where the highest AUC and ACC values for each item have been bolded.
| Dataset | DKT (AUC / ACC) | DKVMN (AUC / ACC) | CKT (AUC / ACC) | Bi-CLKT (AUC / ACC) | AKT (AUC / ACC) | SAKT (AUC / ACC) | BPSKT (AUC / ACC) |
|---|---|---|---|---|---|---|---|
| ASSISTments2009 | 0.8600 / 0.8385 | 0.8157 / 0.8003 | 0.8254 / 0.8477 | 0.8377 / 0.8022 | 0.8346 / 0.8379 | 0.8480 / 0.8157 | 0.8785 / 0.8576 |
| ASSISTments2015 | 0.7365 / 0.7125 | 0.7268 / 0.7022 | 0.7291 / 0.7355 | 0.7652 / 0.7575 | 0.7828 / 0.7883 | 0.8540 / 0.8621 | 0.8288 / 0.8438 |
| ASSISTments2017 | 0.7343 / 0.7055 | 0.6853 / 0.6695 | 0.7119 / 0.7345 | 0.7450 / 0.7642 | 0.7702 / 0.7725 | 0.7340 / 0.7369 | 0.7865 / 0.8038 |
| Statics2011 | 0.8233 / 0.8038 | 0.8284 / 0.8089 | 0.8241 / 0.8229 | 0.8321 / 0.8472 | 0.8268 / 0.8191 | 0.8530 / 0.8324 | 0.8692 / 0.8522 |
| EdNet | 0.7638 / 0.7452 | 0.7663 / 0.7079 | 0.7327 / 0.7491 | 0.7756 / 0.7792 | 0.7686 / 0.7756 | 0.7513 / 0.7073 | 0.8005 / 0.8241 |
Table 3. The efficiency experiment results of the comparison models on the ASSISTments2017 dataset, with the best values in bold.
| Efficiency | DKT | DKVMN | CKT | Bi-CLKT | AKT | SAKT | BPSKT |
|---|---|---|---|---|---|---|---|
| Training Time | 13 h 32 min | 12 h 57 min | 13 h 3 min | 10 h 15 min | 9 h 27 min | 9 h 55 min | 7 h 52 min |
| Prediction Time | 7 h 23 min | 7 h 40 min | 8 h 10 min | 6 h 47 min | 6 h 12 min | 6 h 28 min | 5 h 44 min |
| Memory Usage Ratio (GB) | 6.2/7.9 (78%) | 6.7/7.9 (85%) | 6/7.9 (76%) | 5.6/7.9 (71%) | 5.2/7.9 (66%) | 5.6/7.9 (71%) | 4.5/7.9 (57%) |
Table 4. Effect of ablation of different variants on model performance (AUC). Each is a BPSKT variant without a specific component. Bold is used to highlight the maximum value.
| Dataset | BPSKT-MS | BPSKT-NG | BPSKT-NS | BPSKT-NAC | BPSKT |
|---|---|---|---|---|---|
| ASSISTments2009 | 0.8611 | 0.8689 | 0.8555 | 0.8726 | 0.8785 |
| ASSISTments2015 | 0.8099 | 0.8117 | 0.8257 | 0.8232 | 0.8288 |
| ASSISTments2017 | 0.7692 | 0.7621 | 0.7583 | 0.7752 | 0.7865 |
| Statics2011 | 0.8585 | 0.8617 | 0.8406 | 0.8603 | 0.8692 |
| EdNet | 0.7852 | 0.7731 | 0.7685 | 0.7967 | 0.8005 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
