Article

Imitating Human Go Players via Vision Transformer

Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu 30010, Taiwan
* Author to whom correspondence should be addressed.
Algorithms 2025, 18(2), 61; https://doi.org/10.3390/a18020061
Submission received: 24 December 2024 / Revised: 16 January 2025 / Accepted: 22 January 2025 / Published: 24 January 2025
(This article belongs to the Special Issue Algorithms for Games AI)

Abstract

Developing AI algorithms for the game of Go has long been a challenging task. While tools such as AlphaGo have revolutionized gameplay, their focus on maximizing win rates often leads to moves that are incomprehensible to human players, limiting their utility as training aids. This work introduces a novel approach to bridge this gap by leveraging a Vision Transformer (ViT) to develop an AI model that achieves professional-level play while mimicking human decision-making. Using a dataset from the KGS Go server, our ViT-based model achieves 51.49% accuracy in predicting expert moves with a simple feature set. Comparative analysis against CNN-based models highlights the ViT’s superior performance in capturing patterns and replicating expert strategies. These findings establish ViTs as promising tools for enhancing Go training by aligning AI strategies with human intuition.

1. Introduction

Developing algorithms for Go programs capable of human-level performance has long been considered an intractable challenge. This difficulty stems from the immense search space and the complex evaluation functions involved, where a single move can drastically influence the game’s outcome. Traditional hand-crafted evaluation functions and rules have consistently fallen short of achieving even amateur-level gameplay. Consequently, the predominant approach over the decades has been to leverage deep learning models trained on expert gameplay records. These models, typically trained under a supervised learning paradigm, treat board positions as inputs and use the corresponding moves made by human experts as labels.
The groundbreaking success of AlphaGo has brought convolutional neural network (CNN)-based methods [1] to the forefront of Go-related tasks. By employing CNNs, AlphaGo not only defeated professional Go players but also demonstrated the viability of deep learning in mastering this intricate game. The successes of AlphaGo [2] and AlphaGo Zero [3] have encouraged Go players to use AI-powered software as training tools.
For instance, Figure 1 illustrates a fifth-line shoulder hit played by AlphaGo against professional ninth-dan player Lee Sedol. This move was described by commentators as “an unthinkable move” and “a mistake”, as it defied conventional human Go intuition. Similarly, Figure 2 highlights another “mistake” by AlphaGo: selecting move A over move B, despite the latter being clearly superior from a human perspective, as it maximizes territory. This divergence arises because AlphaGo calculates win rates through a combination of neural networks and Monte Carlo Tree Search, which leads to a “thinking process” entirely different from human reasoning [4]. Other examples of non-human-like moves, such as slack moves played with no strategic intention other than prolonging the game, further emphasize the knowledge gap between humans and AI. Such moves contradict human intuition and experience, rendering them incomprehensible for players seeking to learn from the model.
The success of existing Go AI systems [2,3,5] lies in their use of self-play, where they play against themselves to maximize their win rate. This approach, however, leads to AI-generated moves that may not resemble human-like strategies, making them less suitable as a training tool. If players cannot understand the reasoning behind AI-generated moves, they are unlikely to replicate such strategies in human-versus-human competitions. Therefore, our study aims to develop a model capable of both formulating professional-level strategies and emulating human gameplay. This would allow Go players to engage with the AI as if they were playing against professional human opponents. Additionally, our model can serve as a post-game analysis tool to help players identify and correct mistakes made during the game.
While CNNs have delivered remarkable results in Go, they also exhibit notable limitations. Certain Go patterns, such as ladders and group life-and-death scenarios, involve long-range dependencies that require CNNs to be exceptionally deep. This increases network complexity and introduces challenges such as gradient vanishing or explosion.
Recent advances in Vision Transformers (ViTs) [6] suggest that Transformers can effectively handle computer vision tasks, often outperforming CNNs across multiple domains. The multi-head self-attention mechanism in ViTs is particularly adept at capturing long-range dependencies, making it well suited for addressing challenges in Go-related tasks.
In this study, we trained a ViT model on a dataset of game records from human experts. Our model achieved a top-one accuracy of 51.49% and a top-five accuracy of 83.12% using a simple feature set, surpassing a CNN-based model’s top-one accuracy of 50.45%. In addition to conventional top-k accuracy metrics, we adopted extended top-k accuracy to account for equivalent moves—logically identical moves with different coordinates—addressing a unique evaluation challenge in Go.
This study makes several key contributions to the field of AI-driven Go training and gameplay:
  • Application of Vision Transformers (ViTs) in Go: We introduce a ViT-based model that captures long-range dependencies in Go and predicts future moves, providing deeper insights into game progression and improving gameplay understanding.
  • Bridging AI and Human Intuition: Our model emulates human decision-making, providing a more intuitive training tool for Go players to engage with the AI as if playing against a professional.
  • Improved Top-one Accuracy Compared to Existing Studies: By utilizing the ViT model, we achieved higher top-one accuracy compared to previous studies, demonstrating the superior capability of ViT in Go move prediction.
The remainder of this article is structured as follows: Section 2 reviews related work on applying deep learning models to Go, with a focus on CNNs, and introduces the background of ViT. Section 3 describes the dataset and the features used as model inputs. Section 4 presents the experimental setup, comparing the performance of ViT with that of two CNN-based models from prior studies on imitating human moves. Section 5 concludes the study and outlines potential directions for future work.

2. Background and Related Work

2.1. Deep Learning and Go

Deep learning has been applied across various fields, including time-series data prediction [7], medicine [8,9], and games [2,10,11,12,13,14]. In the domain of board games, winning strategies typically involve two main steps: evaluating the position and identifying the best move. However, designing a hand-crafted evaluation function for complex games such as Go is particularly challenging. To address this, researchers have utilized convolutional neural networks (CNNs) to learn directly from expert gameplay, bypassing the need for manual evaluation function design.
In 2003, Van Der Werf et al. [10] used hand-crafted features and proposed feature extraction methods such as the Eigenspace Separation Transform, combined with dimensional reduction, to train a multi-layer perceptron (MLP). They achieved 25% accuracy on professional games. Subsequent studies saw advancements with the adoption of CNNs. In 2008, Sutskever and Nair [11] used CNNs to mimic Go experts, employing previous moves and the current position as input. They achieved 34% accuracy for a single CNN model and 39.9% for an ensemble of CNNs. In 2014, Clark and Storkey [12] employed a more sophisticated feature set containing edge encoding and Ko constraints to train an eight-layer CNN, achieving 41.1% and 44.4% accuracy on two different Go datasets. In 2015, Maddison et al. [13] used a large feature set incorporating 36 feature planes, including the rank of the player and ladder patterns. They also used a deeper 12-layer CNN and achieved 55% accuracy. The work of AlphaGo [2] included a supervised learning stage to train the policy network using Go expert data. Their model used a much larger dataset with 48 feature planes and a 12-layer CNN, reaching 55.4% accuracy on the test set. These results highlighted the feasibility of applying CNNs to imitate human expert moves.
However, in Section 1, we pointed out the limitations of CNNs and aimed to demonstrate that Vision Transformers (ViTs) could outperform CNNs, making them a better candidate for imitating Go experts. We compared the performance of ViTs and CNNs on a dataset collected from the KGS Go server [15]. The models used were based on the same architecture as the policy network from AlphaGo [2] and the 12-layer CNN from Clark and Storkey [12]. The results indicated that ViT could predict more expert moves correctly on the test set.

2.2. Vision Transformer (ViT)

In 2017, the Transformer architecture was introduced by Vaswani et al. [16] to address challenges in natural language processing. Building on the success of Transformers and the self-attention mechanism, Dosovitskiy et al. [6] presented the Vision Transformer (ViT) for computer vision (Figure 3). ViT processes input images by first splitting them into patches. These patches are flattened, linearly projected into embedding vectors, and combined with learnable positional embeddings to encode spatial information. An additional class embedding is included, and these embeddings serve as inputs for the Transformer blocks. The final output is passed to an MLP head, which produces the classification result. This figure is adapted from Figure 1 of [6].
The innovation of the ViT lies in its ability to effectively capture global context and long-range dependencies, outperforming CNNs that focus on local receptive fields. ViT treats image patches as tokens, akin to words in a sentence, enabling it to process the entire image simultaneously. After encoding spatial information through positional embeddings, ViT applies Transformer encoder blocks to perform self-attention operations on the patches. The MLP head at the last encoder block generates the final classification. A detailed explanation of the ViT architecture is provided in Section 3.
ViT has demonstrated state-of-the-art performance on multiple image classification benchmarks, often surpassing traditional CNNs. The success of ViT and its derivatives, such as the Swin Transformer [17], has inspired this study to explore the application of ViT in analyzing the game of Go.
The closest work to this study is that of Sagri et al. [18], in which EfficientFormer was used to play Go. However, their goal focused on maximizing the win rate rather than imitating human experts. Although their work applied Transformers to Go, the task itself differs from our approach. To the best of our knowledge, this is the first study to specifically focus on ViT in the context of Go with the goal of imitating human expert moves. We believe the results of this study can significantly contribute to the development of applications that mimic human Go experts, ultimately serving as valuable training tools.

3. Materials and Methods

3.1. Dataset

3.1.1. Dataset Formation

The dataset used in this study was collected from the KGS Go server, spanning from 2004 to 2019. To maintain a consistent level of gameplay quality, we excluded handicap games and focused solely on matches involving players ranked between 4th dan and 9th dan. This decision was made to ensure that the model would learn from expert-level gameplay, as these ranks represent a high degree of skill and strategic depth, avoiding the potential noise or suboptimal moves that might arise from lower-ranked or handicap games. The dataset provides a robust foundation for training a model aimed at imitating professional-level gameplay.
The dataset was divided into a training set and a holdout set, with the training set comprising 26,078,877 positions and the holdout set containing 1,376,321 positions. Each position captures a snapshot of the board at a specific point during a game, representing the arrangement of stones for both players. These positions serve as the input data for the model, enabling it to analyze and predict the next move based on the current board configuration.
One of the key properties of Go is its inherent symmetry. Due to the rules of the game, the information provided by each board position remains invariant under horizontal flips, vertical flips, and rotations. This characteristic, referred to as the symmetry property, allows significant data augmentation opportunities. In this study, we leveraged this property by applying all eight symmetries of the board (the four rotations of 0°, 90°, 180°, and 270°, each with and without reflection), as shown in Figure 4. These transformations effectively expand the dataset while preserving the original strategic context of each position.
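To make the augmentation concrete, the following minimal sketch (illustrative, not the authors' released code; `board_symmetries` is a hypothetical helper name) enumerates the eight symmetric views of a board with NumPy:

```python
import numpy as np

def board_symmetries(board: np.ndarray) -> list:
    """Return the eight equivalent views of a 19 x 19 Go board:
    the four rotations, each with and without a horizontal flip."""
    views = []
    for k in range(4):                  # 0, 90, 180, 270 degrees
        rotated = np.rot90(board, k)
        views.append(rotated)
        views.append(np.fliplr(rotated))
    return views

# Example: a board with one black stone yields eight views.
board = np.zeros((19, 19), dtype=np.int8)
board[2, 3] = 1
assert len(board_symmetries(board)) == 8
```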
The labeling process for the dataset involved assigning a corresponding move to each board position. Specifically, the label for each position is the next move made by the human player in the game. This move is represented as a two-dimensional coordinate on the 19 × 19 Go board. To make the data compatible with the model, these coordinates are flattened into a one-hot vector of length 361, corresponding to all possible board positions (excluding the pass move). In this vector, the position where the move was played is marked with a value of 1, while all other positions are marked with 0. This one-hot encoding method provides a straightforward and efficient representation of the target variable, enabling the model to treat the task as a multi-class classification problem.
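As a sketch of this encoding (assuming row-major flattening, which the paper does not specify), a move at (row, col) maps to class index row × 19 + col:

```python
import numpy as np

def encode_move(row: int, col: int, size: int = 19) -> np.ndarray:
    """One-hot label of length 361 for a move at (row, col)."""
    label = np.zeros(size * size, dtype=np.float32)
    label[row * size + col] = 1.0       # flattened board index
    return label
```

In practice, the bare class index suffices for cross-entropy loss; the one-hot form simply makes the 361-way classification explicit.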
By focusing on expert-level gameplay and utilizing the symmetry property of Go, the dataset was designed to comprehensively reflect the strategic complexity and diversity of high-level matches. This approach ensures that the model is trained on data that closely mirror the patterns and decision-making processes of skilled human players.

3.1.2. Feature Preprocessing

The feature set used for our model is extracted from raw board positions, as detailed in Table 1. These features are encoded using binary planes, where each plane corresponds to a specific aspect of the board configuration (e.g., black stones, white stones, liberties, etc.). This encoding process provides a comprehensive representation of the board state, enabling the model to learn meaningful patterns and strategies from the raw positions.
During both the training and inference phases, the features are first stacked to form a tensor of shape $\mathbb{R}^{10 \times 19 \times 19}$, where 10 is the number of feature planes and $19 \times 19$ corresponds to the dimensions of the Go board. To make the data compatible with the Vision Transformer (ViT), zero padding is then applied along the last two dimensions, resulting in an input tensor of shape $\mathbb{R}^{10 \times 21 \times 21}$. This padding step is essential because the ViT divides its input into smaller patches, and the original board size of 19 is not evenly divisible by the patch size.
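A hedged sketch of this preprocessing step in PyTorch (`preprocess` is a hypothetical helper; only the shapes come from the text):

```python
import torch
import torch.nn.functional as F

def preprocess(planes: torch.Tensor) -> torch.Tensor:
    """Zero-pad the stacked feature planes from (10, 19, 19) to (10, 21, 21)
    so the board divides evenly into 7 x 7 patches."""
    assert planes.shape == (10, 19, 19)
    # pad = (left, right, top, bottom) over the last two dimensions
    return F.pad(planes, pad=(1, 1, 1, 1), value=0.0)

print(preprocess(torch.zeros(10, 19, 19)).shape)  # torch.Size([10, 21, 21])
```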
In the training phase, data augmentation techniques such as random flipping and rotation of the board positions are applied to further enhance the model’s ability to generalize. By exploiting the inherent symmetry of Go positions, the model is exposed to a broader variety of board configurations, improving its robustness and helping it learn patterns that are invariant under these transformations. Additionally, during inference, the same symmetric property is leveraged to apply test-time augmentation, allowing the model to make more robust predictions by considering multiple transformed versions of the board.
This approach not only helps in increasing the diversity of the training data but also ensures that the model is trained to recognize Go patterns that are consistent regardless of board orientation, leading to a more effective and generalizable model.

3.2. Vision Transformer for Go

In this work, we aim to demonstrate the capability of the Vision Transformer (ViT) to imitate the moves of human experts in the game of Go. This problem is formulated as a classification task, where each possible move on the board is treated as a distinct class. Given the 19 × 19 grid of the Go board, this results in a total of 361 possible classes, corresponding to all board positions, excluding the pass move. Each class represents a unique location on the board where a stone can be placed, and the model’s goal is to predict the next move based on the current board configuration, mimicking the decisions made by expert players.
This section provides an overview of the key components involved in applying the Vision Transformer (ViT) to the task of imitating expert-level gameplay in Go. For a comprehensive theoretical background on the ViT, readers are encouraged to refer to [6,16]. Our methodology draws inspiration from [6], particularly for Equations (1) and (2), which outline the mathematical foundations of the ViT. These equations have been adapted to align with the specific requirements of our task, demonstrating how the ViT can be effectively applied to analyze board positions and predict the next move in Go.
An overview of our method is presented in Figure 5. In the ViT, the input features must be reshaped into a sequence of feature patches. The features from Table 1 have the shape $\mathbb{R}^{10 \times 19 \times 19}$, which cannot be divided evenly into patches because 19 is a prime number; the features are therefore first padded with zeros to form $\mathbb{R}^{10 \times 21 \times 21}$.
With the patch size set to 7, the padded features are divided into 9 patches, each of size $7 \times 7$. These patches are then flattened into a sequence of 2D patches $p \in \mathbb{R}^{9 \times (10 \cdot 7^2)}$. A learnable linear projection maps each patch into $\mathbb{R}^{768}$, producing the projected patches $p_{proj} \in \mathbb{R}^{9 \times 768}$, which are suitable for processing by the ViT.
Before passing the patches into the Transformer encoder, an additional learnable class token $x_{class}$ is prepended to the sequence. Positional embeddings are then added to encode spatial information, which is critical for recognizing long-range dependencies and patterns on the Go board. The resulting tensor, denoted as $z_0$, serves as the input to the Transformer encoder. Formally,
$$z_0 = \left[ x_{class};\; p^{1}L;\; p^{2}L;\; \cdots;\; p^{9}L \right] + E_{pos}, \qquad L \in \mathbb{R}^{(10 \cdot 7^2) \times 768},\quad E_{pos} \in \mathbb{R}^{(9+1) \times 768} \tag{1}$$
This formulation ensures that the input data are structured in a way that fully utilizes the ViT’s capability to model the complex spatial and strategic relationships inherent in Go gameplay.
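A minimal PyTorch sketch of Equation (1), under the sizes stated above (9 patches of 10 × 7 × 7 features, embedding dimension 768); the module and its internals are our own illustrative construction, not the authors' code:

```python
import torch
import torch.nn as nn

class GoPatchEmbedding(nn.Module):
    def __init__(self, planes=10, patch=7, n_patches=9, dim=768):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(planes * patch * patch, dim)  # L in Eq. (1)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))

    def forward(self, x):                      # x: (B, 10, 21, 21)
        B, C, _, _ = x.shape
        p = self.patch
        # cut into non-overlapping p x p patches, then flatten each patch
        x = x.unfold(2, p, p).unfold(3, p, p)  # (B, C, 3, 3, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        x = self.proj(x)                       # (B, 9, 768)
        cls = self.cls_token.expand(B, -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos_embed  # z_0
```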
The Transformer encoder blocks used in our study are illustrated in Figure 3. To evaluate the impact of model depth on performance, we experimented with three different configurations of Vision Transformer (ViT) by varying the number of encoder blocks: 4, 8, and 12 blocks. Each encoder block comprises two key components: a multi-head attention (MA) module and a multi-layer perceptron (MLP) module. These components work together to model both local and global dependencies within the input data, enabling the network to capture the complex patterns inherent in Go gameplay.
The operations within each encoder block are described mathematically as follows:
  • Multi-head Attention Module:
The output of the multi-head attention module is computed as follows:
$$z'_l = \mathrm{MSA}\left(\mathrm{LN}(z_{l-1})\right) + z_{l-1}, \qquad l = 1, \ldots, L \tag{2}$$
Here, $z_{l-1}$ is the input to the $l$-th encoder block, LN denotes layer normalization, and MSA represents the multi-head self-attention mechanism. The residual connection ($+\,z_{l-1}$) ensures stability during training by mitigating issues such as vanishing gradients.
  • Multi-layer Perceptron Module:
The output of the MLP module is calculated as follows:
$$z_l = \mathrm{MLP}\left(\mathrm{LN}(z'_l)\right) + z'_l, \qquad l = 1, \ldots, L \tag{3}$$
where $z'_l$ is the intermediate representation from the multi-head attention module, and the residual connection ($+\,z'_l$) further enhances training stability.
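The two residual updates above can be sketched as a single pre-norm encoder block in PyTorch; the head count and MLP width are assumptions consistent with ViT-Base, not values reported in the paper:

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, z):
        h = self.ln1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]  # Eq. (2)
        z = z + self.mlp(self.ln2(z))                      # Eq. (3)
        return z
```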
In this study, L is set to 4, 8, and 12 to compare the performance of Vision Transformer (ViT) models with varying depths. By experimenting with these different configurations, we aim to evaluate the trade-offs between model complexity and predictive accuracy in the task of mimicking expert-level Go gameplay.
After processing the input through all L encoder blocks, the output from the first embedding (corresponding to the class token) of the last encoder block undergoes layer normalization. This normalized representation serves as the basis for generating the final score for each possible move on the Go board.
These scores are then converted into a probability distribution over all potential moves using the softmax function. The softmax operation ensures that the scores are transformed into a normalized probability distribution, where the probabilities of all possible moves sum to 1. Mathematically, the softmax function is defined as follows:
$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}} \tag{4}$$
where $z = (z_1, z_2, \ldots, z_n)$ represents the input logits, and $\mathrm{softmax}(z)_i$ is the probability assigned to the $i$-th class (or move). Here, $n$ corresponds to the total number of classes, which, in the context of a $19 \times 19$ Go board, is 361 (excluding the pass move).
This final probability distribution indicates the model’s confidence for each possible move and is used to select the next move during inference, thereby enabling the model to imitate the decision-making process of expert Go players.
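Putting the last two steps together, a sketch of the classification head (a hypothetical module; the paper does not publish layer-level details beyond the description above):

```python
import torch
import torch.nn as nn

class MoveHead(nn.Module):
    """Layer-normalize the class-token output of the final encoder block
    and map it to a distribution over the 361 board points."""
    def __init__(self, dim=768, n_moves=361):
        super().__init__()
        self.ln = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, n_moves)

    def forward(self, z):                       # z: (B, 10, 768)
        logits = self.fc(self.ln(z[:, 0]))      # class token only
        return torch.softmax(logits, dim=-1)    # Eq. (4)

# At inference, the predicted move is the argmax index, recovered as
# board coordinates via divmod(index, 19) under row-major flattening.
```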

4. Experiments and Results

4.1. Experimental Settings

4.1.1. Training Settings for ViT

The model was trained on the training set for 80 epochs, and the results were evaluated on the holdout set after training. Each minibatch consisted of the trajectory of an entire game, where each data point represented a pair: a board position and the corresponding next move made by a human expert. Data points where the move was “pass” were discarded. Features extracted from the board position served as the input, while the move played was used as the ground truth label.
The Adam optimizer was employed for optimization, with an initial learning rate of 0.0001, and cross-entropy loss was used as the loss function. ViT models of several depths were trained, with each model requiring approximately 10 to 14 days on a single NVIDIA RTX 4070 GPU with 12 GB of memory.
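A hedged sketch of this training configuration (the data loader and model are placeholders; only the optimizer, learning rate, loss, and epoch count come from the text):

```python
import torch
import torch.nn as nn

def train(model, game_loader, epochs=80, device="cuda"):
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    model.to(device).train()
    for _ in range(epochs):
        for features, moves in game_loader:    # one game per minibatch
            features = features.to(device)
            moves = moves.to(device)           # class indices in [0, 361)
            loss = loss_fn(model(features), moves)
            opt.zero_grad()
            loss.backward()
            opt.step()
```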

4.1.2. Training Settings for CNNs

We also evaluated two additional models to compare the performance of the ViT with that of CNN-based architectures, using identical training and holdout sets. The comparison models included the policy network architecture from AlphaGo (Figure 6) and a 12-layer CNN adapted from [12] (Figure 7).
The policy network begins by padding the input features with zeros to form a tensor of size $\mathbb{R}^{10 \times 23 \times 23}$. This tensor is processed through a sequence of convolutional layers, each followed by rectified linear unit (ReLU) activations. A final $1 \times 1$ convolutional layer, combined with flattening operations, converts the tensor into a vector of size $\mathbb{R}^{361}$. A softmax operation then derives the probability distribution of potential moves.
The 12-layer CNN processes preprocessed input data structured as a tensor of size $\mathbb{R}^{10 \times 19 \times 19}$. A sequence of convolutional layers with ReLU activations is applied, and zero padding ensures that the tensor size remains $\mathbb{R}^{10 \times 19 \times 19}$. The final $1 \times 1$ convolutional layer outputs two $\mathbb{R}^{19 \times 19}$ tensors. Each tensor is flattened, and softmax operations are applied to produce the probability distributions for black and white moves, respectively.
In this study, these models are referred to as the “policy network” and the “12-layer CNN”, and their performance is compared to that of the ViT.
The two models share a similar CNN backbone structure, but their output layers are tailored to different objectives. The policy network employs a single output block that consolidates all information into a unified prediction. In contrast, the 12-layer CNN utilizes two separate output blocks, enabling it to handle game dynamics uniquely for each player. This distinction is significant, as it allows the 12-layer CNN to make more nuanced predictions by better capturing the complexities of player-specific strategies.
A significant architectural difference between the models lies in their input layers. The policy network uses the same comprehensive feature set as the ViT model, encompassing various aspects of the game state, such as the current board configuration, previous moves, and additional relevant features. In contrast, the 12-layer CNN excludes the “turn” feature from its input set, as detailed in Table 1. This exclusion required adjustments to the input layer to ensure the model could still effectively process the board state.
Both CNN-based models underwent the same rigorous training process. They were trained for 80 epochs, providing ample time for the models to learn from the training data. Stochastic Gradient Descent (SGD) was used as the optimization algorithm, with an initial learning rate of 0.003 that decayed by 50% every 16 epochs. Cross-entropy loss was employed as the loss function. Notably, the 12-layer CNN only updated parameters related to its output layer during training.
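The stated schedule (SGD at 0.003, halved every 16 epochs) maps directly onto PyTorch's StepLR; a sketch:

```python
import torch

def make_cnn_optimizer(model):
    opt = torch.optim.SGD(model.parameters(), lr=0.003)
    # halve the learning rate every 16 epochs; call sched.step() per epoch
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=16, gamma=0.5)
    return opt, sched
```

Over 80 epochs this yields learning rates of 0.003, 0.0015, 0.00075, 0.000375, and 0.0001875.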
By comparing the performance of these two CNN-based models with the ViT approach, we aimed to gain deeper insights into how different deep learning architectures perform in the task of imitating human Go players. This comparative analysis is essential for identifying the model best suited for use as a training or analysis tool for Go players, ultimately contributing to their skill enhancement and understanding of the game.

4.1.3. Test-Time Augmentation

Test-time augmentation (TTA) is a technique employed to enhance the robustness and accuracy of model predictions during inference by generating multiple augmented samples from the test data. This approach mitigates the impact of individual transformations, providing a more reliable and consistent prediction. Go, with its symmetric board properties, is particularly well suited for TTA, as the board can be rotated and reflected in various ways without altering the fundamental nature of the game (Figure 8).
At inference time, each board position is augmented using its symmetric properties, generating eight symmetries through rotations and reflections. These symmetries are treated as a minibatch of inputs for the model. For each augmented position, the model computes a probability distribution by applying the softmax operation to its predictions. The resulting eight probability distributions are averaged to produce the final move probability distribution.
By leveraging TTA, the model’s predictions become more stable and reliable, as they incorporate the game board’s inherent symmetries. This process ensures that the model evaluates all symmetrical configurations, ultimately improving its predictive accuracy and robustness.
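A minimal sketch of this procedure, with `predict_fn` standing in for a trained model that returns a 19 × 19 probability map; mapping each prediction back to the original orientation before averaging is our assumption of the necessary bookkeeping:

```python
import numpy as np

def tta_predict(predict_fn, planes: np.ndarray) -> np.ndarray:
    """planes: (10, 19, 19) features; returns the averaged (19, 19)
    move-probability map over all eight symmetric views."""
    probs = np.zeros((19, 19), dtype=np.float64)
    for k in range(4):
        for flip in (False, True):
            view = np.rot90(planes, k, axes=(1, 2))
            if flip:
                view = view[:, :, ::-1]
            p = predict_fn(np.ascontiguousarray(view))
            # undo the transform so every map shares one orientation
            if flip:
                p = p[:, ::-1]
            probs += np.rot90(p, -k)
    return probs / 8.0
```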

4.2. Performance Evaluation for Imitating Human Moves

4.2.1. Top-k Accuracy

To evaluate how well a model can imitate human players, we report both top-one and top-five accuracy. For a given board position, a prediction is considered correct for top-one accuracy if the model’s highest-ranked move matches the human expert’s move. For top-five accuracy, it is considered correct if the human expert’s move is among the model’s top five predicted moves. Formally, the top-k accuracy is defined as follows:
$$\mathrm{Top\text{-}}k\ \mathrm{Acc.} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left(y_i \in \mathrm{Top\text{-}}k(\hat{x}_i)\right) \tag{5}$$
where the variables have the following meanings:
  • $N$ is the total number of instances in the dataset;
  • $y_i$ is the true label (human expert’s move) for the $i$-th instance;
  • $\hat{x}_i$ represents the predicted scores or probabilities for the $i$-th input;
  • $\mathrm{Top\text{-}}k(\hat{x}_i)$ is the set of $k$ labels with the highest predicted scores or probabilities for the $i$-th instance;
  • $\mathbb{I}(\cdot)$ is the indicator function, returning 1 if the condition inside is true and 0 otherwise.
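A sketch of this metric for a batch of predictions (probabilities of shape (N, 361), labels as flattened move indices):

```python
import numpy as np

def top_k_accuracy(probs: np.ndarray, labels: np.ndarray, k: int) -> float:
    """probs: (N, 361) predicted scores; labels: (N,) true move indices."""
    top_k = np.argsort(-probs, axis=1)[:, :k]     # k best moves per position
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())
```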
The results are summarized in Table 2, and test-time augmentation (TTA) leveraging the symmetric properties of Go was applied during evaluation. TTA is particularly effective for Go due to the game’s symmetry, allowing the model to evaluate augmented versions of each position without additional training. This method significantly improved performance with no extra computational cost during training.
Our evaluation shows that ViT models with at least eight encoder blocks outperform both CNN-based models in top-one accuracy after incorporating TTA, and the 12-block ViT also leads in top-five accuracy. The best-performing model, ViT/L = 12, achieved a top-one accuracy of 51.49%.
However, due to the computational cost, we limited our testing of ViT models to a maximum of 12 layers. As shown in Table 2, the results suggest that increasing the number of encoder blocks could further improve accuracy, indicating potential for even better performance with deeper models.
We compare our results with existing Go move prediction studies [19,20]. In [19], three different models were used: the Incep–Attention model, the Up–Down model, and an ensemble of the two models. Their top-one accuracy scores were 0.4074, 0.4420, and 0.4686, respectively. In [20], a mobile network was tested for Go move prediction, achieving a top-one accuracy of 0.4431. Compared to these studies, our approach demonstrates superior performance in top-one accuracy. However, it is worth noting that ViT, while achieving higher accuracy, has a larger number of parameters, potentially leading to higher computational requirements and an increased system load.

4.2.2. Extended Top-k Accuracy

In most studies, such as [2,12], only top-one accuracy is reported as a measure of a model’s ability to mimic human experts. However, we argue that relying solely on top-k accuracy is insufficient for evaluating model performance, particularly in the context of Go.
In the opening stage of a Go game, there are often multiple logically equivalent moves according to the rules of the game. However, the ground truth typically contains only one of these moves. For instance, as illustrated in Figure 9, moves A and B are equivalent in terms of their impact on the game. If the ground truth labels move A as correct and the model predicts move B, it would be unjust to consider the prediction a failure, since both moves achieve the same strategic outcome.
An even more striking example is depicted in Figure 10, where eight moves are all logically equivalent. In this scenario, the model has only a one-eighth chance of being marked correct by top-one accuracy, despite all eight moves being equally valid. This limitation highlights the need for more nuanced evaluation metrics or alternative approaches that account for the equivalence of moves in certain game scenarios.
More formally, the extended top-k accuracy can be expressed by
$$\mathrm{Extended\ Top\text{-}}k\ \mathrm{Acc.} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left(\exists\, y \in \mathrm{Equiv}(y_i) : y \in \mathrm{Top\text{-}}k(\hat{y}_i)\right) \tag{6}$$
where the variables have the following meanings:
  • $N$ is the total number of instances in the dataset;
  • $y_i$ is the true label for the $i$-th instance;
  • $\hat{y}_i$ represents the predicted scores or probabilities for the $i$-th input;
  • $\mathrm{Top\text{-}}k(\hat{y}_i)$ is the set of $k$ labels with the highest predicted scores or probabilities for the $i$-th instance;
  • $\mathrm{Equiv}(y_i)$ is the set of labels considered equivalent to the true label $y_i$;
  • $\mathbb{I}(\cdot)$ is the indicator function, which returns 1 if the condition inside is true and 0 otherwise.
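A sketch of the extended metric, with the equivalence sets supplied by the caller (how equivalent moves are enumerated, e.g., from board symmetries in the opening, is left abstract here):

```python
import numpy as np

def extended_top_k_accuracy(probs, labels, equiv_sets, k):
    """probs: (N, 361); labels: (N,); equiv_sets[i] is the set of move
    indices logically equivalent to labels[i] (including labels[i])."""
    top_k = np.argsort(-probs, axis=1)[:, :k]
    hits = [bool(equiv_sets[i] & set(int(m) for m in top_k[i]))
            for i in range(len(labels))]
    return float(np.mean(hits))
```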
The results regarding extended top-k accuracy are listed in Table 3. Performance evaluations based on top-k accuracy and extended top-k accuracy both showed that the ViT with eight encoder layers was sufficient for the task.

4.2.3. Inference Time Analysis on Different Devices

This section analyzes the model’s inference time across devices.
When learning to play Go, users may not always have access to high-performance computers. Many players might rely on lightweight devices to predict the next move. To evaluate the model’s feasibility across different systems, we tested it on three devices: the GTX 1650, the RTX 3090, and the RTX 4090.
The results are illustrated in Figure 11, using a test dataset containing 1,376,321 positions. The total inference times were 280.23 s for the RTX 4090, 354.90 s for the RTX 3090, and 1523.62 s for the GTX 1650. These times correspond to an average inference time per position of 0.00021 s, 0.00026 s, and 0.0012 s, respectively.
These findings indicate that the model is computationally efficient and can be utilized effectively on lightweight devices equipped with a GPU.

5. Conclusions and Future Work

5.1. Conclusions

Building a model that achieves professional-level gameplay and also plays like humans is highly desirable in the Go community. In this study, we showcased the capability of the Vision Transformer (ViT) in this field. The ViT achieved 49.73% top-one accuracy without exploiting the symmetry property of the dataset from the KGS Go server and 51.49% top-one accuracy when exploiting the symmetry property using 12 encoder layers. These results demonstrate that the ViT can mimic expert moves better than CNN-based models.
We also proposed the extended top-k accuracy for evaluating model performance. To the best of our knowledge, all other research on imitating human Go players ignores the issue brought by the symmetric property and equivalent moves. The extended top-k accuracy was designed to address this problem, and we believe it can better evaluate the capability of models in imitating human experts. ViT achieved 50.13% extended top-one accuracy without test-time augmentation and 51.87% extended top-one accuracy with test-time augmentation using 12 encoder layers.

5.2. Future Work

During our experiments, as shown in Table 2 and Table 3, we observed that adding more layers to the model provided room for improvement. Beyond this, exploring advanced Transformer-based architectures offers promising opportunities to enhance Go AI. For instance, the Swin Transformer, which has outperformed the ViT in computer vision tasks, could serve as a robust candidate for imitating human Go players. Its ability to capture long-range dependencies in game patterns suggests that it could potentially replace the backbone of win-oriented models such as AlphaGo, improving their ability to replicate human expertise more effectively.
In addition to the Swin Transformer, other Transformer-based models, such as the Data-efficient Image Transformer (DeiT) and Pyramid Vision Transformer (PVT), could also be integrated into Go AI frameworks to address specific challenges. The DeiT, with its knowledge distillation capabilities, is well suited for enhancing data efficiency and addressing data leakage issues, making it an ideal tool for mimicking rare playing styles and diversifying gameplay. Meanwhile, PVT, with its multi-scale feature extraction and Spatial Reduction Attention (SRA), excels in dense prediction tasks, making it a strong candidate for downstream applications requiring high-resolution processing, such as detailed Go strategy analysis.
By leveraging the unique strengths of these Transformer architectures, future research could optimize Go AI models for both high performance and human-like gameplay. This approach has the potential to revolutionize Go AI, providing more advanced tools for teaching, analyzing, and diversifying Go strategies.
Furthermore, our objective is to imitate the moves of professional Go players, using historical game data from a Go server as our dataset. In future research, we plan to pit our predictive models against professional Go players, further validating their effectiveness and refining their capabilities.

Author Contributions

Conceptualization, Y.-H.H., C.-C.K. and S.-M.Y.; methodology, Y.-H.H., C.-C.K. and S.-M.Y.; software, Y.-H.H. and C.-C.K.; validation, Y.-H.H., C.-C.K. and S.-M.Y.; formal analysis, Y.-H.H., C.-C.K. and S.-M.Y.; investigation, Y.-H.H., C.-C.K. and S.-M.Y.; resources, Y.-H.H., C.-C.K. and S.-M.Y.; data curation, Y.-H.H., C.-C.K. and S.-M.Y.; writing—original draft preparation, C.-C.K.; writing—review and editing, Y.-H.H.; visualization, C.-C.K.; supervision, S.-M.Y.; project administration, S.-M.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no funding.

Data Availability Statement

The dataset used in this paper is sourced from the KGS Go server, covering the period from 2004 to 2019. https://u-go.net/gamerecords/ (accessed on 25 February 2024). The code for our model has been uploaded to GitHub. Please refer to the following link: https://github.com/nctu-dcs-lab/ViT-Go (accessed on 25 February 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  2. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489. [Google Scholar] [CrossRef] [PubMed]
  3. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. Mastering the game of go without human knowledge. Nature 2017, 550, 354–359. [Google Scholar] [CrossRef] [PubMed]
  4. AlphaGo—The Movie|Full Award-Winning Documentary. Available online: https://www.youtube.com/watch?v=WXuK6gekU1Y (accessed on 1 July 2024).
  5. Patankar, S.; Usakoyal, C.; Patil, P.; Raut, K. A Survey of Deep Reinforcement Learning in Game Playing. In Proceedings of the 2024 MIT Art, Design and Technology School of Computing International Conference (MITADTSoCiCon), Pune, India, 25–27 April 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–5. [Google Scholar]
  6. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  7. Pellegrino, M.; Lombardo, G.; Adosoglou, G.; Cagnoni, S.; Pardalos, P.M.; Poggi, A. A Multi-Head LSTM Architecture for Bankruptcy Prediction with Time Series Accounting Data. Future Internet 2024, 16, 79. [Google Scholar] [CrossRef]
  8. Saleh, S.N.; Elagamy, M.N.; Saleh, Y.N.; Osman, R.A. An Explainable Deep Learning-Enhanced IoMT Model for Effective Monitoring and Reduction of Maternal Mortality Risks. Future Internet 2024, 16, 411. [Google Scholar] [CrossRef]
  9. Chen, S.-W.; Chen, J.-K.; Hsieh, Y.-H.; Chen, W.-H.; Liao, Y.-H.; Lin, Y.-C.; Chen, M.-C.; Tsai, C.-T.; Chai, J.-W.; Yuan, S.-M. Improving Patient Safety in the X-Ray Inspection Process with EfficientNet-Based Medical Assistance System. Healthcare 2023, 11, 2068. [Google Scholar] [CrossRef] [PubMed]
  10. Van Der Werf, E.; Uiterwijk, J.W.; Postma, E.; Van Den Herik, J. Local move prediction in Go. In Proceedings of the Computers and Games: Third International Conference, CG 2002, Edmonton, AB, Canada, 25–27 July 2002; Revised Papers 3. Springer: Berlin, Germany, 2003; pp. 393–412. [Google Scholar]
  11. Sutskever, I.; Nair, V. Mimicking go experts with convolutional neural networks. In Proceedings of the Artificial Neural Networks-ICANN 2008: 18th International Conference, Prague, Czech Republic, 3–6 September 2008; Proceedings, Part II 18. Springer: Berlin, Germany, 2008; pp. 101–110. [Google Scholar]
  12. Clark, C.; Storkey, A. Teaching deep convolutional neural networks to play go. arXiv 2014, arXiv:1412.3409. [Google Scholar]
  13. Maddison, C.J.; Huang, A.; Sutskever, I.; Silver, D. Move evaluation in Go using deep convolutional neural networks. arXiv 2014, arXiv:1412.6564. [Google Scholar]
  14. Vinyals, O.; Babuschkin, I.; Czarnecki, W.M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D.H.; Powell, R.; Ewalds, T.; Georgiev, P.; et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 2019, 575, 350–354. [Google Scholar] [CrossRef] [PubMed]
  15. KGS. KGS GO Server. Available online: https://u-go.net/gamerecords/ (accessed on 25 February 2024).
  16. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  17. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  18. Sagri, A.; Cazenave, T.; Arjonilla, J.; Saffidine, A. Vision Transformers for Computer Go. In Proceedings of the International Conference on the Applications of Evolutionary Computation (Part of EvoStar), Aberystwyth, UK, 3–5 March 2024; Springer: Berlin, Germany, 2024; pp. 376–388. [Google Scholar]
  19. Lin, Y.-C.; Huang, Y.-C. Streamlined Deep Learning Models for Move Prediction in Go-Game. Electronics 2024, 13, 3093. [Google Scholar] [CrossRef]
  20. Cazenave, T. Mobile networks for computer Go. IEEE Trans. Games 2020, 14, 76–84. [Google Scholar] [CrossRef]
Figure 1. Game 2 between AlphaGo and Lee Sedol. AlphaGo played move 37 (marked by the red square) and gained a significant advantage from this move.
Figure 2. A “mistake” made by AlphaGo. This is game 1 between AlphaGo and Lee Sedol. After Lee Sedol played move 161 (marked by the red square), AlphaGo answered with move A, which is a “clear” mistake from a human perspective. Move B is obviously a better move than A.
Figure 3. Model overview of the ViT.
Figure 4. The symmetry property of Go. By the rules of Go, the eight positions in this figure are considered equivalent.
Figure 5. Vision Transformer for Go. It should be noted that the images of Go are for illustrative purposes. The actual data are the features from Table 1. The illustration of the figure was inspired by [6].
Figure 6. The policy network’s architecture processes the raw board position through feature extraction, using the features specified in Table 1.
Figure 7. The 12-layer CNN for move prediction preprocesses the inputs by extracting the features listed in Table 1.
Figure 8. The process of test-time augmentation.
Figure 9. Black to play. Moves A and B are considered equivalent according to the rules.
Figure 10. Black to play. All positions marked with an X are considered equivalent according to the rules.
Figure 11. Time cost on different devices.
Table 1. The features extracted from raw positions.

| Feature | Shape | Description |
| --- | --- | --- |
| Black | (1, 19, 19) | The positions of black stones are marked with 1. The rest are marked with 0. |
| White | (1, 19, 19) | The positions of white stones are marked with 1. The rest are marked with 0. |
| Invalid | (1, 19, 19) | The positions of invalid moves are marked with 1. The rest are marked with 0. |
| Turn | (1, 19, 19) | The entire plane is marked with 1 if it is black’s turn to play or 0 if it is white’s turn to play. |
| Ones | (1, 19, 19) | An entire plane is marked with 1. |
| Empty | (1, 19, 19) | The positions that are not occupied by a stone are marked with 1. The rest are marked with 0. |
| Recent moves | (4, 19, 19) | The planes that store the most recent 4 moves. The i-th plane stores the i-th most recent move; the position of that move is marked with 1 and the rest with 0. |
Table 2. The performance of ViT and CNNs is compared. We denote ViT with different layers as ViT/L = k, where k is the number of encoder blocks in the ViT.

| Model | Symmetries | Top-1 Acc. | Top-5 Acc. |
| --- | --- | --- | --- |
| 12-layer CNN | 1 | 0.4911 | 0.8250 |
| 12-layer CNN | 8 | 0.4921 | 0.8259 |
| Policy network | 1 | 0.4924 | 0.8134 |
| Policy network | 8 | 0.5045 | 0.8209 |
| ViT/L = 4 | 1 | 0.4513 | 0.7674 |
| ViT/L = 4 | 8 | 0.4726 | 0.7973 |
| ViT/L = 8 | 1 | 0.4860 | 0.8031 |
| ViT/L = 8 | 8 | 0.5050 | 0.8214 |
| ViT/L = 12 | 1 | 0.4973 | 0.8154 |
| ViT/L = 12 | 8 | 0.5149 | 0.8312 |
Table 3. The extended top-k accuracy of the ViT and CNNs is compared.

| Model | Symmetries | Ext. Top-1 Acc. | Ext. Top-5 Acc. |
| --- | --- | --- | --- |
| 12-layer CNN | 1 | 0.4943 | 0.8274 |
| 12-layer CNN | 8 | 0.4962 | 0.8272 |
| Policy network | 1 | 0.4930 | 0.8144 |
| Policy network | 8 | 0.5091 | 0.8237 |
| ViT/L = 4 | 1 | 0.4533 | 0.7704 |
| ViT/L = 4 | 8 | 0.4751 | 0.7993 |
| ViT/L = 8 | 1 | 0.4892 | 0.8063 |
| ViT/L = 8 | 8 | 0.5106 | 0.8278 |
| ViT/L = 12 | 1 | 0.5013 | 0.8259 |
| ViT/L = 12 | 8 | 0.5187 | 0.8349 |