1. Introduction
Deception has long been a focal point in computer science, representing a significant hallmark of intelligence [1]. In fields such as multi-agent systems and robotics, path planning serves as a foundational element for achieving collaborative task objectives, and deception becomes pivotal in enabling agents to gain strategic advantages in games and adversarial settings. In such environments, deceptive planning empowers both human and AI agents to obscure their true intentions and manipulate the situational awareness of opponents. Beyond gaming contexts, deception is prevalent in diverse multi-agent applications, including negotiations [2,3] and fugitive pursuit scenarios [4]. It also allows intelligence operatives to obfuscate their true intentions and confound the awareness of adversaries, finding practical use in domains such as network security [5], robot soccer competitions [6], privacy preservation [7], and other real-world challenges. Deceptive path planning (DPP) has thus emerged as a pivotal task within this broader landscape.
Imagine a scenario in urban surveillance in which SWAT teams have located traces of terrorists on city streets. If the teams reveal their intentions and proceed directly to the terrorists’ hideout, they risk alarming them and allowing them to escape. The teams can therefore plan a decoy route, departing from a designated point and feigning a circuitous advance towards other locations to mislead the adversaries. This strategy aims to convince the terrorists that the SWAT teams’ objective is not their actual hiding place, so that the teams can ultimately reach and apprehend or eliminate them. The purpose of this approach is to deceive the terrorists’ surveillance systems and observers, thereby concealing the true objective of the SWAT teams to the fullest extent possible.
Consider a scenario in wildlife conservation where researchers select suitable areas within a vast protected region to reintroduce endangered species. Poachers monitor the movement of these animals for illegal hunting. Approaching the animals’ habitat directly could jeopardize their safety, whereas planning a circuitous route would deceive potential poachers, ensuring the safety of the animals while minimizing the risk of detection.
Previous research on DPP problems has raised several issues and opportunities for improvement. Firstly, while Masters et al. proposed three metrics (extent, magnitude, and density) for evaluating deceptive paths [8], each has its limitations. Extent and density merely assess whether grid points along a path are deceptive or not, a binary evaluation that does not quantify deception on a scale from 0 to 1. Although magnitude quantifies deception, it lacks normalization. DPP_Q [9], which incorporates the concept of “average deceptiveness”, only considers paths under a specific time constraint where all paths have equal costs, and fails to address how paths with different time costs should be compared. Therefore, unified metrics for comprehensively evaluating the performance of deceptive paths are lacking.
Secondly, DPP_Q tends to fall into local optima when the number of false goals increases. Our experiments indicate that agents trained using the DPP_Q method exhibit a tendency to initially move towards false targets from the starting point. While this approach may yield substantial rewards initially, the fixed time constraints of DPP_Q prevent agents from balancing deception in the subsequent planning phases.
Figure 1a illustrates this phenomenon using a heatmap based on a 49 × 49 grid map from the Moving-AI dataset, devoid of obstacles. The orange square (with a black border) represents the start point, the blue square (with a black border) is the real goal, and the teal squares (with black borders) are two false goals. The heatmap visualizes the posterior probability of the real goal at each grid point, ranging from 0 to 1, influencing the agent to bypass the lighter-colored points for greater deception (reward). When the agent begins exploring from the starting point, it tends to follow the direction indicated by the red arrow initially, which is detrimental to subsequent planning. Conversely, following the green arrow direction may not yield immediate rewards but offers greater potential rewards later. In Figure 1b, starting planning from the real goal allows the agent to avoid biased initial exploration tendencies influenced by the heatmap.
Thirdly, the initial exploration tendency exhibited by the agent in DPP_Q resembles that of πd1 [8]. πd1 guides the agent towards a false goal first before proceeding to the real goal, in contrast with the πd2–πd4 methods, which lack such a tendency. This distinction typically results in πd1 achieving greater deceptive effects than πd2–πd4 and explains why the paths generated by πd1 entail longer costs. Therefore, we propose that the πd2–πd4 methods could be further optimized based on the tendency of the πd1 method. One potential improvement involves adjusting the Last Deceptive Point (LDP), a concept detailed later.
This article has three main technical contributions. Firstly, we introduce the concept of Average Deception Degree (ADD), based on which we propose the Average Deception Intensity (ADI), analyze its rationality, and apply it to evaluate the subsequent methods. Secondly, for methods that do not consider time constraints, we propose the Endpoint πd1–πd4 methods, which significantly enhance the traditional πd1–πd4 methods. Thirdly, for DPP_Q, which considers time constraints, we propose Endpoint DPP_Q and LDP DPP_Q, both of which resolve the issue of suboptimal DPP solutions under specific distributions of goals.
The structure of this paper is as follows. In Section 2, we review the relevant literature on probabilistic goal recognition and DPP problems. Section 3 provides foundational definitions of these concepts. Section 4 briefly defines ADD and ADI, introduces the πd1–πd4 methods, and subsequently proposes the Endpoint πd1–πd4 methods; it then introduces DPP_Q and outlines the operational mechanisms of the new methods Endpoint DPP_Q and LDP DPP_Q. In Section 5, based on the ADI, we validate the feasibility of our methods through experiments. Lastly, we conclude with a summary and outlook for this paper.
3. Preliminaries
Before explaining the specific details of the Endpoint πd1–πd4 methods, Endpoint DPP_Q, and LDP DPP_Q, it is necessary to explain their basic techniques. In this section, we first present the definition of path-planning domains, followed by the definition of path-planning problems [8]. Next, we introduce the Precise Goal Recognition Model (PGRM) [9], which serves as the goal recognition model of the observer, and finally we define DPP problems [8].
Definition 1. A path planning domain is a triple D = ⟨N, E, c⟩:
N is a non-empty set of nodes (or locations);
E ⊆ N × N is a set of edges between nodes;
c: E → R+ returns the cost of traversing each edge.
A path π in a path planning domain is a sequence of nodes ⟨n0, n1, …, nk⟩, in which (ni, ni+1) ∈ E for each 0 ≤ i < k. The cost of π is the cost of traversing all edges in π, which is cost(π) = Σi=0..k−1 c(ni, ni+1), from the start node to the goal node.
A path planning problem in a path planning domain is the problem of finding a path from the start node to the goal node. Based on Definition 1, we define the path planning problem as follows.
Definition 2. A path planning problem is a tuple ⟨D, s, g⟩:
D is a path planning domain;
s ∈ N is the start node;
g ∈ N is the goal node.
The solution path for a path planning problem is a path π = ⟨n0, n1, …, nk⟩, in which n0 = s and nk = g. An optimal path is a solution path with the lowest cost among all solution paths. The optimal cost for two nodes is the cost of an optimal path between them, which is denoted by optc(n, n′). The A* algorithm [35], a well-known best-first search algorithm, is used by typical AI approaches to find the optimal path between two nodes.
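As a concrete illustration, the following minimal sketch computes optc(·, ·) on an 8-connected grid (straight moves cost 1, diagonal moves cost √2, matching Section 4.3.2). It uses Dijkstra’s algorithm rather than A* for brevity, and the grid representation and function names are illustrative assumptions rather than the paper’s implementation.

```python
import heapq
import math

# 8-connected grid moves: straight moves cost 1, diagonal moves cost sqrt(2).
MOVES = [(0, 1, 1.0), (0, -1, 1.0), (1, 0, 1.0), (-1, 0, 1.0),
         (1, 1, math.sqrt(2)), (1, -1, math.sqrt(2)),
         (-1, 1, math.sqrt(2)), (-1, -1, math.sqrt(2))]

def optc(start, goal, walls, width, height):
    """Optimal cost between two grid nodes (Dijkstra; A* with an admissible
    heuristic would return the same value)."""
    dist = {start: 0.0}
    frontier = [(0.0, start)]
    while frontier:
        d, (x, y) = heapq.heappop(frontier)
        if (x, y) == goal:
            return d
        if d > dist.get((x, y), float("inf")):
            continue  # stale queue entry
        for dx, dy, c in MOVES:
            nxt = (x + dx, y + dy)
            if not (0 <= nxt[0] < width and 0 <= nxt[1] < height) or nxt in walls:
                continue
            nd = d + c
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(frontier, (nd, nxt))
    return float("inf")  # goal unreachable

def path_cost(path):
    """Cost of a path as the sum of its edge costs (1 or sqrt(2) per step)."""
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))
```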
When the number of targets exceeds one, an observer assigns a probability value to each target. Assuming that the path planning domains are discrete, fully observable, and deterministic, we introduce the PGRM as follows:
Definition 3. A Precise Goal Recognition Model (PGRM) is a quadruple ⟨D, G, O, P⟩:
D is a path planning domain;
G = {gr} ∪ Gf is a set of goals, consisting of the real goal gr and a set of false goals Gf;
O = ⟨o1, o2, …, on⟩ is the observation sequence, representing the sequence of all grid points that the observed has passed through from the start node to the current node;
P is a conditional probability distribution across G, given the observation sequence O.
Based on Definition 3, the formula for calculating the cost-difference is:
costdif(g) = optc(s, g | O) − optc(s, g)    (1)
where costdif(g) represents the cost-difference for each g in G; optc(s, g | O) represents the optimal cost for the observed to reach g from s given the observation sequence O; and optc(s, g) represents the optimal cost for the observed to reach g from s without the need to satisfy the observation sequence O.
From Equation (1), the cost-difference for all goals in the set G can be computed. Subsequently, Equation (2) allows us to calculate their posterior probability distribution. The formula for computing P(g | O) for each goal is as follows:
P(g | O) = α · e^(−β·costdif(g)) / (1 + e^(−β·costdif(g)))    (2)
where α is a normalization factor satisfying Σg∈G P(g | O) = 1, and β is a positive constant. β is used to describe the degree to which the behavior of the observed tends toward rational or irrational, namely “soft rationality”; it indicates the sensitivity of the observer to whether the observed is rational. When the observed is fully rational, it will choose the least costly method (the optimal path) to reach its real goal. The larger β, the more the observer believes that the behavior of the observed is rational. When β → 0, P(g | O) → 1/|G|, which means that the observed is considered completely irrational, so that the posterior probabilities of all goals are equal.
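To make Equations (1) and (2) concrete, the sketch below computes the PGRM posterior for every goal. It assumes the Boltzmann-style likelihood reconstructed above, treats the observation sequence O as the sequence of grid nodes visited so far, and relies on an optc(a, b) helper (for example, a closure over the earlier grid sketch); all names are illustrative.

```python
import math

def pgrm_posterior(s, goals, O, optc, beta=1.0):
    """Posterior P(g | O) for every goal g, following Equations (1) and (2).

    optc(a, b) returns the optimal cost between two nodes; O is the sequence
    of nodes the observed has visited so far (possibly empty)."""
    last = O[-1] if O else s
    # Cost of the observed prefix (consecutive observed nodes).
    prefix = sum(optc(a, b) for a, b in zip([s] + list(O), list(O)))
    likelihood = {}
    for g in goals:
        # Equation (1): cost-difference between respecting O and ignoring it.
        costdif = (prefix + optc(last, g)) - optc(s, g)
        # Equation (2): soft-rationality likelihood before normalization.
        likelihood[g] = math.exp(-beta * costdif) / (1.0 + math.exp(-beta * costdif))
    alpha = 1.0 / sum(likelihood.values())  # normalization factor
    return {g: alpha * p for g, p in likelihood.items()}
```

As beta approaches 0 every likelihood tends to the same value, so the normalized posterior becomes uniform, matching the soft-rationality discussion above.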
Based on Definition 3, we define the deceptive path planning problem.
Definition 4. A deceptive path planning (DPP) problem is a quintuple ⟨D, s, gr, G, P⟩:
D is a discrete path planning domain;
s ∈ N is the start node;
gr ∈ N is the real goal node;
G = {gr} ∪ Gf is a set of possible goal nodes, in which gr is a single real goal and Gf is the set of false goals;
P is the posterior probability distribution over G based on the observation sequence O. Its calculation is determined by the goal recognition model of the observer.
Here, PGRM serves as the recognition model of the observer to calculate P(g | O). DPP presents a departure from conventional path planning. While typical path planning endeavors to find the most cost-effective route to a destination, in DPP the objective extends beyond mere navigation to the goal; it also includes minimizing the likelihood of an observer identifying the real goal among a set of possibilities.
4. Method
This section is divided into three parts. Firstly, we present the definitions of Average Deception Degree (ADD) and Average Deception Intensity (ADI). Secondly, we introduce the πd1–πd4 methods and then propose the Endpoint πd1–πd4 methods. Finally, we introduce DPP_Q [9] and propose Endpoint DPP_Q and LDP DPP_Q.
4.1. Average Deception Degree (ADD) and Average Deception Intensity (ADI)
Masters et al. stated that “The quality of the solution depends on the magnitude, density, and extent of the deception” [8], yet these three metrics do not account for the cost of the path. With a sufficient time budget, we should be able to plan paths that achieve better deception effects. We exclude cases where the number of grid points covered by a path is less than or equal to 2, meaning the distance between the start point and the real goal must be greater than one grid-point distance (either √2 or 1); otherwise, DPP problems would be meaningless. To comprehensively evaluate a deceptive path, we introduce two new evaluation metrics: Average Deception Degree (ADD) and Average Deception Intensity (ADI). Here, we define ADD as follows:
Definition 5. The average deception degree of a path π = ⟨n0, n1, …, nk⟩ is ADD(π) = (1/(k − 1)) Σi=1..k−1 (1 − P(gr | Oi)), such that:
π is a path (a solution to a DPP problem);
P(gr | Oi) is the posterior probability of the real goal at node ni calculated by PGRM, and 1 − P(gr | Oi) is the deception value of ni;
Oi = ⟨o1, …, oi⟩, for each node ni the observed has passed through, is the corresponding observation sequence.
Since the deception value is defined at each node, the quantifiable average deceptiveness of a path, referred to as ADD, is calculated as the sum of the deception values at each node divided by the total number of nodes (excluding the start and end nodes). When two paths consume the same amount of time, ADD serves as a direct measure to quantify the overall deceptive magnitude of a path. Based on Definition 5, we propose Definition 6:
Definition 6. The average deception intensity of a path π is ADI(π) = ADD(π)/c^λ, such that:
ADD(π) is the average deception degree of the path;
c is cost(π), the cost of the path;
λ represents the degree of importance placed on time. Typically, λ ranges from 0 to 1. λ = 0 indicates that the observed does not care about the time consumed by the path.
ADI measures the contribution of a path per unit of time to ADD. When path cost is not considered, a higher ADD indicates better deceptive effectiveness. When path cost is considered, the overall effectiveness of a path relates to the emphasis the observed places on time. ADI not only synthesizes the contribution of each node on a path to its ADD but also accounts for time constraints. It serves as a highly comprehensive indicator that reflects the overall deceptive effectiveness of a path. Overall, ADI is a reliable and direct standard for comparing the deceptive effectiveness of two paths.
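A minimal sketch of both metrics, assuming the reconstructed forms of Definitions 5 and 6 (deception value 1 − P(gr | Oi) at each interior node, and ADI = ADD / cost^λ); the function names are illustrative.

```python
def average_deception_degree(path, p_real):
    """ADD: mean deception value over the interior nodes of a path.

    p_real(i) returns the PGRM posterior of the real goal after observing
    path[:i + 1]; paths with two or fewer nodes are excluded (Section 4.1)."""
    interior = range(1, len(path) - 1)          # exclude start and end nodes
    return sum(1.0 - p_real(i) for i in interior) / len(interior)

def average_deception_intensity(add, cost, lam=1.0):
    """ADI: ADD discounted by path cost, with lam in [0, 1] weighting the
    importance of time (lam = 0 ignores cost, so ADI reduces to ADD)."""
    return add / (cost ** lam)
```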
4.2. Deceptive Path Planning (DPP) via πd1–πd4 and Endpoint πd1–πd4
Firstly, the πd1–πd4 methods, as detailed by Masters et al. [8], aim to maximize the Last Deceptive Point (LDP). The methods involve two main steps, as illustrated in the left half of Figure 2. Step A identifies gmin from the false goals, which is the most important false goal in the DPP problem, and Step B determines the LDP based on gmin. The paths planned by πd1–πd4 are illustrated by the red arrows in the right half of Figure 2. Based on these, we introduce the four methods as follows:
πd1 uses a “showing false” strategy where the observed navigates from the start point to gmin, and then continues to the real goal gr. The path generated by πd1 is a combination of two parts: an optimal path from the start point to gmin and an optimal path from gmin to gr. Similar to πd1, πd2 focuses on identifying the key node LDP on the optimal path from gmin to gr. Instead of navigating directly to gmin, it first heads towards the LDP, then to gr. The path generated by πd2 is a combination of two parts: an optimal path from the start point to the LDP and an optimal path from the LDP to gr. Based on πd2, πd3 also heads toward the LDP first, but it introduces a modified heuristic biased toward the false goal gmin to increase the deceptive potential of the path. The path generated by πd3 is a combination of two parts: a heuristically biased path from the start point to the LDP and an optimal path from the LDP to gr. πd4 utilizes a pre-computed probability “heatmap” to enhance path deception, significantly increasing the cost of the path to improve its deception. The path generated by πd4 is a combination of two parts: a heatmap-guided path from the start point to the LDP and an optimal path from the LDP to gr. In Figure 2, πd3 shows a clear tendency to approach the false goal gmin, indicating higher deception compared to πd2 with the same cost consumption. πd2 and πd3, although potentially less deceptive than πd4, significantly reduce path duration compared to πd4 in the same DPP problem.
To further introduce the Endpoint πd1–πd4 methods, as depicted in the right half of Figure 2, the process involves several steps. First, following Step A, once gmin is identified, Step C swaps the positions of “start” and “real goal”. Here, the original real goal becomes the new start point, and the original start point is designated as the new real goal. Simultaneously, all false goals except gmin are removed. This sets up a new DPP problem in which the original real goal is the starting point, the original start point is the new real goal, and gmin remains the sole false goal.
Building upon the new DPP problem, Step D identifies the new gmin (which, because the other false goals have been deleted, corresponds to gmin from the original path planning problem), followed by Step E, which finds the LDP for the new DPP problem. Subsequently, path planning proceeds based on πd1–πd4. This approach yields four deceptive paths from “(start)” to “(real goal)” in Figure 2 based on the new path planning problem. These paths are then reversed to obtain four paths for the original deceptive path planning problem, as indicated by the green arrows in the right half of Figure 2.
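The reversal idea behind the Endpoint variants can be summarized in a short sketch. Here plan_pi_d stands in for any of the base planners (πd1–πd4), select_gmin performs Step A, and the problem object with a replace(...) constructor is an illustrative assumption rather than the paper’s data structure.

```python
def endpoint_plan(problem, plan_pi_d, select_gmin):
    """Endpoint-style planning: solve a reversed DPP problem, then flip the path.

    plan_pi_d(problem)   -> node sequence planned by one of the base methods
    select_gmin(problem) -> the most important false goal gmin (Step A)
    problem              -> dataclass-like object with start, real_goal and
                            false_goals fields and a replace(...) constructor."""
    g_min = select_gmin(problem)                 # Step A
    reversed_problem = problem.replace(          # Step C: swap the endpoints and
        start=problem.real_goal,                 # keep gmin as the only false goal
        real_goal=problem.start,
        false_goals=(g_min,),
    )
    # Steps D and E (gmin and the LDP of the new problem) happen inside the
    # base planner exactly as in the original methods.
    path = plan_pi_d(reversed_problem)
    return list(reversed(path))                  # a path for the original problem
```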
To facilitate further understanding of the differences between πd1–πd4 and Endpoint πd1–πd4, we also present in Figure 3 the new DPP problem in which the additional false goals are not deleted. In contrast to the original DPP problem, in Figure 3, gmin after Step A differs from that of the original DPP problem. Consequently, the LDP also changes: from (6,5) in Figure 2 to (6,3) in Figure 3 (where the horizontal axis represents the x-coordinate and the vertical axis represents the y-coordinate, starting from 0). The four resulting paths generated in this setting are depicted in the right half of Figure 3, as indicated by red arrows.
4.3. Deceptive Path Planning (DPP) via DPP_Q, Endpoint DPP_Q and LDP DPP_Q
We will provide a unified introduction to DPP_Q, Endpoint DPP_Q, and LDP DPP_Q. The core of these three methods is based on reinforcement learning modeling using Count-Based Q-learning. Below, we first introduce the state space, action space, and reward function of the observed as follows:
4.3.1. The State Space
The state space of the observed includes its current coordinate (x, y), as well as the numbers of straight and diagonal movements it has made (denoted as ns and nd, respectively). The state space S is defined as a four-dimensional vector:
s = (x, y, ns, nd)
4.3.2. The Action Space
The observed can take actions of two types: straight movements (up, down, left, right) and diagonal movements (up–left, down–left, up–right, down–right). A straight movement takes time 1, while a diagonal movement takes √2. Specifically, the action space is an 8 × 4 matrix whose rows a0, a1, …, a7 correspond to the eight actions, each represented as a 4-dimensional vector matching the state space S. The first two dimensions of each action update the current coordinate (x, y) of the observed, while the last two dimensions update the counts of straight and diagonal movements it has made. Given the state sT, the next state sT+1 can be obtained by simple vector addition:
sT+1 = sT + ai,  i = 0, 1, 2, …, 7,
where i represents the index of the action taken by the observed.
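A concrete (illustrative) encoding of the state and action vectors is sketched below. The row ordering of the action matrix is an assumption, but each row follows the layout (Δx, Δy, straight-count increment, diagonal-count increment) described above, so the state update is plain vector addition.

```python
import numpy as np

# State s_T = (x, y, n_straight, n_diagonal).
state = np.array([2, 3, 0, 0])

# Action matrix (8 x 4): four straight moves, then four diagonal moves.
# The last two columns increment the straight / diagonal movement counters.
A = np.array([
    [ 0,  1, 1, 0], [ 0, -1, 1, 0], [-1,  0, 1, 0], [ 1,  0, 1, 0],   # straight
    [-1,  1, 0, 1], [-1, -1, 0, 1], [ 1,  1, 0, 1], [ 1, -1, 0, 1],   # diagonal
])

# Time cost of each action: 1 for straight moves, sqrt(2) for diagonal moves.
ACTION_TIME = np.where(A[:, 2] == 1, 1.0, np.sqrt(2))

def step(s_t, i):
    """Next state s_{T+1} = s_T + a_i for an action index i in 0..7."""
    return s_t + A[i]

print(step(state, 4))   # up-left from (2, 3): [1 4 0 1]
```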
4.3.3. The Reward Function
Chen et al. [9] proposed an Approximate Goal Recognition Model (AGRM) based on PGRM and conducted comparative experiments using two different reward mechanisms, demonstrating the reliability of AGRM in computing rewards for the observed. Unlike PGRM, which matches the full observation sequence O, the observer based on AGRM only matches the current node ncurrent of the observed. Specifically, the formula for calculating the cost-difference is:
costdif(g) = optc(s, ncurrent) + optc(ncurrent, g) − optc(s, g)
The formula for computing P(g | ncurrent) for each goal is as follows:
P(g | ncurrent) = α · e^(−β·costdif(g)) / (1 + e^(−β·costdif(g)))
The observed is trained with AGRM. While training with PGRM is also feasible, AGRM allows the fuzzy rewards of every grid point to be allocated in advance, thereby circumventing the time-consuming reward computation during subsequent reinforcement learning training. Specifically, given the current state sT of the observed, the formula for its basic reward is:
rd(sT) = 1 − P(gr | ncurrent)
where the first two dimensions of sT represent the current coordinate of the observed ncurrent, that is, ncurrent = (x, y).
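The practical advantage of AGRM is that the deceptiveness of every grid cell can be precomputed once before training. The sketch below builds such a reward heatmap under the same assumptions as the earlier sketches: an optc(a, b) helper, the reconstructed AGRM posterior, and deceptiveness defined as 1 − P(gr | ncurrent).

```python
import math

def agrm_posterior_real(s0, gr, goals, n_current, optc, beta=1.0):
    """P(gr | n_current) under AGRM: only the current node is matched."""
    likelihood = {}
    for g in goals:
        costdif = optc(s0, n_current) + optc(n_current, g) - optc(s0, g)
        likelihood[g] = math.exp(-beta * costdif) / (1.0 + math.exp(-beta * costdif))
    return likelihood[gr] / sum(likelihood.values())

def deceptiveness_heatmap(s0, gr, goals, free_cells, optc, beta=1.0):
    """Precompute the basic reward 1 - P(gr | n) for every free grid cell, so
    no goal-recognition computation is needed during Q-learning training."""
    return {n: 1.0 - agrm_posterior_real(s0, gr, goals, n, optc, beta)
            for n in free_cells}
```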
We employ the count-based method to encourage the observed to explore unknown states. The count-based method guides the observed to explore state–action pairs with higher uncertainty to confirm whether they yield high rewards. The uncertainty of the observed relative to its environment can be measured by β0/√(N(s, a)) [36], where β0 is a constant and N(s, a) represents the number of times the state–action pair (s, a) has been visited. Specifically, the additional reward used to train the observed is set as:
r+(s, a) = β0/√(N(s, a))
Intuitively, if the observed has visited a state–action pair (s, a) less frequently (i.e., N(s, a) is smaller), the corresponding additional reward will be larger, so it should be more inclined to visit this state–action pair to confirm whether it has a high reward. The rules for the observed to receive rewards are as follows:
If the remaining time is less than the shortest time the observed needs to reach the target, the observed cannot arrive at the target within the time constraint. This scenario is denoted as Condition A, with a reward of −9 given.
The observed successfully reaches the target, which is denoted as Condition B, with a reward of +100 given.
Otherwise, the observed receives the deceptiveness of the grid point it traverses, and, in addition, it receives the exploration reward to encourage it to explore unknown state–action pairs. The reward function is shown as follows:
r(sT, a) = −9 (Condition A); +100 (Condition B); (1 − P(gr | ncurrent)) + β0/√(N(sT, a)) (otherwise)
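Assembling the pieces, the sketch below shows the full reward signal as reconstructed above. The constants −9 and +100 come from the text; the exploration coefficient beta0 and the exact way the count-based bonus is added to the deceptiveness term are assumptions.

```python
import math

def reward(condition_a, condition_b, deceptiveness, visit_count, beta0=0.1):
    """Reward for one step of the observed.

    condition_a:   remaining time is too short to reach the target (penalty -9)
    condition_b:   the target has been reached (bonus +100)
    deceptiveness: precomputed 1 - P(gr | n_current) for the current cell
    visit_count:   N(s, a) for the chosen state-action pair (>= 1 after counting)"""
    if condition_a:
        return -9.0
    if condition_b:
        return 100.0
    exploration_bonus = beta0 / math.sqrt(visit_count)   # count-based bonus [36]
    return deceptiveness + exploration_bonus
```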
Below, we provide pseudocode for DPP_Q, Endpoint DPP_Q and LDP DPP_Q, using a 10 × 10 grid map as an example, accompanied by detailed explanations.
Algorithm 1 introduces the DPP problem and initializes it. It first sets up the learning rate, discount factor, and epsilon-greedy value, and initializes the matrices D and Ds. D is a 10 × 10 matrix storing the shortest path lengths (equivalent to the optimal time cost) from each grid point to the real goal gr. Ds is a 10 × 10 matrix storing the optimal time cost from each grid point to the start point s0. Furthermore, Algorithm 1 reads in the various elements of the DPP problem: the time constraint, obstacle coordinates, starting point coordinates, real goal coordinates, and coordinates of all fake goal points (line 1). Then, it initializes the Q-table and the counting matrix N-table, setting the Q-values of illegal actions to −1000 (lines 2 to 3).
Algorithm 1 DPP via Q-learning (Initialization and Problem Setup)
1: Read the DPP problem: time constraint TC, set of obstacles Wall, start node s0, real goal gr, and false goals Gf; set the learning rate, discount factor, and epsilon-greedy value; initialize the distance matrices D and Ds
2: Initialize the Q-table and the count matrix N-table
3: Set the Q-values of illegal actions (those leading into Wall or out of the map) to −1000
Algorithm 2 describes the training process of the observed. First, the observed checks the legality of the eight actions (lines 1 to 5). Then, the observed selects an action according to the epsilon-greedy policy, and the chosen state–action pair is counted (lines 6 to 9). Subsequently, the observed receives a reward and updates the Q-table and the state (lines 10 to 13).
Algorithm 2 DPP via Q-learning (Training Framework)
1: for each action ai, i = 0, 1, …, 7 do
2:   if sT + ai leads into Wall or out of the map then
3:     mark ai as illegal
4:   end if
5: end for
6: with probability epsilon, select a random legal action a
7: otherwise, select the legal action a maximizing Q(sT, a)
8: N(sT, a) ← N(sT, a) + 1
9: compute the count-based bonus r+(sT, a) = β0/√(N(sT, a))
10: receive the reward r(sT, a) according to the reward function
11: sT+1 ← sT + a
12: Q(sT, a) ← Q(sT, a) + learning_rate · [r(sT, a) + discount · maxa′ Q(sT+1, a′) − Q(sT, a)]
13: sT ← sT+1
Algorithm 3 describes the DPP_Q method. It first initializes the problem according to Algorithm 1. Then, it sets the initial start point of the observed as s0 and initializes the values of the next two dimensions to 0 (line 2). Each training round includes checking the termination condition: whether the remaining time is less than the shortest time the observed needs to reach the target (the real goal gr), indicating that the observed cannot reach the real goal within the time constraint TC, or whether the current coordinate of the observed coincides with the coordinate of the target (the real goal gr), i.e., scoord = gr (line 5). Finally, the optimal path can be collected using the method of tracking the maximum value of the Q-table (line 10).
Algorithm 3 DPP_Q
1: Initialize the DPP problem according to Algorithm 1
2: s ← (s0.x, s0.y, 0, 0)
3: for each training episode do
4:   while True do take one training step according to Algorithm 2
5:     if the remaining time is less than the shortest time needed to reach gr, or scoord = gr then
6:       break
7:     end if
8:   end while
9: end for
10: Collect the optimal path by tracking the maximum values of the Q-table
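Line 10 of Algorithm 3 (“tracking the maximum value of the Q-table”) can be sketched as follows. The state layout and action matrix follow the earlier action-space sketch, the Q-table is assumed to map state tuples to arrays of eight Q-values, and the termination handling is simplified.

```python
import numpy as np

def extract_greedy_path(Q, start_state, goal_coord, A, max_steps=500):
    """Follow argmax Q from the start state until the goal coordinate is reached."""
    state = np.asarray(start_state)
    path = [tuple(int(v) for v in state[:2])]
    for _ in range(max_steps):
        if path[-1] == tuple(goal_coord):
            break
        key = tuple(int(v) for v in state)        # (x, y, n_straight, n_diagonal)
        best = int(np.argmax(Q[key]))             # greedy action; illegal ones hold -1000
        state = state + A[best]
        path.append(tuple(int(v) for v in state[:2]))
    return path
```

For Endpoint DPP_Q, the same routine is run from gr towards s0 and the collected path is then reversed.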
Algorithm 4 describes the Endpoint DPP_Q method. It first initializes the problem according to Algorithm 1. Then, it sets the initial exploration start point of the observed as the real goal gr, and initializes the values of the next two dimensions to 0 (line 2). Each training round includes checking the termination condition: whether the remaining time is less than the shortest time the observed needs to reach the start node s0, indicating that the observed cannot reach s0 within the time constraint TC, or whether the current coordinate of the observed coincides with the coordinate of the target (the start node s0), i.e., scoord = s0 (line 5). After collecting the path (line 10), performing a reversal operation yields a solution for the DPP problem (line 11).
Algorithm 4 Endpoint DPP_Q
1: Initialize the DPP problem according to Algorithm 1
2: s ← (gr.x, gr.y, 0, 0)
3: for each training episode do
4:   while True do take one training step according to Algorithm 2
5:     if the remaining time is less than the shortest time needed to reach s0, or scoord = s0 then
6:       break
7:     end if
8:   end while
9: end for
10: Collect the path from gr to s0 by tracking the maximum values of the Q-table
11: return reverse(path)
Algorithm 5 describes the LDP DPP_Q method. It begins by initializing the problem according to Algorithm 1. Then, it computes gmin to determine the coordinates of the LDP, and calculates the shortest cost distance from the LDP to the false goal gmin (line 1). The initial start point of the observed is set as the LDP, with its values for the next two dimensions initialized to 0 (line 2). Subsequently, the algorithm divides into two steps that can be executed in parallel or sequentially: one agent starts from the LDP to explore paths that can reach the real goal gr within its time budget, while another agent starts from the LDP to explore paths that can reach the start node s0 within its time budget. After collecting the paths, performing a reversal and concatenation operation yields a solution for the DPP problem starting from s0 and ending at gr (line 21).
Algorithm 5 LDP DPP_Q
1: Initialize the DPP problem according to Algorithm 1; compute gmin, determine the LDP, and compute the shortest cost from the LDP to gmin
2: s ← (LDP.x, LDP.y, 0, 0)
3: for each training episode do
4:   while True do
5:     take one training step according to Algorithm 2
6:     if the remaining time of this segment is insufficient to reach gr, or scoord = gr then
7:       break
8:     end if
9:   end while
10: end for
11: Collect path1 from the LDP to gr by tracking the maximum values of the Q-table
12: Re-initialize the Q-table and N-table; s ← (LDP.x, LDP.y, 0, 0)
13: for each training episode do
14:   while True do
15:     take one training step according to Algorithm 2
16:     if the remaining time of this segment is insufficient to reach s0, or scoord = s0 then
17:       break
18:     end if
19:   end while
20: end for
21: Collect path2 from the LDP to s0; return reverse(path2) + path1
6. Discussion and Conclusions
In this paper, we introduce two metrics for measuring the solutions of DPP problems: ADD and ADI. Employing a reverse-thinking approach, we propose three new methods (Endpoint πd1–πd4, Endpoint DPP_Q and LDP DPP_Q) that demonstrate significant improvements compared to the original methods (πd1–πd4 and DPP_Q).
Firstly, we define the concept of ADD. Based on ADD, we further develop the notion of ADI and analyze its validity, applying it to evaluate the subsequent methods. Secondly, we enhance the πd1–πd4 methods, which do not consider time constraints, by introducing Endpoint πd1–πd4, which shows substantial improvements over the traditional methods. Experimental results indicate a negligible change for one of the four variants (−0.03%) and significant average improvements of 16.83%, 10.47%, and 5.03% for the other three; overall, Endpoint πd1–πd4 outperforms πd1–πd4 by 8.07%. Finally, addressing methods under time constraints like DPP_Q, we propose Endpoint DPP_Q and LDP DPP_Q. Both methods effectively address the issue of poor path deception in DPP_Q when the real and false goals have specific distributions. Moreover, Endpoint DPP_Q and LDP DPP_Q demonstrate even more significant advantages in overcoming local-optima issues on large maps compared to DPP_Q, resulting in substantial improvements.
For future work, we recognize that the training times of methods such as DPP_Q, Endpoint DPP_Q, and LDP DPP_Q are still lengthy. Inspired by the principles of LDP DPP_Q, we suggest segmenting paths by incorporating more navigation points similar to the LDP; segmenting agent training could accelerate learning while achieving comparable results. Additionally, modeling the goal recognition method of the observer remains challenging. PGRM serves as an assumed model that is potentially inaccurate or flawed. Future research should introduce adversarial elements and more complex opponent goal recognition models. For example, the observed may quickly perceive the recognition results or actions of the observer and adjust its planning accordingly to enhance deception strategies. Furthermore, expanding DPP problems to higher dimensions, such as three-dimensional space for drone flight or dynamic environments, promises practical advancements. Incorporating additional elements such as images or audio captured by observers, rather than relying solely on coordinates, could make the game more intricate. These directions hold promise for future research, enhancing practical applicability significantly.