1. Introduction
With the continuous development of electronic warfare, electronic jamming technology used to cover target assaults poses severe challenges to the defender and seriously weakens radar detection capability. Electronic jamming is dynamic, diverse, and complex, which demands ever stronger anti-jamming capability from radars. Radars today possess many mature anti-jamming techniques, but commanders are more concerned with how to plan the best anti-jamming measures once the radar is jammed. With the emergence of cognitive radar, intelligent decision-making in place of subjective experience has gradually become the mainstream trend [1]. The core of radar intelligent anti-jamming technology is to automatically identify the jamming type and adaptively take effective anti-jamming measures to complete the confrontation [2]. A cognitive radar actively perceives the characteristics of jamming electromagnetic signals in the environment, classifies and identifies the jamming signals, and then selects the corresponding optimal anti-jamming countermeasures. Cognitive radars thus have the ability to adaptively schedule anti-jamming measures [3].
The electronic countermeasures between jammers and radars resemble a game of chess [4]: according to the current situation on the board, the two sides place their pieces in ever-changing ways. The game between jammers and radars involves many uncertain factors, mainly the following. First, the reliability of radar jamming identification: the jamming type may be identified incorrectly. Second, the mutual restraint relationship between radar anti-jamming measures and jamming types is not a one-to-one mapping: multiple anti-jamming actions can counter the same jamming pattern, and one anti-jamming measure can weaken multiple jamming patterns. Therefore, in a complex electromagnetic environment, the choice of anti-jamming action involves great uncertainty.
The decision-making principle for anti-jamming actions is that the radar takes the best anti-jamming measure to weaken the effect of a given jamming pattern, thereby improving its detection ability. Since the jammer's signal processor can capture the characteristics of the signal emitted by the radar and adjust the jamming pattern in time according to the radar's anti-jamming response, the radar and the jammer are engaged in a dynamic game [5]. We regard the choice of radar and jammer actions as an uncertain dynamic decision-making problem whose decision framework is stochastic dynamic programming [6]. The process has the Markov property, because the conditional probability distribution of its future state depends only on the current state. A Markov decision process is the optimal decision process of a stochastic dynamic system based on Markov process theory [7]. Through the study of the state space, the action set, and the state transition probabilities, the future state and evolution of the system can be predicted to some extent. In this paper, the goal is to build a sequential decision model: such problems can be modeled as Markov decision processes (MDPs) if the state is observable and only its transitions are random, whereas if the state is not fully observable, the problem can be modeled as a partially observable Markov decision process (POMDP) [8]. In our setting, the state relevant to the choice of radar anti-jamming action is not completely observable. Therefore, the POMDP is a suitable framework for our research.
The rest of the paper is organized as follows: related radar anti-jamming decision work is introduced in Section 2. The anti-jamming system design and problem formulation are given in Section 3. The proposed POMDP model for intelligent decision-making of radar anti-jamming actions is described in Section 4 and Section 5. Simulation results and analysis are given in Section 6. The conclusion is discussed in Section 7.
2. Related Work
Jiang et al. [3] considered the radar resource management problem for a cognitive radar tracking multiple moving targets in the presence of jammers. The problem was modeled as a hybrid POMDP-based game to exploit the statistical characteristics of the moving targets and the competition between the jammers and the radar, and a low-complexity gradient optimization algorithm was proposed to find the optimal anti-jamming policy of the radar. To separate the hidden real targets and establish their accurate trajectories in the presence of deception jammers, Jiang et al. [6] further extended the existing non-anti-jamming tracking model to an anti-jamming POMDP-based game tracking model; the optimal anti-jamming resource management policy enables the limited resources to be utilized effectively, thus guaranteeing the anti-jamming performance. Both works apply the POMDP approach to the radar anti-jamming problem, but they study the optimal policy for radar anti-jamming resource management at the signal processing level, with the goal of improving radar tracking performance in the presence of jammers. Our work, in contrast, designs a sequential decision model for radar anti-jamming: during the dynamic confrontation between radar and jammer, the radar selects the best anti-jamming measure for each jamming pattern to improve its anti-jamming performance.
Joseph Mitola et al. proposed cognitive radio theory in 1999, which pioneered research on cognitive radar theory [9]. Chen Wei et al. proposed a combat mode of intelligent networking and coordinated attack, using artificial intelligence methods to enhance the anti-jamming functions of radars and seekers and improve their level of intelligence [10]. Liu Meng et al. proposed an intelligent anti-jamming system based on a neural network algorithm, which mainly completed waveform design using discontinuous orthogonal frequency division multiplexing, a transform domain, and spread spectrum [11]. Wu Xianmeng et al. studied jamming perception and performance evaluation for low-altitude radars and gave a technical solution for intelligent anti-jamming [12]. Wang Xin et al. proposed a radar anti-jamming system architecture based on jamming cognition, in which game theory was applied to autonomously select anti-jamming measures through the reconnaissance and identification of jamming [13]. He Bin [14] modeled and analyzed the game process of cognitive radar anti-jamming and, based on the principle of radar active jamming guidance, studied anti-jamming countermeasures for radar radio-frequency shielding. Wang Hao et al. [15] from Hohai University used Q-learning and Sarsa methods to enable the radar to update and optimize its anti-jamming policy autonomously. Yuan Quan [16] from Harbin Institute of Technology used the deep Q-network (DQN) algorithm to design an intelligent radar anti-jamming network: after intelligently identifying the jamming type, the DQN decision algorithm selects the best anti-jamming measure. Serkan Ak and Stefan Bruggenwirth investigated the anti-jamming performance of a cognitive radar under a POMDP model; they introduced the probability of being jammed as a performance metric beyond the conventional signal-to-noise ratio (SNR) and analyzed it with deep Q-network (DQN) and long short-term memory (LSTM) networks under various uncertainty values [17].
3. System Model and Problem Formulation
In this section, we present the system model of the radar anti-jamming system. Then, the sequential decision model of the anti-jamming policy is formulated based on the POMDP.
3.1. System Structure Design
We designed a cognitive radar anti-jamming intelligent decision-making system. The POMDP model is used to describe the random behavior of adaptively selecting anti-jamming actions when the radar is jammed, so as to achieve intelligent confrontation. The uncertain factors can be described probabilistically within the POMDP model. The advantage of the model is that it quantitatively describes the radar's ability to identify the working state of the jammer, and thereby supports the radar's autonomous anti-jamming decision-making.
Figure 1 depicts the structure of the radar intelligent anti-jamming system.
(1) This system realizes the adaptive closed loop of transmitter→antenna transmission→space (channel)→antenna reception→receiver→anti-jamming decision→transmitter.
(2) This system has a dynamic environment database containing information about the environment and targets of interest, which comes from external jamming sources; the information in the database is dynamically and continuously updated.
(3) The adaptive anti-jamming decision-making unit has the function of knowledge-assisted processing.
3.2. POMDP Preliminaries
A Markov decision process is a special kind of sequential decision problem, characterized by the fact that the set of admissible actions, the obtained rewards, and the transition probabilities depend only on the current state and the selected action, and not on the past history; this is the Markov property. A POMDP is a generalization of the Markov decision process [18]. It describes a decision maker's interaction with a stochastic system whose current state is not directly observable. The main parts of a POMDP are similar to a general sequential decision model: it has a decision cycle, states, actions, observations, transition probabilities, and rewards. The model is described by the following elements.
$S$, the set of system states.
$A$, the set of actions.
$O$, the set of observations.
The observation model, denoted by $P(o \mid s', a)$, i.e., the probability that observation $o$ is made given that the state is $s'$ and action $a$ was taken.
The underlying Markov chain that models the transitions of the system state, denoted by $P(s' \mid s, a)$, the probability that the system transitions from state $s$ to state $s'$ after taking action $a$.
The reward function $R(s, a)$, which defines a real-valued reward when the agent takes action $a$ in state $s$.
As the system state is not exactly observable, the knowledge about the system is represented by the belief state $b$. This is a probability distribution over the state space based on the internal dynamics of the system, the initial belief $b_0$ that the system starts with, the actions taken, and the observations made. By incorporating the information from the action $a$ taken and the observation $o$ received, the updated belief $b'$ at time $t+1$ is computed with a Bayesian update [19]:

$$b'(s') = \frac{P(o \mid s', a)\sum_{s \in S} P(s' \mid s, a)\, b(s)}{\sum_{s'' \in S} P(o \mid s'', a)\sum_{s \in S} P(s'' \mid s, a)\, b(s)}. \quad (1)$$
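For illustration, a minimal NumPy sketch of the Bayesian update in Equation (1) is given below. The array names and shapes (`P_trans[a, s, s'] = P(s'|s, a)` and, for simplicity, an observation model `P_obs[s', o] = P(o|s')` that depends only on the jamming state, as in Table 3) are assumptions made for this sketch, not the paper's implementation; the same conventions are reused in the later sketches.

```python
import numpy as np

def belief_update(b, a, o, P_trans, P_obs):
    """Bayesian belief update of Equation (1).

    b       : (|S|,)         current belief over jamming states
    a, o    : int            index of the action taken / observation received
    P_trans : (|A|,|S|,|S|)  P_trans[a, s, s'] = P(s' | s, a)
    P_obs   : (|S|,|O|)      P_obs[s', o]      = P(o | s')
    """
    predicted = b @ P_trans[a]                # prediction: sum_s P(s'|s,a) b(s)
    unnormalized = P_obs[:, o] * predicted    # correction by the observation likelihood
    return unnormalized / unnormalized.sum()  # normalize to a probability distribution
```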
The combination of an action and an observation induces an immediate reward, depending on the current state, and a future reward, depending on the next state. The value function describes the relation between the immediate reward, the future reward, and the belief state. Owing to the introduction of the belief state space, the POMDP problem can be transformed into a Markov decision process over belief states, and the optimal value function becomes the maximum expected discounted reward at each belief point. Equation (2) is the Bellman optimality equation of the POMDP model in the belief state reformulation:

$$V^*(b) = \max_{a \in A}\left[\sum_{s \in S} b(s)\, R(s,a) + \gamma \sum_{o \in O} P(o \mid b, a)\, V^*(b')\right]. \quad (2)$$

Here, $O$ is the set of all possible observations that the agent can receive from the environment, $b'$ is the belief state of the next time step, $a \in A$, and $o \in O$. $P(o \mid b, a)$ indicates the probability that the agent receives observation $o$ after taking action $a$ in belief state $b$. $\gamma \in [0, 1)$ is the discount factor. In the decision-making process, the discount factor makes the optimization objective converge to a finite value. In general, the smaller the discount factor, the greater the weight of short-term rewards, which makes the agent pay more attention to short-term rewards when making decisions.
A POMDP policy prescribes the action to take at each belief. An optimal policy is one that maximizes the value function; it is described by the optimal value function, which gives the maximum value for each belief. The optimal policy can be defined by Equation (3):

$$\pi^*(b) = \arg\max_{a \in A}\left[\sum_{s \in S} b(s)\, R(s,a) + \gamma \sum_{o \in O} P(o \mid b, a)\, V^*(b')\right], \quad (3)$$

where $b'$ represents the belief state of the next time step. An optimal policy is computationally intractable because it is defined on a continuous belief space. Fortunately, in the finite-horizon case, the optimal value function is proven to be piecewise linear and convex (PWLC) [20]. A practical approach is therefore to compute an approximation of the optimal value function over a subset of the belief space. The value function $V(b)$ is the combination of piecewise linear value functions over the belief space; these piecewise linear functions represent the segments or hyperplanes with the largest value on the corresponding partitions of the belief state space. The coefficient vectors of these hyperplanes are called $\alpha$-vectors, and different $\alpha$-vectors correspond to different policies. The value at any belief state can be obtained by substituting the belief state into the hyperplane equations. The optimal value function can thus be written as Equation (4):

$$V^*(b) = \max_{\alpha \in \Gamma}\; \sum_{s \in S} \alpha(s)\, b(s). \quad (4)$$
It is noteworthy that Equation (4) does not hold in general for infinite-horizon POMDPs, as the finiteness of the set of $\alpha$-vectors does not generally extend from the finite-horizon case to the infinite-horizon case [21]. In the infinite-horizon case, the optimal value function can be approximated using a finite set of vectors. For a finite set $\Gamma$ of such $\alpha$-vectors, each $\alpha$-vector has a corresponding action, and the best action under a given belief state can be determined by an inner product operation. Such POMDPs can be solved with approximation algorithms; an excellent example is the Successive Approximations of the Reachable Space under Optimal Policies (SARSOP) algorithm.
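As a small illustration of how an $\alpha$-vector policy is executed, the following sketch (using the same array conventions as the earlier belief-update example) returns the action attached to the $\alpha$-vector that maximizes the inner product with the current belief; it is a generic PWLC lookup, not the authors' code.

```python
import numpy as np

def select_action(b, alpha_vectors, alpha_actions):
    """Action choice over a PWLC value function (Equation (4)).

    b             : (|S|,)   belief state
    alpha_vectors : (N,|S|)  each row is one alpha-vector
    alpha_actions : (N,)     action index associated with each alpha-vector
    """
    values = alpha_vectors @ b        # inner product of every alpha-vector with b
    best = int(np.argmax(values))     # index of the dominating hyperplane at b
    return alpha_actions[best], values[best]
```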
3.3. Point-Based Value Iteration Approximation with the SARSOP Algorithm
Due to the limitation of computational complexity, solving POMDP problems exactly has always been difficult and has restricted the development and application of POMDP theory, so many efficient approximate POMDP solvers have been developed. H. Kurniawati et al. [22] proposed the Successive Approximations of the Reachable Space under Optimal Policies (SARSOP) algorithm on the basis of previous research. The SARSOP algorithm exploits the concept of the optimally reachable belief space $\mathcal{R}^*(b_0)$ to improve the computational efficiency of POMDP planning; $\mathcal{R}^*(b_0)$ is defined as the set of belief states that can be reached from the initial belief $b_0$ by following an optimal policy. The belief space can be organized as a belief tree $T_R$ with the initial belief state $b_0$ as the root node; each node represents a reachable belief state, and all nodes in $T_R$ form the complete belief space. The core of the SARSOP algorithm is the idea of bounding: heuristic information and online learning techniques make as many sampled belief states as possible belong to the belief space $\mathcal{R}^*(b_0)$. The algorithm mainly includes three steps: sampling, backup, and pruning. In order to reduce the amount of computation, the SARSOP algorithm updates the value function only on some belief states, so the belief states are first sampled according to the belief tree $T_R$ and the initial $\alpha$-vector set.

The purpose of performing the backup operation on a belief state $b$ selected from $T_R$ is to collect the information of its child nodes and feed it back to $b$. The backup operation takes a specific belief point $b$ and the value function of step $t$ as input, and outputs the optimal vector of point $b$ in the value function of step $t+1$. For all $a \in A$ and $o \in O$, the next belief state $b'_{a,o}$ can be obtained according to Equation (1), and its corresponding optimal $\alpha$-vector is obtained from Equation (5):

$$\alpha_{a,o} = \arg\max_{\alpha \in \Gamma}\; \alpha \cdot b'_{a,o}. \quad (5)$$

The value function vector for each $a \in A$, aggregating over all $o \in O$, is calculated by Equation (6):

$$\alpha_a(s) = R(s,a) + \gamma \sum_{o \in O}\sum_{s' \in S} P(o \mid s', a)\, P(s' \mid s, a)\, \alpha_{a,o}(s'). \quad (6)$$

The action $a'$ corresponding to the optimal vector $\alpha'$ is determined by Equation (7):

$$a' = \arg\max_{a \in A}\; \alpha_a \cdot b, \qquad \alpha' = \alpha_{a'}. \quad (7)$$

The optimal vector $\alpha'$ is then added to the vector set $\Gamma$. For pruning, the SARSOP algorithm adopts the $\delta$-neighborhood method [20] to cut the branches and vectors corresponding to suboptimal actions; furthermore, the algorithm can further reduce the number of vectors in $\Gamma$ by concentrating on the sampled belief states and their neighborhoods.
SARSOP also introduces a gap termination criterion: an expected gap $\epsilon$ is set between the upper bound and the lower bound of the value function at the root of $T_R$. By selecting at each node the action and observation with the maximum upper bound, a path that minimizes the root gap is obtained. When the root gap is less than $\epsilon$ or the time limit is reached, sampling is terminated. The pseudo code of the SARSOP algorithm is as follows (Algorithm 1).
Algorithm 1 SARSOP algorithm
1: Initialize the vector set $\Gamma$; initialize the lower bound and upper bound of the value function.
2: Add the initial belief point $b_0$ as the root node of the tree $T_R$.
3: Repeat
4:   SAMPLE($T_R$, $\Gamma$).
5:   Select child nodes $b$ from $T_R$; for each selected node $b$, BACKUP($T_R$, $\Gamma$, $b$).
6:   PRUNE($T_R$, $\Gamma$).
7: Until the root gap is less than $\epsilon$ or the time limit is reached.
8: Return $\Gamma$.
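To make the backup step concrete, the sketch below performs one point-based backup corresponding to Equations (5)-(7), reusing the array conventions of the earlier sketches; it illustrates the operation only and omits SARSOP's sampling, bound maintenance, and pruning.

```python
import numpy as np

def backup(b, Gamma, P_trans, P_obs, R, gamma):
    """One point-based backup at belief b (Equations (5)-(7)).

    Gamma : (N,|S|)   current set of alpha-vectors
    R     : (|S|,|A|) immediate reward R(s, a)
    Returns the new alpha-vector and its associated action.
    """
    n_S, n_A = R.shape
    n_O = P_obs.shape[1]
    alpha_a = np.zeros((n_A, n_S))
    for a in range(n_A):
        alpha_a[a] = R[:, a].copy()
        for o in range(n_O):
            # Equation (5): alpha-vector that dominates at the successor belief.
            b_next = P_obs[:, o] * (b @ P_trans[a])
            if b_next.sum() > 0:
                b_next = b_next / b_next.sum()
            best = Gamma[np.argmax(Gamma @ b_next)]
            # Equation (6): back-propagate that vector through the model.
            alpha_a[a] += gamma * (P_trans[a] @ (P_obs[:, o] * best))
    a_star = int(np.argmax(alpha_a @ b))   # Equation (7)
    return alpha_a[a_star], a_star
```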
4. Radar Anti-Jamming Action Decision Model
4.1. Principles of Radar–Environment Interaction
Anti-jamming action scheduling is based on identifying and classifying jamming and building a database, adaptively selecting anti-jamming actions according to the jamming type, and optimizing the anti-jamming actions according to the anti-jamming performance. The same jamming type can be countered by multiple anti-jamming means, and the same anti-jamming countermeasure can also act against multiple types of jamming. In different jamming environments, different anti-jamming means have different effects. In addition, the jamming a radar faces on the actual battlefield is complex and changeable, which makes the selection of anti-jamming actions extremely complicated.
The radar system contains a dynamic knowledge base for storing, scheduling, and updating various types of prior information. Through continuous optimization of the dynamic knowledge base, anti-jamming action sets that can deal with a variety of complex scenarios can be found. The cognitive radar anti-jamming system selects anti-jamming actions mainly based on the idea of game theory. The radar system has complete judgment criteria for whether the anti-jamming actions available when jamming is encountered meet the requirements. The anti-jamming effect must be evaluated, actions with better effects adopted, and the evaluation results fed back to support the selection of anti-jamming actions, forming a closed loop.
A knowledge base and rules need to be preset to describe the tasks and task requirements for a given period of time. The system provides a set of optimal policies calculated from the action set and the state set, and the radar chooses the best action based on its observations. The POMDP model of radar anti-jamming ensures that the radar's action at the next moment is related only to the current jamming signal type.
Figure 2 shows the model structure and flow of POMDP for radar anti-jamming.
4.2. Jamming Types and Belief States
Many scholars have summarized, through exploration and research, the benefits of different jamming policies against radar anti-jamming. Reference [23] lists 10 kinds of jamming measures and 9 kinds of radar anti-jamming measures according to the application of electronic countermeasures, as in Table 1. It is assumed that the 10 jamming measures of the jammer constitute the state set, denoted as $S = \{s_1, s_2, \ldots, s_{10}\}$, and the 9 anti-jamming measures constitute the action set, denoted as $A = \{a_1, a_2, \ldots, a_9\}$. The adversarial jamming benefit scores are shown in Table 2: the smaller the value, the better the anti-jamming effect, and the larger the value, the worse the anti-jamming effect.
4.3. Radar Anti-Jamming Action and Jamming State Transfer
Based on the radar countermeasure benefit matrix, the basic principle of state transition is as follows: under the anti-jamming action currently adopted by the radar, the score of the jamming measure transferred to is not less than the score of the current jamming measure. According to the principle of maximum entropy, the state transition probability is assumed to obey a uniform distribution over the transferable directions. For example, after executing an action $a_k$, state $s_i$ can transition to state $s_j$ with a certain probability if $s_i \to s_j$ is a transferable direction under action $a_k$. For the action $a_k$, state $s_i$ can only transition to state $s_j$ if the jamming effect of the next state $s_j$ is not worse than that of the current state $s_i$; otherwise, the transition probability is 0, as shown in Equation (8):

$$P(s_j \mid s_i, a_k) = \begin{cases} \dfrac{1}{\left|\{\, s \in S : c(s, a_k) \geq c(s_i, a_k) \,\}\right|}, & c(s_j, a_k) \geq c(s_i, a_k), \\[2mm] 0, & \text{otherwise}, \end{cases} \quad (8)$$

where $c(s_i, a_k)$ is the score of the current state $s_i$ against anti-jamming action $a_k$, and $c(s_j, a_k)$ is the score of the next state $s_j$ against anti-jamming action $a_k$.
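A short sketch of how the transition probabilities implied by Equation (8) can be built from a benefit matrix like Table 2 is shown below; the matrix `C` and its values are placeholders, since the actual table entries are not reproduced here.

```python
import numpy as np

def build_transitions(C):
    """Transition tensor implied by Equation (8).

    C : (|S|,|A|) benefit matrix, C[s, a] = score c(s, a) of jamming state s
        against anti-jamming action a (smaller value = better anti-jamming effect).
    Returns P with P[a, s_i, s_j] = P(s_j | s_i, a).
    """
    n_S, n_A = C.shape
    P = np.zeros((n_A, n_S, n_S))
    for a in range(n_A):
        for s_i in range(n_S):
            # The jammer may only move to states where action a scores no better.
            reachable = C[:, a] >= C[s_i, a]
            # Maximum-entropy assumption: uniform over the transferable directions.
            P[a, s_i, reachable] = 1.0 / reachable.sum()
    return P
```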
4.4. Observations and Observed Probabilities
Jamming identification plays a very important role in the process of intelligent radar countermeasures. The POMDP approach enables synchronous learning and confrontation by quickly identifying jamming features and inferring jamming types. Suppose there are nine groups of time-frequency domain characteristic parameters, $O = \{o_1, o_2, \ldots, o_9\}$, describing the jamming type, and assume that the observation probabilities have already been learned. According to the characteristic samples collected from the 10 jamming states, the observation probability database of the characteristic parameters is constructed, as shown in Table 3.
4.5. Reward Feedback
The purpose of radar anti-jamming is to weaken the jamming effect or force the jamming state into a less threatening one. To determine the reward function, we define the jamming threat level $D(s)$. The efficiency of a radar anti-jamming system is evaluated by comparing the threat level before and after a state transition; in general, if the threat level $D$ falls after a state transition, the reward increases. We describe the effectiveness of any single anti-jamming measure in terms of a sign function $\mathrm{sgn}(\cdot)$: if $\mathrm{sgn}(\cdot) = -1$, the qualitative evaluation of the effect of action $a_k$ is negative; conversely, if $\mathrm{sgn}(\cdot) = 1$, the qualitative evaluation of action $a_k$ is positive. The threat levels $D(s_i)$ and $D(s_j)$ can then be compared by Equation (9). If the comparison in Equation (9) gives a tie, the threat ranking is further determined by calculating the sums of the scores of states $s_i$ and $s_j$ against all actions $a_k$, as in Equation (10).
According to Equations (9) and (10), the state set can be arranged in ascending order of jamming threat level.
The reward function obtained by the radar anti-jamming policy is defined as Equation (11):

$$R = \begin{cases} +1, & D(s_j) < D(s_i), \\ 0, & D(s_j) = D(s_i), \\ -1, & D(s_j) > D(s_i), \end{cases} \quad (11)$$

that is, if the threat level of the new state $s_j$ is lower than that of the previous state $s_i$, the feedback reward is +1; if it is higher, the feedback reward is -1; and if the threat levels of the two states are the same, the feedback reward is 0.
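Equation (11) can be transcribed directly as the small function below; the threat-level array `D`, obtained from the ordering given by Equations (9) and (10), is assumed to be provided.

```python
def reward(s_prev, s_next, D):
    """Reward of Equation (11): +1 if the threat level drops, -1 if it rises, 0 otherwise.

    D : sequence of threat levels indexed by jamming state.
    """
    if D[s_next] < D[s_prev]:
        return 1
    if D[s_next] > D[s_prev]:
        return -1
    return 0
```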
5. Radar Intelligent Anti-Jamming Planning Decision Making Process Based on POMDP
Considering that the game theory-based decision-making of jammers and radars has high real-time requirements, this paper adopts the POMDP offline planning method to select the optimal policy. First, a large amount of preprocessing time is invested in the search phase to generate the radar anti-jamming policy set on the entire belief state space. Then, the optimal policy is selected according to the anti-jamming gain in the policy execution stage. The algorithm flow and the pseudo code are described as follows (Algorithm 2).
Step 1: Input the parameters of the radar anti-jamming policy scheduling model into the SARSOP algorithm, set the expected gap $\epsilon$ or a time limit as the algorithm termination condition, and obtain the $\alpha$-vector set $\Gamma$.
Step 2: Initialize the jamming state $s_0$ and the initial belief state $b_0$.
Step 3: Infer the jamming type according to the observed value, and update the belief state $b$ according to Equation (1).
Step 4: Perform an inner product operation between each vector in $\Gamma$ and $b$, and select the optimal anti-jamming action from the policy set according to the $\alpha$-vector corresponding to the maximum value.
Step 5: Affected by the anti-jamming action, the jamming pattern is transferred according to the state transition probability, and a new observation is generated.
Step 6: Determine whether the maximum time step is reached; if not, return to Step 3.
Algorithm 2 Offline planning of anti-jamming action based on POMDP
Input: A POMDP <S, A, O, P, R>. Output: A near-optimal policy, represented by the $\alpha$-vector set $\Gamma$.
1  Let $\alpha_0(s) \leftarrow 0$ for all $s \in S$; $\Gamma \leftarrow \{\alpha_0\}$.
2  repeat
3    for each sampled belief point $b$ in $T_R$ do
4      for each $a \in A$, $o \in O$ do
5        compute the successor belief $b'_{a,o}$ by Equation (1) and $\alpha_{a,o}$ by Equation (5)
6      end
7      compute $\alpha_a$ for each $a \in A$ by Equation (6)
8      determine the optimal vector $\alpha'$ and action $a'$ by Equation (7)
9      update $\Gamma \leftarrow \Gamma \cup \{\alpha'\}$
10   end
11 until the root gap is less than $\epsilon$ or the time limit is reached
12 return $\Gamma$
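The execution stage of Steps 2-6 can be sketched as the loop below, reusing the helper functions from the earlier sketches (`select_action`, `belief_update`, `reward`); the sampling of the true jamming state and of the observation stands in for the jammer's response and the radar's jamming-type identification, and all names are illustrative rather than taken from the paper.

```python
import numpy as np

def run_episode(Gamma, alpha_actions, P_trans, P_obs, D, b0, s0, T_max, seed=0):
    """Online execution of the offline-planned anti-jamming policy (Steps 2-6)."""
    rng = np.random.default_rng(seed)
    n_S, n_O = P_obs.shape
    b, s, total = b0.copy(), s0, 0
    for _ in range(T_max):
        a, _ = select_action(b, Gamma, alpha_actions)   # Step 4: inner product over Gamma
        s_next = rng.choice(n_S, p=P_trans[a, s])       # Step 5: jamming state transfers
        o = rng.choice(n_O, p=P_obs[s_next])            # a new observation is generated
        total += reward(s, s_next, D)                   # accumulate Equation (11)
        b = belief_update(b, a, o, P_trans, P_obs)      # Step 3: Bayesian update
        s = s_next
    return total                                        # Step 6: stop at the maximum step
```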
6. Simulations and Discussions
In this section, the performance of the radar anti-jamming policy based on POMDP is verified by simulation in MATLAB.
Parameter settings are as follows: the discount factor is $\gamma$; the initial belief state obeys a uniform distribution; the SARSOP algorithm stops when the root gap reaches the expected gap $\epsilon$; and the maximum time step is $T_{\max}$.
The jamming state is dynamically shifted by the effect of the radar's anti-jamming actions, and observations are generated. The radar updates the belief state, infers the jamming type from the observations, and adaptively selects the best anti-jamming action. The jammer's initial state is the broadband noise jamming pattern. In the simulation, three anti-jamming action policies are compared with the action policy solved by the POMDP model: the greedy policy, the random policy, and the fixed policy.
The greedy (or "myopic") policy selects the action targeted at the state that has the maximum likelihood of producing the current observation. The random policy maps the belief state to a random action, with every action having the same probability of being chosen. The fixed policy means that the radar always takes the same anti-jamming action regardless of how the jamming state changes; constant false alarm rate processing is used as the fixed anti-jamming measure for the radar.
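For reference, the three baseline policies can be sketched as follows; the index used for the fixed action (constant false alarm rate processing) is a placeholder, not the paper's numbering, and the arrays follow the conventions of the earlier sketches.

```python
import numpy as np

def random_policy(n_actions, rng):
    # Every anti-jamming action is chosen with equal probability.
    return int(rng.integers(n_actions))

def fixed_policy(fixed_action=0):
    # Always the same measure, e.g. constant false alarm rate processing
    # (index 0 is a placeholder).
    return fixed_action

def greedy_policy(o, P_obs, C):
    """Myopic policy: pick the state most likely to have produced observation o,
    then the action with the best (smallest) score against that state."""
    s_hat = int(np.argmax(P_obs[:, o]))   # maximum-likelihood jamming state
    return int(np.argmin(C[s_hat]))       # smaller score = better anti-jamming effect
```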
Figure 3 and Figure 4 show the evolution of the jamming states and observations under the four policies, and Figure 5 shows the anti-jamming action taken by the radar at each time step. The horizontal axis is the iteration time step from 0 to 50, and the vertical axis is the index of the state, observation, and action, respectively. The results intuitively reflect the dynamic game confrontation process between the jammer and the radar. To compare the four policies, the average cumulative reward under each policy, i.e., the reward accumulated up to the current time step, is recorded in Figure 6; the time step varies from 0 to 1000. It can be seen that the reward of the action policy planned by the POMDP is significantly better than those of the greedy, random, and fixed policies, indicating that the POMDP policy takes more effective anti-jamming measures against the enemy's jamming patterns.
The jamming patterns are divided into five types according to their attributes, namely main lobe jamming, side lobe jamming, noise jamming, impulse jamming, and deception jamming. We define the jamming state occupancy and perform 1000 Monte Carlo experiments to count the average occupancy of the different jamming types, as shown in Figure 7. The vertical axis indicates the average occupancy of each jamming type, defined as the proportion of occurrences of that jamming type over all Monte Carlo trials.
It can be seen that, as the states transition, the average occupancy of the jamming states under the four policies tends to be stable. Setting aside the fixed policy, the anti-jamming action set generated by the POMDP policy yields the highest average occupancy for deception jamming and the lowest for main lobe and side lobe jamming. The random policy yields the highest average occupancy for main lobe and side lobe jamming and the lowest for noise and impulse jamming. Under the fixed policy, only noise jamming and impulse jamming occur.
The jamming states, observations, and radar anti-jamming action scheduling are shown in Figure 8, Figure 9 and Figure 10. We recorded the accumulation of rewards when executing the optimal policy, as shown in Figure 11.
According to the state transition principle, for the same anti-jamming action the jamming score of the next state is never inferior to that of the current state, so the anti-jamming effect does not improve after a state transition; after multi-step transitions it may even become negative. Thus, the long-term reward accumulation of the fixed policy is the lowest. The reward of the random policy is strongly random, since each action has the same probability of being selected; it has the lowest short-term reward, but its accumulated reward exceeds that of the fixed policy when the time step reaches about 550. In the early stage, the greedy policy earns higher rewards than the POMDP policy: the greedy policy only attends to the reward of the current time step, while the POMDP policy attends more to the long-term reward, so the former has the higher short-term reward but the latter has the highest reward at the end of the confrontation, as shown in Figure 11. Overall, the action plan of the POMDP policy has the best anti-jamming effect.
The main goal of the radar's anti-jamming actions is to force the jammer to transfer from a high-threat state to a low-threat one, so that the radar can more fully exert its combat power in a safe environment. To verify the efficiency of the proposed POMDP-based cognitive radar anti-jamming model, we changed the initial jamming state and counted the average occupancy of the lowest-threat jamming state under the four policies, as shown in Figure 12; the vertical axis represents the average occupancy of the lowest-threat state. The POMDP policy is the most effective at forcing the jammer from a high-threat state into a low-threat one. Meanwhile, the proposed method is insensitive to the initial state: changing the initial state hardly affects the reward outcome or the average occupancy of the lowest-threat jamming state.
Table 4 reports the maximum reward and the average reward after 1000 steps of anti-jamming actions under the four policies. Clearly, the reward of the POMDP policy is much higher than that of the other three policies, while the fixed policy yields the lowest reward and the worst anti-jamming effect. The average occupancy of the jamming state with the lowest threat level is approximately 0.09, 0.08, 0.67, and 0 under the four policies, respectively.
7. Conclusions
In order to improve the adaptive anti-jamming capability of radars, a cognitive radar intelligent anti-jamming decision-making system model is designed. The decision-making of the jammer–radar confrontation is realized by applying POMDP theory, which differs from greedy methods based on expert systems, blind manual selection, or fixed selection methods. The system model simulates the Observation, Orientation, Decision, and Action (OODA) closed loop of the anti-jamming system and obtains feedback through the reward function. Considering the uncertainty in the radar's identification of jamming types, the jamming type is inferred from observations so as to select the best countermeasure, which greatly improves the intelligence of the radar system's anti-jamming decision-making. The average occupancy of the lowest-threat state under the POMDP policy is the highest among all policies; compared with the greedy policy and the random policy, the occupancy of the lowest-threat state under the POMDP policy increased by approximately 12.5% and 34.3%, respectively. The reward of the POMDP policy is much higher than that of the other three policies. Therefore, the anti-jamming intelligent decision-making radar system can generate countermeasures adaptively based on the POMDP model, thereby significantly reducing the jamming threat to radars and improving their survivability and combat capability.