Task-Importance-Oriented Task Selection and Allocation Scheme for Mobile Crowdsensing

Chang, Sha; Wu, Yahui; Deng, Su; Ma, Wubin; Zhou, Haohao

doi:10.3390/math12162471


by Sha Chang *, Yahui Wu *, Su Deng, Wubin Ma and Haohao Zhou

Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology, Changsha 410073, China

* Authors to whom correspondence should be addressed.

Mathematics 2024, 12(16), 2471; https://doi.org/10.3390/math12162471

Submission received: 2 July 2024 / Revised: 7 August 2024 / Accepted: 8 August 2024 / Published: 10 August 2024

(This article belongs to the Topic AI and Data-Driven Advancements in Industry 4.0)


Abstract:

In Mobile Crowdsensing (MCS), sensing tasks have different impacts and contributions to the whole system or specific targets, so the importance of the tasks is different. Since resources for performing tasks are usually limited, prioritizing the allocation of resources to more important tasks can ensure that key data or information can be collected promptly and accurately, thus improving overall efficiency and performance. Therefore, it is very important to consider the importance of tasks in the task selection and allocation of MCS. In this paper, a task queue is established, the importance of tasks, the ability of participants to perform tasks, and the stability of the task queue are considered, and a novel task selection and allocation scheme (TSAS) in the MCS system is designed. This scheme introduces the Lyapunov optimization method, which can be used to dynamically keep the task queue stable, balance the execution ability of participants and the system load, and perform more important tasks in different system states, even when the participants are limited. In addition, the Double Deep Q-Network (DDQN) method is introduced to improve on the traditional solution of the Lyapunov optimization problem, so this scheme has a certain predictive ability and foresight on the impact of future system states. This paper also proposes action-masking and iterative training methods for the MCS system, which can accelerate the training process of the neural network in the DDQN and improve the training effect. Experiments show that the TSAS based on the Lyapunov optimization method and DDQN performs better than other algorithms, considering the long-term stability of the queue, the number and importance of tasks to be executed, and the congestion degree of tasks.

Keywords:

mobile crowdsensing; task importance; Lyapunov optimization; double deep Q-network; action mask

MSC:

60K25; 68M20; 90C15

1. Introduction

Mobile Crowdsensing (MCS) [1] introduces a novel sensing methodology by leveraging intelligent devices embedded with a variety of sensors to gather sensory information. In comparison to traditional static sensor networks, MCS eliminates the prerequisite of pre-deploying a substantial sensor infrastructure, thereby offering significant benefits, including reduced costs, scalability, real-time capabilities, and enhanced efficiency. Consequently, MCS has found widespread application in domains such as smart city management, environmental surveillance, and data acquisition [2,3,4,5].

Typically, an MCS system architecture, as illustrated in Figure 1, encompasses a central platform collaborating with a multitude of participant devices [6]. The process is initiated when external applications requiring data transmit sensing tasks to the platform. The platform assumes the pivotal role of task mediation, meticulously evaluating both the specific requirements of the tasks and the current state of the participants. By undertaking a comprehensive analysis, it formulates a strategic task allocation scheme to match tasks with the most suitable participants [7].

Participants undertake the assigned tasks and subsequently forward the accumulated data back to the platform. Here, the platform undertakes the tasks of securely storing and processing the data and subsequently transmitting them to the requesting external applications and generating revenue. Thus, MCS effectively integrates the sensing capacity of participants with the diverse demands of data-reliant applications [8,9].

In Mobile Crowdsensing (MCS), tasks vary in their impacts and contributions to the overall system or specific objectives, leading to a differentiation in their levels of importance. Take environmental monitoring [10] as an illustration: detecting hazardous gas leaks may hold greater importance than routine air quality monitoring, given its direct implications for public safety. By identifying and prioritizing tasks that pose higher risks or possess greater importance, potential hazards and losses can be mitigated while enhancing the system’s responsiveness and crisis management capacities.

Similarly, in military strategizing, the prioritization and importance of tasks exert a direct influence on higher-level decision-making processes. Tasks deemed highly important may demand data of superior quality or a higher completion rate. The gathering of vital data can arm decision-makers with timely and crucial information, thereby facilitating more efficacious decision-making [11,12].

Consequently, considering the importance of tasks during the task selection and allocation phases within MCS is of great significance. Given the usual constraints of finite resources, such as participant availability, energy, time, and financial budgets, prioritizing these resources for more crucial tasks ensures the prompt and precise collection of vital data. This strategic allocation enhances the overall effectiveness and performance of the MCS system.

In many studies concerning task allocation, the importance of the tasks is consistently recognized as an indispensable factor.

To guarantee the timely execution of urgent tasks, ref. [13] introduced a new multi-objective optimization problem based on the quality of service of urgent task-aware cloud manufacturing service composition. It gives a higher importance to urgent tasks in the production process. It proposes two service composition methods based on vertical collaboration and speed selection to speed up task completion. Ref. [14] proposed a radar task selection method based on task importance. The method first sorts tasks according to their dwell times and priorities and defines a set of reward values according to the initial ranking. Then, the priority of the tasks is changed iteratively according to the reward and punishment strategy. Finally, the optimal sequencing is determined to schedule tasks. This method significantly improves the performance of task scheduling and shortens the time for task selection and allocation. Ref. [15] ranked different nodes in wireless rechargeable sensor networks through the importance of data transmission so as to distinguish the priority of node charging. The importance of data transmission is mainly determined by the deadlines of the tasks and the penalty values of the tasks. In order to minimize data loss, charging tasks are divided into early tasks and delayed tasks. Nodes with high importance and short deadlines are included in early tasks. Ref. [16] proposed a scheduling scheme that can be directly applied to actual wind turbine maintenance tasks. The authors considered the type of maintenance tasks, the importance of the maintenance tasks, the remaining time, the task maintenance time, and the waiting time before maintenance; determined the priority of different maintenance tasks; and constructed a scheduling model based on task priority.

To address the problem of multiple permanent node failures in multicore real-time systems, ref. [17] proposed two protocols based on task importance: a recovery time distribution protocol and a graceful degradation protocol. These two protocols are based on the importance of the tasks. For single-node failure and multiple-node failures, the task is delayed or discarded after the importance is degraded. Based on a cloud computing platform, ref. [18] proposed a new deadline allocation strategy with level importance. The task level importance is obtained by calculating the proportion of a task level in all task levels. Based on the level importance, the authors proposed a heuristic task scheduling method. This method can minimize the execution cost by effectively scheduling tasks under deadline constraints. Ref. [19] studied the relationship between the processing time of task execution and the importance of the task in mobile edge computing. The authors distinguish the importance of the tasks according to the processing time of task execution. The corresponding task priority is given according to the average processing time of task execution. Then, the tasks with different priorities are weighted to allocate computing resources to ensure that the high-priority tasks can obtain sufficient computing resources. Ref. [20] summarized the factors related to the importance of tasks in industrial software into three aspects: intrinsic importance, the longest time that processing can be delayed, and whether it is on the critical path. The importance of tasks is negatively correlated with the longest processing time and positively correlated with the other two factors. The authors proposed an online task scheduling algorithm based on task importance. The algorithm sorts and schedules all tasks in descending order of importance. 
In addition, the authors of ref. [20] also established resource reservation, preemptive scheduling, and online adaptive adjustment methods, which further improve the efficiency and adaptability of the algorithm.

Currently, research on the significance of tasks in task selection and allocation within MCS remains scarce.

Ref. [21] set a weight for each task according to its importance: the higher the weight, the more important the task. In that work, the revenue obtained by the data requester is defined as the total weighted quality of all sensing data, and a task allocation scheme is designed to maximize this revenue under a limited budget. It should be noted that the weight in ref. [21] is a relative value, which expresses the importance of a task relative to other tasks once the set of all tasks is determined. However, in practical applications, tasks arrive asynchronously and continuously, and the system cannot know all tasks in advance. Therefore, this method is not suitable for long-term, online dynamic task allocation in the MCS system.

In participatory MCS systems, the allocation and management of some tasks are based on auctions. Auctions may involve untruthful bidding and malicious participants, which leads to the redundant assignment of tasks; that is, a task is assigned to multiple participants. This not only wastes resources but also reduces the actual execution rate of tasks. To solve this problem, ref. [22] associated the importance of tasks with the number of bidders. Unpopular tasks, that is, tasks with fewer bidders, are given higher importance and are assigned and executed first, thus significantly improving the task completion rate. However, the authors did not distinguish the importance of the tasks themselves.

This prior research assumes that adequate participants are ready to undertake tasks. In practical deployments, however, MCS systems differ from other task allocation frameworks, such as those in edge computing or industrial manufacturing systems: the participants in MCS systems are autonomous and free to join or leave the system at their discretion. This autonomy may leave certain areas or periods short of participants, which leads to a backlog of pending tasks. In addition, current research mainly determines task selection and allocation based on the current state, without considering the impact of future task arrivals on current decisions. Hence, this paper focuses on devising a forward-looking, long-term task selection and allocation scheme based on the importance of tasks when the tasks and participants are not known in advance. The main contributions of this paper are as follows:

A task queue is established, and the Lyapunov optimization method is introduced to propose the task selection and allocation scheme (TSAS) in the MCS system. When there are many tasks but not enough participants, participants mainly perform tasks of high importance and stabilize the task queue by controlling the number of tasks entering the queue; when the number of tasks is small but participants are sufficient, participants can perform the backlog tasks in the queue. So, tasks of lower importance can also be executed, which not only improves the completion of tasks but also improves the utilization of resources. Therefore, the TSAS is a long-term and online scheme that can dynamically balance the relationship between the execution ability of participants, the stability of the task queue, and the importance of tasks.
According to different system states, namely, the length of the task queue, the number of participants, and the number of tasks arriving at the platform, the TSAS is dynamically determined. This scheme has certain predictive power and foresight on the influence of future states.
The traditional solution of the Lyapunov optimization problem is improved, the DDQN method is introduced, and action mask, iterative training, and other methods are proposed to accelerate the training process and improve the training effect.

The rest of this paper is organized as follows. Section 2 introduces the system model and objective function. Section 3 develops the TSAS and solves the problem based on Lyapunov optimization and the DDQN. Section 4 presents a series of experiments to evaluate the performance of the TSAS and compares it with the Classical Lyapunov optimization algorithm, QPA algorithm, and offline task selection and allocation algorithm. Finally, Section 5 concludes this article.

2. System Model and Problem Formulation

2.1. System Model

In order to facilitate the management and scheduling of the system and ensure the synchronization and consistency of data collection, continuous time is discretized into equal time slots $t \in \{0, 1, 2, \dots\}$. The number of tasks arriving at the platform differs from slot to slot. As shown in Figure 2, $O(t)$ represents the number of tasks arriving at the platform in time slot $t$. $O(t)$ is i.i.d. over slots with $E\{O(t)\} = λ$ and $0 \leq O(t) \leq O^{max}$. $O_{j}(t)$ refers to the $j$th task arriving at the platform.

In order to establish a long-term task management method that can dynamically balance the number of participants and tasks, this paper designs a task queue to store the tasks to be performed. $Q(t)$ indicates the length of the task queue in the platform in time slot $t$. Each task has its own importance value. $F_{s}(t)$ refers to the importance of the $s$th task, which is determined by the data requester when generating the task. The larger the $F_{s}(t)$, the higher the importance of the task.

Task management includes task selection and task allocation. Task selection refers to the platform determining the tasks that can enter the task queue and be executed based on the state and number of participants in the previous time slot. Therefore, not all arriving tasks can enter the platform. When a task is dropped without entering the task queue, it indicates that the MCS system has a heavy load, a long task queue, and insufficient participants. The discarded tasks will be executed by other systems. This can ensure the stability of the task queue in the MCS system and avoid long waiting times for tasks.

$o(t)$ represents the number of tasks entering the task queue of the sensing platform in time slot $t$, and $0 \leq o(t) \leq O(t)$. Because tasks differ in importance, the platform should select the tasks that enter the task queue according to their importance in order to make full use of the limited sensing resources.
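The selection step can be illustrated with a minimal Python sketch. It assumes that, once the number of admitted tasks $o(t)$ has been decided, the platform simply admits the most important arrivals; the function name `admit_most_important` and the list `arrival_importance` are hypothetical and not part of the original model.

```python
def admit_most_important(arrival_importance, num_admitted):
    """Return the indices of the arriving tasks admitted to the queue.

    arrival_importance[j] is the importance of the j-th arriving task
    (hypothetical input); num_admitted plays the role of o(t).
    """
    if not 0 <= num_admitted <= len(arrival_importance):
        raise ValueError("o(t) must satisfy 0 <= o(t) <= O(t)")
    # Rank arrivals from most to least important; ties keep arrival order.
    ranked = sorted(
        range(len(arrival_importance)),
        key=lambda j: arrival_importance[j],
        reverse=True,
    )
    return sorted(ranked[:num_admitted])


# Four arrivals, two admitted: the tasks with importance 8 and 5 enter the queue.
print(admit_most_important([3, 8, 1, 5], num_admitted=2))  # [1, 3]
```

The remaining arrivals are the ones that would be dropped and handed to other systems.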

Task allocation refers to the scheme in which the platform assigns tasks from the queue to the participants who execute them. Tasks are not allocated in the order in which they entered the task queue; rather, they are selectively assigned based on their importance. $i$ denotes a participant, $i \in \{1, \dots, N\}$. $x(t) = \{x_{si}(t) | s \in \{1, 2, \dots, Q(t)\}, i \in \{1, 2, \dots, N\}\}$ indicates the task allocation scheme, that is, whether task $s$ in the queue is allocated to participant $i$ in time slot $t$. If $x_{si}(t) = 1$, participant $i$ performs task $s$. If $x_{si}(t) = 0$, participant $i$ does not perform task $s$. In each time slot, a participant can perform only a single task, and a task is assigned to at most one participant. $r(t)$ is the number of tasks that can be performed in time slot $t$:

r (t) = \sum_{s = 1}^{Q (t)} \sum_{i = 1}^{N} x_{s i} (t)

(1)

0 \leq \sum_{i = 1}^{N} x_{s i} (t) \leq 1

(2)

0 \leq \sum_{s = 1}^{Q (t)} x_{s i} (t) \leq 1

(3)

0 \leq r (t) \leq N

(4)

At the end of time slot $t$, the remaining unexecuted tasks in the task queue will wait to be executed in the future. New tasks will arrive at the platform in the next time slot, and the platform will make the task selection decision again to determine the number of tasks entering the queue. The dynamics of $Q(t)$ follow the formula below:

Q (t + 1) = \max [Q (t) - r (t), 0] + o (t)

(5)

where $Q(t) \geq 0$ and $Q(0) = 0$; that is, in the initial time slot, the length of the task queue is 0.

If $Q(t)$ satisfies

\lim_{t \to + \infty} \frac{E \{Q (t)\}}{t} = 0

(6)

then $Q(t)$ is stable; that is, the task queue length is bounded. Therefore, the MCS system is stable.
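The queue dynamics in Equation (5) can be traced with a few lines of Python. This is only an illustrative sketch: the helper names `step_queue` and `simulate` and the example sequences are assumptions, and the `served` values stand for $r(t)$ in each slot.

```python
def step_queue(q, r, o):
    """Equation (5): Q(t+1) = max[Q(t) - r(t), 0] + o(t)."""
    if q < 0 or r < 0 or o < 0:
        raise ValueError("q, r and o must be non-negative")
    return max(q - r, 0) + o


def simulate(admitted, served, q0=0):
    """Return the trajectory [Q(0), Q(1), ...] for given o(t) and r(t)."""
    trajectory = [q0]
    for o, r in zip(admitted, served):
        trajectory.append(step_queue(trajectory[-1], r, o))
    return trajectory


# Hypothetical run over three slots, starting from an empty queue, Q(0) = 0.
print(simulate(admitted=[3, 1, 0], served=[2, 2, 2]))  # [0, 3, 2, 0]
```

In this run the backlog built up in the first slot is cleared once admissions drop below the service capacity, which is the behaviour the stability condition (6) asks for in the long run.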

2.2. System Utility

If the MCS system is stable, tasks that enter the queue can be executed. $U(t)$ represents the system utility in time slot $t$, which depends on the number of sensing tasks entering the task queue:

U (t) = β o (t)

(7)

where $β$ represents the utility that the platform can obtain by performing a unit task. $β$ is a positive constant. The larger the $U(t)$, the more tasks enter the task queue. A large $U(t)$ may cause the queue to grow continuously and become unstable. If $U(t)$ is too small, resources may be wasted, meaning that some participants have no tasks to execute and remain idle. Therefore, the task selection scheme needs to balance the stability of the task queue and the execution ability of participants.

2.3. Importance Level of the Tasks Being Executed

As mentioned earlier, when resources are limited, important tasks should be prioritized. $I(t)$ represents the total importance of all performed tasks in time slot $t$:

I (t) = \sum_{s = 1}^{Q (t)} \sum_{i = 1}^{N} F_{s} (t) x_{s i} (t)

(8)

The larger the $I(t)$, the more meaningful the task selection and allocation scheme of the system. Therefore, the task selection and allocation scheme needs to prioritize tasks with higher importance.

2.4. Problem Formulation

The problem to be solved in this paper is to maximize the overall system utility and the importance of the executed tasks through task selection and task allocation schemes while satisfying the stability of the task queue. Due to the long-term operation, dynamism, uncertainty, and nonlinearity of the MCS system studied in this paper, it is transformed into an optimization problem of system efficiency and task importance under time-averaged constraints. The objective function of this problem is as follows:

\max_{o (t)} \bar{U} = \max_{o (t)} \lim_{t \to + \infty} \frac{1}{t} \sum_{τ = 0}^{t - 1} E \{U (τ)\}

(9)

\max_{x_{si} (t)} \bar{I} = \max_{x_{si} (t)} \lim_{t \to + \infty} \frac{1}{t} \sum_{τ = 0}^{t - 1} E \{I (τ)\}

(10)

s.t. (1), (5), (6), (7)

3. Task Selection and Allocation Scheme

3.1. Lyapunov Optimization

In order to solve the optimization problem (9)–(10), this paper introduces queuing theory and Lyapunov optimization theory, which greatly reduce the difficulty of the problem. Lyapunov optimization theory [23] addresses stochastic network optimization. The work of Professor Neely and his team on Lyapunov optimization is relatively mature and has been applied in many fields. Unlike static optimization algorithms, an algorithm based on Lyapunov optimization theory is online, dynamic, and adaptive: when the system state changes, no manual adjustment is needed. The algorithm has a certain self-learning ability and can maintain asymptotic optimality. In addition, such an algorithm is simpler to implement, requires less prior knowledge, and has a certain decoupling ability, which can greatly reduce the complexity of the algorithm.

First, the Lyapunov function is defined as follows:

L (t) ≜ \frac{{Q (t)}^{2}}{2}

(11)

$L(t)$ is the measure of task queue congestion, and $L(t) \geq 0$. Generally, we cannot directly guarantee that the Lyapunov function is bounded, but we can design a method to keep the task queue backlog within a certain range so as to effectively control the congestion degree of the task queue and ensure the stability of the queue. Therefore, the Lyapunov drift function is defined as follows:

Δ (Q (t)) ≜ E \{L (t + 1) - L (t)| Q (t)\}

(12)

The Lyapunov drift function represents the change in the Lyapunov function with time. Because the system is dynamic and uncertain, the Lyapunov drift function is defined as a conditional expectation in this paper. The expectation depends on the task selection scheme of the MCS system and is related to the system state.

The utility function is added as a penalty to the above drift function to obtain the drift-plus-penalty function:

Δ_{v} (Q (t)) ≜ E \{L (t + 1) - L (t)| (Q (t))\} - V E \{U (t)| (Q (t))\}

(13)

$V$ is a non-negative constant. By adjusting the value of $V$, the trade-off between queue stability and system utility can be controlled.

In order to ensure the stability of the queue and maximize system utility and the total importance of performed tasks, task selection decision $o(t)$ and task allocation decision $x(t)$ should be designed to minimize the drift-plus-penalty function. The objective function (9), together with the stability condition (6), is converted into

\min Δ_{v} (Q (t)) = \min E \{L (t + 1) - L (t)| (Q (t))\} - V E \{U (t)| (Q (t))\}

(14)

The classical solution of (14) is currently the mainstream, general approach to task selection and assignment problems in MCS. Because of the complexity of the Lyapunov optimization problem, it is difficult to obtain the minimum value of the drift-plus-penalty function directly. The classical solution therefore minimizes the supremum of the drift-plus-penalty function instead. However, in practical problems, computing the supremum directly may be very complicated or even impractical. Generally, accuracy is traded against computational complexity, and the problem is solved by minimizing one of the upper bounds obtained through appropriate scaling. This approach yields only an approximately optimal solution, which is not the optimal task selection and allocation scheme and may even cause high task queue congestion. In addition, it does not consider the impact of future states on the task selection scheme of the current time slot. Therefore, this paper introduces deep reinforcement learning to address these shortcomings of the classical solution of the Lyapunov optimization algorithm.
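For reference, the upper-bound approach described above can be made concrete with the standard drift bound from the Lyapunov optimization literature. The following is a sketch under the model of Section 2, not necessarily the exact bound used by the authors; the constant $B$ is a symbol introduced here for illustration. The first line uses $\max[Q(t) - r(t), 0]^{2} \leq (Q(t) - r(t))^{2}$ and $\max[Q(t) - r(t), 0] \leq Q(t)$; the second uses $r(t) \leq N$ and $o(t) \leq O^{max}$.

```latex
\begin{aligned}
Q(t+1)^{2} &\leq Q(t)^{2} + r(t)^{2} + o(t)^{2} + 2 Q(t) \left[ o(t) - r(t) \right] \\
\Delta(Q(t)) &\leq B + Q(t) \, E\{ o(t) - r(t) \mid Q(t) \},
  \qquad B = \frac{N^{2} + (O^{max})^{2}}{2} \\
\Delta_{v}(Q(t)) &\leq B - Q(t) \, E\{ r(t) \mid Q(t) \}
  + E\{ (Q(t) - V \beta) \, o(t) \mid Q(t) \}
\end{aligned}
```

Minimizing the right-hand side over $0 \leq o(t) \leq O(t)$ gives a threshold rule: admit all $O(t)$ arrivals when $Q(t) < V β$ and none otherwise. The rule depends only on the current slot, which is the limitation that the learning-based approach is meant to address.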

Reinforcement learning [24] is an important branch of machine learning. Reinforcement learning systems are usually composed of agents and environments. Agents can observe and interact with the environment, generate empirical data, and self-learn to maximize long-term returns so as to solve problems in different fields. In reinforcement learning, data are generated as the agent interacts with the environment and are therefore unlabeled. When an agent executes an action in a certain state of the environment, it receives a reward from the environment that indicates whether the action is good or bad, and the state of the environment changes accordingly. In the next interaction, the agent makes decisions and acts according to the new state of the environment. However, the goal of the reinforcement learning problem is long-term; that is, the goal is to maximize the sum of long-term rewards, so agents cannot simply obtain the optimal action from a single reward. In general, empirical data are generated iteratively during training, with data generation and agent training alternating. Agents obtain optimal strategies in different states through long-term training.

The Q-learning algorithm is a common reinforcement learning algorithm, which is usually implemented in tabular form. In Q-learning, a table is first established to store the Q values of each action in different states; training data are then collected during the interaction between the agent and the environment, and the Q values in the table are updated with experience replay. Obviously, this method is suitable for small state and action spaces. When the state or action space is large, a table is not convenient for recording the Q value of each state and action, so it is more reasonable to estimate the Q value by function fitting. The Deep Q-Network (DQN) [25] is based on the idea of function fitting and uses the powerful expressive ability of neural networks to fit and estimate the Q value. The Double Deep Q-Network (DDQN) [26,27] improves on the DQN. By separating action selection and value evaluation, the DDQN reduces the risk of overestimation and improves learning stability and final performance. These improvements make the DDQN a powerful tool for dealing with high-dimensional state spaces and complex decision problems. Because the environment in the system model in this paper is unknown and unpredictable, and the state space is large, it is difficult to use the traditional table-based reinforcement learning method to solve the problem. Therefore, the DDQN is used in the task selection stage, and a task selection decision algorithm based on Lyapunov optimization and the DDQN is proposed. The Lyapunov optimization method based on deep reinforcement learning no longer scales a certain upper bound of the drift-plus-penalty function but directly optimizes the task selection decision scheme through a reward function.

The DDQN-based MCS system is composed of the agent and the environment. The agent refers to the MCS platform. The environment includes not only the real physical objects, namely, the participants, but also the working mechanism and operating rules of MCS. The interaction between the agent and the environment refers to the interaction between the platform and the participants. When tasks arrive at the platform, the platform determines which tasks can enter the task queue, that is, makes the task selection decision, so as to minimize the drift-plus-penalty function. Then, the task allocation scheme is determined according to $o(t)$, $Q(t)$, and $F_{s}(t)$ to maximize the objective function (10), and the tasks are assigned to the participants. The participants complete the tasks and upload data to the platform. At this point, one interaction between the agent and the environment is completed, and experience data are generated for the agent to learn from.

Therefore, the objective functions (9) and (10) are solved in two stages:

1. In the task selection stage, the goal is mainly to solve the problem below:

\min_{o (t)} Δ_{v} (Q (t)) = \min_{o (t)} E \{L (t + 1) - L (t)| (Q (t))\} - V E \{U (t)| (Q (t))\}

s.t. Q (t + 1) = \max [Q (t) - r (t), 0] + o (t)

r (t) = \sum_{s = 1}^{Q (t)} \sum_{i = 1}^{N} x_{s i} (t)

U (t) = β o (t)

2. In the task allocation stage, the goal is mainly to solve the problem below:

\max_{x_{si} (t)} I (t) = \max_{x_{si} (t)} \sum_{s = 1}^{Q (t)} \sum_{i = 1}^{N} F_{s} (t) x_{s i} (t)

The working mechanism of the task selection stage and allocation stage in the MCS system based on Lyapunov optimization and deep reinforcement learning is shown in Figure 3. The specific algorithm is introduced in the below sections.
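As an illustration of the allocation-stage problem, the sketch below assigns the most important queued tasks to distinct participants. It assumes that participants are interchangeable, so that $F_{s}(t)$ does not depend on who executes the task, and that importance values are non-negative; under these assumptions, picking the top $\min(N, Q(t))$ tasks maximizes $I(t)$ subject to Constraints (2) and (3). The function names are hypothetical, and the authors' own allocation algorithm may differ.

```python
def allocate_by_importance(importance, num_participants):
    """Map queued task index s -> participant index i, so that x_si(t) = 1.

    importance[s] is F_s(t) for the s-th queued task (0-based here).
    Each participant gets at most one task, each task at most one participant.
    """
    ranked = sorted(
        range(len(importance)),
        key=lambda s: importance[s],
        reverse=True,
    )
    chosen = ranked[:num_participants]
    return {s: i for i, s in enumerate(chosen)}


def total_importance(importance, assignment):
    """Equation (8): I(t) is the summed importance of the executed tasks."""
    return sum(importance[s] for s in assignment)


# Four queued tasks, two participants: tasks 1 and 3 are executed.
assignment = allocate_by_importance([2, 9, 5, 7], num_participants=2)
print(assignment)                                   # {1: 0, 3: 1}
print(total_importance([2, 9, 5, 7], assignment))   # 16
```

Here $r(t)$ equals the number of entries in the returned mapping.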

3.2. Task Selection Stage

In the task selection stage, the DDQN and Lyapunov optimization are used to make the task selection decision. The action-value network calculates all the action values in different states so as to obtain the optimal actions, that is, to determine the number of tasks entering the task queue. The state, action, reward function of the algorithm, and parameter update mechanism of the action-value network are introduced in detail below.

3.2.1. State $S (t)$

The determination of state $S(t)$ mainly considers the factors that affect the action value in the MCS system, including the number of tasks reaching the platform $O(t)$, the length of the task queue $Q(t)$, and the number of participants $N(t)$, so that $S(t) = [O(t), Q(t), N(t)]$. There are different states in different time slots. The state space is denoted by $S ≜ \{S(t)\}$. $S(t)$ is dynamic and uncertain. On the one hand, the action decided by the agent is uncertain. When the agent decides the action and acts on the environment, the state of the environment may also change. On the other hand, the next state is usually determined by the current state and action, but the next state cannot be completely predicted and determined. Obviously, $O(t)$, $Q(t)$, and $N(t)$ are dynamic and uncertain. The state space $S$ is very large. The traditional table-based reinforcement learning method cannot solve this problem, so the DDQN is suitable for the platform to make decisions.

3.2.2. Action $a (t)$

Action $a(t)$ refers to the number of tasks entering the task queue, which is the task selection decision made by the agent in time slot $t$; that is, $a(t) = o(t)$ with $0 \leq a(t) \leq O(t)$. The action space is denoted by $A ≜ \{a(t)\}$. Action values are different in different states. The optimal action is the action with the greatest value, that is, the optimal solution of the objective function (14).
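Because the constraint $0 \leq a(t) \leq O(t)$ changes from slot to slot, a network with a fixed number of outputs can propose actions that are infeasible in the current state. One common way to handle this is to mask such actions before taking the greedy choice. The sketch below shows that idea only; the name `masked_greedy_action` is hypothetical, and the masking method actually used in the paper may differ.

```python
def masked_greedy_action(q_values, num_arrivals):
    """Pick the highest-valued action a with 0 <= a <= O(t).

    q_values[a] is the estimated value of admitting a tasks; entries with
    a > num_arrivals are infeasible in this slot and are ignored.
    """
    if num_arrivals < 0:
        raise ValueError("O(t) must be non-negative")
    max_action = min(num_arrivals, len(q_values) - 1)
    return max(range(max_action + 1), key=lambda a: q_values[a])


# The unmasked argmax would be action 3, but only 2 tasks arrived.
print(masked_greedy_action([1.0, 2.5, 0.3, 9.0], num_arrivals=2))  # 1
```

Restricting the choice in this way keeps infeasible actions out of the collected experience.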

3.2.3. Reward $Re(t)$

As shown in Figure 4, after action $a(t)$ is executed in time slot $t$, the environment returns a reward to the platform, which is determined by the reward function $Re(t)$. The sum of current and future rewards is the return. The goal of the DDQN is to find a strategy that maximizes the expected return. Therefore, the reward function directly affects the results of reinforcement learning.

The objective function (14) aims to maximize the system utility while stabilizing the task queue. Therefore, the reward function is defined as the negative of the drift-plus-penalty term, namely, a weighted combination of the Lyapunov drift and the system utility:

Re (t) = - (\frac{{Q (t + 1)}^{2}}{2} - \frac{{Q (t)}^{2}}{2} - V β o (t))

(15)
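Equation (15) can be evaluated directly once the queue update of Equation (5) is applied. The following sketch is illustrative only; the function name `reward` and the numerical values are assumptions.

```python
def reward(q, r, o, v, beta):
    """Equation (15): negative Lyapunov drift plus V times the utility."""
    q_next = max(q - r, 0) + o          # Equation (5)
    drift = q_next ** 2 / 2 - q ** 2 / 2
    return -(drift - v * beta * o)


# Q(t) = 4, r(t) = 2, o(t) = 3, V = 10, beta = 1:
# Q(t+1) = 5, drift = 12.5 - 8 = 4.5, reward = -(4.5 - 30) = 25.5
print(reward(q=4, r=2, o=3, v=10.0, beta=1.0))  # 25.5
```

A larger $V$ makes admitting tasks more rewarding relative to the growth of the queue, which matches the role of $V$ in Equation (13).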

3.2.4. Action-Value Network Training and Parameter-Updating Mechanism

The action-value network is a neural network composed of an input layer, a hidden layer, and an output layer. The network parameter is $ω$. The input layer has three nodes, which take the three variables of state $S(t)$, namely, $O(t)$, $Q(t)$, and $N(t)$. The network outputs the values of all actions in state $S(t)$.
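The shape of such a network can be sketched in plain Python. The hidden width, the ReLU activation, the weight initialization, and all function names below are assumptions made for illustration; the text specifies only three inputs, one hidden layer, and one output per action.

```python
import random


def init_network(num_actions, hidden=16, seed=0):
    """Create parameters for a 3 -> hidden -> num_actions network."""
    rng = random.Random(seed)

    def layer(n_in, n_out):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_out)]
        return weights, [0.0] * n_out

    return layer(3, hidden), layer(hidden, num_actions)


def affine(weights, biases, x):
    """Compute W x + b for a single input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def action_values(params, state):
    """Map S(t) = [O(t), Q(t), N(t)] to one value per action a(t)."""
    (w1, b1), (w2, b2) = params
    hidden = [max(0.0, h) for h in affine(w1, b1, state)]  # ReLU (assumed)
    return affine(w2, b2, hidden)


params = init_network(num_actions=6)      # actions a(t) in {0, ..., 5}
q_values = action_values(params, [5, 12, 4])
print(len(q_values))  # 6
```

With $O^{max} + 1$ outputs, the index of each output corresponds to a candidate number of admitted tasks.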

The red line in Figure 3 shows the training process of the neural networks. The task allocation algorithm based on the DDQN is different from the supervised machine learning process. The algorithm has no prior knowledge. In order to enable the neural network to produce reliable and accurate action values, the experience data generated by the interaction between the agent and the environment are stored for the training of the neural network. In Figure 3, state

S (t)

, action

a (t)

, reward

R e (t)

, and the next state

S (t + 1)

are the experience data generated in time slot

t

, which will be stored in the replay buffer. During training, experience replay is used to randomly sample training data tuples

(S (t), a (t), R e (t), S (t + 1))

in the buffer for the gradient descent training of the neural networks. Experience replay not only reduces the correlation between training samples but also allows each sample to be reused many times, improving the efficiency with which experience data are used.
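The replay buffer described above can be sketched as a fixed-capacity queue with uniform random sampling. The class name `ReplayBuffer` and its method names are assumptions made for this illustration, not the authors' implementation.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (S(t), a(t), Re(t), S(t+1)) tuples."""

    def __init__(self, capacity: int):
        # A bounded deque drops the oldest tuple once the capacity is reached.
        self.data = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.data.append((state, action, reward, next_state))

    def sample(self, batch_size: int):
        # Uniform sampling without replacement from the stored tuples.
        return random.sample(list(self.data), min(batch_size, len(self.data)))

    def __len__(self):
        return len(self.data)


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    # Toy states (O(t), Q(t), N(t)); the values are placeholders.
    buffer.store((t, t, t), 0, 0.0, (t + 1, t + 1, t + 1))
batch = buffer.sample(2)
```

Because old tuples are overwritten first, a long run in the stable regime gradually crowds out experience gathered before the queue stabilized.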

The network parameters of the DDQN are updated mainly through temporal difference (TD) learning. In order to prevent the overestimation of action values caused by bootstrapping with the same neural network, the DDQN sets up two networks to calculate action values, namely, the target network and the evaluation network, as shown in Figure 5. They have the same network structure, but their parameters are not identical.

q_{targ}(S(t), a(t))

represents the value of

a (t)

calculated by the target network in state

S (t)

.

q_{eval}(S(t), a(t))

represents the value of

a (t)

calculated by the evaluation network in state

S (t)

.

The TD target

\hat{q} (t)

is

\hat{q}(t) = Re(t) + γ \cdot \max_{a'} q_{targ}(S(t+1), a')

(16)

The TD error

δ (t)

is

δ(t) = q_{eval}(S(t), a(t)) - \hat{q}(t)

(17)

The loss function is defined as

Loss = \frac{1}{batch} \sum_{t = 1}^{batch} δ(t)^{2}

(18)

The gradient descent method is used to minimize the loss of the evaluation network and update its parameters, while the parameters of the target network remain fixed. After the evaluation network has been trained a certain number of times, the parameters of the target network are replaced directly with those of the evaluation network.
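The quantities in Equations (16) to (18) and the periodic target update can be sketched as plain functions over lists of action values. This follows Equation (16) as printed, where the maximum is taken over the target network's outputs; the common Double DQN formulation instead selects the maximizing action with the evaluation network and evaluates it with the target network. All function names and the `sync_every` parameter are assumptions for illustration.

```python
def td_target(reward, next_q_target, gamma):
    """Eq. (16): reward plus the discounted best target-network value in S(t+1)."""
    return reward + gamma * max(next_q_target)


def td_error(q_eval_sa, target):
    """Eq. (17): evaluation-network value minus the TD target."""
    return q_eval_sa - target


def batch_loss(errors):
    """Eq. (18): mean squared TD error over a sampled batch."""
    return sum(e * e for e in errors) / len(errors)


def maybe_sync(step, eval_params, target_params, sync_every):
    """Hard update: copy the evaluation parameters every `sync_every` steps."""
    if step % sync_every == 0:
        return list(eval_params)
    return target_params


target = td_target(reward=2.0, next_q_target=[1.0, 3.0, 2.0], gamma=0.5)
error = td_error(q_eval_sa=3.0, target=target)
loss = batch_loss([error, 1.5])
```

In a real implementation the gradient of `loss` with respect to the evaluation network parameters would drive the update; only the arithmetic is shown here.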

3.2.5. Iterative Training

In order to increase the state experience with small

Q (t)

in the replay buffer, this paper designs an iterative training method for the infinite-horizon Markov decision problem.

In deep reinforcement learning, Markov decision problems are divided into finite-horizon and infinite-horizon problems, and their neural network training methods differ. For a finite-horizon problem, the evaluation network is usually trained iteratively over many episodes, so different states are stored in the replay buffer and appear repeatedly during training. However, the task selection and allocation problem in MCS is an infinite-horizon Markov decision process, and an episode has no terminal state. Therefore, training is usually performed within a single episode and stops after a specified number of steps. In the Lyapunov optimization problem, once the task queue reaches a stable state, it no longer returns to the unstable, pre-stabilization values of

Q (t)

. Since the experience data in the replay buffer are constantly overwritten during training, far more experience is stored from after the queue stabilizes than from before. This shortage of pre-stabilization experience in the replay buffer leads to unsatisfactory performance of the TSAS before the task queue is stabilized.

Therefore, this paper designs an iterative training method that conforms to the Lyapunov optimization problem. The specific method is as follows: The training time is divided into three stages. The first stage is from the beginning of training until the loss is small enough, when the task queue is stable. In the second stage, the system state is reset to the initial state, that is,

Q (0) = 0

. When the queue reaches the stable state, it will be reset to the initial state again, which will be repeated many times with iterative training so as to fully train the queue state before stabilization. After several instances of resetting the system state and iterative training, the third stage is started; that is, the stable state is trained again until the end of the training. The training results are shown in Section 4.1.
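The three-stage schedule can be sketched as a predicate that says when the queue should be reset to the initial state. This is a simplification under stated assumptions: the first stage is bounded here by a fixed step count `warmup` rather than by a loss threshold, and resets occur at a fixed interval `reset_every` rather than on detecting stability. All names are hypothetical.

```python
def should_reset_queue(step, warmup, reset_every, iterative_end):
    """Return True when Q(t) should be reset to Q(0) = 0 at this step.

    Stage 1: step < warmup               -> ordinary training, no resets.
    Stage 2: warmup <= step < iterative_end -> reset every `reset_every` steps.
    Stage 3: step >= iterative_end       -> ordinary training again.
    """
    in_iterative_stage = warmup <= step < iterative_end
    return in_iterative_stage and (step - warmup) % reset_every == 0


resets = [s for s in range(100)
          if should_reset_queue(s, warmup=20, reset_every=10, iterative_end=60)]
print(resets)
```

Each reset forces the agent back through the low-backlog states, so the replay buffer keeps receiving experience with small queue lengths during the second stage.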

3.2.6. Priority Experience Replay and Importance Sampling Weight

In order to further improve the training efficiency, this paper also introduces priority experience replay and importance sampling weights. In DRL, the traditional experience replay method stores the experience generated by agent interaction with the environment in the replay buffer and then randomly samples these experiences during the training. With the training of the evaluation network,

Q (t)

gradually increases, resulting in an uneven distribution of states in the replay buffer. This random sampling may lead to some important experiences being ignored; for example, states with small

Q (t)

cannot be fully trained, while some less important experiences are sampled frequently, resulting in a poor training effect. Priority experience replay addresses this problem by assigning different priorities to experiences. These priorities are calculated from the TD error, which represents the gap between the current estimated action value and the target action value. An experience with a large TD error differs greatly from the expected return, so it is considered more important and should be sampled more often for training. Therefore, in priority experience replay, experiences are not sampled with equal probability but according to their priority. High-priority experiences are selected more frequently, but relying on them exclusively may bias the learning result toward a local optimum.

The sampling bias needs to be corrected by importance sampling weights; that is, when calculating the gradient, an importance weight is used to weight the loss of each sample to correct the bias. If an experience should occur more frequently under the target strategy, it will be given more weight during training; otherwise, it will be given less weight. The importance sampling weight helps balance the learning of different types of experiences by adjusting the contribution of each sampling experience so as to ensure that the learning process pays attention to both important experiences and low-priority experiences that may contain key information.
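The paper does not state the exact priority and weight formulas, so the following sketch assumes the widely used proportional scheme: sampling probability proportional to the absolute TD error raised to a power `alpha`, and importance sampling weights of the form (N · P(i))^(-beta_is), normalized by their maximum. The names `alpha`, `beta_is`, and `eps`, and their default values, are assumptions and are unrelated to the utility weight β used elsewhere in this paper.

```python
def sampling_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Probability of drawing each experience, proportional to |TD error|^alpha."""
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def importance_weights(probs, beta_is=0.4):
    """Weights that down-weight frequently sampled experiences in the loss."""
    n = len(probs)
    weights = [(n * p) ** (-beta_is) for p in probs]
    max_w = max(weights)
    return [w / max_w for w in weights]


probs = sampling_probabilities([0.1, 1.0, 5.0])
weights = importance_weights(probs)
```

Experiences with larger TD errors are drawn more often but contribute with a smaller weight to the loss, which is the correction described in the paragraph above.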

3.2.7. Action Mask

Action masking [28] is an important training technique in DRL. It improves learning efficiency and performance by restricting the agent's action selection, which also simplifies that selection. The action space in the task selection and allocation of MCS is huge, but only some of the actions are feasible in the current state. Without an action mask, agents may choose invalid actions, which not only wastes resources but also destabilizes the training process. With action masking, agents focus on feasible actions, which reduces the time spent exploring invalid actions, improves the training efficiency, and speeds up the training process.

The action mask method is designed as follows:

mask(a(S(t))) = mask(\arg\max_{a} q(S(t), a)) = \begin{cases} q(S(t), a), & a \text{ is valid in } S(t) \\ 0, & \text{otherwise} \end{cases}

(19)

a(t) = \arg\max_{a} (mask(a(S(t))))

(20)

An action mask is usually used in policy gradient learning, where it sets the gradient of the log-probability of invalid actions to zero so as to prevent the network from selecting them. The DDQN, in contrast, is a value-based method, and the above action mask method does not change the gradient of invalid actions. However, the action mask method still works for the following reason: in the DDQN, each time a tuple

(S (t), a (t), R e (t), S (t + 1))

is taken from the replay buffer, the parameters of the action-value network are updated in the following ways:

ω_{new} \leftarrow ω_{now} - α \cdot δ(t) \cdot \nabla_{ω} Q(S(t), a(t); ω_{now})

(21)

where

δ (t)

is obtained from Equation (17). It can be seen from the above formula that when the state is

S (t)

, only the value function of action

a (t)

selected by Formula (16) will be updated, and its action value will gradually increase. Because invalid actions are shielded by the action mask method, their value functions are never updated, so their values remain significantly lower than those of valid actions, and they are not selected by the evaluation network.

3.3. Task Allocation Stage

In the objective Functions (9) and (10), the optimization goal of Formula (10) is to maximize the total importance of all tasks being performed. This is mainly realized through two aspects.

On the one hand, on the condition of maintaining the stability of the queue, we can maximize the number of tasks to be performed in the task queue and improve the utilization of idle participants. As the number of tasks being performed increases, the sum of task importance also increases. By selecting tasks with higher importance to enter the queue, the importance of performed tasks is further increased. This has been realized in Section 3.2.

On the other hand, we can maximize the importance by formulating a reasonable task allocation scheme

x_{s i} (t)

. This is achieved in the task allocation stage.

In the task allocation stage, the platform allocates tasks in the task queue to participants to perform. Because the number of participants is limited, when the number of tasks is large, some tasks in the task queue cannot be allocated and performed immediately. The order in which tasks are performed is determined mainly by their importance: tasks with high importance are assigned first, while tasks with low importance wait in the queue. The sum of the importance of all tasks performed in Formula (10) is converted into the sum of the importance of the tasks performed in each time slot, which greatly reduces the difficulty and complexity of the problem, namely,

\max_{x_{si} (t)} I (t) = \max_{x_{si} (t)} \sum_{s = 1}^{Q (t)} \sum_{i = 1}^{N} F_{s} (t) x_{s i} (t)

The above optimization problem can be solved by simply ranking the queued tasks by importance and selecting the highest-ranked ones.
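A minimal sketch of this ranking step is given below. It assumes that each participant performs at most one task per time slot and that each task needs exactly one participant; the function name `allocate_by_importance` and the sample importance values are hypothetical.

```python
def allocate_by_importance(queue_importance, num_participants):
    """Pick the most important queued tasks, one per available participant.

    Returns the indices of the selected tasks and their total importance I(t).
    """
    ranked = sorted(range(len(queue_importance)),
                    key=lambda s: queue_importance[s],
                    reverse=True)
    chosen = ranked[:num_participants]
    return chosen, sum(queue_importance[s] for s in chosen)


# Four queued tasks, two participants: the two most important tasks are executed.
chosen, total_importance = allocate_by_importance([12.0, 31.5, 7.0, 25.0], 2)
print(chosen, total_importance)
```

Under these assumptions, sorting by importance maximizes the per-slot sum, because any other selection of the same size replaces a chosen task with a less important one.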

It should be noted that, in fact, when training the action-value network of the DDQN, experience data need to be obtained after the above two stages. In state

S (t)

, the MCS system selects, assigns, and executes tasks and receives rewards

R e (t)

and a new task queue

Q (t + 1)

. When the new tasks

O (t + 1)

arrive in the next slot, the number of participants who can perform tasks

N (t + 1)

is updated, and then the next state

S (t + 1)

is obtained. Therefore, the above two stages are a whole and are carried out simultaneously. Without the task allocation stage, the action-value network would not be able to obtain sufficient experience data and complete training.

Based on all of the above, the design of the TSAS for MCS is as follows (Algorithms 1 and 2).

Algorithm 1. TSAS based on Lyapunov optimization and DDQN
1	Initialize the evaluation network $Q_{ω}$ with random network parameters $ω$.
2	Copy the same parameters $ω' \leftarrow ω$ to initialize the target network $Q_{ω'}$.
3	Initialize the replay buffer.
4	Input: environment state $S(t)$: the number of tasks reaching the platform $O(t)$, the length of the task queue $Q(t)$, and the number of participants $N(t)$.
5	Output: the number of tasks entering the task queue $a(t)$ and task allocation scheme $x(t)$.
6	Set $Q(0) = 0$.
7	for episode $e = 1 \to E$ do:
8	Obtain environment state $S(t)$.
9	Input $S(t)$ into the evaluation network, output the action values $q_{eval}(S(t), a)$, and select action $a(t)$ with the $ϵ$-greedy policy.
10	Execute $a(t)$ in $S(t)$, and determine the tasks to enter the task queue according to $F_{s}(t)$.
11	Determine the task allocation and execution scheme according to $F_{s}(t)$: $x(t) = \{x_{si}(t)\}$.
12	After the tasks are completed, obtain the next state $S(t+1)$ and reward $Re(t)$.
13	Store the experience data tuple $\{S(t), a(t), Re(t), S(t+1)\}$ in the replay buffer.
14	If $learn = True$:
15	Execute Algorithm 2: train the evaluation network, and update the parameters $ω$.
16	If $replace\_target = True$:
17	Update the target network parameters: $ω' \leftarrow ω$.
18	end

Algorithm 2. Training and parameter update of action-value network
1	Sample $N$ groups of data from the replay buffer: $\{(S(i), a(i), Re(i), S(i+1))\}_{i = 1, \dots, N}$
2	for $i = 1 \to N$ do:
3	Input $S(i)$ into the evaluation network to calculate the action value: $q_{eval}(S(i), a(i))$;
4	Input $S(i+1)$ into the target network to calculate the TD target: $\hat{q}(i) = Re(i) + γ \cdot \max_{a'} q_{targ}(S(i+1), a')$;
5	Calculate the error: $δ(i) = q_{eval}(S(i), a(i)) - \hat{q}(i)$.
6	end
7	Update the parameters $ω$ of the evaluation network $Q_{ω}$ to minimize the $Loss$: $Loss = \frac{1}{N} \sum_{i = 1}^{N} δ(i)^{2}$.

4. Experimental Validation and Performance Evaluation

Simulation experiments were conducted to evaluate the performance of the TSAS.

4.1. Training Performance

An MCS system consisting of an MCS platform and participants was simulated. The number of participants

N (t)

in each time slot follows a Poisson distribution of

λ = 5

. The number of tasks arriving at the platform

O (t)

follows a Poisson distribution of

λ = 10

,

O (t) \in [0, 20]

. The importance of arriving tasks

F (t)

follows a Gaussian distribution with a mean of 20 and a variance of 10,

F (t) \in (0, 40)

. Set

V = 1000

,

β = 1

.
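The simulation settings above can be sketched with the standard library alone. Two points are assumptions, because the text does not specify them: arrivals above 20 are clipped to 20, and importance values outside (0, 40) are redrawn. The Poisson sampler uses Knuth's multiplication method, and all function names are hypothetical.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw a Poisson(lam) variate with Knuth's method (suitable for small lam)."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def sample_importance(rng, mean=20.0, variance=10.0, low=0.0, high=40.0):
    """Gaussian importance F(t), redrawn until it falls inside (low, high)."""
    while True:
        value = rng.gauss(mean, math.sqrt(variance))
        if low < value < high:
            return value


def sample_slot(rng, lam_tasks=10.0, lam_participants=5.0, max_tasks=20):
    """Return (O(t), N(t), importance list) for one time slot."""
    arrivals = min(sample_poisson(lam_tasks, rng), max_tasks)  # assumed clipping
    participants = sample_poisson(lam_participants, rng)
    importance = [sample_importance(rng) for _ in range(arrivals)]
    return arrivals, participants, importance


rng = random.Random(0)
slots = [sample_slot(rng) for _ in range(2000)]
mean_arrivals = sum(s[0] for s in slots) / len(slots)
```

Note that `random.gauss` takes a standard deviation, so the stated variance of 10 is converted with a square root.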

In the experiments, an evaluation network and a target network were set up. The two neural networks have the same structure and are composed of an input layer, two hidden layers, and an output layer, which contain 3, 32, 16, and 21 nodes, respectively. Algorithm 1 is run to obtain experience data, which are stored in the replay buffer. The storage capacity of the replay buffer is 20,000. To enrich experience data, the

ϵ

-greedy strategy is adopted to select actions.

ϵ = 0.1 + 0.0001 \times \text{training times}

, with a maximum of 0.9. The experiment runs 100,000 times. Each time the experiment runs, experience data are stored in the replay buffer. When the parameters of the evaluation network are updated 200 times, the target network updates its parameters once. Both the evaluation network and the target network use the action mask method to reduce the feasible range of actions. The action mask is set as follows:

mask(a(S(t))) = mask(\arg\max_{a} q(S(t), a)) = \begin{cases} q(S(t), a), & a \leq O(t) \\ 0, & \text{otherwise} \end{cases}

(22)

a(t) = \arg\max_{a} (mask(a(S(t))))

(23)
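A sketch of masked greedy selection with the validity rule a ≤ O(t) follows. It deliberately swaps in a different fill value: Equation (22) assigns 0 to invalid actions, whereas this sketch assigns negative infinity, a common variant that keeps an invalid action from winning the argmax if all valid action values happen to be negative. The function name `masked_greedy_action` is hypothetical.

```python
import math


def masked_greedy_action(q_values, arrivals):
    """Return the highest-valued action among those satisfying a <= O(t).

    q_values[a] is the action value of admitting a tasks; arrivals is O(t).
    """
    masked = [q if a <= arrivals else -math.inf
              for a, q in enumerate(q_values)]
    return max(range(len(masked)), key=masked.__getitem__)


# Action 3 has the largest raw value but is invalid because only 2 tasks arrived.
action = masked_greedy_action([-5.0, -2.0, -3.0, 4.0], arrivals=2)
print(action)
```

With the network described above, `q_values` would hold the 21 outputs for a = 0, ..., 20.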

In order to observe the impact of iterative training methods, two training methods were adopted. Firstly, the network was trained 80,000 times using ordinary training methods without iterative training, and the training results are shown in Figure 6.

Then, the iterative training method was used to continue training the network 100,000 times. The specific steps are as follows: First, the evaluation network runs to generate 10,000 experiences and stores them in the replay buffer; second, the evaluation network starts to be trained. Every 1000 runs of the evaluation network, the system state is initialized, so the states before queue stabilization are iteratively trained; third, after the evaluation network runs 50,000 times, ordinary training is resumed, and the states after the queue is stable are trained again until the system runs a total of 100,000 times. The training results are shown in Figure 7.

In Figure 6, it can be seen that as the model continuously learns and optimizes, the task queue

Q (t)

eventually stabilizes, and the loss decreases. This indicates that the TSAS is feasible and can keep the task queue stable.

In Figure 6a, when

t \in [0, 41,000)

,

Q (t)

fluctuates around smaller values; when

t \in [41,000, 50,000]

,

Q (t)

significantly increases and tends to stabilize. Below is an analysis of the experimental results. When

t \in [0, 20,000]

, the replay buffer is accumulating training data, and the evaluation network has not yet started training. The system adopts the

ϵ

-greedy strategy to select actions and randomly generates numbers in the interval

[0, 1)

. When the random number is less than

ϵ

, the evaluation network calculates the number of tasks that can enter the task queue

o (t)

. When the random number is greater than

ϵ

,

o (t)

is generated in the interval

[0, O (t)]

randomly. Therefore, within

t \in [0, 20,000]

,

Q (t)

is small, and

o (t)

fluctuates within

[0, O (t)]

. When

t \in [20,001, 41,000)

, the evaluation network starts training, and

ϵ

gradually increases from 0.1 to 0.9.

o (t)

is increasingly computed by the still-untrained evaluation network, so

o (t)

in Figure 6b decreases. Due to the storage of experience data generated by untrained networks in the replay buffer, the

L o s s

is large, and

Q (t)

is small. When

t \in [41,001, 100,000]

, as the evaluation network is trained and optimized,

o (t)

continues to increase, and

Q (t)

also increases and tends to stabilize, thereby maximizing the expected value of the reward.

In Figure 6c, it can be seen that the

L o s s

gradually decreases with training.

Q (t)

is relatively large and stable. In order to investigate whether states before queue stabilization, that is, states with a small value of

Q (t)

, have been trained, this paper continues to train the network using the iterative training method. The results are shown in the following figures.

In Figure 7a, it can be seen that the model trained 80,000 times can quickly make

Q (t)

reach stability. When

t \in [10,000, 60,000]

, the state is initialized every 1000 times, so

Q (t)

is reset to 0, and then

Q (t)

quickly reaches stability again.

In Figure 7c, it can be seen that the

L o s s

increases sharply and then gradually decreases with the training of the model. This is because, in classic training, as shown in Figure 6, although the queue is stable and the loss gradually decreases and converges after training 80,000 times, the states before the queue is stable have not been fully trained. Therefore, when

Q (t)

is small, the loss is still very large. After iterative training, regardless of whether the task queue is stable, that is, regardless of the size of

Q (t)

, the evaluation network can solve the objective function.

4.2. Comparison Experiment of Different Algorithms

In order to verify the performance of the TSAS, this paper compares the TSAS that has been trained according to Section 4.1 with three algorithms: the Classical Lyapunov optimization algorithm, the QPA algorithm [29], and the offline task selection and allocation algorithm.

Classical Lyapunov optimization algorithm: As described in Section 3.1, this algorithm is the mainstream and general solution for solving task allocation problems in MCS at present. In the drift-plus-penalty functions, there are unknown variables, which cannot be predicted before executing the tasks. Therefore, the Classical Lyapunov optimization algorithm solves the problem by eliminating unknown variables and minimizing one of the upper bounds of the drift-plus-penalty function through appropriate scaling methods.

QPA algorithm: This algorithm maximizes the importance of the tasks performed by controlling

o (t)

. In the QPA algorithm, all tasks can enter the platform and tasks that have not been executed will be stored in the task queue, waiting to be executed later.

Offline task selection and allocation algorithm: In order to compare the difference between the online task allocation method TSAS and the offline task allocation method, this paper also presents comparative experiments on the offline task selection and allocation algorithm. This algorithm selects and allocates tasks according to the system status of the current time slot and does not set the task queue.

The experiment runs for a total of 1000 time slots.

t \in \{0, 1, \dots, 999\}

. In reality,

O (t)

varies at different time slots. Usually, there are more tasks that arrive during the day and fewer at night. Therefore, in order to simulate the changes in

O (t)

more realistically, this experiment sets

O (t)

in different time slots as follows: when

0 < t < 600

,

O (t)

follows a Poisson distribution of

λ = 10

; when

600 < t < 1000

,

O (t)

follows a Poisson distribution of

λ = 5

.

O (t) \in [0, 20]

. The importance of arriving tasks

F (t)

follows a Gaussian distribution with an average of 20 and a variance of 10.

F (t) \in (0, 40)

.

Comparison experiments were conducted on different algorithms using the following evaluation metrics:

$Q (t)$ refers to the number of tasks waiting to be performed in the task queue.

$Q (t + 1) = \max [Q (t) - r (t), 0] + o (t)$

The size of $Q (t)$ depends on $o (t)$ and $r (t)$. Maintaining the stability of the task queue is an important goal of the online and long-term task allocation problem of the MCS system. By observing the change in $Q (t)$, we can intuitively judge whether the task queue has reached a stable state. The result of the comparison experiment on $Q (t)$ is shown in Figure 8.
$o (t)$ refers to the number of tasks entering the queue. $o (t)$ is an important part of the task selection and allocation method, which directly determines the queue length and its stability, system efficiency, and the importance of the tasks performed. When there are few participants, the system should reduce $o (t)$ in order to maintain the stability of the queue. When there are many participants, the system should increase $o (t)$ while maintaining queue stability in order to maximize the system utility and the performed task importance. Therefore, by observing $o (t)$ , we can evaluate the advantages and disadvantages of different algorithms. The result of the comparison experiment on $o (t)$ is shown in Figure 9.
The $Return$ represents the total rewards:

$Return = \sum_{t = 0}^{1000} Re(t)$

The larger the $Return$, the smaller the drift-plus-penalty function, and the closer the solution is to the optimization goal. The result of the comparison experiment on the $Return$ is shown in Figure 10a.
$U\_sum$ represents the total utility of the system. The more tasks that enter the queue, the greater the total utility.

$U\_sum = \sum_{t = 0}^{1000} β \cdot o (t)$

A larger $U\_sum$ is not always better. Because the number of participants and their ability to execute tasks are limited, when too many tasks are in the queue, $U\_sum$ is large but the queue is seriously congested, resulting in a long delay before tasks are executed. The result of the comparison experiment on $U\_sum$ is shown in Figure 10b.
$P\_sum$ represents the total number of tasks actually executed:

$P\_sum = \sum_{t = 0}^{1000} r (t)$

When the number of participants and their ability to perform tasks are limited, $P\_sum$ is largely constrained by their ability to perform tasks. The result of the comparison experiment on $P\_sum$ is shown in Figure 10c.
$I\_sum$ indicates the total importance of the tasks performed:

$I\_sum = \sum_{t = 0}^{1000} I (t) = \sum_{t = 0}^{1000} \sum_{s = 1}^{Q (t)} \sum_{i = 1}^{N} F_{s} (t) x_{s i} (t)$

The larger $I\_sum$ is, the more important the tasks performed are. The result of the comparison experiment on $I\_sum$ is shown in Figure 10d.
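The metrics above can be accumulated in a single pass over a recorded run, as in the following sketch. The function name `run_metrics`, the input lists, and the toy values are assumptions; the queue update and reward follow the expressions given in this paper.

```python
def run_metrics(admitted, executed, importance_done, q0=0, v=1000.0, beta=1.0):
    """Accumulate Return, U_sum, P_sum, and I_sum over a recorded run.

    admitted[t] is o(t), executed[t] is r(t), and importance_done[t] is I(t).
    """
    q = q0
    total_return = 0.0
    for o_t, r_t in zip(admitted, executed):
        q_next = max(q - r_t, 0) + o_t                        # queue update
        total_return += -(q_next ** 2 / 2 - q ** 2 / 2 - v * beta * o_t)
        q = q_next
    return {
        "Q_final": q,
        "Return": total_return,
        "U_sum": beta * sum(admitted),
        "P_sum": sum(executed),
        "I_sum": sum(importance_done),
    }


metrics = run_metrics(admitted=[3, 2, 0], executed=[0, 2, 3],
                      importance_done=[0.0, 45.0, 50.0], v=10.0, beta=1.0)
print(metrics)
```

Because the drift terms telescope, the accumulated return equals V times the total utility minus half the difference between the squared final and initial queue lengths, which is why a run that ends with a long queue scores poorly on $Return$ even when its utility is high.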

Figure 8 shows the variation in

Q (t)

under different algorithms. The following can be seen: (1) The

Q (t)

of the offline task selection and allocation algorithm is always 0. (2) When

0 < t < 600

, the queue of the Classical Lyapunov optimization algorithm and TSAS can gradually stabilize, while the queue of the QPA algorithm keeps growing and cannot reach a stable state. (3) When

600 < t < 1000

, the queue of the Classical Lyapunov optimization algorithm and TSAS gradually decreases, while the queue of the QPA algorithm no longer grows and appears to remain stable.

The reasons for the above result are as follows:

The offline task selection and allocation algorithm makes decisions based on $O (t)$ and $N (t)$ in order to maximize the number of tasks to be executed. There is no backlog of queued tasks, so the queue length is always 0. For this reason, some participants will be idle when the number of tasks decreases while the number of participants increases, so the offline task selection and allocation algorithm executes fewer tasks than the online task selection and allocation algorithms, as shown in Figure 10.
When $0 < t < 600$ , $O (t)$ follows a Poisson distribution of $λ = 10$ , meaning that $O (t)$ is greater than $N (t)$ , resulting in a backlog of queued tasks. The QPA algorithm seeks to maximize $U (t)$ greedily without limiting the queue length, so $Q (t)$ continues to grow and cannot stabilize. $Q (t)$ of the Classical Lyapunov optimization algorithm tends to stabilize at around 1000, while TSAS tends to stabilize at around 500. $Q (t)$ of the TSAS is significantly smaller than that of the Classical Lyapunov optimization algorithm. This is because the TSAS and the Classical Lyapunov optimization algorithm are both based on the Lyapunov optimization algorithm to maintain the stability of the task queue and pursue utility. The difference between the two algorithms lies in their solution. The Classical Lyapunov optimization algorithm approximates the unknown variable, minimizes an upper bound of the drift-plus-penalty function, and obtains an approximate optimal solution, thus causing high queue congestion. The TSAS can directly obtain the optimal solution of the drift-plus-penalty function, improving the defects of the Classical Lyapunov optimization algorithm. It can be seen that the TSAS can control the queue backlog and reduce the delay in task execution, which has important practical significance.
When $600 < t < 1000$ , $O (t)$ follows a Poisson distribution of $λ = 5$ . The QPA algorithm greedily pursues the maximization of $U (t)$ ; therefore, $Q (t)$ of QPA is equal to the average value of $N (t)$ . At this point, the queue tends to stabilize. In fact, once $O (t)$ increases, $Q (t)$ will continue to grow, so the QPA algorithm cannot maintain the stability of the task queue. Due to the decrease in $O (t)$ , the $o (t)$ of the Classical Lyapunov optimization algorithm and TSAS is reduced. The number of tasks executed is greater than the number of new tasks added to the queue, and the backlog of tasks in the queue is constantly being executed, so $Q (t)$ begins to decrease. Tasks that have not been executed are stored in the task queue instead of being directly discarded. Therefore, when participants have free time, the backlog of tasks in the queue will be executed. This increases the execution rate of tasks while also increasing the system’s utility and improving the utilization of idle participants.

Figure 9 shows the selection decision

o (t)

of the MCS system under different algorithms. (1) The

o (t)

of the offline task selection and allocation algorithm is always consistent with the number of participants. (2) The

o (t)

of the QPA algorithm is always consistent with

O (t)

. (3) When

0 < t < 600

, the TSAS and the Classical Lyapunov optimization algorithm both admit only some of the arriving tasks into the task queue rather than all of them. However, the

o (t)

of the Classical Lyapunov optimization algorithm fluctuates greatly, while the

o (t)

of the TSAS has a small fluctuation interval. (4) When

600 < t < 1000

, the

o (t)

of the TSAS and the Classical Lyapunov optimization algorithm is consistent with

O (t)

.

The reasons for the above results are as follows:

(1): The offline task selection and allocation algorithm does not consider the long-term utility of the system. As the $O (t)$ set in the experiment is greater than or equivalent to the number of participants, in order to maximize the number of tasks performed, $o (t)$ is consistent with the number of participants subject to the task execution capability of the system.
(2): The QPA algorithm pursues the maximization of system benefits $U (t)$ greedily without considering the stability of task queues. All tasks arriving at the platform enter the task queue. Therefore, its $o (t)$ is consistent with $O (t)$ . When $0 < t < 600$ , $O (t)$ is large, and $o (t)$ is also large, which is greater than in other algorithms; when $600 < t < 1000$ , it is consistent with the TSAS and the Classical Lyapunov optimization algorithm. $O (t)$ decreases, and $o (t)$ also decreases.
(3): When $0 < t < 600$, $O (t)$ is relatively large. In order to maintain the stability of the queue, the TSAS and the Classical Lyapunov optimization algorithm both admit only some of the arriving tasks into the task queue rather than all of them. In addition, the $o (t)$ of the Classical Lyapunov optimization algorithm is obtained by solving the drift-plus-penalty function with the classical solution mentioned before, which depends mainly on the current state $S (t)$, so $o (t)$ fluctuates greatly. The TSAS solves Formula (14) with the DDQN, so it depends not only on the current state $S (t)$ but also takes future states into account, and the fluctuation range of $o (t)$ is relatively small. In Figure 10, it can be seen that the fluctuation of $o (t)$ causes a difference in the total importance of tasks being performed, as detailed in the following analysis.
(4): When $600 < t < 1000$ , $O (t)$ is equal to the number of participants. In order to maximize the system utility and the importance of performed tasks, the TSAS and the Classical Lyapunov optimization algorithm put all arriving tasks into the task queue, so their $o (t)$ is consistent with $O (t)$ .

Due to the complexity of the actual situation, it is impossible to judge the merits of the algorithm by a single indicator. In the experiment, it is necessary to comprehensively consider the following four indicators to evaluate different algorithms.

In Figure 10, it can be seen that (1) $U\_sum$, $P\_sum$, and $I\_sum$ of the offline task selection and allocation algorithm are lower than those of the other three algorithms, and its $Return$ is 0. (2) $U\_sum$ and $I\_sum$ of the QPA algorithm are the highest, and its $P\_sum$ is equal to that of the Classical Lyapunov optimization algorithm and the TSAS. Its $Return$ is negative, far less than that of the Classical Lyapunov optimization algorithm and the TSAS. (3) The $Return$ and $I\_sum$ of the TSAS are both higher than those of the Classical Lyapunov optimization algorithm, its $P\_sum$ is equal to that of the Classical Lyapunov optimization algorithm, and only its $U\_sum$ is slightly lower than that of the Classical Lyapunov optimization algorithm.

(1) Because the offline task selection and allocation algorithm does not consider long-term task allocation and has no backlog of queued tasks, some participants are idle when the number of tasks decreases and the number of participants increases. Therefore, its performance is lower than that of the other three long-term task allocation algorithms. The offline task selection and allocation algorithm makes decisions based only on the current system state, so $Return$ has no meaning for this algorithm and was not calculated in this experiment.

(2) The QPA algorithm focuses on achieving optimal long-term utility, with all arriving tasks $O (t)$ entering the task queue for execution. Therefore, its $U\_sum$ and $I\_sum$ are the highest. Due to the limitation of $N (t)$, its $P\_sum$ is equal to that of the Classical Lyapunov optimization algorithm and the TSAS. However, its $Return$ is negative, far less than that of the Classical Lyapunov optimization algorithm and the TSAS, so the optimal solution cannot be obtained. This is because QPA cannot control the growth of the queue, and its task queue cannot reach stability. Therefore, it is not suitable for long-term task selection and allocation problems.

(3) Since $Return$ is closely related to the objective function (14), the larger the $Return$, the closer the solution is to the optimal solution. The TSAS can directly obtain the optimal solution of the objective function, so its $Return$ is the largest. Figure 9 shows that, because $o(t)$ of the TSAS tends to be more stable than that of the Classical Lyapunov optimization algorithm, the TSAS can select more important tasks to enter the queue. The Classical Lyapunov optimization algorithm, in order to maintain the stability of the queue, sets $o(t)$ to 0 in many time slots, so important tasks cannot be selected to enter the queue, and its $I_{sum}$ is lower than that of the TSAS. Because the number of participants is small, $P_{sum}$ is mainly restricted by the number of participants. To maximize the system utility, both algorithms have all participants perform tasks, so their $P_{sum}$ values are the same. The $U_{sum}$ of the Classical Lyapunov optimization algorithm is larger than that of the TSAS, which leads to a large queue backlog for the Classical Lyapunov optimization algorithm. Thus, the TSAS is better than the Classical Lyapunov optimization algorithm.
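One mechanism that can produce many slots with $o(t) = 0$ is sketched below. It assumes a linear utility with a hypothetical per-task weight `unit_utility`, so that the per-slot drift-plus-penalty term $Q(t)\,o - V \cdot \text{unit\_utility} \cdot o$ is linear in $o$ and is minimized at an endpoint. The paper's utility and the exact rule used in the experiment may differ, and the names `classical_admission` and `run_classical` are illustrative only.

```python
def classical_admission(q, offered, v, unit_utility):
    """Per-slot drift-plus-penalty choice under an assumed linear utility.

    Minimises q * o - v * unit_utility * o over o in {0, ..., offered}.
    The expression is linear in o, so the minimiser is an endpoint:
    admit everything while the backlog is below the threshold, else nothing.
    """
    return offered if q < v * unit_utility else 0


def run_classical(slots, offered=8, capacity=5, v=100, unit_utility=1.0):
    """Simulate the queue and return the admitted counts o(0..slots-1)."""
    q = 0
    admitted_per_slot = []
    for _ in range(slots):
        o = classical_admission(q, offered, v, unit_utility)
        admitted_per_slot.append(o)
        q = max(q - capacity, 0) + o  # assumed standard backlog recursion
    return admitted_per_slot
```

In this toy setting, `run_classical(1000)` only ever returns the values 0 and `offered`: once the backlog reaches the threshold, admission switches off until enough tasks have been performed. Tasks offered in those slots are rejected regardless of their importance, which is the effect described above for $I_{sum}$.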

In summary, the TSAS based on the DDQN is more suitable for solving the drift-plus-penalty function in Lyapunov optimization. It dynamically adjusts the system’s action to optimize performance as much as possible while maintaining system stability. Therefore, it can balance the system stability and utility in the drift-plus-penalty function.

4.3. Impact of $V$ on TSAS

To study the impact of $V$ on the TSAS, we set $V = 100$, $V = 500$, $V = 1000$, and $V = 2000$ and ran the TSAS for a total of 1000 time slots, $t \in \{0, 1, \dots, 999\}$.

Figure 11a shows the variation in $Q(t)$ under different $V$. As $V$ increases, the stable value of $Q(t)$ also increases. This is determined by the objective function (14). When $O(t)$ is determined, because $o(t) \leq O(t)$, the larger the $V$, the longer the time required for $Q(t)$ to reach stability, and the larger the stable value of $Q(t)$. Figure 11b shows the changes in $o(t)$ with different $V$. The larger the $V$, the larger the $o(t)$ and the larger its fluctuation range. The larger the $V$, the greater the weight of system utility. Therefore, it can be concluded that as $V$ increases, the TSAS maximizes the $Return$ by increasing $o(t)$, while the fluctuation range of $o(t)$ also increases.
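The dependence of the stable backlog on $V$ can be made explicit in a simplified setting. The following derivation is a sketch under assumptions that are not taken from the paper: the per-slot drift-plus-penalty term has the common form $Q(t)\,[o(t) - n(t)] - V\,U(t)$, the utility is linear, $U(t) = u\,o(t)$ with a hypothetical per-task weight $u > 0$, and the offered tasks are bounded, $O(t) \leq O_{\max}$. The TSAS solves the problem with the DDQN rather than with this closed-form rule, so the result indicates only the trend.

$$
\begin{aligned}
o^{*}(t) &= \arg\min_{0 \leq o \leq O(t)} \big[\, Q(t)\,o - V u\, o \,\big]
= \begin{cases} O(t), & Q(t) < V u, \\ 0, & Q(t) \geq V u, \end{cases} \\
Q(t+1) &= \max\{Q(t) - n(t),\, 0\} + o^{*}(t) \;\leq\; V u + O_{\max}
\quad \text{whenever } Q(t) \leq V u + O_{\max}.
\end{aligned}
$$

The second line follows by cases: if $Q(t) < V u$, at most $O_{\max}$ tasks are added, and if $Q(t) \geq V u$, nothing is admitted and the backlog cannot grow. Under these assumptions the backlog settles near a level proportional to $V$, and reaching that level from an empty queue takes longer for larger $V$, which agrees qualitatively with Figure 11a.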

Figure 12 shows the variation in $Return$, $U_{sum}$, and $P_{sum}$ with different $V$. As $V$ increases, $Return$ and $U_{sum}$ gradually increase, but $P_{sum}$ stays the same. This is determined by Formula (15). When $V$ increases, the $Return$ also increases. $V$ determines the weight of the utility in the drift-plus-penalty function, so when $V$ is large, $U_{sum}$ is high. Because $O(t)$ is larger than $N(t)$, $P_{sum}$ is limited by $N(t)$. When $V$ increases, $P_{sum}$ cannot increase accordingly.

From the above, it can be seen that, in practical applications, $V$ should be determined based on $O(t)$ and $N(t)$ in order to balance the queue length and system utility. Although a large $V$ can increase system utility, it can also lead to an excessive queue backlog and the delayed execution of some tasks. When $N(t)$ is determined and $V$ is high, that is, when the weight of system utility is high, the system utility is high. However, the stable value of $Q(t)$ also increases, resulting in the delayed execution of some low-importance tasks. When $V$ is small, the weight of system utility decreases, and $U(t)$ decreases while the stable value of $Q(t)$ decreases; thus, the delay in performing tasks is shortened. Therefore, both the length of the queue backlog and the system utility should be considered when choosing $V$.
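The trade-off between backlog and utility can be explored numerically before fixing $V$. The sketch below is a toy model and not the TSAS: it assumes constant offered tasks and capacity with $O > N$, a linear utility with a hypothetical weight `unit_utility`, and a simple threshold admission rule that admits all offered tasks only while $Q(t) < V \cdot \text{unit\_utility}$. The function name `sweep_v` and the default values are illustrative.

```python
def sweep_v(v_values, slots=1000, offered=8, capacity=5, unit_utility=0.1):
    """Map each V to (mean backlog, total utility, tasks performed).

    Hypothetical toy model: constant offered tasks O and capacity N with
    O > N, linear utility, and a threshold rule that admits all offered
    tasks only while Q(t) < V * unit_utility.
    """
    results = {}
    for v in v_values:
        q = 0
        backlog_sum = 0
        utility_sum = 0.0
        performed_sum = 0
        for _ in range(slots):
            admitted = offered if q < v * unit_utility else 0
            performed = min(q, capacity)  # participants limit execution
            q = q - performed + admitted
            backlog_sum += q
            utility_sum += unit_utility * admitted
            performed_sum += performed
        results[v] = (backlog_sum / slots, utility_sum, performed_sum)
    return results
```

Calling `sweep_v([100, 500, 1000, 2000])` in this setting gives a mean backlog that rises with $V$, a total utility that does not decrease, and an identical number of performed tasks for every $V$ because execution is capped by the capacity. The same kind of sweep over candidate values of $V$, run with the real $O(t)$ and $N(t)$, is one way to pick a value that keeps the backlog acceptable.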

Further analysis reveals two methods that can reduce the backlog of queued tasks. On the one hand, a backlog is caused by a large $V$ and a large $o(t)$. Therefore, as mentioned above, $V$ can be appropriately reduced to reduce the backlog. On the other hand, the ability of the system to perform tasks can be improved by recruiting more participants, thereby reducing the backlog.

5. Conclusions

This paper proposes the TSAS based on Lyapunov optimization and the DDQN, which divides the task assignment in the MCS system into a task selection stage and a task allocation stage. The TSAS is a long-term and online task allocation scheme that dynamically balances the execution ability of participants, the number of tasks, and the importance of the tasks, thus not only improving the completion of tasks but also improving the utilization of participants. The DDQN is used to improve on the traditional solution of the Lyapunov optimization problem so that the impact of future system states can be predicted. In addition, in order to accelerate the training process of the neural networks in the DDQN, this paper also proposes action-masking and iterative training methods, which are proved to be feasible.

In future research, the following areas remain to be explored. First, targeted research can be conducted on task selection and allocation in MCS for different application fields and different influencing factors, such as communication resources, participants’ energy, and task timeliness. Second, in terms of solving task selection and allocation problems, most current methods focus on obtaining analytical solutions or using heuristic algorithms to obtain approximately optimal solutions. Machine learning, deep reinforcement learning, or a combination of the two can be used in place of the traditional methods to solve some task allocation problems in MCS. In addition, although this paper focuses on centralized task allocation problems, distributed task allocation problems, which are rarely studied, are an important area for future exploration.

Author Contributions

Conceptualization, S.C., Y.W., S.D., W.M. and H.Z.; Methodology, S.C., Y.W., S.D. and H.Z.; Software, S.C.; Validation, S.C.; Formal analysis, S.C. and Y.W.; Investigation, S.C.; Resources, S.C.; Data curation, S.C.; Writing—original draft, S.C.; Writing—review & editing, S.C. and Y.W.; Visualization, S.C.; Supervision, Y.W., S.D. and W.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflict of interest.


Figure 1. The composition of the MCS system.

Figure 2. Schematic diagram of task queue of MCS system.

Figure 3. The working mechanism of the task selection and allocation scheme in MCS.

Figure 4. Flow chart of state, action, and reward.

Figure 5. Training process of action-value network.

Figure 6. Training results of action-value network without iterative training: (a) task queue $Q(t)$, (b) the number of tasks that can enter the task queue $o(t)$, (c) $Loss$, (d) reward $Re(t)$.

Figure 7. Training results of action-value network with iterative training: (a) task queue $Q(t)$, (b) the number of tasks that can enter the task queue $o(t)$, (c) $Loss$, (d) reward $Re(t)$.

Figure 8. Variation in $Q(t)$ under different algorithms.

Figure 9. Variation in $o(t)$ under different algorithms.

Figure 10. Comparison of different indicators under different algorithms: (a) $Return$ comparison. (b) $U_{sum}$ comparison. (c) $P_{sum}$ comparison. (d) $I_{sum}$ comparison.

Figure 11. (a) Variation in $Q(t)$ with different $V$. (b) Variation in $o(t)$ with different $V$.

Figure 12. (a) Variation in $Return$ with different $V$. (b) Variation in $P_{sum}$ with different $V$. (c) Variation in $U_{sum}$ with different $V$.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Chang, S.; Wu, Y.; Deng, S.; Ma, W.; Zhou, H. Task-Importance-Oriented Task Selection and Allocation Scheme for Mobile Crowdsensing. Mathematics 2024, 12, 2471. https://doi.org/10.3390/math12162471

