1. Introduction
The implementation of Industry 4.0 solutions, currently observed in various branches of the manufacturing industry, is expected to result in a transition from traditional production systems to smart, more flexible, autonomous, and reconfigurable cyber–physical systems. This is related to the transition from intelligent production based on knowledge to smart production based on data and knowledge, where the term “smart” refers to the creation, acquisition, and use of data supported by advanced information and communication technologies as well as advanced data analysis. As a result, future production systems and their management methods will be supported by real-time data transmission, exchange, and analysis technologies along with simulation and optimization based on digital models [
1,
2,
3]. Scientific and industrial research in this area is supported by the development of key technologies related to the concept of Industry 4.0, such as the Internet of Things, big data, cloud computing, embedded systems, and artificial intelligence. Systems that create a communication interface between the digital and physical worlds through the integration of computation, networking, and physical resources are called cyber–physical systems (CPSs) and, with respect to production systems, Cyber–Physical Production Systems (CPPSs) [
2,
4,
5]. Apart from physical elements, a cyber–physical system is characterized by a cyber layer, the main element of which is a virtual component—a digital twin (DT) of the production system. DTs are taking a central position in new-generation intelligent production through integration with a CPPS, and their architecture should allow simultaneous access to production data and their processing [
5,
6]. A CPS, therefore, is a system that aggregates available information to predict conditions in production plants, enables additional optimization through data collection, analysis, and processing, and supports decision-making through control or prediction [
7,
8,
9].
In this context, the search for effective methods of creating DTs that can be easily applied in industry is particularly topical. This is a challenge currently faced by researchers and engineers of production systems. As recent research shows, the creation of digital twins, which are the central element of a CPPS, can be successfully implemented using discrete-event computer simulation models [
10,
11,
12,
13,
14]. The very concept of a twin, in the sense of a prototype/copy of an existing object that mirrors actual operating conditions in order to simulate its behavior in real time, appeared during NASA’s Apollo program [
15]. Since then, computer simulation methods and tools have undergone continuous development, passing from individual applications of simulation models for solving specific problems, through simulation-based support of system design, to today’s digital twin concepts, in which simulation models constitute the center of the functionality of the CPS [
14,
15]. A simplified view of the system architecture is shown in
Figure 1.
The approaches to using simulation models as a DT proposed in the literature, including, in particular, the references indicated above, obviously differ in scope and functionality, both with respect to the model itself and to the scope and direction of data exchange between data sources and the simulation model. Depending on the possibility of updating the state of the simulation model objects and the direction of data exchange, the authors of [16] divided digital mappings into three categories of digital representation of physical objects (a classification of digital twins):
Digital Model—which does not use any form of automatic data integration between the physical object and the digital object;
Digital Shadow—in which there is an automated one-way flow of data between the state of a physical and digital object;
Digital Twin—where data flows (in both directions) between the existing physical and digital objects are fully integrated.
Of course, in the applications of simulation models described above, since the DT is the central element of the CPPS, the DT should have the functionality of the third category. Bi-directional communication, along with the mapping and simulation of the behavior, functioning, and performance of the physical counterpart, is only a necessary condition for its use as the main element of the cyber layer. For digital twins to become useful in supporting decision-making in the areas of planning and control of physical system objects, it is necessary to enrich the cyber layer (or the digital twin itself) with analytical modules that allow for the determination of solutions that can be sent to the control system in real time. The challenges faced by such solutions correspond to the requirements of the third and fourth levels of development of production management and control strategies in the four-level development scale proposed by Zhuang, Liu and Xiong [
17] concerning manufacturing systems (
Figure 2). The first level of development is characterized by so-called passive (reactive) management and control methods, where most of the data are collected and entered offline, and traditional relational databases can meet most data management requirements. At the next stage, data are collected in real time using RFID tags, sensors, and barcodes, and are transmitted over real-time networks, enabling faster data processing and decision-making. The third and fourth stages involve the use of predictive and proactive management and control methods, respectively. These require extensive application of machine learning, artificial intelligence methods, cloud computing, and other big data technologies.
The predictive strategy is to be based on combined big data and digital twin technologies. In the case of a proactive strategy, it may even be possible to control physical systems based on the self-adaptation and self-reconfiguration functions of the CPPS. It is an extension of the predictive strategy towards intelligent management and production control, thanks to the combination of artificial intelligence technologies and digital twins. Currently, many companies are still at the first or second stage of development [
17]. Similarly, the Digital Twin-based Manufacturing System (DTMS) model proposed in [
18] is a novel reference model emphasizing fusion, interaction, and iteration. Regarding iteration, the system continuously analyzes the production process and iteratively corrects the DT model to improve the overall DTMS. The authors in [
18] point out that the manufacturing industry is currently in an era of rapid and intelligent development thanks to the strengthening of information technologies. In this context, the research on DTMS has become a “hot topic” in both academic and industrial circles.
The need, indicated above in relation to the Industry 4.0 concept, to respond to changing environmental factors and to autonomously reconfigure or adapt the production system highlights the great potential of machine learning applications and their ability to achieve promising results in production research. Moreover, it is pointed out that classical methods of allocating processes to resources, based on priority dispatching rules or evolutionary algorithms, are not adapted to the dynamic environment in modern production [
19,
20]. It becomes necessary to enrich CPPSs with solutions that, e.g., implement advanced artificial intelligence and machine learning methods. In such a combination, the DT reflecting the behavior of the production system can play the role of a virtual environment used in the training phase. This is especially important in relation to today’s requirements, where companies have to deal with mass customization, complex technologies, and shortening product life cycles. They must, therefore, be able to operate under highly dynamic market conditions and meet increasing sustainability standards. In recent years, there has been a significant increase in interest in the use of reinforcement learning algorithms, also in the area of production systems [
19].
Regardless of the production planning or control problem considered, the architectures of RL agent interaction in the training process proposed in the literature indicate the need to develop an environment (appropriate for a given problem and system) in which the Deep RL agent can interact. Most non-simulation-based approaches to solving production planning problems use analytical models. These models often describe a problem through the objective function, as well as constraints to define the problem’s structure [
21]. Developing analytical models of real production planning, scheduling, and sequencing problems is a complex task [
21]. Analytical models are now often the basis for modules that determine, for example, the production schedule and production indicators and, as a result, the value of the reward function in the communication process with the RL agent [
22,
23,
24,
25]. In many cases, the interaction environment is based on dedicated computational algorithms and mathematical models, where it is necessary to develop formulas that calculate subsequent states and rewards based on the state and action (see, e.g., [
22,
26,
27,
28,
29]).
However, compared with the DT based on discrete simulation models considered in this paper, modifying analytical models, especially adding a new type of constraint or combining models covering different subsystems or production planning areas, is usually also a complex and time-consuming task. The DT, thanks to direct access to operational data and the use of communication techniques for data acquisition, allows fast semi-automatic or automatic adaptation of the simulation model. In addition, 3D simulation models also take into account intralogistics subsystems and/or the availability of operators and their travel times, which are usually omitted in dedicated analytical models. They can be quickly adapted to changes resulting from emergencies that require route changes and the use of alternative resources, supply chain disruptions at various levels, and changes in demand. This also meets the requirements of the latest concepts of the so-called Symbiotic Simulation System [
30]. Analytical approaches, due to the need to update mathematical models and formulas (the agent’s interaction environment) in the event of changes (often even minor ones) in the real production system, may be too inflexible in practical industrial applications.
This gap motivated research in the area of using DTs of production systems, based on modern discrete-event computer simulation systems, as an environment for the interaction of reinforcement learning agents, the results of which are presented in this paper. Additionally, many studies provide the parameters of the RL algorithms used (e.g., [
31]). Still, there is no analysis or discussion of the impact of their values on the results obtained or the speed of the training process. The results of such analyses may assist in their selection in future studies and indicate where response times can be reduced by shortening the training time.
The highlights and contributions of this paper are as follows: (1) a proposed CPPS architecture for the DRL agent training process, in which the agent communicates with the digital twin of the production system, so that it does not have to interact with the real workshop environment; (2) an example of the practical implementation of DRL algorithms for solving manufacturing process allocation problems, together with a comparison of the effectiveness of selected algorithms, and results and conclusions from the analysis of the impact of selected DRL algorithm parameter values on the quality of the found policy models and the training speed.
In the following section, the basics of RL, an overview of selected RL algorithms, and related work are presented. In
Section 3, the proposed CPPS architecture for the DRL agent training process is described. The comparison experiments are presented in
Section 4.
Section 5 discusses the results. The last section contains short conclusions and areas of further research work.
2. Reinforcement Learning
Reinforcement learning (RL) is a subcategory of machine learning and differs from supervised and unsupervised learning in particular by its trial-and-error approach to learning in direct interaction with the environment [
19,
32]. Reinforcement learning was identified by MIT Technology Review [33] as one of the breakthrough technologies, with great potential in a wide range of applications [19]. It is indicated that it may revolutionize the field of artificial intelligence and play a key role in its future achievements. RL is also part of the ongoing trend in artificial intelligence and machine learning towards greater integration with statistics and optimization [
32].
Formally, reinforcement learning draws on ideas from dynamical systems theory; in particular, optimal control of Markov decision processes provides a mathematical framework for modeling decision-making under uncertainty. The basic idea is to capture the aspects of the real problem that the learning agent is supposed to solve by interacting with the environment and taking actions that affect the state of the environment in order to achieve the set goal [
32]. An autonomous agent controlled by an RL algorithm observes and receives a state $s_t$ from the environment at time step $t$. Then, it interacts with the environment and executes an action $a_t$ according to its policy $\pi$. The policy is also known as the agent's strategy. After acting, the environment transitions to a new state $s_{t+1}$ based on the current state and action, and provides an immediate reward $r_{t+1}$ to the agent [34,35] (Figure 3). The reward can be positive or negative: a positive reward is awarded in exchange for successful actions, while a negative one represents a penalty.
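To make this interaction cycle concrete, the loop described above can be sketched in a few lines of Python using a Gymnasium-style environment interface; the environment used here is only a placeholder, since in the approach considered later in this paper the environment is the digital twin of the production system:

```python
import gymnasium as gym

# Placeholder environment used only to illustrate the interaction loop;
# in the proposed approach, the environment is the DES-based digital twin.
env = gym.make("CartPole-v1")

observation, info = env.reset()            # initial state s_0
for t in range(1000):
    action = env.action_space.sample()     # action a_t (here chosen by a random policy)
    observation, reward, terminated, truncated, info = env.step(action)
    # observation now corresponds to s_{t+1} and reward to r_{t+1}
    if terminated or truncated:
        observation, info = env.reset()
env.close()
```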
RL is a machine learning paradigm alongside supervised and unsupervised learning. It differs from supervised learning, in which learning occurs from an external training set of labeled examples for a given problem class and whose goal is, among other things, to extrapolate the system's answers so that it works correctly in situations not present in the training set [32]. For this reason, supervised learning is impractical for dynamic interactive problems, for which a training set representative of all situations cannot even be prepared, since many of them are not known in advance. An example of such environments is production systems operating in changing environmental conditions, in which the agent must be able to learn from its own experience. RL also differs from the other major machine learning paradigm, unsupervised learning. Despite apparent similarities, since unsupervised learning also does not rely on examples of correct behavior (a training set), the main difference is that RL seeks to maximize the reward value rather than to find a hidden structure. Unlike many other approaches that address subproblems without considering how they might fit into the bigger picture [
32], a key feature that makes RL suitable for solving production management problems is that RL takes into account the whole problem of a goal-oriented agent interacting with an unknown, changing environment.
The rapid development of and a significant increase in interest in RL algorithms are also related to the development of deep learning in recent years, based on the function approximation and powerful representation capabilities of deep neural networks. The advent of deep learning has had a significant impact on many areas of machine learning, improving the state of the art. In particular, deep reinforcement learning (DRL) combines a deep learning architecture with a reinforcement learning architecture to perform a wide range of complex decision-making tasks that were previously infeasible for a machine. In practice, DRL has made it possible to overcome earlier problems related to the lack of scalability, the limitation to fairly low-dimensional problems, and the curse of dimensionality (memory complexity, computational complexity, and sample complexity) [
35]. More information and explanations about DRL technology in machine learning can be found, e.g., in [
19,
32,
36]. RL algorithms based on these methods have become very popular in recent years [
32].
To update its behavior, a learning agent needs a learning algorithm. In an ideal case, the agent would have a model of all transitions between state–action pairs (model-based algorithms). The advantage of model-based algorithms is that they can simulate transitions using a learned model, which is useful when each interaction with the environment is costly, but they also have serious drawbacks. Model learning introduces additional complexity and the possibility of model errors, which, in turn, affect the resulting policy in the training process [
35,
37]. Moreover, because the number of actions and state changes in real systems is huge, storing such a model explicitly is often impractical. In contrast, model-free systems are built on a trial-and-error basis, which eliminates the requirement to store all combinations of states and actions [
26]. Therefore, the main disadvantage of early RL algorithms, which mainly focused on tabular and approximation-based algorithms, was their limited application in solving real-world problems [
38]. The advantage of model-free algorithms is that, instead of learning the dynamics of the environment or state transition functions, they learn directly from interactions with the environment [
19,
35]. This is currently a very popular group of algorithms that includes, as briefly described below, policy-based algorithms (e.g., PPO), value-based algorithms (e.g., DQN and DDQN), and hybrid algorithms (e.g., A3C and A2C) [
19,
35,
38]. Their advantages also include support for different state space and observation models. Moreover, the advantage of DRL algorithms is that deep neural networks are general function approximators, which can be used to approximate value functions and policies in complex tasks [
19,
38].
The basic reinforcement learning algorithm can operate with limited knowledge of the situation and limited feedback on decision quality [
39]. Research on reinforcement learning methods has led to the development of various groups of algorithms, which differ, for example, in the way policies are constructed. Popular RL algorithms include Q-learning, DQN (Deep Q-Network), PPO (Proximal Policy Optimization), DDPG (Deep Deterministic Policy Gradients), SAC (Soft Actor–Critic), SARSA (State–Action–Reward–State–Action), NAF (Normalized Advantage Functions), and A3C (Asynchronous Advantage Actor–Critic). A few selected RL algorithms are described below, chosen on the basis of their achievements and development history. The first is the model-free Q-learning algorithm, based on the Bellman equation and an off-policy approach, proposed in [
40]. The learned action-value function Q approximates the optimal action-value function, so the next action is selected to maximize the Q value of the next state. The Q function gets its name from a value representing the “quality” of a selected action in a given state. Basically, the Q-learning algorithm can be represented as a two-dimensional state–action array (Q-table) storing the estimated value of selecting a given action in a given state. When an action is performed and its outcome observed, the corresponding entry in the array is updated [
26]. In turn, in the Deep Q-Network (DQN) algorithm, which belongs to value-based methods and results from the development of deep learning, the state–action matrix is replaced with a class of artificial neural networks known as deep neural networks. By combining deep neural networks and Q-learning, it has been shown that an agent can successfully learn control policies directly from high-dimensional sensory input using DRL [
41]. More precisely, DQN uses a deep convolutional neural network to approximate the optimal action-value function Q, which is the maximum sum of rewards discounted at each time-step [
42]. The next algorithm, PPO, belongs to a group of policy-based methods that learn the policy directly to maximize expected future rewards. PPO belongs to a family of policy gradient methods that alternate between sampling data through interaction with the environment and optimizing a “surrogate” objective function using stochastic gradient ascent. PPO uses multiple epochs of stochastic gradient ascent to perform each policy update [
43]. The cited studies indicate that these methods are characterized by stability, reliability, and high efficiency combined with simplicity of implementation. Furthermore, PPO, unlike DQN, supports a continuous action space and directly maps states to actions, creating a representation of the actual policy. In actor–critic methods, a critic is used to evaluate the policy function estimated by the actor, which is used to decide the best action for a specific state and to tune the model parameters of the policy function. In the A3C algorithm, the critics learn the value function while multiple actors are trained in parallel. A3C has become one of the most popular algorithms in recent years. It combines advantage updates with the actor–critic formulation and relies on asynchronously updated policy and value function networks trained in parallel over several processing threads. Using multiple actors stabilizes improvements in the parameters and conveys an additional benefit in allowing for more exploration to occur [
35]. In turn, A2C is a synchronous, deterministic variant of A3C, which achieves comparable performance. Moreover, when using a single GPU, or when using only the CPU for larger policies, an A2C implementation is more cost-effective than A3C [
44].
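For reference, the tabular Q-learning update outlined above can be written in its standard form, with learning rate $\alpha$ and discount factor $\gamma$:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$

so the stored estimate is shifted towards the observed reward plus the discounted value of the best action available in the next state; DQN replaces the table $Q(s, a)$ with a deep neural network trained to minimize the error of this same target.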
Deep Reinforcement Learning in Production Systems
Machine learning, and in particular DRL, is at the beginning of its path to becoming an effective and common method of solving problems in modern, complex production systems. However, the first research reviews clearly show that, following the already well-established modeling, simulation, virtualization, and big data techniques, it is machine learning, and recent research into its application, that indicates great potential for the wide application of RL, from process control to maintenance [
19,
45]. In recent years, research has started to assess the possibilities and usefulness of DRL algorithms in applications to selected problems in the area of production and logistics, such as robotics, production scheduling and dispatching, process control, intralogistics, assembly planning, or energy management [
19,
26,
35,
41]. A review of recent publications on the implementation of DRL in production systems also indicates several potential areas of application in the manufacturing stage of the engineering life cycle [
19,
25].
Many studies have shown important advances in applications in areas related to the control of task executors in dynamic production conditions, such as robot control, in particular, mobile robots or AGVs [
38]. For example, ref. [
46] presents a case study on creating and training a digital twin of a robotic arm, in which a DRL agent was trained using the PPO and SAC algorithms in a virtual space and the learned behavior was applied in a physical space. As a result of the training, the hyperparameters of the DRL algorithm were adjusted for slow-paced and stable training, which allowed for the completion of all levels of the curriculum. Only the final training results are shown. The need to adapt hyperparameters to the needs of a given project and to provide adequate time for training was pointed out. However, it was not shown how the values of the adopted parameters influenced the agent’s training results. In [
47], the authors propose a visual path-following algorithm based on DRL, namely double DQN (DDQN). The learning efficiency of DDQN was compared for different learning rates and three sizes of the experience replay buffer. The potential of applying the path-following strategy learned in the simulation environment to real-world applications was shown. Similar approaches to controlling AGVs and mobile robots can be found in [
48,
49], and a more comprehensive review in [
25].
Other popular areas of proposed DRL applications are order selection and scheduling, in particular, dynamic scheduling [
19,
25]. In [
31], a self-adaptive scheduling approach based on DDQN is proposed. To validate its effectiveness, the simulation model of a semiconductor production demonstration unit was used. The models obtained for DDQN were compared with DQN, and it was found that the proposed approach promotes autonomous planning activities and effectively reduces manual supervision. The experiments were conducted for a single input set of algorithm parameters, without discussing their impact on the training process. The summary indicates the need for future research to include digital twins or cyber–physical systems. In [
22], a real-time scheduling method based on a DRL algorithm was developed to minimize the mean tardiness of the dynamic distributed job shop scheduling problem. Five DRL algorithms were compared using a scheduling environment and developed algorithms covering selected aspects of machines, tasks, and operations. For the problem under consideration, PPO achieved the best result and the highest win rate. Similar results were obtained when comparing the DRL algorithms with classic dispatching rules, composite scheduling rules, and metaheuristics. Another example of research on the use of DRL in scheduling is an RL approach used to obtain efficient robot task sequences that minimize the makespan in robotic flow shop scheduling problems [
50]. In turn, in [
24], to tackle large-scale scheduling problems, an intelligent algorithm based on deep reinforcement learning (DDQN) was proposed. A reward function algorithm was also proposed, which, based on the adopted mathematical model, determines task delays and returns the next state and reward. The performance of the proposed algorithm was found to be superior to the chosen heuristics and other DRL algorithms. In [
23], a dynamic permutation flow-shop scheduling problem was solved to minimize the total cost of lateness using DRL. The architecture of the problem-solving system was proposed and a mathematical model was established that minimizes the total cost of delays. The results show good performance of the A2C-based scheduling agent considering solution quality and CPU time. A more detailed review of applications in this area is provided in [
25]. Most research results indicate that DRL-based algorithms can train and adapt their behavior according to changes in the state of the production environment. They demonstrate better decision-making capabilities compared to classic dispatching rules and heuristics, adapting their strategies to the performance of agents in the environment [
51].
In addition, as shown in the review [
19], preliminary research has shown, in the vast majority of the described cases, the advantage of the applied DRL algorithms in this area. Deep RL outperformed 17 out of 19 conventional dispatching rules or heuristics and improved system efficiency. Although few of them have been tested in real workshop conditions, the results show very high potential for this type of method. In this area, the research mainly concerned the DQN, PPO, and A2C algorithms based on discrete observation and action spaces [
52].
Reviews of published research results in the areas of application of DRL methods in production also indicate difficulties in transferring the results to real-world scenarios [
19,
25]. Concerns are raised about the possibility of deterioration of the quality of solutions after the transfer, or the lack of direct application of the proposed solutions in real manufacturing environments. The challenges and prospects defined for future research concern the need to consider ways to quickly adapt to production requirements for various production equipment and input data. In this context, the proposed CPPS solutions that can be directly implemented in the operating environment of industrial demonstrators (constituting a kind of bridge between simulation and analytical models and industrial practice) seem to be valuable.
In our research, we chose the PPO and A2C algorithms for the experiments. This choice was dictated by the fact that these algorithms are well known for their simplicity and, therefore, ease of implementation [
43]. Ease of implementation and low cost of entry for future industrial applications were crucial for us because the DT used in the experiments was based on data obtained from industrial partners and modified for research purposes.
3. CPPS with DES-Based DT and DRL Agent Architecture
The central element of the proposed CPPS architecture is the DT based on discrete-event 3D simulation models. To reflect the dynamics of the real system, the simulation model is supported by automatic model generators used for the semi-automatic creation and automatic updating of simulation models. The generators can be based on data mapping and transformation methods using a neutral data format and the XSL/XSLT languages [
53,
54]. Process data from sensors and programmable logic controllers (PLCs) are transferred to the SCADA system and then to MES/ERP systems. From there, through data exchange interfaces and integration modules, the data required for creating and updating the simulation model can be mapped and, depending on the needs, transformed into a neutral data model and then, using the internal languages of the simulation software, transformed into code that creates a new simulation model or modifies an existing one, as shown in
Figure 4. This solution allows for easy practical implementation of the method because it can be used with any management system and any computer simulation system (e.g., Enterprise Dynamics, Arena, FlexSim).
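As a simplified illustration of this mapping step, the sketch below (in Python) reads a fragment of a neutral-format XML description of workstations and turns it into model-building statements; the element names, attributes, and the create_station command are hypothetical placeholders for the data model and internal scripting language of the simulation system actually used:

```python
import xml.etree.ElementTree as ET

# Hypothetical neutral-format fragment exported from the MES/ERP integration layer.
NEUTRAL_XML = """
<stations>
  <station id="M1" cycle_time="42.0" buffer="5"/>
  <station id="M2" cycle_time="37.5" buffer="3"/>
</stations>
"""

def generate_model_code(xml_text: str) -> list[str]:
    """Transform the neutral data model into model-creation statements
    (placeholders for the internal scripting language of the simulator)."""
    root = ET.fromstring(xml_text)
    statements = []
    for station in root.findall("station"):
        statements.append(
            f'create_station(name="{station.get("id")}", '
            f'cycle_time={station.get("cycle_time")}, '
            f'buffer_capacity={station.get("buffer")})'
        )
    return statements

for line in generate_model_code(NEUTRAL_XML):
    print(line)
```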
Of course, as mentioned in
Section 1, the transition from a digital model to a digital twin requires that the digital model has mechanisms for a bi-directional data flow between the existing physical and virtual/digital objects. For this purpose, a model–control system architecture is proposed that uses so-called emulation tools for industrial automation system components directly in the simulation software. This allows direct connections to be created between the digital, virtual objects of the simulation system and external physical sensors, PLCs, or communication clients/servers. The architecture of such a solution is shown in
Figure 5. The combination of solutions shown in
Figure 4 and Figure 5 allows the model to be updated with data from production management systems as well as directly from sensors or PLCs (e.g., using the OPC or Modbus protocols). This functionality is provided by, e.g., the emulation tool of the FlexSim simulation system [
55]. It also makes it possible to send responses back from the DT to the planning or control system.
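As an illustration of the direct sensor/PLC connection, the sketch below reads a single tag over OPC UA using the python-opcua library; the server endpoint and node identifier are placeholders for the actual automation configuration:

```python
from opcua import Client  # python-opcua (freeopcua) library

# Placeholder endpoint and node id of, e.g., a machine-state tag on the OPC UA server.
OPC_ENDPOINT = "opc.tcp://192.168.0.10:4840"
NODE_ID = "ns=2;s=Line1.Machine1.State"

client = Client(OPC_ENDPOINT)
try:
    client.connect()
    machine_state = client.get_node(NODE_ID).get_value()
    # The value can now be pushed into the corresponding digital-twin object,
    # e.g., to update the state of the emulated machine in the simulation model.
    print("Current machine state:", machine_state)
finally:
    client.disconnect()
```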
The next stage involves supplementing the cyber layer with modules enabling the use of selected RL algorithms and providing a two-way connection between the DRL agent and the simulation model. The simulation model serves as the environment for interaction, learning, and verification of the agent’s policy. Most of the algorithms mentioned in
Section 2 are publicly available in open-source libraries containing their implementations. These include popular Python libraries such as OpenAI Baselines, Stable Baselines3, or keras-rl. Additionally, an API must be provided for communication between DRL agents and the environment. In the case discussed, the environment with which the agent communicates is a simulation model, so the simulation software constitutes a non-standard environment for which appropriate communication and configuration modules must be designed. Such modules should not only implement two-way communication, but also provide access to model configuration functions that allow, among other things, the definition of the observation space, the space of actions taken by the agent, and the reward function, along with its normalization to the values required by individual algorithms. Appropriate tool packages that provide the functionality described above can be implemented directly in commercially available simulation software using API mechanisms. The architecture of such a solution is shown in
Figure 6.
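A minimal sketch of such a communication module is given below; it assumes a Gymnasium-style interface and a hypothetical connector object (the sim parameter) that exchanges state, action, and reward messages with the simulation software, so the concrete message format and spaces would depend on the simulation system and problem at hand:

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces


class DigitalTwinEnv(gym.Env):
    """Wraps the DES-based digital twin as an RL environment (sketch).

    `sim` stands for a user-written connector to the simulation software
    (socket, OPC, or vendor API); its methods are placeholders.
    """

    def __init__(self, sim, n_machines: int, n_jobs: int):
        super().__init__()
        self.sim = sim
        # Example spaces: the agent observes buffer levels / machine states
        # and selects the machine to which the next job is allocated.
        self.observation_space = spaces.Box(low=0.0, high=np.inf,
                                            shape=(n_machines + n_jobs,),
                                            dtype=np.float32)
        self.action_space = spaces.Discrete(n_machines)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        state = self.sim.reset_model()          # restart the simulation run
        return np.asarray(state, dtype=np.float32), {}

    def step(self, action):
        # Send the allocation decision to the model, advance the simulation
        # to the next decision point, and receive the new state and reward
        # (e.g., a statistic of the model such as throughput or tardiness).
        state, reward, done = self.sim.apply_action(int(action))
        return np.asarray(state, dtype=np.float32), float(reward), bool(done), False, {}
```

An environment defined in this way can then be passed directly to the training routines of libraries such as Stable Baselines3.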
A practical example of the implementation of this concept is shown in the next section.
5. Discussion
Comparing the selected DRL algorithms, it can be concluded that PPO turned out to be the best for both the uniform and modified product version distributions. In the first case, it achieved a significant advantage over A2C, while, in the second case, both algorithms achieved similar, very good results of the trained model. More importantly, the results obtained in the training process were confirmed in the process of using the trained models in simulations conducted on the DT for many replications of the simulation model. As can be seen in
Figure 13 and
Figure 17, the obtained values of the reward function (correlated with the objective function for the problem under consideration) were at the same level as in the final phase of the training process. They also showed a significant advantage over the results for random control decisions and those based on typical dispatching rules.
The obtained results also indicate the need to perform a phase of tuning algorithm parameters for the current states of the production system mapped in the digital twin. This is most clearly visible in the results for PPO shown in
Figure 11. The convergence of the graph for case (b) could suggest that the maximum result had been achieved in the training process. Still, the results of subsequent experiments (c) and (d) showed that the training results could be improved by 50%. However, as the example of the results obtained for the set of A2C training parameters in the case shown in
Figure 16d shows, achieving convergence on the graph of the reward value versus the number of steps does not guarantee obtaining an effective trained model. In that case, probably due to significant differences in the number of product versions arriving in the system, the number of steps to run for each environment per update was too small to allow the agent’s policy to be adjusted to the circumstances. Moreover, general conclusions cannot be drawn here with regard to the minimum value of
n_steps because, in Experiment #1, an identical set of parameter values gave the best result for the A2C algorithm for the case analyzed there. However, it is noticeable that values that are too low make the agent converge prematurely, thus failing to learn the optimal decision path. Similarly, the analysis of the results regarding the influence of varying the discount factor gamma, where lower values place more emphasis on immediate rewards, indicates the possibility of achieving faster convergence of the training process. This can be seen in
Figure 12b,c and
Figure 15b,c. In the analyzed cases, however, it did not always result in a significant reduction in the learning time or the maximum reward value obtained (
Figure 16b,c).
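For reference, the parameters discussed above (such as n_steps and the discount factor gamma) are passed directly to the algorithm constructors in libraries such as Stable Baselines3; the values below are purely illustrative and are not the ones used in the reported experiments:

```python
import gymnasium as gym
from stable_baselines3 import A2C, PPO

# Placeholder environment; in the proposed architecture this would be the
# digital-twin environment described in Section 3.
env = gym.make("CartPole-v1")

# Illustrative hyperparameter values only; as discussed, their impact on the
# training process should be analyzed for the production system at hand.
a2c_model = A2C("MlpPolicy", env, n_steps=16, gamma=0.95, learning_rate=7e-4)
ppo_model = PPO("MlpPolicy", env, n_steps=2048, gamma=0.99, learning_rate=3e-4)

a2c_model.learn(total_timesteps=100_000)
```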
This shows that, due to the variable, dynamic nature of production systems and the wide variety of organizational and technological solutions used by manufacturers, it is recommended to always analyze the impact of parameter value variability on the quality of the training process. In general, the training process can be significantly shortened through a preliminary analysis of the impact of the parameters on its speed. This is particularly visible in the results obtained for A2C (
Figure 16b,c) and PPO (
Figure 11c,d), where, already in the middle of the training process, the reward trend gradually converges to the highest value, indicating that the model training is complete.
Furthermore, training the RL agent and verifying the learned model in the combined simulation-based DT and RL agent architecture proposed in this paper allows for simple and fast reconfiguration of the environment with which the agent interacts. Compared to architectures based on non-simulation environments, a brief overview of which is presented in Section 1, changing the components or performance indicators of the system on which the reward function is based, which, in the case of analytical models, would require constructing new formulas or mathematical models, does not pose any problem and is quick and simple. This is because, in the proposed approach, the values of the reward function can be directly linked to the statistics of the simulation model components, and modifying the reward function comes down to indicating them in the communication modules (see
Figure 9). In addition, using simple predefined conditional blocks, it is possible to create dynamic functions that activate various statistics depending on the adopted conditions. Similarly, changing the structure of the simulation model in response to changes in the real production system can be done automatically or semi-automatically by connecting the model to enterprise information systems or automation devices.
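As a conceptual sketch only, with a hypothetical get_statistic accessor standing in for the statistics interface of the simulation software, linking the reward to model statistics and switching between them conditionally might look as follows:

```python
def compute_reward(model, mode: str) -> float:
    """Reward taken directly from simulation-model statistics.

    `model.get_statistic` is a hypothetical accessor for the statistics
    interface of the simulator; changing the reward definition only
    requires pointing at different statistics here.
    """
    if mode == "throughput":
        return model.get_statistic("products_completed")
    if mode == "tardiness":
        # Negative sign: lower mean tardiness should yield a higher reward.
        return -model.get_statistic("mean_tardiness")
    # Dynamic variant: switch the active statistic depending on a condition.
    if model.get_statistic("wip_level") > 100:
        return -model.get_statistic("wip_level")
    return model.get_statistic("products_completed")
```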
The limitations of the presented architecture are mainly related to the scope of the digital representation of real objects. Most modern advanced simulation software contains extensive libraries of predefined objects, covering manufacturing, assembly, and logistics subsystems. However, due to the wide variety of technologies and production organization in different industries, some model objects still have to be created from scratch. Similarly, current simulation systems have not transferred all the communication protocols used in industrial automation to the digital layer.
The key challenge remains to transfer the discussed solutions into industrial practice as quickly as possible. Related partial challenges include providing a framework for data transfer and real-time communication capabilities, as well as developing common data models and implementation guidelines. Challenges in the closely related areas of cybersecurity and explainable AI should also be mentioned. Furthermore, there is an important future need to ensure that the designed industrial solutions comply with the legal requirements to which AI-based solutions used in industrial practice will be subject, especially in the context of the ongoing work on the AI Act [
58].