Our main motivation is to investigate the feasibility of developing fast-learning and more robust RL agents using environment augmentation with GANs. While there has been work on improving the performance of RL agents and of GANs, environment augmentation via GANs remains largely unexplored. Vincent Huang et al. [2] proposed enhancing the training of RL agents by using EGANs to initialize the agents so that they learn faster. The EGAN approach can speed up the early training phases of real-time systems with sparse and slow data sampling. Agents trained with EGAN and GAN achieved faster increases in rewards, and both approaches can provide more modalities in the data space. However, suitability was demonstrated only for wireless communication; applying the technique to other domains requires further exploration, and additional work is needed to verify and fine-tune the system for optimal performance. Thang Doan et al. [3] proposed GAN-Q-learning, a novel distributional RL method and an alternative to traditional models, and analyzed its performance in a GAN-based environment. Their results show that GAN-Q-learning keeps pace with benchmark algorithms such as traditional Q-learning. They suggested incorporating a CNN to make use of rewards and giving high priority to the stability of future iterations. Tim Salimans et al. [4] provided several training techniques for GANs that generate realistic human images; these methods can be used to train RL agents that were previously untrainable and to reduce training time. Saldanha et al. [5] presented their effort to synthesize pulmonary sounds of various categories using variants of the variational autoencoder, namely the multilayer perceptron VAE (MLP-VAE), the convolutional VAE (CVAE) and the conditional VAE. They compared the impact of augmenting the imbalanced dataset on the performance of different lung sound classifiers. To identify and categorize dementia into different groups based on its prominence and severity in the available MRI images, Jain et al. [6] suggested a novel DCGAN-based augmentation and classification (D-BAC) model. They proposed using the GAN-augmented dataset to study the early detection of dementia, known as mild cognitive impairment (MCI). Machine learning, especially reinforcement learning, has recently found widespread use in wireless communications for predicting network traffic and identifying the quickest and shortest paths for sending data over networks [7,8]. The model-free methodology known as Q-learning has been acknowledged as promising, particularly for complex decision-making processes. Excellent learning capabilities, successful outcomes and the ability to integrate with other models are just a few of the advantages of Q-learning [9]. According to Saeed Kaviani et al., DeepCQ+ routing [10], which innovatively integrates emerging multi-agent deep reinforcement learning (MADRL), achieves consistently higher performance across various MANET architectures while training on only a small number of network parameters and conditions. Xinyu You et al. created a novel packet routing architecture based on multi-agent deep reinforcement learning in a fully decentralized setting; each router in this system is equipped with an independent long short-term memory (LSTM) recurrent neural network for training and decision making [11]. In deep Q-learning, the targets for a given input state are the Q-values associated with each potential action, i.e., the current estimates of the total future reward obtainable for each action. These Q-values are updated via the temporal difference formula [12], recalled below.
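For reference, in standard notation (ours, not reproduced from [12]), with learning rate $\alpha$ and discount factor $\gamma$, the temporal difference update for Q-learning is
\[
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right],
\]
and in deep Q-learning the bracketed target is typically computed as $y_t = r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-)$, where $\theta^-$ denotes the parameters of a periodically refreshed target network.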
Guillermo Caminero et al. [14] proposed detecting intrusions using adversarial reinforcement learning. In addition, according to them, adversarial reinforcement learning requires less training time than other algorithms. Hence, adversarial reinforcement learning is a good fit for performing predictions online, and it is mainly suitable for imbalanced datasets and for situations where high prediction accuracy is required. Chelsea Finn et al. [15] demonstrated an equivalence between a maximum entropy inverse reinforcement learning (IRL) algorithm and a GAN model in which the discriminator can evaluate the generator's density. They derived a discriminator whose form incorporates the density values from the generator, leading to an unbiased estimate of the underlying energy function.
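Concretely, this equivalence hinges on a discriminator of a particular form; in the notation of [15], with learned cost $c_\theta$, generator (policy) density $\pi(\tau)$ over trajectories $\tau$ and partition function estimate $Z$,
\[
D_\theta(\tau) = \frac{\tfrac{1}{Z} \exp(-c_\theta(\tau))}{\tfrac{1}{Z} \exp(-c_\theta(\tau)) + \pi(\tau)}.
\]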
Justin Fu et al. [16] proposed adversarial inverse reinforcement learning (AIRL), a practical and scalable IRL algorithm built on an adversarial reward learning formulation. They showed that AIRL can recover reward functions that are robust to variations in the environment during training. AIRL can learn disentangled rewards and significantly improves upon both prior imitation learning and IRL algorithms.
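AIRL restricts the discriminator to an analogous state-action form; following [16], with learned function $f_\theta$ and policy $\pi$,
\[
D_\theta(s, a) = \frac{\exp(f_\theta(s, a))}{\exp(f_\theta(s, a)) + \pi(a \mid s)},
\]
so that, at optimality, $f_\theta$ recovers the expert policy's advantage.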
Karol Hausman et al. [17] proposed a multi-modal imitation learning framework that can segment and imitate skills from unlabelled and unstructured demonstrations, learning a multi-modal policy that successfully imitates the automatically segmented tasks. Ali Taleb Zadeh Kasgari et al. [18] proposed a GAN approach that pre-trains deep-RL on both synthetic and real data, yielding an experienced deep-RL framework that has been exposed to a broad range of network conditions. The outcome was that, under extreme conditions, the proposed experienced deep-RL agent can recover instantly, whereas a conventional deep-RL agent, under data rate restrictions, requires several epochs to reach high reliability and low latency. Michal Uricar et al. [19] proposed applying GANs to autonomous driving and advanced data augmentation. In their paper, the generator was trained in such a way that, after a number of epochs, it learned to identify which parts of an image were soiled; the trained generator is capable of both “de-soiling” an image and introducing soiling into it. Gaps in the study were the limited diversity of the dataset and the over-fitting that was observed.