Improved Q-Learning Method for Linear Discrete-Time Systems
Round 1
Reviewer 1 Report
This is a paper with strong theoretical development but lacking experimental assessment of the proposed Q-learning method for linear discrete-time systems.
The state of the art should be carefully reviewed to better position the contribution in relation to Q-learning and other supervised and unsupervised learning techniques. Please check the following questions and the corresponding suggestions:
Can the extended fuzzy Kalman filter be combined with the proposed modification? See for example:
D. S. Pires and G. L. O. Serra, "Methodology for Evolving Fuzzy Kalman Filter Identification," International Journal of Control, Automation and Systems, vol. 17, no. 3, pp. 793-800, 2019.
Y. Ding, X. Xiao, X. Huang, and J. Sun, "System identification and a model-based control strategy of motor driven system with high order flexible manipulator," Industrial Robot, vol. 46, no. 5, pp. 672-681, 2019.
F. Matía et al., "The fuzzy Kalman filter: Improving its implementation by reformulating uncertainty representation," Fuzzy Sets and Systems, 2019, https://doi.org/10.1016/j.fss.2019.10.015.
How are the parameters of the Q-learning method set? Is there any automatic procedure? See for example:
M. B. Radac, R. E. Precup, and R. C. Roman, "Model-free control performance improvement using virtual reference feedback tuning and reinforcement Q-learning," International Journal of Systems Science, vol. 48, no. 5, pp. 1071-1083, 2017.
I. la Fe et al., "Automatic Selection of Optimal Parameters Based on Simple Soft-Computing Methods: A Case Study of Micromilling Processes," IEEE Transactions on Industrial Informatics, vol. 15, no. 2, pp. 800-811, 2019, Art. no. 8325494.
Is it possible to make a comparison with other techniques such as SVM and KNN? See for example:
K. Huang, S. Li, X. Kang, and L. Fang, "Spectral–Spatial Hyperspectral Image Classification Based on KNN," Sensing and Imaging, vol. 17, no. 1, pp. 1-13, 2016, Art. no. 1.
K. Li, X. Luo, and M. Jin, "Semi-supervised learning for SVM-KNN," Journal of Computers, vol. 5, no. 5, pp. 671-678, 2010.
F. Castaño et al., "Obstacle recognition based on machine learning for on-chip lidar sensors in a cyber-physical system," Sensors (Switzerland), vol. 17, no. 9, 2017, Art. no. 2109.
Can the proposed Q-learning method be combined with a gradient-free optimization method to improve the dynamic behavior? See for example:
X. Qi et al., "A new meta-heuristic butterfly-inspired algorithm," Journal of Computational Science, vol. 23, pp. 226-239, 2017.
G. Beruvides et al., "Multi-objective optimization based on an improved cross-entropy method. A case study of a micro-scale manufacturing process," Information Sciences, vol. 334-335, pp. 161-173, 2016.
Author Response
Point 1: Can the extended fuzzy Kalman filter be combined with the proposed modification?
Response 1: The main purpose of the Kalman filter is to estimate the states of a system or to identify the parameters of an assumed model, so its role is similar to that of an observer in control theory. With an observer, the controller still has to be designed from a mathematical model of the controlled system, and that model can be established by any method. The extended fuzzy Kalman filter is an improved form of the Kalman filter and serves the same function as the traditional one.
The improved Q-learning method, in contrast, is a model-free control scheme: it does not rely on any model of the controlled system and computes the controller directly from data sampled from the system. It is therefore difficult to see how the extended fuzzy Kalman filter could be combined with the proposed algorithm; at most, the two approaches could serve as benchmarks for each other on the same controlled system.
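To make the model-free procedure concrete, the following is a minimal sketch of policy-iteration Q-learning for the discrete-time linear quadratic problem: a quadratic Q function is fitted to sampled transitions and the feedback gain is then improved from it. This is our illustration, not the manuscript's code; the function names, the example system, and all constants are assumptions. The ridge parameter reflects the ridge-regression modification summarized by Reviewer 2 below; setting it to zero recovers ordinary least squares.
```python
# Minimal sketch of policy-iteration Q-learning for the discrete-time LQ problem.
# Illustration only: names, the example system, and all constants are assumptions.
import numpy as np

def quad_basis(z):
    """Quadratic basis phi(z) so that z^T H z = phi(z)^T theta for symmetric H."""
    n = len(z)
    feats = []
    for i in range(n):
        for j in range(i, n):
            scale = 1.0 if i == j else 2.0   # off-diagonal terms appear twice in z^T H z
            feats.append(scale * z[i] * z[j])
    return np.array(feats)

def theta_to_H(theta, n):
    """Rebuild the symmetric matrix H from its stacked upper-triangular entries."""
    H = np.zeros((n, n))
    k = 0
    for i in range(n):
        for j in range(i, n):
            H[i, j] = H[j, i] = theta[k]
            k += 1
    return H

def q_learning_lqr(A, B, Qm, Rm, K0, iters=20, samples=400, ridge=1e-6, seed=0):
    """Fit Q(x,u) = [x;u]^T H [x;u] from sampled transitions, then improve u = -K x.
    ridge > 0 gives ridge regression; ridge = 0 is ordinary least squares."""
    rng = np.random.default_rng(seed)
    nx, nu = B.shape
    K = np.asarray(K0, dtype=float)
    for _ in range(iters):
        Phi, c = [], []
        x = rng.standard_normal(nx)
        for _ in range(samples):
            u = -K @ x + 0.1 * rng.standard_normal(nu)      # exploratory control input
            x_next = A @ x + B @ u                           # simulated here; could be plant data
            z = np.concatenate([x, u])
            z_next = np.concatenate([x_next, -K @ x_next])
            Phi.append(quad_basis(z) - quad_basis(z_next))   # temporal-difference regressor
            c.append(x @ Qm @ x + u @ Rm @ u)                # one-step quadratic cost
            x = x_next
        Phi, c = np.asarray(Phi), np.asarray(c)
        theta = np.linalg.solve(Phi.T @ Phi + ridge * np.eye(Phi.shape[1]), Phi.T @ c)
        H = theta_to_H(theta, nx + nu)
        K = np.linalg.solve(H[nx:, nx:], H[nx:, :nx])        # policy improvement step
    return K

# Illustrative second-order stable system (placeholder values, not from the paper):
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [0.1]])
K = q_learning_lqr(A, B, np.eye(2), np.eye(1), K0=np.zeros((1, 2)))
```
Note that the plant matrices A and B are used above only to generate sample data; the regression and the gain update themselves use the sampled transitions alone, which is what makes the scheme model-free.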
Point 2: How are the parameters of the Q-learning method set? Is there any automatic procedure?
Response 2: To design a Q-learning algorithm, the form of the Q function must first be decided; only then can the choice of parameters be discussed. Since the method is used to solve quadratic optimal control problems, the form of the Q function should coincide with the quadratic performance index so that the approximately optimal controller can be designed. The main parameters of the quadratic performance index are the weighting matrices Q and R: R should be a symmetric positive definite matrix, and Q should be symmetric positive semi-definite. The values of these two matrices set the relative weighting between the state deviations and the control values; in other words, the designer must judge which of the two is more important in the performance index. In the optimal control problem, the choice of Q and R depends on the performance requirements, the designer's experience, and the constraint conditions.
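For reference, a standard way of writing these quantities (a textbook form, not quoted from the manuscript) is
\[
J = \sum_{k=0}^{\infty}\left(x_k^{\top} Q\, x_k + u_k^{\top} R\, u_k\right), \qquad Q = Q^{\top} \succeq 0, \quad R = R^{\top} \succ 0,
\]
with the Q function taking the matching quadratic form
\[
Q(x_k,u_k) = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^{\top} H \begin{bmatrix} x_k \\ u_k \end{bmatrix},
\]
so that larger entries in Q penalize state deviations more heavily, while larger entries in R penalize control effort.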
Point 3: Is it possible to make a comparison with other techniques such as SVM and KNN?
Response 3: Both KNN and SVM are supervised algorithms, which means their training data must have corresponding target values. Even though some improved variants have evolved into semi-supervised algorithms, at most they can cope with situations such as insufficient or imbalanced training data; they still cannot obtain meaningful answers from data without target values. Reinforcement learning methods, represented by Q-learning, are unsupervised in this sense: they do not need the "answers" for the data, but instead construct an evaluation mechanism for the current status and adjust their strategy according to its results. So there are fundamental differences between KNN, SVM and Q-learning.
Point 4: Can the proposed Q-learning method be combined with a gradient-free optimization method to improve the dynamic behavior?
Response 4: A gradient-free optimization method is one in which, during the iterative optimization process, the search direction is decided by criteria other than the gradient. In this sense, the Q-learning method is itself a kind of gradient-free optimization method when it is used to solve optimization problems, and finding combinations of reinforcement learning methods and bio-inspired algorithms is indeed a research direction pursued by some researchers.
Author Response File: Author Response.docx
Reviewer 2 Report
The authors present a modified Q-learning algorithm for solving the quadratic optimal control problem of discrete-time linear systems. The modification consists of replacing least squares regression with ridge regression.
Overall, the article is interesting. It presents in a very formal way the proposed solution. Nevertheless, I have some remarks.
It is a detriment that the introduction does not explain the quadratic optimal control problem of discrete-time linear systems.
Many times in the introduction the same statements are repeated.
Where were the examples in section 3 taken from? It is worth giving a specific problem and solving it, not simply quoting numbers that mean almost nothing. You can always find some cases for which the proposed solution will be better.
The HJB equations are unexplained; they are not obvious to everyone.
Error in the name of section 2: impoved -> improved, and some other language errors (I suggest reviewing the whole work again).
Author Response
Point 1: It is a detriment that the introduction does not explain the quadratic optimal control problem of discrete-time linear systems.
Response 1: Thank you for the reminder. The explanation of the quadratic optimal control problem of discrete-time linear systems can be found in the additional Word file, since some mathematical expressions cannot be shown directly on this web page. We have tried to expound it concisely and clearly, and we will add it to the introduction.
Point 2: Where were the examples in section 3 taken from? It is worth giving a specific problem and solving it, not simply quoting numbers that mean almost nothing. You can always find some cases for which the proposed solution will be better.
Response 2: Example 2 in Section 3 is taken from a discrete-time model of a DC motor. For this kind of plant, the discrete-time models have a similar structure and similar parameters. For example, in the book Reinforcement Learning and Dynamic Programming Using Function Approximators (Lucian Busoniu, Robert Babuska, Bart De Schutter, et al., CRC Press, 2010), a DC motor is modeled as a second-order discrete-time linear system in Section 3.4.5; the mathematical expression can be found in the additional Word file. That model has a structure and parameters analogous to Example 2 in our manuscript.
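For illustration only (the specific numerical values are given in the additional Word file and in the cited book; the symbols below are generic placeholders), a second-order discrete-time linear model of this kind has the form
\[
x_{k+1} = A x_k + B u_k, \qquad A \in \mathbb{R}^{2\times 2}, \; B \in \mathbb{R}^{2\times 1},
\]
where the two states typically collect the motor's shaft angle and angular velocity and \(u_k\) is the input voltage.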
As for Example 1, the main goal of choosing a third-order system is to illustrate that the improved Q-learning method is effective for higher-order systems, since in many optimal control papers the plants are only first- or second-order. In fact, the model can be replaced by any other stable linear discrete-time system and the proposed method remains effective.
In the revised manuscript, we will add a proper explanation of our examples.
Point 3: The HJB equations are unexplained; they are not obvious to everyone.
Response 3: HJB is short for Hamilton-Jacobi-Bellman. In optimal control theory, the solution of the HJB equation is the real-valued value function giving the minimum cost for a specific dynamic system and its cost function; from it, the optimal controller of the controlled system is obtained. To write down the HJB equation, all the information about the controlled system must be known, which means a precise mathematical model of the system has to be established. Many methods have been studied for solving the HJB equation, and the Riccati equation mentioned earlier is derived from it.
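For context, a standard textbook statement (not taken from the manuscript) of the discrete-time Bellman/HJB optimality equation for the linear quadratic problem is
\[
V^{*}(x_k) = \min_{u_k}\left[x_k^{\top} Q\, x_k + u_k^{\top} R\, u_k + V^{*}(A x_k + B u_k)\right],
\]
and with the quadratic ansatz \(V^{*}(x) = x^{\top} P x\) it reduces to the discrete-time algebraic Riccati equation
\[
P = A^{\top} P A - A^{\top} P B \left(R + B^{\top} P B\right)^{-1} B^{\top} P A + Q,
\]
which is the Riccati equation referred to above. Solving it requires the model matrices A and B, whereas the Q-learning method avoids this requirement.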
In the revised manuscript, we will add a proper explanation of the HJB equations where the term is used.
Point 4: Error in the name of section 2: impoved -> improved, and some other language errors (I suggest reviewing the whole work again).
Response 4: Thank you for your patience and time. We will do our best to eliminate the language errors in our manuscript.
Author Response File: Author Response.docx
Round 2
Reviewer 1 Report
Comments are provided in the reply to the reviewers, but these comments are not reflected in the resubmitted paper. The authors have not updated the review of the state of the art accordingly in order to position the claimed contributions of the paper within it.
Author Response
Thank you for considering the revised version of our manuscript entitled “Improved Q-learning Method for Linear Discrete-time Systems” (Manuscript ID: processes-725141). We believe that the comments have been highly constructive and very useful for restructuring the paper.
We have carefully taken the reviewer's comments into account and revised the paper accordingly. The typos have been checked and corrected, and the revised portions are highlighted in blue. The main corrections are as follows.
- To highlight the superiority of Q-learning, we have added background on the differences between the Q-learning method and other common algorithms such as SVM, KNN, and the Kalman filter with its extended forms. The characterization of the Q-learning method as a kind of gradient-free optimization method has also been added to the introduction. For more details, please see the second paragraph of the section “Q-learning method for model free control schemes”.
- The principles for setting the parameters of the quadratic optimal Q-learning method have been added to the manuscript. For more details, please see the paragraph before Equation (8) in Section 1, “Design process of quadratic optimal controller by existing Q-learning method”.
- More papers related to the revised statements have been added to the reference list.
Thank you again for your comments and suggestions. We would like to express our appreciation to the editors and reviewers for their comments on our paper; they are all valuable and very helpful for revising and improving its quality, and they also provide important guidance for our research. We hope that the revised paper meets with your approval and look forward to hearing from you regarding our submission. Should you have any further questions or comments, please do not hesitate to contact us.