On Entropy Regularized Path Integral Control for Trajectory Optimization
Abstract
1. Introduction
1.1. Stochastic Search Methods
1.2. Path Integral Control
1.3. Contributions
- We provide a comprehensive overview of existing Path Integral Control methods tailored to policy search. In doing so, we point out a shared underlying design principle which allows for a formal comparison and classification.
- We propose an original and intuitive argument for the introduction of entropy regularization terms in the context of optimization that is based on the principle of entropic inference.
- We untie the derivation of Path Integral Control methods from its historical roots in Linearly Solvable Optimal Control and illustrate that a similar set of algorithms can be derived directly from the more timely framework of Entropy Regularized Optimal Control. To that end, we introduce the framework of Entropy Regularized Trajectory Optimization and derive the Entropy Regularized Path Integral Control (ERPIC) method. We consider this to be our primary contribution. Furthermore, this work elevates the structural similarity between Evolutionary Strategies, such as CMA-ES [13] and NES [14], and PIC methods, originally pointed out and exploited in [25], to a formal equivalence.
- We give a formal comparison of preceding PIC methods and ERPIC tailored to derivative-free trajectory optimization with control affine dynamics and locally linear Gaussian policies.
2. Preliminaries and Notation
2.1. General Notation
2.2. Dynamic System Models
2.3. Stochastic Optimal Control
2.4. Local Parametric Policies
3. Path Integral Control
3.1. Discrete Time Linearly Solvable Optimal Control
3.2. Path Integral Control (PIC) Methods
3.2.1. Exact Methods
Sample Efficient Path Integral Control Method
Path Integral Relative Entropy Policy Search
3.2.2. Gradient Ascent Methods
Path Integral Cross Entropy Method
Adaptive Smoothing Path Integral Control (ASPIC) Method
3.3. Other Noteworthy PIC Methods
3.4. Other Remarks
4. Entropy Regularized Path Integral Control
4.1. Entropy Regularized Optimization
4.1.1. Entropic Inference
4.1.2. Optimization as an Inference Problem
4.1.3. Theoretical Search Distribution Sequences
- the distribution sequence collapses in the limit onto the Dirac delta distribution centered at the minimizer of the objective;
- the expectation of the objective under the search distribution is a monotonically decreasing function of g, regardless of the choice of prior;
- the entropy of the search distribution is a monotonically decreasing function of g if the prior is chosen uniform on the search domain (a numerical sketch of these properties follows this list).
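The following is a minimal numerical sketch of these properties. It assumes the theoretical sequence takes the tempered form π_g(x) ∝ π_0(x) exp(−g f(x)), which is our reading of this section; the exact expressions are given by the equations of the paper and are not reproduced here. On a toy objective, the expected objective value decreases with g, the entropy shrinks, and the probability mass concentrates around the minimizer.

```python
import numpy as np

# Toy one-dimensional objective with minimizer x* = 0.3,
# discretized uniformly over the search domain [-1, 1].
f = lambda x: (x - 0.3) ** 2
x = np.linspace(-1.0, 1.0, 2001)
prior = np.ones_like(x) / len(x)  # uniform prior pi_0

def tempered(g):
    """Hypothetical tempered search distribution pi_g(x) ~ pi_0(x) * exp(-g * f(x))."""
    w = prior * np.exp(-g * f(x))
    return w / w.sum()

for g in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    pi_g = tempered(g)
    exp_f = np.sum(pi_g * f(x))                      # expected objective value
    entropy = -np.sum(pi_g * np.log(pi_g + 1e-300))  # discrete entropy
    print(f"g = {g:7.1f}   E[f] = {exp_f:.5f}   H = {entropy:.3f}")
# E[f] and H decrease monotonically with g, and the mass concentrates
# around x* = 0.3, consistent with the collapse onto a Dirac delta.
```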
4.2. Entropy Regularized Optimal Control
4.3. Entropy Regularized Trajectory Optimization
- As a result of the entropy regularization and the optimization space lifted to state trajectories, it is now possible to obtain another formal yet explicit optimal state trajectory distribution. Note that this is not the same optimal distribution as we derived in the LSOC setting, given that here the control is not penalized through a Kullback–Leibler divergence term but implicitly through the cost C. When we evaluate C and have access to the full state-action trajectory, we can simply replace C by R. Here, we emphasize that the distribution still represents a state trajectory distribution, which is now also a function of the actions.
- Second, the theorem implies that we can readily apply the PIC design strategy described in Section 3.2, substituting the optimal path distribution sequence (31) for the target distribution, and the parametrized path distribution induced by some stochastic parametric policy for the search distribution, to derive a generalized class of PIC methods. We refer to this class as Entropy Regularized Path Integral Control or ERPIC (a simplified sketch of the resulting update is given below).
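As an illustration of the design strategy referred to above, the following is a heavily simplified, hypothetical sketch of a generic PIC-style update: rollouts are sampled from a Gaussian policy over open-loop action sequences, each rollout is weighted by an exponentiated negative cost, and the policy parameters are updated as the weight-normalized average. The dynamics, cost, and temperature lam below are placeholders chosen for illustration and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

T, K, lam = 20, 64, 1.0        # horizon, number of rollouts, temperature
mean = np.zeros(T)             # mean of a Gaussian policy over action sequences
sigma = 0.5                    # fixed exploration standard deviation

def rollout_cost(actions):
    """Placeholder trajectory cost: steer a 1-D integrator state towards 1."""
    state, cost = 0.0, 0.0
    for u in actions:
        state = state + 0.1 * u
        cost += (state - 1.0) ** 2 + 0.01 * u ** 2
    return cost

for it in range(50):
    samples = mean + sigma * rng.standard_normal((K, T))   # K sampled action sequences
    costs = np.array([rollout_cost(a) for a in samples])
    # Exponentiated-cost weights; subtracting the minimum improves numerical stability.
    w = np.exp(-(costs - costs.min()) / lam)
    w /= w.sum()
    mean = w @ samples                                      # weighted parameter update

print("final cost of the mean action sequence:", rollout_cost(mean))
```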
5. Formal Comparison of Path Integral Control Methods
5.1. Control Affine Systems
- SEPIC (or PICE): The function P reads as the cost accumulated over the trajectory, where the states and control efforts are penalized separately. The trailing term is included to compensate for the full noise penalization [41]. Any nonlinear function can be used to penalize the states; the control is penalized using a quadratic penalty term which depends on the noise added to the system. A crucial limitation is clearly that the exploration noise and the control cost are therefore coupled.
- PIREPS (or ASPIC): It turns out that in this setting the weights are simply a smoothed version of those associated with SEPIC. Under strong regularization, the weights all take approximately the same value and the search distribution is barely updated. Under weak regularization, the method reduces to SEPIC (see the sketch after this list).
- ERPIC: One can easily verify that, in the ERPIC setting, the function P represents the cost accumulated over the trajectory, where the states and control efforts are no longer penalized separately. Here, a discount term is included that promotes uncertain trajectories, making sure that the entropy of the search distribution does not eventually evaporate. Second, we wish to point out the obvious similarities with the stochastic search method in Appendix C.
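The qualitative behaviour of these weight functions can be reproduced in a few lines. The sketch below assumes, purely for illustration, that the smoothed weights are obtained by tempering the exponentiated costs with a regularization-dependent exponent; the exact expressions are those discussed in the comparison above. Strong regularization pushes the weights towards uniform, while weak regularization recovers SEPIC-style weights.

```python
import numpy as np

costs = np.array([1.0, 2.0, 3.0, 10.0])  # costs of four hypothetical rollouts
lam = 1.0                                # temperature of the exponentiated costs

def weights(costs, smoothing):
    """Normalized weights; smoothing = 1 gives SEPIC-style weights,
    smoothing -> 0 (strong regularization) gives near-uniform weights."""
    logw = -smoothing * (costs - costs.min()) / lam
    w = np.exp(logw)
    return w / w.sum()

print("weak regularization (SEPIC-like) :", weights(costs, smoothing=1.0))
print("moderate regularization          :", weights(costs, smoothing=0.3))
print("strong regularization (uniform)  :", weights(costs, smoothing=0.01))
```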
5.2. Locally Linear Gaussian Policies
5.3. Discussion
5.3.1. Remarks Related to Stochastic Search Methods and Variance
5.3.2. Positioning of PIC Methods within the Field of RL
6. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
Appendix A. Proof of Theorem 2
- Consider any point other than the minimizer. One can then construct two sets that separate the points with smaller and larger objective values, which allows us to derive an upper bound on the value of the density at that point. From this bound it follows that the density tends to 0 as g tends to infinity. On the other hand, at the minimizer itself, one can easily verify that the normalizing denominator tends to 0 and the density thus tends to ∞. This limit behavior agrees with that of the Dirac delta and the statement follows.
- To prove that the expectation is a monotonically decreasing function of g, we simply have to verify that its derivative with respect to g is strictly negative. Therefore, let us first express the expectation explicitly, introducing the normalizer. Taking the derivative with respect to g then yields minus the variance of the objective under the search distribution. As the variance is strictly positive, except in the degenerate case, the statement follows (the derivative is spelled out after this proof).
- The entropy of the distribution can be written out explicitly; for this we will also need the logarithm of the density. Taking the derivative of the entropy with respect to g shows that, in case the prior is chosen uniform on the search domain, the entropy decreases monotonically with g. Otherwise, the entropy might temporarily increase, especially when the prior and the limiting distribution are far apart. Further note that the rate of convergence increases with g. □
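For completeness, the derivative computation behind the second item can be spelled out. Assuming the tempered form π_g(x) ∝ π_0(x) e^{−g f(x)} (our reading of the theorem; the notation in the paper may differ), the derivative of the expected objective with respect to g equals minus its variance under the search distribution, which is what makes the expectation monotonically decreasing:

```latex
\frac{\mathrm{d}}{\mathrm{d}g}\,\mathbb{E}_{\pi_g}[f]
  = \frac{\mathrm{d}}{\mathrm{d}g}\,
    \frac{\int \pi_0(x)\,e^{-g f(x)}\,f(x)\,\mathrm{d}x}
         {\int \pi_0(x)\,e^{-g f(x)}\,\mathrm{d}x}
  = \mathbb{E}_{\pi_g}[f]^2 - \mathbb{E}_{\pi_g}\!\left[f^2\right]
  = -\operatorname{Var}_{\pi_g}[f] \;\le\; 0.
```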
Appendix B. Derivation of Derivatives in (22) and (23)
Appendix C. Entropy Regularized Evolutionary Strategy
Appendix D. Proof of Theorem 3
Appendix E. Derivation of Equation (37)
References
- Heess, N.; Dhruva, T.B.; Sriram, S.; Lemmon, J.; Merel, J.; Wayne, G.; Tassa, Y.; Erez, T.; Wang, Z.; Eslami, S.; et al. Emergence of locomotion behaviours in rich environments. arXiv 2017, arXiv:1707.02286. [Google Scholar]
- Todorov, E. Optimal control theory. In Bayesian Brain: Probabilistic Approaches to Neural Coding; MIT Press: Cambridge, MA, USA, 2006; pp. 269–298. [Google Scholar]
- Mayne, D. A Second-order Gradient Method for Determining Optimal Trajectories of Non-linear Discrete-time Systems. Int. J. Control 1966, 3, 85–95. [Google Scholar] [CrossRef]
- Tassa, Y.; Mansard, N.; Todorov, E. Control-limited differential dynamic programming. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014; pp. 1168–1175. [Google Scholar] [CrossRef] [Green Version]
- Erez, T.; Todorov, E. Trajectory optimization for domains with contacts using inverse dynamics. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura, Portugal, 7–12 October 2012; pp. 4914–4919. [Google Scholar]
- Tassa, Y.; Erez, T.; Todorov, E. Synthesis and stabilization of complex behaviors through online trajectory optimization. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura, Portugal, 7–12 October 2012; pp. 4906–4913. [Google Scholar]
- Todorov, E.; Li, W. A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In Proceedings of the 2005 American Control Conference, Portland, OR, USA, 8–10 June 2005; Volume 1, pp. 300–306. [Google Scholar]
- Diehl, M.; Bock, H.; Diedam, H.; Wieber, P. Fast direct multiple shooting algorithms for optimal robot control. In Fast Motions in Biomechanics and Robotics; Springer: Berlin/Heidelberg, Germany, 2006; pp. 65–93. [Google Scholar]
- Kennedy, J.; Eberhart, R. Particle swarm optimization. In Proceedings of the ICNN’95-International Conference on Neural Networks, Perth, WA, Australia, 27 November–1 December 1995; Volume 4, pp. 1942–1948. [Google Scholar] [CrossRef]
- Schwefel, H. Numerische Optimierung von Computer-Modellen Mittels der Evolutionsstrategie: Mit Einer Vergleichenden Einführung in die Hill-Climbing-und Zufallsstrategie; Springer: Berlin/Heidelberg, Germany, 1977; Volume 1. [Google Scholar]
- Abdolmaleki, A.; Price, B.; Lau, N.; Reis, L.; Neumann, G. Deriving and improving CMA-ES with information geometric trust regions. In Proceedings of the Genetic and Evolutionary Computation Conference, ACM, Berlin, Germany, 15–19 July 2017; pp. 657–664. [Google Scholar]
- Ollivier, Y.; Arnold, L.; Auger, A.; Hansen, N. Information-geometric optimization algorithms: A unifying picture via invariance principles. J. Mach. Learn. Res. 2017, 18, 564–628. [Google Scholar]
- Hansen, N. The CMA evolution strategy: A tutorial. arXiv 2016, arXiv:1604.00772. [Google Scholar]
- Wierstra, D.; Schaul, T.; Glasmachers, T.; Sun, Y.; Peters, J.; Schmidhuber, J. Natural evolution strategies. J. Mach. Learn. Res. 2014, 15, 949–980. [Google Scholar]
- Winter, S.; Brendel, B.; Pechlivanis, I.; Schmieder, K.; Igel, C. Registration of CT and intraoperative 3-D ultrasound images of the spine using evolutionary and gradient-based methods. IEEE Trans. Evol. Comput. 2008, 12, 284–296. [Google Scholar] [CrossRef]
- Hansen, N.; Niederberger, A.; Guzzella, L.; Koumoutsakos, P. A method for handling uncertainty in evolutionary optimization with an application to feedback control of combustion. IEEE Trans. Evol. Comput. 2008, 13, 180–197. [Google Scholar] [CrossRef]
- Villasana, M.; Ochoa, G. Heuristic design of cancer chemotherapies. IEEE Trans. Evol. Comput. 2004, 8, 513–521. [Google Scholar] [CrossRef]
- Gholamipoor, M.; Ghadimi, P.; Alavidoost, M.; Feizi Chekab, M. Application of evolution strategy algorithm for optimization of a single-layer sound absorber. Cogent Eng. 2014, 1, 945820. [Google Scholar] [CrossRef]
- Hansen, N.; Kern, S. Evaluating the CMA evolution strategy on multimodal test functions. In Lecture Notes in Computer Science, Proceedings of the International Conference on Parallel Problem Solving from Nature, Birmingham, UK, 18–22 September 2004; Springer: Berlin/Heidelberg, Germany, 2004; pp. 282–291. [Google Scholar]
- Gregory, G.; Bayraktar, Z.; Werner, D. Fast Optimization of Electromagnetic Design Problems Using the Covariance Matrix Adaptation Evolutionary Strategy. IEEE Trans. Antennas Propag. 2011, 59, 1275–1285. [Google Scholar] [CrossRef]
- Kothari, D. Power system optimization. In Proceedings of the 2012 2nd National Conference on Computational Intelligence and Signal Processing (CISP), Guwahati, India, 2–3 March 2012; pp. 18–21. [Google Scholar] [CrossRef]
- Todorov, E. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2007; pp. 1369–1376. [Google Scholar]
- Yong, J.; Zhou, X. Stochastic Controls: Hamiltonian Systems and HJB Equations; Springer Science & Business Media: Berlin/Heidelberg, Germany, 1999; Volume 43. [Google Scholar]
- Kappen, H. Linear theory for control of nonlinear stochastic systems. Phys. Rev. Lett. 2005, 95, 200201. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Stulp, F.; Sigaud, O. Path integral policy improvement with covariance matrix adaptation. arXiv 2012, arXiv:1206.4621. [Google Scholar]
- Stulp, F.; Sigaud, O. Policy Improvement Methods: Between Black-Box Optimization and Episodic Reinforcement Learning. 2012. Available online: https://hal.archives-ouvertes.fr/hal-00738463 (accessed on 1 October 2020).
- Lefebvre, T.; Crevecoeur, G. Path Integral Policy Improvement with Differential Dynamic Programming. In Proceedings of the 2019 IEEE International Conference on Advanced Intelligent Mechatronics (AIM), Hong Kong, China, 8–12 July 2019. [Google Scholar]
- Rajamäki, J.; Naderi, K.; Kyrki, V.; Hämäläinen, P. Sampled differential dynamic programming. In Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Korea, 9–14 October 2016; pp. 1402–1409. [Google Scholar]
- Chebotar, Y.; Kalakrishnan, M.; Yahya, A.; Li, A.; Schaal, S.; Levine, S. Path integral guided policy search. In Proceedings of the 2017 IEEE international conference on robotics and automation (ICRA), Singapore, 29 May–3 June 2017; pp. 3381–3388. [Google Scholar]
- Summers, C.; Lowrey, K.; Rajeswaran, A.; Srinivasa, S.; Todorov, E. Lyceum: An efficient and scalable ecosystem for robot learning. arXiv 2020, arXiv:2001.07343. [Google Scholar]
- Williams, G.; Drews, P.; Goldfain, B.; Rehg, J.; Theodorou, E. Aggressive driving with model predictive path integral control. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 May 2016; pp. 1433–1440. [Google Scholar] [CrossRef]
- Williams, G.; Drews, P.; Goldfain, B.; Rehg, J.; Theodorou, E. Information-Theoretic Model Predictive Control: Theory and Applications to Autonomous Driving. IEEE Trans. Robot. 2018, 34, 1603–1622. [Google Scholar] [CrossRef]
- Theodorou, E.; Krishnamurthy, D.; Todorov, E. From information theoretic dualities to path integral and Kullback-Leibler control: Continuous and discrete time formulations. In Proceedings of the Sixteenth Yale Workshop on Adaptive and Learning Systems, New Haven, CT, USA, 5–7 June 2013. [Google Scholar]
- Theodorou, E. Nonlinear stochastic control and information theoretic dualities: Connections, interdependencies and thermodynamic interpretations. Entropy 2015, 17, 3352–3375. [Google Scholar] [CrossRef]
- Haarnoja, T.; Tang, H.; Abbeel, P.; Levine, S. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 7–9 August 2017; pp. 1352–1361. [Google Scholar]
- Peters, J.; Mulling, K.; Altun, Y. Relative entropy policy search. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, 11–15 July 2010. [Google Scholar]
- Neu, G.; Jonsson, A.; Gómez, V. A unified view of entropy-regularized markov decision processes. arXiv 2017, arXiv:1705.07798. [Google Scholar]
- Akrour, R.; Abdolmaleki, A.; Abdulsamad, H.; Peters, J.; Neumann, G. Model-free trajectory-based policy optimization with monotonic improvement. J. Mach. Learn. Res. 2018, 19, 565–589. [Google Scholar]
- Caticha, A. Entropic inference: Some pitfalls and paradoxes we can avoid. In Proceedings of the MaxEnt 2012, The 32nd International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Garching, Germany, 15–20 July 2012; Volume 1553, pp. 200–211. [Google Scholar]
- Szepesvári, C. Reinforcement Learning Algorithms for MDPs; Wiley Encyclopedia of Operations Research and Management Science; John Wiley & Sons: Hoboken, NJ, USA, 2010. [Google Scholar]
- Kappen, H.; Ruiz, H.C. Adaptive importance sampling for control and inference. J. Stat. Phys. 2016, 162, 1244–1266. [Google Scholar] [CrossRef] [Green Version]
- Williams, G.; Aldrich, A.; Theodorou, E. Model predictive path integral control: From theory to parallel computation. J. Guid. Control. Dyn. 2017, 40, 344–357. [Google Scholar] [CrossRef]
- Kappen, H.; Wiegerinck, W.; van den Broek, B. A path integral approach to agent planning. In Autonomous Agents and Multi-Agent Systems; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2007. [Google Scholar]
- Drews, P.; Williams, G.; Goldfain, B.; Theodorou, E.; Rehg, J. Vision-Based High-Speed Driving With a Deep Dynamic Observer. IEEE Robot. Autom. Lett. 2019, 4, 1564–1571. [Google Scholar] [CrossRef] [Green Version]
- Theodorou, E.; Buchli, J.; Schaal, S. A generalized path integral control approach to reinforcement learning. J. Mach. Learn. Res. 2010, 11, 3137–3181. [Google Scholar]
- Theodorou, E.; Buchli, J.; Schaal, S. Reinforcement learning of motor skills in high dimensions: A path integral approach. In Proceedings of the 2010 IEEE International Conference on Robotics and Automation, Anchorage, AK, USA, 3–7 May 2010; pp. 2397–2403. [Google Scholar]
- Gómez, V.; Kappen, H.; Peters, J.; Neumann, G. Policy search for path integral control. In Lecture Notes in Computer Science, Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Nancy, France, 15–19 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 482–497. [Google Scholar]
- Pan, Y.; Theodorou, E.; Kontitsis, M. Sample efficient path integral control under uncertainty. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 7–12 December 2015; pp. 2314–2322. [Google Scholar]
- Thalmeier, D.; Kappen, H.; Totaro, S.; Gómez, V. Adaptive Smoothing Path Integral Control. arXiv 2020, arXiv:2005.06364. [Google Scholar]
- Yamamoto, K.; Ariizumi, R.; Hayakawa, T.; Matsuno, F. Path Integral Policy Improvement With Population Adaptation. IEEE Trans. Cybern. 2020. [Google Scholar] [CrossRef] [PubMed]
- Sun, Y.; Wierstra, D.; Schaul, T.; Schmidhuber, J. Efficient natural evolution strategies. In Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation, ACM, Montreal, QC, Canada, 8–12 July 2009; pp. 539–546. [Google Scholar]
- Pourchot, A.; Perrin, N.; Sigaud, O. Importance mixing: Improving sample reuse in evolutionary policy search methods. arXiv 2018, arXiv:1808.05832. [Google Scholar]
- Giffin, A.; Caticha, A. Updating probabilities with data and moments. In Proceedings of the 27th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Saratoga Springs, NY, USA, 8–13 July 2007; Volume 954, pp. 74–84. [Google Scholar]
- Jaynes, E. Information theory and statistical mechanics. Phys. Rev. 1957, 106, 620. [Google Scholar] [CrossRef]
- Jaynes, E. Information theory and statistical mechanics. II. Phys. Rev. 1957, 108, 171. [Google Scholar] [CrossRef]
- Kullback, S. Information Theory and Statistics; John Wiley & Sons: Hoboken, NJ, USA, 1959. [Google Scholar]
- Shore, J.; Johnson, R. Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Trans. Inf. Theory 1980, 26, 26–37. [Google Scholar]
- Jaynes, E. On the rationale of maximum-entropy methods. Proc. IEEE 1982, 70, 939–952. [Google Scholar] [CrossRef]
- Tikochinsky, Y.; Tishby, N.; Levine, R. Consistent inference of probabilities for reproducible experiments. Phys. Rev. Lett. 1984, 52, 1357. [Google Scholar] [CrossRef]
- Caticha, A. Entropic inference. In Proceedings of the MaxEnt 2010, the 30th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Chamonix, France, 4–9 July 2010; Volume 1305, pp. 20–29. [Google Scholar]
- Abdolmaleki, A.; Lioutikov, R.; Peters, J.; Lau, N.; Reis, L.; Neumann, G. Model-based relative entropy stochastic search. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 7–12 December 2015; pp. 3537–3545. [Google Scholar]
- Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft actor-critic algorithms and applications. arXiv 2018, arXiv:1812.05905. [Google Scholar]
- Rubin, J.; Shamir, O.; Tishby, N. Trading value and information in MDPs. In Decision Making with Imperfect Decision Makers; Springer: Berlin/Heidelberg, Germany, 2012; pp. 57–74. [Google Scholar]
- Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust region policy optimization. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 1889–1897. [Google Scholar]
- Rawlik, K.; Toussaint, M.; Vijayakumar, S. On stochastic optimal control and reinforcement learning by approximate inference. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, Beijing, China, 3–9 August 2013; AAAI Press: Menlo Park, CA, USA, 2013. [Google Scholar]