Abstract
Value estimation is a key problem in Reinforcement Learning. Although Deep Reinforcement Learning (DRL) has achieved many successes in different fields, the underlying structure and learning dynamics of the value function, especially with complex function approximation, are not fully understood. In this paper, we report that a decreasing rank of the Q-matrix widely exists during the learning process across a series of continuous control tasks for different popular algorithms. We hypothesize that this low-rank phenomenon reflects a common learning dynamics of the Q-matrix: from a stochastic, high-dimensional space to a smooth, low-dimensional one. Moreover, we reveal a positive correlation between value-matrix rank and value estimation uncertainty. Inspired by this evidence, we propose a novel Uncertainty-Aware Low-rank Q-matrix Estimation (UA-LQE) algorithm as a general framework to facilitate the learning of the value function. By quantifying the uncertainty of state-action value estimates, we selectively erase the entries with highly uncertain values in the state-action value matrix and conduct low-rank matrix reconstruction to recover their values. Such a reconstruction exploits the underlying structure of the value matrix to improve the value approximation, thus leading to a more efficient learning process of the value function. In the experiments, we evaluate the efficacy of UA-LQE on several representative OpenAI MuJoCo continuous control tasks.
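To make the core idea concrete, the following is a minimal sketch of the reconstruction step described above, assuming a per-entry uncertainty score and some low-rank completion routine (for instance, the Soft-Impute sketch in Appendix A). The function name, the quantile-based threshold, and the masking rule are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def ua_lqe_reconstruct(q_matrix, uncertainty, low_rank_complete, quantile=0.9):
    """Illustrative UA-LQE step (not the authors' exact implementation):
    erase the most uncertain entries of the state-action value matrix and
    recover them with a low-rank matrix completion routine."""
    # Treat entries whose uncertainty exceeds the chosen quantile as unknown.
    threshold = np.quantile(uncertainty, quantile)
    known_mask = uncertainty < threshold                # Omega: entries kept as-is
    masked_q = np.where(known_mask, q_matrix, 0.0)
    # Recover the erased entries, exploiting the low-rank structure of the Q-matrix.
    return low_rank_complete(masked_q, known_mask)
```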
Notes
1. For TD3 and SAC, which adopt double critics \(Q_{\theta _1}, Q_{\theta _2}\), we calculate the approximate rank of \(Q_{\theta _1}\), since the two critics are equivalent for this purpose.
References
Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., Munos, R.: Unifying count-based exploration and intrinsic motivation. In: NeurIPS, pp. 1471–1479 (2016)
Cai, J., Candès, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 20(4), 1956–1982 (2010)
Ciosek, K., Vuong, Q., Loftin, R., Hofmann, K.: Better exploration with optimistic actor critic. In: NeurIPS, pp. 1785–1796 (2019)
Fujimoto, S., van Hoof, H., Meger, D.: Addressing function approximation error in actor-critic methods. In: ICML (2018)
Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: ICML, pp. 1856–1865 (2018)
Hafner, D., Lillicrap, T.P., Ba, J., Norouzi, M.: Dream to control: learning behaviors by latent imagination. In: ICLR (2020)
Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat, N., Modayil, J.: Deep reinforcement learning and the deadly triad. CoRR abs/1812.02648 (2018)
He, J., Zhou, D., Gu, Q.: Uniform-PAC bounds for reinforcement learning with linear function approximation. CoRR abs/2106.11612 (2021)
Keshavan, R.H., Oh, S., Montanari, A.: Matrix completion from a few entries. In: ISIT, pp. 324–328 (2009)
Kumar, A., Agarwal, R., Ghosh, D., Levine, S.: Implicit under-parameterization inhibits data-efficient deep reinforcement learning. In: ICLR (2021)
Lillicrap, T.P., et al.: Continuous control with deep reinforcement learning. In: ICLR (2015)
Luo, X., Meng, Q., He, D., Chen, W., Wang, Y.: I4R: promoting deep reinforcement learning by the indicator for expressive representations. In: IJCAI, pp. 2669–2675 (2020)
Lyle, C., Rowland, M., Ostrovski, G., Dabney, W.: On the effect of auxiliary tasks on representation dynamics. In: AISTATS, vol. 130, pp. 1–9 (2021)
Ma, S., Goldfarb, D., Chen, L.: Fixed point and Bregman iterative methods for matrix rank minimization. Math. Program. 128(1–2), 321–353 (2011)
Mazumder, R., Hastie, T., Tibshirani, R.: Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. 11, 2287–2322 (2010)
Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
Ong, H.: Value function approximation via low-rank models. CoRR abs/1509.00061 (2015)
Osband, I., Blundell, C., Pritzel, A., Roy, B.V.: Deep exploration via bootstrapped DQN. In: NeurIPS, pp. 4026–4034 (2016)
Pathak, D., Gandhi, D., Gupta, A.: Self-supervised exploration via disagreement. In: ICML, vol. 97, pp. 5062–5071 (2019)
Scherrer, B., Ghavamzadeh, M., Gabillon, V., Lesner, B., Geist, M.: Approximate modified policy iteration and its application to the game of Tetris. J. Mach. Learn. Res. 16, 1629–1676 (2015)
Schreck, J.S., Coley, C.W., Bishop, K.J.: Learning retrosynthetic planning through simulated experience. ACS Central Sci. 5(6), 970–981 (2019)
Silver, D., et al.: Mastering the game of go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., Riedmiller, M.A.: Deterministic policy gradient algorithms. In: ICML, pp. 387–395 (2014)
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
Tang, H., et al.: #Exploration: a study of count-based exploration for deep reinforcement learning. In: NeurIPS, pp. 2753–2762 (2017)
Vinyals, O., et al.: Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782), 350–354 (2019)
Yang, T., et al.: Exploration in deep reinforcement learning: a comprehensive survey. CoRR abs/2109.06668 (2021)
Yang, Y., Zhang, G., Xu, Z., Katabi, D.: Harnessing structures for value-based planning and reinforcement learning. In: ICLR (2020)
Acknowledgement
This work is supported by the National Natural Science Foundation of China (Grant No. 62106172).
Appendices
A More Background on Matrix Estimation
Matrix estimation methods mainly fall into two categories: matrix completion and matrix restoration. In this paper, we focus on the former and use the terms matrix estimation and matrix completion interchangeably. Consider a low-rank matrix M in which the values of some entries are missing or unknown. The goal of matrix completion is to recover the matrix by leveraging its low-rank structure. We use \(\varOmega \) to denote the set of indices of the known entries and \(\hat{M}\) to denote the recovered (also called reconstructed in this paper) matrix. Formally, the matrix completion problem is formulated as follows:
\[ \min _{\hat{M}} \Vert \hat{M} \Vert _{*} \quad \text {s.t.} \quad \hat{M}_{ij} = M_{ij}, \; \forall (i,j) \in \varOmega . \]
Here \(\Vert \cdot \Vert _{*}\) is the nuclear norm, i.e., the sum of singular values, which serves as a convex surrogate for the rank. The purpose of the above optimization problem is to fill in the missing entries of the original matrix while keeping the rank of the filled matrix \(\hat{M}\) low.
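As a quick numeric illustration (an illustrative snippet, not part of the paper), the nuclear norm can be computed as the sum of singular values, and a low rank shows up as few nonzero singular values:

```python
import numpy as np

# A rank-one matrix: the second row is twice the first.
M = np.array([[1.0, 2.0],
              [2.0, 4.0]])
singular_values = np.linalg.svd(M, compute_uv=False)
print(singular_values.sum())          # nuclear norm ||M||_* = 5.0
print(np.linalg.matrix_rank(M))       # rank(M) = 1
```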
Representative methods for solving this optimization problem include the Fixed Point Continuation (FPC) algorithm [14], the Singular Value Thresholding (SVT) algorithm [2], the OptSpace algorithm [9], and the Soft-Impute algorithm [15].

In this paper, we use the Soft-Impute algorithm for matrix completion, i.e., Q-matrix reconstruction. The pseudocode of Soft-Impute is shown in Algorithm 2. We define the projection operator \(P_{\varOmega }\) as follows:
\[ P_{\varOmega }(M)_{ij} = \begin{cases} M_{ij}, & (i,j) \in \varOmega , \\ 0, & (i,j) \notin \varOmega . \end{cases} \]
Similarly, we define \(P_{\bar{\varOmega }}\) on the complement index set \(\bar{\varOmega }\). Soft-Impute completes the missing entries in \(\bar{\varOmega }\) in an iterative reconstruction fashion. In each iteration, the singular value matrix S is obtained by SVD; a filtering step then subtracts a value \(\lambda \), controlled by the filtering divider hyperparameter \(\zeta \), from the singular values; next, the new matrix \(M_{\text {new}}\) is reconstructed from the thresholded singular value matrix \(S_{\lambda }\). Afterwards, a termination check compares the difference between the matrices before and after the iteration against the termination threshold \(\epsilon \). When the termination criterion is met or the maximum number of iterations is reached, the algorithm outputs the completed matrix \(\hat{M}\).
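Since Algorithm 2 itself is not reproduced here, the following is a minimal Python sketch of the Soft-Impute iteration as described above. The exact relation between the shrinkage value \(\lambda \) and the filtering divider \(\zeta \) (here taken as \(\lambda = \sigma _{\max }/\zeta \)) and the form of the termination test are our assumptions, not the authors' implementation.

```python
import numpy as np

def soft_impute(masked_m, known_mask, zeta=50, eps=1e-4, max_iter=100):
    """Sketch of the Soft-Impute iteration (our reading of the description above,
    not the authors' code). known_mask marks the entries in Omega; masked_m holds
    their values and zeros elsewhere."""
    m_hat = masked_m.copy()
    for _ in range(max_iter):
        # Keep the known entries and plug in the current estimate for the rest.
        filled = np.where(known_mask, masked_m, m_hat)
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        # Soft-threshold the singular values; we assume lambda is the largest
        # singular value divided by the filtering divider zeta (an assumption).
        lam = s[0] / zeta
        s_lam = np.maximum(s - lam, 0.0)
        m_new = (u * s_lam) @ vt
        # Stop when the iterate barely changes relative to its size.
        change = np.linalg.norm(m_new - m_hat) / max(np.linalg.norm(m_hat), 1.0)
        m_hat = m_new
        if change <= eps:
            break
    return m_hat
```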
B Additional Experimental Details
All regular hyperparameters and implementation details are described in the main body of the paper.
For code-level details, our code is implemented with Python 3.6.9 and PyTorch 1.3.1. All experiments were run on a single NVIDIA GeForce GTX 1660Ti GPU. Our Q-matrix reconstruction algorithm is implemented with reference to https://github.com/YyzHarry/SV-RL.
For the matrix reconstruction algorithm, i.e., Soft-Impute, we set the filtering divider \(\zeta = 50\), the termination threshold \(\epsilon = 0.0001\), and the maximum iteration number to 100.
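For concreteness, a hypothetical call of the Soft-Impute sketch from Appendix A with these settings would look as follows (synthetic data for illustration only, not the experimental pipeline itself):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic low-rank "Q-matrix" with roughly 20% of its entries erased.
q_matrix = rng.standard_normal((64, 16)) @ rng.standard_normal((16, 64))
known_mask = rng.random(q_matrix.shape) < 0.8
masked_q = np.where(known_mask, q_matrix, 0.0)
# Hyperparameters as reported above: zeta = 50, epsilon = 0.0001, 100 iterations.
q_hat = soft_impute(masked_q, known_mask, zeta=50, eps=1e-4, max_iter=100)
```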
C Complete Learning Curves of Table 2