Abstract
Value estimation is a key problem in Reinforcement Learning. Although Deep Reinforcement Learning (DRL) has achieved many successes in different fields, the underlying structure and learning dynamics of the value function, especially with complex function approximation, are not fully understood. In this paper, we report that a decreasing rank of the Q-matrix widely exists during the learning process across a series of continuous control tasks for different popular algorithms. We hypothesize that this low-rank phenomenon reflects a common learning dynamics of the Q-matrix: from a stochastic, high-dimensional space to a smooth, low-dimensional one. Moreover, we reveal a positive correlation between value-matrix rank and value estimation uncertainty. Inspired by this evidence, we propose a novel Uncertainty-Aware Low-rank Q-matrix Estimation (UA-LQE) algorithm as a general framework to facilitate the learning of the value function. By quantifying the uncertainty of state-action value estimates, we selectively erase the entries with highly uncertain values in the state-action value matrix and conduct low-rank matrix reconstruction to recover their values. Such a reconstruction exploits the underlying structure of the value matrix to improve the value approximation, thus leading to a more efficient learning process of the value function. In the experiments, we evaluate the efficacy of UA-LQE on several representative OpenAI MuJoCo continuous control tasks.
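To make the core idea concrete, the following is a minimal sketch of the reconstruction step described above, assuming a per-entry uncertainty score and some low-rank completion routine (for instance, the Soft-Impute sketch in Appendix A). The function name, the quantile-based threshold, and the masking rule are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def ua_lqe_reconstruct(q_matrix, uncertainty, low_rank_complete, quantile=0.9):
    """Illustrative UA-LQE step (not the authors' exact implementation):
    erase the most uncertain entries of the state-action value matrix and
    recover them with a low-rank matrix completion routine."""
    # Treat entries whose uncertainty exceeds the chosen quantile as unknown.
    threshold = np.quantile(uncertainty, quantile)
    known_mask = uncertainty < threshold                # Omega: entries kept as-is
    masked_q = np.where(known_mask, q_matrix, 0.0)
    # Recover the erased entries, exploiting the low-rank structure of the Q-matrix.
    return low_rank_complete(masked_q, known_mask)
```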
Notes
1. For TD3 and SAC, which adopt double critics \(Q_{\theta _1}, Q_{\theta _2}\), we calculate the approximate rank of \(Q_{\theta _1}\), since the two critics are equivalent for this purpose.
References
Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., Munos, R.: Unifying count-based exploration and intrinsic motivation. In: NeurIPS, pp. 1471–1479 (2016)
Cai, J., Candès, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 20(4), 1956–1982 (2010)
Ciosek, K., Vuong, Q., Loftin, R., Hofmann, K.: Better exploration with optimistic actor critic. In: NeurIPS, pp. 1785–1796 (2019)
Fujimoto, S., van Hoof, H., Meger, D.: Addressing function approximation error in actor-critic methods. In: ICML (2018)
Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: ICML, pp. 1856–1865 (2018)
Hafner, D., Lillicrap, T.P., Ba, J., Norouzi, M.: Dream to control: learning behaviors by latent imagination. In: ICLR (2020)
Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat, N., Modayil, J.: Deep reinforcement learning and the deadly triad. CoRR abs/1812.02648 (2018)
He, J., Zhou, D., Gu, Q.: Uniform-PAC bounds for reinforcement learning with linear function approximation. CoRR abs/2106.11612 (2021)
Keshavan, R.H., Oh, S., Montanari, A.: Matrix completion from a few entries. In: ISIT, pp. 324–328 (2009)
Kumar, A., Agarwal, R., Ghosh, D., Levine, S.: Implicit under-parameterization inhibits data-efficient deep reinforcement learning. In: ICLR (2021)
Lillicrap, T.P., et al.: Continuous control with deep reinforcement learning. In: ICLR (2015)
Luo, X., Meng, Q., He, D., Chen, W., Wang, Y.: I4R: promoting deep reinforcement learning by the indicator for expressive representations. In: IJCAI, pp. 2669–2675 (2020)
Lyle, C., Rowland, M., Ostrovski, G., Dabney, W.: On the effect of auxiliary tasks on representation dynamics. In: AISTATS, vol. 130, pp. 1–9 (2021)
Ma, S., Goldfarb, D., Chen, L.: Fixed point and Bregman iterative methods for matrix rank minimization. Math. Program. 128(1–2), 321–353 (2011)
Mazumder, R., Hastie, T., Tibshirani, R.: Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. 11, 2287–2322 (2010)
Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
Ong, H.: Value function approximation via low-rank models. CoRR abs/1509.00061 (2015)
Osband, I., Blundell, C., Pritzel, A., Roy, B.V.: Deep exploration via bootstrapped DQN. In: NeurIPS, pp. 4026–4034 (2016)
Pathak, D., Gandhi, D., Gupta, A.: Self-supervised exploration via disagreement. In: ICML, vol. 97, pp. 5062–5071 (2019)
Scherrer, B., Ghavamzadeh, M., Gabillon, V., Lesner, B., Geist, M.: Approximate modified policy iteration and its application to the game of Tetris. J. Mach. Learn. Res. 16, 1629–1676 (2015)
Schreck, J.S., Coley, C.W., Bishop, K.J.: Learning retrosynthetic planning through simulated experience. ACS Central Sci. 5(6), 970–981 (2019)
Silver, D., et al.: Mastering the game of go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., Riedmiller, M.A.: Deterministic policy gradient algorithms. In: ICML, pp. 387–395 (2014)
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
Tang, H., et al.: #Exploration: a study of count-based exploration for deep reinforcement learning. In: NeurIPS, pp. 2753–2762 (2017)
Vinyals, O., et al.: Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782), 350–354 (2019)
Yang, T., et al.: Exploration in deep reinforcement learning: a comprehensive survey. CoRR abs/2109.06668 (2021)
Yang, Y., Zhang, G., Xu, Z., Katabi, D.: Harnessing structures for value-based planning and reinforcement learning. In: ICLR (2020)
Acknowledgement
This work is supported by the National Natural Science Foundation of China (Grant No. 62106172).
Appendices
A More Background on Matrix Estimation
Matrix estimation methods mainly fall into two categories: matrix completion and matrix restoration. In this paper, we focus on the former and use the terms matrix estimation and matrix completion interchangeably. Consider a low-rank matrix M in which the values of some entries are missing or unknown. The goal of matrix completion is to recover the matrix by leveraging its low-rank structure. We use \(\varOmega \) to denote the set of indices of the known entries and \(\hat{M}\) to denote the recovered (also called reconstructed in this paper) matrix. Formally, the matrix completion problem is formulated as follows:
\[ \min _{\hat{M}} \Vert \hat{M} \Vert _{*} \quad \text {s.t.} \quad \hat{M}_{ij} = M_{ij}, \; \forall (i,j) \in \varOmega . \]
Here \(\Vert \cdot \Vert _{*}\) is the nuclear norm, i.e., the sum of singular values, which serves as a convex surrogate for the rank. The purpose of the above optimization problem is to fill in the missing entries of the original matrix while keeping the rank of the filled matrix \(\hat{M}\) low.
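As a quick numeric illustration (an illustrative snippet, not part of the paper), the nuclear norm can be computed as the sum of singular values, and a low rank shows up as few nonzero singular values:

```python
import numpy as np

# A rank-one matrix: the second row is twice the first.
M = np.array([[1.0, 2.0],
              [2.0, 4.0]])
singular_values = np.linalg.svd(M, compute_uv=False)
print(singular_values.sum())          # nuclear norm ||M||_* = 5.0
print(np.linalg.matrix_rank(M))       # rank(M) = 1
```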
Representative methods for solving this optimization problem include the Fixed Point Continuation (FPC) algorithm [14], the Singular Value Thresholding (SVT) algorithm [2], the OptSpace algorithm [9], and the Soft-Impute algorithm [15].

In this paper, we use the Soft-Impute algorithm for matrix completion, i.e., Q-matrix reconstruction. The pseudocode of Soft-Impute is shown in Algorithm 2. We define the projection operator \(P_{\varOmega }\) as follows:
\[ P_{\varOmega }(M)_{ij} = \begin{cases} M_{ij}, & (i,j) \in \varOmega , \\ 0, & (i,j) \notin \varOmega . \end{cases} \]
Similarly, we define \(P_{\bar{\varOmega }}\) on the complement index set \(\bar{\varOmega }\). Soft-Impute completes the missing entries in \(\bar{\varOmega }\) in an iterative reconstruction fashion. In each iteration, the singular value matrix S is obtained by SVD; a filtering step then subtracts a value \(\lambda \), controlled by the filtering divider hyperparameter \(\zeta \), from the singular values; next, the new matrix \(M_{\text {new}}\) is reconstructed from the thresholded singular value matrix \(S_{\lambda }\). Afterwards, a termination check compares the difference between the matrices before and after the iteration against the termination threshold \(\epsilon \). When the termination criterion is met or the maximum number of iterations is reached, the algorithm outputs the completed matrix \(\hat{M}\).
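Since Algorithm 2 itself is not reproduced here, the following is a minimal Python sketch of the Soft-Impute iteration as described above. The exact relation between the shrinkage value \(\lambda \) and the filtering divider \(\zeta \) (here taken as \(\lambda = \sigma _{\max }/\zeta \)) and the form of the termination test are our assumptions, not the authors' implementation.

```python
import numpy as np

def soft_impute(masked_m, known_mask, zeta=50, eps=1e-4, max_iter=100):
    """Sketch of the Soft-Impute iteration (our reading of the description above,
    not the authors' code). known_mask marks the entries in Omega; masked_m holds
    their values and zeros elsewhere."""
    m_hat = masked_m.copy()
    for _ in range(max_iter):
        # Keep the known entries and plug in the current estimate for the rest.
        filled = np.where(known_mask, masked_m, m_hat)
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        # Soft-threshold the singular values; we assume lambda is the largest
        # singular value divided by the filtering divider zeta (an assumption).
        lam = s[0] / zeta
        s_lam = np.maximum(s - lam, 0.0)
        m_new = (u * s_lam) @ vt
        # Stop when the iterate barely changes relative to its size.
        change = np.linalg.norm(m_new - m_hat) / max(np.linalg.norm(m_hat), 1.0)
        m_hat = m_new
        if change <= eps:
            break
    return m_hat
```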
B Additional Experimental Details
All regular hyperparameters and implementation details are described in the main body of the paper.
For code-level details, our code is implemented with Python 3.6.9 and PyTorch 1.3.1. All experiments were run on a single NVIDIA GeForce GTX 1660Ti GPU. Our Q-matrix reconstruction algorithm is implemented with reference to https://github.com/YyzHarry/SV-RL.
For the matrix reconstruction algorithm, i.e., Soft-Impute, we set the filtering divider \(\zeta = 50\), the termination threshold \(\epsilon = 0.0001\), and the maximum iteration number to 100.
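For concreteness, a hypothetical call of the Soft-Impute sketch from Appendix A with these settings would look as follows (synthetic data for illustration only, not the experimental pipeline itself):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic low-rank "Q-matrix" with roughly 20% of its entries erased.
q_matrix = rng.standard_normal((64, 16)) @ rng.standard_normal((16, 64))
known_mask = rng.random(q_matrix.shape) < 0.8
masked_q = np.where(known_mask, q_matrix, 0.0)
# Hyperparameters as reported above: zeta = 50, epsilon = 0.0001, 100 iterations.
q_hat = soft_impute(masked_q, known_mask, zeta=50, eps=1e-4, max_iter=100)
```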
C Complete Learning Curves of Table 2