Uncertainty-Aware Low-Rank Q-Matrix Estimation for Deep Reinforcement Learning | SpringerLink

Uncertainty-Aware Low-Rank Q-Matrix Estimation for Deep Reinforcement Learning

  • Conference paper
  • First Online:
Distributed Artificial Intelligence (DAI 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13170)


Abstract

Value estimation is a key problem in reinforcement learning. Although Deep Reinforcement Learning (DRL) has achieved many successes in different fields, the underlying structure and learning dynamics of the value function, especially with complex function approximation, are not fully understood. In this paper, we report that a decreasing rank of the Q-matrix is widely observed during the learning process across a series of continuous control tasks for several popular algorithms. We hypothesize that this low-rank phenomenon reflects a common learning dynamic of the Q-matrix: an evolution from a stochastic high-dimensional space to a smooth low-dimensional space. Moreover, we reveal a positive correlation between value matrix rank and value estimation uncertainty. Inspired by this evidence, we propose a novel Uncertainty-Aware Low-rank Q-matrix Estimation (UA-LQE) algorithm as a general framework to facilitate the learning of the value function. By quantifying the uncertainty of state-action value estimates, we selectively erase highly uncertain entries of the state-action value matrix and recover their values through low-rank matrix reconstruction. Such a reconstruction exploits the underlying structure of the value matrix to improve the value approximation, leading to more efficient learning of the value function. In the experiments, we evaluate the efficacy of UA-LQE on several representative OpenAI MuJoCo continuous control tasks.
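To make the idea concrete, the following is a minimal sketch of one UA-LQE-style reconstruction step, assuming a per-entry uncertainty estimate (e.g., from an ensemble of critics) and a generic low-rank completion routine. The function and variable names (ua_lqe_reconstruct, complete_low_rank, erase_ratio) are our own illustration, not the authors' implementation.

```python
import numpy as np

def ua_lqe_reconstruct(Q_matrix, uncertainty, complete_low_rank, erase_ratio=0.2):
    """Erase the most uncertain Q-matrix entries and recover them via
    low-rank matrix completion (illustrative sketch of the UA-LQE idea)."""
    # Q_matrix:          (num_states, num_actions) array of current value estimates
    # uncertainty:       array of the same shape quantifying estimation uncertainty
    # complete_low_rank: callable (matrix, known_mask) -> completed matrix
    # erase_ratio:       fraction of entries treated as unreliable (assumed knob)
    num_erase = max(1, int(erase_ratio * Q_matrix.size))
    # flatten, sort by uncertainty, and pick out the most uncertain entries
    erase_idx = np.argsort(uncertainty, axis=None)[-num_erase:]
    known_mask = np.ones(Q_matrix.shape, dtype=bool)
    known_mask.flat[erase_idx] = False            # mark them as missing
    # recover the erased entries by exploiting the low-rank structure
    return complete_low_rank(Q_matrix, known_mask)
```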


Notes

  1. For TD3 and SAC, which adopt double critics \(Q_{\theta _1}, Q_{\theta _2}\), we calculate the approximate rank of \(Q_{\theta _1}\), since the two are equivalent.

References

  1. Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., Munos, R.: Unifying count-based exploration and intrinsic motivation. In: NeurIPS, pp. 1471–1479 (2016)

    Google Scholar 

  2. Cai, J., Candès, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 20(4), 1956–1982 (2010)

    Article  MathSciNet  Google Scholar 

  3. Ciosek, K., Vuong, Q., Loftin, R., Hofmann, K.: Better exploration with optimistic actor critic. In: NeurIPS, pp. 1785–1796 (2019)

    Google Scholar 

  4. Fujimoto, S., van Hoof, H., Meger, D.: Addressing function approximation error in actor-critic methods. In: ICML (2018)

    Google Scholar 

  5. Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: ICML, pp. 1856–1865 (2018)

    Google Scholar 

  6. Hafner, D., Lillicrap, T.P., Ba, J., Norouzi, M.: Dream to control: learning behaviors by latent imagination. In: ICLR (2020)

    Google Scholar 

  7. Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat, N., Modayil, J.: Deep reinforcement learning and the deadly triad. CoRR abs/1812.02648 (2018)

    Google Scholar 

  8. He, J., Zhou, D., Gu, Q.: Uniform-PAC bounds for reinforcement learning with linear function approximation. CoRR abs/2106.11612 (2021)

    Google Scholar 

  9. Keshavan, R.H., Oh, S., Montanari, A.: Matrix completion from a few entries. In: Proceedings of the IEEE International Symposium on Information Theory, ISIT 2009, June 28–July 3, 2009, Seoul, Korea, pp. 324–328. IEEE (2009)

    Google Scholar 

  10. Kumar, A., Agarwal, R., Ghosh, D., Levine, S.: Implicit under-parameterization inhibits data-efficient deep reinforcement learning. In: ICLR (2021)

    Google Scholar 

  11. Lillicrap, T.P., et al.: Continuous control with deep reinforcement learning. In: ICLR (2015)

    Google Scholar 

  12. Luo, X., Meng, Q., He, D., Chen, W., Wang, Y.: I4R: promoting deep reinforcement learning by the indicator for expressive representations. In: IJCAI, pp. 2669–2675. ijcai.org (2020)

    Google Scholar 

  13. Lyle, C., Rowland, M., Ostrovski, G., Dabney, W.: On the effect of auxiliary tasks on representation dynamics. In: AISTATS. vol. 130, pp. 1–9 (2021)

    Google Scholar 

  14. Ma, S., Goldfarb, D., Chen, L.: Fixed point and Bregman iterative methods for matrix rank minimization. Math. Program. 128(1–2), 321–353 (2011)

    Article  MathSciNet  Google Scholar 

  15. Mazumder, R., Hastie, T., Tibshirani, R.: Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. 11, 2287–2322 (2010)

    MathSciNet  MATH  Google Scholar 

  16. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)

    Article  Google Scholar 

  17. Ong, H.: Value function approximation via low-rank models. CoRR abs/1509.00061 (2015)

    Google Scholar 

  18. Osband, I., Blundell, C., Pritzel, A., Roy, B.V.: Deep exploration via bootstrapped DQN. In: Lee, D.D., Sugiyama, M., von Luxburg, U., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5–10, 2016, Barcelona, Spain, pp. 4026–4034 (2016)

    Google Scholar 

  19. Pathak, D., Gandhi, D., Gupta, A.: Self-supervised exploration via disagreement. In: ICML, vol. 97, pp. 5062–5071 (2019)

    Google Scholar 

  20. Scherrer, B., Ghavamzadeh, M., Gabillon, V., Lesner, B., Geist, M.: Approximate modified policy iteration and its application to the game of Tetris. J. Mach. Learn. Res. 16, 1629–1676 (2015)

    MathSciNet  MATH  Google Scholar 

  21. Schreck, J.S., Coley, C.W., Bishop, K.J.: Learning retrosynthetic planning through simulated experience. ACS Central Sci. 5(6), 970–981 (2019)

    Article  Google Scholar 

  22. Silver, D., et al.: Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)

    Article  Google Scholar 

  23. Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., Riedmiller, M.A.: Deterministic policy gradient algorithms. In: ICML, pp. 387–395 (2014)

    Google Scholar 

  24. Sutton, R.S., Barto, A.G.: Reinforcement learning: an introduction. IEEE Trans. Neural Netw. 16, 285–286 (1988)

    MATH  Google Scholar 

  25. Tang, H., et al.: #Exploration: a study of count-based exploration for deep reinforcement learning. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4–9 December 2017, pp. 2753–2762. Long Beach, CA, USA (2017)

    Google Scholar 

  26. Vinyals, O., et al.: Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782), 350–354 (2019)

    Article  Google Scholar 

  27. Yang, T., et al.: Exploration in deep reinforcement learning: a comprehensive survey. CoRR abs/2109.06668 (2021)

    Google Scholar 

  28. Yang, Y., Zhang, G., Xu, Z., Katabi, D.: Harnessing structures for value-based planning and reinforcement learning. In: ICLR (2020)

    Google Scholar 


Acknowledgement

This work is supported by the National Natural Science Foundation of China (Grant No. 62106172).

Author information


Correspondence to Jianye Hao or Yan Zheng.


Appendices

Appendix

A More Background on Matrix Estimation

Matrix estimation mainly falls into two categories: matrix completion and matrix restoration. In this paper, we focus on the former and use the terms matrix estimation and matrix completion interchangeably. Consider a low-rank matrix M in which the values of some entries are missing or unknown. The goal of the matrix completion problem is to recover the matrix by leveraging its low-rank structure. We use \(\varOmega \) to denote the set of indices of known entries and \(\hat{M}\) to denote the recovered (also called reconstructed in this paper) matrix. Formally, the matrix completion problem is formalized as follows:

$$\begin{aligned} \min _{\hat{M}} \ \Vert \hat{M}\Vert _{*}, \quad \text {subject to} \ \hat{M}_{i,j}=M_{i,j}, \quad \forall (i,j) \in \varOmega \end{aligned}$$
(5)

Here \(\Vert \cdot \Vert _{*}\) denotes the nuclear norm, a convex surrogate for the matrix rank. The purpose of the above optimization problem is to fill in the missing entries of the original matrix while keeping the rank of the completed matrix \(\hat{M}\) low.
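As a concrete illustration of the objective and constraint in Eq. (5), the following minimal numpy check is a sketch of our own (the names nuclear_norm and satisfies_constraint are not from the paper):

```python
import numpy as np

def nuclear_norm(M_hat):
    # the nuclear norm is the sum of the singular values of the matrix
    return np.linalg.svd(M_hat, compute_uv=False).sum()

def satisfies_constraint(M_hat, M, omega, tol=1e-8):
    # omega is an iterable of (i, j) index pairs of known entries
    return all(abs(M_hat[i, j] - M[i, j]) <= tol for (i, j) in omega)
```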

Representative methods for solving this optimization problem include the Fixed Point Continuation (FPC) algorithm [14], the Singular Value Thresholding (SVT) algorithm [2], the OptSpace algorithm [9], and the Soft-Impute algorithm [15].

Algorithm 2: Soft-Impute (pseudocode)

In this paper, we use the Soft-Impute algorithm for matrix completion, i.e., Q-matrix reconstruction. Pseudocode for Soft-Impute is shown in Algorithm 2. Define the projection function \(P_{\varOmega }\) as below:

$$\begin{aligned} P_{\varOmega }(M)(i, j) = {\left\{ \begin{array}{ll} M_{i,j}, &{} \text {if } (i,j) \in \varOmega \\ 0, &{} \text {if } (i,j) \notin \varOmega \end{array}\right. } \end{aligned}$$
(6)

Similarly, we define the function \(P_{\bar{\varOmega }}\) for the complement set \(\bar{\varOmega }\). The Soft-Impute algorithm completes the missing entries in \(\bar{\varOmega }\) in an iterative reconstruction fashion. In each iteration, the singular value matrix S is obtained by SVD; soft-thresholding is then performed by subtracting a value \(\lambda \) controlled by the filtering divider hyperparameter \(\zeta \), and the new matrix \(M_{\text {new}}\) is reconstructed from the thresholded singular values \(S_{\lambda }\). Afterwards, a termination check compares the change between the matrices before and after the iteration with the termination threshold \(\epsilon \). When the threshold is met or the maximum number of iterations is reached, the algorithm outputs the completed matrix \(\hat{M}\).
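For reference, the following is a minimal numpy sketch of the Soft-Impute loop as described above. It assumes \(\lambda \) is set to the largest singular value divided by the filtering divider \(\zeta \); the exact \(\lambda \) schedule and stopping rule in the authors' implementation may differ.

```python
import numpy as np

def soft_impute(M_obs, known_mask, zeta=50, eps=1e-4, max_iter=100):
    """Sketch of Soft-Impute [15]: complete the missing entries of a matrix.

    M_obs:      matrix with arbitrary values at unobserved positions
    known_mask: boolean array, True where the entry is known (the set Omega)
    """
    P_omega = lambda X: np.where(known_mask, X, 0.0)      # Eq. (6)
    P_omega_bar = lambda X: np.where(known_mask, 0.0, X)  # complement projection

    M_hat = P_omega(M_obs)                                # initialize unknowns to 0
    for _ in range(max_iter):
        U, S, Vt = np.linalg.svd(M_hat, full_matrices=False)
        lam = S.max() / zeta                              # assumed lambda schedule
        S_lam = np.maximum(S - lam, 0.0)                  # soft-threshold singular values
        # keep the known entries fixed; update only the missing ones
        M_new = P_omega(M_obs) + P_omega_bar((U * S_lam) @ Vt)
        # termination check on the relative change between iterations
        if np.linalg.norm(M_new - M_hat) <= eps * (np.linalg.norm(M_hat) + 1e-12):
            return M_new
        M_hat = M_new
    return M_hat
```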

B Additional Experimental Details

All regular hyperparameters and implementation details are described in the main body of the paper.

For code-level details, our code is implemented with Python 3.6.9 and PyTorch 1.3.1. All experiments were run on a single NVIDIA GeForce GTX 1660 Ti GPU. Our Q-matrix reconstruction algorithm is implemented with reference to https://github.com/YyzHarry/SV-RL.

For the matrix reconstruction algorithm, i.e., Soft-Impute, we set the filtering divider \(\zeta = 50\), the termination threshold \(\epsilon = 0.0001\), and the maximum number of iterations to 100.
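With the sketch given in Appendix A, these settings correspond to a call like the one below (Q_matrix and known_mask are hypothetical placeholders for the sampled Q-matrix and the mask of non-erased entries, not names from the authors' code):

```python
# Reported Soft-Impute settings: zeta = 50, eps = 1e-4, max_iter = 100
Q_hat = soft_impute(Q_matrix, known_mask, zeta=50, eps=1e-4, max_iter=100)
```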

C Complete Learning Curves of Table 2

Fig. 6. Complete learning curves corresponding to Table 2. In the reward curves, the first four plots show reconstruction on the target Q-matrix and the last four show reconstruction on the current Q-matrix. Results are means and standard deviations over 3 trials.


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Sang, T., Tang, H., Hao, J., Zheng, Y., Meng, Z. (2022). Uncertainty-Aware Low-Rank Q-Matrix Estimation for Deep Reinforcement Learning. In: Chen, J., Lang, J., Amato, C., Zhao, D. (eds) Distributed Artificial Intelligence. DAI 2021. Lecture Notes in Computer Science, vol 13170. Springer, Cham. https://doi.org/10.1007/978-3-030-94662-3_2

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-94662-3_2

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-94661-6

  • Online ISBN: 978-3-030-94662-3

  • eBook Packages: Computer Science (R0)
