Abstract
Multi-agent reinforcement learning (MARL) is a prevalent learning paradigm for solving stochastic games. In most MARL studies, agents in a game are defined as teammates or enemies beforehand, and the relationships among the agents (i.e., their identities) remain fixed throughout the game. However, in real-world problems, the agent relationships are commonly unknown in advance or change dynamically. Many multi-party interactions start off by asking: who is on my team? This question arises whether it is the first day at the stock exchange or in kindergarten. Training policies for such situations, in the face of imperfect information and ambiguous identities, is therefore an important problem that needs to be addressed. In this work, we develop a novel identity detection reinforcement learning (IDRL) framework that allows an agent to dynamically infer the identities of nearby agents and select an appropriate policy to accomplish the task. In the IDRL framework, a relation network is constructed to deduce the identities of other agents by observing their behaviors, and a danger network is optimized to estimate the risk of false-positive identifications. Beyond that, we propose an intrinsic reward that balances maximizing external rewards against accurate identification. After identifying the cooperation-competition pattern among the agents, IDRL applies an off-the-shelf MARL method to learn the policy. To evaluate the proposed method, we conduct experiments on the Red-10 card-shedding game, and the results show that IDRL achieves superior performance over other state-of-the-art MARL methods. Notably, the relation network identifies the identities of agents on par with top human players, and the danger network reasonably avoids the risk of imperfect identification. The code to reproduce all the reported results is available online at https://github.com/MR-BENjie/IDRL.
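As a reading aid, the following is a minimal PyTorch sketch of the decision loop outlined above: a relation network infers identities from observed behavior, a danger network estimates the false-positive risk, and an intrinsic reward blends the external reward with identification confidence. All module, function, and parameter names here (RelationNet, DangerNet, intrinsic_reward, beta, obs_dim) are illustrative assumptions, not the interfaces of the released repository.

```python
# Minimal sketch of the IDRL decision loop; names are illustrative assumptions.
import torch
import torch.nn as nn


class RelationNet(nn.Module):
    """Infers, from observed behavior, whether each other agent is a teammate."""

    def __init__(self, obs_dim: int, n_others: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_others),  # one teammate logit per other agent
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(obs))  # P(teammate) for each other agent


class DangerNet(nn.Module):
    """Estimates the risk that the current identification is a false positive."""

    def __init__(self, obs_dim: int, n_others: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + n_others, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, identity: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(torch.cat([obs, identity], dim=-1)))


def intrinsic_reward(extrinsic: float, id_confidence: torch.Tensor,
                     beta: float = 0.1) -> float:
    """Blend the environment reward with a bonus for confident identification."""
    return extrinsic + beta * id_confidence.mean().item()


# Usage sketch: infer identities, gate them by estimated risk, then hand the
# resulting cooperation-competition pattern to an off-the-shelf MARL policy.
obs_dim, n_others = 64, 3
relation, danger = RelationNet(obs_dim, n_others), DangerNet(obs_dim, n_others)
obs = torch.randn(1, obs_dim)
p_teammate = relation(obs)                      # inferred identities
risk = danger(obs, p_teammate)                  # estimated false-positive risk
teammates = (p_teammate > 0.5) & (risk < 0.5)   # act only on low-risk identifications
reward = intrinsic_reward(extrinsic=1.0, id_confidence=p_teammate)
```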
Acknowledgements
We acknowledge funding support for this work from the Key Program of the National Natural Science Foundation of China (Grant No. 51935005), the Basic Research Project (Grant No. JCKY20200603C010), the China Academy of Launch Vehicle Technology (CALT2022-18), the Natural Science Foundation of Heilongjiang Province of China (Grant No. LH2021F023), and the Science and Technology Planning Project of Heilongjiang Province of China (Grant No. GA21C031).
Appendices
Appendix A: Red-10 game rules
Deck. The Red-10 game is played with a standard 52-card deck comprising 13 ranks in each of the four suits: clubs, diamonds, hearts, and spades. Within each suit, the ranks from highest to lowest are 2, A, K, Q, J, 10, 9, 8, 7, 6, 5, 4, 3.
Card combination categories. Similar to Doudizhu, Red-10 has a rich set of card combination categories, listed below (an illustrative comparison sketch follows the list).
- Solo: any individual card, ranked by its face rank.
- Pair: two identically ranked cards, ranked by their face rank.
- Trio: three identically ranked cards, ranked by their face rank.
- Trio with solo: a trio plus any single card, ranked by the trio.
- Trio with pair: a trio plus a pair, ranked by the trio.
- Solo chain: no fewer than five cards of consecutive ranks, ranked by the lowest rank in the chain.
- Pairs chain: no fewer than three consecutive pairs, ranked by the lowest rank in the chain.
- Airplane: no fewer than two consecutive trios, ranked by the lowest rank in the combination.
- Airplane with small wings: no fewer than two consecutive trios plus as many extra single cards as trios, ranked by the lowest rank in the chain of trios.
- Airplane with large wings: no fewer than two consecutive trios plus as many extra pairs as trios, ranked by the lowest rank in the chain of trios.
- Four with two single cards: four cards of equal rank plus two single cards, ranked by the four cards.
- Four with two pairs: four cards of equal rank plus two pairs, ranked by the four cards.
- Bomb: four cards of equal rank.
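The sketch below illustrates the rank order and the "same category, higher rank, or bomb" comparison rule used in the card-playing phase. The names (RANK_VALUE, beats, key_rank) are illustrative assumptions, the code does not reflect the environment's internal representation, and chain-length matching is omitted for brevity.

```python
# Illustrative helper for the rank order and the comparison rule above.
RANKS = ["3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A", "2"]
RANK_VALUE = {r: i for i, r in enumerate(RANKS)}  # "3" lowest, "2" highest


def beats(prev_category: str, prev_key_rank: str,
          next_category: str, next_key_rank: str) -> bool:
    """Can the next play be made on top of the previous one?

    key_rank is the rank that orders a combination: the face rank for a
    solo/pair/trio, the lowest rank for a chain or airplane, the quad rank
    for a bomb or four-with-kickers. Chain-length matching is not modeled.
    """
    if next_category == "bomb":
        # A bomb beats any non-bomb; a higher bomb beats a lower one.
        return (prev_category != "bomb"
                or RANK_VALUE[next_key_rank] > RANK_VALUE[prev_key_rank])
    # Otherwise the category must match and the key rank must be strictly higher.
    return (next_category == prev_category
            and RANK_VALUE[next_key_rank] > RANK_VALUE[prev_key_rank])


assert beats("pair", "9", "pair", "K")        # K-K beats 9-9
assert beats("solo_chain", "3", "bomb", "7")  # a bomb beats any non-bomb play
assert not beats("trio", "A", "trio", "A")    # an equal rank does not beat
```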
A Red-10 game consists of two phases:

1. Dealing: a shuffled deck of 52 cards is dealt evenly to the four players in turn.
2. Card-playing: the four players play cards in turn; the first player may play any category. Each subsequent player must play cards of the same category with a higher rank, or a bomb; otherwise, they pass. If three consecutive players pass, the next player may again play any category. The game ends when any player runs out of cards.
Winner. Players holding a red 10 are on the "Landlord" team, and the others are on the "Peasant" team. The first team in which a player runs out of cards wins. A minimal sketch of this game flow is given below.
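The following sketch covers the dealing and card-playing phases and the winner rule under simplifying assumptions: the function names (deal, landlord_team, play_game, choose_move) are hypothetical, and move legality is delegated to a choose_move callback rather than enforced.

```python
# Minimal sketch of the two game phases and the winner rule; names are
# illustrative and not part of the released environment.
import random

RANKS = ["2", "A", "K", "Q", "J", "10", "9", "8", "7", "6", "5", "4", "3"]
SUITS = "CDHS"  # clubs, diamonds, hearts, spades


def deal():
    """Dealing phase: shuffle the 52-card deck and deal 13 cards to each player."""
    deck = [rank + suit for rank in RANKS for suit in SUITS]
    random.shuffle(deck)
    return [deck[i::4] for i in range(4)]


def landlord_team(hands):
    """Players holding a red 10 (hearts or diamonds) form the Landlord team."""
    return {i for i, hand in enumerate(hands) if "10H" in hand or "10D" in hand}


def play_game(hands, choose_move):
    """Card-playing phase: players shed cards in turn until one hand is empty.

    choose_move(player, hand) returns the cards to shed (an empty list means
    "pass"); a full implementation would also enforce that each move beats the
    previous one and that three consecutive passes let the next player lead.
    """
    landlords = landlord_team(hands)
    player = 0
    while True:
        for card in choose_move(player, hands[player]):
            hands[player].remove(card)
        if not hands[player]:
            return "Landlord" if player in landlords else "Peasant"
        player = (player + 1) % 4


# Example run: every player naively sheds one card per turn.
print(play_game(deal(), lambda p, hand: hand[:1]))
```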
Appendix B: Detailed input data
In the Red-10 game environment, the detailed input features of the Q action-value function, the relation network, and the danger network are listed in Tables 8 and 9.
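Since Tables 8 and 9 are not reproduced here, the snippet below shows only one illustrative way to encode a hand of cards as a binary suit-by-rank matrix; it is an assumption for exposition, not the feature layout actually used by IDRL.

```python
# Illustrative hand encoding only; the actual input features of the Q network,
# relation network, and danger network are those specified in Tables 8 and 9.
import numpy as np

RANKS = ["3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A", "2"]
SUITS = ["C", "D", "H", "S"]


def encode_hand(hand):
    """Encode a list of cards such as ["10H", "AS"] into a 4 x 13 binary matrix."""
    x = np.zeros((len(SUITS), len(RANKS)), dtype=np.float32)
    for card in hand:
        rank, suit = card[:-1], card[-1]
        x[SUITS.index(suit), RANKS.index(rank)] = 1.0
    return x


print(encode_hand(["10H", "10D", "AS", "3C"]).sum())  # -> 4.0 (four cards set)
```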
Appendix C: Experiment hyper-parameters
We list the hyper-parameters of IDRL in the Red-10 experiments in Table 10 and the hyper-parameters of the baseline algorithms in Table 11.
About this article
Cite this article
Han, S., Li, S., An, B. et al. Classifying ambiguous identities in hidden-role Stochastic games with multi-agent reinforcement learning. Auton Agent Multi-Agent Syst 37, 35 (2023). https://doi.org/10.1007/s10458-023-09620-x