{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,4,4]],"date-time":"2025-04-04T09:27:34Z","timestamp":1743758854943,"version":"3.38.0"},"reference-count":70,"publisher":"SAGE Publications","issue":"5","license":[{"start":{"date-parts":[[2024,11,7]],"date-time":"2024-11-07T00:00:00Z","timestamp":1730937600000},"content-version":"vor","delay-in-days":366,"URL":"http:\/\/www.sagepub.com\/licence-information-for-chorus"}],"funder":[{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["#1941722","#2218760"],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["The International Journal of Robotics Research"],"published-print":{"date-parts":[[2024,4]]},"abstract":" Designing reward functions is a difficult task in AI and robotics. The complex task of directly specifying all the desirable behaviors a robot needs to optimize often proves challenging for humans. A popular solution is to learn reward functions using expert demonstrations. This approach, however, is fraught with many challenges. Some methods require heavily structured models, for example, reward functions that are linear in some predefined set of features, while others adopt less structured reward functions that may necessitate tremendous amounts of data. Moreover, it is difficult for humans to provide demonstrations on robots with high degrees of freedom, or even quantifying reward values for given trajectories. To address these challenges, we present a preference-based learning approach, where human feedback is in the form of comparisons between trajectories. We do not assume highly constrained structures on the reward function. Instead, we employ a Gaussian process to model the reward function and propose a mathematical formulation to actively fit the model using only human preferences. Our approach enables us to tackle both inflexibility and data-inefficiency problems within a preference-based learning framework. We further analyze our algorithm in comparison to several baselines on reward optimization, where the goal is to find the optimal robot trajectory in a data-efficient way instead of learning the reward function for every possible trajectory. Our results in three different simulation experiments and a user study show our approach can efficiently learn expressive reward functions for robotic tasks, and outperform the baselines in both reward learning and reward optimization. <\/jats:p>","DOI":"10.1177\/02783649231208729","type":"journal-article","created":{"date-parts":[[2023,11,7]],"date-time":"2023-11-07T08:00:26Z","timestamp":1699344026000},"page":"665-684","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":8,"title":["Active preference-based Gaussian process regression for reward learning and optimization"],"prefix":"10.1177","volume":"43","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9516-3130","authenticated-orcid":false,"given":"Erdem","family":"B\u0131y\u0131k","sequence":"first","affiliation":[{"name":"Department of Electrical Engineering, Stanford University, Stanford, CA, USA"},{"name":"Center for Human-Compatible Artificial Intelligence, UC Berkeley, Berkeley, CA, USA"},{"name":"Thomas Lord Department of Computer Science, University of Southern California, Los Angeles, CA, USA"}]},{"given":"Nicolas","family":"Huynh","sequence":"additional","affiliation":[{"name":"Department of Applied Mathematics, \u00c9cole Polytechnique, Palaiseau, France"},{"name":"Department of Computer Science and Technology, University of Cambridge, Cambridge, UK"}]},{"given":"Mykel J.","family":"Kochenderfer","sequence":"additional","affiliation":[{"name":"Department of Aeronautics and Astronautics, Stanford University, Stanford, CA, USA"}]},{"given":"Dorsa","family":"Sadigh","sequence":"additional","affiliation":[{"name":"Department of Electrical Engineering, Stanford University, Stanford, CA, USA"},{"name":"Department of Computer Science, Stanford University, Stanford, CA, USA"}]}],"member":"179","published-online":{"date-parts":[[2023,11,7]]},"reference":[{"key":"bibr1-02783649231208729","doi-asserted-by":"publisher","DOI":"10.1145\/1015330.1015430"},{"key":"bibr2-02783649231208729","doi-asserted-by":"publisher","DOI":"10.1145\/1102351.1102352"},{"key":"bibr3-02783649231208729","doi-asserted-by":"publisher","DOI":"10.1007\/s12369-012-0160-0"},{"key":"bibr4-02783649231208729","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-33486-3_8"},{"key":"bibr5-02783649231208729","first-page":"217","volume":"78","author":"Bajcsy A","year":"2017","journal-title":"Proceedings of Machine Learning Research"},{"key":"bibr6-02783649231208729","doi-asserted-by":"publisher","DOI":"10.1145\/3171221.3171267"},{"key":"bibr7-02783649231208729","doi-asserted-by":"crossref","unstructured":"Basu C, Yang Q, Hungerman D, et al. (2017) Do you want your autonomous car to drive like you? In: ACM\/IEEE international conference on Human-Robot Interaction (HRI), Vienna, Austria, 06\u201309 March 2017, pp. 417\u2013425.","DOI":"10.1145\/2909824.3020250"},{"key":"bibr8-02783649231208729","doi-asserted-by":"publisher","DOI":"10.1109\/IROS40897.2019.8968522"},{"volume-title":"Learning preferences for interactive autonomy","year":"2022","author":"Biyik E","key":"bibr9-02783649231208729"},{"key":"bibr12-02783649231208729","unstructured":"Biyik E, Palan M, Landolfi NC, et al. (2019b) Asking easy questions: a user-friendly approach to active reward learning. In: Conference on Robot Learning (CoRL), Osaka, Japan, October 2019."},{"key":"bibr10-02783649231208729","unstructured":"Biyik E, Sadigh D (2018) Batch active preference-based learning of reward functions. In: Conference on Robot Learning (CoRL), Zurich, Switzerland, October 2018, pp. 120\u2013127."},{"key":"bibr11-02783649231208729","doi-asserted-by":"publisher","DOI":"10.1109\/CDC40024.2019.9030169"},{"key":"bibr13-02783649231208729","unstructured":"Biyik E, Wang K, Anari N, et al. (2019c) Batch active learning using determinantal point processes. ArXiv Preprint arXiv:1906.07975."},{"key":"bibr14-02783649231208729","doi-asserted-by":"publisher","DOI":"10.15607\/rss.2020.xvi.041"},{"key":"bibr15-02783649231208729","doi-asserted-by":"publisher","DOI":"10.1177\/02783649211041652"},{"key":"bibr16-02783649231208729","doi-asserted-by":"publisher","DOI":"10.1109\/hri53351.2022.9889650"},{"key":"bibr17-02783649231208729","first-page":"1278780","volume":"10","author":"Bridson R","year":"2007","journal-title":"SIGGRAPH Sketches"},{"key":"bibr18-02783649231208729","unstructured":"Brockman G, Cheung V, Pettersson L, et al. (2016) Openai gym. ArXiv Preprint arXiv:1606.01540."},{"key":"bibr19-02783649231208729","unstructured":"Brown DS, Niekum S (2019) Deep Bayesian reward learning from preferences. In: NeurIPS workshop on safety and robustness in decision making, Vancouver, BC, December 2019, pp. n\/a."},{"key":"bibr20-02783649231208729","doi-asserted-by":"publisher","DOI":"10.1109\/IROS.2011.6094735"},{"key":"bibr21-02783649231208729","unstructured":"Casper S, Davies X, Shi C, et al. (2023) Open problems and fundamental limitations of reinforcement learning from human feedback. ArXiv Preprint arXiv:2307.15217."},{"key":"bibr22-02783649231208729","unstructured":"Christiano PF, Leike J, Brown T, et al. (2017) Deep reinforcement learning from human preferences. In: Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, December 2017, pp. 4299\u20134307."},{"key":"bibr23-02783649231208729","doi-asserted-by":"publisher","DOI":"10.1145\/1102351.1102369"},{"key":"bibr24-02783649231208729","unstructured":"Clark J, Amodei D (2016) Faulty reward functions in the wild. https:\/\/openai.com\/blog\/faulty-reward-functions\/"},{"key":"bibr25-02783649231208729","doi-asserted-by":"publisher","DOI":"10.1007\/s10514-015-9454-z"},{"volume-title":"Robotics: Science and Systems","year":"2012","author":"Dragan AD","key":"bibr26-02783649231208729"},{"key":"bibr27-02783649231208729","unstructured":"Gonz\u00e1lez J, Dai Z, Damianou A, et al. (2017) Preferential Bayesian optimization. In: International Conference on Machine Learning (ICML), Sydney, Australia, August 2017, pp. 1282\u20131291."},{"key":"bibr28-02783649231208729","first-page":"282","volume-title":"Uncertainty in Artificial Intelligence","author":"Hensman J","year":"2013"},{"key":"bibr29-02783649231208729","unstructured":"Ho J, Ermon S (2016) Generative adversarial imitation learning. In: Advances in Neural Information Processing Systems (NeurIPS), Barcelona, Spain, December 2016, pp. 4565\u20134573."},{"key":"bibr30-02783649231208729","unstructured":"Houlsby N, Husz\u00e1r F, Ghahramani Z, et al. (2011) Bayesian active learning for classification and preference learning. In: NIPS workshop on Bayesian optimization, experimental design and bandits: theory and applications. (accessed December 2011)."},{"key":"bibr31-02783649231208729","unstructured":"Houlsby N, Huszar F, Ghahramani Z, et al. (2012) Collaborative Gaussian processes for preference learning. In: Advances in Neural Information Processing Systems (NeurIPS), Lake Tahoe, NV, December 2012, pp. 2096\u20132104."},{"key":"bibr32-02783649231208729","unstructured":"Ibarz B, Leike J, Pohlen T, et al. (2018) Reward learning from human preferences and demonstrations in atari. In: Advances in Neural Information Processing Systems (NeurIPS), Montreal, QC, December 2018, pp. 8011\u20138023."},{"key":"bibr33-02783649231208729","volume":"2015","author":"Javdani S","year":"2015","journal-title":"Robotics: Science and Systems"},{"key":"bibr34-02783649231208729","unstructured":"Jensen BS, Nielsen JB (2011) Pairwise judgements and absolute ratings with gaussian process priors. Technical University of Denmark (DTU), Department of Applied Mathematics and Computer Science, Technical Report."},{"key":"bibr35-02783649231208729","first-page":"4415","volume":"33","author":"Jeon HJ","year":"2020","journal-title":"Advances in Neural Information Processing Systems"},{"key":"bibr36-02783649231208729","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2007.4408844"},{"key":"bibr37-02783649231208729","doi-asserted-by":"publisher","DOI":"10.1109\/DASC43569.2019.9081648"},{"key":"bibr38-02783649231208729","unstructured":"Katz SM, Maleki A, B\u0131y\u0131k E, et al. (2021) Preference-based learning of reward function features. ArXiv Preprint arXiv:2103.02727."},{"key":"bibr39-02783649231208729","doi-asserted-by":"publisher","DOI":"10.1162\/PRES_a_00223"},{"key":"bibr40-02783649231208729","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-51532-8_10"},{"key":"bibr41-02783649231208729","unstructured":"Lee K, Smith L, Abbeel P (2021) Pebble: feedback-efficient interactive reinforcement learning via relabeling experience and unsupervised pre-training. In: International conference on machine learning, Virtual, July 2021."},{"key":"bibr42-02783649231208729","doi-asserted-by":"publisher","DOI":"10.2514\/1.I010363"},{"key":"bibr43-02783649231208729","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2019.2957109"},{"key":"bibr44-02783649231208729","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA40945.2020.9197197"},{"key":"bibr45-02783649231208729","unstructured":"Myers V, Biyik E, Anari N, et al. (2021) Learning multimodal rewards from rankings. In: Conference on robot learning, London, UK, November 2021, pp. 342\u2013352. PMLR."},{"key":"bibr46-02783649231208729","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA48891.2023.10160439"},{"key":"bibr47-02783649231208729","unstructured":"Ng AY, Russell SJ (2000) Algorithms for inverse reinforcement learning. In: International Conference on Machine Learning (ICML), Stanford, CA, June 2000, 1, p. 2."},{"key":"bibr48-02783649231208729","first-page":"278","volume":"99","author":"Ng AY","year":"1999","journal-title":"International Conference on Machine Learning (ICML)"},{"key":"bibr49-02783649231208729","first-page":"2035","volume":"9","author":"Nickisch H","year":"2008","journal-title":"Journal of Machine Learning Research"},{"key":"bibr50-02783649231208729","first-page":"27730","volume":"35","author":"Ouyang L","year":"2022","journal-title":"Advances in Neural Information Processing Systems"},{"key":"bibr51-02783649231208729","doi-asserted-by":"crossref","unstructured":"Palan M, Landolfi NC, Shevchuk G, et al. (2019) Learning reward functions by integrating human demonstrations and preferences. In: Robotics: science and systems, Messe Freiburg, Germany, June 2019.","DOI":"10.15607\/RSS.2019.XV.023"},{"key":"bibr52-02783649231208729","doi-asserted-by":"publisher","DOI":"10.1109\/HRI.2019.8673300"},{"key":"bibr53-02783649231208729","unstructured":"Ramachandran D, Amir E (2007) Bayesian inverse reinforcement learning. In: International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India, January 2007, 7, pp. 2586\u20132591."},{"key":"bibr54-02783649231208729","doi-asserted-by":"publisher","DOI":"10.7551\/mitpress\/3206.001.0001"},{"key":"bibr55-02783649231208729","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA.2013.6630809"},{"key":"bibr56-02783649231208729","unstructured":"Sadigh D, Sastry S, Seshia SA, et al. (2016) Planning for autonomous cars that leverage effects on human actions. In: Robotics: science and systems, Cambridge, MA, July 2016."},{"key":"bibr57-02783649231208729","doi-asserted-by":"crossref","unstructured":"Sadigh D, Dragan AD, Sastry SS, et al. (2017) Active preference-based learning of reward functions. In: Robotics: science and systems, Cambridge, MA, July 2017.","DOI":"10.15607\/RSS.2017.XIII.053"},{"key":"bibr58-02783649231208729","doi-asserted-by":"publisher","DOI":"10.15607\/RSS.2022.XVIII.065"},{"key":"bibr59-02783649231208729","unstructured":"Song J, Ren H, Sadigh D, et al. (2018) Multi-agent generative adversarial imitation learning. In: Advances in Neural Information Processing Systems (NeurIPS), Montreal, QC, December 2018, pp. 7461\u20137472."},{"key":"bibr60-02783649231208729","unstructured":"Sontakke SA, Arnold S, Zhang J, et al. (2023) Roboclip: one demonstration is enough to learn robot policies. In: Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, December 2023."},{"key":"bibr61-02783649231208729","unstructured":"Stadie BC, Abbeel P, Sutskever I (2017) Third-person imitation learning. In: International Conference on Learning Representations (ICLR), Toulon, France, April 2017."},{"key":"bibr62-02783649231208729","doi-asserted-by":"publisher","DOI":"10.1109\/IROS.2012.6386109"},{"key":"bibr63-02783649231208729","doi-asserted-by":"publisher","DOI":"10.1109\/IROS45743.2020.9341416"},{"key":"bibr65-02783649231208729","doi-asserted-by":"publisher","DOI":"10.1109\/LRA.2019.2897342"},{"key":"bibr66-02783649231208729","doi-asserted-by":"publisher","DOI":"10.1109\/IROS45743.2020.9341530"},{"key":"bibr64-02783649231208729","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA40945.2020.9196661"},{"key":"bibr67-02783649231208729","unstructured":"Wilde N, Biyik E, Sadigh D, et al. (2021) Learning reward functions from scale feedback. In: 5th annual conference on robot learning, London, UK, November 2021."},{"issue":"136","key":"bibr68-02783649231208729","first-page":"1","volume":"18","author":"Wirth C","year":"2017","journal-title":"Journal of Machine Learning Research"},{"key":"bibr69-02783649231208729","unstructured":"Wise M, Ferguson M, King D, et al. (2016) Fetch and freight: standard platforms for service robot applications. In: Workshop on autonomous mobile service robots, New York City, NY, July 2016."},{"key":"bibr70-02783649231208729","unstructured":"Ziebart BD, Maas AL, Bagnell JA, et al. (2008) Maximum entropy inverse reinforcement learning. In: AAAI conference on Artificial Intelligence (AAAI), Chicago, IL, July 2008, 8, pp. 1433\u20131438."}],"container-title":["The International Journal of Robotics Research"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/02783649231208729","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/02783649231208729","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/02783649231208729","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/02783649231208729","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,3]],"date-time":"2025-03-03T23:45:30Z","timestamp":1741045530000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/02783649231208729"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,11,7]]},"references-count":70,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2024,4]]}},"alternative-id":["10.1177\/02783649231208729"],"URL":"https:\/\/doi.org\/10.1177\/02783649231208729","relation":{},"ISSN":["0278-3649","1741-3176"],"issn-type":[{"type":"print","value":"0278-3649"},{"type":"electronic","value":"1741-3176"}],"subject":[],"published":{"date-parts":[[2023,11,7]]}}}