{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,10,18]],"date-time":"2024-10-18T04:28:02Z","timestamp":1729225682682,"version":"3.27.0"},"reference-count":0,"publisher":"IOS Press","isbn-type":[{"value":"9781643685489","type":"electronic"}],"license":[{"start":{"date-parts":[[2024,10,16]],"date-time":"2024-10-16T00:00:00Z","timestamp":1729036800000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2024,10,16]]},"abstract":"Generative language models (LMs) offer numerous advantages but may produce inappropriate or harmful outputs due to the harmful knowledge acquired during pre-training. This knowledge often manifests as undesirable correspondences, such as \u201charmful prompts\u201d leading to \u201charmful outputs,\u201d which our research aims to mitigate through unlearning techniques. However, existing unlearning methods based on gradient ascent can significantly impair the performance of LMs. To address this issue, we propose a novel approach called Weighted Positional N-pair (WPN) Learning, which leverages position-weighted mean pooling within an n-pair contrastive learning framework. WPN is designed to modify the output distribution of LMs by eliminating specific harmful outputs (e.g., replacing toxic responses with neutral ones), thereby transforming the model\u2019s behavior from \u201charmful prompt-harmful output\u201d to \u201charmful prompt-harmless response\u201d. Experiments on OPT and GPT-NEO LMs show that WPN effectively reduces the proportion of harmful responses, achieving a harmless rate of up to 95.8% while maintaining stable performance on nine common benchmarks (with less than 2% degradation on average). Moreover, we provide empirical evidence to demonstrate WPN\u2019s ability to weaken the harmful correspondences in terms of generalizability and robustness, as evaluated on out-of-distribution test sets and under adversarial attacks.","DOI":"10.3233\/faia240662","type":"book-chapter","created":{"date-parts":[[2024,10,17]],"date-time":"2024-10-17T13:01:53Z","timestamp":1729170113000},"source":"Crossref","is-referenced-by-count":0,"title":["WPN: An Unlearning Method Based on N-pair Contrastive Learning in Language Models"],"prefix":"10.3233","author":[{"given":"Guitao","family":"Chen","sequence":"first","affiliation":[{"name":"Beijing University of Posts and Telecommunications"}]},{"given":"Yunshen","family":"Wang","sequence":"additional","affiliation":[{"name":"Beijing University of Posts and Telecommunications"}]},{"given":"Hongye","family":"Sun","sequence":"additional","affiliation":[{"name":"Beijing University of Posts and Telecommunications"}]},{"given":"Guang","family":"Chen","sequence":"additional","affiliation":[{"name":"Beijing University of Posts and Telecommunications"}]}],"member":"7437","container-title":["Frontiers in Artificial Intelligence and Applications","ECAI 2024"],"original-title":[],"link":[{"URL":"https:\/\/ebooks.iospress.nl\/pdf\/doi\/10.3233\/FAIA240662","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,10,17]],"date-time":"2024-10-17T13:01:53Z","timestamp":1729170113000},"score":1,"resource":{"primary":{"URL":"https:\/\/ebooks.iospress.nl\/doi\/10.3233\/FAIA240662"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,10,16]]},"ISBN":["9781643685489"],"references-count":0,"URL":"https:\/\/doi.org\/10.3233\/faia240662","relation":{},"ISSN":["0922-6389","1879-8314"],"issn-type":[{"value":"0922-6389","type":"print"},{"value":"1879-8314","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,10,16]]}}}