Abstract
In real-world information retrieval, the timely retrieval of newly published documents has gained significant attention in recent years. In this paper, we develop an effective retrieval method for search engines, namely inverse retrieval. We propose a two-stage contrastive strategy to train the doc2query model, the component of inverse retrieval that generates queries from documents. We perform offline or nearline computations to generate queries, and then build or update an index that maps each query to tuples of document and score. We have implemented an offline and a nearline retrieval channel at Xiaohongshu, and both channels showed substantial improvement in A/B tests. To make our work reproducible, we release the QD100K dataset with 111K documents and 23M query-doc pairs. Our experimental results on QD100K and MS MARCO show the effectiveness of our method. All our code and datasets are available at https://github.com/fytxlj/InverseRetrievalDataset.
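The query-to-(document, score) index described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `InverseIndex`, the `max_docs_per_query` cap, the lowercase and whitespace normalization, and the exact-match lookup are all assumptions made for illustration. The generated queries and their scores are passed in directly, standing in for the output of a doc2query model.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


class InverseIndex:
    """Hypothetical index: maps a query string to (doc_id, score) tuples."""

    def __init__(self, max_docs_per_query: int = 100) -> None:
        # Assumed cap on posting-list length; not taken from the paper.
        self.max_docs_per_query = max_docs_per_query
        self._postings: Dict[str, Dict[str, float]] = defaultdict(dict)

    @staticmethod
    def _normalize(query: str) -> str:
        # Assumed normalization: lowercase and collapse whitespace.
        return " ".join(query.lower().split())

    def add_document(
        self, doc_id: str, scored_queries: Iterable[Tuple[str, float]]
    ) -> None:
        """Offline build or nearline update: register the queries generated
        for one document, keeping the best score per (query, document)."""
        for query, score in scored_queries:
            docs = self._postings[self._normalize(query)]
            docs[doc_id] = max(score, docs.get(doc_id, float("-inf")))
            if len(docs) > self.max_docs_per_query:
                # Evict the lowest-scoring document for this query.
                worst = min(docs, key=docs.get)
                del docs[worst]

    def retrieve(self, query: str, k: int = 10) -> List[Tuple[str, float]]:
        """Online lookup: return up to k (doc_id, score) tuples, best first."""
        docs = self._postings.get(self._normalize(query), {})
        return sorted(docs.items(), key=lambda item: item[1], reverse=True)[:k]


if __name__ == "__main__":
    index = InverseIndex()
    # Scores here are made-up placeholders, not model outputs.
    index.add_document("doc_a", [("cheap hotels in tokyo", 0.9)])
    index.add_document("doc_b", [("Cheap hotels in Tokyo", 0.7)])
    print(index.retrieve("cheap hotels in tokyo"))
```

Calling `add_document` again for a newly published document plays the role of the nearline update in this sketch, while building the index from a full corpus in one pass corresponds to the offline channel.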
Notes
- 1.
- 2. Quality is predicted by a neural network upon the publication of each new document.
- 3.
- 4.
Acknowledgement
The authors would like to thank Dr. Shusen Wang at Xiaohongshu. This work was mainly conducted when Yuantao Fan was an intern. In addition, the authors would like to thank the anonymous reviewers for their valuable comments on improving the final version of this paper.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Fan, Y., Tu, X., Li, R. (2024). An Inverse Retrieval Method via Query Generation for Xiaohongshu’s Search Engine. In: Huang, DS., Zhang, X., Zhang, C. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2024. Lecture Notes in Computer Science, vol 14879. Springer, Singapore. https://doi.org/10.1007/978-981-97-5675-9_31
DOI: https://doi.org/10.1007/978-981-97-5675-9_31
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-5674-2
Online ISBN: 978-981-97-5675-9
eBook Packages: Computer Science, Computer Science (R0)