Reliable and Generalizable
Neural Search Engine Architectures
During the last five years, search engines based on neural network techniques have emerged as an alternative to traditional search engine architectures. These new neural ranking architectures use distributed text representations that enable reasoning about how well a query term such as "airplane" matches a document term such as "jet", as well as more sophisticated methods of combining evidence. They may be more effective than simpler models, especially when a massive amount of training data is available.
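The soft matching described above can be illustrated with cosine similarity between term embeddings. This is a minimal sketch with made-up four-dimensional vectors; real systems learn dense vectors of hundreds of dimensions from large corpora, and the specific values and vocabulary here are purely illustrative.

```python
import math

# Toy term embeddings (illustrative values only).
embeddings = {
    "airplane": [0.9, 0.8, 0.1, 0.0],
    "jet":      [0.8, 0.9, 0.2, 0.1],
    "banana":   [0.0, 0.1, 0.9, 0.8],
}

def cosine(u, v):
    """Cosine similarity: a soft-match score between two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# A query term scores highly against a related document term,
# even though the two strings never match exactly...
print(cosine(embeddings["airplane"], embeddings["jet"]))

# ...and scores low against an unrelated term.
print(cosine(embeddings["airplane"], embeddings["banana"]))
```

Exact-match models would score both pairs identically (zero overlap); distributed representations are what let a ranker distinguish them.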
This project develops new methods of training neural ranking architectures when a massive amount of training data is not available for the target application; integrates external knowledge resources to provide more information for making accurate ranking decisions; and applies the architecture to the task of retrieving tabular data from scientific documents. This collection of problems is chosen to increase the practicality of neural ranking architectures outside of high-traffic commercial search environments, and to investigate and exploit the strengths of neural ranking architectures at using attention mechanisms to manage evidence, soft-matching across different types of evidence, and learning sophisticated nonlinear decision models. This research furthers the development of neural ranking architectures that are generally applicable and more reliable than current systems due to their ability to integrate a broader range of evidence in a predictable manner.
Jamie Callan, Principal Investigator
Zhuyun Dai, Research Assistant
Hafeezul Rahman, Research Assistant
HongChien Yu, Undergraduate & Graduate Research Assistant
Weihan Anita Li, Undergraduate Research Assistant
Luyu Gao, Graduate Research Assistant
Zhen Fan, Graduate Research Assistant
Research results are disseminated through research publications. Datasets and experimental results are disseminated through online virtual appendices to those publications. Open-source software is disseminated as part of the open-source Lemur Project.
Z. Dai and J. Callan. Context-aware term weighting for first-stage passage retrieval (short paper). In Proceedings of the 43rd International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM. 2020.
Z. Dai and J. Callan. Context-aware document term weighting for ad-hoc search. In Proceedings of The Web Conference 2020. 2020. [slides]
Z. Dai and J. Callan. An evaluation of weakly-supervised DeepCT in the TREC 2019 Deep Learning Track. In The Twenty-Eighth Text REtrieval Conference Proceedings (TREC 2019), National Institute of Standards and Technology, special publication. 2020.
Z. Dai and J. Callan. Deeper text understanding for IR with contextual neural language modeling (short paper). In Proceedings of the 42nd International ACM SIGIR Conference on Research & Development in Information Retrieval. 2019. [code] [data]
Z. Dai and J. Callan. Context-aware sentence/passage term importance estimation for first stage retrieval. arXiv:1910.10687. 2019.
Z. Dai, Z. Fan, H. Rahman, and J. Callan. Local matching networks for engineering diagram search. In Proceedings of The Web Conference 2019. 2019. [poster] [data]
Z. Fan, L. Gao, J. Callan. CSurF: Sparse lexical retrieval through contextualized surface forms. In Proceedings of the 2023 ACM SIGIR International Conference on the Theory of Information Retrieval, pp. 65-75. ACM. 2023.
Z. Fan, L. Gao, R. Jha, and J. Callan. COILCR: Efficient semantic matching in contextualized exact match retrieval. In Advances in Information Retrieval – 45th European Conference on IR Research, pp. 298–312. Springer Cham. 2023.
L. Gao and J. Callan. Long document re-ranking with modular re-ranker (short paper). In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM. 2022. A preprint appeared as arXiv:2205.04275.
L. Gao and J. Callan. Unsupervised corpus aware language model pre-training for dense passage retrieval. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 2022. A preprint appeared as arXiv:2108.05540.
L. Gao and J. Callan. Condenser: A pre-training architecture for dense retrieval. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. A preprint appeared as arXiv:2104.08253.
L. Gao, Z. Dai, and J. Callan. COIL: Revisit exact lexical match in information retrieval with contextualized inverted list. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3030-3042. Association for Computational Linguistics. 2021.
L. Gao, Z. Dai, and J. Callan. Rethink training of BERT rerankers in multi-stage retrieval pipeline (short paper). In Advances in Information Retrieval – 43rd European Conference on IR Research. 2021.
L. Gao, Z. Dai, T. Chen, Z. Fan, B. Van Durme and J. Callan. Complement lexical retrieval model with semantic residual embeddings. In Advances in Information Retrieval – 43rd European Conference on IR Research. 2021.
L. Gao, Y. Zhang, J. Han and J. Callan. Scaling deep contrastive learning batch size under memory limited setup. In Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP). ACL. 2021.
L. Gao, Z. Dai, and J. Callan. EARL: Speedup transformer-based rankers with pre-computed representation. arXiv:2004.13313. 2020.
L. Gao, Z. Dai, and J. Callan. Understanding BERT rankers under distillation. In Proceedings of the 2020 ACM SIGIR International Conference on the Theory of Information Retrieval. 2020.
L. Gao, Z. Dai, and J. Callan. Modularized transformer-based ranking framework. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. 2020.
Z. Jiang, L. Gao, J. Araki, H. Ding, Z. Wang, J. Callan, and G. Neubig. Retrieval as attention: End-to-end learning of retrieval and reading within a single transformer. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2336-2349. 2022.
J. Mackenzie, Z. Dai, L. Gallagher and J. Callan. Efficiency implications of term re-weighting for passage retrieval (short paper). In Proceedings of the 43rd International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM. 2020.
R. Padaki, Z. Dai and J. Callan. Rethinking query expansion for BERT reranking. In Advances in Information Retrieval - 42nd European Conference on IR Research. 2020.
H. Yu, Z. Dai, and J. Callan. PGT: Pseudo relevance feedback using a graph-based transformer (short paper). In Advances in Information Retrieval – 43rd European Conference on IR Research. 2021.
H. Yu, C. Xiong, and J. Callan. Improving query representations for dense retrieval with pseudo relevance feedback (short paper). In Proceedings of the 30th ACM International Conference on Information and Knowledge Management, pp. 3592-3596. ACM. 2021.
S. Zhang, Z. Dai, K. Balog and J. Callan. Summarizing and exploring tabular data in conversational search (short paper). In Proceedings of the 43rd International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM. 2020.
Tevatron is a simple and efficient toolkit for training and running dense retrievers with deep language models. The toolkit has a modular design for easy research use, and a set of command-line tools is also provided for fast development and testing. Easy-to-use interfaces to Hugging Face's state-of-the-art pre-trained transformers support Tevatron's strong performance. arXiv, Github
This research is sponsored by National Science Foundation grant IIS-1815528. Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s), and do not necessarily reflect those of the sponsors.