Jie Lu's Homepage - jielu@cs.cmu.edu

Publications

"Full-text federated search in peer-to-peer networks" by Jie Lu.
Ph.D. Dissertation, Language Technologies Institute, Carnegie Mellon University, 2007.

"Content-based peer-to-peer network overlay for full-text federated search" by Jie Lu and Jamie Callan.
8th RIAO Conference on Large-Scale Semantic Access to Content (RIAO'07), 2007.

"User modeling for full-text federated search in peer-to-peer networks" by Jie Lu and Jamie Callan.
29th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'06), 2006.

"Full-text federated search of text-based digital libraries in peer-to-peer networks" by Jie Lu and Jamie Callan.
Journal of Information Retrieval, Volume 9, Number 4, 2006.

"Combining multiple resources, evidences and criteria for genomic information retrieval" by Luo Si, Jie Lu and Jamie Callan.
Text Retrieval Conference (TREC'06), 2006.

"Federated search of text-based digital libraries in hierarchical peer-to-peer networks" by Jie Lu and Jamie Callan.
27th European Conference on Information Retrieval Research (ECIR'05), 2005.

"Federated search of text-based digital libraries in hierarchical peer-to-peer networks" by Jie Lu and Jamie Callan.
Peer-to-Peer IR Workshop of the 27th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'04), 2004.

"Merging retrieval results in hierarchical peer-to-peer networks" by Jie Lu and Jamie Callan.
27th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'04), 2004.

"Content-based retrieval in hybrid peer-to-peer networks" by Jie Lu and Jamie Callan.
12th International Conference on Information and Knowledge Management (CIKM'03), 2003.

"Distributed information retrieval with skewed database size distributions" by Luo Si, Jie Lu and Jamie Callan.
National Conference on Digital Government Research (dg.o2003), 2003.

"Reducing storage costs for federated search of text databases" by Jie Lu and Jamie Callan.
National Conference on Digital Government Research (dg.o2003), 2003.

"Pruning long documents for distributed information retrieval" by Jie Lu and Jamie Callan.
11th International Conference on Information and Knowledge Management (CIKM'02), 2002.

Research

Federated search in distributed environments
My dissertation research develops an integrated framework of network overlay, network evolution, and search models for full-text ranked retrieval in P2P networks. Multiple directory services maintain full-text representations of resources located in their network neighborhoods, and provide local resource selection and result merging services. The network overlay model defines a network structure that extends previous peer functionalities and integrates the search-enhancing properties of interest-based locality, content-based locality, and small-world connectivity to explicitly support full-text federated search. The network evolution model provides autonomous and adaptive topology evolution algorithms to construct a network structure with the desired content distribution, navigability, and load balancing, without centralized control or semantic annotations. The network search model addresses the problems of resource representation, resource selection, and result merging based on the unique characteristics of P2P networks, and balances effectiveness against cost. The framework is a comprehensive and practical solution to full-text ranked retrieval in large-scale, distributed, and dynamic environments with heterogeneous, open-domain content.
The models developed as integrated parts of the framework for full-text federated search in P2P networks can also be used in other applications such as organizing the server farm in large-scale centralized search, managing online communities and social networks, and improving meta-search and personalized search.

Genomic information retrieval
My work in genomic information retrieval focuses on combining multiple resources, evidence, and criteria for query expansion and result ranking. Acronyms, aliases, and synonyms are extracted from external biomedical resources such as AcroMed, LocusLink, and UMLS to create lexicons. Based on the associations among terms and phrases in these lexicons, several term-weighting schemes are designed to assign weights to expansion terms from different sources. For result ranking, different scoring criteria are used to evaluate evidence at the document, passage, and term-matching granularities, and the resulting scores are combined using a weighted linear combination to produce the final ranking. Evaluation results show that the query expansion technique based on external biomedical resources is effective, and that result ranking by combining multiple scoring criteria and evidence consistently performs better than result ranking based on a single criterion.
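The weighted linear combination used for result ranking can be illustrated with a minimal sketch. The criterion names, weights, and scores below are hypothetical placeholders chosen for illustration; they are not the values or the scoring functions used in the actual system.

```python
from typing import Dict, List

# Hypothetical weights for the three evidence granularities (illustrative only).
WEIGHTS: Dict[str, float] = {"document": 0.5, "passage": 0.25, "term": 0.25}


def combine_scores(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return the weighted linear combination of per-criterion scores."""
    return sum(weight * scores[name] for name, weight in weights.items())


def rank(candidates: Dict[str, Dict[str, float]],
         weights: Dict[str, float]) -> List[str]:
    """Order document ids by combined score, highest first."""
    return sorted(candidates,
                  key=lambda doc_id: combine_scores(candidates[doc_id], weights),
                  reverse=True)


# Made-up scores for two documents, one per evidence granularity.
candidates = {
    "doc_a": {"document": 0.8, "passage": 0.4, "term": 0.2},
    "doc_b": {"document": 0.6, "passage": 1.0, "term": 1.0},
}
ranking = rank(candidates, WEIGHTS)
```

In this toy example doc_b ranks above doc_a even though doc_a has the higher document-level score, which is the kind of case where combining several criteria gives a different ranking than any single criterion.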

Automatic duplicate detection
The task of duplicate detection in large public comment datasets is to detect exact-duplicate and near-duplicate documents in comments made by the public about proposed federal regulations. Exact-duplicate and near-duplicate comments are typically created by copying and editing form letters provided by organized interest groups and lobbies. To utilize the domain knowledge about the creation process of duplicate documents, a new fuzzy match edit operation is introduced in my work to match sentences with minor word differences. The degree of fuzzy match between sentences is measured using traditional information retrieval techniques. A modified edit distance method is proposed to compare documents at the sentence granularity based on the edit operations of substitution, insertion, deletion, and fuzzy match. By combining the complementary strengths of a similarity-based approach commonly used in IR (flexibility and efficiency) and a string-based approach which measures the effort required to transform one document into another (accuracy), more effective and robust performance can be achieved for detecting near duplicates.
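A minimal sketch of an edit distance computed at the sentence granularity with an added fuzzy match operation follows. It rests on several assumptions that are not taken from the original work: cosine similarity over term counts stands in for the "traditional information retrieval techniques", insertion, deletion, and substitution each cost 1, a fuzzy match costs one minus the similarity, and the threshold of 0.8 is arbitrary.

```python
import math
import re
from collections import Counter
from typing import List


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of two sentences over raw term counts (an assumed measure)."""
    ta = Counter(re.findall(r"\w+", a.lower()))
    tb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(ta[t] * tb[t] for t in ta)
    norm = (math.sqrt(sum(v * v for v in ta.values()))
            * math.sqrt(sum(v * v for v in tb.values())))
    return dot / norm if norm else 0.0


def sentence_edit_distance(doc_a: List[str], doc_b: List[str],
                           fuzzy_threshold: float = 0.8) -> float:
    """Edit distance between two documents given as lists of sentences.

    Assumed costs: insertion, deletion, substitution = 1; exact match = 0;
    fuzzy match (similarity >= fuzzy_threshold) = 1 - similarity.
    """
    n, m = len(doc_a), len(doc_b)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = float(i)          # delete all sentences of doc_a
    for j in range(1, m + 1):
        dist[0][j] = float(j)          # insert all sentences of doc_b
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s, t = doc_a[i - 1], doc_b[j - 1]
            if s == t:
                pair_cost = 0.0        # exact match
            else:
                sim = cosine_similarity(s, t)
                # fuzzy match for minor word differences, else full substitution
                pair_cost = 1.0 - sim if sim >= fuzzy_threshold else 1.0
            dist[i][j] = min(dist[i - 1][j] + 1.0,            # deletion
                             dist[i][j - 1] + 1.0,            # insertion
                             dist[i - 1][j - 1] + pair_cost)  # match or substitution
    return dist[n][m]


form_letter = ["I oppose the proposed rule.", "Please protect our rivers."]
edited_copy = ["I strongly oppose the proposed rule.", "Please protect our rivers."]
distance = sentence_edit_distance(form_letter, edited_copy)
```

Here the edited copy differs from the form letter by one added word, so the first sentence pair is treated as a fuzzy match and the distance stays well below the cost of a full substitution.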

Here are some of the places I visited during my time at CMU. About one third were "business" trips (attending conferences, project meetings, etc.), and the rest were just for fun. Although visiting these places was fun, especially those outside the U.S., getting the visas I sometimes needed to go and return was no fun at all.

Amsterdam (2004, 2005) | Brussels (2004) | Paris (2002, 2004)
London (2004) | Toronto (2003, 2004) | Santiago de Compostela, Spain (2005)