I am a Ph.D. graduate of the Language Technologies Institute in the
School of Computer Science at Carnegie Mellon University.
I was a member of the Distributed Information Retrieval Group led by my advisor, Prof. Jamie Callan.
I graduated in May 2007 and joined the Intelligent Information Interaction group at IBM T. J. Watson Research Center.
My work focuses on intelligent information retrieval and analytic technologies and systems.
I am a BIG fan of the Pittsburgh Steelers and the Pittsburgh Penguins. Although I no longer live in Pittsburgh, I still follow and support my favorite teams as much as I can. I love everything Black & Gold! Go Steelers! Go Pens!
"Full-text federated
search in peer-to-peer networks" by Jie Lu.
Ph.D. Dissertation, Language Technologies Institute, Carnegie
Mellon University, 2007.
"Content-based peer-to-peer
network overlay for full-text federated search" by Jie
Lu and Jamie Callan.
8th RIAO Conference on Large-Scale Semantic Access to Content (RIAO '07),
2007.
"User modeling
for full-text federated search in peer-to-peer networks" by Jie Lu
and Jamie Callan.
29th International ACM SIGIR Conference on Research and Development in
Information Retrieval (SIGIR'06), 2006.
"Full-text federated
search of text-based digital libraries in peer-to-peer networks"
by Jie Lu and Jamie Callan.
Journal of Information Retrieval, Volume 9, Number 4, 2006.
"Combining multiple resources,
evidences and criteria for genomic information retrieval" by Luo
Si, Jie Lu and Jamie Callan.
Text Retrieval Conference (TREC'06), 2006.
"Federated search of text-based
digital libraries in hierarchical peer-to-peer networks" by Jie Lu
and Jamie Callan.
27th European Conference on Information Retrieval Research (ECIR'05),
2005.
"Federated search of text-based
digital libraries in hierarchical peer-to-peer networks" by Jie Lu
and Jamie Callan.
Peer-to-Peer IR Workshop of the 27th International ACM SIGIR Conference
on Research and Development in Information Retrieval (SIGIR'04),
2004.
"Merging retrieval
results in hierarchical peer-to-peer networks" by Jie Lu and Jamie
Callan.
27th International ACM SIGIR Conference on Research and Development in
Information Retrieval (SIGIR'04), 2004.
"Content-based retrieval in
hybrid peer-to-peer networks" by Jie Lu and Jamie Callan.
12th International Conference on Information and Knowledge Management
(CIKM'03), 2003.
"Distributed information retrieval
with skewed database size distributions" by Luo Si, Jie Lu and Jamie
Callan.
National Conference on Digital Government Research (dg.o2003),
2003.
"Reducing storage costs
for federated search of text databases" by Jie Lu and Jamie Callan.
National Conference on Digital Government Research (dg.o2003),
2003.
"Pruning long documents
for distributed information retrieval" by Jie Lu and Jamie Callan.
11th International Conference on Information and Knowledge Management
(CIKM'02), 2002.
Federated search in distributed environments
My dissertation research develops an integrated framework of network overlay,
network evolution, and search models for full-text ranked retrieval in P2P
networks. Multiple directory services maintain full-text representations of
resources located in their network neighborhoods, and provide local resource
selection and result merging services. The network overlay model defines
a network structure that extends previous peer functionalities and integrates
search-enhancing properties of interest-based locality, content-based locality,
and small-world connectivity to explicitly support full-text federated search. The network
evolution model provides autonomous and adaptive topology evolution algorithms
to construct a network structure with the desired content distribution, navigability,
and load balancing without centralized control or semantic annotations.
The network search model addresses the problems of resource representation,
resource selection, and result merging based on the unique characteristics
of P2P networks, and balances effectiveness against cost. The framework
is a comprehensive and practical solution to full-text ranked retrieval in
large-scale, distributed, and dynamic environments with heterogeneous, open-domain
content.
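To make the roles of resource selection and result merging at a directory service more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the algorithms from the dissertation: the function names `select_resources` and `merge_results`, the smoothed query-likelihood score, and the min-max score normalization are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def select_resources(query_terms, resource_reps, k=2, mu=0.5):
    """Rank resources by a smoothed query log-likelihood and return the top k names.

    resource_reps maps a resource name to a Counter of term frequencies,
    standing in for the full-text representation kept by a directory service.
    The scoring function is an assumption made for this sketch.
    """
    # Pool all representations to get background term statistics for smoothing.
    background = Counter()
    for rep in resource_reps.values():
        background.update(rep)
    background_len = sum(background.values())
    vocab_size = len(background)

    scores = {}
    for name, rep in resource_reps.items():
        rep_len = sum(rep.values())
        score = 0.0
        for term in query_terms:
            p_res = rep.get(term, 0) / rep_len if rep_len else 0.0
            # Add-one smoothing keeps the background probability above zero.
            p_bg = (background.get(term, 0) + 1) / (background_len + vocab_size + 1)
            score += math.log((1 - mu) * p_res + mu * p_bg)
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]


def merge_results(result_lists):
    """Merge per-resource ranked lists into one list of (doc_id, score, resource).

    Scores from different resources are not directly comparable, so each list
    is min-max normalized to [0, 1] before sorting (a simple stand-in for a
    real result-merging model).
    """
    merged = []
    for name, results in result_lists.items():
        if not results:
            continue
        raw = [score for _, score in results]
        lo, hi = min(raw), max(raw)
        span = hi - lo
        for doc_id, score in results:
            norm = (score - lo) / span if span else 1.0
            merged.append((doc_id, norm, name))
    merged.sort(key=lambda item: item[1], reverse=True)
    return merged
```

In this sketch a directory service would call `select_resources` with the query terms and the representations of its neighborhood, forward the query to the chosen resources, and pass their returned lists to `merge_results`.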
The models developed as integrated parts of the framework for full-text federated
search in P2P networks can also be used in other applications such as
organizing the server farm in large-scale centralized search, managing online
communities and social networks, and improving meta-search and personalized
search.
Genomic information retrieval
My work in genomic information retrieval focuses on combining multiple
resources, evidence, and criteria for query expansion and result ranking.
Acronyms, aliases, and synonyms are extracted from external biomedical resources
such as AcroMed, LocusLink and UMLS to create lexicons. Based on the associations
among terms/phrases in these lexicons, several term-weighting schemes are
designed to assign weights to expansion terms from different sources. For
result ranking, different scoring criteria are used to evaluate evidence from
document, passage, and term-matching granularities, which are further combined
using a weighted linear combination to produce the final ranking. Evaluation results
show that the technique developed for query expansion based on external biomedical
resources is effective, and result ranking by combining multiple scoring criteria
and evidence consistently provides better performance compared with result
ranking based on a single criterion.
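The weighted linear combination used for result ranking can be sketched in a few lines of Python. The evidence names (`document`, `passage`, `term`), the weights, and the scores below are made-up values for illustration; the actual scoring criteria and weights from the work are not reproduced here.

```python
def combine_scores(evidence, weights):
    """Weighted linear combination of per-granularity evidence scores.

    evidence maps an evidence name to a score; missing evidence counts as 0.
    """
    return sum(w * evidence.get(name, 0.0) for name, w in weights.items())


def rank_results(candidates, weights):
    """Return (doc_id, combined_score) pairs sorted from best to worst."""
    scored = [(doc_id, combine_scores(ev, weights)) for doc_id, ev in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical weights and evidence scores, chosen only for this example.
weights = {"document": 0.5, "passage": 0.3, "term": 0.2}
candidates = {
    "d1": {"document": 0.9, "passage": 0.2, "term": 0.1},
    "d2": {"document": 0.5, "passage": 0.8, "term": 0.9},
}
ranking = rank_results(candidates, weights)
```

Here `d2` outranks `d1` even though `d1` has the higher document-level score, which is the kind of case where combining evidence from several granularities changes the ranking.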
Automatic duplicate detection
The task of duplicate detection in large public comment datasets is to
detect exact-duplicate and near-duplicate documents in comments made by the
public about proposed federal regulations. Exact-duplicate and near-duplicate
comments are typically created by copying and editing form letters provided
by organized interest groups and lobbies. To utilize the domain knowledge
about the creation process of duplicate documents, a new fuzzy match edit
operation is introduced in my work to match sentences with minor word differences.
The degree of fuzzy match between sentences is measured using traditional
information retrieval techniques. A modified edit distance method is proposed
to compare documents at the sentence granularity based on the edit operations
of substitution, insertion, deletion, and fuzzy match. By combining the complementary
strengths of a similarity-based approach commonly used in IR (flexibility
and efficiency) and a string-based approach which measures the effort required
to transform one document into another (accuracy), more effective and robust
performance can be achieved for detecting near duplicates.
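The idea of an edit distance over sentences with an extra fuzzy match operation can be sketched as follows. This is a simplified illustration, not the method as published: the use of cosine similarity over term counts as the fuzzy match measure, the threshold of 0.8, and the unit costs for insertion, deletion, and substitution are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def cosine(sentence_a, sentence_b):
    """Cosine similarity between bag-of-words term-count vectors."""
    va = Counter(re.findall(r"\w+", sentence_a.lower()))
    vb = Counter(re.findall(r"\w+", sentence_b.lower()))
    dot = sum(count * vb[term] for term, count in va.items())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def sentence_edit_distance(doc_a, doc_b, fuzzy_threshold=0.8):
    """Edit distance between two documents given as lists of sentences.

    Operations: insertion and deletion (cost 1), substitution (cost 1),
    exact match (cost 0), and fuzzy match (cost 1 - similarity) when two
    sentences are at least fuzzy_threshold similar.
    """
    n, m = len(doc_a), len(doc_b)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = float(i)
    for j in range(1, m + 1):
        dist[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = doc_a[i - 1], doc_b[j - 1]
            if a == b:
                pair_cost = 0.0            # exact match
            else:
                sim = cosine(a, b)
                # Fuzzy match for near-identical sentences, else substitution.
                pair_cost = 1.0 - sim if sim >= fuzzy_threshold else 1.0
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,          # delete a sentence from doc_a
                dist[i][j - 1] + 1.0,          # insert a sentence from doc_b
                dist[i - 1][j - 1] + pair_cost,
            )
    return dist[n][m]
```

Under these assumptions, a comment that differs from a form letter by one lightly edited sentence gets a distance well below 1, while a comment with a replaced sentence costs a full substitution.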
Here are some of the places I visited during my time at CMU. About one third of the trips were for "business" (attending conferences, project meetings, etc.) and the rest were just for fun. Although going to these places was fun, especially places outside the U.S., the visas I sometimes needed in order to go and return were no fun at all.
Amsterdam (2004, 2005) | Brussels (2004) | Paris (2002, 2004)
London (2004) | Toronto (2003, 2004) | Santiago de Compostela, Spain (2005)
New Hampshire | Massachusetts | Connecticut
Washington DC | Maryland | Rhode Island
New York | Pennsylvania | Ohio | Tennessee
Louisiana | Florida | California | Colorado