Anthony Tomasic is a Consultant for the Robotics Institute at Carnegie Mellon University. For 15 years he was a Senior Systems Scientist at CMU. He was co-founder and Director of the Carnegie Mellon University Master of Computational Data Science degree program (CMU MCDS). Anthony also co-founded the Master of Science in Product Management, a degree program that focuses on transitioning software engineers to product management roles. Currently Anthony is CEO of Fort Alto, Inc., an application manufacturing company.
Anthony's research career started with an undergraduate degree in Computer Science (with honors) from Indiana University, Bloomington. He then joined the European Computer-Industry Research Centre (ECRC) in Munich, Germany, where he worked in part on the view update problem in database theory. He then attended graduate school at Princeton and performed his thesis research at Stanford University. His thesis introduced novel methods for improving the response time and throughput of information retrieval search. Upon receiving his Ph.D., Anthony led a research team at the Institut National de Recherche en Informatique et en Automatique (INRIA). His team created the federated database DISCO for data integration. DISCO was transferred to Kelkoo.com, a French internet comparison shopping site, which was subsequently purchased by Yahoo. In 1999, he participated in a team that was a winner in the French National New Venture competition. Anthony then spent three years with various internet.bomb start-ups in Silicon Valley. Eventually he moved back into research at Carnegie Mellon University, where for several years he led a team, as part of the RADAR project, that created intelligent assistants for the desktop. He has also contributed to research on extract-transform-load systems, detection of phishing messages, and scaling of database systems. In 2009, Anthony received an MBA from the Tepper School of Business at Carnegie Mellon University. In 2011, Anthony, in partnership with three other faculty, founded Tiramisu Transit, LLC. In 2020, he co-founded Fort Alto, Inc., an application manufacturing company.
Abstract
Many collections of scientific data in particular disciplines are available today on the World Wide Web. Most of these data sources are compliant with some standard for interoperable access. In addition, sources may support a common semantics, i.e., a shared meaning for the data types and their domains. However, sharing data among a global community of users is still difficult for the following reasons: (i) data providers need a mechanism for describing and publishing available sources of data; (ii) data administrators need a mechanism for discovering the location of published sources and obtaining metadata from these sources; and (iii) users need a mechanism for browsing and selecting sources. This paper describes a system, WebSemantics, that accomplishes the above tasks. We describe an architecture for the publication and discovery of scientific data sources that is an extension of the World Wide Web architecture and protocols. We support catalogs containing metadata about data sources for some application domain. We define a language for discovering sources and querying their metadata. We then describe the WebSemantics prototype.
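To make the publish, discover, and browse roles concrete, the following sketch shows a bare-bones metadata catalog in Python. The Catalog and SourceDescription names, their fields, and the matching logic are hypothetical illustrations; they are not the WebSemantics interfaces or protocol.

```python
# Minimal sketch of a catalog for publishing and discovering data sources.
# All class, attribute, and URL names are hypothetical illustrations,
# not the actual WebSemantics interfaces.
from dataclasses import dataclass, field


@dataclass
class SourceDescription:
    url: str                                      # location of the published source
    domain: str                                   # application domain of the catalog
    metadata: dict = field(default_factory=dict)  # types, units, coverage, ...


class Catalog:
    """Holds descriptions of published sources for one application domain."""

    def __init__(self):
        self.sources: list[SourceDescription] = []

    def publish(self, description: SourceDescription) -> None:
        # A data provider registers a description of an available source.
        self.sources.append(description)

    def discover(self, **constraints) -> list[SourceDescription]:
        # A data administrator or user selects sources whose metadata
        # matches all of the given key/value constraints.
        return [
            s for s in self.sources
            if all(s.metadata.get(k) == v for k, v in constraints.items())
        ]


# Example: publish one source, then discover sources that measure temperature.
catalog = Catalog()
catalog.publish(SourceDescription(
    url="http://example.org/lake-data",
    domain="limnology",
    metadata={"measurement": "temperature", "units": "celsius"},
))
print(catalog.discover(measurement="temperature"))
```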
Abstract
Disco is a mediator system developed at INRIA for accessing heterogeneous data sources over the Internet. In Disco, mediators accept queries from users, process them with respect to wrappers, and return answers. Wrappers provide access to the underlying sources. To efficiently process queries, the mediator performs cost-based query optimization. In a heterogeneous distributed database, cost-estimate-based query optimization is difficult to achieve because the underlying data sources do not export cost information. Disco's approach relies on combining a generic cost model with specific cost information exported by wrappers. In this paper, we propose a validation of Disco's cost model based on experimentation with real Web data sources. This validation shows the efficiency of our generic cost model as well as the efficiency of more specialized cost functions.
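As a rough illustration of combining a generic cost model with wrapper-specific cost information, the sketch below falls back to mediator-wide defaults for any parameter a wrapper does not export. The parameter names and the linear cost formula are assumptions made for the example; they are not Disco's actual cost model.

```python
# Sketch: per-wrapper cost estimation with a generic fallback.
# The parameters and formula are illustrative, not Disco's cost model.

GENERIC = {"cost_per_tuple": 0.5, "fixed_overhead": 200.0}  # mediator-wide defaults


def estimate_cost(wrapper_exports: dict, cardinality: int) -> float:
    """Estimate the cost of a sub-query sent to one wrapper.

    wrapper_exports holds whatever cost parameters the wrapper chose to
    export; missing parameters fall back to the generic cost model.
    """
    per_tuple = wrapper_exports.get("cost_per_tuple", GENERIC["cost_per_tuple"])
    overhead = wrapper_exports.get("fixed_overhead", GENERIC["fixed_overhead"])
    return overhead + per_tuple * cardinality


# A wrapper over a fast local source exports its own parameters;
# a bare Web wrapper exports nothing and receives the generic estimate.
print(estimate_cost({"cost_per_tuple": 0.01, "fixed_overhead": 5.0}, 10_000))
print(estimate_cost({}, 10_000))
```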
Abstract
The dramatic growth of the Internet has created a new problem for users: the location of relevant sources of documents. This article presents a framework for (and experimentally analyzes a solution to) this problem, which we call the text-source discovery problem. Our approach consists of two phases. First, each text source exports its contents to a centralized service. Then, users present queries to the service, which returns an ordered list of promising text sources. This article describes GlOSS -- Glossary of Servers Server --, with two versions: bGlOSS, which provides a Boolean query retrieval model, and vGlOSS, which provides a vector-space retrieval model. We also present hGlOSS, which provides a decentralized version of the system. We extensively describe the methodology for measuring the retrieval effectiveness of these systems and provide experimental evidence, based on actual data, that all three systems are highly effective at determining promising text sources for a given query.
Abstract
Accessing many data sources aggravates problems for users of heterogeneous distributed databases. Database administrators must deal with fragile mediators, that is, mediators with schemas and views that must be significantly changed to incorporate a new data source. When implementing translators of queries from mediators to data sources, database implementors must deal with data sources that do not support all the functionality required by mediators. Application programmers must deal with graceless failures for unavailable data sources. Queries simply return failure and no further information when data sources are unavailable for query processing. The Distributed Information Search COmponent (DISCO) addresses these problems. Data modeling techniques manage the connections to data sources, and sources can be added transparently to the users and applications. The interface between mediators and data sources flexibly handles different query languages and different data source functionality. Query rewriting and optimization techniques rewrite queries so they are efficiently evaluated by sources. Query processing and evaluation semantics are developed to process queries over unavailable data sources. In this article we describe (a) the distributed mediator architecture of DISCO; (b) the data model and its modeling of data source connections; (c) the interface to underlying data sources and the query rewriting process; and (d) query processing semantics. We describe several advantages of our system.
Abstract
With the profusion of text databases on the Internet, it is becoming increasingly hard to find the most useful databases for a given query. To attack this problem, several existing and proposed systems employ brokers to direct user queries, using a local database of summary information about the available databases. This summary information must effectively distinguish relevant databases, and must be compact while allowing efficient access. We offer evidence that one broker, GlOSS, can be effective at locating databases of interest even in a system of hundreds of databases, and examine the performance of accessing the GlOSS summaries for two promising storage methods: the grid file and partitioned hashing. We show that both methods can be tuned to provide good performance for a particular workload (within a broad range of workloads), and discuss the tradeoffs between the two data structures. As a side effect of our work, we show that grid files are more broadly applicable than previously thought; in particular, we show that by varying the policies used to construct the grid file we can provide good performance for a wide range of workloads even when storing highly skewed data.
Abstract
Many information retrieval systems provide access to abstracts. For example, Stanford University, through its FOLIO system, provides access to the INSPEC database of abstracts of the literature on physics, computer science, electrical engineering, etc. In this article this database is studied using a trace-driven simulation. We focus on a physical index design which accommodates truncations, inverted index caching, and database scaling in a distributed shared-nothing system. All three issues are shown to have a strong effect on response time and throughput. Database scaling is explored in two ways. One way assumes an ``optimal'' configuration for a single host and then linearly scales the database by duplicating the host architecture as needed. The second way determines the optimal number of hosts given a fixed database size.
Abstract
The performance of distributed text document retrieval systems is strongly influenced by the organization of the inverted index. This paper compares the performance impact on query processing of various physical organizations for inverted lists. We present a new probabilistic model of the database and queries. Simulation experiments determine those variables that most strongly influence response time and throughput. This leads to a set of design trade-offs over a wide range of hardware configurations and new parallel query processing strategies.
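For intuition about what a physical organization of inverted lists involves, the sketch below contrasts two commonly studied layouts for a distributed index: partitioning postings by document and partitioning them by term. These are generic textbook layouts used for illustration; they are not claimed to be the exact organizations compared in the paper.

```python
# Sketch of two common physical organizations for a distributed inverted index.
# Illustrative only; not the specific organizations evaluated in the paper.
from collections import defaultdict

docs = {
    1: ["parallel", "query", "processing"],
    2: ["inverted", "index", "query"],
    3: ["response", "time", "throughput"],
}
NUM_HOSTS = 2


def partition_by_document(docs, num_hosts):
    # Each host indexes a disjoint subset of documents; a query is broadcast
    # to every host and the partial results are merged.
    hosts = [defaultdict(list) for _ in range(num_hosts)]
    for doc_id, terms in docs.items():
        for term in terms:
            hosts[doc_id % num_hosts][term].append(doc_id)
    return hosts


def partition_by_term(docs, num_hosts):
    # Each host stores the complete inverted list for a subset of terms;
    # a query touches only the hosts that own its terms.
    hosts = [defaultdict(list) for _ in range(num_hosts)]
    for doc_id, terms in docs.items():
        for term in terms:
            hosts[hash(term) % num_hosts][term].append(doc_id)
    return hosts


print(partition_by_document(docs, NUM_HOSTS))
print(partition_by_term(docs, NUM_HOSTS))
```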
Abstract
This paper describes the e-XML component suite, a modular product for integrating heterogeneous data sources under an XML schema and querying in real-time the integrated information using XQuery, the emerging W3C standard for XML query. We describe the two main components of the suite, i.e., the repository for warehousing XML and the mediator for distributed query processing. We also discuss some typical applications.
Abstract
Many collections of scientific data in particular disciplines are available today around the world. Much of this data conforms to some agreed-upon standard for data exchange, i.e., a standard schema and its semantics. However, sharing this data among a global community of users is still difficult because of a lack of standards for the following necessary functions: (i) data providers need a standard for describing or publishing available sources of data; (ii) data administrators need a standard for discovering the published data; and (iii) users need a standard for accessing this discovered data. This paper describes a prototype implementation of a system, WebSemantics, that accomplishes the above tasks. We describe an architecture and protocols for publishing, discovering, and accessing scientific data. We define a language for discovering sources and querying the data in these sources, and we provide a formal semantics for this language.
Abstract
Accessing numerous widely-distributed data sources poses significant new challenges for query optimization and execution. Congestion or failure in the network introduces highly variable response times for wide-area data access. This paper is an initial exploration of solutions to this variability. We investigate a class of dynamic, run-time query plan modification techniques that we call query plan scrambling. We present an algorithm which modifies execution plans on-the-fly in response to unexpected delays in data access. The algorithm both reschedules operators and introduces new operators into the plan. We present simulation results that show how our technique effectively hides delays in receiving the initial requested tuples from remote data sources.
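The toy sketch below conveys the rescheduling half of this idea: when the source feeding the next operator is delayed, the executor runs other runnable parts of the plan rather than blocking. The plan representation is a deliberate simplification and does not reproduce the paper's scrambling algorithm or its operator synthesis step.

```python
# Toy sketch of rescheduling around a delayed source; a simplified assumption,
# not the query plan scrambling algorithm itself.

def execute_plan(scans, is_delayed):
    """scans: dict mapping source name -> callable returning tuples.
    is_delayed: callable(source) -> bool, simulating an unexpected delay."""
    pending = list(scans)          # sources whose scans still need to run
    results = {}
    while pending:
        progressed = False
        for source in list(pending):
            if is_delayed(source):
                continue                           # skip the blocked scan for now
            results[source] = scans[source]()      # run a runnable operator instead
            pending.remove(source)
            progressed = True
        if not progressed:
            # Everything runnable is blocked; a real system would now restructure
            # the plan (e.g., introduce new operators) or wait for data to arrive.
            break
    return results, pending


scans = {"R": lambda: [(1, "a")], "S": lambda: [(1, "x")], "T": lambda: [(2, "y")]}
done, blocked = execute_plan(scans, is_delayed=lambda s: s == "S")
print(done)     # work completed while S was delayed
print(blocked)  # scans still waiting on the delayed source
```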
Abstract
Access to large numbers of data sources introduces new problems for users of heterogeneous distributed databases. End users and application programmers must deal with unavailable data sources. Database administrators must deal with incorporating new sources into the model. Database implementors must deal with the translation of queries between query languages and schemas. The Distributed Information Search COmponent (Disco) addresses these problems. Query processing semantics are developed to process queries over data sources which do not return answers. Data modeling techniques manage connections to data sources. The component interface to data sources flexibly handles different query languages and translates queries. This paper describes (a) the distributed mediator architecture of Disco, (b) its query processing semantics, (c) the data model and its modeling of data source connections, and (d) the interface to underlying data sources.
Abstract
On-line information vendors offer access to multiple databases. In addition, the advent of a variety of INTERNET tools has provided easy, distributed access to many more databases. The result is thousands of text databases from which a user may choose for a given information need (a user query). This paper, an abridged version, presents a framework for (and analyzes a solution to) this problem, which we call the text-database discovery problem (see full version for a survey of related work). Our solution to the text-database discovery problem is to build a service that can suggest potentially good databases to search. A user's query will go through two steps: first, the query is presented to our server (dubbed GlOSS, for Glossary-Of-Servers Server) to select a set of promising databases to search. During the second step, the query is actually evaluated at the chosen databases. GlOSS gives a hint of what databases might be useful for the user's query, based on word-frequency information for each database. This information indicates, for each database and each keyword in the database vocabulary, how many documents at that database actually contain the keyword, for each field designator (Sections 2 and 3). For example, a Computer-Science library could report that ``Knuth'' (keyword) occurs as an author (field designator) in 180 documents, the keyword ``computer,'' in the title of 25,548 documents, and so on. This information is orders of magnitude smaller than a full index since for each keyword field-designation pair we only need to keep its frequency, not the identities of the documents that contain it. To evaluate the set of databases that GlOSS returns for a given query, Section 4 presents a framework based on the precision and recall metrics of information-retrieval theory. In that theory, for a given query q and a given set S of relevant documents for q, precision is the fraction of documents in the answer to q that are in S, and recall is the fraction of S in the answer to q. We borrow these notions to define metrics for the text-database discovery problem: for a given query q and a given set of ``relevant databases'' S, P is the fraction of databases in the answer to q that are in S, and R is the fraction of S in the answer to q. We further extend our framework by offering different definitions for a ``relevant database'' (Section 4). We have performed experiments using query traces from the FOLIO library information-retrieval system at Stanford University, and involving six databases available through FOLIO. As we will see, the results obtained for different variants of GlOSS are very promising (Section 5). Even though GlOSS keeps a small amount of information about the contents of the available databases, this information proved to be sufficient to produce very useful hints on where to search.
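As a rough illustration of how such frequency summaries can rank databases, the sketch below estimates the number of matching documents for a conjunctive query from per-keyword document counts, assuming keyword independence, and orders the databases by that estimate. Both the formula and the sample counts are illustrative; they are not necessarily GlOSS's exact estimator or data.

```python
# Sketch: rank databases by an estimated result size computed from keyword
# document frequencies. The independence assumption is an illustrative
# simplification, not necessarily GlOSS's exact formula.

def estimate_matches(db_size, freqs, query_terms):
    """db_size: total documents in the database.
    freqs: keyword -> number of documents containing it.
    query_terms: keywords that must all appear (Boolean AND)."""
    if db_size == 0:
        return 0.0
    estimate = float(db_size)
    for term in query_terms:
        estimate *= freqs.get(term, 0) / db_size   # assume terms are independent
    return estimate


databases = {                      # hypothetical summaries, not real FOLIO data
    "cs-library": (100_000, {"knuth": 180, "computer": 25_548}),
    "physics-library": (80_000, {"knuth": 2, "computer": 900}),
}
query = ["knuth", "computer"]
ranked = sorted(
    databases.items(),
    key=lambda item: estimate_matches(item[1][0], item[1][1], query),
    reverse=True,
)
for name, (size, freqs) in ranked:
    print(name, estimate_matches(size, freqs, query))
```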
Abstract
Declining disk and CPU costs have kindled a renewed interest in efficient document indexing techniques. In this paper, the problem of incremental updates of inverted lists is addressed using a dual-structure index data structure that dynamically separates long and short inverted lists and optimizes the retrieval, update, and storage of each type of list. The behavior of this index is studied with the use of a synthetically-generated document collection and a simulation model of the algorithm. The index structure is shown to support rapid insertion of documents, fast queries, and to scale well to large document collections and many disks.
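The sketch below illustrates the dual-structure idea: an inverted list starts in a short-list structure optimized for cheap appends and is promoted to a long-list structure once it grows past a threshold. The threshold and the purely in-memory storage are assumptions made to keep the example small; they are not the paper's tuned design.

```python
# Sketch of a dual-structure inverted index: short lists are update-friendly,
# long lists are kept separately once they pass a size threshold.
# Threshold and in-memory storage are illustrative assumptions.
from collections import defaultdict

SHORT_LIST_LIMIT = 4   # postings; a real system tunes this to disk block size


class DualStructureIndex:
    def __init__(self):
        self.short_lists = defaultdict(list)   # many small, cheap-to-append lists
        self.long_lists = {}                   # few large, query-optimized lists

    def insert_document(self, doc_id, terms):
        for term in set(terms):
            if term in self.long_lists:
                self.long_lists[term].append(doc_id)
            else:
                self.short_lists[term].append(doc_id)
                if len(self.short_lists[term]) > SHORT_LIST_LIMIT:
                    # Promote a list that has grown long.
                    self.long_lists[term] = self.short_lists.pop(term)

    def lookup(self, term):
        return self.long_lists.get(term) or self.short_lists.get(term, [])


index = DualStructureIndex()
for doc_id in range(10):
    index.insert_document(doc_id, ["database", "rare" if doc_id == 3 else "index"])
print(index.lookup("database"))   # promoted to the long-list structure
print(index.lookup("rare"))       # still held in a short list
```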
Abstract
With the proliferation of the world's ``information highways,'' a renewed interest in efficient document indexing techniques has come about. In this paper, the problem of incremental updates of inverted lists is addressed using a new dual-structure index data structure. The index dynamically separates long and short inverted lists and optimizes the retrieval, update, and storage of each type of list. To study the behavior of the index, a space of engineering trade-offs, ranging from optimizing update time to optimizing query performance, is described. We quantitatively explore this space by using actual data and hardware in combination with a simulation of an information retrieval system. We then describe the best algorithm for a variety of criteria.
Abstract
The popularity of on-line document databases has led to a new problem: finding which text databases (out of many candidate choices) are the most relevant to a user. Identifying the relevant databases for a given query is the text database discovery problem. The first part of this paper presents a practical solution based on estimating the result size of a query and a database. The method is termed GlOSS--Glossary of Servers Server. The second part of this paper evaluates the effectiveness of GlOSS based on a trace of real user queries. In addition, we analyze the storage cost of our approach.
Abstract
A common class of existing information retrieval systems provides access to abstracts. For example, Stanford University, through its FOLIO system, provides access to the INSPEC database of abstracts of the literature on physics, computer science, electrical engineering, etc. In this paper this database is studied using a trace-driven simulation. We focus on physical index design, inverted index caching, and database scaling in a distributed shared-nothing system. All three issues are shown to have a strong effect on response time and throughput. Database scaling is explored in two ways. One way assumes an ``optimal'' configuration for a single host and then linearly scales the database by duplicating the host architecture as needed. The second way determines the optimal number of hosts given a fixed database size.
Abstract
The performance of distributed text document retrieval systems is strongly influenced by the organization of the inverted index. This paper compares the performance impact on query processing of various physical organizations for inverted lists. We present a new probabilistic model of the database and queries. Simulation experiments determine which variables most strongly influence response time and throughput. This leads to a set of design trade-offs over a range of hardware configurations and new parallel query processing strategies.
Abstract
First steps are taken in examining the view update problem in deductive databases. The class of recursive definite deductive databases is examined. A view update is defined as a statement of factual logical consequence of the deductive database. A translation is a minimal update on the facts of a deductive database such that the view update holds. The number of translations for a view update is exponential in the size of the database. Algorithms for view updates are presented and proven correct. They are based on SLD-resolution and are independent of the computation rule. Finally, as an example of a method for reducing the number of possible translations of a view update, rule annotations are introduced. A small number of unique annotations (proportional to the size of the database) is shown to produce unique translations of view updates.
Abstract
We discuss the problem of unavailable data sources in the context of two mediator-based applications. We discuss the limitations of existing systems with respect to this problem and describe a novel evaluation model that overcomes these shortcomings.
Introduction
Mediator systems are being deployed in various environments to provide query access to heterogeneous data sources. When processing a query, the mediator may have difficulty accessing a data source (due to network or server problems). In such cases the mediator is faced with the problem of unavailable data sources. In this paper, we discuss the problem of unavailable data sources in mediator-based applications. We first introduce two applications that we are currently developing. The first application concerns a hospital information system; a mediator accesses data sources located in the different services to provide doctors with information on patients. The second application concerns access to document repositories within a network of public and private institutions; a mediator accesses the data sources located in each institution to answer queries posed through a World Wide Web application. We detail the characteristics of these applications in Section 2. We show that these applications are representative of large classes of applications. We then discuss, in Section 3, the impact of unavailable data sources on the design of both applications. We illustrate the limitations of classical mediator systems. We give in Section 4 an overview of a novel sequential model of interaction which fits the needs of both applications and overcomes some of the above-mentioned shortcomings. We review related work in Section 5. We conclude and give directions for future work in Section 6.
Abstract
The scientific community, public organizations and administrations have generated a large amount of data concerning the environment. There is a need to allow sharing and exchange of this type of information by various kinds of users including scientists, decision-makers and public authorities. Metadata arises as the solution to support these requirements. We present a formal framework for classification of metadata that will give a uniform definition of what metadata is, how it can be used and where it must be used. This framework also provides a procedure for classifying elements of existing metadata standards.
Abstract
Many heterogeneous database system products and prototypes exist today; they will soon be deployed in a wide variety of environments. Most existing systems suffer from an Achilles' heel: they fail ungracefully in the presence of unavailable data sources. If some data sources are unavailable when accessed, these systems either silently ignore them or generate an error. This behavior is improper in environments where there is a non-negligible probability that data sources cannot be accessed (e.g., the Internet). If some data sources cannot be accessed when processing a query, the complete answer to this query cannot be computed; however, some work can be done with the data sources that are available. In this paper, we propose a novel approach where, in the presence of unavailable data sources, the answer to a query is a partial answer. A partial answer is a representation of the work that has been done when the complete answer to a query cannot be computed, and of the work that remains to be done in order to obtain this complete answer. The use of a partial answer is twofold. First, it contains an incremental query that allows the complete answer to be obtained without redoing the work that has already been done. Second, the application program can extract information from a partial answer through the use of a secondary query, which we call a parachute query. In this paper, we present a framework for partial answers and we propose three algorithms for the evaluation of queries in the presence of unavailable sources, the construction of incremental queries, and the evaluation of parachute queries.
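A minimal sketch of the partial-answer idea follows, assuming hypothetical PartialAnswer and evaluate helpers: work already done is kept, the unavailable sources are recorded so an incremental re-evaluation can finish the query later, and parachute queries probe the data that has already been obtained. It is an illustration of the concept, not the paper's algorithms.

```python
# Sketch: partial answers with an incremental remainder and parachute queries.
# Class and function names are hypothetical, not the paper's API.
from dataclasses import dataclass, field


@dataclass
class PartialAnswer:
    computed: dict = field(default_factory=dict)   # source -> tuples already fetched
    remaining: list = field(default_factory=list)  # sources still to be queried

    def parachute_query(self, predicate):
        # Extract whatever already-available data satisfies the predicate.
        return [t for tuples in self.computed.values() for t in tuples if predicate(t)]


def evaluate(sources, available):
    answer = PartialAnswer()
    for name, fetch in sources.items():
        if name in available:
            answer.computed[name] = fetch()
        else:
            answer.remaining.append(name)   # retried later by the incremental query
    return answer


sources = {"radiology": lambda: [("patient1", "x-ray")],
           "pharmacy": lambda: [("patient1", "aspirin")]}
partial = evaluate(sources, available={"radiology"})
print(partial.parachute_query(lambda t: t[0] == "patient1"))
print(partial.remaining)   # ['pharmacy']: finish later without redoing radiology
```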
Abstract
Mediator systems are used today in a wide variety of unreliable environments. When processing a query, a mediator may try to access a data source which is unavailable. In this situation, existing systems either silently ignore unavailable data sources or generate an error. This behavior is inefficient in environments with a non-negligible probability that a data source is unavailable (e.g., the Internet). In the case that some data sources are unavailable, the complete answer to a query cannot be obtained; however, useful work can be done with the available data sources. In this paper, we describe a novel approach to mediator query processing where, in the presence of unavailable data sources, the answer to a query is computed incrementally. It is possible to access data obtained at intermediate steps of the computation. We define two new evaluation models and analytically model, for each, the probability of obtaining the answer to a query in the presence of unavailable data sources. The analysis shows that complete answers are more likely in our two evaluation models than in a classical system. We measure the performance of our evaluation models via simulations and show that, in the case that all data sources are available, the performance penalty for our approach is negligible.
Abstract
Given an intensional database (IDB) and an extensional database (EDB), the view update problem translates updates on the IDB into updates on the EDB. One approach to the view update problem uses a translation language to specify the meaning of a view update. In this paper we prove properties of a translation language. This approach to the view update problem studies the expressive power of the translation language and the computational cost of demonstrating properties of a translation. We use an active rule-based database language for specifying translations of view updates. This paper uses the containment of one datalog program (or conjunctive query) by another to demonstrate that a translation is semantically correct. We show that the complexity of correctness is lower for insertion than for deletion. Finally, we discuss extensions to the translation language.