Anthony Tomasic is a Consultant for the Robotics Institute at Carnegie Mellon University. For 15 years he was a Senior Systems Scientist at CMU. He was co-founder and Director of the Carnegie Mellon University Master of Computational Data Science degree program (CMU MCDS). Anthony also co-founded the Master of Science in Product Management, a degree program that focuses on transitioning software engineers to product management roles. Currently Anthony is CEO of Fort Alto, Inc., an application manufacturing company.
Anthony's research career started with an undergraduate degree in Computer Science (with honors) from Indiana University, Bloomington. He then joined the European Computer-Industry Research Centre (ECRC) in Munich, Germany, where he worked in part on the view update problem in database theory. He then attended graduate school at Princeton and performed his thesis research at Stanford University. His thesis introduced novel methods for improving the response time and throughput of information retrieval search. Upon receiving his Ph.D., Anthony led a research team at the Institut National de Recherche en Informatique et en Automatique (INRIA). His team created the federated database DISCO for data integration. DISCO was transferred to Kelkoo.com, a French internet comparison shopping site, which was subsequently purchased by Yahoo. In 1999, he participated in a team that was a winner in the French National New Venture competition. Anthony then spent three years with various internet.bomb start-ups in Silicon Valley. Eventually he moved back into research at Carnegie Mellon University, where for several years he led a team, as part of the RADAR project, that created intelligent assistants for the desktop. He has also contributed to research on extract-transform-load systems, detection of phishing messages, and scaling of database systems. In 2009, Anthony received an MBA from the Tepper School of Business at Carnegie Mellon University. In 2011, Anthony, in partnership with three other faculty, founded Tiramisu Transit, LLC. In 2020, he co-founded Fort Alto, Inc., an application manufacturing company.
Abstract
Many collections of scientific data in particular disciplines are available today on the World Wide Web. Most of these data sources are compliant with some standard for interoperable access. In addition, sources may support a common semantics, i.e., a shared meaning for the data types and their domains. However, sharing data among a global community of users is still difficult for the following reasons: (i) data providers need a mechanism for describing and publishing available sources of data; (ii) data administrators need a mechanism for discovering the location of published sources and obtaining metadata from these sources; and (iii) users need a mechanism for browsing and selecting sources. This paper describes a system, WebSemantics, that accomplishes the above tasks. We describe an architecture for the publication and discovery of scientific data sources that is an extension of the World Wide Web architecture and protocols. We support catalogs containing metadata about data sources for some application domain. We define a language for discovering sources and querying their metadata. We then describe the WebSemantics prototype.
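To make the publish, discover, and browse roles concrete, the following sketch shows a bare-bones metadata catalog in Python. The Catalog and SourceDescription names, their fields, and the matching logic are hypothetical illustrations; they are not the WebSemantics interfaces or protocol.

```python
# Minimal sketch of a catalog for publishing and discovering data sources.
# All class, attribute, and URL names are hypothetical illustrations,
# not the actual WebSemantics interfaces.
from dataclasses import dataclass, field


@dataclass
class SourceDescription:
    url: str                                      # location of the published source
    domain: str                                   # application domain of the catalog
    metadata: dict = field(default_factory=dict)  # types, units, coverage, ...


class Catalog:
    """Holds descriptions of published sources for one application domain."""

    def __init__(self):
        self.sources: list[SourceDescription] = []

    def publish(self, description: SourceDescription) -> None:
        # A data provider registers a description of an available source.
        self.sources.append(description)

    def discover(self, **constraints) -> list[SourceDescription]:
        # A data administrator or user selects sources whose metadata
        # matches all of the given key/value constraints.
        return [
            s for s in self.sources
            if all(s.metadata.get(k) == v for k, v in constraints.items())
        ]


# Example: publish one source, then discover sources that measure temperature.
catalog = Catalog()
catalog.publish(SourceDescription(
    url="http://example.org/lake-data",
    domain="limnology",
    metadata={"measurement": "temperature", "units": "celsius"},
))
print(catalog.discover(measurement="temperature"))
```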
Abstract
Disco is a mediator system developed at INRIA for accessing heterogeneous data sources over the Internet. In Disco, mediators accept queries from users, process them with respect to wrappers, and return answers. Wrappers provide access to the underlying sources. To efficiently process queries, the mediator performs cost-based query optimization. In a heterogeneous distributed database, cost-estimate-based query optimization is difficult to achieve because the underlying data sources do not export cost information. Disco's approach relies on combining a generic cost model with specific cost information exported by wrappers. In this paper, we propose a validation of Disco's cost model based on experimentation with real Web data sources. This validation shows the efficiency of our generic cost model as well as the efficiency of more specialized cost functions.
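As a rough illustration of combining a generic cost model with wrapper-specific cost information, the sketch below falls back to mediator-wide defaults for any parameter a wrapper does not export. The parameter names and the linear cost formula are assumptions made for the example; they are not Disco's actual cost model.

```python
# Sketch: per-wrapper cost estimation with a generic fallback.
# The parameters and formula are illustrative, not Disco's cost model.

GENERIC = {"cost_per_tuple": 0.5, "fixed_overhead": 200.0}  # mediator-wide defaults


def estimate_cost(wrapper_exports: dict, cardinality: int) -> float:
    """Estimate the cost of a sub-query sent to one wrapper.

    wrapper_exports holds whatever cost parameters the wrapper chose to
    export; missing parameters fall back to the generic cost model.
    """
    per_tuple = wrapper_exports.get("cost_per_tuple", GENERIC["cost_per_tuple"])
    overhead = wrapper_exports.get("fixed_overhead", GENERIC["fixed_overhead"])
    return overhead + per_tuple * cardinality


# A wrapper over a fast local source exports its own parameters;
# a bare Web wrapper exports nothing and receives the generic estimate.
print(estimate_cost({"cost_per_tuple": 0.01, "fixed_overhead": 5.0}, 10_000))
print(estimate_cost({}, 10_000))
```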
Abstract
The dramatic growth of the Internet has created a new problem for users: the location of relevant sources of documents. This article presents a framework for (and experimentally analyzes a solution to) this problem, which we call the text-source discovery problem. Our approach consists of two phases. First, each text source exports its contents to a centralized service. Then, users present queries to the service, which returns an ordered list of promising text sources. This article describes GlOSS -- Glossary of Servers Server --, with two versions: bGlOSS, which provides a Boolean query retrieval model, and vGlOSS, which provides a vector-space retrieval model. We also present hGlOSS, which provides a decentralized version of the system. We extensively describe the methodology for measuring the retrieval effectiveness of these systems and provide experimental evidence, based on actual data, that all three systems are highly effective at determining promising text sources for a given query.
Abstract
Accessing many data sources aggravates problems for users of heterogeneous distributed databases. Database administrators must deal with fragile mediators, that is, mediators with schemas and views that must be significantly changed to incorporate a new data source. When implementing translators of queries from mediators to data sources, database implementors must deal with data sources that do not support all the functionality required by mediators. Application programmers must deal with graceless failures for unavailable data sources. Queries simply return failure and no further information when data sources are unavailable for query processing. The Distributed Information Search COmponent (DISCO) addresses these problems. Data modeling techniques manage the connections to data sources, and sources can be added transparently to the users and applications. The interface between mediators and data sources flexibly handles different query languages and different data source functionality. Query rewriting and optimization techniques rewrite queries so they are efficiently evaluated by sources. Query processing and evaluation semantics are developed to process queries over unavailable data sources. In this article we describe (a) the distributed mediator architecture of DISCO; (b) the data model and its modeling of data source connections; (c) the interface to underlying data sources and the query rewriting process; and (d) query processing semantics. We describe several advantages of our system.
Abstract
With the profusion of text databases on the Internet, it is becoming increasingly hard to find the most useful databases for a given query. To attack this problem, several existing and proposed systems employ brokers to direct user queries, using a local database of summary information about the available databases. This summary information must effectively distinguish relevant databases, and must be compact while allowing efficient access. We offer evidence that one broker, GlOSS, can be effective at locating databases of interest even in a system of hundreds of databases, and examine the performance of accessing the GlOSS summaries for two promising storage methods: the grid file and partitioned hashing. We show that both methods can be tuned to provide good performance for a particular workload (within a broad range of workloads), and discuss the tradeoffs between the two data structures. As a side effect of our work, we show that grid files are more broadly applicable than previously thought; in particular, we show that by varying the policies used to construct the grid file we can provide good performance for a wide range of workloads even when storing highly skewed data.
Abstract
Many information retrieval systems provide access to abstracts. For example, Stanford University, through its FOLIO system, provides access to the INSPEC database of abstracts of the literature on physics, computer science, electrical engineering, etc. In this article this database is studied using a trace-driven simulation. We focus on a physical index design which accommodates truncations, inverted index caching, and database scaling in a distributed shared-nothing system. All three issues are shown to have a strong effect on response time and throughput. Database scaling is explored in two ways. One way assumes an ``optimal'' configuration for a single host and then linearly scales the database by duplicating the host architecture as needed. The second way determines the optimal number of hosts given a fixed database size.
Abstract
The performance of distributed text document retrieval systems is strongly influenced by the organization of the inverted index. This paper compares the performance impact on query processing of various physical organizations for inverted lists. We present a new probabilistic model of the database and queries. Simulation experiments determine those variables that most strongly influence response time and throughput. This leads to a set of design trade-offs over a wide range of hardware configurations and new parallel query processing strategies.
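For intuition about what a physical organization of inverted lists involves, the sketch below contrasts two commonly studied layouts for a distributed index: partitioning postings by document and partitioning them by term. These are generic textbook layouts used for illustration; they are not claimed to be the exact organizations compared in the paper.

```python
# Sketch of two common physical organizations for a distributed inverted index.
# Illustrative only; not the specific organizations evaluated in the paper.
from collections import defaultdict

docs = {
    1: ["parallel", "query", "processing"],
    2: ["inverted", "index", "query"],
    3: ["response", "time", "throughput"],
}
NUM_HOSTS = 2


def partition_by_document(docs, num_hosts):
    # Each host indexes a disjoint subset of documents; a query is broadcast
    # to every host and the partial results are merged.
    hosts = [defaultdict(list) for _ in range(num_hosts)]
    for doc_id, terms in docs.items():
        for term in terms:
            hosts[doc_id % num_hosts][term].append(doc_id)
    return hosts


def partition_by_term(docs, num_hosts):
    # Each host stores the complete inverted list for a subset of terms;
    # a query touches only the hosts that own its terms.
    hosts = [defaultdict(list) for _ in range(num_hosts)]
    for doc_id, terms in docs.items():
        for term in terms:
            hosts[hash(term) % num_hosts][term].append(doc_id)
    return hosts


print(partition_by_document(docs, NUM_HOSTS))
print(partition_by_term(docs, NUM_HOSTS))
```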
Abstract
This paper describes the e-XML component suite, a modular product for integrating heterogeneous data sources under an XML schema and querying in real-time the integrated information using XQuery, the emerging W3C standard for XML query. We describe the two main components of the suite, i.e., the repository for warehousing XML and the mediator for distributed query processing. We also discuss some typical applications.
Abstract
Many collections of scientific data in particular disciplines are available today around the world. Much of this data conforms to some agreed-upon standard for data exchange, i.e., a standard schema and its semantics. However, sharing this data among a global community of users is still difficult because of a lack of standards for the following necessary functions: (i) data providers need a standard for describing or publishing available sources of data; (ii) data administrators need a standard for discovering the published data; and (iii) users need a standard for accessing this discovered data. This paper describes a prototype implementation of a system, WebSemantics, that accomplishes the above tasks. We describe an architecture and protocols for publishing, discovering, and accessing scientific data. We define a language for discovering sources and querying the data in these sources, and we provide a formal semantics for this language.
Abstract
Accessing numerous widely-distributed data sources poses significant new challenges for query optimization and execution. Congestion or failure in the network introduces highly variable response times for wide-area data access. This paper is an initial exploration of solutions to this variability. We investigate a class of dynamic, run-time query plan modification techniques that we call query plan scrambling. We present an algorithm which modifies execution plans on-the-fly in response to unexpected delays in data access. The algorithm both reschedules operators and introduces new operators into the plan. We present simulation results that show how our technique effectively hides delays in receiving the initial requested tuples from remote data sources.
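The toy sketch below conveys the rescheduling half of this idea: when the source feeding the next operator is delayed, the executor runs other runnable parts of the plan rather than blocking. The plan representation is a deliberate simplification and does not reproduce the paper's scrambling algorithm or its operator synthesis step.

```python
# Toy sketch of rescheduling around a delayed source; a simplified assumption,
# not the query plan scrambling algorithm itself.

def execute_plan(scans, is_delayed):
    """scans: dict mapping source name -> callable returning tuples.
    is_delayed: callable(source) -> bool, simulating an unexpected delay."""
    pending = list(scans)          # sources whose scans still need to run
    results = {}
    while pending:
        progressed = False
        for source in list(pending):
            if is_delayed(source):
                continue                           # skip the blocked scan for now
            results[source] = scans[source]()      # run a runnable operator instead
            pending.remove(source)
            progressed = True
        if not progressed:
            # Everything runnable is blocked; a real system would now restructure
            # the plan (e.g., introduce new operators) or wait for data to arrive.
            break
    return results, pending


scans = {"R": lambda: [(1, "a")], "S": lambda: [(1, "x")], "T": lambda: [(2, "y")]}
done, blocked = execute_plan(scans, is_delayed=lambda s: s == "S")
print(done)     # work completed while S was delayed
print(blocked)  # scans still waiting on the delayed source
```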
Abstract
Access to large numbers of data sources introduces new problems for users of heterogeneous distributed databases. End users and application programmers must deal with unavailable data sources. Database administrators must deal with incorporating new sources into the model. Database implementors must deal with the translation of queries between query languages and schemas. The Distributed Information Search COmponent (Disco) addresses these problems. Query processing semantics are developed to process queries over data sources which do not return answers. Data modeling techniques manage connections to data sources. The component interface to data sources flexibly handles different query languages and translates queries. This paper describes (a) the distributed mediator architecture of Disco, (b) its query processing semantics, (c) the data model and its modeling of data source connections, and (d) the interface to underlying data sources.
Abstract
On-line information vendors offer access to multiple databases. In addition, the advent of a variety of INTERNET tools has provided easy, distributed access to many more databases. The result is thousands of text databases from which a user may choose for a given information need (a user query). This paper, an abridged version, presents a framework for (and analyzes a solution to) this problem, which we call the text-database discovery problem (see full version for a survey of related work). Our solution to the text-database discovery problem is to build a service that can suggest potentially good databases to search. A user's query will go through two steps: first, the query is presented to our server (dubbed GlOSS, for Glossary-Of-Servers Server) to select a set of promising databases to search. During the second step, the query is actually evaluated at the chosen databases. GlOSS gives a hint of what databases might be useful for the user's query, based on word-frequency information for each database. This information indicates, for each database and each keyword in the database vocabulary, how many documents at that database actually contain the keyword, for each field designator (Sections 2 and 3). For example, a Computer-Science library could report that ``Knuth'' (keyword) occurs as an author (field designator) in 180 documents, the keyword ``computer,'' in the title of 25,548 documents, and so on. This information is orders of magnitude smaller than a full index since for each keyword field-designation pair we only need to keep its frequency, not the identities of the documents that contain it. To evaluate the set of databases that GlOSS returns for a given query, Section 4 presents a framework based on the precision and recall metrics of information-retrieval theory. In that theory, for a given query q and a given set S of relevant documents for q, precision is the fraction of documents in the answer to q that are in S, and recall is the fraction of S in the answer to q. We borrow these notions to define metrics for the text-database discovery problem: for a given query q and a given set of ``relevant databases'' S, P is the fraction of databases in the answer to q that are in S, and R is the fraction of S in the answer to q. We further extend our framework by offering different definitions for a ``relevant database'' (Section 4). We have performed experiments using query traces from the FOLIO library information-retrieval system at Stanford University, and involving six databases available through FOLIO. As we will see, the results obtained for different variants of GlOSS are very promising (Section 5). Even though GlOSS keeps a small amount of information about the contents of the available databases, this information proved to be sufficient to produce very useful hints on where to search.
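As a rough illustration of how such frequency summaries can rank databases, the sketch below estimates the number of matching documents for a conjunctive query from per-keyword document counts, assuming keyword independence, and orders the databases by that estimate. Both the formula and the sample counts are illustrative; they are not necessarily GlOSS's exact estimator or data.

```python
# Sketch: rank databases by an estimated result size computed from keyword
# document frequencies. The independence assumption is an illustrative
# simplification, not necessarily GlOSS's exact formula.

def estimate_matches(db_size, freqs, query_terms):
    """db_size: total documents in the database.
    freqs: keyword -> number of documents containing it.
    query_terms: keywords that must all appear (Boolean AND)."""
    if db_size == 0:
        return 0.0
    estimate = float(db_size)
    for term in query_terms:
        estimate *= freqs.get(term, 0) / db_size   # assume terms are independent
    return estimate


databases = {                      # hypothetical summaries, not real FOLIO data
    "cs-library": (100_000, {"knuth": 180, "computer": 25_548}),
    "physics-library": (80_000, {"knuth": 2, "computer": 900}),
}
query = ["knuth", "computer"]
ranked = sorted(
    databases.items(),
    key=lambda item: estimate_matches(item[1][0], item[1][1], query),
    reverse=True,
)
for name, (size, freqs) in ranked:
    print(name, estimate_matches(size, freqs, query))
```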
Abstract
Declining disk and CPU costs have kindled a renewed interest in efficient document indexing techniques. In this paper, the problem of incremental updates of inverted lists is addressed using a dual-structure index data structure that dynamically separates long and short inverted lists and optimizes the retrieval, update, and storage of each type of list. The behavior of this index is studied with the use of a synthetically-generated document collection and a simulation model of the algorithm. The index structure is shown to support rapid insertion of documents, fast queries, and to scale well to large document collections and many disks.
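The sketch below illustrates the dual-structure idea: an inverted list starts in a short-list structure optimized for cheap appends and is promoted to a long-list structure once it grows past a threshold. The threshold and the purely in-memory storage are assumptions made to keep the example small; they are not the paper's tuned design.

```python
# Sketch of a dual-structure inverted index: short lists are update-friendly,
# long lists are kept separately once they pass a size threshold.
# Threshold and in-memory storage are illustrative assumptions.
from collections import defaultdict

SHORT_LIST_LIMIT = 4   # postings; a real system tunes this to disk block size


class DualStructureIndex:
    def __init__(self):
        self.short_lists = defaultdict(list)   # many small, cheap-to-append lists
        self.long_lists = {}                   # few large, query-optimized lists

    def insert_document(self, doc_id, terms):
        for term in set(terms):
            if term in self.long_lists:
                self.long_lists[term].append(doc_id)
            else:
                self.short_lists[term].append(doc_id)
                if len(self.short_lists[term]) > SHORT_LIST_LIMIT:
                    # Promote a list that has grown long.
                    self.long_lists[term] = self.short_lists.pop(term)

    def lookup(self, term):
        return self.long_lists.get(term) or self.short_lists.get(term, [])


index = DualStructureIndex()
for doc_id in range(10):
    index.insert_document(doc_id, ["database", "rare" if doc_id == 3 else "index"])
print(index.lookup("database"))   # promoted to the long-list structure
print(index.lookup("rare"))       # still held in a short list
```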
Abstract
With the proliferation of the world's ``information highways,'' a renewed interest in efficient document indexing techniques has come about. In this paper, the problem of incremental updates of inverted lists is addressed using a new dual-structure index data structure. The index dynamically separates long and short inverted lists and optimizes the retrieval, update, and storage of each type of list. To study the behavior of the index, a space of engineering trade-offs, ranging from optimizing update time to optimizing query performance, is described. We quantitatively explore this space by using actual data and hardware in combination with a simulation of an information retrieval system. We then describe the best algorithm for a variety of criteria.
Abstract
The popularity of on-line document databases has led to a new problem: finding which text databases (out of many candidate choices) are the most relevant to a user. Identifying the relevant databases for a given query is the text database discovery problem. The first part of this paper presents a practical solution based on estimating the result size of a query and a database. The method is termed GlOSS--Glossary of Servers Server. The second part of this paper evaluates the effectiveness of GlOSS based on a trace of real user queries. In addition, we analyze the storage cost of our approach.
Abstract
A common class of existing information retrieval systems provides access to abstracts. For example, Stanford University, through its FOLIO system, provides access to the INSPEC database of abstracts of the literature on physics, computer science, electrical engineering, etc. In this paper this database is studied using a trace-driven simulation. We focus on physical index design, inverted index caching, and database scaling in a distributed shared-nothing system. All three issues are shown to have a strong effect on response time and throughput. Database scaling is explored in two ways. One way assumes an ``optimal'' configuration for a single host and then linearly scales the database by duplicating the host architecture as needed. The second way determines the optimal number of hosts given a fixed database size.
Abstract
The performance of distributed text document retrieval systems is strongly influenced by the organization of the inverted index. This paper compares the performance impact on query processing of various physical organizations for inverted lists. We present a new probabilistic model of the database and queries. Simulation experiments determine which variables most strongly influence response time and throughput. This leads to a set of design trade-offs over a range of hardware configurations and new parallel query processing strategies.
Abstract
First steps are taken in examining the view update problem in deductive databases. The class of recursive definite deductive databases is examined. A view update is defined as a statement of factual logical consequence of the deductive database. A translation is a minimal update on the facts of a deductive database such that the view update holds. The number of translations for a view update is exponential in the size of the database. Algorithms for view updates are presented and proven correct. They are based on SLD-resolution and are independent of the computation rule. Finally, as an example of a method for reducing the number of possible translations of a view update, rule annotations are introduced. A small number of unique annotations (proportional to the size of the database) is shown to produce unique translations of view updates.
Abstract
We discuss the problem of unavailable data sources in the context of two mediator-based applications. We discuss the limitations of existing systems with respect to this problem and describe a novel evaluation model that overcomes these shortcomings.
Introduction
Mediator systems are being deployed in various environments to provide query access to heterogeneous data sources. When processing a query, the mediator may have difficulty accessing a data source (due to network or server problems). In such cases the mediator is faced with the problem of unavailable data sources. In this paper, we discuss the problem of unavailable data sources in mediator-based applications. We first introduce two applications that we are currently developing. The first application concerns a hospital information system; a mediator accesses data sources located in the different services to provide doctors with information on patients. The second application concerns access to document repositories within a network of public and private institutions; a mediator accesses the data sources located in each institution to answer queries posed through a World Wide Web application. We detail the characteristics of these applications in Section 2. We show that these applications are representative of large classes of applications. We then discuss, in Section 3, the impact of unavailable data sources on the design of both applications. We illustrate the limitations of classical mediator systems. We give in Section 4 an overview of a novel sequential model of interaction which fits the needs of both applications and overcomes some of the above-mentioned shortcomings. We review related work in Section 5. We conclude and give directions for future work in Section 6.
Abstract
The scientific community, public organizations and administrations have generated a large amount of data concerning the environment. There is a need to allow sharing and exchange of this type of information by various kinds of users including scientists, decision-makers and public authorities. Metadata arises as the solution to support these requirements. We present a formal framework for classification of metadata that will give a uniform definition of what metadata is, how it can be used and where it must be used. This framework also provides a procedure for classifying elements of existing metadata standards.
Abstract
Many heterogeneous database system products and prototypes exist today; they will soon be deployed in a wide variety of environments. Most existing systems suffer from an Achilles' heel: they fail ungracefully in the presence of unavailable data sources. If some data sources are unavailable when accessed, these systems either silently ignore them or generate an error. This behavior is improper in environments where there is a non-negligible probability that data sources cannot be accessed (e.g., the Internet). If some data sources cannot be accessed when processing a query, the complete answer to this query cannot be computed; however, some work can be done with the data sources that are available. In this paper, we propose a novel approach where, in the presence of unavailable data sources, the answer to a query is a partial answer. A partial answer is a representation of the work that has been done when the complete answer to a query cannot be computed, and of the work that remains to be done in order to obtain this complete answer. The use of a partial answer is twofold. First, it contains an incremental query that allows the complete answer to be obtained without redoing the work that has already been done. Second, the application program can extract information from a partial answer through the use of a secondary query, which we call a parachute query. In this paper, we present a framework for partial answers and we propose three algorithms for the evaluation of queries in the presence of unavailable sources, the construction of incremental queries, and the evaluation of parachute queries.
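A minimal sketch of the partial-answer idea follows, assuming hypothetical PartialAnswer and evaluate helpers: work already done is kept, the unavailable sources are recorded so an incremental re-evaluation can finish the query later, and parachute queries probe the data that has already been obtained. It is an illustration of the concept, not the paper's algorithms.

```python
# Sketch: partial answers with an incremental remainder and parachute queries.
# Class and function names are hypothetical, not the paper's API.
from dataclasses import dataclass, field


@dataclass
class PartialAnswer:
    computed: dict = field(default_factory=dict)   # source -> tuples already fetched
    remaining: list = field(default_factory=list)  # sources still to be queried

    def parachute_query(self, predicate):
        # Extract whatever already-available data satisfies the predicate.
        return [t for tuples in self.computed.values() for t in tuples if predicate(t)]


def evaluate(sources, available):
    answer = PartialAnswer()
    for name, fetch in sources.items():
        if name in available:
            answer.computed[name] = fetch()
        else:
            answer.remaining.append(name)   # retried later by the incremental query
    return answer


sources = {"radiology": lambda: [("patient1", "x-ray")],
           "pharmacy": lambda: [("patient1", "aspirin")]}
partial = evaluate(sources, available={"radiology"})
print(partial.parachute_query(lambda t: t[0] == "patient1"))
print(partial.remaining)   # ['pharmacy']: finish later without redoing radiology
```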
Abstract
Mediator systems are used today in a wide variety of unreliable environments. When processing a query, a mediator may try to access a data source which is unavailable. In this situation, existing systems either silently ignore unavailable data sources or generate an error. This behavior is inefficient in environments with a non-negligible probability that a data source is unavailable (e.g., the Internet). In the case that some data sources are unavailable, the complete answer to a query cannot be obtained; however, useful work can be done with the available data sources. In this paper, we describe a novel approach to mediator query processing where, in the presence of unavailable data sources, the answer to a query is computed incrementally. It is possible to access data obtained at intermediate steps of the computation. We define two new evaluation models and analytically model, for each, the probability of obtaining the answer to a query in the presence of unavailable data sources. The analysis shows that complete answers are more likely in our two evaluation models than in a classical system. We measure the performance of our evaluation models via simulations and show that, in the case that all data sources are available, the performance penalty for our approach is negligible.
Abstract
Given an intensional database (IDB) and an extensional database (EDB), the view update problem translates updates on the IDB into updates on the EDB. One approach to the view update problem uses a translation language to specify the meaning of a view update. In this paper we prove properties of a translation language. This approach to the view update problem studies the expressive power of the translation language and the computational cost of demonstrating properties of a translation. We use an active rule-based database language for specifying translations of view updates. This paper uses the containment of one datalog program (or conjunctive query) by another to demonstrate that a translation is semantically correct. We show that the complexity of correctness is lower for insertion than for deletion. Finally, we discuss extensions to the translation language.