James Cheney
University of Edinburgh

Data provenance as dependency analysis


* * * * * PLEASE NOTE:  NOT USUAL DAY OR CONFERENCE ROOM * * * * *

Abstract:

Scientists in a variety of disciplines are now using databases and other sophisticated systems in novel ways and placing new demands on them.  For example, biologists expect data to be accompanied by so-called provenance information explaining how the data got there, where it came from, and how it has been manipulated.  Currently, provenance is maintained through manual effort.  Doing so is expensive, tedious, and error-prone, which motivates investigating ways of automatically tracking and managing provenance in general-purpose systems such as databases.

This raises many system design and implementation issues.  But there are also foundational questions that ought to be addressed first, such as what makes a given candidate definition of provenance correct or suitable for a given purpose.  These questions have been largely ignored.  Instead, most work in this area is based on ad hoc definitions motivated by imprecise claims that the definition captures how parts of the input "influence", "contribute to" or "are relevant to" parts of the output.

In this talk I will present a new provenance-tracking technique, called dependency provenance, that is equipped with a clear and (I argue) well-motivated correctness property.  For each part of the output of a query, we define the dependency provenance as the set of input locations on which the given output part depends, in a sense similar to that used in programming language dependency analyses.  This notion is also closely related to debugging techniques such as dynamic program slicing, adapted to databases.  Calculating exact dependency provenance turns out to be expensive (and undecidable in general), so we consider dynamic and static over-approximations.
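
As a rough illustration of the idea of a dynamic over-approximation, the toy sketch below propagates sets of input locations through a single aggregate query.  It is not the formalism presented in the talk: the data model (tables as lists of dictionaries), the location encoding (row index, field name), and the helper names `annotate` and `sum_where` are all assumptions made for this example.

```python
# Toy sketch only: the data model and all names here are assumptions
# made for illustration, not the definitions used in the talk.
from typing import Callable, Dict, FrozenSet, List, Tuple

Loc = Tuple[int, str]              # (row index, field name) names an input location
Cell = Tuple[int, FrozenSet[Loc]]  # a value paired with the locations it depends on
Row = Dict[str, Cell]


def annotate(table: List[Dict[str, int]]) -> List[Row]:
    """Tag every input cell with its own location."""
    return [
        {field: (value, frozenset({(i, field)})) for field, value in row.items()}
        for i, row in enumerate(table)
    ]


def sum_where(rows: List[Row], test: str,
              pred: Callable[[int], bool], target: str) -> Cell:
    """Sum `target` over rows whose `test` field satisfies `pred`,
    collecting the input locations the result may depend on."""
    total = 0
    deps: FrozenSet[Loc] = frozenset()
    for row in rows:
        test_value, test_deps = row[test]
        # Every tested cell is recorded: changing one could add a row
        # to the sum or drop a row from it.
        deps |= test_deps
        if pred(test_value):
            value, value_deps = row[target]
            total += value
            deps |= value_deps
    return total, deps


table = [
    {"dept": 1, "salary": 100},
    {"dept": 2, "salary": 250},
    {"dept": 1, "salary": 300},
]
total, deps = sum_where(annotate(table), "dept", lambda d: d == 1, "salary")
```

Here the location `(1, "salary")` is absent from `deps`, and changing that cell cannot change `total`, which is the kind of guarantee a dependency-style correctness property asks for.  The tracking is only an over-approximation: with a predicate that always returns true, every `dept` cell would still be recorded even though the sum would not depend on any of them.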

Joint work with Amal Ahmed and Umut Acar

  
 
Host: Bob Harper
Appointments:  April Foster
<aprilf@cs.cmu.edu>


Thursday, April 24, 2008
3:30 - 5:00 p.m.
Wean Hall 7220

Principles of Programming Seminars