When we describe the structure of a relational database, we refer to the schema -- the set of tables, fields and the relationships between them that allow us to organize our data. The schema might take the form of pictures, descriptions, or code to generate the database skeleton. This document provides a little of each to help you design code, queries, and new datastructures around the DurandLab databases.
Although we've broken the structure into several schemas for many data sources, each can access the tables of the others and should be viewed as a single interconnected database.
Note: Since we're constantly trying to find ways to bring in new biological data, these schemas will be changing very often, and new ones will be added! The best way to explore the DBs once you have the general idea is to fire up a client like mysqlcc and explore.
CDART
CDART is a tool by NCBI extracting protein domain information about sequences contained in NCBI's RefSeq database, by detecting domain models (PSSMs) within protein sequences (PIGs). We use this data to generate the domain architecture information in DurandLab.
This table holds information about each of the position-specific scoring matrices used as models of domains in the detection process. PSSMs were imported (by NCBI) from several sources, including SMART, PFAM, and COG. We subsequently removed all COG data, as COG predictions don't match our protein model and were generating unnecessary ambiguity in domain architectures.
A unique pssmID is assigned to each pssm, along with an internal accession and shortname for descriptive purposes. family is currently unused. clusterID defines membership in a a cluster as described with the clusters table.
This table provides a way of creating directed graphs of pssms, where edges indicate some relationship between the pssms involved. relationID provides a category of relationship, which can be corresponded to a descriptive string on the pssm_relationship_type table. Currently, the only relationships recorded are those of frequent co-occurrence and overlap in proteins, as described by Lewis Geer's CDART paper.
Because CDART's pssm repository is redundant -- it contains multiple models from different sources for a single well-studied domain -- it's necessary for us to simplify our protein architectures by treating closely related pssms as equivalent. We formed these equivalence groups out (clusters)of the connected components in the relationship gap defined in pssm_relationships. Each group is given a unique clusterID. Additional fields are essentially statistics we used to calculate the usefulness of this group definition. k is the number of pssms in the cluster, cliqueness is a measure 0-1 of how well connected a cluster is (meaningful equivalence groups would presumably have a high cliqueness). When we calculate domain architectures, we do so in terms of clusters rather than individual pssms.
Clusters turned out to be tricky entities because poorly connected clusters can connect pssms with little in common. Biconnected Components are structures in a graph which are more rigorously connected than connected components. Any member must be removable without breaking the component. Since individual nodes can belong to multiple bccs, we instead label edges with a bccID (since they can belong to only one).
Entries in NCBI's protein sequence databases are referred to by gi, a unique identifier given to each submitted sequence. Since sequences are frequently redundant, gi2nr is used to associate each gi with a protein identity group (pig) of completely identical sequences. These sequences are then used for the sequence analysis found in hits.
Additional fields are descriptive of the gi. taxID corresponds to GenBank's taxonomic id tree, indicating the organism in which the sequence was found. defline is unused. symbol has been filled in when possible with symbols from nrmus_unmarked. Nucleotide accessions -- the NCBI accession of the nucleotide sequence which generated the protein -- have been filled in the same way.
This table provides an entry for each instance in which a pssm was matched with a certain cutoff score to any pig. Note that there are no unique identifiers! A pig may match multiple pssms, a pssm may hit multiple pigs, and it may hit the same pig many times!
This table contains most of the information associated with a SwissProt entry. Name and accession both form unique keys presently. Name is unique within a version of SwissProt, accessions are conserved between version so that even if entries are combined or split, they can be found in the accession_list of a future entry.
Examine the SwissProt manual for details on the meaning of each field. Notable the sequence itself has been replaced with a seqID field pointing to a generic sequence entry in the DurandLab.prot_seq table. Also, all database cross-references have been extracted to the dbXref table facilitate querying.
Each SwissProt entry originally contained multiple db_Xref tags, which were extracted to this table. The meaning of primaryID and secondaryID differ depending on the referred database (db_name). Note that there are no real unique keys -- there may be multiple entries sharing name, accession, and db_name.
Like the corresponding tables for pigs, these provide a way of remembering a set of SwissProt entries for batch processing. Sets are described in the entry_sets table with an (automatically generated) setID and a short name and long description for documentation. Actual membership in a set is provided by the entry_set_membership table, which enforces unique entries (an entry may be part of a set only once). A variety of tools in /afs/cs/user/bobsedge/bin/ convert these sets into sets in other dbs.
gi2accession is generated from NCBI's LiveLists flatfiles. Each entry relates a gi and version to a given accession. This is useful for doing database crossreferences and lookups from journals, and also in descerning the type of sequence -- protein or nucleotide.
This database is derived from NCBI's "names.txt" flatfile, which is part of their taxonomic tree structure. SwissProt and CDART both define species origin in terms of GB_taxIDs, which are looked up in this table. None of the hierarchy information is maintained -- you can't determine where in the taxa a taxID falls. But you can correspond it to a name, making it useful to find which organism a given sequence came from. name_txt is the most commonly used field, but name_class will inform you if it's a scientific name or some sort of container node. "LK3 Transgenic mice", for example, is a frequently used taxID, and is somewhat more specific than the "mus musculus" entry.
This table is used to store architecture information generated by CDART data. You can see links back to CDART fields CD_pig and CD_clusterID. The way architectures are generated changes frequently, but this table is intended to store the results and some descriptive analysis. number_regions is the number of serparate domain-coding regions found on the strand. ambiguous is meant to indicate whether any of those regions could not be clearly classified as corresponding to a single clusterID of related pssms. CD_clusterID_list shows all the different clusters (domains) that had hits on this protein sequence. repeats indicates the repetition of one or more of the clusters on the strand. suspected_repeats is an outdated measure for when a region matched by a cluster seemed to be longer than any of the component pssms. arch_txt contains a long description of the decoded regions. A new field will be added soon containing a simple string representing the architecture.
prot_seq is a generic method of storing a protein sequence, to simplify the creation of programs that do batch analysis on protein sequences. Each sequence is given an automatically generated seqID which becomes its unique identified. GB_taxID_list is used to indicate the organisms containing this sequence. source, sourceID1, sourceID2, and description are used both to elucidate the protein and to provide a FASTA header when necessary for the protein sequence.
source indicates the originating database of the sequence, sourceID1 and sourceID2 differ in meaning according to the source. You can look up the meaning of each source code and ID in the source_dbs lookup table.
Once again, these sets allow groups of protein sequences to be remembered for batch processing. There is significant existing code for this purpose -- check out ProteinSequenceSet.py . Sets can be created from various sources, blasted against a database, turned into a blastable database, blasted against themselves.
The BLAST library can automatically save the results of large BLAST runs to this directory. This makes it easy to query out for Nan's NC calculations and other purpose.
Every blast_hit is associated with a blast_run, partially because certain parameters are used for the entire run and adjust the scoring scheme. These parameters, and the database that was used for the run, are stored in entries in this table. In the future, this will have to encompass slightly different parameter schemes since we're exploring different ways to do BLAST statistics.
Pairs of genes, keyed by ORF location, presumed to be resulting from a mass duplication in yeast, ccording to Kellis.
euGenes is a collection of gene data from eukaryotic organisms maintained at Indiana U. We downloaded a table from them that indicates probable homologies in yeast, according to experiments they've done. Not necessarily trustworthy!
Saccharomyces Genome Database is one of the best genomic resources for yeast data. We use a local table primarily to aggregate identifiers: this puts symbols, loci, ORF names etc on one entry per gene where appropriate.
Saeed is a friend of Dr. Durand's and researcher at MIT, who has kindly loaned us some data for examining large scale duplication histories in yeast. The genes table lists all genes used in his motif experiments. The motifs table lists a series of upstream motifs he believes are associated with regulation. The upstreams table connects the two, listing for each gene, how many of the top X hits for a given motif lie in its upstream region.
The exp_loci contains loci used as the gene identifiers in his correlation data. The correlation data itself is in correlations, which is listing gene expression correlations with a series of cell states from a microarray experiment.
Lee's regulation data indicates which of 46 known transcription factors seemed to have an impact on which of 6700 yeast genes.
Wolfe's yeast genome experiments are our gold standard for finding retained duplicates from the genomic duplication in yeast. genes lists all examined yeast genes, and gene_pairs links entries which are suspected ohnologs. family_codes is a reference table for codes in a field of genes which puts genes into categories according to function.
ensembl contains genomic data organized according to genomic location for a variety of organisms. It's very useful for placing a protein on a chromosome and analyzing location-based phenomonenon! Currently, we can retrieve peptide sequences for ensembl genes, and form sets of retrieved sequences based on a location on a chromosome in an ensembl organism. Look at /afs/cs/user/bobsedge/lib/ensembl/ for details.
Ensembl maintains their own public MySQL server, so we don't keep a local copy. Check the ensembl library, which accesses it, for details.
MGI is a database of mouse genes, focused on genomic rather than proteomic data. We use it mostly to reference gene symbols and loci used in papers and references.
InterPro is a widely used repository of domain models. Our local copy is essentially just a lookup table from names to descriptions, so that we can understand some of the entries in SwissProt's dbXref table.
These DBs are for Nan and Narayanan to store ther own data. Do not disturb!
This table assists a cancer project Dr. Durand is working with Marc and others on. It's maintained by Marc.