Line Weavings for the Astral 40 Set --- Fall 2005


This link points to a tree of webpages organized by the second and third characters of each protein's PDB ID, a common convention for PDB-derived archives.

At the bottom of the hierarchy, each protein chain is represented by two files: a crossing file and a backbone overlay file.
For instance, chain A of protein 5rub corresponds to the crossing file 5rub_A.lad and the backbone overlay file 5rub_A_sse.pdb.
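
As a rough illustration of the naming scheme, the following Python sketch builds the two file names for a chain and extracts the two characters used to organize the tree. The function names are hypothetical, and how those two characters map onto actual directory names is not specified here, so the sketch returns only the key.

```python
def weaving_file_names(pdb_id, chain):
    """Return (crossing file, backbone overlay file) names for one chain."""
    stem = f"{pdb_id}_{chain}"
    return f"{stem}.lad", f"{stem}_sse.pdb"


def tree_key(pdb_id):
    """Second and third characters of the PDB ID, used to organize the tree.

    How this key is turned into directory names is an assumption left open.
    """
    return pdb_id[1:3]
```

For chain A of 5rub this gives the names 5rub_A.lad and 5rub_A_sse.pdb, with the key "ru".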


Please cite, and read for more details:
M. A. Erdmann. Protein Similarity from Knot Theory: Geometric Convolution and Line Weavings.
Journal of Computational Biology, Vol. 12, No. 6, 2005, 609-637.


See also Structural Comparisons for a basis set of proteins.

Crossing Files

A crossing file represents the line weaving associated with the secondary structure elements of a protein chain. For each line, the crossing file lists the residues involved, the centroid of the associated alpha carbons, the (oriented) tangent to the line, and the start and end points formed by projecting the secondary structure's first and last alpha carbons onto the line.
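
The start and end points described above come from an orthogonal projection onto the line. The Python sketch below shows that projection under stated assumptions: the tangent is taken as given (how it is fitted is covered in the cited paper, not here), and the function names are hypothetical.

```python
import math


def centroid_of(points):
    """Centroid of a list of 3D points (e.g. alpha carbon coordinates)."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def project_onto_line(point, centroid, tangent):
    """Orthogonally project `point` onto the line through `centroid`
    with direction `tangent` (the tangent need not be unit length)."""
    norm = math.sqrt(sum(t * t for t in tangent))
    unit = [t / norm for t in tangent]
    # Signed distance of the point along the line, measured from the centroid.
    offset = sum((p - c) * u for p, c, u in zip(point, centroid, unit))
    return tuple(c + offset * u for c, u in zip(centroid, unit))


def line_endpoints(alpha_carbons, tangent):
    """Start and end points: projections of the first and last alpha carbons."""
    c = centroid_of(alpha_carbons)
    start = project_onto_line(alpha_carbons[0], c, tangent)
    end = project_onto_line(alpha_carbons[-1], c, tangent)
    return c, start, end
```

Because the tangent is oriented, the start point lies behind the end point along the tangent direction when the line follows the chain from first to last residue.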

The crossing file also shows the crossing matrix associated with the lines, the matrix of interline angles, the matrix of centroid separations, and the matrix of closest approach between secondary structures as determined by their constituent alpha and sidechain carbons. Each entry of the crossing matrix is "+", "-", or ".", specifying the sign of the corresponding interline angle. The entry "." appears for angles that lie below a threshold of 0.5 radians.
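
The mapping from an interline angle to a crossing matrix entry can be sketched as follows. This is only an illustration: it assumes the angle is available as a signed value in radians and that the threshold applies to its magnitude. How the sign itself is computed is described in the cited paper and is not reproduced here. The function names are hypothetical.

```python
ANGLE_THRESHOLD = 0.5  # radians, the threshold quoted in the text


def crossing_symbol(signed_angle, threshold=ANGLE_THRESHOLD):
    """Map a signed interline angle to "+", "-", or ".".

    Assumption: the threshold is compared against the angle's magnitude.
    """
    if abs(signed_angle) < threshold:
        return "."
    return "+" if signed_angle > 0 else "-"


def crossing_row(signed_angles):
    """Convert one row of signed angles into a row of crossing symbols."""
    return [crossing_symbol(a) for a in signed_angles]
```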


Backbone Overlay Files

A backbone overlay file is a .pdb file that contains the backbone atoms of the protein chain and some "fake" atoms to represent the line approximations.

Each secondary structure line appears in the file with an alpha carbon at its centroid, a beryllium atom at its start, and a sulphur atom at its end. A chain of hydrogen atoms connects these three points, giving the appearance of a line. The residue number associated with all these atoms is the index that appears for the secondary structure in the crossing file (in the Type column).

The protein chain appears with its original chain letter (or the letter A if the protein did not originally have a chain letter). The lines appear with chain letter Z (unless that conflicts with the protein chain letter, in which case the lines have chain letter A).
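
The chain letter rule can be written as a small Python function. This is a sketch of the rule as stated above, with a hypothetical function name. It treats a blank chain field as "no chain letter".

```python
def overlay_chain_letters(original_chain):
    """Return (protein chain letter, line chain letter) for an overlay file."""
    # A blank or missing chain letter becomes "A".
    protein = (original_chain or "").strip() or "A"
    # Lines use "Z" unless the protein chain already uses it.
    lines = "A" if protein == "Z" else "Z"
    return protein, lines
```

A viewer script that selects chain Z to color the lines would therefore need adjusting for proteins whose own chain letter is Z.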

The overlay files should display fine in either RASMOL or Protein Explorer.
The following two RASMOL scripts will display the file using structural colors for the protein chain and green for the lines (assuming the lines have chain letter Z):
   Ribbon Script
   Backbone Script

Sources

This tree of webpages represents 5271 protein chains. With minor modifications to account for missing structures, these are the chains encompassing the genetic domains that Astral lists as having less than 40% sequence identity. We used the version (1.67) present at the Astral website on 19-July-2005.

For coordinates, we used the protein files appearing on the PDB DVD set entitled "Release #1, 2004 Edition", as received 19-July-2005. For protein chains with multiple models we simply used the first model present in the PDB file. Similarly, for atoms with alternate locations we used the first location.
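
One way to apply the "first model, first location" rule to the lines of a PDB file is sketched below in Python. This is not the code used to build the data set. It assumes standard fixed-column ATOM/HETATM records and interprets "first location" as the first record encountered for a given chain, residue, and atom name. The function name is hypothetical.

```python
def first_model_first_altloc(pdb_lines):
    """Keep atom records from the first model only, one location per atom."""
    kept = []
    seen = set()
    for line in pdb_lines:
        record = line[:6].strip()
        if record == "ENDMDL":
            break  # everything after the first model is ignored
        if record not in ("ATOM", "HETATM"):
            continue
        # Columns (1-based): atom name 13-16, chain 22, residue number plus
        # insertion code 23-27. The alternate location flag (column 17) is
        # deliberately left out of the key so later locations are skipped.
        key = (line[21], line[22:27], line[12:16])
        if key in seen:
            continue
        seen.add(key)
        kept.append(line)
    return kept
```

Files without MODEL/ENDMDL records pass through unchanged apart from the alternate location filtering.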

The residue numbers appearing in the crossing and overlay files are not necessarily the original PDB residue numbers. Instead, we ran DSSP over the PDB files, and used the internal DSSP residue numbers, reshifted to start at 1 for each protein chain. Doing so avoided issues with negative residue numbers and insertion codes.
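
The renumbering can be pictured as a uniform shift per chain. The Python sketch below assumes the input is a list of (chain letter, DSSP internal number) pairs in file order and subtracts each chain's first number so that the chain starts at 1. Whether gaps within a chain received any further treatment is not stated here, so the sketch leaves them as they are. The function name is hypothetical.

```python
def shift_to_one_per_chain(residues):
    """Shift DSSP internal residue numbers so each chain starts at 1.

    `residues` is a sequence of (chain, dssp_number) pairs in file order.
    """
    first_number = {}
    shifted = []
    for chain, number in residues:
        # Remember the first number seen for each chain.
        first_number.setdefault(chain, number)
        shifted.append((chain, number - first_number[chain] + 1))
    return shifted
```

For example, a second chain whose DSSP numbers begin at 4 would be renumbered to begin at 1.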

We also determined secondary structure using DSSP. We augmented the DSSP information slightly (automatically, not by hand), by adding some turns to helices and breaking some strands at non-bridge locations. We also broke secondary structures that bent severely, in order to reduce the likelihood of poor line approximations. We ignored strands consisting of a single residue and helices consisting of fewer than three residues. Some protein chains have no associated secondary structure and thus no associated line weaving.
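
The length filter in the paragraph above amounts to a simple predicate. The Python sketch below covers only that filter, not the augmentation or bend-breaking steps. The labels "strand" and "helix" and the function names are hypothetical.

```python
def keep_element(kind, length):
    """Return True if a secondary structure element is long enough to keep."""
    if kind == "strand":
        return length > 1   # single-residue strands are ignored
    if kind == "helix":
        return length >= 3  # helices with fewer than three residues are ignored
    raise ValueError(f"unknown secondary structure kind: {kind!r}")


def filter_elements(elements):
    """Keep only (kind, length) pairs that pass the length filter."""
    return [(kind, length) for kind, length in elements if keep_element(kind, length)]
```

A chain for which this filter leaves nothing has no associated line weaving.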

THANKS: We are very grateful to all those individuals and institutions who have created the sources listed above, and thank them for making the sources publicly available.


A word of caution: the line approximations derived for secondary structure elements consisting of only a few residues can be overly sensitive to the coordinates of their constituent alpha carbons. For instance, adding or removing even a single residue in a short helix, say one with fewer than 7 residues, may shift the orientation of the line dramatically. (This is not surprising since 4 residues are required to define a single "center" of a helical axis, but it is worth remembering.)


Acknowledgment

This research was supported in part by Carnegie Mellon University and the Pennsylvania Department of Health through the grant "Integrated Protein Informatics for Cancer Research". We are very grateful for this support.

Any opinions, findings, and conclusions or recommendations expressed in this research are those of the author(s) and do not necessarily reflect the views of Carnegie Mellon University, the Pennsylvania Department of Health, or any other private or governmental agency.


Disclaimer

The data contained in this set of webpages and the files to which they point are approximations, possibly faulty, and are provided "as is", without express or implied warranty of any kind.


Copyright (C) 2005 by Michael A. Erdmann.

Permission is granted to any individual or institution to use, copy, and/or distribute this material, provided that the complete contents of this webpage, including but not limited to the disclaimer, copyright, and permission notice, are maintained, intact, in all copies and supporting documentation.


Modified 19-September-2005 by
Michael Erdmann