MySQL Databases
What are they?
We bring data from all sorts of public sources into the lab and organize them in MySQL databases for easy programatic access. Rather than creating 50 tables in a single database, we've split them into multiple databases maintained on the same server. The distinction is artificial: one query can move between databases without trouble. Bringing new data into the lab, designing a schema, and putting it online is one of the major tasks for the principle research programmer.
Information about the MySQL server itself (as opposed to the data) can be found here.
Details about the existing databases and their schemas can be found here.
Where do they live?
Currently, the server and all data reside on fugu.
/net/fugu/usr0/db/[db_name]/
-- Flat files go here when first coming into the lab
/net/fugu/usr0/tools/pybin/db/[db_name/
-- Scripts that parse and upload the data to the db go here.
/net/fugu/usr0/db/mysql_data/[db_name]/
-- The server keeps tables in RAM, but periodically stores them here.
Standard Operating Procedures
-
Bringing In External Data
Obviously there are a ton of design decisions to be made where an external resource is being brought into the lab. Here's the general workplan I've followed.
- Download flatfiles for the resource to
/net/fugu/usr0/db/[db name]
. Figure out what's there and how it's organized.
- Figure out what subset of the data will actually be used by the lab. For a big cross-referencing resource like SwissProt, completion might be a good thing. For small datasets from other labs, there may be a ton of fields we don't really need.
- Design the schema using the same paradigms as used in the other databases. When possible, try to maintain the organization scheme under which you got the data as well.
- Use
mysqlcc
or similar to add your schema to the database. Don't forget to set permissions if you made a new database.
- Write scripts to parse out flat files and populate the database. Place them in
/net/fugu/usr0/tools/pybin/db/[db_name]
. If the data is in a standard format, it might be worth writing a library to parse that format and putting it in pylib
.
-
Managing List Fields
One of the biggest pains in MySQL is creating fields that track lists of items. You probably have your own methods about this, but here are two that have worked for me:
- If a reasonably small number of identifiers are going in the list, use a TEXT type field and a comma-delimmeted list. You can use MySQL's
FIND_IN_SET(member, set)
function to query out of these fairly quickly.
- If a large number of identifiers is going in the list, consider a seperate table that joins the primary key of your table with many secondary fields (like
DurandLab.prot_seq_set_membership
)
back home
bobsedge@andrew.cmu.edu