SQL Primer

Some Links:

Introduction

MySQL is an Open-Source relational database that has become very popular because it is fast, compatible, and well supported. It operates as a server (currently on Motif), listening to queries by client applications on the internet. Queries can be used to update information in the database or to retrieve it, even performing complex calculations and comparisons on the server. MySQL also implements a security policy giving different users permission to access and update records selectively.


Table Properties Display from mysqlcc client

The MySQL server can manage multiple, seperate databases, but in practice a user logs onto and accesses only a single one. Each database consists of one or more tables, which in turn contain records and fields.

MySQL automatically sorts the data entered into a table by indexes, combinations of fields that can be specified manually or automatically. This sort is invisible to the user, but allows specific subsets of the data to be called up rapidly at will.

To describe these subsets, MySQL implements Structured Query Language (SQL), a syntax for describing sets of data to retrieve or alter. SQL is the common language of most relational databases. SQL statements view, edit, create or delete records, or can even be used to modify the structure of the database or send special instructions to the server.

SELECT Queries

The most common query in SQL is the select query. The general form is: SELECT [fields] FROM [tables] WHERE [criteria]. (complete syntax). If no criteria is specified, the statement will operate on the whole table! This is important to remember later when altering data.

mysql> SELECT * FROM CD_pssm_relationship_type;
+------------+----------+
| relationID | type     |
+------------+----------+
|          1 | same     |
|          2 | superset |
|          3 | subset   |
|          4 | similar  |
|          5 | co-occur |
+------------+----------+
5 rows in set (0.01 sec)

mysql> SELECT TYPE FROM CD_pssm_relationship_type WHERE relationID < 4 AND relationID > 1;
+----------+
| type     |
+----------+
| superset |
| subset   |
+----------+
2 rows in set (0.00 sec)

The return order in the example above is a complete coincidence! By nature, SQL tables have no row order, not even the order you put entries in in. Instead, we can specify an order on our reutrn set:

mysql> SELECT * FROM CD_pssm_relationship_type ORDER BY type ASC;
+------------+----------+
| relationID | type     |
+------------+----------+
|          5 | co-occur |
|          1 | same     |
|          4 | similar  |
|          3 | subset   |
|          2 | superset |
+------------+----------+
5 rows in set (0.00 sec)

SQL also provides built-in functions that we can use for more complicated comparisons such as string searches or simple regular expressions:

mysql> SELECT * FROM CD_pssm WHERE INSTR(shortname, "tyr");
+--------+------------+--------------+--------+-----------+--------+
| pssmID | accession  | shortname    | family | clusterID | length |
+--------+------------+--------------+--------+-----------+--------+
|   4981 | pfam03222  | Trp_Tyr_perm |      0 |      1570 |    393 |
|   5392 | cd00192    | TyrKc        |      0 |        44 |    269 |
|   7538 | pfam00264  | tyrosinase   |      0 |      2252 |    325 |
|  14924 | smart00219 | TyrKc        |      0 |        44 |    255 |
+--------+------------+--------------+--------+-----------+--------+
4 rows in set (0.01 sec)

mysql> SELECT * FROM CD_pssm WHERE shortname LIKE "Tyr%";
+--------+------------+------------+--------+-----------+--------+
| pssmID | accession  | shortname  | family | clusterID | length |
+--------+------------+------------+--------+-----------+--------+
|   5392 | cd00192    | TyrKc      |      0 |        44 |    269 |
|   7538 | pfam00264  | tyrosinase |      0 |      2252 |    325 |
|  14924 | smart00219 | TyrKc      |      0 |        44 |    255 |
+--------+------------+------------+--------+-----------+--------+
3 rows in set (0.01 sec)

Tables can be used in isolation like this, but the point of a relational database is to draw on relationships between fields in different tables. Frequently, a field in one table will contain a number or name that corresponds to a specific row in another. We call this a foreign key. SQL can traverse these keys in a single statement, effectively merging 2 rows from different tables. In this example, we find all the sequence hits for the tyrosinase domain, without knowing it's pssmID. Note the use of aliases for the table names in order to unambiguously specify which field we'rue using:

mysql> SELECT h.*, p.shortname FROM CD_hits h, CD_pssm p WHERE p.pssmID = h.pssmID AND p.shortname = "tyrosinase";
+----------+-------+------+--------+-----------+----------+-------+----------+------------+
| pig      | start | stop | pssmID | pssmStart | pssmStop | eval  | bitscore | shortname  |
+----------+-------+------+--------+-----------+----------+-------+----------+------------+
| 11995023 |    79 |  399 |   7538 |         2 |      324 | -2236 |     3519 | tyrosinase |
| 11995028 |     2 |  271 |   7538 |        36 |      316 |  -559 |     1088 | tyrosinase |
| 12053765 |  1269 | 1537 |   7538 |        33 |      324 | -1181 |     2024 | tyrosinase |
| 12053765 |  2090 | 2381 |   7538 |        27 |      324 | -1114 |     1932 | tyrosinase |
| 12053765 |  1696 | 1957 |   7538 |        50 |      318 | -1039 |     1820 | tyrosinase |
| 12053765 |  2520 | 2788 |   7538 |        36 |      321 |  -986 |     1743 | tyrosinase |
| 12053765 |   847 | 1131 |   7538 |        31 |      318 |  -925 |     1658 | tyrosinase |
| 12053765 |   402 |  714 |   7538 |         4 |      323 |  -894 |     1612 | tyrosinase |
| 12053765 |  2928 | 3175 |   7538 |        36 |      307 |  -824 |     1512 | tyrosinase |
.....
+----------+-------+------+--------+-----------+----------+-------+----------+------------+
373 rows in set (8.30 sec)

Note that the example above took 8 seconds to complete, whereas previous examples were almost instantaneous! The reason is that behind the scenes, MySQL is performing a join operation: A new table is formed, consisting of each row of CD_hits merged with each row of CD_pssm. The where statement then filters out rows that don't match on pssmID. Sometimes this is ok, but sometimes we want more control of the join operation for speed or complex operations (like joining a table to itself!). Here is the above example with an explicit join:

mysql> SELECT h.*, p.shortname FROM CD_hits h JOIN CD_pssm p ON p.pssmID = h.pssmID WHERE p.shortname = "tyrosinase";
+----------+-------+------+--------+-----------+----------+-------+----------+------------+
| pig      | start | stop | pssmID | pssmStart | pssmStop | eval  | bitscore | shortname  |
+----------+-------+------+--------+-----------+----------+-------+----------+------------+
| 11995023 |    79 |  399 |   7538 |         2 |      324 | -2236 |     3519 | tyrosinase |
| 11995028 |     2 |  271 |   7538 |        36 |      316 |  -559 |     1088 | tyrosinase |
| 12053765 |  1269 | 1537 |   7538 |        33 |      324 | -1181 |     2024 | tyrosinase |
| 12053765 |  2090 | 2381 |   7538 |        27 |      324 | -1114 |     1932 | tyrosinase |
| 12053765 |  2928 | 3175 |   7538 |        36 |      307 |  -824 |     1512 | tyrosinase |
.....
+----------+-------+------+--------+-----------+----------+-------+----------+------------+
373 rows in set (8.12 sec)

In this case the speed gain is minimal because MySQL is already capable of optimizing simple "where" clauses into joins as above...more complex queries will benefit more. Note that you can join any number of tables using this syntax. We'll use this to find out what organisms tyrosinase is hitting on:

mysql> SELECT h.pig, p.shortname, t.name_txt FROM CD_hits h JOIN CD_pssm p ON p.pssmID = h.pssmID \
JOIN CD_gi2nr g ON h.pig = g.pig JOIN GB_names t ON g.taxID = t.taxID WHERE p.shortname = "tyrosinase";
+----------+------------+-----------------------------------------------------------------+
| pig      | shortname  | name_txt                                                        |
+----------+------------+-----------------------------------------------------------------+
|    37507 | tyrosinase | Homo sapiens                                                    |
|    37509 | tyrosinase | Homo sapiens                                                    |
|    37513 | tyrosinase | Homo sapiens                                                    |
|    55062 | tyrosinase | LK3 transgenic mice                                             |
|    66358 | tyrosinase | LK3 transgenic mice                                             |
|    66358 | tyrosinase | LK3 transgenic mice                                             |
|    80770 | tyrosinase | "Actinomyces glaucescens" Preobrazhenskaya in Gauze et al. 1957 |
.....
| 28864833 | tyrosinase | Sus scrofa                                                      |
+----------+------------+-----------------------------------------------------------------+
529 rows in set (6.46 sec)

The shortened time and number of rows is due to the fact that showing multiple hits per pig would be redundant, given our display set. Conveniently, we now see only one row per sequence. In many cases, we're less interested in the actual rows than we are in some attribute about them. Keywords like COUNT and DISTINCT can tell us features of the return set without returning the whole thing. In the examples below, we find out how many different sequences and species the domain occurs in:

mysql> SELECT DISTINCT t.name_txt FROM CD_hits h JOIN CD_pssm p ON p.pssmID = h.pssmID JOIN CD_gi2nr g ON h.pig = g.pig JOIN GB_names t ON g.taxID = t.taxID WHERE p.shortname = "tyrosinase";
+-----------------------------------------------------------------+
| name_txt                                                        |
+-----------------------------------------------------------------+
| Homo sapiens                                                    |
| LK3 transgenic mice                                             |
| "Actinomyces glaucescens" Preobrazhenskaya in Gauze et al. 1957 |
| Helix pomatia                                                   |
| North Pacific giant octopus                                     |
| Neurospora crassa                                               |
| "Actinomyces antibioticus" Waksman and Woodruff 1941            |
| Lycopersicon esculentum                                         |
....
+-----------------------------------------------------------------+
199 rows in set (0.02 sec)

mysql> SELECT COUNT(*) FROM CD_hits h JOIN CD_pssm p ON p.pssmID = h.pssmID JOIN CD_gi2nr g ON h.pig = g.pig JOIN GB_names t ON g.taxID = t.taxID WHERE p.shortname = "tyrosinase";
+----------+
| COUNT(*) |
+----------+
|      529 |
+----------+
1 row in set (0.05 sec)


back home
bobsedge@andrew.cmu.edu