Overview
Querying a short-read database for a transcript of interest is a fundamental problem in biology. Yet such queries are computationally intensive and scale linearly with the size of the data being searched. This leads to a computational bottleneck in which large databases of sequencing reads are compiled but never investigated systematically. To address this problem, we developed the Sequence Bloom Tree (SBT) data structure to facilitate searching short-read expression experiments for transcripts of interest. Rather than naively exploring every file in a database, the SBT prunes, with high probability, the files that do not contain the query, and thus scales linearly with the number of experiments containing the query rather than with the total size of the experiment set.
The SBT is built upon the Bloom filter implementation in the Jellyfish library, and its default settings are designed as a reasonable compromise between speed, storage cost, and accuracy. In the paper, we demonstrate that the SBT can search multi-terabyte databases substantially faster than any existing tool, with reasonable accuracy and negligible storage costs both on disk and in RAM.
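The pruning idea described above can be illustrated with a minimal Python sketch. It is not the SBT implementation: the names BloomFilter, Node, build_leaf, build_tree, query, and theta are hypothetical, SHA-256 stands in for the Jellyfish hashing, and the tree is built by simply pairing neighbouring nodes, which is a simplifying assumption. The sketch assumes only what the overview states: each experiment is summarized by a Bloom filter of its k-mers, each internal node holds the union of its children, and a subtree is skipped when too few query k-mers appear in its filter.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter over strings (illustrative only, not Jellyfish)."""

    def __init__(self, num_bits=4096, num_hashes=1):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit vector

    def _positions(self, item):
        # Derive one bit position per hash function from a seeded SHA-256.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def union(self, other):
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged


class Node:
    def __init__(self, bloom, name=None, children=()):
        self.bloom = bloom
        self.name = name  # set only on leaves (one leaf per experiment)
        self.children = list(children)


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_leaf(name, reads, k):
    bloom = BloomFilter()
    for read in reads:
        for kmer in kmers(read, k):
            bloom.add(kmer)
    return Node(bloom, name=name)


def build_tree(leaves):
    """Pair nodes level by level (assumes at least one leaf).

    Each parent stores the union of its two children's filters.
    """
    level = list(leaves)
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            if len(pair) == 1:
                parents.append(pair[0])
            else:
                parents.append(
                    Node(pair[0].bloom.union(pair[1].bloom), children=pair))
        level = parents
    return level[0]


def query(node, query_kmers, theta):
    """Return names of leaves whose filter holds >= theta of the query k-mers."""
    if not query_kmers:
        return []
    hits = sum(1 for kmer in query_kmers if kmer in node.bloom)
    if hits < theta * len(query_kmers):
        return []  # prune: nothing below this node can pass the threshold
    if not node.children:
        return [node.name]
    matches = []
    for child in node.children:
        matches.extend(query(child, query_kmers, theta))
    return matches


K = 5
experiments = {
    "exp_A": ["ACGTACGTTAGCCTAGGATC"],
    "exp_B": ["TTTTTGGGGGCCCCCAAAAA"],
    "exp_C": ["GATTACAGATTACAGATTACA"],
    "exp_D": ["CCACGTTAGCCTAGGTT"],
}
root = build_tree([build_leaf(n, r, K) for n, r in experiments.items()])
# The query occurs verbatim in exp_A and exp_D, so both are always reported
# (Bloom filters have no false negatives); other leaves could only appear
# through false positives.
matches = query(root, kmers("ACGTTAGCCTAGG", K), theta=0.8)
print(matches)
```

Because a parent's filter is the union of its children's filters, a query that fails the threshold at a node cannot pass it at any leaf beneath that node, which is what makes skipping the whole subtree safe in this sketch.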
Downloads
Download SBT Source on GitHub
Download SBT Linux Binary [beta v0.3.5]
Download SBT User Manual [beta v0.3.5]
The pre-release version of our latest tool, the Split Sequence Bloom Tree (SSBT), can be found below. Manuscript coming soon!
Download SSBT Source on GitHub
SBT Example Files
Download Example Compressed SBT Index. - All the necessary files to load and query a 2652 experiment compressed SBT. [176 GB Download, 200 GB unpacked].
Download Example SBT Leaves. - The compressed leaf bloom filters for 2652 SRR experiments. [50 GB Download, 63 GB unpacked].
Download Example SBT Uncompressed Leaves. - The uncompressed leaf bloom filters for 2652 SRR experiments. [68 GB Download, 618 GB unpacked].
Installation Instructions
To install using the binary:
Download the latest version of SBT
Decompress the tarball: tar xzf sbt-binary-0.3.5.tar.gz
To install from source:
Install gcc (Version 4.9.1 or later)
Install Jellyfish (Version 2.2.0 or later)
Install SDSL (SDSL-lite)
Download the latest version of SBT from GitHub
Compile:
cd bloomtree/src
make
Solomon, Brad and Carl Kingsford. Fast search of thousands of short-read sequencing experiments. Nature Biotechnology. 2016. doi: 10.1038/nbt.3442
A list of the experiments included in that paper is here: srr-list.txt