Sailfish Frequently Asked Questions
Q: I've read the kallisto paper, and it looks like kallisto is much more accurate (and faster) than the current Sailfish software; is this true?
A: Not necessarily. The benchmarks in the kallisto paper were run against an older version of Sailfish (from March 2014). While this was the latest version of Sailfish at the time kallisto was originally submitted, it has since been superseded and has been deprecated for some time. Based on the success we had in Salmon when replacing alignment with mapping, we decided to update Sailfish to use this improved mapping information instead of k-mer counting. As mentioned in the note above, this results in substantial improvements in both Sailfish's accuracy and its speed. For a comparison of quantification accuracy that considers reasonably recent versions of kallisto and Sailfish (among other tools), please refer to the RapMap pre-print (the most-recent pre-print is available here). You can also see a re-analysis of Sailfish's performance on some of the kallisto simulations (using a more recent version of Sailfish) in the QuantAnalysis repo.
Q: What k should I use?
A: Actually, it doesn't seem to matter that much, since Sailfish only uses k-mers to find initial exact matches (which are then extended using quasi-mapping). The current default is 31, and only odd values of k less than 32 are currently supported.
Q: How does Sailfish deal with paired-end data?
A: Sailfish first tries to find concordant mappings of paired-end reads (i.e. mappings of the read pair to the same transcript). If no concordant mappings can be found, then orphaned mappings are considered.
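A paired-end run might be invoked along the lines of the sketch below. The wrapper name quantify_paired and the file names in the usage note are hypothetical, and the library type string IU (inward-facing, unstranded) is an assumption that should be checked against the documentation for your Sailfish version and your library preparation.

```shell
# Hypothetical wrapper: quantify one paired-end library.
# Arguments: index directory, mate-1 file, mate-2 file, output directory.
# The "IU" library type (inward-facing, unstranded) is an assumption;
# adjust it to match how your library was prepared.
quantify_paired() {
    sailfish quant -i "$1" -l IU -1 "$2" -2 "$3" -o "$4"
}
```

For example: quantify_paired sfindex reads_1.fq reads_2.fq reads_quant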
Q: How can I read gzipped sequence files?
A: Reading from .gz files is possible in Sailfish. Rather than integrating a compressed reader into Sailfish itself, you can redirect the decompressed reads to a named pipe and feed them directly into Sailfish. This lets you quantify reads that are gzipped or compressed with any other method that can be decompressed in a streaming manner. For example, say your reads are in the gzipped file reads.fq.gz. Then you can use a command like the following:
sailfish quant -i sfindex -l U -r <(gunzip -c reads.fq.gz) -o reads_quant
The <() syntax is part of the bash shell, not Sailfish. It executes the command between the parentheses and creates a named pipe (aka a fifo) containing the output of this command (in this case, the decompressed reads). See http://en.wikipedia.org/wiki/Process_substitution for more information. Sailfish then reads this stream of decompressed reads as if it were a regular fastq formatted file. A few releases ago we added support to Sailfish for reading from streaming sources for just this purpose, since many people keep sequencing reads around in compressed form.
Q: What does "Percent reads mapped" mean?
A: The "mapping rate" is the percentage of all sequenced fragments that map either concordantly or as orphans (no distinction is made when computing the mapping rate). While some fragments may not map due to an excess of sequencing errors or unannotated material in the sample, we usually observe a mapping rate at least as high as that of traditional aligners. If you observe a low mapping rate, consider re-building the index with a smaller value of k (this is particularly true for short reads).
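Re-building the index with a smaller k could be scripted roughly as follows. This is only a sketch: the helper name rebuild_index, the file names, and the value 19 in the usage note are hypothetical, and the check simply mirrors the constraint stated earlier in this FAQ (odd k, less than 32).

```shell
# Hypothetical helper: rebuild a Sailfish index with a given k.
# Arguments: transcript fasta, output index directory, k.
rebuild_index() {
    k=$3
    # This FAQ states that only odd values of k below 32 are supported.
    if [ "$k" -lt 1 ] || [ "$k" -ge 32 ] || [ $((k % 2)) -eq 0 ]; then
        echo "k must be odd and less than 32 (got $k)" >&2
        return 1
    fi
    sailfish index -t "$1" -o "$2" -k "$k"
}
```

For example: rebuild_index transcripts.fa sfindex_k19 19, after which the quant step is re-run against the new index directory.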
Q: Which reference transcriptome was used for the experiments in the paper?
A: The reference transcriptome was derived from the RefSeq GTF file. The GTF file we used is here.
In order to fairly compare with Cufflinks, we extracted transcripts from the genomic sequence and the RefSeq GTF file using the gffread tool, and then retained only the ’NM’ labeled transcripts. The main motivation for doing this is that Cufflinks, unlike the other tools we tested, accepts only a GTF and genomic sequence to perform quantification of annotated transcripts (i.e. you cannot provide it directly with a fasta file representing your transcripts). Further, Cufflinks always seemed to quantify a slightly smaller number of transcripts than what was present in our original transcriptome fasta file. Thus, to put all of the methods on equal footing, we extracted the set of transcripts to quantify from the genome and GTF file directly using gffread, which shares code with Cufflinks and interprets the GTF in the same way. In this manner, we get a fasta file containing a set of transcripts that Cufflinks will quantify in full.
The GTF file with which we extracted the transcripts is necessarily older than the (frequently updated) annotation file currently on the UCSC FTP server. This results in fewer annotated genes, despite the fact that they are both ‘hg18’ annotations. We’ve been using the same GTF file since we started developing Sailfish, and this one, in particular, was provided to us by the RSEM authors in response to a request for data so that we could quantify the same set of genes used in the RSEM paper.
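The extraction described in this answer could be approximated as in the sketch below. It is not the exact procedure we ran: the function and file names are hypothetical, and the filter assumes that ’NM’ labeled transcripts are those whose fasta header begins with ">NM_".

```shell
# Extract every annotated transcript as fasta (requires gffread on PATH).
# Arguments: genome fasta, GTF annotation, output fasta.
extract_transcripts() {
    gffread -w "$3" -g "$1" "$2"
}

# Keep only records whose header starts with ">NM_" (an assumed convention).
# Argument: input fasta; filtered records are written to standard output.
keep_nm_transcripts() {
    awk '/^>/ { keep = ($0 ~ /^>NM_/) } keep' "$1"
}
```

For example: extract_transcripts genome.fa refseq.gtf all_transcripts.fa, followed by keep_nm_transcripts all_transcripts.fa > nm_transcripts.fa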