Hai-Son Le * Machine Learning Department * Carnegie Mellon University

Probabilistic error correction for RNA sequencing

Sequencing of RNAs (RNA-Seq) has revolutionized the field of transcriptomics, but the reads obtained often contain errors. Although the forthcoming generation of sequencing technologies will produce significantly longer reads, which should improve the quality of data analysis, these reads are expected to have a higher error rate. Read error correction can have a large impact on the ability to accurately assemble transcripts. This is especially true for de novo transcriptome analysis, where a reference genome is not available. Current read error correction methods, developed for DNA sequencing data, cannot handle the overlapping effects of non-uniform abundance, polymorphisms and alternative splicing.

We develop SEECER, a hidden Markov model (HMM) based method that is the first to successfully address these problems. To handle the large number of reads and transcripts, we developed efficient computational methods for fast learning of millions of HMMs. SEECER reduces the number of mismatch and indel sequencing errors and significantly improves the performance of downstream analyses, with or without an available reference sequence. Using human RNA-Seq data, we show that SEECER greatly improves upon previous methods in terms of the quality of read alignments to the genome and assembly reconstruction accuracy.

More information on SEECER
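
As a rough illustration of what read error correction means, the sketch below corrects a set of already-aligned, equal-length reads by column-wise majority vote. This is a deliberately simplified stand-in and not SEECER's HMM-based method: the function names `majority_consensus` and `correct_reads` are hypothetical, and the toy ignores exactly the complications named above (non-uniform abundance, polymorphisms, alternative splicing and indels).

```python
from collections import Counter
from typing import List


def majority_consensus(reads: List[str]) -> str:
    """Return the column-wise majority base of equal-length, pre-aligned reads."""
    if not reads:
        return ""
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("this toy example assumes reads of equal length")
    # zip(*reads) yields one tuple of bases per alignment column.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


def correct_reads(reads: List[str]) -> List[str]:
    """Replace every read with the consensus (mismatch correction only)."""
    consensus = majority_consensus(reads)
    return [consensus for _ in reads]


if __name__ == "__main__":
    reads = ["ACGTAC", "ACGTAC", "ACTTAC", "ACGTAC"]
    print(majority_consensus(reads))
```

A naive vote like this would also erase true polymorphisms and splice variants, which is one reason a probabilistic model of each read cluster is a more suitable tool for RNA-Seq data.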

Inferring Interaction Networks: Applications to microRNA target prediction

Determining interactions between entities, the overall organization of networks, and the clustering of their nodes is a major challenge when analyzing biological and social network data. We develop a model that integrates noisy interaction scores with properties of individual entities to infer interaction networks and cluster nodes within these networks. We focus on applications to the study of how microRNAs regulate mRNAs in cells.

• H.S. Le, Z. Bar-Joseph, Inferring Interaction Networks using the IBP applied to microRNA Target Prediction, Neural Information Processing Systems (NIPS), 2011, to appear.

Cross-Species Expression Analysis

Inferring orthologs and clustering genes

“Recent studies compare gene expression data across species to identify core and species specific genes in biological systems. To perform such comparisons researchers need to match genes across species. This is a challenging task since the correct matches (orthologs) are not known for most genes. … Here we develop a new method that can utilize soft matches (given as priors) to infer both, unique and similar expression patterns across species and a matching for the genes in both species. Our method uses a Dirichlet process mixture model which includes a latent data matching variable.”

• H.S. Le and Z. Bar-Joseph. Cross species expression analysis using a Dirichlet process mixture model with latent matchings. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1270-1278. 2010.

Large gene expression databases query

“Expression databases, including the Gene Expression Omnibus and ArrayExpress, have experienced significant growth over the past decade and now hold hundreds of thousands of arrays from multiple species. … However, while several methods exist for finding co-expressed genes in the same species as a query gene, looking at co-expression of homologs or arbitrary genes in other species is challenging. Thus, to carry out cross-species analysis using these databases, we need methods that can match experiments in one species with experiments in another species.”

• H.S. Le, Z. Oltvai, and Z. Bar-Joseph. Cross-species queries of large gene expression databases. Bioinformatics, 26(19):2416, 2010.

Undergraduate work

Cache-Oblivious Dynamic Programming for Bioinformatics

We present efficient cache-oblivious algorithms for some well-studied string problems in bioinformatics including the longest common subsequence, global pairwise sequence alignment and three-way sequence alignment (or median), both with affine gap costs, and RNA secondary structure prediction with simple pseudoknots.

• R. Chowdhury, H.S. Le, and V. Ramachandran. Cache-oblivious dynamic programming for bioinformatics. Computational Biology and Bioinformatics, IEEE/ACM Transactions on, 7(3):495-510, 2010.
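
For readers unfamiliar with the first problem in that list, here is a minimal sketch of the textbook dynamic program for the length of the longest common subsequence. It is the standard row-by-row recurrence, not the cache-oblivious algorithm from the paper; the function name `lcs_length` is chosen for this example only.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Uses the classic recurrence, keeping only the previous row:
      L[i][j] = L[i-1][j-1] + 1               if a[i-1] == b[j-1]
      L[i][j] = max(L[i-1][j], L[i][j-1])     otherwise
    """
    prev = [0] * (len(b) + 1)  # row for the empty prefix of a
    for ch_a in a:
        curr = [0]  # column 0: empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(lcs_length("AGGTAB", "GXTXAYB"))  # common subsequence "GTAB"
```

This version takes time proportional to the product of the two string lengths and scans the table one row at a time. A cache-oblivious formulation typically reorganizes the same recurrence recursively, so that subproblems fit in cache without the algorithm knowing the cache size.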

Pattern Identification in Biogeography

Identifying common patterns among area cladograms that arise in historical biogeography is an important tool for biogeographical inference. We develop the first rigorous formalization of these pattern-identification problems. We develop metrics to compare area cladograms. We define the maximum agreement area cladogram (MAAC) and we develop efficient algorithms for finding the MAAC of two area cladograms, while showing that it is NP-hard to find the MAAC of several binary area cladograms.

• G. Ganapathy, B. Goodson, R. Jansen, H.S. Le, V. Ramachandran, and T. Warnow. Pattern identification in biogeography. IEEE/ACM Transactions on Computational Biology and Bioinformatics, pages 334-346, 2006.

Hai-Son Le

PhD Student
