Large-scale graphical structure learning and its applications to textual data and genome micro-array data
- This is my dissertation work, supervised by my advisor, Prof. Yiming Yang. Learning the structure of large graphical models has become an area of intense research with the emergence of high-throughput data in many fields, including molecular biology, social science, natural language processing, and marketing data analysis. The estimated structures are valuable because they are semantically clear, easy for humans to understand, and easy to integrate with tools such as decision support systems. However, the problem becomes intractable for traditional learning algorithms when the number of nodes in the graph is large. My research addresses several of the main problems in this line of work. I proposed new algorithms that transform structure learning into solving a set of Lasso regressions, which are convex optimization problems and can be solved efficiently. The model has been used to learn genome regulatory networks and to improve text categorization with correlated topics.
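To make the reduction to Lasso regressions concrete, here is a minimal sketch of one generic way to do it: regress each variable on all the others with an L1 penalty and draw an edge wherever a coefficient is non-zero. This is an illustration under stated assumptions, not the algorithm from the dissertation. The function names (`lasso_cd`, `learn_structure`, `make_chain_data`), the penalty value, and the toy chain data are all hypothetical choices made for the example.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator used in the Lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Minimise (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent.

    X is a list of n rows with p floats each; y is a list of n floats.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    resid = list(y)  # residual y - Xw; equals y because w starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes column j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, lam) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                w[j] = new_w
    return w


def learn_structure(data, lam):
    """Run one Lasso regression per node and return the undirected edge set.

    An edge (j, k) is kept if either regression selects the other node
    (an "OR" rule; this is an assumption of the sketch).
    """
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in data]
    edges = set()
    for j in range(p):
        others = [k for k in range(p) if k != j]
        X = [[row[k] for k in others] for row in centered]
        y = [row[j] for row in centered]
        for k, coef in zip(others, lasso_cd(X, y, lam)):
            if abs(coef) > 1e-8:
                edges.add((min(j, k), max(j, k)))
    return edges


def make_chain_data(n, seed=0):
    """Toy data: a chain x0 -> x1 -> x2 plus an unrelated variable x3."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x0 = rng.gauss(0.0, 1.0)
        x1 = 0.6 * x0 + 0.8 * rng.gauss(0.0, 1.0)
        x2 = 0.6 * x1 + 0.8 * rng.gauss(0.0, 1.0)
        x3 = rng.gauss(0.0, 1.0)
        rows.append([x0, x1, x2, x3])
    return rows


data = make_chain_data(1000, seed=0)
edges = learn_structure(data, lam=0.3)
print(sorted(edges))
```

Each per-node regression is an independent convex problem, so the regressions can be solved separately and in parallel, which is what makes this style of approach attractive when the graph has many nodes. The penalty `lam` controls sparsity and would normally be tuned rather than fixed by hand.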
Non-linear feature selection and sparse regression
- I explored the duality between SVM regression and Lasso regression and extended Lasso regression to a non-linear version, referred to as the Feature Vector Machine (FVM). FVM reformulates the standard Lasso regression into a form isomorphic to SVM, and this form can be easily extended to feature selection with non-linear models by introducing kernels defined on feature vectors. FVM generates sparse solutions in the non-linear feature space and is much more tractable than traditional non-linear feature selection approaches such as feature scaling kernel machines. Based on FVM, the proposed structure-learning framework can also be generalized to capture non-linear correlations among nodes in the graph.
Large-scale text categorization
- Developed various large-scale text categorization systems and conducted experiments on several benchmark corpora (including RCV1, OHSUMED, and Reuters-21578). Investigated how to use the hierarchical structure of class labels to improve categorization performance. Our experimental results on the RCV1 dataset have become a popular baseline in the field of automated text categorization. Here is a link to some of my experimental results on the Reuters-21578 and RCV1 corpora. The vectors generated from the Reuters-21578 corpus can be downloaded from this link.
Gene selection from genome micro-array data
- Investigated different types of gene selection algorithms and developed a novel approach that has been applied successfully to extract disease-related genes from genome micro-array data. This approach has also been applied to improve text categorization performance by eliminating redundant or irrelevant text features.
Data mining from mass spectra
- Collaborated with faculty in the Biochemistry department at Vanderbilt University and developed novel protein identification and disease prediction systems based on mass spectra.