In Submission, WWW '18
We approach the problem of Active Search at the granularity of entities present in the document corpus, as opposed to defining explicit similarity kernels over the documents. We show that our approach is scalable and allows flexibility in supervising the entities. We also propose heuristics as exploratory criteria for the search task.
Oral Presentation at DMBIH, IEEE ICDM '17
Joint Work with UPMC Dept. of Critical Care Medicine
We propose a Machine Learning Pipeline that explicitly exploits the inherent Hierarchy of ICD-9 Codes, resulting in an interpretable model that has better predictive performance for Rarer Medical Conditions like Venous Thrombo-embolism.
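One simple way to expose the ICD-9 hierarchy to a model is to expand each code into its ancestor prefixes, so that a rare leaf code shares features with its parent categories. The sketch below is an illustration only, not the pipeline described above. The function names `icd9_ancestors` and `hierarchical_features` are hypothetical, and the code assumes numeric ICD-9 codes written with a decimal point.

```python
def icd9_ancestors(code):
    """Return the hierarchy path of a numeric ICD-9 code,
    from its 3-digit category down to the full code."""
    category, _, detail = code.strip().partition(".")
    path = [category]
    # Each additional digit after the decimal point is one level deeper.
    for i in range(1, len(detail) + 1):
        path.append(f"{category}.{detail[:i]}")
    return path


def hierarchical_features(codes):
    """Collect every code on a record together with all of its ancestors."""
    features = set()
    for code in codes:
        features.update(icd9_ancestors(code))
    return features


print(icd9_ancestors("453.40"))
print(sorted(hierarchical_features(["453.40", "453.41"])))
```

With this expansion, two records carrying different leaf codes under the same category still overlap on the shared ancestor features, which is one way a model can pool evidence for rarer conditions.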
Oral Presentation at Bloomberg Data 4 Good Exchange (D4GX) '17
Paper in Workshop on User Generated Noisy Text, ACL EMNLP '17
We mined data from over 4M Publicly Available Escort advertisements, and constructed a data-driven pipeline to perform Entity Resolution and isolate solicitors. We then utilized supervised machine learning to classify these resolved entities as being indicative of trafficking, with prior ground truth.
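As a rough illustration of Entity Resolution over advertisements, the sketch below groups ads that transitively share an identifier, using a union-find structure. It is a minimal sketch under assumptions, not the pipeline from the paper: the choice of linking signal (shared contact strings), the function name `resolve_entities`, and the sample data are all hypothetical.

```python
from collections import defaultdict


def resolve_entities(ads):
    """ads maps an ad id to a set of identifiers (for example phone numbers).
    Returns clusters of ad ids that are linked by shared identifiers."""
    parent = {ad: ad for ad in ads}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # identifier -> first ad that used it
    for ad, identifiers in ads.items():
        for ident in identifiers:
            if ident in first_seen:
                union(first_seen[ident], ad)
            else:
                first_seen[ident] = ad

    clusters = defaultdict(set)
    for ad in ads:
        clusters[find(ad)].add(ad)
    return list(clusters.values())


# Hypothetical sample data.
sample_ads = {
    "ad1": {"555-0100"},
    "ad2": {"555-0100", "555-0101"},
    "ad3": {"555-0101"},
    "ad4": {"555-0199"},
}
print(sorted(sorted(c) for c in resolve_entities(sample_ads)))
```

Each resulting cluster can then be treated as one resolved entity and passed to a downstream classifier.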
Entry for Citadel Datathon Finals '17
We performed Associative Analysis and Causal Inference to answer the following questions.
1) What relationships can we identify between district-level demographic variables and educational achievement gaps?
2) Are district-level achievement gaps correlated with state-level crime and employment variables?
3) What relationships can we identify between district-level achievement gaps and college-level variables such as enrollment ratios between different ethnicities and debt taken on by students of different ethnicities?
First Place Cash Award of $20k at Citadel Datathon '17
We analyzed Gene Expressions in Multiple Tissues from Populations of Healthy and Cancerous individuals using the GTEx and TCGA datasets, and exploited unsupervised techniques like stochastic neighbor embedding to featurize the gene expressions in a non-parametric setting. We also proposed rank-order-based features, and finally built classification models over the expression levels to classify their source. Our team was the winning entry and received a Cash Prize of $20k.
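Rank-order features replace each expression value with its rank within the sample, which makes the representation invariant to any monotone rescaling of the measurements. The sketch below is one plausible reading of that idea, not the team's actual code. The function name `rank_features` is hypothetical, and tied values are assumed to receive their average rank.

```python
def rank_features(values):
    """Map each value to its 0-based rank within the sample.
    Tied values share the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


# Hypothetical expression levels for three genes in two samples;
# the second sample is the first one measured at twice the scale.
print(rank_features([5.0, 0.1, 2.3]))
print(rank_features([10.0, 0.2, 4.6]))
```

Both samples yield the same rank vector, so a classifier trained on these features does not depend on the absolute scale of the expression levels.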
Deliverable for DARPA MEMEX and the FBI
Competitively Outperformed Teams from Leading Academic and Industrial Partners
For DARPA's MEMEX Quarterly Program Review, we presented personaLink, a Python tool to extract features from, and train supervised classifiers for, Pairwise classification of users across Internet Forums. personaLink was applied on Credit Card hacking forums with limited ground truth, and outperformed top MEMEX participants.
Working Paper for NAACL-HLT '18 [pdf]
CS-LTI-747 Final Project [pdf]
We investigate Deep Multi-modal Techniques to Model the Course Time Series of Electronic Health Records. Here, the Modalities are the ICD Codes in subsequent admissions, along with the Clinical Notes. We empirically evaluate the performance of a model trained jointly on the modalities, using the popular MIMIC-III Dataset.
CS-MLD-701 Final Project
Footprints form a crucial piece of evidence in crime scenes, which forensic experts and law enforcement agencies depend on to build incriminating evidence against a suspect, especially in cases of recidivism. In this project we describe a system to perform:
1) Footprint Detection, to detect if the print belongs to the right or left foot, and
2) Footprint Matching, which, given prior footprint data, aims to match newer footprints. We further extend this work by performing some analysis, exploration and pattern discovery using Machine Learning Techniques.
CS-LTI-711 Assignments & Projects
Implemented Classical Large Scale NLP systems, including Language Modeling, Parsing and Statistical Machine Translation, as part of the Intro-NLP Course
1) Experimental Exploration of Trigram Kneser-Ney Language Models [pdf]
2) Parsing with Unlexicalised Probabilistic Context Free Grammar [pdf]
3) Reranking Probabilistic Parses with Supervised Learning [pdf]
4) Comparative Study of Word Alignment Models for Machine Translation [pdf]
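To give a flavor of the first item, the sketch below implements interpolated Kneser-Ney smoothing for a bigram model. This is a deliberate simplification: the assignment concerned trigram models, and the function name `train_bigram_kn`, the fixed discount of 0.75, and the toy corpus are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_kn(tokens, discount=0.75):
    """Return prob(word, context) under interpolated Kneser-Ney (bigram case)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_count = Counter()       # how often each word occurs as a context
    followers = defaultdict(set)    # distinct words seen after a context
    preceders = defaultdict(set)    # distinct contexts seen before a word
    for (context, word), count in bigrams.items():
        context_count[context] += count
        followers[context].add(word)
        preceders[word].add(context)
    bigram_types = len(bigrams)

    def prob(word, context):
        # Continuation probability: share of bigram types that end in `word`.
        p_cont = len(preceders.get(word, ())) / bigram_types
        total = context_count[context]
        if total == 0:
            return p_cont  # unseen context: fall back to the lower order
        discounted = max(bigrams[(context, word)] - discount, 0) / total
        backoff_weight = discount * len(followers[context]) / total
        return discounted + backoff_weight * p_cont

    return prob


corpus = "the cat sat on the mat the cat ate".split()
prob = train_bigram_kn(corpus)
vocab = set(corpus)
print(sum(prob(w, "the") for w in vocab))  # close to 1, up to float rounding
```

The mass removed from each seen bigram by the discount is redistributed through the continuation probability, which is why the conditional distribution still sums to one.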
We investigate the use of secondary loss functions to regularize the model objective, while at the same time constraining the model architecture to learn a more interpretable embedding for the Training data. Empirical results with shallow networks on MNIST show that the model learns a saliency-map-like layer, enhancing performance as compared to standard models with a similar number of parameters. The applicability to more challenging tasks with deeper architectures is currently being investigated.
An attempt to exploit social media using Natural Language Processing, to exhibit a user's Cognitive State through a Physical Interface. As such, Sentiband is a humble attempt to demonstrate how modern technology bridges multiple paradigms of Computer Science, Human Computer Interaction and Cognitive Science.
Sentiband exploits Twitter activity to perform Sentiment Analysis, and emits Light corresponding to the perceived sentiment.
A Prototype motivated by the Internet of Things paradigm, SatLight is a handheld device that connects to the Internet over an Ethernet Connection, using an open source Linux Board, and polls the NORAD Database to track the ISS and other satellites using their corresponding Two-line Element Set (TLE) values.
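For readers unfamiliar with TLE values, the sketch below parses the second line of a Two-line Element Set using its fixed column layout, verifies the modulo-10 checksum, and derives the orbital period from the mean motion. It is not the SatLight firmware. The function names are hypothetical, and the sample line is an often-quoted historical element set for catalog number 25544, used here only as example input.

```python
def tle_checksum(line):
    """Modulo-10 checksum over the first 68 columns:
    digits count at face value, each minus sign counts as 1."""
    total = 0
    for ch in line[:68]:
        if ch.isdigit():
            total += int(ch)
        elif ch == "-":
            total += 1
    return total % 10


def parse_tle_line2(line):
    """Extract the orbital elements from TLE line 2 (fixed-width columns)."""
    if len(line) < 69 or line[0] != "2":
        raise ValueError("not a TLE line 2")
    if tle_checksum(line) != int(line[68]):
        raise ValueError("checksum mismatch")
    mean_motion = float(line[52:63])  # revolutions per day
    return {
        "catalog_number": int(line[2:7]),
        "inclination_deg": float(line[8:16]),
        "raan_deg": float(line[17:25]),
        # Eccentricity is stored with an implied leading decimal point.
        "eccentricity": float("0." + line[26:33]),
        "arg_perigee_deg": float(line[34:42]),
        "mean_anomaly_deg": float(line[43:51]),
        "mean_motion_rev_per_day": mean_motion,
        "period_minutes": 1440.0 / mean_motion,
    }


SAMPLE_LINE2 = "2 25544  51.6416 247.4627 0006703 130.5360 325.0288 15.72125391563537"
elements = parse_tle_line2(SAMPLE_LINE2)
print(elements["inclination_deg"], round(elements["period_minutes"], 1))
```

Turning these elements into a position at a given time additionally requires an orbit propagator such as SGP4, which is outside the scope of this sketch.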
Featured on HACKADAY.com [link]
Featured on DangerousPrototypes.com [link]
Featured on The Official Texas Instruments Blog [link]
We exploit TI's Low Power MSP-430 Microcontroller along with ZigBee transceivers to create a Wireless Mesh based Environmental Sensing module.