CS Independent Study - Spring 2020 (Pittsburgh)

Take a look at the CS independent study projects completed by our undergraduates this semester. Click on the title link to view a PDF of the associated poster (if available). If you have questions about a student's work, feel free to email them using the indicated Andrew ID (andrewID@andrew.cmu.edu).

Ishan Bhargava
ibhargav
Advisor(s): Michael Coblenz and Jonathan Aldrich
Title: Obsidian Language Server
Abstract: Traditionally, programmers had to write hundreds of lines of code, compile it, and then go back and fix the inevitable compilation errors. This hampers developer productivity, as the programmer may end up working on disjoint sections of the code before running the compiler. This project aims to improve the Obsidian developer experience through editor integration.

Anand Bollu
abollu
Advisor(s): Leila Wehbe
Title: Using Control Tasks to Study the Effectiveness of Linguistic and Cognitive Probing Models
Abstract: Probes are models that aim to reveal the extent to which a feature representation (like a sentence encoding) captures particular information of interest (like tense or part of speech). They can vary in complexity, ranging from simple linear models to highly non-linear deep neural networks. Given a lossless encoding, a sufficiently expressive probe with enough training data would be able to achieve high accuracy on almost any task. This makes it difficult to be sure that the original feature representation actually contains our information of interest, as a complex enough probe might have been able to learn the probing task itself. Recently, the introduction of control tasks (tasks that associate every word with a randomly determined label) has provided a way for us to measure a probe’s selectivity, the difference between its probing task accuracy and control task accuracy. An ideal probe is highly selective: it should perform well on probing tasks and not so well on control tasks. But the downside of existing control task measures is that there is no way to assess what a “good” selectivity threshold is. We propose a slightly different construction of control tasks motivated by permutation tests, whereby we hope to quantify selectivity by measuring the statistical significance of a given probe’s performance. We also study the selectivity of recently proposed probing tasks designed to predict key information shared between neural network representations and brain recordings of people reading text.
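
The permutation-test idea in the abstract above can be pictured with a minimal sketch. Everything here is an assumption for illustration, not the project's method: the probe is a toy nearest-centroid classifier on one-dimensional features, the data is synthetic, the control task simply shuffles the labels, and all function names are hypothetical.

```python
import random


def train_centroids(features, labels):
    """Toy probe: store the mean feature value of each label."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def accuracy(centroids, features, labels):
    """Predict the label whose centroid is closest, then score the predictions."""
    correct = 0
    for x, y in zip(features, labels):
        predicted = min(centroids, key=lambda c: abs(x - centroids[c]))
        correct += predicted == y
    return correct / len(labels)


def probe_accuracy(features, labels, n_train):
    """Train on the first n_train examples and evaluate on the rest."""
    centroids = train_centroids(features[:n_train], labels[:n_train])
    return accuracy(centroids, features[n_train:], labels[n_train:])


def permutation_test(features, labels, n_train, n_permutations=200, seed=0):
    """Compare the probe's accuracy with its accuracy on shuffled (control) labels."""
    rng = random.Random(seed)
    observed = probe_accuracy(features, labels, n_train)
    control = []
    for _ in range(n_permutations):
        shuffled = list(labels)
        rng.shuffle(shuffled)  # control task: labels carry no information
        control.append(probe_accuracy(features, shuffled, n_train))
    # Standard +1 correction so the p-value is never exactly zero.
    at_least = sum(1 for a in control if a >= observed)
    p_value = (at_least + 1) / (n_permutations + 1)
    return observed, control, p_value


# Synthetic data (an assumption): the feature is informative about the label.
data_rng = random.Random(1)
labels = [i % 2 for i in range(200)]
features = [data_rng.gauss(3.0 * y, 1.0) for y in labels]

observed, control, p_value = permutation_test(features, labels, n_train=100)
selectivity = observed - sum(control) / len(control)
print(observed, selectivity, p_value)
```

In this sketch, selectivity is the gap between the real-task accuracy and the average control accuracy, and the p-value says how often a control run matches or beats the real task.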

Mckenna Brown
mckennab
Advisor(s): Maxine Eskenazi
Title: DialPort NLU
Abstract: This project focuses on the creation of a general and extensible Natural Language Understanding (NLU) module with a dialog pipeline architecture. This component takes a transcribed sentence and breaks it down into two sources of information: the general intent behind the sentence, and the key details extracted for “slot filling” of associated variables. As an example, for the sentence “I would like to book a ticket to Pittsburgh,” the intent could be set to “book a ticket” while slot filling could set the variable “destination_city” to Pittsburgh. This poster outlines the steps taken in intent classification and slot filling within the NLU, as well as results against benchmark datasets such as ATIS and MultiWOZ. The NLU uses various methods including Support Vector Machines (SVMs) and Long Short-Term Memory networks (LSTMs), and will be made open source for use by the dialog research community.
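
To make the intent/slot split concrete, here is a toy sketch of the output shape described above. It is a rule-based stand-in, not the SVM or LSTM models the project uses; the `NLUResult` type, the `parse` function, and the `book_ticket` intent label are all hypothetical names chosen for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class NLUResult:
    intent: str
    slots: dict = field(default_factory=dict)


def parse(sentence):
    """Toy NLU: keyword match for the intent, regex for one slot."""
    intent = "book_ticket" if "book a ticket" in sentence.lower() else "unknown"
    slots = {}
    # Assumption: the destination is a capitalized word right after "to".
    match = re.search(r"\bto ([A-Z][a-zA-Z]+)", sentence)
    if match:
        slots["destination_city"] = match.group(1)
    return NLUResult(intent, slots)


result = parse("I would like to book a ticket to Pittsburgh")
print(result.intent, result.slots)
```

A learned system would replace both rules with trained classifiers, but the downstream dialog pipeline would still receive an intent label plus a slot dictionary of this kind.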

Anirban Chowdhury
achowdh1
Advisor(s): Reid Simmons and Stephanie Rosenthal
Title: Learning the Differences between Data Analysts
Abstract: This project focuses on modeling the problem-solving strategies of professional and novice data scientists. Experimental subjects were presented with a data science/machine learning problem, and we obtained information about the methods they employed to solve it by scraping the code they wrote. We first identified the actions they used (e.g., feature engineering, feature selection, model training/evaluation) in order to model the process they used to tackle the problem. We used Markov Decision Processes to gain insight into the data we gathered and made strides in developing our own model, with the goal of identifying actions that new data scientists would be likely to take given previous action sequences. We hope this will lead to a recommender agent that aids data scientists when they are stuck on a problem.

Ashley Hong
ahong1
Advisor(s): Kristin Williams and Scott Hudson
Title: Designing a Framework for Tangible Interfaces in End-User Programming
Abstract: The body of research literature in End-User Programming has been growing over the past several decades. In our work, we explore and identify the critical dimensions of EUP research in order to highlight potential challenges and guide further endeavors in related fields. There are thousands of research artifacts related to End-User Programming, so we use unsupervised machine learning approaches like topic modeling to discover the abstract topics that best characterize this collection of documents. In this poster, we will outline this problem and present a framework of the design space.

Swarnim Kalbande
skalband
Advisor(s): Bhiksha Raj
Title: Integration of Complex Machine Learning Techniques into Real-World Systems
Abstract: As the title indicates, my goal was to understand the integration of these techniques, using as the real-world system Agot.ai, a CMU startup working on cashierless checkout at fast-food restaurants like Sushi Fuku. During the first half of the semester, I explored methods of serving size detection on Sushi Fuku data using depth images and set up a framework to do so, although our data pipeline got stuck with 8-bit data for the next few months. The second half of the semester was intended to focus on aiding food item detection by developing/modifying a pose tracking model to track hands and use that to know which ingredients went into which bowls. Due to the transition to remote work, I will instead do exploratory research on the same topic and try to find ways to work with limited data (Sushi Fuku has closed) or no data (trying to modify existing labeled COCO datasets to predict well on our data).

Simran Kaur
skaur
Advisor(s): Zachary Lipton
Title: Investigating the Sketchy Effects of Adversarial Training
Abstract: Targeted adversarial attacks consist of iteratively updating an image by gradient ascent to increase the score of a chosen class. While adversarial attacks against vanilla CNNs produce noisy instances of the input image, attacks against adversarially trained neural networks produce images that resemble the target class. In an effort to further investigate how similar phenomena are realized differently between standard and adversarially trained models, we consider datasets where some feature(s) do not vary across images (e.g., black and white sketches, images where the right half of each image is blank). We observe that while adversarial attacks against the vanilla CNN produce noisy images that fail to maintain the relevant dataset’s consistent feature(s), these attacks against the adversarially trained CNN generally maintain the feature(s). We hope to replicate these experiments on more datasets of this nature. Eventually, we hope to explain the link between standard vs. adversarial training and differences in how certain phenomena are realized.
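
The first sentence of the abstract above describes the attack loop, which can be sketched as follows. This is a minimal illustration under stated assumptions: a tiny linear softmax classifier stands in for the CNN so the gradient can be written by hand, the "image" is a two-element vector, and the function names, step size, and step count are hypothetical. A real attack would also constrain the perturbation and clip pixel values, which is omitted here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_of(weights, x):
    """Linear model: one row of weights per class."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def targeted_attack(weights, x, target, step=0.1, n_steps=50):
    """Gradient ascent on log p(target | x) with respect to the input x."""
    x = list(x)
    for _ in range(n_steps):
        probs = softmax(logits_of(weights, x))
        # For a linear softmax model:
        # d log p_target / dx_j = W[target][j] - sum_k p_k * W[k][j]
        grad = [
            weights[target][j] - sum(p * row[j] for p, row in zip(probs, weights))
            for j in range(len(x))
        ]
        x = [xi + step * g for xi, g in zip(x, grad)]
    return x


weights = [[1.0, 0.0], [0.0, 1.0]]  # two classes, two input features
image = [1.0, 0.0]                  # starts out favoring class 0
before = softmax(logits_of(weights, image))[1]
adversarial = targeted_attack(weights, image, target=1)
after = softmax(logits_of(weights, adversarial))[1]
print(before, after)
```

With a deep network the gradient would come from automatic differentiation rather than a closed form, but the update rule has the same shape.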

Sunjana Kulkarni
sunjanak
Advisor(s): Alan Black and Khyathi Chandu
Title: "My Way of Telling a Story": Persona based Grounded Story Generation
Abstract: This semester's work was a continuation of the pre-existing Persona based Grounded Story Generation project, which involved the construction of text generation models that generate a "story" from a sequence of images. The long-term goal of this project involves using data from movies and television shows from four different genres (fantasy, reality/sitcom, sci-fi, and comic book/superhero) to construct a discriminator. This discriminator will then influence a text generator to generate a story from the sequence of images in the speaking manner/personality of a fictional character from one of these genres. For this semester, our goal was to compile and analyze this dataset, which is made up of character dialogues from the scripts of movies and television shows from the four genres. The dataset contains between 10,000 and 40,000 dialogues for each genre. We aim to track vocabulary variance for each genre and character, calculate percentages of interaction between characters in each genre, filter out stop-words, and identify and remove clusters from the data that represent characters discussing a shared topic, so that the discriminator is influenced solely by the characters' speaking mannerisms rather than the subject matter of these dialogues. This analysis will inform our choice of model for training the discriminator, which we will be working on in the following semester.

Austin Leung
austinle
Advisor(s): Amir Zadeh and LP Morency
Title: Concept Embeddings
Abstract: In daily life, humans understand concepts such as owls and hooting through various mediums. These include images, audio, motion, and words. This poster outlines efforts made to compile datasets of these mediums to eventually train a multimodal machine learning model.

Anita Li
weihanl1
Advisor(s): Andre Platzer and Brandon Bohrer
Title: Discrete Games on Graphs modeled in Game Logic
Abstract: Game logic is a formal logic for proving safety and liveness properties of first-order games, which are a programming language with discrete computations and adversarial dynamics. This project focuses primarily on discrete games between a pursuer and an evader, which have real-life implications for modern robotics. Using Game Logic, this project strives to prove winning strategies for the Cops-and-Robbers game on certain classes of graphs, such as cycles, paths and directed acyclic paths.

Michelle Ling
mling2
Advisor(s): Cori Faklaris
Title: Towards Creating a Social Authentication System
Abstract: Research has shown that it can be difficult for workgroups to manage their shared online accounts with traditional authentication methods like passwords, as there is often a trade-off between privacy and ease of use. The goal of this project is to ideate and develop an authentication system suitable for these types of shared accounts. Employing the unique and dynamic social interactions between members of a workgroup, this system will target more accessible shared knowledge factors that will allow for a more convenient and secure method of authentication. This work aims to lay the foundation for a system that authenticates workgroups at CMU.

Ryan Liu
ryanliu
Advisor(s): Nihar Shah
Title: Preventing Quid-Pro-Quo in Peer Review
Abstract: In this project, our task was to design, implement, and test suitable algorithms to find reviewer-paper assignments in the setting of large-scale conferences. There are many desirable or necessary qualities in the assignment that we strove to achieve and maintain, such as avoiding assignments where a pair of reviewers review each other's papers, and maximizing the similarity between papers and reviewers. We settled on a randomized algorithm that upper bounds the probability that a bad matching happens. After settling on a plausible algorithm, we made significant steps to test its runtime and variations. We have also taken steps to show the correctness of the algorithm. Later on, we plan to publish a paper and offer the algorithm to a variety of conferences for use.
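
The "pair of reviewers review each other's papers" condition above can be checked on a candidate assignment with a short sketch. This only detects such pairs; it is not the randomized assignment algorithm the project developed. The data layout (dicts keyed by reviewer and paper) and the function name are assumptions for illustration.

```python
def find_reciprocal_pairs(assignment, authors):
    """Return sorted pairs (a, b) where a reviews a paper by b and b reviews a paper by a.

    assignment: dict mapping reviewer -> set of paper ids they review
    authors:    dict mapping paper id -> set of author names
    """
    reviews_author = set()  # (reviewer, author) pairs implied by the assignment
    for reviewer, papers in assignment.items():
        for paper in papers:
            for author in authors[paper]:
                if author != reviewer:
                    reviews_author.add((reviewer, author))
    reciprocal = {
        tuple(sorted(pair))
        for pair in reviews_author
        if (pair[1], pair[0]) in reviews_author
    }
    return sorted(reciprocal)


authors = {"p1": {"alice"}, "p2": {"bob"}, "p3": {"carol"}}
assignment = {"alice": {"p2"}, "bob": {"p1"}, "carol": {"p1"}}
print(find_reciprocal_pairs(assignment, authors))
```

In this toy example alice and bob review each other's papers, while carol reviewing alice's paper is one-directional and is not flagged.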

Shunzhe Ma
shunzhem
Advisor(s): Seth Goldstein
Title: ColabBook: An online collaborative notebook for remote research and learning
Abstract: These days, remote learning and collaboration are becoming more and more important. This project seeks to develop a modern version of the science notebook that supports modern input devices, handwritten notes, and computer-generated documents (e.g., PDFs, PowerPoint slides, Jupyter notebooks, etc.). It will support versioning so that the notebook remains a valid science document. Furthermore, more collaborative tools are being integrated to make the notebook into a team-centered system that supports remote collaboration as well as remote teaching.

Xiaoya (Michelle) Ma
xiaoyam
Advisor(s): LP Morency
Title: Music Synthesis and Generation with Unsupervised Learning
Abstract: My work can be mainly split into two parts: dataset and model. For the dataset, I’m creating a multi-instrument music dataset to better aid my research. The dataset contains drums, piano, violin and guitar, with 25 hours of audio for each of these instruments. For each instrument, I labeled the data with tempo, genre, and key. These labels together help provide a better direction for the model during the generation step. We’ve also improved the model from last semester by using a Graph Neural Network on top of Transformers to help the model find correlations between each pixel and every other pixel of the spectrograms. In this case the model still captures the distribution in both frequency and temporal domains.

Arvind Mahankali
amahanka
Advisor(s): David Woodruff
Title: Improved Algorithms and Hardness Results for ell-p Low Rank Approximation
Abstract: We study the problem of low-rank approximation with entrywise ell_p-norm error for 1 <= p < 2, which has applications in machine learning, computer vision and other domains. We obtain a bi-criteria guarantee for ell_p-column subset selection: given an (n x d) matrix A, our algorithms find a small subset U of the columns of A that achieves a small approximation factor relative to the error achieved by the optimal column subset of size k. We also empirically study a greedy column subset selection heuristic, which performs well in practice. Furthermore, we obtain improved relative error guarantees and hardness results for ell_p low rank approximation via column subset selection. Finally, we bypass those hardness results, obtaining a smaller approximation error through an algorithm that does not rely on column subset selection.

Kayleigh Migdol
kmigdol
Advisor(s): Pravesh Kothari
Title: Sub-Gaussian Mean Estimation of Heavy Tailed Distributions
Abstract: For subgaussian distributions, it is quite simple to estimate the mean accurately with realistic bounds. For distributions with heavier tails, however, this is not the case. Still, there are methods for which the distribution of the mean estimate is subgaussian even if the underlying distribution is not, so the same guarantees still hold. We investigate such methods. Specifically, we look at the sum-of-squares method, in which outliers are ignored, and at the "median"-based method of mean estimation.
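
One well-known estimator of the "median"-based kind is median-of-means, sketched below. Whether this is the exact variant studied in the project is an assumption; the function name, block count, and example data are hypothetical and chosen only to show why the approach resists heavy tails.

```python
import random
import statistics


def median_of_means(samples, n_blocks, seed=0):
    """Split the samples into equal-sized blocks, average each block, return the median."""
    if not 1 <= n_blocks <= len(samples):
        raise ValueError("n_blocks must be between 1 and the number of samples")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # random block membership
    block_size = len(shuffled) // n_blocks  # leftover samples are dropped
    block_means = [
        statistics.mean(shuffled[i * block_size:(i + 1) * block_size])
        for i in range(n_blocks)
    ]
    return statistics.median(block_means)


# 99 well-behaved samples plus one extreme outlier.
samples = [1.0] * 99 + [1e6]
plain_mean = statistics.mean(samples)
robust_mean = median_of_means(samples, n_blocks=10)
print(plain_mean, robust_mean)
```

The single outlier can corrupt only one of the ten block means, so the median of the block means is unaffected, whereas the plain sample mean is pulled far from 1.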

Nikitha Murikinati
rmurikin
Advisor(s): Graham Neubig
Title: Morphological Inflection in a Shared Space
Abstract: Cross-lingual transfer between typologically related languages has proven very successful for the task of morphological inflection. However, if the languages do not share the same script, current methods yield more modest improvements. We explore the use of transliteration between related languages as a data preprocessing method to alleviate this issue. We experimented with 5 language pairs (Hindi/Bengali, Sanskrit/Bengali, Telugu/Kannada, Arabic/Maltese, Hebrew/Maltese), and in most cases transliterating the transfer language data into the target language's script leads to accuracy improvements of up to 12 percentage points.

Brandon Pek
bpek
Advisor(s): Timothy Libert
Title: Backtracking Online Trackers: Third-Party Trackers in Health-Related Advertising
Abstract: We explore advertising and online tracking technologies in ethically controversial and personally sensitive areas such as healthcare, politics, and pornography, and analyze cookies across large datasets of such domains to uncover the large hidden players in these fields. We also analyze the policies that large corporations and members of the National Advertising Initiative adopt towards these fields, and identify violations of those policies in order to spread awareness and inspire policy-level enforcement changes.

Ishani Santurkar
ivs
Advisor(s): Jan Hoffman
Title: Sub-Synchronizing Shared Session Types in Nomos
Abstract: Nomos is a programming language for smart contracts. Nomos is based on session types, which capture the protocols of interactions in the type rather than the implementation code and guarantee protocol adherence statically rather than at runtime. To ensure that multiple users of a contract interact with it in mutual exclusion, session types in Nomos are shared, requiring that a user acquire a shared contract, interact with it linearly, and then release it back to its shared state. Key to the safety of this is the equi-synchronizing requirement, which requires that the contract be released at the same type that it was acquired at. However, it is desirable to allow the contract to be released at a subtype of the type at which it was acquired (sub-synchronizing). For example, subtyping can be used to distinguish different phases of an auction in its type, with the ended phase being a subtype of the running phase. In this project, I developed rules for the algorithmic subtyping of Nomos. I proved the soundness and completeness of these rules with respect to a formal definition of subtyping. I implemented the subtyping and sub-synchronizing algorithms and integrated them into the Nomos type system. Finally, I proved that the extended system is still type-safe. These additions increased the flexibility of the Nomos type system and expanded the range of programs that can be expressed.

Ruiqi Wang
ruiqiw1
Advisor(s): Haiyi Zhu
Title: Closing Gender Gap: analyzing SVM model outputs and building software that detects keywords
Abstract: Professor Haiyi Zhu is working on a project that uses an SVM model to generate gender-indicative words from Wikipedia. In the first part of this semester, I built a Chrome extension that detects and highlights keywords generated by that SVM model. Then, I did a literature review on bias categorization, and I am now working on using word embeddings to show that our keywords indeed indicate bias.

Steven Wu
stevenwu
Advisor(s): Red Whittaker
Title: Mitigating Risks for Autonomous Rover Navigation at the Lunar Pole
Abstract: In 2022, the MoonRanger team will launch a rover to the Moon to search for ice within craters at the lunar pole. During the journey to the Moon, the rover will face several risks that threaten its integrity prior to arrival, one of which is the damaging effect of space radiation on the rover’s onboard electronics. On the lunar surface, the rover will navigate autonomously between waypoints, utilizing its own path planning capabilities to avoid obstacles and optimize time and efficiency as it accomplishes mission objectives. The goals of this research project are to identify several radiation-hardening options available for the Nvidia Jetson TX2i on the MoonRanger rover and select the best choice in terms of compute power and information preservation. Additionally, the project explores a preliminary path planning implementation based on the dynamic window approach (DWA) algorithm included in the ROS navigation package. Our results indicate that the onboard error correcting code (ECC) in the TX2i achieves the most efficient tradeoff between radiation protection and development cost. The cost map-based DWA algorithm also demonstrates an effective algorithmic structure for identifying cheap navigation paths. Both conclusions offer legitimate solutions to significant radiation and navigation risks the MoonRanger rover will face during the span of its mission.

Menghen (Sam) Yong
myong
Advisor(s): Bogdan Vasilescu
Title: Measuring and comparing programmer behavior in open source projects with varying degree of visibility
Abstract: From social interactions, we observe that people behave better in public places. We thus hypothesized that the same holds for writing software: programmers behave more “correctly” in more highly visible settings. “Correct” behaviors include writing readable code and clear code comments, providing useful and accurate git commit messages, etc. “Highly visible” settings could be public repositories with a large number of stars. We sampled (some number of) public repositories, with Python as the primary language, from (some number of) GitHub users. We analyzed a user’s git commits in the repositories they contribute to and generated several metrics for “correct” behaviors, including average comment density across a user’s git commits in each repository and a user’s average git commit message length in each repository. Then for each user we analyzed the correlation between these metrics and the number of stars in a repository (using some statistical methods) (and reached some conclusion).
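
The two metrics named in the abstract above could be computed along these lines. This is a rough sketch under assumptions: the abstract does not define comment density, so here it is taken to be the fraction of non-blank lines that are full-line `#` comments, message length is counted in characters, and both function names are hypothetical.

```python
def comment_density(source):
    """Fraction of non-blank lines that are full-line '#' comments (assumed definition)."""
    lines = [line.strip() for line in source.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum(line.startswith("#") for line in lines) / len(lines)


def average_message_length(messages):
    """Mean length, in characters, of a list of commit messages."""
    if not messages:
        return 0.0
    return sum(len(message) for message in messages) / len(messages)


example_source = "# add two numbers\ndef add(a, b):\n    return a + b\n\n# done\n"
example_messages = ["Fix off-by-one in parser", "wip"]
print(comment_density(example_source))
print(average_message_length(example_messages))
```

Inline comments and docstrings are ignored by this definition; a fuller analysis might use Python's `tokenize` module to count them as well.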

Vicki Zeng
vzz
Advisor(s): Reid Simmons
Title: Personalization and Sentiment Analysis of a Social Scrabble-Playing Robot
Abstract: My research looks to increase engagement between the CMU community and Victor, the Scrabble-playing robot at the Gates-Hillman Center. We believe that personalizing interactions between Victor and players would make interactions more natural, intimate and long-lasting. Building off work from last semester, Victor can now respond to a greater variety of player-specific playing patterns, such as a player's personal score per turn. We have also been performing sentiment analysis on the chat conversations between Victor and players. The resulting emotional responses can be used to tune his sassy personality on a player-by-player basis, thus increasing his likability with each player.