Senior thesis presentations will take place on
Wednesday, May 1, GHC 4401 (Rashid Auditorium).
All times are Eastern Daylight Time.
(Zoom link available via email invite.)
9:45AM | Sarah Fisher | WORK IN PROGRESS: Predictive modeling of cancer tumor evolution with graph-based spatial representations
9:50AM | Jessica Liu | WORK IN PROGRESS: Identifying the impact of Alzheimer’s disease-associated genetic variants on molecular pathways via genetic perturbation of selected enhancers in conditional specific immune cells
10:00AM | Daniel Ng | Memory Reuse in Linear Functional Computation |
10:25AM | Shengchao Yang | Exceptions in a Message Passing Interpretation of Substructural Logic |
10:50AM | Yueqi Song | What Is Missing in Multilingual Visual Reasoning and How to Fix It |
11:15AM | Kaajal Gupta | Hermes: Evaluating Enso Schedulers for High-Performance Networking |
11:40AM | Tony Yu | Rage Against the Context Switch |
1:00PM | Raashi Mohan | Robust Disaster Damage Assessment: Leveraging Large Pretrained Models |
1:25PM | Alan Luo | Optimized deconvolution of bulk and single-cell RNA-seq data |
1:50PM | Ruby Redlich | RERconverge Expansion: Using Relative Evolutionary Rates to Study Complex Categorical Trait Evolution |
2:15PM | Gavin Zhu | Cognitive Framework for Preference Adaptation in Human-AI Interaction |
2:40PM | Bradley Teo | Improving ADT Representation in Virgil |
ABSTRACTS (alphabetical order by presenter)
WORK IN PROGRESS: Predictive modeling of cancer tumor evolution with graph-based spatial representations
Sarah Fisher
Research Advisor: Oana Carja
Cancer is a progressive disease characterized by fast-evolving tumors with many distinct cellular subpopulations and complex structures. Because this spatial cellular heterogeneity appears to play an essential role in cancer treatment outcomes, we investigate the evolutionary dynamics that underlie cancer progression. With unprecedented public access to extensive imaging datasets, we can leverage this under-utilized data to map the spatial architecture of tumor cell populations, rather than simply counting cells as typical pathology techniques do. This project combines digital pathology, graph theory, and evolutionary modeling to understand cancer progression through the lens of its spatial architecture. By building a spatially informed representation of cancer tumors and employing mathematical models to predict their evolution, this project aims to explore cancer's long-term eco-evolutionary dynamics and response to treatment.
Hermes: Evaluating Enso Schedulers for High-Performance Networking
Kaajal Gupta
Advisors: Hugo Sadok, Justine Sherry
Kernel bypass has recently gained popularity as a way to achieve high-performance networking. However, it pays for its improvements in performance with CPU efficiency. For applications to respond quickly to packets and achieve high performance, they must continually poll the NIC, never yielding to the kernel, which leads to low CPU efficiency. Alternatively, CPU efficiency can be kept high by having the process scheduler interpose between the NIC and applications, letting it deschedule idle applications, but this sacrifices performance due to data movement. In this project, a kernel-bypass system is introduced that reconciles performance and CPU efficiency. The key idea behind this system is to have the process scheduler interpose on the network control plane while letting applications exchange data directly with the NIC (the data plane). This design is enabled by Ensō, a recent proposal for a new NIC interface that detaches data from notifications of the data arriving. Having the scheduler handle notifications, but not the data, allows it to react quickly to changes in load without the overhead of data movement typically imposed by the kernel. Different designs incorporating Ensō are evaluated on a variety of metrics to determine how to best leverage this split between the data and control planes.
WORK IN PROGRESS: Identifying the impact of Alzheimer’s disease-associated genetic variants on molecular pathways via genetic perturbation of selected enhancers in conditional specific immune cells
Jessica Liu
Research advisors: Andreas Pfenning and Ziheng Chen
Alzheimer’s disease (AD) is one of the most devastating neurodegenerative diseases, accounting for 60-70% of dementia cases. However, difficulties in understanding its complex molecular mechanisms make early diagnosis and treatment challenging. While the cause of the more common late-onset Alzheimer’s disease (LOAD) has not yet been fully uncovered, previous genome-wide association studies (GWAS) on LOAD have identified thousands of genetic variants, mainly single nucleotide polymorphisms (SNPs), associated with AD. To identify the causal variants, we performed stratified linkage disequilibrium score regression (S-LDSC) and discovered that AD-associated common variants from GWAS are highly enriched in open chromatin regions, consisting mainly of cis-regulatory elements in microglia and macrophages, while rare variants, which are not enriched in the immune system, are hypothesized to correlate with neural development and synaptic function. We hypothesize that these SNPs are likely disrupting cis-regulatory element activity in immune cells. Previously, we measured the impact of AD-associated SNPs on enhancer activity in conditional specific immune cells using a massively parallel reporter assay (MPRA). To further confirm that AD SNPs mediate endogenous gene expression, we want to investigate more specific changes in molecular pathways caused by these AD-associated SNPs via single-cell CRISPR screening on conditional specific immune cell models. This study could contribute to the understanding of the role of immune cells in AD pathogenesis and provide novel targets for AD treatment development.
Optimized deconvolution of bulk and single-cell RNA-seq data
Alan Luo
Advisor: Dr. Russell Schwartz
My goal was to improve RADs, an algorithm for integrating bulk and single-cell genomic data in cancer progression studies. RADs performs a technique called semi-deconvolution, which seeks to infer clonal frequency and gene expression evolution. The first efforts in this research area focused on specific use cases in which the available data is limited to bulk data profiling the average genomic features of mixtures of many distinct cells. Single-cell data has revolutionized the field by allowing one to track the genomic behavior of individual cells in a tumor, but it is not always technically feasible. There are situations where one has samples suitable for single-cell methods, such as some recent metastases, but also samples only suitable for bulk methods, such as an archived primary tumor biopsy. RADs focuses on such scenarios but can have poor resolution for quantifying the different cell types in the bulk data. So far, I have explored a bulk dataset and a reference single-cell RNA-seq dataset. The bulk dataset is composed of three samples: two breast cancer bone metastases and one associated breast primary tumor. I ran RADs on these two datasets for multiple repetitions, using different penalty weights each time. Performance worsened as the penalty weight increased. In addition, the fractions observed across the penalty weights were generally accurate, as the fraction changes for each cell type had viable biological explanations. Therefore, few changes were needed beyond fixing minor errors, such as invalid function calls. More results are being collected.
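To illustrate the general idea behind deconvolving bulk expression against a single-cell reference (a simplified sketch, not the RADs algorithm itself: RADs adds penalty terms and evolutionary structure on top of this), one can frame the core step as non-negative least squares over hypothetical toy data:

```python
import numpy as np
from scipy.optimize import nnls

def deconvolve(bulk, reference):
    """Estimate cell-type fractions for one bulk sample.

    bulk:      (n_genes,) average expression of the mixed bulk sample
    reference: (n_genes, n_cell_types) mean expression per cell type,
               derived from a single-cell reference dataset
    """
    # Non-negative least squares: bulk ~= reference @ fractions
    fractions, _residual = nnls(reference, bulk)
    total = fractions.sum()
    return fractions / total if total > 0 else fractions

# Toy example: 4 genes, 2 cell types mixed in a 70/30 ratio
reference = np.array([[10.0, 0.0],
                      [0.0, 10.0],
                      [5.0, 5.0],
                      [8.0, 2.0]])
true_fracs = np.array([0.7, 0.3])
bulk = reference @ true_fracs
print(np.round(deconvolve(bulk, reference), 2))  # → [0.7 0.3]
```

Real methods must additionally handle noise, shared expression signatures between clones, and regularization (the penalty weights mentioned above), but the least-squares core is the same.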
Robust Disaster Damage Assessment: Leveraging Large Pretrained Models
Raashi Mohan
Advisors: Saurabh Garg, Amrith Setlur, Aditi Raghunathan
Robustness to distribution shifts is essential for deploying machine learning models in real-world applications. Nevertheless, existing techniques that enhance the performance of machine learning in the presence of these shifts have primarily focused on shifts that are well-defined and uncomplicated; no method has proven successful in improving performance in the open-ended scenario considered in this study. Today, there is growing interest in leveraging unlabeled data, especially in real-world applications where this data type is cheaper and easier to obtain. Unfortunately, the de facto standard of simply using existing data from previous related events to create models tailored toward specific tasks often falls short in several real-world situations, as such strategies are designed for conventional machine learning benchmarks, where a wealth of labels is available and distribution shifts are absent. In light of these challenges, this thesis aims to devise novel approaches for effectively performing disaster assessment after natural disasters, leveraging OpenAI’s CLIP as a large pre-trained zero-shot backbone, with the goal of enhancing performance in the presence of distribution shifts.
Memory Reuse in Linear Functional Computation
Daniel Ng
Advisor: Prof. Frank Pfenning
The semi-axiomatic sequent calculus, or SAX, offers an alternative way to represent proofs in the sequent calculus. SAX also corresponds to a process calculus, where processes interact by writing and reading from memory cells. Improvements were then made to the memory layout of data structures to create the SNAX language, which acts similarly to SAX but uses fewer pointer dereferences. We have now added a system of memory reuse to the existing linear system provided by SNAX, further minimizing the time spent performing memory allocations. We have also updated both the typing rules and the dynamic rules for SNAX to accommodate reuse. Unlike the original SNAX, which is concurrent, SNAX with reuse runs under a sequential semantics. With memory reuse, progress and preservation still hold, meaning that well-typed programs in SNAX will never cause a memory error. Our changes allow SNAX to more closely match other functional languages that are capable of updating their single-threaded data structures in-place.
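The payoff of linearity described above can be illustrated outside SNAX as well. In a linear setting, a data structure consumed exactly once can donate its memory cells to the result, so an operation like `map` allocates nothing. A minimal sketch (hypothetical illustration, not SNAX code or its actual cell layout):

```python
# Sketch: a linked list whose cons cells are reused in place.
# In a linear language the type system guarantees the input list
# has no other owners, making this overwrite safe.

class Cons:
    __slots__ = ("head", "tail")
    def __init__(self, head, tail):
        self.head = head
        self.tail = tail

def map_in_place(f, xs):
    """Apply f to every element, reusing each Cons cell instead of
    allocating a fresh one."""
    node = xs
    while node is not None:
        node.head = f(node.head)   # overwrite the cell just consumed
        node = node.tail
    return xs                       # same cells, new contents

lst = Cons(1, Cons(2, Cons(3, None)))
out = map_in_place(lambda x: x * 10, lst)
print(out is lst, out.head, out.tail.head)  # → True 10 20
```

In Python this safety rests on convention; the point of the type-system changes described above is to make the compiler enforce it, which is also why progress and preservation must be re-established for the reuse rules.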
RERconverge Expansion: Using Relative Evolutionary Rates to Study Complex Categorical Trait Evolution
Ruby Redlich
Advisors: Dr. Andreas Pfenning and Dr. Amanda Kowalczyk
Comparative genomics approaches seek to associate molecular evolution with the evolution of phenotypes across a phylogeny. Many of these methods, including our evolutionary rates-based method, RERconverge, lack the ability to analyze non-ordinal, multicategorical traits. To address this limitation, we introduce an expansion to RERconverge that associates shifts in evolutionary rates with the convergent evolution of multi-categorical traits. The categorical RERconverge expansion includes methods for performing categorical ancestral state reconstruction, statistical tests for associating relative evolutionary rates with categorical variables, and a new method for performing phylogeny-aware permutations, “permulations”, on multi-categorical traits. In addition to demonstrating our new method on a three-category diet phenotype, we compare its performance to binary RERconverge analyses and two existing methods for comparative genomic analyses of categorical traits: phylogenetic simulations and a phylogenetic signal based method. Our results show that our new categorical method outperforms phylogenetic simulations at identifying genes and enriched pathways significantly associated with the diet phenotypes and that the categorical ancestral state reconstruction drives an improvement in our ability to capture diet-related enriched pathways compared to binary RERconverge when implemented without user input on phenotype evolution. Through investigation of the PIEZO1 gene, we also illustrate how diet-relevant genes detected by our method can possess convergent patterns of amino acid sequence change. An additional case study using the binary pair bonding phenotype illustrates how our categorical expansion can still be applied successfully to binary traits as indicated by our identification of relevant biological pathways related to male gametes, ovarian follicles, and behavioral response to drugs. 
The categorical expansion to RERconverge will provide a strong foundation for applying the comparative method to categorical traits on larger data sets with more species and more complex trait evolution than have previously been analyzed.
What Is Missing in Multilingual Visual Reasoning and How to Fix It
Yueqi Song
Research Advisor: Graham Neubig
NLP models today strive to support multiple languages and modalities, improving accessibility for diverse users. In this paper, we evaluate their multilingual, multimodal capabilities by testing on a visual reasoning task. We observe that proprietary systems like GPT-4V currently obtain the best performance on this task, while open models lag in comparison. Surprisingly, GPT-4V exhibits similar performance between English and other languages, indicating the potential for equitable system development across languages. Our analysis of model failures reveals three key aspects that make this task challenging: multilinguality, complex reasoning, and multimodality. To address these challenges, we propose three targeted interventions: a translate-test approach to tackle multilinguality, a visual programming approach to break down complex reasoning, and a novel method that leverages image captioning to address multimodality. Our interventions achieve the best open performance on this task in a zero-shot setting, boosting open model LLaVA-v1.5-13B by 14%, LLaVA-v1.5-34B by 20%, and Qwen-VL by 17%, while also modestly improving GPT-4V's performance.
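The translate-test intervention mentioned above follows a well-known control flow: translate a non-English query into English before sending it to a model that is strongest in English. A minimal sketch of that pipeline, where `translate` and `vqa_model` are hypothetical stand-ins for a real MT system and a real vision-language model:

```python
# Sketch of translate-test: queries reach the model in English
# regardless of the user's language. Both functions below are toy
# stand-ins, not the systems evaluated in the thesis.

def translate(text: str, src: str, dst: str = "en") -> str:
    # Stand-in for an off-the-shelf machine translation system.
    toy_dictionary = {("ja", "en"): {"これは犬ですか？": "Is this a dog?"}}
    return toy_dictionary.get((src, dst), {}).get(text, text)

def vqa_model(image, question_en: str) -> str:
    # Stand-in for a vision-language model queried in English.
    return "yes" if "dog" in question_en.lower() else "no"

def translate_test(image, question: str, lang: str) -> str:
    question_en = question if lang == "en" else translate(question, lang)
    return vqa_model(image, question_en)

print(translate_test(None, "これは犬ですか？", "ja"))  # → yes
```

The appeal of this design is that multilinguality is handled entirely by the translator, so the downstream model only ever sees its best-supported language.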
Improving ADT Representation in Virgil
Bradley Teo
Advisor: Ben Titzer
Algebraic Data Types (ADTs) are an increasingly common feature in modern programming languages. We explore various optimizations to ADT representations in Virgil, a systems-level programming language that compiles to x86, x86-64, Wasm and the Java Virtual Machine. In Virgil, programmers can now annotate ADTs as unboxed to eliminate the overhead of heap allocation, and we have extended the language to enable programmer-expressed bit-layouts for varying levels of control on memory layout. More aggressive techniques for packing data, such as packed references, are used to further reduce memory usage and register pressure. The performance impact of these representation changes was evaluated on a variety of workloads in terms of execution time and memory usage.
Exceptions in a Message Passing Interpretation of Substructural Logic
Shengchao Yang
Advisor: Frank Pfenning
We study a message-passing interpretation of classical affine logic and propose a new type system that incorporates exceptions by introducing explicit channel cancellation and exception-handling constructs. Our type system ensures program safety by enforcing session fidelity and deadlock freedom. To experiment, we implement an interpreter for our language and test it on several examples, comparing against the expected program behavior. Beyond that, we extend our language with uncaught exceptions and non-exhaustive matches.
Rage Against the Context Switch
Tony Yu
Advisor: Professor Dimitrios Skarlatos
Modern computers must have the capability to handle hundreds of processes. A core component of this capability is the context switch, wherein the state of a process is saved so that execution can be paused and resumed at a later point, freeing the CPU for use by another process. Context switches allow processes to share CPUs and help computers to hide stalls from blocked processes. However, context switches are computationally intensive, and typically have a negative impact on system performance. Despite this, context switches are extremely prevalent in all computer systems, as they are necessary to handle I/O interrupts, system calls, VM-exits, core scheduling and preemption, and other multiprocessing operations. A key design paradigm is that core scheduling and preemption is a responsibility delegated to the operating system. Using user-space instrumentation, we quantify the consequent context switching and kernel thread preemption overheads and show that they are significant in common multiprocessing settings. We also present Nemo, a novel CPU simulator for the Structural Simulation Toolkit, and use it to demonstrate the same overheads in a simulated environment. We also propose means by which such overheads can be eliminated from modern computing systems.
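Context-switch overheads of the kind quantified above can be observed with a classic user-space micro-benchmark: two processes ping-pong a byte over pipes, so each round trip forces the scheduler to switch between them. A rough sketch (POSIX-only; this measures pipe overhead together with the switches, and is not the thesis's instrumentation):

```python
import os
import time

def measure_round_trip(rounds: int = 10_000) -> float:
    """Average time for one byte to travel parent -> child -> parent.

    Each round trip forces at least two context switches when both
    processes share a CPU; the figure also includes pipe syscall cost.
    """
    r1, w1 = os.pipe()   # parent -> child
    r2, w2 = os.pipe()   # child -> parent
    pid = os.fork()
    if pid == 0:                      # child: echo every byte back
        for _ in range(rounds):
            os.read(r1, 1)
            os.write(w2, b"x")
        os._exit(0)
    start = time.perf_counter()
    for _ in range(rounds):
        os.write(w1, b"x")
        os.read(r2, 1)
    elapsed = time.perf_counter() - start
    os.waitpid(pid, 0)
    return elapsed / rounds           # seconds per round trip

print(f"{measure_round_trip() * 1e6:.1f} us per round trip")
```

Even this crude measurement typically shows costs of microseconds per round trip, which is exactly the overhead that compounds at scale in syscall- and interrupt-heavy workloads.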
Cognitive Framework for Preference Adaptation in Human-AI Interaction
Feiyu “Gavin” Zhu
Advisor: Reid Simmons
Previous work on preference learning focuses extensively on using rewards as proxies. Despite fitting nicely into the reinforcement learning paradigm, reward-based machine learning approaches face the difficulty of fully representing personal preferences as rewards and the challenge of updating the policy with few samples. In this study, we take an alternative rule-centric approach, drawing inspiration from cognitive science and building a decision-making framework centered around production rules. The production rules in the cognitive framework are abstract, modular, interpretable, and composable, all important features in human-AI interactions. In this work we formally define a cognitive architecture, show how we can bootstrap its rules with minimal human input, collect a set of human preferences in the real world, and show how the proposed architecture can adapt to those preferences in one shot. We hope that this work will inspire future work on rule-centric agent policies.