Starting in the late 18th century, linguists developed the comparative method, a procedure for reconstructing the phonological forms of words in the unattested ancestor languages (proto-languages) of a group of daughter languages. Various attempts have been made to implement the comparative method with computational models. However, these efforts have typically not been fully unsupervised (that is, using no phylogenetic or gold-standard proto-form information).

We are modeling comparative reconstruction using neural models that overcome these limitations, inducing both proto-forms and a phylogeny (family tree) given only cognate sets (sets of words derived from the same ancestor), which can be identified by other (computational) means.

We are working with datasets from Sinitic (Chinese), Polynesian, Romance, and Tangkhulic.

This project is work in progress with Anna Cai, Leon Lu, Helen Jason Rauchwerk, Kexun Zhang, Naveen Suresh, and Aravind Mysore. Alumni include Young Min Kim, Chen Cui, Kalvin Chang, and Nate Robinson.