NeuLab Presentations at EMNLP 2020

NeuLab will be presenting at EMNLP 2020! Come visit our posters and presentations if you’re attending the conference online.

Conference Papers

A Bilingual Generative Transformer for Semantic Sentence Embedding

  • Authors: John Wieting, Graham Neubig, Taylor Berg-Kirkpatrick.
  • Time: Tuesday, Nov 17, 02:00 (UTC+0) Gather Session 1C: Linguistic Theories, Cognitive Modeling and Psycholinguistics; Semantics: Sentence-level Semantics, Textual Inference and Other areas

A Bilingual Generative Transformer for Semantic Sentence Embedding

Semantic sentence embedding models encode natural language sentences into vectors, such that closeness in embedding space indicates closeness in the semantics between the sentences. Bilingual data offers a useful signal for learning such embeddings: properties shared by both sentences in a translation pair are likely semantic, while divergent properties are likely stylistic or language-specific. We propose a deep latent variable model that attempts to perform source separation on parallel sentences, isolating what they have in common in a latent semantic vector, and explaining what is left over with language-specific latent vectors. Our proposed approach differs from past work on semantic sentence encoding in two ways. First, by using a variational probabilistic framework, we introduce priors that encourage source separation, and can use our model’s posterior to predict sentence embeddings for monolingual data at test time. Second, we use high-capacity transformers as both data generating distributions and inference networks – contrasting with most past work on sentence embeddings. In experiments, our approach substantially outperforms the state-of-the-art on a standard suite of unsupervised semantic similarity evaluations. Further, we demonstrate that our approach yields the largest gains on more difficult subsets of these evaluations where simple word overlap is not a good indicator of similarity.
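
The core idea, source separation into a shared semantic variable plus language-specific ones, can be sketched compactly. Below is a minimal, illustrative PyTorch skeleton, not the paper's implementation: the mean-pooling encoder and all layer sizes are assumptions made for brevity.

```python
# Hypothetical sketch of source separation on a translation pair:
# one shared "semantic" latent inferred from both sentences, plus a
# language-specific latent per side. Sizes and the crude mean-pooling
# encoder are illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class SourceSeparationEncoder(nn.Module):
    def __init__(self, vocab_size, dim=256, sem_dim=128, lang_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.to_semantic = nn.Linear(2 * dim, 2 * sem_dim)  # mean, log-variance
        self.to_lang = nn.Linear(dim, 2 * lang_dim)         # per-language latent

    def pool(self, tokens):
        return self.embed(tokens).mean(dim=1)               # crude sentence vector

    def forward(self, src_tokens, tgt_tokens):
        src, tgt = self.pool(src_tokens), self.pool(tgt_tokens)
        # The shared semantic variable is inferred from BOTH sides of the pair.
        sem_mu, sem_logvar = self.to_semantic(torch.cat([src, tgt], -1)).chunk(2, -1)
        # Language-specific variables see only their own sentence.
        src_mu, _ = self.to_lang(src).chunk(2, -1)
        tgt_mu, _ = self.to_lang(tgt).chunk(2, -1)
        # Reparameterized sample; at test time sem_mu alone is the embedding.
        sem = sem_mu + torch.randn_like(sem_mu) * (0.5 * sem_logvar).exp()
        return sem, src_mu, tgt_mu
```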

Automatic Extraction of Rules Governing Morphological Agreement

  • Authors: Aditi Chaudhary, Antonios Anastasopoulos, Adithya Pratapa, David R. Mortensen, Zaid Sheikh, Yulia Tsvetkov, Graham Neubig.
  • Time: Tuesday, Nov 17, 18:00 (UTC+0) Gather Session 3E: Information Extraction; Phonology, Morphology and Word Segmentation

Automatic Extraction of Rules Governing Morphological Agreement

Creating a descriptive grammar of a language is an indispensable step for language documentation and preservation. However, at the same time it is a tedious, time-consuming task. In this paper, we take steps towards automating this process by devising an automated framework for extracting a first-pass grammatical specification from raw text in a concise, human- and machine-readable format. We focus on extracting rules describing agreement, a morphosyntactic phenomenon at the core of the grammars of many of the world’s languages. We apply our framework to all languages included in the Universal Dependencies project, with promising results. Using cross-lingual transfer, even with no expert annotations in the language of interest, our framework extracts a grammatical specification which is nearly equivalent to those created with large amounts of gold-standard annotated data. We confirm this finding with human expert evaluations of the rules that our framework produces, which have an average accuracy of 78%. We release an interface demonstrating the extracted rules at https://neulab.github.io/lase/.
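
To make “rules governing agreement” concrete, the hypothetical sketch below mines agreement statistics from a dependency treebank: for each dependency relation, it counts how often a morphological feature matches between head and dependent. The token format is an assumption, and the paper goes well beyond such raw counts to extract concise, human-readable rules.

```python
# Hypothetical sketch: measure head-dependent agreement per relation.
# Each token is assumed to be a dict with 'id', 'head', 'deprel', and a
# 'feats' dict of morphological features (as in CoNLL-U treebanks).
from collections import defaultdict

def agreement_stats(sentences, feature="Number"):
    match, total = defaultdict(int), defaultdict(int)
    for sent in sentences:
        by_id = {tok["id"]: tok for tok in sent}
        for tok in sent:
            head = by_id.get(tok["head"])           # None for the root
            if head and feature in tok["feats"] and feature in head["feats"]:
                total[tok["deprel"]] += 1
                match[tok["deprel"]] += tok["feats"][feature] == head["feats"][feature]
    # Relations with a ratio near 1.0 are candidates for agreement rules.
    return {rel: match[rel] / total[rel] for rel in total}
```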

Dynamic Data Selection and Weighting for Iterative Back-Translation

  • Authors: Zi-Yi Dou, Antonios Anastasopoulos, Graham Neubig
  • Time: Wednesday, Nov 18, 02:00 (UTC+0) Gather Session 4A: Machine Translation and Multilinguality

Dynamic Data Selection and Weighting for Iterative Back-Translation

Back-translation has proven to be an effective method to utilize monolingual data in neural machine translation (NMT), and iteratively conducting back-translation can further improve the model performance. Selecting which monolingual data to back-translate is crucial, as we require that the resulting synthetic data are of high quality and reflect the target domain. To achieve these two goals, data selection and weighting strategies have been proposed, with a common practice being to select samples close to the target domain but also dissimilar to the average general-domain text. In this paper, we provide insights into this commonly used approach and generalize it to a dynamic curriculum learning strategy, which is applied to iterative back-translation models. In addition, we propose weighting strategies based on both the current quality of the sentence and its improvement over the previous iteration. We evaluate our models on domain adaptation, low-resource, and high-resource MT settings and on two language pairs. Experimental results demonstrate that our methods achieve improvements of up to 1.8 BLEU points over competitive baselines.
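
The selection heuristic the paper generalizes (“close to the target domain but dissimilar to average general-domain text”) is often operationalized as a cross-entropy difference between two language models. Here is a minimal sketch using add-alpha-smoothed unigram LMs, a simplifying assumption; the paper's dynamic curriculum and quality-based weighting are not shown.

```python
# Minimal sketch of cross-entropy-difference data selection: score each
# monolingual sentence by in-domain LM log-probability minus
# general-domain LM log-probability, then keep the top k.
import math
from collections import Counter

def unigram_logprob(sentence, counts, total, vocab, alpha=1.0):
    """Length-normalized, add-alpha smoothed unigram log-probability."""
    words = sentence.split()
    lp = sum(math.log((counts[w] + alpha) / (total + alpha * vocab)) for w in words)
    return lp / max(len(words), 1)

def select(monolingual, in_domain, general, k):
    in_c = Counter(w for s in in_domain for w in s.split())
    gen_c = Counter(w for s in general for w in s.split())
    vocab = len(set(in_c) | set(gen_c))
    in_total, gen_total = sum(in_c.values()), sum(gen_c.values())
    scored = sorted(monolingual, reverse=True,
                    key=lambda s: unigram_logprob(s, in_c, in_total, vocab)
                                  - unigram_logprob(s, gen_c, gen_total, vocab))
    return scored[:k]
```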

Is Chinese Word Segmentation a Solved Task? Rethinking Neural Chinese Word Segmentation

  • Authors: Jinlan Fu, Pengfei Liu, Qi Zhang, Xuanjing Huang
  • Time: Wednesday, Nov 18, 02:00 (UTC+0) Gather Session 4I: Phonology, Morphology and Word Segmentation

Is Chinese Word Segmentation a Solved Task? Rethinking Neural Chinese Word Segmentation

The performance of Chinese Word Segmentation (CWS) systems has gradually reached a plateau with the rapid development of deep neural networks, especially the successful use of large pre-trained models. In this paper, we take stock of what we have achieved and rethink what’s left in the CWS task. Methodologically, we propose a fine-grained evaluation for existing CWS systems, which not only allows us to diagnose the strengths and weaknesses of existing models (under the in-dataset setting), but also enables us to quantify the discrepancy between different criteria and alleviate the negative transfer problem when doing multi-criteria learning. Strategically, although we do not aim to propose a novel model in this paper, our comprehensive experiments on eight models and seven datasets, together with thorough analysis, point to promising directions for future research. We make all code publicly available and release an interface that can quickly evaluate and diagnose users’ models: https://github.com/neulab/InterpretEval.

NeuSpell: A Neural Spelling Correction Toolkit

  • Authors: Sai Muralidhar Jayanthi, Danish Pruthi, Graham Neubig
  • Time: Wednesday, Nov 18, 02:00 (UTC+0) Gather Session 4K: Demos

NeuSpell: A Neural Spelling Correction Toolkit

We introduce NeuSpell, an open-source toolkit for spelling correction in English. Our toolkit comprises ten different models, and benchmarks them on naturally occurring misspellings from multiple sources. We find that many systems do not adequately leverage the context around the misspelt token. To remedy this, (i) we train neural models using spelling errors in context, synthetically constructed by reverse engineering isolated misspellings; and (ii) use contextual representations. By training on our synthetic examples, correction rates improve by 9% (absolute) compared to the case when models are trained on randomly sampled character perturbations. Using richer contextual representations boosts the correction rate by another 3%. Our toolkit enables practitioners to use our proposed and existing spelling correction systems via both a unified command line and a web interface. Among many potential applications, we demonstrate the utility of our spell-checkers in combating adversarial misspellings. The toolkit, along with code and pretrained models, is freely available online.
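
To illustrate what “spelling errors in context” look like as training data, here is a toy sketch of corrupting clean text into (noisy, clean) pairs; the confusion table and corruption rate are made-up assumptions, not the toolkit's actual synthesis procedure.

```python
# Toy sketch: synthesize in-context spelling errors by swapping words
# for known misspellings, producing (noisy input, gold correction) pairs.
import random

CONFUSIONS = {                      # illustrative misspelling lexicon
    "receive": ["recieve", "receve"],
    "definitely": ["definately", "definitly"],
    "separate": ["seperate"],
}

def corrupt(sentence, rate=0.3, rng=random):
    noisy = [rng.choice(CONFUSIONS[w]) if w in CONFUSIONS and rng.random() < rate
             else w
             for w in sentence.split()]
    return " ".join(noisy)

clean = "i will definitely receive a separate invoice"
print(corrupt(clean), "->", clean)
```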

Interpretable Multi-dataset Evaluation for Named Entity Recognition

  • Authors: Jinlan Fu, Pengfei Liu, Graham Neubig.
  • Time: Wednesday, Nov 18, 02:00 (UTC+0) Gather Session 4I: Interpretability and Analysis of Models for NLP; Syntax: Tagging, Chunking, and Parsing

Interpretable Multi-dataset Evaluation for Named Entity Recognition

With the proliferation of models for natural language processing tasks, it is even harder to understand the differences between models and their relative merits. Simply looking at differences between holistic metrics such as accuracy, BLEU, or F1 does not tell us why or how particular methods perform differently and how diverse datasets influence the model design choices. In this paper, we present a general methodology for interpretable evaluation for the named entity recognition (NER) task. The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them, identifying the strengths and weaknesses of current systems. By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area: https://github.com/neulab/InterpretEval.
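
The flavor of this interpretable evaluation can be approximated by bucketing gold entities on an attribute and scoring each bucket separately, rather than reporting one corpus-level F1. Below is a minimal sketch with entity length as the attribute; the bucket edges are assumptions.

```python
# Minimal sketch of attribute-bucketed evaluation for NER: report recall
# per entity-length bucket. Entities are (sentence_id, start, end, type).
from collections import defaultdict

def bucketed_recall(gold_entities, predicted_entities):
    predicted = set(predicted_entities)
    hits, totals = defaultdict(int), defaultdict(int)
    for ent in gold_entities:
        length = ent[2] - ent[1]
        bucket = "1" if length == 1 else "2-3" if length <= 3 else "4+"
        totals[bucket] += 1
        hits[bucket] += ent in predicted
    return {b: hits[b] / totals[b] for b in totals}
```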

X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained Language Models

  • Authors: Zhengbao Jiang, Antonios Anastasopoulos, Jun Araki, Haibo Ding, Graham Neubig
  • Time: Wednesday, Nov 18, 02:00 (UTC+0) Gather Session 4A: Machine Translation and Multilinguality

X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained Language Models

Language models (LMs) have proven surprisingly successful at capturing factual knowledge by completing cloze-style fill-in-the-blank questions such as “Punta Cana is located in _.” However, while knowledge is both written and queried in many languages, studies on LMs’ factual representation ability have almost invariably been performed on English. To assess factual knowledge retrieval in LMs in different languages, we create a multilingual benchmark of cloze-style probes for 23 typologically diverse languages. To properly handle language variations, we expand probing methods from single- to multi-word entities, and develop several decoding algorithms to generate multi-token predictions. Extensive experimental results provide insights about how well (or poorly) current state-of-the-art LMs perform at this task in languages with more or fewer available resources. We further propose a code-switching-based method to improve the ability of multilingual LMs to access knowledge, and verify its effectiveness on several benchmark languages. Benchmark data and code have been released at https://x-factr.github.io/.
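
For a feel of the probing setup, the sketch below fills a multi-[MASK] cloze prompt with a multilingual masked LM using simple independent decoding. The model name and the two-mask template are assumptions for illustration; the paper develops and compares several more sophisticated multi-token decoding algorithms.

```python
# Sketch of multi-token cloze probing with a masked LM (independent
# decoding: each mask is filled with its own argmax in one pass).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "bert-base-multilingual-cased"   # assumed model for illustration
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

prompt = f"Punta Cana is located in {tok.mask_token} {tok.mask_token}."
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

mask_positions = (inputs.input_ids[0] == tok.mask_token_id).nonzero(as_tuple=True)[0]
print(" ".join(tok.decode(logits[0, pos].argmax()) for pos in mask_positions))
```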

OCR Post Correction for Endangered Language Texts

  • Authors: Shruti Rijhwani, Antonios Anastasopoulos, Graham Neubig.
  • Time: Wednesday, Nov 18, 02:00 (UTC+0) Gather Session 4A: Machine Translation and Multilinguality

OCR Post Correction for Endangered Language Texts

There is little to no data available to build natural language processing models for most endangered languages. However, textual data in these languages often exists in formats that are not machine-readable, such as paper books and scanned images. In this work, we address the task of extracting text from these resources. We create a benchmark dataset of transcriptions for scanned books in three critically endangered languages and present a systematic analysis of how general-purpose OCR tools are not robust to the data-scarce setting of endangered languages. We develop an OCR post-correction method tailored to ease training in this data-scarce setting, reducing the recognition error rate by 34% on average across the three languages.

Re-evaluating Evaluation in Text Summarization

  • Authors: Manik Bhandari, Pranav Gour, Atabak Ashfaq, Pengfei Liu, Graham Neubig.
  • Time: Thursday, Nov 19, 00:00 (UTC+0) Zoom Q&A Session 16C: Summarization

Re-evaluating Evaluation in Text Summarization

Automated evaluation metrics as a stand-in for manual evaluation are an essential part of the development of text-generation tasks such as text summarization. However, while the field has progressed, our standard metrics have not – for nearly 20 years ROUGE has been the standard evaluation in most summarization papers. In this paper, we make an attempt to re-evaluate the evaluation method for text summarization: assessing the reliability of automatic metrics using top-scoring system outputs, both abstractive and extractive, on recently popular datasets for both system-level and summary-level evaluation settings. We find that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems.
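
System-level meta-evaluation of a metric boils down to correlating per-system metric averages with per-system human judgments. Here is a minimal sketch with made-up numbers; the paper, of course, uses real system outputs and human annotations.

```python
# Sketch of system-level metric meta-evaluation: Kendall's tau between
# a metric's per-system means and mean human ratings. Scores are toy data.
from scipy.stats import kendalltau

human  = {"sys_a": 4.1, "sys_b": 3.6, "sys_c": 3.9}     # mean human rating
metric = {"sys_a": 0.42, "sys_b": 0.37, "sys_c": 0.41}  # mean metric score

systems = sorted(human)
tau, p = kendalltau([human[s] for s in systems], [metric[s] for s in systems])
print(f"system-level Kendall tau = {tau:.2f} (p = {p:.2f})")
```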

TACL Papers

The Return of Lexical Dependencies: Neural Lexicalized PCFGs

  • Authors: Hao Zhu, Yonatan Bisk, Graham Neubig
  • Time: Wednesday, Nov 18, 16:00 (UTC+0) Zoom Q&A Session 13D: Syntax: Tagging, Chunking, and Parsing

The Return of Lexical Dependencies: Neural Lexicalized PCFGs

In this paper we demonstrate that context-free grammar (CFG) based methods for grammar induction benefit from modeling lexical dependencies. This contrasts with the most popular current methods for grammar induction, which focus on discovering either constituents or dependencies. Previous approaches to marrying these two disparate syntactic formalisms (e.g. lexicalized PCFGs) have been plagued by sparsity, making them unsuitable for unsupervised grammar induction. However, in this work, we present novel neural models of lexicalized PCFGs which allow us to overcome sparsity problems and effectively induce both constituents and dependencies within a single model. Experiments demonstrate that this unified framework achieves stronger results on both representations than when modeling either formalism alone. Code is available at https://github.com/neulab/neural-lpcfg.

How Can We Know What Language Models Know?

  • Authors: Zhengbao Jiang, Frank F. Xu, Jun Araki, Graham Neubig
  • Time: Wednesday, Nov 18, 23:00 (UTC+0) Zoom Q&A Session 15D: Language Generation

How Can We Know What Language Models Know?

Recent work has presented intriguing results examining the knowledge contained in language models (LM) by having the LM fill in the blanks of prompts such as “Obama is a _ by profession”. These prompts are usually manually created, and quite possibly sub-optimal; another prompt such as “Obama worked as a _” may result in more accurately predicting the correct profession. Because of this, given an inappropriate prompt, we might fail to retrieve facts that the LM does know, and thus any given prompt only provides a lower bound estimate of the knowledge contained in an LM. In this paper, we attempt to more accurately estimate the knowledge contained in LMs by automatically discovering better prompts to use in this querying process. Specifically, we propose mining-based and paraphrasing-based methods to automatically generate high-quality and diverse prompts, as well as ensemble methods to combine answers from different prompts. Extensive experiments on the LAMA benchmark for extracting relational knowledge from LMs demonstrate that our methods can improve accuracy from 31.1% to 39.6%, providing a tighter lower bound on what LMs know. We have released the code and the resulting LM Prompt And Query Archive (LPAQA) at https://github.com/jzbjyb/LPAQA.
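
The prompt-ensembling idea is easy to sketch: query the same fact with several prompts and average the resulting answer distributions. In the sketch below, `score_answers` is a stand-in for a real LM call, and the example prompts are illustrative assumptions.

```python
# Sketch of uniform-weight prompt ensembling. `score_answers(prompt)`
# is assumed to return {candidate answer: probability} from some LM.
from collections import defaultdict

PROMPTS = ["{x} is a {y} by profession.",
           "{x} worked as a {y}.",
           "The profession of {x} is {y}."]

def ensemble_query(subject, score_answers):
    totals = defaultdict(float)
    for template in PROMPTS:
        prompt = template.format(x=subject, y="[MASK]")
        for answer, prob in score_answers(prompt).items():
            totals[answer] += prob / len(PROMPTS)
    return max(totals, key=totals.get)   # highest averaged probability
```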

Findings Papers

Improving Target-side Lexical Transfer in Multilingual Neural Machine Translation

  • Authors: Luyu Gao, Xinyi Wang, Graham Neubig

Improving Target-side Lexical Transfer in Multilingual Neural Machine Translation

To improve the performance of Neural Machine Translation (NMT) for low-resource languages (LRLs), one effective strategy is to leverage parallel data from a related high-resource language (HRL). However, multilingual data has been found more beneficial for NMT models that translate from the LRL to a target language than for ones that translate into the LRL. In this paper, we aim to improve the effectiveness of multilingual transfer for NMT models that translate into the LRL by designing a better decoder word embedding. Extending upon Soft Decoupled Encoding (Wang et al., 2019), a general-purpose multilingual encoding method, we propose DecSDE, an efficient character n-gram based embedding specifically designed for the NMT decoder. Our experiments show that DecSDE leads to consistent gains of up to 1.8 BLEU on translation from English to four different languages.
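
A character n-gram word embedding in the spirit of SDE/DecSDE can be sketched as follows; the hashing trick, n-gram range, and dimensions are assumptions rather than the paper's exact design, which adds further machinery for efficient decoding.

```python
# Sketch: a word's embedding is the mean of hashed character n-gram
# embeddings, letting related languages share subword parameters.
import torch
import torch.nn as nn

class CharNgramEmbedding(nn.Module):
    def __init__(self, n_buckets=10000, dim=128, n_min=1, n_max=4):
        super().__init__()
        self.table = nn.Embedding(n_buckets, dim)
        self.n_buckets, self.n_min, self.n_max = n_buckets, n_min, n_max

    def forward(self, word):
        padded = f"<{word}>"                      # boundary markers
        ids = [hash(padded[i:i + n]) % self.n_buckets
               for n in range(self.n_min, self.n_max + 1)
               for i in range(len(padded) - n + 1)]
        return self.table(torch.tensor(ids)).mean(dim=0)

print(CharNgramEmbedding()("translation").shape)   # torch.Size([128])
```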

Weakly- and Semi-supervised Evidence Extraction

  • Authors: Danish Pruthi, Bhuwan Dhingra, Graham Neubig, Zachary C. Lipton

Weakly- and Semi-supervised Evidence Extraction

For many prediction tasks, stakeholders desire not only predictions but also supporting evidence that a human can use to verify its correctness. However, in practice, additional annotations marking supporting evidence may only be available for a minority of training examples (if available at all). In this paper, we propose new methods to combine few evidence annotations (strong semi-supervision) with abundant document-level labels (weak supervision) for the task of evidence extraction. Evaluating on two classification tasks that feature evidence annotations, we find that our methods outperform baselines adapted from the interpretability literature to our task. Our approach yields substantial gains with as few as a hundred evidence annotations. Code and datasets to reproduce our work are available at https://github.com/danishpruthi/evidence-extraction.
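
One natural way to combine the two signals is a joint objective: every example contributes a document-label loss, and only the small annotated subset adds a token-level evidence loss. Below is a minimal sketch; the mixing weight and model internals are assumptions, not the paper's exact objective.

```python
# Sketch of a weak+strong joint loss: document classification on all
# examples, plus token-level evidence tagging where annotations exist.
import torch.nn.functional as F

def joint_loss(doc_logits, doc_label, tok_logits=None, evidence_mask=None, lam=1.0):
    loss = F.cross_entropy(doc_logits, doc_label)        # weak supervision
    if evidence_mask is not None:                        # strong, when available
        loss = loss + lam * F.binary_cross_entropy_with_logits(
            tok_logits, evidence_mask.float())
    return loss
```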

Why and when should you pool? Analyzing Pooling in Recurrent Architectures

  • Authors: Pratyush Maini, Keshav Kolluru, Danish Pruthi, Mausam

Why and when should you pool? Analyzing Pooling in Recurrent Architectures

Pooling-based recurrent neural architectures consistently outperform their counterparts without pooling. However, the reasons for their enhanced performance are largely unexamined. In this work, we examine three commonly used pooling techniques (mean-pooling, max-pooling, and attention), and propose max-attention, a novel variant that effectively captures interactions among predictive tokens in a sentence. We find that pooling-based architectures substantially differ from their non-pooling equivalents in their learning ability and positional biases, which elucidates their performance benefits. By analyzing gradient propagation, we discover that pooling facilitates better gradient flow than BiLSTMs without pooling. Further, we expose how BiLSTMs are positionally biased towards tokens at the beginning and end of a sequence; pooling alleviates such biases. Consequently, we identify settings where pooling offers large benefits: (i) in low-resource scenarios, and (ii) when important words lie towards the middle of the sentence. Among the pooling techniques studied, max-attention is the most effective, resulting in significant performance gains on several text classification tasks.
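
The three standard pooling operators the paper examines are compact enough to show directly. The sketch below implements mean, max, and attention pooling over a batch of BiLSTM hidden states; max-attention itself is not reproduced here, and the single-vector attention scorer is an assumption.

```python
# Sketch of mean/max/attention pooling over hidden states of shape
# (batch, time, dim), e.g. the outputs of a BiLSTM.
import torch
import torch.nn as nn

class Pooler(nn.Module):
    def __init__(self, dim, kind="attention"):
        super().__init__()
        self.kind = kind
        self.scorer = nn.Linear(dim, 1, bias=False)   # used by attention only

    def forward(self, states):                        # (B, T, D) -> (B, D)
        if self.kind == "mean":
            return states.mean(dim=1)
        if self.kind == "max":
            return states.max(dim=1).values
        weights = torch.softmax(self.scorer(states), dim=1)   # (B, T, 1)
        return (weights * states).sum(dim=1)          # attention-weighted sum
```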

CDEvalSumm: An Empirical Study of Cross-Dataset Evaluation for Neural Summarization Systems

  • Authors: Yiran Chen, Pengfei Liu, Ming Zhong, Zi-Yi Dou, Danqing Wang, Xipeng Qiu, Xuanjing Huang

CDEvalSumm: An Empirical Study of Cross-Dataset Evaluation for Neural Summarization Systems

Neural network-based models augmented with unsupervised pre-trained knowledge have achieved impressive performance on text summarization. However, most existing evaluation methods are limited to an in-domain setting, where summarizers are trained and evaluated on the same dataset. We argue that this approach can narrow our understanding of the generalization ability of different summarization systems. In this paper, we perform an in-depth analysis of the characteristics of different datasets and investigate the performance of different summarization models under a cross-dataset setting, in which a summarizer trained on one corpus is evaluated on a range of out-of-domain corpora. A comprehensive study of 11 representative summarization systems on 5 datasets from different domains reveals the effect of model architectures and generation paradigms (i.e., abstractive and extractive) on model generalization ability. Further, experimental results shed light on the limitations of existing summarizers. A brief introduction and supplementary code are available online.