A key challenge in natural language processing is defining the
computational representation of words. Data-driven distributional
approaches use corpora to induce vector-space representations for
words, based on the contexts they occur in. This project goes beyond
traditional approaches (e.g., latent semantic analysis; Deerwester et
al., 1990), which define a word's context solely as the words that
tend to occur near it in corpora, by extending the types of contexts
used to construct semantic vectors. First, this project incorporates
translation contexts, i.e., words readily available in multilingual
parallel corpora, alongside traditional monolingual corpora. This
allows evidence-sharing across languages, most importantly from
resource-rich languages with large corpora to more resource-poor
languages. Second, this project incorporates social context inferable
from social network platforms, captured through author, time,
geographic, and social connection metadata. Taken together, these
additional features give a broader definition of a word's context and
yield a more unified distributional model of human language, moving
toward a language-independent semantics. The project focuses on ten
typologically diverse languages representing several major language
families (English, Arabic, Chinese, Spanish, Russian, German,
Portuguese, Swahili, Malagasy, and Farsi). A key emphasis is scaling
up algorithms for inferring distributional representations to
web-scale corpora and dealing with much larger contextual vectors
representing the expanded notion of context. The approach also
leverages noisy syntactic processing, so that syntactic relations,
not just neighboring words, can be used in defining context.
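The expanded-context idea above can be sketched in miniature: build an LSA-style word-by-context co-occurrence matrix from a monolingual corpus, append extra columns of translation contexts drawn from word-aligned parallel data, and reduce the combined matrix with a truncated SVD. The corpus, alignments, window size, and dimensionality below are all invented toy values for illustration, not the project's actual data or algorithm.

```python
# Sketch: LSA-style vectors over an expanded context matrix.
# Monolingual co-occurrence columns are augmented with "translation
# context" columns from a toy word-aligned parallel corpus.
import numpy as np

# Toy English corpus and hypothetical English-Spanish word alignments.
corpus = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "the mouse ate the cheese",
]
alignments = [("cat", "gato"), ("dog", "perro"),
              ("mouse", "raton"), ("cat", "gato")]

vocab = sorted({w for line in corpus for w in line.split()})
v_idx = {w: i for i, w in enumerate(vocab)}

# Monolingual context: co-occurrence counts in a +/-1 word window.
mono = np.zeros((len(vocab), len(vocab)))
for line in corpus:
    words = line.split()
    for i, w in enumerate(words):
        for j in (i - 1, i + 1):
            if 0 <= j < len(words):
                mono[v_idx[w], v_idx[words[j]]] += 1

# Translation context: counts of aligned foreign words, as new columns.
foreign = sorted({f for _, f in alignments})
f_idx = {f: i for i, f in enumerate(foreign)}
trans = np.zeros((len(vocab), len(foreign)))
for en, es in alignments:
    trans[v_idx[en], f_idx[es]] += 1

# Concatenate the two context types; reduce with truncated SVD (as in LSA).
contexts = np.hstack([mono, trans])
U, S, _ = np.linalg.svd(contexts, full_matrices=False)
k = 3
vectors = U[:, :k] * S[:k]  # k-dimensional word vectors

def similarity(a, b):
    """Cosine similarity between two words' reduced vectors."""
    va, vb = vectors[v_idx[a]], vectors[v_idx[b]]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
```

Words sharing both monolingual neighbors and translation contexts (e.g., "cat" and "dog" here) end up with similar vectors; at web scale the same construction requires sparse matrices and scalable factorization rather than a dense SVD.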
In addition to improving the quality of the learned lexico-semantic
representations by including richer contextual information, this
project creates lexical semantic representations that link word types
across languages. These have direct use in text processing
applications such as text categorization, machine translation,
information extraction, and semantic analysis of text, and they will
enable the construction of robust lexical semantic resources in
lower-resource languages, which benefit from the richness of resources
in the languages they are paired with. The multilingual vector
representations produced will be released to the research community
and will be used in undergraduate class projects. The project
supports the education of two graduate students in a dynamic research
environment.