Non-Negative Sparse Embedding

This is the webpage for NNSE (Non-Negative Sparse Embedding) - a semantic representation scheme that derives interpretable and cognitively plausible word representations from massive web corpora. The model matrices and papers can be downloaded below.

The main features of this vector space model (aka word embedding, aka distributional semantics model) are:
- state-of-the-art performance on cognitive modelling tasks
- individual features that are human-interpretable
- compact representations with domain-specific features
- a mixture of topical and taxonomic semantics (from LSA/LDA-style document co-occurrences, and HAL-style local dependency co-occurrences)
In brief, the model is derived in an unsupervised way from ~10 million documents and ~15 billion words of web text (from the Clueweb collection). MALT dependency co-occurrences (target word - dependency - head/dependent) are collated (applying a frequency cutoff), adjusted with positive pointwise mutual information (PPMI) to normalise for word and feature frequencies, and reduced in dimensionality with sparse SVD methods. In parallel, document co-occurrence counts (LSA/LDA style) are similarly collated, PPMI adjusted, and sparse SVD reduced. The union of these inputs is factorised again using Non-Negative Sparse Embedding, a variation on Non-Negative Sparse Coding.
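The preprocessing just described can be sketched with standard Python scientific libraries. The snippet below is only an illustration under assumptions of my own: the toy count matrix stands in for the real dependency/document co-occurrence counts, SciPy's truncated svds stands in for whatever sparse SVD implementation was actually used, and scikit-learn's dictionary learning with a non-negativity constraint on the codes approximates the NNSE factorisation rather than reproducing the paper's solver.

    # Illustrative sketch of the pipeline above: PPMI weighting, truncated
    # SVD, then a non-negative sparse factorisation. The toy `counts`
    # matrix and all parameter values are placeholders, not the paper's.
    import numpy as np
    from scipy import sparse
    from scipy.sparse.linalg import svds
    from sklearn.decomposition import MiniBatchDictionaryLearning

    def ppmi(counts):
        """Positive pointwise mutual information weighting of a count matrix."""
        coo = sparse.coo_matrix(counts, dtype=np.float64)
        total = coo.sum()
        row_sums = np.asarray(coo.sum(axis=1)).ravel()   # word marginals
        col_sums = np.asarray(coo.sum(axis=0)).ravel()   # context marginals
        # PMI = log( p(w,c) / (p(w) p(c)) ), with negative values clipped to zero
        pmi = np.log((coo.data * total) / (row_sums[coo.row] * col_sums[coo.col]))
        pmi = np.maximum(pmi, 0.0)
        return sparse.csr_matrix((pmi, (coo.row, coo.col)), shape=coo.shape)

    # toy stand-in for the real word-by-feature co-occurrence counts
    counts = sparse.random(1000, 2000, density=0.01, format="csr", random_state=0)
    counts.data = np.rint(counts.data * 20) + 1.0

    X = ppmi(counts)                         # PPMI-adjusted co-occurrences
    u, s, vt = svds(X, k=100)                # truncated SVD (paper-scale: ~1000 dims)
    reduced = u * s                          # dense words x k input to the NNSE step

    # NNSE step, approximated here by dictionary learning with non-negative
    # codes; the paper's own online NNSE solver is not reproduced.
    nnse = MiniBatchDictionaryLearning(n_components=30, positive_code=True,
                                       alpha=1.0, random_state=0)
    A = nnse.fit_transform(reduced)          # sparse, non-negative word codes
    D = nnse.components_                     # dictionary (dimensions x k)

At paper scale the rows would correspond to the ~35,000-word vocabulary, and the non-negative code matrix A would play the role of the downloadable word representations.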
The result is a model in which a relatively compact set of feature dimensions (typically in the hundreds) can be used to describe all the words in a typical adult-scale vocabulary (here approximated with a list of ~35,000 frequent words of American English). The representation of a single word is sparse and disjoint - e.g. a typical concrete noun in the 300-dimension model might use only 30 of the features, and these features would be mostly disjoint from those used by other word types (e.g. abstract nouns, verbs, function words). Within the space, words should have both taxonomic neighbours (e.g. judge is near to referee) and topical neighbours (e.g. judge is near to prison).
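To make the idea of taxonomic and topical neighbours concrete, here is a minimal cosine nearest-neighbour lookup; the `words`/`vectors` names and the toy data are illustrative placeholders, with real values coming from one of the downloadable models below.

    # Minimal cosine nearest-neighbour lookup over NNSE word vectors.
    # `words` and `vectors` are illustrative placeholders; in practice they
    # would be read from one of the downloadable model files.
    import numpy as np

    def nearest(word, words, vectors, k=5):
        """Return the k words most cosine-similar to `word`."""
        idx = words.index(word)
        unit = vectors / (np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-12)
        sims = unit @ unit[idx]
        order = np.argsort(-sims)
        return [(words[i], float(sims[i])) for i in order if i != idx][:k]

    # toy data, just to make the sketch runnable
    words = ["judge", "referee", "prison", "banana"]
    vectors = np.array([[0.5, 0.0, 0.3],
                        [0.6, 0.0, 0.0],
                        [0.1, 0.0, 0.4],
                        [0.0, 0.7, 0.0]])
    print(nearest("judge", words, vectors, k=2))  # referee (taxonomic), prison (topical)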
The features can also be interpreted, and often encode prominent aspects of meaning, such as taxonomic categories, topical associations and word senses/usages. Here are a couple of examples, giving the most prominent semantic dimensions for a word, and characterising each of those dimensions in turn by its most prominent word-members.
Representation for apple

Weight | Top Words (per weighted dimension)
0.40   | raspberry, peach, pear, mango, melon
0.26   | ripper, aac, converter, vcd, rm
0.14   | cpu, intel, mips, pentium, risc
0.13   | motorola, lg, samsung, vodafone, alcatel
0.11   | peaches, apricots, pears, cherries, blueberries
Representation for motorbike

Weight | Top Words (per weighted dimension)
0.69   | bike, mtb, bikes, harley, motorcycle
0.35   | canoe, raft, scooter, kayak, skateboard
0.15   | sedan, dealership, dealerships, dealer, convertible
0.10   | attorney, malpractice, lawyer, attorneys, lawyers
0.08   | earnhardt, speedway, irl, indy, racing
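Tables like these can be produced directly from a downloaded model matrix. The sketch below is a hedged illustration: it assumes the same `words`/`vectors` layout as the earlier snippet (none of these names appear in the original files), listing a word's highest-weighted dimensions and labelling each by its top-scoring member words.

    # Sketch of how the per-word tables above could be produced: take a
    # word's highest-weighted dimensions and label each by its strongest
    # member words. `words`/`vectors` are placeholders as in the earlier sketch.
    import numpy as np

    def describe(word, words, vectors, n_dims=5, n_members=5):
        """Top dimensions for `word`, each labelled by its strongest member words."""
        vec = vectors[words.index(word)]
        rows = []
        for d in np.argsort(-vec)[:n_dims]:              # highest-weighted dimensions
            members = np.argsort(-vectors[:, d])[:n_members]
            rows.append((float(vec[d]), [words[i] for i in members]))
        return rows

    # toy data, just to make the sketch runnable
    words = ["apple", "pear", "intel", "samsung"]
    vectors = np.array([[0.40, 0.14, 0.13],
                        [0.50, 0.00, 0.00],
                        [0.00, 0.60, 0.00],
                        [0.00, 0.00, 0.50]])
    print(describe("apple", words, vectors, n_dims=2, n_members=2))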
Download

Several of the models used in the paper can be downloaded below as zipped plain-text files. Each line is a tab-delimited word entry - the first field is the word token, and the following fields are values in a fixed number of semantic feature dimensions. Comment lines starting with '#' can be ignored (a minimal loading sketch follows the list below).
- Full document and dependency model, NNSE reduced [number of output dimensions: 50 | 300 | 1000 | 2500]
- Dependency model (taxonomic relatedness), NNSE reduced [number of output dimensions: 300]
- Document model (topical relatedness), NNSE reduced [number of output dimensions: 300]
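A minimal reader for files in this format might look as follows; the file name is a placeholder, and the parsing assumes exactly the layout described above (word token, then tab-separated feature values, with '#' comment lines skipped).

    # Minimal loader for the tab-delimited model files described above.
    # "nnse_300.txt" is a placeholder name, not an actual download.
    import numpy as np

    def load_model(path):
        words, rows = [], []
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.startswith("#") or not line.strip():
                    continue                          # skip comments and blank lines
                fields = line.rstrip("\n").split("\t")
                words.append(fields[0])               # first field: word token
                rows.append([float(x) for x in fields[1:]])
        return words, np.array(rows)

    words, vectors = load_model("nnse_300.txt")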
References
More details of the scheme are given in this paper:

Brian Murphy, Partha Talukdar and Tom Mitchell, 2012: Learning Effective and Interpretable Semantic Models using Non-Negative Sparse Embedding. International Conference on Computational Linguistics (COLING 2012), Mumbai, India. [Paper]

... and there is further background in:

Brian Murphy, Partha Talukdar and Tom Mitchell, 2012: Selecting Corpus-Semantic Models for Neurolinguistic Decoding. Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM), Montreal, pages 114-123. [Paper]
Feel free to e-mail with comments or questions: brianmurphy@cmu.edu.