Breck et al, IJCAI 2007
Title
Identifying Expressions of Opinion in Context
Link
The paper can be found here.
Summary
This paper addresses the problem of identifying opinion expressions in text. Following previous work, the authors distinguish two kinds of opinion expressions. Direct subjective expressions (DSEs) are spans of text that explicitly express an attitude or opinion. In contrast, expressive subjective elements (ESEs) are spans of text that indicate a degree of subjectivity on the part of the speaker. The authors treat the identification of opinion expressions as a tagging task and use conditional random fields (CRFs) to solve it.
Identifying opinion expressions in text is difficult for several reasons. Expressions vary in length from one to twenty words. They may be verb phrases, noun phrases, or strings of words that do not correspond to a single linguistic constituent. Whether an expression is subjective can also depend on its context. The authors used IO encoding: each token in the text is labeled as either inside a subjective opinion expression ('I') or outside one ('O'). A run of consecutive 'I'-tagged tokens is read as one contiguous opinion expression.
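To make the IO encoding concrete, here is a minimal sketch of how a sequence of I/O tags can be decoded back into contiguous spans. The function name `io_to_spans`, the example sentence, and its tags are illustrative assumptions and do not come from the paper.

```python
from typing import List, Tuple


def io_to_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Convert IO tags into (start, end) token spans, with end exclusive."""
    spans = []
    start = None  # index where the current run of 'I' tags began, if any
    for i, tag in enumerate(tags):
        if tag == "I" and start is None:
            start = i                 # a new expression begins here
        elif tag != "I" and start is not None:
            spans.append((start, i))  # the run of 'I' tags just ended
            start = None
    if start is not None:             # an expression runs to the end of the text
        spans.append((start, len(tags)))
    return spans


# Hypothetical example sentence and tags, for illustration only.
tokens = "The minister strongly criticized the new policy".split()
tags = ["O", "O", "I", "I", "O", "O", "O"]

spans = io_to_spans(tags)
expressions = [" ".join(tokens[s:e]) for s, e in spans]
print(spans, expressions)
```

One consequence of this encoding is that two opinion expressions that are directly adjacent cannot be told apart: their 'I' tags merge into a single run and are decoded as one span.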
The various features used in the model were:
- Lexical Features: A window of four words to the left and four to the right of the current token was used, with a separate feature for each position in the window. This gives about 18,000 features (the vocabulary size) per position.
- Syntactic Features: The POS tag of the token was used as a feature, as were the types of the current, previous, and next syntactic constituents given by a parser.
- Dictionary-Based Features: A WordNet-based feature encoded the synsets that are hypernyms of the current token, for a total of 29,989 features, many of which may be active (set to one) for a given token. A verb categorization feature derived the verb type from FrameNet. Strong and weak subjectivity cues were also used as features.
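The lexical window features can be sketched as follows. This is only an illustration of the idea, not the authors' implementation: the function name `window_features`, the feature key format, the lowercasing, and the `<PAD>` symbol for positions outside the sentence are all assumptions.

```python
from typing import Dict, List


def window_features(tokens: List[str], i: int, size: int = 4) -> Dict[str, str]:
    """Return one word feature per window position around token i.

    Positions run from -size to +size, with 0 being the current token.
    """
    feats = {}
    for offset in range(-size, size + 1):
        j = i + offset
        # Positions that fall outside the sentence get a padding symbol (assumption).
        word = tokens[j].lower() if 0 <= j < len(tokens) else "<PAD>"
        feats[f"w[{offset:+d}]"] = word
    return feats


# Hypothetical example sentence, for illustration only.
tokens = "The minister strongly criticized the new policy".split()
feats = window_features(tokens, 0)
print(feats)
```

Each key/value pair corresponds to one binary indicator in the CRF, which is why the number of possible features per window position is on the order of the vocabulary size.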
Like the penultimate paper I summarized, this paper used the MPQA corpus for its experiments. The corpus contains 535 newswire documents with a variety of annotations. All the DSEs and ESEs in the documents were manually annotated. 135 documents were used for development, and 5-fold cross-validation was used to evaluate the system. Precision, recall, and F-measure were the evaluation measures. As a baseline, two dictionaries of subjectivity clues identified in previous work were used to mark tokens that match a dictionary entry as part of a subjective opinion expression.
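For reference, precision, recall, and F-measure over predicted spans can be computed as in the sketch below. It assumes that a predicted span counts as correct only when it exactly matches a gold span; this summary does not say which matching criterion the paper used, so treat that choice, the function name, and the example spans as assumptions.

```python
from typing import Set, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def precision_recall_f1(predicted: Set[Span], gold: Set[Span]) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 under exact span matching (assumption)."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold and predicted spans, for illustration only.
gold = {(2, 4), (7, 8), (10, 13)}
predicted = {(2, 4), (10, 13), (15, 16), (20, 22)}

p, r, f = precision_recall_f1(predicted, gold)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

A looser criterion, such as counting any overlap between a predicted and a gold span as a match, would only change how `correct` is computed.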
The results show that the system outperforms the baseline for both DSEs and ESEs.
This paper is interesting because its results come within 5% of human inter-annotator agreement. It also points toward non-factoid question answering systems that could answer questions such as who expressed what kind of opinion about a given subject.