3. DETECTION
The CMU topic detection system for TDT-2 was largely based on
previous retrospective and online detection systems used in the TDT-1
pilot study [5,6], since the new definition of the detection task
encompasses the earlier versions of the tasks. In particular, we used
the vector space model (VSM) to represent documents as weighted
unigram models. We also used temporally-sensitive versions of our
incremental and hierarchical group-average (GAC) clustering algorithms
to detect new topics within the three deferral periods.
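As a rough illustration of this representation (not the exact weighting
or tokenization used in our system), the following sketch builds
TF-IDF-weighted unigram vectors and compares them by cosine similarity:

    import math
    from collections import Counter

    def tfidf_vector(tokens, idf):
        """Weighted unigram (bag-of-words) vector: term frequency times IDF."""
        tf = Counter(tokens)
        return {term: count * idf.get(term, 0.0) for term, count in tf.items()}

    def cosine(u, v):
        """Cosine similarity between two sparse vectors stored as dicts."""
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0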
The operation of the system is fairly straightforward. First, the
inverse-document frequencies for all terms are initialized using a
training corpus. As new stories "arrive" in the deferral window, they
are used to update the IDF statistics. Then, the entire deferral
window is clustered using GAC with fractionation [7], an efficient
greedy agglomerative clustering algorithm. This method divides the
incoming data stream into sequential buckets, clustering by cosine
similarity within each bucket, and then merging clusters among buckets.
Additional shuffling and reclustering among stories in adjacent buckets
improves performance. A fortunate property of fractionation lies in
the way it achieves its efficiency: by dividing the data, performing a
partial clustering, and repeating. By using the time of stories as the
criterion for these divisions, we can prefer associating stories that
occur nearby in time. Preferring temporally proximate stories in this
way provided a substantial performance increase in earlier TDT work.
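The sketch below outlines this clustering scheme under simplifying
assumptions: the bucket size, the merge threshold, and the naive
agglomeration loop are illustrative placeholders, and the shuffling and
reclustering of stories in adjacent buckets is omitted. It reuses the
cosine() helper from the earlier sketch.

    def centroid(vectors):
        """Average of a list of sparse vectors (dicts)."""
        total = {}
        for vec in vectors:
            for term, w in vec.items():
                total[term] = total.get(term, 0.0) + w
        return {term: w / len(vectors) for term, w in total.items()}

    def agglomerate(clusters, threshold):
        """Greedy group-average agglomeration: repeatedly merge the most
        similar pair of clusters (by centroid cosine) above the threshold."""
        clusters = [list(c) for c in clusters]
        while len(clusters) > 1:
            best, pair = threshold, None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    sim = cosine(centroid(clusters[i]), centroid(clusters[j]))
                    if sim > best:
                        best, pair = sim, (i, j)
            if pair is None:
                break
            i, j = pair
            clusters[i].extend(clusters.pop(j))
        return clusters

    def gac_fractionation(story_vectors, bucket_size=100, threshold=0.25):
        """Fractionation: cluster within sequential (time-ordered) buckets,
        then merge the resulting clusters across buckets."""
        buckets = [story_vectors[i:i + bucket_size]
                   for i in range(0, len(story_vectors), bucket_size)]
        local = []
        for bucket in buckets:
            local.extend(agglomerate([[v] for v in bucket], threshold))
        return agglomerate(local, threshold)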
The TDT task requires the identification of new topics prior to the
expiration of the deferral window. In other words, decisions as to
whether individual stories belong to existing topics or mark the
start of a new one must be made without waiting for new data beyond
the deferral window. These decisions are made by cluster-to-cluster
similarity comparisons between each cluster in the deferral window
and past topic clusters prior to the deferral period. Specifically:
sim(C1, C2) = cos(centroid(C1), centroid(C2))
NEW(C1)  iff  max_i sim(C1, Ci) < THRESHOLD
If the maximum similarity is above a threshold (optimized on the
training corpus), then the cluster is deemed to be an extension of the
most similar existing topic. Otherwise, it is deemed to be the onset
of a new topic. In either case, the cluster (the extended old topic or
the newly begun topic) becomes part of the past as the deferral window
is moved forward in time.
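A minimal sketch of this decision rule, reusing the centroid() and
cosine() helpers from the sketches above (the threshold itself would be
tuned on the training corpus):

    def is_new_topic(window_cluster, past_topic_clusters, threshold):
        """A deferral-window cluster begins a NEW topic iff its centroid is
        not sufficiently similar to the centroid of any past topic cluster."""
        c = centroid(window_cluster)
        best = max((cosine(c, centroid(t)) for t in past_topic_clusters),
                   default=0.0)
        return best < threshold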
As an improvement, we implemented a form of temporal decay in the
cluster-to-cluster similarity function. At first, we used the date of
each cluster's earliest story as its timestamp, but this led to a high
miss rate (0.3526 story-weighted, to be specific). Changing the
timestamp to that of the most recent story (to penalize less those
topics that continue to be reported) significantly improved our
results, dropping our miss rate to 0.0765 while increasing the
false-alarm rate by a much smaller amount (0.0013 vs. 0.0004).
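The exact decay function is not spelled out here; the sketch below
assumes a simple exponential decay on the gap between cluster
timestamps, with each cluster stamped by its most recent story:

    def decayed_similarity(c1_vec, c1_days, c2_vec, c2_days,
                           half_life_days=30.0):
        """Hypothetical temporal decay: down-weight centroid similarity by
        the gap between cluster timestamps.  The exponential form and the
        half-life value are illustrative assumptions."""
        gap = abs(c1_days - c2_days)           # timestamp gap, in days
        decay = 0.5 ** (gap / half_life_days)
        return cosine(c1_vec, c2_vec) * decay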
We found that incrementally-updated IDF statistics allowed larger
events to be followed through their evolution as later reports
introduced new content, essentially accommodating topic drift. This
becomes important in larger events, where knowledge about the event,
the terminology used to describe it, and related developments can
change substantially from the initial reports as the event matures.
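A minimal sketch of incrementally-updated IDF statistics (the smoothing
in the IDF formula is an assumption, not necessarily the one used by
our system):

    import math

    class IncrementalIDF:
        """IDF statistics initialized on a training corpus and updated as new
        stories arrive, letting term weights track evolving vocabulary."""

        def __init__(self, training_docs):
            self.n_docs = 0
            self.doc_freq = {}
            for doc in training_docs:
                self.add(doc)

        def add(self, tokens):
            """Fold one newly arrived story into the document-frequency counts."""
            self.n_docs += 1
            for term in set(tokens):
                self.doc_freq[term] = self.doc_freq.get(term, 0) + 1

        def idf(self, term):
            # Add-one smoothing is an illustrative choice, not the exact formula.
            return math.log((self.n_docs + 1) / (self.doc_freq.get(term, 0) + 1))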
Our methods appear to degrade gracefully under imperfect transcription
and boundary conditions. Machine-recognized speech increased the
number of missed and false-alarm documents only minimally, even though
word-error rates (WER) approached 25%. Initial tests with
automatically determined story boundaries incurred more degradation,
but still less than anticipated. Further tests are required to
quantify these effects more reliably, but the results reported in
Tables 1 and 2 below provide good evidence of robust performance.
We expect that performance would be improved by treating incoming
stories differently based on source, as will be discussed in Section 6.
Also, editorial-style articles, regardless of source, may diverge
enough to be considered distinct from a group of "just-the-facts"
reports on the same topic. Both of these behaviors have been observed,
and we hope to address them in future revisions.
Source Condition | Segmented? | Deferral Period | Story-Weighted Cdet | Topic-Weighted Cdet
ASR              | Yes        | 10 files        | 0.0075              | 0.0057
ASR              | Yes        | 1 file          | 0.0092              | 0.0060
Transcript       | Yes        | 10 files        | 0.0068              | 0.0049
Transcript       | Yes        | 1 file          | 0.0089              | 0.0057
ASR              | No         | 10 files        | 0.0107              | 0.0076

Table 1: Detection Results.
Source Condition | Segmented? | Deferral Period | Story-Weighted Cdet | Topic-Weighted Cdet
ASR              | Yes        | 10 files        | 0.0028              | 0.0042

Table 2: Post-Submission Improved Detection Results.
4. TRACKING
We used two methods for topic tracking, k-nearest neighbors (kNN) and
decision-tree induction. kNN is an instance-based classification
method which has previously been applied to text categorization (by
Yang et al. [8]), with generally good performance.
A requirement in official TDT evaluations is that each event be tracked
independently, without any knowledge about other events. According
to this constraint, we adapted the conventional M-ary kNN classifier
we had developed for general text categorization and trained one kNN
classifier per topic. Essentially, the system converts an input story
into a vector, compares it to the training stories, and selects the k
nearest neighbors based on the cosine similarity between the input
story and the training stories. The confidence score for a YES
prediction on the input story is computed by summing the similarity
scores of the positive and negative stories, respectively, in the
k-neighborhood, and checking whether the difference between the two
sums is above a threshold (typically 0, unless we want to introduce a
bias towards avoiding false negatives or avoiding false positives).
The scores for all test events are also recorded to compute
Decision-Error Tradeoff (DET) curves.
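A sketch of this per-topic scoring (kNN-a); the value of k is
illustrative, training stories are assumed to be (vector, on-topic?)
pairs, and cosine() is the helper from Section 3:

    def knn_a_score(story_vec, training, k=10):
        """kNN-a: take the k training stories most similar to the input story
        and score YES-confidence as the sum of similarities to positive
        neighbors minus the sum of similarities to negative neighbors."""
        neighbors = sorted(((cosine(story_vec, vec), on_topic)
                            for vec, on_topic in training), reverse=True)[:k]
        pos = sum(sim for sim, on_topic in neighbors if on_topic)
        neg = sum(sim for sim, on_topic in neighbors if not on_topic)
        return pos - neg   # predict YES if this exceeds the threshold (typically 0)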
Since positive instances are sparse (for most topics) in the training
set, it is difficult to achieve greater recall (fewer misses) without
significantly sacrificing precision (more false alarms). One solution
to this problem is to discount the influence of negative examples by
sampling a small portion in the k-neighborhood, and ignoring the
remaining examples. This idea leads to a modified version of kNN; for
distinction, we refer to the original version as kNN-a, and the
modified version as kNN-b. In the modified version, we take the k1
(less than k) nearest positive examples P(x, k1) and the k2 (less than
k) nearest negative examples N(x, k2) from the k-neighborhood, and
average the similarity scores of the two subsets separately. The
confidence score for a YES prediction on the input story is computed
as an interpolated value between these two quantities. For the
purpose of the official evaluation, however, we did not use different
k1 and k2 parameters.
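A sketch of kNN-b under these definitions; the interpolation form and
the values of k, k1, k2, and the weight alpha are assumptions for
illustration:

    def knn_b_score(story_vec, training, k=10, k1=3, k2=3, alpha=0.5):
        """kNN-b: within the k-neighborhood, average the similarities of the
        k1 nearest positive and the k2 nearest negative examples, then
        interpolate between the two averages."""
        neighbors = sorted(((cosine(story_vec, vec), on_topic)
                            for vec, on_topic in training), reverse=True)[:k]
        pos = [sim for sim, on_topic in neighbors if on_topic][:k1]
        neg = [sim for sim, on_topic in neighbors if not on_topic][:k2]
        avg_pos = sum(pos) / len(pos) if pos else 0.0
        avg_neg = sum(neg) / len(neg) if neg else 0.0
        return alpha * avg_pos - (1.0 - alpha) * avg_neg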
The Decision Tree (DTree) tracking method also uses binary
classifiers. DTree uses not only the words as features, but also a
number of meta-features such as presence of a word near the beginning
of a story, multiple occurrences of a word, highly-collocated bigrams,
and M-of-N thresholds. Further, the decision-tree induction can be
tuned through a variety of inductive-bias parameters such as the
maximum number of negative training instances, when to stop growing
the tree, and which meta-features to consider. A time-windowing
capability is available when performing the actual tracking with the
induced decision trees, but it was ineffective in this evaluation
because the data covered only two months of time -- in TDT-1 we had
found two months to be an appropriate size for the temporal window, so
a window of that size spans essentially the entire collection.
To create an M-of-N threshold meta-feature, DTree simply selects the N
top-ranked features (including non-threshold meta-features) and then
splits the collection of stories for the decision node by whether or
not they contain at least M of those N features, rather than whether
or not they contain the single top-ranked feature.
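A minimal sketch of such a split, with stories represented as sets of
their (meta-)features:

    def m_of_n_split(stories, ranked_features, m, n):
        """M-of-N threshold meta-feature: take the N top-ranked features and
        split the stories at this node by whether each contains at least M
        of them."""
        top_n = set(ranked_features[:n])
        passes = [s for s in stories if len(top_n & s) >= m]
        fails = [s for s in stories if len(top_n & s) < m]
        return passes, fails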
Bigram meta-features are selected based on the mutual conditional
probability of the two words. From all pairs of adjacent words, we
keep only those for which the occurrence probability of each word
given the other is above some pre-selected threshold. This list of
candidate bigrams is then further limited to a specified number of
bigram features by selecting the most likely from among the
candidates.
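A sketch of this selection procedure; the threshold, the candidate
limit, and the use of raw frequency to rank the "most likely"
candidates are assumptions:

    from collections import Counter

    def select_bigram_features(tokens, cond_threshold=0.2, max_bigrams=50):
        """Keep adjacent word pairs where the conditional probability of each
        word given the other exceeds a threshold, then trim to the most
        frequent candidates.  `tokens` is the word list of the training text."""
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        candidates = {}
        for (w1, w2), count in bigrams.items():
            if (count / unigrams[w1] >= cond_threshold and
                    count / unigrams[w2] >= cond_threshold):
                candidates[(w1, w2)] = count
        return sorted(candidates, key=candidates.get, reverse=True)[:max_bigrams]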
Because of the small number of positive training instances, the
information gain, which is the primary means of ordering features in
decision-tree induction, had to be augmented with additional measures
that serve as tie-breakers. If two (meta-)features have the same
information gain, then
- bigrams will be given priority over single words,
- words occurring near the beginning of the story will be given priority
over words occurring anywhere,
- simple words are given priority over multiple-occurrence meta-features,
and
- the feature with higher TF*IDF score is given priority.
Should none of the above tie-breakers prefer one word over another,
the final resort is to prefer longer words and then to sort words of
equal length alphabetically.
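One way to express this ordering is as a composite sort key; the field
names below are hypothetical, not the actual data structures used by
DTree:

    def tie_break_key(feat):
        """Composite ordering for features with equal information gain;
        smaller keys sort first.  `feat` is a dict describing one
        (meta-)feature with hypothetical field names."""
        return (
            -feat["info_gain"],                              # primary criterion
            0 if feat["kind"] == "bigram" else 1,            # bigrams first
            0 if feat.get("near_beginning") else 1,          # near-beginning words next
            0 if feat["kind"] != "multi_occurrence" else 1,  # simple words over multi-occurrence
            -feat["tfidf"],                                  # higher TF*IDF first
            -len(feat["word"]),                              # longer words first
            feat["word"],                                    # finally, alphabetical
        )

    # Usage: candidate_features.sort(key=tie_break_key)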
The DTree program has been optimized for speed when tracking multiple
events (especially when testing multiple values of Nt) on the same
story collection. The entire collection is read into memory and
indexed before tracking begins, trading off increased memory use
(depending on parameters, 3 to 4.5 times the size of the text without
SGML markup) against reduced run-time. The evaluation set was
processed at Nt=4 for all 21 evaluation events in just 4 minutes 31
seconds, using 140 megabytes of memory, and for Nt=1,2,3,4 in 8
minutes 25 seconds.
For the default evaluation, DTree used 300 negative training stories,
4-of-12 threshold features, the multiple-occurrence and near-beginning
(within 125 words) meta-features, and stopped growing the trees
whenever a node had at least 56% positive instances. The parameters
for the closed-captioning text and FDCH manual transcriptions were
similar, using 200 negative training stories in each case and stopping
at 30% and 32% positive instances for a node, respectively. All of these
parameters were set via cross-validation on the development set.
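For concreteness, these defaults could be collected in a parameter set
like the following; the names are illustrative, not the program's
actual option names:

    # Hypothetical grouping of the default DTree parameters described above.
    DTREE_DEFAULTS = {
        "negative_training_stories": 300,
        "threshold_meta_feature": (4, 12),      # 4-of-12 threshold features
        "use_multiple_occurrence": True,
        "near_beginning_window_words": 125,
        "stop_growth_positive_fraction": 0.56,
    }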
We found that on the TDT-2 collection, the following were advantageous,
improving Ctrack for DTree:
- preferring words near the beginning of the story
- threshold meta-features (greatly improve robustness)
The following did not improve performance:
- adaptive time-windowing (insufficient temporal extent in the collection)
- bigram features (increased the miss rate more than they reduced
false alarms)
- sets of words common to the positive training stories (often
increased both miss and false-alarm rates)
And the following varied, sometimes helping and sometimes hurting
performance:
- stemming the words of the story
- distinguishing between single and multiple occurrences of a word
Planned enhancements to DTree are 1) per-source decision-tree training
and induction parameters and 2) unsupervised updates triggered by a
second, high-precision decision tree. Currently, all parameters such
as M/N for threshold features and the cutoff for "near the beginning
of the story" are set globally, but DTree would benefit from having
these parameters set individually for each source. This will be
discussed in more detail in Section 6.
The tracking results for kNN and DTree are shown in Table 3. These
results indicate that kNN is somewhat superior to DTree for this
tracking task, and also that ASR with a 25% word-error rate degrades
performance only moderately (by about 10%, as in detection). Moreover,
combining automated segmentation with tracking also degrades
performance gracefully, by about the same percentage as for detection.
Most of the increase in Ctrack is due to increased false alarms
(roughly double), many of which result when an incorrect segmentation
places key words into one of the stories adjacent to the actual
on-topic story, causing DTree to decide that the adjacent story is also
on-topic. Since kNN is less affected by individual words, it suffers
less degradation in the topic-weighted error metric from the
segmentation errors, as shown in Table 4 for a post-evaluation run.
After submitting our results for tracking with DTree, we discovered
that an error in the code used to find optimum parameter settings
resulted in suboptimal parameters. On correcting the error and
re-tuning for the training set, we achieved the improved results shown
in Table 4 using 225 negative stories, 4-of-12
threshold features, the near-beginning meta-feature, and stopping
expansion at 50% positive instances.
System | Source Condition | Boundaries Given? | Story-Weighted Ctrack | Topic-Weighted Ctrack
kNN    | ASR              | Yes               | 0.0076                | 0.0077
kNN    | Cl.Caption       | Yes               | 0.0072                | 0.0072
DTree  | ASR              | Yes               | 0.0096                | 0.0085
DTree  | Cl.Caption       | Yes               | 0.0100                | 0.0090
DTree  | ASR              | No                | 0.0116                | 0.0109

Table 3: Official Tracking Results, Nt=4.
System | Source Condition | Boundaries Given? | Story-Weighted Ctrack | Topic-Weighted Ctrack
DTree  | ASR              | Yes               | 0.0085                | 0.0079
DTree  | Cl.Caption       | Yes               | 0.0089                | 0.0080
kNN    | ASR              | No                | 0.0105                | 0.0092
DTree  | ASR              | No                | 0.0106                | 0.0105

Table 4: Post-Submission Tracking Results, Nt=4.