Text Segmentation in the Informedia Project

Faculty Mentor:

 

Alex Hauptmann  (alex@cs.cmu.edu)

Students:

 

Zhirong Wang      (Zhirong@cs)

Ningning Hu         (hnn@cs)

Jichuan Chang     (cjc@cs)

Abstract

In this paper, we report our experience building a text segmenter for the Informedia project. Several methods are used in our project; their experimental results are compared and analyzed. Based on the application context of the Informedia project, we choose relaxed error metrics for performance evaluation.

1.               Introduction

Segmentation is an integral and critical process in the Informedia digital video library. The success of information retrieval in Informedia hinges on the critical assumption that we can segment a whole news broadcast into individual paragraphs or stories. The segmentation task can be conducted on different media (video, speech, text, etc.), and their results can be integrated to achieve better performance. This paper reports our experience building a closed-caption text segmenter to aid the segmentation of audio and video data.

The text segmentation problem focuses on identifying story boundaries, where one region of text ends and another begins, within a document. This work was motivated by the observations that such a seemingly simple problem can actually prove quite difficult to automate [1], and that a tool for partitioning a stream of undifferentiated text into coherent regions would be of great benefit to a number of existing applications.

Consider the following scenario: a video-on-demand application responds to a news event query by providing the user with a stream of video containing related news clips. This application may be able to accurately locate positions in its database which are highly relevant to the query, but be unable to determine how much of the neighboring data should be provided to the user. Clearly, without an accurate segmentation tool, the user will be flooded with overly abundant or unrelated information (commercials, for example).

A text segmenter also helps to detect subtopics in a long passage, allowing the reader to quickly jump to the topics of greatest interest. Because segmentation provides additional structural information about the document, such tools can also be used in information extraction and summarization tasks, to quickly build an outline of the key points in a long passage.

We treat text segmentation as the task of automatically locating topic boundaries. It can be refined into a classification problem: given a block of consecutive words (or sentences), a segmenter, having observed a set of labeled data, should tell us whether a boundary exists in that block. We use several classification methods and compare their performance in order to find the best one. These methods include a Neural Network (BP network), Naive Bayes classification, and Support Vector Machines.

Sentences within a correctly segmented text region are (part of) a semantically coherent unit belonging to the same topic, and story boundaries in a news transcript usually correspond to topic shifts between different parts of the document. This observation suggests performing text segmentation by detecting topic changes, so we also use topic change detection to assist the segmenter. The underlying topics of news stories are identified with the Expectation Maximization (EM) clustering method.

2.               Our Approach

We approach the text segmentation problem with several methods. Since many publicly available tools already implement most of them, our work mainly falls into data preparation, feature selection, parameter tuning, and comparison and analysis of the results.

2.1 Data Collection and Preparation

The raw closed-caption data come from the Informedia project and include several thousand transcripts from different programs: CNN World View, The World Today, Early Prime, and Science & Technology. We chose to use the CNN World View transcripts from 1999 to October 2000 (about 500 passages). The data come in a proprietary format (see the example below), with timing information alongside each line (“>>>” indicates a boundary).

001630 CENTURY >>> WE PEOPLE TEND TO

001631 PUT THINGS LIKE THE PASSING OF A

001633 MILLENIUM IN SHARP FOCUS. WE

001633 CELEBRATE, CONTEMPLATE, EVEN

001635 WORRY A BIT, SOMETIMES WORRY A

001636 LOT. AFTER ALL, IT'S SOMETHING

001638 THAT HAPPENS ONLY ONCE EVERY ONE

001641 THOUSAND YEARS. A BIG DEAL?

001641 PERHAPS NOT TO ALL LIVING THINGS,

001642 AS CNN'S RICHARD BLYSTONE

001643 FOUND OUT WHEN HE CONSIDERED ONE

001654 VERY OLD TREE. >>> HO HUM.

001654 ANOTHER MILLENNIUM. THE GREAT YEW

The raw data are pre-processed in several logical steps (by one or more programs):

(1)   Capitalize all words; remove non-printable characters and timing information.

(2)   Remove stories that are too short (fewer than 50 words) or too long (more than 1500 words); these two limits are based on the actual distribution of story lengths.

Below is the pre-processed example in one sentence per line format:

WE PEOPLE TEND TO PUT THINGS LIKE THE PASSING (omitted for short).

WE CELEBRATE, CONTEMPLATE, EVEN WORRY A BIT, SOMETIMES WORRY A LOT.

AFTER ALL, IT'S SOMETHING THAT HAPPENS ONLY (omitted for short).

A BIG DEAL PERHAPS NOT TO ALL LIVING THINGS, (omitted for short).

>>>

HO HUM.

ANOTHER MILLENNIUM.

(3)   Stem words using a standard algorithm (Porter's stemming algorithm).

(4)   Map numbers, titles, dates and times, and abbreviations to their class names. For example, “3:50 PM” and “10:00 AM” are both represented as __TIME__.

(5)   Depending on the method, the intermediate data are divided into text blocks of fixed length (in words or sentences).

(6)   Some of the tools we use exclude stop words (common words like “the” and “where” that rarely help classification) from the input data, so we remove stop words only when needed.

(7)   For some tools (ANN and SVM), we need to transform each block of text into a vector. Given the selected feature space, the vector's components are the occurrence counts of the distinct feature words in the block (a minimal sketch of this step appears after this list).
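
As a concrete illustration of steps (4) and (7), below is a minimal Python sketch of how a pre-processed text block might be mapped to class tokens and turned into a count vector over a chosen feature vocabulary. The regular expressions and the feature list are hypothetical placeholders, not the exact ones used by our programs.

    import re
    from collections import Counter

    # Hypothetical class-token patterns (step 4); our programs use a larger set.
    CLASS_PATTERNS = [
        (re.compile(r"\b\d{1,2}:\d{2}\s*(AM|PM)\b"), "__TIME__"),
        (re.compile(r"\bUNITED STATES\b"), "__USA__"),
        (re.compile(r"\b\d+(\.\d+)?\b"), "__NUM__"),
    ]

    def map_class_tokens(text):
        """Replace numbers, times, etc. with their class names."""
        for pattern, token in CLASS_PATTERNS:
            text = pattern.sub(token, text)
        return text

    def block_to_vector(block_text, feature_words):
        """Step (7): count occurrences of each feature word in a text block."""
        counts = Counter(map_class_tokens(block_text.upper()).split())
        return [counts[w] for w in feature_words]

    # Usage with a hypothetical 5-word feature space.
    features = ["CNN", "WORLDVIEW", "__NUM__", "__TIME__", "REPORTS"]
    print(block_to_vector("CNN WORLDVIEW AT 3:50 PM REPORTS 12 STORIES", features))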

2.2 Support Vector Machine

Support vector machines are based on the Structural Risk Minimization principle from computational learning theory. A nice property of the SVM is that its classification performance is largely independent of the dimensionality of the feature space, which is particularly useful for our text segmentation problem (which usually involves tens of thousands of features -- words). It has also been reported that SVMs achieve substantial improvements over other current text classification methods.

In our experiment, we divide each passage into blocks of exactly 2 consecutive sentences. A block is labeled “yes” and called a “boundary block” if a boundary falls between its two sentences; otherwise it is labeled “no” and called a “background block”.
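
To make the blocking concrete, here is a minimal sketch, assuming the transcript is already split into one sentence per line with “>>>” markers kept between stories (as in Section 2.1) and using every consecutive sentence pair; the exact pairing is an implementation detail of our scripts.

    def make_blocks(lines):
        """Turn sentences (with ">>>" boundary markers between stories) into
        (two-sentence block, label) pairs; "yes" marks a boundary block."""
        sentences, boundary_after = [], []
        for line in lines:
            if line.strip() == ">>>":
                if boundary_after:
                    boundary_after[-1] = True   # a boundary follows the previous sentence
            else:
                sentences.append(line.strip())
                boundary_after.append(False)
        blocks = []
        for i in range(len(sentences) - 1):
            label = "yes" if boundary_after[i] else "no"
            blocks.append(((sentences[i], sentences[i + 1]), label))
        return blocks

    # Usage on the pre-processed snippet from Section 2.1.
    lines = ["A BIG DEAL PERHAPS NOT TO ALL LIVING THINGS.", ">>>",
             "HO HUM.", "ANOTHER MILLENNIUM."]
    for block, label in make_blocks(lines):
        print(label, block)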

We have tried two SVM classification tools:

·        Rainbow: Rainbow is a statistical text classification tool built on top of the Bow library. It first builds an index by counting words in the training data; an SVM classifier is then trained and used to classify the testing data.

In our experiment, its performance is similar to that of Naive Bayes classification (although it takes much longer to achieve).

·        SVMlight is an SVM tool built for text classification. It accepts vector (of word counts) or sparse-vector input. After counting the number of distinct words, we realized that, even for an SVM, there are too many features to train a classifier on.

To reduce the feature space to a few hundred dimensions, we decided to keep only the words with the highest average mutual information. To simplify the computation, we actually chose words from the sentences sitting just before and after a boundary that have the largest difference between their occurrence probabilities in boundary blocks and in background blocks (a minimal sketch of this selection appears at the end of this subsection).

The result is disappointing, mainly because SVMlight takes too long to train. For a simple case with 500 training blocks (actually 3 passages), the training process finished in 15 minutes (our machine runs Linux on an Intel Pentium III box). The training time increases quickly with the size of the training data: when we increased the training set to 1600 blocks (about 8 passages), it failed to finish even after 15 hours[1].
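
As a rough illustration of the simplified feature selection described above, the sketch below scores each word by the difference between its occurrence probability in boundary blocks and in background blocks and keeps the top n words; it is a cheap stand-in for ranking by average mutual information and assumes blocks labeled as described earlier in this subsection.

    from collections import Counter

    def select_features(blocks, n=100):
        """Pick the n words whose occurrence probability differs most between
        boundary ("yes") and background ("no") blocks."""
        word_counts = {"yes": Counter(), "no": Counter()}
        block_counts = Counter()
        for (s1, s2), label in blocks:
            block_counts[label] += 1
            for word in set((s1 + " " + s2).split()):   # presence, not frequency
                word_counts[label][word] += 1
        vocab = set(word_counts["yes"]) | set(word_counts["no"])

        def score(word):
            p_yes = word_counts["yes"][word] / max(block_counts["yes"], 1)
            p_no = word_counts["no"][word] / max(block_counts["no"], 1)
            return abs(p_yes - p_no)

        return sorted(vocab, key=score, reverse=True)[:n]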

2.3 Neural Network

We use the stochastic gradient descent version of the Back Propagation algorithm for feed-forward networks containing 2 layers of sigmoid units. The network structure is illustrated in Figure 1. Units in each layer are connected to all units in the preceding layer. The output is a vector with 2 components, corresponding to the probabilities that the input block is or is not a boundary block. Below we discuss two other network parameters:

Figure 1. Structure of 2-layer Back Propagation Network

·        Input Units and Hidden Units

The number and features of the input units are determined by experiments. We use an n-vector (n = 100 or 200) counting the occurrences of the top n words with the highest mutual information (the same words used for SVMlight). One would expect more input units to improve performance; in our experiments, networks with 200 input units outperform those with 100 units by 5%, but the computation cost increases much more quickly.

The number of hidden units is also determined by experiments and by the tradeoff between accuracy and computation cost. We finally chose 100 input units and 10 hidden units (see Figure 1).
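
For reference, below is a minimal Python/numpy sketch of the kind of network used here (100 inputs, 10 sigmoid hidden units, 2 sigmoid outputs, trained with stochastic gradient descent back propagation); the learning rate and weight initialization are arbitrary placeholders rather than our tuned values.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    class TwoLayerBP:
        """Feed-forward network with one hidden layer of sigmoid units and a
        2-unit sigmoid output (boundary vs. background block)."""
        def __init__(self, n_in=100, n_hidden=10, n_out=2, lr=0.05):
            self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))
            self.W2 = rng.normal(0.0, 0.1, (n_out, n_hidden))
            self.lr = lr

        def forward(self, x):
            h = sigmoid(self.W1 @ x)
            o = sigmoid(self.W2 @ h)
            return h, o

        def train_one(self, x, target):
            """One stochastic gradient descent update on a single example."""
            h, o = self.forward(x)
            delta_o = (o - target) * o * (1 - o)           # output error terms
            delta_h = (self.W2.T @ delta_o) * h * (1 - h)  # back-propagated error
            self.W2 -= self.lr * np.outer(delta_o, h)
            self.W1 -= self.lr * np.outer(delta_h, x)
            return o

    # Usage: one update on a random 100-dimensional word-count vector.
    net = TwoLayerBP()
    x = rng.integers(0, 3, 100).astype(float)
    net.train_one(x, np.array([1.0, 0.0]))   # target: boundary block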

·        Merging False Alarms

By observing the classification results of the ANN, we noticed an interesting phenomenon: about 15% of the false alarms are “clustered” around a true boundary. For example, below are the classification results for a short passage, where boundary blocks are represented as 0’s and background blocks as 1’s. The three consecutive 0’s in the classification line show one such “false alarm cluster”. Because more than two sentences usually contribute to a story’s introduction and conclusion (sign-offs), features in such sentences (suggesting the existence of a nearby boundary) are also learned by our Neural Network. Such features cause some confusion when our segmenter tries to distinguish a boundary block from a background block.

Reference Classification:

111111011111011110111111

Classification (Before Merging):

110111000101011110111111

Classification (After Merging):

110111011101011110111111

One way to reduce such confusion is to include temporal information, which may help the segmenter distinguish the first boundary block of the introduction from the following background blocks. Instead, we chose a much simpler, brute-force method -- merging such false alarms. Assuming the input blocks are arranged in their order of occurrence, we simply traverse each false-alarm cluster, keep the one block with the highest target value (the one our segmenter considers most likely to contain a boundary), and change the other 0’s into 1’s.

This method is simple and effective, except that it might remove a true boundary while leaving a nearby false alarm in place, which slightly reduces the recall value. We cannot recover from such errors, but the relaxed error metrics will not count them.
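
A minimal sketch of this merging step, assuming the predictions arrive in temporal order as a label sequence (0 = predicted boundary block, 1 = background block) together with the network's boundary scores:

    def merge_false_alarms(labels, scores):
        """Within each run of consecutive predicted boundaries (0's), keep only
        the block with the highest boundary score and flip the rest back to 1."""
        merged = list(labels)
        i = 0
        while i < len(merged):
            if merged[i] == 0:
                j = i
                while j < len(merged) and merged[j] == 0:
                    j += 1                     # [i, j) is one cluster of 0's
                best = max(range(i, j), key=lambda k: scores[k])
                for k in range(i, j):
                    if k != best:
                        merged[k] = 1
                i = j
            else:
                i += 1
        return merged

    # Usage: the three consecutive 0's collapse onto the highest-scoring block.
    labels = [1, 1, 0, 1, 1, 1, 0, 0, 0, 1]
    scores = [.1, .2, .9, .1, .2, .1, .4, .8, .3, .2]
    print(merge_false_alarms(labels, scores))   # -> [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]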

·        Stop words

We also studied the impact of removing stop words in our experiments. Removing stop words is common practice in text processing applications, so why bother to observe the effect of classification with stop words included? Because our mutual-information statistics showed that stop words make up more than two thirds of the top 50 feature words (with different importance in the two groups). We ran experiments on input data with and without stop word removal, to find out whether the stop words really matter in our classification. The results show that segmentation with stop words increases the recall value but decreases the precision. This suggests that there probably exist special patterns of stop words in boundary blocks, which help to identify more true boundaries; but such patterns also occur in background blocks and can introduce more false alarms.

2.4 Naive Bayes Classification

Naive Bayes classification is a powerful method widely used in text classification applications; we use the Rainbow toolset in our experiments. One of the major problems in Naive Bayes classification (as in the other methods) is the selection of training data. After cutting the raw data into blocks (of 2 sentences), only 7% of the blocks are boundary blocks. After several initial experiments, we realized that such a low frequency of boundary blocks cannot effectively train our classifier (it identified only 10% of the true boundaries).

Increasing the percentage of boundary blocks in the training data effectively improves the recall value, but it also hurts the precision of the segmentation. We ran experiments to choose a suitable percentage of boundary blocks in our training data. A good tradeoff between precision and recall depends on the application context. In our project, we assume that lower precision only gives the user shorter news clips, which is better than flooding the user with unrelated information, the consequence of lower recall. This tradeoff leads to a relative preference for recall in our experiments.

Different classification methods also prefer different percentages of boundary blocks in the training data. For Naive Bayes and Neural Network classification, we use 50% boundary blocks in the training data.
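
One simple way to obtain such a distribution is to keep every boundary block and randomly subsample the background blocks, as in the sketch below; the exact resampling scheme we used is a detail of our scripts, so this is just one plausible variant.

    import random

    def rebalance(blocks, boundary_fraction=0.5, seed=0):
        """Subsample background ("no") blocks so that boundary ("yes") blocks
        make up the requested fraction of the training data."""
        random.seed(seed)
        yes = [b for b in blocks if b[1] == "yes"]
        no = [b for b in blocks if b[1] == "no"]
        n_no = int(len(yes) * (1 - boundary_fraction) / boundary_fraction)
        sample = yes + random.sample(no, min(n_no, len(no)))
        random.shuffle(sample)
        return sample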

2.5 Topic Change Detection

Topic change detection was used in Dragon's approach to text segmentation and proved to be quite effective (67% recall and 65% precision). Dragon uses a multi-pass k-means algorithm to construct the clusters, while we use Rainbow’s EM clustering to attack this problem. There are two important parameters to be determined in our method:

·        Number of topics

When clustering documents, one must specify the number of clusters in the data set. Dragon borrowed a trick from the speech recognition field, using thresholding to limit the size of the search space (the number of clusters) and iteratively merging topics and creating new ones. In our project, due to the limitations of the Rainbow tool, we have to choose the number by intuition and experiments.

·        Size of sliding window

We tried different window sizes in the topic change detection method, and 8 sentences works the best. As the text window grows, more boundary blocks are combined into a single window, which decreases the number of boundary blocks that can be identified. We stop at 8 sentences per window because beyond that point the portion of such errors is no longer negligible.
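
Putting the two parameters together, a minimal sketch of the detection step might look like the following: assign each fixed-size window of sentences to its most likely topic cluster and hypothesize a boundary wherever the topics of adjacent windows differ. The `predict_topic` function stands in for the trained Rainbow/EM topic model and is an assumption of this sketch.

    def detect_topic_changes(sentences, predict_topic, window=8):
        """Slide a fixed-size window over the sentences and report a boundary
        between adjacent windows whose predicted topic clusters differ."""
        boundaries = []
        prev_topic = None
        for start in range(0, len(sentences) - window + 1, window):
            text = " ".join(sentences[start:start + window])
            topic = predict_topic(text)   # stand-in for the trained topic model
            if prev_topic is not None and topic != prev_topic:
                boundaries.append(start)  # boundary hypothesized before this window
            prev_topic = topic
        return boundaries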

Table 1 shows the results of segmenters built with different sliding window sizes and numbers of topics. According to these results, clustering into more topics improves the overall classification accuracy; but, limited by Rainbow’s processing capacity, we only had time to obtain these results.

 

                            Recall                              Precision
Topics \ Window size    4         6         8               4         6         8

8                    0.321839  0.256177  0.311724        0.196568  0.248424  0.365696

16                   0.421456  0.360208  0.38069         0.198198  0.267633  0.353846

Table 1. Segmentation performance using topic change detection method

2.6 Fixed Length Text Segmentation

This part will be finished by Zhirong Wang.

3.               Experimental Results and Analysis

3.1 Error Metrics

How can we evaluate the performance of the different segmentation methods once we have the experimental results? Two useful indicators are precision and recall, the conventional information retrieval metrics. For our segmentation task,

Recall     =  # true boundaries correctly identified / # total true boundaries

Precision  =  # true boundaries correctly identified / # total boundaries identified

Researchers have also proposed other novel measurements for the text segmentation problem. For fixed-length segments, [9] uses the fraction of overlap between the segment and the relevant story as the relevance metric; for text segmentation based on language models, [7] proposes an error metric based on the distance (in number of words) between an identified boundary and the neighboring actual boundaries. A similar but much simplified idea is used in our approach.

We use the sentence as the minimum unit of segmentation. In the simplest case, an identified boundary is correct if and only if it is a true boundary. But considering our application of interactive query, a segmentation method is almost satisfactory if it always comes close to a true boundary. Closeness can be defined in units of words or sentences; here we relax our correctness criterion to accept all boundaries that are one or two sentences off a true boundary. We call the allowed distance between an identified boundary and the closest true boundary the DR (degree of relaxation). Figure 2 illustrates the relaxed failure model for our sentence-based segmentation methods.

Figure 2. Failure model of sentence-based text segmentation method (Adapted from [7])
(YY# means under the degree of relaxation #, identified boundary is OK.)
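
A minimal sketch of how precision and recall under a given degree of relaxation might be computed, with boundaries given as sentence indices (DR = 0 reduces to the exact metrics defined above):

    def relaxed_metrics(predicted, true, dr=0):
        """Precision/recall where a predicted boundary counts as correct if it
        lies within dr sentences of some true boundary."""
        if not predicted or not true:
            return 0.0, 0.0
        correct = sum(1 for p in predicted if any(abs(p - t) <= dr for t in true))
        found = sum(1 for t in true if any(abs(p - t) <= dr for p in predicted))
        return correct / len(predicted), found / len(true)

    # Usage: one prediction is a sentence off; it counts under DR = 1 but not DR = 0.
    print(relaxed_metrics(predicted=[5, 13], true=[5, 12], dr=0))   # (0.5, 0.5)
    print(relaxed_metrics(predicted=[5, 13], true=[5, 12], dr=1))   # (1.0, 1.0)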

Below are our results with the relaxed error metric for the ANN method (10 hidden units); the relaxed metric helps to absorb the errors introduced by false-alarm merging.

 

        Before merging             After merging
DR      Precision   Recall         Precision   Recall

0       0.241       0.554          0.263       0.516

1       0.290       0.666          0.331       0.648

2       0.336       0.772          0.383       0.749

 

Table 2. Performance of ANN segmentation

3.2 Performance Evaluation

(1) SVM: Rainbow and SVMlight

 

             SVMlight    Rainbow

Recall       0.07        ??

Precision    0.223       ??

Table 3. Segmentation results of Rainbow and SVMlight

(2) ANN

2.1. Impact of Training Data Distribution

According to the following data, we chose to use 50% boundary blocks in our training data, because this distribution provides a rather high recall value (71% after merging) and an acceptable precision value (33% after merging). This actually means the average length of our segments is 5 sentences, corresponding to about 30 seconds of news broadcasting.

Figure 3. Impact of Training Data Distribution

2.2. Impact of stop word removal

Retaining stop words improves the recall value but hurts the precision. Table 4 gives the results of the ANN with 100 input units, using 50% boundary blocks in the training data. The same trend can be observed with different training data.

%Y = 50%           No Stop Words   With Stop Words   No Stop Words   With Stop Words
100 Input Units    (No merge)      (No merge)        (Merged)        (Merged)

Recall             0.597           0.721             0.705           0.846

Precision          0.201           0.115             0.327           0.208

Table 4. Impact of stop word removal

2.3.  Impact of number of features (number of input units)

%Y = 50%           100 Input Units   200 Input Units   100 Input Units   200 Input Units
                   (No merge)        (No merge)        (Merged)          (Merged)

Recall             0.597             0.628             0.705             0.739

Precision          0.201             0.137             0.327             0.219

Table 5. Impact of the number of features

The effect of increasing the number of input units is the same as that of retaining stop words: recall improves but precision drops. The same trend can be observed with different training data. A second reason we chose 100 input units is that it greatly reduces the computation cost.

(3) Naive Bayes classification

           

             %Y = 25%    %Y = 33%    %Y = 50%

Recall       0.589       0.777       0.888

Precision    0.122       0.100       0.009

Table 6. Impact of the training data distribution on the Naive Bayes segmenter

(4) Topic change detection method

                       Recall                      Precision
Window size        4       6       8           4       6       8

# topics = 8     0.322   0.256   0.312       0.197   0.248   0.366

# topics = 16    0.421   0.360   0.381       0.198   0.268   0.354

Table 7. Impact of window size and number of topics

The results of the topic change detection method are very different from the other methods, with a much lower recall but a relatively high precision value. We can say that the TCD method is a conservative segmenter that will not be tempted to identify too many boundaries, because it uses global information only. Such global information could be combined with the ANN method, which uses only information within 2 consecutive sentences. Because the window size in the TCD method differs from that of the ANN and Naive Bayes methods, we did not test voting among the different methods; but we believe future work in this direction could improve segmentation accuracy by integrating their strengths.

(5) Fixed length segmentation

This part will be finished by Zhirong Wang.

(6) Performance of different segmentation methods

Figure 4. Segmentation Accuracy

Figure 4 shows the performance values of the different methods we tried. The best tradeoff points between precision and recall are selected and compared. All of our methods suffer from rather low precision but achieve higher recall (although topic change detection also has a relatively low recall). We currently choose the Fixed Length (FL) segmenter because it reaches the best tradeoff point among all these methods.

Our results differ from published segmentation results, partly because of differences in the data sets. The closed-caption transcripts we use are much noisier (with noise words, omitted sentences, and incorrect labels), which proved more difficult to work with.

4.               Conclusion

We applied several simple, traditional machine learning methods to the problem of text segmentation. Compared with published results, we achieved slightly higher recall but rather lower precision. This leaves a lot of room for improving our methods, for example by integrating time-series analysis with our ANN classification or using more topics to cluster news stories. We could also combine the current methods with more sophisticated ones (such as Dragon’s approach or the Hearst algorithm), or even with segmentation information from other media (such as video segmentation and speech recognition).

Appendix A: Top 50 Features used in ANN experiments

We count the words appearing in the first and last sentences of stories separately, so the 50 words under each heading actually come from two groups: 25 words from the ending sentences and 25 from the beginning sentences.

Without Stop Word Removal          After Stop Word Removal

CNN          OF                    CNN             WORLDVIEW
THE          A                     REPORTS         __USA__
OF           I                     CAPTIONING      UNITED
IT           THAT                  __NUM__         CLINTON
TO           AND                   CLOSED          THINK
REPORTS      IN                    WORLDVIEW       PRESIDENT
AND          IT                    ADDITION        STATES
THAT         WORLDVIEW             REPORTING       __NUM__
IS           THEY                  BELL            CNN
THIS         WE                    LONDON          SPACE
CAPTIONING   __USA__               PROVIDED        WORLD
ON           ON                    ORDERED         IRAQ
A            YOU                   THINK           PRINCESS
__NUM__      HE                    HEART           DAY
THEY         BE                    JERUSALEM       DIANA
WE           THERE                 ATLANTIC        WELL
ARE          DO                    WASHINGTON      ISRAEL
CLOSED       TO                    WALTER          NEW
BY           HAVE                  RODGERS         JUDY
THERE        FOR                   PEOPLE          AHEAD
WORLDVIEW    UNITED                COMMUNICATION   GET
AT           THIS                  WHITE           TODAY
I            WILL                  THANK           FIRST
DO           CLINTON               CORRESPONDENT   MILITARY
ADDITION     FROM                  MOSCOW          MONEY

 



[1] Actually, we never managed to finish this training process even in 2 weeks and finally gave it up.