This project applies a high-accuracy classification method, k-Nearest Neighbor (kNN), to a very large category space by combining feature selection with a hierarchical approach. Feature selection reduces the number of features needed for text categorization within each sub-domain, where a sub-domain is a non-terminal node in the hierarchical category structure. The hierarchical approach applies a separate classifier to each sub-domain.
At the beginning of the project, several tasks were assigned to the people who chose to work on it. The tasks include sampling a training set from the whole corpus for each classifier, performing feature selection, converting SGML-formatted documents into SMART format, documenting the usage of the kNN program, generating the hierarchical structure for the documents, and removing "none"-category documents from the corpus.
The sampling process has been established, and several parameter tunings will be examined and reported in the final report.
Here is my proposal:
Task | To be done by | Status
Design: obtaining the modules from others; designing the integrated system | March 10 | done
Implementation: building the wrapping modules; interface modules to control the flow of the system; training the system | March 17 | done
Evaluation: comparing the results with other published systems | April 1 | done
Documentation: Project Web site | March 3 | done
Documentation: Preliminary Design Report | March 17 | done
Documentation: Final Report | April 22 | done
The idea is to use a divide-and-conquer strategy to make a large problem more tractable, hopefully without significant loss of classification accuracy. The approach used in this project has two parts. The first part is pre-processing the documents; the second is deploying classifiers on the different layers of the hierarchy. Pre-processing performs feature selection and sampling to reduce the computational complexity, and multiple classifiers are used to take advantage of the sampling.
The pre-processing for text classification is illustrated in Figure 1. First, the Reuters-21578 corpus is converted from SGML format into SMART format. Second, documents with insufficient information for training and testing (those belonging to no category) are removed, which reduces the number of categories to 90. Third, a hierarchical structure of the corpus is generated. The sampling and feature selection modules use this hierarchical structure to obtain the sampled training set and the features used for training the classifiers.
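The purification step above can be sketched as follows. This is a minimal illustration using a hypothetical in-memory document representation; the real pipeline operates on SGML/SMART files.

```python
# Sketch of the corpus-purification step: drop documents that belong to
# no category ("none" documents) and collect the surviving categories.
# The dict-based document format here is an assumption for illustration.

def purify(documents):
    """Remove documents with an empty category list."""
    kept = [d for d in documents if d["categories"]]
    categories = {c for d in kept for c in d["categories"]}
    return kept, categories

docs = [
    {"id": 1, "categories": ["grain", "wheat"]},
    {"id": 2, "categories": []},            # a "none" document: removed
    {"id": 3, "categories": ["crude"]},
]
kept, cats = purify(docs)
print(len(kept), sorted(cats))  # 2 ['crude', 'grain', 'wheat']
```

On the real corpus, the same filter leaves 90 categories with at least one training and test document.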
The sampling task, which was assigned to me during this development phase, implements the kNN strategy on top of the SMART program. Each document in the purified training set (with the "none" category removed) belongs to at least one category, so we can collect the positive documents for a category. To train a classifier correctly, negative documents, which do not belong to the category, must also be sampled. Using the positive documents as a query, we can use SMART to retrieve documents from the training set. The returned non-relevant documents are taken as the negative documents, since they are the nearest neighbors around the positive samples (assuming the positive documents form a single centroid).
The procedure for sampling documents for a classifier is described in Figure 2. Taking the positive samples from the training set of a classifier (e.g., COMMODITY.set is the training set for the COMMODITY category), we compose one large query from those samples. Running SMART with this query against cate90.train as the corpus yields a relevant-document list. The documents on that list that are not truly relevant to the query are kept as the negative nearest-neighbor documents.
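The negative-sampling idea can be sketched as follows. A toy word-overlap score stands in for SMART's vector-space retrieval, and the document format is an assumption for illustration.

```python
# Sketch of negative sampling: pool the positive documents into one
# query, retrieve the nearest documents from the training corpus, and
# keep the retrieved documents that are NOT positives as negatives.

def retrieve(query_words, corpus, k):
    """Return the k corpus documents with the largest word overlap
    with the query (a stand-in for SMART retrieval)."""
    scored = sorted(corpus, key=lambda d: -len(query_words & d["words"]))
    return scored[:k]

def sample_negatives(positives, corpus, k):
    query = set().union(*(d["words"] for d in positives))
    pos_ids = {d["id"] for d in positives}
    return [d for d in retrieve(query, corpus, k) if d["id"] not in pos_ids]

positives = [{"id": 1, "words": {"wheat", "grain"}}]
corpus = positives + [
    {"id": 2, "words": {"grain", "corn"}},
    {"id": 3, "words": {"oil"}},
]
print([d["id"] for d in sample_negatives(positives, corpus, k=2)])  # [2]
```

Document 2 is retrieved because it is close to the positives, but it is not itself a positive, so it becomes a negative sample, exactly the "near-miss" documents the procedure above is after.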
Two important parameters should be considered for future sampling. One is the number of positive documents used to train a classifier, since the purpose of sampling is to reduce the computational cost of training on the whole corpus. The second is the number of nearest-neighbor documents chosen as negative samples.
The classifier for this project is the kNN classifier. As discussed in the previous section, the kNN method is more robust for large-scale corpora and makes no assumption of term independence. The Reuters collection has two to three levels of categories, so we can apply kNN classifiers at the root node and at the intermediate-level nodes. The classifier at the root node classifies incoming documents into the next-level categories; the classifiers at the intermediate-level nodes classify documents into the leaf-node categories. Each document has scores for the intermediate categories and the leaf-node categories. The document is assigned the category with the highest final score, computed as the product of the scores along the hierarchical path traversed from the root node to the leaf node.
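The hierarchical scoring rule can be sketched as follows. The score values here are illustrative numbers, not outputs of the actual kNN classifiers.

```python
# Sketch of the scoring rule: a document's score for a leaf category is
# the product of the classifier scores along the root-to-leaf path, and
# the document is assigned the highest-scoring leaf.

def assign_category(path_scores):
    """path_scores maps each leaf category to the list of classifier
    scores along its root-to-leaf path."""
    best_leaf, best = None, -1.0
    for leaf, scores in path_scores.items():
        prod = 1.0
        for s in scores:
            prod *= s
        if prod > best:
            best_leaf, best = leaf, prod
    return best_leaf, best

scores = {
    "wheat": [0.9, 0.8],   # root -> GRAIN (0.9), GRAIN -> wheat (0.8)
    "crude": [0.4, 0.95],  # root -> ENERGY (0.4), ENERGY -> crude (0.95)
}
print(assign_category(scores))  # 'wheat' wins: 0.9 * 0.8 = 0.72 > 0.38
```

Multiplying along the path means a document must score well at every level of the hierarchy to be assigned a leaf category, which is what lets the intermediate classifiers prune unlikely branches.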
The experiments of the project observe how sampling and feature selection affect the efficiency and accuracy of hierarchical categorization. Feature selection is conducted after the training documents are sampled. Four experiments were run for comparison: flat classification, sampled classification, flat classification with filtering, and sampled classification with filtering.
The flat classification, which treats all the categories as a flat structure, is the baseline of the experiment; all other results are compared against it. The sampled classification shows how the sampling approach affects classification. The sampling scheme draws 50% of the documents from the original training set. We cannot sample documents from the cate90.train file directly, since the sampling scripts work on a per-category basis, while cate90.train contains all the training documents across categories. The sampling procedure starts from the leaf-node categories: for each category, 50% of the documents are sequentially extracted, and all the sampled documents are appended into one file called samcate90.train, which contains 4811 documents. We filtered both cate90.train and samcate90.train to remove the less frequently occurring words; the 50% filtering results, i.e., the training files with the 50% least frequently occurring words removed, are used for the filtering experiments.
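The per-category 50% sampling step can be sketched as follows. The category-to-documents mapping is a hypothetical stand-in for the per-category training files; the real scripts concatenate the results into samcate90.train.

```python
# Sketch of the sampling scheme: for each leaf category, sequentially
# take the first half of its training documents, then concatenate all
# sampled documents into one training set.

def sample_half(per_category_docs):
    """per_category_docs maps a category name to its ordered list of
    training-document ids."""
    sampled = []
    for category, docs in per_category_docs.items():
        n = len(docs) // 2 or 1      # keep at least one document per category
        sampled.extend(docs[:n])     # sequential extraction, as described above
    return sampled

train = {"grain": ["d1", "d2", "d3", "d4"], "crude": ["d5", "d6"]}
print(sample_half(train))  # ['d1', 'd2', 'd5']
```

Note that a document belonging to several categories could be selected more than once under this scheme; deduplicating the concatenated file is a separate concern.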
The precision-recall plot of the test results is shown in Figure 3 and the 11-point average for each experiment is shown in Table 1.
Figure 3 P-R plot of the results
Experiment | 11-point average
flat-no-filter | 0.8266 |
sample-no-filter | 0.8068 |
flat-filter | 0.7770 |
sample-filter | 0.7380 |
Table 1 11-point averages of the results
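For reference, an 11-point average like those in Table 1 is computed by taking the interpolated precision at each of the 11 recall levels 0.0, 0.1, ..., 1.0 and averaging them. The (recall, precision) points below are illustrative, not the project's actual results.

```python
# 11-point interpolated average precision: at each recall level, take
# the best precision achieved at that recall or higher, then average
# over the 11 levels.

def eleven_point_average(pr_points):
    """pr_points: list of (recall, precision) pairs."""
    total = 0.0
    for i in range(11):
        level = i / 10.0
        # interpolated precision: best precision at recall >= level
        candidates = [p for r, p in pr_points if r >= level]
        total += max(candidates) if candidates else 0.0
    return total / 11.0

pr = [(0.0, 1.0), (0.5, 0.8), (1.0, 0.4)]
print(round(eleven_point_average(pr), 3))  # 0.636
```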