Abstract:
Because of the large number of public comments submitted to regulatory agencies, there is a growing need to organize these documents automatically. Clustering public comments into a hierarchy would help regulatory agencies analyze and browse their content. However, most hierarchical clustering algorithms fail to provide good descriptors for the clusters they generate, which reduces the usefulness of the hierarchy. This paper focuses on the task of automatically assigning descriptors to clusters in the hierarchy. We propose a simple algorithm that automatically assigns labels to hierarchical clusters. The algorithm computes a descriptive score for each label candidate based on its statistical features and then assigns the candidate with the highest descriptive score as the cluster descriptor. We evaluate our model against ground truth data taken from the category labels in the Open Directory Project (ODP). The evaluation demonstrates that our algorithm outperforms the previous method in assigning labels to hierarchical clusters. In the majority of categories, our algorithm selects cluster labels similar to those chosen by humans.
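
The score-then-select step described above can be illustrated with a minimal sketch. The abstract does not name the statistical features used, so the features here (a term's document frequency in the cluster, weighted by how much it exceeds the frequency in the parent cluster) are assumptions chosen only for illustration, and the function names `document_frequency`, `descriptive_scores`, and `assign_label` are hypothetical.

```python
from collections import Counter


def document_frequency(docs):
    """Return the fraction of documents that contain each term."""
    counts = Counter()
    for doc in docs:
        # Count each term at most once per document.
        counts.update(set(doc.lower().split()))
    n = len(docs)
    return {term: c / n for term, c in counts.items()}


def descriptive_scores(cluster_docs, parent_docs):
    """Score each candidate term of a cluster.

    Assumed scoring (not taken from the paper): a term scores highly when
    it is frequent in the cluster and more frequent there than in the
    parent cluster.
    """
    df_cluster = document_frequency(cluster_docs)
    df_parent = document_frequency(parent_docs)
    return {
        term: df_c * (df_c - df_parent.get(term, 0.0))
        for term, df_c in df_cluster.items()
    }


def assign_label(cluster_docs, parent_docs):
    """Pick the candidate with the highest descriptive score."""
    scores = descriptive_scores(cluster_docs, parent_docs)
    # Iterate candidates in sorted order so ties resolve alphabetically.
    return max(sorted(scores), key=scores.get)


# Toy data, invented for this example.
cluster_docs = [
    "mercury emissions from power plants",
    "mercury limits for coal plants",
    "mercury in fish",
]
parent_docs = cluster_docs + [
    "organic food labeling rules",
    "labeling of organic produce",
    "plants and organic farming",
]

print(assign_label(cluster_docs, parent_docs))  # mercury
```

In this toy hierarchy, "mercury" appears in every document of the cluster but in only half of the parent's documents, so it outscores "plants", which is common in both.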