Abstract:
An administrative agency is required by law to evaluate public
comments on proposed regulations during the rulemaking process. In
general, these public comments contain many exact duplicates and near
duplicates of form letters. We focus on the automatic detection of
near duplicates (form letters) in this domain. We propose a
simple and efficient method that uses the similarity between document
language models, together with fingerprinting, to identify near-duplicate
comments. The method combines word-order information with the
"bag-of-words" approach. We then report an experiment showing that
this simple method can provide reasonable performance in detecting
near duplicates in public comment data.
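
As a rough illustration of the idea only, the sketch below pairs a bag-of-words comparison of smoothed unigram language models with word n-gram fingerprints, which carry word-order information. The abstract does not give these details, so everything here is an assumption made for illustration: the additive smoothing, the use of Jensen-Shannon divergence as the model similarity, the trigram fingerprints, the MD5 hash, the thresholds, and all function names.

```python
import hashlib
import math
from collections import Counter


def tokens(text):
    """Lowercase whitespace tokenization (an assumed, deliberately simple choice)."""
    return text.lower().split()


def unigram_model(toks, vocab, alpha=0.1):
    """Additively smoothed unigram language model over a shared vocabulary."""
    counts = Counter(toks)
    total = len(toks) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def js_divergence(a, b):
    """Jensen-Shannon divergence between the two documents' unigram models.

    This is the bag-of-words part: it ignores word order entirely.
    """
    ta, tb = tokens(a), tokens(b)
    vocab = set(ta) | set(tb)
    p = unigram_model(ta, vocab)
    q = unigram_model(tb, vocab)
    m = {w: 0.5 * (p[w] + q[w]) for w in vocab}

    def kl(x, y):
        return sum(x[w] * math.log(x[w] / y[w]) for w in vocab)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def fingerprints(toks, n=3):
    """Hashes of all word n-grams; these are sensitive to word order."""
    grams = (" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return {hashlib.md5(g.encode("utf-8")).hexdigest() for g in grams}


def fingerprint_overlap(a, b, n=3):
    """Jaccard overlap between the two documents' fingerprint sets."""
    fa, fb = fingerprints(tokens(a), n), fingerprints(tokens(b), n)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)


def is_near_duplicate(a, b, max_div=0.1, min_overlap=0.5):
    """Flag a pair only if both signals agree (thresholds are hypothetical)."""
    return (js_divergence(a, b) <= max_div
            and fingerprint_overlap(a, b, 3) >= min_overlap)
```

In this sketch, two comments with the same words in a different order have identical unigram models but share few or no fingerprints, which is why the two signals are checked together.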