For Research
This page lists various software I have developed. Most of the code
is under either the GPL or the LGPL.
There is no warranty of any kind that the code will work as advertised
(in fact it might even blow up your machine) - use at your own risk.
However, if you do find the code useful, or have any comments (bug reports,
etc.), I'd love to hear from you.
UTools
UTools is a set of software modules written in C++ for performing
sentence generation/analysis using a unification-based grammar
formalism (example grammar). It
is a modernized and extended version of the original GenKit written
by Tomita and Nyberg in 1988 (in LISP), and a close relative of KANTOO (also
written in C++), which is used in the KANT project
at LTI, CMU, but is closed source. UTools is an independent implementation and is
currently used in various LTI projects such as Avenue.
To illustrate what it can do, here is how UTools can be used to
generate a sentence. First you need to have a feature structure representing
the meaning of the sentence you want to generate, like this ("A
small dog all of a sudden bites the girl in white with the teeth."):
((SUBJ
   (*OR*
     ((PRED DOG)
      (FIN -)
      (PERS 3)
      (NUM SG)
      (BREED CHIHUAHUA))
     ((PRED DOG)
      (FIN -)
      (PERS 3)
      (NUM SG)
      (BREED PITBULL))))
 (OBJ
   ((PRED GIRL)
    (FIN +)
    (PERS 3)
    (NUM SG)
    (COAT_COLOR WHITE)))
 (INST
   ((PRED TOOTH)
    (FIN +)
    (PERS 3)
    (NUM PL)))
 (PRED BITE)
 (TENSE PRES))
The feature structure is basically a tree specifying what
the subject is (it can be a 3rd person singular chihuahua or pitbull),
what the object is (the poor girl), what the instrument is
(teeth, plural), and what the action is (bite!).
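If it helps to see how such a structure might look in code, here is a minimal sketch of a feature-structure node in C++ - my own illustration for this page, not the actual Toolbox/UKernel classes - with *OR* disjunctions kept as a list of alternatives:

// Minimal feature-structure sketch (illustration only, not the real UTools types).
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct FSNode {
    std::string atom;                                       // atomic value, e.g. "DOG"
    std::map<std::string, std::shared_ptr<FSNode>> feats;   // feature -> sub-structure
    std::vector<std::shared_ptr<FSNode>> alternatives;      // non-empty for an *OR* node
};

std::shared_ptr<FSNode> atom(const std::string& a) {
    auto n = std::make_shared<FSNode>();
    n->atom = a;
    return n;
}

int main() {
    // Encode the SUBJ part of the example: a dog that is either a chihuahua or a pitbull.
    auto chihuahua = std::make_shared<FSNode>();
    chihuahua->feats["PRED"]  = atom("DOG");
    chihuahua->feats["BREED"] = atom("CHIHUAHUA");

    auto pitbull = std::make_shared<FSNode>();
    pitbull->feats["PRED"]  = atom("DOG");
    pitbull->feats["BREED"] = atom("PITBULL");

    auto subj = std::make_shared<FSNode>();                 // the *OR* node
    subj->alternatives = {chihuahua, pitbull};

    auto root = std::make_shared<FSNode>();
    root->feats["SUBJ"] = subj;
    root->feats["PRED"] = atom("BITE");

    std::cout << "SUBJ has " << root->feats["SUBJ"]->alternatives.size()
              << " alternatives\n";                         // prints 2
    return 0;
}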
You then need to provide a grammar and
a lexicon for UTools to generate the
target sentence. A trace of this process can be seen here.
You can find more information about this process by reading the
UKernel manual:
B. Han and A. Lavie. UKernel:
A Unification Kernel. Technical Report CMU-LTI-03-177,
Language Technologies Institute, Carnegie Mellon University, August
13, 2004.
Currently UTools has the following components (C++):
- Generator (download):
This is a simple recursive-descent sentence generator - given a
feature structure as input, a natural language sentence is generated
by executing grammar rules.
- UKernel (download):
This is the core engine of UTools - it provides the basic functionality
for executing the unification rules in a grammar (a minimal sketch of
what unification does follows this list).
- Toolbox (download):
This is a library implementing basic data structures such as token
strings/dictionaries, tree templates and context-free grammars
with prefix searching, etc.
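To give a feel for what UKernel's core job is, here is a toy version of feature-structure unification - again only an illustrative sketch, not the real engine (it ignores *OR* disjunctions, structure sharing, and everything else that makes unification useful in practice):

// Toy feature-structure unification (illustration only; UKernel's real
// implementation is much richer, e.g. it handles *OR* disjunctions).
#include <iostream>
#include <map>
#include <memory>
#include <string>

struct FS {
    std::string atom;                                  // non-empty => atomic node
    std::map<std::string, std::shared_ptr<FS>> feats;  // non-empty => complex node
};

using FSPtr = std::shared_ptr<FS>;

FSPtr makeAtom(const std::string& a) {
    auto f = std::make_shared<FS>();
    f->atom = a;
    return f;
}

// Returns the unified structure, or nullptr if the two structures clash.
FSPtr unify(const FSPtr& a, const FSPtr& b) {
    if (!a) return b;
    if (!b) return a;
    if (!a->atom.empty() || !b->atom.empty()) {        // at least one atomic node
        if (a->atom == b->atom && a->feats.empty() && b->feats.empty()) return a;
        return nullptr;                                // atom mismatch, or atom vs. complex
    }
    auto result = std::make_shared<FS>(*a);            // start from a copy of a
    for (const auto& [feat, val] : b->feats) {         // fold in b's features one by one
        auto it = result->feats.find(feat);
        FSPtr merged = val;
        if (it != result->feats.end()) merged = unify(it->second, val);
        if (!merged) return nullptr;                   // clash on this feature
        result->feats[feat] = merged;
    }
    return result;
}

int main() {
    // ((PRED GIRL) (NUM SG)) unifies with ((NUM SG) (PERS 3)), but not with ((NUM PL)).
    auto a = std::make_shared<FS>();
    a->feats["PRED"] = makeAtom("GIRL");
    a->feats["NUM"]  = makeAtom("SG");
    auto b = std::make_shared<FS>();
    b->feats["NUM"]  = makeAtom("SG");
    b->feats["PERS"] = makeAtom("3");
    auto c = std::make_shared<FS>();
    c->feats["NUM"]  = makeAtom("PL");

    std::cout << (unify(a, b) ? "a+b unify" : "a+b clash") << "\n";  // a+b unify
    std::cout << (unify(a, c) ? "a+c unify" : "a+c clash") << "\n";  // a+c clash
    return 0;
}

The point of unification is that two compatible structures merge into one, while a clash (here, NUM SG against NUM PL) makes the attempt fail; that success/failure test is what drives the application of grammar rules.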
Two modules are notably missing: the first is a front-end module
which parses grammar files into the corresponding data structures
to drive the system (otherwise there is no way for UTools to understand
the grammar rules you developed in plain text!). The second missing
module is a parser. With these pieces added, UTools could be
a complete suite meeting a broad range of needs in developing natural
language applications.
Hence my call for help: my plan is to add a front-end
module with bindings for popular scripting languages, e.g., Python,
and to add an Earley
parser for efficient parsing. These two components should be
written in C/C++ as well. With limited time on my hands, however, I would
like to invite anyone who is interested to join me - we can
even move this project to Sourceforge if
necessary, but the project has to remain open source (either GPL or LGPL).
Please contact me (email at the bottom of this page) if you are interested.
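To make the Earley part concrete, here is a toy Earley recognizer - again only a sketch of the algorithm (grammar hard-coded, no feature structures, no parse-tree recovery), not a piece of UTools:

// A toy Earley recognizer (a sketch of the parsing strategy only, not part of
// UTools). Nonterminals are the grammar's map keys; any other symbol on a
// right-hand side is treated as a terminal, i.e. a word.
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct State {
    std::string lhs;
    std::vector<std::string> rhs;
    size_t dot;      // how much of rhs has been recognized so far
    size_t origin;   // chart column where this state started
    bool operator==(const State& o) const {
        return lhs == o.lhs && rhs == o.rhs && dot == o.dot && origin == o.origin;
    }
};

using Grammar = std::map<std::string, std::vector<std::vector<std::string>>>;

void addState(std::vector<State>& column, const State& s) {
    for (const auto& t : column) if (t == s) return;    // avoid duplicates
    column.push_back(s);
}

bool recognize(const Grammar& g, const std::string& start,
               const std::vector<std::string>& words) {
    std::vector<std::vector<State>> chart(words.size() + 1);
    for (const auto& rhs : g.at(start)) addState(chart[0], {start, rhs, 0, 0});

    for (size_t i = 0; i <= words.size(); ++i) {
        for (size_t j = 0; j < chart[i].size(); ++j) {   // chart[i] may grow as we go
            State s = chart[i][j];
            if (s.dot < s.rhs.size()) {
                const std::string& next = s.rhs[s.dot];
                if (g.count(next)) {                     // PREDICT: expand a nonterminal
                    for (const auto& rhs : g.at(next)) addState(chart[i], {next, rhs, 0, i});
                } else if (i < words.size() && words[i] == next) {   // SCAN: match a word
                    addState(chart[i + 1], {s.lhs, s.rhs, s.dot + 1, s.origin});
                }
            } else {                                     // COMPLETE: s.lhs is fully recognized
                for (size_t k = 0; k < chart[s.origin].size(); ++k) {
                    State t = chart[s.origin][k];
                    if (t.dot < t.rhs.size() && t.rhs[t.dot] == s.lhs)
                        addState(chart[i], {t.lhs, t.rhs, t.dot + 1, t.origin});
                }
            }
        }
    }
    for (const auto& s : chart[words.size()])
        if (s.lhs == start && s.dot == s.rhs.size() && s.origin == 0) return true;
    return false;
}

int main() {
    Grammar g = {
        {"S",  {{"NP", "VP"}}},
        {"NP", {{"the", "dog"}, {"the", "girl"}}},
        {"VP", {{"bites", "NP"}}}
    };
    std::cout << recognize(g, "S", {"the", "dog", "bites", "the", "girl"}) << "\n";  // 1
    std::cout << recognize(g, "S", {"the", "dog", "the", "girl"}) << "\n";           // 0
    return 0;
}

The three operations (predict, scan, complete) handle arbitrary context-free grammars in at worst cubic time, which is why an Earley parser is an attractive fit here.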
Analyzer
Analyzer is a program for building a bilingual dictionary from a
parallel corpus. Analyzer uses a steady-state genetic algorithm to
search for better translation solutions, and uses part-of-speech information
from the target language to 'optimize' the translations. You need
to download Toolbox and ePost to
build and run this code.
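For those unfamiliar with the technique, here is a generic steady-state GA skeleton - the chromosome encoding and the fitness function below are stand-ins for illustration only; see the paper below for how Analyzer actually scores candidate dictionaries:

// Generic steady-state GA skeleton (illustration only; the encoding and the
// fitness function here are stand-ins, not Analyzer's actual scoring).
#include <algorithm>
#include <iostream>
#include <random>
#include <vector>

using Individual = std::vector<int>;   // a candidate mapping: source word i -> target word ind[i]

std::mt19937 rng(42);

// Stand-in fitness: simply rewards mapping source word i to target word i.
// In Analyzer this would score a candidate dictionary against the corpus.
double fitness(const Individual& ind) {
    double score = 0;
    for (size_t i = 0; i < ind.size(); ++i)
        if (ind[i] == static_cast<int>(i)) score += 1;
    return score;
}

// Binary tournament selection: pick two at random, keep the fitter one.
Individual tournament(const std::vector<Individual>& pop) {
    std::uniform_int_distribution<size_t> pick(0, pop.size() - 1);
    const Individual& a = pop[pick(rng)];
    const Individual& b = pop[pick(rng)];
    return fitness(a) >= fitness(b) ? a : b;
}

int main() {
    const size_t genes = 10, popSize = 30, steps = 2000;
    std::uniform_int_distribution<int> randomGene(0, static_cast<int>(genes) - 1);
    std::uniform_int_distribution<size_t> randomCut(0, genes);

    // Random initial population.
    std::vector<Individual> pop(popSize, Individual(genes));
    for (auto& ind : pop)
        for (auto& g : ind) g = randomGene(rng);

    for (size_t step = 0; step < steps; ++step) {
        // Steady state: breed one child per step and let it replace the worst individual.
        Individual p1 = tournament(pop), p2 = tournament(pop);
        size_t cut = randomCut(rng);
        Individual child(p1.begin(), p1.begin() + cut);             // one-point crossover
        child.insert(child.end(), p2.begin() + cut, p2.end());
        if (randomGene(rng) == 0)                                   // occasional mutation
            child[randomCut(rng) % genes] = randomGene(rng);

        auto worst = std::min_element(pop.begin(), pop.end(),
            [](const Individual& x, const Individual& y) { return fitness(x) < fitness(y); });
        if (fitness(child) > fitness(*worst)) *worst = child;
    }

    auto best = std::max_element(pop.begin(), pop.end(),
        [](const Individual& x, const Individual& y) { return fitness(x) < fitness(y); });
    std::cout << "best fitness: " << fitness(*best) << " / " << genes << "\n";
    return 0;
}

The 'steady-state' part is that each step breeds a single child that competes against the worst member of the population, instead of rebuilding a whole generation at once.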
Paper: B. Han. Building
a Bilingual Dictionary with Scarce Resources: A Genetic Algorithm
Approach. In the Student Research Workshop, the Second Meeting
of the North American Chapter of the Association for Computational
Linguistics (NAACL-2001), Pittsburgh, 2001.
You can also find the slides in the Publications section.
v1_4-20010903: Now compiled
under g++ 3.0 with -Wall -pedantic, along with some minor fixes.
ePost
ePost is an adapted version of Brill's part-of-speech tagger for
English (it was actually factored out of Analyzer). The differences
from Brill's original code are:
- Lots of memory-related bugs have been removed, so you can now run the
tagger more than once at run time.
- A C++ class wrapper is defined so it's easier to use in C++.
- Now runs under Windows/MSVC++ 6.0.
You might also be interested in a cool Java/Perl tagger developed
by Jimmy Lin at the
University of Maryland (based on ePost) - you can download it there!
v1_0-20020722: fixed warnings/errors
for gcc 3.1.
For Fun
This is non-research-related code (at least for now). Again, use at
your own risk.
JunkMatcher
Icon courtesy of Steve Caplin
JunkMatcher (on Sourceforge.net)
is a versatile spam-filter add-on for Mac OS X. Although Apple's built-in Mail.app has
a wonderful, statistically trained junk filter, spammers nowadays
use various
tricks to conceal what they really want to say (graphics,
encoded characters, etc.). If you're going nuts over your spam problem,
I've written a cocktail-style tool that uses various effective
techniques, such as Bayesian filtering, IP-based blocking
and flexible regular
expressions, to identify those sneaky junk mails.
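As a rough illustration of the Bayesian ingredient in that cocktail, here is a textbook-style naive-Bayes scorer - this is not JunkMatcher's actual code, just the idea behind one of its techniques:

// Toy naive-Bayes spam scorer (illustration of the idea only, not JunkMatcher code).
#include <cmath>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

struct BayesFilter {
    std::map<std::string, int> spamCount, hamCount;   // word -> number of messages it was seen in
    int spamMsgs = 0, hamMsgs = 0;

    static std::vector<std::string> tokenize(const std::string& text) {
        std::istringstream in(text);
        std::vector<std::string> words;
        for (std::string w; in >> w; ) words.push_back(w);
        return words;
    }

    static int lookup(const std::map<std::string, int>& m, const std::string& w) {
        auto it = m.find(w);
        return it == m.end() ? 0 : it->second;
    }

    void train(const std::string& text, bool isSpam) {
        (isSpam ? spamMsgs : hamMsgs)++;
        for (const auto& w : tokenize(text)) (isSpam ? spamCount : hamCount)[w]++;
    }

    // Log-odds that the message is spam; a positive score means "looks like spam".
    double score(const std::string& text) const {
        double logOdds = std::log((spamMsgs + 1.0) / (hamMsgs + 1.0));
        for (const auto& w : tokenize(text)) {
            double pSpam = (lookup(spamCount, w) + 1.0) / (spamMsgs + 2.0);  // Laplace smoothing
            double pHam  = (lookup(hamCount, w) + 1.0) / (hamMsgs + 2.0);
            logOdds += std::log(pSpam / pHam);
        }
        return logOdds;
    }
};

int main() {
    BayesFilter f;
    f.train("cheap pills buy now", true);                    // one tiny spam example
    f.train("meeting agenda for tomorrow", false);           // one tiny legitimate example
    std::cout << f.score("buy cheap pills") << "\n";         // positive: spammy
    std::cout << f.score("agenda for the meeting") << "\n";  // negative: looks legitimate
    return 0;
}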
receivedDB
receivedDB is a set
of Python scripts for parsing
the Received:
headers in emails. A database file (text)
is kept separate from the code so new header patterns can be added
for recognizing new kinds of headers. This is useful for tracking
the origin of emails and detecting header forgery.
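The actual scripts are in Python; the C++ sketch below only illustrates the general idea of driving the matching from patterns kept in a separate text file (the file name and the sample pattern are made up for this example):

// Gist of the approach: the patterns live in a separate file, one regular
// expression per line, each with a capture group for the sending host (the
// file name and the pattern below are hypothetical examples).
#include <fstream>
#include <iostream>
#include <regex>
#include <string>

int main() {
    // A hypothetical pattern file "received_patterns.txt" might contain lines like:
    //   from ([-.A-Za-z0-9]+) \(.*\) by .*
    std::ifstream db("received_patterns.txt");
    std::string header =
        "Received: from mail.example.com (unknown [192.0.2.1]) by mx.example.org";

    for (std::string pattern; std::getline(db, pattern); ) {
        std::smatch m;
        if (std::regex_search(header, m, std::regex(pattern)) && m.size() > 1) {
            std::cout << "sending host: " << m[1] << "\n";
            break;                                   // first matching pattern wins
        }
    }
    return 0;
}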