For Research
This page lists various software I have developed. Most of the code
is under either the GPL or the LGPL.
There is no warranty of any kind that the code will work as advertised
(in fact it might even blow up your machine) - use at your own risk.
However, if you do find the code useful, or have any comments (bug reports,
etc.), I'd love to hear from you.
UTools
UTools is a set of software modules written in C++ for performing
sentence generation/analysis using a unification-based grammar
formalism (example grammar). It
is a modernized and extended version of the original GenKit written
by Tomita and Nyberg in 1988 (in LISP), and a close relative of KANTOO (also
written in C++), which is used in the KANT project
at LTI, CMU, but is closed source. UTools is an independent implementation and is
currently used in various LTI projects such as Avenue.
To illustrate what it can do, here is how UTools can be used to
generate a sentence. First you need to have a feature structure representing
the meaning of the sentence you want to generate, like this ("A
small dog all of a sudden bites the girl in white with the teeth."):
((SUBJ
   (*OR*
     ((PRED DOG)
      (FIN -)
      (PERS 3)
      (NUM SG)
      (BREED CHIHUAHUA))
     ((PRED DOG)
      (FIN -)
      (PERS 3)
      (NUM SG)
      (BREED PITBULL))))
 (OBJ
   ((PRED GIRL)
    (FIN +)
    (PERS 3)
    (NUM SG)
    (COAT_COLOR WHITE)))
 (INST
   ((PRED TOOTH)
    (FIN +)
    (PERS 3)
    (NUM PL)))
 (PRED BITE)
 (TENSE PRES))
The feature structure is basically a tree specifying what
the subject is (it can be a 3rd person singular chihuahua or pitbull),
what the object is (the poor girl), what the instrument is
(teeth, plural), and what the action is (bite!).
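If it helps to see how such a structure might look in code, here is a minimal sketch of a feature-structure node in C++ - my own illustration for this page, not the actual Toolbox/UKernel classes - with *OR* disjunctions kept as a list of alternatives:

// Minimal feature-structure sketch (illustration only, not the real UTools types).
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct FSNode {
    std::string atom;                                       // atomic value, e.g. "DOG"
    std::map<std::string, std::shared_ptr<FSNode>> feats;   // feature -> sub-structure
    std::vector<std::shared_ptr<FSNode>> alternatives;      // non-empty for an *OR* node
};

std::shared_ptr<FSNode> atom(const std::string& a) {
    auto n = std::make_shared<FSNode>();
    n->atom = a;
    return n;
}

int main() {
    // Encode the SUBJ part of the example: a dog that is either a chihuahua or a pitbull.
    auto chihuahua = std::make_shared<FSNode>();
    chihuahua->feats["PRED"]  = atom("DOG");
    chihuahua->feats["BREED"] = atom("CHIHUAHUA");

    auto pitbull = std::make_shared<FSNode>();
    pitbull->feats["PRED"]  = atom("DOG");
    pitbull->feats["BREED"] = atom("PITBULL");

    auto subj = std::make_shared<FSNode>();                 // the *OR* node
    subj->alternatives = {chihuahua, pitbull};

    auto root = std::make_shared<FSNode>();
    root->feats["SUBJ"] = subj;
    root->feats["PRED"] = atom("BITE");

    std::cout << "SUBJ has " << root->feats["SUBJ"]->alternatives.size()
              << " alternatives\n";                         // prints 2
    return 0;
}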
You then need to provide a grammar and
a lexicon for UTools to generate the
target sentence. A trace of this process can be seen here.
You can find more information about this process by reading the
UKernel manual:
B. Han and A. Lavie. UKernel:
A Unification Kernel. Technical Report CMU-LTI-03-177,
Language Technologies Institute, Carnegie Mellon University, August
13, 2004.
Currently UTools has the following components (C++):
- Generator (download):
This is a simple recursive-descent sentence generator - given a
feature structure as input, a natural language sentence is generated
by executing grammar rules.
- UKernel (download):
This is the core engine of UTools - it provides the basic functionality
for executing the unification rules in a grammar (a minimal sketch of
what unification does follows this list).
- Toolbox (download):
This is a library implementing basic data structures such as token
strings/dictionaries, tree templates and context-free grammars
with prefix searching, etc.
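To give a feel for what UKernel's core job is, here is a toy version of feature-structure unification - again only an illustrative sketch, not the real engine (it ignores *OR* disjunctions, structure sharing, and everything else that makes unification useful in practice):

// Toy feature-structure unification (illustration only; UKernel's real
// implementation is much richer, e.g. it handles *OR* disjunctions).
#include <iostream>
#include <map>
#include <memory>
#include <string>

struct FS {
    std::string atom;                                  // non-empty => atomic node
    std::map<std::string, std::shared_ptr<FS>> feats;  // non-empty => complex node
};

using FSPtr = std::shared_ptr<FS>;

FSPtr makeAtom(const std::string& a) {
    auto f = std::make_shared<FS>();
    f->atom = a;
    return f;
}

// Returns the unified structure, or nullptr if the two structures clash.
FSPtr unify(const FSPtr& a, const FSPtr& b) {
    if (!a) return b;
    if (!b) return a;
    if (!a->atom.empty() || !b->atom.empty()) {        // at least one atomic node
        if (a->atom == b->atom && a->feats.empty() && b->feats.empty()) return a;
        return nullptr;                                // atom mismatch, or atom vs. complex
    }
    auto result = std::make_shared<FS>(*a);            // start from a copy of a
    for (const auto& [feat, val] : b->feats) {         // fold in b's features one by one
        auto it = result->feats.find(feat);
        FSPtr merged = val;
        if (it != result->feats.end()) merged = unify(it->second, val);
        if (!merged) return nullptr;                   // clash on this feature
        result->feats[feat] = merged;
    }
    return result;
}

int main() {
    // ((PRED GIRL) (NUM SG)) unifies with ((NUM SG) (PERS 3)), but not with ((NUM PL)).
    auto a = std::make_shared<FS>();
    a->feats["PRED"] = makeAtom("GIRL");
    a->feats["NUM"]  = makeAtom("SG");
    auto b = std::make_shared<FS>();
    b->feats["NUM"]  = makeAtom("SG");
    b->feats["PERS"] = makeAtom("3");
    auto c = std::make_shared<FS>();
    c->feats["NUM"]  = makeAtom("PL");

    std::cout << (unify(a, b) ? "a+b unify" : "a+b clash") << "\n";  // a+b unify
    std::cout << (unify(a, c) ? "a+c unify" : "a+c clash") << "\n";  // a+c clash
    return 0;
}

The point of unification is that two compatible structures merge into one, while a clash (here, NUM SG against NUM PL) makes the attempt fail; that success/failure test is what drives the application of grammar rules.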
Two modules are notably missing: the first is a front-end module
which parses grammar files into the corresponding data structures
to drive the system (otherwise there is no way for UTools to understand
the grammar rules you developed in plain text!). The second missing
module is a parser. With these pieces added, UTools could be
a complete suite meeting a broad range of needs in developing natural
language applications.
Hence my call for help: my plan is to add a front-end
module with bindings for popular scripting languages, e.g., Python,
and to add an Earley
parser for efficient parsing. These two components should be
written in C/C++ as well. With limited time on my hands, however, I would
like to invite anyone who is interested to join me - we can
even move this project to Sourceforge if
necessary, but the project has to remain open source (either GPL or LGPL).
Please contact me (email at the bottom of this page) if you are interested.
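To make the Earley part concrete, here is a toy Earley recognizer - again only a sketch of the algorithm (grammar hard-coded, no feature structures, no parse-tree recovery), not a piece of UTools:

// A toy Earley recognizer (a sketch of the parsing strategy only, not part of
// UTools). Nonterminals are the grammar's map keys; any other symbol on a
// right-hand side is treated as a terminal, i.e. a word.
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct State {
    std::string lhs;
    std::vector<std::string> rhs;
    size_t dot;      // how much of rhs has been recognized so far
    size_t origin;   // chart column where this state started
    bool operator==(const State& o) const {
        return lhs == o.lhs && rhs == o.rhs && dot == o.dot && origin == o.origin;
    }
};

using Grammar = std::map<std::string, std::vector<std::vector<std::string>>>;

void addState(std::vector<State>& column, const State& s) {
    for (const auto& t : column) if (t == s) return;    // avoid duplicates
    column.push_back(s);
}

bool recognize(const Grammar& g, const std::string& start,
               const std::vector<std::string>& words) {
    std::vector<std::vector<State>> chart(words.size() + 1);
    for (const auto& rhs : g.at(start)) addState(chart[0], {start, rhs, 0, 0});

    for (size_t i = 0; i <= words.size(); ++i) {
        for (size_t j = 0; j < chart[i].size(); ++j) {   // chart[i] may grow as we go
            State s = chart[i][j];
            if (s.dot < s.rhs.size()) {
                const std::string& next = s.rhs[s.dot];
                if (g.count(next)) {                     // PREDICT: expand a nonterminal
                    for (const auto& rhs : g.at(next)) addState(chart[i], {next, rhs, 0, i});
                } else if (i < words.size() && words[i] == next) {   // SCAN: match a word
                    addState(chart[i + 1], {s.lhs, s.rhs, s.dot + 1, s.origin});
                }
            } else {                                     // COMPLETE: s.lhs is fully recognized
                for (size_t k = 0; k < chart[s.origin].size(); ++k) {
                    State t = chart[s.origin][k];
                    if (t.dot < t.rhs.size() && t.rhs[t.dot] == s.lhs)
                        addState(chart[i], {t.lhs, t.rhs, t.dot + 1, t.origin});
                }
            }
        }
    }
    for (const auto& s : chart[words.size()])
        if (s.lhs == start && s.dot == s.rhs.size() && s.origin == 0) return true;
    return false;
}

int main() {
    Grammar g = {
        {"S",  {{"NP", "VP"}}},
        {"NP", {{"the", "dog"}, {"the", "girl"}}},
        {"VP", {{"bites", "NP"}}}
    };
    std::cout << recognize(g, "S", {"the", "dog", "bites", "the", "girl"}) << "\n";  // 1
    std::cout << recognize(g, "S", {"the", "dog", "the", "girl"}) << "\n";           // 0
    return 0;
}

The three operations (predict, scan, complete) handle arbitrary context-free grammars in at worst cubic time, which is why an Earley parser is an attractive fit here.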
Analyzer
Analyzer is a program for building a bilingual dictionary from a
parallel corpus. Analyzer uses a steady-state genetic algorithm to
search for better translation solutions, and uses part-of-speech information
from the target language to 'optimize' the translations. You need
to download Toolbox and ePost to
build and run this code.
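For those unfamiliar with the technique, here is a generic steady-state GA skeleton - the chromosome encoding and the fitness function below are stand-ins for illustration only; see the paper below for how Analyzer actually scores candidate dictionaries:

// Generic steady-state GA skeleton (illustration only; the encoding and the
// fitness function here are stand-ins, not Analyzer's actual scoring).
#include <algorithm>
#include <iostream>
#include <random>
#include <vector>

using Individual = std::vector<int>;   // a candidate mapping: source word i -> target word ind[i]

std::mt19937 rng(42);

// Stand-in fitness: simply rewards mapping source word i to target word i.
// In Analyzer this would score a candidate dictionary against the corpus.
double fitness(const Individual& ind) {
    double score = 0;
    for (size_t i = 0; i < ind.size(); ++i)
        if (ind[i] == static_cast<int>(i)) score += 1;
    return score;
}

// Binary tournament selection: pick two at random, keep the fitter one.
Individual tournament(const std::vector<Individual>& pop) {
    std::uniform_int_distribution<size_t> pick(0, pop.size() - 1);
    const Individual& a = pop[pick(rng)];
    const Individual& b = pop[pick(rng)];
    return fitness(a) >= fitness(b) ? a : b;
}

int main() {
    const size_t genes = 10, popSize = 30, steps = 2000;
    std::uniform_int_distribution<int> randomGene(0, static_cast<int>(genes) - 1);
    std::uniform_int_distribution<size_t> randomCut(0, genes);

    // Random initial population.
    std::vector<Individual> pop(popSize, Individual(genes));
    for (auto& ind : pop)
        for (auto& g : ind) g = randomGene(rng);

    for (size_t step = 0; step < steps; ++step) {
        // Steady state: breed one child per step and let it replace the worst individual.
        Individual p1 = tournament(pop), p2 = tournament(pop);
        size_t cut = randomCut(rng);
        Individual child(p1.begin(), p1.begin() + cut);             // one-point crossover
        child.insert(child.end(), p2.begin() + cut, p2.end());
        if (randomGene(rng) == 0)                                   // occasional mutation
            child[randomCut(rng) % genes] = randomGene(rng);

        auto worst = std::min_element(pop.begin(), pop.end(),
            [](const Individual& x, const Individual& y) { return fitness(x) < fitness(y); });
        if (fitness(child) > fitness(*worst)) *worst = child;
    }

    auto best = std::max_element(pop.begin(), pop.end(),
        [](const Individual& x, const Individual& y) { return fitness(x) < fitness(y); });
    std::cout << "best fitness: " << fitness(*best) << " / " << genes << "\n";
    return 0;
}

The 'steady-state' part is that each step breeds a single child that competes against the worst member of the population, instead of rebuilding a whole generation at once.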
Paper: B. Han. Building
a Bilingual Dictionary with Scarce Resources: A Genetic Algorithm
Approach. In the Student Research Workshop, the Second Meeting
of the North American Chapter of the Association for Computational
Linguistics (NAACL-2001), Pittsburgh, 2001.
You can also find the slides in the Publications section.
v1_4-20010903: Now compiled
under g++ 3.0 with -Wall -pedantic, along with some minor fixes.
ePost
ePost is an adapted version of Brill's part-of-speech tagger for
English (it was actually factored out of Analyzer). The differences
from Brill's original code are:
- Lots of memory-related bugs have been removed, so you can now run the
tagger more than once at run time.
- A C++ class wrapper is defined so it's easier to use in C++.
- Now runs under Windows/MSVC++ 6.0.
You might also be interested in a cool Java/Perl tagger developed
by Jimmy Lin at the
University of Maryland (based on ePost) - you can download it there!
v1_0-20020722: fixed warnings/errors
for gcc 3.1.
For Fun
This is non-research-related code (at least for now). Again, use at
your own risk.
JunkMatcher
Icon courtesy of Steve Caplin
JunkMatcher (on Sourceforge.net)
is a versatile spam-filter add-on for Mac OS X. Although Apple's built-in Mail.app has
a wonderful, statistically trained junk filter, spammers nowadays
use various
tricks to conceal what they really want to say (graphics,
encoded characters, etc.). If you're going nuts over your spam problem,
I've written a cocktail-style tool that uses various effective
techniques, such as Bayesian filtering, IP-based blocking
and flexible regular
expressions, to identify those sneaky junk mails.
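As a rough illustration of the Bayesian ingredient in that cocktail, here is a textbook-style naive-Bayes scorer - this is not JunkMatcher's actual code, just the idea behind one of its techniques:

// Toy naive-Bayes spam scorer (illustration of the idea only, not JunkMatcher code).
#include <cmath>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

struct BayesFilter {
    std::map<std::string, int> spamCount, hamCount;   // word -> number of messages it was seen in
    int spamMsgs = 0, hamMsgs = 0;

    static std::vector<std::string> tokenize(const std::string& text) {
        std::istringstream in(text);
        std::vector<std::string> words;
        for (std::string w; in >> w; ) words.push_back(w);
        return words;
    }

    static int lookup(const std::map<std::string, int>& m, const std::string& w) {
        auto it = m.find(w);
        return it == m.end() ? 0 : it->second;
    }

    void train(const std::string& text, bool isSpam) {
        (isSpam ? spamMsgs : hamMsgs)++;
        for (const auto& w : tokenize(text)) (isSpam ? spamCount : hamCount)[w]++;
    }

    // Log-odds that the message is spam; a positive score means "looks like spam".
    double score(const std::string& text) const {
        double logOdds = std::log((spamMsgs + 1.0) / (hamMsgs + 1.0));
        for (const auto& w : tokenize(text)) {
            double pSpam = (lookup(spamCount, w) + 1.0) / (spamMsgs + 2.0);  // Laplace smoothing
            double pHam  = (lookup(hamCount, w) + 1.0) / (hamMsgs + 2.0);
            logOdds += std::log(pSpam / pHam);
        }
        return logOdds;
    }
};

int main() {
    BayesFilter f;
    f.train("cheap pills buy now", true);                    // one tiny spam example
    f.train("meeting agenda for tomorrow", false);           // one tiny legitimate example
    std::cout << f.score("buy cheap pills") << "\n";         // positive: spammy
    std::cout << f.score("agenda for the meeting") << "\n";  // negative: looks legitimate
    return 0;
}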
receivedDB
receivedDB is a set
of Python scripts for parsing
the Received:
headers in emails. A database file (text)
is kept separate from the code so new header patterns can be added
for recognizing new kinds of headers. This is useful for tracking
the origin of emails and detecting header forgery.
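The actual scripts are in Python; the C++ sketch below only illustrates the general idea of driving the matching from patterns kept in a separate text file (the file name and the sample pattern are made up for this example):

// Gist of the approach: the patterns live in a separate file, one regular
// expression per line, each with a capture group for the sending host (the
// file name and the pattern below are hypothetical examples).
#include <fstream>
#include <iostream>
#include <regex>
#include <string>

int main() {
    // A hypothetical pattern file "received_patterns.txt" might contain lines like:
    //   from ([-.A-Za-z0-9]+) \(.*\) by .*
    std::ifstream db("received_patterns.txt");
    std::string header =
        "Received: from mail.example.com (unknown [192.0.2.1]) by mx.example.org";

    for (std::string pattern; std::getline(db, pattern); ) {
        std::smatch m;
        if (std::regex_search(header, m, std::regex(pattern)) && m.size() > 1) {
            std::cout << "sending host: " << m[1] << "\n";
            break;                                   // first matching pattern wins
        }
    }
    return 0;
}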