Software and Datasets
Software: Jangada
Jangada is an API for signature block extraction and reply-to extraction from
email messages. It follows the ideas of the paper "Learning to Extract
Signature and Reply Lines from Email" (CEAS 2004), but performance was slightly
improved by using a new set of features not mentioned in the original
reference.
Some Features:
Extracts signature blocks and reply lines in email messages with very good
accuracy.
Can be easily integrated in other Java applications (for instance, the entire
email message as a String can be used as input).
Can be easily integrated in other MinorThird applications (using the TextLabels
format, it accepts as input email messages with other annotations, such as
dates, personal names, speech acts, etc.).
Licensing: University of Illinois/NCSA Open Source License
Documentation: Very poor. An initial javadocs page is here. There is some
documentation on how to use Jangada in the example files below.
Requires: j2sdk1.4 or later. Uses MinorThird.jar.
Recommended: When using email files as input, results will be better if the
messages are in MIME (.eml) format.
Usage example:
1. Create a new directory (for instance, jangadaDir).
2. Download jangada.jar, minorThird.jar, the example files, and the email files
to jangadaDir.
3. Unzip (gunzip Demos.tar.gz) and untar (tar -xvf Demos.tar) the example
files, as well as the email files.
4. Add jangadaDir, jangadaDir/minorThird.jar and jangadaDir/jangada.jar to the
CLASSPATH.
5. For a quick demo, compile the example files. For instance: “javac
Demo2.java” (in case of errors, please check your CLASSPATH again).
6. Run the examples on the email files directory: “java Demo2 emails/*”.
7. Check the documentation in the DemoX.java files and try your own
application.
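The setup steps above can be sketched as a POSIX shell session. This is only a sketch under stated assumptions: the downloads have already been placed in jangadaDir by hand, the example archive is named Demos.tar.gz as in the text, and the email files unpack into an emails/ directory. The name of the email archive is not given here, so that step is left as a comment.

```shell
# Create the working directory used in the text and move into it.
mkdir -p jangadaDir
cd jangadaDir || exit 1

# Manual step: copy jangada.jar, minorThird.jar, Demos.tar.gz and the
# email archive (its file name is not stated in the text) into this directory.

# Unpack the example files if the archive is present.
if [ -f Demos.tar.gz ]; then
    gunzip Demos.tar.gz
    tar -xvf Demos.tar
fi

# Put the directory and both jars on the CLASSPATH, keeping any existing value.
CLASSPATH="$PWD:$PWD/minorThird.jar:$PWD/jangada.jar${CLASSPATH:+:$CLASSPATH}"
export CLASSPATH

# Compile and run one demo on the email directory, if both are available.
# Assumption: the email files were unpacked into ./emails.
if [ -f Demo2.java ] && [ -d emails ]; then
    javac Demo2.java && java Demo2 emails/*
fi
```

The guards around the archive and the demo are there so the session can be pasted before the downloads are complete; once the files are in place, the same commands perform the unpacking, compilation and run.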
Reminder 1: If you’d like to have access to the source code, please send me an
email.
Reminder 2: If you used this package, please cite the following reference:
· Learning to Extract Signature and Reply Lines from Email, Vitor R. Carvalho
and William W. Cohen, CEAS 2004.
Software: Ciranda
A Java application that predicts the Email-Acts (or email speech acts) of email
messages. It follows the ideas of two papers (EMNLP 2004 and SIGIR 2005), but
performance was significantly improved by careful feature selection and
additional features.
Some Features:
Predicts the following acts: Request, Commit, Deliver, Propose, Meet, dData.
Provides the confidence in each prediction.
Easy way to use these acts as features in your application.
Licensing: No guarantees are provided. Lots of bugs for sure. Use at your own risk!
Documentation: Very poor. An initial javadocs page is here. Please check Example.java
on how to use it.
Requires: j2sdk1.4 or later. Uses MinorThird.jar (see below).
Questions: I’ll be happy to help, especially if you tell me what a good Ciranda
is :-)
Usage example:
1. Create a new directory called ciranda, and ciranda/lib.
2. Download ciranda.jar and minorThird.jar to ciranda/lib.
3. Add ciranda/ and lib/ciranda.jar to the CLASSPATH.
4. Download the example file Example.java to ciranda/.
5. Compile it: “javac Example.java” (in case of errors, please check your
CLASSPATH again).
6. Run the example: “java Example”.
7. Or run the main application on a directory with emails in text format
(without headers):
8. Create the test directory ciranda/testdir.
9. Add some emails in text format (such as msg1, msg2, msg3) to
ciranda/testdir.
10. Run “java -jar lib/ciranda.jar testdir”.
11. Or try your own application.
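The steps above can be sketched as a POSIX shell session. This is a sketch under assumptions, not a tested recipe: the jars and Example.java are copied in by hand, lib/minorThird.jar is added to the CLASSPATH because Ciranda requires it (the text itself only lists ciranda/ and lib/ciranda.jar), and the single message written to testdir is a made-up placeholder, not part of any dataset.

```shell
# Create the directory layout described in the text and move into it.
mkdir -p ciranda/lib ciranda/testdir
cd ciranda || exit 1

# Manual step: copy ciranda.jar and minorThird.jar into lib/,
# and Example.java into this directory.

# Put the directory and the jars on the CLASSPATH, keeping any existing value.
# Assumption: minorThird.jar is listed as well, since Ciranda depends on it.
CLASSPATH="$PWD:$PWD/lib/ciranda.jar:$PWD/lib/minorThird.jar${CLASSPATH:+:$CLASSPATH}"
export CLASSPATH

# Compile and run the example, if it has been downloaded.
if [ -f Example.java ]; then
    javac Example.java && java Example
fi

# Add a plain-text message without headers to the test directory.
# The content is an illustrative placeholder only.
printf 'Could you send me the report by Friday?\n' > testdir/msg1

# Run the main application on the test directory, if the jar is present.
if [ -f lib/ciranda.jar ]; then
    java -jar lib/ciranda.jar testdir
fi
```

In practice testdir would hold your own messages (msg1, msg2, msg3 and so on) in place of the placeholder.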
Reminder: Send me an email if you'd like the source code. If you use this
package, please cite the following reference:
· Learning to Classify Email into “Speech Acts”, William W. Cohen, Vitor R.
Carvalho and Tom M. Mitchell, EMNLP 2004.
Dataset: Signature and Reply Dataset
These 617 email messages are annotated with signature lines and reply-to lines.
The messages are a subset of the 20 Newsgroups dataset (produced by Ken Lang at
CMU in the mid-1990s).