This is the documentation for the SYNERGY NER system for Arabic and
Swahili as of April 2010.
The SYNERGY system performs NER by translating Arabic or Swahili text
to English, running off-the-shelf state-of-the-art English NER systems,
and aligning the annotated English output back to the original Arabic
or Swahili text. It achieves F1 scores of 0.788 for Arabic and 0.815
for Swahili. It is discussed in detail in this paper.
The system is implemented in Perl. It requires the following Perl
modules to be installed:
List::Util qw(max min)
Text::English
WebService::Google::Language
Encode qw/encode decode is_utf8/
Encode::Buckwalter
REST::Google::Translate
Algorithm::Diff qw(LCS LCS_length LCSidx diff sdiff compact_diff
traverse_sequences traverse_balanced)
String::LCSS_XS qw(lcss lcss_all)
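Before running the system, it can help to confirm that these modules are actually installed. The following is a minimal sketch of such a check; the helper missing_modules is a hypothetical name introduced here for illustration and is not part of SYNERGY.

```perl
use strict;
use warnings;

# Module names taken from the list above.
my @required = qw(
    List::Util Text::English WebService::Google::Language Encode
    Encode::Buckwalter REST::Google::Translate Algorithm::Diff
    String::LCSS_XS
);

# Hypothetical helper (not part of SYNERGY): returns the names of the
# modules that cannot be loaded.
sub missing_modules {
    my @missing;
    for my $module (@_) {
        # Convert Foo::Bar to Foo/Bar.pm so that require accepts a string.
        (my $path = "$module.pm") =~ s{::}{/}g;
        push @missing, $module unless eval { require $path; 1 };
    }
    return @missing;
}

my @missing = missing_modules(@required);
print @missing
    ? "Missing modules: @missing\n"
    : "All required modules are installed.\n";
```

Any module reported as missing would need to be installed (for example from CPAN) before NER can be loaded.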
To use the system, first include the following line in your code:
use NER;
Then instantiate an object using the following command:
$ner = NER->new(%args);
The arguments hash can take the following optional parameters:
align - which alignment to use, one of {"giza", "en", "src"}.
default "giza".
pp - post-processing (0 - off, 1 - on). default 1.
coref - whether to also perform coreference resolution (CRR) (1) or
not (0). default 0.
translate - which translator to use: google ("g") or bing ("b").
default "g".
Thus a sample usage would look like this:
$ner = NER->new(align => "giza", pp => 1, coref => 0,
translate => "g");
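The documented values and defaults can also be checked before the object is created. This is only a sketch under that assumption: check_ner_args is a hypothetical helper, not a SYNERGY function, and whether NER itself validates its arguments is not stated here.

```perl
use strict;
use warnings;

# Allowed values and defaults as documented above.
my %allowed = (
    align     => [qw(giza en src)],
    pp        => [0, 1],
    coref     => [0, 1],
    translate => [qw(g b)],
);
my %default = (align => "giza", pp => 1, coref => 0, translate => "g");

# Hypothetical helper (not part of SYNERGY): merges user options over the
# documented defaults and dies on an unknown key or value.
sub check_ner_args {
    my %user = @_;
    for my $key (sort keys %user) {
        die "Unknown option '$key'\n" unless exists $allowed{$key};
        die "Invalid value '$user{$key}' for option '$key'\n"
            unless grep { $_ eq $user{$key} } @{ $allowed{$key} };
    }
    return (%default, %user);
}

my %args = check_ner_args(align => "en", coref => 1);
print join(", ", map { "$_ => $args{$_}" } sort keys %args), "\n";
# prints: align => en, coref => 1, pp => 1, translate => g
```

The resulting hash could then be passed on as NER->new(%args).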
Then, to use any function in this module, say X, call it as follows:
$ner->X(@args);
The function to call to perform NER is translate; it must be supplied
a list of file names, e.g.
$ner->translate("file1_arb.txt", "file2_arb.txt");
File names must end in either "_arb.txt" for Arabic, or "_swh.txt" for
Swahili.
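Since the language is taken from the file name suffix, a caller may want to screen file names before passing them to translate. A minimal sketch follows; source_language is a hypothetical helper introduced for illustration, and how SYNERGY itself reacts to other suffixes is not documented here.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of SYNERGY): maps a file name to the
# language implied by its suffix, or undef if the suffix is not recognised.
sub source_language {
    my ($file) = @_;
    return "Arabic"  if $file =~ /_arb\.txt\z/;
    return "Swahili" if $file =~ /_swh\.txt\z/;
    return;
}

my @files = ("file1_arb.txt", "file2_swh.txt", "notes.txt");
for my $file (@files) {
    my $lang = source_language($file);
    print defined $lang
        ? "$file: $lang\n"
        : "$file: rejected (must end in _arb.txt or _swh.txt)\n";
}
```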
Other useful functions include make_table, lbj and stanford. Sample
usage:
($stan_table, $lbj_table, $corr_table, $union_table) =
$ner->make_table($outfile, @infiles);
The first argument must be the name of the output file; subsequent
arguments must be raw input files. By raw input files, we mean English
text files without any labeling. Their names must end in _raw, i.e.
follow the format xyz_raw.txt
Before using the make_table function for some $infile, the following
commands must be run:
$ner->lbj($infile);
$ner->stanford($infile);
This will produce the LBJ-tagged and Stanford-tagged versions of the
raw file. You must have the LBJ and Stanford systems installed, and the
folders LbjNerTagger1.11.release and stanford-ner-2009-01-16 must be in
your current folder. The LBJ files produced will be of the form
_lbj.txt and the Stanford files produced will be of the form
_tempstan.txt and _stanford.txt. Do not rename these files.
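A quick pre-flight check for the two tagger folders can avoid a failed run. The sketch below assumes only the folder names given above; missing_tagger_dirs is a hypothetical helper and not part of SYNERGY.

```perl
use strict;
use warnings;
use File::Spec;

# Folder names as given above.
my @tagger_dirs = ("LbjNerTagger1.11.release", "stanford-ner-2009-01-16");

# Hypothetical helper (not part of SYNERGY): returns the tagger folders
# that are absent from $base (default: the current folder).
sub missing_tagger_dirs {
    my ($base) = @_;
    $base = File::Spec->curdir unless defined $base;
    return grep {
        my $path = File::Spec->catdir($base, $_);
        !-d $path;
    } @tagger_dirs;
}

my @missing = missing_tagger_dirs();
print @missing
    ? "Missing tagger folders: @missing\n"
    : "Both tagger folders are present.\n";
```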
So, a full use case would look like this:
for $file (@infiles)
{
    $ner->lbj($file);
    $ner->stanford($file);
}
($stan_table, $lbj_table, $corr_table, $union_table)
= $ner->make_table("table_all.txt", @infiles);
Now you can use either the tables returned in $stan_table, etc. or the
output file table_all.txt.
If a labeled version of the raw file exists, its name must be of the
form _correct.txt. In that case $corr_table will be non-empty;
otherwise it will be an empty string.