This is the documentation for the SYNERGY NER system for Arabic and Swahili as of April 2010.

The SYNERGY system performs NER by translating Arabic or Swahili text to English, running off-the-shelf state-of-the-art English NER systems, and aligning the annotated English output back to the original Arabic or Swahili text. It achieves F1 scores of 0.788 for Arabic and 0.815 for Swahili. It is discussed in detail in this paper.
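
To illustrate the last step of that pipeline, here is a minimal sketch of projecting English NER tags back onto source tokens through a word alignment. It is not the SYNERGY implementation: the function name project_tags, the alignment format (pairs of source and English token indices) and the example sentence, tags and alignment are all assumptions made up for illustration.

```perl
use strict;
use warnings;

# Project English NER tags onto source tokens through a word alignment.
# $alignment is a list of [source_index, english_index] pairs (assumed format).
sub project_tags {
    my ($n_source, $english_tags, $alignment) = @_;
    my @source_tags = ("O") x $n_source;    # unaligned tokens stay untagged
    for my $pair (@$alignment) {
        my ($src, $en) = @$pair;
        $source_tags[$src] = $english_tags->[$en];
    }
    return @source_tags;
}

# Made-up example: a Swahili sentence, the tags of its English translation
# ("President Kikwete visited Nairobi"), and a one-to-one alignment.
my @source       = qw(Rais Kikwete alitembelea Nairobi);
my @english_tags = qw(O PER O LOC);
my @alignment    = ([0, 0], [1, 1], [2, 2], [3, 3]);

my @tags = project_tags(scalar @source, \@english_tags, \@alignment);
print join(" ", map { "$source[$_]/$tags[$_]" } 0 .. $#source), "\n";
# prints "Rais/O Kikwete/PER alitembelea/O Nairobi/LOC"
```

Real alignments are rarely one-to-one, so an actual system also has to decide what to do with unaligned or many-to-one tokens; the sketch simply leaves unaligned source tokens tagged "O".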

The system is implemented in Perl. It requires the following Perl modules to be installed:

List::Util qw(max min)
Text::English
WebService::Google::Language
Encode qw(encode decode is_utf8)
Encode::Buckwalter
REST::Google::Translate
Algorithm::Diff qw(LCS LCS_length LCSidx diff sdiff compact_diff traverse_sequences traverse_balanced)
String::LCSS_XS qw(lcss lcss_all)
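
Before running the system, it may help to check that these modules can be loaded. The following is a small sketch, not part of the NER module; the helper name missing_modules is hypothetical.

```perl
use strict;
use warnings;

# Return the names of the modules in the given list that cannot be loaded.
sub missing_modules {
    my @modules = @_;
    my @missing;
    for my $module (@modules) {
        # Convert Foo::Bar to Foo/Bar.pm and try to load it.
        (my $path = "$module.pm") =~ s{::}{/}g;
        push @missing, $module unless eval { require $path; 1 };
    }
    return @missing;
}

my @required = qw(
    List::Util Text::English WebService::Google::Language Encode
    Encode::Buckwalter REST::Google::Translate Algorithm::Diff String::LCSS_XS
);

my @missing = missing_modules(@required);
print @missing
    ? "Missing modules: @missing\n"
    : "All required modules are installed.\n";
```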


To use the system, first include the following line in your code:
  use NER;

Then instantiate an object using the following command:

$ner = NER->new(%args);

The arguments hash can take the following optional parameters:

align - specifies which alignment to use from {"giza", "en", "src"}. default "giza".
pp - post-processing (0 - off, 1 - on). default 1.
coref - whether to also perform coreference resolution (CRR) (1) or not (0). default 0.
translate - which translator to use: google ("g") or bing ("b"). default "g".

Thus a sample usage would look like this:
$ner = NER->new(align => "giza", pp => 1, coref => 0, translate => "g");
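
Since all four parameters are optional, it can be convenient to fill in the documented defaults and catch misspelled options before building the object. The sketch below does this with a hypothetical helper, check_ner_args, which is not part of the NER module; whether NER->new performs any such validation itself is not stated here.

```perl
use strict;
use warnings;

# Documented options with their default and allowed values.
my %OPTIONS = (
    align     => { default => "giza", allowed => [qw(giza en src)] },
    pp        => { default => 1,      allowed => [0, 1] },
    coref     => { default => 0,      allowed => [0, 1] },
    translate => { default => "g",    allowed => [qw(g b)] },
);

# Hypothetical helper: fill in defaults, die on unknown names or values.
sub check_ner_args {
    my %args = @_;
    for my $name (keys %args) {
        die "Unknown option '$name'\n" unless exists $OPTIONS{$name};
        my $value = $args{$name};
        die "Invalid value '$value' for option '$name'\n"
            unless grep { $_ eq $value } @{ $OPTIONS{$name}{allowed} };
    }
    my %full = map { $_ => $OPTIONS{$_}{default} } keys %OPTIONS;
    return (%full, %args);    # explicit arguments override defaults
}

my %args = check_ner_args(align => "en", coref => 1);
print join(", ", map { "$_=$args{$_}" } sort keys %args), "\n";
# prints "align=en, coref=1, pp=1, translate=g"
```

The checked hash can then be passed on as $ner = NER->new(%args);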

Then, to use any function in this module, say X, call it as follows:
$ner->X(@args);

The function to call to perform NER is translate. It must be given a list of file names, e.g.
$ner->translate("file1_arb.txt", "file2_arb.txt");

File names must end in either "_arb.txt" for Arabic, or "_swh.txt" for Swahili.
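
As an illustration of this naming rule, the sketch below filters a list of file names down to those with an accepted suffix. The helper language_of and the example file names are hypothetical and not part of the NER module.

```perl
use strict;
use warnings;

# Hypothetical helper: map a file name to its language by suffix.
# Returns undef (in scalar context) for names with neither suffix.
sub language_of {
    my ($file) = @_;
    return "Arabic"  if $file =~ /_arb\.txt\z/;
    return "Swahili" if $file =~ /_swh\.txt\z/;
    return;
}

my @candidates = ("file1_arb.txt", "news_swh.txt", "notes.txt");
my @infiles    = grep { defined language_of($_) } @candidates;
print "Accepted: @infiles\n";
# prints "Accepted: file1_arb.txt news_swh.txt"
```

The filtered list can then be passed on as $ner->translate(@infiles);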

Other useful functions include make_table, lbj and stanford. Sample usage:

($stan_table, $lbj_table, $corr_table, $union_table) = $ner->make_table($outfile, @infiles);

The first argument must be the name of the output file; subsequent arguments must be raw input files. By raw input files, we mean English text files without any labeling. Their names must end in _raw.txt, i.e. follow the format xyz_raw.txt

Before calling make_table for some $infile, the following commands must be run:
  $ner->lbj($infile);
  $ner->stanford($infile);
 
This will produce the LBJ-tagged and Stanford-tagged versions of the raw file. You must have the LBJ and Stanford systems installed, and the folders LbjNerTagger1.11.release and stanford-ner-2009-01-16 must be in your current folder. The LBJ files produced will be of the form _lbj.txt, and the Stanford files produced will be of the form _tempstan.txt and _stanford.txt. Do not change these names.
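
These requirements can be checked before calling lbj and stanford. The following is a sketch only; the helper preflight_problems and its optional second argument (the folder to look in, defaulting to the current folder) are hypothetical and not part of the NER module.

```perl
use strict;
use warnings;

# Hypothetical helper: list the problems that would prevent tagging $infile.
sub preflight_problems {
    my ($infile, $base_dir) = @_;
    $base_dir = "." unless defined $base_dir;
    my @problems;
    push @problems, "$infile does not end in _raw.txt"
        unless $infile =~ /_raw\.txt\z/;
    for my $dir ("LbjNerTagger1.11.release", "stanford-ner-2009-01-16") {
        push @problems, "folder $dir not found in $base_dir"
            unless -d "$base_dir/$dir";
    }
    return @problems;
}

my @problems = preflight_problems("xyz_raw.txt");
print @problems ? join("\n", @problems) . "\n" : "Ready to tag.\n";
```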

So, a full use case would look like this:

    for my $file (@infiles)
    {
        $ner->lbj($file);
        $ner->stanford($file);
    }
    ($stan_table, $lbj_table, $corr_table, $union_table) = $ner->make_table("table_all.txt", @infiles);

Now you can use either the tables returned in $stan_table, etc., or the output file table_all.txt.

If a labeled version of the raw file exists, its name must be of the form _correct.txt. In that case $corr_table will be non-empty; otherwise it will be an empty string.