Data Set of Morphologically Analyzed Inuktitut Words

This is the page for the currently available set of morphologically analyzed Inuktitut words. The development of this corpus is described in (forthcoming publication). The analyzed words were extracted from the Nunavut Hansard corpus, version 3.0, and were produced using two different analyzers: 1) the original Uqailaut analyzer described here (files with the 'uqailaut' designation in their name), and 2) the analyzer described in Micher, 2019, "Bootstrapping a Neural Morphological Generator from Morphological Analyzer Output for Inuktitut", Proceedings of the 3rd Workshop on Computational Methods for Endangered Languages, Vol. 2 Extended Abstracts, Honolulu, HI (files with the 'neuralmorph' designation in their name). The '1' and '2' in the file names denote corpus versions: '1' refers to version 1.1, and '2' refers to version 3.0 (excluding words already processed from version 1.1). The file whose name contains 'deep' is a rerun of the corresponding file, using the neural analyzer trained on deep form morphemes, to be consistent with the neuralmorph.2 file. The 'uqailaut' files contain only the first analysis returned by the Uqailaut analyzer, which can produce ambiguous output.

Word counts for the files are the following:

The format of the files is as follows:

<word><tab>{morph1}{morph2}...{morphx}| (for uqailaut files)
<word><tab>{morph1}{morph2}...{morphx}* (for neuralmorph files)

Some words are missing an analysis, denoted by "MISSING" in the analysis slot.
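
The line format above can be read with a few lines of Python. The following is a minimal sketch, not part of the data set distribution: the function names parse_line and count_missing are hypothetical, and the code assumes that each morpheme is enclosed in one pair of braces with no nested braces, and that anything outside the braces (such as the trailing "|" or "*") can be ignored. The sample lines use the placeholders from the format description, not real corpus entries.

```python
import re

# Assumption: every morpheme sits inside one pair of braces, with no nesting.
MORPH_RE = re.compile(r"\{([^{}]*)\}")


def parse_line(line):
    """Split one line into (word, morphemes).

    morphemes is a list of strings, or None when the analysis
    slot holds the literal string MISSING.
    """
    word, _, analysis = line.rstrip("\n").partition("\t")
    analysis = analysis.strip()
    if analysis == "MISSING":
        return word, None
    # Characters outside the braces (trailing "|" or "*") are skipped.
    return word, MORPH_RE.findall(analysis)


def count_missing(lines):
    """Return (total, missing) counts over an iterable of lines."""
    total = missing = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        total += 1
        if parse_line(line)[1] is None:
            missing += 1
    return total, missing


# Placeholder lines only; these are not taken from the corpus.
sample = [
    "word1\t{morph1}{morph2}|\n",
    "word2\t{morph1}{morph2}{morph3}*\n",
    "word3\tMISSING\n",
]
for entry in sample:
    print(parse_line(entry))
print(count_missing(sample))
```

To process a real file, the same functions can be applied to the lines of an opened file object; the file encoding is not stated on this page, so it would need to be checked first.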