Viktor Trón
IGK, Language Technology and Cognitive Systems, Universities of Edinburgh & Saarbrücken.
MOKK Lab, Budapest Institute of Technology, Budapest. v.tron@ed.ac.uk
This file documents the HunLex morphological resource specification framework and precompilation tool (HunLex). It corresponds to release 0.1 of the Hunlex distribution.
More information about Hunlex can be found at the MOKK Lab homepage,
http://lab.mokk.bme.hu.
Table of Contents
1 Introduction
2 License
4 Installation
  4.1 Download
  4.2 Supported Platforms
  4.3 Prerequisites
  4.4 Install
  4.5 Uninstall and Reinstall
  4.6 Installed Files
5 Bootstrapping
6 Toplevel Control
7 Description Language
  7.1 Morphs
    7.1.1 Morph Preamble and Variants
    7.1.2 Blocks
  7.2 Macros
  7.3 Metadata
8 Files
  8.1 Input Resources
    8.1.1 Primary Resources
      8.1.1.1 Lexicon
      8.1.1.2 Grammar
    8.1.2 Configuration Files
    8.1.3 Morpheme Configuration File
    8.1.4 Feature Configuration File
    8.1.5 Usage Configuration
  8.2 Output Resources
9 Command-line Control
10 Levels
11 Tags
  11.1 Merging Tags
  11.2 Feature Structures
12 Flags
13 Troubleshooting
  13.1 Installation Problems
  13.2 Problems running hunlex
  13.3 Resource Compilation Problems
  13.4 Grammar Problems
Chapter 1: Introduction
1 Introduction
This document presents the HunLex morphological resource specification framework and
precompilation tool, which is being developed as part of the Budapest Institute of Technology
Media Education and Research Center's HunTools Natural Language Processing Toolkit
(http://lab.mokk.bme.hu).
1.2 Motivation
The motivation behind HunLex came from two opposing types of requirements lexical resources are supposed to fulfill:
(i) scalability, maintainability, extensibility; and
(ii) optimized format for the application.
The constraints in (i) favour one central, redundancy-free, abstract, but transparent specification, while the ones in (ii) require possibly multiple application-specific, potentially redundant, optimized formats.
In order to reconcile these two opposing requirements, HunLex introduces an offline layer
into the word-analysis workflow, which mediates between two levels of resources:
1. a central database conforming to (i) (also primary resource, input resource),
2. various application-specific formats conforming to (ii) (also secondary or output resource)
The primary resources are supposed to be reasonably designed to help human maintenance,
and the secondary ones are supposed to optimize very different things, ranging from file size
and performance with the tool that uses them to coverage, robustness, verbosity, and normative
strictness, depending on who uses them for what purpose.
HunLex is used to compile the primary resources into a particular application-specific format (see Section 8.2 [Output Resources], page 33). This resource compilation phase is an
offline process which is highly configurable, so that users can fine-tune the output resources
according to their needs.
By introducing this layer of offline resource compilation, maintenance, extensibility and portability of lexical resources become possible without compromising performance on specific
word-analysis tasks.
Providing the environment for a sensible primary resource specification framework and
managing the offline precompilation process are the raison d'être behind Hunlex.
6. selection of registers, degree of normativity, etc. based on usage qualifiers in the database
7. selection of output morphological annotation, configurable tags information
Chapter 2: License
2 License
Hunlex is free software.
It is licensed under LGPL, which roughly means the following.
There are no restrictions on downloading it other than your bandwidth and our slothful
ways of making things available.
There are no restrictions on use either other than its deficiencies, clumsy features and outrageous bugs. However, this can be amended, because there are no restrictions on modifying
it either. See also Section 3.5 [Contribution], page 5.
Freedom of use implies that any resources that you create or compile with the mediation
of Hunlex are yours and you hold the right to distribute them in any way. Consider telling us
about this great news, see Section 3.8 [Contact], page 6.
What is more, there are no restrictions on redistributing this software or any modified
version of it.
For some legalese telling you the same, read the License http://creativecommons.org/licenses/LGPL/2.1/
Todo: Shall we not include the License?
3.4 Praises
So you found hunlex cool and/or useful and would like the authors to hear about that. How
nice is that! See Section 3.8 [Contact], page 6.
3.5 Contribution
Hunlex is an open source development, so developers are welcome to contribute to make it
better in any imaginable way. Contact us (see Section 3.8 [Contact], page 6) to work out
the details of how and what you would want to contribute to Hunlex.
3.6 Reference
For the context of the whole huntools kit, use
@InProceedings{szoszablya_saltmil:04,
  author =       {L\'aszl\'o N\'emeth and Viktor Tr\'on
                  and P\'eter Hal\'acsy and Andr\'as Kornai
                  and Andr\'as Rung and Istv\'an Szakad\'at},
  title =        {Leveraging the open-source ispell codebase
                  for minority language analysis},
  booktitle =    {Proceedings of SALTMIL 2004},
  year =         2004,
  organization = {European Language Resources Association},
  url =          {http://lab.mokk.bme.hu/}
}
A very brief intro to hunlex with a one-page English resume.
@InProceedings{hunlex_mszny:04,
  author =      {Tr\'on, Viktor},
  title =       {HunLex - a description framework and
                 resource compilation tool for morphological dictionaries},
  booktitle =   {II. Magyar Sz\'am\'{\i}t\'og\'epes
                 Nyelv\'eszeti Konferencia},
  institution = {Szegedi Tudom\'anyegyetem},
  address =     {Szeged, Hungary},
  year =        2004
}
These and other papers can be downloaded from the MOKK Lab publications page at
http://lab.mokk.bme.hu
3.7 Authors
The author of hunlex and this document is Viktor Trón. He can be reached by mail at
v.tron@ed.ac.uk
Hopefully more can be found on MOKK Lab's pages at http://lab.mokk.bme.hu.
3.8 Contact
You can get in contact with us if you
1. Mail Viktor Trón at v.tron@ed.ac.uk
2. Join the forums on http://lab.mokk.bme.hu
3. Submit a bug report (see Section 3.2 [Submitting a Bug Report], page 5) or feature
request (see Section 3.3 [Requesting a New Feature], page 5).
Chapter 4: Installation
4 Installation
So you want to install the hunlex toolkit (see Chapter 1 [Introduction], page 1) from the
hunlex source distribution. This document describes what and how you can install with
this distribution.
4.1 Download
The latest version of the hunlex source distribution is always available from the MOKK Lab
website at http://lab.mokk.bme.hu or, if all else fails, by mailing me at v.tron@ed.ac.uk.
4.3 Prerequisites
ocaml [Prerequisite]
Hunlex is written in the OCaml programming language (http://www.ocaml.org/).
OCaml compilers are extremely easy to install, are available for various platforms, and can be downloaded for free in various package formats from
http://caml.inria.fr/ocaml/distrib.html.
You will need ocaml version >= 3.08 to compile hunlex.
ocaml-make [Prerequisite]
OCamlMakefile (i.e., ocaml-make) is needed for the installation of hunlex and is
available from Markus Mottl's homepage at
http://www.ai.univie.ac.at/~markus/home/ocaml_sources.html#OCamlMakefile
4.4 Install
Hunlex is installed in the good old way, i.e., by typing
$ make && sudo make install
in the toplevel directory of the unpacked distribution. Read no further if you know what I
am talking about or if you trust some God.
The hunlex distribution is available in a source tarball called hunlex.tgz. First you have
to unpack it by typing
$ tar xzvf hunlex.tgz
Then, you enter the toplevel directory of the unpacked distribution with
$ cd hunlex
To compile it, simply type
$ make
in the toplevel directory of the distribution.
To install it (on what gets installed, see Section 4.6 [Installed Files], page 9), type
$ make install
Well, by default this would want to install things under /usr/local, so you have to have
admin permissions. If you are not root but you are in the sudoers file with the appropriate
rights, you type:
$ sudo make install
You can change the location of the installation by changing the install prefix path with
$ sudo make PREFIX=/my/favourite/path install
Changing the installation location of individual install targets is not recommended but easy-peasy if you have a clue about make and Makefile-s. To do this,
you have to change the relevant Makefile-s in the subdirectories of the distribution. See
Section 4.6 [Installed Files], page 9.
If it works, great! Go ahead to Chapter 5 [Bootstrapping], page 10.
If you have problems, double-check that you have the prerequisites (see Section 4.3 [Prerequisites], page 7). If you think you followed the instructions but still have problems, submit
a bug report (see Section 3.2 [Submitting a Bug Report], page 5).
If you are upgrading an earlier version of hunlex, you may want to uninstall the earlier one
first (see Section 4.5 [Uninstall and Reinstall], page 8).
Chapter 5: Bootstrapping
5 Bootstrapping
So you installed hunlex and it's running smoothly.
This section leads you through the first steps and gives you hints on how you set out working
with hunlex.
Create your sandbox directory.
Change to it.
Create your own local Makefile. This will be your connection to the hunlex toplevel control.
For your Makefile to understand hunlex predefined toplevel targets (see Section 6.3 [Targets],
page 13), you have to include (not insert) the hunlex systemwide Makefile. So you create a
Makefile with the following content:
-include /path/to/HunlexMakefile
where /path/to/HunlexMakefile is the path to HunlexMakefile which is supposed to
be installed on your system (see Section 4.6 [Installed Files], page 9), by default under
/usr/local/lib/HunlexMakefile.
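The bootstrapping steps so far (create a sandbox, write a one-line Makefile into it) can be sketched as a small shell script. This is only an illustration: the directory name hunlex-sandbox and the shell variables SANDBOX and HUNLEX_MAKEFILE are assumptions made for this example, and the HunlexMakefile path is the default location mentioned above, which may differ on your system.

```shell
#!/bin/sh
# Hypothetical sandbox location; any directory of your choice will do.
SANDBOX=${SANDBOX:-./hunlex-sandbox}
# Default install location named in the text; adjust it if you installed
# hunlex with a different PREFIX.
HUNLEX_MAKEFILE=${HUNLEX_MAKEFILE:-/usr/local/lib/HunlexMakefile}

# Create the sandbox directory.
mkdir -p "$SANDBOX"

# Write the local Makefile: a single line that includes the systemwide
# hunlex Makefile.
printf '%s\n' "-include $HUNLEX_MAKEFILE" > "$SANDBOX/Makefile"

# Show what was written.
cat "$SANDBOX/Makefile"
```

Once the Makefile is in place, change to the sandbox directory and continue with the make command described next.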
Now, you are ready to test things for yourself. In order to see if all is well, type
$ make
at your prompt in the same sandbox directory.
In fact, you will always type the make command to control hunlex. If you don't give
arguments to make, a so-called default action (target, see Section 6.3 [Targets], page 13) is
assumed. The default target is resources, which creates the output resources according
to the default settings (see Section 6.4 [Options], page 15).
Toplevel control assumes by default that all its necessary resources are found in the current
directory (see Section 6.4.3 [Input File Options], page 17). If this is not the case, because
the files do not exist, the compulsory ones are created and the compilation runs creating
the output resources.
Surely, the missing files are created without contents and your output resources will be
empty as well. However, this vacuous run will test whether hunlex (and toplevel control) is
working properly.
Now if you list your directory, you should see:
$ ls
affix.aff
dictionary.dic
grammar
lexicon
Makefile
morph.conf
phono.conf
usage.conf
Now.
If you want to develop (toy around with) your own data and create resources, the next
step is to fill in the input files. Read on to learn more about files (see Chapter 8 [Files],
page 30) and then about the hunlex morphological resource specification language (see
Chapter 7 [Description Language], page 24). Since you want to test your creation, you
ultimately have to learn about toplevel control (see Chapter 6 [Toplevel Control], page 12)
and gradually about the advanced issues in the chapters that follow these.
If you already have your hunlex-resources describing your favourite language ready and you
want to compile specific output resources from it with hunlex, you better read about toplevel
control with special attention to the options (see Chapter 6 [Toplevel Control], page 12). If
you want to fiddle around with more advanced optimization, such as levels and tags, you
may end up having to read everything, sorry.
6 Toplevel Control
You typically want to use hunlex through its toplevel control interface. Toplevel control
means that you invoke hunlex indirectly through a Makefile to compile your resources.
We envisage typical users of hunlex developing their lexical resources in an input directory
and occasionally dumping output resources for their analyser into specific target directories
for various applications.
If you don't like Makefiles or your system does not have make (how did you compile hunlex,
then?), you will have to invoke hunlex from a shell and use it via the command-line interface.
This is non-typical use and not recommended. The command-line interface, which is almost
equivalent in functionality to the Makefile interface, is described only for completeness
and for people developing alternative wrappers (see Chapter 9 [Command-line Control],
page 35).
In fact, you don't actually need to know much about make and Makefile-s to use hunlex.
Just follow the steps described in Chapter 5 [Bootstrapping], page 10. We assume that you
have a project directory with a Makefile sitting in it in order to try out what is described
here.
This document is more like a reference manual that details what you can do with your
resources and how you can do it through the Makefile interface. What the resources are
and how you can develop your own is described in other chapters (see Chapter 8 [Files],
page 30 and see Chapter 7 [Description Language], page 24).
6.3 Targets
The functionality of hunlex is accessed through targets. Targets are arguments of the make
command, which reads your local Makefile and ultimately consults the systemwide hunlex
toplevel Makefile called HunlexMakefile (see Section 4.6 [Installed Files], page 9).
Usually, you will control hunlex through make by typing:
make options targets
where options is a sequence of variable assignments which set your options described below
(see Section 6.4 [Options], page 15) and where targets is a sequence of targets. For more
on variables and targets you may consult the manual of make.
The available toplevel targets are detailed below:
resources
generate
new [Special Target]
pretends that the base resources have changed. You need this directive if you want
to recompile the resources although no primary resource has changed, for instance
because you are using a different configuration option. (If the base resources
are unchanged, no compilation takes place; you have to force it with new, see
make.)
make MIN_LEVEL=3 new resources
clean [Special Target]
removes all intermediate temporary files, so that only lexicon, grammar, the
configuration files, and the output resources (affix and dictionary) remain.
Todo: This is not implemented yet.
distclean [Special Target]
removes all non-primary resources, so that only lexicon, grammar, and the configuration files remain.
test [Test Target]
tests the resources by making hunmorph read them (the dic and aff files) and
analyze the contents of the file that is the value of TEST (see Section 6.4.3 [Input File
Options], page 17). TEST is by default set to the standard input, so after saying
$ make test
you have to type in words in the terminal window (exiting with C-d).
If you want to test by analyzing a file, you have to set the value of TEST.
$ make TEST=my/favourite/testfile test
Test output goes to stdout, so just redirect it to a file:
$ make TEST=my/favourite/testfile test > test.out 2> test.log
testwordlist [Test Target]
runs hunmorph on the wordlist file (see Section 6.3.1 [Resource Compilation
Targets], page 14, generate) and writes the result to the standard output (so you
may want to redirect the result to a file).
realtest [Test Target]
puts hunlex and the analyzer to the test by creating the resources according to the
settings of your Makefile and then running hunmorph on the whole generated wordlist.
Warning: Note that this target first generates all words and then creates
the resources again. Running this on huge databases is probably not a
good idea.
The way to test a bigger database instead is to create a set
of words that your ideal analyzer has to recognize or correctly analyze
and test on that (with test). Realtest is just a quick and dirty shorthand
for toy databases to check if everybody is with us.
6.4 Options
Options of the toplevel are in effect Makefile variables that can be set at the user's will.
(All the command-line options of hunlex can be accessed through the toplevel options, which are
passed to hunlex to regulate the compilation process. The documentation of command-line
options is found in Chapter 9 [Command-line Control], page 35, but only for the record. All
hunlex options are written in capital letters (LEXICON), and all command-line options begin with a
dash and are written in small letters (-lexicon), but otherwise they are the same.)
All options can be set or reset in your local Makefile (and remembered, see Section 6.2
[Storing your Settings], page 13). These will override the system default. Both the system
default and your local default can be overridden by direct command-line variable assignments
passed to make, such as:
$ make QUIET= DEBUG_LEVEL=3 OUTPUTDIR=/my/favourite/outputdir
Listed and explained below are all the hunlex options (all public Makefile variables) that
the toplevel control provides for the user to manipulate.
When you see something like variable (value), it means the default value of the variable
variable is value.
HUNLEX (hunlex) [Option]
[Installation], page 7). If you want to use (i) an alternative version of hunlex that
is not the one found in the path, or (ii) an uninstalled version of hunlex, or (iii) an
installed version whose location you don't want to include in your path, then
you should set which hunlex to use with this variable.
HUNLEX=/my/favourite/version/of/hunlex
HUNMORPH (hunmorph) [Option]
You need the executable hunmorph from the Huntools package (see Section 14.1.1
[Huntools], page 46) only for testing; if you don't want to test with direct analysis
(and just want to compile the resources), you don't need to bother.
When used, however, the hunmorph executable is assumed to be found in the path
under the name hunmorph. If this is not the case, update your path or provide the path to
hunmorph with the line
HUNMORPH=/my/favourite/version/of/hunmorph
QUIET (@ = quiet)
DEBUG_LEVEL (0) [Option]
sets the verbosity of hunlex itself. By default the debug level is set to 0. Debug messages
are sensitive to the debug level in the range from 0 to 6-ish: the higher the number,
the more verbose hunlex is about its doings.
0 is non-verbose mode, which means that it only displays (fatal) error messages. If
you set DEBUG_LEVEL to, say, -1, even error messages will be suppressed (only an
uncaught exception will be reported in case of fatal errors).
It is typically a good idea to set DEBUG_LEVEL to 2 or 3 and request more only if you really
want to see what is happening.
Caveat: In fact you won't understand the messages anyway, so the debug
blurbs just give you an idea of the context where something went wrong
with your grammar/lexicon, etc.
Todo: This shouldn't be so, and debug messages pertaining to grammar
development should be self-evident or well designed and documented.
Especially parsing errors and/or compile warnings about the grammar
and lexicon should be clear.
Usually you want to create a log by redirecting the debug output of make (standard error)
with your debug messages to a file. This can be done, for instance, by
$ make DEBUG_LEVEL=5 resources 2> log
TIME (time) [Option]
By default, every run of hunlex is timed to measure how long it takes to compile the
resources (using the Unix shell's time command), and this information is displayed. Of course, this
is only interesting with big lexicons. If you (i) don't have a time command, (ii) have
a different time command, or (iii) don't want time measured and displayed, just reset
the TIME variable. The option can be unset by the line
TIME=
in your local Makefile.
[Option]
lexicon file
[Option]
grammar file
They can all be set to alternative paths individually. If they are in the same directory, the
directory path can also be set via the variable GRAMMARDIR:
GRAMMARDIR (inputdir) [Option]
the directory for the hunlex primary input resource files, which is, by default, set to
inputdir, the value of the variable INPUTDIR, see below.
There are three further input resources which need to be present for a hunlex compilation.
These are the compilation configuration files.
[Option]
the usage configuration file (see Section 8.1.2 [Configuration Files], page 31)
[Option]
the morph(eme) configuration file (see Section 8.1.2 [Configuration Files], page 31)
[Option]
the configuration file (see Section 8.1.2 [Configuration Files], page 31) for
morphophonologic and morphoorthographic features
There are two optional configuration files, the signature and the flags file. By default, the
options corresponding to these files are set to the empty string, which tells hunlex not to use
feature structures (see Section 11.2 [Feature Structures], page 41) or custom output flags
(see Chapter 12 [Flags], page 42).
SIGNATURE () [Option]
The location of the signature file used to process and validate feature structures (see
Section 11.2 [Feature Structures], page 41, see Section 8.1.2 [Configuration Files],
page 31). If it is set to the empty string (the default), hunlex does not use feature
structures.
If you use this file, it makes sense to call it something like fs.conf or
signature.conf and store it in confdir with your other configuration files, so the
assignment
SIGNATURE=$(CONFDIR)/fs.conf
is an appropriate setting.
FLAGS () [Option]
The location of the custom output flags file (see Section 8.1.2 [Configuration Files],
page 31) used to decide which flags are used in the output resources (see Chapter 12
[Flags], page 42). If it is set to the empty string (default), hunlex will use a built-in
flagset to determine flaggable characters (see Chapter 12 [Flags], page 42).
If you use this file, it makes sense to call it something like flags.conf and store it
in confdir with your other configuration files, so the assignment
FLAGS=$(CONFDIR)/flags.conf
is an appropriate setting.
All configuration files can be set to alternative paths individually. If they are in the same
directory, the directory path can also be set via the variable CONFDIR:
CONFDIR (inputdir) [Option]
the directory for the hunlex compilation configuration files, which is, by default, set
to inputdir, the value of the variable INPUTDIR, see below.
As explained, all input files can be set to alternative paths individually, or the primary resources
together and the configuration files together. If all input resources (primary and configuration)
are in the same directory, this directory path can also be set via the variable INPUTDIR:
INPUTDIR (. = current directory) [Option]
the directory for all hunlex input resource files, which is, by default, set to the current
directory.
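The way the three directory options fall back on one another can be sketched in shell. This is an illustration only, assuming nothing beyond the defaults stated above (INPUTDIR defaults to the current directory, GRAMMARDIR and CONFDIR default to INPUTDIR); resolve_dirs is a hypothetical helper, not something hunlex or its toplevel Makefile provides.

```shell
#!/bin/sh
# resolve_dirs INPUTDIR GRAMMARDIR CONFDIR
# Hypothetical helper: pass an empty string for any option left unset.
# It prints the directories that would be in effect under the documented
# defaults.
resolve_dirs() {
    inputdir=${1:-.}             # INPUTDIR defaults to the current directory
    grammardir=${2:-$inputdir}   # GRAMMARDIR defaults to INPUTDIR
    confdir=${3:-$inputdir}      # CONFDIR defaults to INPUTDIR
    printf 'INPUTDIR=%s GRAMMARDIR=%s CONFDIR=%s\n' \
        "$inputdir" "$grammardir" "$confdir"
}

# Nothing set: everything lives in the current directory.
resolve_dirs "" "" ""
# Only INPUTDIR set: both other directories follow it.
resolve_dirs src "" ""
# CONFDIR set individually: only the configuration files move.
resolve_dirs src "" conf
```

In an actual project the same effect is obtained by passing the variables to make, for example INPUTDIR=src CONFDIR=conf.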
TEST (/dev/stdin)
[Option]
affix file
[Option]
dictionary file
[Option]
The wordlist generated by the generate target (see Section 6.3.1 [Resource Compilation Targets], page 14).
where outputdir (the default directory of the files) is the value of the variable OUTPUTDIR:
OUTPUTDIR (. = current directory) [Option]
the directory for the hunlex output resource files, which is, by default, set to the
current directory.
As you can see, the default setting is that all input and output files are located in the
current directory under their recommended canonical names. Putting the output resources
in the same directory as the primary resources might not be a good idea if you want to
compile various types of output resources.
DOUBLE_FLAGS ()
The following two options regulate the level of morphemes. You find more details about
levels in a separate chapter (see Chapter 10 [Levels], page 36).
MIN_LEVEL (1) [Option]
Morphemes of level below MIN_LEVEL are treated as lexical, i.e., are precompiled with
the appropriate stems into the dictionary file. By default, only morphemes of level 0
or below are precompiled into the dictionary.
MAX_LEVEL (10000) [Option]
Morphemes with levels higher than the value of MAX_LEVEL are, on the other hand,
treated as being on the same (non-lexical) level. By default, only morphemes of level
above 10000 are treated as having the same level.
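How MIN_LEVEL and MAX_LEVEL partition morpheme levels can be sketched with a small shell function. The function classify_level and its three output labels are hypothetical names used only for this illustration; hunlex does not provide them. That levels between the two bounds keep their own level is an inference from the two descriptions above; Chapter 10 [Levels] is the authoritative account.

```shell
#!/bin/sh
# classify_level LEVEL [MIN_LEVEL] [MAX_LEVEL]
# Hypothetical helper (not part of hunlex) mirroring the rules above.
# The defaults are the documented ones: MIN_LEVEL=1, MAX_LEVEL=10000.
classify_level() {
    level=$1
    min_level=${2:-1}
    max_level=${3:-10000}
    if [ "$level" -lt "$min_level" ]; then
        # below MIN_LEVEL: precompiled with the stems into the dictionary
        echo "lexical"
    elif [ "$level" -gt "$max_level" ]; then
        # above MAX_LEVEL: all such morphemes share one non-lexical level
        echo "merged"
    else
        # in between: the morpheme keeps its own level (assumption)
        echo "own-level"
    fi
}

classify_level 0        # prints "lexical" with the default bounds
classify_level 1        # prints "own-level"
classify_level 2 3      # prints "lexical" when MIN_LEVEL is 3
classify_level 10001    # prints "merged"
```

Raising MIN_LEVEL, as in make MIN_LEVEL=3 new resources, therefore moves more morphemes into the dictionary file.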
TAG_DELIM ()
OUT_DELIM ( ) [Option]
OUT_DELIM_DIC (<TAB>) [Option]
set the delimiter to put between the fields of the affix and dictionary files,
respectively. By default it is set to a single space for the affix file and to <TAB>
for the dictionary file.
NB: A tab might allow better postprocessing in the affix file and even
allow spaces in the tags, which might be useful.
At the time of writing the huntools reader only allowed a TAB, not a
space, as delimiter in the dictionary file, so change with caution.
MODE (Analyzer) [Option]
the major output mode regulates what information gets output in the affix and dictionary files and how affix entries are conflated.
Warning: This option is not effective at the moment due to the lack of a
clear functional specification and it is also unclear how this option should
interact with the option STEMINFO (below).
Todo: Clarify this. See warning.
The possible values at the moment are:
Spellchecker
Stemmer
Analyzer
NoMode
STEMINFO (LemmaWithTag)
[Option]
FS_INFO ()
The affix file specifies a lot of variables to be read by the morphbase routines. Some of these
are metadata but some are crucial for suggestions and accent replacement for automatic
error correction, see below.
AFF_PREAMBLE ()
DIC_PREAMBLE ()
CHAR_CONVERSION_TABLE () [Option]
REPLACEMENT_TABLE () [Option]
These are the character-conversion table and replacement table to be included into
morphbase resources if alternatives (e.g., for spellchecking) or robust error correction
are required (see Section 14.1.1 [Huntools], page 46). These features are documented in
the huntools documentation (hopefully, but certainly not here, see Section 8.2 [Output
Resources], page 33, see Section 14.1.1 [Huntools], page 46).
NB: This feature of including these extra files into the affix file is only
available through toplevel control and will never be an integral part of the
hunlex executable.
AFF_SET (ISO8859-2) [Option]
Identifies the character set for the analyzer reading the affix file. By default, this is
set to ISO8859-2, i.e., Eastern European. Maybe this is the hun in hunlex...
[Option]
is the file from which settings for some affix variables are read. If it doesn't exist, no
affix variables other than the ones directly managed are dumped into the affix file.
AFF_FORBIDDENWORD (!)
AFF_ONLYROOT (~)
STEM_GIVEN () [Option]
If this flag is present, it indicates to the stemmer/analyzer whether or not the stem string is
to be output as part of the annotation. For instance (if the STEM_GIVEN flag is
x), the following dic file
go/ [VERB]
went/x go[VERB]
will result in the following stemming:
> go
go[VERB]
> went
go[VERB]
This makes more compact dictionaries. What information one wants the stemmer
and analyzer to output can be configured through hunlex options (see below).
Todo: This flag is not implemented yet (since it is not implemented in
morphbase either) and probably never will be, since the treatment of special flags shouldn't be user-customizable beyond the choice of
flaggable characters.
Warning: Make sure the flags given here are consistent with the double flags option and the custom flags file (see the FLAGS variable above, and see Chapter 12
[Flags], page 42).
These options are superfluous and should be automatically managed by hunlex
which would write them into the affix file. Very likely to be deprecated soon.
Todo: This needs to be implemented.
Warning: Additional settings that are to be included in the affix file and are
crucial part of the resources (partly should be set by hunlex itself) such as
compoundflags. I have no idea what to do with these at the moment. The ones
I know of are listed here just for the record.
Todo: This needs to be sorted out.
Some of these data are actually global and could even go to the settings preamble (AFF_SETTINGS):
NAME
LANG
HOME
VERSION
These ones should be dynamic metadata
??
The ones below should clearly be controlled and output by hunlex itself (as should ONLYROOT
and FORBIDDENWORD, but they are handled by the toplevel, at least).
Ones relating to compounding (compounding is handled very differently by myspell,
morphbase and jmorph):
COMPOUNDMIN
COMPOUNDFLAG
COMPOUNDWORD?
COMPOUNDFORBIDFLAG
COMPOUNDSYLLABLE
SYLLABLENUM
COMPOUNDFIRST
COMPOUNDLAST
Warning: Compounding is as yet unsupported by hunlex and should be worked
on with high priority.
I have really no idea about the following ones:
TRY
ACCENT
CHECKNUM
WORDCHARS
HU_KOTOHANGZO
7 Description Language
This chapter is about the framework that allows you to describe the morphology and lexicon
of a language. Below we specify the syntax and semantics of this description language. The
files written in this language (the lexicon and grammar) are the primary resources of
hunlex (see Section 8.1 [Input Resources], page 30) and the basis for all compiled output
(how this works is described in another chapter, see Chapter 6 [Toplevel Control], page 12).
There are three kinds of statement in this language:
morph definition
macro definition
metadata definition
Only the grammar file can contain macro definitions (see Section 7.2 [Macros], page 28)
and metadata definitions (see Section 7.3 [Metadata], page 29), whereas both the lexicon and
the grammar file can contain morph definitions, which describe morphological units (affix
morphemes, lexemes and their paradigms). In this respect, the syntax of the lexicon and
grammar files is identical and is therefore discussed together rather than separately (see
Section 7.1 [Morphs], page 24), although the usefulness (and sometimes even the
semantics) of certain expressions might be different in the lexicon and in the grammar.
7.1 Morphs
Morphs are the central entities in the description language. They stand for morphological
units of any size and abstractness including affix morphemes, lexemes, paradigms, etc. and
are not what linguists call morphs (i.e., a particular occurrence of one morpheme). Morphs
are meant to describe an affix morpheme or a lexeme, but in fact, it is up to you what
level of abstractness you find useful in your grammar, so you can have individual morphs
describing each allomorph of a morpheme or each stem variant of a lexeme. The point,
however, is that morphs support the description of variants or allomorphs: a morph is
basically a collection of rules, variants, etc. that belong together. Ideally, a variant of an
affix morpheme is an affix allomorph, i.e., a concrete affixation rule, while a variant of
a lexeme is a stem variant or an exceptional form of the lexeme's paradigm.
[statement]
morph-name block...
[preamble]
block...
[variant]
A morph statement is introduced by an optional MORPH: keyword. It is a good idea to drop
it and start the statement directly with the preamble (in fact, the name of the morph),
which is compulsory.
A morph description has a preamble, i.e., a header describing the global properties of the
morph, the properties which characterize all of its variants/allomorphs.
After the preamble, one finds the variants one after the other. The preamble and the
variants are delimited by a comma.
Finally, the morph definition like all other statements is closed by a semicolon.
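The overall shape of a morph statement (optional MORPH: keyword, preamble, comma-delimited variants, closing semicolon) can be illustrated with a small splitter. This is a rough sketch and not the hunlex parser: the function name `split_morph` and the example morph are made up, and the sketch assumes that no block argument contains a comma or a semicolon.

```python
def split_morph(statement):
    """Split one morph statement into (name, preamble, variants)."""
    body = statement.strip()
    if not body.endswith(";"):
        raise ValueError("a statement must be closed by a semicolon")
    body = body[:-1].strip()
    if body.startswith("MORPH:"):
        # The leading keyword is optional and carries no information.
        body = body[len("MORPH:"):]
    parts = [part.strip() for part in body.split(",")]
    preamble, variants = parts[0], parts[1:]
    if not preamble:
        raise ValueError("the morph name is compulsory")
    name = preamble.split()[0]
    return name, preamble, variants


# Hypothetical plural morph with two allomorphs.
print(split_morph("MORPH: PLUR TAG: [PL] , +ok IF: back , +ek IF: front ;"))
```

The first comma-separated part is the preamble (starting with the morph name); every further part is one variant.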
The preamble starts with the name of the morph. The name can be any arbitrary id, a
mnemonic string that ideally identifies the morph uniquely. Referring to other morphs is
important in describing how morphemes can be combined: for these references to be
reliable, the names in the grammar are supposed to be unique. This is not
important in the lexicon, where homophonous lemmas can have identical names (however,
this is not recommended, since, in such a case, for instance, morphological synthesis would
be unable to distinguish two senses especially if they are of the same morphosyntactic
category).
The rest of the preamble as well as each individual variant is composed of blocks. Blocks
are the ingredients of the description, they specify information such as conditions of rule
application, output of a rule, the tag associated with the rule, etc.
In sum, then, morphs have the following structure:
[statement]
7.1.2 Blocks
Blocks are the ingredients of the description, they specify information such as conditions of
rule application, output of a rule, the tag associated with the rule, etc.
Blocks all have a leading keyword followed by some expressions (arguments) and last till
the next keyword or the end of the variant:
KEYWORD argument...
[block]
Blocks can come in any order within a variant and can be repeated any number of times.
So writing
KEYWORD: argument0 argument1 argument2 ...
has the same effect as when it is written like
KEYWORD: argument0 KEYWORD: argument1 KEYWORD: argument2 ...
or even
KEYWORD: argument0 SOME-OTHER-BLOCKS KEYWORD: argument1 SOME-OTHER-BLOCKS KEYWORD: arg
or when it is included with a macro (see Section 7.2 [Macros], page 28).
Certain blocks specify information in a cumulative way, so every time they are specified the
information is added to the info specified so far. For instance an IF block is cumulative,
all the arguments of all the IF blocks of a variant cumulate to give the conditions of rule
application, i.e., the rule applies only if all conditions on features are satisfied by the input
(see IF block below).
However, other blocks do not specify information that can be interpreted cumulatively, so it
does not make sense to give them more than one argument or to specify them more than
once for a variant. (They may, however, still be specified in the preamble and overridden in
a variant, for instance.)
Whenever contradictory information is given, the specification given last overrides the
previous ones.
So if you write
CLIP: 1 CLIP: 2
it is the same as
CLIP: 2
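The difference between cumulative and non-cumulative blocks can be sketched as a merge over (keyword, arguments) pairs. The sketch is illustrative only: the function name and the data layout are assumptions, and the set of cumulative keywords is limited to the ones this chapter explicitly calls cumulative.

```python
# Keywords this chapter explicitly describes as cumulative.
CUMULATIVE = {"IF", "KEEP"}


def merge_blocks(blocks):
    """Merge a sequence of (keyword, [arguments]) pairs for one variant."""
    merged = {}
    for keyword, args in blocks:
        if keyword in CUMULATIVE:
            # Cumulative blocks add their arguments to what is there already.
            merged.setdefault(keyword, []).extend(args)
        elif args:
            # For all other blocks the specification given last wins.
            merged[keyword] = args[-1]
    return merged


print(merge_blocks([("CLIP", ["1"]), ("IF", ["a"]), ("CLIP", ["2"]), ("IF", ["b"])]))
```

With this reading, repeating a cumulative keyword is equivalent to giving all its arguments at once, while a repeated CLIP keeps only the last value.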
In what follows, blocks are listed and explained one by one.
[block]
default morphs are used to assign features to inputs unspecified for some features. A
morph with a default block just adds extra rules that leave alone inputs which are
specified for any of the features to be defaulted. The variants of a morph having a
default block in their preamble will assume that neither of the features to be defaulted
is present in the input.
So
morph DEFAULT: feature0 feature1 , MATCH: x OUT: feature0 ;
is equivalent to
morph , IF: !feature0 feature1 OUT: feature1 , IF: feature0 !feature1 OUT: feature0
, IF: feature0 feature1 OUT: feature0 feature1 , IF: !feature0 !feature1 MATCH: x
OUT: feature0 ;
Filters typically want to pass on their whole input by default.
VARIANT variant
[block]
this block defines the actual affix or lexis.
The exact shape of variant determines what type of affix, lexis the variant describes:
+aff
describes a suffix: when the rule applies, aff is appended to the end of the
input (after possibly clipping some characters).
aff+
describes a prefix: when the rule applies, aff is prepended to the beginning of
the input (after possibly clipping some characters).
pref+suff
describes a circumfix: when the rule applies, pref is prepended to the beginning
of the input and suff is appended to the end of the input (in each case after
possibly clipping some characters).
lexis
defines a lexis. This is typically used in the lexicon and serves as input to
the rules. If the VARIANT keyword is left out, the variant has to come as the first block
of the rule (after the comma closing the preamble or the preceding rule).
If a lexis is used in the grammar, it is meant to stand for a suppletive form.
Since it may well be a typo, a warning is given. We encourage the policy of putting
suppletive paradigmatic exceptions as variants of the lexeme in the lexicon file,
especially since matches are ineffective for lexis rules; conditions on the
suppletion should therefore be expressed with features, which is much safer anyway.
Lexis and affix strings can contain any character except whitespace, comma,
semicolon, colon (??), exclamation mark, slash, tilde, plus sign and hash mark:
[^# \t \n ; , \r ! / + ~]
Todo: there should be a way to allow escapes.
Substitutions (which are special kind of rules) are specified by REPLACE/WITH
blocks.
CLIP integer
[block]
This block specifies the number of characters that need to be clipped from one end
of the input.
It has no effect if the variant is a lexis or a substitution, so you don't use this block in
the lexicon.
If no CLIP block is given, no characters are clipped (the integer defaults to zero).
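How the shape of a variant and the CLIP value could interact is sketched below. This is an illustration under stated assumptions, not hunlex code: the function name `apply_variant` is made up, clipping for a prefix is assumed to happen at the beginning of the input, and clipping is simply ignored for circumfixes because the text does not say how it would be split between the two ends.

```python
def apply_variant(word, shape, clip=0):
    """Apply a variant of shape '+aff', 'aff+', 'pref+suff' or 'lexis' to word."""
    if "+" not in shape:
        # A lexis: CLIP has no effect, the form is taken as given.
        return shape
    prefix, suffix = shape.split("+", 1)
    if prefix and suffix:
        # Circumfix; clipping is left out of this sketch.
        return prefix + word + suffix
    if suffix:
        # Suffix: clip from the end of the input, then append.
        return word[:max(0, len(word) - clip)] + suffix
    # Prefix: clip from the beginning of the input (assumption), then prepend.
    return prefix + word[clip:]


print(apply_variant("dal", "+ok"))           # suffix, nothing clipped
print(apply_variant("try", "+ied", clip=1))  # suffix after clipping one character
print(apply_variant("do", "un+"))            # prefix
```

The default of `clip=0` mirrors the statement that no characters are clipped when no CLIP block is given.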
REPLACE pattern
WITH template
[block]
[block]
MATCH pattern
[block]
specifies a match condition on rule application. The rule only applies if the input
matches pattern, which is a hunlex regular expression. Since matches are ineffective for
lexis rules, you don't use this block in the lexicon.
The matched expression defines a match at the edge of the word, the beginning for
prefixes and the end for suffixes. You may include special symbols like ^ and $, to
make this more explicit.
Match blocks are non-cumulative, but circumfixes allow two matches (one beginning
with a ^ and one ending in a $).
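The edge-anchored nature of match conditions can be illustrated with Python's `re` module standing in for hunlex regular expressions (whose exact syntax is not specified here, so the equivalence is an assumption). The function name `matches` and the `kind` argument are made up for the example.

```python
import re


def matches(word, pattern, kind):
    """Check a match condition at the edge of word relevant for the affix kind."""
    if kind == "suffix":
        # Suffix rules match at the end of the word; an explicit $ is harmless.
        return re.search("(?:" + pattern + ")$", word) is not None
    if kind == "prefix":
        # Prefix rules match at the beginning; an explicit ^ is harmless.
        return re.match(pattern, word) is not None
    raise ValueError("kind must be 'suffix' or 'prefix'")


print(matches("try", "[^aeiou]y", "suffix"))
print(matches("play", "[^aeiou]y", "suffix"))
print(matches("undo", "un", "prefix"))
```

A circumfix would simply need both checks to succeed, one with a pattern for the beginning and one with a pattern for the end.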
IF condition ...
[block]
IF blocks specify the conditions of rule application. Conditions are either positive
conditions (feature-name) or negative conditions (NOT feature-name).
The rule only applies if the input has the positive features specified in the IF blocks
and doesn't have the negative features specified in the IF blocks.
IF blocks are therefore cumulative and the conditions are understood conjunctively.
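The conjunctive reading of IF conditions amounts to a simple check over the input's feature set. In this sketch, negative conditions are written with a leading `!` as in the DEFAULT example earlier in this section; the function name and the feature names are hypothetical.

```python
def rule_applies(input_features, conditions):
    """Return True if the input satisfies all positive and negative conditions."""
    for condition in conditions:
        if condition.startswith("!"):
            # Negative condition: the feature must be absent from the input.
            if condition[1:] in input_features:
                return False
        elif condition not in input_features:
            # Positive condition: the feature must be present in the input.
            return False
    return True


print(rule_applies({"back", "noun"}, ["back", "!plural"]))
print(rule_applies({"back", "noun", "plural"}, ["back", "!plural"]))
```

Because every condition has to hold, collecting the arguments of several IF blocks into one list does not change the result.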
[block]
specifies the output conditions of the variant (affix rule or lexis). An output can be a
feature or a morph.
Features can be restricted to particular morphs.
TAG tag-string
[block]
[block]
[block]
tells that the morph in question is a filter which defines fallback rules for lexical
features.
This means that the variants are meant to apply only if the input has none of the
filtered features.
Has no effect within individual variants or in the lexicon. Only relevant in a morph
preamble in the grammar.
Cumulative (conjunctive on the rule conditions)
[block]
Defines inheritance of features: provided that the variant applies to an input, a feature
mentioned in the KEEP block is an output feature of the result of rule application if and
only if the input has the feature.
If output features and keep features overlap, output features are meant to override
inheritance.
Features which are restricted by the input condition (IF block) are inherited normally,
but since they are known, can also be mentioned in the OUT block for clarity.
NB: The items following KEEP in a KEEP block are features. They
cannot be macro names. Don't trick yourself by abbreviating a sequence
of phonofeatures with a macro and then referring to that in a KEEP block. Don't
forget that macros abbreviate (a series of) blocks, so clearly they can't
be nested within a KEEP block.
Cumulative
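The interplay of output features and inherited (KEEP) features can be sketched as set operations. This only illustrates the rule stated above for plain feature names; the function name and the features are invented for the example.

```python
def output_features(input_features, out, keep):
    """Features of the result: OUT features plus KEEP features the input has."""
    inherited = {feature for feature in keep if feature in input_features}
    # A feature listed in OUT is present whether or not it would be inherited.
    return set(out) | inherited


print(sorted(output_features({"back", "plural"}, out=["cased"], keep=["back", "round"])))
```

Features of the input that are mentioned neither in OUT nor in KEEP (here `plural`) do not survive rule application in this reading.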
FREE bool
[block]
specifies whether the rule application gives a full form. For bound stems or non-closing
affixes, it has to be set to false.
By default, variants in the lexicon are not free, while variants in the grammar are free
(?? is this ok?).
FS feature-structure
[block]
specifies the feature structure graph to be merged when the rule applies. feature-structure
is a kr-style feature structure description string.
PASS bool
[block]
7.2 Macros
DEFINE macro-name blocks
[expression]
defines a macro named macro-name. Later (any time after this definition), any time
macro-name is encountered it is understood as if it said blocks. blocks is a sequence
of any blocks including (other) macro-names. The macro-name appearing elsewhere
than its definition has to be already defined.
Todo: What happens if a macro-name is also a declared morph-name, or a declared feature?
[expression]
binds regexp-name to a hunlex regular expression, i.e., a regular expression that
can contain regular expression macro-names in angle-brackets. regexp-name can be
referenced within any regular expression later. An expression is resolved by replacing
the substring <regexp-name> with the resolved regexp.
This means that you have to escape your <-s and >-s if they do not delimit regexp
names.
As said, you can define regexp-macros using other macros, only at the time of using a
regexp-name it has to be defined already (the definition should be earlier in the file),
so that it can be resolved at the time of reading the definition.
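The resolution of regular expression macros described above is essentially textual substitution at definition time. The following sketch shows the idea; the function names `define` and `resolve` and the table layout are assumptions, and escaped angle brackets are not handled.

```python
import re


def resolve(table, regexp):
    """Replace every <regexp-name> in regexp by its already resolved definition."""
    def substitute(match):
        name = match.group(1)
        if name not in table:
            # A name must be defined before it is used.
            raise KeyError("undefined regexp name: " + name)
        return table[name]
    return re.sub(r"<([^<>]+)>", substitute, regexp)


def define(table, name, regexp):
    """Bind name to regexp, resolving references at the time of definition."""
    table[name] = resolve(table, regexp)


table = {}
define(table, "V", "[aeiou]")
define(table, "C", "[^aeiou]")
define(table, "CV", "<C><V>")
print(resolve(table, "<CV>$"))
```

Since definitions are resolved as they are read, a definition can only refer to names defined earlier, and no recursion can arise.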
7.3 Metadata
8 Files
There are various files that hunlex processes. Input as well as output files are described in
this chapter. The file names used in this chapter are just nicknames (which happen to be
the default filenames assumed) and can be changed at will by setting toplevel options (see
Section 6.4 [Options], page 15).
8.1.1.1 Lexicon
The lexicon file (the file name of which is lexicon by default, but can be set through
options, see Section 6.4.3 [Input File Options], page 17) is the repository of lexical entries,
containing information about:
lemmas
stem-allomorphs (variants) belonging to the lemma's paradigm
suppletive forms expressing some paradigmatic slot of the lemma
morphological output annotation (tag) of the lemma (and the variants)
sense indices (arbitrary tag to distinguish identical lemmas)
the morphosyntactic and morphophonological features which characterize variants (or
lemmas), and which determine its morphological combinations (i.e., which rules apply
to it and how).
usage qualifiers of the variants (or lemmas), such as register, usage domain, normative
status, formality, etc.
The syntax of the lexicon file is basically the same as that of the grammar, except that
it cannot contain macro definitions (see Section 7.2 [Macros], page 28). This syntax of
8.1.1.2 Grammar
The grammar file is the other primary resource and also absolutely necessary to describe
the morphology of your language. Its name is grammar by default but can be changed by
setting toplevel options (see Section 6.4.3 [Input File Options], page 17). The grammar file
specifies:
affix morphemes
affix-allomorphs (variants) belonging to the same morpheme
morphological output annotation (tag) of the affix morpheme (and its variants)
the morphosyntactic and morphophonological features which characterize variants (or
morphemes) and which determine its morphological combinations (which rules apply
to it and how).
usage features of the variants (or affixes), such as register, usage domain, normative
status, formality, etc.
possibly special pseudo-affixes, so-called filters, which assign (default) features to variants based on their form (orthographic patterns) or other features.
The syntax of the grammar file is the same as the one used for the lexicon except that the
grammar file can contain macro definitions (see Section 7.2 [Macros], page 28). The syntax
and semantics of this description language is explained in detail in another chapter, see
Chapter 7 [Description Language], page 24.
For examples of grammar files, have a look at the zillion examples in the Examples directory
that comes with the distribution (see Section 4.6 [Installed Files], page 9).
It declares the affix morphemes and the filters that are to be used from among the ones
that are in the grammar.
Warning: the affix morphemes not listed (or commented out) in this file are
ineffective for the compilation (as if they were not in the grammar).
Each line in this file contains the affix morpheme's name and, optionally, a second field,
which gives the level of the morpheme. If no level is given, the affix is assumed to be
of the maximum level (the value of the option MAX_LEVEL, see Section 6.4.5 [Resource
Compilation Options], page 19). Very briefly, levels regulate which affixes will be merged
with which other affixes to yield the affix clusters that are dumped as affix rules into the affix
file. The odds and ends of levels are described in detail in another chapter (see Chapter 10
[Levels], page 36).
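Reading such a morpheme configuration file can be sketched as follows. This is not the hunlex reader: the function name is made up, fields are assumed to be separated by whitespace, and `#` is assumed as the comment character (the text only says that morphemes can be commented out). The default of 10000 is the value listed for MAX_LEVEL in the variable index.

```python
MAX_LEVEL = 10000  # default listed for the MAX_LEVEL option


def parse_morph_conf(text, max_level=MAX_LEVEL):
    """Map each declared morpheme name to its level."""
    levels = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            # Blank or commented-out line: the morpheme stays undeclared.
            continue
        name = fields[0]
        # A missing second field means the maximum level.
        levels[name] = int(fields[1]) if len(fields) > 1 else max_level
    return levels


# Hypothetical file contents.
print(parse_morph_conf("PLUR 1\nINE\n# OLD 2\n"))
```

Morphemes that do not end up in the resulting mapping are the ones that are ineffective for the compilation.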
For examples of the rather dull morph.conf files, browse the examples in the Examples
directory that comes with the distribution (see Section 4.6 [Installed Files], page 9).
If you have a grammar and you want to declare all the (undeclared) morphs defined in it
by including them in morph.conf, all you do is type
make DEBUG_LEVEL=1 new resources 2>&1 | grep (morph skipped) | cut -d -f1 >> in/mo
in the directory where your local Makefile resides. This will append all the undeclared
morphs (one per line) to the morph.conf file. Note, the morphs so declared will be of level
maximum level (see above).
make DEBUG_LEVEL=1 new resources 2>&1 | grep (feature skipped) | cut -d -f1 | sort
The usage.conf file is one of the compilation configuration files that determine how
hunlex compiles the output resources (aff and dic, see Section 8.2 [Output Resources],
page 33) from the primary resources (lexicon and grammar, see Section 8.1.1 [Primary
Resources], page 30).
usage.conf in particular determines which usage qualifiers are allowed for the input units
(lexical entries, affixes, filters and the variants thereof) that are included into the resource
to be compiled. Units having a usage qualifier that is not listed in this file are ignored for
the compilation (as if they were not there).
NB: Usage qualifiers are not first class features. They can not be negated or
used as conditions on rule application. They are simply used to categorize rules
(affixes and stems) in certain dimensions such as etymology, register, usage
domain, normative status, formality, etc.
In addition to declaring allowed usage qualifiers, this file has another function as well. Each
line containing the usage qualifier may contain a second field which is a tag associated with
that usage feature. If this field is missing, the name of the usage qualifier string is assumed
to be its tag. Usage qualifier tags can be output by the analyzer if they are compiled into
the resources by hunlex.
This can be configured with the output info option (see Section 6.4.5 [Resource Compilation
Options], page 19).
Warning: This option is not implemented yet.
Todo: This is not implemented yet. I don't even know if this is fine like this.
The problem is that they cannot really be just intermixed with the ordinary
morphological tags.
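The two roles of usage.conf described above (declaring the allowed usage qualifiers and associating a tag with each) can be sketched like this. The function names, the whitespace-separated layout and the example qualifiers are assumptions for illustration only.

```python
def parse_usage_conf(text):
    """Map each allowed usage qualifier to its tag."""
    tags = {}
    for line in text.splitlines():
        fields = line.split(None, 1)
        if not fields:
            continue
        name = fields[0]
        # Without a second field the qualifier's own name serves as its tag.
        tags[name] = fields[1].strip() if len(fields) > 1 else name
    return tags


def is_included(unit_qualifiers, tags):
    """A unit is compiled only if all of its usage qualifiers are listed."""
    return all(qualifier in tags for qualifier in unit_qualifiers)


# Hypothetical file contents.
tags = parse_usage_conf("formal\nslang [SLANG]\n")
print(tags)
print(is_included(["slang"], tags), is_included(["archaic"], tags))
```

Note that the qualifiers only filter units; as stated above, they cannot be negated or used as conditions on rule application.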
Various dimensions of usage information can be made effective by introducing expressions
with arbitrary leading keywords (see Chapter 7 [Description Language], page 24). Redefining each of the wanted usage dimensions in the parsing_common.ml file will result in
making any one or more of them effective as usage qualifiers. The point is that you can
keep a lot of information in the same lexical database. When the keywords it contains are
hunlex-ineffective, the expressions they lead are simply ignored.
Caveat: At the moment, for these alternatives, you have to recompile hunlex, with the new keyword associations, see Chapter 7 [Description Language],
page 24.
Todo: This could be done online but has very low priority.
For examples of usage.conf files, browse the examples, in the Examples directory that
comes with the distribution (see Section 4.6 [Installed Files], page 9).
Section 8.1.1 [Primary Resources], page 30). But the affix and dictionary files are resources
that are used by real-time word-analysis routines (such as morphbase, myspell or jmorph,
see Chapter 14 [Related Software and Resources], page 46). They share commonalities of
format with minor idiosyncrasies, some of which are still changing.
Hunlex reads a transparent human-maintainable non-redundant morphological grammar
description with the lexicon of a language and creates affix and dictionary files tailored to
your needs (see Chapter 1 [Introduction], page 1). The ultimate purpose of hunlex is that
these output resource files could at last be considered a binary-like secondary (automatically
compiled) format, not a primary (maintained) lexical resource.
Therefore the technical specification of these output formats should only concern you here
if you want to compile affix and dictionary files for your own (or modified versions of our
own) word-analysis software which also reads the aff/dic files. In such a case, however, you
know that format better than I do. All I can say is that the parameters along which the
format can be manipulated are supposed to conform to the formats of the software listed
in Section 14.1 [Software that can use the output of Hunlex as input], page 46. If you
develop such software as well and would like your format to be supported, take a deep
breath and consider requesting a feature from the authors (see Section 3.3 [Requesting a
New Feature], page 5).
In sum, the format of these output resource files is not detailed here. Anyway, they are (probably) well documented elsewhere (e.g., myspell manual page). See especially the documentation of huntools and the morphbase library (see Section 14.1.1 [Huntools], page 46).
9 Command-line Control
This chapter is a verbatim include of the hunlex manpage. Command-line control is not the
recommended interface to use hunlex, see toplevel control (see Chapter 6 [Toplevel Control],
page 12).
removed so make doc would run
10 Levels
Levels index morphemes and are assigned to morphemes in the morph.conf file (see
Section 8.1.3 [Morpheme Configuration File], page 31).
Levels govern which affixes will be merged together into complex affixes (or affix clusters)
and will constitute an affix rule (or, to be linguistically correct, an affix-cluster rule) in the output
affix file (see Section 8.2 [Output Resources], page 33). Affix rules in the affix file will
be stripped from the analyzed words by the analysis routines in one step (i.e., by one
rule-application).
Levels, then, regulate the output resources of hunlex and have no role to play in how you
design your grammars. There are no levels in the hunlex grammar and lexicon, the files
which describe the morphology of the language (see Section 8.1.1 [Primary Resources],
page 30). Levels make sense only in relation to the compilation process.
This chapter describes why you would want levels, how you manipulate them and what
consequences they have for analysis.
For instance, taking the example of the previous section, if both the plural and the inessive
morpheme are below the minimal level (of on-line-ness), the whole morphologically complex
word dalokban will be included in the dictionary file. To learn why you would want to do
such a thing see also Section 10.4 [Manipulating Levels with Options], page 37.
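The effect of the minimal level on where a morpheme ends up can be sketched as a simple partition. This is a simplification under assumptions: the function name and the level values are invented, and the only rule taken from the text is that morphemes below the minimal level are precompiled into the dictionary; what exactly happens to the rest (merging into affix clusters) is not modelled.

```python
def split_by_min_level(levels, min_level):
    """Partition morphemes into dictionary-precompiled and affix-file ones."""
    into_dictionary = sorted(m for m, lvl in levels.items() if lvl < min_level)
    into_affix_file = sorted(m for m, lvl in levels.items() if lvl >= min_level)
    return into_dictionary, into_affix_file


# Hypothetical levels for the plural, the inessive and a possessive morpheme.
print(split_by_min_level({"PLUR": 1, "INE": 2, "POSS": 5}, min_level=3))
```

With these made-up values, both the plural and the inessive fall below the minimal level, so a form like dalokban would be listed in the dictionary file as a whole.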
$ cp morph.conf morph.conf.orig
$ cut -d -f1 morph.conf.orig | nl -nln -s | sed s/\(.*\) \(.*\)$/\2 \1/g > morp
Todo: I should provide an option that does this.
With a routine that supports any number of affix stripping operations, such a resource
will allow correct analysis. But not with the ones that allow only a finite number of rule
applications.
Todo: write on recursion
Geeky note: If rule application monotonically increases the size of the input,
potential recursion is never unbounded recursion, since all analysis routines have
a fixed buffer size anyway. If, however, empty strings or clippings make
rule application non-monotonic in size, potential recursion may cause actual
infinite loops in some incautious implementations. Boundedness of recursion
due to buffer size restrictions is only one sense in which the full intended (implied) generative power of an arbitrary hunlex grammar is not reflected in the
analyzer's actual analysis potential.
affix file will be small, but your analysis may not be optimal for runtime (if it generates the
correct analyses at all, see Section 10.4.3 [Levels and Steps of Affix Stripping], page 38).
If, on the other hand, you precompile all affixes into affix clusters, you might end up with
hundreds of megabytes of affix file, which is going to compromise the memory load of your runtime analysis (though it may be faster for the analysis algorithm than recursive calls). This last
realization led the author of hunspell (see Section 14.1.1 [Huntools], page 46) to introduce
a second step of suffix stripping in the algorithm which was a legacy of the original myspell
code with its one level of affix stripping.
Finally, compiling everything in the lexicon is not a very good idea for complex morphologies
and big lexicons, although it may be indispensable for testing on smaller fragments of
lexicons/grammars or for creating wordlists (see Section 6.3.3 [Test Targets], page 14).
Some special affix rules should always be precompiled into the dictionary and not output
as affix rules. These rules are the ones that cannot be interpreted as affix rules at all, for
instance, rules of substitution or suppletion. These rules are beyond the descriptive capacity
of affix files. Therefore all substitutions and suppletions are precompiled (merged with the
rules or stems they can be applied to) irrespective of their level. Find more about this.
11 Tags
We call the information that a morphological analyzer is expected to output for an analyzed
word a piece of morphological annotation. In more general terms, however, when we talk
about any kind of word-analysis routine such as a spell-checker or a stemmer, we call the output
information these routines associate with words tags. We want to emphasize here that this
piece of output information tags the whole word that is analyzed. The tag is used to annotate
words in a corpus by decorating a raw text with useful extra information.
NB: Tagging as we use it in no way constitutes a constituent structure, segmentation, etc. of the input word form.
This document describes the ways in which you can associate tags with your morphemes
(or individual stem variants and affix rules). These tags should be thought of as the
ingredients of an output tag that an analyzed word containing that morpheme would be given.
Certainly, not all analysis software can or is supposed to output any useful information
about the morphological makeup of the word. For instance, a spell-checker is typically
required only to recognize whether a word is correct (usually in a strict normative sense),
but a morphological analyzer or a stemmer is supposed to output some information. Since
the huntools routines are able to perform full morphological analysis, not just recognition
(REFERENCE), adding morphological tags to your rules is worth your while. Nevertheless,
if you never ever want to be able to output any useful info (because you only care about
spellchecking), you don't really need to read on.
12 Flags
Flags are used in the output resources (see Section 8.2 [Output Resources], page 33) to
index affix rules. Each entry in the dictionary file has a set of flags indicating which affix
rules can be applied to it.
So, flags are given by hunlex and written in the affix and dictionary files. There is no
such thing as a flag in the hunlex input grammar or lexicon, the files which describe your
morphology.
You can specify some aspects of what flags hunlex will assign to affix classes and how. This
is what the present chapter is about.
The association of flags to affix classes takes flags from left to right. This means that if the
output requires 35 flags, the first 35 flaggable characters will be used. This is, however, all
that can be said: which actual flag goes with which affix class cannot be further specified.
Warning: This last sentence is a warning in itself. For people who are used
to fiddling with affix files that were manually created (in fact almost all ispell
resources), it has to be stressed: hunlex-generated affix files are not to be read
by humans and should be considered binary. Associations of flags with particular affix rules/classes are not permanent across various configurations/resource
compilations. If you want to post-process affix files, never assume particular
flags are meaningful. This is rather obvious once you realize that the affix
rules/classes themselves are not consistent across different parametrizations,
either (see e.g., levels). This policy is called dynamic flagging.
An exception to dynamic flagging might be special flags, which are
fairly consistent since their expression can be customized to particular strings
(see Section 6.4.5 [Resource Compilation Options], page 19, Affix file variables).
But this feature will soon cease to exist, so just wipe your tears off your face,
be happy that you have a hunlex resource and forget your old flags.
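The left-to-right association of flags with affix classes can be sketched as below. It is a minimal illustration, not the hunlex implementation: the function name, the class names and the Python exception (standing in for the Not_enough_flags exception mentioned in this chapter) are assumptions.

```python
class NotEnoughFlags(Exception):
    """Stand-in for the Not_enough_flags exception."""


def assign_flags(affix_classes, flaggable):
    """Pair affix classes with flaggable characters, taken from left to right."""
    if len(affix_classes) > len(flaggable):
        raise NotEnoughFlags(
            "%d classes but only %d flags" % (len(affix_classes), len(flaggable))
        )
    return dict(zip(affix_classes, flaggable))


# Hypothetical affix classes; only the first three flaggable characters are used.
print(assign_flags(["PLUR", "INE", "PLUR+INE"], "ABCDEFG"))
```

Because the pairing depends only on the order and number of affix classes, recompiling with different parameters can give the same class a different flag, which is why flags should never be treated as meaningful.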
thousand affix classes. If this is not enough for you (you get the exception
Not_enough_flags), you are likely to have a problem in your grammar (see
Chapter 13 [Troubleshooting], page 45). If you are sure it is not a grammar
problem, you better choose another language. At any rate, please notify us
(see Section 3.8 [Contact], page 6) about this extraordinary case and we
might even extend the support for flags even in morphbase on one of our free
afternoons (see Section 3.3 [Requesting a New Feature], page 5).
13 Troubleshooting
13.1 Installation Problems
If hunlex won't install, check Section 4.3 [Prerequisites], page 7 carefully with special
attention to the versions.
There are some hints hidden among the lines of Section 4.4 [Install], page 7 which you may
have missed.
C
CHAR_CONVERSION_TABLE ()
CONFDIR (inputdir)
D
DEBUG_LEVEL
DEBUG_LEVEL (0)
DIC (outputdir/dictionary.dic)
DIC_PREAMBLE ()
DOUBLE_FLAGS ()
F
FLAGS ()
FS_INFO ()
G
GRAMMAR (grammardir/grammar)
GRAMMARDIR (inputdir)
H
HUNLEX (hunlex)
HUNMORPH (hunmorph)
I
INPUTDIR (. = current directory)
L
LEXICON (grammardir/lexicon)
M
MAX_LEVEL (10000)
MIN_LEVEL (1)
MODE (Analyzer)
MORPH (confdir/morph.conf)
O
OUT_DELIM ( )
OUT_DELIM_DIC (<TAB>)
OUTPUTDIR (. = current directory)
Q
QUIET
QUIET (@ = quiet)
R
REPLACEMENT_TABLE ()
S
SIGNATURE ()
STEM_GIVEN ()
STEMINFO (LemmaWithTag)
T
TAG_DELIM ()
TEST (/dev/stdin)
TIME
TIME (time)
U
USAGE (confdir/usage.conf)
W
WORDLIST (outputdir/wordlist)
Files Index
C
character conversion table
character replacement table
configuration files
configuration directory
F
flags file
flags.conf
fs.conf
G
grammar directory
grammar file
H
hunlex
hunlex.1
HunlexMakefile
I
input resources directory
install prefix
M
Makefile
Makefile, hunlex toplevel
Makefile, local
morph.conf
O
output resources directory
P
phono.conf
preambles for affix and dictionary files
primary resource files
S
signature file
T
test file
U
usage configuration file
usage.conf
W
wordlist file
Concept Index
A
affix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
affix clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
affix file, variables in . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
affix rules (in the affix file) . . . . . . . . . . . . . . . . . . . . . 32
affix rules, conditioning the application of . . . . . . 32
affix rules, merging affix rules in the affix file . . . 36
allomorph, setting stem allomorph to output . . . 20
B
block . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
block, the blocks used within morph statements . . . 25
C
character set, setting character set in the affix file . . . 21
command-line interface . . . . . . . . . . . . . . . . . . . . . . . . 12
compounding, affix file variables for compounding . . . 22
condition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
configuration of resource compilation . . . . . . . . . . . 31
cumulative, blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
D
debugging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
debugging resource compilation . . . . . . . . . . . . . . . . 16
default settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
default target . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10, 13
default values of options . . . . . . . . . . . . . . . . . . . . . . . 15
delimiter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
delimiter, tag . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
description language . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
dictionary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
double flags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
F
feature structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
feature structures . . . . . . . . . . . . . . . . . . . . . . . . . . 20, 41
feature, declaring features for compilation . . . . . . 32
flags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18, 42
flags, double . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
flags, limit on the number of . . . . . . . . . . . . . . . . . . . 43
flags, setting double flags . . . . . . . . . . . . . . . . . . . . . . 19
flags, setting special flags through toplevel options . . . 21
flags, special . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
G
H
hunlex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
hunlex, invoking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
hunmorph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
huntools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1, 16, 21, 38
I
input resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
J
jmorph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
L
lemma, setting output lemma . . . . . . . . . . . . . . . . . . 20
level, setting maximal level . . . . . . . . . . . . . . . . . . . . 19
level, setting minimal level . . . . . . . . . . . . . . . . . . . . . 19
levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19, 36
levels, setting the level of a morpheme . . . . . . . . . 32
lexical resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
lexicon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
LGPL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
license . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
M
macro definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
macros . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
make, GNU. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
make, hunlex installation . . . . . . . . . . . . . . . . . . . . . . . 7
make, toplevel control . . . . . . . . . . . . . . . . . . . . . . . . . 12
Makefile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Makefile variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Makefile, hunlex installation . . . . . . . . . . . . . . . . . . . . 7
metadata definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
metadata, output metadata into affix file . . . . . . . 22
minimal level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
mode, setting mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
morph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
morph definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
morphemes, configuring . . . . . . . . . . . . . . . . . . . . . . . . 31
morphological analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 1
morphological analyzer . . . . . . . . . . . . . . . . . . . . . . . . . 1
morphological annotation . . . . . . . . . . . . . . . . . . . . . . 40
myspell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
O
ocaml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
ocaml-make . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
OCamlMakefile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
options, default value . . . . . . . . . . . . . . . . . . . . . . . . . . 15
options, executable path . . . . . . . . . . . . . . . . . . . . . . . 15
options, remembering . . . . . . . . . . . . . . . . . . . . . . . . . . 13
options, toplevel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
output information, of a word-analysis routine . . 40
output options. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
output resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1, 33
output, tags to output . . . . . . . . . . . . . . . . . . . . . . . . . 20
P
preamble, the header part of a morph statement . . . 24
preambles, for affix and dictionary files . . . . . . . . . 21
primary resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
R
resource compilation . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
resource compilation, configuration of . . . . . . . . . . 31
resource, primary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
resource-specification language . . . . . . . . . . . . . . . . . . 1
resources, input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1, 30
resources, lexical . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
resources, output . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1, 33
resources, primary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
resources, secondary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
S
settings, storing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
signature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
special flags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
T
tags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19, 40
tags, formatting output . . . . . . . . . . . . . . . . . . . . . . . . 40
tags, merging tags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
tags, setting output tags . . . . . . . . . . . . . . . . . . . . . . . 20
targets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
targets, special . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
targets, test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
test targets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
test, a file to test your grammar on . . . . . . . . . . . . 18
test, testing resources . . . . . . . . . . . . . . . . . . . . . . . . . . 14
testing compiled resources . . . . . . . . . . . . . . . . . . . . . 18
testing tags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
toplevel control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
toplevel options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
U
usage configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
usage qualifiers, declaring . . . . . . . . . . . . . . . . . . . . . . 33
V
variables, affix file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
variables, in affix file . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
variables, Makefile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
variant, as part of a morph statement . . . . . . . . . . 24
verbosity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12, 16
I
A
Are there any Hunlex resources available for any language? . . . 46
Is there software with similar functionality? . . . 46
C
Can
Can
Can
Can
H
How are tags calculated by the analyzer?. . . . . . . 40
How can I associate tags with usage qualifiers? . . . 33
How can I configure resource compilation? . . . . . 31
How can I configure usage qualifiers? . . . . . . . . . . . 33
How can I configure what the output tags of my analyzer will be? . . . 40
U
Under what copyright restrictions can I distribute linguistic resources compiled or to be compiled with hunlex? . . . 4
Under what copyright restrictions can I redistribute this software? . . . 4
W
What are feature structures good for? . . . . . . . . . . 41
What are feature structures? . . . . . . . . . . . . . . . . . . . 41
What are levels? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
What are special flags? . . . . . . . . . . . . . . . . . . . . . . . . 44
What are the configuration files? . . . . . . . . . . . . . . . 31
What are the primary resources for hunlex? . . . . 30
What are usage qualifiers? . . . . . . . . . . . . . . . . . . . . . 33
What do I need if I want to install this package? . . . 7
What do I want to install? . . . . . . . . . . . . . . . . . . . . . . 7
What do levels mean for affixes and the affix file? . . . 36
What do levels mean for stems in the lexicon and the dictionary? . . . 36
What does a flag look like? . . . . . . . . . . . . . . . . . . . . 42
What does hunlex do with the tags? . . . . . . . . . . . 40
What files are relevant for hunlex? . . . . . . . . . . . . . 30
What files are the input for hunlex? . . . . . . . . . . . . 30
What gets installed when I install hunlex? . . . . . . 9
What is a block? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
What is a dictionary file? . . . . . . . . . . . . . . . . . . . . . . 33
What is a flag? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
What is a flaggable character? . . . . . . . . . . . . . . . . . 42
What is a variant? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
What is an affix file? . . . . . . . . . . . . . . . . . . . . . . . . . . 33
What is Hunlex? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1, 7
What is in the grammar file? . . . . . . . . . . . . . . . . . . 31
What is in the lexicon file? . . . . . . . . . . . . . . . . . . . . 30
What is morphological annotation? . . . . . . . . . . . . 40
What is the difference between features and usage qualifiers? . . . 33
What is the grammar file? . . . . . . . . . . . . . . . . . . . . . 31
What is the lexicon file? . . . . . . . . . . . . . . . . . . . . . . . 30
What is the License of this software? . . . . . . . . . . . . 4
What is the morph.conf file? . . . . . . . . . . . . . . . . . . . 31
What is the phono.conf file? . . . . . . . . . . . . . . . . . . . 32
What is the usage.conf file? . . . . . . . . . . . . . . . . . . . . 33
What output files does hunlex create? . . . . . . . . . . 33
What platforms are supported by Hunlex? . . . . . . 7
Where can I find the sources? . . . . . . . . . . . . . . . . . . . 7
Where do the tags come from? . . . . . . . . . . . . . . . . . 40
Which affix rules do I want to merge and which words do I want to . . . 38
Which characters can be flags? . . . . . . . . . . . . . . . . . 42
Which files actually describe the language? . . . . . 30
Which flags will be in the affix file and which flag is a particular . . . 42
Which software can use the output of Hunlex? . . . 46
Who are you? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
With which options can I manipulate levels (and thereby affix-merging)? . . . 37