
HunLex

Morphological resource specification framework
and precompilation tool
Reference Manual
Edition draft for release candidate for pre-beta version 0.1

Viktor Trón
IGK, Language Technology and Cognitive Systems, Universities of Edinburgh & Saarbrücken. MOKK Lab, Budapest Institute of Technology, Budapest. v.tron@ed.ac.uk
This file documents the HunLex morphological resource specification framework and precompilation tool (HunLex). It corresponds to release 0.1 of the Hunlex distribution.
More information about Hunlex can be found at the MOKK Lab homepage,
http://lab.mokk.bme.hu.

Table of Contents

1 Introduction
  1.1 Hunlex: A Short Description
  1.2 Motivation
  1.3 Configurable Compilations
2 License
3 Authors, Contact, Bugs
  3.1 License? What license?
  3.2 Submitting a Bug Report
  3.3 Requesting a New Feature
  3.4 Praises
  3.5 Contribution
  3.6 Reference
  3.7 Authors
  3.8 Contact
4 Installation
  4.1 Download
  4.2 Supported Platforms
  4.3 Prerequisites
  4.4 Install
  4.5 Uninstall and Reinstall
  4.6 Installed Files
5 Bootstrapping
6 Toplevel Control
  6.1 Verbosity and Debugging
  6.2 Storing your Settings
  6.3 Targets
    6.3.1 Resource Compilation Targets
    6.3.2 Special Targets
    6.3.3 Test Targets
  6.4 Options
    6.4.1 Executable Path Options
    6.4.2 Verbosity and Debug Options
    6.4.3 Input File Options
    6.4.4 Output File Options
    6.4.5 Resource Compilation Options
7 Description Language
  7.1 Morphs
    7.1.1 Morph Preamble and Variants
    7.1.2 Blocks
  7.2 Macros
  7.3 Metadata
8 Files
  8.1 Input Resources
    8.1.1 Primary Resources
      8.1.1.1 Lexicon
      8.1.1.2 Grammar
    8.1.2 Configuration Files
    8.1.3 Morpheme Configuration File
    8.1.4 Feature Configuration File
    8.1.5 Usage Configuration
  8.2 Output Resources
9 Command-line Control
10 Levels
  10.1 Levels and Affix Rules
  10.2 Levels and Stems
  10.3 Levels and Ordering
  10.4 Manipulating Levels with Options
    10.4.1 Levels and Generation
    10.4.2 Levels and No Clusters
    10.4.3 Levels and Steps of Affix Stripping
  10.5 Levels and Optimizing Performance
11 Tags
  11.1 Merging Tags
  11.2 Feature Structures
12 Flags
  12.1 Two Forms of Flags
  12.2 Flaggable Characters
  12.3 Limit on the Number of Flags
  12.4 Special Flags
13 Troubleshooting
  13.1 Installation Problems
  13.2 Problems running hunlex
  13.3 Resource Compilation Problems
  13.4 Grammar Problems
14 Related Software and Resources
  14.1 Software that can use the output of Hunlex as input
    14.1.1 Huntools
    14.1.2 Myspell
    14.1.3 Jmorph
    14.1.4 Ispell
  14.2 Available resources
    14.2.1 The Hungarian Morphdb Project
    14.2.2 The English Morphdb Project
  14.3 Hunlex's relatives
    14.3.1 XFST, TWOLC, LEXC
Variables and Options Index
Description Language Index
Files Index
Concept Index
Frequently Asked Questions

Chapter 1: Introduction

1 Introduction
This document presents the HunLex morphological resource specification framework and precompilation tool, which is being developed as part of the HunTools Natural Language Processing Toolkit of the Budapest Institute of Technology Media Education and Research Center (http://lab.mokk.bme.hu).

1.1 Hunlex: A Short Description


HunLex offers a description language, i.e., a formalism for specifying a base lexicon and morphological rules which describe a language's morphology. This description, which is stored in textual format, serves as your primary resource: it represents your knowledge about the morphology and lexicon of the language in question.
Now, providing a resource-specification language is rather useless in itself. Hunlex is able to process these primary resources and create the type of resources that are used by some real-time word-level analysis tools. If you create these from your primary resources, you might call them secondary resources. These provide the language-specific knowledge to a variety of word-level analysis tools.
At present, most importantly, Hunlex provides the language-specific resources for the HunTools word-level analysis toolkit (see Section 14.1.1 [Huntools], page 46). This package contains the MorphBase library of word-analysis routines such as spell-checker, stemmer, morphological analyzer/generator and their standalone executable wrappers. Therefore, a single Hunlex description of your favourite language will enable you to perform spell-checking, stemming, and morphological analysis for that language, which is more than useful.
In addition to the HunTools routines, other software that uses ispell-type resources will be able to use Hunlex's output. Among these are myspell, an open-source spell-checker (also used in Open Office http://www.openoffice.org, see Section 14.1.2 [Myspell], page 46), and jmorph, a superfast Java morphological analyzer (see Section 14.1.3 [Jmorph], page 46).
This document describes how you can create your primary resources and what you can (make Hunlex) do with them.
Note: This document is not intended to describe how to use any of these real-time tools or what they are good for. See the above links to learn more about them.
In particular, this document provides you with:
1. The compulsory tedium about the license, authors, contact, bug reports, etc. (see Chapter 2 [License], page 4, Section 3.7 [Authors], page 6, Section 3.8 [Contact], page 6, Section 3.2 [Submitting a Bug Report], page 5, and Chapter 3 [About], page 5).
2. The indispensable but trivial installation notes (see Chapter 4 [Installation], page 7).
3. A bit about bootstrapping your way as a Hunlex user (see Chapter 5 [Bootstrapping], page 10).
4. The detailed exposition of the syntax and semantics of the resource specification language (see Chapter 7 [Description Language], page 24).
TODO: not yet
5. The description of the toplevel control of the hunlex resource compiler (see Chapter 6 [Toplevel Control], page 12), detailing all the options and parameters. The direct command-line interface is also described there.
6. Some hints on troubleshooting (see Chapter 13 [Troubleshooting], page 45).
7. Information about related software and resources (see Chapter 14 [Related Software and Resources], page 46).
8. A lot of advanced issues, like flags (see Chapter 12 [Flags], page 42), levels (see Chapter 10 [Levels], page 36), tags (see Chapter 11 [Tags], page 40), and the list and format of files (see Chapter 8 [Files], page 30).

1.2 Motivation
The motivation behind HunLex came from two opposing types of requirements that lexical resources are supposed to fulfill:
1. (i) scalability, maintainability, extensibility; and
2. (ii) an optimized format for the application.
The constraints in (i) favour one central, redundancy-free, abstract, but transparent specification, while the ones in (ii) require possibly multiple application-specific, potentially redundant, optimized formats.
In order to reconcile these two opposing requirements, HunLex introduces an offline layer into the word-analysis workflow, which mediates between two levels of resources:
1. a central database conforming to (i) (also called primary resource or input resource),
2. various application-specific formats conforming to (ii) (also called secondary or output resources).
The primary resources are supposed to be reasonably designed to help human maintenance, and the secondary ones are supposed to optimize very different things, ranging from file size, performance with the tool that uses them, coverage, and robustness to verbosity and normative strictness, depending on who uses them for what purpose.
HunLex is used to compile the primary resources into a particular application-specific format (see Section 8.2 [Output Resources], page 33). This resource compilation phase is an offline process which is highly configurable, so that users can fine-tune the output resources according to their needs.
By introducing this layer of offline resource compilation, the maintenance, extensibility, and portability of lexical resources become possible without compromising performance on specific word-analysis tasks.
Providing the environment for a sensible primary resource specification framework and managing the offline precompilation process are the raison d'être behind Hunlex.

1.3 Configurable Compilations


Configuration allows you to adjust the compilation of resources along various dimensions:
1. choice of output format that suits the algorithm (spell-checking, stemming, morphological analysis, generation, synthesis),
2. selection of morphemes to be included in the resource,
3. grouping of morphemes to be stripped in one step as an affix cluster (with one rule application),
4. selection of morphophonological features that are to be observed or ignored,
5. depth of recursive rule application,
6. selection of registers, degree of normativity, etc. based on usage qualifiers in the database,
7. selection of output morphological annotation and configurable tag information.

Chapter 2: License

2 License
Hunlex is free software.
It is licensed under the LGPL, which roughly means the following.
There are no restrictions on downloading it other than your bandwidth and our slothful ways of making things available.
There are no restrictions on use either, other than its deficiencies, clumsy features and outrageous bugs. However, this can be amended, because there are no restrictions on modifying it either. See also Section 3.5 [Contribution], page 5.
Freedom of use implies that any resources that you create or compile with the mediation of Hunlex are yours, and you hold the right to distribute them in any way. Consider telling us about this great news, see Section 3.8 [Contact], page 6.
What is more, there are no restrictions on redistributing this software or any modified version of it.
For some legalese telling you the same, read the License at http://creativecommons.org/licenses/LGPL/2.1/
Todo: Shall we not include the License?

Chapter 3: Authors, Contact, Bugs

3 Authors, Contact, Bugs


3.1 License? What license?
See Chapter 2 [License], page 4.

3.2 Submitting a Bug Report


If you find a bug or an undesirable feature or anything that is worth a couple of lines ranting at the authors, please go ahead and send a bug report on the MOKK Lab bugzilla page at http://lab.mokk.bme.hu or send a mail to me (see Section 3.8 [Contact], page 6).

3.3 Requesting a New Feature


So you are using hunlex and find yourself realizing that you desperately need a certain feature which happens not to be implemented. Go ahead and request it from the authors (see Section 3.8 [Contact], page 6) or sit silently and hope!

3.4 Praises
So you found hunlex cool and/or useful and would like the authors to hear about that. How
nice is that! See Section 3.8 [Contact], page 6.

3.5 Contribution
Hunlex is an open-source development, so developers are welcome to contribute to make it better in any imaginable way. Contact us (see Section 3.8 [Contact], page 6) to work out the details of how and what you would want to contribute to Hunlex.

3.6 Reference
For the context of the whole huntools kit, use
@InProceedings{szoszablya_saltmil:04,
  author =       {L\'aszl\'o N\'emeth and Viktor Tr\'on
                  and P\'eter Hal\'acsy and Andr\'as Kornai
                  and Andr\'as Rung and Istv\'an Szakad\'at},
  title =        {Leveraging the open-source ispell codebase
                  for minority language analysis},
  booktitle =    {Proceedings of SALTMIL 2004},
  year =         2004,
  organization = {European Language Resources Association},
  url =          {http://lab.mokk.bme.hu/}
}
A very brief intro to hunlex with a one-page English resume.
@InProceedings{hunlex_mszny:04,
  author =       {Tr\'on, Viktor},
  title =        {HunLex - a description framework and
                  resource compilation tool for morphological dictionaries},
  booktitle =    {II. Magyar Sz\'am\'{\i}t\'og\'epes
                  Nyelv\'eszeti Konferencia},
  institution =  {Szegedi Tudom\'anyegyetem},
  address =      {Szeged, Hungary},
  year =         2004
}
These and other papers can be downloaded from the MOKK Lab publications page at
http://lab.mokk.bme.hu

3.7 Authors
The author of hunlex and this document is Viktor Trón. He can be reached by e-mail at
v.tron@ed.ac.uk
Hopefully more can be found on MOKK Lab's pages at http://lab.mokk.bme.hu.

3.8 Contact
You can get in contact with us if you
1. Mail Viktor Trón at v.tron@ed.ac.uk
2. Join the forums on http://lab.mokk.bme.hu
3. Submit a bug report (see Section 3.2 [Submitting a Bug Report], page 5) or feature
request (see Section 3.3 [Requesting a New Feature], page 5).

Chapter 4: Installation

4 Installation
So you want to install the hunlex toolkit (see Chapter 1 [Introduction], page 1) from the
hunlex source distribution. This document describes what and how you can install with
this distribution.

4.1 Download
The latest version of the hunlex source distribution is always available from the MOKK Lab website at http://lab.mokk.bme.hu or, if all else fails, by mailing me at v.tron@ed.ac.uk.

4.2 Supported Platforms


The hunlex executable in principle runs on any platform for which there is an OCaml compiler (see Section 4.3 [Prerequisites], page 7). This includes all Linuxes, Unices, MS Windows, etc.
Warning: This package has not been tested on platforms other than Linux.

4.3 Prerequisites
ocaml [Prerequisite]
Hunlex is written in the OCaml programming language (http://www.ocaml.org/). OCaml compilers are extremely easy to install, are available for various platforms, and can be downloaded for free in various package formats from http://caml.inria.fr/ocaml/distrib.html.
You will need ocaml version >= 3.08 to compile hunlex.

ocaml-make [Prerequisite]
OCamlMakefile (i.e., ocaml-make) is needed for the installation of hunlex and is available from Markus Mottl's homepage at
http://www.ai.univie.ac.at/~markus/home/ocaml_sources.html#OCamlMakefile
(I used version 6.19, writing on 8.1.2004).
For OCamlMakefile you will need ocaml and GNU make (for ocaml-make version 6.19 you will need GNU make version >= 3.80).
NB: Earlier versions of ocaml-make and GNU make will most probably work as well, but they have not been tested yet.
You don't need anything else to use hunlex (but a little patience).

4.4 Install
Hunlex is installed in the good old way, i.e., by typing
$ make && sudo make install
in the toplevel directory of the unpacked distribution. Read no further if you know what I
am talking about or if you trust some God.


The hunlex distribution is available in a source tarball called hunlex.tgz. First you have
to unpack it by typing
$ tar xzvf hunlex.tgz
Then, you enter the toplevel directory of the unpacked distribution with
$ cd hunlex
To compile it, simply type
$ make
in the toplevel directory of the distribution.
To install it (on what gets installed, see Section 4.6 [Installed Files], page 9), type
$ make install
Well, by default this would want to install things under /usr/local, so you have to have
admin permissions. If you are not root but you are in the sudoers file with the appropriate
rights, you type:
$ sudo make install
You can change the location of the installation by changing the install prefix path with
$ sudo make PREFIX=/my/favourite/path install
Changing the installation location of individual install targets is not recommended, but it is easy if you know your way around make and Makefiles. To do this, you have to change the relevant Makefiles in the subdirectories of the distribution. See
Section 4.6 [Installed Files], page 9.
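If you do not have admin permissions, the PREFIX mechanism can also point at a directory you own, in which case sudo is not needed. The following is a minimal sketch, assuming a hypothetical prefix under your home directory ($HOME/hunlex-install is not a name used by the distribution) and assuming that the commands are run from the toplevel directory of the unpacked distribution:

```shell
# Hypothetical install prefix (an assumption); any directory you can write to will do.
PREFIX="$HOME/hunlex-install"

# Build and install only when a Makefile is present, i.e. when this is run
# from the toplevel directory of the unpacked distribution.
if [ -f Makefile ]; then
    make && make PREFIX="$PREFIX" install
fi

# HunlexMakefile expects the hunlex executable to be found in the PATH,
# so put the bin directory of the chosen prefix in front of it.
PATH="$PREFIX/bin:$PATH"
export PATH
```

The PATH change only lasts for the current shell session; adding the last two lines to your shell startup file would make it permanent.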
If it works, great! Go ahead to Chapter 5 [Bootstrapping], page 10.
If you have problems, double-check that you have the prerequisites (see Section 4.3 [Prerequisites], page 7). If you think you followed the instructions but still have problems, submit a bug report (see Section 3.2 [Submitting a Bug Report], page 5).
If you are upgrading an earlier version of hunlex, you may want to uninstall the earlier one
first (see Section 4.5 [Uninstall and Reinstall], page 8).

4.5 Uninstall and Reinstall


The install prefix is remembered in the source distribution in the file install_prefix.
So after you cd into the toplevel directory of the distribution, you can uninstall hunlex by
typing
$ make uninstall
You can reinstall it with
$ make reinstall
at any time if you make modifications to the code or compile options.
Warning: Note that if you fiddle with changing the location of individual install targets, uninstall and reinstall will not work correctly.


4.6 Installed Files


The following files and directories are installed, paths are relative to the install prefix (see
Section 4.4 [Install], page 7):
bin/hunlex
the executable which can be run on the command line (see Chapter 9 [Command-line
Control], page 35)
lib/HunlexMakefile
is the Makefile that defines the toplevel control of hunlex (see Chapter 6 [Toplevel
Control], page 12). This file is to be include-ed into your local Makefile to give you
a Makefile-style wrapper for calling hunlex (see Chapter 5 [Bootstrapping], page 10
and Chapter 6 [Toplevel Control], page 12).
Note that HunlexMakefile will assume that the hunlex executable is found in your path. Make sure that install-prefix/bin is in the PATH (usually /usr/local/bin is in the PATH).
share/doc/hunlex/
is a directory containing hunlex documentation. Various documents in various formats are found under this directory, including a replica of this document.
TODO: this is not yet the case
man/hunlex.1
is the hunlex man page, which describes the command-line use of hunlex (see also Chapter 9 [Command-line Control], page 35). Command-line use of hunlex is not the recommended way of using it for the general user. Instead, use hunlex through the toplevel control (see Chapter 6 [Toplevel Control], page 12).
Todo: there is no man page yet

Chapter 5: Bootstrapping


5 Bootstrapping
So you installed hunlex and it's running smoothly.
This section leads you through the first steps and gives you hints on how you set out working
with hunlex.
Create your sandbox directory.
Change to it.
Create your own local Makefile. This will be your connection to the hunlex toplevel control.
For your Makefile to understand hunlex predefined toplevel targets (see Section 6.3 [Targets],
page 13), you have to include (not insert) the hunlex systemwide Makefile. So you create a
Makefile with the following content:
-include /path/to/HunlexMakefile
where /path/to/HunlexMakefile is the path to HunlexMakefile which is supposed to
be installed on your system (see Section 4.6 [Installed Files], page 9), by default under
/usr/local/lib/HunlexMakefile.
Now, you are ready to test things for yourself. In order to see if all is well, type
$ make
at your prompt in the same sandbox directory.
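The steps so far can be sketched as a short shell session. The directory name hunlex-sandbox is a hypothetical choice, and the path assumes the default install prefix /usr/local; make is only run if HunlexMakefile is actually found there:

```shell
# Hypothetical sandbox name (an assumption); any empty directory works.
SANDBOX="hunlex-sandbox"
mkdir -p "$SANDBOX"
cd "$SANDBOX" || exit 1

# Default location of the systemwide Makefile; adjust it if you installed
# hunlex with a different PREFIX.
HUNLEX_MAKEFILE="/usr/local/lib/HunlexMakefile"

# The local Makefile consists of a single include line.
printf '%s\n' "-include $HUNLEX_MAKEFILE" > Makefile

# Run the default target only if hunlex is really installed in that location.
if [ -f "$HUNLEX_MAKEFILE" ]; then
    make
fi
```

Because the include line starts with a dash, make does not complain about a missing HunlexMakefile; in that case it simply finds no targets to build, which is a hint that the path needs adjusting.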
In fact, you will always type the make command to control hunlex. If you don't give
arguments to make, a so-called default action (target, see Section 6.3 [Targets], page 13) is
assumed. The default target is resources which creates the output resources according
to the default settings (see Section 6.4 [Options], page 15).
Toplevel control assumes by default that all its necessary resources are found in the current
directory (see Section 6.4.3 [Input File Options], page 17). If this is not the case, because
the files do not exist, the compulsory ones are created and the compilation runs creating
the output resources.
Surely, the missing files are created without contents and your output resources will be
empty as well. However, this vacuous run will test whether hunlex (and toplevel control) is
working properly.
Now if you list your directory, you should see:
$ ls
affix.aff       grammar  Makefile    phono.conf
dictionary.dic  lexicon  morph.conf  usage.conf

If this is not the case, go to see Chapter 13 [Troubleshooting], page 45.


The meaning of these files in your directory is explained in detail in another chapter (see Chapter 8 [Files], page 30).
If you type make (or the equivalent make resources) again, your resources will not be compiled again, since the input resources did not change. If you still want to compile your resources again, you type
$ make new resources
which forces toplevel to recompile although no input files changed (see Section 6.3.2 [Special Targets], page 14).


Now.
If you want to develop (toy around with) your own data and create resources, the next
step is to fill in the input files. Read on to learn more about files (see Chapter 8 [Files],
page 30) and then about the hunlex morphological resource specification language (see
Chapter 7 [Description Language], page 24). Since you want to test your creation, you
ultimately have to learn about toplevel control (see Chapter 6 [Toplevel Control], page 12)
and gradually about the advanced issues in the chapters that follow these.
If you already have your hunlex resources describing your favourite language ready and you want to compile specific output resources from them with hunlex, you had better read about toplevel control with special attention to the options (see Chapter 6 [Toplevel Control], page 12). If you want to fiddle around with more advanced optimization, such as levels and tags, you may end up having to read everything, sorry.

Chapter 6: Toplevel Control


6 Toplevel Control
You typically want to use hunlex through its toplevel control interface. Toplevel control
means that you invoke hunlex indirectly through a Makefile to compile your resources.
We envisage typical users of hunlex developing their lexical resources in an input directory and occasionally dumping output resources for their analyser into specific target directories for various applications.
If you don't like Makefiles or your system does not have make (how did you compile hunlex, then?), you will have to invoke hunlex from a shell and use it via the command-line interface. This is non-typical use and not recommended. The command-line interface, which is almost equivalent in functionality to the Makefile interface, is described only for completeness and for people developing alternative wrappers (see Chapter 9 [Command-line Control], page 35).
In fact, you don't actually need to know much about make and Makefiles to use hunlex. Just follow the steps described in Chapter 5 [Bootstrapping], page 10. We assume that you have a project directory with a Makefile sitting in it in order to try out what is described here.
This document is more like a reference manual that details what you can do with your resources and how you can do it through the Makefile interface. What the resources are and how you can develop your own is described in other chapters (see Chapter 8 [Files], page 30 and Chapter 7 [Description Language], page 24).

6.1 Verbosity and Debugging


First of all, you need to know how to make your compilation process more verbose.
In order to see what the toplevel Makefile wrapper is doing, you have to unset the QUIET option. For instance, typing
$ make QUIET= new resources
will tell you what the Makefile is doing, i.e., what programs it invokes, etc. Unless you are debugging the toplevel control interface of hunlex, you don't want the toplevel to be verbose about what it is doing. So just don't do this.
What you want instead is to make the resource compilation process more verbose, probably because you want to debug your grammar or want hunlex to give you hints about what went wrong with your resource compilation.
Verbosity of the hunlex resource compilation can be set with the DEBUG_LEVEL option.
Typing
$ make DEBUG_LEVEL=1
in your sandbox (with empty primary resources) will give you something like this (see
Chapter 5 [Bootstrapping], page 10):
Reading morpheme declarations and levels...0 morphemes declared.
Reading phono features...0 phono features declared.
Reading usage qualifiers...0 usage qualifiers declared.
Parsing the grammar...ok
Parsing the lexicon and performing closure on levels... 0 entries read.
Dynamically allocating flags; dumping affix file...ok
Dumping precompiled stems to dictionary file...ok
0.00user 0.00system 0:00.02elapsed 12%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+329minor)pagefaults 0swaps
The first couple of lines give you information about the stages of compilation and are described elsewhere.
The enigmatic last two lines tell you how long it took hunlex to compile your resources. If you are not interested in this information, you can unset the TIME option (see 4). Typing, say,
$ make TIME= new resources
will not measure and display the duration of compiling.

6.2 Storing your Settings


Your favourite settings can be remembered by adding them to your local Makefile in a rather obvious way. Let us assume you want your DEBUG_LEVEL to be set to 1 by default and also that you couldn't care less about the time of compilation. In this case you want to have the following in your Makefile:
DEBUG_LEVEL=1
TIME=
You can also define your default target (see Section 6.3 [Targets], page 13), i.e., the task that make will carry out if you invoke it without an explicit target. For instance, if you always want to recompile your resources each time you invoke make, irrespective of whether your primary resources and/or compile configurations changed, you can add the following line at the top of the file:
default: new resources
Now, your Makefile looks something like this:
# comments are introduced by a #
# my favourite target
default: new resources
# my favourite settings
DEBUG_LEVEL=1
TIME=
-include /path/to/HunlexMakefile

6.3 Targets
The functionality of hunlex is accessed through targets. Targets are arguments of the make
command, which reads your local Makefile and ultimately consults the system-wide hunlex
toplevel Makefile called HunlexMakefile (see Section 4.6 [Installed Files], page 9).
Usually, you will control hunlex through make by typing:
make options targets
where options is a sequence of variable assignments which set the options described below
(see Section 6.4 [Options], page 15) and where targets is a sequence of targets. For more
on variables and targets you may consult the manual of make.
The available toplevel targets are detailed below:

6.3.1 Resource Compilation Targets


resources [Resource Compilation Target]
compiles the output resources given the input resources and configuration files. The
necessary file locations and options are defined by the relevant variables described
below (see Section 6.4.3 [Input File Options], page 17). This target creates the dictionary
and the affix files (by default dictionary.dic and affix.aff, see Section 8.2
[Output Resources], page 33).

generate [Resource Compilation Target]
by setting MIN_LEVEL to a big number, this target generates resources that contain
all words of the language precompiled into the dictionary. The stems of the
dictionary without their output annotation (see Annotation) are found in the file
wordlist.

6.3.2 Special Targets


new [Special Target]
pretends that the base resources have changed. You need this directive if you want
to recompile the resources although no primary resource has changed, for instance
because you are using a different configuration option. (If the base resources
are unchanged, no compilation takes place unless you force it with new; see the
manual of make.)
$ make MIN_LEVEL=3 new resources

clean [Special Target]
removes all intermediate temporary files, so that only the lexicon, the grammar, the
configuration files, and the output resources (affix and dictionary) remain.
Todo: This is not implemented yet.

distclean [Special Target]
removes all non-primary resources, so that only the lexicon, the grammar, and the configuration files remain.

6.3.3 Test Targets


Additional targets for testing are available; these all presuppose that the huntools package (see
Section 14.1.1 [Huntools], page 46) is installed and that the executable hunmorph is found
in the path. An alternative hunmorph can be used by setting the HUNMORPH option (see
Section 6.4.1 [Executable Path Options], page 15).

test [Test Target]
tests the resources by making hunmorph read them (the dic and aff files) and
analyze the contents of the file that is the value of TEST (see Section 6.4.3 [Input File
Options], page 17). TEST is by default set to the standard input, so after saying
$ make test
you have to type in words in the terminal window (exiting with C-d).
If you want to test by analyzing a file, you have to set the value of TEST.
$ make TEST=my/favourite/testfile test
Test output goes to stdout, so just redirect it to a file:
$ make TEST=my/favourite/testfile test > test.out 2> test.log

testwordlist [Test Target]
runs hunmorph on the wordlist file (see Section 6.3.1 [Resource Compilation
Targets], page 14, generate) and writes the result to the standard output (so you
may want to redirect the result to a file).

realtest [Test Target]
puts hunlex and the analyzer to the test by creating the resources according to the
settings of your Makefile and then running hunmorph on the whole generated wordlist.
Warning: Note that this target first generates all words and then creates
the resources again. Running this on huge databases is probably not a
good idea.
The way you want to test a bigger database instead is by creating a set
of words that your ideal analyzer has to recognize or correctly analyze
and testing on that (with test). realtest is just a quick and dirty shorthand
for toy databases to check if everybody is with us.

6.4 Options
Options of the toplevel are in effect Makefile variables that can be set at the user's will.
(All the command-line options of hunlex can be accessed through the toplevel options, which are
passed to hunlex to regulate the compilation process. The documentation of command-line
options is found in Chapter 9 [Command-line Control], page 35, but only for the record.
Toplevel options are written in all capital letters (LEXICON), while command-line options begin with a
dash and are written in all small letters (-lexicon); otherwise they are the same.)
All options can be set or reset in your local Makefile (and remembered, see Section 6.2
[Storing your Settings], page 13). These will override the system default. Both the system
default and your local default can be overridden by direct command-line variable assignments
passed to make, such as the following:
$ make QUIET= DEBUG_LEVEL=3 OUTPUTDIR=/my/favourite/outputdir
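The override order described above (command-line assignment over local Makefile over system default) can be pictured with a small Python sketch. This is only an illustration of the lookup order, not how make or HunlexMakefile is implemented; the system defaults are the ones documented in this chapter, while the contents of the other two layers and the function name effective_options are made up for the example.

```python
from collections import ChainMap

# System defaults as documented in this chapter; the other two layers are made-up examples.
SYSTEM_DEFAULTS = {"DEBUG_LEVEL": "0", "TIME": "time", "OUTPUTDIR": "."}
local_makefile = {"DEBUG_LEVEL": "1", "TIME": ""}
command_line = {"DEBUG_LEVEL": "3", "OUTPUTDIR": "/my/favourite/outputdir"}


def effective_options(command_line, local_makefile, system_defaults):
    """Return the settings in force; earlier layers shadow later ones."""
    return dict(ChainMap(command_line, local_makefile, system_defaults))


options = effective_options(command_line, local_makefile, SYSTEM_DEFAULTS)
for name in sorted(options):
    print(f"{name}={options[name]}")
```

Here DEBUG_LEVEL comes from the command line, TIME (unset) from the local Makefile, and nothing falls back to the system default because every variable is overridden somewhere.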
Listed and explained below are all the hunlex options (all public Makefile variables) that
the toplevel control provides for the user to manipulate.
When you see something like variable (value), it means the default value of the variable
variable is value.

6.4.1 Executable Path Options


HUNLEX (hunlex) [Option]
The hunlex executable is by default assumed to be found in the path with name
hunlex. By default, installation installs hunlex into /usr/local/bin (see Chapter 4
[Installation], page 7). If you want to use (i) an alternative version of hunlex that
is not the one found in the path, or (ii) an uninstalled version of hunlex, or (iii) an
installed version whose location you don't want to include in your path, then
you should set which hunlex to use with this variable.
HUNLEX=/my/favourite/version/of/hunlex

HUNMORPH (hunmorph) [Option]
You need the executable hunmorph from the Huntools package (see Section 14.1.1
[Huntools], page 46) only for testing. If you don't want to test with direct analysis
(you just want to compile the resources), you don't need to bother.
When used, however, the hunmorph executable is assumed to be found in the path
with name hunmorph. If this is not the case, update your path or provide the path to
hunmorph with the line
HUNMORPH=/my/favourite/version/of/hunmorph

6.4.2 Verbosity and Debug Options


QUIET (@ = quiet) [Option]
Quiet mode is set by default, which means that the workings of the Makefile toplevel
won't bore you to death. The compilation debug messages that hunlex blurps when
running can still be displayed independently (see the DEBUG_LEVEL option below). The
QUIET option only refers to what the toplevel wrapper invokes (this way of handling
Makefile verbosity is an idea nicked from OCamlMakefile by Markus Mottl).

DEBUG_LEVEL (0) [Option]
sets the verbosity of hunlex itself. By default the debug level is set to 0. Debug messages
are sensitive to the debug level in the range from 0 to 6-ish: the higher the number,
the more verbose hunlex is about its doings.
0 is non-verbose mode, which means that it only displays (fatal) error messages. If
you set DEBUG_LEVEL to, say, -1, even error messages will be suppressed (only an
uncaught exception will be reported in case of fatal errors).
It is typically a good idea to set DEBUG_LEVEL to 2 or 3 and request more only if you really
want to see what is happening.
Caveat: In fact you won't understand the messages anyway, so the debug
blurps just give you an idea of the context where something went wrong
with your grammar/lexicon, etc.
Todo: This shouldn't be so, and debug messages pertaining to grammar
development should be self-evident or well designed and documented.
Especially parsing errors and/or compile warnings about the grammar
and lexicon should be clear.
Usually you want to create a log by redirecting the debug output of make (standard error)
to a file. This can be done, for instance, by
$ make DEBUG_LEVEL=5 resources 2> log

TIME (time) [Option]
By default, every run of hunlex is timed with the unix shell's time command, and the time
it took to compile the resources is displayed. Surely, this
is only interesting with big lexicons. If you (i) don't have a time command, (ii) have
a different time command, or (iii) don't want time measured and displayed, just reset
the TIME variable. The option can be unset by the line
TIME=
in your local Makefile.

6.4.3 Input File Options


The type and use of hunlex input resource files are described in detail elsewhere (see
Section 8.1 [Input Resources], page 30). The options by which their locations can be (re)set
are listed below:

LEXICON (grammardir/lexicon) [Option]
lexicon file

GRAMMAR (grammardir/grammar) [Option]
grammar file

They can both be set to alternative paths individually. If they are in the same directory, the
directory path can also be set via the variable GRAMMARDIR:

GRAMMARDIR (inputdir) [Option]
the directory for the hunlex primary input resource files, which is, by default, set to
inputdir, the value of the variable INPUTDIR, see below.

There are three further input resources which need to be present for a hunlex compilation.
These are the compilation configuration files.

USAGE (confdir/usage.conf) [Option]
the usage configuration file (see Section 8.1.2 [Configuration Files], page 31)

MORPH (confdir/morph.conf) [Option]
the morph(eme) configuration file (see Section 8.1.2 [Configuration Files], page 31)

PHONO (confdir/phono.conf) [Option]
the configuration file (see Section 8.1.2 [Configuration Files], page 31) for
morphophonologic and morphoorthographic features

There are two optional configuration files, the signature and the flags file. By default, the
options corresponding to these files are set to the empty string, which tells hunlex not to use
feature structures (see Section 11.2 [Feature Structures], page 41) or custom output flags
(see Chapter 12 [Flags], page 42).
SIGNATURE () [Option]
The location of the signature file used to process and validate feature structures (see
Section 11.2 [Feature Structures], page 41, see Section 8.1.2 [Configuration Files],
page 31). If it is set to the empty string (the default), hunlex does not use feature
structures.
If you use this file, it makes sense to call it something like fs.conf or
signature.conf and store it in confdir with your other configuration files, so the
assignment
SIGNATURE=$(CONFDIR)/fs.conf
is an appropriate setting.

FLAGS () [Option]
The location of the custom output flags file (see Section 8.1.2 [Configuration Files],
page 31) used to decide which flags are used in the output resources (see Chapter 12
[Flags], page 42). If it is set to the empty string (the default), hunlex will use a built-in
flagset to determine flaggable characters (see Chapter 12 [Flags], page 42).
If you use this file, it makes sense to call it something like flags.conf and store it
in confdir with your other configuration files, so the assignment
FLAGS=$(CONFDIR)/flags.conf
is an appropriate setting.

All configuration files can be set to alternative paths individually. If they are in the same
directory, the directory path can also be set via the variable CONFDIR:
CONFDIR (inputdir) [Option]
the directory for the hunlex compilation configuration files, which is, by default, set
to inputdir, the value of the variable INPUTDIR, see below.

As explained, all input files can be set to alternative paths individually, or the primary resources
can be set together and the configuration files together. If all input resources (primary and configuration)
are in the same directory, this directory path can also be set via the variable INPUTDIR:

INPUTDIR (. = current directory) [Option]
the directory for all hunlex input resource files, which is, by default, set to the current
directory.
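The chain of defaults described above (INPUTDIR feeds GRAMMARDIR and CONFDIR, which in turn feed the individual file options) can be sketched in Python as follows. This is a rough model of the documented defaults only, not the actual logic of HunlexMakefile; the function name resolve_input_paths is hypothetical, and the optional SIGNATURE and FLAGS files are left out because they default to the empty string.

```python
import posixpath


def resolve_input_paths(overrides=None):
    """Fill in every input location that was not set explicitly."""
    opts = dict(overrides or {})
    inputdir = opts.setdefault("INPUTDIR", ".")
    grammardir = opts.setdefault("GRAMMARDIR", inputdir)
    confdir = opts.setdefault("CONFDIR", inputdir)
    # Primary resources live under GRAMMARDIR by default.
    opts.setdefault("LEXICON", posixpath.join(grammardir, "lexicon"))
    opts.setdefault("GRAMMAR", posixpath.join(grammardir, "grammar"))
    # Compilation configuration files live under CONFDIR by default.
    opts.setdefault("USAGE", posixpath.join(confdir, "usage.conf"))
    opts.setdefault("MORPH", posixpath.join(confdir, "morph.conf"))
    opts.setdefault("PHONO", posixpath.join(confdir, "phono.conf"))
    return opts


paths = resolve_input_paths({"CONFDIR": "conf", "LEXICON": "big/lexicon"})
```

In this example only the configuration files move to conf, the lexicon is taken from its explicitly given path, and the grammar still comes from the current directory.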

A special test file is only used with the Test targets:


TEST (/dev/stdin) [Option]
The value of TEST is a file (well, a file descriptor, to be precise), the contents of
which are tested whenever the toplevel test target is called (see Section 6.3.3 [Test
Targets], page 14). By default it is set to the standard input, so testing with test
will expect you to type in words in your terminal window.

6.4.4 Output File Options


Hunlex's output resources are the affix and the dictionary files (see Section 8.2 [Output
Resources], page 33). The options by which their locations can be (re)set are listed below:

AFF (outputdir/affix.aff) [Option]
affix file

DIC (outputdir/dictionary.dic) [Option]
dictionary file

WORDLIST (outputdir/wordlist) [Option]
The wordlist generated by the generate target (see Section 6.3.1 [Resource Compilation Targets], page 14).

where outputdir (the default directory of the files) is the value of the variable OUTPUTDIR:

OUTPUTDIR (. = current directory) [Option]
the directory for the hunlex output resource files, which is, by default, set to the
current directory

As you can see, the default setting is that all input and output files are located in the
current directory under their recommended canonical names. Putting the output resources
in the same directory as the primary resources might not be a good idea if you want to
compile various types of output resources.

6.4.5 Resource Compilation Options


DOUBLE_FLAGS () [Option]
if set, hunlex uses double flags (two-character flags) in the output resources (see
Chapter 12 [Flags], page 42).

The following two options regulate the level of morphemes. You find more details about
levels in a separate chapter (see Chapter 10 [Levels], page 36).
MIN_LEVEL (1) [Option]
Morphemes of level below MIN_LEVEL are treated as lexical, i.e., are precompiled with
the appropriate stems into the dictionary file. By default, only morphemes of level 0
or below are precompiled into the dictionary.

MAX_LEVEL (10000) [Option]
Morphemes with levels higher than the value of MAX_LEVEL are, on the other hand,
treated as being on the same (non-lexical) level. By default, only morphemes of level
above 10000 are treated as having the same level.
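As a rough illustration of the two thresholds, the following Python sketch classifies a morpheme level against MIN_LEVEL and MAX_LEVEL using the documented defaults. The function name level_treatment and the three labels it returns are invented for this example; hunlex itself has no such function, and the chapter on levels remains the authoritative description.

```python
def level_treatment(level, min_level=1, max_level=10000):
    """Classify a morpheme level against MIN_LEVEL and MAX_LEVEL."""
    if level < min_level:
        return "lexical"    # precompiled with its stems into the dictionary file
    if level > max_level:
        return "collapsed"  # shares one non-lexical level with everything above MAX_LEVEL
    return "own level"      # kept as an affix rule on its own level


treatments = [level_treatment(level) for level in (0, 1, 10000, 10001)]
```

Raising MIN_LEVEL moves more morphemes into the "lexical" class, which is what the generate target relies on when it sets MIN_LEVEL to a big number.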

The options below regulate the format of output resources in detail:


TAG_DELIM () [Option]
determines the delimiter hunlex puts between the individual tags of affixes when tags are
merged.
This is interesting if you have a tagging scheme where each morpheme is tagged with a
label like MORPH1, but in the output you want the labels clearly delimited, like:
wordtoanalyze
>lemma_MORPH1_MORPH2
The above is possible if you set
TAG_DELIM=_

OUT_DELIM ( ) [Option]
OUT_DELIM_DIC (<TAB>) [Option]
set the delimiter to put between the fields of the affix and dictionary files,
respectively. By default it is set to a single space for the affix file and to <TAB>
in the dictionary.
NB: A tab might allow better postprocessing in the affix file and even
allow spaces in the tags, which might be useful.
At the time of writing the huntools reader only allowed a TAB, not a
space, as delimiter in the dictionary file, so change with caution.

MODE (Analyzer) [Option]
the major output mode regulates what information gets output in the affix and dictionary files and how affix entries are conflated.
Warning: This option is not effective at the moment due to the lack of a
clear functional specification, and it is also unclear how this option should
interact with the option STEMINFO (below).
Todo: Clarify this. See warning.
The possible values at the moment are:
Spellchecker
Stemmer
Analyzer
NoMode
all without effect (see warning).

STEMINFO (LemmaWithTag) [Option]
regulates what info the analyzer should output about a word.
This option can take the following values:
Tag: only output the tag of a lexical stem (to output the pos tag of the stem)
Lemma: only output the lemma of a lexical stem (for stemmers doing lexical indexing)
Stem: output the stem allomorph of the stem (e.g., for counting stem variant occurrences?)
LemmaWithTag: output lemma with the tag (default, for morphological analysis)
StemWithTag: output stem (allomorph) with the tag (?)
NoSteminfo: no output for the dictionary (for spell-checker resources).
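To make the six STEMINFO values concrete, here is a small Python sketch that picks the stem information for one dictionary entry. It is an illustration only: the function name stem_annotation is hypothetical, and the sketch assumes that lemma (or stem) and tag are simply concatenated, as in the go[VERB] annotation used elsewhere in this chapter; the real output format may differ.

```python
def stem_annotation(steminfo, lemma, stem, tag):
    """Pick the stem information to output for a given STEMINFO value."""
    choices = {
        "Tag": tag,
        "Lemma": lemma,
        "Stem": stem,
        "LemmaWithTag": lemma + tag,  # the documented default
        "StemWithTag": stem + tag,
        "NoSteminfo": "",             # nothing is output, e.g. for spell-checker resources
    }
    return choices[steminfo]


default_annotation = stem_annotation("LemmaWithTag", lemma="go", stem="went", tag="[VERB]")
```

For the suppletive stem went of the lemma go, the default setting yields the lemma with its tag, while StemWithTag would report the allomorph instead.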
FS_INFO () [Option]
regulates whether feature structure annotations (see Section 11.2 [Feature Structures],
page 41) should be output along with the normal (string type) tags (see Chapter 11
[Tags], page 40). This is extremely useful for debugging purposes. If the manually
supplied tag chunks are supposed to yield well-formed feature structures in the
output annotation of the analyzer, it is a good idea to check whether this is the
case. If this option is set to -fs_info (the corresponding command-line option),
the feature structures resulting from unification are output along with the tags in
the dictionary and the affix file. Typically, this option is used with the generate
target (see Section 6.3.1 [Resource Compilation Targets], page 14) and the second
and the third columns of the dictionary file are compared (they are supposed to be
identical).
Todo: This process should be added to the set of toplevel test targets.

The affix file specifies a lot of variables to be read by the morphbase routines. Some of these
are metadata but some are crucial for suggestions and accent replacement for automatic
error correction, see below.

Warning: This part is a disastrously underdeveloped part of hunlex and an
outrageously ad-hoc part of morphbase as well.
Preambles can be added to the hunlex output files; these are meant to be official comment
headers about copyright information, etc.
AFF_PREAMBLE () [Option]
DIC_PREAMBLE () [Option]
are files to be included as preambles in the affix and dictionary output resources,
respectively. By default, they are unset, i.e., no preambles will be included in the
output resources.
NB: This feature is only available through toplevel control and will never be an
integral part of the hunlex executable.

CHAR_CONVERSION_TABLE () [Option]
REPLACEMENT_TABLE () [Option]
These are the character-conversion table and replacement table to be included in
morphbase resources if alternatives (e.g., for spellchecking) or robust error correction
are required (see Section 14.1.1 [Huntools], page 46). These features are documented in
the huntools documentation (hopefully, but certainly not here, see Section 8.2 [Output
Resources], page 33, see Section 14.1.1 [Huntools], page 46).
NB: The inclusion of these extra files into the affix file is only
available through toplevel control and will never be an integral part of the
hunlex executable.

AFF_SET (ISO8859-2) [Option]
Identifies the character set for the analyzer reading the affix file. By default, this is
set to ISO8859-2, i.e., Eastern European. Maybe this is the hun in hunlex...

AFF_SETTINGS (confdir/affix_vars.conf) [Option]
is the file from which settings for some affix variables are read. If it doesn't exist, no
affix variables other than the ones directly managed are dumped into the affix file.
Todo: Need to sort these things out.


Some affix file variables are managed by hunlex internally but dumped to the affix file by
the toplevel routines.
Todo: This is done at the moment by the toplevel Makefile, but should be
integrated into the hunlex executable itself.
AFF_FORBIDDENWORD (!) [Option]
AFF_ONLYROOT (~) [Option]
These two flags will be attached to (i) bound stems and (ii) affix entries which cannot
be stripped first (i.e., suffixes which cannot end a word, see Chapter 12 [Flags],
page 42).

STEM_GIVEN () [Option]
If this flag is present, it indicates to the stemmer/analyzer whether the stem string is
to be output as part of the annotation. For instance (if the STEM_GIVEN flag is
x), the following dic file
go/ [VERB]
went/x go[VERB]
will result in the following stemming:
> go
go[VERB]
> went
go[VERB]
This makes for more compact dictionaries. What information one wants the stemmer
and analyzer to output can be configured through hunlex options (see below).
Todo: This flag is not implemented yet (since it is not implemented yet in
morphbase either), and it will probably never be implemented, since the treatment of special flags shouldn't be user-customizable beyond the choice of
flaggable characters.
Warning: Make sure the flags given here are consistent with the double flags option and the custom flags file (see the FLAGS variable above, and see Chapter 12
[Flags], page 42).
These options are superfluous and should be automatically managed by hunlex
which would write them into the affix file. Very likely to be deprecated soon.
Todo: This needs to be implemented.
Warning: There are additional settings, such as compound flags, that are to be included
in the affix file and are a crucial part of the resources (some of them should be set by
hunlex itself). I have no idea what to do with these at the moment. The ones
I know of are listed here just for the record.
Todo: This needs to be sorted out.
Some of these data are actually global and could even go to the settings preamble
(AFF_SETTINGS):
NAME
LANG
HOME
VERSION
These ones should be dynamic metadata
??
The ones below should clearly be controlled and output by hunlex itself (as should ONLYROOT
and FORBIDDENWORD, but these are handled by the toplevel, at least).
Ones relating to compounding (compounding is handled very differently by myspell,
morphbase and jmorph):
COMPOUNDMIN
COMPOUNDFLAG
COMPOUNDWORD?
COMPOUNDFORBIDFLAG
COMPOUNDSYLLABLE
SYLLABLENUM
COMPOUNDFIRST
COMPOUNDLAST
Warning: Compounding is as yet unsupported by hunlex and should be worked
on with high priority.
I have really no idea about the following ones:
TRY
ACCENT
CHECKNUM
WORDCHARS
HU_KOTOHANGZO


7 Description Language
This chapter is about the framework that allows you to describe the morphology and lexicon
of a language. Below we specify the syntax and semantics of this description language. The
files written in this language (the lexicon and grammar) are the primary resources of
hunlex (see Section 8.1 [Input Resources], page 30) and the basis for all compiled output
(how this works is described in another chapter, see Chapter 6 [Toplevel Control], page 12).
There are three kinds of statement in this language:
morph definition
macro definition
metadata definition
Only the grammar file can contain macro definitions (see Section 7.2 [Macros], page 28)
and metadata definitions (see Section 7.3 [Metadata], page 29), while both the lexicon and
the grammar file can contain morph definitions, which describe morphological units (affix
morphemes, lexemes and their paradigms). In this respect, the syntax of the lexicon and
grammar files is identical and is therefore discussed together rather than separately (see
Section 7.1 [Morphs], page 24), although the usefulness (and sometimes even the
semantics) of certain expressions might differ between the lexicon and the grammar.

7.1 Morphs
Morphs are the central entities in the description language. They stand for morphological
units of any size and abstractness including affix morphemes, lexemes, paradigms, etc. and
are not what linguists call morphs (i.e., a particular occurrence of one morpheme). Morphs
are meant to describe an affix morpheme or a lexeme, but in fact, it is up to you what
level of abstractness you find useful in your grammar, so you can have individual morphs
describing each allomorph of a morpheme or each stem variant of a lexeme. But the point
is that morphs support description of variants or allomorphs. Anyway, a morph is basically
a collection of rules, variants, etc. that somehow belong together. Ideally, a variant of an
affix morpheme is actually an affix allomorph, a concrete affixation rule, while a variant of
a lexeme is a stem variant or an exceptional form of the lexeme's paradigm.

7.1.1 Morph Preamble and Variants


(MORPH:) preamble, variant0, variant1, ... ; [statement]
morph-name block... [preamble]
block... [variant]
A morph statement is introduced by an optional MORPH: keyword. It is a good idea to drop
it and start the statement directly with the preamble (in fact, the name of the morph),
which is compulsory.
A morph description has a preamble, i.e., a header describing the global properties of the
morph, the properties which characterize all of its variants/allomorphs.
After the preamble, one finds the variants one after the other. The preamble and the
variants are delimited by a comma.
Finally, the morph definition like all other statements is closed by a semicolon.

The preamble starts with the name of the morph. The name of the morph can be any
arbitrary id, a mnemonic string that ideally uniquely identifies the morph. Referring to other
morphs is important in describing how morphemes can be combined: in order for these
references to be reliable, the names in the grammar are supposed to be unique. This is not
important in the lexicon, where homophonous lemmas can have identical names (however,
this is not recommended, since in such a case morphological synthesis, for instance, would
be unable to distinguish two senses, especially if they are of the same morphosyntactic
category).
The rest of the preamble as well as each individual variant is composed of blocks. Blocks
are the ingredients of the description, they specify information such as conditions of rule
application, output of a rule, the tag associated with the rule, etc.
In sum, then, morphs have the following structure:

(MORPH:) morph-name block ... (, block ... )* ; [statement]

Blocks are explained in detail in the next subsection.

7.1.2 Blocks
Blocks are the ingredients of the description, they specify information such as conditions of
rule application, output of a rule, the tag associated with the rule, etc.
Blocks all have a leading keyword followed by some expressions (arguments) and last till
the next keyword or the end of the variant:

KEYWORD argument... [block]
Blocks can come in any order within a variant and can be repeated any number of times.
So writing
KEYWORD: argument0 argument1 argument2 ...
has the same effect as when it is written like
KEYWORD: argument0 KEYWORD: argument1 KEYWORD: argument2 ...
or even
KEYWORD: argument0 SOME-OTHER-BLOCKS KEYWORD: argument1 SOME-OTHER-BLOCKS KEYWORD: arg
or when it is included with a macro (see Section 7.2 [Macros], page 28).
Certain blocks specify information in a cumulative way, so every time they are specified the
information is added to the info specified so far. For instance an IF block is cumulative,
all the arguments of all the IF blocks of a variant cumulate to give the conditions of rule
application, i.e., the rule applies only if all conditions on features are satisfied by the input
(see IF block below).
However, other blocks do not specify information that can be interpreted cumulatively, so it
does not make sense to have more than one argument with them or to specify them more than
once for a variant. (They may, however, still be specified in the preamble and overridden in
a variant, for instance.)
In every case, out of contradictory pieces of information, the one given last has the last word,
overriding previous ones.
So if you write
CLIP: 1 CLIP: 2
it is the same as

CLIP: 2
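The difference between cumulative blocks and blocks where the last one wins can be sketched in Python as follows. This is a toy model of the behaviour described above, not hunlex's parser: the function merge_blocks and the representation of blocks as (keyword, arguments) pairs are assumptions made for the example, and the set of cumulative keywords only lists the blocks that this manual explicitly calls cumulative.

```python
# Blocks that the manual explicitly describes as cumulative.
CUMULATIVE = {"IF", "KEEP", "USAGE", "FILTER"}


def merge_blocks(blocks):
    """Fold a sequence of (keyword, arguments) pairs into one setting per keyword."""
    merged = {}
    for keyword, arguments in blocks:
        if keyword in CUMULATIVE:
            # Cumulative information adds up across repeated blocks.
            merged.setdefault(keyword, []).extend(arguments)
        else:
            # Otherwise the block given last has the last word.
            merged[keyword] = list(arguments)
    return merged


variant = merge_blocks([
    ("CLIP", ["1"]),
    ("IF", ["feature0"]),
    ("CLIP", ["2"]),
    ("IF", ["feature1"]),
])
```

In this example the two IF blocks add up to a single condition list, whereas only the second CLIP block survives.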
In what follows, blocks are listed and explained one by one.

DEFAULT feature ... [block]
default morphs are used to assign features to inputs unspecified for some features. A
morph with a default block just adds extra rules that leave alone inputs which are
specified for any of the features to be defaulted. The variants of a morph having a
default block in their preamble will assume that none of the features to be defaulted
is present in the input.
So
morph DEFAULT: feature0 feature1 , MATCH: x OUT: feature0 ;
is equivalent to
morph , IF: !feature0 feature1 OUT: feature1 , IF: feature0 !feature1 OUT: feature0
, IF: feature0 feature1 OUT: feature0 feature1 , IF: !feature0 !feature1 MATCH: x
OUT: feature0 ;
Filters typically want to pass on their whole input by default.

VARIANT variant [block]
this block defines the actual affix or lexis.
The exact shape of variant determines what type of affix or lexis the variant describes:
+aff describes a suffix: when the rule applies, aff is appended to the end of the
input (after possibly clipping some characters)
aff+ describes a prefix: when the rule applies, aff is prepended to the beginning of
the input (after possibly clipping some characters)
pref+suff describes a circumfix: when the rule applies, pref is prepended to the
beginning of the input (after possibly clipping some characters) and suff is appended
to the end of the input (after possibly clipping some characters)
lexis defines a lexis. This is typically used in the lexicon and used as input to
the rules. If the VARIANT keyword is left out, it has to come as the first block
of the rule (after the comma closing the preamble or the preceding rule).
If a lexis is used in the grammar, it is meant to stand for a suppletive form.
Since it may well be a typo, a warning is given. We encourage the policy of putting
suppletive paradigmatic exceptions as variants of the lexeme in the lexicon file,
especially since matches are ineffective for lexis rules; conditions on the
suppletion should therefore be expressed with features, which is much safer anyway.

All the lexis and affix strings can contain any character except whitespace, comma,
semicolon, colon??, exclamation mark, slash, tilde, and plus sign: [^# \t \n ; , \r
! / + ~]
There should be a way to allow escapes.
Substitutions (which are a special kind of rule) are specified by REPLACE/WITH
blocks.

CLIP integer [block]
This block specifies the number of characters that need to be clipped from one end
of the input.
It has no effect if the variant is a lexis or substitution. So you don't use this block in
the lexicon.
If no CLIP block is given, no characters are clipped (the integer defaults to zero).
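How a VARIANT and a CLIP block work together can be sketched in Python. This is a simplified model, not hunlex code: the function name apply_affix is hypothetical, and the sketch assumes that clipping happens at the end where the affix attaches (the end of the input for suffixes, the beginning for prefixes). Circumfixes are left out because the text above does not say how the clipped characters are distributed between the two ends.

```python
def apply_affix(word, variant, clip=0):
    """Apply a '+aff' suffix or an 'aff+' prefix variant to word, clipping first."""
    if variant.startswith("+"):
        base = word[:len(word) - clip]  # clip from the end, then append the suffix
        return base + variant[1:]
    if variant.endswith("+"):
        base = word[clip:]              # clip from the beginning, then prepend the prefix
        return variant[:-1] + base
    raise ValueError("only plain suffix and prefix variants are handled in this sketch")


comparative = apply_affix("happy", "+ier", clip=1)
negated = apply_affix("do", "un+")
```

With CLIP set to 1 the final y of happy is removed before ier is attached, and with no CLIP block the prefix is attached to the unchanged input.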

REPLACE pattern [block]
WITH template [block]
These blocks specify a substitution.
pattern is a hunlex regular expression.
template is a replacement string which can contain special symbols \1, \2, etc.,
which reference the bracketed subpatterns in pattern.
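The idea of back-references in the template can be illustrated with Python's re module. Note that this is only a stand-in: hunlex regular expressions have their own syntax, which may differ from Python's, and the manual does not say whether hunlex rewrites only the first match, as this sketch assumes. The function name substitute is hypothetical.

```python
import re


def substitute(word, pattern, template):
    """Rewrite word by replacing the first match of pattern with template."""
    return re.sub(pattern, template, word, count=1)


# \1 in the template stands for whatever the first bracketed subpattern matched.
rewritten = substitute("city", r"(t)y$", r"\1ie")
swapped = substitute("abc", r"(a)(b)", r"\2\1")
```

In the first call the bracketed t is carried over into the result while the final y is replaced; the second call shows two subpatterns being reordered.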

MATCH pattern [block]
specifies a match condition on rule application. The rule only applies if the input
matches pattern, which is a hunlex regular expression. So you don't use this in the
lexicon.
The matched expression defines a match at the edge of the word: the beginning for
prefixes and the end for suffixes. You may include special symbols like ^ and $ to
make this more explicit.
MATCH blocks are non-cumulative, but circumfixes allow two matches (one beginning
with a ^ and one ending in a $).

IF condition ... [block]
IF blocks specify the conditions of rule application. Conditions are either positive
conditions (feature-name) or negative conditions (NOT feature-name).
The rule only applies if the input has the positive features specified in the IF blocks
and doesn't have the negative features specified in the IF blocks.
IF blocks are therefore cumulative and the conditions are understood conjunctively.
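The conjunctive reading of IF conditions amounts to a simple set test, sketched here in Python. The function rule_applies is hypothetical and models features as plain strings; it only illustrates the condition described above, not how hunlex evaluates rules internally.

```python
def rule_applies(input_features, positive, negative):
    """True if the input has every positive feature and none of the negative ones."""
    features = set(input_features)
    return set(positive) <= features and features.isdisjoint(negative)


# A rule with the conditions "feature0" and "NOT feature1":
accepted = rule_applies({"feature0", "feature2"}, ["feature0"], ["feature1"])
rejected = rule_applies({"feature0", "feature1"}, ["feature0"], ["feature1"])
```

Because the conditions are conjunctive, adding a further IF block can only make a rule apply to fewer inputs, never to more.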

OUT output ... [block]
specifies the output conditions of the variant (affix rule or lexis). An output can be a
feature or a morph.
Features can be restricted to particular morphs.

TAG tag-string [block]
specifies the output tag chunk associated with the variant.

USAGE usage-qualifier ... [block]
specifies usage qualifiers describing the variant.
Cumulative (conjunctive)

FILTER feature ...

[block]
declares that the morph in question is a filter, which defines fallback rules for lexical
features.
This means that the variants are meant to apply only if the input has none of the
filtered features.
It has no effect within individual variants or in the lexicon; it is only relevant in a morph
preamble in the grammar.
Cumulative (conjunctive on the rule conditions).

KEEP feature ...

[block]
Defines inheritance of features: whenever a particular variant applies to an input, a
feature mentioned in the KEEP block is an output feature of the result if and only if
the input has the feature.
If output features and keep features overlap, output features are meant to override
inheritance.
Features which are restricted by the input condition (IF block) are inherited normally,
but since they are known, they can also be mentioned in the OUT block for clarity.
NB: The thingies following KEEP in a KEEP block are features. They
cannot be macro names. Don't trick yourself by abbreviating a sequence of
phonofeatures with a macro and then referring to that in a KEEP block. Don't
forget that macros abbreviate (a series of) blocks, so clearly they can't
be nested within a KEEP block.
Cumulative.
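The interplay of KEEP and OUT features can be illustrated with sets. The sketch below models only positively specified features and treats "output features override inheritance" as "OUT features are always present in the result"; both are simplifying assumptions, and the function and feature names are hypothetical.

```python
def result_features(input_features, keep=(), out=()):
    """Features of the result after a variant has applied to an input.

    A KEEP feature is inherited if and only if the input carries it;
    OUT features are added regardless of what the input had.
    """
    inherited = set(input_features) & set(keep)
    return inherited | set(out)


# "back" is inherited, "round" is not (the input lacks it),
# "lowering" is dropped because it is not mentioned in KEEP.
features = result_features({"back", "lowering"}, keep={"back", "round"}, out={"plural"})
```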

FREE bool

[block]
specifies if the rule application gives a full form. For bound stems or non-closing
affixes, it has to be set to false.
By default, variants in the lexicon are NOT-free; variants in the grammar are free.
Todo: is this ok?

FS feature-structure

[block]
specifies the feature structure graph to be merged when the rule applies. feature-structure
is a KR-style feature structure description string.

PASS bool

[block]

7.2 Macros
DEFINE macro-name blocks

[expression]
defines a macro named macro-name. Any time macro-name is encountered after this
definition, it is understood as if it said blocks. blocks is a sequence
of any blocks, including (other) macro-names. A macro-name appearing anywhere other
than in its own definition has to be already defined.
Todo: What happens if a macro-name is also a declared morph-name or a declared feature?

REGEXP regexp-name regexp

[expression]
binds regexp-name to a hunlex regular expression, i.e., a regular expression that
can contain regular expression macro names in angle brackets. regexp-name can be
referenced within any regular expression later. An expression is resolved by replacing
the substring <regexp-name> with the resolved regexp.
This means that you have to escape your <-s and >-s if they do not delimit regexp
names.
As said, you can define regexp macros using other macros; the only requirement is that a
regexp-name is already defined at the point where it is used (its definition has to come
earlier in the file), so that it can be resolved at the time of reading the definition.
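Resolution at definition time can be sketched as follows. This is a minimal model, not the hunlex parser: the table argument, the function name, the permitted characters in macro names and the use of a backslash as the escape character are all assumptions made for the example.

```python
import re

# <name> references; a reference preceded by a backslash is left alone
# (the backslash escape syntax is an assumption).
_REFERENCE = re.compile(r"(?<!\\)<([A-Za-z_][A-Za-z0-9_-]*)>")


def define_regexp(table, name, regexp):
    """Bind `name` to `regexp`, resolving <macro> references immediately."""

    def resolve(match):
        referenced = match.group(1)
        if referenced not in table:
            # The definition has to come earlier in the file.
            raise KeyError("regexp macro <%s> used before its definition" % referenced)
        return table[referenced]

    table[name] = _REFERENCE.sub(resolve, regexp)
    return table[name]


table = {}
define_regexp(table, "V", "[aeiou]")
long_vowel = define_regexp(table, "VV", "<V><V>")
```

Because each definition is stored in resolved form, later references never need a second pass.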

7.3 Metadata

8 Files
Hunlex processes various files. Input as well as output files are described in
this chapter. The file names used in this chapter are just nicknames (which happen to be
the default filenames assumed) and can be changed at will by setting toplevel options (see
Section 6.4 [Options], page 15).

8.1 Input Resources


Hunlex considers several types of input files, which are discussed in turn.
The lexicon and the grammar are the two primary resources. These
files contain the description of the language's morphology: the rules for affixation, the
lexical entries, the morphological output annotation (tags), etc. (see Section 8.1.1
[Primary Resources], page 30).
Secondly, there are configuration files, which declare the morphemes and features that
are considered active by hunlex for a particular compilation. By choosing and adjusting
parameters of these features, one can manipulate under- and over-generation of the analyzer
(see Section 6.4.5 [Resource Compilation Options], page 19) and, most importantly, regulate
which affixes are merged together to yield the affix-cluster rules dumped into the affix file.
The way affixes are merged is crucial for the efficiency of real-time analyzers (see Chapter 10
[Levels], page 36). These files are also described below (see Section 8.1.2 [Configuration
Files], page 31).

8.1.1 Primary Resources


Primary resources are the files that you are supposed to develop, maintain, extend and that
describe your morphology (see Section 1.2 [Motivation], page 2). There are two primary
resources: the grammar and the lexicon. These files are described below.

8.1.1.1 Lexicon
The lexicon file (named lexicon by default; the name can be set through
options, see Section 6.4.3 [Input File Options], page 17) is the repository of lexical entries,
containing information about:

  * lemmas
  * stem-allomorphs (variants) belonging to the lemma's paradigm
  * suppletive forms expressing some paradigmatic slot of the lemma
  * morphological output annotation (tag) of the lemma (and the variants)
  * sense indices (arbitrary tags to distinguish identical lemmas)
  * the morphosyntactic and morphophonological features which characterize variants (or
    lemmas), and which determine their morphological combinations (i.e., which rules apply
    to them and how)
  * usage qualifiers of the variants (or lemmas), such as register, usage domain, normative
    status, formality, etc.
The syntax of the lexicon file is basically the same as that of the grammar, except that
it cannot contain macro definitions (see Section 7.2 [Macros], page 28). This syntax of
describing morphology is explained in detail in another chapter (see Chapter 7 [Description
Language], page 24).
For examples of lexicons, have a look at the zillion examples in the Examples directory that
comes with the distribution (see Section 4.6 [Installed Files], page 9).

8.1.1.2 Grammar
The grammar file is the other primary resource and is also absolutely necessary to describe
the morphology of your language. Its name is grammar by default but can be changed by
setting toplevel options (see Section 6.4.3 [Input File Options], page 17). The grammar file
specifies:

  * affix morphemes
  * affix-allomorphs (variants) belonging to the same morpheme
  * morphological output annotation (tag) of the affix morpheme (and its variants)
  * the morphosyntactic and morphophonological features which characterize variants (or
    morphemes) and which determine their morphological combinations (which rules apply
    to them and how)
  * usage qualifiers of the variants (or affixes), such as register, usage domain, normative
    status, formality, etc.
  * possibly special pseudo-affixes, so-called filters, which assign (default) features to
    variants based on their form (orthographic patterns) or other features
The syntax of the grammar file is the same as the one used for the lexicon except that the
grammar file can contain macro definitions (see Section 7.2 [Macros], page 28). The syntax
and semantics of this description language is explained in detail in another chapter, see
Chapter 7 [Description Language], page 24.
For examples of grammar files, have a look at the zillion examples in the Examples directory
that comes with the distribution (see Section 4.6 [Installed Files], page 9).

8.1.2 Configuration Files


Configuration files mediate between the primary resources describing a language and a particular resource created for a particular method, routine or application.
There are three configuration files which tell hunlex which units and features should be
included in the output resource from among the ones mentioned in the primary resources.
The units (morphemes, features) not declared in these configuration files are considered
ineffective by hunlex while reading the primary resources.
The format of these three configuration files is the same: each declares one unit per line (with
some parameters) and accepts comments that start with # and last till the end of the line.
They are discussed in turn below.

8.1.3 Morpheme Configuration File


The morph.conf file is one of the compilation configuration files that determine how hunlex
compiles its output resources (aff and dic, see Section 8.2 [Output Resources], page 33)
from the primary resources (lexicon and grammar, see Section 8.1.1 [Primary Resources],
page 30).

It declares the affix morphemes and the filters that are to be used from among the ones
that are in the grammar.
Warning: the affix morphemes not listed (or commented out) in this file are
ineffective for the compilation (as if they were not in the grammar).
Each line in this file contains the affix morpheme's name and optionally a second field,
which gives the level of the morpheme. If no level is given, the affix is assumed to be
of the maximum level (the value of the option MAX_LEVEL, see Section 6.4.5 [Resource
Compilation Options], page 19). Very briefly, levels regulate which affixes will be merged
with which other affixes to yield the affix clusters that are dumped as affix rules into the affix
file. The odds and ends of levels are described in detail in another chapter (see Chapter 10
[Levels], page 36).
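To make the file format concrete, here is a small sketch of how such a file could be read. It is not the hunlex reader: it assumes that fields are separated by whitespace, and the function name, the morpheme names and the max_level argument (standing in for the MAX_LEVEL option) are made up for the example.

```python
def parse_morph_conf(text, max_level):
    """Map each declared morpheme name to its level.

    Lines hold a name and an optional level; # starts a comment.
    Morphemes without an explicit level get `max_level`.
    """
    levels = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        levels[fields[0]] = int(fields[1]) if len(fields) > 1 else max_level
    return levels


sample = """# hypothetical morph.conf
PLUR 1
INESS 2   # inessive
DIMIN
"""
levels = parse_morph_conf(sample, max_level=3)
```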
For examples of the rather dull morph.conf files, browse the examples in the Examples
directory that comes with the distribution (see Section 4.6 [Installed Files], page 9).
If you have a grammar and you want to declare all the (undeclared) morphs defined in it
by including them in morph.conf, all you do is type

make DEBUG_LEVEL=1 new resources 2>&1 | grep '(morph skipped)' | cut -d' ' -f1 >> in/mo
in the directory where your local Makefile resides. This will append all the undeclared
morphs (one per line) to the morph.conf file. Note that the morphs so declared will be of the
maximum level (see above).

8.1.4 Feature Configuration File


The phono.conf file is one of the compilation configuration files that determine how hunlex
compiles the output resources (aff and dic, see Section 8.2 [Output Resources], page 33)
from the primary resources (lexicon and grammar, see Section 8.1.1 [Primary Resources],
page 30).
The phono.conf file simply lists all the features that we want to use from among
the ones used in the grammar and the lexicon. Very briefly, features are attributes of affixes
and lexical entries, the presence or absence of which can be a condition on applying an affix
rule.
Warning: Features used in the grammar but not mentioned (or commented
out) in the phono.conf file will be ignored (as if they were never there) for
the present compilation by hunlex when reading the primary resources.
Warning: Features mentioned in phono.conf but never used in the grammar
or the lexicon are allowed. Maybe they should generate a warning, but they don't.
This may cause a lot of trouble.
So, phono.conf simply declares the features, one on each line, and allows the usual comments
(with a #).
For examples of phono.conf files, browse the examples in the Examples directory that
comes with the distribution (see Section 4.6 [Installed Files], page 9).
If you have a grammar and you want to declare all the (undeclared) features referred to in
the grammar in conditions by including them in phono.conf, all you do is type

make DEBUG_LEVEL=1 new resources 2>&1 | grep '(feature skipped)' | cut -d' ' -f1 | sort

8.1.5 Usage Configuration

The usage.conf file is one of the compilation configuration files that determine how
hunlex compiles the output resources (aff and dic, see Section 8.2 [Output Resources],
page 33) from the primary resources (lexicon and grammar, see Section 8.1.1 [Primary
Resources], page 30).
usage.conf in particular determines which usage qualifiers are allowed for the input units
(lexical entries, affixes, filters and the variants thereof) that are included into the resource
to be compiled. Units having a usage qualifier that is not listed in this file are ignored for
the compilation (as if they were not there).
NB: Usage qualifiers are not first-class features. They cannot be negated or
used as conditions on rule application. They are simply used to categorize rules
(affixes and stems) in certain dimensions such as etymology, register, usage
domain, normative status, formality, etc.
In addition to declaring allowed usage qualifiers, this file has another function as well. Each
line containing a usage qualifier may contain a second field, which is a tag associated with
that usage qualifier. If this field is missing, the name of the usage qualifier is assumed
to be its tag. Usage qualifier tags can be output by the analyzer if they are compiled into
the resources by hunlex.
This can be configured with the output info option (see Section 6.4.5 [Resource Compilation
Options], page 19).
Warning: This option is not implemented yet.
Todo: I don't even know if this is fine like this.
The problem is that usage tags cannot really be just intermixed with the ordinary
morphological tags.
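The two roles of usage.conf described above (declaring allowed qualifiers and giving each a tag) can be sketched like this. The code only illustrates the description in this section; whitespace-separated fields, the function names and the qualifier names are assumptions.

```python
def parse_usage_conf(text):
    """Map each allowed usage qualifier to its tag (its own name by default)."""
    tags = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        tags[fields[0]] = fields[1] if len(fields) > 1 else fields[0]
    return tags


def unit_is_included(unit_qualifiers, tags):
    """A unit is compiled only if all of its usage qualifiers are declared."""
    return all(qualifier in tags for qualifier in unit_qualifiers)


usage = parse_usage_conf("formal FRM\nslang  # tag defaults to the name\n")
```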
Various dimensions of usage information can be made effective by introducing expressions
with arbitrary leading keywords (see Chapter 7 [Description Language], page 24). Redefining each of the wanted usage dimensions in the parsing_common.ml file will result in
making any one or more of them effective as usage qualifiers. The point is that you can
keep a lot of information in the same lexical database. When the keywords it contains are
hunlex-ineffective, the expressions they lead are simply ignored.
Caveat: At the moment, for these alternatives, you have to recompile hunlex, with the new keyword associations, see Chapter 7 [Description Language],
page 24.
Todo: This could be done online but has very low priority.
For examples of usage.conf files, browse the examples in the Examples directory that
comes with the distribution (see Section 4.6 [Installed Files], page 9).

8.2 Output Resources


The output of a hunlex resource compilation is an affix file and a dictionary file. In brief,
the affix file contains the description of the affix (cluster) rules of the language we analyze,
while the dictionary contains the stems the affix rules can apply to. They have more or
less the same role as the grammar and lexicon files, the primary resources of hunlex (see
Section 8.1.1 [Primary Resources], page 30). But the affix and dictionary files are resources
that are used by real-time word-analysis routines (such as morphbase, myspell or jmorph,
see Chapter 14 [Related Software and Resources], page 46). They share a common
format with minor idiosyncrasies, some of which are still changing.
Hunlex reads a transparent human-maintainable non-redundant morphological grammar
description with the lexicon of a language and creates affix and dictionary files tailored to
your needs (see Chapter 1 [Introduction], page 1). The ultimate purpose of hunlex is that
these output resource files could at last be considered a binary-like secondary (automatically
compiled) format, not a primary (maintained) lexical resource.
Therefore the technical specification of these output formats should only concern you here
if you want to compile affix and dictionary files for your own (or modified versions of our
own) word-analysis software which also reads the aff/dic files. In such a case, however, you
know that format better than I do. All I can say is that the parameters along which the
format can be manipulated are supposed to conform with the format of the software listed
in Section 14.1 [Software that can use the output of Hunlex as input], page 46. If you
develop some such stuff as well and would like your format to be supported, take a deep
breath and consider requesting a feature from the authors (see Section 3.3 [Requesting a
New Feature], page 5).
In sum, the format of these output resource files is not detailed here. Anyway, they are (probably) well documented elsewhere (e.g., the myspell manual page). See especially the documentation of huntools and the morphbase library (see Section 14.1.1 [Huntools], page 46).

9 Command-line Control
This chapter is a verbatim include of the hunlex manpage. Command-line control is not the
recommended interface to hunlex; use toplevel control instead (see Chapter 6 [Toplevel Control],
page 12).
Note: the manpage include has been removed so that make doc would run.

10 Levels
Levels index morphemes and are assigned to morphemes in the morph.conf file (see
Section 8.1.3 [Morpheme Configuration File], page 31).
Levels govern which affixes will be merged together into complex affixes (or affix clusters)
and will constitute an affix rule (linguistically correctly, an affix-cluster rule) in the output
affix file (see Section 8.2 [Output Resources], page 33). Affix rules in the affix file will
be stripped from the analyzed words by the analysis routines in one step (i.e., by one
rule-application).
Levels, then, regulate the output resources of hunlex and have no role to play in how you
design your grammars. There are no levels in the hunlex grammar and lexicon, the files
which describe the morphology of the language (see Section 8.1.1 [Primary Resources],
page 30). Levels make sense only in relation to the compilation process.
This chapter describes why you would want levels, how you manipulate them and what
consequences they have on analysis.

10.1 Levels and Affix Rules


Imagine a word that has several affixes, like dalokban (= dal 'song' + ok plural + ban inessive).
Assume that your hunlex grammar correctly describes the plural and inessive morphemes
and their combination rules. If you assign these morphemes to different levels, the output
resource will contain affix rules expressing the morphemes separately. This means that
these affixes are not stripped in one go by the analysis routines using the affix file as their
resource.
Some affixes, however, may need to be stripped as a cluster in one go, because some analysis
algorithms do not allow an arbitrary number of consecutive affix-stripping operations or because
stripping them in one go is just more optimal for your purposes (see Section 10.5 [Levels and
Optimizing Performance], page 38). Therefore the separate affix rules in the input grammar
should be merged when they are dumped by hunlex as rules into the affix file. Levels
regulate which morphemes should be merged with which other morphemes. (To be more
precise, they regulate which affix rules expressing which morphemes should be merged with
which other affix rules expressing which other morphemes.)
Since merged affix rules are highly redundant and tedious to maintain, one of the main
purposes of hunlex is actually to allow for high flexibility in your choice of merging affixes
to create resources optimized for your needs, while at the same time allowing for a transparent and non-redundant description for easy maintenance and scalability (see Chapter 1
[Introduction], page 1).

10.2 Levels and Stems


Levels do not only regulate which affixes are compiled into one affix cluster (an affix rule in
the output affix file, see Section 10.1 [Levels and Affix Rules], page 36). They also determine which stems are precompiled into the dictionary (see Section 8.2 [Output Resources],
page 33). In particular, all affixes below a so called minimal lexical level (see Section 10.3
[Levels and Ordering], page 37) are precompiled with the stems of the lexicon into the
output dictionary.

For instance, taking the example of the previous section, if both the plural and the inessive
morpheme are below the minimal level (of on-line-ness), the whole morphologically complex
word dalokban will be included in the dictionary file. To learn why you would want to do
such a thing, see also Section 10.4 [Manipulating Levels with Options], page 37.

10.3 Levels and Ordering


The word 'level' is actually rather misleading, since the notion of level we have here has
only a very restricted sense of ordering. There is no sense in which (rules expressing) a
morpheme of level i cannot be applied after (rules expressing) another morpheme of level
j where i > j.
There is a sense in which levels do have ordering, however. There is always a minimal
level (that is the value of the MIN_LEVEL option, see Section 6.4.5 [Resource Compilation
Options], page 19) below which all morphemes are compiled into the dictionary (i.e., they
are merged with the absolute stems in the lexicon and dumped as stems into the dictionary).
The default lexical level is 1, meaning that (affix rules expressing) morphemes of level 0
or less are merged with the appropriate stems and the resulting (morphologically complex)
words will be entries in the dictionary file (see Section 10.2 [Levels and Stems], page 36).
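The role of the minimal level can be sketched as a simple partition of the declared morphemes. The code below is an illustration only: it follows the strict reading given above (with the default of 1, levels 0 or less are lexical), and the function name, morpheme names and level values are hypothetical.

```python
def split_by_min_level(levels, min_level=1):
    """Split morphemes into those precompiled into the dictionary and the rest.

    levels: mapping from morpheme name to level (as declared in morph.conf)
    Returns (lexical, online): morphemes below `min_level` are merged with
    stems; all others remain affix rules in the affix file.
    """
    lexical = sorted(m for m, level in levels.items() if level < min_level)
    online = sorted(m for m, level in levels.items() if level >= min_level)
    return lexical, online


levels = {"DIMIN": -1, "PLUR": 0, "INESS": 1}
lexical, online = split_by_min_level(levels)
# A very large minimal level makes every morpheme lexical.
everything, nothing = split_by_min_level(levels, min_level=100000)
```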
Since the dictionary file entries are the practical stems of the analysis routines, configuring
the level of morphemes gives you the option to adjust the depth of stemming. For instance,
if you do not want your stemmer to analyze some derivational affix (which you
otherwise productively describe with a rule in the grammar), all you have to do is to assign
a lexical level to this morpheme in morph.conf. Recompiling with this configuration will
result in resources with the precompiled entries in the dictionary file.
See also the MAX_LEVEL option (see Section 6.4.5 [Resource Compilation Options], page 19).

10.4 Manipulating Levels with Options


10.4.1 Levels and Generation
You don't always have to fiddle manually with assigning alternative levels to each morpheme.
For some of the special cases, hunlex provides an option. It is very common that you
want to generate all the words your grammar accepts. All you have to do is to set the
minimal level to a very large value that is higher than any of the levels you have assigned
to morphemes in the morph.conf file (see Section 8.1.3 [Morpheme Configuration File],
page 31). This is done with the MIN_LEVEL option (see Section 6.4.5 [Resource Compilation
Options], page 19). This tells hunlex that all the rules expressing all the morphemes
are to be compiled into the dictionary, which results in deriving all the words of the language.
This option is also provided as the generate toplevel target (see Section 6.3 [Targets],
page 13); in fact,
make generate
is just a shorthand for
make MIN_LEVEL=100000 new resources

10.4.2 Levels and No Clusters


In order to create an output in which no two affix rules are merged, it is enough to assign
every morpheme to a different level, for instance by using the following unix shell commands:

$ cp morph.conf morph.conf.orig
$ cut -d' ' -f1 morph.conf.orig | nl -nln -s' ' | sed 's/\(.*\) \(.*\)$/\2 \1/g' > morp
Todo: I should provide an option that does this.
With a routine that supports any number of affix stripping operations, such a resource
will allow correct analysis, but not with the ones that allow only a limited number of rule
applications.
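The renumbering that gives every morpheme its own level can also be sketched in Python. This is not part of hunlex; the function name is made up, whitespace-separated fields are assumed, and, unlike a plain cut/nl pipeline, the sketch drops comments and empty lines.

```python
def distinct_levels(morph_conf_text):
    """Return morph.conf content in which every morpheme has its own level.

    Existing levels are discarded; morphemes are numbered 1, 2, 3, ...
    in the order in which they are declared.
    """
    names = []
    for line in morph_conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if line:
            names.append(line.split()[0])
    return "".join("%s %d\n" % (name, level) for level, name in enumerate(names, start=1))


renumbered = distinct_levels("PLUR 1\nINESS 1\n# a comment\nDIMIN\n")
```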
Todo: write on recursion
Geeky note: If rule application monotonically increases the size of the input,
potential recursion is never unbounded, since all analysis routines have
a fixed buffer size anyway. If, however, empty strings or clippings make
rule application non-monotonic in size, potential recursion may cause actual
infinite loops in some incautious implementations. Boundedness of recursion
due to buffer size restrictions is only one sense in which the full intended (implied) generative power of an arbitrary hunlex grammar is not reflected in the
analyzer's actual analysis potential.

10.4.3 Levels and Steps of Affix Stripping


For myspell style resources where you want only one stage of affix stripping, you should use
one lexical and one non-lexical level. Without having to create your alternative morph.conf
file, this can easily be done with the combination of the MIN_LEVEL and the MAX_LEVEL
options (see Section 6.4 [Options], page 15).
You just set these two options to the same value l, and all morphemes with level equal to
or smaller than l will be compiled into the dictionary, and all the other morphemes (i.e.,
affix morphemes with level greater than l) will be merged into clusters (and these affixes
will be dumped to the affix file as rules). Implementations like myspell (see Section 14.1.2
[Myspell], page 46) can only run correctly with such resources given that they allow only
one step of suffix stripping.
Todo: This is slightly more complicated because prefixes and suffixes are stripped
separately. We should clarify this. And this whole myspell business is actually
not tested.
Myspell supports only one stage of affix stripping, the morphbase routines support two and
jmorph supports any number (truly recursive).
With an affix file in which the plural and the inessive of the earlier example (see Section 10.1
[Levels and Affix Rules], page 36) are separate affix rules, the analyzer would
have to do two suffix stripping operations to recognize the word dalokban. Therefore, using
such a resource, myspell will not recognize this word at all. The morphbase routines will
be able to analyze it, since they allow two stages of suffix stripping, which is just enough, and
so will jmorph, since it allows any number of suffix stripping steps.
So, when you configure which affixes are merged, make sure you have considered the generative capacity of the target analysis routine (how many suffix strippings it can make).
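The check suggested above can be written down as a small table lookup. The step limits are the ones stated in this section (myspell one, morphbase two, jmorph any number); the function and table names are hypothetical, and prefixes are ignored for simplicity.

```python
import math

# Number of consecutive suffix stripping steps each routine supports,
# as stated in this section.
MAX_STRIPPING_STEPS = {"myspell": 1, "morphbase": 2, "jmorph": math.inf}


def can_analyze(analyzer, rules_to_strip):
    """True if `analyzer` can strip `rules_to_strip` affix rules from one word."""
    return rules_to_strip <= MAX_STRIPPING_STEPS[analyzer]


# dalokban with plural and inessive as two separate affix rules:
separate = {name: can_analyze(name, 2) for name in MAX_STRIPPING_STEPS}
# the same word with both affixes merged into one cluster rule:
merged = {name: can_analyze(name, 1) for name in MAX_STRIPPING_STEPS}
```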
Todo: precompile into the dictionary?

10.5 Levels and Optimizing Performance


Which affix rules you want to merge and precompile into clusters is entirely up to you and
usually a question of optimization. If you choose not to precompile anything, then your
affix file will be small, but your analysis may not be optimal for runtime (if it generates the
correct analyses at all, see Section 10.4.3 [Levels and Steps of Affix Stripping], page 38).
If, on the other hand, you precompile all affixes into affix clusters, you might end up with
hundreds of megabytes of affix file, which is going to strain your memory at analysis time (though it may be faster for the analysis algorithm than recursive calls). This last
realization led the author of hunspell (see Section 14.1.1 [Huntools], page 46) to introduce
a second step of suffix stripping into the algorithm, which had inherited its one level of
affix stripping from the original myspell code.
Finally, compiling everything in the lexicon is not a very good idea for complex morphologies
and big lexicons, although it may be indispensable for testing on smaller fragments of
lexicons/grammars or for creating wordlists (see Section 6.3.3 [Test Targets], page 14).
Some special affix rules should always be precompiled into the dictionary and not output
as affix rules. These rules are the ones that cannot be interpreted as affix rules at all, for
instance, rules of substitution or suppletion. These rules are beyond the descriptive capacity
of affix files. Therefore all substitutions and suppletions are precompiled (merged with the
rules or stems they can be applied to) irrespective of their level.
Todo: find out more about this.

11 Tags
We call the information that a morphological analyzer is expected to output for an analyzed
word a piece of morphological annotation. In more general terms, however, when we talk
about any kind of word-analysis routine such as a spell-checker or a stemmer, we call the output
information these routines associate with words tags. We want to emphasize here that this
piece of output information tags the whole word that is analyzed. The tag is used to annotate
words in a corpus by decorating a raw text with useful extra information.
NB: Tagging as we use it in no way constitutes a constituent structure, segmentation, etc. of the input word form.
This document describes the ways in which you can associate tags with your morphemes
(or individual stem variants and affix rules). These tags should be thought of as
ingredients of the output tag that an analyzed word containing that morpheme would get.
Certainly, not all analysis software can or is supposed to output any useful information
about the morphological makeup of the word. For instance, a spell-checker is typically
required only to recognize whether a word is correct (usually in a strict normative sense),
but a morphological analyzer or a stemmer is supposed to output some information. Since
the huntools routines are able to perform full morphological analysis, not just recognition
(REFERENCE), adding morphological tags to your rules is worth your while. Nevertheless,
if you never ever want to be able to output any useful info (because you only care about
spellchecking), you don't really need to read on.

11.1 Merging Tags


The output tag associated with a successful analysis of a word is defined, extremely primitively,
as the concatenation of the tags assigned to the rules and the stem which constitute
the parse of the word.
NB: It is not clear whether the order of prefixes, stems and suffixes should matter
in some cases. In the usual case we assume that what the analyzer will do is
concatenate the tags in the order of affix-rule stripping.
Todo: What the analyzers do with the tags should be clarified. In fact, both
huntools and jmorph do something that is smarter for particular purposes but
not reasonably generalizable or even incorrect for the general case.
For this to work, you can assign tags (chunks of output annotation) to any affix variant
and stem variant in your grammar and lexicon. This is done with TAG expressions (see
Chapter 7 [Description Language], page 24, TAG keyword).
As hunlex merges affixes, it merges their tags accordingly as expected. There are a number
of formatting options with which you can influence the way you put together tags. One is
the TAG_DELIM option (see Section 6.4.5 [Resource Compilation Options], page 19), which
sets the delimiter between any two tags. If multiple tags are given by TAG expressions,
they are also concatenated with this delimiter in the order of their appearance within the
rule-block.
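The concatenation described above amounts to joining tag chunks with the delimiter. The sketch below only illustrates that; the function name, the tag strings and the choice of "/" for the delimiter are made up, and the default value of TAG_DELIM is not assumed.

```python
def merge_tags(tag_chunks, delim):
    """Concatenate tag chunks in the given order, separated by `delim`.

    tag_chunks: tags in their order of appearance (within a rule block,
    or across the affixes that are merged into one cluster)
    delim: stands in for the value of the TAG_DELIM option
    """
    return delim.join(tag_chunks)


cluster_tag = merge_tags(["PLUR", "INESS"], "/")
```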
Depending on the main output mode of hunlex, various pieces of information can be chosen
to be considered as tags to output. This is important if you want to configure your resources
so that they give you a stemmer, a tagger or an analyzer. Various other options are
available as well, see Section 6.4.5 [Resource Compilation Options], page 19.

11.2 Feature Structures


Hunlex also supports feature structures as a kind of annotation scheme. This is extremely
useful to cross-check the correctness of your tags. Tags can be quite messy and, since they
are pieces of strings, they are difficult to check.
Feature structures are structured objects which are checked against a signature (given in
the signature file, which is the value of the option SIGNATURE, see Section 6.4.3 [Input
File Options], page 17) and are merged with graph-unification. As annotations to give
complex morphological information, they are more expressive and adequate than pieces of
tags that are concatenated. Also, feature structures, unlike just arbitrary strings in the tags,
are interpretable data structures which one can directly calculate with, say, in a syntactic
analyzer using the output of the morphological analyzer.
That said, it has to be added that the analyzers themselves do not support these feature
structures. This means that they still manipulate pieces of feature-structure descriptions
as strings and glue them together. If you use the feature structures within your hunlex description, however, you can be certain that, even if the analyzer just concatenates them, the
resulting analyses describe valid FS-s according to your signature (see also the SIGNATURE
option under Section 6.4.3 [Input File Options], page 17).
Todo: Include a proper description of the extended KR framework of FS-s.
Todo: No support for derivations is implemented yet (it is on the way).

12 Flags
Flags are used in the output resources (see Section 8.2 [Output Resources], page 33) to
index affix rules. Each entry in the dictionary file has a set of flags indicating which affix
rules can be applied to it.
So, flags are given by hunlex and written in the affix and dictionary files. There is no
such thing as a flag in the hunlex input grammar or lexicon, the files which describe your
morphology.
You can specify some aspects of what flags hunlex will assign to affix classes and how. This
is what the present chapter is about.

12.1 Two Forms of Flags


Flags can be one or two characters long.
Myspell (and legacy xspell implementations) can only handle single-character flags.
Single-character flags are sufficient in the general case and are the default. If you are
dealing with a language of sensible complexity, the default is fine and you don't need to
read this chapter any further.
Double flags are composed of a digit as the first character and a flaggable character (see
Section 12.2 [Flaggable Characters], page 42) as the second, such as 3f or 9t. In order
to use double flags, use the DOUBLE_FLAGS option (see Section 6.4.5 [Resource Compilation
Options], page 19). Read on to learn why you would use double flags (see Section 12.3
[Limit on the Number of Flags], page 43).

12.2 Flaggable Characters


Flaggable characters are characters that hunlex can use as flags (in the case of
single-character flags) or as the second character of a double flag (see Section 12.1
[Two Forms of Flags], page 42). All non-whitespace characters are in principle flaggable.
By default, the flaggable characters are the following 132 characters, which are
hard-wired in hunlex. I am not in a position to list them here, because... (I bet my
hundred forints that some of the characters below are displayed completely differently for
you than for me, in any format and on any display, ranging from your terminal through your
browser to acroread.) They are, however, included in the original texinfo version of this
document as a comment (see file doc/texinfo/flags.texinfo or, to be sure, the source code,
src/hunlex_wrapper.ml).
Flaggable characters can, however, be customized through hunlex's FLAGS option (see
Section 6.4.3 [Input File Options], page 17). This option takes a filename. The content of
the file is the sequence of characters to be used for flags, without any delimiters.
Warning: Make sure you do not include any whitespace in this file (other than
a trailing newline) and do not include any character twice. Since the characters
are not checked for sanity, doing otherwise may result in ill-formed affix files
or conflated affix classes. If you use double flags, do not include digits as
flaggables.
Todo: Why don't we bloody check this? Checking of flaggable characters should
be added in a future version.
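Until hunlex performs this check itself, you can run one yourself before compiling. The following Python sketch is not part of hunlex; the function names are invented for the example, and the file encoding is a parameter because this manual does not state which encoding the flags file is expected to use (the ISO-8859-2 default below is only a guess based on the AFF_SET default).

```python
DIGITS = "0123456789"


def read_flaggables(path, encoding="iso-8859-2"):
    """Read a custom flags file, dropping one trailing newline if present.

    The default encoding is an assumption, not documented hunlex behaviour.
    """
    with open(path, encoding=encoding) as handle:
        text = handle.read()
    return text[:-1] if text.endswith("\n") else text


def check_flaggables(chars, double_flags=False):
    """Return a list of human-readable problems found in the flaggables."""
    problems = []
    seen = set()
    for position, ch in enumerate(chars, start=1):
        if ch.isspace():
            problems.append(f"position {position}: whitespace {ch!r}")
        elif ch in seen:
            problems.append(f"position {position}: duplicate {ch!r}")
        elif double_flags and ch in DIGITS:
            problems.append(f"position {position}: digit {ch!r} with double flags")
        seen.add(ch)
    return problems
```

An empty result means the file passes the three checks named in the warning above: no whitespace, no repeated character and, with double flags, no digits.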

The association of flags with affix classes takes flaggable characters from left to right.
This means that if the output requires 35 flags, the first 35 flaggable characters will be
used. This is, however, all that can be said: which actual flag goes with which affix
class cannot be specified any further.
Warning: This last sentence is a warning in itself. For people who are used
to fiddling with manually created affix files (which is the case for almost all
ispell resources), it has to be stressed: hunlex-generated affix files are not
meant to be read by humans and should be considered binary. Associations of
flags with particular affix rules/classes are not permanent across
configurations/resource compilations. If you want to post-process affix files,
never assume that particular flags are meaningful. This is rather obvious once
you realize that the affix rules/classes themselves are not consistent across
different parametrizations either (see e.g., levels). This policy is called
dynamic flagging.
An exception to dynamic flagging might be the special flags, which are
fairly consistent since their expression can be customized to particular strings
(see Section 6.4.5 [Resource Compilation Options], page 19, Affix file variables).
But this feature will soon cease to exist, so just wipe your tears off your face,
be happy that you have a hunlex resource and forget your old flags.
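To picture why flags are not stable across compilations, consider this small Python sketch of left-to-right flag assignment. It only mimics the policy described above: the affix class names, the order in which classes are enumerated and the use of `ValueError` (standing in for hunlex's Not_enough_flags exception) are assumptions of the example, and special flags are ignored.

```python
def assign_flags(affix_classes, flaggables):
    """Give the i-th affix class the i-th flaggable character."""
    if len(affix_classes) > len(flaggables):
        raise ValueError("not enough flags")
    return dict(zip(affix_classes, flaggables))


# Two hypothetical compilations of the same grammar: the second
# configuration happens to produce one affix class fewer.
full = assign_flags(["PLUR", "ACC", "DAT"], "ABCDEF")
reduced = assign_flags(["ACC", "DAT"], "ABCDEF")
```

Here the class called ACC receives the flag B in the first compilation and A in the second, which is exactly why post-processing scripts should never rely on a particular flag.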

12.3 Limit on the Number of Flags


If you use single-character flags (see Section 12.1 [Two Forms of Flags], page 42), the number
of flags equals the number of flaggable characters, i.e., the length of the custom flag file (see
Section 6.4.5 [Resource Compilation Options], page 19, flags.conf) or 132, by default
(see Section 12.2 [Flaggable Characters], page 42). This is also the maximum number of
affix classes you can have in your output resources.
(Since some flaggable characters are chosen to express the special flags (see Section 12.4
[Special Flags], page 44), the number of possible affix classes is actually the number of
flaggables minus the number of special flags needed.)
Sometimes this is not enough for languages with hugely complex and lexically idiosyncratic
morphology, and one has to use double flags (see Section 12.1 [Two Forms of Flags], page 42).
You can tell that you really need them if hunlex resource compilation stops, complaining
that there are not enough flags (exception Not_enough_flags).
You tell hunlex to use double flags with the appropriately named DOUBLE_FLAGS option (see
Section 6.4.5 [Resource Compilation Options], page 19). With double flags you can have ten
times as many affix classes as flaggable characters, i.e., 1320 (from 0a to 9z or whatever)
with the default flaggables (see Section 12.2 [Flaggable Characters], page 42).
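The arithmetic behind these limits can be written down in a few lines of Python. This is only an illustration of the counts given in this section, not hunlex code: the function names are invented, the digit-major ordering of double flags is a guess suggested by "from 0a to 9z", and subtracting the special flags in the double-flag case is an assumption carried over from the single-flag case.

```python
DIGITS = "0123456789"


def flag_inventory(flaggables, double=False):
    """List the flags that can be built from the flaggable characters."""
    if not double:
        return list(flaggables)
    # A double flag is a digit followed by a flaggable character.
    return [digit + ch for digit in DIGITS for ch in flaggables]


def max_affix_classes(n_flaggables, n_special=0, double=False):
    """Number of flags left for affix classes once special flags are taken."""
    total = n_flaggables * (len(DIGITS) if double else 1)
    return total - n_special
```

With the 132 default flaggables and no special flags, `max_affix_classes(132)` gives 132 and `max_affix_classes(132, double=True)` gives 1320, matching the figures above.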
Warning: Double flags are only understood by the morphbase implementations
(see Section 14.1.1 [Huntools], page 46), but not by ispell, myspell and (yet)
jmorph.
This is a reason why one might use the huntools package. Please tell us if this
is the sole reason you are using huntools.
Caveat: The use of double flags in the morphbase routines is a compile-time
option at the moment.
Caveat: When customizing your flaggables (see Section 12.2 [Flaggable
Characters], page 42) and using double flags, you can have up to about two
thousand affix classes. If this is not enough for you (you get the exception
Not_enough_flags), you are likely to have a problem in your grammar (see
Chapter 13 [Troubleshooting], page 45). If you are sure it is not a grammar
problem, you had better choose another language. At any rate, please notify us
(see Section 3.8 [Contact], page 6) about this extraordinary case and we
might extend the support for flags, even in morphbase, on one of our free
afternoons (see Section 3.3 [Requesting a New Feature], page 5).

12.4 Special Flags


There are special flags in the affix file (the full documentation of special flags is
(hopefully) found in the morphbase and jmorph docs). Special flags are special because they
do not index affixes but encode other sorts of information needed by the analysis routines.
A number of these flags can be configured through the options (see Section 6.4 [Options],
page 15). These are:
ROOT_ONLY_FLAG
STEM_GIVEN
Todo: Include the remaining special flags into the implementation and uncomment them in
the texinfo document.
WARNING, CAVEAT: if you use two-character flags (the DOUBLE_FLAGS option), you have to
make sure that the special flags are also two characters long, otherwise you will get
ill-formed affix files. If you set flags through the toplevel Makefile's variables, make
sure your flags are quoted (otherwise Makefile will resolve the flag ~ to, say, /home/tron
and you won't understand what went wrong...)
TODO: Special flags should NOT be user-configurable at all. They should be assigned the
first possible flags.


13 Troubleshooting
13.1 Installation Problems
If hunlex won't install, check Section 4.3 [Prerequisites], page 7 carefully, with special
attention to the versions.
There are some hints hidden among the lines of Section 4.4 [Install], page 7 which you may
have missed.

13.2 Problems running hunlex


If you upgraded from an earlier version, make sure you uninstall the earlier version first
(see Section 4.5 [Uninstall and Reinstall], page 8).
If you use hunlex through the toplevel control with a Makefile (see Chapter 6 [Toplevel
Control], page 12), the hunlex executable is by default assumed to be found in the path
under the name hunlex.
By default, installation puts hunlex into /usr/local/bin (see Section 4.6 [Installed
Files], page 9) unless you set another install prefix.
Find out whether the hunlex executable is found in the path by typing
$ which hunlex
If it is not found, check where you installed it by looking into the file install_prefix
in the toplevel directory of your source distribution. If this file is not there, your
installation was not successful.
Once you have found out your install prefix, see if install-prefix/bin/hunlex exists. If it
does, you can do one of the following:
add install-prefix/bin to your path with something like:
PATH=install-prefix/bin:${PATH}
or
tell the toplevel where to find your hunlex. You can do this by setting the HUNLEX
toplevel option (see Section 6.4 [Options], page 15).
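If you prefer to script this check, the lookup described above can be sketched in Python as follows. The function is hypothetical and not shipped with hunlex; it takes the install prefix as an argument instead of parsing the install_prefix file, since the layout of that file is not described here.

```python
import os
import shutil


def locate_hunlex(install_prefix=None, name="hunlex"):
    """Return the path of the executable, or None if it cannot be found.

    Looks in the search path first (like `which`), then falls back to
    <install_prefix>/bin/<name> when a prefix is given.
    """
    found = shutil.which(name)
    if found is not None:
        return found
    if install_prefix is not None:
        candidate = os.path.join(install_prefix, "bin", name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    # /usr/local is the default prefix mentioned above.
    print(locate_hunlex("/usr/local"))
```

If the function only finds the executable through the prefix fallback, that is the case where you need to extend your path or set the HUNLEX toplevel option.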

13.3 Resource Compilation Problems


13.4 Grammar Problems
If your grammar seems to overgenerate, the first thing to check is whether you declared the
features your grammar relies on in the phono.conf file.
You may have misspelled some phonofeature; this can be traced by peeping into the debug
messages. Ideally you do this by redirecting the output into a log file (with the debug
level set sufficiently high) and searching the file for the term skipped. This is the
warning hunlex gives to let you know that an entity has been skipped.
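Searching the log can be done with any text tool; as one possibility, here is a small Python sketch that lists the lines containing the term skipped together with their line numbers. The sample log lines are invented for the example and do not reproduce hunlex's actual message format.

```python
def find_skipped(lines):
    """Return (line number, text) pairs for lines mentioning 'skipped'."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if "skipped" in line
    ]


# Invented sample; with a real log you would pass an open file instead,
# e.g. `with open(log_path) as handle: find_skipped(handle)`.
sample_log = [
    "entry 1 compiled\n",
    "unknown feature: entity skipped\n",
    "done\n",
]
hits = find_skipped(sample_log)
```

Each hit points at a place where a misspelled or undeclared feature may have caused an entity to be dropped.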


14 Related Software and Resources


14.1 Software that can use the output of Hunlex as input
14.1.1 Huntools
14.1.2 Myspell
14.1.3 Jmorph
14.1.4 Ispell

14.2 Available resources


14.2.1 The Hungarian Morphdb Project
The HunLex framework is being used in the development of an open-source morphological
database (lexicon and grammar) for the Hungarian language in a collaboration between the
Hungarian Academy of Sciences, Research Institute for Linguistics and the Budapest
Institute of Technology, Media Education and Research Center Natural Language Processing
Lab. This database aspires to be the most complete and accurate account of Hungarian
morphology published so far, and is the result of merging several well-respected electronic
resources (see http://lab.mokk.bme.hu).

14.2.2 The English Morphdb Project

14.3 Hunlex's relatives


14.3.1 XFST, TWOLC, LEXC
For xfst, twolc, lexc, see
http://www.xrce.xerox.com/competencies/content-analysis/fst/home.en.html
or
http://www.stanford.edu/~laurik/fsmbook/home.html

Variables and Options Index

A
AFF (outputdir/affix.aff) . . . 18
AFF_FORBIDDENWORD (!) . . . 21
AFF_ONLYROOT (~) . . . 21
AFF_PREAMBLE () . . . 21
AFF_SET (ISO8859-2) . . . 21
AFF_SETTINGS (confdir/affix_vars.conf) . . . 21

C
CHAR_CONVERSION_TABLE () . . . 21
CONFDIR (inputdir) . . . 18

D
DEBUG_LEVEL . . . 12
DEBUG_LEVEL (0) . . . 16
DIC (outputdir/dictionary.dic) . . . 18
DIC_PREAMBLE () . . . 21
DOUBLE_FLAGS () . . . 19

F
FLAGS () . . . 18
FS_INFO () . . . 20

G
GRAMMAR (grammardir/grammar) . . . 17
GRAMMARDIR (inputdir) . . . 17

H
HUNLEX (hunlex) . . . 15
HUNMORPH (hunmorph) . . . 16

I
INPUTDIR (. = current directory) . . . 18

L
LEXICON (grammardir/lexicon) . . . 17

M
MAX_LEVEL (10000) . . . 19
MIN_LEVEL (1) . . . 19
MODE (Analyzer) . . . 20
MORPH (confdir/morph.conf) . . . 17

O
OUT_DELIM ( ) . . . 19
OUT_DELIM_DIC (<TAB>) . . . 19
OUTPUTDIR (. = current directory) . . . 19

P
PHONO (confdir/phono.conf) . . . 17

Q
QUIET . . . 12
QUIET (@ = quiet) . . . 16

R
REPLACEMENT_TABLE () . . . 21

S
SIGNATURE () . . . 17
STEM_GIVEN () . . . 21
STEMINFO (LemmaWithTag) . . . 20

T
TAG_DELIM () . . . 19
TEST (/dev/stdin) . . . 18
TIME . . . 13
TIME (time) . . . 16

U
USAGE (confdir/usage.conf) . . . 17

W
WORDLIST (outputdir/wordlist) . . . 18

Description Language Index

macro definitions . . . 28

Files Index

A
affix file . . . 18, 33
affix settings file . . . 21

C
character conversion table . . . 21
character replacement table . . . 21
configuration files . . . 17, 31
configuration directory . . . 18

D
dictionary file . . . 18, 33

F
flags file . . . 18, 42
flags.conf . . . 18, 42
fs.conf . . . 17

G
grammar directory . . . 17
grammar file . . . 17, 31

H
hunlex . . . 9
hunlex.1 . . . 9
HunlexMakefile . . . 9, 10, 12, 13

I
input resources directory . . . 18
install prefix . . . 8

L
lexicon file . . . 17, 30

M
Makefile . . . 12
Makefile, hunlex toplevel . . . 9
Makefile, local . . . 10
morph.conf . . . 17, 31, 37

O
output resources directory . . . 19

P
phono.conf . . . 17, 32
preambles for affix and dictionary files . . . 21
primary resource files . . . 30

S
signature file . . . 17

T
test file . . . 18

U
usage configuration file . . . 33
usage.conf . . . 17, 33

W
wordlist file . . . 18

Concept Index

A
affix . . . 18
affix clusters . . . 32
affix file, variables in . . . 20
affix rules (in the affix file) . . . 32
affix rules, conditioning the application of . . . 32
affix rules, merging affix rules in the affix file . . . 36
allomorph, setting stem allomorph to output . . . 20

B
block . . . 25
block, the blocks used within morph statements . . . 25

C
character set, setting character set in the affix file . . . 21
command-line interface . . . 12
compounding, affix file variables for compounding . . . 22
condition . . . 32
configuration of resource compilation . . . 31
cumulative, blocks . . . 25

D
debugging . . . 16
debugging resource compilation . . . 16
default settings . . . 10
default target . . . 10, 13
default values of options . . . 15
delimiter . . . 19
delimiter, tag . . . 19
description language . . . 1
dictionary . . . 18
double flags . . . 42

F
feature structure . . . 17
feature structures . . . 20, 41
feature, declaring features for compilation . . . 32
flags . . . 18, 42
flags, double . . . 42
flags, limit on the number of . . . 43
flags, setting double flags . . . 19
flags, setting special flags through toplevel options . . . 21
flags, special . . . 44

G
generating all the words of the language . . . 37
generation . . . 37
grammar . . . 31

H
hunlex . . . 1
hunlex, invoking . . . 12
hunmorph . . . 16
huntools . . . 1, 16, 21, 38

I
input resources . . . 1

J
jmorph . . . 38

L
lemma, setting output lemma . . . 20
level, setting maximal level . . . 19
level, setting minimal level . . . 19
levels . . . 19, 36
levels, setting the level of a morpheme . . . 32
lexical resources . . . 2
lexicon . . . 30
LGPL . . . 4
license . . . 4

M
macro definition . . . 24
macros . . . 28
make, GNU . . . 7
make, hunlex installation . . . 7
make, toplevel control . . . 12
Makefile . . . 12
Makefile variables . . . 15
Makefile, hunlex installation . . . 7
metadata definition . . . 24
metadata, output metadata into affix file . . . 22
minimal level . . . 37
mode, setting mode . . . 20
morph . . . 24
morph definition . . . 24
morphemes, configuring . . . 31
morphological analysis . . . 1
morphological analyzer . . . 1
morphological annotation . . . 40
myspell . . . 38

O
ocaml . . . 7
ocaml-make . . . 7
OCamlMakefile . . . 7
options, default value . . . 15
options, executable path . . . 15
options, remembering . . . 13
options, toplevel . . . 15
output information, of a word-analysis routine . . . 40
output options . . . 40
output resources . . . 1, 33
output, tags to output . . . 20

P
preamble, the header part of a morph statement . . . 24
preambles, for affix and dictionary files . . . 21
primary resources . . . 1

R
resource compilation . . . 12
resource compilation, configuration of . . . 31
resource, primary . . . 30
resource-specification language . . . 1
resources, input . . . 1, 30
resources, lexical . . . 2
resources, output . . . 1, 33
resources, primary . . . 1
resources, secondary . . . 1

S
settings, storing . . . 13
signature . . . 17
special flags . . . 44
spell-checking, settings for . . . 20
spellchecker . . . 1
spellchecking . . . 1
spellchecking, resources for myspell . . . 38
statement . . . 24
stemming stemmer . . . 1
stemming, settings for . . . 20

T
tags . . . 19, 40
tags, formatting output . . . 40
tags, merging tags . . . 40
tags, setting output tags . . . 20
targets . . . 13
targets, special . . . 14
targets, test . . . 14
test targets . . . 14
test, a file to test your grammar on . . . 18
test, testing resources . . . 14
testing compiled resources . . . 18
testing tags . . . 17
time . . . 16
toplevel control . . . 12
toplevel options . . . 15

U
usage configuration . . . 33
usage qualifiers, declaring . . . 33

V
variables, affix file . . . 22
variables, in affix file . . . 20
variables, Makefile . . . 15
variant, as part of a morph statement . . . 24
verbosity . . . 12, 16

Frequently Asked Questions

52

Frequently Asked Questions


- How can I display/undisplay the time it takes
hunlex to compile the resources? . . . . . . . . . . 13
- How can I invoke an alternative version of hunlex
(rather then the systemwide one)? . . . . . . . . . 15
- How can I make compilation more verbose with
the toplevel control? . . . . . . . . . . . . . . . . . . . . . . 12
- How can I remember the options on the toplevel?
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
- How can I set what tags are output by the
analyzer? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
- How can I test if my grammar compiles? . . . . . 14
- How can I test if my resources are working? . . 14
- How do I compile my resources with hunlex? . . 14
- Is there a way to dump all the words my
grammar generates? . . . . . . . . . . . . . . . . . . . . . . . 14
- What can I do through the toplevel control at
all? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
- What is a Makefile? . . . . . . . . . . . . . . . . . . . . . . . . . . 12
- What is a target? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
- What is toplevel control? . . . . . . . . . . . . . . . . . . . . . 12
- What options are available for the toplevel
control? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
- What targets are available through
HunlexMakefile? . . . . . . . . . . . . . . . . . . . . . . . . 13

How can I contact you? . . . . . . . . . . . . . . . . . . . . . . . . . 6


How can I declare usage qualifiers? . . . . . . . . . . . . . 33
How can I reset which usage qualifier dimension do
I want in the output? . . . . . . . . . . . . . . . . . . . . . 33
How can I select which morphemes are included in
the output? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
How do I compile resources for the myspell
spellchecker? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
How do I define a macro in the grammar? . . . . . . 28
How do I install the hunlex? . . . . . . . . . . . . . . . . . . . . 7
How do I reinstall hunlex? . . . . . . . . . . . . . . . . . . . . . . 8
How do I specify the morphology of the language?
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
How do I start using hunlex? . . . . . . . . . . . . . . . . . . 10
How do I uninstall hunlex?. . . . . . . . . . . . . . . . . . . . . . 8
How do I upgrade hunlex? . . . . . . . . . . . . . . . . . . . . . . 8
How do you specify the words of the language?
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
How many arguments a block can have? . . . . . . . 25
How many blocks can one have in a variant? . . . 25
How many flags (affix classes) can I have? . . . . . . 43
How many steps of affix stripping do I want to
have? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

A
Are there any Hunlex resources available for any
language? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Are there software with similar functionality? . . 46

C
Can I generate all the words of the language? . . 37
Can I use hunlex from a shell? . . . . . . . . . . . . . . . . . 35
Can I use hunlex on the command-line? . . . . . . . . 35
Can I use the macros I defined in the grammar in
the lexicon? Yes. . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Can one iterate the same type of blocks? . . . . . . . 25
Can you recommend a morphological analyzer?
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Can you recommend a spellchecker?. . . . . . . . . . . . 46
Can you recommend a stemmer? . . . . . . . . . . . . . . . 46

H
How are tags calculated by the analyzer?. . . . . . . 40
How can I associate tags with usage qualifiers?
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
How can I configure resource compilation? . . . . . 31
How can I configure usage qualifiers? . . . . . . . . . . . 33
How can I configure what the output tags of my
analyzer will be? . . . . . . . . . . . . . . . . . . . . . . . . . . 40

I
I am a keen hacker/stupid user with sleepless
nights/long afternoons at work to be
wasted/spent usefully. How can I contribute to
your work? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
I am selling this product to people for money. Is
this OK? OK with us. Ask your customers. . . 4
I am using hunlex for commercial purposes to
make money. Is this OK? OK with us. Ask
your neighbour. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
I am using hunlex, but desperately lacking a
feature X. Can I request it? . . . . . . . . . . . . . . . . 5
I found a bug/strange feature/some stupid
mistake. What should I do? . . . . . . . . . . . . . . . . 5
I found hunlex cool/useful/useless. How can I let
the authors know? . . . . . . . . . . . . . . . . . . . . . . . . . 5
I used hunlex and report it in a paper. What
reference do I use? . . . . . . . . . . . . . . . . . . . . . . . . . 5
In what sense are levels ordered? . . . . . . . . . . . . . . . 37
Is Hunlex just a converter?. . . . . . . . . . . . . . . . . . . . . . 2

U
Under what copyright restrictions can I distribute
linguistic resources compiled or to be compiled
with hunlex? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Under what copyright restrictions can I
redistribute this software? . . . . . . . . . . . . . . . . . . 4

Frequently Asked Questions

Under what restrictions can I modify this
software?. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Under what restrictions can I use this software?
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

W
What are feature structures good for? . . . . . . . . . . 41
What are feature structures? . . . . . . . . . . . . . . . . . . . 41
What are levels? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
What are special flags? . . . . . . . . . . . . . . . . . . . . . . . . 44
What are the configuration files? . . . . . . . . . . . . . . . 31
What are the primary resources for hunlex? . . . . 30
What are usage qualifiers? . . . . . . . . . . . . . . . . . . . . . 33
What do I need if I want to install this package?
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
What do I want to install? . . . . . . . . . . . . . . . . . . . . . . 7
What do levels mean for affixes and the affix file?
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
What do levels mean for stems in the lexicon and
the dictionary? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
What does a flag look like? . . . . . . . . . . . . . . . . . . . . 42
What does hunlex do with the tags? . . . . . . . . . . . 40
What files are relevant for hunlex? . . . . . . . . . . . . . 30
What files are the input for hunlex? . . . . . . . . . . . . 30
What gets installed when I install hunlex? . . . . . . 9
What is a block? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
What is a dictionary file? . . . . . . . . . . . . . . . . . . . . . . 33
What is a flag? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
What is a flaggable character? . . . . . . . . . . . . . . . . . 42


What is a variant? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
What is an affix file? . . . . . . . . . . . . . . . . . . . . . . . . . . 33
What is Hunlex? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1, 7
What is in the grammar file? . . . . . . . . . . . . . . . . . . 31
What is in the lexicon file? . . . . . . . . . . . . . . . . . . . . 30
What is morphological annotation? . . . . . . . . . . . . 40
What is the difference between features and usage
qualifiers? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
What is the grammar file? . . . . . . . . . . . . . . . . . . . . . 31
What is the lexicon file? . . . . . . . . . . . . . . . . . . . . . . . 30
What is the License of this software? . . . . . . . . . . . . 4
What is the morph.conf file? . . . . . . . . . . . . . . . . . . . 31
What is the phono.conf file? . . . . . . . . . . . . . . . . . . . 32
What is the usage.conf file? . . . . . . . . . . . . . . . . . . . . 33
What output files does hunlex create? . . . . . . . . . . 33
What platforms are supported by Hunlex? . . . . . . 7
Where can I find the sources? . . . . . . . . . . . . . . . . . . . 7
Where do the tags come from? . . . . . . . . . . . . . . . . . 40
Which affix rules do I want to merge and which
words do I want to . . . . . . . . . . . . . . . . . . . . . . . . 38
Which characters can be flags? . . . . . . . . . . . . . . . . . 42
Which files actually describe the language? . . . . . 30
Which flags will be in the affix file and which flag
is a particular . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Which software can use the output of Hunlex?
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Who are you? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
With which options can I manipulate levels (and
thereby affix-merging)? . . . . . . . . . . . . . . . . . . . . 37
