This directory contains our effort to proofread apertium-kaz’s lexicon against the 15-volume [Explanatory] Dictionary of Literary Kazakh (“Қазақ Әдеби Тілінің Сөздігі”), published by the Kazakh Linguistics Institute in 2011 (EDOK2011), and the single-volume [Explanatory] Dictionary of Kazakh (“Қазақ Сөздігі”), published by the Linguistics Institute and the Kazakh Language Committee in 2013 (EDOK2013). The goal is to resolve issue #11 of apertium-kaz and to extend its lexicon with more stems, especially common words (as opposed to proper nouns).
We plan to merge the results back to apertium-kaz once we’re done proof-reading it.
In the meantime, we provide our own version of apertium-kaz.kaz.lexc as a drop-in replacement for apertium-kaz’s file of the same name, as well as a slightly modified Makefile.am for it. To use apertium-kaz with our modifications, you should:
- Install Apertium Core:
wget https://apertium.projectjj.com/apt/install-nightly.sh -O - | sudo bash
sudo apt-get -f install apertium-all-dev
OR
wget https://apertium.projectjj.com/rpm/install-nightly.sh -O - | sudo bash
sudo yum install apertium-all-devel
or similar, depending on what kind of GNU/Linux distribution you are using. See this article for more information. For Windows users, the Apertium project provides a VirtualBox image with all the necessary tools installed. If you’re using the VirtualBox image, simply continue with the next step.
- Download apertium-kaz:
git clone https://github.com/apertium/apertium-kaz.git
- Replace the apertium-kaz/apertium-kaz.kaz.lexc file with the one we provide.
- Replace the apertium-kaz/Makefile.am file with the one we provide.
- Compile apertium-kaz:
cd apertium-kaz; ./autogen.sh; make
Here’s a brief comparison of the two main characteristics of apertium-kaz’s state before and after our modifications to it:
Number of stems in .lexc before
Number of stems in .lexc after
Naive coverage before
Naive coverage after
Bible (New Testament)
91.69 91.45 91.22 90.23 => avg. 91.1475
93.82 93.91 93.59 91.86 => avg. 93.295
Using HFST transducers for measuring coverage is not optimal, though, because of a long-standing HFST issue related to vowel harmony at word boundaries. For example, ^жылдан бастап/*жылдан бастап$ appears among the unrecognized words. That is why in apertium-kaz we started printing HFST transducers in ATT format and recompiling them into lttoolbox transducers from that ATT representation.
Naive coverage is defined as the share of words in a running text for which apertium-kaz returns at least one analysis. For calculating the coverage on Wikipedia, four non-consecutive chunks of approximately 3.3 million tokens were selected. The total size of Kazakh Wikipedia as of this writing is about 33 million tokens. The Bible corpus is often used for calculating coverage and other benchmarks, since it is the text available in the highest number of languages. The coverage was measured using the hfst-covtest script.
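As an illustration of the definition above, here is a minimal Python sketch that computes naive coverage from analyser output. It is not the hfst-covtest script. It assumes the output is in the Apertium stream format seen earlier (^surface/analysis$, with unknown words marked by a leading *), and it ignores escaped characters. The function name naive_coverage and the sample analysis are made up for the example.

```python
import re

# One lexical unit in the stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def naive_coverage(analysed_text):
    """Return the share of tokens that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = 0
    for unit in units:
        # Everything after the first slash is the list of analyses.
        _surface, _, analyses = unit.partition("/")
        # An unknown word carries a single pseudo-analysis starting with '*'.
        if analyses and not analyses.startswith("*"):
            known += 1
    return known / len(units)


# Illustrative input: one analysed token and one unknown token.
sample = "^үй/үй<n><nom>$ ^жылдан бастап/*жылдан бастап$"
print(naive_coverage(sample))  # 0.5
```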
The remaining unanalysed words seem to be misspelled words, proper nouns, abbreviations, or bound affixes that for some reason appear detached in the text. It is trivial to augment apertium-kaz.kaz.lexc with catch-all regular expressions like the following (one for each category, such as N1 for nouns, ADJ for adjectives, V-TV for transitive verbs, NP-TOP for toponyms, etc.):
<(а | ә | б | в | г | ғ | д | е | ё | ж | з | и | і | й | к | қ | л |
м | н | ң | о | ө | п | р | с | т | у | ұ | ү | ф | х | һ | ц | ч | ш | щ |
ь | ы | ъ | э | ю | я)+> N1 ;
The above regex allows analysing all forms of a noun even if it is not in apertium-kaz’s lexicon. Let’s take the non-word “баргылларда” as an example. Based on its ending, it looks like the plural locative form of a noun “баргыл”. The above regex will analyse it as such, but it will also return a nominative reading (and several others). In short, for such catch-all regexes to be useful, a good (statistical) disambiguator is required. The development of a disambiguator is scheduled for 2020, so we decided not to add such regular expressions to apertium-kaz yet.
Besides, including such regexes seems to make apertium-kaz impossible to compile.
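The ambiguity can be illustrated with a toy Python sketch. It is not how the transducer works: the suffix lists are a simplified assumption (a few plural and locative allomorphs, no vowel-harmony checks), and guess_noun_readings is a hypothetical helper. It only shows why a catch-all stem pattern yields several competing readings for a single surface form.

```python
# Simplified, assumed allomorph lists; the real morphotactics are richer.
PLURAL = ("лар", "лер", "дар", "дер", "тар", "тер")
LOCATIVE = ("да", "де", "та", "те")


def guess_noun_readings(word):
    """Enumerate noun readings when any letter string may be a stem."""
    # The whole word can always be taken as a bare nominative stem.
    readings = [f"{word}<n><nom>"]
    for pl in PLURAL:
        if word.endswith(pl) and len(word) > len(pl):
            readings.append(f"{word[:-len(pl)]}<n><pl><nom>")
    for loc in LOCATIVE:
        if word.endswith(loc) and len(word) > len(loc):
            base = word[:-len(loc)]
            # Locative of a singular stem.
            readings.append(f"{base}<n><loc>")
            for pl in PLURAL:
                if base.endswith(pl) and len(base) > len(pl):
                    # Locative of a plural stem.
                    readings.append(f"{base[:-len(pl)]}<n><pl><loc>")
    return readings


for reading in guess_noun_readings("баргылларда"):
    print(reading)
```

With these assumed suffix lists the sketch prints three readings: the bare nominative “баргылларда”, the locative of “баргыллар”, and the plural locative of “баргыл”. Choosing among them is the disambiguator’s job.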
Here’s how apertiumpp-kaz is different from apertium-kaz.
The main list of stems, lexicon.rkt, is implemented in a full-fledged programming language (Racket), and not in the Xerox/Helsinki Finite State Toolkit’s lexc formalism.
- This list of stems is consumed by a scribble/text-based template, apertium-kaz.kaz.lexc.scrbl. A normal apertium-kaz.kaz.lexc file is generated when the following command is run:
racket apertium-kaz.kaz.lexc.scrbl
As opposed to the 3-element data structure of a lexc entry (upper-side string, lower-side string, continuation lexicon), with other marks being comments formatted in a particular way by convention, the main datatype of lexicon.rkt is an Entry with 7 fields, representing the following information:
- the upper-side string
- the lower-side string
- the continuation lexicon(s)
- a gloss and various (restrictional) marks such as USE/MT, DIR/LR, DIR/RL, etc.
- inflected forms of this word which were unnecessarily lexicalised in the .lexc file we proofread (or were lexicalised by the authors of the print dictionaries, but we thought it wasn’t necessary to lexicalise them in the transducer)
- the stem from which this stem has been derived in a semi-productive way, or a chain of such stems
- normative spelling(s) of this word
In the source code, entries can be wrapped up with function calls, which modify entries in various ways (or not), depending on how the functions in question are defined, and, ultimately, what defaults a particular application of apertium-kaz calls for.
Below are four examples of entries of apertiumpp-kaz, named as E-1, E-2, E-3 and E-4 here (with the difference that in the actual lexicon.rkt we use a more concise notation, omitting duplications and some of the empty lists ’()).
(define E-1 (e "абдаста" "абдаста" 'N1 '() '() '() '("әбдесте")))
(define E-2 (e "абдикация" "абдикация" 'N1 '() '() '() '()))
(define E-3 (e "абдырат" "абдырат" 'V-IV `("confuse,embarras" ,USE/MT) '("абдырату") '("абдыра") '()))
(define E-4 (e "абыр-дабыр" "абыр-дабыр" '(IDEO N1) '() '() '() '()))
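For readers more at home in Python than in Racket, the same 7-field record could be sketched as a dataclass. This is only an analogy, not the project’s code. The field names, the to_lexc helper, and the choice to emit one lexc line per continuation lexicon are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    upper: str    # upper-side string
    lower: str    # lower-side string
    cont: tuple   # continuation lexicon(s), e.g. ("N1",) or ("IDEO", "N1")
    marks: list = field(default_factory=list)         # gloss and marks
    inflected: list = field(default_factory=list)     # lexicalised inflected forms
    derived_from: list = field(default_factory=list)  # source stem(s)
    normative: list = field(default_factory=list)     # normative spelling(s)


def to_lexc(entry):
    """Render lexc lines of the form 'upper:lower CONT ; ! "comment"'."""
    comment = " ".join(str(mark) for mark in entry.marks)
    lines = []
    for cont in entry.cont:
        line = f"{entry.upper}:{entry.lower} {cont} ;"
        if comment:
            line += f' ! "{comment}"'
        lines.append(line)
    return lines


# Python counterparts of E-1 and E-4 from the text.
e1 = Entry("абдаста", "абдаста", ("N1",), normative=["әбдесте"])
e4 = Entry("абыр-дабыр", "абыр-дабыр", ("IDEO", "N1"))
for line in to_lexc(e1) + to_lexc(e4):
    print(line)
```

The extra fields carry information that a plain lexc line can only hold in comments, which is the motivation given above for moving to a richer datatype.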
Commentaries on why and how these modifications were made follow.
We have chosen the 15-volume and single-volume explanatory dictionaries of Kazakh as a reference because they are:
developed by publicly-financed organisations, responsible for language policy in Kazakhstan, and not commercial companies.
Individual words (entry words, to be exact, which interest us in this project and which we have extracted from the dictionaries) are not copyrightable per se, but the latter point is a further safeguard that we are not violating anyone’s rights.
First of all, we copied all common words from apertium-kaz.kaz.lexc. By common words we mean words which are not proper nouns. This includes open-class words (nouns, verbs, adjectives, adverbs), but also closed-class or functional words like pronouns, determiners, postpositions etc.
With few exceptions, the entry words contained in the single-volume EDOK2013 are a superset of those contained in the 15-volume EDOK2011. The greater size of the latter is due to example usages and more elaborate explanations.
Therefore, we proceeded as follows:
- extracted text from EDOK2013’s pdf file
- converted the entries in it into (entry word, rest of the entry) pairs, separated by tabs
- labeled the first N entry words with the right categories, using lexikograf.py
Lexikograf.py expects two command-line arguments: a dictionary in plain text format, and a number BATCH_SIZE. Lines in the dictionary of the following form:
label \tab entry word \tab rest of the entry
serve as training data.
Lines in the following form:
entry word \tab rest of the entry
are lines for which lexikograf.py will suggest a label, and the user is requested to either mark the suggested label as correct or, if it is not, to type in the correct label.
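The two line formats can be told apart by their number of tab-separated columns. The following Python sketch shows one way to do that. It is not taken from lexikograf.py: the function name split_dictionary is hypothetical, and it assumes that the rest of an entry never contains a tab itself.

```python
def split_dictionary(lines):
    """Split dictionary lines into training data and lines still to label."""
    labeled, unlabeled = [], []
    for raw in lines:
        fields = raw.rstrip("\n").split("\t")
        if len(fields) == 3:
            # label, entry word, rest of the entry
            labeled.append(tuple(fields))
        elif len(fields) == 2:
            # entry word, rest of the entry
            unlabeled.append(tuple(fields))
        # Lines with any other shape are skipped in this sketch.
    return labeled, unlabeled


lines = [
    "N1\tабдаста\trest of the entry\n",
    "абдикация\trest of the entry\n",
]
labeled, unlabeled = split_dictionary(lines)
print(len(labeled), len(unlabeled))  # 1 1
```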
After having seen BATCH_SIZE new observations, lexikograf.py adds these new observations to the training data and (re)trains a MaxEnt (aka multinomial logistic regression) classifier. At each step, the annotation process is backed up in the ws.pickle file as a WorldState, which is a compound data structure consisting of <classifier, labeled entries, unlabeled entries>.
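The loop described above might look roughly like the sketch below. It is an assumption-laden illustration rather than lexikograf.py itself: MajorityClassifier is a trivial stand-in for the MaxEnt classifier, the ask callback replaces the interactive prompt, and the names annotate and backup_path are invented. New observations are appended to the labeled list right away, but the classifier is retrained only once per batch.

```python
import pickle
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class WorldState:
    classifier: object
    labeled: list = field(default_factory=list)    # (label, word, rest)
    unlabeled: list = field(default_factory=list)  # (word, rest)


class MajorityClassifier:
    """Stand-in for MaxEnt: always predicts the most frequent label."""

    def __init__(self):
        self.label = None

    def train(self, labeled):
        counts = Counter(label for label, _word, _rest in labeled)
        self.label = counts.most_common(1)[0][0] if counts else None

    def predict(self, word, rest):
        return self.label


def annotate(state, batch_size, ask, backup_path=None):
    """Label all unlabeled entries, retraining after every batch."""
    state.classifier.train(state.labeled)
    since_training = 0
    while state.unlabeled:
        word, rest = state.unlabeled.pop(0)
        suggestion = state.classifier.predict(word, rest)
        # `ask` returns the suggestion if accepted, or a corrected label.
        label = ask(word, rest, suggestion)
        state.labeled.append((label, word, rest))
        since_training += 1
        if since_training == batch_size:
            state.classifier.train(state.labeled)
            since_training = 0
        if backup_path is not None:
            # Back up the whole state after each step.
            with open(backup_path, "wb") as handle:
                pickle.dump(state, handle)
    return state


state = WorldState(MajorityClassifier(),
                   [("N1", "абдаста", "rest of the entry")],
                   [("абдикация", "rest of the entry")])
annotate(state, batch_size=1, ask=lambda word, rest, suggestion: suggestion)
print(state.labeled[-1][0])  # N1
```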
Apparently, we labeled 754 entries in this way, after which the number of errors lexikograf.py made seemed negligible, so we made it label the rest of the entries and added the entry words to lexicon.rkt, if such (word, continuation lexicon) pairs were not present in it already.
As described in the previous section, lexicon.rkt is a union of entries from two sources:
- common words of the original apertium-kaz.kaz.lexc file, and
- entry words from EDOK2013 (the first 757 of which were hand-labeled with correct continuation marks, the rest with the labels lexikograf.py’s classifier assigned to them)
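The union keeps an entry from the dictionary only when the same (word, continuation lexicon) pair is not already present. A minimal Python sketch of that rule follows, with merge_entries as a hypothetical name and plain tuples standing in for full entries.

```python
def merge_entries(lexc_entries, dictionary_entries):
    """Union of two entry lists, keyed on (word, continuation lexicon)."""
    merged = list(lexc_entries)
    seen = set(lexc_entries)
    for pair in dictionary_entries:
        if pair not in seen:
            merged.append(pair)
            seen.add(pair)
    return merged


from_lexc = [("абдаста", "N1")]
from_dictionary = [("абдаста", "N1"), ("абдикация", "N1"), ("абдырат", "V-IV")]
print(merge_entries(from_lexc, from_dictionary))
```

Keying on the pair rather than on the word alone means that a word already in the lexicon with one continuation lexicon can still be added with a different one.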
The resulting lexicon.rkt requires manual checking because:
- errors from the original apertium-kaz.kaz.lexc got carried over (see issue #11)
- lexikograf.py might have labeled words from EDOK2013 incorrectly (read: they have a wrong continuation lexicon in lexicon.rkt)
To mitigate both kinds of errors, we open lexicon.rkt and EDOK2011 side by side and read them in parallel. We proofread lexicon.rkt against EDOK2011, and not against EDOK2013, because the explanations of the former are more elaborate and, more importantly, it includes example usage sentences for each entry word / sense.
For most of the words in lexicon.rkt, reading explanations or examples was not necessary, as it was apparent whether their continuation classes were correct or not. For some, however, reading the example sentences was crucial. Notably, they were helpful for figuring out whether a verb was transitive or intransitive, or whether an adjective was A1 or A2. As a side note, we decided to restrict the possible continuation classes for adjectives to two (A1 and A2), thus eliminating A3 and A4 entirely. The only difference between an A1 adjective and an A2 adjective is that the former is actually both an adjective and an adverb, and thus can modify both nouns and verbs, while the latter is used solely as an attribute in a sentence.
We have said above that entries in lexicon.rkt can be wrapped with function calls. Here are some examples of that:
(-day '("абажадай" A1 () () ()))
(-li '("абажурлы" A2 () () ()))
(comp-verb '("абай бол" V-IV () () ()))
(refl '("абайлан" V-TV () () ("абайла")))
(caus '("абайлат" V-TV () ("абайлату") ("абайла")))
(-siz '("абайсыз" A1 () () ()))
(caus '("дағдыландыр" V-TV () ("дағдыландырыл") ("дағдылан")))
(sim '("даңғойсы" V-IV () () ()))
(multi '("даму% ақау" N1 () () ()))
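How such wrappers behave is up to their definitions, which are not shown here. Purely as an illustration of the idea, the Python sketch below builds wrappers that either tag an entry or drop it, depending on a default chosen per application. The factory make_wrapper, the keep flag, and the tagging behaviour are all hypothetical, and entries are modelled as 5-tuples mirroring the concise notation above.

```python
def make_wrapper(tag, keep=True):
    """Build a wrapper that tags an entry, or drops it when keep is False."""
    def wrapper(entry):
        if not keep:
            # This application does not want entries of this kind.
            return None
        word, cont, marks, inflected, derived_from = entry
        return (word, cont, marks + (tag,), inflected, derived_from)
    return wrapper


# Hypothetical defaults: keep causatives, drop reflexives.
caus = make_wrapper("caus")
refl = make_wrapper("refl", keep=False)

entries = [
    caus(("абайлат", "V-TV", (), ("абайлату",), ("абайла",))),
    refl(("абайлан", "V-TV", (), (), ("абайла",))),
]
kept = [entry for entry in entries if entry is not None]
print(kept)
```

Changing a single flag in a wrapper’s definition then changes how every entry wrapped with it is treated, without touching the entries themselves.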
This work is being funded by the Committee of Science of the Ministry of Education and Science of the Republic of Kazakhstan, contract# 346/018-2018/33-28, IRN AP05133700, principal investigator Zhenisbek Assylbekov.
Just like apertium-kaz, the contents of this repo are published under GPL version 3.