MAXIMILIANO DURAN
INALCO, Paris
Abstract
For many Natural Language Processing (NLP) applications in Quechua, having all kinds of dictionaries at hand is an important prerequisite.
This article presents the process of building a dictionary of inflected Quechua verbal forms (IQVF), indispensable for any text analysis.
First, to set the stage, I recall the results and methods that I used to obtain a complete dictionary of inflected Quechua nouns. I had previously isolated 60 nominal suffixes. After a detailed study of how these suffixes behave grammatically, I show how I built the module capable of generating all the noun inflections.
Starting from a general Quechua-Spanish dictionary, I extracted the Quechua nouns and translated them into French. This first corpus contained about 1500 nouns, to which I applied the set of paradigms programmed in NooJ. The result is a dictionary of all the inflections of these nouns, totaling 710,216 inflected nominal forms.
We then extended these methods to verbs. First, we briefly present the corpus on which the study is based. It consists of: 1) a set of ten contemporary tales by a well-known Peruvian author; 2) an ancient manuscript, the only one written by an Indian who had secretly learned to write, at the end of the 16th century; 3) an ancient dictionary, the most complete of its time, written by a Spanish missionary at the beginning of the 17th century; and, finally, 4) the same contemporary Quechua-Spanish dictionary used for the nouns. This corpus totals over 1400 pages, from which, using the NooJ resources, we obtained a list of 2000 verbs, which I translated into French. Next, I describe the grammatical details of the inflectional structure of Quechua verbs and thereby verify a remarkable property: all Quechua verbs are regular. I then highlight another outstanding characteristic of Quechua morphology: a conjugated form may itself, in turn, be inflected, in postposition, by a particular subset of suffixes, which I have also isolated. These suffixes may combine with one another to form manifolds of two, three or more suffixes, in fact up to 10. Importantly, the order in which suffixes agglutinate is not arbitrary. We studied the constraints that a manifold must respect in order to be valid, that is, compatible with the structural grammar of the language.
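The idea of an ordering constraint on suffix manifolds can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the constraint system established in the article: each suffix is given a hypothetical slot number, two suffixes in the same slot are treated as mutually exclusive, and a manifold is considered valid when its slot numbers strictly increase. The suffixes and slot numbers below are placeholders chosen for the example.

```python
from itertools import combinations

# Hypothetical slot numbers for a few post-posed suffixes.
# These values are illustrative placeholders, not the article's constraints.
SLOT = {"ña": 1, "raq": 1, "taq": 2, "chu": 3}


def is_valid(sequence, slot):
    """A sequence is valid if its slot numbers strictly increase."""
    ranks = [slot[s] for s in sequence]
    return all(a < b for a, b in zip(ranks, ranks[1:]))


def valid_sequences(slot, max_len):
    """Enumerate every valid suffix sequence of length 1..max_len."""
    ordered = sorted(slot, key=slot.get)  # candidates in slot order
    result = []
    for k in range(1, max_len + 1):
        for combo in combinations(ordered, k):
            # Rejects combinations with two suffixes from the same slot.
            if is_valid(combo, slot):
                result.append(combo)
    return result


if __name__ == "__main__":
    for seq in valid_sequences(SLOT, 4):
        print("-" + "-".join(seq))
```

Enumerating combinations in slot order and filtering keeps the sketch short; with the roughly ten-suffix manifolds mentioned above, a real implementation would more likely encode the constraints directly in the paradigm grammar.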
All this work allowed us to build a comprehensive catalog of paradigms. We first obtained all the conjugated forms and then, by applying the manifolds, the supplementary set of paradigms. We were then able to program them in NooJ.
Thus, from the catalog of 2000 verbs, we obtained more than 1 million inflections, which together form the dictionary of Inflected Quechua Verbal Forms (IQVF).
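Because all verbs are regular, a single paradigm can be applied mechanically to every root. The following Python sketch illustrates that generation step outside NooJ; it is not the author's NooJ grammar. The present-tense person endings are given in one common southern Quechua spelling and are an assumption of this example, as are the three sample verbs and the short list of post-posed suffix sequences. Allomorphy is ignored.

```python
# Present-tense person endings (assumed for this example, not from the article).
PERSON_ENDINGS = ["ni", "nki", "n", "nchik", "niku", "nkichik", "nku"]

# Hypothetical post-posed suffix sequences; the empty tuple keeps the bare form.
POSTPOSED = [(), ("ña",), ("chu",), ("ña", "chu")]


def root_of(infinitive):
    """Strip the infinitive marker -y to obtain the verb root."""
    if not infinitive.endswith("y"):
        raise ValueError(f"expected an infinitive ending in -y: {infinitive}")
    return infinitive[:-1]


def inflect(infinitive):
    """Generate conjugated forms, then append each post-posed sequence."""
    root = root_of(infinitive)
    forms = []
    for ending in PERSON_ENDINGS:
        conjugated = root + ending
        for sequence in POSTPOSED:
            forms.append(conjugated + "".join(sequence))
    return forms


if __name__ == "__main__":
    verbs = ["rimay", "puriy", "mikuy"]  # sample verbs for the sketch
    dictionary = {verb: inflect(verb) for verb in verbs}
    total = sum(len(forms) for forms in dictionary.values())
    print(total, "forms generated")
```

Each verb here yields 7 x 4 = 28 forms. The same multiplication of conjugated forms by valid suffix sequences is what makes a catalog of 2000 verbs grow to a very large number of inflected forms.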
Evaluation of this IQVF dictionary on our corpus shows good matching results.
Finally, we present the outlook for our work.
N.B. The rest of the article may be obtained by writing to the author at: duran_maximiliano@yahoo.fr