Abstract:
This paper presents a proposal for the
development of a computerised text corpus in Kurdish. After reviewing the
recent development of computer corpus linguistics and lexicography and
giving several examples, such as dictionary building, concordancing and
historical language studies, the author gives a short overview of the
evolution of the computer tools available for the Kurdish language and then suggests
building an experimental corpus of 100 000 words through a
non-commercial consortium, as a preliminary to the realisation of a larger
corpus of one million words. Finally, the paper summarises the benefits
which would accrue to the Kurdish language from entering this research field.
Keywords:
Computational
linguistics, text corpus & corpora, Kurdish language.
The author:
Gérard Gautier, born in
1955. Originally a physicist, he holds a Ph.D. in Anthropology and two MAs, in
Educational Science and Computer Science. He studied Kurdish in Paris and worked in
multilingual computing and applied research in computational linguistics
(generation of electronic dictionaries for voice dictation, for IBM). He taught
at university, lived in Taiwan from 1989 to 1996 and in Kurdistan from
1999 to 2004, before coming back to France to work in the field of
non-formal science education for young people and pedagogic consulting.
Computerised
text corpus in Kurdish
A proposal for the Erbil Symposium, September 2006
ggautierk@online.fr
The
development of computer corpus lexicography
Computers
can be used as a tool to build large collections of texts in a specific language,
called a “text corpus” (pl. “text corpora”).
This practice has developed considerably over the last twenty years, and
the construction process and the choices at stake are now quite well known.
Why
did this practice develop? One answer is that the use of computers
increasingly permeates all aspects of life, so electronic texts are becoming more
and more available. Another reason is that researchers quickly understood that
these electronic texts can be used in a very interesting way to reveal
useful facts about a language, because it is much easier to search a text on
a computer than on paper.
Building
dictionaries
One
good example is the construction of dictionaries.
Before
the use of corpora, the dictionary author or “lexicographer” relied on
reading large amounts of text and on introspection to establish facts. The team who
built the well-known Oxford English Dictionary (OED), completed in the 1920s, pushed this
technique to the extreme by using hundreds of volunteer readers across the whole United
Kingdom.
They scanned publications looking for interesting ways of using words, then
sent the data on small slips of paper to a centralised office in Oxford,
where it was put in order for inclusion in the dictionary. This process took
decades.
Today,
the OED would probably be built using a text corpus. That is how the Cobuild
Dictionary was built in the 1980s.
Twenty
years ago, in the 1980s, a collection of texts totalling one million words, such as the Brown
University corpus, was seen as a big corpus. In 1993 it was already
considered small. Now it is considered very
small, and ten-million-word corpora are quite common. Indeed, the Cobuild
corpus now counts hundreds of millions of words, 56 million of which are
currently available on the Internet through the Collins WordbanksOnline
service.
Concordancing
A
now quite classical example of the use of text corpora is the concordancer.
This small piece of software finds all the occurrences of any word in a
text. It produces output called KWIC, i.e. “Key Word In
Context” listings, which are very useful
for quickly surveying the different usages of a word or expression. Below is a
concordance of the word “newspaper” in the
Collins WordbanksOnline
English corpus, obtained directly online from the Collins Corpus Concordance
Sampler website.
Iraqi withdrawal. An Iraqi government | newspaper | (al-Thawra) has said that a new United
| mother was a journalist. After her | newspaper | office was ransacked she was jailed - for
| for Aamulehti, Finland's second largest | newspaper | and the secretary of Finland's Electrician'
| Hunter and Julian Amery. The Observer | newspaper | subsequently revealed Lauder-Frost's racist
| top of the 10 note. Finally, a folded | newspaper | would be laid carefully over the cash and
| Long may it continue. [p] In the world of | newspaper | publishing, there is one success story that
| in a less hostile climate. The TODAY | newspaper | agrees, saying that if Iran's dead could
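The mechanism behind a concordancer is simple enough to sketch. The following minimal Python sketch produces KWIC-style lines like those above; the file name corpus.txt and the five-word context window are assumptions chosen for illustration, not features of any existing tool.

```python
# Minimal KWIC (Key Word In Context) concordancer sketch.
# "corpus.txt" is a placeholder name for a plain UTF-8 text file;
# the 5-word context window is an arbitrary choice.
import re

def kwic(path, keyword, context=5):
    """Print every occurrence of `keyword` with `context` words on each side."""
    with open(path, encoding="utf-8") as f:
        words = re.findall(r"\w+", f.read())
    for i, w in enumerate(words):
        if w.lower() == keyword.lower():
            left = " ".join(words[max(0, i - context):i])
            right = " ".join(words[i + 1:i + 1 + context])
            print(f"{left:>45} | {w} | {right}")

kwic("corpus.txt", "newspaper")
```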
Concordancing tools are
now available for Arabic, and could be used for Kurdish in Arabic letters as
well. A good example is aConCorde, developed by Andrew Roberts at the School of Computing of
the University of Leeds (UK) (see screenshot below).
[Figure: a screenshot of aConCorde © from Andrew Roberts' website]
The
figures next to the words in the lists on the right (both Arabic and English) are
the number of occurrences of each word. This points to another possible
use of text corpora and concordances: language teaching.
Teachers of a second language have found it quite useful to have students
use concordancers in class to investigate usage facts – and to know which words have the
highest frequency.
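Such frequency lists are just as easy to produce as the concordance itself; here is a minimal sketch, again assuming a plain UTF-8 file with the placeholder name corpus.txt.

```python
# Sketch of a word-frequency list like the side panels shown in aConCorde.
# "corpus.txt" is a placeholder file name.
import re
from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:
    words = re.findall(r"\w+", f.read())

# Print the 20 most frequent word forms with their counts.
for word, count in Counter(w.lower() for w in words).most_common(20):
    print(f"{count:6d}  {word}")
```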
An
example of historical study
Corpora can also be used
for other studies, such as tracing the evolution of the language. If texts are
ordered by date, it becomes possible to see how the use of a given word evolves
over time. For instance, the French National Centre for Scientific Research (CNRS) and the University of Chicago teamed up to
give access to a large corpus of 150 million words of French texts through the
ARTFL project (American and French Research on the Treasury of the French
Language).
To quote from the ARTFL website:
“The ARTFL database is one of the largest
of its kind in the world. The number, variety and historical range of its texts
allow researchers to go well beyond the usual narrow focus on single works or
single authors. The database permits both the rapid exploration of single
texts, and the inter-textual research of a kind virtually impossible without
the aid of a computer.”
At Stanford University, Keith Baker was able to use ARTFL to
study the evolution of the use of the word “revolution” across the centuries
spanned by the texts. In the resulting publication, the author wrote about
the usefulness of the ARTFL corpus for his research:
Over the years, I have used
ARTFL in a number of research projects on the history of French political
culture. [...] I have found it extremely helpful. Generally speaking, I have
searched the database for occurrences of terms relevant to particular political
concepts. The searches [...] have demonstrated shifts in the frequency of the
uses of important terms in the database over relatively long periods of time.
[...]
Another project in which I had valuable recourse to ARTFL
was a study of the idea of "revolution" in prerevolutionary France,
first published in 1988 and also reprinted in Inventing the French Revolution.
Searching the database for révolution produced an enormous amount of
information. It revealed important occurrences in works I would not otherwise
have investigated, as well as ensuring that I did not miss occurrences in works
I already knew to be crucial [...] .
Given
the high-speed evolution the Kurdish language is undergoing in the Kurdistan Region of
Iraq, this kind of study is a likely use for a future Kurdish text corpus.
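To make the idea concrete for a future Kurdish corpus, here is a hypothetical sketch of such a diachronic count. It assumes each collected text is listed, with its year of publication, in a simple tab-separated index file; the file name catalogue.tsv, its layout and the example word are all assumptions made for illustration.

```python
# Hypothetical sketch: count occurrences of a word per year in a date-tagged corpus.
# Assumes an index file "catalogue.tsv" whose lines are "<year>\t<path-to-text-file>";
# this layout and the file names are invented for illustration.
import csv
import re
from collections import defaultdict

def occurrences_by_year(index_path, keyword):
    counts = defaultdict(int)
    with open(index_path, encoding="utf-8") as index:
        for year, path in csv.reader(index, delimiter="\t"):
            with open(path, encoding="utf-8") as f:
                words = re.findall(r"\w+", f.read())
            counts[int(year)] += sum(1 for w in words if w == keyword)
    return dict(sorted(counts.items()))

# Example call with a word of interest, e.g. شۆڕش ("revolution" in Sorani Kurdish).
print(occurrences_by_year("catalogue.tsv", "شۆڕش"))
```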
Kurdish
particulars: a rapid evolution of the available tools
About a decade ago, I
worked on the technical difficulties awaiting any researcher trying to build a
Kurdish-language corpus. At the time, tools as
well as standards were lacking, but fortunately, even though some technical
choices do remain, the “technical landscape” has now
simplified a lot. First, operating systems with an efficient Arabic-language
graphical user interface (Arabic GUI), once only available on the Macintosh, have
become the norm in the PC world. Second, the UNICODE multilingual encoding scheme,
introduced in the 1990s, has gained wide acceptance, particularly thanks to
the Internet (for Kurdish, this gave rise to the creation of numerous websites
using UNICODE, the majority of which were not online even five years ago).
This evolution means
that whatever representation we choose for Kurdish – Arabic or Roman alphabet –
we now have the tools to work with, since the majority of functions which are
available for Arabic should work with Kurdish in Arabic letters. Besides, the
prevalence of UNICODE means that we now have a natural standard to rely on for
storing any textual data in Kurdish. Some Kurdish groups abroad have started to make
specific tools available, particularly the KurdIT Group.
Starting
the tests for a Kurdish Text Corpus
There are obviously
several possible approaches to building a Kurdish-language text corpus,
but I would like to suggest one of them.
First, a “small” experimental
corpus of 100 000 words should be built to work through the technical problems.
This is only an introductory paper, so I will not go into more technical
considerations: there are still some remaining problems, such as the existence
of faulty computer fonts which oblige the user to type the short vowel e
as an h followed by a non-separating space... Depending on the software
under which they were typed and the font used, the collected texts may have to
be “normalised” (i.e. those
sequences of h + space replaced by e) before being
exported and re-encoded into UNICODE. There are also some small problems
concerning the encoding of the sequences [lam + alif] لا and [lam + hawt + alif] لاَ. A small set of texts would make it possible to think about these problems as
they occur, and to test-drive the solutions for a bigger project.
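As a very rough sketch of what such a normalisation pass could look like in Python: the exact code points are assumptions, since they depend on the faulty fonts actually encountered; here I assume the legacy sequence is the Arabic letter heh (U+0647) followed by either a zero-width non-joiner (U+200C) or a no-break space (U+00A0), to be replaced by the Kurdish short vowel ە (U+06D5), and the file names are placeholders.

```python
# Rough normalisation sketch for legacy-typed Kurdish text.
# ASSUMPTIONS: the faulty "h + non-separating space" sequence is encoded as
# U+0647 (heh) followed by U+200C (zero-width non-joiner) or U+00A0 (no-break
# space), and the intended letter is U+06D5 (Arabic letter ae, the Kurdish
# short vowel e). The real sequences depend on the fonts and software actually
# used and must be verified on the collected texts first.
LEGACY_SEQUENCES = ["\u0647\u200c",   # heh + zero-width non-joiner (assumed)
                    "\u0647\u00a0"]   # heh + no-break space (assumed)
PROPER_E = "\u06d5"                   # Kurdish short vowel ە

def normalise(text: str) -> str:
    for seq in LEGACY_SEQUENCES:
        text = text.replace(seq, PROPER_E)
    return text

# "raw_text.txt" and "normalised.txt" are placeholder file names.
with open("raw_text.txt", encoding="utf-8") as src, \
     open("normalised.txt", "w", encoding="utf-8") as dst:
    dst.write(normalise(src.read()))
```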
Then... since the
beginning of the 1990s, the Kurdistan Region of Iraq has witnessed an important
development in the field of publications. The language itself has evolved quite a
lot over the last 10-15 years. I think that a bigger corpus (though still small
by today's standards) of one million words could gather articles taken from
all the magazines published since that time.
I
think it is magazines which best capture the introduction of new vocabulary
into the language. By this I do not mean at all that other types of texts should
be excluded: to produce a balanced corpus, daily newspapers and
books should indeed be included, with texts from different fields.
Another
aspect which should not be forgotten is the legal one. It is very important
that the authors and publishers of any text not only give their texts to the
project, but also specify in written form the right to use their
texts for inclusion in the corpus. A non-commercial use agreement should be
prepared, which will allow official research agencies around the world to
distribute the corpus to all interested university departments for a nominal
fee. Without such an agreement, the texts could not be distributed for
research.
Benefits
They
are numerous and I will only summarise them.
If
the result of the work can be distributed through a linguistic research agency,
for instance ELRA-ELDA in Europe, the
availability of Kurdish data will encourage researchers throughout Europe – including
Kurds in the diaspora – to work on the language. It will increase the global
presence of Kurdish in the field of linguistic research worldwide, hence defending
the language in general (and also defending it against scientifically unfounded
allegations that there is no such thing as a “Kurdish language”...). A well-known
example of this effect is Korean: the availability of electronic resources in
that language actually produced a sharp increase in the number of studies
devoted to it.
There
are new techniques to learn in the process, since similar experiences have already been
gone through by numerous researchers, and this will allow Kurdish
researchers to link up with a whole community abroad. Besides, this work will
stimulate reflection on standardisation and will probably help in making choices,
including for future software development such as fonts and keyboard
drivers that encourage good typing practice by users. This might not look
very useful, but if we want one day a correcting dictionary (spell-checker) for
Kurdish, working under any word processor, it is necessary to first have good
typing tools!
Indeed,
the availability of a large quantity of electronic data in Kurdish
means that it will become possible in time to generate correcting tools such as those
common for Arabic or English. And the existence of such tools is in turn
important in the day-to-day defence of any language.
Then
other applications will become possible, such as a WordNet for Kurdish and, more
generally, the generation of databases following XML standards, which will greatly
help the publishing of dictionaries – online, electronic or even paper,
general or specialised. This is a necessary development for moving forward in
translation work, which is needed not only for further academic research, but also for
business and the economy in general.
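As an illustration of the kind of XML database record meant here, a dictionary entry might be generated as in the hypothetical sketch below; the element names (entry, headword, pos, sense...) are invented for illustration, not an existing standard, and a real project would follow a schema agreed by the consortium.

```python
# Hypothetical sketch of generating an XML dictionary entry.
# The element and attribute names are invented for illustration; a real
# project would follow an agreed XML lexicon schema.
import xml.etree.ElementTree as ET

entry = ET.Element("entry", lang="ku")
ET.SubElement(entry, "headword").text = "رۆژنامە"   # "newspaper"
ET.SubElement(entry, "pos").text = "noun"
sense = ET.SubElement(entry, "sense")
ET.SubElement(sense, "definition").text = "a daily or weekly publication"
ET.SubElement(sense, "example").text = "..."        # a corpus citation would go here

print(ET.tostring(entry, encoding="unicode"))
```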
Conclusion
The
institutional support for such a project might be a consortium headed (for
instance) by a computer department, including any interested institution and
in particular publishers willing to donate their files under a non-commercial
agreement. It would also provide a framework for MA students willing to work on this
project.
Corresponding members to help them could be
based in academic institutions abroad, cooperating with the consortium and
providing a link with similar research around the world. Contacts to this effect
could be made with laboratories and universities known to work in the field.
Besides the best-known labs working on “big” languages such as English, speakers of
numerous “small” languages, such as Catalan, Breton and Gaelic, now work on such
developments. Kurdish is not alone in facing these issues and, as already
mentioned, starting this type of work would provide a link with all those
people sympathetic to Kurdish needs.
(*) ggautierk@online.fr, +33 6 73 26 11 13