Open access data and open educational resources help to save endangered languages


May 15, 2015

What role does open access data play in the conservation of endangered languages? Why make samples of rare speech freely accessible on-line, and what problems arise when this kind of data is made open? Read on in a guest post by Katarzyna Klessa, adjunct at Adam Mickiewicz University in Poznań and editor of the phonetics and phonology section in Open Linguistics, and Tomasz Wicherkiewicz, lecturer at Adam Mickiewicz University in Poznań, Chair of Oriental Studies, Department for Language Policy and Minority Studies.

Linguistic diversity has unquestionably become a vital issue of global importance. Until the last decades of the 20th century, the multitude of languages and their varieties had been perceived as a local obstacle to the balanced socioeconomic development of nation-states. Thereafter, people discovered that the diversity of languages, dialects, vernaculars, patois, Mundarten, наречия or говоры, streektalen, idiomas or hablas they remembered from their youth had begun to disappear. These processes have lost their local dimension and attracted international attention, prompting institutions, organisations and academia to alert the public at large: the world’s language diversity is as endangered as its natural ecosystems. By losing their languages, people and peoples are losing their culture, their collective memory, their identity and their uniqueness. Language endangerment and even language death threaten communities of speakers all over the world, as more and more languages are not transmitted to new generations. Thus the figures of 6,500 to 7,200 languages worldwide, quoted by e.g. Ethnologue.org and SIL International, are dynamic, if not exceedingly unsteady.

Linguists are openly expressing their concern. David Crystal wrote in his book Language Death (2000: 18-19): “I consider it a plausible calculation that – at the rate things are going – the coming [21st] century will see either the death or the doom of 90% of mankind’s languages” [1], while earlier Michael Krauss (1992: 7) stated: “That means only about 600 [languages] are ‘safe’. The majority of the world’s languages are vulnerable not just to decline but to extinction. Over half the world’s languages are moribund, i.e. not effectively passed on to the next generation. A middle position could assert 50% loss in the next 100 years” [2]. A couple of years later, the same scholar offered a more drastic prognosis: “Out of approximately 5,000 to 6,000 still existing languages, over one tenth is in the last phase of their existence, and about 80 per cent of others are immediately threatened with extinction. It has been estimated that in 2050 only about 1,000 living languages will remain on earth” [3].

Language documentation

Next to revitalisation programmes and initiatives, one of the essential responses to language endangerment is the documentation of language varieties, i.e. collecting, digitising and archiving samples of written texts (if they exist) and spoken utterances. The more representative the samples (various speakers, various genres, various domains and topics), the better. The ultimate goal of language documentation is a representative corpus for a language. Such corpora exist and grow as a result of long-term, well-staffed projects with multidisciplinary teams of researchers, secure financing and a high-tech hardware and software base.

The ‘other’ languages: inne-jezyki.amu.edu.pl

Stable, representative resources are often unavailable for endangered languages and the shrinking or vanishing communities of their speakers, especially if their language variety lacks a recognised, prestigious (literary) standard.

This is the case for most language varieties spoken on the territory of Poland: the country in its present (post-WWII) shape and ethno-linguistic makeup had long been considered one of the most homogeneous and monolingual nation-states in Europe. Yet the territory of Rzeczpospolita [4] had always been inhabited by communities speaking languages other than Polish. Hence the meaning of the web-site name in Polish: inne-jezyki (‘other languages’). Of course, their array changed; the languages themselves changed, and so did the communities who spoke them and used them as a means of intra-group communication; and, finally, throughout history, the territory of Rzeczpospolita has changed like that of no other political body in Europe.

The project Poland’s Linguistic Heritage. Documentation Database for Endangered Languages (in Polish: Dziedzictwo językowe Rzeczypospolitej. Baza dokumentacji zagrożonych języków) was carried out in the years 2012-2014 and financed by a grant from the Polish National Programme for the Development of Humanities. The project team included specialists in endangered and/or minority languages and language documentation from Adam Mickiewicz University in Poznań, the University of Warsaw and the Catholic University of Lublin. The main goal of the project was to create a database for archiving annotated video- and/or audio-recorded samples of the non-Polish language varieties that constitute the linguistic heritage of Poland. The research team collected data on and recordings of those varieties, and published them as an open-access web portal, www.inne-jezyki.amu.edu.pl, in Polish- and English-language versions.

As a next step, we intend to extend the catalogued database with new, widely accessible linguistic, ethnolinguistic and sociolinguistic data. In the longer term, the collected corpora and materials can serve as linguistic sources in community-driven revitalization projects.


Data access

To make the archive as widely accessible as possible, it was decided to design an on-line repository, which can now be reached at: inne-jezyki.amu.edu.pl. Three levels of accessibility have been implemented for the database: (i) public, (ii) for educational purposes (participating teachers and their students), and (iii) confidential (only for project-team members). In the initial stage (language profiles + recordings of the first four language varieties), however, we encountered hardly any obstacles in obtaining full permission to record our informants and publish practically all data without restriction (see more in the section: Collecting data for endangered languages).
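
The three access levels can be pictured as a simple clearance check. The following is only a minimal sketch in Python, not the portal's actual implementation: the role names, the `can_view` function and the assumption that the levels are nested (project-team members also see educational and public material) are illustrative and do not come from the project itself.

```python
from enum import IntEnum


class Access(IntEnum):
    """The three accessibility levels named in the text."""
    PUBLIC = 1        # (i) open to everyone
    EDUCATIONAL = 2   # (ii) participating teachers and their students
    CONFIDENTIAL = 3  # (iii) project-team members only


# Hypothetical role names; each maps to the highest level that role may see.
ROLE_CLEARANCE = {
    "visitor": Access.PUBLIC,
    "teacher": Access.EDUCATIONAL,
    "student": Access.EDUCATIONAL,
    "team_member": Access.CONFIDENTIAL,
}


def can_view(role: str, item_level: Access) -> bool:
    """Return True if the role's clearance covers the item's level.

    Assumption: levels are nested, so a higher clearance includes
    everything below it. Unknown roles are treated as public visitors.
    """
    clearance = ROLE_CLEARANCE.get(role, Access.PUBLIC)
    return item_level <= clearance
```

Under these assumptions, `can_view("teacher", Access.EDUCATIONAL)` is true while `can_view("visitor", Access.EDUCATIONAL)` is false, so a recording whose speaker prefers not to be openly recognisable could simply be filed at a more restricted level.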

Contents of the Poland’s Linguistic Heritage Database

The resources compiled under the Poland’s Linguistic Heritage project include:

1. extended fact sheets/profiles on history, location and range, endangerment, speakers’ community, standardization efforts, language contacts, subvarieties, state of research and literature in and on the following language varieties or dialect-clusters:
– Polish Yiddish (mainly in diaspora),
– Latgalian (in former Polish Livonia, now in Latgale),
– Wilamowicean (language exclave of Wilamowice in southern Poland),
– Hałcnowian (within the former German Bielitz-Bialaer Sprachinsel in southern Poland),
– Lithuanian,
– Belarusian,
– Ukrainian,
– Rusyn-Lemko,
– Podlachian and Polesian varieties,
– Old-Believers’ Russian,
– Spiš and Orava varieties of the Polish-Slovak borderland,
– Lachian varieties of the Polish-Moravian-Czech borderland,
– Czech dialects of the Kłodzko/Kladsko Valley and Zelów/Zelov enclave,
– Silesian dialects of German,
– German varieties of Central Poland, Galicia, Greater Poland and Prussia,
– Low German dialects (including the Mennonites’ Plautdietsch),
– Romani varieties,
– Karaim,
– language varieties spoken in the past by Polish Armenians and Tatars;

2. a database compiled during the field-work stage of the project, including commented and annotated language samples/sources with metadata (spoken, narrated, sung, printed, manuscript and iconographic) for four of the above-listed language varieties: Polish Yiddish, Latgalian, Wilamowicean and Hałcnowian (in the last case, the project team managed to find and register all living speakers of the lect; several aged informants from Hałcnów and Wilamowice passed away during the project, which makes the archived materials truly unique);

3. digitalised written texts (with translations and transliterations), digital(ised) sound, audiovisual and multimodal recordings of utterances and performative speech acts (in large parts transliterated, transcribed, annotated and translated) in the four languages;

4. photodocumentation of ethno-linguistic artifacts in or related to the aforementioned languages; preliminary linguistic analyses (phonetics, prosody, language contacts) of the archived materials.

Annotating & translating the original data

We decided to provide at least a part of our recordings with various levels / types of annotation, e.g. orthographic (adopting a nomic spelling system or transliteration), phonetic / phonemic transcription, morphological glossing as well as ethno-/extralinguistic comments. This should make the archived materials comprehensible and universally functional for specialists who do not know the relevant language/variety, but use it (or intend to use it) for a range of purposes in diverse disciplines: linguistics, including socio- and ethnolinguistics, language policy, linguistic anthropology, ethnology and ethnography, including folklore studies, ethnomusicology, ethnotaxonomies, or inter- and transdisciplines, such as regional studies, minority studies, etc.
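
The idea of several parallel annotation levels attached to one stretch of a recording can be sketched as a small data structure. This is a hypothetical illustration in Python, not the format used in the database: the class name `Segment`, the tier names and the time fields are assumptions chosen to mirror the annotation types listed above, and the annotation text is a placeholder rather than real language data.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Tier names are illustrative labels for the annotation types in the text.
TIER_NAMES = ("orthographic", "phonetic", "gloss", "comment")


@dataclass
class Segment:
    """One time-aligned stretch of a recording with its annotation tiers."""
    start: float  # seconds from the beginning of the recording
    end: float
    tiers: Dict[str, str] = field(default_factory=dict)

    def annotate(self, tier: str, text: str) -> None:
        """Store text on a known tier; reject tier names not in TIER_NAMES."""
        if tier not in TIER_NAMES:
            raise ValueError(f"unknown tier: {tier}")
        self.tiers[tier] = text

    def missing_tiers(self) -> List[str]:
        """List the tiers that have not been filled in yet."""
        return [name for name in TIER_NAMES if name not in self.tiers]


# Usage with placeholder content only.
segment = Segment(start=0.0, end=2.5)
segment.annotate("orthographic", "<orthographic form>")
todo = segment.missing_tiers()  # tiers still awaiting annotation
```

Keeping the tiers separate in this way is what would let a specialist who does not know the variety work from the gloss or the comments alone, while a phonetician consults only the transcription.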

Collecting data for endangered languages

Fieldwork and archive queries

The materials for the Poland’s Linguistic Heritage database have been (and continue to be) gathered during field-work sessions, mostly at the informants’ places of residence. Owing to the transborder dimension and diachronic character of the project, we had to travel not only to regions of the country where lesser-used languages are spoken, but also abroad (e.g. to Latvian Latgalia/Latgale or to individual speakers of Polish Yiddish living in Western European cities). We have also queried larger language archives, such as Archiv für Gesprochenes Deutsch or Ústav pro jazyk český. Information concerning mixed or transitional varieties has also been derived from, compared with and cross-referenced to Polish dialectal resources, e.g. Dialekty i gwary polskie.

Problematic issues

As also observed in previous research, when working with persons belonging to (ethnic/linguistic) minorities or with institutions representing them, the documenting researcher(s) and informant(s) for the most part quickly find a common language. On the other hand, one has to take into account that in many regions, belonging to a minority, be it national, ethnic, confessional or linguistic, has constituted an obstacle, objective or subjective, within the majority society. This is the case for the generation that remembers, e.g., the communist era in Poland. We may therefore expect that some (minority-language) informants will prefer not to be fully and openly recognisable, and that their recordings should be made accessible only to a limited audience.

What is more, recording (and publishing) samples of previously disregarded, low-prestige or (supposedly) unacceptable language varieties is in itself a way of raising the (self-)esteem of their speakers and reintegrating their speech communities. Most speakers of the documented microlects are very advanced in years (usually born before or during WWII), and they were brought up in the belief that the only “correct” and fully fledged language to be used is standard Polish. Our projects therefore not only have a documentary dimension, but also act as instruments for reviving and replenishing the local language repertoire and identity.

The practical impact of the project – some examples

A clear example of the practical influence and real-world outcomes of the Heritage project is the case of the Wymysiöeryś microethnolect of Wilamowice/Wymysoü in southern Poland. Documentation programmes initiated by our project stimulated an array of revitalisation ventures, e.g. Endangered languages. Comprehensive models for research and revitalization, or the academia-cum-community-run Dokumentacja językowego i kulturowego dziedzictwa Wilamowic and Stworzenie klastra turystycznego w gminie Wilamowice w oparciu o język wymysiöeryś i związaną z nim kulturę. The project’s objectives and results have several times been presented to and by the community of Wilamowice, on occasions including International Mother Language Day, celebrated in Wilamowice for the first time in history in February 2015 (with a theatrical performance of Hobbit in Wilamowicean), the International Conference on Endangered Languages in Wilamowice in June 2014, and the Conference on Endangered Languages in the Polish Parliament in November 2013.

The neighbouring settlement of Hałcnów/Ałza/Alzen, formerly a dialectal link with(in) the so-called Bielitz-Bialaer Sprachinsel and for a long time considered extinct, is modestly regaining its ancient native language variety, called Päuersch. Thanks to our project, all living speakers of Hałcnowian have been identified, and the language is being described in an academic monograph by Marek Dolatowski from A. Mickiewicz University in Poznań. For the first time in post-war history, Hałcnowian resounded in public during the 2015 International Mother Language Day in Wilamowice. The very last users speak up…

The same reaction has been noticed among the Russian Old-Believers in Masuria, a few elderly members of whom speak a very interesting translanguage containing Russian, German and Polish strata. We expect similar effects among speakers of other German varieties in Poland, which were preliminarily described in the first stage and urgently await documentation. Preparatory arrangements are also being made to document the Czech dialect(s) spoken in the town of Zelów/Zelov and the Kłodzko/Kladsko Valley, Lithuanian dialects spoken in the northeasternmost part of the Suwałki region/Suvalkija, and the Belarusian dialects of Podlachia/Podlasie/Падляшша.


Feedback from speaker communities and researchers

We have received varied feedback from speakers of the described and documented language varieties, as well as positive comments from persons who know of the languages (through their grandparents’ generation) but have never heard them.

Teachers and educationalists keep expressing their appreciation for the openly accessible and high-quality teaching materials on language history, ethno-linguistic diversity, language contacts etc.

The scholarly profile of our project has been highly appreciated by such partners as the Central Archive for German Yiddish and the Ukrainian Academy of Sciences, who intend to share our resources or use the model of the entire project. In this way, and hopefully on a much wider scale in the future, our documentation resources will become verifiable by scholars, available, and compatible with similar documentation centres for endangered languages around the world.

Experts of the National Programme for the Development of Humanities at the Polish Ministry of Science and Higher Education declared the results and outcomes of the project a benchmark for project implementation.

Languagesindanger.eu – open education

The (first stage of the) project was carried out in parallel with another one: INNET – Innovative Networking in Infrastructure for Endangered Languages, run by an international consortium: University of Cologne, Max Planck Institute in Nijmegen, Research Institute for Linguistics in Budapest and Adam Mickiewicz University in Poznań. The main task of the Polish team was preparing, publishing and disseminating teaching materials concerning endangered languages and their documentation. All teaching materials and interactive self-study exercises are freely accessible on-line at: www.languagesindanger.eu.


[1] – Crystal, David 2000. Language Death. Cambridge University Press.
[2] – Krauss, Michael 1992. “The world’s languages in crisis”, Language 68: 4-10.
[3] – Krauss, Michael 1998. “The Scope of the Language Endangerment Crisis and Recent Response to It”, in: K. Matsumura (ed.) Studies in Endangered Languages. Tokyo: Hituzi Syobo, 103-106.
[4] – Rzeczpospolita (a calque from Latin res publica) is the Polish endonymic term referring to the consecutive state(hood) forms: Commonwealth of Both Nations (I Rzeczpospolita), the interbellum independent Polish state (II Rzeczpospolita – 1918-1945), the People’s Republic of Poland (1945/1952-1989), and the III Rzeczpospolita (since 1990).

1 – Old-Believers’ Russian book, photo by Tomasz Wicherkiewicz.
2 – Commercial advertisement in the Karaim language in the village of Troki (Trakai), photo by Tomasz Wicherkiewicz.
3 – Inhabitants of Stare Bielsko Village in 1935, photo from Wagner, Richard Ernst 1935. Der Beeler Psalter. Die Bielitz-Bialaer deutsche mundartliche Dichtung. Katowice: Kattowitzer Buchdruckerei u. Verlags – Sp. Akc.
4 – Signage in Polish and a Podlachian variety, photo by Renessaince, licensed under CC-BY-3.0.

