Low Resource Languages
Resources for conservation, development, and documentation of low resource (human) languages.
According to some estimates, half of the roughly 7,000 currently spoken languages are expected to become extinct this century. However, academics, independent scholars, organizations, communities, and individuals are doing a lot of work to stop or slow this trend. This list collects open source code that is useful for documenting, conserving, developing, preserving, or working with endangered languages.
Slack Group
We have a Slack group for live discussion. Join Us Here!
Publication
A white paper describing this repository was published at the LREC 2016 CCURL Workshop (Collaboration and Computing for Under-Resourced Languages). The paper is in this repository, in the papers folder. Download the raw paper here: Open Source Code Serving Endangered Languages.
Contribute
To edit this list on GitHub, simply click here. If you would like to discuss anything at all related to this, please open an issue. If you know of any resource available that is not on this list, please add it, either using the link above or by submitting pull requests.
There are more details on contributing in the CONTRIBUTING guide.
If you're interested in discussing the list in some offline capacity, get in touch with @RichardLitt. I'd be more than happy to have a phone call or email exchange.
Table of Contents
Table of Contents generated with DocToc
- Definitions
- Generic Repositories
- Keyboard Layout Configuration Helpers
- Annotation
- Format Specifications
- i18n-related Repositories
- Audio automation
- Text-to-Speech (TTS)
- Automatic Speech Recognition (ASR)
- Text automation
- Experimentation
- Flashcards
- Natural language generation
- Computing systems
- Android Applications
- Chrome Extensions
- FieldDB
- Academic Research Paper-Specific Repositories
- Example Repositories
- Fonts
- Corpora
- Organizations
- Tutorials
- Language Specific Projects
- License
Definitions
Endangered languages are human languages that are in danger of extinction. This list also encompasses minority languages, which are spoken by a stable but small population (for example, Maltese or Hawaiian), and low- or under-resourced languages, which may be spoken by a large population but are under-represented digitally (for instance, Quechua). These languages share certain characteristics; the most pertinent is sparse data and a lack of resources, ranging from spell-checkers to grammars to machine translation corpora. Other under-resourced languages that do not fall under this list include constructed languages (for instance, Klingon or Na'vi), computer languages (for instance, JavaScript or Lua), and extinct languages whose records are so sparse as to be computationally irrelevant for most purposes (for instance, Tocharian).
Open Source "promotes a universal access via a free license to a product's design or blueprint, and universal redistribution of that design or blueprint, including subsequent improvements to it by anyone." (Wiki). This is important because money and resources allocated to a language or project that is not open source are spent at the expense of possible extensibility elsewhere.
This list used to be named endangered-languages. It was renamed because endangerment is a loaded term that may not reflect the views of language communities speaking minority languages. The name low-resource-languages focuses this list on a lack of digital resources compared to other, high-resourced languages.
Tools which are built for these languages are not included (unless relevant for dialects or variants): Arabic, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, Flemish, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Norwegian, Norwegian (Bokmål), Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Thai, Turkish, Ukrainian, Valencian, Vietnamese. This list comes from the list of most popular content languages for websites, on this Wikipedia page. Other metrics could be used - if you have another one, please suggest it!
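For contributors deciding whether a tool belongs here, the exclusion rule above can be sketched as a small check. This is only an illustration under stated assumptions: the function name, its parameters, and the choice to admit language-agnostic tools and tools that cover at least one non-excluded language are hypothetical, not a policy or script from this repository.

```python
# Hypothetical sketch of the inclusion rule; not part of this repository.
# The set below is copied from the exclusion list in the paragraph above.
EXCLUDED_LANGUAGES = frozenset({
    "Arabic", "Bulgarian", "Catalan", "Chinese", "Croatian", "Czech",
    "Danish", "Dutch", "English", "Estonian", "Finnish", "Flemish",
    "French", "German", "Greek", "Hebrew", "Hungarian", "Indonesian",
    "Italian", "Japanese", "Korean", "Latvian", "Lithuanian", "Norwegian",
    "Norwegian (Bokmål)", "Persian", "Polish", "Portuguese", "Romanian",
    "Russian", "Serbian", "Slovak", "Slovenian", "Spanish", "Swedish",
    "Thai", "Turkish", "Ukrainian", "Valencian", "Vietnamese",
})


def qualifies_for_list(target_languages, covers_dialect_or_variant=False):
    """Return True if a tool plausibly fits the inclusion rule.

    target_languages: names of the languages the tool was built for.
    covers_dialect_or_variant: True when the tool is relevant for a dialect
        or variant of an excluded language (the exception noted above).
    """
    if covers_dialect_or_variant:
        return True
    # Assumption: a tool with no specific target language is generic
    # and therefore still in scope.
    if not target_languages:
        return True
    # Assumption: one non-excluded target language is enough to qualify.
    return any(lang not in EXCLUDED_LANGUAGES for lang in target_languages)
```

Under these assumptions, `qualifies_for_list(["Quechua"])` is true, while `qualifies_for_list(["English", "French"])` is false unless the dialect or variant exception applies.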
This list is particularly good at one thing: showing the kinds of tools that exist in the field, generically. For in-depth research into a specific language or tool suite, however, it does not perform exceptionally well. For instance, listing all of the Firefox language packs or Apertium language modules for each low resource language would be unhelpful, as would including all of the tools available for Basque noted in the ACL Wiki, which would mainly mean cataloguing tools from the IXA group, some of which are open source and some of which are not. Instead, view this list as a starting point for more research.
Looking for resources for code languages? Take a look at the awesome lists collection.
Generic Repositories
Single language lexicography projects and utilities
Utilities
- Project for Free Electronic Dictionaries - A project for a Java MIDlet for mobile phones, for indigenous language dictionaries.
- Webonary - Site that hosts digital dictionaries for single languages.
- WeSay - Allows language communities to build their own dictionaries. https://software.sil.org/wesay/ (by SIL International).
Software
- 4lang - Concept dictionary using Eilenberg machines.
- accentuate.us - a.k.a. "charlifter". Statistical Unicodification of plain text for many languages.
- alignment-with-openfst - This is an implementation of the CRF autoencoder framework for four tasks: bitext word alignment, part-of-speech tagging, code switching, dependency parsing.
- Apertium - Apertium is a toolbox to build open-source shallow-transfer machine translation systems, especially suitable for related language pairs: it includes the engine, maintenance tools, and open linguistic data for several language pairs.
- ark-tweet-nlp - CMU ARK Twitter Part-of-Speech Tagger (Fork).
- ArtOfReading - Index and processing scripts related to the Art Of Reading illustration collection.
- bayesline - A Multinomial Bayesian Classification for Language Identification.
- bible-corpus-tools - A collection of tools for reading/processing the multilingual Bible corpus.
- BloomDesktop - Bloom Desktop is a hybrid c#/javascript/html/css Windows application that dramatically "lowers the bar" for language communities who want books in their own languages. Bloom delivers a low-training, high-output system where mother tongue speakers and their advocates work together to foster both community authorship and access to external materia… https://bloomlibrary.org/.
- BloomLibrary - Bloom Library Single Page App, using AngularJS & Bootstrap, Parse.com backend. https://bloomlibrary.org/.
- brain - Neural networks in JavaScript.
- Bristol Uni MT Morphology tools - This repo is a mirror of scripts previously available on http://www.cs.bris.ac.uk/Research/MachineLearning/Morphology/resources.jsp. Included: Ukwabelana - An open-source morphological Zulu corpus and EMMA: A Novel Evaluation Metric for Morphological Analysis.
- brown-cluster - C++ implementation of the Brown word clustering algorithm.
- CasualConc - CasualConc is a concordance program that runs natively on Mac OS X 10.5 Leopard or later. It was originally designed for casual use (preliminary analysis or non-research purposes), though [the maintainer] has been using it for his own research (and many others have). It can generate KWIC concordance lines, word clusters, collocation analysis, and word counts.
- cdec - Decoder, aligner, and model optimizer for statistical machine translation and other structured prediction models based on (mostly) context-free formalisms.
- charlint - Charlint is a character normalization/checking tool written in Perl. Among other things, it implements Normalization Form C of Unicode TR 15, as a test platform for Early Uniform Normalization in the W3C Character Model.
- chorus - A version control system designed to enable workflows appropriate for typical language development teams who are geographically distributed.
- clam - Computational Linguistics Application Mediator -- Quickly turn NLP applications into RESTful webservices with a web-application front-end. You provide a specification of your command line application, its input, output and parameters, and CLAM wraps around your application to form a fully fledged RESTful webservice.
- CMU Sphinx - CMUSphinx is a speaker-independent large vocabulary continuous speech recognizer released under a BSD-style license. It is also a collection of open source tools and resources that allow researchers and developers to build speech recognition systems.
- cnminlangwebcollect - Chinese minorities website languages detection and websites collection.
- Cog - Cog is a tool for comparing languages using lexicostatistics and comparative linguistics techniques. It can be used to automate much of the process of comparing word lists from different language varieties. http://sillsdev.github.io/cog/.
- convertextract - Convert Excel, Word and PowerPoint files with non-Unicode text (like text requiring SIL fonts) into Unicode, while preserving original file's formatting.
- CorpusTools - Phonological CorpusTools http://phonologicalcorpustools.github.io/CorpusTools/.
- CTK - Built around LDC's champollion sentence aligner kernel, Champollion Tool Kit (CTK) aims to provide ready-to-use parallel text sentence alignment tools for as many language pairs as possible. (Original project is on SourceForge: http://champollion.sourceforge.net).
- DataTags - A system to assess the sensitivity and privacy risk of a dataset, and assign a tag to describe how the dataset must be transferred, stored and accessed. (Fork).
- dataverse - A data repository framework to share and publish research data.
- Dative - Dative: software for linguistic fieldwork http://www.dative.ca.
- dative - A single-page application that interacts with multiple linguistic fieldwork web service databases. Website.
- DeepLearnToolbox - Matlab/Octave toolbox for deep learning. Includes Deep Belief Nets, Stacked Autoencoders, Convolutional Neural Nets, Convolutional Autoencoders and vanilla Neural Nets. Each method has examples to get you started.
- Desmeme - Database and tools for exploring linguistic templates.
- dictdb - Dictionary database for language translation.
- discoursegraphs - Python-based tool to convert and merge multilayer annotated linguistic data.
- divvun-gramcheck - This program does FST lookup on forms specified as Constraint Grammar format readings, and looks up error-tags in an XML file with human-readable messages. It is meant to be used as a late stage of a grammar checker pipeline.
- divvun-keyboard - Keyboard apps for iOS and Android with keyboard layouts for indigenous and minority languages.
- divvunspell - hfst-ospell (below) rewritten in Rust, for robust concurrency and memory management. In practical use it is about 10x faster than hfst-ospell. It uses the same zhfst files as hfst-ospell, which are available for all languages in the GiellaLT GitHub org (see below).