1. What does this library do?
Its task is simple: it tells you which language some text is written in. This is very useful as a preprocessing step for linguistic data in natural language processing applications such as text classification and spell checking. Other use cases might include routing e-mails to the geographically appropriate customer service department based on the language they are written in.
2. Why does this library exist?
Language detection is often done as part of large machine learning frameworks or natural language processing applications. In cases where you don't need the full-fledged functionality of those systems or don't want to learn the ropes of those, a small flexible library comes in handy.
Python is widely used in natural language processing, so there are several comprehensive open-source libraries for this task, such as Google's CLD 2 and CLD 3, Langid, FastText, FastSpell, Simplemma and Langdetect. Unfortunately, most of them have two major drawbacks:
- Detection only works with quite lengthy text fragments. For very short text snippets such as Twitter messages, they do not provide adequate results.
- The more languages take part in the decision process, the less accurate the detection results become.
Lingua aims to eliminate these problems. It needs hardly any configuration and yields accurate results on both long and short text, even on single words and phrases. It draws on both rule-based and statistical methods but does not use any dictionaries of words. It does not need a connection to any external API or service either. Once the library has been downloaded, it can be used completely offline.
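The statistical half of this approach can be illustrated with a toy character-trigram model. The sketch below is a heavily simplified illustration built on made-up sample sentences, not Lingua's actual implementation or training data:

```python
from collections import Counter

def trigrams(text: str) -> Counter:
    """Count the character trigrams of a lowercased text."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

# Toy per-language "models" built from a single sample sentence each;
# a real statistical model is trained on a large corpus per language.
SAMPLES = {
    "English": "the quick brown fox jumps over the lazy dog and then the cat",
    "German": "der schnelle braune fuchs springt ueber den faulen hund",
    "Spanish": "el rapido zorro marron salta sobre el perro perezoso",
}
MODELS = {lang: trigrams(text) for lang, text in SAMPLES.items()}

def detect(text: str) -> str:
    """Pick the language whose trigram counts overlap the input the most."""
    seen = trigrams(text)
    def overlap(model: Counter) -> int:
        return sum(min(count, model[gram]) for gram, count in seen.items())
    return max(MODELS, key=lambda lang: overlap(MODELS[lang]))

print(detect("the dog jumps"))  # English
```

With realistic training corpora, many n-gram orders, and additional rules for unambiguous alphabets, the same basic idea scales to dozens of languages without any word dictionaries or network access.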
3. A short history of this library
This library started as a pure Python implementation. Python's quick prototyping capabilities contributed significantly to its early improvements. Unfortunately, there was always a tradeoff between performance and memory consumption. At first, Lingua's language models were stored in dictionaries at runtime. This yielded fast lookups at the cost of high memory consumption (more than 3 GB). The language models were therefore moved into NumPy arrays instead, which reduced memory consumption to approximately 800 MB but caused CPU performance to drop significantly. Neither approach was satisfactory.
Starting with version 2.0.0, the pure Python implementation was replaced with compiled Python bindings to the native Rust implementation of Lingua. This change has led to both fast performance and a small memory footprint of less than 1 GB. The pure Python implementation is still available in a separate branch of this repository and will be kept up to date in subsequent 1.* releases. Both the 1.* and 2.* versions will remain available on the Python Package Index (PyPI).
4. Which languages are supported?
Compared to other language detection libraries, Lingua's focus is on quality over quantity, that is, getting detection right for a small set of languages first before adding new ones. Currently, the following 75 languages are supported:
- Afrikaans
- Albanian
- Arabic
- Armenian
- Azerbaijani
- Basque
- Belarusian
- Bengali
- Norwegian Bokmal
- Bosnian
- Bulgarian
- Catalan
- Chinese
- Croatian
- Czech
- Danish
- Dutch
- English
- Esperanto
- Estonian
- Finnish
- French
- Ganda
- Georgian
- German
- Greek
- Gujarati
- Hebrew
- Hindi
- Hungarian
- Icelandic
- Indonesian
- Irish
- Italian
- Japanese
- Kazakh
- Korean
- Latin
- Latvian
- Lithuanian
- Macedonian
- Malay
- Maori
- Marathi
- Mongolian
- Norwegian Nynorsk
- Persian
- Polish
- Portuguese
- Punjabi
- Romanian
- Russian
- Serbian
- Shona
- Slovak
- Slovene
- Somali
- Sotho
- Spanish
- Swahili
- Swedish
- Tagalog
- Tamil
- Telugu
- Thai
- Tsonga
- Tswana
- Turkish
- Ukrainian
- Urdu
- Vietnamese
- Welsh
- Xhosa
- Yoruba
- Zulu
5. How accurate is it?
Lingua can report accuracy statistics for the bundled test data that is available for each supported language. The test data for each language is split into three parts:
- a list of single words with a minimum length of 5 characters
- a list of word pairs with a minimum length of 10 characters
- a list of complete grammatical sentences of various lengths
Both the language models and the test data have been created from separate documents of the Wortschatz corpora offered by Leipzig University, Germany. Data crawled from various news websites has been used for training, each corpus comprising one million sentences. For testing, corpora made of arbitrarily chosen websites have been used, each comprising ten thousand sentences. From each test corpus, random unsorted subsets of 1,000 single words, 1,000 word pairs and 1,000 sentences have been extracted.
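The subset extraction described above can be sketched with the standard library's sampling. The corpus below is a stand-in list, not the actual Wortschatz files or the script used to build the test data:

```python
import random

def extract_subset(corpus_lines, size=1000, seed=0):
    """Draw a random, unordered subset of entries from a test corpus."""
    rng = random.Random(seed)  # fixed seed only to keep this sketch reproducible
    return rng.sample(corpus_lines, size)  # sampling without replacement

# Stand-in for a test corpus of ten thousand sentences.
corpus = [f"sentence {i}" for i in range(10_000)]
subset = extract_subset(corpus, size=1000)
print(len(subset), len(set(subset)))  # 1000 1000 (no duplicates)
```

Because `random.sample` draws without replacement, each of the 1,000 entries is distinct, matching the "random unsorted subset" described above.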
Given the generated test data, I have compared the detection results of Lingua, FastText, FastSpell, Langdetect, Langid, Simplemma, CLD 2 and CLD 3 on the data for all 75 languages supported by Lingua. Languages that are not supported by one of the other detectors are simply ignored for that detector during the detection process.
Each of the following sections contains two plots. The bar plot shows the detailed accuracy results for each supported language. The box plot illustrates the distribution of the accuracy values for each classifier. Each box represents the range within which the middle 50 % of the data lie; the horizontal line inside each colored box marks the median of the distribution.
5.1 Single word detection
(Bar plot and box plot)
5.2 Word pair detection
(Bar plot and box plot)
5.3 Sentence detection
(Bar plot and box plot)
5.4 Average detection
(Bar plot and box plot)
5.5 Mean, median and standard deviation
The table below shows detailed statistics for each language and classifier including mean, median and standard deviation.
Afrikaans

| Classifier | Average | Single Words | Word Pairs | Sentences |
|---|---|---|---|---|
| Lingua (high accuracy mode) | 79 | 58 | 81 | 97 |
| Lingua (low accuracy mode) | 64 | 38 | 62 | 93 |
| Langdetect | 67 | 37 | 66 | 98 |
| FastText | 36 | 11 | 23 | 74 |
| FastSpell (conservative mode) | 70 | 49 | 67 | |
| FastSpell (aggressive mode) | 73 | 50 | 74 | |
| Langid | 30 | 1 | 10 | |
| CLD3 | 55 | 22 | 46 | |
| CLD2 | 55 | 13 | 56 | |
| Simplemma | - | - | - | |
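As an illustration of how such summary statistics are derived, the sketch below computes mean, median and standard deviation for the single-word accuracy values of Afrikaans from the table above, skipping the classifier that does not support Afrikaans (the "-" entry). It uses the sample standard deviation; the library's published statistics may use a different convention:

```python
import statistics

# Single-word accuracy values (%) for Afrikaans, one per classifier;
# the "-" entry (unsupported language) is left out.
accuracies = [58, 38, 37, 11, 49, 50, 1, 22, 13]

mean = statistics.mean(accuracies)      # 31
median = statistics.median(accuracies)  # 37
stdev = statistics.stdev(accuracies)    # ~20.01, sample standard deviation

print(f"mean={mean:.1f} median={median} stdev={stdev:.1f}")
```

The large standard deviation relative to the mean reflects how widely the classifiers disagree on short inputs, which is exactly the weakness described in section 2.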