lingua - 多语言文本检测库，低配置需求，高精度

[![ci build status][ci badge]][ci url] [![codecov][codecov badge]][codecov url] ![supported languages][supported languages badge] [![Kotlin platforms badge][Kotlin platforms badge]][Kotlin platforms url] [![license badge][license badge]][license url] [![javadoc][javadoc badge]][javadoc url] [![Maven Central][Maven Central badge]][Maven Central] [![Download][lingua version badge]][lingua download url]

1. What does this library do?

Its task is simple: It tells you which language some provided textual data is written in. This is very useful as a preprocessing step for linguistic data in natural language processing applications such as text classification and spell checking. Other use cases, for instance, might include routing e-mails to the right geographically located customer service department, based on the e-mails' languages.

2. Why does this library exist?

Language detection is often done as part of large machine learning frameworks or natural language processing applications. In cases where you don't need the full-fledged functionality of those systems or don't want to learn the ropes of those, a small flexible library comes in handy.

So far, three other comprehensive open source libraries working on the JVM for this task are [Apache Tika], [Apache OpenNLP] and [Optimaize Language Detector]. Unfortunately, especially the latter has three major drawbacks:

Detection only works with quite lengthy text fragments. For very short text snippets such as Twitter messages, it doesn't provide adequate results.
The more languages take part in the decision process, the less accurate are the detection results.
Configuration of the library is quite cumbersome and requires some knowledge about the statistical methods that are used internally.

Lingua aims at eliminating these problems. It nearly doesn't need any configuration and yields pretty accurate results on both long and short text, even on single words and phrases. It draws on both rule-based and statistical methods but does not use any dictionaries of words. It does not need a connection to any external API or service either. Once the library has been downloaded, it can be used completely offline.

3. Which languages are supported?

Compared to other language detection libraries, Lingua's focus is on quality over quantity, that is, getting detection right for a small set of languages first before adding new ones. Currently, the following 75 languages are supported:

A
- Afrikaans
- Albanian
- Arabic
- Armenian
- Azerbaijani
B
- Basque
- Belarusian
- Bengali
- Norwegian Bokmal
- Bosnian
- Bulgarian
C
- Catalan
- Chinese
- Croatian
- Czech
D
- Danish
- Dutch
E
- English
- Esperanto
- Estonian
F
- Finnish
- French
G
- Ganda
- Georgian
- German
- Greek
- Gujarati
H
- Hebrew
- Hindi
- Hungarian
I
- Icelandic
- Indonesian
- Irish
- Italian
J
- Japanese
K
- Kazakh
- Korean
L
- Latin
- Latvian
- Lithuanian
M
- Macedonian
- Malay
- Maori
- Marathi
- Mongolian
N
- Norwegian Nynorsk
P
- Persian
- Polish
- Portuguese
- Punjabi
R
- Romanian
- Russian
S
- Serbian
- Shona
- Slovak
- Slovene
- Somali
- Sotho
- Spanish
- Swahili
- Swedish
T
- Tagalog
- Tamil
- Telugu
- Thai
- Tsonga
- Tswana
- Turkish
U
- Ukrainian
- Urdu
V
- Vietnamese
W
- Welsh
X
- Xhosa
Y
- Yoruba
Z
- Zulu

4. How good is it?

Lingua is able to report accuracy statistics for some bundled test data available for each supported language. The test data for each language is split into three parts:

a list of single words with a minimum length of 5 characters
a list of word pairs with a minimum length of 10 characters
a list of complete grammatical sentences of various lengths

Both the language models and the test data have been created from separate documents of the [Wortschatz corpora] offered by Leipzig University, Germany. Data crawled from various news websites have been used for training, each corpus comprising one million sentences. For testing, corpora made of arbitrarily chosen websites have been used, each comprising ten thousand sentences. From each test corpus, a random unsorted subset of 1000 single words, 1000 word pairs and 1000 sentences has been extracted, respectively.

Given the generated test data, I have compared the detection results of Lingua, Apache Tika, Apache OpenNLP and Optimaize Language Detector using parameterized JUnit tests running over the data of Lingua's supported 75 languages. Languages that are not supported by the other libraries are simply ignored for those during the detection process.

Each of the following sections contains two plots. The bar plot shows the detailed accuracy results for each supported language. The box plot illustrates the distributions of the accuracy values for each classifier. The boxes themselves represent the areas which the middle 50 % of data lie within. Within the colored boxes, the horizontal lines mark the median of the distributions.

4.1 Single word detection

Bar plot

4.2 Word pair detection

Bar plot

4.3 Sentence detection

Bar plot

4.4 Average detection

Bar plot

4.5 Mean, median and standard deviation

The table below shows detailed statistics for each language and classifier including mean, median and standard deviation.

Open table

Language	Average					Single Words					Word Pairs					Sentences
	Lingua (high accuracy mode)	Lingua (low accuracy mode)	Tika	OpenNLP	Optimaize	Lingua (high accuracy mode)	Lingua (low accuracy mode)	Tika	OpenNLP	Optimaize	Lingua (high accuracy mode)	Lingua (low accuracy mode)	Tika	OpenNLP	Optimaize	Lingua (high accuracy mode)	Lingua (low accuracy mode)	Tika	OpenNLP	Optimaize
Afrikaans	79	64	71	72	39	58	38	44	41	3	81	62	70	75	22	97	93	98	99	93
Albanian	88	80	79	71	70	69	54	54	40	38	95	86	84	73	73	100	99	99	100	98
Arabic	98	94	97	84	89	96	88	94	65	72	99	96	99	88	94	100	99	100	99	100
Armenian	100	100	-	100	-	100	100	-	100	-	100	100	-	100	-	100	100	-	100	-
Azerbaijani	90	82	-	82	-	77	71	-	60	-	92	78	-	86	-	99	96	-	99	-
Basque	84	74	83	77	66	71