Link Grammar Parser
Version 5.12.5
The Link Grammar Parser exhibits the linguistic (natural language) structure of English, Thai, Russian, Arabic, Persian and limited subsets of a half-dozen other languages. This structure is a graph of typed links (edges) between the words in a sentence. One may obtain the more conventional HPSG (constituent) and dependency-style parses from Link Grammar by applying a collection of rules to convert to these different formats. This is possible because Link Grammar goes a bit "deeper" into the "syntactico-semantic" structure of a sentence: it provides considerably more fine-grained and detailed information than is commonly available from conventional parsers.
The theory of Link Grammar parsing was originally developed in 1991 by Davy Temperley, John Lafferty and Daniel Sleator, at the time professors of linguistics and computer science at Carnegie Mellon University. The three initial publications on this theory provide the best introduction and overview; since then, there have been hundreds of publications further exploring, examining and extending the ideas.
Although based on the original Carnegie-Mellon code base, the current Link Grammar package has dramatically evolved and is profoundly different from earlier versions. There have been innumerable bug fixes; performance has improved by several orders of magnitude. The package is fully multi-threaded, fully UTF-8 enabled, and has been scrubbed for security, enabling cloud deployment. Parse coverage of English has been dramatically improved; other languages have been added (most notably, Thai and Russian). There is a raft of new features, including support for morphology, dialects, and a fine-grained weight (cost) system, allowing vector-embedding-like behaviour. There is a new, sophisticated tokenizer tailored for morphology: it can offer alternative splittings for morphologically ambiguous words. Dictionaries can be updated at run-time, enabling systems that perform continuous learning of grammar to also parse at the same time. That is, dictionary updates and parsing are mutually thread-safe. Classes of words can be recognized with regexes. Random planar graph parsing is fully supported; this allows uniform sampling of the space of planar graphs. A detailed report of what has changed can be found in the ChangeLog.
This code is released under the LGPL license, making it freely available for both private and commercial use, with few restrictions. The terms of the license are given in the LICENSE file included with this software.
Please see the main web page for more information. This version is a continuation of the original CMU parser.
New!
As of version 5.9.0, the system includes an experimental system for generating sentences. These are specified using a "fill in the blanks" API, in which words are substituted into wild-card locations whenever the result is a grammatically valid sentence. Additional details are in the man page: man link-generator (in the man subdirectory).
This generator is used in the OpenCog Language Learning project, which aims to automatically learn Link Grammars from corpora, using brand-new and innovative information theoretic techniques, somewhat similar to those found in artificial neural nets (deep learning), but using explicitly symbolic representations.
Quick Overview
The parser includes APIs in various programming languages, as well as a handy command-line tool for playing with it. Here's some typical output:
linkparser> This is a test!
Linkage 1, cost vector = (UNUSED=0 DIS= 0.00 LEN=6)
+-------------Xp------------+
+----->WV----->+---Ost--+ |
+---Wd---+-Ss*b+ +Ds**c+ |
| | | | | |
LEFT-WALL this.p is.v a test.n !
(S (NP this.p) (VP is.v (NP a test.n)) !)
LEFT-WALL 0.000 Wd+ hWV+ Xp+
this.p 0.000 Wd- Ss*b+
is.v 0.000 Ss- dWV- O*t+
a 0.000 Ds**c+
test.n 0.000 Ds**c- Os-
! 0.000 Xp- RW+
RIGHT-WALL 0.000 RW-
This rather busy display illustrates many interesting things. For example, the Ss*b link connects the verb and the subject, and indicates that the subject is singular. Likewise, the Ost link connects the verb and the object, and also indicates that the object is singular. The WV (verb-wall) link points at the head-verb of the sentence, while the Wd link points at the head-noun. The Xp link connects to the trailing punctuation. The Ds**c link connects the noun to the determiner: it again confirms that the noun is singular, and also that the noun starts with a consonant. (The PH link, not required here, is used to force phonetic agreement, distinguishing 'a' from 'an'.) These link types are documented in the English Link Documentation.
The bottom of the display is a listing of the "disjuncts" used for each word. The disjuncts are simply a list of the connectors that were employed to form the links. They are particularly interesting because they serve as an extremely fine-grained form of "part of speech". Thus, for example, the disjunct S- O+ indicates a transitive verb: it's a verb that takes both a subject and an object. The additional markup above indicates that 'is' is not only being used as a transitive verb, but also indicates finer details: a transitive verb that took a singular subject, and was used as (is usable as) the head verb of a sentence. The floating-point value is the "cost" of the disjunct; it very roughly captures the idea of the log-probability of this particular grammatical usage. Much as parts of speech correlate with word meanings, so also fine-grained parts of speech correlate with much finer distinctions and gradations of meaning.
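The way connectors join to form links can be illustrated with a small sketch. The following is a simplified model of connector matching, not the library's actual implementation: it treats a connector as a leading uppercase name plus a lowercase "subscript", where subscript characters match when they are equal or when either side is the wildcard '*' (a missing subscript behaves like '*'). The real matcher has additional rules, e.g. the lowercase 'h' and 'd' head/dependent markers seen in hWV and dWV above.

```python
def connectors_match(c1, c2):
    """Decide whether a right-pointing connector (e.g. "Ss*b" with '+')
    can join a left-pointing one (e.g. "Ss" with '-').

    Simplified rule: the leading uppercase names must be identical, and
    the lowercase subscripts must agree position by position, where '*'
    (or a missing character) matches anything.
    """
    def split(c):
        # Separate the leading uppercase name from the subscript.
        i = 0
        while i < len(c) and c[i].isupper():
            i += 1
        return c[:i], c[i:]

    h1, s1 = split(c1)
    h2, s2 = split(c2)
    if h1 != h2:
        return False
    # Pad the shorter subscript with wildcards, then compare positions.
    width = max(len(s1), len(s2))
    s1, s2 = s1.ljust(width, "*"), s2.ljust(width, "*")
    return all(a == b or a == "*" or b == "*" for a, b in zip(s1, s2))
```

Under this rule, Ss*b matches Ss (so 'this' can serve as the singular subject of 'is'), and the O*t+ connector on 'is' matches the Os- connector on 'test', producing the Ost link shown in the diagram; Ss and Sp fail on the number subscript.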
The link-grammar parser also supports morphological analysis. Here is an example in Russian:
linkparser> это теста
Linkage 1, cost vector = (UNUSED=0 DIS= 0.00 LEN=4)
+-----MVAip-----+
+---Wd---+ +-LLCAG-+
| | | |
LEFT-WALL это.msi тест.= =а.ndnpi
The LL link connects the stem 'тест' to the suffix 'а'. The MVA link connects only to the suffix because, in Russian, it is the suffixes that carry all of the syntactic structure, and not the stems. The Russian lexis is documented here.
The Thai dictionary is now fully developed, effectively covering the entire language. An example in Thai:
linkparser> นายกรัฐมนตรี ขึ้น กล่าว สุนทรพจน์
Linkage 1, cost vector = (UNUSED=0 DIS= 2.00 LEN=2)
+---------LWs--------+
| +<---S<--+--VS-+-->O-->+
| | | | |
LEFT-WALL นายกรัฐมนตรี.n ขึ้น.v กล่าว.v สุนทรพจน์.n
The VS link connects the two verbs 'ขึ้น' and 'กล่าว' in a serial verb construction. A summary of link types is documented here. Full documentation of Thai Link Grammar can be found here.
Thai Link Grammar also accepts POS-tagged and named-entity-tagged inputs. Each word can be annotated with the Link POS tag. For example:
linkparser> เมื่อวานนี้.n มี.ve คน.n มา.x ติดต่อ.v คุณ.pr ครับ.pt
Found 1 linkage (1 had no P.P. violations)
Unique linkage, cost vector = (UNUSED=0 DIS= 0.00 LEN=12)
+---------------------PT--------------------+
+---------LWs---------+---------->VE---------->+ |
| +<---S<---+-->O-->+ +<--AXw<-+--->O--->+ |
| | | | | | | |
LEFT-WALL เมื่อวานนี้.n[!] มี.ve[!] คน.n[!] มา.x[!] ติดต่อ.v[!] คุณ.pr[!] ครับ.pt[!]
Full documentation for the Thai dictionary can be found here.
The Thai dictionary accepts LST20 tagsets for POS and named entities, to bridge the gap between fundamental NLP tools and the Link Parser. For example:
linkparser> วันที่_25_ธันวาคม@DTM ของ@PS ทุก@AJ ปี@NN เป็น@VV วัน@NN คริสต์มาส@NN
Found 348 linkages (348 had no P.P. violations)
Linkage 1, cost vector = (UNUSED=0 DIS= 1.00 LEN=10)
+--------------------------------LWs--------------------------------+
| +<------------------------S<------------------------+
| | +---------->PO--------->+ |
| +----->AJpr----->+ +<---AJj<--+ +---->O---->+------NZ-----+
| | | | | | | |
LEFT-WALL วันที่_25_ธันวาคม@DTM[!] ของ@PS[!].pnn ทุก@AJ[!].jl ปี@NN[!].n เป็น@VV[!].v วัน@NN[!].na คริสต์มาส@NN[!].n
Note that each word above is annotated with LST20 POS tags and NE tags. Full documentation for both the Link POS tags and the LST20 tagsets can be found here. More information about LST20, e.g. annotation guideline and data statistics, can be found here.
The any language supports uniformly-sampled random planar graphs:
linkparser> asdf qwer tyuiop fghj bbb
Found 1162 linkages (1162 had no P.P. violations)
+-------ANY------+-------ANY------+
+---ANY--+--ANY--+ +---ANY--+--ANY--+
| | | | | |
LEFT-WALL asdf[!] qwer[!] tyuiop[!] fghj[!] bbb[!]
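The linkage count reported above can be reproduced with a short brute-force sketch (an illustration, not the library's parsing algorithm): with the any dictionary, a linkage is just a set of links over the token sequence (LEFT-WALL plus the five words, six tokens in all) that connects all the tokens and contains no two crossing links. Counting such graphs directly gives the same 1162.

```python
from itertools import combinations

def count_linkages(n):
    """Count the connected, crossing-free link diagrams over n tokens.

    Tokens sit on a line; links (i, j) and (k, l) cross when i < k < j < l.
    """
    edges = list(combinations(range(n), 2))
    total = 0
    # A connected graph on n tokens needs at least n - 1 links.
    for size in range(n - 1, len(edges) + 1):
        for links in combinations(edges, size):
            if any(i < k < j < l or k < i < l < j
                   for (i, j), (k, l) in combinations(links, 2)):
                continue  # two links cross: not a planar diagram
            # Union-find connectivity check over the tokens.
            parent = list(range(n))
            def find(x):
                while parent[x] != x:
                    parent[x] = parent[parent[x]]
                    x = parent[x]
                return x
            for i, j in links:
                parent[find(i)] = find(j)
            if len({find(t) for t in range(n)}) == 1:
                total += 1
    return total
```

count_linkages(6) yields the 1162 linkages the parser reports for the five-word sentence above; the small cases (1 diagram for two tokens, 4 for three) are easy to verify by hand.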
The ady language does likewise, performing random morphological splittings:
linkparser> asdf qwerty fghjbbb
Found 1512 linkages (1512 had no P.P. violations)
+------------------ANY-----------------+
+-----ANY----+-------ANY------+ +---------LL--------+
| | | | |
LEFT-WALL asdf[!ANY-WORD] qwerty[!ANY-WORD] fgh[!SIMPLE-STEM].= =jbbb[!SIMPLE-SUFF]
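The notation above marks where the tokenizer split the word: the stem carries a trailing '=' (displayed as fgh.=) and the suffix a leading '=' (=jbbb). A toy sketch of the candidate-splitting step, illustrating the notation only (the real tokenizer is driven by the affix file):

```python
def candidate_splits(word):
    """Enumerate every stem/suffix split of a word, in the parser's
    display notation: the stem gets a trailing '.=', the suffix a
    leading '='.  The whole, unsplit word is also a candidate."""
    splits = [(word,)]
    for i in range(1, len(word)):
        splits.append((word[:i] + ".=", "=" + word[i:]))
    return splits
```

For 'fghjbbb' this includes the ('fgh.=', '=jbbb') split chosen in the linkage above; the parser keeps whichever alternatives yield a valid linkage.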
Theory and Documentation
An extended overview and summary can be found in the Link Grammar Wikipedia page, which touches on most of the important, primary aspects of the theory. However, it is no substitute for the original papers published on the topic:
- Daniel D. K. Sleator, Davy Temperley, "Parsing English with a Link Grammar" October 1991 CMU-CS-91-196.
- Daniel D. Sleator, Davy Temperley, "Parsing English with a Link Grammar", Third International Workshop on Parsing Technologies (1993).
- Dennis Grinberg, John Lafferty, Daniel Sleator, "A Robust Parsing Algorithm for Link Grammars", August 1995 CMU-CS-95-125.
- John Lafferty, Daniel Sleator, Davy Temperley, "Grammatical Trigrams: A Probabilistic Model of Link Grammar", 1992 AAAI Symposium on Probabilistic Approaches to Natural Language.
There are many more papers and references listed on the primary Link Grammar website.
See also the C/C++ API documentation. Bindings for other programming languages, including python3, java and node.js, can be found in the bindings directory. (There are two sets of javascript bindings: one set for the library API, and another set for the command-line parser.)
Contents
Content | Description |
---|---|
LICENSE | The license describing terms of use |
ChangeLog | A compendium of recent changes. |
configure | The GNU configuration script |
autogen.sh | Developer's configure maintenance tool |
link-grammar/*.c | The program. (Written in ANSI-C) |
---- | ---- |
bindings/autoit/ | Optional AutoIt language bindings. |
bindings/java/ | Optional Java language bindings. |
bindings/js/ | Optional JavaScript language bindings. |
bindings/lisp/ | Optional Common Lisp language bindings. |
bindings/node.js/ | Optional node.js language bindings. |
bindings/ocaml/ | Optional OCaml language bindings. |
bindings/python/ | Optional Python3 language bindings. |
bindings/python-examples/ | Link-grammar test suite and Python language binding usage example. |
bindings/swig/ | SWIG interface file, for other FFI interfaces. |
bindings/vala/ | Optional Vala language bindings. |
---- | ---- |
data/en/ | English language dictionaries. |
data/en/4.0.dict | The file containing the dictionary definitions. |
data/en/4.0.knowledge | The post-processing knowledge file. |
data/en/4.0.constituents | The constituent knowledge file. |
data/en/4.0.affix | The affix (prefix/suffix) file. |
data/en/4.0.regex | Regular expression-based morphology guesser. |
data/en/tiny.dict | A small example dictionary. |
data/en/words/ | A directory full of word lists. |
data/en/corpus*.batch | Example corpora used for testing. |
---- | ---- |
data/ru/ | A full-fledged Russian dictionary |
data/th/ | A full-fledged Thai dictionary (100,000+ words) |
data/ar/ | A fairly complete Arabic dictionary |
data/fa/ | A Persian (Farsi) dictionary |
data/de/ | A small prototype German dictionary |
data/lt/ | A small prototype Lithuanian dictionary |
data/id/ | A small prototype Indonesian dictionary |
data/vn/ | A small prototype Vietnamese dictionary |
data/he/ | An experimental Hebrew dictionary |
data/kz/ | An experimental Kazakh dictionary |
data/tr/ | An experimental Turkish dictionary |
---- | ---- |
morphology/ar/ | An Arabic morphology analyzer |
morphology/fa/ | A Persian morphology analyzer |
---- | ---- |
debug/ | Information about debugging the library |
msvc/ | Microsoft Visual-C project files |
mingw/ | Information on using MinGW under MSYS or Cygwin |
UNPACKING and signature verification
The system is distributed using the conventional tar.gz format; it can be extracted using the tar -zxf link-grammar.tar.gz command at the command line.
A tarball of the latest version can be downloaded from:
https://www.gnucash.org/link-grammar/downloads/
The files have been digitally signed to make sure that there was no corruption of the dataset during download, and to help ensure that no malicious changes were made to the code internals by third parties. The signatures can be checked with the gpg command:
gpg --verify link-grammar-5.12.5.tar.gz.asc
which should generate output identical to (except for the