Link Grammar Parser
by Davy Temperley, John Lafferty and Daniel Sleator (this variant maintained by Dom Lachowicz - <domlachowicz@gmail.com> and Linas Vepstas - <linasvepstas@gmail.com>)
News
July, 2008: link-grammar 4.3.6 released! This includes an important security fix; anyone using version 4.2.4 or earlier is advised to upgrade.
What is the Link Grammar?
The Link Grammar Parser is a syntactic parser of English, based on link grammar, an original theory of English syntax. Given a sentence, the system assigns to it a syntactic structure, which consists of a set of labeled links connecting pairs of words. The parser also produces a "constituent" (Penn tree-bank style phrase tree) representation of a sentence (showing noun phrases, verb phrases, etc.).
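As a rough illustration of the "set of labeled links connecting pairs of words" described above, a parse can be modeled as a list of (left word, right word, label) triples. The sketch below does not use the parser's API: the `Link` class, the `describe` helper, and the hand-written links for the example sentence are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One labeled link between two word positions (hypothetical type)."""
    left: int    # index of the left word in the sentence
    right: int   # index of the right word in the sentence
    label: str   # link type, e.g. "S" for subject-verb


def describe(words, links):
    """Return one readable line per link, ordered by left then right word position."""
    ordered = sorted(links, key=lambda link: (link.left, link.right))
    return [f"{words[l.left]} --{l.label}-- {words[l.right]}" for l in ordered]


# Hand-written example links, not actual parser output.
words = ["the", "cat", "chased", "a", "mouse"]
links = [
    Link(0, 1, "D"),  # determiner to noun
    Link(1, 2, "S"),  # subject to verb
    Link(2, 4, "O"),  # verb to object
    Link(3, 4, "D"),  # determiner to noun
]

for line in describe(words, links):
    print(line)
```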
Did the AbiWord team write Link Grammar?
In large part, no. The project is the brainchild of Davy Temperley, John Lafferty and Daniel Sleator, all university professors. It is the product of a decade of academic research into grammar, and is founded on a theory backed by numerous publications. Its canonical homepage is hosted by Carnegie Mellon University.
So, then what is it doing @ AbiSource.com?
The AbiWord team had a concrete need - to integrate a grammar checking feature into AbiWord. The best choice, they felt, was to build upon Temperley et al.'s successful Link Grammar project.
However, in order for the link-grammar project to be useful to them and to the greater Free Software world, the AbiWord community felt that a variety of changes to the project would be necessary. While they did have success (a few years ago) convincing the authors to release Link Grammar under a GPL-compatible license, there was no practical way to continue project development and maintenance at the CMU website. So the AbiWord community took it under its wing and has nurtured the project since.
Notable changes from the upstream Link Grammar package include:
- Actively maintained.
- Portability fixes to non-Linux platforms (e.g. Windows).
- Java bindings.
- Support for UTF8 Unicode, and languages other than English.
- A variety of other bug fixes to both the source code and the dictionaries.
- A more standard, portable build system, making packagers' lives easier.
- Convenience features for integrators, such as a simplified API, pkg-config integration, dynamic/shared library support.
Downloading Link Grammar
The system can be downloaded either as a tarball, or via SVN. The current stable version is Link Grammar 4.3.6 (July, 2008). Older versions are available here.
Unstable, working versions are available through AbiWord's SVN repository. Anonymous read-only access is available by issuing the command:
svn co http://svn.abisource.com/link-grammar/trunk link-grammar
General instructions for AbiWord's anonymous SVN can be found here.
The Link Grammar source can be browsed online here. Link Grammar's public API can be found here.
Documentation
A mirror of the Link Grammar dictionary documentation is here. A mirror of the API documentation is here.
Mailing Lists
The current list for Link Grammar discussion is at the link-grammar google group.
Bug Tracker
Bug reports, patches, RFEs, etc. are gladly welcomed.
- Bug reports should be filed at the Google code bug tracker.
- General issue discussion, requests for enhancement, and related matters should be discussed on the Link Grammar mailing list.
Disclaimer
Link grammar is a natural language parser, not an artificial intelligence. This means that there are many sentences that it cannot parse correctly, and many others for which it generates multiple parses. There are also entire classes of speech that it cannot parse, such as Valley-girl speak. Link grammar does best on "newspaper English": medium-length sentences written with good grammar, proper punctuation, and proper capitalization. It don't do 733t speek, etc. In particular, it has problems with the following "registers" and types of writing:
- Phrases (that are not a part of a complete sentence)
- Bulleted lists, such as this.
- Quotations within sentences (and parenthetical remarks). These can be handled by an appropriate front-end that separates out the quotations from the rest of the text.
- Slang speech and slang words, like 733t warez d00dz, although it can certainly guess from context if the slang is sufficiently grammatical.
- Long run-on sentences. These can generate thousands of alternative parses in a combinatorial explosion.
- Certain "registers", such as newspaper headlines; for example, "Thieves rob bank."
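The front-end mentioned above for quotations and parenthetical remarks could look roughly like the following. This is a minimal sketch, not part of Link Grammar: the function name `split_asides` is hypothetical, only straight double quotes and un-nested parentheses are handled, and each extracted span would still need to be passed to the parser separately.

```python
import re

# Matches a double-quoted span or a parenthesized span (nesting is not handled).
_ASIDE = re.compile(r'"([^"]*)"|\(([^()]*)\)')


def split_asides(text):
    """Return (main_text, asides).

    main_text is the input with quoted and parenthetical spans removed;
    asides lists the removed spans in order of appearance.
    """
    asides = []

    def _collect(match):
        quoted, parenthetical = match.group(1), match.group(2)
        asides.append(quoted if quoted is not None else parenthetical)
        return " "

    main = _ASIDE.sub(_collect, text)
    main = re.sub(r"\s+", " ", main).strip()      # collapse leftover whitespace
    main = re.sub(r"\s+([.,;:!?])", r"\1", main)  # no space before punctuation
    return main, asides


main, asides = split_asides("The parser (a C library) reads text.")
print(main)
print(asides)
```

Dropping a quotation that acts as the object of a verb can leave the main sentence incomplete, so a more careful front-end might substitute a placeholder word instead of deleting the span.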
Recent Changes
Version 4.3.6 includes the following changes:
- Fixes for Windows MS Visual-C builds.
- Fix parsing of "He walked the dog.", "He sailed the boat."
- Add support for right-apostrophe (’) which is a non-ASCII UTF8 char.
- Add support for other non-ASCII UTF8 punctuation.
- Fix crash on printing constituent tree of certain long sentences.
- Avoid recursive error reporting for UTF8 dictionary errors.
- Clarify error logging and error printing.
- Add java getVersion() to return link-grammar version string.
- Add more numbers to dict (e.g. twenty-seven, bazillion, half-dozen, etc.)
- Foodstuffs: bagels, lox, tacos, guacamole, roe, neufchatel, mayo, etc.
- Weights and measures: megabytes, °C, km² etc.
- Performance improvements in printing of link-tree.
- Convert assert into warning when no canonical linkages can be found.
- Convert assert into warning when constituent andlist overflows.
- Provide additional checks for constituent overflows.
- Convert most error printfs into a formal error reporting system.
- Remove all globals, library is now thread-safe.
- Fix crash when sentence has square bracket, and doing constituents.
Version 4.3.5 includes the following changes:
- Added ant build file to create the link-grammar jar file.
- Fix regression in command-line client of multiple-parse display.
- Use MB_LEN_MAX, not MB_CUR_MAX for UTF8 support.
- Fix a WIN32 compiler regression (no in-line support in Windows).
- Fix error in handling of UTF8 dictionaries.
- Fix strncat() misuse in error.c
- Fix capitalization errors in country names.
- Fix parsing of "he angled left, he dodged left, he turned left".
- Don't build the JNI library if Java isn't found. Fixes build on Windows.
- Fix install bug for NetBSD systems.
- Pre-detected entities cannot participate in G links.
- There is no UTF8 support in Windows, so stub it out.
- Fix crash in constituent output, bug #22 in googlecode bug tracker.
- Some small steps taken to eventually make library thread-safe.
- There are three constituent string styles, enable all three.
- Make the command-line flag errors less cryptic.
- Add readline (BSD editline) support.
- Rename "grammar-parse" to the more logical "link-parser".
- Small man page updates.
- Export and cost, link cost via public API.
Version 4.3.4 includes the following changes:
- Fix regression of handling of capitalization at the start of sentences.
- Fix dictionary search path so that it respects command-line input.
- Fix rare but nasty crash when parsing long sentences in panic mode.
- Add a method to set the dictionary path.
- Fix all remaining compiler warnings.
- Make parser capable of handling UTF8 strings and dictionaries.
- Ongoing minor expansion of the Lithuanian (lt) dictionary.
Version 4.3.3 includes the following changes:
- Missing java is a warning, not an error.
- man page for grammar-parse.
- Removed cruft from the dictionary open routines.
- configure tries to guess some non-standard jni.h locations.
- Split up java library exports, should help cygwin builds.
- Fix java library pre-linking bug.
- Minor English dictionary additions.
- Prototype Lithuanian (lt) dictionary.
Version 4.3.2 includes the following changes:
- Fix dictionary errors involving given names; e.g. any sentence with the name "John" in it.
- Minor Windows build fixes.
Version 4.3.1 includes the following changes:
- Merger of extensive dictionary additions from Peter Szolovits. This adds 15K new words, bringing the dictionary to 70K words total.
Version 4.3.0 includes the following changes:
- New link types (Ct, Cta, Rn, Rw) for comparatives, so as to link relative clauses: "John is bigger than Dave is", "John wants more cookies than Dave wants". The Rw link is used to link question words to the relative clauses that follow them.
- Dictionary Fixes for "Espresso is a coffee drink", "Teach me fetch", "I am pooped" as synonym for "I am tired", "Mother likes her", "Mommy loves me" and related. Also, directives involving "go": "Go play ball", "Go take a walk", "You and Rover go play with the ball."
- Dictionary support for external entity markup. This includes the recognition of personID0..personID60, dateID0..dateID60, organizationID0..organizationID60 and locationID0..locationID60 as appropriate words.
- Fixes of numerous compile-time warnings.
- Simple Java (JNI) bindings.
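The external entity markup added in version 4.3.0 implies a preprocessing step in which some other tool replaces detected entities with the placeholder words personID0..personID60, dateID0..dateID60, and so on before the sentence reaches the parser. A minimal sketch of such a step follows; the `mark_entities` function, its argument format, and the example sentence are assumptions for illustration and are not part of Link Grammar.

```python
KINDS = ("person", "date", "organization", "location")


def mark_entities(tokens, entities, max_id=60):
    """Replace entity spans with placeholder words such as personID0.

    tokens   -- list of word strings
    entities -- list of (start, end, kind) tuples, end exclusive,
                with kind one of KINDS (hypothetical input format)
    Returns (new_tokens, mapping from placeholder to original text).
    """
    counters = {kind: 0 for kind in KINDS}
    out, mapping, pos = [], {}, 0
    for start, end, kind in sorted(entities):
        if kind not in counters:
            raise ValueError(f"unknown entity kind: {kind}")
        if start < pos or end <= start:
            raise ValueError("entity spans must be non-empty and non-overlapping")
        if counters[kind] > max_id:
            raise ValueError(f"more than {max_id + 1} entities of kind {kind}")
        out.extend(tokens[pos:start])              # copy words before the entity
        placeholder = f"{kind}ID{counters[kind]}"  # e.g. personID0
        counters[kind] += 1
        mapping[placeholder] = " ".join(tokens[start:end])
        out.append(placeholder)
        pos = end
    out.extend(tokens[pos:])                       # copy the remaining words
    return out, mapping


tokens = "John Smith joined Acme Corp in Boston".split()
spans = [(0, 2, "person"), (3, 5, "organization"), (6, 7, "location")]
marked, mapping = mark_entities(tokens, spans)
print(" ".join(marked))
```

The returned mapping lets a caller restore the original entity text after parsing.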
Version 4.2.5 includes the following changes:
- Fix for a security problem, involving a buffer overflow: CVE-2007-5395.
Adjunct Projects
- RelEx Semantic Relation Extractor
- RelEx is an English-language semantic relationship extractor, built on the Carnegie-Mellon link parser. It can identify subject, object, indirect object and many other relationships between words in a sentence. It will also provide part-of-speech tagging, noun-number tagging, verb tense tagging, gender tagging, and so on. Relex includes a basic implementation of the Hobbs anaphora (pronoun) resolution algorithm. Optionally, it can use GATE for entity detection.
- Perl bindings
- A Perl module was written by Dan Brian. [Download] [Documentation (mirror)]. See also a tutorial. Note that the Perl bindings were developed against an older version of the link parser.
- Ocaml bindings
- OCaml interface to Link Grammar
- Ruby bindings
- Ruby interface to Link Grammar
- Persian dictionaries
- Persian dictionaries, by Jon Dehdari. These require the Persian stemming engine, as significant morphology analysis needs to be performed to parse Persian.
- Arabic dictionaries
- Arabic dictionaries, by Jon Dehdari. [download] These require the Aramorph stemming package, which is included.
- Russian parser
- Located at http://slashzone.ru/parser/. By Sergey Protasov. Russian morpheme dictionaries can be had at http://aot.ru.
- English dictionary extensions
- LinkGrammar-WN is a lexicon expansion for the English language Link Grammar Parser. This project adds 14K new words to the dictionaries. The extended lexicon is provided under the GPL license, and thus cannot be merged back into the current project.
- Medical terms
- Extending the Link Grammar Parser's lexicon from UMLS' Specialist lexicon -- adds many medical terms. All but the six largest of these dictionaries have now been merged into version 4.3.1. The large dictionaries EXTRA.2, EXTRA.3, EXTRA.8, EXTRA.9, EXTRA.12, and EXTRA.17 have not been merged. These dictionaries contain 180K assorted medical, biological and biochemical terms and phrases.
Of related interest
- Genia tagger
- The Genia tagger is useful for named entity extraction.
Recent Applications and Publications
Some recent uses and applications of the Link Grammar Parser are shown below. There is also an older bibliography on the CMU website referencing several dozen papers pertaining to the Link Grammar Parser.
- Fabian M. Suchanek, Georgiana Ifrim, Gerhard Weikum, "Combining Linguistic and Statistical Analysis to Extract Relations from Web Documents" (2006)
- P. Szolovits, "Adding a Medical Lexicon to an English Parser". Proc. AMIA 2003 Annual Symposium. Pages 639-643. 2003.
Some miscellaneous facts:
- Any categorial grammar can be easily converted to a link grammar; see section 6 of Daniel Sleator and Davy Temperley. 1993. "Parsing English with a Link Grammar." Third International Workshop on Parsing Technologies.
- Link grammars can be learned by performing a statistical analysis on a large corpus: see John Lafferty, Daniel Sleator, and Davy Temperley. 1992. "Grammatical Trigrams: A Probabilistic Model of Link Grammar." Proceedings of the AAAI Conference on Probabilistic Approaches to Natural Language, October, 1992. See also the P. Szolovits paper above.
License
The Link Grammar license is essentially the BSD license. A copy of this license can be found below, and at the original authors' CMU site.
Copyright (c) 2003-2004 Daniel Sleator, David Temperley, and John Lafferty. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
- The names "Link Grammar" and "Link Parser" must not be used to endorse or promote products derived from this software without prior written permission. To obtain permission, contact sleator@cs.cmu.edu
THIS SOFTWARE IS PROVIDED BY DANIEL SLEATOR, DAVID TEMPERLEY, JOHN LAFFERTY AND OTHER CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
