Event Extraction papers
This repository contains resources for Natural Language Processing (NLP) with a focus on the task of Event Extraction.
Pattern matching
1993
1. Automatically Constructing a Dictionary for Information Extraction Tasks by Ellen Riloff
Knowledge-based natural language processing systems have achieved good success with certain tasks but they are often criticized because they depend on a domain-specific dictionary that requires a great deal of manual knowledge engineering. This knowledge engineering bottleneck makes knowledge-based NLP systems impractical for real-world applications because they cannot be easily scaled up or ported to new domains. In response to this problem, we developed a system called AutoSlog that automatically builds a domain-specific dictionary of concepts for extracting information from text. Using AutoSlog, we constructed a dictionary for the domain of terrorist event descriptions in only 5 person-hours. We then compared the AutoSlog dictionary with a hand-crafted dictionary that was built by two highly skilled graduate students and required approximately 1500 person-hours of effort. We evaluated the two dictionaries using two blind test sets of 100 texts each. Overall, the AutoSlog dictionary achieved 98% of the performance of the hand-crafted dictionary. On the first test set, the AutoSlog dictionary obtained 96.3% of the performance of the hand-crafted dictionary. On the second test set, the overall scores were virtually indistinguishable, with the AutoSlog dictionary achieving 99.7% of the performance of the hand-crafted dictionary.
1995
1. Acquisition of linguistic patterns for knowledge-based information extraction by Jun-Tae Kim, D.I. Moldovan
The paper presents an automatic acquisition of linguistic patterns that can be used for knowledge based information extraction from texts. In knowledge based information extraction, linguistic patterns play a central role in the recognition and classification of input texts. Although the knowledge based approach has been proved effective for information extraction on limited domains, there are difficulties in construction of a large number of domain specific linguistic patterns. Manual creation of patterns is time consuming and error prone, even for a small application domain. To solve the scalability and the portability problem, an automatic acquisition of patterns must be provided. We present the PALKA (Parallel Automatic Linguistic Knowledge Acquisition) system that acquires linguistic patterns from a set of domain specific training texts and their desired outputs. A specialized representation of patterns called FP structures has been defined. Patterns are constructed in the form of FP structures from training texts, and the acquired patterns are tuned further through the generalization of semantic constraints. Inductive learning mechanism is applied in the generalization step. The PALKA system has been used to generate patterns for our information extraction system developed for the fourth Message Understanding Conference (MUC-4).
2. Automatically Acquiring Conceptual Patterns without an Annotated Corpus by Ellen Riloff, Jay Shoen
Previous work on automated dictionary construction for information extraction has relied on annotated text corpora. However, annotating a corpus is time-consuming and difficult. We propose that conceptual patterns for information extraction can be acquired automatically using only a preclassified training corpus and no text annotations. We describe a system called AutoSlog-TS, which is a variation of our previous AutoSlog system, that runs exhaustively on an untagged text corpus. Text classification experiments in the MUC-4 terrorism domain show that the AutoSlog-TS dictionary performs comparably to a hand-crafted dictionary, and actually achieves higher precision on one test set. For text classification, AutoSlog-TS requires no manual effort beyond the preclassified training corpus. Additional experiments suggest how a dictionary produced by AutoSlog-TS can be filtered automatically for information extraction tasks. Some manual intervention is still required in this case, but AutoSlog-TS significantly reduces the amount of effort required to create an appropriate training corpus.
3. Learning information extraction patterns from examples by Scott B. Huffman
A growing population of users want to extract a growing variety of information from on-line texts. Unfortunately, current information extraction systems typically require experts to hand-build dictionaries of extraction patterns for each new type of information to be extracted. This paper presents a system that can learn dictionaries of extraction patterns directly from user-provided examples of texts and events to be extracted from them. The system, called LIEP, learns patterns that recognize relationships between key constituents based on local syntax. Sets of patterns learned by LIEP for a sample extraction task perform nearly at the level of a hand-built dictionary of patterns.
1998
1. Multistrategy Learning for Information Extraction by Dayne Freitag
Information extraction (IE) is the problem of filling out pre-defined structured summaries from text documents. We are interested in performing IE in non-traditional domains, where much of the text is often ungrammatical, such as electronic bulletin board posts and Web pages. We suggest that the best approach is one that takes into account many different kinds of information, and argue for the suitability of a multistrategy approach. We describe learners for IE drawn from three separate machine learning paradigms: rote memorization, term-space text classification, and relational rule induction. By building regression models mapping from learner confidence to probability of correctness and combining probabilities appropriately, it is possible to improve extraction accuracy over that achieved by any individual learner. We describe three different multistrategy approaches. Experiments on two IE domains, a collection of electronic seminar announcements from a university computer science department and a set of newswire articles describing corporate acquisitions from the Reuters collection, demonstrate the effectiveness of all three approaches.
1999
1. Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping by Ellen Riloff, Rosie Jones
Information extraction systems usually require two dictionaries: a semantic lexicon and a dictionary of extraction patterns for the domain. We present a multilevel bootstrapping algorithm that generates both the semantic lexicon and extraction patterns simultaneously. As input, our technique requires only unannotated training texts and a handful of seed words for a category. We use a mutual bootstrapping technique to alternately select the best extraction pattern for the category and bootstrap its extractions into the semantic lexicon, which is the basis for selecting the next extraction pattern. To make this approach more robust, we add a second level of bootstrapping (metabootstrapping) that retains only the most reliable lexicon entries produced by mutual bootstrapping and then restarts the process. We evaluated this multilevel bootstrapping technique on a collection of corporate web pages and a corpus of terrorism news articles. The algorithm produced high-quality dictionaries for several semantic categories.
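The mutual bootstrapping loop described in this abstract can be illustrated with a toy sketch. This is not the authors' implementation: the pattern strings, the extraction sets, the scoring function (precision weighted by the log of the hit count), and the name `mutual_bootstrap` are all assumptions made for illustration, and the second level (metabootstrapping) is omitted.

```python
import math


def mutual_bootstrap(pattern_extractions, seeds, iterations=2):
    """Toy mutual bootstrapping loop (illustrative only).

    pattern_extractions: dict mapping a pattern to the set of phrases
        it extracts from an unannotated corpus (hypothetical data).
    seeds: initial lexicon entries for the semantic category.
    """
    lexicon = set(seeds)
    chosen = []
    for _ in range(iterations):
        best, best_score = None, 0.0
        # Sorted iteration keeps tie-breaking deterministic.
        for pattern, extracted in sorted(pattern_extractions.items()):
            if pattern in chosen:
                continue
            hits = len(extracted & lexicon)
            if hits == 0:
                continue
            # Assumed score: fraction of extractions already in the
            # lexicon, weighted by the log of the number of hits.
            score = (hits / len(extracted)) * math.log2(hits + 1)
            if score > best_score:
                best, best_score = pattern, score
        if best is None:
            break
        chosen.append(best)
        # Bootstrap: all extractions of the best pattern join the lexicon.
        lexicon |= pattern_extractions[best]
    return lexicon, chosen


# Hypothetical patterns and extractions for a "location" category.
patterns = {
    "headquartered in <x>": {"london", "tokyo", "paris"},
    "offices in <x>": {"paris", "berlin", "tokyo"},
    "acquired <x>": {"acme", "globex"},
}
lexicon, chosen = mutual_bootstrap(patterns, seeds={"london", "tokyo"})
print(sorted(lexicon), chosen)
```

In this toy run the first selected pattern adds "paris" to the lexicon, which in turn makes the second pattern score well enough to be selected. The same mechanism lets a single bad extraction propagate, which is the problem the abstract's second bootstrapping level addresses.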
2000
1. REES: A Large-Scale Relation and Event Extraction System by Chinatsu Aone, Mila Ramos-Santacruz
This paper reports on a large-scale, end-to-end relation and event extraction system. At present, the system extracts a total of 100 types of relations and events, which represents a much wider coverage than is typical of extraction systems. The system consists of three specialized pattern-based tagging modules, a high-precision co-reference resolution module, and a configurable template generation module. We report quantitative evaluation results, analyze the results in detail, and discuss future directions.
2. Automatic Acquisition of Domain Knowledge for Information Extraction by Roman Yangarber, Ralph Grishman, Pasi Tapanainen, Silja Huttunen
In developing an Information Extraction (IE) system for a new class of events or relations, one of the major tasks is identifying the many ways in which these events or relations may be expressed in text. This has generally involved the manual analysis and, in some cases, the annotation of large quantities of text involving these events. This paper presents an alternative approach, based on an automatic discovery procedure, ExDisco, which identifies a set of relevant documents and a set of event patterns from un-annotated text, starting from a small set of seed patterns. We evaluate ExDisco by comparing the performance of discovered patterns against that of manually constructed systems on actual extraction tasks.
2001
1. Adaptive Information Extraction from Text by Rule Induction and Generalisation by Fabio Ciravegna
(LP)2 is a covering algorithm for adaptive Information Extraction from text (IE). It induces symbolic rules that insert SGML tags into texts by learning from examples found in a user-defined tagged corpus. Training is performed in two steps: initially a set of tagging rules is learned; then additional rules are induced to correct mistakes and imprecision in tagging. Induction is performed by bottom-up generalization of examples in the training corpus. Shallow knowledge about Natural Language Processing (NLP) is used in the generalization process. The algorithm has a considerable success story. From a scientific point of view, experiments report excellent results with respect to the current state of the art on two publicly available corpora. From an application point of view, a successful industrial IE tool has been based on (LP)2. Real world applications have been developed and licenses have been released to external companies for building other applications. This paper presents (LP)2, experimental results and applications, and discusses the role of shallow NLP in rule induction.
2002
1. Event Pattern Discovery from the Stock Market Bulletin by Fang Li, Huanye Sheng, Dongmo Zhang
Electronic information grows rapidly as the Internet is widely used in our daily life. In order to identify the exact information for the user query, information extraction is widely researched and investigated. The template, which pertains to events or situations, and contains slots that denote who did what to whom, when, and where, is predefined by a template builder. Therefore, fixed templates are the main obstacles for the information extraction system out of the laboratory. In this paper, a method to automatically discover the event pattern in Chinese from stock market bulletin is introduced. It is based on the tagged corpus and the domain model. The pattern discovery process is independent of the domain model by introducing a link table. The table is the connection between text surface structure and semantic deep structure represented by a domain model. The method can be easily adapted to other domains by changing the link table.
2003
1. A System for new event detection by Thorsten Brants, Francine Chen, Ayman Farahat
We present a new method and system for performing the New Event Detection task, i.e., in one or multiple streams of news stories, all stories on a previously unseen (new) event are marked. The method is based on an incremental TF-IDF model. Our extensions include: generation of source-specific models, similarity score normalization based on document-specific averages, similarity score normalization based on source-pair specific averages, term reweighting based on inverse event frequencies, and segmentation of the documents. We also report on extensions that did not improve results. The system performs very well on TDT3 and TDT4 test data and scored second in the TDT-2002 evaluation.
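A minimal sketch of the incremental TF-IDF idea from this abstract: document frequencies are updated as each story arrives, and a story is flagged as a new event when its best cosine similarity to earlier stories falls below a threshold. The class name `NewEventDetector`, the whitespace tokenizer, the smoothed idf formula, and the threshold value are assumptions for illustration. The paper's source-specific models, score normalization, and term reweighting are not reproduced here.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class NewEventDetector:
    def __init__(self, threshold=0.2):
        self.threshold = threshold  # assumed value; would be tuned on data
        self.doc_freq = Counter()   # term -> number of stories seen with it
        self.num_docs = 0
        self.history = []           # term counts of earlier stories

    def _weights(self, term_counts):
        # Assumed weighting: tf * log(1 + N / df), using counts seen so far.
        return {
            t: f * math.log(1.0 + self.num_docs / self.doc_freq[t])
            for t, f in term_counts.items()
        }

    def is_new_event(self, text):
        term_counts = Counter(text.lower().split())
        # Incremental step: update corpus statistics before weighting.
        self.doc_freq.update(term_counts.keys())
        self.num_docs += 1
        current = self._weights(term_counts)
        best = max(
            (cosine(current, self._weights(old)) for old in self.history),
            default=0.0,
        )
        self.history.append(term_counts)
        return best < self.threshold


detector = NewEventDetector()
stories = [
    "earthquake hits coastal city",
    "strong earthquake hits coastal city again",
    "central bank raises interest rates",
]
flags = [detector.is_new_event(s) for s in stories]
print(flags)
```

Earlier stories are re-weighted with the current statistics on every comparison, which keeps the sketch short but would be too slow for a real news stream.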
2. Bottom-Up Relational Learning of Pattern Matching Rules for Information Extraction by Mary Elaine Califf, Raymond J. Mooney
Information extraction is a form of shallow text processing that locates a specified set of relevant items in a natural-language document. Systems for this task require significant domain-specific knowledge and are time-consuming and difficult to build by hand, making them a good application for machine learning. We present an algorithm, RAPIER, that uses pairs of sample documents and filled templates to induce pattern-match rules that directly extract fillers for the slots in the template. RAPIER is a bottom-up learning algorithm that incorporates techniques from several inductive logic programming systems. We have implemented the algorithm in a system that allows patterns to have constraints on the words, part-of-speech tags, and semantic classes present in the filler and the surrounding text. We present encouraging experimental results on two domains.
3. An Improved Extraction Pattern Representation Model for Automatic IE Pattern Acquisition by Kiyoshi Sudo, Satoshi Sekine, Ralph Grishman
Several approaches have been described for the automatic unsupervised acquisition of patterns for information extraction. Each approach is based on a particular model for the patterns to be acquired, such as a predicate-argument structure or a dependency chain. The effect of these alternative models has not been previously studied. In this paper, we compare the prior models and introduce a new model, the Subtree model, based on arbitrary subtrees of dependency trees. We describe a discovery procedure for this model and demonstrate experimentally an improvement in recall using Subtree patterns.
2005
1. Automatic event and relation detection with seeds of varying complexity by Feiyu Xu, Hans Uszkoreit and Hong Li
In this paper, we present an approach for automatically detecting events in natural language texts by learning patterns that signal the mentioning of such events. We construe the relevant event types as relations and start with a set of seeds consisting of representative event instances that happen to be known and also to be mentioned frequently in easily available training data. Methods have been developed for the automatic identification of event extents and event triggers. We have learned patterns for a particular domain, i.e., prize award events. Currently we are systematically investigating the criteria for selecting the most effective patterns for the detection of events in sentences and paragraphs. Although the systematic investigation is still under way, we can already report on first very promising results of the method for learning of patterns and for using these patterns in event detection.
2. A Semantic Approach to IE Pattern Induction by Mark Stevenson, Mark A. Greenwood
This paper presents a novel algorithm for the acquisition of Information Extraction patterns. The approach makes the assumption that useful patterns will have similar meanings to those already identified as relevant. Patterns are compared using a variation of the standard vector space model in which information from an ontology is used to capture semantic similarity. Evaluation shows this algorithm performs well when compared with a previously reported document-centric approach.
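One way to read "a variation of the standard vector space model" with ontology-based similarity is to replace the plain dot product with a term-to-term similarity weighting. The sketch below shows that idea under stated assumptions: patterns are treated as binary term vectors, the similarity values in `TERM_SIM` are invented rather than derived from any ontology, and the function names are hypothetical. It may differ from the paper's exact formulation.

```python
import math

# Hypothetical term-to-term similarities; a real system would derive
# these from an ontology. Identical terms have similarity 1.0.
TERM_SIM = {
    frozenset(["hire", "appoint"]): 0.9,
    frozenset(["company", "firm"]): 0.95,
}


def term_similarity(a, b):
    if a == b:
        return 1.0
    return TERM_SIM.get(frozenset([a, b]), 0.0)


def pattern_similarity(p, q):
    """Similarity-weighted cosine between two patterns.

    p, q: iterables of terms, treated as binary vectors. Computes
    (p W q^T) / (|p| |q|), where W holds the term similarities.
    """
    p_terms, q_terms = set(p), set(q)
    cross = sum(term_similarity(a, b) for a in p_terms for b in q_terms)
    return cross / (math.sqrt(len(p_terms)) * math.sqrt(len(q_terms)))


known = ("company", "hire", "ceo")
candidate = ("firm", "appoint", "ceo")
unrelated = ("bomb", "explode", "city")

print(pattern_similarity(known, candidate))
print(pattern_similarity(known, unrelated))
```

A plain cosine would score `known` and `candidate` at only 1/3, since they share a single literal term. The similarity weighting rates them as near-equivalent, matching the abstract's assumption that useful patterns have meanings similar to those already identified as relevant.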