Automated Coding Papers

This collection of papers deals mainly with the validity and applications of event data to the study of international relations. While these papers are primarily products of Kansas Event Data Project directors and personnel, G. Dale Thomas's "Practical Guide" to events data is included, as are some other papers dealing with automated coding that have been produced by other research groups.

Click here to go to a detailed history of the Kansas Event Data Project

Creating Custom Event Data Without Dictionaries: A Bag-of-Tricks

Halterman, Andrew, Philip A. Schrodt, Andreas Beger, Benjamin E. Bagozzi and Grace I. Scarborough.

Event data, or structured records of "who did what to whom" that are automatically extracted from text, is an important source of data for scholars of international politics. The high cost of developing new event datasets, especially using automated systems that rely on hand-built dictionaries, means that most researchers draw on large, pre-existing datasets such as ICEWS rather than developing tailor-made event datasets optimized for their specific research question. This paper describes a "bag of tricks" for efficient, custom event data production, drawing on recent advances in natural language processing (NLP) that allow researchers to rapidly produce customized event datasets. The paper introduces techniques for training an event category classifier with active learning, identifying actors and the recipients of actions in text using large language models, standard machine learning classifiers, and pretrained "question-answering" models from NLP, and resolving mentions of actors to their Wikipedia article to categorize them. We describe how these techniques produced the new POLECAT global event dataset that is intended to replace ICEWS, along with examples of how scholars can quickly produce smaller, custom event datasets. We publish example code and models to implement our new techniques.

Paper presented at the International Studies Association, Montreal, March 2023.

Link to open archive copy of the paper

PLOVER and POLECAT: A New Political Event Ontology and Dataset

Halterman, Andrew, Benjamin E. Bagozzi, Andreas Beger, Philip A. Schrodt, and Grace I. Scarborough.

POLECAT is a new global political event dataset intended to serve as the successor to the dataset produced by the DARPA Integrated Crisis Early Warning System (ICEWS) project. POLECAT's event data are machine coded from millions of multi-language international news reports and will soon cover the period from 2010 to the present. These data are generated using the Next Generation Event Coder (NGEC), a new automated coder that replaces the use of extensive (and difficult to update) dictionaries with a more flexible set of expert annotations of an event's characteristics. In contrast to existing automated event coders, it uses a combination of NLP tools, transformer-based neural networks, and actor information sourced from Wikipedia. POLECAT's event data are based on an event-mode-context ontology, the Political Language Ontology for Verifiable Event Records (PLOVER), that replaces the older CAMEO ontology used in past datasets such as ICEWS and Phoenix. These innovations offer substantial improvements in the scope and accuracy of political event data in terms of the what, how, why, where, and when of domestic and international interactions. After detailing PLOVER and POLECAT, we illustrate the innovations and improvements through a preliminary comparison to the existing ICEWS event data system.

Paper presented at the International Studies Association, Montreal, March 2023.

Link to open archive copy of the paper

Taming the Firehose: Thematically summarizing very large news corpora using topic modeling

Philip A. Schrodt

This paper instantiates the new "application-focused" format for PolMeth XXXV. The application in question is producing thematic chronologies from very large corpora of news texts (both native English and machine-translated Arabic) using a combination of political event data coding—specifically, a successor to the event coder used in the ICEWS project—and latent Dirichlet allocation (LDA) topic modeling as implemented through the open source program gensim. Because this is an applied project where the less-than-infinitely-patient end-users are looking for plausible and more or less distinct clusterings, rather than whatever dog's breakfast is produced by LDA, the two algorithmic challenges are reconciling the indeterminate outcomes from LDA (that is, due to numerical optimization over a high-dimensional surface characterized by many local optima, multiple runs produce different clusterings) and identifying similar clusters within a single run. The system—which is entirely unsupervised after various pre-processing steps, including the use of an automated event coder similar to that used by the ICEWS project to restrict the corpus to sentences involving transactional events—is producing reasonably coherent (and consistent) results, with an interesting distinction between the results where the texts all involve a single country, where thematic clusters tend to align according to the foreign relations with other countries, and a clustering on a much larger corpus involving the entire Middle East, where the clusters are more behavioral. Issues remaining in the system are appropriate summarization—gensim's function for this seems less than completely reliable—and differences induced by the very different stylistic characteristics of native English and translated Arabic.

Paper presented at PolMeth XXXV: 2018 Conference of the Society for Political Methodology, Brigham Young University, 18-21 July 2018.

Link to Adobe .pdf file of the paper

Comparison Metrics for Large Scale Political Event Data Sets

Philip A. Schrodt

This paper addresses three general issues surrounding the use of political event data generated by fully automated methods in forecasting political conflict. I first look at the differences between the data generation process for machine and human coded data, where I believe the major difference in contemporary efforts is found not in the precision of the coding, but rather in the effects of using multiple sources. While the use of multiple sources has virtually no downside in human coding, it has great potential to introduce noise in automated coding. I then propose a metric for comparing event data sources based on the correlations between weekly event counts in the CAMEO "pentaclasses" weighted by the frequency of dyadic events, and illustrate this with two examples:

A comparison of the new ICEWS public data set with an unpublished data set based only on the BBC Summary of World Broadcasts.
A comparison of the TABARI shallow parser and PETRARCH full parser for the 35-year KEDS Reuters and Agence France Presse Levant series.

In the case of the ICEWS/BBC comparison, the metric appears useful not only in showing the overall convergence—typical weighted correlations are in the range of 0.45, surprisingly high given the differences between the two data sets—but also in showing variations across time and regions. In the case of TABARI/KEDS, the metric shows high convergence for the series with a large number of reports, and also shows that the PETRARCH coding reduces the number of material conflict events—presumably mostly by eliminating false positives—by around a factor of 2 in most dyads. In both tests, the metric is good at identifying anomalous dyads, Asia in the case of ICEWS and Palestine in the case of the TABARI-coded Levant series.
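As a rough illustration of the kind of metric described above, the following Python sketch computes a Pearson correlation between two sources' weekly counts for each (dyad, event class) pair, then averages those correlations weighted by the number of events in each pair. The data layout, the function names and the exact weighting scheme are assumptions made for illustration, and the toy counts are invented; the paper's own definition may differ in detail.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def weighted_correlation(source_a, source_b):
    """Event-frequency-weighted mean correlation between two sources.

    Each argument maps (dyad, event_class) -> list of weekly event counts.
    Only keys present in both sources are compared (an assumption).
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for key in source_a.keys() & source_b.keys():
        a, b = source_a[key], source_b[key]
        weight = sum(a) + sum(b)  # more active dyads count for more
        weighted_sum += weight * pearson(a, b)
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0

# Invented toy data: two dyads, four weeks each.
source_a = {("A-B", "material_conflict"): [5, 8, 2, 9],
            ("C-D", "verbal_cooperation"): [1, 0, 2, 1]}
source_b = {("A-B", "material_conflict"): [4, 7, 3, 8],
            ("C-D", "verbal_cooperation"): [0, 1, 1, 2]}
score = weighted_correlation(source_a, source_b)
```

Weighting by event frequency keeps sparsely reported dyads, whose correlations are noisy, from dominating the summary score.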

Finally, the paper looks at the degree to which the complex coding of PETRARCH can be duplicated using much simpler "bag of words" methods, specifically a simple pattern-based method for identifying actors and support vector machines for identifying events. While, as expected, these methods do not fully reproduce the more complex codings, they perform far better than chance on doing the aggregated classifications typically found in research projects, and arguably could be used as a first approximation for new behaviors where automated coding dictionaries are not available. The pattern-based actor and SVM models also strongly suggest that there may be a very substantial number of sentences which are currently not coded which actually contain events.

Paper presented at the Text as Data meetings, New York University, 16-17 October 2015. Partial results were presented earlier at the European Political Science Association meetings, Vienna (Austria), 25 June 2015 and at the Conference on Forecasting and Early Warning of Conflict, Peace Research Institute, Oslo (Norway), 22-23 April 2015.

Link to Adobe .pdf file of the paper

Three's a Charm?: Open Event Data Coding with EL:DIABLO, PETRARCH, and the Open Event Data Alliance

Philip A. Schrodt, John Beieler and Muhammed Idris

This paper is a brief review of three current efforts to provide an open and transparent path to the automated production of event data:

EL:DIABLO: an open, user-friendly modular system for the acquisition and coding of web-based news sources which is intended to allow small research teams to generate customized event data sets with a minimum of effort
PETRARCH: a Python-based event data coder using fully-parsed Penn Treebank input
The Open Event Data Alliance, a new professional organization for the promotion and provision of fully transparent open event data

Presented at the International Studies Association, Toronto, March 2014.

Link to Adobe .pdf file of the paper

Automated Coding of Very Large Scale Political Event Data

Philip A. Schrodt and Jay Yonamine

This paper discusses the current state-of-the-art for generating high-volume, near-real-time event data using automated coding methods, based on recent efforts by the Penn State Event Data Project and its precursors. Political event data—the extraction of who did what to whom from natural language news reports—is a core element of many models that forecast political stability, and is easily adapted to new problems, either in a fully-automated mode or as part of a machine-assisted coding system. We discuss five elements of a contemporary system: acquisition and filtering of source texts, the actor and event coding ontology, the development of actor dictionaries using named-entity recognition, the development of a coding engine, and integration with open-source natural language processing software. While we will be focusing on the characteristics of the open-source TABARI, most of these issues will be relevant to any system, and thus the paper is intended as a "how-to" guide to the pragmatic challenges and solutions to various elements of the process of generating event data using automated techniques.

Presented at the New Directions in Text as Data workshop, Harvard, October 2012.

Link to Adobe .pdf file of the paper

Detecting Latent Topics in Political News Reports using Latent Dirichlet Allocation Models

Benjamin E. Bagozzi and Philip A. Schrodt

Latent Dirichlet Allocation (LDA) models are a machine learning method for finding sets of words that characterize latent dimensions in texts. In this paper, we apply LDA to a very large corpus of international newswire texts for 2000 to 2011 covering 61 countries in Europe and the Middle East to determine the extent to which these latent dimensions correspond to the categories found in existing event data ontologies such as WEIS, CAMEO and IDEA. We analyze the texts after removing common stop words, proper nouns and meronyms based on the CountryCodes file developed by the CAMEO project. The LDA analysis produces very plausible latent topics -- which is not always true of LDA -- and most of these topics are general rather than country-specific, though as expected, some country-specific crisis and topic areas (e.g. specific international and ethnic conflicts) can be found as well. Clustering analysis supports a dominant cooperation-conflict dimension in most but not all of the cases, and shows very strongly that when compared across countries, the latent topics cluster substantively rather than by country, and those clusters contain representatives of the topics derived from the set of all of the stories. However, LDA also reveals a number of additional issues, particularly economic and institutional, that are generally not covered well in the existing event data coding schemes, in large part because of their traditional emphasis on violent conflict. This suggests that automated coding methods could be used to substantially expand the scope of event data, particularly in areas relevant to political economy and routine institutional behavior such as elections. It may be possible to use these methods to increase the effectiveness of report filters and to identify stories containing sentences that should produce events but are missed by automated coding programs due to incomplete dictionaries.

Presented at the European Political Science Association meetings, Berlin, June 2012.

Link to Adobe .pdf file of the paper

Precedents, Progress and Prospects in Political Event Data

Philip A. Schrodt

The past decade has seen a renaissance in the development of political event data sets. This has been due to at least three sets of factors. First, there have been technological changes that have reduced the cost of producing event data, including the availability of information on the Web, the development of specialized systems for automated coding, and the development of machine-assisted systems which reduce the cost of human coding. Second, event data have become much more elaborate than the original state-centric data sets such as WEIS and COPDAB, with a far greater emphasis on sub-state and non-state actors, and in some data sets, the incorporation of geospatial information. Finally, there have been major institutional investments, such as support for a number of Uppsala and PRIO data sets, the DARPA ICEWS Asian and global data sets, and various political violence data sets from the U.S. government. This paper will first review the major new contributions, with a focus on those represented in this special issue, discuss some of the open problems in the existing data and finally discuss prospects for future development, including the enhanced use of open-source natural language processing tools, standardizing the coding taxonomies, and prospects for near-real-time coding systems.

International Interactions 38,5 (December 2012)

Link to Adobe .pdf file of the paper

Analyzing International Event Data (2001/2012)

Philip A. Schrodt, Deborah J. Gerner

This is a book-length manuscript that covers event data in general and the KEDS project in particular.

Update March 2012: We've been continuing to use the first three chapters for instructional purposes, so these have been slightly updated and reformatted to include the bibliographic references.

Preface and Table of Contents [.pdf] [Postscript]
Chapter One: International Event Data [.pdf 2012, includes Appendix] [.pdf 2001] [Postscript]
Chapter Two: Fundamentals of Machine Coding [.pdf 2012] [.pdf 2001] [Postscript]
Chapter Three: Statistical Characteristics of Event Data [.pdf 2012] [.pdf 2001] [Postscript]
Chapter Four: Clustering Methods [.pdf] [Postscript]
Chapter Five: Sequence Analysis Methods [.pdf] [Postscript]
Chapter Six: Hidden Markov Models [.pdf]
Chapter Seven: Conclusion [.pdf] [Postscript]
Bibliography [.pdf] [Postscript]
Appendix: Event Coding Systems [.pdf] [Postscript]

Building Datasets with TABARI Output: An Analysis of Varying Temporal and Typological Aggregations

Penn State ICEWS Working Group, June 2010

This document begins by explaining various aggregation options for each of the four variables found in standard event data -- date, actor, target and source -- while briefly addressing relevant data management concerns in STATA. The explanation begins with CAMEO and concludes with Date aggregations. In addition, it provides a case study of the Israel-Palestine dyad to illustrate how different aggregation strategies generate varying empirical relationships, and it briefly discusses the relationships between variables in the coding system.
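To make the idea of temporal and typological aggregation concrete, here is a minimal Python sketch (Python is used here for illustration; the document itself works in STATA). It assumes each event is a hypothetical tuple of date, source actor, target actor and CAMEO code, and it relies on the fact that the first two digits of a CAMEO code give its top-level category. The function name, the record layout and the sample events are all invented for this example.

```python
from collections import Counter
from datetime import date

def aggregate(events, period="month", root_only=True):
    """Count events per (time bucket, source, target, event key).

    events: iterable of (date, source, target, cameo_code) tuples (assumed layout).
    period: "month" or "week" (ISO week) for the temporal aggregation.
    root_only: if True, collapse CAMEO codes to their two-digit root category.
    """
    counts = Counter()
    for day, source, target, code in events:
        if period == "month":
            bucket = (day.year, day.month)
        elif period == "week":
            iso = day.isocalendar()
            bucket = (iso[0], iso[1])  # ISO year and ISO week number
        else:
            raise ValueError("period must be 'month' or 'week'")
        event_key = code[:2] if root_only else code
        counts[(bucket, source, target, event_key)] += 1
    return counts

# Invented sample events for a single dyad.
events = [
    (date(2002, 1, 3), "ISR", "PSE", "190"),
    (date(2002, 1, 17), "ISR", "PSE", "193"),
    (date(2002, 1, 20), "PSE", "ISR", "042"),
    (date(2002, 2, 1), "ISR", "PSE", "190"),
]
monthly = aggregate(events)
weekly = aggregate(events, period="week", root_only=False)
```

Comparing `monthly` with `weekly` on the same events shows how the choice of bucket and code granularity changes the counts that later enter a statistical model.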

Link to Adobe .pdf file of the paper

Automated Production of High-Volume, Near-Real-Time Political Event Data

Philip A. Schrodt

This paper summarizes the current state-of-the-art for generating high-volume, near-real-time event data using automated coding methods, based on recent efforts from the DARPA Integrated Crisis Early Warning System (ICEWS) and NSF-funded research. The ICEWS work expanded previous automated coding efforts by more than two orders of magnitude, coding about 26 million sentences generated from 8 million stories condensed from around 30 gigabytes of text. The actual coding took six minutes. The paper is largely a general "how-to" guide to the pragmatic challenges and solutions to various elements of the process of generating event data using automated techniques. It also discusses a number of ways that this could be augmented with existing open-source natural language processing software to generate a third-generation event data coding system.

Paper prepared for delivery at the Annual Meeting of the American Political Science Association, Washington, 2 - 5 September 2010.

Link to Adobe .pdf file of the paper

Inductive Event Data Scaling using Item Response Theory

Philip A. Schrodt

Political event data are frequently converted to an interval level measurement by assigning a numerical scaled value to each event. All of the existing scaling systems rely on non-replicable expert assessments to determine these numerical scores, which do not take into account the characteristics of the data that will be aggregated. This paper uses item response theory (IRT)—a technique originally developed for the scaling of test scores—to derive scales inductively, using event data on Israeli interactions with Lebanon and the Palestinians for 1991-2007. In the IRT model, the probability of an event being reported in an interval of time by a specific news source is modeled as a logistic function on a latent dimension determined from the data itself. Monthly scores on this latent trait are calculated using three IRT models: the single-parameter Rasch model, and two- and three-parameter models that add discrimination and guessing parameters, respectively. The three formulations produce generally comparable scores (correlations around 0.90 or higher). The Rasch scales are less successful than the expert-derived Goldstein scale in reconciling the somewhat divergent sets of events derived from the Agence France Presse and Reuters news services. This is in all likelihood due largely to the low weighting given uses of force by the IRT models, because force events are common in these two data sets. A factor analysis of the event counts shows that a single cooperation-conflict dimension generally accounts for about two-thirds of the variance in these dyads, but a second case-specific dimension explains another 20%. Finally, moving averages of the scores generally correlate well with the Goldstein values, suggesting that IRT may provide a route towards deriving a purely inductive and hence replicable scale.
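For readers unfamiliar with IRT, the logistic models named above can be written in their standard textbook forms. The notation is an assumption made for illustration and is not taken from the paper: $x_{it} = 1$ means that event type $i$ is reported in time interval $t$, $\theta_t$ is the latent trait score for that interval, and $b_i$, $a_i$ and $c_i$ are the difficulty, discrimination and guessing parameters for event type $i$.

```latex
\begin{align}
\text{Rasch (one-parameter):} \quad
  & P(x_{it} = 1 \mid \theta_t) = \frac{1}{1 + e^{-(\theta_t - b_i)}} \\
\text{Adding discrimination:} \quad
  & P(x_{it} = 1 \mid \theta_t) = \frac{1}{1 + e^{-a_i(\theta_t - b_i)}} \\
\text{Adding guessing:} \quad
  & P(x_{it} = 1 \mid \theta_t) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta_t - b_i)}}
\end{align}
```

Setting $a_i = 1$ and $c_i = 0$ in the last form recovers the Rasch model, which is why the three formulations can be expected to give broadly similar scores when discrimination varies little across event types.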

Presented at the Summer Meeting of the Society for Political Methodology, Pennsylvania State University, 18 - 20 July 2007. An earlier version was presented at the 2007 Annual Meeting of the International Studies Association, Chicago.

Link to Adobe .pdf file of the paper

The CAMEO (Conflict and Mediation Event Observations) Actor Coding Framework

Philip A. Schrodt, Ömür Yilmaz, Deborah J. Gerner, and Dennis Hermreck

The Conflict and Mediation Event Observations (CAMEO) framework is a relatively new event data-coding scheme optimized for the study of third party mediation in global disputes. In an earlier paper (Gerner et al. 2002) we discussed the development of the event-coding component of that framework; in this paper we discuss the actor-coding framework. Because almost all contemporary conflicts transcend the traditional focus on state actors, featuring instead significant involvement of both sub-state and non-state actors, the state-centered coding schemes used in older data sets such as WEIS and COPDAB have proven inadequate for coding current events. In their place, we have established a systematic method of hierarchically creating codes that allow for the identification of states, sub-state actors, ethnic groups, geographical regions, IGOs and NGOs. This system, while still under development, has proven sufficient to code a wide range of relevant actors involved in inter- and intra-state protracted conflicts in Africa, the Balkans, Central Asia and the Middle East.

Presented at the 2008 Annual Meeting of the International Studies Association, 26 - 29 March 2008. An earlier version was presented at the 2005 Annual Meeting of the American Political Science Association, 1 - 4 September 2005.

Link to Adobe .pdf file of the paper

Conflict and Mediation Event Observations (CAMEO): A New Event Data Framework for a Post Cold War World

Deborah J. Gerner, Philip A. Schrodt, Ömür Yilmaz, and Rajaa Abu-Jabr

The Conflict and Mediation Event Observations (CAMEO) framework is a new event data coding scheme optimized for the study of third party mediation in international disputes. The World Events Interaction Survey (WEIS) framework that the authors used in previous event data research has a number of shortcomings, including vagueness in and overlap of some categories, and a limited applicability to contemporary issues involving non-state actors. The authors have addressed these and other problems in constructing CAMEO and have produced far more complete documentation than is available for WEIS.

CAMEO has been developed and implemented using the TABARI automated coding program and has been used to generate data sets for the Balkans (1989-2002; N=71,081), Levant (1979-2002; N=139,376), and West Africa (1989-2002; N=18,519) from Reuters and Agence France Presse reports. This article reports statistical comparisons of CAMEO-coded and WEIS-coded data for these three geographical regions. CAMEO and WEIS show similar irregularities in the distribution of events by category. In addition, when the data are aggregated to a general behavioral level (that is, into verbal cooperation, material cooperation, verbal conflict and material conflict), most of the data sets show a high correlation (r > 0.90) in the number of WEIS and CAMEO events coded per month. Finally, there is a significant correlation (r > 0.57) between the count of CAMEO events specifically dealing with mediation and negotiation, and a pattern-based measure of mediation the authors developed earlier from WEIS data. CAMEO thus appears to maintain coverage of events typically coded by WEIS while adding enhanced precision and stronger coverage of additional activities such as mediation that are of increasing scholarly interest in the twenty-first century.

Presented at the 2002 Annual Meeting of the American Political Science Association, 29 August - 1 September 2002.
(Earlier version at the International Studies Association, New Orleans, 23-27 March 2002.)

Link to Adobe .pdf file of the paper

Fair & Balanced or Fit to Print? The Effects of Media Sources on Statistical Inferences

Andrew Reeves (William and Mary), Stephen Shellman (University of Georgia) and Brandon Stewart (William and Mary)

This paper examines the effects of source bias on statistical inferences drawn from event data analyses. Most event data projects use a single source to code events. For example, most of the early Kansas Event Data System (KEDS) datasets code only Reuters news reports or only Agence France Presse (AFP) reports. One of the goals of Project Civil Strife (PCS)—a new domestic-based event data project—is to code event data from several news sources to garner the most extensive coverage of events and control for bias often found in a single source. Herein, we examine the effects that source bias has on the inferences we draw from statistical time-series models. In this study, we concentrate on Indonesia and Cambodia from 1980-2004 using datasets produced by automated content analysis of multiple sources (i.e. Associated Press, British Broadcasting Corporation, Japan Economic Newswire, United Press International, and Xinhua). The analyses show that we draw different inferences across sources, especially when we disaggregate domestic political groups. We then combine our sources together and eliminate duplicate events to create a multi-source dataset and compare the results to the single-source models. We conclude that there are important differences in the inferences drawn depending on the source used, and that researchers should (1) check their results across multiple sources and/or (2) analyze multi-source data to test their hypotheses.

Paper prepared for delivery at the Annual Meeting of the International Studies Association, San Diego, March 2006.

Link to Adobe .pdf file of the paper

Power Laws in Event Data

Philip A. Schrodt

This paper explores the possibility that the distribution of event data might follow a power-law distribution. A variety of event data sets, both manually-coded (WEIS and BCOW) and machine-coded (IDEA and various KEDS project data sets) are analyzed. The distribution of the frequency of categories does not follow a classic Zipf's law
   log(x) = a + b * log(rank)
particularly well, but there is remarkable consistency to the modified form
   log(x) = a + b * rank
A "Type IV" power law of the form
   log(1 - CDF(x)) = a + b * log(x)
is found to fit an exponential transformation of Goldstein totals quite well (r² > 0.99) for KEDS data sets on the Levant and Balkans.
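The first two functional forms above can be compared with ordinary least squares on ranked category frequencies. The Python sketch below is only an illustration under stated assumptions: the function names are invented, the input is a plain list of category counts, and the fitting procedure (simple OLS on logged counts) is a generic choice rather than the method used in the paper. The third, "Type IV" form is not covered here.

```python
from math import log

def ols(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b, r_squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    r_squared = (sxy * sxy) / (sxx * syy) if syy else 1.0
    return a, b, r_squared

def rank_frequency_fits(counts):
    """Fit log(x) = a + b*log(rank) (Zipf) and log(x) = a + b*rank (modified form)."""
    freqs = sorted((c for c in counts if c > 0), reverse=True)
    ranks = list(range(1, len(freqs) + 1))
    log_freqs = [log(f) for f in freqs]
    return {
        "zipf": ols([log(r) for r in ranks], log_freqs),
        "log_linear": ols(ranks, log_freqs),
    }

# Invented counts that halve with each rank, so the modified form fits exactly.
fits = rank_frequency_fits([1024, 512, 256, 128, 64, 32])
```

Comparing the two `r_squared` values for a real set of category counts gives a quick check of which form describes the rank-frequency relationship better.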

Paper prepared for Claudio Cioffi-Revilla (ed.). Power Laws in the Social Sciences: Discovering Complexity and Non-Equilibrium Dynamics in the Social Universe.

Link to Adobe .pdf file of the paper

Monitoring conflict using automated coding of newswire sources: a comparison of five geographical regions

Philip A. Schrodt, Erin M. Simpson and Deborah J. Gerner

This paper discusses the experience of the Kansas Event Data System (KEDS) in developing event data sets for monitoring conflict levels in five geographical areas: the Levant (Arab-Israeli conflict), Persian Gulf, former Yugoslavia, Central Asia (Afghanistan, Armenia-Azerbaijan, former Soviet republics), and West Africa (Nigeria, Liberia, Sierra Leone). These data sets were coded from commercial news sources using the KEDS and TABARI automated coding systems. The paper discusses our experience in developing the dictionaries required for this coding, the problems with the coverage in the various areas, and provides a number of examples of the statistical summaries that can be produced with the event data. We also compare the coverage of Reuters and Agence France Presse news services for selected years in the Levant and former Yugoslavia.

Paper presented at the PRIO/Uppsala University/DECRG High-Level Scientific Conference on Identifying Wars: Systematic Conflict Research and Its Utility in Conflict Resolution and Prevention, Uppsala, Sweden 8-9 June 2001

Link to Adobe .pdf file of the paper

Automated Coding of International Event Data Using Sparse Parsing Techniques

Philip A. Schrodt

"Event data" record the interactions of political actors reported in sources such as newspapers and news services; this type of data is widely used in research in international relations. Over the past ten years, there has been a shift from coding event data by humans -- typically university students -- to using computerized coding. The automated methods are dramatically faster, enabling data sets to be coded in real time, and provide far greater transparency and consistency than human coding. This paper reviews the experience of the Kansas Event Data System (KEDS) project in developing automated coding using "sparse parsing" machine coding methods, discusses a number of design decisions that were made in creating the program, and assesses features that would improve the effectiveness of these programs.

Paper presented at the International Studies Association, Chicago, 21-24 February 2001

Link to Adobe .pdf file of the paper

Link to Adobe .pdf file of a shorter version of the paper presented at the Fifth International Conference on Social Science Methodology, Cologne, Germany, October 3 - 6, 2000.

The Machine-Assisted Creation of Historical Event Data Sets: A Practical Guide

G. Dale Thomas

As the title of this project suggests, the purpose of this paper is to present a practical guide to the machine assisted creation of historical event data sets. In an effort to make this useful to as large an audience as possible, I begin by introducing event data as commonly understood in international relations research and discuss why some scholars are arguing that this type of data is presently becoming more attractive. After a somewhat brief discussion of the foregoing, I turn to a series of practical steps specifically oriented toward the machine assisted creation of historical event data sets. Throughout this discussion, I will be relying on examples drawn from my experience in creating the Northern Ireland Systemic (NIS) data set which focuses on domestic political violence from 1968 to 1996. Finally, I present a brief description of work being carried out using the NIS data.

Paper presented at the International Studies Association meetings, Los Angeles, 14-18 March 2000

Link to Adobe .pdf version of this paper

Barbarians at the Gate: A Political Methodologist's Notes on using the National Center for Supercomputer Applications

Philip A. Schrodt
August 1999

This note is a synopsis of my recent experience using the facilities at the National Center for Supercomputer Applications (NCSA) in Urbana-Champaign. The quick summary: If you are already familiar with Unix systems, it is remarkably straightforward to run ANSI C code on these machines. The individual processors in the "supercomputer" are much slower than those in a contemporary personal computer, but one can gain substantial wall-clock speed advantages through parallelism. In most instances, parallel processing can be added to a program with very little additional effort. The computing time at NCSA is free and should be a consideration for anyone using computationally intensive methods that are parallel either in their inner-most loops (e.g. linear algebra routines) or at the outer-most loops (e.g. Monte Carlo or resampling).

[A version of this paper with active Web links can be found here]

An Event Data Set for the Arabian/Persian Gulf Region 1979-1997

Philip A. Schrodt and Deborah J. Gerner

This paper discusses a WEIS-coded event data set covering the Arabian/Persian Gulf region (Iran, Iraq, Kuwait, Oman, Saudi Arabia, Yemen, and the smaller Gulf states) for the period 15 April 1979 to 10 June 1997. The coded events cover international interactions among these states, as well as interactions with any other states or major international organizations. The data set is generated from Reuters news reports downloaded from the NEXIS data service and coded using the Kansas Event Data System (KEDS) machine-coding program.

The paper begins with a review of the process of generating a machine-coded data set, including a discussion of software we have developed to partially automate the development of dictionaries to code new geographical regions. The Gulf data are coded using a standard set of verb phrases (rather than phrases specifically adapted to the Gulf) and an actors dictionary that has been augmented only with the actors identified by a utility program that examines the source texts for actors not already found in the KEDS dictionary.

The Reuters reports generate 264,421 events when full stories are coded and 48,721 events when only lead sentences are coded. An examination of the time series that are generated when the events are aggregated by month using the Goldstein scale shows that they capture the major features of the behavior that we know to have occurred in the region. There is generally a high correlation (r > 0.75) between the series generated from lead-sentences and from full stories when the major actors of the region (Iran, Iraq, Saudi Arabia and USA) are studied. An exception to this pattern is found in interactions involving a relatively minor actor, the United Arab Emirates. Here the full-story coding provides far more events than the lead-sentence coding and shows greater variance even for interactions between major actors. We expect this will also be the case for other small Gulf states, suggesting that full-story coding may be necessary for a complete analysis of these actors.

This paper was presented at the annual meetings of the International Studies Association, Minneapolis, 18-22 March 1998.

Link to Adobe .pdf version of this paper

The Effects of Media Coverage on Crisis Assessment and Early Warning in the Middle East

Deborah J. Gerner and Philip A. Schrodt
Schmeidl, Susanne and Howard Adelman, eds. 1998. Early Warning and Early Response. New York: Columbia International Affairs Online, Columbia University Press.

The international news media have a tremendous impact on the prediction and assessment of humanitarian crises. While government agencies, IGOs and NGOs have internal sources of information about areas experiencing stress, academic researchers and the general public -- whose interests often must be mobilized to support intervention -- are likely to depend primarily on electronic sources such as CNN, Reuters, Agence France Presse and elite newspapers such as The New York Times. It is well known that the coverage provided by these sources is uneven, particularly in marginal areas such as Africa and Central Asia, and that their attention-span is limited. In this chapter we examine some characteristics of media coverage of a well-covered region -- the Arab-Israeli conflict -- and assess systematically how this coverage might affect early warning and monitoring.

We first examine the issue of "media fatigue": how does the number and type of events reported in public sources change as a conflict evolves? Using the first three years of the Palestinian intifada as a case study, we compare the reports of uses of force in event data sets based on Reuters and on The New York Times with reports in an independently-collected data set from a Jerusalem-based human rights data center. As predicted by the media fatigue hypothesis, we find that the correlation between the three sources declines over time. The correlation between Reuters and The New York Times changes in a pattern that is similar to the correlation between these sources and the human rights data source, suggesting that the correlation between the two news sources could be used as an indicator of the bias in an event data set caused by media fatigue.
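
One way to look for a declining correlation between sources is to correlate their monthly event counts in successive time windows. The sketch below assumes two equal-length lists of monthly counts and non-overlapping windows; the function names and the window scheme are illustrative assumptions, not the procedure used in the chapter.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation; returns None when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def windowed_correlation(series_a, series_b, window):
    """Correlate two monthly count series in consecutive, non-overlapping windows.

    A trailing partial window is ignored. Returns one correlation per window.
    """
    if len(series_a) != len(series_b):
        raise ValueError("series must have the same length")
    return [
        pearson_r(series_a[start:start + window], series_b[start:start + window])
        for start in range(0, len(series_a) - window + 1, window)
    ]


# Toy counts for two sources that agree in the first window and diverge in the second.
source_a = [1, 2, 3, 3, 2, 1]
source_b = [2, 4, 6, 1, 2, 3]
per_window = windowed_correlation(source_a, source_b, window=3)
```

With real data, a window of 12 months would give one correlation per year of the conflict, which could then be inspected for a downward trend.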

Second, we analyze the effects of various levels of event data aggregation on a cluster-based early warning indicator for political change in the Levant. Our results show that very high levels of aggregation -- for example distinguishing only between conflict and cooperation, or simply counting the number of events -- provide almost as much predictive power as that provided by more detailed differentiation of events. At an extreme level of aggregation, we find that a data set that indicates only whether any events were reported for a dyad provides about 50% of the clustering of political activities that is provided by detailed coding.
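
The aggregation levels mentioned above can be illustrated with a small sketch that reduces the events reported for one dyad in one period to progressively coarser summaries. The event codes and the cooperative/conflictual partition are hypothetical placeholders, not the WEIS categories or the exact scheme used in the chapter.

```python
from collections import Counter

# Hypothetical partition of event codes into cooperative and conflictual sets.
COOPERATIVE = {"CONSULT", "AGREE", "AID"}
CONFLICTUAL = {"ACCUSE", "THREATEN", "FORCE"}


def aggregate(codes, level):
    """Summarize a list of event codes for one dyad-period at a given level.

    level is one of:
      "detailed"             -> count per event code
      "conflict_cooperation" -> counts of cooperative and conflictual events
      "count"                -> total number of events
      "presence"             -> 1 if any event was reported, else 0
    """
    if level == "detailed":
        return dict(Counter(codes))
    if level == "conflict_cooperation":
        return {
            "cooperation": sum(code in COOPERATIVE for code in codes),
            "conflict": sum(code in CONFLICTUAL for code in codes),
        }
    if level == "count":
        return len(codes)
    if level == "presence":
        return int(bool(codes))
    raise ValueError(f"unknown aggregation level: {level}")


dyad_month = ["ACCUSE", "FORCE", "FORCE", "CONSULT"]
summaries = {level: aggregate(dyad_month, level)
             for level in ("detailed", "conflict_cooperation", "count", "presence")}
```

Each coarser level discards information kept by the one before it, which is what makes it possible to ask how much predictive power survives each step.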

These findings have two implications. First, the limitations of media fatigue in newswire and newspaper sources suggest that early warning models might benefit from greater attention to specialized reports such as those available through ReliefWeb; this is particularly important in long-term monitoring. Second, much of the variance provided by media reports such as Reuters and The New York Times is found in the existence of the reports -- whether a region is being covered at all -- rather than in the detailed content of those reports. Because of this limitation, complex coding schemes requiring expert interpretation of news reports are unlikely to provide significantly more information than simpler coding schemes such as those that can be implemented with all-machine coding.

Link to Adobe .pdf version of this paper

The Kansas Event Data System: A Beginner's Guide with an Application to the Study of Media Fatigue in the Palestinian Intifada

Deborah J. Gerner and Philip A. Schrodt

This paper provides a general introduction to using KEDS, the Kansas Event Data System. KEDS is a Macintosh-based machine coding system for generating event data using pattern recognition and simple linguistic parsing. The system codes from machine-readable text describing international events; the NEXIS data service, optical character recognition and CD-ROM can provide such texts. The paper is targeted at researchers who are considering using KEDS to generate event data: it describes the overall process of generating and analyzing event data, as well as providing a FAQ (Frequently Asked Questions) section and an annotated bibliography of sources of information on event data.

We illustrate the use of KEDS in political science research by examining the issue of "media fatigue": how does the number and type of events reported in public sources change as a conflict evolves? Using the first three years of the Palestinian intifada as a case study, we compare a machine-coded, Reuters-based event data set, a human-coded data set based on the New York Times and an independently-collected data set reporting levels of lethal violence during this period. As predicted by the media fatigue hypothesis, we find that the correlation between the three sources declines over time, and is generally proportional to the level of interest that the international media is showing in this region. The correlation between Reuters and the New York Times changes in a pattern that is similar to the correlation between these sources and the independent data source, suggesting that the correlation between the two news sources could be used as an indicator of the level of error in an event data set caused by media fatigue.

Poster session presentation at the American Political Science Association, San Francisco, 28 August - 1 September 1996

Political Science: KEDS, A Program for the Machine Coding of Event Data

Philip A. Schrodt, Shannon G. Davis and Judith L. Weddle
Social Science Computer Review 12,3: 561-588 (Fall 1994)

This paper describes in technical detail the Kansas Event Data System (KEDS) and summarizes our experience in coding Reuters data for the Middle East. The components of KEDS are first described; this discussion is intended to provide sufficient detail about the program that one could develop a more sophisticated machine-coding system based on our research. We then discuss a number of problems we have encountered in machine coding, focusing on the Reuters data source and the KEDS program itself. The paper concludes with a discussion of future approaches to machine coding in event data research and other potential applications of the technology.

Link to Adobe .pdf version of a revision of this paper

This paper appears in Social Science Computer Review 12,3: 561-588 (Fall 1994).

Validity Assessment of a Machine-Coded Event Data Set for the Middle East, 1982-1992

Philip A. Schrodt and Deborah J. Gerner
American Journal of Political Science 38:825-854. (1994)

This paper is a study of the validity of a machine-coded event data series for six Middle Eastern actors and the United States. The series is based on Reuters newswire story leads coded into the WEIS categories. The face validity of the data is assessed by examining the monthly net cooperation scores based on Goldstein's (1992) scale in comparison to narrative accounts of the interactions between the actors; the event data series clearly shows the major patterns of political interaction. The machine-coded data are also compared to a human-coded WEIS data set based on The New York Times and Los Angeles Times. Almost all dyads show a statistically significant correlation between the number of events reported by the two series, as well as the number of cooperative events. About half of the dyads show significant correlation in net cooperation and the number of conflictual events; many of these differences appear to be due to the higher density of events in Reuters. Finally, the machine-coded and WEIS data sets are used in two statistical time series studies and are shown to produce generally comparable results.

Link to Adobe .pdf version of a revision of this paper

This paper appears in American Journal of Political Science 38:825-854. (1994)

Machine Coding of Event Data Using Regional and International Sources

Deborah J. Gerner, Philip A. Schrodt, Ronald A. Francisco, and Judith L. Weddle

This article discusses research on the machine coding of international event data from international and regional news sources using the Kansas Event Data System (KEDS). First, we suggest that the definition of an "event" should be modified so that events are explicitly and unambiguously defined in terms of natural language. Second, we discuss KEDS: a Macintosh-based machine coding system using pattern recognition and simple linguistic parsing to code events using the WEIS event categories. Third, we compare the Reuters international news service reports with those of two specialized regional sources: the foreign policy chronologies in the Journal of Palestine Studies and the German language biweekly publication Informationen.

We conclude by noting that machine coding, when combined with the numerous sources of machine readable text that have become available in the past decade, has the potential to provide a much richer source of event data on international political interactions than that currently available. The ease of machine coding should encourage the creation of event coding schemes developed to address specific theoretical concerns; the increased density of these new data sets may allow the study of problems that could not be analyzed before.

Link to Adobe .pdf version of a revision of this paper

This paper appears in International Studies Quarterly 38:91-119 (1994).

Event Data in Foreign Policy Analysis

Philip A. Schrodt

This article is a textbook-level discussion of the use of event data in political analysis. It discusses a variety of approaches to event data -- including COPDAB, CREON and BCOW -- and illustrates this with several examples of event data research. A modified version of this paper appears as an appendix to the KEDS manual.

Link to Adobe .pdf version of a revision of this paper

This paper appears in Laura Neack, Jeanne A.K. Hey and Patrick J. Haney, eds., Foreign Policy Analysis: Continuity and Change. New York: Prentice-Hall, 1994: 145-166.

Machine Coding of Event Data

Philip A. Schrodt and Christopher Donald

This paper reports work on the development of a machine coding system for generating events data. The project is in three parts. First, the NEXIS database system is discussed as a source of events data. NEXIS provides a convenient source of essentially real-time events from a number of international sources. Programs are developed which reformat these data and assign dates, actor and target codes. Second, a system using machine-learning and statistical techniques for natural language processing is described and tested. The system is automatically "trained" using examples of the natural language text which includes the codes, and then generates a statistical scheme for classifying unknown cases. Using the English language text from the FBIS Index, ICPSR WEIS and IPPRC WEIS sets, the system shows about 90% - 95% accuracy on its training set and around 40% - 60% accuracy on a validation set. Third, an alternative system using pattern-based coding is described and tested on NEXIS. In this system, a set of 500 rules is capable of coding NEXIS events with about 70% - 80% accuracy.
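
As a rough illustration of what pattern-based coding involves, the sketch below matches a sentence against a small actor dictionary and a list of verb-phrase rules, and emits a (date, source, target, event) record. It is a toy under stated assumptions, not the system described in the paper: the dictionaries, codes and the function name are all hypothetical, and matching is plain substring search with no parsing.

```python
# Hypothetical dictionaries; a real system would hold hundreds of entries.
ACTORS = {"IRAQ": "IRQ", "KUWAIT": "KUW", "SAUDI ARABIA": "SAU"}
VERB_RULES = [("MET WITH", "CONSULT"), ("ACCUSED", "ACCUSE"), ("INVADED", "FORCE")]


def code_sentence(date, sentence):
    """Return (date, source, target, event) or None if no rule applies.

    The source is the first known actor before the matched verb phrase and
    the target is the first known actor after it.
    """
    text = sentence.upper()

    # Locate the first verb rule whose phrase occurs in the sentence.
    match = next(((text.find(phrase), event)
                  for phrase, event in VERB_RULES if phrase in text), None)
    if match is None:
        return None
    verb_pos, event = match

    # Collect actor mentions in order of appearance (naive substring match).
    mentions = sorted((text.find(name), code)
                      for name, code in ACTORS.items() if name in text)
    sources = [code for pos, code in mentions if pos < verb_pos]
    targets = [code for pos, code in mentions if pos > verb_pos]
    if not sources or not targets:
        return None
    return (date, sources[0], targets[0], event)


coded = code_sentence("1990-08-02", "Iraq invaded Kuwait on Thursday.")
uncoded = code_sentence("1990-08-02", "Oil prices rose sharply.")
```

Sentences with no matching rule, or with no actor on either side of the verb, are left uncoded.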

This paper was presented at the annual meetings of the International Studies Association, Washington, DC, April 1990.

Link to Adobe .pdf version of a revision of this paper

LAST UPDATED: 21 April 2013