The Kansas Event Data System: A Beginner's Guide Illustrated with a Study of Media Fatigue in the Palestinian Intifada

Philip A. Schrodt and Deborah J. Gerner
Department of Political Science
University of Kansas

Poster Session presented at the
American Political Science Association meetings
San Francisco
August 1996


ABSTRACT

This exhibit provides a general introduction to research using KEDS, the Kansas Event Data System. KEDS is a Macintosh-based machine coding system for generating event data using pattern recognition and simple linguistic parsing. The system codes from machine-readable text describing international events; such texts can be obtained from the NEXIS data service, CD-ROMs, and optical character recognition. The paper written for the session is targeted at researchers who are considering using KEDS to generate event data: it describes the overall process of generating and analyzing event data, and provides a FAQ (Frequently Asked Questions) section and an annotated bibliography of sources of information on event data.

The graphs in this exhibit show three applications of KEDS data. First, we show time series graphs of aggregated events for several dyads in the Middle East, including a comparison of machine-coded and human-coded data. Second, we show graphs from our current work using KEDS data to develop early warning indicators of political change. Third, we examine the issue of media 'fatigue'. Using the Palestinian intifada as a case study, we compare the machine-coded, Reuters-based event data set with two independently coded sources of information on the levels of lethal violence during the first three years of the intifada. We find that the correlation among the three sources declines over time. The correlation between Reuters and the New York Times changes in a pattern suggesting that the correlation between these two sources could be used as an indicator of the level of error in an event data set caused by media fatigue.

Development of KEDS was funded by the National Science Foundation through Grants SES89-10738, SBR-9410023 and SES90-25130 (Data Development in International Relations Project) and the University of Kansas General Research Fund Grant 3500-X0-0038.

Exhibit Directory:

Kansas Event Data System

  • Fully operational machine coding system for generating event data
  • Operates on unedited English text such as Reuters leads
  • Coding accuracy is comparable to the level achieved by human coders
  • Statistically validated against human coded data (American Journal of Political Science 38,3)
  • Uses sparse parsing of English sentences
  • More accurate than simple pattern recognition
  • Recognizes compound nouns and compound clauses
  • Assigns common nouns to political actors
  • System is general purpose: all information required to define an event coding framework is entered from text files

Accuracy and Validity

The accuracy of KEDS depends heavily on the source text, the event coding scheme and the type of event being coded. We have done a variety of reliability checks in recent papers and articles (Gerner et al. 1992; Schrodt and Gerner 1994); other tests of the KEDS system are found in Huxtable and Pevehouse (1996) and Bond et al. (1996).

In the data set we are developing for the Middle East (Reuters lead sentences coded with the WEIS scheme), KEDS assigns the same code as a single human coder in about 75% to 85% of the cases. Approximately 10% of the Reuters leads have a syntactic structure that is too complicated or too idiosyncratic for KEDS to handle properly; some of the residual coding disagreement comes from ambiguities in the WEIS coding categories themselves. In an experiment where dictionaries were optimized for the coding of a single day of Reuters leads, the PANDA project, using a coding scheme substantially more detailed than WEIS, achieved 91.7% machine coding accuracy; this probably represents the upper limit of accuracy for Reuters leads and a program using KEDS's sparse parsing approach (Bond, Bennett & Vogele 1994:9). This level of coding accuracy is comparable to that achieved in event data projects using human coders: Burgess and Lawton (1972:58) report a mean intercoder reliability of 82% for the eight projects where that statistic is known.

Schrodt and Gerner (1993, 1994) assess the face validity of KEDS-generated data for the Middle East, 1982-1993; the time series produced by the program correspond closely to the patterns expected from narrative accounts of the interactions between the actors. In these papers, the KEDS data were also compared to the human-coded WEIS data set for 1982-1991. For almost all dyads, there was a statistically significant correlation between the total number of events reported by the two series, as well as between the numbers of cooperative events. For the number of conflictual events and for net cooperation values aggregated using the Goldstein (1992) scale, there was a statistically significant correlation in about half of the dyads. Many of the differences between the two series appear to be due to the higher density of events in KEDS compared to the New York Times-based WEIS: the Reuters series contained, on average, three times as many events as WEIS. Two statistical studies using the KEDS and WEIS data sets, one involving cross-correlation and the other spectral analysis, produced generally comparable results, although some idiosyncratic differences are found in specific dyads.
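As a minimal sketch of the kind of dyad-level comparison described above, the following Python fragment correlates monthly event counts from two series; the counts shown are invented placeholders, not values from the KEDS or WEIS data sets.

    # Minimal sketch of a dyad-level series comparison; the monthly counts
    # below are invented placeholders, not KEDS or WEIS values.
    from statistics import mean

    def pearson_r(x, y):
        """Pearson correlation between two equal-length series."""
        mx, my = mean(x), mean(y)
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        var_x = sum((a - mx) ** 2 for a in x)
        var_y = sum((b - my) ** 2 for b in y)
        return cov / (var_x * var_y) ** 0.5

    # Hypothetical monthly event counts for one dyad from two sources
    reuters_counts = [12, 8, 15, 30, 22, 9, 14, 11, 25, 18, 7, 16]
    nyt_counts     = [4, 3, 6, 11, 7, 2, 5, 3, 9, 6, 2, 5]

    print("r = %.3f" % pearson_r(reuters_counts, nyt_counts))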

Standard KEDS parsing:

  • Identifies the source/subject, verb phrase and target/object (a toy sketch of this dictionary-driven matching follows the lists below)
  • Identifies compound actors
  • Reduces titles to a single actor reference
  • Identifies compound clauses within a sentence
  • Locates the referents of pronouns
  • Eliminates comma-delimited subordinate clauses
Additional parsing features that can be activated by commands in the .Options file:
  • Associates agents with actors
  • Evaluates prepositional phrases to determine the location of an event
  • Applies a complexity filter to bypass sentences that are unlikely to code correctly
  • Applies grammatical rules to transform the sentence structure
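To make the basic source-verb-target step concrete, here is a toy Python sketch of dictionary-driven sparse parsing. The actor entries, verb phrases, WEIS-style codes and matching rules are simplified illustrations of the general idea, not the actual KEDS dictionaries or algorithm.

    # Toy sketch of dictionary-driven sparse parsing: find a source actor,
    # a verb phrase and a target actor in a lead sentence. The entries and
    # WEIS-style codes are illustrative, not the actual KEDS dictionaries.
    ACTORS = {"ISRAEL": "ISR", "PALESTINIANS": "PAL", "JORDAN": "JOR"}
    VERBS = {"ACCUSED": "121", "MET WITH": "031", "FIRED ON": "223"}

    def code_lead(sentence):
        text = sentence.upper()
        # Record where each known actor appears; take the first hit as the
        # source and the last as the target (a real parser does far more)
        hits = sorted((text.find(p), code) for p, code in ACTORS.items() if p in text)
        event = next((code for p, code in VERBS.items() if p in text), None)
        if len(hits) < 2 or event is None:
            return None  # too little structure recovered to code an event
        return (hits[0][1], event, hits[-1][1])

    print(code_lead("Israeli troops fired on Palestinians in Gaza on Monday."))
    # -> ('ISR', '223', 'PAL')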

Advantages of Machine Coding

  • Fast and inexpensive; data sets can be maintained in real time
  • Transparent: coding rules are explicit in the dictionaries
  • Reproducible: a coding system can be consistently maintained over a period of time without the "coding drift" caused by changing teams of coders
  • Coding dictionaries can be shared between institutions
  • Unaffected by the biases of individual coders

Disadvantages

  • Machine coding makes errors on complex sentences
  • Requires a properly formatted, machine-readable source of text
  • Development of new coding dictionaries is time-consuming

Machine Coding using KEDS

STEP 1: Locate and reformat a set of machine-readable texts

The very first step in doing research with KEDS is locating a source of machine-readable text. This will usually come from an on-line data service such as NEXIS or from a CD-ROM. In all likelihood, the original text will not be in the input format used by KEDS. We have developed a number of reformatting programs that remove the irrelevant information found in a NEXIS download and reformat the text; some of these programs, along with their Pascal and C source code, can be found at the KEDS web site.

However, unless your original text is in exactly the same format as ours from NEXIS, you will need to write your own filter or modify one of ours. Because machine-readable data are usually consistently formatted, this is not very difficult provided you know (or know someone who knows) a programming language such as BASIC, Pascal or C; the macro language in Microsoft Word is another possibility for reformatting. A minimal sketch of such a filter appears below.
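To illustrate what such a filter does, here is a minimal Python sketch (in place of the Pascal or C of our actual filters). The input layout, with "DATE:" and "LEAD:" field markers and an end-of-record line, is purely hypothetical; a real NEXIS download and the actual KEDS input format each have their own documented structure.

    # Minimal sketch of a reformatting filter; the field markers and the
    # output layout are hypothetical, not the real NEXIS or KEDS formats.
    import sys

    def filter_records(infile, outfile):
        date, lead = None, None
        for raw in infile:
            line = raw.strip()
            if line.startswith("DATE:"):            # hypothetical field marker
                date = line[len("DATE:"):].strip()
            elif line.startswith("LEAD:"):          # hypothetical field marker
                lead = line[len("LEAD:"):].strip()
            elif line == "---END RECORD---" and date and lead:
                outfile.write(date + "\t" + lead + "\n")
                date, lead = None, None
            # everything else (headers, copyright notices) is discarded

    if __name__ == "__main__":
        with open(sys.argv[1]) as f_in, open(sys.argv[2], "w") as f_out:
            filter_records(f_in, f_out)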

STEP 2: Develop the initial coding dictionaries

KEDS uses large dictionaries of proper nouns and verb phrases to code the actors and events it finds in the source text. If you intend to code political events, you will probably find it easier to modify the dictionaries developed by other projects than to start a new dictionary from scratch. The advantage of this approach is that the existing dictionaries have already identified most of the English vocabulary used in Reuters, so even if you expect to change the coding scheme substantially, you will know what types of phrases to expect. The KEDS (WEIS, Middle East) and Pevehouse (BCOW, Middle East) dictionaries are available from our project and are archived at the ICPSR; the PANDA dictionaries are available from the PANDA project (contact: dbond@cfia.harvard.edu). All three sets of dictionaries were developed by coding Reuters lead sentences.

STEP 3: Fine-tune the dictionaries

With the initial dictionaries incorporated into your system, the next step is fine-tuning ("tweaking") the phrases to work correctly with your data and coding scheme. This is done by going through a large number of texts and modifying the vocabulary as needed; this process will also give you an indication of the accuracy of the system. Most vocabulary modifications involve the addition of specific individual actors (e.g. political leaders; geographical place names) and the addition of verb phrases describing behaviors specific to the problem you are considering.

While you are fine-tuning the dictionaries you might also look at some of the advanced features of KEDS, such as the use of substitution rules, word classes, the complexity filter, and additional coding features such as issues and content analysis counts. The grammatical transformation rules may enable you to develop general solutions to problems that would otherwise require a large number of specific phrases, and the additional coding features allow information to be extracted from a sentence beyond the basic source-event-target structure of event data.

STEP 4: Autocode the entire data set

Unless you intend to use KEDS for machine-assisted coding of your entire data set, the data should be autocoded once the accuracy of the dictionaries has reached a level you are comfortable with. Autocoding ensures that the coding rules have been applied consistently across the entire data set, rather than having the part of the data that was used to develop the dictionaries coded by hand and the remainder machine coded. Autocoding also ensures that your coding can be replicated by later researchers, as well as by yourself at a later date.

STEP 5: Aggregate the data for statistical analysis

Because event data form an irregular, nominal-measure (categorical) time series, they must be aggregated before they can be used by standard statistical programs such as SPSS and SAS or by graphical tools such as spreadsheets; these all expect a regular, interval-measure (numerical) time series. The transformation is usually done by mapping each event code to an interval-level scale (for example, Goldstein 1992) and then aggregating the data by actor-pair and week, month or year using averages or totals.

It is possible to do this aggregation by scripting the data transformation facilities of a statistical program. However, this process tends to be slow and awkward, particularly when dealing with a large number of actor pairs. As an alternative, we have developed an aggregation program, KEDS_Count, to automate the process; the program and its documentation are on the KEDS disks. In contrast to the text reformatting programs, which need to be customized, KEDS_Count should handle most situations requiring aggregation of event data into a time series.
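As a sketch of this aggregation step (not KEDS_Count itself), the following Python fragment maps WEIS-style event codes to scale values and totals net cooperation by dyad and month. The scale values and events are illustrative placeholders, not the actual Goldstein weights or coded data.

    # Sketch of event aggregation (not KEDS_Count itself): map each event
    # code to a scale value, then total net cooperation by dyad and month.
    # The weights and events below are placeholders, not Goldstein values.
    from collections import defaultdict

    SCALE = {"031": 1.0, "121": -2.2, "223": -10.0}  # illustrative weights

    # Events as (date "YYMMDD", source, target, WEIS-style code)
    events = [
        ("880105", "ISR", "PAL", "223"),
        ("880117", "PAL", "ISR", "121"),
        ("880203", "ISR", "PAL", "031"),
    ]

    totals = defaultdict(float)
    for date, src, tgt, code in events:
        month = date[:4]                  # "YYMM" aggregation period
        totals[(src, tgt, month)] += SCALE.get(code, 0.0)

    for (src, tgt, month), net in sorted(totals.items()):
        print("%s>%s %s: net cooperation %+.1f" % (src, tgt, month, net))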

For Further Information:

KEDS

Gerner, Deborah J., Philip A. Schrodt, Ronald A. Francisco, and Judith L. Weddle. 1994. "The Machine Coding of Events from Regional and International Sources." International Studies Quarterly 38:91-119.

Description of the DDIR-sponsored KEDS research; includes tests on German-language sources and a foreign affairs chronology.

Schrodt, Philip A. and Deborah J. Gerner. 1994. "Validity Assessment of a Machine-Coded Event Data Set for the Middle East, 1982-1992." American Journal of Political Science 38:825-854.

Statistically compares KEDS data to a human-coded data set covering the same time period and actors.

Schrodt, Philip A., Shannon G. Davis and Judith L. Weddle. 1994. "Political Science: KEDS-A Program for the Machine Coding of Event Data." Social Science Computer Review 12,3: 561-588.

General description of KEDS and an extended discussion of some of the problems encountered coding Reuters.

Event Data

Schrodt, Philip A. 1994. "Event Data in Foreign Policy Analysis." In Laura Neack, Jeanne A.K. Hey, and Patrick J. Haney, eds., Foreign Policy Analysis: Continuity and Change. New York: Prentice-Hall, pp. 145-166.

Textbook-level introduction to the general topic of event data analysis.

Duffy, Gavin, ed. 1994. International Interactions 20,1-2.

Special double-issue on event data analysis.

Merritt, Richard L., Robert G. Muncaster, and Dina A. Zinnes, eds. 1994. Management of International Events: DDIR Phase II. Ann Arbor: University of Michigan Press.

Reports from the DDIR projects.

Computational Methods for Interpreting Text

Advanced Research Projects Agency (ARPA). 1993. Proceedings of the Fifth Message Understanding Conference (MUC-5). Los Altos, CA: Morgan Kaufmann.

Reports from a large-scale ARPA project on developing computer programs to interpret news reports on terrorism in Latin America.

Pinker, Steven. 1994. The Language Instinct. New York: W. Morrow and Co.

Excellent non-technical introduction to contemporary linguistics; extensive discussion of the problems of parsing English.

Salton, Gerald. 1989. Automatic Text Processing. Reading, Mass: Addison-Wesley.

General introduction to the use of computers to process text; covers a wide variety of methods.