2003 Political Methodology Conference Outline

Analyzing Text using Statistical Methods

2003 Society for Political Methodology Summer Conference Information Session

Friday, 18 July 2003, 8:00 - 9:00 a.m.

Link to HTML version of slides presented at the workshop

Questions? Contact Philip Schrodt at schrodt.parusanalytics.com, phone 814-865-8978.

Session Description

Over the past ten years, the statistical analysis of textual materials related to political behavior has been revolutionized by two developments. First, the World Wide Web has made vast quantities of politically relevant text available at nearly zero cost. Second, most research projects have switched from using teams of human coders to using fully automated coding done with desktop computers. A number of studies have shown that the validity of automated coding is comparable to that of human coding, while its reliability is significantly greater, particularly for data sets that need to be maintained over a long period of time. Together, these two developments allow a single scholar, such as a graduate student doing dissertation research, to produce large customized data sets.

This session will [very] briefly introduce this methodology. Topics include a general introduction to natural language processing, the key grammatical issues involved in automated coding, automating the downloading and reformatting of texts using the Perl programming language, and analyzing documents as statistical objects. Given the limited time, none of these topics will be discussed in depth, but the session should give researchers contemplating a text-based research project a good idea of what can and cannot be done with current technology.

Tentative Outline of Session Topics

Introduction to automated natural language processing

Review of automated content analysis
Keyword-based approaches
Syntactic approaches
Language problems: disambiguation and lemmatization
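
As a rough illustration of the keyword-based approach listed above, the following Python sketch counts how many tokens in a text match a small dictionary of word stems. The categories, the stems, and the function name `code_text` are hypothetical and chosen only for this example; matching by prefix is a crude stand-in for real lemmatization and does nothing about disambiguation.

```python
import re
from collections import Counter

# Hypothetical dictionary: category -> word stems (illustrative only).
KEYWORDS = {
    "conflict": {"attack", "bomb", "threat"},
    "cooperation": {"agree", "meet", "negotiat"},
}

def code_text(text):
    """Count tokens whose beginning matches a stem in each category."""
    counts = Counter()
    for token in re.findall(r"[a-z]+", text.lower()):
        for category, stems in KEYWORDS.items():
            # Prefix matching lets "attacked" match the stem "attack".
            if any(token.startswith(stem) for stem in stems):
                counts[category] += 1
    return counts

if __name__ == "__main__":
    sentence = "Rebels attacked the town while leaders agreed to meet negotiators."
    print(dict(code_text(sentence)))
```

Prefix matching also produces false positives (a stem such as "meet" would match an unrelated longer word that happens to begin the same way), which is one reason the outline treats disambiguation and lemmatization as separate problems.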

Automated downloading from the Web: "if you can see it, you've got it"

Why you really want to learn the Perl programming language
Downloading using Perl
Downloading using a scripted browser
Reformatting
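
The reformatting step above can be sketched as follows. The session itself uses Perl; this example uses Python's standard library purely for illustration and assumes the page has already been downloaded into a string. The class name `TextExtractor` and the function `html_to_text` are hypothetical names introduced here.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    """Return the text content of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

if __name__ == "__main__":
    page = ("<html><head><title>News</title><style>p {color: red}</style></head>"
            "<body><p>Talks <b>resumed</b> today.</p></body></html>")
    print(html_to_text(page))
```

Real news pages usually need site-specific cleanup (navigation menus, bylines, repeated footers) beyond this generic tag stripping.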

Text as a statistical object

Utility and frequency: Zipf's Law meets information theory
The text always bats last: why inductive methods are important
Textual feature vectors and reduction of dimensionality
Classification and clustering
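
To make the idea of text as a statistical object more concrete, here is a minimal Python sketch, not taken from the session materials. It builds a rank-frequency table (the quantity Zipf's Law describes, with frequency roughly inversely proportional to rank), turns documents into word-count feature vectors over a shared vocabulary, and compares them with cosine similarity, one common input to classification and clustering. All function names and the sample documents are assumptions made for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def rank_frequency(tokens):
    """Return (rank, word, count) tuples, most frequent first, ties alphabetical."""
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]

def feature_vectors(docs):
    """Map each document to a vector of word counts over a shared vocabulary."""
    token_lists = [tokenize(doc) for doc in docs]
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

if __name__ == "__main__":
    docs = ["the talks failed", "the talks resumed", "rain fell"]
    vocab, vectors = feature_vectors(docs)
    print(vocab)
    print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

With a realistic corpus the vocabulary runs to many thousands of words, which is why the outline pairs feature vectors with reduction of dimensionality.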

Further software resources

LAST UPDATED: 1 JUNE 2003