Analyzing Text using Statistical Methods

2003 Society for Political Methodology Summer Conference Information Session

Friday, 18 July 2003, 8:00 - 9:00 a.m.

Link to HTML version of slides presented at the workshop

Questions? Contact Philip Schrodt by phone at 814-865-8978.

Session Description

Over the past ten years, the statistical analysis of textual materials related to political behavior has been revolutionized by two developments. First, the World Wide Web has made vast quantities of politically relevant text available at nearly zero cost. Second, most research projects have switched from using teams of human coders to using fully automated coding done on desktop computers. A number of studies have shown that the validity of automated coding is comparable to that of human coding, while its reliability is significantly greater, particularly for data sets that need to be maintained over a long period of time. Together, these two developments allow a single scholar, for example a graduate student doing dissertation research, to produce a large customized data set.

This session will [very] briefly introduce this methodology. Topics include a general introduction to natural language processing, the key grammatical issues involved in automated coding, automating the downloading and reformatting of texts using the perl computer programming language, and analyzing documents as statistical objects. Given the limitations of time, none of these topics will be discussed in depth, but the session should give researchers contemplating a research project using text a good idea of what can and cannot be done using current technology.
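As a rough illustration of what "analyzing documents as statistical objects" can mean, here is a minimal sketch. It is not taken from the session materials, and it is written in Python rather than perl purely for brevity. The two headlines are invented, and the names `tokenize` and `feature_vector` are hypothetical helpers introduced only for this example. The sketch ranks words by frequency across a tiny collection and then represents each document as a vector of word counts.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def feature_vector(text, vocabulary):
    """Return word counts for `text`, in the fixed order given by `vocabulary`."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

# Two invented headlines standing in for downloaded news text.
docs = [
    "Government troops attacked rebel positions near the border",
    "Rebel leaders met government officials for peace talks",
]

# Total frequency of each word across the collection, ranked from most
# to least frequent (the rank-frequency view that Zipf's Law describes).
totals = Counter(word for doc in docs for word in tokenize(doc))
ranked = totals.most_common()

# A shared vocabulary gives every document a vector of the same length.
vocabulary = sorted(totals)
vectors = [feature_vector(doc, vocabulary) for doc in docs]

for word, count in ranked[:5]:
    print(word, count)
```

Vectors of this kind are the usual starting point for dimensionality reduction, classification, and clustering. A real project would need a far larger corpus and more careful tokenization than this.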

Tentative Outline of Session Topics

Introduction to automated natural language processing
  • Review of automated content analysis
  • Keyword-based approaches
  • Syntactic approaches
  • Language problems: disambiguation and lemmatization
Automated downloading from the Web: "if you can see it, you've got it"
  • Why you really want to learn the perl programming language
  • Downloading using perl
  • Downloading using a scripted browser
  • Reformatting
Text as a statistical object
  • Utility and frequency: Zipf's Law meets information theory
  • The text always bats last: why inductive methods are important
  • Textual feature vectors and reduction of dimensionality
  • Classification and clustering
Further software resources