Slides from Workshop on Automated Content Analysis

Philip A. Schrodt
Department of Political Science
Penn State University
schrodt.parusanalytics.com

Presented at the 20th Political Methodology Summer Conference
University of Minnesota
18 July 2003

Outline of workshop

  • Overview of content analysis
  • Human vs. automated coding
  • Accessing material on the web
  • Text as a statistical object
  • Do’s and Don’t’s in contemporary content analysis
  • Further information

Key points to be made:

  • Contemporary content analysis is very different from methods used in the 1960s
  • Automated coding is superior to human coding in large projects; it is a well-developed technology
  • The Web has made a tremendous amount of data available in machine readable form, at your desktop, for free
  • Learn and use the Perl language for text processing
  • Text has regular statistical characteristics but should be treated inductively

Contemporary Content Analysis

Levels of content analysis

Analytical Term   Linguistic Term   Methodology
Thematic          Lexical           Analysis of words and phrases
Syntactic         Syntactic         Use of grammatical rules to determine role of words
Network           Semantic          Use relationships between words to disambiguate meanings

Research in other fields

Field                       Application
Library science             Automated indexing
Computational Linguistics   Automated translation and natural language processing generally
Psychology                  Personality tests
Communications Studies      Content of popular culture: books, movie and television scripts
Education                   Automated grading
Business                    Automated evaluation of resumés, aptitude tests

Resources: Books
  • Alexa, Melina and Cornelia Zuell. 1999. A Review of Software for Text Analysis. Mannheim: Zentrum für Umfragen, Methoden und Analysen.
  • Neuendorf, Kimberly A. 2002. The Content Analysis Guidebook. Thousand Oaks CA: Sage.
  • Popping, Roel. 2000. Computer-Assisted Text Analysis. Thousand Oaks CA: Sage.
  • Roberts, Carl. 1997. Text Analysis for the Social Sciences. Mahwah NJ: Lawrence Erlbaum Associates.
  • Salton, G. 1989. Automatic Text Processing. Reading, Mass: Addison-Wesley.
  • Weber, Robert Philip. 1990. Basic Content Analysis, 2d ed. Newbury Park, CA: Sage Publications.
Potential text sources relevant to political behavior
  • News reports
  • Legislation
  • Campaign platforms
  • Editorials
  • Open ended survey questions
Advantages of text as a data source
  • Text is one of the primary methods of communicating political information
  • Text is unaffected by the act of measurement
  • The source material is intentional: it was created for some political purpose
  • Web-based text can be collected in near-real-time at very little cost
  • Using automated acquisition and coding methods, a single individual can create an original, customized data set with little or no funding

Human versus Automated Coding

Reliability in content analysis
  • stability–the ability of a coder to consistently assign the same code to a given text;
  • reproducibility–intercoder reliability;
  • accuracy–the ability of a group of coders to conform to a standard.

Source: Weber (1990:17)
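
Reproducibility can be checked numerically once two coders have coded the same texts. The following is a minimal Perl sketch, not a method taken from Weber: it reports simple percent agreement and Cohen's kappa, a common chance-corrected measure that is swapped in here for illustration. The function name `intercoder_agreement` and the "coop"/"conf" codes are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: percent agreement and Cohen's kappa for two coders.
# Takes two array references of equal length holding the assigned codes.
sub intercoder_agreement {
    my ($x, $y) = @_;
    my $n = @$x;
    die "need two non-empty code lists of equal length\n" unless $n && $n == @$y;

    my $same = 0;
    my (%count_x, %count_y);
    for my $i (0 .. $n - 1) {
        $same++ if $x->[$i] eq $y->[$i];
        $count_x{ $x->[$i] }++;
        $count_y{ $y->[$i] }++;
    }

    my $observed = $same / $n;
    my $expected = 0;    # agreement expected by chance from the marginal totals
    for my $code (keys %count_x) {
        $expected += ($count_x{$code} / $n) * (($count_y{$code} || 0) / $n);
    }
    my $kappa = $expected == 1 ? 1 : ($observed - $expected) / (1 - $expected);
    return ($observed, $kappa);
}

# Made-up codes for ten texts: "coop" = cooperation, "conf" = conflict.
my @coder1 = qw(coop coop conf conf coop conf coop coop conf coop);
my @coder2 = qw(coop coop conf coop coop conf coop conf conf coop);
my ($agree, $kappa) = intercoder_agreement(\@coder1, \@coder2);
printf "agreement = %.2f, kappa = %.2f\n", $agree, $kappa;
```

Running the same function on one coder's first and second pass over the same texts gives a rough check of stability.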

Advantages of automated coding
  • Fast and inexpensive
  • Transparent: coding rules are explicit in the dictionaries
  • Reproducible: a coding system can be consistently maintained over a period of time without the "coding drift" caused by changing teams of coders.
  • Coding dictionaries can also be shared between institutions.
  • Unaffected by the biases of individual coders.
Disadvantages of automated coding
  • Automated thematic coding has problems with disambiguation; automated syntactic coding makes errors on complex sentences.
  • Requires a properly formatted, machine-readable source of text, therefore older paper and microfilm sources are difficult to code.
  • Development of new coding dictionaries is time-consuming–KEDS/PANDA initial dictionary development required two labor-years. (Modification of existing dictionaries, however, requires far less effort.)
Tradeoffs between human and machine coding
  • Machine coding uses only information that is explicit in the text; human coders are likely to use implicit knowledge of the situation.
  • Machine coding is not affected by boredom and fatigue
  • Human coders can more effectively interpret idiomatic and metaphorical text
  • Human coders can more effectively deal with complex subordinate phrases

Summary: Comparative advantages of human versus machine coding

Advantage to human coding
  • Small data sets
  • Data coded only one time at a single site
  • Existing dictionaries cannot be modified
  • Complex sentence structure
  • Metaphorical, idiomatic, or time-dependent text
  • Money available to fund coders and supervisors
Advantage to machine coding
  • Large data sets
  • Data coded over a period of time or across projects
  • Existing dictionaries can be modified
  • Simple sentence structures
  • Literal, present-tense text
  • Money is limited
Caution:

Don’t commit yourself to human coding until you have first spent several hours–not just a few minutes–doing the coding. It is a tedious, mind-numbing task.

"Doing content analysis by hand will reduce even the most fanatical post-modernist to pleading for a computer."
Philip Stone (author of General Inquirer)

Do you have the funds to hire a group of reliable, enthusiastic, and committed graduate students or undergraduate honors students with excellent substantive knowledge who will code accurately and consistently for months or years at a time?

[No, you don’t...]

Suggestions:

Design your coding protocol with automated coding in mind. Coding categories that cannot be easily differentiated by automated methods usually cannot be easily differentiated by human coders either.

Do not mix data from manual and automated coding! Optimize your coding dictionaries first, then use automated coding for the entire data set.

Disambiguation and Lemmatization

Disambiguation refers to the problem of dealing with homonyms–words that sound (and are written) the same but have different meanings. These are very common in English.

Lemmatization refers to the problem of associating various forms of a word with the same root. This can usually be done with simple stemming in English; it is more complicated in most other languages.

Disambiguation: "Bat"

Noun

  • wooden (or aluminum) cylinder used in the game of baseball
  • small flying mammal

Verb

  • act of batting ("at bat")
  • blinking ("bat an eye")

Idiomatic uses

  • "go to bat for": defending or interceding;
  • "right off the bat": immediately;
  • "bats in the belfry": commentary on an individual’s cognitive ability

Foreign phrases

  • "bat mitzvah": a girl’s coming-of-age ceremony (Hebrew).
Disambiguation, cont.

Any of these uses might be encountered in an English-language text, and multiple uses might be found in a single sentence

"The umpire didn’t bat an eye as Sarah lowered her bat to watch the bat flying around the pitcher."

Disambiguation–3

Words can also change from verbs to nouns without modification. Consider:

  • I plan to drive to the store, then wash the car.
  • When John returned from the car wash, he parked his car in the drive.

In summary: "Verbing weirds language."

Bill Watterson, Calvin and Hobbes
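
A purely lexical match cannot tell the uses of "bat" in the umpire sentence apart: it finds the same token three times. The following Perl sketch lists each occurrence with its neighboring words, which is the kind of context a disambiguation rule would need to inspect. The function name `contexts` is hypothetical, and looking only one word to each side is an assumption made for brevity.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return "left [word] right" for every occurrence of $target.
sub contexts {
    my ($sentence, $target) = @_;
    my @words = split ' ', $sentence;
    my @hits;
    for my $i (0 .. $#words) {
        (my $bare = lc $words[$i]) =~ s/[^a-z']//g;   # drop punctuation
        next unless $bare eq lc($target);
        my $left  = $i > 0       ? $words[$i - 1] : '';
        my $right = $i < $#words ? $words[$i + 1] : '';
        push @hits, sprintf('%s [%s] %s', $left, $words[$i], $right);
    }
    return @hits;
}

my $sentence = "The umpire didn't bat an eye as Sarah lowered her bat "
             . "to watch the bat flying around the pitcher.";
print "$_\n" for contexts($sentence, 'bat');
```

The three lines of output differ only in their neighbors ("didn't ... an", "her ... to", "the ... flying"), so any rule that separates the verb from the two nouns has to look beyond the word itself.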

Lemmatization

Nouns: "Syria"

  • Possessive: "Syria’s"
  • Adjectival: "Syrian"
  • Plural: "Syrians"

Verbs: "discuss"

  • 3rd person singular: "discusses"
  • Past tense: "discussed"

In general, English word forms are exceptionally simple: nouns mark only number (singular and plural), there are only two regular verb endings (-s/-es and -ed), and nouns do not change form to indicate whether they are the subject or object (case). Most other languages are more complex, but that complexity also carries additional information.
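
The word forms listed above can be collapsed with a handful of suffix rules. The following is a minimal Perl sketch of such a stemmer; the function name `simple_stem` and the specific rules are illustrative assumptions, not the rules of any particular coding program. It assumes plain ASCII apostrophes, and a rule set this small will mis-stem many words (for example "this" or "historian").

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: strip a few common English suffixes.
sub simple_stem {
    my ($word) = @_;
    $word = lc $word;
    $word =~ s/'s$//;                  # possessive: syria's -> syria
    $word =~ s/(ss|sh|ch|x)es$/$1/     # discusses -> discuss
        or $word =~ s/ians$/ia/        # syrians -> syria
        or $word =~ s/ian$/ia/         # syrian -> syria
        or $word =~ s/([^s])s$/$1/     # plain plural or third-person -s
        or $word =~ s/ed$//;           # discussed -> discuss
    return $word;
}

for my $w ("Syria's", "Syrian", "Syrians", "discusses", "discussed") {
    printf "%-10s -> %s\n", $w, simple_stem($w);
}
```

Only the first matching rule in the chain is applied, which keeps the rules from undoing one another.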

Text Processing using Perl

Why should a political methodologist learn programming?
  • It is at the guts of all of the programs you will be using anyway, so it helps you figure them out.
  • It gives you vastly more flexibility than you would otherwise have, particularly when dealing with text. Many tasks that are difficult with search-and-replace or statistical transformations are easy with a program.
  • 10-year-olds program; and 16-year-olds can cover the basics in about 10 weeks (albeit in BASIC or Pascal)

    20-year-old hackers in developing countries can write and deploy viruses for the Windows OS that cause billions of dollars of damage across the planet in a few hours!

  • It is easy to learn, though to get it down well, you need to practice, practice, practice.
Why learn programming?, continued
  • Moore’s Law–computer capacity doubles every 18 months. You don’t want to use this??

    Economist’s law–every discussion of computing must start by mentioning Moore’s Law

  • Otherwise you are at the mercy of computer programmers

    See also: plumbing, automobile repair, landscaping, remodeling

The wrong reasons to learn programming (despite what you have heard)

Instant access to fantastic jobs earning zillion-dollar salaries

  • See Micro-smurfs

  • See NASDAQ technology index, 1998-present

  • If you don’t enjoy it, you don’t want to do it for a living
  • Academic salaries are quite competitive

Only opportunity to meet, and possibly mate with, other individuals with severe personality disorders and zero social skills

Advantages of Perl

Note: these advantages assume one already knows C/C++ or Java...

  • Most of the control structures and syntax of Perl are the same as in C++ and Java.
  • Perl does not require any of the headers and variable declarations used in C and Java.
  • Perl contains a large number of additional string-oriented functions and data structures not available in C.
  • The pattern matching and substitution options are incredibly rich.
  • Perl transparently interfaces with the operating system – in other words, a Perl program can easily move, delete or rename files, fetch web pages, and the like.
Advantages of Perl, continued
  • Perl is open-source and freely available for Unix, Linux, Windows, and Macintosh. It runs as part of the operating system on many Unix machines, in Linux, and in the Macintosh OS X operating system.
  • There is extensive documentation and source code available on the Web.
  • "Perl is the glue that holds the web together"–much of what you download from the web will have been generated from Perl and is therefore easily processed with Perl.
Caveat:

Perl comes out of the Unix community and a lot of the most powerful features of the language are based on Unix models, which will seem obscure until you become familiar with them. But once you've learned the "regular expression" syntax for Perl, you can also use it in Unix.

Disadvantages of Perl
  • Perl is an interpreted language, rather than a compiled language, so it is probably too slow for writing large programs. The speed seems fine on both Unix and the Mac, however–a simple program for counting event types in a WEIS file runs through a 30,000-line data file in less than a second on a Mac G3.
  • This is a text-processing language, not a general purpose language.
For further information on Perl
  • Larry Wall, Tom Christiansen, and Jon Orwant. 2000. Programming Perl. (3rd edition) Cambridge: O'Reilly Associates.
    (this is known as the "camel book" and is the definitive guide to Perl. 1067 pages. Possibly more than you want to know.)
  • Randal Schwartz and Tom Christiansen. 1997. Learning Perl. (2nd edition) Cambridge: O'Reilly Associates.
    (covers the 30% of the language that is used most of the time)
  • http://www.perl.com (home page for the Perl enterprise)
  • http://www.perldoc.com (this links into full Perl documentation, complete with a search facility)
  • "An Instantaneous Introduction to Perl"
    [by Michael Grobe, University of Kansas]
A Perl program for downloading a known set of URLs

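use strict;
use warnings;
use LWP::Simple;   # provides get()

# Read one URL per line and append each page's HTML to a single output file
open(my $fin, '<', 'my.file.of.URLs') or die "Cannot read URL list: $!";
open(my $fout, '>', 'my.file.of.HTML.txt') or die "Cannot write output: $!";
while (my $theURL = <$fin>) {
   chomp($theURL);
   next unless length $theURL;      # skip blank lines
   my $theHTML = get($theURL);      # get() returns undef if the fetch fails
   if (defined $theHTML) { print $fout "\n\n$theHTML"; }
   else                  { warn "Could not fetch $theURL\n"; }
}
close($fin);
close($fout);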

Alternative: script a browser

This is likely to be the easier method if the site requires authentication or other security measures.

  • Step 1: Log into the site manually and manually navigate to a point where you can access the material you want
  • Step 2: Run a separate script (for example, AppleScript on the Macintosh) that drives the browser to do the downloads.
Caution:

Don’t assume that you will be able to download from a site: it may use internal scripts or other methods that get in the way. Experiment first.

However, most sites can be downloaded. In particular, any site that can be indexed by Google can be downloaded using automated methods (since that is how Google works). This provides an incentive for sites that want traffic to be Perl-friendly.

Text Filtering

This is an essential step in any original automated analysis. The text that you download will not be in a format that you can immediately analyze!

Filters are used to regularize the text for later processing. Perl is ideal for this task.

What a Text Filter Needs to Do
  • Remove the HTML tags and other web-specific coding
  • Locate the beginning and end of the document text
  • Segment the article into sentences
    Problems:
      • Periods in abbreviations
      • Abbreviations at the end of a sentence
  • Identify quotations for separate treatment
    Problems:
      • Short quoted phrases in mid-sentence
      • Bill "Mad Dog" Jones
      • Use of double apostrophes rather than quotation marks
  • Eliminate duplicate stories–comparison of character counts seems to work for this
  • Ignore everything in the file not required for the above tasks
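
Two of the tasks above, removing HTML tags and segmenting sentences, can be sketched in a few lines of Perl. This is a minimal illustration under stated assumptions, not a production filter: the function names `strip_html` and `split_sentences` and the short abbreviation list `%ABBREV` are hypothetical, only three HTML entities are decoded, and regular expressions will not cope with badly malformed HTML.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed abbreviation list; a real project would need a much longer one.
my %ABBREV = map { $_ => 1 } qw(Mr. Mrs. Dr. Gen. Sen. Rep. U.S. U.N.);

# Remove tags and a few common entities, then collapse whitespace.
sub strip_html {
    my ($html) = @_;
    $html =~ s/<(script|style)\b.*?<\/\1>//gis;   # drop embedded scripts and styles
    $html =~ s/<[^>]*>/ /g;                        # drop remaining tags
    $html =~ s/&nbsp;/ /g;
    $html =~ s/&quot;/"/g;
    $html =~ s/&amp;/&/g;
    $html =~ s/\s+/ /g;
    $html =~ s/^ | $//g;                           # trim leading and trailing space
    return $html;
}

# End a sentence at a token ending in ., ! or ? unless it is a known abbreviation.
sub split_sentences {
    my ($text) = @_;
    my (@sentences, @current);
    for my $token (split ' ', $text) {
        push @current, $token;
        if ($token =~ /[.!?]["']?$/ && !$ABBREV{$token}) {
            push @sentences, join(' ', @current);
            @current = ();
        }
    }
    push @sentences, join(' ', @current) if @current;
    return @sentences;
}

my $html = '<html><body><p>Sen. Smith met Dr. Jones on Monday. '
         . 'They discussed trade.</p></body></html>';
print "$_\n" for split_sentences(strip_html($html));
```

The abbreviation list handles periods inside a sentence, but an abbreviation that really does end a sentence ("...returned to the U.S. The president...") is still merged with the next sentence, which is the second problem noted in the list.
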
Text File Formats
  • ASCII ("text")–this is what you want.
  • MS-Word (or other word processing)–nearly impossible to process; convert to "text"
  • HTML–downloaded from the web; this is ASCII plus tags
  • RTF–"rich text format"; also ASCII with tags
  • PDF–portable document format (Adobe); see "MS-Word"
  • JPEG and other graphics formats–these are scanned images of the document and cannot be coded directly. OCR might work on some of these, but it is tedious
Operating System Differences

How is a line ended?

  • Macintosh (classic Mac OS; OS X follows Unix)–ASCII 13 (\r)
  • Unix–ASCII 10 (\n)
  • Windows–ASCII 13 + ASCII 10 (\r\n)
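
Whatever platform the text came from, a filter can map every line-ending convention onto a single one before any other processing. The following is a minimal Perl sketch; the function name `normalize_newlines` is hypothetical, and it assumes the file was read in binary mode (`binmode`) so that Perl has not already translated the line endings.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Map \r\n pairs and lone \r characters to \n; existing \n is left alone.
sub normalize_newlines {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;   # a \r, optionally followed by \n, becomes one \n
    return $text;
}

my $mixed = "line one\r\nline two\rline three\n";
print normalize_newlines($mixed);
```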

Special characters (e.g. diacriticals å, ü)–there are a wide variety of "standards";

"Unicode"–successor to ASCII; incorporates character sets of all widely-used languages (e.g. Russian, Arabic, Hebrew, Hindi, Chinese, Korean, Japanese)

Treating Text as a Statistical Object

Statistical Characteristics of Text

Zipf’s Law (a.k.a. rank-size law)

"The frequency of the occurrence of a word in a natural language is inversely proportional to its rank in frequency"

In mathematics: $f_i \propto 1/r_i$, where $f_i$ is the frequency of word $i$ and $r_i$ is its rank in frequency

In English: A small number of words account for most of word usage

Word frequency in English
% of usage   # of words
40%          50
60%          2,300
85%          8,000
99%          16,000

Total words in American English: about 600,000

Total words in technical English (all fields): about 3 million
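
The rank-frequency pattern is easy to inspect in one's own texts. The following Perl sketch counts words, sorts them by frequency, and prints the cumulative share of usage covered at each rank; the function name `rank_frequencies` and the toy sentence are hypothetical, and a real test would need a corpus of many thousands of words.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return a list of [word, count] pairs sorted from most to least frequent.
sub rank_frequencies {
    my ($text) = @_;
    my %count;
    $count{lc $1}++ while $text =~ /([A-Za-z']+)/g;
    return map  { [ $_, $count{$_} ] }
           sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
}

my $text   = "the cat and the dog and the bird saw the fish";
my @ranked = rank_frequencies($text);
my $total  = 0;
$total += $_->[1] for @ranked;

# Columns: rank, word, count, cumulative percent of all word usage
my ($rank, $running) = (0, 0);
for my $pair (@ranked) {
    $rank++;
    $running += $pair->[1];
    printf "%2d  %-6s %2d  %5.1f%%\n",
        $rank, $pair->[0], $pair->[1], 100 * $running / $total;
}
```

Even in this toy sentence the two most frequent words account for more than half of the eleven tokens.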

Functional Words

Very short words such as

Articles          a an the
Interrogatives    who what when where why how
Prepositions      to from at in above below
Auxiliary verbs   have has was were been
Markers           by in at to
                  (French de, German du, Arabic fi)
Pronouns          I you he she him her his hers

In English, the specificity of a word is generally proportional to its length

Marker words have multiple uses: Random House College Dictionary lists 29 meanings for "by," 31 for "in," 25 for "to," and 15 for "for."

Zipf’s Law collides with statistical analysis

Information theory:
the information contained in an item of data is proportional to $\log(1/f_i)$, so rarer items carry more information

Statistical Efficiency:
the standard error of a parameter estimate is inversely proportional to the square root of the sample size

The upshot: Any content analysis must balance the high level of information contained in low-frequency words with the requirements of getting a sample of those words sufficiently large for reasonable parameter estimation
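
The tension can be written out in a few lines. This is a sketch under simplifying assumptions rather than a result from the slides: it treats Zipf's Law as exact, takes information in the usual information-theoretic form $\log(1/f_i)$, and introduces $c$ (a normalizing constant), $N$ (the number of word tokens in the corpus), and $n_i$ (the expected number of occurrences of word $i$) as illustrative symbols.

```latex
\begin{align*}
f_i &= \frac{c}{r_i}
    && \text{(Zipf's Law, with $c$ a normalizing constant)} \\
I_i &\propto \log\frac{1}{f_i} = \log r_i - \log c
    && \text{(information rises with rank)} \\
n_i &= N f_i = \frac{cN}{r_i}
    && \text{(expected occurrences in $N$ tokens)} \\
\mathrm{SE}_i &\propto \frac{1}{\sqrt{n_i}} = \sqrt{\frac{r_i}{cN}}
    && \text{(standard error also rises with rank)}
\end{align*}
```

Under these assumptions information grows only with the logarithm of rank while the standard error grows with its square root, so halving the standard error for a word of a given rank takes four times as much text.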

What does a document look like as a statistical object?

Mathematically: it is a high-dimensional, sparse feature vector where the elements of the vector are the frequencies of specific words and phrases in the document

Geometrically: it is a point in a high-dimensional space.

The upshot: Anything you can do with points, you can do with documents
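
One thing that can be done with points is to measure how close they are. The following Perl sketch builds sparse word-frequency vectors as hashes and compares documents by the cosine of the angle between them. The function names `feature_vector` and `cosine` and the example sentences are hypothetical; cosine similarity is one common choice of measure, not one prescribed by the slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a sparse word-frequency vector (a hash reference) for one document.
sub feature_vector {
    my ($text) = @_;
    my %vec;
    $vec{lc $1}++ while $text =~ /([A-Za-z']+)/g;
    return \%vec;
}

# Cosine of the angle between two sparse vectors:
# 1 = same direction, 0 = no shared words.
sub cosine {
    my ($u, $v) = @_;
    my ($dot, $uu, $vv) = (0, 0, 0);
    $uu += $_ ** 2 for values %$u;
    $vv += $_ ** 2 for values %$v;
    for my $word (keys %$u) {
        $dot += $u->{$word} * $v->{$word} if exists $v->{$word};
    }
    return 0 unless $uu && $vv;    # an empty document matches nothing
    return $dot / sqrt($uu * $vv);
}

my $doc1 = feature_vector("Syria accused Israel of an attack");
my $doc2 = feature_vector("Israel accused Syria of an attack");
my $doc3 = feature_vector("Parliament passed the budget");
printf "doc1 vs doc2: %.2f\n", cosine($doc1, $doc2);   # prints 1.00
printf "doc1 vs doc3: %.2f\n", cosine($doc1, $doc3);   # prints 0.00
```

The first pair scores 1.00 even though the two sentences reverse who accused whom: a word-frequency vector discards word order, so distinctions of that kind need syntactic rather than purely lexical coding.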

Do's and Don't's in Contemporary Content Analysis

Content Analysis "Best Practices"

Wherein we introduce two individuals whom we will follow through the hazardous process of implementing a content analysis project in 2003

Draco

who does everything wrong...

Hermione

who does everything right...


Coding Methods–1

Draco

Relies on manual coding because human coders can apply contextual interpretations to the text

Hermione

Uses automated coding because it is fast, transparent, reliable, and stable.

Coding Methods–2

Draco

Establishes a coding protocol that requires extensive training and supervision of coders

Hermione

Establishes a coding protocol that takes into consideration the limitations and likely errors of automated text processing

Coding Methods–3

Draco

Hires a team of coders, trains them to 85% inter-coder reliability levels, then plays Quidditch while they complete the coding

Hermione

Avoids the use of multiple coders whenever possible, and does continual cross-checks if they are used. Coders only work on dictionaries, never with the final data

Coding Methods–4

Draco

Uses data produced by a combination of automated methods and manual correction

Hermione

Uses data produced by fully-automated methods to ensure transparency and reproducibility

Choice of Medium

Draco

Codes from paper

Hermione

Codes from sources on the web or CD-ROM

Obtaining information from the web

Draco

Downloads using cut-and-paste from web pages

Hermione

Downloads HTML source using a spider or script, and post-processes the information using a Perl program

Formatting the Input Data – 1

Draco

Assumes that data will be in a form that can be processed immediately by the coding program

Hermione

Assumes that the data will require extensive reformatting before it can be processed by the coding program

Formatting the Input Data – 2

Draco

Reformats the text using MS-Word, SAS macros, or Visual Basic, or hires a computer science graduate student to write a reformatting program in LISP

Hermione

Reformats the text herself using Perl

Review of state-of-the-art methods

Draco

Reads U.S. political science studies from the 1960s

Hermione

Studies current research in sociology, psychology and communications studies, with a focus on European research

Choosing an automated coding program

Draco

Uses the program his graduate advisor used in 1978 or obsessively checks out every program referenced on Bill Evans’s web site.

Hermione

First determines the requirements of the project, then checks several reviews of available software, and then chooses between two or three of the most promising programs.

Deciding what to code from the text

Draco

Codes as much information as possible from the text given the limitations of the automated coding program.

Hermione

Codes only the information required for the project, and focuses on maximizing the validity of that coding. Incrementally adds complexity to the coding scheme as required.

Intellectual property issues

Draco

Openly flouts copyrights because hey, we’re the Napster generation and information wants to be free; shares copyrighted primary source material

Hermione

Quietly asserts right to download copyrighted material for research purposes under the legal doctrine of fair use; shares only secondary data

Determining Coding Categories

Draco

Establishes a small number of coding categories based on deductive understanding of the knowledge domain and examination of a few source texts

Hermione

Determines coding categories by automated coding of all of the data and then applies a statistical method for reduction of dimensionality or clustering

Assumptions about the accuracy of the coding

Draco

Assumes coded data contain very little error

Hermione

Assumes data–whether coded manually or by automated methods–will contain at least 15% erroneously coded records, and probably 25% to 35%. Some of this error will be systematic.

Additional Resources

Additional Resources: Books
  • Alexa, Melina and Cornelia Zuell. 1999. A Review of Software for Text Analysis. Mannheim: Zentrum für Umfragen, Methoden und Analysen.
  • Neuendorf, Kimberly A. 2002. The Content Analysis Guidebook. Thousand Oaks CA: Sage.
  • Popping, Roel. 2000. Computer-Assisted Text Analysis. Thousand Oaks CA: Sage.
  • Roberts, Carl. 1997. Text Analysis for the Social Sciences. Mahwah NJ: Lawrence Erlbaum Associates.
  • Salton, G. 1989. Automatic Text Processing. Reading, Mass: Addison-Wesley.
  • Weber, Robert Philip. 1990. Basic Content Analysis, 2d ed. Newbury Park, CA: Sage Publications.
Additional Resources: Web Sites