Content Analysis Slides

Slides from Workshop on Automated Content Analysis

Philip A. Schrodt
Department of Political Science
Penn State University
schrodt.parusanalytics.com

Presented at the 20th Political Methodology Summer Conference
University of Minnesota
18 July 2003

Outline of workshop

Overview of content analysis
Human vs. automated coding
Accessing material on the web
Text as a statistical object
Do’s and Don’t’s in contemporary content analysis
Further information

Key points to be made:

Contemporary content analysis is very different from methods used in the 1960s
Automated coding is superior to human coding in large projects; it is a well-developed technology
The Web has made a tremendous amount of data available in machine readable form, at your desktop, for free
Learn and use the Perl language for text processing
Text has regular statistical characteristics but should be treated inductively

Contemporary Content Analysis

Levels of content analysis

Analytical Term	Linguistic Term	Methodology
Thematic	Lexical	Analysis of words and phrases
Syntactic	Syntactic	Use of grammatical rules to determine role of words
Network	Semantic	Use relationships between words to disambiguate meanings

Research in other fields

Library science	Automated indexing
Computational Linguistics	Automated translation and natural language processing generally
Psychology	Personality tests
Communications Studies	Content of popular culture: books, movie and television scripts
Education	Automated grading
Business	Automated evaluation of resumés, aptitude tests

Resources: Books

Alexa, Melina and Cornelia Zuell. 1999. A Review of Software for Text Analysis. Mannheim: Zentrum für Umfragen, Methoden und Analysen.
Neuendorf, Kimberly A. 2002. The Content Analysis Guidebook. Thousand Oaks CA: Sage
Popping, Roel. 2000. Computer-Assisted Text Analysis. Thousand Oaks CA: Sage
Roberts, Carl. 1997. Text Analysis for the Social Sciences. Mahwah NJ: Lawrence Erlbaum Associates
Salton, G. 1989. Automatic Text Processing. Reading, Mass: Addison-Wesley.
Weber, Robert Philip. 1990. Basic Content Analysis, 2d ed. Newbury Park, CA: Sage Publications.

Potential text sources relevant to political behavior

News reports
Legislation
Campaign platforms
Editorials
Open ended survey questions

Advantages of text as a data source

Text is one of the primary methods of communicating political information
Text is unaffected by the act of measurement
The source material is intentional: it was created for some political purpose
Web-based text can be collected in near-real-time at very little cost
Using automated acquisition and coding methods, a single individual can create an original, customized data set with little or no funding

Human versus Automated Coding

Reliability in content analysis

stability–the ability of a coder to consistently assign the same code to a given text;
reproducibility–intercoder reliability;
accuracy–the ability of a group of coders to conform to a standard.

Source: Weber (1990:17)

Advantages of automated coding

Fast and inexpensive
Transparent: coding rules are explicit in the dictionaries
Reproducible: a coding system can be consistently maintained over a period of time without the "coding drift" caused by changing teams of coders.
Coding dictionaries can also be shared between institutions
Unaffected by the biases of individual coders.
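
The transparency point can be illustrated with a minimal sketch of dictionary-based thematic coding. The dictionary entries, the category names, and the function name code_text are hypothetical, invented for this example; they are not taken from KEDS or any other real coding dictionary.

```perl
use strict;
use warnings;

# Hypothetical coding dictionary: word => category.
my %dictionary = (
    accused  => 'CONFLICT',
    attacked => 'CONFLICT',
    met      => 'COOPERATION',
    agreed   => 'COOPERATION',
);

# Count how many dictionary words of each category appear in a text.
sub code_text {
    my ($text) = @_;
    my %counts;
    for my $word (split /\W+/, lc $text) {
        next unless exists $dictionary{$word};
        $counts{ $dictionary{$word} }++;
    }
    return \%counts;
}

my $result = code_text("Syria accused Israel, but the ministers met and agreed to talks.");
for my $category (sort keys %$result) {
    print "$category: $result->{$category}\n";
}
```

Every coding decision here can be traced to a dictionary entry, which is what makes the rules explicit and shareable. The same literal matching is also why thematic coding struggles with ambiguous words.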

Disadvantages of automated coding

Automated thematic coding has problems with disambiguation; automated syntactic coding makes errors on complex sentences.
Requires a properly formatted, machine-readable source of text, therefore older paper and microfilm sources are difficult to code.
Development of new coding dictionaries is time-consuming–KEDS/PANDA initial dictionary development required two labor-years. (Modification of existing dictionaries, however, requires far less effort.)

Tradeoffs between human and machine coding

Machine coding uses only information that is explicit in the text; human coders are likely to use implicit knowledge of the situation.
Machine coding is not affected by boredom and fatigue
Human coders can more effectively interpret idiomatic and metaphorical text
Human coders can more effectively deal with complex subordinate phrases

Summary: Comparative advantages of human versus machine coding

Advantage to human coding

Small data sets
Data coded only one time at a single site
Existing dictionaries cannot be modified
Complex sentence structure
Metaphorical, idiomatic, or time-dependent text
Money available to fund coders and supervisors

Advantage to machine coding

Large data sets
Data coded over a period of time or across projects
Existing dictionaries can be modified
Simple sentence structures
Literal, present-tense text
Money is limited

Caution:

Don’t commit yourself to human coding until you have first spent several hours–not just a few minutes–doing the coding. It is a tedious, mind-numbing task.

"Doing content analysis by hand will reduce even the most fanatical post-modernist to pleading for a computer."
Philip Stone (author of General Inquirer)

Do you have the funds to hire a group of reliable, enthusiastic, and committed graduate students or undergraduate honors students with excellent substantive knowledge who will code accurately and consistently for months or years at a time?

[No, you don’t...]

Suggestions:

Design your coding protocol with automated coding in mind. Coding categories that cannot be easily differentiated by automated methods usually cannot be easily differentiated by human coders either.

Do not mix data from manual and automated coding! Optimize your coding dictionaries first, then use automated coding for the entire data set.

Disambiguation and Lemmatization

Disambiguation refers to the problem of dealing with homonyms–words that sound (and are written) the same but have different meanings. These are very common in English.

Lemmatization refers to the problem of associating various forms of a word with the same root. This can usually be done with simple stemming in English; it is more complicated in most other languages.

Disambiguation: "Bat"

Noun

wooden (or aluminum) cylinder used in the game of baseball
small flying mammal

Verb

act of batting ("at bat")
blinking ("bat an eye")

Idiomatic uses

"go to bat for": defending or interceding;
"right off the bat": immediately;
"bats in the belfry": commentary on an individual’s cognitive ability

Foreign phrases

"bat mitzvah": a girl’s coming-of-age ceremony (Hebrew).

Disambiguation, cont.

Any of these uses might be encountered in an English-language text, and multiple uses might be found in a single sentence

"The umpire didn’t bat an eye as Sarah lowered her bat to watch the bat flying around the pitcher."

Disambiguation–3

Words can also change from verbs to nouns without modification: Consider

I plan to drive to the store, then wash the car
When John returned from the car wash, he parked his car in the drive.

In summary: "Verbing weirds language."

Bill Watterson, Calvin and Hobbes

Lemmatization

Nouns: "Syria"

Possessive: "Syria’s"
Adjectival: "Syrian"
Plural: Syrians

Verbs: "discuss"

3rd person singular: "discusses"
Past tense: "discussed"

In general, English-language word forms are exceptionally simple: nouns have only two numbers (singular and plural), there are only two regular verb endings (-s/-es and -ed), and nouns do not change form to indicate whether they are a subject or an object (case). Most other languages are more complex, but that complexity also carries additional information.
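
As a rough illustration of simple stemming for the example words above, the following sketch strips the possessive, plural, and regular verb endings. The function name simple_stem and its three rules are assumptions made for this example. It is not a full stemming algorithm and will mis-stem many English words (for instance, any base word that happens to end in "ed").

```perl
use strict;
use warnings;

# Naive suffix stripping; the rules are illustrative only.
sub simple_stem {
    my ($word) = @_;
    $word = lc $word;
    $word =~ s/'s$//;                # possessive: syria's -> syria
    $word =~ s/sses$/ss/             # discusses -> discuss
        or $word =~ s/ed$//          # discussed -> discuss
        or $word =~ s/(?<!s)s$//;    # plural: syrians -> syrian
    return $word;
}

for my $form ("Syria's", "Syrians", "discusses", "discussed") {
    printf "%-10s -> %s\n", $form, simple_stem($form);
}
```

Note that these rules reduce "Syrians" to "syrian" but do not connect it to "syria". Linking adjectival forms to the noun usually takes an explicit dictionary entry rather than a suffix rule.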

Text Processing using Perl

Why should a political methodologist learn programming?

It is at the guts of all of the programs you will be using anyway, so it helps you figure them out.
It gives you vastly more flexibility than you would otherwise have, particularly dealing with text. Things can be done very easily with a program that are difficult with a search-and-replace or statistical transformations
10-year-olds program; and 16-year-olds can cover the basics in about 10 weeks (albeit in BASIC or Pascal)
20-year-old hackers in developing countries can write and deploy viruses for the Windows OS that cause billions of dollars of damage across the planet in a few hours!
It is easy to learn, though to get it down well, you need to practice, practice, practice.

Why learn programming?, continued

Moore’s Law–computer capacity doubles every 18 months. You don’t want to use this??
Economist’s law–every discussion of computing must start by mentioning Moore’s Law
Otherwise you are at the mercy of computer programmers
See also: plumbing, automobile repair, landscaping, remodeling

The wrong reasons to learn programming (despite what you have heard)

Instant access to fantastic jobs earning zillion-dollar salaries

See Micro-smurfs
See NASDAQ technology index, 1998-present
If you don’t enjoy it, you don’t want to do it for a living
Academic salaries are quite competitive

Only opportunity to meet, and possibly mate with, other individuals with severe personality disorders and zero social skills

Advantages of Perl

Note: these advantages assume one already knows C/C++ or Java...

Most of the control structures and syntax of Perl are the same as in C++ and Java.
Perl does not require any of the headers and variable declarations used in C and Java.
Perl contains a large number of additional string-oriented functions and data structures not available in C.
The pattern matching and substitution options are incredibly rich.
Perl transparently interfaces with the operating system – in other words, a Perl program can easily move, delete or rename files, fetch web pages, and the like.

Advantages of Perl, continued

Perl is open-source and freely available for Unix, Linux, Windows, and Macintosh. It runs as part of the operating system on many Unix machines, in Linux, and in the Macintosh OS X operating system.
There is extensive documentation and source code available on the Web.
"Perl is the glue that holds the web together"–much of what you download from the web will have been generated from Perl and is therefore easily processed with Perl

Caveat:

Perl comes out of the Unix community and a lot of the most powerful features of the language are based on Unix models, which will seem obscure until you become familiar with them. But once you've learned the "regular expression" syntax for Perl, you can also use it in Unix.

Disadvantages of Perl

Perl is an interpreted language, rather than a compiled language, so it is probably too slow for writing large programs. The speed seems fine on both Unix and the Mac, however–a simple program that counts event types in a WEIS file runs through a 30,000 line data file in less than a second on a Mac G3.
This is a text-processing language, not a general purpose language.

For further information on Perl

Larry Wall, Tom Christiansen, and Jon Orwant. 2000. Programming Perl. (3rd edition) Cambridge: O'Reilly & Associates.
(this is known as the "camel book" and is the definitive guide to Perl. 1067 pages. Possibly more than you want to know.)
Randal Schwartz and Tom Christiansen. 1997. Learning Perl. (2nd edition) Cambridge: O'Reilly & Associates.
(covers the 30% of the language that is used most of the time)
http://www.perl.com (home page for the Perl enterprise)
http://www.perldoc.com (this links into full Perl documentation, complete with a search facility)
"An Instantaneous Introduction to Perl"
[by Michael Grobe, University of Kansas]

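A Perl program for downloading a known set of URLs

    use LWP::Simple;    # provides the get() function

    # Read one URL per line and append each page's HTML to a single output file.
    open(FIN, "<", "my.file.of.URLs") or die "Cannot open URL list: $!";
    open(FOUT, ">", "my.file.of.HTML.txt") or die "Cannot open output file: $!";
    while ($theURL = <FIN>) {
        chomp($theURL);
        $theHTML = get($theURL);    # get() returns undef if the download fails
        print FOUT "\n\n$theHTML" if defined $theHTML;
    }
    close(FIN);
    close(FOUT);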

Alternative: script a browser

This is likely to be the easier method if the site requires authentication or other security measures

Step 1: Log into the site manually and manually navigate to a point where you can access the material you want
Step 2: Run a separate script (for example, AppleScript on the Macintosh) that drives the browser to do the downloads.

Caution:

Don’t assume that you will be able to download from a site: it may use internal scripts or other methods that get in the way. Experiment first.

However, most sites can be downloaded. In particular, any site that can be indexed by Google can be downloaded using automated methods (since that is how Google works). This provides an incentive for sites that want traffic to be Perl-friendly.

Text Filtering

This is an essential step in any original automated analysis. The text that you download will not be in a format that you can immediately analyze!

Filters are used to regularize the text for later processing. Perl is ideal for this task.

What a Text Filter Needs to Do

Remove the HTML tags and other web-specific coding
Locate the beginning and end of the document text
Segment article into sentences. Problems: periods in abbreviations; abbreviations at the end of a sentence
Identify quotations for separate treatment. Problems: short quoted phrases in mid-sentence (…Bill "Mad Dog" Jones…); use of double apostrophes rather than quotation marks
Eliminate duplicate stories–comparison of character counts seems to work for this
Ignore everything in the file not required for the above tasks
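
Three of these steps (removing tags, segmenting sentences, and dropping duplicates by character count) can be sketched in a few lines of Perl. This is a minimal illustration under stated assumptions, not a production filter: the function names and the short abbreviation list are hypothetical, and the regular-expression tag stripper is crude (it will fail on a ">" inside an attribute value, for example).

```perl
use strict;
use warnings;

# Hypothetical abbreviation list; a real filter would need a much longer one.
my @abbreviations = qw(Mr Mrs Dr Gen);

# Replace HTML tags with spaces, then collapse and trim whitespace.
sub strip_html {
    my ($html) = @_;
    $html =~ s/<[^>]*>/ /g;
    $html =~ s/\s+/ /g;
    $html =~ s/^\s+|\s+$//g;
    return $html;
}

# Split text into sentences, protecting periods that follow known abbreviations.
sub split_sentences {
    my ($text) = @_;
    for my $abbr (@abbreviations) {
        $text =~ s/\b\Q$abbr\E\./$abbr\x00/g;    # hide the period temporarily
    }
    my @sentences = split /(?<=[.!?])\s+/, $text;
    s/\x00/./g for @sentences;                   # restore the hidden periods
    return @sentences;
}

# Keep only the first story of each character count.
sub drop_duplicates {
    my @stories = @_;
    my %seen;
    return grep { !$seen{ length $_ }++ } @stories;
}

my $text = strip_html("<p>Dr. Smith met Gen. Jones on Friday.</p> <p>No agreement was reached.</p>");
print "$_\n" for split_sentences($text);
```

The sketch does not solve the case of an abbreviation that ends a sentence: a protected abbreviation never triggers a split, so the next sentence is merged into the current one.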

Text File Formats

ASCII ("text")–this is what you want.
MS-Word (or other word processing)–nearly impossible to process; convert to "text"
HTML–downloaded from the web; this is ASCII plus tags
RTF–"rich text format"; also ASCII with tags
PDF–portable document format (Adobe); see "MS-Word"
JPEG and other graphics formats–these are scanned images of the document and cannot be coded directly. OCR might work on some of these, but it is tedious

Operating System Differences

How is a line ended?

Macintosh (classic Mac OS)–ASCII 13 (\r)
Unix–ASCII 10 (\n)
Windows–ASCII 13 + ASCII 10 (\r\n)
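
Whichever convention a source file uses, a filter can convert all line endings to a single newline before further processing. A minimal sketch, in which the function name normalize_newlines is an assumption for this example:

```perl
use strict;
use warnings;

# Convert carriage-return + line-feed pairs and bare carriage returns to a single line feed.
sub normalize_newlines {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;
    return $text;
}

my $mixed = "line one\r\nline two\rline three\n";
print normalize_newlines($mixed);
```

The single pattern \r\n? handles both cases: a carriage return followed by a line feed is replaced as a pair, and a carriage return on its own is replaced by itself.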

Special characters (e.g. diacriticals å, ü)–there are a wide variety of "standards";

"Unicode"–successor to ASCII; incorporates character sets of all widely-used languages (e.g. Russian, Arabic, Hebrew, Hindi, Chinese, Korean, Japanese)

Treating Text as a Statistical Object

Statistical Characteristics of Text

Zipf’s Law (a.k.a. rank-size law)

"The frequency of the occurrence of a word in a natural language is inversely proportional to its rank in frequency"

In mathematics: f_i ∝ 1/r_i

In English: A small number of words account for most of word usage

Word frequency in English

% of usage	# of words
40%	50
60%	2,300
85%	8,000
99%	16,000

Total words in American English: about 600,000

Total words in technical English (all fields): about 3 million
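
A rank-frequency table of this kind can be computed for any text with a few lines of Perl. The sketch below is a minimal illustration: the function name rank_words and the sample sentence are assumptions for this example, and the tokenization rule (lowercased runs of letters and apostrophes) is deliberately simplistic.

```perl
use strict;
use warnings;

# Return [word, count] pairs sorted by descending frequency, then alphabetically.
sub rank_words {
    my ($text) = @_;
    my @words = lc($text) =~ /[a-z']+/g;
    my %freq;
    $freq{$_}++ for @words;
    return map { [ $_, $freq{$_} ] }
           sort { $freq{$b} <=> $freq{$a} or $a cmp $b } keys %freq;
}

my $sample = "the cat saw the dog and the dog saw the bird";
my $rank = 0;
for my $pair (rank_words($sample)) {
    $rank++;
    printf "%2d  %-5s %d\n", $rank, @$pair;
}
```

Even in this tiny sample the top-ranked word ("the") accounts for 4 of the 11 tokens. In a realistic corpus, the ranked list would be dominated by the same kind of short functional words.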

Functional Words

Very short words such as

Articles	a an the
Interrogatives	who what when where why how
Prepositions	to from at in above below
Auxiliary verbs	have has was were been
Markers	by in at to (French de, German du, Arabic fi)
Pronouns	I you he she him her his hers

In English, the specificity of a word is generally proportional to its length.

Marker words have multiple uses: Random House College Dictionary lists 29 meanings for "by," 31 for "in," 25 for "to," and 15 for "for."

Zipf’s Law collides with statistical analysis

Information theory:
the information contained in an item of data is proportional to -log(f_i), so the rarer the item, the more information it carries

Statistical Efficiency:
the standard error of a parameter estimate is inversely proportional to the square root of the sample size

The upshot: Any content analysis must balance the high level of information contained in low-frequency words with the requirements of getting a sample of those words sufficiently large for reasonable parameter estimation

What does a document look like as a statistical object?

Mathematically: it is a high-dimensional, sparse feature vector where the elements of the vector are the frequencies of specific words and phrases in the document

Geometrically: it is a point in a high-dimensional space.

The upshot: Anything you can do with points, you can do with documents
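
To make the geometric view concrete, the sketch below stores each document as a sparse word-frequency vector (a Perl hash) and compares documents by the cosine of the angle between their vectors. Cosine similarity is only one of many possible measures and is chosen here as an assumption for illustration; the function names word_vector and cosine are likewise hypothetical.

```perl
use strict;
use warnings;

# Sparse feature vector: only words that actually occur are stored.
sub word_vector {
    my ($text) = @_;
    my %vec;
    $vec{$_}++ for split /\W+/, lc $text;
    delete $vec{''};    # split can yield a leading empty field
    return \%vec;
}

# Cosine of the angle between two sparse vectors (0 if either is empty).
sub cosine {
    my ($u, $v) = @_;
    my ($dot, $uu, $vv) = (0, 0, 0);
    $uu += $_ ** 2 for values %$u;
    $vv += $_ ** 2 for values %$v;
    for my $word (keys %$u) {
        $dot += $u->{$word} * $v->{$word} if exists $v->{$word};
    }
    return 0 unless $uu && $vv;
    return $dot / sqrt($uu * $vv);
}

my $doc1 = word_vector("Syria met with Israel");
my $doc2 = word_vector("Israel met with Syria");
my $doc3 = word_vector("The bat flew around the pitcher");
printf "doc1 vs doc2: %.2f\n", cosine($doc1, $doc2);
printf "doc1 vs doc3: %.2f\n", cosine($doc1, $doc3);
```

The first comparison prints 1.00 and the second 0.00. The first two documents land on the same point because a frequency vector discards word order, which is a reminder of what the thematic (lexical) level of analysis can and cannot see.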

Do's and Don't's in Contemporary Content Analysis

Content Analysis "Best Practices"

Wherein we introduce two individuals who we will follow through the hazardous process of implementing a content analysis project in 2003

Draco

who does everything wrong...

Hermione

Hermione

Assumes data–whether coded manually or by automated methods–will contain at least 15% erroneously coded records, and probably 25% to 35%. Some of this error will be systematic.

Additional Resources

Additional Resources: Web Sites

William Evans' content analysis web page
Harald Klein's text analysis software page
Kansas Event Data System site (automated event data analysis)