Kansas Event Data System FAQs: Frequently Asked Questions

Status of this FAQ

This FAQ was assembled around 2002 based on a very large number of emails we had answered dealing with various parts of the project, both for the KEDS and TABARI programs, and a lot of questions about downloading and filtering texts. While a few of the queries concerning TABARI may still be relevant (though even here, you are far better off looking at the manual) this is largely of historical interest. Though, as you can see, we were answering a lot of emailed questions.

Where can I get the program?

The most up-to-date version of the program -- as well as the manual, utility programs, various coding dictionaries, data sets and so forth -- is available from this web site.

What are the system requirements?

We have used KEDS in a variety of Macintosh configurations, including an SE, SE/30, original II, II with a DayStar Turbo 30 accelerator, IIsi, IIsi with a DayStar Turbo 40 accelerator, Quadra 900, LC, Powerbook 160, Powerbook 520c, PowerPC Macintosh 7100, iMac, iBook, G3 and G4; we've also used it under Systems 6.0.5, 6.0.7, 7.1, 7.5, 8.0, 8.6 and 9.0. The suggested application memory size is set at 2048K so KEDS should run on Macs with 4Mb or more of memory under System 6, or 8Mb under System 7.

How fast is the coding?

Roughly 15 events per second on a PowerPC 7100/80 or on a Mac IIsi running with a DayStar 68040 50 MHz accelerator. Coding our 80,000 event 1979-1996 Middle East data set using a 4000-phrase .Verbs dictionary and 700-phrase .Actors dictionary requires about two hours. TABARI is about 70 times faster and will code that data set in about 7 seconds on a 1.6 GHz G5.

Will the program run on Windows95?

The Penn State Event Data Project is part of the valiant but desperate rear-guard effort to prevent William Gates III from controlling the entire known universe. Providing the Penn State Event Data Project exclusively on the Macintosh platform is our small contribution to that cause.

Click here for a more extended discussion of this issue.

If you are really concerned about getting the program onto Windows, we can make the source code available on the condition that the resulting compiled program is posted on this web site. Contact schrodt.parusanalytics.com if you are interested in this possibility.

A more parsimonious solution for any serious project would be to purchase a used Macintosh -- we recently got a Mac IIsi, complete with color monitor, for $350. Download the public domain BBEdit Lite program as your text editor, and you're in business (all but the oldest Macs are able to read Windows disks containing any source texts). (A good source of information for the Macintosh novice/Windows refugee is Paul Hensel's Macintosh page).

KEDS will run on all known Macintoshes that have at least 2 Mb of memory. Call your local used computer store: on average Macintoshes have a much longer usable life than Wintel systems, and tend to be resold rather than ending up in landfills.

Update, Spring 2005: With the introduction of the $500 Mac Mini, this is still good advice: the Macintosh (or Linux/Unix -- same code) versions of our software are almost always more recent than the Windows versions, so if you intend to do any serious research using TABARI, it makes sense to work in that environment rather than Windows. The $500 cost of a Mac Mini (or $600 if you get 512 Mb of memory) is trivial compared to the cost of the time invested in a major project.

How about UNIX systems?

The old answers, circa 1999

UNIX is okay; we just can't afford a workstation. At one point there was a way to get around this: The Apple "Macintosh Application Environment" (MAE) version 3.0 allows Macintosh applications to be run in the X-Window environment on Sun's Solaris 2.4 or later and on Hewlett-Packard's HP-UX 9.0 and HP-UX-10.10 or later, and a 30-day demo version was available for free.

Unfortunately, Apple discontinued selling this system in May, 1998. Copies might still be floating around, or you might be able to locate a copy of the software somewhere. But buying an inexpensive used Mac is probably easier.

But help is on the way -- the TABARI system, which is compatible with KEDS dictionaries, uses Linux as its reference platform, and is open-source C++ as well. TABARI will eventually replace KEDS as our primary system.

The situation, circa 2002

TABARI runs just fine on Linux, and apparently at least one shop has converted it to Solaris Unix as well. At the moment the posted Macintosh version is a version or two ahead of the Linux version, but it's all ANSI C++ so if you want the current version in Linux, conversion should be easy. Once the feature set settles down, we'll post a tested Linux version of the code as well.

Also note that the new Apple OS-X is Unix, albeit BSD rather than Linux. Unix with a very fancy user interface.

The situation, circa spring 2005

Finally got around to doing a conversion to Macintosh OS-X. We've also switched from MetroWerks to Apple's XCode/gcc environment for development, which will mean a single code base will support both Linux and the Macintosh. Speed on the new hardware is pretty decent: 10,600 events per second on a 1.6 GHz G5 and 8,600 events/sec on a Mac Mini.

There are currently two choices for OS-X (both posted versions are compiled, though current source code is available on request):

  1. TABARI.0.4.9B2: The Linux version of TABARI.0.4 now runs nicely in OS-X. The interface is a bit ugly but it works.

  2. TABARI.0.5.0B1: This is the initial version of a cleaner interface using the "ncurses" Unix library -- it operates much like the older OS-9 version. It has also implemented the much-requested color-coded parse display.

Both of these -- as well as subsequent OS-X versions -- run in the OS-X Terminal application: see the TABARI page for further detail.

But despite your general loathing of Microsoft, TABARI actually does run on Windows, right?

Right. TABARI is an open-source program and Dale Thomas, who used KEDS to generate data for his dissertation, converted TABARI to run on Windows. That's what open source is all about. We anticipate adding a graphical-user-interface to TABARI in the near future, and if someone would like to convert that to Windows as well, this would be very nice. Please volunteer.

Don't you have anything nice to say about Microsoft??

Well, yes:

  1. I actually think MS-Excel is a good program. Can't say the same about MS-Word, though like almost everyone else I use it.
  2. Steve Maguire's Writing Solid Code : Microsoft's Techniques for Developing Bug-Free C Programs is an excellent guide to programming.
  3. At least Gates is putting his accumulated billions to good use -- the Bill and Melinda Gates Foundation's focus on Third World disease is admirable (I'm serious...) and beats heck out of the expensive toys that many information-age zillionaires are focusing on.

Okay, okay, enough on Microsoft. So, how long does it take to develop dictionaries?

This depends entirely on the extent to which your event coding scheme differs from that used in the existing dictionaries. We estimate that around two person-years (4000 hours) went into the development of our Middle East dictionaries, and a comparable level of effort has gone into the PANDA dictionaries. However, this involved a lot of dead-ends and the effort was integrated into debugging the program itself (and in the case of PANDA, refining the coding scheme). Translating the WEIS codes into the BCOW system, in contrast, took one person less than a month. The existing dictionaries probably contain most of the English-language vocabulary relevant to coding political events, but usually one needs to work with actual news reports to determine the best association of phrases and codes.

Other work in automated text processing of reports of political events (ARPA 1993; Lehnert & Sundheim 1991) indicates that dictionaries on the order of 5,000 phrases are necessary for relatively complete discrimination of various political events described by news media sources in the English language. The KEDS and PANDA dictionaries are somewhat smaller than this -- about 4000 phrases -- and a dictionary focusing on only a small subset of behavior might be substantially smaller.

How are you getting Reuters?

The old answers, circa 1999

From Reuters: we are currently using Reuters Business Briefing service, where stories can be downloaded for about $0.04 a story. Reuters charges about $40 per connect-hour for their service, with a minimum subscription of 120 hours, and we find we can consistently download about 1,000 full stories per hour. At the present time, Reuters has no equivalent to the NEXIS "HLEAD" format, but downloading full stories is workable. Still cheaper than survey research.

LEXIS-NEXIS, the source we used until about a year ago, has changed their academic service, pricing and software and is no longer practical for large downloads (at least at Kansas -- your contract conditions and software may vary). Reuters and NEXIS parted company on 10 June 1997 and NEXIS no longer has current news from Reuters, though they actually have better archival files than the Business Briefing provides. [But see the update below!]

The current situation, circa 2001

We aren't, which is why the data sets have not been updated for a while. Reuters Business Briefing has apparently moved entirely to a web-based format, but we have yet to experiment with whether this is practical for downloading.

Academic researchers might want to consider the Agence France Presse (AFP) reports that are available through the LEXIS-NEXIS "Academic Universe" service available at many university libraries. AFP is the second-largest news service (Reuters is the largest) and the archives on NEXIS go back to 1980. We've done some initial comparisons between the two sources and it appears that the relative levels of coverage differ by region -- AFP actually has more stories than Reuters for the Middle East, but fewer on the Balkans. The general pattern of events in the two sources is quite similar, however.

Reuters archives, on the other hand, are no longer available on NEXIS.

The current situation, circa 2002

We have resumed near-real-time coding and intend to post updated data sets every 3 months or so. The new data is from AFP (via NEXIS) rather than Reuters. We have been attempting quite unsuccessfully to get Reuters from some other source (Factiva, Dow Jones, someone) but can't even get them to return our phone calls. The Factiva service has meanwhile announced (on their web site) that they will stop carrying Reuters as of June 2003, so it looks like we would need to shift to AFP anyway.

We are continuing to do comparisons between AFP and Reuters for the 8-year period where we have data from both sources, and it presently appears as though there may be some substantial differences between them. While we are presenting these as a single time series, any statistical results that show a break between pre- and post-1999 events should be assumed to be an artifact of the data unless shown otherwise. Details on the data sources used to generate data for various time periods can be found in the Read.Me files that accompany the data sets.

The current situation, circa 2005

Reuters can now be downloaded via email from Factiva, and we've got a filter for this. The Factiva search limit is 100 stories (versus 1,000 for NEXIS) so one has to be patient, but this is doable for individual countries. Reuters also appears to be ever-so-gradually posting stories further back in time: one can now get these back to 1987 or so. However, because downloading from NEXIS is so dramatically faster, we're mostly using AFP now.

Can KEDS code full stories in addition to lead sentences?

In the case of Reuters, yes. Schrodt and Gerner (1998) discuss the differences between lead and full-story coding for a data set on the Gulf, and Phillip Huxtable (see Schrodt, Huxtable & Gerner 1996) has been using full-story coding to generate a data set for West Africa. KEDS can pass a pronoun reference across sentences, and in Reuters stories a pronoun at the beginning of a sentence usually refers to the first actor in the previous sentence. Full stories have much more redundant and irrelevant information than do lead sentences, so it is necessary to use different pre-processing and post-processing filters for full-story coding than are used for lead sentences.

Based on our validity tests, lead sentences alone provided sufficient coverage in the Middle East, an area closely monitored by the international media. As noted earlier, our data set based on Reuters leads provided almost three times the density of events found in human coding of full stories from the New York Times, in part because Reuters tends to break up any major story on the Middle East into several small stories. Huxtable found the opposite pattern to hold in West Africa -- Reuters tends to combine multiple, virtually unrelated events into a single story -- so full-story coding provided a considerable increase in density.

I used the .actors file in the demonstration suite on my data and I'm getting almost no valid events. Am I doing something wrong?

It depends on what world you are coding. In order to avoid copyright problems, the text in the demonstration file contains hypothetical events with actors and place-names from Middle Earth (as in J.R.R. Tolkien's Lord of the Rings). As a general rule, these are not relevant to contemporary political activity. Instead, you should use one of the dictionaries that are saved with the data sets we have posted elsewhere on this site.

That's a joke, right? Who would ever use a dictionary that has "Mordor" and "Barad-Dur" in the actors list?

We get an email from someone having this problem about once every six months. Some people don't read the friendly manual. We're shocked, shocked...

Is KEDS useful for projects other than coding event data?

Maybe, and maybe not. KEDS is optimized for event data coding, but has a number of additional string-recognition features that can be applied to content analysis projects. However, several important features found in most general purpose content analysis packages -- notably Boolean searches and proximity-based searches -- are not implemented in KEDS. Unless your coding scheme relies heavily on the subject-verb-object structure of a sentence or needs to handle compound nouns, you are probably better off with a content analysis program.

http://www.car.ua.edu has pointers to a large set of software packages for content analysis. For a review of content analysis programs, see William Evans (1996) "Computer-Supported Content Analysis: Trends, Tools and Techniques." Social Science Computer Review 14,3:269-279.

Will KEDS work for languages other than English?

In principle, yes, and in limited practice, yes. An early version of the program was used to code German-language reports quite successfully, but the specialized sections of the parser that handled German were not maintained in later versions. Depending on the language, it may be possible to use KEDS' "rules" facility to transform the structure of a sentence to something resembling English, but we have not tried this. The PANDA project is currently working on a machine-coding system that works in Spanish.

Is the program in the public domain?

Under the Bayh-Dole Act that governs technology developed with National Science Foundation funding, the KEDS program is the intellectual property of the University of Kansas. You may use and make copies of the program for educational, government and non-profit use without charge; the program can be posted to bulletin boards and included in software collections provided that a copy of the manual is included. If you wish to license the program or its source code for commercial applications, please contact the University of Kansas Technology Transfer Office (Phone: (785) 864-7871 Fax: (785) 864-5049 ).

TABARI, in contrast, is just garden-variety open-source, specifically under the GNU General Public License.

Are you planning to sell a commercial version?

No. The commercial event coding system that we recommend is from Virtual Research Associates (VRA) -- a spin-off from the PANDA project -- who have developed a commercial event coding system and several information management and data visualization programs for the Windows operating system. The system is currently being used by UNICEF and several government agencies for monitoring political and economic activity.

Because you are undoubtedly wondering...

The term KEDS is an acronym for "Kansas Event Data System." The software is in no way connected to -- nor could it possibly be confused with -- a trademarked brand of footwear with a similar name.


Emailed Questions and Answers

  • Acquiring Machine-Codeable Text (Questions 1-20)
  • Setting Up And Coding In TABARI/KEDS (Questions 21-50)
  • Analyzing TABARI/KEDS Data Sets (Questions 51-75)
  • Programming (Questions 76-83)
  • Miscellaneous (Questions 84-101)

  • Acquiring Machine-Codeable Text

    1. Question: I don't know how I can save the html files as Unix files because when I open the html file and hit "Save As" under "File" there isn't an option for file format. Do you know how I can do that? In addition, when I double-click the html files on a Mac, they open with some browser program. Can they also be opened with BBEdit?

    Answer: If you are using BBEdit, there is a little box in the file window -- I think it is the third from the left -- that allows the file type to be designated. They are still html files; it is just that the line endings have changed. And yes -- an html file is just another form of a text file, so any text processor will be able to open it.

    2. Question: The lead articles don't seem to have a recognizable extension of any kind (e.g. they aren't text files it appears). I can't open the files in BBEdit or any related text editor, and when I try to run the files in TABARI, it crashes. What am I doing wrong here?
    USA, June 2001

    Answer: They should be text files, and BBEdit should be able to open them (unless it is hitting some sort of limit -- are you using BBEdit Lite or the full thing?). I just checked a few of the files from here (using Fetch) and they look fine.

    3. Question: I find it quite difficult to determine which text filter and text format to use for inserting the texts into the TABARI program. Could you provide a step by step example and/or a manual of the program for Windows users?

    Answer: The format of the text is quite simple -- the first line has the date in YYMMDD format (plus optional identification information), then this is followed by the text itself, formatted to 80-character lines, followed by a blank line. For example:

    030501 AFPN-0001-01
    Japanese Foreign Minister Yoriko Kawaguchi on Thursday welcomed the newly
    released "road map" to reviving the Middle East peace process.

    030501 AFPN-0002-01
    The first ever Palestinian prime minister, Mahmoud Abbas, and the United States
    administration must determine the success of the international "road map" aimed
    at reviving the Middle East peace process, Britain's press said Thursday.

    As far as the filter is concerned, it depends entirely on what electronic source is being used: unfortunately, the filters have to be customized for each individual source of data, though the examples that we have on the web site might provide a place to start. Some people have been able to do the filtering using the "macros" in a general-purpose word processor such as Microsoft Word, but more commonly the filters are written by computer programmers. The filters that we have on the web site are written in standard computer programming languages (Pascal, C, and Perl) and would work equally well on Windows or any operating system.
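The record layout described in this answer (a YYMMDD date line with an optional identifier, the text wrapped to 80-character lines, then a blank line) can be sketched in a few lines of Python. This is only an illustration of the layout, not one of the project's filters; the function name `format_record` is hypothetical, and a real filter still has to extract the date and sentence from whatever the news source delivers.

```python
import textwrap


def format_record(date, ident, sentence, width=80):
    """Build one record: header line, wrapped text lines, blank line.

    `date` is expected as six digits (YYMMDD); `ident` is the optional
    identification string and may be empty.
    """
    if len(date) != 6 or not date.isdigit():
        raise ValueError("date must be six digits in YYMMDD form")
    header = f"{date} {ident}".rstrip()
    # Collapse stray whitespace and newlines before wrapping to `width`.
    body = textwrap.fill(" ".join(sentence.split()), width=width)
    return f"{header}\n{body}\n\n"


if __name__ == "__main__":
    print(format_record(
        "030501",
        "AFPN-0001-01",
        'Japanese Foreign Minister Yoriko Kawaguchi on Thursday welcomed the '
        'newly released "road map" to reviving the Middle East peace process.',
    ), end="")
```

Writing the returned strings one after another to a plain-text file gives a sequence of records in the layout above.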

    4. Question: Where can I get a hold of AFP reports (lead sentences) already converted by Perl to a TABARI format? Are these available on the web somewhere? I just need 35 to 50 of them (if possible).
    USA, August 2002

    Answer: Do you have NEXIS "Academic Universe" available? -- if so, you can just use nexispider.pl (Perl Program) to get a bunch on your own. You can also use the TABARI.Demo.text file to get a feel of what you need to obtain.

    5. Question: I realize that the program you wrote is to be used with Agence France Presse. I have been downloading stories from the "Non-US News" and "US News" sources. This may be why I am having a problem with the filter I was sent. Do you have any suggestions about how I can alter the program you wrote, or the filter I have been using for these sources? Someone suggested that I "paste" a phrase to the beginning of each story.
    USA, December 2000

    Answer: As far as I can tell, the format you have is very similar to what we've been working with. You can eliminate the filtering for AFP by simply putting a # in front of the line: if ($body !~ m/Agence/) { print "Skip non-AFP story\n\n"; next;}

    There are some other AFP-specific filters in there, but these are also pretty obvious. The other difference is that the filter is designed to work with an HTML file, rather than a text file (and actually, the file you sent me was RTF, which is yet another set of formatting commands). If you want to use the nexispider.pl program, the easiest way to do this is downloading the files in HTML rather than text -- both Netscape Navigator and MS-Explorer have options for doing this. That, however, supposes that you want to look at specific stories (that is, are you going through and selecting individual stories before saving them? Or do you just want to format everything that a NEXIS search produces?). This can be done either way.

    6. Question: I have noticed the Nexispider sometimes misses the end of a sentence and then puts multiple sentences together. I adopted a very poor solution of cutting off entries at five lines, but I wanted to ask you if you had this trouble.
    Florida, USA, January 2003

    Answer: Yes, it is definitely a bug. AFP formerly (and conveniently) just had each sentence separated by a blank line, and Nexispider used that to painlessly separate sentences. But they now have these multiple-sentence paragraphs, like Reuters. I'm going to try to find a work-around; meanwhile in our system (with the complexity filter on) those stories are almost always skipped because they contain too many verbs. Though that isn't necessarily a good thing either.

    7. Question: Where can I get a filter that will translate the Reuters headlines into TABARI/KEDS compatible format? I have searched the web site but cannot find it. Can you help?
    United Kingdom, November 2002

    Answer: The reformatting programs that we used to go from Reuters to TABARI/KEDS are about half-way down the web page at http://web.ku.edu/keds/software.html. There are about a half-dozen of them in various languages -- Pascal, C, Perl. Unfortunately, it is quite possible that none of them will work for your problem without modification -- the data suppliers (Reuters, NEXIS, etc.) keep changing their formats and every time the format changes, the program needs to be changed at least a little. But at least you can get some ideas of how the programs work by looking at the code.

    8. Question: I am considering using the filter program Factiva.Reutlead.filter.pl and I am running into a problem with the spaces. If I process only one html file downloaded from Factiva (either on a server or on a Mac itself), there will be extra spaces between leads in the leads file. TABARI could not show the last few words of each lead because of this problem. But after the spaces are deleted to only one space, TABARI could read all words of the lead. Do you know how I can fix this problem?

    Answer: The problem is due to incompatible file types. The Windows, Unix and Mac operating systems use different methods for indicating a line ending. I'm guessing that your input file was produced in Windows, so the Mac isn't reading it correctly -- it needs to be saved as either a Unix or Mac file. I'm pretty sure that the free "TextWrangler" program from http://www.barebones.com/index.shtml does the file conversions easily. If not, the commercial equivalent, BBEdit, definitely does this, and I suspect there are some shareware solutions. With the introduction of Mac OS-X, which is fundamentally a Unix system, the situation has gotten a little more complicated, since Apple is gradually phasing out the Macintosh file format. The current version of TABARI -- version 0.5 -- needs Unix files, but some of our earlier software worked with Mac files. Try converting to Unix first -- that should do it. If it doesn't, try Mac.
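If no text editor with line-ending conversion is at hand, the conversion to Unix line endings is simple enough to script. The sketch below is an illustration in Python rather than anything shipped with TABARI; the function names are hypothetical. It treats Windows (CR LF) and classic Macintosh (CR) endings alike and rewrites both as Unix (LF).

```python
def to_unix_line_endings(data: bytes) -> bytes:
    """Return `data` with Windows (CR LF) and classic Mac (CR) endings as LF."""
    # Replace CR LF first so the lone-CR pass does not double the line breaks.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(src_path, dst_path):
    """Read `src_path` as raw bytes and write a Unix-ending copy to `dst_path`."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(to_unix_line_endings(data))
```

Working on raw bytes avoids any guess about the character encoding of the downloaded text. Writing to a separate output path keeps the original download intact in case the conversion turns out not to be the problem.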

    9. Question: I am having a problem with the length of some of the paragraphs that are being downloaded. Any suggestions?
    Wisconsin, USA, March 2001

    Answer: What I'm guessing is happening is that some of the sources you are using have more than one sentence per block of text (Agence France Presse, which was what I originally developed that filter for, only had one sentence per block). KEDS or TABARI will actually work with multiple sentences -- people have produced useful data that way -- but it's not designed for that purpose, and it would be better to get this reduced to individual sentences. You need to figure out what parts of the download you want to keep -- ideally this can be done with a combination of the "Copyright" and "SECTION:" fields. (The "Copyright" will give the news source; some of the sources do not carry the "SECTION:" field and you will probably just want to deal with the ones that do have this.) Once you've got that, it is a fairly simple matter to eliminate all of the useless stories -- this involves a minor modification of the earlier Perl program. Now, at that point, the question becomes what the remaining stories look like. TABARI will handle up to 2048 characters in the source text -- which is about 10 times the length of the typical news wire sentence. The quick-and-dirty way of handling multiple sentences in a text block (that is, something delimited by the paragraph delimiters in HTML) is just to feed it the entire block -- it will code the first verb it encounters, plus deal with any other verbs found in the context of compound sentences, and that is usually what you want to get out of the text block anyway. The alternative is to try to actually identify the ends of sentences, but this turns out to be fairly complicated because of the presence of abbreviations (though we've got some programs that do this pretty well). I am guessing that at the moment, TABARI is trying to read the entire story, which is giving it problems with memory. But again, the first step is figuring out what you want to keep out of that download.
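To show why abbreviations make sentence splitting awkward, here is a minimal Python sketch of the naive approach: break after a word ending in ".", "!" or "?" unless the word is a known abbreviation or a single initial. It is not the project's own sentence-splitting program; the abbreviation list is a short hypothetical sample and would need to be extended for any real news source.

```python
import re

# Hypothetical sample list; a real filter needs a much longer one.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "gen.", "col.", "st.", "u.s.", "u.n."}

# A sentence-final word ends in . ! or ?, optionally followed by quotes/brackets.
SENTENCE_END = re.compile(r"[.!?][\"')]*$")
SINGLE_INITIAL = re.compile(r"[A-Za-z]\.")


def split_sentences(block):
    """Split a block of text into sentences, skipping known abbreviations."""
    sentences, current = [], []
    for word in block.split():
        current.append(word)
        is_abbreviation = (
            word.lower() in ABBREVIATIONS or SINGLE_INITIAL.fullmatch(word)
        )
        if SENTENCE_END.search(word) and not is_abbreviation:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences
```

The weakness is visible in the design: an abbreviation that happens to end a sentence ("...arrived in the U.S.") will wrongly glue two sentences together, which is the kind of case that makes this "fairly complicated" in practice.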

    10. Question: I am gathering information out of newspaper articles and some of the articles exceed the number of allowed characters. How do I resolve this issue without losing event data?

    Answer: The problem may be the line length rather than the story length. The example you give is 266 characters in length, which is close to the 255 characters that is the line length limitation; the source text limit is 2048 characters, or around 25 of the typical 80-character lines.
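A quick check for both limits can be scripted before the text ever reaches the coder. The Python sketch below uses the two figures quoted in this answer (255 characters per line, 2048 characters of source text); how the program itself counts characters, for instance whether line breaks are included, is an assumption here, and the function names are hypothetical.

```python
import textwrap

MAX_LINE = 255     # line-length limit quoted in the answer
MAX_RECORD = 2048  # source-text limit quoted in the answer


def check_record(lines):
    """Return a list of problems found in the text lines of one record."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if len(line) > MAX_LINE:
            problems.append(
                f"line {number} has {len(line)} characters (limit {MAX_LINE})"
            )
    # Assumption: the record limit counts characters on all text lines.
    total = sum(len(line) for line in lines)
    if total > MAX_RECORD:
        problems.append(f"record has {total} characters (limit {MAX_RECORD})")
    return problems


def rewrap(lines, width=80):
    """Re-flow the text of one record into lines of at most `width` characters."""
    text = " ".join(" ".join(lines).split())
    return textwrap.wrap(text, width=width)
```

An over-long line, such as the 266-character example, is fixed by `rewrap` without dropping any words; a record that exceeds the overall limit needs to be split into sentences instead.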

    11. Question: When I run GEDS_format.pl, I get one file (GEDS.output) with reformatted text. But, the program broke up and assigned each sentence of each article a separate ID. So one article going in might end up as several "articles"--each one sentence long--coming out. Is that what you meant the program GEDS_format.pl to do?
    USA, May 2001

    Answer: This is correct. If you look at the ID, you will see that a new middle number is assigned to each story, and then the final number refers to the sentence number within the story (assuming you are doing full stories; otherwise the final number is always -01). KEDS (and maybe TABARI -- if it doesn't now, it will...) uses this sentence sequencing information to forward pronoun references across consecutive sentences.
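The ID structure can be made concrete with a small Python sketch. It assumes the three-part layout seen in identifiers such as AFPN-0002-01 elsewhere in this FAQ (source tag, story number, sentence number, separated by hyphens); the IDs produced by GEDS_format.pl may be laid out differently, and the names `RecordId`, `parse_record_id` and `follows` are hypothetical.

```python
from typing import NamedTuple


class RecordId(NamedTuple):
    source: str    # e.g. "AFPN"
    story: int     # middle number: one per story
    sentence: int  # final number: position of the sentence within the story


def parse_record_id(ident):
    """Split an ID of the assumed form SOURCE-STORY-SENTENCE into its parts."""
    source, story, sentence = ident.split("-")
    return RecordId(source, int(story), int(sentence))


def follows(previous, current):
    """True when `current` is the next sentence of the same story."""
    return (
        current.source == previous.source
        and current.story == previous.story
        and current.sentence == previous.sentence + 1
    )
```

A test like `follows` is the kind of check that sentence sequencing makes possible: a pronoun reference would only be carried forward when two records are consecutive sentences of one story.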

    12. Question: When I run the GEDS_format.pl program, it asks me for the input file, so I type "GEDS.sample.list". It then asks me for the output file, so I type "GEDS.output". It then proceeds to the command, "Reading file list from GEDS.sample.list... Writing reformatted records to GEDS.output". Then I get the error: "Can't open summary file GEDS.sample.list.summary; error; Invalid argument at line 68, line 2". Is the problem that the file name is too long or are there too many periods?
    USA, April 2001

    Answer: Try the following: change the line $summary_name = "$file_list\.$summary_suffix"; to $summary_name = "test.out"; This will give a nice simple file name that shouldn't cause any problems.

    13. Question: I downloaded Perl to my office computer. I put the GEDS_Format.pl file in the bin folder and got it to run. I entered the file name of my sample text and "sislin" for the name of my output file. Perl said it was doing something, created a "sislin" file, but the file turned out to be empty. I tried to run the TextSampleArticles file that I sent you, but that wouldn't generate a reformatted file either. In regards to a GEDS.sample.list file, I don't seem to have one. (All I have is the GEDS_format.pl and the GEDS sample.reformat file.) Can I make a GEDS.sample.list file? Do I save the file in ASCII? Is there anything else in the file, or just the name of each file of articles I want to reformat?
    USA, April 2001

    Answer: Are you sure that $file_list contains the name of a file that contains the names of the files you want to convert, rather than the file name itself? -- that would also explain the problem. The way the program is currently set up, there should be a file named "GEDS.sample.list". This file contains the names of the files that you want to process, with one file name per line (the current version just contains a single name). Alternatively, if you change the name of the file you want to process to Text.Sample.Articles.txt, the program should work. But the easier way to do things is to edit GEDS.sample.list. This should solve the problem. If you don't have such a file, just make one with that name, save it in ASCII ("text"), and all it needs to contain is the name of the file (or if you want to run multiple files at once, one name per line).
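Since the list file is easy to get wrong, it can help to check it before running the filter. The Python sketch below is an illustration only (the function names are hypothetical): it reads the entries, one file name per line, and reports entries that do not exist on disk as well as entries that look like the first line of a text record (six digits followed by a space), which suggests the text file itself was supplied in place of a list.

```python
import os
import re


def parse_file_list(lines):
    """Return the non-blank entries of a list file, stripped of line endings."""
    return [line.strip() for line in lines if line.strip()]


def suspicious_entries(names, exists=os.path.isfile):
    """Return (missing, record_like) entries that deserve a second look.

    `exists` is the function used to test for a file; it defaults to
    os.path.isfile and is a parameter only to make the check easy to test.
    """
    missing = [name for name in names if not exists(name)]
    # A YYMMDD date followed by a space looks like a record header, not a name.
    record_like = [name for name in names if re.match(r"\d{6}\s", name)]
    return missing, record_like


if __name__ == "__main__":
    # Assumes a list file named as in the answer, in the current directory.
    with open("GEDS.sample.list") as handle:
        entries = parse_file_list(handle)
    print(suspicious_entries(entries))
```

File names in the list are resolved relative to the directory the script is run from, so the list file and the text files should normally sit in the same folder.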

    14. Question: I think I successfully ran the GEDS_format.pl program. However, the files are so large (e.g. 100 megs) that I cannot run them in TABARI yet (insufficient hard drive space) and I am not sure how to open the output file without corrupting it. Notepad just won't do it. FYI, it only took a few minutes to reformat 68,000 articles. I suppose now, I need to purchase a better computer to run Linux, or maybe a better hard drive for the computer I've got.
    USA, May 2001

    Answer: You might try opening the file in one of the Linux text editors (there are two or three in the RedHat distribution) -- they probably won't have the constraints of NotePad.
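If the goal is only to confirm that the reformatted output looks right, the first few records can be inspected without opening the whole file in an editor. This Python sketch is an illustration, not a project utility; the function name `head` and the file name in the usage line are hypothetical.

```python
from itertools import islice


def head(lines, count=20):
    """Return the first `count` lines from any iterable of lines."""
    # islice stops after `count` items, so the rest of the input is never read.
    return [line.rstrip("\r\n") for line in islice(lines, count)]


if __name__ == "__main__":
    # Iterating over a file object reads it line by line, not all at once.
    with open("GEDS.output", encoding="latin-1") as handle:
        for line in head(handle, 20):
            print(line)
```

Because the file is read lazily and never written to, a 100-megabyte output file can be checked this way with no risk of corrupting it.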

    15. Question: I traced a problem I was having with the "Nexisreverse" program you wrote. Upon doing this, I received the following message:
    Running program "Nexisreverse"
    Output file: reverse.output
    File list: beles
    Reading 981221 NEXI-0001-01
    Can't open input file 981221 NEXI-0001-01;
    error No such file or directory at nexisr~1.txt line 106, chunk 1
    Do you think this may be due to filter issues?
    USA, April 2001

    Answer: I'm wondering if there is some problem with the line-feeds (Windows, Unix and the Mac use three different ways of doing this; sometimes the software translates automatically between them, and sometimes it doesn't). However, it is possible that filtering could be a problem. Is nexispider.pl actually running? (Also, what is it running on? -- Windows or Unix?). If nexispider.pl is outputting files in a Windows environment and then the Mac actor filter is trying to read them... that could be the problem. There is a shareware program for the Mac called "BBEdit Lite" that will take care of this easily (that is, it translates linefeed formats) if this turns out to be the problem. However, it seems that the program reads the input as a file containing a list of file names to be processed. For example, I just ran it on a list that looked like:
    AFPWA.000101-000331
    AFPWA.000401-000630
    AFPWA.000601-000930
    AFPWA.001001-001231
    AFPWA.010101-010330
    This was a file called "dir.2000", and the $file_list statement was $file_list = "dir.2000". Based on your output, the program was trying to read the article text directly as if it were the list, and therefore it interpreted the first line as a file name.
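
    One quick way to diagnose this kind of mix-up is to check whether the lines of the supposed list actually name files on disk. The sketch below is in Python rather than Perl and is not taken from Nexisreverse; missing_entries is a hypothetical helper.

```python
import os

def missing_entries(list_lines, directory="."):
    """Return the entries of a file list that are not files in `directory`."""
    names = [line.strip() for line in list_lines if line.strip()]
    return [n for n in names if not os.path.isfile(os.path.join(directory, n))]
```

    If nearly every line comes back as missing, and the lines look like record headers or sentences, the program has probably been pointed at the text itself instead of at a list of file names.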

    16. Question: I wrote my own filter using Perl, since I'm interested not just in Reuters or Agence France Presse but in as many news wires as I can get. The program can strip the text of all the html "noise" and save the article as a clean text file, but I wonder if this is not part of the problem or if it could become a problem later even if I succeed in implementing the ISSUES feature. Is there something I'm not doing right at the moment that the filters you provide do, which in turn is making TABARI crash?

    Answer: That seems unlikely, unless what you've got is a line-end problem -- i.e. your files have either Unix or Windows line endings rather than Macintosh line endings -- this is probably the single most common problem people run into with KEDS and filters. If your Perl program was running on something other than a Mac, you might check that the conversion was done correctly. But that would mess up KEDS generally, not just the ISSUES. I forget exactly what the maximum line length is in KEDS -- I think it is 120 chars -- but as long as you keep it around 80 or so you shouldn't have problems. Otherwise as long as your records follow the
    (date)
    (text line 1)
    (text line n)
    (blank)
    format you should be okay; the input routine is fairly robust.
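
    A filter can produce this layout with very little code. The following Python sketch wraps the text at 80 characters, following the rule of thumb above; format_record is a hypothetical name, and the "\n" line ending is only a placeholder that would still need converting to whatever the coding machine expects.

```python
import textwrap

def format_record(date, text, width=80):
    """Return one record: a date line, wrapped text lines, then a blank line."""
    # Collapse runs of whitespace so earlier line breaks do not survive.
    lines = textwrap.wrap(" ".join(text.split()), width=width)
    return "\n".join([date] + lines) + "\n\n"
```

    Each call returns a single record ending in a blank line, so records can simply be concatenated into one input file.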

    17. Question: I wanted to make sure I was clear on the process for using the Nexispider program. I will download the stories and save them as html files (as I have done with the RTF files). Next I open Perl (running the program you wrote with appropriate AFP lines skipped) and point it to the appropriate file name for the stories I have downloaded. This will produce output that will be read by TABARI. Is this correct?
    USA, December 2000

    Answer: The way the program is currently set up, you give it information on the URL of the first story in a sequence, and then it does the downloading and formatting automatically. But it would be very easy (a couple of lines) to modify the program so that it worked from a file rather than a URL. Given that you can download multiple stories in a single file, that would probably be easiest provided you can get them in html. If this is not an option, it might be possible to convert the program to work in rtf.

    18. Question: We are just trying to figure out the logistics of actually running the KEDS program on an electronic database of stories. The problem seems to be that most of these services allow a limited number of stories to be accessed at one time. The librarian here was even saying that I would probably have to download all of the stories over which I wanted to run the KEDS program, which for this project would just be huge (since it goes so far back in time). Can you advise me on the best way to handle the search logically?
    USA, November 1999

    Answer: As far as downloading goes, one is pretty much stuck with the limits given by the data provider, so the trick is (1) make your search statement as specific as possible, to minimize the amount of text downloaded that is uncodeable and (2) download in whatever chunks you have to download in. This takes a fair amount of clock time, but very little human time, since once you've got the download going, it runs without interruption. Once you've got the raw text, you'll need to reformat it so that it works with KEDS. It is a bit frustrating to get the filter configured -- though a computer science student should find it easy -- but once it is running, the filtering process is very fast.

    19. Question: I wanted to ask you if you could describe to me what Reuters text should look like before and if possible after it goes through your filter program. I think with this information, I could figure out how to change the Access data into appropriate text format to be filtered.
    USA, December 2000

    Answer:
    Date ID
    Lead Sentence
    If you want to do pronoun forwarding across sentences, there is some additional sequencing information that can go in the header, but it is optional. Keep in mind that TABARI always works with sentences, not headlines.

    20. Question: I have been working on some dictionaries (actors, verbs, locations for Cambodia) and now I am beginning to test them. However, I am having trouble getting AFP press reports from Nexis using the nexispider script (nexispider_pl.pl). Apparently the articles are accessible through HTTPS protocol only and the script does not work for HTTPS. Could you suggest a work-around or modifications to the script that would enable it to fetch the articles from a HTTPS site?
    Florida, USA, January 2003

    Answer: Near the end of the Nexispider.pl program, there is a line reading $nexturl ='http://web.lexis-nexis.com' .substr($doc,$start+6,$end-$start-6). Change the "http" to "https" and see if that works. As best I can tell, Perl handles https the same way it handles http.
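
    For readers more comfortable with Python than Perl, the same expression can be written as a slice. This is only an illustration of what the Perl line computes: next_url is a hypothetical name, and reading the +6 as skipping the six characters of href=" is a guess about the original script.

```python
def next_url(doc, start, end, scheme="https"):
    """Mirror the Perl expression: host prefix plus a slice of the page."""
    # Perl substr($doc, $start+6, $end-$start-6) takes ($end-$start-6)
    # characters beginning at offset $start+6, i.e. doc[start+6:end].
    return scheme + "://web.lexis-nexis.com" + doc[start + 6:end]
```

    Making the scheme a parameter means switching between http and https is a one-word change.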





    Setting Up And Coding In TABARI/KEDS

    21. Question: If my actor/verb/project files were created in Notepad on a PC, do you know if I will be able to access them and use them in KEDS on a Mac?
    USA, February 2004

    Answer: Yes, with a bit of automatic translation. Windows and the Mac use different codes to indicate end-of-line (and Linux uses yet another code), so you need to translate between the formats. There are a bunch of programs out there that do this -- we use the text editor BBEdit on the Mac and it will do this just by selecting a menu option. Some Mac programs also do it automatically -- MS-Word, for example. If you can find someone who uses a Mac regularly, they can probably show you how to do it -- it is very easy. The rest of the information other than the line endings goes between the systems without translation.
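
    If no such editor is at hand, the translation itself is only a few lines of code. Here is a minimal Python sketch; convert_line_endings is a hypothetical helper, and it assumes the classic conventions of carriage return for the Mac, line feed for Unix/Linux, and carriage return plus line feed for Windows.

```python
def convert_line_endings(data, target="\r"):
    """Return the bytes in `data` with every line ending replaced by `target`."""
    # Normalize Windows (CR LF) first, then classic Mac (CR), to Unix (LF).
    unix = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unix.replace(b"\n", target.encode("ascii"))
```

    Reading and writing the dictionary files in binary mode (open(name, "rb") and open(name, "wb")) keeps Python from doing its own newline translation on top of this.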

    22. Question: I have successfully filtered my data and I am now working on modifying the actor dictionary to reflect the actors in my cases. I am modifying the actor dictionary by including a numeric code for the countries and actors (the numeric codes reflect those I have used for my other data set). However, when I use the modified dictionary with TABARI none of the actors are coded. After completing its cycle, TABARI returns a file with no actors or events coded, although I know that these actors exist in the file. Do you have any idea why these actors are ignored? Is it because I am using numeric codes? I altered the demo dictionary by changing the actor codes to numeric codes with the same result (i.e. no actors were found). Do you have any ideas?
    USA, June 2001

    Answer: My first guess would be to check to be sure that the file types are compatible -- that is, did you edit the actors file on a different operating system than you are running TABARI on?

    23. Question: I'd like to format my data to look something like the Panda data files. Panda has the following fields:
    1. Source Actor
    2. Source Agent
    3. Target Actor
    4. Target Agent
    5. Event
    6. Place
    7. Issue
    8. Domain, Mechanism, Sanction, Context.
    Of these, the first five TABARI seems to handle quite well. One question I have is whether TABARI can simultaneously process the first four fields. My feeling looking around the code was that this was not fully implemented -- that is, it would either implement 1&3, or 2&4. One workaround I found was to run TABARI twice, once using the actors file and once using the agents file. Is there an easier method?
    New Jersey, USA, November 2002

    Answer: Yes, TABARI only sort of handles "agents" -- in the sense that one can simply put in the appropriate patterns (e.g. "Israeli police"), whereas KEDS had a series of rules (well, three or four rules) that would also pick up alternative constructions such as "police in Israel" automatically. While we never really used agents, I gather that the facility works pretty well, and it is on the list to be implemented, though not any time soon. Your idea of running the data through twice, once with actors and once with agents, might well work -- I hadn't thought about it before but from a linguistic perspective, it seems fairly plausible (that is, nouns and adjectives are usually adjacent to each other, and for that matter all the KEDS system did was look for nearby words). The only place where it will break down is if there is an agent in the subject but not the object (or vice versa), e.g. "Israeli police accused Palestinians of..." The frequency of such constructions is an empirical question -- one way you could get an estimate is by looking at the original PANDA data (is this still available?) and seeing how frequently there is an actor-agent in the source and only an actor in the target (or vice versa).

    24. Question: I am preparing an actor list and I wanted to know if I should only keep actors as follows: names of countries, capitals, and key persons?
    USA, October 2000

    Answer: Assuming you want to get "traditional" state-centric event data, that would be correct. Obviously you could also code minority groups, in which case you would need their names as well. But if you are just focusing on nation-states, Reuters almost always uses either the state name, capital or a major actor (usually the head of state, head of government, or foreign minister) as the reference.

    25. Question: I am running Java (recently downloaded) on my PC. I have tested the Java program by writing my own short "hello world" type of program, so I know that it functions properly. I am running everything from my hard drive (the floppy issue refers to the Mac version of the Actor_Filter program I used in the Mac lab at school). I have all the necessary files to run Actor_Filter in java, which it recognizes and runs properly. When prompted to give an input file name (the document I want filtered) I point it to a "txt" file located in the same directory as the Java.exe and Actor_Filter program. After locating this file the program stalls. Basically, after I see "Temporary file: actors.scr" nothing else happens.
    USA, May 2001

    Answer: My guess is that the command file has some problem because of the lines "Unknown Option." Try these few things:

    1) Try renaming afilter.options to something else (e.g. afilter.options.backup) and run the program; if it does not find an options file, it will run with the defaults. If it then runs fine, there is a problem with the options file and we can work from there.
    2) In the afilter.options file you have OUTPUT FILE: [input file].kwic; this is actually trying to open a file literally named "[input file].kwic". You should either leave this line out (and accept the default) or replace [input file] with a real name, since it might be trying to create a file name with the "<" character in it (option 1 above should test this).
    3) If these don't work, send me a copy of your afilter.options file so I can try that out.

    26. Question: I am stumped as to why TABARI is having fits over material following a ";" in the verb list. I opened the standard verb file in a word processor and all that I see is a hard return at the end of each line, but no hidden codes that I can find in any line. This makes me think that the TABARI program is looking for something it should not be (the comments), but I don't know why. Note that this also happened with the standard actor file, but there were only 2 comments, and I took both of them out. I am also a bit stumped by the purpose of the options file, which lists some codes and actors, but not all of them. Is this important?
    USA, April 2001

    Answer: Does it run the "demo" program correctly? That will show you how the program behaves when it is running. To run the demo:
    1. respond with "C" to the first prompt,
    2. hit "return" at the second prompt,
    3. enter anything ("XXX" will work) in response to the request for a coder ID,
    4. enter N if it asks if you want to skip records (it may not, depending on the status of the file).
    Then you can go through the demo records (they all involve states from Lord of the Rings) by hitting "N" or "return". If we get this far, then we know the program is running. Which installation of Linux are you running? With regard to the codes in the .options file: these are all optional -- they are just used to make the output easier to read, and the program will code without them. It picks up the codes from the dictionaries. Either way, try the demo first.

    27. Question: I might need to run TABARI with a more specific dictionary to get exactly what I need. Do you have the dictionary for the Levant data?

    Answer: They are with the data -- that is, when you download the data sets, those are in a folder that also contains the dictionaries.

    28. Question: I am considering using the TABARI tool (on Windows NT) that you have developed. I got a feel for the system through the Demo project that is provided with the TABARI_NT zip file and now want to start looking at setting up my own project with tailored dictionaries. Unfortunately I am struggling to get started as I cannot work out how to set up a new project. Is there a set of instructions that you could direct me to that explains how to get started with a new project? In particular I also want to use a CD-ROM and a web site as the data sources, and I cannot work out how to do this either -- is there a set of instructions that explain how this is done?
    United Kingdom, September 2002

    Answer: The "project file" itself is easy -- it is just a list of the files (dictionaries, source texts, etc). Have you looked at the TABARI.0.3.changes.doc document? -- that describes how to set up a project file (which is just an ASCII "text" document). Now, the hard part, unfortunately, is getting your source text into the correct input format. That format is simple -- just a date in YYMMDD format, followed by the lines of the text to be coded, followed by a blank. The problem arises with the fact that the original text you probably want to code -- whether on a CD, the Web, a database, whatever -- is almost certainly in a more complicated format. To convert from that, you need to either do a lot of reformatting manually (one can do a lot of this with search-and-replace or macros in a word processor), or write a specialized program (we call them "filters") to do this automatically. I'm guessing that this is where you are at the moment. Since the formatting of the source texts varies dramatically (between sources; usually it is reasonably consistent within a single source, e.g. Lexis-Nexis), this conversion process has to be customized. I think that there are some general tools available for this sort of thing, but since we just write little programs to do it, I'm not familiar with them. The reformatters convert your original text into the simple form used by TABARI, and from that point things are fairly straightforward.
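
    One small piece that nearly every filter needs is turning the source's date into YYMMDD. A Python sketch follows; to_yymmdd is a hypothetical helper, and the default source format ("September 12, 2002") is only an assumption about what a given source prints.

```python
from datetime import datetime

def to_yymmdd(date_text, source_format="%B %d, %Y"):
    """Convert a date string in `source_format` to the YYMMDD form."""
    return datetime.strptime(date_text, source_format).strftime("%y%m%d")
```

    Passing a different source_format (for example "%Y-%m-%d") covers sources that print their dates differently, which is where most of the per-source customization ends up.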

    29. Question: I am beginning to get familiar with TABARI. However, I have not been able to get a sample file to run. I was using the sample file on the Software page under KEDS - perhaps this file is only compatible with KEDS.

    Answer: Correct -- the "project" and "options" files in the two programs are very different ("actors" and "verbs" are the same, or at least compatible). Get the TABARI file but -- in contrast to someone I was corresponding with last week -- DON'T USE THE SAMPLE FILE TO CODE REAL DATA! Unless you are coding Middle Earth (as in Lord of the Rings), since that is where the place names are from.

    30. Question: I have a student who produced a draft six months ago that used TABARI and reported odd results. She has told me that more than 80% of the TABARI events were duplicates and of the remaining cases more than 60% were not coded correctly by TABARI. Consequently, she threw out more than 5,400 of some 5,700 events. Can you provide any insight as to how we can resolve this dilemma?
    Ohio, USA, October 2002

    Answer: The big problem is the use of multiple news wire sources. We don't do that -- we just use one (originally Reuters, now AFP), and even then we filter for duplicate stories (effectively in Reuters, less so in AFP; we've had to use a new technique to get around that). So depending on how many sources were being used, the duplicate count could be credible (e.g. if she was downloading anything on Nexis, which has a couple thousand sources). We've got a filter to handle this problem. Another possible issue might involve the actor dictionaries (that might explain the accuracy problem; in fact she might have needed to do some work with the verb dictionaries as well).
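
    To give a feel for what a duplicate filter does, here is a deliberately simple Python sketch. It is not the project's filter: it only drops stories whose text is identical after lower-casing and collapsing whitespace, and drop_duplicates is a hypothetical name.

```python
import hashlib

def drop_duplicates(stories):
    """Keep the first copy of each story, comparing normalized text."""
    seen = set()
    unique = []
    for story in stories:
        normalized = " ".join(story.lower().split())
        key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(story)
    return unique
```

    Reports of the same event from different wire services are rarely word-for-word identical, so with many sources something fuzzier than an exact match would be needed.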

    31. Question: I am using TABARI and several files are coming back as "event not found". When I hand code these files, however, I often find that there are in fact events but that the wording seems a little too complex for the program to pick out. Therefore, I would like to include a Complex statement in my options file for TABARI to create a complex file which includes these files where no events are found. I'm running into problems at this stage because the complex file is not being created after TABARI runs through my input files. Ideally, I like to mark as complex the files that have actors yet are coded as "no event". I have a complex statement in my options file which is simply:
    COMPLEX: NOEVENT EXPLAIN
    Do you happen to know what I am doing incorrectly to make the program not generate the complex file? I am autocoding using the "full" option. Does this make a difference?
    USA, February 2004

    Answer: It turns out that, in fact, there isn't a way of generating a separate file of "complex" events in TABARI, though there was in KEDS. No technical reason for this; it just hasn't been implemented. I'll try to do this in the future, but it won't be soon. One possibility (assuming you have access to a Mac) is that you might just try running KEDS -- for the most part the files are compatible between the two -- that is, any work you do on KEDS could later be moved to TABARI once that new feature is built in.

    32. Question: I am trying to use TABARI (tabari-0.4.04.windowsbeta.exe) to generate data for my dissertation research, and I am running into some problems that I was hoping you could help me with (or point me in the direction of applicable literature for my problem). I am receiving an error message when I attempt to run the program:
    Enter project file name, or <Return> for the demonstration file: c:/tabari/dissproject.project
    Last coding session: ; Coder
    Coding session 1
    Enter coder ID: lmp
    Initializing from options.options
    Fatal error encountered in program
    Problem: Unable to open the input file TABARI.diss.actors
    Press <Return> to end processing.
    I have absolutely no idea why the file cannot be opened. Previously, I was receiving a message that the file was not a Windows formatted file. I changed the association from Word to Notepad, and now I get the message above. When I try to run the validation files or the demo, I get an error message that the TABARI.Demo.project is not a "Windows formatted" file and it asks if I want to convert it. No matter if I hit Y or N, the program shuts down. I was wondering if perhaps it is my operating system? I am running Windows XP. I have a different problem when I try to use the TABARI version .2 compiled by Dale Thomas. It appears as if the program runs through all my input files, but it ends in just a few seconds, and there is nothing in the output file. I have contacted Dr. Thomas as well to see if he knows why this (or the problem I outlined above) is happening.
    Indiana, USA, December 2003

    Answer: This definitely sounds like the problem involves some incompatibilities between the file formats (rather than anything more serious), and just guessing, I'm wondering if TABARI is reading one of your files (say the project file), seeing a particular file format (Windows, Mac, or Unix), and then crashing when it finds an incompatibility in another file. We just added that capacity to detect file types recently (it is the single most common problem people seem to have with the program, probably since we are supporting multiple platforms) and probably haven't dealt with all of the possible contingencies. Is there somewhere at Indiana where you could find someone who could easily check those file formats and make sure they are all Windows-formatted (that is what you want if you are using the Windows version)? If one has the correct software, this only takes a couple of seconds (if you don't, you have to find the correct software, which can take hours...) -- is there a general consulting facility that might be able to provide that? If not, you could send the files here and we could check and see what the problem is -- my concern about that is that sometimes systems change the file formats in transit (my email goes through a Windows system that forwards to a Unix system and then I read the mail on a Mac... you see the problem...)
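
    Checking which format a file is in can also be done with a few lines of code. The Python sketch below is not how TABARI does its own detection; detect_line_endings is a hypothetical helper that simply counts the three kinds of line ending in the raw bytes.

```python
def detect_line_endings(data):
    """Classify bytes as 'windows', 'mac', 'unix', 'mixed' or 'none'."""
    crlf = data.count(b"\r\n")
    cr = data.count(b"\r") - crlf   # bare carriage returns (classic Mac)
    lf = data.count(b"\n") - crlf   # bare line feeds (Unix/Linux)
    counts = {"windows": crlf, "mac": cr, "unix": lf}
    kinds = [kind for kind, n in counts.items() if n > 0]
    if not kinds:
        return "none"
    return kinds[0] if len(kinds) == 1 else "mixed"
```

    Running this over the project, options, actors and verbs files (each read with open(name, "rb").read()) should show quickly whether one of them differs from the rest.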

    33. Question: I am in Europe and I've tried to download the TABARI software from the web site but unfortunately I don't have within my reach a Mac OS 6.0 PC. Although I downloaded the TABARI software I find it difficult to understand how we input the data into the system and then how to evaluate the output.

    Answer: In terms of the program, there is a Windows (and also a Linux) version of TABARI that works almost the same as the Macintosh version, so you could just use that. However, the more likely problem is going to be finding texts to code. In North America, we usually can use the NEXIS data base to access international news reports from Agence France Presse and other agencies, or more recently the Factiva data base to access Reuters. When I've talked with colleagues in Europe, it seems that these two services are not as easily available there, but you could check with your library to find out. Another alternative would be to use an existing data set -- e.g. the World Events Interaction Survey (WEIS) data that covers 1966 to 1993. This was coded by a number of different projects over the years; it only includes the coded events, not the text.

    34. Question: I have been able to download TABARI, but I have not been able to find its source code. I want to change the output format for my project. Where can I find it?
    Missouri, USA, September 2000

    Answer: Look at the "TABARI.zip" file on the CD-ROM -- the source code is in there. This doesn't have the Windows interface (I think the zipped file contains the Linux version; the Mac version is in a .sit file in the Mac directory on the disk).

    35. Question: I was doing some initial coding with TABARI and wanted to ask a question. I notice that when I go to modify the actor list, it displays so many actors that the ones I need go off the screen. I've tried re-setting my window size, but that doesn't seem to do the trick. It could be that unless the font size is quite small, Windows shows fewer lines than the Mac(?).
    USA, April 2000

    Answer: If you go to the TABARI.h file and "uncomment" the line // const int WINDOW_HEIGHT = 24; and then comment-out the next line (which sets the variable to 48), I think that will take care of the problem, since the modify routines use WINDOW_HEIGHT to determine how many lines to display. As best I can tell, the "DOS console" output window is limited to the classic 24H x 80W size of a dumb terminal -- if you increase the size, it changes the font, but doesn't give additional lines.

    36. Question: Whenever I substitute a demo file with any STANDARD file, TABARI will initialize the file and then hang. Somehow TABARI is not recognizing them properly. Any ideas as to what may be causing this?

    Answer: Try adding the following line to the end of the STANDARD...ACTORS file:
    ~~FINISH
    The actors and verbs files are supposed to have that as the final line, but I just noticed that the standard file doesn't. That would mess up the detection of the end of file, and Red Hat might handle that differently than Debian. A ~~FINISH line is also supposed to be at the end of the .verbs file. I didn't find anything weird in the .project file -- the editor doesn't seem to be the problem.
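
    If you have several dictionaries to check, the fix can be scripted. A small Python sketch is given here; ensure_finish_line is a hypothetical helper, and dropping trailing blank lines before the marker is my assumption about what the program tolerates.

```python
def ensure_finish_line(text, marker="~~FINISH"):
    """Return dictionary text whose last line is the ~~FINISH marker."""
    lines = text.splitlines()
    # Drop trailing blank lines so the marker really is the final line.
    while lines and not lines[-1].strip():
        lines.pop()
    if not lines or lines[-1].strip() != marker:
        lines.append(marker)
    return "\n".join(lines) + "\n"
```

    The result uses "\n" line endings, so it may still need converting to the format of the machine that will run TABARI.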

    37. Question: I think adding ~~FINISH helped, however, I received another message; although it initialized the STANDARD actors file okay, it then said:
    error in the input line
    non-governmental organization ING
    problem: no code ("[...]") has been specified
    This leads me to believe that I need to make changes to the codes in the demo options file(?).
    USA, April 2001

    Answer: This is in all likelihood occurring in the line INTERNATIONAL_COMMISSION_OF_JURISTS [ING] ; international non-governmental organization ING. This seems to be a single line (and the phrase occurs as a comment, so it should be ignored). However, based on the error message, your editor might have put a line-feed in after "international" (I'm guessing this is the reason since that is where a line-feed would go if it was word-wrapping to 72 characters). Check whether that was the case. Otherwise, just eliminate the comment -- it may be overflowing the line length (actually, let me know if this was the problem). If you are just hitting C, the program will continue anyway. The CODES commands in the options file affect how the codes are displayed, but in fact the system will take any code it finds -- in other words, it doesn't cross-check these against the options file.
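
    A wrapped comment of this kind can be found mechanically by looking for lines that have text but no bracketed code. The Python sketch below is only a rough check for an actors file: lines_without_code is a hypothetical helper, and skipping lines that begin with "~~" is an assumption based on the ~~FINISH marker.

```python
import re

CODE = re.compile(r"\[[^\]]*\]")

def lines_without_code(text):
    """Return (line number, line) pairs that have text but no [CODE]."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        body = line.split(";", 1)[0].strip()   # ignore the comment part
        if not body or body.startswith("~~"):
            continue
        if not CODE.search(body):
            problems.append((number, line))
    return problems
```

    On a file where the comment has wrapped, the second half of the comment shows up as its own line with no code, which matches the error message above.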

    38. Question: I need to be able to do two things: add new actors/verbs to the dictionaries and get TABARI to use these dictionaries instead of the demo ones. How do I tell the program to do this?
    USA, April 2001

    Answer: The file you need to change to use alternative dictionaries (rather than the demo dictionaries) is called TABARI.Demo.project. This is what tells TABARI what files to use. You can name the new version whatever you would like -- the program will prompt for the name after you enter "C" at the first prompt.

    39. Question: I am assuming TABARI prompts you each time it finds an event. How do I make TABARI run through the entire file and give me an output file, what is the output file called, and how are the data stored in it?
    USA, May 2001

    Answer: There is an "auto-code" function -- it is one of the menu options -- that will code an entire file automatically. It should only take a few seconds (if you turn off all of the display options; otherwise it takes a little longer).

    40. Question: I would like to code not just Reuters, but as many sources as I can access through Lexis-Nexis. Unfortunately, the program didn't work when I tried to implement the ISSUES syntax in the Options file as described on p. 117 of the KEDS manual. Is that feature fully implemented in TABARI? If so, how does it work?
    USA, October 2002

    Answer: Yes, it is now implemented, though only in the latest Macintosh version of TABARI (vers 0.4). It works pretty much the same as in KEDS (which should, in fact, be working -- the PANDA project was using ISSUES for years and I thought it was fully debugged); the documentation is in the "TABARI.0.4.changes.doc" file.

    41. Question: I'm confused about whether TABARI/KEDS will actually read the entire text or just the lead sentences. You mention the WEIS coding for West Africa, using full-story coding, but as I understand it, TABARI and KEDS are designed to look only at lead sentences. However, I just tried an article that the program coded correctly in my belief, something it couldn't have done by looking just at the lead sentence. I'm interested particularly in the "ISSUES" facility and how the program goes about identifying it. My hunch is that coding only the lead sentences might suffice to identify source/target and so on, but it might bias the identification of the issue under contention.
    USA, November 2002

    Answer: TABARI is *expecting* to see only a single sentence. However, if you give it multiple sentences in a single record, and it doesn't find anything in the first sentence, it will keep going through the text treating it as though it were a single sentence, and often as not it does a pretty good job pulling events out anyway (this is the advantage of sparse parsing -- it works even when a strictly grammatical parser would be choking). This is not how I recommend doing things, but I've had a couple other people do this and get results they were happy with. The cleaner way to do this is to set up your text filter so that it outputs all of the sentences in the story as separate records -- the NEXISFilter programs (in their various C, Pascal and Perl incarnations) can all do this; they assign separate serial numbers to the story and the sentence within the story. In the coding process, sentences that don't contain events are just skipped, and all of the sentences that do contain events are coded. This is more reliable for coding, though it does generate [many] duplicate events (though any news wire coding will do this). In short, the distinction between the lead-sentence and full-story coding comes in the filter, not in TABARI or KEDS. Also since the "ISSUES" facility just does string matching, it doesn't care about grammar at all, so giving the system a big chunk of text would probably give credible results.
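
    The filter-side approach can be sketched in a few lines of Python. This is not the NEXISFilter code: story_to_records is a hypothetical function, the header layout (date, then source-story-sentence) is an assumed format, and the sentence splitter is naive enough to break on abbreviations such as "U.S.".

```python
import re

def story_to_records(date, source, story_number, story_text):
    """Split a story into one record per sentence, numbering each sentence."""
    text = " ".join(story_text.split())
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    records = []
    for sentence_number, sentence in enumerate(sentences, start=1):
        header = f"{date} {source}-{story_number:04d}-{sentence_number:02d}"
        records.append(f"{header}\n{sentence}\n\n")
    return records
```

    Keeping the story number in every header is what makes it possible to remove duplicate events from the same story afterwards.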

    42. Question: I'm having some problems importing the PANDA dictionary. The problem, it seems, is that TABARI is not parsing 'compound' dictionary fields correctly. For example, [055!] works fine, but [055!/2] will cause a seg fault. The problem is somewhere in the 'storeEventString' function (codes.cp), but before I start debugging I was wondering if you can verify if this is a problem in the Macintosh version.
    New Jersey, USA, October 2002

    Answer: Actually, it's not a problem, it's an unimplemented feature. What 0.4 will handle is string-based issues -- that is, it will do a simple string-based content analysis of the text and output codes when it finds the strings. I haven't implemented the "embedded codes" (or whatever they were called in KEDS for issues), and I gather PANDA used those a lot. There is a simple work-around for this, since you could simply simulate it by creating a unique code for the various combinations (i.e. something like "055_2!" instead of KEDS's "055!/2") and then parse out those issue-based subfields in the resulting data. A bit of a pain but it will give you all of the same functionality, and TABARI has plenty of room for unique codes.
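
    The work-around amounts to two small string operations, sketched here in Python. Both function names are hypothetical, the "_" separator is just the one suggested above, and I am treating the trailing "!" as an opaque part of the code without assuming anything about what it means.

```python
def flatten_code(keds_code):
    """Turn a KEDS-style compound code such as '055!/2' into '055_2!'."""
    base, sep, sub = keds_code.partition("/")
    if not sep:
        return keds_code            # nothing to flatten
    suffix = "!" if base.endswith("!") else ""
    return base.rstrip("!") + "_" + sub + suffix

def split_code(flat_code):
    """Recover (base, subfield) from a flattened code in the output data."""
    base, _, sub = flat_code.rstrip("!").partition("_")
    return base, sub
```

    flatten_code would be run once over the dictionary before coding, and split_code over the coded output afterwards.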

    43. Question: What do you do if a single sentence contains multiple events? For example consider the following sentence: President Bush said today that he was very pleased with today's meeting with President Chirac in which the two countries agreed to increase aid to Afghanistan. To me, this one sentence might contain three WEIS events: a positive comment by Bush, a meeting of the US and France, and a promise of material aid. But how would you handle it?
    Texas, USA, June 2004

    Answer: We "subordinate" comments -- in other words, we code something as a comment only if we can't code anything else from it. I'm not sure we'd get this one coded correctly, but... hmmm... ideally we would get an agreement with USA and France, and then two "promises" for aid. I doubt that the sparse parse would be that accurate, however. The meeting itself would be no problem -- it would be picked up a zillion times in easily codeable form from other stories. One of the useful things is that even when one gets a story like this which is difficult to code, *usually* all of the relevant events will be picked up elsewhere -- these news feeds (unlike the older paper-based sources) are highly redundant.

    44. Question: I'm trying to use the ISSUE function in KEDS. I think I've got it figured out, but I just want to make sure. Once I create an ISSUE file, if KEDS finds the words in the ISSUE file, it will assign each lead in that category as designated in the issue file, right? I'm trying to keep political and economic-related events separate. I'm creating an issue file full of "economic" terms to keep track of this.
    USA, June 1999

    Answer: Yes. Except for the "PLACE" issue, which does a little bit of parsing, the remaining issues are just simple pattern-matching (though note that you can also assign priorities to different patterns, so you can handle economic and political issues in the same file). I've never worked directly with the ISSUE facility, but PANDA did a lot with it, so it is probably debugged fairly well.

    45. Question: I have no idea how to change settings in Linux, but I'm hoping to port over a different actor list and a different verb list and some text files and see what happens. Does this sound right?
    USA, February 2001

    Answer: To change the actors, verbs, and other dictionary files, all you need to do is edit the input files. There are Linux text editing utilities that come with the installation -- in other words, they are already on your machine -- so you've got everything you need to work with.

    46. Question: My intention is to use your KEDS program, but I thought I would first check with you to see what options might be available. Should I just download what you have posted on your web site, or are there other means to purchase the required software and manuals?
    Washington, D.C., USA, January 2003

    Answer: No, everything we've got on the web site is our current best effort (well, almost everything -- sometimes stuff doesn't get posted quickly, but that is due to disorganization, not policy). You'll probably want to use TABARI rather than KEDS -- it is a more recent program and is more accurate than KEDS (also runs on Linux and Windows as well as the Mac, though the Mac version is the most current).

    47. Question: I noticed on the web page that the dictionaries for the coding of the Middle East are available. I hope to use KEDS to obtain data on intranational political violence (primarily in Latin America and Africa) using the Nexis service. I am still at the early stages of this process. In the manual you suggest using an existing dictionary, and then tweaking it if necessary. Do you know of a dictionary that someone has created specifically for intranational political violence?
    USA, September 1999

    Answer: Yes -- we've got a dictionary that we developed specifically for Colombia, and another for Nigeria. They have a lot in common with the international dictionaries, but have some additional vocabulary, particularly on drugs (come to think of it, we've got a Mexico dictionary as well).

    48. Question: One of my students was working with the latest version of the WEIS dictionaries and found that they crashed both KEDS and TABARI 0.2. I noticed that some of the dictionary entries have { } and <> embedded in them. Are these supposed to be supported by KEDS and TABARI 0.2? I deleted the forty or so lines containing these and afterward both of the programs could work with the verb dictionary.
    Florida, USA, February 2003

    Answer: Your supposition is correct: those things only work with TABARI 0.4 -- I should probably document that somewhere in the dictionaries. The one other possible problem you might run into is that some of the TABARI dictionaries (at least the ones we are using here) were running into the limits of memory for phrases. Fix is trivial -- just increase the size of the arrays -- and there is a warning that pops up a lot before you hit the limit.

    49. Question: Is there a way to allow adjectives to be coded by perhaps extending the verb dictionary?
    United Kingdom, July 2002

    Answer: We do a lot of this in the verbs dictionary when we add direct objects to the root verb. The content analytical "issues" would be another way of doing it. But if most of the material you are interested in is distant from the verb of the sentence (i.e. buried in subordinate phrases or conditionals), you might be better off with a more general content analysis package.

    50. Question: I looked at all the various parts of TABARI that I can see on Linux and I am still not sure I understand how to change the actor and verb files. Are there some instructions for this? Basically, I want to use the verb file you sent and I am working on adding stuff to the actor file, but I must just be missing how I tell TABARI what files to use. I would appreciate any advice you can offer.
    USA, March 2001

    Answer: Look in the "project" file (it is probably called "TABARI.Demo.project") -- that is where the information is set. Specifically:
    <actorsfile> TABARI.Demo.actors
    <verbsfile> TABARI.Demo.verbs
    <optionsfile> TABARI.Demo.options





    Analyzing TABARI/KEDS Data Sets

    51. Question: I am having problems opening the data set. Could you tell me under which program I can read the data set if I am working with a PC, and not a Macintosh?
    Russia, May 1999

    Answer: If you click on the link "Download Balkans data (.zip)", that should initiate an FTP file transfer of a file that is in the "Zip-It" file compression form; you should be able to open this on most machines running Windows. The files themselves are just simple ASCII text; any word processor or spreadsheet should be able to read them.

    52. Question: I need to retrieve raw KEDS data for U.S.-Soviet interactions through the end of the Cold War, rather than just using the aggregated Goldstein measure (in the Levant-WEIS set). I am attempting to get count data on the number of agreements between the two states. I have downloaded older KEDS archive data but they will not open in Excel, and therefore I am unable to look at them to see what data they contain. Therefore, I am requesting your guidance in what data set I should use to get the raw KEDS data and what format the older archived KEDS data sets are in.
    Indiana, USA, April 2004

    Answer: The raw KEDS data are simply a "flat ASCII" ("text") file -- the reason that Excel probably won't read them is that the file is very large (about 190,000 lines). All of the raw data are in this format -- the only difference you might encounter is that some of the older sets were saved only in Macintosh format, whereas now the .sit files are Mac and the .zip are DOS (the only difference is in the line endings, and many text processors translate these automatically anyway). One caution: while there are USA/USR records in that data set, they only record events where the USA and USSR were discussing the Middle East (or where the Middle East was somehow mentioned in the story). If they were talking about, say, the Intermediate Nuclear Forces agreement, it won't show up.

    53. Question: I have downloaded the KEDS Count tool and I have been trying to run it against the West Africa data set that is on the KEDS web site. I am running the software on an iBook using Mac OS 10 (but it appears to open the Count tool in Mac OS 9.2). Once I have selected the appropriate file from the file selection dialog box that appears when you run KEDS_Count, everything appears to run normally -- however I end up with an error, simply stating "The application KEDS_Count 1.0b8 has unexpectedly quit". Do you have any idea what this could be? Do you know if this is a common problem?
    United Kingdom, November 2002

    Answer: That program is one of our less robust efforts, and actually if you've got access to a Windows machine and don't mind doing a bit of cross-platform work, Dale Thomas's "Aggregator" program is a lot better. But in terms of KEDS_Count, it crashes (I forget why...) when the input file is too long -- if I remember correctly, a file of data longer than a year (or maybe it is two years) will crash it. So if you've got a longer file, you might try breaking it up. It will only work under OS 9.2; it's an old Pascal program.
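    Breaking a long file into one-year pieces could be done along these lines. This is a rough sketch that assumes each record starts with a YYMMDD date; the sample records, the actor codes and the function name are made up for illustration.

```python
from collections import defaultdict


def split_by_year(lines, year_of=lambda line: line[:2]):
    """Group event records by year, preserving their original order."""
    chunks = defaultdict(list)
    for line in lines:
        if line.strip():                     # skip blank lines
            chunks[year_of(line)].append(line)
    return dict(chunks)


# Hypothetical tab-delimited records: date, source, target, event code.
records = [
    "890103\tNIG\tGHA\t031\n",
    "890517\tGHA\tNIG\t121\n",
    "900212\tNIG\tLBR\t223\n",
]
for year, chunk in split_by_year(records).items():
    print(year, len(chunk))
```

    Each chunk could then be written to its own file and run through KEDS_Count separately. If the dates in a given file are laid out differently, pass a different year_of function.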

    54. Question: I want to know if the KEDS data will be available for years before 1990. If it is not available now, when will it be?
    USA, January 2004

    Answer: You probably used the PANDA data, which was coded using the KEDS program but by a different project. I thought they had coded from about 1985 forward (it wouldn't go back much before 1985) and I'm not sure why they haven't posted it -- the data definitely existed at some point. If there is not a link to the PANDA project on Gary's site, go to our site and get to the "Other Links" page, and we've got a link.

    55. Question: We'd like to use the KEDS data for the Levant from 1979-1998 to update a data set. However, I'm having trouble getting the data set to open. It's too large for Excel. Any suggestions? Also, if it's feasible, we'd like to do quarterly measures of our variables rather than annual, but we may be restricted to annual depending upon economic data available to us for Israel. What are our options in this regard?
    Texas, USA, February 2001

    Answer: You might try opening the data in Word or in a text editor -- it is definitely all there. You might also take a look at our "KEDS_Count" program, which is designed to do aggregations, though I don't think this has a quarterly option. Getting a hold of Stata do-files is the best way to handle aggregations. There were/are a couple of folks at Indiana who spliced COPDAB and WEIS -- see Reuveny, Rafael, and Heejoon Kang. 1996b. "International Conflict and Cooperation: Splicing the COPDAB and WEIS Series." International Studies Quarterly 40,2:281-305. -- which would give you the first part. I'm not sure which version of WEIS they used -- the Tomlinson version goes to 1992 or thereabouts. However, the KEDS-coded data is a lot more detailed than WEIS (there is no overlap with COPDAB) and you'd get some discontinuities there.
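    For the quarterly measures, the grouping itself is simple enough to sketch in a few lines. The example below is an illustration only: it assumes records have already been read into (date, source, target, code) tuples with YYMMDD dates, and the sample records are invented.

```python
from collections import Counter


def quarter_of(yymmdd):
    """Return (two-digit year, quarter 1-4) for a YYMMDD date string."""
    month = int(yymmdd[2:4])
    return yymmdd[:2], (month - 1) // 3 + 1


def quarterly_counts(records):
    """Count events per (source, target, year, quarter)."""
    counts = Counter()
    for date, source, target, _code in records:
        year, quarter = quarter_of(date)
        counts[(source, target, year, quarter)] += 1
    return counts


# Hypothetical records for illustration.
records = [
    ("790415", "ISR", "PAL", "223"),
    ("790620", "ISR", "PAL", "031"),
    ("790702", "PAL", "ISR", "121"),
]
for key, n in sorted(quarterly_counts(records).items()):
    print(key, n)
```

    Summing a scaled score instead of counting would only require replacing the "+= 1" with the score for each record's code.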

    56. Question: What do you think about converting the categorical codes into a continuous scale that is meaningful to genocide data? In particular, I was wondering how a Goldstein's scale-type could be applicable to genocide modeling.
    USA, October 2002

    Answer: If I remember the genocide coding correctly, there was a fairly clear order to the categories, and consequently a scale similar to Goldstein's should work. Most work with event data in political science has used scaled data rather than categorical data. In the work that we've done, the precise values of the scale don't matter very much, as long as they roughly approximate a cooperation-to-conflict dimension. Typically about 50% of the variance in the models is explained just by the presence or absence of events, and scaling (or categories) just picks up a remaining 20% to 30%.

    57. Question: I was toying around with the Levant data, converting scores to the Goldstein scale, and I noticed that the WEIS scores "142" "199" and "215" did not have a Goldstein conversion listed on your web site. Do you have an updated Goldstein conversion table so that I can remain uniform with the other scores and avoid creating my own score conversions here?

    Answer: 142 is actually a WEIS code -- deny policy -- and is scored -1.1. For 199 and 215, either dropping the events or converting them to their cue categories (e.g. treat 199 as 190, Goldstein -4.0, and 215 as 210, Goldstein -5.0) would probably work.
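    The cue-category fallback can be sketched as a lookup that tries the full three-digit code first and then the code with its last digit set to zero. The table below holds only the three values quoted in this answer; a real conversion table has many more rows, and the function name is made up.

```python
# Only the values quoted in the answer above; a full table has many more rows.
GOLDSTEIN = {"142": -1.1, "190": -4.0, "210": -5.0}


def goldstein_score(code, table=GOLDSTEIN):
    """Look up a WEIS code, falling back to its cue category (e.g. 199 -> 190)."""
    if code in table:
        return table[code]
    cue = code[:2] + "0"          # 199 -> 190, 215 -> 210
    return table.get(cue)         # None if the cue category is missing too


for code in ("142", "199", "215"):
    print(code, goldstein_score(code))
```

    Returning None for codes that cannot be scored leaves the choice of dropping those events to the caller.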

    58. Question: I am trying to obtain the total material conflict between ISR, PAL, EGY, SYR, and LEB. How do I go about doing this?

    Answer: The easiest measure would be to use the Goldstein scaled totals that are in the Excel file for the Levant data (there are references to the original Goldstein article in most of our papers -- it is just a conflict-cooperation measure). Almost all of the conflict in the region involves two dyads -- Israel-Palestine and Israel-Lebanon -- and if you just need a quick conflict measure, the total of those numbers should work fairly well. The data are aggregated by month, 1979 to 2003.

    59. Question: What is the connection between the WEIS coding and the Levant data set? Incidentally, we did not see any triple digits codes (WEIS coding) in the event coding files, although it says that the files are WEIS coded. Also, where can we download a simple description of the files?

    Answer: You should definitely be seeing three-digit codes in there -- in fact all of the original codes are three-digit. We frequently use 2-digit codes in our analysis but those are derived from the three-digit codes. In addition, all the current Levant data sets should include a simple description of the file.

    60. Question: Many of your events are given general WEIS category codes 020 (comment), 030 (consult), etc., which Goldstein did not employ when devising his scale. What is the convention (if any) in weighting these events? I've been weighting them with the average for events of that type but thought I'd consult with you before my analysis proceeded too far. Also, how are pessimistic comment (022) and optimistic comment (024) weighted? While I understand that the comment weights are near zero, neutral comment has a (small) negative weight so presumably a pessimistic comment would have a somewhat larger negative weight.
    USA, June 1999

    Answer: This is the same thing that we do. We use the same weights that Goldstein uses. There are a few codes in that data set that are not in the original WEIS. We've not been really systematic about these, but there also aren't a lot of them.

    61. Question: I want to take the MID data, identify the periods of time a pair of states are engaged in a dispute, then go to KEDS, and pull out all the KEDS events during that time period between the pair of states. We've started to call these KMIDS, by the way. Suppose I pull out the Iran-Iraq events (i.e. use the MID data to identify the periods of time Iran & Iraq are engaged in MIDs, then get all the KEDS events in the Levant data set during those times). If I use the Levant data set, is it the case that I will miss Iran-Iraq events during a dispute because of the way in which the Levant data were coded? Is there a problem if I try to extract KEDS events from a KEDS data set from pairs of countries that were not the focus of the data set -- i.e. if I use the Levant data set as the source, is there a problem for states other than Egypt, Israel, Jordan, Lebanon, Palestinians, Syria, USA, and USSR/Russia?
    USA, June 1999

    Answer: Apparently it is possible to extract quite a bit of information on conflicts in a region, even if the countries aren't the focus. This is because a story is selected if one of the named countries appears *anywhere* in the story, so if one has a lightning rod such as Israel involved, you pick up a lot of stories where Israel is mentioned only peripherally. As I might have mentioned in the past, I know of one study (of the "I could tell you but then I'd have to kill you" genre) where the Levant data set was used as a surrogate for a global data set (in fact I think it was compared against either PANDA or GEDS, I forget which) and did very well in predicting conflicts (accuracy in excess of 70%). But that study used a bunch of diffuse statistical techniques, whereas I believe that MID studies tend to be more direct. My guess is that as long as you are dealing with conflicts that are seen as interlinked (e.g. the Arab-Israeli and Iran-Iraq conflicts), then you'll pick up almost all of the major events, but not the minor ones. If you get conflicts that are pretty much independent (e.g. Korea), there won't be much, though with full-story coding you still might pick up a few things.

    62. Question: Do you have any of the Bosnia data in raw form? That is, as simple verbal chronologies before being coded according to the WEIS format? I would like to code some of it (1998-99) using the BCOW coding scheme.
    USA, June 1999

    Answer: Unfortunately, we're using Reuters Business Briefing reports that are covered by a strict copyright and contractual agreement that prevents us from transferring the raw text outside of this institution. Do you have access to NEXIS? -- they don't have Reuters but they've got almost everything else, and you could probably assemble a chronology from there fairly easily.

    63. Question: The event codes in the Levant data set contain some codes that are not listed in Goldstein's article (I would like to weight the events). Do you weight events? If so, what do you do about the events that are not listed in Goldstein, 1992?
    USA, June 1999

    Answer: We added an assortment of codes along the way -- frankly we haven't been all that systematic about this since all of our analytical work has either used aggregated scores or else two-digit scores, but the codes are there. Almost 50% of the variance is explained by the presence/absence of events (and most of the rest is explained by gross distinctions between cooperative and conflictual events).

    64. Question: As I recall, some effort has gone into turning event-data scales into cardinal rather than ordinal measures so that they can be aggregated for larger time periods. Do you know whether anyone has thought about mapping event data onto MID data so that the same sort of aggregation could occur?
    Massachusetts, USA, October 2003

    Answer: Assuming that one accepts the legitimacy of Goldstein scaling and can get around the source-comparability problems (as you know, two big assumptions) this would actually be very straightforward:
    1. Locate as many MIDs as possible that are covered somewhere in the existing universe of event data sets (expanded WEIS, maybe COPDAB, maybe GEDS, KEDS, PANDA, IDEA, Goldstein-Pevehouse).
    2. Figure out the begin-end dates for the MID and the dyads involved.
    3. Run Goldstein totals (or some equivalent for COPDAB and GEDS) for the dyad-period. As an alternative to Goldstein scores, one could also do a vector of event-type counts, using either the 4-category verbal/material cooperation/conflict scheme we use, or the 3-category scheme IDEA uses, or, for that matter, WEIS cue categories. That would make for a neat little exercise in cluster analysis, and then one could see whether it is possible to project that space onto MID.
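    Steps 2 and 3 above can be sketched as a filter-and-sum over event records. This is a minimal illustration, not a tested procedure: the events, the date window and the function name are invented, and the only score used is the -1.1 for WEIS 142 quoted elsewhere in this FAQ.

```python
from datetime import date


def dyad_period_total(events, actor_a, actor_b, start, end, score):
    """Sum scaled scores for events between two actors within [start, end]."""
    pair = {actor_a, actor_b}
    total = 0.0
    for day, source, target, code in events:
        if start <= day <= end and {source, target} == pair:
            value = score(code)
            if value is not None:      # skip codes with no scale value
                total += value
    return total


SCORES = {"142": -1.1}                 # single illustrative entry
events = [                             # made-up events, not real data
    (date(1980, 9, 23), "IRQ", "IRN", "142"),
    (date(1980, 9, 25), "IRN", "IRQ", "142"),
    (date(1980, 9, 25), "IRN", "USA", "142"),   # different dyad, ignored
    (date(1981, 1, 5), "IRQ", "IRN", "142"),    # outside the window, ignored
]
total = dyad_period_total(events, "IRN", "IRQ",
                          date(1980, 9, 1), date(1980, 12, 31), SCORES.get)
print(total)
```

    The sketch treats the dyad as undirected, adding both directions together. Keeping separate totals for each direction, or a vector of event-type counts instead of a sum, would be a small change to the loop.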

    65. Question: I am beginning to do some analysis on the data I have collected/coded with TABARI. I have found that there are several duplicates and some instances of mis-coding, so I am going through to check each event that TABARI coded to make sure it is legitimate. A very long process, as I am sure you can imagine. I was wondering if there is a way to display the "record", and/or perhaps even the text of the coded events in output. Is this possible? I played with the options file, but was unsuccessful.
    USA, June 2002

    Answer: At the moment, there isn't a way to add the text to the output record, but you can output the identification information (that is, any identification information following the data) from the record -- see the "OUTPUT:" command in TABARI.changes.doc. That would at least allow you to identify which texts generated the events. This works in version 0.4; I don't think it was in 0.2.

    66. Question: When I try to open the Levant WEIS text file that I downloaded, the aggregator program gives me the following error message: "run time error '62'. input past end of file". How should I proceed considering I am using Windows 2000?

    Answer: In all likelihood it is a file-format incompatibility -- Aggregator is expecting a DOS/Windows format and you've probably got the file in either Mac or Unix format. There are a bunch of little utility programs that can correct this -- probably the quickest thing would be to call your computer center and see what they recommend: once you've got a program, it will just take a couple seconds.
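    If no utility is handy, the conversion is small enough to do in a few lines of script. The following is a sketch only: it assumes the problem really is the line endings, and the function names are made up.

```python
def to_dos_line_endings(data: bytes) -> bytes:
    """Normalize Mac (CR), Unix (LF) or mixed line endings to DOS/Windows CRLF."""
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unified.replace(b"\n", b"\r\n")


def convert_file(src_path, dst_path):
    """Read a file in binary mode and write a CRLF copy to dst_path."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(to_dos_line_endings(data))


print(to_dos_line_endings(b"line one\rline two\r"))
```

    Working on bytes rather than text avoids any automatic newline translation by the language itself, and a file that is already in DOS format passes through unchanged.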

    67. Question: I'm trying do some work with the Levant data. What program can I use to open it? It's too big for Excel, Stata won't do it, nor will Access.

    Answer: You just need a regular text editor that works with flat ASCII files. We use BBEdit on the Mac, but presumably there is a Windows equivalent, and you could also try Microsoft Word.

    68. Question: In the Levant data set, the web page indicates that it is a "folder containing events (N=92,687) and a tab-delimited text file containing monthly total scores (Goldstein scale) for dyadic interactions within the following set of countries: Egypt, Israel, Jordan, Lebanon, Palestinians, Syria, USA, and USSR/Russia. Coverage is April 1979 to December 1998." When I unzip the events file, it appears to be daily event scores from 79-04-15 to 98-12-31, and includes more countries than those given above. I'm happy to have more data than the web site indicates, but can you tell which countries are included completely in it?
    Texas, USA, June 1999

    Answer: The daily event data include all interactions that involve the stated actors and *any* other actor in the system (as source or target). You'll also find a large assortment of interactions from actors outside that set: these occur, for example, when the original sentence being coded dealt with the UK and Italy discussing an intervention in Lebanon. Does that make sense? In other words, the criterion for inclusion was whether a story was retrieved in the original search used in NEXIS, not on the resulting coded data. The tabulated file of Goldstein-scaled scores, in contrast, just has the interactions within that set of eight actors (i.e. 56 directed dyads).
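    The figure of 56 follows from counting ordered source-target pairs among the eight actors (8 x 7), which a short check confirms. The actor names are taken from the web page description quoted in the question; the variable names are arbitrary.

```python
from itertools import permutations

ACTORS = ["Egypt", "Israel", "Jordan", "Lebanon",
          "Palestinians", "Syria", "USA", "USSR/Russia"]

# Ordered pairs: (source, target) and (target, source) are distinct dyads.
directed_dyads = list(permutations(ACTORS, 2))
print(len(directed_dyads))
```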

    69. Question: I have WEIS data from 1966 to 1992. It appears that the data covers all countries. Has this data been updated to 2003? If so, is it possible to get an updated file?
    Texas, USA, November 2003

    Answer: No -- 1992 was the last year that WEIS was updated globally (by Rodney Tomlinson at the Naval Academy). Beyond that the updates have just been for specific regions.

    70. Question: I am interested in the original WEIS aggregated quarterly. I am also interested in the data you have for the Middle East.
    USA, September 1999

    Answer: There are 40,000 or so directed dyads in WEIS. Do you want all of them? Are you interested in scaled values (e.g. Goldstein scale) or event counts? What level of detail on the coding (WEIS 2-digit or 3-digit)? This can be downloaded from the KEDS web page; this includes an Excel spreadsheet with monthly aggregations (Goldstein scaled).

    71. Question: I am working on a project utilizing the Gulf Data set. Does the Gulf Data set consist of only actors or targets who are countries of the Gulf region (i.e. only actors or targets pertaining to Saudi Arabia, Iran, Iraq, Kuwait, Oman, etc., per the NEXIS search listed on the KEDS Data web page), or are all other countries included? I assumed that the first was true, but then I noticed that there were actor/target combinations in the set that, given a search for only gulf state actors or targets, should not be included, such as USA/Russia, France/Russia, United Nations/United Kingdom, Greece/Vatican, etc. I am trying to put together a data set for six actors (Great Britain, Canada, Italy, Iran, China, and Saudi Arabia) using Lexis for five specific time periods between 1982-1998, and any shortcuts in our data collection efforts would be extremely beneficial.
    Texas, USA, January 2001

    Answer: Our data probably won't help you much except for Iran and Saudi -- the process will generate events on some of the actors, but it will be a very incomplete set. Probably your best bet would be to start over, since otherwise you'll probably get duplicate records. We've got a program that automates the downloading of NEXIS records from the NEXIS "Academic Universe" site -- that might save you some time. There is also a new reformatting program written in Perl; both are on the web site.

    72. Question: I was wondering if you had the WEIS data updated through the present? I have up to 1992, but would like the newer stuff as well. If this is possible, that would be great. If not, if you wouldn't mind letting me know the criteria for including certain countries or not others in your data set, that would be nice, so I can update the data set.
    Texas, USA, October 2002

    Answer: It sounds like you've got the most up-to-date version of WEIS, which is the Tomlinson set. The closest thing one could get to an update is the new PANDA data set at http://gking.harvard.edu/data.shtml. This has global coverage to 2001 or thereabouts, and you can translate the PANDA codes directly into WEIS codes. There will be some discontinuities where you splice the data (it goes back to 1990 or so, so you will have some years of overlap) because the original WEIS uses the New York Times whereas PANDA uses Reuters, which has a lot more events, but if you can figure out how to deal with that problem it would give you an extension.

    73. Question: For my purposes I would like to use monthly event counts based on event categories relating to cooperation and conflict. As for the measure, I'm not sure whether an event count or a scaled total would be more useful. If you already have the aggregated data in a separate data set, that would make it a lot easier for me. Otherwise, I'll play around with the aggregator program you suggested below and see if I can get the information I need.
    Pennsylvania, USA, February 2004

    Answer: There is a tab-delimited file in the download that includes Goldstein-scaled totals (using Joshua Goldstein's scale for WEIS) by dyad-month. If you are okay with scaled totals for a measure of cooperation and conflict, you could just use that. I've also got a program that does counts for verbal cooperation, material cooperation, verbal conflict and material conflict (by dyad) -- each of those categories corresponding to a set of WEIS 2-digit categories. I've either got that available somewhere or could run it easily enough.
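    The four-category counts amount to mapping each event's two-digit WEIS category to one of the four labels and tallying by dyad and month. The sketch below shows the idea. The mapping in it is a deliberately tiny placeholder (the answer does not list which two-digit categories go where), so it should be replaced with the project's actual grouping; the sample events are also invented.

```python
from collections import Counter

# Placeholder mapping for illustration only; not the project's actual grouping.
CATEGORY_OF = {
    "02": "verbal cooperation",
    "07": "material cooperation",
    "12": "verbal conflict",
    "22": "material conflict",
}


def category_counts(events, category_of=CATEGORY_OF):
    """Count events per (source, target, YYMM month, category)."""
    counts = Counter()
    for yymmdd, source, target, code in events:
        category = category_of.get(code[:2])   # first two digits = 2-digit category
        if category is not None:               # unmapped codes are skipped
            counts[(source, target, yymmdd[:4], category)] += 1
    return counts


events = [                                     # made-up events
    ("040203", "ISR", "PAL", "223"),
    ("040210", "ISR", "PAL", "221"),
    ("040215", "PAL", "ISR", "121"),
    ("040301", "ISR", "PAL", "023"),
]
for key, n in sorted(category_counts(events).items()):
    print(key, n)
```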

    74. Question: I had just completed a finite state Markov chain analysis of the Israeli-Palestinian conflict with your 1979-1992 data set when I stumbled upon the 1979-2003 data on your web site. The two data sets do not appear to share the same events over the time period they have in common. Are the data sets from different sources? Or is there some other explanation?
    Washington, D.C., USA, October 2003

    Answer: The sources are probably the same (though we may have done some additional filtering), but the coding programs were radically different: the 1979-1992 set was generated with one of the earliest working versions of KEDS, whereas the current set is generated with TABARI, which should be a lot more accurate. We also update that every three months.

    75. Question: I'll be using KEDS/TABARI data in my research, but was unable to find any dyads for the conflict that seems most suited to my project (the Civil War in Yemen, 1961-1972). I'm looking at other, more recent conflicts, but my advisor suggested that I make sure there is no way to access data from Yemen, and assured me that if there is any data available for this conflict, you would know where to access it. Are there any resources you would suggest?
    USA, October 2003

    Answer: Probably your only option would be the COPDAB data set, which should be available through the ICPSR (COPDAB = Conflict and Peace Data Bank; PI is Edward Azar). COPDAB gets a little squirrelly in places, but the original focus of the database was the Middle East (Azar was Lebanese originally) and it might be pretty good -- it is certainly worth looking at.





    Programming

    76. Question: I have a couple of inquiries about the NEXIS filter for DOS/Windows. The zip file has only the source code -- I don't have a "C" compiler here -- is it possible to compile this into an *.exe file to be run directly from Windows or DOS?
    USA, October 2000

    Answer: With a couple of caveats, the answer is "yes", though one will definitely need a C compiler. The one change that needs to be made is the function:
    int GetTEXTFile (char filename[])
    which calls a Macintosh file-selection dialog (in other words, it brings up a nice point-and-click box on the Mac). The easiest way around this is just to use the standard C "fopen" function and type in the file name. Alternatively, I assume there is some Windows equivalent to the Mac functions. Everything else in the program is ANSI C (that is, the standardized version of C) and will compile under virtually any system. If you don't have access to a C compiler, here are some suggestions:
    -- Check out the computer labs being used by the introductory programming classes; they will almost certainly be using C or C++.
    -- The CodeWarrior "Discover Programming for Windows" series contains Windows compilers for C, C++ and Java, and about half a dozen programming books on a CD-ROM. It should cost about $60. There is also a Microsoft equivalent; I'm not sure about the price on that.
    -- This won't solve the Windows problem directly, but if you are using either a Unix or Linux system, just use the "gcc" compiler that is built into the system. The filtering can be done there, then the filtered data downloaded to Windows.
    The Nexis_Filter program is not particularly complicated (or even necessarily very well done), and anyone who has completed a basic programming class in C should be able to make these modifications.

    77. Question: I'm a little confused: I thought you were using the same code base for all platforms. Does the Mac code differ slightly? I'm running the Linux version of TABARI and would like to use the ISSUES function. Is there any chance ISSUES will be implemented in the Linux version soon? Otherwise I suppose my only options are to run the Mac version or run KEDS on a Mac.
    New Jersey, USA, October 2002

    Answer: You are correct -- the code base is identical for the Mac and Linux versions, and the "Mac" code *should* run in Linux. Everything is ANSI C++ and as far as I know it should work on that system -- just change
    #define MAC FALSE // see interface.cp for details
    #define LINUX TRUE
    #define WINDOWS FALSE
    in the TABARI.h file.

    78. Question: I have been looking at the TABARI code from version 2.02 and have a question. You seem to have implemented the three different coding options of ALL, SENTENCE, and CLAUSE, but how should these be entered into the options file for the program to recognize them? I always end up with the default of coding by CLAUSE.
    Florida, USA, April 2003

    Answer: That's what you get for reading the source code. As it happens, checkVerbs() was one of the few procedures that I imported from KEDS and just translated into C. However, it looks like I imported the comments but not all of the code -- in fact the only thing that is implemented is CODE BY CLAUSE.

    79. Question: When I convert MS Access Reuters articles into Text files, I get left with Date (tab) Time (tab) ID (tab) body. Originally, I planned to go from Access to Excel to Text, which deleted the time (I assume this, because I never knew it was there) and changed the date. However, Excel could not handle the file size. Now, I am going right from Access to Text and the time appears in each article. The time always appears as: 0:00:00. I was able to open the file in Word, find and replace the 0:00:00 with (nothing) then save as text. Unfortunately, Word now crashes too. You had suggested having the Perl program GEDS_format.pl take out the time. I do not know how to do this and was wondering if you could offer any suggestions as to adding a line to delete this and where that line should go.
    USA, June 2001

    Answer: A line of the form
    $line =~ s/0:00:00//;
    should do this -- that will replace the string 0:00:00 with a null string. I think if you put it after the line
    $curdate = $&; # set $curdate to $MATCH;
    that should work.
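    If editing the Perl script is not an option, the time column could also be dropped in a separate pre-processing pass before the file reaches GEDS_format.pl. The following is a sketch under the assumption that every line has the Date (tab) Time (tab) ID (tab) body layout described in the question; the function name and the sample line are made up.

```python
def drop_time_field(line, time_value="0:00:00"):
    """Remove the time column from a Date<TAB>Time<TAB>ID<TAB>body line."""
    fields = line.rstrip("\n").split("\t", 3)      # at most four fields
    if len(fields) == 4 and fields[1] == time_value:
        del fields[1]                              # drop the time column
    return "\t".join(fields) + "\n"


print(drop_time_field("6/1/2001\t0:00:00\t123\tBody text\n"), end="")
```

    Unlike a plain string substitution, this removes the tab along with the time, so no empty field is left behind, and it only touches the second column, so a "0:00:00" that happens to occur inside the article body is left alone.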

    80. Question: I am experiencing problems with running the program which I have downloaded from the KEDS website (HMM source code). Each time I try to compile it, I receive several error messages. I tried to do it in the Borland C++ compiler and in the Visual C++ compiler as well. (I work on a PC - MS Windows, and I do not have access to a Macintosh). What should I do, Professor, after having downloaded the HMM source code, to be able to run it? How should I prepare the data (the Balkans-set from KEDS) to make it possible for the HMM-program to work on it?
    Poland, April 2000

    Answer: I have not tried running the code in either Borland C++ or Visual C++, but I have run it with very minor modifications on the Unix g++ compiler (and I originally modified it from code written for a Solaris compiler), so it is fairly close to being standard. Also it is only C rather than C++ (though it does use C++-style comments), which might simplify things. Very little in the program is specific to the Macintosh -- in fact lately I've been running it almost exclusively in Unix on a supercomputer, and I could send you that code if it would be helpful. If you can find someone who knows C and is familiar with the Borland compiler, they could probably interpret the error messages and make the corrections fairly easily. As you have determined, you will need to transform the basic event data in order to convert it to a sequence that can be modeled using the HMM.

    81. Question: I have been working on some dictionaries (actors, verbs, locations for Cambodia) and now I am beginning to test them. However, I am having trouble getting AFP press reports off of Nexis using the nexispider script (nexispider_pl.pl). Apparently the articles are accessible through https protocol only and the script does not work for https. Could you suggest a work-around or modifications to the script that would enable it to fetch the articles from an https site?
    Florida, USA, January 2003

    Answer: Near the end of the Nexispider.pl program, there is a line reading $nexturl ='http://web.lexis-nexis.com' .substr($doc,$start+6,$end-$start-6). Change the "http" to "https" and see if that works. As best I can tell, Perl handles https the same way it handles http.
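
    To make the substring arithmetic in that line easier to follow, here is a rough Python equivalent. It is a sketch, not a port of Nexispider.pl: the function name next_url is hypothetical, and the guess that $start points at an href=" attribute (six characters long) and $end at its closing quote is an assumption based only on the +6 offset.

```python
def next_url(doc: str, start: int, end: int,
             base: str = "https://web.lexis-nexis.com") -> str:
    """Build the next URL from a link embedded in the page text.

    Perl's substr($doc, $start+6, $end-$start-6) takes end-start-6
    characters beginning at start+6, which is the slice doc[start+6:end].
    """
    return base + doc[start + 6:end]


if __name__ == "__main__":
    # Made-up page fragment, for illustration only.
    doc = '<a href="/universe/doc?id=1">next</a>'
    start = doc.find('href="')        # assumed meaning of $start
    end = doc.find('"', start + 6)    # assumed meaning of $end
    print(next_url(doc, start, end))
```

    Only the scheme in the base string changes; the slice itself is the same for http and https.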

    82. Question: I have started to figure out which stories I can skip in my coding and I am beginning to modify the code itself. Would the following lines work to skip stories in which the section=arts or leisure for all sources?
    if ($doc =~ m/SECTION:<\/strong> Arts/) { print "Skip arts story\n"; next;}
    if ($doc =~ m/SECTION:<\/strong> Leisure/) { print "Skip leisure story\n"; next;}
    USA, March 2001

    Answer: Yes, this is correct. You might find it helpful to find some sort of introduction to Perl (there are a lot of them now, at every level -- check the computer section of your local bookstore): Perl's notation for patterns is a bit abstract, but once you get the hang of it, it is fairly easy to use.
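
    If the list of sections to skip keeps growing, one pattern built from a list may be easier to maintain than one test per section. The following Python sketch shows the idea; it assumes the same "SECTION:</strong> " markup that appears in the question, and the names SKIPPED_SECTIONS and skipped_section are invented for this example.

```python
import re

# Assumption: the section label follows "SECTION:</strong> " as in the question.
SKIPPED_SECTIONS = ("Arts", "Leisure")
SECTION_RE = re.compile(
    r"SECTION:</strong> (%s)" % "|".join(re.escape(s) for s in SKIPPED_SECTIONS)
)


def skipped_section(doc: str):
    """Return the name of the skipped section the story is in, or None."""
    match = SECTION_RE.search(doc)
    return match.group(1) if match else None


if __name__ == "__main__":
    story = "<strong>SECTION:</strong> Arts; Pg. 4"
    section = skipped_section(story)
    if section:
        print("Skip %s story" % section.lower())
```

    As with the Perl tests, this matches on the start of the section name, so a label such as "Arts & Leisure" is also skipped.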

    83. Question: I am unsure how to modify the program for copyright related issues (for sources without "SECTION:"). Would something like this work?
    $bodyidx = index($doc,">Copyright");
    $langidx = index($doc,"<br>",$bodyidx);
    $body = substr($doc,$bodyidx,$langidx-$bodyidx);
    if ($body =~ m/1998 Miller Freeman PLC/) { print "Skip Miller Freeman source\n"; next;}
    USA, March 2001

    Answer: Yes, this works great.
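
    One thing the index/substr approach leaves open is what happens when a story has no ">Copyright" marker or no "<br>" after it. The Python sketch below follows the same steps and makes those two cases explicit. It is an illustration under the markup assumed in the question; the function names and the blocked tuple are hypothetical.

```python
def copyright_line(doc: str) -> str:
    """Return the text from '>Copyright' up to the next '<br>', or '' if absent."""
    start = doc.find(">Copyright")
    if start == -1:
        return ""                 # no copyright marker in this story
    end = doc.find("<br>", start)
    if end == -1:
        end = len(doc)            # no line break: take the rest of the story
    return doc[start:end]


def should_skip(doc: str, blocked=("1998 Miller Freeman PLC",)) -> bool:
    """True if the copyright line names any blocked source."""
    line = copyright_line(doc)
    return any(name in line for name in blocked)


if __name__ == "__main__":
    story = "<p>Copyright 1998 Miller Freeman PLC<br>LANGUAGE: English"
    if should_skip(story):
        print("Skip Miller Freeman source")
```

    Restricting the test to the copyright line, as the original code does, avoids skipping a story that merely mentions the publisher in its body.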





    Miscellaneous

    84. Question: I am trying to generate a data set on trade disputes and wonder if KEDS can do that. Ideally I'd like to be able to generate a set of bilateral cases from the 1950s, global in distribution. Do you think KEDS could do something like that? I am planning to use these as null cases to be compared with the subset of cases that are brought to the WTO.
    USA, September 1999

    Answer: It depends entirely on whether you've got an appropriate set of machine-readable texts of news reports. NEXIS will generally take you back to about 1980, plus or minus a couple of years, but no further. I think some of the major newspapers and magazines (NYT? Time?) have CD-ROM versions that go back much further, though I don't know whether they have text, or just images of the pages. In addition, they will not be very comprehensive. Optical scanning of text is an option, and works a lot better than it did a few years ago, but most people still find it more trouble than it is worth unless you've got a concentrated set of texts (e.g. a chronology, and such a thing might be available for trade issues). If you do have a set of machine-readable news reports, I would imagine that KEDS could code them pretty easily. You would need to modify the dictionaries a bit, and you would probably want to use the "ISSUES" facility (which does relatively simple content analysis, just matching patterns of words) to figure out what the dispute was about (e.g. tariffs, health and safety, dumping, whatever), but it should work. KEDS is optimized for coding sentences that contain transitive verbs (i.e. some actor doing something to another actor) and most trade disputes would fall into that category.

    85. Question: I am trying to code transcripts. What programs do you suggest using? Is human coding preferable?

    Answer: You definitely don't want to use a syntactical coder like KEDS or TABARI, since it requires grammatically correct English. Transcripts (or any conversational, or quasi-conversational -- e.g. instant messaging -- text) usually aren't grammatical. You may be able to use a thematic coder such as TextPack: that is, you can extract the information you need using just the occurrence of specific words and phrases. Under the "Links" page on our web site, there are links to sites by Harald Klein and by Bill Evans that list a large number of these packages. In addition, under some circumstances, human coding is preferable. In particular, if you are dealing with a relatively small number of cases and if you are dealing with coding rules that involve comparing information that occurs in a variety of places in the document (but the documents are too short for statistical indexing to work), human coding may be best.

    86. Question: Have you ever heard of Global Event Data System (GEDS)? Do you know where I can find it (I can't find it in ICPSR)?

    Answer: This was a project at the University of Maryland that collected a number of data sets (mostly fairly short -- just a couple of years) during the 1990s. Their data used to be available on the web from the Center for International Development and Conflict Management. The CIDCM link is http://www.cidcm.umd.edu/.

    87. Question: I was wondering: would you happen to have, or would you know of someone who might have, the old COPDAB data set in a format that might be easier to use than Microsoft Access? I've tried endlessly and thoroughly unsuccessfully to convert it into SPSS, but I can't get it to work.

    Answer: Try the "Stat/Transfer" program -- people around here swear by it and it apparently handles Access. Details are at: http://stata.com/products/transfer.html.

    88. Question: I am a grad student working on analyzing Saudi Arabian foreign policy and I am wondering what event data are available on Saudi Arabia. Any suggestions?
    New York, USA, March 2003

    Answer: Look at Gary King's web site at Harvard -- he has recently posted a 3.5-million event set covering the entire world; again, the temporal coverage is primarily the 1990s but I think he has material through mid-2002 -- it isn't in chronological order but Dale Thomas wrote an extraction program and you could check. If you need really recent material (e.g. within 24 hours), it wouldn't be all that difficult to just generate your own using our automated coding software, assuming you've got access to Lexis-Nexis (see the software.html page at our web site). We've got an older (late 1990s) coding dictionary for Saudi Arabia that you could use as a starting point, and then you would just need to update it.

    89. Question: I am inquiring about any assistance you or your department are open to providing researchers who are in need of your data-collection system. My dissertation requires me to assemble an event data set that "scores" the overall level of conflict-cooperation in the Japan-China dyad from 1970-current (or as current as I can get).
    Washington, D.C., USA, January 2003

    Answer: Current isn't the problem, since if you use a source available on NEXIS you can update as recently as about 24 to 48 hours. The older stuff is more problematic -- the source we use is Agence France Presse and it supposedly goes to 1991 but gets real spotty before about 1993. NEXIS has a new file called "Information Bank Abstracts" that goes back to 1969 -- we've never worked with it but for a well-monitored dyad such as Japan-China it might be just the thing. You might want to look at it and see whether it would give you the information you are looking for.

    90. Question: I am currently working on the validation of a model that employs event data to anticipate regional instability. One of the regions we would like to model is India/Pakistan. Unfortunately, we haven't been able to find any data sets that focus on this region and the funding for this project does not allow us to build a new data set. Are you aware of any existing data for this area? We are most interested in the time frame from 1990 to 2002, but other times may be useful.
    USA, October 2002

    Answer: Yes -- if you go to http://gking.harvard.edu/data.shtml you can download the IDEA data set, which covers 1991-2002 (or something close to that), and there is also a nice little extraction program that can pull out the India-Pakistan dyad. That's your best bet. There is also an India set for 1987-1997 on our site.

    91. Question: I am in the process of gleaning event data from about 10,000 pages of Russian language text, and I am enthusiastically looking for an electronic way to do it. I was wondering if you might be able to point me in the direction of a program capable of working in Russian, or which might be relatively easily adapted to do so (the text which I have is rather formulaic in nature).
    New York, USA, March 2001

    Answer: The best place to check for this is William Evans's list of content analysis software (http://www.gsu.edu/~wwwcom/content/csoftware/software_menu.html). I attended several presentations on content analysis software at the recent sociological methodology meetings in Cologne, and I'm almost positive that some of those systems could handle Russian. It is certainly the case that there are a number of systems designed for languages other than English.

    92. Question: I would like to know if the Levant data is the most appropriate to study terrorism for the CAMEO project. Additionally, are you aware of any other terrorism data available?
    USA, October 2003

    Answer: It sort of depends on how specific you need the data to be. In terms of tracking the incidents of uses of violence over time (and generally assigning blame, albeit not always to specific groups), the CAMEO set is quite good. In terms of details on particular incidents (and also distinguishing acts of terrorism from, say, clashes between two armed groups), it isn't so good. As for other data sets, it depends on whether you are interested in specific acts of "terrorism", or acts of violence in the context of other political and criminal behavior. The primary terrorism data set that most people use is one by the RAND Corp, though there are probably some others as well. The US Department of State also has "Patterns of Global Terrorism" reports that would provide a quick source of data for the 1990s (they are all on the web; just use Google with the title).

    93. Question: I have an undergraduate student who is looking at two different kinds of conceptions of development and she would like to test this hypothesis on more cases (Belize, Ecuador, Haiti, Romania, Sweden -- a random sample) over a several year time frame. I thought that KEDS might be a good source for finding fine-grained conflict data, but it seems that she would have to develop her own dictionary for each country and download the Reuters news. Any ideas?
    USA, February 2000

    Answer: Three ideas come to mind:
    1. There is the PANDA data set, which is already coded and can be downloaded, and covers 1982-1995 or something like that (source is Reuters). I've heard mixed things about the quality of the data, however -- it is apparently better for some areas than others. But it covers all countries. There is a link from our web page.
    2. There is Rod Tomlinson's extension of the WEIS data set, which goes to about 1992. Again, all countries, and the coverage is fairly consistent, but it is based on the New York Times. I've got a copy and can send it.
    3. To the extent that COPDAB has been continued, it is through the GEDS project. They have mostly been doing data development for the State Failures Project, and they might have some of those countries (though not, I suspect, Sweden). I'm not sure what the availability of their data is, but again, we've got a link to it. John Sislin (formerly worked for Fred Pearson) just took over as director of the project.

    94. Question: I am interested in using an automated content analysis program to analyze the US administration's discourse. In particular, I want to look at how the administration subsumed the Iraqi issue under a loss frame (in accordance with Kahneman and Tversky's experiments on framing in their development of prospect theory), and what effects this might have had on proclivity to accept the military invasion of Iraq by the US public. Any advice concerning which content analysis software would be most helpful?
    USA, March 2004

    Answer: What you probably need is a general-purpose "thematic" content analysis system. If you go to http://web.ku.edu/keds/other.html and go down to the "Other Links" section, the first two links (William Evans and Harald Klein) will take you to web sites that have links to a variety of different software packages (both academic and commercial). The other thing that you will need to check into is the source of the text, which needs to be in some type of machine-readable form. Different programs use different formats; if you go to http://web.ku.edu/keds/filters.html we've got some automated downloading programs that work with NEXIS and a few other sources; these may not be exactly what you need but they might give you an idea of what is involved.

    95. Question: I'm using probit/logit models and following Ward's work on reciprocity to determine whether a number of key conflictive actors (the US, Russia, Egypt, and Israel) are actually more accommodative than is usually presumed. Among other things, I'm estimating the threshold values beyond which these states cease accommodating. The preliminary evidence suggests that the US in particular but also Russia and Egypt actually absorb considerable dyadic conflict behavior. It would be nice to have more global coverage for the post-1978 period but using KEDS to cover key dyads (US to Russia, Egypt, Syria, etc.) has been a big help. The people at Maryland have offered to send me the updated COPDAB data but (as you know) the new coverage is sketchy and I'm reluctant (perhaps unjustifiably) to work with WEIS.
    USA, June 1999

    Answer: Have you looked at the PANDA data set? I haven't used it, and I've heard mixed evaluations of it (I get the impression it is better on some regions than others), but it is global (albeit I think it just does 1981 to 1995 or some such). As for WEIS vs COPDAB, it seems increasingly clear to me (based on my own studies and those of some other folks) that subtle differences in event coding make almost no difference in most analyses -- the noise and censoring (usually for journalistic, rather than political, reasons) in the news reports completely swamps any subtle signals one is trying to get. I did one systematic study that showed that the simple existence of events (irrespective of content) accounted for almost 50% of the explanatory power I was getting.

    96. Question: I am working with a colleague on an NSF proposal to create a small arms database. It would be in the form of event data, tracking the flow and use of small arms. We would want to use TABARI. We would need to add a new dictionary of small arms terms. We would need to modify TABARI to search for these terms in the full text. I would like TABARI to pull out the sentences which have a small arms term in it. The time period is (at this point) 1970-1999. Can this be done? What resources (e.g. programmers) would I need to do it? How long would it take to do this?
    USA, July 2000

    Answer: Probably. An arms transfer is an event described by a transitive verb, so TABARI (or KEDS) should be able to code it. A lot depends on exactly what your source text looks like. But I would guess that most of the vocabulary is very specific so it should be relatively easy to code. As for the dates, this will be a problem unless you've got access to a text source that goes back that far. Reuters will go back to 1979, but it has once again disappeared from NEXIS, so you would have to contract with them directly. Agence France Presse is available on NEXIS, but only goes back to 1991. Well, if you've got a good programmer, they could probably make all of the changes you need in less than a month. What you really need is dictionary development more than programming. Once you've got the source text downloaded and reformatted, I can't imagine it would take more than a semester. Downloading and reformatting are often time-consuming (but painless) once you get it working correctly, but getting to the point where it does work correctly is easy for some projects, insurmountably difficult for others.

    97. Question: Do you know if there is anywhere I can get a copy of the WEIS coder's manual? We are working through Reuters stories, developing a scheme to parse them that we hope can translate into our program. Although we are using WEIS, I'd be glad to switch to another scheme. But right now I can't find a scheme with all of the following: (a) the coder's manual, (b) event codes, and (c) an associated cooperation/conflict scale. Am I missing something?
    Texas, USA, July 2004

    Answer: The only official WEIS manual that I know of is the ICPSR codebook (which is limited). However, Rod Tomlinson had a much more extensive version -- I've got a copy and could send you a photocopy of that, but I'm not sure what the status of it is (i.e. I don't know if Tomlinson considers it open for quotation). I've also got another unofficial manual from a private-sector project in the mid-1980s that was done by one of McClelland's students. The situation is just as bad on COPDAB, though the extensions that John Davies made for GEDS I think would be much more complete -- I think he has a fairly extensive manual, and used something similar to the Azar-Sloan scale. IDEA and CAMEO would have all of the components you are looking for, I think (if not now, then shortly -- there is work on an IDEA scale, and I think they have a complete manual on the web). We've got an extensive CAMEO manual and Dale Thomas's scale.

    98. Question: I'm the CEO of an advertising agency that does work for many major companies like Hallmark Cards, Pizza Hut, Sprint, GMAC and other Fortune 500 companies. I have a theory that trends in consumer values are often revealed in the news media. But it's been difficult to find a methodology to apply to this theory. Do you think KEDS might be adaptable to such a use?
    USA, July 1999

    Answer: I've been surprised that no one is doing this already. I see bits and pieces of reports of projects around the country and I know there are a few such projects -- though mostly in the financial sector -- but I haven't seen anything in advertising. As you undoubtedly are aware, the newspapers for virtually all major urban areas are available on Nexis, and actually most of them have web-based versions as well. I don't know whether these electronic versions carry "soft" news -- helpful homemaking hints and the like -- but if they do, you should be able to find a lot of useful information on consumer trends there. KEDS is not ideal for the task, but it would probably work reasonably well. KEDS is optimized for coding "who did what to whom" information, which occurs in subject-verb-object form, whereas I would guess that you would want either straightforward content analysis -- simply counting phrases, e.g. how often "cilantro" shows up in Peoria versus Phoenix -- or else likes and dislikes. The first is very easy to code; the second is more difficult. KEDS has a content-analysis module in addition to the event coding.
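
    To give a sense of how little is involved in the simple phrase-counting case, here is a minimal Python sketch. It is not the KEDS content-analysis module and makes no claim about how that module works; it assumes the stories have already been downloaded and tagged with the city of the newspaper, and the function name and sample texts are invented.

```python
import re
from collections import Counter


def phrase_counts(stories, phrase):
    """Count whole-word, case-insensitive occurrences of phrase per source.

    stories is an iterable of (source, text) pairs, e.g. ("Peoria", "...").
    """
    pattern = re.compile(r"\b%s\b" % re.escape(phrase), re.IGNORECASE)
    counts = Counter()
    for source, text in stories:
        counts[source] += len(pattern.findall(text))
    return counts


if __name__ == "__main__":
    # Made-up texts, for illustration only.
    stories = [
        ("Peoria", "Cilantro sales rose. Local cooks like cilantro."),
        ("Phoenix", "Nothing about herbs in this story."),
        ("Peoria", "More on cilantro this week."),
    ]
    for source, n in sorted(phrase_counts(stories, "cilantro").items()):
        print(source, n)
```

    Coding likes and dislikes is the harder problem, since it depends on who is expressing the attitude and toward what, which word counts alone do not capture.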

    99. Question: I have thus far found that LEXIS-NEXIS searches of the NYT show increasingly frequent first paragraph references to "foreign" OR "international" OR "war" AND "public opinion" by decade (333 in the '70's; 813 in the '80's and 1513 in the '90's). Yet as I'm sure you know, the problems with such searches are many (e.g., LEXIS-NEXIS uses abstracts prior to 1980, and nothing at all prior to that). So would this be a project with which KEDS (or a similar data set) might be helpful?
    Ohio, USA, November 2000

    Answer: Unfortunately, KEDS won't be able to help you here unless you've got an independent source of news stories. In other words, our system is also dependent on NEXIS indexing -- we download material from NEXIS, then code it. If you want to explore the stories that NEXIS has found, I'm guessing that a general-purpose content analysis program might serve your purposes better than KEDS. KEDS is optimized for subject-verb-object (i.e. who did what to whom) coding, whereas I'm guessing that you may be more interested in the issue being discussed (which tends to occur after the object in a news report). There are links to a number of content analysis sites at Bill Evans' content analysis web site: http://www.gsu.edu/~wwwcom/content/csoftware/software_menu.html.

    100. Question: I am working at a company and the goals for the "Intelligence and Threat Analysis" function of the company are not ambitious -- we want to provide our personnel and managers early warning of potential risk that is based on a coherent, rational, and systemic process of information analysis. Perhaps simply identifying indicators that have been historically important in driving political and economic events would suffice. I would appreciate any of your thoughts on how I may direct my research, or ideas on finding work in the field that may save time or avoid developing a new process. My goal is to integrate an academic process into a practical model that may have applicability in real world situations.
    USA, September 2000

    Answer: We've got a new, and much faster version of the KEDS program ("TABARI") that is open-source code and runs on the Linux and Windows platforms as well as the Macintosh. Information about this is on the web site. A few other sources you might want to check out: First, see if you can get a copy of the book Preventive Measures: Building Risk Assessment and Crisis Early Warning Models (edited by John Davies and Ted Gurr; publisher is Rowman and Littlefield; ISBN # 0-8476-8874-7) -- this is a very good summary of almost all of the government and academic early warning projects that I know of in North America, Western Europe and the UN. Second, take a look at the web site of the "Forum for Early Warning and Early Response" (FEWER -- http://www.fewer.org/) -- this has links to quite a few other projects, mostly with an NGO orientation. Finally, there is actually a commercial venture called VRA that is using techniques similar to ours (it spun off of a Harvard-based project that collaborated for a number of years with KEDS). The contact for this would be Doug Bond; his email is vra@mediaone.net. I'm not clear how far along they are in developing early warning models, but they've got an excellent system for generating data.

    101. Question: I have decided to take the plunge and buy a used Mac to run KEDS on. Do you have any advice as to what models would be good? I went to the surplus store on campus and they have Mac II and Centris models, and they say they sometimes get PowerPCs or Powermacs (I can't remember which one). What should I get? I want to make sure that if I go to the data lab and find a Mac, I will be able to download all the required software (KEDS program, filters, etc.) onto floppy disks so that I could then upload them into my Mac (which is probably too old to connect to the internet). I just want to make sure before I buy a Mac, that I will be able to get the software on it.
    USA, April 2000

    Answer: It will run on either model, but the Centris is more recent. The Mac II model was introduced in 1987, if I remember correctly. The Centris is early 1990s. But KEDS will run on either -- in fact it was primarily developed on a Mac II. The program and the dictionaries fit comfortably on a floppy disk (at least when they are compressed -- the compressed program is only about 150K). The text records might be more problematic -- it is *possible* to do this on floppy disks but it is a bit of a pain. Prior to getting things networked, I usually moved files by just shifting an external hard disk between various machines. You might want to experiment a bit first with machines in a lab to make sure that you can get everything working (particularly the text filtering, which is the step that delays most people) before investing in a Mac. (On filtering: another approach that apparently works is MS-Word macros and/or Visual Basic. I think that this is the approach the PANDA project used.)