Dictionaries

CountryInfo

CountryInfo.txt is a general-purpose file intended to facilitate natural language processing of news reports and political texts. It was originally developed to identify states for the text filtering system used in the development of MID4, then extended to incorporate CIA World Factbook and WordNet information for the development of TABARI dictionaries.

The file contains about 32,000 lines covering about 240 countries and administrative units (e.g. American Samoa, Christmas Island, Hong Kong, Greenland). It is internally documented and almost, but not quite, XML: the major fields are delimited with tags of the form <tag>...</tag>, but the elements inside them are delimited with line feeds. Converting this to strict XML would be a relatively simple programming exercise for anyone who should be working with the file in the first place. The file is UTF-8 with Unix line feeds and will need to be converted if used on a Windows system.
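As a rough illustration of the layout described above, here is a minimal Python sketch that reads tagged fields whose inner elements are separated by line feeds, plus a helper that converts Unix line feeds for use on Windows. The sample text, the tag names (CountryName, Nationality) and the function names are assumptions made up for this example; check the documentation inside CountryInfo.txt for the real field names and for any nesting, which this sketch does not handle.

```python
import re

# Hypothetical sample in the style described above: major fields are tagged,
# and the elements inside a field are separated by line feeds.
# These tag names are illustrative, not taken from CountryInfo.txt itself.
SAMPLE = """<CountryName>EXAMPLELAND</CountryName>
<Nationality>
EXAMPLELANDER
EXAMPLELANDIC
</Nationality>
"""

# Matches <tag>...</tag>, allowing the body to span several lines.
FIELD = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def parse_fields(block):
    """Return {tag: [element, ...]} for a flat (non-nested) block of fields."""
    fields = {}
    for tag, body in FIELD.findall(block):
        # Elements inside a field are one per line; drop blank lines.
        fields[tag] = [line.strip() for line in body.splitlines() if line.strip()]
    return fields


def unix_to_windows(text):
    """Convert LF line endings to CRLF without doubling existing CRLFs."""
    return text.replace("\r\n", "\n").replace("\n", "\r\n")


if __name__ == "__main__":
    print(parse_fields(SAMPLE))
```

When reading the real file in Python, opening it with encoding="utf-8" should be enough on any platform, since text mode handles the line feeds on input.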

Fields include

  • Country name in English
  • Adjectival forms and synonyms of the country name, including some non-English versions of the name
  • ISO-3166 numeric, alpha2 and alpha3 codes, FIPS-10 code, IMF code, COW alpha and numeric codes
  • Capital city
  • Cities with populations over 1 million
  • Regions and geographical features (WordNet meronyms)
  • Leaders, 1960-2008 (rulers.org)
  • Members of government, 2003-2010 (CIA World Leaders)

NOTE: I'm gradually transitioning to using GitHub as a primary repository, so you should check https://github.com/philip-schrodt/CountryInfo-1 for possibly more recent versions.

CountryInfo.140728.txt has been archived at dataverse.harvard.edu with the persistent URL http://dx.doi.org/10.7910/DVN/NBPRDW

Download CountryInfo.120116.txt (.zip) [Updated 16-Jan-2012]

Download CountryInfo.140728.txt (.zip) [Updated 28-Jul-2014]
Revision which has TABARI-style date restrictions on countries which became independent in the post-1989 period (mostly former Soviet Union and former Yugoslavia) plus some additional small code corrections.

Download CountryInfo.120116.actors (.zip), a TABARI-formatted .actors file extracted from CountryInfo.120116.txt [Updated 6-Jan-2012]

Download CountryInfo perl utilities (.zip).
translate.countryinfo.pl is the Perl program used to extract the .actors file; it would also be a good starting point for doing additional work with the CountryInfo.txt format, though it does not accommodate the date-restricted country codes. format.rulers.org.pl attempts to convert rulers.org entries into CountryInfo format; it takes input either from a file or pasted in from the keyboard, and works with most standard entries. [Updated 6-Jan-2012]

Links to additional resources on names

The EU Joint Research Centre (those great folks who give us European Media Monitor) maintains a very large, multilingual list of political names at

http://langtech.jrc.ec.europa.eu/JRC-Names.html

This web site also links to an excellent paper on the technical challenges involved with name detection and resolution.

From Vincent Arel-Bundock:

The countrycode package for R includes a set of regular expressions which can be used to match country names in character strings to country codes.

http://cran.r-project.org/web/packages/countrycode/index.html
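The general idea of matching country names to codes with regular expressions can be sketched in Python as follows. This is not the countrycode package's code or its pattern set: the three patterns and the match_country function are assumptions written for illustration, and a real table needs far more care with variant and overlapping names.

```python
import re

# Illustrative (pattern, ISO-3166 alpha3 code) pairs; not copied from countrycode.
PATTERNS = [
    (re.compile(r"\bunited\s+states\b", re.IGNORECASE), "USA"),
    (re.compile(r"\bfrance\b|\bfrench\s+republic\b", re.IGNORECASE), "FRA"),
    (re.compile(r"\bgermany\b", re.IGNORECASE), "DEU"),
]


def match_country(text):
    """Return the code of the first pattern found in text, or None."""
    for pattern, code in PATTERNS:
        if pattern.search(text):
            return code
    return None


if __name__ == "__main__":
    for name in ["The United  States of America", "French Republic", "Atlantis"]:
        print(name, "->", match_country(name))
```

Because the first matching pattern wins, the order of the table matters whenever one country's name can appear inside another's.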

Also see the kountry Stata module by Rafal Raciborski:

http://ideas.repec.org/c/boc/bocode/s453301.html