GDELT’s Unit of Analysis Problem 1

Arnon Kohavi February 12, 2019 0 Comments
GDELT's Unit of Analysis Problem 1

GDELT’s Unit of Analysis Problem 1

In the course of the Event Data community’s reaction to Mona Chalabi’s use of GDELT to analyze events in Nigeria, some of us realized that a fundamental problem with GDELT is that its unit of analysis is unclear. This is a huge no-no.

The Unit of Analysis in a well-formed dataset is represented by a single row of data.  Some combinations of the columns of that row will generate a predictable unique identifier.  For example, International Relations data tends to represent countries across years.

 Hence we say that the Unit of analysis is the Country-year.  The should only be one row in such a dataset for Afghanistan in 1990, so we might create a unique identifier called AFG1990, and rest easily in the confidence that it will be unique.

Each row in GDELT represents one discrete event in theory.  GDELT takes many “events” from many articles, aggregates them into a groupings it believes represent the same event, and builds the data based on this grouping.  The problem is, it does this badly.  

GDELT’s Unit of Analysis Problem 1

We simply cannot trust the mysterious overlord program that builds and updates the data to actually code an event correctly and then figure out it reported by dozen other “events” it has already coded and cluster them together.  One of Chalabi’s errors was to trust this claim of GDELT’s at face value.

The way that we mitigate this problem is by saying that GDELT measures reporting coverage of events.  This is true in a sense–reporting is one of the many filters through which the real, true information must be distilled before it comes to us in raw-data form.  However, this line of thinking seduces us into believing that each GDELT row constitutes a single story about an event.  

This is wrong for two reasons: Firstly, GDELT doesn’t translate one article into one event.  It translates every sentence in an article into an event. Secondly, it collapses many events that it codes as the same (i.e. same location, same time, same actors, same action) into a single “event.”

But if we cannot trust these datapoints to represent real, meaningful, discrete events, we can at least use GDELT to measure the reporting of events.  How then can we recover this mysterious “reporting” variable which seems to be the only marginally trustworthy product of GDELT?  Well, GDELT provides 58 fields for each row, two of which are “NumMentions”, “NumSources”, and “NumArticles”.  

NumMentions – This is the total number of mentions of this event across all source documents. Multiple references to an event within a single document also contribute to this count. This can be used as a method of assessing the “importance” of an event: the more discussion of that event, the more likely it is to be significant. The total universe of source documents and the density of events within them vary over time, so it is recommended that this field be normalized by the average or other measure of the universe of events during the time period of interest.

NumSources – This is the total number of information sources containing one or more mentions of this event. This can be used as a method of assessing the “importance” of an event: the more discussion of that event, the more likely it is to be significant. The total universe of sources varies over time, so it is recommended that this field be normalized by the average or other measure of the universe of events during the time period of interest.

NumArticles – This is the total number of source documents containing one or more mentions of this event. This can be used as a method of assessing the “importance” of an event: the more discussion of that event, the more likely it is to be significant. The total universe of source documents varies over time, so it is recommended that this field be normalized by the average or other measure of the universe of events during the time period of interest.

f you desire to employ GDELT to analyze reporting covering of something (for example, Kidnappings in Nigeria between 1-1-1979 and Now), take these steps:

  1. Query the data you want from GDELT.
  2. For your temporal unit of interest (days at minimum, probably years at maximum), sum any of the three variables listed above.
  3. Take a logarithm (any logarithm–it doesn’t really matter that much) of your counts.  This will smooth out the natural exponential growth in reporting over the time period GDELT records.  This isn’t the most nuanced handling, but it’s sufficiently effective for a “big picture” analysis.

That will give you a pretty good measure of reporting coverage of your topic of interest.

Leave a Reply

Your email address will not be published. Required fields are marked *