Event Data in Popular Media
Mona Chalabi of 538’s Datalab took a stab at reporting on the mass kidnappings executed by Boko Haram in Nigeria. It’s a very popular story at the moment, and Datalab, 538, and the larger collection of trendy new news entities run by geeks offer a data-driven approach to news and analysis. But kidnappings, like most things in social science, are exceedingly difficult to find or collect data on.
The modern, popular approach to dealing with things that are hard to measure is to use event data. The idea behind event data is to parse the contents of large quantities of news reports and use machine learning to turn each article’s text into structured records of raw data.
The most available source of event data at the moment is the Global Database of Events, Language, and Tone, or GDELT. So it’s only natural that Chalabi should turn to GDELT to address this topic.
Anyone who’s worked with GDELT at any reasonable length (or read its foundational ISA paper) will recognize the exponential growth in GDELT entries. This occurs not because world events are multiplying, but because the volume of digitally available news, GDELT’s raw material, has grown exponentially over time. Jay Ulfelder observed this error on Twitter:
Wow, initial version of this 538 piece made huge rookie mistake in use of GDELT data http://t.co/6ICxq6bQO9
Luckily, someone (probably Kalev Leetaru himself) reached out to Chalabi to steer the analysis back on course.
You rarely want to plot raw numbers of events, since global news volume available in digital form has increased exponentially over the past 30 years (it increases by a fraction of a percent each day), meaning the total universe of events recorded in the news and available for processing has increased exponentially over that time. Instead, you want to divide the total number of what you are after (kidnappings in Nigeria) by all GDELT events for a given time period to get a normalized intensity measure.
Update not much better. Still trying to infer trend in phenomenon from super-noisy trend in reporting on it.
GDELT is a notoriously noisy source of data. While I agree (with Ulfelder) that it’s a pretty tall claim to make that kidnappings are actually on the uptick, I’d like to emphatically add that it isn’t necessarily incorrect. GDELT is a powerful tool, and it may lead to some breakthroughs in the social sciences (its legal and academic challenges notwithstanding).
However, it’s critically important to remember that GDELT is powerful for its ability to show trends in broad strokes (months, not days) over long periods (decades, not years). Diluting the counts over time in this way diminishes the effects of a variety of errors, including the one Ulfelder highlighted. I hope Chalabi can take this lesson forward for future reporting using GDELT.
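The normalization Leetaru describes can be sketched in a few lines. The counts below are synthetic, purely for illustration (real values would come from GDELT queries): the point is simply that dividing topic-specific counts by all events in the same period strips out the exponential growth in overall news volume.

```python
# Hypothetical monthly counts, invented for illustration only;
# in practice both series would be pulled from GDELT.
kidnap_events = {"2014-01": 40, "2014-02": 55, "2014-03": 90}
all_events = {"2014-01": 200_000, "2014-02": 275_000, "2014-03": 450_000}

# Normalized intensity: events of interest divided by ALL recorded
# events in the same period. Raw counts rise with news volume;
# this ratio does not.
intensity = {
    month: kidnap_events[month] / all_events[month]
    for month in kidnap_events
}

for month in sorted(intensity):
    print(f"{month}: {intensity[month]:.6f}")
```

Note that the periods here are months, not days, in keeping with the broad-strokes advice above: coarser bins also average away much of GDELT’s day-to-day noise.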