How to build a Geospatial Event Database
It occurred to me recently that it would be useful and interesting to be able to perform arbitrary geospatial querying on event data. The typical use case for querying event data geospatially is to filter on the Country (which has been a field in every event dataset with which I’ve worked), but this isn’t really “geospatial”. A true geospatial querying system enables you to specify arbitrary geographic constraints.
So, for example, let’s say I’m working with Phoenix and I’m interested in events that occurred within 50 miles of Atlanta, GA. The easy way to do this is to filter events on location name.
# library("devtools")
# install_github("ahalterman/phoxy")  # In case you don't have phoxy yet
library("dplyr")
library("phoxy")
download_phoenix("Phoenix")
events <- ingest_phoenix("Phoenix")
events %>% filter(LocationName == "Atlanta")
(Adapting the code for ICEWS uses a similar, albeit more memory-intensive process. That is left as an exercise for the reader.)
The problem with this approach is that it only gives you events that the geocoder identified as being in Atlanta. Luckily we have Latitudes and Longitudes, which offer a bit more precision. Latitudes and Longitudes can be mapped to approximate distances directly by hand (although this approach gets increasingly inaccurate for Longitudes as you move away from the equator).
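To convert between degrees and miles, we can do a little dimensional analysis. One minute of arc is one nautical mile, there are 60 minutes in an equatorial degree, and there are about 1.15 miles in a nautical mile. That makes a degree roughly 60 × 1.15 = 69 miles, so 50 miles is about 50 / 69 ≈ 0.725 degrees.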
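The conversion can be sketched in a few lines of R. This is a rough illustration, not a recommendation: it applies the equatorial figure to longitude as well, which is exactly the simplification the caveat above warns about. The `toy_events` data frame is made up purely to exercise the filter, and the `Lat` and `Lon` column names are assumed to match the event data.

```r
# Dimensional analysis: degrees -> nautical miles -> statute miles.
minutes_per_degree <- 60          # one minute of arc is one nautical mile
miles_per_nautical_mile <- 1.15
miles_per_degree <- minutes_per_degree * miles_per_nautical_mile

radius_miles <- 50
radius_degrees <- radius_miles / miles_per_degree   # about 0.725

# Atlanta's approximate coordinates, as used in the query later in this post.
atlanta <- c(lat = 33.755, lon = -84.39)

# A naive bounding box: the same offset in degrees in every direction.
bbox <- list(
  lat_min = atlanta[["lat"]] - radius_degrees,
  lat_max = atlanta[["lat"]] + radius_degrees,
  lon_min = atlanta[["lon"]] - radius_degrees,
  lon_max = atlanta[["lon"]] + radius_degrees
)

# Hypothetical toy events, invented only to show the filter working.
toy_events <- data.frame(
  name = c("downtown", "far_away"),
  Lat  = c(33.76, 36.16),
  Lon  = c(-84.40, -86.78)
)

in_box <- toy_events$Lat >= bbox$lat_min & toy_events$Lat <= bbox$lat_max &
  toy_events$Lon >= bbox$lon_min & toy_events$Lon <= bbox$lon_max

toy_events[in_box, ]
```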
Atlanta’s 50-mile bounding box
The problem is that the bounding box contains more space than you’re actually interested in. “Within 50 miles” implies a circle with its center at the center of Atlanta and a radius of 50 miles. If you actually queried on the box, you’d get results from Gainesville, GA (Northeast corner of the box), even though Gainesville is actually farther than 50 miles from Atlanta.
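To put a number on how far the corner overshoots, here is a minimal sketch using the haversine formula. It assumes a spherical Earth with a mean radius of 3958.8 miles, so it is itself only an approximation, and `haversine_miles` is a helper name made up for this illustration. With the 0.725-degree offset in both directions, the corner comes out at roughly 65 miles from the center.

```r
# Great-circle distance on a sphere (haversine formula), in statute miles.
# Assumption: mean Earth radius of 3958.8 miles; the real Earth is an ellipsoid.
haversine_miles <- function(lat1, lon1, lat2, lon2, radius_miles = 3958.8) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_miles * asin(sqrt(pmin(1, a)))
}

atlanta_lat <- 33.755
atlanta_lon <- -84.39
offset_deg  <- 0.725   # 50 miles expressed in equatorial degrees

# Distance from the center to the northeast corner of the box.
corner_miles <- haversine_miles(
  atlanta_lat, atlanta_lon,
  atlanta_lat + offset_deg, atlanta_lon + offset_deg
)

# Distance from the center to the eastern edge, for comparison: a degree of
# longitude is shorter at this latitude, so the edge is closer than 50 miles.
east_edge_miles <- haversine_miles(
  atlanta_lat, atlanta_lon,
  atlanta_lat, atlanta_lon + offset_deg
)

print(c(corner = corner_miles, east_edge = east_edge_miles))
```

So the box is wrong in both directions at once: too generous at the corners and too stingy along its eastern and western edges.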
At this point, we could write a SQL function to calculate the Euclidean distance between an arbitrary Lat, Lon and Atlanta. But we’re not going to do that, because A) a pretty good solution won’t assume the Earth is a flat, two-dimensional plane, or even a sphere, but an ellipsoid; B) there are a ton of mistakes we could make on the way to getting a pretty good solution; and C) this is a solved problem.
In particular, the PostGIS extension to PostgreSQL represents the best-in-class collection of algorithms for solving geospatial querying problems of this variety. So let’s set one up! We could do it locally, but what’s the fun in that? Plus, if you want to build an app to show off to the world, you really don’t want to host it off your laptop. So let’s do it from Amazon’s Relational Database Service.
The first step is to set up your Amazon AWS account. Next, create a new database using Amazon RDS. Select “PostgreSQL.” At Step 2, if you want to do this with your AWS Free Tier account (which I enthusiastically recommend), select “No”. Step 3 is a little more complicated, but should be manageable for the adventurous.
Set the DB Engine version to the latest available (presently 9.4.1), DB Instance Class to “db.t2.micro” (necessary for Free Tier), Multi-AZ deployment to “No” (again, necessary for Free Tier), and then enter an identifier, username, and password.
You’re going to need to remember your credentials. Be sure to copy those down somewhere. The defaults for Step 4 are all fine, but you’ll need to give your database a name. Copy that down too.
At this point, you can click the big blue button at the bottom of the screen and watch the computer spin its loading spinner for a few minutes. When it stops, your database will be provisioned and launched. Just copy down your endpoint where you wrote down all your other credentials, and you’ll be set to go.
Now that we have launched our database, we need to enable the PostGIS extensions. We can actually accomplish this inside of R. To start, let’s create a file I’m calling “creds.R”. The reason I propose you save this file (as opposed to merely running it) is that we’ll use it in several places. Infosec Wisdom here:
The less often you type in your security credentials while you’re developing, the fewer files you’ll end up having to keep secret when you share your code or deploy your app. This R script will only do two things: First, load up the RPostgreSQL package, and second, create a connection to your database in an object creatively named “con”. Copy and modify as appropriate:
library("RPostgreSQL")
con <- dbConnect(dbDriver("PostgreSQL"),
                 host = "your database's host",
                 port = "5432",
                 user = "your database's username",
                 password = "your database's password",
                 dbname = "your database's name")
Got that modified? Great. Save it to your R working directory. Now we can connect to the database with R. From there, we can submit the (astonishingly few) queries needed to get PostGIS running:
source("creds.R")
dbGetQuery(con, "CREATE EXTENSION postgis;")
dbGetQuery(con, "CREATE EXTENSION postgis_topology;")
dbGetQuery(con, "CREATE EXTENSION fuzzystrmatch;")
dbGetQuery(con, "CREATE EXTENSION postgis_tiger_geocoder;")
The database is now set up and configured! Now, to write the data into your database…
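A minimal sketch of that write, assuming the `events` object from the ingest step and a table named `phoenix` (the name the later queries use). `write_events` is a hypothetical wrapper introduced here for illustration; the actual work is done by `dbWriteTable`, which is available once “creds.R” has loaded RPostgreSQL.

```r
# Hypothetical helper: copy the events into a PostgreSQL table.
# Assumes RPostgreSQL (and with it DBI) has been loaded by creds.R.
write_events <- function(con, events, table_name = "phoenix") {
  # Coerce to a plain data.frame before handing it to dbWriteTable.
  dbWriteTable(con, table_name, data.frame(events))
}

# Run the write only when a connection and the events are in the session,
# i.e. after source("creds.R") and the ingest step earlier in this post.
if (exists("con") && exists("events")) {
  write_events(con, events)
}
```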
Notice how we wrap events in a data.frame. This is necessary because of a slight eccentricity of the RPostgreSQL package. Anyway, that will run for a little while, but once it returns, you’ll have all of your data in the database. To see some of it:
dbGetQuery(con, "SELECT * FROM phoenix LIMIT 20;")
To actually use this data, we’re going to need to learn a little bit of PostGIS. A full PostGIS tutorial is well outside of my intent, but the internet is full of helpful documentation and interesting how-tos. To test whether a thing is within a given distance of another thing, there’s a PostGIS function called “ST_DWithin”.
I say “thing” because there are generally four types of things in Geospatial Data: points (generally useful), linestrings (good for roads), polygons (good for geopolitical areas), and multi* (good for anything that’s fragmented, like describing the borders of the US since it has non-contiguous land areas). Events are simple: every event is a point. So, we’ll also need to use a little PostGIS function to make a Point Object in the database. They’ve helpfully called it “ST_MakePoint.” Here’s how we piece it all together:
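# ST_MakePoint takes (x, y), i.e. (longitude, latitude).
# On plain geometry points the 0.725 threshold is measured in degrees.
dbGetQuery(con, 'SELECT * FROM phoenix
  WHERE ST_DWithin(
    ST_MakePoint(-84.39, 33.755),
    ST_MakePoint("Lon", "Lat"),
    0.725);')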
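That query compares plain geometry points, so its distance threshold is in degrees. For a threshold in real-world units, PostGIS also has a geography type, for which `ST_DWithin` takes its distance in meters. The following is a sketch of that variant, not the query used for the count in this post, and its results will not necessarily match the degree-based ones. It assumes the same `phoenix` table with `Lat` and `Lon` columns, and `miles_to_meters` is a helper made up for the example.

```r
# Hypothetical helper: statute miles to meters (1 mile = 1609.344 m).
miles_to_meters <- function(miles) miles * 1609.344

radius_m <- miles_to_meters(50)

# Build the SQL. ST_MakePoint takes (longitude, latitude); ST_SetSRID tags
# the point as WGS 84 (SRID 4326) before the cast to geography.
query <- sprintf(
  'SELECT * FROM phoenix
   WHERE ST_DWithin(
     ST_SetSRID(ST_MakePoint(-84.39, 33.755), 4326)::geography,
     ST_SetSRID(ST_MakePoint("Lon", "Lat"), 4326)::geography,
     %.1f);',
  radius_m
)

cat(query, "\n")

# With a live connection from creds.R, the query would be sent like this:
# dbGetQuery(con, query)
```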
And there you have it! Phoenix presently contains approximately 400 events that happened within 50 miles of Atlanta. If you’ve never worked with Geospatial Querying before, you probably aren’t sold yet. After all, it was sort of a long way around for this payoff. This type of treatment for Political Event Data isn’t unprecedented.
In particular, Hammond and Weidmann put GDELT into a PostGIS database in order to tease out a capital-city bias in the geocoder. (They didn’t control for centroid bias, but that’s about two papers’ worth of criticism and methodology.) If you want to really feel the power, I’m going to need a little time to dream up an interface to demonstrate it. In the meantime, take a look at the PostGIS documentation and let me know if you have any interesting applications in mind.