What’s the deal with Kansas?

Arnon Kohavi, December 5, 2019
Dylan Matthews over at Vox recently issued a slight correction to an article of his about Pornhub’s look at the politics of its consumers (SFW, believe it or not).

Pornhub’s analysis suggested that porn viewership in Kansas dominated that of all other states at 194 pageviews per capita. This was a conspicuous outlier, as it bucked the trend that Democratic-leaning states tend to watch more porn than Republican-leaning states.

John Bieler bumped into a similar issue with protests. Either there’s a field in the middle of Nowhere, Kansas where people go to protest and watch lots of porn, or there’s a bug in the analysis.

When things like this happen in social science, it’s usually because of an error. Such things are always worth investigating. If it is an error, then you can fix it and strengthen your analysis. If it’s not an error, then you might have discovered something truly interesting.

In this case, it was certainly an error.  Which brings us back to Matthews at Vox and the correction. Pornhub analyzed IP-based geolocations.  In other words, it recorded the IP address of everyone who visited, and then ran the IP address through a geocoder.  Want to know where you are?

IP-based geolocation is a type of coarse geolocation–it’s not super-accurate.  Your phone can’t use it to figure out street directions, which is why you have a GPS chip.  GPS is called fine geolocation.  There are a bunch of IP addresses that are assigned to the United States.  

Some of those have no regional mapping, meaning that anyone using them could be anywhere in the lower 48, so far as the server is concerned. The server just records the IP and moves on. Then, when Pornhub wants to run its analysis, it passes that list of IPs to the geocoder. When the geocoder gets one of these IPs that can’t be placed anywhere more specific than “United States”, it returns the centroid of the country. Where do you think the centroid of the US is?


Now, attributing all this weird stuff to Kansas isn’t the geocoder’s fault. Geocoding a centroid in the absence of more information is a good strategy!
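To see how a sensible fallback turns into a Kansas-sized spike, here is a minimal sketch of a toy geocoder. Everything in it is an assumption made for illustration: the lookup tables are hypothetical, the IPs come from ranges reserved for documentation, and the centroid is only the approximate geographic center of the contiguous US. It is not how Pornhub's pipeline or any real geocoding service is implemented.

```python
from collections import Counter

# Approximate geographic center of the contiguous US (near Lebanon, Kansas).
# Rounded, illustrative value; a real geocoder may use a different point.
US_CENTROID = (39.83, -98.58)

# Hypothetical table of IPs that map to a specific place.
# These addresses are from reserved documentation ranges, not real users.
KNOWN_LOCATIONS = {
    "192.0.2.10": (40.71, -74.01),   # roughly New York City
    "192.0.2.11": (34.05, -118.24),  # roughly Los Angeles
}

# Hypothetical set of IPs known only to be "somewhere in the United States".
COUNTRY_ONLY = {"198.51.100.1", "198.51.100.2", "198.51.100.3"}


def geocode(ip):
    """Return (lat, lon) for an IP, falling back to the country centroid."""
    if ip in KNOWN_LOCATIONS:
        return KNOWN_LOCATIONS[ip]
    if ip in COUNTRY_ONLY:
        # No regional mapping: the best available guess is the centroid.
        return US_CENTROID
    return None  # IP is not in either table


def count_by_location(ips):
    """Count visits per geocoded point, skipping IPs we cannot place."""
    return Counter(loc for loc in map(geocode, ips) if loc is not None)


visit_log = [
    "192.0.2.10",
    "198.51.100.1",
    "192.0.2.11",
    "198.51.100.2",
    "198.51.100.3",
]
counts = count_by_location(visit_log)
print(counts.most_common(1))  # the busiest "location" in the log
```

Every country-only visitor lands on the same point, so a map built from these counts shows a hotspot in rural Kansas that no actual Kansan created.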

Imagine you have to predict a person’s height, but you know nothing about that person except which country they’re from.  Let’s say we know they’re from the United States.  Well, what’s the average height of a person in the United States?

Wikipedia breaks it down by gender, which makes sense: most adult men are a little taller than most adult women. But we don’t know the person’s gender. We could take the midpoint of the two averages, or we could build a population-weighted mean. But all of these solutions boil down to the same idea: take an average.
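If you want to see the population-weighted version spelled out, here is a small sketch. The heights and population shares are made-up placeholders chosen for round arithmetic, not real statistics, and `weighted_mean` is a helper written for this example.

```python
def weighted_mean(values, weights):
    """Average of values, with each value counted in proportion to its weight."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Illustrative placeholders only; look up real figures before using this.
avg_height_cm = {"men": 176.0, "women": 162.0}
population_share = {"men": 0.49, "women": 0.51}

groups = list(avg_height_cm)
best_guess = weighted_mean(
    [avg_height_cm[g] for g in groups],
    [population_share[g] for g in groups],
)
print(round(best_guess, 2))
```

With shares this close to even, the weighted mean sits very near the simple midpoint of the two group averages, which is why the choice between them rarely matters much for a guess like this.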

The centroid is the average of all coordinates in a country’s physical boundaries. So it’s arguably the best guess we can provide about an event’s location in the absence of more information. In a way, all geocoded addresses are centroids: the geocoder takes the most specific geographic identifier it has (Country, State, County, City) and returns the centroid of that geographic region.
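For the curious, "the average of all coordinates" inside a boundary can be computed directly from the boundary itself. The sketch below uses the standard shoelace formula for the centroid of a simple polygon. It assumes flat, planar coordinates and ignores the curvature of the Earth, so treat it as an illustration of the idea rather than something to run on real country borders. The rectangle is a made-up shape.

```python
def polygon_centroid(vertices):
    """Area centroid of a simple polygon given as a list of (x, y) vertices.

    Uses the shoelace formula. Vertices must be listed in order around the
    boundary (either direction), without repeating the first point at the end.
    """
    signed_area_x2 = 0.0  # twice the signed area
    cx = 0.0
    cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the polygon
        cross = x0 * y1 - x1 * y0
        signed_area_x2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if signed_area_x2 == 0:
        raise ValueError("polygon has zero area")
    # Centroid = sum / (6 * area), and area = signed_area_x2 / 2.
    return cx / (3 * signed_area_x2), cy / (3 * signed_area_x2)


# A made-up 4-by-2 rectangle standing in for a region's boundary.
region = [(0, 0), (4, 0), (4, 2), (0, 2)]
print(polygon_centroid(region))
```

For a rectangle the result is simply its middle, but the same function handles irregular outlines, where the centroid can differ noticeably from the plain average of the corner points.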

If you want to discuss a map with geocoded data on it, you’d better be certain you know about that! Otherwise, you’ll end up correcting your articles after publication. If you’re working with international geocoded data, Frank at Gothos has made your job easy by putting together a handy dataset with estimates for the centroids of every country. Go forth, and be wary of claims that Kansas is special!

