Access Event Data in Google BigQuery From R

Arnon Kohavi January 5, 2018 0 Comments
Access Event Data in Google BigQuery From R

Access Event Data in Google BigQuery From R

I spent a good portion of this past spring working with GDELT. Unfortunately, a perfect storm of legal troubles and methodological criticisms this spring turned GDELT from the rising star of Computational Social Science into the ugly behemoth we’d all prefer not to discuss. I stopped updating my copy of GDELT and refocused my efforts to other projects.

So when Google Announced that GDELT would be hosted for free on BigQuery, I was only mildly interested (read: didn’t care). GDELT exhibits some ideal characteristics for a demonstration of high-speed data processing: It’s small enough that you could play with it on a well-equipped consumer computer, but large enough (currently ~100GB) that doing so is a terrible idea.

Some of the queries I worked with, for example, would occupy hours of computing time on a fairly powerful machine. When I tried the some of the same queries on the BigQuery Console, it took ~10 sec, an improvement of three orders of magnitude.

But then I happened to be looking over the dplyr documentation and noticed that it had BigQuery integration, and I became slightly more interested. dplyr wraps the API from BigRQuery (one of Hadley‘s other projects). The effect of this is that we can integrate complex operations exceedingly large data sources into R workflows with little overhead.

Getting BigRQuery set up isn’t the most straight-forward project, but it isn’t too hard either. From the BigRQuery README:

If you just want to play around with the bigquery API, it’s easiest to start with the Google’s free sample data. To do that, you’ll also need to create your own project for billing purposes. If you’re just playing around, it’s unlikely that you’ll go over the 10,000 request/day free limit, but google still needs a project that it can bill (you don’t even need to provide a credit card).

To create a project:

  • Open
  • Click “Create Project” at the top
  • Select a name and project ID, and click “Create”
  • Turn on the BigQuery API by clicking “APIs & Auth” on the left, scrolling down to “BigQuery API”, and clicking the button at the right from “OFF” to “ON”.
  • Click on “Overview” at the left
  • Either the Project ID or Project Number at the top can be used to identify your project withbigrquery.

From there, you can access GDELT through the BigQuery Browser Console. Theoretically, you should be able to access it through the API at this point, but I couldn’t get that working. Instead, I just made a copy of the events table and imported it into my own project.  To do that:

  • Click on the dropdown arrow next to your project’s name in the left pane.  Select “Create a New Dataset”
  • For Simplicity’s sake, just call it “GDELT”
  • Under “gdelt-bq:full” in the left pane, click the dropdown arrow next to “events” and click on “Copy table”
  • Copy the table to your project, to the GDELT dataset, and (again, for simplicity’s sake) call it “events”
  • You can access the data in R and play with it in dplyr.  

Warning: This could cost you some money! Though the limit on free requests is 10,000/day, the limit on free data processed is 1TB/month.  Since GDELT is 100GB, that means you can run approximately 10 free queries on the complete GDELT database monthly.  Use them well!  (If you need more, 5TB is only $5 USD).Congratulations! In just a few minutes, you have gained access to hyper-fast querying on a hundred-gigabyte dataset right within R.

While it doesn’t look like GDELT has much of a future in the academic community, perhaps this model (being hosted on Google BigQuery) is a valuable way for future massive, machine-coded datasets to be made accessible to the masses (I’m looking at you, Phoenix!).

After all, any tool that reduces computation time by 3 orders of magnitude and fits seamlessly into the R environment we all already use is too valuable a prospect to ignore.

Leave a Reply

Your email address will not be published. Required fields are marked *