Rewired State Project: StateAware
I went to the Rewired State event yesterday to contribute to contribute code based on hacking government data sources and APIs. My project was called StateAware, and it aimed to collect, combine and enrich data through APIs and screen scraping.
I wanted StateAware to achieve two things:
- The government won’t make sensible APIs, so let’s do it for them
- Make people more aware of locally relevant government data
The project I worked hard on the project at the event, but it’s not even at the proof of concept stage yet. It was too ambitious for a one–day event.
Architecture
- StateAware is written in Rails
- It collects values from various APIs based on user input and serializes the data in a model called DataPoint
- If data matching DataPoints have been found recently, it doesn’t refetch it from a particular API (effectively caching it so APIs/sites don’t get hammered)
- DataPoints are grouped by DataGroups, APIs and Scrapers. DataGroups are a generic group that might appear in a user interface as categories or tags
Data collection
StateAware collects data through API and Scraper classes. The APIs are currently subclasses of an ActiveRecord model called Api. Scrapers are also models, but use a scraper DSL that should make it easier for people to contribute scrapers.
The Scraper DSL was just a quick one I knocked up for the prototype, but isn’t friendly enough yet.
Enriching data
I installed GeoKit with the aim of automatically geocoding data. I also wanted to collect Twitter search info about a particular DataPoint for its location. Another interesting search would be news items (perhaps from Google News or the BBC).
The first API I supported was TheyWorkForYou.com. I thought it would be interesting to see news links based on MPs and the things they’ve talked about.
I also tried to support UK flood warnings so I could have Twitter searches for people in those areas talking about the floods, but I got stuck trying to scrape their site.
Future expansion
- I didn’t do any work towards enriching data, but this could easily be added
- I designed an iPhone map that would show local data (that’s why a lot of the code refers to postcode searches currently)
- Each API and Scraper should define the datatype inputs it takes for relevant searching
- APIs and Scraper stubs need to including licensing details
- I started adding controllers that speak JSON and XML. The ultimate goal was trusted remote clients that can contribute data — this would help if a particular site blocks the IP of the scrapers
Download
Here’s what I had after 6 hours of hacking in The Guardian’s offices: