Prying Data Open

In the aftermath of Hurricane Irene, I was trying to get information from my local electric cooperative about outages. There were many (including my neighborhood) and I wanted to see the scale of the problem. It turns out, they have a page with a map that shows current outages by zip code.

It’s pretty old-school as far as web maps go but it gets the job done. Their day job is making electricity, not web maps, so I won’t critique it too much. One thing I did notice is that the map seems to be dynamically generated (as do the tables on the page) from some inaccessible data source. I searched for some kind of feed, to no avail.

The data on this page is ideal for an RSS feed which could be consumed by any of the local news portals, online mapping sites, and other outlets that may be used by the public. Yet, there is no feed. Here is an example of useful information locked away behind an uninformed design decision. The organization has already made the decision to publish this information so using RSS or social media would not expose anything more than what is already being released.

It makes me wonder about the scale of this problem. How much more information is being produced in relatively inaccessible forms by otherwise well-intentioned organizations? In this case, the information is being produced as an HTML page, so we can always scrape and republish the information, which is exactly what I did. The resulting feed can be found here:

http://demo.zekiah.com/smecofeed/smeco_outage.xml

The feed is simple: the ZIP code is in the item title and the number of households affected is in the item description (by itself with no other decoration). Since ZIP codes are fairly standard, it is easy to consume the feed and do other things with it, such as map it on GeoCommons. This map may seem redundant but now the data can be layered with other data sets such as shelter locations, ice distribution centers and the like, making it more useful.
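To give a sense of how little a consumer has to do, here is a minimal sketch in Python 3 (standard library only) that turns a feed of this shape into a dictionary keyed by ZIP code. The sample XML and its values are invented for illustration, and the function name `outages_by_zip` is my own; the only thing taken from the real feed is the layout, with the ZIP code in the title and the bare count in the description.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the shape described above: ZIP code in <title>,
# bare household count in <description>. The counts are made up.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Current SMECO outages</title>
    <link>http://outage.smeco.coop</link>
    <description>Current SMECO power outages by ZIP code</description>
    <item><title>20619</title><description>125</description></item>
    <item><title>20650</title><description>40</description></item>
  </channel>
</rss>"""


def outages_by_zip(feed_xml):
    """Return a {zip_code: households_affected} dict from the feed XML."""
    root = ET.fromstring(feed_xml)
    outages = {}
    for item in root.iter("item"):
        zip_code = (item.findtext("title") or "").strip()
        count_text = (item.findtext("description") or "").strip()
        try:
            outages[zip_code] = int(count_text)
        except ValueError:
            # Skip any item whose description is not a plain number.
            continue
    return outages


if __name__ == "__main__":
    for zip_code, households in sorted(outages_by_zip(SAMPLE_FEED).items()):
        print(zip_code, households)
```

From a dictionary like this, joining against ZIP code boundaries or centroids for mapping is a simple lookup.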

To produce this feed I used Python. Dave Bouwman pointed me to Beautiful Soup and I also made use of the ScrapeNFeed library (which makes use of PyRSS2Gen). I have it set up on a cron job to update every two hours and dump a new XML file. I decided this was preferable to doing a direct link back to the page because I’m unsure how robust their server is. I am posting my code below in the event that someone else needs to do this. This type of approach is very fragile. You’ll see from the code that it’s very dependent upon the structure of the source HTML. So, if the page structure changes, the feed will break. This is obviously not ideal so it’s best to view it as a band-aid.
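One way to soften that fragility is to validate what comes out of the table before publishing it, so that a layout change makes the cron job fail loudly instead of quietly writing an empty or garbage feed. The following is only a sketch in Python 3, not part of my script: the helper name `clean_rows`, the five-digit ZIP check and the handling of thousands separators are all assumptions about what a sensible row looks like.

```python
import re

# Assumed shapes for a sensible row: five-digit ZIP, non-negative integer count.
ZIP_PATTERN = re.compile(r"^[0-9]{5}$")
COUNT_PATTERN = re.compile(r"^[0-9]+$")


def clean_rows(raw_rows):
    """Keep only (zip, count) pairs that look valid; raise if none do.

    raw_rows is a list of (zip_text, count_text) tuples as pulled from the
    table cells, before any cleanup. Either value may be None.
    """
    cleaned = []
    for zip_text, count_text in raw_rows:
        # Treat non-breaking spaces like ordinary padding.
        zip_code = (zip_text or "").replace("\xa0", " ").strip()
        count = (count_text or "").replace("\xa0", " ").strip()
        count = count.replace(",", "")  # tolerate "1,250"
        if ZIP_PATTERN.match(zip_code) and COUNT_PATTERN.match(count):
            cleaned.append((zip_code, int(count)))
    if not cleaned:
        # Nothing looked like outage data: the page layout probably changed.
        raise ValueError("no valid outage rows found; page layout may have changed")
    return cleaned
```

With a check like this in front of the feed writer, header rows and totals rows fall out naturally, and a redesigned page produces an error in the cron log rather than a misleading feed.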

I suspect that there’s a lot of this kind of thing going on. Where you find it, it’s best to engage with the organization to help make it better and that’s my next step here. There’s been a lot of talk about open data in our industry for a while, along with a lot of activity. Situations like this make me realize the scale of the work yet to be done. It will take a lot of effort to open up data all the way down the line and, perhaps, even more effort to help organizations understand why it is beneficial to do so in the first place. But it’s work that needs to be done.

As promised, here’s the Python code should anyone find it useful:

[sourcecode language="python"]
from BeautifulSoup import BeautifulSoup
from PyRSS2Gen import RSSItem
import ScrapeNFeed


class SmecoFeed(ScrapeNFeed.ScrapedFeed):

    def HTML2RSS(self, headers, body):
        soup = BeautifulSoup(body)
        # The outage table is the fourth <table> on the page.
        table = soup.findAll('table')[3]
        rows = table.findAll('tr')
        items = []
        # Skip the last row of the table.
        for row in rows[:-1]:
            cols = row.findAll('td')
            # Rows without <td> cells carry no data.
            if len(cols) > 0:
                # Strip spaces from the cell text.
                zip_code = cols[0].string.replace(' ', '')
                tot = cols[1].string.replace(' ', '')
                # This link is not real. It will simply take you to the homepage.
                lnk = 'http://www.smeco.coop#' + zip_code
                items.append(RSSItem(title=zip_code, description=tot, link=lnk))
        self.addRSSItems(items)


SmecoFeed.load("Current SMECO outages (as scraped by Zekiah Technologies)",
               'http://outage.smeco.coop',
               "Current SMECO power outages by ZIP code",
               'smeco_outage.xml',
               'smeco_outage.pickle',
               managingEditor='bill@zekiah.com (Bill Dollins)')
[/sourcecode]


Published by Bill Dollins