In the aftermath of Hurricane Irene, I was trying to get information from my local electric cooperative about outages. There were many (including my neighborhood) and I wanted to see the scale of the problem. It turns out they have a page with a map that shows current outages by ZIP code.
It’s pretty old-school as far as web maps go, but it gets the job done. Their day job is making electricity, not web maps, so I won’t critique it too much. One thing I did notice is that the map seems to be dynamically generated (as do the tables on the page) from some inaccessible data source. I searched and tried to find some kind of feed, to no avail.
The data on this page is ideal for an RSS feed which could be consumed by any of the local news portals, online mapping sites, and other outlets that may be used by the public. Yet, there is no feed. Here is an example of useful information locked away behind an uninformed design decision. The organization has already made the decision to publish this information so using RSS or social media would not expose anything more than what is already being released.
It makes me wonder about the scale of this problem. How much more information is being produced in relatively inaccessible forms by otherwise well-intentioned organizations? In this case, the information is being produced as an HTML page, so we can always scrape and republish the information, which is exactly what I did. The resulting feed can be found here:
http://demo.zekiah.com/smecofeed/smeco_outage.xml
The feed is simple: the ZIP code is in the item title and the number of households affected is in the item description (by itself, with no other decoration). Since ZIP codes are fairly standard, this makes it easy to consume the feed and do other things with it, such as map it on GeoCommons. This map may seem redundant, but now the data can be layered with other data sets such as shelter locations, ice distribution centers and the like, making it more useful.
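If you want to do something with the feed yourself, pulling the ZIP/household pairs back out takes only a few lines. Here is a minimal sketch using nothing beyond the Python standard library; the feed URL is the one above and the element names follow the layout just described:

[sourcecode language="python"]
# Minimal sketch of a feed consumer; assumes the RSS layout described above
# (ZIP code in <title>, household count in <description>).
import urllib2
import xml.etree.ElementTree as ET

FEED_URL = 'http://demo.zekiah.com/smecofeed/smeco_outage.xml'

xml_text = urllib2.urlopen(FEED_URL).read()
root = ET.fromstring(xml_text)
for item in root.findall('./channel/item'):
    zip_code = item.findtext('title')
    households = item.findtext('description')
    print '%s: %s households affected' % (zip_code, households)
[/sourcecode]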
To produce this feed, I used Python. Dave Bouwman pointed me to Beautiful Soup, and I also made use of the ScrapeNFeed library (which in turn uses PyRSS2Gen). I have it set up on a cron job to update every two hours and dump a new XML file (a sample crontab entry follows the code below). I decided this was preferable to linking directly back to their page because I’m unsure how robust their server is. I am posting my code below in the event that someone else needs to do this. Be warned that this type of approach is fragile: you’ll see from the code that it depends heavily on the structure of the source HTML, so if the page structure changes, the feed will break. This is obviously not ideal, so it’s best to view it as a band-aid.
I suspect that there’s a lot of this kind of thing going on. Where you find it, it’s best to engage with the organization to help make it better and that’s my next step here. There’s been a lot of talk about open data in our industry for a while, along with a lot of activity. Situations like this make me realize the scale of the work yet to be done. It will take a lot of effort to open up data all the way down the line and, perhaps, even more effort to help organizations understand why it is beneficial to do so in the first place. But it’s work that needs to be done.
As promised, here’s the Python code should anyone find it useful:
[sourcecode language="python"]
from BeautifulSoup import BeautifulSoup
import re
import urllib2
from PyRSS2Gen import RSSItem, Guid
import ScrapeNFeed

class SmecoFeed(ScrapeNFeed.ScrapedFeed):
    def HTML2RSS(self, headers, body):
        soup = BeautifulSoup(body)
        # The outage numbers live in the fourth table on the page
        table = soup.findAll('table')[3]
        rows = table.findAll('tr')
        items = []
        # The last row of the table is skipped
        for index in range(len(rows) - 1):
            row = rows[index]
            cols = row.findAll('td')
            if len(cols) > 0:
                # First cell is the ZIP code, second is the number of
                # households affected; strip stray spaces from the cell text
                zip = cols[0].string
                zip = zip.replace(' ', '')
                tot = cols[1].string
                tot = tot.replace(' ', '')
                # This link is not real. It will simply take you to the homepage.
                lnk = 'http://www.smeco.coop#' + zip
                items.append(RSSItem(title=zip, description=tot, link=lnk))
        self.addRSSItems(items)

SmecoFeed.load("Current SMECO outages (as scraped by Zekiah Technologies)",
               'http://outage.smeco.coop',
               "Current SMECO power outages by ZIP code",
               'smeco_outage.xml',
               'smeco_outage.pickle',
               managingEditor='bill@zekiah.com (Bill Dollins)')
[/sourcecode]
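As mentioned above, scheduling the scraper is just a cron job. Something along these lines would run it every two hours; the paths and script name here are hypothetical, so adjust them to wherever the script and its output actually live:

[sourcecode language="bash"]
# Hypothetical paths: run the scraper every two hours, on the hour.
# The script writes smeco_outage.xml and smeco_outage.pickle to its working directory.
0 */2 * * * cd /var/www/smecofeed && /usr/bin/python smeco_feed.py
[/sourcecode]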
Nice work, Bill. Beautiful Soup really is one of the killer Python packages; it almost makes scraping fun.
Thank you. Beautiful Soup was a lifesaver; I’m glad Dave pointed me to it. My feet are getting a little steadier with Python the more I make myself use it.
Arguably, that is the lowest level of open data openness... Question: is it just government that is pushing open data, or are companies (firms) also pushing in that direction?
Cheers,
Jw.
Agreed. This approach just barely opens the data more. HTML is obviously open but RSS is more consumable for other uses.
I don’t know if I’d characterize the drive for open data to be a “push” as much as a “pull” right now. I think there’s increasing demand from various entities that want to consume data for other applications. Government entities, mostly Federal, are responding to varying degrees.
There also seems to be an uneven understanding of what open data means. Many organizations view nice Flex maps on the web as “open” even though they provide no real access to data. It’s a big problem that’s getting better slowly and unevenly.
Wow. That page is BEYOND old school. I checked the source, and the number of customers affected is hardcoded into the HTML. It says the page is updated every 15 minutes (I waited and checked, and this appears to be the case), so I can only assume that the HTML is being rewritten (probably manually) every quarter hour. It’s almost beautiful – like watching the scoreboard on the Green Monster.
It’s hard to imagine someone hand-jamming that every 15 minutes. If you go to the main site for the coop and hover over the “outages” link, it points to a PHP page so I think it is actually dynamically generated. At least I hope so…
But, like I said, their day job is to make electricity so I’ll go easy on them.