Tasks for Volunteers

We've gotten a number of people offering to volunteer and asking if there's something they can work on, so I've set up this wiki page to keep track of the projects you can volunteer for and who's currently claimed each one.

If none of these appeal to you, you can:

  1. email Aaron (me@aaronsw.com) and tell him what sort of thing you'd be interested in, and he'll try to think of something appropriate

  2. join our volunteers mailing list and wait for us to email with a request

If you are interested, take a look at our coding standards and add your name next to the task you want to work on.


FEC electronic filings

being looked at by simonbc

The FEC provides data in two ways: FTP dumps of fixed-width text files covering everyone, and CSV records of the campaign filings that were made electronically. We handle the first, but a lot of good data is in the second, and we need a parser for it:

http://www.fec.gov/finance/disclosure/efile_search.shtml
http://www.fec.gov/elecfil/eFilingFormats.zip [ZIP]
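
To give a sense of the shape we're after, here's a minimal sketch; the form types and column lists in it are illustrative stand-ins, since the real layouts vary by form type and format version and live in eFilingFormats.zip:

    import csv

    def parse_fec_file(path):
        """Yield one dictionary per record in an e-filed .fec file."""
        # Illustrative layouts only: the real column lists, which vary by
        # form type and format version, come from eFilingFormats.zip.
        layouts = {
            'HDR': ['record_type', 'ef_type', 'fec_version', 'software'],
            'SA':  ['form_type', 'filer_id', 'entity_type', 'name', 'amount'],
        }
        with open(path, newline='') as f:
            for row in csv.reader(f):
                if not row:
                    continue
                # Schedule rows like 'SA11A1' share the base 'SA' layout.
                key = row[0] if row[0] in layouts else row[0][:2]
                cols = layouts.get(key)
                if cols:
                    yield dict(zip(cols, row))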

historical voting data

Robert Vanderbei has collected county-level historical voting data here:

http://watchdog.net/data/crawl/manual/rvdb/

It'd be great to import this so people can see how their counties have voted over the years.

lobbying

being looked at by Drew and Jeremy Dunck

Lobbyists -- love them or hate them, they're obviously an important part of our government, which is why we want watchdog to track them. The Senate provides an XML database of lobbyist information, and OpenSecrets provides advice on making sense of it. As usual, we want a Python parser that reads the files and emits dictionaries of the important information.
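
Something roughly like the sketch below, which assumes the dump is a flat file of <Filing> elements carrying their data as attributes, with nested <Registrant> and <Client> elements; check the actual schema before trusting any of that:

    import xml.etree.ElementTree as ET

    def parse_lobbying(path):
        """Yield one dictionary per lobbying filing."""
        # iterparse streams the file, which matters for very large dumps.
        for event, elem in ET.iterparse(path):
            if elem.tag == 'Filing':
                record = dict(elem.attrib)
                for child in ('Registrant', 'Client'):
                    node = elem.find(child)
                    if node is not None:
                        record[child.lower()] = dict(node.attrib)
                yield record
                elem.clear()  # free the element once we've emitted it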

on the issues

The website On The Issues collects a vast number of positions and quotes from various political figures. Unfortunately, it doesn't seem to be stored in a database but just as plain HTML pages, so it's kind of hard to parse. Still, it's a wealth of useful information -- it would be great if someone could work on a parser for it.
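
A scraping sketch might start like this; the tag structure (h3 headings followed by p quotes) is pure guesswork, since it's hand-written HTML and you'll have to inspect real pages and adjust the selectors:

    import urllib.request
    from bs4 import BeautifulSoup  # or any HTML parser you prefer

    def scrape_issue_page(url):
        """Return {'position': ..., 'quote': ...} dicts from one page."""
        html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(html, 'html.parser')
        records = []
        # The h3-then-p structure here is a guess; adjust after looking
        # at the actual markup.
        for heading in soup.find_all('h3'):
            quote = heading.find_next('p')
            records.append({
                'position': heading.get_text(strip=True),
                'quote': quote.get_text(strip=True) if quote else None,
            })
        return records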

SEC

being looked at by AaronSw

The SEC keeps track of all publicly-traded companies and their major executives, owners, and investors. Unfortunately, it's all XML wrapped in SGML and not very easy for people to get at. We want to parse out the key data and load it up so that people can explore corporate structures better. The database is called EDGAR and you can get it over FTP. GovTrack has developed some C# code that parses it, and the output is on archive.org. As usual, we want Python dictionaries with the key data so we can import them into a SQL DB.
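
Here's a sketch of the extraction step, assuming the ownership-filing layout where the XML document sits between <XML> tags inside the SGML wrapper; the element names follow the Form 4 ownership schema but should be double-checked:

    import re
    import xml.etree.ElementTree as ET

    def parse_edgar_filing(path):
        """Extract key fields from one EDGAR ownership filing."""
        with open(path) as f:
            text = f.read()
        # The XML payload sits between <XML> tags in the SGML wrapper.
        m = re.search(r'<XML>\s*(.*?)\s*</XML>', text, re.DOTALL)
        if m is None:
            return None
        root = ET.fromstring(m.group(1))
        def first(tag):
            node = root.find('.//' + tag)
            return node.text if node is not None else None
        # Element names below follow the Form 4 schema; verify them.
        return {
            'issuer': first('issuerName'),
            'cik': first('issuerCik'),
            'owner': first('rptOwnerName'),
        }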

trademarks

The US Government provides documentation of their trademark data XML format and some sample data on the USPTO website. Unfortunately, it's massively complicated. Your task, should you choose to accept it, is to figure out how to make sense of all this and write a Python script that goes through the XML files and returns dictionaries containing all of the important information. Remember, we'll want to integrate this with the SEC database above.
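
The same streaming approach as the lobbying parser should work here; the element names below are plausible guesses at the USPTO schema and need verifying against the sample data:

    import xml.etree.ElementTree as ET

    def parse_trademarks(path):
        """Yield one dictionary per trademark case file."""
        for event, elem in ET.iterparse(path):
            if elem.tag == 'case-file':  # guessed element name
                yield {
                    'serial': elem.findtext('serial-number'),
                    'mark': elem.findtext('.//mark-identification'),
                    'owner': elem.findtext('.//party-name'),
                }
                elem.clear()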

extract more almanac data

Modify the code in almanac.py (see http://watchdog.net/code/?p=dev.git;a=tree;f=import/parse) to extract more data, like presidential election voting history.

census

being looked at by Trevor

The US Census has an enormous amount of information about each Congressional district. Unfortunately, it's all in a very confusing format. If you can decode their complicated CSV files and output standard Python dictionaries full of interesting facts about each Congressional district, we'd be much obliged.

Here's some Perl code from GovTrack that parses some of it.
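
In Python, the end product probably looks something like this; the column codes below are made up, and the real ones (plus wherever the headers actually live) come from the census documentation:

    import csv

    def parse_sf3(path, fields):
        """Yield a facts dictionary for each row of a summary-file CSV."""
        # `fields` maps raw census column codes to friendly names, e.g.
        # {'HC01_VC03': 'pct_never_married'} -- these codes are invented;
        # 'GEO_ID' is likewise a guess at the geography column.
        with open(path, newline='') as f:
            for row in csv.DictReader(f):
                yield {
                    'geo_id': row.get('GEO_ID'),
                    'facts': {nice: row.get(code)
                              for code, nice in fields.items()},
                }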

Ideas:

SF3-DP2: % never married, % divorced, % >= some college, % >= college degree, % professional degree, % foreign born, % speak foreign language, % veterans

Unknown: armed forces personnel, crimes, inmates, voting age population, voting age population by race, registered voters, political party identification, mortality rates

public schools

being looked at by zack

The National Center for Education Statistics (NCES) has a wide variety of information on the country's public schools. In particular, their Common Core of Data will let you calculate things like dollars per pupil, student-to-teacher ratio, dropout and graduation rates, etc. It'd be great to get this data into the database so we can see how different districts stack up.
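
A sketch of the ratio calculations, with stand-in column names (LEAID, MEMBER, FTE, TOTALEXP) and a tab delimiter that will all need matching against the actual CCD layouts:

    import csv

    def district_stats(path):
        """Compute the derived ratios we care about for each district."""
        with open(path, newline='') as f:
            for row in csv.DictReader(f, delimiter='\t'):
                pupils = float(row['MEMBER'] or 0)
                teachers = float(row['FTE'] or 0)
                yield {
                    'district': row['LEAID'],
                    'dollars_per_pupil':
                        float(row['TOTALEXP']) / pupils if pupils else None,
                    'student_teacher_ratio':
                        pupils / teachers if teachers else None,
                }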

Meanwhile, SchoolDataDirect.org also has a lot of data [click-thru license], "including student performance data, No Child Left Behind data, school environment data, financial data, community demographic data and analytical ratios". It might be easier to just get the data from here, especially the NCLB and other student performance data that NCES doesn't seem to publish.

environmental toxins

Scorecard aggregates a whole series of data about toxins in your neighborhood, but their interface is terrible. It'd be great to get this data and integrate it into watchdog so you could, say, see where your district stands on the pollution scale.

I'm no expert on this (perhaps we should call someone at Environmental Defense), but it looks like the NEI database is the place to start.

The data comes as Access databases, so we'll need an Access DB parser.
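
One route that avoids parsing the .mdb format ourselves: shell out to mdb-export from the open-source mdbtools package, which dumps a table as CSV:

    import csv
    import io
    import subprocess

    def access_table(db_path, table):
        """Yield dictionaries from one table of an Access database."""
        out = subprocess.run(
            ['mdb-export', db_path, table],  # mdb-export prints CSV
            capture_output=True, text=True, check=True,
        ).stdout
        yield from csv.DictReader(io.StringIO(out))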

mortality data

The National Center for Health Statistics seems to have a lot of data. In particular, they have a database of every death record -- it'd be great to have the top causes of death for each district.
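
Once the raw records are parsed, the per-district rollup is simple; this sketch assumes each record comes out as a dict with 'district' and 'cause' keys, however the parser ends up producing them (the real files code causes as ICD codes you'd want to translate into readable names):

    from collections import Counter, defaultdict

    def top_causes(records, n=5):
        """Top-n causes of death per district."""
        by_district = defaultdict(Counter)
        for r in records:
            by_district[r['district']][r['cause']] += 1
        return {d: c.most_common(n) for d, c in by_district.items()}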

ICPSR

ICPSR has data sets on congresspeople and congressional districts covering some interesting topics:

http://www.icpsr.umich.edu/cocoon/ICPSR/STUDY/00011.xml
http://www.icpsr.umich.edu/cocoon/ICPSR/STUDY/03371.xml
http://www.icpsr.umich.edu/cocoon/ICPSR/STUDY/00013.xml

textual analysis

We've got speech data for every rep, and while I'm sure some people might want to read through those speeches, a lot of people would rather let a computer do that for them. If we could extract some fun things from the text, like Amazon does from book scans, that would be great. I'm thinking of (a rough sketch of the first follows the list):

  • favorite/least favorite phrases: the top 5 phrases the rep uses more than the average rep
  • most/least unique phrases: phrases used by a rep with the fewest uses by other people
  • defining words: phrases that best pick out a rep in some kind of statistical analysis
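
For the favorite-phrases idea, a crude sketch: score each trigram by a smoothed ratio of the rep's usage to everyone's, with fancier statistics (log-odds, TF-IDF) as the obvious refinement:

    from collections import Counter

    def ngrams(words, n=3):
        """All length-n runs of words."""
        return zip(*(words[i:] for i in range(n)))

    def favorite_phrases(rep_speeches, all_speeches, top=5):
        """Phrases a rep uses most relative to the overall baseline."""
        rep = Counter(ngrams(' '.join(rep_speeches).lower().split()))
        everyone = Counter(ngrams(' '.join(all_speeches).lower().split()))
        # Crude ratio with +1 smoothing to keep rare phrases from
        # dominating; refine once this produces plausible output.
        def score(phrase):
            return rep[phrase] / (everyone[phrase] + 1)
        return sorted(rep, key=score, reverse=True)[:top]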

SurveyUSA

SurveyUSA.com publishes approval rating numbers for every Senator. It'd be nice to have a crawler and a parser that extracts these.

Here's some Perl code from GovTrack that does this.

Jonathan Holst: I have begun looking into this. I am not sure, though, what numbers I should look at. The best I have been able to come up with is http://www.surveyusa.com/50StateTracking.html, with some sort of re.match('Senate Approval'). Can anyone confirm, or am I completely off track?

AaronSw: Nope, you've got it.
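
Building on that, a sketch of the scrape; the regex is a guess at what the rows look like and will need tightening against the live HTML:

    import re
    import urllib.request

    def senate_approval(url='http://www.surveyusa.com/50StateTracking.html'):
        """Scrape senator approval numbers from the tracking page."""
        html = urllib.request.urlopen(url).read().decode('latin-1')
        # Guesses at rows like 'Senate Approval ... Smith ... 52%'.
        pattern = re.compile(
            r"Senate Approval\D*?([A-Z][a-zA-Z.' -]+?)\D*?(\d{1,2})%")
        return [{'senator': name.strip(), 'approval': int(pct)}
                for name, pct in pattern.findall(html)]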

maps

being looked at by Groby

Play around with Mapnik (http://mapnik.org/) or PostGIS to see if you can get it generating overlay maps for congressional districts. You can get the boundaries from the census at http://www.census.gov/geo/www/cob/cd110.html

The maps code we're currently using calls out to this Perl code on someone else's server (http://razor.occams.info/code/repo/?/viz/wms).

I was thinking it'd be nice to have it in Python on our server so that we could do more with it than just show one district at a time.

A second step would be loading in actual geographic data so we could replace Google Maps entirely -- one problem with Google Maps is that there's no way to generate static images, so pages with a lot of district maps take a very long time to load.
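
For the Mapnik route, rendering a static image can be quite small once you have a stylesheet; this sketch assumes a hypothetical districts.xml pointing at the census boundary shapefiles linked above:

    import mapnik

    # districts.xml is an assumed Mapnik XML stylesheet referencing the
    # census congressional-district shapefiles.
    m = mapnik.Map(800, 500)
    mapnik.load_map(m, 'districts.xml')
    m.zoom_all()
    mapnik.render_to_file(m, 'district_overlay.png', 'png')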

load PVS data

being worked on by Pradeep Gowda

We've crawled a large amount of data from Project Vote Smart and stored it in JSON dumps. We also have Project Vote Smart IDs for most politicians thanks to GovTrack. Now all we need is a simple script that uses that information to line up and import the data from Project Vote Smart.
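
A sketch of the lineup step, assuming the dumps hold one JSON object per line with a 'candidateId' field (the actual field names may differ):

    import json

    def pvs_records(dump_path, pvs_to_pol):
        """Pair up crawled PVS records with our politician IDs."""
        # `pvs_to_pol` is the PVS-id -> politician-id map via GovTrack;
        # 'candidateId' is a guess at the key in the crawled JSON.
        with open(dump_path) as f:
            for line in f:
                rec = json.loads(line)
                pol = pvs_to_pol.get(rec.get('candidateId'))
                if pol is not None:
                    yield pol, rec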


Archive: volunteer/completed