Coding Standards

Parsing

Most of our data sets come in antiquated, baroque, overdesigned formats. When parsing it, you have two tasks: first, to fully understand the format and make sure you're processing it accurately, and second, to pick out the actually interesting data from the vast morass that's made available. All this can be done without any programming -- it just requires looking at the documentation and the files themselves (although programming often helps decode the files).

But once you've figured things out, you'll want to follow our coding standards to hand off the results.

Python

We strongly, strongly prefer that code be written in Python, like the majority of our stuff. And if you're processing XML, we prefer that you use the built-in pulldom library. (But we're more flexible about that.)

Your code should emit standard Python dictionaries (or web.storage objects) and should be structured so that there's a function parse_whatever that returns a generator over all the various dictionaries (i.e. it calls yield with each dictionary).

Other

If for some reason you can't use Python, please use something that will run on all standard UNIXes. Instead of yielding dictionaries, have your code emit JSON objects. (Our preference is for JSON objects with sorted keys and an indent of two.) Output the list of JSON objects by using netstrings. (def netstring(x): return str(len(x)) + ':' + x + ',') So an example output would be:

18:{
  "monkeys": 5
},17:{
  "donuts": 2
},

Both

Either way, there are some basic principles to follow:

Emit one dictionary per thing. Let's say you're parsing a table of crime statistics by county. You'll want to emit one dictionary for each county, with keys for its name, state and any other identifiers, and all of the interesting crime data. So maybe:

{"county_name": "Lake County", "state": "Illinois", "homicides": 25}
{"county_name": "Broward County", "state": "Florida", "homicides": 350}

Use identifiers for keys:

Don't do: {"First name": "Robert Blankenfield"}
Instead do: {"first_name": "Robert Blankefield"}

Nest dictionaries as needed:

Don't do: {"author": [2008, 7]}
Instead do: {"author": {"birth_year": 2008, "books": 7}}

Preserve all IDs. Make sure not to lose any identifiers; we'll need those to integrate with the rest of the database later.

Use standard formats. For dates, please use W3CDTF (YYYY-MM-DD).

Example. For an example of some code that follows all the coding standards, look at import/parse/govtrack.py.

If you have any more questions, feel free to ask on the mailing list.

changed March 2, 2012 delete history edit