Thoughts on data. (in progress)

1. The data summarizes and categorizes selected information on officeholders and candidates for the Senate, House, and executive office. This information includes, for example:

Legislative Votes

  • The politician's vote and the bill's content are summarized in the title.
  • A paragraph summary of the bill follows, sometimes with pro and con statements.
  • Cited with a link to the bill on THOMAS, a link to a vote tally on (that in turn links to the government tally), and the date.

Legislation Sponsorship

  • The legislation is summarized in the title.
  • A short summary of the bill follows.
  • Cited with a link to the bill on THOMAS.

Interest Group Ratings

  • Rating, group, and issue are stated in the title.
  • The rating and group are explained in a short paragraph.
  • Cited with a link to a page listing all ratings by the group in that year, and the date.

Direct Quotations

  • A sentence fragment is pulled from the quotation as the title.
  • At least one paragraph from the speech or interview follows.
  • Cited with the publication, page (if applicable), and date.
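
The four kinds of entry above share one shape (title, summary, citation), so a single record type may be enough to hold them. A minimal sketch, assuming nothing about the final schema; the class and field names are illustrative only, and the date is kept as a raw string because no date format is specified here:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PositionKind(Enum):
    """The four entry types described in section 1."""
    VOTE = "Legislative Votes"
    SPONSORSHIP = "Legislation Sponsorship"
    RATING = "Interest Group Ratings"
    QUOTATION = "Direct Quotations"


@dataclass
class Citation:
    text: str                          # full citation text as scraped
    link_title: Optional[str] = None   # filled only when the citation has a link
    link_url: Optional[str] = None
    date: Optional[str] = None         # raw string; no date format is assumed


@dataclass
class Position:
    kind: PositionKind
    title: str       # e.g. vote + bill content, or rating + group + issue
    summary: str     # paragraph summary, explanation, or quoted paragraph(s)
    citation: Citation


# Hypothetical record, for illustration only.
example = Position(
    kind=PositionKind.QUOTATION,
    title="Example title fragment",
    summary="Example paragraph from a speech.",
    citation=Citation(text="Example Gazette, p. 4, Jan 1, 2005", date="Jan 1, 2005"),
)
```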

2. Format

All content is invalid HTML. Its tag structure is regular across the site, so that e.g. a table of John McCain's positions on Abortion will have the same (invalid) tag structure as a table of Evan Bayh's positions on Foreign Policy.
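
Because the markup is invalid but regular, a lenient event-based parser is one way to read it without first repairing it. The sketch below uses Python's standard `html.parser` and treats each new `<tr>` or `<td>` as implicitly closing the previous one. The sample markup is made up for illustration; the site's real tag structure has not been checked here.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect table rows as lists of cell text, tolerating unclosed tags."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read
        self._cell = None   # text fragments of the cell being read

    def _close_cell(self):
        if self._cell is not None:
            # Collapse runs of whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()          # a new row implicitly ends the old one
            self._row = []
        elif tag in ("td", "th"):
            if self._row is None:      # cell without an enclosing <tr>
                self._row = []
            self._close_cell()         # a new cell implicitly ends the old one
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def table_rows(html):
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    parser._close_row()   # flush anything still open at end of input
    return parser.rows


# Hypothetical invalid markup: unclosed <td>, <tr>, and <b>.
sample = "<table><tr><td>Voted YES<td>2005<tr><td><b>Rated 100%</td></table>"
rows = table_rows(sample)
```

Since the tag structure is the same across politicians and issues, one collector of this kind could plausibly be reused for every issue page.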

3. Content Organization

Each politician has one front page listing the same 24 issue categories and the titles of all categorized positions. Each front page links to 24 issue pages listing the full positions as described above (title, summary, and citation).

Front pages for all politicians are located in:

  • (Senators, Governors, and non-officeholders),
  •[state-abbreviation]/ (Reps from large states),
  • (Reps from small states)

Issue pages are stored in:

  •[issue-category]/ (Senators),
  •[state-abbreviation] (Reps from large states)
  • (Reps from small states)
  •[year]/ (for non-officeholder candidates for office in that year)

In addition, direct quotations from interviews, debates, speeches and questionnaires are listed (separately and redundantly) on These are not indexed.

Note that the subdomain is a trick; no matter which subdomain you use, you reach the same content.
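
One consequence for a crawler: the same page can be reached under several hostnames, so URLs should probably be normalized to a single host before they are de-duplicated. A minimal sketch, in which `example.org` and `www.example.org` are placeholders and not the site's real domain:

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url, site_domain, canonical_host):
    """Map any subdomain of site_domain to canonical_host; drop the fragment.

    URLs on other sites keep their host. Any port is dropped.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host == site_domain or host.endswith("." + site_domain):
        host = canonical_host
    return urlunsplit((parts.scheme, host, parts.path, parts.query, ""))


def dedupe(urls, site_domain, canonical_host):
    """Return canonical URLs in first-seen order, without duplicates."""
    seen = set()
    unique = []
    for url in urls:
        key = canonical_url(url, site_domain, canonical_host)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Placeholder hosts and paths, for illustration only.
crawled = [
    "http://senate.example.org/a/page.htm",
    "http://house.example.org/a/page.htm",
    "http://other.net/a/page.htm",
]
unique = dedupe(crawled, "example.org", "www.example.org")
```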

4. Citations

Links to THOMAS and primary publications should be collected wherever available, with both link-title and link-url fields. The site sometimes cites information from a second page that summarizes the data, links to the primary source, and adds other content. In this case, the link to the second page should be crawled for the primary data link, then discarded. This way our data refers only to primary sources and removes the middleman.

When a citation does not contain a link, the entire text of the citation should be scraped as the reference.
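
The two rules above (collect link-title and link-url when links exist, otherwise keep the whole citation text) can be sketched as follows. This is an illustration under assumptions, not a finished scraper: the field names and the sample fragments are made up, and the step of following a secondary page to its primary link is not covered.

```python
from html.parser import HTMLParser


def squeeze(text):
    """Collapse runs of whitespace to single spaces."""
    return " ".join(text.split())


class LinkCollector(HTMLParser):
    """Collect <a href> links and the full text of a citation fragment."""

    def __init__(self):
        super().__init__()
        self.links = []     # dicts with link_title and link_url
        self.text = []      # every text fragment, linked or not
        self._href = None
        self._title = []

    def _finish_link(self):
        if self._href is not None:
            self.links.append(
                {"link_title": squeeze("".join(self._title)), "link_url": self._href}
            )
            self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._finish_link()   # tolerate a missing </a>
                self._href = href
                self._title = []

    def handle_endtag(self, tag):
        if tag == "a":
            self._finish_link()

    def handle_data(self, data):
        self.text.append(data)
        if self._href is not None:
            self._title.append(data)


def parse_citation(fragment):
    parser = LinkCollector()
    parser.feed(fragment)
    parser.close()
    parser._finish_link()   # flush a link left open at end of input
    if parser.links:
        return {"links": parser.links, "reference": None}
    # No link: the entire citation text becomes the reference.
    return {"links": [], "reference": squeeze("".join(parser.text))}


# Hypothetical fragments, for illustration only.
with_link = parse_citation('Reference: <a href="http://example.org/bill">Example Act</a>; on Jan 1')
without_link = parse_citation("Source:  Example Gazette, p. 4,\n Jan 1")
```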

The date of each citation should be scraped as a separate field.


Votes on bills are cited in the format: "Reference: bill name; Bill THOMAS-link; vote number ontheissues-link on date"

Sponsorship of bills is cited in the format: "Source: source-title ontheissues-link on date"
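
If the citation is first flattened to plain text, the two formats above can be split into fields with regular expressions, which also gives the date as a separate field. A rough sketch, assuming the link text sits where the formats show the link placeholders and that the date is whatever follows the last " on "; the sample strings are invented and the date is left as a raw string:

```python
import re

# "Reference: bill name; Bill THOMAS-link; vote number ontheissues-link on date"
VOTE_RE = re.compile(
    r"^Reference:\s*(?P<bill_name>.*?);\s*Bill\s+(?P<bill>.*?)\s*;\s*"
    r"vote number\s+(?P<vote>.*)\s+on\s+(?P<date>.+)$"
)
# "Source: source-title ontheissues-link on date"
SPONSOR_RE = re.compile(r"^Source:\s*(?P<source>.*)\s+on\s+(?P<date>.+)$")


def parse_reference(text):
    """Split a flattened citation into fields; fall back to the raw text."""
    text = " ".join(text.split())   # normalize whitespace first
    for kind, pattern in (("vote", VOTE_RE), ("sponsorship", SPONSOR_RE)):
        match = pattern.match(text)
        if match:
            return {"kind": kind, **match.groupdict()}
    return {"kind": "unparsed", "reference": text}


# Invented examples, for illustration only.
vote = parse_reference("Reference: Example Act; Bill H.R.1 ; vote number 2005-001 on Jan 1, 2005")
sponsor = parse_reference("Source: Example Resolution on Trade on Feb 2, 2006")
other = parse_reference("Example Gazette, p. 4")
```

The greedy match before " on " is deliberate: a title that itself contains " on " still leaves only the trailing part as the date.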

changed March 2, 2012