
Initial source code release

  • Josh Tauberer
    Message 1 of 8, Jul 3, 2007
      Group,

      I've publicly posted some of the screen-scraping Perl scripts GovTrack
      uses to collect its data. I'll be adding my scripts to the Subversion
      repository below over time with the purpose of making it possible for
      others to maintain, extend, and improve them.

      The scripts are not released under an open source license for a variety
      of reasons, among which is the fact that open source licenses don't
      enforce the type of open information model I'd want the *fruits* of the
      scripts to be covered under. Since wide distribution of the scripts
      isn't necessary to benefit from them, an open source license would be
      toothless.

      The repository is browsable here:
      http://razor.occams.info/code/repo/?/govtrack/gather/us

      I don't know quite what it would take to make any of the scripts
      actually work for you, though. Certainly you'll need a 'data' directory
      in the right place, and possibly MySQL tables in the right place, Perl
      modules, etc. If you try to get them working and can't, or make
      progress, post any notes here, or better yet post patches to make set-up
      easier. Also, the code is heavily *un*commented, and well... that's just
      the way it will be.

      All patches are welcome.

      --
      - Josh Tauberer

      http://razor.occams.info

      "Yields falsehood when preceded by its quotation! Yields
      falsehood when preceded by its quotation!" Achilles to
      Tortoise (in "Gödel, Escher, Bach" by Douglas Hofstadter)
    • damianmont
      Message 2 of 8, Jul 5, 2007
        --- In govtrack@yahoogroups.com, Josh Tauberer <tauberer@...> wrote:
        > The repository is browsable here:
        > http://razor.occams.info/code/repo/?/govtrack/gather/us

        Thank you Josh as always.

        Any way you could just document what each file does?

        Here's the list:
        database.people.sql
        database.tables2.sql
        database.tables.sql
        db.pl
        general.pl
        indexing.pl
        parse_record.pl
        parse_rollcall.pl
        parse_status.pl
        personaldb.pl
        sql.pl
        util.pl

        (p.s. EXCELLENT article on XML.com... That's how I found your site)

        Also could you maybe send us more or less the pages you scrape?
        I'll probably get them from reading the script, but I'm a php guy, not
        perl...but a programmer is a programmer right? I should be able to
        figure it out.
      • Joshua Tauberer
        Message 3 of 8, Jul 5, 2007
          --- In govtrack@yahoogroups.com, "damianmont" <photoca@...> wrote:
          > Any way you could just document what each file does?
          >
          > Here's the list:

          Sure.

          > database.people.sql
          MySQL table schema and data for "people" table of all people that have
          served in Congress (name, birthday, etc.), and "people_roles" table
          for every role in Congress each person has served (role type
          (senator/representative), start/end date, party, etc.).

          > database.tables2.sql
          > database.tables.sql
          MySQL table schemas for other tables that are filled in by the
          scripts, mostly for indexing bills. The people tables are the only
          ones I edit by hand and are not automatically generated from some
          other source.

          You'll need to pipe these to mysql, or otherwise load them, for any of
          the scripts to work. (The indexing tables aren't strictly necessary if
          you disable the indexing routines one way or another, but the people
          tables are pretty critical for all of the parsing scripts.)

          > db.pl
          Utility script for opening the MySQL database.

          > general.pl
          Really old utility functions that I don't really use.

          > indexing.pl
          Subroutines that update the indexing MySQL tables based on the
          contents of a bill or vote file.

          > parse_record.pl
          Parses the Congressional Record from THOMAS.

          > parse_rollcall.pl
          Parses roll call pages from the House and Senate websites.

          > parse_status.pl
          Parses bill status pages from THOMAS.

          > personaldb.pl
          Converts the name of a representative into an ID. Considers a date, role
          type, and state/district info to disambiguate when a name is ambiguous.

          > sql.pl
          Utility functions for dealing with MySQL (preparing SQL statements
          programmatically).

          > util.pl
          A ton of utility functions used throughout.
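
For anyone who doesn't read Perl, the kind of lookup described for personaldb.pl above might look roughly like this in Python. This is only a sketch under assumptions, not the actual script: the `Role` record, the `find_person_id` function, and the sample rows are all hypothetical, and the real script may well match names and dates differently.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Role:
    """One row of a hypothetical roles table (field names are assumptions)."""
    person_id: int
    lastname: str
    role_type: str            # "sen" or "rep"
    state: str
    district: Optional[int]   # None for senators
    start: date
    end: date


def find_person_id(roles, lastname, when, role_type=None, state=None, district=None):
    """Return the single matching person ID, or None if zero or several match."""
    # Start with everyone of that last name who was serving on the given date.
    matches = [r for r in roles
               if r.lastname.lower() == lastname.lower() and r.start <= when <= r.end]
    # Narrow the candidates with whatever extra hints the caller has.
    if role_type is not None:
        matches = [r for r in matches if r.role_type == role_type]
    if state is not None:
        matches = [r for r in matches if r.state == state]
    if district is not None:
        matches = [r for r in matches if r.district == district]
    ids = {r.person_id for r in matches}
    return ids.pop() if len(ids) == 1 else None


# Made-up sample data: three different people who share a last name.
SAMPLE_ROLES = [
    Role(1, "Doe", "rep", "XX", 4, date(2007, 1, 4), date(2009, 1, 3)),
    Role(2, "Doe", "rep", "YY", 9, date(2007, 1, 4), date(2009, 1, 3)),
    Role(3, "Doe", "sen", "ZZ", None, date(2003, 1, 7), date(2009, 1, 3)),
]

if __name__ == "__main__":
    # The name alone is ambiguous; adding role type and state resolves it.
    print(find_person_id(SAMPLE_ROLES, "Doe", date(2007, 7, 1)))
    print(find_person_id(SAMPLE_ROLES, "Doe", date(2007, 7, 1), "rep", "XX"))
```

Returning None for both "no match" and "still ambiguous" keeps the sketch short; a real parser would probably want to tell those two cases apart.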

          > (p.s. EXCELLENT article on XML.com... That's how I found your site)

          Thanks!

          > Also could you maybe send us more or less the pages you scrape?
          > I'll probably get them from reading the script, but I'm a php guy, not
          > perl...but a programmer is a programmer right? I should be able to
          > figure it out.

          That's a long list. Maybe another time!

          - Josh
        • tay199
          Message 4 of 8, Jul 6, 2007
            YEAH! Thanks Josh. I'll keep the group posted on any patches or things
            of value we find and can contribute.

            Taylor
          • Kevin Henry
            Message 5 of 8, Jul 8, 2007
              Great, thanks Josh...

              I didn't do much scripting (proper) for http://www.whereabill.org/,
              but since Josh has started the ball rolling, I'll add in the one
              script I did write: an XSLT file that takes the XML version of Josh's
              people database
              (http://www.govtrack.us/data/us/110/repstats/people.xml) and extracts
              the people/roles relevant to a single specified Congressional session.

              The problem (for my purposes, which include sending this information
              to the browser client on each request) is that the people.xml file is
              large (6.3MB), and includes lots of dead people. :) So I use this
              script to get only the information for a particular Congress.

              Kevin


              bioseparate.xml:

              <?xml version="1.0" encoding="UTF-8"?>
              <xsl:stylesheet version='1.0'
                  xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
                  xmlns:exsl="http://exslt.org/common">
                <xsl:output method="xml" version="1.0" encoding="ISO-8859-1"
                    indent="yes"/>

                <!-- Number of the Congress to extract. Congress N covers the
                     years 2N + 1787 and 2N + 1788 (the 110th: 2007 and 2008). -->
                <xsl:param name="congress">110</xsl:param>

                <xsl:template match="people">
                  <xsl:copy>
                    <!-- Copy every person, keeping only the roles whose
                         start and end years span either year of the Congress. -->
                    <xsl:variable name="striprole">
                      <xsl:for-each select="person">
                        <xsl:copy>
                          <xsl:copy-of select="@*"/>
                          <xsl:copy-of select="role[(((number($congress)*2 + 1787) >=
                              number(substring-before(@startdate,'-'))) and ((number($congress)*2 +
                              1787) &lt;= number(substring-before(@enddate,'-')))) or
                              (((number($congress)*2 + 1788) >=
                              number(substring-before(@startdate,'-'))) and ((number($congress)*2 +
                              1788) &lt;= number(substring-before(@enddate,'-'))))]"/>
                        </xsl:copy>
                      </xsl:for-each>
                    </xsl:variable>
                    <!-- Output only the people left with at least one role. -->
                    <xsl:for-each select="exsl:node-set($striprole)/person[role]">
                      <xsl:copy-of select="."/>
                    </xsl:for-each>
                  </xsl:copy>
                </xsl:template>

              </xsl:stylesheet>
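
For readers who would rather not run an XSLT processor, the same filtering idea can be sketched in Python with the standard library. This is a rough equivalent under assumptions, not Kevin's script: the function names `congress_years` and `filter_people` and the sample document are made up, and only the `person`/`role` elements and the `startdate`/`enddate` attributes are taken from the stylesheet above.

```python
import xml.etree.ElementTree as ET


def congress_years(congress):
    """Return the two calendar years a Congress covers (110 -> 2007, 2008)."""
    first = congress * 2 + 1787
    return first, first + 1


def filter_people(people, congress):
    """Build a new <people> tree holding only roles active in that Congress."""
    years = congress_years(congress)
    out = ET.Element(people.tag)
    for person in people.findall("person"):
        # Compare by year only, like substring-before(@startdate, '-').
        kept = [
            role for role in person.findall("role")
            if any(int(role.get("startdate").split("-")[0]) <= year
                   <= int(role.get("enddate").split("-")[0])
                   for year in years)
        ]
        if kept:  # drop people with no role in this Congress
            copy = ET.SubElement(out, "person", dict(person.attrib))
            copy.extend(kept)
    return out


# Made-up sample input; the real people.xml has many more attributes.
SAMPLE = """<people>
  <person id="1" name="Alice Example">
    <role type="rep" startdate="2007-01-04" enddate="2009-01-03"/>
    <role type="rep" startdate="1999-01-06" enddate="2001-01-03"/>
  </person>
  <person id="2" name="Bob Example">
    <role type="sen" startdate="1951-01-03" enddate="1957-01-03"/>
  </person>
</people>"""

if __name__ == "__main__":
    result = filter_people(ET.fromstring(SAMPLE), 110)
    print(ET.tostring(result, encoding="unicode"))
```

Like the stylesheet, this copies each person's attributes plus the matching roles and nothing else, so any other child elements of `person` are dropped.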
            • damianmont
              Message 6 of 8, Jul 10, 2007
                Kevin,

                Love that www.WhereaBill.org site.
                You use the information from josh's xml files?

                Love the mashup, very well done.

              • Peggy Garvin
                Message 7 of 8, Jul 10, 2007
                  I'd like to know, too. I am writing a brief article (right now, due tomorrow) about some of the new legislative info projects and want to mention whereabill as well as a sample of sites that have used Govtrack's file.

                  Thanks,
                  Peggy Garvin
                  peggy -at- garvinconsulting.com


                • Kevin Henry
                  Message 8 of 8, Jul 11, 2007
                    Peggy and Damian,

                    Thanks, glad you like the site.

                    I'm getting all the data from govtrack. Specifically, I'm using the
                    following files:

                    - the bill status data (www.govtrack.us/data/us/*/bills/*.xml)
                    - the roll vote data (/data/us/*/rolls/*.xml)
                    - the people database (/data/us/110/repstats/people.xml)
                    - the popularity listing (/data/us/bills.technorati.xml)
                    - the search service (/congress/billsearch_api.xpd)

                    I keep a copy of all the files on my server, and do a daily rsync (as
                    Josh describes here: http://www.govtrack.us/source.xpd) to stay current.

                    Basically, when the server gets a request for a certain bill, it
                    retrieves the status data and goes through the action items, parsing
                    them into the steps that will be represented in the "driving
                    directions". It also retrieves any relevant roll vote data, and then
                    sends that information (along with the summary, the titles, the list
                    of sponsors, and the biographical data for that session of Congress)
                    back to the client, which renders everything (with the help of the
                    Google Maps API).

                    So it's really a (sort-of) UI sitting on top of govtrack's (sort-of) API.

                    Let me know if you need any more information...


                    Regards,
                    Kevin
                    http://www.whereabill.org/