Re: [govtrack] Two Project Ideas [bill versioning]

  • Joshua Tauberer / GovTrack.us
    Message 1 of 21, Nov 9, 2006
      yahoogroups-backupemail@... wrote:
      > On Wed, 8 Nov 2006, Joshua Tauberer / GovTrack.us wrote:
      >> The first step is to convert it to text -- you can see the text versions
      >> (from "pdftotext -layout -nopgbrk") that GovTrack makes at the same
      >> addresses, just replace .pdf with .txt. Without "-layout" you get a
      >> differently formatted text version that could be more useful for this.
      >
      > there's a fork of pdftotext (also free) which has very
      > useful -html and -xml output flags which might be a
      > better place to start from if you don't have tools already.
      >
      > http://pdftohtml.sourceforge.net/

      Ahha, I think that could be useful. Thanks for the pointer. (It's
      actually been integrated in the poppler-utils RPM for Fedora Core 6, if
      that's useful for anyone.)

      For reference, the two PDFs in HTML with pdftohtml are:

      http://www.govtrack.us/hc95rds.html
      http://www.govtrack.us/hc95enr.html

      It's not getting the alignment of lines quite right, splitting up things
      that are on the same line, but that might not matter for this task, since
      differing line breaks between versions have to be ignored anyway.
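
To illustrate the point about ignoring line breaks, here is a minimal Python sketch that compares two text versions word by word. It is only an assumption about how the cleanup might look, not what GovTrack actually does: the `LINE_NUMBER` pattern and the helper names `normalize` and `word_diff` are hypothetical, and the pattern would also strip a real one- or two-digit number that happens to start a line.

```python
import difflib
import re

# Hypothetical pattern: a one- or two-digit margin line number at the start of a line.
# Caveat: it also removes a genuine small number that begins a line.
LINE_NUMBER = re.compile(r"^\s*\d{1,2}\s+")


def normalize(text):
    """Return the words of `text`, ignoring margin line numbers and line wrapping."""
    words = []
    for line in text.splitlines():
        line = LINE_NUMBER.sub("", line)
        words.extend(line.split())
    return words


def word_diff(old_text, new_text):
    """Return (operation, old_words, new_words) for every region that differs."""
    old, new = normalize(old_text), normalize(new_text)
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    changes = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            changes.append((op, " ".join(old[i1:i2]), " ".join(new[j1:j2])))
    return changes
```

Because the comparison runs over a flat list of words, two versions that differ only in where the lines wrap produce no changes at all.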

      --
      - Joshua Tauberer

      http://razor.occams.info

      "Strike up the klezmer and start acting like a man. You're
      about to have a truth-mitzvah." -- The Colbert Report
    • Scott Burns
      Message 2 of 21, Nov 9, 2006
        Instead of trying to convert PDFs and remove formatting you can get
        basic HTML versions of these bills from Thomas. This bill, for
        example, can be found here:

        http://thomas.loc.gov/cgi-bin/query/z?c109:H.CON.RES.95:

        From that page select the link to "Text of Legislation". You'll
        then be presented with a list of the different versions from
        different stages of the process. Pick the one you want there by
        selecting the link and, then, on the next page select "Printer
        Friendly Display". You'll then get a basic HTML display that, while
        somewhat ugly (see the source) should be parse-able into a DOM and
        then compared node-for-node to another version to find diffs.
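
A rough Python sketch of that idea, using only the standard library: it pulls the text nodes out of each HTML page in document order and diffs the two lists. The class and function names are made up for illustration, and it assumes the pages contain no script or style blocks whose contents should be skipped.

```python
import difflib
from html.parser import HTMLParser


class TextNodes(HTMLParser):
    """Collect the non-empty text nodes of an HTML document in document order."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_data(self, data):
        # Collapse runs of whitespace so line wrapping inside a node is ignored.
        text = " ".join(data.split())
        if text:
            self.nodes.append(text)


def text_nodes(html):
    parser = TextNodes()
    parser.feed(html)
    parser.close()
    return parser.nodes


def node_diff(old_html, new_html):
    """Return the added ('+ ') and removed ('- ') text nodes between two pages."""
    old, new = text_nodes(old_html), text_nodes(new_html)
    return [line for line in difflib.ndiff(old, new) if line.startswith(("+ ", "- "))]
```

This compares text nodes only and ignores the tags themselves, which may be enough if the markup differences between versions are purely presentational.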

        I haven't played around with the queries there enough to figure out
        if there's a reliable URL to get directly to the text display of the
        version you want, though getting to the bill summary page is easy.
        It shouldn't be hard to script a bot to do the navigation.

        HTH ...s.

        On Nov 8, 2006, at 7:11 PM, Joshua Tauberer / GovTrack.us wrote:

        > Scott Willeke wrote:
        > > Regarding Project 1:
        > > Can you recommend an example or two with PDFs that could be
        > compared.
        >
        > Hi, Scott. Take these two:
        >
        > http://www.govtrack.us/data/us/bills.text/109/hc/hc95rds.pdf
        > http://www.govtrack.us/data/us/bills.text/109/hc/hc95enr.pdf
        >
        > The first is the resolution after it was passed by the House (at the
        > time it was received in the Senate). The second is the final form of
        > the bill (as it was "enrolled") after the Senate passed it as well.
        > From a cursory look it seems the Senate tacked on some stuff at the
        > end,
        > between the versions.
        >
        > > How do you envision the difference in the output being shown?
        >
        > A list of changes, highlighting, whatever -- as long as it can filter
        > out a whole variety of unimportant formatting changes, like line
        > numbering, section numbering, line wrapping, etc.
        >
        > The first step is to convert it to text -- you can see the text
        > versions
        > (from "pdftotext -layout -nopgbrk") that GovTrack makes at the same
        > addresses, just replace .pdf with .txt. Without "-layout" you get a
        > differently formatted text version that could be more useful for this.
        >
        > Then strip out the formatting things that I mentioned above.
        > (Obviously
        > not trivial for line wrapping, for instance.)
        >
        > Then run a diff, but one would have to figure out how to format the
        > output of the diff so it looks like a bill again. (I have some
        > thoughts
        > on that, for future reference, but I won't get into it now.)
        >
        > > Alternatively, a completely new PDF document with highlighted
        > areas or
        > > annotations could be shown (e.g. something like MS Word's diff
        > > annotations).
        >
        > It's easier to view and navigate in HTML, so I don't think that's as
        > important, but interesting.
        >
        > Difficult, but not impossible. I hope you give it a shot.
        >
        > -
        > - Joshua Tauberer
        >
        > http://razor.occams.info
        >
        > "Strike up the klezmer and start acting like a man. You're
        > about to have a truth-mitzvah." -- The Colbert Report
        >
        >

        --
        Scott Burns, Staff Technologist <sburns@...>
        Public Knowledge <http://www.publicknowledge.org>

        -- Fortifying and Defending a Vibrant Information Commons
      • Joshua Tauberer / GovTrack.us
        Message 3 of 21, Nov 9, 2006
          Scott Burns wrote:
          > Instead of trying to convert PDFs and remove formatting you can get
          > basic HTML versions of these bills from Thomas. This bill, for
          > example, can be found here:
          >
          > http://thomas.loc.gov/cgi-bin/query/z?c109:H.CON.RES.95:

          Right, I forgot that Thomas's HTML versions are pretty good.

          > I haven't played around with the queries there enough to figure out
          > if there's a reliable URL to get directly to the text display of the
          > version you want

          Not as far as I know, either.

          In that case, the task may be a lot easier. Convert the HTML into XML,
          and then run a difference with an XML differencing tool, such as xmldiff
          (a Python script, very slow when I tried it just now, but seems to
          actually be useful for this project and can read the HTML directly) or
          XyDiff:

          http://gemo.futurs.inria.fr/software/XyDiff/cdrom/www/xydiff/index-eng.htm

          XyDiff might do the same thing faster and better, but I haven't tried it.
          It's in C++ and needs to be compiled.
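
For anyone who wants to see the shape of the problem before installing either tool, here is a naive positional tree comparison in Python. It is not how xmldiff or XyDiff work (both use smarter matching); it just walks two well-formed XML trees in parallel with the standard library. The function names are hypothetical, and tail text after a closing tag is ignored.

```python
import xml.etree.ElementTree as ET
from itertools import zip_longest


def compare(old, new, path):
    """Yield (path, description) for each positional difference between two elements."""
    if old.tag != new.tag:
        yield path, f"tag changed to {new.tag}"
        return
    # Normalize whitespace so re-wrapped text does not count as a change.
    old_text = " ".join((old.text or "").split())
    new_text = " ".join((new.text or "").split())
    if old_text != new_text:
        yield path, f"text changed from {old_text!r} to {new_text!r}"
    # Pair children by position; a missing partner means an addition or removal.
    for index, (a, b) in enumerate(zip_longest(old, new)):
        if a is None:
            yield f"{path}/{b.tag}[{index}]", "added"
        elif b is None:
            yield f"{path}/{a.tag}[{index}]", "removed"
        else:
            yield from compare(a, b, f"{path}/{a.tag}[{index}]")


def xml_diff(old_xml, new_xml):
    old_root, new_root = ET.fromstring(old_xml), ET.fromstring(new_xml)
    return list(compare(old_root, new_root, f"/{old_root.tag}"))
```

A positional comparison like this reports everything after an inserted section as changed, which is exactly the weakness a real XML differencing tool is meant to address.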

          --
          - Joshua Tauberer

          http://razor.occams.info

          "Strike up the klezmer and start acting like a man. You're
          about to have a truth-mitzvah." -- The Colbert Report
        • Andrew Badr
          Message 4 of 21, Mar 26, 2007
            http://federallink.org/

            There are several aspects of the site that aren't ready for public consumption, like the long lists of data and overall ugliness, but I want to get some feedback, starting with you fine folks on the govtrack mailing list.

            Beyond feedback, I'm looking for permanent help with coding, design, or establishing relationships with advocacy groups.

            -Andrew

            On 11/6/06, Andrew Badr <andrewbadr.etc@...> wrote:
            I'm glad to hear that people are interested. A friend and I are working on the project. We can't devote all our time to it, but we expect to launch some time in February, and something that could be called a demo should be ready much sooner.

            Andrew


            On 11/6/06, Joshua Tauberer / GovTrack.us < tauberer@...> wrote:

            Andrew Badr wrote:
            > I'm already working on Project 2 :)

            When can we all expect to see a demo? :)

            But, seriously, that's great. Keep us all posted.

            --
            - Joshua Tauberer

            http://razor.occams.info

            "Strike up the klezmer and start acting like a man. You're
            about to have a truth-mitzvah." -- The Colbert Report

            Andrew Badr wrote:
            >
            >
            > I'm already working on Project 2 :)
            >
            > On 11/4/06, Joshua Tauberer / GovTrack.us <tauberer@...> wrote:
            >
            > In case anyone on the list is bored and wants to work on a project that
            > would be really useful, I want to extract two ideas out of the current
            > read-the-bill thread. (And since GovTrack has a moderate surplus at the
            > moment, I could potentially fund one.)
            >
            > Project 1 - Version Tracking Bills
            >
            > Given two PDF versions of a bill (such as the bill as it was introduced
            > and then as it was after being reported by a committee, or in the case
            > in the other thread, as it was after being passed by the Senate and then
            > again as it was following the conference committee), what are the
            > additions, removals, and changes that were made?
            >
            > The idea is to have the effect of combining the Linux tools pdftotext
            > and diff, but better. Or, to tweak that process so that the output is
            > actually useful for a regular citizen.
            >
            > Project 2 - Collecting Advocacy Positions
            >
            > I want to display on GovTrack the positions of advocacy
            > groups/individuals on particular bills. What I need is a way for
            > independent organizations/individuals to enter their positions on
            > bills/amendments/votes (support/oppose/ambivalent + comment), or to
            > import their positions from e.g. blog entries, so that they end up in a
            > common data format to be displayed on GovTrack (and any other site that
            > wants to display it). This would entail creating a small website.
            >
            > --
            > - Joshua Tauberer
            >
            > http://razor.occams.info
            >
            > "Strike up the klezmer and start acting like a man. You're
            > about to have a truth-mitzvah." -- The Colbert Report
            >
            >
            >
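
The "common data format" in Project 2 quoted above could be as small as one record per position. The Python sketch below is purely an assumption for discussion: the field names, the `target` identifier, and the JSON layout are hypothetical and do not describe anything GovTrack or FederalLink actually uses. Only the three stances come from the project description.

```python
import json
from dataclasses import asdict, dataclass

# The three stances named in the Project 2 description.
STANCES = {"support", "oppose", "ambivalent"}


@dataclass
class Position:
    organization: str  # group or individual taking the position
    target: str        # hypothetical identifier for a bill, amendment, or vote
    stance: str        # one of STANCES
    comment: str = ""

    def __post_init__(self):
        if self.stance not in STANCES:
            raise ValueError(f"unknown stance: {self.stance!r}")


def to_json(positions):
    """Serialize a list of positions to a JSON array."""
    return json.dumps([asdict(p) for p in positions], indent=2)


def from_json(text):
    """Parse a JSON array back into Position records, validating each stance."""
    return [Position(**item) for item in json.loads(text)]
```

Any site could then publish or import such a list without knowing how the positions were originally entered.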



          • Josh Tauberer
            Message 5 of 21, Mar 27, 2007
              Andrew Badr wrote:
              > http://federallink.org/
              >
              > There are several aspects of the site that aren't ready for public
              > consumption, like the long lists of data and overall ugliness, but I
              > want to get some feedback, starting with you fine folks on the govtrack
              > mailing list.
              >
              > Beyond feedback, I'm looking for permanent help with coding, design, or
              > establishing relationships with advocacy groups.

              Hey, Andrew.

              (Btw, apparently we know someone in common. Small world...)

              The site looks great. I look forward to being able to link from GovTrack
              to FederalLink (and hopefully to include some stats from your site on
              GovTrack).

              When you correlate the patterns of two advocacy groups, how do you
              select which groups to show (given the one the user is looking at)?

              I'd love to see some graphical representations of the data (the same way
              I made my political spectrum, for instance).

              --
              - Josh Tauberer

              http://razor.occams.info

              "Yields falsehood when preceded by its quotation! Yields
              falsehood when preceded by its quotation!" Achilles to
              Tortoise (in "Gödel, Escher, Bach" by Douglas Hofstadter)
            • Nancy Berry
              Message 6 of 21, Mar 27, 2007
                I went in and registered...easy, and then went into bills...great work!
                 
                Nancy Berry


                ----- Original Message ----
                From: Josh Tauberer <tauberer@...>
                To: govtrack@yahoogroups.com
                Sent: Tuesday, March 27, 2007 5:52:17 PM
                Subject: Re: [govtrack] Two Project Ideas

                Andrew Badr wrote:

                > http://federallink.org/
                >
                > There are several aspects of the site that aren't ready for public
                > consumption, like the long lists of data and overall ugliness, but I
                > want to get some feedback, starting with you fine folks on the govtrack
                > mailing list.
                >
                > Beyond feedback, I'm looking for permanent help with coding, design, or
                > establishing relationships with advocacy groups.

                Hey, Andrew.

                (Btw, apparently we know someone in common. Small world...)

                The site looks great. I look forward to being able to link from GovTrack
                to FederalLink (and hopefully to include some stats from your site on
                GovTrack).

                When you correlate the patterns of two advocacy groups, how do you
                select which groups to show (given the one the user is looking at)?

                I'd love to see some graphical representations of the data (the same way
                I made my political spectrum, for instance).

                --
                - Josh Tauberer

                http://razor.occams.info

                "Yields falsehood when preceded by its quotation! Yields
                falsehood when preceded by its quotation!" Achilles to
                Tortoise (in "Gödel, Escher, Bach" by Douglas Hofstadter)

