
Re: [govtrack] Two Project Ideas

  • Scott Willeke
    Message 1 of 21 , Nov 7, 2006
      Regarding Project 1:
      Can you recommend an example or two with PDFs that could be compared?
      How do you envision the difference in the output being shown? Maybe it
      could create a report with details about each addition and removal.
      Alternatively, a completely new PDF document with highlighted areas or
      annotations could be shown (e.g. something like MS Word's diff
      annotations). I'm not sure if I'll have the time to take on the project
      but with a little more information I can give it a try. In any case I
      think the information will be useful to anyone considering the project
      and I think such a tool will be invaluable.

      Joshua Tauberer / GovTrack.us wrote:
      > In case anyone on the list is bored and wants to work on a project that
      > would be really useful, I want to extract two ideas out of the current
      > read-the-bill thread. (And since GovTrack has a moderate surplus at the
      > moment, I could potentially fund one.)
      >
      > Project 1 - Version Tracking Bills
      >
      > Given two PDF versions of a bill (such as the bill as it was introduced
      > and then as it was after being reported by a committee, or in the case
      > in the other thread, as it was after being passed by the Senate and then
      > again after it was following the conference committee), what are the
      > additions, removals, and changes that were made?
      >
      > The idea is to have the effect of combining the Linux tools pdftotext
      > and diff, but better. Or, to tweak that process so that the output is
      > actually useful for a regular citizen.
      >
      > Project 2 - Collecting Advocacy Positions
      >
      > I want to display on GovTrack the positions of advocacy
      > groups/individuals on particular bills. What I need is a way for
      > independent organizations/individuals to enter their positions on
      > bills/amendments/votes (support/oppose/ambivalent + comment), or to
      > import their positions from e.g. blog entries, so that they end up in a
      > common data format to be displayed on GovTrack (and any other site that
      > wants to display it). This would entail creating a small website.
      >
      >
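For anyone considering Project 2, here is a minimal sketch of what a "common data format" for positions might look like. Everything in it is an assumption made for illustration: the field names, the `Position` and `Stance` types, and the `hc109-95` style identifier are hypothetical and are not an existing GovTrack schema. Only the three stances and the free-text comment come from the project description above.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class Stance(Enum):
    """The three stances named in the project description."""
    SUPPORT = "support"
    OPPOSE = "oppose"
    AMBIVALENT = "ambivalent"


@dataclass
class Position:
    organization: str     # who holds the position
    target: str           # hypothetical ID of a bill, amendment, or vote
    stance: Stance
    comment: str = ""
    source_url: str = ""  # e.g. the blog entry the position was imported from


def to_record(position: Position) -> dict:
    """Convert a Position to a plain dict that json can serialize."""
    record = asdict(position)
    record["stance"] = position.stance.value
    return record


def to_json(positions: list[Position]) -> str:
    return json.dumps([to_record(p) for p in positions], indent=2)


def from_json(payload: str) -> list[Position]:
    """Rebuild Position objects from the JSON produced by to_json."""
    return [
        Position(**{**record, "stance": Stance(record["stance"])})
        for record in json.loads(payload)
    ]


if __name__ == "__main__":
    example = Position("Example Org", "hc109-95", Stance.OPPOSE, "Too costly")
    print(to_json([example]))
```

Keeping the stance as a closed set of values and everything else as plain strings is one way to let any site read the data without sharing code.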
    • Steve Andersen
      Message 2 of 21 , Nov 8, 2006
        It's a project we did for customers, and one of their requirements was for it to be an invite-only system. The work these folks do around bills and fighting for their positions is very adversarial, and the tools they use are kept pretty close to the vest.

        The import is done from ftp://landru.leg.state.or.us/pub

        Steve

        -----Original Message-----
        From: Joe Germuska [mailto:Joe@...]
        Sent: Tuesday, November 07, 2006 8:18 AM
        To: Steve Andersen; govtrack@yahoogroups.com
        Subject: RE: [govtrack] Two Project Ideas

        At 7:53 AM -0800 11/7/06, Steve Andersen wrote:
        >Unfortunately, I can't give you a URL as it's a password protected app.

        Why is it password protected? Is it not possible to have something like that public without it being "poisoned"? (I ask honestly.) Or was it more a "we're just getting started, so let's keep it in the family."

        >We built our project on Plone, the open source content management
        >system. In Oregon, the bill status is provided in machine readable form
        >every night, so we don't have to do any pdf trickery.

        Good for Oregon! Where are these documents served? Do other states do this?

        Joe

        --
        Joe Germuska
        Joe@... * http://blog.germuska.com

        "The truth is that we learned from João forever to be out of tune."
        -- Caetano Veloso
      • Joshua Tauberer / GovTrack.us
        Message 3 of 21 , Nov 8, 2006
          Scott Willeke wrote:
          > Regarding Project 1:
          > Can you recommend an example or two with PDFs that could be compared.

          Hi, Scott. Take these two:

          http://www.govtrack.us/data/us/bills.text/109/hc/hc95rds.pdf
          http://www.govtrack.us/data/us/bills.text/109/hc/hc95enr.pdf

          The first is the resolution after it was passed by the House (at the
          time it was received in the Senate). The second is the final form of
          the bill (as it was "enrolled") after the Senate passed it as well.
          From a cursory look it seems the Senate tacked on some stuff at the end,
          between the versions.

          > How do you envision the difference in the output being shown?

          A list of changes, highlighting, whatever -- as long as it can filter
          out a whole variety of unimportant formatting changes, like line
          numbering, section numbering, line wrapping, etc.

          The first step is to convert it to text -- you can see the text versions
          (from "pdftotext -layout -nopgbrk") that GovTrack makes at the same
          addresses, just replace .pdf with .txt. Without "-layout" you get a
          differently formatted text version that could be more useful for this.

          Then strip out the formatting things that I mentioned above. (Obviously
          not trivial for line wrapping, for instance.)

          Then run a diff, but one would have to figure out how to format the
          output of the diff so it looks like a bill again. (I have some thoughts
          on that, for future reference, but I won't get into it now.)
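The steps above (convert to text, strip formatting, diff) could be sketched roughly as follows. This is a sketch only, assuming the text has already been produced by pdftotext. The function names and the pattern used to recognize printed line numbers are illustrative assumptions, and real bill text would need more rules (section numbering, page headers, hyphenation).

```python
import difflib
import re

# Assumption: a printed line number is 1-2 digits at the start of a line.
# This would also strip a real leading number such as "5 percent".
LINE_NUMBER = re.compile(r"^\s*\d{1,2}\s+")


def normalize(text: str) -> list[str]:
    """Drop leading line numbers and line wrapping; return a flat word list."""
    words: list[str] = []
    for line in text.splitlines():
        line = LINE_NUMBER.sub("", line)
        words.extend(line.split())
    return words


def word_changes(old_text: str, new_text: str) -> list[tuple[str, str, str]]:
    """Return (operation, old words, new words) for each changed run."""
    old, new = normalize(old_text), normalize(new_text)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    changes = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            changes.append((op, " ".join(old[i1:i2]), " ".join(new[j1:j2])))
    return changes


if __name__ == "__main__":
    old = " 1  Resolved, That the budget\n 2  for fiscal year 2006 is set.\n"
    new = " 1  Resolved, That the\n 2  budget for fiscal year 2007\n 3  is set.\n"
    for change in word_changes(old, new):
        print(change)
```

Comparing words rather than lines is what makes the different line wrapping between the two versions irrelevant: only the changed year is reported.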

          > Alternatively, a completely new PDF document with highlighted areas or
          > annotations could be shown (e.g. something like MS Word's diff
          > annotations).

          It's easier to view and navigate in HTML, so I don't think that's as
          important, but interesting.
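As an illustration of HTML output, a word-level diff can be rendered with the standard `<ins>` and `<del>` elements. This is a minimal sketch; the function name `diff_to_html` is hypothetical, and styling (for example, colors for insertions and deletions) is left to a stylesheet.

```python
import difflib
from html import escape


def diff_to_html(old_words: list[str], new_words: list[str]) -> str:
    """Render a word-level diff as an HTML fragment using <del> and <ins>."""
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    parts = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            parts.append(escape(" ".join(old_words[i1:i2])))
        if op in ("delete", "replace"):
            parts.append("<del>" + escape(" ".join(old_words[i1:i2])) + "</del>")
        if op in ("insert", "replace"):
            parts.append("<ins>" + escape(" ".join(new_words[j1:j2])) + "</ins>")
    return " ".join(parts)


if __name__ == "__main__":
    old = "the budget is set".split()
    new = "the final budget is set".split()
    print(diff_to_html(old, new))
```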

          Difficult, but not impossible. I hope you give it a shot.

          --
          - Joshua Tauberer

          http://razor.occams.info

          "Strike up the klezmer and start acting like a man. You're
          about to have a truth-mitzvah." -- The Colbert Report
        • yahoogroups-backupemail@msmith.net
          Message 4 of 21 , Nov 8, 2006
            On Wed, 8 Nov 2006, Joshua Tauberer / GovTrack.us wrote:
            > The first step is to convert it to text -- you can see the text versions
            > (from "pdftotext -layout -nopgbrk") that GovTrack makes at the same
            > addresses, just replace .pdf with .txt. Without "-layout" you get a
            > differently formatted text version that could be more useful for this.

            there's a fork of pdftotext (also free) which has very
            useful -html and -xml output flags which might be a
            better place to start from if you don't have tools already.

            http://pdftohtml.sourceforge.net/


            Sam
            www.disruptiveproactivity.com

            --
            May you always be as vivid as your hallucinations.
          • Joshua Tauberer / GovTrack.us
            Message 5 of 21 , Nov 9, 2006
              yahoogroups-backupemail@... wrote:
              > On Wed, 8 Nov 2006, Joshua Tauberer / GovTrack.us wrote:
              >> The first step is to convert it to text -- you can see the text versions
              >> (from "pdftotext -layout -nopgbrk") that GovTrack makes at the same
              >> addresses, just replace .pdf with .txt. Without "-layout" you get a
              >> differently formatted text version that could be more useful for this.
              >
              > there's a fork of pdftotext (also free) which has very
              > useful -html and -xml output flags which might be a
              > better place to start from if you don't have tools already.
              >
              > http://pdftohtml.sourceforge.net/

              Ahha, I think that could be useful. Thanks for the pointer. (It's
              actually been integrated in the poppler-utils RPM for Fedora Core 6, if
              that's useful for anyone.)

              For reference, the two PDFs in HTML with pdftohtml are:

              http://www.govtrack.us/hc95rds.html
              http://www.govtrack.us/hc95enr.html

              It's not getting the alignment of lines quite right (it splits up things
              on the same line), but that might not affect the task, since differing
              line breaks between versions have to be ignored anyway.
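If the split fragments do turn out to matter, the `-xml` output mentioned earlier in the thread carries position information that can be used to put them back together. The sketch below assumes the XML consists of `<page>` elements containing `<text>` elements with integer `top` and `left` attributes; that layout is an assumption to check against the actual output of your pdftohtml version. The function name and the `tolerance` parameter are hypothetical.

```python
import xml.etree.ElementTree as ET


def lines_from_pdf2xml(xml_source: str, tolerance: int = 2) -> list[str]:
    """Merge text fragments whose vertical positions nearly match into lines."""
    root = ET.fromstring(xml_source)
    lines: list[str] = []
    for page in root.iter("page"):
        # (top, left, text) for every fragment, ordered from top to bottom.
        fragments = sorted(
            (int(node.get("top")), int(node.get("left")),
             "".join(node.itertext()).strip())
            for node in page.iter("text")
        )
        rows: list[list[tuple[int, int, str]]] = []
        for fragment in fragments:
            # Same row if within `tolerance` units of the row's first fragment.
            if rows and fragment[0] - rows[-1][0][0] <= tolerance:
                rows[-1].append(fragment)
            else:
                rows.append([fragment])
        for row in rows:
            row.sort(key=lambda f: f[1])  # left-to-right within the row
            lines.append(" ".join(f[2] for f in row))
    return lines


if __name__ == "__main__":
    sample = """<pdf2xml>
<page number="1">
<text top="100" left="120" width="80" height="12" font="0">CONCURRENT</text>
<text top="101" left="40" width="70" height="12" font="0"><b>HOUSE</b></text>
<text top="120" left="40" width="90" height="12" font="0">RESOLUTION</text>
</page>
</pdf2xml>"""
    print(lines_from_pdf2xml(sample))
```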

              --
              - Joshua Tauberer

              http://razor.occams.info

              "Strike up the klezmer and start acting like a man. You're
              about to have a truth-mitzvah." -- The Colbert Report
            • Scott Burns
              Message 6 of 21 , Nov 9, 2006
                Instead of trying to convert PDFs and remove formatting you can get
                basic HTML versions of these bills from Thomas. This bill, for
                example, can be found here:

                http://thomas.loc.gov/cgi-bin/query/z?c109:H.CON.RES.95:

                From that page select the link to "Text of Legislation". You'll
                then be presented with a list of the different versions from
                different stages of the process. Pick the one you want there by
                selecting the link and, then, on the next page select "Printer
                Friendly Display". You'll then get a basic HTML display that, while
                somewhat ugly (see the source) should be parse-able into a DOM and
                then compared node-for-node to another version to find diffs.

                I haven't played around with the queries there enough to figure out
                if there's a reliable URL to get directly to the text display of the
                version you want, though getting to the bill summary page is easy.
                It shouldn't be hard to script a bot to do the navigation.
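A flat variant of the comparison described above can be sketched with the standard library alone. Instead of a full DOM, it reduces each page to a list of text blocks and diffs the two lists. The set of tags treated as block boundaries is an assumption and would need adjusting to the markup Thomas actually serves; the class and function names are hypothetical.

```python
import difflib
from html.parser import HTMLParser

# Assumption: these tags separate one block of bill text from the next.
BLOCK_TAGS = {"p", "br", "div", "li", "h1", "h2", "h3", "h4", "pre", "tr"}


class BlockText(HTMLParser):
    """Collect the text of an HTML page as a list of whitespace-normalized blocks."""

    def __init__(self) -> None:
        super().__init__()
        self.blocks: list[str] = []
        self._buffer: list[str] = []

    def _flush(self) -> None:
        text = " ".join(" ".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def blocks(html: str) -> list[str]:
    parser = BlockText()
    parser.feed(html)
    parser.close()
    return parser.blocks


def changed_blocks(old_html: str, new_html: str) -> list[tuple[str, list[str], list[str]]]:
    """Return (operation, old blocks, new blocks) for each run that differs."""
    old, new = blocks(old_html), blocks(new_html)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (op, old[i1:i2], new[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]
```

Fetching the pages is left out on purpose, since no reliable direct URL for a given version is known in this thread.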

                HTH ...s.

                --
                Scott Burns, Staff Technologist <sburns@...>
                Public Knowledge <http://www.publicknowledge.org>

                -- Fortifying and Defending a Vibrant Information Commons
              • Joshua Tauberer / GovTrack.us
                Message 7 of 21 , Nov 9, 2006
                  Scott Burns wrote:
                  > Instead of trying to convert PDFs and remove formatting you can get
                  > basic HTML versions of these bills from Thomas. This bill, for
                  > example, can be found here:
                  >
                  > http://thomas.loc.gov/cgi-bin/query/z?c109:H.CON.RES.95:

                  Right, I forgot that Thomas's HTML versions are pretty good.

                  > I haven't played around with the queries there enough to figure out
                  > if there's a reliable URL to get directly to the text display of the
                  > version you want

                  Not as far as I know, either.

                  In that case, the task may be a lot easier. Convert the HTML into XML,
                  and then run a difference with an XML differencing tool, such as xmldiff
                  (a Python script, very slow when I tried it just now, but seems to
                  actually be useful for this project and can read the HTML directly) or
                  XyDiff:

                  http://gemo.futurs.inria.fr/software/XyDiff/cdrom/www/xydiff/index-eng.htm

                  XyDiff might do the same thing faster and better, but I haven't tried it.
                  It's written in C++ and needs to be compiled.
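To make the idea of an XML difference concrete, here is a deliberately naive tree comparison using Python's standard library. It is not the algorithm used by xmldiff or XyDiff. It pairs children strictly by position, so an insertion in the middle of a bill would misalign everything after it, which is the kind of problem dedicated tools are meant to handle. The function name and the path notation are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def tree_diff(old: ET.Element, new: ET.Element, path: Optional[str] = None) -> list[str]:
    """Compare two XML trees position by position and describe the differences.

    Tail text (text following a closing tag) is ignored in this sketch.
    """
    here = path or f"/{old.tag}"
    if old.tag != new.tag:
        return [f"{here}: element replaced by <{new.tag}>"]
    notes = []
    old_text, new_text = (old.text or "").strip(), (new.text or "").strip()
    if old_text != new_text:
        notes.append(f"{here}: text {old_text!r} -> {new_text!r}")
    old_children, new_children = list(old), list(new)
    for index, (a, b) in enumerate(zip(old_children, new_children)):
        notes.extend(tree_diff(a, b, f"{here}/{a.tag}[{index}]"))
    for extra in old_children[len(new_children):]:
        notes.append(f"{here}: removed <{extra.tag}>")
    for extra in new_children[len(old_children):]:
        notes.append(f"{here}: added <{extra.tag}>")
    return notes


if __name__ == "__main__":
    old = ET.fromstring("<bill><section>Budget for 2006</section></bill>")
    new = ET.fromstring(
        "<bill><section>Budget for 2007</section>"
        "<section>Reserve fund</section></bill>"
    )
    for note in tree_diff(old, new):
        print(note)
```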

                  --
                  - Joshua Tauberer

                  http://razor.occams.info

                  "Strike up the klezmer and start acting like a man. You're
                  about to have a truth-mitzvah." -- The Colbert Report
                • Andrew Badr
                  Message 8 of 21 , Mar 26, 2007
                    http://federallink.org/

                    There are several aspects of the site that aren't ready for public consumption, like the long lists of data and overall ugliness, but I want to get some feedback, starting with you fine folks on the govtrack mailing list.

                    Beyond feedback, I'm looking for permanent help with coding, design, or establishing relationships with advocacy groups.

                    -Andrew

                    On 11/6/06, Andrew Badr <andrewbadr.etc@...> wrote:
                    I'm glad to hear that people are interested. A friend and I are working on the project. We can't devote all our time to it, but we expect to launch some time in February, and something that could be called a demo should be ready much sooner.

                    Andrew


                    On 11/6/06, Joshua Tauberer / GovTrack.us < tauberer@...> wrote:

                    Andrew Badr wrote:
                    > I'm already working on Project 2 :)

                    When can we all expect to see a demo? :)

                    But, seriously, that's great. Keep us all posted.

                    --
                    - Joshua Tauberer

                    http://razor.occams.info

                    "Strike up the klezmer and start acting like a man. You're
                    about to have a truth-mitzvah." -- The Colbert Report

                  • Josh Tauberer
                    Message 9 of 21 , Mar 27, 2007
                      Andrew Badr wrote:
                      > http://federallink.org/
                      >
                      > There are several aspects of the site that aren't ready for public
                      > consumption, like the long lists of data and overall ugliness, but I
                      > want to get some feedback, starting with you fine folks on the govtrack
                      > mailing list.
                      >
                      > Beyond feedback, I'm looking for permanent help with coding, design, or
                      > establishing relationships with advocacy groups.

                      Hey, Andrew.

                      (Btw, apparently we know someone in common. Small world...)

                      The site looks great. I look forward to being able to link from GovTrack
                      to FederalLink (and hopefully to include some stats from your site on
                      GovTrack).

                      When you correlate the patterns of two advocacy groups, how do you
                      select which groups to show (given the one the user is looking at)?

                      I'd love to see some graphical representations of the data (the same way
                      I made my political spectrum, for instance).

                      --
                      - Josh Tauberer

                      http://razor.occams.info

                      "Yields falsehood when preceded by its quotation! Yields
                      falsehood when preceded by its quotation!" Achilles to
                      Tortoise (in "Gödel, Escher, Bach" by Douglas Hofstadter)
                    • Nancy Berry
                      Message 10 of 21 , Mar 27, 2007
                        I went in and registered...easy, and then went into bills...great work!
                         
                        Nancy Berry

