Re: Bill text problems?

  • David
    Message 1 of 9 , Nov 29, 2011
      Hi Josh,

      Shots fired, golly. I don't believe your note is either fair or accurate.

      Just a placeholder for the forum to say I'll email you individually to set up a phone conversation about your concerns. Let's talk by voice later today.

      Sincerely,
      -David

      http://www.participatorypolitics.org


      --- In govtrack@yahoogroups.com, Josh Tauberer <tauberer@...> wrote:
      >
      > I've been meaning to write about this.
      >
      > About two weeks ago GPO stopped updating GPO Access, which had been their
      > system for publishing documents since the mid-90s. New bills and other
      > documents are only being published in FDSys now, and GovTrack isn't
      > pulling from FDSys because FDSys didn't exist when I wrote the bill text
      > scraper.
      >
      > Since I've been focused on POPVOX lately, I haven't had a chance to
      > build a new scraper for GovTrack, although in anticipation of this I've
      > been working on reimplementing much of the same functionality on POPVOX.
      > I'm not sure what, if any, of that code will be open, though we have an
      > experimental API for it now.
      >
      > It would be helpful to know who else, if anyone, is using bill text so I
      > can plan the future of GovTrack's bill text accordingly.
      >
      > But I will say that folks free riding on my data and using it to compete
      > with my business (i.e. POPVOX) get no sympathy from me.
      >
      > - Josh Tauberer
      > - GovTrack.us / POPVOX.com
      >
      > http://razor.occams.info | www.govtrack.us | www.popvox.com
      >
      > On 11/29/2011 02:12 AM, jlundigard wrote:
      > > Hey all,
      > >
      > > We've noticed that we stopped receiving bill text from govtrack. It seems to have stopped around this bill:
      > >
      > > http://www.govtrack.us/congress/bill.xpd?bill=s112-1788
      > >
      > > That bill and more recently introduced ones don't have any bill text even though the text exists on the GPO website.
      > >
      > > Perhaps a scraper is down?
      > >
      > > Thanks,
      > > Andy
      > > OpenCongress.org
      > >
      > >
      > >
      > > ------------------------------------
      > >
      > > Yahoo! Groups Links
      > >
      > >
      > >
      >
    • Eric Mill
      Message 2 of 9 , Nov 29, 2011
        I make use of the bill text that GovTrack provides in Sunlight's data services (our Real Time Congress API) and in the apps that depend on it (including our Congress app). We load it into ElasticSearch (recommended, btw) and we power our search and highlighting with it. I'm imminently about to document this full text search capability and offer it to the public.
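For readers who haven't used it, here is a rough sketch of what a full-text search with highlighting looks like as an Elasticsearch request body. It is an illustration only, not the Real Time Congress implementation: the field name `full_text` is a made-up placeholder, and the `match` query is the form used in current Elasticsearch query DSL, which may differ from what older releases accept. The code only builds and prints the JSON body; it does not contact a server.

```python
import json


def build_search_body(phrase: str, field: str = "full_text") -> dict:
    """Build a search request body that matches `phrase` in `field`
    and asks Elasticsearch to return highlighted snippets for that field.

    `field` is a hypothetical name; use whatever field holds the bill text.
    """
    return {
        "query": {"match": {field: phrase}},
        # An empty options object requests highlighting with default settings.
        "highlight": {"fields": {field: {}}},
    }


# Print the body that would be sent to an index's _search endpoint.
print(json.dumps(build_search_body("health care"), indent=2))
```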

        Unlike bill metadata, where you've done God's work scraping THOMAS all day, every day, the bill text is a less vital service, since you just repackage what GPO offers and provide it via rsync. This is an incredibly useful way to provide it though! I'd like it to stick around.

        I'm not sure it's possible to "free ride" on free, CC0-licensed, repackaged versions of public domain government data. If you feel like people have been insufficiently thankful for your work or haven't given enough attribution, that is a more valid and specific conversation to have than accusing folks who are asking about the status of your public data on your public mailing list of competing with your business.

        -- Eric


      • Josh Tauberer
        Message 3 of 9 , Nov 29, 2011
          > the bill text is a less vital service, since you just
          > repackage what GPO offers

          Exactly. That's why I'm not particularly concerned about dropping this,
          since it doesn't do much to begin with, and after 5+ years of running the
          bill text scraper it's past time to rethink what's useful. (Btw, it does
          also scrape the HTML bill text on THOMAS, which is slightly less
          trivial, but still pretty trivial.)

          Do you use the PDFs or HTML (or .txt?)?

          (Clearly when I said "free loading" I was not referring to what I agree
          is a simple repackaging of PDFs.)

          - Josh Tauberer
          - GovTrack.us / POPVOX.com

          http://razor.occams.info | www.govtrack.us | www.popvox.com

        • Josh Tauberer
          Message 4 of 9 , Nov 29, 2011
            On 11/29/2011 09:02 AM, David wrote:
            > Shots fired, golly.

            All right. I take back the "free loading" bit. But not the rest.


            - Josh Tauberer
            - GovTrack.us / POPVOX.com

            http://razor.occams.info | www.govtrack.us | www.popvox.com

          • Eric Mill
            Message 5 of 9 , Nov 29, 2011
              I use a combination of three files for each bill. Primarily, the .txt, for the text. I'm only storing the text en masse for full text search, not storing the semantic hierarchy of the bill. Secondarily, I use the MODS XML metadata to get what date the bill version was issued on, a pretty critical piece of data. However, sometimes the MODS file doesn't exist, and I use the .xml (HTML) version of the bill as a backup source for the issued date -- which, now that I look at the code, makes use of the Dublin Core metadata that you add on top of the original bill data. I don't make use of the PDF.
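To make the fallback described above concrete, here is a minimal Python sketch (an illustration, not the Ruby code from Real Time Congress). It assumes the MODS file carries the date in a `dateIssued` element in the standard MODS v3 namespace, and that the bill XML carries a Dublin Core `dc:date` element; the function name and the sample documents are invented for the example.

```python
import xml.etree.ElementTree as ET
from typing import Optional

MODS_NS = "{http://www.loc.gov/mods/v3}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def issued_date(mods_xml: Optional[str], bill_xml: Optional[str]) -> Optional[str]:
    """Return the issued date, preferring MODS and falling back to Dublin Core.

    Either argument may be None when that file is missing. Pass bytes instead
    of str if the document starts with an XML encoding declaration.
    """
    if mods_xml:
        node = ET.fromstring(mods_xml).find(f".//{MODS_NS}dateIssued")
        if node is not None and node.text:
            return node.text.strip()
    if bill_xml:
        # Backup source: Dublin Core metadata embedded in the bill XML.
        node = ET.fromstring(bill_xml).find(f".//{DC_NS}date")
        if node is not None and node.text:
            return node.text.strip()
    return None


# Made-up sample documents, only to exercise both paths.
sample_mods = (
    '<mods xmlns="http://www.loc.gov/mods/v3">'
    "<originInfo><dateIssued>2011-11-01</dateIssued></originInfo></mods>"
)
sample_bill = (
    '<bill><metadata xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dublinCore><dc:date>2011-11-02</dc:date></dublinCore></metadata></bill>"
)
print(issued_date(sample_mods, sample_bill))
print(issued_date(None, sample_bill))
```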

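              My code that does all this is here, btw:
              https://github.com/sunlightlabs/realtimecongress/blob/master/tasks/bill_text_archive/bill_text_archive.rb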

              I understand that this is less vital, but I mean it when I say the rsync is incredibly useful -- so much so that if you left it offline, what I'd probably do is set up a separate dedicated GPO bulk data mirroring service for at least bill text, that supported rsync, and use that internally. That's a lot of work, though! If you're continuing to use the GPO's bill text files in your own work on POPVOX, you'd do the community a service by continuing to make that work available.

              -- Eric


            • Josh Tauberer
              Message 6 of 9 , Dec 10, 2011
                Hi, everyone.

                Bill text is updating now.

                Thanks to whoever here forwarded the problem on to GPO --- I got an
                email from someone at GPO who pointed me to their sitemap files, e.g.:
                http://www.gpo.gov/smap/fdsys/sitemap_2011/2011_BILLS_sitemap.xml
                (warning: BIG file). I'm checking on how bills are split by year (by
                publication date?), but this seems to be the most helpful way to find
                them all.
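Since the sitemap is large, one way to walk it without loading the whole file into memory is a streaming parse. The sketch below assumes the file follows the standard sitemaps.org format (`<urlset>` containing `<url><loc>` entries); the function name and the sample URLs are placeholders, and nothing here has been checked against the actual GPO file.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def iter_sitemap_locs(stream: BinaryIO) -> Iterator[str]:
    """Yield each <loc> URL from a sitemap, parsing incrementally."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == SITEMAP_NS + "loc" and elem.text:
            yield elem.text.strip()
        elif elem.tag == SITEMAP_NS + "url":
            # Drop the finished <url> entry's children to keep memory use low.
            elem.clear()


# Tiny made-up sitemap standing in for the real (big) file.
sample = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>http://example.com/a</loc><lastmod>2011-12-01</lastmod></url>"
    b"<url><loc>http://example.com/b</loc></url>"
    b"</urlset>"
)
for loc in iter_sitemap_locs(io.BytesIO(sample)):
    print(loc)
```

For the real file, the same function can be handed any binary file-like object, such as a file opened with `open(path, "rb")` after downloading.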

                Btw, Eric: for indexing bill text, it might be better to use the
                original text files from GPO. The .txt files on GovTrack are generated
                using pdftotext and have line numbers, whereas the GPO original .txt
                files do not (I imagine they are generated from the XML or GPO locator
                codes files directly).
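If someone is stuck with the line-numbered files in the meantime, a rough cleanup pass might look like the sketch below. It assumes each numbered line begins with optional whitespace, a one- or two-digit number, and more whitespace; that layout is a guess about the pdftotext output rather than something verified here, and the pattern would also strip a genuine leading number from the bill text, so the GPO originals remain the safer source.

```python
import re

# Assumed layout: optional indent, 1-2 digit line number, whitespace, then text.
_LINE_NO = re.compile(r"^\s*\d{1,2}\s+(?=\S)")


def strip_line_numbers(text: str) -> str:
    """Remove a leading line number from every line that has one."""
    return "\n".join(_LINE_NO.sub("", line) for line in text.splitlines())


# Made-up sample in the assumed layout.
numbered = "  1 A BILL\n  2 To amend the Act.\nSEC. 2. Short title."
print(strip_line_numbers(numbered))
```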

                I don't use my own .txt files except to display historical bill text,
                and unless there's an objection I could replace the pdftotext-generated
                files with the GPO original .txt files.

                Any objections from anyone?

                - Josh Tauberer
                - GovTrack.us / POPVOX.com

                http://razor.occams.info | www.govtrack.us | www.popvox.com

                On 11/29/2011 10:25 AM, Eric Mill wrote:
                > My code that does all this is here, btw:
                > https://github.com/sunlightlabs/realtimecongress/blob/master/tasks/bill_text_archive/bill_text_archive.rb
                >
                > I understand that this is less vital, but I mean it when I say the rsync
                > is incredibly useful -- so much so that if you left it offline, what I'd
                > probably do is set up a separate dedicated GPO bulk data mirroring
                > service for at least bill text, that supported rsync, and use that
                > internally. That's a lot of work, though! If you're continuing to use
                > the GPO's bill text files in your own work on POPVOX, you'd do the
                > community a service by continuing to make that work available.
                >
                > -- Eric
                >
                > On Tue, Nov 29, 2011 at 10:00 AM, Josh Tauberer <tauberer@...
                > <mailto:tauberer@...>> wrote:
                >
                > the bill text is a less vital service, since you just
                > repackage what GPO offers
                >
                >
                > Exactly. That's why I'm not particularly concerned about dropping
                > this since it doesn't do much to begin with and after 5+ years of
                > running the bill text scraper it's past time to rethink what's
                > useful. (Btw, it does also scrape the HTML bill text on THOMAS,
                > which is slightly less trivial, but still pretty trivial.)
                >
                > Do you use the PDFs or HTML (or .txt?)?
                >
                > (Clearly when I said "free loading" I was not referring to what I
                > agree is a simple repackaging of PDFs.)
                >
                >
                > - Josh Tauberer
                > - GovTrack.us / POPVOX.com
                >
                > http://razor.occams.info | www.govtrack.us <http://www.govtrack.us>
                > | www.popvox.com <http://www.popvox.com>
                >
                > On 11/29/2011 09:30 AM, Eric Mill wrote:
                >
                >
                >
                > I make use of the bill text that GovTrack provides in Sunlight's
                > data
                > services (our Real Time Congress API) and in the apps that
                > depend on it
                > (including our Congress app). We load it into ElasticSearch
                > (recommended, btw) and we power our search and highlighting with
                > it. I'm
                > imminently about to document this full text search capability
                > and offer
                > it to the public.
                >
                > Unlike bill metadata, where you've done God's work and scrapes
                > THOMAS
                > all day every day, the bill text is a less vital service, since
                > you just
                > repackage what GPO offers and provide it via rsync. This is an
                > incredibly useful way to provide it though! I'd like it to stick
                > around.
                >
                > I'm not sure it's possible to "free ride" on free, CC0-licensed,
                > repackaged versions of public domain government data. If you
                > feel like
                > people have been insufficiently thankful for your work or
                > haven't given
                > enough attribution, that is a more valid and specific
                > conversation to
                > have than accusing folks who are asking about the status of your
                > public
                > data on your public mailing list of competing with your business.
                >
                > -- Eric
                >
                > On Tue, Nov 29, 2011 at 8:27 AM, Josh Tauberer
                > <tauberer@... <mailto:tauberer@...>
                > <mailto:tauberer@... <mailto:tauberer@...>>> wrote:
                >
                > I've been meaning to write about this.
                >
                > About two weeks ago GPO stopped updating GPO Access, which
                > was their
                > system for publishing documents since the mid 90s. New bills
                > and other
                > documents are only being published in FDSys now, and
                > GovTrack isn't
                > pulling from FDSys because FDSys didn't exist when I wrote
                > the bill text
                > scraper.
                >
                > Since I've been focused on POPVOX lately, I haven't had a
                > chance to
                > build a new scraper for GovTrack, although in anticipation
                > of this I've
                > been working on reimplementing much of the same
                > functionality on POPVOX.
                > I'm not sure what if any of that code will be open, though
                > we have an
                > experimental API for it now.
                >
                > It would be helpful to know who else, if anyone, is using
                > bill text so I
                > can plan the future of GovTrack's bill text accordingly.
                >
                > But I will say that folks free riding on my data and using
                > it to compete
                > with my business (i.e. POPVOX) get no sympathy from me.
                >
                > - Josh Tauberer
                > - GovTrack.us / POPVOX.com
                >
                > http://razor.occams.info | www.govtrack.us
                > <http://www.govtrack.us> <http://www.govtrack.us>
                > | www.popvox.com <http://www.popvox.com> <http://www.popvox.com>
                >
                >
                > On 11/29/2011 02:12 AM, jlundigard wrote:
                > > Hey all,
                > >
                > > We've noticed the we stopped receiving bill text from govtrack.
                > It seems to have stopped around this bill:
                > >
                > > http://www.govtrack.us/__congress/bill.xpd?bill=s112-__1788
                > <http://www.govtrack.us/congress/bill.xpd?bill=s112-1788>
                > >
                > > That bill and more recently introduced ones don't have any bill
                > text even though the text exists on the CPO website.
                > >
                > > Perhaps a scraper is down?
                > >
                > > Thanks,
                > > Andy
                > > OpenCongress.org
                > >
                >
                > --
                > Developer | sunlightfoundation.com
                >
              • Eric Mill
                Message 7 of 9 , Dec 10, 2011
                • 0 Attachment
                  I've been looking for exactly that kind of sitemap file! Would you
                  mind sharing how we can find the different sitemaps?

                  For example, I guessed at the URL for the one for public and private laws:
                  http://www.gpo.gov/smap/fdsys/sitemap_2011/2011_PLAW_sitemap.xml

                  But that file is very small and doesn't list what you would need to
                  effectively spider the PLAW collection without scraping their HTML.
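For anyone checking what a given sitemap file actually lists, here is a minimal sketch in Python. It assumes the file follows the standard sitemaps.org format (a `urlset` of `url` entries, each with a `loc` and an optional `lastmod`); the thread does not show the real file's contents, so the sample document, its example.gov URLs, and the function name `list_sitemap_entries` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample; the real file's entries are not shown in this thread.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.gov/docs/EXAMPLE-1/content-detail.html</loc>
    <lastmod>2011-12-09</lastmod>
  </url>
  <url>
    <loc>http://example.gov/docs/EXAMPLE-2/content-detail.html</loc>
  </url>
</urlset>
"""


def list_sitemap_entries(xml_text):
    """Return (loc, lastmod) pairs; lastmod is None when the entry has none."""
    # Parse from bytes so the encoding declaration in the prolog is honored.
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:
            entries.append((loc, lastmod))
    return entries


if __name__ == "__main__":
    for loc, lastmod in list_sitemap_entries(SAMPLE_SITEMAP):
        print(lastmod or "(no lastmod)", loc)
```

Counting the entries this returns for a downloaded file is a quick way to tell whether a sitemap covers a whole collection or only a handful of pages.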

                  As for text of bills -- I actually came to that realization yesterday
                  myself, that the GPO .txt files were probably better. I definitely
                  would not mind you switching over to them - I can adjust my regular
                  expressions (just for sanitization, not extracting data) accordingly.
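To illustrate the kind of sanitization the pdftotext-generated files call for, here is a rough Python sketch that strips leading line numbers. The layout it assumes (optional indentation, a one- or two-digit number, then whitespace before the text) is a guess for illustration, not a description of GovTrack's actual files, and `strip_line_numbers` is a hypothetical name rather than code from any of the projects mentioned here.

```python
import re

# Assumed layout: optional indentation, a 1-2 digit line number, then at least
# one space or tab before the bill text. This is an illustrative guess.
LINE_NUMBER = re.compile(r"^[ \t]*\d{1,2}[ \t]+(?=\S)", re.MULTILINE)


def strip_line_numbers(text):
    """Remove a leading line number from every line that has one."""
    return LINE_NUMBER.sub("", text)


if __name__ == "__main__":
    sample = (
        " 1 SECTION 1. SHORT TITLE.\n"
        " 2   This Act may be cited as\n"
        "10 the Example Act."
    )
    print(strip_line_numbers(sample))
```

A pattern like this would also eat a line of real text that happens to begin with a short number (say, "12 months after enactment"), which is one reason unnumbered source files are easier to index.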

                  -- Eric

                  On Sat, Dec 10, 2011 at 12:51 PM, Josh Tauberer <tauberer@...> wrote:
                  > Hi, everyone.
                  >
                  > Bill text is updating now.
                  >
                  > Thanks to whoever here forwarded the problem on to GPO --- I got an email
                  > from someone at GPO who pointed me to their sitemap files, e.g.:
                  > http://www.gpo.gov/smap/fdsys/sitemap_2011/2011_BILLS_sitemap.xml (warning:
                  > BIG file). I'm checking on how bills are split by year (by publication
                  > date?), but this seems to be the most helpful way to find them all.
                  >
                  > Btw, Eric- For indexing bill text, it might be better to use the original
                  > text files from GPO. The .txt files on GovTrack are generated using
                  > pdftotext and have line numbers, whereas the GPO original .txt files do not
                  > (I imagine they are generated from the XML or GPO locator codes files
                  > directly).
                  >
                  > I don't use my own .txt files except to display historical bill text, and
                  > unless there's an objection I could replace the pdftotext-generated files
                  > with the GPO original .txt files.
                  >
                  > Any objections from anyone?
                  >
                  >
                  > - Josh Tauberer
                  > - GovTrack.us / POPVOX.com
                  >
                  > http://razor.occams.info | www.govtrack.us | www.popvox.com
                  >
                  > On 11/29/2011 10:25 AM, Eric Mill wrote:
                  >>
                  >>
                  >>
                  >> I use a combination of three files for each bill. Primarily, the .txt,
                  >> for the text. I'm only storing the text en masse for full text search,
                  >> not storing the semantic hierarchy of the bill. Secondarily, I use the
                  >> MODS XML metadata to get what date the bill version was issued on, a
                  >> pretty critical piece of data. However, sometimes the MODS file doesn't
                  >> exist, and I use the .xml (HTML) version of the bill as a backup source
                  >> for the issued date -- which, now that I look at the code, makes use of
                  >> the Dublin Core metadata that you add on top of the original bill data.
                  >> I don't make use of the PDF.
                  >>
                  >> My code that does all this is here, btw:
                  >>
                  >> https://github.com/sunlightlabs/realtimecongress/blob/master/tasks/bill_text_archive/bill_text_archive.rb
                  >>
                  >> I understand that this is less vital, but I mean it when I say the rsync
                  >> is incredibly useful -- so much so that if you left it offline, what I'd
                  >> probably do is set up a separate dedicated GPO bulk data mirroring
                  >> service for at least bill text, that supported rsync, and use that
                  >> internally. That's a lot of work, though! If you're continuing to use
                  >> the GPO's bill text files in your own work on POPVOX, you'd do the
                  >> community a service by continuing to make that work available.
                  >>
                  >> -- Eric
                  >>
                  >> On Tue, Nov 29, 2011 at 10:00 AM, Josh Tauberer <tauberer@...> wrote:
                  >>
                  >>        the bill text is a less vital service, since you just
                  >>        repackage what GPO offers
                  >>
                  >>
                  >>    Exactly. That's why I'm not particularly concerned about dropping
                  >>    this since it doesn't do much to begin with and after 5+ years of
                  >>    running the bill text scraper it's past time to rethink what's
                  >>    useful. (Btw, it does also scrape the HTML bill text on THOMAS,
                  >>    which is slightly less trivial, but still pretty trivial.)
                  >>
                  >>    Do you use the PDFs, the HTML, or the .txt files?
                  >>
                  >>    (Clearly when I said "free loading" I was not referring to what I
                  >>    agree is a simple repackaging of PDFs.)
                  >>
                  >>
                  >>    - Josh Tauberer
                  >>    - GovTrack.us / POPVOX.com
                  >>
                  >>    http://razor.occams.info | www.govtrack.us | www.popvox.com
                  >>
                  >>    On 11/29/2011 09:30 AM, Eric Mill wrote:
                  >>
                  >>
                  >>
                  >>        I make use of the bill text that GovTrack provides in Sunlight's data
                  >>        services (our Real Time Congress API) and in the apps that depend on it
                  >>        (including our Congress app). We load it into ElasticSearch
                  >>        (recommended, btw) and we power our search and highlighting with it. I'm
                  >>        imminently about to document this full text search capability and offer
                  >>        it to the public.
                  >>
                  >>        Unlike bill metadata, where you've done God's work and scrape THOMAS
                  >>        all day every day, the bill text is a less vital service, since you just
                  >>        repackage what GPO offers and provide it via rsync. This is an
                  >>        incredibly useful way to provide it though! I'd like it to stick around.
                  >>
                  >>        I'm not sure it's possible to "free ride" on free, CC0-licensed,
                  >>        repackaged versions of public domain government data. If you feel like
                  >>        people have been insufficiently thankful for your work or haven't given
                  >>        enough attribution, that is a more valid and specific conversation to
                  >>        have than accusing folks who are asking about the status of your public
                  >>        data on your public mailing list of competing with your business.
                  >>
                  >>        -- Eric
                  >>
                  >>        On Tue, Nov 29, 2011 at 8:27 AM, Josh Tauberer <tauberer@...> wrote:
                  >>
                  >>            I've been meaning to write about this.
                  >>
                  >>            About two weeks ago GPO stopped updating GPO Access, which was their
                  >>            system for publishing documents since the mid 90s. New bills and other
                  >>            documents are only being published in FDSys now, and GovTrack isn't
                  >>            pulling from FDSys because FDSys didn't exist when I wrote the bill text
                  >>            scraper.
                  >>
                  >>            Since I've been focused on POPVOX lately, I haven't had a chance to
                  >>            build a new scraper for GovTrack, although in anticipation of this I've
                  >>            been working on reimplementing much of the same functionality on POPVOX.
                  >>            I'm not sure what if any of that code will be open, though we have an
                  >>            experimental API for it now.
                  >>
                  >>            It would be helpful to know who else, if anyone, is using bill text so I
                  >>            can plan the future of GovTrack's bill text accordingly.
                  >>
                  >>            But I will say that folks free riding on my data and using it to compete
                  >>            with my business (i.e. POPVOX) get no sympathy from me.
                  >>
                  >>            - Josh Tauberer
                  >>            - GovTrack.us / POPVOX.com
                  >>
                  >>            http://razor.occams.info | www.govtrack.us | www.popvox.com
                  >>
                  >>
                  >>
                  >>            On 11/29/2011 02:12 AM, jlundigard wrote:
                  >>         > Hey all,
                  >>         >
                  >>         > We've noticed that we stopped receiving bill text from govtrack.
                  >>         > It seems to have stopped around this bill:
                  >>         >
                  >>         > http://www.govtrack.us/congress/bill.xpd?bill=s112-1788
                  >>         >
                  >>         > That bill and more recently introduced ones don't have any bill
                  >>         > text even though the text exists on the GPO website.
                  >>         >
                  >>         > Perhaps a scraper is down?
                  >>         >
                  >>         > Thanks,
                  >>         > Andy
                  >>         > OpenCongress.org
                  >>
                  >>        --
                  >>        Developer | sunlightfoundation.com
                  >>
                  >> --
                  >> Developer | sunlightfoundation.com



                  --
                  Developer | sunlightfoundation.com