Bill text problems?

  • jlundigard
    Message 1 of 9 , Nov 28, 2011
      Hey all,

      We've noticed that we stopped receiving bill text from GovTrack. It seems to have stopped around this bill:

      http://www.govtrack.us/congress/bill.xpd?bill=s112-1788

      That bill and more recently introduced ones don't have any bill text, even though the text exists on the GPO website.

      Perhaps a scraper is down?

      Thanks,
      Andy
      OpenCongress.org
    • Josh Tauberer
      Message 2 of 9 , Nov 29, 2011
        I've been meaning to write about this.

        About two weeks ago GPO stopped updating GPO Access, which had been their
        system for publishing documents since the mid-90s. New bills and other
        documents are only being published in FDSys now, and GovTrack isn't
        pulling from FDSys because FDSys didn't exist when I wrote the bill text
        scraper.

        Since I've been focused on POPVOX lately, I haven't had a chance to
        build a new scraper for GovTrack, although in anticipation of this I've
        been working on reimplementing much of the same functionality on POPVOX.
        I'm not sure what, if any, of that code will be open, though we have an
        experimental API for it now.

        It would be helpful to know who else, if anyone, is using bill text so I
        can plan the future of GovTrack's bill text accordingly.

        But I will say that folks free riding on my data and using it to compete
        with my business (i.e. POPVOX) get no sympathy from me.

        - Josh Tauberer
        - GovTrack.us / POPVOX.com

        http://razor.occams.info | www.govtrack.us | www.popvox.com

      • David
        Message 3 of 9 , Nov 29, 2011
          Hi Josh,

          Shots fired, golly. I don't believe your note is either fair or accurate.

          Just a placeholder for the forum to say I'll email you individually to set up a phone conversation about your concerns. Let's talk by voice later today.

          Sincerely,
          -David

          http://www.participatorypolitics.org


        • Eric Mill
          Message 4 of 9 , Nov 29, 2011
            I make use of the bill text that GovTrack provides in Sunlight's data services (our Real Time Congress API) and in the apps that depend on it (including our Congress app). We load it into ElasticSearch (recommended, btw) and we power our search and highlighting with it. I'm about to document this full-text search capability and offer it to the public.
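
For readers wondering what "search and highlighting" might look like in practice, here is a minimal sketch of a search request body in ElasticSearch's query DSL. It is not Sunlight's actual code. The field name `text` and the fragment size are assumptions made for illustration, and the sketch only builds the JSON body rather than talking to a server.

```python
import json


def build_search_body(phrase, field="text", fragment_size=150):
    """Build a search body: a match query plus highlighting on the same field.

    `field` is a hypothetical name for wherever the bill text is stored.
    """
    return {
        "query": {"match": {field: phrase}},
        "highlight": {"fields": {field: {"fragment_size": fragment_size}}},
    }


if __name__ == "__main__":
    # Print the JSON that would be sent to the index's _search endpoint.
    print(json.dumps(build_search_body("clean energy"), indent=2))
```

The highlight section asks the server to return matching snippets from the same field that was searched, which is what lets an app show the matched phrase in context.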

            Unlike bill metadata, where you've done God's work and scrape THOMAS all day every day, the bill text is a less vital service, since you just repackage what GPO offers and provide it via rsync. This is an incredibly useful way to provide it though! I'd like it to stick around.

            I'm not sure it's possible to "free ride" on free, CC0-licensed, repackaged versions of public domain government data. If you feel like people have been insufficiently thankful for your work or haven't given enough attribution, that is a more valid and specific conversation to have than accusing folks who are asking about the status of your public data on your public mailing list of competing with your business.

            -- Eric

            ------------------------------------

            Yahoo! Groups Links

            <*> To visit your group on the web, go to:
               http://groups.yahoo.com/group/govtrack/

            <*> Your email settings:
               Individual Email | Traditional

            <*> To change settings online go to:
               http://groups.yahoo.com/group/govtrack/join
               (Yahoo! ID required)

            <*> To change settings via email:
               govtrack-digest@yahoogroups.com
               govtrack-fullfeatured@yahoogroups.com

            <*> To unsubscribe from this group, send an email to:
               govtrack-unsubscribe@yahoogroups.com

            <*> Your use of Yahoo! Groups is subject to:
               http://docs.yahoo.com/info/terms/





          • Josh Tauberer
            Message 5 of 9 , Nov 29, 2011
              > the bill text is a less vital service, since you just
              > repackage what GPO offers

              Exactly. That's why I'm not particularly concerned about dropping this
              since it doesn't do much to begin with and after 5+ years of running the
              bill text scraper it's past time to rethink what's useful. (Btw, it does
              also scrape the HTML bill text on THOMAS, which is slightly less
              trivial, but still pretty trivial.)

              Do you use the PDFs or HTML (or .txt?)?

              (Clearly when I said "free loading" I was not referring to what I agree
              is a simple repackaging of PDFs.)

              - Josh Tauberer
              - GovTrack.us / POPVOX.com

              http://razor.occams.info | www.govtrack.us | www.popvox.com

            • Josh Tauberer
              Message 6 of 9 , Nov 29, 2011
                On 11/29/2011 09:02 AM, David wrote:
                > Shots fired, golly.

                All right. I take back the "free loading" bit. But not the rest.


                - Josh Tauberer
                - GovTrack.us / POPVOX.com

                http://razor.occams.info | www.govtrack.us | www.popvox.com

              • Eric Mill
                Message 7 of 9 , Nov 29, 2011
                  I use a combination of three files for each bill. Primarily, the .txt, for the text. I'm only storing the text en masse for full text search, not storing the semantic hierarchy of the bill. Secondarily, I use the MODS XML metadata to get what date the bill version was issued on, a pretty critical piece of data. However, sometimes the MODS file doesn't exist, and I use the .xml (HTML) version of the bill as a backup source for the issued date -- which, now that I look at the code, makes use of the Dublin Core metadata that you add on top of the original bill data. I don't make use of the PDF.

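                  My code that does all this is here, btw:
                  https://github.com/sunlightlabs/realtimecongress/blob/master/tasks/bill_text_archive/bill_text_archive.rb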

                  I understand that this is less vital, but I mean it when I say the rsync is incredibly useful -- so much so that if you left it offline, what I'd probably do is set up a separate dedicated GPO bulk data mirroring service for at least bill text, that supported rsync, and use that internally. That's a lot of work, though! If you're continuing to use the GPO's bill text files in your own work on POPVOX, you'd do the community a service by continuing to make that work available.
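
The MODS-first, Dublin Core-second lookup described above could be sketched roughly as follows. This is an illustration only, not the Ruby code linked in the message. It assumes the MODS file carries the date in `originInfo/dateIssued` and that the fallback XML carries a `dc:date` element somewhere in the tree. The sample documents and their dates are made up.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Made-up sample documents; real files are far larger and may be laid out differently.
MODS_SAMPLE = (
    '<mods xmlns="http://www.loc.gov/mods/v3">'
    "<originInfo><dateIssued>2011-11-03</dateIssued></originInfo>"
    "</mods>"
)
BILL_SAMPLE = (
    '<bill xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<metadata><dc:date>2011-11-02</dc:date></metadata>"
    "</bill>"
)


def first_text(xml_text, path):
    """Return the stripped text of the first element matching path, or None."""
    node = ET.fromstring(xml_text).find(path)
    if node is not None and node.text:
        return node.text.strip()
    return None


def issued_date(mods_xml, bill_xml):
    """Prefer the MODS issue date; fall back to Dublin Core in the bill XML."""
    if mods_xml:
        date = first_text(mods_xml, f".//{{{MODS_NS}}}originInfo/{{{MODS_NS}}}dateIssued")
        if date:
            return date
    if bill_xml:
        return first_text(bill_xml, f".//{{{DC_NS}}}date")
    return None


if __name__ == "__main__":
    print(issued_date(MODS_SAMPLE, BILL_SAMPLE))
    print(issued_date(None, BILL_SAMPLE))
```

Passing `None` for a missing file keeps the "sometimes the MODS file doesn't exist" case explicit at the call site.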

                  -- Eric


                • Josh Tauberer
                  Message 8 of 9 , Dec 10, 2011
                    Hi, everyone.

                    Bill text is updating now.

                    Thanks to whoever here forwarded the problem on to GPO --- I got an
                    email from someone at GPO who pointed me to their sitemap files, e.g.:
                    http://www.gpo.gov/smap/fdsys/sitemap_2011/2011_BILLS_sitemap.xml
                    (warning: BIG file). I'm checking on how bills are split by year (by
                    publication date?), but this seems to be the most helpful way to find
                    them all.
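
For anyone else who wants to discover bills through sitemap files like the one above, a minimal sketch of reading one might look like this. It assumes the file follows the standard sitemaps.org `urlset` layout with `loc` and `lastmod` elements, which has not been checked against the GPO file itself. The sample entries use placeholder URLs.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Placeholder sample, standing in for a (much bigger) real sitemap file.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.invalid/bills/a</loc><lastmod>2011-11-20</lastmod></url>
  <url><loc>http://example.invalid/bills/b</loc><lastmod>2011-12-05</lastmod></url>
</urlset>"""


def sitemap_entries(xml_text):
    """Return (loc, lastmod) pairs from a sitemaps.org <urlset> document."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.iter(f"{{{SITEMAP_NS}}}url"):
        loc = url.findtext(f"{{{SITEMAP_NS}}}loc")
        lastmod = url.findtext(f"{{{SITEMAP_NS}}}lastmod")
        if loc:
            entries.append((loc.strip(), lastmod.strip() if lastmod else None))
    return entries


def changed_since(entries, cutoff):
    """Keep entries whose lastmod string sorts at or after cutoff (YYYY-MM-DD)."""
    return [(loc, mod) for loc, mod in entries if mod is not None and mod >= cutoff]


if __name__ == "__main__":
    for loc, mod in changed_since(sitemap_entries(SAMPLE), "2011-12-01"):
        print(mod, loc)
```

Comparing `lastmod` values as plain strings only works while every entry uses the same date format, and a file as big as the one warned about above would probably be better handled with a streaming parser such as `xml.etree.ElementTree.iterparse`.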

                    Btw, Eric- For indexing bill text, it might be better to use the
                    original text files from GPO. The .txt files on GovTrack are generated
                    using pdftotext and have line numbers, whereas the GPO original .txt
                    files do not (I imagine they are generated from the XML or GPO locator
                    codes files directly).
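
Anyone stuck with the pdftotext-generated files in the meantime could try stripping the line numbers before indexing. The sketch below is a guess at the layout, not a description of the actual files: it assumes each numbered line starts with optional spaces, a one- or two-digit number, and then whitespace. It will also eat a genuine leading number in the bill text, so treat it as a rough cleanup only.

```python
import re

# Assumed layout: optional indent, a 1-2 digit line number, then whitespace.
LINE_NUMBER = re.compile(r"^\s*\d{1,2}\s+")


def strip_line_numbers(text):
    """Remove an assumed leading line number from every line of text."""
    return "\n".join(LINE_NUMBER.sub("", line) for line in text.splitlines())


if __name__ == "__main__":
    sample = " 1  SECTION 1. SHORT TITLE.\n 2  This Act may be cited as the Example Act."
    print(strip_line_numbers(sample))
```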

                    I don't use my own .txt files except to display historical bill text,
                    and unless there's an objection I could replace the pdftotext-generated
                    files with the GPO original .txt files.

                    Any objections from anyone?

                    - Josh Tauberer
                    - GovTrack.us / POPVOX.com

                    http://razor.occams.info | www.govtrack.us | www.popvox.com

                    On 11/29/2011 10:25 AM, Eric Mill wrote:
                    >
                    >
                    > I use a combination of three files for each bill. Primarily, the .txt,
                    > for the text. I'm only storing the text en masse for full text search,
                    > not storing the semantic hierarchy of the bill. Secondarily, I use the
                    > MODS XML metadata to get what date the bill version was issued on, a
                    > pretty critical piece of data. However, sometimes the MODS file doesn't
                    > exist, and I use the .xml (HTML) version of the bill as a backup source
                    > for the issued date -- which, now that I look at the code, makes use of
                    > the Dublin Core metadata that you add on top of the original bill data.
                    > I don't make use of the PDF.
                    >
                    > My code that does all this is here, btw:
                    > https://github.com/sunlightlabs/realtimecongress/blob/master/tasks/bill_text_archive/bill_text_archive.rb
                    >
                    > I understand that this is less vital, but I mean it when I say the rsync
                    > is incredibly useful -- so much so that if you left it offline, what I'd
                    > probably do is set up a separate dedicated GPO bulk data mirroring
                    > service for at least bill text, that supported rsync, and use that
                    > internally. That's a lot of work, though! If you're continuing to use
                    > the GPO's bill text files in your own work on POPVOX, you'd do the
                    > community a service by continuing to make that work available.
                    >
                    > -- Eric
                    >
                    > On Tue, Nov 29, 2011 at 10:00 AM, Josh Tauberer <tauberer@...
                    > <mailto:tauberer@...>> wrote:
                    >
                    > the bill text is a less vital service, since you just
                    > repackage what GPO offers
                    >
                    >
                    > Exactly. That's why I'm not particularly concerned about dropping
                    > this since it doesn't do much to begin with and after 5+ years of
                    > running the bill text scraper it's past time to rethink what's
                    > useful. (Btw, it does also scrape the HTML bill text on THOMAS,
                    > which is slightly less trivial, but still pretty trivial.)
                    >
                    > Do you use the PDFs or HTML (or .txt?)?
                    >
                    > (Clearly when I said "free loading" I was not referring to what I
                    > agree is a simple repackaging of PDFs.)
                    >
                    >
                    > - Josh Tauberer
                    > - GovTrack.us / POPVOX.com
                    >
                    > http://razor.occams.info | www.govtrack.us <http://www.govtrack.us>
                    > | www.popvox.com <http://www.popvox.com>
                    >
                    > On 11/29/2011 09:30 AM, Eric Mill wrote:
                    >
                    >
                    >
                    > I make use of the bill text that GovTrack provides in Sunlight's data
                    > services (our Real Time Congress API) and in the apps that depend on it
                    > (including our Congress app). We load it into ElasticSearch
                    > (recommended, btw) and we power our search and highlighting with it.
                    > I'm about to document this full text search capability and offer it
                    > to the public.
                    >
                    > Unlike bill metadata, where you've done God's work and scrape THOMAS
                    > all day every day, the bill text is a less vital service, since you
                    > just repackage what GPO offers and provide it via rsync. This is an
                    > incredibly useful way to provide it though! I'd like it to stick
                    > around.
                    >
                    > I'm not sure it's possible to "free ride" on free, CC0-licensed,
                    > repackaged versions of public domain government data. If you feel like
                    > people have been insufficiently thankful for your work or haven't given
                    > enough attribution, that is a more valid and specific conversation to
                    > have than accusing folks who are asking about the status of your public
                    > data on your public mailing list of competing with your business.
                    >
                    > -- Eric
                    >
                    > On Tue, Nov 29, 2011 at 8:27 AM, Josh Tauberer <tauberer@...> wrote:
                    >
                    > I've been meaning to write about this.
                    >
                    > About two weeks ago GPO stopped updating GPO Access, which was their
                    > system for publishing documents since the mid 90s. New bills and other
                    > documents are only being published in FDSys now, and GovTrack isn't
                    > pulling from FDSys because FDSys didn't exist when I wrote the bill text
                    > scraper.
                    >
                    > Since I've been focused on POPVOX lately, I haven't had a chance to
                    > build a new scraper for GovTrack, although in anticipation of this I've
                    > been working on reimplementing much of the same functionality on POPVOX.
                    > I'm not sure what if any of that code will be open, though we have an
                    > experimental API for it now.
                    >
                    > It would be helpful to know who else, if anyone, is using bill text so I
                    > can plan the future of GovTrack's bill text accordingly.
                    >
                    > But I will say that folks free riding on my data and using it to compete
                    > with my business (i.e. POPVOX) get no sympathy from me.
                    >
                    > - Josh Tauberer
                    > - GovTrack.us / POPVOX.com
                    >
                    > http://razor.occams.info | www.govtrack.us | www.popvox.com
                    >
                    >
                    > On 11/29/2011 02:12 AM, jlundigard wrote:
                    > > Hey all,
                    > >
                    > > We've noticed that we stopped receiving bill text from govtrack.
                    > > It seems to have stopped around this bill:
                    > >
                    > > http://www.govtrack.us/congress/bill.xpd?bill=s112-1788
                    > >
                    > > That bill and more recently introduced ones don't have any bill
                    > > text even though the text exists on the GPO website.
                    > >
                    > > Perhaps a scraper is down?
                    > >
                    > > Thanks,
                    > > Andy
                    > > OpenCongress.org
                    > >
                    > --
                    > Developer | sunlightfoundation.com
                    >
                    > --
                    > Developer | sunlightfoundation.com
                  • Eric Mill
                    Message 9 of 9 , Dec 10, 2011
                      I've been looking for exactly this kind of sitemap file! Would you mind
                      sharing how we can find the different sitemaps?

                      For example, I guessed at the URL for the one for public and private laws:
                      http://www.gpo.gov/smap/fdsys/sitemap_2011/2011_PLAW_sitemap.xml

                      But that file is very small and doesn't list what you would need to
                      effectively spider the PLAW collection without scraping their HTML.
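
For anyone following along, here is a minimal sketch of how one might list the entries in a sitemap file like these. It assumes the file follows the standard sitemaps.org schema (`urlset` / `url` / `loc` / `lastmod`); I have not confirmed that against the GPO files. The sample XML and the `example.gov` URLs are made up for illustration, and `list_sitemap_entries` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace used by the standard sitemaps.org schema (assumed to apply here).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample; real entries in the GPO files may look different.
SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.gov/doc/BILLS-112s1788is</loc>
    <lastmod>2011-11-03</lastmod>
  </url>
  <url>
    <loc>http://example.gov/doc/BILLS-112hr1ih</loc>
  </url>
</urlset>
"""


def list_sitemap_entries(xml_text):
    """Return a list of (loc, lastmod) tuples; lastmod is None when absent."""
    root = ET.fromstring(xml_text.strip())
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:  # skip entries that carry no location at all
            entries.append((loc, lastmod))
    return entries


if __name__ == "__main__":
    for loc, lastmod in list_sitemap_entries(SAMPLE_SITEMAP):
        print(loc, lastmod)
```

Comparing each `lastmod` value against the last fetch time would be one way to pull only new or changed documents, provided the file actually supplies that field.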

                      As for text of bills -- I actually came to that realization yesterday
                      myself, that the GPO .txt files were probably better. I definitely
                      would not mind you switching over to them - I can adjust my regular
                      expressions (just for sanitization, not extracting data) accordingly.
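
To illustrate the kind of sanitization in question, here is a rough sketch of stripping leading line numbers from pdftotext-style output. The line-number format is an assumption on my part (one or two digits at the start of a line, followed by whitespace); this is not the actual regular expression from the Real Time Congress code, and `strip_line_numbers` is a hypothetical name.

```python
import re

# Assumption: a line number is one or two digits at the start of a line,
# optionally indented, followed by spaces or tabs and then the bill text.
LINE_NUMBER = re.compile(r"^[ \t]*\d{1,2}[ \t]+(?=\S)", re.MULTILINE)


def strip_line_numbers(text):
    """Remove assumed leading line numbers, leaving other lines untouched."""
    return LINE_NUMBER.sub("", text)


if __name__ == "__main__":
    sample = (
        "1 A BILL\n"
        "2 To amend title 5.\n"
        "10 SECTION 1. SHORT TITLE.\n"
        "No number here\n"
    )
    print(strip_line_numbers(sample))
```

A pattern like this would also eat a legitimate leading number (a line that begins "5 years", say), which is one reason the unnumbered GPO .txt files are the easier input.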

                      -- Eric

                      On Sat, Dec 10, 2011 at 12:51 PM, Josh Tauberer <tauberer@...> wrote:
                      > Hi, everyone.
                      >
                      > Bill text is updating now.
                      >
                      > Thanks to whoever here forwarded the problem on to GPO --- I got an email
                      > from someone at GPO who pointed me to their sitemap files, e.g.:
                      > http://www.gpo.gov/smap/fdsys/sitemap_2011/2011_BILLS_sitemap.xml (warning:
                      > BIG file). I'm checking on how bills are split by year (by publication
                      > date?), but this seems to be the most helpful way to find them all.
                      >
                      > Btw, Eric- For indexing bill text, it might be better to use the original
                      > text files from GPO. The .txt files on GovTrack are generated using
                      > pdftotext and have line numbers, whereas the GPO original .txt files do not
                      > (I imagine they are generated from the XML or GPO locator codes files
                      > directly).
                      >
                      > I don't use my own .txt files except to display historical bill text, and
                      > unless there's an objection I could replace the pdftotext-generated files
                      > with the GPO original .txt files.
                      >
                      > Any objections from anyone?
                      >
                      >
                      > - Josh Tauberer
                      > - GovTrack.us / POPVOX.com
                      >
                      > http://razor.occams.info | www.govtrack.us | www.popvox.com
                      >
                      > On 11/29/2011 10:25 AM, Eric Mill wrote:
                      >>
                      >>
                      >>
                      >> I use a combination of three files for each bill. Primarily, the .txt,
                      >> for the text. I'm only storing the text en masse for full text search,
                      >> not storing the semantic hierarchy of the bill. Secondarily, I use the
                      >> MODS XML metadata to get what date the bill version was issued on, a
                      >> pretty critical piece of data. However, sometimes the MODS file doesn't
                      >> exist, and I use the .xml (HTML) version of the bill as a backup source
                      >> for the issued date -- which, now that I look at the code, makes use of
                      >> the Dublin Core metadata that you add on top of the original bill data.
                      >> I don't make use of the PDF.
                      >>
                      >> My code that does all this is here, btw:
                      >>
                      >> https://github.com/sunlightlabs/realtimecongress/blob/master/tasks/bill_text_archive/bill_text_archive.rb
                      >>
                      >> I understand that this is less vital, but I mean it when I say the rsync
                      >> is incredibly useful -- so much so that if you left it offline, what I'd
                      >> probably do is set up a separate dedicated GPO bulk data mirroring
                      >> service for at least bill text, that supported rsync, and use that
                      >> internally. That's a lot of work, though! If you're continuing to use
                      >> the GPO's bill text files in your own work on POPVOX, you'd do the
                      >> community a service by continuing to make that work available.
                      >>
                      >> -- Eric



                      --
                      Developer | sunlightfoundation.com