Re: [govtrack] Bill text problems?

  • Eric Mill
    Message 1 of 9, Nov 29, 2011
      I make use of the bill text that GovTrack provides in Sunlight's data services (our Real Time Congress API) and in the apps that depend on it (including our Congress app). We load it into ElasticSearch (recommended, btw) and we power our search and highlighting with it. I'm imminently about to document this full text search capability and offer it to the public.
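For readers unfamiliar with the search-and-highlight setup mentioned above, here is a rough sketch of what a request body for such a query could look like. It is an illustration only, not Sunlight's actual code: the field name `full_text` and the helper `build_search_body` are hypothetical. Only the `match` query and the `highlight` section are standard Elasticsearch query DSL.

```python
import json


def build_search_body(phrase, field="full_text", fragment_size=200):
    """Build an Elasticsearch request body that searches bill text and
    asks for highlighted excerpts around each hit.

    The field name is a placeholder, not the real Real Time Congress schema.
    """
    return {
        "query": {"match": {field: phrase}},
        "highlight": {
            "fields": {field: {"fragment_size": fragment_size}},
        },
    }


if __name__ == "__main__":
    # Print the JSON that would be sent to the search endpoint.
    print(json.dumps(build_search_body("renewable energy"), indent=2))
```

The body is built as a plain dictionary so it can be sent with any HTTP client; no particular Elasticsearch client library is assumed.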

      Unlike bill metadata, where you've done God's work and scraped THOMAS all day every day, the bill text is a less vital service, since you just repackage what GPO offers and provide it via rsync. This is an incredibly useful way to provide it though! I'd like it to stick around.

      I'm not sure it's possible to "free ride" on free, CC0-licensed, repackaged versions of public domain government data. If you feel like people have been insufficiently thankful for your work or haven't given enough attribution, that is a more valid and specific conversation to have than accusing folks who are asking about the status of your public data on your public mailing list of competing with your business.

      -- Eric

      On Tue, Nov 29, 2011 at 8:27 AM, Josh Tauberer <tauberer@...> wrote:
      I've been meaning to write about this.

      About two weeks ago GPO stopped updating GPO Access, which was their
      system for publishing documents since the mid 90s. New bills and other
      documents are only being published in FDSys now, and GovTrack isn't
      pulling from FDSys because FDSys didn't exist when I wrote the bill text
      scraper.

      Since I've been focused on POPVOX lately, I haven't had a chance to
      build a new scraper for GovTrack, although in anticipation of this I've
      been working on reimplementing much of the same functionality on POPVOX.
      I'm not sure what if any of that code will be open, though we have an
      experimental API for it now.

      It would be helpful to know who else, if anyone, is using bill text so I
      can plan the future of GovTrack's bill text accordingly.

      But I will say that folks free riding on my data and using it to compete
      with my business (i.e. POPVOX) get no sympathy from me.

      - Josh Tauberer
      - GovTrack.us / POPVOX.com

      http://razor.occams.info | www.govtrack.us | www.popvox.com

      On 11/29/2011 02:12 AM, jlundigard wrote:
      > Hey all,
      >
      > We've noticed that we stopped receiving bill text from govtrack.  It seems to have stopped around this bill:
      >
      > http://www.govtrack.us/congress/bill.xpd?bill=s112-1788
      >
      > That bill and more recently introduced ones don't have any bill text even though the text exists on the GPO website.
      >
      > Perhaps a scraper is down?
      >
      > Thanks,
      > Andy
      > OpenCongress.org
      >
      >
      >
      > ------------------------------------
      >
      > Yahoo! Groups Links
      >
      >
      >


      ------------------------------------

      Yahoo! Groups Links

      <*> To visit your group on the web, go to:
         http://groups.yahoo.com/group/govtrack/

      <*> Your email settings:
         Individual Email | Traditional

      <*> To change settings online go to:
         http://groups.yahoo.com/group/govtrack/join
         (Yahoo! ID required)

      <*> To change settings via email:
         govtrack-digest@yahoogroups.com
         govtrack-fullfeatured@yahoogroups.com

      <*> To unsubscribe from this group, send an email to:
         govtrack-unsubscribe@yahoogroups.com

      <*> Your use of Yahoo! Groups is subject to:
         http://docs.yahoo.com/info/terms/




      --

    • Josh Tauberer
      Message 2 of 9, Nov 29, 2011
        > the bill text is a less vital service, since you just
        > repackage what GPO offers

        Exactly. That's why I'm not particularly concerned about dropping this
        since it doesn't do much to begin with and after 5+ years of running the
        bill text scraper it's past time to rethink what's useful. (Btw, it does
        also scrape the HTML bill text on THOMAS, which is slightly less
        trivial, but still pretty trivial.)

        Do you use the PDFs or HTML (or .txt?)?

        (Clearly when I said "free loading" I was not referring to what I agree
        is a simple repackaging of PDFs.)

        - Josh Tauberer
        - GovTrack.us / POPVOX.com

        http://razor.occams.info | www.govtrack.us | www.popvox.com

      • Josh Tauberer
        Message 3 of 9, Nov 29, 2011
          On 11/29/2011 09:02 AM, David wrote:
          > Shots fired, golly.

          All right. I take back the "free loading" bit. But not the rest.


          - Josh Tauberer
          - GovTrack.us / POPVOX.com

          http://razor.occams.info | www.govtrack.us | www.popvox.com

          On 11/29/2011 09:02 AM, David wrote:
          >
          > Hi Josh,
          >
          > Shots fired, golly. I don't believe your note is either fair or accurate.
          >
          > Just a placeholder for the forum to say I'll email you individually to have a phone conversation about your concerns. Let's chat voice later today.
          >
          > Sincerely,
          > -David
          >
          > http://www.participatorypolitics.org
        • Eric Mill
          Message 4 of 9, Nov 29, 2011
            I use a combination of three files for each bill. Primarily, the .txt, for the text. I'm only storing the text en masse for full text search, not storing the semantic hierarchy of the bill. Secondarily, I use the MODS XML metadata to get what date the bill version was issued on, a pretty critical piece of data. However, sometimes the MODS file doesn't exist, and I use the .xml (HTML) version of the bill as a backup source for the issued date -- which, now that I look at the code, makes use of the Dublin Core metadata that you add on top of the original bill data. I don't make use of the PDF.
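The fallback described above (MODS first, Dublin Core in the bill XML second) can be sketched roughly as follows. This is not the Ruby code from Real Time Congress. The function name `issued_date` and both sample documents are made up for illustration, and the assumption that the date lives in MODS `originInfo/dateIssued` and in a Dublin Core `dc:date` element may not match the real files exactly.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Invented sample documents, only to exercise the function below.
MODS_SAMPLE = (
    '<mods xmlns="http://www.loc.gov/mods/v3">'
    "<originInfo><dateIssued>2011-11-02</dateIssued></originInfo>"
    "</mods>"
)
BILL_SAMPLE = (
    '<bill xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<metadata><dc:date>2011-11-03</dc:date></metadata>"
    "</bill>"
)


def issued_date(mods_xml, bill_xml):
    """Return the issued date as a string, preferring the MODS file and
    falling back to Dublin Core metadata in the bill XML.

    Either argument may be None when that file does not exist.
    Returns None when neither source carries a date.
    """
    if mods_xml:
        node = ET.fromstring(mods_xml).find(
            f".//{{{MODS_NS}}}originInfo/{{{MODS_NS}}}dateIssued"
        )
        if node is not None and node.text:
            return node.text.strip()
    if bill_xml:
        node = ET.fromstring(bill_xml).find(f".//{{{DC_NS}}}date")
        if node is not None and node.text:
            return node.text.strip()
    return None


if __name__ == "__main__":
    print(issued_date(MODS_SAMPLE, BILL_SAMPLE))  # MODS wins when present
    print(issued_date(None, BILL_SAMPLE))         # falls back to Dublin Core
```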

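            My code that does all this is here, btw:
            https://github.com/sunlightlabs/realtimecongress/blob/master/tasks/bill_text_archive/bill_text_archive.rb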

            I understand that this is less vital, but I mean it when I say the rsync is incredibly useful -- so much so that if you left it offline, what I'd probably do is set up a separate dedicated GPO bulk data mirroring service for at least bill text, that supported rsync, and use that internally. That's a lot of work, though! If you're continuing to use the GPO's bill text files in your own work on POPVOX, you'd do the community a service by continuing to make that work available.

            -- Eric

            --
            Developer | sunlightfoundation.com <http://sunlightfoundation.com>

          • Josh Tauberer
            Message 5 of 9, Dec 10, 2011
              Hi, everyone.

              Bill text is updating now.

              Thanks to whoever here forwarded the problem on to GPO --- I got an
              email from someone at GPO who pointed me to their sitemap files, e.g.:
              http://www.gpo.gov/smap/fdsys/sitemap_2011/2011_BILLS_sitemap.xml
              (warning: BIG file). I'm checking on how bills are split by year (by
              publication date?), but this seems to be the most helpful way to find
              them all.
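For anyone wanting to walk a sitemap file like the one above, a minimal sketch follows. It assumes the GPO files follow the standard sitemaps.org `urlset` format, which has not been verified here. The helper name `sitemap_entries` and the sample document (with example.com URLs) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Invented sample; real entries would point at FDSys documents.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/bills/a</loc><lastmod>2011-12-01</lastmod></url>
  <url><loc>http://example.com/bills/b</loc></url>
</urlset>"""


def sitemap_entries(xml_text):
    """Yield (loc, lastmod) pairs from a sitemaps.org <urlset> document.

    lastmod is None when an entry omits it.
    """
    root = ET.fromstring(xml_text)
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:
            yield loc, lastmod


if __name__ == "__main__":
    for loc, lastmod in sitemap_entries(SAMPLE):
        print(loc, lastmod)
```

Because the real file is large, a streaming parser such as `ET.iterparse` would likely be a better fit than loading the whole document at once; the version here favors brevity.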

              Btw, Eric- For indexing bill text, it might be better to use the
              original text files from GPO. The .txt files on GovTrack are generated
              using pdftotext and have line numbers, whereas the GPO original .txt
              files do not (I imagine they are generated from the XML or GPO locator
              codes files directly).
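To show why the line numbers matter for indexing, here is a rough sketch of the kind of cleanup the pdftotext-generated files would need. The exact layout of those files is an assumption: the pattern below guesses that each numbered line starts with optional indentation, a one- or two-digit number, and whitespace. The name `strip_line_numbers` is hypothetical.

```python
import re

# Assumed layout: optional indent, a 1-2 digit line number, then at least
# one space before the actual text of the line.
LINE_NUMBER = re.compile(r"^\s*\d{1,2}\s+(?=\S)")


def strip_line_numbers(text):
    """Remove a leading line number from every line that has one.

    Lines without a leading number are returned unchanged.
    """
    return "\n".join(LINE_NUMBER.sub("", line) for line in text.splitlines())


if __name__ == "__main__":
    sample = " 1 A BILL\n 2 To amend the Act.\nSECTION 1. SHORT TITLE."
    print(strip_line_numbers(sample))
```

A pattern like this also eats legitimate leading numbers (a line that really begins with "10 percent", say), which is one reason the unnumbered GPO originals are the safer input.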

              I don't use my own .txt files except to display historical bill text,
              and unless there's an objection I could replace the pdftotext-generated
              files with the GPO original .txt files.

              Any objections from anyone?

              - Josh Tauberer
              - GovTrack.us / POPVOX.com

              http://razor.occams.info | www.govtrack.us | www.popvox.com

            • Eric Mill
              Message 6 of 9, Dec 10, 2011
                I've been looking for exactly this kind of sitemap file! Would you mind
                sharing how we can find the sitemaps for other collections?

                For example, I guessed at the URL for the one for public and private laws:
                http://www.gpo.gov/smap/fdsys/sitemap_2011/2011_PLAW_sitemap.xml

                But that file is very small and doesn't list what you would need to
                effectively spider the PLAW collection without scraping their HTML.
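The URL guess above follows a pattern that can be written down in a couple of lines. The pattern is inferred only from the two URLs in this thread (BILLS and PLAW for 2011), so it may not hold for other collections or years, and, as noted, a sitemap that exists may still be incomplete. The function name `sitemap_url` is made up for illustration.

```python
def sitemap_url(year, collection):
    """Guess the FDSys sitemap URL for a collection code and year.

    The pattern is inferred from two example URLs and is not documented
    behavior; treat the result as a guess to be checked.
    """
    return (
        "http://www.gpo.gov/smap/fdsys/"
        f"sitemap_{year}/{year}_{collection}_sitemap.xml"
    )


if __name__ == "__main__":
    for code in ("BILLS", "PLAW"):
        print(sitemap_url(2011, code))
```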

                As for text of bills -- I actually came to that realization yesterday
                myself, that the GPO .txt files were probably better. I definitely
                would not mind you switching over to them - I can adjust my regular
                expressions (just for sanitization, not extracting data) accordingly.

                -- Eric

                On Sat, Dec 10, 2011 at 12:51 PM, Josh Tauberer <tauberer@...> wrote:
                > Hi, everyone.
                >
                > Bill text is updating now.
                >
                > Thanks to whoever here forwarded the problem on to GPO --- I got an email
                > from someone at GPO who pointed me to their sitemap files, e.g.:
                > http://www.gpo.gov/smap/fdsys/sitemap_2011/2011_BILLS_sitemap.xml (warning:
                > BIG file). I'm checking on how bills are split by year (by publication
                > date?), but this seems to be the most helpful way to find them all.
                >
                > Btw, Eric- For indexing bill text, it might be better to use the original
                > text files from GPO. The .txt files on GovTrack are generated using
                > pdftotext and have line numbers, whereas the GPO original .txt files do not
                > (I imagine they are generated from the XML or GPO locator codes files
                > directly).
                >
                > I don't use my own .txt files except to display historical bill text, and
                > unless there's an objection I could replace the pdftotext-generated files
                > with the GPO original .txt files.
                >
                > Any objections from anyone?
                >
                >
                > - Josh Tauberer
                > - GovTrack.us / POPVOX.com
                >
                > http://razor.occams.info | www.govtrack.us | www.popvox.com
                >
                > On 11/29/2011 10:25 AM, Eric Mill wrote:
                >>
                >>
                >>
                >> I use a combination of three files for each bill. Primarily, the .txt,
                >> for the text. I'm only storing the text en masse for full text search,
                >> not storing the semantic hierarchy of the bill. Secondarily, I use the
                >> MODS XML metadata to get what date the bill version was issued on, a
                >> pretty critical piece of data. However, sometimes the MODS file doesn't
                >> exist, and I use the .xml (HTML) version of the bill as a backup source
                >> for the issued date -- which, now that I look at the code, makes use of
                >> the Dublin Core metadata that you add on top of the original bill data.
                >> I don't make use of the PDF.
                >>
                >> My code that does all this is here, btw:
                >>
                >> https://github.com/sunlightlabs/realtimecongress/blob/master/tasks/bill_text_archive/bill_text_archive.rb
                >>
                >> I understand that this is less vital, but I mean it when I say the rsync
                >> is incredibly useful -- so much so that if you left it offline, what I'd
                >> probably do is set up a separate dedicated GPO bulk data mirroring
                >> service for at least bill text, that supported rsync, and use that
                >> internally. That's a lot of work, though! If you're continuing to use
                >> the GPO's bill text files in your own work on POPVOX, you'd do the
                >> community a service by continuing to make that work available.
                >>
                >> -- Eric
                >>
                >> On Tue, Nov 29, 2011 at 10:00 AM, Josh Tauberer <tauberer@...> wrote:
                >>
                >>        the bill text is a less vital service, since you just
                >>        repackage what GPO offers
                >>
                >>
                >>    Exactly. That's why I'm not particularly concerned about dropping
                >>    this since it doesn't do much to begin with and after 5+ years of
                >>    running the bill text scraper it's past time to rethink what's
                >>    useful. (Btw, it does also scrape the HTML bill text on THOMAS,
                >>    which is slightly less trivial, but still pretty trivial.)
                >>
                >>    Do you use the PDFs or HTML (or .txt?)?
                >>
                >>    (Clearly when I said "free loading" I was not referring to what I
                >>    agree is a simple repackaging of PDFs.)
                >>
                >>
                >>    - Josh Tauberer
                >>    - GovTrack.us / POPVOX.com
                >>
                >>    http://razor.occams.info | www.govtrack.us | www.popvox.com
                >>
                >>    On 11/29/2011 09:30 AM, Eric Mill wrote:
                >>
                >>
                >>
                >>        I make use of the bill text that GovTrack provides in Sunlight's
                >>        data services (our Real Time Congress API) and in the apps that
                >>        depend on it (including our Congress app). We load it into
                >>        ElasticSearch (recommended, btw) and we power our search and
                >>        highlighting with it. I'm imminently about to document this full
                >>        text search capability and offer it to the public.
                >>
                >>        Unlike bill metadata, where you've done God's work and scrape
                >>        THOMAS all day every day, the bill text is a less vital service,
                >>        since you just repackage what GPO offers and provide it via
                >>        rsync. This is an incredibly useful way to provide it though! I'd
                >>        like it to stick around.
                >>
                >>        I'm not sure it's possible to "free ride" on free, CC0-licensed,
                >>        repackaged versions of public domain government data. If you feel
                >>        like people have been insufficiently thankful for your work or
                >>        haven't given enough attribution, that is a more valid and
                >>        specific conversation to have than accusing folks who are asking
                >>        about the status of your public data on your public mailing list
                >>        of competing with your business.
                >>
                >>        -- Eric
                >>
                >>        On Tue, Nov 29, 2011 at 8:27 AM, Josh Tauberer <tauberer@...> wrote:
                >>
                >>            I've been meaning to write about this.
                >>
                >>            About two weeks ago GPO stopped updating GPO Access, which was
                >>            their system for publishing documents since the mid 90s. New
                >>            bills and other documents are only being published in FDSys
                >>            now, and GovTrack isn't pulling from FDSys because FDSys didn't
                >>            exist when I wrote the bill text scraper.
                >>
                >>            Since I've been focused on POPVOX lately, I haven't had a
                >>            chance to build a new scraper for GovTrack, although in
                >>            anticipation of this I've been working on reimplementing much
                >>            of the same functionality on POPVOX. I'm not sure what if any
                >>            of that code will be open, though we have an experimental API
                >>            for it now.
                >>
                >>            It would be helpful to know who else, if anyone, is using bill
                >>            text so I can plan the future of GovTrack's bill text
                >>            accordingly.
                >>
                >>            But I will say that folks free riding on my data and using it
                >>            to compete with my business (i.e. POPVOX) get no sympathy from
                >>            me.
                >>
                >>            - Josh Tauberer
                >>            - GovTrack.us / POPVOX.com
                >>
                >>            http://razor.occams.info | www.govtrack.us | www.popvox.com
                >>
                >>
                >>
                >>            On 11/29/2011 02:12 AM, jlundigard wrote:
                >>         > Hey all,
                >>         >
                >>         > We've noticed that we stopped receiving bill text from govtrack.
                >>         > It seems to have stopped around this bill:
                >>         >
                >>         > http://www.govtrack.us/congress/bill.xpd?bill=s112-1788
                >>         >
                >>         > That bill and more recently introduced ones don't have any bill
                >>         > text even though the text exists on the GPO website.
                >>         >
                >>         > Perhaps a scraper is down?
                >>         >
                >>         > Thanks,
                >>         > Andy
                >>         > OpenCongress.org
                >>
                >>        --
                >>        Developer | sunlightfoundation.com
                >>
                >>
                >>
                >>
                >>
                >>
                >>
                >>
                >> --
                >> Developer | sunlightfoundation.com
                >>
                >>
                >>
                >>



                --
                Developer | sunlightfoundation.com