Getting bills in HTML from Thomas

  • Harvey Frey
    Message 1 of 14 , May 24, 2010
          Has anyone tried downloading bills from Thomas in HTML format, or converting their Text downloads to HTML?

          The guy at the Thomas help desk at first didn't understand what I wanted, and then said that no one had ever asked for that before.


          As you know, when you download the Contents page of a bill, the hyperlinks to the actual sections point to your own local folder instead of to the Thomas page where they exist.

          If you add a base statement to point them back to the Thomas site, the links disappear in a few minutes since the search expires.

          If you download all their referenced pages, you can't use their links to convert them to local hyperlinks, since they address them through a cgi program instead of through static name anchors.

          You can download a bill as plain text rather than HTML, but manually adding all the name anchors and hyperlinks would be a massive job.

          (I did it for the Patriot Act, and it wasn't fun, especially since it looked like they intentionally obfuscated the references, sometimes using Public Law References, sometimes USC, and sometimes common names, so it was a detective job to find the sections they were amending.)

          Has anyone tried this, say with perl?

          I'm specifically interested in HR 3590, the recent Health Reform Bill.
          Some bills are posted as a massive XML file, but this one isn't. (If it were, I suppose you could use their id/id-ref pairs to construct href/name pairs, and then clean out the rest of the XML cruft.)

      Harvey

      =============================
      Harvey S. Frey MD PhD Esq.
      hsfrey@...  www.harp.org
      -----------------------------
      "Withdrawing in disgust is not the same thing as apathy."
      - Brian Eno

      =============================
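
      A minimal Python sketch of the id/id-ref idea mentioned above, for bills that are posted as XML. The element names (`toc-entry`, `section`, `header`) and the sample document are assumptions made up for illustration, not the actual bill schema; only the `id` and `idref` attribute pairing comes from the message.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical sample; real bill XML will use different element names.
SAMPLE = """<bill>
  <toc>
    <toc-entry idref="S1">Sec. 1. Short title.</toc-entry>
    <toc-entry idref="S2">Sec. 2. Definitions.</toc-entry>
  </toc>
  <section id="S1"><header>Short title</header></section>
  <section id="S2"><header>Definitions</header></section>
</bill>"""


def link_pairs(xml_text):
    """Return (toc_links, anchors) built from idref/id attribute pairs."""
    root = ET.fromstring(xml_text)
    # Collect every id that exists, so dangling idrefs are not linked.
    targets = {el.get("id") for el in root.iter() if el.get("id")}
    toc_links = []
    anchors = []
    for el in root.iter():
        text = escape("".join(el.itertext()).strip())
        ref = el.get("idref")
        if ref and ref in targets:
            toc_links.append('<a href="#%s">%s</a>' % (ref, text))
        if el.get("id"):
            anchors.append('<a name="%s"></a>%s' % (el.get("id"), text))
    return toc_links, anchors


if __name__ == "__main__":
    links, names = link_pairs(SAMPLE)
    print("\n".join(links + names))
```

      Everything else in the XML is simply ignored here, which is one way of "cleaning out the cruft"; a real converter would presumably also want to keep the section text and hierarchy.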
    • Josh Tauberer
      Message 2 of 14 , May 24, 2010
        That's what GovTrack does to get bill text.
        http://www.govtrack.us/data/us/bills.text/111/h/h3590.html

        I clean the HTML and make sure it's well-formed XML before putting it there.

        - Josh Tauberer
        - CivicImpulse / GovTrack.us

        http://razor.occams.info | www.govtrack.us | civicimpulse.com

        "Members of both sides are reminded not to use guests of the
        House as props."

      • Neil M.
        Message 3 of 14 , May 24, 2010
          I've got an old python script that takes the US Code text files (from
          http://uscode.house.gov), scans for references and outputs mediawiki
          formatted pages and wiki links. It shouldn't be too difficult to modify
          it to parse some other text/HTML file and output HTML instead. What is
          it you want to link to exactly? The aforementioned US Code website?
          Thomas? Both?

          Neil
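
          For readers who want to try the reference-scanning approach, here is a rough Python sketch. It is not Neil's script (which is not shown in the thread and outputs MediaWiki markup); it emits HTML links instead. The citation pattern only covers the simple "42 U.S.C. 300gg" form, and `LINK_TEMPLATE` is a placeholder URL, not a real US Code address scheme.

```python
import re
from html import escape

# Matches citations such as "42 U.S.C. 300gg" or "26 U.S.C. 5000A".
# Assumption: other citation styles (Public Law numbers, common names)
# are out of scope for this sketch.
USC_REF = re.compile(r"\b(\d+)\s+U\.S\.C\.\s+(\d+[A-Za-z]*(?:-\d+)?)")

# Hypothetical link target; substitute whatever site you want to point at.
LINK_TEMPLATE = "https://example.org/uscode/{title}/{section}"


def link_usc_refs(text):
    """Return HTML in which each matched U.S.C. citation is a hyperlink."""
    out = []
    pos = 0
    for m in USC_REF.finditer(text):
        out.append(escape(text[pos:m.start()]))  # text before the citation
        url = LINK_TEMPLATE.format(title=m.group(1), section=m.group(2))
        out.append('<a href="%s">%s</a>' % (url, escape(m.group(0))))
        pos = m.end()
    out.append(escape(text[pos:]))  # trailing text after the last citation
    return "".join(out)


if __name__ == "__main__":
    print(link_usc_refs("See 42 U.S.C. 300gg for details."))
```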

        • Harvey Frey
          Message 4 of 14 , May 24, 2010
            Josh:

                What I'm trying to do is clean it up so the section hierarchy is displayed, and add hyperlinks to the text from the TOC. This is an excerpt of the TOC. What doesn't show here is that the lines are hyperlinks, and I've used a different background color for the sections of text added to other acts. (The back-ticks are not very salient.)

                It's not that I'm a neat freak - I just get easily confused when the margins are all raggedy and the fonts don't go with the hierarchy.

            Harvey

            SECTION 1. SHORT TITLE; TABLE OF CONTENTS.

              (a) Short Title- This Act may be cited as the `Patient Protection and Affordable Care Act'.
              (b) Table of Contents- The table of contents of this Act is as follows:
                Sec. 1. Short title; table of contents.

            TITLE I--QUALITY, AFFORDABLE HEALTH CARE FOR ALL AMERICANS




          • Harvey Frey
            Message 5 of 14 , May 25, 2010
              Hi Neil:

                  Thanks for the response!

                  I simply want to use the 'contents' section of a bill to hyperlink to the actual text paragraphs within the same bill, to make the bills easier to navigate and comprehend.

                  If I save the contents page from Thomas, the links point to a cgi script and disappear within a few minutes. AFAIK, Thomas uses no permanent links - everything runs through their cgi script.

                  If I download the entire bill from Thomas, it contains no links at all, so I need to be able to find section headings and put name anchors there, and put corresponding href anchors in the correct TOC line, but not do it for incidental references which are not headings.

                  So it's not a problem of finding stereotypical text references to USC sections and constructing HTML from them. That would be a pretty straightforward RegExp problem.

              Harvey
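
              One possible way to do the TOC-to-heading linking described above, sketched in Python. It assumes that headings start a line with upper-case "SECTION n." or "SEC. n." and that TOC entries start a line with mixed-case "Sec. n.", as in the excerpt quoted earlier in the thread; incidental references such as "section 1" in running text match neither pattern and are left alone. The sample text and the `sec<n>` anchor names are made up for illustration.

```python
import re
from html import escape

# Assumed conventions: TOC entries use "Sec. n.", headings use
# "SECTION n." or "SEC. n."; both must begin the line.
TOC_LINE = re.compile(r"^\s*Sec\. (\d+[A-Za-z]?)\. ")
HEADING = re.compile(r"^\s*(?:SECTION|SEC\.) (\d+[A-Za-z]?)\. ")

# Illustrative sample, not the real bill text.
SAMPLE = """SECTION 1. SHORT TITLE; TABLE OF CONTENTS.
  (b) Table of Contents- The table of contents of this Act is as follows:
    Sec. 1. Short title; table of contents.
    Sec. 2. Definitions.
SEC. 2. DEFINITIONS.
  In this Act, section 1 applies.
"""


def link_toc(text):
    """Wrap headings in name anchors and TOC lines in matching hrefs."""
    lines = text.splitlines()
    # First pass: record which section numbers actually have a heading.
    numbers = set()
    for line in lines:
        m = HEADING.match(line)
        if m:
            numbers.add(m.group(1))
    # Second pass: emit HTML, linking only TOC lines with a known target.
    out = []
    for line in lines:
        body = escape(line.strip())
        head = HEADING.match(line)
        toc = TOC_LINE.match(line)
        if head:
            out.append('<h3><a name="sec%s"></a>%s</h3>' % (head.group(1), body))
        elif toc and toc.group(1) in numbers:
            out.append('<p><a href="#sec%s">%s</a></p>' % (toc.group(1), body))
        elif body:
            out.append("<p>%s</p>" % body)
    return "\n".join(out)


if __name__ == "__main__":
    print(link_toc(SAMPLE))
```

              The sketch relies on section numbers being unique within the bill; text quoted from other acts, or titles and subtitles in the TOC, would need extra patterns.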
              ------------------------------------
              
              Yahoo! Groups Links
              
              <*> To visit your group on the web, go to:
                  http://groups.yahoo.com/group/govtrack/
              
              <*> Your email settings:
                  Individual Email | Traditional
              
              <*> To change settings online go to:
                  http://groups.yahoo.com/group/govtrack/join
                  (Yahoo! ID required)
              
              <*> To change settings via email:
                  govtrack-digest@yahoogroups.com 
                  govtrack-fullfeatured@yahoogroups.com
              
              <*> To unsubscribe from this group, send an email to:
                  govtrack-unsubscribe@yahoogroups.com
              
              <*> Your use of Yahoo! Groups is subject to:
                  http://docs.yahoo.com/info/terms/
              
              
                
            • Neil M.
              Message 6 of 14 , May 25, 2010
                I had some free time. Something like this?

                http://www.nabber.org/media/HR3590.html

                Neil

              • Harvey Frey
                Message 7 of 14 , May 26, 2010
                  Neil:

                      Precisely! :-D
                      Thank You!

                      Did you write a script to do that?
                      Manually it would surely have taken more than a little "free time" !!

                  Harvey

                • Neil M.
                  Message 8 of 14 , May 26, 2010
                    Yes, I wrote a quick Python script. If anyone wants it, just let me know.

                    Neil

                    >>>>> through a cgi program instead of through static name anchors.
                    >>>>>
                    >>>>> You can download a bill as plain text rather than HTML, but
                    >>>>> manually adding all the name anchors and hyperlinks would be a massive job.
                    >>>>>
                    >>>>> (I did it for the Patriot Act, and it wasn't fun, especially since
                    >>>>> it looked like they intentionally obfuscated the references, sometimes
                    >>>>> using Public Law References, sometimes USC, and sometimes common names,
                    >>>>> so it was a detective job to find the sections they were amending.)
                    >>>>>
                    >>>>> Has anyone tried this, say with perl?
                    >>>>>
                    >>>>> I'm specifically interested in HR 3590, the recent Health Reform
                    >>>>> Bill. Some bills are posted as a massive XML file, but this one isn't.
                    >>>>> (If it were, I suppose you could use their id/id-ref pairs to construct
                    >>>>> href/name pairs, and then clean out the rest of the XML cruft.)
                    >>>>>
                    >>>>> Harvey
                    >>>>>
                    >>>>> =============================
                    >>>>> Harvey S. Frey MD PhD Esq.
                    >>>>> hsfrey@... www.harp.org
                    >>>>> -----------------------------
                    >>>>> "Withdrawing in disgust is not the same thing as apathy."
                    >>>>> - Brian Eno
                    >>>>> =============================
                    >>>>>
                    >>>>>
                    >>>>>
                    >>>>>
                    >
                  • Harvey Frey
                    Message 9 of 14 , May 26, 2010
                      Neil:

                      > if anyone wants it just let me know

                           Yes, please!

                          It does need a little manual post-editing, since the section numbers in amended texts can be (and are) duplicates.
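
A minimal sketch of the general approach, for anyone following along. This is not Neil's script, which isn't posted in this thread. It assumes (an assumption, not something confirmed here) that table-of-contents entries start a line with "Sec. N." and body headings start a line with "SEC. N.", and the function name link_bill is made up for illustration. Incidental references in running text match neither pattern, so they are left alone. Repeated section numbers get a numbered suffix, so the anchors stay unique and are easy to find when post-editing by hand.

```python
import html
import re

# Assumed conventions (hypothetical, not confirmed by the thread):
# TOC entries look like "Sec. 1001. Title." and body headings look like
# "SEC. 1001. TITLE.", each at the start of a line.
TOC_RE = re.compile(r"^\s*Sec\. (\d+[A-Za-z]?)\. ")
HEAD_RE = re.compile(r"^\s*SEC\. (\d+[A-Za-z]?)\. ")


def link_bill(text):
    """Return a list of HTML lines with TOC entries linked to headings."""
    seen = {}  # section number -> number of headings seen with that number
    out = []
    for line in text.splitlines():
        escaped = html.escape(line)
        head = HEAD_RE.match(line)
        toc = TOC_RE.match(line)
        if head:
            num = head.group(1)
            seen[num] = seen.get(num, 0) + 1
            # The first heading gets the plain anchor; later duplicates get
            # a suffix (sec1001-2, ...) and need manual review.
            if seen[num] == 1:
                anchor = f"sec{num}"
            else:
                anchor = f"sec{num}-{seen[num]}"
            out.append(f'<a name="{anchor}">{escaped}</a>')
        elif toc:
            # TOC lines always point at the first heading with that number.
            out.append(f'<a href="#sec{toc.group(1)}">{escaped}</a>')
        else:
            out.append(escaped)
    return out
```

The output lines would still need wrapping in a pre block or similar to keep the bill's layout. Linking each TOC line to the first heading with that number is only a guess at the right target, which is why the duplicates still need a manual pass.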

                      Harvey

                      Neil M. wrote:
                      Yes I wrote a quick Python script if anyone wants it just let me know.
                      
                      Neil
                      
                        