
Re: [govtrack] Getting bills in HTML from Thomas

  • Neil M.
    Message 1 of 14 , May 25, 2010
      I had some free time, something like this?

      http://www.nabber.org/media/HR3590.html

      Neil

      On 5/25/2010 7:23 PM, Harvey Frey wrote:
      >
      >
      > Hi Neil:
      >
      > Thanks for the response!
      >
      > I simply want to use the 'contents' section of a bill to hyperlink
      > to the actual text paragraphs within the same bill, to make the bills
      > easier to navigate and comprehend.
      >
      > If I save the contents page from Thomas, the links point to a cgi
      > script and disappear within a few minutes. AFAIK, Thomas uses no
      > permanent links - everything runs through their cgi script.
      >
      > If I download the entire bill from Thomas, it contains no links at
      > all, so I need to be able to find section headings and put name anchors
      > there, and put corresponding href anchors in the correct TOC line, but
      > not do it for incidental references which are not headings.
      >
      > So it's not a problem of finding stereotypical text references to
      > USC sections and constructing HTML from them. That would be a pretty
      > straightforward RegExp problem.
      >
      > Harvey
      > ================================================
      >
      > Neil M. wrote:
      >
      >> I've got an old python script that takes the US Code text files (from
      >> http://uscode.house.gov), scans for references and outputs mediawiki
      >> formatted pages and wiki links. It shouldn't be too difficult to modify
      >> it to parse some other text/HTML file and output HTML instead. What is
      >> it you want to link to exactly? The aforementioned US Code website?
      >> Thomas? Both?
      >>
      >> Neil
      >>
      >> On 5/24/2010 3:59 PM, Harvey Frey wrote:
      >>
      >>>
      >>>
      >>> Has anyone tried downloading bills from Thomas in HTML format, or
      >>> converting their Text downloads to HTML?
      >>>
      >>> The guy at the Thomas help desk at first didn't understand what I
      >>> wanted, and then said that no one had ever asked for that before.
      >>>
      >>> As you know, when you download the Contents page of a bill, the
      >>> hyperlinks to the actual sections point to your own local folder instead
      >>> of to the Thomas page where they exist.
      >>>
      >>> If you add a base statement to point them back to the Thomas site,
      >>> the links disappear in a few minutes since the search expires.
      >>>
      >>> If you download all their referenced pages, you can't use their
      >>> links to convert them to local hyperlinks, since they address them
      >>> through a cgi program instead of through static name anchors.
      >>>
      >>> You can download a bill as plain text rather than HTML, but
      >>> manually adding all the name anchors and hyperlinks would be a massive job.
      >>>
      >>> (I did it for the Patriot Act, and it wasn't fun, especially since
      >>> it looked like they intentionally obfuscated the references, sometimes
      >>> using Public Law References, sometimes USC, and sometimes common names,
      >>> so it was a detective job to find the sections they were amending.)
      >>>
      >>> Has anyone tried this, say with perl?
      >>>
      >>> I'm specifically interested in HR 3590, the recent Health Reform
      >>> Bill. Some bills are posted as a massive XML file, but this one isn't.
      >>> (If it were, I suppose you could use their id/id-ref pairs to construct
      >>> href/name pairs, and then clean out the rest of the XML cruft.)
      >>>
      >>> Harvey
      >>>
      >>> =============================
      >>> Harvey S. Frey MD PhD Esq.
      >>> hsfrey@... www.harp.org
      >>> -----------------------------
      >>> "Withdrawing in disgust is not the same thing as apathy."
      >>> - Brian Eno
      >>> =============================
      >>>
      >>>
      >>>
      >>
      >>
      >
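Harvey's requirement quoted above (put name anchors on the real section headings, link the matching TOC lines to them, and leave incidental references alone) could be sketched roughly as follows. This is not Neil's script, only a minimal illustration. It assumes, without confirmation from the thread, that TOC entries begin with "Sec. N." and real headings begin with "SEC. N." in capitals; `link_bill` and both patterns are hypothetical names.

```python
import html
import re

# Assumed conventions (not confirmed by the thread): TOC entries start with
# "Sec. N." and real section headings start with "SEC. N." in capitals.
TOC_RE = re.compile(r"^\s*Sec\. (\d+[A-Za-z]?)\.")
HEADING_RE = re.compile(r"^\s*SEC\. (\d+[A-Za-z]?)\.")


def link_bill(text):
    """Return an HTML string with TOC entries linked to section headings."""
    out = []
    for line in text.splitlines():
        escaped = html.escape(line)
        toc = TOC_RE.match(line)
        heading = HEADING_RE.match(line)
        if toc:
            # TOC line: point at the anchor for the same section number.
            out.append('<a href="#sec%s">%s</a>' % (toc.group(1), escaped))
        elif heading:
            # Real heading: drop a name anchor here.
            out.append('<a name="sec%s">%s</a>' % (heading.group(1), escaped))
        else:
            # Incidental references such as "under section 1001" match
            # neither pattern, so they are left unlinked.
            out.append(escaped)
    return "<pre>\n" + "\n".join(out) + "\n</pre>"
```

Because only lines that start with the heading pattern are anchored, a mention of a section in running text is passed through untouched. Real bill text may well need looser patterns than these.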
    • Harvey Frey
      Message 2 of 14 , May 26, 2010
        Neil:

            Precisely! :-D
            Thank You!

            Did you write a script to do that?
            Manually it would surely have taken more than a little "free time" !!

        Harvey

      • Neil M.
        Message 3 of 14 , May 26, 2010
          Yes, I wrote a quick Python script. If anyone wants it, just let me know.

          Neil

        • Harvey Frey
          Message 4 of 14 , May 26, 2010
            Neil:

            > if anyone wants it just let me know

                 Yes, please!

                It does need a little manual post-editing, since the section numbers in amended texts can be (and are) duplicates.

            Harvey
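The duplicate section numbers mentioned above are a problem because two headings would otherwise get the same anchor name. One possible way to reduce the manual post-editing is to suffix repeats, as in the sketch below. `unique_anchor_names` is a hypothetical helper, and treating the first occurrence as the one the TOC refers to is only an assumption, so the output would still need checking by hand.

```python
from collections import Counter


def unique_anchor_names(section_numbers):
    """Give each heading a distinct anchor name, in document order."""
    seen = Counter()
    names = []
    for number in section_numbers:
        seen[number] += 1
        if seen[number] == 1:
            # First occurrence keeps the plain name (assumed TOC target).
            names.append("sec%s" % number)
        else:
            # Later duplicates get a suffix so they never shadow the first.
            names.append("sec%s-%d" % (number, seen[number]))
    return names
```

For example, the numbers 1001, 2701, 2701 in document order would yield the names sec1001, sec2701 and sec2701-2.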
