Re: [govtrack] scraping javascript sites, colorado example

  • Bill Farrell
    Message 1 of 5 , Mar 3, 2005
      Yes, that kind of runaround is ALL too common. The more I scrape, the more I find that "public information" isn't. You may have read my diatribes on FCC et al. :-)

      Would it be an idea to use the --save-cookies {file} option to capture the cookies as they're tossed? If you can sleuth-out which (or which set) would be the magic, it's possible to toss the cookies back by using the --load-cookies {file} .
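      The same save/load trick can be sketched in Python with the standard library's http.cookiejar, which reads and writes the same Netscape cookies.txt format wget uses. (This is only an illustration — the domain and cookie name below are placeholders, and in real use the jar would be filled by fetching a page through an opener rather than by hand.)

```python
import http.cookiejar
import time

# MozillaCookieJar speaks the same Netscape cookies.txt format
# as wget's --save-cookies / --load-cookies options.
jar = http.cookiejar.MozillaCookieJar("cookies.txt")

# In real use the jar is filled by an opener built with
# urllib.request.HTTPCookieProcessor(jar); here we plant one by hand.
jar.set_cookie(http.cookiejar.Cookie(
    version=0, name="SessionID", value="abc123",
    port=None, port_specified=False,
    domain="www.leg.state.co.us", domain_specified=True,
    domain_initial_dot=False,
    path="/", path_specified=True,
    secure=False, expires=int(time.time()) + 86400,
    discard=False, comment=None, comment_url=None, rest={},
))
jar.save()

# Later: "toss the cookies back" by loading the file into a fresh jar,
# which an HTTPCookieProcessor would then replay on each request.
jar2 = http.cookiejar.MozillaCookieJar("cookies.txt")
jar2.load()
print([c.name for c in jar2])
```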

      Best!
      Bill

      ----- Original Message -----
      From: Neal McBurnett <neal@...>
      To: govtrack@yahoogroups.com
      Sent: Thu, 3 Mar 2005 20:31:26 +0000
      Subject: [govtrack] scraping javascript sites, colorado example


    • Neal McBurnett
      Message 2 of 5 , Mar 3, 2005
        Here are some spidering/scraping resources I've stumbled upon via
        google. Feedback on any of them would be welcomed:

        Python web-client programming
        http://wwwsearch.sourceforge.net/bits/GeneralFAQ.html

        HTMLForms

        dealing with javascript:
        Java's httpunit from Jython, since it knows some JavaScript
        Mozilla automation & XPCOM / PyXPCOM, Konqueror & DCOP / KParts /
        PyKDE

        ssl: Mozilla plugin: livehttpheaders.
        Use lynx -trace, and filter out the junk with a script.

        http://linux.duke.edu/projects/urlgrabber/

        mozilla plugin can display HTML form information and HTML table
        structure:
        http://chrispederick.myacen.com/work/firebird/webdeveloper/

        HTML Screen Scraping: A How-To Document
        http://www.rexx.com/~dkuhlman/quixote_htmlscraping.html
        python, urllib, HTMLParser, sgrep, quixote

        python: http://www.crummy.com/software/BeautifulSoup/
        http://www.pycs.net/users/0000316/

        http://www.oreilly.com/catalog/spiderhks/toc.html
        Very Perl-oriented.
        LWP::Simple
        HTML::TreeBuilder
        WWW::Mechanize
        Template::Extract
        WWW::Yahoo::Groups

        XPath


        And I did manage to get some colorado legislation pdfs to load
        directly, like

        http://www.leg.state.co.us/clics2005a/csl.nsf/billcontainers/AB832E748317A60987256F3A006A1217/$FILE/002_01.pdf

        so they aren't necessarily as difficult as I thought. But other
        similar URLs don't work, so I'm still puzzled.
        I used ethereal/FollowTcpStream to see which URLs my browser was
        actually retrieving.

        -Neal

        > One topic I haven't seen much discussion of here is examples or
        > techniques for scraping documents.
        >
        > E.g. I'm interested in Colorado legislation. The site
        > (http://www.leg.state.co.us/) is pretty unfriendly from an automated
        > scraping standpoint, in my experience. The legislation is in pdf,
        > which is a pain, but pdftotext seems to produce moderately scrapable
        > text.
        >
        > Model software (in python?) for cleaning up the output of pdftotext
        > would be welcomed.
        >
        >
        > But finding the pdfs is tricky. E.g. I start at the list of bills,
        > which is designed to use form submission to select successive sets of
        > 50 bills:
        >
        > http://www.leg.state.co.us/Clics2005a/csl.nsf/BillFoldersSenate?openFrameset
        >
        > Having found the right bill, say I want to look at the original
        > version of the bill (with legislative summary). First we need to go
        > to the "All versions" link. A typical URL, for SB05-079, is
        >
        > http://www.leg.state.co.us/clics2005a/csl.nsf/fsbillcont2/51B91106B515902487256F5E0078CF5A?Open
        >
        > So they seem to intentionally introduce a hash to obfuscate the
        > URL.
        >
        > Going to that URL gets us to the "Introduced Bill" name: 079_01.pdf
        > with an invisible (white) "n" appended to it, visible only when I
        > select the area for cut-and-paste. Who knows why.
        >
        > Clicking on that goes to a different place than advertised by "Copy
        > Link Location" in firefox, via a javascript _doClick() function:
        >
        > _doClick('87256EE50072C919.8975551e51fa01d087256dd30080e1d5/$Body/0.390C'...)
        >
        > Ending up (at least for Firefox) at another useless intermediate web
        > page, which has more javascript that automatically downloads the pdf
        > (though I'm not sure why, and might be confused by the frame structure).
        >
        > It also has a link to the "Bill" as http://www.leg.state.co.us/clics2005a/csl.nsf/fsbillcont3/51B91106B515902487256F5E0078CF5A?open&file=079_01.pdf
        > but again that is misleading.
        >
        > When the pdf is downloading, the Firefox dialog box which asks if I
        > want to browse it or download it contains this url:
        >
        > http://www.leg.state.co.us/clics2005a/csl.nsf/billcontainers/51B91106B515902487256F5E0078CF5A/$FILE/
        >
        > But if I try to load that via wget I get "400 Bad Request"
        > so they may be playing around with cookies or other magic.
        >
        > The Firefox "Page Info" Links section points to this for the "Current
        > PDF Document":
        > http://www.leg.state.co.us/clics2005a/csl.nsf/billcontainers/51B91106B515902487256F5E0078CF5A?OpenDocument
        >
        > which again just produces the html page.
        >
        > Can someone find a URL that just loads the pdf?
        > Is this sort of run-around common? Any good insights or tools to deal with
        > it? Anyone want to scrape, for a start, all the legislative summaries
        > for all the bills as introduced :-) ?
        >
        > Thanks,
        >
        > Neal McBurnett http://bcn.boulder.co.us/~neal/
        > Signed and/or sealed mail encouraged. GPG/PGP Keyid: 2C9EBA60
      • Joshua Tauberer / GovTrack
        Message 3 of 5 , Mar 5, 2005
          Before I reply to Neal-- I've added a section on RDF schemas/ontologies to the article I wrote:
          http://www.govtrack.us/articles/20050302rdf.xpd

          Neal,

          I'm gonna try to step through the process of getting the status of
          legislation out of the Colorado site, based on the links you gave. Here
          goes.

          First, load the page:
          http://www.leg.state.co.us/Clics2005a/csl.nsf/(bf-3)?OpenView&Count=50000

          I've changed the Count parameter from how the website has it in the
          framed version.

          To extract bill history, the only things that are relevant are the
          History links, which conveniently are in the form:

          <A HREF="[[URL]]" target="Bottom2" target="Bottom2">History</A>

          You can pick out those URLs using regular expressions, or even
          some simple string manipulation functions. For each of those links,
          load up the URL.
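
          A minimal Python sketch of that extraction (the HTML below is just the pattern described above, with made-up URLs):

```python
import re

# Two sample links in the form the Colorado site emits; only the
# one whose text is exactly "History" should be picked up.
html = '''<A HREF="/clics2005a/csl.nsf/billsummary/ABC123?open" target="Bottom2" target="Bottom2">History</A>
<A HREF="/other" target="Bottom2">Something else</A>'''

# Grab the HREF of every link whose visible text is "History".
history_urls = re.findall(
    r'<A\s+HREF="([^"]+)"[^>]*>History</A>', html, re.IGNORECASE)
print(history_urls)
```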

          Absurdly enough, this page is a frameset. So you'll need to do the same
          trick of looking for the right URL to load. In this case, it's the URL
          in the line that matches:

          <frame src="[[URL]]" name="File"

          Although they have it as a relative URI there, so you'll need to tack on
          http://www.leg.state.co.us to the beginning.
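
          In Python, pulling the frame URL out and making it absolute might look like this (the frame src here is invented for illustration):

```python
import re
from urllib.parse import urljoin

# A sample frameset line of the shape described above.
html = '<frame src="/clics2005a/csl.nsf/billhistory/XYZ?OpenDocument" name="File" scrolling="auto">'

m = re.search(r'<frame\s+src="([^"]+)"\s+name="File"', html)
# The site uses a relative URI, so resolve it against the host.
full_url = urljoin("http://www.leg.state.co.us", m.group(1))
print(full_url)
```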

          Finally I'm at the page with the bill history... Now you've gotta just
          extract the bill number and each action. What I sometimes do is strip
          the HTML out of the page, and then do pattern matching. So, doing that,
          to get to the bill number, you find the line that starts with
          "Summarized History for Bill Number " and take whatever follows it on
          the line. To get the actions, just pick out any lines that match the
          pattern:
          DD/DD/DD Whatever...
          Which is easy with regular expressions, but, again, also possible by
          just testing whether there are digits and slashes in the right indexes
          of the string.
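
          Those two steps can be sketched in Python like so (the sample text is made up; real pages will vary, and the HTML is assumed to be stripped already):

```python
import re

# Pretend the HTML tags have already been stripped out of the page.
text = """Summarized History for Bill Number SB05-079
01/12/05 Introduced In Senate - Assigned to Judiciary
02/03/05 Senate Committee on Judiciary Refer Amended
"""

PREFIX = "Summarized History for Bill Number "
bill_number = None
actions = []
for line in text.splitlines():
    if line.startswith(PREFIX):
        bill_number = line[len(PREFIX):].strip()
    elif re.match(r'^\d{2}/\d{2}/\d{2}\s', line):
        # Lines matching "DD/DD/DD Whatever..." are individual actions.
        actions.append(line.strip())

print(bill_number, len(actions))
```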

          Hope some of this is useful. If you get stuck somewhere, post more. :)

          --
          - Joshua Tauberer

          http://taubz.for.net

          ** Nothing Unreal Exists **
        • John Labovitz
          Message 4 of 5 , Mar 5, 2005
            On Mar 3, 2005, at 10:09 PM, Neal McBurnett wrote:

            > Here are some spidering/scraping resources I've stumbled upon via
            > google.

            There are starting to be some good Ruby libraries for screen-scraping,
            too.

            There's a simple but good version of WWW::Mechanize (you can find it
            via the 'gems' Ruby library if you have that installed). And REXML is
            a fantastic XML parsing library, with XPath built in so you don't have
            to do so much procedural stuff as you do with some of the Perl modules.

            This won't help much with the Javascript mess, though. (And yes, I've
            found similar awful cruft in dealing with scraping financial services
            sites. I think it must be output of some middle-ware app that folks
            use to make web sites. I had to deal with one recently that had *no*
            way of navigating via regular HTML; only Javascript links! Truly
            annoying.)

            --
            John Labovitz
            Macintosh support, research, and software development
            John Labovitz Consulting, LLC
            johnl@... | +1 503.949.3492 |
            www.johnlabovitz.com/consulting