
scraping javascript sites, colorado example

  • Neal McBurnett
    Message 1 of 5, Mar 3, 2005
      One topic I haven't seen much discussion of here is examples or
      techniques for scraping documents.

      E.g. I'm interested in Colorado legislation. The site
      (http://www.leg.state.co.us/) is pretty unfriendly from an automated
      scraping standpoint, in my experience. The legislation is in pdf,
      which is a pain, but pdftotext seems to produce moderately scrapable
      text.

      Model software (in python?) for cleaning up the output of pdftotext
      would be welcomed.
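
      (To make that concrete, here is a minimal sketch of the sort of
      cleanup pass I have in mind. Untested, it assumes pdftotext is on
      the PATH, and the file name is just an example.)

          import re
          import subprocess

          def pdf_to_clean_text(pdf_path):
              # -layout asks pdftotext to preserve the physical page layout
              out = subprocess.run(["pdftotext", "-layout", pdf_path, "-"],
                                   capture_output=True, text=True, check=True)
              lines = []
              for line in out.stdout.splitlines():
                  line = re.sub(r"\s{2,}", " ", line.strip())  # collapse runs of spaces
                  if line:                                     # drop blank lines / form feeds
                      lines.append(line)
              return "\n".join(lines)

          print(pdf_to_clean_text("079_01.pdf"))  # example file name only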


      But finding the pdfs is tricky. E.g. I start at the list of bills,
      which is designed to use form submission to select successive sets of
      50 bills:

      http://www.leg.state.co.us/Clics2005a/csl.nsf/BillFoldersSenate?openFrameset

      Having found the right bill, say I want to look at the original
      version of the bill (with legislative summary). First we need to go
      to the "All versions" link. A typical URL, for SB05-079, is

      http://www.leg.state.co.us/clics2005a/csl.nsf/fsbillcont2/51B91106B515902487256F5E0078CF5A?Open

      So they seem to intentionally introduce a hash to obfuscate the
      URL.

      Going to that URL gets us to the "Introduced Bill" name: 079_01.pdf
      with an invisible (white) "n" appended to it, visible only when I
      select the area for cut-and-paste. Who knows why.

      Clicking on that goes to a different place than advertised by "Copy
      Link Location" in firefox, via a javascript _doClick() function:

      _doClick('87256EE50072C919.8975551e51fa01d087256dd30080e1d5/$Body/0.390C'...)

      Ending up (at least for Firefox) at another useless intermediate web
      page, which has more javascript that automatically downloads the pdf
      (though I'm not sure why, and might be confused by the frame structure).

      It also has a link to the "Bill" as http://www.leg.state.co.us/clics2005a/csl.nsf/fsbillcont3/51B91106B515902487256F5E0078CF5A?open&file=079_01.pdf
      but again that is misleading.

      When the pdf is downloading, the Firefox dialog box which asks if I
      want to browse it or download it contains this url:

      http://www.leg.state.co.us/clics2005a/csl.nsf/billcontainers/51B91106B515902487256F5E0078CF5A/$FILE/

      But if I try to load that via wget I get "400 Bad Request"
      so they may be playing around with cookies or other magic.

      The Firefox "Page Info" Links section points to this for the "Current
      PDF Document":
      http://www.leg.state.co.us/clics2005a/csl.nsf/billcontainers/51B91106B515902487256F5E0078CF5A?OpenDocument

      which again just produces the html page.

      Can someone find a URL that just loads the pdf?
      Is this sort of run-around common? Any good insights or tools to deal with
      it? Anyone want to scrape, for a start, all the legislative summaries
      for all the bills as introduced :-) ?

      Thanks,

      Neal McBurnett http://bcn.boulder.co.us/~neal/
      Signed and/or sealed mail encouraged. GPG/PGP Keyid: 2C9EBA60
    • Bill Farrell
      Message 2 of 5, Mar 3, 2005
        Yes, that kind of runaround is ALL too common. The more I scrape, the more I find that "public information" isn't. You may have read my diatribes on FCC et al. :-)

        Would it be an idea to use the --save-cookies {file} option to capture the cookies as they're tossed? If you can sleuth out which cookie (or which set) would be the magic, it's possible to toss the cookies back by using --load-cookies {file}.
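
        If you'd rather stay in Python than drive wget, a rough equivalent
        of the save/load trick might look like this (untested sketch; the
        final PDF path is only a guess at the pattern, not a confirmed URL):

            import http.cookiejar
            import urllib.request

            # Roughly what wget's --save-cookies / --load-cookies does.
            jar = http.cookiejar.MozillaCookieJar("cookies.txt")
            try:
                jar.load(ignore_discard=True)   # reuse cookies from an earlier run
            except OSError:
                pass                            # first run: no cookie file yet
            opener = urllib.request.build_opener(
                urllib.request.HTTPCookieProcessor(jar))

            # Hit the HTML page first so the server can set whatever it wants...
            opener.open("http://www.leg.state.co.us/clics2005a/csl.nsf/"
                        "fsbillcont2/51B91106B515902487256F5E0078CF5A?Open")
            jar.save(ignore_discard=True)

            # ...then try the file URL with those cookies in hand (path is a guess).
            pdf = opener.open("http://www.leg.state.co.us/clics2005a/csl.nsf/"
                              "billcontainers/51B91106B515902487256F5E0078CF5A"
                              "/$FILE/079_01.pdf").read()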

        Best!
        Bill

        ----- Original Message -----
        From: Neal McBurnett <neal@...>
        To: govtrack@yahoogroups.com
        Sent: Thu, 3 Mar 2005 20:31:26 +0000
        Subject: [govtrack] scraping javascript sites, colorado example


      • Neal McBurnett
        Message 3 of 5, Mar 3, 2005
          Here are some spidering/scraping resources I've stumbled upon via
          google. Feedback on any of them would be welcomed:

          Python web-client programming
          http://wwwsearch.sourceforge.net/bits/GeneralFAQ.html

          HTMLForms

          dealing with javascript:
          Java's httpunit from Jython, since it knows some JavaScript
          Mozilla automation & XPCOM / PyXPCOM, Konqueror & DCOP / KParts /
          PyKDE

          ssl: Mozilla plugin: livehttpheaders.
          Use lynx -trace, and filter out the junk with a script.

          http://linux.duke.edu/projects/urlgrabber/

          mozilla plugin can display HTML form information and HTML table
          structure:
          http://chrispederick.myacen.com/work/firebird/webdeveloper/

          HTML Screen Scraping: A How-To Document
          http://www.rexx.com/~dkuhlman/quixote_htmlscraping.html
          python, urllib HTMLParser sgrep quixote,

          python: http://www.crummy.com/software/BeautifulSoup/
          http://www.pycs.net/users/0000316/

          http://www.oreilly.com/catalog/spiderhks/toc.html
          Very Perl-oriented.
          LWP::Simple
          HTML::TreeBuilder
          WWW::Mechanize
          Template::Extract
          WWW::Yahoo::Groups

          XPath


          And I did manage to get some Colorado legislation pdfs to load
          directly, like

          http://www.leg.state.co.us/clics2005a/csl.nsf/billcontainers/AB832E748317A60987256F3A006A1217/$FILE/002_01.pdf

          so they aren't necessarily as difficult as I thought. But other
          similar URLs don't work, so I'm still puzzled.
          I used ethereal/FollowTcpStream to see which URLs my browser was
          actually retrieving.

          -Neal

        • Joshua Tauberer / GovTrack
          Message 4 of 5, Mar 5, 2005
            Before I reply to Neal-- I've added a section on RDF schemas/ontologies to the article I wrote:
            http://www.govtrack.us/articles/20050302rdf.xpd

            Neal,

            I'm gonna try to step through the process of getting the status of
            legislation out of the Colorado site, based on the links you gave. Here
            goes.

            First, load the page:
            http://www.leg.state.co.us/Clics2005a/csl.nsf/(bf-3)?OpenView&Count=50000

            I've changed the Count parameter from how the website has it in the
            framed version.

            To extract bill history, the only things that are relevant are the
            History links, which conveniently are in the form:

            <A HREF="[[URL]]" target="Bottom2" target="Bottom2">History</A>

            You can pick out those URLs using regular expressions, or even
            some simple string manipulation functions. For each of those links,
            load up the URL.
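
            Something like this ought to pull them out (an untested sketch;
            the regex assumes the href is double-quoted and the markup looks
            like the line above):

                import re
                import urllib.request

                LIST_URL = ("http://www.leg.state.co.us/Clics2005a/csl.nsf/"
                            "(bf-3)?OpenView&Count=50000")

                html = urllib.request.urlopen(LIST_URL).read().decode(
                    "latin-1", "replace")

                # Grab the URL out of every <A HREF="...">History</A> link.
                history_urls = re.findall(
                    r'<A HREF="([^"]+)"[^>]*>History</A>', html, re.IGNORECASE)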

            Absurdly enough, this page is a frameset. So you'll need to do the same
            trick of looking for the right URL to load. In this case, it's the URL
            in the line that matches:

            <frame src="[[URL]]" name="File"

            They have it as a relative URI there, though, so you'll need to
            tack http://www.leg.state.co.us onto the beginning.
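
            In code that step might look something like this (again just a
            sketch; it assumes the History URL itself is already absolute or
            has had the host prepended):

                import re
                import urllib.request

                BASE = "http://www.leg.state.co.us"

                def history_frame_url(frameset_url):
                    html = urllib.request.urlopen(frameset_url).read().decode(
                        "latin-1", "replace")
                    # Find the src of the frame named "File" and make it absolute.
                    m = re.search(r'<frame src="([^"]+)"\s+name="File"',
                                  html, re.IGNORECASE)
                    return BASE + m.group(1) if m else None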

            Finally I'm at the page with the bill history... Now you've gotta just
            extract the bill number and each action. What I sometimes do is strip
            the HTML out of the page, and then do pattern matching. So, doing that,
            to get to the bill number, you find the line that starts with
            "Summarized History for Bill Number " and take whatever follows it on
            the line. To get the actions, just pick out any lines that match the
            pattern:
            DD/DD/DD Whatever...
            Which is easy with regular expressions, but, again, also possible by
            just testing whether there are digits and slashes in the right indexes
            of the string.
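
            A sketch of that last step, with crude tag stripping and the two
            patterns just described:

                import re

                def parse_history(page_html):
                    text = re.sub(r"<[^>]+>", " ", page_html)   # strip the HTML
                    text = text.replace("&nbsp;", " ")

                    m = re.search(r"Summarized History for Bill Number\s+(\S+)",
                                  text)
                    bill_number = m.group(1) if m else None

                    # Lines beginning with a DD/DD/DD date are the actions.
                    actions = [a.strip() for a in
                               re.findall(r"\d\d/\d\d/\d\d[^\n]*", text)]
                    return bill_number, actions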

            Hope some of this is useful. If you get stuck somewhere, post more. :)

            --
            - Joshua Tauberer

            http://taubz.for.net

            ** Nothing Unreal Exists **
          • John Labovitz
            Message 5 of 5, Mar 5, 2005
              On Mar 3, 2005, at 10:09 PM, Neal McBurnett wrote:

              > Here are some spidering/scraping resources I've stumbled upon via
              > google.

              There are starting to be some good Ruby libraries for screen-scraping,
              too.

              There's a simple but good version of WWW::Mechanize (you can find it
              via the 'gems' Ruby library if you have that installed). And REXML is
              a fantastic XML parsing library, with XPath built in so you don't have
              to do so much procedural stuff as you do with some of the Perl modules.

              This won't help much with the Javascript mess, though. (And yes, I've
              found similar awful cruft in dealing with scraping financial services
              sites. I think it must be the output of some middleware app that folks
              use to make web sites. I had to deal with one recently that had *no*
              way of navigating via regular HTML; only Javascript links! Truly
              annoying.)

              --
              John Labovitz
              Macintosh support, research, and software development
              John Labovitz Consulting, LLC
              johnl@... | +1 503.949.3492 |
              www.johnlabovitz.com/consulting