259 Re: [govtrack] Two Project Ideas [bill versioning]

  • Joshua Tauberer / GovTrack.us
    Nov 9 4:04 AM
      yahoogroups-backupemail@... wrote:
      > On Wed, 8 Nov 2006, Joshua Tauberer / GovTrack.us wrote:
      >> The first step is to convert it to text -- you can see the text versions
      >> (from "pdftotext -layout -nopgbrk") that GovTrack makes at the same
      >> addresses, just replace .pdf with .txt. Without "-layout" you get a
      >> differently formatted text version that could be more useful for this.
      > there's a fork of pdftotext (also free) which has very
      > useful -html and -xml output flags which might be a
      > better place to start from if you don't have tools already.
      > http://pdftohtml.sourceforge.net/

      Ahha, I think that could be useful. Thanks for the pointer. (It's
      actually been integrated in the poppler-utils RPM for Fedora Core 6, if
      that's useful for anyone.)
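
For anyone who wants to script the text conversion step quoted above, here is a minimal sketch. It assumes pdftotext is installed and on the PATH; the helper names `pdftotext_command` and `pdf_to_text` are made up for illustration and are not part of GovTrack's code. The only flags used are the ones mentioned in this thread, plus "-" as the output file to get the text on stdout.

```python
import subprocess

def pdftotext_command(pdf_path, layout=True):
    """Build the pdftotext command line; "-" sends the text to stdout."""
    cmd = ["pdftotext", "-nopgbrk"]
    if layout:
        # -layout tries to preserve the physical page layout; dropping it
        # gives the differently formatted version mentioned in the thread.
        cmd.append("-layout")
    return cmd + [pdf_path, "-"]

def pdf_to_text(pdf_path, layout=True):
    """Run pdftotext (must be installed) and return the extracted text."""
    result = subprocess.run(
        pdftotext_command(pdf_path, layout),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `pdf_to_text("bill.pdf", layout=False)` on a hypothetical file would return the non-layout text, which may be the easier starting point for comparing versions.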

      For reference, the two PDFs in HTML with pdftohtml are:


      It's not getting the alignment of lines quite right: it splits up things
      that are on the same line. That might not matter for this task, though,
      since differing line breaks between versions have to be ignored anyway.
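
To illustrate what ignoring line breaks could look like, here is a rough sketch of a word-level comparison using Python's standard difflib. This is only one possible approach, not something GovTrack does; the function names `words` and `word_diff` are invented for the example, and it treats any run of whitespace (including newlines) as a word separator.

```python
import difflib
import re

def words(text):
    """Split text into words, discarding all whitespace (including line breaks)."""
    return re.findall(r"\S+", text)

def word_diff(old_text, new_text):
    """Return (op, old_words, new_words) tuples for the spans that differ."""
    old, new = words(old_text), words(new_text)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

Two versions that differ only in where the lines wrap produce an empty list, so only real wording changes show up.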

      - Joshua Tauberer


      "Strike up the klezmer and start acting like a man. You're
      about to have a truth-mitzvah." -- The Colbert Report